
**Course Project - 01: Data Preprocessing, EDA and Regression Analysis**

**Project Tasks**

## **Exploratory Data Analysis (EDA) & Preprocessing:**

T1. Explore the dataset assigned to your team and provide:


### a. A summary of the dataset (should include information on the columns present, attribute types, null values, and a summary of each attribute).

**Data Source:** The data is collected from analogue and digital sensors installed on the compressor of a metro train's APU (Air Production Unit). These sensors monitor different aspects of the compressor's operation.

**Sensors:** The dataset includes readings from the following sensors:

- **Pressure Sensor:** Monitors pressure levels within the APU.
- **Temperature Sensor:** Measures the temperature of the APU.
- **Motor Current Sensor:** Records the electrical current consumed by the compressor's motor.
- **Air Intake Valve Sensor:** Monitors the status or position of the air intake valve.


**What is the APU?**\
-- The APU, or Air Production Unit, is the compressor-based subsystem that produces the compressed air used by the train's pneumatic equipment. The attributes described in this notebook cover its compressor, air-drying towers, and reservoirs.

In [1]:
#Importing Required Libraries
import pandas as pd
import numpy as np

In [2]:
# Mounting Google drive to fetch dataset
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Loading the dataset
df = pd.read_csv('/content/drive/MyDrive/MetroPT3(AirCompressor).csv')

In [4]:
# Getting first five instances of dataset
df.head(5)

| | Unnamed: 0 | timestamp | TP2 | TP3 | H1 | DV_pressure | Reservoirs | Oil_temperature | Motor_current | COMP | DV_eletric | Towers | MPG | LPS | Pressure_switch | Oil_level | Caudal_impulses |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2020-02-01 00:00:00 | -0.012 | 9.358 | 9.34 | -0.024 | 9.358 | 53.6 | 0.04 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | 10 | 2020-02-01 00:00:10 | -0.014 | 9.348 | 9.332 | -0.022 | 9.348 | 53.675 | 0.04 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 2 | 20 | 2020-02-01 00:00:19 | -0.012 | 9.338 | 9.322 | -0.022 | 9.338 | 53.6 | 0.0425 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 3 | 30 | 2020-02-01 00:00:29 | -0.012 | 9.328 | 9.312 | -0.022 | 9.328 | 53.425 | 0.04 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 4 | 40 | 2020-02-01 00:00:39 | -0.012 | 9.318 | 9.302 | -0.022 | 9.318 | 53.475 | 0.04 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |


In [5]:
# Getting the shape of dataset
df.shape

(1516948, 17)

In [6]:
# Getting the names of columns present in the dataset
df.columns

Index(['Unnamed: 0', 'timestamp', 'TP2', 'TP3', 'H1', 'DV_pressure',
       'Reservoirs', 'Oil_temperature', 'Motor_current', 'COMP', 'DV_eletric',
       'Towers', 'MPG', 'LPS', 'Pressure_switch', 'Oil_level',
       'Caudal_impulses'],
      dtype='object')

In [7]:
# Getting column/attribute type
df.dtypes

Unnamed: 0           int64
timestamp           object
TP2                float64
TP3                float64
H1                 float64
DV_pressure        float64
Reservoirs         float64
Oil_temperature    float64
Motor_current      float64
COMP               float64
DV_eletric         float64
Towers             float64
MPG                float64
LPS                float64
Pressure_switch    float64
Oil_level          float64
Caudal_impulses    float64
dtype: object

`timestamp` is the only non-numeric column (dtype `object`, holding date-time strings); every other column is numeric.\
**No encoding is needed.**

### Pre-Processing


In [8]:
# Checking null values
df.isnull().sum()

Unnamed: 0         0
timestamp          0
TP2                0
TP3                0
H1                 0
DV_pressure        0
Reservoirs         0
Oil_temperature    0
Motor_current      0
COMP               0
DV_eletric         0
Towers             0
MPG                0
LPS                0
Pressure_switch    0
Oil_level          0
Caudal_impulses    0
dtype: int64

**No null values are present.**

In [9]:
#Checking duplicate rows
df.duplicated().sum()

0

**No Duplicate Instances are present.**

In [14]:
# Dropping Unnecessary columns
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [12]:
# Converting timestamp to datetime format for time-series analysis
df['timestamp'] = pd.to_datetime(df['timestamp'])

In [13]:
df['timestamp']

0         2020-02-01 00:00:00
1         2020-02-01 00:00:10
2         2020-02-01 00:00:19
3         2020-02-01 00:00:29
4         2020-02-01 00:00:39
                  ...        
1516943   2020-09-01 03:59:10
1516944   2020-09-01 03:59:20
1516945   2020-09-01 03:59:30
1516946   2020-09-01 03:59:40
1516947   2020-09-01 03:59:50
Name: timestamp, Length: 1516948, dtype: datetime64[ns]
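The timestamps displayed above are not perfectly regular (the third reading arrives 9 seconds after the second rather than 10). Before any time-series analysis it may help to measure the spacing between consecutive readings. The snippet below is a minimal sketch on a five-row stand-in series copied from the first rows shown earlier; the variable names `ts` and `gaps` are illustrative and not part of the notebook.

```python
import pandas as pd

# Small stand-in for df['timestamp']; values mirror the first rows shown above.
ts = pd.to_datetime(pd.Series([
    "2020-02-01 00:00:00",
    "2020-02-01 00:00:10",
    "2020-02-01 00:00:19",
    "2020-02-01 00:00:29",
    "2020-02-01 00:00:39",
]))

# Seconds elapsed between consecutive readings (the first entry is NaN).
gaps = ts.diff().dt.total_seconds()

# How often each gap length occurs, and whether the readings are in time order.
print(gaps.value_counts().sort_index())
print("monotonic:", ts.is_monotonic_increasing)
```

On the full dataset the same two lines could be run on `df['timestamp']` directly; unusually large gaps would point to periods with missing readings.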

In [16]:
# Getting the number of unique values for each column
df.nunique()

timestamp          1516948
TP2                   5257
TP3                   3683
H1                    2665
DV_pressure           2257
Reservoirs            3682
Oil_temperature       2462
Motor_current         1809
COMP                     2
DV_eletric               2
Towers                   2
MPG                      2
LPS                      2
Pressure_switch          2
Oil_level                2
Caudal_impulses          2
dtype: int64

**Observation**
- Some columns have only two unique values, so they are likely binary (categorical) signals rather than continuous measurements.


In [23]:
cnt= df['TP2'].dtypes
cnt

dtype('float64')

In [24]:
# Separating binary columns (already encoded as 0/1) from continuous numerical columns
cat_cols = []
num_cols = []
for i in df.columns:
  # Skipping the timestamp column, which is neither a binary signal nor a sensor reading
  if pd.api.types.is_datetime64_any_dtype(df[i]):
    continue
  cnt = df[i].nunique()  # number of distinct values in the column
  if cnt <= 2:
    val = sorted(df[i].unique())
    print(f'unique values for {i}: {val}')
    print(f'value counts {i}: {cnt}\n')
    cat_cols.append(i)
  else:
    num_cols.append(i)


unique values for COMP: [0.0, 1.0]
value counts COMP: 2

unique values for DV_eletric: [0.0, 1.0]
value counts DV_eletric: 2

unique values for Towers: [0.0, 1.0]
value counts Towers: 2

unique values for MPG: [0.0, 1.0]
value counts MPG: 2

unique values for LPS: [0.0, 1.0]
value counts LPS: 2

unique values for Pressure_switch: [0.0, 1.0]
value counts Pressure_switch: 2

unique values for Oil_level: [0.0, 1.0]
value counts Oil_level: 2

unique values for Caudal_impulses: [0.0, 1.0]
value counts Caudal_impulses: 2



In [22]:
# Categorical columns
cat_cols

['COMP',
 'DV_eletric',
 'Towers',
 'MPG',
 'LPS',
 'Pressure_switch',
 'Oil_level',
 'Caudal_impulses']

In [25]:
# Numerical columns
num_cols

['TP2',
 'TP3',
 'H1',
 'DV_pressure',
 'Reservoirs',
 'Oil_temperature',
 'Motor_current']

- 'TP2', 'TP3', 'H1', 'DV_pressure', 'Reservoirs', 'Oil_temperature', 'Motor_current' are quantitative variables.
- 'COMP', 'DV_eletric', 'Towers', 'MPG', 'LPS', 'Pressure_switch', 'Oil_level', 'Caudal_impulses' are qualitative variables. Moreover, these variables are binary in nature.
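With the columns split into quantitative and binary groups, a per-attribute summary can be produced separately for each group. The following is a minimal sketch on a synthetic stand-in frame: the values are made up, and `demo`, `demo_num_cols`, and `demo_cat_cols` are hypothetical names standing in for `df`, `num_cols`, and `cat_cols`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (values are made up for illustration).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "TP2": rng.normal(1.4, 3.0, n),
    "Oil_temperature": rng.normal(62.0, 6.0, n),
    "Motor_current": rng.choice([0.04, 3.9, 5.7], n),
    "COMP": rng.integers(0, 2, n).astype(float),
    "LPS": rng.integers(0, 2, n).astype(float),
})
demo_num_cols = ["TP2", "Oil_temperature", "Motor_current"]
demo_cat_cols = ["COMP", "LPS"]

# Count, mean, std, min, quartiles and max for the continuous columns.
num_summary = demo[demo_num_cols].describe().T

# For 0/1 columns the mean is the share of readings where the signal is active.
cat_summary = pd.DataFrame({
    "active_share": demo[demo_cat_cols].mean(),
    "n_active": demo[demo_cat_cols].sum().astype(int),
})

print(num_summary.round(3))
print(cat_summary)
```

Reporting the active share for binary signals tends to be more informative than quartiles, which collapse to 0 or 1 for such columns.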

**Description of Attributes:**\
(researched plus taken from assigned dataset description)\
Attributes in the dataset:
1. **Unnamed: 0:** An unnamed index or identifier for each record in the dataset.
2. **timestamp:** The timestamp indicating the time at which the readings were recorded.
3. **TP2:** Reading from a pressure sensor; TP2 measures the pressure on the compressor.
4. **TP3:** Reading from a pressure sensor; TP3 measures the pressure generated at the pneumatic panel.
5. **H1:** Reading from a pressure sensor; H1 measures the pressure generated due to the pressure drop that occurs when the cyclonic separator filter discharges.
6. **DV_pressure:** Reading from a pressure sensor; it measures the pressure drop generated when the air-drying towers discharge. A zero reading indicates that the compressor is operating under load.
7. **Reservoirs:** The downstream pressure of the reservoirs, which should be close to the pneumatic panel pressure (TP3).
8. **Oil_temperature:** Reading of oil temperature on the compressor.
9. **Motor_current:** The current of one phase of the three-phase motor;\
it presents values close to
  - 0 A when it turns off,
  - 4 A when working offloaded,
  - 7 A when working under load, and
  - 9 A when it starts working.
10. **COMP:** The electrical signal of the air intake valve on the compressor.
  - It is active when there is no air intake, indicating that the compressor is either turned off or operating in an offloaded state.
11. **DV_eletric:** The electrical signal that controls the compressor outlet valve.
  - It is active when the compressor is functioning under load.
  - It is inactive when the compressor is either off or operating in an offloaded state.
12. **Towers:** The electrical signal that defines which tower is drying the air and which tower is draining the humidity removed from the air.
  - When not active, it indicates that tower one is functioning.
  - When active, it indicates that tower two is in operation.
13. **MPG:** The electrical signal responsible for starting the compressor under load by activating the intake valve when the pressure in the air production unit (APU) falls below 8.2 bar.\
It activates the COMP sensor, which assumes the same behaviour as the MPG sensor.
14. **LPS:** Reading of LPS (low pressure system): the electrical signal that detects and activates when the pressure drops below 7 bar.
15. **Pressure_switch:** The electrical signal that detects the discharge in the air-drying towers.
16. **Oil_level:** The electrical signal that detects the oil level on the compressor.\
It is active when the oil is below the expected values.
17. **Caudal_impulses:** The electrical signal that counts the pulse outputs generated by the absolute amount of air flowing from the APU to the reservoirs.
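The description of `Motor_current` above names four nominal levels (about 0 A, 4 A, 7 A, and 9 A). One way to turn those levels into an operating-state label is to bin the readings with `pd.cut`. This is only a sketch: the cut points are an assumption (midpoints between the nominal levels), and the sample readings and the names `current`, `edges`, `labels`, and `state` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical Motor_current readings in amperes (not taken from the dataset).
current = pd.Series([0.04, 0.0425, 3.9, 4.1, 6.8, 7.2, 9.1, 8.9])

# Assumed cut points: midpoints between the nominal 0 A, 4 A, 7 A and 9 A levels.
edges = [-np.inf, 2.0, 5.5, 8.0, np.inf]
labels = ["off", "offloaded", "under_load", "starting"]

# Each reading is assigned to the interval (left, right] that contains it.
state = pd.cut(current, bins=edges, labels=labels)

print(state.value_counts().reindex(labels))
```

The cut points would need to be checked against the actual distribution of `Motor_current` (for example with a histogram) before being relied on.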




### b. Data Visualization, summarizing insights about the dataset through EDA.
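As a possible starting point for this part, the sketch below computes two summaries that are commonly plotted during EDA: a correlation matrix of the continuous columns and hourly means of the sensor readings. It runs on a synthetic stand-in frame in which `Reservoirs` is deliberately built to track `TP3`, mirroring the attribute description; the values and the names `demo`, `corr`, and `hourly` are assumptions, not results from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: one reading every 10 s for 6 hours (values are made up).
rng = np.random.default_rng(1)
idx = pd.date_range("2020-02-01", periods=6 * 360, freq=pd.Timedelta(seconds=10))
tp3 = 9.0 + rng.normal(0, 0.3, len(idx))
demo = pd.DataFrame({
    "timestamp": idx,
    "TP3": tp3,
    "Reservoirs": tp3 + rng.normal(0, 0.01, len(idx)),  # built to track TP3
    "Oil_temperature": 60 + rng.normal(0, 5, len(idx)),
})

# Pairwise Pearson correlation between the continuous columns.
corr = demo[["TP3", "Reservoirs", "Oil_temperature"]].corr()

# Hourly means shrink the series to a size that is easy to plot.
hourly = demo.set_index("timestamp").resample(pd.Timedelta(hours=1)).mean()

print(corr.round(2))
print(hourly.head())
```

With a plotting backend installed, `hourly.plot()` would draw the aggregated series, and `corr` could be passed to a heatmap function.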

## **Regression Analysis:**

### T2. Identify and list regression problems on your assigned dataset. Which one does seem the most interesting to you and why?

Regression problems involve predicting a continuous numeric value based on input features. Given the attributes in the metro train dataset, here are a few potential regression problems:

1. **Predict Motor Current:** Given sensor readings such as 'TP2', 'TP3', 'H1', and 'Oil_temperature', predict the 'Motor_current' value, which represents the electrical current consumed by the motor. This could be valuable for monitoring motor health and efficiency.

2. **Oil Temperature Prediction:** Use attributes like 'TP2', 'TP3', 'H1', and 'Motor_current' to predict the 'Oil_temperature'. Accurate prediction of oil temperature is crucial for maintaining optimal compressor operation and preventing overheating.

3. **Air Pressure Prediction:** Given sensor readings like 'TP2', 'TP3', 'H1', and 'Oil_temperature', predict the 'DV_pressure' or other pressure-related attributes. Accurate pressure prediction is essential for maintaining safe and efficient compressor operation.

4. **Reservoir Pressure Prediction ('Reservoirs'):** Predict the downstream pressure of the reservoirs using the other attributes. Because this reading should stay close to the pneumatic panel pressure ('TP3'), large prediction errors could point to abnormal behaviour in the air circuit.
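To make the first problem concrete, the sketch below shows one way to set up features and target for `Motor_current` and to hold out the most recent readings for testing. It uses a synthetic stand-in frame (the values and the linear relationship are made up), and the 80/20 split and the mean-prediction baseline are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the sensor table (values are made up).
rng = np.random.default_rng(2)
n = 500
demo = pd.DataFrame({
    "TP2": rng.normal(1.4, 3.0, n),
    "TP3": rng.normal(9.0, 0.6, n),
    "H1": rng.normal(7.5, 3.0, n),
    "Oil_temperature": rng.normal(62.0, 6.0, n),
})
demo["Motor_current"] = 2.0 + 0.5 * demo["TP2"] + rng.normal(0, 0.3, n)

features = ["TP2", "TP3", "H1", "Oil_temperature"]
target = "Motor_current"

# Chronological split: the last 20% of rows is held out, so the model is
# never evaluated on readings that precede its training data.
split = int(len(demo) * 0.8)
X_train, X_test = demo[features].iloc[:split], demo[features].iloc[split:]
y_train, y_test = demo[target].iloc[:split], demo[target].iloc[split:]

# Naive baseline: always predict the training mean.
baseline_mae = (y_test - y_train.mean()).abs().mean()
print(X_train.shape, X_test.shape, round(baseline_mae, 3))
```

Any regression model worth keeping should beat this baseline error on the held-out rows.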



**Selecting the Most Interesting Problem:**

Choosing the most interesting problem depends on goals, domain expertise, and the potential impact of solving the problem. However, one problem that stands out is predicting the 'Motor_current'. Motor current consumption is a crucial indicator of the motor's health, efficiency, and potential issues. By accurately predicting motor current, you could:

- Identify abnormal motor behavior or impending failures.
- Optimize energy consumption by understanding how motor current changes in different operating conditions.
- Schedule maintenance more effectively, preventing unexpected breakdowns.
- Enhance passenger safety by addressing potential motor-related risks.

Solving this problem would likely have a direct impact on the overall reliability, efficiency, and safety of the metro train system.



**Predicting Motor Current ('Motor_current'):**

**Reasons for Choosing this Problem:**

1. **Operational Significance:** Motor current consumption is a critical operational parameter in a metro train's compressor system. It directly relates to the health and performance of the motor driving the compressor.

2. **Safety and Reliability:** Accurate predictions of motor current can help in ensuring the safety and reliability of the metro train system. Any anomalies or deviations from expected motor current levels could indicate potential issues or wear and tear.

3. **Energy Efficiency:** Motor current is closely linked to energy efficiency. Predicting motor current allows for optimizing energy consumption, which is essential for cost savings and environmental sustainability.

4. **Maintenance Planning:** Predictive modeling of motor current can enable proactive maintenance planning. Detecting abnormal current spikes or drops can trigger maintenance interventions before critical failures occur, minimizing downtime.

5. **Data Availability:** 'Motor_current' is recorded in every one of the 1,516,948 rows with no null values, making it a practical target for predictive modeling.

6. **Real-time Monitoring:** Accurate motor current predictions can support real-time monitoring, allowing for immediate responses to any deviations from expected current levels.




### T3. Build an end-to-end Machine Learning pipeline for your assigned dataset for the aforementioned most interesting regression problems found in T2.


Your pipeline should include components for dataset preprocessing, transformation, regression model building, hyperparameter tuning (grid search or optimization), and evaluation. Report results on the regression models with hyperparameter tuning, and report the best hyperparameter values. Report results using at least two relevant evaluation metrics, such as RMSE and MAE. Compare results for different models and give the reasoning for the differences.
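The components listed in this task can be sketched with scikit-learn as a single `Pipeline` tuned by `GridSearchCV`. The example below is an illustration under stated assumptions, not the project's solution: the data is synthetic, `Ridge` is an arbitrary model choice, the `alpha` grid is hypothetical, and `TimeSeriesSplit` is used so that each validation fold comes after its training fold.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and the Motor_current target.
rng = np.random.default_rng(3)
n = 600
X = pd.DataFrame({
    "TP2": rng.normal(1.4, 3.0, n),
    "TP3": rng.normal(9.0, 0.6, n),
    "H1": rng.normal(7.5, 3.0, n),
    "Oil_temperature": rng.normal(62.0, 6.0, n),
})
y = 2.0 + 0.5 * X["TP2"] + 0.05 * X["Oil_temperature"] + rng.normal(0, 0.3, n)

# Chronological hold-out: train on the first 80%, test on the last 20%.
split = int(n * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

# Preprocessing and model in one object, so scaling is fit on training folds only.
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
param_grid = {"model__alpha": [0.01, 0.1, 1.0, 10.0]}  # hypothetical grid

search = GridSearchCV(
    pipe,
    param_grid,
    cv=TimeSeriesSplit(n_splits=3),
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)

# Evaluate the refitted best pipeline on the held-out rows.
pred = search.predict(X_test)
rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
mae = float(mean_absolute_error(y_test, pred))
print(search.best_params_, round(rmse, 3), round(mae, 3))
```

To compare models, the `"model"` step and its grid can be swapped for another regressor while the split, cross-validation scheme, and metrics stay the same.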