## DATA PREPROCESSING/DATA PREPARATION

1. Data Cleaning - dropping rows or columns, imputing nulls with `fillna` (mean, median or mode), changing dtypes
2. Data Transformation
    * If data is CONTINUOUS = Standard Scaler, MinMaxScaler, Robust Scaler
    * If data is DISCRETE   = Label Encoder, One hot Encoder

## 1. Import Necessary Libraries

In [1]:
import pandas as pd

## 2. Import Data

In [2]:
weather_data_2010=pd.read_csv('data_clean.csv')
weather_data_2010.head(50)

Unnamed: 0.1,Unnamed: 0,Ozone,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
0,1,41.0,190.0,7.4,67,5,1,2010,67,S
1,2,36.0,118.0,8.0,72,5,2,2010,72,C
2,3,12.0,149.0,12.6,74,5,3,2010,74,PS
3,4,18.0,313.0,11.5,62,5,4,2010,62,S
4,5,,,14.3,56,5,5,2010,56,S
5,6,28.0,,14.9,66,5,6,2010,66,C
6,7,23.0,299.0,8.6,65,5,7,2010,65,PS
7,8,19.0,99.0,13.8,59,5,8,2010,59,C
8,9,8.0,19.0,20.1,61,5,9,2010,61,PS
9,10,,194.0,8.6,69,5,10,2010,69,S


Pick one feature at a time and compare it with the output (o/p) feature. With domain knowledge, we can judge how much that feature is likely to contribute to predicting the output.

## 3. Data Understanding

In [3]:
weather_data_2010.shape

(158, 10)

In [4]:
weather_data_2010.isnull().sum()

Unnamed: 0     0
Ozone         38
Solar.R        7
Wind           0
Temp C         0
Month          0
Day            0
Year           0
Temp           0
Weather        3
dtype: int64

In [5]:
38/158*100

24.050632911392405

- About 24% of the observations have no Ozone value recorded.
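
The percentage above is computed by hand from the null count and the row count. As a minimal sketch, the same figure can be obtained for every column at once by averaging the boolean null mask. The frame `toy` below is a hypothetical stand-in for `weather_data_2010`, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for weather_data_2010
toy = pd.DataFrame({
    "Ozone": [41.0, np.nan, 12.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5],
})

# isnull() gives True/False per cell; the column-wise mean of those
# booleans is the fraction of missing values, so * 100 gives a percentage
missing_pct = toy.isnull().mean() * 100
print(missing_pct)
```

Applied to the real frame, this would avoid typing row counts manually for each feature.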

In [6]:
weather_data_2010.dtypes

Unnamed: 0      int64
Ozone         float64
Solar.R       float64
Wind          float64
Temp C         object
Month          object
Day             int64
Year            int64
Temp            int64
Weather        object
dtype: object

In [7]:
weather_data_2010.describe(include='all')

Unnamed: 0.1,Unnamed: 0,Ozone,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
count,158.0,120.0,151.0,158.0,158.0,158.0,158.0,158.0,158.0,155
unique,,,,,41.0,6.0,,,,3
top,,,,,81.0,9.0,,,,S
freq,,,,,11.0,34.0,,,,59
mean,79.5,41.583333,185.403974,9.957595,,,16.006329,2010.0,77.727848,
std,45.754781,32.620709,88.723103,3.511261,,,8.997166,0.0,9.377877,
min,1.0,1.0,7.0,1.7,,,1.0,2010.0,56.0,
25%,40.25,18.0,119.0,7.4,,,8.0,2010.0,72.0,
50%,79.5,30.5,197.0,9.7,,,16.0,2010.0,78.5,
75%,118.75,61.5,257.0,11.875,,,24.0,2010.0,84.0,


- If we replace null values with the mean or median, the nulls are filled, but many rows end up sharing the same repeated value. This is artificial information added to the data, and the model will learn that pattern too.
- If such imputation is not acceptable, we drop the column instead.

In [8]:
7/158*100

4.430379746835443

With only about 4.4% of the Solar.R values missing, we can either impute them with the mean/median or drop those observations (row-wise).
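
The three options mentioned here (mean imputation, median imputation, dropping rows) can be compared side by side. This is a small sketch on a hypothetical Series named `solar`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing reading
solar = pd.Series([190.0, 118.0, np.nan, 313.0, 99.0], name="Solar.R")

mean_filled = solar.fillna(solar.mean())      # mean is pulled by extreme values
median_filled = solar.fillna(solar.median())  # median is more robust to outliers
dropped = solar.dropna()                      # loses the observation entirely

print(mean_filled[2], median_filled[2], len(dropped))
```

Which option is appropriate depends on how skewed the feature is and how many rows can be spared.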

In [9]:
weather_data_2010['Weather'].unique()

array(['S', 'C', 'PS', nan], dtype=object)

In [10]:
3/154*100  # note: the frame has 158 rows, so 3/158*100 (about 1.9%) is the exact share

1.948051948051948

- For the output (o/p) feature, mean/mode imputation is not a good idea because the model may learn wrong patterns. The better option is to drop those observations.
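
Dropping only the rows whose output value is missing can be done with the `subset` argument of `dropna`. The sketch below uses a hypothetical frame `toy`; it shows that rows with nulls in other columns are kept.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 0 lacks Solar.R, row 1 lacks the target Weather
toy = pd.DataFrame({
    "Solar.R": [np.nan, 118.0, 149.0],
    "Weather": ["S", np.nan, "PS"],
})

# Drop rows only where the target column is null
cleaned = toy.dropna(subset=["Weather"])
print(cleaned)
```

A plain `dropna(axis=0)` would also remove row 0 here, which is why null treatment of the input features is usually done first.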

## 4. Data Preparation

### STAGE 1: Data Cleaning

#### 1. Null Value Treatment

In [11]:
weather_data_2010.head()

Unnamed: 0.1,Unnamed: 0,Ozone,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
0,1,41.0,190.0,7.4,67,5,1,2010,67,S
1,2,36.0,118.0,8.0,72,5,2,2010,72,C
2,3,12.0,149.0,12.6,74,5,3,2010,74,PS
3,4,18.0,313.0,11.5,62,5,4,2010,62,S
4,5,,,14.3,56,5,5,2010,56,S


In [12]:
del weather_data_2010['Unnamed: 0'] # Dropping unwanted feature

In [13]:
weather_data_2010.head()

Unnamed: 0,Ozone,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
0,41.0,190.0,7.4,67,5,1,2010,67,S
1,36.0,118.0,8.0,72,5,2,2010,72,C
2,12.0,149.0,12.6,74,5,3,2010,74,PS
3,18.0,313.0,11.5,62,5,4,2010,62,S
4,,,14.3,56,5,5,2010,56,S


### Client agreed to drop the Ozone feature because it has too many null values.

In [14]:
del weather_data_2010['Ozone']

In [15]:
weather_data_2010.isna().sum()

Solar.R    7
Wind       0
Temp C     0
Month      0
Day        0
Year       0
Temp       0
Weather    3
dtype: int64

### Client agreed to go with mean imputation for the Solar.R feature

In [16]:
weather_data_2010['Solar.R'].mean()

185.40397350993376

In [17]:
weather_data_2010.head()

Unnamed: 0,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
0,190.0,7.4,67,5,1,2010,67,S
1,118.0,8.0,72,5,2,2010,72,C
2,149.0,12.6,74,5,3,2010,74,PS
3,313.0,11.5,62,5,4,2010,62,S
4,,14.3,56,5,5,2010,56,S


In [18]:
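# Assign the result back; a chained inplace fillna warns in recent pandas and may not update the frame
weather_data_2010['Solar.R'] = weather_data_2010['Solar.R'].fillna(value=185.4)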

In [19]:
weather_data_2010.isna().sum()

Solar.R    0
Wind       0
Temp C     0
Month      0
Day        0
Year       0
Temp       0
Weather    3
dtype: int64
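
The imputation above uses a hard-coded, rounded mean of 185.4. As a hedged alternative, the mean can be computed and applied in one step so the fill value always matches the data. The frame `df` is a hypothetical example, not the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing Solar.R reading
df = pd.DataFrame({"Solar.R": [190.0, np.nan, 149.0, 313.0]})

# mean() skips NaN by default, so it is computed from the observed values only
solar_mean = df["Solar.R"].mean()

# Assign the filled column back instead of filling in place
df["Solar.R"] = df["Solar.R"].fillna(solar_mean)
print(df)
```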

### Client agreed to drop 3 observations for which the Weather has not been captured carefully

In [20]:
weather_data_2010.dropna(axis=0, inplace=True)

In [21]:
weather_data_2010.isna().sum()

Solar.R    0
Wind       0
Temp C     0
Month      0
Day        0
Year       0
Temp       0
Weather    0
dtype: int64

#### 2. Conversion Of Datatypes

In [22]:
weather_data_2010.dtypes

Solar.R    float64
Wind       float64
Temp C      object
Month       object
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [23]:
weather_data_2010.head(10)

Unnamed: 0,Solar.R,Wind,Temp C,Month,Day,Year,Temp,Weather
0,190.0,7.4,67,5,1,2010,67,S
1,118.0,8.0,72,5,2,2010,72,C
2,149.0,12.6,74,5,3,2010,74,PS
3,313.0,11.5,62,5,4,2010,62,S
4,185.4,14.3,56,5,5,2010,56,S
5,185.4,14.9,66,5,6,2010,66,C
6,299.0,8.6,65,5,7,2010,65,PS
7,99.0,13.8,59,5,8,2010,59,C
8,19.0,20.1,61,5,9,2010,61,PS
9,194.0,8.6,69,5,10,2010,69,S


In [24]:
del weather_data_2010['Temp C']  # 'Temp C' repeats the values of 'Temp' in the rows shown above

In [25]:
weather_data_2010.dtypes

Solar.R    float64
Wind       float64
Month       object
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [26]:
weather_data_2010.head(40)

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5,1,2010,67,S
1,118.0,8.0,5,2,2010,72,C
2,149.0,12.6,5,3,2010,74,PS
3,313.0,11.5,5,4,2010,62,S
4,185.4,14.3,5,5,2010,56,S
5,185.4,14.9,5,6,2010,66,C
6,299.0,8.6,5,7,2010,65,PS
7,99.0,13.8,5,8,2010,59,C
8,19.0,20.1,5,9,2010,61,PS
9,194.0,8.6,5,10,2010,69,S


We need to convert the Month column's datatype to integer.

First, we convert the non-numeric (object) entries to null (NaN) using `errors='coerce'`.

In [27]:
weather_data_2010['Month']=pd.to_numeric(arg=weather_data_2010['Month'], errors='coerce')
weather_data_2010.head(40)

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5.0,1,2010,67,S
1,118.0,8.0,5.0,2,2010,72,C
2,149.0,12.6,5.0,3,2010,74,PS
3,313.0,11.5,5.0,4,2010,62,S
4,185.4,14.3,5.0,5,2010,56,S
5,185.4,14.9,5.0,6,2010,66,C
6,299.0,8.6,5.0,7,2010,65,PS
7,99.0,13.8,5.0,8,2010,59,C
8,19.0,20.1,5.0,9,2010,61,PS
9,194.0,8.6,5.0,10,2010,69,S


In [28]:
weather_data_2010.dtypes

Solar.R    float64
Wind       float64
Month      float64
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [29]:
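# Assign the result back; a chained inplace fillna warns in recent pandas and may not update the frame
weather_data_2010['Month'] = weather_data_2010['Month'].fillna(value=5)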

In [30]:
weather_data_2010.head(30)

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5.0,1,2010,67,S
1,118.0,8.0,5.0,2,2010,72,C
2,149.0,12.6,5.0,3,2010,74,PS
3,313.0,11.5,5.0,4,2010,62,S
4,185.4,14.3,5.0,5,2010,56,S
5,185.4,14.9,5.0,6,2010,66,C
6,299.0,8.6,5.0,7,2010,65,PS
7,99.0,13.8,5.0,8,2010,59,C
8,19.0,20.1,5.0,9,2010,61,PS
9,194.0,8.6,5.0,10,2010,69,S


In [31]:
weather_data_2010['Month']=weather_data_2010['Month'].astype('int')

In [32]:
weather_data_2010.dtypes

Solar.R    float64
Wind       float64
Month        int32
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object
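
The three steps above (coerce to numeric, fill the resulting nulls, cast to integer) can be sketched end to end on a small example. The Series `month` and the choice of the mode as fill value are assumptions for illustration; the notebook itself fills with the constant 5.

```python
import pandas as pd

# Hypothetical Month column stored as strings, with one non-numeric entry
month = pd.Series(["5", "5", "May", "6", "6", "6"], name="Month")

# Step 1: entries that cannot be parsed become NaN, so the dtype is float64
numeric = pd.to_numeric(month, errors="coerce")

# Step 2: fill NaN with the most frequent value (an assumption for this sketch)
fill_value = numeric.mode().iloc[0]

# Step 3: cast to a fixed-width integer type
month_int = numeric.fillna(fill_value).astype("int64")
print(month_int.tolist())
```

Requesting `int64` explicitly keeps the resulting dtype the same across platforms, whereas plain `'int'` may map to `int32` on some systems, as in the output above.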

**Explore np.where**
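
As a starting point for that exploration, here is a minimal sketch of `np.where` doing a conditional replacement in a single expression. The frame `df` and the replacement value 5 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical Month column with one non-numeric entry
df = pd.DataFrame({"Month": ["5", "May", "6"]})

numeric = pd.to_numeric(df["Month"], errors="coerce")

# np.where(condition, value_if_true, value_if_false) works element-wise:
# where the value is missing use 5, otherwise keep the parsed number
df["Month"] = np.where(numeric.isna(), 5, numeric).astype("int64")
print(df)
```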

In [33]:
weather_data_2010.dtypes

Solar.R    float64
Wind       float64
Month        int32
Day          int64
Year         int64
Temp         int64
Weather     object
dtype: object

In [34]:
weather_data_2010.head()

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5,1,2010,67,S
1,118.0,8.0,5,2,2010,72,C
2,149.0,12.6,5,3,2010,74,PS
3,313.0,11.5,5,4,2010,62,S
4,185.4,14.3,5,5,2010,56,S


### STAGE 2: Data Transformation

2. Data Transformation
    * If data is **CONTINUOUS** = Standard Scaler, MinMaxScaler, Robust Scaler
    * If data is **DISCRETE**   = Label Encoder, One hot Encoder
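
The three scalers listed for continuous data are not demonstrated in this notebook, so the following is a small sketch of how they differ. The array `X` is a hypothetical single feature containing one outlier; it is not taken from the weather data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical continuous feature with one large outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

standard = StandardScaler().fit_transform(X)  # zero mean, unit variance
minmax = MinMaxScaler().fit_transform(X)      # rescaled to the range [0, 1]
robust = RobustScaler().fit_transform(X)      # centred on median, scaled by IQR

print(standard.ravel())
print(minmax.ravel())
print(robust.ravel())
```

Because `RobustScaler` uses the median and interquartile range, the outlier does not compress the remaining values the way it does with `MinMaxScaler`.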

#### 1. Label Encoder

In [35]:
weather_data= weather_data_2010.copy()

In [36]:
weather_data.head()

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5,1,2010,67,S
1,118.0,8.0,5,2,2010,72,C
2,149.0,12.6,5,3,2010,74,PS
3,313.0,11.5,5,4,2010,62,S
4,185.4,14.3,5,5,2010,56,S


In [37]:
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
weather_data['Weather']=le.fit_transform(weather_data['Weather'])
weather_data

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5,1,2010,67,2
1,118.0,8.0,5,2,2010,72,0
2,149.0,12.6,5,3,2010,74,1
3,313.0,11.5,5,4,2010,62,2
4,185.4,14.3,5,5,2010,56,2
...,...,...,...,...,...,...,...
153,190.0,7.4,5,1,2010,67,0
154,193.0,6.9,9,26,2010,70,1
155,145.0,13.2,9,27,2010,77,2
156,191.0,14.3,9,28,2010,75,2


In [38]:
weather_data.dtypes

Solar.R    float64
Wind       float64
Month        int32
Day          int64
Year         int64
Temp         int64
Weather      int32
dtype: object
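
To see which integer was assigned to which category, the fitted encoder exposes `classes_`, and `inverse_transform` maps codes back to labels. A minimal sketch on a hypothetical Series `weather`:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of the Weather column
weather = pd.Series(["S", "C", "PS", "S"])

le = LabelEncoder()
codes = le.fit_transform(weather)

# classes_ holds the labels in sorted order; position equals the assigned code
mapping = {label: code for code, label in enumerate(le.classes_)}

# inverse_transform recovers the original labels from the codes
decoded = le.inverse_transform(codes)

print(mapping)
print(codes, decoded)
```

The codes follow the sorted order of the labels, which matches the output above where C, PS and S become 0, 1 and 2.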

#### 2. One Hot Encoder

This can be done with either of the following two libraries:

1. Pandas: `pd.get_dummies()`
2. sklearn: `OneHotEncoder()`

##### 1. Pandas

In [39]:
weather_data_2= weather_data_2010.copy()
weather_data_2.head()

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5,1,2010,67,S
1,118.0,8.0,5,2,2010,72,C
2,149.0,12.6,5,3,2010,74,PS
3,313.0,11.5,5,4,2010,62,S
4,185.4,14.3,5,5,2010,56,S


In [40]:
weather_data_2=pd.get_dummies(data=weather_data_2, columns=['Weather'], drop_first=True)
weather_data_2

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather_PS,Weather_S
0,190.0,7.4,5,1,2010,67,0,1
1,118.0,8.0,5,2,2010,72,0,0
2,149.0,12.6,5,3,2010,74,1,0
3,313.0,11.5,5,4,2010,62,0,1
4,185.4,14.3,5,5,2010,56,0,1
...,...,...,...,...,...,...,...,...
153,190.0,7.4,5,1,2010,67,0,0
154,193.0,6.9,9,26,2010,70,1,0
155,145.0,13.2,9,27,2010,77,0,1
156,191.0,14.3,9,28,2010,75,0,1
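
With `drop_first=True`, the first category (C) has no column of its own and is represented by zeros in all remaining dummy columns. The sketch below shows this on a hypothetical frame `toy`; passing `dtype=int` is an assumption made here so the output is 0/1 regardless of pandas version, since newer releases default to booleans.

```python
import pandas as pd

# Hypothetical frame with one row per weather category
toy = pd.DataFrame({"Temp": [67, 72, 74], "Weather": ["S", "C", "PS"]})

# drop_first=True removes the first (sorted) category column, here Weather_C
dummies = pd.get_dummies(toy, columns=["Weather"], drop_first=True, dtype=int)
print(dummies)
```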


##### 2. OneHot Encoder

In [41]:
weather_data_3= weather_data_2010.copy()
weather_data_3.head()

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,Weather
0,190.0,7.4,5,1,2010,67,S
1,118.0,8.0,5,2,2010,72,C
2,149.0,12.6,5,3,2010,74,PS
3,313.0,11.5,5,4,2010,62,S
4,185.4,14.3,5,5,2010,56,S


In [42]:
from sklearn.preprocessing import OneHotEncoder
ohe=OneHotEncoder(handle_unknown='ignore')
# fit_transform returns a sparse matrix; toarray() makes it dense for the DataFrame
ohe_data=pd.DataFrame(ohe.fit_transform(weather_data_3[['Weather']]).toarray())
ohe_data

Unnamed: 0,0,1,2
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,0.0,1.0
4,0.0,0.0,1.0
...,...,...,...
150,1.0,0.0,0.0
151,0.0,1.0,0.0
152,0.0,0.0,1.0
153,0.0,0.0,1.0


In [43]:
# Rows were dropped earlier, so reuse the frame's row labels; a default 0..n-1 index misaligns the join
ohe_data.index=weather_data_3.index
weather_data_3=weather_data_3.iloc[:,0:6].join(ohe_data)
weather_data_3

Unnamed: 0,Solar.R,Wind,Month,Day,Year,Temp,0,1,2
0,190.0,7.4,5,1,2010,67,0.0,0.0,1.0
1,118.0,8.0,5,2,2010,72,1.0,0.0,0.0
2,149.0,12.6,5,3,2010,74,0.0,1.0,0.0
3,313.0,11.5,5,4,2010,62,0.0,0.0,1.0
4,185.4,14.3,5,5,2010,56,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...
153,190.0,7.4,5,1,2010,67,1.0,0.0,0.0
154,193.0,6.9,9,26,2010,70,0.0,1.0,0.0
155,145.0,13.2,9,27,2010,77,0.0,0.0,1.0
156,191.0,14.3,9,28,2010,75,0.0,0.0,1.0


### How to choose between OneHot Encoder and Label Encoder

### Input feature:
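Based on whether the model is **parametric or non-parametric**, we choose between Label Encoding (LE) and One-Hot Encoding (OHE).
* For **Parametric Models** - better to go with **OHE**, because models such as linear regression treat integer label codes as ordered magnitudes.
* For **Non-Parametric Models** - always go with **Label Encoding.**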

### Output feature:
The output feature must always be **label encoded.**