# Prepare the Data for Machine Learning Algorithms

In [1]:
import numpy as np
import pandas as pd

In [2]:
filepath="../data/preprocessed_data/train.csv"

In [4]:
housing=pd.read_csv(filepath,index_col=0)

In [5]:
housing.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,72100.0,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,279600.0,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,82700.0,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,112500.0,NEAR OCEAN
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,238300.0,<1H OCEAN


- Split the dataset into X and y so that X holds the attributes used to predict y.
- y is the target variable, median_house_value.

In [6]:
X = housing.drop('median_house_value',axis=1)
y = housing["median_house_value"].copy()

In [7]:
X.head()

Unnamed: 0,longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,ocean_proximity
12655,-121.46,38.52,29.0,3873.0,797.0,2237.0,706.0,2.1736,INLAND
15502,-117.23,33.09,7.0,5320.0,855.0,2015.0,768.0,6.3373,NEAR OCEAN
2908,-119.04,35.37,44.0,1618.0,310.0,667.0,300.0,2.875,INLAND
14053,-117.13,32.75,24.0,1877.0,519.0,898.0,483.0,2.2264,NEAR OCEAN
20496,-118.7,34.28,27.0,3536.0,646.0,1837.0,580.0,4.4964,<1H OCEAN


In [8]:
y.head()

12655     72100.0
15502    279600.0
2908      82700.0
14053    112500.0
20496    238300.0
Name: median_house_value, dtype: float64

## Data Cleaning

- Data cleaning is needed because most machine learning models cannot handle missing values in the features.

In [9]:
X.isnull().sum()

longitude               0
latitude                0
housing_median_age      0
total_rooms             0
total_bedrooms        158
population              0
households              0
median_income           0
ocean_proximity         0
dtype: int64

- The total_bedrooms column has 158 null values.

## Ways to deal with null values
- If more than half of a column's values are null, drop that column.
- Remove the rows that contain null values.
- Impute (fill) the null values with an appropriate value, such as the median.

In [10]:
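# housing.dropna(subset=["total_bedrooms"])  # option 1: drop the rows with nulls
# housing.drop("total_bedrooms", axis=1)     # option 2: drop the whole column
median = housing["total_bedrooms"].median()  # option 3: fill with the median
# Assign the result back instead of using inplace=True on a column selection
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)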
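The three options can be compared side by side on a small frame. This is only a sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the housing data (values are invented)
toy = pd.DataFrame({
    "total_rooms": [10.0, 20.0, 30.0, 40.0],
    "total_bedrooms": [2.0, np.nan, 6.0, 4.0],
})

# Share of missing values per column, useful for deciding whether to drop a column
missing_share = toy.isnull().mean()

# Option 1: drop the rows where total_bedrooms is missing
dropped_rows = toy.dropna(subset=["total_bedrooms"])

# Option 2: drop the whole column
dropped_col = toy.drop("total_bedrooms", axis=1)

# Option 3: fill the missing values with the column median
median = toy["total_bedrooms"].median()
filled = toy.assign(total_bedrooms=toy["total_bedrooms"].fillna(median))

print(missing_share)
print(filled)
```

None of the three calls modifies `toy` itself; each returns a new DataFrame, so the result has to be assigned to a name to be kept.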

## Filling Null values with sklearn
- Scikit-Learn provides a handy class to take care of missing values: SimpleImputer.

In [11]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")
# The attributes are numerical, so we use the median strategy.
# For categorical attributes, use strategy="most_frequent" (the mode).
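For a categorical column, the mode-based strategy in SimpleImputer is named `most_frequent`. The sketch below uses a made-up `toy_cat` frame (the housing data has no missing ocean_proximity values), so treat it as an illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one missing entry (invented for illustration)
toy_cat = pd.DataFrame(
    {"ocean_proximity": ["INLAND", np.nan, "INLAND", "NEAR BAY"]},
    dtype=object,
)

# "most_frequent" fills each missing value with the column's mode
cat_imputer = SimpleImputer(strategy="most_frequent")
filled_cat = cat_imputer.fit_transform(toy_cat)

print(cat_imputer.statistics_)  # the learned mode for each column
print(filled_cat)
```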

In [12]:
housing_num = housing.drop("ocean_proximity", axis=1)
# The median can only be computed on numerical values, so we drop ocean_proximity (object dtype).

In [13]:
imputer.fit(housing_num)

In [14]:
imputer.statistics_

array([-1.18510e+02,  3.42600e+01,  2.90000e+01,  2.11900e+03,
        4.33000e+02,  1.16400e+03,  4.08000e+02,  3.54155e+00,
        1.79500e+05])

In [15]:
housing_num.median().values

array([-1.18510e+02,  3.42600e+01,  2.90000e+01,  2.11900e+03,
        4.33000e+02,  1.16400e+03,  4.08000e+02,  3.54155e+00,
        1.79500e+05])

We can now use this “trained” imputer to transform the training dataset, replacing the missing values with the learned medians.

In [16]:
temp=imputer.transform(housing_num)

The transform method returns a NumPy array, so we need to convert it back into a DataFrame.

In [17]:
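# Build the DataFrame from the imputed array (temp), keeping the original column names and index
housing_tr = pd.DataFrame(temp, columns=housing_num.columns, index=housing_num.index)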
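The fit, transform, and rebuild steps can be sketched end to end on toy data. The `train` and `test` frames below are hypothetical and their values are invented; the point is that the medians are learned once from the training data and then reused, and that passing `columns` and `index` restores the labels that the NumPy array loses.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical training and test frames (invented values)
train = pd.DataFrame(
    {"total_bedrooms": [2.0, np.nan, 6.0, 4.0], "households": [1.0, 2.0, 3.0, 4.0]},
    index=[12655, 15502, 2908, 14053],
)
test = pd.DataFrame(
    {"total_bedrooms": [np.nan], "households": [5.0]},
    index=[20496],
)

# Learn the per-column medians from the training data only
imputer = SimpleImputer(strategy="median")
imputer.fit(train)

# transform returns a NumPy array, so rebuild DataFrames with the original labels
train_tr = pd.DataFrame(imputer.transform(train), columns=train.columns, index=train.index)
test_tr = pd.DataFrame(imputer.transform(test), columns=test.columns, index=test.index)

print(train_tr)
print(test_tr)
```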

## Handling text and Categorical attributes

In [18]:
housing_cat=housing[['ocean_proximity']].copy()
housing_cat.head()

Unnamed: 0,ocean_proximity
12655,INLAND
15502,NEAR OCEAN
2908,INLAND
14053,NEAR OCEAN
20496,<1H OCEAN


- Most machine learning algorithms work on numerical input, so we need to convert this text into numbers.
- This conversion is called encoding.

## Methods for encoding
- Ordinal Encoding
- One Hot Encoding
- Label Encoding

In [19]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()

In [20]:
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded


array([[1.],
       [4.],
       [1.],
       ...,
       [0.],
       [0.],
       [1.]])

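The encoder's categories_ attribute is a list containing a 1D array of categories for each categorical attribute. The position of a category in that array is the number it was encoded as.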

In [21]:
ordinal_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

One issue with this representation is that ML algorithms will assume that two nearby values are more similar than two distant values. That is fine for ordered categories such as good, average, bad. Here, however, the categories are distinct labels with no natural order, much like Yes and No. In such situations the preferable encoding technique is One Hot Encoding.
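When the categories really are ordered, the order can be passed to OrdinalEncoder explicitly through its `categories` parameter, instead of relying on the default sorted order. The `ratings` frame below is a hypothetical example built around the good, average, bad case and is not part of the housing data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered attribute (invented for illustration)
ratings = pd.DataFrame({"quality": ["good", "bad", "average", "good"]}, dtype=object)

# List the categories from lowest to highest so the codes follow that order
ordered_encoder = OrdinalEncoder(categories=[["bad", "average", "good"]])
ratings_encoded = ordered_encoder.fit_transform(ratings)

print(ratings_encoded.ravel())  # bad -> 0, average -> 1, good -> 2
```

Without the `categories` argument the encoder sorts the labels, which here would put average before bad.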

In [22]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)


One hot encoding creates one new column per category. For each row, the column matching its category is set to 1 (hot) and the remaining columns are set to 0 (cold). The result is a SciPy sparse matrix, so we call toarray() to view it as a dense array.

In [23]:
housing_cat_1hot.toarray()

array([[0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0.],
       ...,
       [1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

In [24]:
cat_encoder.categories_

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

These are basic but important steps in data preprocessing. In the next notebook we will move on to Transformers.