### Prepare Data for ML Algorithms

#### 1. Load Data and Split (as in section 1)

In [2]:
#load data
import os
import pandas as pd

HOUSING_PATH = os.path.join("datasets", "housing")

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

housing = load_housing_data()

In [3]:
#split train and test (with stratified sampling)
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

housing["income_cat"] = pd.cut(housing["median_income"],
                              bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                              labels=[1, 2, 3, 4, 5])

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
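
As a rough illustration of what the stratified split buys us, the sketch below runs `StratifiedShuffleSplit` on a small made-up DataFrame and compares category proportions in the full set and the test set. The `toy` DataFrame, its `value` and `cat` columns, and the 50/30/20 category mix are assumptions for the example only, not the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 100 rows in three categories (50 / 30 / 20).
toy = pd.DataFrame({
    "value": np.arange(100),
    "cat": [1] * 50 + [2] * 30 + [3] * 20,
})

toy_split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for toy_train_idx, toy_test_idx in toy_split.split(toy, toy["cat"]):
    toy_train = toy.iloc[toy_train_idx]   # split() yields positional indices
    toy_test = toy.iloc[toy_test_idx]

# Category proportions in the full set and in the test set.
overall_props = toy["cat"].value_counts(normalize=True).sort_index()
test_props = toy_test["cat"].value_counts(normalize=True).sort_index()
```

With stratification, `test_props` should closely match `overall_props`, which is the property the income-category split above relies on.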

In [4]:
#clean unnecessary data
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

#only take the training data (or sample of it if dataset is too large)
housing = strat_train_set.copy()

#### 2. Data Cleaning

For example, the `total_bedrooms` attribute has some missing values (see section 1). There are three options:
- drop the districts with missing values
- drop the whole attribute
- set the missing values to some value (zero, the mean, the median, etc.)

In [None]:
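# pandas methods (options 1 and 2 return new DataFrames; housing itself is unchanged):
housing.dropna(subset=["total_bedrooms"])   # option 1: drop districts with missing values
housing.drop("total_bedrooms", axis=1)      # option 2: drop the whole attribute
median = housing["total_bedrooms"].median() # option 3: compute the training-set median; reuse it for the test set and new data
housing["total_bedrooms"] = housing["total_bedrooms"].fillna(median)  # assign back rather than using chained inplace=True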
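
The note on option 3 (reuse the training median on the test set and on new data) can be sketched as follows. The `train` and `test` DataFrames and their values are made up for illustration; only the column name comes from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the training and test sets.
train = pd.DataFrame({"total_bedrooms": [100.0, np.nan, 300.0, 500.0]})
test = pd.DataFrame({"total_bedrooms": [np.nan, 250.0]})

# Compute the median on the training set only (NaN values are skipped).
train_median = train["total_bedrooms"].median()

# Fill both sets with that same training-set value.
train["total_bedrooms"] = train["total_bedrooms"].fillna(train_median)
test["total_bedrooms"] = test["total_bedrooms"].fillna(train_median)
```

The test set is never used to compute the fill value, so no information from it leaks into the preprocessing.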

In [6]:
# sklearn method:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")             
housing_num = housing.drop("ocean_proximity", axis=1)  # keep only the numerical attributes
imputer.fit(housing_num)                               # learn the median of each numerical attribute --> the imputer acts as an Estimator

imputer.statistics_  # the learned medians are stored in statistics_; same values as housing_num.median().values

array([-1.1851e+02,  3.4260e+01,  2.9000e+01,  2.1195e+03,  4.3300e+02,
        1.1640e+03,  4.0800e+02,  3.5409e+00,  1.7950e+05])

In [11]:
# transform the training set by replacing missing values with the "learned" medians --> NumPy array with transformed features
X = imputer.transform(housing_num)                      # --> the imputer also acts as a Transformer
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index) # put it back into a pandas DataFrame
X.shape

(16512, 9)
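
To make the fit/transform steps easier to check by hand, here is a minimal sketch on a tiny made-up DataFrame. The `toy` data and the `rooms` and `income` columns are assumptions for the example; the `SimpleImputer` calls are the same ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "rooms": [2.0, 4.0, np.nan, 6.0],
    "income": [1.0, np.nan, 3.0, 5.0],
})

toy_imputer = SimpleImputer(strategy="median")
toy_imputer.fit(toy)                      # learns one median per column
learned = toy_imputer.statistics_         # medians, in column order

# transform() returns a NumPy array; wrap it to keep column names and index.
filled = pd.DataFrame(toy_imputer.transform(toy),
                      columns=toy.columns, index=toy.index)
```

Here `learned` matches `toy.median().values`, and each missing entry in `filled` holds its column's median.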

#### 3. Handling Text and Categorical Attributes

In [5]:
# take a look at the only text/non-numerical attribute: ocean_proximity
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)

| | ocean_proximity |
|---|---|
| 17606 | <1H OCEAN |
| 18632 | <1H OCEAN |
| 14650 | NEAR OCEAN |
| 3230 | INLAND |
| 3555 | <1H OCEAN |
| 19480 | INLAND |
| 8879 | <1H OCEAN |
| 13685 | INLAND |
| 4937 | <1H OCEAN |
| 4861 | <1H OCEAN |


Observe that the `ocean_proximity` attribute doesn't contain arbitrary text, but a limited number of possible values.\
--> this is a categorical attribute\
--> its categories can be converted to numbers for ML algorithms

In [6]:
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]

array([[0.],
       [0.],
       [4.],
       [1.],
       [0.],
       [1.],
       [0.],
       [1.],
       [0.],
       [0.]])

In [7]:
ordinal_encoder.categories_   # instance variable: all categories

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

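However, representing the categories with the numbers 0 to 4 will cause ML algorithms to assume that two nearby values are more similar than two distant values, which is not the case here (e.g. `<1H OCEAN` (0) is arguably closer to `NEAR OCEAN` (4) than to `INLAND` (1)).\
--> use one-hot vectors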
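
The difference can be made concrete by comparing pairwise distances under the two encodings. This is a small NumPy-only sketch; the distance matrices are illustrative and not something scikit-learn computes for you.

```python
import numpy as np

categories = ["<1H OCEAN", "INLAND", "ISLAND", "NEAR BAY", "NEAR OCEAN"]

# Ordinal encoding: each category becomes a single number 0..4.
ordinal = np.arange(len(categories), dtype=float)

# One-hot encoding: each category becomes a row of the identity matrix.
one_hot = np.eye(len(categories))

# Pairwise distances between categories under each encoding.
ordinal_dist = np.abs(ordinal[:, None] - ordinal[None, :])
one_hot_dist = np.linalg.norm(one_hot[:, None, :] - one_hot[None, :, :], axis=-1)
```

Under the ordinal codes, `<1H OCEAN` is four times farther from `NEAR OCEAN` than from `INLAND`, purely because of alphabetical ordering. Under one-hot encoding, every pair of distinct categories is the same distance apart (the square root of 2), so no artificial ordering is implied.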

In [8]:
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot     # SciPy sparse matrix: stores locations of non-zero elements

<16512x5 sparse matrix of type '<class 'numpy.float64'>'
	with 16512 stored elements in Compressed Sparse Row format>
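
To see what "stores locations of non-zero elements" means in practice, here is a small sketch that builds a Compressed Sparse Row matrix from a made-up 3x3 one-hot array (the `dense` array is an assumption for the example) and inspects its internals.

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot rows: exactly one non-zero entry per row.
dense = np.array([[1., 0., 0.],
                  [0., 0., 1.],
                  [0., 1., 0.]])

csr = sparse.csr_matrix(dense)

stored = csr.nnz          # number of stored (non-zero) elements
cols = csr.indices        # column index of each stored element, row by row
row_ptr = csr.indptr      # where each row starts within `cols` / `csr.data`
```

Only the three non-zero values and their positions are kept, which is why the one-hot matrix above reports 16512 stored elements (one per row) instead of 16512 x 5.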

In [9]:
housing_cat_1hot.toarray()   # convert to a dense NumPy array

array([[1., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1.],
       ...,
       [0., 1., 0., 0., 0.],
       [1., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0.]])

In [10]:
cat_encoder.categories_    # instance variable: all categories

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
       dtype=object)]

There are more considerations in converting categorical attributes; see the chapters on ***representation learning*** for details.