<a rel="license" href="http://creativecommons.org/licenses/by/4.0/"><img alt="Creative Commons Licence" style="border-width:0" src="https://i.creativecommons.org/l/by/4.0/88x31.png" /></a><br /><span xmlns:dct="http://purl.org/dc/terms/" property="dct:title">COMP3611 - Building a Machine Learning Pipeline (part 2)</span> by <span xmlns:cc="http://creativecommons.org/ns#" property="cc:attributionName">Marc de Kamps and University of Leeds</span> is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/4.0/">Creative Commons Attribution 4.0 International License</a>.

## Building a machine learning pipeline (part 2)

### Objectives

We will clean the data. In particular, we will:
- Impute missing values
- Replace alphanumeric information by one-hot-encoded vectors to arrive at a purely numerical representation
- Add features to the dataset that are likely to boost prediction power
- Use Transformer and Pipeline objects to provide self-documenting implementations of these steps

Note that we have left the sample answers in place, or you wouldn't be able to run through the notebook.

### Caveat

There is one small error in this notebook that we will leave in place, but have removed in (Part 3). Once you have run (Part 3), see if you can discover it.
**Warning** Use Part 3 as a basis for further coding, not this notebook (part 2).

We start where we left off in the previous notebook:

In [17]:
import os
import numpy as np
import pandas as pd
import tarfile

from sklearn.model_selection import StratifiedShuffleSplit

local_path = 'datasets/housing'


def restore():
    housing_tgz=tarfile.open(os.path.join(local_path,'./housing.tgz'))
    housing_tgz.extractall(path=local_path)
    housing_tgz.close()

    csv_path=os.path.join(local_path,'./housing.csv')
    housing = pd.read_csv(csv_path)

    # create test training set with stratified sampling (see previous notebook)
    housing["income_category"]=np.ceil(housing["median_income"]/1.5)
    housing["income_category"] = housing["income_category"].where(housing["income_category"] < 5, 5.0)

    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2,random_state=42)
    
    for train_index, test_index in split.split(housing,housing["income_category"]):
        strat_train_set = housing.loc[train_index]
        strat_test_set = housing.loc[test_index]
    
    for set_ in (strat_train_set, strat_test_set):
        set_.drop(("income_category"),axis=1,inplace=True)
        

    # this is the mistake - no need to remove the labels from dataset
    housing.drop(("income_category"),axis=1,inplace=True)

   
    return housing, strat_train_set, strat_test_set

housing, strat_train_set, strat_test_set = restore()

## Missing data

In many real-world data sets there is missing data. Doctors forget to register the temperature on some days, sensors are sometimes faulty and produce no sensible output, etc. Pandas is a considerable step up from comma-separated files in that it has an explicit representation for a missing value: *NaN*.

In general, however, machine learning algorithms can't work with NaN. This means either removing the data where NaN occurs or replacing the missing values by a sensible default value, a process called *imputation*.

Here you have three choices:
- Drop districts where NaN occurs
- Drop the entire attribute (i.e. remove the entire column where NaN occurs)
- Impute

Many imputation strategies are simple: they replace the missing value by the median (or sometimes mean) of the attribute. This is potentially dangerous: if the pattern of missingness is not random, but has systematic causes it may be wrong to use this strategy. More sophisticated strategies, that we will not consider here, consist of using for example a random forest or decision tree to try and predict the missing values.
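Before choosing a strategy it helps to see how much data is missing and where. The following is a minimal sketch on a small made-up frame; the column names mirror the housing data, but the values are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "total_rooms":    [880.0, 7099.0, 1467.0, 1274.0, 1627.0],
    "total_bedrooms": [129.0, np.nan, 190.0, np.nan, 280.0],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Fraction of rows with at least one missing value
fraction_incomplete = toy.isna().any(axis=1).mean()
print(fraction_incomplete)

# Median imputation: the median is computed from the observed values only
median = toy["total_bedrooms"].median()
imputed = toy["total_bedrooms"].fillna(median)
print(imputed.tolist())
```

If a large fraction of rows is incomplete, or the missing values cluster in particular kinds of rows, that is a hint that the missingness may not be random and that simple imputation deserves a second look.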

Pandas offers quick ways for implementing simple imputation:

In [18]:
#housing.dropna(subset=["total_bedrooms"]) # drop those districts for which the total_bedrooms value is missing
#housing.drop("total_bedrooms",axis=1) # drop the entire column

# impute missing values by the median
median=housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)

0         129.0
1        1106.0
2         190.0
3         235.0
4         280.0
          ...  
20635     374.0
20636     150.0
20637     485.0
20638     409.0
20639     616.0
Name: total_bedrooms, Length: 20640, dtype: float64

*scikit-learn* itself also offers support for simple imputation strategies. And since this nicely dovetails with the use of pipelines, which is an important topic in this notebook, let's check it out.

In [19]:
from sklearn.impute import SimpleImputer

imputer=SimpleImputer(strategy="median")

It should be as simple as calling the fit method on the data frame. So:

            imputer.fit(housing)

You will find this will cause an exception to be thrown:

In [20]:
imputer.fit(housing)

ValueError: Cannot use median strategy with non-numeric data:
could not convert string to float: 'NEAR BAY'

**Exercise 1** Examine the data frame to see what causes this problem.

*ANSWER*: SimpleImputer calculates the median value for each column. It cannot do this for columns that are not completely numeric, such as *ocean_proximity*, hence an error is thrown.

**Exercise 2** The offending data itself does not need to be imputed. So it could be dropped, at least temporarily, from the dataframe. Research how you can create a dataframe that doesn't contain the offending data, then fit the imputer to it. Once you have done this, print the

                imputer.statistics_
                
member variable.

In [21]:
#! Sample answer
print(housing.head())
dropped = housing['ocean_proximity']
print(dropped)
housing_num =housing.drop("ocean_proximity",axis=1)
imputer.fit(housing_num)
print(imputer.statistics_)

   longitude  latitude  housing_median_age  total_rooms  total_bedrooms  \
0    -122.23     37.88                41.0        880.0           129.0   
1    -122.22     37.86                21.0       7099.0          1106.0   
2    -122.24     37.85                52.0       1467.0           190.0   
3    -122.25     37.85                52.0       1274.0           235.0   
4    -122.25     37.85                52.0       1627.0           280.0   

   population  households  median_income  median_house_value ocean_proximity  
0       322.0       126.0         8.3252            452600.0        NEAR BAY  
1      2401.0      1138.0         8.3014            358500.0        NEAR BAY  
2       496.0       177.0         7.2574            352100.0        NEAR BAY  
3       558.0       219.0         5.6431            341300.0        NEAR BAY  
4       565.0       259.0         3.8462            342200.0        NEAR BAY  
0        NEAR BAY
1        NEAR BAY
2        NEAR BAY
3        NEAR BAY
4  

In the previous question, you should have created a new dataframe that doesn't contain the 'ocean_proximity' attribute. Let's assume this reduced dataframe is called *housing_num* which now is guaranteed to contain numerical data only.

The fit method has not changed *housing_num* at all. It has only fed the data of the reduced dataframe to the imputer and allowed it to calculate the median value of each column. This is called *training* the imputer.
Now the imputer must be used to actually clean up the data of the *housing_num* frame. This is done by calling the imputer's *transform* method on *housing_num*. **Before proceeding, please ensure you have created the reduced dataframe and called it housing_num or the notebook will crash. Such a crash is not acceptable in your coursework submission.**
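The difference between *fit* and *transform* can be seen on a small array. This is a sketch with invented data; the name `toy_imputer` is only used here, to avoid clashing with the notebook's `imputer`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy array with one missing value in the second column (invented data)
data = np.array([[1.0, 10.0],
                 [2.0, np.nan],
                 [3.0, 30.0]])

toy_imputer = SimpleImputer(strategy="median")

toy_imputer.fit(data)              # learns the column medians, ignoring NaN
print(toy_imputer.statistics_)     # the learned medians, one per column
print(np.isnan(data).sum())        # the input still contains its NaN

cleaned = toy_imputer.transform(data)  # returns a new array with NaN replaced
print(cleaned)
```

Only the array returned by *transform* is free of NaN; the array passed to *fit* is left as it was.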

In [22]:
X=imputer.transform(housing_num)

**Exercise 3** Examine the form of the data represented by X.

In [23]:
#! Sample answer
print(X[0])
print(X.shape)

# So just one big numpy array, but without NaN!
# (20640, 9) means this 2D array has 20640 rows, one per district,
# and each row contains 9 values, i.e. one data value for each column

[-1.2223e+02  3.7880e+01  4.1000e+01  8.8000e+02  1.2900e+02  3.2200e+02
  1.2600e+02  8.3252e+00  4.5260e+05]
(20640, 9)


**Exercise 4** Convert X back into a dataframe with the original column names (minus *ocean_proximity*)

In [24]:
#! Sample answer
housing_tr = pd.DataFrame(X,columns=housing_num.columns)
housing_tr.head()

| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 |


### Text and Categorical Variables

In general machine learning algorithms need numerical values to work, and *scikit-learn* expects data in two-dimensional numpy arrays (like X in the previous example). It is therefore necessary to convert text, which often indicates a categorical variable (like '<1H OCEAN', 'INLAND', 'NEAR OCEAN' etc.), into numerical values. (An exception to this would be a project that tries to apply NLP to text fragments, but that is not the case here; the column *ocean_proximity* clearly contains categories.) It would be possible to convert them into numerical values as in '<1H OCEAN' = 0, 'INLAND' = 1, etc. There is a potential problem with this, as there is an implicit concept of nearness in this coding: 0 and 1 are closer together than 1 and 4, for example. If used in a clustering algorithm this nearness may be used without this being intended by the investigator. A one-hot encoding does not have this problem.

A one-hot coding would code 5 categories as follows:
(1,0,0,0,0)
(0,1,0,0,0)
   ...
(0,0,0,0,1)

There is no unintended concept of proximity in this coding and it is to be preferred for coding categorical variables numerically.
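The difference between the two codings can be checked numerically. The sketch below uses plain numpy and five abstract categories; it only illustrates the distance argument and is not part of the housing pipeline.

```python
import numpy as np

# Integer coding of five categories: 0, 1, 2, 3, 4
ordinal = np.arange(5, dtype=float)

# One-hot coding: row i of the identity matrix codes category i
one_hot = np.eye(5)

# Distance between categories under the integer coding
d_ord_01 = abs(ordinal[0] - ordinal[1])
d_ord_14 = abs(ordinal[1] - ordinal[4])

# Euclidean distance between the same categories under the one-hot coding
d_hot_01 = np.linalg.norm(one_hot[0] - one_hot[1])
d_hot_14 = np.linalg.norm(one_hot[1] - one_hot[4])

print(d_ord_01, d_ord_14)  # the integer coding makes 1 and 4 three times as far apart as 0 and 1
print(d_hot_01, d_hot_14)  # under one-hot coding every pair is equally far apart
```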

Again *scikit-learn* offers a custom-made class for transforming textual categorical variables into one-hot encoded ones:

In [25]:
from sklearn.preprocessing import OneHotEncoder
enc=OneHotEncoder()
# reshape(rows, columns) reshapes an array so that it has the given number of rows and columns.
# One of the two may be -1, in which case numpy calculates it from the size of the array.
# For a 3*4 array, reshape(6,-1) gives 6 rows and (3*4) / 6 = 2 columns.
# Here reshape(-1,1) produces a single column with as many rows as there are values.
enc.fit(housing["ocean_proximity"].values.reshape(-1,1))
print(enc.categories_)

[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]


*reshape* is necessary because a transformer (such as SimpleImputer or OneHotEncoder) expects a 2D array. **housing["ocean_proximity"]** is a *pandas.Series*, hence the use of *values* to obtain the underlying numpy array. The -1 is a neat little syntactic trick that avoids having to know or calculate the length of the array.
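The effect of *values* and *reshape(-1,1)* on the shape of the data is easy to inspect. A small sketch, with a made-up Series standing in for the *ocean_proximity* column:

```python
import numpy as np
import pandas as pd

# A small Series standing in for housing["ocean_proximity"] (values invented)
s = pd.Series(["INLAND", "NEAR BAY", "INLAND"])

flat = s.values                # 1D array of length 3
column = flat.reshape(-1, 1)   # 2D array: one column, row count inferred by numpy

print(flat.shape)
print(column.shape)
```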

In [26]:
enc.transform(housing["ocean_proximity"].values.reshape(-1,1)).toarray()

array([[0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 1., 0.],
       ...,
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

## Transformers

*SimpleImputer* and *OneHotEncoder* are so-called **transformers**. Whenever you start to realise that you need to do a lot of repetitive processing, you should either use a transformer or, in case *scikit-learn* doesn't have one that fits your processing, design one of your own.

You can design a class of your own that encapsulates your preprocessing code. All you need to do is provide functions: 

                fit()
                transform()
                fit_transform()
                
*fit_transform* simply calls *fit* first and then *transform*.
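As a sketch of this interface: if a class also inherits from *scikit-learn*'s `TransformerMixin`, it gets a *fit_transform* for free once *fit* and *transform* are defined. The class `ColumnRatioAdder` below is hypothetical and exists only for this illustration; it is not the transformer used later in this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ColumnRatioAdder(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: appends the ratio of two columns."""

    def __init__(self, numerator_ix=0, denominator_ix=1):
        self.numerator_ix = numerator_ix
        self.denominator_ix = denominator_ix

    def fit(self, X, y=None):
        # Nothing to learn from the data in this example
        return self

    def transform(self, X):
        ratio = X[:, self.numerator_ix] / X[:, self.denominator_ix]
        return np.c_[X, ratio]

X_toy = np.array([[6.0, 2.0],
                  [9.0, 3.0]])

adder = ColumnRatioAdder()
# fit_transform is inherited from TransformerMixin: it calls fit, then transform
result = adder.fit_transform(X_toy)
print(result)
```

Inheriting from `BaseEstimator` additionally provides `get_params` and `set_params`, which is what makes constructor arguments usable as hyperparameters later on.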

As an example, let's add some preprocessing that adds the number of rooms per household and the population per household, and optionally also the number of bedrooms per room. Earlier we have seen that the number of rooms per household correlates quite strongly with the median house value, and it may be a variable that we want to represent explicitly in the dataset, meaning that we have to add it by transforming the data. Optionally, we may want to add the number of bedrooms per room to the dataset as well. We will make the transformer configurable so that later in your analysis pipeline you can investigate both options, simply by setting a binary flag, which then becomes a hyperparameter.

In [27]:
from sklearn.base import BaseEstimator

class CombinedAttributesAdder(BaseEstimator):

    def __init__(self, do_add_bedrooms_per_room = False):
        
        # binary flag: if True, bedrooms per room is added as an extra column
        self.do_add_bedrooms_per_room = do_add_bedrooms_per_room
        
        # These are the column indices of the respective columns. OK for illustration purposes.
        # For more robust code you would want to extract these values from the DataFrame by name.
        self.rooms_ix      = 3
        self.bedrooms_ix   = 4
        self.population_ix = 5
        self.household_ix  = 6
        
    def fit(self, X, y=None):
        # We don't transform the target values here
        return self
    
    def transform(self, X, y=None):
        rooms_per_household = X[:,self.rooms_ix]/X[:,self.household_ix]
        population_per_household = X[:, self.population_ix]/ X[:,self.rooms_ix]
        if self.do_add_bedrooms_per_room:
            bedrooms_per_room = X[:,self.bedrooms_ix]/X[:,self.rooms_ix]
            return np.c_[X,rooms_per_household, population_per_household,bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
        
attr_adder=CombinedAttributesAdder(do_add_bedrooms_per_room=True)
housing_extra_attribs=attr_adder.transform(housing.values)
print(housing_extra_attribs)


[[-122.23 37.88 41.0 ... 6.984126984126984 0.3659090909090909
  0.14659090909090908]
 [-122.22 37.86 21.0 ... 6.238137082601054 0.3382166502324271
  0.15579659106916466]
 [-122.24 37.85 52.0 ... 8.288135593220339 0.338104976141786
  0.12951601908657123]
 ...
 [-121.22 39.43 17.0 ... 5.20554272517321 0.44676131322094054
  0.21517302573203195]
 [-121.32 39.43 18.0 ... 5.329512893982808 0.39838709677419354
  0.21989247311827956]
 [-121.24 39.37 16.0 ... 5.254716981132075 0.4980251346499102
  0.22118491921005387]]


**My questions:**
1. What is a hyperparameter?
1a. What is the difference between a hyperparameter, a feature and a parameter?
A feature is a column in the dataset. Parameters are the values a model learns from the data, for example the weights applied to the features in a regression model. A hyperparameter is a setting of the modelling algorithm itself that is chosen before training, e.g. the number of trees in a random forest.

2. What is the point of the fit function here?
In *CombinedAttributesAdder*, fit has nothing to learn and simply returns self; it is there so that the class offers the same interface as other transformers. For the imputer, fit calculates a value for every column, the median in this case.

3. What does the notation with the colons etc mean when used in the transform method on X?
It is numpy slicing: X[:, i] selects every row of column i, i.e. the whole column rather than a single element.

4. What is np.c_ doing?
It stacks vectors as the columns of a matrix, as shown below.

The extra attributes have now been added to the data, which are spit out as a 2D numpy array without further structure.  Ultimately, this is what you need if you want to do numerical processing and exactly what *scikit-learn* algorithms expect.
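Because the array carries no column names, a transformer working on it has to know the column positions. One way to avoid hard-coding them is to look the positions up by name while the data is still a DataFrame. The sketch below does this on a one-row toy frame whose column order is assumed to match the first columns of *housing_num*; the values are invented.

```python
import pandas as pd

# Toy frame; column order assumed to match housing_num (values invented)
toy = pd.DataFrame({
    "longitude": [-122.23], "latitude": [37.88], "housing_median_age": [41.0],
    "total_rooms": [880.0], "total_bedrooms": [129.0],
    "population": [322.0], "households": [126.0],
})

# Look up the integer position of each column by its name
rooms_ix = toy.columns.get_loc("total_rooms")
bedrooms_ix = toy.columns.get_loc("total_bedrooms")
population_ix = toy.columns.get_loc("population")
household_ix = toy.columns.get_loc("households")

print(rooms_ix, bedrooms_ix, population_ix, household_ix)
```

The indices found this way could then be passed to the transformer's constructor instead of being fixed inside it.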

The *np.c_* object is a numpy shorthand for grouping vectors as columns into a matrix. There are many ways of achieving the same effect; some people prefer *hstack* or *vstack* constructions.

In [28]:
a=np.array([1,2,3]).T
b=np.array([4,5,6]).T
print(np.c_[a,b])

print(np.vstack([a,b]).T)

[[1 4]
 [2 5]
 [3 6]]
[[1 4]
 [2 5]
 [3 6]]


### Feature Scaling

Most machine learning algorithms perform better if the numerical values of the attributes are of comparable magnitude. The total number of rooms varies between 0 and 39320, whereas income is between 0 and 15. Typically, the attributes are rescaled, but not the target variable (here the median house value). Decision-tree-based methods are an exception: there the data is taken as is.

#### Min-max scaling (normalisation)

The data are shifted and rescaled so that their minimum value corresponds to 0 and their maximum value corresponds to 1. *scikit-learn* provides the **MinMaxScaler** to achieve this. It can be reconfigured to other target ranges than $[0,1]$ where desired.
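A minimal sketch of min-max scaling on invented values, once by hand and once with *MinMaxScaler*, including a different target range through its `feature_range` parameter:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One feature, five samples (invented values)
x = np.array([[2.0], [4.0], [6.0], [8.0], [10.0]])

# By hand: (x - min) / (max - min), computed per column
by_hand = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# The same with scikit-learn
scaled = MinMaxScaler().fit_transform(x)

# A different target range, here [-1, 1]
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(x)

print(by_hand.ravel())
print(scaled.ravel())
print(scaled_pm1.ravel())
```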

#### Standardisation
The mean is subtracted from all data points, and the result is then divided by the standard deviation so that a zero-mean, unit-variance distribution results.

There are pros and cons to each of these methods. Some machine learning algorithms (neural networks) expect input values to be within a certain range, meaning that you have to use min-max scaling. But a single outlier can compress the other data points to heap up around 0, something that doesn't happen to the same extent in standardisation.

*scikit-learn* provides the *StandardScaler* transformer to implement this method.
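The effect of an outlier on the two scaling methods can be compared directly. The sketch below uses nine ordinary values and one large outlier, all invented, and measures how much room the nine ordinary values are left with after scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Nine ordinary values and one large outlier (invented data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0]).reshape(-1, 1)

min_max = MinMaxScaler().fit_transform(x).ravel()
standard = StandardScaler().fit_transform(x).ravel()

# Spread of the nine ordinary values after each scaling
spread_min_max = min_max[:9].max() - min_max[:9].min()
spread_standard = standard[:9].max() - standard[:9].min()
print(spread_min_max, spread_standard)

# Standardised data have zero mean and unit standard deviation,
# but are not confined to a fixed range
print(standard.mean(), standard.std(), standard.max())
```

In this toy case the ordinary values are squeezed together by both methods, but less so by standardisation than by min-max scaling.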



## Pipelines

Pipelines group transformers sequentially and provide a single access point for running the entire preprocessing chain. The transformations we have discussed so far crop up in almost every machine learning analysis. Each transformation step is usually simple, and it may be that initially you have cobbled together an ad hoc preprocessing step to get going with the analysis. Such code tends to obscure your programmes. If you find yourself bored because you're constantly writing relatively simple preprocessing steps, you're not using pipelines properly. The individual Transformers can be hidden in python modules that you can import. The Pipelines provide a very short definition of the preprocessing steps and are almost self-documenting. An example:

In [29]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer',SimpleImputer(strategy="median")),
    ('attribs_adder',CombinedAttributesAdder()),
    ('std_scaler',StandardScaler())
])

housing_num_tr = num_pipeline.fit_transform(housing_num)

This piece of code is very readable. Moreover, since imputing and scaling occur so often, similar pipelines can often be constructed for other analyses, and the individual components can be re-used with a minimum of new programming effort.

The names are practical for documentation, but also allow named access to the elements of the pipeline in case they're needed downstream in the analysis:

In [30]:
num_pipeline['imputer']

When you call the pipeline's *fit* method, it calls the *fit_transform* method of each element of the pipeline in order, passing the output of one step as input to the next, and finally calls *fit* on the last element. All but the last element of the pipeline must be *transformers*; the last can be an *estimator* (an *estimator* has a *fit* method, but not necessarily a *transform* or *fit_transform* method).
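A sketch of a pipeline whose last element is an estimator rather than a transformer. The data are invented, and `LinearRegression` is used purely as an example of a final estimator; it is an assumption of this sketch, not a model choice made in this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy regression data (invented): y = 2 * x, with one missing input
X_toy = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0]])
y_toy = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # transformer
    ("std_scaler", StandardScaler()),               # transformer
    ("regressor", LinearRegression()),              # final estimator
])

# fit: fit_transform on the two transformers in order, then fit on the regressor
model.fit(X_toy, y_toy)

# predict: transform with the fitted transformers, then predict with the regressor
prediction = model.predict(np.array([[3.0]]))
print(prediction)

# Steps remain accessible by name
imputer_medians = model.named_steps["imputer"].statistics_
print(imputer_medians)
```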

**QUESTION:** What is an estimator and what actually does fit do?
Fit learns values from the data: for example, fitting an imputer with the median strategy calculates the median of each column, and transform then uses these values to replace the NaN entries.
An estimator is any object that learns from data through fit. It need not transform the data: a transformer is an estimator that also offers transform, whereas a model such as a regressor offers predict instead.

### Combining Separate Pipelines for Numerical and Categorical Values

We have seen that we need to handle numerical and categorical values differently. This effectively boils down to the creation of two pipelines that each run on mutually exclusive columns. We would like to run these pipelines in parallel and combine their joint output in one 2D numpy array which contains the conversion of all attributes to sensible numerical values.

To this end we will write a custom transformer that selects only certain columns of the dataset and drops the other ones. We will use that transformer, called **DataFrameSelector**, to create two parallel pipelines and then combine the two resulting datasets.

In [31]:
class DataFrameSelector(BaseEstimator):
    
    def __init__(self, attribute_names):
        self.attribute_names= attribute_names
        
    def fit(self,X, y = None):
        return self
    
    def transform(self, X):
        return X[self.attribute_names].values


Note that although there appears to be no pandas code, this works because X is a pandas DataFrame, which can take a list of column names and select the appropriate columns!
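The selection that the transformer relies on can be tried in isolation. A sketch on a two-row toy frame (column names borrowed from the housing data, values invented):

```python
import pandas as pd

toy = pd.DataFrame({
    "total_rooms": [880.0, 7099.0],
    "households": [126.0, 1138.0],
    "ocean_proximity": ["NEAR BAY", "INLAND"],
})

# Indexing with a list of names returns a DataFrame with just those columns;
# .values turns it into a 2D numpy array
numeric = toy[["total_rooms", "households"]].values
categorical = toy[["ocean_proximity"]].values

print(numeric.shape, numeric.dtype)
print(categorical.shape)
```

Note that selecting with a list, even a list with a single name, yields a 2D array, so no *reshape* is needed afterwards.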

The two pipelines can now be created as follows **from the training set**.

In [32]:
num_attribs=list(housing) # list(DataFrame) gives the column names; ocean_proximity is removed on the next line
num_attribs.remove("ocean_proximity")
cat_attribs=["ocean_proximity"]

num_pipeline= Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer',SimpleImputer(strategy="median")),
    ('attribs_adder',CombinedAttributesAdder()),
    ('std_scaler',StandardScaler())
])

cat_pipeline = Pipeline([
    ('selector',DataFrameSelector(cat_attribs)),
    ('one hot',OneHotEncoder())
])

These pipelines can be run independently ('in parallel'). Their results now need to be combined. *scikit-learn* offers the **FeatureUnion** for this purpose.

In [33]:
from sklearn.pipeline import FeatureUnion

full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline",num_pipeline),
    ("cat_pipeline",cat_pipeline)
])

The whole pipeline can now be run in one go on the original *housing* DataFrame. 

In [35]:
housing.head()
#housing=strat_train_set.drop("median_house_value",axis=1)
#housing_labels=strat_train_set["median_house_value"]
housing_prepared = full_pipeline.fit_transform(housing)

print(housing_prepared.shape)
print("\n\n\n")
print(housing_prepared)

(20640, 16)




  (0, 0)	-1.3278352216308462
  (0, 1)	1.0525482830366848
  (0, 2)	0.9821426581785077
  (0, 3)	-0.8048190966246049
  (0, 4)	-0.9724764790070289
  (0, 5)	-0.9744285971768408
  (0, 6)	-0.9770328537634586
  (0, 7)	2.3447657583017163
  (0, 8)	2.129631481668038
  (0, 9)	0.6285594533305325
  (0, 10)	-0.08762704370782531
  (0, 14)	1.0
  (1, 0)	-1.3228439144608317
  (1, 1)	1.043184551645693
  (1, 2)	-0.6070189133741593
  (1, 3)	2.045890099248613
  (1, 4)	1.357143430032187
  (1, 5)	0.8614388682720688
  (1, 6)	1.6699610275640624
  (1, 7)	2.3322379635373314
  (1, 8)	1.314156136924335
  (1, 9)	0.32704135754480507
  (1, 10)	-0.0971931712397134
  (1, 14)	1.0
  (2, 0)	-1.3328265288008536
  :	:
  (20637, 12)	1.0
  (20638, 0)	-0.8736262691597478
  (20638, 1)	1.7782374658384243
  (20638, 2)	-0.8453931491070593
  (20638, 3)	-0.35559976683593253
  (20638, 4)	-0.3048269656692802
  (20638, 5)	-0.6044293340584699
  (20638, 6)	-0.3937525814946471
  (20638, 7)	-1.0545829218829477
  (20638, 8)	-1

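**QUESTION:** What is the output showing in the final cell?
The result is a sparse matrix, so only the non-zero entries are printed. Each tuple gives the (row, column) position of an entry and the number to its right is its value, e.g. (0, 0) -1.3278... means that row 0, column 0 holds the value -1.3278...; positions that are not listed hold 0.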
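The printed form comes from SciPy's sparse matrices, which store only the non-zero entries. A small sketch with an invented matrix shows the relation between the sparse and the ordinary dense form:

```python
import numpy as np
from scipy import sparse

# A small dense matrix with mostly zeros (invented values)
dense = np.array([[1.5, 0.0, 0.0],
                  [0.0, 0.0, 2.0]])

sp = sparse.csr_matrix(dense)

print(sp.shape)   # same shape as the dense matrix
print(sp.nnz)     # number of stored (non-zero) values
print(sp)         # lists only the stored entries with their (row, column) position

# Convert back to an ordinary numpy array
back = sp.toarray()
print(back)
```

Calling `toarray()` on a large sparse result, such as the prepared housing data, gives the familiar 2D array at the cost of also storing all the zeros.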