# Learning Objectives

### In this module, we will learn how to:
* Extract “features” from formatted datasets, and store them in appropriate data structures for use by Machine Learning libraries.
* Meaningfully extract features of different forms (e.g. continuous variables, strings, periodic data, categorical variables)
* Deal with missing values in a dataset

## Features from Categorical Data

* Incorporate binary and categorical features into a regressor
* Compare the benefits of various feature representation strategies

#### A motivating example

How would we build regression models that answer questions like:

* How does height vary with **gender**?
  - The gender values might look more like ```{"male", "female", "other", "not specified"}```
* How do preferences vary with **geographical region**?
* How does product demand change during different **seasons**?


Let's start with a binary problem where we just have **```{"male", "female"}```**


#### What should our model equation look like?

<img src="Datasets/Binary_model_equation.jpg" alt="Drawing" style="width: 300px;"/>

#### How to interpret the results:

<img src="Datasets/Binary_model_illus.jpg" alt="Drawing" style="width: 800px;"/>
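To make the interpretation concrete, here is a minimal sketch on made-up numbers (the heights and the `is_female` column are hypothetical, not taken from any dataset used here). With a single binary feature, ordinary least squares sets $\theta_0$ to the mean of the group coded 0 and $\theta_1$ to the difference between the two group means.

```python
import numpy as np

# Hypothetical heights in cm; is_female is 1 for "female" and 0 for "male".
height = np.array([178.0, 182.0, 174.0, 165.0, 161.0, 169.0])
is_female = np.array([0, 0, 0, 1, 1, 1])

# Feature matrix with a constant column for the intercept theta_0.
X = np.column_stack([np.ones(len(height)), is_female])

# Ordinary least squares: theta minimizes ||X theta - height||^2.
theta, *_ = np.linalg.lstsq(X, height, rcond=None)

male_mean = height[is_female == 0].mean()
female_mean = height[is_female == 1].mean()
print("theta_0:", theta[0], "male mean:", male_mean)
print("theta_1:", theta[1], "difference of means:", female_mean - male_mean)
```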



### Example - Air quality prediction

We'll look at the problem of predicting **air quality**, using an index called pm2.5, measured in Beijing.

* This is a "simpler" dataset than some of the others we've been working with, as the relevant features are all real-valued.


#### What are we trying to predict?

$$ \mbox{pm2.5} = \theta_0 + \theta_1 \times \mbox{temp} + \theta_2 \times \mathbf{1}[\mbox{year is 2010 or 2011}] $$

In [1]:
import pandas as pd
df = pd.read_csv("Datasets/Beijing_PM25_air_data.csv", sep=',', header=0)
df.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


In [2]:
import numpy as np

# Binary feature: 1 if the year is before 2012 (i.e., 2010 or 2011), else 0
df['year before 2012'] = np.where(df['year'] < 2012, 1, 0)

In [6]:
# Drop rows where the target (pm2.5) is missing
dataset = df.dropna(subset=['pm2.5'])
y = dataset['pm2.5']
X = dataset[['TEMP', 'year before 2012']]

print(len(X), len(y))

41757 41757


In [7]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X,y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 105.40135451195876
Coefficients: 
 [-0.67852431  4.21278971]
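One way to read these numbers is to plug them back into the model equation. The sketch below hard-codes the intercept and coefficients printed above (rounded) in a hypothetical helper, `predict_pm25`. It only illustrates that the binary feature shifts the prediction by $\theta_2$, whatever the temperature.

```python
# Values copied (rounded) from the fitted model's printed output.
THETA_0 = 105.4014   # intercept
THETA_1 = -0.6785    # coefficient on TEMP
THETA_2 = 4.2128     # coefficient on 'year before 2012'

def predict_pm25(temp, year):
    """Hypothetical helper: evaluate the fitted model by hand."""
    year_before_2012 = 1 if year < 2012 else 0
    return THETA_0 + THETA_1 * temp + THETA_2 * year_before_2012

early = predict_pm25(temp=10.0, year=2010)
late = predict_pm25(temp=10.0, year=2013)

# Same temperature, different year group: the gap is exactly THETA_2.
print(early, late, early - late)
```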



#### What if we have ```{"male", "female", "other", "not specified"}```?

<img src="Datasets/Multi_category.jpg" alt="Drawing" style="width: 300px;"/>

### Code: Convert categorical variable into dummy/indicator variables

```df = pd.get_dummies(df, columns=['type'])```
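As a quick illustration of what `pd.get_dummies` produces, here is a small sketch on a made-up frame (the `type` values and the `price` column are hypothetical). The `dtype=int` argument is passed only so that the indicators print as 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Hypothetical toy data; 'type' is the categorical column to encode.
toy = pd.DataFrame({
    "price": [3.5, 2.0, 4.1, 2.2],
    "type": ["red", "white", "red", "rose"],
})

# Each distinct value of 'type' becomes its own 0/1 indicator column,
# named '<column>_<value>'; the original 'type' column is removed.
encoded = pd.get_dummies(toy, columns=["type"], dtype=int)
print(encoded)
```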

In [11]:
# Remove the previous dummy for year before 2012
dataset = dataset.drop(columns=['year before 2012'])

In [None]:
# Replace 'year' with one indicator column per year (plus 'year_nan' for missing years)
dataset = pd.get_dummies(dataset, columns=['year'], dummy_na=True)
dataset.head()

Unnamed: 0,No,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir,year_2010.0,year_2011.0,year_2012.0,year_2013.0,year_2014.0,year_nan
24,25,1,2,0,129.0,-16,-4.0,1020.0,SE,1.79,0,0,1,0,0,0,0,0
25,26,1,2,1,148.0,-15,-4.0,1020.0,SE,2.68,0,0,1,0,0,0,0,0
26,27,1,2,2,159.0,-11,-5.0,1021.0,SE,3.57,0,0,1,0,0,0,0,0
27,28,1,2,3,181.0,-7,-5.0,1022.0,SE,5.36,1,0,1,0,0,0,0,0
28,29,1,2,4,138.0,-7,-5.0,1022.0,SE,6.25,2,0,1,0,0,0,0,0

In [20]:
# Columns 12-16 are the year_2010.0 ... year_2014.0 indicators (year_nan is left out)
X = pd.concat([dataset[['TEMP']], dataset.iloc[:, 12:17]], axis = 1)
X.head()

Unnamed: 0,TEMP,year_2010.0,year_2011.0,year_2012.0,year_2013.0,year_2014.0
24,-4.0,1,0,0,0,0
25,-4.0,1,0,0,0,0
26,-5.0,1,0,0,0,0
27,-5.0,1,0,0,0,0
28,-5.0,1,0,0,0,0


In [21]:
y = dataset['pm2.5']

print(len(X), len(y))

41757 41757


In [22]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X,y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 107.0559131654772
Coefficients: 
 [-0.6809868   4.69436887  0.4650212  -8.26332891  3.08954049  0.01439836]
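The fit above includes an intercept together with an indicator for every year from 2010 to 2014. Assuming every row belongs to exactly one of those years, the indicators sum to 1 in each row, which duplicates the intercept column. The sketch below uses a tiny made-up design matrix (not the dataset itself) to show that redundancy numerically.

```python
import numpy as np

# Hypothetical category labels for six rows (three categories: 0, 1, 2).
labels = np.array([0, 1, 2, 0, 1, 2])

# One indicator column per category (full one-hot encoding).
one_hot = np.eye(3)[labels]
intercept = np.ones((len(labels), 1))

full = np.hstack([intercept, one_hot])            # intercept + 3 indicators
reduced = np.hstack([intercept, one_hot[:, 1:]])  # intercept + 2 indicators

# The full design has 4 columns but only rank 3: one column is redundant.
print(full.shape[1], np.linalg.matrix_rank(full))
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

With a redundant column the individual coefficients are not uniquely determined, although the predictions are unaffected. Dropping one indicator makes each remaining coefficient an offset relative to the dropped category.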


## Concept: One-hot encodings

<img src="Datasets/One_hot_code.jpg" alt="Drawing" style="width: 400px;"/>

* Note that to capture 4 possible categories, we only need three dimensions (a dimension for "male" would be redundant)
* This approach can be used to capture a variety of categorical feature types, as well as objects that belong to multiple categories.
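A minimal sketch of this encoding in plain Python follows. The helper names `one_hot` and `multi_hot` are hypothetical, and "male" is treated as the reference category that gets no dimension of its own, as in the note above. The genre labels in the last line are made up for illustration.

```python
CATEGORIES = ["male", "female", "other", "not specified"]

def one_hot(value, categories=CATEGORIES):
    """Encode one category using len(categories) - 1 dimensions.

    The first category is the reference level and maps to all zeros.
    """
    return [1 if value == c else 0 for c in categories[1:]]

def multi_hot(values, categories):
    """Encode an object that can belong to several categories at once."""
    return [1 if c in values else 0 for c in categories]

print(one_hot("male"))    # [0, 0, 0]
print(one_hot("other"))   # [0, 1, 0]
print(multi_hot({"comedy", "drama"}, ["action", "comedy", "drama"]))  # [0, 1, 1]
```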

## Features from Temporal Data

* Investigate different strategies for extracting features from temporal (or seasonal) data
* Extend the concept of one-hot-encodings to represent temporal information


#### A motivating example

How would we build regression models that answer questions like:

* How do sales (or preferences) vary over time?
* What are the **long term** trends of sales?
* What are the **short term** trends (e.g. day of the week, season, etc.)?


Let's start with a simple feature representation, e.g., map the month name to a month number

$$
\mbox{rating} = \theta_0 + \theta_1 \times \mbox{month}
$$

#### The model we learn might look something like:

<img src="Datasets/Rating_month.jpg" alt="Drawing" style="width: 500px;"/>

<img src="Datasets/Rating_month_2.jpg" alt="Drawing" style="width: 700px;"/>

* This representation implies that the model would "wrap around" on December 31 to its January 1st value
* This type of "sawtooth" pattern probably isn't very realistic

### Fitting piecewise functions

<img src="Datasets/Piecewise_month.jpg" alt="Drawing" style="width: 500px;"/>

* Note that we don't need a feature for January

Or, equivalently, we'd have features such as:

<img src="Datasets/Piecewise_month_feature.jpg" alt="Drawing" style="width: 450px;"/>
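A sketch of how such features could be built and fit, assuming months are numbered 1 to 12 and using made-up ratings (the helper `month_features` and the data are hypothetical). January is the reference month, so it is represented by the constant term alone.

```python
import numpy as np

def month_features(month):
    """Return [1, is_Feb, is_Mar, ..., is_Dec] for a month number in 1..12."""
    features = [1.0] + [0.0] * 11
    if month > 1:
        features[month - 1] = 1.0
    return features

# Hypothetical data: two ratings per month.
months = np.repeat(np.arange(1, 13), 2)
ratings = 3.0 + 0.1 * months + np.tile([-0.2, 0.2], 12)

X = np.array([month_features(m) for m in months])
theta, *_ = np.linalg.lstsq(X, ratings, rcond=None)

# theta[0] is the mean January rating; theta[k] is the offset of
# month k + 1 relative to January.
print(np.round(theta, 3))
```

Because each month gets its own offset, the fitted function is a step function over the year, which avoids the sawtooth shape of the single `month` feature.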



## Summary of concepts

* Motivated the use of piecewise functions to model temporal data
* Described how one-hot encodings can be used to model piecewise functions

### On your own...

* Think about what piecewise functions you might use to model demand on Amazon
  * Is the day of week important?
  * Or the day of the month?
  
* How would you incorporate significant holidays (which may influence demand) into this model?

## Feature Transformations

* Demonstrate the use of **transformations** to incorporate non-linear functions into linear models



#### A motivating example

We've previously seen simple examples of regression models, such as ```weight``` vs. ```height```:

* A linear relationship is probably *okay* for modeling this data
* But it certainly makes some assumptions that aren't totally justified
* How can we fit more suitable, or more general, functions?

### Fitting complex functions

Let's start with a polynomial function (e.g. a cubic function):

$$
\mbox{weight} = \theta_0 + \theta_1 \times \mbox{height} + \theta_2 \times \mbox{height}^2 + \theta_3 \times \mbox{height}^3
$$

* Note that this is perfectly straightforward: the model still takes the form

    $$
    \mbox{weight} = \theta \cdot x
    $$

* We just need to use the feature vector

    $$
    x = [1, \mbox{height}, \mbox{height}^2, \mbox{height}^3]
    $$

**NOTE** that we can use the same approach to fit arbitrary functions of the features! E.g.,

$$
\mbox{weight} = \theta_0 + \theta_1 \times \mbox{height} + \theta_2 \times \mbox{height}^2 + \theta_3\exp{(\mbox{height})} + \theta_4\sin{(\mbox{height})}
$$

* We can perform arbitrary combinations of the **features** and the model will still be linear in the **parameters** (theta):

    $$
    \mbox{weight} = \theta \cdot x
    $$
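A minimal sketch of feature transformation on synthetic data (the values and the helper `features` are hypothetical): the inputs are transformed first, and an ordinary linear least-squares solve then recovers the coefficients of a function that is nonlinear in the input.

```python
import numpy as np

def features(x):
    """Map a scalar input to the feature vector [1, x, x^2, sin(x)]."""
    return [1.0, x, x ** 2, np.sin(x)]

# Synthetic inputs and a target that is nonlinear in x but linear in theta.
x = np.linspace(-3.0, 3.0, 25)
true_theta = np.array([2.0, -1.0, 0.5, 3.0])
X = np.array([features(v) for v in x])
y = X @ true_theta

# The solver never sees the transformations; it only sees the columns of X.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(theta, 6))
```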

In [23]:
import pandas as pd
df = pd.read_csv("Datasets/Beijing_PM25_air_data.csv", sep=',', header=0)
df.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0


In [35]:
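# Sign-preserving square: x^2 for x >= 0 and -(x^2) for x < 0.
# Note that this is not a plain square, despite the column name.
df['TEMP^2'] = df['TEMP'].apply(lambda x: np.square(x) if x >= 0 else (-1)*np.square(x))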
df[df['TEMP'] < -5].head()


Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir,TEMP^2
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0,-121.0
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0,-144.0
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0,-121.0
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0,-196.0
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0,-144.0
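Note that the values in the `TEMP^2` column are negative for negative temperatures, so the cell computes a sign-preserving square rather than a plain square. Below is a vectorized sketch of the same computation on a few made-up temperatures, which avoids the row-by-row `apply`:

```python
import numpy as np
import pandas as pd

# Hypothetical temperatures, not taken from the dataset.
temp = pd.Series([-11.0, -4.0, 0.0, 3.0])

# Row-by-row version, as in the cell above.
by_apply = temp.apply(lambda x: np.square(x) if x >= 0 else -np.square(x))

# Vectorized equivalent: sign(x) * x^2 (equivalently x * |x|).
vectorized = np.sign(temp) * temp ** 2

print(vectorized.tolist())   # [-121.0, -16.0, 0.0, 9.0]
print(temp.pow(2).tolist())  # plain square: [121.0, 16.0, 0.0, 9.0]
```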


In [38]:
# NumPy functions apply element-wise to a whole column
df['exp(TEMP)'] = np.exp(df['TEMP'])
df.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir,TEMP^2,TEMP^3,exp(TEMP)
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0,-121.0,1.67017e-05,1.67017e-05
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0,-144.0,6.144212e-06,6.144212e-06
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0,-121.0,1.67017e-05,1.67017e-05
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0,-196.0,8.315287e-07,8.315287e-07
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0,-144.0,6.144212e-06,6.144212e-06


In [39]:
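# 'TEMP^3' appears in the output above but is not created in any cell shown here;
# errors='ignore' lets this cell run even if the column is absent
df = df.drop(columns=['TEMP^3'], errors='ignore')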

In [40]:
# NumPy functions apply element-wise to a whole column
df['sin(TEMP)'] = np.sin(df['TEMP'])
df.head()

Unnamed: 0,No,year,month,day,hour,pm2.5,DEWP,TEMP,PRES,cbwd,Iws,Is,Ir,TEMP^2,exp(TEMP),sin(TEMP)
0,1,2010,1,1,0,,-21,-11.0,1021.0,NW,1.79,0,0,-121.0,1.67017e-05,0.99999
1,2,2010,1,1,1,,-21,-12.0,1020.0,NW,4.92,0,0,-144.0,6.144212e-06,0.536573
2,3,2010,1,1,2,,-21,-11.0,1019.0,NW,6.71,0,0,-121.0,1.67017e-05,0.99999
3,4,2010,1,1,3,,-21,-14.0,1019.0,NW,9.84,0,0,-196.0,8.315287e-07,-0.990607
4,5,2010,1,1,4,,-20,-12.0,1018.0,NW,12.97,0,0,-144.0,6.144212e-06,0.536573


In [50]:
# Drop rows with a missing target, then select the features by name
df = df.dropna(subset=['pm2.5'])
X = df[['TEMP', 'TEMP^2', 'exp(TEMP)', 'sin(TEMP)']]
y = df['pm2.5']

print(len(X), len(y))

41757 41757


In [51]:
X.head()

Unnamed: 0,TEMP,TEMP^2,exp(TEMP),sin(TEMP)
24,-4.0,-16.0,0.018316,0.756802
25,-4.0,-16.0,0.018316,0.756802
26,-5.0,-25.0,0.006738,0.958924
27,-5.0,-25.0,0.006738,0.958924
28,-5.0,-25.0,0.006738,0.958924


In [52]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X,y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 106.83226683826837
Coefficients: 
 [-4.04531348e-01 -1.10624873e-02 -7.95767209e-17 -2.85229509e-04]


**HOWEVER**, the same approach would **not** work if we wanted to transform the parameters:

$$
\mbox{weight} = \theta_0 + \theta_1 \times \mbox{height} + {\theta_2}^2 \times \mbox{height} + \sigma{(\theta_3)}\times\mbox{height}
$$

* The **linear** models we've seen so far do not support these types of transformations (i.e., they need to be linear in their parameters)

* There *are* alternative models that support nonlinear transformations of parameters, e.g., neural networks

## Summary of concepts

* Showed how to apply arbitrary transformations to features in a linear model
* Further explained the restrictions and assumptions of **linear models**

### On your own...

* Extend our previous code (on pm2.5 levels vs. air temperature) to handle simple polynomial functions.
