## Features from Categorical Data

* Incorporate binary and categorical features into a regressor
* Compare the benefits of various feature representation strategies

### Example 1 - Height vs. Gender

How would we build regression models that answer questions such as:

* How does height vary with **gender**?
  - In practice, the gender values might look like ```{"male", "female", "other", "not specified"}```
* How do preferences vary with **geographical region**?
* How does product demand change during different **seasons**?

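For a feature with more than two values, such as the gender values listed above, one common option is one-hot encoding: one 0/1 indicator column per category. Below is a minimal sketch, assuming a hypothetical toy DataFrame (the `toy` data and its column name are made up for illustration and are not part of the datasets used here).

```python
import pandas as pd

# Hypothetical toy data containing the four gender values mentioned above.
toy = pd.DataFrame({
    "gender": ["male", "female", "other", "not specified", "female"],
})

# One indicator column per category; dtype=int yields 0/1 instead of booleans.
one_hot = pd.get_dummies(toy, columns=["gender"], dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1, so the encoding imposes no ordering on the categories.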

Let's **start** with a binary problem where the only values are **```{"male", "female"}```**:
<img src="Binary_model_illus.JPG" alt="Drawing" style="width: 700px;"/>

#### What are we trying to predict?

$$ \mbox{Height} = \theta_0 + \theta_1 \times \textbf{if gender is female} $$

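To build intuition for this equation: with a single 0/1 indicator as the only feature, the least-squares fit should give $\theta_0$ equal to the mean height of the "0" group and $\theta_1$ equal to the difference between the two group means. The sketch below checks this on hypothetical toy heights (made-up numbers, not taken from `weight_height.csv`).

```python
import numpy as np

# Hypothetical toy heights; not taken from weight_height.csv.
heights = np.array([70.0, 68.0, 72.0, 64.0, 66.0])
is_female = np.array([0, 0, 0, 1, 1])

# Design matrix: a column of ones (for theta_0) and the indicator (for theta_1).
A = np.column_stack([np.ones_like(heights), is_female])
theta, *_ = np.linalg.lstsq(A, heights, rcond=None)

male_mean = heights[is_female == 0].mean()
female_mean = heights[is_female == 1].mean()

print(theta)                               # fitted [theta_0, theta_1]
print(male_mean, female_mean - male_mean)  # group mean and difference of means
```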
In [2]:
import pandas as pd
df = pd.read_csv("weight_height.csv", sep=',', header=0)
df.head()

| | Gender | Height | Weight |
|---|---|---|---|
| 0 | Male | 73.847017 | 241.893563 |
| 1 | Male | 68.781904 | 162.310473 |
| 2 | Male | 74.110105 | 212.740856 |
| 3 | Male | 71.730978 | 220.04247 |
| 4 | Male | 69.881796 | 206.349801 |


**Add a new column that indicates whether each person is female**

E.g., if `Gender == "Female"`, the column value is $1$; otherwise it is $0$.

In [4]:
import numpy as np

df['Is Female'] = np.where(df['Gender'] == "Female", 1, 0)

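**Code: Extracting X and Y**

Note that the feature matrix below also includes `Weight`, so the model being fitted is

$$ \mbox{Height} = \theta_0 + \theta_1 \times \mbox{Weight} + \theta_2 \times \textbf{if gender is female} $$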

In [10]:
# Features: weight plus the binary gender indicator; target: height
X = df[['Weight', 'Is Female']]
Y = df['Height']

print(len(X), len(Y))

10000 10000


**Code: Linear regression with a categorical feature**

In [11]:
from sklearn import linear_model

# Ordinary least squares; coef_ follows the column order of X (Weight, Is Female)
regr = linear_model.LinearRegression()
regr.fit(X, Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 46.06780351484079
Coefficients: 
 [0.12275942 0.96286425]

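One way to read these numbers is to plug them back into the linear model. The sketch below copies the intercept and coefficients from the output above and compares the predictions for two people of the same weight who differ only in the `Is Female` indicator; the helper `predict_height` and the weight value are hypothetical, introduced only for illustration.

```python
# Values copied from the fitted model output above.
intercept = 46.06780351484079
coef_weight, coef_is_female = 0.12275942, 0.96286425

def predict_height(weight, is_female):
    """Evaluate the fitted linear model for one person."""
    return intercept + coef_weight * weight + coef_is_female * is_female

weight = 150.0  # hypothetical weight, in the units of the Weight column
diff = predict_height(weight, 1) - predict_height(weight, 0)
print(diff)
```

The difference equals the `Is Female` coefficient whatever weight is chosen: the indicator shifts the prediction by a constant offset and leaves the slope with respect to weight unchanged.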

### Example 2 - Air quality prediction

We'll look at the problem of predicting **air quality** in Beijing, as measured by PM2.5 (the `pm2.5` column), the concentration of fine particulate matter

* This is a "simpler" dataset than some of the others we've been working with, as the relevant features are all real-valued

#### What are we trying to predict?

$$ \mbox{pm2.5} = \theta_0 + \theta_1 \times \mbox{temp} + \theta_2 \times \textbf{if year 2010} + \theta_3 \times \textbf{if year 2011} + \theta_4 \times \textbf{if year 2012} + \theta_5 \times \textbf{if year 2013} + \theta_6 \times \textbf{if year 2014}$$

In [14]:
import pandas as pd
df = pd.read_csv("Beijing_PM25_air_data.csv", sep=',', header=0)
df.head()

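| | No | year | month | day | hour | pm2.5 | DEWP | TEMP | PRES | cbwd | Iws | Is | Ir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2010 | 1 | 1 | 0 | NaN | -21 | -11.0 | 1021.0 | NW | 1.79 | 0 | 0 |
| 1 | 2 | 2010 | 1 | 1 | 1 | NaN | -21 | -12.0 | 1020.0 | NW | 4.92 | 0 | 0 |
| 2 | 3 | 2010 | 1 | 1 | 2 | NaN | -21 | -11.0 | 1019.0 | NW | 6.71 | 0 | 0 |
| 3 | 4 | 2010 | 1 | 1 | 3 | NaN | -21 | -14.0 | 1019.0 | NW | 9.84 | 0 | 0 |
| 4 | 5 | 2010 | 1 | 1 | 4 | NaN | -20 | -12.0 | 1018.0 | NW | 12.97 | 0 | 0 |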


**Code: Convert categorical variable into dummy/indicator variables**

In [24]:
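# Drop rows where the target (pm2.5) is missing
dataset = df.dropna(subset=['pm2.5'])
# One-hot encode year; dummy_na=True also appends a year_nan indicator column
dataset = pd.get_dummies(dataset, columns=['year'], dummy_na=True)
dataset.head()

# Features: temperature plus the five year indicators (column positions 12-16)
X = pd.concat([dataset[['TEMP']], dataset.iloc[:, 12:17]], axis=1)
X.head()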

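| | TEMP | year_2010.0 | year_2011.0 | year_2012.0 | year_2013.0 | year_2014.0 |
|---|---|---|---|---|---|---|
| 24 | -4.0 | 1 | 0 | 0 | 0 | 0 |
| 25 | -4.0 | 1 | 0 | 0 | 0 | 0 |
| 26 | -5.0 | 1 | 0 | 0 | 0 | 0 |
| 27 | -5.0 | 1 | 0 | 0 | 0 | 0 |
| 28 | -5.0 | 1 | 0 | 0 | 0 | 0 |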

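Selecting the indicator columns by position, as above, depends on the exact column order of the table. A possible alternative is to select them by name prefix. The sketch below uses a hypothetical miniature table with made-up values (`toy`, `year_cols`, and `X_toy` are illustrative names); note that with `dummy_na=True` a `year_nan` column would also match the prefix and would need to be excluded.

```python
import pandas as pd

# Hypothetical miniature of the air-quality table (values are made up).
toy = pd.DataFrame({
    "year": [2010, 2011, 2012, 2010],
    "TEMP": [-4.0, 3.0, 10.0, -5.0],
    "pm2.5": [129.0, 80.0, 60.0, 148.0],
})

encoded = pd.get_dummies(toy, columns=["year"], dtype=int)

# Pick the indicator columns by their "year_" prefix instead of by position.
year_cols = [c for c in encoded.columns if c.startswith("year_")]
X_toy = encoded[["TEMP"] + year_cols]
print(X_toy.columns.tolist())
```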

In [25]:
Y = dataset['pm2.5']

print(len(X), len(Y))

41757 41757


**Code: linear regression model with both numerical and categorical data**

In [26]:
from sklearn import linear_model

regr = linear_model.LinearRegression()
regr.fit(X,Y)

print('Intercept: \n', regr.intercept_)
print('Coefficients: \n', regr.coef_)

Intercept: 
 107.0559131654772
Coefficients: 
 [-0.6809868   4.69436887  0.4650212  -8.26332891  3.08954049  0.01439836]

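Because exactly one year indicator is 1 in every row, the five indicator columns always sum to one. Together with the intercept, one of them is therefore redundant, and the year coefficients reported above sum to roughly zero. A common alternative representation drops one category and treats it as the baseline. The sketch below compares the two encodings on synthetic stand-in data (not the Beijing dataset; all names and values are made up for illustration) and checks that both give the same predictions.

```python
import numpy as np
import pandas as pd
from sklearn import linear_model

# Synthetic stand-in data; not the Beijing dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "TEMP": rng.normal(10.0, 8.0, size=200),
    "year": rng.choice([2010, 2011, 2012], size=200),
})
y_toy = 100.0 - 0.5 * toy["TEMP"] + rng.normal(0.0, 5.0, size=200)

# Full one-hot encoding: one indicator column per year.
X_full = pd.get_dummies(toy, columns=["year"], dtype=int)
# Reduced encoding: the first year becomes the baseline absorbed by the intercept.
X_drop = pd.get_dummies(toy, columns=["year"], drop_first=True, dtype=int)

pred_full = linear_model.LinearRegression().fit(X_full, y_toy).predict(X_full)
pred_drop = linear_model.LinearRegression().fit(X_drop, y_toy).predict(X_drop)

print(X_full.shape[1], X_drop.shape[1])    # number of feature columns in each encoding
print(np.allclose(pred_full, pred_drop))   # whether the two fits predict the same values
```

With the reduced encoding, each remaining year coefficient can be read as an offset relative to the dropped baseline year, which tends to make the coefficients easier to interpret.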