<br>
# 3.6.6  Qualitative Predictors
<br>

### Form of multiple linear regression

$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n$

- $y$ is the response
- $\beta_0$ is the intercept
- $\beta_1$ is the coefficient for $x_1$ (the first feature)
- $\beta_n$ is the coefficient for $x_n$ (the nth feature)
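
As a minimal sketch of the formula above, the prediction is the intercept plus the dot product of the coefficients and the feature values. The numbers below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical coefficients and one observation (illustration only).
beta_0 = 2.0                        # intercept
beta = np.array([0.5, -1.0, 3.0])   # beta_1 .. beta_n
x = np.array([4.0, 1.0, 2.0])       # x_1 .. x_n

# y = beta_0 + beta_1*x_1 + ... + beta_n*x_n
y_hat = beta_0 + np.dot(beta, x)
print(y_hat)  # 9.0
```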

<br>

In [1]:
# imports used throughout this section

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, explained_variance_score
from sklearn.preprocessing import PolynomialFeatures

import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [2]:
# read CSV file and save the results
#data = pd.read_csv('data/Carseats.csv')
data = pd.read_csv('data/Credit.csv')

# display the first 5 rows
print(data.head())

# create a Python list of feature names
feature_cols = ['Income', 'Ethnicity']

# use the list to select a subset of the original DataFrame
X = data[feature_cols]

# print the first 5 rows
X.head()

    Income  Limit  Rating  Cards  Age  Education  Gender Student Married  \
0   14.891   3606     283      2   34         11    Male      No     Yes   
1  106.025   6645     483      3   82         15  Female     Yes     Yes   
2  104.593   7075     514      4   71         11    Male      No      No   
3  148.924   9504     681      3   36         11  Female      No      No   
4   55.882   4897     357      2   68         16    Male      No     Yes   

   Ethnicity  Balance  
0  Caucasian      333  
1      Asian      903  
2      Asian      580  
3      Asian      964  
4  Caucasian      331  


    Income  Ethnicity
0   14.891  Caucasian
1  106.025      Asian
2  104.593      Asian
3  148.924      Asian
4   55.882  Caucasian


In [3]:
# check the type and shape of X
print(type(X))
print(X.shape)

<class 'pandas.core.frame.DataFrame'>
(400, 2)


In [4]:
# select a Series from the DataFrame
y = data['Balance']

# equivalent command that works if there are no spaces in the column name
y = data.Balance

# print the first 5 values
y.head()

0    333
1    903
2    580
3    964
4    331
Name: Balance, dtype: int64

In [5]:
# check the type and shape of y
print(type(y))
print(y.shape)

<class 'pandas.core.series.Series'>
(400,)


In [6]:
# one-hot encode the categorical column (Ethnicity); numeric columns are left unchanged
X = pd.get_dummies(X)
print(X)

      Income  Ethnicity_African American  Ethnicity_Asian  Ethnicity_Caucasian
0     14.891                           0                0                    1
1    106.025                           0                1                    0
2    104.593                           0                1                    0
3    148.924                           0                1                    0
4     55.882                           0                0                    1
5     80.180                           0                0                    1
6     20.996                           1                0                    0
7     71.408                           0                1                    0
8     15.125                           0                0                    1
9     71.061                           1                0                    0
10    63.095                           0                0                    1
11    15.045                           0            
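
The encoding above keeps one dummy column per ethnicity level. A common alternative, shown here only as a sketch on a toy frame with illustrative values, is to pass `drop_first=True` so that the first level becomes the baseline and the remaining dummies are measured relative to it. The names `toy`, `full`, and `reduced` are hypothetical.

```python
import pandas as pd

# Toy frame mimicking the two selected columns (values are illustrative).
toy = pd.DataFrame({
    "Income": [14.891, 106.025, 20.996],
    "Ethnicity": ["Caucasian", "Asian", "African American"],
})

full = pd.get_dummies(toy)                      # one column per level
reduced = pd.get_dummies(toy, drop_first=True)  # first level dropped (baseline)

print(list(full.columns))
print(list(reduced.columns))
```

With `drop_first=True`, the `Ethnicity_African American` column is omitted, which avoids the exact linear dependence between the full set of dummies and the intercept.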

In [None]:
# refresh the feature list so it matches the dummy-encoded columns
feature_cols = list(X.columns)
print(feature_cols)

## Splitting X and y into training and testing sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [None]:
# default split is 75% for training and 25% for testing
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)
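
To see what the default 75%/25% split means for a dataset of this size, here is a small sketch on synthetic arrays with 400 rows and 4 columns (stand-ins for the encoded Credit features, not the real data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins with the same number of rows as the Credit data (400).
X_demo = np.zeros((400, 4))
y_demo = np.zeros(400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)
print(X_tr.shape, X_te.shape)  # (300, 4) (100, 4)
```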

## Multiple Linear Regression in scikit-learn

In [None]:
# import model
#from sklearn.linear_model import LinearRegression

# instantiate
linreg = LinearRegression()

# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)

### Interpreting model coefficients

In [None]:
# print the intercept and coefficients
print(linreg.intercept_)
print(linreg.coef_)

In [None]:
# pair the feature names with the coefficients
list(zip(feature_cols, linreg.coef_))

$y = 221.744 - 6.266 \times \text{Income} - 53.975 \times \text{Ethnicity\_African American} + 19.799 \times \text{Ethnicity\_Asian} + 34.175 \times \text{Ethnicity\_Caucasian}$

- This is a statement of **association**, not **causation**.
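
One way to read the dummy coefficients is to evaluate the fitted equation once per ethnicity level. The sketch below copies the coefficients from the equation above and uses a hypothetical income of 50 (an assumed value, in the dataset's units). Because all three dummy columns are kept alongside the intercept, no single level acts as the baseline; only the differences between levels are directly interpretable.

```python
# Coefficients copied from the fitted equation above.
intercept = 221.744
b_income = -6.266
b_eth = {"African American": -53.975, "Asian": 19.799, "Caucasian": 34.175}

income = 50.0  # hypothetical income value, chosen only for illustration

# Exactly one dummy equals 1 per person, so exactly one ethnicity coefficient applies.
pred = {level: intercept + b_income * income + b for level, b in b_eth.items()}

# The gap between two levels is the difference of their coefficients,
# regardless of the income value plugged in.
gap = pred["Caucasian"] - pred["African American"]
print(pred)
print(round(gap, 3))  # 88.15
```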


### Making predictions

In [None]:
# make predictions on the testing set
y_pred = linreg.predict(X_test)

We need an **evaluation metric** in order to compare our predictions with the actual values!

### Computing $R^2$

In [None]:
print(linreg.score(X_test, y_test))
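
For a regressor, `score` returns the coefficient of determination, $R^2 = 1 - SS_{res}/SS_{tot}$. The sketch below computes it by hand on a few illustrative values (not taken from the Credit data) and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values, not taken from the Credit data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_hat))
```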

### Computing the RMSE 

In [None]:
print(np.sqrt(mean_squared_error(y_test, y_pred)))
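
The RMSE is the square root of the mean squared residual, so it is expressed in the same units as the response. As a small check on illustrative values (again not the Credit data), the hand-computed version should match the `mean_squared_error` route used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values, not taken from the Credit data.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.0, 7.5, 9.0])

rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```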

## How to handle categorical features using sklearn

This topic is not covered in this course.

When to use One Hot Encoding vs LabelEncoder vs DictVectorizer?

There are some cases where LabelEncoder or DictVectorizer are useful, but these are quite limited in my opinion due to ordinality.

LabelEncoder can turn [dog,cat,dog,mouse,cat] into [1,0,1,2,0] (the classes are sorted and numbered from 0), but then the imposed ordinality means that the average of cat and mouse is dog. Still, there are algorithms like decision trees and random forests that can work with categorical variables just fine, and LabelEncoder can be used to store values using less disk space.
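
The imposed ordering is easy to inspect. In this short sketch, scikit-learn's `LabelEncoder` sorts the distinct labels and numbers them from 0, so the integer codes carry an order that the categories themselves do not have.

```python
from sklearn.preprocessing import LabelEncoder

animals = ["dog", "cat", "dog", "mouse", "cat"]

encoder = LabelEncoder()
codes = encoder.fit_transform(animals)

print(encoder.classes_.tolist())  # ['cat', 'dog', 'mouse']
print(codes.tolist())             # [1, 0, 1, 2, 0]
```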

One-hot encoding has the advantage that the result is binary rather than ordinal and that everything sits in an orthogonal vector space. The disadvantage is that for high cardinality, the feature space can blow up quickly and you start fighting the curse of dimensionality. In these cases, I typically employ one-hot encoding followed by PCA for dimensionality reduction. I find that the judicious combination of one-hot plus PCA can seldom be beaten by other encoding schemes. PCA finds the linear overlap, so it will naturally tend to group similar features into the same feature.
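
A minimal sketch of the one-hot plus PCA combination described above, assuming a single synthetic categorical column with up to 30 distinct labels. The data, the variable names, and the choice of 5 components are all hypothetical; the one-hot matrix is converted to a dense array before PCA to keep the example simple.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical high-cardinality column: 200 rows, up to 30 distinct labels.
labels = rng.integers(0, 30, size=(200, 1)).astype(str)

# One binary column per distinct label (sparse by default, so densify).
onehot = OneHotEncoder().fit_transform(labels).toarray()

# Project the one-hot columns onto 5 principal components.
reduced = PCA(n_components=5).fit_transform(onehot)

print(onehot.shape, reduced.shape)
```

In practice the number of components would be chosen from the explained variance rather than fixed in advance.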
