# A) Continuation with VIF 

## 1) Examples where VIF is a problem and how to eliminate it

### Example 1

Let us use the *car-mpg.csv* file containing the cars dataset.

Steps for calculating VIF:

* a. Run a multiple regression
* b. Calculate the VIF factors
* c. Examine the VIF of each explanatory variable. Consider dropping a variable whose VIF is more than 5.
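
The steps above rest on the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 is obtained by regressing predictor j on the remaining predictors. The following is a minimal sketch of that definition using NumPy only; the function name and the synthetic columns `a`, `b` and `c` are assumptions made for illustration and are not part of the cars data.

```python
import numpy as np

def vif_by_auxiliary_regression(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        # Intercept plus all remaining predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Synthetic predictors (hypothetical data)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + 0.1 * rng.normal(size=200)   # nearly a copy of a, so strongly collinear with it
c = rng.normal(size=200)             # unrelated to a and b
vifs = vif_by_auxiliary_regression(np.column_stack([a, b, c]))
print(vifs)
```

In this sketch the two nearly identical columns receive VIFs far above 5, while the unrelated column stays close to 1.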

To construct the two design matrices y and X (outcome and predictor data), we use patsy.dmatrices, which takes a formula_like argument and the data.

https://etav.github.io/python/vif_factor_python.html

https://patsy.readthedocs.io/en/latest/API-reference.html#basic-api

In [1]:
import pandas                               as     pd
import numpy                                as     np
import matplotlib.pyplot                    as     plt
import seaborn                              as     sns
import statsmodels.api                      as     sm
import scipy.stats                          as     stats
from   patsy                                import dmatrices
from   statsmodels.stats.outliers_influence import variance_inflation_factor

In [2]:
cars = pd.read_csv('./data/car-mpg.csv')
cars.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 9 columns):
mpg         398 non-null float64
cyl         398 non-null int64
disp        398 non-null float64
hp          398 non-null object
wt          398 non-null int64
acc         398 non-null float64
yr          398 non-null int64
origin      398 non-null int64
car name    398 non-null object
dtypes: float64(3), int64(4), object(2)
memory usage: 28.1+ KB


Let the outcome be stored in y and the predictor variables in X, both pandas DataFrames (because of return_type='dataframe').

In [3]:
### Collect features

features = "cyl + disp + wt + acc + yr + origin"

### Extract y and X dataframes based on this regression:

y, X = dmatrices('mpg ~' + features, cars, return_type='dataframe')

In [4]:
vif               = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"]   = X.columns
print(vif)


   VIF Factor   features
0  612.937457  Intercept
1   10.541162        cyl
2   20.057737       disp
3    8.554710         wt
4    1.621482        acc
5    1.184443         yr
6    1.662647     origin


We need to retain only one of the highly correlated variables cyl, disp and wt.

In [5]:
print('Corr coeff between cyl  with mpg is %1.4f'% np.corrcoef(cars.mpg, cars.cyl)[0,1])
print('Corr coeff between disp with mpg is %1.4f'% np.corrcoef(cars.mpg, cars.disp)[0,1])
print('Corr coeff between wt   with mpg is %1.4f'% np.corrcoef(cars.mpg, cars.wt)[0,1])

print('\nWe retain the variable most highly correlated with the target variable and drop the others')
print('We retain %s with %1.4f' %('wt',np.corrcoef(cars.mpg, cars.wt)[0,1]))

Corr coeff between cyl  with mpg is -0.7754
Corr coeff between disp with mpg is -0.8042
Corr coeff between wt   with mpg is -0.8317

We retain the variable most highly correlated with the target variable and drop the others
We retain wt with -0.8317


#### After dropping the variables cyl and disp, which are less correlated with the target variable mpg than wt, we rebuild the design matrices for the multiple linear regression model and recompute the VIFs.

In [6]:
features =  "wt + acc + yr + origin"

### Extract y and X dataframes based on this regression:

y, X = dmatrices('mpg ~' + features, cars, return_type='dataframe')

In [7]:
vif1               = pd.DataFrame()
vif1["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif1["features"]   = X.columns
print(vif1)

   VIF Factor   features
0  570.081190  Intercept
1    1.808820         wt
2    1.257330        acc
3    1.143102         yr
4    1.513603     origin


#### Inference

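After dropping cyl and disp, every remaining predictor has a VIF below 2, so we no longer observe multi-collinearity among the predictors. The large value reported for the Intercept is not a predictor VIF and is not used for this decision.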

### Practice Exercise 1

For *Boston House Prices dataset*, check if there is multi-collinearity and if so remove it.
Your target variable is MEDV = Median value of owner-occupied homes in $1000's

In [21]:
from   sklearn.datasets import load_boston   # available only in scikit-learn < 1.2
data = load_boston()
print(data.DESCR)

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

## 2) Final model for Auto data				

#### We now proceed to build a final model for auto data.

#### Detect and remove Outliers

First we try to get the outliers using IQR by doing the following:
* Calculate interquartile range
* Calculate the outlier cutoff
* Identify outliers (data lying beyond the outlier cutoff)
* Remove those outliers
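
The four steps above can be sketched on a small toy array before applying them to the cars data. The values below are hypothetical, with 40 planted as the outlier; the 1.5 multiplier is the same cutoff factor used in the notebook function.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 40], dtype=float)   # toy data; 40 is the planted outlier

# Step 1: interquartile range
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25

# Step 2: outlier cutoff
cut_off = 1.5 * iqr
lower, upper = q25 - cut_off, q75 + cut_off

# Step 3: identify values beyond the cutoff
is_outlier = (values < lower) | (values > upper)

# Step 4: remove them
cleaned = values[~is_outlier]
print(lower, upper, cleaned)
```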

In [8]:
def detect_mark_outliers(data):
    """Return a copy of the pandas Series `data` with IQR outliers replaced by NaN."""
    
    q25, q75  = np.percentile(data, 25), np.percentile(data, 75)
    iqr       = q75 - q25
    cut_off   = iqr * 1.5
    lower, upper = q25 - cut_off, q75 + cut_off
    print('Percentiles: 25th=%.3f, 75th=%.3f, IQR=%.3f' % (q25, q75, iqr))
    
    # Keep values inside [lower, upper]; everything else is marked as NaN.
    # The input Series itself is left unchanged.
    return data.where((data >= lower) & (data <= upper))

In [9]:
df = cars[['mpg', 'wt', 'acc', 'yr', 'origin']].copy()  # work on a copy, not on a slice of cars

for col in df.columns:
    print('Variable: %s' % col)
    df[col] = detect_mark_outliers(df[col])
    
print(df.info())


Variable: mpg
Percentiles: 25th=17.500, 75th=29.000, IQR=11.500
Variable: wt
Percentiles: 25th=2223.750, 75th=3608.000, IQR=1384.250
Variable: acc
Percentiles: 25th=13.825, 75th=17.175, IQR=3.350
Variable: yr
Percentiles: 25th=73.000, 75th=79.000, IQR=6.000
Variable: origin
Percentiles: 25th=1.000, 75th=2.000, IQR=1.000
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 5 columns):
mpg       397 non-null float64
wt        398 non-null int64
acc       389 non-null float64
yr        398 non-null int64
origin    398 non-null int64
dtypes: float64(2), int64(3)
memory usage: 15.6 KB
None


In [10]:
df.dropna(inplace = True) ## Remove outliers


In [11]:
df.shape

(388, 5)

In [12]:
X = df[['wt', 'acc', 'yr', 'origin']]
y = df.mpg
print(X.shape)
print(y.shape)

(388, 4)
(388,)


### 1) Absence of multi-collinearity

This was already checked in Example 1: with cyl and disp dropped, the VIFs of wt, acc, yr and origin are all below 2.

## 3) Improving the model with transformation showing R-square, adj R-square				

### Example 2

Use Boston data set as shown in Practice Exercise 1.
Check if the model measures such as R-square and adjusted R square have improved after transforming the predictor variables.
Here our target variable is MEDV.

In [40]:
import pandas                               as     pd
import numpy                                as     np
import matplotlib.pyplot                    as     plt
import seaborn                              as     sns
import statsmodels.api                      as     sm
import scipy.stats                          as     stats
from   sklearn                              import datasets
from   sklearn.metrics                      import mean_squared_error
from   sklearn.preprocessing                import PolynomialFeatures
from   sklearn.linear_model                 import LinearRegression
from   sklearn                              import linear_model

In [41]:
# Load data (datasets.load_boston is available only in scikit-learn < 1.2)
boston = datasets.load_boston()
print(boston.data.shape, boston.target.shape)
print(boston.feature_names)

(506, 13) (506,)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']


In [42]:
data = pd.DataFrame(boston.data, columns = boston.feature_names)
data = pd.concat([data, pd.Series(boston.target, name = 'MEDV')], axis = 1)
data.head()

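| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |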


In [57]:
X              =  boston.data
y              =  boston.target

X              = sm.add_constant(X) 
model          = sm.OLS(y, X).fit()
residuals      = model.resid
model.summary()

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | y | R-squared: | 0.741 |
| Model: | OLS | Adj. R-squared: | 0.734 |
| Method: | Least Squares | F-statistic: | 108.1 |
| Date: | Wed, 28 Nov 2018 | Prob (F-statistic): | 6.95e-135 |
| Time: | 21:41:18 | Log-Likelihood: | -1498.8 |
| No. Observations: | 506 | AIC: | 3026.0 |
| Df Residuals: | 492 | BIC: | 3085.0 |
| Df Model: | 13 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | 36.4911 | 5.104 | 7.149 | 0.000 | 26.462 | 46.520 |
| x1 | -0.1072 | 0.033 | -3.276 | 0.001 | -0.171 | -0.043 |
| x2 | 0.0464 | 0.014 | 3.380 | 0.001 | 0.019 | 0.073 |
| x3 | 0.0209 | 0.061 | 0.339 | 0.735 | -0.100 | 0.142 |
| x4 | 2.6886 | 0.862 | 3.120 | 0.002 | 0.996 | 4.381 |
| x5 | -17.7958 | 3.821 | -4.658 | 0.000 | -25.302 | -10.289 |
| x6 | 3.8048 | 0.418 | 9.102 | 0.000 | 2.983 | 4.626 |
| x7 | 0.0008 | 0.013 | 0.057 | 0.955 | -0.025 | 0.027 |
| x8 | -1.4758 | 0.199 | -7.398 | 0.000 | -1.868 | -1.084 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 178.029 | Durbin-Watson: | 1.078 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 782.015 |
| Skew: | 1.521 | Prob(JB): | 1.54e-170 |
| Kurtosis: | 8.276 | Cond. No. | 15100.0 |


We will now apply a transformation to the predictor variables and check whether the coefficient of determination improves on 0.741.
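
To make the transformation concrete, here is a small sketch of the columns a degree-2 polynomial expansion generates for one observation: a bias term, the original features, and all squares and pairwise products. The helper names are hypothetical and the code uses only the standard library; it is meant as an illustration of the idea rather than a description of how any particular library implements it.

```python
from itertools import combinations_with_replacement
from math import comb

def degree2_terms(row):
    """Return [1, x_i ..., x_i * x_j ...] for one observation."""
    terms = [1.0]                                    # bias term
    terms += [float(v) for v in row]                 # original features
    terms += [float(a * b)                           # squares and pairwise products
              for a, b in combinations_with_replacement(row, 2)]
    return terms

def n_degree2_columns(n_features):
    """Number of monomials of total degree <= 2 in n_features variables."""
    return comb(n_features + 2, 2)

print(degree2_terms([2, 3]))      # order: 1, a, b, a*a, a*b, b*b
print(n_degree2_columns(13))      # column count for 13 input features
```

The column count grows quadratically with the number of features, which is why the adjusted R-square, which penalises extra columns, is worth reporting next to the plain R-square.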

In [61]:
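# Note: X already holds the constant column added by sm.add_constant above, so the
# degree-2 expansion (and X_.shape[1] in the adjusted R-square below) includes redundant columns.
poly      = PolynomialFeatures(degree=2)
X_        = poly.fit_transform(X)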

model2     = linear_model.LinearRegression()
model2.fit(X_, y)

r_squared  = model2.score(X_,y)
adjusted_r_squared = 1 - (1 - r_squared) * (len(y) - 1) / (len(y) - X_.shape[1] - 1)

print('Polynomial Model of degree 2 - R square is %.2f R adj square %.2f' %(r_squared, adjusted_r_squared))

Polynomial Model of degree 2 - R square is 0.90 R adj square 0.87


The transformation has therefore improved both in-sample measures relative to the earlier linear model:

| Model | $R^2$ | Adj $R^2$ |
| -------------- | ----- | ------- |  
| Linear | 0.741 | 0.734 |
| Polynomial | 0.90 | 0.87 |
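
As a check on the adjusted R-square formula used in the polynomial cell, the short sketch below applies it to the values reported in the linear model's OLS summary (R-square 0.741, 506 observations, 13 predictors) and reproduces the adjusted value of 0.734. The function name is an assumption introduced for this illustration.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Adjusted R-square for a model with an intercept and n_predictors regressors."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Inputs taken from the OLS summary of the linear Boston model
linear_adj = adjusted_r_squared(0.741, 506, 13)
print(round(linear_adj, 3))
```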



https://acadgild.com/blog/polynomial-regression-understand-power-of-polynomials

### Practice Exercise 2

Use the red wine data and predict the target variable, wine quality using the predictor variables (1 to 11).
Check whether the transformation of predictor variables has improved R square and adj. R square.


Attribute Information:

Input variables (based on physicochemical tests): 

+ 1 - fixed acidity 
+ 2 - volatile acidity 
+ 3 - citric acid 
+ 4 - residual sugar 
+ 5 - chlorides 
+ 6 - free sulfur dioxide 
+ 7 - total sulfur dioxide 
+ 8 - density 
+ 9 - pH 
+ 10 - sulphates 
+ 11 - alcohol 

**Output variable (based on sensory data):**

+ 12 - quality (score between 0 and 10)

##### Relevant Papers:
+ P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.


#### Citation Request:

Please include this citation if you plan to use this database: 

+ P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

In [25]:
import pandas as pd
wine_data   = pd.read_csv('./data/winequality-red.csv', header = 0, sep = ';')
print(wine_data.info())
print(wine_data.shape)
print(wine_data.head(5).T)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
fixed acidity           1599 non-null float64
volatile acidity        1599 non-null float64
citric acid             1599 non-null float64
residual sugar          1599 non-null float64
chlorides               1599 non-null float64
free sulfur dioxide     1599 non-null float64
total sulfur dioxide    1599 non-null float64
density                 1599 non-null float64
pH                      1599 non-null float64
sulphates               1599 non-null float64
alcohol                 1599 non-null float64
quality                 1599 non-null int64
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
None
(1599, 12)
                            0        1       2       3        4
fixed acidity          7.4000   7.8000   7.800  11.200   7.4000
volatile acidity       0.7000   0.8800   0.760   0.280   0.7000
citric acid            0.0000   0.0000   0.040   0.560   0.0000
residual sugar    

## 4) Binary & Multinomial Predictors				

http://songhuiming.github.io/pages/2017/01/21/linear-regression-in-python-chapter-3-regression-with-categorical-predictors/

### 4a) Dummy variable				

### Example 3

The dataset contains over 370000 used-car ads scraped with Scrapy from Ebay-Kleinanzeigen.
https://www.kaggle.com/orgesleka/used-cars-database/home

The file autos.csv includes the following fields:

| Sl No | Variable | Description | 
| --- | -------------- | ------------------------------------ |
| 1 | dateCrawled | when this ad was first crawled, all field-values are taken from this date | 
| 2 |  name |  "name" of the car | 
| 3 |  seller | private or dealer | 
| 4 |  offerType |  | 
| 5 |  price  |  the price on the ad to sell the car | 
| 6 |  abtest |  | 
| 7 |  vehicleType |  | 
| 8 |  yearOfRegistration |  at which year the car was first registered |  
| 9 |  gearbox |  | 
| 10 |  powerPS  |  power of the car in PS | 
| 11 |  model |  | 
| 12 |  kilometer  |  how many kilometers the car has driven | 
| 13 |  monthOfRegistration  |  at which month the car was first registered | 
| 14 |  fuelType |  | 
| 15 |  brand |  | 
| 16 |  notRepairedDamage  |  if the car has a damage which is not repaired yet | 
| 17 |  dateCreated  |  the date for which the ad at ebay was created | 
| 18 |  nrOfPictures  | number of pictures in the ad (unfortunately this field contains everywhere a 0 and is thus useless (bug in crawler!) ) | 
| 19 |  postalCode  |  | 
| 20 |  lastSeenOnline |  when the crawler saw this ad last online | 

The fields lastSeen and dateCreated could be used to estimate how long a car will be at least online before it is sold.
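
A minimal sketch of that idea, using only the standard library: the difference between the two timestamps gives a lower bound on how long an ad stayed online. The timestamp layout in `FMT` and the two example timestamps are assumptions made for illustration, not values read from autos.csv.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"   # assumed timestamp layout; check the actual file

def days_online(date_created, last_seen):
    """Lower bound, in whole days, on how long an ad stayed online."""
    delta = datetime.strptime(last_seen, FMT) - datetime.strptime(date_created, FMT)
    return delta.days

# Hypothetical timestamps, not rows from autos.csv
example_days = days_online("2016-03-24 00:00:00", "2016-04-07 03:16:57")
print(example_days)
```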

Data brought to you by Orges Leka.

We shall use the following 7 of the 20 variables, and select only the rows with price not more than 50000.
1. seller, the nature of the seller
2. price, the price on the ad
3. yearOfRegistration, the year in which the car was first registered
4. gearbox, type of gearbox
5. kilometer, number of kilometers driven
6. fuelType, type of fuel (and whether the vehicle is electric or not)
7. notRepairedDamage, whether or not the vehicle has damage that has not been repaired yet.

In [15]:
import pandas          as pd
import numpy           as np
import statsmodels.api as sm

In [16]:
usecols = ['seller', 'price', 'yearOfRegistration', 'gearbox', 'kilometer', 'fuelType', 'notRepairedDamage']
data    = pd.read_csv('./data/autos.csv', encoding = 'latin', quoting = 3, usecols = usecols)
data    = data[data.price <= 50000].dropna()  # keep price <= 50000 and drop rows with missing values
print(data.info())
print(data.columns)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 276025 entries, 1 to 371823
Data columns (total 7 columns):
seller                276025 non-null object
price                 276025 non-null int64
yearOfRegistration    276025 non-null int64
gearbox               276025 non-null object
kilometer             276025 non-null int64
fuelType              276025 non-null object
notRepairedDamage     276025 non-null object
dtypes: int64(3), object(4)
memory usage: 16.8+ MB
None
Index(['seller', 'price', 'yearOfRegistration', 'gearbox', 'kilometer',
       'fuelType', 'notRepairedDamage'],
      dtype='object')


We create X as a DataFrame of predictor variables and y as the response variable vector.

In [17]:
X = data[['seller', 'yearOfRegistration', 'gearbox', 'kilometer','fuelType', 'notRepairedDamage']]
y = data['price']
print(X.shape)
print(y.shape)

(276025, 6)
(276025,)


#### Get the levels of the variables:
* 1) gearbox, which is binary
* 2) fuelType, which is multinomial

In [27]:
data["gearbox"] = data["gearbox"].astype('category')
print(dict( enumerate(data.gearbox.cat.categories)))
data["fuelType"] = data["fuelType"].astype('category')
print(dict( enumerate(data.fuelType.cat.categories)))

{0: 'automatik', 1: 'manuell'}
{0: 'andere', 1: 'benzin', 2: 'cng', 3: 'diesel', 4: 'elektro', 5: 'hybrid', 6: 'lpg'}


### Inference

* 1) There are two levels for the variable gearbox: *automatik and manuell*.
* 2) There are seven levels for the variable fuelType: *andere, benzin, cng, diesel, elektro, hybrid and lpg*.

In [18]:
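### Create dummy variables for each of the categorical variables
### Note: every level is kept, so each set of dummies is collinear with the constant added later

for variable in list(X.columns):
    if X[variable].dtype == object:
        dcols = pd.get_dummies(X[variable], dtype = int)  # one 0/1 column per level
        X     = X.join(dcols)
        del X[variable]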

In [19]:
print(X.head())

   yearOfRegistration  kilometer  gewerblich  privat  automatik  manuell  \
1                2011     125000           0       1          0        1   
3                2001     150000           0       1          0        1   
4                2008      90000           0       1          0        1   
5                1995     150000           0       1          0        1   
6                2004     150000           0       1          0        1   

   andere  benzin  cng  diesel  elektro  hybrid  lpg  ja  nein  
1       0       0    0       1        0       0    0   1     0  
3       0       1    0       0        0       0    0   0     1  
4       0       0    0       1        0       0    0   0     1  
5       0       1    0       0        0       0    0   1     0  
6       0       1    0       0        0       0    0   0     1  
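
The dummy columns printed above keep every level of each categorical variable. A common alternative, sketched here on a few hypothetical rows that reuse level names from the autos data, is to drop one level per variable so that it serves as the reference level. This is an illustration of `pd.get_dummies` with `drop_first=True`, not the encoding used for the results reported in this notebook.

```python
import pandas as pd

# Hypothetical rows, not taken from autos.csv
toy = pd.DataFrame({
    "kilometer": [125000, 150000, 90000],
    "gearbox":   ["manuell", "automatik", "manuell"],
    "fuelType":  ["benzin", "diesel", "lpg"],
})

# drop_first=True omits the first level of each variable (here automatik and benzin);
# the omitted level becomes the reference that the intercept absorbs
encoded = pd.get_dummies(toy, columns=["gearbox", "fuelType"], drop_first=True, dtype=int)
print(encoded)
```

With this encoding, each remaining dummy coefficient can be read directly as the difference from the reference level.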


In [20]:
X              = sm.add_constant(X) 
model          = sm.OLS(y, X).fit()
residuals      = model.resid
model.summary()

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Dep. Variable: | price | R-squared: | 0.405 |
| Model: | OLS | Adj. R-squared: | 0.404 |
| Method: | Least Squares | F-statistic: | 17050.0 |
| Date: | Thu, 29 Nov 2018 | Prob (F-statistic): | 0.0 |
| Time: | 07:34:26 | Log-Likelihood: | -2766500.0 |
| No. Observations: | 276025 | AIC: | 5533000.0 |
| Df Residuals: | 276013 | BIC: | 5533000.0 |
| Df Model: | 11 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
| --- | --- | --- | --- | --- | --- | --- |
| const | -3043.5557 | 791.631 | -3.845 | 0.000 | -4595.130 | -1491.981 |
| yearOfRegistration | 10.5436 | 0.400 | 26.369 | 0.000 | 9.760 | 11.327 |
| kilometer | -0.0831 | 0.000 | -310.428 | 0.000 | -0.084 | -0.083 |
| gewerblich | -3275.0485 | 2297.565 | -1.425 | 0.154 | -7778.214 | 1228.116 |
| privat | 231.4928 | 1571.037 | 0.147 | 0.883 | -2847.698 | 3310.683 |
| automatik | 709.8405 | 396.023 | 1.792 | 0.073 | -66.354 | 1486.035 |
| manuell | -3753.3962 | 396.007 | -9.478 | 0.000 | -4529.559 | -2977.233 |
| andere | -4244.3640 | 489.064 | -8.679 | 0.000 | -5202.916 | -3285.812 |
| benzin | 95.5883 | 170.177 | 0.562 | 0.574 | -237.953 | 429.130 |

| Statistic | Value | Statistic | Value |
| --- | --- | --- | --- |
| Omnibus: | 104523.624 | Durbin-Watson: | 2.004 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 794297.077 |
| Skew: | 1.63 | Prob(JB): | 0.0 |
| Kurtosis: | 10.644 | Cond. No. | 1.97e+22 |

### 4b) Interpretation of regression coefficients

### Our regression equation is given below:

price = -3043.5557 + yearOfRegistration * 10.5436  - kilometer * 0.0831 - gewerblich * 3275.0485 + privat * 231.4928 + automatik * 709.8405 - manuell * 3753.3962 - andere * 4244.3640 + benzin * 95.5883 + cng * 901.6695 + diesel * 4029.0504 - elektro * 6094.3164 + hybrid * 1677.0442 + lpg * 591.7725 - ja * 2983.5059 - nein * 60.0498

We would like to interpret the coefficients for the variable gearbox, which is split into two dummy variables:
**automatik and manuell.**

Because every level of each categorical variable enters the model together with a constant, the dummy columns of one variable sum to the constant column (the very large Cond. No. in the summary reflects this). The individual dummy coefficients are therefore not uniquely determined, and only the differences between levels of the same variable can be interpreted.

Holding all other variables constant, for the binary variable **gearbox**:
* a car with an automatik gearbox is priced 709.8405 - (-3753.3962) = 4463.2367 higher than a car with a manuell gearbox.

Holding all other variables constant, for the multinomial variable **fuelType**, taking benzin as the reference level:
* an andere (fuelType) car is priced 4339.95 lower than a benzin car.
* a cng (fuelType) car is priced 806.08 higher than a benzin car.
* a diesel (fuelType) car is priced 3933.46 higher than a benzin car.
* an elektro (fuelType) car is priced 6189.90 lower than a benzin car.
* a hybrid (fuelType) car is priced 1581.46 higher than a benzin car.
* an lpg (fuelType) car is priced 496.18 higher than a benzin car.
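
Since both gearbox dummies and all seven fuelType dummies enter the model alongside a constant, the quantity that compares two levels of the same variable is the difference between their coefficients. The sketch below computes such differences from the coefficients in the regression equation above; the helper name `contrast` is an assumption introduced for this illustration.

```python
# Coefficients copied from the regression equation above
gearbox_coef = {"automatik": 709.8405, "manuell": -3753.3962}
fuel_coef = {"andere": -4244.3640, "benzin": 95.5883, "cng": 901.6695,
             "diesel": 4029.0504, "elektro": -6094.3164, "hybrid": 1677.0442,
             "lpg": 591.7725}

def contrast(coefs, level, reference):
    """Expected price difference between two levels of the same variable."""
    return coefs[level] - coefs[reference]

gearbox_gap = contrast(gearbox_coef, "automatik", "manuell")
diesel_gap = contrast(fuel_coef, "diesel", "benzin")
print(round(gearbox_gap, 4), round(diesel_gap, 4))
```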

### 4c) Interaction effects		

Does the effect of kilometer on the price of a car differ between automatik and manuell gearboxes?

In [34]:
from statsmodels.formula.api import ols
result = ols(formula = 'price ~ C(seller) + yearOfRegistration + C(gearbox) + kilometer + C(fuelType) + C(notRepairedDamage) + kilometer * C(gearbox)', data = data).fit()    
print(result.summary())    

                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.413
Model:                            OLS   Adj. R-squared:                  0.413
Method:                 Least Squares   F-statistic:                 1.620e+04
Date:                Thu, 29 Nov 2018   Prob (F-statistic):               0.00
Time:                        11:36:49   Log-Likelihood:            -2.7645e+06
No. Observations:              276025   AIC:                         5.529e+06
Df Residuals:                  276012   BIC:                         5.529e+06
Df Model:                          12                                         
Covariance Type:            nonrobust                                         
                                      coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------------
Intercept 

### Inference

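From the above coefficients table, we observe that the interaction term between kilometer and the manuell gearbox has a coefficient of 0.0384. Holding all other variables constant, each additional kilometer changes the price of a car with a manual gearbox by 0.0384 more than it changes the price of a car with an automatic gearbox (the reference level).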
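
To illustrate how an interaction term is read, the sketch below simulates prices in which the per-kilometer slope differs by gearbox type and then recovers the two slopes by least squares. All numbers in the data-generating process are hypothetical and are not estimates from autos.csv; only NumPy is used.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
km = rng.uniform(0, 150_000, n)
manual = rng.integers(0, 2, n)     # 1 = manual, 0 = automatic (reference level)

# Hypothetical process: slope per kilometer is -0.10 for automatic cars
# and -0.10 + 0.04 = -0.06 for manual cars
price = 20_000 - 0.10 * km - 3_000 * manual + 0.04 * km * manual + rng.normal(0, 500, n)

# Design matrix: intercept, kilometer, manual dummy, and their product (the interaction)
X = np.column_stack([np.ones(n), km, manual, km * manual])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)

slope_automatic = coef[1]             # kilometer effect at the reference level
slope_manual = coef[1] + coef[3]      # kilometer effect plus the interaction coefficient
print(slope_automatic, slope_manual)
```

The interaction coefficient is the difference between the two slopes, so it describes how the kilometer effect changes with the gearbox type rather than a price change on its own.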

### Practice Exercise 3

Use Carseats.csv, a simulated data set containing sales of child car seats at 400 different stores.

Predict Sales using the explanatory variables 2 to 10 listed below and interpret the coefficients:

Format: a data frame with 400 observations on the following 11 variables.

| Sl No | Variable | Description |
| --- | ------------------- | --------------------------------- |
| 1 | Sales | Unit sales (in thousands) at each location  | 
| 2 | CompPrice | Price charged by competitor at each location | 
| 3 | Income | Community income level (in thousands of dollars) | 
| 4 | Advertising | Local advertising budget for company at each location (in thousands of dollars)  | 
| 5 | Population | Population size in region (in thousands) | 
| 6 | Price | Price company charges for car seats at each site | 
| 7 | ShelveLoc | A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site |  
| 8 | Age | Average age of the local population | 
| 9 | Education | Education level at each location | 
| 10 | Urban | A factor with levels No and Yes to indicate whether the store is in an urban or rural location | 
| 11 | US | A factor with levels No and Yes to indicate whether the store is in the US or not | 

**Source:** Simulated data

**References:** James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013) An Introduction to Statistical Learning with applications in R, www.StatLearning.com, Springer-Verlag, New York


In [None]:
import pandas as pd

In [27]:
carseats = pd.read_csv('./data/Carseats.csv')
print(carseats.info())
print(carseats.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 11 columns):
Sales          400 non-null float64
CompPrice      400 non-null int64
Income         400 non-null int64
Advertising    400 non-null int64
Population     400 non-null int64
Price          400 non-null int64
ShelveLoc      400 non-null object
Age            400 non-null int64
Education      400 non-null int64
Urban          400 non-null object
US             400 non-null object
dtypes: float64(1), int64(7), object(3)
memory usage: 34.5+ KB
None
   Sales  CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  \
0   9.50        138      73           11         276    120       Bad   42   
1  11.22        111      48           16         260     83      Good   65   
2  10.06        113      35           10         269     80    Medium   59   
3   7.40        117     100            4         466     97    Medium   55   
4   4.15        141      64            3         340    12

# B) Automatic Model Building									

https://gerardnico.com/data_mining/stepwise_regression

Stepwise regression covers regression models in which the choice of predictor variables is carried out by an automatic procedure.

Stepwise regression adds or removes predictor variables based on their p-values.

In forward stepwise selection, we start the model with no predictors and, at each step, add the predictor with the smallest p-value, provided it is below the threshold value.

In backward stepwise selection, we start the model with all the predictors and, at each step, remove the predictor with the largest p-value, provided it is above the threshold value.
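
A single forward step can be sketched as follows: fit one model per candidate predictor, record the p-value of the candidate's coefficient, and pick the smallest. To stay self-contained, this sketch computes p-values with a normal approximation to the t distribution, which is an assumption made here; statsmodels reports exact t-based p-values. The data and the names `x1`, `x2`, `x3` are synthetic.

```python
import math
import numpy as np

def approx_p_value(included, candidate, y):
    """Approximate two-sided p-value of the candidate's coefficient (normal approximation)."""
    n = len(y)
    design = np.column_stack([np.ones(n), included, candidate])   # intercept first
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (n - design.shape[1])                # residual variance
    cov = sigma2 * np.linalg.inv(design.T @ design)               # coefficient covariance
    t_stat = coef[-1] / math.sqrt(cov[-1, -1])
    return math.erfc(abs(t_stat) / math.sqrt(2))

# Synthetic data: x1 matters a lot, x2 a little, x3 not at all
rng = np.random.default_rng(42)
n = 300
features = {name: rng.normal(size=n) for name in ("x1", "x2", "x3")}
y = 3.0 * features["x1"] + 0.5 * features["x2"] + rng.normal(size=n)

included = np.empty((n, 0))                                       # no predictors selected yet
p_values = {name: approx_p_value(included, col, y) for name, col in features.items()}
best = min(p_values, key=p_values.get)
print(best, p_values)
```

A full procedure would repeat this step with the chosen predictor added to `included`, stopping once no candidate falls below the entry threshold.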

### Example 4

Apply stepwise regression to select the best features using the pre-defined, Boston data set

https://datascience.stackexchange.com/questions/24405/how-to-do-stepwise-regression-using-sklearn

We will write a function to perform forward-backward feature selection based on p-values from statsmodels.api.OLS.

Input parameters:

+ 1) X - pandas.DataFrame with features
+ 2) y - pandas Series (or array) with the target variable
+ 3) initial_list - list of features to start with (column names of X)
+ 4) threshold_in - include a feature if its p-value < threshold_in
+ 5) threshold_out - exclude a feature if its p-value > threshold_out (set threshold_in < threshold_out)
+ 6) verbose - whether to print each step

Output: List of selected features 

See https://en.wikipedia.org/wiki/Stepwise_regression for the details

In [6]:
import pandas           as     pd
import numpy            as     np
import statsmodels.api  as     sm
from   sklearn.datasets import load_boston

In [13]:
def stepwise_selection(X, y, 
                       initial_list  = None, 
                       threshold_in  = 0.01, 
                       threshold_out = 0.05, 
                       verbose       = True):

    included_list = list(initial_list) if initial_list is not None else []
    
    while True:
        
        changed = False
        
        # forward step: fit one model per excluded feature and record its p-value
        
        excluded_list   =  list(set(X.columns)-set(included_list))
        new_pval        =  pd.Series(index = excluded_list, dtype = float)
        
        for new_column in excluded_list:
            
            model                = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included_list+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
            
        best_pval = new_pval.min()
        
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()  # label (not position) of the smallest p-value
            included_list.append(best_feature)
            changed=True
            
            if verbose:
                print('Add  %s with p-value %1.12f'%(best_feature, best_pval))

        # backward step: refit with the current features and drop the weakest one if needed
        
        model      = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included_list]))).fit()
        
        # use all coefs except intercept
        
        p_values    =  model.pvalues.iloc[1:]
        worst_pval  =  p_values.max() # NaN if p_values is empty
        
        if worst_pval > threshold_out:
            
            changed=True
            worst_feature = p_values.idxmax()  # label (not position) of the largest p-value
            included_list.remove(worst_feature)
            
            if verbose:
                print('Drop %s with p-value %1.12f'%(worst_feature, worst_pval))
                
        if not changed:
            break
            
    return included_list


In [14]:
data = load_boston()  # available only in scikit-learn < 1.2
X    = pd.DataFrame(data.data, columns=data.feature_names) # Predictor variables
y    = data.target # Target or Response variable

In [15]:
result = stepwise_selection(X, y)

print('resulting features:')
print(result)

Add  LSTAT with p-value 0.000000000000
Add  RM with p-value 0.000000000000
Add  PTRATIO with p-value 0.000000000000
Add  DIS with p-value 0.000016684671
Add  NOX with p-value 0.000000054881
Add  CHAS with p-value 0.000265473059
Add  B with p-value 0.000771945890
Add  ZN with p-value 0.004651615937
resulting features:
['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN']


In [5]:
print(X.columns)

Index(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX',
       'PTRATIO', 'B', 'LSTAT'],
      dtype='object')


### Inference

The Boston dataset contains the following features:

**CRIM, ZN, INDUS, CHAS, NOX, RM, AGE, DIS, RAD, TAX, PTRATIO, B and LSTAT**

The subset of features selected by the stepwise procedure is **LSTAT, RM, PTRATIO, DIS, NOX, CHAS, B and ZN**.

### Practice Exercise 4

Use the red wine data and apply stepwise regression to select the best features for predicting the target variable, quality.

### Take Home Exercises

### Exercise 5

Use the Whitewine data and recommend a regression model which will be free of multicollinearity.
Also take care of binary and multinomial predictors and interpret regression equations.

### White wine data

Attribute Information:

Input variables (based on physicochemical tests): 

+ 1 - fixed acidity 
+ 2 - volatile acidity 
+ 3 - citric acid 
+ 4 - residual sugar 
+ 5 - chlorides 
+ 6 - free sulfur dioxide 
+ 7 - total sulfur dioxide 
+ 8 - density 
+ 9 - pH 
+ 10 - sulphates 
+ 11 - alcohol 

**Output variable (based on sensory data):**

+ 12 - quality (score between 0 and 10)

##### Relevant Papers:
+ P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.


#### Citation Request:

Please include this citation if you plan to use this database: 

+ P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

In [28]:
import pandas as pd

whitewine_data = pd.read_csv('./data/winequality-white.csv', sep = ";")
print(whitewine_data.info())
print(whitewine_data.head().T)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB
None
                            0        1        2         3         4
fixed acidity           7.000    6.300   8.1000    7.2000    7.2000
volatile acidity        0.270    0.300   0.2800    0.2300    0.2300
citric acid             0.360    0.340   0.4000    0.3200    0.3200
residual suga

### Exercise 6

Apply stepwise regression to select the best features using the concrete dataset to predict strength.

We have 1030 observations on 9 variables.

**Attribute information**

| Sl No | Variable | Description |
| --- | ------------------------ | ---------------------------|
| 1 | cement | Cement in kg per m3 of mixture |
| 2 | slag | Blast Furnace Slag |
| 3 | ash | Fly Ash |
| 4 | water | Water |
| 5 | superplastic | Superplasticizer |
| 6 | coarseagg | Coarse Aggregate |
| 7 | fineagg | Fine Aggregate |
| 8 | age | Age in days (1 to 365) |
| 9 | strength | Concrete compressive strength, target variable |

In [30]:
import pandas as pd

cement_df =  pd.read_csv('./data/concrete.csv', header = 0)
print(cement_df.info())
print(cement_df.columns)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
cement          1030 non-null float64
slag            1030 non-null float64
ash             1030 non-null float64
water           1030 non-null float64
superplastic    1030 non-null float64
coarseagg       1030 non-null float64
fineagg         1030 non-null float64
age             1030 non-null int64
strength        1030 non-null float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB
None
Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')


## END