<center><h1>Regression Models in Python</h1></center>
<center><h3>2022-03-15</h3></center>


# Linear Regression Review

Linear regression is a technique for modeling outcome variables that are continuous (e.g., heart rate, age, income, or cholesterol).

Our goal for linear regression in a machine learning context is slightly different from the goal in classical statistics. In particular, we are less interested in the "significance" of a given predictor variable and more interested in the overall quality of the model in terms of its performance on new data.

We can use anywhere from 1 to _p_ predictor variables in our model for the outcome variable. 




## $$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_p x_{pi} + \epsilon_i$$
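
As a rough illustration of the equation above, the sketch below generates synthetic data from a known set of coefficients and then recovers them with ordinary least squares. The sample size, number of predictors, coefficient values, and noise level are all made-up assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed so the sketch is reproducible

n, p = 200, 3                           # hypothetical sample size and number of predictors
beta_true = np.array([5.0, 2.0, -1.0, 0.5])   # beta_0 (intercept), then beta_1..beta_p

X = rng.normal(size=(n, p))             # synthetic predictors x_1..x_p
eps = rng.normal(scale=0.1, size=n)     # noise term epsilon_i
y = beta_true[0] + X @ beta_true[1:] + eps

# Prepend a column of ones so the intercept is estimated along with the slopes
X_design = np.column_stack([np.ones(n), X])

# Ordinary least squares estimate of all coefficients
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

print(np.round(beta_hat, 2))
```

With little noise and a few hundred rows, the estimates should land close to the values used to generate the data.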

![image](images/linear_reg.png)

<center><img src=images/train_test_split.png width = 1080/></center>

# scikit-learn
  * One of the most popular packages in the Python ecosystem
  * Specialized for machine learning algorithms

In [34]:
import pandas as pd
import numpy as np 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

## Minnesota Traffic Volume Data
Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN.


In [35]:
df = pd.read_csv("data/Metro_Interstate_Traffic_Volume.csv")

print(df.shape)

df.head()

(48204, 9)


Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918


In [36]:
xvars = ["temp", "rain_1h", "snow_1h", "clouds_all"]

X = df.loc[:, xvars].values              # get X values (i.e., predictors/features)
y = df.loc[:, "traffic_volume"].values   # get y values (i.e., outcome/target variable)

## Train/Test Split

In [37]:
# Split training/test data at random

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
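
To make the idea of a random split concrete, here is a minimal sketch of the same kind of operation written with NumPy. It is an illustration only, not scikit-learn's implementation; the helper name `simple_train_test_split` and the toy arrays are hypothetical.

```python
import numpy as np

def simple_train_test_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices, then hold out a `test_size` fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))                 # random ordering of the row indices
    n_test = int(np.ceil(test_size * len(X)))     # number of rows held out for testing
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

X_demo = np.arange(20).reshape(10, 2)   # 10 synthetic rows, 2 features
y_demo = np.arange(10)                  # matching synthetic outcomes

X_tr, X_te, y_tr, y_te = simple_train_test_split(X_demo, y_demo)
print(X_tr.shape, X_te.shape)
```

The key property is that every row ends up in exactly one of the two sets and that each row of `X` stays paired with its `y` value.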

# Fitting Regression Model

In [38]:
mod = LinearRegression()       # create model object

mod.fit(X_train, y_train)      # fit model to training data

LinearRegression()

In [39]:
print(mod.coef_)               # show regression coefficients 

[2.03771422e+01 1.41759285e-01 6.18593670e+02 3.89941368e+00]
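
The coefficient array is easier to read when each value is paired with its predictor name. The sketch below shows one way to do that, assuming the coefficients follow the column order of `X`. It uses synthetic stand-in data rather than the traffic data, and `X_demo`, `y_demo`, and `demo_mod` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
feature_names = ["temp", "rain_1h", "snow_1h", "clouds_all"]

# Synthetic stand-in data; NOT the traffic data set
X_demo = rng.normal(size=(100, len(feature_names)))
y_demo = 3.0 + X_demo @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=100)

demo_mod = LinearRegression().fit(X_demo, y_demo)

# coef_ follows the column order of X, so zip pairs each name with its coefficient
for name, coef in zip(feature_names, demo_mod.coef_):
    print(f"{name:>12}: {coef: .3f}")

# The intercept (beta_0) is stored separately from the slopes
print(f"{'intercept':>12}: {demo_mod.intercept_: .3f}")
```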


## Making Predictions with Fitted Model

In [40]:
# Use our fitted model to make predictions using test set 

y_pred = mod.predict(X_test)

In [41]:
# Print our metrics of model adequacy 

print("Mean Absolute Error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:     ", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
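# explained_variance_score matches R-squared only when the residuals average to zero;
# metrics.r2_score computes R-squared directly
print("Explained Variance:     ", metrics.explained_variance_score(y_test, y_pred))

Mean Absolute Error:     1701.0237286830916
Mean Squared Error:      3799938.6620990513
Root Mean Squared Error: 1949.3431360586703
Explained Variance:      0.02834348685617749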
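
The error metrics above can also be computed by hand, which helps clarify what each one measures. The sketch below uses a tiny set of made-up values in which every prediction is too high by exactly 1. It computes both R-squared and explained variance to show that the two differ when predictions are systematically biased.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])      # hypothetical observed values
y_hat = np.array([4.0, 6.0, 8.0, 10.0])      # hypothetical predictions, each too high by 1

resid = y_true - y_hat

mae = np.mean(np.abs(resid))                  # mean absolute error
mse = np.mean(resid ** 2)                     # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error

# R-squared: 1 - SS_res / SS_tot
r2 = 1 - np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Explained variance: 1 - Var(residuals) / Var(y); a constant bias does not lower it
explained_var = 1 - np.var(resid) / np.var(y_true)

print(mae, mse, rmse, r2, explained_var)
```

Here R-squared is 0.8 while explained variance is 1.0, because the constant offset of 1 counts against R-squared but not against explained variance.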



# Feature Engineering

* The term "feature" is typically used to refer to a predictor variable
* Feature engineering means transforming and/or modifying raw data in a manner that extracts additional information from it

In [9]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

import time

# Minnesota Traffic Volume Data
Hourly Interstate 94 Westbound traffic volume for MN DoT ATR station 301, roughly midway between Minneapolis and St Paul, MN.


## Pre-Processing and Data Engineering 

In [10]:
def get_hour_wkday(v_dt):
    n = len(v_dt)
    hour = np.zeros(n)
    wkday = np.zeros(n)
    
    for i in range(n):
        dt_tmp = time.strptime(v_dt[i], "%Y-%m-%d %H:%M:%S")
        hour[i] = dt_tmp.tm_hour
        wkday[i] = dt_tmp.tm_wday
    
    return hour, wkday

In [11]:
# create two new columns `hour` and `wkday`

df["hour"], df["wkday"] = get_hour_wkday(df["date_time"])
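
As an alternative to the explicit loop, pandas can parse the timestamps and extract the same fields in a vectorized way. This is a sketch on two sample timestamps, not a drop-in replacement: it returns integers rather than the floats produced by the `np.zeros` arrays above. The names `stamps`, `parsed`, and `expected` are hypothetical.

```python
import time
import pandas as pd

# Two sample timestamps in the same format as the date_time column
stamps = pd.Series(["2012-10-02 09:00:00", "2012-10-06 13:00:00"])

parsed = pd.to_datetime(stamps, format="%Y-%m-%d %H:%M:%S")

hour = parsed.dt.hour            # hour of day, 0-23
wkday = parsed.dt.dayofweek      # Monday = 0 ... Sunday = 6, same convention as tm_wday

# Cross-check the weekday against the loop-based approach using time.strptime
expected = [time.strptime(s, "%Y-%m-%d %H:%M:%S").tm_wday for s in stamps]

print(hour.tolist(), wkday.tolist(), expected)
```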

In [12]:
# Add dummy coded weather

weather_dummy_codes = pd.get_dummies(df["weather_description"])

df = df.join(weather_dummy_codes)
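
To see what dummy coding produces, the sketch below applies `pd.get_dummies` to a four-row series built from weather descriptions that appear in the data. Passing `dtype=int` is an assumption made here to force 0/1 columns, since some pandas versions return booleans by default.

```python
import pandas as pd

weather = pd.Series(
    ["scattered clouds", "broken clouds", "overcast clouds", "broken clouds"],
    name="weather_description",
)

# One column per distinct category, with a 1 marking the category of each row
dummies = pd.get_dummies(weather, dtype=int)

print(dummies)
```

Each row has exactly one 1, so the dummy columns always sum to 1 across a row. When that redundancy is a concern, `pd.get_dummies` also accepts `drop_first=True` to omit one category.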

In [13]:
df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume,hour,...,sleet,smoke,snow,thunderstorm,thunderstorm with drizzle,thunderstorm with heavy rain,thunderstorm with light drizzle,thunderstorm with light rain,thunderstorm with rain,very heavy rain
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545,9.0,...,0,0,0,0,0,0,0,0,0,0
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516,10.0,...,0,0,0,0,0,0,0,0,0,0
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767,11.0,...,0,0,0,0,0,0,0,0,0,0
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026,12.0,...,0,0,0,0,0,0,0,0,0,0
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918,13.0,...,0,0,0,0,0,0,0,0,0,0


### More Data Engineering

In [14]:
# create column of 0/1 indicating holidays

# treat both the string "None" and missing values as non-holidays
df["is_holiday"] = [0 if pd.isna(x) or x == "None" else 1 for x in df["holiday"]]

In [15]:
# create column of 0/1 indicating weekends (Saturday = 5, Sunday = 6)

df["is_weekend"] = [1 if x in [5, 6] else 0 for x in df["wkday"]]

df.head()

Unnamed: 0,holiday,temp,rain_1h,snow_1h,clouds_all,weather_main,weather_description,date_time,traffic_volume,hour,...,snow,thunderstorm,thunderstorm with drizzle,thunderstorm with heavy rain,thunderstorm with light drizzle,thunderstorm with light rain,thunderstorm with rain,very heavy rain,is_holiday,is_weekend
0,,288.28,0.0,0.0,40,Clouds,scattered clouds,2012-10-02 09:00:00,5545,9.0,...,0,0,0,0,0,0,0,0,0,0
1,,289.36,0.0,0.0,75,Clouds,broken clouds,2012-10-02 10:00:00,4516,10.0,...,0,0,0,0,0,0,0,0,0,0
2,,289.58,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 11:00:00,4767,11.0,...,0,0,0,0,0,0,0,0,0,0
3,,290.13,0.0,0.0,90,Clouds,overcast clouds,2012-10-02 12:00:00,5026,12.0,...,0,0,0,0,0,0,0,0,0,0
4,,291.14,0.0,0.0,75,Clouds,broken clouds,2012-10-02 13:00:00,4918,13.0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
xvars = ["temp", "rain_1h", "snow_1h", "clouds_all"]

xvars2 = ["temp", "rain_1h", "snow_1h", "clouds_all", "hour", "wkday", "is_weekend", "is_holiday"]

xvars2 = xvars2 + weather_dummy_codes.columns.values.tolist()


X = df.loc[:, xvars2].values             # get X values (i.e., predictors/features)
y = df.loc[:, "traffic_volume"].values   # get y values (i.e., outcome/target variable)

## Split Training/Test Data

In [17]:
# Split training/test data at random

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

### Fitting Regression Model

In [18]:
mod = LinearRegression()       # create model object

mod.fit(X_train, y_train)      # fit model to training data

LinearRegression()

In [19]:
print(mod.coef_)               # show regression coefficients 

[ 1.36282865e+01  2.37103764e-01 -1.00316251e+02  4.84476336e+00
  9.27434574e+01  6.47757732e+01 -1.18253645e+03 -1.49543785e+03
 -1.00338654e+03  2.98694636e+02  2.75686496e+02 -3.62998128e+02
  5.14655779e+02 -7.46664112e+01  6.66119597e+02  3.41400982e+02
  9.87140468e+01 -3.66936279e+02 -1.22216844e+02 -6.77524112e+01
  1.82313666e+02 -5.57203807e+01  8.59593709e+02  9.50366376e+02
 -1.11240570e+02 -1.65110266e+02 -2.01285152e+02 -2.83143246e+01
  9.08638953e+02 -3.54866993e+02 -6.82636976e+02 -6.47624688e+02
  4.97411544e+02 -6.62934222e+02  0.00000000e+00  1.12205424e+02
  1.45678356e+03  1.82560102e+01 -6.22174307e+02 -6.02553589e+02
  1.77766070e+03 -6.48467998e+02 -5.38561604e+02 -6.04015837e+02
  1.25378962e+02 -1.16041691e+03]


In [20]:
# Use our fitted model to make predictions using test set 

y_pred = mod.predict(X_test)

In [21]:
# Print our metrics of model adequacy 

print("Mean Absolute Error:    ", metrics.mean_absolute_error(y_test, y_pred))
print("Mean Squared Error:     ", metrics.mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
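# explained_variance_score matches R-squared only when the residuals average to zero;
# metrics.r2_score computes R-squared directly
print("Explained Variance:     ", metrics.explained_variance_score(y_test, y_pred))

Mean Absolute Error:     1555.8016865530478
Mean Squared Error:      3157421.0110158063
Root Mean Squared Error: 1776.9133380713326
Explained Variance:      0.19265696897419982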
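
To put the two sets of results side by side, the short calculation below compares the root mean squared error of the four-predictor model with that of the model using the engineered features. Both values are copied from the outputs above; the variable names are hypothetical.

```python
rmse_base = 1949.3431360586703         # model with the four weather predictors
rmse_engineered = 1776.9133380713326   # model with hour, weekday, holiday, and weather dummies

# Relative reduction in RMSE from adding the engineered features
reduction = (rmse_base - rmse_engineered) / rmse_base

print(f"RMSE reduction: {reduction:.1%}")
```

The engineered features lower the test-set RMSE by roughly 9%.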


# Logistic Regression Models

### Classification vs. Regression

* Classification involves a categorical outcome variable
* Regression involves a continuous outcome variable
* Despite its name, "logistic regression" is used for classification: it models the probability of a categorical outcome
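
The link between a linear model and a categorical outcome is the logistic (sigmoid) function, which turns a linear predictor into a probability between 0 and 1. The sketch below illustrates this for a single predictor. The intercept, slope, and `x` values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and slope for a single predictor x
beta_0, beta_1 = -4.0, 0.1
x = np.array([20.0, 40.0, 60.0])

linear_predictor = beta_0 + beta_1 * x      # log-odds of the positive class
prob = sigmoid(linear_predictor)            # P(y = 1 | x)
label = (prob >= 0.5).astype(int)           # classify with a 0.5 threshold

print(prob.round(3), label)
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, which is why a threshold on the probability is equivalent to a threshold on the linear predictor.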


<center><img src=images/train_test_split.png width = 1080/></center>

# `adult` Data
  * Popular census data 
  * UCI ML repository

In [22]:
import sklearn

In [23]:
from sklearn.linear_model import LogisticRegression

In [24]:
help(LogisticRegression)

Help on class LogisticRegression in module sklearn.linear_model._logistic:

class LogisticRegression(sklearn.linear_model._base.LinearClassifierMixin, sklearn.linear_model._base.SparseCoefMixin, sklearn.base.BaseEstimator)
 |  LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)
 |  
 |  Logistic Regression (aka logit, MaxEnt) classifier.
 |  
 |  In the multiclass case, the training algorithm uses the one-vs-rest (OvR)
 |  scheme if the 'multi_class' option is set to 'ovr', and uses the
 |  cross-entropy loss if the 'multi_class' option is set to 'multinomial'.
 |  (Currently the 'multinomial' option is supported only by the 'lbfgs',
 |  'sag', 'saga' and 'newton-cg' solvers.)
 |  
 |  This class implements regularized logistic regression using the
 |  'liblinear' library, 'newton-cg', 's

In [25]:
import pandas as pd
import numpy as np 

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

## Data Exploration

In [26]:
df = pd.read_csv("data/adult.csv", skipinitialspace = True)

df.columns = df.columns.str.replace(" ", "")

print(df.shape)

df.head(100)

(32561, 15)


Unnamed: 0,age,workclass,fnlwgt,education,education_num,marital_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,29,Local-gov,115585,Some-college,10,Never-married,Handlers-cleaners,Not-in-family,White,Male,0,0,50,United-States,<=50K
96,48,Self-emp-not-inc,191277,Doctorate,16,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,1902,60,United-States,>50K
97,37,Private,202683,Some-college,10,Married-civ-spouse,Sales,Husband,White,Male,0,0,48,United-States,>50K
98,48,Private,171095,Assoc-acdm,12,Divorced,Exec-managerial,Unmarried,White,Female,0,0,40,England,<=50K


### Dummy Code Outcome Variable

In [27]:
df["income_gt_50"] = [1 if x == ">50K" else 0 for x in df["income"]]

np.mean(df["income_gt_50"])        # proportion > 50k

0.2408095574460244

In [28]:
xvars = ["age", "education_num", "capital_gain", "hours_per_week"]

X = df.loc[:, xvars].values              # get X values (i.e., predictors/features)
y = df.loc[:, "income_gt_50"].values     # get y values (i.e., outcome/target variable)

## Train/Test Split

In [29]:
# Split training/test data at random

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Fitting Logistic Regression Model

In [30]:
mod = LogisticRegression()       # create model object

mod.fit(X_train, y_train)        # fit model 

LogisticRegression()

In [31]:
print(mod.coef_)               # show regression coefficients 

[[4.39336041e-02 3.29083778e-01 3.12547546e-04 4.29242150e-02]]
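
Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are often easier to interpret. The sketch below applies this to the coefficients printed above. Keep in mind that `LogisticRegression` applies L2 regularization by default, so these values are somewhat shrunken estimates.

```python
import numpy as np

xvars = ["age", "education_num", "capital_gain", "hours_per_week"]

# Coefficients copied from the fitted model output above
coefs = np.array([4.39336041e-02, 3.29083778e-01, 3.12547546e-04, 4.29242150e-02])

# exp(coef) is the multiplicative change in the odds for a one-unit increase
odds_ratios = np.exp(coefs)

for name, ratio in zip(xvars, odds_ratios):
    print(f"{name:>15}: {ratio:.4f}")
```

For example, each additional unit of `education_num` multiplies the estimated odds of income above 50K by roughly 1.39, holding the other predictors fixed.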


## Making Predictions with Fitted Model

In [32]:
# Use our fitted model to make predictions using test set 

y_pred = mod.predict(X_test)
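
`predict` returns hard class labels, while `predict_proba` returns the estimated class probabilities behind them. The sketch below shows how the two relate for a binary model. It uses synthetic stand-in data rather than the `adult` data, and `X_demo`, `y_demo`, and `demo_mod` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data; NOT the adult data set
X_demo = rng.normal(size=(300, 2))
noise = rng.normal(scale=0.5, size=300)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

demo_mod = LogisticRegression().fit(X_demo, y_demo)

proba = demo_mod.predict_proba(X_demo)[:, 1]   # estimated P(y = 1) for each row
labels = demo_mod.predict(X_demo)              # hard 0/1 class labels

# For a binary model, predict() matches thresholding the probability at 0.5
same = np.array_equal(labels, (proba > 0.5).astype(int))
print(same)
```

Working with the probabilities directly makes it possible to choose a threshold other than 0.5 when the two kinds of error have different costs.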

In [33]:
# Print our metrics of model adequacy 

print("F1 Score:    ", metrics.f1_score(y_test, y_pred))
print("Accuracy:    ", metrics.accuracy_score(y_test, y_pred))

F1 Score:     0.4738336713995943
Accuracy:     0.8008598188238907
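
Accuracy can look strong on imbalanced data even when the model finds few of the positive cases. Since about 24% of the rows in the full data have income above 50K, always predicting the majority class would already be right roughly 76% of the time. The sketch below computes accuracy and F1 by hand on a small set of made-up labels and compares accuracy with that kind of majority-class baseline.

```python
import numpy as np

y_true = np.array([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])   # hypothetical labels (30% positive)
y_hat = np.array([1, 0, 0, 0, 0, 0, 0, 1, 0, 0])    # hypothetical predictions

tp = np.sum((y_true == 1) & (y_hat == 1))   # true positives
fp = np.sum((y_true == 0) & (y_hat == 1))   # false positives
fn = np.sum((y_true == 1) & (y_hat == 0))   # false negatives

accuracy = np.mean(y_true == y_hat)
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)

# Accuracy of always predicting the majority class (0)
baseline_accuracy = np.mean(y_true == 0)

print(accuracy, f1, baseline_accuracy)
```

In this toy example the accuracy of 0.7 is no better than the baseline, while the F1 score of 0.4 reflects the missed positive cases.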
