# Advanced Regression Techniques 

Multiple regression usually fits better than univariate regression because it finds a model solution with **lower SSE** (higher r-squared). However, if there are **too many X variables (e.g., > 10) in a regression model**, the model becomes **too complex (overfitted) to be useful** in practice due to **multicollinearity and difficulty of interpretation**. This problem is also known as the **"curse of dimensionality"**.

Thus, **more advanced regression techniques are needed to deal with this issue**.

> ## 1. Regularization 
> - refers to **the process of penalizing a model that has too many redundant (or highly correlated) variables**
> - the goal is to develop a regression model with **low SSE** and **simplicity (fewer X variables or predictors)**
> - **two types of regression include regularization in their objective function: Lasso and Ridge**
> - http://scikit-learn.org/stable/modules/linear_model.html

> ## 2. Feature selection
> - refers to **the process of selecting the most useful predictors**, helping analysts understand which predictors matter in predicting the y value
> - the goal is to develop a **simple** regression model with a **user-specified number of predictors**
> - examples: f_regression (with SelectKBest) and RFE

# Lasso regression (Regularization)

- Lasso (Least Absolute Shrinkage and Selection Operator) is one of the **regression models with regularization**
- It finds a model solution with **fewer X variables**

> #### How does Lasso work?

> - Remember that the goal of regression is to **minimize SSE**, and adding more predictors is likely to reduce **SSE**
> - Thus, Lasso includes **regularization**, **a mechanism that penalizes adding too many variables**.

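                  minimize (SSE + alpha * sum of |coefficient|)

                  where alpha = penalty parameter (the larger alpha is, the more each non-zero coefficient costs)

                  note: scikit-learn's Lasso scales the SSE term by 1 / (2 * number of samples)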

> - This penalty pushes Lasso regression toward fewer predictors (a simpler regression model) by shrinking the coefficients of less useful predictors to 0
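
The effect of alpha can be checked on a small example. The sketch below is not part of the baseball analysis: it uses synthetic data (an assumption made purely for illustration) in which only 3 of 10 predictors influence y, and counts how many coefficients stay non-zero as alpha grows. The names `X_demo`, `y_demo`, and `nonzero_counts` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption for illustration): 10 predictors,
# but only the first 3 actually influence y.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.array([5.0, -3.0, 2.0, 0, 0, 0, 0, 0, 0, 0])
y_demo = X_demo.dot(true_coef) + rng.normal(scale=0.5, size=200)

# Fit Lasso for increasing alpha and count the surviving predictors.
nonzero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=alpha).fit(X_demo, y_demo)
    nonzero_counts[alpha] = int(np.sum(model.coef_ != 0))
    print("alpha={:<5} non-zero coefficients: {}".format(alpha, nonzero_counts[alpha]))
```

With a small alpha the model should keep most predictors, and with a large enough alpha every coefficient is driven to 0, so alpha is best treated as a tuning parameter rather than a fixed value.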

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#regression packages
import sklearn.linear_model as lm
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
import statsmodels.formula.api as sm

#lasso regression
from sklearn import linear_model

#f_regression (feature selection)
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import SelectKBest

# recursive feature elimination (feature selection)
from sklearn.feature_selection import RFE

In [None]:
teams = pd.read_csv("data/baseball.csv")
teams.head()

On Base Percentage (OBP), also called On Base Average (OBA), is a measure of how often a batter reaches base.

The full formula is OBP = (Hits + Walks + Hit by Pitch) / (At Bats + Walks + Hit by Pitch + Sacrifice Flies). Batters are not credited with reaching base on an error or fielder's choice, and they are not charged with an opportunity if they make a sacrifice bunt.
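
As a quick check of the formula, here is a minimal sketch that computes OBP from raw counts. The function name `on_base_percentage` and the season line passed to it are hypothetical, and returning 0.0 when there are no opportunities is an assumption, not a baseball convention.

```python
def on_base_percentage(hits, walks, hit_by_pitch, at_bats, sacrifice_flies):
    """OBP = (H + BB + HBP) / (AB + BB + HBP + SF)."""
    times_on_base = hits + walks + hit_by_pitch
    opportunities = at_bats + walks + hit_by_pitch + sacrifice_flies
    if opportunities == 0:
        return 0.0  # assumption: avoid division by zero for an empty stat line
    return float(times_on_base) / opportunities

# Hypothetical season line, not a real player's statistics.
obp = on_base_percentage(hits=150, walks=60, hit_by_pitch=5,
                         at_bats=500, sacrifice_flies=5)
print("OBP: {:.3f}".format(obp))  # 215 / 570, prints "OBP: 0.377"
```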

All-time leaders:

- Ted Williams: .482 (career)
- Barry Bonds: .609 (2004 season)

http://www.baseball-reference.com/bullpen/On_base_percentage

In [None]:
teams = teams.drop(['yearID', 'teamID', 'Rank'], axis=1)
teams.head(2)

In [None]:
#assigning columns to X and Y variables
y = teams['R'] 
X = teams.drop(['R'], axis =1)

In [None]:
#Fit the model
model1 = linear_model.Lasso(alpha=1)         #higher alpha (penalty parameter), fewer predictors
model1.fit(X, y)
model1_y = model1.predict(X)

In [None]:
print("Coefficients: {}".format(model1.coef_))
print("y-intercept: {}".format(model1.intercept_))

In [None]:
# pair each column name with its coefficient, formatted to 3 decimals
coef = ["%.3f" % i for i in model1.coef_]
xcolumns = list(X.columns)
list(zip(xcolumns, coef))

The regression model has become a lot simpler than the full model with all X variables. Several X variables were removed from the model, including **G, AB, salary, BA, OBP, and SLG**. The coefficients of these removed variables are **0 or very close to 0**.

R = 0.212RA + 2.113W + 0.471H + 0.242BB + ...

In [None]:
print("mean square error: {}".format(mean_squared_error(y, model1_y)))
print("explained variance (r-squared): {}".format(explained_variance_score(y, model1_y)))
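
The cell above reports `explained_variance_score` as r-squared. The two agree when the residuals average to zero, which should be the case for in-sample predictions from a linear model fitted with an intercept, but they can differ otherwise. The toy numbers below are an assumption chosen to make the difference visible: every prediction is exactly 1 too high.

```python
from sklearn.metrics import explained_variance_score, r2_score

y_true = [1.0, 2.0, 3.0]
y_pred = [2.0, 3.0, 4.0]   # every prediction is off by the same constant

# Explained variance ignores a constant offset in the errors.
ev = explained_variance_score(y_true, y_pred)
# r-squared penalizes the offset because it uses the raw squared errors.
r2 = r2_score(y_true, y_pred)

print("explained variance: {:.2f}".format(ev))  # 1.00
print("r-squared: {:.2f}".format(r2))           # -0.50
```

When scoring predictions on new data, `r2_score` is probably the safer choice, since a systematic bias would otherwise go unnoticed.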

# f_regression (Feature Selection)

- Quick linear model for testing the effect of a single predictor, sequentially for many predictors.
- http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression

In [None]:
#select only 2 X variables
X_new = SelectKBest(f_regression, k=2).fit_transform(X, y)
X_new

f_regression determines that **OBP** and **SLG** are the two most important predictors.
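
By default `fit_transform` returns a plain array without column names, so it is not obvious from the output which predictors were kept. One way to recover the names is the selector's `get_support()` mask. The sketch below uses synthetic data with hypothetical column names (`strong`, `medium`, `noise`), not the baseball data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (assumption for illustration): two useful predictors, one noise column.
rng = np.random.RandomState(0)
n = 200
demo = pd.DataFrame({
    "strong": rng.normal(size=n),
    "medium": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
target = 3.0 * demo["strong"] + 1.5 * demo["medium"] + rng.normal(scale=0.5, size=n)

# Keep the fitted selector so its boolean mask can be inspected.
selector = SelectKBest(f_regression, k=2).fit(demo, target)
selected = list(demo.columns[selector.get_support()])
print("selected columns: {}".format(selected))

# F-scores per column, highest first.
scores = sorted(zip(selector.scores_, demo.columns), reverse=True)
print(scores)
```

Applied to the notebook's data, the same pattern would be `X.columns[selector.get_support()]` after fitting the selector on `X` and `y`.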

In [None]:
model2 = lm.LinearRegression()
model2.fit(X_new, y)
model2_y = model2.predict(X_new)

print("mean square error: {}".format(mean_squared_error(y, model2_y)))
print("explained variance (r-squared): {}".format(explained_variance_score(y, model2_y)))

In [None]:
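# use f_regression with k = 3 and develop a new regression model
X_new3 = SelectKBest(f_regression, k=3).fit_transform(X, y)

model3 = lm.LinearRegression()
model3.fit(X_new3, y)
model3_y = model3.predict(X_new3)

print("mean square error: {}".format(mean_squared_error(y, model3_y)))
print("explained variance (r-squared): {}".format(explained_variance_score(y, model3_y)))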

# Recursive Feature Elimination (RFE): Another Feature Selection Method

In [None]:
lr = lm.LinearRegression()
rfe = RFE(lr, n_features_to_select=2)
rfe_y = rfe.fit(X,y)

print("Features sorted by their rank:")
print(sorted(zip(map(lambda x: x, rfe.ranking_), X.columns)))

In [None]:
# or you can do something like this: zip first and sort the data

print(sorted(zip(rfe.ranking_, X.columns)))
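
Besides `ranking_`, a fitted RFE object exposes a boolean `support_` mask that marks the selected features directly. The sketch below shows both on synthetic data; the column names `a` to `d` and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption for illustration): only columns "a" and "c" drive the target.
rng = np.random.RandomState(1)
n = 200
demo = pd.DataFrame(rng.normal(size=(n, 4)), columns=["a", "b", "c", "d"])
target = 4.0 * demo["a"] - 2.0 * demo["c"] + rng.normal(scale=0.5, size=n)

# RFE repeatedly drops the weakest feature until 2 remain.
rfe_demo = RFE(LinearRegression(), n_features_to_select=2).fit(demo, target)

kept = list(demo.columns[rfe_demo.support_])           # features with rank 1
ranking = dict(zip(demo.columns, rfe_demo.ranking_))   # 1 = selected, larger = dropped earlier
print("kept: {}".format(kept))
print("ranking: {}".format(ranking))
```

With a linear model, RFE judges features by the size of their coefficients, so predictors measured on very different scales may need standardizing first for the ranking to be meaningful.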

# Appendix: with fewer predictors

In [None]:
#First Model
runs_reg_model1 = sm.ols("R~OBP+SLG+BA",teams)
runs_reg1 = runs_reg_model1.fit()
#Second Model
runs_reg_model2 = sm.ols("R~OBP+SLG",teams)
runs_reg2 = runs_reg_model2.fit()
#Third Model
runs_reg_model3 = sm.ols("R~BA",teams)
runs_reg3 = runs_reg_model3.fit()

- The first model uses OBP, SLG and BA as features.
- The second model uses OBP and SLG as features.
- The third model uses BA only.

In [None]:
print(runs_reg1.summary())
print(runs_reg2.summary())
print(runs_reg3.summary())

- The first model has an adjusted R-squared of 0.918, and the 95% confidence interval for the BA coefficient runs from -283 to 468. This is counterintuitive, since we expect the BA coefficient to be positive. An interval this wide that also spans zero is a symptom of **multicollinearity** among the predictors.

- The second model has an adjusted R-squared of 0.919, and the last model has an adjusted R-squared of 0.500.

- Based on this analysis, the second model using **OBP** and **SLG** is the best of the three for predicting runs scored.
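
One common way to quantify multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R-squared). The sketch below is a NumPy-only illustration on synthetic data, not a result from the baseball data; the helper `variance_inflation_factors` is a hypothetical name, and the "above 5 or 10" threshold is only a rule of thumb.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])  # add an intercept
        coef = np.linalg.lstsq(design, target, rcond=None)[0]
        residuals = target - design.dot(coef)
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return vifs

# Synthetic predictors (assumption): x1 and x2 are near copies, x3 is independent.
rng = np.random.RandomState(0)
base = rng.normal(size=300)
x1 = base + rng.normal(scale=0.1, size=300)
x2 = base + rng.normal(scale=0.1, size=300)
x3 = rng.normal(size=300)

vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(["{:.1f}".format(v) for v in vifs])
```

The two near-duplicate columns should show large VIFs while the independent column stays close to 1. Running the same helper on `teams[['OBP', 'SLG', 'BA']]` would be one way to check how strongly BA overlaps with the other two predictors.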

# References

- http://adilmoujahid.com/posts/2014/07/baseball-analytics/ (reproduced from this page)
- https://www.analyticsvidhya.com/blog/2016/01/complete-tutorial-ridge-lasso-regression-python/#four
- http://www.python-course.eu/lambda.php (excellent resource for lambda and map)