Engineer your features, then create three models. Each model will be run on a training set and a test set (or multiple test sets, if you take a folds approach). The models should be:

* Vanilla logistic regression

* Ridge logistic regression

* Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn LogisticRegression class has a "penalty" argument that takes either 'l1' or 'l2' as a value.

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?
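
One way to set up the comparison is sketched below. It is a minimal example on synthetic data (the arrays, the seed, and the C values are illustrative assumptions, not taken from the football data). It fits the three variants on a training split and scores them on a held-out test split. A very large C stands in for the unpenalized model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 3 features, binary outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]  # the third feature is pure noise
y = (rng.random(400) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# liblinear supports both the L1 and the L2 penalty.
models = {
    "vanilla": LogisticRegression(C=1e9, solver="liblinear"),
    "ridge": LogisticRegression(penalty="l2", C=1.0, solver="liblinear"),
    "lasso": LogisticRegression(penalty="l1", C=1.0, solver="liblinear"),
}

test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = model.score(X_test, y_test)  # held-out accuracy

print(test_accuracy)
```

Scoring on data the model has not seen is what makes the three numbers comparable; in-sample accuracy tends to flatter the least-regularized model.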

In [56]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
import math
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import preprocessing
%matplotlib inline

In [58]:
football = pd.read_csv("/Users/Jenny/Documents/Thinkful/international football data/results.csv")
football.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
0,1872-11-30,Scotland,England,0,0,Friendly,Glasgow,Scotland,False
1,1873-03-08,England,Scotland,4,2,Friendly,London,England,False
2,1874-03-07,Scotland,England,2,1,Friendly,Glasgow,Scotland,False
3,1875-03-06,England,Scotland,2,2,Friendly,London,England,False
4,1876-03-04,Scotland,England,3,0,Friendly,Glasgow,Scotland,False


In [65]:
football.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38929 entries, 0 to 38928
Data columns (total 9 columns):
date          38929 non-null object
home_team     38929 non-null object
away_team     38929 non-null object
home_score    38929 non-null int64
away_score    38929 non-null int64
tournament    38929 non-null object
city          38929 non-null object
country       38929 non-null object
neutral       38929 non-null bool
dtypes: bool(1), int64(2), object(6)
memory usage: 2.4+ MB


In [84]:
# Look at the matches where Spain is the home team.
# Our binary outcome will be: did Spain win?
# .copy() gives an independent DataFrame, so adding columns later
# does not trigger SettingWithCopyWarning.
spain = football[football.home_team == 'Spain'].copy()
spain.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral
602,1921-12-18,Spain,Portugal,3,1,Friendly,Madrid,Spain,False
673,1923-01-28,Spain,France,3,0,Friendly,San Sebastián,Spain,False
753,1923-12-16,Spain,Portugal,3,0,Friendly,Seville,Spain,False
833,1924-12-21,Spain,Austria,2,1,Friendly,Barcelona,Spain,False
864,1925-06-14,Spain,Italy,1,0,Friendly,Valencia,Spain,False
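
The copy warnings that show up later in this notebook come from adding columns to a filtered slice of a DataFrame. A minimal sketch of the usual remedy, on a toy frame whose values are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the results table (values are illustrative).
matches = pd.DataFrame({
    "home_team": ["Spain", "France", "Spain"],
    "home_score": [3, 1, 0],
    "away_score": [1, 1, 2],
})

# .copy() makes the filtered frame independent of `matches`.
home_spain = matches[matches["home_team"] == "Spain"].copy()
home_spain["spain_win"] = (
    home_spain["home_score"] > home_spain["away_score"]
).astype(int)

print(home_spain)
print("spain_win" in matches.columns)  # the original frame is untouched
```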


In [17]:
print(spain['home_score'].mean())
print(spain['away_score'].mean())

2.32378223495702
0.7449856733524355


### Feature engineering

In [73]:
# Cast the score column and the engineered indicator columns to float.
# NOTE: columns[4:5] selects only away_score, and columns[9:13] are the
# indicator columns created in the next cell, so that cell has to run first.
numerical_columns = list(spain.columns[4:5]) + list(spain.columns[9:13])
spain[numerical_columns] = spain[numerical_columns].astype(float)

spain.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,away_score_large,home_score_large,friendly,world_cup_qual,spain_win
0,1872-11-30,Scotland,England,0,0.0,Friendly,Glasgow,Scotland,0,0.0,0.0,1.0,0.0,0
1,1873-03-08,England,Scotland,4,2.0,Friendly,London,England,0,1.0,1.0,1.0,0.0,1
2,1874-03-07,Scotland,England,2,1.0,Friendly,Glasgow,Scotland,0,0.0,1.0,1.0,0.0,1
3,1875-03-06,England,Scotland,2,2.0,Friendly,London,England,0,0.0,1.0,1.0,0.0,0
4,1876-03-04,Scotland,England,3,0.0,Friendly,Glasgow,Scotland,0,1.0,0.0,1.0,0.0,1


In [102]:
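# Binary indicator features.
# NOTE: the first two names are swapped relative to what they measure:
# away_score_large flags matches where the HOME side scored more than 2 goals,
# and home_score_large flags matches where the AWAY side scored at least once.
spain['away_score_large'] = np.where(spain['home_score'] > 2, 1, 0)
spain['home_score_large'] = np.where(spain['away_score'] > 0, 1, 0)
spain['neutral'] = np.where(spain['neutral'] == True, 1, 0)
spain['friendly'] = np.where(spain['tournament'] == 'Friendly', 1, 0)
spain['world_cup_qual'] = np.where(spain['tournament'] == 'FIFA World Cup qualification', 1, 0)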
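
As a small sketch of the same idea with names that say what each flag measures, the example below builds the indicators on a toy frame. The column names `home_score_gt2` and `away_scored` are hypothetical, chosen here for illustration, and the rows are made up.

```python
import pandas as pd

# Toy rows (illustrative values only).
toy = pd.DataFrame({
    "home_score": [3, 2, 1],
    "away_score": [1, 0, 2],
    "tournament": ["Friendly", "FIFA World Cup qualification", "Friendly"],
    "neutral": [False, True, False],
})

# A boolean comparison cast to int gives the same 0/1 coding as np.where(cond, 1, 0).
features = pd.DataFrame({
    "home_score_gt2": (toy["home_score"] > 2).astype(int),
    "away_scored": (toy["away_score"] > 0).astype(int),
    "neutral": toy["neutral"].astype(int),
    "friendly": (toy["tournament"] == "Friendly").astype(int),
    "world_cup_qual": (toy["tournament"] == "FIFA World Cup qualification").astype(int),
})

print(features)
```

Keep in mind that any feature built from the final score is only known after the match, so it describes the result rather than predicting it in advance.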

In [75]:
spain.head()

Unnamed: 0,date,home_team,away_team,home_score,away_score,tournament,city,country,neutral,away_score_large,home_score_large,friendly,world_cup_qual,spain_win
0,1872-11-30,Scotland,England,0,0.0,Friendly,Glasgow,Scotland,0,0,0,1,0,0
1,1873-03-08,England,Scotland,4,2.0,Friendly,London,England,0,1,1,1,0,1
2,1874-03-07,Scotland,England,2,1.0,Friendly,Glasgow,Scotland,0,0,1,1,0,1
3,1875-03-06,England,Scotland,2,2.0,Friendly,London,England,0,0,1,1,0,0
4,1876-03-04,Scotland,England,3,0.0,Friendly,Glasgow,Scotland,0,1,0,1,0,1


In [108]:
new_spain = spain[['home_score', 'away_score', 'neutral', 'away_score_large', 'home_score_large', 'friendly', 'world_cup_qual']]
new_spain.head()

Unnamed: 0,home_score,away_score,neutral,away_score_large,home_score_large,friendly,world_cup_qual
602,3,1,0,1,1,1,0
673,3,0,0,1,0,1,0
753,3,0,0,1,0,1,0
833,2,1,0,0,1,1,0
864,1,0,0,0,0,1,0


In [109]:
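# Work on an explicit copy so that adding the target column does not
# raise SettingWithCopyWarning.
spain_copy = new_spain.copy()
# Target: 1 if Spain (the home side) won, 0 for a draw or a loss.
spain_copy['spain_win'] = np.where(spain_copy['home_score'] > spain_copy['away_score'], 1, 0)
spain_copy.head()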

  


Unnamed: 0,home_score,away_score,neutral,away_score_large,home_score_large,friendly,world_cup_qual,spain_win
602,3,1,0,1,1,1,0,1
673,3,0,0,1,0,1,0,1
753,3,0,0,1,0,1,0,1
833,2,1,0,0,1,1,0,1
864,1,0,0,0,0,1,0,1


## Vanilla Logistic Regression

### Start with statsmodels

In [126]:
import statsmodels.api as sm

# Declare predictors: can we predict whether or not Spain wins from
# whether the match is a friendly and whether it is on neutral ground?
X_statsmod = new_spain[['friendly', 'neutral']].copy()

# sm.Logit does not add an intercept automatically, so add one by hand.
X_statsmod['intercepts'] = 1

# Declare and fit the model.
logit = sm.Logit(spain_copy['spain_win'], X_statsmod)
result = logit.fit()

print(result.summary())

Optimization terminated successfully.
         Current function value: 0.611997
         Iterations 5
                           Logit Regression Results                           
Dep. Variable:              spain_win   No. Observations:                  349
Model:                          Logit   Df Residuals:                      346
Method:                           MLE   Df Model:                            2
Date:                Tue, 17 Jul 2018   Pseudo R-squ.:                 0.02142
Time:                        18:48:39   Log-Likelihood:                -213.59
converged:                       True   LL-Null:                       -218.26
                                        LLR p-value:                  0.009326
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
friendly      -0.6234      0.251     -2.483      0.013      -1.115      -0.131
neutral       -0.8028      0.
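
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The short sketch below does that for the `friendly` row of the summary above; the numbers are copied from that row and nothing else is assumed.

```python
import math

coef_friendly = -0.6234            # coefficient from the summary table
ci_low, ci_high = -1.115, -0.131   # 95% interval from the same row

odds_ratio = math.exp(coef_friendly)
odds_ratio_ci = (math.exp(ci_low), math.exp(ci_high))

print(round(odds_ratio, 3))
print(tuple(round(v, 3) for v in odds_ratio_ci))
```

An odds ratio of about 0.54 means that, with `neutral` held fixed, the odds of a Spain win in a friendly are roughly half the odds in a non-friendly match. The interval stays below 1, in line with the p-value of 0.013.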



In [128]:
#calc. accuracy. Get probability that each game will be won by spain
pred_statsmod = result.predict(X_statsmod)

#code win as 1 if probability is greater than .5
pred_y_statsmod = np.where(pred_statsmod < .5, 0, 1)

#accuracy table
table = pd.crosstab(spain_copy['spain_win'], pred_y_statsmod)

print('\nAccuracy by win status')
print(table)
print('\nPercentage accuracy')
print((table.iloc[0,0] + table.iloc[1,1]) / (table.sum().sum()))


Accuracy by win status
col_0       0    1
spain_win         
0          22   89
1          13  225

Percentage accuracy
0.707736389685
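
Accuracy alone hides how lopsided these predictions are. The sketch below recomputes it from the crosstab counts printed above and sets it next to a majority-class baseline (always predict a Spain win) and the per-class recall. Only the four counts are taken from the output; the variable names are illustrative.

```python
# Counts copied from the crosstab above (rows: actual, columns: predicted).
tn, fp = 22, 89     # actual non-wins: predicted 0, predicted 1
fn, tp = 13, 225    # actual wins:     predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
baseline = (fn + tp) / total       # always predict "Spain wins"
recall_win = tp / (tp + fn)        # share of wins predicted as wins
recall_non_win = tn / (tn + fp)    # share of non-wins predicted as non-wins

print(f"accuracy       {accuracy:.3f}")
print(f"baseline       {baseline:.3f}")
print(f"recall win     {recall_win:.3f}")
print(f"recall non-win {recall_non_win:.3f}")
```

The model beats the always-win baseline by only a few points, and it identifies about one in five of the matches Spain failed to win.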


In [127]:
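# Second model: swap the friendly flag for away_score_large.
X_statsmod = new_spain[['away_score_large', 'neutral']].copy()

# Add the intercept column by hand, as before.
X_statsmod['intercepts'] = 1

# Declare and fit the model.
logit = sm.Logit(spain_copy['spain_win'], X_statsmod)
result = logit.fit()

print(result.summary())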

Optimization terminated successfully.
         Current function value: 0.462558
         Iterations 9
                           Logit Regression Results                           
Dep. Variable:              spain_win   No. Observations:                  349
Model:                          Logit   Df Residuals:                      346
Method:                           MLE   Df Model:                            2
Date:                Tue, 17 Jul 2018   Pseudo R-squ.:                  0.2604
Time:                        18:49:30   Log-Likelihood:                -161.43
converged:                       True   LL-Null:                       -218.26
                                        LLR p-value:                 2.086e-25
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
away_score_large     4.7553      1.014      4.690      0.000       2.768       6.742
neutral    



In [129]:
#calc. accuracy. Get probability that each game will be won by spain
pred_statsmod = result.predict(X_statsmod)

#code win as 1 if probability is greater than .5
pred_y_statsmod = np.where(pred_statsmod < .5, 0, 1)

#accuracy table
table = pd.crosstab(spain_copy['spain_win'], pred_y_statsmod)

print('\nAccuracy by win status')
print(table)
print('\nPercentage accuracy')
print((table.iloc[0,0] + table.iloc[1,1]) / (table.sum().sum()))


Accuracy by win status
col_0       0    1
spain_win         
0          22   89
1          13  225

Percentage accuracy
0.707736389685


### Now using SKLearn

In [125]:
from sklearn.linear_model import LogisticRegression

# Declare a logistic regression classifier.
# C is the inverse of the regularization strength, so a very large C
# effectively switches the penalty off (vanilla logistic regression).
lr = LogisticRegression(C=1e9)
y = spain_copy['spain_win']
X = new_spain[['away_score_large', 'neutral']]

# Fit the model.
fit = lr.fit(X, y)

# Display.
print('Coefficients')
print(fit.coef_)
print(fit.intercept_)
pred_y_sklearn = lr.predict(X)

print('\n Accuracy by win status')
print(pd.crosstab(pred_y_sklearn, y))

print('\n Percentage accuracy')
print(lr.score(X, y))

Coefficients
[[ 4.75449729 -0.6592537 ]]
[ 0.16207944]

 Accuracy by win status
spain_win   0    1
row_0             
0          22   13
1          89  225

 Percentage accuracy
0.707736389685


**The statsmodels and scikit-learn accuracies are the same!**

## Ridge Logistic Regression
L2 regularization
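
For reference, the L2-penalized objective that scikit-learn's LogisticRegression minimizes can be written as follows. The notation is introduced here for illustration: labels are coded as $y_i \in \{-1, +1\}$, $w$ is the coefficient vector, $b$ the intercept, and $C$ the inverse regularization strength.

$$
\min_{w,\,b}\ \frac{1}{2}\lVert w\rVert_2^2 \;+\; C\sum_{i=1}^{n}\log\!\left(1+\exp\!\left(-y_i\,(x_i^\top w+b)\right)\right)
$$

A larger $C$ puts more weight on the data-fit term, so the solution approaches plain maximum likelihood; a smaller $C$ shrinks $w$ toward zero. The L1 (lasso) variant replaces $\frac{1}{2}\lVert w\rVert_2^2$ with $\lVert w\rVert_1$.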

In [130]:
y = spain_copy['spain_win']
x = new_spain[['away_score_large', 'neutral']]

In [136]:
# Try a range of regularization values and record the accuracy for each.
# Note: in scikit-learn, C is the inverse of the regularization strength
# (smaller C = stronger penalty), and .score() returns accuracy, not R-squared.
alphas = np.arange(0.1, 5, 1)
lr_ridge = LogisticRegression(penalty='l2')
ridge_r_squared = []

# Train model with different regularization values
for a in alphas:
    lr_ridge.set_params(C=a, fit_intercept=False)
    lr_ridge.fit(x, y)
    ridge_r_squared.append(lr_ridge.score(x, y))

In [134]:
# Get values
ridge_r_squared

[0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004]
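
The scores above are measured on the same rows the model was fit on, which makes it hard to see any effect of C. A common alternative is to choose C by cross-validated accuracy. The sketch below shows the pattern on synthetic data; the data, the seed, and the grid are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption): 4 features, 2 of them informative.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 4))
noise = rng.normal(scale=1.0, size=300)
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + noise > 0).astype(int)

c_grid = [0.01, 0.1, 1.0, 10.0]
cv_means = []
for c in c_grid:
    model = LogisticRegression(penalty="l2", C=c)
    scores = cross_val_score(model, X_demo, y_demo, cv=5)  # accuracy per fold
    cv_means.append(scores.mean())

best_c = c_grid[int(np.argmax(cv_means))]
print(dict(zip(c_grid, np.round(cv_means, 3))))
print("best C:", best_c)
```

With only two binary predictors, as in this notebook, the cross-validated scores may still be nearly flat across C, which is itself worth reporting.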

In [137]:
# Every C value gives the same accuracy, so take the last one in the grid
alphas[4]

4.0999999999999996

In [145]:
from sklearn.model_selection import cross_val_score

# Not much variation observed by changing penalization coefficient.  Select best one.

# Instantiate and set regularization coefficient
lr_ridge = LogisticRegression(penalty='l2', C=4.1, fit_intercept=False)

# Fit model
lr_ridge.fit(x, y)

# Get results
print('Coefficient')
print(lr_ridge.coef_)
print(lr_ridge.intercept_)

# Get predictions
pred_y_r = lr_ridge.predict(x)

print('\nAccuracy by win status')
print(pd.crosstab(pred_y_r, y))

# 5-fold cross-validated accuracy
ridge_scores = cross_val_score(lr_ridge, x, y, cv=5)
print('\nPercentage accuracy')
print(ridge_scores)
print('Mean:', ridge_scores.mean())

Coefficient
[[ 4.16457697 -0.4626811 ]]
0.0

Accuracy by win status
spain_win    0    1
row_0              
0          110  117
1            1  121

Percentage accuracy
[ 0.74647887  0.54285714  0.7         0.60869565  0.71014493]
Mean: 0.661635319161


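* Overall, our accuracy **decreased**: the table above gives (110 + 121) / 349 ≈ 0.662 in-sample, against 0.708 for the vanilla model.  
*Why?* A likely reason is that this model was fit with fit_intercept=False, which forces the intercept to 0.0, while the vanilla model estimated one (0.162). The penalty itself probably matters less here.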
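
The effect of dropping the intercept can be seen in isolation with a small sketch on hand-built data (an assumption, not the football data): one binary feature, with wins common whether the feature is 0 or 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hand-built data (assumption): 100 rows with x = 0 (70 wins),
# 100 rows with x = 1 (90 wins).
x_demo = np.array([0] * 100 + [1] * 100).reshape(-1, 1)
y_demo = np.array([1] * 70 + [0] * 30 + [1] * 90 + [0] * 10)

with_intercept = LogisticRegression(penalty="l2", C=4.1)
with_intercept.fit(x_demo, y_demo)

no_intercept = LogisticRegression(penalty="l2", C=4.1, fit_intercept=False)
no_intercept.fit(x_demo, y_demo)

# Without an intercept, every row with x = 0 has a decision score of exactly 0,
# and predict() only assigns class 1 to strictly positive scores.
acc_with = with_intercept.score(x_demo, y_demo)
acc_without = no_intercept.score(x_demo, y_demo)
print(acc_with, acc_without)
```

In this toy setup the model without an intercept has to call every x = 0 row a non-win, even though most of them are wins, which mirrors the 117 wins predicted as 0 in the ridge table.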

## LASSO Logistic Regression
L1 Regularization

In [152]:
# Repeat the iterative process to find the best value of the penalization coefficient.

# Generate range of C values (inverse regularization strength)
alphas = np.arange(0.01, 1, 0.1)
# The L1 penalty needs a solver that supports it, such as liblinear.
lr_lasso = LogisticRegression(penalty='l1', solver='liblinear')
lasso_r_squared = []

# Train model with different regularization values and record the accuracy
for a in alphas:
    lr_lasso.set_params(C=a, fit_intercept=False)
    lr_lasso.fit(x, y)
    lasso_r_squared.append(lr_lasso.score(x, y))

In [153]:
# Get values
lasso_r_squared

[0.31805157593123207,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004,
 0.66189111747851004]
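
The first score in the list (0.318) equals 111/349, the share of matches Spain did not win, which is what you get when every match is predicted as a non-win. That is consistent with the L1 penalty at C=0.01 shrinking both coefficients to zero, although the notebook does not print them. The sketch below reproduces the effect on hand-built data (an assumption, not the football data).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hand-built data (assumption): one binary feature, wins are common.
x_demo = np.array([0] * 100 + [1] * 100).reshape(-1, 1)
y_demo = np.array([1] * 70 + [0] * 30 + [1] * 90 + [0] * 10)

results = {}
for c in (0.001, 10.0):
    model = LogisticRegression(
        penalty="l1", C=c, solver="liblinear", fit_intercept=False
    )
    model.fit(x_demo, y_demo)
    # Store the fitted coefficient and the in-sample accuracy for this C.
    results[c] = (model.coef_.ravel().copy(), model.score(x_demo, y_demo))
    print(c, results[c])
```

With a strong enough L1 penalty and no intercept, all decision scores are zero and the classifier falls back to class 0 for every row; relaxing the penalty lets the coefficient move away from zero.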

In [154]:
# Instantiate and set regularization coefficient to selected value
# (liblinear is a solver that supports the L1 penalty).
lr_lasso = LogisticRegression(penalty='l1', C=0.9, solver='liblinear')
# Increasing C (weaker regularization) increased accuracy in the search above.

# Fit model
lr_lasso.fit(x, y)

# Get results
print('Coefficient')
print(lr_lasso.coef_)
print(lr_lasso.intercept_)

# Get predictions
pred_y_l = lr_lasso.predict(x)

print('\nAccuracy by win status')
print(pd.crosstab(pred_y_l, y))

# 5-fold cross-validated accuracy
lasso_scores = cross_val_score(lr_lasso, x, y, cv=5)
print('\nPercentage accuracy')
print(lasso_scores)
print('Mean:', lasso_scores.mean())

Coefficient
[[ 3.98549969 -0.47588523]]
[ 0.13471575]

Accuracy by win status
spain_win   0    1
row_0             
0          22   13
1          89  225

Percentage accuracy
[ 0.69014085  0.68571429  0.72857143  0.72463768  0.71014493]
Mean: 0.70784183361


* Accuracy is about the **same** as vanilla logistic regression. 

### Other interesting things we could look at:
* Is using World Cup qualifying rounds better for prediction accuracy?
    * Could be used to look at: *how well does Spain perform under pressure?*
* Whether or not Spain scored more goals than its average would probably be a better predictor.
* How well could we predict based *only* on the away team score?
* How well does this work if Spain is the away team, not the home team?
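
For the last question, a first step would be to express every match from Spain's point of view, whichever side it played on. A minimal sketch on a toy frame follows; the rows are made up, and the column names `spain_goals`, `opponent_goals`, and `spain_home` are hypothetical.

```python
import pandas as pd

# Toy rows (illustrative values only).
toy = pd.DataFrame({
    "home_team": ["Spain", "Italy", "France", "Spain"],
    "away_team": ["Portugal", "Spain", "Spain", "Italy"],
    "home_score": [3, 1, 0, 1],
    "away_score": [1, 1, 2, 1],
})

# Keep every match Spain played, home or away.
involved = (toy["home_team"] == "Spain") | (toy["away_team"] == "Spain")
spain_matches = toy[involved].copy()
is_home = spain_matches["home_team"] == "Spain"

# Goals from Spain's point of view: home_score when home, away_score otherwise.
spain_matches["spain_goals"] = spain_matches["home_score"].where(
    is_home, spain_matches["away_score"]
)
spain_matches["opponent_goals"] = spain_matches["away_score"].where(
    is_home, spain_matches["home_score"]
)
spain_matches["spain_home"] = is_home.astype(int)
spain_matches["spain_win"] = (
    spain_matches["spain_goals"] > spain_matches["opponent_goals"]
).astype(int)

print(spain_matches)
```

With `spain_home` as a predictor, the same models could then estimate how much playing at home changes the odds of a win.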