Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be fit on a training set and evaluated on a test set (or on multiple test sets, if you take a folds approach). The models should be:

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: the scikit-learn LogisticRegression class has a "penalty" argument that takes either 'l1' or 'l2' as a value.

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?
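One possible way to choose the regularization parameter is a cross-validated grid search. The sketch below is only an illustration under stated assumptions: the data is synthetic (generated with `make_classification`), and the grid of `C` values and the choice of accuracy as the scoring metric are arbitrary placeholders, not recommendations from the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own features and labels).
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)

# Scaling lives inside the pipeline so each fold is standardized
# using only its own training portion.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))

# Candidate penalties and strengths; a smaller C means stronger regularization.
param_grid = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": [0.01, 0.1, 1.0, 10.0, 100.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

The best parameters and the mean cross-validated score are then available as `search.best_params_` and `search.best_score_`, which gives a concrete basis for the "regularization parameter selection" discussion in the report.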

Record your work and reflections in a notebook to discuss with your mentor.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

In [2]:
# read_csv does not need xlrd, so no extra import is required here
df = pd.read_csv('binary1.csv')

In [3]:
df.shape

(400, 4)

In [4]:
df.dtypes

admit      int64
gre        int64
gpa      float64
rank       int64
dtype: object

In [5]:
df.head()

   admit  gre   gpa  rank
0      0  380  3.61     3
1      1  660  3.67     3
2      1  800  4.00     1
3      1  640  3.19     4
4      0  520  2.93     4


### Split the data set

In [6]:
# import the train_test_split function
from sklearn.model_selection import train_test_split

# features
x = df[['gre', 'gpa']]

# labels
y = df['admit']

# split the dataset into training and test sets
# test_size=0.5 gives a 50% training set and a 50% test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)
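Because the split is random, each run of the cell produces a different training and test set. A minimal sketch of a reproducible, class-balanced split follows. The small DataFrame `demo` is a hypothetical stand-in with the same column names as the real data, and the seed value 42 is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for binary1.csv with the same column names (random values).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "admit": rng.integers(0, 2, size=400),
    "gre": rng.integers(220, 801, size=400),
    "gpa": rng.uniform(2.0, 4.0, size=400).round(2),
})

x = demo[["gre", "gpa"]]
y = demo["admit"]

# random_state fixes the shuffle so the split is identical on every run;
# stratify=y keeps the share of each class the same in both halves.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=42, stratify=y
)

print(x_train.shape, x_test.shape)
print(y_train.mean(), y_test.mean())
```

Reusing one fixed split for all three models also makes their test scores directly comparable.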

### Vanilla Logistic Regression

In [7]:
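from sklearn import linear_model

# create a vanilla logistic regression
# note: scikit-learn's LogisticRegression applies an L2 penalty by default,
# so C=1.0 is still regularized; a very large C approximates an unpenalized fit
lr = LogisticRegression(C=1.0)

# train the model using the training set
lr.fit(x_train, y_train)

# predict labels for the test set
y_pred = lr.predict(x_test)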

In [8]:
#model accuracy, how often is the classifier correct?
print("accuracy: {}".format(lr.score(x_test,y_test)))

accuracy: 0.665
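An accuracy figure is easier to judge next to the accuracy of always predicting the most common class. The sketch below shows one way to compute that baseline with scikit-learn's `DummyClassifier`. It uses synthetic, imbalanced data as an assumption, so its numbers say nothing about the admissions data above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): roughly 70% zeros and 30% ones.
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0,
    n_clusters_per_class=1, weights=[0.7, 0.3], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

# Baseline: always predict the most frequent class seen in training.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
model = LogisticRegression().fit(X_train, y_train)

baseline_acc = baseline.score(X_test, y_test)
model_acc = model.score(X_test, y_test)

print("baseline accuracy: {:.3f}".format(baseline_acc))
print("model accuracy:    {:.3f}".format(model_acc))
```

A model whose accuracy is close to the baseline is adding little beyond the class frequencies, whatever its raw accuracy looks like.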


### Ridge Regression

In [9]:
# features
x = df[['gre', 'gpa']]

# labels
y = df['admit']

# re-split the dataset (train_test_split was imported in an earlier cell)
# test_size=0.5 gives a 50% training set and a 50% test set;
# with no random_state, this split differs from the earlier one
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

In [10]:
# create a ridge model
# note: linear_model.Ridge is an L2-penalized linear regression, not a logistic classifier
ridge = linear_model.Ridge(alpha=10, fit_intercept=False)

# train the model using the training set
ridge.fit(x_train, y_train)

# predict continuous values for the test set
y_pred = ridge.predict(x_test)

In [11]:
# for a regressor, score() returns R^2 (coefficient of determination), not accuracy
print("R^2: {}".format(ridge.score(x_test, y_test)))

R^2: 0.0001254691813378228
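`linear_model.Ridge` fits a penalized linear regression, so its `score` is R^2 rather than classification accuracy. A ridge logistic regression, as the assignment asks for, could instead be sketched with `LogisticRegression(penalty="l2")`. The example below is an illustration only: the data is synthetic, and `C=0.1` is an arbitrary choice of penalty strength.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own features and labels).
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# C is the inverse of the regularization strength, so C=0.1 is a
# stronger L2 penalty than the default C=1.0. Features are standardized
# first because the penalty is sensitive to feature scale.
ridge_logit = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l2", C=0.1),
)
ridge_logit.fit(X_train, y_train)

# For a classifier, score() returns mean accuracy on the given data.
ridge_acc = ridge_logit.score(X_test, y_test)
print("accuracy: {:.3f}".format(ridge_acc))
```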


### Lasso Regression

In [12]:
# features
x = df[['gpa', 'gre']]

# labels
y = df['admit']

# re-split the dataset (train_test_split was imported in an earlier cell)
# test_size=0.5 gives a 50% training set and a 50% test set;
# with no random_state, this split differs from the earlier ones
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5)

In [13]:
# create a lasso model
# note: linear_model.Lasso is an L1-penalized linear regression, not a logistic classifier
lass = linear_model.Lasso(alpha=0.3)

# train the model using the training set
lass.fit(x_train, y_train)

# predict continuous values for the test set
y_pred = lass.predict(x_test)

In [14]:
# for a regressor, score() returns R^2 (coefficient of determination), not accuracy
print("train R^2: {}".format(lass.score(x_train, y_train)))
print("test R^2: {}".format(lass.score(x_test, y_test)))

train R^2: 0.03496729083741035
test R^2: 0.013775117912969814
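`linear_model.Lasso` is likewise a linear regression, so the two numbers above are R^2 values. A lasso logistic regression could be sketched with `LogisticRegression(penalty="l1")`, which needs a solver that supports the L1 penalty, such as `liblinear`. The example below uses synthetic data and an arbitrary set of `C` values as assumptions. It records how many coefficients each fit sets exactly to zero, which is the feature-selection effect that distinguishes lasso from ridge.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; swap in your own features and labels).
X, y = make_classification(
    n_samples=400, n_features=15, n_informative=5,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

results = {}
for C in (0.01, 0.1, 1.0):
    # Smaller C means a stronger L1 penalty.
    lasso_logit = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l1", solver="liblinear", C=C),
    )
    lasso_logit.fit(X_train, y_train)

    coefs = lasso_logit.named_steps["logisticregression"].coef_.ravel()
    n_zero = int(np.sum(coefs == 0))           # features dropped by the penalty
    acc = lasso_logit.score(X_test, y_test)    # mean accuracy on the test set
    results[C] = (n_zero, acc)
    print("C={}: {} zero coefficients, accuracy {:.3f}".format(C, n_zero, acc))
```

Comparing the zeroed-out count with test accuracy across `C` values is one way to see the trade-off between a sparser model and predictive performance.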
