# Data Dive Week 6: Logistic Regression 

This week we take a look at *logistic regression*, the first classification model we'll be covering in class. We'll be using `scikit-learn` in today's exercise. 


***

As we discussed last week, logistic regression is a *classification model*, meaning that it is designed to estimate the likelihood that a given observed data point belongs to a given class, or category. Today we'll be looking at a real-world application of logistic regression using July 2019 flight data from the U.S. Department of Transportation's [Bureau of Transportation Statistics](https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236).
***
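
To make "likelihood" concrete: logistic regression passes a weighted sum of the features through the logistic (sigmoid) function, which squeezes any real number into the range 0 to 1. The sketch below uses a made-up intercept and coefficient and a hypothetical departure-hour feature purely for illustration; none of these values are estimated from the flight data.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical (made-up) intercept and coefficient, not fitted to any data
intercept = -2.0
coef_dep_hour = 0.1

dep_hour = np.array([6, 12, 18, 22])            # hypothetical departure hours
score = intercept + coef_dep_hour * dep_hour    # linear score (the log-odds)
probability = sigmoid(score)                    # probability of the positive class

print(probability.round(3))
```

A larger score always maps to a larger probability, so the sign of a coefficient tells us the direction in which a feature pushes the prediction.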

![flights](https://media.giphy.com/media/Btn42lfKKrOzS/source.gif)


In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression


In [None]:
df = pd.read_csv('https://grantmlong.com/data/Flights-July2019.csv')
print(df.shape)
print(list(df))
df.sample(5).transpose()

# Identifying a Target

If you're a traveler, which variable might make the most sense to predict? How about if you are a travel booking site?

Create and/or summarize the variable that you think makes the most sense to model. What percentage of flights fall into this category?
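
As one possible starting point, here is a minimal sketch on toy data. It assumes a hypothetical `ARR_DELAY` column holding the arrival delay in minutes and an arbitrary 15-minute cutoff for a "bad" flight; both the column name and the cutoff are assumptions, so substitute whatever definition fits your question.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the flight table. ARR_DELAY (minutes) is an assumed
# column name; NaN stands in for a flight with no recorded arrival delay.
toy = pd.DataFrame({'ARR_DELAY': [-5.0, 3.0, 42.0, 17.0, np.nan, 0.0]})

# Assumed definition: a flight is "bad" if it arrives 15 or more minutes late.
toy['BAD_FLIGHT'] = (toy['ARR_DELAY'] >= 15).astype(float)

# Keep missing delays missing rather than silently turning them into 0.
toy.loc[toy['ARR_DELAY'].isna(), 'BAD_FLIGHT'] = np.nan

# The mean of a 0/1 column is the share of rows in the positive class
# (pandas skips NaN values when averaging).
share_bad = toy['BAD_FLIGHT'].mean()
print('%0.1f%% of flights are bad' % (share_bad * 100))
```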


### Ensuring our target has valid values

Every observation needs a valid target value for the model to run properly, so we drop any rows where the target is missing.

In [None]:
target = 'XXX'
df = df.loc[df[target].notna()] 


# Exploring Our Data

As we start to put together a model, we'll want to think about the features that might make sense to use in our model. Ideally, these should be:
 1. Available in advance.
 2. Sensibly related to our target.
 3. Capable of being encoded into a model. 
 
Let's summarize some potential features, and look at the ways in which they correlate with our target.

### Categorical Data

To examine how well some potential categorical variables might work, we can summarize our target by each of the categorical values.

In [None]:
potential_feature = 'XXX'

# Mean of the target and number of flights for each category, highest mean first
summary = (
    df[[potential_feature, target]]
    .groupby(potential_feature)
    .agg(['mean', 'count'])
    .sort_values(by=(target, 'mean'), ascending=False)
)

if df[potential_feature].nunique() <= 20:
    print(summary)
else:
    # Too many categories to print: show only the ten highest and ten lowest
    print(summary.head(10))
    print()
    print(summary.tail(10))

### Continuous Variables 

For continuous features, evaluating their predictive potential is slightly more straightforward. We can look at the distribution of the potential features by each class to gauge their relationship to the target.

In [None]:
potential_feature = 'XXX'

# Blue: flights where the target is 0; red: flights where the target is 1
df.loc[(df[target]==0), potential_feature].hist(bins=20, alpha=.5, density=True, color='blue')
df.loc[(df[target]==1), potential_feature].hist(bins=20, alpha=.5, density=True, color='red')

# Building Feature Sets

For each of the categorical features we'd like to include, we'll have to break each category - or a combination of categories - into a series of binary features. For continuous features, we'll have to ensure each row we'd like to include has a valid value. 
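
To see what "a series of binary features" looks like, here is a small sketch on toy data. `OP_CARRIER` is a column from the flight table, but the rows below are made up, and `top = 2` is chosen only to keep the output small.

```python
import pandas as pd

# Toy rows for illustration; not taken from the flight data.
toy = pd.DataFrame({'OP_CARRIER': ['AA', 'DL', 'AA', 'UA', 'DL', 'AA', 'B6']})

top = 2  # keep only the two most common categories
top_values = toy['OP_CARRIER'].value_counts().index[:top].to_list()

for value in top_values:
    # 1 if the row belongs to this category, 0 otherwise
    toy['OP_CARRIER=' + str(value)] = (toy['OP_CARRIER'] == value) * 1

print(toy)
```

Rows whose category falls outside the top values get a 0 in every new column, so together they act as the baseline group the model compares against.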

In [None]:
####################################################################        
# Categorical features

all_features = []
categorical_features = ['XXX', 'XXX']
top = 10

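for each_feature in categorical_features:
    # One binary column for each of the `top` most common values
    for f in df[each_feature].value_counts().index[:top].to_list():
        # str(f) keeps this working when the category values are not strings
        feature_name = each_feature + '=' + str(f)
        df[feature_name] = (df[each_feature]==f)*1
        all_features.append(feature_name)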

        
####################################################################        
# Continuous features

continuous_features = ['XXX', 'XXX']

for each_feature in continuous_features:
    all_features.append(each_feature)

        
####################################################################        
# Adding a constant
df['constant'] = 1
all_features.append('constant')
    
    
df[all_features].describe().transpose()

# Training A Model in `statsmodels`

As we saw last week, `statsmodels` can be helpful if we want to view the summary statistics of our output. Just as with linear regression, it only takes a few lines of code to use `statsmodels` to fit the model and print the result.

In [None]:
logit = sm.Logit(df[target], df[all_features])
result = logit.fit()
print(result.summary())
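
One way to read the coefficient table: each coefficient is a change in log-odds, so exponentiating it gives an odds ratio. The sketch below uses made-up coefficient values and hypothetical feature names for illustration; with a fitted model, the same transformation could be applied to `result.params`.

```python
import numpy as np
import pandas as pd

# Made-up coefficients for illustration only; not output from a fitted model.
coefs = pd.Series({'feature_a': 0.69, 'feature_b': -0.22, 'constant': -1.5})

# exp(coefficient) is the multiplicative change in the odds of the positive
# class for a one-unit increase in the feature, holding the others fixed.
odds_ratios = np.exp(coefs)
print(odds_ratios.round(2))
```

An odds ratio above 1 means the feature raises the odds of the positive class, and one below 1 means it lowers them.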

# Training A Model in `sklearn`

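We can fit the same model with `scikit-learn`. It does not print a summary table the way `statsmodels` does, but it makes it easy to score the model and generate predictions. Note that `LogisticRegression` applies L2 regularization by default, so its coefficients will generally differ somewhat from the `statsmodels` estimates. Documentation for the `LogisticRegression` object can be found [here](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).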

In [None]:
X = df[all_features].values
y = df[target].values


##### Let's fit and score the model:

In [None]:
clf = LogisticRegression(random_state=20191016, solver='lbfgs').fit(X, y)
print('The accuracy of our model is %0.1f%%' % (clf.score(X, y)*100))
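
The accuracy printed above is measured on the same rows used to fit the model, and it says little on its own when one class is much more common than the other. The sketch below shows one way to check both points, using `train_test_split` on synthetic data. The data and the `_toy` names are assumptions made for illustration, not part of the flight exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: two random features and a noisy binary target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 2))
y_toy = (X_toy[:, 0] + rng.normal(scale=1.0, size=1000) > 1).astype(int)

# Hold out 25% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

clf_toy = LogisticRegression(solver='lbfgs').fit(X_train, y_train)

# Accuracy we would get by always predicting the most common class.
baseline = max(y_test.mean(), 1 - y_test.mean())

print('Held-out accuracy: %0.1f%%' % (clf_toy.score(X_test, y_test) * 100))
print('Majority-class baseline: %0.1f%%' % (baseline * 100))
```

If the held-out accuracy is barely above the baseline, the model is adding little beyond predicting the most common outcome.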


##### Let's see where we went right and where we went wrong:

In [None]:
df['likelihood'] = clf.predict_proba(X)[:,1]

df['likelihood'].hist(bins=20)


Let's take a look at the flights that turned out to be bad but that our model considered least likely to be bad:

In [None]:
interesting_cols = ['FL_DATE', 'OP_CARRIER', 'ORIGIN', 'DEST', 'CRS_DEP_TIME', 'likelihood', target]

(
    df.loc[(df[target]==1), interesting_cols]
    .sort_values(by='likelihood', ascending=True)
    .head(10)
)

Now let's take a look at where we predicted a bad flight, but the flights were not bad.

In [None]:
(
    df.loc[(df[target]==0), interesting_cols]
    .sort_values(by='likelihood', ascending=False)
    .head(10)
)

Finally, let's take a closer look at our summary metrics. How do they change as we change our cutoff value for our prediction?

In [None]:
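def calculate_metrics(df, threshold):
    """Return accuracy, precision, recall, and the number of predicted positives.

    Relies on the `clf`, `X`, `y`, and `target` variables defined above.
    """
    # Predict the positive class whenever the likelihood reaches the threshold
    df['predicted'] = (clf.predict_proba(X)[:,1]>=threshold)*1
    accuracy = sum(df['predicted']==y)/len(y)
    # Precision: share of predicted positives that are actually positive
    precision = df.loc[df.predicted==1, target].mean()
    # Recall: share of actual positives that we predicted as positive
    recall = df.loc[df[target]==1, 'predicted'].mean()

    return accuracy, precision, recall, sum(df['predicted'])

for p in [.1, .2, .3, .4, .5, .6]:
    print(p, calculate_metrics(df.copy(), p))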
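
To check your intuition for how the cutoff trades precision against recall, here is a small worked sketch with six made-up flights. The labels, likelihoods, and the `toy_metrics` helper are all hypothetical and independent of the fitted model.

```python
import numpy as np

# Made-up labels and predicted likelihoods for six flights.
actual = np.array([1, 1, 1, 0, 0, 0])
likelihood = np.array([0.9, 0.6, 0.2, 0.7, 0.3, 0.1])

def toy_metrics(actual, likelihood, threshold):
    """Return accuracy, precision, and recall at a given cutoff."""
    predicted = (likelihood >= threshold) * 1
    accuracy = (predicted == actual).mean()
    # Precision: of the flights we flagged, what share were actually bad?
    precision = actual[predicted == 1].mean()
    # Recall: of the flights that were actually bad, what share did we flag?
    recall = predicted[actual == 1].mean()
    return accuracy, precision, recall

for threshold in [0.25, 0.5, 0.8]:
    print(threshold, toy_metrics(actual, likelihood, threshold))
```

In this toy example, raising the cutoff from 0.25 to 0.8 lifts precision from 1/2 to 1 while recall falls from 2/3 to 1/3: a stricter cutoff flags fewer flights, but a larger share of those flags are correct.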