In [1]:
import pandas as pd
import numpy as np
import scipy
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import linear_model
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
%matplotlib inline

# 3.3.4 Challenge: Compare Logistic Regressions
Now that you have two new regression methods at your fingertips, it's time to give them a spin. In fact, for this challenge, let's put them together! Pick a dataset of your choice with a binary outcome and the potential for at least 15 features. If you're drawing a blank, the crime rates in 2013 dataset has a lot of variables that could be made into a modelable binary outcome.

Engineer your features, then create three models. Each model will be run on a training set and a test-set (or multiple test-sets, if you take a folds approach). The models should be:

- Vanilla logistic regression
- Ridge logistic regression
- Lasso logistic regression

If you're stuck on how to begin combining your two new modeling skills, here's a hint: scikit-learn's LogisticRegression class has a "penalty" argument that accepts either 'l1' or 'l2' as a value.
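As a minimal sketch of that hint, all three variants can be built from the same estimator. The synthetic data from `make_classification`, the `C` values, and the `liblinear` solver are assumptions chosen for illustration, not settings used elsewhere in this notebook. Because `C` is the inverse of the regularization strength, a very large `C` is used here to approximate an unpenalized ("vanilla") fit.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset: 15 features, binary outcome.
X_demo, y_demo = make_classification(n_samples=500, n_features=15,
                                     n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

# Penalized models are sensitive to feature scale, so standardize first,
# fitting the scaler on the training rows only.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

models = {
    # Huge C means a negligible penalty, approximating a vanilla fit.
    'vanilla': LogisticRegression(penalty='l2', C=1e9, solver='liblinear'),
    'ridge': LogisticRegression(penalty='l2', C=1.0, solver='liblinear'),
    'lasso': LogisticRegression(penalty='l1', C=0.1, solver='liblinear'),
}

accuracy = {}
for name, model in models.items():
    model.fit(X_tr_s, y_tr)
    accuracy[name] = model.score(X_te_s, y_te)  # classification accuracy
    print(name, round(accuracy[name], 3))
```

All three scores here are accuracies on the same held-out rows, so they can be compared directly.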

In your report, evaluate all three models and decide on your best. Be clear about the decisions you made that led to these models (feature selection, regularization parameter selection, model evaluation criteria) and why you think that particular model is the best of the three. Also reflect on the strengths and limitations of regression as a modeling approach. Were there things you couldn't do but you wish you could have done?

# Dataset: Online News Popularity
Source: Kelwin Fernandes (kafc '@' inesctec.pt, kelwinfc '@' gmail.com) - INESC TEC, Porto, Portugal/Universidade do Porto, Portugal.
Pedro Vinagre (pedro.vinagre.sousa '@' gmail.com) - ALGORITMI Research Centre, Universidade do Minho, Portugal
Paulo Cortez - ALGORITMI Research Centre, Universidade do Minho, Portugal
Pedro Sernadela - Universidade de Aveiro

Website: https://archive.ics.uci.edu/ml/datasets/online+news+popularity


# Loading data and feature engineering

In [2]:
data = pd.read_csv('onp.csv')
data = data.rename(columns=lambda x: x.strip())

# Binary target: 1 if the article received more than 1400 shares, otherwise 0.
data['shares_binary'] = (data['shares'] > 1400).astype(int)
data['shares_rate'] = data['shares']/data['timedelta']

data = data.drop(['url', 'n_non_stop_words', 'n_non_stop_unique_tokens',
                  'kw_max_max', 'kw_avg_max', 'kw_min_avg'], axis=1)

features=data.columns[7:54]
X = data[features]
X = pd.get_dummies(X)
Y = data['shares_binary']

# Holdout Groups
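from sklearn.model_selection import train_test_split
# Use train_test_split to create the training and test groups.
# Note: `data` (not `X`) is split here, so the feature matrix still contains
# 'shares', 'shares_binary' and 'shares_rate', all derived from the target.
X_train, X_test, Y_train, Y_test = train_test_split(data, Y, test_size=0.2, random_state=20)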
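
The split above is applied to `data`, which still holds the columns the target was computed from. A leak-free variant might look like the sketch below. It runs on a small synthetic frame so that it is self-contained; the frame, its random values, and the two feature columns are assumptions for illustration, and `stratify` is an optional extra rather than something the notebook uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny synthetic frame standing in for the news data (values are random).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'timedelta': rng.integers(8, 731, size=200),
    'n_tokens_title': rng.integers(5, 20, size=200),
    'num_imgs': rng.integers(0, 10, size=200),
    'shares': rng.integers(1, 5000, size=200),
})
demo['shares_binary'] = (demo['shares'] > 1400).astype(int)
demo['shares_rate'] = demo['shares'] / demo['timedelta']

# Everything computed from the outcome must stay out of the feature matrix.
leaky = ['shares', 'shares_binary', 'shares_rate']
X_demo = demo.drop(columns=leaky)
y_demo = demo['shares_binary']

# Split the cleaned features, keeping the class balance similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=20, stratify=y_demo)
print(list(X_tr.columns))
```

With the target-derived columns removed, a near-perfect test score would no longer be expected by construction.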

# Logistic Regression

In [7]:
lr = LogisticRegression(C=.9)
fit = lr.fit(X_train, Y_train)
print(lr.score(X_test, Y_test))

0.9766679278597553

# Ridge Regression

In [4]:
ridgeregr = linear_model.Ridge(alpha=.5, fit_intercept=False)
ridgeregr.fit(X_train, Y_train)
print(ridgeregr.score(X_test, Y_test))

0.9999999954334717



# LASSO Regression

In [5]:
lass = linear_model.Lasso(alpha=.1)
lassfit = lass.fit(X_train, Y_train)
print(lass.score(X_test, Y_test))

0.8298853911184845
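
One thing that may help when reading these numbers: `linear_model.Ridge` and `linear_model.Lasso` are linear regressors, so their `score` method returns R^2, while `LogisticRegression.score` returns classification accuracy. The sketch below checks this on synthetic data; the data and the parameter values are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import accuracy_score, r2_score

X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
reg = Ridge(alpha=0.5).fit(X_demo, y_demo)

# Classifier: .score is the fraction of correctly predicted labels.
clf_score = clf.score(X_demo, y_demo)
same_as_accuracy = accuracy_score(y_demo, clf.predict(X_demo))

# Regressor: .score is R^2 of the continuous predictions against the 0/1 labels.
reg_score = reg.score(X_demo, y_demo)
same_as_r2 = r2_score(y_demo, reg.predict(X_demo))

print('accuracy:', clf_score, same_as_accuracy)
print('R^2:', reg_score, same_as_r2)
```

Because the two quantities are on different scales, an R^2 of 0.83 and an accuracy of 0.98 do not say which model classifies better.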



# Cross Validation Scores

In [6]:
print('vanilla cv score:\n', cross_val_score(lr, X_test, Y_test, cv=10))
print('lasso cv score:\n', cross_val_score(lass, X_test, Y_test, cv=10))
print('ridge cv score:\n', cross_val_score(ridgeregr, X_test, Y_test, cv=10))



vanilla cv score:
 [0.97355164 0.93064313 0.97856242 0.99369483 0.96090794 0.98991173
 0.96595208 0.96090794 0.96464646 0.97853535]
lasso cv score:
 [0.81930842 0.81858638 0.80799404 0.81720592 0.7755907  0.8129662
 0.80845716 0.81940628 0.81310516 0.80348137]
ridge cv score:
 [0.9999999  0.9999999  0.9999999  0.9999999  0.99999988 0.9999999
 0.9999999  0.9999999  0.9999999  0.9999999 ]
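
For comparison, here is a hedged sketch of cross-validating penalized logistic regressions with scaling done inside each fold. The synthetic data, the `C` values, the `liblinear` solver, and the use of a pipeline are all assumptions, not the notebook's own setup. Cross-validation is usually run on the training portion or the full dataset rather than on the held-out test rows alone.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=600, n_features=15,
                                     n_informative=5, random_state=2)

candidates = {
    'ridge (l2)': LogisticRegression(penalty='l2', C=1.0, solver='liblinear'),
    'lasso (l1)': LogisticRegression(penalty='l1', C=1.0, solver='liblinear'),
}

cv_summary = {}
for name, est in candidates.items():
    # The scaler is refit on the training part of every fold,
    # so no information from the validation fold leaks into it.
    pipe = make_pipeline(StandardScaler(), est)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=10, scoring='accuracy')
    cv_summary[name] = (scores.mean(), scores.std())
    print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')
```

Reporting the mean and spread across folds makes it easier to tell whether a difference between two models is larger than the fold-to-fold noise.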


# Best Model

I don't know how to interpret these results: the three scores are not measured on the same scale, and all of them look too high to be trusted.

