We will use an advertising dataset and logistic regression to model whether or not a particular internet user clicked on an ad on a company website. We will try to create a model that predicts whether a user will click on an ad based on that user's features.

Features:

* 'Daily Time Spent on Site': consumer time on site in minutes
* 'Age': customer age in years
* 'Area Income': Avg. Income of geographical area of consumer
* 'Daily Internet Usage': Avg. minutes a day consumer is on the internet
* 'Ad Topic Line': Headline of the advertisement
* 'City': City of consumer
* 'Male': Whether or not consumer was male
* 'Country': Country of consumer
* 'Timestamp': Time at which consumer clicked on Ad or closed window
* 'Clicked on Ad': 0 or 1, indicating whether the consumer clicked on the ad

In [1]:
# import the packages needed 
import pandas as pd
import numpy as np
import seaborn as sns
import statsmodels.api as sm 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
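# import only the metric functions used below from sklearn.metrics
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix)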

# 2 functions to print out metrics (written by us, not in sklearn)

# define a function for calculating the metric to be used later 
# takes in 2 inputs: Y_pred, Y_true
# and uses them to calculate metrics using functions in sklearn
def classification_metrics(Y_pred, Y_true):
    acc = accuracy_score(Y_true, Y_pred)
    precision = precision_score(Y_true, Y_pred)
    recall = recall_score(Y_true, Y_pred)
    f1score = f1_score(Y_true, Y_pred)
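    # use the function argument Y_pred (not a global y_pred)
    # note: AUC here is computed from hard 0/1 predictions, not probabilities
    auc = roc_auc_score(Y_true, Y_pred)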

    # the function's outputs are the 5 variables below
    return acc, precision, recall, f1score, auc

# define a function for printing the metrics using inputs: classifierName, Y_pred, Y_true
# e.g. inputs can be: 'Logistic Regression', y_pred, y_test
# inside the function, we do something with the inputs (e.g. run classification_metrics on the inputs)
# classification_metrics is another function we wrote above
def display_metrics(classifierName, Y_pred, Y_true):
    print ("______________________________________________")
    print ("Model: "+classifierName)
    acc, precision, recall, f1score, auc = classification_metrics(Y_pred, Y_true)
    # returns 5 vars: acc, precision, recall, f1score, auc
    # print them below
    print ("Accuracy: "+str(acc))
    print ("Precision: "+str(precision))
    print ("Recall: "+str(recall))
    print ("F1-score: "+str(f1score))
    print ("AUC: "+str(auc))
    print ("______________________________________________")
    print ("")
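
`classification_metrics` passes hard 0/1 predictions to `roc_auc_score`. AUC is more commonly computed from predicted probabilities, which `predict_proba` returns. The sketch below compares the two on synthetic stand-in data; `make_classification` and all the `*_demo` names are assumptions for illustration, not part of the advertising data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic stand-in data (an assumption; not the advertising data)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

labels = clf.predict(X_te)              # hard 0/1 predictions
scores = clf.predict_proba(X_te)[:, 1]  # predicted probability of class 1

auc_from_labels = roc_auc_score(y_te, labels)
auc_from_scores = roc_auc_score(y_te, scores)
print("AUC from hard labels:  ", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```

The two numbers generally differ, because hard labels collapse the ROC curve to a single threshold.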


## Get the Data
**Read in the advertising.csv file and set it to a data frame called ad_data.**

In [2]:
# read in the data using Pandas 
# give it a name of your choice 
ad_data = pd.read_csv('advertising.csv')

**Check the head of ad_data**

In [3]:
# peek at the data 
ad_data.head()

Unnamed: 0,Daily Time Spent on Site,Age,Income,Daily Internet Usage,Male,Country,Clicked on Ad
0,82.03,41,71511.08,187.53,0,Afghanistan,1
1,80.03,44,24030.06,150.84,0,Afghanistan,0
2,51.38,59,42362.49,158.56,0,Afghanistan,0
3,77.07,40,44559.43,261.02,0,Afghanistan,1
4,51.87,50,51869.87,119.65,0,Afghanistan,0


# Logistic Regression

We will use both statsmodels and scikit-learn. Statsmodels will be used to estimate the coefficients, get p-values, etc. 
Scikit-learn will be used to make predictions and calculate various metrics for predictive accuracy. 
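
For reference, both libraries fit the standard logistic model. The notation below ($\beta_0$ for the intercept, $\beta_j$ for the coefficient of feature $x_j$) is introduced here for illustration:

$$
P(y = 1 \mid x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_k x_k)}}
$$

Equivalently, the log-odds $\log\frac{P}{1-P}$ is linear in the features, so each coefficient is the change in log-odds per one-unit increase in its feature.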

In [4]:
# Copy the original data so as not to overwrite it
ad_data2 = ad_data.copy()

# create dummy variables for each country by applying pd.get_dummies() to the 'Country' column
# add a prefix to the dummy names using prefix='Country'
# embed pd.get_dummies(ad_data2['Country'], prefix='Country') inside pd.concat()

# another way:
# countries = pd.get_dummies(ad_data2['Country'], prefix='Country')
# ad_data2 = pd.concat([ad_data2, countries],axis=1)

ad_data2 = pd.concat([ad_data2, pd.get_dummies(ad_data2['Country'], prefix='Country')],axis=1)

# now drop the original 'Country' column (it is no longer needed)
ad_data2.drop(['Country'],axis=1, inplace=True)

# Create a list of predictor (x) variables: just Age
predictors1 = ['Age']

# Create another list of predictor variables: Age, Country dummies (without the first country dummy) 
# [i for i in ad_data2.columns if i.startswith('Country')]: chooses all items in ad_data2.columns (var names)
# which start with 'Country'
# [i for i in ad_data2.columns if i.startswith('Country')][1:] -> add all countries but the first one (drop country at index 0)
predictors2 = ['Age']+[i for i in ad_data2.columns if i.startswith('Country')][1:]

predictors3 = ['Age', 'Male', 'Daily Time Spent on Site', 'Daily Internet Usage', 'Income']
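
The cell above drops the first country dummy by slicing the column list with `[1:]`. A possible alternative is to let pandas drop it with `drop_first=True`. The sketch below uses a tiny made-up frame (`demo` and its rows are illustrative assumptions, not the advertising data):

```python
import pandas as pd

# tiny illustrative frame (made-up rows, not the advertising data)
demo = pd.DataFrame({
    "Age": [41, 44, 59, 40],
    "Country": ["Afghanistan", "Argentina", "Brazil", "Argentina"],
})

# drop_first=True omits the first category in sorted order,
# which then serves as the baseline country
dummies = pd.get_dummies(demo["Country"], prefix="Country", drop_first=True)
demo2 = pd.concat([demo.drop(columns="Country"), dummies], axis=1)

predictors = list(demo2.columns)
print(predictors)
```

Here 'Afghanistan' becomes the baseline, so its dummy column never appears in the predictor list.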


In [5]:
# create dataframes for X (Age plus the country dummies) and y variables 
X = ad_data2[predictors2] # choose predictors2
y = ad_data2['Clicked on Ad'] # choose target var

# see list of X variables 
# X.columns is the list of var names
# [i for i in X.columns]: choose all the items in X.columns (var names in X) in list
print('X variables:\n', [i for i in X.columns])


X variables:
 ['Age', 'Country_Argentina', 'Country_Australia', 'Country_Belgium', 'Country_Brazil', 'Country_Canada', 'Country_China', 'Country_Croatia', 'Country_Egypt', 'Country_France', 'Country_Germany', 'Country_Japan', 'Country_Malaysia', 'Country_Mexico', 'Country_Netherlands', 'Country_Pakistan', 'Country_Portugal', 'Country_Russia', 'Country_Singapore', 'Country_South Africa', 'Country_Spain', 'Country_Thailand', 'Country_Turkey', 'Country_USA', 'Country_Uruguay', 'Country_Vietnam', 'Country_Zimbabwe']


In [6]:
X

Unnamed: 0,Age,Country_Argentina,Country_Australia,Country_Belgium,Country_Brazil,Country_Canada,Country_China,Country_Croatia,Country_Egypt,Country_France,...,Country_Russia,Country_Singapore,Country_South Africa,Country_Spain,Country_Thailand,Country_Turkey,Country_USA,Country_Uruguay,Country_Vietnam,Country_Zimbabwe
0,41,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,44,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,59,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,40,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,50,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,29,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,29,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
997,48,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
998,26,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


# Sklearn

In [7]:
# split the dataset into training and test sets (e.g. Use 30% of the data as test data, 70% as training data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)

# Define the type of model using function LogisticRegression()
model = LogisticRegression()

# Fit the model using training set
model.fit(X_train, y_train)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
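
The warning above suggests two remedies: scale the features or raise `max_iter`. The sketch below shows both on synthetic stand-in data; `make_classification`, the `*_demo` names and the value `max_iter=5000` are assumptions for illustration, not settings taken from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data (an assumption; not the advertising data)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

# Option 1: standardize the features inside a pipeline, then fit
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

# Option 2: keep the raw features but allow more solver iterations
patient_model = LogisticRegression(max_iter=5000)
patient_model.fit(X_demo, y_demo)

# iterations the solver actually used in the scaled pipeline
n_iter_scaled = scaled_model.named_steps["logisticregression"].n_iter_[0]
print("solver iterations with scaling:", n_iter_scaled)
```

Putting the scaler in a pipeline means it is fitted on the training data only when combined with `train_test_split`, which avoids leaking test-set statistics.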


In [8]:
# Predict the y values for the test set using function 'model.predict'
y_pred = model.predict(X_test)

# see predictions 
y_pred


array([0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0,
       1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1,
       1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0,
       1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
       1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0], d

In [9]:
# calculate the confusion matrix for the test data using function 'confusion_matrix' in sklearn
# inputs: y_test, y_pred
# y_test: true y-values 
# y_pred: predicted y-values 
confusion_matrix_results = confusion_matrix(y_test, y_pred)

# print the counts of the confusion matrix 
print('confusion matrix: \n', confusion_matrix_results)

# print the metrics using function 'display_metrics' we wrote
display_metrics('Logistic Regression', y_pred, y_test)


confusion matrix: 
 [[ 89  48]
 [ 34 129]]
______________________________________________
Model: Logistic Regression
Accuracy: 0.7266666666666667
Precision: 0.7288135593220338
Recall: 0.7914110429447853
F1-score: 0.7588235294117648
AUC: 0.7205230397205677
______________________________________________
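
As a check, the metrics printed above can be recomputed by hand from the confusion matrix counts. This sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and uses the fact that, for hard 0/1 predictions, AUC equals the average of recall and specificity.

```python
# counts copied from the confusion matrix printed above
# layout: [[TN, FP], [FN, TP]]
tn, fp = 89, 48
fn, tp = 34, 129

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# with hard 0/1 predictions, AUC is the mean of recall and specificity
specificity = tn / (tn + fp)
auc_hard = (recall + specificity) / 2

print(accuracy, precision, recall, f1, auc_hard)
```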



# Statsmodels 


In [10]:
# Estimate a logistic regression model on the data set with statsmodels 
# feed in all data (no splitting)

X = ad_data2[predictors3] 

model = sm.Logit(y, X)
result = model.fit() 
result.summary2()


Optimization terminated successfully.
         Current function value: 0.254291
         Iterations 7


Model:,Logit,Method:,MLE
Dependent Variable:,Clicked on Ad,Pseudo R-squared:,0.633
Date:,2023-09-26 17:33,AIC:,518.5811
No. Observations:,1000,BIC:,543.1199
Df Model:,4,Log-Likelihood:,-254.29
Df Residuals:,995,LL-Null:,-693.15
Converged:,1.0000,LLR p-value:,1.1228e-188
No. Iterations:,7.0000,Scale:,1.0000

,Coef.,Std.Err.,z,P>|z|,[0.025,0.975]
Age,-0.2629,0.0170,-15.4264,0.0000,-0.2964,-0.2295
Male,-0.1639,0.2314,-0.7081,0.4789,-0.6174,0.2897
Daily Time Spent on Site,0.0618,0.0081,7.5825,0.0000,0.0458,0.0777
Daily Internet Usage,0.0245,0.0030,8.1913,0.0000,0.0186,0.0304
Income,0.0000,0.0000,1.9489,0.0513,-0.0000,0.0000
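
Logit coefficients are on the log-odds scale, so exponentiating them gives odds ratios, which are often easier to read. The sketch below uses the coefficients copied from the table above; Income is left out because its coefficient is rounded to 0.0000 in the printout, and the dictionary name `coefs` is only illustrative.

```python
import math

# coefficients copied from the summary table above
coefs = {
    "Age": -0.2629,
    "Male": -0.1639,
    "Daily Time Spent on Site": 0.0618,
    "Daily Internet Usage": 0.0245,
}

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = {name: math.exp(b) for name, b in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio = {ratio:.3f}")
```

Two points to keep in mind when reading the table: it has no constant row, because `sm.Logit` does not add an intercept automatically (`sm.add_constant(X)` would add one), and the Income coefficient shows as 0.0000 most likely because income is measured in large units, not because it is exactly zero.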
