# DSO 528 case 2 sample models
Logistic regression for TrojanHorse case

## Import data

In [1]:
# prompt: Import buyer data from /content/drive/MyDrive/Colab Notebooks/DSO528/Week5/TrojanHorse.csv

import pandas as pd

# Assuming you have mounted your Google Drive
buyer = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/DSO528/Week5/TrojanHorse.csv')


In [2]:
buyer.head()

Unnamed: 0,Gender,M,R,F,FirstPurch,BusinessExecutive,Surfer,Yuppie,Hipster,Artist,ClassicGentleman,Rugged,Formal,Casual,Comic,Success
0,1,260,16,2,18,0,0,1,0,1,0,0,0,0,0,0
1,0,259,12,5,30,1,0,1,0,0,1,1,0,0,0,0
2,1,218,16,6,42,1,1,3,1,0,0,0,0,0,0,0
3,1,143,14,1,14,0,0,0,1,0,0,0,0,0,0,0
4,1,419,8,11,52,4,0,1,0,1,1,2,1,1,1,1


## Split data into train and test sets

*   70% training and 30% testing
*   Set seed to 528 for this case

In [3]:
# prompt: use sklearn library, set seed as 528, select 70% of the buyer data as training and the rest as testing

from sklearn.model_selection import train_test_split

train, test = train_test_split(buyer, test_size=0.3, random_state=528)

print(f"Training data shape: {train.shape}")
print(f"Testing data shape: {test.shape}")


Training data shape: (1400, 16)
Testing data shape: (600, 16)


## Model 1. Success ~ R + ClassicGentleman
This model is our initial attempt: the variables were chosen based on experience and heuristic reasoning rather than a formal selection procedure.




In [7]:
# prompt: Use sklearn to build a logistic regression model for the train data to predict Success (set target level as Success = 1) using R and ClassicGentleman. Generate statistical summary of the model using statsmodels.

import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

# Define X and y for the training data
X_train = train[['R', 'ClassicGentleman']]
y_train = (train['Success'] == 1).astype(int)  # Convert Success to binary (1 or 0)

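# Create and fit the logistic regression model
# Note: sklearn's LogisticRegression applies L2 regularization by default (C=1.0),
# so its coefficients can differ slightly from the unregularized statsmodels fit below
model1 = LogisticRegression()
model1.fit(X_train, y_train)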

# Use statsmodels to get a statistical summary
X_train_with_constant = sm.add_constant(X_train)  # Add a constant term for the intercept
logit_model = sm.Logit(y_train, X_train_with_constant).fit()
print(logit_model.summary())


Optimization terminated successfully.
         Current function value: 0.290478
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                Success   No. Observations:                 1400
Model:                          Logit   Df Residuals:                     1397
Method:                           MLE   Df Model:                            2
Date:                Wed, 25 Sep 2024   Pseudo R-squ.:                  0.1107
Time:                        16:59:25   Log-Likelihood:                -406.67
converged:                       True   LL-Null:                       -457.31
Covariance Type:            nonrobust   LLR p-value:                 1.017e-22
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -1.3918      0.179     -7.768      0.000      -1.743      -1.041
R          

Model equation
log(P1/P0) = -1.39 - 0.11 * R + 0.90 * ClassicGentleman, where P1 = P(Success = 1) and P0 = 1 - P1

McFadden's pseudo R-squared is about 0.11, meaning the model improves the log-likelihood by about 11% over an intercept-only model. Note that this is not the share of variance explained, as R-squared is in linear regression. According to the statistical summary, both R and ClassicGentleman are significant at the 5% level. Based on the coefficient estimates, the higher the recency (R), the lower the purchase propensity, holding ClassicGentleman constant. Similarly, ClassicGentleman has a positive effect on purchase likelihood, holding R constant.
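To make the interpretation concrete, here is a minimal sketch that converts the rounded coefficients from the model equation above into odds ratios, recomputes the pseudo R-squared from the two log-likelihoods in the summary, and scores one customer. The customer values (R = 10, ClassicGentleman = 1) and the helper name `predicted_probability` are hypothetical and chosen only for illustration.

```python
import math

# Rounded coefficients copied from the model equation above
intercept = -1.39
coef_r = -0.11
coef_classic = 0.90

# Odds ratio: multiplicative change in the odds of Success = 1
# for a one-unit increase in the variable, holding the other fixed
odds_ratio_r = math.exp(coef_r)
odds_ratio_classic = math.exp(coef_classic)

# McFadden's pseudo R-squared from the log-likelihoods in the summary
ll_model = -406.67  # Log-Likelihood
ll_null = -457.31   # LL-Null (intercept-only model)
pseudo_r2 = 1 - ll_model / ll_null


def predicted_probability(r, classic_gentleman):
    """Turn the log-odds from the model equation into a probability."""
    log_odds = intercept + coef_r * r + coef_classic * classic_gentleman
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical customer, for illustration only
p_example = predicted_probability(r=10, classic_gentleman=1)

print(f"Odds ratio for R: {odds_ratio_r:.3f}")
print(f"Odds ratio for ClassicGentleman: {odds_ratio_classic:.3f}")
print(f"Pseudo R-squared: {pseudo_r2:.4f}")
print(f"P(Success = 1) for the example customer: {p_example:.3f}")
```

An odds ratio below 1 (as for R) means the odds of buying shrink with each additional unit, while an odds ratio above 1 (as for ClassicGentleman) means they grow. Because the coefficients are rounded, the numbers are approximate.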

Does the model make sense for the business?

### Apply model 1 to predict for test data
Save predicted probability to buyer_model1.csv

In [8]:
# prompt: Use model1 built in the previous step to predict probability of Success = 1, make a copy of test data and save the predicted probability to the test data, save the data altogether to buyer_model1.csv at /content/drive/MyDrive/Colab Notebooks/DSO528/Week5

# Predict probabilities for the test data
X_test = test[['R', 'ClassicGentleman']]
y_pred_proba = model1.predict_proba(X_test)[:, 1]  # Probability of Success = 1

# Make a copy of the test data
buyer_model1 = test.copy()

# Add the predicted probability to the dataframe
buyer_model1['Predicted_Probability'] = y_pred_proba

# Save the dataframe to a CSV file
buyer_model1.to_csv('/content/drive/MyDrive/Colab Notebooks/DSO528/Week5/buyer_model1.csv', index=False)
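The `[:, 1]` index used with `predict_proba` works because scikit-learn orders the probability columns by `classes_`, which is sorted, so column 1 belongs to class 1 when the target is coded 0/1. The sketch below checks this on synthetic stand-in data (not the TrojanHorse data); the names `X_demo`, `y_demo`, and `clf` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the TrojanHorse data)
rng = np.random.default_rng(528)
X_demo = rng.normal(size=(200, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression().fit(X_demo, y_demo)
proba = clf.predict_proba(X_demo)

# Columns of predict_proba follow the order of clf.classes_
positive_col = list(clf.classes_).index(1)
p_success = proba[:, positive_col]

print("classes_:", clf.classes_)
print("column holding P(class = 1):", positive_col)
print("each row sums to 1:", np.allclose(proba.sum(axis=1), 1.0))
```

Looking up the column through `classes_` is a small safeguard in case the target is ever coded differently (for example, as strings).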


## (optional) Model 2: Selected features (X) using SequentialFeatureSelector

For more information about SFS, check https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SequentialFeatureSelector.html

You may consider other scoring criteria we've learned in class, such as accuracy, recall, precision, and r-square (note that sklearn's `r2` scorer is designed for regression rather than classification). For more information about scoring criteria, check https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter

Other things to try: change the number of features to select, or change the selection direction (`forward` or `backward`).
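As a rough illustration of these options, the sketch below runs SFS twice on synthetic stand-in data (not the TrojanHorse data), once forward with `roc_auc` and once backward with `recall`. The helper `select_features`, the data shape, and the choice of 3 features are assumptions made only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (NOT the TrojanHorse data): 300 rows, 8 features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=528
)


def select_features(scoring, direction, n_features):
    """Run SFS with the given settings and return the kept column indices."""
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=n_features,
        direction=direction,
        scoring=scoring,
        cv=5,
    )
    selector.fit(X_demo, y_demo)
    return list(selector.get_support(indices=True))


forward_auc = select_features("roc_auc", "forward", 3)
backward_recall = select_features("recall", "backward", 3)

print("Forward selection, roc_auc:", forward_auc)
print("Backward selection, recall:", backward_recall)
```

The two runs may or may not pick the same columns. Comparing them is a quick way to see how sensitive the selection is to the scoring criterion and the direction.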


In [9]:
# prompt: Build a logistic regression model for the train data set up in the previous step. The model predicts Success (set target level as Success = 1). Select the best 5 features based on roc_auc, report statistical summary of the model. Use SFS.

from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
import statsmodels.api as sm

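# Prepare the data
# Note: every column except Success is a candidate feature, including 'Unnamed: 0',
# which looks like a saved row index rather than a real predictor
X_train = train.drop('Success', axis=1)
y_train = train['Success']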

# Build a logistic regression model with SFS for feature selection
model2 = LogisticRegression(max_iter=1000)  # Increase max_iter if needed
sfs = SequentialFeatureSelector(model2, n_features_to_select=5, direction='forward', scoring='roc_auc')
sfs.fit(X_train, y_train)

# Get the selected features
selected_features = X_train.columns[sfs.get_support()]

# Train the model with the selected features
X_train_selected = X_train[selected_features]

model2.fit(X_train_selected, y_train)

# Statistical summary using statsmodels
X_train_selected_with_constant = sm.add_constant(X_train_selected)
logit_model2 = sm.Logit(y_train, X_train_selected_with_constant).fit()
print(logit_model2.summary())

Optimization terminated successfully.
         Current function value: 0.276020
         Iterations 7
                           Logit Regression Results                           
Dep. Variable:                Success   No. Observations:                 1400
Model:                          Logit   Df Residuals:                     1394
Method:                           MLE   Df Model:                            5
Date:                Wed, 25 Sep 2024   Pseudo R-squ.:                  0.1550
Time:                        17:00:45   Log-Likelihood:                -386.43
converged:                       True   LL-Null:                       -457.31
Covariance Type:            nonrobust   LLR p-value:                 7.548e-29
                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------
const               -0.8276      0.221     -3.742      0.000      -1.261      -0.394
Gender     

### Apply model 2 to predict for test data

In [10]:
# prompt: Use model2 to predict probability of Success = 1, make a copy of test data and save the predicted probability to the test data, save the data altogether to buyer_model2.csv at /content/drive/MyDrive/Colab Notebooks/DSO528/Week5

# Prepare the test data with selected features
X_test_selected = test[selected_features]

# Predict probabilities for the test data using model2
test_pred_model2 = model2.predict_proba(X_test_selected)[:, 1]

# Make a copy of the test data
buyer_model2 = test.copy()

# Add the predicted probabilities to the copied test data
buyer_model2['Predicted_Probability'] = test_pred_model2

# Save the data with predicted probabilities to buyer_model2.csv
buyer_model2.to_csv('/content/drive/MyDrive/Colab Notebooks/DSO528/Week5/buyer_model2.csv', index=False)
