# User Restaurant Compatibility

We'll work with three data sets: user data, restaurant data, and restaurant redemption data. With the data available, we will estimate a user's likelihood of redeeming inKind credit at a given restaurant.


## Clean and wrangle data

To better explore the data, we'll start with reading it into a dataframe and getting a quick picture of what's contained.



In [1]:
#load in the data sets

import pandas as pd


# Load DataFrames
df_users = pd.read_csv('user_data.csv')
df_restaurants = pd.read_csv('restaurant_data.csv')
df_redemptions = pd.read_csv('restaurant_redemption_data.csv')



In [2]:
#get a sense of what is contained in each data set

df_users.head()

Unnamed: 0,anonymized_user_id,created_at,user_source,zip_code,brand_id_of_source_offer_if_applicable
0,1,2024-05-15 00:00:00,Offer,27609,834.0
1,2,2024-05-25 00:00:00,Offer,89109,833.0
2,3,2024-05-02 00:00:00,Redemption,76005,
3,4,2024-05-19 00:00:00,Referral,65401,
4,5,2024-05-08 00:00:00,Offer,22554,


In [3]:
df_restaurants.head()

Unnamed: 0,anonymized_location_id,anonymized_brand_id (brands contain multiple locations),location_address,location_city,location_state,location_zip_code,Day of location_creation,location_latitude,location_longitude,stars_avg,service_type,check_avg,venue_type,location_name,Cuisine Types
0,1,2184.0,280 East 12300 South,Draper,UT,84020,2017-05-17,40.525296,-111.882849,,unknown,mid,cafe,Draper,"Casual, Fine Dining, Takeout"
1,2,2184.0,1856 5400 S,Salt Lake City,UT,84118,2017-05-17,40.653483,-111.942164,,unknown,mid,cafe,Taylorsville,"Casual, Fine Dining, Takeout"
2,3,,3068 Mt Pleasant St NW,Washington,DC,20009,2017-09-25,38.928548,-77.037527,4.5,,,,Each Peach Market,
3,4,,3064 Mt Pleasant St NW,Washington,DC,20009,2017-09-25,38.928463,-77.037516,4.4,,,,Pear Plum Cafe,
4,5,,1815 Adams Mill Rd NW,Washington,DC,20005,2017-11-17,38.923209,-77.042616,4.1,,,,Southern Hospitality,


In [4]:
df_redemptions.head()

Unnamed: 0,anonymized_user_id,created_at,anonymized_location_id,anonymized_brand_id,metro_area,txn_id,total_bill_amount,tip,order_type,miles_travelled
0,1314668,2024-01-06 08:40:56,75.0,1122.0,Cincinnati,1915276,26.45,0.0,Dine-in,1.04
1,1862552,2024-03-08 15:17:55,290.0,1515.0,Austin,2389101,28.16,5.2,Dine-in,1.81
2,1425149,2024-01-15 09:33:27,75.0,1122.0,Cincinnati,1972584,22.76,4.84,Dine-in,0.28
3,1554732,2024-04-22 09:51:55,224.0,1118.0,"Manhattan, New York",2892306,39.66,6.58,Dine-in,3.36
4,1892523,2024-03-06 06:16:02,224.0,1118.0,"Manhattan, New York",2370753,9.44,0.56,Dine-in,0.71


It's clear that we'll be interested in using the `brand_id` fields to join datasets together for later analysis. We'll get rid of null values first.

In [5]:
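#create function to clean the data

def write_non_null_to_file(df, column_name, output_filename):
    """Write the rows of df where column_name is not null to a CSV file."""
    if column_name not in df.columns:
        raise ValueError(f"Column '{column_name}' not found in DataFrame.")

    # Keep only rows that have a value in the given column
    non_null_df = df[df[column_name].notna()]

    # index=False keeps the row index out of the output file
    non_null_df.to_csv(output_filename, index=False)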

write_non_null_to_file(df_users, 'brand_id_of_source_offer_if_applicable', 'user_data_cleaned.csv')
write_non_null_to_file(df_restaurants, 'anonymized_brand_id (brands contain multiple locations)', 'restaurant_data_cleaned.csv')
write_non_null_to_file(df_redemptions, 'anonymized_brand_id', 'redemptions_data_cleaned.csv')

Next up, we'll join the redemption and restaurant data on the `anonymized_location_id` field, which is named identically in both cleaned files.

In [6]:
df_redemptions = pd.read_csv('redemptions_data_cleaned.csv')
df_restaurants = pd.read_csv('restaurant_data_cleaned.csv')
df_merged = pd.merge(df_redemptions, df_restaurants, on='anonymized_location_id', how='inner')
pd.set_option('display.max_columns', None)
df_merged.head()

Unnamed: 0,anonymized_user_id,created_at,anonymized_location_id,anonymized_brand_id,metro_area,txn_id,total_bill_amount,tip,order_type,miles_travelled,anonymized_brand_id (brands contain multiple locations),location_address,location_city,location_state,location_zip_code,Day of location_creation,location_latitude,location_longitude,stars_avg,service_type,check_avg,venue_type,location_name,Cuisine Types
0,1314668,2024-01-06 08:40:56,75.0,1122.0,Cincinnati,1915276,26.45,0.0,Dine-in,1.04,381.0,7893 Beechmont Ave,Cincinnati,OH,45255,2020-11-27,39.072392,-84.337628,4.1,full_service,low,casual_dining,City Bird - Beechmont,"American Fare, Family Friendly, Fried Chicken,..."
1,1862552,2024-03-08 15:17:55,290.0,1515.0,Austin,2389101,28.16,5.2,Dine-in,1.81,28.0,5900 Slaughter Lane Suite D-500,Austin,TX,78749,2021-08-18,30.201736,-97.878878,4.4,full_service,mid,casual_dining,District Kitchen Slaughter Lane,"Bar, Brunch, Cocktail Bar, Comfort Food, Date ..."
2,1425149,2024-01-15 09:33:27,75.0,1122.0,Cincinnati,1972584,22.76,4.84,Dine-in,0.28,381.0,7893 Beechmont Ave,Cincinnati,OH,45255,2020-11-27,39.072392,-84.337628,4.1,full_service,low,casual_dining,City Bird - Beechmont,"American Fare, Family Friendly, Fried Chicken,..."
3,1554732,2024-04-22 09:51:55,224.0,1118.0,"Manhattan, New York",2892306,39.66,6.58,Dine-in,3.36,377.0,227 Front St,San Francisco,CA,10014,2021-05-12,37.793777,-122.399096,4.3,quick_service_restaurant,very_low,coffee_shop,Bluestone Lane (Financial District),"Brunch, Business Dining, Cafe, Casual, Coffee ..."
4,1892523,2024-03-06 06:16:02,224.0,1118.0,"Manhattan, New York",2370753,9.44,0.56,Dine-in,0.71,377.0,227 Front St,San Francisco,CA,10014,2021-05-12,37.793777,-122.399096,4.3,quick_service_restaurant,very_low,coffee_shop,Bluestone Lane (Financial District),"Brunch, Business Dining, Cafe, Casual, Coffee ..."
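
An inner merge like the one above drops redemptions whose location has no match, and it can multiply rows if the lookup table repeats a key. The sketch below uses the `validate` and `indicator` options of `pd.merge` to surface both cases. The two toy frames and their values are made up for illustration and are not taken from the data sets.

```python
import pandas as pd

# Toy stand-ins for the redemption and restaurant tables (values are made up)
toy_redemptions = pd.DataFrame({
    "txn_id": [1, 2, 3],
    "anonymized_location_id": [75, 290, 999],  # 999 has no matching restaurant
})
toy_restaurants = pd.DataFrame({
    "anonymized_location_id": [75, 290],
    "venue_type": ["casual_dining", "coffee_shop"],
})

# validate="many_to_one" raises an error if a location id repeats in toy_restaurants;
# indicator=True adds a `_merge` column recording where each row came from
checked = pd.merge(
    toy_redemptions,
    toy_restaurants,
    on="anonymized_location_id",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = checked[checked["_merge"] == "left_only"]
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")
print(f"{len(unmatched)} unmatched redemption(s), {len(matched)} matched")
```

Running a check of this kind before switching to `how='inner'` makes it visible how many redemptions the join would discard.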


In [7]:
#we only need user_ids and the date of creation here, this will be useful for creating product flags down the line

df_users = pd.read_csv('user_data_cleaned.csv', usecols=['anonymized_user_id', 'created_at'])
df_users.head()

Unnamed: 0,anonymized_user_id,created_at
0,1,2024-05-15 00:00:00
1,2,2024-05-25 00:00:00
2,6,2024-05-29 00:00:00
3,7,2024-04-27 00:00:00
4,14,2024-02-16 00:00:00


In [8]:
#drop duplicate transaction ids
df_no_dupes = df_merged.drop_duplicates(subset=['txn_id'], keep='first')

#delete rows where venue_type is unknown
df_cleaned = df_no_dupes[df_no_dupes['venue_type'] != 'unknown']

sorted_df = df_cleaned.sort_values('anonymized_user_id')

sorted_df.head()

Unnamed: 0,anonymized_user_id,created_at,anonymized_location_id,anonymized_brand_id,metro_area,txn_id,total_bill_amount,tip,order_type,miles_travelled,anonymized_brand_id (brands contain multiple locations),location_address,location_city,location_state,location_zip_code,Day of location_creation,location_latitude,location_longitude,stars_avg,service_type,check_avg,venue_type,location_name,Cuisine Types
26867,2,2024-05-05 15:33:46,1842.0,2353.0,Washington,3070259,38.24,5.0,Dine-in,,1956.0,1800 14th Street NW,Washington,DC,20009,2024-02-19,38.914381,-77.032146,4.4,full_service,mid,casual_dining,Doi Moi,"Brunch, Cocktail Bar, Vietnamese"
34085,13,2024-04-29 10:10:10,331.0,1520.0,Washington,2990663,11.55,0.0,Dine-in,0.02,33.0,4201 Georgia Ave NW,Washington,DC,20011,2021-10-06,38.942041,-77.025309,4.3,full_service,low,casual_dining,Honeymoon Chicken,"Bib Gourmand, Brunch, Cocktails, Fried Chicken..."
23973,13,2024-03-04 08:28:25,331.0,1520.0,Washington,2360834,33.55,4.58,Dine-in,0.02,33.0,4201 Georgia Ave NW,Washington,DC,20011,2021-10-06,38.942041,-77.025309,4.3,full_service,low,casual_dining,Honeymoon Chicken,"Bib Gourmand, Brunch, Cocktails, Fried Chicken..."
35526,14,2024-05-14 17:23:01,1488.0,2334.0,Annapolis,3166503,22.53,4.18,Dine-in,13.72,1937.0,8133 Honeygo Boulevard,Nottingham,MD,21236,2023-11-13,39.371247,-76.465211,4.1,full_service,mid,casual_dining,Bar Louie - White Marsh,"American Fare, Bar, Beer, Burgers, Casual, Coc..."
15515,14,2024-05-21 17:49:54,1488.0,2334.0,Annapolis,3238841,36.58,6.76,Dine-in,13.72,1937.0,8133 Honeygo Boulevard,Nottingham,MD,21236,2023-11-13,39.371247,-76.465211,4.1,full_service,mid,casual_dining,Bar Louie - White Marsh,"American Fare, Bar, Beer, Burgers, Casual, Coc..."


Next up, we'll start considering what fields we have available and what model will work best to predict where a user will redeem credit. A logistic regression model looks like a good choice, as it can predict binary outcomes (in this case, to redeem or not to redeem) based on a set of predictor variables.

Since the amount spent in a given `venue_type` can vary greatly (coffee shop visits vs fine dining), it doesn't seem helpful to track total spent if we're just looking for where a customer will likely redeem credit, and not looking to track where a user will spend the most (a different analysis than was asked).

We'll look at `user_id`, `location_id`, `venue_type`, and the number of times a user has redeemed at a given restaurant (counted here per `location_id`), and will create a binary `purchase_flag` field. Typically, purchase flags have many factors that go into deciding a likely buy or unlikely buy, such as time since last purchase, user app activity, repurchase history, etc. We will keep it simple due to the data that we have available: a user-restaurant pair is flagged as a likely purchase when the user has redeemed credit at that location 3 or more times.
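
To make the decision rule concrete, here is a minimal sketch on a made-up redemption log. The names `toy_log`, `toy_counts`, and `THRESHOLD` are hypothetical and exist only for this illustration; the threshold of 3 is the rule described above.

```python
import pandas as pd

# Toy redemption log: one row per transaction (values are made up)
toy_log = pd.DataFrame({
    "anonymized_user_id": [14, 14, 14, 13, 13, 2],
    "anonymized_location_id": [1488, 1488, 1488, 331, 331, 1842],
})

# Count redemptions for each (user, location) pair
toy_counts = (
    toy_log.groupby(["anonymized_user_id", "anonymized_location_id"])
    .size()
    .reset_index(name="redemption_count_by_user")
)

# Decision rule: flag pairs with 3 or more redemptions
THRESHOLD = 3
toy_counts["purchase_flag"] = (
    toy_counts["redemption_count_by_user"] >= THRESHOLD
).astype(int)
print(toy_counts)
```

Only the pair with three redemptions receives a flag of 1; pairs with one or two redemptions are flagged 0.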

####

In [9]:
#We'll keep the fields we're interested in and drop the rest

df_prep = sorted_df[['anonymized_user_id', 'anonymized_location_id', 'venue_type']]
df_prep.head()
                         

Unnamed: 0,anonymized_user_id,anonymized_location_id,venue_type
26867,2,1842.0,casual_dining
34085,13,331.0,casual_dining
23973,13,331.0,casual_dining
35526,14,1488.0,casual_dining
15515,14,1488.0,casual_dining


In [10]:
# Count instances of redemptions by each user for a given restaurant

counts = df_prep.groupby(['anonymized_user_id', 'anonymized_location_id']).size().reset_index(name='redemption_count_by_user')

counts.head()

Unnamed: 0,anonymized_user_id,anonymized_location_id,redemption_count_by_user
0,2,1842.0,1
1,13,331.0,2
2,14,1488.0,6
3,15,32.0,1
4,15,326.0,1


In [11]:
df_venues = df_restaurants[['anonymized_location_id', 'venue_type']]
df_venues.head()

Unnamed: 0,anonymized_location_id,venue_type
0,1,cafe
1,2,cafe
2,7,bar
3,8,bar
4,9,bar


In [12]:
#merge back the venue_types to table with counts

df_counts = pd.merge(counts, df_venues, on='anonymized_location_id', how='inner')
df_counts.head()

Unnamed: 0,anonymized_user_id,anonymized_location_id,redemption_count_by_user,venue_type
0,2,1842.0,1,casual_dining
1,13,331.0,2,casual_dining
2,14,1488.0,6,casual_dining
3,15,32.0,1,fast_casual
4,15,326.0,1,casual_dining


In [13]:
#set location_ids to integers
df_counts['anonymized_location_id'] = df_counts['anonymized_location_id'].astype(int)
#create purchase_flag field based on if a user has redeemed at a particular restaurant 3 or more times
df_counts['purchase_flag'] = (df_counts['redemption_count_by_user'] >= 3).astype(int)
df_counts.tail()

Unnamed: 0,anonymized_user_id,anonymized_location_id,redemption_count_by_user,venue_type,purchase_flag
425752,2929041,1609,1,casual_dining,0
425753,2929041,1621,1,fine_dining,0
425754,2929082,1472,2,casual_dining,0
425755,2929089,604,1,casual_fine_dining,0
425756,2929090,63,1,casual_dining,0


## Build and run the model
The dataset we'll use for the model has been cleaned and wrangled, but a few preparation steps remain. We're going to convert the `venue_type` field into numeric values with one-hot encoding and split the data into training and test sets.

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Features (X) and target (y)
X = df_counts[['anonymized_user_id', 'anonymized_location_id', 'venue_type', 'redemption_count_by_user']]
y = df_counts['purchase_flag']

# Preprocessing: One-Hot Encoding for categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('venue_type', OneHotEncoder(), ['venue_type']) # preprocess venue types from text to numeric value
        # ('remainder', 'passthrough', ['anonymized_user_id', 'anonymized_location_id', 'redemption_count_by_user'])  # Keep numerical columns as is
    ],
    remainder='passthrough'
)

# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Time to train the model.

In [40]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create a pipeline with preprocessing and logistic regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000)) 
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Classification Report (Precision, Recall, F1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     79890
           1       1.00      1.00      1.00      5262

    accuracy                           1.00     85152
   macro avg       1.00      1.00      1.00     85152
weighted avg       1.00      1.00      1.00     85152


Confusion Matrix:
[[79890     0]
 [    0  5262]]


In [41]:
# Predict probabilities for the test set
y_proba = pipeline.predict_proba(X_test)[:, 1]  # Probability of class 1 (will purchase)

print("\nPredicted Likelihood of Purchase:")
print(y_proba)


Predicted Likelihood of Purchase:
[1.84380705e-21 2.05718173e-19 8.62841454e-18 ... 2.29334426e-19
 1.30938004e-21 1.40692928e-21]


In [42]:
df_yprob = pd.DataFrame(y_proba)
df_yprob.to_csv('Predicted_Likelihood_of_Purchase_orig.csv', index=False)

In [51]:
# Get predicted probabilities for the full dataset (including both training and test sets)
full_pred_prob = pipeline.predict_proba(X)[:, 1]  # Probabilities for class 1 (purchase = 1)

# Add predicted probabilities to the full dataset
df_counts['Predicted_Purchase_Likelihood'] = full_pred_prob
df_output = df_counts
# Save the dataset with the predicted probabilities
df_output.to_csv('Predict_Purchase_Model_Output_orig.csv', index=False)

In [54]:
df_output.head()

Unnamed: 0,anonymized_user_id,anonymized_location_id,redemption_count_by_user,venue_type,purchase_flag,Predicted_Purchase_Likelihood
0,2,1842,1,casual_dining,0,2.129972e-21
1,13,331,2,casual_dining,0,6.025485e-09
2,14,1488,6,casual_dining,1,1.0
3,15,32,1,fast_casual,0,1.269245e-18
4,15,326,1,casual_dining,0,1.4121769999999999e-21


We can see here (and much better in the full dataset) that the predicted purchase likelihoods match up well with the purchase flag values. Below, we explore ways to refine the results of the model.
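
One point worth keeping in mind when reading the perfect scores: `purchase_flag` is computed directly from `redemption_count_by_user`, and that count is also one of the model's features. The sketch below, on made-up counts, shows that a single threshold on that feature reproduces the flag exactly, so probabilities very close to 0 or 1 are what we would expect here. The frame `toy` and its values are assumptions for illustration.

```python
import pandas as pd

# Toy counts (made-up values); the flag is derived from them with the 3-or-more rule
toy = pd.DataFrame({"redemption_count_by_user": [1, 2, 6, 1, 3, 4, 2]})
toy["purchase_flag"] = (toy["redemption_count_by_user"] >= 3).astype(int)

# A "model" that looks only at the count feature and applies the same threshold
rule_prediction = (toy["redemption_count_by_user"] >= 3).astype(int)

# Share of rows where the threshold rule agrees with the flag
rule_accuracy = (rule_prediction == toy["purchase_flag"]).mean()
print(f"Accuracy of the threshold rule: {rule_accuracy}")
```

A more informative evaluation would likely need a target that is not a direct function of the features, for example a flag built from a later time window than the counts.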

# Actionable Next Steps

### &ensp; - Segment customers based on purchase likelihood
### &ensp; - Run targeted marketing campaigns based on those segments. Example actions
###  &emsp; &emsp;  * High likelihood: offer promotions/loyalty rewards
###  &emsp; &emsp;  * Medium likelihood: provide personalized content/discounts, implement remarketing strategies
###  &emsp;  &emsp; * Low likelihood: build brand awareness, consider offering significant discounts to drive purchases
### &ensp; - Identify customer retention strategies
### &ensp; - Use the data to help with cross-selling/upselling by using or creating recommendation software
### &emsp;  &emsp; * Collaborative and content-based filtering (e.g. Netflix recommendations) are two methods that can be combined
### &ensp; - Identify ways to improve customer experience (try and get feedback from each segment and act on it)
###  &emsp;  &emsp; * UI/UX improvements
###  &emsp; &emsp;  * Customer support
### &ensp; - Improve the model
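
The segmentation step listed above could be sketched as follows. The cut points in `bins` are hypothetical and would need to be tuned to the campaign; the scores are made-up values, not model output.

```python
import pandas as pd

# Toy predicted likelihoods (made-up values)
scores = pd.DataFrame({
    "anonymized_user_id": [2, 13, 14, 15],
    "Predicted_Purchase_Likelihood": [0.05, 0.45, 0.95, 0.70],
})

# Hypothetical cut points for low / medium / high likelihood
bins = [0.0, 0.33, 0.66, 1.0]
labels = ["low", "medium", "high"]

scores["segment"] = pd.cut(
    scores["Predicted_Purchase_Likelihood"],
    bins=bins,
    labels=labels,
    include_lowest=True,  # so a likelihood of exactly 0.0 lands in "low"
)
print(scores)
```

Each segment can then be matched to one of the example actions in the list.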

# A Little More for Fun
The model seemed to perform very well, but many of the predicted likelihood of purchase values returned are very small. There's a solid chance this has to do with our classes being unevenly distributed (far fewer users have redeemed 3 or more times at a given restaurant). We will try undersampling the majority class and SMOTE, a method of oversampling the minority class.

#### Undersampling the majority

In [21]:
from imblearn.under_sampling import RandomUnderSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Features (X) and target (y)
X = df_counts[['anonymized_user_id', 'anonymized_location_id', 'venue_type', 'redemption_count_by_user']]
y = df_counts['purchase_flag']

# Preprocessing: One-Hot Encoding for categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('venue_type', OneHotEncoder(), ['venue_type']) # preprocess venue types from text to numeric value
        # ('remainder', 'passthrough', ['anonymized_user_id', 'anonymized_location_id', 'redemption_count_by_user'])  # Keep numerical columns as is
    ],
    remainder='passthrough'
)

# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply undersampling to balance the classes in the training set
undersample = RandomUnderSampler(random_state=42)
X_train_under, y_train_under = undersample.fit_resample(X_train, y_train)

print(f"Original class distribution: {dict(zip(*np.unique(y_train, return_counts=True)))}")
print(f"Undersampled class distribution: {dict(zip(*np.unique(y_train_under, return_counts=True)))}")

Original class distribution: {np.int64(0): np.int64(319420), np.int64(1): np.int64(21185)}
Undersampled class distribution: {np.int64(0): np.int64(21185), np.int64(1): np.int64(21185)}
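
To put the imbalance in the original training set into proportion, the short calculation below uses the two class counts printed above.

```python
# Class counts copied from the training-set output above
majority = 319420  # purchase_flag == 0
minority = 21185   # purchase_flag == 1

# Share of the training rows that belong to the minority class
minority_share = minority / (majority + minority)

# How many majority rows there are for each minority row
imbalance_ratio = majority / minority

print(f"Minority share: {minority_share:.2%}")
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f} to 1")
```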


In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create a pipeline with preprocessing and logistic regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000)) 
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Classification Report (Precision, Recall, F1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     79890
           1       1.00      1.00      1.00      5262

    accuracy                           1.00     85152
   macro avg       1.00      1.00      1.00     85152
weighted avg       1.00      1.00      1.00     85152


Confusion Matrix:
[[79890     0]
 [    0  5262]]


In [23]:
# Predict probabilities for the test set
y_proba = pipeline.predict_proba(X_test)[:, 1]  # Probability of class 1 (will purchase)

print("\nPredicted Likelihood of Purchase:")
print(y_proba)


Predicted Likelihood of Purchase:
[1.84380705e-21 2.05718173e-19 8.62841454e-18 ... 2.29334426e-19
 1.30938004e-21 1.40692928e-21]


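Still getting very small values for the probabilities. Note that the pipeline above is fit on `X_train` and `y_train` rather than on `X_train_under` and `y_train_under`, so the undersampled data never reaches the model and the results are identical to the first run. We'll try SMOTE next.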
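
Here is a minimal sketch of random undersampling written with plain pandas instead of `RandomUnderSampler`, on made-up data. `undersample_majority` is a hypothetical helper, not part of imbalanced-learn. The balanced pair it returns is what would be handed to `fit`.

```python
import pandas as pd

def undersample_majority(X, y, random_state=42):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    n_min = y.value_counts().min()
    # Sample n_min row labels from each class, then select those rows from X and y
    keep = y.groupby(y).sample(n=n_min, random_state=random_state).index
    return X.loc[keep], y.loc[keep]

# Toy training data: 8 negatives and 2 positives (values are made up)
X_toy = pd.DataFrame({"redemption_count_by_user": [1, 1, 2, 1, 2, 1, 1, 2, 3, 5]})
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 1], name="purchase_flag")

X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(y_bal.value_counts().to_dict())

# The balanced pair is what would then be passed to the estimator, e.g.
# pipeline.fit(X_bal, y_bal)
```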


#### SMOTE, or Oversampling the Minority class

In [29]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Features (X) and target (y)
X = df_counts[['anonymized_user_id', 'anonymized_location_id', 'venue_type', 'redemption_count_by_user']]
y = df_counts['purchase_flag']

# Preprocessing: One-Hot Encoding for categorical columns
preprocessor = ColumnTransformer(
    transformers=[
        ('venue_type', OneHotEncoder(), ['venue_type']) # preprocess venue types from text to numeric value
        # ('remainder', 'passthrough', ['anonymized_user_id', 'anonymized_location_id', 'redemption_count_by_user'])  # Keep numerical columns as is
    ],
    remainder='passthrough'
)

# Split data into training and test sets (67% train, 33% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create a pipeline with preprocessing and logistic regression
pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000)) 
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Classification Report (Precision, Recall, F1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00    131769
           1       1.00      1.00      1.00      8731

    accuracy                           1.00    140500
   macro avg       1.00      1.00      1.00    140500
weighted avg       1.00      1.00      1.00    140500


Confusion Matrix:
[[131769      0]
 [     0   8731]]


In [31]:
# Predict probabilities for the test set
y_proba = pipeline.predict_proba(X_test)[:, 1]  # Probability of class 1 (will purchase)

print("\nPredicted Likelihood of Purchase:")
print(y_proba)


Predicted Likelihood of Purchase:
[4.34603298e-24 5.21227813e-22 2.82280950e-20 ... 4.75508791e-24
 5.13881844e-24 9.99999999e-01]


This seemed to have the biggest effect on the output, which may indicate that increasing the relative size of the minority class (likely purchasers) could reduce the model's bias toward predicting the majority class (unlikely purchasers).
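
For intuition about how SMOTE enlarges the minority class: it creates synthetic points by interpolating between a minority sample and one of its nearest minority-class neighbors. The sketch below is a simplified illustration of that idea in NumPy, using only the single nearest neighbor. It is not imbalanced-learn's implementation, and `synthesize` and the toy points are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy minority-class points in a 2-D feature space (made-up values)
minority_points = np.array([[3.0, 1.0], [4.0, 2.0], [6.0, 1.5]])

def synthesize(points, rng):
    """Create one synthetic point between a random sample and its nearest neighbor."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    distances[i] = np.inf               # exclude the point itself
    neighbor = points[np.argmin(distances)]
    gap = rng.random()                  # interpolation weight in [0, 1)
    return points[i] + gap * (neighbor - points[i])

new_point = synthesize(minority_points, rng)
print(new_point)
```

Because each synthetic point lies on the segment between two real minority points, the new rows stay inside the region the minority class already occupies.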

### Scaling numeric features

In [35]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import numpy as np

# Features (X) and target (y)
X = df_counts[['anonymized_user_id', 'anonymized_location_id', 'venue_type', 'redemption_count_by_user']]
y = df_counts['purchase_flag']

numeric_features = ['anonymized_user_id', 'anonymized_location_id', 'redemption_count_by_user']

# Preprocessing: scale numeric columns and one-hot encode the categorical column
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('venue_type', OneHotEncoder(), ['venue_type']) # preprocess venue types from text to numeric value
        # ('remainder', 'passthrough', ['anonymized_user_id', 'anonymized_location_id', 'redemption_count_by_user'])  # Keep numerical columns as is
    ],
)

# Split data into training and test sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
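
As a quick illustration of what standard scaling does to columns on very different scales (id-like values versus small counts), the sketch below applies the same formula by hand in NumPy: subtract the column mean and divide by the population standard deviation. The array `X_num` holds made-up values.

```python
import numpy as np

# Toy numeric columns: a large id-like value and a small count (made-up values)
X_num = np.array([
    [1314668.0, 1.0],
    [1862552.0, 2.0],
    [1425149.0, 6.0],
])

col_mean = X_num.mean(axis=0)
col_std = X_num.std(axis=0)   # population standard deviation (ddof=0)

# After scaling, each column has mean 0 and standard deviation 1
X_scaled = (X_num - col_mean) / col_std
print(X_scaled.round(3))
```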

In [36]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Create a pipeline with preprocessing and logistic regression
pipeline = ImbPipeline(steps=[
    ('preprocessor', preprocessor),
    ('smote', SMOTE(sampling_strategy='auto', random_state=42)),
    ('classifier', LogisticRegression(class_weight='balanced', max_iter=1000)) 
])

# Train the model
pipeline.fit(X_train, y_train)

# Predict on the test set
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Classification Report (Precision, Recall, F1-score)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

# Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00     79890
           1       1.00      1.00      1.00      5262

    accuracy                           1.00     85152
   macro avg       1.00      1.00      1.00     85152
weighted avg       1.00      1.00      1.00     85152


Confusion Matrix:
[[79890     0]
 [    0  5262]]


In [37]:
# Predict probabilities for the test set
y_proba = pipeline.predict_proba(X_test)[:, 1]  # Probability of class 1 (will purchase)

print("\nPredicted Likelihood of Purchase:")
print(y_proba)


Predicted Likelihood of Purchase:
[1.00779944e-08 1.34165930e-08 1.67467701e-08 ... 1.11402248e-08
 1.13928444e-08 1.15426209e-08]


Scaling the numeric features did not move the needle much: accuracy is unchanged, and the printed probabilities are still very close to zero.