## **Multi Mart data for Frequency and Revenue Analysis**

Frequency and Revenue Analysis supports estimating customer lifetime value: the total future profit expected from the relationship with a given customer. Businesses use it to measure how well retention strategies perform and to decide how much to spend on customer acquisition.

### **Data Dictionary**

**Objective**: To understand and gain insights from an E-Commerce dataset by performing various exploratory data analyses, data visualization, and data modelling.<br>
<br>
**Dataset Columns:**

**CustomerID** : Unique customer ID<br>
**first purchase date** : It refers to the date when a customer or user made their initial purchase or transaction with the organization.<br>
**last purchase date** : It refers to the date when a customer or user made their most recent purchase or transaction with the organization.<br>
**total purchases** : It is the count or sum of all purchases made by a customer or user with the organization.<br>
**total revenue** : It is the sum of all revenue generated from customer or user transactions with the organization.<br>
**referral source** : It provides information about how individuals found out about the products or services.<br>
**churn indicator** : This is a binary flag that indicates whether a customer or user has churned (i.e., stopped using the products or services) or is still an active customer. Typically, a value of 1 or "Yes" is used to indicate churn, while 0 or "No" is used to indicate an active customer.<br>
**discount used** : It provides information about whether a discount was utilized for a specific purchase or order.<br>
**product category** : It classifies products into specific categories or groups based on their characteristics or purpose.<br>
**responsetolastcampaign** : This indicates whether and how a customer or user responded to the most recent marketing campaign.<br>
**feedbackscore** : It represents a numeric score or rating provided by customers or users as feedback for a product, service, or experience.<br>
**preferredpaymentmethod** : It provides information about the customer's preferred way to make payments.<br>
**supportticketsraised** : It represents the number of customer or user support tickets that have been opened or raised by individuals seeking assistance, reporting issues, or making inquiries.<br>
**hasloyaltycard** : This is a binary indicator that shows whether a customer or user possesses a loyalty card with the organization.<br>
**frequency** : The frequency column represents how often a customer or user interacts with the organization, such as making purchases, engaging with the services, or participating in activities. The frequency column is based on the first purchase date and the last purchase date period.
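
The date-based columns can be illustrated with a small sketch. It assumes a definition the dataset does not document: `tenure` as the days between first and last purchase, `recency` as the days since the last purchase relative to the latest date in the data, and `frequency` as purchases per 30 days of tenure. The toy rows are invented; only the column names come from the code later in this notebook.

```python
import pandas as pd

# Hypothetical toy records; values are invented for illustration
toy = pd.DataFrame({
    "customerid": [1, 2, 3],
    "firstpurchasedate": ["2023-01-01", "2023-03-15", "2023-06-01"],
    "lastpurchasedate": ["2023-12-31", "2023-03-15", "2023-09-29"],
    "totalpurchases": [24, 1, 8],
})

first = pd.to_datetime(toy["firstpurchasedate"])
last = pd.to_datetime(toy["lastpurchasedate"])

# Assumed snapshot date: the most recent purchase seen in the data
snapshot = last.max()

# Tenure: days between first and last purchase (0 for one-time buyers)
toy["tenure"] = (last - first).dt.days

# Recency: days since the most recent purchase, relative to the snapshot
toy["recency"] = (snapshot - last).dt.days

# Frequency (assumed definition): purchases per 30 days of tenure,
# with tenure floored at 30 days to avoid dividing by zero
toy["frequency"] = toy["totalpurchases"] / (toy["tenure"].clip(lower=30) / 30)

print(toy[["customerid", "tenure", "recency", "frequency"]])
```

If the dataset's own `frequency` column was built differently, the same pattern applies with the formula swapped out.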


In [None]:
# importing the necessary  libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.io as pio
import itertools
import warnings
import plotly.offline as py

# Initialize Plotly in notebook mode
py.init_notebook_mode(connected=True)

# Enable inline plotting for matplotlib
%matplotlib inline

# Set the default Plotly renderer to 'colab'
pio.renderers.default = 'colab'

# Ignore warnings to prevent them from being displayed
warnings.filterwarnings("ignore")

In [None]:
# #Mount google drive to access data set
# from google.colab import drive
# drive.mount('/content/drive')

# reading the data
data = pd.read_csv('Customer_Lifetime_Value_Dataset.csv')

# printing the head of the data
data.head()

In [None]:
# getting an overview of the data
data.info()

In [None]:
# printing an overview of the dataset
print ("Rows     : " ,data.shape[0])
print ("Columns  : " ,data.shape[1])
print ("\nFeatures : \n" ,data.columns.tolist())
print ("\nUnique values :  \n",data.nunique())

In [None]:
# Count missing values
missing_data = data.isnull()
for column in missing_data.columns.values.tolist():
    print(column)
    print (missing_data[column].value_counts())
    print("")


The dataset consists of 10k records and has no missing values.

In [None]:
# counting distinct values per column (a count below the row count means values repeat)
data.nunique()

Several columns contain repeated values, which is expected in a retail dataset: many customers share the same payment method, referral source, or product category.
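
Repeated values within a column are different from duplicated records. The sketch below, on an invented four-row frame, shows one way to tell them apart with `DataFrame.duplicated()`. The column names mirror the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame: the last two rows are identical
toy = pd.DataFrame({
    "customerid": [101, 102, 103, 103],
    "preferredpaymentmethod": ["cash", "paypal", "cash", "cash"],
    "totalpurchases": [5, 2, 7, 7],
})

# Distinct values per column (repeats in a categorical column are normal)
unique_per_column = toy.nunique()

# Fully duplicated rows (every column matches an earlier row)
duplicate_rows = int(toy.duplicated().sum())

# Repeated customer IDs, which would be suspicious for a unique key
duplicate_ids = int(toy["customerid"].duplicated().sum())

# Missing values per column in a single call
missing_per_column = toy.isnull().sum()

print(unique_per_column)
print("Duplicate rows:", duplicate_rows)
print("Duplicate customer IDs:", duplicate_ids)
print(missing_per_column)
```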

### **Statistical Analysis**

In [None]:
# getting an overall statistical analysis of the data by using .describe()
data.describe()

### **One-Hot Encoding**

In [None]:
# # Identify categorical columns
# categorical_columns = data.select_dtypes(include=['object']).columns

columns_to_encode = ['referralsource', 'responsetolastcampaign', 'preferredpaymentmethod']

# Perform one-hot encoding for all categorical columns
onehot_data = pd.get_dummies(data, columns=columns_to_encode)

# # Display the encoded DataFrame
# print(onehot_data)

In [None]:
onehot_data.columns

In [None]:
onehot_data.head()
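
As a small illustration of what `pd.get_dummies` produces, the sketch below encodes a hypothetical four-row frame. The values are invented; only the column name `preferredpaymentmethod` comes from the dataset. It also shows `drop_first=True`, which may be worth considering for linear models because the dummy columns of one feature always sum to one.

```python
import pandas as pd

# Hypothetical toy frame for illustration
toy = pd.DataFrame({
    "customerid": [1, 2, 3, 4],
    "preferredpaymentmethod": ["cash", "paypal", "cash", "debit card"],
})

# One column per category, stored as 0/1 integers
encoded = pd.get_dummies(toy, columns=["preferredpaymentmethod"], dtype=int)

# Same encoding, but the first category (alphabetically) is dropped
encoded_drop_first = pd.get_dummies(
    toy, columns=["preferredpaymentmethod"], drop_first=True, dtype=int
)

print(encoded.columns.tolist())
print(encoded_drop_first.columns.tolist())
```

Tree-based models such as random forests are generally not affected by the redundant column, so the full encoding used in this notebook is a reasonable default for them.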

### **Correlation with One-Hot Encoded Columns**

In [None]:
# importing the necessary libraries
import seaborn as sns
import matplotlib.pyplot as plt

# correlation matrix over the numeric and one-hot encoded columns
# (numeric_only=True skips the date and text columns, which pandas >= 2.0 would otherwise reject)
matrix = onehot_data.corr(numeric_only=True)

# Creating a heatmap
plt.figure(figsize=(40, 30))
sns.heatmap(matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix of Numerical and One-Hot Encoded Features')
plt.show()

### **Random Forest**

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt

target_column = 'totalpurchases'
# Dropping non-numeric columns
X = onehot_data.drop([target_column, 'customerid', 'firstpurchasedate', 'lastpurchasedate', 'productcategory', 'avgpurchasevalue', 'avgtimebetweenpurchases', 'supportticketsraised',
                      'hasloyaltycard'], axis=1)
y = onehot_data[target_column]

# Converting non-numeric columns to numeric
X_numeric = X.apply(pd.to_numeric, errors='coerce')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y, test_size=0.2, random_state=42)

# Initialize the RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict the target values
y_pred_rf = rf.predict(X_test)

# Calculate R-squared score
r2_score(y_test, y_pred_rf)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_rf)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_rf)

# Print the results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)

### **Linear Regression**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

target_column = 'totalpurchases'
# Dropping non-numeric and target columns
X = onehot_data.drop([target_column, 'customerid', 'firstpurchasedate', 'lastpurchasedate', 'productcategory', 'avgpurchasevalue', 'avgtimebetweenpurchases', 'supportticketsraised',
                      'hasloyaltycard'], axis=1)
y = onehot_data[target_column]

# Converting non-numeric columns to numeric
X_numeric = X.apply(pd.to_numeric, errors='coerce')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y, test_size=0.2, random_state=42)

# Initializing the Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lr = linear_reg.predict(X_test)

# Evaluate the model
# mean_squared_error(y_test, y_pred)
r2_score(y_test, y_pred_lr)

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_lr)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_lr)

# Print the results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
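
The metric cells repeat the same calls for each model. One way to avoid the repetition, and the risk of printing values left over from a previous model, is a small helper. `regression_report` is a hypothetical name introduced here, not part of scikit-learn, and the four data points are invented.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for one set of predictions."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mse": float(mse),
        "rmse": float(np.sqrt(mse)),
        "mae": float(mean_absolute_error(y_true, y_pred)),
        "r2": float(r2_score(y_true, y_pred)),
    }


# Toy example: errors are -1, 0, +1, 0
report = regression_report([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

In the notebook this could be called as `regression_report(y_test, y_pred_rf)` or `regression_report(y_test, y_pred_lr)` right after each model is fitted.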

### **Feature Importance**

In [None]:
# Get feature importances
feature_importances = rf.feature_importances_

# Create a DataFrame with feature names and their importance scores
feature_importance_df = pd.DataFrame({
    'Feature': X_numeric.columns,
    'Importance': feature_importances
})

# Sort the DataFrame by importance
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the feature importance DataFrame
print("Feature Importance:")
print(feature_importance_df)

# Plot the top N feature importances
# top_n = 10  # You can adjust this based on your preference
# plt.figure(figsize=(12, 8))
# sns.barplot(x='Importance', y='Feature', data=feature_importance_df.head(top_n))
# plt.title(f'Top {top_n} Feature Importances')
# plt.show()

### **PCA**

In [None]:
from sklearn.decomposition import PCA
import pandas as pd

# Initialize PCA with the desired number of components
n_components = 3  # Number of principal components to keep
pca = PCA(n_components=n_components)

# Fit PCA to your data and transform the data to the new feature space
X_pca = pca.fit_transform(X)

# Create a DataFrame for the transformed data
X_pca_df = pd.DataFrame(data=X_pca, columns=[f'PC{i+1}' for i in range(n_components)])

# Display the transformed data
print("Transformed Data (First 5 rows):")
print(X_pca_df.head())

In [None]:
# Assuming 'pca' is your trained PCA model
# Access the loadings (coefficients) for each principal component
loadings = pca.components_

# Retrieve the loadings for PC1
loadings_pc1 = loadings[0]  # Assuming PC1 is the first component

# Create a DataFrame to display the loadings
loadings_df = pd.DataFrame(data=loadings_pc1, index=X.columns, columns=['Loading_PC1'])

# Display the loadings for PC1
print("Loadings for PC1:")
print(loadings_df)
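
PCA is sensitive to the scale of its inputs, so a column measured in large units (such as revenue) can dominate the first component. The sketch below is an assumption about how this could be handled, not the approach used above: it standardizes the features first and then reports the explained variance ratio. The data is synthetic; the column names are borrowed from the dataset only for readability.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two correlated columns on very different scales, one independent column
rng = np.random.default_rng(42)
n = 200
base = rng.normal(size=n)
toy = pd.DataFrame({
    "totalrevenue": 1000 * base + rng.normal(scale=50, size=n),
    "frequency": base + rng.normal(scale=0.1, size=n),
    "feedbackscore": rng.normal(size=n),
})

# Standardize, then project onto two principal components
pipeline = make_pipeline(StandardScaler(), PCA(n_components=2))
scores = pipeline.fit_transform(toy)

# Share of total variance captured by each component
pca_step = pipeline.named_steps["pca"]
explained = pca_step.explained_variance_ratio_

# Loadings: one row per original feature, one column per component
loadings = pd.DataFrame(pca_step.components_.T, index=toy.columns, columns=["PC1", "PC2"])

print("Explained variance ratio:", explained)
print(loadings)
```

The explained variance ratio gives a basis for choosing `n_components` instead of fixing it at 3 in advance.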


### **Random Forest with PCA and Feature Importance Columns**

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt

target_column = 'totalpurchases'
features = ['frequency', 'tenure', 'recency', 'totalrevenue', 'feedbackscore', 'discountsused', 'churnindicator', 'referralsource_Online advertisements', 'preferredpaymentmethod_debit card', 'referralsource_Influencer endorsements', 'preferredpaymentmethod_credit card', 'referralsource_Social media promotions', 'referralsource_Word of mouth', 'referralsource_Email campaigns', 'responsetolastcampaign_opened mail', 'referralsource_In-store promotions', 'preferredpaymentmethod_cash', 'preferredpaymentmethod_apple pay', 'responsetolastcampaign_ignored', 'preferredpaymentmethod_paypal']
# Dropping non-numeric columns
X = onehot_data[features]
y = onehot_data[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict the target values
y_pred_rf = rf.predict(X_test)

# Calculate R-squared score
r2_score(y_test, y_pred_rf)

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_rf)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_rf)

# Print the results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print(r2_score(y_test, y_pred_rf))

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import seaborn as sns
import matplotlib.pyplot as plt

target_column = 'totalpurchases'
features = ['frequency', 'tenure', 'recency', 'totalrevenue', 'feedbackscore', 'discountsused', 'churnindicator']
# Dropping non-numeric columns
X = onehot_data[features]
y = onehot_data[target_column]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the RandomForestRegressor
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict the target values
y_pred_rf = rf.predict(X_test)

# Calculate R-squared score
r2_score(y_test, y_pred_rf)

from sklearn.metrics import mean_squared_error, mean_absolute_error

# Calculate Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred_rf)

# Calculate Root Mean Squared Error (RMSE)
rmse = np.sqrt(mse)

# Calculate Mean Absolute Error (MAE)
mae = mean_absolute_error(y_test, y_pred_rf)

# Print the results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print(r2_score(y_test, y_pred_rf))

### **Linear Regression using PCA and Feature Importance Columns**

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

target_column = 'totalpurchases'
features = ['frequency', 'tenure', 'recency', 'totalrevenue', 'feedbackscore', 'discountsused', 'churnindicator', 'referralsource_Online advertisements', 'preferredpaymentmethod_debit card', 'referralsource_Influencer endorsements', 'preferredpaymentmethod_credit card', 'referralsource_Social media promotions', 'referralsource_Word of mouth', 'referralsource_Email campaigns', 'responsetolastcampaign_opened mail', 'referralsource_In-store promotions', 'preferredpaymentmethod_cash', 'preferredpaymentmethod_apple pay', 'responsetolastcampaign_ignored', 'preferredpaymentmethod_paypal']
# Dropping non-numeric columns
X = onehot_data[features]
y = onehot_data[target_column]

# Converting non-numeric columns to numeric
X_numeric = X.apply(pd.to_numeric, errors='coerce')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y, test_size=0.2, random_state=42)

# Initializing the Linear Regression model
linear_reg = LinearRegression()
linear_reg.fit(X_train, y_train)

# Make predictions on the test set
y_pred_lr = linear_reg.predict(X_test)

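# Evaluate the model on the test set using the linear regression predictions
from sklearn.metrics import mean_absolute_error

mse = mean_squared_error(y_test, y_pred_lr)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred_lr)

# Print the results
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)
print("Mean Absolute Error (MAE):", mae)
print("R-squared:", r2_score(y_test, y_pred_lr))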

In [None]:
!pip install lazypredict

In [None]:
from lazypredict.Supervised import LazyRegressor

target_column = 'totalpurchases'
features = ['frequency', 'tenure', 'recency', 'totalrevenue', 'feedbackscore', 'discountsused', 'churnindicator', 'referralsource_Online advertisements', 'preferredpaymentmethod_debit card', 'referralsource_Influencer endorsements', 'preferredpaymentmethod_credit card', 'referralsource_Social media promotions', 'referralsource_Word of mouth', 'referralsource_Email campaigns', 'responsetolastcampaign_opened mail', 'referralsource_In-store promotions', 'preferredpaymentmethod_cash', 'preferredpaymentmethod_apple pay', 'responsetolastcampaign_ignored', 'preferredpaymentmethod_paypal']
# Dropping non-numeric columns
X = onehot_data[features]
y = onehot_data[target_column]

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize LazyRegressor
reg = LazyRegressor(verbose=0, ignore_warnings=True, custom_metric=None)

# Fit LazyRegressor
models, predictions = reg.fit(X_train, X_test, y_train, y_test)

# Print model performance
print(models)

### **Gradient Boosting**

In [None]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Assuming 'totalpurchases' is your target column
target_column = 'totalpurchases'

# Drop non-numeric and target columns
X = onehot_data.drop([target_column, 'customerid', 'firstpurchasedate', 'lastpurchasedate', 'productcategory','supportticketsraised','hasloyaltycard','avgpurchasevalue',
               'avgtimebetweenpurchases'], axis=1)
y = onehot_data[target_column]

# Convert non-numeric columns to numeric
X_numeric = X.apply(pd.to_numeric, errors='coerce')

# Fill NaN values (resulting from non-numeric conversion) with some appropriate value or strategy
X_numeric.fillna(0, inplace=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y, test_size=0.2, random_state=42)

# Initialize the GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42)

# Fit the model
gbr.fit(X_train, y_train)

# Predict on the testing data
y_pred = gbr.predict(X_test)

# Calculate R-squared value
r2_encoded = r2_score(y_test, y_pred)

print("R-squared value using Gradient Boosting Regressor model:", r2_encoded)


### **Grid Search for 'totalpurchases'**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Assuming 'totalpurchases' is your target column
target_column = 'totalpurchases'

# Drop non-numeric and target columns
X = onehot_data[['frequency', 'tenure', 'recency', 'totalrevenue', 'feedbackscore', 'discountsused', 'churnindicator']]
y = onehot_data[target_column]

# Convert non-numeric columns to numeric
X_numeric = X.apply(pd.to_numeric, errors='coerce')

# Fill NaN values (resulting from non-numeric conversion) with some appropriate value or strategy
X_numeric.fillna(0, inplace=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y, test_size=0.2, random_state=42)

# Define the parameter grid to search through
param_grid = {
    'n_estimators': [100, 200, 300],  # Number of boosting stages to be run
    'learning_rate': [0.05, 0.1, 0.2],  # Boosting learning rate
    'max_depth': [3, 4, 5]  # Maximum depth of the individual regression estimators
}

# Initialize the GradientBoostingRegressor
gbr = GradientBoostingRegressor(random_state=42)

# Create the GridSearchCV object
grid_search = GridSearchCV(estimator=gbr, param_grid=param_grid, cv=5, scoring='r2', n_jobs=-1)

# Perform grid search
grid_search.fit(X_train, y_train)

# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_

print("Best parameters found:", best_params)
print("Best R-squared score found:", best_score)

# Use the best estimator for prediction
best_estimator = grid_search.best_estimator_
y_pred_grid = best_estimator.predict(X_test)

# Calculate R-squared value using the best estimator
r2_grid = r2_score(y_test, y_pred_grid)
print("R-squared value using the best estimator from GridSearchCV:", r2_grid)


In [None]:
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.preprocessing import LabelEncoder
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score

# Assuming 'totalpurchases' is your target column
target_column = 'totalpurchases'

# Drop non-numeric and target columns
X = onehot_data[['frequency', 'tenure', 'recency', 'totalrevenue', 'feedbackscore']]
y = onehot_data[target_column]

# Convert non-numeric columns to numeric
X_numeric = X.apply(pd.to_numeric, errors='coerce')

# Fill NaN values (resulting from non-numeric conversion) with some appropriate value or strategy
X_numeric.fillna(0, inplace=True)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_numeric, y, test_size=0.2, random_state=42)

# Initialize the GradientBoostingRegressor
gbr = GradientBoostingRegressor(n_estimators=100, random_state=42, learning_rate = 0.05, max_depth = 3)

# Fit the model
gbr.fit(X_train, y_train)

# Predict on the testing data
y_pred = gbr.predict(X_test)

# Calculate R-squared value
r2_encoded = r2_score(y_test, y_pred)

print("R-squared value using Gradient Boosting Regressor model:", r2_encoded)

## **'hasloyaltycard' Analysis**

In [None]:
df = pd.read_csv("/content/drive/MyDrive/DAB303/group project/project-4/Customer_Lifetime_Value_Dataset.csv")
df.head()

### **Label Encoding**

In [None]:
from sklearn.preprocessing import LabelEncoder

# Select the columns to be label encoded
columns_to_encode = ['referralsource', 'responsetolastcampaign', 'preferredpaymentmethod', 'hasloyaltycard']

# Create a LabelEncoder instance
label_encoder = LabelEncoder()

# Create a new DataFrame 'encoded_data' to store the label-encoded data
encoded_data = df.copy()

# Iterate over the selected columns and apply label encoding
for column in columns_to_encode:
    # Fit and transform the column
    encoded_data[column] = label_encoder.fit_transform(encoded_data[column])

encoded_data.head()

### **Label Encoded Correlation**

In [None]:
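# numeric_only=True skips the date and product category columns, which are still text
encoded_data.corr(numeric_only=True)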

In [None]:
# Calculate the correlation matrix (numeric columns only)
correlation_matrix = encoded_data.corr(numeric_only=True)

# Set the threshold values
positive_threshold = 0.0999999
negative_threshold = -0.0999999

# Find correlation values greater than 0.0999999
positive_correlations = correlation_matrix[correlation_matrix > positive_threshold].stack()

# Find correlation values lesser than -0.0999999
negative_correlations = correlation_matrix[correlation_matrix < negative_threshold].stack()

# Display the results
print("Correlation Values Greater than 0.0999999:")
print(positive_correlations)

print("\nCorrelation Values Lesser than -0.0999999:")
print(negative_correlations)
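
Filtering the full matrix also returns each column's correlation with itself (always 1.0) and lists every pair twice. A possible refinement, sketched here on invented data, keeps only the upper triangle before applying the threshold. The column names `a` to `d` are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 3, 4, 1, 2],
    "d": [1, 0, 1, 0, 1],
})
corr = toy.corr()

# Boolean mask that is True strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Keep each pair once and drop the self-correlations
pairs = corr.where(mask).stack()

# Apply the threshold on the absolute value, strongest pairs first
strong = pairs[pairs.abs() > 0.1].sort_values(key=np.abs, ascending=False)

print(strong)
```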

In [None]:
encoded_data.columns

### **Lazy Predict and Random Forest Classifier**

In [None]:
from lazypredict.Supervised import LazyClassifier

target_column = 'hasloyaltycard'
# features = ['hasloyaltycard', 'firstpurchasedate', 'lastpurchasedate', 'productcategory', 'customerid']
# Dropping non-numeric columns
y = encoded_data[target_column]
X = encoded_data.drop(['hasloyaltycard', 'firstpurchasedate', 'lastpurchasedate', 'productcategory', 'supportticketsraised'], axis=1)
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize LazyClassifier
classy = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)

# Fit LazyClassifier
models, predictions = classy.fit(X_train, X_test, y_train, y_test)

# Print model performance
print(models)

In [None]:
from sklearn.ensemble import RandomForestClassifier

target_column = 'hasloyaltycard'

# Dropping non-numeric columns
X_best = encoded_data.drop(['hasloyaltycard', 'firstpurchasedate', 'lastpurchasedate', 'productcategory', 'supportticketsraised', 'customerid'], axis=1)
y_best = encoded_data[target_column]

# Split data into train and test sets (using the feature set without customerid)
X_train, X_test, y_train, y_test = train_test_split(X_best, y_best, test_size=0.2, random_state=42)

# Initialize the RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

# Predict the target values
y_pred_rfc = rfc.predict(X_test)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, r2_score

# y_test contains the true labels and y_pred_rfc the predicted labels
accuracy = accuracy_score(y_test, y_pred_rfc)
precision = precision_score(y_test, y_pred_rfc)
recall = recall_score(y_test, y_pred_rfc)
f1 = f1_score(y_test, y_pred_rfc)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

# R-squared on the classifier's predictions (y_pred_rfc), not the earlier regression output
r2_score(y_test, y_pred_rfc)

### **Grid Search**

In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid to search
param_grid = {
    'n_estimators': [300, 400, 350, 250],
    'max_depth': [35, 40, 45, 50],
    'min_samples_split': [25, 27, 30,33],
    'min_samples_leaf': [8, 10, 11, 12]
}

# Initialize the RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Initialize GridSearchCV with the specified parameter grid and the classifier
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=5, scoring='accuracy')

# Fit GridSearchCV to the training data
grid_search.fit(X_train, y_train)

# Print the best parameters found by GridSearchCV
print("Best Parameters:", grid_search.best_params_)

# Get the best model
best_rf = grid_search.best_estimator_

# Predict on the testing data using the best model
y_pred = best_rf.predict(X_test)

# Calculate precision, recall, and F1 score
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

In [None]:
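# Get feature importances from the fitted best model
# (GridSearchCV fits clones, so the original rf object is never fitted)
feature_importances = best_rf.feature_importances_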

# Create a DataFrame to display feature importances
feature_importance_df = pd.DataFrame({'Feature': X_train.columns, 'Importance': feature_importances})

# Sort the DataFrame by importance values in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the feature importance DataFrame
print("Feature Importance:")
print(feature_importance_df)

### **OneHot Encoded**

In [None]:
from sklearn.preprocessing import LabelEncoder

# Select the columns to be one-hot encoded
columns_to_onehot_encode = ['referralsource', 'responsetolastcampaign', 'preferredpaymentmethod']

# Create a new DataFrame 'onehot' to store the one-hot encoded data
onehot = df.copy()

# Iterate over the selected columns and apply one-hot encoding
for column in columns_to_onehot_encode:
    # Apply one-hot encoding
    encoded_column = pd.get_dummies(onehot[column], prefix=column)

    # Concatenate the one-hot encoded column with the original DataFrame
    onehot = pd.concat([onehot, encoded_column], axis=1)

    # Drop the original column
    onehot.drop(column, axis=1, inplace=True)

# Label encode the 'hasloyaltycard' column
label_encoder = LabelEncoder()
onehot['hasloyaltycard'] = label_encoder.fit_transform(onehot['hasloyaltycard'])

onehot.head()

In [None]:
onehot.columns

### **Random Forest One-Hot**

In [None]:
from sklearn.ensemble import RandomForestClassifier

target_column = 'hasloyaltycard'

# Dropping non-numeric columns
X = onehot.drop(['hasloyaltycard', 'firstpurchasedate', 'lastpurchasedate', 'productcategory', 'supportticketsraised'], axis=1)
y = onehot[target_column]

# Split data into train and test sets
X_trainoh, X_testoh, y_trainoh, y_testoh = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize the RandomForestClassifier
rfcl = RandomForestClassifier(n_estimators=100, random_state=42)
rfcl.fit(X_trainoh, y_trainoh)

# Predict the target values
y_pred_rfcl = rfcl.predict(X_testoh)

# y_testoh contains the true labels and y_pred_rfcl the predicted labels
accuracy = accuracy_score(y_testoh, y_pred_rfcl)
precision = precision_score(y_testoh, y_pred_rfcl)
recall = recall_score(y_testoh, y_pred_rfcl)
f1 = f1_score(y_testoh, y_pred_rfcl)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)

## **Data Imbalance**

### **K-Fold Cross Validation**

In [None]:
from sklearn.model_selection import cross_validate
from sklearn.metrics import make_scorer, precision_score, recall_score, f1_score, accuracy_score

# Initialize the RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=100, random_state=42)

# Define the scoring metrics
scoring = {'precision': make_scorer(precision_score),
           'recall': make_scorer(recall_score),
           'f1_score': make_scorer(f1_score),
           'accuracy': make_scorer(accuracy_score)}

# Perform k-fold cross-validation
k = 5
cv_results = cross_validate(rfc, X_best, y_best, cv=k, scoring=scoring)

# Print the cross-validation results
print("Cross-Validation Scores:")
for metric in scoring:
    mean_cv_score = cv_results[f"test_{metric}"].mean()
    print(f"{metric.capitalize()}:")
    print("  Mean:", mean_cv_score)
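
Since this section concerns class imbalance, it may help to inspect the class shares before trusting accuracy. The sketch below runs on synthetic data from `make_classification` (not the notebook's dataset) and combines three options that scikit-learn provides: stratified folds, `class_weight="balanced"`, and the `balanced_accuracy` scorer. Whether these help on the real `hasloyaltycard` target is an open question.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: roughly 85% class 0 and 15% class 1
X_toy, y_toy = make_classification(
    n_samples=500, n_features=6, weights=[0.85, 0.15], random_state=42
)

# Share of each class; a large gap signals imbalance
class_share = pd.Series(y_toy).value_counts(normalize=True)
print(class_share)

# Stratified folds keep the class shares similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# class_weight="balanced" reweights classes inversely to their frequency
model = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=42)

metric_names = ["balanced_accuracy", "f1"]
cv_results = cross_validate(model, X_toy, y_toy, cv=cv, scoring=metric_names)

# Average each metric across the folds
mean_scores = {name: cv_results[f"test_{name}"].mean() for name in metric_names}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```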