# Data-driven proposal to maximize the profit of the next marketing campaign
Name: Guilherme Coelho Minervino
Live in: Brasília/DF
Github: http://github.com/guico3lho
Linkedin: https://www.linkedin.com/in/guilherme-coelho-2258751a2/

By convention, customers who responded to the campaign are called **positive customers** and customers who did not respond are called **negative customers**.

## 1. Packages and Functions

### Import packages

In [1]:


import pandas as pd
import matplotlib.pyplot as plt

from pandas.plotting import scatter_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
import datetime
import numpy as np

### Functions

## 2. Preparing data

### Importing dataset

In [None]:
# import dataset from github
df = pd.read_csv('https://raw.githubusercontent.com/ifood/ifood-data-business-analyst-test/master/ml_project1_data.csv',
                 sep=',')
df

In [None]:
df[df['Income'].isnull()]

Notes:
Shape (2240, 29)
Num_Columns_Numerical = 27
Num_Columns_Categorical = 2

###  Preprocessing

Since 24 rows have a null value in the Income column, there are two options: delete those rows or fill the missing values with the median Income of the remaining rows. Filling with the median performed best: accuracy was slightly better when the rows with NaN values were deleted, but the F1 score was worse.
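As a rough illustration of the two options, the sketch below applies both to a tiny made-up Income column (the values are invented for the example and do not come from the dataset). It only shows the mechanics, not the model comparison mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up incomes with two missing values (not taken from the dataset).
toy = pd.DataFrame({"Income": [20000.0, 35000.0, np.nan, 50000.0, np.nan, 120000.0]})

# Option 1: drop the rows with a missing Income (fewer rows remain).
dropped = toy.dropna(subset=["Income"])

# Option 2: keep every row and fill the gaps with the column median.
median_income = toy["Income"].median()  # median of the 4 known values
filled = toy.assign(Income=toy["Income"].fillna(median_income))

print(len(dropped), len(filled), median_income)
```

The median is used rather than the mean because it is less sensitive to a few very high incomes.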

In addition, the case description does not make clear whether 2n Cycle refers to high school or to a postgraduate degree. So, the mean Income of each Education category is analyzed below.
Reference for 2n Cycle as a postgraduate degree: https://www.unibo.it/en/teaching/enrolment-transfer-and-final-examination/the-university-system/what-is-a-second-cycle-degree-programme

In [None]:
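# Alternative that was tested: df = df.dropna()
# Assign the result back: fillna(..., inplace=True) on a column selection is unreliable in recent pandas versions
df['Income'] = df['Income'].fillna(df['Income'].median())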

df.groupby('Education')['Income'].mean().sort_values(ascending=True)

Notes:
Based on the results, 2n Cycle has the second-lowest mean income, so it is presumed to correspond to high school rather than a postgraduate degree.
This information is used later to encode these categories as numbers (OrdinalEncoding).

There are two categorical columns and one date column.
Let's see how many categories each categorical column has.

In [None]:
df_pp = df.copy()
display(df_pp['Education'].value_counts())
display(df_pp['Marital_Status'].value_counts())


To support the exploratory analysis and the predictions, the columns Dt_Customer (date string), Education (ordinal categorical) and Marital_Status (nominal categorical) need to be converted to a numeric representation.


In [None]:
from sklearn.preprocessing import OrdinalEncoder


# Encode the Education column in increasing hierarchical order (Basic (0) -> PhD (4))
categories = [['Basic', '2n Cycle', 'Graduation', 'Master', 'PhD']]
ordinalEncoder = OrdinalEncoder(categories=categories)
df_pp['Education_Cat'] = ordinalEncoder.fit_transform(df_pp['Education'].values.reshape(-1, 1))
df_pp['Education_Cat'] = df_pp['Education_Cat'].astype(int)


Notes:
The OrdinalEncoder was used because the Education column has ordinal categories (there is a hierarchy between them).
Basic, 2n Cycle, Graduation, Master and PhD receive the values (weights) 0, 1, 2, 3 and 4, respectively.
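A minimal sketch of that mapping on a few made-up values, using the same category order as above. The sample array is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Explicit order, lowest to highest, as assumed in this notebook.
order = [["Basic", "2n Cycle", "Graduation", "Master", "PhD"]]
encoder = OrdinalEncoder(categories=order)

# Made-up sample values; the encoder expects a 2D array (one column here).
sample = np.array(["PhD", "Basic", "Graduation", "2n Cycle", "Master"]).reshape(-1, 1)
codes = encoder.fit_transform(sample).ravel().astype(int)

print(codes.tolist())
```

Without the explicit `categories` argument, the encoder would sort the labels alphabetically, which would not respect the hierarchy.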

In [None]:
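# Encode the Marital_Status column with get_dummies, since the order of its categories does not matter
# dtype=int keeps the dummy columns as 0/1 integers (recent pandas versions default to booleans)
df_pp = pd.get_dummies(df_pp, columns=['Marital_Status'], prefix=['Marital_Type'], dtype=int)
# get_dummies removes the original column, so restore it for later inspection
df_pp['Marital_Status'] = df['Marital_Status']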


Notes:
get_dummies was used because the Marital_Status categories have no hierarchy between them (nominal categories). Each category receives its own column.
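For illustration, the same call on a tiny made-up column shows the resulting layout: one 0/1 column per category, with exactly one of them set to 1 in each row. The `dtype=int` argument is an assumption added here so the output is numeric rather than boolean.

```python
import pandas as pd

# Made-up marital statuses (not taken from the dataset).
toy = pd.DataFrame({"Marital_Status": ["Single", "Married", "Single", "Widow"]})

dummies = pd.get_dummies(toy, columns=["Marital_Status"], prefix=["Marital_Type"], dtype=int)
print(dummies)
```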

Transforming the date column into a number

In [None]:
df_pp['Dt_Customer_Number'] = df_pp['Dt_Customer'].apply(lambda x: int(round(datetime.datetime.strptime(x, '%Y-%m-%d').timestamp())))
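One caveat: `datetime.timestamp()` interprets a date without time zone as local time, so the numbers can shift by a few hours from one machine to another. A possible alternative, sketched here on made-up dates and not the method used in this notebook, treats the dates as UTC and works on the whole column at once.

```python
import pandas as pd

dates = pd.Series(["1970-01-02", "2012-09-04", "2014-06-29"])  # made-up sample values
parsed = pd.to_datetime(dates, format="%Y-%m-%d")

# Whole seconds elapsed since 1970-01-01, treating the dates as UTC.
seconds = (parsed - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(seconds.tolist())
```

Either way, a larger number means a more recent enrollment date.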

In [None]:
# Move the Response column to the end of df
df_columns = [col for col in df_pp.columns if col != 'Response']
df_columns.append('Response')
df_pp = df_pp[df_columns]

df after all preprocessing steps:

In [None]:
df_pp

Notes:
10 columns were added by converting the columns Education, Marital_Status and Dt_Customer to numeric form

## 3. Exploratory Analysis

### 3.1. Analyzing 10 samples (5 with target = 1 and 5 with target = 0)
Objective: identify promising and weak features and collect ideas for the analysis that follows

In [None]:
samples = df_pp.sort_values(by=['Response'], ascending=False).groupby('Response').head(5)
samples

Notes on the sample of size 10 (initial hypotheses):
- Columns that seem correlated with Response: MntFruits, MntMeat, MntFish, MntGold; NumWebPurchases, NumCatalogPurchases; Education; Marital_Status
- Positive customers spend more on Wines, Meat, Fish and Gold products than negative customers
- Positive customers have a higher education level and fewer kids

### 3.2 Let's look at the statistics of the data

In [None]:
df_pp.describe()

Notes:
describe() summarizes each numeric column: count, mean, standard deviation, minimum, quartiles and maximum


### 3.3 Let's check the class balance of the dataset

In [None]:
df_pp['Response'].value_counts()

Notes:
- The dataset is imbalanced
- Percentage of customers who responded to the campaign: 15%
- Percentage of customers who did not respond to the campaign: 85%
- Therefore, it can be presumed that the machine learning models will be better at predicting negative customers (0) than positive customers (1)
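A practical consequence of this imbalance is that accuracy alone can look good without the model finding any positive customer. The sketch below uses made-up labels with the same 85/15 split to show the percentages and the accuracy of a trivial model that always predicts the majority class.

```python
import pandas as pd

# Made-up labels with the same 85/15 split as reported above.
response = pd.Series([0] * 85 + [1] * 15, name="Response")

# Share of each class.
shares = response.value_counts(normalize=True)

# Accuracy of always predicting the majority class (0).
baseline_accuracy = shares.max()
print(shares.to_dict(), baseline_accuracy)
```

Any model evaluated later should therefore be compared against roughly 85% accuracy, not against 50%.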

### 3.4 Let's see the mean of each column grouped by the target label (Response)

In [None]:
df_3_4 = df_pp.groupby('Response').mean(numeric_only=True)  # skip the remaining text columns
df_3_4

Note:
- The columns Z_CostContact and Z_Revenue did not appear because they are constant for all samples

Notes:
Considering df_3_4:
- Positive customers have about 10k higher income than negative customers
- Negative customers have more kids and teens at home than positive customers
- Negative customers take longer to make another purchase than positive customers
- Positive customers spend almost twice as much on Wines and Meat as negative customers
- Positive customers spend more on Fruits, Fish, Sweets and Gold than negative customers
- Positive customers buy more through the catalog than negative customers
- Positive customers buy more on the web than negative customers, even though the number of web visits is very similar for both groups (therefore, a positive customer is more likely to buy a product during a web visit than a negative customer)
- Considering all previous campaigns, positive customers responded better than negative customers
- Complain was expected to be higher for negative customers, but the data show similar values for both groups
- The education level of positive customers is higher than that of negative customers
- Looking at the Marital_Status columns, it can be inferred that the new gadget is better received by people without a partner (Single or Divorced)

Notes:
- Year_Birth, Income, MntWines, MntMeat, MntFish, MntSweet, MntGold, NumCatalog, NumWeb, AcceptedCmp[1-5], Education, Alone, Divorced, Single, Widow and Absurd were higher for Response = 1
- Kidhome, Teenhome, Recency, Married and Together were higher for Response = 0

## 4. Visualizations
Let's drop the columns that carry no useful information


In [None]:
df_4 = df_pp.copy()
df_4 = df_4.drop(['ID', 'Z_CostContact', 'Z_Revenue', 'Education', 'Marital_Status'], axis=1)
df_4

### 4.1 Univariate Plot

In [None]:
df_4.hist(figsize=(15, 25))
plt.show()

Notes:
- Many more people have a Graduation than any other education level
- Spending on Fish, Fruit and Wine is concentrated at low values. This makes sense, since there are close to 6 times more negative customers than positive customers and, as seen in section 3.4, negative customers spend less on these products than positive customers

### 4.2 Multivariate Plot

In [None]:

df_4_2 = df_4.copy()
df_4_2 = df_4_2.loc[:, ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']]
scatter_matrix(df_4_2, figsize=(10, 12))
plt.show()

In [None]:

df_4_2 = df_4.copy()
df_4_2 = df_4_2.loc[:, ['MntWines', 'MntFruits',
                        'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'Response']]
scatter_matrix(df_4_2, figsize=(12, 12),alpha=0.1)
plt.show()

Notes:
It can be seen from the scatter matrix above that:
- Fish, Meat, Fruit and Wines are positively correlated with each other

In [None]:

df_4_2 = df_4.copy()
df_4_2 = df_4_2.loc[:, ['NumDealsPurchases', 'NumWebPurchases',
                        'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'Response']]
scatter_matrix(df_4_2, figsize=(10, 12),alpha=0.1)
plt.show()

Notes:
- No insight was obtained from the scatter matrix above

In [None]:

df_4_2 = df_4.copy()
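# Dt_Customer is still a string column and would be left out of the plot, so its numeric version is used
df_4_2 = df_4_2.loc[:, ['Year_Birth', 'Education_Cat', 'Income', 'Kidhome',
                        'Teenhome', 'Dt_Customer_Number', 'Recency', 'Response']]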
scatter_matrix(df_4_2, figsize=(10, 12))
plt.show()

Notes:
- No insight was obtained from the scatter matrix above

Notes:
- The higher the spending on Wine and Meat, the higher the chance that the customer is positive
- The lower the number of store purchases, the higher the chance that the customer is negative

### 4.3 Confirming notes above using corr()

In [None]:
corr = df_pp.corr(numeric_only=True)  # ignore the remaining text columns
corr

Notes:
Analyzing the correlations, a new relevant feature was found: Dt_Customer_Number. The longer someone has been a customer, the higher the chance that they will buy the gadget


## 5. Customer segmentation

KMeans is used to inspect the clusters formed by the features MntWines, MntMeatProducts and Response, to support the claim that high values of these features go together with a higher chance of Response = 1

In [None]:
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


The data needs to be normalized because KMeans relies on distances between points. So, a log function is applied to the two spending columns and StandardScaler to all three columns

In [None]:
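def log_alt(x):
    # log(x) is negative (or -inf) for x < 1, so clamp those cases to 0
    # and avoid the divide-by-zero warning of np.log(0)
    if x < 1:
        return 0
    return np.log(x)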

In [None]:
data = df_pp[["MntWines", "MntMeatProducts", "Response"]]

df_log = pd.DataFrame()
df_log['MntWines'] = data['MntWines'].apply(log_alt)
df_log['MntMeatProducts'] = data['MntMeatProducts'].apply(log_alt)
df_log['Response'] = data['Response']

std_scaler = StandardScaler()
df_scaled = std_scaler.fit_transform(df_log)
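The clamped log can also be written in a vectorized form, sketched below on made-up spending values. `np.maximum(x, 1)` raises every value below 1 to 1, whose log is 0, which matches what `log_alt` returns. After StandardScaler, the column has mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

amounts = np.array([[0.0], [1.0], [10.0], [100.0], [1000.0]])  # made-up spending values

# Same result as log_alt: log(x) for x >= 1, and 0 for x < 1.
logged = np.log(np.maximum(amounts, 1.0))

# Standardize: subtract the column mean and divide by the column standard deviation.
scaled = StandardScaler().fit_transform(logged)
print(logged.ravel(), scaled.mean(), scaled.std())
```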

In [None]:

erros = []
for k in range(1,11):
    model = KMeans(n_clusters=k, random_state=42)
    model.fit(df_scaled)
    erros.append(model.inertia_)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Error of cluster')
sns.pointplot(x=list(range(1,11)), y=erros)
plt.show()

According to the elbow method, a good number of clusters is the point where the error curve bends (the "elbow"), which here is N = 3
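To make the idea concrete, the sketch below runs the same loop on synthetic data with three well-separated groups (the points are generated for the example, not taken from the dataset). The inertia falls sharply up to k = 3 and only slightly afterwards, which is the bend the method looks for.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three groups of 50 points around made-up centers.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

inertias = []
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias.append(model.inertia_)

# Relative drop in inertia gained by each extra cluster (k=2, k=3, ...).
drops = [(inertias[i - 1] - inertias[i]) / inertias[i - 1] for i in range(1, len(inertias))]
print([round(d, 2) for d in drops])
```

On real data the bend is usually less sharp, so the choice of N remains a judgment call.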

In [None]:
model = KMeans(n_clusters=3, random_state=2)
model.fit(df_scaled)
data = data.assign(ClusterLabel = model.labels_)
data.groupby("ClusterLabel")[["MntWines", "MntMeatProducts", "Response"]].median()

In [None]:
import plotly.express as px
fig = px.scatter_3d(
    data_frame=data,
    x="MntWines",
    y="MntMeatProducts",
    z="Response",
    title = "Relationship between MntWines, MntMeatProducts, Response",
    color="ClusterLabel",
    height=500
)
fig.show()

Notes:
Since Cluster 2 represents Response = 1 and Clusters 0 and 1 represent Response = 0, the 3D graph and the table show that positive customers spend more than 200 on meat products and more than 460 on wine products

## 6. Classification methodologies

### Packages

In [None]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report


### Functions

In [None]:
def evaluateModels(X_train, y_train, models, n_splits):
    print(f"{n_splits}-Fold Cross validation")
    results = []
    names = []
    for name, model in models:
        kfold = model_selection.StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=2)
        cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
        results.append(cv_results)
        names.append(name)
        print(f"{name}: Mean Accuracy={cv_results.mean():.5f}, Standard Deviation={cv_results.std():.5f}")

In [None]:
class Models:
    def __init__(self, X_train, X_test , y_train, y_test):
        self.X_train = X_train
        self.X_test = X_test
        self.y_train = y_train
        self.y_test = y_test

    def logistic_regression(self):
        print("Logistic Regression")
        self.name = 'Logistic Regression'
        self.classifier = LogisticRegression(solver='liblinear', multi_class='ovr')
        self.classifier.fit(self.X_train, self.y_train)
        self.y_pred = self.classifier.predict(self.X_test)

    def svm(self):
        self.name = 'SVM'
        print("SVM")
        # self.classifier = SVC(C=1.0, kernel='linear', degree=3, gamma='auto',random_state=0)
        self.classifier = SVC(gamma='auto')

        self.classifier.fit(self.X_train, self.y_train)
        self.y_pred = self.classifier.predict(self.X_test)

    def k_neighbors(self, n_neighbors):
        self.name = 'K Neighbors'
        print("KNN, n_neighbors = {}".format(n_neighbors))
        # self.classifier = KNeighborsClassifier(n_neighbors=5, metric='cosine', p=2)
        self.classifier = KNeighborsClassifier(n_neighbors=n_neighbors, metric='euclidean')

        self.classifier.fit(self.X_train, self.y_train)
        self.y_pred = self.classifier.predict(self.X_test)

    def score(self, type='cr'):
        self.score1 = accuracy_score(self.y_test, self.y_pred)
        self.score2 = precision_score(self.y_test, self.y_pred)
        self.score3 = recall_score(self.y_test, self.y_pred)
        self.score4 = f1_score(self.y_test, self.y_pred)
        self.cm = confusion_matrix(self.y_test, self.y_pred)

        if (type == 'scores'):

            print("---- Scores ----")
            print("Accuracy score is: {}%".format(round(self.score1 * 100, 2)))
            print("Precision score is: {}".format(round(self.score2 * 100, 2)))
            print("Recall score is: {}".format(round(self.score3 * 100, 2)))
            print("F1 score is: {}".format(round(self.score4 * 100, 2)))
        elif (type == 'cr'):
            print("---- Classification Report ----")
            print(classification_report(self.y_test, self.y_pred))

        elif (type == 'cm'):
            print("---- Confusion Matrix ----")

            self.show_confusion_matrix()

    def show_confusion_matrix(self):
        print("Confusion Matrix")
        plt.figure(figsize=(5, 5))

        sns.heatmap(self.cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Negative', 'Positive'],
                    yticklabels=['Negative', 'Positive'])
        plt.xlabel('Predicted')
        plt.ylabel('Truth')
        plt.title('Confusion Matrix')
        plt.show()

### Methodology 1
- Use the features most correlated with the target Response for training

#### Preparing data for input to model

Get the names of the columns most correlated with Response (abs >= 0.09)

In [None]:
# get columns more correlated
df_more_corr = corr.loc[abs(corr['Response']) >= 0.09]
columns_more_corr = df_more_corr.index.tolist()
# create dataframe with these columns
df_m1 = df_pp[columns_more_corr]
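The selection step can be illustrated on a tiny made-up DataFrame (column names and values are invented for the example). Note that Response always passes the filter, since its correlation with itself is 1. Because Response is the last column of the table, it also ends up last in the selection.

```python
import pandas as pd

toy = pd.DataFrame({
    "strong": [1, 3, 1, 3],    # moves exactly with Response
    "noise": [1, 1, 2, 2],     # uncorrelated with Response by construction
    "Response": [0, 1, 0, 1],
})

corr_toy = toy.corr()

# Keep the rows whose absolute correlation with Response reaches the threshold.
selected = corr_toy.index[corr_toy["Response"].abs() >= 0.09].tolist()
print(selected)
```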




In [None]:
# Split array into features and target label
m1_array = df_m1.values
X = m1_array[:, :-1]
y = m1_array[:, -1]

In [None]:
# normalize features
ss = MinMaxScaler()
X = ss.fit_transform(X)

Notes:
MinMaxScaler normalization was chosen because the models gained about 2% in accuracy compared with using no normalization at all.

This normalization maps each column to the range 0 to 1, using the minimum and maximum values of that column to compute the final value.
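In formula form, each value becomes (x - min) / (max - min), computed per column. The sketch below checks this by hand against MinMaxScaler on made-up feature values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_toy = np.array([[10.0, 1.0], [20.0, 3.0], [40.0, 5.0]])  # made-up feature values

# Manual min-max scaling, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
manual = (X_toy - col_min) / (col_max - col_min)

# Same transformation with scikit-learn.
scaled = MinMaxScaler().fit_transform(X_toy)
print(manual)
```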

In [None]:
# split into train and test

X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.8, random_state=2)

#### Evaluating models

Cross-validation is applied to the training split (in each round, one part is used for training and another part for validation), and the test split is kept for the final test. This method was chosen because all of the training data is used for validation at some point. After cross-validation, the best model is applied to unseen data (the test split), which shows whether the model is overfitted to the training data.
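The stratified variant of k-fold used here keeps the class ratio roughly constant in every fold, which matters for an imbalanced target. A small sketch on made-up labels (20% positives, invented for the example) shows how the validation folds are formed.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_toy = np.array([1] * 20 + [0] * 80)          # made-up labels, 20% positives
X_toy = np.arange(len(y_toy)).reshape(-1, 1)   # placeholder feature

kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

fold_sizes = []
positives_per_fold = []
for train_idx, val_idx in kfold.split(X_toy, y_toy):
    fold_sizes.append(len(val_idx))
    positives_per_fold.append(int(y_toy[val_idx].sum()))

print(fold_sizes, positives_per_fold)
```

Every sample lands in exactly one validation fold, so each row of the training split is used for validation once.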

In [None]:
# evaluating models

models = []

models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('KNN', KNeighborsClassifier(n_neighbors=5)))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC(gamma='auto')))

evaluateModels(X_train,y_train, models, 10)

Evaluating the 4 models above using only the training split (iteratively, 90% for training and 10% for validation), Logistic Regression showed the best results, with a mean accuracy of 89% over the 10 rounds

#### Predictions
- Now that the best model has been selected, it is time to test it on unseen data

In [None]:

models = Models(X_train, X_test, y_train, y_test)
models.logistic_regression()
print(f"Methodology 1 Results")
models.score()
models.show_confusion_matrix()

#### Cost-Revenue Confusion Matrix
Considering the idea of the confusion matrix and the columns Z_CostContact and Z_Revenue, I propose the Cost-Revenue Confusion Matrix, whose goal is to calculate the profit of the model
Properties:
- 2x2 shape
- If a customer is a TP, the company had a cost of 3 and a revenue of 11, resulting in a profit of 8 (the model predicts that the customer will buy and he does)
- If a customer is an FP, the company had a cost of 3 and a revenue of 0, resulting in a profit of -3 (the model predicts that the customer will buy and he does not)
- If a customer is an FN, the company would have had a "possible" cost of 3 and a "possible" revenue of 11, so a profit of 8 was missed and is counted as -8 (since the company simply did not profit from a positive customer, a profit of 0 could also be argued, but I adopted -8 MUs by convention)
- If a customer is a TN, the company had a cost of 0 and a "saving" of 3, resulting in a profit of 3 (the model predicts that the customer will not buy and he does not, saving 3 MUs)
- Therefore, each TP customer adds 8 to the final profit
- Each FP customer adds -3 to the final profit
- And so on...
- The final profit is the sum of the element-wise products of the two matrices (equivalently, the dot product of the two flattened matrices)

Therefore, the confusion matrix:
|TN FP|
|FN TP|

Will be mapped to:
|3 -3|
|-8 8|
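As a worked example of this mapping, the sketch below uses a hypothetical confusion matrix (the counts are invented, not the model's results) and computes the profit as 3·TN - 3·FP - 8·FN + 8·TP.

```python
import numpy as np

# Hypothetical confusion matrix in scikit-learn layout [[TN, FP], [FN, TP]].
cm = np.array([[100, 10],
               [20, 30]])

# Profit per customer in each cell, as defined above.
payoff = np.array([[3, -3],
                   [-8, 8]])

# Element-wise product, then sum: 3*100 - 3*10 - 8*20 + 8*30
profit = int((cm * payoff).sum())
print(profit)  # 350
```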

Based on that mapping, the following metrics are considered the most relevant:
- recall, because we want to minimize FN cases (profit of -8) and maximize TP cases (profit of 8)
- accuracy, because we seek to maximize TP (profit of 8) and TN (saving of 3) cases
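For reference, both metrics can be read directly from the four cells of a confusion matrix. The sketch below uses a hypothetical matrix (invented counts, not the model's output) to show the arithmetic.

```python
import numpy as np

# Hypothetical confusion matrix [[TN, FP], [FN, TP]]; not the model's real output.
tn, fp, fn, tp = np.array([[100, 10], [20, 30]]).ravel()

recall_positive = tp / (tp + fn)            # share of positive customers that were found
recall_negative = tn / (tn + fp)            # share of negative customers that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that were correct

print(recall_positive, round(recall_negative, 3), accuracy)
```

A low recall on positive customers means many FN cells, which are the most expensive ones in the mapping above.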


Thus, the model had 98% recall on negative customers, 39% recall on positive customers and an accuracy of 87%.
Based on that, the profit can be calculated as follows:

In [None]:
confusion_matrix_m1 = models.cm
cost_revenue_confusion_matrix_m1 = np.array([[3,-3],[-8,8]])

Computing the dot product of the two flattened matrices

In [None]:
final_profit_m1 = np.dot(confusion_matrix_m1.reshape(4), cost_revenue_confusion_matrix_m1.reshape(4))
print(f"The final profit for methodology 1 is {final_profit_m1} MUs")

### Methodology 2
- Use all features for training

#### Preparing data for input to model

In [None]:
df_m2 = df_pp.copy()
df_m2.drop(columns=['Education', 'Marital_Status','Dt_Customer','Z_Revenue','Z_CostContact'], inplace=True)


In [None]:
m2_array = df_m2.values
X = m2_array[:, :-1]
y = m2_array[:, -1]

In [None]:
from sklearn.preprocessing import MinMaxScaler
ss = MinMaxScaler()
X = ss.fit_transform(X)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size=0.8, random_state=2)

#### Evaluating models

In [None]:
models = []

models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('KNN', KNeighborsClassifier(n_neighbors=5)))
models.append(('CART', DecisionTreeClassifier()))
models.append(('SVM', SVC(gamma='auto')))

evaluateModels(X_train, y_train, models, 10)

Notes:
LR was the best model using StratifiedKFold on the training part of the 80/20 train_test_split, with an accuracy of 0.89107

Since LR got the best results, it is used for the final test: applying the model to unseen data (the test split)

#### Predictions
- Now that the best model has been selected, it is time to test it on unseen data


In [None]:
models = Models(X_train, X_test, y_train, y_test)
models.logistic_regression()
print("Methodology 2 Results")
models.score()
models.show_confusion_matrix()

#### Cost-Revenue Confusion Matrix

Thus, the model had 98% recall on negative customers, 42% recall on positive customers and an accuracy of 88%.
Based on that, the profit can be calculated as follows:

In [None]:
# Same calculation as in Methodology 1, with variables named for Methodology 2
confusion_matrix_m2 = models.cm
cost_revenue_confusion_matrix_m2 = np.array([[3, -3], [-8, 8]])

final_profit_m2 = np.dot(confusion_matrix_m2.reshape(4), cost_revenue_confusion_matrix_m2.reshape(4))
print(f"The final profit for methodology 2 is {final_profit_m2} MUs")

## 7. Conclusion

- It is important to note that the model is better at correctly predicting customers who will not respond to the campaign (TN) than customers who will respond (TP), because the dataset is imbalanced

- I expected the model that uses only the features most correlated with Response (Methodology 1) to perform better. Surprisingly, the model trained on all features (Methodology 2) did best, with a profit of 805 MUs, an accuracy of 88% and a recall of 42%, against 754 MUs, 87% accuracy and 39% recall for Methodology 1

- It is important to tell the CMO that if a customer has low recency, has been a customer for a long time, has no teens or kids at home, is not married (the five features most negatively correlated with Response), accepted campaigns 1, 3 or 5, and spends a lot on Wines or Meat (the five features most positively correlated with Response), the chance of being a positive customer (buying the gadget) is extremely high

## 8. References
- https://machinelearningmastery.com/machine-learning-in-python-step-by-step/
- https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd
- https://towardsdatascience.com/data-exploration-and-analysis-using-python-e564473d7607
- https://medium.com/ml-research-lab/chapter-4-knowledge-from-the-data-and-data-exploration-analysis-99a734792733
- https://www.freecodecamp.org/news/customer-segmentation-python-machine-learning/
- Data Science Do Zero: Noções Fundamentais com Python, 2021 - Joel Grus