## The only product-level information available about the new products is their cost range and product category (cream, foundation, lip color, etc.).

## Mine last month's cosmetics sales data and use the relevant features to estimate which products will sell more (`Purchased?` = 1)

## Task 0: Understand the Data

In [None]:
## Import libraries
import os
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sb

In [None]:
# Load the data from previous months (past)
Past = pd.read_csv("Past_month_products.csv")
print(Past.shape)
Past.head()

(5000, 37)


Unnamed: 0,product_id,user_id,NumOfEventsInJourney,NumSessions,interactionTime,maxPrice,minPrice,NumCart,NumView,NumRemove,InsessionCart,InsessionView,InsessionRemove,Weekend,Fr,Mon,Sat,Sun,Thu,Tue,Wed,2019,2020,Jan,Feb,Oct,Nov,Dec,Afternoon,Dawn,EarlyMorning,Evening,Morning,Night,Purchased?,Noon,Category
0,5866936,561897800.0,1.333333,1.333333,5550.0,15.84,15.84,0.0,1.333333,0.0,0.0,1.0,0.0,0.0,0.0,0.333333,0.0,0.0,0.333333,0.0,0.333333,0.333333,0.666667,0.333333,0.333333,0.333333,0.0,0.0,0.0,0.0,0.666667,0.333333,0.0,0.0,0,0.0,1.0
1,5647110,532652900.0,2.25,1.5,27556.5,5.8,5.565,1.25,0.25,0.25,3.75,2.25,9.0,0.0,0.0,0.25,0.0,0.25,0.0,0.25,0.25,0.5,0.5,0.0,0.5,0.0,0.25,0.25,0.75,0.0,0.0,0.25,0.0,0.0,0,0.0,1.0
2,5790472,457810900.0,1.0,1.0,0.0,6.2725,6.2725,0.25,0.75,0.0,17.25,30.0,2.5,0.0,0.25,0.25,0.25,0.25,0.0,0.0,0.0,0.5,0.5,0.0,0.5,0.25,0.25,0.0,0.0,0.0,0.0,0.75,0.25,0.0,0,0.0,1.0
3,5811598,461264100.0,1.5,1.5,131532.5,5.56,5.56,0.25,1.0,0.25,3.25,10.5,1.0,0.0,0.0,0.25,0.25,0.0,0.25,0.25,0.0,0.5,0.5,0.5,0.0,0.0,0.25,0.25,0.0,0.0,0.0,0.5,0.0,0.25,0,0.25,1.0
4,5846363,515799300.0,1.875,1.375,11055.875,4.08625,4.08625,0.5,1.0,0.25,4.875,3.375,4.25,0.0,0.125,0.125,0.375,0.0,0.25,0.125,0.0,0.75,0.25,0.125,0.125,0.25,0.25,0.25,0.375,0.0,0.125,0.25,0.25,0.0,1,0.0,1.0


In [None]:
# Next, load the data regarding products to be launched next month
Next = pd.read_csv("Next_month_products.csv")
print(Next.shape)
Next.head()

(30091, 5)


Unnamed: 0,product_id,maxPrice,minPrice,Purchased?,Category
0,5866502,7.616667,7.616667,0,1.0
1,5870408,6.27,6.27,0,3.0
2,5900580,10.008,10.008,0,1.0
3,5918778,5.98,5.98,0,2.5
4,5848772,26.83,26.83,0,1.0


### Besides the target `Purchased?`, only the `product_id`, `maxPrice`, `minPrice`, and `Category` columns are common to both the training and test data

# Task 1: Exploratory Data Analysis (EDA) and Data Preparation
## EDA: Find the following:
1. Percentage of Purchased events in train data
2. Percentage of Purchased events in test data
3. Are there any overlaps in product ID between train and test data?

In [None]:
y_train = Past['Purchased?'].values
print(f"Percentage of Purchased in Training data = {(np.sum(y_train)/len(y_train))*100}")

y_test = Next['Purchased?'].values
print(f"Percentage of Purchased in Test data = {(np.sum(y_test)/len(y_test))*100}")

# Verify that every product ID in the training data appears only once
print(f"Every product ID in the training data appears only once: {len(np.unique(Past['product_id'])) == Past.shape[0]}")

# Verify that every product ID in the test data appears only once
print(f"Every product ID in the test data appears only once: {len(np.unique(Next['product_id'])) == Next.shape[0]}")

# Concatenate the product_id columns of the training and test DataFrames
frames = [Past['product_id'], Next['product_id']]
result = np.array(pd.concat(frames))

# Get all the unique product IDs and their counts
prod, prod_counts = np.unique(result, return_counts=True)

# Determine whether any product IDs appear in both the training and test data
num = (prod_counts > 1).astype(int)
overlap = set(Past['product_id']).intersection(set(Next['product_id']))
print(f"Number of product IDs appearing more than once in the combined training and test data = {sum(num)}")
print(f"These product IDs are present in both the training and test data: {overlap}")

Percentage of Purchased in Training data = 34.38
Percentage of Purchased in Test data = 34.42557575354757
Every product ID in the training data appears only once: True
Every product ID in the test data appears only once: True
Number of product IDs appearing more than once in the combined training and test data = 0
These product IDs are present in both the training and test data: set()


## Next, create `X_train`, `y_train`, `X_test`, and `y_test`. Remember the following: 
1. The `Purchased?` column is the target
2. `X_train` and `X_test` should contain the same features
3. `product_id` should NOT be one of those features. Can you see why?

In [None]:
def return_train_test_data(df_old, df_new):
    X_train = df_old[['maxPrice', 'minPrice', 'Category']].values
    y_train = df_old[['Purchased?']].values
    X_test  = df_new[['maxPrice', 'minPrice', 'Category']].values
    y_test  = df_new[['Purchased?']].values
    return X_train, y_train, X_test, y_test
    
X_train, y_train, X_test, y_test = return_train_test_data(Past, Next)    
print(X_train.shape, y_train.shape, X_test.shape)

(5000, 3) (5000, 1) (30091, 3)


# Task 2, Baselining: Build the best classifier using the Past month's data to predict whether the Next month's products will be Purchased
## Consider using AutoML to estimate the best classifier. Which features would you use from the training data?
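
Because only about 34% of products are purchased, a classifier that always predicts "not purchased" already reaches roughly 66% accuracy. Any AutoML result should be read against that floor. The sketch below illustrates the idea with scikit-learn's `DummyClassifier` on synthetic stand-in data; the names `X_demo`, `y_demo`, and `baseline` are illustrative assumptions and are not part of the notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for the notebook's data (assumption): three numeric
# features and roughly 34% positive labels, mirroring the class balance above.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = (rng.random(1000) < 0.34).astype(int)

# Always predict the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)
baseline_acc = baseline.score(X_demo, y_demo)

# The baseline accuracy equals the share of the majority class.
majority_share = max(np.mean(y_demo == 0), np.mean(y_demo == 1))
print(f"Majority-class baseline accuracy = {baseline_acc:.4f}")
```

To apply the same check to the real data, one would fit on `X_train`, `y_train` and score on `X_test`, `y_test`.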

In [None]:
# Install TPOT; comment out the following line if it is already installed
!pip install tpot



In [None]:
# TPOT for classification
from tpot import TPOTClassifier

# Instantiate and train a TPOT auto-ML classifier
tpot = TPOTClassifier(generations=5, population_size=40, verbosity=2)
tpot.fit(X_train, y_train)

# Evaluate the classifier on the test data
# By default, the scoring function is accuracy
print(f"{tpot.score(X_test, y_test)}")
tpot.export('tpot_products_pipeline.py')

  y = column_or_1d(y, warn=True)




Generation 1 - Current best internal CV score: 0.8762000000000001

Generation 2 - Current best internal CV score: 0.8762000000000001

Generation 3 - Current best internal CV score: 0.8762000000000001

Generation 4 - Current best internal CV score: 0.8762000000000001

Generation 5 - Current best internal CV score: 0.8762000000000001

Best pipeline: RandomForestClassifier(SelectFromModel(input_matrix, criterion=entropy, max_features=0.9000000000000001, n_estimators=100, threshold=0.45), bootstrap=True, criterion=gini, max_features=0.15000000000000002, min_samples_leaf=1, min_samples_split=4, n_estimators=100)


  y = column_or_1d(y, warn=True)


0.8727858828221062



## Use the appropriate lines of `tpot_products_pipeline.py` (and modify the relevant names) to write a function which returns the predicted labels generated by the best classifier which TPOT found 

In [None]:
from sklearn.tree import DecisionTreeClassifier

# TPOT's search is stochastic, so the best pipeline can differ between runs.
# The decision tree below differs from the random forest reported in the
# output above; it is the pipeline used for the evaluation that follows.

def return_tpot_results(X_train, y_train, X_test):
    exported_pipeline = DecisionTreeClassifier(criterion="entropy", max_depth=6, min_samples_leaf=19, min_samples_split=17)
    
    exported_pipeline.fit(X_train, y_train)
    prediction = exported_pipeline.predict(X_test)
    return prediction

pred = return_tpot_results(X_train, y_train, X_test)

## Evaluate the results of the best classifier which TPOT found

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score as accuracy
from sklearn.metrics import recall_score as recall
from sklearn.metrics import precision_score as precision
from sklearn.metrics import f1_score

# TPOT confusion matrix
cmtp = confusion_matrix(y_test, pred) 
acc  = accuracy(y_test, pred)
rec  = recall(y_test, pred)
prec = precision(y_test, pred)
f1   = f1_score(y_test, pred)

print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmtp)

Accuracy = 0.8721544647901366, Precision = 0.9604072398190046, Recall = 0.6556617434115262, F1-score = 0.7793012449084964
Confusion Matrix is:
[[19452   280]
 [ 3567  6792]]
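
As a sanity check, the four reported metrics can be recomputed by hand from the confusion matrix printed above. The following is a minimal sketch; the `*_demo` variable names are illustrative assumptions, and the matrix values are copied from the output.

```python
import numpy as np

# Confusion matrix from the output above: rows are true labels,
# columns are predicted labels (scikit-learn's convention).
cm_demo = np.array([[19452, 280],
                    [3567, 6792]])
tn, fp, fn, tp = cm_demo.ravel()

accuracy_demo = (tp + tn) / cm_demo.sum()   # share of correct predictions
precision_demo = tp / (tp + fp)             # predicted positives that are correct
recall_demo = tp / (tp + fn)                # true positives that were found
f1_demo = 2 * precision_demo * recall_demo / (precision_demo + recall_demo)

print(f"Accuracy = {accuracy_demo:.4f}, Precision = {precision_demo:.4f}, "
      f"Recall = {recall_demo:.4f}, F1-score = {f1_demo:.4f}")
```

The high precision and lower recall reflect the asymmetry in the matrix: only 280 false positives, but 3567 purchased products were missed.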


# Task 3, Semi-supervised learning: Apply label spreading on the data and run performance analysis by cross validation.

Step 1: Combine `X_train` and `X_test`

Step 2: Combine `y_train` with an array of -1 labels that has the same length as `y_test` (-1 marks a sample as unlabeled)

Step 3: Run label spreading on complete data. Use knn spreading with `n_neighbors` varying as 1,3,5,7,9,11. What's the best neighborhood?
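
The three steps can be sketched end to end as follows. This is an illustration on synthetic data generated with `make_classification`, not the notebook's dataset; `X_demo`, `y_true`, `n_labeled`, and `recall_by_k` are assumed names, and choosing recall as the selection metric is also an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.semi_supervised import LabelSpreading

# Synthetic stand-in data (assumption): 600 samples with 3 features.
X_demo, y_true = make_classification(
    n_samples=600, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

# Keep labels for the first 100 samples; mark the rest as unlabeled with -1.
n_labeled = 100
y_semi = np.full(len(y_true), -1)
y_semi[:n_labeled] = y_true[:n_labeled]

# Fit one label-spreading model per neighborhood size and score the
# transduced labels of the unlabeled samples against their true labels.
recall_by_k = {}
for k in (1, 3, 5, 7, 9, 11):
    model = LabelSpreading(kernel="knn", alpha=0.01, n_neighbors=k)
    model.fit(X_demo, y_semi)
    preds = model.transduction_[n_labeled:]
    recall_by_k[k] = recall_score(y_true[n_labeled:], preds)

best_k = max(recall_by_k, key=recall_by_k.get)
print(recall_by_k, "best n_neighbors:", best_k)
```

In the notebook itself the true test labels are available only for evaluation, so the comparison across `n_neighbors` values should use the same `y_test` metrics computed in the evaluation cells.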


### Concatenate `X_train` and `X_test`

In [None]:
X = np.concatenate((X_train, X_test), axis=0)
print(X.shape[0])
print(y_train.shape)

35091
(5000, 1)


### Create an array shaped like a column of `X_test`, with each value equal to -1
### Make sure the array is a column vector

In [None]:
y_hat = -1*np.ones((X_test.shape[0],1))

### Concatenate `y_train` and `y_hat`

In [None]:
y = np.concatenate((y_train, y_hat), axis=0)

### Instantiate and train the label-spreading model. Use a KNN kernel and set `alpha` to 0.01. Try the `n_neighbors` values mentioned above; the run shown here uses `n_neighbors=17`.

In [None]:
from sklearn.semi_supervised import LabelSpreading
lp_model = LabelSpreading(kernel='knn', alpha=0.01, n_neighbors=17)
lp_model.fit(X, y)


  y = column_or_1d(y, warn=True)


LabelSpreading(alpha=0.01, kernel='knn', n_neighbors=17)

### Extract the label predictions (transductions) for the test data

In [None]:
# The first X_train.shape[0] entries belong to the labeled training data
semi_sup_preds = lp_model.transduction_[X_train.shape[0]:]

### Evaluate the test predictions against the true test labels

In [None]:
cm   = confusion_matrix(y_test, semi_sup_preds)
acc  = accuracy(y_test, semi_sup_preds)
rec  = recall(y_test, semi_sup_preds)
prec = precision(y_test, semi_sup_preds)
f1   = f1_score(y_test, semi_sup_preds)
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cm)

Accuracy = 0.8197800006646505, Precision = 0.7931116389548694, Recall = 0.6446568201563857, F1-score = 0.7112199797646308
Confusion Matrix is:
[[17990  1742]
 [ 3681  6678]]


## Compare the recall obtained with and without label spreading. Tabulate your results

| Method       | Recall | F1-score | Accuracy |
|--------------|--------|----------|----------|
| AutoML       | 0.6557 | 0.7793   | 0.8722   |
| Label Spread | 0.6447 | 0.7112   | 0.8198   |

In this run, label spreading with `n_neighbors=17` scores below the AutoML baseline on all three metrics, so recall does not increase.