# Today you are a Machine Learning Engineer at the Department of New Products at Target Cosmetics!
This work relies on processed data from Kaggle https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop

This work is motivated by the publication https://arxiv.org/pdf/2010.02503.pdf

So far you have seen user-product interaction data used to classify a user-product relationship as ending in a purchase or no purchase, and to cluster (categorize) user behaviors.

In this assignment, we will have a very small training set to work with. Additionally, the test set we'll use has very few features. We'll first expose you to an Auto-Machine Learning library called TPOT and show you how it can be used to search over many ML model architectures. Then we will use the Label Spreading method to do semi-supervised learning, allowing us to leverage a small amount of labeled data in combination with a larger amount of unlabeled data. Finally we'll have a more open-ended task centering on system design for Zero-shot learning.

Labeled data is sparse, and in our hypothetical application (cosmetics purchase prediction) the intention is to maximize Recall, so that no popular cosmetic is understocked. Digital overstocking is allowed since it will not cause customers to disengage.

## Task 1: Exploratory Data Analysis (EDA) and Data Preparation

1. Read in the data file `Past_month_products.csv` and save it as a DataFrame called `past_df`. This dataset has the past data and will be our training set.

    Look at the shape of the DataFrame to determine the number of features and number of datapoints.
    
    Look at the first few rows of the DataFrame and review the data.

In [1]:
import pandas as pd
past_df = pd.read_csv('data/Past_month_products.csv')
past_df.head()
past_df.shape

(5000, 37)

2. Read in the data in `Next_month_products.csv` and save it as a DataFrame called `next_df`. This is the test dataset.

    Look at the shape of the DataFrame and look at the first few rows.

In [2]:
next_df = pd.read_csv('data/Next_month_products.csv')
next_df.head()
next_df.shape

(30091, 5)

3. How does the number of datapoints in the training set compare to the number of datapoints in the test set?

    And how does the feature set in the training set compare to the feature set in the test set?

**Answer**

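The training set has 5,000 datapoints and the test set has 30,091, so the test set is about six times larger (a difference of 25,091 datapoints). The training set has 37 columns while the test set has only 5 (a difference of 32), so most of the training features are not available at test time.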
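
The comparison can also be done programmatically. The sketch below uses two toy DataFrames as stand-ins for `past_df` and `next_df`. The five test columns are the ones referenced elsewhere in this notebook, while `brand` and `views` are hypothetical training-only columns invented for illustration.

```python
import pandas as pd

# Toy stand-ins for past_df and next_df; 'brand' and 'views' are
# hypothetical training-only columns used purely for illustration.
past_toy = pd.DataFrame({
    'product_id': [1, 2],
    'maxPrice': [9.5, 3.0],
    'minPrice': [8.0, 2.5],
    'Category': [4, 7],
    'brand': [0, 1],
    'views': [120, 15],
    'Purchased?': [1, 0],
})
next_toy = pd.DataFrame({
    'product_id': [3, 4, 5],
    'maxPrice': [5.0, 7.2, 1.1],
    'minPrice': [4.0, 6.9, 1.0],
    'Category': [4, 2, 7],
    'Purchased?': [0, 1, 0],
})

# Columns present in both frames, in the training frame's order.
shared_cols = [c for c in past_toy.columns if c in next_toy.columns]
# Columns that exist only in the training frame.
train_only_cols = [c for c in past_toy.columns if c not in next_toy.columns]
# How many times larger the test set is than the training set.
size_ratio = len(next_toy) / len(past_toy)

print('Shared columns:    ', shared_cols)
print('Train-only columns:', train_only_cols)
print('Test/train size ratio:', size_ratio)
```

Applying the same three expressions to the real `past_df` and `next_df` should list the usable features directly.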

Imagine that you are helping plan the launch of new products. You have to figure out how to mine last month's cosmetic sales data, use the relevant features, and estimate which products will sell more.

4. What percentage of datapoints are a purchase in the training set?

In [3]:
# `Purchased?` is a 0/1 label, so its mean is the fraction of purchases.
print(past_df['Purchased?'].mean() * 100)

34.38


5. What percentage of datapoints are a purchase in the test set?

In [4]:
print(next_df['Purchased?'].mean() * 100)

34.42557575354757


6. Are there any product ids in both the training and test datasets?

In [5]:
import numpy as np
unique_train_ids = np.unique(past_df['product_id'])
unique_test_ids = np.unique(next_df['product_id'])

intersection = len(set(unique_train_ids).intersection(unique_test_ids))

print('Number of overlapping ids: ', intersection)

Number of overlapping ids:  0


7. Create `X_train`, `y_train`, `X_test`, and `y_test` according to the following guidelines.
    * The `Purchased?` column is the target.
    * `X_train` and `X_test` should contain the same features (so you will not be able to use all the features).
    * `product_id` should not be a feature.
    
    Double check that the shapes of the four arrays are what you expect.

In [6]:
# Only the features available in both datasets; product_id is excluded.
feature_cols = ['maxPrice', 'minPrice', 'Category']

X_train = past_df[feature_cols].values
y_train = past_df['Purchased?'].values

X_test = next_df[feature_cols].values
y_test = next_df['Purchased?'].values

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

(5000, 3) (5000,) (30091, 3) (30091,)


## Task 2: Build the best classifier you can using only the Past month's data.

We will be using the TPOT library to find an optimal model.

1. Install `tpot`.

    If you're running the notebook locally, follow [these instructions](https://epistasislab.github.io/tpot/installing/), using either conda or pip.
    
    If you're using Colab, uncomment the following line to install tpot.

In [7]:
!pip install tpot



2. Instantiate and train a TPOT auto-ML classifier.

    The parameters are set fairly arbitrarily (with some trial and error). Use these parameter values:
    * `generations`: 5
    * `population_size`: 50
    * `verbosity`: 2 (so you can see each generation's performance)
    
    The final line will create a Python script `tpot_products_pipeline.py` containing the code to build the best pipeline found by TPOT.

In [8]:

from tpot import TPOTClassifier

tpot = TPOTClassifier(generations=5, population_size=50, verbosity=2, random_state=42)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_products_pipeline.py')







Generation 1 - Current best internal CV score: 0.8724000000000001

Generation 2 - Current best internal CV score: 0.8724000000000001

Generation 3 - Current best internal CV score: 0.8747999999999999

Generation 4 - Current best internal CV score: 0.8747999999999999

Generation 5 - Current best internal CV score: 0.8747999999999999

Best pipeline: DecisionTreeClassifier(SelectFromModel(input_matrix, criterion=entropy, max_features=0.2, n_estimators=100, threshold=0.30000000000000004), criterion=entropy, max_depth=8, min_samples_leaf=13, min_samples_split=17)
0.8721544647901366


3. Take the appropriate lines (updating the variable names) from `tpot_products_pipeline.py` to build a model on our training set and make predictions on the test set. Save the predictions as `y_pred`.

    If there is a model used in `tpot_products_pipeline.py` that you aren't familiar with, look it up!

    Note: There is randomness in the way TPOT searches, so it's possible you won't get exactly the same result as your classmates.

In [9]:
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from tpot.export_utils import set_param_recursive

# Average CV score on the training set was: 0.8747999999999999
exported_pipeline = make_pipeline(
    SelectFromModel(estimator=ExtraTreesClassifier(criterion="entropy", max_features=0.2, n_estimators=100), threshold=0.30000000000000004),
    DecisionTreeClassifier(criterion="entropy", max_depth=8, min_samples_leaf=13, min_samples_split=17)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 42)

exported_pipeline.fit(X_train, y_train)
y_pred = exported_pipeline.predict(X_test)
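
The CV score TPOT reports above is accuracy, which is its default `scoring` for classification, while the stated goal of this assignment is recall. The following sketch shows how recall could be cross-validated alongside accuracy. It is only an illustration: `X_demo` and `y_demo` are synthetic stand-ins for `X_train` and `y_train`, and a plain decision tree with the hyperparameters TPOT found replaces the full exported pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train: 3 features, roughly 35% positives.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.65, 0.35], random_state=42,
)

# Decision tree with the hyperparameters from the TPOT pipeline above.
tree = DecisionTreeClassifier(
    criterion="entropy", max_depth=8,
    min_samples_leaf=13, min_samples_split=17, random_state=42,
)

# Same model, two different scoring functions.
recall_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="recall")
accuracy_scores = cross_val_score(tree, X_demo, y_demo, cv=5, scoring="accuracy")

print("Mean CV recall:  ", recall_scores.mean())
print("Mean CV accuracy:", accuracy_scores.mean())
```

If recall is what matters, passing `scoring='recall'` to `TPOTClassifier` is one option worth trying, at the likely cost of lower precision.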

4. Compute some evaluation metrics for the predictions made above. Print the accuracy, recall, precision, f1 score and confusion matrix.

In [10]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score as accuracy
from sklearn.metrics import recall_score as recall
from sklearn.metrics import precision_score as precision
from sklearn.metrics import f1_score

cmtp=confusion_matrix(y_test, y_pred)
acc  = accuracy(y_test, y_pred)
rec  = recall(y_test, y_pred)
prec = precision(y_test, y_pred)
f1   = f1_score(y_test, y_pred)
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmtp)

Accuracy = 0.8721544647901366, Precision = 0.9604072398190046, Recall = 0.6556617434115262, F1-score = 0.7793012449084964
Confusion Matrix is:
[[19452   280]
 [ 3567  6792]]
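
As a sanity check, the metrics can be recomputed by hand from the confusion matrix printed above. This sketch assumes scikit-learn's layout (rows are true classes, columns are predicted classes) and uses `_cm` suffixes so that the `accuracy`, `recall`, and `precision` function aliases imported earlier are not overwritten.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class.
cm = np.array([[19452, 280],
               [3567, 6792]])
tn, fp, fn, tp = cm.ravel()

accuracy_cm = (tp + tn) / cm.sum()    # share of all predictions that are correct
precision_cm = tp / (tp + fp)         # share of predicted purchases that are real
recall_cm = tp / (tp + fn)            # share of real purchases that are found
f1_cm = 2 * tp / (2 * tp + fp + fn)   # harmonic mean of precision and recall

print(f'Accuracy = {accuracy_cm:.4f}, Precision = {precision_cm:.4f}, '
      f'Recall = {recall_cm:.4f}, F1-score = {f1_cm:.4f}')
print(f'Purchased products the model missed (false negatives): {fn}')
```

The false negatives are the costly errors in this application, since each one is a purchased product that the model would have left understocked.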


## Task 3: Semi-supervised learning: Apply label spreading on the data.

We won't use any of the labels for the test set. We'll just use labels for the training set. We will, however, use the **features** from the test set along with the features from the training set. Since we're using a large number of samples, but only a small number of these samples have labels, this is **semi-supervised learning**.

1. Create a matrix `X` that has the rows from `X_train` concatenated with the rows from `X_test`.

    Check the shape of the matrix.

In [11]:
X = np.concatenate([X_train, X_test])
X.shape

(35091, 3)

2. Create the target array `y` by concatenating `y_train` with a vector of -1's, effectively creating a dummy label for the `X_test` rows in `X`.

    Check the shape of the array. It should have as many values as `X` has rows.

In [12]:
# -1 marks a row as unlabeled; np.full keeps the labels as integers.
y = np.concatenate([y_train, np.full(X_test.shape[0], -1)])
y.shape

(35091,)

Scikit-learn provides two label propagation models: `LabelPropagation` and `LabelSpreading`. Both work by constructing a similarity graph over all items in the input dataset. `LabelSpreading` is similar to the basic label propagation algorithm, but it uses an affinity matrix based on the normalized graph Laplacian and soft clamping of the labels, so the original labels can be partially overridden (the amount is controlled by `alpha`). We will be using scikit-learn's `LabelSpreading` model with the `knn` kernel.
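
To build intuition before running this on the real data, here is a minimal sketch on a toy dataset. The points, the `n_neighbors=3` setting, and `alpha=0.2` are illustrative choices, not the values used in this assignment. There are two well-separated groups of four points, and only one point per group carries a label.

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

# Two well-separated groups of four points each.
X_toy = np.array([
    [0, 0], [0, 1], [1, 0], [1, 1],
    [10, 10], [10, 11], [11, 10], [11, 11],
], dtype=float)

# One labeled point per group; -1 marks the unlabeled points.
y_toy = np.array([0, -1, -1, -1, 1, -1, -1, -1])

# The kNN graph links each point to its nearest neighbors, so labels can
# only flow between points that are close to each other.
toy_model = LabelSpreading(kernel='knn', n_neighbors=3, alpha=0.2)
toy_model.fit(X_toy, y_toy)

# transduction_ holds one inferred label per input row, labeled or not.
print(toy_model.transduction_)
```

Because the kNN graph has no edges between the two groups, each unlabeled point should inherit the label of the single labeled point in its own group.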

3. Train a `LabelSpreading` model. Set `kernel` to `knn` and `alpha` to 0.01.

In [13]:
from sklearn.semi_supervised import LabelSpreading
model = LabelSpreading(kernel='knn', alpha=0.01)
model.fit(X,y)

LabelSpreading(alpha=0.01, kernel='knn')

4. Extract the predictions for the test data. You can get the predictions from the `transduction_` attribute. Note that there is a value for every row in `X`, so select just the values that correspond to `X_test`.

In [14]:
y_pred = model.transduction_
y_pred =  y_pred[y_train.shape[0]:]
y_pred.shape

(30091,)

5. Compute some evaluation metrics for the predictions. Print the accuracy, recall, precision, f1 score and confusion matrix.

In [15]:
cmtp=confusion_matrix(y_test, y_pred)
acc  = accuracy(y_test, y_pred)
rec  = recall(y_test, y_pred)
prec = precision(y_test, y_pred)
f1   = f1_score(y_test, y_pred)
print(f'Accuracy = {acc}, Precision = {prec}, Recall = {rec}, F1-score = {f1}')
print('Confusion Matrix is:')
print(cmtp)

Accuracy = 0.8184174670167159, Precision = 0.7975683890577507, Recall = 0.6332657592431702, F1-score = 0.7059836418424451
Confusion Matrix is:
[[18067  1665]
 [ 3799  6560]]


6. Collect your results in the table below to compare the two models.


| Method          | Recall | F1-score | Accuracy |
| --------------- | ------ | -------- | -------- |
| TPOT (AutoML)   | 0.6557 | 0.7793   | 0.8722   |
| Label Spreading | 0.6333 | 0.7060   | 0.8184   |
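
Both models reach a recall of only about 0.65, even though recall is the priority here. One common lever, not used in this assignment, is to lower the decision threshold on the predicted probability so that more products are flagged as likely purchases. The sketch below illustrates the trade-off on synthetic data; the dataset, the tree settings, and the threshold of 0.3 are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 3 features, roughly 35% positives.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=3, n_informative=3, n_redundant=0,
    weights=[0.65, 0.35], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, stratify=y_demo, random_state=0,
)

clf = DecisionTreeClassifier(max_depth=8, min_samples_leaf=13, random_state=0)
clf.fit(X_tr, y_tr)

# Probability of the positive class (purchase) for each test row.
proba = clf.predict_proba(X_te)[:, 1]

# Compare the default 0.5 cut-off with a lower, recall-friendly one.
results = {}
for threshold in (0.5, 0.3):
    y_hat = (proba >= threshold).astype(int)
    results[threshold] = (
        recall_score(y_te, y_hat),
        precision_score(y_te, y_hat, zero_division=0),
    )
    print(f'threshold={threshold}: recall={results[threshold][0]:.3f}, '
          f'precision={results[threshold][1]:.3f}')
```

Lowering the threshold can only add positive predictions, so recall never decreases. Precision usually drops in exchange, which is acceptable here because overstocking is cheap.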

## Task 4: System Design for Zero Shot Learning
So far we have been looking at 3 product-level features (min price, max price, product category) to classify whether a particular product will get purchased or not.
Now, let's say you have access to some more information regarding each past sold cosmetic item and each next cosmetic item. Design a system to enable accurate identification of an item that is more likely to be purchased.
Think through the following:
1. What additional data fields do you need per cosmetic in the past and next catalogues? How would you process these data fields?
2. You have access to images of each cosmetic. How will you use these images to extract relevant features for gauging interest in the new cosmetics?
3. Design an end-to-end system workflow using the additional cosmetic data and cosmetic images to predict purchasing popularity.

**This task was optional for this assignment.**
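
One possible starting point for such a system, sketched under strong assumptions, is to represent every product as an embedding vector and score each new product by the purchase rate of its most similar past products. In the sketch below the embeddings are random placeholders. In a real system they might come from an image model combined with encoded metadata, and all names here (`past_embeddings`, `new_embeddings`, `purchase_scores`) are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Placeholder embeddings; real ones would be extracted from product
# images and metadata (this is an assumption, not part of the dataset).
past_embeddings = rng.normal(size=(200, 16))   # 200 past products
past_purchased = rng.integers(0, 2, size=200)  # 0/1 purchase labels
new_embeddings = rng.normal(size=(5, 16))      # 5 unseen products

# Find the 10 most similar past products for every new product.
knn = NearestNeighbors(n_neighbors=10, metric='cosine').fit(past_embeddings)
distances, indices = knn.kneighbors(new_embeddings)

# Cosine similarity = 1 - cosine distance; clip so weights stay positive.
weights = np.clip(1.0 - distances, 1e-6, None)

# Similarity-weighted share of purchased neighbors, between 0 and 1.
purchase_scores = (weights * past_purchased[indices]).sum(axis=1) / weights.sum(axis=1)

print(purchase_scores)
```

No label for the new products is needed at any point, which is what makes this kind of similarity lookup a candidate for the zero-shot setting.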

## Task 5: Summary and Discussion

What would you report back as the best method to gauge product popularity?

Think in terms of Data, Process and Outcomes specifically.

Consider the following:
1. Can you store the data in some other way to enable ZSL or more efficient information storage/retrieval?
2. Given a new data set on the job, how would you report the best "method"? What are the steps to always follow? 
3. What is the metric/metrics you would use to report your results?

Share your screen and discuss your findings. Think about generalizability (something that works across data sets).

Also, look into ML system design in terms of Data, Process and Outcome.
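
On the question of which metrics to report, one way to keep results comparable across methods is to compute them in a single place. The helper below is a hypothetical sketch (the name `summarize_predictions` is not from the assignment) that returns the metrics used throughout this notebook, demonstrated on a tiny hand-made example.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)


def summarize_predictions(y_true, y_hat):
    """Return the evaluation metrics used in this notebook as a dict."""
    return {
        'recall': recall_score(y_true, y_hat),
        'precision': precision_score(y_true, y_hat),
        'f1': f1_score(y_true, y_hat),
        'accuracy': accuracy_score(y_true, y_hat),
        'confusion_matrix': confusion_matrix(y_true, y_hat),
    }


# Tiny example: 2 true positives, 1 false negative, 1 false positive,
# and 1 true negative.
summary = summarize_predictions([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
for name, value in summary.items():
    print(name, value, sep=': ')
```

Calling the same helper on the TPOT and Label Spreading predictions would fill in the comparison table without repeating the metric code.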