Video link https://youtu.be/q0VofqW4g20

# Walmart TripType prediction

https://www.kaggle.com/c/walmart-recruiting-trip-type-classification

>For this competition, you are tasked with categorizing shopping trip types based on the items that customers purchased. To give a few hypothetical examples of trip types: a customer may make a small daily dinner trip, a weekly large grocery trip, a trip to buy gifts for an upcoming holiday, or a seasonal trip to buy clothes.


## Multi-class classification: the goal is to predict the type of the trip (`TripType`).



In [10]:
# Requires a Kaggle API token: put kaggle.json into ~/.kaggle or use the environment method
!kaggle competitions download -c walmart-recruiting-trip-type-classification


In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [9]:
train = pd.read_csv('walmart-recruiting-trip-type-classification/train.csv')
test = pd.read_csv('walmart-recruiting-trip-type-classification/test.csv')


In [None]:
train.head()

### Columns description

- `TripType` - a categorical id representing the type of shopping trip the customer made. This is the ground truth that you are predicting. TripType_999 is an "other" category.
- `VisitNumber` - an id corresponding to a single trip by a single customer
- `Weekday` - the weekday of the trip
- `Upc` - the UPC number of the product purchased
- `ScanCount` - the number of the given item that was purchased. A negative value indicates a product return.
- `DepartmentDescription` - a high-level description of the item's department
- `FinelineNumber` - a more refined category for each of the products, created by Walmart


In [None]:
train.shape

In [None]:
train.count()

In [None]:
train.TripType.value_counts()

# 1. Understand the task.

> Every ML task has its own specifics, and there are no universal solutions, only principles that generally work.

We need to predict the type of the visit: `TripType`. Every row of the data table describes a single product, not a visit, therefore we need to combine the information about a visit from all of its purchases.

![](https://downloader.disk.yandex.ru/preview/efd0ee1162a479288a9ef907b7c627e0216b9b3eb9f85500ee4e0a0a66552c4c/5f6c7f77/eeebh5OVftecfJjMjJc1xYaANWzMwwdQQtKH72IlRPTvgOXDhJidoIyHmzaIk9kp8pZpk9fLNjeTpe27JpbpDg==?uid=0&filename=Screenshot+from+2020-09-24+10-12-50.png&disposition=inline&hash=&limit=0&content_type=image%2Fpng&tknv=v2&owner_uid=159868851&size=2048x2048)

image link https://yadi.sk/i/ShJNKZ-42X65NA

Unfortunately, we have no information about customers, so we cannot tell whether several visits were made by the same person.
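
To make the row-per-product layout concrete, here is a minimal sketch of collapsing product rows into one row per visit. The toy values and the output column names (`n_rows`, `n_departments`) are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Toy data in the same layout as train.csv: one row per scanned product.
# All values are invented for illustration only.
toy = pd.DataFrame({
    'TripType':    [8, 8, 8, 40, 40],
    'VisitNumber': [5, 5, 5, 7, 7],
    'Weekday':     ['Friday'] * 3 + ['Sunday'] * 2,
    'ScanCount':   [1, 2, -1, 1, 1],
    'DepartmentDescription': ['DAIRY', 'DAIRY', 'PRODUCE', 'PRODUCE', 'MENS WEAR'],
})

# Collapse the product rows into one row per visit.
visits = toy.groupby('VisitNumber').agg(
    TripType=('TripType', 'first'),        # constant within a visit
    Weekday=('Weekday', 'first'),          # constant within a visit
    n_rows=('ScanCount', 'size'),          # number of product rows in the visit
    n_departments=('DepartmentDescription', 'nunique'),
)
print(visits)
```

Each visit now carries a single label and a few aggregated descriptors, which is the shape a classifier needs.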


In [None]:
train.head(5)

# 2. Build a simple baseline model

- Count number of purchases in a Visit
- Create binary column `is_weekend`
- Drop all the remaining columns

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
def is_weekend(day):
    return int(day in ['Saturday', 'Sunday'])

df_train = train.copy()

# Generate is_weekend
df_train['is_weekend'] = df_train.Weekday.apply(is_weekend)

# Generate n_products
gp_n_products = df_train.groupby('VisitNumber')['ScanCount'].count()
df_train['n_products'] = df_train.VisitNumber.map(gp_n_products)

# drop duplicated Visit numbers
df_train = df_train.drop_duplicates(subset=['VisitNumber']).reset_index(drop=True)

# drop all columns except `is_weekend`, `n_products` and `TripType`
df_train = df_train.drop(['VisitNumber', 'Weekday', 'Upc', 'ScanCount',
                          'DepartmentDescription', 'FinelineNumber'], axis=1)

# Encode TripType so unique values are from 0 to (m-1), where m is number of classes
encoder = LabelEncoder().fit(df_train['TripType'])
df_train['TripType_encoded'] =  encoder.transform(df_train['TripType'])
df_train = df_train.drop('TripType', axis=1)

# Create separate variables X and y
X = df_train.drop('TripType_encoded', axis=1)
y = df_train.TripType_encoded

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
# Split data into train and test, use parameter `stratify=y`

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=774, stratify=y)

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss

In [None]:
# train a RandomForest model with default hyperparameters

clf = RandomForestClassifier()
clf.fit(X_train, y_train)

---
For this task we use the log loss. In the binary case it is $$- y \log p - (1-y)\log(1-p),$$
and we use its multi-class form:

$$-\frac{1}{N}\sum_{i=1}^N \log {\left(\frac{e^{a_{it_i}}}{\sum_{j=0}^{M-1} e^{a_{ij}}}\right)}$$

Here $t_i \in \{0, \ldots, M-1\}$ is the true class of the $i$-th object, $M$ is the number of classes, and $N$ is the number of objects. The numerator $e^{a_{it_i}}$ is the unnormalized probability of the $i$-th object being assigned to the right class $t_i$, thus:

$$p_{it_i} = \frac{e^{a_{it_i}}}{\sum_{j=0}^{M-1} e^{a_{ij}}}$$

see for example https://towardsdatascience.com/cross-entropy-for-classification-d98e7f974451
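
As a small numeric check of the multi-class log loss (minus the mean log-probability of the true class), the sketch below computes it by hand with NumPy and compares it with `sklearn.metrics.log_loss`. The scores `a` and labels `t` are made-up toy values.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up scores a_ij for N=3 objects and M=3 classes, and true classes t_i.
a = np.array([[2.0, 0.5, 0.1],
              [0.2, 1.5, 0.3],
              [0.1, 0.2, 3.0]])
t = np.array([0, 1, 2])

# Softmax: p_ij = exp(a_ij) / sum_j exp(a_ij).
# Subtracting the row maximum does not change the result but avoids overflow.
e = np.exp(a - a.max(axis=1, keepdims=True))
p = e / e.sum(axis=1, keepdims=True)

# Multi-class log loss: minus the mean log-probability of the true class.
manual = -np.mean(np.log(p[np.arange(len(t)), t]))
from_sklearn = log_loss(t, p, labels=[0, 1, 2])
print(manual, from_sklearn)
```

The two numbers should agree up to floating-point error, since `log_loss` only needs the predicted probability of the true class for each object.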

In [None]:
# predict y_test and compute multi-class log loss of your prediction

y_pred = clf.predict_proba(X_test)
log_loss(y_test, y_pred)

In [None]:
y_pred.shape

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
# Just to get an intuition on whether the prediction is good or bad, compute the accuracy of your model

y_pred = clf.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
# Remember this is a 38 class classification, count predicted classes

np.unique(y_pred, return_counts=True)

In [None]:
# Check value counts of a TripType on a train set

y_train.value_counts()

## Compare with the constant prediction


In [None]:
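# Compute the log loss and the accuracy of a constant prediction (always predict the most frequent TripType)

most_frequent = y_train.value_counts().idxmax()  # encoded label of the most frequent class
n_classes = y.nunique()

y_dummy = np.full(len(y_test), most_frequent)
print(accuracy_score(y_test, y_dummy))

# Probability 1 for the most frequent class, 0 for all the others
y_dummy_proba = np.zeros((len(y_test), n_classes))
y_dummy_proba[:, most_frequent] = 1
print(log_loss(y_test, y_dummy_proba))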
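
The constant prediction above puts probability 1 on a single class, which the log loss punishes heavily for every object of another class. A gentler constant baseline predicts the class frequencies instead. The sketch below compares the two with scikit-learn's `DummyClassifier` on synthetic labels; the class proportions are invented and the data is not the Walmart data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import log_loss

# Synthetic, imbalanced labels (made-up proportions).
rng = np.random.default_rng(0)
y_toy = rng.choice(4, size=1000, p=[0.5, 0.3, 0.15, 0.05])
X_toy = np.zeros((len(y_toy), 1))  # DummyClassifier ignores the features

# 'most_frequent' yields one-hot probabilities, 'prior' yields class frequencies.
one_hot = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
prior = DummyClassifier(strategy='prior').fit(X_toy, y_toy)

loss_one_hot = log_loss(y_toy, one_hot.predict_proba(X_toy), labels=one_hot.classes_)
loss_prior = log_loss(y_toy, prior.predict_proba(X_toy), labels=prior.classes_)
print(loss_one_hot, loss_prior)
```

Both baselines have the same accuracy, but the frequency-based one gives a far lower log loss, so it is the fairer reference point when the metric is log loss.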

## Conclusions on the baseline
1. Even with these simple features and rather naive predictions, we do better than a constant prediction.
2. The classifier mostly predicts frequent classes.
3. Some frequent classes are predicted and some are not. This may be because the predicted
classes are better described by the generated features.

# 3. A deeper look at the features.

### Columns description

- `TripType` - a categorical id representing the type of shopping trip the customer made. This is the ground truth that you are predicting. TripType_999 is an "other" category.
- `VisitNumber` - an id corresponding to a single trip by a single customer
- `Weekday` - the weekday of the trip
- `Upc` - the UPC number of the product purchased
- `ScanCount` - the number of the given item that was purchased. A negative value indicates a product return.
- `DepartmentDescription` - a high-level description of the item's department
- `FinelineNumber` - a more refined category for each of the products, created by Walmart


In [None]:
train.head()

In [None]:
train.dtypes

All features except `ScanCount` are categorical. A negative `ScanCount` value indicates a product return.

### 3.1 VisitNumber

`VisitNumber` identifies a visit. We need it to aggregate the purchases of a visit,
but the number itself is not important: it is just an index. Let's have a quick look at it.

In [None]:
# How many unique VisitNumber are in the data?

train.VisitNumber.value_counts()

In [None]:
# What is the typical size of a purchase (number of unique products in a Visit)?

sns.histplot(data=train.VisitNumber.value_counts(), bins=50);

In [None]:


# Number of visits for each visit size (rows per visit), for the 20 smallest sizes
train.VisitNumber.value_counts().value_counts().sort_index()[:20]

--- 

- More than half of all visits consist of 4 or fewer purchases
- 90% of visits consist of 17 or fewer purchases
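
Statements like these can be read directly off the distribution of visit sizes. A minimal sketch on made-up visit numbers (the real thresholds of 4 and 17 come from the notebook's data and are not reproduced here):

```python
import pandas as pd

# Toy product rows: only VisitNumber matters for the visit size.
toy = pd.DataFrame({'VisitNumber': [1, 1, 1, 2, 3, 3, 4, 4, 4, 4, 5]})

# Visit size = number of product rows per visit.
visit_size = toy.groupby('VisitNumber').size()

# Median and 90% quantile of the visit size.
print(visit_size.quantile([0.5, 0.9]))

# Share of visits with at most 4 rows.
share_small = (visit_size <= 4).mean()
print(share_small)
```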

### 3.2 Weekday

In [None]:
# How many purchases (rows) fall on each weekday?
weekdays = ['Monday', 'Tuesday', 'Wednesday','Thursday','Friday', 'Saturday', 'Sunday']

train.Weekday.value_counts().reindex(weekdays)

In [None]:
# Do we have more returns on some weekdays?

train['IsReturn'] = train['ScanCount'] < 0
train.groupby('Weekday')['IsReturn'].sum().reindex(weekdays)

In [None]:
# Are all returns `-1`?

train[train.IsReturn].ScanCount.value_counts()

### 3.3 DepartmentDescription

In [None]:
# What are the most popular Departments? (total sum over ScanCount)

train.groupby('DepartmentDescription').ScanCount.sum().sort_values(ascending=False)[:20]

In [None]:
# Do the most popular DepartmentDescription values differ between TripTypes?
# > This lets us more or less deanonymize `TripType`.

gp = train.groupby('TripType')['DepartmentDescription'].value_counts().reset_index(name='Count')
gp2 = (gp.groupby(['TripType'])[['Count', 'DepartmentDescription']]
         .apply(pd.DataFrame.nlargest, n=3, columns=['Count'])
         .reset_index()
         .drop('level_1', axis=1))

gp2[:15]

In [None]:
# From our baseline, recall most popular TripTypes, let's deanonymize them.
# Select subset of the previous table with TripType in [40, 39, 37, 38, 25]

gp2[gp2.TripType.isin([40, 39, 37, 38, 25])]

In [None]:
# What are the DepartmentDescription with most returns?

train.groupby('DepartmentDescription')['IsReturn'].sum().sort_values(ascending=False)[:15]

### 3.4 FinelineNumber

> According to the data description, `FinelineNumber` is just a more detailed `DepartmentDescription`

In [None]:
# Build a crosstab between DepartmentDescription and FinelineNumber
# for the 10 most popular DepartmentDescription values (by total ScanCount)
# and the 15 most frequent FinelineNumber values within them


popular_dd = train.groupby('DepartmentDescription').ScanCount.sum().sort_values(ascending=False)[:10].index
sub = train[train.DepartmentDescription.isin(popular_dd)]
sub = sub[sub.FinelineNumber.isin(sub.FinelineNumber.value_counts().iloc[:15].index)]

tab = pd.crosstab(sub.FinelineNumber, sub.DepartmentDescription)
tab
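
A complementary check to the crosstab is to count how many departments each `FinelineNumber` appears in: if fineline numbers were strictly nested inside departments, every count would be 1. The sketch below uses invented toy rows; the department names and numbers are placeholders.

```python
import pandas as pd

# Toy rows (invented values): fineline 1508 appears in two departments.
toy = pd.DataFrame({
    'DepartmentDescription': ['DAIRY', 'DAIRY', 'PRODUCE', 'PRODUCE', 'BAKERY'],
    'FinelineNumber':        [1508,    1508,    1508,      4010,      115],
})

# Number of distinct departments per fineline number.
depts_per_fineline = toy.groupby('FinelineNumber')['DepartmentDescription'].nunique()

# Fineline numbers shared by more than one department.
shared = depts_per_fineline[depts_per_fineline > 1]
print(shared)
```

If `shared` is non-empty on the real data, the pair (`DepartmentDescription`, `FinelineNumber`) carries more information than `FinelineNumber` alone.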

### 3.5 Upc

In [None]:
train.Upc.value_counts()

## Conclusions on EDA

 - Most visits consist of a small number of products
 - Large visits are on weekends
 - `TripType` depends on `DepartmentDescription` and `FinelineNumber`
 - `Upc` is something like a bar code; it could be useful but contains almost 100k unique values
 

# 4. Generate features

In [None]:
train = pd.read_csv('walmart-recruiting-trip-type-classification/train.csv')
test = pd.read_csv('walmart-recruiting-trip-type-classification/test.csv')

In [None]:
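def return_nth(x, n=0):
    # n-th element of x, or NaN if x is too short or not indexable (e.g. NaN)
    try:
        return x[n]
    except (IndexError, KeyError, TypeError):
        return np.nan
    
def return_nth_val(x, n=0):
    # Same as return_nth, but falls back to 0 instead of NaN
    try:
        return x[n]
    except (IndexError, KeyError, TypeError):
        return 0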
    
def is_weekend(x):
    return int(x in ['Sunday', 'Saturday'])

def has_return(x):
    return int(any(_x < 0 for _x in x))

def sum_return(x):
    return np.sum([_x for _x in x if _x < 0])

In [None]:
def generate_features(df):
    
    df = df.copy()
    
    # 1. Most frequent DepartmentDescription in the purchase, second most frequent, third, fourth.

    f = lambda grp: grp.value_counts().nlargest(10)
    x = df.groupby('VisitNumber')['DepartmentDescription'].apply(f).reset_index()
    gp = x.groupby('VisitNumber')['level_1'].unique()

    df['PopularCategory'] = df.VisitNumber.map(gp)
    for i in range(4):
        df[f'Category_{i}'] = df['PopularCategory'].apply(return_nth, args=[i])

    # 2. Same for FinelineNumber, but 6

    x = df.groupby('VisitNumber')['FinelineNumber'].apply(f).reset_index()
    gp = x.groupby('VisitNumber')['level_1'].unique()

    df['PopularFineline'] = df.VisitNumber.map(gp)
    for i in range(6):
        df[f'Fineline_{i}'] = df['PopularFineline'].apply(return_nth, args=[i]).astype(object)

    # 3. Count number of unique DepartmentDescription in the purchase

    gp = df.groupby('VisitNumber')['DepartmentDescription'].nunique()
    df['#Unique_Department'] = df.VisitNumber.map(gp)

    # 4. Count number of unique FinelineNumber in the purchase.

    gp = df.groupby('VisitNumber')['FinelineNumber'].nunique()
    df['#Unique_Fineline'] = df.VisitNumber.map(gp)

    # 5. Count ScanCount (number of unique Upc in the purchase)

    gp = df.groupby('VisitNumber')['ScanCount'].count()
    df['#UniqueScanCount'] = df['VisitNumber'].map(gp)   

    # 6. Sum ScanCount

    gp = df.groupby('VisitNumber')['ScanCount'].sum()
    df['TotalScanCount'] = df['VisitNumber'].map(gp)

    # 7. Is weekend

    df['is_Weekend'] = df['Weekday'].apply(is_weekend).astype(object)

    # 8. Sum returns

    gp = df.groupby('VisitNumber')['ScanCount'].apply(sum_return)
    df['total_return'] = df.VisitNumber.map(gp)
    df['total_return'] = df['total_return'].fillna(0)

    # Drop old and intermediate features

    df = df.drop(['Upc', 'ScanCount', 'DepartmentDescription',
                  'FinelineNumber', 'PopularFineline', 'PopularCategory'], axis=1)

    # Drop duplicated rows (with the same VisitNumber)

    df = df.drop_duplicates(keep='first')
    
    return df
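
The "most frequent values per visit" step in `generate_features` relies on the column name `level_1` produced by `reset_index()`, which may depend on the pandas version. As a hedged alternative, the sketch below builds the same kind of columns from explicit counts and ranks. `top_n_per_visit` and its `prefix` argument are hypothetical helpers, not part of the notebook, and ties are broken by value order, which may differ from the original.

```python
import pandas as pd

def top_n_per_visit(df, column, n, prefix):
    """n most frequent values of `column` per visit, one row per VisitNumber.

    Hypothetical helper for illustration. Visits with fewer than n distinct
    values get NaN in the remaining columns.
    """
    # Count occurrences of every (visit, value) pair; NaN values are skipped.
    counts = (df.groupby(['VisitNumber', column]).size()
                .rename('cnt').reset_index()
                .sort_values(['VisitNumber', 'cnt'], ascending=[True, False]))
    # Rank values inside each visit: 0 = most frequent.
    counts['rank'] = counts.groupby('VisitNumber').cumcount()
    top = counts[counts['rank'] < n]
    # One column per rank, one row per visit.
    wide = top.pivot(index='VisitNumber', columns='rank', values=column)
    wide = wide.reindex(columns=range(n))
    wide.columns = [f'{prefix}_{i}' for i in range(n)]
    return wide

toy = pd.DataFrame({
    'VisitNumber': [1, 1, 1, 2],
    'DepartmentDescription': ['DAIRY', 'PRODUCE', 'DAIRY', 'BAKERY'],
})
top_departments = top_n_per_visit(toy, 'DepartmentDescription', n=2, prefix='Category')
print(top_departments)
```

The result is indexed by `VisitNumber`, so it can be joined onto a one-row-per-visit table. Visits where every value of `column` is missing do not appear in the result at all.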

In [None]:
# train = generate_features(train)
# train.to_csv('gb_generated_features_train.csv', index=False)

train = pd.read_csv('gb_generated_features_train.csv')

In [None]:
# test = generate_features(test)
# test.to_csv('gb_generated_features_test.csv', index=False)

test = pd.read_csv('gb_generated_features_test.csv')

# 5. Train a model

We will use the CatBoost implementation of gradient boosting, https://catboost.ai, since our data has many
categorical features and CatBoost handles categorical features automatically.

https://catboost.ai/docs/concepts/python-usages-examples.html


### Preparations

In [None]:
from catboost import CatBoostClassifier, Pool

In [None]:
# Print column names, their index and datatype

for i, col in enumerate(train.columns):
    print(i-2, col, train[col].dtype)

In [None]:
# CatBoost requires all categorical features to be `str` or `int`, and all missing values to be `str`.

for col in train.columns:
    if col.startswith(('Category', 'Fineline')):
        train[col] = train[col].apply(str)
        
train.fillna('NaN', inplace=True)

In [None]:
# Split data into train and test

x_train, x_eval = train_test_split(train, test_size=0.1, random_state=10, shuffle=True)

In [None]:
# We could provide catboost indexes of all categorical features

cat_features = [*list(range(10)), 15]
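
Hard-coded indices are easy to get out of sync with the columns. A minimal sketch of deriving them from column names instead is shown below. The column order is an assumption: it is the order `generate_features` appears to produce once `TripType` and `VisitNumber` are dropped, and `feature_columns`, `cat_names` and `cat_idx` are illustrative names.

```python
# Assumed feature order after generate_features, without TripType and VisitNumber.
feature_columns = (
    ['Weekday']
    + [f'Category_{i}' for i in range(4)]
    + [f'Fineline_{i}' for i in range(6)]
    + ['#Unique_Department', '#Unique_Fineline', '#UniqueScanCount',
       'TotalScanCount', 'is_Weekend', 'total_return']
)

# Treat Weekday, is_Weekend and every Category_*/Fineline_* column as categorical.
cat_names = [c for c in feature_columns
             if c.startswith(('Category_', 'Fineline_')) or c in ('Weekday', 'is_Weekend')]
cat_idx = [feature_columns.index(c) for c in cat_names]
print(cat_idx)
```

Under this assumed order the sketch also selects index 10 (`Fineline_5`), which the hard-coded list does not contain. The saved model was presumably trained with the original list, so treat this as an illustration rather than a drop-in replacement.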


In [None]:
# Create Pool objects (catboost internal data structure)

data_train = Pool(x_train.drop(['TripType', 'VisitNumber'], axis=1), 
                  label=x_train.TripType,
                  cat_features=cat_features)

data_eval = Pool(x_eval.drop(['TripType', 'VisitNumber'], axis=1), 
                 label=x_eval.TripType,
                 cat_features=cat_features)

### Fit a model


> If you have a GPU, it will drastically improve training speed: use the parameter `task_type='GPU'`

In [None]:
# Fit a CatBoostClassifier with the following set of hyperparameters

ctb_params = {
    'depth': 4,
    'learning_rate': .33,
    'l2_leaf_reg': 3,
    'loss_function': 'MultiClass',
    'verbose': 1,
    'thread_count': 12,
}

In [None]:
# With these particular features and hyperparameters, training takes about 1.5 hours on a CPU

# model = CatBoostClassifier(**ctb_params)
# model.fit(X = data_train, silent=False, eval_set=data_eval)

![](https://downloader.disk.yandex.ru/preview/47a45d0202eecc7b3d8fb35050278a2714c430c51af256ed919379e6a54c2c5d/5f7074cc/bc4nyN3t5JFfbprrYUVJamEuORedzsk-VmIkZ9PIafohTWWT9tiuZj6n0fo_c9npMfoJXu57fO-EZQkbM2vKyQ==?uid=0&filename=Screenshot+from+2020-09-27+10-16-51.png&disposition=inline&hash=&limit=0&content_type=image%2Fpng&tknv=v2&owner_uid=159868851&size=2048x2048)

pic. link https://yadi.sk/i/IFRhOFu_y_PAKw

### Save the results into file

- save model 
- save feature importances

In [None]:
import json

In [None]:
# model.save_model('catboost_depth=4_lr=.33_l2=.3')
# feature_importance = dict(zip(model.feature_names_, model.get_feature_importance()))
# with open('feature_importance_cb.json', 'w') as f:
#     json.dump(feature_importance, f)

In [None]:
# Load saved model

model = CatBoostClassifier()
model = model.load_model('catboost_depth=4_lr=.33_l2=.3')

In [None]:
# Compute log loss on the local train and test data

log_loss(x_train.TripType, model.predict_proba(data_train))

In [None]:
log_loss(x_eval.TripType, model.predict_proba(data_eval))

In [None]:
# Compute accuracy score on the local train and test data

accuracy_score(x_eval.TripType, model.predict(data_eval))

In [None]:
accuracy_score(x_train.TripType, model.predict(data_train))

### Visualize the feature importance

https://catboost.ai/docs/concepts/fstr.html

some other possible approaches:
- https://github.com/slundberg/shap
- https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html
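
To illustrate the permutation-importance approach linked above, here is a minimal sketch on synthetic data. It uses a scikit-learn `RandomForestClassifier` rather than the CatBoost model, and the data is generated so that only the first feature carries signal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: the label depends only on feature 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

clf_toy = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Shuffle each feature in turn and measure how much the score drops.
result = permutation_importance(clf_toy, X_toy, y_toy, n_repeats=5, random_state=0)
print(result.importances_mean)
```

For a real model it is usually more informative to compute the importance on held-out data rather than on the training set used here for brevity.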

In [None]:
# Load saved feature importance

with open('feature_importance_cb.json', 'r') as f:
    feature_importance = json.load(f)
    
feature_importance = sorted(feature_importance.items(), key=lambda x: x[1])

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Plot barchart of feature importances

fig, ax = plt.subplots(figsize=(5,10))

colnames = [fi[0] for fi in feature_importance]
importance = [fi[1] for fi in feature_importance]

ax.barh(range(len(colnames)), importance)
ax.set_yticks(range(len(colnames)))
ax.set_yticklabels(colnames);
ax.grid();
ax.set_title('Feature importance');

### What's next?

1. Analyze where your model makes the most mistakes, e.g. which classes are never predicted and which classes the model confuses (build a confusion matrix).
2. Based on (1), generate more features, e.g. "Percentage of products from the top 1 `DepartmentDescription`".
3. Grid search over the boosting parameters. Fit the best model on the whole train data. 
https://catboost.ai/docs/concepts/python-reference_catboost_grid_search.html
4. Think about cross-validation, e.g. a time-based split? Upsampling rare classes?
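
As a starting point for the first item, the sketch below builds a confusion matrix and per-class recall with scikit-learn. The labels are made-up toy values; with the real model one would pass the evaluation labels and the model's predicted classes instead.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy true and predicted labels for three classes.
y_true_toy = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_toy = np.array([0, 0, 1, 1, 0, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1, 2])
print(cm)

# Per-class recall: share of each true class that was predicted correctly.
recall = cm.diagonal() / cm.sum(axis=1)
print(recall)
```

Classes with low recall, and the columns their rows leak into, point at the pairs of trip types that need more discriminating features.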