# Diplodatos Kaggle Competition

We present this piece of code to create the baseline for the competition, and as an example of how to deal with this kind of problem. The main goals are that you:

1. Learn
1. Try different models and see which one best fits the given data
1. Get a higher score than the given one in the current baseline example
1. Try to get the highest score in the class :)

In [1]:
# Import the required packages
import os
from sklearn import preprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read the *original* dataset...

In [2]:
url_train = 'https://raw.githubusercontent.com/DiploDatos/AprendizajeSupervisado/master/practico/data/train.csv'
url_test = 'https://raw.githubusercontent.com/DiploDatos/AprendizajeSupervisado/master/practico/data/test.csv'

original_df = pd.read_csv(url_train)



In [None]:
original_df.columns

In [None]:
original_df.describe()

In [None]:
original_df.head()

In [None]:
set(original_df.Weekday)

In [None]:
set(original_df.DepartmentDescription)

There are `nan`s in the column, let us find them...

In [None]:
original_df[original_df.DepartmentDescription.isna()]

When the description is NaN, are Upc and FinelineNumber both NaN as well?

In [None]:
(original_df.DepartmentDescription.isna().sum(),
 (original_df.DepartmentDescription.isna() & original_df.Upc.isna() & original_df.FinelineNumber.isna()).sum())

In [None]:
original_df[original_df.Upc.isna()]

But it may be the case that Upc is NaN but not the description...

In [None]:
(original_df.Upc.isna().sum(),
 original_df.FinelineNumber.isna().sum(),
 (original_df.FinelineNumber.isna() & original_df.Upc.isna()).sum())

Upc and FinelineNumber are both NaN at the same time
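
The same check can be made more compact with a cross-tabulation of the two missing-value masks. The following is a minimal sketch on a toy frame with made-up values (not the competition data), only meant to show the idea:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values mimicking the missing-value pattern
toy = pd.DataFrame({
    "Upc": [1.0, np.nan, np.nan, 4.0],
    "FinelineNumber": [10.0, np.nan, np.nan, 40.0],
    "DepartmentDescription": ["DAIRY", np.nan, "PHARMACY RX", "BAKERY"],
})

# Rows: is Upc missing? Columns: is FinelineNumber missing?
nan_table = pd.crosstab(toy.Upc.isna(), toy.FinelineNumber.isna())
print(nan_table)

# True when both columns are missing on exactly the same rows
same_rows = (toy.Upc.isna() == toy.FinelineNumber.isna()).all()
print(same_rows)
```

If the off-diagonal cells of the table are zero, the two columns are always missing together.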

In [None]:
bins = set(original_df.TripType)

In [None]:
# to count the TripType values, we have to group the visits by VisitNumber
original_df.groupby("VisitNumber").TripType.mean().value_counts().plot(kind='bar')

**Unbalanced!!!**
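
To put a number on the imbalance, one option is to look at the relative frequency of each class once there is a single label per visit. A small sketch with invented visits and labels (the values below are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical records: several rows can belong to the same visit
toy = pd.DataFrame({
    "VisitNumber": [1, 1, 2, 3, 3, 3, 4, 5],
    "TripType":    [8, 8, 8, 39, 39, 39, 8, 40],
})

# Keep one label per visit, then compute the share of each class
per_visit = toy.groupby("VisitNumber").TripType.first()
shares = per_visit.value_counts(normalize=True)
print(shares)

# Ratio between the most and the least frequent class
imbalance_ratio = shares.max() / shares.min()
print(imbalance_ratio)
```

A ratio far from 1 suggests trying `scoring='balanced_accuracy'` or a stratified split when validating.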

Create a function to load the datasets. Some important decisions made here:

- We will use one-hot encoding for Weekday and DepartmentDescription. All transformations are applied to both the training and testing datasets.
- We handle NaNs as another category. This may not be the best approach.
- We may have multiple records for a single visit, and the goal is to classify all those records the exact same way. Therefore, we will prepare the data so that all the information for a visit is in the same record.
- Based on the previous bullet, we will count the DepartmentDescription values for all items acquired in the same visit.
- We drop Upc and FinelineNumber to simplify the process. You may want to use them, as they carry a lot of information (maybe using one-hot encoding for them as well).
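
The decisions above can be sketched on a tiny, made-up frame before applying them to the real data. The rows and department names here are assumptions chosen only to keep the example small:

```python
import pandas as pd

# Hypothetical records: visit 1 has three rows, visit 2 has one row with a missing department
toy = pd.DataFrame({
    "VisitNumber": [1, 1, 1, 2],
    "Weekday": ["Friday", "Friday", "Friday", "Sunday"],
    "ScanCount": [1, 2, 1, -1],
    "DepartmentDescription": ["DAIRY", "DAIRY", "BAKERY", None],
})

# One-hot encode the department, keeping NaN as its own category
encoded = pd.get_dummies(toy, columns=["DepartmentDescription"], dummy_na=True, dtype=int)

# Collapse to one row per visit: ScanCount and the department indicators are summed
per_visit = encoded.groupby(["VisitNumber", "Weekday"], as_index=False).sum()
print(per_visit)
```

After the `groupby`, each department column holds the number of records of that department in the visit.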

The following operations will be performed in a function. Anyway, we present them here so that we can see them in action:
First, we drop the columns. We include TripType as we are going to treat it differently:

In [None]:
df = original_df.drop(["Upc", "FinelineNumber", "TripType"], axis=1)

In [None]:
df

In [None]:
df = pd.get_dummies(df, columns=["DepartmentDescription"], dummy_na=True)

In [None]:
df

Now, we group by VisitNumber and Weekday (Weekday should be the same for every record of a visit), and sum ScanCount and the one-hot encoded DepartmentDescription columns.

In [None]:
df = df.groupby(["VisitNumber", "Weekday"], as_index=False).sum()

In [None]:
df

In [None]:
def transform_data_original(train_data_fname, test_data_fname):
    df_train = pd.read_csv(train_data_fname)
    df_train['is_train_set'] = 1
    df_test = pd.read_csv(test_data_fname)
    df_test['is_train_set'] = 0

    # we  get the TripType for the train set. To do that, we group by VisitNumber and
    # then we get the max (or min or avg)
    y = df_train.groupby(["VisitNumber", "Weekday"], as_index=False).max().TripType

    # we remove the TripType now, and concat training and testing data
    # the concat is done so that we have the same columns for both datasets
    # after one-hot encoding
    df_train = df_train.drop("TripType", axis=1)
    df = pd.concat([df_train, df_test])
    
    # the next three operations are the ones we have just presented in the previous lines
    
    # drop the columns we won't use (it may be good to use them somehow)
    df = df.drop(["Upc", "FinelineNumber"], axis=1)

    # one-hot encoding for the DepartmentDescription
    df = pd.get_dummies(df, columns=["DepartmentDescription"], dummy_na=True)

    # now we add the groupby values
    df = df.groupby(["VisitNumber", "Weekday"], as_index=False).sum()
    
    # finally, we do one-hot encoding for the Weekday
    df = pd.get_dummies(df, columns=["Weekday"], dummy_na=True)

    # get train and test back
    df_train = df[df.is_train_set != 0]
    df_test = df[df.is_train_set == 0]
    
    X = df_train.drop(["is_train_set"], axis=1)
    yy = None
    XX = df_test.drop(["is_train_set"], axis=1)

    return X, y, XX, yy

Load the data...

In [None]:
X, y, XX, yy = transform_data_original(url_train, url_test)

Create the model and evaluate it

In [None]:
# For the sake of comparison
models_scores = dict()

In [9]:
# split training dataset into train and "validation"
# (the grid search below cross-validates on the train split only;
# the validation split is kept aside and could be useful for you depending on your approach)
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

In [None]:
# results dataframe is used to store the computed results
results = pd.DataFrame(columns=('clf', 'best_acc'))

In [None]:
# we will use a DecisionTree to classify and GridSearch to determine the parameters
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score

tree_param = {'criterion':('gini', 'entropy'), 'min_samples_leaf':(1, 2, 5),
              'min_samples_split':(2, 3, 5, 10, 50, 100)}
tree = DT(random_state=42)
tree_clf = GridSearchCV(tree, tree_param, cv=3, scoring='accuracy') #scoring='balanced_accuracy')
tree_clf.fit(X_train, y_train)
best_tree_clf = tree_clf.best_estimator_

In [None]:
print('Best Decision Tree accuracy: ', tree_clf.best_score_)
print(best_tree_clf)
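# DataFrame.append was removed in pandas 2.0, so we concatenate a one-row frame instead
new_row = pd.DataFrame([{'clf': best_tree_clf, 'best_acc': tree_clf.best_score_}])
results = pd.concat([results, new_row], ignore_index=True)

print('The best classifier so far is: ')
print(results.loc[results['best_acc'].astype(float).idxmax()]['clf'])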

**And finally**, we predict the unknown label for the testing set

In [None]:
models_scores = results
models_scores

In [None]:
X.shape, XX.shape

In [None]:
yy = results.clf.iloc[0].predict(XX)

The last thing we do is generate the file to be *submitted* to Kaggle

In [None]:
submission = pd.DataFrame(list(zip(XX.VisitNumber, yy)), columns=["VisitNumber", "TripType"])

In [None]:
submission.to_csv("../data/submission.csv", header=True, index=False)

We, Group 13, propose the following classification solutions for Walmart's Classification Challenge. Many improvements could be included, but resources are scarce, so we could not test many alternatives or tune the parameters of the selected models.

In [55]:
import xgboost as xgb
from sklearn.model_selection import cross_validate
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from sklearn import preprocessing
import gc


In [None]:
def add_department_counts(data):
    alist = []
    for array in np.asarray(data.iloc[:,4:70]):
        count = 0
        for item in array:
            if item > 0:
                count += 1
        alist.append(count)
    dept_counts = pd.DataFrame(alist)
    dept_counts = dept_counts.rename(columns={0:"DepartmentCounts"})
    dept_counts = dept_counts.set_index(data.index)
    data.insert(4, 'DepartmentCounts', dept_counts)
    return data
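
The row-by-row loop above can also be written with a vectorized comparison. This is a sketch of an equivalent alternative, not the version used for the submitted results; the function name and the toy columns are hypothetical, and the default column range `4:70` is copied from the loop above:

```python
import pandas as pd

def add_department_counts_vectorized(data, first_col=4, last_col=70):
    """Insert a DepartmentCounts column with the number of positive entries per row."""
    # Compare the department columns with 0 and count the True values per row
    counts = (data.iloc[:, first_col:last_col] > 0).sum(axis=1)
    out = data.copy()
    out.insert(first_col, "DepartmentCounts", counts)
    return out

# Hypothetical frame: four leading columns followed by two department columns
toy = pd.DataFrame({
    "VisitNumber": [5, 9],
    "Upc": [1.0, 2.0],
    "NumItems": [-1, 3],
    "ItemsReturned": [1.0, 0.0],
    "BAKERY": [-1, 2],
    "DAIRY": [0, 1],
})
result = add_department_counts_vectorized(toy)
print(result)
```

Unlike the loop version, this sketch returns a copy and leaves the input frame untouched.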

In [56]:
def transform_data(train_data_fname, test_data_fname):
    df_train = pd.read_csv(train_data_fname)
    df_train['is_train_set'] = 1
    df_test = pd.read_csv(test_data_fname) 
    df_test['is_train_set'] = 0

    # we  get the TripType for the train set. To do that, we group by VisitNumber and
    # then we get the max (or min or avg)
    y = df_train.groupby(["VisitNumber", "Weekday"], as_index=False).max().TripType

 
    # we remove the TripType now, and concat training and testing data
    # the concat is done so that we have the same columns for both datasets
    # after one-hot encoding
    df_train = df_train.drop("TripType", axis=1)
    df = pd.concat([df_train, df_test])


    # Correct inconsistent spellings in some department names
    df['DepartmentDescription'] = df['DepartmentDescription'].replace({'GIRLS WEAR, 4-6X  AND 7-14': 'LADIES WEAR',
               'HEALTH AND BEAUTY AIDS': 'HEALTH & BEAUTY', 'LADIESWEAR': 'LADIES WEAR',
               'MENSWEAR' : 'MENS WEAR', 'OPTICAL - FRAMES' : 'OPTICAL',
               'OPTICAL - LENSES': 'OPTICAL', 'PHARMACY OTC': 'PHARMACY'})

    dummies = pd.get_dummies(df.DepartmentDescription)
    df[dummies.columns] = dummies 
    df['Weekday'] = df['Weekday'].map({"Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4, "Friday": 5, 
                                           "Saturday": 6, "Sunday": 7})
    data_dummies = df.iloc[:,7:]
    data_dummies = data_dummies.apply(lambda x: x*df["ScanCount"])
    data_dummies = data_dummies.replace(-0,0)
    
    
    # FinelineNumber
    # quantile 0.7 gives an accuracy of 0.74514
    # quantile 0.6 gives an accuracy of 0.74573
    # The selected quantile filters out a certain amount of data. It should be tuned in order to keep just the most
    # important values. Applying the same approach to other features could be interesting.
    fineline_ranks = df.FinelineNumber.value_counts()
    fineline_ranks = fineline_ranks[fineline_ranks > fineline_ranks.quantile(0.6)]
    columns1 = list(fineline_ranks.index)
    dummies = pd.get_dummies(df.FinelineNumber)
    dummies = dummies[columns1]


    df.loc[df.ScanCount < 0, 'ItemsReturned'] = 1
    df.loc[df.ItemsReturned != 1, 'ItemsReturned'] = 0
    
    df = df[[ "VisitNumber", "Weekday", "Upc", "ScanCount", "ItemsReturned", "is_train_set"]]
    df = df.rename(columns={"ScanCount":"NumItems"})
    df = pd.concat([df, data_dummies], axis=1)
    df = pd.concat([df, dummies], axis=1)


    # now we add the groupby values
    df = df.groupby(["VisitNumber", "Weekday"], as_index=False).sum()
    
    df['working_day'] =  df['Weekday'].map({1: 'working_day', 2: 'working_day', 3: 'working_day',
                                        4: 'working_day', 5: 'working_day',
                                           6: 'weekend', 7: 'weekend'})



    df = pd.get_dummies(df, columns=["working_day"])

    # finally, we do one-hot encoding for the Weekday
    df = pd.get_dummies(df, columns=["Weekday"])
    
    
    # get train and test back
    df_train = df[df.is_train_set != 0]
    df_test = df[df.is_train_set == 0]
    
    X = df_train.drop(["is_train_set"], axis=1)
    yy = None
    XX = df_test.drop(["is_train_set"], axis=1)

    return X, y, XX, yy

#
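
The FinelineNumber filtering inside `transform_data` keeps only the values whose frequency lies above a chosen quantile of all frequencies. A minimal sketch on an invented series (the numbers are assumptions, picked so the result is easy to check by hand):

```python
import pandas as pd

# Hypothetical FinelineNumber column: 100 appears 4 times, 200 three times, and so on
fineline = pd.Series([100, 100, 100, 100, 200, 200, 200, 300, 300, 400, 500])

# Frequency of each value, and the 0.6 quantile of those frequencies
counts = fineline.value_counts()
threshold = counts.quantile(0.6)

# Keep only the values that are more frequent than the threshold
frequent = counts[counts > threshold]

# One-hot encode, then keep only the columns of the frequent values
dummies = pd.get_dummies(fineline)[list(frequent.index)]
print(threshold, list(frequent.index), dummies.shape)
```

Raising the quantile keeps fewer columns, which trades information for memory and training time.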


In [57]:
X, y, XX, yy = transform_data(url_train, url_test)

# Add the number of departments per visit?
# (the flag must not share its name with the function, or it would shadow it)
use_department_counts = True

if use_department_counts:
    X = add_department_counts(X)
    XX = add_department_counts(XX)


In [58]:
X.head()

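| | VisitNumber | Upc | NumItems | Return | DepartmentCounts | 1-HR PHOTO | ACCESSORIES | AUTOMOTIVE | BAKERY | BATH AND SHOWER | ... | 9974.0 | working_day_weekend | working_day_working_day | Weekday_1 | Weekday_2 | Weekday_3 | Weekday_4 | Weekday_5 | Weekday_6 | Weekday_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | 68113150000.0 | -1 | 1.0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 9 | 1070088000.0 | 3 | 0.0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 10 | 17009270000.0 | 3 | 0.0 | 2 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 5 | 11 | 22738950000.0 | 4 | 0.0 | 3 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6 | 12 | 183217300000.0 | 7 | 0.0 | 4 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |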


In [59]:
# Encode labels?
encode_labels = True

if encode_labels:
    le = preprocessing.LabelEncoder()
    le.fit(y)
    y_encoded = le.transform(y)


# Create training and validation datasets
X_train, X_valid, y_train, y_valid = train_test_split(X, y_encoded, test_size=0.3, random_state=42)
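
The label encoding maps the original TripType codes to consecutive integers starting at 0, and `inverse_transform` recovers the original codes for the submission. A small round-trip sketch with made-up labels (the names `toy_le` and `trip_types` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical TripType labels: not consecutive and not starting at 0
trip_types = np.array([999, 8, 39, 8, 40])

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(trip_types)   # classes are sorted, then numbered from 0
decoded = toy_le.inverse_transform(encoded)  # back to the original codes

print(toy_le.classes_)
print(encoded)
print(decoded)
```

The encoder must be the same object for both steps, which is why `le` is kept around until the predictions are decoded.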

`XGBClassifier` converts the data to a DMatrix on the fly, so there is no need to do it ourselves. If cross-validation with `xgb.cv` is to be done, however, the conversion is a necessary step.

In [78]:
xgb_clf = xgb.XGBClassifier(seed=1234, objective = 'multi:softprob', num_class = 38)
xgb_clf.fit(X_train, y_train, verbose=True)
xgb_clf_score = accuracy_score(y_valid, xgb_clf.predict(X_valid))

print('XGB', xgb_clf_score)

XGB 0.7457357402158238


In [35]:
# We must transform the encoded labels back to their original values before submitting them.
yy = le.inverse_transform(xgb_clf.predict(XX))

#yy = xgb_clf.predict(XX)

array([30, 40, 21, ...,  8, 39, 39])

In [115]:
#submission2 = pd.DataFrame(list(zip(XX.VisitNumber, yy)), columns=["VisitNumber", "TripType"])

In [116]:
#submission2.to_csv("../data/submission_Alessio_FinelineNumber_8.csv", header=True, index=False)

46920

These were the results we submitted to the Kaggle competition last Friday.

Example of how to tune the parameters of an XGBoost model. Here we have tuned only the learning rate: including more parameters requires more iterations, and we currently don't have enough resources to run them in a reasonable time.

In [82]:

#from xgboost import XGBRegressor
from xgboost import XGBRFRegressor
from sklearn.model_selection import GridSearchCV

from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error

#from sklearn.preprocessing import Imputer
from sklearn.impute import SimpleImputer 
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

In [100]:
#XGBoost hyper-parameter tuning
def hyperParameterTuning(X_train, y_train):
    param_tuning = {
        'learning_rate': [0.01, 0.1],
        #'max_depth': [3, 5, 7, 10],
        'objective' : ['multi:softprob'],
        #'min_child_weight': [1, 3, 5],
        #'subsample': [0.5, 0.7],
        #'colsample_bytree': [0.5, 0.7],
        #'n_estimators' : [100, 200, 500],
         'num_class' : [38]
    }

    xgb_model = xgb.XGBClassifier()

    gsearch = GridSearchCV(estimator = xgb_model,
                           param_grid = param_tuning,                        
                           scoring = 'accuracy',  # classification accuracy
                           cv = 3,
                           n_jobs = -1,
                           verbose = 1)

    gsearch.fit(X_train,y_train)

    return gsearch.best_params_
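
When an exhaustive grid is too expensive, one possible alternative is a randomized search, which evaluates a fixed number of sampled parameter combinations. The sketch below is only an illustration: it swaps XGBoost for a decision tree and uses synthetic data so that it runs in seconds, and the parameter ranges are assumptions rather than tuned values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class problem standing in for the real data
X_toy, y_toy = make_classification(n_samples=300, n_features=10, n_informative=5,
                                   n_classes=3, random_state=0)

# Candidate values; only n_iter combinations are sampled from them
param_distributions = {
    "max_depth": [3, 5, 7, 10],
    "min_samples_leaf": [1, 2, 5],
    "min_samples_split": [2, 5, 10, 50],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # budget: 5 combinations instead of the full 48
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

The budget is controlled by `n_iter`, so the cost stays fixed no matter how many parameters are added to the dictionary.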

In [101]:
gc.collect()

318

In [102]:
#Run only in the first run of the kernel.
hyperParameterTuning(X_train, y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   6 | elapsed: 255.9min finished


{'learning_rate': 0.1, 'num_class': 38, 'objective': 'multi:softprob'}

Example of cross-validation of the hyperparameters of an XGBoost model. Here we need to transform our data to a DMatrix because the `cv` function doesn't do it on the fly. Accuracy is not available as a metric here, so we use the multiclass classification error rate (`merror`).

In [104]:
train_dmatrix = xgb.DMatrix(X_train, label = y_train)
test_dmatrix = xgb.DMatrix(X_valid, label = y_valid)

In [105]:
# Create the parameter dictionary: param
# ('merror' is the multiclass error rate; 'error' is meant for binary classification)
param = {'objective': 'multi:softprob', 'num_class':38, 
     'eval_metric': 'merror', "max_delta_step": 5, "learning_rate": 0.1}

In [None]:
# Perform cross-validation: cv_results
cv_results = xgb.cv(dtrain = train_dmatrix, params = param, 
                  nfold = 3, num_boost_round = 10, 
                  metrics = "merror", as_pandas = True, seed = 123)



In [None]:
# Print cv_results
print(cv_results)

In [None]:
# Print the accuracy
print(((1-cv_results["test-merror-mean"]).iloc[-1]))
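
The last cell relies on the fact that the multiclass error rate is the fraction of wrongly classified samples, so accuracy is its complement. A quick check with invented predictions:

```python
import numpy as np

# Hypothetical true and predicted classes for five samples
y_true = np.array([0, 2, 1, 1, 3])
y_pred = np.array([0, 2, 1, 0, 3])

# merror: wrong cases divided by all cases
merror = np.mean(y_pred != y_true)

# accuracy is the complement of the error rate
accuracy = 1.0 - merror
print(merror, accuracy)
```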