# Diplodatos Kaggle Competition

We present this piece of code to create the baseline for the competition, and as an example of how to deal with this kind of problem. The main goals are that you:

1. Learn
1. Try different models and see which one best fits the given data
1. Get a higher score than the one in the current baseline example
1. Try to get the highest score in the class :)

In [1]:
# Import the required packages
import os
from sklearn import preprocessing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Read the *original* dataset...

In [2]:
original_df = pd.read_csv('https://raw.githubusercontent.com/DiploDatos/AprendizajeSupervisado/master/practico/data/train.csv')

In [4]:
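from hashlib import md5
def hashit(val):
    # floats (e.g. NaN for missing values) are returned as plain strings;
    # string values are replaced by their MD5 hex digest
    if isinstance(val, float):
        return str(val)
    return md5(val.encode('utf-8')).hexdigest()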

In [5]:
def get_dia_laboral(nombre_dia):
    if nombre_dia in ['Wednesday', 'Thursday', 'Friday', 'Monday','Tuesday']:
        return 'Dia laboral'
    else:
        return 'Fin de semana'

In [24]:
def transform_data(train_data_fname, test_data_fname):
    df_train = pd.read_csv(train_data_fname)
    df_train['is_train_set'] = 1
    df_test = pd.read_csv(test_data_fname)
    df_test['is_train_set'] = 0

    # get the TripType for each visit in the train set: group by VisitNumber
    # (and Weekday) and take the max (min or avg would work too)
    y = df_train.groupby(["VisitNumber", "Weekday"], as_index=False).max().TripType

    # we remove the TripType now, and concat training and testing data
    # the concat is done so that we have the same columns for both datasets
    # after one-hot encoding
    df_train = df_train.drop("TripType", axis=1)
    df = pd.concat([df_train, df_test])
    
    # the next three operations are the ones we have just presented in the previous lines
    #df.drop_duplicates(keep='first', ignore_index=True, inplace=True)

    # drop the columns we won't use (it may be good to use them somehow)
    #df = df.drop(["Upc"], axis=1)

    # fill missing FinelineNumber values in PHARMACY RX rows with a fixed value
    mask = df.FinelineNumber.isna() & (df.DepartmentDescription == 'PHARMACY RX')
    column_name = 'FinelineNumber'
    df.loc[mask, column_name] = 4822.0

    # one-hot encoding for the DepartmentDescription
    df = pd.get_dummies(df, columns=["DepartmentDescription"], dummy_na=True)

    # now we add the groupby values
    #df = df.groupby(["VisitNumber", "Weekday","FinelineNumber"], as_index=False).sum()
    df = df.groupby(["VisitNumber", "Weekday"], as_index=False).sum()

    # add a working day / weekend flag derived from Weekday and one-hot encode it
    df['tipo_dia'] = df.Weekday.apply(get_dia_laboral)
    df = pd.get_dummies(df, columns=["tipo_dia"], dummy_na=True)

    # finally, we do one-hot encoding for the Weekday
    df = pd.get_dummies(df, columns=["Weekday"], dummy_na=True)

    # get train and test back
    df_train = df[df.is_train_set != 0]
    df_test = df[df.is_train_set == 0]
    
    X = df_train.drop(["is_train_set"], axis=1)
    yy = None
    XX = df_test.drop(["is_train_set"], axis=1)

    return X, y, XX, yy
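
The aggregation step inside `transform_data` can be illustrated on a toy frame. This is a minimal sketch with made-up rows: the `toy` frame and its values are assumptions, not competition data, and `dtype=int` is added here only so the printed table shows 0/1 counts. It shows how one-hot encoding followed by a grouped sum turns one row per purchased item into one row per visit.

```python
import pandas as pd

# Toy line-item data: one row per purchased item (values are made up).
toy = pd.DataFrame({
    "VisitNumber": [1, 1, 2],
    "Weekday": ["Friday", "Friday", "Sunday"],
    "ScanCount": [1, 2, 1],
    "DepartmentDescription": ["DAIRY", "PHARMACY RX", "DAIRY"],
})

# One-hot encode the department, then collapse to one row per visit.
encoded = pd.get_dummies(toy, columns=["DepartmentDescription"], dtype=int)
per_visit = encoded.groupby(["VisitNumber", "Weekday"], as_index=False).sum()

# Each department column now counts the items bought in that department.
print(per_visit)
```

Because `groupby` sorts by the group keys by default, the labels obtained from the same grouping on the train set should line up row by row with the aggregated features, assuming each visit falls on a single weekday.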

Load the data...

In [25]:
X, y, XX, yy = transform_data("https://raw.githubusercontent.com/DiploDatos/AprendizajeSupervisado/master/practico/data/train.csv", "https://raw.githubusercontent.com/DiploDatos/AprendizajeSupervisado/master/practico/data/test.csv")

Create the model and evaluate it

In [26]:
X.shape

(67029, 84)

In [27]:
# split the training dataset into train and "validation" sets
# (the validation set is used below to estimate the accuracy of the fitted model;
# you could use cross-validation instead, depending on your approach)
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.3, random_state=42)

In [28]:
# results dataframe is used to store the computed results
results = pd.DataFrame(columns=('clf', 'best_acc'))

In [29]:
# imports for a DecisionTree classifier and GridSearch parameter tuning
# (the baseline below uses a random forest with default parameters instead)
from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score
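
The imports above could be combined roughly as follows. This is only a sketch on synthetic data: `X_toy`, `y_toy` and the values in `param_grid` are illustrative assumptions, not the competition data or tuned settings. With the real data you would pass `X_train` and `y_train` to `fit` instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (not the competition data).
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical grid; these values are illustrative, not tuned.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}

# 3-fold cross-validated search over the grid, scored by accuracy.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, scoring="accuracy", cv=3
)
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a tree refit on all the data passed to `fit`, so it can be used directly for prediction.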


In [30]:
# Random forest classifier with default parameters
from sklearn import ensemble
clf = ensemble.RandomForestClassifier(random_state=2)
clf.fit(X_train, y_train);

In [31]:
predictions = clf.predict(X_valid)
# accuracy on the validation set, as a percentage (%d truncates the decimals)
print('Accuracy: %d ' % (accuracy_score(y_valid, predictions) * 100))

Accuracy: 69 
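
Note that `%d` truncates, so the 69 above could be anything from 69.0 up to just under 70. The sketch below uses made-up labels (`y_true` and `y_pred` are assumptions, not model output) to show that the manual computation and `accuracy_score` agree, and how to print the value with decimals.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels, only to compare the two ways of computing accuracy.
y_true = np.array([3, 8, 8, 40, 999, 3])
y_pred = np.array([3, 8, 9, 40, 999, 8])

# Fraction of positions where the prediction equals the true label.
manual = np.sum(y_true == y_pred) / float(y_true.size)
library = accuracy_score(y_true, y_pred)

# Two decimals instead of a truncated integer.
print("Accuracy: %.2f%%" % (library * 100))
```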


**And finally**, we predict the unknown label for the testing set

In [32]:
predictions = clf.predict(XX)


Export the results


In [33]:
submission2 = pd.DataFrame(list(zip(XX.VisitNumber, predictions)), columns=["VisitNumber", "TripType"])

In [34]:
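# create the output folder if it does not exist (os is imported in the first cell)
os.makedirs("sample_data", exist_ok=True)
submission2.to_csv("sample_data/submission_randomforest.csv", header=True, index=False)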
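
Before uploading, it may be worth checking the submission frame. The following is a minimal sketch with made-up values: `visit_numbers` and `toy_predictions` stand in for `XX.VisitNumber` and `predictions`, and the two-column layout is taken from the cell above rather than from the competition rules.

```python
import pandas as pd

# Made-up stand-ins for XX.VisitNumber and the model predictions.
visit_numbers = pd.Series([1, 2, 3], name="VisitNumber")
toy_predictions = [8, 40, 999]

submission = pd.DataFrame(
    {"VisitNumber": visit_numbers, "TripType": toy_predictions}
)

# Checks before writing: one row per visit and no missing labels.
assert submission["VisitNumber"].is_unique
assert submission["TripType"].notna().all()

# Without a path, to_csv returns the CSV text, which is handy for inspection.
csv_text = submission.to_csv(index=False)
print(csv_text)
```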