# Notebook DataMining
## Jean Ababii - Guilhem Baissus - Chanèle Jourdan

# ----- 1. LOADING DATA

The goal of this project is to determine who is playing from a behavioral trace (the game events produced by the player) by designing a prediction model with machine learning techniques. We address this problem on the video game StarCraft 2. 

We'll work with these two datasets :

* TRAIN - the training set: labelled behavioral traces
* TEST - the test set: unlabelled behavioral traces (the player has to be predicted)

In [None]:
# IMPORTS

import numpy as np
import pandas as pd 
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        
import csv
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import preprocessing
import matplotlib.pyplot as plt
plt.rcParams.update({'font.size': 18})

print("done")

In [None]:
# GET TRAIN AND TEST DATASETS

path_train_dataset = os.path.join(dirname, "TRAIN.CSV")
path_test_dataset = os.path.join(dirname, "TEST.CSV")

In [None]:
# VARIABLES INFLUENCING THE RESULTS

time_limit = 120
minimum_keys_pressed = 20
#number_of_features = 0

# ----- 2. DATA VISUALISATION AND PRE-PROCESSING

For the pre-processing, we load the two datasets into dictionaries and perform various operations on the data. 

In [None]:
#Get number of lines of the datasets train and test

def get_number_of_lines_in_csv(path): 
    #the context manager closes the file once the lines are counted
    with open(path) as file:
        reader = csv.reader(file)
        nb_lines = len(list(reader))
    return nb_lines
    
nb_lines_train = get_number_of_lines_in_csv(path_train_dataset)
print("Number of lines in train dataset : ", nb_lines_train)
nb_lines_test = get_number_of_lines_in_csv(path_test_dataset) 
print("Number of lines in test dataset : ",nb_lines_test)

In [None]:
#Get train's data and test's data into dictionaries

#gets array of the different elements of the row before time_limit. time_limit = 0 means no time limit.
def get_row_elements(row_str, time_limit):
    elements = row_str.split(',')
    if int(time_limit) == 0:
        return elements
    for index, element in enumerate(elements):
        #timestamps look like "t5", "t10", ...; these checks also skip empty elements and action names starting with "t"
        if element.startswith("t") and element[1:].isdigit() and int(element[1:]) >= int(time_limit):
            return elements[:index]
    return elements

#saves data in a dictionary
def fill_dic (dic, row_number, row_elements):
    dic[row_number]= row_elements

#creates a dictionary with the row number as key and an array of all the elements of that line as value
def create_dictionnary(path, nb_lines, time_limit):    
    dic = {}

    df = pd.read_csv(path, sep='delimiter', header=None)
    columns = df.columns
    rows = df[columns[0]]
    for index in range(0,nb_lines):
        row_str = rows[index]
        row_elements = get_row_elements(row_str, time_limit)
        fill_dic(dic, index, row_elements)
    return dic

#time_limit = 30

train_rows = create_dictionnary(path_train_dataset, nb_lines_train, time_limit)
#print(train_rows)

test_rows = create_dictionnary(path_test_dataset, nb_lines_test, time_limit)
#print(test_rows)
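
To make the truncation step concrete, here is a minimal standalone sketch of cutting one row at the time limit. The function name `truncate_row` and the sample row (including its URL) are hypothetical and only serve the illustration; we assume timestamps are tokens of the form `t5`, `t10`, and so on.

```python
def truncate_row(row_str, time_limit):
    """Return the comma-separated fields of a game that occur before `time_limit` seconds."""
    elements = row_str.split(",")
    if time_limit == 0:  # 0 means "no time limit"
        return elements
    for index, element in enumerate(elements):
        # A timestamp is assumed to be "t" followed by digits only.
        is_timestamp = element.startswith("t") and element[1:].isdigit()
        if is_timestamp and int(element[1:]) >= time_limit:
            return elements[:index]
    return elements


# Hypothetical row, for illustration only.
sample = "http://example.com/player/1,Terran,s,hotkey10,t5,Base,s,t10,hotkey12,t15,s"
print(truncate_row(sample, 10))
```

With a limit of 10 seconds, everything from the `t10` marker onward is dropped, while the player identifier and the race at the start of the row are kept.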

# 2.1 Get features

We extract and create several features from our dataset:

* The list of all the keys pressed, to count how many times a player used each key during a game
* The race played (there are 3 races, so we create 3 features set to 1 or 0 according to the race played in each game)
* A string that identifies the player
* The total number of keys pressed
* Time features: the number of keys pressed in every 5-second interval
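
As a rough sketch of how these features can be derived from a single game, the snippet below counts each action, one-hot encodes the race and computes the total number of keys pressed. The helper names (`is_timestamp`, `row_to_features`) and the sample row are hypothetical; we assume the first two fields are the player identifier and the race.

```python
from collections import Counter

RACES = ("Protoss", "Terran", "Zerg")


def is_timestamp(token):
    # Timestamps are assumed to look like "t5", "t10", ...
    return token.startswith("t") and token[1:].isdigit()


def row_to_features(row):
    """Split a parsed row into the player label and a dict of features."""
    player, race, *events = row
    actions = [event for event in events if not is_timestamp(event)]
    features = dict(Counter(actions))            # frequency of each key
    for candidate in RACES:                      # manual one-hot encoding
        features[candidate] = 1.0 if race == candidate else 0.0
    features["total_keys_pressed"] = len(actions)
    return player, features


# Hypothetical row, for illustration only.
row = ["http://example.com/player/1", "Zerg", "s", "hotkey10", "t5", "s", "Base", "t10", "s"]
player, features = row_to_features(row)
print(player, features)
```

Keys that never occur in a game are simply absent from the dictionary here; in a dataframe they would have to be filled with 0.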

In [None]:
#Get columns of the datasets

#conditions allowing the extraction of elements to make the features:
#timestamps (t5, t10, ...), player urls, races and already known columns are rejected
def extraction_conditions(element):
    is_timestamp = any("t{}".format(digit) in element for digit in range(1, 10))
    is_race = "Protoss" in element or "Terran" in element or "Zerg" in element
    return not (is_timestamp or element in dataset_columns or "http" in element or is_race)

#Creates the columns of the dataset from the values extracted per row
def get_dataset_columns(dic, nb_lines):
    for index in range(0, nb_lines):
        #read from the dictionary passed as argument rather than the global train_rows
        row = dic[index]
        for element in row:
            if extraction_conditions(element) :
                dataset_columns.append(element)
                
        
#We add 4 more columns: "player_profile" is the class we want to predict, and the 3 race columns ("Protoss", "Terran", "Zerg") allow a manual one-hot encoding of the race chosen by the player
dataset_columns = ['player_profile', 'Protoss', 'Terran', 'Zerg'] 

get_dataset_columns(train_rows, nb_lines_train)
print(dataset_columns)

#number of columns
print('----------------------')
print("Number of columns : {}".format(len(dataset_columns)))

In [None]:
#Adding time features columns
#We are going to try to add features linked to time to see if the results improve

#Every 5 seconds until time limit, create a column where the number of keys pressed is going to be recorded
def add_5seconds_columns(time_limit, columns):
    time = 0
    for i in range(0,int(time_limit/5)):
        time +=5
        columns.append("t{} to t{}".format(time - 5, time))

def add_total_keys_pressed(columns):
    columns.append("total_keys_pressed")
    
#add_5seconds_columns(time_limit, dataset_columns)
#add_total_keys_pressed(dataset_columns)
    
print(dataset_columns)

#number of columns
print('----------------------')
print("Number of columns : {}".format(len(dataset_columns)))    
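
The time features named above can be sketched as follows: the number of actions is counted per 5-second window. This is an illustration under the assumption that a marker `tN` means "N seconds have elapsed", so that the actions following it fall in the window starting at N; the function name `actions_per_window` and the sample events are hypothetical.

```python
def actions_per_window(events, time_limit, window=5):
    """Count the actions in each `window`-second interval up to `time_limit`."""
    n_windows = time_limit // window
    counts = [0] * n_windows
    current = 0  # index of the window currently being filled
    for token in events:
        if token.startswith("t") and token[1:].isdigit():
            # Jump to the window that starts at this timestamp.
            current = int(token[1:]) // window
        elif current < n_windows:
            counts[current] += 1
    return counts


# Hypothetical events (player and race already removed).
events = ["s", "hotkey10", "t5", "Base", "t10", "t15", "s", "s", "s"]
print(actions_per_window(events, time_limit=20))  # [2, 1, 0, 3]
```

Reading the window index from the timestamp itself means that a window without any action simply keeps its count of 0.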

# 2.2 Cleaning data

* We plot the number of keys pressed in every game to check whether the distribution is homogeneous, and adapt the data accordingly


In [None]:
#Data distribution for the time limit chosen 
    
def get_key_frequence_per_game(dic, nb_lines):
    nb_key = []
    for index in range(0, nb_lines):
        row = dic[index]
        size_row = len(row)
        i = size_row -1
        while row[i][0]!="t" and i>0:
            i-=1
        if i!=0:
            time_value = row[i][1:]
            #              size - number of t present - 2 for the first two columns (id and specie chosen)
            nb_key.append(size_row - int(time_value)/5 - 2)
        else:
            #In case there is no time stamp on the line
            nb_key.append(size_row - 2)
            #if size_row - 2 ==0:
                #print("no value for a line")
        
    return nb_key

def plot_data_visualisation(dic, nb_lines):
    nb_key = get_key_frequence_per_game(dic, nb_lines)
    plt.hist(nb_key, bins=40)  
    plt.grid(axis='y', alpha=0.75)
    plt.xlabel('Number of keys pressed')
    plt.ylabel('Number of games')
    plt.title("Distribution of the number of keys pressed in {} seconds".format(time_limit))
    plt.show()

plt.figure(figsize = (20,10))
plot_data_visualisation(train_rows, nb_lines_train)


We see that there is a large variability in game duration and in the number of keys pressed. 
It is therefore necessary to adjust our data by:

- Removing games that are empty or very short, so that they do not bias our data processing. 
- Normalising the data
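
The two adjustments can be sketched on plain dictionaries of key counts. This is only one possible reading of "normalising" (turning counts into per-game relative frequencies); the notebook itself relies on scaling at a later stage. The function name `filter_and_normalise` and the sample games are hypothetical.

```python
def filter_and_normalise(games, minimum_keys_pressed):
    """Drop games with too few key presses and turn counts into relative frequencies.

    `games` is a list of dicts mapping an action name to its count in one game.
    """
    kept = []
    for counts in games:
        total = sum(counts.values())
        if total < minimum_keys_pressed:
            continue  # empty or very short game
        kept.append({action: n / total for action, n in counts.items()})
    return kept


# Hypothetical games, for illustration only.
games = [{"s": 30, "hotkey10": 10}, {"s": 3}, {"Base": 5, "s": 15}]
print(filter_and_normalise(games, minimum_keys_pressed=20))
```

The second game is dropped because it has only 3 key presses, and the remaining two become comparable even though one is twice as long as the other.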

In [None]:
#get lines having less than a minimum number of keys pressed
def not_enough_values_lines_indexes(dic, nb_lines, minimum_keys_pressed):
    lines_indexes = []
    for index in range(0, nb_lines):
        row = dic[index]
        size_row = len(row)    
        i = size_row -1
        while row[i][0]!="t" and i>0:
            i-=1
        if i!=0:
            time_value = row[i][1:]
            #              size - number of t present - 2 for the first two columns (id and specie chosen)
            size = size_row - int(time_value)/5 - 2
        else:
            #In case there is no time stamp on the line
            size = size_row - 2
        if size <minimum_keys_pressed:
            lines_indexes.append(index)

    return lines_indexes

lines_no_value_indexes = []
#minimum_keys_pressed = 20

lines_no_value_indexes = not_enough_values_lines_indexes(train_rows, nb_lines_train, minimum_keys_pressed)
print("{} lines need to be removed because they have less than {} values".format(len(lines_no_value_indexes), minimum_keys_pressed))
print(lines_no_value_indexes)

This gives us a list containing the indexes of all the lines with fewer than 20 values (the value of minimum_keys_pressed). This list will then be used when creating the dataset.  

# 2.3 Handling unbalanced classes

We plot here the number of games of each player to check whether the distribution is homogeneous, and adapt the data accordingly.


In [None]:
with open(path_train_dataset, 'r') as temp_f:
    # get No of columns in each line
    col_count = [ len(l.split(",")) for l in temp_f.readlines() ]
    
column_names_train = ['Player','Race'] + [i for i in range(0, max(col_count)-2)]
df = pd.read_csv(path_train_dataset, header=None, delimiter=",", names=column_names_train)

In [None]:
# Plot the number of games by player

df_by_player = df.groupby(['Player']).count().sort_values(by=['Race'])['Race']

avg = df_by_player.mean()
print('Average of number of games by player: ', avg)

bar_chart = df_by_player.plot.bar(x='Player', y='Race', rot=0, figsize = (20,10))
bar_chart.hlines(avg, -.5,200.5, linestyles='dashed')
bar_chart.annotate('average',(10,avg+1))
bar_chart.axes.get_xaxis().set_visible(False)
bar_chart.set_title("Distribution of the number of games by player", fontsize=20)
bar_chart


We see that the number of games per player is really unbalanced, which can be problematic, especially for cross-validation. We therefore have to process the data in one of these two ways:

- Duplicate data for the players with a very low number of games. 
- Remove data for the players with a large number of games (but this is probably less relevant because it implies losing information)

So we decide to add data for the players whose number of games is below the average (which is around 15), by duplicating their games once or twice depending on how many games they have. 
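
The duplication rule can be summarised in a few lines, independently of pandas. In this sketch the thresholds (fewer than 6 games: two extra copies, fewer than 15 games: one extra copy) are the ones used in this notebook, while the function names and the toy labels are hypothetical.

```python
from collections import Counter


def duplication_factor(n_games):
    """Number of extra copies added for each game of a player."""
    if n_games < 6:
        return 2
    if n_games < 15:
        return 1
    return 0


def oversample(labels):
    """Return the original row indices followed by the indices of the duplicated rows."""
    games_per_player = Counter(labels)
    indices = list(range(len(labels)))
    for index, player in enumerate(labels):
        indices.extend([index] * duplication_factor(games_per_player[player]))
    return indices


# Toy labels: player "a" has 2 games, "b" has 7, "c" has 16.
labels = ["a"] * 2 + ["b"] * 7 + ["c"] * 16
indices = oversample(labels)
print(Counter(labels[i] for i in indices))
```

Because the duplicates are exact copies, they should ideally be created only inside the training part of each cross-validation split; otherwise the same game can appear on both sides of the split.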

In [None]:
# Get list of players with fewer than 15 games in dataframe

df_mod = df.groupby(['Player']).count()
df_0to5 = df_mod[(df_mod['Race']<6)]
df_0to15 = df_mod[(df_mod['Race']<15)]
df_6to15 = df_0to15[(df_0to15['Race']>5)]
player_6to15 = df_6to15.index.tolist()
player_0to5 = df_0to5.index.tolist()

print("Number of players with 0 to 5 games : ", len(player_0to5))
print("Number of players with 6 to 14 games : ", len(player_6to15))
print("--------------------------------------")

# Duplicate data for these players (once or twice depending on their number of games) in dataframe

rows_to_add=[]
rows_indexes_add_once=[]
rows_indexes_add_twice=[]
for indexRow in range(nb_lines_train):
    row=df.values[indexRow]
    if(row[0] in player_6to15):
        rows_indexes_add_once.append(indexRow)
        rows_to_add.append(row)
    elif(row[0] in player_0to5):
        rows_indexes_add_twice.append(indexRow)
        rows_to_add.append(row)
        rows_to_add.append(row)
        
df_with_duplication=pd.concat([df,pd.DataFrame(rows_to_add, columns=df.columns)], ignore_index=True)

# Add duplicates in dictionary 

print("Length of the dictionary before duplication :",nb_lines_train)

#new rows are appended right after the last original row instead of at a hard-coded offset
next_index = nb_lines_train
for row_index in rows_indexes_add_once:
    train_rows[next_index] = train_rows[row_index]
    next_index += 1
#rows of players with fewer than 6 games are copied twice
for row_index in rows_indexes_add_twice:
    train_rows[next_index] = train_rows[row_index]
    train_rows[next_index + 1] = train_rows[row_index]
    next_index += 2
    
nb_lines_train=len(train_rows)
print("Number of games to duplicate :",len(rows_indexes_add_once)+2*len(rows_indexes_add_twice))

print("Length of the dictionnary after duplication :",nb_lines_train)


# Plot the number of games by player with new distribution

df_with_duplication_by_player = df_with_duplication.groupby(['Player']).count().sort_values(by=['Race'])['Race']
bar_chart = df_with_duplication_by_player.plot.bar(x='Player', y='Race', rot=0, figsize = (20,8))
bar_chart.hlines(avg, -.5,200.5, linestyles='dashed') #avg was computed before duplication
bar_chart.annotate('old average',(10,avg+1))
bar_chart.axes.get_xaxis().set_visible(False)
bar_chart.set_title("Distribution of the number of games by player after duplication", fontsize=20)

bar_chart


# 2.4 Extract features & Create dataframe

Finally, we create the dataframe from all the modifications made above, with the list of features we described. 

In [None]:
#Organize data into columns (frequencies of keys per line + race and profile + time features) and create a dataframe for each dataset

#creates empty dataset
def create_empty_dataset(columns):
    dataset = {}
    for column in columns:
        dataset[column] = []
    return dataset

#Gets the number of times one key was clicked
def get_number_clicked_key(row, key_name):
    total_number = 0
    for element in row:
        if element == key_name:
            total_number +=1
    return total_number

#returns the profile of the player
def get_profile(row):
    if "http" in row[0]:
        return row[0]
    else:
        return "Unknown"
    

#appends the one-hot encoding of the race chosen on this line to the dataset
def get_specie(dataset, row):    
    column_number = 0
    #Looking for the column containing the race
    for index,element in enumerate(row):
        if element == "Protoss" or  element == "Terran" or element =="Zerg":
            column_number = index
            break
        
    if row[column_number]=="Protoss":
        dataset["Protoss"].append(1.0) 
        dataset["Terran"].append(0.0)
        dataset["Zerg"].append(0.0)
    elif row[column_number] == "Terran":
        dataset["Protoss"].append(0.0) 
        dataset["Terran"].append(1.0)
        dataset["Zerg"].append(0.0)
    elif row[column_number] == "Zerg":
        dataset["Protoss"].append(0.0) 
        dataset["Terran"].append(0.0)
        dataset["Zerg"].append(1.0)
    else:
        print("Race not detected")

#count the number of keys pressed in a 5 seconds interval until time limit
def get_key_number_5seconds(dataset, row, time_limit):  
    number_elements_in_frame = 0
    time = 5
    for index, element in enumerate(row):
        if "t{}".format(time) in element or index == len(row)-1:
            #I have to pop because a 0 appears for a unknown reason
            dataset["t{} to t{}".format(time -5, time)].pop()
            dataset["t{} to t{}".format(time -5, time)].append(number_elements_in_frame)
            number_elements_in_frame = 0
            time +=5
        #does not take in account the elements linked to the player profile or the specie chosen
        if "http" in element or element == "Zerg" or element == "Protoss" or element == "Terran":
            number_elements_in_frame -=1
        number_elements_in_frame +=1
        
#count total number of keys pressed between 0 and the time limit
def get_total_keys_pressed(dataset, row):
    number_elements_in_row = 0
    for index, element in enumerate(row):
        number_elements_in_row +=1
        if "http" in element or element == "Zerg" or element == "Protoss" or element == "Terran" or "t1" in element or "t2" in element or "t3" in element or "t4" in element or "t5" in element or "t6" in element or "t7" in element or "t8" in element or "t9" in element:
            number_elements_in_row -=1
    dataset["total_keys_pressed"].pop()
    dataset["total_keys_pressed"].append(number_elements_in_row)

                

#create final dataset with the columns filled
def create_dataset(nb_lines, rows, time_limit, dataset_type = "train"):
    dataset = create_empty_dataset(dataset_columns)

    #For each line, the columns are filled
    for index in range(0,nb_lines):
        #Do not take in account the lines with not enough values
        if index not in lines_no_value_indexes or dataset_type =="test":
            #Prevent "species" and profiles columns to be called
            for column in dataset_columns[4:]:
                row = rows[index]
                total_number = get_number_clicked_key(row, column)
                dataset[column].append(total_number)
            dataset['player_profile'].append(get_profile(rows[index]))
            get_specie(dataset, rows[index])
            #get_key_number_5seconds(dataset, rows[index], time_limit)
            #get_total_keys_pressed(dataset, rows[index])
    return dataset
            
train_dataset = create_dataset(nb_lines_train, train_rows, time_limit, "train")
#print(train_dataset)
test_dataset =  create_dataset(nb_lines_test, test_rows, time_limit, "test")

#Create dataframes
train_df = pd.DataFrame(train_dataset,columns=dataset_columns)
test_df = pd.DataFrame(test_dataset,columns=dataset_columns)


In [None]:
#Create csv files from the dataframes
#train_df.to_csv("train_df.csv", index = False)
#test_df.to_csv("test_df.csv",index = False)

# ----- 3.DIMENSIONALITY REDUCTION

Dimensionality reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data.

Why do it?
* Reducing training time
* Using less memory 
* Improving accuracy (less misleading data can improve modeling accuracy), even if this is not the main advantage
* Reducing the risk of overfitting (less redundant data means fewer opportunities to make decisions based on noise)
* Avoiding the “curse of dimensionality” (problems that arise when analyzing and organizing data in high-dimensional spaces and that do not occur in low-dimensional settings. For example: in high-dimensional data, all objects appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient).

We can define two types of dimensionality reduction: feature selection and feature extraction.


# 3.1 Feature selection

Some features in the data can be redundant or irrelevant, and can thus be removed without much loss of information. So we try to find a subset of the input variables, reducing dimensionality by removing some features.

There are 3 different methods for feature selection:

**1.FILTER METHOD**

Filter methods select variables independently of the model, discarding the least interesting ones. Their drawback is that they tend to select redundant variables, because they do not consider the relationships between variables.


**2.WRAPPER METHOD**

This method consists in evaluating subsets of variables, which makes it possible, unlike filter approaches, to detect possible interactions between variables. Compared to the first method, however, it can increase the risk of overfitting and requires significant computation time.

**3.EMBEDDED METHOD**

This method tries to combine the advantages of the two previous methods. The learning algorithm relies on its own variable selection process and performs feature selection and classification simultaneously.
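
To compare the three families side by side, here is a small sketch on synthetic count data. It is not the selection applied in this notebook: the data, the value of `k` and the choice of estimators are assumptions made for the illustration. `SelectKBest` stands for the filter family, `RFE` for the wrapper family and `SelectFromModel` for the embedded family.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.tree import DecisionTreeClassifier

# Synthetic, non-negative "key count" data: only features 0 and 1 depend on the class.
rng = np.random.default_rng(0)
y = rng.integers(0, 3, size=120)
X = rng.poisson(3, size=(120, 8)).astype(float)
X[:, 0] += 5 * y
X[:, 1] += 3 * (y == 2)

k = 4  # number of features to keep (arbitrary for this sketch)

# Filter: score each feature independently of any model.
filter_sel = SelectKBest(score_func=chi2, k=k).fit(X, y)

# Wrapper: repeatedly fit a model and drop the weakest feature.
wrapper_sel = RFE(DecisionTreeClassifier(random_state=0), n_features_to_select=k).fit(X, y)

# Embedded: reuse the importances learned by the model itself.
embedded_sel = SelectFromModel(
    ExtraTreesClassifier(n_estimators=50, random_state=0),
    max_features=k,
    threshold=-np.inf,  # select on max_features only
).fit(X, y)

for name, selector in [("filter", filter_sel), ("wrapper", wrapper_sel), ("embedded", embedded_sel)]:
    print(name, selector.get_support(indices=True))
```

All three selectors expose the same `get_support` and `transform` methods, so the selection learned on the training set can be applied unchanged to the test set.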


***FIRST EXAMPLE: using feature importance (embedded method, since the importances come from a fitted tree ensemble)***

In [None]:
X_train = train_df.drop(columns=['player_profile'])
y_train = train_df['player_profile']

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
import matplotlib.pyplot as plt

In [None]:
model = ExtraTreesClassifier()
model.fit(X_train,y_train)

In [None]:
#plot graph of feature importances for better visualization
feat_importances = pd.Series(model.feature_importances_, index=X_train.columns)
fig = feat_importances.nlargest(36).plot(kind='barh', figsize=(20,10), title = "Feature importance", xlabel ="key", ylabel = "score")
fig.set_title("Feature importance", fontsize=20)
plt.show()

In [None]:
#new dataframe only with the selected features
train_fi19_df = train_df.drop(columns =["player_profile", "hotkey11", "hotkey51", "hotkey31", "hotkey71", "hotkey21",
                                      "hotkey81", "hotkey41", "hotkey61", "hotkey72", "hotkey82", "hotkey91", "hotkey92",
                                      "hotkey01", "Terran", "Zerg", "Protoss", "hotkey02"])
test_fi19_df  = test_df.drop(columns =["player_profile", "hotkey11", "hotkey51", "hotkey31", "hotkey71", "hotkey21",
                                      "hotkey81", "hotkey41", "hotkey61", "hotkey72", "hotkey82", "hotkey91", "hotkey92",
                                      "hotkey01", "Terran", "Zerg", "Protoss", "hotkey02"])

In [None]:
#new dataframe only with the selected features
train_fi11_df = train_df.drop(columns =["player_profile", "hotkey11", "hotkey51", "hotkey31", "hotkey71", "hotkey21", "hotkey81", "hotkey41", "hotkey61", "hotkey72", 
                                        "hotkey82", "hotkey91", "hotkey92", "hotkey01", "Terran", "Zerg", "Protoss", "hotkey02", "Base", "hotkey80", "SingleMineral",
                                       "hotkey62", "hotkey70", "hotkey90", "hotkey60", "hotkey00"])
test_fi11_df  = test_df.drop(columns =["player_profile", "hotkey11", "hotkey51", "hotkey31", "hotkey71", "hotkey21", "hotkey81", "hotkey41", "hotkey61", "hotkey72", 
                                        "hotkey82", "hotkey91", "hotkey92", "hotkey01", "Terran", "Zerg", "Protoss", "hotkey02", "Base", "hotkey80", "SingleMineral",
                                       "hotkey62", "hotkey70", "hotkey90", "hotkey60", "hotkey00"])

***SECOND EXAMPLE: UNIVARIATE SELECTION***


In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

In [None]:
#apply SelectKBest class to score the features (k = 36 keeps all of them)
nb_of_features = 36
bestfeatures = SelectKBest(score_func=chi2, k=nb_of_features)
fit = bestfeatures.fit(X_train,y_train)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X_train.columns)

#concat two dataframes for better visualization 
featureScores = pd.concat([dfcolumns,dfscores],axis=1)
featureScores.columns = ['action','score']  #naming the dataframe columns

#show best features
fig = featureScores.nlargest(36,'score').plot(x='action', kind='barh', figsize=(20,10), xlabel ="key", ylabel = "score")
fig.set_title("Univariate selection of the best attributes", fontsize=20)
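
For intuition about the score plotted above, the sketch below computes a chi-squared statistic for one non-negative feature by comparing the feature total observed in each class with the total expected if the feature were independent of the class. As far as we know, this observed-versus-expected construction is the one behind `sklearn.feature_selection.chi2`, but the function below is our own illustration and the data are made up.

```python
from collections import defaultdict


def chi2_score(values, labels):
    """Chi-squared statistic between one non-negative feature and the class labels.

    Assumes the feature has a non-zero total.
    """
    total = sum(values)
    n_samples = len(values)
    observed = defaultdict(float)   # sum of the feature per class
    class_size = defaultdict(int)   # number of samples per class
    for value, label in zip(values, labels):
        observed[label] += value
        class_size[label] += 1
    score = 0.0
    for label, size in class_size.items():
        expected = total * size / n_samples
        score += (observed[label] - expected) ** 2 / expected
    return score


labels = ["a", "a", "b", "b"]
print(chi2_score([1, 3, 0, 4], labels))  # 0.0: both players press this key equally often
print(chi2_score([5, 5, 0, 0], labels))  # 10.0: only player "a" uses this key
```

A key that is used equally by all players gets a score of 0, while a key used by a single player gets a high score, which is why it is ranked as more useful for identifying the player.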

In [None]:
#new dataframe only with the selected features
train_univariate18_df = train_df.drop(columns =["player_profile", "Protoss", "Zerg", "hotkey71", "hotkey80", "hotkey51", "Terran", "hotkey70", "hotkey61", "hotkey81",
                                                "hotkey50", "hotkey90", "hotkey00", "hotkey91", "hotkey01", "hotkey11", "hotkey41", "hotkey21", "hotkey31"])
test_univariate18_df  = test_df.drop(columns =["player_profile", "Protoss", "Zerg", "hotkey71", "hotkey80", "hotkey51", "Terran", "hotkey70", "hotkey61", "hotkey81",
                                                "hotkey50", "hotkey90", "hotkey00", "hotkey91", "hotkey01", "hotkey11", "hotkey41", "hotkey21", "hotkey31"])

In [None]:
#new dataframe only with the selected features
train_univariate13_df = train_df.drop(columns =["player_profile", "Protoss", "Zerg", "hotkey71", "hotkey80", "hotkey51","Terran", "hotkey70", "hotkey61", "hotkey81",
                                                "hotkey50", "hotkey90", "hotkey00", "hotkey91", "hotkey01", "hotkey11", "hotkey41", "hotkey21", "hotkey31", "hotkey40",
                                                "hotkey60", "hotkey10", "hotkey20", "hotkey30"])
test_univariate13_df  = test_df.drop(columns =["player_profile", "Protoss", "Zerg", "hotkey71", "hotkey80", "hotkey51","Terran", "hotkey70", "hotkey61", "hotkey81",
                                                "hotkey50", "hotkey90", "hotkey00", "hotkey91", "hotkey01", "hotkey11", "hotkey41", "hotkey21", "hotkey31", "hotkey40",
                                                "hotkey60", "hotkey10", "hotkey20", "hotkey30"])

***THIRD EXAMPLE: using low variance (filter method)***

In [None]:
#Feature selection with scikit-learn's VarianceThreshold
from sklearn.feature_selection import VarianceThreshold

fs = VarianceThreshold(5)
feature_columns = train_df.columns.drop("player_profile")
xtrain_var = fs.fit_transform(np.array(train_df[feature_columns]))
#the support mask refers to feature_columns, not to train_df.columns (which still contains player_profile)
selected_columns = feature_columns[fs.get_support()]
train_var_df = train_df[selected_columns]
test_var_df  = test_df[selected_columns]

print(train_var_df.columns)
print("Number of features selected :",len(train_var_df.columns))
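
What the variance threshold does can be reproduced by hand: a column is kept only if its population variance is strictly greater than the threshold. The column values below are made up for the illustration and `high_variance_columns` is a hypothetical helper.

```python
from statistics import pvariance


def high_variance_columns(table, threshold):
    """Return the names of the columns whose population variance exceeds `threshold`.

    `table` maps a column name to the list of its values.
    """
    return [name for name, values in table.items() if pvariance(values) > threshold]


# Made-up values, for illustration only.
table = {
    "Protoss": [1, 0, 0, 1],     # variance 0.25
    "s": [10, 40, 25, 5],        # variance 187.5
    "hotkey10": [2, 3, 2, 3],    # variance 0.25
}
print(high_variance_columns(table, threshold=5))  # ['s']
```

A 0/1 column has a variance of at most 0.25, so with a threshold of 5 the three race columns are always removed, whatever the data.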

# 3.2 Feature extraction

We start from an initial set of measured data and build derived features intended to be informative and non-redundant. These derived features are constructed as combinations of the initial variables, which facilitates the subsequent learning and generalization steps.

There are many ways to do feature extraction.





***FOURTH EXAMPLE: using PCA (principal component analysis)***

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

In [None]:
#normalization
X_train = StandardScaler().fit_transform(X_train)

#PCA
pca = PCA(n_components=nb_of_features) #nb_of_features = 36
principalComponents = pca.fit_transform(X_train)
principalDf = pd.DataFrame(data = principalComponents)

In [None]:
#plot explained variance
plt.figure(figsize=(18,10))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.grid()
plt.title("Cumulative explained variance as a function of the number of components",fontsize=20)
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance')

We keep the first 20 principal components (around 90% of the cumulative explained variance). 
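
Instead of reading the number of components off the curve, it can be computed from the explained variance ratios. The sketch below works on a plain list such as `pca.explained_variance_ratio_`; the ratios used here are hypothetical and `components_for_variance` is an illustrative helper, not part of scikit-learn.

```python
from itertools import accumulate


def components_for_variance(explained_variance_ratio, target):
    """Smallest number of leading components whose cumulative ratio reaches `target`."""
    for k, cumulative in enumerate(accumulate(explained_variance_ratio), start=1):
        if cumulative >= target:
            return k
    return len(explained_variance_ratio)


# Hypothetical ratios, sorted in decreasing order as PCA returns them.
ratios = [0.4, 0.25, 0.15, 0.1, 0.06, 0.04]
print(components_for_variance(ratios, 0.85))  # 4
print(components_for_variance(ratios, 0.5))   # 2
```

According to its documentation, scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components`, in which case it selects the number of components needed to reach that share of the variance.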

# ----- 4.MODEL CONSTRUCTION AND PREDICTION

In [None]:
#Scale X
from sklearn.preprocessing import StandardScaler

def scale(X):
    X = StandardScaler().fit_transform(X)
    return X

In [None]:
#Evaluate the models
def evaluate(dataset, model, n_splits):

    #use the dataset passed as argument rather than the global train_df
    X = np.array(dataset.drop(columns="player_profile"))
    X = scale(X)
    y = np.array(dataset["player_profile"])

    le = preprocessing.LabelEncoder()
    le.fit(y)
    y = le.transform(y)

    #array
    scores_accuracy = []

    kf = KFold(n_splits=n_splits)
    nb_folds_processed = 0
    for train_index, test_index in kf.split(X, y):
        nb_folds_processed +=1
        print(" {} folds processed".format(nb_folds_processed))
        x_train, x_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)

        scores_accuracy.append(accuracy_score(y_test, y_pred))

    return np.mean(scores_accuracy)

nb_splits = 4

model = DecisionTreeClassifier(criterion="entropy")
print(evaluate(train_df,model, nb_splits ))

model = RandomForestClassifier(n_estimators=100,bootstrap = True, max_features = 'sqrt')
#print(evaluate(train_df,model, nb_splits ))
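
The evaluation above scales the whole dataset before splitting it and uses unshuffled folds. A possible alternative, sketched here on synthetic data, is to put the scaler and the classifier in a pipeline, so that the scaler is fitted on the training folds only, and to use stratified, shuffled folds so that each fold contains a similar share of every player. The dataset, the number of trees and the random seeds are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the player dataset (4 classes, 10 features).
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=4, random_state=0
)

# The scaler is refitted inside each training fold, never on the validation fold.
pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

Stratification requires every class to have at least as many samples as there are folds, which is one more reason to look at the number of games per player before choosing the number of splits.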

In [None]:
#Train decision tree model with the train dataset 

def train_decision_tree(df):
    #variables
    criterion = "entropy"

    X_train = np.array(df.drop(columns="player_profile"))
    y_train = np.array(df["player_profile"])

    model = DecisionTreeClassifier(criterion=criterion)
    model = model.fit(X_train, y_train)
    
    return model


In [None]:
#Train random forest model

def train_random_forest(df):

    X_train = np.array(df.drop(columns="player_profile"))
    y_train = np.array(df["player_profile"])

    model = RandomForestClassifier(n_estimators=100, 
                                   bootstrap = True,
                                   max_features = 'sqrt')

    model = model.fit(X_train, y_train)

    return model

# ----- 5. Results and conclusion

In [None]:
def get_predictions(model):
    #the test set is unlabelled, so only the features are needed
    X_test = np.array(test_df.drop(columns="player_profile"))

    y_pred =  model.predict(X_test)
    return y_pred


In [None]:
#We create a new csv containing the predictions 
def create_csv(y_pred, name="results.csv"):
    #row ids start at 1 and follow the order of the test set
    data = {
            "RowId" : list(range(1, nb_lines_test+1)),
            "prediction" : y_pred,
        }

    df = pd.DataFrame(data,columns=list(data.keys()))
    df.to_csv(name, index = False)

## 5.1 Model without any feature selection

In [None]:
#We predict the results using the normal trained model

model = train_random_forest(train_df)
y_pred = get_predictions(model)
create_csv(y_pred, name = "results.csv")

### Results without any feature selection : 0.897

## 5.2 Model with feature importance

In [None]:
#Prediction with feature importance technique

#train model
X_train = np.array(train_fi11_df)
y_train = np.array(train_df["player_profile"])

model = RandomForestClassifier(n_estimators=100, 
                               bootstrap = True,
                               max_features = 'sqrt')

model = model.fit(X_train, y_train)

#predict on trained model
X_test = np.array(test_fi11_df)
y_pred = model.predict(X_test)

create_csv(y_pred, name = "featureImportance.csv")

### Results with feature importance : 0.885 (keeping 19 features), 0.841 (keeping 11 features)

## 5.2 Model with univariate selection

In [None]:
#Prediction with univariate selection technique

#train model
X_train = np.array(train_univariate18_df)
y_train = np.array(train_df["player_profile"])

model = RandomForestClassifier(n_estimators=100, 
                               bootstrap = True,
                               max_features = 'sqrt')

model = model.fit(X_train, y_train)

#predict on trained model
X_test = np.array(test_univariate18_df)
y_pred = model.predict(X_test)

create_csv(y_pred, name = "univariateSelection.csv")

### Results with univariate selection : 0.879 (keeping 18 features), 0.829 (keeping 13 features)

## 5.3 Model with feature selection using variance threshold

In [None]:
#Prediction with variance threshold

#train model
X_train = np.array(train_var_df)
y_train = np.array(train_df["player_profile"])

model = RandomForestClassifier(n_estimators=100, 
                               bootstrap = True,
                               max_features = 'sqrt')

model = model.fit(X_train, y_train)

#predict on trained model
X_test = np.array(test_var_df)
y_test = np.array(test_df["player_profile"])
y_pred = model.predict(X_test)

create_csv(y_pred, name = "resultsVariance.csv")

### Result with variance threshold : 0.868 (threshold = 5)

## 5.4 Model with PCA

In [None]:
#preparing train and test data
scaler = StandardScaler() 

X_train = train_df.drop(columns=['player_profile'])
scaler.fit(X_train)
X_train = scaler.transform(X_train)
y_train = np.array(train_df["player_profile"])

X_test = test_df.drop(columns=['player_profile'])
X_test = scaler.transform(X_test)

#PCA
pca = PCA(n_components=20) #20 components = 90% of the variance
pca.fit(X_train)

X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

#the Random Forest model 
model = RandomForestClassifier(n_estimators=100, 
                               bootstrap = True,
                               max_features = 'sqrt')

model = model.fit(X_train, y_train)

#Prediction
y_pred = model.predict(X_test)

create_csv(y_pred, name = "pca20.csv")

### Results with PCA : 0.879 (with 20 new features), 0.868 (13 new features)

## 5.5 Conclusion

|With new features|With duplication|Dim. reduction|Number of features|Result|
|-----|-----|-----|-----|-----|
|No|No|No|36|0.897|
|Yes|No|No|62|0.897|
|No|Yes|No|36|0.897|
|No|Yes|Feature importance|19|0.885|
|No|Yes|Feature importance|11|0.841|
|No|Yes|Univariate selection |18|0.879|
|No|Yes|Univariate selection |13|0.829|
|No|Yes|Low variance|16|0.868|
|No|Yes|Low variance|?|?|
|No|Yes|PCA|20 new|0.879|
|No|Yes|PCA|13 new|0.868|

Here’s a summary of the results of all the models we’ve tested.

* 1st line: Our initial model, without any dimensionality reduction, scores 0.897.
* 2nd line: Adding features does not improve the result, so we did not keep them for the other tries: since we want to reduce the dimension, it would not make sense.
* 3rd line: Duplicating data gives a similar result too.

About dimensionality reduction: 
* For feature selection, we tested each method twice, keeping more or fewer features. When we keep between 16 and 19 features, depending on the method, the result drops by only 0.01 to 0.03 while the number of features is almost halved, which is a really interesting result. 
* For feature extraction with the PCA method, even when keeping about a third as many features, we still get good results.

Finally, if we had to keep only one model in this project, we would still take the one without any dimensionality reduction: we do not really have a computation time or memory problem in our case, and we are not in a situation of overfitting. So we have little interest in reducing the dimension.
But we showed, and that was our objective, that reducing the dimension can bring substantial benefits in other situations with a very large number of features. (And of course there are many other methods to do that; we have presented only 4 possibilities here.)
