### Data Preparation

In [1]:
import pandas as pd
import numpy as np

For data preparation, we start from a data file that was cleaned beforehand. Some values had been stored in a date format, so we changed them back to their intended format. Once every column had the correct type, we saved the file that is loaded here.

In [2]:
data_path = './data'
file_name = 'cleaned_test.xlsx'
df_prepared = pd.DataFrame()
df = pd.read_excel(f'{data_path}/{file_name}')
df.head()

Unnamed: 0,Tipo de Red,Estudiante de Tecnología,Nivel de Educación,Vive en Ciudad,Tipo de Instituto,Edad,Dispositivo,Tipo de Internet,Situación Financiera,Género,Duración de la Clase
0,3G,No,Colegio,No,Público,21-25,Smartphone,Compra Megas,Mala,Masculino,1-3
1,4G,No,Colegio,No,Privado,16-20,Smartphone,Wifi,Media,Femenino,1-3
2,4G,No,Universidad,Si,Público,21-25,Smartphone,Compra Megas,Media,Masculino,0
3,3G,Si,Universidad,No,Público,21-25,Smartphone,Compra Megas,Media,Masculino,1-3
4,3G,No,Escuela,No,Público,6-10,Smartphone,Compra Megas,Media,Femenino,0


The categorical variables need to be one-hot encoded so that each category gets its own 0/1 indicator column. For this we use the "get_dummies" function from pandas.

In [3]:
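def one_hot_encoding(column_name, column_prefix, df_prepared, df_original):
    """Append the one-hot columns of df_original[column_name] to df_prepared."""
    # dtype=int keeps 0/1 values; newer pandas versions return booleans by default
    df_coding = pd.get_dummies(df_original[column_name], prefix=column_prefix, dtype=int)
    # Concatenate column-wise; rows are aligned on the shared index
    return pd.concat([df_prepared, df_coding], axis=1)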
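
To make the effect of "get_dummies" concrete, here is a minimal sketch on a toy dataframe. The rows are made up for illustration and only reuse one column name from the dataset; `dtype=int` is an assumption added so the result is 0/1 regardless of the installed pandas version.

```python
import pandas as pd

# Toy data (hypothetical rows, reusing one column name from the dataset).
toy = pd.DataFrame({'Tipo de Red': ['3G', '4G', '4G', '2G']})

# One indicator column per category, named "<prefix>_<category>".
# dtype=int forces 0/1 output; newer pandas versions default to booleans.
encoded = pd.get_dummies(toy['Tipo de Red'], prefix='Tipo de Red', dtype=int)

print(encoded.columns.tolist())
print(encoded.sum(axis=1).tolist())  # each row has exactly one active indicator
```

Each row ends up with exactly one 1 across the three new columns, which is the property the neural network input relies on.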

These are the categorical variables. Their values are categories, so they are one-hot encoded before being used in the neural network.

In [4]:
columns_to_one_hot = [
    'Tipo de Red',
    'Nivel de Educación',
    'Tipo de Instituto',
    'Edad',
    'Dispositivo',
    'Tipo de Internet',
    'Situación Financiera',
    'Género',
    'Duración de la Clase',
]

The non-categorical variables are also included in the complete dataframe. These are binary variables.

In [5]:
not_categoricals = []
for column in df.columns:
    if column not in columns_to_one_hot:
        not_categoricals.append(column)
print(not_categoricals)

['Estudiante de Tecnología', 'Vive en Ciudad']


The binary variables also get a numeric representation: 1 for "Si" (yes) and 0 for "No".

In [6]:
for column in not_categoricals:
    # Map the labels in one step. This avoids chained .loc assignment, which
    # raises SettingWithCopyWarning and may not update df_prepared at all.
    df_prepared[column] = df[column].map({"Si": 1, "No": 0})
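
A label that is neither "Si" nor "No" (a typo or an accented variant, for example) would slip through the binary encoding unnoticed. The following sketch shows one way to guard against that. The helper name `encode_binary` and the toy series are hypothetical and not part of the original notebook.

```python
import pandas as pd

def encode_binary(series, mapping=None):
    """Map "Si"/"No" labels to 1/0 and fail loudly on anything else."""
    mapping = mapping or {'Si': 1, 'No': 0}
    encoded = series.map(mapping)
    # map() yields NaN for every label that is missing from the mapping
    unexpected = series[encoded.isna()].unique().tolist()
    if unexpected:
        raise ValueError(f'Unexpected labels: {unexpected}')
    return encoded.astype(int)

# Toy example (made-up values, column name borrowed from the dataset)
toy = pd.Series(['No', 'Si', 'No'], name='Vive en Ciudad')
print(encode_binary(toy).tolist())
```

Raising an error here is a design choice: for a small survey dataset it is usually better to stop and inspect the unexpected label than to carry missing values into training.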

In [7]:
for column_name in columns_to_one_hot:
    df_prepared = one_hot_encoding(column_name,column_name,df_prepared,df)

In [8]:
df_prepared

Unnamed: 0,Estudiante de Tecnología,Vive en Ciudad,Tipo de Red_2G,Tipo de Red_3G,Tipo de Red_4G,Nivel de Educación_Colegio,Nivel de Educación_Escuela,Nivel de Educación_Universidad,Tipo de Instituto_Privado,Tipo de Instituto_Público,...,Tipo de Internet_Compra Megas,Tipo de Internet_Wifi,Situación Financiera_Buena,Situación Financiera_Mala,Situación Financiera_Media,Género_Femenino,Género_Masculino,Duración de la Clase_0,Duración de la Clase_1-3,Duración de la Clase_3-6
0,0,0,0,1,0,1,0,0,0,1,...,1,0,0,1,0,0,1,0,1,0
1,0,0,0,0,1,1,0,0,1,0,...,0,1,0,0,1,1,0,0,1,0
2,0,1,0,0,1,0,0,1,0,1,...,1,0,0,0,1,0,1,1,0,0
3,1,0,0,1,0,0,0,1,0,1,...,1,0,0,0,1,0,1,0,1,0
4,0,0,0,1,0,0,1,0,0,1,...,1,0,0,0,1,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
236,0,1,0,0,1,0,1,0,1,0,...,0,1,0,0,1,1,0,0,1,0
237,0,1,0,0,1,0,1,0,1,0,...,1,0,0,0,1,1,0,0,1,0
238,0,1,0,0,1,0,1,0,1,0,...,1,0,0,0,1,0,1,0,1,0
239,0,1,0,0,1,0,1,0,0,1,...,0,1,0,0,1,0,1,0,1,0
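
Before saving, it can be worth checking that every one-hot group has exactly one active column per row. The sketch below assumes the "<prefix>_<category>" naming produced by "get_dummies"; the helper `check_one_hot_groups` and the toy dataframe are hypothetical.

```python
import pandas as pd

def check_one_hot_groups(df_encoded, prefixes):
    """Return the prefixes whose indicator columns do not sum to 1 in every row."""
    bad = []
    for prefix in prefixes:
        # Collect the indicator columns that belong to this original variable
        cols = [c for c in df_encoded.columns if c.startswith(f'{prefix}_')]
        if not cols or not (df_encoded[cols].sum(axis=1) == 1).all():
            bad.append(prefix)
    return bad

# Toy example (made-up values): the second row of "Edad" is deliberately inconsistent
toy = pd.DataFrame({
    'Género_Femenino': [0, 1],
    'Género_Masculino': [1, 0],
    'Edad_16-20': [1, 1],
    'Edad_21-25': [0, 1],
})
print(check_one_hot_groups(toy, ['Género', 'Edad']))
```

On the real data, a call such as `check_one_hot_groups(df_prepared, columns_to_one_hot)` would be expected to return an empty list, provided no source column contains missing values.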


The dataset is ready to be saved and used in the following steps.

In [9]:
out_file_name = 'test_data_set_prepared.csv'
df_prepared.to_csv(f'{data_path}/{out_file_name}',index=False)