# Assignment steps
*   Explore the data
*   State the problem to solve
*   Preprocess the data into a suitable format
*   Choose algorithms
*   Fit and validate
*   Decide on the final algorithm, and test

# Problems to be solved

A growing number of companies, e-commerce companies in particular, have been shown to struggle to convert prospects (potential customers) into active customers with a first purchase. Keeping existing customers engaged over time with the products and the company is also hard to understand, and it leads to lost revenue from cancellations and, eventually, to losing the customer.

Research has focused on product analysis and on the customer's value cycle within the company.

Questions that help frame the problem:
* How do the products perform in sales?
* How do customers relate to the products over time?
* How do customers group according to their needs and interests?
* How do we retain customers or improve the conversion rate to customers?

## OBJECTIVE:
This project explores different tools for customer conversion, retention and sales performance using Machine Learning methods.

We will explore and measure the effectiveness of the following tools:
A. Product Analytics.
B. Product recommendation.
C. CLV (Customer Lifetime Value).
D. Customer segmentation.

ML techniques to use and explore:
* Unsupervised
  * Clustering (K-Prototypes)
* Supervised
  * KNN, Decision Tree
  * SVM, ANN, DNN



# EDA - Exploratory Data Analysis

## Initialization

In [1]:
%matplotlib inline

In [None]:
!pip install seaborn

In [176]:
!pip install kmodes

In [None]:
!pip install nltk

In [144]:
import pandas as pd
from pandas import DataFrame
from datetime import datetime, timedelta, date
from pandas.plotting import autocorrelation_plot
from pandas import read_csv
from matplotlib import pyplot as plt


import warnings;
warnings.filterwarnings('ignore')



In [None]:
import seaborn as sns

In [None]:

# NLP
import re
import nltk 
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
from nltk.corpus import stopwords 
from nltk.tokenize import word_tokenize, sent_tokenize

In [None]:
# Algorithm 1
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import LinearSVC
from sklearn import preprocessing

from kmodes.kprototypes import KPrototypes
from pprint import pprint
import numpy as np

If you are running on Google Colab

In [129]:
import sys
assert sys.version_info >= (3, 5)
import os

In [None]:
RETAIL_PATH = "https://github.com/hcgalvan/UNSAM-Machine-Learning-on-Economics/raw/main/data/"

In [None]:
if 'google.colab' in sys.modules:
    from urllib.parse import quote

    def load_dataset(filename, datasets_path=RETAIL_PATH):
        # RETAIL_PATH is a URL, so it is joined by hand (os.path.join is meant for
        # filesystem paths) and the space in the file name is percent-encoded
        csv_url = datasets_path + quote(filename)
        return pd.read_csv(csv_url, encoding='unicode_escape')

    retail_ol_h1 = load_dataset("Year 2009-2010_train.csv")
    retail_ol_h2 = load_dataset("Year 2010-2011_train.csv")

Use this if you are running locally (VS Code or any other environment)

In [130]:
retail_ol_h1 = pd.read_csv('./data/Year 2009-2010_train.csv',  encoding= 'unicode_escape')
retail_ol_h2 = pd.read_csv('./data/Year 2010-2011_train.csv',  encoding= 'unicode_escape')

Common to both local and Google Colab runs

In [131]:
frames = [retail_ol_h1, retail_ol_h2]
results = pd.concat(frames)
df = results.copy()

# PREPROCESSING

#### Current state of the data

In [132]:
df.shape

(853896, 9)

In [133]:
df.isnull().sum()

Unnamed: 0          0
Invoice             0
StockCode           0
Description      3531
Quantity            0
InvoiceDate         0
Price               0
Customer ID    194467
Country             0
dtype: int64

In [None]:
# Check the total and unique number of values in each feature (column).

for i in df.columns:
    print("Total number of values", i, len(df[i]))
    print("Unique number of values", i, len(df[i].unique()))

In [135]:
# Check for null values in the features
df.isnull().sum()

Unnamed: 0          0
Invoice             0
StockCode           0
Description      3531
Quantity            0
InvoiceDate         0
Price               0
Customer ID    194467
Country             0
dtype: int64

In [136]:
# Check for duplicated rows
df.duplicated().sum()

0

In [137]:
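# Incomplete months
print('Date range: %s ~ %s' % (df['InvoiceDate'].min(), df['InvoiceDate'].max()))
# Rows in December 2011, the incomplete last month (not displayed: only the cell's last expression is shown)
df.loc[df['InvoiceDate'] >= '2011-12-01'].shape
# Rows on the first day of the range, 2009-12-01
df.loc[df['InvoiceDate'] < '2009-12-02' ].shape

Date range: 2009-12-01 07:45:00 ~ 2011-12-09 12:50:00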


(2600, 9)

## Data Preparation
1. Remove cancelled orders.
2. Remove records without a Customer ID or product description, and the untitled column.
3. Exclude incomplete months.
4. Compute total sales from the Quantity and unit Price features.
5. Per-customer data: to analyze customer segments we need to transform the data so that each record represents the purchase history of an individual customer.

#### 1. Data cleaning
Some records have negative values in the Quantity column; these represent cancelled orders. We ignore and remove these records.
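
As a minimal illustration of this filter, the sketch below uses a small made-up table (the rows are invented for the example, not taken from the dataset) and keeps only the rows with a positive Quantity.

```python
import pandas as pd

# Toy invoice lines; the values are invented for illustration only
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "1003"],
    "Quantity": [6, 2, -6, 1],
    "Price": [2.5, 4.0, 2.5, 10.0],
})

# Rows with a non-positive Quantity are treated as cancellations and dropped
cancelled = toy.loc[toy["Quantity"] <= 0]
kept = toy.loc[toy["Quantity"] > 0]
print(len(cancelled), "cancelled rows,", len(kept), "rows kept")
```

Filtering into a new variable, rather than dropping rows in place, keeps the original table available for counting how many records were removed.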

In [138]:
def limpieza_datos(df):
    # Keep only rows with a positive quantity (negative ones are cancelled orders)
    df = df.loc[df['Quantity'] > 0]
    # Drop the leftover, untitled index column from the CSV export
    df = df.drop(columns=['Unnamed: 0'])
    # Drop rows with a missing Customer ID or Description; these values are not imputed
    df = df.dropna()
    # Drop duplicated rows
    df = df.drop_duplicates()
    # Exclude the incomplete month (December 2011)
    df = df.loc[df['InvoiceDate'] < '2011-12-01']
    df = df.loc[df['InvoiceDate'] > '2009-12-01']
    return df

#### 2. Feature additions

#### NLP - PREPROCESSING TO ANALYZE PRODUCT CATEGORIES

In [171]:
# This function derives a product type and a colour from each product description
def agrega_color(df):
    colours = ['red','orange', 'yellow','green', 'blue', 'indigo', 'violet', 'purple', 'pink', 'silver', 'gold', 'beige', 'brown', 'grey', 'gray', 'black', 'white', 'cream']

    stop_words = set(stopwords.words('english'))
    lemmatizer = nltk.stem.WordNetLemmatizer()
    Product_type = []
    Colour_type = []
    for row in df['Description']:
        s = " "
        description = re.sub('[^a-zA-Z]', " ", str(row).lower())  # keep letters only
        wordsList = nltk.word_tokenize(description)  # tokenization
        # Lemmatization, skipping stop words
        wordsList = [lemmatizer.lemmatize(w, 'n') for w in wordsList if w not in stop_words]
        # Record the first colour word found in the description, if any
        flag = False
        for w in wordsList:
            if w in colours:
                Colour_type.append(w)
                flag = True
                break
        if not flag:
            Colour_type.append("no_color")

        tagged = nltk.pos_tag(wordsList)

        # Keep the singular nouns (tag 'NN') as the product type
        for tag in tagged:
            if tag[1] == 'NN':
                s += tag[0] + " "
        Product_type.append(s)

    return Product_type, Colour_type
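
The colour step can be tried on its own without downloading the NLTK resources. The sketch below is a simplified stand-in, not the function above: it splits on whitespace instead of using the NLTK tokenizer and lemmatizer, uses a shortened colour list, and `first_colour` is a hypothetical helper name.

```python
import re

# Shortened colour list, for the example only
COLOURS = {"red", "blue", "white", "pink"}

def first_colour(description):
    """Return the first colour word in a description, or 'no_color'."""
    # Replace everything that is not a letter with a space, then split into words
    words = re.sub("[^a-zA-Z]", " ", str(description).lower()).split()
    for w in words:
        if w in COLOURS:
            return w  # stop at the first colour found
    return "no_color"

print(first_colour("WHITE HANGING HEART T-LIGHT HOLDER"))
print(first_colour("JUMBO BAG"))
```

The loop stops only once a colour has been found; stopping after the first word regardless would label most descriptions as having no colour.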

In [174]:
def borrar_desc_invoice(df):
    # Drop the "InvoiceDate" and "Description" columns
    X = df.drop(["Description", "InvoiceDate"], axis=1)
    return X

In [175]:
def cambiar_tipo_datos(df):
    # Cast every column to categorical, then restore the numeric ones to float.
    # Columns are addressed by name; by position they are 2 (Quantity), 3 (Price) and 8 (Revenue)
    X = df.astype('category')
    for col in ['Quantity', 'Price', 'Revenue']:
        X[col] = df[col].astype(float)
    return X

#### GENERAL-PURPOSE FUNCTIONS

In [None]:
# Parse InvoiceDate as a datetime and set numeric types for later processing
def tipos_dataset(df):
    # The raw values look like '2009-12-01 07:45:00'
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'], format='%Y-%m-%d %H:%M:%S')
    df['Quantity'] = pd.to_numeric(df['Quantity'], errors='coerce')
    df['Price'] = df['Price'].astype(float)
    df['Month'] = df['InvoiceDate'].dt.month
    return df

In [146]:
# Add features to analyze different cases
# USE ONLY AFTER THE DATA TYPES HAVE BEEN SET WITH tipos_dataset(df)
def agregados_features(df):
    # Add the Sales feature
    df['Sales'] = df['Quantity'] * df['Price']
    # Per-order date and time parts
    df['date'] = pd.to_datetime(df.InvoiceDate.dt.date, errors='coerce')
    df['time'] = df.InvoiceDate.dt.time
    df['hour'] = df.InvoiceDate.dt.hour

    # Calendar parts come from InvoiceDate: datetime.time values have no day, month or year
    df['day'] = df.InvoiceDate.dt.day
    df['month'] = df.InvoiceDate.dt.month
    df['year'] = df.InvoiceDate.dt.year

    df['dayofweek'] = df.InvoiceDate.dt.dayofweek
    df['weekend'] = df['dayofweek'].isin([5, 6])

    # Call agrega_color once and unpack both lists
    df['Product Type'], df['Colour_type'] = agrega_color(df)

    return df

In [None]:
def agregar_fechas(df):
    # Derive the calendar parts from InvoiceDate (parsed here in case it is still a string)
    fechas = pd.to_datetime(df['InvoiceDate'])
    df['Day'] = fechas.dt.day
    df['Month'] = fechas.dt.month
    df['Year'] = fechas.dt.year
    df['DayOfWeek'] = fechas.dt.dayofweek

    return df

In [None]:
import plotly.express as px

# Sums the quantity sold per country and month, labels each country with its
# continent and draws an animated map
def cantidad_mensual_pais(df):
    df = df.groupby(['Country', 'Month']).agg({'Quantity': 'sum'})
    df = df.reset_index()
    df = df.sort_values(by=['Month'])
    Europe = ['United Kingdom', 'France', 'Belgium', 'EIRE',
              'Germany', 'Portugal', 'Denmark', 'Netherlands', 'Poland',
              'Spain', 'Cyprus', 'Greece', 'Norway', 'Austria', 'Sweden',
              'Finland', 'Italy', 'Switzerland', 'Malta', 'Channel Islands',
              'Lithuania', 'Iceland']
    Asia = ['Japan', 'United Arab Emirates', 'Singapore', 'Hong Kong',
            'Thailand', 'Korea', 'Lebanon', 'Israel']
    America = ['USA', 'Brazil', 'Canada', 'West Indies']
    Australia = ['Australia']
    df['Continent'] = df['Country'].map(lambda x: 'Europe' if x in Europe else (
                                        'Asia' if x in Asia else
                                        'America' if x in America else
                                        'Australia' if x in Australia else 'None'))
    # Country holds country names rather than ISO codes, hence locationmode
    fig = px.scatter_geo(df, locations="Country", locationmode="country names",
                         color="Continent", hover_name="Country", size="Quantity",
                         animation_frame="Month",
                         projection="natural earth")
    fig.show()

#### PRICE FUNCTIONS

In [None]:
# Plot quantity by price range
def ploteo_precio_cantidad(df):
    df = df[['Quantity','Price']].copy()
    df['PriceBins'] = pd.cut(df['Price'].tolist(), bins=8)
    sns.barplot(data=df, x="PriceBins", y="Quantity")

In [None]:
# How many purchases were made each month, per country
# (counts rows with a Customer ID, not unique customers)
def price_customer(df):
    df = df.groupby(['Country' , 'Month']).agg({'Price':'sum' , 'Customer ID' :'count'})
    df.columns = ['PriceSum','CustomerIDCount']
    df = df.reset_index()
    cm = sns.light_palette("blue", as_cmap=True)
    pvd = pd.pivot_table(df, values='CustomerIDCount', index=['Country'],
                    columns=['Month'],
                    aggfunc='sum').fillna(0)
    return pvd.style.background_gradient(cmap=cm)

In [None]:
# In which ranges are prices most common? - Plot
def rango_precios(df):
    # Count how many rows carry each distinct price
    df = df['Price'].value_counts().reset_index()
    df.columns = ['Price','CountPrice']
    df['PriceBins'] = pd.cut(df['Price'].tolist(), bins=8)
    # Sum the counts within each bin (the default estimator would plot their mean)
    sns.barplot(data=df, x='PriceBins', y='CountPrice', estimator=sum)

In [None]:
# Price changes over time - DATA EXPLORATION
def AED(df):
    customer_avg_spending = df[['Price','Customer ID', 'InvoiceDate' , 'Country']]
    avg_selling_of_products = df[['Price','Quantity','InvoiceDate']]
    return customer_avg_spending, avg_selling_of_products

In [None]:
import plotly.graph_objects as go

def customer_avg_spending_insights(df):
    # Sum of prices per calendar day, plotted as a time series
    df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate']).dt.strftime('%Y-%m-%d')
    df = df.groupby(['InvoiceDate']).agg({'Price':'sum'}).reset_index()
    df = df.reset_index(drop=True)
    df.columns = ['Date','PriceSum']
    fig = go.Figure()
    fig.add_trace(go.Scatter(x=df['Date'], y=df['PriceSum'], name="Price Sum",
                             line_color='deepskyblue'))
    fig.update_layout(title_text='Sum range of all prices among time',
                      xaxis_rangeslider_visible=True)
    fig.show()

#### AGGREGATION AND MATRIX FUNCTIONS

In [None]:
def matriz_cliente_item(df):
    # Customer-item matrix
    # aggfunc='sum' adds up all the quantities a customer bought of each item.
    customer_item_matrix = df.pivot_table(
        index='Customer ID', 
        columns='StockCode', 
        values='Quantity',
        aggfunc='sum'
    )
    # Encode the matrix as 0/1: 1 means the customer bought the product at least once, 0 means never.
    # Cells with no purchase are NaN, and NaN > 0 is False, so they become 0.
    customer_item_matrix = (customer_item_matrix > 0).astype(int)
    return customer_item_matrix
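
To make the encoding concrete, here is a small sketch on invented data (three customers, three stock codes) that builds the quantity matrix and then binarizes it.

```python
import pandas as pd

# Invented purchases: customer 2 buys item A twice
toy = pd.DataFrame({
    "Customer ID": [1, 1, 2, 2, 3],
    "StockCode": ["A", "B", "A", "A", "C"],
    "Quantity": [2, 1, 5, 3, 4],
})

# One row per customer, one column per item, summed quantities in the cells
matrix = toy.pivot_table(index="Customer ID", columns="StockCode",
                         values="Quantity", aggfunc="sum")

# Customer/item pairs with no purchase are NaN; NaN > 0 is False, so they become 0
binary = (matrix > 0).astype(int)
print(binary)
```

A 0/1 matrix like this is a common input for item-based recommendation, where only whether a product was bought matters and not how many units.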

In [None]:
def agrupa_cliente_pedidos(df):
    # Total sales and latest date per customer and invoice
    orders_df = df.groupby(['Customer ID', 'Invoice']).agg({
        'Sales': 'sum',
        'InvoiceDate': 'max'
    })
    return orders_df

In [None]:
def agrupa_cliente_ventas(df):
    # Per-customer sales data: total sales and number of distinct invoices
    customer_df = df.groupby('Customer ID').agg({
        'Sales': 'sum',
        'Invoice': 'nunique'
    })

    customer_df.columns = ['TotalSales', 'OrderCount']
    customer_df['AvgOrderValue'] = customer_df['TotalSales']/customer_df['OrderCount']
    return customer_df
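
The same aggregation can be checked by hand on a few invented rows. The sketch below uses pandas named aggregation, which is an alternative spelling of the dictionary form used above, and the data is made up for the example.

```python
import pandas as pd

# Customer 1 has two invoices (1001 with two lines, 1002 with one); customer 2 has one
toy = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2],
    "Invoice": ["1001", "1001", "1002", "1003"],
    "Sales": [10.0, 5.0, 30.0, 8.0],
})

summary = toy.groupby("Customer ID").agg(
    TotalSales=("Sales", "sum"),
    OrderCount=("Invoice", "nunique"),  # distinct invoices, not invoice lines
)
summary["AvgOrderValue"] = summary["TotalSales"] / summary["OrderCount"]
print(summary)
```

Counting distinct invoices matters here: customer 1 has three invoice lines but only two orders, so the average order value is 45 / 2 = 22.5.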

#### 5. Per-customer data

In [None]:
# Snapshot of the matrix (assumes df has already been cleaned)
customer_item_matrix = matriz_cliente_item(df)
customer_item_matrix

In [105]:
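# Per-customer totals (assumes df already has the Sales column), then rank each column
customer_df = agrupa_cliente_ventas(df)
rank_df = customer_df.rank(method='first')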

In [None]:
customer_df.head(15)

## Descriptive Analysis of the Data

InvoiceDate is still stored as text (dtype object) and needs to be converted to a datetime type.

In [107]:
df.describe(include='object')

| | Invoice | StockCode | Description | InvoiceDate | Country |
|---|---|---|---|---|---|
| count | 613951 | 613951 | 613951 | 613951 | 613951 |
| unique | 35540 | 4605 | 5243 | 33316 | 41 |
| top | 576339 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 2011-11-14 15:27:00 | United Kingdom |
| freq | 434 | 4051 | 4051 | 434 | 551712 |


### Customer descriptives

In [108]:
# Analyze customers
df['Customer ID'].describe()

count    613951.000000
mean      15322.031318
std        1695.361076
min       12346.000000
25%       13971.000000
50%       15251.000000
75%       16794.000000
max       18287.000000
Name: Customer ID, dtype: float64

In [109]:
df['Quantity'].describe()

count    613951.000000
mean         13.385628
std         119.300741
min           1.000000
25%           2.000000
50%           6.000000
75%          12.000000
max       74215.000000
Name: Quantity, dtype: float64

In [110]:
customer_df.describe()

| | TotalSales | OrderCount | AvgOrderValue |
|---|---|---|---|
| count | 5828.0 | 5828.0 | 5828.0 |
| mean | 2327.814351 | 6.098147 | 302.795982 |
| std | 11403.148921 | 12.491479 | 474.321604 |
| min | 0.0 | 1.0 | 0.0 |
| 25% | 276.365 | 1.0 | 143.1 |
| 50% | 684.745 | 3.0 | 227.0975 |
| 75% | 1810.44 | 7.0 | 340.546941 |
| max | 460080.01 | 379.0 | 19633.5 |


In [111]:
rank_df.describe()

| | TotalSales | OrderCount | AvgOrderValue |
|---|---|---|---|
| count | 5828.0 | 5828.0 | 5828.0 |
| mean | 2914.5 | 2914.5 | 2914.5 |
| std | 1682.543016 | 1682.543016 | 1682.543016 |
| min | 1.0 | 1.0 | 1.0 |
| 25% | 1457.75 | 1457.75 | 1457.75 |
| 50% | 2914.5 | 2914.5 | 2914.5 |
| 75% | 4371.25 | 4371.25 | 4371.25 |
| max | 5828.0 | 5828.0 | 5828.0 |


### Data normalization

In [112]:
normalized_df = (rank_df - rank_df.mean()) / rank_df.std()
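
The cell above standardizes the ranks rather than the raw values, which limits the pull of extreme customers such as the maximum TotalSales of 460080.01 in the summary earlier. The sketch below shows the two steps on five invented values; the numbers are made up for the example.

```python
import pandas as pd

# Invented totals with one extreme value and one tie
toy = pd.DataFrame({"TotalSales": [10.0, 2000.0, 35.0, 35.0, 400000.0]})

# Step 1: replace each value by its rank; ties are broken by order of appearance
ranks = toy.rank(method="first")

# Step 2: z-score the ranks (pandas std() uses the sample standard deviation)
z = (ranks - ranks.mean()) / ranks.std()
print(ranks["TotalSales"].tolist())
print(z["TotalSales"].round(3).tolist())
```

Because the ranks are always 1..n, every column ends up with the same spread, which is why the three columns in the normalized summary share identical quartiles.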

In [113]:
normalized_df.head(15)

| Customer ID | TotalSales | OrderCount | AvgOrderValue |
|---|---|---|---|
| 12346.0 | 1.723284 | 1.244842 | 1.729228 |
| 12347.0 | 1.327455 | 0.859711 | 1.405313 |
| 12348.0 | 0.814541 | 0.491221 | 0.844852 |
| 12349.0 | 1.310219 | 0.203561 | 1.633539 |
| 12350.0 | -0.960154 | -1.731605 | 0.246948 |
| 12351.0 | -0.897748 | -1.731011 | 0.392561 |
| 12352.0 | 1.111116 | 1.17471 | 0.254674 |
| 12353.0 | -0.667145 | -0.747381 | -0.585126 |
| 12354.0 | 0.085882 | -1.730416 | 1.572322 |
| 12355.0 | 0.122731 | -0.746786 | 1.081399 |


In [114]:
normalized_df.describe()

| | TotalSales | OrderCount | AvgOrderValue |
|---|---|---|---|
| count | 5828.0 | 5828.0 | 5828.0 |
| mean | 0.0 | 0.0 | -9.753504e-18 |
| std | 1.0 | 1.0 | 1.0 |
| min | -1.731605 | -1.731605 | -1.731605 |
| 25% | -0.865803 | -0.865803 | -0.8658025 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 0.865803 | 0.865803 | 0.8658025 |
| max | 1.731605 | 1.731605 | 1.731605 |


### More detailed product analysis

In [115]:
# Which product is the best seller (the most frequent StockCode)?
df.StockCode.mode()

0    85123A
dtype: object

In [None]:
product = df[df.StockCode.str.contains("85123A")]
product.head()

In [117]:
df.Description.mode()

0    WHITE HANGING HEART T-LIGHT HOLDER
dtype: object

### Price and quantity analysis

In [118]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 613951 entries, 0 to 433525
Data columns (total 9 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   Invoice      613951 non-null  object 
 1   StockCode    613951 non-null  object 
 2   Description  613951 non-null  object 
 3   Quantity     613951 non-null  int64  
 4   InvoiceDate  613951 non-null  object 
 5   Price        613951 non-null  float64
 6   Customer ID  613951 non-null  float64
 7   Country      613951 non-null  object 
 8   Sales        613951 non-null  float64
dtypes: float64(3), int64(1), object(5)
memory usage: 46.8+ MB


In [121]:
df1 = tipos_dataset(df)

In [None]:
ploteo_precio_cantidad(df1)

## ALGORITHMS ANALYZED

## ALGORITHM 1 - CLASSIFICATION AND QUANTITY PREDICTION

In [None]:
# Clean the data
df = limpieza_datos(df)

In [None]:
# Add columns to the dataframe (agrega_color is called once, since it is slow)

df['Product Type'], df['Colour_type'] = agrega_color(df)

In [None]:
# Drop Description and InvoiceDate
df1 = borrar_desc_invoice(df)

In [None]:
# Add sales revenue
df1['Revenue'] = df1['Price'] * df1['Quantity']

In [None]:
# Label encoding of categorical features
label_encoder = preprocessing.LabelEncoder()

for col in ["Invoice", "StockCode", "Customer ID","Country", "Product Type","Colour_type"]:
  df1[col] = label_encoder.fit_transform(df1[col])

In [None]:
# Set the data types to categorical and float
df1 = cambiar_tipo_datos(df1)

In [None]:
# Split the dataframe into "train" and "test"
train, test = train_test_split(df1, train_size = 0.8, random_state = 0)

#### Clustering similar items into a new cluster feature, using K-Prototypes clustering
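
K-Prototypes handles mixed data by combining a squared Euclidean distance on the numeric columns with a count of mismatches on the categorical columns, the latter weighted by a factor usually called gamma. The sketch below computes that dissimilarity for a single row and a single prototype. It is a hand-written illustration, not the kmodes implementation; `mixed_dissimilarity` is a hypothetical helper and the values are invented.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Hypothetical helper: numeric squared distance plus gamma times categorical mismatches."""
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = sum(a != b for a, b in zip(x_cat, proto_cat))
    return numeric_part + gamma * categorical_part

# Row: Quantity 6, Price 2.5, country and colour as categories
d = mixed_dissimilarity([6.0, 2.5], ["United Kingdom", "white"],
                        [4.0, 2.5], ["United Kingdom", "red"], gamma=0.5)
print(d)  # (6 - 4)^2 + 0.5 * 1 mismatch = 4.5
```

This is why the list of categorical column positions passed to the clustering call matters: a column left out of it is treated as numeric and enters the squared-distance term instead.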

In [None]:
# Look for the optimal value of 'K' // takes several minutes to run

cluster_range = range(2, 15)
cost = []
for num_clusters in cluster_range:
    kproto = KPrototypes(n_clusters = num_clusters, init='Cao')
    kproto.fit_predict(train, categorical=[0, 1, 4, 5, 6, 7])
    cost.append(kproto.cost_)
# Plot the cost against K so the elbow can be read off the x axis
plt.plot(cluster_range, cost)

In [None]:
# Fit the final model and get a cluster number for each training row

kproto = KPrototypes(n_clusters = 3, init = 'Cao')
# Same categorical columns as in the search for K (column 7 is Colour_type)
kproto.fit_predict(train, categorical=[0, 1, 4, 5, 6, 7])
print(kproto.cost_)
labels=kproto.labels_

In [None]:
# Add the new feature
train["Cluster number"]=labels


In [None]:
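# Now add "InvoiceDate" back to the dataframe
# Note: df keeps the row labels of the two source files, so a label can appear twice and the index merge then matches both rows

df2 = train.merge(pd.DataFrame(df["InvoiceDate"]), left_index=True, right_index=True)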


## ADD FEATURE ENGINEERING HERE

In [None]:
# Add date parts: day, month, year and day of week
df2 = agregar_fechas( df2 )

In [None]:
# Save the intermediate result to a CSV file
df2.to_csv('./data/df2.csv')

In [None]:
# Load the file back, drop the leftover index column and show the first rows
df2 = pd.read_csv('./data/df2.csv')
df2.drop(['Unnamed: 0'], axis =1, inplace=True)
df2.head()

### Classification of test data into number of clusters

- Cluster numbers were treated as the target variable, since the objective was to match the records from the validation and testing sets with the clusters from the training set.
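
The idea can be sketched end to end on synthetic data: cluster the training rows, train a classifier to predict the cluster number, then use the classifier to place unseen rows into the existing clusters. For brevity the sketch swaps in scikit-learn's KMeans on purely numeric, invented data in place of K-Prototypes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of points (invented data)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
X_train, X_new = train_test_split(X, train_size=0.8, random_state=0)

# Step 1: cluster the training rows only
km = KMeans(n_clusters=2, n_init=10, random_state=0)
train_labels = km.fit_predict(X_train)

# Step 2: learn to predict the cluster number from the features
clf = LinearSVC()
clf.fit(X_train, train_labels)

# Step 3: assign held-out rows to the existing clusters
new_labels = clf.predict(X_new)
agreement = (new_labels == km.predict(X_new)).mean()
print(agreement)
```

KMeans can assign new rows directly, which is what the agreement check uses as a reference. Training a separate classifier is a workaround for cases where the held-out rows are to be labelled without refitting the clustering.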

In [None]:
# Split the dataframe into training and validation sets
train_, val_= train_test_split(df2, train_size = 0.8, random_state = 0)

In [None]:
# InvoiceDate is dropped from the features: it is a string column and its parts are already in Day, Month, Year and DayOfWeek
train_y=train_["Cluster number"]
train_x=train_.drop(['Cluster number', 'InvoiceDate'],axis=1,inplace=False)

val_y=val_["Cluster number"]
val_x=val_.drop(['Cluster number', 'InvoiceDate'],axis=1,inplace=False)

#### Classification
* We use a linear SVC (LinearSVC), which gave the best result among the algorithms tried

In [None]:
model1 = LinearSVC()
model1.fit(train_x, train_y)

In [None]:
# Predict on the validation set
pred_y = model1.predict(val_x)

In [None]:
# Evaluate performance
accuracy_score(val_y,pred_y)

In [None]:
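# Add "InvoiceDate" to the test data

test_Df = test.merge(pd.DataFrame(df["InvoiceDate"]), left_index=True, right_index=True)
# Add the same date parts and the predicted cluster, so test rows have the same columns as the training rows
test_Df = agregar_fechas(test_Df)
test_Df["Cluster number"] = model1.predict(test_Df[train_x.columns])
test_Df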

### PREDICTING THE "QUANTITY" FEATURE FOR PRODUCT DEMAND

In [None]:
# Label encoding for categorical features

from sklearn import preprocessing
label_encoder = preprocessing.LabelEncoder()

for col in ["Invoice", "StockCode", "Customer ID","Country", "Product Type","Colour_type"]:
  df[col] = label_encoder.fit_transform(df[col])

In [None]:
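# Quantity is the target. InvoiceDate (a string) is dropped, and so is Revenue,
# which is Price * Quantity and would leak the target into the features.
drop_cols = ['Quantity', 'InvoiceDate', 'Revenue']
train_y=train_["Quantity"].astype('int')
train_x=train_.drop(drop_cols, axis=1,inplace=False)

test_Df_y=test_Df["Quantity"].astype('int')
# Reorder the test columns to match the training columns
test_Df_x=test_Df.drop(drop_cols,axis=1,inplace=False)[train_x.columns]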

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

# All other hyperparameters are left at their scikit-learn defaults
clf = RandomForestClassifier(n_estimators=100)
clf.fit(train_x, train_y)

In [None]:
prediction_test = clf.predict(test_Df_x)
prediction_test

In [None]:
from sklearn.metrics import f1_score
f1_score(test_Df_y, prediction_test, average='micro')

In [None]:
accuracy_score(test_Df_y, prediction_test)
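
For single-label multiclass predictions, the micro-averaged F1 score is equal to accuracy, so the two cells above report the same number. The sketch below checks this on a few invented labels and also prints the macro average, which weights every class equally and can differ.

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented quantity labels and predictions: 5 of the 8 predictions are correct
y_true = [1, 2, 2, 6, 12, 12, 1, 6]
y_pred = [1, 2, 6, 6, 12, 1, 1, 2]

acc = accuracy_score(y_true, y_pred)
f1_micro = f1_score(y_true, y_pred, average="micro")
f1_macro = f1_score(y_true, y_pred, average="macro")
print(acc, f1_micro, f1_macro)
```

When many distinct quantities are rare, the macro average is the more informative complement to accuracy, since it is not dominated by the most frequent values.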

In [None]:
# Hyperparameter tuning for the Random Forest algorithm

# Number of trees in the random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at each split (None means all features)
max_features = ['sqrt', None]
# Maximum number of levels in a tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Whether each tree is trained on a bootstrap sample
bootstrap = [True, False]
# Create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
pprint(random_grid)

# Use the random grid to search for the best hyperparameters
# First create the base model to tune
rf = RandomForestRegressor()
# Random search over the parameters, using 3-fold cross-validation,
# trying 100 different combinations and using all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 100, cv = 3, verbose=2, random_state=42, n_jobs = -1)
# Fit the random search model
rf_random.fit(train_x, train_y)
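
Once the search has been fitted, the chosen combination and its cross-validated score can be read from `best_params_` and `best_score_`. The sketch below runs a much smaller search on synthetic data so that it finishes in seconds; the data and the reduced grid are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
# Synthetic regression problem: the target depends mostly on the first feature
X = rng.normal(size=(120, 3))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=120)

small_grid = {
    "n_estimators": [10, 20],
    "max_depth": [2, 4, None],
    "min_samples_leaf": [1, 2],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=small_grid,
    n_iter=4,          # try 4 of the 12 possible combinations
    cv=3,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2 of the best candidate
```

With the default `refit=True`, `search.best_estimator_` is retrained on all the data passed to `fit` and can be used directly for prediction.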

In [None]:
# KNN

from sklearn.neighbors import KNeighborsClassifier

neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(train_x, train_y)
knn=neigh.predict(test_Df_x)
print(accuracy_score(test_Df_y, knn))

In [None]:
# SVC with RBF kernel

from sklearn.svm import SVC

rbf_svc = SVC(kernel='rbf')
rbf_svc.fit(train_x, train_y)

rbf=rbf_svc.predict(test_Df_x)
accuracy_score(test_Df_y, rbf)


In [None]:
# AdaBoost

from sklearn.ensemble import AdaBoostClassifier

ad = AdaBoostClassifier(n_estimators=100, random_state=0)
ad.fit(train_x, train_y)
adb=ad.predict(test_Df_x)
print(accuracy_score(test_Df_y, adb))

In [None]:
# Logistic regression

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(multi_class='ovr')
lr.fit(train_x, train_y)
lrc=lr.predict(test_Df_x)
print(accuracy_score(test_Df_y, lrc))

In [None]:
# Naive Bayes classifier

from sklearn.naive_bayes import GaussianNB

nb = GaussianNB()
nb.fit(train_x, train_y)
nbc=nb.predict(test_Df_x)
print(accuracy_score(test_Df_y, nbc))

In [None]:
# Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier 

dtree_model = DecisionTreeClassifier().fit(train_x, train_y)
dtree_predictions = dtree_model.predict(test_Df_x)
accuracy_score(test_Df_y, dtree_predictions)

In [None]:
# GradientBoostingClassifier

from sklearn.ensemble import GradientBoostingClassifier

gb=GradientBoostingClassifier()
gb.fit(train_x, train_y)
gbc=gb.predict(test_Df_x)
print(accuracy_score(test_Df_y, gbc))