# <center>Supervised models: XGBoost and SVM</center>

**Project:**
Supervised models

**Author:**
Carlos Ramírez - Coordinación de plazas y remuneraciones - Racionalización

**Review and modifications:**
Name - Coordinación de analítica de datos

**Last modified:**
10.08.2022


<hr style="height:2px;border-width:0;color:black;background-color:black">

## 0. Library imports and function definitions

Install the xgboost library beforehand. For more information, see the [installation guide](https://xgboost.readthedocs.io/en/stable/install.html).

XGBoost blog posts of interest
- [Hyperparameters](https://towardsdatascience.com/a-guide-to-xgboost-hyperparameters-87980c7f44a9)

SVM blog posts of interest
- [SVM](https://towardsdatascience.com/support-vector-machine-python-example-d67d9b63f1c8#:~:text=Support%20Vector%20Machine%20(SVM)%20is,straight%20line%20between%20two%20classes.)

In [1]:
# Install XGBoost if it is not already available
# pip install xgboost

In [2]:
import pandas as pd
import numpy as np
import math
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import base64
import gc
import xgboost as xgb
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler

In [3]:
# Build a OneDrive direct-download URL from a share link

def create_onedrive_directdownload(onedrive_link):
    # Encode the share link as unpadded, URL-safe base64
    data_bytes64 = base64.b64encode(bytes(onedrive_link, 'utf-8'))
    data_bytes64_String = data_bytes64.decode('utf-8').replace('/','_').replace('+','-').rstrip("=")
    resultUrl = f"https://api.onedrive.com/v1.0/shares/u!{data_bytes64_String}/root/content"
    return resultUrl
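
As a quick check of what the helper produces, the sketch below applies the same encoding to a made-up share link. The link `https://1drv.ms/u/s!EXAMPLE?e=abc123` is a placeholder assumed for illustration, not a real file, and the standard library's `urlsafe_b64encode` is used here as an equivalent to the two `replace` calls.

```python
import base64

def onedrive_direct_url(share_link):
    # URL-safe base64 swaps '+' for '-' and '/' for '_'; the '=' padding is stripped
    token = base64.urlsafe_b64encode(share_link.encode("utf-8")).decode("utf-8").rstrip("=")
    return f"https://api.onedrive.com/v1.0/shares/u!{token}/root/content"

# Hypothetical share link, used only to illustrate the encoding
example_link = "https://1drv.ms/u/s!EXAMPLE?e=abc123"
direct_url = onedrive_direct_url(example_link)
print(direct_url)
```

The resulting URL can be passed straight to `pd.read_csv`, as the next section does with the real links.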

## 1. Dataset preparation

In [4]:
# Load the training CSV
onedrive_link_train = "https://1drv.ms/u/s!AodhAFTTDqU02wuv2mkPBKGILXcR?e=CcroYu"
onedrive_directdownload_train = create_onedrive_directdownload(onedrive_link_train)

df_train = pd.read_csv(onedrive_directdownload_train)
# Drop leftover index columns (named 'Unnamed: ...')
df_train = df_train.loc[:, ~df_train.columns.str.contains('^Unnamed')]

# Load the test CSV
onedrive_link_test = "https://1drv.ms/u/s!AodhAFTTDqU02mV2CVCE3C20RUrs?e=Aynct8"
onedrive_directdownload_test = create_onedrive_directdownload(onedrive_link_test)

df_test = pd.read_csv(onedrive_directdownload_test)
df_test = df_test.loc[:, ~df_test.columns.str.contains('^Unnamed')]
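
The `^Unnamed` filter targets the index column that appears when a DataFrame is saved with its index and then read back. A minimal sketch with an in-memory CSV (the column names `a` and `b` are invented for the example):

```python
import io
import pandas as pd

# A CSV whose first header cell is empty, as produced by to_csv() with the index included
csv_text = ",a,b\n0,1,2\n1,3,4\n"
df = pd.read_csv(io.StringIO(csv_text))
print(list(df.columns))  # the empty header is read as 'Unnamed: 0'

# Keep only the columns whose name does not start with 'Unnamed'
df_clean = df.loc[:, ~df.columns.str.contains('^Unnamed')]
print(list(df_clean.columns))
```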

In [5]:
# Encode categorical variables as numeric codes in train and test
my_dataframes = {'df_train': df_train, 'df_test': df_test}

for df_name, df in my_dataframes.items():
    # Rurality
    df['cod_ruralidad'] = 0
    df.loc[df['ruralidad']=='Rural 1', 'cod_ruralidad'] = 1
    df.loc[df['ruralidad']=='Rural 2', 'cod_ruralidad'] = 2
    df.loc[df['ruralidad']=='Rural 3', 'cod_ruralidad'] = 3

    # VRAEM
    df['cod_vraem'] = 0
    df.loc[df['vraem']=='Vraem', 'cod_vraem'] = 1

    # Border
    df['cod_frontera'] = 0
    df.loc[df['frontera']=='Frontera', 'cod_frontera'] = 1

    # Bilingual
    df['cod_bilingue'] = 0
    df.loc[df['bilingue']=='Bilingue', 'cod_bilingue'] = 1

    # School type (single-teacher / multigrade)
    df['cod_caracteristica'] = 0
    df.loc[df['caracteristica']=='Unidocente', 'cod_caracteristica'] = 1
    df.loc[df['caracteristica']=='Multigrado', 'cod_caracteristica'] = 2

    # Education level
    df['cod_nivel'] = 0
    df.loc[df['d_niv_mod']=='Inicial - Jardin', 'cod_nivel'] = 1
    df.loc[df['d_niv_mod']=='Primaria', 'cod_nivel'] = 2
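
The same encoding can be written more compactly with a dictionary and `Series.map`. The sketch below runs on a small made-up DataFrame; unmapped categories (here the invented value `'Urbano'`) and missing values fall back to 0, matching the default used in the loop above.

```python
import pandas as pd

# Toy data; 'Urbano' is an invented placeholder for any category not listed in the mapping
toy = pd.DataFrame({'ruralidad': ['Rural 1', 'Rural 3', 'Urbano', None, 'Rural 2']})

ruralidad_codes = {'Rural 1': 1, 'Rural 2': 2, 'Rural 3': 3}

# map() yields NaN for unmapped or missing values, which fillna(0) turns into the default code
toy['cod_ruralidad'] = toy['ruralidad'].map(ruralidad_codes).fillna(0).astype(int)
print(toy)
```

Keeping one dictionary per variable puts every category-to-code pair in a single place, which may make the encoding easier to audit.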

In [6]:
# Drop the original categorical columns and other columns not used as features
df_train = df_train.drop(['nlat_ie', 'nlong_ie', 'ubigeo', 'codooii', 'd_niv_mod', 'ruralidad', 'vraem',
                          'frontera', 'bilingue', 'caracteristica', 'd_prov', 'd_dreugel', 'kfold'], axis=1)

df_test = df_test.drop(['nlat_ie', 'nlong_ie', 'ubigeo', 'codooii', 'd_niv_mod', 'ruralidad', 'vraem',
                        'frontera', 'bilingue', 'caracteristica', 'd_prov', 'd_dreugel'], axis=1)

## 2. Train and test specification - XGBoost

In [7]:
# Define the explanatory variables and the target for train and test
# The target here is secciones_necesarias
X_train = df_train.drop(['secciones_necesarias'], axis=1)
Y_train = df_train['secciones_necesarias']
x_test = df_test.drop(['secciones_necesarias'], axis=1)
y_test = df_test['secciones_necesarias']

# Store the datasets as XGBoost DMatrix objects
dtrain = xgb.DMatrix(X_train, label = Y_train)
dtest = xgb.DMatrix(x_test, label = y_test)

In [8]:
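# No objective is set, so XGBoost uses its default (regression with squared error)
params = {
    'max_depth': 3,  # maximum tree depth
    'eta': 0.3       # learning rate
}
epochs = 10  # number of boosting rounds

# Train the model
model = xgb.train(params, dtrain, epochs)
# Predict and round to the nearest integer
y_pred = np.round(model.predict(dtest))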

In [9]:
# Side-by-side comparison of actual and predicted values
comparativo = pd.DataFrame({'y_test': y_test.values, 'y_pred': y_pred})

In [10]:
accuracy_score(y_test, y_pred)

0.6746205259541508
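
Because no objective is set, the model is trained as a regression and its predictions are rounded afterwards, so the accuracy above counts exact matches only. As a hedged illustration with invented numbers (not the project's data), the sketch below reads exact-match accuracy next to the mean absolute error, which indicates how far off the misses are.

```python
import numpy as np

# Invented true section counts and raw regression outputs
y_true = np.array([1, 2, 2, 3, 4])
y_raw = np.array([1.2, 1.6, 2.4, 3.9, 4.1])

y_round = np.round(y_raw)                  # round to the nearest integer
accuracy = np.mean(y_round == y_true)      # share of exact matches
mae = np.mean(np.abs(y_round - y_true))    # average size of the error, in sections

print(accuracy, mae)
```

In this toy case four of five predictions match exactly, and the single miss is off by one section.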

## 3. Train and test specification - SVM

In [11]:
# Drop rows with missing values, which SVC cannot handle
df_train = df_train.dropna(how='any')
df_test = df_test.dropna(how='any')

X_train_SVM = df_train.drop(['secciones_necesarias'], axis=1)
Y_train_SVM = df_train['secciones_necesarias']
X_test_SVM = df_test.drop(['secciones_necesarias'], axis=1)
y_test_SVM = df_test['secciones_necesarias']



In [None]:
svclassifier = SVC(kernel='poly', degree=8)
svclassifier.fit(X_train_SVM, Y_train_SVM)

In [None]:
y_pred = svclassifier.predict(X_test_SVM)
# Evaluate the predictions against the true test labels
print(confusion_matrix(y_test_SVM, y_pred))
print(classification_report(y_test_SVM, y_pred))

In [None]:
accuracy_score(y_test_SVM, y_pred)
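
`StandardScaler` is imported at the top of the notebook but is not used in the cells above. SVMs are generally sensitive to feature scale, so one option is to standardize the features inside a pipeline before fitting. The sketch below is an assumption about how that could look, shown on synthetic data from `make_classification` with an RBF kernel rather than the degree-8 polynomial kernel used above; it is not the notebook's model or data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic three-class data standing in for the real features and target
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# The scaler is fitted on the training split only and reused on the test split
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
svm_pipeline.fit(X_tr, y_tr)

y_hat = svm_pipeline.predict(X_te)
print(accuracy_score(y_te, y_hat))
```

Wrapping the scaler and the classifier in one pipeline keeps the test data out of the scaling statistics, which would otherwise leak information into training.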