<h1><strong>Hyperparameter Optimization</strong></h1>
    
<h3>This notebook searches for the best hyperparameters for each algorithm. It then defines a function that returns a dictionary with the best hyperparameters found for each algorithm, which should save a lot of time in future experiments.</h3>

------------------------------------------------------------------------------

<h4>Three optimization methods are tried:</h4>
<ol> 
    <li> Grid Search </li>
    <li> Random Search </li>
    <li> Bayesian Optimization </li>
</ol>
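
As a rough illustration of how the first two methods differ, here is a minimal pure-Python sketch. The `score` function is a made-up stand-in for a cross-validated metric, and `random_search_toy` is a hypothetical name that is unrelated to the `random_search` helper imported later in this notebook. Grid search evaluates every combination, while random search evaluates a fixed number of sampled combinations. Bayesian optimization is not shown; it uses the results of earlier evaluations to choose the next candidate.

```python
import itertools
import random


def score(params):
    # Toy stand-in for a cross-validated score; it peaks at C=2, gamma=0.1.
    return -((params["C"] - 2) ** 2) - (params["gamma"] - 0.1) ** 2


grid = {"C": [0.1, 1, 2, 5, 10], "gamma": [0.01, 0.1, 1]}


def grid_search(grid):
    # Evaluate every combination in the Cartesian product of the grid.
    names = list(grid)
    candidates = [dict(zip(names, values))
                  for values in itertools.product(*grid.values())]
    return max(candidates, key=score), len(candidates)


def random_search_toy(grid, n_iter, seed=0):
    # Evaluate only n_iter combinations, each sampled independently.
    rng = random.Random(seed)
    candidates = [{name: rng.choice(values) for name, values in grid.items()}
                  for _ in range(n_iter)]
    return max(candidates, key=score), len(candidates)


best_grid, n_grid = grid_search(grid)
best_random, n_random = random_search_toy(grid, n_iter=5)
print(best_grid, n_grid)
print(best_random, n_random)
```

Grid search is guaranteed to find the best point on the grid, at the cost of evaluating all 15 combinations here; random search may miss it but caps the cost at `n_iter` evaluations.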

-------------------------------------------------------------------------------    
    
<br>First, the necessary imports:

In [1]:
import pandas as pd
import numpy as np

import nbimporter # pip install nbimporter
import sklearn

from sklearn.naive_bayes  import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from catboost import CatBoostClassifier
                        
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

from feature_builder import process_dataset, add_text_embeddings, calculate_keyword_encoding
from hyperparameter_tuning import random_search, GridSearch, bayesian_optimization #pip install hyperopt



In [2]:
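def obtener_hiperparametros():
    """Return the best hyperparameters found so far for each algorithm.

    An empty dict means no tuned values have been recorded for that algorithm.
    """
    return {
        'Catboost': {},
        'cnn': {},
        'GRU': {},
        'knn': {},
        'LightGBM': {},
        'Logistic-Regression': {},
        'lstm': {},
        'MNB': {},
        'nn': {},
        'SVM': {'kernel': 'rbf', 'gamma': 'scale', 'degree': 9, 'coef0': 5, 'C': 2},
        'XGBoost': {},
    }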
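
One way the dictionary could be consumed in later experiments is to unpack an entry into the estimator's constructor. The sketch below uses a trimmed copy of the dictionary for illustration only; the variable names `best` and `tuned_svc` are assumptions, not part of the original notebook.

```python
from sklearn import svm
from sklearn.naive_bayes import MultinomialNB


def obtener_hiperparametros():
    # Trimmed copy of the dictionary above, for illustration only.
    return {
        'MNB': {},
        'SVM': {'kernel': 'rbf', 'gamma': 'scale', 'degree': 9, 'coef0': 5, 'C': 2},
    }


best = obtener_hiperparametros()

# Unpack the stored values as keyword arguments.
tuned_svc = svm.SVC(**best['SVM'])

# An empty entry simply falls back to the estimator's defaults.
default_mnb = MultinomialNB(**best['MNB'])

print(tuned_svc.get_params()['kernel'], tuned_svc.get_params()['C'])
```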

In [3]:
train_dataset = pd.read_csv('train.csv')

In [4]:
test_dataset = pd.read_csv('test.csv')

In [5]:
#y = train_dataset.loc[:,'target']
y=train_dataset[['id','target']]

<h2><strong>Preparing the sets with different features</strong></h2>
Based on what was investigated, different algorithms need different feature sets.

<h3>First, the fully processed set using spaCy</h3>

In [6]:
x_processed = process_dataset(train_dataset, use_spacy=True)

Embeddings loaded!
Percentage of words covered in the embeddings = 0.4875485193423176
Embeddings loaded!
Percentage of words covered in the embeddings = 0.5959707770644233


<h3>Sets that only need embeddings</h3>

In [14]:
x_embedd = process_dataset(train_dataset, use_spacy=True, use_manual_features=False)

Percentage of words covered in the embeddings = 0.4937444933920705


<h3>Sets that use TF-IDF</h3>

In [19]:
x_tfidf = process_dataset(train_dataset, text_type='tfidf', use_manual_features=False)
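
The internals of `process_dataset` are not shown in this notebook, so the following is only a sketch of what TF-IDF features look like, using scikit-learn's `TfidfVectorizer` directly on a few made-up sentences. The names `texts` and `x_demo` are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus; the real input would be the text column of train.csv.
texts = [
    "forest fire near the town",
    "I love this song",
    "fire crews are evacuating the town",
]

vectorizer = TfidfVectorizer()
x_demo = vectorizer.fit_transform(texts)  # sparse matrix: one row per text

print(x_demo.shape)
print(sorted(vectorizer.vocabulary_)[:5])
```

Each row is a sparse vector with one column per vocabulary term. With the default tokenizer, single-character tokens such as "I" are dropped.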

<h3><strong>SVM</strong></h3>

In [6]:
SVC = svm.SVC()

In [7]:
params_svc={'kernel':('linear', 'rbf', 'sigmoid', 'poly'),
    'C':[0.1, 0.2, 0.5, 0.8, 1, 2, 5, 10, 25],
    'degree':np.arange(1,10,1),
    'coef0':[0.1, 0.5, 1, 2, 5, 10],
    'gamma':('auto','scale')}

<h5>Random Search</h5>

In [None]:
random_search(x_processed, y, SVC, params_svc)

Fitting 5 folds for each of 40 candidates, totalling 200 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
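
The `hyperparameter_tuning` module is not included here, so the following is only a guess at what a helper like `random_search` might do, written around scikit-learn's `RandomizedSearchCV`. The name `random_search_sketch`, the `scoring='f1'` choice, and the defaults (5 folds, 40 candidates, 4 workers, chosen to match the log above) are all assumptions. The toy data keeps the example quick to run.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV


def random_search_sketch(x, y, model, params, cv=5, n_jobs=4, n_iter=40):
    # Hypothetical stand-in for hyperparameter_tuning.random_search.
    search = RandomizedSearchCV(
        model,
        params,
        n_iter=n_iter,    # number of sampled candidates
        cv=cv,            # number of cross-validation folds
        n_jobs=n_jobs,    # parallel workers
        scoring='f1',     # assumed metric
        random_state=0,
    )
    search.fit(x, y)
    return search.best_params_, search.best_score_


# Small synthetic binary problem, unrelated to the notebook's dataset.
x_toy, y_toy = make_classification(n_samples=120, n_features=8, random_state=0)
toy_params = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}

best_params, best_score = random_search_sketch(
    x_toy, y_toy, svm.SVC(), toy_params, cv=3, n_jobs=1, n_iter=4
)
print(best_params, best_score)
```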


<h5>Grid Search</h5>

In [None]:
GridSearch(x_processed, y, SVC, params_svc)
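
To get a feel for the cost of an exhaustive search over `params_svc`, the number of candidates can be counted from the sizes of each list in the grid. The 5-fold figure is an assumption carried over from the random search log; the fold count used by `GridSearch` is not shown.

```python
import math

# Number of values per hyperparameter in params_svc.
params_svc_sizes = {
    'kernel': 4,   # linear, rbf, sigmoid, poly
    'C': 9,        # 0.1, 0.2, 0.5, 0.8, 1, 2, 5, 10, 25
    'degree': 9,   # np.arange(1, 10, 1)
    'coef0': 6,    # 0.1, 0.5, 1, 2, 5, 10
    'gamma': 2,    # auto, scale
}

n_candidates = math.prod(params_svc_sizes.values())
n_folds = 5  # assumed, matching the random search log

print(n_candidates)            # 3888
print(n_candidates * n_folds)  # 19440
```

Under that assumption, the full grid needs 19440 fits, against the 200 fits reported by the random search run.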



<h5>Bayesian Optimization</h5>

<h3><strong>Logistic-Regression</strong></h3>

In [65]:
logisticRegr = LogisticRegression()

In [None]:
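# Note: in scikit-learn, the 'l1' penalty is only supported by the 'liblinear'
# and 'saga' solvers; the other 'l1' combinations in this grid fail to fit.
params_lr = {
    'penalty': ['l1', 'l2'],
    'C': np.logspace(-4, 4, 20),
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}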
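
Not every penalty and solver pair in this grid is valid: scikit-learn's `LogisticRegression` accepts the `'l1'` penalty only with the `'liblinear'` and `'saga'` solvers. The sketch below counts how many candidates in the grid are affected; the variable names are illustrative.

```python
import itertools

penalties = ['l1', 'l2']
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
n_c_values = 20  # len(np.logspace(-4, 4, 20))

# Solvers that accept an L1 penalty in scikit-learn's LogisticRegression.
l1_capable = {'liblinear', 'saga'}

pairs = list(itertools.product(penalties, solvers))
valid = [(p, s) for p, s in pairs if p == 'l2' or s in l1_capable]
invalid = [pair for pair in pairs if pair not in valid]

print(len(pairs) * n_c_values)    # 200 candidates in total
print(len(invalid) * n_c_values)  # 60 of them cannot be fitted
print(invalid)
```

One option, if the search helpers accept it, is to pass a list of two dictionaries (one per penalty) so that only valid pairs are tried.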


<h5>Random Search</h5>

In [None]:
random_search(x_processed, y, logisticRegr, params_lr)

<h5>Grid Search</h5>

In [None]:
GridSearch(x_processed, y, logisticRegr, params_lr)

<h5>Bayesian Optimization</h5>

<h3><strong>Catboost</strong></h3>

In [7]:
catboost = CatBoostClassifier(thread_count=2,verbose=False)

In [8]:
params_cb = {'depth':np.arange(1,12,1),
          'iterations':[80, 100,256,465,678,1000],
          'learning_rate':[0.01,0.05,0.1,0.3], 
          'l2_leaf_reg':np.arange(2,10,1),
          'border_count':[0,5,10,50,100],
          'random_strength':[0,1,42]}

<h5>Random Search</h5>

In [9]:
random_search(x_processed, y['target'], catboost, params_cb,5,4)

0:	learn: 0.6173932	total: 108ms	remaining: 27.6s
1:	learn: 0.5728471	total: 143ms	remaining: 18.1s
2:	learn: 0.5418936	total: 171ms	remaining: 14.4s
3:	learn: 0.5189744	total: 203ms	remaining: 12.8s
4:	learn: 0.5026454	total: 232ms	remaining: 11.6s
5:	learn: 0.4895342	total: 258ms	remaining: 10.7s
6:	learn: 0.4764438	total: 285ms	remaining: 10.2s
7:	learn: 0.4670584	total: 316ms	remaining: 9.78s
8:	learn: 0.4564305	total: 343ms	remaining: 9.42s
9:	learn: 0.4483355	total: 370ms	remaining: 9.11s
10:	learn: 0.4416335	total: 396ms	remaining: 8.82s
11:	learn: 0.4351781	total: 424ms	remaining: 8.63s
12:	learn: 0.4289448	total: 450ms	remaining: 8.42s
13:	learn: 0.4231282	total: 479ms	remaining: 8.29s
14:	learn: 0.4179280	total: 510ms	remaining: 8.19s
15:	learn: 0.4125080	total: 539ms	remaining: 8.09s
16:	learn: 0.4070748	total: 570ms	remaining: 8.01s
17:	learn: 0.4023490	total: 607ms	remaining: 8.03s
18:	learn: 0.3974547	total: 638ms	remaining: 7.96s
19:	learn: 0.3932494	total: 670ms	remaini

165:	learn: 0.1279537	total: 4.74s	remaining: 2.57s
166:	learn: 0.1271581	total: 4.77s	remaining: 2.54s
167:	learn: 0.1262774	total: 4.8s	remaining: 2.51s
168:	learn: 0.1253948	total: 4.83s	remaining: 2.49s
169:	learn: 0.1245077	total: 4.86s	remaining: 2.46s
170:	learn: 0.1239833	total: 4.88s	remaining: 2.43s
171:	learn: 0.1232528	total: 4.91s	remaining: 2.4s
172:	learn: 0.1226141	total: 4.94s	remaining: 2.37s
173:	learn: 0.1218608	total: 4.96s	remaining: 2.34s
174:	learn: 0.1210155	total: 4.99s	remaining: 2.31s
175:	learn: 0.1202603	total: 5.02s	remaining: 2.28s
176:	learn: 0.1195066	total: 5.05s	remaining: 2.25s
177:	learn: 0.1187832	total: 5.08s	remaining: 2.23s
178:	learn: 0.1179224	total: 5.11s	remaining: 2.2s
179:	learn: 0.1170393	total: 5.14s	remaining: 2.17s
180:	learn: 0.1161761	total: 5.17s	remaining: 2.14s
181:	learn: 0.1154147	total: 5.2s	remaining: 2.11s
182:	learn: 0.1145984	total: 5.23s	remaining: 2.08s
183:	learn: 0.1138104	total: 5.26s	remaining: 2.06s
184:	learn: 0.11

<h5>Grid Search</h5>

In [None]:
GridSearch(x_processed, y, catboost, params_cb)

<h5>Bayesian Optimization</h5>

<h3><strong>MNB</strong></h3>

In [91]:
MultiNB = MultinomialNB()

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>

<h3><strong>CNN</strong></h3>

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>

<h3><strong>GRU</strong></h3>

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>

<h3><strong>KNN</strong></h3>

In [None]:
params_knn= [{'n_neighbors': np.arange(1,30)},
   {'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']},
   {'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski']}]
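
When a grid is written as a list of dictionaries, scikit-learn treats each dictionary as a separate grid, so parameters from different dictionaries are never combined. The sketch below uses `ParameterGrid` to compare the number of candidates for the list form against a single merged dictionary; `as_list` and `as_dict` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

as_list = [{'n_neighbors': np.arange(1, 30)},
           {'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute']},
           {'metric': ['euclidean', 'manhattan', 'chebyshev', 'minkowski']}]

# Merge the three dictionaries into one to search all combinations.
as_dict = {name: values for grid in as_list for name, values in grid.items()}

print(len(ParameterGrid(as_list)))  # 29 + 4 + 4 = 37
print(len(ParameterGrid(as_dict)))  # 29 * 4 * 4 = 464
```

The list form varies one parameter at a time while the others stay at the estimator's defaults. That is cheaper, but it cannot find interactions between parameters.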

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>

<h3><strong>LightGBM</strong></h3>

In [None]:
params_lgbm = {
    'objective': ['binary'],
    'learning_rate': [0.005, 0.01, 0.05, 0.1, 0.3],
    'n_estimators': np.arange(25, 200, 15),
    'num_leaves': np.arange(24, 45, 3),
    'feature_fraction': np.arange(0.1, 0.91, 0.2),
    'bagging_fraction': np.arange(0.8, 1.01, 0.1),
    'max_depth': np.arange(3, 12, 1),
    'lambda_l2': np.arange(0, 3),
    'min_split_gain': [0.001, 0.01, 0.1],
    # list + ndarray adds element-wise, so convert to a list before concatenating
    'min_child_weight': [1e-05] + list(np.arange(5, 11)),
}
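
One detail worth checking in a grid like this: `+` between a Python list and a NumPy array adds element-wise (with broadcasting) instead of concatenating. A `min_child_weight` entry meant to hold `1e-05` followed by the integers 5 to 10 therefore needs an explicit list conversion. A small demonstration, with illustrative variable names:

```python
import numpy as np

# The list is converted to an array and added to every element.
broadcast = [1e-05] + np.arange(5, 11)

# Converting to a list first gives the intended concatenation.
concatenated = [1e-05] + list(np.arange(5, 11))

print(len(broadcast), broadcast[0])        # 6 values, the first is about 5.00001
print(len(concatenated), concatenated[0])  # 7 values, the first is 1e-05
```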

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>

<h3><strong>LSTM</strong></h3>

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>

<h3><strong>NN</strong></h3>

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>

<h3><strong>XGBoost</strong></h3>

In [None]:
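# Note: 'reg:linear' is a regression objective, renamed 'reg:squarederror' in newer XGBoost releases.
params_xgb = [{'objective': ['binary:logistic','reg:linear'],'learning_rate':np.arange(0.1,0.5,0.1)},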
              {'n_estimators':np.arange(16,116,15)},
              {'scale_pos_weight':np.arange(2,6,1)},
              {'max_depth':np.arange(4,12,1),'min_child_weight':np.arange(1,10,1)},
              {'gamma':np.arange(0,0.5,0.1)},
              {'subsample':np.arange(0.6,1,0.1),'colsample_bytree':np.arange(0.6,0.91,0.05)},
              {'colsample_bylevel':np.arange(0.6,0.91,0.05)}]

<h5>Random Search</h5>

<h5>Grid Search</h5>

<h5>Bayesian Optimization</h5>