<h1><strong>Bagging Classifier</strong> </h1> 
    
<h3>This notebook tests the Bagging ensemble with each of the implemented algorithms.</h3>
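As a reminder of what the ensemble does, the following is a minimal sketch of bagging written with the standard library only: each base model is fitted on a bootstrap sample (rows drawn with replacement) and the ensemble combines their predictions by majority vote. `ThresholdStump`, `bootstrap_sample` and `bagging_predict` are hypothetical names used for illustration; scikit-learn's `BaggingClassifier` is more general than this.

```python
import random
from collections import Counter


def bootstrap_sample(xs, ys, rng):
    """Draw len(xs) examples with replacement."""
    idx = [rng.randrange(len(xs)) for _ in xs]
    return [xs[i] for i in idx], [ys[i] for i in idx]


class ThresholdStump:
    """Toy base model (hypothetical): predicts 1 above the midpoint of the class means."""

    def fit(self, xs, ys):
        ones = [x for x, y in zip(xs, ys) if y == 1]
        zeros = [x for x, y in zip(xs, ys) if y == 0]
        if ones and zeros:
            self.threshold = (sum(ones) / len(ones) + sum(zeros) / len(zeros)) / 2
        else:
            # A bootstrap sample can miss a class; fall back to the overall mean.
            self.threshold = sum(xs) / len(xs)
        return self

    def predict(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]


def bagging_predict(models, xs):
    """Majority vote over the predictions of all fitted base models."""
    votes = [m.predict(xs) for m in models]
    return [Counter(column).most_common(1)[0][0] for column in zip(*votes)]


rng = random.Random(0)
xs = [0.1, 0.4, 0.5, 0.9, 1.6, 2.0, 2.2, 2.7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

# An odd number of base models avoids ties in a binary vote.
models = []
for _ in range(11):
    boot_x, boot_y = bootstrap_sample(xs, ys, rng)
    models.append(ThresholdStump().fit(boot_x, boot_y))

ensemble_pred = bagging_predict(models, [0.2, 2.5])
print(ensemble_pred)
```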

First, the necessary imports.

In [1]:
import pandas as pd
import numpy as np

import sklearn

from sklearn.naive_bayes  import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn import svm
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.metrics import f1_score, accuracy_score


from feature_builder import process_dataset, add_text_embeddings, calculate_keyword_encoding
from hyperparameter_tuning import random_search


In [2]:
train_dataset = pd.read_csv('train.csv')

In [3]:
test_dataset = pd.read_csv('test.csv')

In [4]:
y = train_dataset.loc[:,'target']

In [2]:
# from sklearn.metrics import confusion_matrix
# print('Confusion Matrix : \n' + str(confusion_matrix(y_test,y_pred)))  # TODO: review

<h2>Preparing the sets with different features</h2>
According to our research, the different algorithms need different feature sets.

<h3>First, the fully processed sets with spaCy</h3>

In [5]:
x_processed = process_dataset(train_dataset, use_spacy=True)

Percentage of words covered in the embeddings = 0.4937444933920705
Percentage of words covered in the embeddings = 0.5961027457927369


In [6]:
x_train_processed, x_test_processed, y_train_processed, y_test_processed = train_test_split(x_processed, y, test_size = .30, random_state = 17)

<h3>Sets for the models that only need embeddings</h3>

In [58]:
x_embedd = train_dataset.copy()
add_text_embeddings(x_embedd, x_embedd['text'], 'embeddings')
x_embedd.drop(['text', 'location', 'keyword', 'id', 'target'], axis=1, inplace=True)

Percentage of words covered in the embeddings = 0.4937444933920705


In [59]:
x_train_embedd, x_test_embedd, y_train_embedd, y_test_embedd = train_test_split(x_embedd, y, test_size = .30, random_state = 17)
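The processed and embedding sets are split with the same `test_size` and `random_state`, so both splits should select the same rows, which keeps their validation scores comparable. The sketch below illustrates this on toy arrays (the arrays and the `row_ids` helper are assumptions for illustration, not the notebook's data). It also shows that a different `test_size`, such as the 0.33 used for the TF-IDF set, produces a different partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 20
row_ids = np.arange(n_rows)  # stands in for the target, so we can track which rows land where

# Two feature sets with the same number of rows but different widths.
features_a = np.arange(n_rows * 2).reshape(n_rows, 2)
features_b = np.arange(n_rows * 3).reshape(n_rows, 3)

_, _, _, ids_test_a = train_test_split(features_a, row_ids, test_size=0.30, random_state=17)
_, _, _, ids_test_b = train_test_split(features_b, row_ids, test_size=0.30, random_state=17)
_, _, _, ids_test_c = train_test_split(features_a, row_ids, test_size=0.33, random_state=17)

# Same size and seed: identical validation rows, regardless of the features.
same_rows = np.array_equal(ids_test_a, ids_test_b)
print(same_rows)

# Different test_size: a different number of validation rows.
print(len(ids_test_a), len(ids_test_c))
```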

<h3>Sets for the models that use TF-IDF</h3>

In [89]:
v = TfidfVectorizer()
x_tfidf = v.fit_transform(train_dataset['text'])


In [90]:
x_train_tfidf, x_test_tfidf, y_train_tfidf, y_test_tfidf = train_test_split(x_tfidf, y, test_size = .33, random_state = 17)
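Because the vectorizer above is fitted on the whole training text before the split, its vocabulary and IDF weights also see the validation rows, which can make the validation score slightly optimistic. One possible alternative, sketched here on toy sentences (the texts and labels are invented for illustration), is to split the raw text first and let a pipeline fit the vectorizer on the training part only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus (assumption): 1 = disaster-related, 0 = not.
texts = [
    "forest fire near the town", "flood warning issued tonight",
    "earthquake shakes the city", "storm destroys several homes",
    "wildfire evacuation ordered", "hurricane hits the coast",
    "i love this sunny day", "great movie last night",
    "my new phone is on fire sale", "lunch with friends was fun",
    "this song is a total blast", "watching football at home",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Split the raw text first, so the validation rows never reach the vectorizer's fit.
texts_train, texts_test, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.33, random_state=17, stratify=labels
)

# The pipeline fits TF-IDF on texts_train only, then reuses it to transform texts_test.
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(texts_train, y_tr)
toy_pred = pipeline.predict(texts_test)
print(list(toy_pred))
```

The same pipeline object can be passed to `BaggingClassifier` as the base model, since it behaves like any other scikit-learn estimator.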

<h1>Logistic Regression</h1>

In [65]:
logisticRegr = LogisticRegression(solver='liblinear', penalty='l1', multi_class='auto', max_iter=1000, C=1)

In [67]:
BC_LR = BaggingClassifier(base_estimator= logisticRegr, n_estimators=10, random_state=0)
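The `base_estimator` keyword used in this notebook was renamed to `estimator` in scikit-learn 1.2 and removed in 1.4, so these cells may fail on a recent installation. A minimal sketch of a version-tolerant constructor follows. `make_bagging` is a hypothetical helper, and the data comes from `make_classification` rather than from the notebook's features.

```python
import inspect

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression


def make_bagging(base_model, **kwargs):
    """Build a BaggingClassifier with whichever keyword this scikit-learn version accepts."""
    params = inspect.signature(BaggingClassifier.__init__).parameters
    key = "estimator" if "estimator" in params else "base_estimator"
    return BaggingClassifier(**{key: base_model}, **kwargs)


# Synthetic data (assumption) standing in for x_train_processed / y_train_processed.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

base_lr = LogisticRegression(solver="liblinear", penalty="l1", C=1, max_iter=1000)
bagged_lr = make_bagging(base_lr, n_estimators=10, random_state=0)
bagged_lr.fit(X_toy, y_toy)

# One fitted copy of the base model per bootstrap sample.
n_fitted = len(bagged_lr.estimators_)
print(n_fitted)
```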

In [68]:
BC_LR.fit(x_train_processed, y_train_processed)

BaggingClassifier(base_estimator=LogisticRegression(C=1, class_weight=None,
                                                    dual=False,
                                                    fit_intercept=True,
                                                    intercept_scaling=1,
                                                    l1_ratio=None,
                                                    max_iter=1000,
                                                    multi_class='auto',
                                                    n_jobs=None, penalty='l1',
                                                    random_state=None,
                                                    solver='liblinear',
                                                    tol=0.0001, verbose=0,
                                                    warm_start=False),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None

In [69]:
y_predict_LR = BC_LR.predict(x_test_processed)

In [70]:
f1_score(y_test_processed, y_predict_LR)

0.7899073120494334
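For reference, `f1_score` with its default settings reports the F1 of the positive class, that is, the harmonic mean of precision and recall. The sketch below recomputes it by hand on toy labels (the label lists and the `f1_binary` helper are assumptions for illustration).

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 of the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true_toy = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_toy = [1, 0, 1, 0, 1, 0, 1, 0]

# 3 true positives, 1 false positive, 1 false negative:
# precision = recall = 0.75, so F1 = 0.75.
toy_f1 = f1_binary(y_true_toy, y_pred_toy)
print(toy_f1)
```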

<h1>SVM</h1>

In [75]:
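# Note: with the default RBF kernel, `degree` (poly only) and `coef0` (poly/sigmoid only) are ignored.
SVM = svm.SVC(degree=10, coef0=10, C=5)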

In [76]:
BC_SVM = BaggingClassifier(base_estimator= SVM, n_estimators=10, random_state=0)

In [77]:
BC_SVM.fit(x_train_embedd, y_train_embedd)

BaggingClassifier(base_estimator=SVC(C=5, break_ties=False, cache_size=200,
                                     class_weight=None, coef0=10,
                                     decision_function_shape='ovr', degree=10,
                                     gamma='scale', kernel='rbf', max_iter=-1,
                                     probability=False, random_state=None,
                                     shrinking=True, tol=0.001, verbose=False),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=0, verbose=0, warm_start=False)

In [78]:
y_pred_svm = BC_SVM.predict(x_test_embedd)

In [79]:
f1_score(y_test_embedd, y_pred_svm)

0.7789699570815452
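The fitted model above reports `kernel='rbf'`, and with that kernel the `degree` and `coef0` arguments have no effect. The sketch below checks this on synthetic data (an assumption, not the notebook's embeddings): two RBF models that differ only in those two arguments should produce the same decision values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic data (assumption) standing in for the embedding features.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)

# Same kernel and C; only degree and coef0 differ.
svc_with_extras = SVC(kernel="rbf", C=5, degree=10, coef0=10).fit(X_toy, y_toy)
svc_plain = SVC(kernel="rbf", C=5).fit(X_toy, y_toy)

same_decision = np.allclose(
    svc_with_extras.decision_function(X_toy),
    svc_plain.decision_function(X_toy),
)
print(same_decision)
```

If the intent was a polynomial kernel, `kernel='poly'` would need to be passed explicitly, which would make those two arguments take effect.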

<h1>CatBoost</h1>

In [80]:
catboost = CatBoostClassifier(verbose=False)

In [81]:
BC_CB = BaggingClassifier(base_estimator= catboost, n_estimators=10, random_state=0)

In [82]:
BC_CB.fit(x_train_processed, y_train_processed)

BaggingClassifier(base_estimator=<catboost.core.CatBoostClassifier object at 0x7f023b63a310>,
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=0, verbose=0, warm_start=False)

In [83]:
y_pred_processed = BC_CB.predict(x_test_processed)

In [84]:
f1_score(y_test_processed, y_pred_processed)

0.781897491821156

<h1>MNB</h1>
    

In [91]:
MultiNB = MultinomialNB()

In [92]:
BC_MNB = BaggingClassifier(base_estimator= MultiNB, n_estimators=10, random_state=0)

In [93]:
BC_MNB.fit(x_train_tfidf, y_train_tfidf)

BaggingClassifier(base_estimator=MultinomialNB(alpha=1.0, class_prior=None,
                                               fit_prior=True),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=1.0, n_estimators=10, n_jobs=None,
                  oob_score=False, random_state=0, verbose=0, warm_start=False)

In [94]:
y_pred_MNB = BC_MNB.predict(x_test_tfidf)

In [95]:
f1_score(y_test_tfidf, y_pred_MNB)

0.7233351380617218

<h1>CNN</h1>

<h1>KNN</h1>

<h1>LSTM</h1>

<h1>XGBoost</h1>

In [7]:
xgbooster = XGBClassifier(max_depth=3, n_estimators=600, colsample_bytree=0.9,
                        subsample=0.9, nthread=4, learning_rate=0.05)


BC_XGB = BaggingClassifier(base_estimator= xgbooster, n_estimators=10, random_state=0)

In [8]:
BC_XGB.fit(x_train_processed, y_train_processed)

BaggingClassifier(base_estimator=XGBClassifier(base_score=None, booster=None,
                                               colsample_bylevel=None,
                                               colsample_bynode=None,
                                               colsample_bytree=0.9, gamma=None,
                                               gpu_id=None,
                                               importance_type='gain',
                                               interaction_constraints=None,
                                               learning_rate=0.05,
                                               max_delta_step=None, max_depth=3,
                                               min_child_weight=None,
                                               missing=nan,
                                               monotone_constraints=None,
                                               n_estimators=600, n_jobs=None,
                                               nthread=4,
    

In [9]:
y_pred_processed = BC_XGB.predict(x_test_processed)

In [10]:
f1_score(y_test_processed, y_pred_processed)

0.7902964959568733

<h1>LightGBM</h1>

In [11]:
gbm = LGBMClassifier()

BC_LGB = BaggingClassifier(base_estimator= gbm, n_estimators=10, random_state=0)
BC_LGB.fit(x_train_processed, y_train_processed)

BaggingClassifier(base_estimator=LGBMClassifier(), random_state=0)

In [12]:
y_pred_processed = BC_LGB.predict(x_test_processed)

In [13]:
f1_score(y_test_processed, y_pred_processed)

0.7810972297664313
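To compare the ensembles, the F1 scores printed in this notebook can be collected and ranked. The values below are copied from the outputs above; the dictionary itself is only an illustrative way of organizing them. Note that MNB was evaluated on a different split (`test_size = .33`), so its score is not strictly comparable with the rest.

```python
# F1 scores as printed by the cells above (one validation split each).
f1_by_model = {
    "Logistic Regression": 0.7899073120494334,
    "SVM": 0.7789699570815452,
    "CatBoost": 0.781897491821156,
    "MNB": 0.7233351380617218,
    "XGBoost": 0.7902964959568733,
    "LightGBM": 0.7810972297664313,
}

# Sort from best to worst F1.
ranking = sorted(f1_by_model.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name:<20} {score:.4f}")

best_name, best_score = ranking[0]
gap_to_second = best_score - ranking[1][1]
```

The top two entries differ by less than 0.001 on a single split, so their ordering may well change with a different `random_state`.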

<h2>Exporting the CSV with the BC trained on the full training set and the best algorithm</h2>

In [28]:
x_test_proccesed = process_dataset(test_dataset, use_spacy=True)

Percentage of words covered in the embeddings = 0.5707598689343111
Percentage of words covered in the embeddings = 0.665389037945573


In [None]:
# Refit the bagged logistic regression on the full training set before predicting on the test set.
BC_LR.fit(x_processed, y)

In [None]:
# `BC` was never defined; use the ensemble that was just refitted on the full training set.
y_pred = BC_LR.predict(x_test_proccesed)

In [None]:
ids = test_dataset['id']
final_df = pd.DataFrame({'target': y_pred}, index=ids)

In [None]:
final_df.to_csv('BC-CB-processed-spacy.csv')