# **EVA** TEXT CLASSIFIER

![img_ppal](https://www.innovacion-tecnologia.com/wp-content/uploads/2020/09/Historia-de-los-humanoides.jpg)

## *CONTENTS:* 
---
0. INTRODUCTION
1. **MACHINE LEARNING**
    - 1.1. Data preparation and cleaning
    - 1.2. Feature Engineering
    - 1.3. Modeling
2. RESULTS
    - 2.1. Visualization and reporting of the results
    - 2.2. Building a pipeline for the automated workflow
---

# ***MACHINE LEARNING: NLP model training***

---

### Libraries

In [1]:
from utils.libreries import *
from utils.functions import *

### Define constants

In [2]:
REL_PATH = os.getcwd()
TRAIN_PATH = REL_PATH + '/data/raw/train.csv'
TEST_PATH = REL_PATH + '/data/raw/test.csv'
TRAIN_PROCESSED_PATH = REL_PATH + '/data/processed/train_processed.csv'

### Datasets

In [3]:
df_train = pd.read_csv(TRAIN_PATH)
df_train.head()

Unnamed: 0,train_idx,text,label,label_text
0,0,i really do recommend this to anyone in need o...,1,positive
1,1,very good every day camera fits nicely in the ...,1,positive
2,2,"but , dollar for dollar , this dvd player is p...",1,positive
3,3,i got this phone yesterday and didn ' t find a...,1,positive
4,4,1 ) price gb of storage,1,positive


In [4]:
df_train.head(20)

Unnamed: 0,train_idx,text,label,label_text
0,0,i really do recommend this to anyone in need o...,1,positive
1,1,very good every day camera fits nicely in the ...,1,positive
2,2,"but , dollar for dollar , this dvd player is p...",1,positive
3,3,i got this phone yesterday and didn ' t find a...,1,positive
4,4,1 ) price gb of storage,1,positive
5,5,one cabinet shop has been using one regularly ...,1,positive
6,6,i will say that the os that the phone runs doe...,0,negative
7,7,this model appears to be especially good,1,positive
8,8,i find that it is stable in my hands and its '...,1,positive
9,9,"the catch is that , while it plays movies just...",0,negative


In [5]:
ds_test = pd.read_csv(TEST_PATH)
ds_test.head()

Unnamed: 0,test_idx,text
0,0,fm receiver it has none
1,1,"the picture quality surprised me , when i firs..."
2,2,great video clip quality for a digital camera ...
3,3,creative did well on its rechargeable battery ...
4,4,i highly recommend this camera to anyone looki...


### Model baseline:


In [6]:
df_train['text'] = df_train['text'].apply(preprocess_text)
df_train['text']

0                 really recommend anyone need new player
1       good every day camera fit nicely pocket jean t...
2                  dollar dollar dvd player probably best
3                    got phone yesterday find problem yet
4                                      1 price gb storage
                              ...                        
3011    itunes find good window medium player computer...
3012          played feature yet camera easy use get used
3013    application lot application work well eventual...
3014                                    battery non issue
3015     fm tuner 5g storage removable disk great feature
Name: text, Length: 3016, dtype: object
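
`preprocess_text` lives in `utils.functions` and is not shown in this notebook. Judging from the output above, it appears to lowercase the text, strip punctuation, remove stop words and lemmatize (for example, `fits` becomes `fit`). The sketch below is a hypothetical stand-in that covers only the first three steps with the standard library; the name `simple_preprocess` and the tiny `STOP_WORDS` set are assumptions, not the project's code.

```python
import re

# Hypothetical, deliberately tiny stop-word list; the real function
# presumably relies on a full list from an NLP library.
STOP_WORDS = {"i", "do", "this", "to", "in", "of", "a", "it", "has", "the", "and", "is"}

def simple_preprocess(text: str) -> str:
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in STOP_WORDS)

print(simple_preprocess("fm receiver it has none"))
```

Lemmatization is left out on purpose, so plural and inflected forms would survive unchanged in this version.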

In [7]:
plt.figure(figsize=(10,8))
plt.style.use('ggplot')

<Figure size 1000x800 with 0 Axes>

In [1]:
from transformers import AutoTokenizer
from transformers import TFAutoModelForSequenceClassification
from scipy.special import softmax
import tensorflow as tf

In [9]:
MODEL = "cardiffnlp/twitter-roberta-base-sentiment"
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [10]:
model = TFAutoModelForSequenceClassification.from_pretrained(MODEL)
model.save_pretrained(MODEL)

text = "Good night 😊"
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
scores = output[0][0].numpy()
scores = softmax(scores)
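
The cell above stops at the softmax scores. The model card for `cardiffnlp/twitter-roberta-base-sentiment` lists the output indices as 0 = negative, 1 = neutral, 2 = positive, so the scores can be turned into a ranked list of labels. A minimal sketch with the standard library follows; the values in `example_logits` are made up for illustration and are not output from the model.

```python
import math

LABELS = ["negative", "neutral", "positive"]  # index order from the model card

def softmax_scores(logits):
    """Numerically stable softmax over a list of raw logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def rank_labels(logits, labels=LABELS):
    """Return (label, probability) pairs sorted from most to least likely."""
    probs = softmax_scores(logits)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)

example_logits = [-2.0, 0.5, 3.0]  # illustrative values only
for label, prob in rank_labels(example_logits):
    print(f"{label}: {prob:.4f}")
```

Because the dataset here is binary (positive / negative), the neutral class would have to be mapped to one of the two labels or dropped; that choice is left open.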

### Models with modified parameters:

In [26]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df_train['text'])
y = df_train['label'].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 77)
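
With its default settings, `TfidfVectorizer` weights each term by its count in the document times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, and then scales every row to unit L2 length. The sketch below reproduces that weighting with the standard library on two toy sentences. `tfidf_rows` is a hypothetical helper that splits on whitespace instead of using scikit-learn's tokenizer, and is meant only to show what the numbers in `X` represent.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Smoothed, L2-normalised TF-IDF per document, as dicts term -> weight."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

docs = ["battery non issue", "good camera good battery"]
for row in tfidf_rows(docs):
    print({t: round(w, 3) for t, w in sorted(row.items())})
```

A term that appears in every document (`battery` here) gets the lowest weight in its row. Note also that the vectorizer in the cell above is fitted on all of `df_train` before the split, so its IDF statistics include the held-out rows; fitting on the training part only would give a stricter evaluation.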

In [None]:
models_names = list(choose_models('all', params=None).keys())
# mods = models_names
mods = models_names[0]  # run only the first model
print(mods)

# Scorers passed to models_generator.
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score),
           'Precision': make_scorer(precision_score), 'F1': make_scorer(f1_score)}
models_generator(X, y, choose_models(mods, params=None), choose_params(mods),
                 file_name='metrics.csv', dir_file='model/model_metrics',
                 dir_model_file='model', scaling=False, scoring=scoring)

In [None]:
models_names = list(choose_models('all', params=None).keys())
# mods = models_names
mods = models_names[3:5]  # run only the fourth and fifth models
print(mods)

# Scorers passed to models_generator.
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score),
           'Precision': make_scorer(precision_score), 'F1': make_scorer(f1_score)}
for mod in mods:
    models_generator(X, y, choose_models(mod, params=None), choose_params(mod),
                     file_name='metrics.csv', dir_file='model/model_metrics',
                     dir_model_file='model', scaling=False, scoring=scoring)

In [14]:
x=pd.read_csv(os.getcwd()+'/model/model_metrics/metrics.csv',sep=';')
x.sort_values(by='F1',ascending = False)

Unnamed: 0,model,params_tried,best_params,ACC,Precision,Recall,F1,ROC,Jaccard,model_path
0,LogisticRegression,"{'penalty': ['l2'], 'class_weight': ['none', '...","{'class_weight': 'none', 'max_iter': 50, 'pena...",0.77649,0.784067,0.921182,0.847112,0.70049,0.734774,model/LogisticRegression.pkl
4,XGBClassifier,"{'nthread': [4], 'objective': ['binary:logisti...","{'colsample_bytree': 1.0, 'learning_rate': 0.0...",0.764901,0.76087,0.948276,0.844298,0.668582,0.73055,model/XGBClassifier.pkl
5,SVC,"[{'C': [1, 10, 100, 1000], 'kernel': ['linear'...","{'C': 1, 'class_weight': 'balanced', 'kernel':...",0.788079,0.854592,0.825123,0.839599,0.768622,0.723542,model/SVC.pkl
1,KNeighborsClassifier,"{'n_neighbors': [3, 5, 7, 9, 11, 13, 15], 'wei...","{'algorithm': 'ball_tree', 'leaf_size': 20, 'n...",0.740066,0.741748,0.940887,0.829533,0.634585,0.70872,model/KNeighborsClassifier.pkl
6,LogisticRegression,"{'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, ...","{'C': 0.01, 'max_iter': 100, 'penalty': 'l1', ...",0.672185,0.672185,1.0,0.80396,0.5,0.672185,model/LogisticRegression_1.pkl
7,LogisticRegression,"{'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, ...","{'C': 0.01, 'max_iter': 100, 'penalty': 'l1', ...",0.672185,0.672185,1.0,0.80396,0.5,0.672185,model/LogisticRegression_2.pkl
3,ExtraTreeClassifier,"{'criterion': ['gini', 'entropy'], 'max_depth'...","{'class_weight': None, 'criterion': 'gini', 'm...",0.665563,0.674658,0.970443,0.79596,0.505424,0.661074,model/ExtraTreeClassifier.pkl
2,DecisionTreeClassifier,"{'criterion': ['log_loss', 'gini', 'entropy'],...","{'class_weight': 'balanced', 'criterion': 'log...",0.652318,0.75,0.724138,0.736842,0.614594,0.583333,model/DecisionTreeClassifier.pkl
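
The columns of this table are linked: F1 is the harmonic mean of precision and recall, and for a binary problem the Jaccard score of the positive class equals F1 / (2 - F1). The sketch below recomputes both from the precision and recall reported in the first row (LogisticRegression) as a sanity check; the helper names are illustrative. It also shows that rows 6 and 7, with recall 1.0 and ROC 0.5, match what a model predicting "positive" for every sample would score.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def jaccard_from_f1(f1: float) -> float:
    """Binary Jaccard index expressed through F1."""
    return f1 / (2 - f1)

# Values copied from the LogisticRegression row of the table.
f1 = f1_from(0.784067, 0.921182)
print(round(f1, 6), round(jaccard_from_f1(f1), 6))

# Always predicting "positive": recall is 1 and precision is the positive share.
print(round(f1_from(0.672185, 1.0), 5))
```

Since an always-positive classifier already reaches an F1 of about 0.80 on this split, F1 alone flatters weak models here; the ROC and accuracy columns help separate them.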


### Result with the chosen model:

In [16]:
df_test = ds_test.copy()
df_test['text'] = ds_test['text'].apply(preprocess_text)


In [19]:
df_test.head()

Unnamed: 0,test_idx,text
0,0,fm receiver none
1,1,picture quality surprised first saw camera saw...
2,2,great video clip quality digital camera much b...
3,3,creative well rechargeable battery feature
4,4,highly recommend camera anyone looking good di...


In [18]:
ds_test.head()

Unnamed: 0,test_idx,text
0,0,fm receiver it has none
1,1,"the picture quality surprised me , when i firs..."
2,2,great video clip quality for a digital camera ...
3,3,creative did well on its rechargeable battery ...
4,4,i highly recommend this camera to anyone looki...


In [22]:
model_path = 'model/LogisticRegression.pkl'
# Load the pickled model; the context manager closes the file handle.
with open(model_path, 'rb') as f:
    model = pickle.load(f)

In [23]:
model

In [27]:
test_X = vectorizer.transform(df_test['text'])
df_test['prediction']=model.predict(test_X) 

In [32]:
df_test.head()

Unnamed: 0,test_idx,text,prediction
0,0,fm receiver none,0
1,1,picture quality surprised first saw camera saw...,1
2,2,great video clip quality digital camera much b...,1
3,3,creative well rechargeable battery feature,1
4,4,highly recommend camera anyone looking good di...,1


In [41]:
prediction_json={'target':{str(df_test['test_idx'][i]):int(df_test['prediction'][i]) for i in range(len(df_test))}}

In [42]:
prediction_json

{'target': {'0': 0,
  '1': 1,
  '2': 1,
  '3': 1,
  '4': 1,
  '5': 0,
  '6': 1,
  '7': 1,
  '8': 1,
  '9': 1,
  '10': 0,
  '11': 0,
  '12': 1,
  '13': 1,
  '14': 1,
  '15': 1,
  '16': 0,
  '17': 0,
  '18': 1,
  '19': 0,
  '20': 1,
  '21': 1,
  '22': 1,
  '23': 1,
  '24': 1,
  '25': 1,
  '26': 1,
  '27': 1,
  '28': 1,
  '29': 1,
  '30': 0,
  '31': 1,
  '32': 1,
  '33': 1,
  '34': 0,
  '35': 1,
  '36': 1,
  '37': 1,
  '38': 1,
  '39': 0,
  '40': 0,
  '41': 0,
  '42': 0,
  '43': 0,
  '44': 1,
  '45': 1,
  '46': 1,
  '47': 1,
  '48': 0,
  '49': 1,
  '50': 0,
  '51': 0,
  '52': 1,
  '53': 1,
  '54': 0,
  '55': 0,
  '56': 1,
  '57': 0,
  '58': 1,
  '59': 0,
  '60': 0,
  '61': 1,
  '62': 0,
  '63': 1,
  '64': 0,
  '65': 1,
  '66': 1,
  '67': 1,
  '68': 1,
  '69': 1,
  '70': 1,
  '71': 1,
  '72': 1,
  '73': 0,
  '74': 1,
  '75': 1,
  '76': 1,
  '77': 0,
  '78': 1,
  '79': 1,
  '80': 1,
  '81': 0,
  '82': 1,
  '83': 0,
  '84': 1,
  '85': 1,
  '86': 1,
  '87': 1,
  '88': 1,
  '89': 1,
  '90': 1,

In [43]:
type(prediction_json)

dict

In [44]:
with open("data/processed/prediction.json", "w") as f:
    json.dump(prediction_json, f)
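
The submission file is a single JSON object whose `target` key maps each `test_idx` (as a string) to an integer prediction. Below is a minimal sketch of the same structure built from plain lists, with a round trip through `json` to confirm that the file would load back unchanged. `build_submission` is a hypothetical helper, and `ids` and `preds` are toy values standing in for the `test_idx` and `prediction` columns.

```python
import json

def build_submission(ids, preds):
    """Map each id (as a string key) to its prediction cast to a plain int."""
    return {"target": {str(i): int(p) for i, p in zip(ids, preds)}}

ids = [0, 1, 2, 3, 4]
preds = [0, 1, 1, 1, 1]  # first five predictions shown above
submission = build_submission(ids, preds)

# JSON keys are always strings, so the dict survives a dump/load round trip.
assert json.loads(json.dumps(submission)) == submission
print(submission)
```

The `int(...)` cast matters with pandas data: NumPy integer types are not serializable by the standard `json` module.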

## OBSERVATIONS:
- F1-score obtained when tested on the website = 0.70

