## Description

This is a short-text classifier for offer types in retail product descriptions. Here we read and clean the data, train and test the model, and save it to a `.joblib` file to be used later or deployed to AI Platform on Google Cloud.

In [None]:
import pandas as pd
import numpy as np
import joblib  # sklearn.externals.joblib was removed from scikit-learn; use the standalone package
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

__References and tutorials to improve the model__


https://hackernoon.com/chars2vec-character-based-language-model-for-handling-real-world-texts-with-spelling-errors-and-a3e4053a147d

https://towardsdatascience.com/introduction-to-word-embedding-and-word2vec-652d0c2060fa

https://github.com/madelonhulsebos/sherlock

Read the data and remove the barcode IDs

In [None]:
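# Load the training spreadsheet and keep an untouched copy of the raw data
xl = pd.ExcelFile("Entrenamiento IA Ofertas.xlsx")
data = xl.parse("Sheet1")
copy = data.copy()
# Drop rows labelled REGULAR; .copy() avoids a SettingWithCopyWarning on the assignment below
data = data[data["TIPO OFERTA"] != "REGULAR"].copy()
# Encode the offer types as integer labels
le = LabelEncoder()
y = le.fit_transform(data["TIPO OFERTA"])
# Remove barcodes (runs of six or more digits); regex=True is required in pandas >= 2.0
data["ITEM"] = data["ITEM"].str.replace(r'\d{6,}', '', regex=True)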
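
To illustrate the barcode cleanup in isolation, here is a minimal sketch on a few made-up item descriptions (the strings are hypothetical, not taken from the spreadsheet). It also collapses the leftover whitespace, which is an extra step the notebook itself does not perform.

```python
import pandas as pd

# Hypothetical item descriptions; the real ones come from the Excel sheet.
items = pd.Series([
    "LECHE ENTERA 1L 7501234567890",
    "2X1 GALLETAS 12345",
    "PAN 123456 INTEGRAL",
])

# Remove runs of six or more digits (assumed to be barcodes),
# then collapse repeated spaces and trim the ends (extra step, not in the notebook).
cleaned = (
    items.str.replace(r"\d{6,}", "", regex=True)
         .str.replace(r"\s+", " ", regex=True)
         .str.strip()
)
print(cleaned.tolist())
```

Shorter digit runs such as quantities or prices are left alone, since only sequences of at least six digits match the pattern.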

In [None]:
data["TIPO OFERTA"].value_counts(normalize=True)

Split the data into train and test sets

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['ITEM'], y,
                                                    random_state=0, test_size=0.4)

This is an important stage. Because product descriptions are short texts, we should use character n-grams (`char_wb`)
instead of the word-level features typically used for long documents.

In [None]:
vect = CountVectorizer(ngram_range=(1,6), analyzer="char_wb").fit(X_train)
X_train_vectorized = vect.transform(X_train)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
len(vect.get_feature_names_out())
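
To see what `char_wb` actually extracts, the following sketch fits a vectorizer on a single made-up description using only 2-grams, so the vocabulary stays small enough to read. The input string and the `demo_vect` name are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical short description.
docs = ["2x1 pan"]

# char_wb builds character n-grams inside each word, padding every word
# with one space on each side so that word boundaries become features too.
demo_vect = CountVectorizer(ngram_range=(2, 2), analyzer="char_wb").fit(docs)
ngrams = sorted(demo_vect.vocabulary_)
print(ngrams)
```

The n-grams never span two words, which keeps the feature space smaller than plain `char` analysis while still tolerating abbreviations and spelling variations inside a word.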

This is a classification problem and the dataset is likely to be imbalanced, so accuracy alone can be misleading. We should report
micro- and macro-averaged scores with `f1_score`.

In [None]:
# multi_class="auto" is already the default and the parameter is deprecated in recent scikit-learn
model = LogisticRegression(solver='liblinear')
model.fit(X_train_vectorized, y_train)
predictions = model.predict(vect.transform(X_test))
print('weighted: ', f1_score(y_test, predictions, average="weighted"))
print('micro: ', f1_score(y_test, predictions, average="micro"))
print('macro: ', f1_score(y_test, predictions, average="macro"))
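
As a rough illustration of why the averaging mode matters on imbalanced data, the sketch below scores a classifier that always predicts the majority class. The labels are made up for the example and have nothing to do with the real dataset.

```python
from sklearn.metrics import f1_score

# Hypothetical imbalanced labels: class 0 dominates, class 1 is rare.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A degenerate classifier that always predicts the majority class.
y_pred = [0] * 10

# Micro averaging pools all decisions, so the majority class dominates the score.
micro = f1_score(y_true, y_pred, average="micro", zero_division=0)
# Macro averaging gives each class equal weight, so the missed class drags it down.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print("micro:", round(micro, 3))
print("macro:", round(macro, 3))
```

A large gap between the micro and macro scores is a hint that the model is doing poorly on the less frequent offer types.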

In [None]:
model.score(vect.transform(X_test), y_test)

In [None]:
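# Print every misclassified test item with its true and predicted label
for i, (true_label, predicted_label) in enumerate(zip(y_test, predictions)):
    if true_label != predicted_label:
        print("ITEM --> " + X_test.iloc[i])
        print("TRUE--> " + str(le.inverse_transform([true_label])), " MODEL--> " + str(le.inverse_transform([predicted_label])))
        print("")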

## Export model

In [None]:
# Persist the three fitted objects needed at prediction time
joblib.dump(vect, "vectorizer.joblib")
joblib.dump(le, "encoder.joblib")
joblib.dump(model, "model.joblib")
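
To use the exported files later, all three objects have to be loaded and applied in the same order as during training: vectorize, predict, then decode the label. The sketch below shows that round trip on a tiny made-up dataset written to a temporary directory; the texts, labels, and `*_demo` / `*_loaded` names are assumptions for illustration, not the notebook's real data.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data standing in for the real offers.
texts = ["2x1 galletas", "2x1 leche", "50% descuento pan", "50% descuento queso"]
labels = ["2X1", "2X1", "DESCUENTO", "DESCUENTO"]

le_demo = LabelEncoder()
y_demo = le_demo.fit_transform(labels)
vect_demo = CountVectorizer(ngram_range=(1, 6), analyzer="char_wb").fit(texts)
model_demo = LogisticRegression(solver="liblinear").fit(vect_demo.transform(texts), y_demo)

new_items = ["2x1 yogurt"]

with tempfile.TemporaryDirectory() as tmp:
    # Save the three fitted objects, mirroring the export cell.
    joblib.dump(vect_demo, os.path.join(tmp, "vectorizer.joblib"))
    joblib.dump(le_demo, os.path.join(tmp, "encoder.joblib"))
    joblib.dump(model_demo, os.path.join(tmp, "model.joblib"))

    # Load them back, as a separate prediction script would.
    vect_loaded = joblib.load(os.path.join(tmp, "vectorizer.joblib"))
    le_loaded = joblib.load(os.path.join(tmp, "encoder.joblib"))
    model_loaded = joblib.load(os.path.join(tmp, "model.joblib"))

# Vectorize, predict, and map the integer prediction back to the offer name.
predicted = le_loaded.inverse_transform(model_loaded.predict(vect_loaded.transform(new_items)))
print(predicted)
```

Files written by `joblib.dump` are pickles, so they should be loaded with a scikit-learn version compatible with the one used for training, and only from trusted sources.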