# Development of Intelligent Computing Systems (2021, IME-USP)
- course [page][3]
- taught by: MSc [Renato Cordeiro Ferreira][1]
- student: [Rodrigo Didier Anderson][2]

[1]: https://www.linkedin.com/in/renatocf/
[2]: https://www.linkedin.com/in/didier11/
[3]: https://www.ime.usp.br/verao/index.php

This is the first part of the course project, in which we create the training pipeline for a categorization model.

More specifically, the goal is to train a model that receives product data and returns the best categories for each product.


- More details about this stage of the project [here][4].
- More info about the data can be found [here][5].

[4]: https://github.com/didier-rda/intelligent-systems-project/blob/main/training/README.md
[5]: https://github.com/didier-rda/intelligent-systems-project/blob/main/data/README.md

## Training Pipeline  
(less than 5 minutes)

The training pipeline consists of the following steps.

For each step, a class was created with the methods needed to carry out that stage of the pipeline.

1. **Data extraction** <br>
   Loads a dataset with product data from a specified path available in the
   environment variable `DATASET_PATH`.
   
   class: `dataExtractor`



2. **Data formatting** <br>
   Processes the dataset to use it for training and validation.
   
   class: `dataFormatter`



3. **Data Modeling & Model Exportation** <br>
   - Specifies a model to handle the categorization problem;
   - Exports a candidate model to a specified path available in the environment
     variable `MODEL_PATH`;
   
   class: `dataModeler`



4. **Model validation** <br>
   Generates metrics about the model accuracy (precision, recall, F1, etc.)
   for each category and exports them to a specified path available in the
   environment variable `METRICS_PATH`.

   class: `modelValidator`
   



To schedule the pipeline, a final class, `dataPipeline`, was created.

Its purpose is to run the pipeline recursively by calling the other classes when necessary.
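
A minimal sketch of how such an orchestrating class could chain the stages, assuming each stage is a plain callable that receives the previous stage's output and that the stages run once, in order. The `Stage` alias, the `run_pipeline` function, and the toy stages are hypothetical; they are not the notebook's `dataPipeline` class.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical alias: a stage is a name plus a function of the previous output.
Stage = Tuple[str, Callable[[Any], Any]]


def run_pipeline(stages: List[Stage], initial: Any = None) -> Tuple[Any, Dict[str, float]]:
    """Run the stages in order, feeding each one the previous result.

    Returns the final result and the elapsed seconds per stage.
    """
    timings: Dict[str, float] = {}
    result = initial
    for name, stage in stages:
        start = time.perf_counter()
        result = stage(result)
        timings[name] = time.perf_counter() - start
    return result, timings


# Toy stand-ins for extraction, formatting and modeling.
toy_stages = [
    ("extract", lambda _: ["Cartão de Visita", "Mandala Espírito Santo"]),
    ("format", lambda rows: [row.lower() for row in rows]),
    ("model", lambda rows: {"n_examples": len(rows)}),
]
final_result, stage_timings = run_pipeline(toy_stages)
```

If the "(less than 5 minutes)" note is read as a run-time budget, the per-stage timings give a simple way to check where the time goes.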

# Import libs

In [8]:
# sys and data processing libs
import sys
import os
import re
import time
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split

# nlp data corpus and libs
import nltk
nltk.download('punkt')
nltk.download('stopwords')

from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize

from string import punctuation

from nltk.probability import FreqDist

from scipy import sparse as sp_sparse

# ML modules from sklearn
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score 
from sklearn.metrics import average_precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report

from pycaret.nlp import *

[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


# 1. Data Extraction

In [2]:
class dataExtractor:

    def __init__(self):

        self.data_path = os.getenv('DATASET_PATH')
        self.data = self.extractData()

        
    def extractData(self):
        '''
        Loads a dataset with product data from a specified path.
        '''
       
        return pd.read_csv(self.data_path)
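
`os.getenv` returns `None` when `DATASET_PATH` is unset, and `pd.read_csv` then fails with a less obvious error. A minimal sketch of an earlier, more explicit check, assuming a hypothetical helper named `resolve_path_from_env` that is not part of the notebook:

```python
import os
from pathlib import Path


def resolve_path_from_env(var_name: str, must_exist: bool = True) -> Path:
    """Read a path from an environment variable and fail early if unusable."""
    value = os.getenv(var_name)
    if not value:
        raise EnvironmentError(f"Environment variable {var_name} is not set")
    path = Path(value)
    if must_exist and not path.is_file():
        raise FileNotFoundError(f"{var_name} points to a missing file: {path}")
    return path
```

`dataExtractor.__init__` could call `resolve_path_from_env('DATASET_PATH')` before reading the CSV; `must_exist=False` would suit output locations such as `MODEL_PATH` and `METRICS_PATH`.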

In [19]:
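# Load the raw dataset and keep only the columns used for training.
df = dataExtractor().data
df = df[['query', 'title', 'category']].copy()  # copy avoids SettingWithCopyWarning
# Combine the search query and the product title into a single text field.
df['query_title'] = df['query'] + ' ' + df['title']
df = df[['query_title', 'category']]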
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 38000 entries, 0 to 37999
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   query_title  38000 non-null  object
 1   category     38000 non-null  object
dtypes: object(2)
memory usage: 593.9+ KB


In [20]:
df.head()

|   | query_title | category |
|---|---|---|
| 0 | espirito santo Mandala Espírito Santo | Decoração |
| 1 | cartao de visita Cartão de Visita | Papel e Cia |
| 2 | expositor de esmaltes Organizador expositor p/... | Outros |
| 3 | medidas lencol para berco americano Jogo de Le... | Bebê |
| 4 | adesivo box banheiro ADESIVO BOX DE BANHEIRO | Decoração |
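
The sample rows mix accented and unaccented spellings of the same words (`cartao de visita` next to `Cartão de Visita`) as well as different casing, so some normalization before vectorizing may help. A minimal sketch, assuming a hypothetical `normalize_text` helper built only on the standard library; it is not the notebook's `dataFormatter`.

```python
import re
import unicodedata
from string import punctuation

# Translation table that maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans({ch: " " for ch in punctuation})


def normalize_text(text: str) -> str:
    """Lowercase, strip accents, drop punctuation and collapse whitespace."""
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    no_punct = no_accents.translate(_PUNCT_TO_SPACE)
    return re.sub(r"\s+", " ", no_punct).strip()
```

Applied to the dataframe, this would look like `df['query_title'].map(normalize_text)`. Whether stripping accents helps or hurts the classifier is something to validate on the data.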


# 2. Data Formatting

In [21]:
nlp = setup(data=df, target='category', session_id=1)

INFO:logs:PyCaret NLP Module
INFO:logs:version 2.2.2
INFO:logs:Initializing setup()
INFO:logs:USI: a44f
INFO:logs:setup(data=(38000, 2), target=category, custom_stopwords=None, html=True, session_id=1, log_experiment=False,
                    experiment_name=None, log_plots=False, log_data=False, verbose=True)
INFO:logs:Checking environment
INFO:logs:python_version: 3.9.1
INFO:logs:python_build: ('default', 'Feb  9 2021 07:55:26')
INFO:logs:machine: x86_64
INFO:logs:platform: Linux-5.10.93-87.444.amzn2.x86_64-x86_64-with-glibc2.28
INFO:logs:Checking libraries
INFO:logs:pd==1.2.0
INFO:logs:numpy==1.19.4
INFO:logs:gensim==4.1.2
INFO:logs:spacy==3.2.1
INFO:logs:nltk==3.5
INFO:logs:textblob==0.17.1
INFO:logs:pyLDAvis==3.2.2
INFO:logs:wordcloud==1.8.1
INFO:logs:mlflow==1.23.1
INFO:logs:Checking Exceptions
ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.



Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/pycaret/nlp.py", line 313, in setup
    sp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
  File "/usr/local/lib/python3.9/site-packages/spacy/__init__.py", line 51, in load
    return util.load_model(
  File "/usr/local/lib/python3.9/site-packages/spacy/util.py", line 427, in load_model
    raise IOError(Errors.E050.format(name=name))
OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/IPython/core/interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-21-0c45c4b72c50>", line 1, in <module>
    nlp = setup(data=df, target='category', session_id=1)
  File "/usr/local/lib/python3.9/site-packages/pycaret/nlp

TypeError: object of type 'NoneType' has no len()
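
The first traceback shows `setup` failing because spaCy cannot find the `en_core_web_sm` model. spaCy models installed through pip or `python -m spacy download` are regular Python packages, so their presence can be checked before calling `setup`. A minimal sketch, assuming a hypothetical `missing_packages` helper:

```python
from importlib.util import find_spec
from typing import Iterable, List


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level package names that cannot be imported."""
    return [name for name in names if find_spec(name) is None]


# en_core_web_sm is the model named in the traceback above.
missing = missing_packages(["en_core_web_sm"])
if missing:
    print("Install before running setup():", ", ".join(missing))
```

The model can be installed with `python -m spacy download en_core_web_sm`. Two things may be worth checking afterwards: `en_core_web_sm` is an English model while the product texts are in Portuguese, and, as far as the PyCaret documentation describes it, the `target` argument of `pycaret.nlp.setup` names the text column, which here would be `query_title` rather than `category`.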

# 3. Data Modeling & Model Exportation
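
This section is still empty in the notebook. For the export step only, a minimal sketch, assuming the trained model is any picklable object (for instance the `OneVsRestClassifier` imported above) and using hypothetical `export_model` and `load_model` helpers:

```python
import os
import pickle
from pathlib import Path
from typing import Any, Optional


def export_model(model: Any, path: Optional[str] = None) -> Path:
    """Pickle the model to `path`, defaulting to the MODEL_PATH variable."""
    target = path or os.getenv("MODEL_PATH")
    if not target:
        raise EnvironmentError("MODEL_PATH is not set and no path was given")
    target_path = Path(target)
    # Create the parent directory so a fresh checkout does not fail here.
    target_path.parent.mkdir(parents=True, exist_ok=True)
    with target_path.open("wb") as handle:
        pickle.dump(model, handle)
    return target_path


def load_model(path: str) -> Any:
    """Load a model previously written by export_model."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.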

# 4. Model Validation
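
This section is also empty in the notebook. A minimal sketch of per-category precision, recall and F1, computed in plain Python and written as JSON. The `per_category_metrics` and `export_metrics` names and the JSON layout are assumptions; `sklearn.metrics.classification_report`, already imported above, is the more usual route.

```python
import json
import os
from collections import Counter
from typing import Dict, Optional, Sequence


def per_category_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, dict]:
    """Compute precision, recall, F1 and support for each category."""
    true_pos, pred_count, support = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        support[truth] += 1
        pred_count[pred] += 1
        if truth == pred:
            true_pos[truth] += 1
    metrics = {}
    for label in sorted(set(support) | set(pred_count)):
        tp = true_pos[label]
        precision = tp / pred_count[label] if pred_count[label] else 0.0
        recall = tp / support[label] if support[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall,
                          "f1": f1, "support": support[label]}
    return metrics


def export_metrics(metrics: Dict[str, dict], path: Optional[str] = None) -> str:
    """Write the metrics as JSON to `path`, defaulting to METRICS_PATH."""
    target = path or os.getenv("METRICS_PATH")
    if not target:
        raise EnvironmentError("METRICS_PATH is not set and no path was given")
    with open(target, "w", encoding="utf-8") as handle:
        json.dump(metrics, handle, ensure_ascii=False, indent=2)
    return target
```

Categories that are never predicted get a precision of 0.0 here instead of raising a division error, which is a design choice and should be kept in mind when reading the exported file.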

# 5. DS Pipeline

In [2]:
from pycaret.datasets import get_data
dataset = get_data('iris')

|   | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [3]:
dataset.species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)