# Development of Intelligent Computing Systems _ 2022  IME-USP
- course [page][4]
- taught by: MSc [Renato Cordeiro Ferreira][1]
- students: [Pedro Almeida][3] and [Rodrigo Didier Anderson][2]

[1]: https://www.linkedin.com/in/renatocf/
[2]: https://www.linkedin.com/in/didier11/
[3]: https://www.linkedin.com/in/plbalmeida/
[4]: https://www.ime.usp.br/verao/index.php

This is the first part of the course project, in which we create the training pipeline for a categorization model.

More specifically, the goal is to train a model that should receive data related to products and return the best categories for them.


- More details about this stage of the project [here][5].
- More info about the data can be found [here][6].

[5]: https://github.com/didier-rda/intelligent-systems-project/blob/main/training/README.md
[6]: https://github.com/didier-rda/intelligent-systems-project/blob/main/data/README.md

## Training Pipeline  

The training pipeline consists of the steps below.

For each step, a class was created with the methods needed to fulfill that stage of the pipeline.

1. **Data extraction** <br>
   class: `DataExtractor`
   
   Loads a dataset with product data from a specified path available in the
   environment variable `DATASET_PATH`.
   
   
   
2. **Data formatting** <br>
   class: `DataFormatter`
   
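   Splits the data into training (70%) and test (30%) sets, then converts the "query" feature into token counts with the `CountVectorizer` class from scikit-learn. The vectorizer is fitted on the training set only and then applied to the test set.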
      


3. **Modeling** <br>
   class: `Modeler`
   
   A Naive Bayes classifier was chosen because it is a common baseline for NLP classification problems. The `MultinomialNB` class from scikit-learn was used, which works well with the integer count features generated by `CountVectorizer`. `MultinomialNB` also handles multiclass classification, which is the problem posed by this dataset.
   


4. **Model validation** <br>
   class: `ModelValidator`
   
   Generates model performance metrics (precision, recall, F1, etc.)
   for each category and exports them to a specified path available in the
   environment variable `METRICS_PATH`.
   
   
   
5. **Model exportation** <br>
   class: `ModelExporter`
   
   Exports a candidate model to a specified path available in the environment variable `MODEL_PATH`.
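
The steps above read their paths from three environment variables. A minimal sketch of a pre-flight check is shown below; the helper name `missing_env_vars` and the placeholder paths are assumptions made for illustration and are not part of the project code.

```python
import os

# The three variable names come from the pipeline description above.
REQUIRED_VARS = ("DATASET_PATH", "METRICS_PATH", "MODEL_PATH")


def missing_env_vars(environ=os.environ, required=REQUIRED_VARS):
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not environ.get(name)]


# A plain dict stands in for the real environment here; the paths are
# placeholders, not the project's actual file locations.
example_env = {"DATASET_PATH": "data/sample.csv", "METRICS_PATH": "metrics.txt"}
print(missing_env_vars(example_env))  # ['MODEL_PATH']
```

Calling `missing_env_vars()` with no arguments checks the real process environment, which may be useful before starting a long training run.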

# Import libs

In [1]:
import os
import pandas as pd
import pickle
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# 1. Data Extraction

In [2]:
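class DataExtractor:
    """Loads the product dataset from the CSV file at DATASET_PATH."""

    def __init__(self):
        # os.getenv returns None when the variable is unset
        self.path = os.getenv('DATASET_PATH')
        self.data = self.get_data()

    def get_data(self):
        return pd.read_csv(self.path)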
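
To try the extraction step without the real dataset, the sketch below writes a tiny CSV to a temporary directory and reads it back through `DATASET_PATH`. Only the column names `query` and `category` come from the pipeline code; the row values and the file name are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows: the values are made up, only the column
# names match what the pipeline code reads.
sample = pd.DataFrame({
    "query": ["wooden baby crib", "ceramic coffee mug"],
    "category": ["Baby", "Decoration"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "sample.csv")
    sample.to_csv(csv_path, index=False)

    # Same lookup the extraction step performs: the path comes from the environment.
    os.environ["DATASET_PATH"] = csv_path
    loaded = pd.read_csv(os.environ["DATASET_PATH"])

print(loaded.shape)          # (2, 2)
print(list(loaded.columns))  # ['query', 'category']
```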

# 2. Data Formatting

In [3]:
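class DataFormatter:
    """Splits the data and turns the "query" text into token counts."""

    def __init__(self):
        self.data = DataExtractor().data

    def data_split(self):
        # 70/30 split; the fixed random_state makes repeated calls return the same split
        y = self.data['category']
        X_train, X_test, y_train, y_test = train_test_split(self.data['query'], y, test_size=0.3, random_state=123)
        return X_train, X_test, y_train, y_test

    def count_vectorizer(self):
        # Split once, fit the vocabulary on the training queries only,
        # then reuse that vocabulary for the test queries
        X_train, X_test, _, _ = self.data_split()
        count_vectorizer = CountVectorizer()
        count_train = count_vectorizer.fit_transform(X_train)
        count_test = count_vectorizer.transform(X_test)
        return count_train, count_test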
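
The effect of fitting the vectorizer on the training queries and only transforming the test queries can be seen on a toy example. The queries below are invented for illustration; words that never appear in the training set (here "green" and "scarf") are simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up queries standing in for the "query" column
train_queries = ["red baby shoes", "blue baby hat"]
test_queries = ["red hat", "green scarf"]

vectorizer = CountVectorizer()
count_train = vectorizer.fit_transform(train_queries)  # learns the vocabulary
count_test = vectorizer.transform(test_queries)        # reuses it unchanged

# Columns follow the alphabetically sorted vocabulary
print(sorted(vectorizer.vocabulary_))  # ['baby', 'blue', 'hat', 'red', 'shoes']
print(count_test.toarray())
# [[0 0 1 1 0]
#  [0 0 0 0 0]]
```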

# 3. Modeling

In [4]:
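class Modeler:
    """Trains a Multinomial Naive Bayes classifier on the training counts."""

    def classifier(self):
        # Reuse one DataFormatter so the CSV is read only once
        formatter = DataFormatter()
        nb_classifier = MultinomialNB()
        nb_classifier.fit(formatter.count_vectorizer()[0], formatter.data_split()[2])
        return nb_classifier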
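
As a rough illustration of the modeling step, the sketch below trains `MultinomialNB` on count features for three categories. The queries and category names are invented; they are not taken from the project dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data with three categories (a multiclass problem)
queries = ["baby crib", "baby stroller", "coffee mug", "tea mug", "dog leash", "dog bowl"]
labels = ["Baby", "Baby", "Kitchen", "Kitchen", "Pets", "Pets"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(queries)

model = MultinomialNB()
model.fit(features, labels)

# New text must go through the same fitted vectorizer before prediction
prediction = model.predict(vectorizer.transform(["dog bowl"]))
print(list(model.classes_))  # ['Baby', 'Kitchen', 'Pets']
print(prediction[0])         # Pets
```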

# 4. Model Validation

In [5]:
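class ModelValidator:
    """Evaluates the classifier on the test set and writes a report to METRICS_PATH."""

    def __init__(self):
        self.model = Modeler().classifier()
        # One formatter instance, reused by prediction() and metrics()
        self.formatter = DataFormatter()

    def prediction(self):
        y_pred = self.model.predict(self.formatter.count_vectorizer()[1])
        return y_pred

    def metrics(self):
        y_test = self.formatter.data_split()[3]
        metrics_report = classification_report(y_test.values, self.prediction())
        # The context manager closes the file even if a write fails
        with open(os.getenv('METRICS_PATH'), 'w') as f:
            f.write('Test set metrics:\n')
            f.write(f'\n{metrics_report}\n')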
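
To show what the per-category metrics look like, the sketch below runs `classification_report` on a handful of invented labels. The `output_dict=True` variant is not used by the pipeline; it is shown here only because it makes individual values easy to inspect.

```python
from sklearn.metrics import classification_report

# Made-up true and predicted categories for four test items
y_true = ["Baby", "Baby", "Pets", "Pets"]
y_pred = ["Baby", "Pets", "Pets", "Pets"]

# Text form, as written to the metrics file by the validation step
text_report = classification_report(y_true, y_pred)
print(text_report)

# Dictionary form, convenient for reading single values
report = classification_report(y_true, y_pred, output_dict=True)
print(report["Baby"]["precision"])  # 1.0 (the one "Baby" prediction is correct)
print(report["Baby"]["recall"])     # 0.5 (one of two "Baby" items was found)
print(report["accuracy"])           # 0.75
```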

# 5. Model Exportation

In [6]:
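class ModelExporter:
    """Trains a model and pickles it to MODEL_PATH."""

    def __init__(self):
        self.model = Modeler().classifier()
        # The context manager closes the file handle after writing
        with open(os.getenv('MODEL_PATH'), 'wb') as f:
            pickle.dump(self.model, f)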
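
The exported classifier expects count features, so whoever loads it also needs the fitted vectorizer. One possible approach, sketched below, is to bundle both in a scikit-learn `Pipeline` and pickle that single object. This is an alternative to the exporter above, not what the project does; the training data and the file name are invented for illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data
queries = ["baby crib", "baby stroller", "dog leash", "dog bowl"]
labels = ["Baby", "Baby", "Pets", "Pets"]

# Vectorizer and classifier travel together as one object
pipeline = make_pipeline(CountVectorizer(), MultinomialNB())
pipeline.fit(queries, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # stand-in for MODEL_PATH
    with open(model_path, "wb") as f:
        pickle.dump(pipeline, f)
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored object accepts raw text directly
print(restored.predict(["dog bowl"]))  # ['Pets']
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.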

In [7]:
# pipeline execution
if __name__ == '__main__':
    ModelValidator().metrics()
    ModelExporter()