Let's imagine that we work in the data science area of a company and have been asked to build a tool that automatically analyzes customer feedback. In other words, we must find out whether an algorithm can learn to predict how customers feel about the company's product, so that it can support the marketing team.
To build a prototype that meets this need, we will use the IMDB dataset, which contains reviews labeled as positive or negative.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd

df = pd.read_csv('/content/drive/MyDrive/IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


After importing the dataset, we can see that, besides an unnamed index column, it has two columns: one with the review text and one with the sentiment assigned to that review.
Since our task is to prototype a product that supports the analysis of customer feedback, and the dataset already pairs each text with a label, we can conclude that such a tool can be developed through supervised learning.

In supervised learning, the computer learns from examples that have already been labeled. Note, for example, that the dataset used in our prototype contains both the written comment and whether it was positive or negative.

So far, we have deduced that supervised learning can help the marketing team with its analyses. Now it is time to test candidate machine learning algorithms for the task. To keep this notebook brief, I took the liberty of testing only one: extremely randomized trees (scikit-learn's ExtraTreesClassifier), a close relative of the random forest.

At this stage, we can conduct the training with the following steps:

        A) Split the dataset into training/test
        B) Prepare the training dataset
        C) Fit the machine learning algorithm (with the training dataset)
        D) Evaluate the model on the test dataset
        E) Save the model for later deployment
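The five steps above can be sketched end to end on a tiny toy dataset. This is only an illustration under stated assumptions: the texts, labels, and variable names below are hypothetical stand-ins for the IMDB data, and the hyperparameters (n_estimators, random_state) were chosen just to keep the run small and repeatable.

```python
import pickle

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the review/sentiment columns.
texts = ["great movie", "loved it", "wonderful acting", "really enjoyable",
         "terrible movie", "hated it", "awful acting", "really boring"] * 5
labels = (["positive"] * 4 + ["negative"] * 4) * 5

# A) Split the dataset into training/test (25% held out for testing).
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, random_state=0, stratify=labels
)

# B) Prepare the training data: learn the vocabulary and build TF-IDF features.
vectorizer = TfidfVectorizer()
X_tr_vec = vectorizer.fit_transform(X_tr)

# C) Fit the algorithm on the training data only.
clf = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X_tr_vec, y_tr)

# D) Evaluate on the test data, reusing the already fitted vectorizer.
toy_accuracy = accuracy_score(y_te, clf.predict(vectorizer.transform(X_te)))
print("toy accuracy:", toy_accuracy)

# E) Serialize both fitted objects so they can be deployed together.
payload = pickle.dumps({"vectorizer": vectorizer, "model": clf})
```

The rest of the notebook follows the same order on the real data, one step per cell.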


In [None]:
# In this cell, we split the original dataset into two parts: a training set and a test set

from sklearn.model_selection import train_test_split

X = df.review
y = df.sentiment

X_train, X_test, y_train, y_test = train_test_split(X, y)

In [None]:
print('X_train -> ', X_train.shape)  # print the training set's dimensions
print('X_test -> ', X_test.shape)  # same for the test set; note that we'll use 12500 observations to test our model

X_train ->  (37500,)
X_test ->  (12500,)
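The 37500/12500 split follows from train_test_split's default behavior: when neither test_size nor train_size is given, 25% of the rows are held out. The sketch below reproduces those sizes on placeholder data. The random_state and stratify arguments are additions for illustration, not something the notebook uses; they make the split repeatable and keep the class balance identical in both parts.

```python
from sklearn.model_selection import train_test_split

# Placeholder data: 50,000 row ids with perfectly balanced 0/1 labels.
rows = list(range(50_000))
toy_labels = [i % 2 for i in rows]

# No test_size given, so the default of 25% is held out for testing.
train_rows, test_rows, train_lab, test_lab = train_test_split(
    rows, toy_labels, random_state=42, stratify=toy_labels
)

print(len(train_rows), len(test_rows))  # 37500 12500
```

Fixing random_state would also make the scores reported later in the notebook reproducible from run to run.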


In [None]:
# In this cell, we import our model (extremely randomized trees), the cross-validation function, and the text feature extractor used for training

from sklearn.ensemble import ExtraTreesClassifier  # import our model
from sklearn.model_selection import cross_val_score  # import the cross-validation function
from sklearn.feature_extraction.text import TfidfVectorizer  # import the vectorizer that turns text into numeric features


Tf = TfidfVectorizer()  # the vectorizer that turns our text data into numeric features


X_traintf = Tf.fit_transform(X_train)  # fit on the training data and transform it into numbers; the same fitted vectorizer will transform the test data as well
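To make the fit/transform distinction concrete, here is a minimal sketch on two made-up sentences (the documents and the toy_tf name are illustrative assumptions). fit_transform learns the vocabulary from the training documents; transform then maps new text onto that same vocabulary, so words never seen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, used only to show the mechanics.
train_docs = ["a wonderful little production", "a boring little film"]
new_docs = ["a wonderful film with unseen words"]

toy_tf = TfidfVectorizer()
train_matrix = toy_tf.fit_transform(train_docs)  # learns the vocabulary and the IDF weights
new_matrix = toy_tf.transform(new_docs)          # reuses them; nothing is re-learned

# With default settings, one-letter tokens such as "a" are dropped.
vocab = sorted(toy_tf.vocabulary_)
print(vocab)  # ['boring', 'film', 'little', 'production', 'wonderful']
print(train_matrix.shape, new_matrix.shape)  # (2, 5) (1, 5)
```

Both matrices have the same number of columns, which is what lets a model trained on one make predictions on the other.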




In [None]:
# In this cell we evaluate our model with cross_val_score, which gives us an estimate of the model's accuracy.
# Afterwards, we will fit the model on the training set.

model_score = cross_val_score(ExtraTreesClassifier(), X_traintf, y_train, cv=2)

In [None]:
model_score  # show the cross-validation score for each fold

array([0.85386667, 0.85488   ])

At this point, we estimate that our model's accuracy will be around 85%. Our next task is to predict on the test set and check how the model actually performs.
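The 85% figure is the average of the two fold scores. A quick way to summarize them is the mean plus the spread between folds, as in the sketch below, which copies the two numbers from the output above (the variable names are illustrative).

```python
import numpy as np

# Fold scores copied from the cross_val_score output above.
fold_scores = np.array([0.85386667, 0.85488])

mean_score = fold_scores.mean()
spread = fold_scores.std()

print(f"estimated accuracy: {mean_score:.4f} +/- {spread:.4f}")
# estimated accuracy: 0.8544 +/- 0.0005
```

Keep in mind that with cv=2 each fold's model is trained on only half of the training set, so this is an estimate for a model that saw less data than the final one.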

In [None]:
# Here we fit our model on the training dataset
model = ExtraTreesClassifier().fit(X_traintf, y_train)

In [None]:
# At this point, we use the fitted Tf vectorizer to transform the test set into a numeric matrix (using the vocabulary learned from the training set) and then predict
preds = model.predict(Tf.transform(X_test))

In [None]:
#with this cell, we can now check our model's accuracy rate on the test data set
from sklearn.metrics import classification_report, confusion_matrix

In [None]:

print(classification_report(y_test, preds))

print('\n')

print(confusion_matrix(y_test, preds))

              precision    recall  f1-score   support

    negative       0.85      0.86      0.86      6233
    positive       0.86      0.85      0.85      6267

    accuracy                           0.86     12500
   macro avg       0.86      0.86      0.86     12500
weighted avg       0.86      0.86      0.86     12500



[[5380  853]
 [ 957 5310]]


From the classification report, we can see that our model's accuracy was about 86%, a value very close to the one estimated with cross_val_score. That is pretty good!
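The numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below copies the matrix from the output above and assumes scikit-learn's usual layout: rows are true labels, columns are predicted labels, both ordered as negative, positive.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows: true negative/positive. Columns: predicted negative/positive.
cm = np.array([[5380, 853],
               [957, 5310]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / everything predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / everything truly in that class

print(f"accuracy: {accuracy:.4f}")  # accuracy: 0.8552
print("precision:", precision.round(2))
print("recall:", recall.round(2))
```

The unrounded accuracy is 10690 / 12500 = 0.8552, which the report displays as 0.86.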

At this point, our prototype model is trained and evaluated. Now suppose we want to deploy it. We can use the pickle module to save the fitted objects to files, so that they can be loaded later without retraining.

In [None]:
#import the pickle module
import pickle

In [None]:
# save the fitted TF-IDF vectorizer to disk
filename = '/content/drive/MyDrive/Tf.pkl'
with open(filename, 'wb') as f:  # the with block closes the file after writing
    pickle.dump(Tf, f)

In [None]:
# save the trained model to disk
filename = '/content/drive/MyDrive/model.pkl'
with open(filename, 'wb') as f:  # the with block closes the file after writing
    pickle.dump(model, f)

It's done!
With the vectorizer and the model saved as pickle files, we can create a small application in Django or Flask that serves our model's predictions to users.
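Whatever framework is used, the serving code has to load both files and apply them in the same order as during training: vectorize first, then predict. The sketch below shows that round trip. To stay runnable on its own, it trains a tiny stand-in model and writes it to a temporary folder instead of Google Drive; the helper names load_artifacts and predict_sentiment are hypothetical, not part of any library.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for the saved Tf.pkl and model.pkl, trained on made-up data.
texts = ["great movie", "loved it", "terrible movie", "hated it"]
labels = ["positive", "positive", "negative", "negative"]
vec = TfidfVectorizer().fit(texts)
clf = ExtraTreesClassifier(n_estimators=20, random_state=0).fit(vec.transform(texts), labels)

tmp_dir = Path(tempfile.mkdtemp())
for name, obj in (("Tf.pkl", vec), ("model.pkl", clf)):
    with open(tmp_dir / name, "wb") as fh:
        pickle.dump(obj, fh)


def load_artifacts(folder):
    """Load the fitted vectorizer and model from a folder (hypothetical helper)."""
    with open(folder / "Tf.pkl", "rb") as fh:
        vectorizer = pickle.load(fh)
    with open(folder / "model.pkl", "rb") as fh:
        model = pickle.load(fh)
    return vectorizer, model


def predict_sentiment(review, vectorizer, model):
    """Vectorize one review with the fitted vectorizer, then predict its label."""
    return model.predict(vectorizer.transform([review]))[0]


loaded_vec, loaded_model = load_artifacts(tmp_dir)
prediction = predict_sentiment("loved it", loaded_vec, loaded_model)
print(prediction)
```

Two practical cautions apply: only unpickle files from a source you trust, since loading a pickle can execute arbitrary code, and load the files with the same scikit-learn version that was used to save them.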