<a href="https://colab.research.google.com/github/jpkrajewski/NLP-youtube-analysis/blob/main/NLP_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

NLP stands for Natural Language Processing. It is a subfield of artificial intelligence and linguistics that focuses on the interaction between computers and human language. NLP involves developing algorithms, models, and techniques that enable computers to understand, interpret, and generate human language in a way that is meaningful and useful.

The primary goal of NLP is to bridge the gap between human language and computer language, allowing machines to process, analyze, and extract information from textual data. NLP encompasses a wide range of tasks and applications, including sentiment analysis, which is the task addressed in this notebook.

## Importing the libraries

In [4]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

Generic Sentiment | Multidomain Sentiment Dataset
50K sentiments merged from multiple domains (Yelp, Twitter, mobile reviews)

https://www.kaggle.com/datasets/akgeni/generic-sentiment-multidomain-sentiment-dataset

**Context**

Most sentiment datasets pertain to a single domain. To get a general sense of sentiment, a model needs to learn sentiment semantics across domains.

**Content**

The dataset combines mobile reviews, Twitter sentiment, Yelp reviews, toxic reviews, and a few other sources to cover multiple domains of sentiment analysis. Labels are encoded as follows:

* 0->Negative
* 1->Neutral
* 2->Positive
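
The integer codes above can be mapped back to readable names when inspecting the data or reporting predictions. The sketch below is illustrative only: `LABEL_NAMES`, `decode_labels`, and the sample labels are hypothetical and not part of the original notebook.

```python
from collections import Counter

# Mapping taken from the dataset description above.
LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

def decode_labels(labels):
    """Translate integer label codes into their sentiment names."""
    return [LABEL_NAMES[int(code)] for code in labels]

# Hypothetical sample; in the notebook this would be the dataset's label column.
sample_labels = [2, 2, 0, 1, 2, 0]
decoded = decode_labels(sample_labels)
class_counts = Counter(decoded)

print(decoded)
print(class_counts.most_common())
```

Counting the decoded labels this way is a quick check for class imbalance before training.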

In [5]:
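dataset = pd.read_csv('./dataset/generic_sentiment_dataset_50k.csv')
# Select columns by name so the code does not depend on column positions
features = dataset['text'].values    # raw review / tweet text
labels = dataset['label'].values     # 0 = negative, 1 = neutral, 2 = positive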

In [6]:
dataset.head()

| | sentiment | text | label |
|---|---|---|---|
| 0 | positive | good mobile. battery is 5000 mah is very big. ... | 2 |
| 1 | positive | Overall in hand ecpirience is quite good matt ... | 2 |
| 2 | positive | 1. Superb Camera,\n2. No lag\n3. This is my fi... | 2 |
| 3 | positive | Bigger size of application names doesn't allow... | 2 |
| 4 | negative | Just a hype of stock android which is not flaw... | 0 |


## Cleaning the texts

In [7]:
# In Natural Language Processing (NLP), text preprocessing plays a crucial role in preparing textual data for analysis.

# The code below aims to clean and normalize the text data,
# reducing noise and simplifying subsequent NLP analysis.

# Preprocessing is crucial for improving the quality and effectiveness of NLP models and algorithms,
# as it helps standardize the text and remove irrelevant information,
# allowing the focus to be on the meaningful aspects of the text that are relevant to the task at hand.

import re
processed_features = []
for sentence in features:

  # Remove all the special characters
  processed_feature = re.sub(r'\W', ' ', str(sentence))

  # Remove all single characters surrounded by whitespace
  processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

  # Remove a single character at the start of the text
  # ('^' must stay unescaped so that it anchors to the start of the string)
  processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

  # Substitute multiple spaces with a single space
  processed_feature = re.sub(r'\s+', ' ', processed_feature)

  # Removing prefixed 'b'
  processed_feature = re.sub(r'^b\s+', '', processed_feature)

  # Converting to Lowercase
  processed_feature = processed_feature.lower()
  processed_features.append(processed_feature)
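
To see what the cleaning steps do to a single string, they can be wrapped in a small function. This is a sketch rather than part of the original notebook: the name `clean_text` and the example sentence are made up, and it assumes the start-of-string pattern is written as `^[a-zA-Z]\s+` so that the anchor actually matches the beginning of the text.

```python
import re

def clean_text(sentence):
    """Apply the same cleaning steps as the loop above to one string."""
    text = re.sub(r'\W', ' ', str(sentence))       # special characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)    # single letters between spaces
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)      # single letter at the start
    text = re.sub(r'\s+', ' ', text)               # collapse repeated whitespace
    text = re.sub(r'^b\s+', '', text)              # leading 'b' from bytes literals
    return text.lower()

example = "Superb Camera!! No lag, a great phone :)"
cleaned = clean_text(example)
print(repr(cleaned))
```

Note that the single-letter rule also drops ordinary words such as "a" and "I", and that the result can keep one leading or trailing space, which the vectorizer's tokenizer ignores.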

## Creating the Bag of Words model

This step uses the NLTK (Natural Language Toolkit) library for its English stop-word list and the scikit-learn library (specifically the TfidfVectorizer class) to perform feature extraction using the TF-IDF (Term Frequency-Inverse Document Frequency) approach.

The resulting processed_features array contains one numerical feature vector per document (in this case, one per preprocessed text), with the vocabulary limited to the 1,500 most frequent terms by max_features=1500. The TF-IDF approach weights each word by how often it appears in a document, scaled down by how many documents in the corpus contain it, so words that are frequent in one text but rare across the corpus receive the highest weights.
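
To make the weighting concrete, here is a small pure-Python sketch of TF-IDF on a toy corpus. It follows the smoothed formula that scikit-learn's `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The corpus and the function name `tfidf` are invented for illustration, tokenization is simplified to whitespace splitting, and the notebook itself relies on `TfidfVectorizer` rather than this code.

```python
import math
from collections import Counter

def tfidf(corpus):
    """Return (vocabulary, rows) with L2-normalized TF-IDF weights per document."""
    docs = [doc.split() for doc in corpus]
    vocabulary = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocabulary, rows

toy_corpus = ["good battery good camera", "bad battery", "good phone"]
vocabulary, rows = tfidf(toy_corpus)

# Weights for the second document, "bad battery".
for term, weight in zip(vocabulary, rows[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the second document, "bad" gets a higher weight than "battery" because "bad" occurs in only one document while "battery" occurs in two.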

In [8]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1500, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abdel\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Splitting the dataset into the Training set and Test set

In [9]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)

## Training the RandomForestClassifier model on the Training set

In [10]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=80, random_state=0)
rf_classifier.fit(X_train, y_train)

### Predicting the Test set results

In [11]:
rf_predictions = rf_classifier.predict(X_test)

### Making the Confusion Matrix

In [12]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, rf_predictions))
print(classification_report(y_test, rf_predictions))
print(accuracy_score(y_test, rf_predictions))


[[2001  239  525]
 [ 585  480  888]
 [ 386  192 4704]]
              precision    recall  f1-score   support

           0       0.67      0.72      0.70      2765
           1       0.53      0.25      0.34      1953
           2       0.77      0.89      0.83      5282

    accuracy                           0.72     10000
   macro avg       0.66      0.62      0.62     10000
weighted avg       0.70      0.72      0.69     10000

0.7185
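
The numbers in the classification report can be recomputed directly from the confusion matrix, which helps when reading the later reports as well. The sketch below copies the random forest matrix printed above; the helper name `per_class_metrics` is hypothetical and not part of the notebook.

```python
# Confusion matrix reported above for the random forest
# (rows = true class, columns = predicted class).
cm = [
    [2001, 239, 525],
    [585, 480, 888],
    [386, 192, 4704],
]

def per_class_metrics(cm):
    """Compute (precision, recall, F1) for each class of a square confusion matrix."""
    metrics = []
    for k in range(len(cm)):
        true_positive = cm[k][k]
        predicted_k = sum(row[k] for row in cm)   # column sum
        actual_k = sum(cm[k])                     # row sum (support)
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

for k, (p, r, f1) in enumerate(per_class_metrics(cm)):
    print(f"class {k}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f}")
```

The low recall for class 1 (neutral) comes from the middle row: only 480 of the 1953 neutral samples are predicted as neutral, while 888 are predicted as positive.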


## Training the Naive Bayes model on the Training set

In [13]:
from sklearn.naive_bayes import GaussianNB

gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train, y_train)

### Predicting the Test set results

In [14]:
gnb_predictions = gnb_classifier.predict(X_test)

### Making the Confusion Matrix

In [15]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, gnb_predictions))
print(classification_report(y_test, gnb_predictions))
print(accuracy_score(y_test, gnb_predictions))


[[1912  461  392]
 [ 705  605  643]
 [ 633  652 3997]]
              precision    recall  f1-score   support

           0       0.59      0.69      0.64      2765
           1       0.35      0.31      0.33      1953
           2       0.79      0.76      0.78      5282

    accuracy                           0.65     10000
   macro avg       0.58      0.59      0.58     10000
weighted avg       0.65      0.65      0.65     10000

0.6514


## Training the DecisionTree model

In [16]:
from sklearn.tree import DecisionTreeClassifier

dtree_classifier = DecisionTreeClassifier()
dtree_classifier.fit(X_train, y_train)

In [17]:
dtree_predictions = dtree_classifier.predict(X_test)

In [18]:
print(confusion_matrix(y_test, dtree_predictions))
print(classification_report(y_test, dtree_predictions))
print(accuracy_score(y_test, dtree_predictions))

[[1723  475  567]
 [ 568  593  792]
 [ 520  517 4245]]
              precision    recall  f1-score   support

           0       0.61      0.62      0.62      2765
           1       0.37      0.30      0.34      1953
           2       0.76      0.80      0.78      5282

    accuracy                           0.66     10000
   macro avg       0.58      0.58      0.58     10000
weighted avg       0.64      0.66      0.65     10000

0.6561


## Training the Logistic Regression OVR (One-vs-Rest) model

In [19]:
from sklearn.linear_model import LogisticRegression

# Initialize the Logistic Regression model with 'ovr' multiclass strategy
lreg_classifier = LogisticRegression(multi_class='ovr')

# Fit the model on the train dataset
lreg_classifier.fit(X_train, y_train)
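
As a rough illustration of what one-vs-rest means: one binary classifier is trained per class (class k versus everything else), and a sample is assigned to the class whose classifier gives the highest score. The weights, biases, and function names below are invented for the example; they are not taken from the fitted `lreg_classifier`.

```python
import math

def sigmoid(z):
    """Logistic function mapping a raw score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def ovr_predict(x, weights, biases):
    """Score x with one binary logistic model per class; return (best class, scores)."""
    scores = []
    for w, b in zip(weights, biases):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        scores.append(sigmoid(z))   # estimated P(class k versus the rest)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical parameters: 3 classes, 2 features.
weights = [[2.0, -1.0], [0.1, 0.1], [-1.5, 2.0]]
biases = [0.0, -0.5, 0.0]

label, scores = ovr_predict([0.2, 0.9], weights, biases)
print(label, [round(s, 3) for s in scores])
```

Recent scikit-learn releases deprecate the `multi_class` argument of `LogisticRegression`; if the cell above raises a warning or error, wrapping the estimator in `sklearn.multiclass.OneVsRestClassifier` is the documented way to get the same strategy.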

In [20]:
# Predicting the Test set results
lreg_predictions = lreg_classifier.predict(X_test)

# Evaluating the Algorithm
print(confusion_matrix(y_test, lreg_predictions))
print(classification_report(y_test, lreg_predictions))
print(accuracy_score(y_test, lreg_predictions))


[[2045  236  484]
 [ 631  440  882]
 [ 351  216 4715]]
              precision    recall  f1-score   support

           0       0.68      0.74      0.71      2765
           1       0.49      0.23      0.31      1953
           2       0.78      0.89      0.83      5282

    accuracy                           0.72     10000
   macro avg       0.65      0.62      0.62     10000
weighted avg       0.69      0.72      0.69     10000

0.72


## Training the XGBoost model

In [21]:
from xgboost import XGBClassifier

# Initialize XGBoost model
xgb_classifier = XGBClassifier()

# Train the model using the entire training dataset
xgb_classifier.fit(X_train, y_train)

In [22]:
# Make predictions on the testing data and evaluate the model
xgb_predictions = xgb_classifier.predict(X_test)

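# Print the confusion matrix, classification report, and accuracy score.
# These calls must use xgb_predictions. The output recorded below is identical to
# the Logistic Regression output, so re-run this cell to see XGBoost's own metrics.
print(confusion_matrix(y_test, xgb_predictions))
print(classification_report(y_test, xgb_predictions))
print(accuracy_score(y_test, xgb_predictions))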

[[2045  236  484]
 [ 631  440  882]
 [ 351  216 4715]]
              precision    recall  f1-score   support

           0       0.68      0.74      0.71      2765
           1       0.49      0.23      0.31      1953
           2       0.78      0.89      0.83      5282

    accuracy                           0.72     10000
   macro avg       0.65      0.62      0.62     10000
weighted avg       0.69      0.72      0.69     10000

0.72


## Training the LightGBM model

In [23]:
from lightgbm import LGBMClassifier

# Initialize LGBM classifier
lgbm_classifier = LGBMClassifier()

# Train the LGBM classifier using the entire training dataset
lgbm_classifier.fit(X_train, y_train)

In [24]:
# Make predictions on the testing data and evaluate the model
lgbm_predictions = lgbm_classifier.predict(X_test)

# Print the confusion matrix, classification report, and accuracy score
print(confusion_matrix(y_test, lgbm_predictions))
print(classification_report(y_test, lgbm_predictions))
print(accuracy_score(y_test, lgbm_predictions))

[[1945  292  528]
 [ 547  557  849]
 [ 325  262 4695]]
              precision    recall  f1-score   support

           0       0.69      0.70      0.70      2765
           1       0.50      0.29      0.36      1953
           2       0.77      0.89      0.83      5282

    accuracy                           0.72     10000
   macro avg       0.66      0.63      0.63     10000
weighted avg       0.70      0.72      0.70     10000

0.7197


## Training the CatBoost model

In [25]:
from catboost import CatBoostClassifier

# Initialize CatBoost Classifier
catboost_classifier = CatBoostClassifier(verbose=0)

# Train the CatBoost Classifier using the entire training dataset
catboost_classifier.fit(X_train, y_train)

<catboost.core.CatBoostClassifier at 0x2181684afd0>

In [26]:
# Make predictions on the testing data and evaluate the model
catboost_predictions = catboost_classifier.predict(X_test)

# Print the confusion matrix, classification report, and accuracy score
print(confusion_matrix(y_test, catboost_predictions))
print(classification_report(y_test, catboost_predictions))
print(accuracy_score(y_test, catboost_predictions))

[[1990  215  560]
 [ 593  419  941]
 [ 340  192 4750]]
              precision    recall  f1-score   support

           0       0.68      0.72      0.70      2765
           1       0.51      0.21      0.30      1953
           2       0.76      0.90      0.82      5282

    accuracy                           0.72     10000
   macro avg       0.65      0.61      0.61     10000
weighted avg       0.69      0.72      0.69     10000

0.7159


## Saving the model

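I am choosing the XGBClassifier as the model for the application. The metrics recorded for it above are identical to the Logistic Regression output (accuracy 0.72), so its actual test accuracy still needs to be confirmed by re-running the evaluation.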
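
For reference, the models can be ranked from the confusion matrices printed in the sections above. The sketch below copies those matrices; XGBoost is left out because the output recorded for it is identical to the Logistic Regression output, so no separate XGBoost figures are available here. The helper names `accuracy` and `macro_f1` are hypothetical.

```python
# Confusion matrices copied from the evaluation outputs above
# (rows = true class, columns = predicted class).
confusion_matrices = {
    "RandomForest": [[2001, 239, 525], [585, 480, 888], [386, 192, 4704]],
    "GaussianNB": [[1912, 461, 392], [705, 605, 643], [633, 652, 3997]],
    "DecisionTree": [[1723, 475, 567], [568, 593, 792], [520, 517, 4245]],
    "LogisticRegression": [[2045, 236, 484], [631, 440, 882], [351, 216, 4715]],
    "LightGBM": [[1945, 292, 528], [547, 557, 849], [325, 262, 4695]],
    "CatBoost": [[1990, 215, 560], [593, 419, 941], [340, 192, 4750]],
}

def accuracy(cm):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores."""
    f1_scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(row[k] for row in cm) - tp
        fn = sum(cm[k]) - tp
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

ranking = sorted(
    confusion_matrices,
    key=lambda name: accuracy(confusion_matrices[name]),
    reverse=True,
)
for name in ranking:
    cm = confusion_matrices[name]
    print(f"{name:18s} accuracy={accuracy(cm):.4f} macro_f1={macro_f1(cm):.2f}")
```

Macro F1 is printed next to accuracy because the neutral class is the hardest one for every model, and accuracy alone hides that.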

In [27]:
import joblib
joblib.dump(xgb_classifier, './app/finalized_model.sav')

['./app/finalized_model.sav']

In [28]:
joblib.dump(vectorizer, './app/vectorizer.sav')

['./app/vectorizer.sav']
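
On the application side, the two saved files would be loaded again and new text would go through the same cleaning and vectorizing steps before prediction. The following is a minimal sketch under that assumption: the helper names (`clean_text`, `load_artifacts`, `predict_sentiment`) are hypothetical, and how the actual application loads the model is not shown in this notebook.

```python
import re

LABEL_NAMES = {0: "negative", 1: "neutral", 2: "positive"}

# Same cleaning steps as in the notebook, as (pattern, replacement) pairs.
CLEANING_STEPS = [
    (r'\W', ' '),
    (r'\s+[a-zA-Z]\s+', ' '),
    (r'^[a-zA-Z]\s+', ' '),
    (r'\s+', ' '),
    (r'^b\s+', ''),
]

def clean_text(sentence):
    """Clean one raw string the same way the training text was cleaned."""
    text = str(sentence)
    for pattern, replacement in CLEANING_STEPS:
        text = re.sub(pattern, replacement, text)
    return text.lower()

def load_artifacts(model_path='./app/finalized_model.sav',
                   vectorizer_path='./app/vectorizer.sav'):
    """Load the saved classifier and vectorizer (needs joblib and both files)."""
    import joblib  # imported here so the other helpers work without joblib
    return joblib.load(model_path), joblib.load(vectorizer_path)

def predict_sentiment(texts, model, vectorizer):
    """Return a sentiment name for each raw input text."""
    cleaned = [clean_text(t) for t in texts]
    features = vectorizer.transform(cleaned)
    if hasattr(features, "toarray"):
        features = features.toarray()   # the models were trained on dense arrays
    return [LABEL_NAMES[int(code)] for code in model.predict(features)]

# Usage, assuming the two files written above exist:
# model, vectorizer = load_artifacts()
# print(predict_sentiment(["Battery life is great"], model, vectorizer))
```

Saving the fitted vectorizer alongside the model matters because new text has to be mapped onto the same 1,500-term vocabulary the classifier was trained on.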