<a href="https://colab.research.google.com/github/jpkrajewski/NLP-youtube-analysis/blob/main/NLP_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

NLP stands for Natural Language Processing. It is a subfield of artificial intelligence and linguistics that focuses on the interaction between computers and human language. NLP involves developing algorithms, models, and techniques that enable computers to understand, interpret, and generate human language in a way that is meaningful and useful.

The primary goal of NLP is to bridge the gap between human language and computer language, allowing machines to process, analyze, and extract information from textual data. NLP encompasses a wide range of tasks and applications; this notebook focuses on one of them, sentiment classification.

## Importing the libraries

In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

Generic Sentiment | Multidomain Sentiment Dataset
50K sentiments merged from multiple domains (Yelp, Twitter, Mobile reviews)

https://www.kaggle.com/datasets/akgeni/generic-sentiment-multidomain-sentiment-dataset

**Context**

Sentiment datasets usually pertain to a single domain. To get a general sense of sentiment, we need to understand sentiment semantics across domains.

**Content**

Mobile reviews, Twitter sentiment, Yelp reviews, toxic reviews, and a few more sources are combined to cover multiple domains of sentiment analysis.

* 0->Negative
* 1->Neutral
* 2->Positive

In [3]:
dataset = pd.read_csv('generic_sentiment_dataset_50k.csv')
# Select columns by name rather than position: 'text' is the raw text, 'label' the 0/1/2 class
features = dataset['text'].values
labels = dataset['label'].values

In [4]:
dataset.head()

| | sentiment | text | label |
|---|---|---|---|
| 0 | positive | good mobile. battery is 5000 mah is very big. ... | 2 |
| 1 | positive | Overall in hand ecpirience is quite good matt ... | 2 |
| 2 | positive | 1. Superb Camera,\n2. No lag\n3. This is my fi... | 2 |
| 3 | positive | Bigger size of application names doesn't allow... | 2 |
| 4 | negative | Just a hype of stock android which is not flaw... | 0 |


## Cleaning the texts

In [None]:
# Text preprocessing prepares raw textual data for analysis.

# The code below cleans and normalizes the text,
# reducing noise and simplifying the later steps.

# Standardizing the text and removing irrelevant information
# lets the model focus on the parts of the text
# that are relevant to the task at hand.

import re

processed_features = []
for sentence in features:

  # Replace every non-word character (punctuation, symbols) with a space
  processed_feature = re.sub(r'\W', ' ', str(sentence))

  # Remove single letters surrounded by whitespace
  processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)

  # Remove a single letter at the start of the text (^ anchors the match to the start)
  processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)

  # Collapse runs of whitespace into a single space
  processed_feature = re.sub(r'\s+', ' ', processed_feature)

  # Remove a leading 'b' left over from a bytes-literal prefix
  processed_feature = re.sub(r'^b\s+', '', processed_feature)

  # Convert to lowercase
  processed_feature = processed_feature.lower()
  processed_features.append(processed_feature)
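The same cleaning steps can be wrapped in a function so that one entry can be inspected in isolation. This is a minimal sketch: the name `clean_text` and the sample sentence are made up for illustration, and it assumes the start-of-text pattern is meant to be anchored with `^`.

```python
import re

def clean_text(sentence):
    """Apply the notebook's cleaning steps to a single text entry."""
    text = re.sub(r'\W', ' ', str(sentence))      # non-word characters -> spaces
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # single letters between spaces
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # a single letter at the start
    text = re.sub(r'\s+', ' ', text)              # collapse runs of whitespace
    text = re.sub(r'^b\s+', '', text)             # leftover bytes-literal prefix
    return text.lower()

# Hypothetical sample, not taken from the dataset
sample = "Overall, in-hand experience is quite good!!"
cleaned = clean_text(sample)
print(repr(cleaned))
```

Note that these steps can leave a leading or trailing space behind; calling `.strip()` on the result would remove it if that matters downstream.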

## Creating the Bag of Words model

This step uses the English stop-word list from the NLTK (Natural Language Toolkit) library and the TfidfVectorizer class from scikit-learn to extract features with the TF-IDF (Term Frequency-Inverse Document Frequency) approach.

After this step, processed_features holds the numerical feature vectors for the preprocessed text, one vector per document (here, one cleaned text entry). TF-IDF weights each word by how often it appears in a document and down-weights words that appear in many documents across the corpus, so each vector captures how important a word is to that particular document.

In [None]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=1500, stop_words=stopwords.words('english'))
processed_features = vectorizer.fit_transform(processed_features).toarray()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Splitting the dataset into the Training set and Test set

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(processed_features, labels, test_size=0.2, random_state=0)

## Training the RandomForestClassifier model on the Training set

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=80, random_state=0)
rf_classifier.fit(X_train, y_train)

### Predicting the Test set results

In [None]:
rf_predictions = rf_classifier.predict(X_test)

### Making the Confusion Matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, rf_predictions))
print(classification_report(y_test, rf_predictions))
print(accuracy_score(y_test, rf_predictions))


[[1990  251  524]
 [ 612  483  858]
 [ 376  201 4705]]
              precision    recall  f1-score   support

           0       0.67      0.72      0.69      2765
           1       0.52      0.25      0.33      1953
           2       0.77      0.89      0.83      5282

    accuracy                           0.72     10000
   macro avg       0.65      0.62      0.62     10000
weighted avg       0.69      0.72      0.69     10000

0.7178
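The figures in the classification report can be recomputed by hand from the confusion matrix, which helps when reading it. The sketch below copies the random forest matrix from the output above; in scikit-learn's layout, rows are true labels and columns are predicted labels.

```python
import numpy as np

# Random forest confusion matrix copied from the printed output
# (rows = true labels, columns = predicted labels)
cm = np.array([
    [1990,  251,  524],
    [ 612,  483,  858],
    [ 376,  201, 4705],
])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / all true members of each class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all predictions of each class
accuracy = cm.trace() / cm.sum()             # all correct / all samples

for label in range(3):
    print(label, round(precision[label], 2), round(recall[label], 2))
print(round(accuracy, 4))
```

The low recall for class 1 (Neutral) comes from the middle row: only 483 of 1953 neutral samples are predicted as neutral, and 858 of them are predicted as positive.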


## Training the KNN model on the Training set

In [None]:
from sklearn.neighbors import KNeighborsClassifier

knn_classifier = KNeighborsClassifier()
knn_classifier.fit(X_train, y_train)

### Predicting the Test set results

In [None]:
knn_predictions = knn_classifier.predict(X_test)

### Making the Confusion Matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, knn_predictions))
print(classification_report(y_test, knn_predictions))
print(accuracy_score(y_test, knn_predictions))


[[ 928  262 1575]
 [ 371  350 1232]
 [ 329  252 4701]]
              precision    recall  f1-score   support

           0       0.57      0.34      0.42      2765
           1       0.41      0.18      0.25      1953
           2       0.63      0.89      0.74      5282

    accuracy                           0.60     10000
   macro avg       0.53      0.47      0.47     10000
weighted avg       0.57      0.60      0.55     10000

0.5979


## Training the Gaussian Naive Bayes model on the Training set

In [None]:
from sklearn.naive_bayes import GaussianNB

gnb_classifier = GaussianNB()
gnb_classifier.fit(X_train, y_train)

### Predicting the Test set results

In [None]:
gnb_predictions = gnb_classifier.predict(X_test)

### Making the Confusion Matrix

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, gnb_predictions))
print(classification_report(y_test, gnb_predictions))
print(accuracy_score(y_test, gnb_predictions))


[[1911  463  391]
 [ 707  604  642]
 [ 627  655 4000]]
              precision    recall  f1-score   support

           0       0.59      0.69      0.64      2765
           1       0.35      0.31      0.33      1953
           2       0.79      0.76      0.78      5282

    accuracy                           0.65     10000
   macro avg       0.58      0.59      0.58     10000
weighted avg       0.65      0.65      0.65     10000

0.6515


## Saving the model to deploy in production

The RandomForestClassifier has the best test accuracy (0.7178, compared with 0.6515 for Gaussian Naive Bayes and 0.5979 for KNN), so I am choosing this model for the application.
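This comparison can be reproduced from the three confusion matrices printed earlier. The sketch below copies them in and ranks the models by accuracy. The dictionary keys are labels chosen for illustration. The macro-averaged recall column is an extra view that treats the three classes equally; it is not the criterion used for the choice.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows = true, columns = predicted)
confusion = {
    'random_forest': np.array([[1990, 251, 524], [612, 483, 858], [376, 201, 4705]]),
    'knn': np.array([[928, 262, 1575], [371, 350, 1232], [329, 252, 4701]]),
    'gaussian_nb': np.array([[1911, 463, 391], [707, 604, 642], [627, 655, 4000]]),
}

summary = {}
for name, cm in confusion.items():
    accuracy = cm.trace() / cm.sum()
    macro_recall = (cm.diagonal() / cm.sum(axis=1)).mean()
    summary[name] = (accuracy, macro_recall)
    print(f"{name}: accuracy={accuracy:.4f}, macro recall={macro_recall:.2f}")

# Pick the model with the highest accuracy
best = max(summary, key=lambda name: summary[name][0])
print(best)
```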

In [None]:
import joblib
joblib.dump(rf_classifier, 'finalized_model.sav')

['finalized_model.sav']

In [None]:
joblib.dump(vectorizer, 'vectorizer.sav')

['vectorizer.sav']
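For reference, this is roughly how an application could use the two saved files. It is a sketch under stated assumptions: the tiny training set is invented so the example runs on its own, and the files are written to a temporary directory rather than the notebook's working directory. Only the file names and the 0/1/2 label meanings come from the notebook.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in training data (hypothetical), only so the sketch is self-contained
texts = ["great phone love it", "terrible battery awful", "it is a phone",
         "love the camera", "awful screen terrible", "the phone arrived"]
labels = [2, 0, 1, 2, 0, 1]

vectorizer = TfidfVectorizer()
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, 'finalized_model.sav')
    vectorizer_path = os.path.join(tmp, 'vectorizer.sav')
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # What the consuming application would do: load both artifacts
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# Use transform (not fit_transform) so new text maps onto the saved vocabulary
new_texts = ["love this phone"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_texts))

label_names = {0: 'Negative', 1: 'Neutral', 2: 'Positive'}
print(label_names[int(prediction[0])])
```

The vectorizer has to be saved alongside the classifier because the model only understands the feature columns the vectorizer learned during training. New text should also go through the same cleaning steps as the training data before it is transformed.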