# Sentiment Analysis

This notebook builds a binary sentiment classifier (positive vs. negative) for each category of the “Multi-Domain Sentiment Dataset”: “Books”, “DVD”, “Electronics”, and “Kitchen”.

### Library imports

In [1]:
#Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, ConfusionMatrixDisplay

#Libraries to graph
import matplotlib.pyplot as plt
import seaborn as sns

#NLTK
import nltk
from nltk.corpus import stopwords
import re
import pandas as pd


stemmer = nltk.stem.SnowballStemmer('english') 
nltk.download('stopwords') 

[nltk_data] Error loading stopwords: <urlopen error [SSL:
[nltk_data]     CERTIFICATE_VERIFY_FAILED] certificate verify failed:
[nltk_data]     unable to get local issuer certificate (_ssl.c:992)>


False

### Read and transform the .review files

Run the Python script "PreProcessingSentimentAnalysis.py".

### Creating the training/validation dataframe

In [2]:
def create_df(file_name):
    # Load the comma-separated file into a DataFrame and hand it back to the caller
    df = pd.read_csv(file_name, sep=',')
    return df
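
As a quick illustration of the loading step, here is a minimal sketch that reads a small in-memory CSV instead of a real file. The column names `review` and `labels` are taken from the later cells of this notebook; the two sample rows and the label values `positive`/`negative` are made up for the example, and the actual output of the preprocessing script may look different.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for the preprocessing script's output.
csv_text = (
    "review,labels\n"
    "\"Great book, loved it\",positive\n"
    "Broke after a week,negative\n"
)

# read_csv accepts any file-like object, so StringIO works like a file path here.
df = pd.read_csv(io.StringIO(csv_text), sep=',')

print(list(df.columns))
print(df.shape)
```

Note that the first review contains a comma, so it has to be quoted in the CSV to stay in a single column.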


### Text processing function

In [3]:
def text_processing(text):
    # Step 1: Remove special characters using a regular expression (non-words).
    processed_feature = re.sub(r'\W', ' ', str(text))
    # Step 2: Remove single-character occurrences.
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # An unescaped ^ anchors the match to the start of the text (r'\^' would match a literal caret).
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)
    # Step 3: Remove numbers (very sporadic occurrences in our dataset).
    processed_feature = re.sub(r'[0-9]+', ' ', processed_feature)
    # Step 4: Simplify consecutive spaces to a single space between words.
    processed_feature = re.sub(' +', ' ', processed_feature)
    # Step 5: Convert all text to lowercase.
    processed_feature = processed_feature.lower()
    # Step 6: Apply stemming. It's a way to bring words to a common root, simplifying the vocabulary.
    # This helps to avoid having two different words with the same meaning in our vocabulary.
    processed_feature = " ".join([stemmer.stem(i) for i in processed_feature.split()])

    return processed_feature
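
To see what the regular-expression steps do on their own, the sketch below repeats steps 1 to 5 without the stemmer, so it only needs the standard library. The function name `clean_text` and the sample sentence are made up for this example, and the final `.strip()` is an addition that is not part of the function above.

```python
import re


def clean_text(text):
    """Regex-only part of the cleaning steps (no stemming)."""
    # Replace every non-word character (punctuation, symbols) with a space.
    cleaned = re.sub(r'\W', ' ', str(text))
    # Drop single letters that are surrounded by whitespace.
    cleaned = re.sub(r'\s+[a-zA-Z]\s+', ' ', cleaned)
    # Drop a single letter at the very start of the text.
    cleaned = re.sub(r'^[a-zA-Z]\s+', ' ', cleaned)
    # Replace runs of digits with a space.
    cleaned = re.sub(r'[0-9]+', ' ', cleaned)
    # Collapse consecutive spaces into one.
    cleaned = re.sub(' +', ' ', cleaned)
    # Lowercase and trim the leading/trailing space (the trim is an addition).
    return cleaned.lower().strip()


sample = "I bought 2 DVDs; it's GREAT!!"
print(clean_text(sample))  # bought dvds it great
```

The example also shows a side effect worth knowing about: the pronoun "I" and the "s" left over from "it's" are removed as single-character tokens.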


Applying the text processing function to each dataset

In [4]:
def apply_processing(category: pd.DataFrame) -> pd.DataFrame:

    # Extracting the unprocessed texts and their labels
    # (.values is an attribute, so it must not be called with parentheses)
    not_processed = category['review'].values
    labels = category['labels'].values

    # Processing all the texts
    processed = [text_processing(text) for text in not_processed]

    # Saving the processed texts in the df
    category['processed'] = processed

    # Returning the processed df
    return category


NameError: name 'df' is not defined

## Text representation

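In this part we take the processed text and turn it into a numerical representation that a classifier can work with. We are going to create a bag of words (BoW): each review becomes a vector that counts how many times each vocabulary word appears in it.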

### Vectorizer

We use scikit-learn's CountVectorizer to create the bag of words.

In [None]:
def bow(processed_text:list):
    
    #Bag of words
    vectorizer = CountVectorizer(max_features=2500, stop_words=stopwords.words('english'))
    
    #Now we build the vocabulary and also transform our text using our dataset
    text_features = vectorizer.fit_transform(processed_text).toarray()

    return text_features
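
To make the resulting matrix concrete, here is a small sketch of CountVectorizer on a toy corpus. The three documents and `max_features=3` are made up for this example (the function above uses 2500), and the NLTK stop-word list is left out so the snippet runs even when the `stopwords` download fails.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus of already-processed (stemmed, lowercased) texts.
docs = ["good book good stori", "bad book", "good dvd bad"]

# Keep only the 3 most frequent terms across the whole corpus.
vectorizer = CountVectorizer(max_features=3)
counts = vectorizer.fit_transform(docs).toarray()

# vocabulary_ maps each kept term to its column index in the matrix.
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

print(vocab)   # ['bad', 'book', 'good']
print(counts)  # rows are documents, columns follow the order of vocab
```

Each row is one document and each column one vocabulary term, so the first row `[0, 1, 2]` says that "bad" does not appear, "book" appears once, and "good" appears twice. The rarer terms "stori" and "dvd" are dropped by `max_features`.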

Let's check that the processing function worked as expected.

In [None]:
print("Not processed:")
print(not_processed[1000])
print("---------------------------------")
print("Processed:")
print(processed[1000])