
# Review Sentiment

In [1]:

import random  # Import the random module for randomization
import json  # Import the json module for handling JSON data
import pickle  # Import the pickle module for object serialization
from sklearn.model_selection import train_test_split  # Import train_test_split for splitting data
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer  # Import CountVectorizer and TfidfVectorizer for text feature extraction
from sklearn.linear_model import LogisticRegression  # Import LogisticRegression for logistic regression model
from sklearn.tree import DecisionTreeClassifier  # Import DecisionTreeClassifier for decision tree model
from sklearn.naive_bayes import GaussianNB  # Import GaussianNB for naive Bayes model
from sklearn.model_selection import GridSearchCV  # Import GridSearchCV for hyperparameter tuning
from sklearn.metrics import f1_score  # Import f1_score for evaluating model performance
from sklearn import svm  # Import svm for support vector machine model

In [2]:

class Sentiment:
    NEGATIVE = "NEGATIVE"  # Define a constant for negative sentiment
    NEUTRAL = "NEUTRAL"  # Define a constant for neutral sentiment
    POSITIVE = "POSITIVE"  # Define a constant for positive sentiment


class Review:
    def __init__(self, text, score):
        self.text = text  # Set the text of the review
        self.score = score  # Set the score of the review
        self.sentiment = self.get_sentiment()  # Determine the sentiment of the review using the get_sentiment() method

    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE  # If the score is less than or equal to 2, the sentiment is negative
        elif self.score == 3:
            return Sentiment.NEUTRAL  # If the score is 3, the sentiment is neutral
        else:
            return Sentiment.POSITIVE  # If the score is greater than 3, the sentiment is positive


class ReviewContainer:
    def __init__(self, reviews):
        self.reviews = reviews  # Set the list of reviews in the container

    def get_text(self):
        return [x.text for x in self.reviews]  # Get the text of each review in a list

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]  # Get the sentiment of each review in a list

    def evenly_distribute(self):
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))  # Filter reviews with negative sentiment
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))  # Filter reviews with positive sentiment
        positive_shrunk = positive[:len(negative)]  # Shrink the positive review list to match the length of the negative review list
        self.reviews = negative + positive_shrunk  # Combine the negative and shrunken positive review lists
        random.shuffle(self.reviews)  # Shuffle the reviews randomly

'''In this code:

- The `Sentiment` class is defined, which contains three constants for representing different sentiment types: `NEGATIVE`, `NEUTRAL`, and `POSITIVE`.
- The `Review` class is defined, which represents a single review. It has attributes for the text, score, and sentiment of the review. The `get_sentiment` method is used to determine the sentiment based on the score.
- The `ReviewContainer` class wraps a list of reviews. `get_text` and `get_sentiment` return the review texts and sentiment labels as lists. `evenly_distribute` balances the classes by keeping every negative review and only the first `len(negative)` positive reviews, then shuffling the combined list. Neutral reviews are dropped in the process, and the truncation only balances the set when positive reviews outnumber negative ones.
- The `lambda` function is used with the `filter` function to filter reviews based on their sentiment.'''
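
The balancing step described above can be sketched on toy data. This is a minimal illustration, not the notebook's method: the helper name `balance_by_label` and its `seed` parameter are hypothetical, it samples at random instead of taking the first items, and it downsamples to the smallest class whichever class that is.

```python
import random
from collections import defaultdict


def balance_by_label(items, label_of, seed=0):
    """Downsample every class to the size of the smallest one (hypothetical helper)."""
    groups = defaultdict(list)
    for item in items:
        groups[label_of(item)].append(item)
    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # local generator, so the global random state is untouched
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: (text, sentiment) pairs with four positives and two negatives.
toy = [("great", "POSITIVE"), ("loved it", "POSITIVE"), ("fine read", "POSITIVE"),
       ("superb", "POSITIVE"), ("awful", "NEGATIVE"), ("boring", "NEGATIVE")]
balanced = balance_by_label(toy, label_of=lambda pair: pair[1])
labels = [label for _, label in balanced]
print(labels.count("POSITIVE"), labels.count("NEGATIVE"))  # 2 2
```

Unlike `evenly_distribute`, this sketch keeps every label it is given, so neutral items would have to be filtered out beforehand to reproduce the two-class setup.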

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:

file_name = '/content/drive/MyDrive/DS_and_ML_projects-master/Review Sentiment/Books_small_10000.json'

reviews = []  # Create an empty list to store the reviews

with open(file_name) as f:
    for line in f:
        review = json.loads(line)  # Parse each line of the file as JSON
        reviews.append(Review(review['reviewText'], review['overall']))  # Create a new Review object and append it to the list

'''In this code:

- The `file_name` variable stores the path to the JSON file containing the reviews.
- An empty list, `reviews`, is created to store the `Review` objects.
- The `open` function is used to open the file in read mode, and the file object is assigned to `f`.
- A loop is used to iterate over each line in the file.
- The `json.loads` function is used to parse each line as JSON, converting it into a Python dictionary.
- A new `Review` object is created using the values from the parsed JSON (`review['reviewText']` for the text and `review['overall']` for the score).
- The newly created `Review` object is appended to the `reviews` list.'''
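
The line-by-line parsing can be tried without the Drive file. The sketch below assumes the same one-JSON-object-per-line layout and uses two made-up records held in an in-memory buffer; skipping blank lines is an added precaution, not something the original loop does.

```python
import io
import json

# Two made-up records in the JSON Lines layout (one JSON object per line).
sample = io.StringIO(
    '{"reviewText": "Loved it", "overall": 5.0}\n'
    '\n'
    '{"reviewText": "Not for me", "overall": 2.0}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:  # a blank line would make json.loads raise, so skip it
        continue
    row = json.loads(line)  # each line becomes a Python dict
    records.append((row["reviewText"], row["overall"]))

print(records)  # [('Loved it', 5.0), ('Not for me', 2.0)]
```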

In [None]:
reviews[84].text

"I'm very happy thus far with this purchase.Have made a total re-do of my eating habits and needed to know some OTHER recipes for the way I want to eat!Kudos for your efforts!Thanks againHAPPY"

In [6]:
reviews[74].text



In [7]:
reviews[4].text

'It was a decent read.. typical story line. Nothing unsavory as so many are. Just a slice of life, plausible.'

### Prep Data

In [8]:

training, test = train_test_split(reviews, test_size=0.33, random_state=42)

# Split the reviews into training and test sets using train_test_split
# The 'reviews' list contains the reviews to be split
# The 'test_size' parameter specifies the proportion of the data to be allocated for the test set (in this case, 33%)
# The 'random_state' parameter sets the random seed for reproducibility

train_container = ReviewContainer(training)

# Create a ReviewContainer object called 'train_container' using the training set
# The 'training' list contains the reviews that will be used for training the model

test_container = ReviewContainer(test)

# Create a ReviewContainer object called 'test_container' using the test set
# The 'test' list contains the reviews that will be used for evaluating the model's performance
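
A small sketch of what `random_state` buys, using a list of integers (`items`, a hypothetical stand-in for the `Review` objects): calling `train_test_split` twice with the same seed should return the same partition, and every item lands in exactly one of the two sets.

```python
from sklearn.model_selection import train_test_split

items = list(range(10))  # stand-in for the list of Review objects

train_a, test_a = train_test_split(items, test_size=0.33, random_state=42)
train_b, test_b = train_test_split(items, test_size=0.33, random_state=42)

print(len(train_a), len(test_a))                # sizes of the two partitions
print(train_a == train_b and test_a == test_b)  # same seed, same split
print(sorted(train_a + test_a) == items)        # nothing lost, nothing duplicated
```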

In [9]:

train_container.evenly_distribute()

# Call the 'evenly_distribute' method on 'train_container' to balance the number of positive and negative reviews in the training set
# This method shrinks the positive review list to match the length of the negative review list and shuffles the reviews randomly

train_x = train_container.get_text()
# Get the text of the reviews in the training set using the 'get_text' method of 'train_container'
# The 'train_x' variable will contain a list of the review texts

train_y = train_container.get_sentiment()
# Get the sentiment of the reviews in the training set using the 'get_sentiment' method of 'train_container'
# The 'train_y' variable will contain a list of the review sentiments

test_container.evenly_distribute()

# Call the 'evenly_distribute' method on 'test_container' to balance the number of positive and negative reviews in the test set
# This method shrinks the positive review list to match the length of the negative review list and shuffles the reviews randomly

test_x = test_container.get_text()
# Get the text of the reviews in the test set using the 'get_text' method of 'test_container'
# The 'test_x' variable will contain a list of the review texts

test_y = test_container.get_sentiment()
# Get the sentiment of the reviews in the test set using the 'get_sentiment' method of 'test_container'
# The 'test_y' variable will contain a list of the review sentiments

print(train_y.count(Sentiment.POSITIVE))
# Count the number of positive sentiments in the training set using the 'count' method of the 'train_y' list
# Print the result

print(train_y.count(Sentiment.NEGATIVE))
# Count the number of negative sentiments in the training set using the 'count' method of the 'train_y' list
# Print the result

'''In this code:

- The `evenly_distribute` method is called on both the `train_container` and `test_container` objects to balance the number of positive and negative reviews in each set.
- After balancing, the `get_text` method is called on `train_container` and `test_container` to get the text of the reviews in the training and test sets, respectively. The results are stored in `train_x` and `test_x`.
- Similarly, the `get_sentiment` method is called on `train_container` and `test_container` to get the sentiment of the reviews in the training and test sets, respectively. The results are stored in `train_y` and `test_y`.
- The `count` method is used on `train_y` to count the number of positive and negative sentiments in the training set, and the results are printed.'''

436
436


### Bag of words vectorization

In [10]:

vectorizer = TfidfVectorizer()
# Create an instance of the TfidfVectorizer class, which is used to convert text data into numerical features based on term frequency-inverse document frequency (TF-IDF)
# Alternatively, you can use CountVectorizer to convert text data into a matrix of token counts

train_x_vectors = vectorizer.fit_transform(train_x)
# Fit the vectorizer to the training data (train_x): this learns the vocabulary and IDF (inverse document frequency) values
# In the same call, convert the training text into a sparse numerical feature matrix, so a separate fitting call is not needed
# The resulting train_x_vectors variable will contain the TF-IDF or count vector representation of the text data

test_x_vectors = vectorizer.transform(test_x)
# Convert the text data in the test set (test_x) into a numerical feature matrix using the vectorizer
# The resulting test_x_vectors variable will contain the TF-IDF or count vector representation of the text data

print(train_x[0])
# Print the first review text in the training set (train_x)

print(train_x_vectors[0])
# Print the vector representation of the first review text in the training set (train_x_vectors)


'''In this code:

- The `TfidfVectorizer` class is used to create a vectorizer object that will convert the text data into TF-IDF vectors.
- Alternatively, you can use `CountVectorizer` to create a vectorizer object that will convert the text data into a matrix of token counts.
- The `fit_transform` method is called on the vectorizer object with `train_x` as input to fit the vectorizer to the training data and convert the training text data into numerical feature vectors.
- The resulting `train_x_vectors` variable contains the TF-IDF or count vector representation of the training text data.
- The `transform` method is called on the vectorizer object with `test_x` as input to convert the test text data into numerical feature vectors.
- The resulting `test_x_vectors` variable contains the TF-IDF or count vector representation of the test text data.
- The `print` statements are used to display the first review text in the training set (`train_x[0]`) and its corresponding vector representation (`train_x_vectors[0]`).'''

Was not really sure what this book was about.  It was so boring, I stopped reading after the 8th/9th chapter.
  (0, 1354)	0.27267863324867586
  (0, 129)	0.41947771626897
  (0, 120)	0.44496480430090557
  (0, 7929)	0.07514566909764407
  (0, 285)	0.20430064266606754
  (0, 6399)	0.17881355463413195
  (0, 7525)	0.3068496986885096
  (0, 1007)	0.2668721485601011
  (0, 7280)	0.1437670029883238
  (0, 4277)	0.09144528706761185
  (0, 149)	0.1527220495082482
  (0, 991)	0.09235961436579658
  (0, 7976)	0.08426891379599838
  (0, 8679)	0.15740353429718923
  (0, 7683)	0.25824783788042205
  (0, 6411)	0.16720137332287055
  (0, 5408)	0.11614310508129516
  (0, 8608)	0.3256587722422593
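
The difference between `fit_transform` on the training text and `transform` on the test text can be seen on a tiny made-up corpus. In this sketch the documents are invented for illustration; the point is that the test matrix has the same columns as the training matrix, and words never seen during fitting contribute nothing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["a great book", "a boring book", "great story"]
test_docs = ["great plot twist"]  # "plot" and "twist" never appear in train_docs

tfidf = TfidfVectorizer()
train_vecs = tfidf.fit_transform(train_docs)  # learns the vocabulary and IDF weights
test_vecs = tfidf.transform(test_docs)        # reuses them; unseen words are ignored

# The default tokenizer drops one-letter tokens, so "a" is not in the vocabulary.
print(sorted(tfidf.vocabulary_))
print(train_vecs.shape, test_vecs.shape)  # same number of columns in both
print(test_vecs.nnz)                      # count of non-zero entries in the test row
```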


# Classification

#### Linear SVM

In [11]:

clf_svm = svm.SVC(kernel='linear')
# Create an instance of the SVM (Support Vector Machine) classifier with a linear kernel
# With a linear kernel, the decision boundary is a hyperplane in the TF-IDF feature space

clf_svm.fit(train_x_vectors, train_y)
# Train the SVM classifier using the training data (train_x_vectors) and corresponding labels (train_y)
# The classifier learns to classify the text data into positive or negative sentiment based on the feature vectors

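test_x[0]
# Look up the first review text in the test set (test_x)
# A notebook only displays the last expression of a cell, so this line produces no output; wrap it in print() to see the text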

clf_svm.predict(test_x_vectors[0])
# Use the trained SVM classifier (clf_svm) to predict the sentiment of the first review text in the test set (test_x_vectors[0])
# The predict method returns the predicted sentiment label (positive or negative) for the given review text

'''In this code:

- An instance of the SVM classifier is created using the `svm.SVC` class with a linear kernel specified as `kernel='linear'`.
- The `fit` method is called on the SVM classifier to train the model using the training data (`train_x_vectors`) and corresponding labels (`train_y`).
- The `test_x[0]` statement looks up the first review text in the test set. Because it is not the last expression in the cell, the notebook does not display it.
- The `predict` method is called on the trained SVM classifier (`clf_svm`) with `test_x_vectors[0]` as input to predict the sentiment label (positive or negative) for the first review text in the test set.'''

array(['NEGATIVE'], dtype='<U8')

#### Decision Tree


In [12]:

clf_dec = DecisionTreeClassifier()
# Create an instance of the DecisionTreeClassifier, which is a decision tree-based classifier

clf_dec.fit(train_x_vectors, train_y)
# Train the DecisionTreeClassifier using the training data (train_x_vectors) and corresponding labels (train_y)
# The classifier learns to classify the text data into positive or negative sentiment based on the feature vectors

clf_dec.predict(test_x_vectors[0])
# Use the trained DecisionTreeClassifier (clf_dec) to predict the sentiment of the first review text in the test set (test_x_vectors[0])
# The predict method returns the predicted sentiment label (positive or negative) for the given review text

array(['NEGATIVE'], dtype='<U8')

#### Naive Bayes

In [13]:

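clf_gnb = DecisionTreeClassifier()
# NOTE: despite the section heading and the variable name, this cell creates a second DecisionTreeClassifier, not a naive Bayes model
# Using the imported GaussianNB instead would need dense input (for example train_x_vectors.toarray()), because GaussianNB does not accept sparse matrices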

clf_gnb.fit(train_x_vectors, train_y)
# Train the DecisionTreeClassifier using the training data (train_x_vectors) and corresponding labels (train_y)
# The classifier learns to classify the text data into positive or negative sentiment based on the feature vectors

clf_gnb.predict(test_x_vectors[0])
# Use the trained DecisionTreeClassifier (clf_gnb) to predict the sentiment of the first review text in the test set (test_x_vectors[0])
# The predict method returns the predicted sentiment label (positive or negative) for the given review text

array(['NEGATIVE'], dtype='<U8')
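
If a Gaussian naive Bayes model is wanted under this heading, one way to fit it on TF-IDF features is sketched below. The toy documents and labels are made up, and the dense conversion with `.toarray()` is the assumption that matters: `GaussianNB` expects dense arrays, so the sparse matrices produced by the vectorizer have to be converted both for fitting and for prediction.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

docs = ["great book", "loved this story", "boring book", "awful story"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

tfidf = TfidfVectorizer()
dense_train = tfidf.fit_transform(docs).toarray()  # sparse matrix -> dense array

gnb = GaussianNB()
gnb.fit(dense_train, labels)

# New text goes through the same vectorizer and the same dense conversion.
dense_new = tfidf.transform(["great story"]).toarray()
prediction = gnb.predict(dense_new)
print(prediction)
```

Dense conversion is cheap at this scale, but the memory cost grows with vocabulary size times document count, so it is worth checking before applying it to a larger corpus.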

#### Logistic Regression

In [14]:

clf_log = LogisticRegression()
# Create an instance of the LogisticRegression classifier, which is commonly used for binary classification tasks

clf_log.fit(train_x_vectors, train_y)
# Train the LogisticRegression classifier using the training data (train_x_vectors) and corresponding labels (train_y)
# The classifier learns to classify the text data into positive or negative sentiment based on the feature vectors

clf_log.predict(test_x_vectors[0])
# Use the trained LogisticRegression classifier (clf_log) to predict the sentiment of the first review text in the test set (test_x_vectors[0])
# The predict method returns the predicted sentiment label (positive or negative) for the given review text

array(['NEGATIVE'], dtype='<U8')

## Evaluation

In [15]:
print(clf_svm.score(test_x_vectors, test_y))
print(clf_dec.score(test_x_vectors, test_y))
print(clf_gnb.score(test_x_vectors, test_y))
print(clf_log.score(test_x_vectors, test_y))

0.8076923076923077
0.6610576923076923
0.6658653846153846
0.8052884615384616


In [16]:
f1_score(test_y, clf_svm.predict(test_x_vectors), average=None, labels=[Sentiment.POSITIVE, Sentiment.NEGATIVE])

array([0.80582524, 0.80952381])
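
With `average=None`, `f1_score` returns one value per class, in the order given by `labels`. A hand-checkable sketch on six made-up labels: for POSITIVE, precision is 1 and recall is 2/3, giving F1 = 0.8; for NEGATIVE, precision is 3/4 and recall is 1, giving F1 = 6/7.

```python
from sklearn.metrics import f1_score

y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]
y_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE", "NEGATIVE"]

# One F1 value per class, ordered as in `labels`.
scores = f1_score(y_true, y_pred, average=None, labels=["POSITIVE", "NEGATIVE"])
print(scores)  # first entry is POSITIVE (0.8), second is NEGATIVE (6/7)
```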

In [17]:
test_y.count(Sentiment.POSITIVE)

208

In [18]:
test_set = ['did not enjoy', 'bad book, do not buy', 'good']
new_test = vectorizer.transform(test_set)

clf_svm.predict(new_test)

array(['NEGATIVE', 'NEGATIVE', 'POSITIVE'], dtype='<U8')

## Tuning model with Grid Search

In [19]:

parameters = {'kernel': ('linear', 'rbf'), 'C': (1, 4, 8, 16, 32)}
# Define a dictionary of parameters for the SVM classifier
# The 'kernel' parameter specifies the type of kernel to be used (linear or radial basis function)
# The 'C' parameter is the inverse of the regularization strength: smaller values regularize more strongly

svc = svm.SVC()
# Create an instance of the SVM classifier

clf = GridSearchCV(svc, parameters, cv=5)
# Create an instance of the GridSearchCV class, which performs an exhaustive search over specified parameter values for an estimator
# In this case, it searches for the best combination of 'kernel' and 'C' values for the SVM classifier
# The 'cv' parameter specifies the number of folds for cross-validation

clf.fit(train_x_vectors, train_y)
# Fit the GridSearchCV object to the training data (train_x_vectors) and corresponding labels (train_y)
# It performs the grid search to find the best combination of parameters
# The classifier learns to classify the text data into positive or negative sentiment based on the feature vectors
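
After fitting, a `GridSearchCV` object exposes which combination won through `best_params_` and its mean cross-validated score through `best_score_`. The sketch below runs a reduced grid on eight made-up documents with `cv=2` purely so it finishes instantly; the corpus, the smaller grid, and the fold count are assumptions, not the notebook's settings.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

docs = [
    "great book", "loved this story", "wonderful characters", "enjoyed every page",
    "boring book", "awful story", "terrible characters", "hated every page",
]
labels = ["POSITIVE"] * 4 + ["NEGATIVE"] * 4

vectors = TfidfVectorizer().fit_transform(docs)

# Reduced grid and two folds, chosen only to keep the toy example fast.
search = GridSearchCV(svm.SVC(), {"kernel": ("linear", "rbf"), "C": (1, 4)}, cv=2)
search.fit(vectors, labels)

print(search.best_params_)  # the winning kernel/C combination
print(search.best_score_)   # mean cross-validated accuracy of that combination
```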

In [20]:
print(clf.score(test_x_vectors, test_y))

0.8197115384615384


## Saving Model


In [21]:
with open('sentiment_classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)

## Loading Model

In [22]:
with open('sentiment_classifier.pkl', 'rb') as f:
    loaded_clf = pickle.load(f)
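
The pickle file above stores only the classifier, so scoring raw text in a fresh session would also need the fitted vectorizer. One possible way around that, sketched here on made-up data, is to bundle both steps in a scikit-learn `Pipeline` and pickle the pipeline as a single object. The toy corpus and the in-memory `pickle.dumps`/`pickle.loads` round trip are illustrative assumptions.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

docs = ["great book", "loved this story", "boring book", "awful story"]
labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE"]

# The pipeline vectorizes raw text and classifies it in one fitted object.
pipe = make_pipeline(TfidfVectorizer(), SVC(kernel="linear"))
pipe.fit(docs, labels)

blob = pickle.dumps(pipe)      # serialize vectorizer and classifier together
restored = pickle.loads(blob)  # restore them together

# The restored pipeline accepts raw strings directly.
print(restored.predict(["great story", "boring story"]))
```

As with any pickle, the file should only be loaded from a trusted source and with a compatible scikit-learn version.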

In [23]:
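print(test_x[84])
# Print the review text at index 84 of the test set

loaded_clf.predict(test_x_vectors[0])
# Predict with the reloaded classifier
# NOTE: this call uses index 0, so the prediction below belongs to test_x[0], not to the review printed above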

Thank goodness I found this after all books were published. I would've gone crazy waiting. 6 books. Super quick read - enjoyed immensely!!Here's the premise...Girl meets boy, boy is f'd up, get the picture?  BUT it's so much more than that. It's smart, witty, sexy, sad, and hopeful. Will the girl get her man or vice versa?  Read it!


array(['NEGATIVE'], dtype='<U8')