### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset amazon_baby.csv.

In [None]:
from google.colab import files
uploaded = files.upload()
!ls

Saving amazon_baby.csv to amazon_baby (1).csv
'amazon_baby (1).csv'   amazon_baby.csv   sample_data


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression

def remove_punctuation(text):
    # Delete every ASCII punctuation character (string is already imported above).
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)

baby_df = pd.read_csv('amazon_baby.csv')
baby_df.head()

Unnamed: 0,name,review,rating
0,Planetwise Flannel Wipes,"These flannel wipes are OK, but in my opinion ...",3
1,Planetwise Wipe Pouch,it came early and was not disappointed. i love...,5
2,Annas Dream Full Quilt with 2 Shams,Very soft and comfortable and warmer than it l...,5
3,Stop Pacifier Sucking without tears with Thumb...,This is a product well worth the purchase. I ...,5
4,Stop Pacifier Sucking without tears with Thumb...,All of my kids have cried non-stop when I trie...,5


**Note:** The dataset for this exercise is an Amazon reviews dataset containing information about reviews of baby products.  

We are given:
- The name of the product,
- The review content,
- The rating.

Classification is the process of putting data into classes. In this case, we classify some of the reviews as positive or negative, and then train a prediction model to classify the remaining reviews.

## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (NaN) reviews with the empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) to -1.

In [None]:
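#a)
# Missing reviews are left as NaN here (astype(str) would turn them into the string "nan"); they are filled in b).
baby_df["review"] = baby_df["review"].apply(lambda t: remove_punctuation(t) if isinstance(t, str) else t) #apply works way faster than iterrows + loc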

#short test:
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [None]:
#b)
baby_df["review"] = baby_df["review"].fillna("")
#short test: NaN != NaN, so this is True only if the value is not missing
baby_df["review"][38] == baby_df["review"][38]

True

In [None]:
#c)
baby_df = baby_df.drop(baby_df[baby_df["rating"] == 3].index)
#short test:
sum(baby_df["rating"] == 3)

0

**Note:** Above, the dataset was cleaned by removing punctuation symbols from the reviews and replacing missing reviews with empty strings. Additionally, rows with a rating of 3 were removed: we are classifying reviews as positive or negative, and a rating of 3 (on a 1–5 scale) is considered neutral, so it would not be helpful for this classification.


In [None]:
#d)
baby_df["rating"] = baby_df["rating"].apply(lambda x: 1 if x>= 4 else -1)
#short test:
sum(baby_df["rating"]**2 != 1)

0

**Note:** Finally, to perform binary classification, we convert the ratings as follows:  
- Ratings ≥ 4 are labeled as 1 (positive review)
- Ratings ≤ 2 are labeled as -1 (negative review)
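
The four preparation steps can be tried out on a tiny table first. This is only a sketch: `toy_df` and its contents are made up for illustration and are not part of amazon_baby.csv. Step b) is done before step a) here so that the helper never receives a NaN.

```python
import string

import numpy as np
import pandas as pd


def remove_punctuation(text):
    # Same helper as in the notebook: delete every ASCII punctuation character.
    return text.translate(str.maketrans('', '', string.punctuation))


# Hypothetical toy data, not taken from amazon_baby.csv.
toy_df = pd.DataFrame({
    "review": ["Great, love it!", np.nan, "It's OK.", "Broke in a day..."],
    "rating": [5, 4, 3, 1],
})

# b) missing reviews become "" so that every entry is a string
toy_df["review"] = toy_df["review"].fillna("")
# a) strip punctuation
toy_df["review"] = toy_df["review"].apply(remove_punctuation)
# c) drop neutral ratings with a boolean mask (copy to get an independent frame)
toy_df = toy_df[toy_df["rating"] != 3].copy()
# d) map ratings to the labels 1 / -1
toy_df["rating"] = np.where(toy_df["rating"] >= 4, 1, -1)

print(toy_df)
```

The boolean mask in step c) is an alternative to `drop(...index)`; both leave the original row labels in place, which is why the index of the result has a gap.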

## CountVectorizer
In order to analyze strings, we need to assign them numerical values. We will use one of the simplest string representations, which transforms each string into an $n$-dimensional vector. The number of dimensions $n$ is the size of our dictionary, and each value of the vector is the number of appearances of the corresponding word in the string.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out())
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']
[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]


In [None]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should note a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it lowercases the text and ignores one-letter words such as "I" (both can be changed during initialization). Finally, for test values, CountVectorizer ignores words which are not in its dictionary, such as "love" above.

## Exercise 2
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer.

In [None]:
#a)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(baby_df["review"], baby_df["rating"], test_size=0.2, random_state=42)

In [None]:
#b)
vectorizer = CountVectorizer()
X_train_example = vectorizer.fit_transform(X_train)
X_test_example = vectorizer.transform(X_test)

**Note:** Following the example code, the vectorizer is fitted on the training reviews only, so its dictionary contains every word that occurs in them. The test reviews are then transformed into vectors with the same fitted vectorizer.


## Exercise 3
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [None]:
#a)
model = LogisticRegression(max_iter=1000)
model.fit(X_train_example, Y_train)


In [None]:
#b)
feature_names = vectorizer.get_feature_names_out()
print(model.classes_)
print(model.coef_.shape)

# coef_[0] holds one weight per dictionary word; positive weights point towards classes_[1] (here: 1)
coefs_with_weights = pd.DataFrame({'coef': model.coef_[0], 'word': feature_names})
coefs_with_weights = coefs_with_weights.sort_values(by='coef', ascending=False)
print("10 Most positive words: ")
print(coefs_with_weights.head(10))
print("\n10 Most negative words: ")
print(coefs_with_weights.tail(10))


[-1  1]
(1, 121806)
10 Most positive words: 
            coef         word
61675   2.231293    lifesaver
66910   2.182193        minor
27436   2.117454          con
96391   2.075734    skeptical
91856   2.061794        saves
81092   2.017108          ply
106160  2.003761     thankful
89841   1.941083         rich
78718   1.929964        penny
119869  1.927159  wonderfully

10 Most negative words: 
            coef           word
81527  -2.189135           poor
112597 -2.196774   unacceptable
34292  -2.222455   disappointed
113777 -2.290251       unusable
34323  -2.479963  disappointing
81538  -2.544620         poorly
114572 -2.565633        useless
120403 -2.626610          worst
120422 -2.767761      worthless
34884  -2.903406   dissapointed


**Note:** Logistic Regression is fitted on the word counts of each review. During training it learns one coefficient per dictionary word:  

- A large positive coefficient means the word pushes the prediction towards the positive class.  
- A large negative coefficient means the word pushes the prediction towards the negative class.  

For example, in this dataset the word with the largest positive coefficient is `"lifesaver"`, while the most negative one is the misspelling `"dissapointed"`, ahead of `"worthless"` and `"worst"`; the correctly spelled `"disappointed"` is also in the bottom ten. Most of the negative words are plainly negative terms. The positive list is less obvious: words such as `"minor"`, `"con"` and `"skeptical"` are not positive on their own, and presumably appear in otherwise favourable reviews (for instance when a reviewer mentions a minor drawback).
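
To make the role of the coefficients concrete, the sketch below recomputes the model output by hand on a made-up training set (`texts` and `labels` are hypothetical). For a binary model, the score of a review is the sum of count times coefficient over all words plus the intercept, and the sigmoid of that score is the probability of the class `classes_[1]`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy reviews and labels.
texts = ["love it", "love love it", "great product", "waste of money",
         "broke fast", "waste waste"]
labels = [1, 1, 1, -1, -1, -1]

vec = CountVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

# Score of each review: sum over words of (count * coefficient) + intercept.
scores = X.toarray() @ clf.coef_[0] + clf.intercept_[0]
# Sigmoid of the score = probability of the class clf.classes_[1] (here: 1).
probs = 1.0 / (1.0 + np.exp(-scores))

print(np.allclose(scores, clf.decision_function(X)))
print(np.allclose(probs, clf.predict_proba(X)[:, 1]))
print(clf.coef_[0][vec.vocabulary_["love"]], clf.coef_[0][vec.vocabulary_["waste"]])
```

In this toy set "love" occurs only in positive reviews and "waste" only in negative ones, so their coefficients come out positive and negative respectively.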


## Exercise 4
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [None]:
#a)
y_pred = model.predict(X_test_example)

In [None]:
#b)
y_pred_probability = model.predict_proba(X_test_example)
#hint: model.predict_proba()

In [None]:
#c)
reviews_w_prob = pd.DataFrame({
    'review': X_test,
    'prob_negative': y_pred_probability[:, 0],
    'prob_positive': y_pred_probability[:, 1]
})

reviews_w_prob_sorted_pos = reviews_w_prob.sort_values(by='prob_positive', ascending=False)
reviews_w_prob_sorted_neg = reviews_w_prob.sort_values(by='prob_negative', ascending=False)

print("Top 5 reviews with highest positive probability:\n")
for i in range(5):
    row = reviews_w_prob_sorted_pos.iloc[i]
    print(f"Review index: {row.name}")
    print(f"Positive probability: {row['prob_positive']}")
    print(f"Negative probability: {row['prob_negative']}")
    print(f"Review: {row['review']}\n")


print("\nTop 5 reviews with highest negative probability:\n")
for i in range(5):
    row = reviews_w_prob_sorted_neg.iloc[i]
    print(f"Review index: {row.name}")
    print(f"Positive probability: {row['prob_positive']}")
    print(f"Negative probability: {row['prob_negative']}")
    print(f"Review: {row['review']}\n")



Top 5 reviews with highest positive probability:

Review index: 74899
Positive probability: 1.0
Negative probability: 0.0
Review: We love this highchair  We have a 4 year old and an 8 month old  This is our 3rd highchairFeatures we loveFit  This chair FITS my infant daughter  She fits in this chair without the extra insert way better than in the basic Evenflo chair we had before  I only use the 3point harness and let her shoulders be free and she sits at a correct level so her arms can move around well and she can lean and reach for things on the tray  Many other chairs have a real problem with fit  So I do believe that with the insert this is the perfect chair to start your 4 month infant in for feeding  The insert will make them more secure kind of like their carseatTray Insert  With our other chairs the tray clicks down into the larger tray all the way aroundyou can remove it for cleaning  Fine  But I always hated that food got into the crack nearest to the baby so I couldnt just wi

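**Note:** Above, we can see how “sure” the model was about each review, expressed as a probability between 0 and 1, for the 5 test reviews with the highest positive probability and the 5 with the highest negative probability. The columns of `predict_proba` follow the order of `model.classes_`, here `[-1  1]`, so column 0 is the negative class and column 1 the positive one. After reading the reviews we can see that they were classified correctly.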
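
As a small sketch of the same ranking step, the columns can be looked up by class label rather than by a hard-coded position, and `nlargest` can pick the most confident rows. The arrays `classes` and `proba` below are hypothetical stand-ins for `model.classes_` and the output of `model.predict_proba(...)`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for model.classes_ and model.predict_proba(...).
classes = np.array([-1, 1])
proba = np.array([[0.10, 0.90],
                  [0.80, 0.20],
                  [0.35, 0.65],
                  [0.95, 0.05]])
reviews = pd.Series(["love it", "broke fast", "works fine", "waste of money"],
                    index=[10, 11, 12, 13])

# Look the column up by label instead of hard-coding 0 / 1.
pos_col = int(np.where(classes == 1)[0][0])
neg_col = int(np.where(classes == -1)[0][0])

table = pd.DataFrame({
    "review": reviews,
    "prob_negative": proba[:, neg_col],
    "prob_positive": proba[:, pos_col],
})

# The two most confident reviews on each side.
top_positive = table.nlargest(2, "prob_positive")
top_negative = table.nlargest(2, "prob_negative")
print(top_positive)
print(top_negative)
```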

In [None]:
#d)
correctly_pred_counter = 0
Y_test_numpy = Y_test.to_numpy()
for i in range(len(y_pred)):
  if y_pred[i] == Y_test_numpy[i]:
    correctly_pred_counter += 1
accuracy = correctly_pred_counter/len(y_pred) * 100
print(f"Accuracy percent: {accuracy}")


Accuracy percent: 93.28955653503644


**Note:** To calculate the accuracy of the model, I counted how many test reviews were assigned the same binary class as in the dataset and expressed this as a percentage of all test reviews. The model achieved a fairly high accuracy of about 93%.
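
The counting loop can also be written as a single NumPy expression. The sketch below compares both versions on made-up labels (`y_true` and `y_hat` are hypothetical, using the same 1 / -1 encoding).

```python
import numpy as np

# Hypothetical labels and predictions, using the notebook's 1 / -1 encoding.
y_true = np.array([1, -1, 1, 1, -1, 1, -1, 1])
y_hat = np.array([1, -1, -1, 1, -1, 1, 1, 1])

# Loop version, as in the cell above.
correct = 0
for truth, guess in zip(y_true, y_hat):
    if truth == guess:
        correct += 1
accuracy_loop = correct / len(y_true) * 100

# Vectorized version: the mean of a boolean array is the fraction of True values.
accuracy_vec = np.mean(y_true == y_hat) * 100

print(accuracy_loop, accuracy_vec)
```

`sklearn.metrics.accuracy_score` computes the same quantity as a fraction between 0 and 1.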

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [None]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [None]:
#a)
model_dict = LogisticRegression(max_iter=1000)
vectorizer = CountVectorizer(vocabulary=significant_words)

In [None]:
X_train_example = vectorizer.fit_transform(X_train)
X_test_example_dict = vectorizer.transform(X_test)
model_dict.fit(X_train_example, Y_train)
y_pred = model_dict.predict(X_test_example_dict)

In [None]:
print(vectorizer.get_feature_names_out())

['love' 'great' 'easy' 'old' 'little' 'perfect' 'loves' 'well' 'able'
 'car' 'broke' 'less' 'even' 'waste' 'disappointed' 'work' 'product'
 'money' 'would' 'return']


**Note:** The above code is very similar to what I did before, with one key difference: the `CountVectorizer` now uses the `vocabulary` argument. This means the vectorizer only counts a fixed set of 20 words instead of all 121806 unique words found in the training reviews.

In [None]:
feature_names = vectorizer.get_feature_names_out()
print(model_dict.classes_)
print(model_dict.coef_.shape)
coefs_with_weights = pd.DataFrame({'coef': model_dict.coef_[0], 'word': feature_names})
coefs_with_weights = coefs_with_weights.sort_values(by='coef', ascending=False)
print("10 Most positive words: ")
print(coefs_with_weights.head(10))
print("\n10 Most negative words: ")
print(coefs_with_weights.tail(10))

[-1  1]
(1, 20)
10 Most positive words: 
       coef     word
6  1.684972    loves
5  1.515068  perfect
0  1.359000     love
2  1.193224     easy
1  0.930882    great
4  0.502431   little
7  0.496196     well
8  0.193270     able
9  0.074529      car
3  0.073441      old

10 Most negative words: 
        coef          word
11 -0.201570          less
16 -0.313727       product
18 -0.342239         would
12 -0.489719          even
15 -0.635649          work
17 -0.946424         money
10 -1.680640         broke
13 -1.979571         waste
19 -2.092836        return
14 -2.398751  disappointed


**Note:** The top positively-weighted words, like "loves" or "perfect", are associated with positive experiences, while the most negative words, like "disappointed", "return" and "waste", naturally reflect complaints. This shows that even a lightweight model can capture meaningful patterns.

In [None]:
y_pred_probabilty_dict = model_dict.predict_proba(X_test_example_dict)

In [None]:
reviews_w_prob = pd.DataFrame({
    'review': X_test,
    'prob_negative': y_pred_probabilty_dict[:, 0],
    'prob_positive': y_pred_probabilty_dict[:, 1]
})

reviews_w_prob_sorted_pos = reviews_w_prob.sort_values(by='prob_positive', ascending=False)
reviews_w_prob_sorted_neg = reviews_w_prob.sort_values(by='prob_negative', ascending=False)

print("Top 5 reviews with highest positive probability:\n")
for i in range(5):
    row = reviews_w_prob_sorted_pos.iloc[i]
    print(f"Review index: {row.name}")
    print(f"Positive probability: {row['prob_positive']}")
    print(f"Negative probability: {row['prob_negative']}")
    print(f"Review: {row['review']}\n")


print("\nTop 5 reviews with highest negative probability:\n")
for i in range(5):
    row = reviews_w_prob_sorted_neg.iloc[i]
    print(f"Review index: {row.name}")
    print(f"Positive probability: {row['prob_positive']}")
    print(f"Negative probability: {row['prob_negative']}")
    print(f"Review: {row['review']}\n")

Top 5 reviews with highest positive probability:

Review index: 134265
Positive probability: 0.9999999819310982
Negative probability: 1.8068901819212613e-08
Review: We bought this stroller after selling our beloved BOB rev on craigslist We used the BOB for 9 months for my son but it just wasnt practical I dont jogrun it didnt have a big basket and was very bulky to take into stores quickly However I did love how it unfolded easily but it was heavy to fold up and lift into my small trunk myself Overall I didnt realize what Id need in a stroller until AFTER I had my son Live  learn We did love how easily the BOB would go over pretty much anything Nevertheless we sold it and after extensive research on strollers we decided it was between the uppababy brand because of the large baskets OR the city mini GT because of its easy fold up design After looking over both strollers I decided on the uppababy cruz because of a few main factors It SITS UP I cant tell you how much my son hates being re

**Note:** Interestingly, compared to the previous model, the “lighter” model did not assign a probability of 1 to any review (likely because it has too few features to make extremely confident predictions). The 1.0 printed for the previous model is itself a floating-point rounding effect, since a logistic model never outputs exactly 1. The order of the most positive and negative reviews also changed.

In [None]:
correctly_pred_counter = 0
Y_test_numpy = Y_test.to_numpy()
for i in range(len(y_pred)):
  if y_pred[i] == Y_test_numpy[i]:
    correctly_pred_counter += 1
accuracy_dict = correctly_pred_counter/len(y_pred) * 100
print(f"Accuracy percent: {accuracy_dict}")

Accuracy percent: 86.899943030194


**Note:** The accuracy of the limited-vocabulary model is about 87%, roughly 6 percentage points below the full-vocabulary model at 93%. This is not a drastic drop, considering the number of features was reduced from 121806 to just 20.

In [None]:
#c)

print("On full dictionary: ")
%time y_pred = model.predict(X_test_example)
%timeit y_pred = model.predict(X_test_example)
print(f"Accuracy : {accuracy}")

print(f"\nOn limited dictionary: ")
%time y_pred = model_dict.predict(X_test_example_dict)
%timeit y_pred = model_dict.predict(X_test_example_dict)
print(f"Accuracy : {accuracy_dict}")

#hint: %time, %timeit

On full dictionary: 
CPU times: user 9.76 ms, sys: 98 µs, total: 9.85 ms
Wall time: 9.63 ms
7.38 ms ± 95.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Accuracy : 93.28955653503644

On limited dictionary: 
CPU times: user 1.49 ms, sys: 0 ns, total: 1.49 ms
Wall time: 1.35 ms
899 µs ± 85.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Accuracy : 86.899943030194


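**Note:** Reducing the dictionary size made prediction approximately 8 times faster (about 7.4 ms versus 0.9 ms per call according to `%timeit`). Only `predict` is timed here, not the vectorization of the reviews. As mentioned before, the price is about 6 percentage points of accuracy, so the smaller dictionary is worth considering in practice when speed matters.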
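
The two figures in this comparison follow directly from the printed outputs. The short calculation below just redoes the arithmetic with the numbers copied from the cells above; the variable names are made up for this sketch.

```python
# Numbers copied from the outputs above.
full_ms, limited_ms = 7.38, 0.899                      # %timeit means, in milliseconds
acc_full, acc_limited = 93.28955653503644, 86.899943030194

speedup = full_ms / limited_ms
acc_drop_points = acc_full - acc_limited

print(f"speed-up: {speedup:.1f}x")
print(f"accuracy drop: {acc_drop_points:.2f} percentage points")
```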