### Today we are going to perform a simple sentiment classification of Amazon reviews.

### Please download the dataset `amazon_baby.csv`.

In [409]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import string
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

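def remove_punctuation(text):
    """Return `text` with all ASCII punctuation characters removed."""
    # `string` is already imported at the top of this cell
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)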

baby_df = pd.read_csv('../../amazon_baby.csv')
baby_df.head()

| | name | review | rating |
|---|---|---|---|
| 0 | Planetwise Flannel Wipes | These flannel wipes are OK, but in my opinion ... | 3 |
| 1 | Planetwise Wipe Pouch | it came early and was not disappointed. i love... | 5 |
| 2 | Annas Dream Full Quilt with 2 Shams | Very soft and comfortable and warmer than it l... | 5 |
| 3 | Stop Pacifier Sucking without tears with Thumb... | This is a product well worth the purchase. I ... | 5 |
| 4 | Stop Pacifier Sucking without tears with Thumb... | All of my kids have cried non-stop when I trie... | 5 |


## Exercise 1 (data preparation)
a) Remove punctuation from reviews using the given function.   
b) Replace all missing (NaN) reviews with an empty string "".  
c) Drop all the entries with rating = 3, as they have neutral sentiment.   
d) Set all positive ($\geq$4) ratings to 1 and negative ($\leq$2) ratings to -1.

In [410]:
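#a)
# Note: astype(str) turns missing values (NaN) into the literal string 'nan',
# so after this line there are no NaN entries left for step b) to fill.
baby_df['review'] = baby_df['review'].astype(str).apply(remove_punctuation)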

#short test: 
baby_df["review"][4] == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'
remove_punctuation(baby_df["review"][4]) == 'All of my kids have cried nonstop when I tried to ween them off their pacifier until I found Thumbuddy To Loves Binky Fairy Puppet  It is an easy way to work with your kids to allow them to understand where their pacifier is going and help them part from itThis is a must buy book and a great gift for expecting parents  You will save them soo many headachesThanks for this book  You all rock'

True

In [411]:
#b)
baby_df['review'] = baby_df['review'].fillna('')  # assign the result back instead of using chained inplace=True

#short test:
baby_df["review"][38] == baby_df["review"][38]


True

In [412]:
#c)
baby_df = baby_df[baby_df['rating'] != 3]

#short test:
sum(baby_df["rating"] == 3)

0

In [413]:
#d) 
baby_df['sentiment'] = baby_df['rating'].apply(lambda rating: 1 if rating >= 4 else -1)
#short test:
sum(baby_df["sentiment"]**2 != 1)  # the labels live in the new 'sentiment' column, so check that one

0

## CountVectorizer
To analyze strings, we need to map them to numerical values. We will use one of the simplest string representations, which transforms each string into an $n$-dimensional vector. The number of dimensions is the size of our dictionary, and each entry of the vector is the number of appearances of the corresponding word in the sentence.

In [414]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
reviews_train_example = ["We like apples",
                   "We hate oranges",
                   "I adore bananas",
                   "We like like apples and oranges",
                   "They dislike bananas"]

X_train_example = vectorizer.fit_transform(reviews_train_example)

print(vectorizer.get_feature_names_out()) # newer scikit-learn versions use get_feature_names_out() instead of the older get_feature_names()
print()
print(X_train_example.todense())



['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they'
 'we']

[[0 0 1 0 0 0 1 0 0 1]
 [0 0 0 0 0 1 0 1 0 1]
 [1 0 0 1 0 0 0 0 0 0]
 [0 1 1 0 0 0 2 1 0 1]
 [0 0 0 1 1 0 0 0 1 0]]



### CountVectorizer Output

This output demonstrates how the `CountVectorizer` transforms text data into numerical vectors:

1.  **Vocabulary (`vectorizer.get_feature_names_out()`):**
    *   `['adore' 'and' 'apples' 'bananas' 'dislike' 'hate' 'like' 'oranges' 'they' 'we']`
    *   This is the dictionary of unique words (features) that the `CountVectorizer` learned from the `reviews_train_example` sentences. Each word is assigned an index, which corresponds to a column in the resulting matrix.

2.  **Vectorized Reviews (`X_train_example.todense()`):**
    *   `[[0 0 1 0 0 0 1 0 0 1] ... [0 0 0 1 1 0 0 0 1 0]]`
    *   This is a matrix where each row represents one of the input sentences from `reviews_train_example`, and each column corresponds to a word from the vocabulary listed above.
    *   The values in the matrix indicate the **count** of each word in the respective sentence. For example:
        *   The first row `[0 0 1 0 0 0 1 0 0 1]` corresponds to "We like apples". It shows a count of 1 for 'apples' (index 2), 1 for 'like' (index 6), and 1 for 'we' (index 9), with 0 for all other words.
        *   The fourth row `[0 1 1 0 0 0 2 1 0 1]` corresponds to "We like like apples and oranges". It shows a count of 1 for 'and' (index 1), 1 for 'apples' (index 2), 2 for 'like' (index 6), 1 for 'oranges' (index 7), and 1 for 'we' (index 9).

This transformation is a fundamental step in converting human-readable text into a format that machine learning models can process.
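
To make the counting concrete, here is a minimal plain-Python sketch of the same idea. It only approximates `CountVectorizer`: the helper names `tokenize`, `build_vocabulary`, and `count_vector` are made up for illustration, and the sketch assumes lowercased tokens of two or more word characters and an alphabetically sorted vocabulary.

```python
import re
from collections import Counter

# Mirrors the idea of CountVectorizer's default tokenization:
# lowercase the text and keep tokens of two or more word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")


def tokenize(text):
    """Split a sentence into lowercase tokens of length >= 2."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(sentences):
    """Collect the unique tokens of all sentences in alphabetical order."""
    return sorted({token for s in sentences for token in tokenize(s)})


def count_vector(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(tokenize(sentence))
    return [counts[word] for word in vocabulary]


sentences = [
    "We like apples",
    "We hate oranges",
    "I adore bananas",
    "We like like apples and oranges",
    "They dislike bananas",
]
vocab = build_vocabulary(sentences)
vectors = [count_vector(s, vocab) for s in sentences]

print(vocab)
for row in vectors:
    print(row)
```

On these five sentences the sketch gives the same vocabulary and count matrix as the scikit-learn output above, including the dropped one-letter word "I".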

In [415]:
reviews_test_example = ["They like bananas",
                   "We hate oranges bananas and apples",
                   "We love bananas"] #New word!

X_test_example = vectorizer.transform(reviews_test_example)

print(X_test_example.todense())

[[0 0 0 1 0 0 1 0 1 0]
 [0 1 1 1 0 1 0 1 0 1]
 [0 0 0 1 0 0 0 0 0 1]]


We should acknowledge a few facts. Firstly, CountVectorizer does not take word order into account. Secondly, it ignores one-letter words (this can be changed during initialization). Finally, when transforming test data, CountVectorizer ignores words that are not in its dictionary.

## Exercise 2 
a) Split dataset into training and test sets.     
b) Transform reviews into vectors using CountVectorizer. 

In [416]:

#a)
X_train, X_test, y_train, y_test = train_test_split(baby_df['review'], baby_df['sentiment'], test_size=0.2, random_state=42)

print(f"Len of training set: {len(X_train)}")
print(f"Len of test set: {len(X_test)}")

Len of training set: 133401
Len of test set: 33351


In [None]:
#b)

vectorizer = CountVectorizer()

X_train_vec = vectorizer.fit_transform(X_train)

X_test_vec = vectorizer.transform(X_test)

## Exercise 3 
a) Train LogisticRegression model on training data (reviews processed with CountVectorizer, ratings as they were).   
b) Print 10 most positive and 10 most negative words.

In [None]:
#a)
model = LogisticRegression(max_iter=1000) # raise max_iter above the default of 100 so the solver has room to converge

model.fit(X_train_vec, y_train)

LogisticRegression(max_iter=1000)


In [None]:
#b)
coefficients = model.coef_[0]
feature_names = vectorizer.get_feature_names_out()

word_coefs = list(zip(feature_names, coefficients))

sorted_word_coefs = sorted(word_coefs, key=lambda x: x[1])

# 10 most negative words
most_negative_words = sorted_word_coefs[:10]

# 10 most positive words
most_positive_words = sorted_word_coefs[-10:]

print("10 most positive words:")
for word, coef in reversed(most_positive_words):
    print(f"{word}: {coef:.4f}")

print("\n10 most negative words")
for word, coef in most_negative_words:
    print(f"{word}: {coef:.4f}")

10 most positive words:
lifesaver: 2.2029
minor: 2.1559
con: 2.1066
skeptical: 2.0595
saves: 2.0476
ply: 2.0129
thankful: 2.0057
rich: 1.9425
penny: 1.9062
wonderfully: 1.9001

10 most negative words
dissapointed: -2.8758
worthless: -2.7469
worst: -2.5940
useless: -2.5686
poorly: -2.5307
disappointing: -2.4414
unusable: -2.2661
disappointed: -2.2192
unacceptable: -2.2002
poor: -2.1930


## Exercise 4 
a) Predict the sentiment of test data reviews.   
b) Predict the sentiment of test data reviews in terms of probability.   
c) Find five most positive and most negative reviews.   
d) Calculate the accuracy of predictions.

In [None]:
#a)
y_pred = model.predict(X_test_vec)
print(X_test[:10])
print("Predicted sentiments: ", y_pred[:10])

146345    Not easy to use and didnt fit any of the table...
80867     NOT WHAT I EXPECTED THEY SHOWING IT LIKE ITS A...
148242    The tip about taping half the air output port ...
162192    A bit smaller than I had anticipated but my in...
141236    I just bought this for my grandson  His mom lo...
113594    It is well designed and provides a lot of conc...
100819    Every parent should own this I own two and my ...
181978    The fabric is pretty rough and has kind of a c...
24518     My 2 year old is obsessed with Minnie Mouse an...
16800     There are a few cons to installing these windo...
Name: review, dtype: object
Predicted sentiments:  [ 1 -1 -1  1  1  1  1  1  1 -1]



### Predicted Sentiments

This output shows the sentiment predictions made by the model for the first 10 reviews in the test set.
*   A value of `1` indicates that the model predicted a **positive** sentiment for the review.
*   A value of `-1` indicates that the model predicted a **negative** sentiment for the review.

For example, the first review was predicted as positive, the second and third as negative, and so on. These are the discrete sentiment labels assigned by the model based on its classification.

In [None]:
#b)
y_pred_proba = model.predict_proba(X_test_vec)

print("Example probabilities for class -1 and class 1:") # columns follow model.classes_: -1 (negative) first, then 1 (positive)
print(X_test[:5])
print(y_pred_proba[:5])
#hint: model.predict_proba()

Example probabilities for class -1 and class 1:
146345    Not easy to use and didnt fit any of the table...
80867     NOT WHAT I EXPECTED THEY SHOWING IT LIKE ITS A...
148242    The tip about taping half the air output port ...
162192    A bit smaller than I had anticipated but my in...
141236    I just bought this for my grandson  His mom lo...
Name: review, dtype: object
[[0.47022063 0.52977937]
 [0.78512203 0.21487797]
 [0.9729086  0.0270914 ]
 [0.00174649 0.99825351]
 [0.00164815 0.99835185]]



### Explanation of Prediction Probabilities

This output shows the predicted probabilities for the first five reviews in the test set. Each row corresponds to a review, and the two columns represent the probabilities for each sentiment class:

*   **First column**: Probability that the review has a negative sentiment (class -1).
*   **Second column**: Probability that the review has a positive sentiment (class 1).

For example, for the first review, there is approximately a 47.02% chance of negative sentiment and a 52.98% chance of positive sentiment. These probabilities indicate the model's confidence in its sentiment classification for each review.
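
The link between these probabilities and the discrete labels from a) can be sketched in plain Python. The numbers are copied from the output above; the `classes` list is an assumption that mirrors `model.classes_`, which scikit-learn keeps in sorted order.

```python
# Probabilities copied from the output above.
# Column 0 is P(class -1), column 1 is P(class 1).
probabilities = [
    [0.47022063, 0.52977937],
    [0.78512203, 0.21487797],
    [0.9729086, 0.0270914],
    [0.00174649, 0.99825351],
    [0.00164815, 0.99835185],
]
classes = [-1, 1]  # assumed to match model.classes_

labels = []
for row in probabilities:
    best_column = row.index(max(row))  # column with the larger probability
    labels.append(classes[best_column])

print(labels)  # [1, -1, -1, 1, 1]
```

These labels agree with the first five entries of the predicted sentiments printed in a). The first review is a close call: it is labelled positive with only about 53% confidence.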

In [None]:
#c) 
positive_probabilities = y_pred_proba[:, 1]

sorted_indices = np.argsort(positive_probabilities)

most_negative_reviews_indices = sorted_indices[:5] #lowest probability of being positive, i.e. the most negative
most_positive_reviews_indices = sorted_indices[-5:] #highest probability of being positive

print("5 most positive reviews:")
for i in reversed(most_positive_reviews_indices):
    print(f"Probability: {positive_probabilities[i]:.4f}\nReview: {X_test.iloc[i]}\n")

print("\n5 most negative reviews:")
for i in most_negative_reviews_indices:
    print(f"Probability: {positive_probabilities[i]:.4f}\nReview: {X_test.iloc[i]}\n")
#hint: use the results of b)

5 most positive reviews:
Probability: 1.0000
Review: After much research I purchased an Urbo2 Its exactly what I hoped it would be For one thing its gorgeous the frame the fabrics the leather looking bumper bar and handle everything It has great maneuverability and its easy to push with just one hand even Its easy to fold and unfold and its lightweight The reversible seat is a must for me as my babies always seem to be very happy in a stroller when they can see me I prefer the handle height all the way extended as I am fairly tall but I also love that it can telescope down to a very low height which is great for my older kids who love to push the baby as well as for making a more compact stroller in tight places like a restaurant or a bus The stroller seat is a nice generous size and the padding is the cushiest Ive ever seen Even the strap covers are cushy The seat sits nice and upright and I love that it can lay truly flat for long stroller naps Because it can lay truly flat it would 

### Analysis of Most Positive and Negative Reviews

The model successfully identified reviews with extreme sentiments, assigning probabilities close to 1.0 for highly positive reviews and 0.0 for highly negative ones.

*   **Most Positive Reviews (Probability ~1.0000)**: These reviews consistently praise products, highlighting features like ease of use, comfort, versatility, durability, and overall satisfaction. Common themes include products being "perfect," "loved," "great," and "worth every penny," often solving specific problems for parents.

*   **Most Negative Reviews (Probability ~0.0000)**: These reviews express strong dissatisfaction, citing issues such as poor design, safety concerns, flimsiness, difficulty in assembly, and inadequate customer service. Words like "disappointment," "dangerous," "awful," "waste," and "return" are prominent, indicating severe problems with the product or brand experience.

In [None]:
#d) 
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy of the model: {accuracy:.4f}")

Accuracy of the model: 0.9329


The full-dictionary model reaches an accuracy of about 93% on the test set.
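
Accuracy is simply the fraction of test reviews whose predicted label equals the true label. A minimal sketch of that computation follows; the helper `accuracy` and both label lists are hypothetical and invented for illustration, they are not taken from the dataset.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


# Hypothetical labels, invented for illustration only.
y_true_example = [1, -1, -1, 1, 1, 1, -1, 1, 1, -1]
y_pred_example = [1, -1, -1, 1, 1, 1, 1, 1, 1, -1]

print(accuracy(y_true_example, y_pred_example))  # 0.9
```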

## Exercise 5
In this exercise we will limit the dictionary of CountVectorizer to the set of significant words, defined below.


a) Redo exercises 2-4 using the limited dictionary.   
b) Check the impact of all the words from the dictionary.   
c) Compare accuracy of predictions and the time of evaluation.

In [None]:
significant_words = ['love','great','easy','old','little','perfect','loves','well','able','car','broke','less','even','waste','disappointed','work','product','money','would','return']

In [None]:
#a)
print("--- Model with limited vectorizer ---")
limited_vectorizer = CountVectorizer(vocabulary=significant_words)
X_train_limited_vec = limited_vectorizer.fit_transform(X_train)
X_test_limited_vec = limited_vectorizer.transform(X_test)

# Redo Exercise 3: 
limited_model = LogisticRegression()
limited_model.fit(X_train_limited_vec, y_train)

# Redo Exercise 4: 
y_pred_limited = limited_model.predict(X_test_limited_vec)
accuracy_limited = accuracy_score(y_test, y_pred_limited)

print(f"Accuracy of the model with limited vectorizer: {accuracy_limited:.4f}")

--- Model with limited vectorizer ---
Accuracy of the model with limited vectorizer: 0.8690


An accuracy of about 87% with only 20 words is fairly high, although it is clearly below the 93% of the full-dictionary model.

In [None]:
#b)
limited_coefficients = limited_model.coef_[0]
limited_feature_names = limited_vectorizer.get_feature_names_out()

limited_word_coefs = list(zip(limited_feature_names, limited_coefficients))
sorted_limited_word_coefs = sorted(limited_word_coefs, key=lambda x: x[1], reverse=True)

print("\nInfluence of words in model with limited vectorizer")
for word, coef in sorted_limited_word_coefs:
    print(f"{word}: {coef:.4f}")


Influence of words in model with limited vectorizer
loves: 1.6850
perfect: 1.5151
love: 1.3590
easy: 1.1932
great: 0.9309
little: 0.5024
well: 0.4962
able: 0.1933
car: 0.0745
old: 0.0734
less: -0.2016
product: -0.3137
would: -0.3422
even: -0.4897
work: -0.6356
money: -0.9464
broke: -1.6806
waste: -1.9796
return: -2.0928
disappointed: -2.3988


## Analysis of Word Influence (Limited Dictionary Model)

The analysis of the coefficients from the limited dictionary model reveals how the model has learned to associate specific words with positive or negative sentiment:

*   **Strong Positive Influence Words**: Words such as `loves`, `perfect`, `love`, `easy`, and `great` have high positive coefficients. This indicates that their presence in a review is a strong signal for the model that the sentiment is positive.

*   **Strong Negative Influence Words**: Conversely, words like `disappointed`, `return`, `waste`, and `broke` have large negative coefficients. The model has learned that these words are strongly associated with negative reviews.

*   **Conclusion**: Even with a small, carefully selected set of words, the model is able to effectively distinguish sentiment. These coefficients logically reflect the emotional meaning of the words, confirming that the model has correctly learned patterns in the data.
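
How the coefficients combine into a prediction can be sketched with the logistic function. The coefficients below are copied from the output above (a subset of the 20 words); the intercept is not shown in this notebook, so the sketch assumes a hypothetical value of 0.0, and the helper `positive_probability` is made up for illustration.

```python
import math

# Coefficients copied from the output above (subset of the 20 words).
coefficients = {
    "love": 1.3590,
    "great": 0.9309,
    "broke": -1.6806,
    "waste": -1.9796,
    "disappointed": -2.3988,
}
intercept = 0.0  # hypothetical: the fitted intercept is not shown here


def positive_probability(word_counts):
    """Logistic-regression style P(positive) for a dict of word counts."""
    score = intercept + sum(
        coefficients.get(word, 0.0) * count
        for word, count in word_counts.items()
    )
    return 1.0 / (1.0 + math.exp(-score))


p_pos = positive_probability({"love": 1, "great": 1})
p_neg = positive_probability({"waste": 1, "disappointed": 1})

print(f"love + great: {p_pos:.3f}")
print(f"waste + disappointed: {p_neg:.3f}")
```

Each occurrence of a word adds its coefficient to the score, so positive words push the probability toward 1 and negative words toward 0. With the real fitted intercept the exact numbers would shift, but the direction of each word's effect stays the same.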

In [None]:
#c)

print("\n--- Model Comparison ---")
print(f"Accuracy (full dictionary): {accuracy:.4f}")
print(f"Accuracy (limited dictionary): {accuracy_limited:.4f}")

print("\nPrediction time for the model with the full dictionary:")
%timeit model.predict(X_test_vec)

print("\nPrediction time for the model with the limited dictionary:")
%timeit limited_model.predict(X_test_limited_vec)


--- Model Comparison ---
Accuracy (full dictionary): 0.9329
Accuracy (limited dictionary): 0.8690

Prediction time for the model with the full dictionary:
5.88 ms ± 71.3 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Prediction time for the model with the limited dictionary:
403 μs ± 16.1 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


### Model Comparison Summary

The results highlight a classic trade-off between model accuracy and computational efficiency:

*   **Accuracy**: The model using the full dictionary is more accurate (93.3%) than the model with a limited set of words (86.9%). This is expected, as more features (words) allow the model to capture more detailed patterns in the data.

*   **Speed**: The model with the limited dictionary is significantly faster. Its prediction time is about 403 microseconds, which is over 14 times faster than the 5.88 milliseconds (5880 microseconds) taken by the model with the full dictionary.

In conclusion, while the full-dictionary model provides higher accuracy, the limited-dictionary model offers a massive improvement in prediction speed. The best choice depends on the application's priorities: if maximum accuracy is essential, the full model is better. If speed and efficiency are more critical, the limited model is a strong alternative with a reasonable sacrifice in accuracy.
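
The size of the trade-off follows directly from the measurements printed above. As a small worked calculation (the variable names are chosen for illustration, the numbers are copied from the outputs):

```python
# Numbers copied from the outputs above.
full_accuracy = 0.9329
limited_accuracy = 0.8690
full_time_s = 5.88e-3      # 5.88 ms per prediction pass
limited_time_s = 403e-6    # 403 microseconds per prediction pass

speedup = full_time_s / limited_time_s
accuracy_drop_points = (full_accuracy - limited_accuracy) * 100

print(f"speedup: {speedup:.1f}x")
print(f"accuracy drop: {accuracy_drop_points:.2f} percentage points")
```

This prints a speedup of 14.6x for an accuracy drop of 6.39 percentage points. The timings come from a single machine and a single run of `%timeit`, so the ratio should be read as indicative rather than exact.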