In [29]:
# Initialize Otter
import otter
grader = otter.Notebook("sentiment-analysis-jumia-reviews.ipynb")

## Week 8 - Sentiment Analysis of Jumia Reviews

Product reviews are evaluations or opinions shared by consumers who have purchased and used a specific product or service. These reviews are typically written on online platforms such as e-commerce websites, social media, or review websites.

In this assignment, you will apply your knowledge of sentiment analysis to analyze the sentiments expressed in product reviews by Jumia customers. You will work together as a group to preprocess the text data, build a sentiment analysis model, and interpret the results.




In [30]:
import os
import pandas as pd
import numpy as np
import nltk
# Uncomment on first run if the NLTK tokenizer data is not available locally
# nltk.data.path.append(os.path.join(os.getcwd(), 'nltk_data'))
# nltk.download('punkt')
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import joblib

%matplotlib inline

**Question 1:** Load the product reviews dataset into a variable called `customer_review_df`. Next, write a function called `check_data` to check if the data has been loaded successfully.

**Question 1.1:** Explore the distribution of sentiment labels in the dataset.

**Question 1.2:** Engineer a new feature called `Sentiment` from the _Rating_ column. It takes the values -1, 0, and 1 for `negative`, `neutral`, and `positive`, respectively.
- Reviews with Rating > 3 are positive
- Reviews with Rating = 3 are neutral
- Reviews with Rating < 3 are negative

In [31]:
# load the product reviews dataset
customer_review_df = pd.read_csv('sentiment-analysis-jumia-reviews.csv')

# write a function called `check_data` to check data loading is successful
def check_data():
    # read_csv raises if the file is missing, so also confirm the DataFrame has rows
    if customer_review_df is not None and not customer_review_df.empty:
        print("Data loaded successfully.")
        print(customer_review_df.head())
    else:
        print("Error loading data.")

check_data()

# Define a function to convert ratings to sentiments
def convert_to_sentiment(rating):
    if rating > 3:
        return 1  # Positive
    elif rating == 3:
        return 0  # Neutral
    else:
        return -1  # Negative

# Apply the function to create a 'Sentiment' column
customer_review_df['Sentiment'] = customer_review_df['Rating'].apply(
    convert_to_sentiment)

# Exploring the distribution of sentiment labels
sentiment_distribution = customer_review_df['Sentiment'].value_counts()
print(f"\nThe sentiment distribution \n{sentiment_distribution}")

customer_review_df

Data loaded successfully.
   Rating                   Title  \
0       3               I like it   
1       1  not happy with product   
2       5                    good   
3       4                    Good   
4       4                 quality   

                                              Review  
0         The neck need to be adjusted, it's too big  
1  You people should improve in the item's you pe...  
2                                            Well ok  
3                                           Was Fine  
4                       Quality is very ok with size  

The sentiment distribution 
 1    76
 0    14
-1    10
Name: Sentiment, dtype: int64


| | Rating | Title | Review | Sentiment |
|---|---|---|---|---|
| 0 | 3 | I like it | The neck need to be adjusted, it's too big | 0 |
| 1 | 1 | not happy with product | You people should improve in the item's you pe... | -1 |
| 2 | 5 | good | Well ok | 1 |
| 3 | 4 | Good | Was Fine | 1 |
| 4 | 4 | quality | Quality is very ok with size | 1 |
| ... | ... | ... | ... | ... |
| 95 | 5 | wonderful | Great fabrics. Looks good on me every time | 1 |
| 96 | 5 | Nice quality | Good Quality | 1 |
| 97 | 1 | 4 Top | Very low quality | -1 |
| 98 | 5 | service rendered | The service is good, I thought is children shirt | 1 |
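The sentiment counts printed above are heavily skewed toward the positive class. As a rough sketch, the snippet below turns those counts into class shares and into the accuracy that a hypothetical "always predict the most common label" classifier would reach on the full dataset. The variable names are illustrative and not part of the assignment.

```python
# Sentiment counts copied from the value_counts() output above
sentiment_counts = {1: 76, 0: 14, -1: 10}

total = sum(sentiment_counts.values())

# Share of each class in the full dataset
proportions = {label: count / total for label, count in sentiment_counts.items()}

# Accuracy of a hypothetical classifier that always predicts the most common label
majority_label = max(sentiment_counts, key=sentiment_counts.get)
majority_baseline = sentiment_counts[majority_label] / total

for label, share in proportions.items():
    print(f"Sentiment {label:>2}: {share:.0%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.0%}")
```

Any trained model is worth comparing against this baseline before its scores are read as evidence of learning.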


**Question 2:** Preprocess the text data by completing the following:
- Convert the reviews to lowercase and remove punctuation. 
- Tokenize the text data to split it into individual words or tokens.

**Note**: Assign your final preprocessed dataset to a variable called `processed_customer_review_df`. Failure to do this might result in you not getting a score for this question.


In [32]:
# Preprocess text data
def preprocess_text(text):

    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = ''.join([char for char in text if char.isalnum() or char.isspace()])
    return text


# Apply text preprocessing to 'Review' column
customer_review_df['Review'] = customer_review_df['Review'].apply(
    preprocess_text)

# Tokenize the text data
customer_review_df['Tokens'] = customer_review_df['Review'].apply(
    word_tokenize)

# Combine tokens into a string (needed for feature extraction)
customer_review_df['Tokens'] = customer_review_df['Tokens'].apply(
    lambda tokens: ' '.join(tokens))

processed_customer_review_df = customer_review_df.copy()
processed_customer_review_df

| | Rating | Title | Review | Sentiment | Tokens |
|---|---|---|---|---|---|
| 0 | 3 | I like it | the neck need to be adjusted its too big | 0 | the neck need to be adjusted its too big |
| 1 | 1 | not happy with product | you people should improve in the items you peo... | -1 | you people should improve in the items you peo... |
| 2 | 5 | good | well ok | 1 | well ok |
| 3 | 4 | Good | was fine | 1 | was fine |
| 4 | 4 | quality | quality is very ok with size | 1 | quality is very ok with size |
| ... | ... | ... | ... | ... | ... |
| 95 | 5 | wonderful | great fabrics looks good on me every time | 1 | great fabrics looks good on me every time |
| 96 | 5 | Nice quality | good quality | 1 | good quality |
| 97 | 1 | 4 Top | very low quality | -1 | very low quality |
| 98 | 5 | service rendered | the service is good i thought is children shirt | 1 | the service is good i thought is children shirt |
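To see what the preprocessing step does to a single review, here is a minimal sketch that applies the same lowercase-and-strip logic to the first review in the dataset. It assumes `str.split()` as a stand-in for `word_tokenize` so that it runs without the NLTK `punkt` data; the two tokenizers can differ on some inputs.

```python
def preprocess_text(text):
    """Lowercase the text and drop every character that is not alphanumeric or whitespace."""
    text = text.lower()
    return ''.join(char for char in text if char.isalnum() or char.isspace())


# First review from the dataset, as shown in the Question 1 output
raw_review = "The neck need to be adjusted, it's too big"

cleaned = preprocess_text(raw_review)

# Assumption: whitespace splitting stands in for nltk's word_tokenize here
tokens = cleaned.split()

print(cleaned)
print(tokens)
```

Note that removing the apostrophe turns "it's" into "its", which is why that form appears in the processed table.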


**Question 3:** Split your processed dataset into training and testing sets using an `80:20` split. You can use the variables **X_train, X_test, y_train, y_test** to store the split data.

**Question 3.1:** Choose a feature extraction technique and implement it. You can choose from techniques like `BoW`, `TF-IDF`, or Word Embeddings. Remember to explain your choice.

**Question 3.2:** Train the sentiment analysis model using `MultinomialNB()` to analyse the reviews. 

**Note**: Assign your model to a variable called `sentiment_review_model`. Failure to do this might result in you not getting a score for this question.
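To make the `BoW` option from Question 3.1 concrete, the sketch below fits `CountVectorizer` on three short, hypothetical reviews and prints the resulting vocabulary and count matrix. Each row is a document and each column is a word, so the representation keeps word frequencies but discards word order.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus, for illustration only
docs = ["good quality", "very low quality", "good good fabric"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)

# Recover the vocabulary in column order from the fitted word-to-index mapping
vocab = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = bow.toarray().tolist()

print(vocab)
for doc, row in zip(docs, counts):
    print(f"{doc!r:22} -> {row}")
```

Word counts are non-negative integers, which is the kind of input `MultinomialNB` is designed for. That is one reasonable justification for pairing it with BoW.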

In [33]:

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    processed_customer_review_df['Tokens'],
    processed_customer_review_df['Sentiment'],
    test_size=0.2,
    random_state=42)

# Choose a feature extraction technique (e.g., Bag of Words)
vectorizer = CountVectorizer(max_features=1000)  # Limit to the top 1000 words
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)


# Create and train the sentiment analysis model
sentiment_review_model = MultinomialNB()
sentiment_review_model.fit(X_train_bow, y_train)
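Because the dataset has only 14 neutral and 10 negative reviews, a purely random 80:20 split can leave very few minority-class rows in the test set. One possible refinement, not used in the cell above, is the `stratify` argument of `train_test_split`. The sketch below uses hypothetical placeholder texts with the same label counts as the notebook.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same label counts as the notebook (76 / 14 / 10)
labels = [1] * 76 + [0] * 14 + [-1] * 10
texts = [f"review {i}" for i in range(len(labels))]

# stratify=labels asks for train and test sets with roughly the same class mix
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

print("Train label counts:", Counter(y_train))
print("Test label counts: ", Counter(y_test))
```

Whether this helps in practice depends on the data. It mainly makes the test-set scores less sensitive to how the rare classes happen to fall.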

**Question 4:** Use the trained model to make predictions on the test set, then evaluate it using MAE, MSE, RMSE, and R-squared.

**Note**: Assign your prediction to a variable called `prediction`. Failure to do this might result in you not getting a score for this question.

In [34]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# analyse reviews using the model
prediction = sentiment_review_model.predict(X_test_bow)

# evaluate the model using different metrics
mae = mean_absolute_error(y_test, prediction)
mse = mean_squared_error(y_test, prediction)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, prediction)

# Display the evaluation metrics
print(f"Mean Absolute Error (MAE): {mae:.3f}")
print(f"Mean Squared Error (MSE): {mse:.3f}")
print(f"Root Mean Squared Error (RMSE): {rmse:.5f}")
print(f"R-squared (R²): {r2:.3f}")

Mean Absolute Error (MAE): 0.250
Mean Squared Error (MSE): 0.350
Root Mean Squared Error (RMSE): 0.59161
R-squared (R²): -0.129
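MAE, MSE, RMSE, and R² are regression metrics, so they treat the labels -1, 0, and 1 as numbers. The imports at the top of the notebook also include classification metrics. As a complementary sketch, the snippet below computes accuracy and macro-averaged F1 on a small set of hypothetical labels (not the notebook's test set).

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels, for illustration only (not the notebook's test set)
y_true = [1, 1, 1, 1, 0, -1]
y_pred = [1, 1, 1, 1, 1, 1]  # a model that always predicts "positive"

accuracy = accuracy_score(y_true, y_pred)

# Macro-averaging gives each class equal weight; zero_division=0 scores
# never-predicted classes as 0 instead of emitting a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(f"Accuracy: {accuracy:.3f}")
print(f"Macro F1: {macro_f1:.3f}")
```

In this toy case accuracy looks acceptable while macro F1 is low, because the neutral and negative classes are never predicted. The same pattern is worth checking on imbalanced data such as this dataset.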


<!-- BEGIN QUESTION -->

**Question 5:** What insight can you derive from this data?

The evaluation metrics indicate that the model's performance is not satisfactory. The MAE of `0.25` means that, on average, the predicted label differs from the actual sentiment label by 0.25 on the -1/0/1 scale. The MSE of `0.35` is the average squared difference between predicted and actual labels, so confusing a positive review with a negative one (an error of 2) counts four times as much as an error of 1. The RMSE of approximately `0.59` is the square root of the MSE and expresses that error in the same units as the labels. The R² of `-0.129` is negative, which means the model's squared error on the test set is larger than that of a baseline that always predicts the mean of the test labels.

Taken together, the negative R² and the other error metrics suggest that the model struggles to capture the variability in sentiment labels. This may be related to the class imbalance seen in Question 1.1 (76 positive, 14 neutral, and 10 negative reviews), and it points to the need for further refinement or for alternative modeling approaches.
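The reported MAE and MSE can also be translated into a count of wrong predictions. The sketch below assumes the test set has 20 rows (20% of the 100 labeled reviews) and uses the fact that, with labels in {-1, 0, 1}, every wrong prediction is off by either 1 or 2.

```python
# Reported metrics from the Question 4 output
mae, mse = 0.25, 0.35

# Assumption: the test set has 20 rows (20% of the 100 labeled reviews)
n_test = 20

total_abs_error = round(mae * n_test)  # sum of |y - y_hat|
total_sq_error = round(mse * n_test)   # sum of (y - y_hat)^2

# With labels in {-1, 0, 1}, each wrong prediction is off by 1 or by 2:
#   off_by_one + 2 * off_by_two = total_abs_error
#   off_by_one + 4 * off_by_two = total_sq_error
off_by_two = (total_sq_error - total_abs_error) // 2
off_by_one = total_abs_error - 2 * off_by_two

misclassified = off_by_one + off_by_two
implied_accuracy = 1 - misclassified / n_test

print(f"Off by one: {off_by_one}, off by two: {off_by_two}")
print(f"Implied accuracy: {implied_accuracy:.2f}")
```

Under that assumption the model gets 16 of 20 test reviews right, which is only slightly above the 76% share of positive reviews in the full dataset.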

In [35]:
# save the model to a .pkl file
joblib.dump(sentiment_review_model, 'sentiment_model.pkl')
joblib.dump(vectorizer, 'count_vectorizer.pkl')

['count_vectorizer.pkl']
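The vectorizer is saved alongside the model because new text must be transformed with the same fitted vocabulary before prediction. The following sketch shows the save-and-reload round trip on hypothetical toy data, writing to a temporary directory so that it does not touch the files created above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data, for illustration only
train_texts = ["good quality", "very bad quality", "great fabric", "poor low quality"]
train_labels = [1, -1, 1, -1]

vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "sentiment_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "count_vectorizer.pkl")

    # Persist both objects, then load them back as a separate process would
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# New text goes through the loaded vectorizer (transform only, no refitting)
new_reviews = ["good fabric"]
loaded_prediction = loaded_model.predict(loaded_vectorizer.transform(new_reviews))
print(loaded_prediction)
```

With the notebook's model, new reviews would also need the same `preprocess_text` step applied before calling `transform`.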

<!-- END QUESTION -->




## Submission

Make sure you have run all cells in your notebook in order before running the cell below, so that all images/graphs appear in the output. The cell below will generate a zip file for you to submit. **Please save before exporting!**


In [36]:
# Save your notebook first, then run this cell to export your submission.
# grader.export(run_tests=True)