**Question 1:** Load the product reviews dataset into a variable called `customer_review_df`. Next, write a function called `check_data` to check if the data has been loaded successfully.

**Question 1.1:** Explore the distribution of sentiment labels in the dataset.

**Question 1.2:** Engineer a new feature called `Sentiment` from the _Rating_ column. It takes the values -1, 0, and 1 for `negative`, `neutral`, and `positive`, respectively.
- Reviews with Rating > 3 are positive
- Reviews with Rating = 3 are neutral
- Reviews with Rating < 3 are negative

## Week 8 - Sentiment Analysis of Jumia Reviews

Product reviews are evaluations or opinions shared by consumers who have purchased and used a specific product or service. These reviews are typically written on online platforms such as e-commerce websites, social media, or review websites.

In this assignment, you will apply your knowledge of sentiment analysis to analyze the sentiments expressed in product reviews by Jumia customers. You will work together as a group to preprocess the text data, build a sentiment analysis model, and interpret the results.




In [None]:
import pandas as pd
import numpy as np
import os
import string  # provides string.punctuation, used in preprocess_text
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

%matplotlib inline
import otter
grader = otter.Notebook()

In [None]:
# load the product reviews dataset
customer_review_df = pd.read_csv('sentiment-analysis-jumia-reviews.csv', engine='python') # SOLUTION

# write a function called `check_data` to check data loading is successful
def check_data():
    # BEGIN SOLUTION
    # True when the DataFrame has at least one row, i.e. loading succeeded
    return not customer_review_df.empty
    # END SOLUTION

# Define a function to convert ratings to sentiments
def convert_to_sentiment(rating):
    # BEGIN SOLUTION
    if rating > 3:
        return 1  # Positive
    elif rating == 3:
        return 0  # Neutral
    else:
        return -1  # Negative
    # END SOLUTION

# Apply the function to create a 'Sentiment' column
customer_review_df['Sentiment'] = customer_review_df['Rating'].apply(convert_to_sentiment)  # SOLUTION
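
For Question 1.1, one possible way to explore the label distribution is with `value_counts`. The sketch below uses a small made-up DataFrame (`toy_df`, a hypothetical stand-in for `customer_review_df`) so it runs without the CSV file; the same two calls should carry over to the real `Sentiment` column.

```python
import pandas as pd

# Toy stand-in for customer_review_df; these ratings are made up for illustration.
toy_df = pd.DataFrame({"Rating": [5, 4, 3, 1, 2, 5, 3, 4]})

# Same mapping as convert_to_sentiment: >3 -> 1, ==3 -> 0, <3 -> -1
toy_df["Sentiment"] = toy_df["Rating"].apply(
    lambda r: 1 if r > 3 else (0 if r == 3 else -1)
)

# Absolute number of reviews per sentiment label, ordered by label value
sentiment_counts = toy_df["Sentiment"].value_counts().sort_index()

# Share of each label, useful for spotting class imbalance
sentiment_share = toy_df["Sentiment"].value_counts(normalize=True).sort_index()

print(sentiment_counts)
print(sentiment_share)
```

A heavily skewed distribution would be worth noting before training, since it affects how the evaluation scores should be read.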


**Question 2:** Preprocess the text data by completing the following:
- Convert the reviews to lowercase and remove punctuation. 
- Tokenize the text data to split it into individual words or tokens.

**Note**: Assign your final preprocessed dataset to a variable called `processed_customer_review_df`. Failure to do this might result in you not getting a score for this question.


In [None]:
# Preprocess text data
def preprocess_text(text):

    # Convert to lowercase
    text = text.lower()     # SOLUTION
    # Remove punctuation
    text = ''.join([char for char in text if char not in string.punctuation])   # SOLUTION
    return text

# Apply text preprocessing to 'Review' column
customer_review_df['Review'] = customer_review_df['Review'].apply(preprocess_text)  # SOLUTION

# Tokenize the text data
customer_review_df['Tokens'] = customer_review_df['Review'].apply(word_tokenize)    # SOLUTION

# Combine tokens into a string (needed for feature extraction)
customer_review_df['Tokens'] = customer_review_df['Tokens'].apply(' '.join)     # SOLUTION

processed_customer_review_df = customer_review_df       # SOLUTION
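
To see what the preprocessing steps do to a single review, here is a minimal standalone sketch. It is not the graded solution: `preprocess_text_sketch` and `tokenize_sketch` are hypothetical names, the sample sentence is made up, and a plain whitespace split is assumed as a simplified stand-in for `word_tokenize` so the example runs without NLTK data.

```python
import string

# Translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def preprocess_text_sketch(text):
    """Lowercase the text and strip ASCII punctuation."""
    return text.lower().translate(PUNCT_TABLE)

def tokenize_sketch(text):
    """Whitespace tokenizer; a simplified stand-in for nltk's word_tokenize."""
    return text.split()

sample = "Great phone!!! Fast delivery, but the charger is BAD."
cleaned = preprocess_text_sketch(sample)
tokens = tokenize_sketch(cleaned)

print(cleaned)
print(tokens)
```

Using `str.translate` removes punctuation in a single pass, which tends to be faster than a character-by-character list comprehension on large review sets.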


**Question 3:** Split your processed dataset into training and testing sets using an `80:20` split. You can use the variables **X_train, X_test, y_train, y_test** to store the split dataset.

**Question 3.1:** Choose a feature extraction technique and implement it. You can choose from techniques like `BoW`, `TF-IDF`, or Word Embeddings. Remember to explain your choice.

**Question 3.2:** Train the sentiment analysis model using `MultinomialNB()` to analyse the reviews. 

**Note**: Assign your model to a variable called `sentiment_review_model`. Failure to do this might result in you not getting a score for this question.

In [None]:

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(processed_customer_review_df['Tokens'], 
                                                        processed_customer_review_df['Sentiment'], test_size=0.2, random_state=42)

# Choose a feature extraction technique (e.g., Bag of Words)
vectorizer = CountVectorizer(max_features=1000)  # Limit to the top 1000 words
X_train_bow = vectorizer.fit_transform(X_train)      # SOLUTION
X_test_bow = vectorizer.transform(X_test)            # SOLUTION


# Create and train the sentiment analysis model
sentiment_review_model = MultinomialNB()            # SOLUTION
sentiment_review_model.fit(X_train_bow, y_train)    # SOLUTION
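
Question 3.1 also lists `TF-IDF` as an option. As a rough sketch of that alternative, the example below chains `TfidfVectorizer` and `MultinomialNB` in a scikit-learn pipeline. The reviews and labels (`toy_reviews`, `toy_labels`) are made up for illustration, and `toy_model` is a hypothetical name, not the graded `sentiment_review_model`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, already-preprocessed reviews with -1 / 0 / 1 sentiment labels
toy_reviews = [
    "good product fast delivery",
    "very good quality",
    "average product nothing special",
    "okay for the price",
    "bad quality broke quickly",
    "very bad experience",
]
toy_labels = [1, 1, 0, 0, -1, -1]

# The pipeline fits the vectorizer and the classifier together, so the
# same vocabulary is reused automatically at prediction time.
toy_model = make_pipeline(TfidfVectorizer(max_features=1000), MultinomialNB())
toy_model.fit(toy_reviews, toy_labels)

# predict() accepts raw strings because vectorization happens inside the pipeline
toy_prediction = toy_model.predict(["good product"])
print(toy_prediction)
```

TF-IDF down-weights words that appear in most reviews, while Bag of Words keeps raw counts; which one works better on the Jumia data is something to check empirically.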



**Question 4:** Use the trained model to predict sentiment on the test set, then evaluate the predictions using MAE, MSE, RMSE, and R-squared.

**Note**: Assign your prediction to a variable called `prediction`. Failure to do this might result in you not getting a score for this question.

In [None]:
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# analyse reviews using the model (predict on the vectorized test set, not the raw text)
prediction = sentiment_review_model.predict(X_test_bow)      # SOLUTION

# evaluate the model using different metrics
mae = mean_absolute_error(y_test, prediction)       # SOLUTION
mse = mean_squared_error(y_test, prediction)        # SOLUTION
rmse = np.sqrt(mse)                                 # SOLUTION
r2 = r2_score(y_test, prediction)                   # SOLUTION
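
The notebook imports `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` at the top without using them. Because the labels are categorical, these may be a useful complement to the error metrics above. A minimal sketch, assuming macro averaging across the three classes and using made-up labels (`y_true`, `y_pred`) rather than the real `y_test` and `prediction`:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Made-up true and predicted sentiment labels for illustration
y_true = [1, 1, 0, -1, -1, 1]
y_pred = [1, 0, 0, -1, 1, 1]

# Fraction of reviews whose label was predicted exactly
accuracy = accuracy_score(y_true, y_pred)

# Macro averaging computes each metric per class, then takes the unweighted mean;
# zero_division=0 avoids warnings if a class is never predicted.
precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(accuracy, precision, recall, f1)
```

Macro averaging treats all three sentiment classes equally; `average="weighted"` is an alternative if the class sizes differ a lot.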

**Question 5:** What insight can you derive from this data?

**SOLUTION:**

<!-- END QUESTION -->

