### Amazon Reviews Dataset

This dataset contains Amazon reviews; the `reviews.text` column holds the review text we want to examine. Work through the tasks below to run sentiment analysis on that text and create a new column that labels each review as positive, negative, or neutral.


### Task 1: Read in the dataset


In [14]:
import pandas as pd


# Load the dataset from the local data directory
df = pd.read_csv('data/amazonReviews.csv')
# Display the first few rows
print(df.head())

                     id       asins   brand                  categories  \
0  AVpe7AsMilAPnD_xQ78G  B00QJDU3KY  Amazon  Amazon Devices,mazon.co.uk   
1  AVpe7AsMilAPnD_xQ78G  B00QJDU3KY  Amazon  Amazon Devices,mazon.co.uk   
2  AVpe7AsMilAPnD_xQ78G  B00QJDU3KY  Amazon  Amazon Devices,mazon.co.uk   
3  AVpe7AsMilAPnD_xQ78G  B00QJDU3KY  Amazon  Amazon Devices,mazon.co.uk   
4  AVpe7AsMilAPnD_xQ78G  B00QJDU3KY  Amazon  Amazon Devices,mazon.co.uk   

  colors             dateAdded           dateUpdated  \
0    NaN  2016-03-08T20:21:53Z  2017-07-18T23:52:58Z   
1    NaN  2016-03-08T20:21:53Z  2017-07-18T23:52:58Z   
2    NaN  2016-03-08T20:21:53Z  2017-07-18T23:52:58Z   
3    NaN  2016-03-08T20:21:53Z  2017-07-18T23:52:58Z   
4    NaN  2016-03-08T20:21:53Z  2017-07-18T23:52:58Z   

                  dimension  ean                         keys  ...  \
0  169 mm x 117 mm x 9.1 mm  NaN  kindlepaperwhite/b00qjdu3ky  ...   
1  169 mm x 117 mm x 9.1 mm  NaN  kindlepaperwhite/b00qjdu3ky  ...   
2 

### Task 2: Clean the Text Data

Remove unwanted characters, convert text to lowercase, and strip extra spaces.


In [15]:
import re

def clean_text(text):
    # TODO: Convert text to lowercase 
    text = text.lower()
    
    # TODO: Remove special characters and digits
    text = re.sub(r'[^a-z\s]', '', text)
    
    # TODO: Strip extra spaces
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

# Apply cleaning function
df['clean_text'] = df['reviews.text'].apply(clean_text)
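
To see what the three steps do, here is a small standalone sketch that runs the same function on a made-up string (the sample sentence is an assumption, not taken from the dataset). Note that stripping apostrophes turns "don't" into "dont" and that digits disappear entirely.

```python
import re

def clean_text(text):
    # Lowercase, drop everything except a-z and whitespace, collapse spaces
    text = text.lower()
    text = re.sub(r'[^a-z\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical review text, for illustration only
sample = "  I DON'T like it... 10/10 would   NOT buy again!  "
cleaned = clean_text(sample)
print(cleaned)
```

If the column might contain missing values, one option is to call `df['reviews.text'].fillna('')` before applying the function, since `NaN` has no `.lower()` method.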

### Task 3: Tokenize the Text

Split text into individual words (tokens).


In [16]:
from nltk.tokenize import word_tokenize


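# Apply word tokenization
# (word_tokenize relies on NLTK's Punkt tokenizer data, which must be downloaded once with nltk.download)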
df['tokens'] = df['clean_text'].apply(word_tokenize)
print(df[['clean_text', 'tokens']].head())

                                          clean_text  \
0  i initially had trouble deciding between the p...   
1  allow me to preface this with a little history...   
2  i am enjoying it so far great for reading had ...   
3  i bought one of the first paperwhites and have...   
4  i have to say upfront i dont like coroporate h...   

                                              tokens  
0  [i, initially, had, trouble, deciding, between...  
1  [allow, me, to, preface, this, with, a, little...  
2  [i, am, enjoying, it, so, far, great, for, rea...  
3  [i, bought, one, of, the, first, paperwhites, ...  
4  [i, have, to, say, upfront, i, dont, like, cor...  


### Task 4: Remove Stopwords

Eliminate common stopwords that do not contribute much meaning.


In [17]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    # TODO: Filter out stopwords
    return [word for word in tokens if word not in stop_words]

# Apply stopword removal
df['filtered_tokens'] = df['tokens'].apply(remove_stopwords)
print(df[['tokens', 'filtered_tokens']].head())

                                              tokens  \
0  [i, initially, had, trouble, deciding, between...   
1  [allow, me, to, preface, this, with, a, little...   
2  [i, am, enjoying, it, so, far, great, for, rea...   
3  [i, bought, one, of, the, first, paperwhites, ...   
4  [i, have, to, say, upfront, i, dont, like, cor...   

                                     filtered_tokens  
0  [initially, trouble, deciding, paperwhite, voy...  
1  [allow, preface, little, history, casual, read...  
2  [enjoying, far, great, reading, original, fire...  
3  [bought, one, first, paperwhites, pleased, con...  
4  [say, upfront, dont, like, coroporate, hermeti...  


[nltk_data] Downloading package stopwords to /Users/jff/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
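
In the output above, "dont" survives filtering because the cleaning step removed its apostrophe, so it no longer matches a stop word such as "don't". One possible workaround, sketched here with a tiny hand-written stop word set (an assumption for illustration; NLTK's list is much longer), is to add apostrophe-free variants to the set.

```python
# Small illustrative stop word set (hypothetical; stands in for NLTK's list)
base_stop_words = {"i", "the", "to", "have", "don't", "it's"}

# Add apostrophe-free variants so tokens such as "dont" are filtered too
stop_words = base_stop_words | {w.replace("'", "") for w in base_stop_words}

def remove_stopwords(tokens):
    # Keep only tokens that are not in the extended stop word set
    return [word for word in tokens if word not in stop_words]

tokens = ["i", "have", "to", "say", "upfront", "i", "dont", "like"]
filtered = remove_stopwords(tokens)
print(filtered)
```

Whether to drop negations at all is a judgment call: words like "dont" and "not" can carry sentiment, so it may be better to keep them for some analyses.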


### Task 5: Convert Text into Numerical Representation

Use TF-IDF (Term Frequency-Inverse Document Frequency) to convert text into a numerical format.


In [18]:
from sklearn.feature_extraction.text import TfidfVectorizer

# TODO: Initialize TF-IDF Vectorizer
vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')

# TODO: Fit and transform the text data
X = vectorizer.fit_transform(df['clean_text'])

print(X.shape)  # Display shape of transformed data

(775, 1000)
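
The shape says there are 775 reviews, each described by 1000 term weights. To make those weights less opaque, here is a minimal pure-Python sketch of one common way to compute them. It assumes whitespace tokenization, a smoothed IDF of ln((1 + n) / (1 + df)) + 1, and L2 normalization per document, which to my understanding mirrors scikit-learn's default settings; the `tfidf` helper and the sample documents are made up for illustration and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalized."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts in this document
        weights = {t: freq * idf[t] for t, freq in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Hypothetical mini-corpus, for illustration only
sample_docs = ["great kindle great screen", "screen broke", "great battery"]
sample_weights = tfidf(sample_docs)
for doc, weights in zip(sample_docs, sample_weights):
    print(doc, {t: round(w, 3) for t, w in weights.items()})
```

Terms that are frequent within a document but rare across the corpus get the largest weights, which is why "broke" outweighs "screen" in the second sample document.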


### Task 6: Perform Basic Sentiment Analysis

Classify each review as positive, negative, or neutral using a simple rule-based approach.
TextBlob's polarity score ranges from -1 to 1: a polarity above 0 is positive, below 0 is negative, and exactly 0 is neutral.


In [19]:
from textblob import TextBlob

def get_sentiment(text):
    # TODO: Compute polarity score
    blob = TextBlob(text)
    return blob.sentiment.polarity
    

# Apply sentiment analysis
df['sentiment'] = df['clean_text'].apply(get_sentiment)
print(df[['clean_text', 'sentiment']].head())

                                          clean_text  sentiment
0  i initially had trouble deciding between the p...   0.199715
1  allow me to preface this with a little history...   0.155127
2  i am enjoying it so far great for reading had ...   0.420833
3  i bought one of the first paperwhites and have...   0.143352
4  i have to say upfront i dont like coroporate h...   0.267086
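
The `sentiment` column holds the raw polarity score, while the goal stated at the top is a column of positive, negative, or neutral labels. A minimal sketch of that last step follows, assuming zero as the cutoff; `label_sentiment`, `threshold`, and the `sentiment_label` column name are hypothetical names that do not appear in the notebook.

```python
def label_sentiment(polarity, threshold=0.0):
    # Map a polarity score in [-1, 1] to a label.
    # threshold is an assumed tolerance band around zero (0.0 = strict rule).
    if polarity > threshold:
        return 'positive'
    if polarity < -threshold:
        return 'negative'
    return 'neutral'

# Scores taken from the output above, plus two made-up values
for score in (0.199715, 0.420833, 0.0, -0.25):
    print(score, label_sentiment(score))

# In the notebook this could be applied as (assumed column name):
# df['sentiment_label'] = df['sentiment'].apply(label_sentiment)
```

Raising `threshold` slightly, for example to 0.05, would treat weakly scored reviews as neutral; whether that is desirable depends on the analysis.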
