# NLP - Mini-Project 
#### Submission by Sai Manasa Ivaturi (sivatur@iu.edu)

### About the data

The dataset was downloaded from the repository provided in the assignment and stored locally for this mini-project.

The "Musical Instruments" dataset contains Amazon reviews of musical instruments.

It holds about 10,000 rows of JSON data (10,261 once loaded), where each JSON object carries the key information about one review of a musical instrument.

The attributes of each JSON object are described below.


| Attribute      | Description                      | Example        |
|----------------|----------------------------------|----------------|
| reviewerID     | ID of the reviewer               | A2SUAM1J3GNN3B |
| asin           | ID of the product                | 0000013714     |
| reviewerName   | name of the reviewer             |                |
| helpful        | helpfulness rating of the review | 2/3            |
| reviewText     | text of the review               |                |
| overall        | rating of the product            |                |
| summary        | summary of the review            |                |
| unixReviewTime | time of the review (unix time)   |                |
| reviewTime     | time of the review (raw)         |                |
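
To make the record layout concrete, here is a minimal sketch of parsing one line of such a JSON-lines file with the standard library. Only `reviewerID`, `asin` and the 2/3 helpfulness value come from the table; every other value is a placeholder, and the `[helpful_votes, total_votes]` layout of `helpful` is an assumption about how the 2/3 rating is stored.

```python
import json

# One line of a JSON-lines file. Only reviewerID, asin and the 2/3 helpfulness
# come from the table above; every other value is a placeholder.
line = (
    '{"reviewerID": "A2SUAM1J3GNN3B", "asin": "0000013714", '
    '"reviewerName": "Example Reviewer", "helpful": [2, 3], '
    '"reviewText": "Works as described.", "overall": 5.0, '
    '"summary": "Good", "unixReviewTime": 0, '
    '"reviewTime": "01 1, 1970"}'
)

record = json.loads(line)

# Assumption: "helpful" is stored as [helpful_votes, total_votes].
helpful_votes, total_votes = record["helpful"]
helpful_label = f"{helpful_votes}/{total_votes}"

print(sorted(record))   # the nine attribute names
print(helpful_label)    # 2/3
```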

Dataset source 

    - webpage -> http://jmcauley.ucsd.edu/data/amazon/
    - data -> http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Musical_Instruments_5.json.gz

### Loading and Cleaning

This section loads the data into a dataframe and checks what needs cleaning.

In [1]:
# loading the dataset
import pandas as pd

In [2]:
path = "/Users/manasaivaturi/Desktop/Musical_Instruments_5.json"
df = pd.read_json(path, lines = True)

In [3]:
print ("Dataframe Shape (row, column):\t"+ str(df.shape))

Dataframe Shape (row, column):	(10261, 9)


In [4]:
# df.info() prints its report directly and returns None, so it is called on its own
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10261 entries, 0 to 10260
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   reviewerID      10261 non-null  object
 1   asin            10261 non-null  object
 2   reviewerName    10234 non-null  object
 3   helpful         10261 non-null  object
 4   reviewText      10261 non-null  object
 5   overall         10261 non-null  int64 
 6   summary         10261 non-null  object
 7   unixReviewTime  10261 non-null  int64 
 8   reviewTime      10261 non-null  object
dtypes: int64(2), object(7)
memory usage: 721.6+ KB


As the output above shows, 10,261 reviews were loaded into the dataframe.

One important observation from the summary statistics below is that the mean of the `overall` column is about 4.49 out of 5, meaning that most reviews carry a positive rating.

In [5]:
df.describe()

|       | overall  | unixReviewTime |
|-------|----------|----------------|
| count | 10261.0  | 10261.0        |
| mean  | 4.488744 | 1360606000.0   |
| std   | 0.894642 | 37797350.0     |
| min   | 1.0      | 1095466000.0   |
| 25%   | 4.0      | 1343434000.0   |
| 50%   | 5.0      | 1368490000.0   |
| 75%   | 5.0      | 1388966000.0   |
| max   | 5.0      | 1405987000.0   |


We can confirm this with the counts below: reviews rated 5 or 4 far outnumber those rated 3, 2 or 1.

In [6]:
df["overall"].value_counts()

5    6938
4    2084
3     772
2     250
1     217
Name: overall, dtype: int64
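
As a cross-check, the mean reported by `describe()` can be recomputed from these counts alone. The sketch below uses plain Python and only the numbers printed above.

```python
# Rating counts copied from the value_counts() output above.
counts = {5: 6938, 4: 2084, 3: 772, 2: 250, 1: 217}

total = sum(counts.values())
# Weighted mean: each star value weighted by how often it occurs.
mean_rating = sum(stars * n for stars, n in counts.items()) / total
# Share of reviews rated 4 or 5 stars.
share_4_or_5 = (counts[5] + counts[4]) / total

print(total)
print(round(mean_rating, 6))
print(round(share_4_or_5, 3))
```

The total matches the 10,261 rows of the dataframe, the weighted mean agrees with the 4.488744 reported by `describe()`, and roughly 88% of the reviews carry 4 or 5 stars.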

Turning to cleaning: as the output below shows, only `reviewerName` has null values (27 of them).

No other column contains null values that would need to be treated beforehand.

In [7]:
# checking for null values
df.isnull().sum()

reviewerID         0
asin               0
reviewerName      27
helpful            0
reviewText         0
overall            0
summary            0
unixReviewTime     0
reviewTime         0
dtype: int64

### Cleaning/preprocessing the data

As the output above shows, sentiment analysis does not need every column of the dataset.

The columns most useful for sentiment analysis are
    - reviewText
    - summary
To measure the accuracy of our sentiment analysis, we also keep the `overall` column.

So we create a much simpler dataframe for the project, keeping `overall` and `reviewText`.

In [8]:
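# .copy() makes fin an independent dataframe, so the column assignments below
# do not trigger pandas' SettingWithCopyWarning
fin = df[["overall", "reviewText"]].copy()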

In [9]:
fin

|       | overall | reviewText |
|-------|---------|------------|
| 0     | 5       | Not much to write about here, but it does exac... |
| 1     | 5       | The product does exactly as it should and is q... |
| 2     | 5       | The primary job of this device is to block the... |
| 3     | 5       | Nice windscreen protects my MXL mic and preven... |
| 4     | 5       | This pop filter is great. It looks and perform... |
| ...   | ...     | ... |
| 10256 | 5       | Great, just as expected. Thank to all. |
| 10257 | 5       | I've been thinking about trying the Nanoweb st... |
| 10258 | 4       | I have tried coated strings in the past ( incl... |
| 10259 | 4       | Well, MADE by Elixir and DEVELOPED with Taylor... |


#### changing "overall" column to positive and negative

As seen above, the current rating is a 5-star system, which is more granular than this project needs.

For the sake of experimentation, ease of understanding and simplicity, we reduce it to a binary level: we create a new column that marks each rating as either positive or negative.

To do so, we categorize all reviews with an "overall" score of 3 or above as positive (`True`) and those below 3 as negative (`False`).

In [10]:
def sent(rating):
    # True (positive) for ratings of 3 and above, False (negative) otherwise
    return rating >= 3

In [11]:
fin["rating"] = fin["overall"].apply(sent)

In [12]:
fin["rating"].value_counts()

True     9794
False     467
Name: rating, dtype: int64

Here, the original ratings comprise 9794 positive and 467 negative reviews.

This is our source of truth, and we will calculate the accuracy of our sentiment analysis against it.
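
A quick consistency check, sketched in plain Python: applying the same threshold to the star-rating counts printed earlier should reproduce the 9794/467 split. The helper name `is_positive` is illustrative and not part of the notebook.

```python
# Star-rating counts from the earlier value_counts() output.
counts = {5: 6938, 4: 2084, 3: 772, 2: 250, 1: 217}

def is_positive(rating):
    # Same threshold as the notebook: 3 stars and above count as positive.
    return rating >= 3

positive = sum(n for stars, n in counts.items() if is_positive(stars))
negative = sum(n for stars, n in counts.items() if not is_positive(stars))

print(positive, negative)  # 9794 467
```

Note that the 772 three-star reviews land on the positive side, which contributes to the strong class imbalance.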

In [13]:
fin

|       | overall | reviewText | rating |
|-------|---------|------------|--------|
| 0     | 5       | Not much to write about here, but it does exac... | True |
| 1     | 5       | The product does exactly as it should and is q... | True |
| 2     | 5       | The primary job of this device is to block the... | True |
| 3     | 5       | Nice windscreen protects my MXL mic and preven... | True |
| 4     | 5       | This pop filter is great. It looks and perform... | True |
| ...   | ...     | ... | ... |
| 10256 | 5       | Great, just as expected. Thank to all. | True |
| 10257 | 5       | I've been thinking about trying the Nanoweb st... | True |
| 10258 | 4       | I have tried coated strings in the past ( incl... | True |
| 10259 | 4       | Well, MADE by Elixir and DEVELOPED with Taylor... | True |


### Processing data

With the data in the required format, we can now apply the NLP techniques learnt in this course.

The cleaned data is ready for the following steps:
    
    - Tokenization
    - Removing stop words
    - Removing punctuation
    - Removing trailing contractions (tokens such as `n't` or `'s`)
    - POS tagging
    - Lemmatization
    - Sentiment analysis

Lemmatization reduces a word to its root form; however, if the POS tag is to be taken into account, lemmatization must happen only after POS tagging is complete.

Another important aspect is that `sentiwordnet` uses a different set of POS tags than the NLTK word tagger, so tags have to be mapped before they can be used for a lookup.

However, an advantage of `sentiwordnet` is that a lookup without a POS tag returns the senses for every part of speech of the word, so the sense for the most probable POS tag is always present in the returned list.

For this reason, this project puts more effort into cleaning and tokenization, so that the right word is found in `sentiwordnet`, than into POS tagging and lemmatization.
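
The tag mismatch mentioned above can be bridged with a small mapping. The following is a minimal sketch, not something the notebook does: `penn_to_wordnet` is a hypothetical helper that maps Penn Treebank tags (the kind `nltk.pos_tag` produces) to the one-letter WordNet codes, and the tagged pairs are hand-written for illustration rather than tagger output.

```python
# WordNet / SentiWordNet part-of-speech codes.
NOUN, VERB, ADJ, ADV = "n", "v", "a", "r"

def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code.

    Hypothetical helper, not part of the notebook. Returns None for tags
    that have no WordNet counterpart (determiners, pronouns, ...).
    """
    if tag.startswith("NN"):
        return NOUN
    if tag.startswith("VB"):
        return VERB
    if tag.startswith("JJ"):
        return ADJ
    if tag.startswith("RB"):
        return ADV
    return None

# Hand-written (word, tag) pairs, for illustration only.
tagged = [("great", "JJ"), ("strings", "NNS"), ("sounds", "VBZ"),
          ("really", "RB"), ("the", "DT")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

The mapped code could then be passed as the second argument of `swn.senti_synsets(word, pos)` to restrict a lookup to one part of speech.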

#### importing libraries

In [14]:
from nltk.corpus import stopwords 
import nltk
from nltk.corpus import sentiwordnet as swn 
import string

Given a list of tokens, the function below cleans them by

    - converting tokens to lowercase
    - removing stop words
    - removing punctuation
    - removing trailing contractions

This function will be applied later.

In [15]:
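def remove_stopwords(tokens):
    # Build the stop-word set once per call instead of once per token
    stop_words = set(stopwords.words('english'))
    # Keep a token (lowercased) only if it is
    #   - not a stop word (checked before lowercasing, so "The" is kept as "the")
    #   - not punctuation
    #   - not a trailing contraction such as "n't" or "'s"
    return [token.lower() for token in tokens
            if token not in stop_words
            and token not in string.punctuation
            and "'" not in token]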
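
To see what this filter does on a small input, here is a self-contained sketch. It assumes a tiny hypothetical stop-word set in place of `stopwords.words('english')`, and the second function, `clean_case_insensitive`, is an illustrative variant that the notebook does not use.

```python
import string

# Hypothetical, tiny stop-word set standing in for stopwords.words('english').
STOP_WORDS = {"the", "does", "it"}

def clean(tokens, stop_words=STOP_WORDS):
    # Same filter as remove_stopwords: the stop-word test sees the original
    # casing, and lowercasing happens only for tokens that are kept.
    return [t.lower() for t in tokens
            if t not in stop_words
            and t not in string.punctuation
            and "'" not in t]

def clean_case_insensitive(tokens, stop_words=STOP_WORDS):
    # Variant that lowercases first, so capitalized stop words are dropped too.
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered
            if t not in stop_words
            and t not in string.punctuation
            and "'" not in t]

tokens = ["The", "product", "does", "n't", "work", ",", "the", "end", "."]
print(clean(tokens))                   # ['the', 'product', 'work', 'end']
print(clean_case_insensitive(tokens))  # ['product', 'work', 'end']
```

The capitalized `The` survives the first version because the stop-word list is lowercase. Switching to the second version would change the tokens and therefore the scores reported later in the notebook.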

Given a list of tokens, the function below looks up each token with `senti_synsets` and returns the first matching object (or `None` when there is no match) for further processing.

This function will be applied later.

In [16]:
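def get_scores(tokens):
    synsets = []
    for word in tokens:
        try:
            # take the first SentiWordNet entry returned for the word
            synsets.append(list(swn.senti_synsets(word))[0])
        except IndexError:
            # the word has no SentiWordNet entry
            synsets.append(None)
    return synsets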

The function below takes the `sentiwordnet` objects and computes their mean positive and negative scores, returning `True` when the mean positive score is higher.

Note: the scores are normalized by the number of tokens found in `sentiwordnet`; tokens without an entry are ignored.

In [17]:
def sent_rating(score_objects):
    # Ignore tokens that had no SentiWordNet entry
    found = [i for i in score_objects if i is not None]
    pos = [i.pos_score() for i in found]
    neg = [i.neg_score() for i in found]
    # Mean scores over the tokens that were found (0 if none were found)
    m_p = sum(pos)/len(pos) if len(pos)>0 else 0
    m_n = sum(neg)/len(neg) if len(neg)>0 else 0
    # Positive only when the mean positive score is strictly higher;
    # ties, including reviews with no scored tokens, count as negative
    return m_p > m_n
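
The decision rule can be tried without loading the corpus. The sketch below assumes a hypothetical `FakeSynset` class that mimics only the `pos_score()` and `neg_score()` accessors used above; the scores are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FakeSynset:
    """Hypothetical stand-in for an NLTK SentiSynset; only the two
    score accessors used by the notebook are mimicked."""
    pos: float
    neg: float

    def pos_score(self):
        return self.pos

    def neg_score(self):
        return self.neg

def sent_rating(score_objects):
    found = [s for s in score_objects if s is not None]
    if not found:
        return False  # no scored tokens: tie at 0, treated as negative
    mean_pos = sum(s.pos_score() for s in found) / len(found)
    mean_neg = sum(s.neg_score() for s in found) / len(found)
    return mean_pos > mean_neg

# Made-up scores, for illustration only.
mostly_positive = [FakeSynset(0.75, 0.0), None, FakeSynset(0.0, 0.25)]
all_neutral = [FakeSynset(0.0, 0.0), FakeSynset(0.0, 0.0)]

print(sent_rating(mostly_positive))  # True
print(sent_rating(all_neutral))      # False
print(sent_rating([None, None]))     # False
```

Because a tie counts as negative, a review whose matched senses all score 0.0 on both sides is labelled negative even when the text reads as positive.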

The code below processes the `reviewText` column of the `fin` dataframe by applying each function in turn.

The advantage of these loops is that both the list of functions and the list of columns are parameters, so further functions or columns can be added easily.

In [18]:
tokenize = nltk.word_tokenize
# functions are applied in order: tokenize first, then clean the tokens
functions = [tokenize, remove_stopwords]
for func in functions:
    for column in ["reviewText"]:
        fin[column] = fin[column].apply(func)
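
The same idea of an ordered, swappable list of processing steps can be sketched without pandas or NLTK. Everything here is illustrative: `tokenize` is a whitespace split standing in for `nltk.word_tokenize`, and `drop_short` is a hypothetical extra step showing how a new function slots into the list.

```python
from functools import reduce

def tokenize(text):
    # Whitespace split as a stand-in for nltk.word_tokenize.
    return text.split()

def lowercase(tokens):
    return [t.lower() for t in tokens]

def drop_short(tokens):
    # Hypothetical extra step: drop one-character tokens.
    return [t for t in tokens if len(t) > 1]

def run_pipeline(value, functions):
    # Feed the output of each function into the next one, left to right.
    return reduce(lambda acc, func: func(acc), functions, value)

reviews = ["Great strings a bit pricey", "Nice windscreen"]
pipeline = [tokenize, lowercase, drop_short]
processed = [run_pipeline(r, pipeline) for r in reviews]
print(processed)
```

Order matters: every step after the tokenizer expects a list of tokens, so the tokenizer has to come first.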

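At this point, the `reviewText` column has undergone

    - Tokenization
    - Cleaning of each token:
        - conversion to lowercase
        - removal of stop words
        - removal of punctuation
        - removal of trailing contractions
        
We can validate this by looking at the dataframe. Note that tokens such as `the`, `this` and `i` still appear, because the stop-word check runs before lowercasing and capitalized stop words therefore pass through.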

In [19]:
fin

|       | overall | reviewText | rating |
|-------|---------|------------|--------|
| 0     | 5       | [not, much, write, exactly, supposed, filters,... | True |
| 1     | 5       | [the, product, exactly, quite, affordable.i, r... | True |
| 2     | 5       | [the, primary, job, device, block, breath, wou... | True |
| 3     | 5       | [nice, windscreen, protects, mxl, mic, prevent... | True |
| 4     | 5       | [this, pop, filter, great, it, looks, performs... | True |
| ...   | ...     | ... | ... |
| 10256 | 5       | [great, expected, thank] | True |
| 10257 | 5       | [i, thinking, trying, nanoweb, strings, i, bit... | True |
| 10258 | 4       | [i, tried, coated, strings, past, including, e... | True |
| 10259 | 4       | [well, made, elixir, developed, taylor, guitar... | True |


The line below applies the `get_scores` function to the `reviewText` column and stores the resulting objects in the `scores_reviewText` column of the dataframe.

In [20]:
fin["scores_reviewText"]=fin["reviewText"].apply(get_scores)

The line below applies the `sent_rating` function to the `scores_reviewText` column and stores the resulting boolean prediction in the `new_rating` column of the dataframe.

In [21]:
fin["new_rating"]=fin["scores_reviewText"].apply(sent_rating)

With the sentiment analysis performed on the column, we can now measure its accuracy.

As established earlier in the notebook, the `rating` column is the source of truth for this prediction.

The row-by-row comparison is saved to another column, `same`, in the dataframe.

In [22]:
fin["same"]=(fin["rating"]==fin["new_rating"])

### Comparing the original ratings vs generated 

In [23]:
fin["rating"].value_counts()

True     9794
False     467
Name: rating, dtype: int64

In [24]:
fin["new_rating"].value_counts()

True     7831
False    2430
Name: new_rating, dtype: int64

### Results

In [25]:
results = fin["same"].value_counts()

In [26]:
results

True     7748
False    2513
Name: same, dtype: int64

As the results above show, 7748 reviews are classified correctly and 2513 are classified wrongly.

That is an accuracy of about 75.5% (7748 of 10,261). This is the accuracy achieved with the `sentiwordnet` corpus from NLTK.
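
The three `value_counts()` outputs above are enough to recover the full confusion matrix. The sketch below does that arithmetic in plain Python, using only the printed counts, and compares the accuracy with a baseline that always predicts the majority class. The baseline is an added point of reference, not something the notebook computes.

```python
# Counts copied from the value_counts() outputs above.
actual_pos, actual_neg = 9794, 467      # rating
pred_pos, pred_neg = 7831, 2430         # new_rating
correct, wrong = 7748, 2513             # same

total = actual_pos + actual_neg
accuracy = correct / total

# Solve for the confusion-matrix cells:
#   tp + tn = correct        tp + fp = pred_pos
#   tn + fp = actual_neg  => tp - tn = pred_pos - actual_neg
tp = (correct + pred_pos - actual_neg) // 2
tn = correct - tp
fp = actual_neg - tn
fn = actual_pos - tp

# Accuracy of always predicting the majority class (positive).
baseline = actual_pos / total

print(tp, tn, fp, fn)                           # 7556 192 275 2238
print(round(accuracy, 3), round(baseline, 3))   # 0.755 0.954
```

Under these counts, only 192 of the 467 negative reviews are recognised as negative, and labelling every review positive would already score about 95.4%. With classes this imbalanced, accuracy alone gives a limited picture of how well the lexicon-based approach works.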

In [27]:
fin

|       | overall | reviewText | rating | scores_reviewText | new_rating | same |
|-------|---------|------------|--------|-------------------|------------|------|
| 0     | 5       | [not, much, write, exactly, supposed, filters,... | True | `[<not.r.01: PosScore=0.0 NegScore=0.625>, <muc...` | False | False |
| 1     | 5       | [the, product, exactly, quite, affordable.i, r... | True | `[None, <merchandise.n.01: PosScore=0.0 NegScor...` | True | True |
| 2     | 5       | [the, primary, job, device, block, breath, wou... | True | `[None, <primary.n.01: PosScore=0.0 NegScore=0....` | True | True |
| 3     | 5       | [nice, windscreen, protects, mxl, mic, prevent... | True | `[<nice.n.01: PosScore=0.0 NegScore=0.0>, <wind...` | True | True |
| 4     | 5       | [this, pop, filter, great, it, looks, performs... | True | `[None, <dad.n.01: PosScore=0.0 NegScore=0.0>, ...` | True | True |
| ...   | ...     | ... | ... | ... | ... | ... |
| 10256 | 5       | [great, expected, thank] | True | `[<great.n.01: PosScore=0.0 NegScore=0.0>, <exp...` | False | False |
| 10257 | 5       | [i, thinking, trying, nanoweb, strings, i, bit... | True | `[<iodine.n.01: PosScore=0.0 NegScore=0.0>, <th...` | True | True |
| 10258 | 4       | [i, tried, coated, strings, past, including, e... | True | `[<iodine.n.01: PosScore=0.0 NegScore=0.0>, <tr...` | True | True |
| 10259 | 4       | [well, made, elixir, developed, taylor, guitar... | True | `[<well.n.01: PosScore=0.0 NegScore=0.0>, <make...` | True | True |
