

![data-x](https://raw.githubusercontent.com/afo/data-x-plaksha/master/imgsource/dx_logo.png)


___

#### NAME: Kunhee Kim

#### STUDENT ID: 3036181095
___

## Module 500 NLP Module: Text Processing
### Sentiment Analysis on Lululemon reviews


In [25]:
#make compatible with Python 2 and Python 3
from __future__ import print_function, division, absolute_import

# Remove warnings
import warnings
warnings.filterwarnings('ignore')

# plotting
import matplotlib.pyplot as plt
%matplotlib inline

___

#### About 

As you go through the notebook, you will encounter these main steps in the code: 
1. Reading the file `lululemon_website_reviews_v1.csv` 
2. A function `review_cleaner(review)` that cleans the reviews in the input file.<br>
3. Using VADER to analyze the sentiment
4. Analyzing the resulting sentiment scores
___


#### Data description
>Data source: https://shop.lululemon.com/<br>

<br>

___

## Data Statistics


In [2]:
# regular expressions, NLP toolkit, HTML parsing, and data handling
import re
import nltk
import bs4 as bs
import numpy as np
import pandas as pd
 

# download NLTK resources (POS tagger model, WordNet, stopword list, tokenizer)
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')

# import NLTK text-processing tools
from nltk.tokenize import sent_tokenize # tokenizes sentences
from nltk.stem import PorterStemmer     # stemmer
from nltk.tag import pos_tag            # parts-of-speech tagging
from nltk.corpus import wordnet         # lexical database (synonyms, word senses)
from nltk.stem import WordNetLemmatizer # lemmatizer (dictionary form of a word)
from nltk.corpus import stopwords       # stopwords
from nltk.util import ngrams            # ngram iterator

eng_stopwords = stopwords.words('english')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/konijjang/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/konijjang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/konijjang/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/konijjang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


<br>

**Load Data**

In [6]:
lululemon_data = pd.read_csv("/Users/konijjang/Desktop/INDENG 135/lululemon_website_reviews_v1.csv")
lululemon_data.head()

Unnamed: 0.1,Unnamed: 0,title,review,product
0,0,Perfect for Alaska living,"I love my align leggings, but needed somethin...","Base Pace High-Rise Tight 28"" Brushed"
1,1,Always have such high hopes for tights an...,"Alas, another pair of tights that pill. It is...","Base Pace High-Rise Tight 28"" Brushed"
2,3,Vest,Love the side pockets and new color. Great fi...,Down for It All Vest
3,4,Light Vest,Very light weight vest. Love the green color....,Down for It All Vest
4,5,Runs small,When I order or buy something the is “fit” I ...,Down for It All Vest


In [8]:
lululemon_data.shape

(1938, 4)

In [9]:
print('Average character length of the reviews is:')
lengths = lululemon_data['review'].apply(len)
print(np.mean(lengths))

Average character length of the reviews is:
213.7296181630547


<br>

___


## Preparing data for classification



We'll use the function `review_cleaner` to read in reviews and:

> - Remove HTML tags (using BeautifulSoup)
> - Extract emoticons (emotion symbols, aka smileys :D )
> - Remove non-letters (using a regular expression)
> - Convert all words to lowercase letters and tokenize them (using the `.split()` method on the review strings, so that every word in the review is an element in a list)
> - Remove all the English stopwords from the list of review words
> - Join the words back into one string separated by spaces, and append the emoticons to the end

<br>

**Pro Tip:** Transform the list of stopwords to a set before removing the stopwords -- i.e. assign `eng_stopwords = set(stopwords.words("english"))`. Use the set to look up stopwords. This will substantially speed up the computations (Python is much quicker when searching a set than a list).
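
The speed difference can be checked with a small timing sketch. The word list below is a made-up stand-in for the NLTK stopword list (so the snippet runs without any downloads), and the exact timings depend on the machine; only the relative gap between the two lookups matters.

```python
import timeit

# Hypothetical stand-in for stopwords.words("english"); not the real NLTK list
stopword_list = ["word{}".format(i) for i in range(200)]
stopword_set = set(stopword_list)

# 1,000 tokens to filter, half of them "stopwords"
tokens = ["legging", "word199", "vest", "word0"] * 250

def filter_with_list():
    # each membership test scans the list element by element
    return [w for w in tokens if w not in stopword_list]

def filter_with_set():
    # each membership test is a hash lookup
    return [w for w in tokens if w not in stopword_set]

# Both approaches keep exactly the same tokens
same_result = filter_with_list() == filter_with_set()

list_time = timeit.timeit(filter_with_list, number=20)
set_time = timeit.timeit(filter_with_set, number=20)
print("same result:", same_result)
print("list lookup: {:.4f}s, set lookup: {:.4f}s".format(list_time, set_time))
```

Building the set once, outside the function that is applied to every review, avoids repeating the conversion for each row.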

<br>

In [10]:
from nltk.tokenize import sent_tokenize
import re

def review_cleaner(review):
    '''
        Clean and preprocess a review.
            1. Remove HTML tags
            2. Extract emoticons
            3. Use regex to remove all special characters (only keep letters)
            4. Make strings to lower case and tokenize / word split reviews
            5. Remove English stopwords
            6. Rejoin to one string
        
        @review (type:str) is an unprocessed review string
        @return (type:str) is a 6-step preprocessed review string
    '''
    
    #1. Remove HTML tags (name the parser explicitly to avoid a bs4 warning)
    review = bs.BeautifulSoup(review, "html.parser").text
    
    #2. Use regex to find emoticons (raw string so the backslashes reach the regex engine)
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', review)
    
    #3. Remove non-letters (punctuation, digits, symbols)
    review = re.sub("[^a-zA-Z]", " ", review)
    
    #4. Tokenize into words (all lower case)
    review = review.lower().split()
    
    #5. Remove stopwords
    eng_stopwords = set(stopwords.words("english"))
    review = [w for w in review if w not in eng_stopwords]
    
    #6. Join the review to one sentence, with the emoticons appended at the end
    review = ' '.join(review + emoticons)

    return review
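
To see what step 2 picks up, the emoticon pattern can be tried on its own. The sample sentence below is made up for illustration and is not taken from the dataset.

```python
import re

# Same pattern as in review_cleaner: eyes (: ; =), optional nose (-), mouth ( ) ( D P )
emoticon_pattern = r'(?::|;|=)(?:-)?(?:\)|\(|D|P)'

# Made-up sample text, not a real review
sample = "Great fit :) but the zipper broke :-( Would still buy again ;-)"

found = re.findall(emoticon_pattern, sample)
print(found)  # [':)', ':-(', ';-)']

# Step 3 on its own erases these symbols, which is why they are extracted first
letters_only = re.sub("[^a-zA-Z]", " ", sample).lower().split()
print(letters_only)
```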

In [11]:
lululemon_data['cleaned_text'] = lululemon_data['review'].apply(review_cleaner)

In [12]:
lululemon_data.head()

Unnamed: 0.1,Unnamed: 0,title,review,product,cleaned_text
0,0,Perfect for Alaska living,"I love my align leggings, but needed somethin...","Base Pace High-Rise Tight 28"" Brushed",love align leggings needed something warmer te...
1,1,Always have such high hopes for tights an...,"Alas, another pair of tights that pill. It is...","Base Pace High-Rise Tight 28"" Brushed",alas another pair tights pill disappointing lo...
2,3,Vest,Love the side pockets and new color. Great fi...,Down for It All Vest,love side pockets new color great fit staple t...
3,4,Light Vest,Very light weight vest. Love the green color....,Down for It All Vest,light weight vest love green color fits true size
4,5,Runs small,When I order or buy something the is “fit” I ...,Down for It All Vest,order buy something fit usually go size bigger...


In [13]:
lululemon_data['cleaned_text'][3]

'light weight vest love green color fits true size'

<br>

___


## Using VADER to analyze the sentiment



In [14]:
! pip install vaderSentiment



In [15]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [16]:
obj = SentimentIntensityAnalyzer()

In [17]:
def predict_sentiment(sentence):
    return obj.polarity_scores(sentence)

In [18]:
lululemon_data['polarity_score'] = lululemon_data['cleaned_text'].apply(predict_sentiment)

In [20]:
lululemon_data['polarity_score'][3]

{'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound': 0.7906}

In [21]:
def extract_compound(scores):
    # `scores` is the dict returned by polarity_scores; avoid shadowing the built-in `dict`
    return scores['compound']

In [23]:
lululemon_data['compound'] = lululemon_data['polarity_score'].apply(extract_compound)

In [24]:
lululemon_data.head()

Unnamed: 0.1,Unnamed: 0,title,review,product,cleaned_text,polarity_score,compound
0,0,Perfect for Alaska living,"I love my align leggings, but needed somethin...","Base Pace High-Rise Tight 28"" Brushed",love align leggings needed something warmer te...,"{'neg': 0.0, 'neu': 0.422, 'pos': 0.578, 'comp...",0.9853
1,1,Always have such high hopes for tights an...,"Alas, another pair of tights that pill. It is...","Base Pace High-Rise Tight 28"" Brushed",alas another pair tights pill disappointing lo...,"{'neg': 0.451, 'neu': 0.549, 'pos': 0.0, 'comp...",-0.802
2,3,Vest,Love the side pockets and new color. Great fi...,Down for It All Vest,love side pockets new color great fit staple t...,"{'neg': 0.0, 'neu': 0.399, 'pos': 0.601, 'comp...",0.9618
3,4,Light Vest,Very light weight vest. Love the green color....,Down for It All Vest,light weight vest love green color fits true size,"{'neg': 0.0, 'neu': 0.5, 'pos': 0.5, 'compound...",0.7906
4,5,Runs small,When I order or buy something the is “fit” I ...,Down for It All Vest,order buy something fit usually go size bigger...,"{'neg': 0.07, 'neu': 0.759, 'pos': 0.172, 'com...",0.4749


<br>

___


## Analyze sentiment



In [49]:
# compound > 0 is positive, compound < 0 is negative, compound == 0 is neutral
positive_sentiments = len(lululemon_data[lululemon_data['compound'] > 0])
negative_sentiments = len(lululemon_data[lululemon_data['compound'] < 0])
neutral_sentiments = len(lululemon_data[lululemon_data['compound'] == 0])
print(positive_sentiments)
print(negative_sentiments)
print(neutral_sentiments)

1760
123
55
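
The three counts add up to the 1,938 rows reported by `lululemon_data.shape`, so they can be turned into shares directly. The sketch below does that and also shows an alternative labeling rule: the vaderSentiment README suggests treating compound scores of at least 0.05 as positive and at most -0.05 as negative, which leaves a wider neutral band than the strict zero cutoff used in this notebook. The helper `label_sentiment` and its `threshold` parameter are hypothetical names introduced here for illustration.

```python
# Counts printed above, in order: compound > 0, compound < 0, compound == 0
counts = {"compound > 0": 1760, "compound < 0": 123, "compound == 0": 55}
total = sum(counts.values())  # matches the number of rows in the dataset

# Share of each group as a percentage, rounded to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)


def label_sentiment(compound, threshold=0.05):
    """Hypothetical helper: label a compound score using a +/- threshold band."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Compound scores of the first five rows shown earlier
for score in [0.9853, -0.802, 0.9618, 0.7906, 0.4749]:
    print(score, label_sentiment(score))
```

With this rule, reviews whose compound score is close to zero but not exactly zero would move into the neutral group, so the counts could differ slightly from the ones above.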
