#### Data Preparation

Using pandas to read data

In [1]:
import pandas as pd

df = pd.read_csv('fashion_data.csv')
df.head()

Unnamed: 0,year,season,brand,author of review,location,time,review text
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ..."
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...


#### Data Processing

1. Use NLTK to tokenize text in the "review text" column
2. Give the user options for "bag-of-words", "bag-of-words with stemming and stop words removal", "POS with all nouns" and "POS with only NNP" by customized functions
3. Upon choosing the functions, add top 30 features to dataframe as new columns for further analysis and return a new dataframe
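
The per-review steps above (clean, tokenize, count, keep the top features) can be sketched without any dependencies. This is only an illustration: `top_features_for` is a hypothetical helper, and `str.split()` stands in for `nltk.word_tokenize`.

```python
import re
from collections import Counter

def top_features_for(text, k=30):
    """Return the k most frequent lowercase alphabetic tokens in text."""
    # Replace every non-letter with a space and lowercase, as the class does
    cleaned = re.sub('[^a-zA-Z]', ' ', text).lower()
    # str.split() is a simple stand-in for nltk.word_tokenize
    tokens = cleaned.split()
    return Counter(tokens).most_common(k)

sample = "The dress was red. The coat was red, too!"
features = top_features_for(sample, k=3)
print(features)
```

Each (token, count) pair would then become one column value for that review's row.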

In [9]:
import nltk
import re
class reviewAnalyzer():
    def __init__(self, df, top_features=30):
        '''
        :param df: the original input data; must contain a "review text" column
        :param top_features: number of top features kept for each review
        '''
        
        self.df = df
        self.top_features = top_features
    
    def simple_bags(self, ngrams = 2):
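        '''
        Simple bag-of-words analysis.

        :param ngrams: largest n-gram size to include (1 means single words only)
        :return: a copy of the dataframe with the top feature counts added as columns
        '''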
        
        data = self.df.copy()
        
        for ind in range(len(data)):
            rawText = data.loc[ind, 'review text']
            # Use regular expressions to get a pure letter text and transfer all words to lower case
            letterText = re.sub('[^a-zA-Z]', ' ', rawText).lower()
            tokens = nltk.word_tokenize(letterText, language='english')
            text = []
            for n in range(1, ngrams+1):
                tempText = nltk.ngrams(tokens, n)
                text.extend([' '.join(words).strip() for words in tempText])
            
            freq = nltk.FreqDist(text)
            sorted_freq = sorted(freq.items(),key = lambda k:k[1], reverse = True)[:self.top_features]
            
            for feature, frequency in sorted_freq:
                data.loc[ind, feature] = frequency
        
        data = data.fillna(0)
        return data
    
    
    def bag_of_words_stem_stop(self, ngrams = 3):
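        '''
        Bag-of-words analysis with stop words removal and WordNet lemmatization.

        :param ngrams: largest n-gram size to include (1 means single words only)
        :return: a copy of the dataframe with the top feature counts added as columns
        '''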
        
        data = self.df.copy()
        stopwords = nltk.corpus.stopwords.words('english')
        wnl = nltk.WordNetLemmatizer()
        
        for ind in range(len(data)):
            rawText = data.loc[ind, 'review text']
            # Use regular expressions to get a pure letter text and transfer all words to lower case
            letterText = re.sub('[^a-zA-Z]', ' ', rawText).lower()
            tokens = nltk.word_tokenize(letterText, language='english')
            
            text = []
            for n in range(1, ngrams+1):
                tempText = nltk.ngrams(tokens, n)
                text.extend([' '.join(words).strip() for words in tempText])
            
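            # Drop stop words, then lemmatize with WordNet (lemmatization, not true stemming).
            # Both steps run on whole n-gram strings, so they mainly affect single-word entries.
            noStopWordsText = [words for words in text if words not in stopwords]
            stemmedText = [wnl.lemmatize(words) for words in noStopWordsText]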
            
            freq = nltk.FreqDist(stemmedText)
            sorted_freq = sorted(freq.items(),key = lambda k:k[1], reverse = True)[:self.top_features]
            
            for feature, frequency in sorted_freq:
                data.loc[ind, feature] = frequency
        
        data = data.fillna(0)
        return data
    
    
    def pos_tags(self, posList = ('NN', 'NNP', 'NNS', 'NNPS')):
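        '''
        POS analysis; by default keeps all the noun forms (NN, NNP, NNS, NNPS).

        :param posList: tuple of POS tags to keep, e.g. ('NNP',) for proper nouns only
        :return: a copy of the dataframe with the top feature counts added as columns
        '''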
                
        data = self.df.copy()

        
        for ind in range(len(data)):
            rawText = data.loc[ind, 'review text']
            # Use regular expressions to get a pure letter text and transfer all words to lower case
            letterText = re.sub('[^a-zA-Z]', ' ', rawText).lower()
            tokens = nltk.word_tokenize(letterText, language='english')
            
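            # Tag the tokens and keep only the (word, tag) pairs whose tag is in posList
            posTag = [(word, tag) for (word, tag) in nltk.pos_tag(tokens=tokens) if tag in posList]
            
            freq = nltk.FreqDist(posTag)
            sorted_freq = sorted(freq.items(),key = lambda k:k[1], reverse = True)[:self.top_features]
            
            for feature, frequency in sorted_freq:
                data.loc[ind, feature] = frequency
        
        # As in the other methods, reviews that lack a feature get a count of 0
        data = data.fillna(0)
        return data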

#### Data Analysis

After instantiating the class, users can choose among four approaches, depending on their needs:

1. A simple bag-of-words approach
2. A bag-of-words approach with stemming and stop words removal
3. POS approach and focus on all the noun forms (NN, NNP, NNS, NNPS)
4. POS approach and only focus on NNP
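
Options 3 and 4 differ only in which tags are kept, so option 4 would presumably be called as `pos_tags(posList=('NNP',))`. The sketch below illustrates the filtering on hand-written (word, tag) pairs that stand in for `nltk.pos_tag` output; `count_tags` is a hypothetical helper, not part of the class.

```python
from collections import Counter

# Hand-written (word, tag) pairs standing in for nltk.pos_tag output
tagged = [('Jean', 'NNP'), ('Touitou', 'NNP'), ('showed', 'VBD'),
          ('dresses', 'NNS'), ('in', 'IN'), ('Paris', 'NNP'),
          ('dresses', 'NNS')]

def count_tags(tagged_tokens, keep):
    """Count the words whose POS tag is in the keep collection."""
    keep = set(keep)
    return Counter(word for word, tag in tagged_tokens if tag in keep)

all_nouns = count_tags(tagged, ('NN', 'NNP', 'NNS', 'NNPS'))
proper_only = count_tags(tagged, ('NNP',))
print(all_nouns)
print(proper_only)
```

Note the trailing comma in `('NNP',)`. With a bare string `'NNP'`, the test `tag in posList` becomes a substring check, so the tag `'NN'` would also match.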

In [10]:
# Pass the dataframe in to instantiate the class
analyzer = reviewAnalyzer(df)

##### Simple Bag of Words Approach

The simple_bags() function counts not only single words but also n-gram phrases up to a configurable length (the ngrams argument, default 2), which makes the results easier to interpret.
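
As a rough illustration of how single words and phrases end up in the same frequency table, the sketch below builds all n-grams up to `max_n` with plain Python. `ngram_phrases` is a hypothetical stand-in for the `nltk.ngrams` loop inside the class.

```python
from collections import Counter

def ngram_phrases(tokens, max_n=2):
    """Return all 1..max_n-grams of tokens as space-joined strings."""
    phrases = []
    for n in range(1, max_n + 1):
        # Zipping n staggered views of the token list yields each n-gram
        grams = zip(*(tokens[i:] for i in range(n)))
        phrases.extend(' '.join(gram) for gram in grams)
    return phrases

tokens = ['flower', 'power', 'is', 'flower', 'power']
counts = Counter(ngram_phrases(tokens, max_n=2))
print(counts.most_common(3))
```

Here the phrase "flower power" is counted alongside the single words "flower" and "power".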

In [11]:
simple_bags = analyzer.simple_bags()

Display the first 10 rows of the final dataframe

In [13]:
simple_bags.head(10)

Unnamed: 0,year,season,brand,author of review,location,time,review text,the,a,of,...,start,they start,industry,ishii,flower power,flower,heart after,dressing in,loose capri,another was
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,14.0,9.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,19.0,10.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....,18.0,20.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",24.0,13.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,10.0,10.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,5.0,6.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,19.0,9.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,Spring,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,7.0,9.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",14.0,13.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,Spring,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,19.0,11.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
%matplotlib
simple_bags.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency with Simple Bag of Words Approach')

Using matplotlib backend: MacOSX


<matplotlib.axes._subplots.AxesSubplot at 0x114c0db70>

##### Bag-of-words approach with stemming and stop words removal


The bag_of_words_stem_stop() function uses the NLTK stop word list and the WordNet lemmatizer (lemmatization stands in for stemming here). Users can also set the ngrams argument (default 3) to include phrase frequencies for more insight.
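
As written, the class builds the n-grams first and filters afterwards, so a phrase such as "of the" is not matched by the single-word stop list. One possible variant, sketched below, removes stop words before building phrases. `STOP_WORDS` is a tiny hypothetical list and `filter_then_ngrams` a hypothetical helper; the real code uses `nltk.corpus.stopwords`.

```python
STOP_WORDS = {'the', 'of', 'a', 'was'}  # hypothetical tiny list for illustration

def filter_then_ngrams(tokens, max_n=2):
    """Drop stop words first, then build 1..max_n-grams from what remains."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    phrases = []
    for n in range(1, max_n + 1):
        # Slide a window of size n over the filtered tokens
        phrases.extend(' '.join(kept[i:i + n]) for i in range(len(kept) - n + 1))
    return phrases

tokens = ['the', 'word', 'of', 'the', 'day']
phrases = filter_then_ngrams(tokens)
print(phrases)
```

The trade-off is that filtering first joins words that were not adjacent in the original text ("word day" here), so which order is preferable depends on the analysis.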

In [16]:
bagWords = analyzer.bag_of_words_stem_stop()

Display the first 10 rows of the output

In [46]:
bagWords.head(10)

Unnamed: 0,year,season,brand,author of review,location,time,review text,dtacher,woman,dress,...,flower power,klum,bey heidi,separately big,newfound leg,read flower,big hit,mechanic suit,printed different,season founder
0,2016,2,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,4.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,2,August saw the announcement of big news for A....,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,2,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,2,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,3,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
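
In the output above, a few metadata cells hold small numbers instead of their original values (for example `season` in row 0 and `time` in row 2). A likely cause is that a top feature such as the token "season" shares its name with an existing column, so `data.loc[ind, feature] = ...` overwrites that cell. One possible way to avoid this, sketched here on a plain dict standing in for a dataframe row, is to prefix feature names; the `feat_` prefix and `add_features` helper are hypothetical.

```python
def add_features(row, counts, prefix='feat_'):
    """Return a copy of row with prefixed feature counts added."""
    updated = dict(row)
    for feature, frequency in counts.items():
        # The prefix keeps a token like 'season' from clobbering the 'season' column
        updated[prefix + feature] = frequency
    return updated

row = {'season': 'Spring', 'brand': 'A Dtacher'}
counts = {'season': 2, 'dress': 3}
new_row = add_features(row, counts)
print(new_row)
```

With a dataframe, the same idea would amount to writing to `data.loc[ind, prefix + feature]` instead of `data.loc[ind, feature]`.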


In [45]:
bagWords.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequen')

<matplotlib.axes._subplots.AxesSubplot at 0x11addd2b0>