## Data Preparation

Read the data with pandas and display the first five rows:

In [1]:
import pandas as pd

df = pd.read_csv('fashion_data.csv')
df.head()

Unnamed: 0,year,season,brand,author of review,location,time,review text
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ..."
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...


## Importing the Custom Python Module

In [4]:
from ReviewAnalyzer import reviewAnalyzer
# Pass the DataFrame to the constructor to create an analyzer instance
analyzer = reviewAnalyzer(df)

## Data Analysis

Once the analyzer has been created, four analyses are available, depending on the user's needs:

1. A simple bag-of-words approach: reviewAnalyzer.simple_bags()
2. A bag-of-words approach with stemming and stop-word removal: reviewAnalyzer.bag_of_words_stem_stop()
3. A POS approach that focuses on all noun forms (NN, NNP, NNS, NNPS): reviewAnalyzer.pos_tags()
4. A POS approach that focuses only on NNP: reviewAnalyzer.pos_tags(posList=['NNP'])

#### 1. Simple Bag-of-Words Approach

The simple_bags() function counts single words and can also add n-gram phrases to the analysis, which can make the results easier to interpret. In this case single words are sufficient, so ngrams is set to 1.
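The source of simple_bags() is not shown here, so the following is only a minimal sketch of how word and n-gram counting might work. The function name `ngram_counts`, the regular-expression tokenizer, and the reading of `ngrams` as "count every phrase length from 1 up to this value" are all assumptions made for illustration.

```python
import re
from collections import Counter


def ngram_counts(text, ngrams=1):
    """Count every phrase of 1..ngrams consecutive words in a text.

    Hypothetical helper: the tokenizer and the meaning of `ngrams`
    are assumptions, not the module's actual behavior.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter()
    for n in range(1, ngrams + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


toy_review = "The dress was the word of the day"
unigrams = ngram_counts(toy_review, ngrams=1)
with_bigrams = ngram_counts(toy_review, ngrams=2)

print(unigrams.most_common(3))
print(with_bigrams["the day"])
```

With ngrams=1 only single words are counted; raising it adds phrase keys such as "the day" next to the single-word keys.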

In [5]:
simple_bags = analyzer.simple_bags(ngrams=1)

0      Detachment was the word of the day at A Dtache...
1      You heard this collection coming long before y...
2      August saw the announcement of big news for A....
3      They call me the king of basics, Jean Touitou ...
4      Natalia Alaverdian is a designer with a lot of...
5      Process has always been paramount to Ace & Jig...
6      A bohemian circle of muses have been in heavy ...
7      Walking into Adam Lippess Washington Square ap...
8      Ever the adventurer, Adam Selman will gleefull...
9      Hanako Maeda has been busy rediscovering her J...
10     According to the folks at Adidas HQ, this seas...
11     This collection was an ode to Tiresias, a bird...
12     This was a poignant presentationyet not necess...
13     Akriss Albert Kriemler has long looked to arch...
14     The backdrop of Alberta Ferrettis show was an ...
15     For Spring, Alessandra Rich unveiled an unabas...
16     And then there was light. . . . Alexander Lewi...
17     The girls had pink-cheek

KeyError: "['the'] not in index"

In [27]:
simple_bags.to_csv('output.csv')

Display the first 10 rows of the final dataframe:

In [22]:
simple_bags.head(10)

Unnamed: 0,year,season,brand,author of review,location,time,review text,the,a,of,...,gerbase,inserts,trajectory,start,dirt,industry,ishii,flower,patch,yep
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,14.0,9.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,19.0,10.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....,18.0,20.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",24.0,13.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,10.0,10.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,5.0,6.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,19.0,9.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,Spring,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,7.0,9.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",14.0,13.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,Spring,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,19.0,11.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Plot the top 30 concepts (the keywords with the highest frequency):

In [6]:
%matplotlib
simple_bags.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency - Simple Bag of Words Approach')

Using matplotlib backend: MacOSX


<matplotlib.axes._subplots.AxesSubplot at 0x11eef5cf8>

In [15]:
simple_bags.iloc[:, 2:].groupby('time').sum()

Unnamed: 0_level_0,the,a,of,with,and,that,dtacher,to,you,at,...,carlo,dirt,start,mounds,ishii,flower,industry,proposed,heidi,convert
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,122.0,84.0,77.0,29.0,76.0,27.0,0.0,56.0,2.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,200.0,127.0,109.0,41.0,110.0,28.0,0.0,74.0,2.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,42.0,42.0,35.0,4.0,37.0,19.0,0.0,31.0,5.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,7.0,15.0,8.0,2.0,7.0,4.0,0.0,4.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"November 10, 2015",11.0,8.0,4.0,3.0,19.0,2.0,0.0,5.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"October 1, 2015",214.0,120.0,99.0,34.0,125.0,45.0,0.0,86.0,6.0,22.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"October 12, 2015",34.0,23.0,27.0,6.0,16.0,13.0,0.0,22.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"October 13, 2015",4.0,9.0,4.0,3.0,7.0,0.0,0.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"October 14, 2015",10.0,15.0,16.0,4.0,10.0,5.0,0.0,10.0,3.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"October 15, 2015",28.0,33.0,23.0,3.0,16.0,6.0,0.0,5.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


#### 2. Bag-of-Words Approach with Stemming and Stop-Word Removal


The bag_of_words_stem_stop() function removes NLTK stop words and normalizes the remaining words with the WordNet lemmatizer. Users can also choose an n-gram approach to include phrase frequencies for more insight.
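As a rough illustration of stop-word removal followed by word normalization, the sketch below uses a tiny hand-written stop-word set and a crude plural rule in place of the NLTK resources, so that it runs without any downloads. The names `content_word_counts`, `TOY_STOP_WORDS`, and `toy_lemmatize` are hypothetical and are not part of the module.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set; NLTK's English list is much longer.
TOY_STOP_WORDS = {"the", "a", "of", "was", "at", "and", "for"}


def toy_lemmatize(word):
    """Crude stand-in for a lemmatizer: maps 'dresses' to 'dress'."""
    if word.endswith("sses"):
        return word[:-2]
    return word


def content_word_counts(text, stop_words=TOY_STOP_WORDS, lemmatize=toy_lemmatize):
    """Drop stop words, normalize what is left, and count the results."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = [lemmatize(t) for t in tokens if t not in stop_words]
    return Counter(kept)


# With NLTK installed and its 'stopwords' and 'wordnet' corpora downloaded,
# the real resources could be passed in instead:
#   from nltk.corpus import stopwords
#   from nltk.stem import WordNetLemmatizer
#   content_word_counts(text, set(stopwords.words("english")),
#                       WordNetLemmatizer().lemmatize)

counts = content_word_counts("The dresses and a dress for the woman")
print(counts)
```

Normalizing after stop-word removal merges "dresses" and "dress" into one key, which is why the resulting vocabulary is smaller than in the simple approach.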

In [30]:
bagWords = analyzer.bag_of_words_stem_stop()

Display the first 10 rows of the output:

In [34]:
bagWords.head(10)

Unnamed: 0,year,season,brand,author of review,location,time,review text,dtacher,dress,woman,...,lims,compost,installed,mound,waste,phillip,ishii,yep,mechanic,denimheads
0,2016,2,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,4.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,2,August saw the announcement of big news for A....,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,2,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,2,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,3,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
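In the table above, some season, brand, and time cells hold small integers instead of the original values. One plausible explanation, not confirmed here because the module's code is not shown, is that word-count columns named "season", "brand", or "time" overwrite the metadata columns of the same name. The sketch below reproduces that effect with plain dictionaries and shows one possible guard, a prefix on word keys; `merge_counts` and the `w_` prefix are hypothetical.

```python
def merge_counts(metadata, counts, prefix=""):
    """Combine a metadata row with word counts into a single row.

    With an empty prefix, a word that shares its name with a metadata
    field silently replaces that field's value.
    """
    row = dict(metadata)
    for word, n in counts.items():
        row[prefix + word] = n
    return row


# Toy values, chosen only to illustrate the collision.
metadata = {"season": "Spring", "brand": "A Dtacher"}
counts = {"season": 2, "dress": 3}

clobbered = merge_counts(metadata, counts)
safe = merge_counts(metadata, counts, prefix="w_")

print(clobbered["season"])
print(safe["season"], safe["w_season"])
```

If this is indeed the cause, keeping word columns in a separate namespace would also keep a later groupby on time or season from mixing counts with real values.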


In [36]:
bagWords.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency - Bag of Words Approach Excluded Stopwords and Stemmed ')

<matplotlib.axes._subplots.AxesSubplot at 0x1217a2e80>

#### 3. POS Approach Focusing on All Noun Forms (NN, NNP, NNS, NNPS)
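The tag filter behind pos_tags() and its posList argument might look roughly like the sketch below. It starts from tokens that are already tagged with Penn Treebank tags, so it needs no tagger; the pairs are hand-tagged for illustration and have the same shape that nltk.pos_tag(nltk.word_tokenize(text)) returns. The name `count_by_pos` and the lowercasing step are assumptions.

```python
from collections import Counter

# The four noun tags named in the heading above.
NOUN_TAGS = ("NN", "NNP", "NNS", "NNPS")


def count_by_pos(tagged_tokens, pos_list=NOUN_TAGS):
    """Count lowercased words whose POS tag is in pos_list."""
    wanted = set(pos_list)
    return Counter(word.lower() for word, tag in tagged_tokens if tag in wanted)


# Hand-tagged toy sentence (assumed tags, not tagger output).
tagged = [
    ("Natalia", "NNP"), ("Alaverdian", "NNP"), ("is", "VBZ"),
    ("a", "DT"), ("designer", "NN"), ("with", "IN"), ("ideas", "NNS"),
]

all_nouns = count_by_pos(tagged)
proper_only = count_by_pos(tagged, pos_list=["NNP"])

print(all_nouns)
print(proper_only)
```

Restricting the list to NNP keeps only proper nouns, which tends to surface names of people, brands, and places rather than garments.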

In [64]:
posTag = analyzer.pos_tags()

In [69]:
posTag.head(10)

Unnamed: 0,year,season,brand,author of review,location,time,review text,clothes,woman,dresses,...,dirt,lims,compost,phillip,directness,food,rebirth,ishii,varietals,pima
0,2016,1,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,3.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,2,August saw the announcement of big news for A....,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,1,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,1,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,1,Process has always been paramount to Ace & Jig...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,1,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,1,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,2,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [66]:
posTag.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency - POS Approach (All Nouns) ')

<matplotlib.axes._subplots.AxesSubplot at 0x123e75c18>

#### 4. POS Approach Focusing Only on NNP

In [78]:
posTagNNP = analyzer.pos_tags(posList=['NNP'])

In [70]:
posTagNNP.head(10)

Unnamed: 0,year,season,brand,author of review,location,time,review text,berber,madison,zealand,...,treacly,kenya,navy,mood,zip,kaleidoscopic,kick,robber,zac,zimmermann
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,Spring,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,Spring,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [74]:
posTagNNP.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency - POS Approach (NNP only) ')

<matplotlib.axes._subplots.AxesSubplot at 0x123e75c18>