## Data Preparation

Use pandas to read the data and show the first five rows:

In [2]:
import pandas as pd

df = pd.read_csv('fashion_data.csv')

df.head()

Unnamed: 0,year,season,brand,author of review,location,time,review text
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ..."
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...


## Self-written Python Module Import

In [3]:
from ReviewAnalyzer import reviewAnalyzer
# pass the DataFrame to the constructor to create an analyzer instance
analyzer = reviewAnalyzer(df)

## Data Analysis

After creating the analyzer, four approaches are available, depending on the user's needs:

1. A simple bag-of-words approach: reviewAnalyzer.simple_bags()
2. A bag-of-words approach with stemming and stop words removal: reviewAnalyzer.bag_of_words_stem_stop()
3. A POS approach that focuses on all the noun forms (NN, NNP, NNS, NNPS): reviewAnalyzer.pos_tags()
4. A POS approach that focuses only on NNP: reviewAnalyzer.pos_tags(posList=['NNP'])
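
To make the first approach concrete, here is a minimal sketch of a bag-of-words count built with pandas. It is not the ReviewAnalyzer implementation, which is not shown in this notebook; the `count_words` helper, the tokenizing regex, and the two sample reviews are assumptions made purely for illustration.

```python
import re
from collections import Counter

import pandas as pd


def count_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(tokens))


# Two made-up reviews standing in for the 'review text' column (hypothetical data).
reviews = pd.Series([
    "The collection was bold and the clothes were bright.",
    "A quiet collection of clothes.",
])

# One row per review, one column per word; words absent from a review get 0.
bag = pd.DataFrame([count_words(t) for t in reviews]).fillna(0)
```

Each row of `bag` then plays the same role as the word-count columns that appear after the metadata columns in the outputs of this notebook.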

#### 1. Simple Bag of Words Approach

The simple_bags() function counts single words and can also add custom n-gram phrases to the analysis to make the results easier to interpret. In this case, however, we don't need the n-gram approach, so we pass ngrams=1.
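
For readers unfamiliar with n-grams, the sketch below shows how phrases of n consecutive words can be generated from a token list. The `ngrams` helper here is a hypothetical stand-alone function, not the one inside ReviewAnalyzer, and it assumes that n=1 yields single words only.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens joined into one phrase."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# A short phrase taken from one of the review previews.
tokens = "a bohemian circle of muses".split()

unigrams = ngrams(tokens, 1)  # single words
bigrams = ngrams(tokens, 2)   # two-word phrases
```

Counting the resulting phrases the same way as single words would add one column per phrase to the output.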

In [4]:
simple_bags = analyzer.simple_bags(ngrams=1)

In [5]:
simple_bags.to_csv('output.csv')

Display the first 10 rows of the final dataframe:

In [6]:
simple_bags.head(10)

Unnamed: 0,YEAR,SEASON,BRAND,AUTHOR OF REVIEW,LOCATION,TIME,REVIEW TEXT,the,a,of,...,gerbases,gerbase,brandelli,dirt,start,lims,ishii,flower,et,varietals
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,14.0,9.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,19.0,10.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....,18.0,20.0,10.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",24.0,13.0,4.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,10.0,10.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,5.0,6.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,19.0,9.0,11.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,Spring,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,7.0,9.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",14.0,13.0,12.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,Spring,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,19.0,11.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Plot the top 30 concepts (keywords with the highest frequency):

In [7]:
%matplotlib
simple_bags.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency - Simple Bag of Words Approach')

Using matplotlib backend: MacOSX


<matplotlib.axes._subplots.AxesSubplot at 0x1122b2320>

Get the top keywords (concepts) for further analysis:

In [20]:
top_concepts = simple_bags.iloc[:, 7:].sum().sort_values(ascending=False)[:30].index.values
top_concepts

array(['the', 'a', 'and', 'of', 'to', 'in', 'with', 'that', 'was', 'for',
       'on', 'as', 'it', 'her', 'his', 'were', 'is', 'this', 'but', 'at',
       'from', 'he', 'an', 'their', 'she', 'its', 'collection', 'by',
       'all', 'or'], dtype=object)

Group by TIME to see the keyword trends over time. The TIME values are strings, so after grouping we convert the index to datetime values and sort it chronologically:

In [54]:
groupedData = simple_bags.iloc[:, 2:].groupby('TIME').sum()
groupedData.index = pd.to_datetime(groupedData.index)
groupedData.sort_index(inplace=True)
groupedData

TIME,the,a,of,and,with,dtacher,that,to,clothes,you,...,gerbases,gerbase,brandelli,dirt,start,lims,ishii,flower,et,varietals
2015-09-08,121.0,76.0,68.0,68.0,36.0,0.0,17.0,42.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-09,158.0,119.0,107.0,127.0,54.0,0.0,34.0,61.0,7.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-10,307.0,198.0,176.0,168.0,87.0,0.0,53.0,102.0,4.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-11,297.0,240.0,188.0,206.0,89.0,0.0,74.0,120.0,6.0,8.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-12,337.0,177.0,156.0,162.0,56.0,0.0,59.0,106.0,4.0,6.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-13,268.0,202.0,137.0,169.0,58.0,4.0,65.0,103.0,4.0,5.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-14,513.0,382.0,272.0,275.0,116.0,0.0,113.0,210.0,12.0,6.0,...,0.0,0.0,0.0,3.0,3.0,2.0,2.0,2.0,1.0,1.0
2015-09-15,253.0,184.0,142.0,151.0,56.0,0.0,72.0,113.0,6.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-16,278.0,154.0,143.0,161.0,58.0,0.0,66.0,112.0,5.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2015-09-17,182.0,112.0,75.0,111.0,43.0,0.0,17.0,76.0,6.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
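
The conversion to datetime matters because date strings sort alphabetically rather than chronologically. The following sketch illustrates the difference on three made-up counts; the values and the explicit `format` string are assumptions chosen to match the date style seen in the TIME column.

```python
import pandas as pd

# Made-up counts indexed by date strings in the same style as the TIME column.
counts = pd.Series(
    [3.0, 1.0, 2.0],
    index=["September 13, 2015", "October 1, 2015", "September 8, 2015"],
)

# Sorted as text: alphabetical, so October comes before September.
as_text = counts.sort_index()

# Sorted as dates: parse the strings first, then sort chronologically.
as_dates = counts.copy()
as_dates.index = pd.to_datetime(as_dates.index, format="%B %d, %Y")
as_dates = as_dates.sort_index()
```

Here `as_text` starts with the October entry, while `as_dates` starts with September 8, which is the order a trend plot needs.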


Plot the top concepts over time to show the trend (legends are only for top 5 keywords):

In [30]:
ax1 = groupedData.loc[:, top_concepts].plot(title='Top 30 Keywords Trends over Time')
lines, labels = ax1.get_legend_handles_labels()
ax1.legend(lines[:5], labels[:5], loc='best')

<matplotlib.legend.Legend at 0x121b9b0b8>

Since we need to draw the same plot repeatedly, wrap the steps above in a function for easy reuse:

In [61]:
def myPlot(df, byCol='TIME', topKeywords=30, topLegend=5):
    '''
    Plot the trend of the top concepts over time.
    :param df: input pandas dataframe (7 metadata columns followed by word counts)
    :param byCol: which column to group by
    :param topKeywords: number of most frequent keywords to plot
    :param topLegend: number of keywords to show in the legend
    :return: None
    '''
    top_concepts = df.iloc[:, 7:].sum().sort_values(ascending=False)[:topKeywords].index.values
    # Group the dataframe passed in, not the global simple_bags
    groupedData = df.iloc[:, 2:].groupby(byCol).sum()
    groupedData.index = pd.to_datetime(groupedData.index)
    groupedData.sort_index(inplace=True)
    # Reflect the actual number of keywords in the title
    ax1 = groupedData.loc[:, top_concepts].plot(title='Top {} Keywords Trends over Time'.format(topKeywords))
    lines, labels = ax1.get_legend_handles_labels()
    ax1.legend(lines[:topLegend], labels[:topLegend], loc='best')

#### 2. Bag-of-Words Approach with Stemming and Stop Words Removal


The bag_of_words_stem_stop() function uses the NLTK stop word list and the WordNet lemmatizer to reduce words to their base forms. Users can also choose an n-gram approach to include phrase frequencies for more insight.
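
The idea behind this step can be sketched without NLTK. The example below uses a tiny hand-written stop word set and a lookup table as stand-ins for NLTK's stop word list and the WordNet lemmatizer; `STOP_WORDS`, `LEMMAS`, and `clean_counts` are all hypothetical names, and the real resources are far larger.

```python
from collections import Counter

# Tiny stand-ins for NLTK's stop word list and the WordNet lemmatizer
# (hypothetical and far smaller than the real resources).
STOP_WORDS = {"the", "a", "of", "and", "were", "was"}
LEMMAS = {"dresses": "dress", "skirts": "skirt"}


def clean_counts(tokens):
    """Drop stop words, map each remaining token to its base form, and count."""
    kept = [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)


tokens = "the dresses and the skirts were a dress of silk".split()
counts = clean_counts(tokens)
```

Because "dresses" and "dress" collapse into one entry and words such as "the" disappear, the most frequent keywords become content words rather than function words.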

In [32]:
bagWords = analyzer.bag_of_words_stem_stop()

Display the first 10 rows of the output:

In [33]:
bagWords.head(10)

Unnamed: 0,YEAR,SEASON,BRAND,AUTHOR OF REVIEW,LOCATION,TIME,REVIEW TEXT,dtacher,clothes,woman,...,lims,installed,compost,mound,reprising,lackedhis,ishii,denimheads,tomboy,remnant
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,4.0,3.0,3.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,Spring,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,Spring,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,0.0,2.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Plot the top 30 concepts (keywords with highest frequency):

In [34]:
bagWords.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency - Bag of Words Approach Excluded Stopwords and Stemmed ')

<matplotlib.axes._subplots.AxesSubplot at 0x1226ac8d0>

Plot the top concepts over time to show the trend (legends are only for top 10 keywords):

In [38]:
myPlot(bagWords, topLegend=10)

#### 3. POS Approach Focusing on All Noun Forms (NN, NNP, NNS, NNPS)

The steps are the same as for the previous methods: run the analysis, preview the output, and plot the top keywords.
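
As a rough illustration of filtering by part-of-speech tag, the sketch below counts only the words whose tag is in a given list. It is not the code behind pos_tags(); `count_by_tag` and the hand-tagged sample are assumptions, and in practice a tagger such as `nltk.pos_tag` would produce the (word, tag) pairs.

```python
from collections import Counter

# Penn Treebank noun tags: singular, proper, plural, proper plural.
NOUN_TAGS = ["NN", "NNP", "NNS", "NNPS"]


def count_by_tag(tagged_tokens, pos_list=NOUN_TAGS):
    """Count the lowercased words whose POS tag appears in pos_list."""
    return Counter(word.lower() for word, tag in tagged_tokens if tag in pos_list)


# Hand-tagged tokens (hypothetical); a real tagger would produce pairs like these.
tagged = [
    ("Adam", "NNP"), ("Selman", "NNP"), ("showed", "VBD"), ("bright", "JJ"),
    ("dresses", "NNS"), ("and", "CC"), ("a", "DT"), ("dress", "NN"),
]

all_nouns = count_by_tag(tagged)             # every noun form
proper_only = count_by_tag(tagged, ["NNP"])  # proper nouns only
```

Passing a narrower tag list, as in the last line, corresponds to restricting the analysis to NNP.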

In [39]:
posTag = analyzer.pos_tags()

In [40]:
posTag.head(10)

Unnamed: 0,YEAR,SEASON,BRAND,AUTHOR OF REVIEW,LOCATION,TIME,REVIEW TEXT,clothes,dtacher,dresses,...,gazar,maya,arrangement,phillip,bey,heidi,stella,tommy,hit,waistline
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,3.0,2.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,Spring,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,Spring,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [74]:
posTag.iloc[:, 7:].sum().sort_values(ascending=False)[:30].plot(kind='bar', title='Words Frequency - POS Approach (All Nouns) ')

<matplotlib.axes._subplots.AxesSubplot at 0x12baa4cc0>

In [64]:
myPlot(posTag, topKeywords=15, topLegend=12)

#### 4. POS Approach Focusing Only on NNP

In [81]:
posTagNNP = analyzer.pos_tags(posList=['NNP'])

In [82]:
posTagNNP.head(10)

Unnamed: 0,YEAR,SEASON,BRAND,AUTHOR OF REVIEW,LOCATION,TIME,REVIEW TEXT,berber,square,madison,...,treacly,kenya,navy,mood,zip,kick,kaleidoscopic,robber,zac,zimmermann
0,2016,Spring,A Dtacher,Kristin Anderson,NEW YORK,"September 13, 2015",Detachment was the word of the day at A Dtache...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2016,Spring,A.F. Vandevorst,Luke Leitch,PARIS,"October 1, 2015",You heard this collection coming long before y...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2016,Spring,A.L.C.,Kristin Anderson,NEW YORK,"September 21, 2015",August saw the announcement of big news for A....,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2016,Spring,A.P.C.,Nicole Phelps,PARIS,"October 3, 2015","They call me the king of basics, Jean Touitou ...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2016,Spring,A.W.A.K.E.,Maya Singer,NEW YORK,"October 21, 2015",Natalia Alaverdian is a designer with a lot of...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,2016,Spring,Ace & Jig,Kristin Anderson,NEW YORK,"October 13, 2015",Process has always been paramount to Ace & Jig...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
6,2016,Spring,Acne Studios,Chioma Nnadi,PARIS,"October 3, 2015",A bohemian circle of muses have been in heavy ...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
7,2016,Spring,Adam Lippes,Nicole Phelps,NEW YORK,"September 12, 2015",Walking into Adam Lippess Washington Square ap...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,2016,Spring,Adam Selman,Lee Carter,NEW YORK,"September 10, 2015","Ever the adventurer, Adam Selman will gleefull...",0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,2016,Spring,ADEAM,Kristin Anderson,NEW YORK,"September 14, 2015",Hanako Maeda has been busy rediscovering her J...,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [84]:
posTagNNP.iloc[:, 7:].sum().sort_values(ascending=False)[:30]#.plot(kind='bar', title='Words Frequency - POS Approach (NNP only) ')

kitsch           2.0
bomber           2.0
x                2.0
keller           1.0
takada           1.0
karl             1.0
michigan         1.0
taeuber          1.0
montgomery       1.0
kinda            1.0
zimmermann       1.0
december         1.0
blanket          1.0
von              1.0
xiao             1.0
zealand          1.0
madison          1.0
square           1.0
mott             1.0
ziegfeldas       1.0
zac              1.0
treacly          1.0
robber           1.0
kaleidoscopic    1.0
kick             1.0
zip              1.0
mood             1.0
navy             1.0
kenya            1.0
mouret           1.0
dtype: float64

In [70]:
myPlot(posTagNNP, topKeywords=15, topLegend=10)