## Topic Modelling 
- [helper functions](#helperFunctions)

### [Sentence Tokenizing](#sentTokenizing)

### [Cleaning](#cleaning)

### [NMF](#nmf)
- [5 topics](#nmf-5topics)
    - [topic breakdown](#topicBreakdown1)
- [4 topics](#nmf-4topics)
    - [topic breakdown](#topicBreakdown2)
    
### [NMF With Nouns](#nmfNouns)
- [model 1](#nmfNouns1)
    - [topic analysis](#nouns-ta)

### [Polarized topics](#polarized)
- [general](#polarizedGeneral)
- [body type specific](#polarizedBody)

### [Unpolarized topics](#unpolarized)

In [10]:
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
from sklearn.utils import shuffle
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize
from pymongo import MongoClient
import pandas as pd
import numpy as np
import matplotlib.pyplot as mplt
%matplotlib inline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import time
import nltk
from nltk import word_tokenize

In [2]:
client = MongoClient('ec2-34-198-179-91.compute-1.amazonaws.com', 27017)
db = client.fletcher
dress_col = db.rtr_dresses
rev_col = db.rtr_reviews

In [3]:
cur = rev_col.find({}, {"review":1, "title":1,"_id":0})
rev_df = pd.DataFrame(list(cur))

In [4]:
rev_df.columns

Index(['review', 'title'], dtype='object')

<a id="helperFunctions"></a>
## Helper Functions

In [125]:
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        # Rank features by their weight in this topic, highest first
        top = sorted(zip(feature_names, topic), key=lambda x: x[1], reverse=True)[:n_top_words]
        print(", ".join(name for name, _ in top))
    print()
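
As a quick check of the ranking logic, here is a minimal sketch that returns the top words instead of printing them. The `top_words` name, the `SimpleNamespace` stand-in for a fitted model, and the toy weights are all assumptions made for illustration; they are not part of the notebook.

```python
from types import SimpleNamespace

def top_words(model, feature_names, n_top_words):
    """Return, per topic, the n_top_words features with the largest weights."""
    result = []
    for topic in model.components_:
        ranked = sorted(zip(feature_names, topic),
                        key=lambda pair: pair[1], reverse=True)
        result.append([name for name, _ in ranked[:n_top_words]])
    return result

# Hypothetical stand-in for a fitted model: 2 topics over 4 features.
toy_model = SimpleNamespace(components_=[
    [0.9, 0.1, 0.0, 0.4],
    [0.0, 0.7, 0.8, 0.1],
])
toy_features = ["loved", "fit", "size", "dress"]

toy_top = top_words(toy_model, toy_features, n_top_words=2)
print(toy_top)
```

Sorting on the weight (the second element of each pair) rather than on the feature name is what makes the output a ranking by topic relevance.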

In [6]:
def get_tfidf_and_tf(text, stopwords, max_df=0.90, min_df=0.001, ngram=(2,2), vocab=None):
    # Use tf-idf features for NMF.
    # The stop word list comes from the `stopwords` argument, not from a global.
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df,
                                       ngram_range=ngram,
                                       stop_words=stopwords, vocabulary = vocab)
    t0 = time.time()
    tfidf = tfidf_vectorizer.fit_transform(text)
    print("done in %0.3fs." % (time.time() - t0))

    # Use tf (raw term count) features for LDA.
    print("Extracting tf features for LDA...")
    tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df,
                                    ngram_range=ngram,
                                    stop_words=stopwords)
    t0 = time.time()
    tf = tf_vectorizer.fit_transform(text)
    print("done in %0.3fs." % (time.time() - t0))
    return tfidf, tfidf_vectorizer, tf, tf_vectorizer
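
To make the `ngram` and `max_df` arguments more concrete, here is a rough pure-Python sketch of what they control: which n-grams become candidate terms, and which terms are dropped for appearing in too many documents. The helper names and the toy documents are assumptions for illustration, and the sketch does not try to reproduce scikit-learn's tokenization or term ordering.

```python
def ngrams(tokens, lo, hi):
    """All contiguous n-grams with lo <= n <= hi, joined by spaces."""
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def filter_by_max_df(docs_terms, max_df):
    """Keep terms whose document frequency (as a fraction) is <= max_df."""
    n_docs = len(docs_terms)
    doc_freq = {}
    for terms in docs_terms:
        for term in set(terms):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, c in doc_freq.items() if c / n_docs <= max_df)

# Hypothetical, already-tokenized sentences.
toy_docs = [["loved", "dress"], ["dress", "fit"],
            ["dress", "pockets"], ["great", "fit"]]
toy_terms = [ngrams(doc, 1, 2) for doc in toy_docs]
kept = filter_by_max_df(toy_terms, max_df=0.5)
print(kept)
```

With `ngram=(1, 2)` both single words and word pairs become terms, which is why the topic listings contain entries such as "loved dress" next to "loved".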

<a id="sentTokenizing"></a>
## Sentence Tokenization

In [8]:
sentences = rev_df.review.apply(sent_tokenize)

In [9]:
df_sent = pd.concat([pd.DataFrame({'review': x, 'index': i}) for i,x in enumerate(sentences)], ignore_index=True)
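
The reshaping above produces one row per sentence while keeping the position of the originating review in `index`. A minimal sketch of the same idea without pandas or NLTK follows; `naive_sentences` is a deliberately crude, hypothetical stand-in for `sent_tokenize`, and the reviews are made up.

```python
import re

def naive_sentences(text):
    """Rough splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

toy_reviews = ["Loved it. Fit was perfect!", "Too long for me."]

# One record per sentence, tagged with the index of its source review.
rows = [
    {"index": i, "review": sentence}
    for i, review in enumerate(toy_reviews)
    for sentence in naive_sentences(review)
]
print(rows)
```

Keeping the review index on every sentence is what later allows sentence-level topics and polarity to be traced back to a review and, from there, to a dress.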

<a id="cleaning"></a>
## Cleaning

In [12]:
df_sent.review = df_sent.review.str.lower()
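# regex=False: treat ',' and '.' as literal characters, not as regex patterns
df_sent.review = df_sent.review.str.replace(',', ' ', regex=False)
df_sent.review = df_sent.review.str.replace('.', ' ', regex=False)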
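
One thing to watch in this cleaning step is that `.` is a regular-expression wildcard. Whether pandas treats the pattern as a regex depends on the pandas version and the `regex` argument, so it may be worth seeing the difference on a plain string. The sentence below is made up for illustration.

```python
import re

s = "great fit. loved it, truly."

# Interpreted as a regex, '.' matches every character.
as_regex = re.sub(".", " ", s)

# Interpreted literally, only the punctuation is replaced.
as_literal = s.replace(".", " ").replace(",", " ")

print(repr(as_regex))
print(repr(as_literal))
```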

<a id="nmf"></a>
## NMF

In [17]:
sw = stopwords.words('english')
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(df_sent.review, sw, min_df=0, max_df=0.5, ngram=(1,2))

done in 6.185s.
Extracting tf features for LDA...
done in 5.444s.


<a id="nmf-5topics"></a>
### 5 Topics

In [18]:
n_topics = 5
n_top_words = 20
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1,
          alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 26.115s.

Topics in NMF model:
Topic #0:
loved, loved dress, dress, absolutely loved, absolutely, overall loved, overall, everyone loved, everyone, beautiful, loved loved, loved wearing, much, really loved, dress much, pockets, loved pockets, loved fit, really, love
Topic #1:
compliments, many compliments, many, received, got, received many, got many, night, compliments night, compliments dress, dress, tons, tons compliments, received compliments, got compliments, lots compliments, lots, night long, felt, got tons
Topic #2:
fit, perfect, dress, great, dress fit, fit perfectly, perfectly, fit great, comfortable, fit perfect, length, great dress, like, beautiful, fit like, glove, like glove, dress perfect, heels, well
Topic #3:
rent, would, definitely, recommend, would definitely, definitely rent, dress, would rent, highly, highly recommend, definitely recommend, recommend dress, rent dress, would recommend, would highly, dress would, wear, anyone, rtr, rent runway
Topic #4:
size

<a id="topicBreakdown1"></a>
### Topic breakdown
1. Topic 0 = Loved the dress
2. Topic 1 = Received a lot of compliments
3. Topic 2 = Dress fits well
4. Topic 3 = Would definitely rent again or recommend
5. Topic 4 = True to size 

Topics 2 and 4 look similar (both are about fit and sizing), so the model is refit with 4 topics. These topics can be used to identify which attributes users love about a dress.
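
The similarity judgement here is made by eye. One possible way to put a number on it is the Jaccard overlap between the top-word lists of two topics. This is only a sketch: the word lists below are hypothetical and are not copied from the model output.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical top words for three topics.
toy_topics = {
    "fit": ["fit", "perfect", "dress", "size"],
    "size": ["size", "true", "fit", "dress"],
    "compliments": ["compliments", "many", "received", "night"],
}

fit_vs_size = jaccard(toy_topics["fit"], toy_topics["size"])
fit_vs_compliments = jaccard(toy_topics["fit"], toy_topics["compliments"])
print(fit_vs_size, fit_vs_compliments)
```

A high overlap between two topics would be one hint that the number of topics could be reduced.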


<a id="nmf-4topics"></a>
### 4 Topics

In [20]:
# Fit the NMF model
n_topics = 4
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 21.987s.

Topics in NMF model:
Topic #0:
loved, loved dress, dress, absolutely loved, absolutely, overall loved, overall, beautiful, everyone loved, everyone, loved loved, loved wearing, much, really loved, beautiful dress, dress much, love, amazing, pockets, really
Topic #1:
compliments, many compliments, many, received, got, received many, got many, night, compliments night, compliments dress, dress, tons, tons compliments, received compliments, got compliments, lots compliments, lots, night long, felt, got tons
Topic #2:
fit, size, perfect, dress, true, true size, great, dress fit, fit perfectly, perfectly, fit great, fit true, comfortable, fit perfect, length, like, glove, fit like, like glove, great dress
Topic #3:
rent, would, definitely, recommend, would definitely, definitely rent, dress, would rent, highly, highly recommend, definitely recommend, recommend dress, rent dress, would recommend, would highly, beautiful, dress would, wear, overall, love



<a id="topicBreakdown2"></a>
### Topic breakdown
1. Topic 0 = Loved the dress.
2. Topic 1 = Received a lot of compliments.
3. Topic 2 = Good fit, true to size. Flattering fit.
4. Topic 3 = Would definitely rent again or recommend.


<a id="nmfNouns"></a>
## NMF With Nouns

In [36]:
def noun(s):
    return ' '.join([word[0] for word in nltk.pos_tag(s) if word[1] == 'NN' or word[1] == 'NNS'])

In [None]:
word_tokens = df_sent.review.apply(nltk.word_tokenize)
nouns = word_tokens.apply(noun)
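
The filtering step inside `noun` can be tried without running the tagger. The sketch below applies the same NN/NNS filter to hand-written `(word, tag)` pairs; the pairs are an assumption about what a part-of-speech tagger might return for this sentence, not actual `nltk.pos_tag` output.

```python
NOUN_TAGS = {"NN", "NNS"}  # singular and plural common nouns

def nouns_from_tagged(tagged_tokens):
    """Join the words whose tag marks them as a common noun."""
    return " ".join(word for word, tag in tagged_tokens if tag in NOUN_TAGS)

# Hypothetical tagging of "the dress has deep pockets".
tagged = [("the", "DT"), ("dress", "NN"), ("has", "VBZ"),
          ("deep", "JJ"), ("pockets", "NNS")]

noun_text = nouns_from_tagged(tagged)
print(noun_text)
```

Reducing each sentence to its nouns pushes the topic model toward dress attributes (length, pockets, material) rather than sentiment words such as "loved".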

In [116]:
nouns.to_csv('../data/nouns_per_sentence.csv')

In [118]:
df_sent['nouns'] = nouns

<a id="nmfNouns1"></a>
### Max df = 0.5

In [48]:
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(df_sent.nouns, sw, min_df=0, max_df=0.5, ngram=(1, 3))

done in 3.566s.
Extracting tf features for LDA...
done in 2.988s.


In [50]:
# Fit the NMF model
n_topics = 30
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 143.073s.

Topics in NMF model:
Topic #0:
dress, dress compliments, dress dress, dress night, fit dress, heels dress, size dress, fun, dress length, party, dress event, dress wedding, dress bit, night dress, everything, bra dress, everyone, fun dress, dress occasion, dress heels
Topic #1:
size, size size, fit size, size fit, size dress, backup, fits size, fits, order size, backup size, order, size backup, size length, size dresses, dresses, reviews size, size bit, size reviews, size back, dress fit size
Topic #2:
fit, fit size, size fit, fit dress, fit glove, fit perfect, fit length, fit well, fit perfectly, well, fit bit, perfectly, length fit, medium fit, medium, fit great, fit bust, fit heels, color fit, fit waist
Topic #3:
compliments, compliments night, dress compliments, lot compliments, ton compliments, lot, ton, compliments evening, evening, dress compliments night, wedding compliments, strangers, dress lot compliments, dress lot, compliments night dress, compliments st

<a id="nouns-ta"></a>
### Topic Analysis
- Topic #0: Dress 
- Topic #1: Fit
- Topic #2: Fit
- Topic #3: Compliments
- Topic #4: Length (+)
- Topic #5: Night
- Topic #6: Length 
- Topic #7: Bra (+)
- Topic #8: Bit
- Topic #9: RTR
- Topic #10: Material (+)
- Topic #11: Fit like a glove
- Topic #12: Wedding (+)
- Topic #13: Color (+)
- Topic #14: Event (+)
- Topic #15: Sequins (+)
- Topic #16: Pockets (+)
- Topic #17: Back (+)
- Topic #18: Heels (+) (kind of related to length)
- Topic #19: Perfect
- Topic #20: Fit
- Topic #21: Stretch (+)
- Topic #22: Compliments 
- Topic #23: Compliments
- Topic #24: RTR Experience
- Topic #25: Dress size
- Topic #26: Way
- Topic #27: Bust area (+)
- Topic #28: Lots
- Topic #29: Reviews

The topics marked with (+) are significant and will be used as features of the dress.

In [55]:
topic_prob = pd.DataFrame(nmf.transform(tfidf), columns=['topic_{}'.format(i) for i in range(30)])

In [57]:
good_cols = [4, 7, 10, 12, 13, 14, 15, 16, 17, 18, 21, 27]
bad_cols = set(range(30)) - set(good_cols)

In [59]:
for col in bad_cols:
    del topic_prob['topic_{}'.format(col)]

In [61]:
topic_prob.columns = ['length', 'bra', 'material', 'wedding', 'color', 'event', 'sequins', 'pockets', 'back', 'heels', 'stretch', 'bust_area']
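
Deleting columns and then renaming by position works, but it relies on the two lists staying in the same order. A possible alternative is a single mapping from topic number to name, taken from the topic analysis above. The sketch uses a plain dict as a stand-in for one row of `topic_prob`; the `keep` and `toy_row` names are assumptions.

```python
# Topic number -> feature name, as listed in the topic analysis.
topic_names = {
    4: "length", 7: "bra", 10: "material", 12: "wedding",
    13: "color", 14: "event", 15: "sequins", 16: "pockets",
    17: "back", 18: "heels", 21: "stretch", 27: "bust_area",
}
keep = {"topic_{}".format(i): name for i, name in topic_names.items()}

# Hypothetical row with one value per topic column.
toy_row = {"topic_{}".format(i): float(i) for i in range(30)}
renamed = {name: toy_row[col] for col, name in keep.items()}
print(renamed)

# With pandas, the equivalent would be along the lines of:
# topic_prob = topic_prob[list(keep)].rename(columns=keep)
```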

I have identified 2 groups of categories:
1. Body type related
    - length
    - stretch
    - bust area
2. General 
    - bra
    - material
    - wedding 
    - color 
    - event
    - sequins 
    - pockets 
    - back
    
Categories that are body type related will be scored per body type per dress. General features will be scored per dress.


There are also 3 ways to "score" each category:
1. Polarity (Good/bad)
    - Body type related:
        - length
        - stretch 
        - bust area
    - General
        - material 
        - back 
        - sequins (itchy or not)
        - bra
        - color
2. Sum (How much it's mentioned)
    - sequins
    - wedding 
    - pockets
3. Categorical
    - bra 
    - event 
    - color
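
The three scoring modes can be sketched on a handful of sentences for a single dress. Everything here is illustrative: the records, the `weight` field (standing in for a topic weight), and especially the `value` field used for the categorical mode are assumptions, since the notebook does not show how categorical values are extracted.

```python
from collections import Counter
from statistics import mean

# Hypothetical sentence-level records for one dress.
sentences = [
    {"topic": "material", "polarity": 0.6, "weight": 0.30},
    {"topic": "material", "polarity": -0.2, "weight": 0.10},
    {"topic": "pockets", "polarity": 0.8, "weight": 0.25},
    {"topic": "pockets", "polarity": 0.5, "weight": 0.15},
    {"topic": "event", "polarity": 0.1, "weight": 0.20, "value": "wedding"},
    {"topic": "event", "polarity": 0.3, "weight": 0.20, "value": "wedding"},
    {"topic": "event", "polarity": 0.0, "weight": 0.10, "value": "gala"},
]

def polarity_score(rows, topic):
    """Polarity mode: average sentiment of sentences about the topic."""
    return mean(r["polarity"] for r in rows if r["topic"] == topic)

def sum_score(rows, topic):
    """Sum mode: how much the topic is mentioned overall."""
    return sum(r["weight"] for r in rows if r["topic"] == topic)

def categorical_score(rows, topic):
    """Categorical mode: the most frequently mentioned value."""
    values = [r["value"] for r in rows if r["topic"] == topic and "value" in r]
    return Counter(values).most_common(1)[0][0]

print(polarity_score(sentences, "material"))
print(sum_score(sentences, "pockets"))
print(categorical_score(sentences, "event"))
```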
    


    

### Linking sentences to dress

In [149]:
url_cur = rev_col.find({}, {"url":1,"_id":0})

In [150]:
url_list = pd.DataFrame(list(url_cur))

### Linking body types to comments

In [223]:
df_body = pd.read_csv('../data/measurement_data.csv', index_col=0)

<a id="polarized"></a>
### Polarized topics

<a id="polarizedGeneral"></a>
#### General

In [144]:
from textblob import TextBlob
def calc_polarity(s):
    # polarity is the first field of the sentiment tuple, in [-1, 1]
    return TextBlob(s).sentiment.polarity

In [145]:
polarity = df_sent.review.apply(calc_polarity)

In [172]:
topic_prob = topic_prob.join(polarity)
topic_prob = topic_prob.join(df_sent['index'])
topic_prob = topic_prob.join(url_list, on='index')

In [180]:
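p_topics = ['bra', 'material', 'color', 'sequins', 'back']
# Take the argmax over the topic columns only; the joined polarity ('review'),
# 'index' and 'url' columns must not compete with the topic weights.
topic_cols = ['length', 'bra', 'material', 'wedding', 'color', 'event',
              'sequins', 'pockets', 'back', 'heels', 'stretch', 'bust_area']
topic_prob['topic'] = topic_prob[topic_cols].idxmax(axis=1)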

In [205]:
polar_general = topic_prob[topic_prob.topic.isin(p_topics)]

In [206]:
polar_general = pd.DataFrame(polar_general.groupby(['url', 'topic'], as_index=False)['review'].mean())

In [215]:
polar_general = polar_general.pivot(index='url', columns='topic', values='review')
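
The group-then-pivot step above can be pictured without pandas as "average polarity per (dress, topic), then one row per dress with one entry per topic". The sketch below uses made-up dress identifiers and polarity values.

```python
from collections import defaultdict

# Hypothetical (url, dominant topic, sentence polarity) records.
records = [
    ("dress_a", "color", 0.5),
    ("dress_a", "color", 0.1),
    ("dress_a", "back", -0.2),
    ("dress_b", "color", 0.4),
]

# Step 1: mean polarity per (url, topic) pair.
groups = defaultdict(list)
for url, topic, pol in records:
    groups[(url, topic)].append(pol)
means = {key: sum(vals) / len(vals) for key, vals in groups.items()}

# Step 2: "pivot" to one entry per url, keyed by topic.
wide = defaultdict(dict)
for (url, topic), value in means.items():
    wide[url][topic] = value

print(dict(wide))
```

A dress with no sentences for a topic simply has no entry for it; in the pivoted DataFrame that shows up as a missing value.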

<a id="polarizedBody"></a>
#### Body Type Specific


In [293]:
bt_topics = ['length', 'stretch', 'bust_area']

In [294]:
labeled_df = topic_prob.join(df_body)

In [295]:
labeled_df = labeled_df.dropna()

In [296]:
# .copy() gives an independent frame, so adding a column to it is unambiguous
polar_bt = labeled_df[labeled_df[bt_topics].sum(axis=1) > 0].copy()

In [297]:
polar_bt['topic'] = polar_bt[bt_topics].idxmax(axis=1)


In [302]:
polar_bt = polar_bt.groupby(['url', 'topic', 'kmean_label_2'], as_index=False)['review'].mean()

<a id="unpolarized"></a>
### Unpolarized topics

In [250]:
up_topics = ['sequins', 'wedding', 'pockets']


In [258]:
up_general = topic_prob.groupby('index', as_index=False)[up_topics].sum()

In [261]:
up_general = up_general.join(url_list)

In [264]:
up_general = up_general.groupby('url', as_index=False)[up_topics].mean()

<a id="combining"></a>
### Combining Everything

In [305]:
df_general = polar_general.join(up_general.set_index('url'), lsuffix='_polar', rsuffix='_unpolar')

In [307]:
df_general = df_general.replace(np.nan, 0)

#### Saving to CSV

In [308]:
df_general.to_csv('../data/dress_features.csv')

In [311]:
polar_bt.to_csv('../data/dress_features_bt.csv')

## Thoughts 
- Separate dress-related topics from people-related topics.
- Cluster similar dresses by their dress features.
- Rate how much each person loves a dress. Someone who loves a dress will probably love similar dresses too.
- If a user provides body data, recommend dresses loved by people in the same body cluster.
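
As a rough illustration of the "similar dresses" idea, one option is cosine similarity between per-dress feature vectors. This is a sketch under stated assumptions, not the method used in the notebook: the dress names, the three-feature vectors, and the `most_similar` helper are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical per-dress feature vectors (e.g. color, back, pockets scores).
features = {
    "dress_a": [0.30, 0.00, 0.40],
    "dress_b": [0.25, 0.05, 0.35],
    "dress_c": [-0.40, 0.60, 0.00],
}

def most_similar(name, features):
    """Return the other dress whose feature vector is closest in angle."""
    others = (other for other in features if other != name)
    return max(others, key=lambda other: cosine(features[name], features[other]))

print(most_similar("dress_a", features))
```

A recommender could then suggest the nearest dresses to ones a user has rated highly, optionally restricted to dresses liked by users in the same body cluster.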