# Topic Modelling
- [helper functions](#helper)

## [Tokenizing](#tokenization)
- [sentence tokenization](#sentenceTokenization)
- [cleaning](#cleaning)

## [NMF](#nmf1)
- [5 topics](#nmf1-topic5)
    - [topic analysis](#nmf1-topic5-analysis)
- [9 topics](#nmf1-topic9)
    - [topic analysis](#nmf1-topic9-analysis)

## [NMF With Only Nouns](#nmf2)
- [extract nouns with NLTK](#extractNouns1)
    - [max df = 0.01](#max1)
        - [5 topics](#max1-topic5)
            - [topic analysis](#nmf2-topic5-analysis)
        - [9 topics](#max1-topic9)
            - [topic analysis](#max1-topic9-analysis)
    - [max df = 0.3](#max2)
        - [5 topics](#max2-topic5)
            - [topic analysis](#max2-topic5-analysis)
        - [9 topics](#max2-topic9)
            - [topic analysis](#max2-topic9-analysis)
- [extract nouns with TextBlob](#extractNouns2)
    - [topic analysis](#nmf3-topic5-analysis)
    
## [Model Review](#modelReview)

## [Evaluating Likeness](#likeness)

## [Feature Engineering Topics](#featureEngineeringTopics)
- [Sentiment analysis](#sentimentAnalysis)
- [Topic probability analysis](#topicProbability)


In [1]:
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.metrics import pairwise_distances
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import scale
from sklearn.cluster import KMeans
from sklearn.utils import shuffle
from pymongo import MongoClient
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as mplt
%matplotlib inline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
import time
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize

In [191]:
client = MongoClient('ec2-34-198-179-91.compute-1.amazonaws.com', 27017)
db = client.fletcher
dress_col = db.rtr_dresses
rev_col = db.rtr_reviews

In [3]:
cur = rev_col.find({}, {"review":1, "title":1,"_id":0})
rev_df = pd.DataFrame(list(cur))

<a id="helper"></a>
### Helper Functions 

In [168]:
def print_top_words(model, feature_names, n_top_words):
    for topic_id, topic in enumerate(model.components_):
        print("Topic {}".format(topic_id))
        words = [feature_names[i].strip() for i, v in (sorted(enumerate(topic), key=lambda x:x[1], reverse=True)[:n_top_words])]
        print(', '.join(words))
    print()
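
As a quick check of the ranking logic in this helper, here is a minimal sketch that selects the top terms with `numpy.argsort` instead of `sorted`. The `components` matrix and `feature_names` list are made-up toy values, and `top_words` is a hypothetical name that is not part of the notebook.

```python
import numpy as np

def top_words(components, feature_names, n_top_words):
    """Return the n_top_words highest-weighted terms for each topic row."""
    result = []
    for topic in np.asarray(components):
        # argsort is ascending, so reverse it and keep the first n indices
        top_idx = np.argsort(topic)[::-1][:n_top_words]
        result.append([feature_names[i].strip() for i in top_idx])
    return result

# Toy topic-term matrix: 2 topics x 4 terms (illustrative values only)
components = np.array([[0.1, 0.9, 0.0, 0.4],
                       [0.7, 0.0, 0.2, 0.1]])
feature_names = ['fit', 'loved', 'size', 'dress']

for topic_id, words in enumerate(top_words(components, feature_names, 2)):
    print("Topic {}: {}".format(topic_id, ', '.join(words)))
```

Returning the lists rather than printing inside the function makes it easier to compare topics programmatically later.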

In [5]:
def get_tfidf_and_tf(text, stop_words, max_df=0.90, min_df=0.001, ngram=(2,2), vocab=None):
    # tf-idf features for NMF; use the stop words passed in rather than the global `sw`
    tfidf_vectorizer = TfidfVectorizer(max_df=max_df, min_df=min_df,
                                       ngram_range=ngram,
                                       stop_words=stop_words, vocabulary=vocab)
    t0 = time.time()
    tfidf = tfidf_vectorizer.fit_transform(text)
    print("done in %0.3fs." % (time.time() - t0))

    # Use tf (raw term count) features for LDA.
    print("Extracting tf features for LDA...")
    tf_vectorizer = CountVectorizer(max_df=max_df, min_df=min_df,
                                    ngram_range=ngram,
                                    stop_words=stop_words)
    t0 = time.time()
    tf = tf_vectorizer.fit_transform(text)
    print("done in %0.3fs." % (time.time() - t0))
    return tfidf, tfidf_vectorizer, tf, tf_vectorizer

<a id="tokenization"></a>
## Tokenization
<a id="sentenceTokenization"></a>
### Sentence Tokenization
Each review seems to cover multiple topics. Users comment on fit, occasion, recommendations, use of undergarments, and their overall impression of the dress (beautiful, sparkly, etc.). I will split each review into sentences using `sent_tokenize`.

In [13]:
sentences = rev_df.review.apply(sent_tokenize)

In [14]:
df_sent = pd.concat([pd.DataFrame({'review': x, 'index': i}) for i,x in enumerate(sentences)], ignore_index=True)
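
To make the reshaping concrete, here is a small sketch on two made-up reviews. It assumes a naive period-based splitter (`naive_sentences`, a hypothetical stand-in) so that it runs without the NLTK tokenizer data; the notebook itself uses `sent_tokenize`.

```python
import pandas as pd

reviews = pd.Series([
    "Fits true to size. Got lots of compliments.",
    "I loved this dress!",
])

# Stand-in splitter (assumption): the notebook uses nltk's sent_tokenize instead.
def naive_sentences(text):
    return [s.strip() for s in text.replace('!', '.').split('.') if s.strip()]

sentences = reviews.apply(naive_sentences)

# One row per sentence; 'index' remembers which review the sentence came from.
df_toy = pd.concat(
    [pd.DataFrame({'review': x, 'index': i}) for i, x in enumerate(sentences)],
    ignore_index=True,
)
print(df_toy)
```

The `index` column is what later lets sentence-level topics be joined back to the review and its dress URL.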

<a id="cleaning"></a>
### Cleaning
Remove periods and commas and replace hyphens with spaces. Lowercasing is applied later, before noun extraction.

In [15]:
df_sent.review = df_sent.review.str.replace(r'[\.\,]', '', regex=True)

In [19]:
df_sent.review = df_sent.review.str.replace('-', ' ')

<a id="nmf1"></a>
## NMF
- max document frequency (`max_df`) = 0.05
- (1,2) ngram

In [20]:
sw = stopwords.words('english')
n_top_words = 20

In [22]:
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(df_sent.review, sw, min_df=0, max_df=0.05, ngram=(1,2))

done in 7.682s.
Extracting tf features for LDA...
done in 6.799s.


<a id="nmf1-topic5"></a>
### Extracting 5 topics

In [23]:
n_topics = 5
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))

print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 29.800s.

Topics in NMF model:
Topic #0:
loved, loved dress, absolutely loved, absolutely, overall loved, overall, everyone loved, loved loved, everyone, loved fit, loved wearing, really loved, much, loved pockets, dress much, pockets, dress got, dress would, wearing, loved everything
Topic #1:
many compliments, many, received, received many, got, got many, compliments dress, night, compliments night, received compliments, felt, got compliments, beautiful, night long, tons, tons compliments, long, throughout, compliments throughout, compliments felt
Topic #2:
true, true size, fit true, fits true, fits, dress true, runs true, runs, dress fit, size comfortable, dress fits, ran true, pretty true, ran, dress runs, pretty, flattering, size flattering, comfortable true, length
Topic #3:
rent, definitely, would definitely, definitely rent, recommend, highly, highly recommend, definitely recommend, recommend dress, would rent, rent dress, would highly, would recommend, rent runway, run

<a id="nmf1-topic5-analysis"></a>
#### Topic Analysis
1. Topic 0 = Loved the dress
2. Topic 1 = Received a lot of compliments
3. Topic 2 = Good fit, true to size.
4. Topic 3 = Would definitely rent again or recommend.
5. Topic 4 = Dress was beautiful.

This is a good start to the topic analysis. I can certainly use these topics to measure how much a user likes the dress.


<a id="nmf1-topic9"></a>
### Extracting 9 Topics

In [25]:
n_topics = 9
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 56.028s.

Topics in NMF model:
Topic #0:
loved, loved dress, absolutely loved, absolutely, overall loved, overall, everyone loved, loved loved, everyone, loved fit, loved wearing, really loved, much, loved pockets, dress much, pockets, dress got, dress would, wearing, loved everything
Topic #1:
many compliments, many, received, received many, got, got many, compliments dress, night, compliments night, received compliments, got compliments, felt, night long, tons, tons compliments, throughout, compliments throughout, compliments felt, dress received, throughout night
Topic #2:
true, true size, fit true, fits true, fits, dress true, runs true, runs, dress fit, size comfortable, dress fits, ran true, pretty true, ran, dress runs, pretty, flattering, size flattering, comfortable true, size fit
Topic #3:
rent, definitely, definitely rent, would definitely, would rent, rent dress, definitely recommend, rent runway, runway, rtr, dress would, recommend, rent rtr, definitely wear, first

<a id="nmf1-topic9-analysis"></a>
#### Topic Analysis
1. Topic 0 = Loved the dress. Pockets make people happy.
2. Topic 1 = Received a lot of compliments
3. Topic 2 = Good fit, true to size. Flattering fit.
4. Topic 3 = Would definitely rent again or recommend.
5. Topic 4 = Dress was beautiful.
6. Topic 5 = Recommendation
7. Topic 6 = Beautiful
8. Topic 7 = Fits like a glove.
9. Topic 8 = Dress Length

**5 is the better number of topics. Topics 5-8 seem to repeat earlier topics.** I should also drop adjectives and keep only nouns, because I am getting how a user feels about the dress instead of topics.
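
One way to put a number on "repeating" topics is the Jaccard overlap between their top-word sets. The sketch below is not something the notebook computes; `topic_overlap` is a hypothetical helper and the word lists are toy values chosen for illustration.

```python
from itertools import combinations

def topic_overlap(topics):
    """Jaccard similarity between the top-word sets of every pair of topics."""
    overlaps = {}
    for (i, a), (j, b) in combinations(enumerate(topics), 2):
        a, b = set(a), set(b)
        overlaps[(i, j)] = len(a & b) / len(a | b)
    return overlaps

# Toy top-word lists (illustrative, not the notebook's actual output)
topics = [
    ['rent', 'definitely', 'recommend'],
    ['recommend', 'highly', 'rent'],
    ['true', 'size', 'fit'],
]

for pair, score in topic_overlap(topics).items():
    print(pair, round(score, 2))
```

Pairs with a high score are candidates for merging, which is a hint that fewer topics may be enough.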

<a id="nmf2"> </a>
## NMF With Nouns
<a id="extractNouns1"> </a>
## Extract nouns with NLTK

In [30]:
df_sent.review = df_sent.review.str.lower()

In [37]:
def is_noun(s):
    # Return the tokens that NLTK tags as singular (NN) or plural (NNS) nouns.
    return [x[0] for x in nltk.pos_tag(s) if x[1] in ('NN', 'NNS')]

# nouns = pd.read_csv('data/review_nouns.csv')
nouns = df_sent.review.str.split().apply(is_noun)
nouns = nouns.str.join(' ')
# saving nouns 
# nouns.to_csv('data/review_nouns.csv')

<a id="max1"> </a>
## Max DF = 0.01

In [61]:
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(nouns, sw, max_df=0.01, ngram=(1, 2))

done in 2.348s.
Extracting tf features for LDA...
done in 2.455s.


<a id="max1-topic5"> </a>
### 5 Topics

In [45]:
n_topics = 5
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 6.416s.

Topics in NMF model:
Topic #0:
dress size, size size, size dress, size fit, backup, order, reviews dress, fit dress, backup size, size length, size backup, stretch, order size, backup dress, case, sizes, size bit, medium, back dress, dress dress
Topic #1:
sleeves, shoulders, winter, thing, issue, problem, neckline, lace, look, snug, itchy, shoulder, complaint, tight, nothing, part, stretch, weather, straps, one
Topic #2:
compliments dress, tons, tons compliments, lots, lots compliments, dress night, lot compliments, ton, ton compliments, people, evening, complements, strangers, ball, dress wedding, people dress, prom, friends, everyone, fun
Topic #3:
glove, fit glove, dress glove, fits, size fit, curves, stretch, places, fits size, medium, room, snug, flattering, length heels, backup, everything, spanx, shape, beautiful, gown
Topic #4:
chest, top, shoulders, tape, room, front, medium, snug, cleavage, bottom, tight, someone, fashion, problem, fashion tape, straps, lace,

<a id="max1-topic5-analysis"> </a>
#### Topic Analysis
This selection of topics is not as clear-cut as in the previous models. Terms repeat across topics; for example, "medium" appears in topics 0, 3 and 4.
1. Topic 0: Size
2. Topic 1: Shoulders, sleeves (winter = long sleeves?), upper part of the dress 
3. Topic 2: Compliments 
4. Topic 3: Fits well
5. Topic 4: Fitting issues (room, snug, cleavage, tight, tapes)

<a id="max1-topic9"> </a>
### 9 Topics

In [48]:
n_topics = 9
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 11.109s.

Topics in NMF model:
Topic #0:
dress size, size size, size dress, size fit, order, backup, reviews dress, fit dress, backup size, stretch, size length, size backup, order size, backup dress, case, sizes, size bit, medium, back dress, dress dress
Topic #1:
compliments dress, tons, tons compliments, lots, lots compliments, dress night, lot compliments, ton, ton compliments, people, evening, complements, strangers, dress wedding, ball, people dress, prom, friends, everyone, dress color
Topic #2:
sleeves, shoulders, winter, thing, issue, problem, neckline, lace, snug, itchy, look, shoulder, tight, complaint, nothing, stretch, part, weather, medium, straps
Topic #3:
glove, fit glove, dress glove, fits, size fit, curves, stretch, places, fits size, medium, room, snug, flattering, length heels, backup, everything, spanx, shape, beautiful, dress curves
Topic #4:
chest, shoulders, room, tape, snug, medium, tight, front, cleavage, someone, fashion, fashion tape, problem, cup, i

<a id="max1-topic9-analysis"> </a>
#### Topic Analysis

1. Topic 0: Size
2. Topic 1: Compliments
3. Topic 2: Shoulders, sleeves (winter = long sleeves?), upper part of the dress
4. Topic 3: Fits well
5. Topic 4: Fitting issues (room, snug, cleavage, tight, tapes)
6. Topic 5: Phone (pockets/ clutch)
7. Topic 6: Accessories (gold/silver)
8. Topic 7: Event type (wedding, black tie)
9. Topic 8: Fashion tape

<a id="max2"> </a>
## Max DF = 0.3

In [87]:
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(nouns, sw, max_df=0.3, ngram=(1, 2))

done in 2.447s.
Extracting tf features for LDA...
done in 2.157s.


<a id="max2-topic5"> </a>
### 5 Topics

In [88]:
n_topics = 5
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 4.275s.

Topics in NMF model:
Topic #0:
size, dress size, size size, fit size, size dress, size fit, reviews, backup, order, bust, dresses, bit, back, backup size, order size, waist, chest, size backup, area, sizes
Topic #1:
fit, dress fit, fit size, glove, fit glove, size fit, fit dress, bit, bust, waist, fit perfect, perfect, color, area, body, fit well, hips, medium, well, great
Topic #2:
compliments, night, compliments night, compliments dress, tons, tons compliments, dress compliments, lots, lots compliments, lot, ton, ton compliments, lot compliments, wedding, dress night, night dress, people, event, party, evening
Topic #3:
heels, length, inch, inch heels, length heels, heels length, bit, floor, heels dress, length inch, dress length, flats, ground, height, heel, length dress, dress heels, shoes, perfect, inches
Topic #4:
bra, back, bit, strapless, bra dress, bust, strapless bra, dress bra, cut, straps, top, chest, backless, tape, front, spanx, need, backless bra, area, 

<a id="max2-topic5-analysis"> </a>
#### Topic Analysis
This maximum document frequency (0.3) works a lot better for nouns, because extracting the nouns has already cut a lot of the unnecessary words.

1. Topic 0: Size
2. Topic 1: Fit
3. Topic 2: Compliments 
4. Topic 3: Heels/length
5. Topic 4: Bra
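
The effect of the `max_df` threshold can be illustrated with a small pure-Python sketch. This is a simplified re-implementation for illustration only, assuming thresholds are given as proportions of documents (scikit-learn also accepts absolute counts, which this sketch ignores). The function name and the toy documents are hypothetical.

```python
from collections import Counter

def filter_by_document_frequency(docs, max_df=1.0, min_df=0.0):
    """Keep terms whose document-frequency proportion lies in [min_df, max_df]."""
    n_docs = len(docs)
    # Count each term once per document
    df = Counter(term for doc in docs for term in set(doc.split()))
    return sorted(t for t, c in df.items() if min_df <= c / n_docs <= max_df)

docs = ['dress size fit', 'dress compliments', 'dress heels length', 'dress size']

print(filter_by_document_frequency(docs, max_df=0.3))
print(filter_by_document_frequency(docs, max_df=0.5))
```

With the lower threshold, a term like `size` that appears in half of the documents is discarded. Raising the threshold keeps it while still removing `dress`, which appears everywhere.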

<a id="max2-topic9"> </a>
### 9 Topics

In [89]:
n_topics = 9
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 11.449s.

Topics in NMF model:
Topic #0:
size, dress size, size size, fit size, size dress, size fit, backup, reviews, order, dresses, bust, backup size, back, order size, waist, size backup, chest, fits, sizes, fits size
Topic #1:
fit, dress fit, fit size, glove, fit glove, size fit, fit dress, bust, fit perfect, waist, perfect, fit well, body, well, medium, great, color, fit length, fit perfectly, area
Topic #2:
compliments, compliments dress, compliments night, tons, tons compliments, lots, lots compliments, dress compliments, lot compliments, ton compliments, ton, lot, evening, strangers, people, everyone, ball, party, women, gown
Topic #3:
heels, inch, inch heels, heels dress, floor, length heels, heels length, ground, length inch, dress heels, dress floor, height, inches, heels floor, flats, perfect, shoes, regular, order, taller
Topic #4:
bra, back, strapless, bra dress, strapless bra, bust, dress bra, straps, cut, backless, tape, top, chest, need, backless bra, spanx, f

<a id="max2-topic9-analysis"> </a>
#### Topic Analysis

1. Topic 0: Size
2. Topic 1: Fit
3. Topic 2: Compliments 
4. Topic 3: Heels/length
5. Topic 4: Bra
6. Topic 5: Night (Not very important)
7. Topic 6: Length (repeat of topic 3)
8. Topic 7: Details - sequins, zipper, fabric, material
9. Topic 8: Occasion (wedding, etc)

<a id="extractNouns2"></a>
## Extract nouns with TextBlob

In [51]:
from textblob import TextBlob
def get_nouns(text):
    return TextBlob(text).noun_phrases

In [77]:
# tb_nouns = pd.read_csv('data/tb_nouns.csv')
tb_nouns = rev_df.review.apply(get_nouns)
tb_nouns = tb_nouns.str.join(' ')
# tb_nouns.to_csv('data/tb_nouns.csv')

In [85]:
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(tb_nouns, sw, max_df=0.1, ngram=(2, 4))

done in 2.671s.
Extracting tf features for LDA...
done in 2.519s.


In [86]:
n_topics = 9
t0 = time.time()
nmf = NMF(n_components=n_topics, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
print("done in %0.3fs." % (time.time() - t0))
print("\nTopics in NMF model:")
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, n_top_words)

done in 2.178s.

Topics in NMF model:
Topic #0:
inch heels, length inch heels, length inch, inch heels perfect, inch heels overall, heels perfect, heels overall, dress inch, dress inch heels, regular length, heels inch, size inch heels, size inch, inch heels perfect length, perfect length inch heels, perfect length inch, heels perfect length, heels inch heels, heels dress, heels rtr
Topic #1:
great dress, overall great dress, overall great, dress great, dress rtr, dress overall, dress formal, dress perfect, bachelorette party, heels great, dress good, area great, formal event, dress fit, party great, dress dress, bit snug, rent runway, holiday party, bust area
Topic #2:
black tie, tie wedding, black tie wedding, tie event, black tie event, tie optional, black tie optional, wore black tie, wore black, dress black tie, dress black, perfect black tie, perfect black, strapless bra, optional wedding, tie optional wedding, black tie optional wedding, dress black tie wedding, wore black tie w

<a id="nmf3-topic5-analysis"> </a>
#### Topic Analysis

1. Topic 0: Beautiful dress
2. Topic 1: Heels
3. Topic 2: Bra
4. Topic 3: Great overall
5. Topic 4: Event

The only good group of words I'm getting here is topic 4: the words describing the event are better with TextBlob.

<a id="modelReview"></a>
## Model Review
I can group the topics into these categories:
1. How much the person likes the dress
    1. General liking
        - Beauty
        - Recommend
        - Received compliments
    2. Liking that can be linked to body type
        - Bra
        - Heels usage
        - Good fit, true to size
    
2. Dress attributes 
    - Use of bra
    - Beauty
    - Heels (high heels, inches, flats)
    - Event it's used for
    - Other features

Therefore, I need these topics: 
- Beautiful
- Recommend
- Compliments
- Bra Issues
- Heels
- Good fit
- Event

    
From these topics, we can recommend a dress for a given body type.
- We can assess general liking using [the 5-topic NMF model](#nmf1-topic5).
- Looking at [the 9-topic noun model](#max1-topic9), I can also see that the more topics you request, the more detail you get. This only holds for the noun-only text.

<a id="likeness"></a>
## Evaluating Likeness

In [158]:
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(df_sent.review, sw, min_df=0, max_df=0.05, ngram=(1,2))

done in 7.611s.
Extracting tf features for LDA...
done in 6.554s.


In [159]:
nmf = NMF(n_components=5, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, 5)

Topic 0
['loved', 'loved dress', 'absolutely loved', 'absolutely', 'overall loved']
Topic 1
['many compliments', 'many', 'received', 'received many', 'got']
Topic 2
['true', 'true size', 'fit true', 'fits true', 'fits']
Topic 3
['rent', 'definitely', 'would definitely', 'definitely rent', 'recommend']
Topic 4
['fit perfectly', 'perfectly', 'dress fit', 'glove', 'like glove']



In [160]:
df_like = pd.DataFrame(nmf.transform(tfidf), columns=['love', 'compliment', 'true_fit', 'recommend', 'perfect_fit'])
# true fit suggests that the dress came true to size
# perfect fit suggests that the dress fits well for them

In [163]:
df_like['topic'] = df_like.idxmax(axis=1)

In [164]:
df_like = df_sent.join(df_like)

In [165]:
df_like.head(10)

| | index | review | love | compliment | true_fit | recommend | perfect_fit | topic |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | fits true to size | 0.0 | 0.0 | 0.065298 | 0.0 | 0.0 | true_fit |
| 1 | 0 | i'm 145 lb 5'1" and the 10r fit nice except it... | 0.0 | 5.3e-05 | 0.0 | 0.0 | 0.000578 | perfect_fit |
| 2 | 0 | got compliments from wedding guests i didn't e... | 0.0 | 0.004855 | 0.0 | 0.0 | 0.0 | compliment |
| 3 | 1 | i wish i could have gotten the 16l | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | love |
| 4 | 1 | i am 5'9 and the 16r was a tad short with my h... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000215 | perfect_fit |
| 5 | 1 | other than that i would recommend this dress f... | 0.0 | 0.0 | 0.0 | 0.006444 | 0.0 | recommend |
| 6 | 2 | i loved this dress so much! | 0.044943 | 0.0 | 0.0 | 0.0 | 0.0 | love |
| 7 | 2 | i got lots of compliments and it was very comf... | 0.0 | 0.008206 | 0.0 | 0.0 | 0.0 | compliment |
| 8 | 2 | i wore a 2r but next time i would definitely g... | 0.0 | 0.0 | 0.0 | 0.011736 | 0.0 | recommend |
| 9 | 2 | even with 3 inch heels i had to carry it every... | 0.0 | 0.0 | 0.0 | 0.0 | 0.000729 | perfect_fit |
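
Row 3 in the output above has zero weight on every topic yet is labelled `love`, because `idxmax` returns the first column when all values tie. A minimal sketch of one way to flag such sentences, assuming a placeholder label `'none'` that the notebook does not use:

```python
import pandas as pd

# Toy topic weights (a subset of columns, values taken from rows 2, 3 and 6 above)
weights = pd.DataFrame({
    'love':       [0.0, 0.0, 0.044943],
    'compliment': [0.004855, 0.0, 0.0],
    'true_fit':   [0.0, 0.0, 0.0],
})

# idxmax picks the first column when every value ties at zero
naive = weights.idxmax(axis=1)

# Keep the label only when at least one topic weight is positive;
# 'none' is a placeholder label chosen for this sketch.
labels = naive.where(weights.max(axis=1) > 0, 'none')
print(labels.tolist())
```

Whether unassigned sentences should be dropped or kept as their own category depends on how the topic column is used downstream.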


<a id="details"></a>
## Evaluating Dress Details


In [166]:
tfidf, tfidf_vectorizer, tf, tf_vectorizer = get_tfidf_and_tf(nouns, sw, max_df=0.8, ngram=(1, 2))

done in 3.277s.
Extracting tf features for LDA...
done in 2.495s.


In [169]:
nmf = NMF(n_components=20, random_state=1, alpha=.1, l1_ratio=.5).fit(tfidf)
tfidf_feature_names = tfidf_vectorizer.get_feature_names()
print_top_words(nmf, tfidf_feature_names, 10)

Topic 0
dress, dress fit, dress size, compliments dress, party, dress dress, dress night, dress compliments, size dress, ball
Topic 1
size, dress size, size size, fit size, size dress, size fit, backup, reviews, order, backup size
Topic 2
fit, dress fit, fit size, fit glove, glove, size fit, fit dress, fit perfect, perfect, fit well
Topic 3
compliments, compliments dress, compliments night, tons, tons compliments, lots compliments, lots, dress compliments, lot compliments, ton compliments
Topic 4
heels, inch, inch heels, heels dress, length heels, floor, heels length, ground, length inch, dress heels
Topic 5
bra, strapless, bra dress, strapless bra, dress bra, straps, backless, cut, tape, need
Topic 6
night, compliments night, dress night, night dress, end night, end, night long, people, long, complements
Topic 7
length, length heels, dress length, heels length, length dress, flats, heel, length inch, size length, fit length
Topic 8
bit, dress bit, sleeves, arms, top, area, size bit, c

Topic 0: Dress fit   
Topic 1: Size  
Topic 2: Dress fit  
Topic 3: Compliments  
Topic 4: Inches  
heels, inch, inch heels, heels dress, length heels, floor, heels length, ground, length inch, dress heels  
Topic 5: Bra  
bra, strapless, bra dress, strapless bra, dress bra, straps, backless, cut, tape, need  
Topic 6: Night  
night, compliments night, dress night, night dress, end night, end, night long, people, long, complements  
Topic 7: Length  
length, length heels, dress length, heels length, length dress, flats, heel, length inch, size length, fit length  
Topic 8: Arms/Sleeves  
bit, dress bit, sleeves, arms, top, area, size bit, chest, stretch, side  
Topic 9: Wedding  
wedding, tie, tie wedding, dress wedding, perfect, dress tie, fall, winter, party, wedding dress  
Topic 10: Sequins (itchy)  
sequins, arms, sequins arms, hair, dress sequins, sleeves, arm, scratchy, skin, itchy  
Topic 11: Color (gold)  
color, gold, dress color, person, cut, picture, shoes, style, skin, dress gold  
Topic 12: Experience  
rtr, time, experience, experience rtr, dress rtr, time rtr, rtr dress, rtr experience, order, service  
Topic 13: Material  
material, dress material, stretch, quality, stretchy, body, curves, lot, places, winter  
Topic 14: Event  
event, dress event, work, perfect, party, tie, event dress, tie event, time, day  
Topic 15: Back  
back, back dress, front, dress back, train, cut, part, straps, lace, reviews    
Topic 16: Stretch  
fabric, stretch, lot, stretchy, way, quality, skirt, hips, curves, give  
Topic 17: Hips area  
bust, waist, hips, area, way, chest, room, top, cut, shoulders  
Topic 18: Not sure  
dresses, designer, one, sizes, bridesmaids, bridesmaid, way, time, everyone, rent  
Topic 19: Issues  
zipper, way, side, help, issue, problem, someone, reviews, con, complaint  


In [None]:
topics = ['dress_fit', 'size']

In [181]:
df_details = pd.DataFrame(nmf.transform(tfidf), columns=['topic_{}'.format(x) for x in range(20)])

In [182]:
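# Drop topics 0 (dress fit), 6 (night), 7 (length), 12 (experience) and 18 (not sure);
# the remaining 15 topics are renamed below.
drop = [0, 6, 7, 12, 18]
for d in drop:
    df_details = df_details.drop('topic_{}'.format(d), axis=1)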

In [186]:
df_details.columns = ['size', 'dress_fit', 'compliments', 'length', 'bra', 'sleeves',
                      'wedding', 'sequins', 'color', 'material', 'event', 'back',
                     'stretch', 'hips', 'issues']


In [187]:
df_details['topic'] = df_details.idxmax(axis=1)

In [189]:
df_details = df_sent.join(df_details)

In [198]:
cur = rev_col.find({}, {"url": 1, "_id": 0})

In [199]:
df_urls = pd.DataFrame(list(cur))

In [203]:
df_details = df_details.join(df_urls, on="index")

In [204]:
df_dresses = pd.DataFrame(list(dress_col.find({}, {"_id": 0})))

In [212]:
df_details.columns

Index(['index', 'review', 'size', 'dress_fit', 'compliments', 'length', 'bra',
       'sleeves', 'wedding', 'sequins', 'color', 'material', 'event', 'back',
       'stretch', 'hips', 'issues', 'topic', 'url'],
      dtype='object')

<a id="featureEngineeringTopics"></a>

## Feature Engineering Topics
For some topics we must assess the polarity to see whether the attribute is rated positively or negatively (e.g. length, stretch).
But there are also topics that can be assessed just by the fact that they are mentioned (e.g. back, sequins). The two categories are measured differently: for the polarized topics, we take the mean of the sentiment; for the others, we take the mean of the topic probability.
### Polarized
['size', 'dress_fit','length', 'bra', 'sleeves', 'sequins', 'color', 'material', 'back', 'stretch', 'hips', 'issues']
### Positive
['compliments', 'issues', 'sequins'] 

I've also identified a third type of topic: categorical. These topics can be broken down further through more topic analysis to add categories (e.g. length: is it short or long?).
### Categorical
['length', 'bra','wedding', 'event', 'back', 'issues']

<a id="sentimentAnalysis"></a>
### Sentiment Analysis


In [222]:
def sentiment(s):
    return TextBlob(s).sentiment.polarity

In [223]:
sentiments = df_details.review.apply(sentiment)

In [226]:
df_details['sentiment'] = sentiments

In [323]:
df_polarized = df_details[df_details.topic.isin(['size', 'dress_fit','length', 
                                                 'bra', 'sleeves', 'sequins', 
                                                 'color', 'material', 'back', 
                                                 'stretch', 'hips', 'issues'])] 

In [324]:
df_polarized = pd.DataFrame(df_polarized.groupby(['url', 'topic'], as_index=False)['sentiment'].mean())

In [328]:
df_polarized = df_polarized.pivot(index='url', columns='topic')

In [329]:
df_polarized.head(3)

Values: mean `sentiment` per `url` and `topic`.

| url | back | bra | color | dress_fit | hips | issues | length | material | sequins | size | sleeves | stretch |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| https://www.renttherunway.com/shop/designers/alexis/red_leona_dress | 0.203861 | 0.096007 | 0.407753 | 0.423304 | 0.124177 | 0.081419 | 0.054844 | 0.290147 | 0.066022 | 0.279006 | 0.067228 | 0.105407 |
| https://www.renttherunway.com/shop/designers/allison_parris/cobalt_marilyn_gown | 0.144366 | 0.140034 | 0.411457 | 0.360768 | 0.122181 | 0.142913 | 0.317536 | 0.300392 | 0.157278 | 0.296506 | 0.061319 | 0.277693 |
| https://www.renttherunway.com/shop/designers/badgley_mischka/award_winner_gown | 0.160555 | 0.14096 | 0.367764 | 0.467023 | 0.179997 | 0.152752 | 0.24222 | 0.313751 | 0.080275 | 0.286641 | 0.148834 | 0.243559 |


In [346]:
df_polarized = df_polarized.replace(np.nan, 0) # if it's not mentioned - it's neutral

<a id="topicProbability"></a>
### Topic probability analysis

In [339]:
topic_prob = ['compliments', 'issues', 'sequins']
df_topic_probability = df_details.groupby('index', as_index=False)[topic_prob].sum()

In [340]:
df_index_url = df_details.groupby('index')[['index', 'url']].head(1)

In [341]:
df_topic_probability = df_topic_probability.merge(df_index_url, on='index')

In [342]:
df_topic_probability = df_topic_probability.groupby('url')[topic_prob].mean()

The compliments and issues scores will be multiplied into the end result as a weighting for each dress, so that a dress with more issues is ranked relatively lower.

In [316]:
from sklearn.metrics.pairwise import euclidean_distances

In [348]:
df_complete = df_polarized.join(df_topic_probability)



Save everything so that it can be combined with body type information.
For each body type:  
Sentiment on fit, length, hips, size, sleeves and stretch depends on body type, so it is evaluated per body type.  
Thus, we have a dress space for each cluster.  
Bra, sequins, color, material and back are fairly general, so these values stay the same in all of the clusters.  
Comfort depends on the fit and stretch for each body type.

Recommendation
1. See which cluster the person belongs to.
2. Take their priorities.
3. Take the closest N neighbors 
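
The recommendation steps could be sketched as follows. This is only an illustration under stated assumptions: step 1 (finding the user's cluster) is taken as already done, the user's priorities are assumed to act as per-feature weights in a Euclidean distance, and `recommend`, `ideal`, `priorities` and all numbers are hypothetical toy values rather than the notebook's data.

```python
import numpy as np

def recommend(dress_features, dress_names, user_vector, priorities, n_neighbors=2):
    """Return the n_neighbors dresses closest to user_vector under a weighted Euclidean distance."""
    X = np.asarray(dress_features, dtype=float)
    w = np.asarray(priorities, dtype=float)
    diff = X - np.asarray(user_vector, dtype=float)
    # Features the user cares about more contribute more to the distance
    dist = np.sqrt((w * diff ** 2).sum(axis=1))
    order = np.argsort(dist)[:n_neighbors]
    return [dress_names[i] for i in order]

# Toy dress space for one body-type cluster: columns are (fit, length, stretch)
dress_names = ['dress_a', 'dress_b', 'dress_c']
dress_features = [[0.4, 0.1, 0.3],
                  [0.2, 0.3, 0.1],
                  [0.0, 0.0, 0.0]]
ideal = [0.5, 0.3, 0.3]        # target sentiment profile (assumed)
priorities = [1.0, 0.2, 0.5]   # fit matters most to this user (assumed)

print(recommend(dress_features, dress_names, ideal, priorities))
```

With equal priorities this reduces to a plain Euclidean nearest-neighbor lookup, which is what the imported `euclidean_distances` would provide.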