
# Topic Modeling
## *Data Science Unit 4 Sprint 1 Assignment 4*

Analyze the corpus of Amazon reviews from the Unit 4 Sprint 1 Module 1 lecture using topic modeling:

- Fit a Gensim LDA topic model on the Amazon reviews
- Select an appropriate number of topics
- Create some dope visualizations of the topics
- Write a few bullets on your findings in Markdown at the end
- **Note**: You don't *have* to use generators for this assignment

In [1]:
import numpy as np
import pandas as pd
import gensim
import os
import re
import zipfile

from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from gensim import corpora

from gensim.models.ldamulticore import LdaMulticore



## Load the reviews

In [7]:
path = "data/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv"

reviews = pd.read_csv(path)
reviews.head()

| | id | dateAdded | dateUpdated | name | asins | brand | categories | primaryCategories | imageURLs | keys | ... | reviews.didPurchase | reviews.doRecommend | reviews.id | reviews.numHelpful | reviews.rating | reviews.sourceURLs | reviews.text | reviews.title | reviews.username | sourceURLs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 3 | https://www.amazon.com/product-reviews/B00QWO9... | I order 3 of them and one of the item is bad q... | ... 3 of them and one of the item is bad quali... | Byger yang | https://www.barcodable.com/upc/841710106442,ht... |
| 1 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 4 | https://www.amazon.com/product-reviews/B00QWO9... | Bulk is always the less expensive way to go fo... | ... always the less expensive way to go for pr... | ByMG | https://www.barcodable.com/upc/841710106442,ht... |
| 2 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 5 | https://www.amazon.com/product-reviews/B00QWO9... | Well they are not Duracell but for the price i... | ... are not Duracell but for the price i am ha... | BySharon Lambert | https://www.barcodable.com/upc/841710106442,ht... |
| 3 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 5 | https://www.amazon.com/product-reviews/B00QWO9... | Seem to work as well as name brand batteries a... | ... as well as name brand batteries at a much ... | Bymark sexson | https://www.barcodable.com/upc/841710106442,ht... |
| 4 | AVpgNzjwLJeJML43Kpxn | 2015-10-30T08:59:32Z | 2019-04-25T09:08:16Z | AmazonBasics AAA Performance Alkaline Batterie... | B00QWO9P0O,B00LH3DMUO | Amazonbasics | AA,AAA,Health,Electronics,Health & Household,C... | Health & Beauty | https://images-na.ssl-images-amazon.com/images... | amazonbasics/hl002619,amazonbasicsaaaperforman... | ... | | | | | 5 | https://www.amazon.com/product-reviews/B00QWO9... | These batteries are very long lasting the pric... | ... batteries are very long lasting the price ... | Bylinda | https://www.barcodable.com/upc/841710106442,ht... |


In [8]:
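# Extend gensim's stop word list with a corpus-specific term
STOPWORDS = set(STOPWORDS).union({'amazon'})

def tokenize(text):
    # Lowercase, strip accents, and keep tokens of 4 to 20 characters
    tokens = simple_preprocess(text, deacc=True, min_len=4, max_len=20)
    return [token for token in tokens if token not in STOPWORDS]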

In [9]:
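def wrangle(df):
    new_df = df.copy()
    # Normalize brand names; .str.lower() also tolerates missing values
    new_df['brand'] = new_df['brand'].str.lower()
    # Tokenize the raw review text
    new_df['tokens'] = new_df['reviews.text'].apply(tokenize)
    # Keep only the columns needed for topic modeling
    return new_df[['brand', 'reviews.text', 'tokens']]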

df = wrangle(reviews)
df.head()

| | brand | reviews.text | tokens |
|---|---|---|---|
| 0 | amazonbasics | I order 3 of them and one of the item is bad q... | [order, item, quality, missing, backup, spring... |
| 1 | amazonbasics | Bulk is always the less expensive way to go fo... | [bulk, expensive, products, like] |
| 2 | amazonbasics | Well they are not Duracell but for the price i... | [duracell, price, happy] |
| 3 | amazonbasics | Seem to work as well as name brand batteries a... | [work, brand, batteries, better, price] |
| 4 | amazonbasics | These batteries are very long lasting the pric... | [batteries, long, lasting, price, great] |


## LDA Topic Modeling

In [10]:
def get_reviews(df, token_col):
    """Yield one review's token list at a time."""
    assert token_col in df.columns, f"{token_col} does not exist!"

    # Iterating the column directly avoids the per-row overhead of iterrows()
    yield from df[token_col]

In [11]:
id2word = corpora.Dictionary(get_reviews(df, 'tokens'))
id2word.token2id['batteries']

16

In [12]:
sample_tokens = tokenize("I order 3 of them and one of the item is bad quality. Is missing backup spring so I have to put a pcs of aluminum to make the battery work.")
sample_tokens

['order',
 'item',
 'quality',
 'missing',
 'backup',
 'spring',
 'aluminum',
 'battery',
 'work']

In [13]:
id2word.doc2bow(sample_tokens)

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
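
To make the `(id, count)` pairs above concrete, here is a minimal pure-Python sketch of the bag-of-words idea. It is not gensim's implementation: `build_token2id` and `toy_doc2bow` are hypothetical helpers, and ids here are assigned in order of first appearance, which is a simplification and may not match the ids gensim assigns.

```python
from collections import Counter

def build_token2id(documents):
    """Map each distinct token to an integer id (first appearance order, an assumption)."""
    token2id = {}
    for tokens in documents:
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def toy_doc2bow(tokens, token2id):
    """Return (token_id, count) pairs sorted by id, ignoring unknown tokens."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

# Toy documents for illustration only
docs = [["batteries", "long", "lasting"], ["batteries", "price", "batteries"]]
token2id = build_token2id(docs)
bow = toy_doc2bow(docs[1], token2id)
print(bow)
```

Each review becomes a sparse list of pairs, so a repeated word raises the count while word order is discarded.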

In [14]:
# A bag-of-words (BoW) representation of our corpus.
# Note: get_reviews streams the token lists, but the list comprehension
# still holds every BoW vector in memory.
corpus = [id2word.doc2bow(tokens) for tokens in get_reviews(df, 'tokens')]
corpus[0][:10]

[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1)]
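
The cell above stores every bag-of-words vector in a list. If memory were a concern, one alternative is a small re-iterable wrapper that builds each vector on demand. The sketch below is an assumption-laden illustration: `StreamingCorpus` and `toy_doc2bow` are hypothetical names, and the toy converter only stands in for `id2word.doc2bow`. A plain generator would likely not be enough here, because training with several passes needs to iterate the corpus more than once.

```python
class StreamingCorpus:
    """Re-iterable corpus: converts each token list to BoW only when requested."""

    def __init__(self, token_lists, to_bow):
        self.token_lists = token_lists  # any re-iterable of token lists
        self.to_bow = to_bow            # callable: list of tokens -> BoW vector

    def __iter__(self):
        # A fresh iteration starts every time, so multiple passes work
        for tokens in self.token_lists:
            yield self.to_bow(tokens)

    def __len__(self):
        return len(self.token_lists)

# Toy stand-in for id2word.doc2bow (hypothetical, for demonstration only)
toy_ids = {"batteries": 0, "price": 1}

def toy_doc2bow(tokens):
    counts = {}
    for t in tokens:
        if t in toy_ids:
            counts[toy_ids[t]] = counts.get(toy_ids[t], 0) + 1
    return sorted(counts.items())

stream = StreamingCorpus([["batteries", "price"], ["batteries", "batteries"]], toy_doc2bow)
first_pass = list(stream)
second_pass = list(stream)  # same result: the corpus is not exhausted
```

In this notebook the equivalent call would presumably be `StreamingCorpus(df['tokens'], id2word.doc2bow)`. For a dataset of this size, the in-memory list is a perfectly reasonable choice.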

In [15]:
lda = LdaMulticore(corpus=corpus,
                   id2word=id2word,
                   random_state=723812,
                   num_topics = 15,
                   passes=10,
                   workers=4
                  )

In [16]:
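def print_lda_topics(lda):
    """Print the five highest-weighted words of each topic."""
    # print_topics() returns (topic_id, 'weight*"word" + ...') pairs; pull out the quoted words
    words = [re.findall(r'"([^"]*)"', t[1]) for t in lda.print_topics()]
    topics = [' '.join(t[0:5]) for t in words]
    for topic_id, t in enumerate(topics):
        print(f"------ Topic {topic_id} ------")
        print(t, end="\n\n")

print_lda_topics(lda)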

------ Topic 0 ------
purchase batteries battery happy pleased

------ Topic 1 ------
tablet great size good screen

------ Topic 2 ------
battery like batteries duracell long

------ Topic 3 ------
love loves year bought games

------ Topic 4 ------
kindle great excellent black sound

------ Topic 5 ------
games easy play like loves

------ Topic 6 ------
works easy great read user

------ Topic 7 ------
loves bought daughter tablet wife

------ Topic 8 ------
great good price product batteries

------ Topic 9 ------
good charge kindle screen like

------ Topic 10 ------
tablet love great easy perfect

------ Topic 11 ------
gift bought recommend best loved

------ Topic 12 ------
kids tablet apps great play

------ Topic 13 ------
books read kindle reading tablet

------ Topic 14 ------
batteries long work brand battery



## Topic Distance Visualization

In [18]:
# Note: pyLDAvis 3.3+ renamed this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [19]:
pyLDAvis.gensim.prepare(lda, corpus, id2word)


In [20]:
from gensim.models import CoherenceModel

cm = CoherenceModel(model=lda, corpus=corpus, coherence='u_mass')
print(f'Coherence score: {cm.get_coherence()}')
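
For intuition about the `u_mass` measure requested in the cell above, here is a minimal pure-Python sketch of the UMass coherence idea for a single topic: for every ordered pair of top words, take the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word. The function name `umass_coherence` and the toy documents are assumptions for illustration. gensim's own pipeline differs in smoothing and aggregation details, so the numbers will not match.

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents):
    """Sum of log((D(later, earlier) + 1) / D(earlier)) over ranked word pairs.

    Assumes every word in top_words appears in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    # combinations() keeps ranking order: `earlier` is the higher-ranked word
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score

# Toy documents for illustration only
documents = [["batteries", "price"], ["batteries", "long"], ["tablet", "kids"]]
coherent = umass_coherence(["batteries", "price"], documents)
incoherent = umass_coherence(["batteries", "tablet"], documents)
print(coherent, incoherent)
```

Words that share documents score closer to zero, while words that never co-occur are pushed further negative, which is why a higher (less negative) score is usually read as a more coherent topic.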

## Stretch Goals

* Incorporate Named Entity Recognition in your analysis
* Incorporate some custom pre-processing from our previous lessons (like spaCy lemmatization)
* Analyze a dataset of interest to you with topic modeling