<a href="https://colab.research.google.com/github/n8mcdunna/DS-Unit-4-Sprint-1-NLP/blob/main/module4-topic-modeling/414_Topic_Modeling_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Topic Modeling
## *Data Science Unit 4 Sprint 1 Assignment 4*

Analyze a corpus of Amazon reviews from Unit 4 Sprint 1 Module 1's lecture using topic modeling: 

- Fit a Gensim LDA topic model on Amazon Reviews
- Select an appropriate number of topics
- Create a dope visualization of the topics
- Write a few bullets on your findings in markdown at the end
- **Note**: You don't *have* to use generators for this assignment
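
For anyone curious about the generator route mentioned in the note, here is a minimal sketch of what streaming the reviews could look like. It is not part of the assignment solution: `stream_tokens`, `read_reviews`, and the regex tokenizer are hypothetical helpers, and the only name taken from the dataset is the `reviews.text` column.

```python
import csv
import re
from typing import Iterable, Iterator, List

# Hypothetical, deliberately simple tokenizer: runs of lowercase letters and apostrophes.
TOKEN_PATTERN = re.compile(r"[a-z']+")


def read_reviews(path: str) -> Iterator[dict]:
    """Lazily yield CSV rows as dictionaries keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def stream_tokens(rows: Iterable[dict], column: str = "reviews.text") -> Iterator[List[str]]:
    """Yield one token list per review without holding the whole corpus in memory."""
    for row in rows:
        text = row.get(column) or ""  # treat missing reviews as empty text
        yield TOKEN_PATTERN.findall(text.lower())


# Usage (not run here): iterate once over the file, one review at a time.
# for tokens in stream_tokens(read_reviews("reviews.csv")):
#     ...
```

Because a generator can only be consumed once, anything that needs several passes over the data (such as LDA training with `passes > 1`) would need the stream to be re-created for each pass.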

In [None]:
!python -m spacy download en_core_web_sm 
# Restart runtime after running this download

In [None]:
!wget "https://github.com/LambdaSchool/DS-Unit-4-Sprint-1-NLP/blob/main/module1-text-data/data/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv.zip?raw=true" -O datafiniti.zip
!unzip datafiniti.zip

In [2]:
import pandas as pd
df = pd.read_csv('/content/Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')

In [5]:
import re
import numpy as np
import pandas as pd
from pprint import pprint

# these are all gensim imports 
# check out the gensim docs: https://radimrehurek.com/gensim/
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# we use spaCy to tokenize and lemmatize the reviews (en_core_web_sm does not ship word vectors)
import spacy

# all these imports are for the data viz tool that we use to look at the LDA results and manually determine the topics in our dataset 
# check out the docs: https://pyldavis.readthedocs.io/en/latest/readme.html
# the docs also have an EXAMPLE NOTEBOOK that is more detailed about this data viz tool than our lecture notebook: https://nbviewer.jupyter.org/github/bmabey/pyLDAvis/blob/master/notebooks/pyLDAvis_overview.ipynb
# import pyLDAvis
# import pyLDAvis.gensim 
import matplotlib.pyplot as plt
%matplotlib inline

In [13]:
df['reviews.text'].head(10)

0    I order 3 of them and one of the item is bad q...
1    Bulk is always the less expensive way to go fo...
2    Well they are not Duracell but for the price i...
3    Seem to work as well as name brand batteries a...
4    These batteries are very long lasting the pric...
5    Bought a lot of batteries for Christmas and th...
6    ive not had any problame with these batteries ...
7    Well if you are looking for cheap non-recharge...
8    These do not hold the amount of high power jui...
9    AmazonBasics AA AAA batteries have done well b...
Name: reviews.text, dtype: object

In [15]:
nlp = spacy.load("en_core_web_sm")

In [16]:
# Lemmatize each review, dropping stop words and punctuation
df['lemmas'] = df['reviews.text'].apply(
    lambda doc: [token.lemma_ for token in nlp(doc)
                 if not token.is_stop and not token.is_punct]
)

In [17]:
id2word = corpora.Dictionary(df['lemmas'])
corpus = [id2word.doc2bow(text) for text in df['lemmas']]
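
To make the shape of `corpus` concrete, the following plain-Python sketch mimics the idea behind a dictionary plus bag-of-words encoding: each document becomes a sorted list of `(token_id, count)` pairs. It is an illustration only, not Gensim's implementation. `build_vocabulary` and `to_bag_of_words` are hypothetical helpers, the two toy documents are made up, and the ids Gensim assigns may differ.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_vocabulary(documents: List[List[str]]) -> Dict[str, int]:
    """Assign an integer id to every distinct token, in order of first appearance."""
    vocabulary: Dict[str, int] = {}
    for tokens in documents:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary


def to_bag_of_words(tokens: List[str], vocabulary: Dict[str, int]) -> List[Tuple[int, int]]:
    """Return sorted (token_id, count) pairs, skipping tokens not in the vocabulary."""
    counts = Counter(vocabulary[token] for token in tokens if token in vocabulary)
    return sorted(counts.items())


# Toy documents, invented for illustration only.
docs = [["battery", "last", "long"], ["battery", "cheap", "battery"]]
vocab = build_vocabulary(docs)
bows = [to_bag_of_words(doc, vocab) for doc in docs]
print(bows)  # [[(0, 1), (1, 1), (2, 1)], [(0, 2), (3, 1)]]
```

Word order is discarded in this representation, which is the assumption LDA works under.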

In [18]:
lda_mc = gensim.models.ldamulticore.LdaMulticore(
    corpus = corpus,
    id2word = id2word,
    num_topics = 5,
    workers = 2
)
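
Before visualizing, it can help to read the top words of each topic directly. The sketch below assumes the `(topic_id, formatted_string)` pairs that `print_topics()` returns, where each word appears in double quotes next to its weight. `topic_words` is a hypothetical helper, not part of Gensim.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Matches the words that appear in double quotes in a formatted topic string.
QUOTED_WORD = re.compile(r'"([^"]+)"')


def topic_words(formatted_topics: Iterable[Tuple[int, str]]) -> Dict[int, List[str]]:
    """Map each topic id to the words quoted in its formatted description."""
    return {
        topic_id: QUOTED_WORD.findall(description)
        for topic_id, description in formatted_topics
    }


# Usage with the model fitted above (not run here):
# for topic_id, words in topic_words(lda_mc.print_topics(num_words=8)).items():
#     print(topic_id, ", ".join(words))
```

Reading the word lists side by side makes it easier to give each topic a human-readable label.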

In [20]:
lda_mc.save('lda_multicore.model')

In [19]:
# To reload the saved model later, use the same filename that was passed to save():
# from gensim import models
# lda_mc = models.LdaMulticore.load('lda_multicore.model')

In [None]:
!pip install pyLDAvis

In [24]:
import pyLDAvis
# Note: recent pyLDAvis releases rename this module to pyLDAvis.gensim_models
import pyLDAvis.gensim

In [25]:
pyLDAvis.enable_notebook()

In [26]:
visualization = pyLDAvis.gensim.prepare(
    lda_mc,
    corpus,
    id2word
)

In [27]:
visualization

In [28]:
def compute_coherence_values(dictionary, corpus, texts, limit, start=2, step=3):
    """
    Compute c_v coherence for a range of topic counts

    Parameters:
    ----------
    dictionary : Gensim dictionary
    corpus : Gensim corpus
    texts : List of tokenized input texts
    limit : Max num of topics (exclusive)
    start : Smallest num of topics to try
    step : Increment between successive topic counts

    Returns:
    -------
    model_list : List of LDA topic models
    coherence_values : Coherence values corresponding to the LDA model with respective number of topics
    """
    coherence_values = []
    model_list = []

    for num_topics in range(start, limit, step):
        # Fit one model per candidate topic count
        model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus,
                                                        id2word=dictionary,  # use the argument, not the global id2word
                                                        num_topics=num_topics,
                                                        chunksize=100,
                                                        passes=10,
                                                        per_word_topics=True,
                                                        workers=12)  # adjust to the number of available CPU cores
        model_list.append(model)
        # Score the fitted model with c_v coherence against the tokenized texts
        coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=dictionary, coherence='c_v')
        coherence_values.append(coherencemodel.get_coherence())

    return model_list, coherence_values

In [None]:
model_list, coherence_values = compute_coherence_values(
    dictionary = id2word,
    corpus = corpus,
    texts = df['lemmas'],
    limit = 15,
)

In [31]:
for m, cv in zip(model_list, coherence_values):
    print(f'{m.num_topics} topics has coherence value of {round(cv,5)}')

2 topics has coherence value of 0.44106
5 topics has coherence value of 0.48254
8 topics has coherence value of 0.44795
11 topics has coherence value of 0.42631
14 topics has coherence value of 0.42565
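
One simple way to turn the scores above into a choice is to take the topic count with the highest c_v coherence. The sketch below hard-codes the printed values so it runs on its own; `best_num_topics` is a hypothetical helper, and picking the maximum is only one possible selection rule.

```python
from typing import Dict

# Coherence scores copied from the output printed above, keyed by number of topics.
coherence_by_topics: Dict[int, float] = {
    2: 0.44106,
    5: 0.48254,
    8: 0.44795,
    11: 0.42631,
    14: 0.42565,
}


def best_num_topics(scores: Dict[int, float]) -> int:
    """Return the topic count with the highest coherence score."""
    return max(scores, key=scores.get)


best_k = best_num_topics(coherence_by_topics)
print(f"Highest c_v coherence: {best_k} topics ({coherence_by_topics[best_k]:.5f})")
# prints: Highest c_v coherence: 5 topics (0.48254)
```

The search only tried 2, 5, 8, 11, and 14 topics, so neighboring values such as 4 or 6 were never scored. LDA training is also randomized, so the scores can shift between runs unless a `random_state` is fixed.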


## Stretch Goals

* Incorporate Named Entity Recognition in your analysis
* Incorporate some custom pre-processing from our previous lessons (like spacy lemmatization)
* Analyze a dataset of interest to you with topic modeling