<a href="https://colab.research.google.com/github/punkmic/Topic-modeling-Amazon-Reviews/blob/master/MSDSTopicModel_HW2_BuildATopicModel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MSDS Marketing Text Analytics, Unit 2, Assignment 2: Build a topic model

## ⚡️ Make a Copy

Save a copy of this notebook in your Google Drive before continuing. Be sure to edit your own copy, not the original notebook.

In this assignment, you will implement a topic model preprocessor which can then be applied to the task of topic-modeling Amazon text reviews. Please review the course lectures and documentation up to this point before continuing. Be sure also to be familiar with the [documentation for TMToolkit](https://tmtoolkit.readthedocs.io/en/latest/topic_modeling.html)

Be sure to make a copy into your own Drive account before editing this notebook.

You will implement a preprocessing function to prepare your corpus for topic modeling. It is recommended that you use a small test corpus (an example is provided below) for development, rather than starting with the full review set.

## Imports

**Note:** we have to do a good amount of dependency cleanup in order to get tmtoolkit working in Colab. You will likely see WARNINGS in the output below, but you should not see any ERRORs. In the end, you should end up with spacy 2.3.7, en_core_web_sm 2.3.1, and tmtoolkit 0.10.0.

**Important:** You will also likely see a message to restart the runtime after the installations are complete, and should do so.

In [4]:
try:
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel
except ModuleNotFoundError:
    !pip install --upgrade pip
    !pip install lda
    !pip install spacy-model-manager
    !spacy-model remove en_core_web_sm
    !pip uninstall -y spacy-model-manager
    !pip uninstall -y spacy
    !pip install spacy==2.3.7
    !python -m spacy download en_core_web_sm
    !pip uninstall -y imgaug
    !pip install "imgaug<0.2.7,>=0.2.5"
    !pip install tmtoolkit==0.10.0
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel


ImportError: ignored

**NOTE:** Loading a corpus as a list of strings is not the only way to use tmtoolkit. Given, for example, a large corpus that might not fit in memory, the current approach would not work well. See the tmtoolkit docs on [working with text corpora](https://tmtoolkit.readthedocs.io/en/latest/text_corpora.html) for more info.

## Implement a pre-processor

Here you will implement a function called `preprocess` which returns the TMPreproc object to be used for topic modeling.

The preprocess function will take a list of texts and return a pre-processed corpus object, i.e. a TMPreproc object. Preprocessing should include the following actions on the corpus using the appropriate methods in the TMPreproc class:

 - lemmatize the texts
 - convert tokens to lowercase
 - remove special characters
 - clean tokens to remove numbers and any tokens shorter than 3 characters

The first part of the function to create the corpus and preprocess object are done for you. Your job is to call the specific preprocess functions and to return the resulting preprocess object.


In [7]:
def preprocess(texts, lang="en"):
    """Preprocessor which returns a TMPreproc object processed on corpus as language
    specified by lang (defaults to "en"):

    Should perform all of the following pre-processing functions:
     - lemmatize
     - tokens_to_lowercase
     - remove_special_chars_in_tokens
     - clean_tokens (remove numbers, and remove tokens shorter than 2)
    """
    # Here, we just use the index of the text as the label for the corpus item
    corpus = Corpus({ i:r for i, r in enumerate(texts) })
    preproc = TMPreproc(corpus, language=lang).lemmatize().tokens_to_lowercase().remove_special_chars_in_tokens().clean_tokens(remove_shorter_than=2, remove_numbers=True)

    # TODO: Complete the implementation of this function and submit the
    # .py download of this notebook as your assignment submission.
    return preproc

In [None]:
#~~ /autograde # do not delete this cell

---
### ⚠️  **Caution:** No arbitrary code above this line

The only code written above should be the implementation of your graded function. For experimentation and testing, only add code below.
___

## Function development

Use this section of code to verify your function implementation. You may change the test_corpus as needed to verify your implementation. The grader will be checking that your function returns a TMPreproc object that meets all of the following critera:

 - tokens are lemmatized
 - tokens are converted to lowercase
 - special characters are removed from tokens
 - tokens shorter than 3 characters and numerics are removed

In [8]:
import pprint
pp = pprint.PrettyPrinter(indent=4)

In [9]:
test_corpus = [ # Feel free to edit this corpus for further testing
                # to be sure that your functions meet specifications.
    "The 3 cats sat on the mats!",
    "1 fish 2 fish Red fish Blue fish",
    "She sells $ea$shells"
]
preproc = preprocess(test_corpus)
pp.pprint(preproc.get_tokens())

{   0: {   'lemma': ['cat', 'sit', 'mat'],
           'token': ['cat', 'sit', 'mat'],
           'whitespace': [True, True, False]},
    1: {   'lemma': ['fish', 'fish', 'Red', 'fish', 'Blue', 'fish'],
           'token': ['fish', 'fish', 'red', 'fish', 'blue', 'fish'],
           'whitespace': [True, True, True, True, True, False]},
    2: {   'lemma': ['sell', 'ea$shells'],
           'token': ['sell', 'eashells'],
           'whitespace': [True, False]}}


In [10]:
dtms = {
    "test_corpus": preproc.dtm
}
lda_params = {
    'n_topics': 2,
    'eta': .01,
    'n_iter': 10,
    'random_state': 1234,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)

In [11]:
model = models["test_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, preproc.vocabulary, top_n=5)

topic_1
> #1. fish (0.566384)
> #2. red (0.142655)
> #3. cat (0.142655)
> #4. blue (0.142655)
> #5. sit (0.001412)
topic_2
> #1. sit (0.247549)
> #2. sell (0.247549)
> #3. mat (0.247549)
> #4. eashells (0.247549)
> #5. red (0.002451)


### Assignment submission

After completing the preprocess implementation, download your notebook as a .py file (File > Download > Download .py) and submit the downloaded file for grading.

## Topic modeling Amazon Reviews

Once you have completed the assignment above, you will be well prepared to start your final project for this unit. The project will include loading Amazon reviews into a corpus for topic modeling. The code below demonstrates topic modeling the reviews for a given brand. Note that the final project will require additional segmentation of the data, which is not done for you in the example here.

In [13]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [14]:
import gzip
import itertools
import json

asins = []

# To run this code, you will need to download the metadata file from the course
# assets and upload it to your Google Drive. See the notes about that file
# regarding how it was processed from the original file into json-l format.

with gzip.open("drive/MyDrive/meta_Clothing_Shoes_and_Jewelry.jsonl.gz") as products:
    for product in products:
        data = json.loads(product)
        categories = [c.lower() for c in
                      list(itertools.chain(*data.get("categories", [])))]
        if "nike" in categories:
            asins.append(data["asin"])

Inspect the first fews ASINs

In [15]:
asins[:3]

['B0000V9K32', 'B0000V9K3W', 'B0000V9K46']

Check the length, i.e. the number of resulting ASINs

In [16]:
len(asins)

8327

Build a corpus of review texts

In [17]:
review_corpus = []
with gzip.open("drive/MyDrive/reviews_Clothing_Shoes_and_Jewelry_5.json.gz") as reviews:
    for review in reviews:
        data = json.loads(review)
        if data["asin"] in asins:
            text = data["reviewText"]
            review_corpus.append(text)

Inspect a few of the reviews

In [18]:
for i, review in enumerate(review_corpus[:5]):
    print(i, review[:80])

0 the colour i received is not blue as shown but yellow.Couldnt change it because 
1 Very cute and is really practical. Fits better on smaller wrists which is my cas
2 The watch was exactly what i ordered and I got it very fast. Unfortunately it wa
3 This product came promptly and as described, pleasure doing business with them!-
4 Why isn't Nike making these anymore?  I love this watch, and I get a lot of comp


Build a TMPreproc object from the review corpus

In [19]:
pre = preprocess(review_corpus)

In [20]:
dtms = {
    "reviews_corpus": pre.dtm
}
lda_params = {
    'n_topics': 10,
    'eta': .01,
    'n_iter': 10,
    'random_state': 1234,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)



Print the topics

In [21]:
model = models["reviews_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, pre.vocabulary, top_n=5)

topic_1
> #1. shoe (0.046912)
> #2. good (0.023715)
> #3. fit (0.021885)
> #4. size (0.018012)
> #5. comfortable (0.017603)
topic_2
> #1. shoe (0.029601)
> #2. nike (0.015272)
> #3. watch (0.014036)
> #4. great (0.012632)
> #5. look (0.012486)
topic_3
> #1. shoe (0.030574)
> #2. size (0.019319)
> #3. fit (0.012594)
> #4. buy (0.011312)
> #5. love (0.011284)
topic_4
> #1. shoe (0.035784)
> #2. watch (0.018186)
> #3. good (0.017141)
> #4. great (0.016400)
> #5. nike (0.012763)
topic_5
> #1. shoe (0.038853)
> #2. great (0.017597)
> #3. wear (0.014864)
> #4. get (0.014170)
> #5. love (0.013813)
topic_6
> #1. shoe (0.047699)
> #2. great (0.017569)
> #3. run (0.016964)
> #4. love (0.015816)
> #5. fit (0.014231)
topic_7
> #1. shoe (0.033067)
> #2. wear (0.015657)
> #3. buy (0.013333)
> #4. good (0.012803)
> #5. foot (0.011926)
topic_8
> #1. shoe (0.049415)
> #2. good (0.016635)
> #3. great (0.016451)
> #4. wear (0.016002)
> #5. love (0.015982)
topic_9
> #1. shoe (0.055937)
> #2. fit (0.019548

## Save your topic model and corpus for use in Lab 2

Once you have completed the above assignment, run the following code to save your topic model and your corpus to your Google Drive. You will load this model and use it for document classification in Lab 2.

In [22]:
import pickle
from tmtoolkit.topicmod.model_io import save_ldamodel_to_pickle

with open("drive/MyDrive/MSDS_HW2_model.p", "wb") as modelfile:
    save_ldamodel_to_pickle(modelfile, model, pre.vocabulary, pre.doc_labels, dtm=pre.dtm)

In [23]:
with open("drive/MyDrive/MSDS_HW2_corpus.p", "wb") as corpusfile:
    pickle.dump(review_corpus, corpusfile)