# DTSA 5799 Unsupervised Text Classification for Marketing Analytics Final Project

## Imports

In [2]:
try:
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel
except ModuleNotFoundError:
    !pip install lda
    !pip install tmtoolkit
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel

## Implement a pre-processor

Here you will implement a function called `preprocess` which returns the TMPreproc object to be used for topic modeling.

The preprocess function will take a list of texts and return a pre-processed corpus object, i.e. a TMPreproc object. Preprocessing should include the following actions on the corpus using the appropriate methods in the TMPreproc class:

 - lemmatize the texts
 - convert tokens to lowercase
 - remove special characters
 - clean tokens to remove numbers and any tokens shorter than 3 characters

The first part of the function, which creates the corpus and the preprocess object, is done for you. Your job is to call the specific preprocessing methods and to return the resulting preprocess object.
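
To make the effect of the last three steps concrete, here is a minimal plain-Python sketch on a single list of tokens. It is not tmtoolkit's implementation: `clean_token_list` is a hypothetical helper, and treating "special characters" as `string.punctuation` and "numbers" as `str.isnumeric()` are assumptions made for illustration. Lemmatization is left out because it needs a language model.

```python
import string


def clean_token_list(tokens, min_length=3):
    """Hypothetical helper: lowercase, strip punctuation, drop numbers and short tokens."""
    # Translation table that deletes every ASCII punctuation character (assumption)
    strip_punctuation = str.maketrans("", "", string.punctuation)
    cleaned = []
    for token in tokens:
        token = token.lower().translate(strip_punctuation)
        # Drop tokens that are too short or purely numeric after cleaning
        if len(token) < min_length or token.isnumeric():
            continue
        cleaned.append(token)
    return cleaned


example = ["Great", "shoes", "!!", "size", "10", "is", "well-made", "2015"]
print(clean_token_list(example))
```

Only `great`, `shoes`, `size`, and `wellmade` survive here: the punctuation-only token becomes empty, `10` and `is` are shorter than 3 characters, and `2015` is numeric.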


In [3]:
def preprocess(texts, lang="en"):
    """Preprocessor which returns a TMPreproc object processed on corpus as language
    specified by lang (defaults to "en"):

    Should perform all of the following pre-processing functions:
     - lemmatize
     - tokens_to_lowercase
     - remove_special_chars_in_tokens
     - clean_tokens (remove numbers, and remove tokens shorter than 3)
    """
    # Here, we just use the index of the text as the label for the corpus item
    corpus = Corpus({ i:r for i, r in enumerate(texts) })

    preproc = TMPreproc(corpus, language=lang)

    # Apply each preprocessing step to the TMPreproc instance
    preproc.lemmatize()
    preproc.tokens_to_lowercase()
    preproc.remove_special_chars_in_tokens()
    preproc.clean_tokens(remove_shorter_than=3, remove_numbers=True)

    return preproc

    # TODO: Complete the implementation of this function and submit the
    # .py download of this notebook as your assignment submission.

In [None]:
#~~ /autograde # do not delete this cell

---
### ⚠️  **Caution:** No arbitrary code above this line

The only code written above should be the implementation of your graded function. For experimentation and testing, only add code below.
___

## Function development

Use this section of code to verify your function implementation. You may change the test_corpus as needed to verify your implementation. The grader will be checking that your function returns a TMPreproc object that meets all of the following criteria:

 - tokens are lemmatized
 - tokens are converted to lowercase
 - special characters are removed from tokens
 - tokens shorter than 3 characters and numerics are removed

In [4]:
import pprint
pp = pprint.PrettyPrinter(indent=4)
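
One way to check the criteria above is to scan the resulting tokens for violations. The sketch below is a rough, hypothetical checker, not the grader's logic: `find_violations` is an invented name, and it assumes the tokens are available as a dict mapping document labels to lists of token strings (the shape that the `tokens` property of a TMPreproc object is assumed to have). It cannot confirm lemmatization, which needs a lemmatizer to compare against.

```python
import string


def find_violations(doc_tokens, min_length=3):
    """Hypothetical checker: return (label, token, reason) for each failed criterion."""
    violations = []
    for label, tokens in doc_tokens.items():
        for token in tokens:
            if token != token.lower():
                violations.append((label, token, "not lowercase"))
            if any(ch in string.punctuation for ch in token):
                violations.append((label, token, "special character"))
            if len(token) < min_length:
                violations.append((label, token, "too short"))
            if token.isnumeric():
                violations.append((label, token, "numeric"))
    return violations


# Made-up tokens: document 0 is clean, document 1 breaks several criteria
sample = {0: ["great", "shoe", "fit"], 1: ["Nike", "10", "run!"]}
for violation in find_violations(sample):
    print(violation)
```

With a real run, something like `find_violations(dict(preprocess(test_corpus).tokens))` should return an empty list if the assumptions above hold.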

## Topic modeling Amazon Reviews

Once you have completed the assignment above, you will be well prepared to start your final project for this unit. The project will include loading Amazon reviews into a corpus for topic modeling. The code below demonstrates topic modeling the reviews for a given brand. Note that the final project will require additional segmentation of the data, which is not done for you in the example here.

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [7]:
import gzip
import itertools
import json

asins = []

# To run this code, you will need to download the metadata file from the course
# assets and upload it to your Google Drive. See the notes about that file
# regarding how it was processed from the original file into json-l format.

with gzip.open("drive/MyDrive/meta_Clothing_Shoes_and_Jewelry.jsonl.gz") as products:
    for product in products:
        data = json.loads(product)
        categories = [c.lower() for c in
                      list(itertools.chain(*data.get("categories", [])))]
        if "nike" in categories:
            asins.append(data["asin"])
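
The cell above relies on two details: each line of the gzipped file is one JSON object, and `categories` is a list of lists that must be flattened before the membership test. The following self-contained sketch reproduces that logic on a tiny temporary file so it can be tried without the course data. The records, the file name `meta_sample.jsonl.gz`, and the helper `asins_in_category` are all made up for illustration.

```python
import gzip
import itertools
import json
import os
import tempfile


def asins_in_category(path, category):
    """Hypothetical helper: ASINs whose flattened categories contain `category`."""
    matches = []
    with gzip.open(path, "rt", encoding="utf-8") as products:
        for line in products:
            data = json.loads(line)
            # "categories" is assumed to be a list of lists; flatten and lowercase it
            nested = data.get("categories", [])
            flat = {c.lower() for c in itertools.chain.from_iterable(nested)}
            if category.lower() in flat:
                matches.append(data["asin"])
    return matches


# Made-up product records, one JSON object per line
records = [
    {"asin": "A1", "categories": [["Clothing", "Shoes"], ["Nike", "Running"]]},
    {"asin": "A2", "categories": [["Clothing", "Watches"]]},
    {"asin": "A3"},  # no categories key at all
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "meta_sample.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    sample_asins = asins_in_category(path, "nike")

print(sample_asins)
```

Only the first record matches, because "Nike" appears in one of its nested category lists; the record without a `categories` key is skipped rather than raising an error.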

Inspect the first few ASINs

In [8]:
asins[:3]

['B0000V9K32', 'B0000V9K3W', 'B0000V9K46']

Check the length, i.e. the number of resulting ASINs

In [9]:
len(asins)

8327

Build a corpus of review texts

In [10]:
review_corpus = []
asin_set = set(asins)  # set lookups are fast; scanning the list for every review is slow
with gzip.open("drive/MyDrive/reviews_Clothing_Shoes_and_Jewelry.json.gz") as reviews:
    for review in reviews:
        data = json.loads(review)
        if data["asin"] in asin_set:
            text = data["reviewText"]
            review_corpus.append(text)

Inspect a few of the reviews

In [11]:
for i, review in enumerate(review_corpus[:5]):
    print(i, review[:80])

0 the colour i received is not blue as shown but yellow.Couldnt change it because 
1 Very cute and is really practical. Fits better on smaller wrists which is my cas
2 The watch was exactly what i ordered and I got it very fast. Unfortunately it wa
3 This product came promptly and as described, pleasure doing business with them!-
4 Why isn't Nike making these anymore?  I love this watch, and I get a lot of comp


Build a TMPreproc object from the review corpus

In [12]:
pre = preprocess(review_corpus)



In [13]:
dtms = {
    "reviews_corpus": pre.dtm
}
lda_params = {
    'n_topics': 10,
    'eta': .01,
    'n_iter': 10,
    'random_state': 1234,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)

INFO:lda:n_documents: 21570
INFO:lda:vocab_size: 18131
INFO:lda:n_words: 460163
INFO:lda:n_topics: 10
INFO:lda:n_iter: 10
INFO:lda:<0> log likelihood: -4731611
INFO:lda:<9> log likelihood: -3645141


Print the topics

In [14]:
model = models["reviews_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, pre.vocabulary, top_n=5)

topic_1
> #1. shoe (0.046964)
> #2. good (0.025713)
> #3. great (0.019031)
> #4. fit (0.019010)
> #5. love (0.018058)
topic_2
> #1. shoe (0.036253)
> #2. great (0.017030)
> #3. fit (0.014959)
> #4. buy (0.014594)
> #5. love (0.014180)
topic_3
> #1. shoe (0.039852)
> #2. good (0.013687)
> #3. great (0.013644)
> #4. wear (0.012543)
> #5. size (0.011865)
topic_4
> #1. shoe (0.038793)
> #2. great (0.020065)
> #3. size (0.019479)
> #4. love (0.018729)
> #5. fit (0.016385)
topic_5
> #1. shoe (0.027436)
> #2. watch (0.014914)
> #3. good (0.013763)
> #4. wear (0.012883)
> #5. like (0.012793)
topic_6
> #1. shoe (0.032338)
> #2. wear (0.014842)
> #3. nike (0.014476)
> #4. fit (0.014476)
> #5. great (0.013785)
topic_7
> #1. shoe (0.061466)
> #2. foot (0.019962)
> #3. wear (0.019247)
> #4. run (0.018513)
> #5. good (0.016801)
topic_8
> #1. shoe (0.046920)
> #2. good (0.015410)
> #3. great (0.014826)
> #4. wear (0.014199)
> #5. comfortable (0.013385)
topic_9
> #1. shoe (0.044996)
> #2. nike (0.0157

Topic 1: shoe good great fit love  
Topic 2: shoe great fit buy love  
Topic 3: shoe good great wear size  
Topic 4: shoe great size love fit  
Topic 5: shoe watch good wear like  
Topic 6: shoe wear nike fit great  
Topic 7: shoe foot wear run good  
Topic 8: shoe good great wear comfortable  
Topic 9: shoe nike like good great  
Topic 10: shoe watch good nike buy
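
The ten topics share most of their top words. As a rough way to quantify that, the sketch below counts how many topics each word appears in, using only the top-5 lists summarized above. The variable names are illustrative, and this is a quick diagnostic rather than a formal topic-quality measure.

```python
from collections import Counter

# Top-5 words per topic, copied from the summary above
top_words = {
    1: "shoe good great fit love",
    2: "shoe great fit buy love",
    3: "shoe good great wear size",
    4: "shoe great size love fit",
    5: "shoe watch good wear like",
    6: "shoe wear nike fit great",
    7: "shoe foot wear run good",
    8: "shoe good great wear comfortable",
    9: "shoe nike like good great",
    10: "shoe watch good nike buy",
}

# Each word occurs at most once per topic, so this counts topics per word
topic_counts = Counter(
    word for words in top_words.values() for word in words.split()
)

for word, n_topics in topic_counts.most_common(5):
    print(f"{word}: appears in {n_topics} of {len(top_words)} topics")

print("distinct top words:", len(topic_counts))
```

Words that appear in nearly every topic, such as `shoe`, do little to tell topics apart, which is one reason the final project calls for additional segmentation of the data.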
