# DTSA 5799 Unsupervised Text Classification for Marketing Analytics Final Project

## Imports

In [1]:
try:
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel
except ModuleNotFoundError:
    !pip install lda
    !pip install tmtoolkit
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lda
  Downloading lda-2.0.0-cp37-cp37m-manylinux1_x86_64.whl (351 kB)
Collecting pbr<4,>=0.6
  Downloading pbr-3.1.1-py2.py3-none-any.whl (99 kB)
Installing collected packages: pbr, lda
Successfully installed lda-2.0.0 pbr-3.1.1
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting tmtoolkit
  Downloading tmtoolkit-0.10.0-py3-none-any.whl (7.1 MB)
Collecting spacy<2.4,>=2.3.0
  Downloading spacy-2.3.7-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (10.4 MB)
Collecting scipy<1.6,>=1.5.0
  Downloading scipy-1.5.4-cp37-cp37m-manylinux1_x86_64.whl (25.9 MB)

## Implement a pre-processor

Here you will implement a function called `preprocess` which returns the TMPreproc object to be used for topic modeling.

The preprocess function will take a list of texts and return a pre-processed corpus object, i.e. a TMPreproc object. Preprocessing should include the following actions on the corpus using the appropriate methods in the TMPreproc class:

 - lemmatize the texts
 - convert tokens to lowercase
 - remove special characters
 - clean tokens to remove numbers and any tokens shorter than 3 characters

The first part of the function, which creates the corpus and the preprocess object, is done for you. Your job is to call the specific preprocessing methods and to return the resulting preprocess object.
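As a rough illustration of what the token-level steps listed above do, here is a minimal pure-Python sketch. It is not tmtoolkit's implementation: `clean_token_list` is a hypothetical helper, lemmatization is omitted because it needs a language model, and stopword removal is not covered.

```python
import string

# Hypothetical helper for illustration only; it is not part of tmtoolkit.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_token_list(tokens, min_length=3):
    """Lowercase tokens, strip punctuation characters, and drop
    numeric tokens and tokens shorter than min_length characters."""
    cleaned = []
    for tok in tokens:
        # Lowercase first, then remove special (punctuation) characters
        tok = tok.lower().translate(_PUNCT_TABLE)
        if len(tok) < min_length:
            continue  # too short (this also drops tokens that became empty)
        if tok.isnumeric():
            continue  # purely numeric token
        cleaned.append(tok)
    return cleaned


example = clean_token_list(
    ["The", "Shoes", "fit", "well", "!", "Size", "10", "2014", "e-mail", "ok"]
)
print(example)  # ['the', 'shoes', 'fit', 'well', 'size', 'email']
```

In the assignment itself, the same effect should come from the TMPreproc methods rather than from hand-written code like this.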


In [2]:
def preprocess(texts, lang="en"):
    """Preprocessor which returns a TMPreproc object processed on corpus as language
    specified by lang (defaults to "en").

    Performs all of the following pre-processing steps:
     - lemmatize
     - tokens_to_lowercase
     - remove_special_chars_in_tokens
     - clean_tokens (remove numbers, and remove tokens shorter than 3 characters)
    """
    # Here, we just use the index of the text as the label for the corpus item
    corpus = Corpus({i: r for i, r in enumerate(texts)})

    preproc = TMPreproc(corpus, language=lang)

    # Apply the required preprocessing steps in order
    preproc.lemmatize()
    preproc.tokens_to_lowercase()
    preproc.remove_special_chars_in_tokens()
    preproc.clean_tokens(remove_shorter_than=3, remove_numbers=True)

    # Note: submit the .py download of this notebook as your assignment submission.
    return preproc

## Function development

Use this section of code to verify your function implementation. You may change the test_corpus as needed to verify your implementation. The grader will check that your function returns a TMPreproc object that meets all of the following criteria:

 - tokens are lemmatized
 - tokens are converted to lowercase
 - special characters are removed from tokens
 - tokens shorter than 3 characters and numerics are removed

In [3]:
import pprint
pp = pprint.PrettyPrinter(indent=4)
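
One way to check the lowercase, length, and numeric criteria is to scan the resulting tokens for violations. The sketch below uses a hypothetical helper, `find_violations`, and assumes the tokens are available as a dict mapping document labels to lists of token strings (which is what `TMPreproc.tokens` is assumed to return in tmtoolkit 0.10). Lemmatization cannot be verified this way and still needs a manual look.

```python
def find_violations(tokens_by_doc, min_length=3):
    """Return (doc_label, token, reason) tuples for tokens that break the criteria."""
    violations = []
    for label, tokens in tokens_by_doc.items():
        for tok in tokens:
            if tok != tok.lower():
                violations.append((label, tok, "not lowercase"))
            if len(tok) < min_length:
                violations.append((label, tok, "too short"))
            if tok.isnumeric():
                violations.append((label, tok, "numeric"))
    return violations


# Made-up tokens for illustration; a correct preprocess() should yield no violations.
sample = {0: ["shoe", "fit", "Great"], 1: ["42", "size", "ok"]}
for violation in find_violations(sample):
    print(violation)
```

With the assignment function, a call such as `pp.pprint(find_violations(preprocess(test_corpus).tokens))` should print an empty list, provided the assumption about `tokens` holds.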

## Topic modeling Amazon Reviews

Once you have completed the assignment above, you will be well prepared to start your final project for this unit. The project will include loading Amazon reviews into a corpus for topic modeling. The code below demonstrates topic modeling the reviews for a given brand. Note that the final project will require additional segmentation of the data, which is not done for you in the example here.

In [4]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [5]:
import gzip
import itertools
import json

# A set gives constant-time membership checks when the reviews are filtered later
asins = set()

# To run this code, you will need to download the metadata file from the course
# assets and upload it to your Google Drive. See the notes about that file
# regarding how it was processed from the original file into json-l format.

with gzip.open("drive/MyDrive/meta_Clothing_Shoes_and_Jewelry.jsonl.gz") as products:
    for product in products:
        data = json.loads(product)
        # "categories" is a list of lists; flatten it and lowercase each entry
        categories = [c.lower() for c in
                      itertools.chain.from_iterable(data.get("categories", []))]
        if "nike" in categories:
            asins.add(data["asin"])
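
To see how the category filter behaves without loading the full metadata file, here is a small sketch on made-up records. The record contents are invented for illustration; the only assumption carried over from the code above is that `categories` is a list of lists of strings.

```python
import itertools
import json

# Made-up metadata lines for illustration; real records carry more fields.
lines = [
    '{"asin": "A1", "categories": [["Clothing", "Men"], ["Nike", "Shoes"]]}',
    '{"asin": "A2", "categories": [["Jewelry", "Watches"]]}',
    '{"asin": "A3"}',
]

matching = set()
for line in lines:
    data = json.loads(line)
    # Flatten the nested category lists; a missing key yields an empty list
    flat = [c.lower() for c in
            itertools.chain.from_iterable(data.get("categories", []))]
    if "nike" in flat:
        matching.add(data["asin"])

print(matching)  # {'A1'}
```

Note that the test is an exact match on a whole category string, so a category such as "Nike Shoes" would not match "nike".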

Build a corpus of review texts

In [6]:
review_corpus = []
with gzip.open("drive/MyDrive/reviews_Clothing_Shoes_and_Jewelry.json.gz") as reviews:
    for review in reviews:
        data = json.loads(review)
        if data["asin"] in asins:
            text = data["reviewText"]
            review_corpus.append(text)

Inspect a few of the reviews

In [7]:
for i, review in enumerate(review_corpus[:5]):
    print(i, review[:80])

0 the colour i received is not blue as shown but yellow.Couldnt change it because 
1 Very cute and is really practical. Fits better on smaller wrists which is my cas
2 The watch was exactly what i ordered and I got it very fast. Unfortunately it wa
3 This product came promptly and as described, pleasure doing business with them!-
4 Why isn't Nike making these anymore?  I love this watch, and I get a lot of comp


Build a TMPreproc object from the review corpus

In [8]:
pre = preprocess(review_corpus)



In [9]:
dtms = {
    "reviews_corpus": pre.dtm
}
lda_params = {
    'n_topics': 20,
    'eta': .01,
    'n_iter': 100,
    'random_state': 777,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)

INFO:lda:n_documents: 21570
INFO:lda:n_words: 460163
INFO:lda:vocab_size: 18131
INFO:lda:n_topics: 20
INFO:lda:n_iter: 100
INFO:lda:<0> log likelihood: -5256335
INFO:lda:<10> log likelihood: -3803279
INFO:lda:<20> log likelihood: -3649503
INFO:lda:<30> log likelihood: -3573576
INFO:lda:<40> log likelihood: -3532709
INFO:lda:<50> log likelihood: -3504166
INFO:lda:<60> log likelihood: -3484099
INFO:lda:<70> log likelihood: -3468593
INFO:lda:<80> log likelihood: -3459168
INFO:lda:<90> log likelihood: -3451574
INFO:lda:<99> log likelihood: -3444859
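
The log-likelihood values in the log above can be used as a rough convergence check. The sketch below copies those values by hand and computes the gain between consecutive logged iterations. Shrinking gains suggest the sampler is settling, although this alone does not show that the topics are good.

```python
# Log-likelihood values copied from the training log above, keyed by iteration.
log_likelihood = {
    0: -5256335, 10: -3803279, 20: -3649503, 30: -3573576, 40: -3532709,
    50: -3504166, 60: -3484099, 70: -3468593, 80: -3459168, 90: -3451574,
    99: -3444859,
}

iterations = sorted(log_likelihood)
steps = list(zip(iterations, iterations[1:]))
# Gain in log likelihood between each pair of consecutive logged iterations
gains = [log_likelihood[b] - log_likelihood[a] for a, b in steps]

for (a, b), gain in zip(steps, gains):
    print(f"{a:>2} -> {b:>2}: +{gain}")
```

The gains fall from 1453056 over the first ten iterations to 6715 over the last nine, but they are still positive at iteration 99, so a larger `n_iter` might improve the fit further.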


Print the topics

In [10]:
model = models["reviews_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, pre.vocabulary, top_n=5)

topic_1
> #1. color (0.060760)
> #2. shoe (0.040493)
> #3. look (0.032800)
> #4. like (0.030566)
> #5. black (0.027630)
topic_2
> #1. shoe (0.064983)
> #2. foot (0.047672)
> #3. size (0.035678)
> #4. fit (0.028079)
> #5. wear (0.023104)
topic_3
> #1. shoe (0.057409)
> #2. play (0.037918)
> #3. good (0.036248)
> #4. basketball (0.027793)
> #5. great (0.025313)
topic_4
> #1. order (0.031153)
> #2. shoe (0.029330)
> #3. ship (0.019902)
> #4. size (0.018971)
> #5. return (0.018389)
topic_5
> #1. run (0.040772)
> #2. much (0.020806)
> #3. good (0.019034)
> #4. shirt (0.016748)
> #5. light (0.015348)
topic_6
> #1. shoe (0.068555)
> #2. love (0.042773)
> #3. nike (0.034915)
> #4. comfortable (0.033216)
> #5. great (0.033173)
topic_7
> #1. sock (0.032171)
> #2. wear (0.031252)
> #3. get (0.029075)
> #4. boot (0.023124)
> #5. like (0.020367)
topic_8
> #1. shoe (0.034488)
> #2. foot (0.031290)
> #3. like (0.015112)
> #4. good (0.014799)
> #5. sole (0.014391)
topic_9
> #1. shoe (0.080829)
> #2. r