<a href="https://colab.research.google.com/github/njaiman14/SupervisedLearning/blob/main/MSDSTopicModel_UnsupersedTextClassification_Final_Assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MSDS Unsupervised Text Classification Final Assignment

In this assignment, you will implement a topic model preprocessor, which can then be applied to topic modeling of Amazon text reviews. We will perform the following steps:
* Extract the data
* Perform topic modeling
* Identify Nike's product ASINs and extract the relevant reviews
* Visualize topics
* Perform data clustering
* Provide marketing and product insights

## Imports

In [None]:
try:
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel
except ModuleNotFoundError:
    !pip install --upgrade pip
    !pip install lda
    !pip install spacy-model-manager
    !spacy-model remove en_core_web_sm
    !pip uninstall -y spacy-model-manager
    !pip uninstall -y spacy
    !pip install spacy==2.3.7
    !python -m spacy download en_core_web_sm
    !pip uninstall -y imgaug
    !pip install "imgaug<0.2.7,>=0.2.5"
    !pip install tmtoolkit==0.10.0
    from tmtoolkit.corpus import Corpus
    from tmtoolkit.preprocess import TMPreproc
    from tmtoolkit.topicmod.model_io import print_ldamodel_topic_words
    from tmtoolkit.topicmod.tm_lda import compute_models_parallel


## Implement a pre-processor

Here you will implement a function called `preprocess` which returns the TMPreproc object to be used for topic modeling.

The preprocess function will take a list of texts and return a pre-processed corpus object, i.e. a TMPreproc object. Preprocessing should include the following actions on the corpus using the appropriate methods in the TMPreproc class:

 - lemmatize the texts
 - convert tokens to lowercase
 - remove special characters
 - clean tokens to remove numbers and any tokens shorter than 3 characters
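
To make the last three steps concrete, here is a minimal plain-Python sketch of what they do to a list of tokens. It is only an illustration: `clean_tokens_sketch` and `min_length` are hypothetical names, it does not lemmatize or remove stopwords, and it is not how TMPreproc is implemented.

```python
import string


def clean_tokens_sketch(tokens, min_length=3):
    """Lowercase tokens, strip special characters, then drop numbers and short tokens."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        # Remove punctuation and symbols such as "$" or "!" from inside the token.
        token = "".join(ch for ch in token if ch not in string.punctuation)
        # Drop purely numeric tokens and tokens shorter than min_length.
        if token.isdigit() or len(token) < min_length:
            continue
        cleaned.append(token)
    return cleaned


print(clean_tokens_sketch(["She", "sells", "$ea$shells"]))
```

Tokens that consist only of punctuation become empty strings after stripping and are therefore dropped by the length check.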


In [None]:
def preprocess(texts, lang="en"):
    """Return a TMPreproc object built from `texts`, processed in the language
    given by `lang` (defaults to "en").

    Performs all of the following pre-processing steps:
     - lemmatize
     - tokens_to_lowercase
     - remove_special_chars_in_tokens
     - clean_tokens (remove numbers, and remove tokens shorter than 2)
    """
    # Use the index of each text as the label for the corpus item.
    corpus = Corpus({i: r for i, r in enumerate(texts)})
    preproc = TMPreproc(corpus, language=lang)

    preproc.lemmatize()
    preproc.tokens_to_lowercase()
    preproc.remove_special_chars_in_tokens()
    # NOTE: the assignment text asks for tokens shorter than 3 characters to be
    # removed, while the docstring says 2; this call follows the docstring.
    preproc.clean_tokens(remove_numbers=True, remove_shorter_than=2)

    return preproc

## Function development

Use this section of code to verify your function implementation. You may change the test_corpus as needed. The grader will check that your function returns a TMPreproc object that meets all of the following criteria:

 - tokens are lemmatized
 - tokens are converted to lowercase
 - special characters are removed from tokens
 - tokens shorter than 3 characters and numerics are removed
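
Three of these criteria can be checked mechanically on a plain list of tokens. The helper below is a hypothetical sketch (`find_violations` is not part of tmtoolkit or the grader), and it cannot tell whether a token was lemmatized.

```python
import re


def find_violations(tokens, min_length=3):
    """Return the tokens that break the lowercase, special-character, number, or length rules."""
    violations = []
    for token in tokens:
        if (
            token != token.lower()             # contains uppercase characters
            or re.search(r"[^\w\s]", token)    # contains a special character
            or token.isdigit()                 # is purely numeric
            or len(token) < min_length         # is too short
        ):
            violations.append(token)
    return violations


print(find_violations(["cat", "sit", "mat"]))
print(find_violations(["Red", "ea$shells", "3", "on", "fish"]))
```

To apply it to a preprocessed corpus, pass each document's `'token'` list from `preproc.get_tokens()`; an empty result for every document means these three criteria hold.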

In [None]:
import pprint
from textblob import TextBlob
pp = pprint.PrettyPrinter(indent=4)

In [None]:
test_corpus = [ # Feel free to edit this corpus for further testing
                # to be sure that your functions meet specifications.
    "The 3 cats sat on the mats!",
    "1 fish 2 fish Red fish Blue fish",
    "She sells $ea$shells"
]
preproc = preprocess(test_corpus)
pp.pprint(preproc.get_tokens())

{   0: {   'lemma': ['cat', 'sit', 'mat'],
           'token': ['cat', 'sit', 'mat'],
           'whitespace': [True, True, False]},
    1: {   'lemma': ['fish', 'fish', 'Red', 'fish', 'Blue', 'fish'],
           'token': ['fish', 'fish', 'red', 'fish', 'blue', 'fish'],
           'whitespace': [True, True, True, True, True, False]},
    2: {   'lemma': ['sell', 'ea$shells'],
           'token': ['sell', 'eashells'],
           'whitespace': [True, False]}}


In [None]:
dtms = {
    "test_corpus": preproc.dtm
}
lda_params = {
    'n_topics': 2,
    'eta': .01,
    'n_iter': 10,
    'random_state': 1234,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)

In [None]:
model = models["test_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, preproc.vocabulary, top_n=5)

topic_1
> #1. fish (0.566384)
> #2. red (0.142655)
> #3. cat (0.142655)
> #4. blue (0.142655)
> #5. sit (0.001412)
topic_2
> #1. sit (0.247549)
> #2. sell (0.247549)
> #3. mat (0.247549)
> #4. eashells (0.247549)
> #5. red (0.002451)
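
The listing above shows, for each topic, the five words with the highest probability. As a rough sketch of how such a ranking can be derived from a topic-word matrix, the snippet below sorts each row of a toy matrix. The names `top_words_per_topic`, `demo_vocab`, and `demo_topic_word` are made up for illustration, and the numbers are not model output.

```python
def top_words_per_topic(topic_word, vocab, top_n=5):
    """For each topic (one row per topic), return the top_n (word, probability) pairs."""
    result = []
    for row in topic_word:
        ranked = sorted(zip(vocab, row), key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:top_n])
    return result


# Toy matrix: 2 topics over a 4-word vocabulary (each row sums to 1).
demo_vocab = ["fish", "red", "sit", "mat"]
demo_topic_word = [
    [0.70, 0.20, 0.05, 0.05],
    [0.05, 0.05, 0.50, 0.40],
]

for k, words in enumerate(top_words_per_topic(demo_topic_word, demo_vocab, top_n=2), start=1):
    print(f"topic_{k}", words)
```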


### Load Data

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Topic modeling Amazon Reviews

Once you have completed the assignment above, you will be well prepared to start your final project for this unit. The project will include loading Amazon reviews into a corpus for topic modeling. The code below demonstrates topic modeling the reviews for a given brand. Note that the final project will require additional segmentation of the data, which is not done for you in the example here.

###### Get Negative Texts

In [None]:
def get_negative_texts(texts):
    """Take a list of texts and return a list of the texts that are
    determined to be of negative sentiment.

    See the TextBlob documentation for how to evaluate sentiment. For our
    purposes here, negative sentiment is a sentiment with polarity < 0.0.
    """
    negative_texts = []
    for text in texts:
        # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive).
        blob = TextBlob(text)
        if blob.sentiment.polarity < 0.0:
            negative_texts.append(text)
    return negative_texts
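
The filtering logic can be tried without TextBlob by passing in the scoring function. The sketch below uses a deliberately tiny, hypothetical lexicon scorer (`toy_polarity`, `TOY_LEXICON`) as a stand-in; it only shows the threshold behaviour and is not a substitute for TextBlob's sentiment model.

```python
def filter_negative(texts, polarity_fn, threshold=0.0):
    """Keep only the texts whose polarity is strictly below the threshold."""
    return [text for text in texts if polarity_fn(text) < threshold]


# Hypothetical toy lexicon, used only to make the example self-contained.
TOY_LEXICON = {"poor": -1.0, "terrible": -1.0, "great": 1.0, "love": 1.0}


def toy_polarity(text):
    """Average the lexicon scores of the known words; 0.0 if none are known."""
    scores = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0


print(filter_negative(["poor design", "great shoe", "a watch"], toy_polarity))
```

A text with polarity exactly 0.0 is treated as neutral and is not returned. With TextBlob installed, `lambda t: TextBlob(t).sentiment.polarity` can be passed as `polarity_fn`.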

In [24]:
import gzip
import itertools
import json
import requests
import shutil
from textblob import TextBlob

asins = []

# To run this code, you will need to download the metadata file from the course
# assets and upload it to your Google Drive. See the notes about that file
# regarding how it was processed from the original file into json-l format.

with gzip.open("drive/MyDrive/meta_Clothing_Shoes_and_Jewelry.jsonl.gz") as products:
    for product in products:
        data = json.loads(product)
        categories = [c.lower() for c in
                      list(itertools.chain(*data.get("categories", [])))]
        if "nike" in categories:
            asins.append(data["asin"])

Inspect the first few ASINs

In [None]:
asins[:3]

['B0000V9K32', 'B0000V9K3W', 'B0000V9K46']

Check the length, i.e. the number of resulting ASINs

In [None]:
len(asins)

8327

Build a corpus of review texts

In [None]:
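all_texts = []
asin_set = set(asins)  # set membership is O(1), versus O(n) for the list
with gzip.open('drive/MyDrive/FinalAssignment/reviews_Clothing_Shoes_and_Jewelry.json.gz') as reviews:
    for review in reviews:
        data = json.loads(review)
        if data["asin"] in asin_set:
            text = data["reviewText"]
            all_texts.append(text)

# Later cells refer to this list as `review_corpus`; this alias assumes they
# are meant to be the same list.
review_corpus = all_texts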
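
The same read-and-filter pattern can be exercised on a small in-memory file before running it against the full dataset. This is a sketch under stated assumptions: `collect_review_texts` and the sample records are invented for illustration, and records without a `reviewText` field are simply skipped.

```python
import gzip
import io
import json


def collect_review_texts(lines, wanted_asins):
    """Return the reviewText of every JSON-lines record whose asin is wanted."""
    wanted = set(wanted_asins)  # set lookup keeps the per-record check cheap
    texts = []
    for line in lines:
        record = json.loads(line)
        if record.get("asin") in wanted and "reviewText" in record:
            texts.append(record["reviewText"])
    return texts


# Build a tiny gzip-compressed JSON-lines file in memory (made-up records).
sample_records = [
    {"asin": "A1", "reviewText": "Great shoe"},
    {"asin": "B2", "reviewText": "Not a match"},
    {"asin": "A1", "reviewText": "Runs small"},
]
buffer = io.BytesIO()
with gzip.open(buffer, "wb") as gz:
    for record in sample_records:
        gz.write((json.dumps(record) + "\n").encode("utf-8"))

buffer.seek(0)
with gzip.open(buffer) as reviews:
    sample_texts = collect_review_texts(reviews, ["A1"])

print(sample_texts)
```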

Inspect a few of the reviews

In [17]:
for i, review in enumerate(review_corpus[:5]):
    print(i, review[:80])

0 the colour i received is not blue as shown but yellow.Couldnt change it because 
1 Very cute and is really practical. Fits better on smaller wrists which is my cas
2 The watch was exactly what i ordered and I got it very fast. Unfortunately it wa
3 This product came promptly and as described, pleasure doing business with them!-
4 Why isn't Nike making these anymore?  I love this watch, and I get a lot of comp


In [26]:
negative_texts = get_negative_texts(all_texts)
# Show the first five negative reviews.
for i, text in enumerate(negative_texts[:5]):
    print(i, text)

0 I'm on my 4th watch... I keep returning it due to poor design.  The band keeps coming apart in the same spot!  Nike hasn't been helpful when I've been in contact with them.  Now, I'm on my 4th watch and something funking is going on with the face of this watch and I've brought it in to a jeweler to have the battery changed.. TWICE and it's still acting up.  I wouldn't purchase another NIKE watch.
1 i had some problems with this order, the bill didn't arrive with the watch to the p.o. box and it couldn't be sent to my country as it was supossed to, when i finally recieved this watch (after sending several emails to solve the situation) it just didn't work, i had to spent money to fix it.
2 Watch was very small and barely fit my 6 year-old daughter's wrist.  And she has a small frame!  Strange.
3 The digital numbers were impossible to see I don't know if there was something wrong with the watch but the glare was terrible.  I sent this watch back.
4 Everything that glitters is not gold!


Build a TMPreproc object from the review corpus

In [18]:
pre = preprocess(review_corpus)

In [20]:
dtms = {
    "reviews_corpus": pre.dtm
}
lda_params = {
    'n_topics': 10,
    'eta': .01,
    'n_iter': 10,
    'random_state': 1234,  # to make results reproducible
    'alpha': 1/16
}

models = compute_models_parallel(dtms, constant_parameters=lda_params)

Print the topics

In [21]:
model = models["reviews_corpus"][0][1]
print_ldamodel_topic_words(model.topic_word_, pre.vocabulary, top_n=5)

topic_1
> #1. watch (0.051211)
> #2. use (0.019462)
> #3. time (0.017413)
> #4. heart (0.017413)
> #5. rate (0.014597)
topic_2
> #1. watch (0.050371)
> #2. time (0.032055)
> #3. like (0.016702)
> #4. love (0.014278)
> #5. easy (0.014009)
topic_3
> #1. watch (0.051927)
> #2. look (0.021341)
> #3. great (0.021104)
> #4. like (0.020630)
> #5. much (0.016362)
topic_4
> #1. watch (0.066008)
> #2. wrist (0.016448)
> #3. buy (0.014892)
> #4. battery (0.014670)
> #5. get (0.014003)
topic_5
> #1. watch (0.060353)
> #2. time (0.017485)
> #3. would (0.016766)
> #4. work (0.015090)
> #5. battery (0.014850)
topic_6
> #1. shoe (0.058213)
> #2. good (0.030330)
> #3. run (0.029108)
> #4. great (0.022504)
> #5. nike (0.017612)
topic_7
> #1. watch (0.080141)
> #2. time (0.018387)
> #3. get (0.017916)
> #4. good (0.014852)
> #5. work (0.014616)
topic_8
> #1. shoe (0.051799)
> #2. buy (0.027003)
> #3. good (0.023972)
> #4. like (0.022870)
> #5. fit (0.020115)
topic_9
> #1. use (0.034016)
> #2. watch (0.02

## Save your topic model and corpus for use in Lab 2

Once you have completed the above assignment, run the following code to save your topic model and your corpus to your Google Drive. You will load this model and use it for document classification in Lab 2.

In [38]:
import pickle
from tmtoolkit.topicmod.model_io import save_ldamodel_to_pickle

with open("drive/MyDrive/FinalAssignment/MSDS_FinalAssignment_model.p", "wb") as modelfile:
    save_ldamodel_to_pickle(modelfile, model, pre.vocabulary, pre.doc_labels, dtm=pre.dtm)



In [39]:
with open("drive/MyDrive/FinalAssignment/MSDS_FinalAssignment_corpus.p", "wb") as corpusfile:
    pickle.dump(review_corpus, corpusfile)
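
Before relying on the saved files, it can be worth confirming that a pickled object loads back unchanged. The sketch below does this with a temporary directory instead of Google Drive; `roundtrip_pickle` and `sample_corpus` are hypothetical names used only for this example.

```python
import os
import pickle
import tempfile


def roundtrip_pickle(obj, path):
    """Write obj to path with pickle, then read it back and return the copy."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    with open(path, "rb") as f:
        return pickle.load(f)


sample_corpus = ["Great shoe", "Runs small"]
with tempfile.TemporaryDirectory() as tmpdir:
    restored = roundtrip_pickle(sample_corpus, os.path.join(tmpdir, "corpus.p"))

print(restored == sample_corpus)
```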



In [40]:
import numpy as np
import pickle

try:
    import pyLDAvis
except ImportError:
    !pip install pyLDAvis==2.1.2
    import pyLDAvis

try:
    import tmtoolkit
except ImportError:
    # Pin the same version that the Imports section installs.
    !pip install tmtoolkit==0.10.0
    import tmtoolkit

try:
    from lda import LDA
except ImportError:
    !pip install lda
    from lda import LDA

from tmtoolkit.bow.bow_stats import doc_lengths
from tmtoolkit.topicmod.model_stats import generate_topic_labels_from_top_words
from tmtoolkit.topicmod.model_io import ldamodel_top_doc_topics
from tmtoolkit.topicmod.model_io import load_ldamodel_from_pickle
from tmtoolkit.topicmod.visualize import parameters_for_ldavis



In [43]:
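# Load the corpus and model saved in the cells above.
with open("drive/MyDrive/FinalAssignment/MSDS_FinalAssignment_corpus.p", "rb") as corpusfile:
    corpus = pickle.load(corpusfile)

with open("drive/MyDrive/FinalAssignment/MSDS_FinalAssignment_model.p", "rb") as modelfile:
    model_info = load_ldamodel_from_pickle(modelfile)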

model_info.keys()

model = model_info["model"]
vocab = model_info["vocab"]
dtm = model_info["dtm"]
doc_labels = model_info["doc_labels"]

topic_labels = generate_topic_labels_from_top_words(
    model.topic_word_,
    model.doc_topic_,
    doc_lengths(dtm),
    np.array(vocab),
)



In [44]:
topic_labels



array(['1_watch_use_time', '2_watch_time_like', '3_watch_look_great',
       '4_watch_wrist_buy', '5_watch_time_would', '6_shoe_good_run',
       '7_watch_time_get', '8_shoe_buy_good', '9_use_watch_time',
       '10_watch_band_run'], dtype='<U18')
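
The labels above combine a topic number with a few characteristic words. A simplified sketch of that format is shown below; it ranks words by raw probability and always uses three words, which is an assumption made for illustration and may differ from how `generate_topic_labels_from_top_words` selects and counts words. The names `label_topics`, `label_vocab`, and `label_topic_word` are hypothetical.

```python
def label_topics(topic_word, vocab, n_words=3):
    """Build labels of the form '<topic number>_<word>_<word>_<word>'."""
    labels = []
    for k, row in enumerate(topic_word, start=1):
        # Indices of the vocabulary sorted by descending probability in this topic.
        ranked = sorted(range(len(vocab)), key=lambda j: row[j], reverse=True)
        labels.append("_".join([str(k)] + [vocab[j] for j in ranked[:n_words]]))
    return labels


# Toy matrix: 2 topics over a 5-word vocabulary (not model output).
label_vocab = ["watch", "time", "shoe", "run", "good"]
label_topic_word = [
    [0.50, 0.30, 0.05, 0.05, 0.10],
    [0.05, 0.05, 0.50, 0.25, 0.15],
]

print(label_topics(label_topic_word, label_vocab))
```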

In [45]:
doc_topic = model.doc_topic_
documentclassifications = ldamodel_top_doc_topics(doc_topic, doc_labels, top_n=2, topic_labels=topic_labels) 



In [46]:
documentclassifications.head()



| document | rank_1 | rank_2 |
|---|---|---|
| 0 | 4_watch_wrist_buy (0.4829) | 10_watch_band_run (0.3462) |
| 1 | 7_watch_time_get (0.445) | 5_watch_time_would (0.445) |
| 2 | 4_watch_wrist_buy (0.8853) | 5_watch_time_would (0.07798) |
| 3 | 5_watch_time_would (0.5328) | 2_watch_time_like (0.2705) |
| 4 | 2_watch_time_like (0.5917) | 4_watch_wrist_buy (0.1514) |
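
Each row of this output lists a document's two most probable topics. The idea can be sketched in plain Python by sorting one row of a document-topic matrix; `top_topics_for_doc` and the toy values are hypothetical and only mimic the output format shown here.

```python
def top_topics_for_doc(doc_topic_row, topic_labels, top_n=2):
    """Return the top_n topics of one document as 'label (probability)' strings."""
    ranked = sorted(zip(topic_labels, doc_topic_row), key=lambda pair: pair[1], reverse=True)
    return [f"{label} ({prob:.4g})" for label, prob in ranked[:top_n]]


# Toy example: one document's distribution over three topics.
demo_labels = ["1_watch_use_time", "2_shoe_good_run", "3_watch_band_run"]
demo_row = [0.1, 0.6, 0.3]

print(top_topics_for_doc(demo_row, demo_labels))
```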


In [47]:
documentclassifications["text"] = corpus
documentclassifications.head()



| document | rank_1 | rank_2 | text |
|---|---|---|---|
| 0 | 4_watch_wrist_buy (0.4829) | 10_watch_band_run (0.3462) | the colour i received is not blue as shown but... |
| 1 | 7_watch_time_get (0.445) | 5_watch_time_would (0.445) | Very cute and is really practical. Fits better... |
| 2 | 4_watch_wrist_buy (0.8853) | 5_watch_time_would (0.07798) | The watch was exactly what i ordered and I got... |
| 3 | 5_watch_time_would (0.5328) | 2_watch_time_like (0.2705) | This product came promptly and as described, p... |
| 4 | 2_watch_time_like (0.5917) | 4_watch_wrist_buy (0.1514) | Why isn't Nike making these anymore? I love t... |


## Visualization

pyLDAvis is a Python port of the LDAvis package in R, and is used as a tool for interpreting the topics in a topic model that has been fit to a corpus of text data.

Execute the following code to create an interactive visualization of your topic model.

In [49]:
ldavis_params = parameters_for_ldavis(
    model.topic_word_,
    model.doc_topic_,
    dtm,
    vocab
)



In [50]:
%matplotlib inline
vis = pyLDAvis.prepare(**ldavis_params)
pyLDAvis.enable_notebook(local=True)
pyLDAvis.display(vis)

