# Parker Christenson Assignment 5 Topic modeling

## `Instructions`
1. Pre-process the data by removing stop words and applying stemming or lemmatization, then vectorize the raw text with the CountVectorizer function from Sklearn.
    - Optionally create bigrams and/or trigrams from the available words.
2. Build topic models with 8, 9, and 10 different topics.
3. Evaluate the different topics using perplexity and/or topic coherence.
4. Select a single model, and describe the different topics generated. You can leverage the helper functions provided for this purpose.
    - List the most common word in each topic and optionally create word clouds for each topic.
    - Write a sentence or two to describe your perspective on the composition of each topic. What makes this topic stand out from the others?
    - Take one or two sentences and summarize all the topics together.
    - Note that all the items in Step 4 should be in text form in a Markdown cell in your submitted Jupyter Notebook. For reference, there is no example of this in the COVID19.ipynb assignment lab Jupyter Notebook.


In [8]:
# library imports
import numpy as np
import pandas as pd
import re
import json
import sys
import os
import ast
import random
pd.set_option('display.max_columns', 40)
import nltk
import gensim
import wordcloud
import faiss
from nltk.corpus import stopwords
from helper_functions import * 

# having some trouble with the nltk stopwords, so I'm going to download them again
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tehwh\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

## `Data pre-processing`

In [9]:
root_location = 'E:/Data_sets' # I offloaded all of the data to a 4TB external drive to save space on my computer. I don't need 30K articles on my computer lol

file_locations = [f'{root_location}/biorxiv_medrxiv/biorxiv_medrxiv', 
                  f'{root_location}/noncomm_use_subset/noncomm_use_subset', 
                  f'{root_location}/comm_use_subset/comm_use_subset',
                  f'{root_location}/custom_license/custom_license']

## Set up Stop Words
## Add Any other relevant options manually

stop_words = stopwords.words('english') + ['et', 'al', 'fig', 'etal', 'et al', 'et-al']

processed_articles = process_articles(file_locations, stop_words)
processed_articles.read_files()

print('Example File Name')
print(processed_articles.root_files[0])
print('Number of files')
print(len(processed_articles.root_files))
print('Example Article Information')
print(processed_articles.title_text[2])

Processing Files at the requested location
There are 177 files to process
There were 885 files in the dataset
Processing Files at the requested location
There are 470 files to process
There were 2353 files in the dataset
Processing Files at the requested location
There are 1823 files to process
There were 9118 files in the dataset
Processing Files at the requested location
There are 3391 files to process
There were 16959 files in the dataset
Example File Name
200b85eea010e695ac7281e5b8758c9fbbd6d0ad.json
Number of files
3391
Example Article Information
['Data_sets', 'Generating genomic platforms to study Candida albicans pathogenesis', ['Over the last decade, there has been an exponential growth in the quantity of available genome sequence data due to the very rapid progress in sequencing technology. In 2004, the genome sequence of the human fungal pathogen Candida albicans was released as Assembly 19 (1). With the challenge of working with a heterozygous diploid organism, new computat

In [10]:
processed_articles.process_text()

Cleaning out Junk
Tokenizing words
Converting to list of words and removing stop words
Creating word stems


### As in the COVID-19 lab, we create `bigrams and trigrams` from the available words.

In [11]:
processed_articles.trigrams([art[3] for art in processed_articles.processed_article])
processed_articles.processed_trigrams[2][4]

Creating Bigrams
Creating Trigrams from Bigrams


"['last_decad', 'exponenti_growth', 'quantiti', 'avail', 'genom_sequenc', 'data', 'due', 'rapid_progress', 'sequenc', 'technolog', 'genom_sequenc', 'human', 'fungal_pathogen', 'candida_albican', 'releas', 'assembl', 'challeng', 'work', 'heterozyg', 'diploid', 'organ', 'new', 'comput', 'method', 'develop', 'result', 'releas', 'assembl', 'assembl', 'complet', 'phase', 'diploid', 'genom_sequenc', 'standard', 'albican', 'refer', 'strain', 'sc', 'open_new', 'perspect', 'understand', 'genet_basi', 'function', 'mechan_underli', 'pathogenesi', 'evolut', 'organ', 'although', 'albican', 'gene', 'sequenc', 'avail', 'commun', 'decad', 'predict', 'protein', 'code', 'gene', 'character', 'januari_st', 'accord', 'candida', 'genom', 'databas', 'nowaday', 'grow', 'avail', 'whole_genom', 'dataset', 'encourag', 'shift_toward', 'develop', 'function', 'genom', 'system', 'biolog', 'enabl', 'analysi', 'highthroughput', 'whole_genom', 'assay', 'better_understand', 'biolog', 'network', 'context', 'major', 'effo

In [12]:
from sklearn.feature_extraction.text import CountVectorizer

train, test = train_test_splitter(processed_articles.processed_trigrams, 0.90)

vectorizer = CountVectorizer(min_df = 50, max_df = 0.8, max_features = 50000)
tf = vectorizer.fit_transform([t[4] for t in train]) ## Vectorize training set
tf_feature_names = vectorizer.get_feature_names_out() ## Pull out words for use in eval

# Transform test data for perplexity eval

tf_test = vectorizer.transform([t[4] for t in test])
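
The `min_df` and `max_df` arguments above prune the vocabulary by document frequency before counting. As a rough illustration of those two rules, here is a plain-Python sketch on a toy corpus. The `prune_vocabulary` function and the toy documents are hypothetical and are not part of scikit-learn or the lab's helper functions. An integer bound is treated as an absolute number of documents, and a float bound as a proportion of the documents.

```python
from collections import Counter

def prune_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df].

    Hypothetical helper that mimics CountVectorizer's pruning rules:
    an int bound is an absolute number of documents, a float bound is
    a proportion of the documents.
    """
    n_docs = len(docs)
    # Count in how many documents each term appears (not how many times).
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, df in doc_freq.items() if low <= df <= high)

toy_docs = [
    "viru infect cell",
    "viru vaccin trial",
    "viru protein structur",
    "viru cell protein",
    "patient infect studi",
]
# "viru" is in 4 of 5 documents (0.8), so max_df=0.7 drops it;
# terms seen in only one document fall below min_df=2.
vocab = prune_vocabulary(toy_docs, min_df=2, max_df=0.7)
print(vocab)
```

With the settings used in this notebook (`min_df = 50`, `max_df = 0.8`), a term is kept only if it appears in at least 50 training documents and in no more than 80% of them.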

## Importing the Gensim library for topic modeling

In [21]:
import gensim.corpora as corpora
from gensim.models import CoherenceModel
import logging

In [22]:
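# creating a dictionary and a corpus for the LDA model
# note: t[4] is the string form of a token list, so split() leaves the quotes and commas attached to each token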

id2word = corpora.Dictionary([t[4].split() for t in train])
corpus = [id2word.doc2bow(text.split()) for text in [t[4] for t in train]]
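
Each `t[4]` appears to be the string form of a Python list (the trigram output earlier starts with `"['last_decad', ...`), so `.split()` leaves brackets, quotes, and commas attached to the tokens. One possible way to recover clean tokens is `ast.literal_eval`, sketched here on a made-up string. Whether this fits depends on how the helper functions store the text, which is an assumption.

```python
import ast

# Made-up example in the same shape as the stored trigram strings.
raw = "['last_decad', 'exponenti_growth', 'genom_sequenc', 'data']"

# Whitespace splitting keeps the list punctuation on every token.
split_tokens = raw.split()
print(split_tokens[0])   # ['last_decad',

# literal_eval parses the string back into a real list of clean tokens.
clean_tokens = ast.literal_eval(raw)
print(clean_tokens[0])   # last_decad
```

With clean token lists, the dictionary and corpus could be built from `ast.literal_eval(t[4])` instead of `t[4].split()`. Doing so would change the vocabulary, and therefore the scores and topics reported in this notebook.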

## `Modeling - LDA`

In [23]:
# setting the lists and dictionaries for the models
num_topics_list = [8, 9, 10]
models = {}
perplexity = {}
coherence = {}

# nice little loop to build the models

for num_topics in num_topics_list:
    lda_model = gensim.models.LdaModel(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       update_every=1,
                                       chunksize=100,
                                       passes=10,
                                       alpha='auto',
                                       per_word_topics=True)
    models[num_topics] = lda_model
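    # evaluate using gensim's log_perplexity, which returns the per-word
    # likelihood bound (closer to zero is better), here on the training corpus
    perplexity[num_topics] = lda_model.log_perplexity(corpus)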

    # evaluate using coherence
    coherence_model_lda = CoherenceModel(model=lda_model, texts=[t[4].split() for t in train], dictionary=id2word, coherence='c_v')
    coherence[num_topics] = coherence_model_lda.get_coherence()

`That was a 17 minute run time for the cell above`

In [24]:
# Print the evaluation results
for num_topics in num_topics_list:
    print(f'Number of topics: {num_topics}')
    print(f'Perplexity: {perplexity[num_topics]}')
    print(f'Coherence: {coherence[num_topics]}')
    print('')

Number of topics: 8
Perplexity: -9.311266383217047
Coherence: 0.4842853560855015

Number of topics: 9
Perplexity: -9.621981029842669
Coherence: 0.4329586510443721

Number of topics: 10
Perplexity: -10.019501432435934
Coherence: 0.45750258984138614
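
Gensim's `log_perplexity` returns a per-word likelihood bound rather than the perplexity itself. The perplexity estimate that gensim logs is `2 ** (-bound)`, so a bound closer to zero means a lower (better) perplexity. The sketch below converts the three values printed above (rounded) and ranks the models; the variable names are illustrative only.

```python
# Per-word bounds and c_v coherence copied (rounded) from the output above.
bounds = {8: -9.3113, 9: -9.6220, 10: -10.0195}
coherences = {8: 0.4843, 9: 0.4330, 10: 0.4575}

# Perplexity estimate: 2 raised to the negative per-word bound.
perplexities = {k: 2 ** (-b) for k, b in bounds.items()}

best_by_perplexity = min(perplexities, key=perplexities.get)  # lower is better
best_by_coherence = max(coherences, key=coherences.get)       # higher is better

for k in sorted(bounds):
    print(f"{k} topics: perplexity ~ {perplexities[k]:.0f}, coherence {coherences[k]:.3f}")
print("Lowest perplexity:", best_by_perplexity)
print("Highest coherence:", best_by_coherence)
```

These bounds were computed on the training corpus, so they are not held-out estimates.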



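### `We are going to go with the 10 topic model`

I am going with the 10 topic model for the topic descriptions. Strictly by the scores, the 8 topic model does best: gensim's `log_perplexity` returns a per-word likelihood bound where values closer to zero are better, and the 8 topic model has both the best bound (-9.31) and the highest coherence (0.484). The 10 topic model has the second-highest coherence (0.458) but the lowest bound (-10.02).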

In [27]:
# printing the top 3 most common words in each topic for the 10 topic model

lda_model = models[10]
topics = lda_model.print_topics(num_words=3)

# prints
print("Most common words in each topic (model with 10 topics):")
for topic_num, topic_words in topics:
    print(f"Topic {topic_num}: {topic_words}")

Most common words in each topic (model with 10 topics):
Topic 0: 0.028*"'use'," + 0.012*"'concentr'," + 0.009*"'compound',"
Topic 1: 0.039*"'protein'," + 0.011*"'cell'," + 0.010*"'structur',"
Topic 2: 0.028*"'patient'," + 0.017*"'infect'," + 0.012*"'studi',"
Topic 3: 0.024*"'use'," + 0.022*"'sequenc'," + 0.017*"'detect',"
Topic 4: 0.008*"'develop'," + 0.008*"'provid'," + 0.008*"'health',"
Topic 5: 0.010*"'use'," + 0.009*"'case'," + 0.008*"'studi',"
Topic 6: 0.047*"'de'," + 0.021*"'cat'," + 0.013*"'dog',"
Topic 7: 0.022*"'may'," + 0.017*"'diseas'," + 0.015*"'anim',"
Topic 8: 0.048*"'cell'," + 0.018*"'infect'," + 0.012*"'activ',"
Topic 9: 0.046*"'viru'," + 0.037*"'infect'," + 0.032*"'vaccin',"
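
The topic strings above carry quote and comma artifacts from tokenization (for example `'use',`). A small regex sketch like the one below can pull out clean `(weight, word)` pairs for easier reading. `parse_topic` is a hypothetical helper, and the sample string is copied from Topic 9.

```python
import re

# Captures the weight and the quoted token from terms like 0.046*"'viru',"
TERM_PATTERN = re.compile(r'(\d+\.\d+)\*"([^"]*)"')

def parse_topic(topic_string):
    """Return (weight, word) pairs with stray quotes and commas removed."""
    pairs = []
    for weight, token in TERM_PATTERN.findall(topic_string):
        word = token.strip("'[],")  # drop list punctuation left on the token
        pairs.append((float(weight), word))
    return pairs

sample = '''0.046*"'viru'," + 0.037*"'infect'," + 0.032*"'vaccin',"'''
for weight, word in parse_topic(sample):
    print(f"{word}: {weight}")
```

The same function could be applied to each `topic_words` string in the loop above to print the stems without the extra punctuation.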


### `Topic breakdowns by the most common words and what I think the topics are about`

- `Topic 0`: This looks like it is about using some kind of compound at different concentrations, so it is probably about drug development.
- `Topic 1`: This looks like it is about the structure of viruses, with words like protein, cell, and structure.
- `Topic 2`: This looks like the articles are talking about patients and the way they are infected during a study.
- `Topic 3`: To me this looks like it is about using sequencing to detect a virus, since it has the stems for sequence and detect in there.
- `Topic 4`: It looks like they are talking about action plans to develop a vaccine for the virus and provide it through healthcare.
- `Topic 5`: Maybe how to use studies to help with the virus, using specific cases to help tailor the vaccine?
- `Topic 6`: This is maybe talking about how dogs and cats are infected by the virus, or about their use as test subjects?
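- `Topic 7`: This could be talking about the lockdown because it includes the word 'may', although the lowercase token could just as well be the verb as the month, and the other top words (diseas, anim) point to disease in animals.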
- `Topic 8`: This could be talking about what the virus does to cells and how it affects the person once the infection sets in?
- `Topic 9`: Maybe this is talking about how the virus spreads, and about vaccines?

### `Summary of all the topics 2-3 sentences`

I think the topics each look at something different, from how viruses affect people to how they spread. They are distinct from one another and cover multiple subjects, but they all deal with viruses: how they spread and how they have been treated, historically speaking.