Project 4: Natural Language Processing and Unsupervised Learning

In this notebook, I test out how to preprocess Reddit posts and comments.

8/12 Notes:
Planned workflow: get monthly post submissions -> add columns for cleaned_text, nouns only, verbs only, adjectives only, and named entities only (?) -> do topic modeling on the nouns and named entities -> do sentiment analysis on cleaned_text

# Table of Contents
1. [Imports](#section1)
2. [Getting Post Data](#section2)
3. [Exploratory Preprocessing](#section3)
4. [Aug 13](#section4)

<a id='section1'></a>
### 1. Imports

In [None]:
import json
import praw
import requests

import pandas as pd
import numpy as np
from datetime import datetime
from spacy.lang.en import English
import spacy

import psycopg2 as pg

import re

<a id='section2'></a>
### 2. Getting Post Data

In [None]:
# Postgres info to connect

connection_args = {
    'host': 'localhost', # We are connecting to our local version of psql
    'dbname': 'reddit_medicine',        # DB that we are connecting to
    'port': 5432,        # port we opened on AWS
    'password':'',
    'user': 'postgres'
}

In [None]:
def connect_to_postgres(connection_args):
    '''
    Connect to PostgreSQL database server
    '''
    connection = None
    try:
        connection = pg.connect(**connection_args)
    except (Exception, pg.DatabaseError) as error:
        print(error)
        
    return connection

In [None]:
connection = connect_to_postgres(connection_args)

In [None]:
jan_submission_query = "SELECT * FROM submissions WHERE month = 1;"

jan_df = pd.read_sql(jan_submission_query, connection)

In [None]:
print(jan_df.shape)
jan_df.head()

In [None]:
jan_df["full_text"] = jan_df["title"] + ' ' + jan_df["submission_text"]
jan_df.head()

In [None]:
jan_corpus = ' '.join(jan_df["full_text"])

In [None]:
jan_corpus

Notes about the data: there are a lot of newlines and some odd `&#x200B;` (zero-width space) entities. There is also some markdown formatting. I want to keep the text in square brackets, but remove the parenthesized links that start with https.

There are some abbreviations that I might want to keep: TL;DR, 75 yo, CT (probably CAT scan), MD, DO, NP, PA, ED.
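
A minimal sketch of the link handling described above, assuming Reddit markdown links have the form `[label](https://...)` and that the odd entities look like `&gt;` or `&#x200B;`. The helper name `strip_markdown_links` and the sample string are made up for illustration and are not part of the cleaning code used later in this notebook.

```python
import re

# Matches [label](http://... or https://...) and captures the label.
MARKDOWN_LINK = re.compile(r'\[([^\]]+)\]\(https?://[^)\s]+\)')
# Matches bare URLs that are not wrapped in markdown.
BARE_URL = re.compile(r'https?://\S+')
# Matches HTML entities such as &gt; or &#x200B; (assumed to end with a semicolon).
HTML_ENTITY = re.compile(r'&#?\w+;')


def strip_markdown_links(text):
    """Keep link labels, drop URLs and HTML entities, collapse whitespace."""
    text = MARKDOWN_LINK.sub(r'\1', text)
    text = BARE_URL.sub(' ', text)
    text = HTML_ENTITY.sub(' ', text)
    return ' '.join(text.split())


sample = "See [the CDC page](https://www.cdc.gov/flu) now.\n\n&#x200B;\n\nMore at https://example.com/x"
print(strip_markdown_links(sample))
```

Replacing the markdown links first matters here: if the bare-URL pattern ran first, it would swallow the closing parenthesis and leave the square brackets behind.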

In [None]:
# clean up "\n" characters in corpus
jan_corpus_1 = re.sub('\n', ' ', jan_corpus)
jan_corpus_1

In [None]:
# remove all links
jan_corpus_2 = re.sub(r'https?:\/\/\S+', ' ', jan_corpus)
jan_corpus_2

In [None]:
# remove all weird &gt or &#x200B characters
jan_corpus_3 = re.sub(r'&\S*', ' ', jan_corpus)
jan_corpus_3

In [None]:
clean_jan_corpus = re.sub('\n', ' ', jan_corpus)
clean_jan_corpus = re.sub(r'https?:\/\/\S+', ' ', clean_jan_corpus)
clean_jan_corpus = re.sub(r'&\S*', ' ', clean_jan_corpus)
clean_jan_corpus = re.sub(r'\\xa0', ' ', clean_jan_corpus)
characters_to_clean = '*>|[]()\\",°'
for c in characters_to_clean:
    clean_jan_corpus = clean_jan_corpus.replace(c, '')
clean_jan_corpus = clean_jan_corpus.replace('/', ' ') 
clean_jan_corpus = clean_jan_corpus.replace('-', ' ')
clean_jan_corpus

In [None]:
# finding common medical abbreviations
abbreviations = re.findall('[A-Z][A-Z]+', clean_jan_corpus)
set(abbreviations)

Common abbreviations I can find (manual):
ICU, MD, DO, NP, PA, ED, DDS, MBA, EMS
WHO, SARS, MERS
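
One thing to watch when expanding these abbreviations: a plain `str.replace` also fires inside longer uppercase words (for example `ED` inside `EDIT`). Below is a minimal sketch of a whole-word replacement using regex word boundaries. The function name `expand_abbreviations`, the three-entry dictionary, and the sample sentence are illustrative assumptions, not the notebook's actual data.

```python
import re

# Small illustrative subset of the abbreviation dictionary.
sample_abbreviations = {
    "ED": "emergency_department",
    "ER": "emergency_room",
    "MD": "doctor_of_medicine",
}


def expand_abbreviations(text, abbreviations):
    """Replace whole-word abbreviations only, leaving longer words intact."""
    # Longest keys first so that a longer abbreviation wins over a shorter one.
    keys = sorted(abbreviations, key=len, reverse=True)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in keys) + r')\b')
    return pattern.sub(lambda m: abbreviations[m.group(1)], text)


text = "EDIT: the ED was full, so the MD sent us to another ER."

# Naive version for comparison: substring replacement, no word boundaries.
naive = text
for key, value in sample_abbreviations.items():
    naive = naive.replace(key, value)

print(naive)
print(expand_abbreviations(text, sample_abbreviations))
```

The naive version turns `EDIT:` into `emergency_departmentIT:`, while the word-boundary version leaves it alone.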

In [None]:
abbreviation_dict = {"MD": "doctor_of_medicine",
                     "MPH": "master_of_public_health",
                     "MBA": "master_of_business_administration",
                     "NP": "nurse_practitioner",
                     "PA": "physician_assistant",
                     "RN": "registered_nurse",
                     "DVM": "doctor_of_veterinary_medicine",
                     "DDS": "doctor_of_dentistry",
                     "DO": "doctor_of_osteopathy",
                     "ICU": "intensive_care_unit",
                     "ER": "emergency_room",
                     "EMS": "emergency_medical_services",
                     "CDC": "centers_for_disease_control",
                     "WHO": "world_health_organization",
                     "SARS": "severe_acute_respiratory_syndrome",
                     "MERS": "middle_east_respiratory_syndrome",
                     "SOM": "school_of_medicine"
                     }
abbreviation_dict

In [None]:
for key in abbreviation_dict.keys():
    clean_jan_corpus = clean_jan_corpus.replace(key, ' ' + abbreviation_dict[key] + ' ')
clean_jan_corpus

In [None]:
len(clean_jan_corpus)

In [None]:
def cleaning_function(corpus, regex_patterns, char_space, char_no_space, abbrev_dict):
    '''
    Inputs:
    - corpus (string): string of reddit posts
    - regex_patterns (list): list of regex patterns to remove
    - char_space (string): characters to replace with a space
    - char_no_space (string): characters to replace with no space
    - abbrev_dict (dict): dictionary to replace abbreviations with full words
    Outputs:
    - cleaned_corpus (string): cleaned string of reddit posts
    '''
    cleaned_corpus = str(corpus)
    for pattern in regex_patterns:
        cleaned_corpus = re.sub(pattern, ' ', cleaned_corpus)
    
    for char in char_space:
        cleaned_corpus = cleaned_corpus.replace(char, ' ')
    
    for char in char_no_space:
        cleaned_corpus = cleaned_corpus.replace(char, '')
        
    for key, value in abbrev_dict.items():
        # match whole words only so that e.g. "ED" does not fire inside "EDIT"
        cleaned_corpus = re.sub(r'\b' + re.escape(key) + r'\b', ' ' + value + ' ', cleaned_corpus)
    
    cleaned_corpus = cleaned_corpus.lower()
    
    return cleaned_corpus

In [None]:
regex_patterns = ['\n', '\t', r'https?:\/\/\S+', r'&\S*', r'\\xa0', r'[__]{2,}', r'[\d]+', 'χ', '®']
char_space = '^*>|[]()",°#'
char_no_space = '/-\\'
abbrev_dict = {"MD": "doctor_of_medicine",
               "MPH": "master_of_public_health",
               "MBA": "master_of_business_administration",
               "NP": "nurse_practitioner",
               "PA": "physician_assistant",
               "RN": "registered_nurse",
               "DVM": "doctor_of_veterinary_medicine",
               "DDS": "doctor_of_dentistry",
               "DO": "doctor_of_osteopathy",
               "ICU": "intensive_care_unit",
               "ER": "emergency_room",
               "ED": "emergency_department",
               "EMS": "emergency_medical_services",
               "EMR": "electronic_medical_records",
               "CFR": "case_fatality_rate",
               "CT": "computed_tomography",
               "CDC": "centers_for_disease_control",
               "WHO": "world_health_organization",
               "FDA": "food_and_drug_administration",
               "SARS": "severe_acute_respiratory_syndrome",
               "MERS": "middle_east_respiratory_syndrome",
               "ARDS": "acute_respiratory_distress_syndrome",
               "SOM": "school_of_medicine",
               "COVID": "covid",
               "N95": "n95",
               "n95": "n95", # make sure n95 is counted as a distinc
               "PPE": "personal_protective_equipment"
               }

In [None]:
cleaned_jan_corpus = cleaning_function(jan_corpus, regex_patterns, char_space, char_no_space, abbrev_dict)

In [None]:
cleaned_jan_corpus

In [None]:
# exclude megathreads that have repetitive post titles and post submission text
full_submission_query = "SELECT * FROM submissions WHERE title NOT LIKE 'Megathread:%' AND title NOT LIKE 'Megathread #%' AND title NOT LIKE 'Weekly Careers Thread';"
full_df = pd.read_sql(full_submission_query, connection)
full_df["full_text"] = full_df["title"] + ' ' + full_df["submission_text"]
full_corpus = ' '.join(full_df["full_text"])

In [None]:
full_df["cleaned_text"] = full_df["full_text"].apply(lambda x:
                          cleaning_function(x, regex_patterns, char_space, char_no_space, abbrev_dict))

In [None]:
cleaned_full_corpus = cleaning_function(full_corpus, regex_patterns, char_space, char_no_space, abbrev_dict)

In [None]:
# might need to exclude megathreads because they repeat the same text over and over
cleaned_full_corpus

<a id='section3'></a>
### 3. Exploratory Preprocessing

In [None]:
# trying out nltk
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
from nltk.tokenize import RegexpTokenizer
from nltk.util import ngrams

In [None]:
# looks the most promising?
tokenize_by_word = word_tokenize(cleaned_full_corpus)
tokenize_by_word[:200]

In [None]:
# not looking too great
tokenize_by_sent = sent_tokenize(cleaned_full_corpus)
tokenize_by_sent[:200]

In [None]:
tokenize_by_bigram = word_tokenize(cleaned_full_corpus)
twograms = list(ngrams(tokenize_by_bigram, 2))
twograms[:200]

In [None]:
# RegexpTokenizer with whitespace delimiter
whitespace_tokenizer = RegexpTokenizer(r"\s+", gaps=True)
tokenize_by_regex = whitespace_tokenizer.tokenize(cleaned_full_corpus)
tokenize_by_regex[:200]

In [None]:
# make all text lowercase
cleaned_full_corpus = cleaned_full_corpus.lower()
cleaned_full_corpus

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
cv = CountVectorizer(stop_words='english')
X = cv.fit_transform(word_tokenize(cleaned_full_corpus))
pd.DataFrame(X.toarray(), columns=cv.get_feature_names())

In [None]:
feature_names = cv.get_feature_names()
feature_names[-500:]

In [None]:
# more cleaning I need to do; lots of numbers

# matches 2 or more underscores
corpus_copy = re.sub(r'_{2,}', ' ', cleaned_full_corpus)
# removes runs of digits
corpus_copy = re.sub(r'\d+', ' ', corpus_copy)
corpus_copy = re.sub('χ', ' ', corpus_copy)
corpus_copy

In [None]:
cv1 = CountVectorizer(stop_words='english')
X_1 = cv1.fit_transform(word_tokenize(corpus_copy))
feature_names = cv1.get_feature_names()

In [None]:
# testing out stemmers now

In [None]:
from nltk.stem.lancaster import LancasterStemmer
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

In [None]:
token_words = word_tokenize(corpus_copy)

In [None]:
lancaster = LancasterStemmer()
lancaster_list = []
for word in token_words:
    lancaster_list.append(lancaster.stem(word))

In [None]:
lancaster_list[:500]

In [None]:
porter = PorterStemmer()
porter_list = []
for word in token_words:
    porter_list.append(porter.stem(word))

In [None]:
porter_list[:500]

In [None]:
snowball = SnowballStemmer('english')
snowball_list = []
for word in token_words:
    snowball_list.append(snowball.stem(word))

In [None]:
snowball_list[:500]

In [None]:
# now trying part of speech tag
from nltk.tag import pos_tag

In [None]:
part_of_speech = pos_tag(token_words)
part_of_speech[:500]

In [None]:
only_nouns = []
for word, pos in part_of_speech:
    if 'NN' in pos:
        only_nouns.append(word)

In [None]:
only_nouns[:1000]

In [None]:
len(only_nouns)

In [None]:
from nltk.help import upenn_tagset

In [None]:
upenn_tagset()

In [None]:
# seems like spacy does a lot of things already in the pipeline
# it has a part of speech tagger, a dependency parser, a named
# entity recognizer, and a text classifier

# if I want to use the named entity recognizer, I shouldn't spell out abbreviations?

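# note: English() is a blank pipeline (tokenizer only); the tagger, parser, and
# named entity recognizer come from a trained model such as en_core_web_sm
nlp = English()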
spacy_corpus = corpus_copy[:100000]


In [None]:
spacy_test = nlp(spacy_corpus)

In [None]:
def spacy_post_extract(text, pos):
    '''
    Use spacy to extract words that are a specific part of speech from text.
    Inputs:
    - text (str): string of words to extract words from
    - pos (list): list of part of speech strings; must be one of "NOUN", "VERB", "ADJ", "ADV", "PROPN"
    https://spacy.io/api/annotation#pos-tagging this link contains a table with all the different parts of speech
    Output:
    - pos_string (str): string of words that are a specific part of speech
    '''
    
    nlp = spacy.load("en_core_web_sm")
    spacy_text = nlp(text)
    
    # keep tokens whose coarse part-of-speech tag is one of the requested tags
    pos_list = [token.text for token in spacy_text if token.pos_ in pos]
    
    pos_string = ' '.join(pos_list)
    return pos_string

In [None]:
sample_text = "Our current enemy was still in the shadows. The wards slowly emptied with activities freed."
pos = ["NOUN", "VERB"]
spacy_post_extract(sample_text, pos)

In [None]:
nlp_2 = spacy.load("en_core_web_sm")
spacy_test_2 = nlp_2(spacy_corpus)
for token in spacy_test_2[300:350]:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num, token.pos_, token.dep_, token.head.text)

In [None]:
# https://spacy.io/api/annotation#pos-tagging list of all parts of speech; may need NOUN and PROPN (proper noun)

In [None]:
# try getting all nouns with spacy
noun_list = []
for token in spacy_test_2:
    if token.pos_ == "NOUN":
        noun_list.append(token)

In [None]:
noun_list[:500]

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
# setting min_df to 2 helps clear away a lot of junk!
count_vect = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df=2)
doc_word = count_vect.fit_transform(full_df["cleaned_text"])
words = count_vect.get_feature_names()
vocab = count_vect.vocabulary_

In [None]:
count_vec_df = pd.DataFrame(doc_word.toarray(), columns=count_vect.get_feature_names())
count_vec_df

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
# trying out tfidf vectorizer
tfidf_vect = TfidfVectorizer(analyzer = 'word', stop_words = 'english', min_df=2)
tfidf_doc = tfidf_vect.fit_transform(full_df["cleaned_text"])

In [None]:
tfidf_vec_df = pd.DataFrame(tfidf_doc.toarray(), columns=tfidf_vect.get_feature_names())
tfidf_vec_df

I was able to run both CountVectorizer and TfidfVectorizer. However, a lot of the column names look like junk, so it looks like I should stem the text before vectorizing.

I also want to look at the posts per month rather than per the entire year. So my vectorizer should vectorize on a monthly basis instead.
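
A minimal sketch of the per-month idea, using only the standard library. The `posts` list is made-up data standing in for the `month` and `cleaned_text` columns of `full_df`, and `monthly_term_counts` is a hypothetical helper name. In the notebook, the same split could instead be done by fitting a separate vectorizer on each month's rows.

```python
from collections import Counter, defaultdict

# Hypothetical rows standing in for full_df[["month", "cleaned_text"]].
posts = [
    {"month": 1, "cleaned_text": "flu season flu shots"},
    {"month": 1, "cleaned_text": "flu vaccine supply"},
    {"month": 3, "cleaned_text": "mask shortage mask reuse"},
]


def monthly_term_counts(rows):
    """Return {month: Counter of whitespace-separated tokens} for the given rows."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row["month"]].update(row["cleaned_text"].split())
    return dict(counts)


by_month = monthly_term_counts(posts)
for month in sorted(by_month):
    print(month, by_month[month].most_common(2))
```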

In [None]:
# trying using cosine similarities to find similar posts
from itertools import combinations
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
pairs = list(combinations(enumerate(full_df["cleaned_text"]), 2))
combos = [(a[0], b[0]) for a, b in pairs]
phrases = [(a[1], b[1]) for a, b in pairs]

In [None]:
results = [cosine_similarity(count_vec_df.iloc[[a]], count_vec_df.iloc[[b]]) for a, b in combos]
doc_similarity_count = sorted(zip(results, phrases), reverse=True)

In [None]:
# will need to eliminate weekly careers thread
doc_similarity_count[:50]

In [None]:
doc_similarity_count[100:200]
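
For reference, a minimal sketch of what the cosine similarity on count vectors computes: the dot product of the two word-count vectors divided by the product of their lengths. It uses plain Python instead of sklearn. The function name `cosine_from_counts` and the three toy documents are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_from_counts(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words count vectors."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    # The dot product only needs the words the two documents share.
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # An empty document has no direction, so treat it as dissimilar.
        return 0.0
    return dot / (norm_a * norm_b)


doc_1 = "weekly careers thread".split()
doc_2 = "weekly careers thread".split()
doc_3 = "icu staffing shortage".split()
print(cosine_from_counts(doc_1, doc_2))
print(cosine_from_counts(doc_1, doc_3))
```

Identical posts score (approximately) 1 and posts with no shared words score 0, which is why repeated threads such as the weekly careers thread dominate the top of the similarity list.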

In [None]:
def spacy_named_entities(text):
    '''
    Use spacy to extract named entities from text
    Inputs:
    - text (str): string of words to extract named entities from
    Output:
    - ent_list (list): list of (entity text, entity label) tuples
    '''
    
    nlp = spacy.load("en_core_web_sm")
    spacy_text = nlp(text)
    
    ent_list = []
    
    for ent in spacy_text.ents:
        ent_list.append((ent.text, ent.label_))
    return ent_list

In [None]:
full_text = ' '.join(full_df["full_text"].to_list())
full_text[:100]

In [None]:
# looking at all spacy entities; could be interesting to look at all the geopolitical entities (is China mentioned more?)
spacy_named_entities(' '.join(full_df["full_text"].to_list())[:100000])

In [None]:
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import NMF

In [None]:
#count_vect = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df=2)
#doc_word = count_vect.fit_transform(full_df["cleaned_text"])
lsa = TruncatedSVD(3)
doc_topic = lsa.fit_transform(doc_word)
lsa.explained_variance_ratio_

In [None]:
index_list = []
for i in range(1, 4):
    index_list.append(f'component_{i}')
index_list

In [None]:
topic_word = pd.DataFrame(lsa.components_.round(3),
                          index = index_list,
                          columns=count_vect.get_feature_names())
topic_word

In [None]:
def display_topics(model, feature_names, no_top_words, topic_names=None):
    for ix, topic in enumerate(model.components_):
        if not topic_names or not topic_names[ix]:
            print("\nTopic ", ix)
        else:
            print("\nTopic: '",topic_names[ix],"'")
        print(", ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

In [None]:
display_topics(lsa, count_vect.get_feature_names(), 10)

In [None]:
# trying out NMF
nmf_model = NMF(3)
doc_topic_1 = nmf_model.fit_transform(doc_word)
topic_word_1 = pd.DataFrame(nmf_model.components_.round(3),
                            index=index_list,
                            columns=count_vect.get_feature_names())
topic_word_1

In [None]:
display_topics(nmf_model, count_vect.get_feature_names(), 10)

In [None]:
# trying to grab comments from one submission ignore the bottom
url_template = (r'https://api.pushshift.io/reddit/search/comment/?link_id={}&limit=1000&sort_type={}&sort={}')
submission_id = 'fjj0lr'
sort = 'desc'
sort_type = 'score'
# argument order must match the placeholders: link_id, sort_type, sort
filled_in_template = url_template.format(submission_id, sort_type, sort)
request = requests.get(filled_in_template)
assert request.status_code == 200

In [None]:
json_response = request.json()
march_comment_list = []
comment_dict = {}

comment_dict['id'] = submission_id
every_comment = ''

In [None]:
# making a dataframe of comments from one submission

march_comment_list = []

comment_dict = {}
comment_dict['id'] = submission_id
every_comment = ''

for comment_data in json_response['data']:
    comment = comment_data['body']
    
    # basic cleanup
    comment = comment.replace('\n','').replace('^', '').replace("\\'","'")
    
    every_comment += comment + ' '

comment_dict['all_comments'] = every_comment

march_comment_list.append(comment_dict)

In [None]:
march_comment_df = pd.DataFrame(march_comment_list)
march_comment_df.head()

In [None]:
test_comment = march_comment_df.loc[0, "all_comments"]
test_comment

In [None]:
# Create the nlp spacy object
nlp = English()

In [None]:
spacy_doc = nlp(test_comment)

In [None]:
# at first glance, it looks like the apostrophes aren't keeping the words together
# the word "that's" has been separated into "that" and "'s"
for token in spacy_doc[300:350]:
    # token.i is the index; text is the token text; is_alpha (alphabetic characters only), is_punct (punctuation), like_num (resembles a number)
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num)

In [None]:
nlp_2 = spacy.load("en_core_web_sm")

In [None]:
spacy_doc_2 = nlp_2(test_comment)

In [None]:
# COVID-19 is thought of as a number?; can get part of speech; syntactic dependencies
for token in spacy_doc_2[300:350]:
    print(token.i, token.text, token.is_alpha, token.is_punct, token.like_num, token.pos_, token.dep_, token.head.text)

In [None]:
# looking at named entities
for ent in spacy_doc_2[:1000].ents:
    print(ent.text, ent.label_)

In [None]:
spacy.explain("FAC")

In [None]:
spacy.explain("dobj")

In [None]:
# match patterns
[{"TEXT": "iPhone"}, {"TEXT": "X"}]

In [None]:
# example is nlp = spacy.load("en_core_web_sm"); matcher = Matcher(nlp.vocab)
# can use matches and operators to find text

matcher = spacy.matcher.Matcher(nlp_2.vocab)

In [None]:
pattern = [{"TEXT": "COVID-19"}]
matcher.add("COVID_PATTERN", None, pattern)
sample_doc = nlp_2("COVID-19 coronavirus pandemic")

In [None]:
matches = matcher(sample_doc)

for match_id, start, end in matches:
    matched_span = sample_doc[start:end]
    print(match_id, start, end, matched_span.text)

<a id='section4'></a>
### 4. Aug 13

Goals for today include:
* getting the top words for a section
* stemming and lemmatizing and add those as columns
* using bigrams or trigrams to do topic modeling
* getting sentiment analysis

Getting the top words for a section

In [None]:
full_df.head()

In [None]:
from collections import Counter

In [None]:
# to future Jacky: use this function!
def find_top_words_per_post(text: str, n: int):
    '''
    This function returns the top n words per reddit post.
    Inputs:
    - text (str): reddit submission post
    - n (int): number of words
    Outputs:
    - list_of_top_words (list): a list of the top n words in the post
    '''
    tokenize_text = word_tokenize(text)
    word_counter = Counter(tokenize_text)
    # most_common(n) already returns at most n (word, count) pairs
    list_of_top_words = [word for word, _ in word_counter.most_common(n)]
    
    return list_of_top_words

In [None]:
example_words = full_df.loc[2, "cleaned_text"].strip()
find_top_words_per_post(example_words, 5)

In [None]:
# Alternative pandas-based version of find_top_words_per_post, kept for reference only.
# It stays fully commented out so that it does not override the Counter version above.
# def find_top_words_per_post(text: str, n: int):
#     tokenize_text = word_tokenize(text)
#     tokenize_df = pd.DataFrame(tokenize_text)
#     list_of_top_words = tokenize_df[0].value_counts().index.to_list()
#     if n < len(list_of_top_words):
#         return list_of_top_words[:n]
#     else:
#         return list_of_top_words

In [None]:
# example_df is only created further down, so read the first post from full_df here
tokenize_text_1 = word_tokenize(full_df.loc[0,"cleaned_text"])
tokenize_df_1 = pd.DataFrame(tokenize_text_1)
list_of_top_words_1 = tokenize_df_1[0].value_counts().index.to_list()
list_of_top_words_1[:20]

My next goal is to try stemming and lemmatizing, then making bigrams or trigrams to use with topic modeling
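
As a reminder of what bigrams and trigrams are, here is a minimal standard-library sketch that slides a window of size n over a token list. `make_ngrams` is a hypothetical helper name and the sample phrase is made up. In the cells below, the n-grams actually come from CountVectorizer's `ngram_range` parameter.

```python
def make_ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # zip stops at the shortest slice, so sequences shorter than n give [].
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = "intensive care unit capacity".split()
print(make_ngrams(tokens, 2))
print(make_ngrams(tokens, 3))
```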

In [None]:
full_df.head()

In [None]:
example_df = full_df.copy().iloc[:100]

In [None]:
snowball = SnowballStemmer('english')

In [None]:
example_df["tokenize_text"] = example_df["cleaned_text"].apply(lambda x:word_tokenize(x))
example_df.head()

In [None]:
def stem_text(text):
    #snowball = SnowballStemmer('english')
    stem_list = [snowball.stem(word) for word in text]
    stem_string = ' '.join(stem_list)
    return stem_string

In [None]:
example_df["stemmed_text"] = example_df["tokenize_text"].apply(stem_text)
example_df.head()

In [None]:
from spacy.lemmatizer import Lemmatizer

In [None]:
def spacy_lemmatizer(text):
    spacy_nlp = English()
    doc = spacy_nlp(text)
    
    lemmatize_list = []
    for token in doc:
        lemmatize_list.append(token.lemma_)
    return ' '.join(lemmatize_list)

In [None]:
example_df["lemmatized_text"] = example_df["cleaned_text"].apply(spacy_lemmatizer)

In [None]:
example_df.head()

In [None]:
count_vect_test_1 = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df = 2, ngram_range=(2,2))
doc_word_1 = count_vect_test_1.fit_transform(example_df["cleaned_text"])
lsa = TruncatedSVD(3)
doc_topic_1 = lsa.fit_transform(doc_word_1)

index_list = []

for i in range(1, 4):
    index_list.append(f'component_{i}')

topic_word_1 = pd.DataFrame(lsa.components_.round(3),
                            index=index_list,
                            columns=count_vect_test_1.get_feature_names())
display_topics(lsa, count_vect_test_1.get_feature_names(), 10)

In [None]:
count_vect_test_2 = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df = 2, ngram_range=(2,2))
doc_word_2 = count_vect_test_2.fit_transform(example_df["stemmed_text"])
lsa = TruncatedSVD(3)
doc_topic_2 = lsa.fit_transform(doc_word_2)

index_list = []

for i in range(1, 4):
    index_list.append(f'component_{i}')

topic_word_2 = pd.DataFrame(lsa.components_.round(3),
                            index=index_list,
                            columns=count_vect_test_2.get_feature_names())
display_topics(lsa, count_vect_test_2.get_feature_names(), 10)

In [None]:
count_vect_test_3 = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df = 2, ngram_range=(2,2))
doc_word_3 = count_vect_test_3.fit_transform(example_df["lemmatized_text"])
lsa = TruncatedSVD(3)
doc_topic_3 = lsa.fit_transform(doc_word_3)

index_list = []

for i in range(1, 4):
    index_list.append(f'component_{i}')

topic_word_3 = pd.DataFrame(lsa.components_.round(3),
                            index=index_list,
                            columns=count_vect_test_3.get_feature_names())
display_topics(lsa, count_vect_test_3.get_feature_names(), 10)

In [None]:
nmf_model = NMF(3)

In [None]:
count_vect_test_4 = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df = 2, ngram_range=(2,2))
doc_word_4 = count_vect_test_4.fit_transform(example_df["cleaned_text"])
nmf_model = NMF(3)
doc_topic_4 = nmf_model.fit_transform(doc_word_4)

index_list = []

for i in range(1, 4):
    index_list.append(f'component_{i}')

topic_word_4 = pd.DataFrame(nmf_model.components_.round(3),
                            index=index_list,
                            columns=count_vect_test_4.get_feature_names())
display_topics(nmf_model, count_vect_test_4.get_feature_names(), 10)

In [None]:
count_vect_test_5 = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df = 2, ngram_range=(2,2))
doc_word_5 = count_vect_test_5.fit_transform(example_df["stemmed_text"])
nmf_model = NMF(3)
doc_topic_5 = nmf_model.fit_transform(doc_word_5)

index_list = []

for i in range(1, 4):
    index_list.append(f'component_{i}')

topic_word_5 = pd.DataFrame(nmf_model.components_.round(3),
                            index=index_list,
                            columns=count_vect_test_5.get_feature_names())
display_topics(nmf_model, count_vect_test_5.get_feature_names(), 10)

In [None]:
count_vect_test_6 = CountVectorizer(analyzer = 'word', stop_words = 'english', min_df = 2, ngram_range=(2,2))
doc_word_6 = count_vect_test_6.fit_transform(example_df["lemmatized_text"])
nmf_model = NMF(3)
doc_topic_6 = nmf_model.fit_transform(doc_word_6)

index_list = []

for i in range(1, 4):
    index_list.append(f'component_{i}')

topic_word_6 = pd.DataFrame(nmf_model.components_.round(3),
                            index=index_list,
                            columns=count_vect_test_6.get_feature_names())
display_topics(nmf_model, count_vect_test_6.get_feature_names(), 10)

In [None]:
def print_topics(vectorizer, text, model, n_words):
    '''
    This function prints out topics based on the model fit on the vectorized text.
    
    Inputs:
    - vectorizer: vectorizer used to vectorize the text
    - text: text to be analyzed
    - model (topic modeling model): NMF, LDA, other topic modeling models
    - n_words (int): number of top words to print for each topic
    
    Outputs:
    - prints out topics with n (corresponding to n_words) words that relate to that topic
    '''
    vectorized_text = vectorizer.fit_transform(text)
    transform_text = model.fit_transform(vectorized_text)
    display_topics(model, vectorizer.get_feature_names(), n_words)

In [None]:
def make_topic_dataframe(vectorizer, text, model, n_topics):
    '''
    This function returns a dataframe of n (corresponding to n_topics) topics based on
    the model fit on the vectorized text.
    
    Inputs:
    - vectorizer: vectorizer used to vectorize the text
    - text: text to be analyzed
    - model (topic modeling model): NMF, LDA, other topic modeling models
    - n_topics (int): number of topics; should match the model's number of components
    
    Outputs:
    - topic_df (DataFrame): one row per topic component, one column per vocabulary term
    '''
    
    vectorized_text = vectorizer.fit_transform(text)
    
    transform_text = model.fit_transform(vectorized_text)
    
    index_list = []
    
    # n_topics is an int, so iterate over range(1, n_topics + 1) directly
    for i in range(1, n_topics + 1):
        index_list.append(f'component_{i}')
        
    topic_df = pd.DataFrame(model.components_.round(3),
                            index=index_list,
                            columns=vectorizer.get_feature_names()
                           )
    
    return topic_df

My last goal for Aug 13 is to do sentiment analysis.

In [None]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [None]:
example_string = example_df.loc[0,"full_text"]
example_string[:150]

In [None]:
analyzer = SentimentIntensityAnalyzer()

In [None]:
analyzer.polarity_scores(example_string)

In [None]:
analyzer.polarity_scores(example_string)['compound']

The compound score is between -1 and 1. It is described as the "normalized, weighted composite score".
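
A minimal sketch of the normalization step, under the assumption (worth checking against the vaderSentiment source) that the compound score is the summed word valence `x` squashed by `x / sqrt(x^2 + alpha)` with `alpha = 15`. The function name `normalize_compound` and the sample sums are illustrative. The real analyzer also adjusts valences for punctuation, capitalization, and negation before this step.

```python
import math


def normalize_compound(valence_sum, alpha=15):
    """Squash an unbounded valence sum into the open interval (-1, 1).

    alpha=15 is assumed here to match VADER's default normalization constant.
    """
    return valence_sum / math.sqrt(valence_sum * valence_sum + alpha)


for total in (-4.0, 0.0, 1.0, 4.0):
    print(total, round(normalize_compound(total), 4))
```

This shape explains why long posts with many sentiment-bearing words tend to land close to -1 or 1: the sum grows with the length of the text, while the normalization saturates.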

In [None]:
example_df["compound_sentiment"] = example_df["full_text"].apply(lambda x: analyzer.polarity_scores(x)['compound'])
example_df.head()

In [None]:
from textblob import TextBlob

In [None]:
example_df["textblob_polarity"] = example_df["full_text"].apply(lambda x: TextBlob(x).sentiment.polarity)
example_df.head()

In [None]:
# adding stop words

from sklearn.feature_extraction import text 

add_stop_words = ['example']
stop_words = text.ENGLISH_STOP_WORDS.union(add_stop_words)

What do I need for LDA?
* corpus, num_topics, random_state, chunksize, passes, alpha
* a dictionary (`corpora.Dictionary(data_lemmatized)`)
* for the corpus, you have to build a term-document frequency (bag of words) from the dictionary
* the goal is to build many LDA models with different numbers of topics and keep the one that gives the highest coherence value

Example:
```python
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from pprint import pprint

import pyLDAvis
import pyLDAvis.gensim

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# View
print(corpus[:1])

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=20, 
                                           random_state=100,
                                           update_every=1,
                                           chunksize=100,
                                           passes=10,
                                           alpha='auto',
                                           per_word_topics=True)

# Print the keywords in each topic
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

# Compute Perplexity
print('\nPerplexity: ', lda_model.log_perplexity(corpus))  # per-word likelihood bound; perplexity is 2**(-bound), so a higher bound (lower perplexity) is better

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score: ', coherence_lda)

# Visualize the topics
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
```
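
The model-selection goal from the notes above (try several topic counts, keep the one with the highest coherence) could be sketched roughly as follows. This is a library-free outline: `select_num_topics` and `toy_coherence` are hypothetical names, and the toy scorer only stands in for training an LDA model with `num_topics=k` and computing its coherence.

```python
def select_num_topics(candidate_ks, fit_and_score):
    """Return (best_k, scores) where best_k maximizes fit_and_score(k)."""
    scores = {k: fit_and_score(k) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Stand-in scorer for illustration only; in the notebook this would train an
# LDA model with num_topics=k and return its coherence score.
def toy_coherence(k):
    return 0.5 - abs(k - 8) * 0.02


best_k, scores = select_num_topics(range(2, 21, 2), toy_coherence)
print(best_k, round(scores[best_k], 3))
```

Keeping the scoring function as a parameter means the same loop can be reused with a different coherence measure or a different topic model.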

Aug 15 Work
* Trying to make a class for NLP pipeline
* Trying out LDA

In [None]:
class nlp_preprocessor:
    '''
    A class for pipelining our NLP data. The user provides the text,
    and this class manages the cleaning, transforming, and other
    modifications of the text data.
    
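    Parameters:
    vectorizer: model to vectorize text data
    tokenizer: tokenizer to use; defaults to splitting on spaces
    cleaning_function: how to clean the data; defaults to lowercasing (and stemming, if a stemmer is given)
    stemmer: optional stemmer with a .stem() method
    model: optional model stored on the instance (not used by fit or transform yet)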
    '''
    
    
    def __init__(self, 
                 vectorizer=CountVectorizer(),
                 tokenizer=None,
                 cleaning_function=None,
                 stemmer=None,
                 model=None):
        
        if not tokenizer:
            tokenizer = self.splitter
        if not cleaning_function:
            cleaning_function = self.clean_text
        self.stemmer = stemmer
        self.tokenizer = tokenizer
        self.model = model
        self.cleaning_function = cleaning_function
        self.vectorizer = vectorizer
        self._is_fit = False
        
    def splitter(self, text):
        '''
        Default tokenizer that splits on spaces
        '''
        return text.split(' ')
    
    def clean_text(self, text, tokenizer, stemmer):
        '''
        Naive function to lowercase all words and clean them
        quickly. This is the default if no other cleaning
        function is specified
        '''
        cleaned_text = []
        
        for post in text:
            cleaned_words = []
            for word in tokenizer(post):
                lower_word = word.lower()
                if stemmer:
                    lower_word = stemmer.stem(lower_word)
                cleaned_words.append(lower_word)
            cleaned_text.append(' '.join(cleaned_words))
        return cleaned_text
    
    def fit(self, text):
        '''
        Cleans the data and then fits the vectorizer to the text
        '''
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        self.vectorizer.fit(clean_text)
        self._is_fit = True
    
    def transform(self, text):
        '''
        Cleans the text and transforms it into a vectorized format.
        Returns the vectorized form of the data.
        '''
        
        if not self._is_fit:
            raise ValueError("Must fit model before transforming!")
        
        clean_text = self.cleaning_function(text, self.tokenizer, self.stemmer)
        return self.vectorizer.transform(clean_text)    
    
    

In [None]:
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
from pprint import pprint

import pyLDAvis
import pyLDAvis.gensim

In [None]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')

In [None]:
len(cleaned_full_corpus)

In [None]:
example_corpus = cleaned_full_corpus[:100000]
example_tokenize = word_tokenize(example_corpus)
example_tokenize = [[word for word in example_tokenize if word not in stop_words]]
example_tokenize

In [None]:
lda_dictionary = corpora.Dictionary(example_tokenize)
tokenized_text = example_tokenize

corpus = [lda_dictionary.doc2bow(text) for text in tokenized_text]

print(corpus[:1])

In [None]:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=lda_dictionary,
                                            num_topics=10,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=50,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
pprint(lda_model.print_topics())

In [None]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, lda_dictionary)
vis

Steps to take for cleaning
* Tokenize, and then remove punctuation

example: <br/>
doc = nlp(text_with_punct)<br/>
tokens = [t.text for t in doc] # python based removal<br/>
tokens_without_punct_python = [t for t in tokens if t not in string.punctuation]<br/>
or # spacy based removal<br/>
tokens_without_punct_spacy = [t.text for t in doc if t.pos_ != 'PUNCT']<br/>

* Normalise data (change numbers to text and abbreviations too?); based on nltk
```python
from normalise import normalise

user_abbr = {
    "N.A.T.O": "North Atlantic Treaty Organization"
}

normalized_tokens = normalise(word_tokenize(text), user_abbrevs=user_abbr, verbose=False)
```

