<a href="https://colab.research.google.com/github/gordeli/NLP_EDHEC2023/blob/main/colab/04_Content_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing @ EDHEC

# Part 4: Content Analysis

[<- Previous: Corpus Level Processing](https://colab.research.google.com/github/gordeli/NLP_EDHEC2024/blob/main/colab/03_Corpus_Level_Processing.ipynb)

[-> Next: Sentiment Analysis](https://colab.research.google.com/github/gordeli/NLP_EDHEC2024/blob/main/colab/05_Topic_Sentiment.ipynb)


Facilitator: [Ivan Gordeliy](https://www.linkedin.com/in/gordeli/)

---



## Initial Setup

- **Run "Setup" below first.**

    - This will load libraries and download some resources that we'll use throughout the tutorial.

    - You will see a message reading "Done with setup!" when this process completes.
    

In [None]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------

# counting and data management
import collections
# operating system utils
import os
# regular expressions
import re
# additional string functions
import string
# system utilities
import sys
# request() will be used to load web content
import urllib.request


# 3rd party libraries
# -------------------

# Natural Language Toolkit (https://www.nltk.org/)
import nltk

# download punctuation related NLTK functions
# (needed for sent_tokenize())
nltk.download('punkt')
# download NLTK part-of-speech tagger
# (needed for pos_tag())
nltk.download('averaged_perceptron_tagger')
# download wordnet
# (needed for lemmatization)
nltk.download('wordnet')
# download stopword lists
# (needed for stopword removal)
nltk.download('stopwords')
# dictionary of English words
nltk.download('words')
nltk.download('omw-1.4')

# numpy: matrix library for Python
import numpy as np

# scipy: scientific operations
# works with numpy objects
import scipy
# cosine distance, used for document similarity
import scipy.spatial.distance

# matplotlib (and pyplot) for visualizations
import matplotlib
import matplotlib.pyplot as plt

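# sklearn for basic machine learning operations
import sklearn
import sklearn.manifold
import sklearn.cluster
# dataset loaders and text vectorizers used below
import sklearn.datasets
import sklearn.feature_extraction.text

# wordcloud tool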
!pip install wordcloud
from wordcloud import WordCloud

# for checking object memory usage
!pip install pympler
from pympler import asizeof

!pip install spacy
import spacy

# Downloading data
# ----------------
if not os.path.exists("aclImdb"):
    !wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
    !tar -xzf aclImdb_v1.tar.gz

def text_to_lemma_frequencies(text, remove_stop_words=True):
    
    # split document into sentences
    sentences = nltk.sent_tokenize(text)
    
    # create a place to store (word, pos_tag) tuples
    words_and_pos_tags = []
    
    # get all words and pos tags
    for sentence in sentences:
        words_and_pos_tags += nltk.pos_tag(nltk.word_tokenize(sentence))
        
    # load the lemmatizer
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    # lemmatize the words
    lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for \
              (word,pos) in words_and_pos_tags]
    
    # convert to lowercase
    lowercase_lemmas = [lemma.lower() for lemma in lemmas]
    
    # load the stopword list for English
    stop_words = set([])
    if remove_stop_words:
        stop_words = set(nltk.corpus.stopwords.words('english'))
    
    # add punctuation to the set of things to remove
    all_removal_tokens = stop_words | set(string.punctuation)
    
    # bonus: also add some custom double-quote tokens to this set
    all_removal_tokens |= set(["''","``"])
    
    # only get lemmas that aren't in these lists
    content_lemmas = [lemma for lemma in lowercase_lemmas \
                      if lemma not in all_removal_tokens]
    
    # return the frequency distribution object
    return nltk.probability.FreqDist(content_lemmas)
    
# Lemmatization -- redefining this here to make
# code block more self-contained
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'

print()
print("Done with setup!")

---
## Corpus-level Processing

### Matrix Representations

- Representing documents as vectors of words gets us one step closer to using traditional data science approaches.

- However, never forget that we're still working with language data!

- **How do we get a corpus matrix?**


- First, we'll load a small corpus into memory:

In [None]:
# from the Stanford Movie Reviews Data: 
# http://ai.stanford.edu/~amaas/data/sentiment/

# we downloaded this during our initial Setup
movie_review_dir = "aclImdb/train/unsup/"
movie_review_files = os.listdir(movie_review_dir)
n_movie_reviews = []
n = 50
for txt_file_path in sorted(movie_review_files, \
                            key=lambda x:int(x.split('_')[0]))[:n]:
        full_path = movie_review_dir + txt_file_path
        with open(full_path,'r') as txt_file:
            n_movie_reviews.append(txt_file.read())
            
print("Loaded",len(n_movie_reviews),"movie reviews from the Stanford IMDB " + \
      "corpus into memory.")

- Start by getting a bag-of-words representation for each review.
- Then, create a mapping between the full vocabulary and columns for our matrix.

In [None]:
review_frequency_distributions = []

# process each review, one at a time
for review in n_movie_reviews:
    
    # let's use our function from before
    frequencies = text_to_lemma_frequencies(review)
    review_frequency_distributions.append(frequencies)

# use a dictionary for faster lookup
vocab2index = {}
latest_index = 0
for rfd in review_frequency_distributions:
    for token in rfd.keys():
        if token not in vocab2index:
            vocab2index[token] = latest_index
            latest_index += 1
    
print("Built vocab lookup for vocab of size:",len(vocab2index))

- Given the frequencies and this index lookup, we can build a frequency matrix (as a numpy array).

In [None]:
# make an all-zero numpy array with shape n x v
# n = number of documents
# v = vocabulary size
corpus_matrix = np.zeros((len(review_frequency_distributions), len(vocab2index)))

# fill in the numpy array
for row, rfd in enumerate(review_frequency_distributions):
    for token, frequency in rfd.items():
        column = vocab2index[token]
        corpus_matrix[row][column] = frequency
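
- To make the indexing concrete, here is a minimal sketch of the same two steps on a toy corpus. The three sentences and the whitespace tokenizer are assumptions for illustration only; they stand in for the movie reviews and for `text_to_lemma_frequencies`.

```python
import collections

# hypothetical toy corpus (not from the IMDB data)
toy_docs = ["the dog chased the cat", "the cat slept", "dogs bark"]

# bag-of-words per document, using a plain whitespace split
toy_counts = [collections.Counter(doc.split()) for doc in toy_docs]

# assign each new token the next free column index
toy_vocab2index = {}
for counts in toy_counts:
    for token in counts:
        if token not in toy_vocab2index:
            toy_vocab2index[token] = len(toy_vocab2index)

# one row per document, one column per vocabulary item
toy_matrix = [[0] * len(toy_vocab2index) for _ in toy_docs]
for row, counts in enumerate(toy_counts):
    for token, frequency in counts.items():
        toy_matrix[row][toy_vocab2index[token]] = frequency

print(toy_vocab2index)
for row in toy_matrix:
    print(row)
```

- Note that `dog` and `dogs` get separate columns in this sketch; lemmatization, as done in `text_to_lemma_frequencies`, would merge them.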

In [None]:
# get some basic information about our matrix
def print_matrix_info(m):
    print("Our corpus matrix is",m.shape[0],'x',m.shape[1])
    # share of non-zero entries (density); sparsity is 100% minus this value
    print("Non-zero entries:",round(float(100 * np.count_nonzero(m))/ \
                           (m.shape[0] * m.shape[1]),2),"%")

print_matrix_info(corpus_matrix)
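
- As a rough sketch of why a mostly-zero matrix matters, the snippet below compares the memory used by a dense array with the memory needed to store only its non-zero entries as (row, column, value) triples. The toy matrix and the helper name `density_and_sizes` are made up for illustration; the sparse formats in `scipy.sparse` build on a similar idea.

```python
import numpy as np

def density_and_sizes(m):
    # fraction of entries that are non-zero
    density = np.count_nonzero(m) / m.size
    # bytes used by the dense array
    dense_bytes = m.nbytes
    # bytes for a coordinate list: two indices plus one value per non-zero entry
    rows, cols = np.nonzero(m)
    coo_bytes = rows.nbytes + cols.nbytes + m[rows, cols].nbytes
    return density, dense_bytes, coo_bytes

# hypothetical 4 x 1000 count matrix with only five non-zero entries
toy = np.zeros((4, 1000))
toy[0, 1] = 3
toy[1, 10] = 1
toy[2, 500] = 2
toy[3, 999] = 1
toy[3, 0] = 4

density, dense_bytes, coo_bytes = density_and_sizes(toy)
print("density:", density, "sparsity:", 1 - density)
print("dense bytes:", dense_bytes, "coordinate-list bytes:", coo_bytes)
```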

- Now that we've seen how this works, let's see how some existing Python functions can do the heavy lifting for us.
- Scikit-learn has some useful feature extraction methods:

In [None]:
# we can get a similar corpus matrix with just 3 lines of code
vectorizer = sklearn.feature_extraction.text.CountVectorizer()
sklearn_corpus_data = vectorizer.fit_transform(n_movie_reviews)
sklearn_corpus_matrix = sklearn_corpus_data.toarray()

# get the feature names (1:1 mapping to the columns in the matrix)
print("First 10 features:",vectorizer.get_feature_names_out()[:10])
print()

# let's check out the matrix
print_matrix_info(sklearn_corpus_matrix)

### Document Retrieval and Similarity

- With this matrix, it's very easy to find all documents containing a specific word.

In [None]:
search_term = "funny"
if search_term in vocab2index:
    search_index = vocab2index[search_term]
    matches = [i for i in range(corpus_matrix.shape[0]) \
           if corpus_matrix[i][search_index]!=0]

    # list the documents that contain the search term
    print("These documents contain '"+search_term+"':",matches)
    print()

    # show excerpt where this word appears
    # (case-insensitive search; the lemma may still differ from the surface form)
    example_text = n_movie_reviews[matches[0]]
    example_location = example_text.lower().find(search_term)
    if example_location >= 0:
        start,end = max(example_location-30,0), min(example_location+30,len(example_text))
        print('For example: "...',example_text[start:end],'..."')
    
else:
    print(search_term,"isn't in the sample corpus.")

- We can even use the notion of vector representations to compute the similarity between two documents.

    - (we'll talk about more advanced ways to approach this task later in the tutorial)

In [None]:
example_docs =[ "My dog likes to eat vegetables",\
                "Your dog likes to eat fruit",\
                "The computer is offline",\
                "A computer shouldn't be offline" ]

vectorizer = sklearn.feature_extraction.text.CountVectorizer()
example_data = vectorizer.fit_transform(example_docs)
example_matrix = example_data.toarray()

sim_0_1 = 1-scipy.spatial.distance.cosine(example_matrix[0],example_matrix[1])
sim_2_3 = 1-scipy.spatial.distance.cosine(example_matrix[2],example_matrix[3])
sim_0_2 = 1-scipy.spatial.distance.cosine(example_matrix[0],example_matrix[2])

print("Similarity between 0 and 1:",round(sim_0_1,2))
print("Similarity between 2 and 3:",round(sim_2_3,2))
print("Similarity between 0 and 2:",round(sim_0_2,2))
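
- For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal pure-Python sketch follows; the function name `cosine_similarity` and the count vectors are illustrative assumptions, and `scipy.spatial.distance.cosine` returns one minus this value.

```python
import math

def cosine_similarity(u, v):
    # dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# hypothetical count vectors over the vocabulary [dog, eat, fruit, likes, vegetables]
doc_a = [1, 1, 0, 1, 1]
doc_b = [1, 1, 1, 1, 0]
doc_c = [0, 0, 0, 0, 2]

print(round(cosine_similarity(doc_a, doc_b), 2))
print(round(cosine_similarity(doc_a, doc_c), 2))
```

- Because the score depends only on direction, doubling every count in a document leaves its similarity to other documents unchanged. The sketch assumes neither vector is all zeros.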

- We can do the same thing with our corpus of movie reviews:

In [None]:
# choose a document, and find the most "similar" other document in the corpus
reference_doc = 0
ref_doc_vec = corpus_matrix[reference_doc]
sim_to_ref_doc = []
for row in corpus_matrix:
    sim_to_ref_doc.append(1-scipy.spatial.distance.cosine(ref_doc_vec,row))
    
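print("similarity scores:",sim_to_ref_doc)
# find the best match, skipping the reference document itself
most_similar = max((i for i in range(len(sim_to_ref_doc)) if i != reference_doc), \
                   key=lambda i: sim_to_ref_doc[i])
print(n_movie_reviews[reference_doc])
print("is most similar to")
print(n_movie_reviews[most_similar])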
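
- The similarity loop can also be written with matrix operations. The sketch below scales each row to unit length so that a single matrix-vector product yields all cosine similarities; the helper name `top_k_similar` and the tiny matrix are assumptions for illustration, and rows are assumed to be non-zero.

```python
import numpy as np

def top_k_similar(matrix, ref_index, k=3):
    # scale each row to unit length (assumes no all-zero rows)
    unit_rows = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    # cosine similarity of every row to the reference row
    sims = unit_rows @ unit_rows[ref_index]
    # exclude the reference document itself
    sims[ref_index] = -np.inf
    # indices of the k highest similarities, best first
    best = np.argsort(sims)[::-1][:k]
    return [(int(i), float(sims[i])) for i in best]

# hypothetical 4-document count matrix
toy = np.array([[2.0, 1.0, 0.0],
                [4.0, 2.0, 0.0],
                [0.0, 1.0, 3.0],
                [1.0, 0.0, 0.0]])
result = top_k_similar(toy, ref_index=0, k=2)
print(result)
```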

**Exercise 4**

- First, let's load a dataset that should exhibit some natural groupings based on topic.
    - [20news](http://qwone.com/~jason/20Newsgroups/) is a classic NLP dataset for document classification.
    

In [None]:
# load 20 newsgroups dataset - just 100 texts from 2 categories
categories = ['comp.sys.ibm.pc.hardware', 'rec.sport.baseball']
newsgroups_train_all = sklearn.datasets.fetch_20newsgroups(subset='train',\
                                                 categories=categories)
newsgroups_train = newsgroups_train_all.data[:100]
newsgroups_labels = newsgroups_train_all.target[:100]

print("Loaded",len(newsgroups_train),"documents.")
print("Label distribution:",collections.Counter(newsgroups_labels))

- Now, write a function that creates a corpus matrix from a list of strings containing documents.
    - We can use the `text_to_lemma_frequencies` function from earlier as a starting point!

In [None]:
# ------------- Exercise 4 -------------- #
def docs2matrix(document_list):
    
    # this should be a nice starting point
    lemma_freqs = [text_to_lemma_frequencies(doc) for doc in document_list]

    # change this to return a 2d numpy array
    return None

# -------------     End    -------------- #

# quick test with first 10 documents
X = docs2matrix(newsgroups_train[:10])
if type(X) != type(np.zeros([3,3])):
    print("Did not return a 2d numpy matrix.")
elif X.shape[0] != 10:
    print("number of rows should be 10, but is",X.shape[0])
else:
    print("Created a matrix with shape:",X.shape)

In [None]:
#@title Sample Solution (double-click to view) {display-mode: "form"}

def docs2matrix(document_list):
    
    # use the vocab2index idea from before
    vocab2index = {}
    
    # this should be a nice starting point
    lemma_freqs = [text_to_lemma_frequencies(doc) for doc in document_list]

    latest_index = 0
    for lf in lemma_freqs:
        for token in lf.keys():
            if token not in vocab2index:
                vocab2index[token] = latest_index
                latest_index += 1
    
    # create the zeros matrix
    corpus_matrix = np.zeros((len(lemma_freqs), len(vocab2index)))
    
    for row, lf in enumerate(lemma_freqs):
        for token, frequency in lf.items():
            column = vocab2index[token]
            corpus_matrix[row][column] = frequency
    
    # return the 2d numpy array
    return corpus_matrix


# quick test with first 10 documents
X = docs2matrix(newsgroups_train[:10])
if type(X) != type(np.zeros([3,3])):
    print("Did not return a 2d numpy matrix.")
elif X.shape[0] != 10:
    print("number of rows should be 10, but is",X.shape[0])
else:
    print("Created a matrix with shape:",X.shape)

### TF-IDF

- Some words are less important when making distinctions between documents in a corpus.
- How can we determine the "less important" words?
    - Term frequency-inverse document frequency (TF-IDF) assumes that words that appear in *many documents* are *less informative* overall.
    - Therefore, we weight each term by the inverse of the number of documents in which it appears.
    - We can define $\operatorname{tfidf}(t,d,D) = \operatorname{tf}(t,d) * \log\frac{|D|}{|\{d \in D : t \in d\}|}$ , where
        - $t$ is a term (token) in a corpus
        - $d$ is a document in the corpus
        - $D$ is the corpus itself, containing documents, which, in turn, contain tokens
        - $\operatorname{tf}(t,d)$ is the frequency of $t$ in $d$ (typically normalized at the document level).
- sklearn has another vectorizer that takes care of this for us: the [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)
    - It behaves just like the CountVectorizer() that we saw before, except it computes tfidf scores instead of counts!

- Of course we can just use the TfidfVectorizer, but what would it look like to implement this ourselves?

In [None]:
# assume input matrix contains term counts (rows = documents, columns = terms)
def tfidf_transform(mat):
    
    # convert matrix of counts to matrix of normalized frequencies
    normalized_mat = mat / mat.sum(axis=1, keepdims=True)
    
    # compute IDF scores for each word given the corpus:
    # log(number of documents / number of documents containing the term)
    docs_using_terms = np.count_nonzero(mat,axis=0)
    idf_scores = np.log(mat.shape[0]/docs_using_terms)
    
    # compute tfidf scores
    tfidf_mat = normalized_mat * idf_scores
    return tfidf_mat

tfidf_X = tfidf_transform(X)
print("Counts:",X[0][0:10])
print("TFIDF scores:",tfidf_X[0][0:10])
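
- A small worked example helps to check the formula by hand. The sketch below applies the definition given earlier in this section, with $|D|$ taken as the number of rows (documents), to a made-up 3 x 3 count matrix; the function name `toy_tfidf` and the column labels are illustrative only. Note that `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ from this plain version.

```python
import numpy as np

def toy_tfidf(counts):
    # tf: counts normalized by document length
    tf = counts / counts.sum(axis=1, keepdims=True)
    # df: number of documents containing each term
    df = np.count_nonzero(counts, axis=0)
    # idf: log(|D| / df), with |D| = number of documents (rows)
    idf = np.log(counts.shape[0] / df)
    return tf * idf

# hypothetical counts; columns: "movie", "funny", "boring"
counts = np.array([[2.0, 2.0, 0.0],
                   [1.0, 0.0, 1.0],
                   [3.0, 0.0, 0.0]])
scores = toy_tfidf(counts)
print(np.round(scores, 3))
```

- Here "movie" appears in every document, so its IDF is $\log(3/3) = 0$ and it contributes nothing, while "funny" and "boring" each appear in one document and keep a positive weight.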

### Bonus: SpaCy
- If you have extra time, check out the [SpaCy 101 tutorial](https://spacy.io/usage/spacy-101)!
    - SpaCy is less research focused, but after you have a good grasp on the core concepts, it can provide a powerful set of NLP tools, and it is definitely worth knowing about.
        - It is also often faster to run than NLTK.
        - (we will time our nltk version first, for reference)

In [None]:
%timeit docs2matrix(newsgroups_train)

In [None]:
# Example preprocessing with SpaCy
def text_to_lemma_frequencies(text):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    # token.lemma_ is the lemma string (token.lemma is only its integer hash)
    words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return collections.Counter(words)

In [None]:
# Example document matrix building 
X = docs2matrix(newsgroups_train)
print("Created a matrix with shape:",X.shape)

In [None]:
%timeit docs2matrix(newsgroups_train)

- Why so slow?
    - We reload the model on every call, and the default pipeline also runs components (the parser and named-entity recognizer) that we don't need here.

In [None]:
NLP = spacy.load('en_core_web_sm',disable=['ner','parser'])
def text_to_lemma_frequencies(text):    
    doc = NLP(text)
    # token.lemma_ is the lemma string (token.lemma is only its integer hash)
    words = [token.lemma_ for token in doc if not token.is_stop and not token.is_punct]
    return collections.Counter(words)

In [None]:
%timeit docs2matrix(newsgroups_train)

## Content Analysis

### Visualizing the data

- Let's visualize the data in 2 dimensions
    - We'll use [t-SNE](https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding) to do the dimensionality reduction.
    - Each color (red and blue) will represent one of the "ground truth" clusters.

In [None]:
# show corpus in 2d

X = docs2matrix(newsgroups_train)
print("Created a matrix with shape:",X.shape)
tsne = sklearn.manifold.TSNE(n_components=2, random_state=1)
X_2d = tsne.fit_transform(X)
colors = ['r', 'b']
target_ids = range(len(categories))
for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[newsgroups_labels == target, 0], X_2d[newsgroups_labels == target, 1], c=c, label=label)
# show which color belongs to which newsgroup
plt.legend()
plt.show()

- The groups have a fair degree of overlap. Can kmeans clustering recover them correctly?

In [None]:
# Do kmeans clustering

# n_init=10 restarts k-means from 10 random initializations and keeps the best run
kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
clusters = kmeans.labels_

for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[clusters == target, 0], X_2d[clusters == target, 1], c=c, label=label)

# our own purity function
def compute_average_purity(clusters, labels):
    # group the true labels by the cluster each document was assigned to
    cluster_labels = collections.defaultdict(list)
    for cluster, label in zip(clusters, labels):
        cluster_labels[cluster].append(label)
    # purity of a cluster = share of its documents with the most common label
    cluster_purities = {}
    for cluster, members in cluster_labels.items():
        most_common_count = collections.Counter(members).most_common()[0][1]
        cluster_purities[cluster] = float(most_common_count)/len(members)
    avg_purity = sum(cluster_purities.values())/len(cluster_purities)
    print("Average cluster purity:",avg_purity)
    return avg_purity
    
avg_purity = compute_average_purity(clusters, newsgroups_labels)
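
- To see what the purity score measures, here is a hand-checkable sketch. The function `average_purity` is a compact restatement of the idea written for this example, and the cluster assignments and labels are made up.

```python
import collections

def average_purity(clusters, labels):
    # collect the true labels that ended up in each cluster
    members = collections.defaultdict(list)
    for cluster, label in zip(clusters, labels):
        members[cluster].append(label)
    # purity of one cluster: share of its most common label
    purities = [collections.Counter(ls).most_common(1)[0][1] / len(ls)
                for ls in members.values()]
    return sum(purities) / len(purities)

# hypothetical assignments for five documents
toy_clusters = [0, 0, 0, 1, 1]
toy_labels = ["hardware", "hardware", "baseball", "baseball", "baseball"]
print(average_purity(toy_clusters, toy_labels))

# swapping the cluster ids does not change the score
print(average_purity([1, 1, 1, 0, 0], toy_labels))
```

- Cluster 0 is 2/3 pure and cluster 1 is fully pure, so the average is 5/6. One caveat: purity rewards many small clusters, and if every document were its own cluster the score would be 1.0.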

- That didn't work as well as we'd like it to.
- It's time to introduce better features than just word frequencies.
    - TF-IDF to the rescue!

- What happens if we use tfidf instead of just counts or frequencies?

In [None]:
# show corpus in 2d

#X = docs2matrix(newsgroups_train)
vectorizer = sklearn.feature_extraction.text.TfidfVectorizer()
# convert to a regular numpy array (recent scikit-learn versions reject np.matrix input)
X = vectorizer.fit_transform(newsgroups_train).toarray()
print("Created a matrix with shape:",X.shape)
tsne = sklearn.manifold.TSNE(n_components=2, random_state=1)
X_2d = tsne.fit_transform(X)
colors = ['r', 'b']
target_ids = range(len(categories))
for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[newsgroups_labels == target, 0], X_2d[newsgroups_labels == target, 1], c=c, label=label)
# show which color belongs to which newsgroup
plt.legend()
plt.show()

- These groups appear to have a bit more separation.
- How well can kmeans recover the original groups now?

In [None]:
# Do kmeans clustering with TF-IDF matrix

kmeans = sklearn.cluster.KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)
clusters = kmeans.labels_

for target, c, label in zip(target_ids, colors, categories):
    plt.scatter(X_2d[clusters == target, 0], X_2d[clusters == target, 1], c=c, label=label)
    
avg_purity = compute_average_purity(clusters, newsgroups_labels)

In [None]:
#@title Setup (click the "run" button to the left) {display-mode: "form"}

## Setup ##

# imports

# built-in Python libraries
# -------------------------
import collections
import re
import string
import warnings
warnings.filterwarnings('ignore')

# 3rd party libraries
# -------------------

# Natural Language Toolkit (https://www.nltk.org/)
import nltk

# download punctuation related NLTK functions
# (needed for sent_tokenize())
nltk.download('punkt')
# download NLTK part-of-speech tagger
# (needed for pos_tag())
nltk.download('averaged_perceptron_tagger')
# download wordnet
# (needed for lemmatization)
nltk.download('wordnet')
# download stopword lists
# (needed for stopword removal)
nltk.download('stopwords')
# dictionary of English words
nltk.download('words')

# numpy: matrix library for Python
import numpy as np

!pip install -U gensim

# Gensim for topic modeling
import gensim
# for loading data
import sklearn.datasets
# for LDA visualization
!pip install pyLDAvis
import pyLDAvis
import pyLDAvis.gensim_models

# for uploading data files
from google.colab import files

# downloading values lexicon
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/lexicon_1_0/values_lexicon.txt
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/sample_data/subreddits/christian_500.txt
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/sample_data/subreddits/business_500.txt
!wget https://raw.githubusercontent.com/steve-wilson/values_lexicon/master/sample_data/subreddits/college_500.txt

def text_to_lemma_frequencies(text, remove_stop_words=True):
    
    # split document into sentences
    sentences = nltk.sent_tokenize(text)
    
    # create a place to store (word, pos_tag) tuples
    words_and_pos_tags = []
    
    # get all words and pos tags
    for sentence in sentences:
        words_and_pos_tags += nltk.pos_tag(nltk.word_tokenize(sentence))
        
    # load the lemmatizer
    lemmatizer = nltk.stem.WordNetLemmatizer()
    
    # lemmatize the words
    lemmas = [lemmatizer.lemmatize(word,lookup_pos(pos)) for \
              (word,pos) in words_and_pos_tags]
    
    # convert to lowercase
    lowercase_lemmas = [lemma.lower() for lemma in lemmas]
    
    # load the stopword list for English
    stop_words = set([])
    if remove_stop_words:
        stop_words = set(nltk.corpus.stopwords.words('english'))
    
    # add punctuation to the set of things to remove
    all_removal_tokens = stop_words | set(string.punctuation)
    
    # bonus: also add some custom double-quote tokens to this set
    all_removal_tokens |= set(["''","``"])
    
    # only get lemmas that aren't in these lists
    content_lemmas = [lemma for lemma in lowercase_lemmas \
                      if lemma not in all_removal_tokens and \
                      re.match(r"^\w+$",lemma)]
    
    # return the frequency distribution object
    return nltk.probability.FreqDist(content_lemmas)
    
def docs2matrix(document_list):
    
    # use the vocab2index idea from before
    vocab2index = {}
    latest_index = 0

    lfs = []
    for doc in document_list:
        # text_to_lemma_frequencies() removes NLTK stop words and punctuation
        # itself; its second argument is only a True/False switch
        lf = text_to_lemma_frequencies(doc, remove_stop_words=True)
        for token in lf.keys():
            if token not in vocab2index:
                vocab2index[token] = latest_index
                latest_index += 1
                
        lfs.append(lf)
    
    # create the zeros matrix
    corpus_matrix = np.zeros((len(lfs), len(vocab2index)))
    
    for row, lf in enumerate(lfs):
        for token, frequency in lf.items():
            column = vocab2index[token]
            corpus_matrix[row][column] = frequency
    
    return corpus_matrix, vocab2index

    
# Lemmatization -- redefining this here to make
# code block more self-contained
def lookup_pos(pos):
    pos_first_char = pos[0].lower()
    if pos_first_char in 'nv':
        return pos_first_char
    else:
        return 'n'


            
print()
print("Done with setup!")
print("If you'd like, you can click the (X) button to the left to clear this output.")

### Topic Modeling

- Now that we have some real data, what are some ways that we can explore what's in it?
    - How can we answer the basic question: *What are people talking about in this corpus?*

- Load a corpus matrix, like the ones we created earlier, into gensim's corpus object:

In [None]:
# this time, let's load all documents in the 20news dataset from these categories
categories = ['soc.religion.christian', 'rec.autos', 'talk.politics.misc', \
              'rec.sport.baseball', 'comp.sys.ibm.pc.hardware']
newsgroups_train_all = sklearn.datasets.fetch_20newsgroups(subset='train', \
                                              categories=categories).data
# using the function we wrote before, but modified to also return the vocab2index
corpus_matrix, word2id = docs2matrix(newsgroups_train_all)
# reverse this dictionary
id2word = {v:k for k,v in word2id.items()}

corpus = gensim.matutils.Dense2Corpus(corpus_matrix, documents_columns=False)
print("Loaded",len(corpus),"documents into a Gensim corpus.")
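
- Gensim represents each document as a sparse bag-of-words: a list of `(token_id, count)` pairs that omits zero counts. The sketch below shows that conversion by hand for one row; the helper `row_to_bow`, the vocabulary, and the counts are hypothetical, and this only mimics the format rather than calling gensim.

```python
import numpy as np

def row_to_bow(row):
    # keep only the columns with a non-zero count
    return [(int(i), float(row[i])) for i in np.nonzero(row)[0]]

# hypothetical vocabulary and one document row
toy_id2word = {0: "game", 1: "disk", 2: "pitcher", 3: "drive"}
toy_row = np.array([2.0, 0.0, 1.0, 0.0])

bow = row_to_bow(toy_row)
print(bow)
print([(toy_id2word[i], count) for i, count in bow])
```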

- Given this, we can run LDA right out of the box:

In [None]:
# As of July 2019, gensim calls a deprecated numpy function and gives lots of warning messages
# Let's suppress these.
warnings.filterwarnings('ignore')

# run LDA on our corpus, using our dictionary (k=6)
lda = gensim.models.LdaModel(corpus, id2word=id2word, num_topics=6)
lda.print_topics()

- There is still quite a bit of noise in this list because the documents are full of very common words like "write", "subject", and "from".
- One common approach is to remove the most (and possibly least) common words before running LDA.

In [None]:
total_counts = np.sum(corpus_matrix, axis=0)
sorted_words = sorted( zip( range(len(total_counts)) ,total_counts), \
                       key=lambda x:x[1], reverse=True )
N = 100
M = 50
top_N_ids = [item[0] for item in sorted_words[:N]]
appears_less_than_M_times = [item[0] for item in sorted_words if item[1] < M]
vocab_dense = [id2word[idx] for idx in range(len(id2word))]

print("Top words to remove:", ' '.join([id2word[idx] for idx in top_N_ids]))

# de-duplicate so that a word in both lists is only removed once
remove_indexes = sorted(set(top_N_ids+appears_less_than_M_times))
corpus_matrix_filtered = np.delete(corpus_matrix,remove_indexes,1)

for index in sorted(remove_indexes, reverse=True):
    del vocab_dense[index]

id2word_filtered = {}
word2id_filtered = {}

for i,word in enumerate(vocab_dense):
    id2word_filtered[i] = word
    word2id_filtered[word] = i
    
corpus_filtered = gensim.matutils.Dense2Corpus(corpus_matrix_filtered, documents_columns=False)

print("Original matrix shape:",corpus_matrix.shape)
print("New matrix shape:",corpus_matrix_filtered.shape)
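
- The same filtering idea on a toy matrix, as a sketch: drop the single most frequent column and every column whose total count is below a threshold, then rebuild the vocabulary so that column indices still line up. All names and numbers here are illustrative assumptions.

```python
import numpy as np

# hypothetical counts; columns follow toy_vocab
toy_vocab = ["write", "game", "drive", "zzz"]
toy_matrix = np.array([[5, 2, 0, 0],
                       [4, 0, 3, 1],
                       [6, 1, 2, 0]])

totals = toy_matrix.sum(axis=0)
top_n, min_count = 1, 2

# ids of the top_n most frequent terms and of terms seen fewer than min_count times
most_frequent = np.argsort(totals)[::-1][:top_n]
too_rare = np.nonzero(totals < min_count)[0]
remove = sorted(set(most_frequent.tolist()) | set(too_rare.tolist()))

# remove the columns and the matching vocabulary entries together
filtered_matrix = np.delete(toy_matrix, remove, axis=1)
filtered_vocab = [w for i, w in enumerate(toy_vocab) if i not in remove]

print(filtered_vocab)
print(filtered_matrix)
```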

- Now, run LDA again using this new matrix

In [None]:
lda = gensim.models.LdaModel(corpus_filtered, id2word=id2word_filtered, num_topics=6)
lda.print_topics()

- We can also use this model to get topic probabilities for unseen documents:

In [None]:
unseen_doc = "I went to the baseball game and saw the player hit a homerun !"
unseen_doc_bow = [word2id_filtered.get(word.lower(),-1) for word in unseen_doc.split()]
unseen_doc_vec = np.zeros(len(word2id_filtered))
for word in unseen_doc_bow:
    if word >= 0:
        unseen_doc_vec[word] += 1
unseen_doc_vec = unseen_doc_vec[np.newaxis]
unseen_doc_corpus = gensim.matutils.Dense2Corpus(unseen_doc_vec, documents_columns=False)
vector = lda[unseen_doc_corpus]  # get topic probability distribution for a document
for item in vector:
    print(item)
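
- Each printed item is a list of `(topic_id, probability)` pairs for one document. As a sketch, the most likely topic can be picked with `max`; the numbers below are invented for illustration and are not output from the model above.

```python
# hypothetical topic distribution for one document
toy_topic_dist = [(0, 0.05), (2, 0.71), (4, 0.24)]

# topic with the highest probability
best_topic, best_prob = max(toy_topic_dist, key=lambda pair: pair[1])
print("most likely topic:", best_topic, "with probability", best_prob)

# topics sorted from most to least likely
ranked = sorted(toy_topic_dist, key=lambda pair: pair[1], reverse=True)
print(ranked)
```

- Gensim may leave out topics whose probability is very small, so the list can be shorter than the number of topics.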

- pyLDAvis is a nice tool for visualizing our topics:

In [None]:
pyLDAvis.enable_notebook()

# need to create a gensim dictionary object instead of our
# lightweight dict object - this is what pyLDAvis expects as input
dictionary = gensim.corpora.Dictionary()
dictionary.token2id = word2id_filtered

# visualize the LDA model
vis = pyLDAvis.gensim_models.prepare(lda, corpus_filtered, dictionary)
vis

- [-> Next: Sentiment Analysis](https://colab.research.google.com/github/gordeli/NLP_EDHEC2023/blob/main/05_Sentiment_Analysis.ipynb)