# Latent Dirichlet Allocation (LDA)

The books used here are all in the public domain, and their HTML versions are available from Project Gutenberg at https://www.gutenberg.org/.
We will walk through one example of extracting a book's text with Python. This is not the most efficient way to do it, but it should make the process clear enough for you to try with other books or manuscripts.

### Get the HTML for the Book

We are going to use two libraries for this. One is part of the Python standard library, called urllib.

```python
import urllib
```
The other is a favorite of ours, called Beautiful Soup {cite:p}`BeautifulSoup`.

```python
from bs4 import BeautifulSoup
```

urllib will get the document, and BeautifulSoup makes it easy to parse. 

In [126]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.gutenberg.org/files/55/55-h/55-h.htm" 
html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

Here we remove any CSS (style) and JavaScript (script) elements, so that only the readable text of the book remains.

In [127]:
soup.header

In [128]:
for script in soup(["script", "style"]):
    script.extract()
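
As a small illustration of what this loop does, the sketch below applies the same steps to a short hand-written HTML string. The `sample_html` string and the `sample_*` names are made up for this example and are not part of the book's page.

```python
from bs4 import BeautifulSoup

# hypothetical page, written only for this illustration
sample_html = """
<html>
  <head>
    <style>p { color: green; }</style>
    <script>alert("hello");</script>
  </head>
  <body><p>The road to the City of Emeralds</p></body>
</html>
"""

sample_soup = BeautifulSoup(sample_html, features="html.parser")

# remove every <script> and <style> element from the parse tree
for tag in sample_soup(["script", "style"]):
    tag.extract()

# only the visible paragraph text should be left
sample_text = sample_soup.get_text()
print(sample_text.strip())
```

Without the removal step, `get_text()` would also return the CSS rule and the JavaScript call as if they were part of the story.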

Finally, get the text and add it to our document list. 

In [129]:
text = soup.get_text()
documents = []
documents.append(text)

In [130]:
text[:500]

'\n\n\n\n\nThe Project Gutenberg eBook of The Wonderful Wizard of Oz, by L. Frank Baum\n\n\n\n\nThe Project Gutenberg eBook of The Wonderful Wizard of Oz, by L. Frank Baum\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online\r\nat www.gutenberg.org. If you\r\nare not located in th'

In [131]:
len(text)

233238

We will repeat this process for the other four books. 

In [132]:
other_urls = [
    "https://www.gutenberg.org/files/54/54-h/54-h.htm",
    "https://www.gutenberg.org/files/33361/33361-h/33361-h.htm",
    "https://www.gutenberg.org/files/22566/22566-h/22566-h.htm",
    "https://www.gutenberg.org/files/26624/26624-h/26624-h.htm",
]

for url in other_urls:
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features="html.parser")
    # drop CSS and JavaScript before extracting the text
    for script in soup(["script", "style"]):
        script.extract()
    text = soup.get_text()
    documents.append(text)

In [133]:
len(documents)

5

In [134]:
len(text)

242941

In [135]:
import pandas as pd

### Create Tokens and Vocabulary

Now that we have our books, we need to tokenize the stories by word and then create a vocabulary out of these tokens. sklearn is a fantastic library that we will be using throughout the notebook {cite:p}`sklearn_api`.

In [136]:
documents

['\n\n\n\n\nThe Project Gutenberg eBook of The Wonderful Wizard of Oz, by L. Frank Baum\n\n\n\n\nThe Project Gutenberg eBook of The Wonderful Wizard of Oz, by L. Frank Baum\n\r\nThis eBook is for the use of anyone anywhere in the United States and\r\nmost other parts of the world at no cost and with almost no restrictions\r\nwhatsoever. You may copy it, give it away or re-use it under the terms\r\nof the Project Gutenberg License included with this eBook or online\r\nat www.gutenberg.org. If you\r\nare not located in the United States, you will have to check the laws of the\r\ncountry where you are located before using this eBook.\r\n\nTitle: The Wonderful Wizard of Oz\nAuthor: L. Frank Baum\nRelease Date: February, 1993 [eBook #55]\r\n[Most recently updated: October 19, 2020]\nLanguage: English\nCharacter set encoding: UTF-8\n*** START OF THE PROJECT GUTENBERG EBOOK THE WONDERFUL WIZARD OF OZ ***\n\n\n\nThe Wonderful Wizard of Oz\nby L. Frank Baum\n\r\nThis book is dedicated to my goo

In [137]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
df = cv.fit_transform(documents)
vocab = cv.get_feature_names_out()

In [138]:
vocab

array(['00', '000', '10', ..., 'zoroaster', 'zuz', 'zy'], dtype=object)

Let's take a look at the tokens and the number of occurrences of each token.

In [139]:
print(df.shape)

(5, 9143)


In [140]:
print(df[4])

  (0, 8068)	2870
  (0, 6153)	88
  (0, 3799)	94
  (0, 2714)	12
  (0, 5459)	994
  (0, 9001)	19
  (0, 8982)	40
  (0, 5594)	100
  (0, 1344)	147
  (0, 3400)	9
  (0, 893)	7
  (0, 8101)	245
  (0, 4377)	214
  (0, 3330)	251
  (0, 8580)	28
  (0, 593)	7
  (0, 596)	8
  (0, 4216)	614
  (0, 8508)	10
  (0, 7635)	14
  (0, 551)	1667
  (0, 5169)	34
  (0, 5546)	78
  (0, 5665)	8
  (0, 9034)	22
  :	:
  (0, 4223)	2
  (0, 4870)	2
  (0, 9061)	2
  (0, 2842)	2
  (0, 1404)	1
  (0, 9011)	2
  (0, 2464)	1
  (0, 8294)	1
  (0, 848)	1
  (0, 6727)	1
  (0, 4773)	1
  (0, 1596)	3
  (0, 4975)	1
  (0, 2351)	1
  (0, 5964)	1
  (0, 1079)	1
  (0, 1748)	1
  (0, 3910)	2
  (0, 6098)	1
  (0, 0)	1
  (0, 6004)	1
  (0, 6744)	1
  (0, 5057)	1
  (0, 3985)	1
  (0, 4837)	1


The second number in each pair is the token index, and we use the vocab list to look up the actual word. As an example, look at the first line of the output for the fifth book.

```python
(0, 8068) 2870
```
Token 8068 was used 2870 times, and token 551 was used 1667 times. These two tokens are:

In [141]:
print (vocab[551])
print (vocab[8068])

and
the


In [142]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 5, doc_topic_prior=1)
lda.fit(df)

LatentDirichletAllocation(doc_topic_prior=1, n_components=5)

In [143]:
lda.components_[0]

array([0.20000954, 0.20000966, 0.20000765, ..., 0.2000091 , 0.20000843,
       0.20000782])

In [144]:
import numpy as np 
topic_words = {}
n_top_words = 10
for topic, comp in enumerate(lda.components_):
    #print(topic, comp)
    word_idx = np.argsort(comp)[::-1][:n_top_words] #argsort to get index, and [::-1] to sort in descending
    # store the words most relevant to the topic
    topic_words[topic] = [vocab[i] for i in word_idx]
    # break
    
for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

Topic: 0
  shaken, blocks, rising, council, false, politeness, drying, alas, soldered, scarecrows
Topic: 1
  the, and, to, of, in, you, was, it, that, he
Topic: 2
  shaken, blocks, rising, council, false, politeness, drying, alas, soldered, scarecrows
Topic: 3
  the, to, and, of, in, that, he, tip, you, it
Topic: 4
  shaken, blocks, rising, council, false, politeness, drying, alas, soldered, scarecrows
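
The top-word selection used above relies on `np.argsort`, which can be sketched in isolation. The vocabulary and weights below are made up for illustration; they are not values from the fitted model.

```python
import numpy as np

# made-up vocabulary and topic weights, for illustration only
demo_vocab = np.array(["alas", "dorothy", "false", "oz", "tin"])
demo_weights = np.array([0.2, 5.0, 0.2, 3.1, 1.4])

n_top = 3
ascending = np.argsort(demo_weights)   # indices from smallest to largest weight
top_idx = ascending[::-1][:n_top]      # reverse, then keep the first n_top
top_words = [str(w) for w in demo_vocab[top_idx]]
print(top_words)
```

When nearly all weights in a topic are equal, as with the values close to 0.2 in `lda.components_[0]`, the ranking among those words carries little meaning. A likely reason is that 0.2 matches scikit-learn's default topic-word prior of `1 / n_components`, which would mean the topic received almost no word counts.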


In [145]:
len(lda.components_)

5

Looking at this, we do not get a clear picture of the topics: the lists are dominated by common words such as "the" and "and". This time, let's remove these stop words and see how important 🧼 cleaning the data 🧼 can be!

In [146]:
from sklearn.feature_extraction.text import CountVectorizer

# we can add this to the tokenization step
cv = CountVectorizer(stop_words='english')
df = cv.fit_transform(documents)
vocab = cv.get_feature_names_out()



In [147]:
from sklearn.decomposition import LatentDirichletAllocation
lda = LatentDirichletAllocation(n_components = 4, doc_topic_prior=1)
lda.fit(df)

LatentDirichletAllocation(doc_topic_prior=1, n_components=4)

In [148]:
topic_words = {}
n_top_words = 20
for topic, comp in enumerate(lda.components_):
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    # store the words most relevant to the topic
    topic_words[topic] = [vocab[i] for i in word_idx]
    
for topic, words in topic_words.items():
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

Topic: 0
  said, scarecrow, dorothy, woodman, tip, tin, saw, oz, city, horse, asked, gutenberg, head, good, great, lion, little, project, jack, witch
Topic: 1
  dorothy, said, pg, man, little, king, ozma, wizard, asked, shaggy, gutenberg, oz, girl, project, good, like, time, don, bright, know
Topic: 2
  wagged, dancing, favor, hairy, blossoms, entertainment, magician, favorite, sorts, queerest, sorrowful, hello, dearly, mat, arrive, smelled, cannon, advise, traveling, countless
Topic: 3
  wagged, dancing, favor, hairy, blossoms, entertainment, magician, favorite, sorts, queerest, sorrowful, hello, dearly, mat, arrive, smelled, cannon, advise, traveling, countless


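Much better! Topics 0 and 1 now surface character names such as dorothy, scarecrow, and ozma, although topics 2 and 3 are identical and contain only rare words.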

In [149]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer,HashingVectorizer
from sklearn.decomposition import LatentDirichletAllocation


In [150]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [153]:
sub_df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks_KDM/CSEE5590-KDM/kdm_project/nsf_funding1.csv', encoding = 'latin-1')

In [154]:
sub_df['Synopsis']


0      Using the Rules of Life to Address Societal Ch...
1      Biology has transformed science over the last ...
2                                                    NaN
3                                                    NaN
4                                                    NaN
                             ...                        
672    The aim of the PAC program is to support empir...
673    The National Facilities program supports the o...
674    The Social Psychology Program at NSF supports ...
675    The Sociology Program supports basic research ...
676    The Linguistics Program supports basic science...
Name: Synopsis, Length: 677, dtype: object

In [155]:

tfidf_vect = TfidfVectorizer(stop_words = 'english')
tfidf_tokens = tfidf_vect.fit_transform(sub_df['Synopsis'].values.astype('U'))
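
One side effect of `.values.astype('U')` is worth seeing on its own: missing synopses are turned into the literal text `nan`, which may be why `nan` shows up as a token later. The sketch below uses a made-up two-row column and shows `fillna('')` as one possible alternative, not necessarily the approach the notebook intends.

```python
import numpy as np
import pandas as pd

# made-up column with one missing value, for illustration only
toy_synopsis = pd.Series(["Supports basic research", np.nan], dtype=object)

as_unicode = toy_synopsis.values.astype('U')   # NaN becomes the text 'nan'
filled = toy_synopsis.fillna('').values        # NaN becomes an empty string

print(list(as_unicode))
print(list(filled))
```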


In [156]:
tfidf_vocab = tfidf_vect.get_feature_names_out()

In [158]:
print(tfidf_tokens[0])

In [159]:
lda = LatentDirichletAllocation(n_components = 5, doc_topic_prior=1, random_state=0)
lda.fit(tfidf_tokens)

LatentDirichletAllocation(doc_topic_prior=1, n_components=5, random_state=0)

In [161]:
import string

import nltk
import spacy
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from spacy.lang.en import English

nltk.download('stopwords')
nlp = spacy.load('en_core_web_sm')
parser = English()

# Stop words and special characters
STOPLIST = set(stopwords.words('english') + list(ENGLISH_STOP_WORDS))
SYMBOLS = list(string.punctuation) + ["-", "...", "“", "”", "''"]



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Apply Cleaning and Lemmatization

In [162]:
# Data cleaner and tokenizer
def tokenizeText(text):
    # normalize whitespace and case
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()

    # parse with spaCy and keep each token's lemma
    tokens = [tok.lemma_ for tok in nlp(text)]

    # remove stop words, special characters, and very short tokens
    tokens = [tok for tok in tokens if tok.lower() not in STOPLIST]
    tokens = [tok for tok in tokens if tok not in SYMBOLS]
    tokens = [tok for tok in tokens if len(tok) >= 3]

    # remove remaining tokens that are not alphabetic
    tokens = [tok for tok in tokens if tok.isalpha()]

    # optional stemming (disabled)
    # porter = PorterStemmer()
    # tokens = [porter.stem(word) for word in tokens]

    # deduplicate; comment this out to keep word order and repeated words
    tokens = list(set(tokens))

    return ' '.join(tokens)
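
The filtering steps can be tried without spaCy on a simplified stand-in. The sketch below is not the notebook's tokenizer: it skips lemmatization, splits on whitespace, and uses a tiny hypothetical stop list (`DEMO_STOPLIST`). It also shows an order-preserving way to remove duplicates, since `list(set(...))` returns tokens in arbitrary order.

```python
import string

# tiny hypothetical stop list; the notebook's STOPLIST is far larger
DEMO_STOPLIST = {"the", "and", "to"}

def simple_clean(text, keep_order=True):
    # lower-case, split on whitespace, and trim punctuation from each token
    tokens = [tok.strip(string.punctuation) for tok in text.lower().split()]
    # same filters as tokenizeText: stop words, short tokens, non-alphabetic tokens
    tokens = [tok for tok in tokens if tok not in DEMO_STOPLIST]
    tokens = [tok for tok in tokens if len(tok) >= 3]
    tokens = [tok for tok in tokens if tok.isalpha()]
    if keep_order:
        # dict.fromkeys removes duplicates but keeps first-seen order
        tokens = list(dict.fromkeys(tokens))
    else:
        # sorted set: duplicates removed, alphabetical order
        tokens = sorted(set(tokens))
    return " ".join(tokens)

print(simple_clean("The Lion and the lion ran 42 km to Oz!"))
print(simple_clean("tin woodman and tin scarecrow", keep_order=False))
```

Note that removing duplicates discards word frequencies, so any count or TF-IDF vectorizer applied afterwards sees each word at most once per document.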

In [186]:
# Data cleaning
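# remove digits; r'\d+' is an assumed reading of the pattern (a bare 'D+' strips capital D's instead)
sub_df['Synopsis'] = sub_df['Synopsis'].astype(str).str.replace(r'\d+', '', regex=True)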
sub_df['text_tokenized'] = sub_df['Synopsis'].apply(lambda x:tokenizeText(x))

  


0      inclusion asset investment establish future gl...
1      organization traorm depth sustainable inspire ...
2                                                    nan
3                                                    nan
4                                                    nan
                             ...                        
672    gain material clinical path science irectorate...
673    public specialized operation scientist high ex...
674    ethnicity consider defense clinical doctoral d...
675    illustration specialized mid organization vita...
676    site phonology forâ boundary methodological co...
Name: text_tokenized, Length: 677, dtype: object

In [187]:
sub = sub_df['text_tokenized']

Next, we clean the tokenized text by removing the tokens nsf and nan.

In [190]:
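# remove the standalone tokens 'nsf' and 'nan'; str.replace returns a new Series, so keep the result
sub = sub.str.replace(r'\b(?:nsf|nan)\b', '', regex=True)
sub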

0      inclusion asset investment establish future gl...
1      organization transform depth sustainable inspi...
2                                                       
3                                                       
4                                                       
                             ...                        
672    gain material clinical path science irectorate...
673    public specialized operation scientist high ex...
674    ethnicity consider defense clinical doctoral d...
675    illustration specialized mid organization vita...
676    site phonology forâ boundary methodological co...
Name: text_tokenized, Length: 677, dtype: object
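
Two details of `str.replace` are easy to miss: it returns a new Series rather than changing the original, and a plain pattern such as `nsf` also matches inside longer words (which would explain a token like `traorm` in place of `transform`). The sketch below compares plain substring removal with a whole-word pattern on made-up rows; the `demo` data and variable names are assumptions for illustration.

```python
import pandas as pd

# made-up rows for illustration only
demo = pd.Series(["nsf transform program", "nan"])

# plain substring removal also cuts "nsf" out of "transform"
naive = demo.str.replace("nsf", "", regex=False)

# \b marks word boundaries, so only the standalone tokens are removed
whole_word = demo.str.replace(r"\b(?:nsf|nan)\b", "", regex=True).str.strip()

print(naive[0])
print(whole_word[0])
```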

In [191]:
topic_words = {}
n_top_words = 15
for topic, comp in enumerate(lda.components_):
    word_idx = np.argsort(comp)[::-1][:n_top_words]
    # store the words most relevant to the topic
    topic_words[topic] = [tfidf_vocab[i] for i in word_idx]

for topic, words in topic_words.items():
    # drop the token 'nsf' from the printed list
    if 'nsf' in words:
        words.remove('nsf')
    print('Topic: %d' % topic)
    print('  %s' % ', '.join(words))

Topic: 0
  earth, physics, basic, language, doctoral, astrophysics, mid, evolutionary, facility, particle, ecosystem, evolution, plasma, cultural, population
Topic: 1
  materials, matter, dmr, ecr, solid, fiscal, lsamp, geochemical, properties, governed, physics, qis, surface, accepted, geochemistry
Topic: 2
  nan, erc, cr, outlast, ewd, logical, forthcoming, gi, placing, measured, publicizes, lectures, sharply, distributions, lecturer
Topic: 3
  research, program, proposals, science, engineering, systems, projects, support, stem, data, solicitation, sciences, education, biological
Topic: 4
  serving, facilities, fellowships, colleges, fellows, minority, universities, native, epscor, fellowship, ear, disabilities, ags, persons, msis
