<h3>Book reviews and how to fake them</h3>

- Find an interesting public domain book (https://www.gutenberg.org/browse/scores/top#books-last30) and download it as plain text.


- Use Python to massage the data into a format suitable for the Latent Dirichlet Allocation (LDA) model in scikit-learn. This includes removing stop words and punctuation. Some ideas for how to do this can be found <a href="https://stackoverflow.com/questions/5486337/how-to-remove-stop-words-using-nltk-or-python">here</a>.


- Break the book up into small sections. The most appropriate level might vary between books, but you will most likely be breaking the book up into either paragraphs or chapters. (This might also be a pragmatic decision based on whatever’s easiest.)


- Train an LDA model on the corpus. The LDA model should find interesting topics that occur at the paragraph (or chapter) level. Be sure to explain your choice for any parameters that might significantly affect the model results.


- Print out the first ten words of the ten most common topics.

In [146]:
sum(np.random.dirichlet([1,2,3]))

0.9999999999999999
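
The sum above is 0.9999999999999999 rather than exactly 1 because of floating-point rounding; a Dirichlet sample is a probability vector, so its components sum to 1 mathematically. A minimal standard-library sketch of the idea, assuming the usual construction of a Dirichlet draw as normalized Gamma draws (`sample_dirichlet` is a name made up for this illustration, not part of NumPy):

```python
import math
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # seed chosen arbitrarily so the run is repeatable
sample = sample_dirichlet([1, 2, 3], rng)
print(sample)
# Compare with a tolerance instead of ==, since the sum may be off by one rounding step.
print(math.isclose(sum(sample), 1.0))
```

In LDA, both the per-document topic mixture and the per-topic word distribution are vectors of this kind.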

In [1]:
#import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

In [54]:
#import data
with open('metamorphosis.txt', 'r') as myfile:
    collection = myfile.read().split("\n\n") #\n\n denotes the double linebreak used for paragraph separation
    
#replace "\n" inside each paragraph with a space, trim whitespace, and drop empty entries
collection = [p.replace('\n', ' ').strip() for p in collection if len(p) > 1]

print(f"Number of paragraphs: {len(collection)}\n")
print(f"First paragraph:\n {collection[0]}\n")
print(f"Last paragraph:\n {collection[-1]}")


Number of paragraphs: 99

First paragraph:
 One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin.  He lay on his armour-like back, and if he lifted his head a little he could see his brown belly, slightly domed and divided by arches into stiff sections.  The bedding was hardly able to cover it and seemed ready to slide off any moment.  His many legs, pitifully thin compared with the size of the rest of him, waved about helplessly as he looked.

Last paragraph:
 After that, the three of them left the flat together, which was something they had not done for months, and took the tram out to the open country outside the town.  They had the tram, filled with warm sunshine, all to themselves.  Leant back comfortably on their seats, they discussed their prospects and found that on closer examination they were not at all bad - until then they had never asked each other about their work but all three had jobs which were very go
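
Splitting on a literal "\n\n" works for this file, but it may miss paragraph breaks if the text uses Windows line endings or if the "blank" lines contain stray spaces. A slightly more tolerant sketch, assuming paragraphs are separated by one or more blank lines (`split_paragraphs` and `sample_text` are hypothetical names used only here):

```python
import re

def split_paragraphs(text):
    """Split text on runs of blank lines and collapse line breaks inside paragraphs."""
    # Normalize Windows line endings so "\r\n\r\n" is treated like "\n\n".
    text = text.replace("\r\n", "\n")
    # A paragraph break is a newline, optional whitespace, then another newline.
    chunks = re.split(r"\n\s*\n", text)
    # Join wrapped lines with single spaces and drop empty chunks.
    return [" ".join(chunk.split()) for chunk in chunks if chunk.strip()]

sample_text = "One morning,\nhe woke.\r\n\r\nHe lay on\nhis back.\n \n\nLast one."
paragraphs = split_paragraphs(sample_text)
print(paragraphs)
```

Unlike the cell above, this version also collapses the double spaces after full stops, which makes no difference once the text is tokenized.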

In [71]:
#import libraries for cleaning and preprocessing
import nltk
nltk.download("stopwords")
nltk.download("wordnet")
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
import string


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/hueyninglok/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/hueyninglok/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [79]:
#cleaning and preprocessing
stop = set(stopwords.words('english'))
exclude = set(string.punctuation) 

def lemmatize_stemming(text):
    return PorterStemmer().stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def clean(doc):
    #remove stopwords
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    #remove punctuation
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    #lemmatize and stem
    lemma_stem = " ".join(lemmatize_stemming(word) for word in punc_free.split())
    return lemma_stem

coll_clean = [clean(doc).split() for doc in collection]

In [81]:
print(f"Cleaned first paragraph:\n {coll_clean[0]}\n")
print(f"Cleaned last paragraph:\n {coll_clean[-1]}")

Cleaned first paragraph:
 ['one', 'morn', 'gregor', 'samsa', 'wake', 'troubl', 'dream', 'find', 'transform', 'bed', 'horribl', 'vermin', 'lay', 'armourlik', 'back', 'lift', 'head', 'littl', 'could', 'see', 'brown', 'belli', 'slightli', 'dome', 'divid', 'arch', 'stiff', 'section', 'bed', 'hardli', 'abl', 'cover', 'seem', 'readi', 'slide', 'moment', 'mani', 'leg', 'piti', 'thin', 'compar', 'size', 'rest', 'him', 'wave', 'helplessli', 'look']

Cleaned last paragraph:
 ['that', 'three', 'leav', 'flat', 'togeth', 'someth', 'do', 'month', 'take', 'tram', 'open', 'countri', 'outsid', 'town', 'tram', 'fill', 'warm', 'sunshin', 'themselv', 'lean', 'back', 'comfort', 'seat', 'discuss', 'prospect', 'find', 'closer', 'examin', 'bad', 'never', 'ask', 'work', 'three', 'job', 'good', 'hold', 'particularli', 'good', 'promis', 'futur', 'greatest', 'improv', 'time', 'be', 'cours', 'would', 'achiev', 'quit', 'easili', 'move', 'hous', 'need', 'flat', 'smaller', 'cheaper', 'current', 'one', 'choos', 'grego
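
The cleaned paragraphs still contain stop words such as 'him' and 'that'. This happens because `clean` filters stop words before stripping punctuation, so a token like "him," does not match "him" in the stop list. A minimal sketch of the reverse order, using a tiny made-up stop list in place of NLTK's (`STOP_WORDS` and `tokenize` are illustrative names, and stemming is left out):

```python
import string

# Tiny illustrative stop list; the notebook uses NLTK's English list instead.
STOP_WORDS = {"the", "of", "him", "that", "it", "he", "his", "and"}

def tokenize(doc, stop_words=STOP_WORDS):
    """Lowercase, strip punctuation, then drop stop words."""
    table = str.maketrans("", "", string.punctuation)
    words = doc.lower().translate(table).split()
    return [w for w in words if w not in stop_words]

print(tokenize("The rest of him, waved about helplessly."))
```

Applying this order in the notebook would change the dictionary size and the topics reported further down, so the outputs shown here correspond to the original order.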

In [137]:
# Importing Gensim
import gensim

#create dictionary of words in the clean collection and their integer ids
dictionary = gensim.corpora.Dictionary(coll_clean)
print(f"Number of words in dictionary: {len(dictionary)}\n")

#filter dictionary
print("Filtering process: Remove tokens that appear in less than 5 documents, or more than 50% of the documents.\n")
dictionary.filter_extremes(no_below=5, no_above=0.5)

print(f"Number of words in dictionary post-filter: {len(dictionary)}\n")
print(f"First 10 terms in filtered dictionary:")
#peek at the first terms in the dict (the loop below prints ids 0 through 10)
i = 0
for k, v in dictionary.iteritems():
    print(k,v)
    i += 1
    if i > 10: break



Number of words in dictionary: 1896

Filtering process: Remove tokens that appear in less than 5 documents, or more than 50% of the documents.

Number of words in dictionary post-filter: 451

First 10 terms in filtered dictionary:
0 abl
1 back
2 bed
3 cover
4 find
5 hardli
6 head
7 him
8 lay
9 leg
10 lift
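
To make the filtering step concrete, here is a rough standard-library sketch of what I understand `filter_extremes` to do with `no_below` and `no_above`: count how many documents each token appears in, then keep only tokens inside the allowed range. It ignores the `keep_n` cap, and `filter_by_document_frequency` is a hypothetical helper, not a gensim function.

```python
from collections import Counter

def filter_by_document_frequency(docs, no_below, no_above):
    """Keep tokens found in at least `no_below` docs and at most a `no_above` fraction of docs."""
    # Count each token once per document, however often it occurs there.
    doc_freq = Counter(token for doc in docs for token in set(doc))
    max_docs = int(no_above * len(docs))
    return {tok for tok, df in doc_freq.items() if no_below <= df <= max_docs}

docs = [
    ["door", "open", "father"],
    ["door", "sister"],
    ["door", "father", "sleep"],
    ["door", "sister", "father"],
]
kept = filter_by_document_frequency(docs, no_below=2, no_above=0.75)
print(sorted(kept))
```

With only 99 paragraphs, `no_below=5` is a fairly aggressive cut, which is consistent with the dictionary shrinking from 1896 to 451 terms.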


In [138]:
#create word frequency dictionary for each document in the collection
#and store in bag-of-words format
bow_corpus = [dictionary.doc2bow(doc) for doc in coll_clean]

#select the bag of words for the paragraph at index 10
bow_p10 = bow_corpus[10]

#preview the counts for the first 10 (word id, count) pairs in that paragraph
for word_id, count in bow_p10[:10]:
    print(f"Word {word_id} ({dictionary[word_id]}) appears {count} time(s).")
    

Word 2 (bed) appears 3 time(s).
Word 6 (head) appears 4 time(s).
Word 50 (get) appears 2 time(s).
Word 63 (quit) appears 1 time(s).
Word 72 (tri) appears 1 time(s).
Word 73 (turn) appears 1 time(s).
Word 78 (becom) appears 1 time(s).
Word 79 (better) appears 1 time(s).
Word 97 (push) appears 1 time(s).
Word 100 (slowli) appears 1 time(s).
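
The bag-of-words format is just a list of (word id, count) pairs per document, with out-of-dictionary tokens dropped. A small sketch of the same idea without gensim, assuming a plain dict from token to id (`to_bow` and the token list are made up for illustration; the three ids are copied from the output above):

```python
from collections import Counter

def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs, skipping tokens missing from the vocabulary."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())

token2id = {"bed": 2, "head": 6, "get": 50}  # ids copied from the output above
tokens = ["head", "bed", "get", "bed", "vermin", "head", "bed"]
print(to_bow(tokens, token2id))
```

Word order is discarded at this point, which is why LDA only sees which words co-occur within a paragraph.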


In [139]:
# Creating the object for LDA model using gensim library
Lda = gensim.models.ldamodel.LdaModel

# Running and training LDA model on the bow matrix.
ldamodel = Lda(bow_corpus, num_topics=10, id2word = dictionary, passes=50)


In [140]:
for idx, topic in ldamodel.print_topics():
    print(f'Topic: {idx} \nWords: {topic}\n')

Topic: 0 
Words: 0.017*"right" + 0.017*"felt" + 0.016*"feel" + 0.015*"tri" + 0.015*"back" + 0.014*"make" + 0.014*"look" + 0.013*"part" + 0.013*"hard" + 0.013*"get"

Topic: 1 
Words: 0.022*"go" + 0.022*"door" + 0.021*"say" + 0.018*"thing" + 0.017*"it" + 0.016*"first" + 0.015*"home" + 0.015*"way" + 0.014*"father" + 0.014*"make"

Topic: 2 
Words: 0.035*"door" + 0.031*"open" + 0.025*"even" + 0.024*"famili" + 0.018*"one" + 0.016*"father" + 0.015*"much" + 0.014*"come" + 0.014*"sister" + 0.014*"never"

Topic: 3 
Words: 0.023*"long" + 0.023*"go" + 0.023*"sleep" + 0.022*"even" + 0.018*"alarm" + 0.018*"boss" + 0.018*"train" + 0.018*"half" + 0.018*"that" + 0.018*"see"

Topic: 4 
Words: 0.025*"father" + 0.022*"mother" + 0.021*"say" + 0.021*"grete" + 0.019*"back" + 0.015*"play" + 0.015*"gentlemen" + 0.015*"get" + 0.014*"head" + 0.014*"call"

Topic: 5 
Words: 0.016*"sister" + 0.015*"even" + 0.014*"go" + 0.014*"father" + 0.013*"back" + 0.013*"look" + 0.013*"come" + 0.013*"mr" + 0.012*"samsa" + 0.012*
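
Each topic line above is a formatted string of weight*"word" terms, which is awkward to sort or tabulate. A minimal sketch for turning such a string into (word, weight) pairs with a regular expression, assuming the format shown in the output (`parse_topic` is a hypothetical helper; the sample string is the first five terms of Topic 0):

```python
import re

def parse_topic(topic_string):
    """Turn '0.017*"right" + 0.017*"felt"' into [('right', 0.017), ('felt', 0.017)]."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_0 = '0.017*"right" + 0.017*"felt" + 0.016*"feel" + 0.015*"tri" + 0.015*"back"'
for word, weight in parse_topic(topic_0):
    print(f"{word:<6} {weight:.3f}")
```

The weights within a topic are close to each other here, which suggests the topics are not sharply separated for a text this short.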