# Project II: Topic Classification 

In this project, I perform topic classification on the book Siddhartha by Hermann Hesse, obtained from Project Gutenberg.

In [1]:
# Necessary pip install of external libraries
# !pip install gutenberg
# !pip install gutenberg-cleaner
# !pip install pyLDAvis

In [2]:
# Ignore warnings
import warnings
warnings.filterwarnings("ignore")

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import urllib
import urllib.request
import pyLDAvis.sklearn

# NLP preprocessing libraries
import nltk
nltk.download('punkt')
import gensim
import re
import string
import unicodedata
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.preprocessing import remove_stopwords

# ML libraries
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import plot_confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.decomposition import LatentDirichletAllocation, NMF, TruncatedSVD

# gutenberg cleaning libraries
from gutenberg.cleanup import strip_headers
from gutenberg_cleaner import super_cleaner

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## Data Collection and Preprocessing

In [5]:
url = "https://www.gutenberg.org/cache/epub/2500/pg2500.txt"

response = urllib.request.urlopen(url)
raw = response.read()
text = raw.decode("utf-8-sig")

In [6]:
print(text)

The Project Gutenberg eBook of Siddhartha, by Herman Hesse

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: Siddhartha

Author: Herman Hesse

Translator: Gunther Olesch, Anke Dreher, Amy Coulter, Stefan Langer and Semyon Chaichenets

Release Date: February, 2001 [eBook #2500]
[Most recently updated: December 22, 2021]

Language: English


Produced by: Michael Pullen, Chandra Yenco and Isaac Jones

*** START OF THE PROJECT GUTENBERG EBOOK SIDDHARTHA ***




Siddhartha

An Indian Tale


by Herman Hesse




Contents


 FIRST PART
 THE SON OF THE BRAHM

As the first print-out shows, the ebook contains many sections that are unnecessary for modelling: the Project Gutenberg header and footer, as well as the book title and chapter titles.

To reduce the manual cleaning work, I use the libraries [gutenberg](https://github.com/ageitgey/Gutenberg) and [gutenberg_cleaner](https://github.com/kiasar/gutenberg_cleaner) to strip the header and footer and to remove the chapter titles and book title.
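For reference, the header and footer removal can also be approximated with the standard library by slicing between the `*** START OF ...` and `*** END OF ...` marker lines that Project Gutenberg files carry. This is only a minimal sketch under that assumption: `strip_gutenberg_markers` is a hypothetical helper, not part of either library, and it does not remove the table of contents or chapter titles.

```python
import re

# Marker lines such as "*** START OF THE PROJECT GUTENBERG EBOOK SIDDHARTHA ***"
START_RE = re.compile(r"^\*\*\* ?START OF .*\*\*\*[ \t]*$", re.MULTILINE)
END_RE = re.compile(r"^\*\*\* ?END OF .*\*\*\*[ \t]*$", re.MULTILINE)

def strip_gutenberg_markers(raw_text):
    """Return the text between the START and END marker lines, if present."""
    start = START_RE.search(raw_text)
    end = END_RE.search(raw_text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(raw_text)
    return raw_text[begin:finish].strip()

# Made-up miniature file that mimics the layout of a Gutenberg text
sample = (
    "The Project Gutenberg eBook of Siddhartha\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SIDDHARTHA ***\n"
    "\n"
    "Siddhartha\n"
    "An Indian Tale\n"
    "\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SIDDHARTHA ***\n"
    "License text...\n"
)
body = strip_gutenberg_markers(sample)
print(body)
```

If a marker is missing, the sketch falls back to the start or end of the string, so text without markers is returned unchanged apart from surrounding whitespace.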

In [7]:
# Remove Project Gutenberg headers and footers
text = strip_headers(text).strip()

In [8]:
print(text)

Siddhartha

An Indian Tale


by Herman Hesse




Contents


 FIRST PART
 THE SON OF THE BRAHMAN
 WITH THE SAMANAS
 GOTAMA
 AWAKENING

 SECOND PART
 KAMALA
 WITH THE CHILDLIKE PEOPLE
 SANSARA
 BY THE RIVER
 THE FERRYMAN
 THE SON
 OM
 GOVINDA




FIRST PART

To Romain Rolland, my dear friend




THE SON OF THE BRAHMAN


In the shade of the house, in the sunshine of the riverbank near the
boats, in the shade of the Sal-wood forest, in the shade of the fig
tree is where Siddhartha grew up, the handsome son of the Brahman, the
young falcon, together with his friend Govinda, son of a Brahman. The
sun tanned his light shoulders by the banks of the river when bathing,
performing the sacred ablutions, the sacred offerings. In the mango
grove, shade poured into his black eyes, when playing as a boy, when
his mother sang, when the sacred offerings were made, when his father,
the scholar, taught him, when the wise men talked. For a long time,
Siddhartha had been partaking in the discussions of the

In [9]:
text = super_cleaner(text)

`super_cleaner` replaces the text it removes with the placeholder `[deleted]`, so we replace every occurrence with an empty string.

In [10]:
text = text.replace("[deleted]", "")
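Removing the `[deleted]` placeholders leaves long runs of blank lines behind, as the next print-out shows. One optional way to tidy them, sketched here with the standard library only (the helper name `collapse_blank_lines` and the sample string are assumptions for illustration), is to collapse each run of three or more newlines into a single paragraph break:

```python
import re

def collapse_blank_lines(text):
    """Drop '[deleted]' placeholders and squeeze runs of blank lines."""
    text = text.replace("[deleted]", "")
    # Three or more consecutive newlines become one paragraph break
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = (
    "[deleted]\n\n[deleted]\n\n\n\n"
    "To Romain Rolland, my dear friend\n\n\n\n\n"
    "In the shade of the house"
)
tidy = collapse_blank_lines(sample)
print(tidy)
```

This step is cosmetic for the pipeline here, since the sentence tokenizer and the later cleaning functions already discard the extra whitespace.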

In [11]:
print(text)



















To Romain Rolland, my dear friend






In the shade of the house, in the sunshine of the riverbank near the
boats, in the shade of the Sal-wood forest, in the shade of the fig
tree is where Siddhartha grew up, the handsome son of the Brahman, the
young falcon, together with his friend Govinda, son of a Brahman. The
sun tanned his light shoulders by the banks of the river when bathing,
performing the sacred ablutions, the sacred offerings. In the mango
grove, shade poured into his black eyes, when playing as a boy, when
his mother sang, when the sacred offerings were made, when his father,
the scholar, taught him, when the wise men talked. For a long time,
Siddhartha had been partaking in the discussions of the wise men,
practising debate with Govinda, practising with Govinda the art of
reflection, the service of meditation. He already knew how to speak the
Om silently, the word of words, to speak it silently into himself while
inhaling, to speak it silently out of hi

In [39]:
# Create functions for cleaning text

def remove_url(text):
  # str.replace would match the pattern literally, so use re.sub with a raw string
  return re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)

def remove_url_2(text):
  return re.sub(r'http\S+', '', text)

def remove_twitter_handles(text):
  return re.sub("@[A-Za-z0-9]+", "", text)

def remove_usernames_links(text):
  # Raw strings avoid invalid-escape warnings for \s
  text = re.sub(r'@[^\s]+', '', text)
  text = re.sub(r'http[^\s]+', '', text)
  return text

def remove_punctuations(text):
  additional_punctuations = ['’', '…'] # punctuations not in string.punctuation
  for punctuation in string.punctuation:
      text = text.replace(punctuation, '')
    
  for punctuation in additional_punctuations:
      text = text.replace(punctuation, '')
      
  return text

def remove_hashtags(text):
  return re.sub("#[A-Za-z0-9_]+","", text)

def remove_emojis(text):
  emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)
  
  return emoji_pattern.sub(r'', text)

# Stemming, Lemmatization, and stopwords removal
stemmer = SnowballStemmer('english')
nltk.download('wordnet')

def lemmatize(text):
    return WordNetLemmatizer().lemmatize(text, pos='n')

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize(token))
    return str(result)

# Final data cleaning function
# Numbers not to be removed since they may be important in telling age requirements
def clean_text(text):
  text = remove_twitter_handles(text)
  text = remove_hashtags(text)
  text = remove_url(text)
  text = remove_url_2(text)
  text = remove_punctuations(text)
  text = remove_emojis(text)
  text = remove_stopwords(text)
  text = text.lower()
  text = preprocess(text)
  return text

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
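One detail worth checking in helpers like these: `str.replace` matches its first argument literally, so a regular-expression pattern passed to it removes nothing, whereas `re.sub` interprets the pattern. A small standard-library sketch (the sample string is made up for illustration):

```python
import re

sample = "Read it at https://www.gutenberg.org/ebooks/2500 today"
pattern = r"http\S+"

# str.replace looks for the literal characters 'http\S+', which never occur
literal_result = sample.replace(pattern, "")

# re.sub treats the pattern as a regular expression and removes the URL
regex_result = re.sub(pattern, "", sample)

print(literal_result == sample)
print(repr(regex_result))
```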




In [13]:
text_list = nltk.tokenize.sent_tokenize(text)
text_list

['\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nTo Romain Rolland, my dear friend\n\n\n\n\n\n\nIn the shade of the house, in the sunshine of the riverbank near the\nboats, in the shade of the Sal-wood forest, in the shade of the fig\ntree is where Siddhartha grew up, the handsome son of the Brahman, the\nyoung falcon, together with his friend Govinda, son of a Brahman.',
 'The\nsun tanned his light shoulders by the banks of the river when bathing,\nperforming the sacred ablutions, the sacred offerings.',
 'In the mango\ngrove, shade poured into his black eyes, when playing as a boy, when\nhis mother sang, when the sacred offerings were made, when his father,\nthe scholar, taught him, when the wise men talked.',
 'For a long time,\nSiddhartha had been partaking in the discussions of the wise men,\npractising debate with Govinda, practising with Govinda the art of\nreflection, the service of meditation.',
 'He already knew how to speak the\nOm silently, the word of words, to speak it silently into

In [14]:
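text_list = pd.Series(text_list)
# Note: this rebinds the name clean_text from the function to the cleaned Series,
# so this cell cannot be re-run without redefining the function first
clean_text = text_list.map(clean_text)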

In [15]:
pd.set_option('max_colwidth', None)
clean_text

0                                                                                                                                     ['romain', 'rolland', 'dear', 'friend', 'shade', 'house', 'sunshine', 'riverbank', 'near', 'boat', 'shade', 'salwood', 'forest', 'shade', 'fig', 'tree', 'siddhartha', 'grew', 'handsome', 'son', 'brahman', 'young', 'falcon', 'friend', 'govinda', 'son', 'brahman']
1                                                                                                                                                                                                                                                                               ['sun', 'tanned', 'light', 'shoulder', 'bank', 'river', 'bathing', 'performing', 'sacred', 'ablution', 'sacred', 'offering']
2                                                                                                                                                                                                             

## Modelling

In [19]:
# CountVectorizer and TF-IDF
vectorizer = CountVectorizer(stop_words='english', lowercase=True)

tfidf = TfidfVectorizer(stop_words='english')

data_vectorized = vectorizer.fit_transform(clean_text)
data_tfidf = tfidf.fit_transform(clean_text)

Sparsity is reported here as the percentage of non-zero cells in the document-word matrix, `data_vectorized`.

This percentage gives a rough idea of how diverse the vocabulary is: a low value means that each sentence uses only a small fraction of the words in the vocabulary.

In [20]:
# Materialize the sparse data
data_dense = data_vectorized.todense()

# Compute sparsity = percentage of non-zero cells
print("Sparsity: ", ((data_dense > 0).sum()/data_dense.size)*100, "%")

Sparsity:  0.30218646864686466 %
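The same percentage can be computed without materializing a dense matrix, since SciPy sparse matrices expose the number of stored non-zero entries as `nnz`. A minimal sketch on a toy matrix (the values are illustrative, not taken from the book):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy document-word matrix: 3 documents x 4 vocabulary terms
toy_counts = csr_matrix(np.array([
    [2, 0, 0, 1],
    [0, 0, 3, 0],
    [0, 1, 0, 0],
]))

n_cells = toy_counts.shape[0] * toy_counts.shape[1]
percent_nonzero = 100.0 * toy_counts.nnz / n_cells
print(f"Non-zero cells: {percent_nonzero:.2f} %")
```

Applied to this notebook, the same idea would use `data_vectorized.nnz` and `data_vectorized.shape`, which avoids allocating the dense copy.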


In [21]:
# Build LDA Model
lda_model = LatentDirichletAllocation(n_components=5,               
                                      max_iter=10,             
                                      learning_method='online',   
                                      random_state=42,             
                                      n_jobs = -1               
                                     )


# Build a Non-Negative Matrix Factorization Model
nmf_model = NMF(n_components=5)

# Build a Latent Semantic Indexing Model
lsi_model = TruncatedSVD(n_components=5)

In [23]:
lda_output_cv = lda_model.fit_transform(data_vectorized)
nmf_output_cv = nmf_model.fit_transform(data_vectorized)
lsi_output_cv = lsi_model.fit_transform(data_vectorized)

# Note: refitting the same estimator objects overwrites the count-based fits,
# so lda_model, nmf_model and lsi_model hold the TF-IDF fits from here on
lda_output_tfidf = lda_model.fit_transform(data_tfidf)
nmf_output_tfidf = nmf_model.fit_transform(data_tfidf)
lsi_output_tfidf = lsi_model.fit_transform(data_tfidf)

Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  kwargs['lwork'] = ret[-2][0].real.astype(numpy.int)

In [24]:
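# inspect the inferred topics
def print_topics(model, vectorizer, top_n=5):
    # Look up the vocabulary once instead of once per word
    # (scikit-learn >= 1.0 also offers get_feature_names_out())
    feature_names = vectorizer.get_feature_names()
    for idx, topic in enumerate(model.components_):
        print("Topic %d:" % (idx))
        print([(feature_names[i], topic[i])
                        for i in topic.argsort()[:-top_n - 1:-1]])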
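The slice `argsort()[:-top_n - 1:-1]` used above walks the ascending sort order backwards, so it yields the indices of the `top_n` largest weights in descending order. A toy illustration with NumPy (the words and weights are made up):

```python
import numpy as np

words = ["river", "path", "love", "time"]
weights = np.array([0.9, 0.1, 0.5, 0.7])
top_n = 2

# argsort() is ascending; the reversed slice keeps the last top_n indices
top_idx = weights.argsort()[:-top_n - 1:-1]
top_words = [(words[i], float(weights[i])) for i in top_idx]
print(top_words)
```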

In [25]:
print("LDA Model CV:")
print_topics(lda_model, vectorizer)
print("=" * 5)

print("\n")

print("LDA Model TFIDF:")
print_topics(lda_model, tfidf)
print("=" * 5)

LDA Model CV:
Topic 0:
[('siddhartha', 14.218479724061757), ('saw', 10.243963935323766), ('smiled', 9.587638611812597), ('face', 9.50564272867062), ('learned', 7.583869454275457)]
Topic 1:
[('siddhartha', 40.89357934082687), ('like', 28.08938146002824), ('time', 21.76400936436771), ('thought', 20.352072507850504), ('man', 18.50633964338401)]
Topic 2:
[('havent', 7.959442635308658), ('beloved', 4.141532391011268), ('shoe', 3.1866126839558717), ('clothes', 2.5373405882944544), ('vasudeva', 2.459866048531698)]
Topic 3:
[('closer', 5.5947383769959576), ('bend', 4.988795099963836), ('able', 4.6863316974418), ('siddhartha', 4.4707850340644315), ('peace', 4.010964338642865)]
Topic 4:
[('farewell', 7.190759017296269), ('word', 5.491996312852652), ('come', 5.142442457101188), ('said', 5.092452122181225), ('govinda', 4.6705514308128935)]
=====


LDA Model TFIDF:
Topic 0:
[('siddhartha', 14.218479724061757), ('saw', 10.243963935323766), ('smiled', 9.587638611812597), ('face', 9.50564272867062), (

In [26]:
print("LSI Model CV:")
print_topics(lsi_model, vectorizer)
print("=" * 5)

print("\n")

print("LSI Model TFIDF:")
print_topics(lsi_model, tfidf)
print("=" * 5)

LSI Model CV:
Topic 0:
[('siddhartha', 0.49007503493805704), ('govinda', 0.21892560109506098), ('like', 0.20572968587202348), ('time', 0.19497437733701672), ('said', 0.18034301208144557)]
Topic 1:
[('like', 0.8834537394773193), ('good', 0.10839315734626698), ('people', 0.0798136987048279), ('love', 0.07719747826827927), ('kamala', 0.054275828213396334)]
Topic 2:
[('siddhartha', 0.6270807190748066), ('like', 0.17421894483217795), ('samana', 0.08443626660718744), ('quoth', 0.05953303434419991), ('spoke', 0.05915272037226546)]
Topic 3:
[('govinda', 0.388691145732435), ('teaching', 0.380355342614107), ('oh', 0.221088205155907), ('friend', 0.16721248671068684), ('come', 0.14333548073256927)]
Topic 4:
[('learned', 0.4612260647617315), ('river', 0.44173769914158406), ('know', 0.13794325860756462), ('youve', 0.1142212917563938), ('thing', 0.11107997479619225)]
=====


LSI Model TFIDF:
Topic 0:
[('siddhartha', 0.49007503493805704), ('govinda', 0.21892560109506098), ('like', 0.20572968587202348)

In [27]:
print("NMF Model CV:")
print_topics(nmf_model, vectorizer)
print("=" * 5)

print("\n")

print("NMF Model TFIDF:")
print_topics(nmf_model, tfidf)
print("=" * 5)

NMF Model CV:
Topic 0:
[('siddhartha', 2.151427678134617), ('thought', 0.33157736859898307), ('samana', 0.311349877791384), ('brahman', 0.24885454748122993), ('saw', 0.24115868203508414)]
Topic 1:
[('like', 1.676774845068931), ('good', 0.2280354201634499), ('people', 0.15416038751235192), ('love', 0.13870602446620714), ('kamala', 0.10186867364176165)]
Topic 2:
[('time', 1.0415888734251701), ('long', 0.7176488884521247), ('felt', 0.2638086671484509), ('face', 0.2230465741546087), ('life', 0.20539118276425683)]
Topic 3:
[('govinda', 0.851691151526929), ('oh', 0.49680814165567666), ('teaching', 0.4942297045955835), ('said', 0.4653648189453845), ('friend', 0.42572913185271805)]
Topic 4:
[('river', 0.9594112013146958), ('learned', 0.7819609115505951), ('vasudeva', 0.2878958896185035), ('man', 0.2753028385791073), ('ferryman', 0.2673802287123392)]
=====


NMF Model TFIDF:
Topic 0:
[('siddhartha', 2.151427678134617), ('thought', 0.33157736859898307), ('samana', 0.311349877791384), ('brahman',

For each model, the printed CountVectorizer and TfidfVectorizer topics are identical. Each model object was fitted on the count matrix and then refitted on the TF-IDF matrix, so both print-outs show the TF-IDF fit. Across the models, words such as `siddhartha`, `govinda`, `river`, and `love` recur.
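Because `fit` and `fit_transform` overwrite an estimator's learned `components_`, comparing a count-based fit with a TF-IDF-based fit needs one estimator object per input matrix. A minimal sketch using `sklearn.base.clone`, assuming a toy corpus (the `docs` list, the NMF settings, and the variable names are illustrative, not taken from the notebook):

```python
import numpy as np
from sklearn.base import clone
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Made-up miniature corpus, for illustration only
docs = [
    "siddhartha walked along the river",
    "govinda followed siddhartha on the path",
    "the river taught siddhartha to listen",
    "kamala taught siddhartha about love",
]
count_matrix = CountVectorizer().fit_transform(docs)
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Unfitted template; clone() returns a fresh estimator with the same parameters
base_nmf = NMF(n_components=2, init="nndsvda", random_state=0, max_iter=500)
nmf_counts = clone(base_nmf).fit(count_matrix)
nmf_tfidf = clone(base_nmf).fit(tfidf_matrix)

# Each fitted copy keeps its own components_, so the two can be compared
print(nmf_counts.components_.shape, nmf_tfidf.components_.shape)
print(np.allclose(nmf_counts.components_, nmf_tfidf.components_))
```

The same pattern would apply to the LDA and LSI models: keep one fitted object per vectorizer and pass the matching pair to `print_topics`.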

To look for a better number of topics, I tune the LDA hyperparameters with a grid search, once on the CountVectorizer matrix and once on the TF-IDF matrix, and compare the results.

In [28]:
search_params = {'n_components': range(1, 10), 'learning_decay': [.5, .6, .7, .8, .9]}
lda = LatentDirichletAllocation(learning_method="online")

model_cv = GridSearchCV(lda, param_grid=search_params, verbose=1)
model_cv.fit(data_vectorized)

Fitting 5 folds for each of 45 candidates, totalling 225 fits


GridSearchCV(estimator=LatentDirichletAllocation(learning_method='online'),
             param_grid={'learning_decay': [0.5, 0.6, 0.7, 0.8, 0.9],
                         'n_components': range(1, 10)},
             verbose=1)

In [29]:
search_params = {'n_components': range(1, 10), 'learning_decay': [.5, .6, .7, .8, .9]}
lda = LatentDirichletAllocation(learning_method="online")

model_tfidf = GridSearchCV(lda, param_grid=search_params, verbose=1)
model_tfidf.fit(data_tfidf)

Fitting 5 folds for each of 45 candidates, totalling 225 fits


GridSearchCV(estimator=LatentDirichletAllocation(learning_method='online'),
             param_grid={'learning_decay': [0.5, 0.6, 0.7, 0.8, 0.9],
                         'n_components': range(1, 10)},
             verbose=1)

In [30]:
print(model_cv.best_params_)
print(model_tfidf.best_params_)

{'learning_decay': 0.9, 'n_components': 1}
{'learning_decay': 0.9, 'n_components': 1}


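GridSearchCV selects a single topic for both inputs. By default it ranks candidates with the estimator's `score` method, which for `LatentDirichletAllocation` is an approximate log-likelihood, so the selection reflects likelihood on held-out sentences rather than how interpretable the topics are.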

In [31]:
lda_best_model = LatentDirichletAllocation(n_components=1,               
                                      max_iter=10,             
                                      learning_method='online',   
                                      random_state=42,             
                                      n_jobs = -1,
                                      learning_decay=0.9
                                     )

In [32]:
lda_best_output_cv = lda_best_model.fit_transform(data_vectorized)

lda_best_output_tfidf = lda_best_model.fit_transform(data_tfidf)


In [33]:
print("LDA Best Model CV:")
print_topics(lda_best_model, vectorizer)
print("=" * 1)

print("\n")

print("LDA Best Model TFIDF:")
print_topics(lda_best_model, tfidf)
print("=" * 1)

LDA Best Model CV:
Topic 0:
[('siddhartha', 62.39633492512364), ('like', 34.798224976071815), ('govinda', 30.317471789832346), ('thought', 29.08372147641995), ('time', 28.37285850327281)]
=


LDA Best Model TFIDF:
Topic 0:
[('siddhartha', 62.39633492512364), ('like', 34.798224976071815), ('govinda', 30.317471789832346), ('thought', 29.08372147641995), ('time', 28.37285850327281)]
=


A single topic only lists the most frequent words and cannot separate themes, so let's look at other candidates from the grid search.

In [34]:
results = pd.DataFrame(model_cv.cv_results_)
results.sort_values(by='rank_test_score', inplace=True)

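# Note: after sorting, .loc[1] selects the row labelled 1 (the second candidate
# in grid order), not the second-ranked row; results.iloc[1] would give the latter
params_2nd_best = results.loc[1, 'params']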
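After `sort_values`, a DataFrame keeps its original index labels, so label-based `.loc[1]` and position-based `.iloc[1]` can return different rows. A small sketch with made-up ranks (the parameter values and scores are illustrative only):

```python
import pandas as pd

# Hypothetical search results; index labels 0..2 follow the grid order
results = pd.DataFrame({
    "params": [{"n_components": 1}, {"n_components": 2}, {"n_components": 3}],
    "rank_test_score": [1, 3, 2],
})
ranked = results.sort_values(by="rank_test_score")

by_label = ranked.loc[1, "params"]      # row whose index label is 1
by_position = ranked.iloc[1]["params"]  # second row after sorting

print(by_label)
print(by_position)
```

In this toy table the label-based lookup returns the worst-ranked candidate, while the position-based lookup returns the second-ranked one.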

In [35]:
lda_best_model = LatentDirichletAllocation(n_components=2,               
                                      max_iter=10,             
                                      learning_method='online',   
                                      random_state=42,             
                                      n_jobs = -1,
                                      learning_decay=0.5
                                     )

In [36]:
lda_best_output_cv = lda_best_model.fit_transform(data_vectorized)

lda_best_output_tfidf = lda_best_model.fit_transform(data_tfidf)

In [37]:
print("LDA Best Model CV:")
print_topics(lda_best_model, vectorizer)
print("=" * 2)

print("\n")

print("LDA Best Model TFIDF:")
print_topics(lda_best_model, tfidf)
print("=" * 2)

LDA Best Model CV:
Topic 0:
[('like', 30.052848515334688), ('heart', 16.644956174118885), ('kamala', 14.388222241529657), ('siddhartha', 13.698134252278843), ('face', 13.696821058643398)]
Topic 1:
[('siddhartha', 51.37238983695205), ('thought', 28.147219661237706), ('govinda', 25.691304478924557), ('said', 24.020508656089408), ('path', 20.335299018848403)]
==


LDA Best Model TFIDF:
Topic 0:
[('like', 30.052848515334688), ('heart', 16.644956174118885), ('kamala', 14.388222241529657), ('siddhartha', 13.698134252278843), ('face', 13.696821058643398)]
Topic 1:
[('siddhartha', 51.37238983695205), ('thought', 28.147219661237706), ('govinda', 25.691304478924557), ('said', 24.020508656089408), ('path', 20.335299018848403)]
==


In [38]:
pyLDAvis.enable_notebook()
panel = pyLDAvis.sklearn.prepare(lda_best_model, data_tfidf, tfidf, mds='tsne')
panel

## Conclusion

As the results of the TF-IDF LDA model show, there are two main topics.

The first topic, with the keywords `like`, `heart`, `kamala`, `siddhartha`, `face`, relates to the theme of `love`, as emphasized by **kamala**: Kamala is the courtesan who instructs Siddhartha in the art of physical love. In addition to being Siddhartha's lover, Kamala helps him learn the ways of the city and leave his ascetic life as a Samana behind. Just before she dies from a snakebite, she reveals that Siddhartha is the father of her son. Kamala plays a big role in Siddhartha's journey and pursuit of enlightenment.

The second topic, with the keywords `siddhartha`, `thought`, `govinda`, `said`, `path`, relates to Siddhartha and his journey along the `path`. On that journey, his best friend Govinda joins him as they delve deep into thought and pursue enlightenment.