### 1. Load Libraries

In [123]:
import re, numpy as np, pandas as pd
from pprint import pprint

# Gensim
import gensim
import gensim.corpora as corpora  # needed for corpora.Dictionary below

### Tasks in Class

Suppose a vandal has broken into your study and torn apart four of your books:

* *Great Expectations* by Charles Dickens
* *The War of the Worlds* by H.G. Wells
* *Twenty Thousand Leagues Under the Sea* by Jules Verne
* *Pride and Prejudice* by Jane Austen

This vandal has torn the books into individual chapters, and left them in one large pile. How can we restore these disorganized chapters to their original books? This is a challenging problem since the individual chapters are unlabeled: we don’t know what words might distinguish them into groups. We’ll thus use topic modeling to discover how chapters cluster into distinct topics, each of them (presumably) representing one of the books.
The code below loads the chapters of the four books as a corpus.

In [124]:
%run "CorpusUtils.py"
#books = makeCleanCorpus(abspath = os.path.abspath('.') + '/data/LibraryHeist/')

In [125]:
#import pickle
#pickle.dump( books, open( "data/books.p", "wb" ) )

In [126]:
import pickle
books = pickle.load( open( "data/books.p", "rb" ) )

In [127]:
type(books)

dict

In [128]:
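# Convert the dictionary to a list of token lists.
# Chapter names are not lost: books.keys() iterates in the same order as books.values().
def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True strips accent marks; simple_preprocess itself lowercases, tokenizes and drops punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)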

books_words = list(sent_to_words(books.values()))

#print(books_words[:1])
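
Chapter names can be kept alongside the tokens because a Python `dict` returns its keys and values in the same order, so the names can be zipped back onto the token lists. The sketch below is a minimal illustration, not the notebook's code: `tokenize` is a simplified, hypothetical stand-in for `gensim.utils.simple_preprocess`, and the file names in `toy_books` are invented.

```python
import re

def tokenize(text):
    # Hypothetical, simplified stand-in for gensim.utils.simple_preprocess:
    # lowercase the text and keep runs of ASCII letters
    return re.findall(r"[a-z]+", text.lower())

# Toy stand-in for the `books` dict (file name -> chapter text); names are invented
toy_books = {
    "GE_1.txt": "My father's family name being Pirrip.",
    "PaP_1.txt": "It is a truth universally acknowledged.",
}

# keys() and values() iterate in the same (insertion) order
chapter_names = list(toy_books.keys())
chapter_tokens = [tokenize(text) for text in toy_books.values()]

# Pair each chapter name with its token list
named_tokens = dict(zip(chapter_names, chapter_tokens))
print(named_tokens["PaP_1.txt"])
```

The same idea applies to the real data: `list(books.keys())` gives the labels for `books_words` position by position.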

In [149]:
# Create Dictionary
id2word = corpora.Dictionary(books_words)

# Create Corpus: Term Document Frequency
corpus = [id2word.doc2bow(text) for text in books_words]

# Build LDA model
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                           id2word=id2word,
                                           num_topics=4,                                            
                                           per_word_topics=True,
                                           random_state=1)
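
As a rough illustration of what the dictionary and `doc2bow` steps produce, the following plain-Python sketch builds the same kind of bag-of-words representation: each document becomes a list of `(token_id, count)` pairs. It does not use gensim; the toy documents and the names `token2id` and `to_bow` are assumptions made for this example, and gensim's actual id assignment may differ.

```python
from collections import Counter

# Toy tokenized documents (invented for illustration)
docs = [
    ["joe", "miss", "joe"],
    ["captain", "nautilus", "captain", "joe"],
]

# Give every distinct token an integer id (sorted here only for reproducibility)
vocab = sorted({tok for doc in docs for tok in doc})
token2id = {tok: i for i, tok in enumerate(vocab)}

def to_bow(doc):
    # Count how often each token id occurs in the document
    counts = Counter(token2id[tok] for tok in doc)
    return sorted(counts.items())

toy_corpus = [to_bow(doc) for doc in docs]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation; the LDA model only sees which tokens occur in a chapter and how often.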


In [150]:
def format_topics_sentences(ldamodel, corpus, texts):
    # Collect one row per document (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Get main topic in each document
    for row_list in ldamodel[corpus]:
        # With per_word_topics=True the model returns a tuple whose first element is the topic distribution
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # Dominant topic: the (topic_num, prop_topic) pair with the highest proportion
        topic_num, prop_topic = max(row, key=lambda x: x[1])
        # Keywords of the dominant topic
        wp = ldamodel.show_topic(topic_num)
        topic_keywords = ", ".join([word for word, prop in wp])
        rows.append([int(topic_num), round(float(prop_topic), 4), topic_keywords])

    sent_topics_df = pd.DataFrame(rows, columns=['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords'])

    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df
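
The core of the function above is choosing, for each document, the topic with the largest proportion. A minimal sketch of that step, assuming the topic distribution arrives as a list of `(topic_id, proportion)` pairs (the values in `toy_distribution` are invented):

```python
def dominant_topic(doc_topics):
    # Return the (topic_id, proportion) pair with the highest proportion
    return max(doc_topics, key=lambda pair: pair[1])

# Invented topic distribution for a single document
toy_distribution = [(0, 0.10), (1, 0.84), (2, 0.06)]

topic_id, proportion = dominant_topic(toy_distribution)
print(topic_id, proportion)
```

A document with a low top proportion (for example, just above 0.5) is a weak assignment, which is worth keeping in mind when reading the `Topic_Perc_Contrib` column.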

In [151]:
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=books_words)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(10)

,Document_No,Dominant_Topic,Topic_Perc_Contrib,Keywords,Text
0,0,1.0,0.8372,"elizabeth, miss, like, us, went, saw, came, jo...","[chapter, fathers, family, name, pirrip, chris..."
1,1,2.0,0.9972,"joe, us, captain, thought, nautilus, saw, like...","[chapter, ii, sister, joe, gargery, twenty, ye..."
2,2,2.0,0.5179,"joe, us, captain, thought, nautilus, saw, like...","[chapter, iii, rimy, morning, damp, seen, damp..."
3,3,2.0,0.9993,"joe, us, captain, thought, nautilus, saw, like...","[chapter, iv, fully, expected, find, constable..."
4,4,2.0,0.8326,"joe, us, captain, thought, nautilus, saw, like...","[chapter, apparition, file, soldiers, ringing,..."
5,5,2.0,0.9702,"joe, us, captain, thought, nautilus, saw, like...","[chapter, felicitous, idea, occurred, morning,..."
6,6,1.0,0.8968,"elizabeth, miss, like, us, went, saw, came, jo...","[chapter, xi, appointed, returned, miss, havis..."
7,7,0.0,0.7476,"miss, us, like, elizabeth, went, joe, came, sa...","[chapter, xii, grew, uneasy, subject, pale, yo..."
8,8,0.0,0.6796,"miss, us, like, elizabeth, went, joe, came, sa...","[chapter, xiii, trial, feelings, next, joe, ar..."
9,9,1.0,0.9971,"elizabeth, miss, like, us, went, saw, came, jo...","[chapter, xiv, miserable, thing, feel, ashamed..."


In [152]:
# Create column with true labels
df_dominant_topic["True_Chapter"] = list(books.keys())
# Remove the ".txt" extension and the "x" that appears in some GE file names
df_dominant_topic["True_Chapter"] = df_dominant_topic.True_Chapter.str.replace(r"x|\.txt", "", regex=True)

# Normalize naming for books: keep the prefix before the first "_" (ignoring an optional trailing "r")
df_dominant_topic["True_Book"] = df_dominant_topic.True_Chapter.str.extract(r"^(\w*?)(?=[r]?_)", expand=False)
# Drop a stray lowercase "w" from the book prefix
df_dominant_topic["True_Book"] = df_dominant_topic.True_Book.str.replace("w", "", regex=False)
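
To see what the label-cleaning patterns do, they can be tried on a few names with the standard `re` module. The file names below are hypothetical (the real keys of `books` are not shown in this notebook), and the cleaning pattern is written here with an escaped dot so that it matches a literal `.txt`.

```python
import re

# Hypothetical file names; the real keys of `books` may look different
names = ["GE_1.txt", "GEx_12.txt", "PaP_3.txt", "TWotWr_7.txt", "TTLutSw_2.txt"]

def book_of(name):
    # Drop "x" characters and the literal ".txt" extension
    chapter = re.sub(r"x|\.txt", "", name)
    # Lazily capture word characters up to an optional "r" followed by "_"
    match = re.match(r"^(\w*?)(?=[r]?_)", chapter)
    if match is None:
        return None
    # Remove a stray lowercase "w" from the captured prefix
    return match.group(1).replace("w", "")

labels = [book_of(n) for n in names]
print(labels)
```

Because `\w` also matches `_`, the lazy quantifier `*?` is what makes the capture stop at the first underscore rather than the last.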


In [153]:
# Check different names
df_dominant_topic.True_Book.unique()

array(['GE', 'PaP', 'TTLutS', 'TWotW'], dtype=object)

In [154]:
#Check how many chapters from different books have been clustered to each topic
pd.crosstab(df_dominant_topic.Dominant_Topic, df_dominant_topic.True_Book)

Dominant_Topic,GE,PaP,TTLutS,TWotW
0.0,16,1,1,20
1.0,28,59,2,3
2.0,13,1,14,3
3.0,2,0,29,1


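- We were not able to fully restore the disorganized chapters to their original books: no topic maps cleanly onto a single book.
- However, almost all chapters from `PaP` (59 of 61) were assigned to topic 1, and most chapters from `TWotW` (20 of 27) to topic 0.
- The chapters from `GE` and `TTLutS` are spread across multiple clusters: `GE` mainly across topics 0, 1 and 2 (16, 28 and 13 chapters), and `TTLutS` mainly across topics 2 and 3 (14 and 29 chapters).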
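
To put numbers on these observations, the crosstab counts can be summarized with cluster purity (the share of chapters that belong to the majority book of their topic) and, per book, the share of its chapters that ended up in its largest topic. This is a plain-Python sketch with the counts copied by hand from the crosstab; the variable names are assumptions for this example.

```python
# Counts copied from the crosstab: rows are topics, columns are books
books_order = ["GE", "PaP", "TTLutS", "TWotW"]
counts = {
    0: [16, 1, 1, 20],
    1: [28, 59, 2, 3],
    2: [13, 1, 14, 3],
    3: [2, 0, 29, 1],
}

total = sum(sum(row) for row in counts.values())

# Purity: fraction of chapters that belong to the majority book of their topic
purity = sum(max(row) for row in counts.values()) / total

# For each book, the topic holding most of its chapters and the share it holds
best_topic = {}
for col, book in enumerate(books_order):
    column = {topic: row[col] for topic, row in counts.items()}
    topic = max(column, key=column.get)
    best_topic[book] = (topic, column[topic] / sum(column.values()))

print(f"purity = {purity:.3f}")
for book, (topic, share) in best_topic.items():
    print(f"{book}: topic {topic} holds {share:.0%} of its chapters")
```

Note that `GE` and `PaP` share the same best topic (topic 1), which is one reason the four topics cannot be read as four books.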