
<p>In this notebook, we will detect how closely related Charles Darwin's books are to each other.</p>
<p>For this purpose, we will develop <strong>a content-based book recommendation system</strong>, which determines how close books are to each other based on how similar their topics are.</p>
<p>Here are the books used in the recommendation system below.</p>

In [19]:
import glob

# The books files are contained in this folder
folder = "datasets/"

# Listing all the .txt files sorted alphabetically
files = sorted(glob.glob(folder + "*.txt"))
files

['datasets/Autobiography.txt',
 'datasets/CoralReefs.txt',
 'datasets/DescentofMan.txt',
 'datasets/DifferentFormsofFlowers.txt',
 'datasets/EffectsCrossSelfFertilization.txt',
 'datasets/ExpressionofEmotionManAnimals.txt',
 'datasets/FormationVegetableMould.txt',
 'datasets/FoundationsOriginofSpecies.txt',
 'datasets/GeologicalObservationsSouthAmerica.txt',
 'datasets/InsectivorousPlants.txt',
 'datasets/LifeandLettersVol1.txt',
 'datasets/LifeandLettersVol2.txt',
 'datasets/MonographCirripedia.txt',
 'datasets/MonographCirripediaVol2.txt',
 'datasets/MovementClimbingPlants.txt',
 'datasets/OriginofSpecies.txt',
 'datasets/PowerMovementPlants.txt',
 'datasets/VariationPlantsAnimalsDomestication.txt',
 'datasets/VolcanicIslands.txt',
 'datasets/VoyageBeagle.txt']


<p>Next, we load the contents of each book into Python and apply some basic pre-processing to facilitate the downstream analyses. The resulting collection of texts is called <strong>a corpus</strong>. We also store the book titles for future reference and print the length of each book to get a sense of its size.</p>

In [20]:
import re, os

txts = []
titles = []

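# Compiling the pattern once: it matches runs of characters that are neither alphanumeric nor whitespace
pattern = re.compile(r'[^a-zA-Z0-9\s]+')

for n in files:
    # The context manager closes the file; 'utf-8-sig' drops a leading byte order mark
    with open(n, encoding='utf-8-sig') as f:
        # Removing all non-alpha-numeric characters
        text = pattern.sub('', f.read())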

    title = os.path.basename(n).split('.')[0]
    # Storing the texts and titles of the books in two separate lists
    txts.append(text)
    titles.append(title)

# Printing the length, in characters, of each book
[len(t) for t in txts]

[123229,
 496539,
 1785723,
 616671,
 919542,
 624250,
 342689,
 534797,
 796499,
 904003,
 1047646,
 1014384,
 777741,
 1723333,
 305219,
 919177,
 1094855,
 1084841,
 341193,
 1154047]


<p>In the next parts of this analysis, we will often check the results returned by our method for a given book. For consistency, we will refer to Darwin's most famous book: "<em>On the Origin of Species</em>." Let's find the index associated with this book.</p>

In [21]:
# Storing the index of each title for future reference
titles_idx = {title: i for i, title in enumerate(titles)}

# Print the stored index
print(titles_idx['OriginofSpecies'])

15



<p>Tokenizing and removing stop words from the text.</p>

In [22]:
# Defining a set of stop words
stoplist = set('for a of the and to in be which some is at that we i who whom show via may my our might as well'.split())

txts_lower_case = [txt.casefold() for txt in txts]

# Transforming the text into tokens 
txts_split = [txt.split() for txt in txts_lower_case]

# Removing tokens which are part of the list of stop words
texts = [[token for token in doc if token not in stoplist] for doc in txts_split]

# Printing the first 20 tokens for the "On the Origin of Species" book
print(texts[titles_idx['OriginofSpecies']][:20])

['on', 'origin', 'species', 'but', 'with', 'regard', 'material', 'world', 'can', 'least', 'go', 'so', 'far', 'thiswe', 'can', 'perceive', 'events', 'are', 'brought', 'about']
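The token 'thiswe' in the output above suggests that deleting punctuation outright can glue neighbouring words together. The following is a minimal sketch of the cleaning and stop-word steps on a made-up sentence, comparing deletion with replacement by a space. The names `toy_sample`, `toy_stoplist` and `tokenize` are hypothetical and exist only for this illustration; the sentence is not taken from the corpus files.

```python
import re

# Hypothetical sample sentence (an assumption, not read from the dataset)
toy_sample = "Far as this--we can perceive that events are brought about."

# Shortened stop-word set, in the spirit of the notebook's stoplist
toy_stoplist = set("for a of the and to in be which is at that we as".split())

def tokenize(text, replacement):
    """Replace non-alphanumeric runs, casefold, split on whitespace, drop stop words."""
    cleaned = re.sub(r"[^a-zA-Z0-9\s]+", replacement, text)
    return [tok for tok in cleaned.casefold().split() if tok not in toy_stoplist]

# Deleting punctuation merges "this--we" into a single token
tokens_deleted = tokenize(toy_sample, "")

# Substituting a space keeps the two words apart, so "we" can be filtered out
tokens_spaced = tokenize(toy_sample, " ")

print(tokens_deleted)
print(tokens_spaced)
```

Whether the merged tokens matter depends on how often such punctuation occurs in the books; the sketch only shows the mechanism.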



Stemming of the tokenized corpus:

In [24]:
import pickle

# Loading the stemmed tokens list from the pregenerated pickle file
with open("./datasets/texts_stem.p", mode='rb') as pick_file:
    texts_stem = pickle.load(pick_file)

# Printing the first 20 stemmed tokens from the "On the Origin of Species" book
print(texts_stem[titles_idx['OriginofSpecies']][:20])

UnpicklingError: invalid load key, '\xef'.
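The stemmed tokens are read from a pre-generated pickle file, so the stemming step itself is not shown in this notebook. As a rough illustration of what stemming does, the sketch below strips a few common English suffixes. `crude_stem` is a hypothetical helper written for this example; it is a deliberately crude stand-in and not the algorithm used to produce `texts_stem.p`, which the notebook does not specify.

```python
def crude_stem(token):
    """Strip one common English suffix; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so that short words stay intact
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Made-up token list for illustration
toy_tokens = ["origin", "species", "perceive", "events", "brought", "selected", "varying"]
toy_stems = [crude_stem(tok) for tok in toy_tokens]
print(toy_stems)
```

A real stemmer applies many more rules, so its output on the full corpus would differ from this toy version.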


Building a bag-of-words model

In [7]:
from gensim import corpora

# Creating a dictionary from the stemmed tokens
dictionary = corpora.Dictionary(texts_stem)

bows = [dictionary.doc2bow(book) for book in texts_stem]

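# Index of "On the Origin of Species", looked up in the titles_idx dictionary
ori = titles_idx['OriginofSpecies']

# Printing the first five elements of the "On the Origin of Species" BoW model
bows[ori][:5]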

ModuleNotFoundError: No module named 'gensim'
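To make the bag-of-words idea concrete without relying on gensim, here is a minimal standard-library sketch. `build_dictionary` and `doc2bow` are hypothetical helpers that only imitate the general shape of the gensim calls above (a token-to-id mapping and a list of `(token_id, count)` pairs); the ids that gensim assigns may differ.

```python
from collections import Counter

def build_dictionary(docs):
    """Assign an integer id to every distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def doc2bow(doc, token2id):
    """Return sorted (token_id, count) pairs for one tokenized document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

# Two made-up documents of stemmed tokens
toy_docs = [["speci", "origin", "speci"], ["coral", "reef", "speci"]]
toy_token2id = build_dictionary(toy_docs)
toy_bows = [doc2bow(doc, toy_token2id) for doc in toy_docs]
print(toy_bows)
```

Each document becomes a sparse list of counts, which is the input the tf-idf step expects.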


Finding the most common words of a given book

In [None]:
import pandas as pd

# Converting the BoW model for "On the Origin of Species" into a DataFrame
df_bow_origin = pd.DataFrame(bows[ori])

# Adding the column names to the DataFrame
df_bow_origin.columns = ["index", "occurrences"]
df_bow_origin.head()

# Adding a column containing the token corresponding to the dictionary index
df_bow_origin['token'] = [dictionary[i] for i in df_bow_origin['index']]
df_bow_origin.tail()

# Sorting the DataFrame by descending number of occurrences and printing the first 10 values
df_bow_origin.sort_values(by='occurrences', ascending=False).head(10)


Building a tf-idf model

In [None]:
from gensim.models import TfidfModel

# Generate the tf-idf model
model = TfidfModel(bows)

# Printing the first five tf-idf weights for "On the Origin of Species"
model[bows[ori]][:5]
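The weighting behind a tf-idf model can be sketched in a few lines of plain Python. The scheme used here (raw count times log2 of the number of documents divided by the document frequency, followed by scaling each document to unit length) is assumed to resemble gensim's default settings; treat that as an assumption rather than a description of `TfidfModel`. The function name `tfidf` and the toy input are hypothetical.

```python
import math

def tfidf(bows):
    """Weight (token_id, count) pairs by log2(N / document frequency),
    then scale every document vector to unit Euclidean length."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted_docs = []
    for bow in bows:
        weighted = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in bow]
        # Tokens that occur in every document get weight 0 and are dropped
        weighted = [(tid, w) for tid, w in weighted if w > 0]
        norm = math.sqrt(sum(w * w for _, w in weighted))
        weighted_docs.append([(tid, w / norm) for tid, w in weighted] if norm else [])
    return weighted_docs

# Made-up bag-of-words vectors: token 0 appears in both documents
toy_bows = [[(0, 2), (1, 1)], [(0, 1), (2, 1), (3, 1)]]
toy_tfidf = tfidf(toy_bows)
print(toy_tfidf)
```

In this toy input, token 0 occurs in both documents, so it carries no weight even though it is the most frequent token in the first one. That is the intended effect: words shared by all books say little about what makes one book distinctive.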

The results of the tf-idf model

In [None]:
# Converting the tf-idf model for "On the Origin of Species" into a DataFrame
df_tfidf = pd.DataFrame(model[bows[ori]])
df_tfidf.head()

df_tfidf.columns = ['id', 'scores']

# Adding the tokens corresponding to the numerical indices for better readability
df_tfidf['token'] = [dictionary[i] for i in df_tfidf.id]
df_tfidf.tail()

df_tfidf.sort_values(by='scores', ascending=False).head(10)


Computing distance between texts

In [None]:
from gensim import similarities

# Computing the similarity matrix (pairwise cosine similarity between all texts)
sims = similarities.MatrixSimilarity(model[bows])

sim_df = pd.DataFrame(list(sims))
sim_df.head()

sim_df.columns = titles
sim_df.index = titles

sim_df.head(10)
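The pairwise comparison can be illustrated with a small pure-Python cosine similarity over sparse `(token_id, weight)` vectors. This is a sketch of the measure itself, not of how gensim computes it internally; `cosine_similarity`, the vectors and the book labels are made up for the example.

```python
import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse (token_id, weight) vectors."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(w * b.get(tid, 0.0) for tid, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical tf-idf vectors for three imaginary books
toy_vectors = {
    "BookA": [(0, 0.6), (1, 0.8)],
    "BookB": [(0, 0.8), (1, 0.6)],
    "BookC": [(2, 1.0)],
}
toy_titles = list(toy_vectors)

# Square matrix of pairwise similarities, rows and columns in toy_titles order
toy_sims = [
    [cosine_similarity(toy_vectors[row], toy_vectors[col]) for col in toy_titles]
    for row in toy_titles
]
print(toy_sims)
```

A value near 1 means two books weight the same tokens in similar proportions, while 0 means they share no weighted tokens at all.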


The book most similar to "On the Origin of Species"

In [None]:

%matplotlib inline
import matplotlib.pyplot as plt

# Selecting the column corresponding to "On the Origin of Species"
v = sim_df['OriginofSpecies']

# Sorting by ascending similarity, so the most similar books end up at the top of the bar chart
v_sorted = v.sort_values()
v_sorted[:5]

# Plotting the sorted scores as a horizontal bar chart
v_sorted.plot.barh(rot=0)

plt.xlabel("Cosine similarity")
plt.ylabel("")
plt.title("Most similar books to 'On the Origin of Species'")
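Turning similarity scores into recommendations amounts to ranking them in descending order and leaving out the query book itself, which is always maximally similar to itself. A minimal sketch, assuming the scores are available as a plain dictionary; `most_similar` and the numbers are hypothetical and are not results from the Darwin corpus.

```python
def most_similar(scores, query, top_n=3):
    """Return the top_n (title, score) pairs, excluding the query title."""
    candidates = [(title, score) for title, score in scores.items() if title != query]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:top_n]

# Made-up similarity scores, for illustration only
toy_scores = {"Query": 1.0, "BookA": 0.42, "BookB": 0.87, "BookC": 0.15}
toy_top = most_similar(toy_scores, "Query", top_n=2)
print(toy_top)
```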


Books having similar content

In [None]:
from scipy.cluster import hierarchy

# Computing the clusters from the similarity matrix, using the Ward variance minimization algorithm
Z = hierarchy.linkage(sim_df, 'ward')

# Displaying this result as a horizontal dendrogram
a = hierarchy.dendrogram(Z, leaf_font_size=8, labels=sim_df.index, orientation='left')