# E-book content-based recommendation

In this notebook we'll explore content-based book recommendations using a subset of e-book data (19 Charles Darwin titles) taken from [Project Gutenberg](https://www.gutenberg.org). Using NLP libraries such as gensim, spaCy and NLTK, we'll turn each book into a tf-idf weighted bag of words and compare books by the similarity of their vocabulary, which gives us a simple basis for content-based recommendations. Along the way, we'll look at the most common words in a chosen text, grouped by part of speech, to better understand the data we're working with.

## Imports

In [15]:
import pickle
import glob
import re, os
import pandas as pd

from gensim import corpora
from gensim.models import TfidfModel
from gensim import similarities

import spacy
from spacy.lang.en import English

import nltk
from nltk.stem import PorterStemmer

from scipy.cluster import hierarchy

from tqdm.notebook import trange, tqdm
from tqdm import tqdm_gui
import time

import plotly.graph_objs as go
import plotly.io as pio
import plotly.express as px
import plotly.figure_factory as ff

In [16]:
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/jovyan/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

## Loading in data

In [17]:
# defining folder where data is kept
# using glob to import the files from the defined folder

os.chdir("/home/jovyan/work/recs/")

folder = "data/"

files = glob.glob(folder+ '*.txt')
files.sort()

In [18]:
# inspecting list of files, to ensure dataset was properly loaded
files

['data/CoralReefs.txt',
 'data/DescentofMan.txt',
 'data/DifferentFormsofFlowers.txt',
 'data/EffectsCrossSelfFertilization.txt',
 'data/ExpressionofEmotionManAnimals.txt',
 'data/FormationVegetableMould.txt',
 'data/FoundationsOriginofSpecies.txt',
 'data/GeologicalObservationsSouthAmerica.txt',
 'data/InsectivorousPlants.txt',
 'data/LifeandLettersVol1.txt',
 'data/LifeandLettersVol2.txt',
 'data/MonographCirripedia.txt',
 'data/MonographCirripediaVol2.txt',
 'data/MovementClimbingPlants.txt',
 'data/OriginofSpecies.txt',
 'data/PowerMovementPlants.txt',
 'data/VariationPlantsAnimalsDomestication.txt',
 'data/VolcanicIslands.txt',
 'data/VoyageBeagle.txt']

### Isolating text and title for each book

In [19]:
# loading book content and titles into separate lists we can use later

txts = []
titles = []

for n in files:
    # the context manager closes each file once it has been read
    with open(n, encoding='utf-8-sig') as f:
        # replace every run of non-alphanumeric characters (and underscores) with a single space
        text = re.sub(r'[\W_]+', ' ', f.read())
    # load titles and text into two separate lists
    titles.append(os.path.basename(n).replace('.txt', ''))
    txts.append(text)
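
To make the effect of the cleaning regex concrete, here is a minimal sketch on a made-up sentence (the helper name `normalize_text` and the sample string are assumptions for illustration, not part of the notebook). Note that apostrophes are also replaced, which is why "Darwin's" shows up as "Darwin s" in the preview of the text.

```python
import re

def normalize_text(raw: str) -> str:
    """Replace every run of non-alphanumeric characters (and underscores) with one space."""
    return re.sub(r'[\W_]+', ' ', raw)

# hypothetical sample, not taken from the dataset
sample = "Darwin's Journal -- the earth's surface, &c."
cleaned = normalize_text(sample)
print(cleaned)
```

Punctuation at the very end of a string becomes a trailing space, so call `.strip()` on the result if that matters for later steps.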

In [20]:
# taking a look at the first title and roughly the first 400 characters of its text to ensure we're pulling the titles and text in correctly.
print(titles[0])
print(txts[0][1:400])

CoralReefs
CORAL REEFS by CHARLES DARWIN EDITORIAL NOTE Although in some respects more technical in their subjects and style than Darwin s Journal the books here reprinted will never lose their value and interest for the originality of the observations they contain Many parts of them are admirably adapted for giving an insight into problems regarding the structure and changes of the earth s surface and in f


### Grab index for the book 'Descent of Man'
We'll be seeing how closely the other books in our dataset compare to this title.

In [21]:
# look up the position of 'DescentofMan' in the titles list
dom = titles.index('DescentofMan')
# Print the stored index
print(dom)

1


## Preprocessing
This will include:
* Loading in stopwords
* Tokenizing
* Stemming text in each book

### Load in stopwords

In [22]:
# using spaCy's stop word set
stopwords = spacy.lang.en.stop_words.STOP_WORDS

# inspecting 10 words from the set (sets are unordered, so the sample can vary between runs)
list(stopwords)[:10]

['around',
 'whoever',
 'until',
 'hence',
 'whatever',
 'an',
 'nobody',
 'whereafter',
 'hereafter',
 'please']

### Pre-process text in corpus
After converting all text to lowercase and splitting it on whitespace, we'll build a list of lists: each inner list represents one book and contains every word from that book's text that is *not* one of the stop words defined in the cell above.

In [23]:
txts_lower_split = [txt.lower().split() for txt in txts]
texts = [[word for word in txt if word not in stopwords] for txt in txts_lower_split]

print(texts[2][:100])

['different', 'forms', 'flowers', 'plants', 'species', 'charles', 'darwin', 'm', 'f', 'r', 's', 'professor', 'asa', 'gray', 'volume', 'dedicated', 'author', 'small', 'tribute', 'respect', 'affection', 'contents', 'introduction', 'chapter', 'heterostyled', 'dimorphic', 'plants', 'primulaceae', 'primula', 'veris', 'cowslip', 'differences', 'structure', 'forms', 'degrees', 'fertility', 'legitimately', 'illegitimately', 'united', 'p', 'elatior', 'vulgaris', 'sinensis', 'auricula', 'etc', 'summary', 'fertility', 'heterostyled', 'species', 'primula', 'homostyled', 'species', 'primula', 'hottonia', 'palustris', 'androsace', 'vitalliana', 'chapter', 'ii', 'hybrid', 'primulas', 'oxlip', 'hybrid', 'naturally', 'produced', 'primula', 'veris', 'vulgaris', 'differences', 'structure', 'function', 'parent', 'species', 'effects', 'crossing', 'long', 'styled', 'short', 'styled', 'oxlips', 'forms', 'parent', 'species', 'character', 'offspring', 'oxlips', 'artificially', 'self', 'fertilised', 'cross', 'f
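
The same lowercase, split and filter steps can be sketched on a toy corpus. The stop word set and the two titles below are assumptions chosen for illustration; the notebook itself uses spaCy's full stop word set.

```python
# hypothetical, tiny stop word set standing in for spaCy's STOP_WORDS
toy_stopwords = {"the", "of", "and", "in", "on"}

# hypothetical two-document corpus
toy_txts = ["The Different Forms of Flowers on Plants", "The Descent of Man"]

# lowercase and split on whitespace, then drop stop words
toy_split = [txt.lower().split() for txt in toy_txts]
toy_texts = [[word for word in doc if word not in toy_stopwords] for doc in toy_split]

print(toy_texts)
```

Because the stop words are stored in a set, each membership check is a constant-time lookup, which matters when filtering whole books.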

### Stemming tokenized words

In [24]:
porter = PorterStemmer()
stem_texts = [[porter.stem(token) for token in text] for text in texts]

In [25]:
# dumping to pickle so we don't have to repeat the stemming step when session ends
with open('stem_texts.p', 'wb') as f:
    pickle.dump(stem_texts, f)

In [26]:
# open pickled stemmed tokens
with open('stem_texts.p', 'rb') as f:
    stem_texts = pickle.load(f)
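
The dump and load cells above can be folded into one "load if cached, otherwise compute" helper. This is only a sketch: `load_or_compute`, `slow_stemming_step` and the temporary file path are hypothetical names introduced here, and the toy data stands in for the real stemmed tokens.

```python
import os
import pickle
import tempfile

def load_or_compute(path, compute):
    """Return the pickled object at `path` if it exists; otherwise compute and cache it."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

compute_calls = []

def slow_stemming_step():
    # stand-in for the expensive stemming pass over every book
    compute_calls.append(1)
    return [["differ", "form", "flower"], ["descent", "man"]]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "toy_stems.p")
    first = load_or_compute(cache_path, slow_stemming_step)   # computes and writes the cache
    second = load_or_compute(cache_path, slow_stemming_step)  # reads the cache instead

print(first == second, len(compute_calls))
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.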

In [27]:
# remove pickled file from working directory if needed

# os.remove("stem_texts.p")

In [28]:
# previewing first 20 stemmed tokens from the book at index 2 (DifferentFormsofFlowers); use stem_texts[dom] for Descent of Man
stem_texts[2][:20]

['differ',
 'form',
 'flower',
 'plant',
 'speci',
 'charl',
 'darwin',
 'm',
 'f',
 'r',
 's',
 'professor',
 'asa',
 'gray',
 'volum',
 'dedic',
 'author',
 'small',
 'tribut',
 'respect']

## Building a bag of words model
We can use gensim to create a dictionary (a mapping from each distinct token to an integer id) and a bag-of-words model (a list of `(token id, count)` pairs) for the stemmed tokens in each book.

In [29]:
dictionary = corpora.Dictionary(stem_texts)
bows = [dictionary.doc2bow(i) for i in stem_texts]

# Print the first five elements of the bag of words for the book at index 2 (DifferentFormsofFlowers); use bows[dom] for Descent of Man
print(bows[2][:5])

[(0, 249), (1, 2), (5, 206), (6, 82), (7, 207)]
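
To show what those `(token id, count)` pairs mean, here is a plain-Python sketch of the idea using only the standard library. It is a stand-in for illustration, not gensim's implementation: the names `token2id` and `to_bow` and the toy documents are assumptions, and the ids are assigned in first-seen order, which may differ from the ids gensim assigns.

```python
from collections import Counter

# hypothetical stemmed documents
toy_docs = [
    ["plant", "flower", "plant", "seed"],
    ["man", "speci", "plant"],
]

# build the dictionary: each distinct token gets the next free integer id
token2id = {}
for doc in toy_docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Convert a token list into sorted (token id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

toy_bows = [to_bow(doc) for doc in toy_docs]
print(token2id)
print(toy_bows)
```

Word order is discarded in this representation; only how often each token occurs in each book is kept.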


## Most common words 
Great, we now have a bag-of-words model for each book, built with the dictionary created in the previous cell. Let's convert the bag of words for one book (index 2, DifferentFormsofFlowers) into a DataFrame so we can inspect that book's most frequent tokens.

In [30]:
df_bow_dom = pd.DataFrame(bows[2])
df_bow_dom.columns = ['index', 'occurrences']
df_bow_dom['token'] = [dictionary[index] for index in df_bow_dom["index"]]

# sort the created dataframe by occurrences
df_bow_dom_sorted = df_bow_dom.sort_values(by='occurrences', ascending=False)
display(df_bow_dom_sorted)

Unnamed: 0,index,occurrences,token
1365,3378,1883,style
1106,2723,1466,plant
698,1573,1419,form
1931,6801,1216,flower
930,2179,960,long
2140,9244,929,pollen
1921,6730,830,fertilis
1276,3194,789,short
1252,3133,776,seed
1325,3293,567,speci


### Further exploration
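Let's take a look at what kinds of tokens are used most. We'll use NLTK's part-of-speech tagger to examine the top words by part of speech. Because the tokens are stemmed and tagged out of their original sentence context, the tags should be read as approximate.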

In [31]:
# defining the Penn Treebank part-of-speech tags (as used by NLTK) that we can use to isolate tokens
adjectives = ['JJ', 'JJR', 'JJS']
nouns = ['NN', 'NNS', 'NNP', 'NNPS']
verbs = ['VB','VBD','VBG','VBN','VBP','VBZ']

In [32]:
# adding part of speech column for each token
df_bow_dom_sorted['pos'] = [i[1] for i in list(nltk.pos_tag(df_bow_dom_sorted['token']))]

In [33]:
df_bow_dom_sorted.head()

Unnamed: 0,index,occurrences,token,pos
1365,3378,1883,style,NN
1106,2723,1466,plant,NN
698,1573,1419,form,NN
1931,6801,1216,flower,NN
930,2179,960,long,JJ


In [34]:
# using pandas query function to create new dataframes of just nouns and of just adjectives
# how cool is df.query!?

df_dom_nouns = df_bow_dom_sorted.query(f'pos in {nouns}')
df_dom_adj = df_bow_dom_sorted.query(f'pos in {adjectives}')
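
The filtering that `df.query` performs here boils down to a membership test on the tag. A minimal plain-Python sketch, using the five tagged tokens shown in the output above (the variable names are assumptions for illustration):

```python
# Penn Treebank tag groups, as defined earlier in the notebook
noun_tags = {"NN", "NNS", "NNP", "NNPS"}
adjective_tags = {"JJ", "JJR", "JJS"}

# (token, tag) pairs copied from the head of the tagged DataFrame
tagged_tokens = [("style", "NN"), ("plant", "NN"), ("form", "NN"), ("flower", "NN"), ("long", "JJ")]

# keep only the tokens whose tag belongs to the wanted group
noun_tokens = [token for token, tag in tagged_tokens if tag in noun_tags]
adjective_tokens = [token for token, tag in tagged_tokens if tag in adjective_tags]

print(noun_tokens)
print(adjective_tokens)
```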

### Plotting word frequency

In [35]:
fig = go.Figure(data=go.Bar(y=df_dom_nouns['occurrences'][:20], x = df_dom_nouns['token'][:20]))
  
fig.update_layout(title="Frequency Of Top 20 Nouns In Text",
                  yaxis = dict( title_text = "Frequency"),
                  xaxis = dict( title_text = "Top 20 nouns"),
                  template='plotly_white')

fig.show()

In [36]:
fig = go.Figure(data=go.Bar(y=df_dom_adj['occurrences'][:20], x = df_dom_adj['token'][:20]))
  
fig.update_layout(title="Frequency Of Top 20 Adjectives In Text",
                  yaxis = dict( title_text = "Frequency"),
                  xaxis = dict( title_text = "Top 20 adjectives"),
                  template='plotly_white')

fig.show()

### Building tf-idf model

In [37]:
# Generate the tf-idf model
model = TfidfModel(bows)

# Show the tf-idf weights for the book at index 2 (DifferentFormsofFlowers); use bows[dom] for "Descent of Man"
model[bows[2]]

[(0, 0.02789371688179761),
 (1, 0.00022404591872929807),
 (7, 0.015008320500611945),
 (8, 0.0005972017161367635),
 (9, 0.00030818656838726326),
 (10, 0.0004622798525808949),
 (11, 0.0007421266263750278),
 (12, 0.000989502168500037),
 (14, 0.00044809183745859614),
 (15, 0.0017316287948750647),
 (16, 0.0004947510842500185),
 (17, 0.0004622798525808949),
 (20, 7.048892727468483e-05),
 (21, 0.0012368777106250464),
 (22, 0.0005972017161367635),
 (26, 0.0007841607155525432),
 (27, 0.0009953361935612726),
 (28, 0.0017316287948750647),
 (30, 0.0011982099388463045),
 (31, 0.000398134477424509),
 (32, 0.0007125441009448619),
 (33, 0.000989502168500037),
 (34, 0.00035627205047243096),
 (35, 0.0012368777106250464),
 (36, 0.00035627205047243096),
 (37, 0.0005972017161367635),
 (40, 0.0007704664209681582),
 (41, 0.00024737554212500925),
 (44, 0.0009245597051617898),
 (45, 0.000398134477424509),
 (47, 0.0007421266263750278),
 (48, 0.001268800690944327),
 (49, 0.00030818656838726326),
 (51, 0.00149776
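
For intuition about where these weights come from, here is a plain-Python sketch of tf-idf. It assumes the weighting I understand gensim's `TfidfModel` to use by default: raw term count times `log2(N / document frequency)`, followed by scaling each document vector to unit length. Treat that as an assumption rather than a description of the library's internals. The function name `tfidf` and the toy data are hypothetical.

```python
import math

# hypothetical bags of words: lists of (token id, count) pairs
toy_bows = [
    [(0, 3), (1, 1)],
    [(0, 2), (2, 4)],
    [(0, 1), (1, 2), (2, 1)],
]

def tfidf(bows):
    """Weight each count by log2(N / document frequency), then L2-normalise per document."""
    n_docs = len(bows)
    doc_freq = {}
    for bow in bows:
        for token_id, _ in bow:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1

    weighted = []
    for bow in bows:
        vec = [(tid, count * math.log2(n_docs / doc_freq[tid])) for tid, count in bow]
        # a token that occurs in every document gets weight 0 and is dropped
        vec = [(tid, w) for tid, w in vec if w > 0.0]
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted

toy_weights = tfidf(toy_bows)
print(toy_weights)
```

Token 0 appears in all three toy documents, so it carries no discriminating information and vanishes from every vector. The same effect would explain why some token ids are missing from the output above.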

### Reviewing results

In [38]:
# converting the tf-idf weights for the book at index 15 (PowerMovementPlants) into a DataFrame; use bows[dom] for "Descent of Man"
df_tfidf = pd.DataFrame()

# naming the columns of the DataFrame id and score
df_tfidf['id'] = [i[0] for i in model[bows[15]]]
df_tfidf['score'] = [i[1] for i in model[bows[15]]]

# adding the tokens corresponding to the numerical indices for better readability
df_tfidf['token'] = [dictionary[index] for index in df_tfidf["id"]]

# sorting the scores in descending order and printing the top 5
print(df_tfidf.score.sort_values(ascending=False).head())

3008    0.482463
3541    0.412221
2948    0.336891
2640    0.305406
2788    0.263512
Name: score, dtype: float64


### Computing similarities

In [39]:
# computing the similarity matrix (pairwise cosine similarity between all texts)
sims = similarities.MatrixSimilarity(model[bows])

# transforming the resulting list into a dataframe
sim_df = pd.DataFrame(list(sims))

# adding the titles of the books as columns and index of the dataframe
sim_df.columns = titles
sim_df.index = titles

# printing the resulting matrix
sim_df

Unnamed: 0,CoralReefs,DescentofMan,DifferentFormsofFlowers,EffectsCrossSelfFertilization,ExpressionofEmotionManAnimals,FormationVegetableMould,FoundationsOriginofSpecies,GeologicalObservationsSouthAmerica,InsectivorousPlants,LifeandLettersVol1,LifeandLettersVol2,MonographCirripedia,MonographCirripediaVol2,MovementClimbingPlants,OriginofSpecies,PowerMovementPlants,VariationPlantsAnimalsDomestication,VolcanicIslands,VoyageBeagle
CoralReefs,1.0,0.009,0.00164,0.001563,0.004476,0.027276,0.02278,0.059152,0.001936,0.042032,0.025268,0.005691,0.009944,0.00113,0.038166,0.002231,0.010305,0.057589,0.265371
DescentofMan,0.009,1.0,0.067229,0.030609,0.149602,0.025602,0.14284,0.009175,0.009051,0.085616,0.125167,0.053792,0.040782,0.004401,0.274835,0.009909,0.227369,0.007353,0.121822
DifferentFormsofFlowers,0.00164,0.067229,1.0,0.404392,0.005942,0.00838,0.041547,0.002431,0.007171,0.023403,0.074613,0.008961,0.004667,0.007272,0.13222,0.01777,0.050118,0.002239,0.01225
EffectsCrossSelfFertilization,0.001563,0.030609,0.404392,1.0,0.006671,0.027166,0.04304,0.001865,0.006412,0.028481,0.076024,0.002839,0.00231,0.013757,0.153733,0.036919,0.056914,0.001862,0.016372
ExpressionofEmotionManAnimals,0.004476,0.149602,0.005942,0.006671,1.0,0.019316,0.048441,0.00482,0.010857,0.093342,0.075841,0.015865,0.028151,0.005089,0.062631,0.009536,0.082927,0.005014,0.100142
FormationVegetableMould,0.027276,0.025602,0.00838,0.027166,0.019316,1.000001,0.021401,0.065267,0.03315,0.04003,0.037276,0.018025,0.021302,0.033921,0.047834,0.036035,0.030584,0.055706,0.093838
FoundationsOriginofSpecies,0.02278,0.14284,0.041547,0.04304,0.048441,0.021401,1.0,0.028301,0.005636,0.081739,0.085756,0.007254,0.010337,0.003541,0.343215,0.008254,0.211628,0.018725,0.090888
GeologicalObservationsSouthAmerica,0.059152,0.009175,0.002431,0.001865,0.00482,0.065267,0.028301,1.0,0.006104,0.039093,0.016511,0.008449,0.02334,0.001548,0.053571,0.002691,0.01301,0.371084,0.260977
InsectivorousPlants,0.001936,0.009051,0.007171,0.006412,0.010857,0.03315,0.005636,0.006104,1.0,0.008491,0.025871,0.017461,0.017972,0.222627,0.013918,0.020874,0.009444,0.007387,0.013267
LifeandLettersVol1,0.042032,0.085616,0.023403,0.028481,0.093342,0.04003,0.081739,0.039093,0.008491,1.0,0.764225,0.008032,0.011332,0.007485,0.137225,0.01344,0.075095,0.035625,0.228634
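
Each cell of this matrix is a cosine similarity between two tf-idf vectors. The sketch below computes the same quantity by hand for three made-up sparse vectors; the function name `cosine` and the book names are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {token id: weight} dicts."""
    dot = sum(weight * v.get(token_id, 0.0) for token_id, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# hypothetical tf-idf vectors
toy_vectors = {
    "BookA": {0: 1.0, 1: 2.0},
    "BookB": {0: 2.0, 1: 4.0},  # same direction as BookA, twice the length
    "BookC": {2: 5.0},          # shares no tokens with the others
}

toy_sims = {
    a: {b: cosine(vec_a, vec_b) for b, vec_b in toy_vectors.items()}
    for a, vec_a in toy_vectors.items()
}
print(toy_sims["BookA"])
```

Cosine similarity depends only on direction, not length, so a long book and a short book with the same word proportions score close to 1, while books with no vocabulary in common score 0.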


### Finding most similar title

In [40]:
# Select the column corresponding to "Descent of Man"
v = sim_df['DescentofMan']

# Sort by ascending scores
v_sorted = v.sort_values()
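
To turn such a column of scores into actual recommendations, the queried book itself (similarity 1.0) has to be left out before taking the top entries. A minimal sketch, using a handful of scores copied from the DescentofMan row of the matrix above; the helper name `top_matches` is an assumption.

```python
# a few similarity scores for "DescentofMan", copied from the matrix above
dom_scores = {
    "CoralReefs": 0.009,
    "DescentofMan": 1.0,
    "ExpressionofEmotionManAnimals": 0.149602,
    "FoundationsOriginofSpecies": 0.14284,
    "OriginofSpecies": 0.274835,
    "VariationPlantsAnimalsDomestication": 0.227369,
}

def top_matches(scores, query_title, n=3):
    """Return the n highest-scoring titles, excluding the queried book itself."""
    others = {title: score for title, score in scores.items() if title != query_title}
    return sorted(others.items(), key=lambda item: item[1], reverse=True)[:n]

recommendations = top_matches(dom_scores, "DescentofMan")
for title, score in recommendations:
    print(f"{title}: {score:.3f}")
```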

In [41]:
fig = go.Figure(data=go.Bar(x=v_sorted.index,y=v_sorted))
  
fig.update_layout(title="Similarity Of Each Book To Descent Of Man",
                  yaxis = dict( title_text = "Similarity"),
                  xaxis = dict( title_text = "Book Title"),
                  template='plotly_white')

fig.show()

As we might have expected, Origin of Species is the book most similar to Descent of Man (leaving aside Descent of Man itself, which scores 1.0). But if we want to visualize the big picture and see how all the titles relate to one another based on their vocabulary, we can create a dendrogram.

In [42]:
fig = ff.create_dendrogram(sim_df, labels=sim_df.index, orientation="left")
fig.update_layout(width=1000, height=800, title='Similarity Dendrogram')
fig.show()