Tasks
* Finish researching existing solutions and summarize their technical approaches
* Collect existing code bases on GitHub that use DBpedia to extract knowledge
* Learn pdfminer and write code to parse textbooks in PDF form
* Extract section headers and table of contents, if possible
* Perform basic NLP using spaCy

Deliverables
* Evidence that the textbooks were parsed successfully: number of words, most common words
* Most common words (entities?) by textbook
* Structure of each textbook as parsed by the pipeline (table of contents or chapter headings) [optional]
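
The first two deliverables can be computed with a few lines of standard-library Python. The sketch below is only an illustration: it assumes a plain regex tokenizer as a stand-in for the spaCy pipeline used later in this notebook, and `word_stats` and `STOP_WORDS` are hypothetical names, not part of the project's code.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}

def word_stats(text, top_n=3):
    """Return (total word count, top_n most common non-stop words)."""
    # Lowercase the text and keep runs of letters/apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return len(words), counts.most_common(top_n)

sample = "The cell is the unit of life. The cell divides and the cell grows."
num_words, common = word_stats(sample)
print(num_words, common)
```

In the real pipeline the counts would more likely come from spaCy tokens or lemmas, which would also make it possible to count entities rather than raw words.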

In [18]:
import os
import re
import io
import pickle
import pandas as pd
import spacy
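import spotlight  # pyspotlight, a client for the DBpedia Spotlight REST API

input_dir = 'textbooks/'
data_dir = 'data/'
metadata_file = os.path.join(data_dir, 'metadata.csv')
toc_file = 'toc.pkl'
text_file = 'text.pkl'
# DBpedia Spotlight annotate endpoint
spotlight_server = 'http://192.168.99.101:2222/rest/annotate'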

## Load data and spaCy model

In [2]:
isbns = os.listdir(input_dir)

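# Load the pickled tables of contents
with open(os.path.join(data_dir, toc_file), 'rb') as fp:
    all_toc = pickle.load(fp)

# Load the pickled page texts, keyed by ISBN (one entry per page)
with open(os.path.join(data_dir, text_file), 'rb') as fp:
    all_text = pickle.load(fp)

nlp = spacy.load('en_core_web_md')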

## Load metadata and calculate number of pages

In [3]:
# Read ISBN as a string so it matches the keys of all_text
metadata = pd.read_csv(metadata_file, dtype={'ISBN': str})
metadata['num_pages'] = [len(all_text[isbn]) for isbn in metadata['ISBN']]
metadata

Unnamed: 0,ISBN,title,author,imprint,sold_by,start_page,end_page,num_pages
0,9781429219617,BIOLOGY OF PLANTS,PETER H RAVEN,FREEMAN/WORTH,Macmillan Higher Education,21,747,863
1,9781429242301,INTRODUCING PSYCHOLOGY,DANIEL L SCHACTER,FREEMAN/WORTH,Macmillan Higher Education,38,526,616
2,9781429298643,LIFE: THE SCIENCE OF BIOLOGY,DAVID E SADAVA,FREEMAN/WORTH,Macmillan Higher Education,51,1297,1447
3,9781429298902,PSYCHOLOGY: A CONCISE INTRODUCTION,RICHARD A GRIGGS,WORTH PUBLISHERS,Macmillan Higher Education,22,464,545
4,9781464126147,MOLECULAR BIOLOGY: PRINCIPLES AND PRACTICE,MICHAEL M COX,W. H. FREEMAN,Macmillan Higher Education,30,828,934
5,9781464135958,WHAT IS LIFE? A GUIDE TO BIOLOGY,JAY PHELAN,FREEMAN/WORTH,Macmillan Higher Education,34,718,773
6,9781464140815,PSYCHOLOGY,DAVID G MYERS,FREEMAN/WORTH,Macmillan Higher Education,59,751,985
7,9781464154072,EXPLORING PSYCHOLOGY,DAVID G MYERS,WORTH PUBLISHERS,Macmillan Higher Education,59,662,892
8,9781464171703,ABNORMAL PSYCHOLOGY,RONALD J COMER,WORTH PUBLISHERS,Macmillan Higher Education,33,699,852


In [4]:
isbn = '9781429242301'  # INTRODUCING PSYCHOLOGY
start_page = metadata.loc[metadata['ISBN'] == isbn, 'start_page'].values[0]
end_page = metadata.loc[metadata['ISBN'] == isbn, 'end_page'].values[0]

# Treat start_page and end_page as 1-indexed and inclusive, then join the pages
text = all_text[isbn][(start_page - 1):end_page]
text = ' '.join(text)
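
The slice in this cell converts a 1-indexed, inclusive page range into Python's 0-indexed, end-exclusive slicing. A minimal sketch of the same step as a reusable function follows. `body_text` is a hypothetical helper name, and it assumes `pages` is a list with one string per page, as `all_text[isbn]` appears to be.

```python
def body_text(pages, start_page, end_page):
    """Join pages start_page..end_page (1-indexed, inclusive) into one string."""
    # Fail loudly on a bad range instead of silently returning a short slice.
    if not 1 <= start_page <= end_page <= len(pages):
        raise ValueError("page range falls outside the document")
    return " ".join(pages[start_page - 1:end_page])

pages = ["front matter", "chapter one", "chapter two", "index"]
body = body_text(pages, 2, 3)
print(body)
```

The explicit range check is a design choice for this sketch: a metadata typo would otherwise produce an empty or truncated text without any error.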

In [15]:
doc = nlp(text)

In [27]:
token = doc[2]
print(token)
sentence = next(doc.sents)
print(sentence)
print([word.lemma_ for word in sentence])

Psychology
1 䉱 Psychology’s Roots:
['1', '䉱', 'psychology', '’s', 'root', ':']


In [25]:
test_text = (
    "President Obama called Wednesday on Congress to extend a tax break"
    " for students included in last year's economic stimulus package,"
    " arguing that the policy provides more generous assistance"
)
# confidence and support are DBpedia Spotlight filtering thresholds
annotations = spotlight.annotate(spotlight_server,
                                 test_text,
                                 confidence=0.4, support=20)
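
Each annotation returned by `spotlight.annotate` is a dict with keys such as `URI`, `surfaceForm`, and `similarityScore`, as the output of the next cell shows. The sketch below shows one way to turn such a list into entity counts, assuming that structure. `count_entities` and the `min_score` threshold are hypothetical and not part of pyspotlight, and the sample data is hand-written rather than real server output.

```python
from collections import Counter

def count_entities(annotations, min_score=0.9):
    """Count DBpedia URIs among annotations scoring at least min_score."""
    return Counter(
        a["URI"] for a in annotations if a["similarityScore"] >= min_score
    )

# Hand-written sample shaped like the Spotlight output (scores rounded).
sample_annotations = [
    {"URI": "http://dbpedia.org/resource/Tax",
     "surfaceForm": "tax", "similarityScore": 0.976},
    {"URI": "http://dbpedia.org/resource/Tax",
     "surfaceForm": "taxes", "similarityScore": 0.95},
    {"URI": "http://dbpedia.org/resource/Economic_Stimulus_Act_of_2008",
     "surfaceForm": "economic stimulus", "similarityScore": 0.624},
]
entity_counts = count_entities(sample_annotations)
print(entity_counts.most_common())
```

Counting by `URI` rather than by `surfaceForm` groups different spellings of the same entity, which may suit the "most common entities by textbook" deliverable.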

In [26]:
annotations

[{'URI': 'http://dbpedia.org/resource/President_of_the_United_States',
  'offset': 0,
  'percentageOfSecondRank': 0.008381875965542296,
  'similarityScore': 0.9907347787815642,
  'support': 33769,
  'surfaceForm': 'President',
  'types': ''},
 {'URI': 'http://dbpedia.org/resource/United_States_Congress',
  'offset': 36,
  'percentageOfSecondRank': 0.0033682306283999795,
  'similarityScore': 0.9951933315366982,
  'support': 19906,
  'surfaceForm': 'Congress',
  'types': 'DBpedia:Agent,Schema:Organization,DBpedia:Organisation,DBpedia:Legislature'},
 {'URI': 'http://dbpedia.org/resource/Tax',
  'offset': 57,
  'percentageOfSecondRank': 0.013863102398633843,
  'similarityScore': 0.9762126993833549,
  'support': 9470,
  'surfaceForm': 'tax',
  'types': ''},
 {'URI': 'http://dbpedia.org/resource/Economic_Stimulus_Act_of_2008',
  'offset': 104,
  'percentageOfSecondRank': 0.6035320849660459,
  'similarityScore': 0.6236230719155123,
  'support': 35,
  'surfaceForm': 'economic stimulus',
  'typ