# 01. Textbook Problem Extraction and EDA

Our goal for this project is to be able to balance a chemical equation that may or may not be presented in plain English. This will involve several steps:

- parsing text

In this notebook we will use `tika` to extract back-of-the-chapter problems from textbook `.pdf` files. As an initial comparison, we'll limit our scope to the following topics: `chemical reactions` vs. `quantum mechanics`. To make the processing a little easier, the pages containing the problems have been extracted from the main textbook files using `Preview` (any `pdf` editor should work).

In [2]:
import os                               
import re                   
import time                 # to stall requests (just in case)

import pandas as pd 
import chemdataextractor as cde     # chemistry parser

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

import tika                 # to initiate tika server
from tika import parser     # the specific parser method      

pd.set_option('display.max_colwidth', 0)    # no max column width
pd.set_option('display.max_rows', 150)      



In [4]:
filepaths = []

for file in os.listdir('../data/textbook-problems/'):
    if file.endswith('.pdf'):       # match the extension, not any name containing 'pdf'
        filepaths.append('../data/textbook-problems/'+file)
        
filepaths

['../data/textbook-problems/zumdahl-6.pdf',
 '../data/textbook-problems/bauer-7.pdf',
 '../data/textbook-problems/zumdahl-11.pdf',
 '../data/textbook-problems/tro-7.pdf',
 '../data/textbook-problems/bauer-5.pdf',
 '../data/textbook-problems/tro-9.pdf']
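
Filtering on the file extension and deriving a short name from each path can be done with `os.path` alone. The sketch below is only illustrative: `pdf_stem` is a hypothetical helper, and the second and third entries in `names` are made-up examples of files that should be skipped.

```python
import os

def pdf_stem(filename):
    """Return the file name without '.pdf', or None if the file is not a PDF."""
    stem, ext = os.path.splitext(os.path.basename(filename))
    return stem if ext.lower() == '.pdf' else None

names = ['../data/textbook-problems/zumdahl-6.pdf', 'pdf-notes.txt', '.DS_Store']
stems = [pdf_stem(n) for n in names]
```

Only the first entry yields a name (`'zumdahl-6'`); the other two return `None` even though one of them contains the substring `pdf`.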

## Examine the Text

In [5]:
def grab_text(file, sleep=5, counter=0):
    
    if counter == 2:        # so we stop the recursive function
        print(f"'{file}' could not be opened; giving up.")
        return None

    # grab the raw text using parser.from_file()
    raw = parser.from_file(file)    
    status = raw['status']          # returns the status code from tika server
    
    # if things go well, return the raw text
    if status == 200:               
        print(f"'{file}' successfully opened!")
        return raw['content']
    
    # if things don't go well, pause for `sleep` seconds (five by default) and try again
    # we might not need this code, but it's useful for other server calls
    else:                           
        print(f'! ! ! ! error code {status} ! ! ! !')
        print('! ! ! ! trying again ! ! ! !')
        
        time.sleep(sleep)
        counter += 1
        
        # calls grab_text again; gives up once counter reaches 2 (two attempts in total)
        return grab_text(file, sleep=sleep, counter=counter) 
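
The same retry idea can be written as a loop instead of recursion. This is a minimal sketch, assuming a callable that returns a dict with `status` and `content` keys, as `parser.from_file()` does above. The names `fetch_with_retries`, `max_tries`, and `fake_fetch` are illustrative and not part of `tika`.

```python
import time

def fetch_with_retries(fetch, file, max_tries=3, sleep=0):
    """Call fetch(file) until it reports status 200 or max_tries is reached."""
    for attempt in range(1, max_tries + 1):
        raw = fetch(file)
        if raw['status'] == 200:
            return raw['content']
        print(f'attempt {attempt} failed with status {raw["status"]}')
        time.sleep(sleep)       # wait before the next attempt
    return None                 # every attempt failed

# stand-in for parser.from_file(): fails once, then succeeds
calls = []
def fake_fetch(file):
    calls.append(file)
    status = 200 if len(calls) > 1 else 503
    return {'status': status, 'content': 'problem text'}

result = fetch_with_retries(fake_fetch, 'example.pdf')
```

Passing the fetch function in as an argument keeps the retry logic testable without a running server.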

We will be using `regex` to extract the problems from each file. Different textbooks may label their questions differently, so we will need to examine how each textbook is formatted so we only grab the relevant information.

In [6]:
test = grab_text('../data/textbook-problems/bauer-7.pdf')
test

'../data/textbook-problems/bauer-7.pdf' successfully opened!


'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n9780073511078.pdf\n\n\n282 Chapter 7 Electron Structure of the Atom\n\nElectromagnetic Radiation and Energy\n 7.3 List three types of electromagnetic radiation that have \n\nlonger wavelengths than visible light.\n 7.4 List three types of electromagnetic radiation that have \n\nhigher frequencies than visible light.\n 7.5 Draw a picture of two waves, one with twice the \n\nfrequency of the other. Label the wave with the \nhigher frequency.\n\n 7.6 Draw a picture of two waves, one with three times \nthe wavelength of the other. Label the wave with the \nlonger wavelength.\n\n 7.7 List the following colors of visible light from \nshortest wavelength to longest wavelength: blue, orange, \nyellow, red.\n\n 7.8 List the following colors of visible light from \nlowest frequency to highest frequency: orange, green, \nviolet, yellow.\n\n 7.9 What does it mean when we say that wavelength is \ninversely proportional to freq

In this textbook, each new problem begins with a line break, the chapter number, a period, and the problem number:

`\n 7.4 List three types of electromagnetic radiation that have \n\nhigher frequencies than visible light.\n 7.5 Draw a picture of two waves, one with twice the \n\nfrequency of the other. Label the wave with the \nhigher frequency.`

Notice how there are also line breaks in the problem itself. These shouldn't be included in the cleaned text.
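
To make the pattern concrete, here is a small sketch that splits the excerpt above on its `\n 7.4 `-style labels and then collapses the line breaks left inside each problem. The `sample` string is copied from the excerpt; the variable names are only illustrative.

```python
import re

sample = ('\n 7.4 List three types of electromagnetic radiation that have \n\n'
          'higher frequencies than visible light.\n 7.5 Draw a picture of two '
          'waves, one with twice the \n\nfrequency of the other. Label the wave '
          'with the \nhigher frequency.')

# line break, whitespace, chapter number, period, problem number, whitespace
label = re.compile(r'\n\s[0-9]+\.[0-9]+\s')

# split on the labels, then collapse the whitespace left inside each problem
problems = [re.sub(r'\s+', ' ', p).strip() for p in label.split(sample) if p.strip()]
```

This yields two problems, each on a single line with the chapter and problem numbers removed.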

Every textbook has its own way of formatting the problems.

## Clean the Text

The following function attempts to catch the label variations seen in these files.

In [7]:
# https://stackoverflow.com/questions/44333462/
def make_problems(document):
    
    # raw strings (r'...') pass the backslashes to the regex engine unchanged
    
    # '\n 7.16 ' => 'PPRROOBBLLEEMM'
    clean = re.sub(r'\n\s[0-9]+\.[0-9]+\s', 'PPRROOBBLLEEMM', document)
    
    # '  16. ' => 'PPRROOBBLLEEMM'
    clean = re.sub(r'\s\s[0-9]+\.\s', 'PPRROOBBLLEEMM', clean)
    
    # '\n\n 16. ' => 'PPRROOBBLLEEMM'
    clean = re.sub(r'\n\n\s[0-9]+\. ', 'PPRROOBBLLEEMM', clean)
    
    # '\n16. ' => 'PPRROOBBLLEEMM'
    clean = re.sub(r'\n([0-9]+)\.\s', 'PPRROOBBLLEEMM', clean)

    # '\na. ' => ' '
    # joins sub-problems with the main problem
    # caveat: [a-z]+ also matches a lowercase word that sits alone on a line
    # before a period (e.g. '\nknown. '), so such words are dropped
    clean = re.sub(r'\n([a-z]+)\.\s', ' ', clean)
    
    clean = re.sub(r'-\n', '', clean)       # join hyphenated words
    clean = re.sub(r'\n', ' ', clean)       # treat line breaks as spaces
    clean = re.sub(r'\t', ' ', clean)       # treat tabs as spaces
    clean = re.sub(r'\s\s', ' ', clean)     # replace each pair of spaces with one (longer runs are only halved)

    clean = re.split('PPRROOBBLLEEMM', clean)   # split by problem
    
    return clean

In [8]:
make_problems(test)

['                  9780073511078.pdf  282 Chapter 7 Electron Structure of the Atom Electromagnetic Radiation and Energy',
 'List three types of electromagnetic radiation that have  longer wavelengths than visible light.',
 'List three types of electromagnetic radiation that have  higher frequencies than visible light.',
 'Draw a picture of two waves, one with twice the  frequency of the other. Label the wave with the higher frequency. ',
 'Draw a picture of two waves, one with three times the wavelength of the other. Label the wave with the longer wavelength. ',
 'List the following colors of visible light from shortest wavelength to longest wavelength: blue, orange, yellow, red. ',
 'List the following colors of visible light from lowest frequency to highest frequency: orange, green, violet, yellow. ',
 'What does it mean when we say that wavelength is inversely proportional to frequency? ',
 ' What does it mean when we say that photon energy is proportional to frequency? ',
 'What t
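
The sub-problem pattern in `make_problems` (`\n([a-z]+)\.\s`) has a side effect worth seeing in isolation: `[a-z]+` matches any run of lowercase letters, so a line holding a single word that ends a sentence is removed along with real sub-problem labels. The sketch below contrasts it with a single-letter variant. The `text` string and both pattern names are made up for illustration, and the stricter pattern is an assumption about what the labels look like, not something checked against every textbook.

```python
import re

text = 'its harmful properties became well\nknown. Carbon tetrachloride\na. is a solvent'

loose = re.compile(r'\n([a-z]+)\.\s')    # pattern used in make_problems
strict = re.compile(r'\n[a-z]\.\s')      # hypothetical variant: one letter only

loose_out = loose.sub(' ', text)
strict_out = strict.sub(' ', text).replace('\n', ' ')
```

With the loose pattern the word `known.` disappears from the sentence; with the strict one only the `a.` label is removed.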

In [15]:
PROBLEMS = []

for file in filepaths: 
    name = os.path.splitext(os.path.basename(file))[0]     # e.g. 'zumdahl-6'
    text = make_problems(grab_text(file))
    
    for i, problem in enumerate(text):
        problem_dict = {}
        problem_dict['filepath'] = name
        problem_dict['number'] = i
        problem_dict['text'] = problem.strip()
        PROBLEMS.append(problem_dict)

# turn list of dictionaries into DataFrame
df = pd.DataFrame(PROBLEMS, columns=['filepath', 'number', 'text'])

print(df.shape)
df.head()

'../data/textbook-problems/zumdahl-6.pdf' successfully opened!
'../data/textbook-problems/bauer-7.pdf' successfully opened!
'../data/textbook-problems/zumdahl-11.pdf' successfully opened!
'../data/textbook-problems/tro-7.pdf' successfully opened!
'../data/textbook-problems/bauer-5.pdf' successfully opened!
'../data/textbook-problems/tro-9.pdf' successfully opened!
(533, 3)


Unnamed: 0,filepath,number,text
0,zumdahl-6,0,"The oxides of nitrogen (which are common in automobile exhaust gases), in particular, are known to decompose ozone. For example, gaseous nitric oxide (NO) reacts with ozone gas to produce nitrogen dioxide gas and oxygen gas. Write the unbalanced chemical equation for this process."
1,zumdahl-6,1,"Carbon tetrachloride was widely used for many years as a solvent until its harmful properties became well Carbon tetrachloride may be prepared by the reaction of natural gas (methane, CH4) and elemental chlorine gas in the presence of ultraviolet Write the unbalanced chemical equation for this process."
2,zumdahl-6,2,"When elemental phosphorus, P4, burns in oxygen gas, it produces an intensely bright light, a great deal of heat, and massive clouds of white solid phosphorus(V) oxide (P2O5) product. Given these properties, it is not surprising that phosphorus has been used to manufacture incendiary bombs for warfare. Write the unbalanced equation for the reaction of phosphorus with oxygen gas to produce phosphorus(V) oxide."
3,zumdahl-6,3,"Calcium oxide is sometimes very challenging to store in the chemistry laboratory. This compound reacts with moisture in the air and is converted to calcium If a bottle of calcium oxide is left on the shelf too long, it gradually absorbs moisture from the humidity in the laboratory. Eventually the bottle cracks and spills the calcium hydroxide that has been Write the unbalanced chemical equation for this process."
4,zumdahl-6,4,"Although they were formerly called the inert gases, the heavier elements of Group 8 do form relatively stable compounds. For example, at high temperatures in the presence of an appropriate catalyst, xenon gas will combine directly with fluorine gas to produce solid xenon tetrafluoride. Write the unbalanced chemical equation for this process."


### Tokenizing and Removing Stopwords

It may be useful to tokenize the problems. We will use two methods:

- `NLTK` `word_tokenize()`, removing English stopwords
- `ChemDataExtractor` chemical entity mentions

In [16]:
stop_words = stopwords.words('english')
stop_words += ['copyright',
               'cengage',
               'pearson',
               'learning',
               'may',
               'copied',
               'scanned',
               'duplicated',
               'chapter',
               'practice',
               'problem',
               'exercise',
               'review',
               'question',
               'figure']
stop_words

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [17]:
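def remove_stops(doc):
    doc = word_tokenize(doc)
    # note: stop words are checked before lowercasing, so capitalized
    # ones such as 'The' or 'For' are kept
    doc = [w.lower() for w in doc if w not in stop_words] 
    doc = [w for w in doc if w.isalpha()]       # drop tokens with digits or punctuation
    return doc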

In [18]:
df['txt'] = df['text'].apply(remove_stops)
df.head()

Unnamed: 0,filepath,number,text,txt
0,zumdahl-6,0,"The oxides of nitrogen (which are common in automobile exhaust gases), in particular, are known to decompose ozone. For example, gaseous nitric oxide (NO) reacts with ozone gas to produce nitrogen dioxide gas and oxygen gas. Write the unbalanced chemical equation for this process.","[the, oxides, nitrogen, common, automobile, exhaust, gases, particular, known, decompose, ozone, for, example, gaseous, nitric, oxide, no, reacts, ozone, gas, produce, nitrogen, dioxide, gas, oxygen, gas, write, unbalanced, chemical, equation, process]"
1,zumdahl-6,1,"Carbon tetrachloride was widely used for many years as a solvent until its harmful properties became well Carbon tetrachloride may be prepared by the reaction of natural gas (methane, CH4) and elemental chlorine gas in the presence of ultraviolet Write the unbalanced chemical equation for this process.","[carbon, tetrachloride, widely, used, many, years, solvent, harmful, properties, became, well, carbon, tetrachloride, prepared, reaction, natural, gas, methane, elemental, chlorine, gas, presence, ultraviolet, write, unbalanced, chemical, equation, process]"
2,zumdahl-6,2,"When elemental phosphorus, P4, burns in oxygen gas, it produces an intensely bright light, a great deal of heat, and massive clouds of white solid phosphorus(V) oxide (P2O5) product. Given these properties, it is not surprising that phosphorus has been used to manufacture incendiary bombs for warfare. Write the unbalanced equation for the reaction of phosphorus with oxygen gas to produce phosphorus(V) oxide.","[when, elemental, phosphorus, burns, oxygen, gas, produces, intensely, bright, light, great, deal, heat, massive, clouds, white, solid, phosphorus, v, oxide, product, given, properties, surprising, phosphorus, used, manufacture, incendiary, bombs, warfare, write, unbalanced, equation, reaction, phosphorus, oxygen, gas, produce, phosphorus, v, oxide]"
3,zumdahl-6,3,"Calcium oxide is sometimes very challenging to store in the chemistry laboratory. This compound reacts with moisture in the air and is converted to calcium If a bottle of calcium oxide is left on the shelf too long, it gradually absorbs moisture from the humidity in the laboratory. Eventually the bottle cracks and spills the calcium hydroxide that has been Write the unbalanced chemical equation for this process.","[calcium, oxide, sometimes, challenging, store, chemistry, laboratory, this, compound, reacts, moisture, air, converted, calcium, if, bottle, calcium, oxide, left, shelf, long, gradually, absorbs, moisture, humidity, laboratory, eventually, bottle, cracks, spills, calcium, hydroxide, write, unbalanced, chemical, equation, process]"
4,zumdahl-6,4,"Although they were formerly called the inert gases, the heavier elements of Group 8 do form relatively stable compounds. For example, at high temperatures in the presence of an appropriate catalyst, xenon gas will combine directly with fluorine gas to produce solid xenon tetrafluoride. Write the unbalanced chemical equation for this process.","[although, formerly, called, inert, gases, heavier, elements, group, form, relatively, stable, compounds, for, example, high, temperatures, presence, appropriate, catalyst, xenon, gas, combine, directly, fluorine, gas, produce, solid, xenon, tetrafluoride, write, unbalanced, chemical, equation, process]"
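
In the `txt` column above, tokens such as `the`, `for`, and `when` survive because sentence-initial words are compared against the stop-word list before they are lowercased. Here is a minimal sketch of the alternative order, lowercasing first. To stay self-contained it uses a regex tokenizer and a tiny hand-written stop-word set in place of `word_tokenize` and the NLTK list, so `remove_stops_lower` and `SMALL_STOPS` are illustrative only.

```python
import re

SMALL_STOPS = {'the', 'of', 'for', 'with', 'to', 'and'}   # stand-in for stop_words

def remove_stops_lower(doc, stops=SMALL_STOPS):
    """Lowercase first, then drop stop words and non-alphabetic tokens."""
    tokens = re.findall(r"[A-Za-z0-9']+", doc.lower())
    return [w for w in tokens if w.isalpha() and w not in stops]

tokens = remove_stops_lower('The oxides of nitrogen. For example, NO reacts with ozone.')
```

One trade-off to keep in mind: with the full NLTK list, lowercasing first would also remove the formula `NO`, because `no` is an NLTK stop word, so chemical formulas may need to be protected before this step.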


In [19]:
df['text'] = df['text'].apply(sent_tokenize)
df.head()

Unnamed: 0,filepath,number,text,txt
0,zumdahl-6,0,"[The oxides of nitrogen (which are common in automobile exhaust gases), in particular, are known to decompose ozone., For example, gaseous nitric oxide (NO) reacts with ozone gas to produce nitrogen dioxide gas and oxygen gas., Write the unbalanced chemical equation for this process.]","[the, oxides, nitrogen, common, automobile, exhaust, gases, particular, known, decompose, ozone, for, example, gaseous, nitric, oxide, no, reacts, ozone, gas, produce, nitrogen, dioxide, gas, oxygen, gas, write, unbalanced, chemical, equation, process]"
1,zumdahl-6,1,"[Carbon tetrachloride was widely used for many years as a solvent until its harmful properties became well Carbon tetrachloride may be prepared by the reaction of natural gas (methane, CH4) and elemental chlorine gas in the presence of ultraviolet Write the unbalanced chemical equation for this process.]","[carbon, tetrachloride, widely, used, many, years, solvent, harmful, properties, became, well, carbon, tetrachloride, prepared, reaction, natural, gas, methane, elemental, chlorine, gas, presence, ultraviolet, write, unbalanced, chemical, equation, process]"
2,zumdahl-6,2,"[When elemental phosphorus, P4, burns in oxygen gas, it produces an intensely bright light, a great deal of heat, and massive clouds of white solid phosphorus(V) oxide (P2O5) product., Given these properties, it is not surprising that phosphorus has been used to manufacture incendiary bombs for warfare., Write the unbalanced equation for the reaction of phosphorus with oxygen gas to produce phosphorus(V) oxide.]","[when, elemental, phosphorus, burns, oxygen, gas, produces, intensely, bright, light, great, deal, heat, massive, clouds, white, solid, phosphorus, v, oxide, product, given, properties, surprising, phosphorus, used, manufacture, incendiary, bombs, warfare, write, unbalanced, equation, reaction, phosphorus, oxygen, gas, produce, phosphorus, v, oxide]"
3,zumdahl-6,3,"[Calcium oxide is sometimes very challenging to store in the chemistry laboratory., This compound reacts with moisture in the air and is converted to calcium If a bottle of calcium oxide is left on the shelf too long, it gradually absorbs moisture from the humidity in the laboratory., Eventually the bottle cracks and spills the calcium hydroxide that has been Write the unbalanced chemical equation for this process.]","[calcium, oxide, sometimes, challenging, store, chemistry, laboratory, this, compound, reacts, moisture, air, converted, calcium, if, bottle, calcium, oxide, left, shelf, long, gradually, absorbs, moisture, humidity, laboratory, eventually, bottle, cracks, spills, calcium, hydroxide, write, unbalanced, chemical, equation, process]"
4,zumdahl-6,4,"[Although they were formerly called the inert gases, the heavier elements of Group 8 do form relatively stable compounds., For example, at high temperatures in the presence of an appropriate catalyst, xenon gas will combine directly with fluorine gas to produce solid xenon tetrafluoride., Write the unbalanced chemical equation for this process.]","[although, formerly, called, inert, gases, heavier, elements, group, form, relatively, stable, compounds, for, example, high, temperatures, presence, appropriate, catalyst, xenon, gas, combine, directly, fluorine, gas, produce, solid, xenon, tetrafluoride, write, unbalanced, chemical, equation, process]"


In [57]:
# 'text' now holds a list of sentences, so join them back into one string
r = cde.doc.Paragraph(' '.join(df.loc[0, 'text']))
s = list(r.pos_tagged_tokens)
s

[[('The', 'DT'),
  ('oxides', 'NNS'),
  ('of', 'IN'),
  ('nitrogen', 'NN'),
  ('(', '-LRB-'),
  ('which', 'WDT'),
  ('are', 'VBP'),
  ('common', 'JJ'),
  ('in', 'IN'),
  ('automobile', 'NN'),
  ('exhaust', 'NN'),
  ('gases', 'NNS'),
  (')', '-RRB-'),
  (',', ','),
  ('in', 'IN'),
  ('particular', 'JJ'),
  (',', ','),
  ('are', 'VBP'),
  ('known', 'VBN'),
  ('to', 'TO'),
  ('decompose', 'VB'),
  ('ozone', 'NN'),
  ('.', '.')],
 [('For', 'IN'),
  ('example', 'NN'),
  (',', ','),
  ('gaseous', 'JJ'),
  ('nitric', 'JJ'),
  ('oxide', 'NN'),
  ('(', '-LRB-'),
  ('NO', 'NN'),
  (')', '-RRB-'),
  ('reacts', 'VBZ'),
  ('with', 'IN'),
  ('ozone', 'NN'),
  ('gas', 'NN'),
  ('to', 'TO'),
  ('produce', 'VB'),
  ('nitrogen', 'NN'),
  ('dioxide', 'NN'),
  ('gas', 'NN'),
  ('and', 'CC'),
  ('oxygen', 'NN'),
  ('gas', 'NN'),
  ('.', '.')],
 [('Write', 'VB'),
  ('the', 'DT'),
  ('unbalanced', 'JJ'),
  ('chemical', 'NN'),
  ('equation', 'NN'),
  ('for', 'IN'),
  ('this', 'DT'),
  ('process', 'NN'),
  ('.

In [58]:
s = r.pos_tagged_tokens

In [63]:
s[0][0][0]

'The'

In [54]:
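# `s` must hold plain strings (e.g. chemical names) for this membership test;
# against the POS-tagged tuples assigned above it never matches
t = []
for w in df.loc[0, 'txt']:
    if w not in s:
        t.append(w)
    else:
        t.append('CPD')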

In [55]:
t

['the',
 'CPD',
 'CPD',
 'common',
 'automobile',
 'exhaust',
 'gases',
 'particular',
 'known',
 'decompose',
 'CPD',
 'for',
 'example',
 'gaseous',
 'nitric',
 'oxide',
 'no',
 'reacts',
 'CPD',
 'gas',
 'produce',
 'CPD',
 'dioxide',
 'gas',
 'CPD',
 'gas',
 'write',
 'unbalanced',
 'chemical',
 'equation',
 'process']
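
The replacement step can be sketched on its own, assuming the chemical names are available as a set of lowercase strings. `chemical_names` below is a hand-written placeholder, not output from `ChemDataExtractor`, and `mask_compounds` is a hypothetical helper name.

```python
chemical_names = {'nitrogen', 'ozone', 'oxygen'}    # hypothetical lookup set

def mask_compounds(tokens, names, placeholder='CPD'):
    """Replace every token found in `names` with the placeholder."""
    return [placeholder if w in names else w for w in tokens]

masked = mask_compounds(['nitric', 'oxide', 'reacts', 'ozone', 'gas'], chemical_names)
```

Using a set keeps each membership test constant-time, which matters once the lookup holds many compound names.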

In [96]:
df['filepath'].value_counts()

bauer-7       120
tro-7         99 
zumdahl-11    91 
tro-9         90 
bauer-5       78 
zumdahl-6     49 
Name: filepath, dtype: int64

## Labeling Problems

In [98]:
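# 'text' holds lists of sentences at this point, so join them before tokenizing
df['txt'] = df['text'].apply(lambda sents: remove_stops(' '.join(sents)))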
df.reset_index(drop=True, inplace=True)

In [99]:
def check(df):
    print(df.shape)
    return df.head()

In [101]:
# code adapted from Dae H, Sophia A, Sonam T : LA

def label_problem(df, start=0, problem_col='text'):
    # assumes df has a default 0..n-1 index (see reset_index above)

    balancing = []
    e_config = []
    
    # when resuming, carry over the labels already stored for rows before `start`
    if start > 0:
        for i in range(start):
            balancing.append(df.loc[i, 'balancing'])
            e_config.append(df.loc[i, 'e_config'])

    for i in range(start, len(df)):
        print('=====================')
        print()
        print(f'{i+1}/{len(df)}')
        print(df.loc[i, problem_col])
        print()
        bal = input('answer specifically asks for balanced equation: 1 -- ')
        if bal == "stop":
            print('function stopped')
            print(f'stopped at index {i}.')
            break
        print()
        ec = input('answer specifically asks for electron configuration: 1 -- ')
        if ec == "stop":
            print('function stopped')
            print(f'stopped at index {i}.')
            break
        
        # only record a row once both questions have been answered
        balancing.append(bal)
        e_config.append(ec)
    
    # pad unlabeled rows with 999 so both lists match the DataFrame length
    remaining = len(df) - len(balancing)
    balancing += [999] * remaining
    e_config += [999] * remaining
    
    df['balancing'] = balancing
    df['e_config'] = e_config
    
    return df
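
Because `label_problem` relies on `input()`, it is awkward to test. One option is to pass the prompt function in as a parameter so a scripted list of answers can stand in for the keyboard. This is a rough sketch under that assumption: `label_rows`, `ask`, and `sentinel` are illustrative names, and the example is simplified to a single question per problem.

```python
def label_rows(texts, ask=input, sentinel=999):
    """Ask one question per text; 'stop' ends early and pads with the sentinel."""
    labels = []
    for text in texts:
        answer = ask(f'{text}\nasks for balanced equation? 1 -- ')
        if answer == 'stop':
            break
        labels.append(answer)
    # unlabeled rows get the sentinel so the list matches the input length
    labels += [sentinel] * (len(texts) - len(labels))
    return labels

# scripted answers replace the keyboard: label two rows, then stop
scripted = iter(['1', '', 'stop'])
labels = label_rows(['p1', 'p2', 'p3', 'p4'], ask=lambda prompt: next(scripted))
```

The two answered rows keep their labels and the remaining two are padded with `999`, mirroring the early-stop behavior of `label_problem`.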

In [None]:
# df = label_problem(df)

In [102]:
df.dropna(axis=0, inplace=True)

In [100]:
df.to_csv('../data/textbook-problems.csv', index=False)
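
List-valued columns such as `txt` are written to CSV as their string representation, so they come back as strings on reload. Below is a hedged sketch of restoring them with `ast.literal_eval`. It uses the standard-library `csv` module and an in-memory buffer rather than `pandas` and the real file, and the single row in `rows` is a made-up example.

```python
import ast
import csv
import io

rows = [{'filepath': 'zumdahl-6', 'number': 0, 'txt': ['oxides', 'nitrogen']}]

# write: the list is stored as the string "['oxides', 'nitrogen']"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['filepath', 'number', 'txt'])
writer.writeheader()
writer.writerows(rows)

# read back and turn the string into a list again
buffer.seek(0)
restored = [dict(r, txt=ast.literal_eval(r['txt'])) for r in csv.DictReader(buffer)]
```

`ast.literal_eval` only parses Python literals, which makes it a safer choice than `eval` for this kind of round trip.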