# How can Natural Language Processing (NLP) reframe scholarship across disciplines?

## Introduction

When I started programming, it was to make tools that made my life as a humanist easier. For this talk, I wanted to put together a tool that I came up with about five years ago, while studying abroad at St. Andrews. Back then, the intro to computer science class had me stumped, but now, with the programming skills that I have built up over the past two and a half years, I decided to go back and try to make my idea a reality.

What I'll be showing: a search engine for a specific text
* It will take two inputs:
    1. the text, split into sentences
    2. a search term
* It will give the user back a list of relevant sentences, ranked by their similarity to that search term

This system allows the humanist to pick out important yet hidden sections of text, instead of agonizing over dogeared pages and trying to remember where a certain line was written. 

I wanted to focus on this tool because I think that walking you through how I built it will give you a good overview of how NLP methods can encourage interdisciplinary research. But I want to give you a little demo so that you understand the goal and know what to expect as we go on.

In [None]:
from search import searchEngine
import spacy
import pandas as pd
nlp = spacy.load('en_core_web_md')
df = pd.read_csv('ushh_sentsrand200.csv')

ushh_search = searchEngine(df=df, nlp=nlp)

We will talk at length about this dataset in a couple minutes, but to put it simply, it is made up of 200 random US congressional hearings. Because these hearings span a long time, we can query this search engine to discover what certain congresspeople said or asked about a variety of contemporary issues.

In [None]:
search_term = 'family debt relief'
search = ushh_search.search('serialized_data/rand200_output', search_term,context_size=3,title='title',date='date',speaker='speaker')
ushh_search.displaySearch(search[0],search[1])

So if you are a data scientist or interested in the implementation of this search engine, I have all of the code that I used in this notebook, as well as a module that you can call from the search.py file. All of the code for this presentation can be found at my GitHub: INCLUDE LINK!!!!

Otherwise, if you are a humanist who is interested in how to approach traditional questions in a novel way, I will show you how I thought about this problem, how I devised a solution to it, and how you might address a research question of your own in the same way.

In [None]:
## all imports for the notebook
import requests
from bs4 import BeautifulSoup
import pandas as pd
import re
import nltk
from tqdm import tqdm
from nltk.tokenize import sent_tokenize
import spacy
import warnings
from IPython.display import display, HTML
import matplotlib.pyplot as plt
import geopandas as gpd
import collections

warnings.filterwarnings('ignore')
nltk.download('punkt')
tqdm.pandas()

## United States Congressional Hearings

In this section, I will show you methods to get and clean data and then how to prepare it for text processing. To create our search engine, the computer needs to turn all of the words of a text into a list of numbers called a vector. These vectors represent the semantic meaning of a word or sentence. We will be using spaCy's medium English model to generate these vectors, but before we can, we need to make sure our data is in the correct form.

From the Government Publishing Office: *A hearing is a meeting or session of a Senate, House, joint, or special committee of Congress, usually open to the public, to obtain information and opinions on proposed legislation, conduct an investigation, or evaluate/oversee the activities of a government department or the implementation of a Federal law.*

These documents cover a huge range of subjects from maritime law to medication regulation. Encoded in these documents are what our congresspeople think about certain issues. Thus, our search engine will take in any word or phrase and return what opinions on that subject certain congresspeople have. In a certain sense, these views are available to anyone, but without NLP tools and methods they remain inaccessible to most researchers.

In [None]:
## archive.org mirror of the data
from IPython.display import IFrame

IFrame('https://archive.org/details/us_house_hearings', width=1800, height=600)

#### Preprocessing
I had to do a lot of data cleaning to get this dataset into a form that we could analyze. <br>
If you are interested in webscraping or data structures, I would encourage you to read the following. 

In [None]:
def scrapeArchive(url):
  r = requests.get(url)
  r.encoding = 'utf-8'
  ## name the parser explicitly so the result does not depend on which parsers are installed
  soup = BeautifulSoup(r.text, 'html.parser')

  url_dict = {}
  for div in soup.find_all('div', class_='C234'):
    link = div.find('a')
    ## skip result tiles that do not contain a link
    if link is None:
      continue
    ## turn the item's /details/ page into a direct link to its .htm transcript
    download = link['href'].replace('/details/', '/download/')
    download = f'{download}/{download.split(".")[-1]}.htm'
    url_dict[link.find('div', class_='ttl').text.strip()] = download
  ushh = pd.DataFrame.from_dict(url_dict, orient='index').reset_index().rename(columns={'index':'title', 0:'url'})
  return ushh

The code above strips all of the links to the raw textual data and the title of the hearing and deposits them into a dataframe, 

In [None]:
def getText(title, url):
  r = requests.get(f'http://www.archive.org{url}')
  r.encoding = 'utf-8'
  soup = BeautifulSoup(r.text, 'html.parser')
  ## the transcript, when there is one, sits inside a <pre> tag
  if soup.pre is not None:
    text = soup.pre.text.replace('                                ','').replace('\n','')
    if title.upper() in text:
      ## keep the span from the title to the submissions for the record, minus image placeholders
      text = text[text.find(title.upper()):text.find('[Submissions for the record follow.]')].replace('TIFF','').replace('OMITTED','').replace('[GRAPHIC] [ ]', '')
      return text 
    else:
      return "No data."

while this function can be applied to each of the links to retrieve the text and put it in the same dataframe.

In [None]:
DFs = []
for i in tqdm(range(133)):
  url = f'https://archive.org/details/us_house_hearings?page={i+1}'
  DFs.append(scrapeArchive(url))
ushh = pd.concat(DFs).reset_index(drop=True)

ushh['text'] = ushh.progress_apply(lambda x: getText(x['title'],x['url']), axis=1)
ushh = ushh.drop(ushh.loc[ushh['text'] == 'No data.'].index).dropna().reset_index(drop=True)

Originally, this method took 3 hours, 52 minutes and 24 seconds to complete, so below I will load in the completed dataset, serialized as a CSV file.

In [None]:
ushh = pd.read_csv('ushh827.csv').drop('Unnamed: 0', axis=1).dropna().reset_index(drop=True)
ushh

Great! We have the text in a structured form. Now, we're done, right? .... Right?

In [None]:
def make_html(image):
     return f'<img src="{image}" style="display:inline;margin:1px"/>'
files = ['ushh ex1.PNG','ushh ex2.PNG','ushh ex3.PNG']

display(HTML('<p>Example of header</p>'))
display(HTML(make_html(files[0])))
display(HTML('<p>Example of unuseful pieces of data</p>'))
display(HTML(make_html(files[1])))
display(HTML('<p>Example of how the transcript looks</p>'))
display(HTML(make_html(files[2])))

In fact, our data wrangling has just begun. Now we need to:
* Remove any documents that have no data
* Clean the documents for NLP tools
* Split up documents into speeches and then sentences

In [None]:
## Problem: many documents have no substantive text that we can use for our tool
## Solution: only full hearing transcripts have a date
ushh['text_check'] = ushh['text'].apply(lambda x: bool(re.search(r'[A-Z]+,\s[A-Z]+\s\d{1,2},\s\d{4}', x)))
ushh = ushh.drop(ushh.loc[ushh['text_check'] == False].index)
ushh = ushh.drop('text_check', axis=1)

## Problem: all of the documents have a long header that will skew our results
## Solution: split the header off, on the date we checked for above 
def cleanEntry(text):
    ## \d{1,2} accepts one- and two-digit days, matching the date check above
    header = re.split(r'(?=([A-Z]+,\s[A-Z]+\s\d{1,2},\s\d{4}))',text)[0]
    clean_full = text[len(header):]
    return header, clean_full
ushh['clean_tup'] = ushh.text.apply(cleanEntry)
ushh['header'] = ushh['clean_tup'].apply(lambda x: x[0])
ushh['clean_full'] = ushh['clean_tup'].apply(lambda x: x[1])

## Problem: we want to look at each sentence, but the document is split into speeches
## Solution: each speech begins with the speaker's name
## note: re.split keeps the captured speaker labels in its output, plus None for the alternatives that did not match
speaker_pattern = r'(M[a-z]\.\s[A-Z][a-z]+\.)|(Senator\s[A-Z][a-z]+\.)|(Dr\.\s[A-Z][a-z]+\.)'
date_pattern = r'([A-Z]+,\s[A-Z]+\s\d{1,2},\s\d{4})'
ushh['clean_split'] = ushh.clean_full.apply(lambda x: re.split(speaker_pattern, x)[1:])
ushh['date'] = ushh.clean_full.apply(lambda x: re.search(date_pattern, x))
ushh_speech_explode = ushh.explode('clean_split').reset_index()
## drop the None placeholders left behind by re.split
ushh_speech_explode['speech_check'] = ushh_speech_explode.clean_split.apply(lambda x: isinstance(x,str))
ushh_speech_explode = ushh_speech_explode.drop(ushh_speech_explode.loc[ushh_speech_explode['speech_check'] == False].index)

ushh_speech_explode['speaker'] = ushh_speech_explode.clean_split.apply(lambda x: re.search(speaker_pattern, x))
ushh_speech_explode = ushh_speech_explode.drop('speech_check',axis=1).reset_index(drop=True)

## Problem: want to look at the sentence level, but the text is divided by speech
## Solution: nltk's sent_tokenize function
ushh_speech_explode['sents'] = ushh_speech_explode.clean_split.apply(sent_tokenize)
ushh_sents = ushh_speech_explode.explode('sents')

## Further cleaning of textual artifacts
ushh_sents = ushh_sents.drop(ushh_sents.loc[ushh_sents['sents'].str.contains('Whereupon')].index)
ushh_sents['sents_check'] = ushh_sents['sents'].apply(lambda x: True if not x.isupper() else False)

ushh_sents = ushh_sents.drop(ushh_sents.loc[ushh_sents['sents_check'] == False].index).drop(['sents_check','text','clean_tup','header','clean_full','url'],axis=1).reset_index(drop=True)
ushh_sents['date'] = ushh_sents['date'].apply(lambda x: x.group(0) if not isinstance(x, type(None)) else 'Not Found')
ushh_sents['speaker'] = ushh_sents['speaker'].apply(lambda x: x.group(0) if not isinstance(x, type(None)) else 'Not Found')

ushh_sents
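
In the cell above, `re.split` returns the speaker labels and the speech text as separate list items, so a label and the words spoken under it end up in different rows. Below is a minimal sketch of one way to keep them together, assuming a single capture group around the same three patterns. The `split_speeches` helper and the sample transcript are made up for illustration and are not part of the original pipeline.

```python
import re

# One capture group, so re.split keeps each matched speaker label in its output
SPEAKER = re.compile(
    r'(M[a-z]\.\s[A-Z][a-z]+\.|Senator\s[A-Z][a-z]+\.|Dr\.\s[A-Z][a-z]+\.)'
)

def split_speeches(transcript):
    """Return (speaker, speech) pairs from a transcript string."""
    parts = SPEAKER.split(transcript)
    # parts = [preamble, speaker 1, speech 1, speaker 2, speech 2, ...]
    speakers = parts[1::2]
    speeches = parts[2::2]
    return [(speaker, speech.strip()) for speaker, speech in zip(speakers, speeches)]

# Invented sample text, not taken from a real hearing
sample = ('The committee met at 10 a.m. Mr. Smith. The hearing will come to order. '
          'Senator Jones. Thank you for holding this hearing. '
          'Dr. Lee. I have one question.')

pairs = split_speeches(sample)
for speaker, speech in pairs:
    print(speaker, '->', speech)
```

One caveat with this pattern: a phrase such as "Mr. Chairman." inside a speech also matches it, so real transcripts would likely need an extra rule to tell forms of address apart from speaker labels.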

#### Clean outputs

In [None]:
ushh = pd.read_csv('ushh827.csv')
ushh

In [None]:
ushh_sents = pd.read_csv('ushh_sentsrand200.csv')
ushh_sents

#### Using spaCy

In [None]:
# !spacy download en_core_web_md --quiet ## uncomment this line to download model, must do this the first time
import spacy
nlp = spacy.load("en_core_web_md")
nlp.pipe_names

##### What is spaCy doing?

For right now, this spaCy language model is a black box. We input raw text and we get back a spaCy `Doc` object. <br>
This special Python object has many features, but we are most interested in the vector behind each word: 

In [None]:
a_word = nlp.vocab['word']
a_word.vector

This might look really complicated, but I'll show you below how we can interpret a word vector and all of the things we can do with it.

In [None]:
from sklearn.decomposition import PCA
import plotly.express as px
from plotly.offline import init_notebook_mode
init_notebook_mode() 

words = ['dog','cat','king','queen','man','woman','red','blue']
vectors = []
for word in words:
    vectors.append(nlp.vocab[word].vector)

pca = PCA(n_components=2)
embeddings_2d = pca.fit_transform(vectors)

embeddings_df = pd.DataFrame({"x":embeddings_2d[:, 0], "y":embeddings_2d[:, 1], "token":words})

fig = px.scatter(embeddings_df, x='x', y='y', opacity=0.5, hover_data=['token'])
fig.show()

As we can see above, similar words tend to stay together in this 2D representation of the word vectors. <br>
We can formalize this relation using cosine similarity. <br>

First, imagine that each point in the graph above has a line drawn from the origin to the point, as we see below.<br>
The angle between two of these lines then represents the difference between the two words, at least as far as our model understands it. <br>
<img src='cosinesim.png'/>
<br>
We can then take the cosine of this angle to get a similarity score. <br>
We choose the cosine because the cosine of 0 degrees, that is, when the vectors point in the same direction, is 1, <br>
while the cosine of 180 degrees, that is, when the vectors point in opposite directions, is -1.
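
To make the score concrete, here is a minimal sketch of cosine similarity in plain Python. The 2D vectors are invented for illustration and are not real spaCy vectors (spaCy's own `similarity` method does the equivalent work on the model's much longer vectors).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, made up for this example
same = cosine_similarity([1.0, 2.0], [2.0, 4.0])         # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])   # 90 degrees apart
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # 180 degrees apart

print(round(same, 3), round(orthogonal, 3), round(opposite, 3))
```

Note that the length of a vector does not matter here, only its direction: `[1.0, 2.0]` and `[2.0, 4.0]` count as identical.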

#### Designing a close reading tool

In [None]:
# ushh_sents['sent_doc'] = ushh_sents.sents.progress_apply(nlp)

from spacy.tokens import DocBin
bytes_file = open('serialized_data/rand200_output','rb').read()
doc_bin = DocBin().from_bytes(bytes_file)
ushh_sents['sent_doc'] =  pd.Series(doc_bin.get_docs(nlp.vocab))
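
Before the full function, the core idea can be sketched without spaCy or pandas: score every sentence vector against the query vector and keep the highest scores. The `top_matches` helper, the sentences, and their 3D vectors are all hypothetical stand-ins for the real model output.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def top_matches(sentences, vectors, query_vector, entries=2):
    """Return the `entries` best (score, sentence) pairs, highest score first."""
    scored = [(cosine(vec, query_vector), sent) for sent, vec in zip(sentences, vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:entries]

# Invented sentences and vectors, for illustration only
sentences = ['The budget was cut.', 'Families face rising debt.', 'The ship left port.']
vectors = [[0.9, 0.1, 0.0], [0.7, 0.7, 0.1], [0.0, 0.1, 0.9]]
query_vector = [0.6, 0.8, 0.0]

results = top_matches(sentences, vectors, query_vector)
for score, sentence in results:
    print(round(score, 3), sentence)
```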

In [None]:
def searchWordOrPhraseUSSH(search_text, entries=5, context_size=2):
  if ' ' in search_text:
    ## if the search term is multi-token, we must pass it through our NLP pipeline
    search_vec = nlp(search_text)
  else:
    ## if the search term is only one word, we can find the word vector directly
    search_vec = nlp.vocab[search_text]
  
  ushh_search = ushh_sents
  sim_score = ushh_search['sent_doc'].apply(lambda x: x.similarity(search_vec)).sort_values(ascending=False)[0:entries]
  sim_df = sim_score.reset_index().rename(columns={'index':'org_idx'})

  ## allows the user to choose how much of the context around the selected line they want to see
  def createContext(org, context_size):
    context = ushh_sents.sent_doc.iloc[org].text
    for i in range(1, context_size + 1):
      ## only add neighbouring sentences that actually exist
      if org - i >= 0:
        context = ushh_sents.sent_doc.iloc[org-i].text + '\n' + context
      if org + i < len(ushh_sents):
        context = context + '\n' + ushh_sents.sent_doc.iloc[org+i].text
    return context

  sim_df['context'] = sim_df['org_idx'].apply(lambda x: createContext(x, context_size))

  ## interesting metadata we saved from the scraping phase
  sim_df['title'] = sim_df['org_idx'].apply(lambda x: ushh_sents.title.iloc[x])
  sim_df['date'] = sim_df['org_idx'].apply(lambda x: ushh_sents.date.iloc[x])
  sim_df['speaker'] = sim_df['org_idx'].apply(lambda x: ushh_sents.speaker.iloc[x])
  return sim_df
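
The context step is easy to get wrong at the very start and end of a dataset. Here is a minimal sketch of the same idea on a plain list, assuming `context_size` means the number of sentences to show on each side of the match. The `context_window` name is hypothetical.

```python
def context_window(sentences, idx, context_size):
    """Join the sentence at `idx` with up to `context_size` neighbours on each side."""
    start = max(idx - context_size, 0)                  # do not run past the first sentence
    end = min(idx + context_size + 1, len(sentences))   # do not run past the last sentence
    return '\n'.join(sentences[start:end])

sentences = ['a', 'b', 'c', 'd', 'e']
print(context_window(sentences, 2, 1))
```

Clamping the slice bounds avoids both an `IndexError` at the end of the list and the silent wrap-around that a negative index would cause at the start.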

In [None]:
term = 'Global Warming' ## add search term!
search = searchWordOrPhraseUSSH(term.lower().strip())

display(HTML(f'<h2>{term}</h2>'))
display(HTML('<br>'))
for i in range(len(search)):
  display(HTML(f'<small><i>{search.title.to_list()[i]}</i></small>'))
  display(HTML(f'<small>{search.date.to_list()[i]}</small>'))
  display(HTML(f'<small>{search.speaker.to_list()[i]}</small>'))
  display(HTML(f'<small>Similarity Score: {round(search.sent_doc.to_list()[i], 3)}</small>'))
  display(HTML(f'<p>{search.context.to_list()[i]}</p>'))
  display(HTML('<br>'))

What next? Does this type of tool work for other fields?

## The Three Musketeers

From Wikipedia: The Three Musketeers *is a French historical adventure novel written in 1844 by French author Alexandre Dumas. It is in the swashbuckler genre, which has heroic, chivalrous swordsmen who fight for justice. Set between 1625 and 1628, it recounts the adventures of a young man named d'Artagnan after he leaves home to travel to Paris, hoping to join the Musketeers of the Guard. Although d'Artagnan is not able to join this elite corps immediately, he is befriended by three of the most formidable musketeers of the age – Athos, Porthos and Aramis, "the three musketeers" or "the three inseparables" – and becomes involved in affairs of state and at court.*

Like the US Congressional Hearings data we saw above, these texts are available, in fact enjoyable, to anyone, but they are not very accessible or navigable, especially for research and scholarship. 

In [None]:
## We have some web scraping and data cleaning to do, but much less than before
def scrapeText(url):
  r = requests.get(url)
  r.encoding = 'utf-8'
  soup = BeautifulSoup(r.text, 'html.parser')
  
  book = {}
  for div in soup.find_all('div', class_='chapter'):
    if div.h2 is not None:
      title = div.h2.text
      text = ''
      for p in div.find_all('p'):
        text += p.text
      book[title] = (text.replace('\n','').replace('\r', ' '), url)
  
  book = pd.DataFrame.from_dict(book, orient='index').reset_index().rename(columns={'index':'title',0:'text'})
  return book

urls = [
    'https://www.gutenberg.org/files/1257/1257-h/1257-h.htm', # The Three Musketeers
    'https://www.gutenberg.org/files/1259/1259-h/1259-h.htm', # Twenty Years After
    'https://www.gutenberg.org/files/2609/2609-h/2609-h.htm', # The Vicomte de Bragelonne
    'https://www.gutenberg.org/files/2681/2681-h/2681-h.htm', # Ten Years Later
    'https://www.gutenberg.org/files/2710/2710-h/2710-h.htm', # Louise de la Valliere
    'https://www.gutenberg.org/files/2759/2759-h/2759-h.htm'  # The Man in the Iron Mask
    ]

DFs = []
for url in urls:
  DFs.append(scrapeText(url))

## ignore_index gives every chapter a unique row label, so the drops below only remove the intended rows
ttm = pd.concat(DFs, ignore_index=True)
ttm = ttm.drop(ttm.loc[ttm['title'].str.contains('Footnotes')].index)
ttm = ttm.drop(ttm.loc[ttm['text'] == ''].index).reset_index(drop=True)

## look up each book's name from the URL it was scraped from
url_names = {
    'https://www.gutenberg.org/files/1257/1257-h/1257-h.htm': 'The Three Musketeers',
    'https://www.gutenberg.org/files/1259/1259-h/1259-h.htm': 'Twenty Years After',
    'https://www.gutenberg.org/files/2609/2609-h/2609-h.htm': 'The Vicomte de Bragelonne',
    'https://www.gutenberg.org/files/2681/2681-h/2681-h.htm': 'Ten Years Later',
    'https://www.gutenberg.org/files/2710/2710-h/2710-h.htm': 'Louise de la Valliere',
    'https://www.gutenberg.org/files/2759/2759-h/2759-h.htm': 'The Man in the Iron Mask',
}

ttm['book'] = ttm[1].map(url_names)
ttm = ttm.drop(1, axis=1)
ttm

In [None]:
## sentence tokenization and application of the NLP method as above
ttm['sents'] = ttm['text'].progress_apply(sent_tokenize)
ttm_sents = ttm.explode('sents').reset_index(drop=True)
#ttm_sents['sent_doc'] = ttm_sents['sents'].progress_apply(nlp)

from spacy.tokens import DocBin
bytes_file = open('serialized_data/ttm_output','rb').read()
doc_bin = DocBin().from_bytes(bytes_file)
ttm_sents['sent_doc'] =  pd.Series(doc_bin.get_docs(nlp.vocab))
ttm_sents

What else can we do to our tool to make it more useful for this text in particular? Dumas writes for a primarily French audience, who are expected to have a broad knowledge of western European geography. Using spaCy's off-the-shelf named entity recognition pipe, we can add a mapping component to lines which contain geopolitical entities and locations.

In [None]:
## same basic code as before
def searchWordOrPhraseTTM(search_text, entries=5, context_size=2):
  if ' ' in search_text:
    search_vec = nlp(search_text)
  else:
    search_vec = nlp.vocab[search_text]
  
  ttm_search = ttm_sents
  sim_score = ttm_search['sent_doc'].apply(lambda x: x.similarity(search_vec)).sort_values(ascending=False)[0:entries]
  sim_df = sim_score.reset_index().rename(columns={'index':'org_idx'})

  def createContext(org, context_size):
    context = ttm_sents.sent_doc.iloc[org].text
    for i in range(1, context_size + 1):
      ## only add neighbouring sentences that actually exist
      if org - i >= 0:
        context = ttm_sents.sent_doc.iloc[org-i].text + '\n' + context
      if org + i < len(ttm_sents):
        context = context + '\n' + ttm_sents.sent_doc.iloc[org+i].text
    return context

  ## returns a list of (place, lat, lng, count) tuples
  def getLatLong(org, context_size):
    ## default dicts are really useful for tasks like this
    places = collections.defaultdict(int)
    ## look at the selected sentence and at the context sentences on either side of it
    first = max(org - context_size, 0)
    last = min(org + context_size, len(ttm_sents) - 1)
    for idx in range(first, last + 1):
      for ent in ttm_sents.sent_doc.iloc[idx].ents:
        ## using spaCy entity tags
        if (ent.label_ == 'GPE') or (ent.label_ == 'LOC'):
          places[ent.text] += 1
    coords = []
    # filtering out noise from spaCy NER
    chars = ['Athos','Porthos','Aramis','Grimaud','Felton', 'Louise', 'Montalais', 'Mazarin']
    for place in places:
      if (place != 'one') and (place != 'four') and (place != 'first') and (place != 'Roman') and (place not in chars):
        ## using geonames API to get coordinates; passing params lets requests encode the place name
        params = {'name_equals': place, 'continentCode': 'EU', 'maxRows': 10, 'username': 'pnadel'}
        response = requests.get('http://api.geonames.org/search', params=params)
        geoname = BeautifulSoup(response.text, 'html.parser').find('geoname')
        if geoname is not None:
          lat = float(geoname.find('lat').get_text())
          lng = float(geoname.find('lng').get_text())
          coords.append((place,lat,lng,places[place]))
    return coords

  ## text 
  sim_df['context'] = sim_df['org_idx'].apply(lambda x: createContext(x, context_size))
  
  ## mapping data
  sim_df['coords'] = sim_df['org_idx'].apply(lambda x: getLatLong(x, context_size))

  ## useful metadata
  sim_df['chapter'] = sim_df['org_idx'].apply(lambda x: ttm_sents.title.iloc[x])
  sim_df['book'] = sim_df['org_idx'].apply(lambda x: ttm_sents.book.iloc[x])
  return sim_df
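
The two halves of the mapping step, counting place names and reading coordinates out of a response, can be tried without spaCy or a network connection. This is a minimal sketch under stated assumptions: the entity tags are hypothetical, the XML is hand-written to contain only the `geoname`, `lat` and `lng` elements the code above relies on, and `count_places` and `first_coordinates` are illustrative names.

```python
import collections
import xml.etree.ElementTree as ET

def count_places(entities, noise=()):
    """Count GPE/LOC entity texts from (text, label) pairs, skipping known noise."""
    counts = collections.defaultdict(int)
    for text, label in entities:
        if label in ('GPE', 'LOC') and text not in noise:
            counts[text] += 1
    return dict(counts)

def first_coordinates(xml_text):
    """Return (lat, lng) from the first <geoname> element, or None if there is none."""
    geoname = ET.fromstring(xml_text).find('geoname')
    if geoname is None:
        return None
    return float(geoname.findtext('lat')), float(geoname.findtext('lng'))

# Hypothetical entity tags and a hand-written response, for illustration only
entities = [('Paris', 'GPE'), ('Athos', 'GPE'), ('Paris', 'GPE'),
            ('first', 'ORDINAL'), ('England', 'GPE')]
sample_xml = ('<geonames><geoname><name>Paris</name>'
              '<lat>48.85</lat><lng>2.35</lng></geoname></geonames>')

places = count_places(entities, noise={'Athos'})
coords = first_coordinates(sample_xml)
print(places, coords)
```

Filtering on the entity label first keeps numbers and ordinals out of the lookup, so the noise list only has to cover names the model mislabels as places.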

In [None]:
def europePlot(coords):
  if len(coords) > 0:
    ## coordinates in DF to allow for easier plotting
    coord_df = pd.DataFrame(coords,columns=['place','lat','lng','c'])

    ## open access geojson of Europe
    filename = "europe.geojson"
    df = gpd.read_file(filename)

    fig, ax = plt.subplots(figsize=(8,6))
    ## select countries to plot
    countries = ['FR','ES','GB','BE','NL','IE','PT','AD','CH']
    df[df['id'].isin(countries)].plot(ax=ax)

    ## using pandas built-in plot function
    coord_df.plot(x="lng",y='lat',kind='scatter', c='c',colormap="Reds", ax=ax)

    ## making labels
    for i in range(len(coord_df)):
      plt.text(coord_df.iloc[i].lng,coord_df.iloc[i].lat,f'{coord_df.iloc[i].place}')
      
    ## newer matplotlib versions no longer accept the b= keyword, so pass the flag positionally
    ax.grid(True, alpha=0.5)
    plt.show()

In [None]:
term = 'Chivalry' ## add search term!
ttm_search = searchWordOrPhraseTTM(term.lower(),context_size=3)

display(HTML(f'<h2>{term}</h2>'))
display(HTML('<br>'))
for i in range(len(ttm_search)):
  display(HTML(f'<small><i>{ttm_search.book.to_list()[i]}</i></small>'))
  display(HTML(f'<small>{ttm_search.chapter.to_list()[i]}</small>'))
  display(HTML(f'<small>Similarity Score: {round(ttm_search.sent_doc.to_list()[i], 3)}</small>'))
  display(HTML(f'<p>{ttm_search.context.to_list()[i]}</p>'))
  europePlot(ttm_search.coords.to_list()[i])
  display(HTML('<br>'))

## Looking at the generalized method

I'm going to show off the generalized method by changing the language. We are now looking at the original French version of The Three Musketeers. All we have to do to query this book instead of the English one is swap the English language model for a French one.

In [None]:
ttm_fr = scrapeText('https://www.gutenberg.org/files/13951/13951-h/13951-h.htm')
ttm_fr

In [None]:
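## punkt ships a French sentence model, so we tell nltk which language to expect
ttm_fr['sents'] = ttm_fr['text'].progress_apply(lambda text: sent_tokenize(text, language='french'))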
ttm_fr_sents = ttm_fr.explode('sents').reset_index(drop=True)
ttm_fr_sents

In [None]:
## get French model
!spacy download fr_core_news_md --quiet
nlp_fr = spacy.load('fr_core_news_md')

In [None]:
from search import searchEngine

ttm_fr_search = searchEngine(
    df = ttm_fr_sents,
    nlp = nlp_fr
)

ttm_fr_search.spacyify('sents','ttm_fr_output')

In [None]:
term = 'Chevalerie'
search_fr = ttm_fr_search.search('serialized_data/ttm_fr_output', term, mapping=True, title='title')
ttm_fr_search.displaySearch(search_fr[0],search_fr[1], mapping=True, title='title')
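
Why does swapping the model work? The search logic only needs something that turns text into a vector. The toy sketch below makes that dependency explicit. It is not the `searchEngine` class from `search.py`: `TinySearch` and `make_embedder` are hypothetical, and the word-counting "model" is a stand-in that captures none of the semantics a real spaCy model provides.

```python
import math

def _cosine(u, v):
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # A text with no known words has a zero vector; treat it as matching nothing
    return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0

def make_embedder(vocabulary):
    """Toy 'language model': counts how often each vocabulary word occurs."""
    def embed(text):
        words = text.lower().split()
        return [words.count(word) for word in vocabulary]
    return embed

class TinySearch:
    """Search that depends only on an `embed` function, never on the language itself."""
    def __init__(self, sentences, embed):
        self.sentences = sentences
        self.embed = embed
        self.vectors = [embed(sentence) for sentence in sentences]

    def search(self, term, entries=1):
        query = self.embed(term)
        ranked = sorted(zip(self.sentences, self.vectors),
                        key=lambda pair: _cosine(pair[1], query), reverse=True)
        return [sentence for sentence, _ in ranked[:entries]]

# Same class, two languages: only the embedder changes
english = TinySearch(['He drew his sword', 'They ate bread'],
                     make_embedder(['sword', 'honor', 'bread']))
french = TinySearch(['Il tira son épée', 'Ils mangèrent du pain'],
                    make_embedder(['épée', 'honneur', 'pain']))

print(english.search('sword'))
print(french.search('épée'))
```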

## Takeaway: Close vs. distant reading
In surveying NLP methods through this example, we explored the power and limitations of both close reading and distant reading. <br>
We saw how:
* to process data and to clean a dataset for NLP
* to apply an off-the-shelf language model to preprocessed data
* to build a rudimentary user interface to interpret the model output
* to generalize a method so that it can be used for many different texts

#### What's next? 
The next step would be to deploy this search engine as an application. You have a variety of options, like Streamlit, Voila or Django, but that is outside the purview of this talk, so hopefully I can present on deployment at another time.

# Thank you for your attention!