# Data Extraction using spaCy

This chapter shows how spaCy can be used to extract data from natural language text. We'll walk through collecting date and size variables from a collection of Wikipedia texts covering protests following the deaths of George Floyd, Ahmaud Arbery, and Breonna Taylor.

#### Setup

For this exercise, we'll import `pandas`, `spacy`, and [`re`](https://docs.python.org/3/library/re.html#). 

In [1]:
import pandas as pd
import spacy
import re

In [2]:
nlp = spacy.load("en_core_web_md")

In [3]:
wiki = pd.read_csv('wiki_protest.csv')

At present, our dataset includes information on the `city` and `state` in which 245 protests occurred, the `text` from Wikipedia detailing these protests, and the `references` associated with the Wikipedia text.

In [4]:
wiki

Unnamed: 0,city,text,references,state
0,Auburn,Hundreds of demonstrators held a largely peace...,['https://www.wsfa.com/2020/05/31/hundreds-gat...,Alabama
1,Birmingham,"An estimated 1,000 people gathered on May 31 f...",['https://www.al.com/news/birmingham/2020/05/p...,Alabama
2,Hoover,At least 100 protesters attended a march along...,['https://www.al.com/news/birmingham/2020/05/p...,Alabama
3,Montgomery,Hundreds of people protested on the steps of t...,['https://www.montgomeryadvertiser.com/story/n...,Alabama
4,Troy,About 50 people demonstrated peacefully on May...,['https://www.dothanfirst.com/news/no-justice-...,Alabama
...,...,...,...,...
240,Appleton,Over one thousand people gathered in downtown ...,['https://www.postcrescent.com/story/news/2020...,Wisconsin
241,Eau Claire,Hundreds marched from Phoenix Park to Owen Par...,['https://www.weau.com/content/news/George-Flo...,Wisconsin
242,Kenosha,Between 100 and 125 demonstrators peacefully m...,['https://www.kenoshanews.com/news/local/watch...,Wisconsin
243,Madison,"On May 30, there was a peaceful demonstration ...",['https://madison.com/wsj/news/local/govt-and-...,Wisconsin


#### Entity Labels

The date and size information we're looking to extract can be collected by identifying the named entity labels spaCy assigns to these phrases. Based on the first few words in each row of the `text` key above, we can create a string including a handful of dates and protester counts, and see how spaCy classifies them:

In [5]:
test = nlp("An estimated 1,000 people gathered on May 31 At least 100 protesters attended a march along Hundreds of people protested on the steps About 50 people demonstrated peacefully Over one thousand people gathered in downtown On May 30, there was a peaceful demonstration")
for ent in test.ents:
    print(ent.text, ent.label_)

An estimated 1,000 CARDINAL
May 31 DATE
At least 100 CARDINAL
Hundreds CARDINAL
About 50 CARDINAL
Over one thousand CARDINAL
May 30 DATE


It looks like spaCy classifies the dates in our string as `DATE` entities (this makes sense), and the protester counts in our string as `CARDINAL` entities.
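If a label name is unfamiliar, spaCy can describe it. The short sketch below uses `spacy.explain`, which looks up a description for a label string and does not need the loaded model; it returns `None` for a label it does not recognise. The `descriptions` dictionary is only an illustrative name.

```python
import spacy

# Look up spaCy's built-in description for each label seen above
descriptions = {label: spacy.explain(label) for label in ["DATE", "CARDINAL"]}

for label, description in descriptions.items():
    print(label, "->", description)
```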

#### Dates 

We can now define a function that extracts the entities labelled `DATE` from our `text` key. The function is largely identical to the `extract_adjective` function we used in the POS chapter.

In [6]:
def extract_dates(text):
    dates = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_=='DATE':
            dates.append(ent.text)
    return dates

Let's now apply the function to our `text` key:

In [7]:
wiki['dates'] = wiki['text'].apply(extract_dates)

In [8]:
wiki

Unnamed: 0,city,text,references,state,dates
0,Auburn,Hundreds of demonstrators held a largely peace...,['https://www.wsfa.com/2020/05/31/hundreds-gat...,Alabama,[May 31]
1,Birmingham,"An estimated 1,000 people gathered on May 31 f...",['https://www.al.com/news/birmingham/2020/05/p...,Alabama,[May 31]
2,Hoover,At least 100 protesters attended a march along...,['https://www.al.com/news/birmingham/2020/05/p...,Alabama,[May 30]
3,Montgomery,Hundreds of people protested on the steps of t...,['https://www.montgomeryadvertiser.com/story/n...,Alabama,[May 30]
4,Troy,About 50 people demonstrated peacefully on May...,['https://www.dothanfirst.com/news/no-justice-...,Alabama,[May 29]
...,...,...,...,...,...
240,Appleton,Over one thousand people gathered in downtown ...,['https://www.postcrescent.com/story/news/2020...,Wisconsin,[May 30 and 31]
241,Eau Claire,Hundreds marched from Phoenix Park to Owen Par...,['https://www.weau.com/content/news/George-Flo...,Wisconsin,[May 31]
242,Kenosha,Between 100 and 125 demonstrators peacefully m...,['https://www.kenoshanews.com/news/local/watch...,Wisconsin,[May 31]
243,Madison,"On May 30, there was a peaceful demonstration ...",['https://madison.com/wsj/news/local/govt-and-...,Wisconsin,[May 30]
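Because `extract_dates` appends every `DATE` entity it finds, a text that mentions the same date more than once produces a list with repeats. If only the distinct dates matter, one possible clean-up step is sketched below. The helper name `unique_dates` is hypothetical and not part of the original notebook.

```python
def unique_dates(dates):
    """Drop repeated date strings while keeping first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(dates))

print(unique_dates(["May 30", "May 30", "May 31", "May 30"]))
# ['May 30', 'May 31']
```

Applied to the dataset, this could look like `wiki['dates'].apply(unique_dates)`.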


#### Size

The function we define to extract protester counts is a bit more complex. Here, we'll use the [findall](https://docs.python.org/3/library/re.html?highlight=findall#re.findall) function from `re` to return more consistent data: instead of extracting different phrases for "hundreds of protesters," "hundreds of people," or a "few hundred people," for example, we'll ensure each of these phrases returns "hundreds" for the `size` key once we apply our function to the Wikipedia dataset.

In [9]:
def extract_size(text):
    # Lowercase once so the phrase checks below ignore capitalisation
    s = text.lower()
    for prefix in ['up to', 'more than','around','at least', 'about', 'approximately', 'over', 'almost']:
        for suffix in ['protesters', 'people', 'town residents']:
            # Raw string keeps \d as a regex digit class rather than a string escape
            h = re.findall(r'%s \d+(?:,\d+)? %s' % (prefix, suffix), s)
            if len(h) > 0:
                return h[0]
        if 'hundreds of protesters' in s:
            return 'hundreds'
        if 'hundreds of people' in s:
            return 'hundreds'
        if 'few hundred people' in s:
            return 'hundreds'
        if 'several hundred' in s:
            return 'hundreds'
        if 'over a hundred' in s:
            return 'hundred'
        if 'an estimated 1000' in s:
            return '1000'
        if 'approximately 1,000' in s:
            return '1000'
        if 'more than 1,000 people' in s:
            return '1000'
        if 'more than 1000 protestors' in s:
            return '1000'
        if 'around 150' in s:
            return '150'
        if 'about 200 people' in s:
            return '200'
        if 'over 100' in s:
            return '100'
        if 'about 30 people' in s:
            return '30'

        
    for d in ['dozens of protestors', 'few dozen' , 'at least a dozen']:
        if d in s:
            return d
        if 'several dozen' in s:
            return 'dozens'


    # Fall back to every CARDINAL entity spaCy finds in the original text
    size = []
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_=='CARDINAL':
            size.append(ent.text)
    return ', '.join(size)
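To see what the regular expression inside `extract_size` matches, it can be tried on its own with a single prefix and suffix. This is a small illustration using only `re`; the example sentences are made up for the demonstration.

```python
import re

# The same pattern extract_size builds, for one prefix/suffix pair
pattern = r'%s \d+(?:,\d+)? %s' % ('about', 'people')

print(re.findall(pattern, 'about 50 people demonstrated peacefully'))
# ['about 50 people']
print(re.findall(pattern, 'about 1,000 people gathered downtown'))
# ['about 1,000 people']
print(re.findall(pattern, 'about fifty people gathered'))
# []  (spelled-out numbers are not matched by \d+)
```

The optional group `(?:,\d+)?` is what lets a count written with a thousands separator, such as `1,000`, match as a single number.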

With our second function defined, we can extract our size data from the Wikipedia dataset: 

In [10]:
wiki['size'] = wiki['text'].apply(extract_size)

In [11]:
wiki.sample(10)

Unnamed: 0,city,text,references,state,dates,size
116,Presque Isle,May 30: More than 30 people gathered on Main S...,['https://thecounty.me/2020/06/02/news/peacefu...,Maine,[May 30],more than 30 people
78,Carmel,Hundreds of protesters attended a peaceful mar...,['http://www.youarecurrent.com/2020/06/01/carm...,Indiana,[June 1],hundreds
160,Franklin Township (Somerset County),"On May 31, hundreds of people protested in a m...",['https://www.tapinto.net/towns/franklin-towns...,New Jersey,[May 31],hundreds
64,Valdosta,About 50 protesters assembled on the grounds o...,['https://www.valdostadailytimes.com/news/loca...,Georgia,[May 30],about 50 protesters
76,Springfield,More than 1000 protested peacefully on Ninth S...,['https://www.wcia.com/news/city-leaders-join-...,Illinois,[June 1],More than 1000
97,Waterloo,Approximately 500 people marched on May 29. <s...,['https://kwwl.com/2020/05/29/hundreds-march-f...,Iowa,[May 29],approximately 500 people
218,Newport,Hundreds gathered in front of Newport City Hal...,['https://newportnewstimes.com/article/protest...,Oregon,[June 3],Hundreds
56,Dalton,"On Monday, June 1, several dozen demonstrators...",['https://newschannel9.com/news/local/peaceful...,Georgia,"[Monday, June 1]",dozens
29,Colorado Springs,About 300 protesters demonstrated by lying on ...,['https://www.kktv.com/content/news/Protesters...,Colorado,"[May 30, May 30, May 30]",about 300 protesters
243,Madison,"On May 30, there was a peaceful demonstration ...",['https://madison.com/wsj/news/local/govt-and-...,Wisconsin,[May 30],around 1000
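The sample above still mixes forms such as `more than 30 people`, `More than 1000` and `hundreds`. If a numeric column is needed for analysis, one possible follow-up is to pull the first number out of each string. The sketch below is an assumption about how that might be done, not part of the original notebook; `size_to_int` is a hypothetical helper, and the number it returns is the figure stated in the text (often a bound or an estimate), not an exact count.

```python
import re

def size_to_int(size):
    """Return the first number in a size string as an int, or None."""
    if not isinstance(size, str):
        return None
    # Digits, optionally followed by comma-separated thousands groups
    match = re.search(r'\d+(?:,\d{3})*', size)
    if match is None:
        return None  # e.g. 'hundreds' or 'dozens' contain no digits
    return int(match.group().replace(',', ''))

print(size_to_int('more than 30 people'))  # 30
print(size_to_int('approximately 1,000'))  # 1000
print(size_to_int('hundreds'))             # None
```

Applied to the dataset, this could look like `wiki['size'].apply(size_to_int)`; rows with vague sizes such as `hundreds` would then need a separate decision about how to code them.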
