<a href="https://colab.research.google.com/github/kevinmfreire/wheres_waldo/blob/main/spacy_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition with spaCy

In [1]:
import pandas as pd
import spacy
import requests
import re
import numpy as np
import json
from bs4 import BeautifulSoup
from spacy import displacy
import sqlite3

### We first load the spaCy model `"en_core_web_sm"` and test it out. Each part of the model name has a meaning:
* `en` is the language, English.
* `core` is the capability type: a general-purpose pipeline with tagging, parsing, lemmatization and named entity recognition.
* `web` is the genre of text the pipeline was trained on: written web text such as blogs, news and comments, which suits the news articles scraped later in this notebook.
* `sm` is the model size, small. The small model is built for efficiency at the cost of some accuracy.
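
The naming convention above can be unpacked mechanically. The helper below is a hypothetical illustration (it is not part of spaCy) that splits a pipeline name into its four parts, assuming the name follows the `lang_type_genre_size` pattern described in the list.

```python
def describe_model_name(name):
    """Split a spaCy-style pipeline name into its four parts.

    Hypothetical helper for illustration only; assumes exactly four
    underscore-separated components, as in "en_core_web_sm".
    """
    lang, capability, genre, size = name.split("_")
    return {"language": lang, "type": capability, "genre": genre, "size": size}


parts = describe_model_name("en_core_web_sm")
print(parts)
```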

In [2]:
NER = spacy.load("en_core_web_sm")
raw_text="The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well."
text1= NER(raw_text)

### Let's see how the model works. We first print all the entities the model recognized, which here are:
* ORG - for organizations
* GPE - for countries, cities, states, etc.

In [3]:
for word in text1.ents:
    print(word.text,word.label_)

The Indian Space Research Organisation ORG
India GPE
Bengaluru GPE
Department of Space ORG
India GPE
ISRO ORG
DOS ORG


We can ask spaCy what each entity label means.

In [4]:
spacy.explain("ORG")

'Companies, agencies, institutions, etc.'

In [5]:
spacy.explain("GPE")

'Countries, cities, states'

A pleasant visualization using displacy.

In [6]:
displacy.render(text1,style="ent",jupyter=True)

### Let's look at a different example and run the same analysis.

In [7]:
raw_text2='The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was launched into Earth orbit on 5 November 2013 by the Indian Space Research Organisation (ISRO) and has entered Mars orbit on 24 September 2014. India thus became the first country to enter Mars orbit on its first attempt. It was completed at a record low cost of $74 million.'

In [8]:
text2= NER(raw_text2)
for word in text2.ents:
    print(word.text,word.label_)

The Mars Orbiter Mission (MOM PRODUCT
Mangalyaan PERSON
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
$74 million MONEY


This time we see a few new entities such as **PERSON**, **DATE**, **LOC**, **ORDINAL**, **MONEY**, and **PRODUCT**.

However, we are only interested in names, organizations and locations. As we've seen, two entity labels relate to location. We know that **GPE** marks a geopolitical location such as a country or city, so let's see what **LOC** is.

In [9]:
spacy.explain("PERSON")

'People, including fictional'

In [10]:
spacy.explain("LOC")

'Non-GPE locations, mountain ranges, bodies of water'

As the output above shows, the **LOC** entity covers non-GPE locations such as mountain ranges and bodies of water, and in the second example it labels Earth and Mars as **LOC**. We are not interested in this entity.
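
To make the selection concrete, here is a small sketch that filters the `(text, label)` pairs printed for the second example down to the three labels of interest. The `TARGET_LABELS` name and the hard-coded list are illustrative assumptions; in the notebook the pairs would come from `text2.ents`.

```python
# (text, label) pairs copied from the printed output of the second example
entities = [
    ("The Mars Orbiter Mission (MOM", "PRODUCT"),
    ("Mangalyaan", "PERSON"),
    ("Earth", "LOC"),
    ("5 November 2013", "DATE"),
    ("the Indian Space Research Organisation", "ORG"),
    ("Mars", "LOC"),
    ("24 September 2014", "DATE"),
    ("India", "GPE"),
    ("first", "ORDINAL"),
    ("Mars", "LOC"),
    ("$74 million", "MONEY"),
]

# Assumed name for the set of labels we want to keep
TARGET_LABELS = {"PERSON", "ORG", "GPE"}

kept = [(text, label) for text, label in entities if label in TARGET_LABELS]
print(kept)
```

Note that the small model labels Mangalyaan as a PERSON, so filtering by label does not correct misclassifications.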

Let's render the text using `displacy.render` to visualize the model's predictions within this notebook.

In [11]:
displacy.render(text2,style="ent",jupyter=True)

In [12]:
def clean_text(contents):
  # Replace newline, tab, carriage-return and non-breaking-space characters with plain spaces
  body = contents.replace('\n', ' ')
  body = body.replace('\t', ' ')
  body = body.replace('\r', ' ')
  body = body.replace('\xa0', ' ')
  return body
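
As an alternative sketch (not the notebook's approach), the same clean-up can be done with a single regular expression. In Python 3, `\s` in a `str` pattern also matches the non-breaking space `\xa0`, so one substitution covers all four characters. Unlike `clean_text`, it also collapses runs of whitespace into a single space and trims the ends. The function name is an assumption.

```python
import re


def normalise_whitespace(contents):
    """Collapse any run of whitespace (including non-breaking spaces) into one space."""
    return re.sub(r"\s+", " ", contents).strip()


sample = "COPENHAGEN,\xa0Denmark\n\tThree people\r\nare dead"
cleaned = normalise_whitespace(sample)
print(cleaned)
```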

### Let's now build our function for scraping multiple articles from the NBC News website.

In [13]:
def web_scraper(url, number_of_articles=1):
    # Request
    request = requests.get(url)
    print(request.status_code)

    # We'll save in coverpage the cover page content
    coverpage = request.content

    # Soup creation
    soup1 = BeautifulSoup(coverpage, 'html5lib')

    # News identification
    coverpage_news = []
    for tag in soup1.find_all('h2', class_='styles_headline__ice3t'):
        for anchor in tag.find_all('a'):
            coverpage_news.append(anchor)

    print('Number of articles found: {}'.format(len(coverpage_news)))

    ## Let's extract the text from the article
    # Empty lists for content, links and titles
    news_contents = []
    list_links = []
    list_titles = []

    for n in np.arange(0, number_of_articles):
            
        # Getting the link of the article
        link = coverpage_news[n]['href']
        list_links.append(link)
        
        # Getting the title
        title = coverpage_news[n].get_text()
        list_titles.append(title)
        
        # Reading the content (it is divided in paragraphs)
        article = requests.get(link)
        article_content = article.content
        soup_article = BeautifulSoup(article_content, 'html5lib')
        x = soup_article.find_all('p', {'class':['','endmark']})
        
        # Unifying the paragraphs
        list_paragraphs = []
        for p in np.arange(0, len(x)):
            paragraph = x[p].get_text()
            list_paragraphs.append(paragraph)
        # Join once after the loop so an article with no matching paragraphs yields an empty string
        final_article = " ".join(list_paragraphs)
            
        news_contents.append(final_article)

    # df_show_info
    nbc_articles = pd.DataFrame({
        # 'Article Title': list_titles,
        'Article Link': list_links,
        'Article Content': news_contents})

    # return [list_titles, news_contents, list_links]
    return nbc_articles
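
The headline-identification step can be tried offline on a small HTML snippet. The sketch below uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without a network connection or extra packages. The snippet, the class name `HeadlineLinkParser`, and the example URLs are made up for illustration; only the `h2` class `styles_headline__ice3t` comes from the function above.

```python
from html.parser import HTMLParser


class HeadlineLinkParser(HTMLParser):
    """Collect (href, title) pairs for <a> tags nested inside matching <h2> headings."""

    def __init__(self, headline_class):
        super().__init__()
        self.headline_class = headline_class
        self.in_headline = False
        self.current_href = None
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h2" and self.headline_class in classes:
            self.in_headline = True
        elif tag == "a" and self.in_headline:
            self.current_href = attrs.get("href")
            self.current_text = []

    def handle_data(self, data):
        # Only keep text that sits inside a headline anchor
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            title = "".join(self.current_text).strip()
            self.links.append((self.current_href, title))
            self.current_href = None
        elif tag == "h2":
            self.in_headline = False


# Made-up page fragment: one matching headline and two links that should be ignored
sample_html = """
<h2 class="styles_headline__ice3t"><a href="https://example.com/first">First story</a></h2>
<h2 class="other"><a href="https://example.com/second">Ignored</a></h2>
<p><a href="https://example.com/third">Also ignored</a></p>
"""

parser = HeadlineLinkParser("styles_headline__ice3t")
parser.feed(sample_html)
print(parser.links)
```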

In [14]:
URL='https://www.nbcnews.com/'

In [15]:
contents = web_scraper(URL, 5)

200
Number of articles found: 40


In [16]:
contents

Unnamed: 0,Article Link,Article Content
0,https://www.nbcnews.com/news/danish-police-arr...,"COPENHAGEN, Denmark — Three people are dead an..."
1,https://www.nbcnews.com/news/us-news/akron-ohi...,Officials on Monday set a 9 p.m. curfew for do...
2,https://www.nbcnews.com/news/world/chunk-alpin...,ROME — A large chunk of an Alpine glacier brok...
3,https://www.nbcnews.com/news/world/israeli-mil...,The United States has concluded that gunfire f...
4,https://www.nbcnews.com/news/world/russia-conq...,The last Ukrainian bastion in a key eastern pr...


In [17]:
contents['Article Content']

0    COPENHAGEN, Denmark — Three people are dead an...
1    Officials on Monday set a 9 p.m. curfew for do...
2    ROME — A large chunk of an Alpine glacier brok...
3    The United States has concluded that gunfire f...
4    The last Ukrainian bastion in a key eastern pr...
Name: Article Content, dtype: object

Since the spaCy NER model classifies various entities, we build a function that extracts only our target entities, `PERSON`, `GPE`, and `ORG`, and places them in a new dictionary of lists with the following format:

```
{
  'NAME' : [Kevin, Daniel, ...],
  'ORGANIZATION' : [Apple, Amazon, ...],
  'LOCATION' : [Toronto, Florida, ...]
}
```

In [18]:
def get_unique_results(model_output):
    # Prepare dictionary for obtaining only Name, Organization and Location
    article = {'NAME':[], 'ORGANIZATION':[], 'LOCATION':[]}

    # Iterate through each recognized entity and keep the target labels, skipping duplicates
    for word in model_output.ents:
        if word.label_ == 'PERSON' and (word.text not in article["NAME"]):
            article["NAME"].append(word.text)
        elif word.label_ == 'ORG' and (word.text not in article["ORGANIZATION"]):
            article["ORGANIZATION"].append(word.text)
        elif word.label_ == 'GPE' and (word.text not in article["LOCATION"]):
            article["LOCATION"].append(word.text)
    return article

Since we scraped multiple articles from NBC News, I created a function that extracts the target entities from each article and stores them in the `'Article Content'` column of a new dataframe.

In [19]:
def get_ner_for_all(article, model):
    '''
    Run the NER model on the content of each article and return a new
    dataframe whose 'Article Content' column holds the extracted entities.
    '''
    final_out = article.copy()
    # Replace each article's text with its dictionary of unique target entities
    final_out['Article Content'] = [
        get_unique_results(model(text)) for text in final_out['Article Content']
    ]
    return final_out

In [20]:
output = get_ner_for_all(contents, NER)
output

Unnamed: 0,Article Link,Article Content
0,https://www.nbcnews.com/news/danish-police-arr...,"{'NAME': ['Mattias Tesfaye', 'Søren Thomassen'..."
1,https://www.nbcnews.com/news/us-news/akron-ohi...,"{'NAME': ['Jayland Walker', 'Dan Horrigan', 'H..."
2,https://www.nbcnews.com/news/world/chunk-alpin...,"{'NAME': ['Luca Zaia', 'Mario Draghi', 'Alps',..."
3,https://www.nbcnews.com/news/world/israeli-mil...,"{'NAME': ['Shireen Abu Akleh', 'Ned Price', 'A..."
4,https://www.nbcnews.com/news/world/russia-conq...,"{'NAME': ['Vladimir Putin', 'Sergei Shoigu', '..."


### For visualization purposes I ran the model over all articles to see how it classified its entities.

In [21]:
render_list = []
for content in contents['Article Content']:
  render_out = NER(content)
  render_list.append(render_out)

In [22]:
len(render_list)

5

In [23]:
displacy.render(render_list[0],style="ent",jupyter=True)

Next, I also wanted to save the extracted NER results to both a `.json` and a `.csv` file, where the `.json` file has the following structure:

```
{
  "https://www.nbcnews.com/example-article-1.html" : {
    'NAME' : [Kevin, Daniel, ...],
    'ORGANIZATION' : [Apple, Amazon, ...],
    'LOCATION' : [Toronto, Florida, ...]
  },
  "https://www.nbcnews.com/example-article-2.html" : {
    'NAME' : [Kevin, Daniel, ...],
    'ORGANIZATION' : [Apple, Amazon, ...],
    'LOCATION' : [Toronto, Florida, ...]
  },
  "https://www.nbcnews.com/example-article-3.html" : {
    'NAME' : [Kevin, Daniel, ...],
    'ORGANIZATION' : [Apple, Amazon, ...],
    'LOCATION' : [Toronto, Florida, ...]
  },
  ...
}
```

In [24]:
def save_to_json(results, path):
    outputDict = results.set_index('Article Link').to_dict()['Article Content']

    with open(path+'output.json', 'w') as fp:
        json.dump(outputDict, fp,  indent=4)

def save_to_csv(results, path):
    results.set_index('Article Link').to_csv(path+'output.csv')
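
A quick round trip shows the structure these helpers produce. This sketch builds the link-to-entities mapping by hand (with a made-up link and a shortened entity list) instead of from the dataframe, and writes to a temporary directory. The `ensure_ascii=False` argument is an optional addition, not used above, that keeps names such as 'Søren Thomassen' readable in the file instead of escaping them.

```python
import json
import os
import tempfile

# Hand-built stand-in for the dictionary produced inside save_to_json
results = {
    "https://www.nbcnews.com/example-article-1.html": {
        "NAME": ["Søren Thomassen", "Mette Frederiksen"],
        "ORGANIZATION": ["YouTube", "Live Nation"],
        "LOCATION": ["Denmark", "London"],
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "output.json")
    # Write the mapping, keeping non-ASCII characters as-is
    with open(json_path, "w", encoding="utf-8") as fp:
        json.dump(results, fp, indent=4, ensure_ascii=False)
    # Read it back to confirm nothing was lost
    with open(json_path, encoding="utf-8") as fp:
        loaded = json.load(fp)

print(loaded == results)
```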

In [25]:
save_to_json(output, './')

In [26]:
save_to_csv(output, './')

In [27]:
with open('./output.json') as json_file:
    json_data = json.load(json_file)
json_data

{'https://www.nbcnews.com/news/danish-police-arrest-one-connection-shooting-copenhagen-mall-rcna36515': {'LOCATION': ['Denmark',
   'New York',
   'London'],
  'NAME': ['Mattias Tesfaye',
   'Søren Thomassen',
   'Thomassen',
   'Mette Frederiksen',
   'Harry Styles',
   'Julianne McShane',
   'Patrick Smith'],
  'ORGANIZATION': ['COPENHAGEN',
   'Copenhagen',
   'Field',
   'YouTube',
   'Live Nation',
   'Camilla Fuhr']},
 'https://www.nbcnews.com/news/us-news/akron-ohio-sets-downtown-curfew-cancels-fireworks-wake-jayland-walker-rcna36557': {'LOCATION': ['Akron',
   'Ohio'],
  'NAME': ['Jayland Walker',
   'Dan Horrigan',
   'Horrigan',
   'Martin Luther King Jr. Boulevard',
   'Martin Luther King Jr.',
   'Walker'],
  'ORGANIZATION': ['Horrigan',
   'Downtown Akron',
   'State Route 59',
   'State Route 8',
   'Walker']},
 'https://www.nbcnews.com/news/world/chunk-alpine-glacier-detaches-italy-killing-6-hikers-injuring-9-rcna36526': {'LOCATION': ['Italy',
   'Marmolada'],
  'NAME': 

In [28]:
csv_df = pd.read_csv('./output.csv')
csv_df

Unnamed: 0,Article Link,Article Content
0,https://www.nbcnews.com/news/danish-police-arr...,"{'NAME': ['Mattias Tesfaye', 'Søren Thomassen'..."
1,https://www.nbcnews.com/news/us-news/akron-ohi...,"{'NAME': ['Jayland Walker', 'Dan Horrigan', 'H..."
2,https://www.nbcnews.com/news/world/chunk-alpin...,"{'NAME': ['Luca Zaia', 'Mario Draghi', 'Alps',..."
3,https://www.nbcnews.com/news/world/israeli-mil...,"{'NAME': ['Shireen Abu Akleh', 'Ned Price', 'A..."
4,https://www.nbcnews.com/news/world/russia-conq...,"{'NAME': ['Vladimir Putin', 'Sergei Shoigu', '..."
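
One thing to keep in mind when reading the CSV back: a CSV cell can only hold text, so each dictionary is stored as its string representation and comes back as a `str`, not a `dict`. The sketch below reproduces this with the standard `csv` module and an in-memory buffer (a stand-in for the pandas calls above, with a made-up link and a shortened entity list) and recovers the dictionary with `ast.literal_eval`.

```python
import ast
import csv
import io

entities = {"NAME": ["Jayland Walker"], "ORGANIZATION": [], "LOCATION": ["Akron", "Ohio"]}

# Write one row; the csv writer stores the dict as str(entities)
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["Article Link", "Article Content"])
writer.writerow(["https://www.nbcnews.com/example-article-1.html", entities])

# Read the row back: the cell is now plain text
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw_cell = row["Article Content"]
print(type(raw_cell).__name__)

# literal_eval safely parses the Python literal back into a dict
recovered = ast.literal_eval(raw_cell)
print(recovered["LOCATION"])
```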


### Since I am only interested in the target entity values, I create a new dataframe with the columns:
* `NAME`
* `ORGANIZATION`
* `LOCATION`

I want to focus on just one article.

In [29]:
data_list = [dict(content) for content in output['Article Content']]
data_list[0]

{'LOCATION': ['Denmark', 'New York', 'London'],
 'NAME': ['Mattias Tesfaye',
  'Søren Thomassen',
  'Thomassen',
  'Mette Frederiksen',
  'Harry Styles',
  'Julianne McShane',
  'Patrick Smith'],
 'ORGANIZATION': ['COPENHAGEN',
  'Copenhagen',
  'Field',
  'YouTube',
  'Live Nation',
  'Camilla Fuhr']}

In [30]:
dict1 = data_list[0]
# Wrap each list in a Series so lists of different lengths are padded with NaN
df = pd.DataFrame({k: pd.Series(v) for k, v in dict1.items()})

In [31]:
df

Unnamed: 0,NAME,ORGANIZATION,LOCATION
0,Mattias Tesfaye,COPENHAGEN,Denmark
1,Søren Thomassen,Copenhagen,New York
2,Thomassen,Field,London
3,Mette Frederiksen,YouTube,
4,Harry Styles,Live Nation,
5,Julianne McShane,Camilla Fuhr,
6,Patrick Smith,,
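
The padding seen in the shorter columns comes from wrapping each list in a `pd.Series`: pandas aligns the series by position and fills the gaps with missing values. The same idea can be sketched in plain Python with `itertools.zip_longest`, which pads with `None`. The data are the lists from `data_list[0]`; the variable names are illustrative.

```python
from itertools import zip_longest

entities = {
    "NAME": ["Mattias Tesfaye", "Søren Thomassen", "Thomassen", "Mette Frederiksen",
             "Harry Styles", "Julianne McShane", "Patrick Smith"],
    "ORGANIZATION": ["COPENHAGEN", "Copenhagen", "Field", "YouTube",
                     "Live Nation", "Camilla Fuhr"],
    "LOCATION": ["Denmark", "New York", "London"],
}

# Build one tuple per row, padding the shorter lists with None
rows = list(zip_longest(entities["NAME"], entities["ORGANIZATION"], entities["LOCATION"]))
for row in rows:
    print(row)
```

Values that share a row are not related to each other; the row position is only a by-product of the padding.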


### I now want to save the results into a database for easy searching with sqlite3.

Open a connection to a new database.

In [32]:
conn = sqlite3.connect('ner_database')
c = conn.cursor()

In [33]:
# Column names match the dataframe written in the next cell
c.execute('CREATE TABLE IF NOT EXISTS articles (NAME TEXT, ORGANIZATION TEXT, LOCATION TEXT)')
conn.commit()

In [34]:
df.to_sql('articles', conn, if_exists='replace', index = False)

In [35]:
c.execute('''  
SELECT * FROM articles
          ''')

for row in c.fetchall():
    print (row)

('Mattias Tesfaye', 'COPENHAGEN', 'Denmark')
('Søren Thomassen', 'Copenhagen', 'New York')
('Thomassen', 'Field', 'London')
('Mette Frederiksen', 'YouTube', None)
('Harry Styles', 'Live Nation', None)
('Julianne McShane', 'Camilla Fuhr', None)
('Patrick Smith', None, None)
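
With the table in place, a search is a parameterized query. The sketch below uses an in-memory database and a few of the rows printed above so that it runs on its own; the `demo_conn` name and the search term are arbitrary choices for illustration.

```python
import sqlite3

# In-memory database so the example leaves no file behind
demo_conn = sqlite3.connect(":memory:")
cur = demo_conn.cursor()
cur.execute("CREATE TABLE articles (NAME TEXT, ORGANIZATION TEXT, LOCATION TEXT)")
cur.executemany(
    "INSERT INTO articles VALUES (?, ?, ?)",
    [
        ("Mattias Tesfaye", "COPENHAGEN", "Denmark"),
        ("Søren Thomassen", "Copenhagen", "New York"),
        ("Thomassen", "Field", "London"),
        ("Patrick Smith", None, None),
    ],
)
demo_conn.commit()

# The ? placeholder lets sqlite3 handle quoting of the search term
cur.execute(
    "SELECT NAME FROM articles WHERE NAME LIKE ? ORDER BY rowid",
    ("%Thomassen%",),
)
matches = [name for (name,) in cur.fetchall()]
print(matches)
demo_conn.close()
```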
