Question 3: information_extraction.ipynb — Named Entity Recognition 

1 - Add 2 new texts containing PERSON, ORG, and DATE or GPE entities
2 - Extract entities using spaCy

In [1]:
!pip -q install spacy

In [2]:
!python -m spacy download en_core_web_sm -q

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [3]:
import spacy
import pandas as pd

In [4]:
nlp = spacy.load("en_core_web_sm")

In [51]:
texts = [
    "Dr. Angela Susan Mathew joined Simon Fraser University in 2021.",
    "Ronald started working at Google Canada in Vancouver.",
    "Murilo Farias is studying Data Engineering at Canadian College of Technology and Business.",
    "Jeff and Ronald attended a workshop hosted by Microsoft.",
    "Amazon opened a new office in Vancouver in 2022.",
    "Sarah Thompson was hired by IBM last year.",
    "The project was developed by engineers from Meta.",
    "Murilo and Jeff collaborated on a project at CCTB.",
    "Apple announced a partnership with Simon Fraser University.",
    "Dr. John Smith works as a researcher at Google.",
    "Microsoft Canada organized an AI conference in Toronto.",
    "Angela Mathew presented her research at UBC.",
    "Ronald moved to Vancouver to work for Amazon.",
    "IBM hired new data engineers from CCTB.",
    "Jeff joined a startup backed by Google Ventures.",
    "Murilo completed an internship at Microsoft.",
    "The team from Meta visited Simon Fraser University.",
    "Sarah and John are employees of Apple.",
    "Amazon Web Services hired engineers in Canada.",
    "Google collaborated with IBM on a cloud project."
]


In [52]:
rows = [] 

In [53]:

for doc_id, text in enumerate(texts, start=1):
    # Run the text through the spaCy pipeline
    doc = nlp(text)

    # Each ent is an entity span with .text and .label_
    for ent in doc.ents:
        rows.append({
            "doc_id": doc_id,            # which document this entity came from
            "text": text,                # original text
            "entity_text": ent.text,     # the exact string spaCy marked as an entity
            "entity_label": ent.label_,  # entity type (PERSON, ORG, GPE, DATE, etc.)
        })
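
The loop above calls `nlp(text)` once per string and appends to a global list. As a hedged alternative, the same rows can be built by a small function that accepts already-processed documents, so that spaCy's `nlp.pipe(texts)` (which streams texts through the pipeline) can be plugged in. The helper name `entities_to_rows` is hypothetical, and the `SimpleNamespace` objects below are stand-ins for spaCy documents and entities, used only so the sketch runs without the model.

```python
from types import SimpleNamespace


def entities_to_rows(texts, docs):
    """Build one row per entity from parallel iterables of texts and docs.

    `docs` can be any iterable of objects exposing `.ents`, where each entity
    has `.text` and `.label_` (as spaCy documents and entity spans do).
    """
    rows = []
    for doc_id, (text, doc) in enumerate(zip(texts, docs), start=1):
        for ent in doc.ents:
            rows.append({
                "doc_id": doc_id,
                "text": text,
                "entity_text": ent.text,
                "entity_label": ent.label_,
            })
    return rows


# Stand-in objects (hypothetical data, not real spaCy output) for a dry run.
demo_texts = ["Ronald moved to Vancouver to work for Amazon."]
demo_docs = [
    SimpleNamespace(ents=[
        SimpleNamespace(text="Ronald", label_="PERSON"),
        SimpleNamespace(text="Vancouver", label_="GPE"),
        SimpleNamespace(text="Amazon", label_="ORG"),
    ])
]
demo_rows = entities_to_rows(demo_texts, demo_docs)

# With the notebook's pipeline, the call would be:
# rows = entities_to_rows(texts, nlp.pipe(texts))
```

Because the function returns a fresh list on every call, re-running the cell does not duplicate rows, which can happen when appending to a list defined in an earlier cell.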

In [54]:
entities_df = pd.DataFrame(rows)

In [55]:
entities_df

Unnamed: 0,doc_id,text,entity_text,entity_label
0,1,Dr. Angela Susan Mathew joined Simon Fraser Un...,Angela Susan Mathew,PERSON
1,1,Dr. Angela Susan Mathew joined Simon Fraser Un...,Simon Fraser University,ORG
2,1,Dr. Angela Susan Mathew joined Simon Fraser Un...,2021,DATE
3,2,Ronald started working at Google Canada in Van...,Ronald,PERSON
4,2,Ronald started working at Google Canada in Van...,Google Canada,ORG
5,2,Ronald started working at Google Canada in Van...,Vancouver,GPE
6,3,Murilo Farias is studying Data Engineering at ...,Murilo Farias,PERSON
7,3,Murilo Farias is studying Data Engineering at ...,Data Engineering,ORG
8,3,Murilo Farias is studying Data Engineering at ...,Canadian College of Technology and Business,ORG
9,4,Jeff and Ronald attended a workshop hosted by ...,Jeff,PERSON


3 - Filter results to include only PERSON and ORG

In [56]:
entities_df_filters = entities_df[entities_df["entity_label"].isin(["PERSON", "ORG"])]


In [57]:
entities_df_filters

Unnamed: 0,doc_id,text,entity_text,entity_label
0,1,Dr. Angela Susan Mathew joined Simon Fraser Un...,Angela Susan Mathew,PERSON
1,1,Dr. Angela Susan Mathew joined Simon Fraser Un...,Simon Fraser University,ORG
3,2,Ronald started working at Google Canada in Van...,Ronald,PERSON
4,2,Ronald started working at Google Canada in Van...,Google Canada,ORG
6,3,Murilo Farias is studying Data Engineering at ...,Murilo Farias,PERSON
7,3,Murilo Farias is studying Data Engineering at ...,Data Engineering,ORG
8,3,Murilo Farias is studying Data Engineering at ...,Canadian College of Technology and Business,ORG
9,4,Jeff and Ronald attended a workshop hosted by ...,Jeff,PERSON
10,4,Jeff and Ronald attended a workshop hosted by ...,Ronald,PERSON
11,4,Jeff and Ronald attended a workshop hosted by ...,Microsoft,ORG


4 - Create a frequency table of extracted entities

In [58]:
# Count how often each entity string appears. Grouping by entity_text alone
# merges identical strings even if spaCy gave them different labels; group by
# ["entity_text", "entity_label"] instead to count each label separately.
freq_table = (
    entities_df_filters
    .groupby(["entity_text"])
    .size()
    .reset_index(name="frequency")
    .sort_values("frequency", ascending=False)
)

freq_table


Unnamed: 0,entity_text,frequency
11,Jeff,3
18,Ronald,3
10,IBM,3
7,Google,2
15,Microsoft,2
14,Meta,2
0,Amazon,2
4,Apple,2
22,Simon Fraser University,2
6,Data Engineering,1
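
The frequency table above counts by `entity_text` alone. An alternative is to count `(entity_text, entity_label)` pairs, which matters when a model assigns the same string two different labels. The sketch below uses `collections.Counter` on made-up pairs (invented for illustration, not output from this notebook) to show how the two choices differ.

```python
from collections import Counter

# Hypothetical (entity_text, entity_label) pairs, invented for illustration.
pairs = [
    ("Amazon", "ORG"),
    ("Amazon", "ORG"),
    ("Amazon", "GPE"),   # same string, different label
    ("Jeff", "PERSON"),
]

# Counting by text alone merges both labels into one entry.
by_text = Counter(text for text, _ in pairs)

# Counting by (text, label) keeps them apart.
by_text_and_label = Counter(pairs)

print(by_text["Amazon"])                     # 3
print(by_text_and_label[("Amazon", "ORG")])  # 2
print(by_text_and_label[("Amazon", "GPE")])  # 1
```

In pandas, the second variant corresponds to grouping by both columns, `groupby(["entity_text", "entity_label"])`.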


5 - Display the top 10 most frequent entities

In [60]:
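# freq_table is already sorted by frequency; sorting again is harmless, but
# rows with equal frequency may come back in a different order because the
# default sort algorithm is not stable.
top_10_entities = (
    freq_table
    .sort_values(by="frequency", ascending=False)
    .head(10)
)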

In [61]:
top_10_entities

Unnamed: 0,entity_text,frequency
11,Jeff,3
10,IBM,3
18,Ronald,3
7,Google,2
15,Microsoft,2
14,Meta,2
0,Amazon,2
4,Apple,2
22,Simon Fraser University,2
3,Angela Susan Mathew,1
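
Several entities share the same frequency, so which of them lands in the top 10, and in what order, depends on how ties are broken. The following is a small sketch, assuming ties should be resolved alphabetically (an assumption, not something the assignment specifies). The helper name `top_n` is hypothetical, and the counts are copied from the frequency table above for a handful of entities.

```python
def top_n(counts, n):
    """Return the n most frequent (entity, count) pairs, ties sorted by name."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:n]


# A few counts copied from the frequency table above.
counts = {"Jeff": 3, "Ronald": 3, "IBM": 3, "Google": 2, "Apple": 2}

top_3 = top_n(counts, 3)
print(top_3)  # [('IBM', 3), ('Jeff', 3), ('Ronald', 3)]
```

The pandas counterpart would be to sort on both columns before taking the head, for example `freq_table.sort_values(["frequency", "entity_text"], ascending=[False, True]).head(10)`.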
