# HW 4-7

## Step 1: Setup Environment

- Ensure you have spaCy and pandas installed in your Python environment.
- Import the required libraries for working with spaCy and pandas.
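
Before installing anything, it can help to check which packages are already present. The sketch below is one possible way to do that with the standard library; the helper name `missing_packages` is made up for this example and is not part of spaCy or pandas.

```python
import importlib.util


def missing_packages(names):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the two libraries this notebook depends on.
missing = missing_packages(["spacy", "pandas"])
if missing:
    print("Install first:", ", ".join(missing))
else:
    print("spaCy and pandas are both importable.")
```

The same call could also be given `"en_core_web_sm"`, since the downloaded model is installed as a regular Python package.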

In [1]:
import pandas as pd 
!pip install spacy
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


## Step 2: Load the Dataset

- Download the provided dataset and load it into a pandas DataFrame. HINT: The file is a .tsv, so use pd.read_csv() with sep='\t' to read it.
- Examine the DataFrame to identify the columns containing company names and stock symbols.

In [4]:
df=pd.read_csv('stocks-1.tsv', sep='\t')
df

| Unnamed: 0 | Symbol | CompanyName | Industry | MarketCap |
| --- | --- | --- | --- | --- |
| 0 | A | Agilent Technologies | Life Sciences Tools & Services | 53.65B |
| 1 | AA | Alcoa | Metals & Mining | 9.25B |
| 2 | AAC | Ares Acquisition | Shell Companies | 1.22B |
| 3 | AACG | ATA Creativity Global | Diversified Consumer Services | 90.35M |
| 4 | AADI | Aadi Bioscience | Pharmaceuticals | 104.85M |
| ... | ... | ... | ... | ... |
| 5874 | ZWRK | Z-Work Acquisition | Shell Companies | 278.88M |
| 5875 | ZY | Zymergen | Chemicals | 1.31B |
| 5876 | ZYME | Zymeworks | Biotechnology | 1.50B |
| 5877 | ZYNE | Zynerba Pharmaceuticals | Pharmaceuticals | 184.39M |
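
The `Unnamed: 0` column in the output above is what pandas typically produces when the first column of a file has an empty header, which usually means a row index was saved along with the data. Assuming that is the case here, passing `index_col=0` reads that column as the index instead of as data. The sketch below uses a small in-memory excerpt (built by hand from the rows shown above, not the real file) so it runs on its own.

```python
import io

import pandas as pd

# Hand-built three-row excerpt with the assumed layout of the real file:
# tab-separated, with an unnamed leading index column.
sample_tsv = (
    "\tSymbol\tCompanyName\tIndustry\tMarketCap\n"
    "0\tA\tAgilent Technologies\tLife Sciences Tools & Services\t53.65B\n"
    "1\tAA\tAlcoa\tMetals & Mining\t9.25B\n"
    "2\tAAC\tAres Acquisition\tShell Companies\t1.22B\n"
)

# index_col=0 tells pandas to use the first column as the row index.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)
print(sample_df.columns.tolist())
```

With the real file, the equivalent call would be `pd.read_csv('stocks-1.tsv', sep='\t', index_col=0)`.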


## Step 3: Extract Data for Patterns

- Extract unique company names and stock symbols from the appropriate columns of the DataFrame.
- Create a pattern for each company name and stock symbol, formatted so that spaCy's EntityRuler can recognize it.
- DO NOT manually input individual company stocks to create your patterns. Find an automated solution in case your DataFrame ever gets updated. HINT: Think for-loops.

In [8]:
# Extract the unique company names and stock symbols
unique_companies = df['CompanyName'].unique()
unique_symbols = df['Symbol'].unique()

# Create empty lists to hold the patterns
company_patterns = []
symbol_patterns = []

# Loop through the company names and create one ORG pattern per name
for name in unique_companies:
    pattern = {'label': 'ORG', 'pattern': name}
    company_patterns.append(pattern)

# Loop through the symbols and create one STOCK pattern per symbol
for symbol in unique_symbols:
    pattern = {'label': 'STOCK', 'pattern': symbol}
    symbol_patterns.append(pattern)

# Combine both pattern lists
all_patterns = company_patterns + symbol_patterns
all_patterns


[{'label': 'ORG', 'pattern': 'Agilent Technologies'},
 {'label': 'ORG', 'pattern': 'Alcoa'},
 {'label': 'ORG', 'pattern': 'Ares Acquisition'},
 {'label': 'ORG', 'pattern': 'ATA Creativity Global'},
 {'label': 'ORG', 'pattern': 'Aadi Bioscience'},
 {'label': 'ORG', 'pattern': 'Arlington Asset Investment'},
 {'label': 'ORG', 'pattern': 'American Airlines'},
 {'label': 'ORG', 'pattern': 'Altisource Asset Management'},
 {'label': 'ORG', 'pattern': 'Atlantic American'},
 {'label': 'ORG', 'pattern': "The Aaron's Company"},
 {'label': 'ORG', 'pattern': 'Applied Optoelectronics'},
 {'label': 'ORG', 'pattern': 'AAON, Inc.'},
 {'label': 'ORG', 'pattern': 'Advance Auto Parts'},
 {'label': 'ORG', 'pattern': 'Apple'},
 {'label': 'ORG', 'pattern': 'Accelerate Acquisition'},
 {'label': 'ORG', 'pattern': 'American Assets Trust'},
 {'label': 'ORG', 'pattern': 'Autoscope Technologies'},
 {'label': 'ORG', 'pattern': 'Almaden Minerals'},
 {'label': 'ORG', 'pattern': 'Atlas Air Worldwide Holdings'},
 ...]
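
The two loops above follow the same recipe, so they could be folded into one helper. The sketch below is one way to do that, not the assignment's required solution: `build_patterns` is a hypothetical name, and the extra checks assume the data might contain blanks, duplicates, or values pandas reads as missing (for example, the literal text `NA` is treated as missing by `pd.read_csv` by default).

```python
import math


def build_patterns(values, label):
    """Build one EntityRuler-style phrase pattern per distinct, non-missing value."""
    patterns = []
    seen = set()
    for value in values:
        # Skip None and float NaN, which is how pandas marks missing cells.
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        text = str(value).strip()
        # Skip empty strings and values that were already added.
        if not text or text in seen:
            continue
        seen.add(text)
        patterns.append({"label": label, "pattern": text})
    return patterns


# Small hand-made input that exercises duplicates, NaN, and stray whitespace.
print(build_patterns(["Alcoa", "Alcoa", float("nan"), " Apple "], "ORG"))
```

In the notebook this would be called as `build_patterns(df['CompanyName'], 'ORG') + build_patterns(df['Symbol'], 'STOCK')`, which keeps the solution automated if the DataFrame is updated.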

## Step 4: Create an EntityRuler

- Use a spaCy language model to create an EntityRuler.
- Add the patterns for both companies and stock symbols to the EntityRuler pipeline.

In [10]:
import spacy

nlp = spacy.load('en_core_web_sm')

# Add an EntityRuler before the statistical 'ner' component so that the
# rule-based matches are set first and the model does not overwrite them
ruler = nlp.add_pipe('entity_ruler', before='ner')
ruler.add_patterns(all_patterns)

## Step 5: Test the EntityRuler

- Use sample texts that include references to companies and stock symbols.
- Apply the EntityRuler to each text and check whether it correctly identifies the entities.

In [19]:
# Each paragraph is tested in its own cell; this cell covers paragraph 1
paragraph1="""Helmerich & Payne (HP) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Energy Equipment & Services sector. In contrast, Check-Cap (CHEK) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.

Meanwhile, Vallon Pharmaceuticals (VLON) gained 0.8% after strong quarterly earnings, outperforming its peers in the Biotechnology space. Sequans Communications (SQNS) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Semiconductors & Semiconductor Equipment industry."""


# Run the pipeline on the first paragraph and print the ORG and STOCK entities
doc = nlp(paragraph1)
for ent in doc.ents:
    if ent.label_ in ('ORG', 'STOCK'):
        print(f"{ent.text} - {ent.label_}")

Helmerich & Payne - ORG
HP - STOCK
the Energy Equipment & Services - ORG
Check-Cap - ORG
CHEK - STOCK
Vallon Pharmaceuticals - ORG
VLON - STOCK
Biotechnology - ORG
Sequans Communications - ORG
SQNS - STOCK
Semiconductors & Semiconductor Equipment - ORG
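
Entries such as `the Energy Equipment & Services - ORG` and `Biotechnology - ORG` in the output above are industry names. Assuming no company in the dataset carries those names, they most likely come from the model's statistical `ner` component rather than from the EntityRuler. One way to check what the rules alone match is a ruler-only pipeline, sketched below with `spacy.blank("en")`. The two patterns are copied by hand only to keep the example self-contained; in the notebook, `all_patterns` would be used instead.

```python
import spacy

# Ruler-only pipeline: a blank English tokenizer plus the EntityRuler,
# so no statistical model can contribute entities.
ruler_only = spacy.blank("en")
ruler = ruler_only.add_pipe("entity_ruler")

# Hand-copied subset of patterns, used here only for illustration.
ruler.add_patterns([
    {"label": "ORG", "pattern": "Helmerich & Payne"},
    {"label": "STOCK", "pattern": "HP"},
])

text = (
    "Helmerich & Payne (HP) saw its stock rise by 1.5%, fueled by optimistic "
    "forecasts in the Energy Equipment & Services sector."
)

# Only rule-based matches can appear in doc.ents for this pipeline.
matches = [(ent.text, ent.label_) for ent in ruler_only(text).ents]
print(matches)
```

Anything that shows up in the full pipeline's output but not here was added by the model, not by the patterns.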


In [20]:
# this is for paragraph 2
paragraph2="""Aemetis (AMTX) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Ferro Corporation (FOE) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.

Meanwhile, RingCentral (RNG) gained 0.8% after strong quarterly earnings, outperforming its peers in the Software space. ACI Worldwide (ACIW) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Software industry."""

# Run the pipeline on the second paragraph and print the ORG and STOCK entities
doc = nlp(paragraph2)
for ent in doc.ents:
    if ent.label_ in ('ORG', 'STOCK'):
        print(f"{ent.text} - {ent.label_}")

Aemetis - ORG
AMTX - STOCK
the Oil, Gas & Consumable Fuels - ORG
Ferro Corporation - ORG
FOE - STOCK
RingCentral - ORG
RNG - STOCK
Software - ORG
ACI Worldwide - ORG
ACIW - STOCK
Software - ORG


In [22]:
# this is for paragraph 3
paragraph3="""On a mixed trading day, Par Pacific Holdings (PARR) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector. In contrast, Nano Dimension (NNDM) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.

Meanwhile, Beyond Meat (BYND) gained 0.8% after strong quarterly earnings, outperforming its peers in the Food Products space. Apollo Investment (AINV) also recorded a modest increase of 0.5%, reflecting investors' confidence in its ability to navigate challenges in the Capital Markets industry."""

# Run the pipeline on the third paragraph and print the ORG and STOCK entities
doc = nlp(paragraph3)
for ent in doc.ents:
    if ent.label_ in ('ORG', 'STOCK'):
        print(f"{ent.text} - {ent.label_}")

Par Pacific Holdings - ORG
PARR - STOCK
the Oil, Gas & Consumable Fuels - ORG
Nano Dimension - ORG
NNDM - STOCK
Beyond Meat - ORG
BYND - STOCK
Food Products - ORG
Apollo Investment - ORG
AINV - STOCK
Capital Markets - ORG
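
The three test cells repeat the same loop, so the filtering step could be wrapped in a small function. The sketch below is one possible shape for it; `extract_entities` is a hypothetical helper name, and `fake_nlp` is a stand-in pipeline with made-up entities so the example runs without a model.

```python
from types import SimpleNamespace


def extract_entities(nlp, text, labels=("ORG", "STOCK")):
    """Run the pipeline on text and return (text, label) pairs for the given labels."""
    return [(ent.text, ent.label_) for ent in nlp(text).ents if ent.label_ in labels]


# Stand-in for a spaCy pipeline: any callable returning an object with .ents works.
# In the notebook, pass the real nlp object instead.
def fake_nlp(text):
    return SimpleNamespace(ents=[
        SimpleNamespace(text="Beyond Meat", label_="ORG"),
        SimpleNamespace(text="0.8%", label_="PERCENT"),
        SimpleNamespace(text="BYND", label_="STOCK"),
    ])


print(extract_entities(fake_nlp, "Beyond Meat (BYND) gained 0.8%"))
```

With the real pipeline, looping over `(paragraph1, paragraph2, paragraph3)` and calling `extract_entities(nlp, paragraph)` for each would replace the three separate cells.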
