## 1. Setup Environment

First, make sure that spaCy, pandas, and the en_core_web_sm model (loaded in Section 4) are installed in your Python environment.

In [None]:
import spacy
import pandas as pd
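Before running the rest of the notebook, it can help to confirm that everything is importable. The check below is a minimal sketch using only the standard library; the helper name `missing_packages` is an illustrative assumption, not part of spaCy or pandas. If the model is reported missing, it is normally installed with `python -m spacy download en_core_web_sm`.

```python
from importlib import util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if util.find_spec(name) is None]


# 'en_core_web_sm' is the spaCy model package loaded later in this notebook.
required = ["spacy", "pandas", "en_core_web_sm"]
missing = missing_packages(required)

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```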

## 2. Load the Dataset

Now we need to load the stocks-1.tsv file into a pandas DataFrame. Since the file is in .tsv format, we'll use pd.read_csv() with sep='\t' to load it properly.

In [None]:
# Load the dataset (adjust the path as needed)
df = pd.read_csv('stocks-1.tsv', sep='\t')

# Display the first few rows of the DataFrame to examine the structure
df.head()


Unnamed: 0,Symbol,CompanyName,Industry,MarketCap
0,A,Agilent Technologies,Life Sciences Tools & Services,53.65B
1,AA,Alcoa,Metals & Mining,9.25B
2,AAC,Ares Acquisition,Shell Companies,1.22B
3,AACG,ATA Creativity Global,Diversified Consumer Services,90.35M
4,AADI,Aadi Bioscience,Pharmaceuticals,104.85M
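The `Unnamed: 0` column in the output suggests that the file stores a row index in an unlabeled first column. The sketch below shows one way to read such a column as the index with `index_col=0`. It uses a small in-memory stand-in for the file (the variable `sample_tsv` is hypothetical and copies only the first two rows shown above), so it assumes the real file has the same layout.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for the first rows of stocks-1.tsv.
# The header starts with a tab, i.e. the first column has no name.
sample_tsv = (
    "\tSymbol\tCompanyName\tIndustry\tMarketCap\n"
    "0\tA\tAgilent Technologies\tLife Sciences Tools & Services\t53.65B\n"
    "1\tAA\tAlcoa\tMetals & Mining\t9.25B\n"
)

# index_col=0 uses the unlabeled first column as the row index
# instead of keeping it as a data column named 'Unnamed: 0'.
sample_df = pd.read_csv(io.StringIO(sample_tsv), sep="\t", index_col=0)

print(sample_df.columns.tolist())
```

With the real file, the same argument would be passed alongside `sep='\t'` in the existing `pd.read_csv()` call.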


## 3. Extract Data for Patterns

The company names and stock symbols are in the CompanyName and Symbol columns respectively. Here we extract the unique company names and turn each one into a COMPANY pattern.

In [None]:
# Extract unique company names from the 'CompanyName' column
company_names = df['CompanyName'].unique()

# Display the first few company names to verify
print("Company Names:", company_names[:10])

# Create patterns for the company names
patterns = [{"label": "COMPANY", "pattern": company} for company in company_names]


Company Names: ['Agilent Technologies' 'Alcoa' 'Ares Acquisition' 'ATA Creativity Global'
 'Aadi Bioscience' 'Arlington Asset Investment' 'American Airlines'
 'Altisource Asset Management' 'Atlantic American' "The Aaron's Company"]
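The cell above builds patterns for company names only. If ticker symbols should be matched as well, the same idea extends to the Symbol column. The sketch below is one possible approach, not the notebook's method: the helper `build_patterns`, the `STOCK` label, and the three-row `sample_df` are all assumptions made for illustration. It also drops missing values, since a NaN cannot serve as a pattern.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from stocks-1.tsv.
sample_df = pd.DataFrame(
    {
        "Symbol": ["A", "AA", None],
        "CompanyName": ["Agilent Technologies", "Alcoa", "Alcoa"],
    }
)


def build_patterns(frame, column, label):
    """Build one EntityRuler phrase pattern per unique, non-missing value."""
    values = frame[column].dropna().astype(str).str.strip().unique()
    return [{"label": label, "pattern": value} for value in values if value]


company_patterns = build_patterns(sample_df, "CompanyName", "COMPANY")
symbol_patterns = build_patterns(sample_df, "Symbol", "STOCK")  # 'STOCK' is an assumed label

print(company_patterns)
print(symbol_patterns)
```

One caveat with this design: very short symbols such as A are also ordinary words, so a pattern for them will match the capitalized article at the start of a sentence. Filtering symbols by length or by surrounding parentheses may be needed in practice.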


## 4. Create an EntityRuler

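Now, let's load the en_core_web_sm model and add an EntityRuler to its pipeline. The ruler is placed before the ner component so that the rule-based COMPANY matches are set first; the statistical entity recognizer then keeps those spans and makes its own predictions around them.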

In [None]:
# Initialize a spaCy model
nlp = spacy.load('en_core_web_sm')

# Add the EntityRuler to the pipeline before the 'ner' component
ruler = nlp.add_pipe("entity_ruler", before="ner")

# Add the patterns for the company names to the EntityRuler
ruler.add_patterns(patterns)

# Verify the updated pipeline components
print(nlp.pipeline)


[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec object at 0x176f4dd90>), ('tagger', <spacy.pipeline.tagger.Tagger object at 0x1773214f0>), ('parser', <spacy.pipeline.dep_parser.DependencyParser object at 0x176de57e0>), ('attribute_ruler', <spacy.pipeline.attributeruler.AttributeRuler object at 0x1773a8690>), ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer object at 0x17731b850>), ('entity_ruler', <spacy.pipeline.entityruler.EntityRuler object at 0x1770cff50>), ('ner', <spacy.pipeline.ner.EntityRecognizer object at 0x17720c2e0>)]
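The position of the ruler in the pipeline matters because an EntityRuler does not replace entities that already exist unless its `overwrite_ents` setting is enabled. The sketch below illustrates this under stated assumptions: it uses a blank English pipeline so that no model download is needed, and a first EntityRuler named `ner_stand_in` (a hypothetical name) plays the role of the statistical NER by tagging Alcoa as ORG before the company ruler runs.

```python
import spacy


def labels_with_ruler_last(overwrite):
    """Run a company ruler AFTER a component that has already tagged the span."""
    nlp_demo = spacy.blank("en")

    # Stand-in for the statistical NER: tags "Alcoa" as ORG first.
    ner_stand_in = nlp_demo.add_pipe("entity_ruler", name="ner_stand_in")
    ner_stand_in.add_patterns([{"label": "ORG", "pattern": "Alcoa"}])

    # Company ruler added afterwards, with or without permission to overwrite.
    company_ruler = nlp_demo.add_pipe(
        "entity_ruler",
        name="company_ruler",
        config={"overwrite_ents": overwrite},
    )
    company_ruler.add_patterns([{"label": "COMPANY", "pattern": "Alcoa"}])

    doc = nlp_demo("Alcoa reported earnings.")
    return [(ent.text, ent.label_) for ent in doc.ents]


print(labels_with_ruler_last(overwrite=False))
print(labels_with_ruler_last(overwrite=True))
```

Without overwriting, the earlier ORG label should survive; with `overwrite_ents` enabled, the COMPANY label should take its place. Adding the ruler before ner, as this notebook does, avoids the question altogether.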


## 5. Test the EntityRuler

We’ll test the EntityRuler using the sample texts provided in the assignment. First, let’s define the paragraphs and process them with the updated nlp pipeline.

In [None]:
# Sample text with company names
text = """
Helmerich & Payne (HP) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Energy Equipment & Services sector. 
In contrast, Check-Cap (CHEK) faced a decline of 2.3% following its announcement of increased costs related to supply chain disruptions.
"""

# Process the text with the updated spaCy pipeline
doc = nlp(text)

# Display the recognized entities and their labels
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")

# Another sample text with company names
text2 = """
Aemetis (AMTX) saw its stock rise by 1.5%, fueled by optimistic forecasts in the Oil, Gas & Consumable Fuels sector.
Meanwhile, Ferro Corporation (FOE) faced a decline of 2.3% following its announcement of increased costs.
"""

# Process the second sample text
doc2 = nlp(text2)

# Display the recognized entities and their labels
for ent in doc2.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")



Entity: Helmerich & Payne, Label: COMPANY
Entity: HP, Label: ORG
Entity: 1.5%, Label: PERCENT
Entity: the Energy Equipment & Services, Label: ORG
Entity: Check-Cap, Label: COMPANY
Entity: 2.3%, Label: PERCENT
Entity: Aemetis, Label: COMPANY
Entity: 1.5%, Label: PERCENT
Entity: the Oil, Gas & Consumable Fuels, Label: ORG
Entity: Ferro Corporation, Label: COMPANY
Entity: FOE, Label: ORG
Entity: 2.3%, Label: PERCENT
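Printed entity lists get hard to scan as the texts grow, so it can be convenient to collect the entities into a DataFrame and filter by label. The following is a minimal sketch: the helper `entities_to_frame` is hypothetical, and a blank pipeline with two hand-written COMPANY patterns stands in for the full pipeline so that the example runs without the en_core_web_sm model.

```python
import pandas as pd
import spacy

# Blank pipeline with a ruler only; a stand-in for the notebook's full pipeline.
nlp_demo = spacy.blank("en")
ruler_demo = nlp_demo.add_pipe("entity_ruler")
ruler_demo.add_patterns(
    [
        {"label": "COMPANY", "pattern": "Aemetis"},
        {"label": "COMPANY", "pattern": "Ferro Corporation"},
    ]
)


def entities_to_frame(doc):
    """Collect each entity's text, label, and character offsets into a DataFrame."""
    rows = [
        {
            "text": ent.text,
            "label": ent.label_,
            "start": ent.start_char,
            "end": ent.end_char,
        }
        for ent in doc.ents
    ]
    return pd.DataFrame(rows, columns=["text", "label", "start", "end"])


demo_doc = nlp_demo("Aemetis (AMTX) rose, while Ferro Corporation (FOE) fell.")
ents_df = entities_to_frame(demo_doc)

# Keep only the rule-based company matches.
print(ents_df[ents_df["label"] == "COMPANY"])
```

The same helper could be applied to doc and doc2 from the cell above to compare which spans came from the ruler (COMPANY) and which from the statistical model (ORG, PERCENT).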


## 6. Final Testing and Reprocessing

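Finally, test the EntityRuler on other text inputs to confirm that it recognizes company names consistently. Note that the patterns cover company names only: in the output above, HP and FOE are labeled ORG by the statistical NER model rather than by the ruler, and CHEK and AMTX are not recognized at all. Reprocessing the DataFrame is a further check, although the cell below only re-prints the entities from the second sample text.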

In [None]:
# Display all entities from the second sample text
for ent in doc2.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")


Entity: Aemetis, Label: COMPANY
Entity: 1.5%, Label: PERCENT
Entity: the Oil, Gas & Consumable Fuels, Label: ORG
Entity: Ferro Corporation, Label: COMPANY
Entity: FOE, Label: ORG
Entity: 2.3%, Label: PERCENT
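One way to check the ruler against the dataset itself is to run each company name back through the pipeline and record whether it comes out as a single COMPANY entity. The sketch below assumes a tiny hypothetical list of names in place of the CompanyName column and a blank pipeline in place of the full one; `is_full_company_match` is an illustrative helper, and Unlisted Holdings is a made-up name included to show a miss.

```python
import pandas as pd
import spacy

# Names the ruler knows about (first rows of the dataset shown earlier).
known_names = ["Agilent Technologies", "Alcoa"]

nlp_check = spacy.blank("en")
ruler_check = nlp_check.add_pipe("entity_ruler")
ruler_check.add_patterns(
    [{"label": "COMPANY", "pattern": name} for name in known_names]
)


def is_full_company_match(doc):
    """True if the whole text is recognized as exactly one COMPANY entity."""
    return (
        len(doc.ents) == 1
        and doc.ents[0].label_ == "COMPANY"
        and doc.ents[0].text == doc.text
    )


# Hypothetical frame to reprocess; the last name is deliberately unknown.
check_df = pd.DataFrame(
    {"CompanyName": ["Agilent Technologies", "Alcoa", "Unlisted Holdings"]}
)

# nlp.pipe processes the texts as a stream, which is faster than calling nlp() per row.
docs = nlp_check.pipe(check_df["CompanyName"].tolist())
check_df["Recognized"] = [is_full_company_match(doc) for doc in docs]

print(check_df)
```

Rows where Recognized is False point to names the patterns miss, for example because of tokenization or spelling differences.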
