<table align="center">
  <a target="_blank" href="https://colab.research.google.com/github/martinlf6/schwab-ds-takehome-FengLiu/blob/main/03_models.ipynb">
        <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a>
</table>

In [None]:
!pip install datasets==3.6.0 --force-reinstall


Collecting datasets==3.6.0
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting filelock (from datasets==3.6.0)
  Downloading filelock-3.19.1-py3-none-any.whl.metadata (2.1 kB)
Collecting numpy>=1.17 (from datasets==3.6.0)
  Downloading numpy-2.3.2-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (62 kB)
Collecting pyarrow>=15.0.0 (from datasets==3.6.0)
  Downloading pyarrow-21.0.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets==3.6.0)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting pandas (from datasets==3.6.0)
  Downloading pandas-2.3.2-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Collecting

In [None]:
# A) Load dataset
# "sentences_allagree" keeps only the sentences on which all annotators agreed on the sentiment label.
# load_dataset returns a DatasetDict with a "train" split.
# Passing trust_remote_code=True would skip the interactive prompt about the dataset's custom loading code.
from datasets import load_dataset
ds = load_dataset("financial_phrasebank", "sentences_allagree")
# Convert the train split to a pandas DataFrame and rename the columns.
df = ds["train"].to_pandas().rename(columns={"sentence": "text", "label": "y"})
label_map = {0: "negative", 1: "neutral", 2: "positive"}  # numeric class id -> label name
df["label"] = df["y"].map(label_map)  # human-readable sentiment for each numeric value in y
df["len"] = df["text"].str.split().apply(len)  # number of whitespace-separated tokens per sentence

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


The repository for financial_phrasebank contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/financial_phrasebank.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


FinancialPhraseBank-v1.0.zip:   0%|          | 0.00/682k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/2264 [00:00<?, ? examples/s]

In [None]:
df

Unnamed: 0,text,y,label,len
0,"According to Gran , the company has no plans t...",1,neutral,25
1,"For the last quarter of 2010 , Componenta 's n...",2,positive,39
2,"In the third quarter of 2010 , net sales incre...",2,positive,29
3,Operating profit rose to EUR 13.1 mn from EUR ...,2,positive,24
4,"Operating profit totalled EUR 21.1 mn , up fro...",2,positive,22
...,...,...,...,...
2259,Operating result for the 12-month period decre...,0,negative,27
2260,HELSINKI Thomson Financial - Shares in Cargote...,0,negative,40
2261,LONDON MarketWatch -- Share prices ended lower...,0,negative,26
2262,Operating profit fell to EUR 35.4 mn from EUR ...,0,negative,23
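
The `len` column counts whitespace-separated tokens, and the sentences in this dataset are stored with punctuation already split off by spaces (for example `Gran , the company`). The sketch below, using a made-up sentence and two hypothetical helper names (`token_count`, `word_count`), shows how the token count can differ from a count that ignores punctuation-only tokens. It is an illustration under those assumptions, not part of the original pipeline.

```python
# Same mapping as in the loading cell above.
label_map = {0: "negative", 1: "neutral", 2: "positive"}

def token_count(text: str) -> int:
    """Count whitespace-separated tokens, as df["len"] does."""
    return len(text.split())

def word_count(text: str) -> int:
    """Hypothetical variant: count only tokens containing a letter or digit."""
    return sum(1 for tok in text.split() if any(ch.isalnum() for ch in tok))

# Made-up sentence written in the dataset's pre-tokenized style.
sample_text = "Net sales rose 5 % , the company said ."
sample_y = 2

print(label_map[sample_y], token_count(sample_text), word_count(sample_text))
```

If sentence length is later used as a feature, it may be worth deciding which of the two counts is intended.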


In [None]:
# Install spaCy, then use `python -m spacy download` to fetch the transformer pipeline en_core_web_trf.
!pip install spacy
!python -m spacy download en_core_web_trf


Collecting en-core-web-trf==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_trf-3.8.0/en_core_web_trf-3.8.0-py3-none-any.whl (457.4 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_trf')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [None]:
import spacy
import pandas as pd
nlp = spacy.load("en_core_web_trf") # Load spaCy transformer model, en_core_web_trf.

In [None]:
# Define the aspect extraction function.
# Aspects: organizations, products (e.g., "ETF" (exchange-traded fund), "credit card", "mobile app"),
# and key financial facets ("revenue", "guidance", "fees").
def extract_aspects(text):
    doc = nlp(text)  # Run the spaCy pipeline; the resulting Doc exposes tokens, entities, and noun chunks.
    aspects = []  # Extracted phrases, in order of discovery.

    # Named entities of interest: organizations, products, and titled works. Extend the set if needed.
    for ent in doc.ents:
        if ent.label_ in {"ORG", "PRODUCT", "WORK_OF_ART"}:
            aspects.append(ent.text)

    # Noun chunks (e.g., "the quarterly revenue", "profit margin") whose head is a financial term.
    # The comparison uses the head's lemma, so plural entries such as "fees" or "shares" match only
    # if the lemmatizer leaves the head in its plural form.
    for nc in doc.noun_chunks:
        head = nc.root.lemma_.lower()  # lemma (base form) of the chunk's root word
        if head in {"revenue","guidance","dividend","yield","fees","costs","margin","outlook","results","stock","shares"}:
            aspects.append(nc.text)  # keep the whole chunk, not just the head

    # Strip whitespace, drop single-character strings, and remove duplicates while preserving order.
    aspects = list(dict.fromkeys(a.strip() for a in aspects if len(a.strip()) > 1))
    return aspects

df["aspects"] = df["text"].apply(extract_aspects)  # list of extracted aspects for each sentence
df["num_aspects"] = df["aspects"].str.len()  # number of aspects found in each sentence
print(df[["text", "aspects"]].head(10))


                                                text  \
0  According to Gran , the company has no plans t...   
1  For the last quarter of 2010 , Componenta 's n...   
2  In the third quarter of 2010 , net sales incre...   
3  Operating profit rose to EUR 13.1 mn from EUR ...   
4  Operating profit totalled EUR 21.1 mn , up fro...   
5  Finnish Talentum reports its operating profit ...   
6  Clothing retail chain Sepp+ñl+ñ 's sales incre...   
7  Consolidated net sales increased 16 % to reach...   
8  Foundries division reports its sales increased...   
9  HELSINKI ( AFX ) - Shares closed higher , led ...   

                             aspects  
0                                 []  
1                       [Componenta]  
2                                 []  
3                                 []  
4                                 []  
5                 [Finnish Talentum]  
6                        [Sepp+ñl+ñ]  
7                                 []  
8          [Foundries, Machine S
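
The last step of `extract_aspects` relies on `dict.fromkeys` to drop duplicates while keeping first-seen order. A minimal standalone sketch of that idiom follows; the helper name `dedupe_aspects` and the input list are made up for illustration.

```python
def dedupe_aspects(candidates):
    """Strip whitespace, drop strings of length <= 1, keep first-seen order."""
    cleaned = (c.strip() for c in candidates)
    # dict keys are unique and preserve insertion order (Python 3.7+),
    # so fromkeys() acts as an order-preserving de-duplication.
    return list(dict.fromkeys(c for c in cleaned if len(c) > 1))

# Made-up candidates: a repeated entity, a padded duplicate, and a stray symbol.
raw = ["Cargotec", " Cargotec ", "bank stocks", "%", "Cargotec"]
print(dedupe_aspects(raw))
```

A plain `set` would also remove the duplicates, but it does not guarantee that the aspects come back in the order they appeared in the sentence.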

In [None]:
df

Unnamed: 0,text,y,label,len,aspects,num_aspects
0,"According to Gran , the company has no plans t...",1,neutral,25,[],0
1,"For the last quarter of 2010 , Componenta 's n...",2,positive,39,[Componenta],1
2,"In the third quarter of 2010 , net sales incre...",2,positive,29,[],0
3,Operating profit rose to EUR 13.1 mn from EUR ...,2,positive,24,[],0
4,"Operating profit totalled EUR 21.1 mn , up fro...",2,positive,22,[],0
...,...,...,...,...,...,...
2259,Operating result for the 12-month period decre...,0,negative,27,[],0
2260,HELSINKI Thomson Financial - Shares in Cargote...,0,negative,40,"[HELSINKI Thomson Financial, Cargotec]",2
2261,LONDON MarketWatch -- Share prices ended lower...,0,negative,26,"[MarketWatch, bank stocks]",2
2262,Operating profit fell to EUR 35.4 mn from EUR ...,0,negative,23,[],0


In [None]:
# Transform the sentence-level dataset into an aspect-level dataset.
# Each (sentence, aspect) pair becomes one row, so a sentence with several aspects is repeated once per aspect.
# Every aspect inherits the sentence-level sentiment. This is a simplification: in reality, some aspects
# of a sentence might be positive while others are negative.
pairs = []
for _, row in df.iterrows():  # iterate over the DataFrame row by row
    # Sentences with no aspects add no rows, because the inner loop body never runs for an empty list.
    for aspect in row["aspects"]:
        pairs.append({"sentence": row["text"], "aspect": aspect, "label": row["label"]})

pairs_df = pd.DataFrame(pairs)  # one row per aspect instead of one row per sentence
print(pairs_df.head())


                                            sentence            aspect  \
0  For the last quarter of 2010 , Componenta 's n...        Componenta   
1  Finnish Talentum reports its operating profit ...  Finnish Talentum   
2  Clothing retail chain Sepp+ñl+ñ 's sales incre...         Sepp+ñl+ñ   
3  Foundries division reports its sales increased...         Foundries   
4  Foundries division reports its sales increased...      Machine Shop   

      label  
0  positive  
1  positive  
2  positive  
3  positive  
4  positive  


In [None]:
pairs_df

Unnamed: 0,sentence,aspect,label
0,"For the last quarter of 2010 , Componenta 's n...",Componenta,positive
1,Finnish Talentum reports its operating profit ...,Finnish Talentum,positive
2,Clothing retail chain Sepp+ñl+ñ 's sales incre...,Sepp+ñl+ñ,positive
3,Foundries division reports its sales increased...,Foundries,positive
4,Foundries division reports its sales increased...,Machine Shop,positive
...,...,...,...
2003,"Operating profits in the half were 0.8 m , do...",Glisten,negative
2004,HELSINKI Thomson Financial - Shares in Cargote...,HELSINKI Thomson Financial,negative
2005,HELSINKI Thomson Financial - Shares in Cargote...,Cargotec,negative
2006,LONDON MarketWatch -- Share prices ended lower...,MarketWatch,negative
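
Because every aspect inherits its sentence's label, rows that come from multi-aspect sentences are the ones most likely to carry a wrong aspect-level label. One way to keep track of them is sketched below in plain Python. The `flag_multi_aspect` helper, the `multi_aspect` key, and the example sentences are all hypothetical, and the sketch assumes each sentence string is unique in the data.

```python
from collections import Counter

def flag_multi_aspect(pairs):
    """Return copies of the pair dicts with a boolean 'multi_aspect' key.

    A pair is flagged when its sentence appears in more than one pair,
    i.e. the sentence-level label was copied to several aspects.
    """
    counts = Counter(p["sentence"] for p in pairs)
    return [{**p, "multi_aspect": counts[p["sentence"]] > 1} for p in pairs]

# Made-up pairs in the same shape as the rows appended to `pairs`.
example_pairs = [
    {"sentence": "Company A raised its guidance .", "aspect": "Company A", "label": "positive"},
    {"sentence": "Shares in B fell as C cut its outlook .", "aspect": "B", "label": "negative"},
    {"sentence": "Shares in B fell as C cut its outlook .", "aspect": "C", "label": "negative"},
]

flagged = flag_multi_aspect(example_pairs)
for p in flagged:
    print(p["aspect"], p["multi_aspect"])
```

Flagged rows could then be reviewed by hand or excluded when evaluating an aspect-level model.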
