# General Observations
### Types of Named Entities Identified:

- PERCENT: Identifies percentages (e.g., "-0.42%", "9.61%").
- TIME: Identifies times (e.g., "this morning", "5:00 PM ET").
- GPE (Geopolitical Entity): Identifies countries, cities, states (e.g., "China", "U.S.", "Britain").
- DATE: Identifies dates and periods (e.g., "Tuesday", "January", "3-1/2 week").
- ORG (Organization): Identifies organizations and companies (e.g., "Boeing Co (BA)", "Morgan Stanley", "Fed").
- MONEY: Identifies monetary values (e.g., "$3.8 billion", "27.87").
- PERSON: Identifies people's names (e.g., "Christopher Waller", "Charles Schwab").
- CARDINAL: Identifies numbers that do not fall into other categories (e.g., "25", "412").

### Entity Recognition Accuracy:

- Most entities are correctly identified, such as organizations (e.g., "Morgan Stanley"), monetary values (e.g., "$3.8 billion"), and dates (e.g., "Tuesday").
- Some entities are contextually plausible but may be misclassified, especially when the text is ambiguous. A bare number tagged CARDINAL may need more context to interpret, and a name such as "Charles Schwab" can denote either a person or the brokerage named after him.
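
The misclassification risk noted above can be narrowed down by flagging entities whose label depends heavily on context. The sketch below is a hypothetical post-processing step, not part of the notebook: `flag_for_review` and the regular expression are assumptions, and the input is a list of `(text, label)` tuples shaped like the ones `apply_ner` returns.

```python
import re

# Bare numbers such as "25" or "27.87" carry no unit, so whether they are
# CARDINAL, MONEY, or PERCENT depends entirely on the surrounding context.
BARE_NUMBER = re.compile(r"^-?\d+(?:[.,]\d+)*$")

def flag_for_review(entities):
    """Return the (text, label) pairs whose text is a bare number."""
    return [(text, label) for text, label in entities if BARE_NUMBER.match(text)]

# Illustrative entities reusing the examples listed in the summary.
sample = [
    ("Morgan Stanley", "ORG"),
    ("$3.8 billion", "MONEY"),
    ("27.87", "MONEY"),
    ("25", "CARDINAL"),
    ("-0.42%", "PERCENT"),
]

print(flag_for_review(sample))
```

Entities with an explicit unit ("$3.8 billion", "-0.42%") pass through untouched; only the unitless values are surfaced for manual checking.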

# Named Entities Contextual Analysis:

- The entities span several contexts: stock market performance, economic data, corporate news, and geopolitical events.
- The NER model captures the key players, financial figures, dates, and locations, giving a comprehensive view of the text content.
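
To see which key players, figures, dates, and locations were captured, the entities can be grouped by label. This is a minimal sketch, assuming input shaped like the `Named_Entities` column (one list of `(text, label)` tuples per row). `group_by_label` is a hypothetical helper, and the sample rows reuse examples quoted in the summary rather than real model output.

```python
from collections import defaultdict

def group_by_label(rows):
    """Collect the distinct entity texts seen under each label, in first-seen order."""
    grouped = defaultdict(list)
    for entities in rows:
        for text, label in entities:
            if text not in grouped[label]:
                grouped[label].append(text)
    return dict(grouped)

# Two illustrative rows; each inner list stands in for one article's entities.
rows = [
    [("China", "GPE"), ("U.S.", "GPE"), ("-0.42%", "PERCENT")],
    [("Tuesday", "DATE"), ("Morgan Stanley", "ORG"), ("China", "GPE")],
]

for label, texts in group_by_label(rows).items():
    print(f"{label}: {texts}")
```

Deduplicating per label keeps the overview short when the same company or country appears in many articles.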


In [1]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [2]:
import pandas as pd

def load_and_describe(file_path):
    """
    Load a CSV file into a pandas DataFrame and print its shape, columns, and first few rows.

    Parameters:
    - file_path (str): The path to the CSV file.

    Returns:
    - pd.DataFrame: The loaded DataFrame.
    """
    data = pd.read_csv(file_path)
    print(f"Shape of {file_path}: {data.shape}")
    print(f"Columns of {file_path}: {data.columns.tolist()}")
    print(f"First few rows of {file_path}:")
    print(data.head(), "\n")
    return data

if __name__ == "__main__":
    # Define paths to the CSV files
    aa_path = '/content/drive/MyDrive/42850/ass3/stock_price_preprocessed/aa.csv'
    aapl_path = '/content/drive/MyDrive/42850/ass3/stock_price_preprocessed/aapl.csv'
    news_data_path = '/content/drive/MyDrive/42850/ass3/news_data_preprocessed/aa.csv'

    # Load and describe each CSV file
    aa_data = load_and_describe(aa_path)
    aapl_data = load_and_describe(aapl_path)
    news_data = load_and_describe(news_data_path)


Shape of /content/drive/MyDrive/42850/ass3/stock_price_preprocessed/aa.csv: (13642, 7)
Columns of /content/drive/MyDrive/42850/ass3/stock_price_preprocessed/aa.csv: ['Date', 'Open', 'High', 'Low', 'Close', 'Adj close', 'Volume']
First few rows of /content/drive/MyDrive/42850/ass3/stock_price_preprocessed/aa.csv:
                        Date       Open       High        Low      Close  \
0  2024-02-02 00:00:00+00:00  29.000000  29.719999  28.549999  29.490000   
1  2024-02-01 00:00:00+00:00  30.080000  30.405001  29.150000  29.690001   
2  2024-01-31 00:00:00+00:00  30.490000  31.360001  29.715000  29.750000   
3  2024-01-30 00:00:00+00:00  30.340000  30.840000  30.000000  30.610001   
4  2024-01-29 00:00:00+00:00  30.459999  30.969999  29.688999  30.910000   

   Adj close   Volume  
0  29.490000  4954000  
1  29.690001  4174600  
2  29.750000  5760400  
3  30.610001  4714700  
4  30.910000  4649100   

Shape of /content/drive/MyDrive/42850/ass3/stock_price_preprocessed/aapl.csv: (1087

In [3]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 56.5 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [4]:
import pandas as pd
import spacy

def load_data(file_path):
    """
    Load a CSV file into a pandas DataFrame.

    Parameters:
    - file_path (str): The path to the CSV file.

    Returns:
    - pd.DataFrame: The loaded DataFrame.
    """
    return pd.read_csv(file_path)

def apply_ner(text, nlp):
    """
    Apply NER to a single piece of text.

    Parameters:
    - text (str): The text to process.
    - nlp (spacy.language.Language): The loaded spaCy pipeline.

    Returns:
    - List[Tuple[str, str]]: List of named entities and their labels.
    """
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

def process_ner(data, text_column):
    """
    Process NER on a DataFrame.

    Parameters:
    - data (pd.DataFrame): The data to process.
    - text_column (str): The column containing the text.

    Returns:
    - pd.DataFrame: The data with an additional column for NER results.
    """
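    nlp = spacy.load("en_core_web_sm")
    # Replace missing values with "" so NaN is not parsed as the literal string "nan"
    texts = data[text_column].fillna("").astype(str)
    data['Named_Entities'] = texts.apply(lambda x: apply_ner(x, nlp))
    return data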

def save_data(data, file_path):
    """
    Save the DataFrame to a CSV file.

    Parameters:
    - data (pd.DataFrame): The data to save.
    - file_path (str): The path to save the CSV file.
    """
    data.to_csv(file_path, index=False)

def print_ner_examples(data, num_examples=5):
    """
    Print example rows of the DataFrame with their NER results.

    Parameters:
    - data (pd.DataFrame): The DataFrame containing text and NER results.
    - num_examples (int): Number of examples to print.
    """
    for idx, row in data.head(num_examples).iterrows():
        print(f"Text: {row['Text']}")
        print(f"Named Entities: {row['Named_Entities']}")
        print("-" * 80)

if __name__ == "__main__":
    # Define paths
    news_data_path = '/content/drive/MyDrive/42850/ass3/news_data_preprocessed/aa.csv'
    news_output_path = '/content/drive/MyDrive/42850/ass3/news_data_preprocessed/news_ner_output.csv'

    # Load the data
    news_data = load_data(news_data_path)

    # Apply NER
    news_ner_results = process_ner(news_data, 'Text')

    # Save results
    save_data(news_ner_results, news_output_path)

    # Print some example NER results
    print_ner_examples(news_ner_results, num_examples=5)


Text: March S&P 500 E-Mini futures (ESH24) are trending down -0.42% this morning as investors digested weak economic data from China while also gearing up for crucial U.S. retail sales data.
In Tuesday’s trading session, Wall Street’s major averages closed in the red, with the blue-chip Dow falling to a 3-1/2 week low. Boeing Co (BA) plunged over -7% and was the top percentage loser on the Dow and S&P 500 after Wells Fargo Securities downgraded the stock to Equal Weight from Overweight, citing an increased risk that the heightened scrutiny of the company’s manufacturing quality could affect the pace of production or deliveries. Also, Morgan Stanley (MS) slumped more than -4% after the bank reported mixed Q4 results and warned of lower margins in the wealth-management business. In addition, Spirit Airlines Inc (SAVE) plummeted over -47% after a federal judge blocked the company’s planned $3.8 billion sale to JetBlue Airways on antitrust grounds. On the bullish side, Advanced Micro Devic