# Project ML for portfolio management

The aim of this project is to explore gender differences in central bank communication / speeches using sentiment analysis, inspired by the paper "Leadership, Gender, and Discourse in Monetary Policy: Analyzing Speech Dynamics in the FOMC" (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5002334).

We will explore gender communication styles in speeches delivered by central bank officials, focusing on whether there are observable differences between male and female central bank leaders.
More specifically, using sentiment analysis and NLP techniques, we will analyze whether male and female speakers differ in the topics they address (topic modeling) and the tone of their speeches (sentiment analysis).
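
As a rough illustration of what a dictionary-based tone measure could look like, here is a minimal sketch. The word lists `POSITIVE` and `NEGATIVE` and the function `tone_score` are hypothetical stand-ins introduced for this example only; an actual analysis would more likely rely on a full finance lexicon such as the Loughran-McDonald dictionary.

```python
import re

# Hypothetical mini word lists, for illustration only; a real analysis would
# load a full finance lexicon such as the Loughran-McDonald dictionary.
POSITIVE = {"strong", "improve", "stable", "gain"}
NEGATIVE = {"weak", "risk", "decline", "uncertain"}


def tone_score(text):
    """Return (positive - negative) / (positive + negative) word counts.

    The score lies in [-1, 1]; it is 0.0 when no lexicon word is found.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    n_pos = sum(token in POSITIVE for token in tokens)
    n_neg = sum(token in NEGATIVE for token in tokens)
    total = n_pos + n_neg
    return (n_pos - n_neg) / total if total else 0.0


print(tone_score("Growth is strong and stable, but inflation risk remains."))
```

Normalizing by the number of matched words keeps the score comparable across speeches of different lengths; dividing by the total word count instead would be an equally reasonable choice.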


The speeches and the speaker information will be scraped or downloaded from the BIS website. The subset of speeches to focus on (for instance, those around major policy announcements) still needs to be defined.

Instructions:

1. Sharp summary of our results. It should be clear whether it works or not.
1. Dataset: can be short if reusing an existing one, or longer if web scraping (summary statistics, code, …).
1. Why is it an important question, and why does it matter (central bank data)? What is your contribution? What has been done before, and what is new in what you are doing? Is the evaluation a backtest or something else?
1. Results: explain what we are doing and summarize the test / empirical result. It can be positive or negative.
1. Did it work or not? What can be improved or added?

We can also (and should) read a paper and replicate it: Google Scholar.


1. Introduction
1. Dataset overview
1. Analytics and learning strategies
1. Empirical results: baseline and robustness
1. Conclusion

- If you need to add any package, no problem: add cells in your notebook with "pip install my_additional_package" so that I'm aware of what additional packages I need to run your notebook.
- If you use data that you scraped online, just provide the code to programmatically scrape the data. More generally, I don't want to receive data.csv files.
- If you use .py files to tidy your project, just use a %%writefile magic in the notebook, so that on my side I can create the same .py files on the fly. I don't want to receive additional .py files.

In [None]:
pip install selenium

### INTRODUCTION


The opening segment encompasses four essential elements:

- 1 Contextual Background: What is the larger setting of the study? What makes this area of inquiry compelling? What are the existing gaps or limitations within the current body of research? What are some unanswered yet noteworthy questions?

- 2 Project Contributions: What are the specific advancements made by this study, such as in data acquisition, algorithmic development, parameter adjustments, etc.?

- 3 Summary of the main empirical results: What is the main statistical statement? Is it significant (e.g. statistically or economically)?

- 4 Literature and Resource Citations: What are related academic papers? What are the GitHub repositories, expert blogs, or software packages that are used in this project?

references : 
- Leadership, Gender, and Discourse in Monetary Policy: Analyzing Speech Dynamics in the FOMC (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5002334)
- Information in Central Bank Sentiment: An Analysis of Fed and ECB Communication (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4797935)

### DATASET OVERVIEW

In the dataset profile, one should consider:

- The origin and composition of data utilized in the study. If the dataset is original, then provide the source code to ensure reproducibility.

- The chronological accuracy of the data points, verifying that the dates reflect the actual availability of information.

- A detailed analysis of descriptive statistics, with an emphasis on discussing the importance of the chosen graphs or metrics.
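
The point about chronological accuracy can be checked mechanically. The sketch below assumes each record carries a delivery date and a publication date; the field names `speech_date` and `published` and the toy records are hypothetical and do not reflect the actual BIS metadata schema.

```python
from datetime import date

# Hypothetical records: the field names `speech_date` and `published` are
# assumptions for illustration, not the actual BIS metadata schema.
records = [
    {"id": "a", "speech_date": "2024-03-01", "published": "2024-03-04"},
    {"id": "b", "speech_date": "2024-05-10", "published": "2024-05-08"},
]


def availability_date(record):
    """Date from which the text can be used without look-ahead bias."""
    spoken = date.fromisoformat(record["speech_date"])
    published = date.fromisoformat(record["published"])
    return max(spoken, published)


def suspicious(record):
    """Flag records published before the speech was delivered."""
    spoken = date.fromisoformat(record["speech_date"])
    published = date.fromisoformat(record["published"])
    return published < spoken


for record in records:
    print(record["id"], availability_date(record), suspicious(record))
```

Flagged records are not necessarily errors, but they are worth a manual look before the dates are used in any time-ordered analysis.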

Central bank speeches: we need a corpus of speeches delivered by central bank officials, ideally annotated with speaker information (name, gender, role, country, etc.).

Metadata: to analyze gender differences, the dataset must include:

- Gender of the speaker.
- Date of the speech.
- Context (policy announcements, conferences, etc.).

Scope: to be defined (e.g., BIS speeches from 2000 onwards).
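
One possible way to attach the gender annotation and get a first descriptive count is sketched below. It assumes a hand-curated lookup from speaker name to gender; the lookup `SPEAKER_GENDER`, the placeholder speaker names, and the field names `speaker` and `date` are all hypothetical.

```python
from collections import Counter

# Hypothetical, hand-curated lookup; speaker names are placeholders.
SPEAKER_GENDER = {"Speaker A": "F", "Speaker B": "M"}

# Toy metadata rows; `speaker` and `date` are assumed field names.
speeches = [
    {"speaker": "Speaker A", "date": "2021-02-03"},
    {"speaker": "Speaker B", "date": "2021-06-17"},
    {"speaker": "Speaker B", "date": "2022-01-20"},
    {"speaker": "Speaker C", "date": "2022-09-05"},
]


def count_by_year_and_gender(rows, lookup):
    """Count speeches per (year, gender); unmapped speakers get 'unknown'."""
    counts = Counter()
    for row in rows:
        year = int(row["date"][:4])
        gender = lookup.get(row["speaker"], "unknown")
        counts[(year, gender)] += 1
    return counts


for key, n in sorted(count_by_year_and_gender(speeches, SPEAKER_GENDER).items()):
    print(key, n)
```

Keeping an explicit `unknown` bucket makes gaps in the lookup visible instead of silently dropping speeches.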

In [2]:
import logging
import os
import subprocess
import time
from pathlib import Path

import pandas as pd
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
# warnings.filterwarnings("ignore", category=UserWarning, module="openpyxl")

In [3]:
# Build helper functions
def get_central_bank_speeches_urls():
    """Return the list of speech paths from the BIS central bank speeches API.

    Returns an empty list if the request fails.
    """
    url = "https://www.bis.org/api/document_lists/cbspeeches.json"
    reviews = []
    response = requests.get(url)

    # Only parse the payload if the request was successful (status code 200)
    if response.status_code == 200:
        speeches = response.json()
        # Collect the speech paths, e.g. '/review/r010105b'
        reviews.extend(speeches["list"])
    else:
        print(f"Failed to retrieve data. Status code: {response.status_code}")

    return reviews

review_urls = get_central_bank_speeches_urls()
print(len(review_urls))

20093


In [None]:
# NOT DONE AT ALL !!!!

 

def clean_directory_path(cache_dir, default_dir="data"):
    """Return `cache_dir` as a Path, creating the directory if it does not exist."""
    if cache_dir is None:
        cache_dir = Path(os.getcwd()) / default_dir
    if isinstance(cache_dir, str):
        cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        os.makedirs(cache_dir)
    return cache_dir


logger = logging.getLogger(__name__)


def load_central_banks_speeches(add_url=True, cache_dir="data", force_reload=False, progress_bar=False):
    """Load BIS central bank speeches (metadata and text), cached as a parquet file.

    Speeches whose metadata or HTML page cannot be fetched are skipped.
    """
    filename = clean_directory_path(cache_dir) / "central_banks_speeches.parquet"
    if filename.exists() and not force_reload:
        logger.info(f"loading from cache file: {filename}")
        return pd.read_parquet(filename)

    logger.info("loading from external source")
    urls = get_central_bank_speeches_urls()
    urls_ = tqdm(urls) if progress_bar else urls

    base_url_api = "https://www.bis.org/api/documents"
    base_url = "https://www.bis.org"

    # Get each speech's metadata from the API, then scrape its text from the HTML page
    all_speeches = []
    for link in urls_:  # e.g. link = '/review/r010105b'
        speech_data = {}
        review_url = f"{base_url_api}{link}.json"
        speech_url = f"{base_url}{link}.htm"
        try:
            # Fetch the speech metadata
            speech_data.update(requests.get(review_url).json())
        except Exception as e:
            print(f"Failed to fetch metadata of review {review_url}, error: {e}")
            continue

        # Skip speeches without institution metadata (filter before scraping)
        if "institutions" not in speech_data:
            continue

        try:
            # Scrape and parse the HTML page of the speech
            speech_response = requests.get(speech_url)
            soup = BeautifulSoup(speech_response.content, "html.parser")
            # The full speech is contained in the div with id 'cmsContent'
            speech_data["speech_content"] = soup.find("div", id="cmsContent").get_text()
        except Exception as e:
            print(f"Failed to process speech: {speech_url}, error: {e}")
            continue

        if add_url:
            speech_data["url"] = speech_url
        # Append the speech dict (content + metadata) to all_speeches
        all_speeches.append(speech_data)

    speeches = pd.DataFrame(all_speeches).sort_index()
    logger.info(f"saving cache file {filename}")
    # Assumes every metadata field can be serialized to parquet
    speeches.to_parquet(filename)
    return speeches

In [None]:

# Get each speech's metadata from the API, then scrape its text from the HTML page
all_speeches = []
base_url_api = "https://www.bis.org/api/documents"
base_url = "https://www.bis.org"

counter = 0
switch = None
for link in review_urls:  # optionally wrap in tqdm(review_urls)
    counter += 1
    # e.g. link = '/review/r010105b'
    speech_data = {}
    review_url = f"{base_url_api}{link}.json"
    speech_url = f"{base_url}{link}.htm"
    print(f"Processing speech: {review_url}")
    try:
        # Fetch the speech metadata
        review_response = requests.get(review_url)
        review_metadata = review_response.json()
        speech_data.update(review_metadata)
    except Exception as e:
        print(f"Failed to fetch metadata of review {review_url}, error: {e}")
        continue

    # Only scrape the speech if it comes from the right institution (filter before scraping)
    # TODO: also filter on 'publication_start_date' (e.g. 2024-12-06)
    if "institutions" in speech_data and speech_data["institutions"] == 6:
        switch = speech_data["id"]
        try:
            # Scrape and parse the HTML page of the speech
            speech_response = requests.get(speech_url)
            soup = BeautifulSoup(speech_response.content, "html.parser")
            # The full speech is contained in the div with id 'cmsContent'
            speech_content = soup.find("div", id="cmsContent")
            speech_data["speech_content"] = speech_content.get_text()
        except Exception as e:
            print(f"Failed to process speech: {speech_url}, error: {e}")
            continue

        # Append the speech dict (content + metadata) to all_speeches
        all_speeches.append(speech_data)

    # Pause to avoid overloading the server
    # time.sleep(1)

# Save results
save_path = "speeches.csv"
df = pd.DataFrame(all_speeches)
df.to_csv(save_path, index=False)
print(f"Scraped data saved to {save_path}")
df

Timing: 219 speeches took 1 min 7 s.

In [None]:
round(67 / 219 * 20093 / 3600, 1)  # --> estimated hours for all 20093 speeches, assuming a constant rate

1.7

In [None]:

def load_loughran_mcdonald_dictionary(cache_dir="data", force_reload=False):
    """
    Software Repository for Accounting and Finance by Bill McDonald
    https://sraf.nd.edu/loughranmcdonald-master-dictionary/
    """
    filename = (
        clean_directory_path(cache_dir)
        / "Loughran-McDonald_MasterDictionary_1993-2021.csv"
    )
    if filename.exists() and not force_reload:
        logger.info(f"loading from cache file: {filename}")
    else:
        logger.info("loading from external source")
        file_id = "17CmUZM9hGUdGYjCXcjQLyybjTrcjrhik"  # renamed to avoid shadowing the builtin `id`
        url = f"https://docs.google.com/uc?export=download&confirm=t&id={file_id}"
        # Requires the `wget` command-line tool to be installed
        subprocess.run(f"wget -O '{filename}' '{url}'", shell=True, capture_output=True)
    return pd.read_csv(filename)

In [None]:

def load_10X_summaries(cache_dir="data", force_reload=False):
    """
    Software Repository for Accounting and Finance by Bill McDonald
    https://sraf.nd.edu/sec-edgar-data/
    """
    filename = (
        clean_directory_path(cache_dir)
        / "Loughran-McDonald_10X_Summaries_1993-2021.csv"
    )
    if filename.is_file() and not force_reload:
        logger.info(f"loading from cache file: {filename}")
    else:
        logger.info("loading from external source")
        file_id = "1CUzLRwQSZ4aUTfPB9EkRtZ48gPwbCOHA"  # renamed to avoid shadowing the builtin `id`
        url = f"https://docs.google.com/uc?export=download&confirm=t&id={file_id}"
        # Requires the `wget` command-line tool to be installed
        subprocess.run(f"wget -O '{filename}' '{url}'", shell=True, capture_output=True)

    df = pd.read_csv(filename).assign(
        date=lambda x: pd.to_datetime(x.FILING_DATE, format="%Y%m%d")
    )
    return df



### ANALYTICS AND LEARNING STRATEGY

The analytics and machine learning methodologies section accounts for:

- A detailed explanation of the foundational algorithm.

- A description of the data partitioning strategy for training, validation and test.

- An overview of the parameter selection and optimization process.
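
For dated text such as speeches, one common partitioning strategy is a chronological split, sketched below. The function `chronological_split`, its default fractions, and the `date` field are assumptions made for this illustration and are not a choice the project has committed to.

```python
def chronological_split(rows, train_frac=0.6, val_frac=0.2):
    """Split rows into train/validation/test by date, oldest first.

    Sorting by date before cutting keeps every test observation later than
    the training data, which avoids look-ahead bias.
    """
    ordered = sorted(rows, key=lambda row: row["date"])
    n_train = int(len(ordered) * train_frac)
    n_val = int(len(ordered) * val_frac)
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test


# Toy rows with ISO dates (which sort correctly as strings).
years = (15, 11, 19, 13, 17, 10, 12, 14, 16, 18)
rows = [{"date": f"20{year:02d}-01-01"} for year in years]
train, val, test = chronological_split(rows)
print(len(train), len(val), len(test))
```

A random split would be simpler, but it would let the model see speeches delivered after those it is evaluated on.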

### EMPIRICAL RESULTS : BASELINE AND ROBUSTNESS

To effectively convey the empirical findings, separate the baseline results from the additional robustness tests. Within the primary empirical outcomes portion, include:

- Key statistical evaluations (for instance, if presenting a backtest – provide a pnl graph alongside the Sharpe ratio).

- Insights into what primarily influences the results, such as specific characteristics or assets that significantly impact performance.
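
Since this project compares groups of speakers rather than running a backtest, a key statistical evaluation could be a test for a difference in mean tone. The sketch below uses a two-sided permutation test; the function `permutation_test` is a hypothetical helper and the tone scores are made-up numbers for illustration, not project results.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the observed difference mean(a) - mean(b) and the share of
    label shuffles whose absolute difference is at least as large.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_permutations


# Made-up tone scores, for illustration only (not project results).
tone_female = [0.10, 0.25, 0.05, 0.30, 0.20]
tone_male = [0.15, 0.00, 0.10, 0.05, 0.12]
observed, p_value = permutation_test(tone_female, tone_male)
print(f"difference in mean tone: {observed:.3f}, p-value: {p_value:.3f}")
```

A permutation test makes no distributional assumption about the tone scores, which is convenient when the groups are small or unbalanced.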

The robustness of empirical tests section should detail:

- Evaluation of the stability of the principal finding against variations in hyperparameters or algorithmic modifications.

### CONCLUSION

Finally, the conclusive synthesis should recapitulate the primary findings, consider external elements that may influence the results, and hint at potential directions for further investigative work.