# General Tips
## Using virtual environments
**Step 1:** `cd` to the desired directory and create a virtual environment with `python3 -m venv myenv`. (On Windows, run `py -3.13 -m venv myenv` to use a specific Python version.)

List your installed Python versions with `py -0` on Windows. On Linux, `python3 --version` shows the version of the default `python3`.

**Step 2:** Activate the environment: `source myenv/bin/activate` on Linux, or `myenv\Scripts\activate` on Windows.

**Step 3:** Install any needed packages, e.g. `pip install requests pandas`. Better still, list them in a `requirements.txt` file and run `pip install -r requirements.txt`.

**Step 4:** List all installed packages with `pip list`.
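
After activating, it can help to confirm from Python itself that the interpreter in use belongs to the environment. The following is a minimal sketch, assuming the environment was created with the standard `venv` module as in Step 1; the helper name `in_virtualenv` is illustrative.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("Interpreter:", sys.executable)
print("Inside a virtual environment:", in_virtualenv())
```

If the second line prints `False`, the environment is probably not activated in the current shell.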

## Connecting the Jupyter Notebook to the virtual environment
1. Make sure that `myenv` is activated (`myenv\Scripts\activate` on Windows, `source myenv/bin/activate` on Linux)
2. Run this inside the virtual environment: `pip install ipykernel`
3. Still inside the environment: `python -m ipykernel install --user --name=myenv --display-name "Whatever Python Kernel Name"`
   
   `--name=myenv`: internal identifier for the kernel
   
   `--display-name`: name that shows up in the VS Code kernel picker
4. Open VS Code and select the kernel

   At the top right, click "Select Kernel", then pick "Whatever Python Kernel Name" from the list.
5. If you don't see it right away, restart VS Code or run "Reload Window" from the Command Palette (Ctrl+Shift+P)
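
If the kernel shows up but you are unsure which interpreter it launches, you can read the `kernel.json` file that `ipykernel install` writes. The sketch below assumes the usual layout `<kernels dir>/<name>/kernel.json`; the function name `read_kernel_spec` and its directory argument are illustrative, and `jupyter kernelspec list` prints the actual locations on your machine.

```python
import json
from pathlib import Path


def read_kernel_spec(kernels_dir, name):
    """Return (display_name, interpreter) from <kernels_dir>/<name>/kernel.json."""
    spec_path = Path(kernels_dir) / name / "kernel.json"
    with open(spec_path, encoding="utf-8") as f:
        spec = json.load(f)
    # The first argv entry is the interpreter the kernel is started with.
    return spec["display_name"], spec["argv"][0]


# Example call (the directory is an assumption; adjust it to your system):
# read_kernel_spec(Path.home() / ".local/share/jupyter/kernels", "myenv")
```

The returned interpreter path should point inside `myenv`. If it does not, the install command in step 3 was most likely run outside the activated environment.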

## Useful Commands
1. Use `py -0` to list the Python installations available on Windows

# Step 1: Load the Dataset

In [None]:
from datasets import load_dataset

ds = load_dataset("PatronusAI/financebench", split="train")

Records:  150
dict_keys(['financebench_id', 'company', 'doc_name', 'question_type', 'question_reasoning', 'domain_question_num', 'question', 'answer', 'justification', 'dataset_subset_label', 'evidence', 'gics_sector', 'doc_type', 'doc_period', 'doc_link'])
{'financebench_id': 'financebench_id_03029', 'company': '3M', 'doc_name': '3M_2018_10K', 'question_type': 'metrics-generated', 'question_reasoning': 'Information extraction', 'domain_question_num': None, 'question': 'What is the FY2018 capital expenditure amount (in USD millions) for 3M? Give a response to the question by relying on the details shown in the cash flow statement.', 'answer': '$1577.00', 'justification': 'The metric capital expenditures was directly extracted from the company 10K. The line item name, as seen in the 10K, was: Purchases of property, plant and equipment (PP&E).', 'dataset_subset_label': 'OPEN_SOURCE', 'evidence': [{'evidence_text': 'Table of Contents \n3M Company and Subsidiaries\nConsolidated Statement o

In [27]:
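print("Records: ", len(ds))
# Despite the "Keys" label, this prints the whole record at index 84.
print("Keys: ", ds[84])
# Likewise, this prints only the doc_link field of the first record.
print("First record: ", ds[0]["doc_link"])

print("List of document links:")
# enumerate(..., start=1) numbers the links without a manual counter.
for counter, doc in enumerate(ds, start=1):
    print(f"{counter}: {doc['doc_link']}")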

Records:  150
Keys:  {'financebench_id': 'financebench_id_10136', 'company': 'General Mills', 'doc_name': 'GENERALMILLS_2022_10K', 'question_type': 'metrics-generated', 'question_reasoning': 'Numerical reasoning', 'domain_question_num': None, 'question': "We want to calculate a financial metric. Please help us compute it by basing your answers off of the cash flow statement and the income statement. Here's the question: what is the FY2022 retention ratio (using total cash dividends paid and net income attributable to shareholders) for General Mills? Round answer to two decimal places.", 'answer': '0.54', 'justification': 'The metric in question was calculated using other simpler metrics. The various simpler metrics (from the current and, if relevant, previous fiscal year(s)) used were:\n\nMetric 1: Total cash dividends paid out. This metric was located in the 10K as a single line item named: Dividends paid.\n\nMetric 2: Net income. This metric was located in the 10K as a single line it
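
Because the dataset has one row per question, the same `doc_link` can appear in several rows, so the list printed above contains repeats. A minimal sketch of collecting each link once, in order of first appearance, follows; `unique_doc_links` and the `sample` records are hypothetical stand-ins, and in the notebook you would pass `ds` instead.

```python
def unique_doc_links(records):
    """Return each distinct doc_link once, in order of first appearance."""
    # dict.fromkeys keeps insertion order (Python 3.7+) and drops repeats.
    return list(dict.fromkeys(record["doc_link"] for record in records))


# Hypothetical stand-in records; in the notebook, pass `ds` instead.
sample = [
    {"doc_link": "https://example.com/a.pdf"},
    {"doc_link": "https://example.com/a.pdf"},
    {"doc_link": "https://example.com/b.pdf"},
]

for n, link in enumerate(unique_doc_links(sample), start=1):
    print(f"{n}: {link}")
```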

In [None]:
import os
import requests
from urllib.parse import urlparse

output_dir = "../financebench_pdfs"
os.makedirs(output_dir, exist_ok=True)

# Helper function to get filename from URL
def get_filename_from_url(url):
    return os.path.basename(urlparse(url).path)

# Helper function to download a PDF
def download_pdf(url, output_path):
    # A timeout keeps an unresponsive server from hanging the notebook.
    response = requests.get(url, stream=True, timeout=30)
    # Only save responses that succeeded and are actually PDFs.
    if response.status_code == 200 and 'application/pdf' in response.headers.get('Content-Type', ''):
        with open(output_path, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print(f"✅ Downloaded: {output_path}")
    else:
        print(f"❌ Document not found or not a PDF: {url} (status code {response.status_code})")

# # Download loop
# for i, record in enumerate(ds):
#     url = record["doc_link"]
#     filename = get_filename_from_url(url)
#     output_path = os.path.join(output_dir, filename)
    
#     if os.path.exists(output_path):
#         print(f"[{i}] Skipping (already exists): {filename}")
#         continue

#     print(f"[{i}] Downloading: {filename}")
#     download_pdf(url, output_path)

[0] Downloading: 0001558370-19-000470.pdf
✅ Downloaded: ../financebench_pdfs/0001558370-19-000470.pdf
[1] Skipping (already exists): 0001558370-19-000470.pdf
[2] Downloading: 0000066740-23-000014.pdf
✅ Downloaded: ../financebench_pdfs/0000066740-23-000014.pdf
[3] Skipping (already exists): 0000066740-23-000014.pdf
[4] Skipping (already exists): 0000066740-23-000014.pdf
[5] Downloading: 0000066740-23-000058.pdf
✅ Downloaded: ../financebench_pdfs/0000066740-23-000058.pdf
[6] Skipping (already exists): 0000066740-23-000058.pdf
[7] Skipping (already exists): 0000066740-23-000058.pdf
[8] Downloading: 32abe798-add2-4770-9c7d-4cd3a840ede2
✅ Downloaded: ../financebench_pdfs/32abe798-add2-4770-9c7d-4cd3a840ede2
[9] Skipping (already exists): 32abe798-add2-4770-9c7d-4cd3a840ede2
[10] Downloading: pdf-page.html
❌ Document not found or not a PDF: https://www.adobe.com/pdf-page.html?pdfTarget=aHR0cHM6Ly93d3cuYWRvYmUuY29tL2NvbnRlbnQvZGFtL2NjL2VuL2ludmVzdG9yLXJlbGF0aW9ucy9wZGZzL0FEQkUtMTBLLUZZMTUtRkl

ConnectionError: HTTPSConnectionPool(host='johnsonandjohnson.gcs-web.com', port=443): Max retries exceeded with url: /static-files/9b012500-471a-4df9-93fc-6cee2b420678 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f1512f077d0>: Failed to resolve 'johnsonandjohnson.gcs-web.com' ([Errno -2] Name or service not known)"))
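
The log and traceback above point to two weaknesses in the download loop: some files are saved under URL-derived names with no `.pdf` extension, and a single unreachable host raises `ConnectionError` and stops the run. The following is a minimal sketch of a more tolerant loop, not the notebook's original method. It assumes that each record's `doc_name` is unique per document and safe to use as a filename, and it takes the download function as a parameter (here called `download`, for example the `download_pdf` helper above). The name `download_all` is hypothetical. Exceptions raised by `requests` derive from `OSError`, so catching `OSError` covers connection failures.

```python
import os


def download_all(records, output_dir, download):
    """Call download(url, path) once per distinct link; return the failures."""
    failed = []
    seen = set()
    for record in records:
        url = record["doc_link"]
        if url in seen:
            continue  # another question already referenced this document
        seen.add(url)
        # Name the file after doc_name so it always ends in .pdf
        # (assumption: doc_name is unique and filesystem-safe).
        path = os.path.join(output_dir, f"{record['doc_name']}.pdf")
        if os.path.exists(path):
            continue  # already downloaded in an earlier run
        try:
            download(url, path)
        except OSError as exc:
            # Record the failure and carry on with the next document.
            failed.append((url, exc))
    return failed


# In the notebook this could be called as:
# failed = download_all(ds, output_dir, download_pdf)
```

Returning the failures as a list makes it easy to retry or inspect them afterwards instead of rerunning the whole loop.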