# The goal of this notebook is to extract and analyze the text data available in the EDGAR tool

In [2]:
import pandas as pd
import requests
from datetime import datetime
import json

### Define the Header and import the S&P 500 Company data

In [5]:
sp = pd.read_csv('./data/sp500.csv')
sp.head(3)

Unnamed: 0.1,Unnamed: 0,Symbol,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded
0,0,MMM,3M,Industrials,Industrial Conglomerates,"Saint Paul, Minnesota",1957-03-04,66740,1902
1,1,AOS,A. O. Smith,Industrials,Building Products,"Milwaukee, Wisconsin",2017-07-26,91142,1916
2,2,ABT,Abbott Laboratories,Health Care,Health Care Equipment,"North Chicago, Illinois",1957-03-04,1800,1888


In [6]:
sp['GICS Sector'].unique()

array(['Industrials', 'Health Care', 'Information Technology',
       'Utilities', 'Financials', 'Materials', 'Consumer Discretionary',
       'Real Estate', 'Communication Services', 'Consumer Staples',
       'Energy'], dtype=object)

In [31]:
# First get all Energy sector companies
fin = sp[sp['GICS Sector'] == 'Energy']
industry = fin['GICS Sub-Industry'].unique()
industry

array(['Oil & Gas Exploration & Production',
       'Oil & Gas Equipment & Services', 'Integrated Oil & Gas',
       'Oil & Gas Storage & Transportation',
       'Oil & Gas Refining & Marketing'], dtype=object)

In [None]:
import os
import requests
import pandas as pd
import time

# Headers for SEC API requests
headers = {
    'User-Agent': 'Maseeh Faizan maseeh.faizan@unil.ch'
}

# Base URL for SEC data
base_url = "https://data.sec.gov"

# Create directory for storing filings
os.makedirs('sec_filings', exist_ok=True)

# Loop through all companies in the dataframe
for _, row in sp.iterrows():
    company = row['Symbol']
    cik = row['CIK']

    print(f"Processing {company} (CIK: {cik})...")

    try:
        # Get company submissions
        submissions_url = f"{base_url}/submissions/CIK{cik:010d}.json"
        response = requests.get(submissions_url, headers=headers)
        response.raise_for_status()
        submissions = response.json()

        # Convert to DataFrame
        submissions_df = pd.DataFrame(submissions['filings']['recent'])

        # Filter for 10-K filings
        submissions_df = submissions_df[submissions_df['form'] == '10-K']

        # Skip if no 10-K filings found (check before building URLs on an empty frame)
        if submissions_df.empty:
            print(f"No 10-K filings found for {company}")
            continue

        # Create document URLs (work on a copy of the filtered frame)
        submissions_df = submissions_df.copy()
        submissions_df['doc_url'] = submissions_df.apply(
            lambda x: f"https://www.sec.gov/Archives/edgar/data/{cik}/{x['accessionNumber'].replace('-', '')}/{x['primaryDocument']}",
            axis=1
        )

        # Download each filing
        for idx, filing_row in submissions_df.iterrows():
            url = filing_row['doc_url']
            form = filing_row['form']
            date = filing_row['reportDate']

            # Create filename with company symbol
            filename = f"{company}_{form}_{date}.html"
            file_path = os.path.join('sec_filings', filename)

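            try:
                response = requests.get(url, headers=headers)
                response.raise_for_status()  # don't save an error page as a filing
                with open(file_path, 'w', encoding='utf-8') as f:
                    f.write(response.text)
                print(f"Downloaded: {filename}")
            except Exception as e:
                print(f"Error downloading {filename}: {e}")

            # Pause after every request, not just once per company
            time.sleep(0.1)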

        print(f"Completed {company} ({len(submissions_df)} filings)")

        # Add a small delay to be respectful of SEC's servers
        time.sleep(0.1)

    except Exception as e:
        print(f"Error processing {company}: {e}")

print("All companies processed!")

Processing MMM (CIK: 66740)...
Downloaded: MMM_10-K_2024-12-31.html
Downloaded: MMM_10-K_2023-12-31.html
Downloaded: MMM_10-K_2022-12-31.html
Downloaded: MMM_10-K_2021-12-31.html
Downloaded: MMM_10-K_2020-12-31.html
Downloaded: MMM_10-K_2019-12-31.html
Downloaded: MMM_10-K_2018-12-31.html
Downloaded: MMM_10-K_2017-12-31.html
Completed MMM (8 filings)
Processing AOS (CIK: 91142)...
Downloaded: AOS_10-K_2024-12-31.html
Downloaded: AOS_10-K_2023-12-31.html
Downloaded: AOS_10-K_2022-12-31.html
Downloaded: AOS_10-K_2021-12-31.html
Downloaded: AOS_10-K_2020-12-31.html


KeyboardInterrupt: 
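
The two URLs used in the loop above are built from the CIK, the accession number, and the primary document name. A minimal sketch of that construction as standalone helpers follows. The function names are hypothetical, and the accession number and document name in the example are made-up placeholders, not a real filing.

```python
def submissions_url(cik: int) -> str:
    """Submissions endpoint; the CIK is zero-padded to 10 digits."""
    return f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"


def build_doc_url(cik: int, accession_number: str, primary_document: str) -> str:
    """Archives URL for one filing document."""
    # The folder name is the accession number with its dashes removed.
    folder = accession_number.replace("-", "")
    return f"https://www.sec.gov/Archives/edgar/data/{int(cik)}/{folder}/{primary_document}"


# 66740 is the CIK shown for MMM in the table above; the other two
# arguments are placeholders for illustration only.
print(submissions_url(66740))
print(build_doc_url(66740, "0000000000-24-000001", "example-10k.htm"))
```

Keeping the URL logic in small functions makes it easy to check the padding and dash removal without hitting the SEC servers.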

# This section parses the text from the HTML filings and splits it into the 10-K sections


The idea: for each section, I first look for text that begins with the section's start pattern (e.g. "Item 1. Business") and is later followed by its end pattern (e.g. "Item 1A."). When both are found, the text in between is treated as the content of Item 1. The same logic applies to all the other sections. This approach works for most of the companies tested.

When several candidates match, I keep the longest one as the main text. This way the table of contents and other anomalies are ignored.
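
The start/end/longest-candidate idea described above can be sketched on a toy string. This is a simplified illustration, not the full extraction cell: `extract_section` is a hypothetical helper that takes a single start and end pattern, and the toy text is invented.

```python
import re


def extract_section(text: str, start_pattern: str, end_pattern: str) -> str:
    """Return the longest span between a start match and the next end match."""
    candidates = []
    for start in re.finditer(start_pattern, text, re.IGNORECASE):
        rest = text[start.end():]
        end = re.search(end_pattern, rest, re.IGNORECASE)
        if end:
            candidates.append(rest[:end.start()].strip())
    # The table-of-contents entry gives a short candidate; the real section is longer.
    return max(candidates, key=len, default="")


toy = (
    "Item 1. Business 3 Item 1A. Risk Factors 9 "  # table of contents
    "Item 1. Business We make widgets and sell them worldwide. "
    "Item 1A. Risk Factors Demand may fall."
)
print(extract_section(toy, r"Item\s+1\.\s*", r"Item\s+1A\.\s*"))
```

The first "Item 1." match yields only "Business 3" (the table-of-contents entry), so taking the longest candidate selects the real section body.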

In [None]:
import re
import os
import pandas as pd
import glob
from bs4 import BeautifulSoup, XMLParsedAsHTMLWarning
import warnings
warnings.filterwarnings("ignore", category=XMLParsedAsHTMLWarning)

# Configuration
companies = ['MSFT']  # Example tickers - modify as needed
base_dir = './sec_filings/'

# List to store data for all companies and all sections
all_data = []

# Section patterns and display names
section_patterns = {
    "Business": {
        "start_patterns": [r'ITEM\s+1\.\s*', r'Item\s+1\.\s*'],
        "end_patterns": [r'ITEM\s+1A\.\s*', r'Item\s+1A\.\s*', r'ITEM\s+1\.A\.\s*', r'Item\s+1\.A\.\s*', r'ITEM\s+2\.\s*', r'Item\s+2\.\s*'],
        "display_name": "Item 1. Business"
    },
    "Risk Factors": {
        "start_patterns": [r'ITEM\s+1A\.\s*', r'Item\s+1A\.\s*'],
        "end_patterns": [r'ITEM\s+1B\.\s*', r'Item\s+1B\.\s*', r'ITEM\s+1C\.\s*', r'Item\s+1C\.\s*', r'ITEM\s+2\.\s*', r'Item\s+2\.\s*'],
        "display_name": "Item 1A. Risk Factors"
    },
    "Cybersecurity": {
        "start_patterns": [r'ITEM\s+1C\.\s*', r'Item\s+1C\.\s*'],
        "end_patterns": [r'ITEM\s+2\.\s*', r'Item\s+2\.\s*'],
        "display_name": "Item 1C. Cybersecurity"
    },
    "Properties": {
        "start_patterns": [r'ITEM\s+2\.\s*', r'Item\s+2\.\s*'],
        "end_patterns": [r'ITEM\s+3\.\s*', r'Item\s+3\.\s*'],
        "display_name": "Item 2. Properties"
    },
    "Legal Proceedings": {
        "start_patterns": [r'ITEM\s+3\.\s*', r'Item\s+3\.\s*'],
        "end_patterns": [r'ITEM\s+4\.\s*', r'Item\s+4\.\s*'],
        "display_name": "Item 3. Legal Proceedings"
    },
    "Mine Safety Disclosures": {
        "start_patterns": [r'ITEM\s+4\.\s*', r'Item\s+4\.\s*'],
        "end_patterns": [r'PART\s+II\s*', r'ITEM\s+5\.\s*', r'Item\s+5\.\s*'],
        "display_name": "Item 4. Mine Safety Disclosures"
    },
    "Management Discussion and Analysis": {
        "start_patterns": [r'ITEM\s+7\.\s*', r'Item\s+7\.\s*'],
        "end_patterns": [r'ITEM\s+7A\.\s*', r'Item\s+7A\.\s*', r'ITEM\s+7\.A\.\s*', r'Item\s+7\.A\.\s*'],
        "display_name": "Item 7. Management Discussion and Analysis"
    },
    "Quantitative and Qualitative Disclosures": {
        "start_patterns": [r'ITEM\s+7A\.\s*', r'Item\s+7A\.\s*'],
        "end_patterns": [r'ITEM\s+8\.\s*', r'Item\s+8\.\s*'],
        "display_name": "Item 7A. Quantitative and Qualitative Disclosures about Market Risk"
    },
}

# Find all 10-K files for specified companies
file_paths = []
for ticker in companies:
    pattern = f"{base_dir}{ticker}_10-K_*.html"
    ticker_files = glob.glob(pattern)
    file_paths.extend(ticker_files)

if not file_paths:
    print("No matching files found. Please check the directory and file naming pattern.")
else:
    print(f"Found {len(file_paths)} files to process.")

    # Process each file
    for html_file_path in file_paths:
        if not os.path.exists(html_file_path):
            print(f"Error: File not found at '{html_file_path}'")
            continue

        try:
            # Extract ticker from filename
            base_filename = os.path.basename(html_file_path)
            ticker = base_filename.split('_')[0]

            # Extract date from filename
            parts = base_filename.split('_')
            filing_date = parts[2].split('.')[0] if len(parts) >= 3 else None

            # Read the HTML file
            with open(html_file_path, 'r', encoding='utf-8') as f:
                html_content = f.read()

            # Parse the HTML
            soup = BeautifulSoup(html_content, 'lxml')

            # Extract Text
            text_content = soup.get_text(separator=" ", strip=True)

            # Clean the text
            text_lines = text_content.splitlines()
            cleaned_lines = []
            for line in text_lines:
                processed_line = re.sub(r'[ \t]+', ' ', line).strip()
                if processed_line:
                    cleaned_lines.append(processed_line)
            final_text = "\n".join(cleaned_lines)

            # Sections to extract
            section_names_to_extract = [
                "Business",
                "Risk Factors",
                "Cybersecurity",
                "Properties",
                "Legal Proceedings",
                "Mine Safety Disclosures",
                "Management Discussion and Analysis",
                "Quantitative and Qualitative Disclosures"
            ]

            # Extract all sections and store results
            for section_name_key in section_names_to_extract:
                section_config = section_patterns[section_name_key]
                start_patterns = section_config["start_patterns"]
                end_patterns = section_config["end_patterns"]
                display_name = section_config["display_name"]

                valid_sections = []
                for start_pattern in start_patterns:
                    for start_match in re.finditer(start_pattern, final_text, re.IGNORECASE):
                        start_pos = start_match.start()
                        search_start = start_pos + len(start_match.group())

                        for end_pattern in end_patterns:
                            end_match = re.search(end_pattern, final_text[search_start:], re.IGNORECASE)
                            if end_match:
                                end_pos = search_start + end_match.start()
                                # Slice from the end of the heading match so only this heading is dropped,
                                # not every later "Item N." cross-reference inside the section
                                section_content = final_text[search_start:end_pos].strip()

                                min_content_length = 200
                                if len(section_content) > min_content_length:
                                    valid_sections.append({
                                        'content': section_content,
                                        'length': len(section_content),
                                        'display_name': display_name
                                    })
                                break # Found an end pattern for this start, move to next start_match if any

                if valid_sections:
                    main_section = max(valid_sections, key=lambda x: x['length'])
                    section_content_extracted = main_section['content']
                    section_display_name = main_section['display_name']
                    print(f"Successfully extracted {section_display_name} section for {ticker} (filing date: {filing_date})")
                else:
                    section_content_extracted = ""
                    section_display_name = display_name
                    print(f"No {section_name_key} section found for {ticker}. Check the patterns or document structure.")

                all_data.append({
                    'ticker': ticker,
                    'filing_date': filing_date,
                    'section': section_display_name,
                    'content': section_content_extracted
                })

        except Exception as e:
            print(f"Error processing {html_file_path}: {e}")

# Create DataFrame with all sections
if all_data:
    long_df = pd.DataFrame(all_data)
    print("Processing complete!")
else:
    print("No data was extracted. Please check the file paths and contents.")

Found 6 files to process.
Successfully extracted Item 1. Business section for MSFT (filing date: 2023-06-30)
Successfully extracted Item 1A. Risk Factors section for MSFT (filing date: 2023-06-30)
No Cybersecurity section found for MSFT. Check the patterns or document structure.
Successfully extracted Item 2. Properties section for MSFT (filing date: 2023-06-30)
No Legal Proceedings section found for MSFT. Check the patterns or document structure.
No Mine Safety Disclosures section found for MSFT. Check the patterns or document structure.
Successfully extracted Item 7. Management Discussion and Analysis section for MSFT (filing date: 2023-06-30)
Successfully extracted Item 7A. Quantitative and Qualitative Disclosures about Market Risk section for MSFT (filing date: 2023-06-30)
Successfully extracted Item 1. Business section for MSFT (filing date: 2019-06-30)
Successfully extracted Item 1A. Risk Factors section for MSFT (filing date: 2019-06-30)
No Cybersecurity section found for MSFT. 

### For testing, I will only keep the latest filing (2024)

In [52]:
import pandas as pd

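def get_latest_data(df: pd.DataFrame) -> pd.DataFrame:
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['filing_date'] = pd.to_datetime(df['filing_date'])
    # Latest date across all rows; this is fine while only one company is loaded
    latest_date = df['filing_date'].max()
    latest_df = df[df['filing_date'] == latest_date]
    return latest_df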

latest_data = get_latest_data(long_df)

sp_renamed = sp.rename(columns={'Symbol': 'ticker'})
columns_to_merge = ['ticker', 'Security', 'GICS Sector', 'GICS Sub-Industry']
sp_filtered = sp_renamed[columns_to_merge]


# Perform a left merge to keep all rows from latest_data and add matching GICS data
merged_financial_data = pd.merge(latest_data, sp_filtered, on='ticker', how='left')
merged_financial_data.tail(30)

Unnamed: 0,ticker,filing_date,section,content,Security,GICS Sector,GICS Sub-Industry
0,MSFT,2024-06-30,Item 1. Business,B USINESS GENERAL Embracing Our Future Microso...,Microsoft,Information Technology,Systems Software
1,MSFT,2024-06-30,Item 1A. Risk Factors,RIS K FACTORS Our operations and financial res...,Microsoft,Information Technology,Systems Software
2,MSFT,2024-06-30,Item 1C. Cybersecurity,CY BERSECURITY RISK MANAGEMENT AND STRATEGY Mi...,Microsoft,Information Technology,Systems Software
3,MSFT,2024-06-30,Item 2. Properties,PR OPERTIES Our corporate headquarters are loc...,Microsoft,Information Technology,Systems Software
4,MSFT,2024-06-30,Item 3. Legal Proceedings,,Microsoft,Information Technology,Systems Software
5,MSFT,2024-06-30,Item 4. Mine Safety Disclosures,,Microsoft,Information Technology,Systems Software
6,MSFT,2024-06-30,Item 7. Management Discussion and Analysis,MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANC...,Microsoft,Information Technology,Systems Software
7,MSFT,2024-06-30,Item 7A. Quantitative and Qualitative Disclosu...,QUANTITATIVE AND QUALITAT IVE DISCLOSURES ABOU...,Microsoft,Information Technology,Systems Software
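
The table above is in long format (one row per section), while later cells index sections as columns, e.g. `df['Item 7. Management Discussion and Analysis']`. The notebook does not show how that wide table was built, so the following is only one possible sketch. `to_wide` is a hypothetical helper and the example rows are placeholders.

```python
import pandas as pd


def to_wide(long_table: pd.DataFrame) -> pd.DataFrame:
    """One row per (ticker, filing_date), one column per section name."""
    wide = long_table.pivot_table(
        index=["ticker", "filing_date"],
        columns="section",
        values="content",
        aggfunc="first",  # each section is expected once per filing
    )
    wide.columns.name = None
    return wide.reset_index()


# Tiny stand-in for the merged long table; the contents are placeholders.
example = pd.DataFrame({
    "ticker": ["MSFT", "MSFT"],
    "filing_date": ["2024-06-30", "2024-06-30"],
    "section": ["Item 1. Business", "Item 2. Properties"],
    "content": ["business text", "properties text"],
})
wide_example = to_wide(example)
print(wide_example.columns.tolist())
```

If the GICS columns are needed in the wide table as well, they could be added to `index`; that choice is left open here.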


# Prompt

In [None]:
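# Prompt template. Written as a plain string (not an f-string) because section_name,
# content, etc. only exist once a row and section are selected; fill it with str.format().
prompt = """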
            As financial analysts, we are extracting financial data from the 10-K, more specifically the {section_name} section of the 10-K for the company {company_name}, which is generally operating in the {gics_sector} GICS Sector, specifically the {gics_sub_industry} GICS Sub-Industry.

            You are an information extraction bot.
            **Strictly adhere to the text in the "{section_name}" section to answer the questions below.**
            Your output will be **ONLY a numbered list of answers**, formatted as '1. [Answer]'.
            Each answer must be a paragraph; no bullet points or internal lists.
            **Do not include any other text** (introductions, conclusions, explanations, etc.).
            If information is not explicitly present in the provided text, respond with "Information not available in this section." for that answer.

            GICS Sector: {gics_sector}
            GICS Sub-Industry: {gics_sub_industry}

            **{section_name} Text:**
            {content}

            ---
            **Questions:**
            {question_list}
            """

In [None]:
import pandas as pd
import requests
from tqdm.notebook import tqdm

# Define questions for each section
Business = """
1.  What is the company's primary business and what main products or services does it offer?
2.  What is the company's general business model or how does it primarily generate revenue as described?
3.  What are the main operating segments of the company, if discussed?
4.  What primary markets (e.g., geographic, customer types, industries) does the company serve?
5.  Who are the main competitors mentioned in this section?
6.  How does the company generally manage its operations, such as manufacturing, sourcing, or distribution?
7.  What is the general approach to sales and marketing described?
8.  Is intellectual property mentioned as important to the business, and if so, how?
9.  What is the approximate number of employees mentioned?
10. What significant government regulations are described as applicable to the company's business?
"""

RiskFactors = """
1.  What are the main categories or types of risks disclosed in this section?
2.  What are identified as the most significant overall risks to the company's business or financial condition?
3.  What key risks are mentioned related to the company's industry, markets, economic conditions, or competition?
4.  What key risks are mentioned related to the company's products, services, technology, cybersecurity, or intellectual property?
5.  What key risks are mentioned related to the company's operations, supply chain, manufacturing, or infrastructure?
6.  What key risks are mentioned related to legal, regulatory, compliance matters, or potential litigation?
7.  What key risks are mentioned related to the company's financial condition, liquidity, or access to capital?
8.  Are there any significant risks mentioned related to personnel, management, or key employees?
9.  Are there any risks mentioned related to external events such as natural disasters, pandemics, or geopolitical issues?
10. Are there any risks mentioned related to the company's inability to successfully implement its strategies (e.g., M&A integration risks, new market entry risks)?
"""

Unresolved = """
1.  Are there any unresolved SEC staff comments disclosed?
2.  What is the nature of these unresolved comments?
3.  How long have these comments been outstanding?
4.  What is the potential impact or required action if these comments are resolved unfavorably for the company?
5.  Do the comments suggest potential issues with the company's accounting practices or transparency?
"""

Cybersecurity = """
1. What is the company's general approach to cybersecurity risk management?
2. Is there a specific team or individuals responsible for cybersecurity?
3. What specific cybersecurity risks does the company identify?
4. Has the company experienced any material cybersecurity incidents?
5. What measures or controls does the company have in place to address cybersecurity risks?
6. Does the company mention any third-party assessments or standards they follow?
7. Is there board oversight of cybersecurity risks, and if so, how is it described?
8. Does the company have specific cybersecurity training programs mentioned?
9. Are there any industry-specific cybersecurity regulations mentioned?
10. How does the company approach data protection and privacy?
"""

Properties = """
1.  What are the company's most significant physical properties?
2.  Where are these principal properties located geographically?
3.  Are the key properties owned or leased, and what are the terms of any significant leases?
4.  Is the described capacity and condition of the properties sufficient for current and planned operations?
5.  Are there any material encumbrances or environmental issues noted regarding the properties?
6.  How do the properties described align with and support the company's overall business strategy and segment operations?
"""

LegalProceedings = """
1.  Are there any material legal proceedings disclosed?
2.  What is the nature of the material proceedings?
3.  Who are the key parties involved in the litigation?
4.  What stage are the material proceedings currently in?
5.  Has the company estimated the potential range of loss or impact?
6.  What is the potential impact of an unfavorable outcome on the company?
7.  Are any of the proceedings brought by or against governmental authorities?
"""

ManagementDiscussion = """
1.  What are the key factors management highlights as driving the changes in revenue, costs, and profitability for the reported periods?
2.  How does management explain the performance and key trends within the company's different operating segments?
3.  What significant non-recurring items, unusual events, or accounting changes does management discuss as impacting the results?
4.  What known trends, events, or uncertainties does management identify as reasonably likely to have a material effect on future financial condition or results of operations?
5.  What is management's discussion of the company's liquidity and capital resources?
6.  What are identified as the primary sources and uses of cash during the periods presented?
7.  What are the company's material cash requirements from known contractual obligations, commitments, or debt maturities?
8.  How has the company's capital structure (e.g., debt-to-equity ratio) changed, and what is management's commentary on it?
9.  What are the critical accounting estimates identified by management?
10. Why are these estimates considered critical, and what are the key assumptions or uncertainties underlying them?
11. How does management explain the sensitivity of the financial statements to changes in these critical accounting estimates?
"""

QuantitativeDisclosures = """
1.  What are the primary market risks the company is disclosed as being exposed to?
2.  How does management describe the nature of these market risk exposures?
3.  What are the company's stated objectives and general strategies for managing these market risks?
4.  Does the company disclose the use of derivative financial instruments for hedging market risks? If so, how are they generally used?
5.  Does the company mention holding derivative instruments for trading or speculative purposes?
6.  What quantitative information is provided about the potential impact of changes in interest rates?
7.  What quantitative information is provided about the potential impact of changes in foreign currency exchange rates?
8.  What quantitative information is provided about the potential impact of changes in commodity prices?
9.  What quantitative information is provided about the potential impact of changes in equity prices, if any?
10. What methods (e.g., sensitivity analysis, Value at Risk - VAR) are mentioned as being used for the quantitative market risk analysis? What are the key assumptions of the method used?
"""

# Map section names to their question lists
section_question_map = {
    "Item 1. Business": Business,
    "Item 1A. Risk Factors": RiskFactors,
    "Item 1B. Unresolved Staff Comments" : Unresolved,
    "Item 1C. Cybersecurity": Cybersecurity,
    "Item 2. Properties": Properties,
    "Item 3. Legal Proceedings": LegalProceedings,
    "Item 7. Management Discussion and Analysis": ManagementDiscussion,
    "Item 7A. Quantitative and Qualitative Disclosures about Market Risk": QuantitativeDisclosures
}

def call_local_gemma(prompt, temperature=0.1):
    """
    Call the local Ollama Gemma model and return raw output
    """
    # Set up Ollama API request
    api_url = "http://localhost:11434/api/generate"
    payload = {
        "model": "gemma3:4b-it-qat",  # Using Gemma 3 model
        "prompt": prompt,
        "options": {"temperature": temperature},  # sampling parameters go under "options"
        "stream": False
    }
    
    # Make API call
    try:
        print("Sending prompt to local Gemma model...")
        response = requests.post(api_url, json=payload)
        if response.status_code == 200:
            result = response.json()
            return result.get("response", "No response")
        else:
            print(f"Error: {response.status_code}")
            return f"Error: {response.status_code}"
    except Exception as e:
        print(f"Connection error: {str(e)}")
        return f"Connection error: {str(e)}"

def process_10k_sections(df):
    """
    Process each section and display raw outputs without parsing
    """
    # Identify the section columns
    section_columns = [col for col in df.columns if col in section_question_map]
    
    if not section_columns:
        # Try with partial matching if exact matches aren't found
        section_columns = []
        for col in df.columns:
            for section_name in section_question_map.keys():
                if section_name.lower() in col.lower():
                    section_columns.append(col)
                    break
    
    print(f"Found {len(section_columns)} section columns: {section_columns}")
    
    # Process each row in the dataframe
    for idx, row in tqdm(df.iterrows(), total=len(df), desc="Processing companies"):
        ticker = row.get('ticker', f"Company_{idx}")
        filing_date = row.get('filing_date', 'Unknown')
        
        print(f"\n\n{'='*80}")
        print(f"PROCESSING: {ticker} - Filing Date: {filing_date}")
        print(f"{'='*80}")
        
        # Process each section
        for section_col in section_columns:
            content = row.get(section_col)
            
            # Skip if no content
            if pd.isna(content) or str(content).strip() == "" or str(content).lower() == "nan":
                print(f"Skipping {ticker} {section_col} - no content")
                continue
            
            # Find the matching question set
            section_name = None
            question_list = None
            
            # Try exact match first
            if section_col in section_question_map:
                section_name = section_col
                question_list = section_question_map[section_col]
            else:
                # Try partial matching
                for key in section_question_map.keys():
                    if key.lower() in section_col.lower():
                        section_name = key
                        question_list = section_question_map[key]
                        break
            
            if not section_name or not question_list:
                continue
                
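            # Create the prompt for this section (same wording as the template in the Prompt cell).
            # Assumption: 'Security' and the GICS columns are present in df; fall back if they are not.
            company_name = row.get('Security', ticker)
            gics_sector = row.get('GICS Sector', 'Unknown')
            gics_sub_industry = row.get('GICS Sub-Industry', 'Unknown')
            prompt = f"""
            As financial analysts, we are extracting financial data from the 10-K, more specifically the {section_name} section of the 10-K for the company {company_name}, which is generally operating in the {gics_sector} GICS Sector, specifically the {gics_sub_industry} GICS Sub-Industry.

            You are an information extraction bot.
            **Strictly adhere to the text in the "{section_name}" section to answer the questions below.**
            Your output will be **ONLY a numbered list of answers**, formatted as '1. [Answer]'.
            Each answer must be a paragraph; no bullet points or internal lists.
            **Do not include any other text** (introductions, conclusions, explanations, etc.).
            If information is not explicitly present in the provided text, respond with "Information not available in this section." for that answer.

            GICS Sector: {gics_sector}
            GICS Sub-Industry: {gics_sub_industry}

            **{section_name} Text:**
            {content}

            ---
            **Questions:**
            {question_list}
            """
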
            # Call the Gemma model and display raw output
            try:
                print(f"\n\n{'-'*80}")
                print(f"SECTION: {section_name}")
                print(f"{'-'*80}")
                
                raw_response = call_local_gemma(prompt)
                
                print("\nRAW MODEL OUTPUT:")
                print(f"{'-'*40}")
                print(raw_response)
                print(f"{'-'*40}")
                
            except Exception as e:
                print(f"Error processing {ticker} {section_name}: {str(e)}")

# Function to test the Gemma model with a sample prompt
def test_gemma_with_sample():
    sample_prompt = """
    You are an information extraction bot.
    Based strictly and only on the text provided, answer the following questions.
    Your output must be **ONLY** the numbered answers, formatted as a numbered list (e.g., '1. [Answer]').
    
    **Text:**
    This is a sample company description. The company produces software for healthcare providers.
    They have approximately 5,000 employees and operate mainly in North America and Europe.
    Their main competitors are XYZ Corp and ABC Inc.
    
    ---
    **Questions:**
    1. What is the company's primary business?
    2. How many employees does the company have?
    3. Where does the company operate?
    4. Who are the main competitors?
    """
    
    print("\nTesting Gemma with sample prompt...")
    raw_response = call_local_gemma(sample_prompt)
    print("\nSAMPLE RAW OUTPUT:")
    print(f"{'-'*40}")
    print(raw_response)
    print(f"{'-'*40}")

# Example usage:
# First test with a sample prompt
#test_gemma_with_sample()

# Then process the actual data
#df = pd.read_csv('your_10k_data.csv')  # Replace with your actual data loading
process_10k_sections(df)

Found 7 section columns: ['Item 1. Business', 'Item 1A. Risk Factors', 'Item 1C. Cybersecurity', 'Item 2. Properties', 'Item 3. Legal Proceedings', 'Item 7. Management Discussion and Analysis', 'Item 7A. Quantitative and Qualitative Disclosures about Market Risk']


Processing companies:   0%|          | 0/1 [00:00<?, ?it/s]



PROCESSING: MSFT - Filing Date: 2024-06-30


--------------------------------------------------------------------------------
SECTION: Item 1. Business
--------------------------------------------------------------------------------
Sending prompt to local Gemma model...

RAW MODEL OUTPUT:
----------------------------------------
Here’s a breakdown of the answers to your questions based on the provided text:

1. **What is the company’s primary business and what main products or services does it offer?**
   The company’s primary business is information technology. It offers a wide range of products and services, including operating systems (Windows), productivity software (Microsoft Office), server and cloud computing services, gaming (Xbox), and more.

2. **What is the company’s general business model or how does it primarily generate revenue as described?**
    The company generates revenue primarily through the sale of software licenses, subscriptions to cloud services (like Micros
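
The prompt asks for answers shaped like '1. [Answer]', but the raw output above also contains an introduction and repeated questions. A minimal sketch of splitting such output into a list of answers follows. `split_numbered_answers` is a hypothetical helper and the sample text is invented; it does not strip repeated questions, only the text before the first number.

```python
import re


def split_numbered_answers(raw: str) -> list[str]:
    """Split text shaped like '1. ...', '2. ...' into a list of answers."""
    # A new answer starts at a line that begins with digits followed by a period.
    parts = re.split(r"(?m)^\s*\d+\.\s+", raw)
    # parts[0] is whatever came before the first number (e.g. an unwanted intro).
    return [" ".join(p.split()) for p in parts[1:] if p.strip()]


sample = """Here is a summary:
1. The company sells software.
2. It has about 5,000
   employees.
"""
print(split_numbered_answers(sample))
```

Collapsing whitespace with `" ".join(p.split())` keeps each answer on one line, which matches the "each answer must be a paragraph" instruction in the prompt.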

In [18]:
df['Item 7. Management Discussion and Analysis'].values[0]

'MANAGEMENT’S DISCUSSION AND ANALYSIS OF FINANCIAL CONDITION AND RESULTS OF OPERATIONS The following Management’s Discussion and Analysis of Financial Condition and Results of Operations (“MD&A”) is intended to help the reader understand the results of operations and financial condition of Microsoft Corporation. MD&A is provided as a supplement to, and should be read in conjunction with, our consolidated financial statements and the accompanying Notes to Financial Statements (Part II, Item 8 of this Form 10-K). This section generally discusses the results of our operations for the year ended June 30, 2024 compared to the year ended June 30, 2023. For a discussion of the year ended June 30, 2023 compared to the year ended June 30, 2022, please refer to Part II, Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of Operations” in our Annual Report on Form 10-K for the year ended June 30, 2023. OVERVIEW Microsoft is a technology company committed to making di

In [None]:
df.head(30)

NameError: name 'df' is not defined