# Data Cleaning Notebook

### This notebook cleans and merges stock-related data and general Reddit posts.
### The process is divided into the following sections:
###  1. Data Imports and Setup
###  2. Data Cleaning for Stock Posts and Comments
###  3. Merging and Exporting the Final DataFrame


In [2]:
import pandas as pd
import numpy as np
import os
import importlib.util
import sys
import json

## 1. Data Imports and Setup for Stock Data

In [15]:
# Loading general and stock specific posts
posts = pd.read_csv('./data/reddit/general_posts.csv').dropna(subset=['category'])  # Drop rows with a missing 'category'
stock_posts = pd.read_csv('./data/reddit/posts_stock_specific.csv').dropna(subset=['category'])

# Loading comments and stock specific comments
comments = pd.read_csv('./data/reddit/general_comments.csv')
stock_comments = pd.read_csv('./data/reddit/comments_stock_specific.csv')

### Making sure that the scores are treated as actual numbers and not strings. I then sort the comments by score within each post_id.

I am only considering the top 15 comments for each post.

In [16]:
comments['score'] = pd.to_numeric(comments['score'])
stock_comments['score'] = pd.to_numeric(stock_comments['score'])
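
If a score column ever contains a stray non-numeric entry, `pd.to_numeric` raises by default. A minimal sketch, on invented values (not taken from the Reddit files), of coercing such entries to NaN instead:

```python
import pandas as pd

# Toy frame; the values are made up for illustration only
toy = pd.DataFrame({"score": ["12", "7", "n/a"]})

# errors="coerce" turns unparseable entries into NaN instead of raising
toy["score"] = pd.to_numeric(toy["score"], errors="coerce")
print(toy["score"].tolist())
```

Whether coercing is appropriate depends on how the missing scores should be treated later; the cell above keeps the stricter default.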

In [17]:
# Sort by 'score' within each 'post_id' group and keep the top 15 comments
stock_comments = stock_comments.groupby('post_id').apply(lambda x: x.sort_values(by='score', ascending=False).head(15)).reset_index(drop=True)

comments = comments.groupby('post_id').apply(lambda x: x.sort_values(by='score', ascending=False).head(15)).reset_index(drop=True)
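
An alternative to `groupby(...).apply(...)`, sketched here on invented data: sort the whole frame once, then take `head(n)` per group. It selects the top-scoring comments per post, although the rows come out ordered by score rather than grouped by `post_id`. The names `toy` and `TOP_N` are assumptions for this example.

```python
import pandas as pd

# Toy comments table; the real notebook keeps 15 comments per post
toy = pd.DataFrame({
    "post_id": ["a", "a", "a", "b", "b"],
    "body":    ["a1", "a2", "a3", "b1", "b2"],
    "score":   [5, 50, 20, 1, 9],
})

TOP_N = 2  # hypothetical cutoff for the toy data

# Sort once by score, then keep the first TOP_N rows seen for each post_id
top = (
    toy.sort_values("score", ascending=False)
       .groupby("post_id", sort=False)
       .head(TOP_N)
       .reset_index(drop=True)
)
print(top)
```

Because the next step regroups by `post_id`, the different row order should not matter for this pipeline.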



#### Each post_id has multiple comments, and I want them to stay together. I collect each post's comments into a list so that there is one row per post_id.

In [18]:
stock_comments = stock_comments.groupby('post_id')['body'].apply(list).reset_index()
comments = comments.groupby('post_id')['body'].apply(list).reset_index()

#### Let's now merge the posts and comments together

In [19]:
stocks = pd.merge(stock_posts, stock_comments, on='post_id', how='left')
general = pd.merge(posts, comments, on='post_id', how='left')
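
A small sketch of what the grouping and the left merge produce, using invented posts and comments. With `how='left'`, every post is kept, and a post without comments ends up with NaN in `body`; the last step, which swaps NaN for an empty list, is an optional assumption and not part of the notebook.

```python
import pandas as pd

posts_toy = pd.DataFrame({"post_id": ["a", "b", "c"], "title": ["A", "B", "C"]})
comments_toy = pd.DataFrame({
    "post_id": ["a", "a", "b"],
    "body": ["first", "second", "only"],
})

# One row per post_id, with all comment bodies collected into a list
comment_lists = comments_toy.groupby("post_id")["body"].apply(list).reset_index()

# how="left" keeps every post; post "c" has no comments and gets NaN in 'body'
merged = pd.merge(posts_toy, comment_lists, on="post_id", how="left")

# Optional: replace the NaN placeholders with empty lists
merged["body"] = merged["body"].apply(lambda v: v if isinstance(v, list) else [])
print(merged)
```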

#### Making sure the datetime is well structured and uniform by keeping only the date (removing hours, minutes, and seconds)

In [20]:
# Convert 'created_utc' to datetime and keep only the date
stocks['created_utc'] = pd.to_datetime(stocks['created_utc']).dt.date
general['created_utc'] = pd.to_datetime(general['created_utc']).dt.date
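
Two common ways of dropping the time component behave differently, which can matter if the column is later used for date arithmetic. A sketch on invented timestamps:

```python
import pandas as pd

toy = pd.DataFrame({"created_utc": ["2024-08-15 13:45:10", "2024-07-29 08:01:59"]})
parsed = pd.to_datetime(toy["created_utc"])

# .dt.date yields plain Python date objects (this is what the notebook uses)
as_date = parsed.dt.date

# .dt.normalize() keeps a datetime64 column with the time set to midnight
as_midnight = parsed.dt.normalize()

print(as_date.iloc[0], as_midnight.iloc[0])
```

If `created_utc` were stored as epoch seconds rather than as a date string, `pd.to_datetime(..., unit='s')` would be needed instead; that is an assumption about the input, not something the cell above requires.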

In [21]:
general.head()

Unnamed: 0,post_id,title,selftext,score,upvote_ratio,created_utc,num_comments,author,permalink,url,is_self,flair,subreddit,category,body
0,1esvxig,CNBC: Harris to propose federal ban on 'corpor...,,35612,0.87,2024-08-15,2768,BothZookeepergame612,/r/Economics/comments/1esvxig/cnbc_harris_to_p...,https://www.cnbc.com/2024/08/15/harris-corpora...,False,,Economics,general,[The headline sounds way different than the pr...
1,1f2eubo,Should the world's richest 1% - who gained $42...,,18133,0.91,2024-08-27,2815,Impressive-Ad1944,/r/Economics/comments/1f2eubo/should_the_world...,https://www.business-standard.com/world-news/w...,False,,Economics,general,"[[removed], While the Walton family is one of ..."
2,1ef14i6,Boomers' iron grip on $76 trillion of wealth p...,,13372,0.91,2024-07-29,773,GetRichQuickSchemer_,/r/Economics/comments/1ef14i6/boomers_iron_gri...,https://creditnews.com/economy/boomers-iron-gr...,False,News,Economics,general,[Having a large IRA/401K is understandable. \n...
3,1cbzoay,Nate Silver: Go to a state school. The Ivy Lea...,,12639,0.92,2024-04-24,1432,jivatman,/r/Economics/comments/1cbzoay/nate_silver_go_t...,https://www.natesilver.net/p/go-to-a-state-school,False,,Economics,general,[The point of the ivies isn't the quality of t...
4,1cz1a2v,Some Americans live in a parallel economy wher...,,10761,0.85,2024-05-23,3178,mafco,/r/Economics/comments/1cz1a2v/some_americans_l...,https://finance.yahoo.com/news/some-americans-...,False,News,Economics,general,[The Great Bifurcation occurred right around C...


#### Renaming the columns to more appropriate names

In [22]:
names = {'selftext':'post','body':'comments'}
stocks = stocks.rename(columns=names)
general = general.rename(columns=names)

###  3. Merging and Exporting the Final DataFrame

In [23]:
main_df = pd.concat([stocks,general], ignore_index=True)

main_df.to_csv('./data/cleaned_stock.csv')
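
Two things are worth knowing when this CSV is read back, sketched below on an invented frame and an in-memory buffer. First, writing without `index=False` stores the index as an unnamed column, which `read_csv` later labels `Unnamed: 0`. Second, a column of lists is written as text, so it comes back as strings that need to be parsed.

```python
import ast
import io

import pandas as pd

toy = pd.DataFrame({"post_id": ["a", "b"], "comments": [["x", "y"], ["z"]]})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False avoids an extra 'Unnamed: 0' column on reload
buffer.seek(0)

reloaded = pd.read_csv(buffer)

# The lists were serialized as text such as "['x', 'y']"; parse them back into lists
reloaded["comments"] = reloaded["comments"].apply(ast.literal_eval)
print(reloaded)
```

In the real data, posts without comments hold NaN rather than a list, so the parsing step would need a guard for missing values.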

# Financial Data Cleaner (for now, this only outputs yearly 10-K data, no 10-Qs)

The goal in this section is to clean the raw financial data so that it is useful for my needs.

1. I am importing the raw data.
2. I am pivoting the data so the account names are columns (easier to manipulate).
3. I am removing all the rows that are not part of a 10-K or 10-Q. Essentially, if the frame does not match either format (CY20XXQX or CY20XX), the row is dropped.
4. I am removing all of the columns that are at least 95% empty, because some accounts are listed but carry almost no information.
5. I am doing this for all 500 companies so the data is easier to manipulate. Once it is done, the raw data will be deleted.
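
The frame formats in step 3 can be checked with anchored regular expressions. A minimal sketch on invented frame labels; `annual_mask` and `quarterly_mask` are names chosen for this example:

```python
import pandas as pd

# Invented frame labels; the third one has a trailing character and matches neither pattern
frames = pd.Series(["CY2021", "CY2021Q3", "CY2022Q4I", None, "CY2023"])

# na=False makes missing frames count as non-matching
annual_mask = frames.str.match(r"^CY\d{4}$", na=False)
quarterly_mask = frames.str.match(r"^CY\d{4}Q[1-4]$", na=False)

print(frames[annual_mask].tolist())
print(frames[quarterly_mask].tolist())
```

The `$` anchor is what keeps `CY2021Q3` out of the annual set.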

### I am processing the different financial CSV files and cleaning the data up
- Import the financial CSV data
- Sort values by date
- Pivot the table so that each concept appears as an independent column across the different dates
- The pivot generates a lot of duplicate rows, so I group the rows by period end date and keep the first non-NaN value in each column to get a clean dataframe
- To further clean the data, I remove the columns that are at least 95% empty. I cannot do any meaningful time series analysis on those columns
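
The steps above can be sketched compactly on invented data. This version uses `groupby(...).first()`, which takes the first non-null value per column in each group, as a stand-in for the explicit loop in the cell below; the toy values and the added `Goodwill` column are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Invented long-format data: the same period is reported in two filings
raw = pd.DataFrame({
    "end":     ["2021-12-31", "2021-12-31", "2021-12-31", "2022-12-31"],
    "filed":   ["2022-02-01", "2022-02-01", "2023-02-01", "2023-02-01"],
    "concept": ["Assets", "Revenues", "Assets", "Assets"],
    "value":   [100.0, 40.0, 100.0, 120.0],
})

# Pivot so that each concept becomes its own column
pivoted = raw.pivot_table(
    index=["end", "filed"], columns="concept", values="value"
).reset_index()

# Collapse duplicate periods: first non-null value per column for each 'end'
consolidated = pivoted.sort_values("filed").groupby("end", as_index=False).first()

# Add an entirely empty column to show the sparsity filter at work
consolidated["Goodwill"] = np.nan

# Keep only columns whose share of NaN cells is below 95%
nan_share = consolidated.isna().mean()
cleaned = consolidated.loc[:, nan_share < 0.95]
print(cleaned)
```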

### Output in "Clean" folder

In [None]:
import os
print("Starting data processing...")
sp = pd.read_csv('./data/sp500.csv')
sp.head(3)
for i in sp['Symbol']:
    try:
        print(f'Processing symbol: {i}')
        file_path = f'./data/financial_data/{i}_raw_financials.csv'

        df = pd.read_csv(file_path)
        print(f'Successfully read data for {i}')

        pivoted_df = df.pivot_table(
            index=['cik', 'company_name', 'frame', 'unit', 'end', 'filed', 'form'],
            columns='concept',
            values='value'
        ).reset_index()
        print(f'Pivoted data for {i}')

        metadata_cols = ['cik', 'company_name', 'form', 'frame', 'end', 'unit']
        consolidated_df = pd.DataFrame()
        grouped = pivoted_df.groupby('end')

        for end_date, group in grouped:
            row_data = {'end': end_date}
            for col in metadata_cols:
                if col in group.columns:
                    row_data[col] = group[col].iloc[0]
            
            for col in pivoted_df.columns:
                if col not in metadata_cols and col != 'end':
                    non_nan_values = group[col].dropna()
                    if not non_nan_values.empty:
                        row_data[col] = non_nan_values.iloc[0]
                    else:
                        row_data[col] = None
            
            consolidated_df = pd.concat([consolidated_df, pd.DataFrame([row_data])], ignore_index=True)

        ##### I only want the yearly data not the quarterly data for now ####
        consolidated_df = consolidated_df[consolidated_df['form'] == '10-K']
        consolidated_df = consolidated_df[consolidated_df['frame'].astype(str).str.match(r'^CY\d{4}$')]
        print(f'Filtered by frame format for {i}')


        print(f'Filtering columns with >95% empty cells for {i}')
        if not consolidated_df.empty:
            nan_percentages = consolidated_df.isna().mean()
            cleaned_df = consolidated_df.loc[:, nan_percentages < 0.95]
            consolidated_df = cleaned_df
            print(f'Removed sparse columns for {i}')
        else:
            print(f'Consolidated dataframe is empty for {i}, skipping NaN column removal.')

        output_file_path = f'./data/clean/{i}.csv'
        
        # Ensure the directory exists before saving
        os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
        
        consolidated_df.to_csv(output_file_path, index=False)
        print(f'Successfully processed and saved data for {i} to {output_file_path}')

    except FileNotFoundError:
        print(f"Error: File not found for symbol {i} at path: {file_path}. Skipping this symbol.")
    except pd.errors.EmptyDataError:
        print(f"Error: File for symbol {i} is empty: {file_path}. Skipping this symbol.")
    except KeyError as e:
        print(f"Error: A required column is missing for symbol {i} (KeyError: {e}). Skipping this symbol.")
    except Exception as e:
        print(f"An unexpected error occurred while processing symbol {i}: {e}")
        print(f"Skipping symbol {i} and continuing with the next one.")

print("Finished processing all symbols.")


Starting data processing...


NameError: name 'sp' is not defined

# Here I further clean the data to construct the Balance Sheet, Income Statement, and Cash Flow Statement using the Sector / Industry mappings, and save them as CSV files


- Check each year (Frame) individually for each financial account
- If the primary account has a NaN/empty value in a specific year
- Look at the alternative account names for that same year
- Use the alternative value when available
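
The fallback described above can be sketched in a vectorized form. The mapping entry below is hypothetical: its shape (`primary` plus `alternatives`) mirrors what the code in the next cell expects, but the tag names and values are invented and not taken from the actual mapping files. The `children` sum fallback is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical mapping entry for one account
account_info = {
    "primary": "Revenues",
    "alternatives": ["SalesRevenueNet", "RevenueFromContractWithCustomerExcludingAssessedTax"],
}

# Invented yearly data: the primary tag is missing in two of the three years
df = pd.DataFrame({
    "frame": ["CY2020", "CY2021", "CY2022"],
    "Revenues": [10.0, np.nan, np.nan],
    "SalesRevenueNet": [11.0, 20.0, np.nan],
})

# Candidate columns in priority order, restricted to those present in the data
candidates = [account_info["primary"], *account_info["alternatives"]]
present = [c for c in candidates if c in df.columns]

if present:
    # bfill(axis=1) pulls the first non-NaN value, in priority order, into the first column
    resolved = df[present].bfill(axis=1).iloc[:, 0]
else:
    resolved = pd.Series(np.nan, index=df.index)

print(resolved.tolist())
```

The row-by-row loop in `extract_account_value` makes the same per-year decision; the vectorized form is only a shorter way to express the primary-then-alternatives order.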


### Output in financial_statement

In [None]:
# Global list to store mapping debug information
mapping_debug_log = []

# Function to load industry mappings from sector-specific Python files
def load_industry_mapping(sector, sub_industry, mappings_dir='.'):
    """
    Load the appropriate industry mapping based on sector and sub-industry.
    
    Args:
        sector (str): GICS Sector 
        sub_industry (str): GICS Sub-Industry
        mappings_dir (str): Directory containing the mapping Python files
        
    Returns:
        dict: The mapping dictionary for the specified sector and sub-industry
    """
    try:
        # Convert sector name to filename format
        sector_file = sector.lower().replace(' ', '').replace('&', '').replace('-', '') + '.py'
        
        # Check if file exists (using the specified directory)
        sector_file_path = os.path.join(mappings_dir, sector_file)
        if not os.path.exists(sector_file_path):
            debug_msg = f"Warning: Mapping file {sector_file_path} not found for {sector} - {sub_industry}."
            print(debug_msg)
            mapping_debug_log.append(debug_msg)
            return None
        
        # Load the module dynamically
        spec = importlib.util.spec_from_file_location(sector.lower(), sector_file_path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        
        # Get the industry mappings
        if hasattr(module, 'industry_mappings'):
            industry_mappings = module.industry_mappings
            
            # Log available keys for debugging
            available_keys = list(industry_mappings.keys())
            debug_msg = f"Processing {sector} - {sub_industry}... dict_keys({available_keys})"
            print(debug_msg)
            mapping_debug_log.append(debug_msg)
            
            # Try multiple normalization strategies to find a match
            
            # Strategy 1: Direct key lookup without normalization
            if sub_industry in industry_mappings:
                debug_msg = f"Found direct mapping for {sub_industry}"
                print(debug_msg)
                mapping_debug_log.append(debug_msg)
                return industry_mappings[sub_industry]
            
            # Generate different normalized versions of the sub_industry
            normalized_versions = [
                # Strategy 2: PascalCase with '&' → 'And'
                ''.join(word.capitalize() for word in sub_industry.replace('&', ' And ').split()),
                
                # Strategy 3: PascalCase with '&' removed
                ''.join(word.capitalize() for word in sub_industry.replace('&', ' ').split()),
                
                # Strategy 4: PascalCase with all punctuation removed
                ''.join(word.capitalize() for word in ''.join(c if c.isalnum() or c.isspace() else ' ' for c in sub_industry).split()),
            ]
            
            # Try each normalized version
            for normalized in normalized_versions:
                if normalized in industry_mappings:
                    debug_msg = f"Found mapping for {sub_industry} → {normalized}"
                    print(debug_msg)
                    mapping_debug_log.append(debug_msg)
                    return industry_mappings[normalized]
            
            # Try more aggressive partial matching approaches
            # Create multiple normalized versions for comparison
            sub_variants = [
                # Remove spaces, convert to lowercase
                sub_industry.lower().replace(' ', ''),
                
                # Remove spaces, '&', convert to lowercase
                sub_industry.lower().replace(' ', '').replace('&', ''),
                
                # Replace '&' with 'and', remove spaces, convert to lowercase
                sub_industry.lower().replace(' ', '').replace('&', 'and'),
                
                # Remove all non-alphanumeric chars, convert to lowercase
                ''.join(c.lower() for c in sub_industry if c.isalnum())
            ]
            
            # Try to match with each key using the variants
            for key in industry_mappings:
                # Create similar variants for the key
                key_variants = [
                    # Remove spaces, convert to lowercase
                    key.lower().replace(' ', ''),
                    
                    # Remove spaces, 'And', convert to lowercase
                    key.lower().replace(' ', '').replace('and', ''),
                    
                    # Remove all non-alphanumeric chars, convert to lowercase
                    ''.join(c.lower() for c in key if c.isalnum())
                ]
                
                # Check if any variant of sub_industry matches any variant of key
                for sub_var in sub_variants:
                    for key_var in key_variants:
                        if sub_var == key_var or sub_var in key_var or key_var in sub_var:
                            debug_msg = f"Using mapping for {key} (variant match with {sub_industry})"
                            print(debug_msg)
                            mapping_debug_log.append(debug_msg)
                            return industry_mappings[key]
                
                # Special case for 'Accounts' suffix
                if 'accounts' in key.lower():
                    key_no_accounts = key.lower().replace('accounts', '')
                    for sub_var in sub_variants:
                        if sub_var in key_no_accounts or key_no_accounts in sub_var:
                            debug_msg = f"Using mapping for {key} (matched after removing 'Accounts')"
                            print(debug_msg)
                            mapping_debug_log.append(debug_msg)
                            return industry_mappings[key]
            
            # Use first industry mapping as fallback
            default_key = next(iter(industry_mappings.keys()))
            debug_msg = f"No specific mapping found for {sub_industry}. Using default mapping for {sector}: {default_key}"
            print(debug_msg)
            mapping_debug_log.append(debug_msg)
            return industry_mappings[default_key]
        else:
            debug_msg = f"No industry_mappings found in {sector_file} for {sector} - {sub_industry}."
            print(debug_msg)
            mapping_debug_log.append(debug_msg)
            return None
    except Exception as e:
        debug_msg = f"Error loading industry mapping for {sector} - {sub_industry}: {str(e)}"
        print(debug_msg)
        mapping_debug_log.append(debug_msg)
        return None

# Load company data from CSV
def load_company_data(ticker, data_dir='clean'):
    """
    Load financial data for a specific company.
    
    Args:
        ticker (str): Company ticker symbol
        data_dir (str): Directory containing the cleaned company CSV files
        
    Returns:
        DataFrame: Financial data for the company
    """
    try:
        filepath = os.path.join(data_dir, f"{ticker}.csv")
        df = pd.read_csv(filepath)
        return df
    except Exception as e:
        print(f"Error loading data for {ticker}: {str(e)}")
        return None

# Helper function to extract account value using mapping info
def extract_account_value(df, account_info, available_columns):
    """
    Extract account value from financial data using mapping information.
    Checks cell by cell, using alternatives if primary value is NaN.
    
    Args:
        df (DataFrame): Company financial data
        account_info (dict): Account mapping information
        available_columns (set): Available columns in the DataFrame
        
    Returns:
        Series: Account value from primary or alternatives
    """
    # Initialize result series with NaN values
    result = pd.Series(np.nan, index=df.index)
    
    # Iterate through each row (each reporting period)
    for idx in df.index:
        # Check primary tag first
        if 'primary' in account_info and account_info['primary'] in available_columns:
            primary_value = df.at[idx, account_info['primary']]
            
            # If primary value exists and is not NaN, use it
            if pd.notna(primary_value):
                result.at[idx] = primary_value
                continue
        
        # If primary is NaN or missing, check alternatives
        if 'alternatives' in account_info:
            for alt in account_info['alternatives']:
                if alt in available_columns:
                    alt_value = df.at[idx, alt]
                    if pd.notna(alt_value):
                        result.at[idx] = alt_value
                        break  # Use first non-NaN alternative
        
        # If still NaN, try children as a sum if available
        if pd.isna(result.at[idx]) and 'children' in account_info and account_info['children']:
            child_values = []
            for child in account_info['children']:
                if child in available_columns and pd.notna(df.at[idx, child]):
                    child_values.append(df.at[idx, child])
            
            # If we found any child values, sum them
            if child_values:
                result.at[idx] = sum(child_values)
    
    return result

# Function to reconstruct balance sheet
def reconstruct_balance_sheet(df, industry_mapping):
    """
    Reconstruct balance sheet using industry-specific mappings.
    
    Args:
        df (DataFrame): Company financial data
        industry_mapping (dict): Industry-specific mappings
        
    Returns:
        DataFrame: Reconstructed balance sheet
    """
    # Create new DataFrame for balance sheet
    balance_sheet = pd.DataFrame(index=df.index)
    
    # Copy metadata columns
    id_columns = ['filed', 'company_name', 'end', 'unit', 'form', 'frame', 'cik']
    for col in id_columns:
        if col in df.columns:
            balance_sheet[col] = df[col]
    
    # Get available columns
    available_columns = set(df.columns)
    
    # Process balance sheet sections dynamically
    balance_sheet_sections = ['Assets', 'Liabilities', 'Equity']
    
    for section in balance_sheet_sections:
        if section in industry_mapping:
            # Iterate through all sub-accounts in this section as defined in the mapping
            for account_name, account_info in industry_mapping[section].items():
                value = extract_account_value(df, account_info, available_columns)
                balance_sheet[f"{section} - {account_name}"] = value
    
    # Add missing totals and validate
    balance_sheet = add_missing_balance_sheet_totals(balance_sheet)
    
    # Remove columns with only NaN values
    balance_sheet = remove_nan_only_columns(balance_sheet)
    
    return balance_sheet

# Function to reconstruct income statement
def reconstruct_income_statement(df, industry_mapping):
    """
    Reconstruct income statement using industry-specific mappings.
    
    Args:
        df (DataFrame): Company financial data
        industry_mapping (dict): Industry-specific mappings
        
    Returns:
        DataFrame: Reconstructed income statement
    """
    # Create new DataFrame for income statement
    income_statement = pd.DataFrame(index=df.index)
    
    # Copy metadata columns
    id_columns = ['filed', 'company_name', 'end', 'unit', 'form', 'frame', 'cik']
    for col in id_columns:
        if col in df.columns:
            income_statement[col] = df[col]
    
    # Get available columns
    available_columns = set(df.columns)
    
    # Process Income Statement dynamically
    if 'IncomeStatement' in industry_mapping:
        income_mapping = industry_mapping['IncomeStatement']
        
        # Iterate through all sections in the income statement mapping
        for section_name, section_data in income_mapping.items():
            # Check if this is a nested structure or direct account mapping
            if isinstance(section_data, dict) and 'primary' in section_data:
                # Direct account mapping (unnested)
                value = extract_account_value(df, section_data, available_columns)
                income_statement[f"IncomeStatement - {section_name}"] = value
            else:
                # Nested structure - process each account in the section
                for account_name, account_info in section_data.items():
                    # Check if this is a further nested structure
                    if isinstance(account_info, dict) and 'primary' in account_info:
                        # Direct account mapping
                        value = extract_account_value(df, account_info, available_columns)
                        income_statement[f"IncomeStatement - {section_name} - {account_name}"] = value
                    else:
                        # Further nested structure
                        for sub_account_name, sub_account_info in account_info.items():
                            value = extract_account_value(df, sub_account_info, available_columns)
                            income_statement[f"IncomeStatement - {section_name} - {account_name} - {sub_account_name}"] = value
    
    # Remove columns with only NaN values
    income_statement = remove_nan_only_columns(income_statement)
    
    return income_statement

# Function to reconstruct cash flow statement
def reconstruct_cash_flow_statement(df, industry_mapping):
    """
    Reconstruct cash flow statement using industry-specific mappings.
    
    Args:
        df (DataFrame): Company financial data
        industry_mapping (dict): Industry-specific mappings
        
    Returns:
        DataFrame: Reconstructed cash flow statement
    """
    # Create new DataFrame for cash flow statement
    cash_flow = pd.DataFrame(index=df.index)
    
    # Copy metadata columns
    id_columns = ['filed', 'company_name', 'end', 'unit', 'form', 'frame', 'cik']
    for col in id_columns:
        if col in df.columns:
            cash_flow[col] = df[col]
    
    # Get available columns
    available_columns = set(df.columns)
    
    # Process Cash Flow Statement dynamically
    if 'CashFlowStatement' in industry_mapping:
        cf_mapping = industry_mapping['CashFlowStatement']
        
        # Iterate through all sections in the cash flow mapping
        for section_name, section_data in cf_mapping.items():
            # Process each account in the section
            for account_name, account_info in section_data.items():
                # Check if this is a nested structure
                if isinstance(account_info, dict) and 'primary' in account_info:
                    # Direct account mapping
                    value = extract_account_value(df, account_info, available_columns)
                    cash_flow[f"CashFlow - {section_name} - {account_name}"] = value
                else:
                    # Nested structure
                    for sub_account_name, sub_account_info in account_info.items():
                        value = extract_account_value(df, sub_account_info, available_columns)
                        cash_flow[f"CashFlow - {section_name} - {account_name} - {sub_account_name}"] = value
    
    # Remove columns with only NaN values
    cash_flow = remove_nan_only_columns(cash_flow)
    
    return cash_flow

# Function to add missing balance sheet totals (including validation logic)
def add_missing_balance_sheet_totals(balance_sheet):
    """
    Adds missing total columns according to accounting relationships.
    
    Args:
        balance_sheet: DataFrame with reconstructed balance sheet
        
    Returns:
        DataFrame with missing totals computed where possible and validation columns
    """
    # Make a copy to avoid modifying the original
    result = balance_sheet.copy()
    
    # Define key total column names
    total_assets_col = 'Assets - Total Assets'
    total_liabilities_col = 'Liabilities - TotalLiabilities'
    total_equity_col = 'Equity - TotalStockholdersEquity'
    total_liab_equity_col = 'Equity - TotalLiabilitiesandEquity'
    
    # Ensure total columns exist before attempting calculations row-wise
    for col in [total_assets_col, total_liabilities_col, total_equity_col, total_liab_equity_col]:
        if col not in result.columns:
            result[col] = pd.NA
    
    # Process row by row to handle NaN values in specific cells
    for idx, row in result.iterrows():
        # Case 1: Compute missing Total Liabilities
        if (pd.notna(row[total_liab_equity_col]) and
            pd.notna(row[total_equity_col]) and
            pd.isna(row[total_liabilities_col])):
            result.at[idx, total_liabilities_col] = (row[total_liab_equity_col] -
                                                    row[total_equity_col])
        
        # Case 2: Compute missing Total Stockholders Equity
        if (pd.notna(row[total_liab_equity_col]) and
            pd.notna(row[total_liabilities_col]) and
            pd.isna(row[total_equity_col])):
            result.at[idx, total_equity_col] = (row[total_liab_equity_col] -
                                              row[total_liabilities_col])
        
        # Case 3: Compute missing Total Liabilities and Equity
        if pd.isna(row[total_liab_equity_col]):
            if pd.notna(row[total_assets_col]):
                # Set Total Liabilities and Equity = Total Assets (accounting equality)
                result.at[idx, total_liab_equity_col] = row[total_assets_col]
            elif (pd.notna(row[total_liabilities_col]) and
                  pd.notna(row[total_equity_col])):
                # Compute Total Liabilities and Equity as sum of components
                result.at[idx, total_liab_equity_col] = (row[total_liabilities_col] +
                                                       row[total_equity_col])
    
    # Add validation columns
    result['Validation - A = L+E Difference'] = (result[total_assets_col] -
                                               result[total_liab_equity_col])
    
    # This check validates if the sum of Liabilities and Equity components equals Total Liabilities and Equity
    if total_liabilities_col in result.columns and total_equity_col in result.columns:
        result['Validation - L+E Components Sum Difference'] = (result[total_liabilities_col] +
                                                              result[total_equity_col] -
                                                              result[total_liab_equity_col])
    else:
        result['Validation - L+E Components Sum Difference'] = pd.NA
    
    return result

# Function to remove columns that only contain NaN values
def remove_nan_only_columns(df):
    """
    Removes columns that contain only NaN values.
    
    Args:
        df: DataFrame to clean
        
    Returns:
        DataFrame with NaN-only columns removed
    """
    nan_cols = df.columns[df.isnull().all()].tolist()
    return df.drop(columns=nan_cols)

# Function to process SP500 companies with debug logging
def process_sp500_companies(sp500_df, data_dir='clean', mappings_dir='.', output_dir='output'):
    """
    Process all S&P 500 companies to generate financial statements.
    
    Args:
        sp500_df (DataFrame): S&P 500 companies data
        data_dir (str): Directory containing company CSV files
        mappings_dir (str): Directory containing mapping Python files
        output_dir (str): Directory to save output files
        
    Returns:
        dict: Debug information including mapping issues
    """
    # Create output directory if it doesn't exist
    os.makedirs(output_dir, exist_ok=True)
    
    # Track success and failures
    results = {
        'processed_successfully': [],
        'failed_processing': [],
        'mapping_issues': []
    }
    
    # Process each company
    for idx, row in sp500_df.iterrows():
        ticker = row['Symbol']
        sector = row['GICS Sector']
        sub_industry = row['GICS Sub-Industry']
        
        # Clear previous entries in the global debug log for this company
        global mapping_debug_log
        mapping_debug_log = []
        
        print(f"\nProcessing {ticker} ({sector} - {sub_industry})...")
        
        # Load company data with specified directory
        df = load_company_data(ticker, data_dir=data_dir)
        if df is None:
            debug_msg = f"Skipping {ticker} - Could not load data."
            print(debug_msg)
            results['failed_processing'].append({
                'ticker': ticker,
                'sector': sector,
                'sub_industry': sub_industry,
                'reason': 'Data load failure'
            })
            continue
        
        # Load industry mapping with specified directory
        industry_mapping = load_industry_mapping(sector, sub_industry, mappings_dir=mappings_dir)
        if industry_mapping is None:
            debug_msg = f"Skipping {ticker} - Could not load industry mapping."
            print(debug_msg)
            results['failed_processing'].append({
                'ticker': ticker,
                'sector': sector,
                'sub_industry': sub_industry,
                'reason': 'Mapping load failure'
            })
            continue
        
        try:
            # Generate balance sheet
            balance_sheet = reconstruct_balance_sheet(df, industry_mapping)
            balance_sheet_path = os.path.join(output_dir, f"{ticker}_balance_sheet.csv")
            balance_sheet.to_csv(balance_sheet_path, index=False)
            
            # Generate income statement
            income_statement = reconstruct_income_statement(df, industry_mapping)
            income_statement_path = os.path.join(output_dir, f"{ticker}_income_statement.csv")
            income_statement.to_csv(income_statement_path, index=False)
            
            # Generate cash flow statement
            cash_flow = reconstruct_cash_flow_statement(df, industry_mapping)
            cash_flow_path = os.path.join(output_dir, f"{ticker}_cash_flow.csv")
            cash_flow.to_csv(cash_flow_path, index=False)
            
            success_msg = f"Successfully generated financial statements for {ticker}"
            print(success_msg)
            
            # Record success
            results['processed_successfully'].append({
                'ticker': ticker,
                'sector': sector,
                'sub_industry': sub_industry
            })
            
            # If we used default mapping, record this as a mapping issue
            if any("No specific mapping found" in log for log in mapping_debug_log):
                results['mapping_issues'].append({
                    'ticker': ticker,
                    'sector': sector,
                    'sub_industry': sub_industry,
                    'debug_logs': mapping_debug_log.copy()
                })
        
        except Exception as e:
            error_msg = f"Error processing {ticker}: {str(e)}"
            print(error_msg)
            results['failed_processing'].append({
                'ticker': ticker,
                'sector': sector,
                'sub_industry': sub_industry,
                'reason': str(e)
            })
            continue
    
    # Save the debug results to a JSON file
    debug_file_path = os.path.join(output_dir, "mapping_debug_results.json")
    import json
    with open(debug_file_path, 'w') as f:
        json.dump(results, f, indent=2)
    
    print(f"\nDebug information saved to: {debug_file_path}")
    return results

# Function to extract and show mapping issues
def show_mapping_issues(results=None):
    """
    Display mapping issues from processing results.
    
    Args:
        results: Results dictionary from process_sp500_companies
        
    Returns:
        DataFrame with mapping issues
    """
    if results is None:
        # Try to load results from the default location
        import json
        try:
            with open("output/mapping_debug_results.json", 'r') as f:
                results = json.load(f)
        except (OSError, json.JSONDecodeError):
            # Missing, unreadable, or malformed results file
            print("No readable results file found. Run process_sp500_companies first.")
            return None
    
    # Extract mapping issues
    issues_data = []
    for issue in results['mapping_issues']:
        for log in issue['debug_logs']:
            if "No specific mapping found" in log:
                issues_data.append({
                    'ticker': issue['ticker'],
                    'sector': issue['sector'],
                    'sub_industry': issue['sub_industry'],
                    'log_message': log
                })
    
    # Convert to DataFrame
    if issues_data:
        import pandas as pd
        issues_df = pd.DataFrame(issues_data)
        return issues_df
    else:
        print("No mapping issues found.")
        return None

# Display formatted balance sheet (optional for visualization)
def display_balance_sheet(balance_sheet, in_billions=True):
    """
    Display balance sheet in a readable format.
    
    Args:
        balance_sheet: DataFrame with balance sheet data
        in_billions: If True, display in billions; otherwise in millions
        
    Returns:
        DataFrame with formatted balance sheet
    """
    # Make a copy to avoid modifying the original
    formatted_bs = balance_sheet.copy()
    
    # Identify numeric columns
    numeric_cols = [col for col in formatted_bs.columns 
                   if any(col.startswith(prefix) for prefix in ['Assets', 'Liabilities', 'Equity', 'Validation'])]
    
    # Convert to billions or millions
    divisor = 1_000_000_000 if in_billions else 1_000_000
    
    for col in numeric_cols:
        # Check if the column is numeric before dividing
        if pd.api.types.is_numeric_dtype(formatted_bs[col]):
            formatted_bs[col] = formatted_bs[col] / divisor
    
    # Format the date column if it exists
    if 'end' in formatted_bs.columns:
        try:
            formatted_bs['end'] = pd.to_datetime(formatted_bs['end']).dt.strftime('%Y-%m-%d')
        except (ValueError, TypeError):
            pass  # Keep original format if conversion fails
    
    # Organize columns by section
    metadata_cols = ['filed', 'company_name', 'end', 'unit', 'form', 'frame', 'cik']
    asset_cols = [col for col in formatted_bs.columns if col.startswith('Assets')]
    liability_cols = [col for col in formatted_bs.columns if col.startswith('Liabilities')]
    equity_cols = [col for col in formatted_bs.columns if col.startswith('Equity')]
    validation_cols = [col for col in formatted_bs.columns if col.startswith('Validation')]
    
    # Create ordered list of columns
    ordered_cols = (
        [col for col in metadata_cols if col in formatted_bs.columns] +
        sorted([col for col in asset_cols if 'Total' not in col]) +
        sorted([col for col in asset_cols if 'Total' in col]) +
        sorted([col for col in liability_cols if 'Total' not in col]) +
        sorted([col for col in liability_cols if 'Total' in col]) +
        sorted([col for col in equity_cols if 'Total' not in col and 'Liabilities and Equity' not in col]) +
        sorted([col for col in equity_cols if 'Total Stockholders Equity' in col]) +
        sorted([col for col in equity_cols if 'Liabilities and Equity' in col]) +
        validation_cols
    )
    
    # Only include columns that actually exist
    final_cols = [col for col in ordered_cols if col in formatted_bs.columns]
    
    return formatted_bs[final_cols]

# Main execution
# Load S&P 500 companies data
sp500 = pd.read_csv('./data/sp500.csv')

# Process all companies with specified directories
process_sp500_companies(
    sp500_df=sp500,
    data_dir='/Users/maseehfaizan/Desktop/Maseeh/Projects/Hybrid_Pricer/data/clean', 
    mappings_dir='./data/XBRL_dic',  
    output_dir='/Users/maseehfaizan/Desktop/Maseeh/Projects/Hybrid_Pricer/data/financial_statement'     
)


Processing MMM (Industrials - Industrial Conglomerates)...
Processing Industrials - Industrial Conglomerates... dict_keys(['AerospaceDefense', 'AgriculturalFarmMachinery', 'AirFreightLogistics', 'BuildingProducts', 'CargoGroundTransportation', 'ConstructionEngineering', 'DataProcessingOutsourcedServices', 'ElectricalComponentsEquipment', 'DiversifiedSupportServices', 'EnvironmentalAndFacilitiesServices', 'HeavyElectricalEquipment', 'HumanResourceAndEmploymentServices', 'IndustrialConglomerates', 'IndustrialMachineryAndSuppliesAndComponents', 'PassengerAirlines', 'PassengerGroundTransportation', 'RailTransportation', 'ResearchAndConsultingServices', 'TradingCompaniesAndDistributors', 'ConstructionMachineryHeavyTransportationEquipment'])
Found mapping for Industrial Conglomerates → IndustrialConglomerates
Successfully generated financial statements for MMM

Processing AOS (Industrials - Building Products)...
Processing Industrials - Building Products... dict_keys(['AerospaceDefense', 'A

KeyboardInterrupt: 
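
The `results` dictionary returned by `process_sp500_companies` holds three lists: `processed_successfully`, `failed_processing` and `mapping_issues`. Below is a minimal sketch of how a run could be summarised before digging into individual tickers. The helper name `summarize_results` and the sample dictionary are hypothetical; only the key names are taken from the function above.

```python
from collections import Counter


def summarize_results(results):
    """Count outcomes and failure reasons in a results dict (hypothetical helper)."""
    failed = results.get('failed_processing', [])
    return {
        'n_success': len(results.get('processed_successfully', [])),
        'n_failed': len(failed),
        'n_mapping_issues': len(results.get('mapping_issues', [])),
        # Tally the 'reason' field recorded for each failed ticker
        'failure_reasons': dict(Counter(item['reason'] for item in failed)),
    }


# Made-up sample that mirrors the structure written to mapping_debug_results.json
sample_results = {
    'processed_successfully': [
        {'ticker': 'MMM', 'sector': 'Industrials', 'sub_industry': 'Industrial Conglomerates'},
    ],
    'failed_processing': [
        {'ticker': 'AAA', 'sector': 'X', 'sub_industry': 'Y', 'reason': 'Data load failure'},
        {'ticker': 'BBB', 'sector': 'X', 'sub_industry': 'Y', 'reason': 'Data load failure'},
        {'ticker': 'CCC', 'sector': 'X', 'sub_industry': 'Z', 'reason': 'Mapping load failure'},
    ],
    'mapping_issues': [],
}

summary = summarize_results(sample_results)
print(summary)
```

Since the run above was interrupted, the same helper could also be applied to whatever was loaded back from `mapping_debug_results.json`, assuming the file was written before the interruption.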

In [17]:
import pandas as pd
import os
from collections import defaultdict

# Dictionary to group files by ticker
ticker_files = defaultdict(list)
folder_path = "./data/financial_statement"

# Get all CSV files in the folder
csv_files = [f for f in os.listdir(folder_path) if f.endswith('.csv')]

# Group files by ticker (part before first underscore), exclude master files
for file in csv_files:
    if '_' in file and not file.endswith('_master.csv'):
        ticker = file.split('_')[0]
        ticker_files[ticker].append(file)

# Process each ticker group
for ticker, files in ticker_files.items():
    print(f"Processing ticker: {ticker}")
    print(f"Files: {files}")
    
    # Read and merge all files for this ticker
    merged_df = None
    
    for file in files:
        file_path = os.path.join(folder_path, file)
        df = pd.read_csv(file_path)
        
        if merged_df is None:
            merged_df = df
        else:
            # Merge on 'frame' column
            merged_df = pd.merge(merged_df, df, on='frame', how='outer')
    
    # Remove metadata columns
    metadata_cols = [col for col in merged_df.columns if any(col.startswith(meta) for meta in ['filed', 'company_name', 'end', 'unit', 'form', 'cik'])]
    merged_df = merged_df.drop(columns=metadata_cols)
    
    # Save merged dataframe
    output_file = os.path.join(folder_path, f"{ticker}_master.csv")
    merged_df.to_csv(output_file, index=False)
    print(f"Saved: {ticker}_master.csv")

Processing ticker: MTCH
Files: ['MTCH_cash_flow.csv', 'MTCH_income_statement.csv', 'MTCH_balance_sheet.csv']
Saved: MTCH_master.csv
Processing ticker: HWM
Files: ['HWM_cash_flow.csv', 'HWM_balance_sheet.csv', 'HWM_income_statement.csv']
Saved: HWM_master.csv
Processing ticker: IRM
Files: ['IRM_cash_flow.csv', 'IRM_balance_sheet.csv', 'IRM_income_statement.csv']
Saved: IRM_master.csv
Processing ticker: AWK
Files: ['AWK_balance_sheet.csv', 'AWK_cash_flow.csv', 'AWK_income_statement.csv']
Saved: AWK_master.csv
Processing ticker: MMM
Files: ['MMM_balance_sheet.csv', 'MMM_cash_flow.csv', 'MMM_income_statement.csv']
Saved: MMM_master.csv
Processing ticker: AMD
Files: ['AMD_income_statement.csv', 'AMD_balance_sheet.csv', 'AMD_cash_flow.csv']
Saved: AMD_master.csv
Processing ticker: ESS
Files: ['ESS_cash_flow.csv', 'ESS_income_statement.csv', 'ESS_balance_sheet.csv']
Saved: ESS_master.csv
Processing ticker: DGX
Files: ['DGX_cash_flow.csv', 'DGX_balance_sheet.csv', 'DGX_income_statement.csv']
S
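
The merge cell above joins the three statement files on `frame` and then drops metadata columns by prefix. The toy example below sketches why prefix matching is used: when both sides of a merge share a non-key column such as `filed`, pandas renames the copies `filed_x` and `filed_y`, so an exact-name drop would miss them. The frames and values here are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the three per-ticker statement files (values are made up)
balance = pd.DataFrame({'frame': ['CY2022', 'CY2023'],
                        'filed': ['2023-02-01', '2024-02-01'],
                        'Assets_Total': [10.0, 12.0]})
income = pd.DataFrame({'frame': ['CY2022', 'CY2023'],
                       'filed': ['2023-02-01', '2024-02-01'],
                       'Revenue': [5.0, 6.0]})
cash = pd.DataFrame({'frame': ['CY2023', 'CY2024'],
                     'filed': ['2024-02-01', '2025-02-01'],
                     'OperatingCashFlow': [2.0, 3.0]})

# Outer merges keep every frame that appears in any of the files
merged = balance.merge(income, on='frame', how='outer').merge(cash, on='frame', how='outer')
print(list(merged.columns))  # includes suffixed copies such as filed_x and filed_y

# str.startswith accepts a tuple, so one call covers all metadata prefixes
meta_prefixes = ('filed', 'company_name', 'end', 'unit', 'form', 'cik')
meta_cols = [col for col in merged.columns if col.startswith(meta_prefixes)]
master = merged.drop(columns=meta_cols)
print(master)
```

One caveat of prefix matching: any real line item whose name happens to begin with `end`, `form` or `unit` would be dropped as well, so it may be worth checking the dropped column list once.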

# mapping.json (KPIs, MOATs and Risk mapping)

In [None]:
import json

# Define the new values to replace all existing values
new_swot_values = ["Strengths", "Weaknesses", "Opportunities", "Threats"]
new_porter_values = [
    "High Threat of New Entrants",
    "Low Threat of New Entrants",
    "High Bargaining Power of Buyers",
    "Low Bargaining Power of Buyers",
    "High Bargaining Power of Suppliers",
    "Low Bargaining Power of Suppliers",
    "High Threat of Substitute Products or Services",
    "Low Threat of Substitute Products or Services",
    "High Intensity of Rivalry",
    "Low Intensity of Rivalry"
]

# Read the JSON file
with open('mapping.json', 'r') as file:
    data = json.load(file)

# Update all SWOT and Porter values
for category, details in data.items():
    # Apply the replacement to every section that carries a mapping
    for section in ('kpis', 'moats', 'risks'):
        for item in details.get(section, []):
            mapping = item.get('mapping', {})

            # Replace SWOT values
            if 'swot' in mapping:
                mapping['swot'] = new_swot_values
            if 'SWOT' in mapping:
                mapping['SWOT'] = new_swot_values

            # Replace Porter values
            if 'porters' in mapping:
                mapping['porters'] = new_porter_values
            if 'Porter' in mapping:
                mapping['Porter'] = new_porter_values

# Overwrite the original file
with open('mapping.json', 'w') as file:
    json.dump(data, file, indent=2)

print("JSON file has been updated with new SWOT and Porter values")

# Verify the changes
print("\nVerifying changes:")
for category, details in data.items():
    # Check every section that carries a mapping, not only 'risks'
    for section in ('kpis', 'moats', 'risks'):
        for item in details.get(section, []):
            mapping = item.get('mapping', {})
            swot = mapping.get('swot') or mapping.get('SWOT')
            porter = mapping.get('porters') or mapping.get('Porter')
            if swot or porter:
                print(f"Category: {category} ({section})")
                print(f"SWOT: {swot}")
                print(f"Porter: {porter}")
                print("-" * 50)
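
Because the cell above overwrites `mapping.json` in place, it can help to try the replacement on an in-memory copy first. The sketch below does that with a hypothetical helper, `replace_frameworks`, and a made-up miniature of the file. The layout (category → section → list of items with a `mapping` dict) is assumed from the cell above, not confirmed against the real file.

```python
import copy

# Section names taken from the comment in the cell above
SECTIONS = ('kpis', 'moats', 'risks')


def replace_frameworks(data, swot_values, porter_values, sections=SECTIONS):
    """Return a deep copy of `data` with SWOT/Porter lists replaced (hypothetical helper)."""
    updated = copy.deepcopy(data)
    for details in updated.values():
        for section in sections:
            for item in details.get(section, []):
                mapping = item.get('mapping', {})
                # Only overwrite keys that already exist, as in the original cell
                for key in ('swot', 'SWOT'):
                    if key in mapping:
                        mapping[key] = list(swot_values)
                for key in ('porters', 'Porter'):
                    if key in mapping:
                        mapping[key] = list(porter_values)
    return updated


# Made-up miniature of mapping.json
sample = {
    'Technology': {
        'kpis': [{'name': 'k1', 'mapping': {'swot': ['old'], 'porters': ['old']}}],
        'risks': [{'name': 'r1', 'mapping': {'SWOT': ['old']}}],
    }
}

updated = replace_frameworks(sample, ['Strengths', 'Weaknesses'], ['High Intensity of Rivalry'])
print(updated['Technology']['kpis'][0]['mapping'])
print(updated['Technology']['risks'][0]['mapping'])
```

Working on a deep copy leaves the loaded data untouched, so the result can be inspected before anything is written back to disk.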

In [None]:
import pandas as pd
f = pd.read_csv('/Users/maseehfaizan/Desktop/Maseeh/Projects/Hybrid_Pricer/data/ticker_csvs/ACN_df_with_gemini_responses.csv')
f.head(20)

Unnamed: 0,ticker,filing_date,section,content,Security,GICS Sector,GICS Sub-Industry,Headquarters Location,Date added,CIK,Founded,question_number,question_formatted,question_prompt,gemini_response
0,ACN,2024-08-31,Item 1. Business,Business 2 Business Overview Accenture is a le...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,9.0,What is Accenture's approach to intellectual p...,"As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""What is Accenture's ..."
1,ACN,2024-08-31,Item 1. Business,Business 2 Business Overview Accenture is a le...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,10.0,Does Accenture disclose any key performance in...,"As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""Does Accenture discl..."
2,ACN,2024-08-31,Item 1A. Risk Factors,Risk Factors 18 Risk Factors In addition to th...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,1.0,What specific risks does Accenture identify re...,"As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""What specific risks ..."
3,ACN,2024-08-31,Item 1A. Risk Factors,Risk Factors 18 Risk Factors In addition to th...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,2.0,What risks does Accenture highlight regarding ...,"As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""What risks does Acce..."
4,ACN,2024-08-31,Item 1A. Risk Factors,Risk Factors 18 Risk Factors In addition to th...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,3.0,Does Accenture discuss risks associated with c...,"As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""Does Accenture discu..."
5,ACN,2024-08-31,Item 1A. Risk Factors,Risk Factors 18 Risk Factors In addition to th...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,4.0,"What cybersecurity, data privacy, or intellect...","As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""What cybersecurity, ..."
6,ACN,2024-08-31,Item 1A. Risk Factors,Risk Factors 18 Risk Factors In addition to th...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,5.0,What risks does Accenture identify related to ...,"As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""What risks does Acce..."
7,ACN,2024-08-31,Item 1A. Risk Factors,Risk Factors 18 Risk Factors In addition to th...,Accenture,Information Technology,IT Consulting & Other Services,"Dublin, Ireland",2011-07-06,1467373,1989,6.0,What risks does Accenture face from intense co...,"As financial analysts, we are extracting finan...","```json\n{\n ""question"": ""What risks does Acce..."
