# Integration of FLS datasets (input) and real metric values (target)

This notebook integrates the FLS datasets produced in the previous notebook (``21_apply_classifier_to_create_datasets.ipynb``) with the real metric values fetched from the EDGAR database.

For this project, we want to predict company performance for each metric separately. The integrated data is therefore split into one dataset per metric, and each output file is named after the metric's official name on EDGAR; for example, the Earnings Per Share data is saved as ``EarningsPerShareDiluted.csv``.

## Import packages

In [54]:
import os
import pandas as pd

## Integrate data

### Define function to integrate data

This function will perform the following steps:
1. Load the text data and filter for FLS.
2. Rename columns and drop unnecessary ones for consistency with the metric data.
3. Replace metric names to match those in the metric files.
4. For each metric, merge the text data with the corresponding metric data based on CIK, year, and metric.
5. Save the merged data into separate files named after each metric.

In [65]:
def integrate_data(text_dataset, metrics_folder, output_folder):
    """
    Integrates the text dataset with the metric data.
    
    Parameters
    ----------
    text_dataset : str
        Path to the text dataset.
    metrics_folder : str
        Path to the folder containing the metric data.
    output_folder : str
        Path to the folder where the integrated data will be saved.
    """
    # Read in the text data and keep only the sentences labelled as FLS.
    # .copy() returns an independent DataFrame, so the edits below do not
    # raise SettingWithCopyWarning and no global pandas option is changed.
    df = pd.read_csv(text_dataset)
    fls = df[df['Label'] == 'FLS'].copy()

    # Rename columns for consistency with the metric data
    fls = fls.rename(columns={'Metric': 'metric', 'CIK': 'cik', 'Year': 'year',
                              'Sentence': 'text', 'Item': 'item'})
    # Drop columns that are not needed after filtering
    fls = fls.drop(columns=['index', 'Label', 'Company'])
    fls['metric'] = fls['metric'].replace({"Net Income":"Net Income (Loss)",
                                        "EPS":'Diluted Earnings per share',
                                        "Cash Flow (Investing)":"Net Cash from Investing Activities",
                                        "Cash Flow (Financing)":"Net Cash from Financing Activities",
                                        "Cash Flow (Operating)":"Net Cash from Operating Activities",})
    
    # Collect the names of all metric CSV files in the metrics folder
    metric_files = [file for file in os.listdir(metrics_folder) if file.endswith('.csv')]
    
    # Loop through each metric file for integration
    for metric_file in metric_files:
        # Read in the metric file
        metric = pd.read_csv(os.path.join(metrics_folder, metric_file))
        
        # Merge datasets based on CIK number, year and metric
        merged_data = fls.merge(metric, on=['cik', 'year','metric'], how='inner')
        merged_data = merged_data[['text','item', 'cik', 'year', 'val']]
        
        # Save data
        if merged_data.empty:
            print('No data for: ' + metric_file)
        else:
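            # Make sure the output folder exists before writing
            os.makedirs(output_folder, exist_ok=True)
            merged_data.to_csv(os.path.join(output_folder, metric_file), index=False)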
            print('Saved: ' + metric_file)
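
To make the merge step concrete, here is a minimal sketch on toy data. The sentences, CIK numbers, and values are made up for illustration; only the column names and the merge call mirror the function above. An inner merge keeps a sentence only if the metric file has a row with the same `cik`, `year`, and `metric`.

```python
import pandas as pd

# Toy FLS sentences (hypothetical values, same columns as after renaming)
fls = pd.DataFrame({
    "text": ["We expect revenue to grow.",
             "EPS should improve.",
             "Net income may decline."],
    "item": ["7", "7", "1A"],
    "cik": [1001, 1001, 1002],
    "year": [2020, 2020, 2021],
    "metric": ["Revenue", "Diluted Earnings per share", "Net Income (Loss)"],
})

# Toy metric file for one metric (hypothetical values)
metric = pd.DataFrame({
    "cik": [1001, 1002],
    "year": [2020, 2021],
    "metric": ["Diluted Earnings per share", "Diluted Earnings per share"],
    "val": [2.5, 1.1],
})

# Inner merge: only rows matching on all three keys survive
merged = fls.merge(metric, on=["cik", "year", "metric"], how="inner")
merged = merged[["text", "item", "cik", "year", "val"]]
print(merged)
```

In this toy case only the EPS sentence of company 1001 in 2020 is kept; the other two sentences mention metrics that the toy metric file does not cover.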


### Apply the function to integrate and save data to the desired folder

In [66]:
integrate_data('../../data/01_interim/distilbert_dataset.csv', '../../data/00_raw/metric_data/', '../../data/02_processed/distilbert_data/')
integrate_data('../../data/01_interim/finbert_dataset.csv', '../../data/00_raw/metric_data/', '../../data/02_processed/finbert_data/')

No data for: CostOfGoodsAndServicesSold.csv
Saved: EarningsPerShareDiluted.csv
Saved: EBIT.csv
No data for: NetCashProvidedByUsedInContinuingOperations.csv
Saved: NetCashProvidedByUsedInFinancingActivities.csv
Saved: NetCashProvidedByUsedInInvestingActivities.csv
Saved: NetIncomeLoss.csv
Saved: RevenueFromContractWithCustomerExcludingAssessedTax.csv
No data for: SellingGeneralAndAdministrativeExpense.csv
No data for: CostOfGoodsAndServicesSold.csv
Saved: EarningsPerShareDiluted.csv
Saved: EBIT.csv
No data for: NetCashProvidedByUsedInContinuingOperations.csv
Saved: NetCashProvidedByUsedInFinancingActivities.csv
Saved: NetCashProvidedByUsedInInvestingActivities.csv
Saved: NetIncomeLoss.csv
Saved: RevenueFromContractWithCustomerExcludingAssessedTax.csv
No data for: SellingGeneralAndAdministrativeExpense.csv
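
A "No data for" message means the inner merge returned no rows for that metric file. One possible cause, sketched below on toy data, is that the `metric` labels in the text dataset and in the metric file do not match exactly; whether this is the actual reason for the three files above has not been checked here. The labels and values in the sketch are hypothetical.

```python
import pandas as pd

# Toy text data and metric data with slightly different metric labels
fls = pd.DataFrame({
    "cik": [1001, 1002],
    "year": [2020, 2020],
    "metric": ["Cost of Goods Sold", "Net Income (Loss)"],
    "text": ["sentence a", "sentence b"],
})
metric = pd.DataFrame({
    "cik": [1001],
    "year": [2020],
    "metric": ["Cost of Goods and Services Sold"],
    "val": [10.0],
})

# Labels used in the metric file that never occur in the text data
missing = set(metric["metric"]) - set(fls["metric"])
print(missing)

# A left merge with indicator=True shows which sentences found a match
check = fls.merge(metric, on=["cik", "year", "metric"],
                  how="left", indicator=True)
print(check["_merge"].value_counts())
```

Rows marked `left_only` are sentences without a matching metric value; if every row is `left_only`, the inner merge used in `integrate_data` is empty.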


## Next step

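After integration, each FLS dataset (DistilBERT and FinBERT) yields 6 datasets, one for each metric with matching data, for 12 datasets in total. Three metric files (``CostOfGoodsAndServicesSold.csv``, ``NetCashProvidedByUsedInContinuingOperations.csv`` and ``SellingGeneralAndAdministrativeExpense.csv``) produced no matching rows and were therefore not saved.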
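
As a quick sanity check, the saved files can be listed together with their row counts. This is a minimal sketch: `summarize_outputs` is a hypothetical helper, not part of the project, and the loop only runs for output folders that exist on disk.

```python
import os
import pandas as pd

def summarize_outputs(folder):
    """Return {file name: row count} for every CSV in `folder` (hypothetical helper)."""
    counts = {}
    for name in sorted(os.listdir(folder)):
        if name.endswith(".csv"):
            counts[name] = len(pd.read_csv(os.path.join(folder, name)))
    return counts

# Output folders used in the calls above; skipped if they do not exist
for folder in ("../../data/02_processed/distilbert_data/",
               "../../data/02_processed/finbert_data/"):
    if os.path.isdir(folder):
        print(folder, summarize_outputs(folder))
```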

The last remaining step is to clean these datasets into a form that can be used directly for the performance prediction task. This is covered in the next notebook, ``23_clean_data``.