# Summary statistics

# General information
This notebook computes summary statistics for the matches obtained. 

The expected statistics are: 

- Number of matches of top quality. 
- Number of uncertain matches. 
- Total matches. 
- Firms that didn't get a single match. 
- Total firms in ORBIS. 

These statistics will be disaggregated by: 
- Entity. 
- DENUE's version. 
- Algorithm. 
- Accuracy of the match. 
- Number of workers. 

# Input files
1. **orbis_final:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'` Each row is a firm paired with one of its names, together with its entity, municipality and ORBIS BVDID number.
2. **final_matches:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/final_matches.csv'` Each row is a match between a firm in ORBIS and a firm in DENUE. The file includes the matches obtained with both algorithms and in both versions of DENUE. 

In [1]:
orbis_final_filename = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'
final_matches_filename = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/3-1-5-final_matches.csv'

# Output files
1. **summary_file:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/3-2-summary.xlsx'` This file contains summary statistics, by entity, for all the matches that were kept throughout the process. 

In [2]:
output_file_prefix = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/3-2-summary.xlsx'

# Packages
These packages are needed to run this code. If the machine you are running it on is missing any of them, install it with: 

`!pip install package_name`

- **Pandas** handles importing, wrangling and cleaning the data. 
- **Glob** gets all the files in a directory that share a prefix. 
- **XlsxWriter** is the engine pandas uses below to write the Excel output. It is not imported directly, but it must be installed. 

In [3]:
import glob
import pandas as pd

# Importing the data

In [4]:
orbis_final = pd.read_csv(orbis_final_filename)
final_matches = pd.read_csv(final_matches_filename)

# Selecting the variables of interest for the summary statistics

In [5]:
final_matches = final_matches[['algorithm', 'version', 'entidad_x', 'accuracy', 'n_workers', 'bvdidnumber']].copy()

# Select the strongest match
A firm in ORBIS can have several matches: some are top quality and others uncertain, some have many employees and others only a few. To count each firm once per algorithm, DENUE version and entity, we characterize it by a single match: the one with the largest number of employees and, among the matches tied on that number, the one with the highest quality. 

## Encode the variables

In [6]:
final_matches['accuracy_n'] = final_matches['accuracy'].map({'top': 2, 
                                                             'uncertain': 1})
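
`Series.map` with a dictionary returns `NaN` for any label that is missing from the dictionary, so an unexpected accuracy label would silently sort last in the next step. A minimal sketch of a check, using a made-up toy frame (the label `'medium'` is hypothetical and only there to trigger the check):

```python
import pandas as pd

# Toy data; 'medium' is a hypothetical label that the mapping does not cover.
toy = pd.DataFrame({'accuracy': ['top', 'uncertain', 'medium']})
toy['accuracy_n'] = toy['accuracy'].map({'top': 2, 'uncertain': 1})

# Labels that were not encoded end up as NaN.
unmapped = toy.loc[toy['accuracy_n'].isna(), 'accuracy'].unique().tolist()
print(unmapped)
```

If the list is not empty, the mapping probably needs an extra entry before the matches are sorted.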

## Keep the best match (this is just for the statistics)

In [7]:
final_matches = (final_matches
                 .sort_values(['n_workers', 'accuracy_n'], ascending = False) # sort the matches
                 .drop_duplicates(['algorithm', 'version', 'entidad_x', 'bvdidnumber'], # drop the duplicates
                                  ignore_index = True, keep = 'first')) # keep the strongest one
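
The selection above can be illustrated on a small example. This is only a sketch on made-up data: the identifiers are hypothetical, and `n_workers` is assumed to be a numeric code where a larger value means more employees.

```python
import pandas as pd

# Toy matches for illustration only; identifiers and values are made up.
toy = pd.DataFrame({
    'algorithm':   ['a1', 'a1', 'a1', 'a1'],
    'version':     ['v1', 'v1', 'v1', 'v1'],
    'entidad_x':   ['e1', 'e1', 'e1', 'e1'],
    'bvdidnumber': ['F1', 'F1', 'F1', 'F2'],
    'n_workers':   [1, 3, 3, 2],   # assumed: larger code = more employees
    'accuracy_n':  [2, 1, 2, 1],   # 2 = top, 1 = uncertain
})

# Sort so the largest, then highest-quality, match comes first within each firm,
# then keep only that first row per algorithm, version, entity and firm.
best = (toy
        .sort_values(['n_workers', 'accuracy_n'], ascending=False)
        .drop_duplicates(['algorithm', 'version', 'entidad_x', 'bvdidnumber'],
                         keep='first', ignore_index=True))
print(best[['bvdidnumber', 'n_workers', 'accuracy_n']])
```

Firm `F1` has three candidate matches. The one with `n_workers` 3 and `accuracy_n` 2 is kept, because the number of employees takes priority and quality only breaks ties.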

# Getting the actual statistics

## Calculating them

In [8]:
final_matches_entidad = (final_matches
                         .value_counts(['entidad_x', 'version', 
                                        'algorithm', 'accuracy', 'n_workers'])
                         .reset_index()) # getting disaggregated statistics
final_matches_all = (final_matches
                     .value_counts(['entidad_x', 'version', 'algorithm'])
                     .reset_index()) # statistics of total firms that matched
final_matches_all['accuracy'] = 'all' # adapting the dataframe to merge
final_matches_all['n_workers'] = 999
orbis_final = (orbis_final[['entidad', 'bvdidnumber']]
               .drop_duplicates() # a firm has one row per name, so count each firm once per entity
               .value_counts(['entidad']) # total firms in ORBIS by entity
               .reset_index()
               .rename(columns = {"entidad": "entidad_x"}))
orbis_final['version'] = 'orbis_total' # adapting the dataframe to merge
orbis_final['algorithm'] = 'orbis_total'
orbis_final['accuracy'] = 'orbis_total'
orbis_final['n_workers'] = 999

## Joining the statistics

In [9]:
final_matches = pd.concat([final_matches_entidad, final_matches_all, orbis_final], ignore_index = True)
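
The number of firms without a single match is listed among the expected statistics, but it is not stored explicitly. It can be derived by subtracting the matched firms from the ORBIS total of the same entity. A minimal sketch on made-up counts, assuming both counts refer to distinct firms. The frame names `matched` and `orbis_total` and the column names `count`, `orbis_total` and `unmatched` are hypothetical and chosen only for this example.

```python
import pandas as pd

# Toy counts for illustration; entities and numbers are made up.
matched = pd.DataFrame({
    'entidad_x': ['e1', 'e1', 'e2'],
    'version':   ['v1', 'v2', 'v1'],
    'algorithm': ['a1', 'a1', 'a1'],
    'count':     [40, 35, 10],       # firms with at least one match
})
orbis_total = pd.DataFrame({
    'entidad_x':   ['e1', 'e2'],
    'orbis_total': [50, 25],         # firms in ORBIS per entity
})

# Attach the entity total to every (entity, version, algorithm) row and subtract.
unmatched = matched.merge(orbis_total, on='entidad_x', how='left')
unmatched['unmatched'] = unmatched['orbis_total'] - unmatched['count']
print(unmatched)
```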

# Saving the statistics by entity
This is an Excel file, where each sheet gives the statistics for a specific entity. 

In [10]:
with pd.ExcelWriter(output_file_prefix, engine = 'xlsxwriter') as writer:  
    for group, df in final_matches.groupby('entidad_x'): # by entity; a scalar key yields scalar group labels, not tuples
        df.to_excel(writer, sheet_name = str(group)[:31], index = True) # one sheet per entity (Excel caps sheet names at 31 characters)