# RapidFuzz
## Controlling ORBIS by entity and municipality, against the entire DENUE data set

# General information
This notebook matches the firm names in `orbis_final` against those in `denue_final`. Its main steps are: 

- Splitting `orbis_final` by entity and municipality (`denue_final` is used in full). 
- Extracting the company names. 
- Scoring every ORBIS-DENUE name pair with RapidFuzz. 
- Extracting the results. 
- Labeling the results. 
- Exporting them to a comma-separated values (CSV) file. 

After this, we have the possible name matches in `denue_final` for each company in `orbis_final`. 

# Input files
1. **orbis_final:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'` Each row of this data set is a firm with one of its names, along with its entity, municipality and ORBIS BVDID number.
2. **denue_final:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_final.csv'` Each row of this data set is a firm with one of its names, along with its number of workers, entity, municipality and DENUE key.

In [1]:
orbis_final_file = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'
denue_final_file = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_final.csv'

# Output files
1. **output_file_prefix:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/orbis_entity_municipality_denue/denue/rapidfuzz/matches_tf_idf_orbis_'` Each file with this prefix contains the possible matches for one entity and municipality combination.
2. **final_output:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/matches_rapidfuzz_orbis_entity_municipality_denue.csv'` This file is the concatenation of all the `output_file_prefix` files. 
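
As a rough illustration of the naming scheme, each per-group file name can be thought of as the prefix followed by the entity and municipality values. The helper `group_file_name` and the example prefix and values below are hypothetical; only the `_denue.csv` suffix is taken from the matching loop in this notebook.

```python
def group_file_name(prefix, group):
    """Build a per-group file name: prefix + entity + '_' + municipality + '_denue.csv'.

    `group` is the (entity, municipality) tuple of one combination.
    Each part is cast to str so numeric codes do not break the concatenation.
    """
    return prefix + "_".join(str(part) for part in group) + "_denue.csv"


# Hypothetical prefix and group values, for illustration only.
example = group_file_name("/tmp/matches_demo_", ("JALISCO", "ZAPOPAN"))
print(example)
```

Casting each part to `str` is a defensive choice of this sketch; it matters only if the entity or municipality columns hold numeric codes.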

In [2]:
output_file_prefix = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/orbis_entity_municipality_denue/denue/rapidfuzz/matches_tf_idf_orbis_'
final_output = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/matches_rapidfuzz_orbis_entity_municipality_denue.csv'

# Packages
These are the packages needed to run this code. If the machine you're running it on is missing any of them, run this command: 

`!pip install package_name`
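
To find out which packages are missing before installing anything, a small check along these lines may help. It is only a sketch: the `REQUIRED` list is an assumption based on the import cell of this notebook, and `missing_packages` is a hypothetical helper name.

```python
from importlib.util import find_spec

# Top-level import names used by this notebook (assumed from its import cell).
REQUIRED = ["pandas", "numpy", "dask", "rapidfuzz"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```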

- **Pandas** handles importing, wrangling and cleaning the data. 
- **Numpy** is needed in order to declare missing values. 
- **Glob** gets all the files in a directory that share a prefix. 
- **Dask** splits dataframes into partitions so that large merges can run in parallel. 
- **Rapidfuzz** computes fuzzy string similarity scores; we'll use `fuzz.token_sort_ratio` to compare firm names. 

In [3]:
import pandas as pd
import numpy as np
import glob
import dask.dataframe as dd
from rapidfuzz import fuzz


# Importing the data

In [4]:
orbis_final = pd.read_csv(orbis_final_file)
denue_final = pd.read_csv(denue_final_file)

# Declaring options
To keep the notebook easy to customize, we group all the variables, options and arguments we might want to change in the future. We do it in one cell, but feel free to split it into as many cells as you want. 

In [5]:
# base_df
base_data = orbis_final
category1 = 'entidad' # control 1 for groupby
category2 = 'municipio' # control 2 for groupby
# matching_df
matching_data = denue_final
base_names_variable = 'companyname' # column with the firm names in base_data
matching_names_variable = 'firm' # column with the firm names in matching_data
# extracting the results
## top results
top_results_lower_bound = 0.95 # lowest score accepted for top quality matches
## uncertain results
uncertain_results_lower_bound = 0.75 # lowest score accepted for uncertain quality matches
uncertain_results_upper_bound = 0.95 # highest score accepted for uncertain quality matches
n_uncertain_results = 5 # top n results for uncertain matches
cores = 55 # subject to the number specified in the batch options
##################################
### don't change anything below###
##################################
categories = [category1, category2]
matching_df = (matching_data
                   .copy()
                   .dropna(subset = [matching_names_variable]) # drop observations without variable name (firm name)
                   .drop_duplicates(ignore_index = True)) # drop possible duplicates

# Matching algorithm
The whole algorithm runs inside a single loop, so it can't be split into separate cells; instead, each section is marked with `#` comment banners. 
The main sections are: 

- Filter the base data set for the current entity and municipality. 
- Convert both data sets to Dask dataframes. 
- Build all combinations of base and matching names. 
- Score each pair and extract the results. 
    - Top results. 
    - Uncertain results. 
- Save the matches. 
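
The steps above can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the notebook's actual run: the data frames are invented, plain pandas replaces Dask, and `difflib.SequenceMatcher` stands in for `fuzz.token_sort_ratio` so that the sketch needs no extra packages. The stand-in follows the same idea (sort the tokens of each name, then compare the sorted strings), but its scores may differ from RapidFuzz's. Treating a score of exactly 0.95 as a top match is also a choice of this sketch.

```python
from difflib import SequenceMatcher

import pandas as pd


def token_sort_score(a, b):
    """Sort the tokens of each name, then compare the sorted strings (0 to 1)."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, a_sorted, b_sorted).ratio()


# Hypothetical toy data; the column names mirror the options cell.
base = pd.DataFrame({"bvdidnumber": ["MX1", "MX2"],
                     "companyname": ["ACME SA DE CV", "GRUPO DELTA"]})
matching = pd.DataFrame({"firm": ["SA DE CV ACME", "GRUPO DELTA NORTE",
                                  "PANADERIA LUPITA"]})

# A constant key pairs every base row with every matching row.
pairs = (base.assign(key=0)
         .merge(matching.assign(key=0), on="key")
         .drop(columns="key"))
pairs["score"] = [token_sort_score(a, b)
                  for a, b in zip(pairs["companyname"], pairs["firm"])]

# Top results: scores at or above the top bound.
top = pairs[pairs["score"] >= 0.95].assign(accuracy="top")

# Uncertain results: the n best scores per base firm inside the uncertain band.
uncertain = (pairs[(pairs["score"] > 0.75) & (pairs["score"] < 0.95)]
             .sort_values("score", ascending=False)
             .groupby("bvdidnumber").head(5)
             .assign(accuracy="uncertain"))

labeled = pd.concat([top, uncertain], ignore_index=True)
print(labeled)
```

Because the merge builds every pair, the number of rows is the product of the two data set sizes, which is why the real loop works one entity and municipality combination at a time.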

In [7]:
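n_groups = base_data.groupby(by = categories).ngroups # number of entity-municipality combinations
for i, (base_group, base_df) in enumerate(base_data.groupby(by = categories), start = 1): 
    print(f'Matching {base_group}, combination {i} out of {n_groups}')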
    #########################
    ###Filter the datasets###
    #########################
    base_df = (base_df.copy()
               .dropna(subset = [base_names_variable]) # drop observations without variable name (firm name)
               .drop_duplicates(ignore_index = True)) # drop possible duplicates
    ###################################
    ###Converting to Dask Dataframes###
    ###################################
    # use separate names so the pandas `matching_df` is not overwritten between iterations
    base_ddf = dd.from_pandas(base_df, chunksize = 50)
    matching_ddf = dd.from_pandas(matching_df, chunksize = 500)
    ########################################
    ###All combinations of the dataframes###
    ########################################
    base_ddf = base_ddf.assign(key = 0) # set key to match on
    matching_ddf = matching_ddf.assign(key = 0) # set key to match on
    matches = dd.merge(base_ddf, matching_ddf, suffixes=('_x', '_y'), on = "key", 
                       how = 'outer', shuffle = 'tasks').compute() # tasks: to use distributed computations on all nodes
    ############################
    ###Extracting the results###
    ############################
    matches['score'] = matches.apply(lambda x: 
                                     fuzz.token_sort_ratio(x[base_names_variable], 
                                                                     x[matching_names_variable])/100, 
                                     axis = 1) # apply algorithm
    ############
    #Top results
    ############
    certain_matches = matches[matches['score'] >= top_results_lower_bound].copy() # >= so a score equal to the bound is not dropped
    certain_matches['accuracy'] = 'top' # tag them as top results
    
    ##################
    #Uncertain results
    ##################
    uncertain_matches = (matches[(matches['score'] < uncertain_results_upper_bound) 
                                 & (matches['score'] > uncertain_results_lower_bound)] # select the matches below upper bound
                         .copy() 
                         .sort_values(['score'], ascending = False) # sort values descending
                         .groupby('bvdidnumber').head(n_uncertain_results)) # group by bvdidnumber, then get the n best matches
    uncertain_matches['accuracy'] = 'uncertain' # tag them as uncertain results
    
    ########################
    ###Saving the matches###
    ########################
    
    matches = pd.concat([certain_matches, uncertain_matches], ignore_index = True) # join certain with uncertain matches
    file_name = output_file_prefix + str(base_group[0]) + '_' + str(base_group[1]) + '_denue' + '.csv' # create file name
    matches.drop_duplicates(ignore_index = True).to_csv(file_name, index = False) # remove duplicates and save

# Concatenate and save results
First, we concatenate all the files with the prefix `output_file_prefix`. Then we label them by algorithm and by DENUE's geographical selection. Finally, we drop duplicate matches and save the result. 
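
The same pattern can be tried on throwaway files before running it on the real output. The sketch below is only an illustration: the temporary directory, the `matches_demo_` prefix and the rows are all made up.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "matches_demo_")  # hypothetical prefix

    # Two per-group files that share one identical row.
    shared = {"companyname": "ACME", "firm": "ACME SA", "score": 0.96}
    other = {"companyname": "DELTA", "firm": "GRUPO DELTA", "score": 0.8}
    pd.DataFrame([shared]).to_csv(prefix + "a_denue.csv", index=False)
    pd.DataFrame([shared, other]).to_csv(prefix + "b_denue.csv", index=False)

    # Collect every file that starts with the prefix and stack them.
    files = sorted(glob.glob(prefix + "*.csv"))
    joint = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

joint["algorithm"] = "rapidfuzz"  # label the algorithm
joint = joint.drop_duplicates(ignore_index=True)  # the shared row is kept once
print(joint)
```

Two things to keep in mind with this pattern: `glob` also picks up files left over from earlier runs with the same prefix, and `pd.concat` raises an error when the list of files is empty.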

In [8]:
joint_matches = pd.concat([pd.read_csv(f) for f in 
                           glob.glob(output_file_prefix +'*.csv')], ignore_index=True) # concatenate the results
joint_matches['algorithm'] = 'rapidfuzz' # label the algorithm
joint_matches['selection'] = 'orbis_entity_municipality_denue' # label the database selection
joint_matches.drop_duplicates(ignore_index = True).to_csv(final_output, index = False) # drop duplicates and save 