# Rapid Fuzz
## Controlling ORBIS by entity and municipality and DENUE by entity and municipality

# General information
This notebook performs the following steps on the firm names in `denue_final` and `orbis_final`: 

- Filtering the desired geographical zones in both data sets. 
- Extracting the company names. 
- Scoring the candidate name pairs with RapidFuzz. 
- Extracting the results. 
- Labeling the results. 
- Exporting them to a Comma Separated Values file. 

The result is the set of possible name matches in `denue_final` for each company in `orbis_final`. 
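
To give a sense of what a match score looks like, here is a minimal pure-Python sketch of a token-sort similarity, the kind of score this notebook later obtains from RapidFuzz's `fuzz.token_sort_ratio`. The helper names `lcs_length` and `token_sort_score` are hypothetical, and the sketch assumes the score is the normalized insertion/deletion similarity of the two token-sorted strings; RapidFuzz's own implementation and preprocessing defaults may differ by version.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def token_sort_score(name_a: str, name_b: str) -> float:
    """Similarity in [0, 1] after sorting the whitespace-separated tokens."""
    sorted_a = " ".join(sorted(name_a.split()))
    sorted_b = " ".join(sorted(name_b.split()))
    total_length = len(sorted_a) + len(sorted_b)
    if total_length == 0:
        return 1.0  # two empty names are treated as identical
    return 2 * lcs_length(sorted_a, sorted_b) / total_length


# Word order does not matter once the tokens are sorted.
same_words = token_sort_score("acme de mexico", "mexico acme de")
# A different name scores lower.
different = token_sort_score("acme de mexico", "acme del norte")
print(same_words, different)
```

Sorting the tokens first is what makes the score insensitive to word order, which is useful when the two sources write the same firm name with its words in a different sequence.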

# Input files
1. **orbis_final:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'` Each row of this file represents a firm under one of its names, together with its entity, municipality and ORBIS BvD ID number.
2. **denue_final_alternative:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_final_alternative.csv'` Each row of this file represents a firm under one of its names, together with its number of workers, entity, municipality and DENUE key. 

In [1]:
orbis_final_file = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'
denue_final_alternative_file = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_final_alternative.csv'

# Output files
1. **output_file_prefix:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/both_entity_municipality/denue_alternative/rapidfuzz/matches_rapidfuzz_both_'` One file is written per entity and municipality pair, each containing the possible matches for that pair.
2. **final_output:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/2-2-2-1-matches_rapidfuzz_both_entity_municipality_alternative.csv'` This file is a concatenation of all the `output_file_prefix` files. 

In [2]:
output_file_prefix = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/both_entity_municipality/denue_alternative/rapidfuzz/matches_rapidfuzz_both_'
final_output = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/2-2-2-1-matches_rapidfuzz_both_entity_municipality_alternative.csv'

# Packages
These are the packages needed to run this code. If the machine you are running it on lacks any of them, install it with: 

`!pip install package_name`

- **Pandas** handles importing, wrangling and cleaning the data. 
- **Numpy** is needed in order to declare missing values. 
- **Glob** collects all the files in a directory that share a prefix. 
- **Dask** (`dask.dataframe`) splits the data frames into partitions so that the pairwise merge can be computed in parallel. 
- **RapidFuzz** provides the string similarity score (`fuzz.token_sort_ratio`) used to compare firm names. 
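
Before installing anything, it can help to see which packages are actually missing. The following is a minimal sketch using only the standard library; the list `required_packages` is an assumption based on the import cell of this notebook (`glob` is part of the standard library and needs no installation).

```python
import importlib.util

# Import names of the third-party packages this notebook imports (assumed list).
required_packages = ["pandas", "numpy", "dask", "rapidfuzz"]

# find_spec returns None when a top-level package cannot be found.
missing = [name for name in required_packages
           if importlib.util.find_spec(name) is None]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Each missing name can then be installed with the command above. Depending on the Dask version, the data frame module may need the `dask[dataframe]` extra rather than plain `dask`.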

In [3]:
import pandas as pd
import numpy as np
import glob
from rapidfuzz import fuzz
import dask.dataframe as dd


# Importing the data

In [4]:
orbis_final = pd.read_csv(orbis_final_file)
denue_final = pd.read_csv(denue_final_alternative_file)

# Declaring options
To make the notebook easy to customize, we group all the variables, options and arguments we might want to change in the future. We do it in one cell, but feel free to split it into as many cells as you want. 

In [5]:
# base_df
base_data = orbis_final
category1 = 'entidad' # control 1 for groupby
category2 = 'municipio' # control 2 for groupby
# matching_df
matching_data = denue_final
base_names_variable = 'companyname' # column with the firm names in base_data
matching_names_variable = 'firm' # column with the firm names in matching_data
# extracting the results
## top results
top_results_lower_bound = 0.95 # lowest score accepted for top quality matches
## uncertain results
uncertain_results_lower_bound = 0.9 # lowest score accepted for uncertain quality matches
uncertain_results_upper_bound = 0.95 # highest score accepted for uncertain quality matches
n_uncertain_results = 5 # top n results for uncertain matches
##################################
### don't change anything below###
##################################
categories = [category1, category2]

# Selecting the inner set of categories in `denue_final` and `orbis_final`
- Get the unique category tuples that appear in both datasets. 
- Keep the observations in each dataset that match the unique categories. 

In [6]:
coincident_categories = (base_data[categories]
                         .drop_duplicates() # select the unique controlling variable combinations in base_data
                         .merge(matching_data[categories]
                                .drop_duplicates(), # select the unique controlling variable combinations in matching_data
                                how = 'inner', left_on = categories, 
                                right_on = categories)) # merge 'inner' to get the pairs in both data sets

base_data = (base_data.merge(coincident_categories, how = 'inner', # keep the observations that
                             left_on = categories, right_on = categories)) #  match the inner unique categories
matching_data = matching_data.merge(coincident_categories, how = 'inner', # keep the observations that
                                    left_on = categories, right_on = categories) #  match the inner unique categories
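
On toy data, the filter above behaves as follows. This is a minimal sketch: the rows are made up, and only the column names (`entidad`, `municipio`, `companyname`, `firm`) are taken from this notebook.

```python
import pandas as pd

categories = ["entidad", "municipio"]

# Toy stand-ins for orbis_final and denue_final (made-up rows).
base_data = pd.DataFrame({
    "entidad": ["jalisco", "jalisco", "puebla"],
    "municipio": ["guadalajara", "chapala", "puebla"],
    "companyname": ["acme", "beta", "gamma"],
})
matching_data = pd.DataFrame({
    "entidad": ["jalisco", "puebla", "yucatan"],
    "municipio": ["guadalajara", "puebla", "merida"],
    "firm": ["acme sa", "gama", "delta"],
})

# Unique (entidad, municipio) pairs present in both data sets.
coincident_categories = (base_data[categories].drop_duplicates()
                         .merge(matching_data[categories].drop_duplicates(),
                                how="inner", on=categories))

# Keep only the rows whose pair appears in both data sets.
base_kept = base_data.merge(coincident_categories, how="inner", on=categories)
matching_kept = matching_data.merge(coincident_categories, how="inner", on=categories)

print(base_kept["companyname"].tolist())
print(matching_kept["firm"].tolist())
```

Only the pairs (jalisco, guadalajara) and (puebla, puebla) occur in both toy frames, so the firm in chapala and the firm in merida are dropped before any matching is attempted.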

# Matching algorithm
Because the whole procedure runs inside a single loop, it cannot be split across separate cells, so each step is marked with `#` comment banners instead. 
The main sections are: 

- Filter the datasets with their corresponding categories. 
- Get the company names. 
- Score the name pairs. 
- Extract the results. 
    - Top results. 
    - Uncertain results. 
- Save the matches. 
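
The pairing and scoring steps listed above can be sketched on toy data with plain pandas. This is an illustration under stated assumptions, not the notebook's method: the rows and `bvdidnumber` values are made up, the merge runs in pandas rather than Dask, and `stand_in_score` is a hypothetical scorer built on the standard library's `difflib`, used here in place of RapidFuzz.

```python
import difflib

import pandas as pd

# Made-up rows for one (entidad, municipio) group.
base_df = pd.DataFrame({"bvdidnumber": ["MX001", "MX002"],
                        "companyname": ["acme de mexico", "beta industrial"]})
matching_df = pd.DataFrame({"firm": ["mexico acme de", "beta industrias", "gamma"]})


def stand_in_score(name_a, name_b):
    """Stand-in scorer (difflib), NOT RapidFuzz: compares token-sorted names."""
    sorted_a = " ".join(sorted(name_a.split()))
    sorted_b = " ".join(sorted(name_b.split()))
    return difflib.SequenceMatcher(None, sorted_a, sorted_b).ratio()


# A constant key pairs every base row with every matching row.
pairs = (base_df.assign(key=0)
         .merge(matching_df.assign(key=0), on="key", how="outer")
         .drop(columns="key"))

# Score each candidate pair of names.
pairs["score"] = pairs.apply(
    lambda row: stand_in_score(row["companyname"], row["firm"]), axis=1)

print(len(pairs))  # 2 base rows x 3 matching rows
print(pairs.sort_values("score", ascending=False).head(3))
```

The number of candidate pairs is the product of the two group sizes, which is why the notebook restricts the comparison to firms in the same entity and municipality.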

In [7]:
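n_groups = base_data.groupby(by = categories).ngroups # number of category combinations to process
for i, (base_group, base_df) in enumerate(base_data.groupby(by = categories), start = 1): 
    print(f'Matching {base_group}, combination {i} out of {n_groups}')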
    #########################
    ###Filter the datasets###
    #########################
    base_df = (base_df.copy()
               .dropna(subset = [base_names_variable]) # drop observations without variable name (firm name)
               .drop_duplicates(ignore_index=True)) # drop possible duplicates
    matching_df = (matching_data[(matching_data[category1] == base_group[0]) # filter the matching dataset
                                 & (matching_data[category2] == base_group[1])] # to match a category tuple
                   .copy() 
                   .dropna(subset = [matching_names_variable]) # drop observations without variable name (firm name)
                   .drop_duplicates(ignore_index=True)) # drop possible duplicates
    ###################################
    ###Converting to Dask Dataframes###
    ###################################
    base_df = dd.from_pandas(base_df, chunksize = 50)
    matching_df = dd.from_pandas(matching_df, chunksize = 500)
    ########################################
    ###All combinations of the dataframes###
    ########################################
    base_df = base_df.assign(key = 0) # set key to match on
    matching_df = matching_df.assign(key = 0)# set key to match on
    matches = dd.merge(base_df, matching_df, suffixes=('_x', '_y'), on = "key", 
                       how = 'outer', shuffle = 'tasks').compute() # tasks: to use distributed computations on all nodes
    ############################
    ###Extracting the results###
    ############################
    matches['score'] = matches.apply(lambda x: 
                                     fuzz.token_sort_ratio(x[base_names_variable], 
                                                                     x[matching_names_variable])/100, 
                                     axis = 1) # apply algorithm
    ############
    #Top results
    ############
    certain_matches = matches[matches['score'] >= top_results_lower_bound].copy() # keep scores at or above the lower bound
    certain_matches['accuracy'] = 'top' # tag them as top results
    
    ##################
    #Uncertain results
    ##################
    uncertain_matches = (matches[(matches['score'] < uncertain_results_upper_bound) 
                                 & (matches['score'] >= uncertain_results_lower_bound)] # select the matches between the two bounds
                         .copy() 
                         .sort_values(['score'], ascending = False) # sort values descending
                         .groupby('bvdidnumber').head(n_uncertain_results)) # group by bvdidnumber, then get the n best matches
    uncertain_matches['accuracy'] = 'uncertain' # tag them as uncertain results
    
    ########################
    ###Saving the matches###
    ########################
    
    matches = pd.concat([certain_matches, uncertain_matches], ignore_index = True) # concatenate certain with uncertain matches
    file_name = output_file_prefix + base_group[0] + '_' + base_group[1] + '_alternative' + '.csv' # create file name
    matches.drop_duplicates(ignore_index = True).to_csv(file_name, index = False) # remove duplicates and save

Matching ('aguascalientes', 'aguascalientes'), combination 1 of out 389
Matching ('aguascalientes', 'el_llano'), combination 2 of out 389
Matching ('aguascalientes', 'jesus_maria'), combination 3 of out 389
Matching ('aguascalientes', 'rincon_de_romos'), combination 4 of out 389
Matching ('aguascalientes', 'san_francisco_de_los_romo'), combination 5 of out 389
Matching ('baja_california', 'ensenada'), combination 6 of out 389
Matching ('baja_california', 'mexicali'), combination 7 of out 389
Matching ('baja_california', 'playas_de_rosarito'), combination 8 of out 389
Matching ('baja_california', 'tecate'), combination 9 of out 389
Matching ('baja_california', 'tijuana'), combination 10 of out 389
Matching ('baja_california_sur', 'la_paz'), combination 11 of out 389
Matching ('baja_california_sur', 'los_cabos'), combination 12 of out 389
Matching ('baja_california_sur', 'mulege'), combination 13 of out 389
Matching ('campeche', 'campeche'), combination 14 of out 389
Matching ('campeche'

Matching ('jalisco', 'arandas'), combination 122 of out 389
Matching ('jalisco', 'atengo'), combination 123 of out 389
Matching ('jalisco', 'atotonilco_el_alto'), combination 124 of out 389
Matching ('jalisco', 'autlan_de_navarro'), combination 125 of out 389
Matching ('jalisco', 'canadas_de_obregon'), combination 126 of out 389
Matching ('jalisco', 'chapala'), combination 127 of out 389
Matching ('jalisco', 'cocula'), combination 128 of out 389
Matching ('jalisco', 'degollado'), combination 129 of out 389
Matching ('jalisco', 'el_salto'), combination 130 of out 389
Matching ('jalisco', 'guadalajara'), combination 131 of out 389
Matching ('jalisco', 'ixtlahuacan_de_los_membrillos'), combination 132 of out 389
Matching ('jalisco', 'jamay'), combination 133 of out 389
Matching ('jalisco', 'jesus_maria'), combination 134 of out 389
Matching ('jalisco', 'lagos_de_moreno'), combination 135 of out 389
Matching ('jalisco', 'ocotlan'), combination 136 of out 389
Matching ('jalisco', 'ojuelos_d

Matching ('oaxaca', 'santa_cruz_xoxocotlan'), combination 247 of out 389
Matching ('oaxaca', 'santa_lucia_del_camino'), combination 248 of out 389
Matching ('oaxaca', 'santa_maria_huatulco'), combination 249 of out 389
Matching ('oaxaca', 'teotitlan_del_valle'), combination 250 of out 389
Matching ('puebla', 'amozoc'), combination 251 of out 389
Matching ('puebla', 'atzala'), combination 252 of out 389
Matching ('puebla', 'coronango'), combination 253 of out 389
Matching ('puebla', 'cuautlancingo'), combination 254 of out 389
Matching ('puebla', 'huejotzingo'), combination 255 of out 389
Matching ('puebla', 'jalpan'), combination 256 of out 389
Matching ('puebla', 'puebla'), combination 257 of out 389
Matching ('puebla', 'quecholac'), combination 258 of out 389
Matching ('puebla', 'san_andres_cholula'), combination 259 of out 389
Matching ('puebla', 'san_gregorio_atzompa'), combination 260 of out 389
Matching ('puebla', 'san_miguel_xoxtla'), combination 261 of out 389
Matching ('puebla

Matching ('veracruz_de_ignacio_de_la_llave', 'pueblo_viejo'), combination 366 of out 389
Matching ('veracruz_de_ignacio_de_la_llave', 'san_juan_evangelista'), combination 367 of out 389
Matching ('veracruz_de_ignacio_de_la_llave', 'tierra_blanca'), combination 368 of out 389
Matching ('veracruz_de_ignacio_de_la_llave', 'tihuatlan'), combination 369 of out 389
Matching ('veracruz_de_ignacio_de_la_llave', 'tuxpan'), combination 370 of out 389
Matching ('veracruz_de_ignacio_de_la_llave', 'veracruz'), combination 371 of out 389
Matching ('veracruz_de_ignacio_de_la_llave', 'xalapa'), combination 372 of out 389
Matching ('veracruz_de_ignacio_de_la_llave', 'yanga'), combination 373 of out 389
Matching ('yucatan', 'kanasin'), combination 374 of out 389
Matching ('yucatan', 'merida'), combination 375 of out 389
Matching ('yucatan', 'progreso'), combination 376 of out 389
Matching ('yucatan', 'ticul'), combination 377 of out 389
Matching ('yucatan', 'tixkokob'), combination 378 of out 389
Matchi
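
The labeling step inside the loop can be illustrated on made-up scores. This sketch treats both lower bounds as inclusive, which is an assumption about the intended behavior; the `bvdidnumber` values and firm names are invented, and `n_uncertain_results` is set to 2 instead of 5 to keep the example short.

```python
import pandas as pd

top_results_lower_bound = 0.95
uncertain_results_lower_bound = 0.9
uncertain_results_upper_bound = 0.95
n_uncertain_results = 2  # smaller than the notebook's 5, for a short example

# Made-up scored pairs for two hypothetical ORBIS firms.
matches = pd.DataFrame({
    "bvdidnumber": ["MX001", "MX001", "MX001", "MX001", "MX002"],
    "firm": ["a", "b", "c", "d", "e"],
    "score": [0.97, 0.94, 0.93, 0.91, 0.50],
})

# Top results: scores at or above the top lower bound (assumed inclusive).
certain = matches[matches["score"] >= top_results_lower_bound].copy()
certain["accuracy"] = "top"

# Uncertain results: scores between the two bounds, best n per ORBIS firm.
in_band = ((matches["score"] < uncertain_results_upper_bound)
           & (matches["score"] >= uncertain_results_lower_bound))
uncertain = (matches[in_band]
             .sort_values("score", ascending=False)
             .groupby("bvdidnumber").head(n_uncertain_results)
             .copy())
uncertain["accuracy"] = "uncertain"

labeled = pd.concat([certain, uncertain], ignore_index=True)
print(labeled)
```

Firm `d` scores inside the uncertain band but is cut by the per-firm limit, and firm `e` falls below both bounds, so neither is saved.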

# Concatenate and save results
First, we concatenate all the files that share the prefix `output_file_prefix`. Then we label the rows with the algorithm and DENUE's geographical selection. Finally, we drop duplicate matches and save the result. 

In [8]:
joint_matches = pd.concat([pd.read_csv(f) for f in 
                           glob.glob(output_file_prefix +'*.csv')], ignore_index=True) # concatenate the results
joint_matches['algorithm'] = 'rapidfuzz' # label the algorithm
joint_matches['selection'] = 'both_entity_municipality_alternative' # label the database selection
joint_matches.drop_duplicates(ignore_index = True).to_csv(final_output, index = False) # drop duplicates and save 
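
The same concatenation pattern can be tried safely in a temporary directory. This is a minimal sketch with made-up files and rows; the prefix and file names are hypothetical and only mimic the naming used in this notebook.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "matches_rapidfuzz_both_")  # hypothetical prefix

    # Two made-up per-municipality files that share one identical row.
    first = pd.DataFrame({"companyname": ["acme"], "firm": ["acme sa"],
                          "score": [0.96]})
    second = pd.DataFrame({"companyname": ["acme", "beta"],
                           "firm": ["acme sa", "beta sc"],
                           "score": [0.96, 0.92]})
    first.to_csv(prefix + "jalisco_guadalajara_alternative.csv", index=False)
    second.to_csv(prefix + "puebla_puebla_alternative.csv", index=False)

    # Collect every file that starts with the prefix and stack them.
    files = sorted(glob.glob(prefix + "*.csv"))  # sorted for a reproducible order
    joint = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

joint["algorithm"] = "rapidfuzz"  # label the algorithm
joint = joint.drop_duplicates(ignore_index=True)  # drop exact duplicate rows
print(len(joint))
```

Note that `glob.glob` returns files in arbitrary order, so sorting the list is a cheap way to make the row order of the final file reproducible across runs.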