# TF-IDF Vectorizer 2-3 ngrams Cosine Similarity
## Controlling both ORBIS and DENUE by entity and municipality

# General information
This notebook performs the following steps on the firm names in `denue_final` and `orbis_final`:

- Filtering the desired geographical zones in both data sets. 
- Extracting the company names. 
- Training the algorithm. 
- Extracting the results. 
- Labeling the results. 
- Exporting them to a Comma Separated Values file. 

The result is the set of possible name matches in `denue_final` for each company in `orbis_final`.
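
As a rough illustration of the idea behind these steps, the sketch below vectorizes a few made-up firm names with character 2-3 grams and compares them by cosine similarity. It uses only scikit-learn (the notebook itself uses `awesome_cossim_topn` for the multiplication), and every name in it is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical firm names, for illustration only
orbis_names = ["grupo bimbo sa de cv", "cemex mexico"]
denue_names = ["bimbo sa de cv", "cementos mexicanos", "panaderia la esperanza"]

# Character 2-3 grams, fitted on the union of both name lists
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 3))
vectorizer.fit(orbis_names + denue_names)

orbis_vectors = vectorizer.transform(orbis_names)
denue_vectors = vectorizer.transform(denue_names)

# TfidfVectorizer L2-normalizes each row by default, so the dot product
# of two rows is their cosine similarity
scores = (orbis_vectors @ denue_vectors.T).toarray()

for i, name in enumerate(orbis_names):
    best = scores[i].argmax()
    print(f"{name!r} -> {denue_names[best]!r} (score {scores[i, best]:.2f})")
```

A dense score matrix like this is only practical for small inputs; for full municipalities a sparse top-n product is the more plausible choice.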

# Input files
1. **orbis_final:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'` Each row of this data set is a firm paired with one of its names, plus its entity, municipality, and ORBIS BVDID number.
2. **denue_final_alternative:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_final_alternative.csv'` Each row of this data set is a firm paired with one of its names, plus its number of workers, entity, municipality, and DENUE key.

In [1]:
orbis_final_file = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'
denue_final_alternative_file = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_final_alternative.csv'

# Output files
1. **output_file_prefix:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/both_entity_municipality/denue_alternative/tf-idf/matches_tf_idf_both_'` One file is written per entity and municipality pair, each containing the possible matches for that pair.
2. **final_output:** `'/scratch/public/jpvasquez/MNCs_informality/Final_data/output/2-1-2-1-matches_tf_idf_both_entity_municipality_alternative.csv'` This file is the concatenation of all the `output_file_prefix` files.

In [2]:
output_file_prefix = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/both_entity_municipality/denue_alternative/tf-idf/matches_tf_idf_both_'
final_output = '/scratch/public/jpvasquez/MNCs_informality/Final_data/output/2-1-2-1-matches_tf_idf_both_entity_municipality_alternative.csv'

# Packages
These are the packages needed to run this code. If the machine you are running it on is missing any of them, run:

`!pip install package_name`

- **Pandas** handles importing, wrangling, and cleaning the data. 
- **Numpy** is needed in order to declare missing values. 
- **Glob** lists all the files in a directory whose names start with a given prefix. 
- **Sklearn** is a machine learning package; we use its text feature-extraction module (`TfidfVectorizer`). 
- **Scipy** is used for scientific computing; in our case, `csr_matrix` is a dependency of `awesome_cossim_topn`. 
- **Sparse_dot_topn** performs sparse matrix multiplication followed by top-n multiplication result selection. 
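
Before installing anything, it may help to check which of these packages are actually missing. The helper below is a minimal sketch; `missing_packages` is a hypothetical name, not part of the notebook. Note that the import name `sklearn` is installed from PyPI as `scikit-learn`, and that `glob` ships with Python.

```python
import importlib.util

# Import names of the third-party packages listed above
required = ["pandas", "numpy", "sklearn", "scipy", "sparse_dot_topn"]

def missing_packages(names):
    """Return the import names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

print(missing_packages(required))
```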

In [3]:
import pandas as pd
import numpy as np
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix
from sparse_dot_topn import awesome_cossim_topn

# Importing the data

In [4]:
orbis_final = pd.read_csv(orbis_final_file)
denue_final = pd.read_csv(denue_final_alternative_file)

# Declaring options
To make the notebook easy to customize, we group here all the variables, options, and arguments we might want to change in the future. They are kept in one cell, but feel free to split it into as many cells as you like. 

In [5]:
# base_df
base_data = orbis_final
category1 = 'entidad' # control 1 for groupby
category2 = 'municipio' # control 2 for groupby
# matching_df
matching_data = denue_final
base_names_variable = 'companyname' # column with the firm names in base_data
matching_names_variable = 'firm' # column with the firm names in matching_data
# extracting the results
## top results
top_results_n = 100 # number of results of top quality
top_results_lower_bound = 0.95 # lowest score accepted for top quality matches
## uncertain results
uncertain_results_n = 300 # number of results of uncertain quality
uncertain_results_lower_bound = 0.75 # lowest score accepted for uncertain quality matches
uncertain_results_upper_bound = 0.95 # highest score accepted for uncertain quality matches
n_uncertain_results = 5 # top n results for uncertain matches
cores = 56 # subject to the number specified in the batch options
##################################
### don't change anything below###
##################################
categories = [category1, category2]

# Selecting the inner set of categories in `denue_final` and `orbis_final`
- Get the unique categories tuples. 
- Keep the observations in each dataset that match the unique categories. 
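
The two steps above can be sketched on toy data. The frames and values below are made up; only the column names `entidad` and `municipio` mirror the notebook's options.

```python
import pandas as pd

# Toy stand-ins for the two data sets (values are made up)
base = pd.DataFrame({
    "entidad": ["jalisco", "jalisco", "puebla"],
    "municipio": ["chapala", "cocula", "puebla"],
    "companyname": ["a", "b", "c"],
})
matching = pd.DataFrame({
    "entidad": ["jalisco", "puebla", "yucatan"],
    "municipio": ["chapala", "puebla", "merida"],
    "firm": ["x", "y", "z"],
})
categories = ["entidad", "municipio"]

# Unique (entidad, municipio) pairs present in both data sets
common = (base[categories].drop_duplicates()
          .merge(matching[categories].drop_duplicates(), how="inner", on=categories))

# Keep only the rows whose pair appears in `common`
base_kept = base.merge(common, how="inner", on=categories)
matching_kept = matching.merge(common, how="inner", on=categories)

print(common)
```

Here only the pairs found in both frames survive, so the `cocula` row of `base` and the `merida` row of `matching` are dropped.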

In [6]:
coincident_categories = (base_data[categories]
                         .drop_duplicates() # select the unique controlling variable combinations in base_data
                         .merge(matching_data[categories]
                                .drop_duplicates(), # select the unique controlling variable combinations in matching_data
                                how = 'inner', left_on = categories, 
                                right_on = categories)) # merge 'inner' to get the pairs in both data sets

base_data = (base_data.merge(coincident_categories, how = 'inner', # keep the observations that
                             left_on = categories, right_on = categories)) #  match the inner unique categories
matching_data = matching_data.merge(coincident_categories, how = 'inner', # keep the observations that
                                    left_on = categories, right_on = categories)#  match the inner unique categories

# Matching algorithm
Because the algorithm runs inside a single loop, it cannot be split across separate cells, so each section is marked with `#` comments instead. 
The main sections are: 

- Filter the datasets with their corresponding categories. 
- Get the company names. 
- Train the algorithm. 
- Extract the results. 
    - Top results. 
    - Uncertain results. 
- Save the matches. 
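
The split between top and uncertain results can be illustrated on a small dense score matrix. This is a simplified sketch, not the notebook's procedure: it tiers the scores in one pass rather than with two `awesome_cossim_topn` calls, it does not cap the number of top results per firm, and the scores and the value of `n_uncertain` are made up.

```python
import numpy as np
import pandas as pd

# Made-up cosine scores: rows are base firms, columns are candidate firms
scores = np.array([
    [0.98, 0.80, 0.40, 0.77],
    [0.60, 0.96, 0.97, 0.10],
])
top_lower_bound = 0.95        # at or above this, a match is tagged 'top'
uncertain_lower_bound = 0.75  # below this, a pair is discarded
n_uncertain = 1               # best uncertain candidates kept per base firm

# Long format: one row per (base, candidate) pair above the lower bound
rows, cols = np.nonzero(scores >= uncertain_lower_bound)
pairs = pd.DataFrame({"base": rows, "candidate": cols, "score": scores[rows, cols]})

top = pairs[pairs["score"] >= top_lower_bound].assign(accuracy="top")
uncertain = (pairs[pairs["score"] < top_lower_bound]
             .sort_values("score", ascending=False)
             .groupby("base").head(n_uncertain)  # n best uncertain per base firm
             .assign(accuracy="uncertain"))

matches = pd.concat([top, uncertain], ignore_index=True)
print(matches)
```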

In [7]:
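n_groups = base_data.groupby(by = categories).ngroups # total number of category combinations
i = 0
for base_group, base_df in base_data.groupby(by = categories): 
    i += 1
    print(f'Matching {base_group}, combination {i} out of {n_groups}')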
    #########################
    ###Filter the datasets###
    #########################
    base_df = (base_df.copy()
               .dropna(subset = [base_names_variable]) # drop observations without variable name (firm name)
               .drop_duplicates(ignore_index=True)) # drop possible duplicates
    matching_df = (matching_data[(matching_data[category1] == base_group[0]) # filter the matching dataset
                                 & (matching_data[category2] == base_group[1])] # to match a category tuple
                   .copy() 
                   .dropna(subset = [matching_names_variable]) # drop observations without variable name (firm name)
                   .drop_duplicates(ignore_index=True)) # drop possible duplicates
    ######################################
    ###Getting variable (company) names###
    ######################################
    base_names = base_df[base_names_variable] # get the list of variable names (not necessarily unique) from base_df
    matching_names = matching_df[matching_names_variable] # get the list of variable names 
                                                          # (not necessarily unique) from matching_df
    names = pd.concat([base_names, matching_names], ignore_index=True) # concatenating both lists (Series.append was removed in pandas 2.0)
    ############################
    ###Training the algorithm###
    ############################
    vectorizer = TfidfVectorizer(min_df = 1, ngram_range = (2, 3), analyzer = 'char') # call the function 
                                                                                      # at least one item, 2-3 ngrams, 
                                                                                      # by characters/letters
    tf_idf_matrix = vectorizer.fit(names) # train the models with all the company names from ORBIS and DENUE
    tf_idf_matrix_base = tf_idf_matrix.transform(base_names) # transform each observation into a vector 
                                                             # and append them into a matrix
    tf_idf_matrix_matching = tf_idf_matrix.transform(matching_names) # according to the ngrams
    ############################
    ###Extracting the results###
    ############################
    
    ############
    #Top results
    ############
    
    possible_matches = awesome_cossim_topn(tf_idf_matrix_base, # sparse matrix multiplication, base_df matrix
                                           tf_idf_matrix_matching.transpose(), # multiplied by the matching_df matrix
                                           top_results_n, top_results_lower_bound, use_threads = True, # options
                                           n_jobs = cores)
    
    possible_matches_base_df_index = possible_matches.nonzero()[0] # positions where the matches are located in base_df
    possible_matches_matching_df_index = possible_matches.nonzero()[1] # positions where the matches are located in matching_df
    
    certain_matches = (base_df.iloc[possible_matches_base_df_index] # create certain_matches df, merge the firm names of left side
                       .reset_index(drop = True) # select the observations by location, get index to 0, 1, ..., n
                       .merge(matching_df.iloc[possible_matches_matching_df_index] # select the observations by location
                              .reset_index(drop = True), left_index = True, # get index to 0, 1, ..., n
                              right_index = True)) # merge by index, 0 with 0, 1 with 1, ...
    certain_matches['score'] = possible_matches.data # assign score to each match
    certain_matches['accuracy'] = 'top' # tag them as top results
    
    ##################
    #Uncertain results
    ##################
    
    possible_matches = awesome_cossim_topn(tf_idf_matrix_base, # sparse matrix multiplication, base_df matrix
                                           tf_idf_matrix_matching.transpose(), # multiplied by the matching_df matrix
                                           uncertain_results_n, uncertain_results_lower_bound, use_threads = True, # options
                                           n_jobs = cores)
    
    possible_matches_base_df_index = possible_matches.nonzero()[0] # positions where the matches are located in base_df
    possible_matches_matching_df_index = possible_matches.nonzero()[1] # positions where the matches are located in matching_df
    
    uncertain_matches = (base_df.iloc[possible_matches_base_df_index] # create uncertain_matches df, merge the firm names of left side
                         .reset_index(drop = True) # select the observations by location, get index to 0, 1, ..., n
                         .merge(matching_df.iloc[possible_matches_matching_df_index] # select the observations by location
                                .reset_index(drop = True), left_index = True, # get index to 0, 1, ..., n
                                right_index = True)) # merge by index, 0 with 0, 1 with 1, ...
    uncertain_matches['score'] = possible_matches.data # assign score to each match
    uncertain_matches = (uncertain_matches[uncertain_matches['score'] < uncertain_results_upper_bound] # select the matches below upper bound
                         .sort_values(['score'], ascending = False) # sort values descending
                         .groupby('bvdidnumber').head(n_uncertain_results)) # group by bvdidnumber, then get the n best matches
    uncertain_matches['accuracy'] = 'uncertain' # tag them as uncertain results
    
    ########################
    ###Saving the matches###
    ########################
    
    matches = pd.concat([certain_matches, uncertain_matches], ignore_index = True) # append certain with uncertain matches (DataFrame.append was removed in pandas 2.0)
    file_name = output_file_prefix + base_group[0] + '_' + base_group[1] + '_alternative' +'.csv' # create file name
    matches.drop_duplicates(ignore_index = True).to_csv(file_name, index = False) # remove duplicates and save

Matching ('aguascalientes', 'aguascalientes'), combination 1 out of 398
Matching ('aguascalientes', 'el_llano'), combination 2 out of 398
Matching ('aguascalientes', 'jesus_maria'), combination 3 out of 398
Matching ('aguascalientes', 'rincon_de_romos'), combination 4 out of 398
Matching ('aguascalientes', 'san_francisco_de_los_romo'), combination 5 out of 398
Matching ('baja_california', 'ensenada'), combination 6 out of 398
Matching ('baja_california', 'mexicali'), combination 7 out of 398
Matching ('baja_california', 'playas_de_rosarito'), combination 8 out of 398
Matching ('baja_california', 'tecate'), combination 9 out of 398
Matching ('baja_california', 'tijuana'), combination 10 out of 398
Matching ('baja_california_sur', 'la_paz'), combination 11 out of 398
Matching ('baja_california_sur', 'los_cabos'), combination 12 out of 398
Matching ('baja_california_sur', 'mulege'), combination 13 out of 398
Matching ('campeche', 'campeche'), combination 14 out of 398
Matching ('campeche'

Matching ('jalisco', 'arandas'), combination 122 out of 398
Matching ('jalisco', 'atengo'), combination 123 out of 398
Matching ('jalisco', 'atotonilco_el_alto'), combination 124 out of 398
Matching ('jalisco', 'autlan_de_navarro'), combination 125 out of 398
Matching ('jalisco', 'canadas_de_obregon'), combination 126 out of 398
Matching ('jalisco', 'chapala'), combination 127 out of 398
Matching ('jalisco', 'cocula'), combination 128 out of 398
Matching ('jalisco', 'degollado'), combination 129 out of 398
Matching ('jalisco', 'el_salto'), combination 130 out of 398
Matching ('jalisco', 'guadalajara'), combination 131 out of 398
Matching ('jalisco', 'ixtlahuacan_de_los_membrillos'), combination 132 out of 398
Matching ('jalisco', 'jamay'), combination 133 out of 398
Matching ('jalisco', 'jesus_maria'), combination 134 out of 398
Matching ('jalisco', 'lagos_de_moreno'), combination 135 out of 398
Matching ('jalisco', 'ocotlan'), combination 136 out of 398
Matching ('jalisco', 'ojuelos_d

Matching ('oaxaca', 'santa_cruz_xoxocotlan'), combination 247 out of 398
Matching ('oaxaca', 'santa_lucia_del_camino'), combination 248 out of 398
Matching ('oaxaca', 'santa_maria_huatulco'), combination 249 out of 398
Matching ('oaxaca', 'teotitlan_del_valle'), combination 250 out of 398
Matching ('puebla', 'amozoc'), combination 251 out of 398
Matching ('puebla', 'atzala'), combination 252 out of 398
Matching ('puebla', 'coronango'), combination 253 out of 398
Matching ('puebla', 'cuautlancingo'), combination 254 out of 398
Matching ('puebla', 'huejotzingo'), combination 255 out of 398
Matching ('puebla', 'jalpan'), combination 256 out of 398
Matching ('puebla', 'puebla'), combination 257 out of 398
Matching ('puebla', 'quecholac'), combination 258 out of 398
Matching ('puebla', 'san_andres_cholula'), combination 259 out of 398
Matching ('puebla', 'san_gregorio_atzompa'), combination 260 out of 398
Matching ('puebla', 'san_miguel_xoxtla'), combination 261 out of 398
Matching ('puebla

Matching ('veracruz_de_ignacio_de_la_llave', 'pueblo_viejo'), combination 366 out of 398
Matching ('veracruz_de_ignacio_de_la_llave', 'san_juan_evangelista'), combination 367 out of 398
Matching ('veracruz_de_ignacio_de_la_llave', 'tierra_blanca'), combination 368 out of 398
Matching ('veracruz_de_ignacio_de_la_llave', 'tihuatlan'), combination 369 out of 398
Matching ('veracruz_de_ignacio_de_la_llave', 'tuxpan'), combination 370 out of 398
Matching ('veracruz_de_ignacio_de_la_llave', 'veracruz'), combination 371 out of 398
Matching ('veracruz_de_ignacio_de_la_llave', 'xalapa'), combination 372 out of 398
Matching ('veracruz_de_ignacio_de_la_llave', 'yanga'), combination 373 out of 398
Matching ('yucatan', 'kanasin'), combination 374 out of 398
Matching ('yucatan', 'merida'), combination 375 out of 398
Matching ('yucatan', 'progreso'), combination 376 out of 398
Matching ('yucatan', 'ticul'), combination 377 out of 398
Matching ('yucatan', 'tixkokob'), combination 378 out of 398
Matchi

# Concatenate and save results
First, we concatenate all the files with the prefix `output_file_prefix`. Then we label them by algorithm and by DENUE's geographical selection. Finally, we drop duplicate matches and save the result. 
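
The same pattern can be tried safely in a temporary directory. In this sketch the prefix, file names, and rows are all hypothetical; it only shows how `glob` picks up every file sharing a prefix and how duplicates across files collapse.

```python
import glob
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as tmp_dir:
    prefix = os.path.join(tmp_dir, "matches_")  # hypothetical prefix

    # Two made-up per-municipality files that share one identical row
    pd.DataFrame({"companyname": ["a"], "firm": ["x"], "score": [0.99]}).to_csv(
        prefix + "one.csv", index=False)
    pd.DataFrame({"companyname": ["a", "b"], "firm": ["x", "y"],
                  "score": [0.99, 0.80]}).to_csv(prefix + "two.csv", index=False)

    # Read every file that starts with the prefix and stack them
    files = sorted(glob.glob(prefix + "*.csv"))
    joint = pd.concat([pd.read_csv(f) for f in files], ignore_index=True)

joint["algorithm"] = "tf-idf"                     # label the algorithm
joint = joint.drop_duplicates(ignore_index=True)  # the shared row is kept once
print(joint)
```

Note that `pd.concat` raises an error when given an empty list, so the pattern assumes at least one file matches the prefix.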

In [8]:
joint_matches = pd.concat([pd.read_csv(f) for f in 
                           glob.glob(output_file_prefix +'*.csv')], ignore_index=True) # concatenate the results
joint_matches['algorithm'] = 'tf-idf' # label the algorithm
joint_matches['selection'] = 'both_entity_municipality_alternative' # label the database selection
joint_matches.drop_duplicates(ignore_index = True).to_csv(final_output, index = False) # drop duplicates and save 