# Preparing ORBIS's dataset for matching

# General information
This code has the following main objectives: 

- Recode the geographical zones. 
- Clean company names. 
- Reshape the dataset so that each observation corresponds to a single possible name of a firm in one of its possible locations. 
- Remove stopwords. 
- Save the dataset. 

# Input files
1. **orbis:** `'/scratch/public/jpvasquez/MNCs_informality/Raw_data/ORBIS/Orbis_for_Denue_merge.dta'` This file contains geographical location variables, the number of workers, the ORBIS and DENUE firm keys, generic firm names and legal business names (razón social). 
2. **orbis_geo_corrected:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/data/orbis_municipalities_corrected.csv'` This file contains the geographical zones of the firms with their proper coding. The correction procedure is documented in `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/data/cleaning_datasets_orbis_denue.md'`

In [1]:
orbis_file = '/scratch/public/jpvasquez/MNCs_informality/Raw_data/ORBIS/Orbis_for_Denue_merge.dta'
orbis_geo_corrected_file = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/data/orbis_municipalities_corrected.csv'

# Output files
1. **orbis_final:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'` In this data set each row represents a firm under one of its possible names, together with its entity (`entidad`), municipality (`municipio`) and ORBIS BVDID number. 

In [2]:
orbis_final = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_final.csv'

# Packages
The packages below are needed to run this code. If the machine you are running it on is missing any of them, install it with: 

`!pip install package_name`

**Pandas** is the package that handles importing, wrangling, cleaning and exporting the data. 

**Numpy** is needed in order to declare missing values. 

In [3]:
import pandas as pd
import numpy as np

# Importing the data

In [4]:
orbis = pd.read_stata(orbis_file)
orbis_geo_corrected = pd.read_csv(orbis_geo_corrected_file, sep = ';')

# Cleaning the data

## ORBIS Geographical Zones Corrected

### Rename the variables

In [5]:
orbis_geo_corrected = orbis_geo_corrected.rename(columns = {'d_estado': 'entidad', # actual name: new name
                                                            'd_mnpio': 'municipio', 
                                                            'd_cp': 'cp'}) 

## ORBIS

### Preparing company names for matching
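For each observation, we clean the name and location variables. First, we remove accents and other non-ASCII characters. Second, we convert upper-case letters to lower case. Third, we trim leading and trailing spaces. Fourth, in the location variables only, we replace each space with \_. Fifth, in the company names only, we remove every character that is neither a word character (letter, digit or underscore) nor whitespace, which strips the punctuation. Finally, we trim the company names again, since removing punctuation can leave spaces at either end. 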

In [6]:
name_columns = ['companyname1', 'companyname2', 'companyname3']
location_columns = ['city1', 'region1', 'city2', 'region2', 'city3', 'region3']

for column in name_columns + location_columns:
    orbis[column] = (orbis[column].str.normalize('NFKD') # split accented letters into base letter + mark
                     .str.encode('ascii', errors='ignore') # drop the non-ASCII marks
                     .str.decode('utf-8') # back from bytes to text
                     .str.lower() # replace caps
                     .str.strip()) # trim leading and trailing spaces
    
for column in location_columns:
    orbis[column] = orbis[column].str.replace(" ", "_") # replace space with _
    
for column in name_columns:
    # regex=True is required since pandas 2.0, where the pattern is literal by default
    orbis[column] = (orbis[column].str.replace(r'[^\w\s]', '', regex=True) # remove punctuation
                     .str.strip()) # trim leading and trailing spaces
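
To see what this chain does to a single name, the sketch below applies it to two made-up strings (hypothetical values, not taken from the ORBIS file). It also adds one step that the cell above does not have: a `\s+` replacement that collapses repeated internal spaces, since `str.strip()` only trims the ends. Treat that extra step as an assumption about what "removing multiple spaces" is meant to achieve.

```python
import pandas as pd

# Hypothetical sample values, not taken from the ORBIS file.
names = pd.Series(["  Cafés   México, S.A. de C.V. ", "ÑANDÚ & Hijos"])

cleaned = (names.str.normalize("NFKD")                # split accented letters into base + mark
           .str.encode("ascii", errors="ignore")     # drop the non-ASCII marks
           .str.decode("utf-8")                      # bytes back to text
           .str.lower()
           .str.replace(r"[^\w\s]", "", regex=True)  # remove punctuation
           .str.replace(r"\s+", " ", regex=True)     # extra step: collapse repeated whitespace
           .str.strip())                             # trim both ends

print(cleaned.tolist())  # ['cafes mexico sa de cv', 'nandu hijos']
```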

# Reshaping the data
Note that we've got to do two reshapes: 

1. In the first one, we make each observation a single location for every BVDID number. 
2. In the second one, we get all the possible combinations of a firm name, their respective location and their BVDID number. 

This is important because in this data set there's no difference in BVDID number for a firm with two or three locations. 

## First reshape: Geographical zones
The three location variables (`postcode`, `city` and `region`) carry a numeric suffix, one per possible location. We use them as stubs to reshape the dataset from wide to long with respect to `bvdidnumber`, storing the suffix in `n`. Also, the index is reset. 

In [7]:
orbis = (pd.wide_to_long(orbis, 
                         stubnames = ['postcode', 'city', 'region'], 
                         i = 'bvdidnumber', j = 'n')
         .reset_index())
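
A toy version of this reshape may help to picture the result. The data frame below is invented for illustration (two firms, two location slots); only the column naming pattern follows the ORBIS file. Columns that do not match a stub, such as `companyname1`, are carried along and repeated on every row.

```python
import pandas as pd

# Invented data: two firms, two location slots each.
toy = pd.DataFrame({
    "bvdidnumber": ["MX001", "MX002"],
    "companyname1": ["acme", "globex"],
    "city1": ["puebla", "toluca"],
    "region1": ["puebla", "mexico"],
    "postcode1": ["72000", "50000"],
    "city2": ["leon", ""],
    "region2": ["guanajuato", ""],
    "postcode2": ["37000", ""],
})

# One row per (bvdidnumber, n); the numeric suffix ends up in column n.
long = (pd.wide_to_long(toy,
                        stubnames=["postcode", "city", "region"],
                        i="bvdidnumber", j="n")
        .reset_index())

print(long[["bvdidnumber", "n", "city", "region", "postcode", "companyname1"]])
```

The second slot of `MX002` is empty, which is the kind of row the next step removes.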

## Dropping the null values generated
For every `bvdidnumber`, we always keep the first possible location (which can be missing), and we keep the second and/or third possible location only if both its `city` and its `region` are registered. Then, we concatenate the two subsets and replace the data set. 

In [8]:
orbis1 = orbis[orbis['n'] == 1] # first possible location
orbis2 = orbis[(orbis['n'] != 1) & (orbis['city'] != '') 
               & (orbis['region'] != '')] # second or third possible locations

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
orbis = pd.concat([orbis1, orbis2]).reset_index(drop = True)
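
The filter can be checked on a small invented frame. The sketch assumes, as the cell above does, that a missing location is stored as an empty string rather than as `NaN`; a `NaN` would pass the `!= ''` comparison and be kept.

```python
import pandas as pd

# Invented long-format data: MX002 has no second location.
long = pd.DataFrame({
    "bvdidnumber": ["MX001", "MX002", "MX001", "MX002"],
    "n": [1, 1, 2, 2],
    "city": ["puebla", "", "leon", ""],
    "region": ["puebla", "", "guanajuato", ""],
})

first = long[long["n"] == 1]                       # always kept, even if empty
extra = long[(long["n"] != 1)
             & (long["city"] != "")
             & (long["region"] != "")]             # kept only when fully registered

kept = pd.concat([first, extra]).reset_index(drop=True)
print(kept)
```

`MX002` keeps its (empty) first location, while its empty second location is dropped.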

## Creating an ID for the second reshape
Now that each observation holds a single location, we still don't know which name is associated with which location. So, we can't assume any pairing and have to consider all the possible combinations. 

Note that we need a unique identifier to do the reshape. To get a unique ID, we combine the `bvdidnumber` with `n`, which was generated by the first reshape. 

In [9]:
orbis['id'] = orbis['bvdidnumber'] + "_" + orbis['n'].astype(str)

## Drop `n` to create a different one

In [10]:
orbis = orbis.drop(columns = 'n')

## Second reshape: Possible company names
Using `companyname` as the stub and `id` as the index, we reshape the data set again so that each row pairs one location with one possible company name, which covers all the possible combinations. 

In [11]:
orbis = pd.wide_to_long(orbis, stubnames='companyname', 
                        i='id', j='n').reset_index()
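
As a sketch of why the combined `id` is needed, the invented frame below has one firm with two locations and two names. `bvdidnumber` alone is no longer unique after the first reshape, so it cannot serve as the `i` argument of `wide_to_long`; the concatenated `id` can.

```python
import pandas as pd

# Invented data: one firm, two locations, two possible names.
toy = pd.DataFrame({
    "bvdidnumber": ["MX001", "MX001"],
    "n": [1, 2],
    "city": ["puebla", "leon"],
    "companyname1": ["acme", "acme"],
    "companyname2": ["acme industrial", "acme industrial"],
})

toy["id"] = toy["bvdidnumber"] + "_" + toy["n"].astype(str)  # MX001_1, MX001_2
toy = toy.drop(columns="n")                                  # free the name n for the new suffix

pairs = pd.wide_to_long(toy, stubnames="companyname",
                        i="id", j="n").reset_index()
print(pairs[["id", "n", "city", "companyname"]])
```

Every location is now paired with every name, giving four rows for this firm.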

## Drop observations without a company name
Also, reset the index. 

In [12]:
orbis = (orbis[orbis['companyname'] != '']
         .reset_index(drop = True))

# Cleaning the data

## Create missing values
For every variable, we replace values that are an empty string or consist only of spaces with a missing value. We use the `regex` option so that the pattern captures any number of spaces. 

In [13]:
orbis = orbis.replace(r'^\s*$', np.nan, regex=True) # np.NaN was removed in NumPy 2.0
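
The pattern `^\s*$` matches a string made of zero or more whitespace characters and nothing else. A minimal check on invented values:

```python
import numpy as np
import pandas as pd

# Invented values: an empty string, a run of spaces and a single space.
toy = pd.DataFrame({"city": ["puebla", "", "   "],
                    "postcode": ["72000", " ", "37000"]})

toy = toy.replace(r"^\s*$", np.nan, regex=True)
print(toy.isna().sum().to_dict())  # {'city': 2, 'postcode': 1}
```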

## Removing stopwords
Many words add no information to the name matching. Because every firm is located in México, words that refer to the country or the region do not help to tell firms apart, and we don't care about the company's legal structure when comparing names. We also inspected the data set manually, detected other common words that qualify as stopwords, and made a list with all of them. Then, for each possible firm name, we split the name into words, drop the words in the list, and join the remaining words with single spaces. 

In [14]:
remove_words = ['de', 'a', 's', 'l', 
                'r', 'sade', 'v', 'c', 
                'b', 'sa', 'cv', 'sab', 
                'mexicana', 'mexicano', 'limitada', 'rl', 
                'mexico', 'latinoamerica', 'srl', 'mejico', 
                'via', 'its', 'funds', 'y', 
                'sapi', 'enr', 'sofom', 'mxico', 
                'latin', 'america', 'internacional', 'mexicanos', 
                'mexicanas', 'mex', 'er']
orbis['companyname'] = (orbis['companyname']
                        .apply(lambda x: ' '.join([word for word in x.split() 
                                                   if word not in (remove_words)])))
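
The same logic, written as a small standalone function and applied to two made-up names, shows two side effects worth knowing. The helper name `strip_stopwords` and the shortened word list are hypothetical, introduced only for this illustration.

```python
# Short illustrative subset of the list above, stored as a set for fast lookup.
stopwords = {"de", "sa", "cv", "mexico"}

def strip_stopwords(name, words=stopwords):
    """Drop every whitespace-separated token of name that appears in words."""
    return " ".join(token for token in name.split() if token not in words)

print(strip_stopwords("cafes   mexico sa de cv"))  # cafes
print(repr(strip_stopwords("sa de cv")))           # ''
```

Because `split()` breaks on any run of whitespace, repeated internal spaces disappear at this step. A name made only of stopwords becomes an empty string, not a missing value.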

# Recode geographical zones
We left-merge `orbis` with `orbis_geo_corrected` on `region`, `city` and `postcode`. This attaches the correctly coded geographical variables `entidad`, `municipio` and `cp` to each observation. 

In [15]:
orbis = orbis.merge(orbis_geo_corrected, how = 'left', 
                    on = ['region', 'city', 'postcode']) # same key names on both sides
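
With a left merge, observations whose location has no counterpart in the corrected file are kept with missing `entidad` and `municipio`. The sketch below, on invented data, uses the `indicator` option of `merge` to list them. It is a possible diagnostic, not part of the original pipeline. The key columns also need compatible dtypes on both sides, for instance `postcode` stored as text in both files.

```python
import pandas as pd

# Invented data: the second location is absent from the lookup table.
firms = pd.DataFrame({"region": ["puebla", "jalisco"],
                      "city": ["puebla", "gdl"],
                      "postcode": ["72000", "44100"]})
lookup = pd.DataFrame({"region": ["puebla"],
                       "city": ["puebla"],
                       "postcode": ["72000"],
                       "entidad": ["Puebla"],
                       "municipio": ["Puebla"]})

merged = firms.merge(lookup, how="left",
                     on=["region", "city", "postcode"],
                     indicator=True)               # adds a _merge column

unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["region", "city", "postcode"]])
```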

# Preparing the data to save it
First, we drop the variables that are no longer needed. Then, we drop the duplicated rows and reset the index. Note that there are many duplicates once punctuation, accents and extra spaces have been removed, since different raw spellings can end up as the same cleaned name. 

In [16]:
orbis = (orbis.drop(columns = ['id', 'n', 'city', 
                               'region', 'postcode', 'cp'])
         .drop_duplicates(ignore_index = True))
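
Dropping the helper columns is what exposes the duplicates: rows that differed only in `id` become identical. A small invented example:

```python
import pandas as pd

# Invented data: the first two rows differ only in the helper column id.
toy = pd.DataFrame({
    "bvdidnumber": ["MX001", "MX001", "MX001"],
    "id": ["MX001_1", "MX001_2", "MX001_1"],
    "companyname": ["acme", "acme", "acme industrial"],
    "entidad": ["Puebla", "Puebla", "Puebla"],
})

final = (toy.drop(columns=["id"])
         .drop_duplicates(ignore_index=True))  # ignore_index renumbers rows 0..n-1
print(final)
```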

# Save the dataset
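We save the data set to a comma-separated values (CSV) file, leaving out the index. Note that this call writes the columns in their current order; naming them one by one in the `columns` argument of `to_csv` would impose a preferred order. 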

In [17]:
orbis.to_csv(orbis_final, index = False)
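
If a fixed column order is wanted in the output file, one option is the `columns` argument of `to_csv`. The sketch below writes to an in-memory buffer so that it runs without touching the real output path. The order in `column_order` is hypothetical, not the authors' stated preference.

```python
import io
import pandas as pd

# Invented one-row frame with the columns in an arbitrary order.
toy = pd.DataFrame({"companyname": ["acme"],
                    "municipio": ["Puebla"],
                    "entidad": ["Puebla"],
                    "bvdidnumber": ["MX001"]})

column_order = ["bvdidnumber", "companyname", "entidad", "municipio"]  # hypothetical order

buffer = io.StringIO()
toy.to_csv(buffer, columns=column_order, index=False)

header = buffer.getvalue().splitlines()[0]
print(header)  # bvdidnumber,companyname,entidad,municipio
```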