# Filtering all the possible geographical combinations in `zip_codes`, `orbis` and `denue`

# General information
The main objective of this code is to extract every unique tuple of the form (`entity`, `municipality`, `postcode`) from `zip_codes`, `orbis` and `denue`. For `denue`, only the entity and the municipality are kept. 
Other steps in this process, specific to `zip_codes`, are: 

- Cleaning the names: removing accents and double spaces, and replacing spaces with underscores. 
- Renaming columns. 
- Getting the unique combinations. 
- Exporting the result to a Comma Separated Values file. 

# Input files
1. **zip_codes_file:** `'/scratch/public/jpvasquez/MNCs_informality/Raw_data/ORBIS/zip_codes.txt'` This file lists every geographical zone in México with its zip code, municipality, entity, suburb and other variables, stored both as string names and as numerical codes. 
2. **orbis_file:** `'/scratch/public/jpvasquez/MNCs_informality/Raw_data/ORBIS/Orbis_for_Denue_merge.dta'` This file contains a list of multinational companies in México with their locations, names and Bureau van Dijk ID numbers. 
3. **denue_file:** `'/scratch/public/jpvasquez/MNCs_informality/Raw_data/DENUE/output/denue_version_both.csv'` This file contains all the formal and informal companies in México registered in the Directorio Estadístico Nacional de Unidades Económicas (DENUE), with their complete geographical data, company names, business names (razones sociales) and DENUE key. 

In [1]:
zip_codes_file = '/scratch/public/jpvasquez/MNCs_informality/Raw_data/ORBIS/zip_codes.txt'
orbis_file = '/scratch/public/jpvasquez/MNCs_informality/Raw_data/ORBIS/Orbis_for_Denue_merge.dta'
denue_file = '/scratch/public/jpvasquez/MNCs_informality/Raw_data/DENUE/output/denue_version_both.csv'

# Output files
1. **zip_codes_clean:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/zip_codes_clean.csv'` `zip_codes` with all its variables cleaned. 
2. **zip_codes_municipios:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/zip_codes_municipios.csv'` Unique entity, municipality and zip code tuples registered in México. 
3. **orbis_municipios:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_municipios.csv'` Unique entity, municipality and zip code tuples obtained from the firms registered in ORBIS. 
4. **denue_municipios:** `'/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_municipios.csv'` Unique entity and municipality tuples from the firms registered in DENUE. 

In [2]:
zip_codes_clean = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/zip_codes_clean.csv'
zip_codes_municipios = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/zip_codes_municipios.csv'
orbis_municipios = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/orbis_municipios.csv'
denue_municipios = '/scratch/public/jpvasquez/MNCs_informality/Intermediate_data/output/denue_municipios.csv'

# Packages
These are the packages needed to run this code. If the machine you are running it on is missing any of them, install it with: 

`!pip install package_name`

**Pandas** is the package that handles importing, wrangling, cleaning and exporting the data. 

In [3]:
import pandas as pd

# Importing the data

In [4]:
zip_codes = pd.read_csv(zip_codes_file, sep = "|", encoding='cp1252')
orbis = pd.read_stata(orbis_file)
denue = pd.read_csv(denue_file, engine='python')
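
The `sep` and `encoding` arguments matter for the zip code file because it is pipe-delimited and contains accented characters. The sketch below reproduces that situation with an in-memory buffer instead of the real file. The single data row is made up for illustration; only the column names `d_estado`, `D_mnpio` and `d_CP` come from this notebook.

```python
import io

import pandas as pd

# Made-up, pipe-delimited sample encoded as cp1252 (assumed to mimic the raw file)
raw_bytes = (
    "d_estado|D_mnpio|d_CP\n"
    "Michoacán de Ocampo|Morelia|58000\n"
).encode("cp1252")

# Same reader arguments as the notebook uses for zip_codes_file
toy_zip_codes = pd.read_csv(io.BytesIO(raw_bytes), sep="|", encoding="cp1252")

print(toy_zip_codes)
```

Reading the same bytes with the default UTF-8 encoding would be expected to fail on the accented character, which is why the encoding is stated explicitly.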

# Cleaning the data

## Preparing geographical information for analysis
For every observation in `zip_codes` and `orbis`, we clean all the string variables. First, remove accents and other non-ASCII characters. Second, convert uppercase letters to lowercase. Third, trim leading and trailing whitespace and collapse repeated spaces. Fourth, replace each space with \_. 

In [5]:
dfs = [zip_codes, orbis]

for df in dfs:
    for column in df.select_dtypes(include = 'object'):
        df[column] = (df[column].str.normalize('NFKD') # split letters from their accents
                             .str.encode('ascii', errors='ignore') # drop the non-ASCII accent marks
                             .str.decode('utf-8') # convert bytes back to strings
                             .str.lower() # remove the caps
                             .str.strip() # remove leading and trailing whitespace
                             .str.replace(r'\s+', ' ', regex=True) # collapse multiple spaces
                             .str.replace(" ", "_")) # change spaces for underscores
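
To illustrate what the cleaning chain does to a single column, here is a minimal sketch on made-up place names. The helper `clean_text` and the sample values are hypothetical and not part of the notebook. The sketch also collapses repeated internal spaces with a regular expression, since `str.strip()` alone only trims the ends of each string.

```python
import pandas as pd


def clean_text(series: pd.Series) -> pd.Series:
    """Strip accents, lowercase, tidy whitespace and use underscores (hypothetical helper)."""
    return (series.str.normalize("NFKD")                  # split letters from their accents
                  .str.encode("ascii", errors="ignore")   # drop the non-ASCII accent marks
                  .str.decode("ascii")                    # bytes back to str
                  .str.lower()                            # remove the caps
                  .str.strip()                            # trim leading/trailing whitespace
                  .str.replace(r"\s+", " ", regex=True)   # collapse runs of whitespace
                  .str.replace(" ", "_", regex=False))    # spaces to underscores


# Made-up sample values
sample = pd.Series(["  Ciudad de  México ", "Nuevo León", "MICHOACÁN DE OCAMPO"])
cleaned = clean_text(sample)

print(cleaned.tolist())
# ['ciudad_de_mexico', 'nuevo_leon', 'michoacan_de_ocampo']
```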

## Dropping unnecessary variables

In [6]:
zip_codes = zip_codes.drop(columns = 'c_CP')

## Making sure every variable name is lowercase

In [7]:
zip_codes = zip_codes.rename(columns = {'D_mnpio': 'd_mnpio', 
                                        'd_CP': 'd_cp'})
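
As an alternative to listing each column by hand, `DataFrame.rename` also accepts a function that is applied to every column name. The sketch below uses an empty toy frame whose column names are taken from this notebook; it is an illustration of the idea, not the code the notebook runs.

```python
import pandas as pd

# Toy frame with the mixed-case column names seen in zip_codes
toy = pd.DataFrame(columns=["d_estado", "D_mnpio", "d_CP"])

# str.lower is applied to each column label in turn
toy = toy.rename(columns=str.lower)

print(list(toy.columns))
```

This variant lowercases every column, so it only matches the explicit mapping when no column name should keep its capitals.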

# Preparing ORBIS dataset

## Reshaping the data
We reshape from wide to long using the `postcode`, `city` and `region` variables as stubs, so that each numbered location of a firm becomes its own row. 

In [8]:
orbis = (pd.wide_to_long(orbis, 
                         stubnames = ['postcode', 'city', 'region'], 
                         i = 'bvdidnumber', j = 'n')
         .reset_index())
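
The reshape can be easier to follow on a toy table. The sketch below assumes the wide ORBIS layout stores each location in numbered columns such as `postcode1` and `postcode2`; that naming, the firm IDs and the place values are assumptions made for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical wide layout: one row per firm, numbered location columns
wide = pd.DataFrame({
    "bvdidnumber": ["MX001", "MX002"],
    "postcode1": ["01000", "64000"],
    "postcode2": ["44100", None],
    "city1": ["alvaro_obregon", "monterrey"],
    "city2": ["guadalajara", None],
    "region1": ["ciudad_de_mexico", "nuevo_leon"],
    "region2": ["jalisco", None],
})

# Same call as the notebook: the numeric suffix of each stub ends up in column 'n'
long_df = (pd.wide_to_long(wide,
                           stubnames=["postcode", "city", "region"],
                           i="bvdidnumber", j="n")
           .reset_index())

print(long_df)
```

Each firm now has one row per location number. A firm with fewer locations, like the second one here, keeps a row with missing values for the unused slot.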

# Saving the corresponding data sets

## Saving the clean complete data set

In [9]:
zip_codes.to_csv(zip_codes_clean, index=False)

## Saving the selected variables needed for recoding the other data sets
First, we select the variables of interest. These selections contain duplicate tuples, so we then drop the duplicates. Finally, we save each data set to its assigned location. 

In [10]:
# selecting variables of interest
zip_codes = zip_codes[['d_estado', 'd_mnpio', 'd_cp']].copy()
orbis = orbis[['region', 'city', 'postcode']].copy()
denue = denue[['entidad', 'municipio']].copy()

# lists of data sets and output locations to iterate over
databases = [zip_codes, orbis, denue]
locations = [zip_codes_municipios, orbis_municipios, denue_municipios]

for database, location in zip(databases, locations): 
    (database.drop_duplicates(ignore_index = True) # dropping duplicates
     .to_csv(location, index=False)) # saving the data set
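
The deduplicate-and-save step can be checked on a small example. The sketch below uses made-up rows and writes to a temporary directory rather than the project paths. The file name `toy_municipios.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up rows; the first two are exact duplicates
toy = pd.DataFrame({
    "d_estado": ["jalisco", "jalisco", "nuevo_leon"],
    "d_mnpio": ["guadalajara", "guadalajara", "monterrey"],
    "d_cp": [44100, 44100, 64000],
})

# ignore_index=True renumbers the surviving rows 0, 1, ...
unique_rows = toy.drop_duplicates(ignore_index=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy_municipios.csv"  # hypothetical output file
    unique_rows.to_csv(out_path, index=False)        # index=False keeps the row index out of the file
    round_trip = pd.read_csv(out_path)

print(len(toy), "rows before,", len(unique_rows), "rows after")
```

Reading the file back, as `round_trip` does here, is a quick way to confirm that the saved table has the expected shape.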