# Identifying the Country
We start from the file Extracted_Attributes_LoanUSD.csv, which contains the attributes extracted from the PDF texts and the loan amount in US dollars.

We use fuzzy string matching (from the fuzzywuzzy package) to compare the possible country names found in each document against a list of country names from the pycountry package. We match against both the short name and the official name of each country, and we keep the best match as Country_Code_Cover.

We also search the address extracted from each PDF for a country name. If the name of a country or of its capital city appears in the address, we record that country as Country_Code_Address.

Finally, for the few cases where the two codes differ, we decide the country manually.

In [1]:
#!pip install python-Levenshtein
import pandas as pd
import re
import numpy as np
from fuzzywuzzy import fuzz, process
import pycountry as pyco

In [2]:
Attributes_DF=pd.read_csv('../data/Extracted_Attributes_LoanUSD.csv')

# Identifying Country from Cover Page
We fuzzy-match each Possible_country string against the country names obtained from pycountry.
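
To make the idea concrete, here is a minimal sketch of picking the best-scoring country for one candidate string. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the fuzzywuzzy scorers, so its scores will not be identical to those of `fuzz.WRatio` or `fuzz.QRatio`. The helper names `similarity` and `best_match` and the sample country list are assumptions made for illustration and do not appear in the notebook.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score between two strings, ignoring case."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(candidate, names):
    """Return the (name, score) pair with the highest score for the candidate."""
    scored = [(name, similarity(candidate, name)) for name in names]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical sample; the notebook uses every country listed in pycountry.
sample_names = ["kenya", "mexico", "india"]
print(best_match("Mexco", sample_names))
```

The cells below apply the same "score every pair, keep the maximum" idea to whole DataFrame columns instead of one string at a time.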

In [3]:
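def get_short_name(long_name):
    #Keep the part before the first comma, e.g. 'Korea, Republic of' -> 'korea'
    name=long_name.split(',')[0].lower()
    #Drop generic words such as "people", "democratic" and "republic"
    name=re.sub(r'peoples*|democr\w*|repub\w*','',name)
    #Remove every character except letters, digits, whitespace, dots and parentheses
    re_string=r'[^0-9a-zA-Z\s.\(\)]+'
    name=re.sub(re_string, '',name)
    return name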

In [4]:
#Create a DataFrame with the names of the countries
Countries=pd.DataFrame({'Country_Code':[], 'Country_Name':[],'Official_name':[],'Short_name':[]})
for ind,country in enumerate(pyco.countries):
    name=country.name
    short_name=get_short_name(name)
    #Not every pycountry entry has an official name; fall back to the common name
    official=getattr(country,'official_name',name)
    Countries.loc[ind]=[country.alpha_3,name,official,short_name]
#Mexico is easy to confuse, so we add a row that uses its official name as the short name
Countries.loc[ind+1]=['MEX','Mexico','United Mexican States','united mexican states']

In [5]:
#To fuzzy-match every possible country name, we build an expanded DataFrame
#that has one row per (filename, possible country name) pair
Extended_DF=pd.DataFrame({'filename':[],'Possible_country':[]})
new_ind=0
for ind in Attributes_DF.index:
    filename=Attributes_DF.filename[ind]
    possible_names=Attributes_DF.Possible_country_name[ind]
    #Skip documents with no candidate names (missing values are not strings)
    if not isinstance(possible_names,str):
        continue
    for possible in possible_names.split('\n'):
        Extended_DF.loc[new_ind]=[filename,possible]
        new_ind+=1

In [6]:
Extended_DF.shape

(16074, 2)

In [7]:
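#Cross join the two DataFrames on a constant key, so that every possible
#country name is paired with every country
Extended_DF['Key']=1
Countries['Key']=1
Combined=Extended_DF.merge(Countries,on='Key',how='left')
Combined=Combined.drop('Key',axis=1)
#The score depends only on the pair of names, so we drop the filename and deduplicate
Combined=Combined.drop('filename',axis=1).drop_duplicates()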

In [8]:
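#We use two fuzzywuzzy scorers (WRatio and QRatio) and wrap them with np.vectorize so that they
#can be applied element-wise to whole columns (np.vectorize is a convenience wrapper, not a speed-up)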
def partial_match_long(x,y):
    return(fuzz.WRatio(x.lower(),y.lower()))
partial_match_long_vector = np.vectorize(partial_match_long)

def partial_match_short(x,y):
    return(fuzz.QRatio(x.lower(),y.lower()))
partial_match_short_vector = np.vectorize(partial_match_short)

In [9]:
#Decide on the thresholds
score_treshold1=80
score_treshold2=50

#Find the match score with the Short Name, drop matches that do not exceed the first threshold,
#and keep only the best match per possible_country
Combined['Score1']=partial_match_long_vector(Combined['Possible_country'],Combined['Short_name'])
Combined['Max_Score1']=Combined.groupby(['Possible_country'])['Score1'].transform('max')
Combined=Combined[Combined.Score1>score_treshold1]
Combined=Combined[Combined.Score1==Combined.Max_Score1]

In [10]:
#Find the match score with the Short Name using the second metric, and keep only the best match per possible_country
Combined['Score2']=partial_match_short_vector(Combined['Possible_country'],Combined['Short_name'])
Combined['Max_Score2']=Combined.groupby(['Possible_country'])['Score2'].transform('max')
Combined=Combined[Combined.Score2==Combined.Max_Score2]

In [11]:
#Find the match score with the Official Name, and keep only the best match per possible_country
Combined['Score3']=partial_match_short_vector(Combined['Possible_country'],Combined['Official_name'])
Combined['Max_Score3']=Combined.groupby(['Possible_country'])['Score3'].transform('max')
Combined=Combined[Combined.Score3==Combined.Max_Score3]

In [12]:
#We average all nonzero scores and apply the second threshold
#(zeros are treated as missing so that they do not drag the mean down)
Scores=Combined[['Score1','Score2','Score3']].astype(float).replace(0,np.nan)
Combined['Score_final']=np.nanmean(Scores,axis=1)
Combined=Combined[Combined.Score_final>=score_treshold2]
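
The combination rule used in this cell can be sketched without pandas: zeros are ignored, the remaining scores are averaged, and the result is compared with the second threshold. The function names `mean_nonzero` and `passes_threshold` are hypothetical helpers for illustration; the default threshold of 50 mirrors `score_treshold2` above.

```python
def mean_nonzero(scores):
    """Average the nonzero scores; return None when every score is zero."""
    nonzero = [s for s in scores if s != 0]
    if not nonzero:
        return None
    return sum(nonzero) / len(nonzero)


def passes_threshold(scores, threshold=50):
    """True when the mean of the nonzero scores reaches the threshold."""
    final = mean_nonzero(scores)
    return final is not None and final >= threshold


print(mean_nonzero([90, 0, 60]))
print(passes_threshold([90, 0, 60]))
```

Ignoring zeros means that a candidate with one strong score and one zero is judged on the strong score alone, rather than on a diluted average.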

In [13]:
#Keep only the columns we need before recombining with the filenames
Combined=Combined[['Possible_country','Country_Code','Country_Name','Score_final','Score1']]

In [14]:
Identified_Countries=Extended_DF.merge(Combined,on='Possible_country').drop(['Key'],axis=1).drop_duplicates()

In [15]:
Identified_Countries.shape

(2936, 6)

In [16]:
#For each file, keep the candidate with the highest Score1
Identified_Countries=Identified_Countries.sort_values('Score1',ascending=False).groupby('filename').first().reset_index()

In [17]:
Attributes_DF=Attributes_DF.merge(Identified_Countries,how='outer',on='filename')

In [18]:
#How many documents still have no country code?
Attributes_DF.Country_Code.isnull().sum()

604

In [19]:
#Rename the Country Code
Attributes_DF.rename({'Country_Code':'Country_Code_Cover'},axis=1,inplace=True)

# Extract Country from Address
We extract the country from the address by checking which country name or capital city from the world_bank_country_mappings data appears in it. Because this is a plain substring search, we manually handle many common false matches, such as identifying "Niger" when the country is actually "Nigeria".
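
The "Niger" versus "Nigeria" problem arises because one name is a substring of the other. The notebook handles such cases with explicit exceptions in the loop below. As an alternative, a minimal sketch (not the method used here) is to collect every name found in the address and keep the longest one. The helpers `normalise` and `find_country` and the sample list are assumptions for illustration.

```python
def normalise(text):
    """Lower-case the text and remove spaces, newlines and apostrophes."""
    for char in (" ", "\n", "'", "\u2019"):
        text = text.replace(char, "")
    return text.lower()


def find_country(address, country_names):
    """Return the longest country name contained in the address, or None."""
    clean_address = normalise(address)
    hits = [name for name in country_names if normalise(name) in clean_address]
    return max(hits, key=len) if hits else None


# Hypothetical sample; the notebook reads the names from world_bank_country_mappings.csv.
names = ["Niger", "Nigeria", "Oman", "Romania"]
print(find_country("Ministry of Finance\nLagos, Nigeria", names))
```

Preferring the longest hit only resolves cases where one country name is nested inside another. A short name that happens to appear inside an unrelated word in the address would still need a manual rule.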

In [20]:
worldmap = pd.read_csv('../data/world_bank_country_mappings.csv')

In [21]:
country_name = []
country_code = []
income_level = []
region = []
for i in Attributes_DF.index:
    address = Attributes_DF.Address[i]
    if type(address)==str:
        address = address.replace('\n','')
        address = address.replace(' ','')
        address = address.replace('  ','')
        address = address.replace("’","")
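        address = address.lower()
        #Default to "no match" so that values from a previous row are never reused
        countryname = countrycode = incomelevel = regi = ''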
        for j in range(len(worldmap['name'])):

            if worldmap['name'][j] == 'Africa':
                continue

            if worldmap['name'][j] == 'United States':
                continue

            if worldmap['name'][j] == 'World':
                continue

            if worldmap['name'][j] == 'Niger' and address.find('nigeria') > -1:
                continue

            if worldmap['name'][j] == 'Oman' and address.find('romania') > -1:
                continue

            if worldmap['name'][j] == 'Oman' and address.find('phi') > -1:
                continue

            if worldmap['name'][j] == 'Mali' and address.find('turkey') > -1:
                continue

            if worldmap['name'][j] == 'Guinea' and address.find('papua') > -1:
                countryname = 'Papua New Guinea'
                countrycode = 'PNG'
                incomelevel = 'Lower middle income'
                regi = 'East Asia & Pacific'
                break

            if worldmap['name'][j] == 'Spain' and address.find('trinidad')> -1:
                countryname = 'Trinidad and Tobago'
                countrycode = 'TTO'
                incomelevel = 'High income'
                regi = 'Latin America & Caribbean '
                break

            if worldmap['name'][j] == 'Chile' and address.find('paraguay')> -1:
                countryname = 'Paraguay'
                countrycode = 'PRY'
                incomelevel = 'Upper middle income'
                regi = 'Latin America & Caribbean '
                break

            else:    
                name = worldmap['name'][j]
                name = name.replace(' ','')
                name = name.replace("'","")
                name = name.replace('  ','')
                name = name.lower()
                cut = name.find(',')
                if cut > 0:
                    name = name[:cut]
                countryname = ''
                countrycode = ''
                incomelevel = ''
                regi = ''



                if address.find(name) > -1:
                    if name == 'congo' and address.find('brazzaville')>-1:
                        countryname = 'Congo,Rep.'
                        countrycode = 'COG'
                        incomelevel = 'Lower middle income'
                        regi = 'Sub-Saharan Africa'
                        break

                    else:  
                        countryname = worldmap['name'][j]
                        countrycode = worldmap['id'][j]
                        incomelevel = worldmap['incomeLevel.value'][j]
                        regi = worldmap['region.value'][j]
                        break            


                city = worldmap['capitalCity'][j]
                if type(city) == str:
                    city = city.replace(' ','')
                    city = city.replace("'","")
                    city = city.replace('  ','')
                    city = city.lower()
                    if address.find(name) == -1 and address.find(city) >-1:
                        countryname = worldmap['name'][j]
                        countrycode = worldmap['id'][j]
                        incomelevel = worldmap['incomeLevel.value'][j]
                        regi = worldmap['region.value'][j]
                        break                       


        if countryname == '':
            if address.find('méxico') > -1 or address.find('mexican') > -1:
                countryname = 'Mexico'
                countrycode = 'MEX'
                incomelevel = 'Upper middle income'
                regi = 'Latin America & Caribbean '

        if address.find('yugoslavia') > -1:
            countryname = 'Yugoslavia'
            countrycode = 'YUGOS'
            incomelevel = 'Upper middle income'
            regi = 'Europe & Central Asia'

        if countryname == '':
            countryname = None
            countrycode = None
            incomelevel = None
            regi = None
    else:
        countryname=countrycode=incomelevel=regi=None

    country_name.append(countryname)
    country_code.append(countrycode)
    income_level.append(incomelevel)
    region.append(regi)

In [22]:
#country_code has one entry per row of Attributes_DF, in the same order
Attributes_DF['Country_Code_Address'] = country_code

# Compare Both Country Codes

In [23]:
cover_code = Attributes_DF['Country_Code_Cover']
address_code = Attributes_DF['Country_Code_Address']

In [24]:
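#Sort every row into one of five buckets
same = []      #both codes agree
nan = []       #cover code missing, address code found
none = []      #cover code found, address code missing
twonulls = []  #both codes missing
twodiff = []   #both codes found, but they differ


#A missing cover code is NaN (a float); a missing address code is None
for i in range(len(Attributes_DF)):
    if cover_code[i] != address_code[i] and type(cover_code[i])== float:
        if address_code[i] is None:
            twonulls.append(i)
        else:
            nan.append(i)
    elif cover_code[i] != address_code[i] and type(cover_code[i])== str:
        if address_code[i] is None:
            none.append(i)
        else:
            twodiff.append(i)
    else:
        same.append(i)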

In [25]:
print(len(same),len(nan),len(none),len(twonulls),len(twodiff))

2478 526 74 78 49
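
The five counts above correspond to the five possible outcomes of comparing the two codes. As a minimal sketch, the same classification can be written as a single function and tallied with `collections.Counter`. The function `compare_codes`, its labels, and the sample pairs are hypothetical, and the sketch assumes that a missing code is represented by `None` on both sides, whereas in the notebook a missing cover code is NaN.

```python
from collections import Counter


def compare_codes(cover, address):
    """Label a pair of country codes; a missing code is given as None."""
    if cover is None and address is None:
        return "both_missing"      # twonulls in the notebook
    if cover is None:
        return "cover_missing"     # nan in the notebook
    if address is None:
        return "address_missing"   # none in the notebook
    return "same" if cover == address else "different"  # same / twodiff


# Hypothetical (cover, address) pairs for illustration only.
pairs = [("KEN", "KEN"), (None, "MEX"), ("THA", None), (None, None), ("NER", "NGA")]
tally = Counter(compare_codes(cover, address) for cover, address in pairs)
print(tally)
```

Only the rows labelled "different" need the manual review carried out in the following cells.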


In [26]:
bettercode = []
for i in twodiff:
    address = Attributes_DF["Address"][i]
    address = address.replace('\n','')
    address = address.replace(' ','')
    address = address.replace('  ','')
    address = address.replace("’","")
    address = address.lower()
    
    if address_code[i] == 'BRA' and address.find('brazil')> -1:
        bettercode.append(i)
        continue
        
    if address_code[i] != 'YUGOS':
    
        ind = worldmap[worldmap['id'] == address_code[i]].index[0]
    
        #Strip spaces as well, because they were removed from the address above
        country = worldmap['name'][ind]
        country = country.replace(' ','').lower()
        
        if address.find(country) >-1:
            bettercode.append(i)
            continue
                
        city = worldmap['capitalCity'][ind]
    
        if type(city) == str:
            city = city.replace(' ','')
            city = city.replace("'","")
            city = city.replace('  ','')
            city = city.lower()
            if address.find(city) >-1:
                bettercode.append(i)

In [27]:
rest_twodiff = list(set(twodiff) - set(bettercode))

In [28]:
for i in rest_twodiff:
    #Rows coded as Yugoslavia ('YUGOS') keep their address-based code
    if address_code[i] == 'YUGOS':
        continue
    if cover_code[i] == 'THA' or cover_code[i] == 'ALB':
        #Assign through .loc on the DataFrame to avoid chained assignment
        Attributes_DF.loc[i,'Country_Code_Address'] = cover_code[i]


In [29]:
#Where only the cover page produced a code, use the cover code
for i in none:
    #Assign through .loc on the DataFrame to avoid chained assignment
    Attributes_DF.loc[i,'Country_Code_Address'] = cover_code[i]


In [30]:
#Finally, we store the reconciled code in a single column
Attributes_DF['Country_Code']=Attributes_DF['Country_Code_Address']

In [31]:
#Clear the code 'IBB', which we do not treat as a country
Attributes_DF.loc[Attributes_DF.Country_Code=='IBB','Country_Code']=None


In [32]:
DF_to_export=Attributes_DF.drop(['Country_code_pdf','Possible_country_name','Address','Possible_country','Country_Code_Cover','Country_Name',
               'Score_final','Score1','Country_Code_Address'],axis=1)

In [33]:
#Keep only the columns to attach, renamed to match the exported DataFrame
worldmap=worldmap[['id','name','region.value','incomeLevel.value']].rename(
    columns={'id':'Country_Code','name':'Country','region.value':'Region',
             'incomeLevel.value':'Income_level'})

In [34]:
DF_to_export=DF_to_export.merge(worldmap,how='left',on='Country_Code')

In [35]:
#Export the dataframe
DF_to_export.to_csv('../data/Extracted_Attributes_LoanUSD_Country.csv',index=False)