### In this notebook, I extract and load the World by Income and Region dataset from its <a href ="https://datatopics.worldbank.org/world-development-indicators/the-world-by-income-and-region.html">source</a> and transform it into the appropriate format for Tableau.
### I also make sure that each country's name in the world income dataset matches the country's name in the COVID-19 and population datasets.

In [1]:
import pandas as pd



I need two datasets to complete this task properly:
- Covid-19 Dataset
- The world income and region dataset


In [4]:
# Load the COVID-19 and world income datasets
covid19 = pd.read_csv('dataset/COVID-19.csv')
world_income = pd.read_csv('dataset/data-XHzgJ.csv')
covid19.shape, world_income.shape

((179140, 9), (217, 38))

In [5]:
world_income.head()

Unnamed: 0,Country,Income group,Region,Lending category,1987,1988,1989,1990,1991,1992,...,2011,2012,2013,2014,2015,2016,2017,2018,2019,2020
0,Aruba,High income,Latin America & Caribbean,,,10360.0,11760.0,12230.0,13190.0,13990.0,...,22450.0,23520.0,24510.0,25350.0,26560.0,26840.0,27120.0,,,
1,Afghanistan,Low income,South Asia,IDA,,,,,,,...,530.0,630.0,660.0,630.0,600.0,550.0,530.0,520.0,530.0,500.0
2,Angola,Lower middle income,Sub-Saharan Africa,IBRD,670.0,650.0,860.0,780.0,1380.0,1170.0,...,3410.0,4170.0,4780.0,5010.0,4520.0,3770.0,3450.0,3210.0,2970.0,2230.0
3,Albania,Upper middle income,Europe & Central Asia,IBRD,730.0,730.0,760.0,650.0,410.0,280.0,...,4410.0,4360.0,4540.0,4540.0,4390.0,4320.0,4290.0,4860.0,5220.0,5210.0
4,Andorra,High income,Europe & Central Asia,,,,,,,,...,,,,,,,,,,


In [7]:
covid19.head()

Unnamed: 0.1,Unnamed: 0,Country/Region,Province/State,Lat,Long,date,confirmed,recovery,deaths
0,0,Afghanistan,,33.93911,67.709953,1/22/20,0,0,0
1,1,Albania,,41.1533,20.1683,1/22/20,0,0,0
2,2,Algeria,,28.0339,1.6596,1/22/20,0,0,0
3,3,Andorra,,42.5063,1.5218,1/22/20,0,0,0
4,4,Angola,,-11.2027,17.8739,1/22/20,0,0,0


In [10]:
world_income.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 217 entries, 0 to 216
Data columns (total 38 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   Country           217 non-null    object 
 1   Income group      216 non-null    object 
 2   Region            217 non-null    object 
 3   Lending category  144 non-null    object 
 4   1987              134 non-null    float64
 5   1988              136 non-null    float64
 6   1989              139 non-null    float64
 7   1990              142 non-null    float64
 8   1991              141 non-null    float64
 9   1992              150 non-null    float64
 10  1993              153 non-null    float64
 11  1994              158 non-null    float64
 12  1995              165 non-null    float64
 13  1996              167 non-null    float64
 14  1997              176 non-null    float64
 15  1998              176 non-null    float64
 16  1999              177 non-null    float64
 1

In [14]:
world_income['Income group'].unique()

array(['High income', 'Low income', 'Lower middle income',
       'Upper middle income', nan], dtype=object)

> As the output above shows, the world income dataset contains information about each country: its name, income group (High, Low, Lower middle, or Upper middle income), region, and lending category, followed by one numeric column per year starting in 1987.
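
The `nan` in the `unique()` output, together with the 216 non-null count for `Income group`, suggests that one country has no income classification. A minimal sketch of how such a row could be located, using a tiny stand-in table rather than the real file (the row named "Exampleland" is made up for illustration):

```python
import pandas as pd

# Tiny stand-in for world_income; "Exampleland" is a hypothetical row
# with no income group, added only to illustrate the check.
sample = pd.DataFrame({
    "Country": ["Aruba", "Afghanistan", "Exampleland"],
    "Income group": ["High income", "Low income", None],
})

# Keep only the rows whose income group is missing
missing_income = sample[sample["Income group"].isna()]
print(missing_income["Country"].tolist())
```

On the real data, the same filter would presumably be `world_income[world_income['Income group'].isna()]`.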

### Now I will match the country names across the two datasets

In [28]:
# Create an array of the unique country names in the covid19 dataset
countries_covid = covid19['Country/Region'].unique()

# Create an array of the unique country names in the world income dataset
countries_GNI = world_income['Country'].unique()

# Use a list comprehension to collect the covid19 country names
#   that have no match in the world income dataset
unmatched = [x for x in countries_covid if x not in countries_GNI]

print(
      len(unmatched),
      'Countries in the covid19 dataset do not exist or are not matched in the world income dataset \n\n',
      unmatched,
      '\n')
print('List of countries in the world income dataset that do not exist or are not matched with countries in the covid19 dataset.  \n\n',
      [x for x in countries_GNI if x not in countries_covid])



27 Countries in the covid19 dataset do not exist or are not matched in the world income dataset 

 ['Bahamas', 'Brunei', 'Congo', 'DR Congo', "Côte d'Ivoire", 'Czech Republic (Czechia)', 'Diamond Princess', 'Egypt', 'Gambia', 'Holy See', 'Iran', 'South Korea', 'Kyrgyzstan', 'Laos', 'MS Zaandam', 'Micronesia', 'Russia', 'Saint Kitts & Nevis', 'Saint Lucia', 'St. Vincent & Grenadines', 'Sao Tome & Principe', 'Slovakia', 'Summer Olympics 2020', 'Syria', 'Taiwan', 'Venezuela', 'Yemen'] 

List of countries in the world income dataset that do not exist or are not matched with countries in the covid19 dataset.  

 ['Aruba', 'American Samoa', 'Bahamas, The', 'Bermuda', 'Brunei Darussalam', 'Channel Islands', "Cote d'Ivoire", 'Congo, Dem. Rep.', 'Congo, Rep.', 'Curacao', 'Cayman Islands', 'Czech Republic', 'Egypt, Arab Rep.', 'Faeroe Islands', 'Micronesia, Fed. Sts.', 'Gibraltar', 'Gambia, The', 'Greenland', 'Guam', 'Hong Kong SAR, China', 'Isle of Man', 'Iran, Islamic Rep.', 'Kyrgyz Republic',

> Some countries are named differently in the two datasets, and others appear in only one of them.
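
One possible way to separate the two cases (renamed versus truly absent) is sketched below. It uses set differences plus `difflib.get_close_matches` from the standard library, which is not what this notebook uses; the name sets are small subsets copied from the outputs above, and the variable names are only illustrative.

```python
from difflib import get_close_matches

# Small subsets of the two name lists, taken from the outputs above
covid_names = {"Albania", "Bahamas", "Brunei", "Taiwan"}
income_names = {"Albania", "Aruba", "Bahamas, The", "Brunei Darussalam"}

# Names that appear on one side only
only_in_covid = sorted(covid_names - income_names)
only_in_income = sorted(income_names - covid_names)

# For each covid19-only name, look for a similar-looking income-side name.
# An empty list means no candidate passed difflib's default similarity cutoff.
suggestions = {
    name: get_close_matches(name, only_in_income, n=1)
    for name in only_in_covid
}
print(suggestions)
```

The suggestions are only hints. Short names may get no candidate at the default cutoff, so the final mapping still needs a manual review.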

In [30]:

# Rename the countries in the world income dataset whose names are formatted differently in the covid19 dataset.

country_mapper = {
    'Bahamas, The' : 'Bahamas',
    'Brunei Darussalam' : 'Brunei',
    'Cote d\'Ivoire' : 'Côte d\'Ivoire',
    'Congo, Dem. Rep.' : 'DR Congo',
    'Congo, Rep.' : 'Congo', 
    'Czech Republic' : 'Czech Republic (Czechia)',
    'Egypt, Arab Rep.' : 'Egypt',
    'Gambia, The' : 'Gambia',
    'Iran, Islamic Rep.' : 'Iran',
    'Korea, Rep.' : 'South Korea',
    'Kyrgyz Republic' : 'Kyrgyzstan',
    'Lao PDR' : 'Laos',
    'Micronesia, Fed. Sts.' : 'Micronesia',
    'Russian Federation':'Russia',
    'St. Kitts and Nevis' : 'Saint Kitts & Nevis',
    'St. Lucia' : 'Saint Lucia',
    'St. Vincent and the Grenadines' : 'St. Vincent & Grenadines',
    'Sao Tome and Principe' : 'Sao Tome & Principe',
    'Slovak Republic' :'Slovakia',
    'Syrian Arab Republic':'Syria', 
    'Venezuela, RB' :'Venezuela', 
    'Yemen, Rep.' :'Yemen',
}

world_income['Country'] = world_income['Country'].replace(country_mapper)
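
`Series.replace` with a dictionary swaps whole cell values that equal a key and silently ignores keys that never occur, so a misspelled key goes unnoticed. A small sketch of a sanity check, assuming a stand-in series and a mapper with one deliberately misspelled key (`'Lao PRD'` is hypothetical and does not come from the mapping above):

```python
import pandas as pd

# Small stand-in for world_income['Country']
countries = pd.Series(["Bahamas, The", "Albania", "Korea, Rep.", "Lao PDR"])

# Subset of the mapping above, plus one deliberately misspelled key
mapper = {
    "Bahamas, The": "Bahamas",
    "Korea, Rep.": "South Korea",
    "Lao PRD": "Laos",  # hypothetical typo: matches no value in the column
}

# Keys that never occur in the column would be skipped by replace() without warning
present = set(countries)
unused_keys = [key for key in mapper if key not in present]

# Apply the renaming; values without a matching key are left unchanged
renamed = countries.replace(mapper)
print(unused_keys)
print(renamed.tolist())
```

Checking `unused_keys` before replacing is one way to catch typos in the mapping early.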



### Repeat the process of finding unmatched countries in the two datasets

In [32]:
# Create an array of the unique country names in the covid19 dataset
countries_covid = covid19['Country/Region'].unique()

# Create an array of the unique country names in the world income dataset
countries_GNI = world_income['Country'].unique()

# Use a list comprehension to collect the covid19 country names
#   that have no match in the world income dataset
unmatched = [x for x in countries_covid if x not in countries_GNI]

print(
      len(unmatched),
      'Countries in the covid19 dataset do not exist or are not matched in the world income dataset \n\n',
      unmatched,
      '\n')
print('List of countries in the world income dataset that do not exist or are not matched with countries in the covid19 dataset.  \n\n',
      [x for x in countries_GNI if x not in countries_covid])




5 Countries in the covid19 dataset do not exist or are not matched in the world income dataset 

 ['Diamond Princess', 'Holy See', 'MS Zaandam', 'Summer Olympics 2020', 'Taiwan'] 

List of countries in the world income dataset that do not exist or are not matched with countries in the covid19 dataset.  

 ['Aruba', 'American Samoa', 'Bermuda', 'Channel Islands', 'Curacao', 'Cayman Islands', 'Faeroe Islands', 'Gibraltar', 'Greenland', 'Guam', 'Hong Kong SAR, China', 'Isle of Man', 'Macao SAR, China', 'St. Martin (French part)', 'Northern Mariana Islands', 'New Caledonia', 'Nauru', 'Puerto Rico', 'Korea, Dem. Rep.', 'West Bank and Gaza', 'French Polynesia', 'Sint Maarten (Dutch part)', 'Turks and Caicos Islands', 'Turkmenistan', 'Tuvalu', 'British Virgin Islands', 'Virgin Islands (U.S.)']
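
The five remaining covid19 entries have no counterpart in the world income dataset, so they would end up without an income group if the two tables were joined on country name. The sketch below illustrates this with a left merge and `indicator=True`. It assumes the join is done in pandas, whereas in this project it may well happen in Tableau, and the two small frames are stand-ins built from names in the outputs above.

```python
import pandas as pd

# Stand-ins for the two datasets, using names from the outputs above
covid_sample = pd.DataFrame(
    {"Country/Region": ["Albania", "Diamond Princess", "Taiwan"]}
)
income_sample = pd.DataFrame({
    "Country": ["Albania", "Aruba"],
    "Income group": ["Upper middle income", "High income"],
})

# Left merge keeps every covid19 row; _merge records whether a match was found
merged = covid_sample.merge(
    income_sample,
    how="left",
    left_on="Country/Region",
    right_on="Country",
    indicator=True,
)

# covid19 entries that found no income record
no_income = merged.loc[merged["_merge"] == "left_only", "Country/Region"].tolist()
print(no_income)
```

Rows flagged `left_only` carry a missing `Income group`, which is worth keeping in mind when filtering or coloring by income group later.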


In [34]:
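# Save the updated dataset; index=False avoids writing the row index as an extra unnamed column
world_income.to_csv('dataset/updated-data-XHzgJ.csv', index=False)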
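
The saved file keeps one column per year (wide layout). If a long layout with one row per country and year turns out to be more convenient in Tableau, which is an assumption about the intended target format, `DataFrame.melt` could produce it. A minimal sketch on two rows and two year columns copied from the `head()` output above; the column names `Year` and `Value` are illustrative:

```python
import pandas as pd

# Two rows and two year columns copied from the head() output above
wide = pd.DataFrame({
    "Country": ["Angola", "Albania"],
    "Income group": ["Lower middle income", "Upper middle income"],
    "1987": [670.0, 730.0],
    "1988": [650.0, 730.0],
})

# Columns that identify a country stay as-is; every other column is treated as a year
id_cols = ["Country", "Income group"]
long = wide.melt(id_vars=id_cols, var_name="Year", value_name="Value")

# Year labels arrive as strings; convert them to integers
long["Year"] = long["Year"].astype(int)
print(long)
```

On the real data, `id_cols` would presumably also include `Region` and `Lending category`.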