## Overview
These instructions show users who are new to Python how to compile a subset of infant mortality data taken from the [World Bank](https://data.worldbank.org/indicator/SP.DYN.IMRT.IN).
* The process uses two `.csv` files to create a subset of high-income countries from different geographic regions, along with their infant mortality rates for 1960 and 2021.
* The steps are: downloading the files, importing packages, creating dataframes, filtering and merging the dataframes to produce a subset, and exporting the subset.
## Getting Started
1. Go to the [World Bank website](https://data.worldbank.org/indicator/SP.DYN.IMRT.IN) and download the global infant mortality data in `.csv` format.
2. To make data filtering easier later, rename the two `.csv` files that the World Bank provides (e.g., "Infant Mortality Rates 1.csv" and "Infant Mortality Rates 2.csv").
3. Save these `.csv` files to your Google Drive (this guide stores them in the `Colab Notebooks` folder).
4. Go to Google Colab and select File, then New Notebook.

## Packages
1. Now you will need to mount your Google Drive to your New Notebook by using the following command:


In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


2. Import both the NumPy and pandas packages using the following commands:

In [None]:
import numpy as np

In [None]:
import pandas as pd

> Note that the aliases `np` and `pd` make it easier to refer to these packages later when you call their functions.
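
Once the Drive is mounted, it can help to confirm that the files are where the notebook expects them before trying to read them. The sketch below is one possible way to do that. The helper name `missing_files` is hypothetical (it is not part of Colab or pandas), and the demo runs against a temporary folder so that it works outside Colab as well.

```python
from pathlib import Path
import tempfile


def missing_files(folder, names):
    """Return the entries of `names` that are not files inside `folder`.

    Hypothetical helper for illustration only.
    """
    folder = Path(folder)
    return [name for name in names if not (folder / name).is_file()]


# Demo: a temporary folder stands in for the Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    # Create only one of the two expected files.
    (Path(tmp) / "Infant Mortality Rates 1.csv").write_text("a,b\n1,2\n")
    result = missing_files(
        tmp,
        ["Infant Mortality Rates 1.csv", "Infant Mortality Rates 2.csv"],
    )

print(result)  # names that could not be found
```

In Colab, you would pass your own Drive folder (for example, the folder used in the `read_csv()` paths in this guide) instead of the temporary folder. An empty list means every file was found.
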

## Creating a Dataframe
1. Now that you have mounted your Google Drive, you can load your `.csv` files into dataframes using the `pd.read_csv()` commands below:

In [None]:
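# Country metadata: region and income group for each country code
df = pd.read_csv('gdrive/My Drive/Colab Notebooks/Infant Mortality Rates 2.csv')

In [None]:
# Infant mortality rates by country and year
df2 = pd.read_csv('gdrive/My Drive/Colab Notebooks/Infant Mortality Rates 1 - Infant Mortality Rates 1.csv')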

> Note that you can name the dataframes whatever you'd like; this guide uses `df` and `df2`. Make sure the file paths match the names and Drive folder of your own files.
2. To confirm that the dataframes loaded properly, type the name of each dataframe in its own cell. The notebook should display the data contained in the corresponding `.csv` file.

In [None]:
df

Unnamed: 0,Country Code,Region,IncomeGroup,SpecialNotes,TableName
0,ABW,Latin America & Caribbean,High income,,Aruba
1,AFE,,,"26 countries, stretching from the Red Sea in t...",Africa Eastern and Southern
2,AFG,South Asia,Low income,The reporting period for national accounts dat...,Afghanistan
3,AFW,,,"22 countries, stretching from the westernmost ...",Africa Western and Central
4,AGO,Sub-Saharan Africa,Lower middle income,The World Bank systematically assesses the app...,Angola
...,...,...,...,...,...
260,XKX,Europe & Central Asia,Upper middle income,,Kosovo
261,YEM,Middle East & North Africa,Low income,The World Bank systematically assesses the app...,"Yemen, Rep."
262,ZAF,Sub-Saharan Africa,Upper middle income,Fiscal year end: March 31; reporting period fo...,South Africa
263,ZMB,Sub-Saharan Africa,Lower middle income,National accounts data were rebased to reflect...,Zambia


In [None]:
df2

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,Aruba,ABW,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,,,,...,,,,,,,,,,
1,Africa Eastern and Southern,AFE,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,,,,...,50.858298,49.416164,48.047765,46.638627,45.268284,44.081350,43.027778,42.004211,,
2,Afghanistan,AFG,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,228.9,225.1,221.2,...,55.000000,53.000000,51.100000,49.400000,47.800000,46.300000,44.800000,43.400000,,
3,Africa Western and Central,AFW,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,,,,...,69.988550,68.760967,67.571981,66.373973,64.945255,63.556011,62.165177,60.749633,,
4,Angola,AGO,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,,,,...,60.500000,57.900000,55.700000,53.800000,52.000000,50.400000,48.700000,47.200000,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
261,Kosovo,XKX,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,,,,...,13.800000,13.000000,12.200000,11.500000,10.800000,10.100000,9.600000,9.100000,,
262,"Yemen, Rep.",YEM,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,278.2,272.0,265.1,...,45.400000,46.200000,46.100000,46.000000,46.600000,46.500000,45.800000,46.700000,,
263,South Africa,ZAF,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,,,,,...,30.200000,29.400000,28.900000,28.300000,27.800000,27.300000,26.900000,26.400000,,
264,Zambia,ZMB,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,121.5,119.4,117.6,116.1,114.7,113.7,...,47.200000,47.200000,45.600000,44.200000,43.600000,42.400000,41.100000,40.200000,,
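
Displaying a whole dataframe is one way to check a load. For wide tables like `df2`, a few summary calls can be easier to read. The following is a minimal sketch that uses a tiny inline stand-in for the metadata file (values copied from the preview above, with the `SpecialNotes` column left out) rather than the real download; `sample_csv` and `sample_df` are illustrative names.

```python
import io

import pandas as pd

# Tiny stand-in for the metadata file, so the example runs anywhere.
sample_csv = io.StringIO(
    "Country Code,Region,IncomeGroup,TableName\n"
    "ABW,Latin America & Caribbean,High income,Aruba\n"
    "AFG,South Asia,Low income,Afghanistan\n"
    "AGO,Sub-Saharan Africa,Lower middle income,Angola\n"
)
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)          # (number of rows, number of columns) -> (3, 4)
print(list(sample_df.columns))  # exact column names, needed for filtering later
print(sample_df.head(2))        # only the first two rows
```

Checking the column names at this point is useful because the filtering steps depend on exact spellings such as `IncomeGroup` and `Country Code`.
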


## Filtering Data
Now that you have loaded your `.csv` files, it is time to filter your data!
1. To separate the high income countries from countries of other income levels in the dataset, use the following command:

In [None]:
# Keep only the rows whose IncomeGroup is "High income"; .copy() makes an independent dataframe
economy = df[df["IncomeGroup"] == "High income"].copy()

2. Again, confirm that the command worked by typing the name of the new dataframe:

In [None]:
economy

Unnamed: 0,Country Code,Region,IncomeGroup,SpecialNotes,TableName
0,ABW,Latin America & Caribbean,High income,,Aruba
6,AND,Europe & Central Asia,High income,,Andorra
8,ARE,Middle East & North Africa,High income,,United Arab Emirates
11,ASM,East Asia & Pacific,High income,,American Samoa
12,ATG,Latin America & Caribbean,High income,,Antigua and Barbuda
...,...,...,...,...,...
241,TTO,Latin America & Caribbean,High income,,Trinidad and Tobago
249,URY,Latin America & Caribbean,High income,,Uruguay
250,USA,North America,High income,,United States
254,VGB,Latin America & Caribbean,High income,,British Virgin Islands
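
The filter in step 1 works in two stages: the comparison produces a Series of `True`/`False` values (a boolean mask), and indexing the dataframe with that mask keeps only the `True` rows. The sketch below shows this on a small hand-made dataframe; the names `sample_df`, `mask`, and `high_income` are illustrative.

```python
import pandas as pd

sample_df = pd.DataFrame(
    {
        "Country Code": ["ABW", "AFG", "AND", "AGO"],
        "IncomeGroup": [
            "High income",
            "Low income",
            "High income",
            "Lower middle income",
        ],
    }
)

# Stage 1: one True/False value per row.
mask = sample_df["IncomeGroup"] == "High income"

# Stage 2: keep only the rows where the mask is True.
high_income = sample_df[mask].copy()

print(mask.tolist())                         # [True, False, True, False]
print(high_income["Country Code"].tolist())  # ['ABW', 'AND']
print(high_income.index.tolist())            # [0, 2]
```

Note that the filtered rows keep their original index labels, which is why the index in the `economy` output above jumps from 0 to 6 to 8.
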


3. Now that you know that your code worked, you can merge the filtered dataframe (`economy`) with the second dataframe (`df2`) using the following `pd.merge()` function:

In [None]:
mergedData = pd.merge(economy, df2, on="Country Code")

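> Note that, in this case, the two datasets can be merged because they have a column in common: `Country Code`. By default, `pd.merge()` performs an inner join, so only countries that appear in both dataframes are kept.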
4. Time to make sure that your code worked:

In [None]:
mergedData

Unnamed: 0,Country Code,Region,IncomeGroup,SpecialNotes,TableName,Country Name,Indicator Name,Indicator Code,1960,1961,...,2014,2015,2016,2017,2018,2019,2020,2021,2022,2023
0,ABW,Latin America & Caribbean,High income,,Aruba,Aruba,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,...,,,,,,,,,,
1,AND,Europe & Central Asia,High income,,Andorra,Andorra,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,...,3.5,3.3,3.2,3.0,2.9,2.8,2.7,2.6,,
2,ARE,Middle East & North Africa,High income,,United Arab Emirates,United Arab Emirates,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,135.5,128.9,...,6.7,6.5,6.3,6.2,6.0,5.8,5.6,5.4,,
3,ASM,East Asia & Pacific,High income,,American Samoa,American Samoa,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,,,...,,,,,,,,,,
4,ATG,Latin America & Caribbean,High income,,Antigua and Barbuda,Antigua and Barbuda,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,62.9,58.8,...,6.8,6.5,6.2,6.0,5.7,5.5,5.3,5.2,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
77,TTO,Latin America & Caribbean,High income,,Trinidad and Tobago,Trinidad and Tobago,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,56.0,54.3,...,18.1,17.5,17.0,16.5,16.0,15.5,15.0,14.6,,
78,URY,Latin America & Caribbean,High income,,Uruguay,Uruguay,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,56.6,55.4,...,8.0,7.7,7.3,6.8,6.3,5.8,5.4,5.0,,
79,USA,North America,High income,,United States,United States,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,25.9,25.4,...,5.9,5.8,5.7,5.7,5.6,5.5,5.4,5.4,,
80,VGB,Latin America & Caribbean,High income,,British Virgin Islands,British Virgin Islands,"Mortality rate, infant (per 1,000 live births)",SP.DYN.IMRT.IN,68.9,65.6,...,12.3,12.0,11.6,11.3,10.9,10.6,10.3,9.9,,


> As you can see, the two datasets have officially been merged!
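
To see what the merge does on a smaller scale, here is a minimal sketch with two hand-made dataframes (a few values are taken from the output above). The `how` and `validate` arguments are optional extras that the step above does not use: `how="inner"` spells out the default join type, and `validate="one_to_one"` asks pandas to raise an error if a country code is duplicated on either side.

```python
import pandas as pd

# Left side: already filtered to high-income countries.
left = pd.DataFrame(
    {
        "Country Code": ["AND", "ARE", "USA"],
        "Region": [
            "Europe & Central Asia",
            "Middle East & North Africa",
            "North America",
        ],
    }
)

# Right side: rates for all countries, including one not in `left`.
right = pd.DataFrame(
    {
        "Country Code": ["AFG", "AND", "ARE", "USA"],
        "2021": [43.4, 2.6, 5.4, 5.4],
    }
)

merged = pd.merge(
    left,
    right,
    on="Country Code",
    how="inner",             # keep only codes found in both dataframes
    validate="one_to_one",   # fail loudly if a code is duplicated
)

print(merged)
print(len(merged))  # 3: AFG is dropped because it is not in `left`
```

The merged result has a fresh index starting at 0, which matches the renumbered index in the `mergedData` output above.
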
5. Using the merged dataset, use the `.loc` indexer as shown below to select the columns that you want to keep for further analysis:

> In this case, the columns you will isolate are `Region`, `Country Name`, `IncomeGroup`, `1960`, and `2021`.


In [None]:
final_subset = mergedData.loc[:, ["Region", "Country Name", "IncomeGroup", "1960", "2021"]]

6. Now it's time to ensure that your code worked properly:

In [None]:
final_subset

Unnamed: 0,Region,Country Name,IncomeGroup,1960,2021
0,Latin America & Caribbean,Aruba,High income,,
1,Europe & Central Asia,Andorra,High income,,2.6
2,Middle East & North Africa,United Arab Emirates,High income,135.5,5.4
3,East Asia & Pacific,American Samoa,High income,,
4,Latin America & Caribbean,Antigua and Barbuda,High income,62.9,5.2
...,...,...,...,...,...
77,Latin America & Caribbean,Trinidad and Tobago,High income,56.0,14.6
78,Latin America & Caribbean,Uruguay,High income,56.6,5.0
79,North America,United States,High income,25.9,5.4
80,Latin America & Caribbean,British Virgin Islands,High income,68.9,9.9
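
The sketch below repeats the `.loc` selection on a small hand-made dataframe and then counts missing values per column, since the output above shows that some countries (such as Aruba) have no data for one or both years. The missing-value count is an optional extra check, and `sample`, `cols`, and `subset` are illustrative names. The year columns are written as strings (`"1960"`, not `1960`) because column names read from a `.csv` header are text.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame(
    {
        "Country Name": ["Aruba", "Andorra", "United States"],
        "Region": [
            "Latin America & Caribbean",
            "Europe & Central Asia",
            "North America",
        ],
        "1960": [np.nan, np.nan, 25.9],
        "2020": [np.nan, 2.7, 5.4],
        "2021": [np.nan, 2.6, 5.4],
    }
)

# ":" keeps every row; the list picks the columns and sets their order.
cols = ["Region", "Country Name", "1960", "2021"]
subset = sample.loc[:, cols]

print(list(subset.columns))
print(subset.isna().sum())  # number of missing values in each column
```

The selected columns come back in the order given in the list, not in their original order.
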


## Exporting the Subset
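You have successfully filtered your original dataframes into a smaller and much more organized subset. The only thing left to do is to export this subset to your Google Drive using the `.to_csv()` method.
> Use the following command:

In [None]:
# Write the subset into the mounted Drive folder; index=False leaves out the row index column
final_subset.to_csv('gdrive/My Drive/Colab Notebooks/final_subset.csv', index=False)

Check the folder in your Google Drive that holds your original `.csv` files. Your new subset should now be there in the form of a `.csv` file! Note that a bare filename such as `"final_subset.csv"` would save the file to the notebook's working directory instead of your Drive.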
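
If you would like to verify an export from code, one option is to read the file back and compare it with the dataframe you wrote. The following is a minimal sketch that writes to a temporary folder so that it runs anywhere; in Colab you would use your own Drive path instead. The names `subset`, `out_path`, and `reloaded` are illustrative.

```python
import os
import tempfile

import pandas as pd

subset = pd.DataFrame(
    {
        "Country Name": ["Andorra", "United States"],
        "2021": [2.6, 5.4],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "final_subset.csv")

    # index=False keeps the row index out of the file.
    subset.to_csv(out_path, index=False)

    file_exists = os.path.exists(out_path)
    reloaded = pd.read_csv(out_path)

print(file_exists)
print(list(reloaded.columns))  # no extra "Unnamed: 0" column
print(reloaded)
```

Without `index=False`, the row index would be written as an extra first column, which shows up as `Unnamed: 0` when the file is read back.
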