
# STOR 320: Introduction to Data Science
## Lab 6

**Name:** Robert Nachnani

**PID:** 730573785

In [74]:
from datetime import datetime
from bs4 import BeautifulSoup
from io import StringIO
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import requests

In [12]:
%pip install html5lib

Note: you may need to restart the kernel to use updated packages.


# Scraping, Merging, and Analyzing Datasets for Countries (25 points)

**Background:** In data science, data is often split across many sources, some of which are online. In this analysis assignment, we will scrape country-level data from multiple websites, clean each dataset individually, and merge them. The website [Worldometers](https://www.worldometers.info/) contains country-level data that, once combined, may let us learn interesting things about the world in which we live.

## 0. GDP by Country (7 Points)
The [Worldometer GDP](https://www.worldometers.info/gdp/gdp-by-country/) page contains 2022 GDP data published by the World Bank. GDP is the monetary value of goods and services produced within a country over a period of time. On this website, GDP is presented in dollars.

### 0.0 Scraping the Data
We will walk through webscraping the data from https://www.worldometers.info/gdp/gdp-by-country/ using Pandas into a DataFrame called GDP. You should end up with a new object called GDP which is a DataFrame with 177 observations and 8 variables.

In [21]:
URL_GDP = "https://www.worldometers.info/gdp/gdp-by-country/"

# Send a GET request to the URL
response = requests.get(URL_GDP)

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all tables and read into pandas DataFrame
tables = soup.find_all('table')

table_IO = StringIO(str(tables))
GDP = pd.read_html(table_IO, flavor='bs4', header=0)[0]  # Read the first table

GDP.shape

(177, 8)

In [22]:
GDP.head(5)

Unnamed: 0,#,Country,"GDP (nominal, 2022)",GDP (abbrev.),GDP growth,Population (2022),GDP per capita,Share of World GDP
0,1,United States,"$25,462,700,000,000",$25.463 trillion,2.06%,341534046,"$74,554",25.32%
1,2,China,"$17,963,200,000,000",$17.963 trillion,2.99%,1425179569,"$12,604",17.86%
2,3,Japan,"$4,231,140,000,000",$4.231 trillion,1.03%,124997578,"$33,850",4.21%
3,4,Germany,"$4,072,190,000,000",$4.072 trillion,1.79%,84086227,"$48,429",4.05%
4,5,India,"$3,385,090,000,000",$3.385 trillion,7.00%,1425423212,"$2,375",3.37%


### 0.1 Cleaning the Data (7 points)

Now that we scraped our data into a DataFrame, we need to clean it up. Perform the following tasks:

1.   Remove the first ('#') and fourth ('GDP (abbrev.)') columns from the DataFrame.
2.   Rename the columns 'GDP  (nominal, 2022)', 'GDP growth', 'Population  (2022)', 'GDP per capita', and 'Share of  World GDP' to 'GDP', 'Growth', 'Population', 'PerCapita', and 'Share', respectively.
3.   Remove all dollar signs, percent signs, and commas from 'GDP', 'Growth', 'PerCapita', and 'Share'.
4.  Update column data type of "Country" to be a string dtype and the remaining columns to be numeric. Hint: use `pd.to_numeric`
5. Overwrite the original 'GDP' variable with a new variable, also called 'GDP', that is in trillions of dollars rather than in actual dollars. Overwrite the original 'Population' variable with a new variable of the same name that is in millions of people rather than in actual people. You are scaling the original variables to change the units without changing the variable names.

Be careful of the formatting and spacing in the original column names! Display the first five rows of the cleaned `GDP` DataFrame and the dtype info for `GDP`.
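Because the scraped headers may contain doubled spaces, renaming by exact name only works when the whitespace matches. The following is a minimal sketch, assuming hypothetical headers with doubled spaces like those quoted in task 2; the `toy` frame is illustrative and is not the scraped table. It normalizes whitespace first so the rename keys are predictable.

```python
import pandas as pd

# Hypothetical headers with doubled spaces, as quoted in the task text
toy = pd.DataFrame(
    columns=["Country", "GDP  (nominal, 2022)", "Population  (2022)", "Share of  World GDP"]
)

# Collapse any run of whitespace to a single space and trim both ends
toy.columns = toy.columns.str.replace(r"\s+", " ", regex=True).str.strip()

# Renaming by name now works regardless of the original spacing
toy = toy.rename(columns={
    "GDP (nominal, 2022)": "GDP",
    "Population (2022)": "Population",
    "Share of World GDP": "Share",
})
print(list(toy.columns))
```

Renaming by position, as in the solution cells, avoids the spacing problem as well, but it silently mislabels columns if the page layout changes.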



In [23]:
# Code Solution Here
GDP = GDP.drop(GDP.columns[[0,3]] , axis=1)
GDP.head()

Unnamed: 0,Country,"GDP (nominal, 2022)",GDP growth,Population (2022),GDP per capita,Share of World GDP
0,United States,"$25,462,700,000,000",2.06%,341534046,"$74,554",25.32%
1,China,"$17,963,200,000,000",2.99%,1425179569,"$12,604",17.86%
2,Japan,"$4,231,140,000,000",1.03%,124997578,"$33,850",4.21%
3,Germany,"$4,072,190,000,000",1.79%,84086227,"$48,429",4.05%
4,India,"$3,385,090,000,000",7.00%,1425423212,"$2,375",3.37%


In [24]:
GDP.columns = [GDP.columns[0] , 'GDP' , 'Growth' , 'Population' , 'PerCapita' , 'Share']
GDP.head()

Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share
0,United States,"$25,462,700,000,000",2.06%,341534046,"$74,554",25.32%
1,China,"$17,963,200,000,000",2.99%,1425179569,"$12,604",17.86%
2,Japan,"$4,231,140,000,000",1.03%,124997578,"$33,850",4.21%
3,Germany,"$4,072,190,000,000",1.79%,84086227,"$48,429",4.05%
4,India,"$3,385,090,000,000",7.00%,1425423212,"$2,375",3.37%


In [25]:
# Raw strings (r'...') keep '\$' a valid regex pattern without an invalid-escape warning
GDP['GDP'] = pd.to_numeric(GDP['GDP'].replace({r'\$': '', ',': ''}, regex=True))
GDP['Growth'] = pd.to_numeric(GDP['Growth'].replace({'%': ''}, regex=True))
GDP['PerCapita'] = pd.to_numeric(GDP['PerCapita'].replace({r'\$': '', ',': ''}, regex=True))
GDP['Share'] = pd.to_numeric(GDP['Share'].replace({'%': ''}, regex=True))
GDP['Population'] = pd.to_numeric(GDP['Population'].replace({',': ''}, regex=True))
GDP['Country'] = GDP['Country'].astype(str)
GDP.head()

Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share
0,United States,25462700000000,2.06,341534046,74554,25.32
1,China,17963200000000,2.99,1425179569,12604,17.86
2,Japan,4231140000000,1.03,124997578,33850,4.21
3,Germany,4072190000000,1.79,84086227,48429,4.05
4,India,3385090000000,7.0,1425423212,2375,3.37


In [26]:
GDP['GDP'] = GDP['GDP'] / 1e12
GDP['Population'] = GDP['Population'] / 1e6
GDP

Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share
0,United States,25.462700,2.06,341.534046,74554,25.32
1,China,17.963200,2.99,1425.179569,12604,17.86
2,Japan,4.231140,1.03,124.997578,33850,4.21
3,Germany,4.072190,1.79,84.086227,48429,4.05
4,India,3.385090,7.00,1425.423212,2375,3.37
...,...,...,...,...,...,...
172,Sao Tome & Principe,0.000547,0.93,0.226305,2416,0.00
173,Micronesia,0.000427,-0.62,0.523477,816,0.00
174,Marshall Islands,0.000280,1.50,0.040077,6978,0.00
175,Kiribati,0.000223,1.56,0.130469,1712,0.00
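The task also asks for the dtype info of the cleaned frame. One possible way to check it is sketched below on a small illustrative frame (`toy`, with values copied from the rows shown above) rather than on the scraped `GDP` object. Using `astype("string")` for the pandas string dtype is an assumption about what "string dtype" means in the task.

```python
import pandas as pd

# Small illustrative frame; values copied from the displayed rows
toy = pd.DataFrame({
    "Country": ["Japan", "Germany"],
    "GDP": [4.231140, 4.072190],
    "PerCapita": [33850, 48429],
})
toy["Country"] = toy["Country"].astype("string")

# Every column except Country should be numeric
numeric_cols = [c for c in toy.columns if c != "Country"]
all_numeric = all(pd.api.types.is_numeric_dtype(toy[c]) for c in numeric_cols)
is_string = pd.api.types.is_string_dtype(toy["Country"])

toy.info()  # prints column names, non-null counts, and dtypes
print(all_numeric, is_string)
```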


## 1. Education Index Data by Country (3 Points)

Check out the [Wikipedia page](https://en.wikipedia.org/wiki/Education_Index), which contains the education index for all countries from 1990 to 2019.

### 1.0 Scraping the Education Index Data
The code provided scrapes the data from (https://en.wikipedia.org/wiki/Education_Index) into a data frame called EDU.

In [27]:
# URL to fetch data from
URL_EDU = "https://en.wikipedia.org/wiki/Education_Index"

# Fetch the HTML content
response = requests.get(URL_EDU)
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table and read it into a DataFrame
table = soup.find_all('table')[0]  # Assuming the first table is the one we want
table_IO = StringIO(str(table))
EDU = pd.read_html(table_IO, flavor='bs4', header=0)[0]

EDU.head(5)

Unnamed: 0,Country,1990,1991,1992,1993,1994,1995,1996,1997,1998,...,2010,2011,2012,2013,2014,2015,2016,2017,2018,2019
0,Afghanistan,0.122,0.133,0.145,0.156,0.168,0.179,0.19,0.202,0.213,...,0.372,0.374,0.39,0.398,0.403,0.405,0.406,0.408,0.413,0.414
1,Albania,0.583,0.588,0.557,0.542,0.528,0.55,0.557,0.569,0.579,...,0.671,0.714,0.739,0.749,0.758,0.753,0.745,0.747,0.743,0.746
2,Algeria,0.385,0.395,0.405,0.414,0.424,0.431,0.443,0.458,0.473,...,0.626,0.644,0.639,0.639,0.652,0.659,0.66,0.665,0.668,0.672
3,Andorra,,,,,,,,,,...,0.67,0.671,0.724,0.714,0.725,0.718,0.722,0.713,0.72,0.72
4,Angola,,,,,,,,,,...,0.398,0.423,0.435,0.447,0.46,0.472,0.487,0.498,0.5,0.5


### 1.1 Cleaning the Education Data (3 points)
Perform the following tasks to clean the `EDU` DataFrame:

1. Modify the resulting DataFrame `EDU` to only keep 2 variables: 1) the country’s name and 2) its education index from 2019.
2. Rename the variable named “2019” to “EDIndex”.
3. Update the dtype of 'Country' to a string.  

Display the first 5 rows of `EDU` and the info of `EDU` after making these changes.

In [28]:
# Code Solution Here
# Copy the two-column selection so later assignments do not trigger SettingWithCopyWarning
EDU = EDU[['Country', '2019']].copy()
EDU = EDU.rename(columns={'2019': 'EDIndex'})
EDU['Country'] = EDU['Country'].astype(str)
EDU.head(5)

Unnamed: 0,Country,EDIndex
0,Afghanistan,0.414
1,Albania,0.746
2,Algeria,0.672
3,Andorra,0.72
4,Angola,0.5


## 2. Merging the Datasets (8 points)

Now, we are going to merge the datasets for maximum gains. Make sure you carefully read the instructions for each question. Be very careful in this part of the assignment.

### 2.0 Joining GDP and EDU (2 Points)
The dataset `GDP` is our primary dataset. Create a new DataFrame `GDP_EDU` that brings the education data from `EDU` into the dataset `GDP` using a left join only. Display the first 12 rows of `GDP_EDU`.

In [31]:
# Code Solution Here
GDP_EDU = pd.merge(GDP, EDU, on='Country', how='left')
GDP_EDU

Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share,EDIndex
0,United States,25.462700,2.06,341.534046,74554,25.32,0.900
1,China,17.963200,2.99,1425.179569,12604,17.86,0.862
2,Japan,4.231140,1.03,124.997578,33850,4.21,0.851
3,Germany,4.072190,1.79,84.086227,48429,4.05,0.943
4,India,3.385090,7.00,1425.423212,2375,3.37,0.555
...,...,...,...,...,...,...,...
172,Sao Tome & Principe,0.000547,0.93,0.226305,2416,0.00,
173,Micronesia,0.000427,-0.62,0.523477,816,0.00,
174,Marshall Islands,0.000280,1.50,0.040077,6978,0.00,0.707
175,Kiribati,0.000223,1.56,0.130469,1712,0.00,0.594
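A left join keeps every row of the primary table, but duplicate keys in the right table would silently add rows. The sketch below, on an illustrative three-row frame with values copied from the output above, uses the `validate` argument of `pd.merge` as one possible guard. The names `left`, `right`, and `merged` are placeholders.

```python
import pandas as pd

# Illustrative subsets; Micronesia has no education index in the merged output above
left = pd.DataFrame({
    "Country": ["Japan", "India", "Micronesia"],
    "GDP": [4.231140, 3.385090, 0.000427],
})
right = pd.DataFrame({
    "Country": ["Japan", "India"],
    "EDIndex": [0.851, 0.555],
})

# validate="many_to_one" raises MergeError if Country is duplicated in `right`
merged = pd.merge(left, right, on="Country", how="left", validate="many_to_one")

# A left join against unique keys must preserve the row count of `left`
print(len(merged) == len(left), merged["EDIndex"].isna().sum())
```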


### 2.1 Missing Education Index (2 Points)

How many countries in `GDP_EDU` have missing values for Education Index? Show code that can be used to answer this question and then write your answer in complete sentences.

In [32]:
# Code Solution Here
GDP_EDU['EDIndex'].isna().sum()

19

Answer: Nineteen countries in `GDP_EDU` have missing values for the Education Index.
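Beyond the count, it can help to see which countries are missing. A minimal sketch, using an illustrative frame built from four rows of the merged output above, filters on the same `isna()` mask. The frame name `toy` is a placeholder for `GDP_EDU`.

```python
import pandas as pd

# Four rows copied from the tail of the merged output; None marks a missing index
toy = pd.DataFrame({
    "Country": ["Sao Tome & Principe", "Micronesia", "Marshall Islands", "Kiribati"],
    "EDIndex": [None, None, 0.707, 0.594],
})

# Select the country names where the education index is missing
missing = toy.loc[toy["EDIndex"].isna(), "Country"].sort_values().tolist()
print(len(missing), missing)
```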

### 2.2 Data Inspection (3 Points)
Closely inspect the original datasets and answer the following questions about GDP_EDU in complete sentences. You can use the code if needed, but it is not required. Please show all work. If you don’t reference the appropriate dataset or you are not specific in your answers, you will get 0 points.

#### 2.2.0 Why is there no education index for Iran in the dataset `GDP_EDU`? (1 Point)

**Answer:** There are several possible explanations. The education index for Iran may be missing from the original source, for example because the data was unreported, incomplete, or unavailable when it was collected. Iran may also have intentionally withheld its data from the education index for political or economic reasons.

#### 2.2.1 Why is there no education index for State of Palestine in the dataset `GDP_EDU`? (1 Point)

**Answer:** As with Iran, the education index may have been withheld or incomplete. The name may also differ between the `EDU` and `GDP` datasets, for example 'Palestine' in one and 'State of Palestine' in the other, in which case the join on `Country` finds no match. It also became its own state only recently, so data on the country may not be available.

#### 2.2.2 Why is there no education index for Laos in the dataset `GDP_EDU`? (1 point)

**Answer:** As with the previous countries, there are several possible reasons, including incomplete data, inconsistent country labeling between the datasets, or political and economic factors. Another possibility we have not yet considered is that the country's data collection methods are insufficient.
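The labeling explanation can be checked directly by looking for near matches between the two name lists. The sketch below uses the standard-library `difflib.get_close_matches`. The `edu_names` list is hypothetical and only illustrates the 'Palestine' versus 'State of Palestine' example; the actual spellings in the scraped tables would need to be inspected.

```python
from difflib import get_close_matches

# Names from the GDP table that found no partner in the join (illustrative)
unmatched = ["State of Palestine"]

# Hypothetical spellings on the education-index side
edu_names = ["Palestine", "Pakistan", "Panama"]

# For each unmatched name, suggest the single closest candidate above the cutoff
suggestions = {
    name: get_close_matches(name, edu_names, n=1, cutoff=0.6)
    for name in unmatched
}
print(suggestions)
```

Suggestions like these are only candidates; each one should be confirmed by hand before renaming a country in either dataset.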

### 2.3 Removing NA Values (1 point)

Instead of replacing or dropping all the countries with missing values by hand, we will just drop all rows that are missing the Education Index to move forward with the analysis portion. Drop all rows from `GDP_EDU` that are null for `EDIndex`.



In [43]:
# Code Solution Here
GDP_EDU.dropna(subset=['EDIndex'], inplace=True)
GDP_EDU

Unnamed: 0,Country,GDP,Growth,Population,PerCapita,Share,EDIndex
0,United States,25.462700,2.06,341.534046,74554,25.32,0.900
1,China,17.963200,2.99,1425.179569,12604,17.86,0.862
2,Japan,4.231140,1.03,124.997578,33850,4.21,0.851
3,Germany,4.072190,1.79,84.086227,48429,4.05,0.943
4,India,3.385090,7.00,1425.423212,2375,3.37,0.555
...,...,...,...,...,...,...,...
167,Vanuatu,0.000984,1.85,0.313046,3142,0.00,0.561
170,Samoa,0.000832,-6.02,0.215261,3867,0.00,0.713
171,Dominica,0.000612,5.94,0.066826,9159,0.00,0.632
174,Marshall Islands,0.000280,1.50,0.040077,6978,0.00,0.707
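After dropping rows, the index keeps its original labels, which is why the output above jumps from 167 to 170. If consecutive labels are preferred, one option is `reset_index(drop=True)`, sketched here on an illustrative frame (`toy`) with values copied from the outputs above.

```python
import pandas as pd

# Illustrative rows; Micronesia has no education index
toy = pd.DataFrame({
    "Country": ["Samoa", "Micronesia", "Dominica"],
    "EDIndex": [0.713, None, 0.632],
})

# Drop rows with a missing EDIndex, then renumber the index from 0
clean = toy.dropna(subset=["EDIndex"]).reset_index(drop=True)
print(clean)
```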


## 3. Analyzing the Merged Dataset (12 points)

In these questions, find the answer using code, and then answer the question using complete sentences below the code.

### 3.0 Above Average GDP PerCapita (2 Points)
How many countries have a GDP per capita above the global average?

In [67]:
# Code Solution Here
average = GDP_EDU['PerCapita'].mean(numeric_only=True)
above_average = (GDP_EDU['PerCapita'] > average)
above_average.sum()

49

Answer: 49 countries in `GDP_EDU` have a GDP per capita above the average.
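The cell above averages `PerCapita` across countries, so every country counts equally. Another possible reading of "global average" is total GDP divided by total population. The sketch below contrasts the two on three rows copied from the cleaned table; it illustrates the distinction and is not a claim about which reading the lab intends.

```python
import pandas as pd

# Three rows copied from the cleaned table (GDP in trillions, Population in millions)
toy = pd.DataFrame({
    "Country": ["United States", "China", "India"],
    "GDP": [25.462700, 17.963200, 3.385090],
    "Population": [341.534046, 1425.179569, 1425.423212],
    "PerCapita": [74554, 12604, 2375],
})

# Unweighted mean: each country counts once
simple_mean = toy["PerCapita"].mean()

# Pooled value: total dollars divided by total people
pooled = (toy["GDP"].sum() * 1e12) / (toy["Population"].sum() * 1e6)

print(round(simple_mean, 2), round(pooled, 2))
```

On these three rows the pooled figure is lower than the simple mean because the two most populous countries have the lowest per-capita values.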

### 3.1 Highest GDP Growth Rate (4 Points)

*   Of the countries that have above average GDP PerCapita, what country has the highest GDP growth rate?
*   Of the countries that have below average GDP PerCapita, what country has the highest GDP growth rate?

In [72]:
# Code Solution Here
below_average = GDP_EDU[GDP_EDU['PerCapita'] < average][['Country', 'Growth']]
above_average = GDP_EDU[GDP_EDU['PerCapita'] > average][['Country', 'Growth']]
highest_growth_above_avg = above_average.loc[above_average['Growth'].idxmax(), 'Country']
highest_growth_below_avg = below_average.loc[below_average['Growth'].idxmax(), 'Country']
highest_growth_above_avg , highest_growth_below_avg

('Guyana', 'Cabo Verde')

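Answer:
* Above Average: Among countries with above-average GDP per capita, Guyana has the highest GDP growth rate.
* Below Average: Among countries with below-average GDP per capita, Cabo Verde has the highest GDP growth rate.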
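Both groups can also be handled in one pass by flagging each row and grouping on the flag. This is a sketch on five rows copied from the cleaned table; the `AboveAvg` column is a helper name introduced here, and the average is computed over these five rows only, so the result differs from the full-data answer above.

```python
import pandas as pd

# Five rows copied from the cleaned table
toy = pd.DataFrame({
    "Country": ["United States", "China", "Japan", "Germany", "India"],
    "Growth": [2.06, 2.99, 1.03, 1.79, 7.00],
    "PerCapita": [74554, 12604, 33850, 48429, 2375],
})

# Flag rows above the mean PerCapita of this small frame
toy["AboveAvg"] = toy["PerCapita"] > toy["PerCapita"].mean()

# idxmax per group gives the row label of the highest Growth in each group
idx = toy.groupby("AboveAvg")["Growth"].idxmax()
top = toy.loc[idx, ["AboveAvg", "Country"]]
result = {bool(flag): country for flag, country in zip(top["AboveAvg"], top["Country"])}
print(result)
```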

### 3.2 Lowest Education Index (4 Points)

*   Of the countries that have above average GDP PerCapita, what country has the lowest education index?
*   Of the countries that have below average GDP PerCapita, what country has the lowest education index?

In [73]:
# Code Solution Here
below_average = GDP_EDU[GDP_EDU['PerCapita'] < average][['Country', 'Growth' , 'EDIndex']]
above_average = GDP_EDU[GDP_EDU['PerCapita'] > average][['Country', 'Growth', 'EDIndex' ]]
lowest_edu_above_avg = above_average.loc[above_average['EDIndex'].idxmin(), 'Country']
lowest_edu_below_avg = below_average.loc[below_average['EDIndex'].idxmin(), 'Country']
lowest_edu_above_avg , lowest_edu_below_avg

('Guyana', 'Niger')

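Answer:
* Lowest Above Average: Among countries with above-average GDP per capita, Guyana has the lowest education index.
* Lowest Below Average: Among countries with below-average GDP per capita, Niger has the lowest education index.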
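`idxmin` returns only the first row when several countries share the minimum. If ties matter, `nsmallest` with `keep="all"` is one alternative. The sketch below uses five rows copied from the merged table; this small frame has no ties, so it only shows the call.

```python
import pandas as pd

# Five rows copied from the merged table
toy = pd.DataFrame({
    "Country": ["United States", "China", "Japan", "Germany", "India"],
    "EDIndex": [0.900, 0.862, 0.851, 0.943, 0.555],
})

# keep="all" returns every row tied for the smallest EDIndex
lowest = toy.nsmallest(1, "EDIndex", keep="all")
print(lowest)
```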

### 3.3 Critical Thinking (2 points)

State two additional questions you could answer with the merged dataset. Be creative. You do not need to find the answer, but are welcome to if you are curious.

Answer: 
1. Does a higher EDIndex mean a higher GDP?
2. Does population play a part in the GDP of a country?
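A first look at the first question could use a correlation coefficient. The sketch below computes the Pearson correlation between `EDIndex` and `PerCapita` on five rows copied from the merged table. Using per-capita GDP rather than total GDP is an assumption made here, and five rows are far too few to support any conclusion.

```python
import pandas as pd

# Five rows copied from the merged table
toy = pd.DataFrame({
    "Country": ["United States", "China", "Japan", "Germany", "India"],
    "PerCapita": [74554, 12604, 33850, 48429, 2375],
    "EDIndex": [0.900, 0.862, 0.851, 0.943, 0.555],
})

# Pearson correlation is the default method of Series.corr
r = toy["EDIndex"].corr(toy["PerCapita"])
print(round(r, 3))
```

A correlation describes association only; it would not show that education causes higher GDP or the reverse.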