# DS 3000 - Lab 5: Web Scraping & EDA

**Student Name**: [Juia Ouritskaya]

**Date**: [10/6/2023]

### Submission Instructions
<div class="alert alert-block alert-success">
In this lab you will work with data that was scraped from Wikipedia and analyze it using useful pandas functions that group, sort and aggregate data. The data has been minimally prepared for you and the only data preparation that you are required to do is indicated in <strong>Part 2</strong>. Complete the questions in the lab and submit this `ipynb` file with your solution.
</div>

`Note:` The `ipynb` format stores the figures and outputs from the last time you ran the notebook, and shows them again when the notebook is opened. To ensure that your submitted `ipynb` file reflects your latest code, restart the kernel and rerun every cell (`Kernel > Restart & Run All`) just before uploading the file to Gradescope.

<div class="alert alert-block alert-danger">
Please do not delete the cells that are provided nor add any extra cells. Ensure that you write your code in the given cells where indicated. <br>
<strong>Do not delete any empty cells.</strong><br>
</div>


In [1]:
#DO NOT DELETE THIS CELL


In [2]:
# Import any Libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

### Question 1 (5 pts)

This lab requires that you extract data from a website, load it into a dataframe and analyze it using useful pandas functions that allow you to group, sort and aggregate data.
The data will be scraped from a Wikipedia page that shows the [percentage of population living below the national poverty line](https://en.wikipedia.org/wiki/List_of_sovereign_states_by_percentage_of_population_living_in_poverty#Percent_of_population_living_below_national_poverty_line) for various countries around the world. The data on that page comes from the World Bank API, the CIA World Factbook and other sources.

>`Disclaimer:` Definitions of poverty vary considerably among nations. For example, rich nations generally employ more generous standards of poverty than poor nations. -- *Source: The World Factbook*


### Important Instructions
This lab is a graded assignment that will give you a chance to learn how to scrape data from HTML pages. The lab is divided into four parts:
- Part 1: extract the data from the Wikipedia page (this step is done for you)
- Part 2: transform the dataframe to prepare it for analysis
- Part 3: identify the country with the highest poverty
- Part 4: analyze the average poverty across continents


`Note:`
- Please do not change the name of the variable `df_poverty_data` that is created in `Part 1`. The variable will be used as input to `Part 2`.  <br>
- You can remove the `raise NotImplementedException()`. <br>
- Part 1 is not graded. This step is performed for you.<br>


`Note`: you are only required to complete the logic inside each function. You can test each function by running the subsequent code cell that prints the results.

**Reminder**: DO NOT add or delete any cells below.

### Part 1 (0 pt)

The cell below performs the following steps:
- Extract the HTML tables from a Wikipedia page and save the relevant table in a pandas dataframe called `df_poverty_data`.
- Rename the specific columns that will be used in this lab.
- Filter the dataframe to only contain data on poverty levels that were obtained from the CIA World Factbook.

`Note:` It is recommended that you view the Wikipedia page which will help you to understand what data is being extracted. This URL shows the data that we will use in this lab: https://en.wikipedia.org/wiki/List_of_sovereign_states_by_percentage_of_population_living_in_poverty#Percent_of_population_living_below_national_poverty_line
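The rename and column-selection steps listed above can be sketched on a small hand-built frame, without touching the network. The `tables` list and its toy rows are assumptions standing in for what `pd.read_html` returns (a list of DataFrames, one per HTML table); the `Other_Source` column is hypothetical, and only the remaining column names mirror the lab.

```python
import pandas as pd

# Hypothetical stand-in for the list returned by pd.read_html(url):
# one DataFrame per HTML table found on the page.
tables = [
    pd.DataFrame({"note": ["unrelated table"]}),
    pd.DataFrame({
        "Country": ["A", "B"],
        "Other_Source": ["11.0%", "22.0%"],  # made-up column, dropped below
        "CIA[10]": ["10.0%", "20.0%"],
        "Year.1": ["2017", "2012"],
        "Continent": ["Asia", "Europe"],
    }),
]

# Pick a table by its position in the list (index 1 here, index 2 in the lab).
toy = tables[1]

# Rename the source-specific headers to the names used in the lab.
toy = toy.rename(columns={"CIA[10]": "FactBook_Percentage", "Year.1": "FB_Year"})

# Keep only the four columns needed for the analysis.
toy = toy[["Country", "FactBook_Percentage", "FB_Year", "Continent"]]
print(toy)
```

Because list positions start at 0, index 2 in the lab's cell selects the third table that `pd.read_html` found on the page.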

In [3]:
# load the data
url       = 'https://en.wikipedia.org/wiki/List_of_sovereign_states_by_percentage_of_population_living_in_poverty'
html_data = pd.read_html(url) #scrape data from each html table in the Wikipedia page
df_poverty_data = pd.DataFrame(html_data[2]) #extract the table at index 2 (the third table in the html data) and convert it to a dataframe

# Format the column names
df_poverty_data.rename(columns={'CIA[10]': 'FactBook_Percentage', #this field is the poverty level
                                    'Year.1': 'FB_Year'}, #this field is the most recent year that was recorded by the World Factbook
                                    inplace=True)

# Filter the dataframe to only use the data on poverty levels from the CIA World Factbook
df_poverty_data = df_poverty_data[['Country', 'FactBook_Percentage', 'FB_Year', 'Continent']]
df_poverty_data.head(5)



| | Country | FactBook_Percentage | FB_Year | Continent |
|---|---|---|---|---|
| 0 | Afghanistan | 54.5% | 2017 | Asia |
| 1 | Albania | 14.3% | 2012 | Europe |
| 2 | Algeria | 5.5% | 2011 | Africa |
| 3 | Angola | 32.3% | 2018 | Africa |
| 4 | Anguilla | 23.0% | 2002 | North America |


### Part 2 (1 pt)

Complete the function `format_factbook_data(data)`. The function takes as input `df_poverty_data` from Part 1, and its result is stored in `df_poverty_data_formatted`. Inside the function ensure that you:
- Remove countries with missing data. Hint: the missing data is denoted by '—'
- Convert the fields: `FactBook_Percentage` and `FB_Year` to suitable numeric types.


In [4]:

def format_factbook_data(data: pd.DataFrame) -> pd.DataFrame:
    """
    Input pandas dataframe with the percentage of the population living below the national poverty line.

    Parameters:
    - data (pd.DataFrame): a pandas dataframe containing the poverty levels for various countries.

    Returns:
    - pd.DataFrame: pandas dataframe with four fields: Country, FactBook_Percentage, FB_Year, and Continent. 
                    Ensure that the FactBook_Percentage and FB_Year are converted to suitable numeric types.
    """

    # Remove countries with missing data, which is denoted by '—' (or an empty string)
    # .copy() makes the result an independent dataframe, so the assignments below
    # do not trigger a SettingWithCopyWarning
    data = data[(data['FactBook_Percentage'] != '—') & (data['FactBook_Percentage'] != '')
                & (data['FB_Year'] != '—') & (data['FB_Year'] != '')].copy()
    
    # Convert 'FactBook_Percentage' (e.g. '54.5%' -> 0.545) and 'FB_Year' to suitable numeric types
    data['FactBook_Percentage'] = data['FactBook_Percentage'].str.rstrip('%').astype(float) / 100.0
    data['FB_Year'] = data['FB_Year'].astype(int)
    
    return data

In [5]:
# Printing the result from the function
df_poverty_data_formatted = format_factbook_data(df_poverty_data) #DO NOT change this line.
df_poverty_data_formatted.head()

| | Country | FactBook_Percentage | FB_Year | Continent |
|---|---|---|---|---|
| 0 | Afghanistan | 0.545 | 2017 | Asia |
| 1 | Albania | 0.143 | 2012 | Europe |
| 2 | Algeria | 0.055 | 2011 | Africa |
| 3 | Angola | 0.323 | 2018 | Africa |
| 4 | Anguilla | 0.23 | 2002 | North America |


In [6]:
#DO NOT DELETE THIS CELL


### Part 3 (2 pts)

Complete the function `highest_poverty(data)`. The function takes as input `df_poverty_data_formatted`. Inside the function ensure that you:
- Identify the country with the highest percentage of the population living below the poverty line
- Return a list with the name of the country and the year that the poverty data was recorded.


In [7]:

def highest_poverty(data: pd.DataFrame) -> list:
    """
    Input pandas dataframe with the percentage of the population living below the national poverty line.

    Parameters:
    - data (pd.DataFrame): a pandas dataframe containing the poverty levels for various countries.

    Returns:
    - List: a list with the country name and year e.g. ['Canada', 2016]
    """
    # Identify the country with the highest percentage of the population living below the poverty line
    # Sort the data in 'FactBook_Percentage' in descending order
    sorted_data_descending = data.sort_values(by='FactBook_Percentage', ascending=False)
    
    # Select the first row
    first_row = sorted_data_descending.iloc[0]
    
    # Return a list with the name of the country and the year that the poverty data was recorded
    # Extract country and year from first row
    country = first_row['Country']
    year = first_row['FB_Year']
    
    return [country, year]

In [8]:
# Printing the result from the function
result = highest_poverty(df_poverty_data_formatted) #DO NOT change this line.
print(f"The country with the highest poverty level is {result[0]} and the year this was reported was {result[1]}.")

The country with the highest poverty level is Syria and the year this was reported was 2014.


In [9]:
#DO NOT DELETE THIS CELL


### Part 4 (2 pts)

Complete the function `average_poverty(data)`. The function takes as input `df_poverty_data_formatted`. Inside the function ensure that you:
- Calculate the average poverty level for each continent
- Return the names of the two continents with the highest average


In [10]:

def average_poverty(data: pd.DataFrame) -> list:
    """
    Input pandas dataframe with the percentage of the population living below the national poverty line.

    Parameters:
    - data (pd.DataFrame): a pandas dataframe containing the poverty levels for various countries.

    Returns:
    - List: a list that contains the names of the two continents with the highest average poverty level.

    """
    # Calculate the average poverty level for each continent
    # Group by 'Continent' and calculate the mean of 'FactBook_Percentage'
    average_poverty_by_continent = data.groupby('Continent')['FactBook_Percentage'].mean()
    
    # Sort the average values in descending order  
    sorted_continents_descending = average_poverty_by_continent.sort_values(ascending=False)
    
    # Take the index labels (continent names) of the two highest averages
    first_continent = sorted_continents_descending.index[0]
    second_continent = sorted_continents_descending.index[1]
    
    return [first_continent, second_continent]

In [11]:
# Printing the result from the function
result = average_poverty(df_poverty_data_formatted)

print(f"On average, {result[0]} has the highest poverty level followed by {result[1]}.")


On average, Africa has the highest poverty level followed by North America.


In [12]:
#DO NOT DELETE THIS CELL


**Good job with the lab. Hopefully you learned something new about our world. After you submit the lab, I encourage you to keep exploring the data to extract more insights!**

### Additional Resources:

#### [1. Pandas Documentation](https://pandas.pydata.org/docs/) 
#### [2. Pandas Cheatsheet](https://images.datacamp.com/image/upload/v1676302204/Marketing/Blog/Pandas_Cheat_Sheet.pdf)