### Data Preparation
### Term Project - Milestone 3
### Submitter - Himanshu Singh
### Cleaning/Formatting Website Data

In [6]:
# The healthcare spending (PPP) data is published on a Wikipedia page.
# We locate the table holding the relevant figures and load its contents.

import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
import pandas as pd
import cloudscraper as cs

# Use cloudscraper to avoid Cloudflare challenges or HTTP 403 errors
scraper = cs.create_scraper()
URL = "https://en.wikipedia.org/wiki/List_of_countries_by_total_health_expenditure_per_capita"
response = scraper.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')

number_of_tables = soup.find_all("table")
print(f"Number of tables on the webpage {len(number_of_tables)}")
#print(soup.prettify())






Number of tables on the webpage 2


## Step 1 Find the Right table on HTML page


In [10]:

# Find the main health expenditure table. Based on the page structure, it is the
# table with class 'wikitable sortable defaultright static-row-numbers sticky-table-head'
health_table = soup.find_all('table', {'class': 'wikitable sortable defaultright static-row-numbers sticky-table-head'})

# Check if the table was found
if health_table:
    print("Main Health table found successfully.")
else:
    print("Could not find the main Health table.")



Main Health table found successfully.
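
The lookup above matches the full `class` attribute string, so it depends on the exact order and set of class names the page emits. A minimal sketch of a more tolerant lookup, assuming only the `wikitable` and `sortable` classes are stable; the inline HTML is made-up sample markup, not the real page:

```python
from bs4 import BeautifulSoup

# Made-up sample markup; the class names appear in a different order than expected.
sample_html = """
<table class="sortable wikitable sticky-table-head">
  <tr><th>Location</th><th>PPP $</th><th>Year</th></tr>
  <tr><td>Albania</td><td>1,186</td><td>2022</td></tr>
</table>
<table class="navbox"><tr><td>Navigation</td></tr></table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# A CSS selector matches tables that carry both classes, in any order,
# regardless of which other classes are present.
candidate_tables = sample_soup.select("table.wikitable.sortable")

print(len(candidate_tables))
```

If the selector returns more than one table, the header text of each candidate can be inspected to pick the right one.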


Unnamed: 0,Location,PPP,Year
0,Location,PPP $,Year
1,Afghanistan,383,2022
2,Albania,1186,2022
3,Algeria,547,2022
4,Andorra,5641,2023
...,...,...,...
188,Venezuela,131,2022
189,Vietnam,611,2022
190,Yemen,109,2022
191,Zambia,208,2022


## Step 2 To load data in a dataframe


In [14]:

first_column_data = []
second_column_data = []
third_column_data = []
#fourth_column_data = []

# Iterate through each table row (<tr>)
for row in health_table[0].find_all('tr'):
    # Collect the cells in the current row; include <th> so the header row is captured
    cells = row.find_all(['td', 'th'])
    # Skip rows with fewer than three cells to avoid an IndexError
    if len(cells) >= 3:
        # Append the text of the first three columns: location, PPP value, year
        first_column_data.append(cells[0].get_text(strip=True))
        second_column_data.append(cells[1].get_text(strip=True))
        third_column_data.append(cells[2].get_text(strip=True))
        #fourth_column_data.append(cells[3].get_text(strip=True))





Unnamed: 0,Location,PPP in $,Year
0,Location,PPP $,Year
1,Afghanistan,383,2022
2,Albania,1186,2022
3,Algeria,547,2022
4,Andorra,5641,2023
...,...,...,...
188,Venezuela,131,2022
189,Vietnam,611,2022
190,Yemen,109,2022
191,Zambia,208,2022


## Step 3 To remove the comma from numeric column

In [31]:
# Convert the list to a pandas Series and remove the thousands separators

second_column_data_pd= pd.Series(second_column_data)
second_column_data_pd= second_column_data_pd.str.replace(',', '', regex=False)
second_column_data_pd

0      PPP $
1        383
2       1186
3        547
4       5641
       ...  
188      131
189      611
190      109
191      208
192       96
Length: 193, dtype: object

## Step 4 To put all in dataframe

In [53]:
# Create a dictionary where keys are the column names and values are the lists
datasource1 = {
    'Country': first_column_data,
    'PPP': second_column_data_pd,
    'Year': third_column_data

}

# Forming in one dataframe
df_source1 = pd.DataFrame(datasource1)
df_source1

Unnamed: 0,Country,PPP,Year
0,Location,PPP $,Year
1,Afghanistan,383,2022
2,Albania,1186,2022
3,Algeria,547,2022
4,Andorra,5641,2023
...,...,...,...
188,Venezuela,131,2022
189,Vietnam,611,2022
190,Yemen,109,2022
191,Zambia,208,2022


## Step 5 To clean the data

In [54]:
# The value to delete
value_to_delete = 'Location'

# Create a new DataFrame (df_filtered) by selecting all rows
# where the 'Country' column is NOT equal to value_to_delete.
df_filtered = df_source1[df_source1['Country'] != value_to_delete]
df_filtered

Unnamed: 0,Country,PPP,Year
1,Afghanistan,383,2022
2,Albania,1186,2022
3,Algeria,547,2022
4,Andorra,5641,2023
5,Angola,217,2022
...,...,...,...
188,Venezuela,131,2022
189,Vietnam,611,2022
190,Yemen,109,2022
191,Zambia,208,2022


## Step 6 To convert to numeric values

In [55]:
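# Work on an explicit copy so the assignment does not raise SettingWithCopyWarning,
# then replace the column so its dtype actually becomes integer
df_filtered = df_filtered.copy()
df_filtered['PPP'] = df_filtered['PPP'].astype(int)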
df_filtered

Unnamed: 0,Country,PPP,Year
1,Afghanistan,383,2022
2,Albania,1186,2022
3,Algeria,547,2022
4,Andorra,5641,2023
5,Angola,217,2022
...,...,...,...
188,Venezuela,131,2022
189,Vietnam,611,2022
190,Yemen,109,2022
191,Zambia,208,2022
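
When a column of a filtered DataFrame is converted, it is worth confirming that the dtype actually changed and that the source frame was left untouched. A minimal sketch on toy data (the `toy` and `toy_filtered` names are illustrative, and the rows are copied from the output above), taking an explicit copy before the conversion:

```python
import pandas as pd

# Toy data in the same shape as df_source1, including the repeated header row.
toy = pd.DataFrame({
    "Country": ["Location", "Afghanistan", "Albania"],
    "PPP": ["PPP $", "383", "1186"],
    "Year": ["Year", "2022", "2022"],
})

# Filter out the header row and take an explicit copy,
# so later assignments cannot touch or depend on `toy`.
toy_filtered = toy[toy["Country"] != "Location"].copy()

# Replacing the whole column changes its dtype to a true integer type.
toy_filtered["PPP"] = toy_filtered["PPP"].astype("int64")

print(toy_filtered.dtypes)
print(toy["PPP"].tolist())  # the source frame still holds the original strings
```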


* What changes were made to the data?

Since the data came from a well-maintained website, it was fairly standard and had no major issues. We did, however, perform the following steps:

a)  Standardize column names
b)  Replace headers
c)  Remove the comma thousands separators from numeric values
d)  Clean the data by removing unwanted values
e)  Convert the values to numeric types so they can be used in visualizations


* Are there any legal or regulatory guidelines for your data or project topic?

Because the data does not contain private health information, it does not raise personal health privacy concerns. Ethics and research guidelines should still be followed, including transparency and mitigation of algorithmic bias, so that the models and findings do not unfairly impact certain communities.



* What risks could be created based on the transformations done?

Transformations on scraped HTML data create several risks: data loss or corruption (e.g., a value is distorted if a removed comma was a decimal separator rather than a thousands separator), biased analysis when invalid strings are converted to NaN and silently dropped, and assignments flagged by `SettingWithCopyWarning` that may update a temporary copy instead of the intended DataFrame. These risks compromise data quality and integrity.
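
One way to make the NaN risk visible is to report which raw values fail to parse instead of letting them disappear. A minimal sketch with made-up sample values (the `raw` series is hypothetical, not scraped data):

```python
import pandas as pd

# Made-up raw cell values, including two that are not plain numbers.
raw = pd.Series(["383", "1,186", "N/A", "5,641", "n.a."])

cleaned = raw.str.replace(",", "", regex=False)
numeric = pd.to_numeric(cleaned, errors="coerce")  # unparseable values become missing

# List the original strings that were turned into missing values.
failed = raw[numeric.isna()].tolist()
print(f"{len(failed)} of {len(raw)} values could not be parsed: {failed}")
```

Reviewing the `failed` list before dropping rows shows whether the missing values cluster in particular countries or years, which is where bias would enter.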


* Did you make any assumptions in cleaning/transforming the data?

Yes. The main assumptions are that commas are thousands separators, that currency symbols and percentage signs should be removed, and that 'N/A' or blank cells represent missing numeric data and should be converted to NaN.
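
The thousands-separator assumption can be checked before the commas are stripped. The sketch below is illustrative only: the helper name `looks_like_thousands_format` and the sample strings are assumptions, not part of the notebook.

```python
import re

# Either digits grouped in threes by commas ("1,186", "12,345,678")
# or a plain run of digits ("383", "1186").
THOUSANDS_PATTERN = re.compile(r"(?:\d{1,3}(?:,\d{3})*|\d+)")

def looks_like_thousands_format(text: str) -> bool:
    """Return True if `text` is an integer with optional comma thousands separators."""
    return THOUSANDS_PATTERN.fullmatch(text) is not None

samples = ["383", "1,186", "5,641", "1,5", "12,34", "N/A"]

# Values that do not fit the pattern deserve a manual look before cleaning.
suspicious = [s for s in samples if not looks_like_thousands_format(s)]
print(suspicious)
```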


* How was your data sourced/verified for credibility?

The Wikipedia article "List of countries by total health expenditure per capita" draws its data primarily from two major international organizations, which is the basis for its credibility:

Organisation for Economic Co-operation and Development (OECD): This data is used for the first table, listing OECD member countries and a few others.

World Health Organization (WHO): Data from the WHO's Global Health Expenditure Database is used for the second, more comprehensive table listing nearly all countries.

Both sources provide data on total health spending (public and private) per capita, measured in Purchasing Power Parity (PPP) international dollars, adding a layer of standardization for comparison.


* Was your data acquired in an ethical way?

Yes. The data is openly licensed and available for public use.


* How would you mitigate any of the ethical implications you have identified?

To mitigate the ethical implications of scraping Wikipedia:

Respect Licensing: Explicitly attribute Wikipedia and adhere to the CC BY-SA 4.0 license for reuse.

Practice Server Etiquette: Implement rate limiting (using delays like time.sleep()) to avoid placing undue load on Wikipedia's servers and prevent potential Denial of Service (DoS) issues.

Verify Credibility: Always cite the primary sources (e.g., WHO, OECD) referenced by the Wikipedia article to ensure data authority and minimize misrepresentation risk.
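
The rate-limiting point can be sketched as a small helper that pauses between requests. This is an assumption-laden illustration: `fetch_politely` and the stand-in fetcher are hypothetical names, and in the notebook the `fetch` argument could wrap a call such as `scraper.get(url).text`.

```python
import time
from typing import Callable, Iterable, List

def fetch_politely(urls: Iterable[str],
                   fetch: Callable[[str], str],
                   delay_seconds: float = 1.0) -> List[str]:
    """Call `fetch` on each URL, sleeping `delay_seconds` between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause before every request after the first
        results.append(fetch(url))
    return results

# Usage with a stand-in fetcher that makes no network calls.
pages = fetch_politely(
    ["page-a", "page-b"],
    fetch=lambda url: f"<html>{url}</html>",
    delay_seconds=0.01,
)
print(len(pages))
```

Passing the fetcher in as an argument keeps the delay logic testable without touching Wikipedia's servers.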