# Project Milestone 3: Cleaning/Formatting Website Source

### Load Necessary Packages

In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from fuzzywuzzy import process
import math

import warnings
warnings.filterwarnings('ignore')

### Import and View Data from Website

In [2]:
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_suicide_rate'  # store url as variable
r = requests.get(url)                                    # request the page
html = r.content                                         # raw html from the response
soup = BeautifulSoup(html, 'html.parser')                # parse the html

In [3]:
for table in soup.find_all('table'):       # iterates through all tables to help narrow down which table we need
    print("Classes of Table:")
    print(table.get('class'))

# We will need the table with classes 'sortable', 'nowrap', 'mw-datatable', 'static-row-numbers', 'wikitable'.

Classes of Table:
['sidebar', 'sidebar-collapse', 'nomobile', 'nowraplinks', 'vcard', 'hlist']
Classes of Table:
['sortable', 'mw-datatable', 'static-row-numbers', 'wikitable']
Classes of Table:
['sortable', 'nowrap', 'mw-datatable', 'static-row-numbers', 'wikitable']
Classes of Table:
['sortable', 'nowrap', 'mw-datatable', 'static-row-numbers', 'wikitable']
Classes of Table:
['sortable', 'nowrap', 'mw-datatable', 'static-row-numbers', 'wikitable']
Classes of Table:
['wikitable', 'sortable']
Classes of Table:
['wikitable', 'sortable']
Classes of Table:
['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner']
Classes of Table:
['nowraplinks', 'hlist', 'mw-collapsible', 'autocollapse', 'navbox-inner']
Classes of Table:
['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner']


In [4]:
# find the right table using bs4
table = soup.find('table', class_ = 'sortable nowrap mw-datatable static-row-numbers wikitable')
print(type(table))   # confirm table type

<class 'bs4.element.Tag'>
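
Passing the full class string to `find` only matches when the `class` attribute is exactly that string, in that order. As a hedged alternative, the sketch below uses a CSS selector, which matches on class membership instead. The HTML snippet and the names `demo_soup` and `demo_table` are made up for illustration; they are not part of the Wikipedia page.

```python
from bs4 import BeautifulSoup

# tiny stand-in page (illustrative); the real article has many more tables
html = """
<table class="sidebar"><tr><td>nav</td></tr></table>
<table class="sortable mw-datatable wikitable"><tr><td>first</td></tr></table>
<table class="wikitable nowrap sortable"><tr><td>target</td></tr></table>
"""
demo_soup = BeautifulSoup(html, 'html.parser')

# the selector requires all three classes but ignores their order in the attribute
demo_table = demo_soup.select_one('table.sortable.nowrap.wikitable')
print(demo_table.td.text)
```

Like `find`, `select_one` returns the first matching table, so the result still depends on the order of tables on the page.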


In [5]:
# collect one dict per table row, then build the dataframe once
# (DataFrame.append was removed in pandas 2.0)
rows = []

# iterate through every row of table
for row in table.tbody.find_all('tr'):
    # find all data for each column
    columns = row.find_all('td')

    # skip header rows (no <td> cells) and rows with missing data
    if columns and 'class="table-na"' not in str(columns):
        rows.append({'Country': columns[0].text.strip(),
                     'All': columns[1].text.strip(),
                     'Male': columns[2].text.strip(),
                     'Female': columns[3].text.strip()})

# prepare dataframe with columns for all data
df = pd.DataFrame(rows, columns = ['Country', 'All', 'Male', 'Female'])

In [6]:
# view data
df.head()

Unnamed: 0,Country,All,Male,Female
0,Afghanistan *,7.6,7.9,8.0
1,Albania,7.6,6.5,6.6
2,Algeria,5.9,5.7,5.6
3,Angola,30.0,29.8,29.0
4,Antigua and Barbuda,4.5,4.2,2.8


### Step 1: Remove Asterisks from *Country* Column

In [7]:
# uses a regular expression (raw string) to replace * with an empty string
df['Country'] = df['Country'].str.replace(r'\*', '', regex = True)

In [8]:
# view new data
df.head()

Unnamed: 0,Country,All,Male,Female
0,Afghanistan,7.6,7.9,8.0
1,Albania,7.6,6.5,6.6
2,Algeria,5.9,5.7,5.6
3,Angola,30.0,29.8,29.0
4,Antigua and Barbuda,4.5,4.2,2.8


Asterisks are removed to clean the data and to allow better matching when it is joined with the csv data.
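
A minimal sketch of the same cleanup on a toy Series, assuming pandas is available. Passing `regex=False` treats the asterisk as a literal character, so no escaping is needed. The names `sample` and `cleaned` and the sample values are illustrative only.

```python
import pandas as pd

# toy data standing in for the scraped Country column (illustrative values)
sample = pd.Series(['Afghanistan *', 'Albania', 'Angola*'])

# regex=False treats '*' as a literal character rather than a regex quantifier
cleaned = sample.str.replace('*', '', regex=False)
print(cleaned.tolist())
```

Note that removing the asterisk leaves any whitespace that preceded it in place, so the first toy value still ends in a space.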

### Step 2: Remove Countries that Don't Match CSV Data (Part 1)

In [9]:
# import csv data
csv_data = pd.read_csv('csv_data.csv')
# create list of all countries included in csv data
countries = csv_data['Country'].unique()
# make upper case for matching
countries = [x.upper() for x in countries]

In [10]:
# if entry is in the countries list, mark as TRUE, otherwise FALSE
df['MATCH'] = df['Country'].apply(lambda x: 'TRUE' if x.upper() in countries else 'FALSE')

We import the csv data and isolate the countries it contains, so that rows in the website data whose countries are not in the csv data can be removed. The csv data contains the latitude and longitude information needed to retrieve our key data (weather statistics), so any country missing from it is not relevant to this analysis.
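
The membership check can also be written without `apply`. The sketch below is one possible vectorized form on toy data, producing real booleans rather than the strings 'TRUE'/'FALSE'. The frames `scraped` and `csv_countries` and their values are illustrative stand-ins, not the project data.

```python
import pandas as pd

# toy stand-ins for the scraped table and the csv country list (illustrative values)
scraped = pd.DataFrame({'Country': ['Albania', 'Brunei', 'Algeria']})
csv_countries = pd.Series(['ALBANIA', 'ALGERIA', 'ANGOLA'])

# compare upper-cased names; isin returns a boolean Series
scraped['MATCH'] = scraped['Country'].str.upper().isin(set(csv_countries))

# countries that have no exact counterpart in the csv list
unmatched = scraped.loc[~scraped['MATCH'], 'Country'].tolist()
print(unmatched)
```

With a boolean column, the unmatched rows can be selected with `~scraped['MATCH']` instead of a string comparison.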

**The below section shows how white space was identified, which was then addressed in step 3. We will return to step 2 after this issue is addressed.**

In [11]:
# many of these are in the countries list, so let's figure out why they are not matching
df.loc[df['MATCH'] == 'FALSE']

Unnamed: 0,Country,All,Male,Female,MATCH
0,Afghanistan,7.6,7.9,8.0,False
7,Australia,18.8,17.6,16.2,False
12,Bangladesh,10.0,9.3,8.7,False
14,Belarus,69.3,71.5,71.9,False
18,Bhutan,8.6,8.5,8.2,False
19,Bolivia,11.8,11.3,11.1,False
23,Brunei,3.0,2.4,2.6,False
27,Cape Verde,33.3,33.1,33.4,False
29,Cameroon,29.8,30.3,30.0,False
30,Canada,16.6,16.8,16.3,False


In [12]:
# let's look at a country that was marked as false, but we know should have been true
df.loc[df['MATCH'] == 'FALSE'].iloc[0,0]

'Afghanistan\u202f'

In [13]:
# now let's look at one marked as true
df.iloc[1,0]

'Albania'

It looks like there is white space that needs to be removed. We are going to add in a step to remove this white space, but we will leave this here to show how the white space was identified.
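
The character in question is not an ordinary space. A small sketch using only the standard library shows one way to expose and name such invisible characters; the variable names `name` and `hidden` are illustrative.

```python
import unicodedata

name = 'Afghanistan\u202f'   # the value shown in the output above

# repr() exposes characters that print() renders invisibly
print(repr(name))

# list the code point and Unicode name of every whitespace character
hidden = [(hex(ord(ch)), unicodedata.name(ch)) for ch in name if ch.isspace()]
print(hidden)

# str.strip() with no arguments removes Unicode whitespace, not only ASCII spaces
print(name.strip() == 'Afghanistan')
```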

### Step 3: Remove Trailing Whitespace from Country Names

In [14]:
# removes both leading and trailing whitespace from country column
df['Country'] = df['Country'].str.strip()

White space is removed to find perfect matches and have consistency in the data.

### Step 2: Remove Countries that Don't Match CSV Data (Part 2)

In [15]:
# second check: if entry is in the countries list, mark as TRUE, otherwise FALSE
df['MATCH'] = df['Country'].apply(lambda x: 'TRUE' if x.upper() in countries else 'FALSE')

In [16]:
# find countries that still do not match after removing white space
# .copy() makes fuzzy an independent dataframe, so columns can be added to it safely
fuzzy = df.loc[df['MATCH'] == 'FALSE'].copy()

In [17]:
# perform fuzzy matching to narrow down final countries that do not have a match

# holds matches
matches = []

# puts countries from website without matches in a list
website_list = list(fuzzy['Country'])
website_list = [x.upper() for x in website_list]

# saves matches for all countries
for i in website_list:
    matches.append(process.extract(i, countries, limit = 2))
    
# appends matches to current dataframe
fuzzy['Matches'] = matches

In [18]:
# view matches
fuzzy[['Country', 'Matches']]

Unnamed: 0,Country,Matches
19,Bolivia,"[(BOLIVIA, PLURINATIONAL STATE OF, 90), (VENEZ..."
23,Brunei,"[(BURUNDI, 77), (RÉUNION, 67)]"
27,Cape Verde,"[(CABO VERDE, 80), (CÔTE D'IVOIRE, 55)]"
39,Ivory Coast,"[(CROATIA, 61), (CÔTE D'IVOIRE, 58)]"
43,Czech Republic,"[(CONGO, THE DEMOCRATIC REPUBLIC OF THE, 86), ..."
44,North Korea,"[(KOREA, REPUBLIC OF, 86), (NORTH MACEDONIA, 62)]"
45,DR Congo,"[(CONGO, 90), (CONGO, THE DEMOCRATIC REPUBLIC ..."
57,Fiji,"[(AZERBAIJAN, 45), (BOLIVIA, PLURINATIONAL STA..."
77,Iran,"[(IRAN, ISLAMIC REPUBLIC OF, 90), (IRAQ, 75)]"
90,Laos,"[(BARBADOS, 68), (LAO PEOPLE'S DEMOCRATIC REPU..."


In [19]:
# store matches
matches2 = []
p = []

# iterate through each country's candidate matches and keep only those scoring at least 90
for j in fuzzy['Matches']:
    for k in j:
        
        if k[1] >= 90:
            p.append(k[0])
              
    matches2.append(",".join(p))
    p = []
      
# store the resultant matches back to dataframe
fuzzy['Final_Matches'] = matches2

In [20]:
# view final matches
fuzzy[['Country', 'Final_Matches']]

Unnamed: 0,Country,Final_Matches
19,Bolivia,"BOLIVIA, PLURINATIONAL STATE OF"
23,Brunei,
27,Cape Verde,
39,Ivory Coast,
43,Czech Republic,
44,North Korea,
45,DR Congo,CONGO
57,Fiji,
77,Iran,"IRAN, ISLAMIC REPUBLIC OF"
90,Laos,


In [21]:
# place countries without a match into a list
drop_rows = list(fuzzy['Country'].loc[fuzzy['Final_Matches'] == ''])

In [22]:
# select only rows where country is not in drop list. This removes 11 rows
df = df[~df['Country'].isin(drop_rows)]

Exact matches were first isolated, followed by fuzzy matches. The 11 rows identified as not having a match to our csv data were dropped. 
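
The threshold idea can be illustrated without fuzzywuzzy. The sketch below swaps in the standard library's `difflib`, which scores similarity on a 0 to 1 scale with a different algorithm, so it will not reproduce the scores shown above. The helper `best_match`, the candidate list, and the 0.9 cutoff are assumptions chosen for illustration.

```python
from difflib import get_close_matches

# illustrative candidates in the csv's upper-case naming style
csv_names = ['VIET NAM', 'BURUNDI', 'IRAQ']

def best_match(name, candidates, cutoff=0.9):
    """Return the closest candidate scoring at or above cutoff, else None."""
    found = get_close_matches(name.upper(), candidates, n=1, cutoff=cutoff)
    return found[0] if found else None

print(best_match('Vietnam', csv_names))   # close spelling variant
print(best_match('Brunei', csv_names))    # similar letters only, below the cutoff
```

A high cutoff trades recall for precision: near-identical spellings pass, while names that merely share letters are rejected and must be dropped or mapped by hand.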

### Step 4: Change Fuzzy Matches Rows to Match Exactly

In [23]:
# store original country names and their matches
fuzzy_matches = fuzzy[['Country', 'Final_Matches']].loc[fuzzy['Final_Matches'] != '']

In [24]:
# convert upper case matches to title case (first letter of each word upper, rest lower)
fuzzy_matches['Final_Matches'] = fuzzy_matches['Final_Matches'].str.title()

In [25]:
# join matches to dataframe to prepare for replacement
df = df.set_index('Country').join(fuzzy_matches.set_index('Country'))
# make country name a column instead of index
df.reset_index(inplace = True)

In [26]:
# replaces all nan values in matches column with original country names
df['Final_Matches'] = df['Final_Matches'].mask(df['Final_Matches'].isna(), df['Country'])

In [27]:
# view data
df.head()

Unnamed: 0,Country,All,Male,Female,MATCH,Final_Matches
0,Afghanistan,7.6,7.9,8.0,True,Afghanistan
1,Albania,7.6,6.5,6.6,True,Albania
2,Algeria,5.9,5.7,5.6,True,Algeria
3,Angola,30.0,29.8,29.0,True,Angola
4,Antigua and Barbuda,4.5,4.2,2.8,True,Antigua and Barbuda


For the matches that were not exact, we are overwriting the country names from the Wikipedia name to match the csv data. This will allow for a more seamless join.
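
The join-then-mask replacement can also be expressed as a dictionary lookup. The following is a sketch on a toy frame, assuming the accepted fuzzy matches are held in a dict; the names `scraped` and `accepted` are illustrative.

```python
import pandas as pd

# illustrative subset: scraped names plus a lookup of accepted fuzzy matches
scraped = pd.DataFrame({'Country': ['Albania', 'Bolivia', 'Iran']})
accepted = {
    'Bolivia': 'Bolivia, Plurinational State Of',
    'Iran': 'Iran, Islamic Republic Of',
}

# map() yields NaN for names without an accepted match; fillna() restores the original name
scraped['Final_Matches'] = scraped['Country'].map(accepted).fillna(scraped['Country'])
print(scraped['Final_Matches'].tolist())
```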

### Step 5: Remove Unnecessary Columns

In [28]:
# drop redundant and unneeded columns (assign the result; drop does not modify df in place)
df = df.drop(['Country', 'MATCH'], axis = 1)
# rearrange column order
df = df[['Final_Matches', 'All', 'Male', 'Female']]

In [29]:
# view data
df.head()

Unnamed: 0,Final_Matches,All,Male,Female
0,Afghanistan,7.6,7.9,8.0
1,Albania,7.6,6.5,6.6
2,Algeria,5.9,5.7,5.6
3,Angola,30.0,29.8,29.0
4,Antigua and Barbuda,4.5,4.2,2.8


A few columns were appended to our dataset while we were conducting the matching. These columns are now removed.

### Step 6: Rename *Final_Matches* Column to *Country*

In [30]:
# renames column
df = df.rename(columns = {"Final_Matches": "Country"})

In [31]:
# view final data
df.head(20)

Unnamed: 0,Country,All,Male,Female
0,Afghanistan,7.6,7.9,8.0
1,Albania,7.6,6.5,6.6
2,Algeria,5.9,5.7,5.6
3,Angola,30.0,29.8,29.0
4,Antigua and Barbuda,4.5,4.2,2.8
5,Argentina,16.0,17.6,17.2
6,Armenia,5.5,5.3,5.7
7,Australia,18.8,17.6,16.2
8,Austria,24.9,23.0,24.9
9,Azerbaijan,5.8,6.2,6.5


The Final_Matches column took over as the main Country column, so it should be renamed.

### Ethical Implications

Although the data accessed here is public, privacy remains one of the main ethical considerations when working with sensitive data such as suicide rates. Opinions vary on what level of information should be available for public use. Rates do not contain names or other personally identifying information, but the individuals concerned are still counted in the publicly available figures. During these transformations, the decision was made not to replace country names with aliases, which removes one possible layer of anonymization. There is also a concern about how representative the data is, since some countries had to be removed because they are unavailable in the joining dataset. Most countries in the original dataset are retained, but excluding the rest may still raise ethical concerns.

Unnamed: 0,Country,All,Male,Female
0,Afghanistan,7.6,7.9,8.0
1,Albania,7.6,6.5,6.6
2,Algeria,5.9,5.7,5.6
3,Angola,30.0,29.8,29.0
4,Antigua and Barbuda,4.5,4.2,2.8
...,...,...,...,...
167,"Venezuela, Bolivarian Republic Of",11.3,11.5,10.7
168,Viet Nam,9.4,9.5,9.3
169,Yemen,10.5,10.4,10.2
170,Zambia,35.9,34.3,34.5
