<img src=https://upload.wikimedia.org/wikipedia/en/8/80/Wikipedia-logo-v2.svg width=150 />

### $ \text{Abstract}$

This tutorial demonstrates how to use a simple **scraping** technique to extract data from **Wikipedia**'s [**lists of countries**](https://en.wikipedia.org/wiki/Category:Lists_of_countries) by category. These lists hold a large amount of data organized into different **categories**, collected in order to examine trends and compare them between countries.

<img src='https://github.com/Daniboy370/Machine-Learning/blob/master/Misc/Animation/VID-out-Wiki.gif?raw=true' width=550 />



### $\text{Table of Contents}$

* [**Choosing criterion** : 👀 ](#1)
* [**Scraping pipeline** : 🛠️](#2)
* [**Results** : 📉](#3)

In [1]:
import re
import math
import time
import numpy as np
import pandas as pd

<a id="1"></a> 
### $ \text{Chosen criterion : ♂️ / ♀️ }$

Among all categories, this kernel focuses on one interesting, comparable criterion: [**Human sex ratio**](https://en.wikipedia.org/wiki/Human_sex_ratio).
 
#### $\text{What is this ? }$

**Sex ratio** is a demographic measure that denotes the ratio of **males** to **females** in a population. Like most sexually reproducing species, humans have a sex ratio close to 1:1. Although the ratio is biased towards males at birth, it changes over time due to several factors [[**1**](https://en.wikipedia.org/wiki/Half_the_Sky#:~:text=294%20pp.&text=763098931-,Half%20the%20Sky%3A%20Turning%20Oppression%20into%20Opportunity%20for%20Women%20Worldwide,by%20Knopf%20in%20September%202009.)] :
* Sex-selective infanticide [[**2**](https://en.wikipedia.org/wiki/Infanticide)]
* Sex-selective abortion [[**3**](https://qz.com/335183/heres-why-men-on-earth-outnumber-women-by-60-million/)]
* Deliberate gender control [[**4**](https://pulitzercenter.org/reporting/single-man-one-chinese-bachelors-search-love)]
* War and crime casualties [[**5**](http://data.un.org/Data.aspx?q=sex+ratio+birth&d=PopDiv&f=variableID%3a52)]

For example, the following population pyramid shows the sex-ratio distribution across age groups in Bahrain. The skew is caused mostly by policies that restrict female spouses and children of immigrant workers :
<img src='https://upload.wikimedia.org/wikipedia/commons/4/49/Pyramide_Bahrein.PNG' width=600 />
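
To make the definition concrete, here is a minimal sketch of the males-to-females computation. The function name `sex_ratio` and the head counts are made up for illustration and are not taken from the Wikipedia table.

```python
def sex_ratio(males: int, females: int) -> float:
    """Return the number of males per female in a population."""
    if females <= 0:
        raise ValueError("females must be positive")
    return males / females


# Hypothetical head counts, chosen only to illustrate the arithmetic
ratio = sex_ratio(males=1050, females=1000)
print(f"{ratio:.2f}")  # a value above 1 means more males than females
```

A value below 1, as in several rows of the table scraped later, means that females outnumber males.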



<a id="2"></a> 
### $ \text{Scraping pipeline : 🛠️}$

$1.$ Import the relevant packages and extract the data from the desired web page :

In [2]:
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Download the page and parse it with Python's built-in HTML parser
url = 'https://en.wikipedia.org/wiki/List_of_countries_by_sex_ratio'
html = urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
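
The download step can also be written with an explicit `User-Agent` header and a timeout, which is generally considered polite when scraping (Wikimedia asks automated clients to identify themselves) and avoids hanging on a slow connection. The following is a minimal sketch using only the standard library; the helper names `build_request` and `fetch_html`, the agent string, and the 10-second timeout are assumptions, not part of the original kernel.

```python
from urllib.request import Request, urlopen

URL = 'https://en.wikipedia.org/wiki/List_of_countries_by_sex_ratio'
# Hypothetical identifier; replace with something that describes your script
USER_AGENT = 'sex-ratio-tutorial/0.1 (educational scraping example)'


def build_request(url: str) -> Request:
    """Create a request object that carries a descriptive User-Agent."""
    return Request(url, headers={'User-Agent': USER_AGENT})


def fetch_html(url: str, timeout: float = 10.0) -> bytes:
    """Download the page body; raises URLError/HTTPError on failure."""
    with urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request does not touch the network
req = build_request(URL)
print(req.full_url)
print(req.get_header('User-agent'))
```

Passing the result of `fetch_html(URL)` to `BeautifulSoup(..., 'html.parser')` would then yield the same kind of `soup` object as in the cell above.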

$2.$ Define utilities for cleaning the scraped values :

In [3]:
tables = soup.find_all('table')

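# ---- Validate input propriety ---- #
def process_num(num):
    # Take the first decimal number in the cell, so footnote marks such as
    # '[a]' no longer make float() raise ValueError; return NaN if none found
    match = re.search(r'\d+(?:\.\d+)?', num)
    res = float(match.group()) if match else math.nan
    return res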

# ----- Validate cell propriety ---- #
def is_cell_valid(cells):
    for i in range(len(cells)):
        if cells[i].text.strip() == 'N/A':
            return False
    return True

# ---- Validate string propriety --- #
def make_str_valid( name ):
    # Drop a trailing parenthetical note, e.g. 'Name (note)' -> 'Name'
    ind_delim = name.find('(')

    if ind_delim != -1 :
        wrd = name[:ind_delim].strip()
    else:
        wrd = name

    return wrd
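
As a quick sanity check of the cleaning ideas above, the sketch below runs two stand-alone helpers on hand-written strings. The names `first_number` and `strip_parenthetical` and the sample inputs are hypothetical; they mirror the intent of the utilities rather than reproduce them exactly.

```python
import math
import re


def first_number(text: str) -> float:
    """Return the first decimal number found in text, or NaN if none."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    return float(match.group()) if match else math.nan


def strip_parenthetical(name: str) -> str:
    """Drop everything from the first '(' onwards and trim whitespace."""
    return name.split('(', 1)[0].strip()


# Hand-written samples, not actual cells from the Wikipedia page
print(first_number('1.05'))
print(first_number('0.97[a]'))
print(first_number('N/A'))
print(strip_parenthetical('Exampleland (disputed)'))
```

Returning NaN for unparsable cells lets a later `dropna()` remove them in one step.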

$3.$ Extract **relevant data** from downloaded web page :

In [4]:
countries, Sex_R = [], []

for table in tables:
    rows = table.find_all('tr')
    
    for row in rows:
        cells = row.find_all('td')
        
        if ( len(cells) > 1 and is_cell_valid(cells) ):
            # Col_1 :: country
            country = cells[0]
            country_strip = country.text.strip()
            countries.append( make_str_valid( country_strip ))
            
            # Last column :: sex-ratio
            S_R = cells[-1]
            Sex_R.append(process_num(S_R.text.strip()))
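
The row-and-cell loop above can be tried offline on a tiny hand-written table. The sketch below swaps BeautifulSoup for Python's standard-library `html.parser.HTMLParser`, purely so that it runs without extra packages; the class name `TableCellParser`, the country names and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class TableCellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []        # finished rows, each a list of cell strings
        self._row = None      # cells of the row currently being read
        self._cell = None     # text fragments of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Hand-written sample, not the real Wikipedia markup
SAMPLE = """
<table>
  <tr><th>Country</th><th>At birth</th><th>Total</th></tr>
  <tr><td>Alphaland</td><td>1.05</td><td>0.98</td></tr>
  <tr><td>Betaland</td><td>N/A</td><td>N/A</td></tr>
  <tr><td>Gammaland (note)</td><td>1.06</td><td>1.01</td></tr>
</table>
"""

parser = TableCellParser()
parser.feed(SAMPLE)

# Keep rows with more than one <td> and no 'N/A' cell, as in the loop above
valid = [r for r in parser.rows if len(r) > 1 and 'N/A' not in r]
pairs = [(r[0], float(r[-1])) for r in valid]
print(pairs)
```

The header row contains only `<th>` cells, so it yields no `<td>` entries and is skipped by the `len(r) > 1` check, just as in the original loop.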

$4.$ **Instantiate** data frame :

In [5]:
# Instantiate data frame
df = pd.DataFrame({'Country':countries, 'Sex-Ratio':Sex_R})

# Clean data-frame ( Duplicates & NaNs )
df[df.duplicated()]
df = df.drop_duplicates(subset=['Country'], keep='last').dropna()
df.sample(15)

| Index | Country | Sex-Ratio |
|---|---|---|
| 8 | Antigua and Barbuda | 0.89 |
| 185 | Sint Maarten | 0.98 |
| 76 | Ghana | 0.97 |
| 159 | Paraguay | 1.0 |
| 63 | Eswatini | 0.9 |
| 161 | Philippines | 1.01 |
| 145 | Netherlands | 0.98 |
| 130 | Mauritania | 0.93 |
| 120 | Lithuania | 0.86 |
| 216 | United States | 0.97 |
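
The effect of `drop_duplicates(subset=['Country'], keep='last')` followed by `dropna()` is easiest to see on a toy frame. The rows below are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Invented rows: 'Alphaland' appears twice, 'Betaland' has no value
toy = pd.DataFrame({
    'Country':   ['Alphaland', 'Betaland', 'Alphaland', 'Gammaland'],
    'Sex-Ratio': [1.05,        np.nan,     0.98,        1.01],
})

# keep='last' retains the final occurrence of each country name,
# then dropna() removes rows that still contain a missing value
clean = toy.drop_duplicates(subset=['Country'], keep='last').dropna()
print(clean)
```

In the scraping loop, countries are appended table by table, so if a country shows up in more than one scraped table, the value from the later table is the one that survives.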


$5.$ Load the ISO-3166 country codes and manually **fix** country names that do not match Wikipedia's :

In [6]:
country_raw = pd.read_csv('https://raw.githubusercontent.com/lukes/ISO-3166-Countries-with-Regional-Codes/master/all/all.csv')
df_c = country_raw.iloc[:, [0,2]]
df_c = df_c.rename(columns={'name':'Country', 'alpha-3':'ISO-code'})
df_c = df_c.drop_duplicates(subset=['Country'], keep='last').dropna()

# Manual modification of the data
df_c.at[ df_c[df_c['Country']=='Viet Nam'].index.values[0], 'Country' ] = 'Vietnam'
df_c.at[ df_c[df_c['Country']=='United States of America'].index.values[0], 'Country' ] = 'United States'
df_c.at[ df_c[df_c['Country']=='Iran (Islamic Republic of)'].index.values[0], 'Country' ] = 'Iran'

In [7]:
df_c.at[ df_c[df_c['Country']=='Russian Federation'].index.values[0], 'Country' ] = 'Russia'
df_c.at[ df_c[df_c['Country']=='United Kingdom of Great Britain and Northern Ireland'].index.values[0], 'Country' ] = 'United Kingdom'
df_c.at[ df_c[df_c['Country']=='Venezuela (Bolivarian Republic of)'].index.values[0], 'Country' ] = 'Venezuela'
df_c.at[ df_c[df_c['Country']=='Korea (Democratic People\'s Republic of)'].index.values[0], 'Country' ] = 'Korea, North'
df_c.at[ df_c[df_c['Country']=='Korea, Republic of'].index.values[0], 'Country' ] = 'Korea, South'
df_c.at[ df_c[df_c['Country']=='Bolivia (Plurinational State of)'].index.values[0], 'Country' ] = 'Bolivia'
df_c.at[ df_c[df_c['Country']=='Côte d\'Ivoire'].index.values[0], 'Country' ] = 'Ivory Coast'
df_c.at[ df_c[df_c['Country']=='Congo'].index.values[0], 'Country' ] = 'Congo, Republic of the'
df_c.at[ df_c[df_c['Country']=='Tanzania, United Republic of'].index.values[0], 'Country' ] = 'Tanzania'

# Using the ISO-3166 coding standard to map countries
df['ISO-code'] = df['Country'].map(df_c.set_index('Country')['ISO-code'])
# Clean data-frame ( Duplicates & NaNs )
df.isna().sum()
df = df.dropna()
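
The one-by-one renaming above could also be written with a dictionary, and it can help to list the countries that fail to match before `dropna()` silently removes them. This is a sketch on invented rows: `RENAMES`, `iso` and `scraped` are hypothetical stand-ins for the mapping, `df_c` and `df`, and the ratios and the `ALP` code are placeholders rather than real data.

```python
import pandas as pd

# Stand-in for the ISO table (two real entries plus one made-up row)
iso = pd.DataFrame({
    'Country':  ['Viet Nam', 'Russian Federation', 'Alphaland'],
    'ISO-code': ['VNM',      'RUS',                'ALP'],
})

# Stand-in for the scraped frame; ratios are placeholders,
# and 'Atlantis' deliberately has no ISO entry
scraped = pd.DataFrame({
    'Country':   ['Vietnam', 'Russia', 'Alphaland', 'Atlantis'],
    'Sex-Ratio': [1.00,      1.00,     1.00,        1.00],
})

# One dictionary replaces the repeated .at[...] assignments
RENAMES = {'Viet Nam': 'Vietnam', 'Russian Federation': 'Russia'}
iso['Country'] = iso['Country'].replace(RENAMES)

# Map names to codes, then inspect what did not match before dropping
scraped['ISO-code'] = scraped['Country'].map(iso.set_index('Country')['ISO-code'])
unmatched = scraped.loc[scraped['ISO-code'].isna(), 'Country'].tolist()
print(unmatched)

matched = scraped.dropna(subset=['ISO-code'])
print(matched)
```

Printing `unmatched` on the real data would show which names still need an entry in the renaming dictionary.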

<a id="3"></a> 
### $\text{Results}$

Presenting the cleaned data frame on a world map using Plotly Express's [**choropleth**](https://plotly.github.io/plotly.py-docs/generated/plotly.express.choropleth.html) function 🗺️ :

In [8]:
import plotly.express as px

thres = 1.3
df_th = df.drop(df[ df['Sex-Ratio'] > thres ].index)

# color palette @ https://plotly.com/python/builtin-colorscales/
fig = px.choropleth(df_th, locations='ISO-code',
                    color="Sex-Ratio", hover_name="Country",
                    color_continuous_scale=px.colors.sequential.Sunset, projection="natural earth")
fig.update_layout(title={'text':'Sex-Ratio per country', 'y':0.95, 'x':0.5, 'xanchor':'center', 'yanchor':'top'})
fig.show() # Sunset / Bluered / Electric

Another option is to apply the ***orthographic*** projection, which enables an interactive presentation of the globe 🌐 :

In [9]:
import plotly.express as px

thres = 1.3
df_th = df.drop(df[ df['Sex-Ratio'] > thres ].index)

# color palette @ https://plotly.com/python/builtin-colorscales/
fig = px.choropleth(df_th, locations='ISO-code',
                    color="Sex-Ratio", hover_name="Country",
                    color_continuous_scale=px.colors.sequential.Sunset, projection="orthographic")
fig.update_layout(title={'text':'Sex-Ratio per country', 'y':0.95, 'x':0.5, 'xanchor':'center', 'yanchor':'top'})
fig.show() # Sunset / Bluered / Electric
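
Since the two plotting cells above differ only in the `projection` argument, they could be folded into one helper. This is a minimal sketch, assuming a frame with the same `Country`, `Sex-Ratio` and `ISO-code` columns; `keep_below` and `sex_ratio_map` are hypothetical names, and the sample rows and their codes are invented.

```python
import pandas as pd


def keep_below(df: pd.DataFrame, thres: float = 1.3) -> pd.DataFrame:
    """Keep rows whose sex ratio does not exceed the threshold."""
    return df[df['Sex-Ratio'] <= thres]


def sex_ratio_map(df: pd.DataFrame, projection: str = 'natural earth'):
    """Build the choropleth figure for the given map projection."""
    # Imported here so the filtering helper works without plotly installed
    import plotly.express as px

    fig = px.choropleth(keep_below(df), locations='ISO-code',
                        color='Sex-Ratio', hover_name='Country',
                        color_continuous_scale=px.colors.sequential.Sunset,
                        projection=projection)
    fig.update_layout(title={'text': 'Sex-Ratio per country', 'x': 0.5,
                             'xanchor': 'center'})
    return fig


# Invented rows; the last one is above the 1.3 cut-off and is filtered out
sample = pd.DataFrame({
    'Country':   ['Alphaland', 'Betaland', 'Gammaland'],
    'ISO-code':  ['AAA',       'BBB',      'CCC'],
    'Sex-Ratio': [0.95,        1.04,       2.10],
})
print(keep_below(sample)['Country'].tolist())
```

With the real data, `sex_ratio_map(df).show()` would give the flat map and `sex_ratio_map(df, 'orthographic').show()` the globe view.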

$$
\circ \text{ Comments (💬) , feedback (🤔) and upvotes (👍) are much welcome ! } \circ 
$$

<img src='https://github.com/Daniboy370/Temp_upload/blob/master/VID-Globe.gif?raw=true' width=400 />

$$
−fin−
$$