# Web Scraping to find the population of countries

The aim of this code is to find the population of each country using the <a href="https://www.worldometers.info/world-population/population-by-country/" target="_blank">worldometers.info</a> website. The data is then saved as a CSV file so that it can be used alongside the COVID data on my Tableau dashboard.

In [60]:
#import useful packages
from __future__ import print_function
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup

## Reading the webpage
First, we need to read the webpage into Python. The requests library fetches the HTML source of the page. The same HTML can be inspected in a browser by right-clicking the page and choosing 'View page source'. The command below downloads this HTML into Python.

In [5]:
#download the population-by-country page
r = requests.get("https://www.worldometers.info/world-population/population-by-country/")
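
The same request can be made a little more defensively. The sketch below is a minimal example, assuming a hypothetical helper named `fetch_html` and an arbitrary 10-second timeout (neither appears in the original notebook). It raises an error on a failed HTTP status instead of quietly returning an error page.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    `fetch_html` and the 10-second default are illustrative assumptions.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://www.worldometers.info/world-population/population-by-country/")` would return the same string as `r.text`, provided the request succeeds.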

In [6]:
r.text[0:500]

'\n<!DOCTYPE html><!--[if IE 8]> <html lang="en" class="ie8"> <![endif]--><!--[if IE 9]> <html lang="en" class="ie9"> <![endif]--><!--[if !IE]><!--> <html lang="en"> <!--<![endif]--> <head> <meta charset="utf-8"> <meta http-equiv="X-UA-Compatible" content="IE=edge"> <meta name="viewport" content="width=device-width, initial-scale=1"> <title>Population by Country (2020) - Worldometer</title><meta name="description" content="List of countries and dependencies in the world ranked by population, from '

### Parsing the html
At this point, `r` is a response object and `r.text` holds the page's HTML as a single string. To let Python navigate the structure of the HTML, we parse that string with Beautiful Soup, using Python's built-in HTML parser.

In [7]:
soup = BeautifulSoup(r.text, 'html.parser')

## Retrieving relevant data
We need to pick out the data relevant to us: the table listing countries by population. In the page source, each row of this table has the following format:
```
<tr> <td>POP RANK</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="COUNTRY HYPERLINK URL">COUNTRY</a></td> <td style="font-weight: bold;">POPULATION</td> <td>YEARLY CHANGE %</td> <td>NET CHANGE</td> <td>DENSITY</td> <td>LAND AREA</td> <td>MIGRANTS NET</td> <td>FERTILITY RATE</td> <td>MEDIAN AGE</td> <td>URBAN POPULATION %</td> <td>WORLD SHARE %</td> </tr> 
```
Therefore, we use `.find_all('TAG', attrs={'ATTRIBUTE': 'ATTR VALUE'})`, which returns a result set listing every instance of that tag in the HTML. The `attrs` filter is optional; here the tag name alone is enough.
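
To see how `find_all` behaves with and without an attribute filter, it can be tried on a small inline sample. This is a minimal sketch: `sample_html` is a hand-written, shortened stand-in for the real page (an assumption for illustration), not the downloaded source.

```python
from bs4 import BeautifulSoup

# Hand-written sample with one header row and one data row (illustrative only).
sample_html = """
<table>
  <tr><th>#</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/china-population/">China</a></td><td style="font-weight: bold;">1,439,323,776</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Tag name only: every <tr>, including the header row.
rows = sample_soup.find_all("tr")

# Tag name plus attribute filter: only <td> cells whose style matches exactly.
bold_cells = sample_soup.find_all("td", attrs={"style": "font-weight: bold;"})

print(len(rows))
print(rows[1].find_all("td")[1].text)
print([cell.text for cell in bold_cells])
```

The attribute filter compares the whole `style` string, so the country cell (which has a longer style value) is not returned by the second call.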

In [8]:
table = soup.find_all('tr') #note that the tag that encompasses the data we want is <tr> in html or 'tr' in python
results = table[1:] #save the data as results. Note that the first instance of the tag is the column headers and isn't needed.

In [9]:
len(results) 

235

In [11]:
first_result = results[0]

In [12]:
first_result

<tr> <td>1</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/china-population/">China</a></td> <td style="font-weight: bold;">1,439,323,776</td> <td>0.39 %</td> <td>5,540,090</td> <td>153</td> <td>9,388,211</td> <td>-348,399</td> <td>1.7</td> <td>38</td> <td>61 %</td> <td>18.47 %</td> </tr>

### Slicing html tags

In [21]:
#returns the first country
first_result.find_all('td')[1].text

'China'

In [30]:
#returns the first population
int(first_result.find_all('td')[2].text.replace(',', ''))

1439323776

In [31]:
#returns the first land area in km²
int(first_result.find_all('td')[6].text.replace(',', ''))

9388211

In [32]:
#returns the first median age
int(first_result.find_all('td')[9].text.replace(',', ''))

38

In [34]:
#returns the first percentage of population in urban communities
int(first_result.find_all('td')[10].text[:-2].replace(',', ''))

61

Now that we have managed to return the information we want from the first data point in the results set (`results`), we can iterate through the entire set and build a list of tuples, one per data point.

In [41]:
#iterate through the entire results list to collect the country, population, land area, median age and urban population share
records = []
for result in results:
    cells = result.find_all('td') #look up the row's cells once and reuse them
    country = cells[1].text
    population = int(cells[2].text.replace(',', ''))
    area = int(cells[6].text.replace(',', ''))
    age = cells[9].text.replace(',', '') #kept as text because some rows contain 'N.A.'
    urbn = cells[10].text[:-2].replace(',', '') #drop the trailing ' %'; kept as text for the same reason
    records.append((country, population, area, age, urbn))

In [42]:
records[-3:]

[('Niue', 1626, 260, 'N.A.', '46'),
 ('Tokelau', 1357, 10, 'N.A.', '0'),
 ('Holy See', 801, 0, 'N.A.', 'N.')]
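
The `'N.'` in the last record is what remains of `'N.A.'` once `[:-2]` removes its last two characters. One way to avoid this is to clean each cell with a small helper before converting it. The function below is a sketch only; `parse_number` is a hypothetical name, not part of the original notebook.

```python
def parse_number(text):
    """Convert a cell such as '1,439,323,776' or '61 %' to a float.

    Returns None for missing values ('N.A.' or an empty cell).
    `parse_number` is an illustrative helper, not from the original notebook.
    """
    cleaned = text.replace(",", "").replace("%", "").strip()
    if cleaned in ("", "N.A."):
        return None
    return float(cleaned)


print(parse_number("1,439,323,776"))
print(parse_number("61 %"))
print(parse_number("N.A."))
```

Because the percent sign is removed by name rather than by position, a missing value is never truncated into a different string.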

## Saving to a pandas DataFrame
Having checked that the list of tuples holds the data we wanted to retrieve, we now load it into a pandas DataFrame.

In [83]:
#create dataframe in pandas
df4 = pd.DataFrame(records, columns=['Country/Region', 'Population', 'Land Area (km)', 'Median Age', 'Ratio of Urban Population (%)'])

In [97]:
#replace the missing-value markers with NaN; 'N.' is what remains of 'N.A.' after the [:-2] slice
df4 = df4.replace('N.A.', np.nan)
df4 = df4.replace('N.', np.nan)
df4['Median Age'] = df4['Median Age'].astype('float64')
df4['Ratio of Urban Population (%)'] = df4['Ratio of Urban Population (%)'].astype('float64')
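
An alternative to replacing each marker by hand is `pd.to_numeric` with `errors='coerce'`, which turns anything that cannot be parsed as a number into NaN. The sketch below uses a two-row stand-in DataFrame named `sample` (an assumption for illustration) built from values shown in the records above.

```python
import pandas as pd

# Two-row stand-in for df4, using values from the scraped records.
sample = pd.DataFrame(
    [("Niue", 1626, 260, "N.A.", "46"), ("Holy See", 801, 0, "N.A.", "N.")],
    columns=[
        "Country/Region",
        "Population",
        "Land Area (km)",
        "Median Age",
        "Ratio of Urban Population (%)",
    ],
)

# Unparseable strings such as 'N.A.' and 'N.' become NaN.
for column in ["Median Age", "Ratio of Urban Population (%)"]:
    sample[column] = pd.to_numeric(sample[column], errors="coerce")

print(sample.dtypes)
```

This approach also catches any other non-numeric marker, at the cost of hiding unexpected values that might be worth inspecting.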

In [98]:
df4.tail()

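|     | Country/Region   | Population | Land Area (km) | Median Age | Ratio of Urban Population (%) |
|----:|------------------|-----------:|---------------:|-----------:|------------------------------:|
| 230 | Montserrat       | 4992       | 100            | NaN        | 10.0                          |
| 231 | Falkland Islands | 3480       | 12170          | NaN        | 66.0                          |
| 232 | Niue             | 1626       | 260            | NaN        | 46.0                          |
| 233 | Tokelau          | 1357       | 10             | NaN        | 0.0                           |
| 234 | Holy See         | 801        | 0              | NaN        | NaN                           |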


In [99]:
df4.dtypes

Country/Region                    object
Population                         int64
Land Area (km)                     int64
Median Age                       float64
Ratio of Urban Population (%)    float64
dtype: object

In [102]:
df4.to_csv('Population Data.csv', index=False, encoding='utf-8')
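
The `index=False` argument matters when the file is read back. The sketch below is a small round-trip check under stated assumptions: it uses an in-memory buffer and a two-row stand-in DataFrame named `sample` rather than the real file, and compares writing with and without the index.

```python
import io

import pandas as pd

# Two-row stand-in for df4 after cleaning (illustrative values from the table above).
sample = pd.DataFrame(
    [
        ("Tokelau", 1357, 10, float("nan"), 0.0),
        ("Holy See", 801, 0, float("nan"), float("nan")),
    ],
    columns=[
        "Country/Region",
        "Population",
        "Land Area (km)",
        "Median Age",
        "Ratio of Urban Population (%)",
    ],
)

# Written without the index: the columns come back unchanged.
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

# Written with the index: reading back adds an extra 'Unnamed: 0' column.
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)

print(list(restored.columns))
print(list(restored_with_index.columns))
```

Missing values are written as empty fields and are read back as NaN, so the cleaned columns keep their float type.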