## Web Scraping
***
##### This showcase notebook demonstrates web scraping with Jupyter Notebook and Python libraries. I extract the table 'European Countries by Population (2024)' from www.worldometers.info
***

- Firstly, let's import the libraries we'll need:

In [27]:
from bs4 import BeautifulSoup
import requests
import pandas as pd

- Secondly, we define the URL of the page we want to scrape and download its HTML source:

In [42]:
url = 'https://www.worldometers.info/population/countries-in-europe-by-population/'

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')  # name the parser explicitly
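
Before parsing, it can help to confirm that the request actually succeeded. The following is a minimal sketch and not part of the original notebook: `fetch_html` is a hypothetical helper, and the 10-second timeout is an assumed value.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the HTML source of `url`, raising an error on a failed request.

    Hypothetical helper; the 10-second default timeout is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text


# Usage (performs a real network request):
# html = fetch_html('https://www.worldometers.info/population/countries-in-europe-by-population/')
```

The returned string could then be passed to `BeautifulSoup` in place of `page.text`.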

In [43]:
print(soup)

<!DOCTYPE html>
<!--[if IE 8]> <html lang="en" class="ie8"> <![endif]--><!--[if IE 9]> <html lang="en" class="ie9"> <![endif]--><!--[if !IE]><!--><html lang="en"> <!--<![endif]--> <head> <meta charset="utf-8"/> <meta content="IE=edge" http-equiv="X-UA-Compatible"/> <meta content="width=device-width, initial-scale=1" name="viewport"/> <title>European Countries by Population (2024) - Worldometer</title><meta content="List of countries in Europe ranked by population, from the most populous. Growth rate, median age, fertility rate, area, density, population density, urbanization, urban population, share of world population." name="description"/><!-- Favicon --><link href="/favicon/favicon.ico" rel="shortcut icon" type="image/x-icon"/><link href="/favicon/apple-icon-57x57.png" rel="apple-touch-icon" sizes="57x57"/><link href="/favicon/apple-icon-60x60.png" rel="apple-touch-icon" sizes="60x60"/><link href="/favicon/apple-icon-72x72.png" rel="apple-touch-icon" sizes="72x72"/><link href="/favi

- Then, I isolate the table from the rest of the page's HTML, since we won't need any other data from this page. After identifying the table's class attribute in the HTML, we can select the 'table' tag by that class:

In [44]:
soup.find('table')

<table cellspacing="0" class="table table-striped table-bordered" id="example2" width="100%"> <thead> <tr> <th>#</th> <th>Country (or dependency)</th> <th>Population<br/> (2023)</th> <th>Yearly<br/> Change</th> <th>Net<br/> Change</th> <th>Density<br/> (P/Km²)</th> <th>Land Area<br/> (Km²)</th> <th>Migrants<br/> (net)</th> <th>Fert.<br/> Rate</th> <th>Med.<br/> Age</th> <th>Urban<br/> Pop %</th> <th>World<br/> Share</th> </tr> </thead> <tbody> <tr> <td>1</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/russia-population/">Russia</a></td> <td style="font-weight: bold;">144,444,359</td> <td>-0.19 %</td> <td>-268,955</td> <td>9</td> <td>16,376,870</td> <td>-136,414</td> <td>1.5</td> <td>39</td> <td>75 %</td> <td>1.80 %</td> </tr> <tr> <td>2</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/germany-population/">Germany</a></td> <td style="font-weight: bold;">83,294,633</td> <td>-0.09 %</td> <td>-75,2

In [45]:
europe_table = soup.find('table', class_ = 'table table-striped table-bordered')
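
The table tag in the output above also carries `id="example2"`, so the class string is not the only way to select it. The sketch below compares three lookups on a cut-down stand-in for the page; `sample_html` and the variable names are illustrative assumptions, not the real page source.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the real page (assumption for illustration)
sample_html = """
<table class="table table-striped table-bordered" id="example2">
  <thead><tr><th>#</th><th>Country (or dependency)</th></tr></thead>
  <tbody><tr><td>1</td><td>Russia</td></tr></tbody>
</table>
"""
sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Match the full class string, as in the cell above
by_class = sample_soup.find('table', class_='table table-striped table-bordered')
# Match the id attribute
by_id = sample_soup.find('table', id='example2')
# Match a single class with a CSS selector
by_css = sample_soup.select_one('table.table-striped')
```

All three lookups return the same table here. Which one is most robust depends on how the site changes its markup over time.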

- Next, I'll isolate the table header cells. In HTML they are marked up as 'th' elements:

In [46]:
europe_table.find_all('th')

[<th>#</th>,
 <th>Country (or dependency)</th>,
 <th>Population<br/> (2023)</th>,
 <th>Yearly<br/> Change</th>,
 <th>Net<br/> Change</th>,
 <th>Density<br/> (P/Km²)</th>,
 <th>Land Area<br/> (Km²)</th>,
 <th>Migrants<br/> (net)</th>,
 <th>Fert.<br/> Rate</th>,
 <th>Med.<br/> Age</th>,
 <th>Urban<br/> Pop %</th>,
 <th>World<br/> Share</th>]

In [47]:
europe_titles = europe_table.find_all('th')

- Next, I extract the text of each header cell:

In [48]:
europe_table_titles = [title.text for title in europe_titles]

print(europe_table_titles)

['#', 'Country (or dependency)', 'Population (2023)', 'Yearly Change', 'Net Change', 'Density (P/Km²)', 'Land Area (Km²)', 'Migrants (net)', 'Fert. Rate', 'Med. Age', 'Urban Pop %', 'World Share']
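
On this page a space follows each `<br/>`, so `.text` yields clean titles. If a header had no such space, the two parts would be glued together. As a hedged sketch, `get_text` with a separator handles both cases; `sample_header` is a made-up variation of the real markup.

```python
from bs4 import BeautifulSoup

# Made-up variation of the header markup: no space after <br/>
sample_header = '<table><tr><th>#</th><th>Population<br/>(2023)</th></tr></table>'
header_cells = BeautifulSoup(sample_header, 'html.parser').find_all('th')

# .text concatenates the pieces with nothing in between
glued_titles = [cell.text for cell in header_cells]

# get_text strips each piece and joins the pieces with a single space
clean_titles = [cell.get_text(' ', strip=True) for cell in header_cells]
```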


- Then, using pandas, I define an empty dataframe named 'data' whose column headers are the 'europe_table_titles' extracted from the web page:

In [49]:
data = pd.DataFrame(columns = europe_table_titles)
data

| | # | Country (or dependency) | Population (2023) | Yearly Change | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|


- In HTML, table data cells are marked up as 'td' elements, which are children of a 'tr' element. Each 'tr' element defines one row of cells in the table. Now, I am extracting the cells and the rows:

In [50]:
europe_table.find_all('td')

[<td>1</td>,
 <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/russia-population/">Russia</a></td>,
 <td style="font-weight: bold;">144,444,359</td>,
 <td>-0.19 %</td>,
 <td>-268,955</td>,
 <td>9</td>,
 <td>16,376,870</td>,
 <td>-136,414</td>,
 <td>1.5</td>,
 <td>39</td>,
 <td>75 %</td>,
 <td>1.80 %</td>,
 <td>2</td>,
 <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/germany-population/">Germany</a></td>,
 <td style="font-weight: bold;">83,294,633</td>,
 <td>-0.09 %</td>,
 <td>-75,210</td>,
 <td>239</td>,
 <td>348,560</td>,
 <td>155,751</td>,
 <td>1.5</td>,
 <td>45</td>,
 <td>77 %</td>,
 <td>1.04 %</td>,
 <td>3</td>,
 <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/uk-population/">United Kingdom</a></td>,
 <td style="font-weight: bold;">67,736,802</td>,
 <td>0.34 %</td>,
 <td>227,866</td>,
 <td>280</td>,
 <td>241,930</td>,
 <td>165,790</td>,
 <td>1.6</td>,
 <td>40

In [51]:
europe_table.find_all('tr')

[<tr> <th>#</th> <th>Country (or dependency)</th> <th>Population<br/> (2023)</th> <th>Yearly<br/> Change</th> <th>Net<br/> Change</th> <th>Density<br/> (P/Km²)</th> <th>Land Area<br/> (Km²)</th> <th>Migrants<br/> (net)</th> <th>Fert.<br/> Rate</th> <th>Med.<br/> Age</th> <th>Urban<br/> Pop %</th> <th>World<br/> Share</th> </tr>,
 <tr> <td>1</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/russia-population/">Russia</a></td> <td style="font-weight: bold;">144,444,359</td> <td>-0.19 %</td> <td>-268,955</td> <td>9</td> <td>16,376,870</td> <td>-136,414</td> <td>1.5</td> <td>39</td> <td>75 %</td> <td>1.80 %</td> </tr>,
 <tr> <td>2</td> <td style="font-weight: bold; font-size:15px; text-align:left"><a href="/world-population/germany-population/">Germany</a></td> <td style="font-weight: bold;">83,294,633</td> <td>-0.09 %</td> <td>-75,210</td> <td>239</td> <td>348,560</td> <td>155,751</td> <td>1.5</td> <td>45</td> <td>77 %</td> <td>1.04 %</td> </tr

- And again, I convert the found HTML elements to text, so that they can be added as rows of our dataframe (named 'data'). The print call sits outside the loop, so it shows only the last row:

In [52]:
column_data = europe_table.find_all('tr')

In [53]:
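for row in column_data:
    row_data = row.find_all('td')
    # 'cell' avoids reusing the name of the 'data' dataframe
    individual_row_data = [cell.text for cell in row_data]

# Outside the loop, so only the last row is printed
print(individual_row_data)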

['47', 'Holy See', '518', '1.57 %', '8', '1,295', '0', '0', '', '', 'N.A.', '0.00 %']
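
The row printed above holds strings such as '1,295' and '1.57 %', plus empty cells and an 'N.A.' placeholder. If numeric values are needed later, a small converter is one option. This is only a sketch: `parse_number` is a hypothetical helper, and it keeps percentages as percentage points (1.57, not 0.0157).

```python
def parse_number(text):
    """Convert a scraped cell such as '1,295' or '1.57 %' to a float.

    Returns None for empty cells and for the 'N.A.' placeholder.
    """
    cleaned = text.replace(',', '').replace('%', '').strip()
    if cleaned in ('', 'N.A.'):
        return None
    return float(cleaned)


last_row = ['47', 'Holy See', '518', '1.57 %', '8', '1,295', '0', '0', '', '', 'N.A.', '0.00 %']
# Skip the first two cells (rank and country name), which are not measurements
numbers = [parse_number(cell) for cell in last_row[2:]]
```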


- Then, I construct the dataframe from our scraped and formatted rows, skipping the header row:

In [55]:
for row in column_data[1:]:  # skip the header row, which has no 'td' cells
    row_data = row.find_all('td')
    individual_row_data = [cell.text for cell in row_data]

    # Append the row at the next free integer index
    length = len(data)
    data.loc[length] = individual_row_data

data

| | # | Country (or dependency) | Population (2023) | Yearly Change | Net Change | Density (P/Km²) | Land Area (Km²) | Migrants (net) | Fert. Rate | Med. Age | Urban Pop % | World Share |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Russia | 144444359 | -0.19 % | -268955 | 9 | 16376870 | -136414 | 1.5 | 39.0 | 75 % | 1.80 % |
| 1 | 2 | Germany | 83294633 | -0.09 % | -75210 | 239 | 348560 | 155751 | 1.5 | 45.0 | 77 % | 1.04 % |
| 2 | 3 | United Kingdom | 67736802 | 0.34 % | 227866 | 280 | 241930 | 165790 | 1.6 | 40.0 | 85 % | 0.84 % |
| 3 | 4 | France | 64756584 | 0.20 % | 129956 | 118 | 547557 | 67761 | 1.8 | 42.0 | 84 % | 0.80 % |
| 4 | 5 | Italy | 58870762 | -0.28 % | -166712 | 200 | 294140 | 58496 | 1.3 | 48.0 | 72 % | 0.73 % |
| 5 | 6 | Spain | 47519628 | -0.08 % | -39002 | 95 | 498800 | 39998 | 1.3 | 45.0 | 80 % | 0.59 % |
| 6 | 7 | Ukraine | 36744634 | -7.45 % | -2957105 | 63 | 579320 | 1784718 | 1.3 | 45.0 | 82 % | 0.46 % |
| 7 | 8 | Poland | 41026067 | 2.93 % | 1168922 | 134 | 306230 | -910475 | 1.5 | 40.0 | 55 % | 0.51 % |
| 8 | 9 | Romania | 19892812 | 1.19 % | 233545 | 86 | 230170 | -254616 | 1.7 | 41.0 | 53 % | 0.25 % |
| 9 | 10 | Netherlands | 17618299 | 0.31 % | 54285 | 522 | 33720 | 29998 | 1.6 | 42.0 | 92 % | 0.22 % |
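
Assigning to `data.loc[length]` grows the dataframe one row at a time. A possible alternative, sketched here on a tiny stand-in table, is to collect the rows in a plain list and build the dataframe in a single call. `sample_html`, `rows` and `sample_df` are illustrative names, not part of the original notebook.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tiny stand-in for the scraped table (assumption for illustration)
sample_html = """
<table>
  <thead><tr><th>#</th><th>Country (or dependency)</th></tr></thead>
  <tbody>
    <tr><td>1</td><td><a href="#">Russia</a></td></tr>
    <tr><td>2</td><td><a href="#">Germany</a></td></tr>
  </tbody>
</table>
"""
sample_table = BeautifulSoup(sample_html, 'html.parser').find('table')

headers = [th.get_text(strip=True) for th in sample_table.find_all('th')]

rows = []
for tr in sample_table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in tr.find_all('td')]
    if cells:  # the header row has only 'th' cells, so its list is empty
        rows.append(cells)

# Build the dataframe once from all collected rows
sample_df = pd.DataFrame(rows, columns=headers)
```

Filtering out empty lists also removes the need to slice off the header row with `[1:]`.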


- Woohoo, we just scraped this table from the web. One last thing: let's export it!

In [56]:
data.to_csv(r'C:\Users\nkant\OneDrive\Desktop\European_Country_Population_Scraped.csv', index = False)
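
The path above only exists on one machine. As a hedged sketch, `pathlib` can build the output path from any folder you choose, and the 'utf-8-sig' encoding may help spreadsheet programs display headers such as 'Km²' correctly. The temporary folder and `sample_df` below are stand-ins for your own output directory and the scraped dataframe.

```python
from pathlib import Path
import tempfile

import pandas as pd

# One-row stand-in for the scraped dataframe (assumption for illustration)
sample_df = pd.DataFrame({'Country (or dependency)': ['Russia'],
                          'Land Area (Km²)': ['16,376,870']})

# A temporary folder stands in for whatever output directory you choose
output_dir = Path(tempfile.mkdtemp())
output_path = output_dir / 'European_Country_Population_Scraped.csv'

# 'utf-8-sig' writes a byte-order mark, which Excel uses to detect UTF-8
sample_df.to_csv(output_path, index=False, encoding='utf-8-sig')

# Read the file back to check the round trip; dtype=str keeps the cells as text
round_trip = pd.read_csv(output_path, encoding='utf-8-sig', dtype=str)
```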