<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#URLs" data-toc-modified-id="URLs-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>URLs</a></span></li><li><span><a href="#Request-&amp;-Response" data-toc-modified-id="Request-&amp;-Response-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Request &amp; Response</a></span></li><li><span><a href="#Wrangling-HTML-With-BeautifulSoup" data-toc-modified-id="Wrangling-HTML-With-BeautifulSoup-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Wrangling HTML With BeautifulSoup</a></span></li><li><span><a href="#Title-of-HTML-content" data-toc-modified-id="Title-of-HTML-content-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Title of HTML content</a></span></li><li><span><a href="#Find-All-Tables" data-toc-modified-id="Find-All-Tables-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Find All Tables</a></span></li><li><span><a href="#Find-Right-Table-to-scrap" data-toc-modified-id="Find-Right-Table-to-scrap-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Find Right Table to scrap</a></span></li><li><span><a href="#Number-of-Columns" data-toc-modified-id="Number-of-Columns-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Number of Columns</a></span></li><li><span><a href="#Get-the-Rows" data-toc-modified-id="Get-the-Rows-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Get the Rows</a></span></li><li><span><a href="#Get-Table-Header-Attributes" data-toc-modified-id="Get-Table-Header-Attributes-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Get Table Header Attributes</a></span></li><li><span><a href="#Get-Tablular-Data" data-toc-modified-id="Get-Tablular-Data-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Get Tablular Data</a></span><ul class="toc-item"><li><span><a href="#Data-Analysis" data-toc-modified-id="Data-Analysis-10.1"><span class="toc-item-num">10.1&nbsp;&nbsp;</span>Data Analysis</a></span></li><li><span><a href="#Scrap-the-Data" data-toc-modified-id="Scrap-the-Data-10.2"><span 
class="toc-item-num">10.2&nbsp;&nbsp;</span>Scrap the Data</a></span></li><li><span><a href="#Create-Dictionary" data-toc-modified-id="Create-Dictionary-10.3"><span class="toc-item-num">10.3&nbsp;&nbsp;</span>Create Dictionary</a></span></li><li><span><a href="#Create-DataFrame" data-toc-modified-id="Create-DataFrame-10.4"><span class="toc-item-num">10.4&nbsp;&nbsp;</span>Create DataFrame</a></span></li><li><span><a href="#Rename-DataFrame-Columns" data-toc-modified-id="Rename-DataFrame-Columns-10.5"><span class="toc-item-num">10.5&nbsp;&nbsp;</span>Rename DataFrame Columns</a></span></li><li><span><a href="#Top-5-Countries-with-Highest-Population" data-toc-modified-id="Top-5-Countries-with-Highest-Population-10.6"><span class="toc-item-num">10.6&nbsp;&nbsp;</span>Top 5 Countries with Highest Population</a></span></li><li><span><a href="#Lets-do-some-Clean-Up-!" data-toc-modified-id="Lets-do-some-Clean-Up-!-10.7"><span class="toc-item-num">10.7&nbsp;&nbsp;</span>Lets do some Clean Up !</a></span></li></ul></li><li><span><a href="#Visuals" data-toc-modified-id="Visuals-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Visuals</a></span></li></ul></div>

In [82]:
# for performing HTTP requests
import requests

# for XML & HTML scraping
from bs4 import BeautifulSoup

# for table analysis
import pandas as pd

# write to csv
import csv

# Time
import time

#Visuals
import matplotlib.pyplot as plt

import numpy as np

## URLs

In [2]:
# URL of the Wikipedia page from which to scrape tabular data
url1 = "https://en.wikipedia.org/wiki/List_of_21st-century_classical_composers"

## Request & Response

In [3]:
# A Session object persists certain parameters (such as cookies and headers) across requests.
# By default, requests waits for a response indefinitely, so it is advisable to set the timeout parameter.
# If the request was successful, the response is displayed as '<Response [200]>'.
s = requests.Session()
response = s.get(url1, timeout=10)
#response2 = s.get(url2, timeout=5)
response


<Response [200]>
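
A timeout or an HTTP error status would otherwise only surface later, when parsing fails. The following is a minimal sketch of a more defensive fetch. The helper name `fetch_page` is hypothetical and not part of the original notebook.

```python
import requests


def fetch_page(url, session=None, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    Hypothetical helper for illustration; the notebook calls s.get() directly.
    """
    session = session or requests.Session()
    try:
        response = session.get(url, timeout=timeout)
        # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
        response.raise_for_status()
    except requests.RequestException as exc:
        # RequestException is the base class for timeouts, connection
        # errors and HTTP errors raised by requests
        print(f"Request failed: {exc}")
        return None
    return response.content
```

Returning `None` on failure is only one possible design; re-raising the exception may be preferable when the notebook should stop at the first failed request.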

## Wrangling HTML With BeautifulSoup

In [4]:
# parse the response content as HTML
soup = BeautifulSoup(response.content, 'html.parser')

In [8]:
# to view the content in html format
pretty_soup = soup.prettify()
# print(pretty_soup)

## Title of HTML content

In [9]:
# title of Wikipedia page
soup.title.string

'List of 21st-century classical composers - Wikipedia'

## Find All Tables

In [10]:
# find all the tables in the html
all_tables=soup.find_all('table')

## Find Right Table to scrap

In [11]:
# select the table to scrape
right_table=soup.find('table', {"class":'wikitable sortable'})
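
Passing `{"class": 'wikitable sortable'}` matches the class attribute as an exact string, so a table that carries an additional class would be missed. The sketch below uses made-up sample HTML (not the Wikipedia page) to contrast this with a CSS selector, which matches any table carrying both classes.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration only
html = """
<table class="infobox"><tr><td>skip me</td></tr></table>
<table class="wikitable sortable plainrowheaders"><tr><td>keep me</td></tr></table>
"""
sample_soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: misses the table with an extra class
exact = sample_soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches a table that has both classes, in any order
by_selector = sample_soup.select_one("table.wikitable.sortable")

print(exact)
print(by_selector.td.text)
```

On the page used here the exact match evidently succeeds, so the selector form is only a more tolerant alternative.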

## Number of Columns

In [12]:
# Number of columns in the table, counted from the <td> cells of the last row
# (header rows contain only <th> cells, so they would report 0)
cells = right_table.findAll("tr")[-1].findAll('td')

len(cells)

6

## Get the Rows

In [13]:
# number of rows in the table including header
rows = right_table.findAll("tr")
len(rows)

2924

## Get Table Header Attributes

In [14]:
# header labels from the first header row
# 'Year of' spans two columns (colspan=2), so 5 labels cover the 6 table columns
header = [th.text.rstrip() for th in rows[0].find_all('th')]
print(header)
print('------------')
print(len(header))

['Name', 'Year of', 'Nationality', 'Notable 21st-century works', 'Remarks']
------------
5
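
Because the header occupies two rows, the five labels do not line up with the six data columns. A minimal sketch of flattening such a header follows. The sample HTML is made up, and the second-row labels `birth` and `death` are an assumption borrowed from the column names used later in the notebook.

```python
from bs4 import BeautifulSoup

# Made-up two-row header; 'birth' and 'death' are assumed sub-header labels
sample = """
<table>
<tr><th rowspan="2">Name</th><th colspan="2">Year of</th><th rowspan="2">Nationality</th></tr>
<tr><th>birth</th><th>death</th></tr>
</table>
"""
header_rows = BeautifulSoup(sample, "html.parser").find_all("tr")
sub_headers = iter(th.text.strip() for th in header_rows[1].find_all("th"))

flat_header = []
for th in header_rows[0].find_all("th"):
    span = int(th.get("colspan", 1))
    label = th.text.strip()
    if span == 1:
        flat_header.append(label)
    else:
        # a spanning cell is replaced by one label per sub-header in the second row
        flat_header.extend(f"{label} {next(sub_headers)}" for _ in range(span))

print(flat_header)
# ['Name', 'Year of birth', 'Year of death', 'Nationality']
```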


## Get Tablular Data

### Data Analysis

In [38]:
first_data_row = 2  # Usually 1, but if the header has 2 rows, it is 2

In [112]:
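# Build one record per data row: [article slug, name, remaining cells...]
lst_data_raw = []
for row in rows[first_data_row:]:
    tds = row.select('td')
    name_col = tds[0]
    href = name_col.a['href']
    # Existing articles link to /wiki/<slug>; red links (missing articles) get an empty slug
    article = href.split('/')[-1] if href.startswith('/wiki') else ''
    data = [article, name_col.text.rstrip()]
    data.extend([d.text.rstrip() for d in tds[1:]])
    lst_data_raw.append(data)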

In [113]:
# list the links that do not point to an existing article (red links)
for row in rows[first_data_row:]:
    tds = row.select('td')
    name_col = tds[0]
    href = name_col.a['href']
    if not href.startswith('/wiki'):
        print(href)

/w/index.php?title=Carl_Bergstr%C3%B8m-Nielsen&action=edit&redlink=1
/w/index.php?title=Javier_Jacinto&action=edit&redlink=1
/w/index.php?title=Zoltan_Paulinyi&action=edit&redlink=1
/w/index.php?title=Matthias_Kadar&action=edit&redlink=1
/w/index.php?title=Robert_R%C3%B8nnes&action=edit&redlink=1
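
These red links still carry the intended page title in their query string. If that title is wanted instead of an empty `article` value, the standard library can extract it. This is a sketch with a hypothetical helper name, `title_from_redlink`, and is not part of the notebook.

```python
from urllib.parse import urlparse, parse_qs


def title_from_redlink(href):
    """Return the page title from a red link's query string, or '' if there is none."""
    # parse_qs also decodes percent-encoded characters such as %C3%B8
    query = parse_qs(urlparse(href).query)
    return query.get("title", [""])[0]


print(title_from_redlink("/w/index.php?title=Zoltan_Paulinyi&action=edit&redlink=1"))
# Zoltan_Paulinyi
```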


In [114]:
# sample records
lst_data_raw[0:3]

[['Cassandra_Miller',
  'Cassandra Miller',
  '1976',
  '',
  'Canadian',
  'About Bach (string quartet)',
  ''],
 ['Caio_Fac%C3%B3',
  'Caio Facó',
  '1992',
  '',
  'Brazilian',
  'Diário das Narrativas Fantásticas: Uma fantasia sobre a história da América do Sul (for chamber orchestra)[1]',
  'A major work about Latin American culture'],
 ['Rolf_Riehm', 'Rolf Riehm', '1937', '', 'German', 'Sirenen (opera)', '']]
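
The `article` values are percent-encoded exactly as they appear in the link (for example `Caio_Fac%C3%B3`). If readable slugs are preferred, `urllib.parse.unquote` from the standard library can decode them. This is an optional step that the notebook itself does not perform.

```python
from urllib.parse import unquote

# Slugs copied from the notebook output
encoded_slugs = ["Cassandra_Miller", "Caio_Fac%C3%B3", "Zhu_Jian%27er"]

# unquote() decodes %XX sequences as UTF-8 by default
decoded_slugs = [unquote(slug) for slug in encoded_slugs]
print(decoded_slugs)
# ['Cassandra_Miller', 'Caio_Facó', "Zhu_Jian'er"]
```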

In [100]:
# length of each record
len(lst_data_raw[0])

7

In [101]:
# HTML of a table record

list_row = list(right_table.findAll("tr"))

print('Number of rows:', len(list_row))
print('----------------')
print(list_row[first_data_row])
print('----------------')
print('The first attribute (Name) contains a link reference')
print('----------------')
print(list_row[0].findAll('th'))
print('----------------')
print(list_row[first_data_row].find('a').text)

Number of rows: 2924
----------------
<tr>
<td><a href="/wiki/Cassandra_Miller" title="Cassandra Miller">Cassandra Miller</a></td>
<td>1976</td>
<td></td>
<td>Canadian</td>
<td><i>About Bach</i> (string quartet)</td>
<td>
</td></tr>
----------------
The first attribute (Name) contains a link reference
----------------
[<th data-sort-type="text" rowspan="2">Name
</th>, <th colspan="2">Year of
</th>, <th rowspan="2">Nationality
</th>, <th rowspan="2">Notable 21st-century works
</th>, <th rowspan="2">Remarks
</th>]
----------------
Cassandra Miller


### Scrap the Data

### Create Dictionary

In [115]:
lst_data_raw[:4]

[['Cassandra_Miller',
  'Cassandra Miller',
  '1976',
  '',
  'Canadian',
  'About Bach (string quartet)',
  ''],
 ['Caio_Fac%C3%B3',
  'Caio Facó',
  '1992',
  '',
  'Brazilian',
  'Diário das Narrativas Fantásticas: Uma fantasia sobre a história da América do Sul (for chamber orchestra)[1]',
  'A major work about Latin American culture'],
 ['Rolf_Riehm', 'Rolf Riehm', '1937', '', 'German', 'Sirenen (opera)', ''],
 ['Hiro_Fujikake',
  'Hiro Fujikake',
  '1949',
  '',
  'Japan',
  'Pastoral Fantasy (1975),The Rope Crest (1977), Symphony Japan(1993), Symphony IZUMO(2005)',
  '']]

In [116]:
# transpose the list of row records into one tuple per column
data = list(zip(*lst_data_raw))

In [117]:
[i[:4] for i in data]

[('Cassandra_Miller', 'Caio_Fac%C3%B3', 'Rolf_Riehm', 'Hiro_Fujikake'),
 ('Cassandra Miller', 'Caio Facó', 'Rolf Riehm', 'Hiro Fujikake'),
 ('1976', '1992', '1937', '1949'),
 ('', '', '', ''),
 ('Canadian', 'Brazilian', 'German', 'Japan'),
 ('About Bach (string quartet)',
  'Diário das Narrativas Fantásticas: Uma fantasia sobre a história da América do Sul (for chamber orchestra)[1]',
  'Sirenen (opera)',
  'Pastoral Fantasy (1975),The Rope Crest (1977), Symphony Japan(1993), Symphony IZUMO(2005)')]
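
The `zip(*rows)` idiom transposes rows into columns, but it stops at the shortest row, so a record with a missing cell would silently drop trailing columns. The sketch below uses a shortened copy of the sample records and adds a length check before transposing. The check is an addition for illustration, not something the notebook does.

```python
# Shortened sample records (three fields each) for illustration
rows_sample = [
    ["Cassandra_Miller", "Cassandra Miller", "1976"],
    ["Rolf_Riehm", "Rolf Riehm", "1937"],
]

# Guard: every record must have the same number of fields,
# otherwise zip() would truncate to the shortest one
row_lengths = {len(r) for r in rows_sample}
if len(row_lengths) != 1:
    raise ValueError(f"Rows have differing lengths: {sorted(row_lengths)}")

columns_sample = list(zip(*rows_sample))
names = ["article", "name", "birth"]
as_dict = dict(zip(names, columns_sample))
print(as_dict["birth"])
# ('1976', '1937')
```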

In [61]:
# data

In [118]:
# map each column name to its tuple of values
dat = dict(zip(['article', 'name', 'birth', 'death', 'nationality', 'notable_21_century_works', 'comment'], data))

In [64]:
# dat

### Create DataFrame

In [119]:
# convert dict to DataFrame
df = pd.DataFrame(dat)

# Top 5 records
df.head(5)

| | article | name | birth | death | nationality | notable_21_century_works |
|---|---|---|---|---|---|---|
| 0 | Cassandra_Miller | Cassandra Miller | 1976 | | Canadian | About Bach (string quartet) |
| 1 | Caio_Fac%C3%B3 | Caio Facó | 1992 | | Brazilian | Diário das Narrativas Fantásticas: Uma fantasi... |
| 2 | Rolf_Riehm | Rolf Riehm | 1937 | | German | Sirenen (opera) |
| 3 | Hiro_Fujikake | Hiro Fujikake | 1949 | | Japan | Pastoral Fantasy (1975),The Rope Crest (1977),... |
| 4 | Ann_Cleare | Ann Cleare | 1983 | | Irish | Claustrophobia (orchestra) |
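
The transpose-and-dict step is not strictly required: `pd.DataFrame` also accepts the list of row records directly, together with a `columns` argument. A minimal sketch, using two of the sample records and the same column names as the notebook:

```python
import pandas as pd

# Two records copied from the sample output above
records = [
    ["Cassandra_Miller", "Cassandra Miller", "1976", "", "Canadian", "About Bach (string quartet)", ""],
    ["Rolf_Riehm", "Rolf Riehm", "1937", "", "German", "Sirenen (opera)", ""],
]
columns = ["article", "name", "birth", "death", "nationality", "notable_21_century_works", "comment"]

# Each inner list becomes one row; columns= supplies the labels
df_direct = pd.DataFrame(records, columns=columns)
print(df_direct.shape)
# (2, 7)
```

Unlike `zip`, this form raises an error if a record has more fields than there are column names, which can help catch malformed rows early.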


In [70]:
# Last 5 records
df.tail(5)

| | article | Name | birth | death | Nationality | notable_works |
|---|---|---|---|---|---|---|
| 2917 | Juliana_Hall | Juliana Hall | 1958 | | American | |
| 2918 | Brian_T._Field | Brian T. Field | 1967 | | American | |
| 2919 | Craig_Bohmler | Craig Bohmler | 1956 | | American | |
| 2920 | Evgeny_Kissin | Evgeny Kissin | 1971 | | Russian, British, Israeli | Tango, Meditation, Intermezzo, Toccata, Violon... |
| 2921 | Andrew_March | Andrew March | 1973 | | English | Sanguis Venenatus |


In [120]:
# tag every record with the era covered by this list
df['era'] = '21st-century classical'

### Saving

In [121]:
df.to_csv('composers_21_century.csv', index=False)
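
One caveat when the saved file is loaded again: with default settings, `pd.read_csv` turns empty fields into `NaN`, so a filter such as `df.death == ""` would no longer match after a round trip. The sketch below demonstrates this with an in-memory buffer and made-up rows instead of the real file.

```python
import io

import pandas as pd

# Made-up rows; 'death' is an empty string for a living composer
sample_df = pd.DataFrame({"name": ["A", "B"], "death": ["", "2017"]})

buffer = io.StringIO()
sample_df.to_csv(buffer, index=False)

# Default read: the empty field becomes NaN and the column becomes float
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Keep everything as text and leave empty fields as ''
buffer.seek(0)
as_text = pd.read_csv(buffer, dtype=str, keep_default_na=False)

print(default_read["death"].tolist())
print(as_text["death"].tolist())
```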

In [123]:
# composers with a recorded year of death
df[df.death != ""]

| | article | name | birth | death | nationality | notable_21_century_works | era |
|---|---|---|---|---|---|---|---|
| 7 | Eric_Salzman | Eric Salzman | 1933 | 2017 | American | The True Last Words of Dutch Schultz (opera) | 21st-century classical |
| 9 | J%C3%B3hann_J%C3%B3hannsson | Jóhann Jóhannsson | 1969 | 2018 | Icelandic | And in the Endless Pause There Came the Sound ... | 21st-century classical |
| 10 | Otomar_Kv%C4%9Bch | Otomar Kvěch | 1950 | 2018 | Czech | String Quartet No. 7 | 21st-century classical |
| 12 | Milko_Kelemen | Milko Kelemen | 1924 | 2018 | Croatian | Tromberia for Trumpet and Orchestra | 21st-century classical |
| 14 | Graciela_Agudelo | Graciela Agudelo | 1945 | 2018 | Mexican | | 21st-century classical |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2892 | Ruth_Zechlin | Ruth Zechlin | 1926 | 2007 | German | | 21st-century classical |
| 2894 | Friedrich_Zehm | Friedrich Zehm | 1923 | 2007 | German | | 21st-century classical |
| 2896 | Hans_Zender | Hans Zender | 1936 | 2019 | German | Chief Joseph (opera) | 21st-century classical |
| 2898 | Zhu_Jian%27er | Zhu Jian'er | 1922 | 2017 | Chinese | | 21st-century classical |


In [122]:
# composers without a Wikipedia article (red links)
df[df.article == '']

| | article | name | birth | death | nationality | notable_21_century_works | era |
|---|---|---|---|---|---|---|---|
| 181 | | Carl Bergstrøm-Nielsen [da; de] | 1951 | | Danish | | 21st-century classical |
| 224 | | Javier Jacinto [es; eu] | 1968 | | Spanish | | 21st-century classical |
| 821 | | Zoltan Paulinyi | 1977 | | Brazilian | | 21st-century classical |
| 1648 | | Matthias Kadar | 1977 | | Hungarian-German | | 21st-century classical |
| 2577 | | Robert Rønnes [no] | 1959 | | Norwegian | | 21st-century classical |


In [86]:
df.dtypes

article                     object
Name                        object
birth                       object
death                       object
Nationality                 object
notable_21_century_works    object
era                         object
dtype: object
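
Every column is stored as `object` (text), including the years. If numeric years are needed, for sorting or arithmetic for instance, one possible conversion uses `pd.to_numeric` with pandas' nullable integer type. The sketch below runs on made-up sample rows rather than the full DataFrame.

```python
import pandas as pd

# Made-up sample mirroring the notebook's text columns
sample = pd.DataFrame({"birth": ["1976", "1933"], "death": ["", "2017"]})

for col in ["birth", "death"]:
    # errors="coerce" turns '' and any non-numeric text into NaN;
    # the nullable "Int64" dtype keeps whole years and marks gaps as <NA>
    sample[col] = pd.to_numeric(sample[col], errors="coerce").astype("Int64")

print(sample.dtypes)
```

Note that after this conversion a living composer is identified with `sample.death.isna()` rather than by comparing against an empty string.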