<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#URLs" data-toc-modified-id="URLs-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>URLs</a></span></li><li><span><a href="#Request-&amp;-Response" data-toc-modified-id="Request-&amp;-Response-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Request &amp; Response</a></span></li><li><span><a href="#Wrangling-HTML-With-BeautifulSoup" data-toc-modified-id="Wrangling-HTML-With-BeautifulSoup-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Wrangling HTML With BeautifulSoup</a></span></li><li><span><a href="#Title-of-HTML-content" data-toc-modified-id="Title-of-HTML-content-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Title of HTML content</a></span></li><li><span><a href="#Find-All-Tables" data-toc-modified-id="Find-All-Tables-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Find All Tables</a></span></li><li><span><a href="#Find-Right-Table-to-scrap" data-toc-modified-id="Find-Right-Table-to-scrap-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Find Right Table to scrap</a></span></li><li><span><a href="#Number-of-Columns" data-toc-modified-id="Number-of-Columns-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Number of Columns</a></span></li><li><span><a href="#Get-the-Rows" data-toc-modified-id="Get-the-Rows-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Get the Rows</a></span></li><li><span><a href="#Get-Table-Header-Attributes" data-toc-modified-id="Get-Table-Header-Attributes-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Get Table Header Attributes</a></span></li><li><span><a href="#Get-Tablular-Data" data-toc-modified-id="Get-Tablular-Data-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Get Tablular Data</a></span><ul class="toc-item"><li><span><a href="#Data-Analysis" data-toc-modified-id="Data-Analysis-10.1"><span class="toc-item-num">10.1&nbsp;&nbsp;</span>Data Analysis</a></span></li><li><span><a href="#Scrap-the-Data" data-toc-modified-id="Scrap-the-Data-10.2"><span 
class="toc-item-num">10.2&nbsp;&nbsp;</span>Scrap the Data</a></span></li><li><span><a href="#Create-Dictionary" data-toc-modified-id="Create-Dictionary-10.3"><span class="toc-item-num">10.3&nbsp;&nbsp;</span>Create Dictionary</a></span></li><li><span><a href="#Create-DataFrame" data-toc-modified-id="Create-DataFrame-10.4"><span class="toc-item-num">10.4&nbsp;&nbsp;</span>Create DataFrame</a></span></li><li><span><a href="#Rename-DataFrame-Columns" data-toc-modified-id="Rename-DataFrame-Columns-10.5"><span class="toc-item-num">10.5&nbsp;&nbsp;</span>Rename DataFrame Columns</a></span></li><li><span><a href="#Top-5-Countries-with-Highest-Population" data-toc-modified-id="Top-5-Countries-with-Highest-Population-10.6"><span class="toc-item-num">10.6&nbsp;&nbsp;</span>Top 5 Countries with Highest Population</a></span></li><li><span><a href="#Lets-do-some-Clean-Up-!" data-toc-modified-id="Lets-do-some-Clean-Up-!-10.7"><span class="toc-item-num">10.7&nbsp;&nbsp;</span>Lets do some Clean Up !</a></span></li></ul></li><li><span><a href="#Visuals" data-toc-modified-id="Visuals-11"><span class="toc-item-num">11&nbsp;&nbsp;</span>Visuals</a></span></li></ul></div>

In [2]:
# for performing your HTTP requests
import requests  

# for XML & HTML scraping
from bs4 import BeautifulSoup 

# for table analysis
import pandas as pd

# write to csv
import csv

# Time
import time

#Visuals
import matplotlib.pyplot as plt

import numpy as np

## URLs

In [3]:
# URL of the Wikipedia page whose tabular data you want to scrape.
url1 = "https://en.wikipedia.org/wiki/List_of_Renaissance_composers"

## Request & Response

In [4]:
# A Session object lets you persist certain parameters (e.g. cookies, headers) across requests.
# By default, requests waits for a response indefinitely, so it is advisable to set the timeout parameter.
# If the request was successful, the response is displayed as '<Response [200]>'.
s = requests.Session()
response = s.get(url1, timeout=10)
#response2 = s.get(url2, timeout=5)
response


<Response [200]>
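A displayed `<Response [200]>` is easy to overlook when a later run fails. The sketch below wraps the same call in a small helper that raises on an HTTP error status. `fetch_html` is a hypothetical name, and the `session` argument is assumed to behave like `requests.Session`.

```python
def fetch_html(session, url, timeout=10):
    """Return the page body as bytes, raising on HTTP error statuses.

    Assumption: `session` behaves like requests.Session, i.e. get() returns
    an object with raise_for_status() and a .content attribute.
    """
    response = session.get(url, timeout=timeout)
    # For a requests response, raise_for_status() raises HTTPError on 4xx/5xx,
    # so a failed download is never passed on to the parser.
    response.raise_for_status()
    return response.content
```

In this notebook it could be called as `fetch_html(s, url1)` and the result passed to `BeautifulSoup`.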

## Wrangling HTML With BeautifulSoup

In [5]:
# parse response content to html
soup = BeautifulSoup(response.content, 'html.parser')

In [8]:
# to view the content in html format
pretty_soup = soup.prettify()
# print(pretty_soup)

## Title of HTML content

In [9]:
# title of Wikipedia page
soup.title.string

'List of Renaissance composers - Wikipedia'

## Find All Tables

In [10]:
# find all the tables in the html
all_tables=soup.find_all('table')

## Find Right Table to scrap

In [11]:
# first table with class 'wikitable sortable'
right_table=soup.find('table', {"class":'wikitable sortable'})

In [46]:
# all tables with class 'wikitable sortable'
right_tables=soup.find_all('table', {"class":'wikitable sortable'})

In [48]:
# the second matching table (index 1) is the one scraped below
right_table = right_tables[1]

## Number of Columns

In [1]:
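# Number of <td> cells in each row of the table.
# Run this cell after `right_table` is assigned; in a fresh kernel it fails with the NameError shown below.
for row in right_table.find_all("tr"):
    cells = row.find_all('td')
    print(len(cells))

# cell count of the last row only
len(cells)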

NameError: name 'right_table' is not defined

## Get the Rows

In [50]:
# number of rows in the table, including the header row
rows = right_table.find_all("tr")
len(rows)

16

## Get Table Header Attributes

In [51]:
# header attributes of the table
header = [th.text.rstrip() for th in rows[0].find_all('th')]
print(header)
print('------------')
print(len(header))

['Name', 'Born', 'Died', 'Notes']
------------
4


## Get Tablular Data

### Data Analysis

In [52]:
first_data_row = 1  # Usually 1, but if the header has 2 rows, it is 2

In [53]:
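lst_data_raw = []
for row in rows[first_data_row:]:
    tds = row.select('td')
    name_col = tds[0]
    # the first link in the name cell points to the composer's article
    link = name_col.a
    href = link['href'] if link is not None else ''
    # keep the article name only for internal '/wiki/...' links
    article = href.split('/')[-1] if href.startswith('/wiki') else ''
    # record layout: article, name, then the remaining cells (born, died, notes)
    data = [article, name_col.text.rstrip()]
    data.extend([d.text.rstrip() for d in tds[1:]])
    lst_data_raw.append(data)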
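The article name is derived from the link target, so it helps to isolate that step. Here is a minimal sketch using only the standard library. `article_from_href` is a hypothetical helper, not part of the notebook. Unlike `href.split('/')[-1]`, it drops any `#fragment`, decodes percent-escapes, and keeps titles that contain a slash.

```python
from urllib.parse import unquote, urlparse

def article_from_href(href):
    """Return the article name for an internal '/wiki/...' link, else ''."""
    path = urlparse(href).path  # drops any '?query' or '#fragment'
    if not path.startswith('/wiki/'):
        return ''
    # Decode percent-escapes such as %C3%A9 so names stay readable.
    return unquote(path[len('/wiki/'):])
```

For the hrefs shown in this table the result matches the notebook's own `article` value, e.g. `'John_Plummer_(composer)'`.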

In [54]:
# list any name links that do not point to a '/wiki/...' article (prints nothing for this table)
for row in rows[first_data_row:]:
    tds = row.select('td')
    name_col = tds[0]
    href = name_col.a['href']
    if not href.startswith('/wiki'):
        print(href)

In [57]:
# all scraped records
lst_data_raw

[['Pycard',
  'Pycard',
  ' fl. c. 1390',
  'after c. 1410',
  'Has works preserved in the first layer of the Old Hall Manuscript and elsewhere. His identity is unclear; probably English, but possibly from France.'],
 ['Leonel_Power', 'Leonel Power', 'c. 1370', '1445', ''],
 ['Roy_Henry',
  'Roy Henry',
  'fl. 1410',
  'after 1410',
  'Very likely to be Henry V of England (1387–1422)'],
 ['Byttering',
  'Byttering possibly Thomas Byttering',
  'fl. c. 1410',
  'after 1420',
  ''],
 ['John_Plummer_(composer)', 'John Plummer', 'c. 1410', 'c. 1483', ''],
 ['Henry_Abyngdon', 'Henry Abyngdon', 'c. 1418', '1497', ''],
 ['Walter_Frye', 'Walter Frye', 'fl. c. 1450', '1474', ''],
 ['William_Horwood_(composer)',
  'William Horwood',
  'c. 1430',
  '1484',
  'Some of his music is collected in the Eton Choirbook.'],
 ['John_Hothby',
  'John Hothby Johannes Ottobi',
  'c. 1430',
  '1487',
  'English theorist and composer mainly active in Italy.'],
 ['William_Hawte', 'William Hawte William Haute', '

In [58]:
# number of fields in the first record
len(lst_data_raw[0])

5
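Checking only the first record does not show whether every row has five fields. Cells with `rowspan` or `colspan` would make some rows shorter. The sketch below counts the record lengths across the whole list. `row_length_counts` is a hypothetical helper name.

```python
from collections import Counter

def row_length_counts(records):
    """Count how many records have each length."""
    return Counter(len(record) for record in records)
```

If `row_length_counts(lst_data_raw)` has a single key, the table is regular and the later transpose with `zip` loses nothing.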

In [59]:
# HTML of each table row

list_row = []
for row in right_table.find_all("tr"):
    list_row.append(row)

print('Number of rows:', len(list_row))
print('----------------')
print(list_row[first_data_row])
print('----------------')
print('The Name column holds the link reference')
print('----------------')
print(list_row[0].find_all('th'))
print('----------------')
print(list_row[first_data_row].find('a').text)

Number of rows: 16
----------------
<tr>
<td><a href="/wiki/Pycard" title="Pycard">Pycard</a></td>
<td><span data-sort-value="1370" style="display:none;"></span> <i>fl.</i> c. 1390</td>
<td><span data-sort-value="1410" style="display:none;"></span>after c. 1410</td>
<td>Has works preserved in the first layer of the <a href="/wiki/Old_Hall_Manuscript" title="Old Hall Manuscript">Old Hall Manuscript</a> and elsewhere. His identity is unclear; probably English, but possibly from France.
</td></tr>
----------------
The Name column holds the link reference
----------------
[<th width="150">Name
</th>, <th width="95">Born
</th>, <th width="95">Died
</th>, <th width="250">Notes
</th>]
----------------
Pycard


### Scrap the Data

### Create Dictionary

In [60]:
lst_data_raw[:4]

[['Pycard',
  'Pycard',
  ' fl. c. 1390',
  'after c. 1410',
  'Has works preserved in the first layer of the Old Hall Manuscript and elsewhere. His identity is unclear; probably English, but possibly from France.'],
 ['Leonel_Power', 'Leonel Power', 'c. 1370', '1445', ''],
 ['Roy_Henry',
  'Roy Henry',
  'fl. 1410',
  'after 1410',
  'Very likely to be Henry V of England (1387–1422)'],
 ['Byttering',
  'Byttering possibly Thomas Byttering',
  'fl. c. 1410',
  'after 1420',
  '']]

In [61]:
# transpose the row records into one tuple per column
data = list(zip(*lst_data_raw))

In [62]:
# first four values of each column
[i[:4] for i in data]

[('Pycard', 'Leonel_Power', 'Roy_Henry', 'Byttering'),
 ('Pycard',
  'Leonel Power',
  'Roy Henry',
  'Byttering possibly Thomas Byttering'),
 (' fl. c. 1390', 'c. 1370', 'fl. 1410', 'fl. c. 1410'),
 ('after c. 1410', '1445', 'after 1410', 'after 1420'),
 ('Has works preserved in the first layer of the Old Hall Manuscript and elsewhere. His identity is unclear; probably English, but possibly from France.',
  '',
  'Very likely to be Henry V of England (1387–1422)',
  '')]
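The `zip(*...)` idiom used above is a transpose: it groups the i-th item of every row into one tuple. A small self-contained illustration follows, using two records copied from the output above. The variable names are for this sketch only.

```python
sample_rows = [
    ['Leonel_Power', 'Leonel Power', 'c. 1370', '1445', ''],
    ['Walter_Frye', 'Walter Frye', 'fl. c. 1450', '1474', ''],
]

# zip(*rows) pairs up the i-th item of every row, i.e. it transposes.
columns = list(zip(*sample_rows))

# Transposing again restores the row layout (zip yields tuples, hence list()).
restored = [list(r) for r in zip(*columns)]
```

Note that `zip` stops at the shortest row, so a row with a missing cell would silently shorten every column.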

In [63]:
# data

In [64]:
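# This key list has seven names for five columns (zip would silently drop the extras); the next cell replaces it.
# dat = dict(zip(['article', 'name', 'birth', 'death', 'nationality', 'notable_21_century_works', 'comment'], data))

In [65]:
# one key per scraped column: article, name, born, died, notes
dat = dict(zip(['article', 'name', 'birth', 'death', 'comment'], data))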
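Because `zip` stops at its shortest input, a key list that is longer or shorter than the number of columns produces a mislabelled dictionary without any error. One possible guard is sketched below. `columns_to_dict` is a hypothetical helper, not something the notebook defines.

```python
def columns_to_dict(names, columns):
    """Map column names to column tuples, refusing mismatched lengths."""
    if len(names) != len(columns):
        raise ValueError(f"{len(names)} names for {len(columns)} columns")
    return dict(zip(names, columns))
```

With the five names used here, `columns_to_dict(['article', 'name', 'birth', 'death', 'comment'], data)` should build the same dictionary as the cell above.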

In [66]:
dat

{'article': ('Pycard',
  'Leonel_Power',
  'Roy_Henry',
  'Byttering',
  'John_Plummer_(composer)',
  'Henry_Abyngdon',
  'Walter_Frye',
  'William_Horwood_(composer)',
  'John_Hothby',
  'William_Hawte',
  'Richard_Hygons',
  'Gilbert_Banester',
  'Walter_Lambe',
  'Hugh_Kellyk',
  'Edmund_Turges'),
 'name': ('Pycard',
  'Leonel Power',
  'Roy Henry',
  'Byttering possibly Thomas Byttering',
  'John Plummer',
  'Henry Abyngdon',
  'Walter Frye',
  'William Horwood',
  'John Hothby Johannes Ottobi',
  'William Hawte William Haute',
  'Richard Hygons',
  'Gilbert Banester',
  'Walter Lambe',
  'Hugh Kellyk',
  'Edmund Turges possibly the same as Edmund Sturges'),
 'birth': (' fl. c. 1390',
  'c. 1370',
  'fl. 1410',
  'fl. c. 1410',
  'c. 1410',
  'c. 1418',
  'fl. c. 1450',
  'c. 1430',
  'c. 1430',
  'c. 1430',
  'c. 1435',
  'c. 1445',
  'c. 1450',
  'late 15th century',
  '1450'),
 'death': ('after c. 1410',
  '1445',
  'after 1410',
  'after 1420',
  'c. 1483',
  '1497',
  '1474',


In [30]:
# dat['nationality'] = ['Burgundian' if not n else f"{n}-Burgundian" for n in dat['nationality']]

### Create DataFrame

In [75]:
# convert dict to DataFrame
df = pd.DataFrame(dat)

# Top 5 records
df.head(5)

Unnamed: 0,article,name,birth,death,comment
0,Pycard,Pycard,fl. c. 1390,after c. 1410,Has works preserved in the first layer of the ...
1,Leonel_Power,Leonel Power,c. 1370,1445,
2,Roy_Henry,Roy Henry,fl. 1410,after 1410,Very likely to be Henry V of England (1387–1422)
3,Byttering,Byttering possibly Thomas Byttering,fl. c. 1410,after 1420,
4,John_Plummer_(composer),John Plummer,c. 1410,c. 1483,


In [71]:
# Last 5 records
df.tail(5)

Unnamed: 0,article,name,birth,death,comment,era,nationality
10,Richard_Hygons,Richard Hygons,c. 1435,c. 1509,,Renaissance,English
11,Gilbert_Banester,Gilbert Banester,c. 1445,1487,,Renaissance,English
12,Walter_Lambe,Walter Lambe,c. 1450,after 1504,Major contributor to the Eton Choirbook.,Renaissance,English
13,Hugh_Kellyk,Hugh Kellyk,late 15th century,16th century?,"has two surviving pieces, a five-part Magnific...",Renaissance,English
14,Edmund_Turges,Edmund Turges possibly the same as Edmund Sturges,1450,1500,Has a number of works preserved in the Eton Ch...,Renaissance,English


In [76]:
df['era'] = 'Renaissance'

In [78]:
df['nationality'] = 'English'

In [80]:
col_order = ['article', 'name', 'birth', 'death', 'nationality', 'comment', 'era']

In [81]:
df = df[col_order]

## Composing

In [73]:
df2 = pd.read_csv('composers_Renaissance_table_.csv')

In [87]:
df2.head()

Unnamed: 0,article,name,birth,death,nationality,comment,era
0,Johannes_Tapissier,Johannes Tapissier (Jean de Noyers),c. 1370,before 1410,Burgundian,,Renaissance
1,Nicolas_Grenon,Nicolas Grenon,c. 1375,1456,Burgundian,,Renaissance
2,Pierre_Fontaine_(composer),Pierre Fontaine,c. 1380,c. 1450,Burgundian,,Renaissance
3,Jacobus_Vide,Jacobus Vide,fl. 1405?,after 1433,Burgundian,,Renaissance
4,Guillaume_Legrant,Guillaume Legrant (Lemarcherier),fl. 1405,after 1449,Burgundian,,Renaissance


In [84]:
df2['comment'] = ''

In [88]:
df2 = df2[col_order]

In [89]:
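# stack the English composers on top of the previously saved table;
# ignore_index=True renumbers the rows 0..n-1 instead of repeating each frame's own index
dfc = pd.concat([df, df2], ignore_index=True)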

In [91]:
# row counts: new table, saved table, combined table
list(map(len, [df, df2, dfc]))

[15, 21, 36]

### Saving

In [92]:
df = dfc

In [93]:
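# overwrites the file that was read into df2 above; rerunning the notebook would add these rows again
df.to_csv('composers_Renaissance_table_.csv', index=False)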

In [94]:
# rows whose death entry is not an empty string
df[df.death != ""]

Unnamed: 0,article,name,birth,death,nationality,comment,era
0,Pycard,Pycard,fl. c. 1390,after c. 1410,English,Has works preserved in the first layer of the ...,Renaissance
1,Leonel_Power,Leonel Power,c. 1370,1445,English,,Renaissance
2,Roy_Henry,Roy Henry,fl. 1410,after 1410,English,Very likely to be Henry V of England (1387–1422),Renaissance
3,Byttering,Byttering possibly Thomas Byttering,fl. c. 1410,after 1420,English,,Renaissance
4,John_Plummer_(composer),John Plummer,c. 1410,c. 1483,English,,Renaissance
5,Henry_Abyngdon,Henry Abyngdon,c. 1418,1497,English,,Renaissance
6,Walter_Frye,Walter Frye,fl. c. 1450,1474,English,,Renaissance
7,William_Horwood_(composer),William Horwood,c. 1430,1484,English,Some of his music is collected in the Eton Cho...,Renaissance
8,John_Hothby,John Hothby Johannes Ottobi,c. 1430,1487,English,English theorist and composer mainly active in...,Renaissance
9,William_Hawte,William Hawte William Haute,c. 1430,1497,English,,Renaissance


In [95]:
# rows without an article link (none here)
df[df.article == '']

Unnamed: 0,article,name,birth,death,nationality,comment,era


In [96]:
df.dtypes

article        object
name           object
birth          object
death          object
nationality    object
comment        object
era            object
dtype: object
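Every column is of dtype `object`, so `birth` and `death` are still free text such as `'fl. c. 1410'` or `'after 1420'`. If a numeric year is wanted, one possible approach is sketched below. `first_year` is a hypothetical helper, and discarding the qualifiers (`c.`, `fl.`, `after`) is an assumption about what the analysis needs.

```python
import re

# A standalone run of three or four digits, e.g. '1370' in 'c. 1370'.
_YEAR = re.compile(r'\b(\d{3,4})\b')

def first_year(text):
    """Return the first 3- or 4-digit number in `text` as an int, or None."""
    if not isinstance(text, str):  # e.g. NaN for cells that were empty in a CSV
        return None
    match = _YEAR.search(text)
    return int(match.group(1)) if match else None
```

It could be applied with `df['birth'].map(first_year)`. Entries like `'late 15th century'` contain no such number and map to `None`.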