# Data Cleaning Project: Wikipedia List of Roman Emperors

This Jupyter Notebook creates a CSV file containing data about Roman emperors.

The data source is the Wikipedia page "https://en.wikipedia.org/wiki/List_of_Roman_emperors". The HTML file of the page has been downloaded into the project repository.


##### Import Pandas

In [53]:
# Import Pandas library
import pandas as pd

##### Load content

In [54]:
# Parse the pre-downloaded HTML file.
# pd.read_html returns a list of DataFrames, one per table found in the page.
wikipedia_content = pd.read_html(r"list_roman_emperors_2023_wikipedia.htm")

##### Get tables from content

In [55]:
# The Wikipedia page has 12 tables with information about Roman emperors.
# They sit at positions 1 to 12 of the parsed content; collect them into a list.
table_list = [wikipedia_content[i] for i in range(1, 13)]
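
Hard-coded positions break if the page layout changes. As a minimal sketch of an alternative, the tables could be selected by their headers instead. The frames below are hypothetical stand-ins for the parsed content, and the rule "an emperor table has a column label starting with 'Name'" is an assumption, not something verified against the real page.

```python
import pandas as pd

# Hypothetical stand-ins for the list of DataFrames returned by pd.read_html.
tables = [
    pd.DataFrame({"Key": ["legend"], "Meaning": ["n/a"]}),
    pd.DataFrame({"Portrait": [None], "Name[f]": ["Augustus"], "Reign": ["27 BC - 14 AD"]}),
    pd.DataFrame({"Portrait": [None], "Name": ["Vespasian"], "Reign": ["69 - 79"]}),
]


def is_emperor_table(df):
    """Return True if any column label starts with 'Name' (assumed marker)."""
    return any(str(col).startswith("Name") for col in df.columns)


# Keep only the tables that look like emperor tables.
emperor_tables = [df for df in tables if is_emperor_table(df)]
print(len(emperor_tables))
```

With the real page, the result would still need to be checked against the expected count of 12 tables.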

##### Add information to tables: 'Period' and 'Epoch'

In [45]:
# Every table corresponds to a historic period (mostly a dynasty).
# Create a list of all the historic periods.
# The labels of the historic periods are taken from the Wikipedia page.
list_of_periods = ["Julio-Claudian dynasty (27 BC-68 AD)", 
                   "Year of the Four Emperors (68-69)", 
                   "Flavian dynasty (69-96)", 
                   "Nerva-Antonine dynasty (96-192)", 
                   "Year of the Five Emperors (193)", 
                   "Severan dynasty (193-235)", 
                   "Crisis of the Third Century (235-285)", 
                   "Tetrarchy (284-324)", 
                   "Constantinian dynasty (306-363)", 
                   "Valentinianic dynasty (364-392)", 
                   "Theodosian dynasty (379-457)", 
                   "Last western emperors (455-476)"]

In [46]:
# Add each historic period as a 'Period' column to its corresponding table.
for i in range(len(table_list)):
    table_list[i]['Period'] = list_of_periods[i]
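
The labeling step relies on the list of tables and the list of periods having the same length and order. A possible sketch of the same step with an explicit length check, using two small hypothetical tables in place of the real ones:

```python
import pandas as pd

# Hypothetical miniature versions of table_list and list_of_periods.
tables = [
    pd.DataFrame({"Name": ["Augustus", "Tiberius"]}),
    pd.DataFrame({"Name": ["Galba"]}),
]
periods = [
    "Julio-Claudian dynasty (27 BC-68 AD)",
    "Year of the Four Emperors (68-69)",
]

# Fail early if the two lists do not line up one-to-one.
if len(tables) != len(periods):
    raise ValueError(f"{len(tables)} tables but {len(periods)} period labels")

# Assigning a scalar broadcasts the label to every row of the table.
for table, period in zip(tables, periods):
    table["Period"] = period

print(tables[0])
```

Pairing the lists with `zip` avoids index arithmetic, and the check guards against `zip` silently stopping at the shorter list.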

In [47]:
# The emperor tables correspond to two historical epochs: the Principate and the Dominate.
# Create a list of the epochs.
list_of_epochs = ["Principate (27 BC-284 AD)", "Dominate (284-476)"]

In [48]:
# Emperor tables 1 to 7 belong to the epoch 'Principate'; tables 8 to 12 belong to the epoch 'Dominate'.
# Add the matching epoch as an 'Epoch' column to each table in table_list.
for i in range(len(table_list)):
    if i < 7:
        table_list[i]['Epoch'] = list_of_epochs[0]
    else:
        table_list[i]['Epoch'] = list_of_epochs[1]
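
The split at list index 7 could also be isolated in a small helper, which makes the boundary easy to check on its own. This is only a sketch: the function `epoch_for` and its parameter `first_dominate_index` are hypothetical names introduced for illustration.

```python
PRINCIPATE = "Principate (27 BC-284 AD)"
DOMINATE = "Dominate (284-476)"


def epoch_for(table_index, first_dominate_index=7):
    """Map a zero-based table index to its epoch label.

    Indices 0 to 6 (tables 1 to 7) map to the Principate,
    indices 7 and above (tables 8 to 12) map to the Dominate.
    """
    if table_index < first_dominate_index:
        return PRINCIPATE
    return DOMINATE


# Show the label on both sides of the boundary.
for index in (6, 7):
    print(index, epoch_for(index))
```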

##### Clean column 'Name'

In [49]:
# Two of the emperor tables (the 1st and the 10th) have divergent labels
# for the column that holds the names of the emperors.
# In these tables the "Name" header carries a footnote marker,
# which yields the labels "Name[f]" and "Name[q]".
# Make sure the name column (the second column) is labeled "Name" in all emperor tables.
# Index objects are meant to be immutable, so use rename() instead of writing into columns.values.
for i in range(len(table_list)):
    table_list[i] = table_list[i].rename(columns={table_list[i].columns[1]: 'Name'})
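
Since the divergent labels come from footnote markers, another option is to strip such markers from every column label rather than renaming by position. A minimal sketch on a hypothetical one-row table; the pattern assumes the marker is a bracketed run of letters or digits at the end of the label.

```python
import pandas as pd

# Hypothetical table whose name column carries a footnote marker.
table = pd.DataFrame(
    {"Portrait": [None], "Name[f]": ["Augustus"], "Reign": ["27 BC - 14 AD"]}
)

# Remove a trailing marker such as "[f]" or "[q]" from each column label.
table.columns = table.columns.str.replace(r"\[\w+\]$", "", regex=True)

print(list(table.columns))
```

This would also clean footnote markers on other columns, which may or may not be wanted.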

##### Combine tables into dataframe

In [50]:
# Concatenate all 12 emperor tables into a single DataFrame that holds all data.
combined_table = pd.concat(table_list)
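
By default `pd.concat` keeps each table's own row labels, so the combined index contains repeated values. That does not affect a CSV written with `index=False`, but it matters for any label-based lookup done beforehand. A small sketch with hypothetical frames shows how `ignore_index=True` yields a continuous index:

```python
import pandas as pd

# Two hypothetical tables, each with its own index starting at 0.
first = pd.DataFrame({"Name": ["Augustus", "Tiberius"]})
second = pd.DataFrame({"Name": ["Galba"]})

# Default behavior: the original row labels are kept and repeat.
kept = pd.concat([first, second])

# ignore_index=True discards them and numbers the rows 0..n-1.
combined = pd.concat([first, second], ignore_index=True)

print(kept.index.tolist())
print(combined.index.tolist())
```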

##### Save dataframe

In [51]:
# Save combined_table to a CSV file, without the row index.
combined_table.to_csv(r"C:\Users\Rainer\Documents\mein github code\roman_emperors_data_cleaning\data_roman_emperors.csv", index=False)
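
An absolute path ties the notebook to one machine. As a sketch of a more portable variant, the output path could be built with `pathlib` and the file read back as a sanity check. The temporary directory and the one-row frame are assumptions made to keep the example self-contained; in the notebook the target would be a folder of the project repository.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical one-row stand-in for combined_table.
combined = pd.DataFrame(
    {"Name": ["Augustus"], "Period": ["Julio-Claudian dynasty (27 BC-68 AD)"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Build the path from parts instead of hard-coding a full location.
    output_path = Path(tmp_dir) / "data_roman_emperors.csv"
    combined.to_csv(output_path, index=False, encoding="utf-8")

    # Read the file back to confirm that it was written as expected.
    reloaded = pd.read_csv(output_path)

print(reloaded.shape)
```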