# Clean Entities File

The file `Entities.pdf` from the Territorial Change v6 dataset was converted into a CSV with Tabula. This CSV file required some further cleaning and exploration.

In [1]:
import pandas as pd
import numpy as np
import string

In [2]:
raw_data_path = "../data/raw/"
processed_data_path = "../data/processed/"

In [3]:
# Read every column as a string. The first header contains a literal carriage return (\r).
dfEntities = pd.read_csv(
    raw_data_path + 'tc_entities_tabula.csv',
    encoding='utf-8',
    dtype={'Entity\rNumber': str, 'Name': str, 'Begin Year': str,
           'End Year': str, 'Ending Political Status': str},
)

In [4]:
dfEntities.columns = ['id','name','start_year','end_year','status']

Because this CSV was converted from a PDF, some cells contain literal whitespace control characters (such as `\r`), which need to be replaced with a space.

In [5]:
dfEntities['name'] = dfEntities['name'].str.replace(r'[\t\n\r\x0b\x0c]', ' ', regex=True)
dfEntities['status'] = dfEntities['status'].str.replace(r'[\t\n\r\x0b\x0c]', ' ', regex=True)
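As a small illustration of what this replacement does, the sketch below applies the same pattern to a few toy values. The sample strings are made up for the example and are not rows from the dataset.

```python
import pandas as pd

# Toy values (assumed for illustration) with control characters embedded.
sample = pd.Series(
    ["Gold Coast-Togoland\r(Neutral Zone)", "Became colony\nof 2", None],
    dtype=object,
)

# Replace each tab, newline, carriage return, vertical tab, or form feed
# with a single space. Missing values stay missing.
cleaned = sample.str.replace(r'[\t\n\r\x0b\x0c]', ' ', regex=True)

print(cleaned.tolist())
```

Each matched character is replaced on its own, so a `\r\n` pair would become two spaces rather than one.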

In [6]:
# Strip leading and trailing whitespace from every text column.
for col in ["id", "name", "start_year", "end_year", "status"]:
    dfEntities[col] = dfEntities[col].str.strip()

In [7]:
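# Extract the first 1-4 digit number in the status text, i.e. the id of the
# entity the status refers to ("Became colony of 365" -> "365").
dfEntities['referenced_id'] = dfEntities['status'].str.extract(r'([0-9]{1,4})')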
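The extraction can be checked on a handful of values before trusting it on the full column. This is a minimal sketch on toy strings modelled on the status texts in this notebook; the value `"Unknown"` is hypothetical and only there to show what happens when no number is present. It passes `expand=False`, which is not used in the cell above, so that the result is a Series rather than a one-column DataFrame.

```python
import pandas as pd

# Toy status strings (assumed for illustration); the last one has no number.
status = pd.Series(
    ["Became colony of 365", "Claimed by 6962", "Unknown"],
    dtype=object,
)

# Capture the first run of 1-4 digits; rows without a match become missing.
referenced = status.str.extract(r'([0-9]{1,4})', expand=False)

print(referenced.tolist())
```

Only the first number in each string is captured, and a number longer than four digits would be cut to its first four, so the pattern relies on entity ids having at most four digits.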

In [8]:
dfEntities

Unnamed: 0,id,name,start_year,end_year,status,referenced_id
0,3,Alaska,1816,1867,Became colony of 365,365
1,3,Alaska,1867,1959,Became colony of 2,2
2,3,Alaska,1959,1993,Became part of 2,2
3,4,Hawaii,1898,1960,Became colony of 2,2
4,4,Hawaii,1960,1993,Became part of 2,2
...,...,...,...,...,...,...
2701,9987,Neu Schwabenland,1939,1945,Claimed by 255,255
2702,9991,Peter I I.,1931,1993,Became possession of 385,385
2703,9992,Queen Maud Land,1939,1993,Claimed by 385,385
2704,9993,Bouvet I.,1927,1993,Became possession of 385,385


In [9]:
dfEntities.to_csv(processed_data_path+'tc_entities.csv', index=False)

In [10]:
dfEntities.end_year.astype(float).max()

1997.0

In [11]:
dfEntities.start_year.astype(float).max()

1997.0
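The two checks above cast the year strings with `astype(float)`, which raises a `ValueError` if any value cannot be parsed as a number. One possible alternative, sketched here on toy values rather than the dataset, is `pd.to_numeric` with `errors="coerce"`, which turns unparseable values into NaN so they can be counted instead.

```python
import pandas as pd

# Toy year strings (assumed for illustration); None stands in for a missing year.
years = pd.Series(["1816", "1997", None, "1867"], dtype=object)

# Coerce to numbers; anything missing or unparseable becomes NaN.
numeric_years = pd.to_numeric(years, errors="coerce")

latest = numeric_years.max()            # NaN values are skipped by default
n_missing = int(numeric_years.isna().sum())

print(latest, n_missing)
```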

In [12]:
dfEntities[(dfEntities['start_year'].isna()) | (dfEntities['end_year'].isna())]

Unnamed: 0,id,name,start_year,end_year,status,referenced_id
386,1111,Aves I.,,,Claimed by 101,101
387,1111,Aves I.,,,Claimed by 210,210
435,1301,Galapagos Is.,1832.0,,Became colony of 130,130
1205,4559,Gold Coast-Togoland (Neutral Zone),1887.0,,Became neutral or demilitarized zone of 200,200
1501,609,Spanish Sahara,1975.0,,Occupied by 600,600
1811,6969,Abu Dhabi-Dubai Neutral Zone,,,Claimed by 6962,6962
1812,6969,Abu Dhabi-Dubai Neutral Zone,,,Claimed by 6961,6961
1824,6983,Oman-Sharjah Neutral Zone,,,Claimed by 698,698
1825,6983,Oman-Sharjah Neutral Zone,,,Claimed by 6963,6963
2340,8123,Chien Khouang,1832.0,,Became part of 815,815
