# Start of Data Transformation

Task: use pandas to transform the source CSV files into DataFrames that match the tables in the database schema.

Tables:

- TERRITORY (done)
- TERRITORIALCHANGE (done)


In [1]:
import pandas as pd
import numpy as np

In [2]:
!ls ../SourceData/CorrelatesOfWar/

Codebooks                    MID_Narratives_2002-2010.pdf
CowWarList.csv               NMC_5_0-wsupplementary.csv
CowWarList.pdf               Non-StateWarData_v4.0.csv
Entities.pdf                 Territories.csv
Extra-StateWarData_v4.0.csv  alliance_v4.1_by_member.csv
IGO_stateunit_v2.3.csv       contdir.csv
Inter-StateWarData_v4.0.csv  igounit_v2.3.csv
Intra-StateWarData_v4.1.csv  majors2016.csv
MIDA_4.2.csv                 states2016.csv
MIDB_4.2.csv                 system2016.csv
MIDLOCA_2.0.csv              tc2014.csv
MID_Narratives_1993-2001.pdf


## Create 'TERRITORY' table

Task: transform Territories.csv into a table with attributes:

- TerritoryID
- TerritoryName

Note: Territories.csv was created by running Entities.pdf through [Tabula](https://tabula.technology/) and hand-correcting minor errors (for instance, some sets of rows were shifted to the left).

There were also some `\r`s introduced into rows where the TerritoryName was too long. I removed these by hand.
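The carriage returns could also be stripped in pandas rather than by hand. The following is a minimal sketch on made-up toy data, not the step actually used for this notebook; the frame `toy` and its contents are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Tabula output; the second name contains a stray carriage return.
toy = pd.DataFrame({
    'TerritoryID': [3, 5],
    'TerritoryName': ['Alaska', 'Virgin\r Islands'],
})

# Count the affected rows first, then strip the carriage returns.
has_cr = toy['TerritoryName'].str.contains('\r', regex=False)
print(has_cr.sum())

toy['TerritoryName'] = toy['TerritoryName'].str.replace('\r', '', regex=False)
print(toy['TerritoryName'].tolist())
```

Passing `regex=False` treats `'\r'` as a literal character, so no escaping is needed.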

Some TerritoryIDs matched up to multiple TerritoryNames. Those Territory IDs were:

- 374
- 1152
- 3351
- 3377

I suspect this is a coding error, as the names these IDs corresponded to were in different (albeit relatively close) locations. To keep each ID unique, and because only the ID is recorded in the TERRITORIALCHANGE table, I edited these rows by hand so that both rows for an ID carry the same name, with the second name in parentheses.

I used this code to investigate these irregularities:

```
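# Run after dfTerritory has been loaded (In [3]) and its columns renamed (In [4]).
# Flag names that still contain a carriage return, then count them.
weirdnames = dfTerritory['TerritoryName'].str.contains('\\r', regex=True)
weirdnames.sum()
# IDs that occur more than once are paired with more than one name.
dfTerritory['TerritoryID'].value_counts()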
```
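A more direct way to list IDs that map to several names is to count distinct names per ID. This is a sketch on toy data; the placeholder names `'Name A'` and `'Name B'` are invented and are not the real names attached to ID 374.

```python
import pandas as pd

# Toy data: ID 374 is paired with two different (made-up) names.
toy = pd.DataFrame({
    'TerritoryID': [3, 374, 374, 4],
    'TerritoryName': ['Alaska', 'Name A', 'Name B', 'Hawaii'],
})

# Count distinct names per ID and keep the IDs with more than one.
names_per_id = toy.groupby('TerritoryID')['TerritoryName'].nunique()
conflicting_ids = names_per_id[names_per_id > 1].index.tolist()
print(conflicting_ids)
```

Unlike `value_counts()` on the ID column alone, this ignores IDs that merely repeat with the same name.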

In [3]:
dfTerritory = pd.read_csv('../SourceData/CorrelatesOfWar/Territories.csv')
dfTerritory

Unnamed: 0,Entity Number,Name,Begin Year,End Year,Ending Political Status
0,3,Alaska,1816.0,1867.0,Became colony of 365
1,3,Alaska,1867.0,1959.0,Became colony of 2
2,3,Alaska,1959.0,1993.0,Became part of 2
3,4,Hawaii,1898.0,1960.0,Became colony of 2
4,4,Hawaii,1960.0,1993.0,Became part of 2
5,5,Virgin Islands,1816.0,1917.0,Became colony of 390
6,5,Virgin Islands,1917.0,1993.0,Became colony of 2
7,6,Puerto Rico,1816.0,1821.0,Became part of 1070
8,6,Puerto Rico,1821.0,1898.0,Became colony of 230
9,6,Puerto Rico,1898.0,1952.0,Became colony of 2


In [4]:
dfTerritory.drop(columns=['Begin Year', 'End Year', 'Ending Political Status'], inplace=True)
dfTerritory.rename(columns={'Entity Number':'TerritoryID', 'Name':'TerritoryName'}, inplace=True)
dfTerritory.drop_duplicates(inplace=True)
dfTerritory

Unnamed: 0,TerritoryID,TerritoryName
0,3,Alaska
3,4,Hawaii
5,5,Virgin Islands
7,6,Puerto Rico
10,7,Texas
14,10,Greenland
16,11,Faeroe Is.
18,20,Canada
20,21,Newfoundland
23,30,Bermuda


In [5]:
dfStates = pd.read_csv('../SourceData/CorrelatesOfWar/states2016.csv')
# Select and rename in one step so the result is a new DataFrame, not a slice modified in place.
dfStates = dfStates[['ccode', 'statenme']].rename(columns={"ccode":"TerritoryID", "statenme":"TerritoryName"})

allplaces = [dfTerritory, dfStates]
dfTerritories = pd.concat(allplaces)
dfTerritories

Unnamed: 0,TerritoryID,TerritoryName
0,3,Alaska
3,4,Hawaii
5,5,Virgin Islands
7,6,Puerto Rico
10,7,Texas
14,10,Greenland
16,11,Faeroe Is.
18,20,Canada
20,21,Newfoundland
23,30,Bermuda


In [6]:
# Replace the literal ampersand; regex=False avoids the invalid '\&' escape.
dfTerritories['TerritoryName'] = dfTerritories['TerritoryName'].str.replace('&', 'and', regex=False)
dfTerritories

Unnamed: 0,TerritoryID,TerritoryName
0,3,Alaska
3,4,Hawaii
5,5,Virgin Islands
7,6,Puerto Rico
10,7,Texas
14,10,Greenland
16,11,Faeroe Is.
18,20,Canada
20,21,Newfoundland
23,30,Bermuda


In [7]:
TerritoryNameMaxLength = int(dfTerritories['TerritoryName'].str.encode(encoding='utf-8').str.len().max())
print(TerritoryNameMaxLength)

70
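The length above is measured in UTF-8 bytes rather than characters, which matters for names with non-ASCII letters. A small sketch of the difference follows; the example names are chosen for illustration, and reading the result as a column-size bound for the schema is my assumption rather than something the notebook states.

```python
import pandas as pd

# 'S\u00e3o Tom\u00e9' is written with escapes so the accented letters are single code points.
names = pd.Series(['Alaska', 'S\u00e3o Tom\u00e9'])

char_len = names.str.len()                       # length in characters
byte_len = names.str.encode('utf-8').str.len()   # length in UTF-8 bytes
print(char_len.tolist())
print(byte_len.tolist())
```

Each accented letter here takes two bytes in UTF-8, so the byte length exceeds the character length for the second name.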


In [8]:
dfTerritories['TerritoryID'].value_counts()

530     3
42      3
740     3
150     3
390     3
385     3
339     3
40      3
41      3
345     3
211     3
652     3
651     3
350     3
368     3
367     3
616     3
600     3
210     3
366     3
290     3
255     3
220     3
315     3
305     3
212     3
371     2
451     2
439     2
438     2
       ..
9972    1
7411    1
3316    1
7413    1
7415    1
5324    1
9464    1
9465    1
9466    1
3326    1
3327    1
7759    1
6631    1
9451    1
9452    1
7399    1
9621    1
7552    1
824     1
9820    1
3291    1
9203    1
7551    1
9603    1
5332    1
5331    1
9253    1
7564    1
5325    1
2       1
Name: TerritoryID, Length: 1167, dtype: int64

In [9]:
dfTerritories.drop_duplicates(subset=['TerritoryID'], inplace=True)
dfTerritories

Unnamed: 0,TerritoryID,TerritoryName
0,3,Alaska
3,4,Hawaii
5,5,Virgin Islands
7,6,Puerto Rico
10,7,Texas
14,10,Greenland
16,11,Faeroe Is.
18,20,Canada
20,21,Newfoundland
23,30,Bermuda
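`drop_duplicates` keeps the first occurrence by default, and the territory rows were concatenated ahead of the state rows, so the name from Territories.csv wins whenever an ID appears in both sources. A toy sketch of that behavior, with invented names:

```python
import pandas as pd

# Made-up rows: ID 2 appears in both sources with different names.
territories = pd.DataFrame({'TerritoryID': [2, 3],
                            'TerritoryName': ['Territory name for 2', 'Alaska']})
states = pd.DataFrame({'TerritoryID': [2],
                       'TerritoryName': ['State name for 2']})

combined = pd.concat([territories, states])
deduped = combined.drop_duplicates(subset=['TerritoryID'])  # keep='first' is the default
print(deduped['TerritoryName'].tolist())
```

Reversing the order of the list passed to `pd.concat` would make the state names take precedence instead.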


In [10]:
dfTerritories['TerritoryID'].value_counts()

9251    1
564     1
605     1
9002    1
609     1
610     1
6800    1
8231    1
2661    1
4711    1
616     1
620     1
625     1
626     1
8402    1
7563    1
630     1
7220    1
602     1
601     1
572     1
553     1
567     1
568     1
569     1
570     1
571     1
698     1
600     1
580     1
       ..
3327    1
7759    1
4524    1
225     1
7415    1
3340    1
3341    1
3343    1
3346    1
8152    1
679     1
9464    1
7414    1
9203    1
9452    1
3291    1
9820    1
824     1
7552    1
9621    1
7399    1
9451    1
7413    1
6631    1
9453    1
9454    1
9972    1
7411    1
3316    1
2       1
Name: TerritoryID, Length: 1167, dtype: int64

In [11]:
dfTerritories.to_csv('../FinalData/territory.csv', encoding='utf-8', index=False)

## Create 'TERRITORIALCHANGE' table

Task: transform tc2014.csv into a table with attributes:

- TerritorialChangeID
- Gainer
- Loser
- TransferDate
- Year
- Month
- Procedure
- TerritoryID
- TerritoryArea
- TerritoryPopulation
- IsWholeTerritory
- IsMilConflict
- IsIndependence
- GainerIsCont
- LoserIsCont
- IsGainerHomeland
- IsLoserHomeland
- IsSystemEntry
- IsSystemExit

In [12]:
dfTerrChange = pd.read_csv('../SourceData/CorrelatesOfWar/tc2014.csv')
dfTerrChange

Unnamed: 0,year,month,gainer,gaintype,procedur,entity,contgain,area,pop,portion,loser,losetype,contlose,entry,exit,number,indep,conflict,version
0,1816,7,160,1,-9,160,-9,2093164.00,1970000,1,230,0,0,1,0,3,1,0,5
1,1816,3,200,0,3,790,0,1.00,.,0,790,1,1,0,0,4,0,1,5
2,1816,.,200,0,3,420,0,179.00,.,0,-9,1,-9,0,0,5,0,0,5
3,1817,.,220,0,3,433,0,7819.00,100000,1,200,0,0,0,0,28,0,0,5
4,1817,.,365,1,1,365,1,650.00,.,0,-9,1,1,0,0,29,0,1,5
5,1818,10,2,1,3,20,1,84240.00,.,0,200,0,0,0,0,30,0,0,5
6,1818,12,155,1,-9,155,-9,464568.00,1656300,1,230,0,0,1,0,31,1,1,5
7,1818,10,200,0,3,2,0,41600.00,.,0,2,1,1,0,0,32,0,0,5
8,1818,6,200,0,1,750,0,421200.00,.,0,-9,1,-9,0,0,33,0,1,5
9,1818,.,200,0,2,438,0,16.00,.,0,-9,1,-9,0,0,34,0,0,5


In [13]:
dfTerrChange.rename(columns={
    "year": "Year", "month": "Month", "gainer": "Gainer", "gaintype": "IsGainerHomeland",
    "procedur": "Procedure", "entity": "TerritoryID", "contgain": "GainerIsCont",
    "area": "TerritoryArea", "pop": "TerritoryPopulation", "portion": "IsWholeTerritory",
    "loser": "Loser", "losetype": "IsLoserHomeland", "contlose": "LoserIsCont",
    "entry": "IsSystemEntry", "exit": "IsSystemExit", "number": "TerritorialChangeID",
    "indep": "IsIndependence", "conflict": "IsMilConflict",
}, inplace=True)
dfTerrChange.drop(columns=['version'], inplace=True)
# A month recorded as '.' is missing; default it to January for TransferDate only.
missingmonth = (dfTerrChange['Month'] == '.')
dfTerrChange['MonthClean'] = dfTerrChange['Month']
dfTerrChange.loc[missingmonth, 'MonthClean'] = 1  # .loc avoids chained assignment
dfTerrChange['TransferDate'] = pd.to_datetime(dict(year=dfTerrChange.Year, month=dfTerrChange.MonthClean, day='01'))
# .copy() makes the column selection an independent DataFrame rather than a slice.
dfTerrChange = dfTerrChange[['TerritorialChangeID', 'Gainer', 'Loser', 'TransferDate', 'Year', 'Month', 'Procedure', 'TerritoryID', 'TerritoryArea', 'TerritoryPopulation', 'IsWholeTerritory', 'IsMilConflict', 'IsIndependence', 'GainerIsCont', 'LoserIsCont', 'IsGainerHomeland', 'IsLoserHomeland', 'IsSystemEntry', 'IsSystemExit']].copy()
dfTerrChange


Unnamed: 0,TerritorialChangeID,Gainer,Loser,TransferDate,Year,Month,Procedure,TerritoryID,TerritoryArea,TerritoryPopulation,IsWholeTerritory,IsMilConflict,IsIndependence,GainerIsCont,LoserIsCont,IsGainerHomeland,IsLoserHomeland,IsSystemEntry,IsSystemExit
0,3,160,230,1816-07-01,1816,7,-9,160,2093164.00,1970000,1,0,1,-9,0,1,0,1,0
1,4,200,790,1816-03-01,1816,3,3,790,1.00,.,0,1,0,0,1,0,1,0,0
2,5,200,-9,1816-01-01,1816,.,3,420,179.00,.,0,0,0,0,-9,0,1,0,0
3,28,220,200,1817-01-01,1817,.,3,433,7819.00,100000,1,0,0,0,0,0,0,0,0
4,29,365,-9,1817-01-01,1817,.,1,365,650.00,.,0,1,0,1,1,1,1,0,0
5,30,2,200,1818-10-01,1818,10,3,20,84240.00,.,0,0,0,1,0,1,0,0,0
6,31,155,230,1818-12-01,1818,12,-9,155,464568.00,1656300,1,1,1,-9,0,1,0,1,0
7,32,200,2,1818-10-01,1818,10,3,2,41600.00,.,0,0,0,0,1,0,1,0,0
8,33,200,-9,1818-06-01,1818,6,1,750,421200.00,.,0,1,0,0,-9,0,1,0,0
9,34,200,-9,1818-01-01,1818,.,2,438,16.00,.,0,0,0,0,-9,0,1,0,0
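The TransferDate logic can be seen in isolation on a few rows. This is a sketch with toy values; `month_clean` is a hypothetical helper name, and it converts months to numbers with `pd.to_numeric` rather than the in-place substitution used in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({'Year': [1816, 1816, 1817], 'Month': ['7', '.', '10']})

# Treat '.' as missing and fall back to January for the date only.
month_clean = pd.to_numeric(toy['Month'], errors='coerce').fillna(1).astype(int)

# pd.to_datetime assembles dates from columns named year, month and day.
toy['TransferDate'] = pd.to_datetime(
    pd.DataFrame({'year': toy['Year'], 'month': month_clean, 'day': 1})
)
date_strings = toy['TransferDate'].dt.strftime('%Y-%m-%d').tolist()
print(date_strings)
```

The original Month column is left untouched, so a missing month can still be told apart from a real January in the exported table.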


In [14]:
dfTerrChange = dfTerrChange.replace(-9, '')
# Blank out four specific Loser codes; .loc avoids chained assignment.
dfTerrChange.loc[dfTerrChange['Loser'].isin([0, 7693, 2292, 7507]), 'Loser'] = ''
# '.' marks a missing value in the source file.
for col in ['Month', 'TerritoryPopulation', 'TerritoryArea', 'TerritoryID']:
    dfTerrChange[col] = dfTerrChange[col].replace('.', '')
dfTerrChange['TransferDate'] = dfTerrChange['TransferDate'].dt.strftime('%Y-%m-%d')
dfTerrChange


Unnamed: 0,TerritorialChangeID,Gainer,Loser,TransferDate,Year,Month,Procedure,TerritoryID,TerritoryArea,TerritoryPopulation,IsWholeTerritory,IsMilConflict,IsIndependence,GainerIsCont,LoserIsCont,IsGainerHomeland,IsLoserHomeland,IsSystemEntry,IsSystemExit
0,3,160,230,1816-07-01,1816,7,,160,2093164.00,1970000,1,0,1,,0,1,0,1,0
1,4,200,790,1816-03-01,1816,3,3,790,1.00,,0,1,0,0,1,0,1,0,0
2,5,200,,1816-01-01,1816,,3,420,179.00,,0,0,0,0,,0,1,0,0
3,28,220,200,1817-01-01,1817,,3,433,7819.00,100000,1,0,0,0,0,0,0,0,0
4,29,365,,1817-01-01,1817,,1,365,650.00,,0,1,0,1,1,1,1,0,0
5,30,2,200,1818-10-01,1818,10,3,20,84240.00,,0,0,0,1,0,1,0,0,0
6,31,155,230,1818-12-01,1818,12,,155,464568.00,1656300,1,1,1,,0,1,0,1,0
7,32,200,2,1818-10-01,1818,10,3,2,41600.00,,0,0,0,0,1,0,1,0,0
8,33,200,,1818-06-01,1818,6,1,750,421200.00,,0,1,0,0,,0,1,0,0
9,34,200,,1818-01-01,1818,,2,438,16.00,,0,0,0,0,,0,1,0,0


In [15]:
dfgainers = dfTerrChange['Gainer']
dfstaters = dfStates['TerritoryID']
flagg = dfgainers.isin(dfstaters)
flagg.value_counts()

True     812
False     25
Name: Gainer, dtype: int64
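The membership check above acts as a foreign-key test: any Gainer code that is not a known state code is flagged. A toy sketch of the check, and of one way to blank the unmatched codes, is given here; the values and the names `state_ids`, `changes` and `is_known` are invented for illustration.

```python
import pandas as pd

state_ids = pd.Series([2, 200, 230])             # stand-in for dfStates['TerritoryID']
changes = pd.DataFrame({'Gainer': [200, 7, 2]})  # 7 is not in the state list

is_known = changes['Gainer'].isin(state_ids)
print(is_known.value_counts())

# Keep known codes and replace the rest with an empty string.
changes['Gainer'] = changes['Gainer'].where(is_known, '')
print(changes['Gainer'].tolist())
```

`Series.where` returns a new Series, so assigning it back to the column sidesteps the chained-assignment pattern.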

In [16]:
dfTerrChange.loc[~flagg]

Unnamed: 0,TerritorialChangeID,Gainer,Loser,TransferDate,Year,Month,Procedure,TerritoryID,TerritoryArea,TerritoryPopulation,IsWholeTerritory,IsMilConflict,IsIndependence,GainerIsCont,LoserIsCont,IsGainerHomeland,IsLoserHomeland,IsSystemEntry,IsSystemExit
45,70,7,70,1836-03-01,1836,3.0,4.0,7,1010472.0,38000.0,1,1,1,,1.0,1.0,0,,0
84,115,563,200,1852-01-01,1852,1.0,,563,286065.0,300000.0,1,0,1,,0.0,1.0,0,,0
90,122,564,200,1854-04-01,1854,4.0,,564,129153.0,245000.0,1,0,1,,0.0,1.0,0,,0
203,240,348,640,1878-07-01,1878,7.0,4.0,348,4584.0,195585.0,1,1,1,,1.0,1.0,0,,0
204,241,348,640,1878-07-01,1878,7.0,3.0,640,4610.0,100000.0,0,1,0,1.0,1.0,1.0,1,0.0,0
213,251,348,640,1880-11-01,1880,11.0,3.0,640,520.0,,0,0,0,1.0,1.0,1.0,1,0.0,0
214,252,563,200,1880-04-01,1880,4.0,,563,285363.0,815000.0,1,1,1,,0.0,1.0,0,,0
303,344,563,200,1893-11-01,1893,11.0,3.0,572,17363.0,55000.0,1,0,0,1.0,0.0,1.0,0,0.0,0
401,444,348,640,1913-05-01,1913,5.0,3.0,640,5200.0,,0,1,0,1.0,1.0,1.0,1,0.0,0
412,455,672,640,1914-01-01,1914,,4.0,672,104338.0,,1,1,1,,1.0,1.0,0,,0


In [17]:
# Keep Gainer codes found in the state list; blank the rest without chained assignment.
dfTerrChange['Gainer'] = dfTerrChange['Gainer'].where(flagg, '')


In [18]:
dflosers = dfTerrChange['Loser']
dfstaters = dfStates['TerritoryID']
flagl = dflosers.isin(dfstaters)
flagl.value_counts()

True     651
False    186
Name: Loser, dtype: int64

In [19]:
dfTerrChange.loc[~flagl]

Unnamed: 0,TerritorialChangeID,Gainer,Loser,TransferDate,Year,Month,Procedure,TerritoryID,TerritoryArea,TerritoryPopulation,IsWholeTerritory,IsMilConflict,IsIndependence,GainerIsCont,LoserIsCont,IsGainerHomeland,IsLoserHomeland,IsSystemEntry,IsSystemExit
2,5,200,,1816-01-01,1816,,3,420,179.00,,0,0,0,0,,0,1,0,0
4,29,365,,1817-01-01,1817,,1,365,650.00,,0,1,0,1,1,1,1,0,0
8,33,200,,1818-06-01,1818,6,1,750,421200.00,,0,1,0,0,,0,1,0,0
9,34,200,,1818-01-01,1818,,2,438,16.00,,0,0,0,0,,0,1,0,0
10,35,640,,1818-01-01,1818,,3,671,388500.00,,1,1,0,1,,1,1,0,0
20,45,640,,1822-01-01,1822,,1,625,229471.00,100000,1,1,0,1,,1,1,0,0
22,47,200,,1824-03-01,1824,3,3,830,583.00,10000,1,0,0,0,1,0,1,0,0
26,51,200,,1825-12-01,1825,12,2,9993,48.00,0,1,0,0,0,,0,1,0,0
29,54,200,,1826-01-01,1826,,2,561,200000.00,,0,0,0,0,,0,1,0,0
31,56,200,,1826-06-01,1826,6,3,821,1.00,,0,0,0,0,,0,1,0,0


In [20]:
# Keep Loser codes found in the state list; blank the rest without chained assignment.
dfTerrChange['Loser'] = dfTerrChange['Loser'].where(flagl, '')


In [21]:
dfterrs = dfTerrChange['TerritoryID']
dfterrstates = dfTerritories['TerritoryID'].astype(str)
flagt = dfterrs.isin(dfterrstates)
flagt.value_counts()

True     826
False     11
Name: TerritoryID, dtype: int64

In [22]:
dfTerrChange.loc[~flagt]

Unnamed: 0,TerritorialChangeID,Gainer,Loser,TransferDate,Year,Month,Procedure,TerritoryID,TerritoryArea,TerritoryPopulation,IsWholeTerritory,IsMilConflict,IsIndependence,GainerIsCont,LoserIsCont,IsGainerHomeland,IsLoserHomeland,IsSystemEntry,IsSystemExit
75,105,200,,1849-12-01,1849,12.0,1,,151536.0,9153209.0,1,1,0,0,,0,1.0,0,0
384,427,200,800.0,1909-03-01,1909,3.0,3,822.0,38195.0,450000.0,0,0,0,0,1.0,0,1.0,0,0
409,452,200,,1914-05-01,1914,5.0,2,822.0,18985.0,180412.0,0,0,0,0,,0,1.0,0,0
706,756,740,2.0,1968-06-01,1968,6.0,3,,100.0,,0,0,0,0,0.0,1,0.0,0,0
771,825,645,,1981-12-01,1981,12.0,3,,3333.0,0.0,0,0,0,1,,1,,0,0
772,826,670,,1981-12-01,1981,12.0,3,,3333.0,0.0,0,0,0,1,,1,,0,0
775,829,155,,1984-11-01,1984,11.0,3,,1079.0,0.0,0,0,0,1,,1,,0,0
776,830,160,,1984-11-01,1984,11.0,3,,1083.0,0.0,0,0,0,1,,1,,0,0
802,856,92,,1992-09-01,1992,9.0,3,,147.0,36000.0,0,0,0,1,,1,,0,0
803,857,91,,1992-09-01,1992,9.0,3,,293.0,13000.0,0,0,0,1,,1,,0,0


In [23]:
# Keep TerritoryIDs found in the TERRITORY table; blank the rest without chained assignment.
dfTerrChange['TerritoryID'] = dfTerrChange['TerritoryID'].where(flagt, '')


In [24]:
dfTerrChange.to_csv('../FinalData/territorialchange.csv', encoding='utf-8', index=False)