# Start of Data Transformation

Task: use pandas to transform the source CSV files into DataFrames that match the tables in the target database schema.

Tables:

- WAR (done)
- WAR_PARTICIPANTS (done)
- WAR_LOCATION (done)
- WAR_TRANSITIONS

In [1]:
import pandas as pd
import numpy as np

In [2]:
!ls ../SourceData/CorrelatesOfWar/

Codebooks                    MID_Narratives_2002-2010.pdf
CowWarList.csv               NMC_5_0-wsupplementary.csv
CowWarList.pdf               Non-StateWarData_v4.0.csv
Entities.pdf                 Territories.csv
Extra-StateWarData_v4.0.csv  alliance_v4.1_by_member.csv
IGO_stateunit_v2.3.csv       contdir.csv
Inter-StateWarData_v4.0.csv  igounit_v2.3.csv
Intra-StateWarData_v4.1.csv  majors2016.csv
MIDA_4.2.csv                 states2016.csv
MIDB_4.2.csv                 system2016.csv
MIDLOCA_2.0.csv              tc2014.csv
MID_Narratives_1993-2001.pdf


In [3]:
dfInterStateWar = pd.read_csv('../SourceData/CorrelatesOfWar/Inter-StateWarData_v4.0.csv')
dfIntraStateWar = pd.read_csv('../SourceData/CorrelatesOfWar/Intra-StateWarData_v4.1.csv')
dfNonStateWar = pd.read_csv('../SourceData/CorrelatesOfWar/Non-StateWarData_v4.0.csv')
dfExtraStateWar = pd.read_csv('../SourceData/CorrelatesOfWar/Extra-StateWarData_v4.0.csv')

## Create 'WAR' table

Task: transform the following csv files into one table:

- Inter-StateWarData_v4.0.csv (note: already saved as 'dfInterStateWar')
- Intra-StateWarData_v4.1.csv (note: already saved as 'dfIntraStateWar')
- Non-StateWarData_v4.0.csv (note: already saved as 'dfNonStateWar')
- Extra-StateWarData_v4.0.csv (note: already saved as 'dfExtraStateWar')
- CowWarList.csv (note: generated from pdf using Tabula, with `\r`s removed by hand)

with the following attributes:

- WarID
- WarShortName
- WarLongName (from CowWarList.csv)
- WarType
- IsIntervention (only relevant for Extra-State Wars)
- IsInternational (only relevant for Intra-State Wars)

Note: I re-saved many of the csv files with UTF-8 encoding.

Note: The carriage return characters can also be removed with this code:

`df = df.replace({r'\r': ' '}, regex=True)`
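
As a minimal sketch of that replacement, here it is applied to a small made-up frame (the column names and values are hypothetical, not read from CowWarList.csv):

```python
import pandas as pd

# Hypothetical rows imitating cells that picked up carriage returns during PDF extraction.
df = pd.DataFrame({
    'War Name': ['Franco-Spanish\rWar of 1823', 'Crimean War of 1853-1856'],
    'Year': [1823, 1853],
})

# With regex=True the dict key is treated as a pattern, so '\r' is replaced
# wherever it occurs inside a string cell; numeric columns are left alone.
cleaned = df.replace({r'\r': ' '}, regex=True)
print(cleaned['War Name'].tolist())
```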

In [4]:
dfInterWar = dfInterStateWar[['WarNum', 'WarName', 'WarType']]
dfInterWar = dfInterWar.rename(columns={'WarNum':'WarID', 'WarName':'WarShortName'})
dfInterWar = dfInterWar.drop_duplicates()

dfIntraWar = dfIntraStateWar[['WarNum', 'WarName', 'WarType', 'Intnl']]
dfIntraWar = dfIntraWar.rename(columns={'WarNum':'WarID', 'WarName':'WarShortName', 'Intnl':'IsInternational'})
dfIntraWar = dfIntraWar.drop_duplicates()

dfNonWar = dfNonStateWar[['WarNum', 'WarName', 'WarType']]
dfNonWar = dfNonWar.rename(columns={'WarNum':'WarID', 'WarName':'WarShortName'})
dfNonWar = dfNonWar.drop_duplicates()

dfExtraWar = dfExtraStateWar[['WarNum', 'WarName', 'WarType', 'Interven']]
dfExtraWar = dfExtraWar.rename(columns={'WarNum':'WarID', 'WarName':'WarShortName', 'Interven':'IsIntervention'})
dfExtraWar = dfExtraWar.drop_duplicates()

warDFs = [dfInterWar, dfIntraWar, dfNonWar, dfExtraWar]
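# sort=False keeps the column order as given; the columns are reordered explicitly on the next line
dfWar = pd.concat(warDFs, sort=False).sort_values('WarID').reset_index(drop=True)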
dfWar = dfWar[['WarID', 'WarShortName', 'WarType', 'IsIntervention', 'IsInternational']]
dfWar = dfWar.astype({'IsIntervention':'Int64', 'IsInternational':'Int64'})
dfWar



Unnamed: 0,WarID,WarShortName,WarType,IsIntervention,IsInternational
0,1,Franco-Spanish War,1,,
1,4,First Russo-Turkish,1,,
2,7,Mexican-American,1,,
3,10,Austro-Sardinian,1,,
4,13,First Schleswig-Holstein,1,,
5,16,Roman Republic,1,,
6,19,La Plata,1,,
7,22,Crimean,1,,
8,25,Anglo-Persian,1,,
9,28,Italian Unification,1,,
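
The empty 'IsIntervention' and 'IsInternational' cells above come from concatenating frames that do not all have those columns. The sketch below reproduces the effect on two hypothetical one-row frames (the extra-state row and its codes are made up) and shows why the nullable 'Int64' dtype is useful here:

```python
import pandas as pd

# Hypothetical one-row frames; only the second one has the flag column.
inter = pd.DataFrame({'WarID': [1], 'WarShortName': ['Franco-Spanish War'], 'WarType': [1]})
extra = pd.DataFrame({'WarID': [300], 'WarShortName': ['Hypothetical Extra-State War'],
                      'WarType': [2], 'IsIntervention': [1]})

# Columns missing from one input are filled with NaN, which forces a float column.
combined = pd.concat([inter, extra], sort=False).sort_values('WarID').reset_index(drop=True)
print(combined['IsIntervention'].dtype)

# The nullable 'Int64' dtype stores the flag as an integer and the gap as <NA>.
combined = combined.astype({'IsIntervention': 'Int64'})
print(combined)
```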


Now to add the long names and the general category war type:

In [5]:
dfWarNames = pd.read_csv('../SourceData/CorrelatesOfWar/CowWarList.csv')
dfWarNames

Unnamed: 0,Year,War Name,War Type & Number
0,1816,Allied Bombardment of Algiers of 1816,Extra-State War #300
1,1816,Ottoman-Wahhabi Revolt of 1816-1818,Extra-State War #301
2,1817,Liberation of Chile of 1817-1818,Extra-State War #302
3,1817,First Bolivar Expedition of 1817-1819,Extra-State War #303
4,1817,War of Mexican Independence of 1817-1818,Extra-State War #304
5,1817,British-Kandyan War of 1817-1818,Extra-State War #305
6,1817,British-Maratha of 1817-1818,Extra-State War #306
7,1818,First Maori Tribal War of 1818-1824,Non-State War #1500
8,1818,First Caucasus War of 1818-1822,Intra-State War #500
9,1819,Shaka Zulu-Bantu War of 1819-1828,Non-State War #1501


In [6]:
dfWarNamesIDs = dfWarNames['War Type & Number'].str.split("#", n = 1, expand = True)
dfWarNames['WarTypeName'] = dfWarNamesIDs[0]
dfWarNames['WarID'] = dfWarNamesIDs[1]

dfWarNames = dfWarNames[['WarID', 'WarTypeName', 'War Name']]
dfWarNames = dfWarNames.rename(columns={'War Name':'WarLongName'})
dfWarNames

Unnamed: 0,WarID,WarTypeName,WarLongName
0,300,Extra-State War,Allied Bombardment of Algiers of 1816
1,301,Extra-State War,Ottoman-Wahhabi Revolt of 1816-1818
2,302,Extra-State War,Liberation of Chile of 1817-1818
3,303,Extra-State War,First Bolivar Expedition of 1817-1819
4,304,Extra-State War,War of Mexican Independence of 1817-1818
5,305,Extra-State War,British-Kandyan War of 1817-1818
6,306,Extra-State War,British-Maratha of 1817-1818
7,1500,Non-State War,First Maori Tribal War of 1818-1824
8,500,Intra-State War,First Caucasus War of 1818-1822
9,1501,Non-State War,Shaka Zulu-Bantu War of 1819-1828


In [7]:
dfWarNames['WarID'] = dfWarNames['WarID'].astype('int64')
dfWars = pd.merge(dfWar, dfWarNames, on='WarID')
dfWars = dfWars[['WarID', 'WarShortName', 'WarLongName', 'WarType', 'WarTypeName', 'IsIntervention', 'IsInternational']]
dfWars = dfWars.replace(np.nan, '', regex=True)
dfWars

Unnamed: 0,WarID,WarShortName,WarLongName,WarType,WarTypeName,IsIntervention,IsInternational
0,1,Franco-Spanish War,Franco-Spanish War of 1823,1,Inter-State War,,
1,4,First Russo-Turkish,First Russo-Turkish War of 1828-1829,1,Inter-State War,,
2,7,Mexican-American,Mexican-American War of 1846-1847,1,Inter-State War,,
3,10,Austro-Sardinian,Austro-Sardinian War of 1848-1849,1,Inter-State War,,
4,13,First Schleswig-Holstein,First Schleswig-Holstein War of 1848-1849,1,Inter-State War,,
5,16,Roman Republic,War of the Roman Republic of 1849,1,Inter-State War,,
6,19,La Plata,La Plata War of 1851-1852,1,Inter-State War,,
7,22,Crimean,Crimean War of 1853-1856,1,Inter-State War,,
8,25,Anglo-Persian,Anglo-Persian War of 1856-1857,1,Inter-State War,,
9,28,Italian Unification,War of Italian Unification of 1859,1,Inter-State War,,


In [8]:
dfWars.WarTypeName.unique()

array(['Inter-State War ', 'Extra-State War ', 'Intra-State War ',
       'Non-State War '], dtype=object)

The war type names have a trailing space left over from splitting on '#'.
Stripping whitespace from every string column, just to be safe:

In [9]:
dfWars = dfWars.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
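
For comparison, here is a sketch of an alternative that avoids the trailing space in the first place by using `str.extract` with a regular expression instead of `str.split`. The sample labels follow the format shown in CowWarList.csv; the variable names are only illustrative.

```python
import pandas as pd

labels = pd.Series(['Extra-State War #300', 'Non-State War #1500', 'Intra-State War #500'])

# Splitting on '#' keeps the space that preceded it.
split = labels.str.split('#', n=1, expand=True)
print(repr(split[0][0]))

# A pattern that allows optional whitespace around '#' captures clean values directly.
# Named groups become the column names of the result.
parts = labels.str.extract(r'^(?P<WarTypeName>.*?)\s*#\s*(?P<WarID>\d+)$')
parts['WarID'] = parts['WarID'].astype('int64')
print(parts)
```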

In [10]:
dfWars.to_csv('../FinalData/war.csv', encoding='utf-8', index=False)

## Create 'WAR_PARTICIPANTS' table

Task: transform the following csv files into one table:

- Inter-StateWarData_v4.0.csv (note: already saved as 'dfInterStateWar')
- Intra-StateWarData_v4.1.csv (note: already saved as 'dfIntraStateWar')
- Non-StateWarData_v4.0.csv (note: already saved as 'dfNonStateWar')
- Extra-StateWarData_v4.0.csv (note: already saved as 'dfExtraStateWar')

with the following attributes:

- WarID
- PolityID
- StartDate
- EndDate
- StartYear
- StartMonth
- StartDay
- EndYear
- EndMonth
- EndDay
- Side
- IsInitiator
- Outcome
- Deaths

Note: 'Outcome' is almost entirely determined by 'WarID' and 'Side'. However, one codebook (inter-state war) has an additional outcome type: 8 = changed sides. There is exactly one instance of this, so 'Outcome' also depends on 'PolityID'. This is why it is included in this table and not in a separate one.

Similarly, 'Deaths' is almost entirely determined by 'WarID' and 'PolityID'. However, there are a few instances in which it also depends on 'StartDate', which is why it is included in this table and not in a separate one.
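
One way to check a dependency claim like these is to count distinct values per key combination. The sketch below does that on hypothetical rows; `violations` is an illustrative helper, not part of the notebook, and the codes are made up.

```python
import pandas as pd

# Hypothetical participant rows; the last one imitates the single "changed sides" case.
participants = pd.DataFrame({
    'WarID':    [1, 1, 2, 2, 2],
    'PolityID': [230, 220, 300, 325, 337],
    'Side':     ['B', 'A', 'A', 'B', 'B'],
    'Outcome':  [2, 1, 1, 2, 8],
})

def violations(df, keys, target):
    """Return the key combinations that map to more than one distinct target value."""
    counts = df.groupby(keys)[target].nunique()
    return counts[counts > 1]

# (WarID, Side) alone does not determine Outcome for war 2, side B ...
print(violations(participants, ['WarID', 'Side'], 'Outcome'))
# ... but adding PolityID to the key removes the conflict.
print(violations(participants, ['WarID', 'PolityID', 'Side'], 'Outcome'))
```

An empty result means the target column is fully determined by the chosen key in the data at hand.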

### Changes made to original files (due to data entry errors)

- There was a data entry error in 'Intra-StateWarData_v4.1.csv' for WarNum 585; EndDay1 was coded '-91866' and EndYear1 was left blank. I corrected this by hand so the Day was '-9' and the Year '1866'.
- There was another data entry error in the same file for WarNum 682; EndDay1 was coded '1919' and EndYear1 was left blank. I corrected this by hand so that the Day was '-9' and the Year '1919'.
- There was an apparent data entry error in the same file for WarNum 623, the second entry (Korea) - the StartDay1 was coded as '29' when StartMonth1 was 2... which is not a valid date. I corrected this by hand so the StartDay1 became '28'.
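
Impossible dates like the last one can be found programmatically. A minimal sketch, assuming the year, month and day sit in separate integer columns as in the source files (the rows below are hypothetical):

```python
import pandas as pd

# Hypothetical rows; the second one has a day that does not exist in that month and year.
dates = pd.DataFrame({
    'StartYear':  [1894, 1894, 1866],
    'StartMonth': [2, 2, 9],
    'StartDay':   [28, 29, 14],
})

# pd.to_datetime can assemble dates from columns named year/month/day;
# errors='coerce' turns impossible combinations into NaT instead of raising.
parts = dates.rename(columns={'StartYear': 'year', 'StartMonth': 'month', 'StartDay': 'day'})
parsed = pd.to_datetime(parts, errors='coerce')

invalid = dates[parsed.isna()]
print(invalid)
```

Rows that use the -9 placeholder for a missing day or month would also come back as NaT, so they would need to be set aside before running a check like this.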

In [93]:
dfPolities = pd.read_csv('../FinalData/polity.csv')

### Inter-State War

In [94]:
dfInterStateWar.columns

Index(['WarNum', 'WarName', 'WarType', 'ccode', 'StateName', 'Side',
       'StartMonth1', 'StartDay1', 'StartYear1', 'EndMonth1', 'EndDay1',
       'EndYear1', 'StartMonth2', 'StartDay2', 'StartYear2', 'EndMonth2',
       'EndDay2', 'EndYear2', 'TransFrom', 'WhereFought', 'Initiator',
       'Outcome', 'TransTo', 'BatDeath', 'Version'],
      dtype='object')

In [95]:
dfInterWarPar1 = dfInterStateWar[['WarNum', 'ccode', 'StartMonth1', 'StartDay1', 'StartYear1', 
                                'EndMonth1', 'EndDay1', 'EndYear1', 'Side', 'Initiator', 'Outcome', 'BatDeath']]
dfInterWarPar2 = dfInterStateWar[['WarNum', 'ccode', 'StartMonth2', 'StartDay2', 'StartYear2', 
                                'EndMonth2', 'EndDay2', 'EndYear2', 'Side', 'Initiator', 'Outcome', 'BatDeath']]

dfInterWarPar1 = dfInterWarPar1.rename(columns={'WarNum':'WarID', 'ccode':'PolityID', 'StartMonth1':'StartMonth', 
                                        'StartDay1':'StartDay', 'StartYear1':'StartYear', 'EndMonth1':'EndMonth', 
                                        'EndDay1':'EndDay', 'EndYear1':'EndYear', 'Initiator':'IsInitiator', 
                                        'BatDeath':'Deaths'})
dfInterWarPar2 = dfInterWarPar2.rename(columns={'WarNum':'WarID', 'ccode':'PolityID', 'StartMonth2':'StartMonth', 
                                        'StartDay2':'StartDay', 'StartYear2':'StartYear', 'EndMonth2':'EndMonth', 
                                        'EndDay2':'EndDay', 'EndYear2':'EndYear', 'Initiator':'IsInitiator', 
                                        'BatDeath':'Deaths'})

In [96]:
dfInterWarPar2 = dfInterWarPar2.replace(-8, '')
dfInterWarPar2['datesconcat'] = dfInterWarPar2['StartMonth'].map(str) + dfInterWarPar2['StartDay'].map(str) + dfInterWarPar2['StartYear'].map(str) + dfInterWarPar2['EndMonth'].map(str) + dfInterWarPar2['EndDay'].map(str) + dfInterWarPar2['EndYear'].map(str)
missdate = ''
dfInterWarPar2 = dfInterWarPar2[dfInterWarPar2.datesconcat != missdate]
dfInterWarPar2 = dfInterWarPar2.drop(columns=['datesconcat'])
dfInterWarPar2

Unnamed: 0,WarID,PolityID,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Side,IsInitiator,Outcome,Deaths
7,10,325,3,12,1849,3,30,1849,2,1,2,3400
8,10,300,3,12,1849,3,30,1849,1,2,1,3927
10,13,255,3,25,1849,7,10,1849,1,1,1,2500
11,13,390,3,25,1849,7,10,1849,2,2,2,3500
38,46,255,6,25,1864,7,20,1864,1,1,1,1048
39,46,390,6,25,1864,7,20,1864,2,2,2,2933
40,46,300,6,25,1864,7,20,1864,1,2,1,500
104,100,355,2,3,1913,4,19,1913,1,2,1,32000
105,100,345,2,3,1913,4,19,1913,1,1,1,15000
182,139,365,8,8,1945,8,14,1945,1,2,1,7500000
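
The same filtering can be sketched without building a concatenated string, by testing the date columns directly. The frame below is hypothetical; -8 is the placeholder value that the cell above blanks out.

```python
import pandas as pd

# Hypothetical second-date-set rows: the first participant has no second date range.
second = pd.DataFrame({
    'WarID': [1, 10, 13],
    'StartMonth': [-8, 3, 3], 'StartDay': [-8, 12, 25], 'StartYear': [-8, 1849, 1849],
    'EndMonth': [-8, 3, 7], 'EndDay': [-8, 30, 10], 'EndYear': [-8, 1849, 1849],
})

dateCols = ['StartMonth', 'StartDay', 'StartYear', 'EndMonth', 'EndDay', 'EndYear']

# Keep a row if at least one of its date fields is something other than -8.
hasDates = second[dateCols].ne(-8).any(axis=1)
kept = second[hasDates]
print(kept)
```

This variant leaves -8 values in other columns untouched, which differs from the cell above, where every -8 in the frame is replaced.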


In [97]:
combinedInterWarPar = [dfInterWarPar1, dfInterWarPar2]
dfInterWarPar = pd.concat(combinedInterWarPar).reset_index(drop=True)
dfInterWarPar

Unnamed: 0,WarID,PolityID,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Side,IsInitiator,Outcome,Deaths
0,1,230,4,7,1823,11,13,1823,2,2,2,600
1,1,220,4,7,1823,11,13,1823,1,1,1,400
2,4,640,4,26,1828,9,14,1829,2,2,2,80000
3,4,365,4,26,1828,9,14,1829,1,1,1,50000
4,7,70,4,25,1846,9,14,1847,2,2,2,6000
5,7,2,4,25,1846,9,14,1847,1,1,1,13283
6,10,337,3,29,1848,8,9,1848,2,2,2,100
7,10,325,3,24,1848,8,9,1848,2,1,2,3400
8,10,300,3,24,1848,8,9,1848,1,2,1,3927
9,10,332,4,9,1848,8,9,1848,2,2,2,100


According to the codebook, the 'Initiator' column is coded 1 = yes, did initiate; 2 = no, did not initiate. To standardize, the 2s are changed to 0 (the more widely recognized value for False).

The original table codes 'Side' as 1 or 2. To standardize with the other tables, these are converted to A and B.

In [98]:
# Recode whole columns with Series.replace instead of chained indexing,
# which triggers SettingWithCopyWarning and is not guaranteed to write to the frame.
dfInterWarPar['Side'] = dfInterWarPar['Side'].replace({1: 'A', 2: 'B'})
dfInterWarPar['IsInitiator'] = dfInterWarPar['IsInitiator'].replace(2, 0)
dfInterWarPar


Unnamed: 0,WarID,PolityID,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Side,IsInitiator,Outcome,Deaths
0,1,230,4,7,1823,11,13,1823,B,0,2,600
1,1,220,4,7,1823,11,13,1823,A,1,1,400
2,4,640,4,26,1828,9,14,1829,B,0,2,80000
3,4,365,4,26,1828,9,14,1829,A,1,1,50000
4,7,70,4,25,1846,9,14,1847,B,0,2,6000
5,7,2,4,25,1846,9,14,1847,A,1,1,13283
6,10,337,3,29,1848,8,9,1848,B,0,2,100
7,10,325,3,24,1848,8,9,1848,B,1,2,3400
8,10,300,3,24,1848,8,9,1848,A,0,1,3927
9,10,332,4,9,1848,8,9,1848,B,0,2,100
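
The recoding shown above can also be written as boolean-mask assignments through `.loc`, which selects rows and column in a single indexing step. A minimal sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical rows using the codebook values 1 and 2.
par = pd.DataFrame({'Side': [2, 1, 2], 'IsInitiator': [2, 1, 2]})

# Cast to object first so the integer column can hold the letter codes.
par['Side'] = par['Side'].astype(object)

# One .loc call per recode: rows chosen by the mask, column chosen by name.
par.loc[par['Side'] == 1, 'Side'] = 'A'
par.loc[par['Side'] == 2, 'Side'] = 'B'
par.loc[par['IsInitiator'] == 2, 'IsInitiator'] = 0
print(par)
```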


In [99]:
dfInterWarPar = dfInterWarPar[['WarID', 'PolityID', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Side', 'IsInitiator', 'Outcome', 'Deaths']]
dfInterWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,1,230,1823,4,7,1823,11,13,B,0,2,600
1,1,220,1823,4,7,1823,11,13,A,1,1,400
2,4,640,1828,4,26,1829,9,14,B,0,2,80000
3,4,365,1828,4,26,1829,9,14,A,1,1,50000
4,7,70,1846,4,25,1847,9,14,B,0,2,6000
5,7,2,1846,4,25,1847,9,14,A,1,1,13283
6,10,337,1848,3,29,1848,8,9,B,0,2,100
7,10,325,1848,3,24,1848,8,9,B,1,2,3400
8,10,300,1848,3,24,1848,8,9,A,0,1,3927
9,10,332,1848,4,9,1848,8,9,B,0,2,100


### Intra-State War

In [100]:
dfIntraStateWar.columns

Index(['WarNum', 'WarName', 'WarType', 'CcodeA', 'SideA', 'CcodeB', 'SideB',
       'Intnl', 'StartMonth1', 'StartDay1', 'StartYear1', 'EndMonth1',
       'EndDay1', 'EndYear1', 'StartMonth2', 'StartDay2', 'StartYear2',
       'EndMonth2', 'EndDay2', 'EndYear2', 'TransFrom', 'WhereFought',
       'Initiator', 'Outcome', 'TransTo', 'SideADeaths', 'SideBDeaths',
       'Version'],
      dtype='object')

- 1A = Side A, First set of start/end dates
- 2A = Side A, Second set of start/end dates (need to get rid of rows with no date values)
- 1B = Side B, First set of start/end dates
- 2B = Side B, Second set of start/end dates (need to get rid of rows with no date values)
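
The four-way split listed above can be sketched with one helper that is called once per side and date set. This is only an illustration: `extract` and the one-row frame are hypothetical, and only the year columns are included to keep it short.

```python
import pandas as pd

# Hypothetical wide row in the shape of the intra-state file (subset of columns).
wide = pd.DataFrame({
    'WarNum': [500], 'CcodeA': [365], 'SideA': ['Russia'], 'CcodeB': [-8], 'SideB': ['Chechens'],
    'StartYear1': [1818], 'EndYear1': [1822], 'StartYear2': [-8], 'EndYear2': [-8],
    'SideADeaths': [5000], 'SideBDeaths': [6000],
})

def extract(df, side, dateSet):
    """Select one side and one date set, renaming to side- and set-neutral column names."""
    renames = {
        'WarNum': 'WarID',
        f'Ccode{side}': 'PolityID',
        f'Side{side}': 'PolityName',
        f'StartYear{dateSet}': 'StartYear',
        f'EndYear{dateSet}': 'EndYear',
        f'Side{side}Deaths': 'Deaths',
    }
    out = df[list(renames)].rename(columns=renames)
    out['Side'] = side
    return out

# Order of the pieces: 1A, 2A, 1B, 2B.
longForm = pd.concat([extract(wide, s, d) for s in 'AB' for d in (1, 2)], ignore_index=True)
print(longForm)
```

Rows from the second date set that carry no dates would still have to be dropped afterwards, as the notebook does.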

In [101]:
dfIntraWarPar1A = dfIntraStateWar[['WarNum', 'CcodeA', 'SideA', 'StartMonth1', 'StartDay1', 'StartYear1', 
                                         'EndMonth1', 'EndDay1', 'EndYear1', 'Initiator', 'Outcome', 'SideADeaths']]
dfIntraWarPar2A = dfIntraStateWar[['WarNum', 'CcodeA', 'SideA', 'StartMonth2', 'StartDay2', 'StartYear2', 
                                         'EndMonth2', 'EndDay2', 'EndYear2', 'Initiator', 'Outcome', 'SideADeaths']]
dfIntraWarPar1B = dfIntraStateWar[['WarNum', 'CcodeB', 'SideB', 'StartMonth1', 'StartDay1', 'StartYear1', 
                                         'EndMonth1', 'EndDay1', 'EndYear1', 'Initiator', 'Outcome', 'SideBDeaths']]
dfIntraWarPar2B = dfIntraStateWar[['WarNum', 'CcodeB', 'SideB', 'StartMonth2', 'StartDay2', 'StartYear2', 
                                         'EndMonth2', 'EndDay2', 'EndYear2', 'Initiator', 'Outcome', 'SideBDeaths']]

dfIntraWarPar1A = dfIntraWarPar1A.rename(columns={'WarNum':'WarID', 'CcodeA':'PolityID', 'SideA':'PolityName', 
                                                  'StartMonth1':'StartMonth', 'StartDay1':'StartDay', 
                                                  'StartYear1':'StartYear', 'EndMonth1':'EndMonth', 
                                                  'EndDay1':'EndDay', 'EndYear1':'EndYear', 'SideADeaths':'Deaths'})
dfIntraWarPar2A = dfIntraWarPar2A.rename(columns={'WarNum':'WarID', 'CcodeA':'PolityID', 'SideA':'PolityName', 
                                                  'StartMonth2':'StartMonth', 'StartDay2':'StartDay', 
                                                  'StartYear2':'StartYear', 'EndMonth2':'EndMonth', 
                                                  'EndDay2':'EndDay', 'EndYear2':'EndYear', 'SideADeaths':'Deaths'})
dfIntraWarPar1B = dfIntraWarPar1B.rename(columns={'WarNum':'WarID', 'CcodeB':'PolityID', 'SideB':'PolityName', 
                                                  'StartMonth1':'StartMonth', 'StartDay1':'StartDay', 
                                                  'StartYear1':'StartYear', 'EndMonth1':'EndMonth', 
                                                  'EndDay1':'EndDay', 'EndYear1':'EndYear', 'SideBDeaths':'Deaths'})
dfIntraWarPar2B = dfIntraWarPar2B.rename(columns={'WarNum':'WarID', 'CcodeB':'PolityID', 'SideB':'PolityName', 
                                                  'StartMonth2':'StartMonth', 'StartDay2':'StartDay', 
                                                  'StartYear2':'StartYear', 'EndMonth2':'EndMonth', 
                                                  'EndDay2':'EndDay', 'EndYear2':'EndYear', 'SideBDeaths':'Deaths'})

Get rid of extra rows in '2' tables (second set of dates)

In [102]:
dfIntraWarPar2A = dfIntraWarPar2A.replace(-8, '')
dfIntraWarPar2A['datesconcat'] = dfIntraWarPar2A['StartMonth'].map(str) + dfIntraWarPar2A['StartDay'].map(str) + dfIntraWarPar2A['StartYear'].map(str) + dfIntraWarPar2A['EndMonth'].map(str) + dfIntraWarPar2A['EndDay'].map(str) + dfIntraWarPar2A['EndYear'].map(str)
missdate2A = dfIntraWarPar2A.loc[0, 'datesconcat']
dfIntraWarPar2A = dfIntraWarPar2A[dfIntraWarPar2A.datesconcat != missdate2A]
dfIntraWarPar2A = dfIntraWarPar2A.drop(columns=['datesconcat'])
dfIntraWarPar2A

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths
48,547,329.0,Two Sicilies,5,15,1848,5,15,1849,Liberals,1,1500
86,590,101.0,Venezuela,8,14,1869,1,7,1871,Conservatives,2,-9
111,623,730.0,Korea,9,14,1894,11,28,1894,Tonghak Society,1,-9
193,720,350.0,Greece,2,12,1946,10,16,1949,Communists,1,17970
310,820,620.0,Libya,6,-9,1983,9,-9,1984,FAN,2,1000
324,836,625.0,Sudan,4,15,1992,1,10,2005,SPLA-Garang faction,3,-9
367,877,346.0,Bosnia,3,20,1995,12,14,1995,Bosnian Serbs,1,27500
369,877,,-8,3,20,1995,12,14,1995,Bosnian Serbs,1,-8
391,898,451.0,Sierra Leone,5,11,2000,11,10,2000,Kabbah faction,2,-9
392,898,,-8,5,11,2000,11,10,2000,Kabbah faction,2,-8


In [103]:
dfIntraWarPar2B = dfIntraWarPar2B.replace(-8, '')
dfIntraWarPar2B['datesconcat'] = dfIntraWarPar2B['StartMonth'].map(str) + dfIntraWarPar2B['StartDay'].map(str) + dfIntraWarPar2B['StartYear'].map(str) + dfIntraWarPar2B['EndMonth'].map(str) + dfIntraWarPar2B['EndDay'].map(str) + dfIntraWarPar2B['EndYear'].map(str)
missdate2B = dfIntraWarPar2B.loc[0, 'datesconcat']
dfIntraWarPar2B = dfIntraWarPar2B[dfIntraWarPar2B.datesconcat != missdate2B]
dfIntraWarPar2B = dfIntraWarPar2B.drop(columns=['datesconcat'])
dfIntraWarPar2B

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths
48,547,,Liberals,5,15,1848,5,15,1849,Liberals,1,-9
86,590,,Conservatives,8,14,1869,1,7,1871,Conservatives,2,-9
111,623,,Tonghak Society,9,14,1894,11,28,1894,Tonghak Society,1,-9
193,720,,Communists,2,12,1946,10,16,1949,Communists,1,50000
310,820,,-8,6,-9,1983,9,-9,1984,FAN,2,-8
324,836,,SPLA-Garang faction,4,15,1992,1,10,2005,SPLA-Garang faction,3,-9
367,877,,Bosnian Serbs,3,20,1995,12,14,1995,Bosnian Serbs,1,18543
369,877,344.0,Croatia,3,20,1995,12,14,1995,Bosnian Serbs,1,185
391,898,,Kabbah faction,5,11,2000,11,10,2000,Kabbah faction,2,-9
392,898,452.0,Ghana,5,11,2000,11,10,2000,Kabbah faction,2,-9


Combine the side A tables:

In [104]:
combinedIntraWarSideA = [dfIntraWarPar1A, dfIntraWarPar2A]
dfIntraWarParA = pd.concat(combinedIntraWarSideA).reset_index(drop=True)
dfIntraWarParA['Side'] = 'A'
dfIntraWarParA

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side
0,500,365,Russia,6,10,1818,-9,-9,1822,Chechens,1,5000,A
1,501,-8,Sidon,6,-9,1820,7,21,1821,Sidon,2,-9,A
2,502,300,Austria,3,-9,1821,3,23,1821,Liberals,1,-9,A
3,502,329,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,-9,A
4,503,230,Spain,12,1,1821,4,6,1823,Royalists,4,-9,A
5,505,300,Austria,3,10,1821,5,8,1821,Carbonari,1,-9,A
6,505,325,Sardinia,3,10,1821,5,8,1821,Carbonari,1,-9,A
7,506,640,Ottoman Empire,3,25,1821,4,25,1828,Greeks,4,6000,A
8,506,-8,-8,10,20,1827,10,27,1827,Greeks,4,-8,A
9,506,-8,-8,10,20,1827,10,27,1827,Greeks,4,-8,A


In [105]:
combinedIntraWarSideB = [dfIntraWarPar1B, dfIntraWarPar2B]
dfIntraWarParB = pd.concat(combinedIntraWarSideB).reset_index(drop=True)
dfIntraWarParB['Side'] = 'B'
# Make 'Outcome' consistent between war types: in the inter-state data 1 = winner and 2 = loser,
# while in the intra-state data 1 = side A wins and 2 = side B wins.
# For side B rows, converting to the inter-state convention means swapping 1 and 2.
# Both masks are computed before assigning, so the swap needs no placeholder values.
sideAWins = dfIntraWarParB['Outcome'] == 1
sideBWins = dfIntraWarParB['Outcome'] == 2
dfIntraWarParB.loc[sideBWins, 'Outcome'] = 1
dfIntraWarParB.loc[sideAWins, 'Outcome'] = 2
dfIntraWarParB


Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side
0,500,-8,"Georgians, Dhagestania, Chechens",6,10,1818,-9,-9,1822,Chechens,2,6000,B
1,501,-8,Damascus & Aleppo,6,-9,1820,7,21,1821,Sidon,1,-9,B
2,502,-8,-8,3,-9,1821,3,23,1821,Liberals,2,-8,B
3,502,-8,Liberals,7,2,1820,3,23,1821,Liberals,2,-9,B
4,503,-8,Royalists,12,1,1821,4,6,1823,Royalists,4,-9,B
5,505,-8,-8,3,10,1821,5,8,1821,Carbonari,2,-8,B
6,505,-8,Carbonari,3,10,1821,5,8,1821,Carbonari,2,-9,B
7,506,-8,Greeks,3,25,1821,4,25,1828,Greeks,4,3000,B
8,506,200,United Kingdom,10,20,1827,10,27,1827,Greeks,4,80,B
9,506,220,France,10,20,1827,10,27,1827,Greeks,4,40,B
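
The 1/2 swap for side B can also be sketched as a single mapping step. The series below is hypothetical; codes other than 1 and 2 (such as the 4 visible above) pass through unchanged.

```python
import pandas as pd

# Hypothetical side B outcomes in the intra-state coding (1 = side A wins, 2 = side B wins).
outcome = pd.Series([1, 2, 4, 2])

swap = {1: 2, 2: 1}
# dict.get falls back to the original code, so only 1 and 2 are exchanged.
swapped = outcome.map(lambda code: swap.get(code, code))
print(swapped.tolist())
```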


Need to get rid of rows with no polity (these are extra rows created by the formatting of the original table).

In [106]:
dfIntraWarParA['PolityName'] = dfIntraWarParA['PolityName'].str.strip()
dfIntraWarParB['PolityName'] = dfIntraWarParB['PolityName'].str.strip()

dfIntraWarParA = dfIntraWarParA[dfIntraWarParA.PolityName != '-8']
dfIntraWarParB = dfIntraWarParB[dfIntraWarParB.PolityName != '-8']

In [107]:
combinedIntraWarPar = [dfIntraWarParA, dfIntraWarParB]
dfIntraWarPar = pd.concat(combinedIntraWarPar)
dfIntraWarPar = dfIntraWarPar.sort_values('WarID')
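# Note: the result of reset_index is not assigned, so the index is left unchanged
# and the outputs below still show the original (repeated) index labels.
dfIntraWarPar.reset_index(drop=True)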
dfIntraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side
0,500,365,Russia,6,10,1818,-9,-9,1822,Chechens,1,5000,A
0,500,-8,"Georgians, Dhagestania, Chechens",6,10,1818,-9,-9,1822,Chechens,2,6000,B
1,501,-8,Sidon,6,-9,1820,7,21,1821,Sidon,2,-9,A
1,501,-8,Damascus & Aleppo,6,-9,1820,7,21,1821,Sidon,1,-9,B
3,502,-8,Liberals,7,2,1820,3,23,1821,Liberals,2,-9,B
2,502,300,Austria,3,-9,1821,3,23,1821,Liberals,1,-9,A
3,502,329,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,-9,A
4,503,230,Spain,12,1,1821,4,6,1823,Royalists,4,-9,A
4,503,-8,Royalists,12,1,1821,4,6,1823,Royalists,4,-9,B
5,505,300,Austria,3,10,1821,5,8,1821,Carbonari,1,-9,A


In [113]:
dfIntraWarPar['Deaths'] = dfIntraWarPar['Deaths'].replace('-9', '')
dfIntraWarPar = dfIntraWarPar.replace(-9, '')
dfIntraWarPar = dfIntraWarPar.replace(-8, '')
dfIntraWarPar = dfIntraWarPar.replace(-7, '')
dfIntraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side
0,500,365,Russia,6,10,1818,,,1822,Chechens,1,5000,A
0,500,,"Georgians, Dhagestania, Chechens",6,10,1818,,,1822,Chechens,2,6000,B
1,501,,Sidon,6,,1820,7,21,1821,Sidon,2,,A
1,501,,Damascus & Aleppo,6,,1820,7,21,1821,Sidon,1,,B
3,502,,Liberals,7,2,1820,3,23,1821,Liberals,2,,B
2,502,300,Austria,3,,1821,3,23,1821,Liberals,1,,A
3,502,329,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,,A
4,503,230,Spain,12,1,1821,4,6,1823,Royalists,4,,A
4,503,,Royalists,12,1,1821,4,6,1823,Royalists,4,,B
5,505,300,Austria,3,10,1821,5,8,1821,Carbonari,1,,A


In [114]:
dfIntraWarPar['PolityName'] = dfIntraWarPar['PolityName'].str.strip()
dfIntraWarPar['Initiator'] = dfIntraWarPar['Initiator'].str.strip()
dfIntraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side
0,500,365,Russia,6,10,1818,,,1822,Chechens,1,5000,A
0,500,,"Georgians, Dhagestania, Chechens",6,10,1818,,,1822,Chechens,2,6000,B
1,501,,Sidon,6,,1820,7,21,1821,Sidon,2,,A
1,501,,Damascus & Aleppo,6,,1820,7,21,1821,Sidon,1,,B
3,502,,Liberals,7,2,1820,3,23,1821,Liberals,2,,B
2,502,300,Austria,3,,1821,3,23,1821,Liberals,1,,A
3,502,329,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,,A
4,503,230,Spain,12,1,1821,4,6,1823,Royalists,4,,A
4,503,,Royalists,12,1,1821,4,6,1823,Royalists,4,,B
5,505,300,Austria,3,10,1821,5,8,1821,Carbonari,1,,A


Create the 'IsInitiator' column based on the 'Initiator' column:

In [115]:
dfIntraWarPar['IsInitiator'] = 0
# Assign through .loc to avoid chained indexing (the source of SettingWithCopyWarning).
dfIntraWarPar.loc[dfIntraWarPar['PolityName'] == dfIntraWarPar['Initiator'], 'IsInitiator'] = 1
dfIntraWarPar


Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side,IsInitiator
0,500,365,Russia,6,10,1818,,,1822,Chechens,1,5000,A,0
0,500,,"Georgians, Dhagestania, Chechens",6,10,1818,,,1822,Chechens,2,6000,B,0
1,501,,Sidon,6,,1820,7,21,1821,Sidon,2,,A,1
1,501,,Damascus & Aleppo,6,,1820,7,21,1821,Sidon,1,,B,0
3,502,,Liberals,7,2,1820,3,23,1821,Liberals,2,,B,1
2,502,300,Austria,3,,1821,3,23,1821,Liberals,1,,A,0
3,502,329,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,,A,0
4,503,230,Spain,12,1,1821,4,6,1823,Royalists,4,,A,0
4,503,,Royalists,12,1,1821,4,6,1823,Royalists,4,,B,1
5,505,300,Austria,3,10,1821,5,8,1821,Carbonari,1,,A,0


In [116]:
checkinit = dfIntraWarPar.groupby('WarID')['IsInitiator'].sum()
checkinit.value_counts()

1    276
0     51
2      7
Name: IsInitiator, dtype: int64
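
To make the counts above easier to read: the check sums the flag per war, then counts how many wars have each sum. A small sketch on hypothetical flags:

```python
import pandas as pd

# Hypothetical flags: war 500 has no initiator marked, 501 has one, 563 has two.
flags = pd.DataFrame({
    'WarID': [500, 500, 501, 501, 563, 563],
    'IsInitiator': [0, 0, 1, 0, 1, 1],
})

perWar = flags.groupby('WarID')['IsInitiator'].sum()   # initiators marked per war
summary = perWar.value_counts()                         # number of wars per initiator count
noInitiator = perWar[perWar == 0].index.tolist()        # wars that still need a manual look

print(summary)
print(noInitiator)
```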

In [117]:
missingInit = checkinit.loc[checkinit==0].index

pd.set_option('display.max_rows', 200)
dfIntraWarPar[dfIntraWarPar.WarID.isin(missingInit)]

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side,IsInitiator
0,500,365.0,Russia,6.0,10.0,1818,,,1822.0,Chechens,1,5000.0,A,0
0,500,,"Georgians, Dhagestania, Chechens",6.0,10.0,1818,,,1822.0,Chechens,2,6000.0,B,0
21,518,640.0,Ottoman Empire,10.0,1.0,1831,12.0,27.0,1832.0,Egyptians,2,8000.0,A,0
21,518,,Egyptians & Bashir,10.0,1.0,1831,12.0,27.0,1832.0,Egyptians,1,4000.0,B,0
36,533,640.0,Ottoman Empire,6.0,10.0,1839,6.0,24.0,1839.0,Mehmet Ali,2,2000.0,A,0
36,533,,Egypt,6.0,10.0,1839,6.0,24.0,1839.0,Mehmet Ali,1,1000.0,B,0
44,542,640.0,Ottoman Empire,12.0,19.0,1842,1.0,13.0,1843.0,Ottomans,1,1600.0,A,0
44,542,,Karbala,12.0,19.0,1842,1.0,13.0,1843.0,Ottomans,2,3000.0,B,0
49,548,,Paez led Conservatives,2.0,4.0,1848,8.0,15.0,1849.0,Former Pres. Paez,2,,B,0
49,548,101.0,Venezuela,2.0,4.0,1848,8.0,15.0,1849.0,Former Pres. Paez,1,1500.0,A,0


In [118]:
# Some 'Initiator' values are irregular due to misspellings, small alterations, alternate names, being part of a list, etc.
# Others are less clear and required some Wikipedia/Google searching on my part.
# I used the above dataframe slice to select which rows should be coded as a 1 in the 'IsInitiator' column.
IsInitIndex = [1, 37, 63, 78, 90, 102, 105, 110, 131, 137, 143, 153, 155, 243, 258, 265, 280, 288, 293, 299, 310, 330, 336, 
              359, 394, 401, 461, 497, 526, 529, 530, 544, 552, 572, 578, 583, 598, 608, 615, 619, 621, 633, 636, 692, 
              694, 698, 749, 750, 755, 757, 770, 776, 777]

# Assign through .loc to avoid chained indexing. Note that the index labels are repeated at this point
# (the side A and side B frames were concatenated without resetting the index),
# so every row that shares a listed label is set to 1.
dfIntraWarPar.loc[dfIntraWarPar.index.isin(IsInitIndex), 'IsInitiator'] = 1
dfIntraWarPar


Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side,IsInitiator
0,500,365,Russia,6,10,1818,,,1822,Chechens,1,5000,A,0
0,500,,"Georgians, Dhagestania, Chechens",6,10,1818,,,1822,Chechens,2,6000,B,0
1,501,,Sidon,6,,1820,7,21,1821,Sidon,2,,A,1
1,501,,Damascus & Aleppo,6,,1820,7,21,1821,Sidon,1,,B,1
3,502,,Liberals,7,2,1820,3,23,1821,Liberals,2,,B,1
2,502,300,Austria,3,,1821,3,23,1821,Liberals,1,,A,0
3,502,329,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,,A,0
4,503,230,Spain,12,1,1821,4,6,1823,Royalists,4,,A,0
4,503,,Royalists,12,1,1821,4,6,1823,Royalists,4,,B,1
5,505,300,Austria,3,10,1821,5,8,1821,Carbonari,1,,A,0
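
Selecting rows by index label behaves differently when labels repeat, as they do in the output above. The following sketch, on hypothetical two-row frames, contrasts repeated and unique labels:

```python
import pandas as pd

# Hypothetical side A and side B frames, each with its own 0..1 index.
sideA = pd.DataFrame({'PolityName': ['Sidon', 'Austria'], 'IsInitiator': [0, 0]})
sideB = pd.DataFrame({'PolityName': ['Damascus & Aleppo', 'Liberals'], 'IsInitiator': [0, 0]})

# Without ignore_index, both halves keep labels 0 and 1, so label 1 matches two rows.
both = pd.concat([sideA, sideB])
both.loc[both.index.isin([1]), 'IsInitiator'] = 1
print(both['IsInitiator'].tolist())

# With unique labels, the same selection touches exactly one row.
unique = pd.concat([sideA, sideB], ignore_index=True)
unique.loc[unique.index.isin([1]), 'IsInitiator'] = 1
print(unique['IsInitiator'].tolist())
```

If only one side of a war is meant, a selection on unique labels, or on a combination such as index label plus 'Side', would be needed.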


In [119]:
checkinit = dfIntraWarPar.groupby('WarID')['IsInitiator'].sum()
checkinit.value_counts()

1    259
0     46
2     27
3      2
Name: IsInitiator, dtype: int64

In [120]:
doubleInit = checkinit.loc[checkinit==2].index
dfIntraWarPar[dfIntraWarPar.WarID.isin(doubleInit)]

# just to check, but everything here looks fine.

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side,IsInitiator
1,501,,Sidon,6,,1820,7,21.0,1821,Sidon,2,,A,1
1,501,,Damascus & Aleppo,6,,1820,7,21.0,1821,Sidon,1,,B,1
37,535,,Amir Bashir & Egypt,5,27.0,1840,7,13.0,1840,Amir Bashir & Egypt,1,1000.0,B,1
37,535,,Lebanese Maronites,5,27.0,1840,7,13.0,1840,Amir Bashir & Egypt,2,3500.0,A,1
442,547,,Liberals,5,15.0,1848,5,15.0,1849,Liberals,2,,B,1
48,547,,Liberals,1,12.0,1848,1,27.0,1848,Liberals,2,,B,1
442,547,329.0,Two Sicilies,5,15.0,1848,5,15.0,1849,Liberals,1,1500.0,A,0
48,547,329.0,Two Sicilies,1,12.0,1848,1,27.0,1848,Liberals,1,1500.0,A,0
63,563,,Liberals,2,1.0,1859,5,23.0,1863,Liberals,1,5000.0,B,1
63,563,101.0,Venezuela,2,1.0,1859,5,23.0,1863,Liberals,2,15000.0,A,1


Fill in PolityIDs where missing (mostly non-state groups):

In [121]:
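# Reassign the column instead of calling replace(..., inplace=True) on a selection,
# which may act on a temporary copy in newer pandas versions.
dfIntraWarPar['PolityID'] = dfIntraWarPar['PolityID'].replace('', np.nan).astype(float)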
dfIntraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side,IsInitiator
0,500,365.0,Russia,6,10,1818,,,1822,Chechens,1,5000,A,0
0,500,,"Georgians, Dhagestania, Chechens",6,10,1818,,,1822,Chechens,2,6000,B,0
1,501,,Sidon,6,,1820,7,21,1821,Sidon,2,,A,1
1,501,,Damascus & Aleppo,6,,1820,7,21,1821,Sidon,1,,B,1
3,502,,Liberals,7,2,1820,3,23,1821,Liberals,2,,B,1
2,502,300.0,Austria,3,,1821,3,23,1821,Liberals,1,,A,0
3,502,329.0,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,,A,0
4,503,230.0,Spain,12,1,1821,4,6,1823,Royalists,4,,A,0
4,503,,Royalists,12,1,1821,4,6,1823,Royalists,4,,B,1
5,505,300.0,Austria,3,10,1821,5,8,1821,Carbonari,1,,A,0


In [122]:
# Look up each participant's PolityID by name; the looked-up value lands in PolityID_m
dfIntraWarPar = dfIntraWarPar.merge(dfPolities[['PolityID', 'PolityName']], on='PolityName', how='left', suffixes=('', '_m'))
dfIntraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side,IsInitiator,PolityID_m
0,500,365.0,Russia,6,10,1818,,,1822,Chechens,1,5000,A,0,365.0
1,500,,"Georgians, Dhagestania, Chechens",6,10,1818,,,1822,Chechens,2,6000,B,0,10115.0
2,501,,Sidon,6,,1820,7,21,1821,Sidon,2,,A,1,10092.0
3,501,,Damascus & Aleppo,6,,1820,7,21,1821,Sidon,1,,B,1,10116.0
4,502,,Liberals,7,2,1820,3,23,1821,Liberals,2,,B,1,10048.0
5,502,300.0,Austria,3,,1821,3,23,1821,Liberals,1,,A,0,305.0
6,502,329.0,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,,A,0,329.0
7,503,230.0,Spain,12,1,1821,4,6,1823,Royalists,4,,A,0,230.0
8,503,,Royalists,12,1,1821,4,6,1823,Royalists,4,,B,1,10117.0
9,505,300.0,Austria,3,10,1821,5,8,1821,Carbonari,1,,A,0,305.0


In [123]:
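# Keep the existing PolityID and fill only the gaps from the lookup (assign back instead of inplace=True on a column selection)
dfIntraWarPar['PolityID'] = dfIntraWarPar['PolityID'].fillna(dfIntraWarPar['PolityID_m'])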
dfIntraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,Deaths,Side,IsInitiator,PolityID_m
0,500,365.0,Russia,6,10,1818,,,1822,Chechens,1,5000,A,0,365.0
1,500,10115.0,"Georgians, Dhagestania, Chechens",6,10,1818,,,1822,Chechens,2,6000,B,0,10115.0
2,501,10092.0,Sidon,6,,1820,7,21,1821,Sidon,2,,A,1,10092.0
3,501,10116.0,Damascus & Aleppo,6,,1820,7,21,1821,Sidon,1,,B,1,10116.0
4,502,10048.0,Liberals,7,2,1820,3,23,1821,Liberals,2,,B,1,10048.0
5,502,300.0,Austria,3,,1821,3,23,1821,Liberals,1,,A,0,305.0
6,502,329.0,Two Sicilies,7,2,1820,3,23,1821,Liberals,1,,A,0,329.0
7,503,230.0,Spain,12,1,1821,4,6,1823,Royalists,4,,A,0,230.0
8,503,10117.0,Royalists,12,1,1821,4,6,1823,Royalists,4,,B,1,10117.0
9,505,300.0,Austria,3,10,1821,5,8,1821,Carbonari,1,,A,0,305.0
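
A minimal sketch of the same fill-the-gaps step using `Series.map` instead of a merge, on made-up data. The frames `polities` and `participants` are hypothetical stand-ins for `dfPolities` and `dfIntraWarPar`, and the IDs are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup table standing in for dfPolities.
polities = pd.DataFrame({
    'PolityID':   [365.0, 10092.0, 10048.0],
    'PolityName': ['Russia', 'Sidon', 'Liberals'],
})

# Hypothetical participants with some IDs missing.
participants = pd.DataFrame({
    'WarID':      [500, 501, 502, 502],
    'PolityID':   [365.0, np.nan, np.nan, 300.0],
    'PolityName': ['Russia', 'Sidon', 'Liberals', 'Austria'],
})

# Map each name to its ID; names absent from the lookup map to NaN.
name_to_id = polities.drop_duplicates('PolityName').set_index('PolityName')['PolityID']
looked_up = participants['PolityName'].map(name_to_id)

# Keep existing IDs and fill only the gaps, assigning the result back
# rather than calling fillna(..., inplace=True) on a column selection.
participants['PolityID'] = participants['PolityID'].fillna(looked_up)
print(participants)
```

One reason to consider `map` here: it can never change the number of rows, whereas a left merge duplicates rows whenever a name occurs more than once in the lookup table.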


In [124]:
dfIntraWarPar = dfIntraWarPar[['WarID', 'PolityID', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Side', 'IsInitiator', 'Outcome', 'Deaths']]
dfIntraWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,500,365.0,1818,6,10,1822,,,A,0,1,5000
1,500,10115.0,1818,6,10,1822,,,B,0,2,6000
2,501,10092.0,1820,6,,1821,7,21,A,1,2,
3,501,10116.0,1820,6,,1821,7,21,B,1,1,
4,502,10048.0,1820,7,2,1821,3,23,B,1,2,
5,502,300.0,1821,3,,1821,3,23,A,0,1,
6,502,329.0,1820,7,2,1821,3,23,A,0,1,
7,503,230.0,1821,12,1,1823,4,6,A,0,4,
8,503,10117.0,1821,12,1,1823,4,6,B,1,4,
9,505,300.0,1821,3,10,1821,5,8,A,0,1,


### Non-State War

In [38]:
dfNonStateWar.columns

Index(['WarNum', 'WarName', 'WarType', 'WhereFought', 'SideA1', 'SideA2',
       'SideB1', 'SideB2', 'SideB3', 'SideB4', 'SideB5', 'StartYear',
       'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator',
       'TransFrom', 'TransTo', 'Outcome', 'SideADeaths', 'SideBDeaths',
       'TotalCombatDeaths', 'Version'],
      dtype='object')

In [125]:
dfNonWarParA1 = dfNonStateWar[['WarNum', 'SideA1', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator', 'Outcome', 'SideADeaths']]
dfNonWarParA2 = dfNonStateWar[['WarNum', 'SideA2', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator', 'Outcome', 'SideADeaths']]
dfNonWarParB1 = dfNonStateWar[['WarNum', 'SideB1', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator', 'Outcome', 'SideBDeaths']]
dfNonWarParB2 = dfNonStateWar[['WarNum', 'SideB2', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator', 'Outcome', 'SideBDeaths']]
dfNonWarParB3 = dfNonStateWar[['WarNum', 'SideB3', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator', 'Outcome', 'SideBDeaths']]
dfNonWarParB4 = dfNonStateWar[['WarNum', 'SideB4', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator', 'Outcome', 'SideBDeaths']]
dfNonWarParB5 = dfNonStateWar[['WarNum', 'SideB5', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator', 'Outcome', 'SideBDeaths']]

In [126]:
dfNonWarParA2

Unnamed: 0,WarNum,SideA2,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Initiator,Outcome,SideADeaths
0,1500,-8,1818,-9,-9,1824,-9,-9,A,1,1500
1,1501,-8,1819,-9,-9,1828,9,24,A,1,20000
2,1502,-8,1819,-9,-9,1822,-9,-9,A,1,-9
3,1503,-8,1820,1,8,1820,2,23,B,2,-9
4,1505,-8,1821,9,-9,1823,-9,-9,A,1,500
5,1506,-8,1821,11,-9,1821,12,-9,A,1,-9
6,1508,-8,1825,-9,-9,1828,-9,-9,A,1,-9
7,1509,-8,1825,10,25,1827,4,13,A,3,-9
8,1510,-8,1826,-9,-9,1829,4,12,B,2,2000
9,1511,-8,1826,-9,-9,1827,5,15,A,2,24000


In [127]:
dfNonWarParA1 = dfNonWarParA1[dfNonWarParA1.SideA1 != '-8']
dfNonWarParA2 = dfNonWarParA2[dfNonWarParA2.SideA2 != '-8']
dfNonWarParB1 = dfNonWarParB1[dfNonWarParB1.SideB1 != '-8']
dfNonWarParB2 = dfNonWarParB2[dfNonWarParB2.SideB2 != '-8']
dfNonWarParB3 = dfNonWarParB3[dfNonWarParB3.SideB3 != '-8']
dfNonWarParB4 = dfNonWarParB4[dfNonWarParB4.SideB4 != '-8']
dfNonWarParB5 = dfNonWarParB5[dfNonWarParB5.SideB5 != '-8']

In [128]:
dfNonWarParA2

Unnamed: 0,WarNum,SideA2,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Initiator,Outcome,SideADeaths
18,1523,Argentina,1837,11,-9,1839,1,20,A,1,1400
35,1543,Khoja,1857,-9,-9,1857,9,-9,A,2,-9
40,1550,Nicaragua,1863,1,23,1863,11,15,A,1,-9
56,1573,military,1948,4,3,1949,5,-9,A,2,17000
60,1582,Apodeti,1975,8,11,1975,10,15,B,4,-9
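
The seven near-identical column selections above are one way to go from one column per participant to one row per participant. A minimal sketch of the same reshaping with `DataFrame.melt` follows. The toy frame, its placeholder polity names, and the variable names `wars`, `slot_cols` and `tidy` are made up for illustration, and only two slots per side are shown.

```python
import pandas as pd

# Hypothetical miniature of dfNonStateWar; '-8' marks an unused participant slot.
wars = pd.DataFrame({
    'WarNum':      [1, 2],
    'SideA1':      ['P', 'S'],
    'SideA2':      ['-8', 'T'],
    'SideB1':      ['Q', 'U'],
    'SideB2':      ['R', '-8'],
    'Initiator':   ['A', 'B'],
    'SideADeaths': [100, 300],
    'SideBDeaths': [200, 400],
})

slot_cols = ['SideA1', 'SideA2', 'SideB1', 'SideB2']

# One row per (war, participant slot) instead of one column per slot.
tidy = wars.melt(
    id_vars=['WarNum', 'Initiator', 'SideADeaths', 'SideBDeaths'],
    value_vars=slot_cols,
    var_name='Slot',
    value_name='PolityName',
)

# Drop unused slots, then derive Side from the slot name ('SideA1' -> 'A').
tidy = tidy[tidy['PolityName'] != '-8'].copy()
tidy['Side'] = tidy['Slot'].str[4]
tidy['IsInitiator'] = (tidy['Side'] == tidy['Initiator']).astype(int)

# Each participant takes the death count reported for its own side.
tidy['Deaths'] = tidy['SideADeaths'].where(tidy['Side'] == 'A', tidy['SideBDeaths'])

tidy = (tidy.rename(columns={'WarNum': 'WarID'})
            [['WarID', 'PolityName', 'Side', 'IsInitiator', 'Deaths']]
            .sort_values(['WarID', 'Side'])
            .reset_index(drop=True))
print(tidy)
```

The explicit `.copy()` after filtering keeps the later column assignments from triggering pandas' "set on a copy of a slice" warning.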


In [129]:
# rename() returns a new frame, so reassign instead of using inplace=True on a filtered slice
dfNonWarParA1 = dfNonWarParA1.rename(columns={'SideA1':'PolityName', 'WarNum':'WarID'})
dfNonWarParA2 = dfNonWarParA2.rename(columns={'SideA2':'PolityName', 'WarNum':'WarID'})

combinedNonWarSideA = [dfNonWarParA1, dfNonWarParA2]
dfNonWarParA = pd.concat(combinedNonWarSideA).reset_index(drop=True)
dfNonWarParA['Side'] = 'A'
# Side A is the initiator unless the source names side B
dfNonWarParA['IsInitiator'] = 1
dfNonWarParA.loc[dfNonWarParA['Initiator'] == 'B', 'IsInitiator'] = 0
dfNonWarParA

Unnamed: 0,WarID,PolityName,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Initiator,Outcome,SideADeaths,Side,IsInitiator
0,1500,Te Rauparaha's Ngati Toa,1818,-9,-9,1824,-9,-9,A,1,1500,A,1
1,1501,Shaka Zulu,1819,-9,-9,1828,9,24,A,1,20000,A,1
2,1502,Burma,1819,-9,-9,1822,-9,-9,A,1,-9,A,1
3,1503,Buenos Aires,1820,1,8,1820,2,23,B,2,-9,A,0
4,1505,Hongi Hika's Nga Phuhi,1821,9,-9,1823,-9,-9,A,1,500,A,1
5,1506,Thailand,1821,11,-9,1821,12,-9,A,1,-9,A,1
6,1508,China,1825,-9,-9,1828,-9,-9,A,1,-9,A,1
7,1509,Mexico,1825,10,25,1827,4,13,A,3,-9,A,1
8,1510,Conservative Confederation,1826,-9,-9,1829,4,12,B,2,2000,A,0
9,1511,Viang Chan,1826,-9,-9,1827,5,15,A,2,24000,A,1


In [130]:
# rename() returns a new frame, so reassign instead of using inplace=True on a filtered slice
dfNonWarParB1 = dfNonWarParB1.rename(columns={'SideB1':'PolityName', 'WarNum':'WarID'})
dfNonWarParB2 = dfNonWarParB2.rename(columns={'SideB2':'PolityName', 'WarNum':'WarID'})
dfNonWarParB3 = dfNonWarParB3.rename(columns={'SideB3':'PolityName', 'WarNum':'WarID'})
dfNonWarParB4 = dfNonWarParB4.rename(columns={'SideB4':'PolityName', 'WarNum':'WarID'})
dfNonWarParB5 = dfNonWarParB5.rename(columns={'SideB5':'PolityName', 'WarNum':'WarID'})

combinedNonWarSideB = [dfNonWarParB1, dfNonWarParB2, dfNonWarParB3, dfNonWarParB4, dfNonWarParB5]
dfNonWarParB = pd.concat(combinedNonWarSideB).reset_index(drop=True)
dfNonWarParB['Side'] = 'B'
# Side B is the initiator unless the source names side A
dfNonWarParB['IsInitiator'] = 1
dfNonWarParB.loc[dfNonWarParB['Initiator'] == 'A', 'IsInitiator'] = 0
dfNonWarParB


Unnamed: 0,WarID,PolityName,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Initiator,Outcome,SideBDeaths,Side,IsInitiator
0,1500,Taranaki,1818,-9,-9,1824,-9,-9,A,1,6000,B,0
1,1501,Bantu,1819,-9,-9,1828,9,24,A,1,40000,B,0
2,1502,Assam,1819,-9,-9,1822,-9,-9,A,1,-9,B,0
3,1503,Provinces,1820,1,8,1820,2,23,B,2,-9,B,1
4,1505,Ngati Paoa,1821,9,-9,1823,-9,-9,A,1,2000,B,0
5,1506,Kedah,1821,11,-9,1821,12,-9,A,1,-9,B,0
6,1508,Muslim rebels,1825,-9,-9,1828,-9,-9,A,1,-9,B,0
7,1509,Yaqui Indians,1825,10,25,1827,4,13,A,3,-9,B,0
8,1510,Liberals,1826,-9,-9,1829,4,12,B,2,1300,B,1
9,1511,Siam,1826,-9,-9,1827,5,15,A,2,7000,B,0


In [131]:
# For non-state war, Outcome = 1 if side A wins and 2 if side B wins.
# Recode the side B rows so that 1 = win and 2 = loss; other outcome codes are left unchanged.
# Both masks are computed before any write, so the swap needs no temporary 'win'/'lose' labels.
sideAWon = dfNonWarParB['Outcome'] == 1
sideBWon = dfNonWarParB['Outcome'] == 2
dfNonWarParB.loc[sideAWon, 'Outcome'] = 2
dfNonWarParB.loc[sideBWon, 'Outcome'] = 1

dfNonWarParB


Unnamed: 0,WarID,PolityName,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Initiator,Outcome,SideBDeaths,Side,IsInitiator
0,1500,Taranaki,1818,-9,-9,1824,-9,-9,A,2,6000,B,0
1,1501,Bantu,1819,-9,-9,1828,9,24,A,2,40000,B,0
2,1502,Assam,1819,-9,-9,1822,-9,-9,A,2,-9,B,0
3,1503,Provinces,1820,1,8,1820,2,23,B,1,-9,B,1
4,1505,Ngati Paoa,1821,9,-9,1823,-9,-9,A,2,2000,B,0
5,1506,Kedah,1821,11,-9,1821,12,-9,A,2,-9,B,0
6,1508,Muslim rebels,1825,-9,-9,1828,-9,-9,A,2,-9,B,0
7,1509,Yaqui Indians,1825,10,25,1827,4,13,A,3,-9,B,0
8,1510,Liberals,1826,-9,-9,1829,4,12,B,1,1300,B,1
9,1511,Siam,1826,-9,-9,1827,5,15,A,1,7000,B,0


In [132]:
dfNonWarParA.rename(columns={'SideADeaths':'Deaths'}, inplace=True)
dfNonWarParB.rename(columns={'SideBDeaths':'Deaths'}, inplace=True)

combinedNonWarPar = [dfNonWarParA, dfNonWarParB]
dfNonWarPar = pd.concat(combinedNonWarPar).sort_values('WarID').reset_index(drop=True)
dfNonWarPar = dfNonWarPar.replace(-9, '')
dfNonWarPar

Unnamed: 0,WarID,PolityName,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Initiator,Outcome,Deaths,Side,IsInitiator
0,1500,Te Rauparaha's Ngati Toa,1818,,,1824,,,A,1,1500.0,A,1
1,1500,Ngati Ira,1818,,,1824,,,A,2,6000.0,B,0
2,1500,Waikato,1818,,,1824,,,A,2,6000.0,B,0
3,1500,Ngai Tahu,1818,,,1824,,,A,2,6000.0,B,0
4,1500,Taranaki,1818,,,1824,,,A,2,6000.0,B,0
5,1500,Rangitikei,1818,,,1824,,,A,2,6000.0,B,0
6,1501,Shaka Zulu,1819,,,1828,9.0,24.0,A,1,20000.0,A,1
7,1501,Bantu,1819,,,1828,9.0,24.0,A,2,40000.0,B,0
8,1502,Burma,1819,,,1822,,,A,1,,A,1
9,1502,Assam,1819,,,1822,,,A,2,,B,0


In [133]:
dfNonWarPar['PolityName'] = dfNonWarPar['PolityName'].str.strip()
dfNonWarPar = dfNonWarPar.merge(dfPolities[['PolityID', 'PolityName']], on='PolityName', how='left', suffixes=('', '_m'),)
dfNonWarPar

Unnamed: 0,WarID,PolityName,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Initiator,Outcome,Deaths,Side,IsInitiator,PolityID
0,1500,Te Rauparaha's Ngati Toa,1818,,,1824,,,A,1,1500.0,A,1,10000.0
1,1500,Ngati Ira,1818,,,1824,,,A,2,6000.0,B,0,10089.0
2,1500,Waikato,1818,,,1824,,,A,2,6000.0,B,0,10087.0
3,1500,Ngai Tahu,1818,,,1824,,,A,2,6000.0,B,0,10082.0
4,1500,Taranaki,1818,,,1824,,,A,2,6000.0,B,0,10042.0
5,1500,Rangitikei,1818,,,1824,,,A,2,6000.0,B,0,10091.0
6,1501,Shaka Zulu,1819,,,1828,9.0,24.0,A,1,20000.0,A,1,10001.0
7,1501,Bantu,1819,,,1828,9.0,24.0,A,2,40000.0,B,0,10043.0
8,1502,Burma,1819,,,1822,,,A,1,,A,1,10002.0
9,1502,Assam,1819,,,1822,,,A,2,,B,0,7572.0


In [135]:
# Take an explicit copy of the final columns first, so the cast below is not an assignment into a slice
dfNonWarPar = dfNonWarPar[['WarID', 'PolityID', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Side', 'IsInitiator', 'Outcome', 'Deaths']].copy()
# astype(int) raises if any PolityID is still NaN, i.e. if a PolityName found no match in the merge
dfNonWarPar['PolityID'] = dfNonWarPar['PolityID'].astype(int)
dfNonWarPar


Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,1500,10000,1818,,,1824,,,A,1,1,1500.0
1,1500,10089,1818,,,1824,,,B,0,2,6000.0
2,1500,10087,1818,,,1824,,,B,0,2,6000.0
3,1500,10082,1818,,,1824,,,B,0,2,6000.0
4,1500,10042,1818,,,1824,,,B,0,2,6000.0
5,1500,10091,1818,,,1824,,,B,0,2,6000.0
6,1501,10001,1819,,,1828,9.0,24.0,A,1,1,20000.0
7,1501,10043,1819,,,1828,9.0,24.0,B,0,2,40000.0
8,1502,10002,1819,,,1822,,,A,1,1,
9,1502,7572,1819,,,1822,,,B,0,2,


### Extra-State War

In [136]:
dfExtraStateWar.columns

Index(['WarNum', 'WarName', 'WarType', 'ccode1', 'SideA', 'ccode2', 'SideB',
       'StartMonth1', 'StartDay1', 'StartYear1', 'EndMonth1', 'EndDay1',
       'EndYear1', 'StartMonth2', 'StartDay2', 'StartYear2', 'EndMonth2',
       'EndDay2 ', 'EndYear2', 'Initiator', 'Interven', 'TransFrom', 'Outcome',
       'TransTo', 'WhereFought', 'BatDeath', 'NonStateDeaths', 'Version'],
      dtype='object')

In [137]:
dfExtraStateWar

Unnamed: 0,WarNum,WarName,WarType,ccode1,SideA,ccode2,SideB,StartMonth1,StartDay1,StartYear1,...,EndYear2,Initiator,Interven,TransFrom,Outcome,TransTo,WhereFought,BatDeath,NonStateDeaths,Version
0,300,Allied Bombardment of Algiers,3,210,Netherlands,-8,-8,8,26,1816,...,-8,1,1,-8,1,-8,6,13,-8,4
1,300,Allied Bombardment of Algiers,3,200,United Kingdom,-8,Algeria,8,26,1816,...,-8,1,1,-8,1,-8,6,129,6000,4
2,301,Ottoman-Wahhabi,3,640,Ottoman Empire,-8,Saudi Wahhabis,9,-9,1816,...,-8,1,0,-8,1,-8,6,13500,14000,4
3,302,Liberation of Chile,2,230,Spain,-8,San Martin revolutionaries,1,9,1817,...,-8,0,0,-8,2,-8,1,1700,1140,4
4,303,First Bolivar Expedition,2,230,Spain,-8,New Granada,4,11,1817,...,-8,1,0,-8,2,-8,1,3000,2000,4
5,304,Mexican Independence,2,230,Spain,-8,Mina Expedition,8,15,1817,...,-8,0,0,-8,1,-8,1,1000,1000,4
6,305,British-Kandyan,2,200,United Kingdom,-8,Kandyan rebels,10,-9,1817,...,-8,0,0,-8,1,-8,7,1000,10000,4
7,306,British-Maratha,2,200,United Kingdom,-8,Marathas,11,6,1817,...,-8,0,0,-8,1,-8,7,2800,2000,4
8,307,Ottoman Conquest of Sudan,3,640,Ottoman Empire,-8,Sudan states,-9,-9,1820,...,-8,1,0,-8,1,-8,4,4000,2500,4
9,308,Second Bolivar Expedition,2,230,Spain,-8,New Granada,4,28,1821,...,-8,0,0,-8,2,-8,1,1000,500,4


In [138]:
dfExtraWarPar1A = dfExtraStateWar[['WarNum', 'ccode1', 'SideA', 'StartMonth1', 'StartDay1', 'StartYear1', 
                                         'EndMonth1', 'EndDay1', 'EndYear1', 'Initiator', 'Outcome', 'BatDeath', 'NonStateDeaths']]
dfExtraWarPar2A = dfExtraStateWar[['WarNum', 'ccode1', 'SideA', 'StartMonth2', 'StartDay2', 'StartYear2', 
                                         'EndMonth2', 'EndDay2 ', 'EndYear2', 'Initiator', 'Outcome', 'BatDeath', 'NonStateDeaths']]
dfExtraWarPar1B = dfExtraStateWar[['WarNum', 'ccode2', 'SideB', 'StartMonth1', 'StartDay1', 'StartYear1', 
                                         'EndMonth1', 'EndDay1', 'EndYear1', 'Initiator', 'Outcome', 'BatDeath', 'NonStateDeaths']]
dfExtraWarPar2B = dfExtraStateWar[['WarNum', 'ccode2', 'SideB', 'StartMonth2', 'StartDay2', 'StartYear2', 
                                         'EndMonth2', 'EndDay2 ', 'EndYear2', 'Initiator', 'Outcome', 'BatDeath', 'NonStateDeaths']]

In [139]:
# rename() returns a new frame, so reassign instead of using inplace=True on a column slice.
# Note that the source column 'EndDay2 ' carries a trailing space.
dfExtraWarPar1A = dfExtraWarPar1A.rename(columns={'WarNum':'WarID', 'ccode1':'PolityID', 'SideA':'PolityName', 'StartMonth1':'StartMonth', 
                                      'StartDay1':'StartDay', 'StartYear1':'StartYear', 'EndMonth1':'EndMonth', 
                                      'EndDay1':'EndDay', 'EndYear1':'EndYear'})
dfExtraWarPar2A = dfExtraWarPar2A.rename(columns={'WarNum':'WarID', 'ccode1':'PolityID', 'SideA':'PolityName', 'StartMonth2':'StartMonth', 
                                      'StartDay2':'StartDay', 'StartYear2':'StartYear', 'EndMonth2':'EndMonth', 
                                      'EndDay2 ':'EndDay', 'EndYear2':'EndYear'})
dfExtraWarPar1B = dfExtraWarPar1B.rename(columns={'WarNum':'WarID', 'ccode2':'PolityID', 'SideB':'PolityName', 'StartMonth1':'StartMonth', 
                                      'StartDay1':'StartDay', 'StartYear1':'StartYear', 'EndMonth1':'EndMonth', 
                                      'EndDay1':'EndDay', 'EndYear1':'EndYear'})
dfExtraWarPar2B = dfExtraWarPar2B.rename(columns={'WarNum':'WarID', 'ccode2':'PolityID', 'SideB':'PolityName', 'StartMonth2':'StartMonth', 
                                      'StartDay2':'StartDay', 'StartYear2':'StartYear', 'EndMonth2':'EndMonth', 
                                      'EndDay2 ':'EndDay', 'EndYear2':'EndYear'})


In [140]:
dfExtraWarPar1A

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,BatDeath,NonStateDeaths
0,300,210,Netherlands,8,26,1816,8,30,1816,1,1,13,-8
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,1,129,6000
2,301,640,Ottoman Empire,9,-9,1816,9,11,1818,1,1,13500,14000
3,302,230,Spain,1,9,1817,4,5,1818,0,2,1700,1140
4,303,230,Spain,4,11,1817,8,10,1819,1,2,3000,2000
5,304,230,Spain,8,15,1817,1,1,1818,0,1,1000,1000
6,305,200,United Kingdom,10,-9,1817,11,26,1818,0,1,1000,10000
7,306,200,United Kingdom,11,6,1817,6,3,1818,0,1,2800,2000
8,307,640,Ottoman Empire,-9,-9,1820,6,-9,1821,1,1,4000,2500
9,308,230,Spain,4,28,1821,5,24,1822,0,2,1000,500


In [141]:
combinedExtraWarSideA = [dfExtraWarPar1A, dfExtraWarPar2A]
dfExtraWarParA = pd.concat(combinedExtraWarSideA).reset_index(drop=True)
dfExtraWarParA['Side'] = 'A'
dfExtraWarParA['IsInitiator'] = dfExtraWarParA['Initiator']
dfExtraWarParA

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,BatDeath,NonStateDeaths,Side,IsInitiator
0,300,210,Netherlands,8,26,1816,8,30,1816,1,1,13,-8,A,1
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,1,129,6000,A,1
2,301,640,Ottoman Empire,9,-9,1816,9,11,1818,1,1,13500,14000,A,1
3,302,230,Spain,1,9,1817,4,5,1818,0,2,1700,1140,A,0
4,303,230,Spain,4,11,1817,8,10,1819,1,2,3000,2000,A,1
5,304,230,Spain,8,15,1817,1,1,1818,0,1,1000,1000,A,0
6,305,200,United Kingdom,10,-9,1817,11,26,1818,0,1,1000,10000,A,0
7,306,200,United Kingdom,11,6,1817,6,3,1818,0,1,2800,2000,A,0
8,307,640,Ottoman Empire,-9,-9,1820,6,-9,1821,1,1,4000,2500,A,1
9,308,230,Spain,4,28,1821,5,24,1822,0,2,1000,500,A,0


In [142]:
combinedExtraWarSideB = [dfExtraWarPar1B, dfExtraWarPar2B]
dfExtraWarParB = pd.concat(combinedExtraWarSideB).reset_index(drop=True)
dfExtraWarParB['Side'] = 'B'
# Initiator == 1 marks side A as the initiator, so side B initiates only when Initiator is 0
dfExtraWarParB['IsInitiator'] = 1
dfExtraWarParB.loc[dfExtraWarParB['Initiator'] == 1, 'IsInitiator'] = 0
dfExtraWarParB


Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,BatDeath,NonStateDeaths,Side,IsInitiator
0,300,-8,-8,8,26,1816,8,30,1816,1,1,13,-8,B,0
1,300,-8,Algeria,8,26,1816,8,30,1816,1,1,129,6000,B,0
2,301,-8,Saudi Wahhabis,9,-9,1816,9,11,1818,1,1,13500,14000,B,0
3,302,-8,San Martin revolutionaries,1,9,1817,4,5,1818,0,2,1700,1140,B,1
4,303,-8,New Granada,4,11,1817,8,10,1819,1,2,3000,2000,B,0
5,304,-8,Mina Expedition,8,15,1817,1,1,1818,0,1,1000,1000,B,1
6,305,-8,Kandyan rebels,10,-9,1817,11,26,1818,0,1,1000,10000,B,1
7,306,-8,Marathas,11,6,1817,6,3,1818,0,1,2800,2000,B,1
8,307,-8,Sudan states,-9,-9,1820,6,-9,1821,1,1,4000,2500,B,0
9,308,-8,New Granada,4,28,1821,5,24,1822,0,2,1000,500,B,1


In [143]:
# For extra-state war, Outcome = 1 if side A wins and 2 if side B wins.
# Recode the side B rows so that 1 = win and 2 = loss; other outcome codes are left unchanged.
# Both masks are computed before any write, so the swap needs no temporary 'win'/'lose' labels.
sideAWon = dfExtraWarParB['Outcome'] == 1
sideBWon = dfExtraWarParB['Outcome'] == 2
dfExtraWarParB.loc[sideAWon, 'Outcome'] = 2
dfExtraWarParB.loc[sideBWon, 'Outcome'] = 1

dfExtraWarParB


Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,BatDeath,NonStateDeaths,Side,IsInitiator
0,300,-8,-8,8,26,1816,8,30,1816,1,2,13,-8,B,0
1,300,-8,Algeria,8,26,1816,8,30,1816,1,2,129,6000,B,0
2,301,-8,Saudi Wahhabis,9,-9,1816,9,11,1818,1,2,13500,14000,B,0
3,302,-8,San Martin revolutionaries,1,9,1817,4,5,1818,0,1,1700,1140,B,1
4,303,-8,New Granada,4,11,1817,8,10,1819,1,1,3000,2000,B,0
5,304,-8,Mina Expedition,8,15,1817,1,1,1818,0,2,1000,1000,B,1
6,305,-8,Kandyan rebels,10,-9,1817,11,26,1818,0,2,1000,10000,B,1
7,306,-8,Marathas,11,6,1817,6,3,1818,0,2,2800,2000,B,1
8,307,-8,Sudan states,-9,-9,1820,6,-9,1821,1,2,4000,2500,B,0
9,308,-8,New Granada,4,28,1821,5,24,1822,0,1,1000,500,B,1


In [144]:
combinedExtraWar = [dfExtraWarParA, dfExtraWarParB]
dfExtraWarPar = pd.concat(combinedExtraWar).reset_index(drop=True)
dfExtraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,BatDeath,NonStateDeaths,Side,IsInitiator
0,300,210,Netherlands,8,26,1816,8,30,1816,1,1,13,-8,A,1
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,1,129,6000,A,1
2,301,640,Ottoman Empire,9,-9,1816,9,11,1818,1,1,13500,14000,A,1
3,302,230,Spain,1,9,1817,4,5,1818,0,2,1700,1140,A,0
4,303,230,Spain,4,11,1817,8,10,1819,1,2,3000,2000,A,1
5,304,230,Spain,8,15,1817,1,1,1818,0,1,1000,1000,A,0
6,305,200,United Kingdom,10,-9,1817,11,26,1818,0,1,1000,10000,A,0
7,306,200,United Kingdom,11,6,1817,6,3,1818,0,1,2800,2000,A,0
8,307,640,Ottoman Empire,-9,-9,1820,6,-9,1821,1,1,4000,2500,A,1
9,308,230,Spain,4,28,1821,5,24,1822,0,2,1000,500,A,0


In [145]:
# Participants with a PolityID take BatDeath; participants without one (PolityID == -8) take NonStateDeaths
dfExtraWarPar['Deaths'] = np.where(dfExtraWarPar['PolityID'] != -8, dfExtraWarPar['BatDeath'], dfExtraWarPar['NonStateDeaths'])
dfExtraWarPar


Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,BatDeath,NonStateDeaths,Side,IsInitiator,Deaths
0,300,210,Netherlands,8,26,1816,8,30,1816,1,1,13,-8,A,1,13
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,1,129,6000,A,1,129
2,301,640,Ottoman Empire,9,-9,1816,9,11,1818,1,1,13500,14000,A,1,13500
3,302,230,Spain,1,9,1817,4,5,1818,0,2,1700,1140,A,0,1700
4,303,230,Spain,4,11,1817,8,10,1819,1,2,3000,2000,A,1,3000
5,304,230,Spain,8,15,1817,1,1,1818,0,1,1000,1000,A,0,1000
6,305,200,United Kingdom,10,-9,1817,11,26,1818,0,1,1000,10000,A,0,1000
7,306,200,United Kingdom,11,6,1817,6,3,1818,0,1,2800,2000,A,0,2800
8,307,640,Ottoman Empire,-9,-9,1820,6,-9,1821,1,1,4000,2500,A,1,4000
9,308,230,Spain,4,28,1821,5,24,1822,0,2,1000,500,A,0,1000


In [146]:
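dfExtraWarPar = dfExtraWarPar.replace(-8, '')
# Concatenate the six date fields; a row whose dates were all -8 (now blank) yields an empty string
dfExtraWarPar['datesconcat'] = (dfExtraWarPar['StartMonth'].map(str) + dfExtraWarPar['StartDay'].map(str)
                                + dfExtraWarPar['StartYear'].map(str) + dfExtraWarPar['EndMonth'].map(str)
                                + dfExtraWarPar['EndDay'].map(str) + dfExtraWarPar['EndYear'].map(str))

# Drop rows with no dates at all
missingdate = ''
dfExtraWarPar = dfExtraWarPar[dfExtraWarPar.datesconcat != missingdate]

# Drop rows with no participant; PolityName holds the string '-8', which the integer replace above does not touch
missingpolity = '-8'
dfExtraWarPar = dfExtraWarPar[dfExtraWarPar.PolityName != missingpolity]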

dfExtraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Initiator,Outcome,BatDeath,NonStateDeaths,Side,IsInitiator,Deaths,datesconcat
0,300,210,Netherlands,8,26,1816,8,30,1816,1,1,13,,A,1,13,82618168301816
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,1,129,6000,A,1,129,82618168301816
2,301,640,Ottoman Empire,9,-9,1816,9,11,1818,1,1,13500,14000,A,1,13500,9-918169111818
3,302,230,Spain,1,9,1817,4,5,1818,0,2,1700,1140,A,0,1700,191817451818
4,303,230,Spain,4,11,1817,8,10,1819,1,2,3000,2000,A,1,3000,41118178101819
5,304,230,Spain,8,15,1817,1,1,1818,0,1,1000,1000,A,0,1000,8151817111818
6,305,200,United Kingdom,10,-9,1817,11,26,1818,0,1,1000,10000,A,0,1000,10-9181711261818
7,306,200,United Kingdom,11,6,1817,6,3,1818,0,1,2800,2000,A,0,2800,1161817631818
8,307,640,Ottoman Empire,-9,-9,1820,6,-9,1821,1,1,4000,2500,A,1,4000,-9-918206-91821
9,308,230,Spain,4,28,1821,5,24,1822,0,2,1000,500,A,0,1000,42818215241822


In [147]:
dfExtraWarPar = dfExtraWarPar.drop(columns=['datesconcat', 'Initiator', 'BatDeath', 'NonStateDeaths'])
dfExtraWarPar = dfExtraWarPar.replace(-9, '')
dfExtraWarPar = dfExtraWarPar.replace(-7, '')
dfExtraWarPar

Unnamed: 0,WarID,PolityID,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Outcome,Side,IsInitiator,Deaths
0,300,210,Netherlands,8,26,1816,8,30,1816,1,A,1,13
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,A,1,129
2,301,640,Ottoman Empire,9,,1816,9,11,1818,1,A,1,13500
3,302,230,Spain,1,9,1817,4,5,1818,2,A,0,1700
4,303,230,Spain,4,11,1817,8,10,1819,2,A,1,3000
5,304,230,Spain,8,15,1817,1,1,1818,1,A,0,1000
6,305,200,United Kingdom,10,,1817,11,26,1818,1,A,0,1000
7,306,200,United Kingdom,11,6,1817,6,3,1818,1,A,0,2800
8,307,640,Ottoman Empire,,,1820,6,,1821,1,A,1,4000
9,308,230,Spain,4,28,1821,5,24,1822,2,A,0,1000
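
The cells above blank out the -9, -8 and -7 codes with empty strings, and a column that mixes integers with empty strings ends up with object dtype. As an alternative sketch, and not what this notebook does, the codes can be read as missing values at load time and kept in pandas' nullable integer dtype. The CSV text and column selection below are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV snippet using the -9 / -8 / -7 missing-value codes.
raw = io.StringIO(
    "WarID,StartMonth,StartDay,Deaths\n"
    "1,8,26,13\n"
    "2,9,-9,-8\n"
    "3,-9,-9,-7\n"
)

# Treat the sentinel codes as missing while parsing.
df = pd.read_csv(raw, na_values=[-9, -8, -7])

# Nullable integers keep whole numbers whole while still allowing <NA>.
df = df.astype({'StartMonth': 'Int64', 'StartDay': 'Int64', 'Deaths': 'Int64'})
print(df)
print(df.dtypes)
```

A caveat with this approach: `na_values` applies to every column, so it is only safe when none of the columns can legitimately contain -9, -8 or -7.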


In [148]:
dfExtraWarPar['PolityName'] = dfExtraWarPar['PolityName'].str.strip()
dfExtraWarPar = dfExtraWarPar.merge(dfPolities[['PolityID', 'PolityName']], on='PolityName', how='left', suffixes=('_orig', '_new'),)
dfExtraWarPar

Unnamed: 0,WarID,PolityID_orig,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Outcome,Side,IsInitiator,Deaths,PolityID_new
0,300,210,Netherlands,8,26,1816,8,30,1816,1,A,1,13,210.0
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,A,1,129,200.0
2,301,640,Ottoman Empire,9,,1816,9,11,1818,1,A,1,13500,
3,302,230,Spain,1,9,1817,4,5,1818,2,A,0,1700,230.0
4,303,230,Spain,4,11,1817,8,10,1819,2,A,1,3000,230.0
5,304,230,Spain,8,15,1817,1,1,1818,1,A,0,1000,230.0
6,305,200,United Kingdom,10,,1817,11,26,1818,1,A,0,1000,200.0
7,306,200,United Kingdom,11,6,1817,6,3,1818,1,A,0,2800,200.0
8,307,640,Ottoman Empire,,,1820,6,,1821,1,A,1,4000,
9,308,230,Spain,4,28,1821,5,24,1822,2,A,0,1000,230.0


In [149]:
# Start from the source PolityID and use the looked-up one only where the source value is blank
dfExtraWarPar['PolityID'] = dfExtraWarPar['PolityID_orig']
dfExtraWarPar.loc[dfExtraWarPar['PolityID_orig'] == '', 'PolityID'] = dfExtraWarPar['PolityID_new']
dfExtraWarPar


Unnamed: 0,WarID,PolityID_orig,PolityName,StartMonth,StartDay,StartYear,EndMonth,EndDay,EndYear,Outcome,Side,IsInitiator,Deaths,PolityID_new,PolityID
0,300,210,Netherlands,8,26,1816,8,30,1816,1,A,1,13,210.0,210
1,300,200,United Kingdom,8,26,1816,8,30,1816,1,A,1,129,200.0,200
2,301,640,Ottoman Empire,9,,1816,9,11,1818,1,A,1,13500,,640
3,302,230,Spain,1,9,1817,4,5,1818,2,A,0,1700,230.0,230
4,303,230,Spain,4,11,1817,8,10,1819,2,A,1,3000,230.0,230
5,304,230,Spain,8,15,1817,1,1,1818,1,A,0,1000,230.0,230
6,305,200,United Kingdom,10,,1817,11,26,1818,1,A,0,1000,200.0,200
7,306,200,United Kingdom,11,6,1817,6,3,1818,1,A,0,2800,200.0,200
8,307,640,Ottoman Empire,,,1820,6,,1821,1,A,1,4000,,640
9,308,230,Spain,4,28,1821,5,24,1822,2,A,0,1000,230.0,230


In [150]:
dfExtraWarPar = dfExtraWarPar[['WarID', 'PolityID', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Side', 'IsInitiator', 'Outcome', 'Deaths']]
dfExtraWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,300,210,1816,8,26,1816,8,30,A,1,1,13
1,300,200,1816,8,26,1816,8,30,A,1,1,129
2,301,640,1816,9,,1818,9,11,A,1,1,13500
3,302,230,1817,1,9,1818,4,5,A,0,2,1700
4,303,230,1817,4,11,1819,8,10,A,1,2,3000
5,304,230,1817,8,15,1818,1,1,A,0,1,1000
6,305,200,1817,10,,1818,11,26,A,0,1,1000
7,306,200,1817,11,6,1818,6,3,A,0,1,2800
8,307,640,1820,,,1821,6,,A,1,1,4000
9,308,230,1821,4,28,1822,5,24,A,0,2,1000


### Combine all war types

In [151]:
combinedWarPar = [dfInterWarPar, dfIntraWarPar, dfNonWarPar, dfExtraWarPar]
dfWarPar = pd.concat(combinedWarPar).sort_values(['WarID', 'Side', 'StartYear']).reset_index(drop=True)
dfWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,1,220,1823,4,7,1823,11,13,A,1,1,400
1,1,230,1823,4,7,1823,11,13,B,0,2,600
2,4,365,1828,4,26,1829,9,14,A,1,1,50000
3,4,640,1828,4,26,1829,9,14,B,0,2,80000
4,7,2,1846,4,25,1847,9,14,A,1,1,13283
5,7,70,1846,4,25,1847,9,14,B,0,2,6000
6,10,300,1848,3,24,1848,8,9,A,0,1,3927
7,10,300,1849,3,12,1849,3,30,A,0,1,3927
8,10,337,1848,3,29,1848,8,9,B,0,2,100
9,10,325,1848,3,24,1848,8,9,B,1,2,3400
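
Before concatenating the per-type frames, it can help to confirm that they all expose the same columns, because `pd.concat` silently fills NaN for any column a frame lacks. A minimal sketch under made-up data follows. The shortened schema `expected_cols` and the three toy frames are hypothetical stand-ins for the real participant frames.

```python
import pandas as pd

expected_cols = ['WarID', 'PolityID', 'Side']  # shortened, hypothetical schema

# Hypothetical stand-ins for the per-type participant frames.
frames = {
    'inter': pd.DataFrame({'WarID': [1], 'PolityID': [220], 'Side': ['A']}),
    'intra': pd.DataFrame({'WarID': [500], 'PolityID': [365.0], 'Side': ['A']}),
    'extra': pd.DataFrame({'WarID': [300], 'PolityID_orig': [210], 'Side': ['A']}),
}

# Columns that are missing from, or unexpected in, each frame.
problems = {
    name: sorted(set(expected_cols) ^ set(df.columns))
    for name, df in frames.items()
    if set(df.columns) != set(expected_cols)
}
print(problems)

# Concatenate only the frames that match the schema.
good = [df[expected_cols] for name, df in frames.items() if name not in problems]
combined = pd.concat(good).sort_values(['WarID', 'Side']).reset_index(drop=True)
print(combined)
```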


In [152]:
dfWarPar = dfWarPar.replace(-9, '')
dfWarPar = dfWarPar.replace(-8, '')
dfWarPar = dfWarPar.replace(-7, '')
dfWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,1,220.0,1823,4,7,1823,11,13,A,1,1,400
1,1,230.0,1823,4,7,1823,11,13,B,0,2,600
2,4,365.0,1828,4,26,1829,9,14,A,1,1,50000
3,4,640.0,1828,4,26,1829,9,14,B,0,2,80000
4,7,2.0,1846,4,25,1847,9,14,A,1,1,13283
5,7,70.0,1846,4,25,1847,9,14,B,0,2,6000
6,10,300.0,1848,3,24,1848,8,9,A,0,1,3927
7,10,300.0,1849,3,12,1849,3,30,A,0,1,3927
8,10,337.0,1848,3,29,1848,8,9,B,0,2,100
9,10,325.0,1848,3,24,1848,8,9,B,1,2,3400


In [153]:
# Fill blank months/days with 1 so that a full date can be built; the original columns are kept as-is
for col in ['StartMonth', 'StartDay', 'EndMonth', 'EndDay']:
    dfWarPar[col + 'Clean'] = dfWarPar[col].where(dfWarPar[col] != '', 1)
# 2100 is a placeholder for blank end years, to be made null later
dfWarPar['EndYearClean'] = dfWarPar['EndYear'].where(dfWarPar['EndYear'] != '', 2100)
dfWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths,StartMonthClean,StartDayClean,EndMonthClean,EndDayClean,EndYearClean
0,1,220.0,1823,4,7,1823,11,13,A,1,1,400,4,7,11,13,1823
1,1,230.0,1823,4,7,1823,11,13,B,0,2,600,4,7,11,13,1823
2,4,365.0,1828,4,26,1829,9,14,A,1,1,50000,4,26,9,14,1829
3,4,640.0,1828,4,26,1829,9,14,B,0,2,80000,4,26,9,14,1829
4,7,2.0,1846,4,25,1847,9,14,A,1,1,13283,4,25,9,14,1847
5,7,70.0,1846,4,25,1847,9,14,B,0,2,6000,4,25,9,14,1847
6,10,300.0,1848,3,24,1848,8,9,A,0,1,3927,3,24,8,9,1848
7,10,300.0,1849,3,12,1849,3,30,A,0,1,3927,3,12,3,30,1849
8,10,337.0,1848,3,29,1848,8,9,B,0,2,100,3,29,8,9,1848
9,10,325.0,1848,3,24,1848,8,9,B,1,2,3400,3,24,8,9,1848


In [154]:
dfWarPar = dfWarPar.astype({'StartMonthClean':int, 'StartDayClean':int, 'EndMonthClean':int, 'EndDayClean':int, 'EndYearClean':int})
dfWarPar.dtypes

WarID                int64
PolityID           float64
StartYear            int64
StartMonth          object
StartDay            object
EndYear             object
EndMonth            object
EndDay              object
Side                object
IsInitiator          int64
Outcome              int64
Deaths              object
StartMonthClean      int64
StartDayClean        int64
EndMonthClean        int64
EndDayClean          int64
EndYearClean         int64
dtype: object

In [155]:
dfWarPar['StartDate'] = pd.to_datetime(dict(year=dfWarPar.StartYear, month=dfWarPar.StartMonthClean, day=dfWarPar.StartDayClean))
dfWarPar['EndDate'] = pd.to_datetime(dict(year=dfWarPar.EndYearClean, month=dfWarPar.EndMonthClean, day=dfWarPar.EndDayClean))
dfWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths,StartMonthClean,StartDayClean,EndMonthClean,EndDayClean,EndYearClean,StartDate,EndDate
0,1,220.0,1823,4,7,1823,11,13,A,1,1,400,4,7,11,13,1823,1823-04-07,1823-11-13
1,1,230.0,1823,4,7,1823,11,13,B,0,2,600,4,7,11,13,1823,1823-04-07,1823-11-13
2,4,365.0,1828,4,26,1829,9,14,A,1,1,50000,4,26,9,14,1829,1828-04-26,1829-09-14
3,4,640.0,1828,4,26,1829,9,14,B,0,2,80000,4,26,9,14,1829,1828-04-26,1829-09-14
4,7,2.0,1846,4,25,1847,9,14,A,1,1,13283,4,25,9,14,1847,1846-04-25,1847-09-14
5,7,70.0,1846,4,25,1847,9,14,B,0,2,6000,4,25,9,14,1847,1846-04-25,1847-09-14
6,10,300.0,1848,3,24,1848,8,9,A,0,1,3927,3,24,8,9,1848,1848-03-24,1848-08-09
7,10,300.0,1849,3,12,1849,3,30,A,0,1,3927,3,12,3,30,1849,1849-03-12,1849-03-30
8,10,337.0,1848,3,29,1848,8,9,B,0,2,100,3,29,8,9,1848,1848-03-29,1848-08-09
9,10,325.0,1848,3,24,1848,8,9,B,1,2,3400,3,24,8,9,1848,1848-03-24,1848-08-09


In [156]:
# Format the datetime columns as ISO date strings (vectorized equivalent of apply + strftime)
dfWarPar['StartDate'] = dfWarPar['StartDate'].dt.strftime('%Y-%m-%d')
dfWarPar['EndDate'] = dfWarPar['EndDate'].dt.strftime('%Y-%m-%d')
dfWarPar

Unnamed: 0,WarID,PolityID,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths,StartMonthClean,StartDayClean,EndMonthClean,EndDayClean,EndYearClean,StartDate,EndDate
0,1,220.0,1823,4,7,1823,11,13,A,1,1,400,4,7,11,13,1823,1823-04-07,1823-11-13
1,1,230.0,1823,4,7,1823,11,13,B,0,2,600,4,7,11,13,1823,1823-04-07,1823-11-13
2,4,365.0,1828,4,26,1829,9,14,A,1,1,50000,4,26,9,14,1829,1828-04-26,1829-09-14
3,4,640.0,1828,4,26,1829,9,14,B,0,2,80000,4,26,9,14,1829,1828-04-26,1829-09-14
4,7,2.0,1846,4,25,1847,9,14,A,1,1,13283,4,25,9,14,1847,1846-04-25,1847-09-14
5,7,70.0,1846,4,25,1847,9,14,B,0,2,6000,4,25,9,14,1847,1846-04-25,1847-09-14
6,10,300.0,1848,3,24,1848,8,9,A,0,1,3927,3,24,8,9,1848,1848-03-24,1848-08-09
7,10,300.0,1849,3,12,1849,3,30,A,0,1,3927,3,12,3,30,1849,1849-03-12,1849-03-30
8,10,337.0,1848,3,29,1848,8,9,B,0,2,100,3,29,8,9,1848,1848-03-29,1848-08-09
9,10,325.0,1848,3,24,1848,8,9,B,1,2,3400,3,24,8,9,1848,1848-03-24,1848-08-09


In [157]:
dfWarPar = dfWarPar[['WarID', 'PolityID', 'StartDate', 'EndDate', 'StartYear', 'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Side', 'IsInitiator', 'Outcome', 'Deaths']]
dfWarPar

Unnamed: 0,WarID,PolityID,StartDate,EndDate,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,1,220.0,1823-04-07,1823-11-13,1823,4,7,1823,11,13,A,1,1,400
1,1,230.0,1823-04-07,1823-11-13,1823,4,7,1823,11,13,B,0,2,600
2,4,365.0,1828-04-26,1829-09-14,1828,4,26,1829,9,14,A,1,1,50000
3,4,640.0,1828-04-26,1829-09-14,1828,4,26,1829,9,14,B,0,2,80000
4,7,2.0,1846-04-25,1847-09-14,1846,4,25,1847,9,14,A,1,1,13283
5,7,70.0,1846-04-25,1847-09-14,1846,4,25,1847,9,14,B,0,2,6000
6,10,300.0,1848-03-24,1848-08-09,1848,3,24,1848,8,9,A,0,1,3927
7,10,300.0,1849-03-12,1849-03-30,1849,3,12,1849,3,30,A,0,1,3927
8,10,337.0,1848-03-29,1848-08-09,1848,3,29,1848,8,9,B,0,2,100
9,10,325.0,1848-03-24,1848-08-09,1848,3,24,1848,8,9,B,1,2,3400
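
The cells above fill blank end years with the placeholder 2100 and note that those dates should become null later. One way that last step could look is sketched below. The `sample` frame and the names `parts`, `year_known` and `end_date` are made up for illustration and are not part of this notebook; the idea is to remember which rows had a real year before filling in placeholders, then blank out the assembled date for the rest.

```python
import pandas as pd

# Hypothetical sample: blank strings mark unknown date parts, as in dfWarPar.
sample = pd.DataFrame({
    'EndYear': ['1823', '', '1979'],
    'EndMonth': ['11', '', '5'],
    'EndDay': ['13', '', ''],
})

# errors='coerce' turns blanks (and anything else non-numeric) into NaN.
parts = sample.apply(pd.to_numeric, errors='coerce')

# Remember which rows have a real year before any placeholder is filled in.
year_known = parts['EndYear'].notna()

# Fill placeholders only so the date can be assembled from integer parts.
end_date = pd.to_datetime(dict(
    year=parts['EndYear'].fillna(2100).astype(int),
    month=parts['EndMonth'].fillna(1).astype(int),
    day=parts['EndDay'].fillna(1).astype(int),
))

# Rows without a known year become NaT instead of keeping the 2100 placeholder.
sample['EndDate'] = end_date.where(year_known)
```

Unknown months and days still default to 1 here, matching the choice made in the cells above; only a missing year nulls the whole date.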


## Duplicate issue

For some reason, there are 4 duplicate rows:

In [158]:
duplicates = dfWarPar.duplicated(['WarID', 'PolityID', 'StartDate'])
dfWarPar[duplicates]

Unnamed: 0,WarID,PolityID,StartDate,EndDate,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
696,475,816.0,1979-01-09,1989-09-25,1979,1,9.0,1989,9,25.0,A,1,4,25300.0
1245,803,816.0,1976-01-01,1979-05-01,1976,1,,1979,5,,A,1,1,1000.0
1382,871,372.0,1991-12-26,1992-03-01,1991,12,26.0,1992,3,,A,0,2,
1410,882,372.0,1993-08-18,1994-04-14,1993,8,18.0,1994,4,14.0,A,1,2,5000.0


In [159]:
duplicates2 = dfWarPar.duplicated()
dfWarPar[duplicates2]

Unnamed: 0,WarID,PolityID,StartDate,EndDate,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
696,475,816.0,1979-01-09,1989-09-25,1979,1,9.0,1989,9,25.0,A,1,4,25300.0
1245,803,816.0,1976-01-01,1979-05-01,1976,1,,1979,5,,A,1,1,1000.0
1382,871,372.0,1991-12-26,1992-03-01,1991,12,26.0,1992,3,,A,0,2,
1410,882,372.0,1993-08-18,1994-04-14,1993,8,18.0,1994,4,14.0,A,1,2,5000.0


In [160]:
dfWarPar[dfWarPar['WarID'] == 803]

Unnamed: 0,WarID,PolityID,StartDate,EndDate,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
1243,803,812.0,1976-04-01,1979-05-01,1976,4,,1979,5,,A,0,1,1000
1244,803,816.0,1976-01-01,1979-05-01,1976,1,,1979,5,,A,1,1,1000
1245,803,816.0,1976-01-01,1979-05-01,1976,1,,1979,5,,A,1,1,1000
1246,803,10271.0,1976-04-01,1979-05-01,1976,4,,1979,5,,B,1,2,5000


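I can't figure out why these rows are duplicated. Calling `duplicated()` with no column subset flags the same four rows, so they are exact copies across every column and dropping them loses no information. There were previously 1671 rows; after removal there will be 1667.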
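
As a rough sketch of that check, the snippet below compares key-level duplicates with full-row duplicates on a tiny frame. The values are copied from the War 803 rows shown above but reduced to four columns; `sample`, `key`, `key_dupes` and `full_dupes` are illustrative names, not variables from this notebook. Passing `keep=False` marks every member of a duplicate group, which makes the copies easier to inspect side by side.

```python
import pandas as pd

# Reduced copy of the War 803 participant rows (illustrative subset of columns).
sample = pd.DataFrame({
    'WarID': [803, 803, 803, 803],
    'PolityID': [812.0, 816.0, 816.0, 10271.0],
    'StartDate': ['1976-04-01', '1976-01-01', '1976-01-01', '1976-04-01'],
    'Deaths': [1000, 1000, 1000, 5000],
})

key = ['WarID', 'PolityID', 'StartDate']

# keep=False flags every row in a duplicate group, not just the later copies.
key_dupes = sample[sample.duplicated(key, keep=False)]
full_dupes = sample[sample.duplicated(keep=False)]

# If both selections match, the key duplicates are identical in every column.
same = key_dupes.equals(full_dupes)

# Dropping exact copies keeps one row per group.
deduped = sample.drop_duplicates()
```

If `same` were False, some rows would share a key but differ elsewhere, and a plain `drop_duplicates()` would not remove them.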

In [161]:
dfWarPar = dfWarPar.drop_duplicates()
dfWarPar

Unnamed: 0,WarID,PolityID,StartDate,EndDate,StartYear,StartMonth,StartDay,EndYear,EndMonth,EndDay,Side,IsInitiator,Outcome,Deaths
0,1,220.0,1823-04-07,1823-11-13,1823,4,7,1823,11,13,A,1,1,400
1,1,230.0,1823-04-07,1823-11-13,1823,4,7,1823,11,13,B,0,2,600
2,4,365.0,1828-04-26,1829-09-14,1828,4,26,1829,9,14,A,1,1,50000
3,4,640.0,1828-04-26,1829-09-14,1828,4,26,1829,9,14,B,0,2,80000
4,7,2.0,1846-04-25,1847-09-14,1846,4,25,1847,9,14,A,1,1,13283
5,7,70.0,1846-04-25,1847-09-14,1846,4,25,1847,9,14,B,0,2,6000
6,10,300.0,1848-03-24,1848-08-09,1848,3,24,1848,8,9,A,0,1,3927
7,10,300.0,1849-03-12,1849-03-30,1849,3,12,1849,3,30,A,0,1,3927
8,10,337.0,1848-03-29,1848-08-09,1848,3,29,1848,8,9,B,0,2,100
9,10,325.0,1848-03-24,1848-08-09,1848,3,24,1848,8,9,B,1,2,3400


In [173]:
dfWarPar.to_csv('../FinalData/war_participants.csv', encoding='utf-8', index=False)

## Create WAR_LOCATION table

Task: transform the following csv files into one table:

- Inter-StateWarData_v4.0.csv (note: already saved as 'dfInterStateWar')
- Intra-StateWarData_v4.1.csv (note: already saved as 'dfIntraStateWar')
- Non-StateWarData_v4.0.csv (note: already saved as 'dfNonStateWar')
- Extra-StateWarData_v4.0.csv (note: already saved as 'dfExtraStateWar')

with the following attributes:

- WarID
- Region

In [79]:
dfNonStateWar.columns

Index(['WarNum', 'WarName', 'WarType', 'WhereFought', 'SideA1', 'SideA2',
       'SideB1', 'SideB2', 'SideB3', 'SideB4', 'SideB5', 'StartYear',
       'StartMonth', 'StartDay', 'EndYear', 'EndMonth', 'EndDay', 'Initiator',
       'TransFrom', 'TransTo', 'Outcome', 'SideADeaths', 'SideBDeaths',
       'TotalCombatDeaths', 'Version'],
      dtype='object')

In [182]:
dfInterWarLocs = dfInterStateWar[['WarNum', 'WhereFought']]
dfIntraWarLocs = dfIntraStateWar[['WarNum', 'WhereFought']]
dfExtraWarLocs = dfExtraStateWar[['WarNum', 'WhereFought']]
dfNonWarLocs = dfNonStateWar[['WarNum', 'WhereFought']]

AllWarLocs = [dfInterWarLocs, dfIntraWarLocs, dfExtraWarLocs, dfNonWarLocs]
dfWarLocs = pd.concat(AllWarLocs).reset_index(drop=True)
dfWarLocs

Unnamed: 0,WarNum,WhereFought
0,1,2
1,1,2
2,4,11
3,4,11
4,7,1
5,7,1
6,10,2
7,10,2
8,10,2
9,10,2


In [183]:
dfWarLocs.drop_duplicates(inplace=True)
dfWarLocs

Unnamed: 0,WarNum,WhereFought
0,1,2
2,4,11
4,7,1
6,10,2
10,13,2
12,16,2
16,19,1
18,22,2
23,25,6
25,28,2


In [184]:
dfWarLocs['WhereFought'].value_counts()

7     185
4     132
1     122
6     112
2      97
11      5
9       5
15      2
14      2
19      1
18      1
17      1
16      1
13      1
12      1
Name: WhereFought, dtype: int64

In [185]:
dfWarLocs.rename(columns={'WarNum':'WarID', 'WhereFought':'Region'}, inplace=True)

In [186]:
# Map each WhereFought code to its region name(s) in one pass.
# This avoids chained indexing (df[col][mask] = value), which assigns to a temporary copy.
# .replace leaves any code that is not listed here unchanged.
region_names = {
    1: 'W. Hemisphere',
    2: 'Europe',
    4: 'Africa',
    6: 'Middle East',
    7: 'Asia',
    9: 'Oceania',
    11: 'Europe, Middle East',
    12: 'Europe, Asia',
    13: 'W. Hemisphere, Asia',
    14: 'Europe, Africa, Middle East',
    15: 'Europe, Africa, Middle East, Asia',
    16: 'Africa, Middle East, Asia, Oceania',
    17: 'Asia, Oceania',
    18: 'Africa, Middle East',
    19: 'Europe, Africa, Middle East, Asia, Oceania',
}
dfWarLocs['Region'] = dfWarLocs['Region'].replace(region_names)

dfWarLocs

Unnamed: 0,WarID,Region
0,1,Europe
2,4,"Europe, Middle East"
4,7,W. Hemisphere
6,10,Europe
10,13,Europe
12,16,Europe
16,19,W. Hemisphere
18,22,Europe
23,25,Middle East
25,28,Europe
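
Before translating codes to names, it can help to confirm that every code in the data has a name. The sketch below shows one way to do that. The `codes` series and the code 99 are invented for illustration, and `known_codes` lists only the codes that appear in the `value_counts()` output above.

```python
import pandas as pd

# Codes seen in the value_counts() output above.
known_codes = [1, 2, 4, 6, 7, 9, 11, 12, 13, 14, 15, 16, 17, 18, 19]

# Hypothetical column of WhereFought codes; 99 is an invented, unknown code.
codes = pd.Series([2, 11, 1, 2, 6, 99], name='WhereFought')

# Any code outside the known list would be left untranslated by the mapping step.
unmapped = codes[~codes.isin(known_codes)].unique().tolist()
```

An empty `unmapped` list means the translation covers every row.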


In [188]:
regionlist = dfWarLocs['Region'].str.split(', ', n=4, expand=True)
dfWarLocs['Region1'] = regionlist[0]
dfWarLocs['Region2'] = regionlist[1]
dfWarLocs['Region3'] = regionlist[2]
dfWarLocs['Region4'] = regionlist[3]
dfWarLocs['Region5'] = regionlist[4]
dfWarLocs

Unnamed: 0,WarID,Region,Region1,Region2,Region3,Region4,Region5
0,1,Europe,Europe,,,,
2,4,"Europe, Middle East",Europe,Middle East,,,
4,7,W. Hemisphere,W. Hemisphere,,,,
6,10,Europe,Europe,,,,
10,13,Europe,Europe,,,,
12,16,Europe,Europe,,,,
16,19,W. Hemisphere,W. Hemisphere,,,,
18,22,Europe,Europe,,,,
23,25,Middle East,Middle East,,,,
25,28,Europe,Europe,,,,


In [191]:
# Build one (WarID, Region) frame per split column.
# rename() returns a new frame here, so nothing is modified in place on a slice
# (the in-place version raised SettingWithCopyWarning).
regiondfs = [
    dfWarLocs[['WarID', col]].rename(columns={col: 'Region'})
    for col in ['Region1', 'Region2', 'Region3', 'Region4', 'Region5']
]
dfWarAllLocs = pd.concat(regiondfs).sort_values(['WarID']).reset_index(drop=True)

# Wars fought in fewer than five regions leave empty Region slots; drop them
dfWarAllLocs = dfWarAllLocs.dropna()
dfWarAllLocs


Unnamed: 0,WarID,Region
0,1,Europe
7,4,Europe
9,4,Middle East
14,7,W. Hemisphere
18,10,Europe
24,13,Europe
29,16,Europe
34,19,W. Hemisphere
39,22,Europe
42,25,Middle East
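
For comparison, the same one-row-per-region reshaping can be sketched with `str.split` followed by `DataFrame.explode` (available since pandas 0.25). This is an alternative to the Region1 to Region5 approach used above, not what the notebook ran. `locs` and `long_locs` are illustrative names, and the three rows are taken from the output shown earlier.

```python
import pandas as pd

# Small sample in the same shape as dfWarLocs after the code-to-name mapping.
locs = pd.DataFrame({
    'WarID': [1, 4, 7],
    'Region': ['Europe', 'Europe, Middle East', 'W. Hemisphere'],
})

# Split each Region string into a list, then give every list element its own row.
long_locs = (
    locs.assign(Region=locs['Region'].str.split(', '))
        .explode('Region')
        .reset_index(drop=True)
)
```

Because `explode` emits only the regions that exist, there are no empty slots to drop afterwards, and the number of regions per war does not need to be known in advance.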


In [192]:
dfWarAllLocs.to_csv('../FinalData/war_locations.csv', encoding='utf-8', index=False)