# Wrangle War Data

### Input Datasets

- `Data/Raw/Inter-StateWarData_v4.0.csv`
- `Data/Raw/INTRA-STATE WARS v5.1 CSV.csv`
- `Data/Raw/INTRA-STATE_State_participants v5.1 CSV.csv`
- `Data/Raw/Extra-StateWarData_v4.0.csv`
- `Data/Raw/Non-StateWarData_v4.0.csv`
- `Data/polity.csv`

### Output Datasets

- `Data/war.csv`
- `Data/war_locations.csv`
- `Data/war_participants.csv`
- `Data/war_transitions.csv`

In [1]:
import pandas as pd
import numpy as np

In [2]:
dfInterStateWar = pd.read_csv("../Data/Raw/Inter-StateWarData_v4.0.csv", encoding='utf-8', na_values=[-7, -8, -9], dtype={"WarNum": str})
dfIntraStateWar = pd.read_csv("../Data/Raw/INTRA-STATE WARS v5.1 CSV.csv", encoding='latin-1', na_values=[-7, -8, -9], dtype={"WarNum": str, "Intnl": bool})
dfIntraStateWarPar = pd.read_csv("../Data/Raw/INTRA-STATE_State_participants v5.1 CSV.csv", encoding='latin-1', na_values=[-7, -8, -9], dtype={"WarNum": str})
dfExtraStateWar = pd.read_csv("../Data/Raw/Extra-StateWarData_v4.0.csv", encoding='latin-1', na_values=[-7, -8, -9], dtype={"WarNum": str, "Interven": bool})
dfNonStateWar = pd.read_csv("../Data/Raw/Non-StateWarData_v4.0.csv", encoding='utf-8', na_values=[-7, -8, -9], dtype={"WarNum": str})

dfPolities = pd.read_csv("../Data/polity.csv", encoding='utf-8')
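
To illustrate the `read_csv` options used above, here is a minimal sketch that parses a small in-memory CSV. The rows are invented for the example and are not taken from the COW files; only the column names mirror them. It shows `na_values` turning the sentinel codes -7, -8 and -9 into `NaN`, and `dtype={"WarNum": str}` keeping ids as text.

```python
import io
import pandas as pd

# Invented sample rows; the column names mirror the COW files, the values do not.
sample_csv = io.StringIO(
    "WarNum,WarName,BatDeath\n"
    "500,Example War A,-9\n"
    "502.1,Example War B,1200\n"
    "503,Example War C,-8\n"
)

# Same options as the notebook: sentinel codes become NaN, ids stay strings.
df_sample = pd.read_csv(sample_csv, na_values=[-7, -8, -9], dtype={"WarNum": str})

print(df_sample)
print(df_sample.dtypes)
```

Note that `na_values` applies to every column, so a legitimate value of -9 in any numeric column would also be read as missing.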

## Create "war" table

Table creation statement

```
class War(Base):
    __tablename__ = "war"

    id = Column(String(5), primary_key=True)
    name = Column(Text)
    type_code = Column(Integer)
    type_name = Column(Text)
    subtype_name = Column(Text)
    is_intervention = Column(Boolean)
    is_international = Column(Boolean)
```

In [3]:
dfWars = pd.concat([
    dfInterStateWar[['WarNum', 'WarName', 'WarType']],
    dfIntraStateWar[['WarNum', 'WarName', 'WarType', 'Intnl']],
    dfExtraStateWar[['WarNum', 'WarName', 'WarType', 'Interven']],
    dfNonStateWar[['WarNum', 'WarName', 'WarType']]
]).drop_duplicates(ignore_index=True).rename(columns={'WarNum':'id', 
                                                      'WarName': 'name', 
                                                      'WarType': 'type_code', 
                                                      'Intnl':'is_international', 
                                                      'Interven': 'is_intervention'})
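
The following sketch shows the same concat, de-duplicate and rename pattern on invented data. The frame names `df_inter` and `df_intra` and all rows are hypothetical. It illustrates two things that are easy to miss: participant-level rows collapse to one row per war, and a column that exists in only one input (`Intnl` here) is filled with `NaN` for the others.

```python
import pandas as pd

# Invented participant-level rows: one row per state, so wars repeat.
df_inter = pd.DataFrame({
    "WarNum": ["1", "1", "4"],
    "WarName": ["War A", "War A", "War B"],
    "WarType": [1, 1, 1],
})
# Invented war-level rows with an extra flag column.
df_intra = pd.DataFrame({
    "WarNum": ["500", "502.1"],
    "WarName": ["War C", "War D"],
    "WarType": [4, 5],
    "Intnl": [False, True],
})

df_wars = (
    pd.concat([df_inter, df_intra])
    .drop_duplicates(ignore_index=True)
    .rename(columns={"WarNum": "id", "WarName": "name",
                     "WarType": "type_code", "Intnl": "is_international"})
)
print(df_wars)
```

Because the flag column mixes booleans with `NaN`, pandas cannot store it as a plain `bool` column, which matches the `object` dtype reported for `is_intervention` and `is_international` further down.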

In [4]:
dfWars['type_name'] = dfWars['type_code'].map({1: "Inter-State", 
                                               2: "Extra-State", 
                                               3: "Extra-State", 
                                               4: "Intra-State", 
                                               5: "Intra-State", 
                                               6: "Intra-State", 
                                               7: "Intra-State",
                                               8: "Non-State",
                                               9: "Non-State"})
dfWars['subtype_name'] = dfWars['type_code'].map({1: "Inter-State",
                                                  2: "Colonial War",
                                                  3: "Imperial War",
                                                  4: "Civil war for central control",
                                                  5: "Civil war over local issues",
                                                  6: "Regional internal",
                                                  7: "Intercommunal",
                                                  8: "occur in non-state territory",
                                                  9: "occur across state borders"})
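
Since `Series.map` silently returns `NaN` for any code missing from the dictionary, a quick sanity check can help. This is a minimal sketch with a shortened mapping and invented codes; the code 99 is made up to trigger the check.

```python
import pandas as pd

# Shortened, illustrative mapping (not the full one used above).
type_names = {1: "Inter-State", 2: "Extra-State", 4: "Intra-State"}

# Invented codes; 99 is deliberately missing from the mapping.
codes = pd.Series([1, 4, 2, 99], name="type_code")

mapped = codes.map(type_names)

# Codes that did not find a label end up as NaN after map().
unmapped_codes = sorted(codes[mapped.isna()].unique())
print(unmapped_codes)
```

An empty list here would mean every war received a `type_name`.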

Open question: what is the difference between type 7 (Intra-State, Intercommunal: wars between two or more non-state actors) and type 8 (Non-State: occur in non-state territory)? Need to ask an expert!

In [5]:
dfWars = dfWars[["id", "name", "type_code", "type_name", "subtype_name", "is_intervention", "is_international"]]

In [6]:
dfWars.dtypes

id                  object
name                object
type_code            int64
type_name           object
subtype_name        object
is_intervention     object
is_international    object
dtype: object

In [7]:
dfWars.id.str.len().max()

5

Note: why are war ids strings now instead of integers? To preserve numerical ordering, the most recent intra-state war dataset (v5.1) inserts new wars "in between" existing ids, which produces ids with decimals (for example `502.1`).
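
A small sketch of what goes wrong without the string dtype, using two invented rows: pandas infers a float column as soon as one id has a decimal, so `500` would no longer match the text `"500"` stored elsewhere.

```python
import io
import pandas as pd

csv_text = "id,name\n500,Example War A\n502.1,Example War B\n"  # invented rows

# Default type inference: the decimal id forces the whole column to float.
as_default = pd.read_csv(io.StringIO(csv_text))

# Explicit string dtype: ids are kept exactly as written in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"id": str})

print(as_default["id"].tolist())
print(as_text["id"].tolist())
```

The same `dtype={"id": str}` argument is needed whenever `war.csv` is read back in.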

In [8]:
dfWars.to_csv("../Data/war.csv", index=False)

## Create "war_locations" table

Table creation statement

```
class War_Locations(Base):
    __tablename__ = "war_locations"

    war = Column(String(5), primary_key=True)
    region = Column(Text, primary_key=True)

    __table_args__ = (ForeignKeyConstraint(["war"], ["war.id"]),)
```
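
Before loading the CSV into a database, the two constraints in this table definition can be checked in pandas. This is a sketch on invented stand-in frames (`df_war`, `df_locations`), not the author's validation step. The war id `999` is made up to show an orphaned row.

```python
import pandas as pd

# Invented stand-ins for the "war" and "war_locations" tables.
df_war = pd.DataFrame({"id": ["1", "4", "502.1"]})
df_locations = pd.DataFrame({
    "war": ["1", "4", "4", "999"],
    "region": ["Europe", "Europe", "Middle East", "Asia"],
})

# Composite primary key: no (war, region) pair may appear twice.
duplicate_keys = df_locations[df_locations.duplicated(["war", "region"])]

# Foreign key: every war in war_locations must exist in war.id.
orphans = df_locations[~df_locations["war"].isin(df_war["id"])]

print(len(duplicate_keys), orphans["war"].tolist())
```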

### Interstate war locations

In [9]:
dfInterLoc = dfInterStateWar[["WarNum", "WhereFought"]].drop_duplicates()

In [10]:
dfInterLoc[dfInterLoc.duplicated(["WarNum"])]

| | WarNum | WhereFought |
|---|---|---|
| 104 | 100 | 2 |
| 112 | 106 | 11 |
| 114 | 106 | 15 |
| 115 | 106 | 7 |
| 117 | 106 | 6 |
| 123 | 106 | 14 |
| 169 | 139 | 19 |
| 170 | 139 | 15 |
| 172 | 139 | 14 |
| 180 | 139 | 16 |


In [11]:
dfInterLoc.dtypes

WarNum         object
WhereFought     int64
dtype: object

In [12]:
dfInterStateWar[dfInterStateWar.WarNum.isin(["100","106","139"])].head()

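| | WarNum | WarName | WarType | ccode | StateName | Side | StartMonth1 | StartDay1 | StartYear1 | EndMonth1 | ... | EndMonth2 | EndDay2 | EndYear2 | TransFrom | WhereFought | Initiator | Outcome | TransTo | BatDeath | Version |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | 100 | First Balkan | 1 | 640 | Turkey | 2 | 10 | 17 | 1912 | 4 | ... | | | | 650.0 | 11 | 2 | 2 | | 30000.0 | 4 |
| 103 | 100 | First Balkan | 1 | 350 | Greece | 1 | 10 | 17 | 1912 | 4 | ... | | | | 650.0 | 11 | 2 | 1 | | 5000.0 | 4 |
| 104 | 100 | First Balkan | 1 | 355 | Bulgaria | 1 | 10 | 17 | 1912 | 12 | ... | 4.0 | 19.0 | 1913.0 | 650.0 | 2 | 2 | 1 | | 32000.0 | 4 |
| 105 | 100 | First Balkan | 1 | 345 | Yugoslavia | 1 | 10 | 17 | 1912 | 12 | ... | 4.0 | 19.0 | 1913.0 | 650.0 | 11 | 1 | 1 | | 15000.0 | 4 |
| 111 | 106 | World War I | 1 | 345 | Yugoslavia | 1 | 7 | 29 | 1914 | 11 | ... | | | | | 2 | 2 | 1 | | 70000.0 | 4 |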


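These three wars have more than one `WhereFought` code across their participant rows. Assign each war a single code, the broadest of those observed:

- War 100 has location 11
- War 106 has location 15
- War 139 has location 19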
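
Wars with conflicting location codes can also be listed together with all of their codes, which makes it easier to pick the override by eye. The sketch below uses invented participant rows; the frame name `df_rows` is hypothetical.

```python
import pandas as pd

# Invented participant-level rows; war "100" carries two different codes.
df_rows = pd.DataFrame({
    "WarNum": ["1", "1", "100", "100", "100"],
    "WhereFought": [2, 2, 11, 2, 11],
})

# Collect the distinct codes seen for each war.
codes_per_war = df_rows.groupby("WarNum")["WhereFought"].agg(lambda s: sorted(s.unique()))

# Keep only wars with more than one distinct code.
conflicting = codes_per_war[codes_per_war.map(len) > 1]
print(conflicting)
```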

In [13]:
dfInterLoc.loc[dfInterLoc['WarNum'] == "100", 'WhereFought'] = 11
dfInterLoc.loc[dfInterLoc['WarNum'] == "106", 'WhereFought'] = 15
dfInterLoc.loc[dfInterLoc['WarNum'] == "139", 'WhereFought'] = 19

In [14]:
dfInterLoc = dfInterLoc.drop_duplicates().rename(columns={"WarNum":"war","WhereFought":"region"})

In [15]:
dfInterLoc

| | war | region |
|---|---|---|
| 0 | 1 | 2 |
| 2 | 4 | 11 |
| 4 | 7 | 1 |
| 6 | 10 | 2 |
| 10 | 13 | 2 |
| ... | ... | ... |
| 315 | 219 | 4 |
| 317 | 221 | 2 |
| 325 | 223 | 7 |
| 327 | 225 | 7 |


In [16]:
region_map_values_interstate = {1: 'W. Hemisphere', 2: 'Europe', 4: 'Africa', 6: 'Middle East', 7: 'Asia', 9: 'Oceania', 11: 'Europe,Middle East', 12: 'Europe,Asia', 13: 'W. Hemisphere,Asia', 14: 'Europe,Africa,Middle East', 15: 'Europe,Africa,Middle East,Asia', 16: 'Africa,Middle East,Asia,Oceania', 17: 'Asia,Oceania', 18: 'Africa,Middle East', 19: 'Europe,Africa,Middle East,Asia,Oceania'}

dfInterLoc['region'] = dfInterLoc['region'].replace(region_map_values_interstate).str.split(',')
dfInterLoc = dfInterLoc.explode('region').drop_duplicates().reset_index(drop=True)
dfInterLoc

| | war | region |
|---|---|---|
| 0 | 1 | Europe |
| 1 | 4 | Europe |
| 2 | 4 | Middle East |
| 3 | 7 | W. Hemisphere |
| 4 | 10 | Europe |
| ... | ... | ... |
| 103 | 219 | Africa |
| 104 | 221 | Europe |
| 105 | 223 | Asia |
| 106 | 225 | Asia |
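
The split-and-explode step is easier to follow on a two-row example. This sketch uses invented wars and only two entries from the region mapping. It uses `map` instead of `replace`, which is a substitution on my part: for codes present in the dictionary the two behave the same.

```python
import pandas as pd

# Two entries copied from the interstate region mapping.
region_names = {2: "Europe", 11: "Europe,Middle East"}

# Invented rows: one single-region war and one multi-region war.
df_loc = pd.DataFrame({"war": ["1", "4"], "region": [2, 11]})

# Code -> comma-separated names -> list of names -> one row per name.
df_loc["region"] = df_loc["region"].map(region_names).str.split(",")
df_long = df_loc.explode("region").reset_index(drop=True)
print(df_long)
```

War `4` ends up with one row per region, which is the shape the `war_locations` table expects.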


In [17]:
dfInterLoc.region.value_counts()

Europe           30
Asia             29
Middle East      24
W. Hemisphere    16
Africa            8
Oceania           1
Name: region, dtype: int64

### Intrastate war locations

In [18]:
dfIntraLoc_pre = dfIntraStateWar[dfIntraStateWar.V5RegionNum == 6].iloc[:, 0:7].rename(columns={"WarNum":"war"})

In [19]:
dfIntraLoc_pre["region_asiaoceania"] = "Asia"
dfIntraLoc_pre.loc[dfIntraLoc_pre['CcodeA'] >= 900, 'region_asiaoceania'] = "Oceania"
dfIntraLoc_pre

| | war | WarName | V5RegionNum | WarType | CcodeA | SideA | SideB | region_asiaoceania |
|---|---|---|---|---|---|---|---|---|
| 73 | 567 | Taiping Rebellion phase 2 of 1860-1866 | 6 | 4 | 710.0 | China | Taipings | Asia |
| 74 | 568 | Second Nien Revolt of 1860-1868 | 6 | 5 | 710.0 | China | Nien Society | Asia |
| 75 | 570 | Miao Rebellion phase 2 of 1860-1872 | 6 | 5 | 710.0 | China | Miao | Asia |
| 76 | 571 | Panthay Rebellion phase 2 of 1860-1874 | 6 | 5 | 710.0 | China | Hui Rebels | Asia |
| 80 | 576 | Tungan Rebellion of 1862-1873 | 6 | 5 | 710.0 | China | Shaanxi and Gansu Muslims | Asia |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 397 | 936 | Second Philippine - NPA War of 2005-2006 | 6 | 4 | 840.0 | Philippines | NPA | Asia |
| 399 | 940 | Third Sri Lanka Tamil War of 2006-2009 | 6 | 5 | 780.0 | Sri Lanka | LTTE | Asia |
| 402 | 942 | Second Waziristan War of 2007-present | 6 | 5 | 770.0 | Pakistan | Taliban | Asia |
| 408 | 980 | Kachin Rebellion of 2011-2013 | 6 | 5 | 775.0 | Myanmar | KIA | Asia |
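
The Asia/Oceania split above can be written as a single vectorized expression. This is a sketch on invented rows; the war id `9999` and the country code `910.0` are made up to show the Oceania branch. As in the notebook cell, a missing `CcodeA` falls back to "Asia" because a comparison with `NaN` is false.

```python
import numpy as np
import pandas as pd

# Invented rows; 710.0 and 910.0 are example codes below and above the 900 cut-off.
df_region6 = pd.DataFrame({"war": ["567", "9999"], "CcodeA": [710.0, 910.0]})

# Country codes of 900 and above are labeled Oceania, everything else Asia.
df_region6["region_asiaoceania"] = np.where(df_region6["CcodeA"] >= 900, "Oceania", "Asia")
print(df_region6)
```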


In [20]:
dfIntraLoc_mid = dfIntraStateWar[["WarNum","V5RegionNum"]].rename(columns={"WarNum":"war", "V5RegionNum":"region"})


In [21]:
dfIntraLoc = dfIntraLoc_mid.merge(dfIntraLoc_pre[["war","region_asiaoceania"]], how="left", on="war")

In [22]:
region_map_values_intrastate = {1: 'W. Hemisphere', 2: 'W. Hemisphere', 3: 'Europe', 4: 'Africa', 5: 'Middle East', 6: np.nan}
dfIntraLoc['region'] = dfIntraLoc['region'].replace(region_map_values_intrastate)
dfIntraLoc['region'] = dfIntraLoc['region'].fillna(dfIntraLoc['region_asiaoceania'])
dfIntraLoc

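| | war | region | region_asiaoceania |
|---|---|---|---|
| 0 | 500 | Europe | |
| 1 | 502 | Europe | |
| 2 | 502.1 | Europe | |
| 3 | 503 | Europe | |
| 4 | 504 | Europe | |
| ... | ... | ... | ... |
| 415 | 992 | Middle East | |
| 416 | 992.5 | Africa | |
| 417 | 993 | Europe | |
| 418 | 994 | Middle East | |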
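
The map-then-fill pattern can be checked on a three-row example. This is a sketch with invented rows and a shortened mapping. It uses `map`, which yields `NaN` for any code missing from the dictionary, rather than `replace` with an explicit `NaN` entry; that is an assumption of the sketch, not the notebook's exact call.

```python
import numpy as np
import pandas as pd

# Invented rows: one European war and two wars from region 6 (Asia and Oceania).
df_demo = pd.DataFrame({
    "war": ["500", "567", "9999"],
    "region": [3, 6, 6],
    "region_asiaoceania": [np.nan, "Asia", "Oceania"],
})

# Shortened mapping; code 6 is intentionally left out so it maps to NaN.
region_names = {3: "Europe"}

# Fill the gaps left by code 6 from the helper column, aligned on the index.
df_demo["region"] = df_demo["region"].map(region_names).fillna(df_demo["region_asiaoceania"])
print(df_demo[["war", "region"]])
```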


In [23]:
dfIntraLoc.region.value_counts()

Asia             102
W. Hemisphere    100
Middle East       81
Europe            73
Africa            63
Oceania            1
Name: region, dtype: int64