# IS590DV Final Project: Part 1 - Data Exploration

By: Jenna Jordan

Group Members: Dennis Piehl, Gianni Pezzarossi, Xue Lu, and Hsin-Yuan Wang

## Data Sources

- Correlates of War
- UCDP/PRIO
- World Bank
- Polity IV Project
- CShapes

In [1]:
import pandas as pd

## Correlates of War

http://www.correlatesofwar.org/data-sets

License: http://www.correlatesofwar.org/data-sets/terms-and-conditions

#### War Datasets

http://www.correlatesofwar.org/data-sets/COW-war

Sarkees, Meredith Reid and Frank Wayman (2010). Resort to War: 1816 - 2007. Washington DC: CQ Press.

- Inter-StateWarData_v4.0
    - filesize: 32 kb
- Intra-StateWarData_v4.1
    - filesize: 55 kb
- Non-StateWarData_v4.0
    - filesize: 7 kb
- Extra-StateWarData_v4.0
    - filesize: 23 kb

#### State dataset

http://www.correlatesofwar.org/data-sets/state-system-membership

Correlates of War Project. 2017. "State System Membership List, v2016." Online, http://correlatesofwar.org

- States2016
    - filesize: 11 kb

#### Notes
Missing values are represented by -7, -8, and -9.

For my previous work with these datasets, please see: https://github.com/jenna-jordan/international-relations-database
The functional dependencies and general organization of these datasets are quite messy, and I restructured them in order to build a database from these datasets. All of that work is recorded in the GitHub repo above.
Specifically, look at this notebook: https://github.com/jenna-jordan/international-relations-database/blob/master/DataTransformation/WAR%20Data%20Transformation.ipynb


In [2]:
cow_interstate = pd.read_csv("../Data/CorrelatesOfWar/Raw/Inter-StateWarData_v4.0.csv", na_values = [-7, -8, -9])
cow_intrastate = pd.read_csv("../Data/CorrelatesOfWar/Raw/Intra-StateWarData_v4.1.csv", encoding = "latin-1", na_values = [-7, -8, -9])
cow_nonstate = pd.read_csv("../Data/CorrelatesOfWar/Raw/Non-StateWarData_v4.0.csv", na_values = [-7, -8, -9])
cow_extrastate = pd.read_csv("../Data/CorrelatesOfWar/Raw/Extra-StateWarData_v4.0.csv", encoding = "latin-1", na_values = [-7, -8, -9])
cow_states = pd.read_csv("../Data/CorrelatesOfWar/Raw/states2016.csv", na_values = [-7, -8, -9])
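
As a quick check of what `na_values` does in the cell above, the sketch below reads a small made-up CSV and confirms that the CoW sentinel codes come back as NaN. The rows are invented for illustration; only the column names are borrowed from the inter-state file, and `COW_MISSING` is a hypothetical helper name.

```python
import io
import pandas as pd

# Toy CSV: values are invented for illustration, not taken from the CoW files.
toy_csv = io.StringIO(
    "WarNum,StartDay1,BatDeath\n"
    "1,7,600\n"
    "2,-9,-8\n"
    "3,-7,1200\n"
)

COW_MISSING = [-7, -8, -9]  # sentinel codes described in the notes above
toy = pd.read_csv(toy_csv, na_values=COW_MISSING)

# Count the missing values per column after the sentinels are converted.
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

Defining the sentinel list once and passing it to every `read_csv` call would also avoid repeating the literal list five times.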

### Inter-State War

This dataset records wars fought between states, from 1816 - 2007. There are 25 variables and 337 observations. Each observation is a country-war unit. There are 95 distinct interstate wars recorded, involving 98 distinct states. The earliest conflict occurred in 1823, and the latest conflict occurred in 2003.

In [3]:
cow_interstate

Unnamed: 0,WarNum,WarName,WarType,ccode,StateName,Side,StartMonth1,StartDay1,StartYear1,EndMonth1,...,EndMonth2,EndDay2,EndYear2,TransFrom,WhereFought,Initiator,Outcome,TransTo,BatDeath,Version
0,1,Franco-Spanish War,1,230,Spain,2,4,7,1823,11,...,,,,503,2,2,2,,600.0,4
1,1,Franco-Spanish War,1,220,France,1,4,7,1823,11,...,,,,503,2,1,1,,400.0,4
2,4,First Russo-Turkish,1,640,Ottoman Empire,2,4,26,1828,9,...,,,,506,11,2,2,,80000.0,4
3,4,First Russo-Turkish,1,365,Russia,1,4,26,1828,9,...,,,,506,11,1,1,,50000.0,4
4,7,Mexican-American,1,70,Mexico,2,4,25,1846,9,...,,,,,1,2,2,,6000.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
332,225,Invasion of Afghanistan,1,700,Afghanistan,2,10,7,2001,12,...,,,,851,7,2,4,481.0,4000.0,4
333,227,Invasion of Iraq,1,900,Australia,1,3,19,2003,5,...,,,,,6,1,4,482.0,0.0,4
334,227,Invasion of Iraq,1,200,United Kingdom,1,3,19,2003,5,...,,,,,6,1,4,482.0,33.0,4
335,227,Invasion of Iraq,1,2,United States of America,1,3,19,2003,5,...,,,,,6,1,4,482.0,140.0,4


In [4]:
cow_interstate.dtypes

WarNum           int64
WarName         object
WarType          int64
ccode            int64
StateName       object
Side             int64
StartMonth1      int64
StartDay1        int64
StartYear1       int64
EndMonth1        int64
EndDay1          int64
EndYear1         int64
StartMonth2    float64
StartDay2      float64
StartYear2     float64
EndMonth2      float64
EndDay2        float64
EndYear2       float64
TransFrom       object
WhereFought      int64
Initiator        int64
Outcome          int64
TransTo        float64
BatDeath       float64
Version          int64
dtype: object

In [5]:
cow_interstate['WarNum'].nunique(), cow_interstate['ccode'].nunique()

(95, 98)

In [6]:
cow_interstate['StartYear1'].min(), cow_interstate['StartYear1'].max()

(1823, 2003)

### Intra-State War

This dataset records civil wars fought within states, from 1816 - 2007. There are 28 variables and 442 observations. Each observation is a country-dyad-war unit. There are 334 distinct intrastate wars recorded, involving 104 distinct states and 290 non-state groups. The earliest conflict started in 1818, and the latest conflict started in 2007.

In [7]:
cow_intrastate

Unnamed: 0,WarNum,WarName,WarType,CcodeA,SideA,CcodeB,SideB,Intnl,StartMonth1,StartDay1,...,EndDay2,EndYear2,TransFrom,WhereFought,Initiator,Outcome,TransTo,SideADeaths,SideBDeaths,Version
0,500,First Caucasus,5,365.0,Russia,,"Georgians, Dhagestania, Chechens",0,6.0,10.0,...,,,,2,Chechens,1,,5000,6000,4.1
1,501,Sidon-Damascus,6,,Sidon,,Damascus & Aleppo,0,6.0,,...,,,,6,Sidon,2,,,,4.1
2,502,First Two Sicilies,4,300.0,Austria,,,1,3.0,,...,,,,2,Liberals,1,,,,4.1
3,502,First Two Sicilies,4,329.0,Two Sicilies,,Liberals,1,7.0,2.0,...,,,,2,Liberals,1,,,,4.1
4,503,Spanish Royalists,4,230.0,Spain,,Royalists,0,12.0,1.0,...,,,,2,Royalists,4,1.0,,,4.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
437,938,Third Somalia,4,520.0,Somalia,,SCIC,1,3.0,6.0,...,,,,4,SCIC,1,,,,4.1
438,938,Third Somalia,4,530.0,Ethiopia,,,1,10.0,9.0,...,,,,4,SCIC,1,,,,4.1
439,938,Third Somalia,4,,,531.0,Eritrea,1,10.0,19.0,...,,,,4,SCIC,1,,,,4.1
440,940,Second Sri Lanka Tamil,5,780.0,Sri Lanka,,LTTE,0,10.0,11.0,...,,,,7,Sri Lanka,5,,,,4.1


In [8]:
cow_intrastate.dtypes

WarNum           int64
WarName         object
WarType          int64
CcodeA         float64
SideA           object
CcodeB         float64
SideB           object
Intnl            int64
StartMonth1    float64
StartDay1      float64
StartYear1       int64
EndMonth1      float64
EndDay1        float64
EndYear1       float64
StartMonth2    float64
StartDay2      float64
StartYear2     float64
EndMonth2      float64
EndDay2        float64
EndYear2       float64
TransFrom      float64
WhereFought      int64
Initiator       object
Outcome          int64
TransTo        float64
SideADeaths     object
SideBDeaths     object
Version        float64
dtype: object

In [9]:
all_countries = set(cow_intrastate['CcodeA'].dropna().tolist() + cow_intrastate['CcodeB'].dropna().tolist())
len(list(all_countries))

104

In [10]:
cow_intrastate['SideB'].nunique()

290

In [11]:
cow_intrastate['WarNum'].nunique()

334

In [12]:
cow_intrastate['StartYear1'].min(), cow_intrastate['StartYear1'].max()

(1818, 2007)

### Non-State War

This dataset records civil wars fought within states where the state is not a participant (only non-state groups), from 1816 - 2007. There are 25 variables and 62 observations. Each observation is a war. There are 62 distinct nonstate wars recorded, involving 119 non-state groups. The earliest conflict started in 1818, and the latest conflict started in 1999.

In [13]:
cow_nonstate

Unnamed: 0,WarNum,WarName,WarType,WhereFought,SideA1,SideA2,SideB1,SideB2,SideB3,SideB4,...,EndMonth,EndDay,Initiator,TransFrom,TransTo,Outcome,SideADeaths,SideBDeaths,TotalCombatDeaths,Version
0,1500,First Maori Tribal War,8,9,Te Rauparaha's Ngati Toa,,Taranaki,Ngai Tahu,Waikato,Ngati Ira,...,,,A,,,1,1500.0,6000.0,7500,4
1,1501,Shaka Zulu-Bantu War,8,4,Shaka Zulu,,Bantu,,,,...,9.0,24.0,A,,,1,20000.0,40000.0,60000,4
2,1502,Burma-Assam War,8,7,Burma,,Assam,,,,...,,,A,,,1,,,,4
3,1503,Buenos Aires War,8,1,Buenos Aires,,Provinces,,,,...,2.0,23.0,B,,,2,,,,4
4,1505,Second Maori Tribal War,8,9,Hongi Hika's Nga Phuhi,,Ngati Paoa,Ngati Maru,Waikato River Maori,Te Arawa,...,,,A,,,1,500.0,2000.0,2500,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57,1574,Rwandan Social Revolution,8,4,Hutu,,Tutsi,,,,...,7.0,1.0,A,,,1,,,,4
58,1577,Dhofar Rebellion Phase 1,8,6,Dhofar,,Oman,,,,...,10.0,6.0,A,,,6,,,5000,4
59,1581,Angola Guerilla War,8,4,MPLA,,FLNA,UNITA,,,...,10.0,22.0,A,,186.0,4,,,,4
60,1582,East Timorese War Phase 1,8,7,Fretilin,Apodeti,UDT,,,,...,10.0,15.0,B,,472.0,4,,,3000,4


In [14]:
cow_nonstate.dtypes

WarNum                 int64
WarName               object
WarType                int64
WhereFought            int64
SideA1                object
SideA2                object
SideB1                object
SideB2                object
SideB3                object
SideB4                object
SideB5                object
StartYear              int64
StartMonth           float64
StartDay             float64
EndYear                int64
EndMonth             float64
EndDay               float64
Initiator             object
TransFrom            float64
TransTo              float64
Outcome                int64
SideADeaths          float64
SideBDeaths          float64
TotalCombatDeaths     object
Version                int64
dtype: object

In [15]:
cow_nonstate['WarNum'].nunique()

62

In [16]:
# Collect the distinct group names across all of the side columns
side_cols = ['SideA1', 'SideA2', 'SideB1', 'SideB2', 'SideB3', 'SideB4', 'SideB5']
all_groups = set(cow_nonstate[side_cols].melt()['value'].dropna())
len(all_groups)

119

In [17]:
cow_nonstate['StartYear'].min(), cow_nonstate['StartYear'].max()

(1818, 1999)

### Extra-State War

This dataset records extrasystemic wars (between a state and a nonstate group outside of its borders) fought from 1816 - 2007. There are 28 variables and 198 observations. Each observation is a state/group-dyad-war unit. There are 163 distinct extrastate wars recorded, involving 37 states and roughly 125 non-state groups (estimated as the number of distinct side names minus the number of distinct state codes). The earliest conflict started in 1816, and the latest conflict started in 2004.

In [18]:
cow_extrastate

Unnamed: 0,WarNum,WarName,WarType,ccode1,SideA,ccode2,SideB,StartMonth1,StartDay1,StartYear1,...,EndYear2,Initiator,Interven,TransFrom,Outcome,TransTo,WhereFought,BatDeath,NonStateDeaths,Version
0,300,Allied Bombardment of Algiers,3,210.0,Netherlands,,,8.0,26.0,1816,...,,1,1,,1,,6,13.0,,4
1,300,Allied Bombardment of Algiers,3,200.0,United Kingdom,,Algeria,8.0,26.0,1816,...,,1,1,,1,,6,129.0,6000.0,4
2,301,Ottoman-Wahhabi,3,640.0,Ottoman Empire,,Saudi Wahhabis,9.0,,1816,...,,1,0,,1,,6,13500.0,14000.0,4
3,302,Liberation of Chile,2,230.0,Spain,,San Martin revolutionaries,1.0,9.0,1817,...,,0,0,,2,,1,1700.0,1140.0,4
4,303,First Bolivar Expedition,2,230.0,Spain,,New Granada,4.0,11.0,1817,...,,1,0,,2,,1,3000.0,2000.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
193,482,Iraqi Resistance,3,290.0,Poland,,,3.0,,2004,...,,0,1,227.0,5,,6,23.0,,4
194,482,Iraqi Resistance,3,325.0,Italy,,,3.0,,2004,...,,0,1,227.0,5,,6,33.0,,4
195,482,Iraqi Resistance,3,369.0,Ukraine,,,3.0,,2004,...,,0,1,227.0,5,,6,18.0,,4
196,482,Iraqi Resistance,3,645.0,Iraq,,,6.0,28.0,2004,...,,0,1,227.0,5,,6,10800.0,,4


In [19]:
cow_extrastate['WarNum'].nunique()

163

In [20]:
all_countries2= set(cow_extrastate['ccode1'].dropna().tolist() + cow_extrastate['ccode2'].dropna().tolist())
len(list(all_countries2))

37

In [21]:
all_groups2= set(cow_extrastate['SideA'].dropna().tolist() + cow_extrastate['SideB'].dropna().tolist())
len(list(all_groups2)) - len(list(all_countries2))

125
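
The subtraction in the cell above assumes that every state code corresponds to exactly one name in the side columns. A more direct sketch, shown here on invented rows that only mimic the extra-state layout, counts a name as a non-state group when the matching code column is empty. The row "Rebels X" is hypothetical.

```python
import pandas as pd

# Invented rows that mimic the extra-state layout (not real CoW records).
toy = pd.DataFrame({
    "ccode1": [210.0, 200.0, None, 230.0],
    "SideA":  ["Netherlands", "United Kingdom", "Rebels X", "Spain"],
    "ccode2": [None, None, 640.0, None],
    "SideB":  [None, "Algeria", "Ottoman Empire", "New Granada"],
})

# A name counts as a non-state group only when its side has no state code.
groups_a = toy.loc[toy["ccode1"].isna(), "SideA"].dropna()
groups_b = toy.loc[toy["ccode2"].isna(), "SideB"].dropna()
nonstate_groups = set(groups_a) | set(groups_b)

print(len(nonstate_groups), sorted(nonstate_groups))
```

On the real data this may give a different number than the subtraction, since spelling variants of the same group would still be counted separately.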

In [22]:
cow_extrastate['StartYear1'].min(), cow_extrastate['StartYear1'].max()

(1816, 2004)

### States

This dataset records all of the states in the CoW system, and when they entered/exited the system. If a state left and re-entered the system, it is recorded twice. There are 217 distinct states. If the state existed in 1816, when the CoW data began recording information, it is listed as having begun on Jan 1, 1816, and if it was still in existence in 2016 (when the dataset was last updated) it is recorded as having ended on Dec 31, 2016 (e.g. USA). The 'stateabb' is CoW's own three-letter abbreviation, which does not always match the ISO alpha-3 code (e.g. BHM for Bahamas, where ISO uses BHS).

In [23]:
cow_states

Unnamed: 0,stateabb,ccode,statenme,styear,stmonth,stday,endyear,endmonth,endday,version
0,USA,2,United States of America,1816,1,1,2016,12,31,2016
1,CAN,20,Canada,1920,1,10,2016,12,31,2016
2,BHM,31,Bahamas,1973,7,10,2016,12,31,2016
3,CUB,40,Cuba,1902,5,20,1906,9,25,2016
4,CUB,40,Cuba,1909,1,23,2016,12,31,2016
...,...,...,...,...,...,...,...,...,...,...
238,NAU,970,Nauru,1999,9,14,2016,12,31,2016
239,MSI,983,Marshall Islands,1991,9,17,2016,12,31,2016
240,PAL,986,Palau,1994,12,15,2016,12,31,2016
241,FSM,987,Federated States of Micronesia,1991,9,17,2016,12,31,2016


In [24]:
cow_states['ccode'].nunique()

217
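
To see which states left and re-entered the system, one option is to count rows per 'ccode'. The sketch below uses four rows copied from the preview above; running the same grouping on `cow_states` should list every state with more than one membership spell.

```python
import pandas as pd

# Rows copied from the cow_states preview above.
toy_states = pd.DataFrame({
    "stateabb": ["USA", "CAN", "CUB", "CUB"],
    "ccode":    [2, 20, 40, 40],
    "styear":   [1816, 1920, 1902, 1909],
    "endyear":  [2016, 2016, 1906, 2016],
})

# Count membership spells per state; more than one spell means the state
# left the system and later re-entered it.
spells = toy_states.groupby("ccode").size()
reentered = spells[spells > 1].index.tolist()
print(reentered)
```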

## UCDP-PRIO

UCDP website: https://www.ucdp.uu.se/downloads/
PRIO website: https://www.prio.org/Data/Armed-Conflict/UCDP-PRIO/

License: free to use with citation

Citations:

Pettersson, Therese; Stina Högbladh & Magnus Öberg, 2019. Organized violence, 1989-2018 and peace agreements, Journal of Peace Research 56(4).

Gleditsch, Nils Petter, Peter Wallensteen, Mikael Eriksson, Margareta Sollenberg, and Håvard Strand (2002) Armed Conflict 1946-2001: A New Dataset. Journal of Peace Research 39(5).

#### Datasets

- UCDP/PRIO Armed Conflict Dataset v19.1
    - filesize: 508 kb
- UCDP Actor List
    - filesize: 83 kb

#### Notes

For my work in normalizing this dataset (determining functional dependencies and separating the variables into tidy tables), please see this notebook: https://github.com/jenna-jordan/international-relations-database-extended/blob/master/Wrangle_Data/UCDP-PRIO_Normalize.ipynb

In [25]:
ucdp_conflict = pd.read_csv("../Data/UCDP-PRIO_ArmedConflict/Raw/ucdp-prio-acd-191.csv")

### Armed Conflict & Actors

This dataset records all conflicts from 1946 to 2018 in which there were 25 or more deaths within a single year. There are 286 distinct conflicts, with the earliest in 1946 and the latest in 2018. These conflicts are separated into episodes, and there are 558 distinct episodes. Each episode is further divided into conflict-years, which are the unit of observation. There are 2384 observations (conflict-years) and 28 variables.

The identifiers deserve some discussion here. UCDP shifted from using the Gleditsch & Ward state codes to their own actor codes, which are supposed to identify all distinct actors - states and non-state groups. This actor list is recorded in a separate file (actorlist.csv), which contains 1687 distinct actors. However, exploring it shows that not all G&W states are recorded in the UCDP Actor List (see the Jupyter notebook referenced above) - and furthermore, some of these actors are actually lists of multiple groups. Since only UCDP uses this actor list and the actor list does not contain a mapping to G&W or a more conventional system like ISO, this dataset contains duplicates of the actor columns - one for UCDP, one for G&W identifiers. Also, all of the columns that contain actors (e.g. side_a, gwno_a) are multivalued - the identifiers are separated by commas. For the sides, these lists are "synced" - all in the same order. However, the two location columns - the name and the G&W id - are not synced. All of this is a complete mess, and I will refer you again to the Jupyter notebook referenced above for my work in cleaning and normalizing it.
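
As a rough illustration of what "multivalued but synced" means in practice, the sketch below splits two paired columns and expands them to one actor per row. The two rows are modeled on the Zimbabwe (Rhodesia) records in this file; this is not the normalization used in the referenced notebook, and exploding several columns at once requires pandas 1.3 or later.

```python
import pandas as pd

# Two rows modeled on the Zimbabwe (Rhodesia) records; other columns omitted.
toy = pd.DataFrame({
    "conflict_id": [318, 318],
    "side_b":      ["PF, ZANU", "PF"],
    "side_b_id":   ["494, 493", "494"],
})

# Split both columns on the separator, then explode them together so the
# i-th name stays paired with the i-th id (this relies on the lists being synced).
split = toy.assign(
    side_b=toy["side_b"].str.split(", "),
    side_b_id=toy["side_b_id"].str.split(", "),
)
long = split.explode(["side_b", "side_b_id"]).reset_index(drop=True)
print(long)
```

The same approach would not be safe for the location name and G&W location id, since those two lists are not in matching order.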

In [26]:
ucdp_conflict

Unnamed: 0,conflict_id,location,side_a,side_a_id,side_a_2nd,side_b,side_b_id,side_b_2nd,incompatibility,territory_name,...,ep_end,ep_end_date,ep_end_prec,gwno_a,gwno_a_2nd,gwno_b,gwno_b_2nd,gwno_loc,region,version
0,13637,Afghanistan,Government of Afghanistan,130,"Government of Pakistan, Government of United S...",IS,234,,1,Islamic State,...,0,,,700,"770, 2",,,700,3,19.1
1,13637,Afghanistan,Government of Afghanistan,130,"Government of Pakistan, Government of United S...",IS,234,,1,Islamic State,...,0,,,700,"770, 2",,,700,3,19.1
2,13637,Afghanistan,Government of Afghanistan,130,Government of United States of America,IS,234,,1,Islamic State,...,0,,,700,2,,,700,3,19.1
3,13637,Afghanistan,Government of Afghanistan,130,Government of United States of America,IS,234,,1,Islamic State,...,0,,,700,2,,,700,3,19.1
4,333,Afghanistan,Government of Afghanistan,130,,PDPA,291,,2,,...,0,,,700,,,,700,3,19.1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2379,318,Zimbabwe (Rhodesia),Government of Zimbabwe (Rhodesia),101,Government of South Africa,ZANU,493,,2,,...,0,,,552,560,,,552,4,19.1
2380,318,Zimbabwe (Rhodesia),Government of Zimbabwe (Rhodesia),101,,"PF, ZANU","494, 493",,2,,...,0,,,552,,,,552,4,19.1
2381,318,Zimbabwe (Rhodesia),Government of Zimbabwe (Rhodesia),101,,PF,494,,2,,...,0,,,552,,,,552,4,19.1
2382,318,Zimbabwe (Rhodesia),Government of Zimbabwe (Rhodesia),101,,PF,494,,2,,...,0,,,552,,,,552,4,19.1


In [27]:
ucdp_conflict.dtypes

conflict_id               int64
location                 object
side_a                   object
side_a_id                object
side_a_2nd               object
side_b                   object
side_b_id                object
side_b_2nd               object
incompatibility           int64
territory_name           object
year                      int64
intensity_level           int64
cumulative_intensity      int64
type_of_conflict          int64
start_date               object
start_prec                int64
start_date2              object
start_prec2               int64
ep_end                    int64
ep_end_date              object
ep_end_prec             float64
gwno_a                   object
gwno_a_2nd               object
gwno_b                   object
gwno_b_2nd               object
gwno_loc                 object
region                   object
version                 float64
dtype: object

In [28]:
ucdp_conflict['conflict_id'].nunique()

286

In [29]:
ucdp_conflict['year'].min(), ucdp_conflict['year'].max()

(1946, 2018)

In [30]:
conflicts_gb = ucdp_conflict.groupby(['conflict_id']).agg({'start_date2':'nunique'})
conflicts_gb['start_date2'].sum()

558
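
The episode count above treats each distinct 'start_date2' within a conflict as one episode. The same idea can be sketched on invented conflict-years by dropping duplicate (conflict_id, start_date2) pairs; the ids and dates below are made up.

```python
import pandas as pd

# Invented conflict-years: conflict 1 has two episodes, conflict 2 has one.
toy = pd.DataFrame({
    "conflict_id": [1, 1, 1, 2, 2],
    "year":        [1950, 1951, 1960, 1970, 1971],
    "start_date2": ["1950-03-01", "1950-03-01", "1960-06-15",
                    "1970-01-10", "1970-01-10"],
})

# Each distinct (conflict_id, start_date2) pair is treated as one episode.
episodes = toy[["conflict_id", "start_date2"]].drop_duplicates()
episodes_per_conflict = episodes.groupby("conflict_id").size()
print(len(episodes), episodes_per_conflict.to_dict())
```

Keeping the per-conflict counts around, rather than only their sum, makes it easy to see which conflicts restarted most often.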

## CShapes

http://nils.weidmann.ws/projects/cshapes/shapefile.html

License: CC BY-NC-SA 4.0

Weidmann, Nils B., Doreen Kuse, and Kristian Skrede Gleditsch. 2010. The Geography of the International System: The CShapes Dataset. International Interactions 36 (1).

#### Shapefiles

This dataset contains shapefiles for historical states, 1946 - 2016. I extracted the attributes from these shapefiles to a csv, which is loaded below (see this notebook for the attribute extraction: https://github.com/jenna-jordan/international-relations-database-extended/blob/master/Gather_Data/CSHAPES_extract-attributes.ipynb)

Each row corresponds to a shape. Each shape is a single state for the duration that its borders do not change. There are shapes for both the CoW and G&W systems - when they agree/overlap, there is a single observation, and when they disagree there are two observations (with a -1 or NA value for the identifier system that does not apply). Thankfully, the ISO codes are also included where the mapping applies. 

So, while there are 255 distinct shapes (rows/observations), there are only 202 distinct CoW states, 201 distinct G&W states, and 201 distinct ISO codes. However, there are 229 unique combinations of these codes. These various numbers represent the discrepancies between the different state identifier systems. This is both a problem to solve and an interesting situation to explore visually.

In [32]:
cshapes_states = pd.read_csv("../Data/CShapes/country_shapes.csv", na_values=[-1])
cshapes_states

Unnamed: 0,area,capital_lat,capital_long,capital_name,country_name,cow_code,cow_endday,cow_endmonth,cow_endyear,cow_startday,...,gw_endday,gw_endmonth,gw_endyear,gw_startday,gw_startmonth,gw_startyear,iso_alpha2,iso_alpha3,iso_name,iso_num
0,2.119820e+05,6.800000,-58.20000,Georgetown,Guyana,110.0,30.0,6.0,2016.0,26.0,...,30.0,6.0,2016.0,26.0,5.0,1966.0,GY,GUY,Guyana,328
1,1.459523e+05,5.833333,-55.20000,Paramaribo,Suriname,115.0,30.0,6.0,2016.0,25.0,...,30.0,6.0,2016.0,25.0,11.0,1975.0,SR,SUR,Suriname,740
2,5.041729e+03,10.650000,-61.50000,Port-of-Spain,Trinidad and Tobago,52.0,30.0,6.0,2016.0,31.0,...,30.0,6.0,2016.0,31.0,8.0,1962.0,TT,TTO,Trinidad and Tobago,780
3,9.167822e+05,10.500000,-66.90000,Caracas,Venezuela,101.0,30.0,6.0,2016.0,1.0,...,30.0,6.0,2016.0,1.0,1.0,1946.0,VE,VEN,Venezuela,862
4,2.955212e+03,-13.800000,-172.00000,Apia,Samoa,990.0,30.0,6.0,2016.0,15.0,...,30.0,6.0,2016.0,1.0,1.0,1962.0,WS,WSM,Samoa,882
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
250,2.087609e+07,55.752220,37.61555,Moscow,USSR,,,,,,...,8.0,9.0,1991.0,6.0,9.0,1991.0,SU,SUN,Union of Soviet Socialist Republics,810
251,2.073345e+07,55.752220,37.61555,Moscow,USSR,,,,,,...,26.0,10.0,1991.0,9.0,9.0,1991.0,SU,SUN,Union of Soviet Socialist Republics,810
252,2.026111e+07,55.752220,37.61555,Moscow,USSR,,,,,,...,30.0,11.0,1991.0,27.0,10.0,1991.0,SU,SUN,Union of Soviet Socialist Republics,810
253,1.966415e+07,55.752220,37.61555,Moscow,USSR,,,,,,...,15.0,12.0,1991.0,1.0,12.0,1991.0,SU,SUN,Union of Soviet Socialist Republics,810


In [33]:
cshapes_states.dtypes

area              float64
capital_lat       float64
capital_long      float64
capital_name       object
country_name       object
cow_code          float64
cow_endday        float64
cow_endmonth      float64
cow_endyear       float64
cow_startday      float64
cow_startmonth    float64
cow_startyear     float64
feature_id          int64
gw_code           float64
gw_endday         float64
gw_endmonth       float64
gw_endyear        float64
gw_startday       float64
gw_startmonth     float64
gw_startyear      float64
iso_alpha2         object
iso_alpha3         object
iso_name           object
iso_num             int64
dtype: object

In [34]:
cshapes_states['cow_code'].nunique()

202

In [35]:
cshapes_states['gw_code'].nunique()

201

In [36]:
cshapes_states['iso_alpha3'].nunique()

201

In [37]:
country_codes = cshapes_states[['cow_code', 'gw_code', 'iso_alpha3']].drop_duplicates()
country_codes

Unnamed: 0,cow_code,gw_code,iso_alpha3
0,110.0,110.0,GUY
1,115.0,115.0,SUR
2,52.0,52.0,TTO
3,101.0,101.0,VEN
4,990.0,990.0,WSM
...,...,...,...
237,510.0,510.0,
240,140.0,140.0,
245,626.0,626.0,SSD
246,,345.0,YUG


In [38]:
country_codes['cow_code'].isna().sum(), country_codes['gw_code'].isna().sum(), country_codes['iso_alpha3'].isna().sum()

(2, 5, 21)
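
One way to start exploring these discrepancies is to flag, for each code combination, which identifier systems cover it. The sketch below uses four combinations copied from the `country_codes` preview above; the flag column names are hypothetical.

```python
import pandas as pd

# Code combinations copied from the country_codes preview above.
codes = pd.DataFrame({
    "cow_code":   [110.0, 510.0, 626.0, None],
    "gw_code":    [110.0, 510.0, 626.0, 345.0],
    "iso_alpha3": ["GUY", None, "SSD", "YUG"],
})

# Flag each row by which identifier systems cover it.
codes["in_cow"] = codes["cow_code"].notna()
codes["in_gw"] = codes["gw_code"].notna()
codes["has_iso"] = codes["iso_alpha3"].notna()

# Rows where the two state systems disagree (only one of them has a code).
disagree = codes[codes["in_cow"] != codes["in_gw"]]
print(disagree[["cow_code", "gw_code", "iso_alpha3"]])
```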

## World Bank

I acquired the World Development Indicators via the World Bank's API - see the code here: https://github.com/jenna-jordan/international-relations-database-extended/blob/master/Gather_Data/WB-IndicatorTimeSeries_API.ipynb

I then attempted to narrow down this huge dataset of 1387 indicators to get only those that had the longest history and covered most countries. You can see that process in this notebook (which includes some very messy visualizations): https://github.com/jenna-jordan/international-relations-database-extended/blob/master/Wrangle_Data/WorldBank_WDI_narrow-down.ipynb

This resulted in two datasets: one that had 472 indicators and one that simply subselected the top 25 indicators (as identified by the World Bank). I've included the file with the top 25 indicators in this notebook. 

To simply download the WDI dataset, see: https://datacatalog.worldbank.org/dataset/world-development-indicators

License: CC-BY 4.0

#### Datasets

- Full WDI dataset (not included)
    - filesize: 523.2 mb
- Trimmed WDI dataset (not included)
    - filesize: 244.9 mb
- Top 25 WDI dataset
    - filesize: 12.8 mb


In [39]:
wb_wdi25 = pd.read_csv("../Data/WorldBank/wdi_timeseries_top25.csv", usecols=['country', 'indicator', 'value', 'year'])
wb_wdi25

Unnamed: 0,country,indicator,value,year
0,AFG,BX.KLT.DINV.CD.WD,,2019
1,AFG,BX.KLT.DINV.CD.WD,1.392000e+08,2018
2,AFG,BX.KLT.DINV.CD.WD,5.153390e+07,2017
3,AFG,BX.KLT.DINV.CD.WD,9.359132e+07,2016
4,AFG,BX.KLT.DINV.CD.WD,1.691466e+08,2015
...,...,...,...,...
287995,ZWE,ST.INT.ARVL,,1964
287996,ZWE,ST.INT.ARVL,,1963
287997,ZWE,ST.INT.ARVL,,1962
287998,ZWE,ST.INT.ARVL,,1961


In [40]:
wb_wdi25['country'].nunique()

192

In [41]:
wb_wdi25['indicator'].nunique()

25

In [42]:
wb_wdi25['year'].min(), wb_wdi25['year'].max()

(1960, 2019)

In [43]:
# Fraction of missing values (avoids hard-coding the row count)
wb_wdi25['value'].isna().mean()

0.34694444444444444

In [44]:
wb_wdi25.groupby('indicator').agg({'year': ['min', 'max'], 'value': ['count', 'size'], 'country': ['count']})

Unnamed: 0_level_0,year,year,value,value,country
Unnamed: 0_level_1,min,max,count,size,count
indicator,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
BX.KLT.DINV.CD.WD,1960,2019,7522,11520,11520
EG.ELC.ACCS.ZS,1960,2019,4396,11520,11520
EN.ATM.CO2E.PC,1960,2019,9200,11520,11520
EN.POP.DNST,1960,2019,10812,11520,11520
FP.CPI.TOTL.ZG,1960,2019,7476,11520,11520
MS.MIL.XPND.GD.ZS,1960,2019,6998,11520,11520
NE.EXP.GNFS.ZS,1960,2019,8054,11520,11520
NY.GDP.MKTP.CD,1960,2019,9092,11520,11520
NY.GDP.MKTP.KD.ZG,1960,2019,8732,11520,11520
NY.GDP.PCAP.CD,1960,2019,9089,11520,11520
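
In the table above, every indicator shows the same year range (1960 to 2019) because a row exists for each country-year even when the value is missing. A sketch of a coverage summary that only looks at observed values is given below; the data are invented, and the indicator and country names are placeholders.

```python
import pandas as pd

# Invented long-format rows in the same shape as wb_wdi25 (values are made up).
toy = pd.DataFrame({
    "country":   ["AAA", "AAA", "BBB", "BBB", "AAA", "AAA", "BBB", "BBB"],
    "indicator": ["IND.ONE"] * 4 + ["IND.TWO"] * 4,
    "year":      [2000, 2001, 2000, 2001] * 2,
    "value":     [1.0, 2.0, 3.0, None, None, None, 5.0, None],
})

# Share of non-missing values per indicator.
coverage = toy.groupby("indicator")["value"].agg(lambda s: s.notna().mean())

# First and last year with an actual observation, per indicator.
observed = (
    toy.dropna(subset=["value"])
       .groupby("indicator")["year"]
       .agg(["min", "max"])
)
print(coverage)
print(observed)
```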


## Polity IV

http://www.systemicpeace.org/polityproject.html

Datasets available here: http://www.systemicpeace.org/inscrdata.html

License: Not provided. Instead, this text is included in the above page: "All resources listed on this page are copyrighted by the Center for Systemic Peace. Use of any of these resources in published work must provide proper citation. Reproduction or redistribution of these resources, or substantial portions thereof, is prohibited without prior, written permission from the Center for Systemic Peace."

#### Dataset

- Polity IV Annual Time-Series, 1800-2018
    - filesize: 1.4 mb

This dataset is a time series, so each row/observation represents one country-year unit. Each country has its CoW code, a three-letter abbreviation ('scode', which does not always match the ISO alpha-3 code, e.g. ALG for Algeria), and a name. Each country-year has a democracy score, autocracy score, and aggregate polity score (democracy - autocracy). Other variables record the individual components that went into the democracy & autocracy scores. A new "polity" occurs when there is a regime transition within a country, and additional variables record details about this transition.

There are 194 countries in this dataset, spanning the time frame of 1800 - 2018. There are 17562 rows/observations and 36 variables.

In [45]:
polity = pd.read_csv("../Data/PolityIV/Raw/p4v2018.csv")
polity

Unnamed: 0,cyear,ccode,scode,country,year,flag,fragment,democ,autoc,polity,...,interim,bmonth,bday,byear,bprec,post,change,d4,sf,regtrans
0,21800,2,USA,United States,1800,0,,7,3,4,...,,1.0,1.0,1800.0,1.0,4.0,88.0,1.0,,
1,21801,2,USA,United States,1801,0,,7,3,4,...,,,,,,,,,,
2,21802,2,USA,United States,1802,0,,7,3,4,...,,,,,,,,,,
3,21803,2,USA,United States,1803,0,,7,3,4,...,,,,,,,,,,
4,21804,2,USA,United States,1804,0,,7,3,4,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17557,9502014,950,FJI,Fiji,2014,0,0.0,3,1,2,...,,9.0,22.0,2014.0,1.0,2.0,6.0,1.0,,3.0
17558,9502015,950,FJI,Fiji,2015,0,0.0,3,1,2,...,,,,,,,,,,
17559,9502016,950,FJI,Fiji,2016,0,0.0,3,1,2,...,,,,,,,,,,
17560,9502017,950,FJI,Fiji,2017,0,0.0,3,1,2,...,,,,,,,,,,
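
The relationship between the three scores (polity = democracy - autocracy) can be checked directly. The sketch below does so on two rows copied from the preview above. On the full file the identity is not expected to hold for rows that use Polity's special codes (-66, -77, -88), so this is best read as a check on ordinary rows only.

```python
import pandas as pd

# Rows copied from the polity preview (United States 1800, Fiji 2014).
toy = pd.DataFrame({
    "scode":  ["USA", "FJI"],
    "year":   [1800, 2014],
    "democ":  [7, 3],
    "autoc":  [3, 1],
    "polity": [4, 2],
})

# Recompute the aggregate score and compare it with the recorded one.
toy["polity_check"] = toy["democ"] - toy["autoc"]
matches = bool((toy["polity_check"] == toy["polity"]).all())
print(matches)
```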


In [46]:
polity.dtypes

cyear         int64
ccode         int64
scode        object
country      object
year          int64
flag          int64
fragment    float64
democ         int64
autoc         int64
polity        int64
polity2     float64
durable     float64
xrreg         int64
xrcomp        int64
xropen        int64
xconst        int64
parreg        int64
parcomp       int64
exrec       float64
exconst       int64
polcomp     float64
prior       float64
emonth      float64
eday        float64
eyear       float64
eprec       float64
interim     float64
bmonth      float64
bday        float64
byear       float64
bprec       float64
post        float64
change      float64
d4          float64
sf          float64
regtrans    float64
dtype: object

In [47]:
polity['ccode'].nunique()

194

In [48]:
polity['year'].min(), polity['year'].max()

(1800, 2018)

In [49]:
polity['polity2'].value_counts()

 10.0    2503
-7.0     1901
-10.0    1362
-6.0     1302
-3.0     1181
-9.0     1150
 8.0      812
-4.0      729
 9.0      684
-5.0      587
 7.0      577
 6.0      573
 4.0      547
-1.0      546
-8.0      518
 2.0      468
 5.0      461
 0.0      421
 1.0      381
-2.0      334
 3.0      288
Name: polity2, dtype: int64

In [50]:
polity.groupby('scode').agg({'year': ['min', 'max'], 'polity2': ['min', 'max', 'mean']})

Unnamed: 0_level_0,year,year,polity2,polity2,polity2
Unnamed: 0_level_1,min,max,min,max,mean
scode,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
AFG,1800,2018,-10.0,0.0,-6.336735
ALB,1914,2018,-9.0,9.0,-3.190476
ALG,1962,2018,-9.0,2.0,-4.631579
ANG,1975,2018,-7.0,0.0,-3.863636
ARG,1825,2018,-9.0,9.0,-0.639175
...,...,...,...,...,...
YPR,1967,1990,-8.0,-5.0,-6.916667
YUG,1921,1991,-10.0,2.0,-5.492958
ZAI,1960,2018,-9.0,5.0,-3.406780
ZAM,1964,2018,-9.0,7.0,-0.345455
