# Capstone 2: A/B Testing

## Dataset Sources:
### Sexual Violence in Armed Conflict (SVAC) dataset: measures reports of conflict-related sexual violence committed by armed actors during the years 1989-2015. http://www.sexualviolencedata.org/dataset/
### https://guides.ucf.edu/war/wardata
### Years 1998-2008: https://www.prio.org/Data/Armed-Conflict/Conflict-Site/
### Political Instability Task Force Worldwide Atrocities Dataset, January 2016 to February 2020 (.zip download): fewer than 10 cases for Yemen
### University of MD, Center for International Development & Conflict Management: https://www.dropbox.com/sh/g30wur1u0g8bppt/AAA01uvF4hEwYs8OU11wtnbea?dl=0


In [23]:
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Template for converting an Excel file to CSV:
# df = pd.read_excel('grokonez.xlsx')  # the sheet_name='...' parameter is optional
# df.to_csv('grokonez.csv', index=False)  # use index=True to also write the row index

### Data Key: http://www.sexualviolencedata.org/wp-content/uploads/2019/10/SVAC-2.0-Coding-Manual-FINAL.pdf

### Background & Variables: 
SAMPLE DURING ACTIVE CONFLICT YEARS:
Syria cases: (2485 - 2474) + (9206 - 9199) + (9246 - 9239) = 11 + 7 + 7 = 25
Yemen cases: 625 - 594 = 31

The SVAC Dataset covers conflict-related sexual violence committed by the following types of
armed conflict actors: government/state military, pro-government militias, and rebel/insurgent
forces between 1989-2009 (SVAC 1.0), and government/state military and rebel/insurgent forces
for 2010-2015 (SVAC 2.0). We include only sexual violence by armed groups against individuals
outside their own organization.

All actors listed in the SVAC 2.0 Dataset are involved in state-based conflicts as defined by the
UCDP/PRIO Armed Conflict Database. Peacekeeper and civilian perpetrators are not included as
actors in the dataset. We also do not include non-state actors (both rebel groups and PGMs)
involved in violence that is not part of a conflict with the government.

The original SVAC Dataset covered armed conflict active in the years 1989-2009, as defined by
the UCDP/PRIO Armed Conflict Database. The SVAC 2.0 update adds the years 2010-2015.
Therefore, the SVAC 2.0 Dataset includes all conflicts active in the years 1989-2015. The
additional six-year period includes 92 conflicts in 49 countries that were either active or within five
years of cessation. We collected data for all years of active conflict (defined by 25 battle deaths or
more per year) and for the five years post-conflict. Beyond this post-conflict time period,
“peacetime” sexual violence is outside the scope of the project. We also code “interim” years; see
definition of this variable below. 

Variables
For compatibility and ease of integration with widely used existing datasets, we include a number
of general variables on region, country, year, actor ID, type of actor, and conflict ID mostly from
the UCDP/PRIO data.
We use a monadic conflict-actor-year data structure; we rejected a dyadic structure because many
of the victims of sexual violence in armed conflicts are civilians, which does not lend itself easily
to a dyadic logic. By including variables with dyad ID and conflict ID, however, analysts may
create a dyadic structure. 

**Sexual Violence**
Following the definition used by the International Criminal Court (ICC), we use a definition of
crimes of sexual violence which includes (1) rape, (2) sexual slavery, (3) forced prostitution, (4)
forced pregnancy, and (5) forced sterilization/abortion. Following Elisabeth Wood (2009), we
also include (6) sexual mutilation, and (7) sexual torture. This definition does not exclude the
existence of female perpetrators and male victims. We focus on behaviors that involve direct force
and/or physical violence. We exclude acts that do not go beyond verbal sexual harassment and
abuse, including sexualized insults or verbal humiliation. 

actor_type (source: SVAC)
A coding for the type of actor. More specifically, we employ
the following scheme:
1: State or incumbent government (in UCDP dyadic, this actor type is called 'Side A')
2: State A2 (in UCDP dyadic, this actor type is called 'Side A2nd'). These are states supporting the state (1) involved with conflict on its territory. 
3: Rebel (in UCDP dyadic, the actor type is called 'Side B')
4: State supporting ‘Side B’ in other country (in UCDP dyadic, this actor type is called 'SideB2nd').
5: Second state in interstate conflict (in UCDP dyadic, this actor is called ‘Side B’).
6: Pro-government militias (PGMs)

type (source: UCDP/PRIO)
Nominal variable with three categories:
2: Interstate Conflict
3: Intrastate Conflict
4: Internationalized Internal Armed Conflict

~Syria is both types of conflict: Government & Territory
~Yemen is both types of conflict: Government & Territory
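
The `actor_type` and `type` codes listed above are easier to read in tables and plots once they are mapped to labels. The following is a minimal sketch on a toy frame; the dictionary names and the shortened label wording are illustrative assumptions, not part of the SVAC files.

```python
import pandas as pd

# Labels paraphrased from the coding scheme above (wording is an assumption).
ACTOR_TYPE_LABELS = {
    1: "State (Side A)",
    2: "State supporting Side A (Side A2nd)",
    3: "Rebel (Side B)",
    4: "State supporting Side B (Side B2nd)",
    5: "Second state in interstate conflict (Side B)",
    6: "Pro-government militia (PGM)",
}
CONFLICT_TYPE_LABELS = {
    2: "Interstate",
    3: "Intrastate",
    4: "Internationalized internal",
}

# Toy rows standing in for the real dataset.
toy = pd.DataFrame({"actor_type": [1, 3, 6], "type": [3, 3, 4]})
toy["actor_type_label"] = toy["actor_type"].map(ACTOR_TYPE_LABELS)
toy["type_label"] = toy["type"].map(CONFLICT_TYPE_LABELS)
print(toy)
```

Codes that are missing from a dictionary come back as NaN from `.map`, which is a quick way to spot unexpected values.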

**Sexual Violence Variables**
The sexual violence variables aim to capture data on two dimensions, prevalence and form.
(1) Prevalence
The prevalence measure gives an estimate of the relative magnitude of sexual violence perpetration
by the conflict actor in the particular year. This is coded according to an ordinal scale, adapted
from Cohen (2010; 2016) and discussed in Cohen and Nordås (2014). Note that the coding is
primarily based on the qualitative description; only secondarily do we rely on a count of estimated
incidents. The SVAC dataset cannot be used as a means to estimate the numbers of victims.
Prevalence = 3 (Massive) Sexual violence is likely related to the conflict, and:
• Sexual violence was described as “systematic” or “massive” or “innumerable”
• Actor used sexual violence as a “means of intimidation,” “instrument of control and
punishment,” “weapon,” “tactic to terrorize the population,” “terror tactic,” “tool of war,”
on a “massive scale”
Note: Absent these or similar terms, a count of 1000 or more reports of sexual violence
indicates a prevalence code of 3.
Prevalence = 2 (Numerous) Sexual violence is likely related to the conflict, but did not meet the
requirements for a 3 coding, and:
• Sexual violence was described as “widespread,” “common,” “commonplace,” “extensive,”
“frequent,” “often,” “persistent,” “recurring,” a “pattern,” a “common pattern,” or a “spree”
• Sexual violence occurred “commonly,” “frequently,” “in large numbers,” “periodically,”
“regularly,” “routinely,” “widely,” or on a “number of occasions;” there were “many” or
“numerous instances”
Note: Absent these or similar terms, a count of 25-999 reports of sexual violence indicates a
prevalence code of 2.
Prevalence = 1 (Isolated) Sexual violence is likely related to the conflict, but did not meet the
requirements for a 2 or 3 coding, and:
• There were “reports,” “isolated reports,” or “there continued to be reports” of occurrences
of sexual violence
Note: Absent these or similar terms, a count of less than 25 reports of sexual violence indicates
a prevalence code of 1.
Prevalence = 0 (None) Report issued, but no mention of rape or other sexual violence related to
the conflict
Note: For example, a coder finds a report covering a country in a given year but within the report
there is no mention of rape or other sexual violence related to the conflict.
Prevalence = -99 (BOTH No Report AND No Information) No report found and no data
available from subsequent years, and consequentially no data. This code should be used as
infrequently as possible.
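
The count-based fallback rules in the notes above (used only when the qualitative terms are absent) can be written as a small function. This is a sketch of those thresholds only; the function name is hypothetical, and mapping a count of zero to code 0 is an assumption, since the manual defines 0 as a report with no mention of sexual violence.

```python
def prevalence_from_count(n_reports: int) -> int:
    """Fallback prevalence code from a count of reports (qualitative terms absent)."""
    if n_reports >= 1000:
        return 3  # Massive
    if n_reports >= 25:
        return 2  # Numerous (25-999 reports)
    if n_reports >= 1:
        return 1  # Isolated (fewer than 25 reports)
    return 0      # Assumption: no reports of sexual violence in an issued report


for n in (0, 10, 25, 999, 1000):
    print(n, prevalence_from_count(n))
```

The real coding relies primarily on the qualitative description, so this function cannot reproduce the dataset's codes on its own.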

In [24]:
# Read the SVAC 2.0 workbook (SVAC_2.0_complete_X.1.xlsx) straight from the project site.
# With the default sheet_name, pd.read_excel returns a single DataFrame.
df_xls = pd.read_excel('http://www.sexualviolencedata.org/wp-content/uploads/2019/12/SVAC_2.0_complete_X.1.xlsx')

# Optional: save a CSV copy for faster reloads.
# df_xls.to_csv('SVAC_2.0_complete_X.1.csv', index=False)

In [16]:
df_xls

Unnamed: 0,year,conflictid_old,conflictid_new,actor,actorid,actorid_new,actor_type,type,incomp,region,...,gwnoloc2,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev,form
0,1989,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,0,1,0,0.0,0.0,0.0,-99
1,1990,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,1,0,0,0.0,0.0,0.0,-99
2,1991,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,0,1,0,0.0,0.0,0.0,-99
3,1992,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,0,1,0,0.0,0.0,0.0,-99
4,1993,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,1,0,0,0.0,0.0,0.0,-99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9277,2015,298,13694,Libya,620,111.0,1,3,1,4,...,0,0,0,1,0,0,0.0,0.0,1.0,7
9278,2015,13721,13721,Algeria,615,109.0,1,3,1,4,...,0,0,0,1,0,0,0.0,0.0,0.0,-99
9279,2015,13721,13721,Jund al-Khilafah,5870,5870.0,3,3,1,4,...,0,0,0,1,0,0,0.0,0.0,0.0,-99
9280,2015,13902,13902,IS,1076,234.0,3,3,1,2,...,0,0,0,1,0,0,0.0,0.0,0.0,-99


In [90]:
df.head()

Unnamed: 0,year,conflictid_old,conflictid_new,actor,actorid,actorid_new,actor_type,type,incomp,region,...,gwnoloc2,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev,form
0,1989,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,0,1,0,0.0,0.0,0.0,-99
1,1990,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,1,0,0,0.0,0.0,0.0,-99
2,1991,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,0,1,0,0.0,0.0,0.0,-99
3,1992,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,0,1,0,0.0,0.0,0.0,-99
4,1993,6,205,Iran,630,114.0,1,3,1,2,...,0,0,0,1,0,0,0.0,0.0,0.0,-99


## Hypothesis
Hypothesis: Countries with more actors of type "" will have higher rates of form or prevalence of sexual violence against women during government and territorial conflict.

Notes and open questions:
• Create a new index: a column that combines both columns (violence and type).
• What other possible weak points are there?
• Is there a statistical difference between countries with more actors and countries with fewer actors?
• Look at cases relative to population: the count of cases, the proportion of cases/population, and whether that proportion is larger than the average.
• Cases per person for Syria: is it significantly larger than the average cases per person across all other cases?
• Focus across 3 years.

Null: There will be no difference between Test A (Syria in state) and Test B (rest of world) in rates of form or prevalence of sexual violence against women during government and territorial conflict.
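
One way to test this null hypothesis is sketched below. Because prevalence is an ordinal 0-3 scale, the sketch swaps in a Mann-Whitney U test rather than a t-test; that choice, the toy values, and the use of `state_prev` as the outcome are all assumptions for illustration, not results from the SVAC data.

```python
import pandas as pd
from scipy import stats

# Toy prevalence codes: made-up values, NOT taken from the SVAC dataset.
toy = pd.DataFrame({
    "location": ["Syria"] * 6 + ["Other"] * 10,
    "state_prev": [2, 3, 2, 1, 3, 2, 0, 1, 0, 0, 1, 2, 0, 0, 1, 0],
})

group_a = toy.loc[toy["location"] == "Syria", "state_prev"]   # Test A
group_b = toy.loc[toy["location"] != "Syria", "state_prev"]   # Test B

# Two-sided rank-based test: do the two groups differ in prevalence codes?
u_stat, p_value = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")
print(f"U = {u_stat}, p = {p_value:.4f}")
```

A small p-value would argue against the null; with the real data, rows coded -99 would need to be excluded first, since that code means "no report" rather than a prevalence level.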

In [91]:
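# pd.read_excel already returns a DataFrame when a single sheet is read (it returns a dict of DataFrames only with sheet_name=None),
# so this wrap is just a safeguard: https://stackoverflow.com/questions/45489205/why-is-df-head-not-working-in-python
df = pd.DataFrame(df_xls)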

In [92]:
df.shape

(9282, 22)

In [93]:
np.median(df['actor_type'])
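# The median actor_type code is 3: Rebel (in UCDP dyadic, the actor type is called 'Side B').
# actor_type is a nominal code, so the median is only a rough indicator; the mode is the more meaningful summary.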

3.0

In [94]:
import statistics
statistics.mode(df['actor_type'])
# The most frequent actor_type in the dataset is 3: Rebel (in UCDP dyadic, the actor type is called 'Side B')

3

In [95]:
df.describe()

Unnamed: 0,year,conflictid_old,conflictid_new,actorid,actorid_new,actor_type,type,incomp,region,gwnoloc,gwnoloc2,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev
count,9282.0,9282.0,9282.0,9282.0,9263.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9282.0,9272.0,9272.0,9272.0
mean,2002.067981,153.114738,825.89722,2389.854234,1927.501673,3.199526,3.340228,1.625943,3.193601,570.712346,22.216548,0.51713,0.005171,0.55376,0.080155,0.365869,-3.258412,-2.458477,-12.588654
std,7.562547,379.218092,2337.120281,2956.067713,3186.116762,1.79124,0.543304,0.490319,1.060659,214.027277,120.691644,10.157248,0.101572,0.497128,0.271548,0.481699,17.960517,15.605964,33.064896
min,1989.0,6.0,205.0,2.0,3.0,1.0,2.0,0.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,-99.0,-99.0,-99.0
25%,1996.0,92.0,289.0,530.0,97.0,2.0,3.0,1.0,3.0,475.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,2002.0,137.0,333.0,1119.0,277.0,3.0,3.0,2.0,3.0,628.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,2008.0,193.0,388.0,1646.0,785.0,5.0,4.0,2.0,4.0,750.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
max,2015.0,13902.0,13902.0,8185.0,8185.0,6.0,4.0,2.0,5.0,910.0,811.0,200.0,2.0,1.0,1.0,1.0,3.0,3.0,3.0


## Missing Data:
### We can see that 19 records are missing from the 'actorid_new' column (count 9263 of 9282) -- we don't know which actors are unaccounted for in those 19 rows. Also, we are missing 10 records in each of the last 3 numeric columns (state_prev, ai_prev, hrw_prev; count 9272 of 9282).
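
The missing counts above were read off the `count` row of `df.describe()`. A more direct check is sketched below on a toy frame with made-up values; it also counts the -99 code separately, because -99 ("no report, no information") is stored as a number and so is not reported as missing by pandas.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame; the values are invented for illustration.
toy = pd.DataFrame({
    "actorid_new": [114.0, np.nan, 234.0, 111.0],
    "state_prev": [0.0, -99.0, 2.0, np.nan],
    "hrw_prev": [-99.0, 1.0, 0.0, np.nan],
})
prev_cols = ["state_prev", "hrw_prev"]

missing = toy.isna().sum()                 # true NaN values per column
no_report = (toy[prev_cols] == -99).sum()  # rows carrying the -99 sentinel

print(missing)
print(no_report)
```

On the real data the same two lines would run on `df`, with `ai_prev` added to the list of prevalence columns.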


In [96]:
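# Empty placeholder frame for the Syria/Yemen comparison.
# The original 'year'[2011:2017] sliced the string 'year' and produced an empty column name;
# the 2011-2017 year filter has to be applied to the rows of df, not to the column label.
df2 = pd.DataFrame(
    columns=['year',
             'actor_type',
             'location', 'type'],
    index=['Syria', 'Yemen'])
df2


Unnamed: 0,year,actor_type,location,type
Syria,,,,
Yemen,,,,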


In [97]:
# groupby supports built-in aggregations such as .sum() or .count(); for anything else, use .aggregate(func), e.g. with a NumPy function.
df.groupby('location').count()

Unnamed: 0_level_0,year,conflictid_old,conflictid_new,actor,actorid,actorid_new,actor_type,type,incomp,region,...,gwnoloc2,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev,form
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
Afghanistan,823,823,823,823,823,823,823,823,823,823,...,823,823,823,823,823,823,823,823,823,823
Algeria,156,156,156,156,156,156,156,156,156,156,...,156,156,156,156,156,156,156,156,156,156
Angola,152,152,152,152,152,152,152,152,152,152,...,152,152,152,152,152,152,152,152,152,152
"Australia, Iraq, United Kingdom, United States of America",24,24,24,24,24,24,24,24,24,24,...,24,24,24,24,24,24,24,24,24,24
Azerbaijan,76,76,76,76,76,76,76,76,76,76,...,76,76,76,76,76,76,76,76,76,76
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uzbekistan,34,34,34,34,34,34,34,34,34,34,...,34,34,34,34,34,34,34,34,34,34
Venezuela,12,12,12,12,12,12,12,12,12,12,...,12,12,12,12,12,12,12,12,12,12
Yemen,14,14,14,14,14,14,14,14,14,14,...,14,14,14,14,14,14,14,14,14,14
Yemen (North Yemen),32,32,32,32,32,32,32,32,32,32,...,32,32,32,32,32,32,32,32,32,32


In [98]:
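# numeric_only=True keeps text columns such as 'actor' and 'form' out of the sum.
# Sums of ID columns and of the -99 code are not meaningful on their own.
df.groupby('location').sum(numeric_only=True)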

Unnamed: 0_level_0,year,conflictid_old,conflictid_new,actorid,actorid_new,actor_type,type,incomp,region,gwnoloc,gwnoloc2,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Afghanistan,1651023,113355,327275,1170040,856652.0,2245,3253,1642,2469,576100,0,0,0,638,37,147,20.0,-5337.0,-4147.0
Algeria,312371,56856,86886,528370,460572.0,549,543,310,624,95940,0,0,0,119,0,37,3.0,-389.0,-2770.0
Angola,303878,23694,53424,369895,295672.0,456,546,242,608,82080,0,0,0,73,30,49,24.0,-186.0,-681.0
"Australia, Iraq, United Kingdom, United States of America",48132,5424,10080,10482,1818.0,96,48,48,72,21600,15480,4800,48,4,0,20,0.0,3.0,2.0
Azerbaijan,152093,14828,29648,53474,13436.0,192,284,96,76,28348,0,0,0,27,4,45,-297.0,2.0,-295.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uzbekistan,68142,7514,14110,30406,7456.0,70,112,68,102,23936,0,0,0,8,6,20,3.0,0.0,5.0
Venezuela,23934,960,3324,10224,4560.0,24,36,24,60,1212,0,0,0,2,0,10,2.0,-198.0,-494.0
Yemen,27976,2550,5284,13669,3860.0,28,42,16,28,9492,0,0,0,4,0,10,0.0,0.0,-792.0
Yemen (North Yemen),64434,1576,34190,32503,15920.0,85,122,62,64,21696,0,0,0,31,0,1,0.0,-296.0,0.0


In [99]:
# Per-country medians of the numeric columns; the -99 "no report" code is still counted as a value here.
df.groupby('location').median(numeric_only=True)

Unnamed: 0_level_0,year,conflictid_old,conflictid_new,actorid,actorid_new,actor_type,type,incomp,region,gwnoloc,gwnoloc2,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1
Afghanistan,2008.0,137.0,333.0,385.0,67.0,2.0,4.0,2.0,3.0,700.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
Algeria,2002.0,191.0,386.0,1390.0,538.0,3.0,3.0,2.0,4.0,615.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
Angola,1999.0,131.0,327.0,1392.0,540.0,3.0,4.0,2.0,4.0,540.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Australia, Iraq, United Kingdom, United States of America",2005.5,226.0,420.0,422.5,72.0,5.0,2.0,2.0,3.0,900.0,645.0,200.0,2.0,0.0,0.0,1.0,0.0,0.0,0.0
Azerbaijan,1998.0,193.0,388.0,373.0,64.0,3.0,4.0,1.0,1.0,373.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Uzbekistan,2004.0,221.0,415.0,704.0,133.0,2.0,3.0,2.0,3.0,704.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Venezuela,1994.5,80.0,277.0,852.0,380.0,2.0,3.0,2.0,5.0,101.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
Yemen,1997.0,207.0,402.0,937.0,238.0,2.0,3.0,1.0,2.0,678.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,-99.0
Yemen (North Yemen),2014.5,33.0,230.0,678.0,123.0,3.0,4.0,2.0,2.0,678.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
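
The Yemen row above shows a median `hrw_prev` of -99, which is the "no report" code rather than a prevalence level. A minimal sketch of masking that sentinel before aggregating follows, using toy values that are assumptions and not SVAC data.

```python
import numpy as np
import pandas as pd

# Toy rows: two 'no report' codes and one real prevalence code for Yemen.
toy = pd.DataFrame({
    "location": ["Yemen", "Yemen", "Yemen", "Angola", "Angola"],
    "hrw_prev": [-99.0, -99.0, 1.0, 0.0, 2.0],
})

# Median with the sentinel left in: dominated by -99 for Yemen.
raw_median = toy.groupby("location")["hrw_prev"].median()

# Replace -99 with NaN so pandas skips it when aggregating.
masked = toy.assign(hrw_prev=toy["hrw_prev"].replace(-99, np.nan))
clean_median = masked.groupby("location")["hrw_prev"].median()

print(raw_median)
print(clean_median)
```

Whether to drop or keep the -99 rows depends on the question; for comparing prevalence levels they carry no information about magnitude.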


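### The coding scheme above defines 6 'actor_type' codes (1-6, matching the min and max in df.describe()), and the SVAC definition lists 7 forms of sexual violence. The number of distinct values actually present in the 'actor_type' and 'form' columns still needs to be checked directly.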
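
Counts of distinct categories can be checked directly instead of being read off summary tables. The sketch below uses a toy frame; the `form` strings in it (including the comma-separated combination) are assumptions about how that column might be stored, not confirmed values.

```python
import pandas as pd

# Toy values; the real 'form' column may be coded differently.
toy = pd.DataFrame({
    "actor_type": [1, 3, 3, 6, 2],
    "form": ["-99", "1", "1, 7", "-99", "7"],
})

n_actor_types = toy["actor_type"].nunique()           # distinct actor codes present
form_counts = toy["form"].astype(str).value_counts()  # frequency of each form value

print(n_actor_types)
print(form_counts)
```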

In [None]:
### Need to parse out Syria subset, or 'Group A', aside from the global sample between years 2011-2015. 
#The global sample is 'Group B'. Need a second dataframe to organize this group A. 
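
A possible way to build the two groups described in the comments above is sketched here on a toy frame. It assumes the `location` column contains the exact string "Syria" and that 2011-2015 is the window of interest; both should be verified against the real data.

```python
import pandas as pd

# Toy rows standing in for df; the values are invented for illustration.
toy = pd.DataFrame({
    "location": ["Syria", "Syria", "Yemen", "Iraq", "Syria", "Libya"],
    "year": [2010, 2012, 2013, 2014, 2015, 2016],
    "state_prev": [0, 2, 1, 1, 3, 0],
})

in_window = toy["year"].between(2011, 2015)  # inclusive on both ends
is_syria = toy["location"] == "Syria"

group_a = toy[in_window & is_syria]    # Group A: Syria, 2011-2015
group_b = toy[in_window & ~is_syria]   # Group B: rest of the sample, same years

print(len(group_a), len(group_b))
```

Keeping both groups in the same year window avoids comparing Syria's conflict years against decades of unrelated observations.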