# A/B Testing: Conflict Countries' Rate of Sexual Violence

## Introduction
Since the Vietnam War, conflicts have taken many forms and involved different actors that develop within a country or organize as rebel forces that gain support from other countries, unwittingly or not. In 2010, Syria was an authoritarian regime with a population of 20.4 million split between urban and rural areas. Since its 2011 political uprisings turned into a proxy war, Syria's population has fallen to an estimated 17.3 million because the conflict turned violent, causing over 5 million people to seek refuge. [Source: https://www.worldometers.info/world-population/syria-population/] One could argue that Syria's conflict has escalated to an extent that warrants more investigation into war crimes, such as assassinations, chemical weapons use, and state-implemented sexual violence, than other similar conflicts. Aside from the Iraq War in 2004, the Syrian conflict represents the worst violence, not just in the region but globally, since the Balkan Crisis of the 1990s.

## Background
For example, the Balkans Crisis resulted from the dissolution of Yugoslavia, where six republics simultaneously struggled for independence and sought their own states. However, this struggle devolved into certain states implementing ethnic cleansing measures to influence which countries succeeded in asserting independence based on population numbers. As a result, certain states carried out human rights abuses and sexual violence crimes, which escalated into war crimes to the point of warranting global intervention by the North Atlantic Treaty Organization (NATO). Even NATO intervention did not curtail the abuses, as other countries, like Russia, intervened by supplying military assistance.

Between 1992 and 1995, the conflict erupted into outright genocide, resulting in the death of over 200,000 people, 2.3 million people fleeing their homes (the largest modern refugee crisis since World War II), and a record number of sexually based offenses, according to the international nonprofit the Borgen Project and the United Nations High Commissioner for Refugees. [Sources: https://borgenproject.org/10-facts-about-the-bosnian-genocide/ and https://www.unhcr.org/3ae6a0c58.pdf]

Since the dissolution of Yugoslavia, several conflicts have erupted globally. In particular, the Syrian conflict has also morphed into a refugee crisis that mirrors the internal and external dynamics witnessed after Yugoslavia's dissolution morphed into the Balkans Crisis, followed by the Bosnian Genocide. The Syrian situation repeats the 1990s catastrophe to the extent that the human rights abuses, proxy actors, food shortages, and sexually based offenses carried out by state and non-state actors have produced 5.27 million Syrian refugees. [Source: https://data2.unhcr.org/en/situations/syria] This is more than double the number who fled their homes in the Balkans conflict. Thus, we will test whether the Syrian Crisis outpaces other conflict countries in the same time period, 2011 to 2015, in sexual violence crimes, because sexual violence is recognized as a war crime, not just a casualty of war.

## Variables to Consider
Working expectation: countries hosting more actors of the "Rebel" type will show higher rates (form or prevalence) of sexual violence against women during government and territorial conflicts. To test this, create a new index: a column that combines both columns (violence and type). Possible weak points and alternatives to consider: the proportion of cases larger than the average, the count of cases, and the proportion of cases relative to population. Focus across 3 years.

Types of conflict range between government and territorial.

## Methodology
1) Review the dataset for descriptive statistics, missing values, and time period to select "Group A" and "Group B" for our hypothesis.
    a) We considered selecting Syria and Yemen for A/B testing; however, the sample size for both groups was relatively small.
        Syria cases: (Rows: 2485-2474) + (Rows: 9206-9199) + (Rows: 9246-9239) = 11 + 7 + 7 = 25 cases
        Yemen cases: (Rows: 625-594) = 31 cases
    b) We selected Syria cases between 2011 and 2015 for 'Group A'. For 'Group B', we selected all other conflict countries in the same time period to control for other global dynamics and intervening power players. 'Group A' is represented by the subset dataframe df_Syria and 'Group B' by the subset dataframe df_world.

2) Identify the variable to be compared between groups in our hypothesis using the t-test for A/B testing. This variable will combine the different ways the data set captures sexual violence: the type, or 'form'; the occurrence, also captured in 'form'; and the number of authentic and valid watchdog reports.

3) Create column to add country population values. Name column 'population' because we need to normalize the "Rate of Sexual Violence", which will be labeled as 'sex_violence'. For 'population' column, create a dictionary to populate the 'population' column. 

    a) Consider cases per person for Syria: is it significantly larger than the average cases per person across all other cases?
    b) Consider the population challenge. We will pull population statistics from the Food and Agriculture Organization site: http://www.fao.org/faostat/en/#data/OA. Because population can grow or fall drastically during a conflict, and our test covers the years 2011 to 2015, we will use 2013, the midpoint of that range, to fill the population column. We will then "merge", rather than concatenate, the two data frames pulled from the data sets.

4) Create the Sexual Violence Index, the variable identified in step 2, by totaling the reports and counting the incidences of sexual crime types for the Syria subset, 'Group A'.
    a) Add up the international organizations' filed reports of incidences. These are captured by the top three organizations and extracted from the columns for the Department of State (label: 'state_prev'), Amnesty International (label: 'ai_prev'), and Human Rights Watch (label: 'hrw_prev'). Each organization filed a report documenting whether sexual violence was committed by any actor type (label: 'actor_type') and the form of sexual violence observed (label: 'form').
    b) Multiply the number of 'form' entries by the total reports and divide by the country population to normalize the comparisons.

5) Repeat step 4 for 'Group B', the world subset. 

6) Conduct an 'Independent Samples' t-test on the index, because we are comparing the means of 'Group A' and 'Group B', referring to this two-tailed t-test chart: https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f
We compare the two groups to check whether the null hypothesis holds. The following represent our assumptions:
    a) The samples are independently and randomly drawn.
    b) The distribution of the residuals between the two groups follows the normal distribution.
    c) The variances of the two groups are equal.
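
Assumption (c) can be checked before choosing the test. The following is a minimal sketch on synthetic placeholder samples, not the SVAC data: Levene's test from `scipy.stats` checks for equal variances, and `ttest_ind(..., equal_var=False)` runs Welch's t-test, which does not require them. The sample sizes, the random seed, and the 0.05 cutoff are assumptions chosen for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic placeholder samples (NOT the SVAC data): a small and a large group.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=1.0, scale=2.0, size=25)
group_b = rng.normal(loc=1.0, scale=1.0, size=1000)

# Levene's test: the null hypothesis is that both groups have equal variances.
levene_stat, levene_p = stats.levene(group_a, group_b)

# If equal variances look doubtful, fall back to Welch's t-test.
equal_var = bool(levene_p >= 0.05)
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=equal_var)

print(f"Levene p = {levene_p:.4f}, equal_var = {equal_var}")
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```

With groups as unbalanced as Syria versus the rest of the world, the equal-variance choice can noticeably change the result, so it is worth reporting which variant was used.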

## Data Code Book

### Data Key: http://www.sexualviolencedata.org/wp-content/uploads/2019/10/SVAC-2.0-Coding-Manual-FINAL.pdf

### Variables: 
SAMPLE DURING ACTIVE CONFLICT YEARS:

The SVAC Dataset covers conflict-related sexual violence committed by the following types of
armed conflict actors: government/state military, pro-government militias, and rebel/insurgent
forces between 1989-2009 (SVAC 1.0), and government/state military and rebel/insurgent forces
for 2010-2015 (SVAC 2.0). We include only sexual violence by armed groups against individuals
outside their own organization.

All actors listed in the SVAC 2.0 Dataset are involved in state-based conflicts as defined by the
UCDP/PRIO Armed Conflict Database. Peacekeeper and civilian perpetrators are not included as
actors in the dataset. We also do not include non-state actors (both rebel groups and PGMs)
involved in violence that is not part of a conflict with the government.

The original SVAC Dataset covered armed conflict active in the years 1989-2009, as defined by
the UCDP/PRIO Armed Conflict Database. The SVAC 2.0 update adds the years 2010-2015.
Therefore, the SVAC 2.0 Dataset includes all conflicts active in the years 1989-2015. The
additional six-year period includes 92 conflicts in 49 countries that were either active or within five
years of cessation. We collected data for all years of active conflict (defined by 25 battle deaths or
more per year) and for the five years post-conflict. Beyond this post-conflict time period,
“peacetime” sexual violence is outside the scope of the project. We also code “interim” years; see
definition of this variable below. 

Variables
For compatibility and ease of integration with widely used existing datasets, we include a number
of general variables on region, country, year, actor ID, type of actor, and conflict ID mostly from
the UCDP/PRIO data.
We use a monadic conflict-actor-year data structure; we rejected a dyadic structure because many
of the victims of sexual violence in armed conflicts are civilians, which does not lend itself easily
to a dyadic logic. By including variables with dyad ID and conflict ID, however, analysts may
create a dyadic structure. 

Sexual Violence**
Following the definition used by the International Criminal Court (ICC), we use a definition of
crimes of sexual violence which includes (1) rape, (2) sexual slavery, (3) forced prostitution, (4)
forced pregnancy, and (5) forced sterilization/abortion. Following Elisabeth Wood (2009), we
also include (6) sexual mutilation, and (7) sexual torture. This definition does not exclude the
existence of female perpetrators and male victims. We focus on behaviors that involve direct force
and/or physical violence. We exclude acts that do not go beyond verbal sexual harassment and
abuse, including sexualized insults or verbal humiliation. 

actor_type 
Source: SVAC. A coding for the type of actor. More specifically, we employ
the following scheme:
1: State or incumbent government (in UCDP dyadic, this actor type is called 'Side A')
2: State A2 (in UCDP dyadic, this actor type is called 'Side A2nd'). These are states supporting the state (1) involved with conflict on its territory. 
3: Rebel (in UCDP dyadic, the actor type is called 'Side B')
4: State supporting ‘Side B’ in other country (in UCDP dyadic, this actor type is called 'SideB2nd').
5: Second state in interstate conflict (in UCDP dyadic, this actor is called ‘Side B’).
6: Pro-government militias (PGMs)

type 
Source: UCDP/PRIO. Nominal variable with three categories:
2: Interstate Conflict
3: Intrastate Conflict
4: Internationalized Internal Armed Conflict

~Syria is both types of conflict: Government & Territory
~Yemen is both types of conflict: Government & Territory

**Sexual Violence Variables
The sexual violence variables aim to capture data on two dimensions, prevalence and form.
(1) Prevalence
The prevalence measure gives an estimate of the relative magnitude of sexual violence perpetration
by the conflict actor in the particular year. This is coded according to an ordinal scale, adapted
from Cohen (2010; 2016) and discussed in Cohen and Nordås (2014). Note that the coding is
primarily based on the qualitative description; only secondarily do we rely on a count of estimated
incidents. The SVAC dataset cannot be used as a means to estimate the numbers of victims.
Prevalence = 3 (Massive) Sexual violence is likely related to the conflict, and:
• Sexual violence was described as “systematic” or “massive” or “innumerable”
• Actor used sexual violence as a “means of intimidation,” “instrument of control and
punishment,” “weapon,” “tactic to terrorize the population,” “terror tactic,” “tool of war,”
on a “massive scale”
Note: Absent these or similar terms, a count of 1000 or more reports of sexual violence
indicates a prevalence code of 3.
Prevalence = 2 (Numerous) Sexual violence is likely related to the conflict, but did not meet the
requirements for a 3 coding, and:
• Sexual violence was described as “widespread,” “common,” “commonplace,” “extensive,”
“frequent,” “often,” “persistent,” “recurring,” a “pattern,” a “common pattern,” or a “spree”
• Sexual violence occurred “commonly,” “frequently,” “in large numbers,” “periodically,”
“regularly,” “routinely,” “widely,” or on a “number of occasions;” there were “many” or
“numerous instances”
Note: Absent these or similar terms, a count of 25-999 reports of sexual violence indicates a
prevalence code of 2.
Prevalence = 1 (Isolated) Sexual violence is likely related to the conflict, but did not meet the
requirements for a 2 or 3 coding, and:
• There were “reports,” “isolated reports,” or “there continued to be reports” of occurrences
of sexual violence
Note: Absent these or similar terms, a count of less than 25 reports of sexual violence indicates
a prevalence code of 1.
Prevalence = 0 (None) Report issued, but no mention of rape or other sexual violence related to
the conflict
Note: For example, a coder finds a report covering a country in a given year but within the report
there is no mention of rape or other sexual violence related to the conflict.
Prevalence = -99 (BOTH No Report AND No Information) No report found and no data
available from subsequent years, and consequentially no data. This code should be used as
infrequently as possible.
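
The count-based fallback rules quoted above can be written as a small function. This is only a sketch of the "absent these or similar terms" thresholds; the codebook codes primarily from qualitative descriptions, which this function ignores. The function name `prevalence_from_count` is hypothetical, and mapping a count of zero to code 0 is an assumption.

```python
def prevalence_from_count(report_count):
    """Fallback prevalence code based only on a count of reports.

    Mirrors the count thresholds quoted from the codebook. Mapping a count
    of 0 to code 0 (None) is an assumption made for this sketch.
    """
    if report_count < 0:
        raise ValueError("report_count must be non-negative")
    if report_count >= 1000:
        return 3  # Massive
    if report_count >= 25:
        return 2  # Numerous
    if report_count >= 1:
        return 1  # Isolated
    return 0      # None (assumed for a count of zero)


for n in (0, 3, 25, 999, 1000):
    print(n, prevalence_from_count(n))
```

Because the scale is ordinal, the resulting codes rank actors by magnitude but are not counts of victims or incidents.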

## Hypothesis
Hypothesis: There will be a statistically significant difference between 'Group A' (Syria) and 'Group B' (world) in rates of sexual violence against women during conflict.

Null Hypothesis: There will be no difference between 'Group A' (Syria) and 'Group B' (rest of world) in rates (form or prevalence) of sexual violence against women during conflict. That is, the means of 'Group A' and 'Group B' will be equal.
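
In symbols, writing $\mu_A$ and $\mu_B$ for the mean index of 'Group A' and 'Group B' (notation introduced here for illustration), the two-sided test can be stated as:

$$H_0: \mu_A = \mu_B \qquad \text{versus} \qquad H_1: \mu_A \neq \mu_B$$

Under the equal-variance assumption listed in the methodology, the independent-samples t-statistic is:

$$t = \frac{\bar{x}_A - \bar{x}_B}{s_p \sqrt{\dfrac{1}{n_A} + \dfrac{1}{n_B}}}, \qquad s_p^2 = \frac{(n_A - 1)\,s_A^2 + (n_B - 1)\,s_B^2}{n_A + n_B - 2}$$

Here $\bar{x}$, $s^2$, and $n$ are each group's sample mean, sample variance, and size, and the statistic has $n_A + n_B - 2$ degrees of freedom.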

## Code for Data Analysis

In [15]:
import numpy as np
import pandas as pd
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [16]:
# df_xls holds the SVAC 2.0 data set; df_xls2 is a second data set with country populations.
# The two data frames are merged on 'location' later in the notebook.
df_xls = pd.read_excel('http://www.sexualviolencedata.org/wp-content/uploads/2019/12/SVAC_2.0_complete_X.1.xlsx', dtype='object')
df_xls2 = pd.read_excel('/Users/mehrunisaqayyum/Downloads/population by country.xlsx', header=1)

# Optional: save a CSV copy of the SVAC data.
# df_xls.to_csv('SVAC_2.0_complete_X.1.csv', index=False)

In [17]:
df_xls2 = df_xls2[['location', 'population']]

In [18]:
df_xls

Unnamed: 0,year,conflictid_old,conflictid_new,actor,actorid,actorid_new,actor_type,type,incomp,region,...,gwnoloc2,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev,form
0,1989,6,205,Iran,630,114,1,3,1,2,...,0,0,0,0,1,0,0,0,0,-99
1,1990,6,205,Iran,630,114,1,3,1,2,...,0,0,0,1,0,0,0,0,0,-99
2,1991,6,205,Iran,630,114,1,3,1,2,...,0,0,0,0,1,0,0,0,0,-99
3,1992,6,205,Iran,630,114,1,3,1,2,...,0,0,0,0,1,0,0,0,0,-99
4,1993,6,205,Iran,630,114,1,3,1,2,...,0,0,0,1,0,0,0,0,0,-99
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9277,2015,298,13694,Libya,620,111,1,3,1,4,...,0,0,0,1,0,0,0,0,1,7
9278,2015,13721,13721,Algeria,615,109,1,3,1,4,...,0,0,0,1,0,0,0,0,0,-99
9279,2015,13721,13721,Jund al-Khilafah,5870,5870,3,3,1,4,...,0,0,0,1,0,0,0,0,0,-99
9280,2015,13902,13902,IS,1076,234,3,3,1,2,...,0,0,0,1,0,0,0,0,0,-99


### Population Data
Reorder the dataframe (df_xls) by country name, or 'location', to list population by country more efficiently. Note that `ascending=False` sorts in reverse alphabetical order. We pulled population data from the UN data site. We selected the year 2013 because it is the midpoint of the years 2011 to 2015.

In [19]:
df_xls = df_xls.sort_values(by="location", ascending=False)
print(df_xls)

      year conflictid_old conflictid_new                           actor  \
7305  1998            202            397  Krajina Militia AKA Marticevci   
6700  1992            190            385                        Chetniks   
7789  1998            218            412      Beli Orlovi (White Eagles)   
7790  1999            218            412      Beli Orlovi (White Eagles)   
7791  2000            218            412      Beli Orlovi (White Eagles)   
...    ...            ...            ...                             ...   
4315  1998            137            333    Hizb-i Islami-yi Afghanistan   
4316  1999            137            333    Hizb-i Islami-yi Afghanistan   
4317  2000            137            333    Hizb-i Islami-yi Afghanistan   
4318  2002            137            333    Hizb-i Islami-yi Afghanistan   
4641  2006            137            333                        Portugal   

     actorid actorid_new actor_type type incomp region  ... gwnoloc2 gwnoloc3  \
7305  

### Combine Data Sets
We will create a merged data frame, "merged_df", that combines both data frames. The concatenate function will not serve this purpose because the two data sets list different rows, so the 'location' column would not line up next to each country's population. Instead, we will use the "merge" function. Below is the second data set, the population data set named df_xls2, which lists populations only for countries undergoing conflict between the years 2011 and 2015 for the purpose of testing our hypothesis above.

In [6]:
df_xls2

Unnamed: 0,location,population
0,Afghanistan,32269589
1,Algeria,38140133
2,Angola,26015781
3,Azerbaijan,9385468
4,Bangladesh,152761418
...,...,...
77,United States of America,316400538
78,Uzbekistan,29932631
79,Venezuela,29871040
80,Yemen,25147109


In [20]:
merged_df = pd.merge(df_xls, df_xls2, how='inner', left_on='location', right_on='location')

In [8]:
merged_df

Unnamed: 0,year,conflictid_old,conflictid_new,actor,actorid,actorid_new,actor_type,type,incomp,region,...,gwnoloc3,gwnoloc4,conflictyear,interm,postc,state_prev,ai_prev,hrw_prev,form,population
0,2013,33,230,AQAP,1747,881,3,4,2,2,...,0,0,1,0,0,0,-99,0,-99,10000000
1,2012,33,230,AQAP,1747,881,3,4,2,2,...,0,0,1,0,0,0,0,0,-99,10000000
2,2015,33,230,United States of America,2,3,2,4,2,2,...,0,0,0,0,1,0,0,0,-99,10000000
3,2015,33,230,Jordan,663,120,4,4,2,2,...,0,0,1,0,0,0,0,0,-99,10000000
4,2015,33,230,Yemen (North Yemen),678,123,1,4,2,2,...,0,0,1,0,0,0,0,0,-99,10000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8572,1998,137,333,Hizb-i Islami-yi Afghanistan,1141,299,3,4,2,3,...,0,0,0,0,1,0,0,0,-99,32269589
8573,1999,137,333,Hizb-i Islami-yi Afghanistan,1141,299,3,4,2,3,...,0,0,0,0,1,0,0,0,-99,32269589
8574,2000,137,333,Hizb-i Islami-yi Afghanistan,1141,299,3,4,2,3,...,0,0,0,0,1,0,0,0,-99,32269589
8575,2002,137,333,Hizb-i Islami-yi Afghanistan,1141,299,3,4,2,3,...,0,0,1,0,0,0,0,0,-99,32269589


In [9]:
merged_df.columns

Index(['year', 'conflictid_old', 'conflictid_new', 'actor', 'actorid',
       'actorid_new', 'actor_type', 'type', 'incomp', 'region', 'location',
       'gwnoloc', 'gwnoloc2', 'gwnoloc3', 'gwnoloc4', 'conflictyear', 'interm',
       'postc', 'state_prev', 'ai_prev', 'hrw_prev', 'form', 'population'],
      dtype='object')

In [10]:
df_xls2.columns

Index(['location', 'population'], dtype='object')

Note how we checked to see that our newly merged data set "merged_df" shows the additional 'population' column.
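
An inner merge silently drops any row whose 'location' has no match in the population table, so it can be worth checking which locations fail to match. The sketch below uses toy stand-ins for the two data frames (the names `conflicts` and `populations` and all values are placeholders, not the real data) and the `indicator=True` option of `pd.merge`.

```python
import pandas as pd

# Toy stand-ins for df_xls and df_xls2 (values are illustrative only).
conflicts = pd.DataFrame({
    "location": ["Syria", "Yemen", "Atlantis"],
    "year": [2013, 2013, 2013],
})
populations = pd.DataFrame({
    "location": ["Syria", "Yemen"],
    "population": [100, 200],  # placeholder numbers, not real populations
})

# A left merge with indicator=True keeps every conflict row and adds a
# '_merge' column saying whether a population match was found.
checked = pd.merge(conflicts, populations, how="left", on="location", indicator=True)

# Locations that an inner merge would silently drop.
unmatched = checked.loc[checked["_merge"] == "left_only", "location"].unique()
print(unmatched)
```

Mismatches often come from spelling differences between sources, such as a country listed under a longer official name in one table.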

## Data Breakdown

In [11]:
merged_df.shape

(8577, 23)

In [12]:
index=merged_df.index
index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            8567, 8568, 8569, 8570, 8571, 8572, 8573, 8574, 8575, 8576],
           dtype='int64', length=8577)

In [13]:
merged_df.dtypes
            #Note that the variable "form" is an object. Need to transform into numeric for both subsets.

year              object
conflictid_old    object
conflictid_new    object
actor             object
actorid           object
actorid_new       object
actor_type        object
type              object
incomp            object
region            object
location          object
gwnoloc           object
gwnoloc2          object
gwnoloc3          object
gwnoloc4          object
conflictyear      object
interm            object
postc             object
state_prev        object
ai_prev           object
hrw_prev          object
form              object
population         int64
dtype: object

In [None]:
sum(merged_df['actor_type'].isnull())

In [None]:
sum(merged_df['type'].isnull())

In [None]:
sum(merged_df['state_prev'].isnull())

In [None]:
sum(merged_df['ai_prev'].isnull())

In [None]:
sum(merged_df['hrw_prev'].isnull())

In [None]:
sum(merged_df['form'].isnull())
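
The null checks above only catch empty cells. The codebook's -99 code is an ordinary value, so `isnull()` does not count it. A minimal sketch of counting both at once, on a toy frame standing in for merged_df (all values are illustrative):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for merged_df (values are illustrative only).
toy = pd.DataFrame({
    "state_prev": [0, 1, -99, 2],
    "ai_prev": [0, -99, -99, 3],
    "hrw_prev": [1, 0, np.nan, 2],
})

cols = ["state_prev", "ai_prev", "hrw_prev"]

# Missing cells (NaN) and explicit -99 codes, counted per column in one pass.
summary = pd.DataFrame({
    "n_null": toy[cols].isnull().sum(),
    "n_minus_99": (toy[cols] == -99).sum(),
})
print(summary)
```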

In [21]:
import statistics
statistics.mode(merged_df['actor_type'])

3

We see that the most common 'actor_type' in the conflict country cases reviewed was type 3: 'Rebel' (in UCDP dyadic, the actor type is called 'Side B').

Below we see that the most common type of conflict is the one labeled '3', or "Intrastate Conflict" type.

In [None]:
merged_df['type'].value_counts()

In [None]:
merged_df['actor_type'].value_counts()


There are six actor types. Although '3', or 'Rebel' was the most prevalent in 3,020 of the conflict country cases, 'actor_type' '6' was the second most prevalent and defined as "Pro-government militias" (PGMs) in 2,335 country cases. 
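
To make the counts easier to read, the numeric codes can be mapped to the codebook labels before counting. This is a sketch on a toy column; the dictionary name `ACTOR_TYPE_LABELS` is hypothetical and the labels are shortened from the codebook section above.

```python
import pandas as pd

# Labels shortened from the codebook section of this notebook.
ACTOR_TYPE_LABELS = {
    1: "State or incumbent government",
    2: "State A2 (supporting the state)",
    3: "Rebel",
    4: "State supporting Side B",
    5: "Second state in interstate conflict",
    6: "Pro-government militia (PGM)",
}

# Toy column standing in for merged_df['actor_type'] (illustrative values).
actor_type = pd.Series([3, 3, 6, 1, 3, 6])

# Cast to int first, since the notebook reads every column as 'object'.
labelled = actor_type.astype(int).map(ACTOR_TYPE_LABELS)
counts = labelled.value_counts()
print(counts)
```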

In [None]:
merged_df['form'].value_counts()
                            #Variable 'form' is read as a string of integers. We need to deal with the '-99' value.

In [None]:
merged_df.iloc[6522:6525]['form'] 

In [None]:
#df['form'] = pd.to_numeric(df['form'], errors='coerce')
#print(df)

In [None]:
merged_df['form'].value_counts()

## 'Sex Violence' Index 
Create a new column, 'sex_violence', as the variable being tested in this experiment. The new column draws on the 'form' values summarized by merged_df['form'].value_counts().

### Creating Code: 'Crime_counts'   
We see that 'form' tracks both dimensions of documented sexual violence against women: 1) the number and 2) the type of crimes. Capturing both dimensions is a key component in creating the index 'sex_violence', i.e., sexual violence.

In [None]:
merged_df.iloc[6522:6525]['form'] 

In [None]:
crime_counts = merged_df['form'].value_counts()
print(crime_counts)

In [None]:
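# Convert 'form' to numeric; errors='coerce' turns anything that is not a single number into NaN.
merged_df['form'] = pd.to_numeric(merged_df['form'], errors='coerce')
print(merged_df)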

In [None]:
print(merged_df['form'].iloc[0])
print(type(merged_df['form'].iloc[0]))

In [None]:
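def my_function(x):
    # -99 is the dataset's missing-data code: count it as 1.
    # Otherwise shift the 'form' code up by one.
    if x == -99:
        return 1
    else:
        return x + 1

# 'length_of_form' is never created in this notebook, so apply the function to 'form' directly.
# pd.to_numeric guards against 'form' still being stored as an object column.
merged_df['crime_fixed_another_way'] = pd.to_numeric(merged_df['form'], errors='coerce').apply(my_function)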

In [None]:
merged_df

In [None]:
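# Spot-check the Syria cases that show more than one crime type in the original spreadsheet.
# Note: these row positions come from the original spreadsheet and may not match merged_df after sorting and merging.
print(merged_df['crime_fixed_another_way'].iloc[2474:2485])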

In [None]:
# 'length_of_form' is not created anywhere in this notebook, so inspect 'form' instead.
merged_df['form'].value_counts()

In [None]:
merged_df['crime_fixed_another_way'].value_counts()

Tiago's suggested fix (cast 'form' to an integer before comparing):
merged_df['crime_fixed_another_way'] = merged_df['form'].apply(lambda x: 1 if int(x) == -99 else int(x)+1)

merged_df['number_of_crimes'].apply(len)

In [None]:
merged_df.iloc[6522:6525]['form'] 

In [None]:
merged_df['sex_violence'] = ((merged_df['state_prev']+ merged_df['ai_prev']+ merged_df['hrw_prev']) * merged_df['crime_fixed_another_way'])/ merged_df['population']
print(merged_df['sex_violence'])

We calculated the 'sex_violence' index in another way by removing the step of dividing by 'population' here and called it 'sex_violence_2' to contrast with previous index: 'sex_violence'. 

In [None]:
merged_df['sex_violence_2'] = ((merged_df['state_prev']+ merged_df['ai_prev']+ merged_df['hrw_prev']) * merged_df['crime_fixed_another_way'])#~Removed Population
print(merged_df['sex_violence_2'])
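
As written, both index formulas add the -99 missing-data code into the sum of prevalence scores as if it were a real score. One possible alternative, shown here on toy rows rather than the real data, is to treat -99 as missing before summing. This is a sketch of one handling choice, not the method used in the cells above; the `_masked` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for merged_df (all values are illustrative only).
toy = pd.DataFrame({
    "state_prev": [1, -99, 0],
    "ai_prev": [2, 1, -99],
    "hrw_prev": [1, 1, 0],
    "crime_fixed_another_way": [2, 1, 1],
    "population": [1000, 1000, 2000],
})

prev_cols = ["state_prev", "ai_prev", "hrw_prev"]

# Treat -99 as missing rather than as a number.
prev = toy[prev_cols].replace(-99, np.nan)

# Sum the available scores per row; a row with no valid score stays NaN.
total_prev = prev.sum(axis=1, min_count=1)

toy["sex_violence_2_masked"] = total_prev * toy["crime_fixed_another_way"]
toy["sex_violence_masked"] = toy["sex_violence_2_masked"] / toy["population"]
print(toy[["sex_violence_2_masked", "sex_violence_masked"]])
```

Skipping missing scores when summing is itself a modeling decision: a row with one valid report is then treated the same as a row where the other organizations reported zero.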

In [None]:
# groupby supports built-ins such as .sum() or .count(); for other functions, use .aggregate(numpy_function).
merged_df.groupby('location').count()

In [None]:
sns.catplot(x="sex_violence", y="actor_type", hue="year", kind="swarm", data=merged_df)
plt.title('Sexual Violence by Actor Type')

In our 'Sexual Violence by Actor Type' plot we see that there are six different actor types committing different types of sexual violence, measured by 'form'. "Type 1" represents 'State or incumbent government' (in UCDP dyadic, this actor type is called 'Side A'), while "Type 3" represents 'Rebel' (in UCDP dyadic, the actor type is called 'Side B').
"Type 2" represents states supporting the "Type 1" state involved in conflict on its territory; it is highlighted as an occurrence after 2006 and increasingly between 2011 and 2015. Syria represents this situation with regard to Russia and Iran.


## Group A: Syria Subset
The next step in our experiment is to parse out the Syria subset, or 'Group A', from the global sample for the years 2011-2015.
Source: https://www.dunderdata.com/blog/selecting-subsets-of-data-in-pandas-part-1

In [None]:
# Select rows with 'year' between 2011 and 2015 AND 'location' equal to Syria
df_Syria = merged_df[(merged_df['year'] >= 2011) & (merged_df['year'] <= 2015) & (merged_df['location'] == 'Syria')]
df_Syria
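
The same selection can be expressed with named boolean masks, which makes it easier to keep the two groups on exactly the same year window. This sketch uses a toy frame with placeholder values; it also converts 'year' to numeric first, on the assumption that a column read as 'object' might hold strings.

```python
import pandas as pd

# Toy frame standing in for merged_df (illustrative values only).
toy = pd.DataFrame({
    "year": ["2010", "2011", "2013", "2015", "2014"],  # stored as strings on purpose
    "location": ["Syria", "Syria", "Yemen", "Syria", "Syria"],
})

# Make sure 'year' is numeric before comparing it with integers.
toy["year"] = pd.to_numeric(toy["year"])

in_window = toy["year"].between(2011, 2015)  # inclusive on both ends
is_syria = toy["location"] == "Syria"

group_a = toy[in_window & is_syria]    # Syria, 2011-2015
group_b = toy[in_window & ~is_syria]   # everyone else, same years
print(len(group_a), len(group_b))
```

Reusing `in_window` for both groups guarantees that the comparison group covers the same years as the Syria subset.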

### Summary stats for Group A
Here we see the breakdown of summary statistics for our experiment on Syria, which is 'Group A'. There are 28 cases between years 2011 and 2015. Three major organizations specifically report rape or other sexual violence related to the conflict. Although the State Department column 'state_prev' shows 18 reports mentioning occurrences of sexual violence, both Amnesty International ('ai_prev') and Human Rights Watch ('hrw_prev') provide a higher, and identical, number of reports mentioning sexual violence in the Syria conflict: 23.

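We need the mean of 'sex_violence', our variable of interest, for both groups. We use numpy's built-in mean function; the results are shared immediately below:
    'Group A': 1.6785714285714286
    'Group B': -16.710108073744436

Since the codebook's prevalence scores range from 0 to 3, a negative mean for 'Group B' suggests that -99 missing-data codes are still included in the sums.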

In [None]:
df_Syria.describe()

In [None]:
np.mean(df_Syria['sex_violence'])

In [None]:
np.mean(df_Syria['sex_violence_2'])

In [None]:
sns.catplot(x="type", y="actor_type", kind="swarm", data=df_Syria)
plt.title('Type of Conflict With Actor Intervention in Syria')

We see that conflict 'type' category 2, 'Interstate Conflict', is missing in our 'Group A', or Syria, case. Compare with the plot 'Type of Conflict With Actor Intervention' for 'Group B', which has all three types.

## Group B: World
'Group B' in this experiment represents all countries experiencing conflict between the same time period (our control of years 2011 through 2015) except Syria since Syria is our 'Group A'. 'Group B' includes 1,573 cases for comparison.

In [None]:
df_world = merged_df[(merged_df['year'] >= 2011) & (merged_df['year'] <= 2015) & (merged_df['location'] != 'Syria')]
df_world

In [None]:
df_world.describe()

In [None]:
np.mean(df_world['sex_violence'])

In [None]:
np.mean(df_world['sex_violence_2'])

In [None]:
sns.catplot(x="type", y="actor_type", kind="swarm", data=df_world)
plt.title('Type of Conflict With Actor Intervention')

## T-Test
Between 'Group A' (Syria) and 'Group B' (world), we will run a t-test to review our hypothesis. We will run an 'Independent Samples t-test', which compares the means of the two groups. The two samples have different means, variances, and sample sizes. Note that scipy's `ttest_ind`, as called below, assumes equal variances by default; passing `equal_var=False` runs Welch's t-test, which does not.

In [None]:
from scipy import stats
# import researchpy as rp  # Optional alternative, since we are comparing two dataframes (Syria and world).

x = df_Syria['sex_violence']
y = df_world['sex_violence']
ttest = stats.ttest_ind(x, y)
print('t-test sex_violence', ttest)

# researchpy.ttest(x, y, group1_name=None, group2_name=None, equal_variances=True, paired=False, correction=None)

We calculate a t-statistic of 1.1516 with a p-value of 0.249.
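
The decision rule for a reported p-value can be made explicit. The sketch below plugs in the two numbers reported above; the significance level of 0.05 is a conventional choice and an assumption here, since the notebook does not fix one in advance.

```python
# Values reported above.
t_statistic = 1.1516
p_value = 0.249
alpha = 0.05  # conventional significance level (assumed; not fixed by the notebook)

# Reject the null hypothesis only when the p-value falls below alpha.
if p_value < alpha:
    decision = "reject the null hypothesis"
else:
    decision = "fail to reject the null hypothesis"

print(f"t = {t_statistic}, p = {p_value}, alpha = {alpha}: {decision}")
```

Failing to reject the null hypothesis does not show that the two means are equal; it only means this sample does not provide enough evidence of a difference.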

## Conclusion

After conducting a t-test between 'Group A' and 'Group B', we calculated a t-statistic of 1.1516 with a p-value of 0.249. Because this p-value is well above the conventional 0.05 significance level, we cannot reject the null hypothesis of equal means between 'Group A' and 'Group B' regarding their 'sex_violence' level. (The p-value represents the probability of getting data at least as extreme as those observed if the null hypothesis were true in the population.) Specifically, this test does not show that Syria's average rate of sexual violence differs from the average rate of the other conflict countries. It is important to note that our sample size for Syria ('Group A') is 25, which is much smaller than the sample size of all the other conflict countries ('Group B'), which is 1451 country cases.

In conclusion, Syria's sexual violence index reflects a combination of documented watchdog reports and the pattern and form of sexual violence types represented by our constructed variable 'crime_fixed_another_way', which was derived from the 'form' type and 'form' counts, or occurrences of sexual violence. The Syrian crisis could still represent the worst form of violence against women in conflict since the Balkans crisis and the resulting Bosnian Genocide, but this test alone does not establish that.

Other areas for review include examining the 'actor_type' carrying out the sexual violence ('sex_violence') as well as conducting an A/A test between Syria and Yemen. We propose conducting the latter t-test because Syria and Yemen are two Middle Eastern countries experiencing similar state and non-state actor conflicts that continue today, and they have similar sample sizes. The rates of violence may differ depending on whether more types of actors are involved. Is there a statistical difference between countries with more actors and those with fewer?

For next time: Yemen, proposed code below.

In [None]:
df_Yemen = merged_df[(merged_df['year'] >= 2011) & (merged_df['year'] <= 2015) & (merged_df['location'] == 'Yemen')]
df_Yemen