# Introduction
We analyze survey data on violence against women and girls. The data was collected in 70 countries and contains demographic questions and responses, together with the percentage of respondents answering each of five violence-specific questions, plus an aggregate "for at least one specific reason" item.

# Analysis preparation

In [2]:
import pandas as pd
data=pd.read_csv("violence_data.csv")

In [3]:
# get an idea about the data
data.head()

| | RecordID | Country | Gender | Demographics Question | Demographics Response | Question | Survey Year | Value |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Afghanistan | F | Marital status | Never married | ... if she burns the food | 01/01/2015 | NaN |
| 1 | 1 | Afghanistan | F | Education | Higher | ... if she burns the food | 01/01/2015 | 10.1 |
| 2 | 1 | Afghanistan | F | Education | Secondary | ... if she burns the food | 01/01/2015 | 13.7 |
| 3 | 1 | Afghanistan | F | Education | Primary | ... if she burns the food | 01/01/2015 | 13.8 |
| 4 | 1 | Afghanistan | F | Marital status | Widowed, divorced, separated | ... if she burns the food | 01/01/2015 | 13.8 |


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12600 entries, 0 to 12599
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   RecordID               12600 non-null  int64  
 1   Country                12600 non-null  object 
 2   Gender                 12600 non-null  object 
 3   Demographics Question  12600 non-null  object 
 4   Demographics Response  12600 non-null  object 
 5   Question               12600 non-null  object 
 6   Survey Year            12600 non-null  object 
 7   Value                  11187 non-null  float64
dtypes: float64(1), int64(1), object(6)
memory usage: 787.6+ KB


==> The Value column has missing values: only 11,187 of the 12,600 rows are non-null, so 1,413 rows (about 11%) carry no value.
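As a minimal sketch of how the missing entries could be quantified and set aside, the snippet below uses a small stand-in table (the rows are illustrative, not taken from the dataset). pandas aggregations such as median and mean skip NaN by default, so dropping the rows is optional for the statistics computed later.

```python
import pandas as pd

# Tiny stand-in for the survey table; values are illustrative only.
sample = pd.DataFrame({
    "Country": ["A", "A", "B", "B"],
    "Question": ["q1", "q2", "q1", "q2"],
    "Value": [10.1, None, 13.7, None],
})

# Count missing entries per column.
missing_count = sample.isna().sum()

# Share of rows whose Value is missing.
missing_share = sample["Value"].isna().mean()

# Keep only the rows that actually carry a percentage.
clean = sample.dropna(subset=["Value"])

print(missing_count)
print(f"Share of missing Value entries: {missing_share:.0%}")
```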

In [5]:
# Number of unique values per column
columns = data.columns
for column in columns:
    print(f"{column} : {data[column].nunique()}")

RecordID : 420
Country : 70
Gender : 2
Demographics Question : 5
Demographics Response : 15
Question : 6
Survey Year : 18
Value : 757
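The same counts can probably be obtained without an explicit loop, since `DataFrame.nunique` returns one count per column. A small sketch on an illustrative stand-in table:

```python
import pandas as pd

# Illustrative stand-in; not rows from the real dataset.
sample = pd.DataFrame({
    "Country": ["A", "A", "B"],
    "Gender": ["F", "M", "F"],
    "Value": [10.1, 10.1, None],
})

# One count per column; NaN is excluded from the count by default.
unique_counts = sample.nunique()
print(unique_counts)
```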


# What demographics are telling us

We have 5 demographic questions:

*   Age (15-24, 25-34, 35-49)
*   Education (No education, Primary, Secondary, Higher)
*   Employment (Unemployed, Employed for cash, Employed for kind)
*   Marital status (Married or living together; Widowed, divorced, separated; Never married)
*   Residence (Rural, Urban)

To get a clear picture, we aggregate the data by demographic question and response and compute the minimum, maximum, median and mean of Value for each group.



In [6]:
demographics_df = data.groupby(["Demographics Question", "Demographics Response"])["Value"].agg(["median", "max", "min", "mean"]).reset_index()
demographics_df.columns = ["Question", "Response", "Median", "Max", "Min", "Mean"]
print("Violence % median, min, max, and mean per demographic group")
demographics_df.sort_values(["Question", "Median"])


Violence % median, min, max, and mean per demographic group


| | Question | Response | Median | Max | Min | Mean |
|---|---|---|---|---|---|---|
| 2 | Age | 35-49 | 14.15 | 81.0 | 0.2 | 19.336412 |
| 1 | Age | 25-34 | 14.45 | 81.5 | 0.1 | 19.703562 |
| 0 | Age | 15-24 | 17.5 | 80.1 | 0.1 | 21.084169 |
| 3 | Education | Higher | 4.2 | 74.6 | 0.0 | 8.89867 |
| 6 | Education | Secondary | 13.05 | 76.7 | 0.2 | 17.378892 |
| 5 | Education | Primary | 18.4 | 80.5 | 0.1 | 22.819093 |
| 4 | Education | No education | 21.55 | 82.0 | 0.0 | 25.403125 |
| 9 | Employment | Unemployed | 14.55 | 80.1 | 0.0 | 19.53971 |
| 7 | Employment | Employed for cash | 14.85 | 81.5 | 0.1 | 19.553804 |
| 8 | Employment | Employed for kind | 20.15 | 86.9 | 0.3 | 24.445541 |
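The group-and-summarize step above can also be sketched with pandas named aggregation, which sets the output column names in the same call instead of reassigning `columns` afterwards. The stand-in rows below are illustrative and not taken from the dataset.

```python
import pandas as pd

# Illustrative stand-in for the survey table.
sample = pd.DataFrame({
    "Demographics Question": ["Age", "Age", "Age", "Education", "Education"],
    "Demographics Response": ["15-24", "15-24", "25-34", "Higher", "Higher"],
    "Value": [10.0, 20.0, 5.0, 2.0, 4.0],
})

summary = (
    sample.groupby(["Demographics Question", "Demographics Response"])["Value"]
    # Named aggregation: output column name = aggregation function.
    .agg(Median="median", Max="max", Min="min", Mean="mean")
    .reset_index()
    .rename(columns={
        "Demographics Question": "Question",
        "Demographics Response": "Response",
    })
    # Within each question, order the responses by ascending median.
    .sort_values(["Question", "Median"])
)
print(summary)
```

Renaming by key rather than by position makes the result less sensitive to the order in which the aggregates are listed.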


***What the grouped data shows:***


* The most exposed age group is 15-24, with a median of 17.5% and a maximum of 80.1%; the absolute maximum (81.5%) belongs to the 25-34 age group.
* The education level of girls and women is strongly associated with violence against them: Higher education has a median of only 4.2%, while No education has a median of 21.55% and a maximum of 82%.
* Employment also counts, mainly through the type of employment: the medians for Unemployed (14.55%) and Employed for cash (14.85%) are nearly identical, while Employed for kind reaches 20.15% and has the highest maximum (86.9%).
* Marriage looks slightly protective judging by the median, but married women are exposed to higher levels of violence. The Widowed, divorced, separated group has a smaller median and a smaller maximum than the Married or living together group, although age may also be a factor here.
* Women from Urban areas are less exposed to violence than women from Rural areas: both the median and the maximum are smaller in Urban than in Rural areas.
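The unique-value counts earlier show two Gender values, while the statistics above pool all respondents. One way to check whether the pooled figures hide a difference is to pivot the median by gender. This is only a sketch: the rows are illustrative, and the label "M" is an assumption (only "F" is visible in the sample rows).

```python
import pandas as pd

# Illustrative stand-in; "M" is an assumed label for the second gender.
sample = pd.DataFrame({
    "Gender": ["F", "F", "M", "M", "F", "M"],
    "Demographics Response": ["Rural", "Urban", "Rural", "Urban", "Rural", "Rural"],
    "Value": [30.0, 20.0, 25.0, 10.0, 40.0, 15.0],
})

# One row per demographic response, one column per gender, median in each cell.
by_gender = sample.pivot_table(
    index="Demographics Response",
    columns="Gender",
    values="Value",
    aggfunc="median",
)
print(by_gender)
```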

# What the question answered reveals

In [7]:
question_df = data.groupby(["Question"])["Value"].agg(["median", "max", "min", "mean"]).reset_index()
question_df.columns = ["Question", "Median", "Max", "Min", "Mean"]
print("Violence % median, min, max, and mean per question asked")
question_df.sort_values(["Median"])

Violence % median, min, max, and mean per question asked


| | Question | Median | Max | Min | Mean |
|---|---|---|---|---|---|
| 2 | ... if she burns the food | 6.4 | 56.7 | 0.0 | 9.203445 |
| 5 | ... if she refuses to have sex with him | 9.0 | 68.7 | 0.0 | 13.209613 |
| 1 | ... if she argues with him | 15.7 | 76.5 | 0.0 | 18.983652 |
| 3 | ... if she goes out without telling him | 16.4 | 77.0 | 0.0 | 20.046321 |
| 4 | ... if she neglects the children | 20.8 | 75.6 | 0.0 | 23.461249 |
| 0 | ... for at least one specific reason | 31.0 | 86.9 | 0.0 | 33.217152 |


Generally, the median and maximum are aligned. The specific reasons for violence, ranked by ascending median value, are:

* if she burns the food;
* if she refuses to have sex with him;
* if she argues with him;
* if she goes out without telling him;
* if she neglects the children.

The aggregate item, for at least one specific reason, has the highest median (31.0%) and the absolute maximum (86.9%).
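The ranking above can be reproduced from the per-question medians by setting the aggregate item aside before sorting. The sketch below reuses the medians from the output table; the variable names are hypothetical.

```python
import pandas as pd

# Medians copied from the per-question output table above.
question_stats = pd.DataFrame({
    "Question": [
        "... for at least one specific reason",
        "... if she argues with him",
        "... if she burns the food",
        "... if she goes out without telling him",
        "... if she neglects the children",
        "... if she refuses to have sex with him",
    ],
    "Median": [31.0, 15.7, 6.4, 16.4, 20.8, 9.0],
})

# The aggregate item is not a specific reason, so rank it separately.
aggregate_item = "... for at least one specific reason"
specific = question_stats[question_stats["Question"] != aggregate_item]

# Order the five specific reasons by ascending median and number them.
ranked = specific.sort_values("Median").reset_index(drop=True)
ranked["Rank"] = ranked["Median"].rank(method="min").astype(int)
print(ranked)
```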

# Combining demographics and question answered

In [8]:
demoq_df = data.groupby(["Demographics Question", "Demographics Response", "Question"])["Value"].agg(["median", "max", "min", "mean"]).reset_index()
demoq_df.columns = ["Demographics Question", "Demographics Response", "Question", "Median", "Max", "Min", "Mean"]
print("Violence % median, min, max, and mean per demographic group and question asked")
demoq_df = demoq_df.sort_values(["Demographics Question", "Demographics Response", "Median"])
demoq_df.head()

Violence % median, min, max, and mean per demographic group and question asked


| | Demographics Question | Demographics Response | Question | Median | Max | Min | Mean |
|---|---|---|---|---|---|---|---|
| 2 | Age | 15-24 | ... if she burns the food | 8.5 | 47.7 | 0.1 | 10.098387 |
| 5 | Age | 15-24 | ... if she refuses to have sex with him | 10.05 | 59.4 | 0.3 | 13.404762 |
| 3 | Age | 15-24 | ... if she goes out without telling him | 18.4 | 67.7 | 0.2 | 21.026563 |
| 1 | Age | 15-24 | ... if she argues with him | 18.65 | 65.5 | 0.5 | 20.304032 |
| 4 | Age | 15-24 | ... if she neglects the children | 23.2 | 61.4 | 1.9 | 25.30625 |


In [9]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

def show_answers_per_demographic_question(question, aggregate="Median", layout_mode="Single"):
    font_size = 10
    if layout_mode == "Single":
        fig = go.Figure()
        font_size = 12
    else:
        fig = make_subplots(rows=2, cols=2, specs=[[{'type': 'polar'}]*2]*2)
        
    sel_df = demoq_df.loc[demoq_df["Demographics Question"]==question]
    answers = list(sel_df["Demographics Response"].unique())
    count = 0
    for answer in answers:
        subsel_df = sel_df.loc[sel_df["Demographics Response"] == answer]
        if layout_mode == "Single":
            fig.add_trace(go.Scatterpolar(name=answer, r=subsel_df[aggregate], theta=subsel_df['Question']))
        else:
            # Place each response in its own cell of the 2x2 grid, filling column by column
            fig.add_trace(go.Scatterpolar(name=answer, r=subsel_df[aggregate], theta=subsel_df['Question']), row=count % 2 + 1, col=count // 2 + 1)
        count = count + 1
    fig.update_traces(fill='toself')
    fig.update_layout(title='Demographics Question: '+ question + ' | Statistics: ' + aggregate,font=dict(family="Courier New, monospace",size=font_size))
    fig.show()
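The row and column arithmetic used in the subplot branch of the function above can be checked in isolation. The helper below is a hypothetical stand-in, not part of the notebook; it reproduces the same 1-based placement with `divmod` and shows that a 2x2 grid holds at most four responses.

```python
def grid_position(index, rows=2):
    """Return the 1-based (row, col) of the index-th trace, filling column by column."""
    # divmod gives (index // rows, index % rows): the column first, then the row.
    col, row = divmod(index, rows)
    return row + 1, col + 1

# Positions for the four cells of a 2x2 grid.
positions = [grid_position(i) for i in range(4)]
print(positions)
```

A fifth response would map to column 3, which does not exist in a 2x2 layout, so the grid would need to grow for demographic questions with more than four responses.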

In [10]:

show_answers_per_demographic_question("Age")

In [11]:
show_answers_per_demographic_question("Age", aggregate="Max")

In [12]:

show_answers_per_demographic_question("Education")

In [13]:
show_answers_per_demographic_question("Education", aggregate="Max")

In [14]:
show_answers_per_demographic_question("Employment")

In [15]:
show_answers_per_demographic_question("Employment", aggregate="Max")

In [16]:
show_answers_per_demographic_question("Marital status")

In [17]:
show_answers_per_demographic_question("Marital status", aggregate="Max")

In [18]:
show_answers_per_demographic_question("Residence")

In [19]:
show_answers_per_demographic_question("Residence", aggregate="Max")

# Median violence percent per country

In [21]:
country_df = pd.read_csv("wikipedia-iso-country-codes.csv")
country_df.columns = ["Country", "A2C", "A3C", "Num","ISO"]
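# Inner join on the country name; countries whose names do not match the ISO table are dropped
datac_df = data.merge(country_df, on="Country")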
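Because the merge is an inner join on the country name, any country spelled differently in the two files silently disappears from the maps. A minimal sketch of how such mismatches could be listed, using a left join with `indicator=True`; the second country name and both spellings are hypothetical.

```python
import pandas as pd

# Illustrative stand-ins; "Hypothetical Republic" is an invented name.
survey = pd.DataFrame({
    "Country": ["Afghanistan", "Hypothetical Republic", "Afghanistan"],
    "Value": [10.1, 20.0, 13.7],
})
codes = pd.DataFrame({
    "Country": ["Afghanistan", "Republic of Hypothetical"],
    "A3C": ["AFG", "HYP"],
})

# A left join keeps every survey row; _merge records whether a code was found.
checked = survey.merge(codes, on="Country", how="left", indicator=True)

# Countries present in the survey but absent from the code table.
unmatched = sorted(checked.loc[checked["_merge"] == "left_only", "Country"].unique())
print(unmatched)
```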

In [22]:
df = datac_df.groupby(["Country", "A3C"])["Value"].agg(["median", "max", "min", "mean"]).reset_index()
df.columns = ["Country", "A3C", "Median", "Max", "Min", "Mean"]

In [23]:
def show_on_map(df, aggregate, title):
    fig = go.Figure(data=go.Choropleth(
        locations = df['A3C'],
        z = df[aggregate],
        text = df['Country'],
        colorscale = 'Blues',
        autocolorscale=False,
        reversescale=True,
        marker_line_color='darkgray',
        marker_line_width=0.5,
        colorbar_ticksuffix = '%',
        colorbar_title = 'Violence %',
    ))

    fig.update_layout(
        title_text=title,
        geo=dict(
            showframe=False,
            showcoastlines=False,
            projection_type='equirectangular'
        ),
        annotations = [dict(
            x=0.55,
            y=0.1,
            xref='paper',
            yref='paper',
            text='Source: <a href="https://www.kaggle.com/andrewmvd/violence-against-women-and-girls">\
                Violence against Women and Girls</a>',
            showarrow = False
        )]
    )

    fig.show()

In [24]:
show_on_map(df, "Median", "Violence against Women and Girls - Country median level of violence [%]")

In [25]:
filter_df = datac_df.loc[datac_df["Demographics Response"]=="No education"]
df = filter_df.groupby(["Country", "A3C"])["Value"].agg(["median"]).reset_index()
df.columns = ["Country", "A3C", "Median"]
show_on_map(df, "Median", "Violence against Women with No education - Country Median level of violence [%]")

In [26]:
filter_df = datac_df.loc[datac_df["Demographics Response"]=="Higher"]
df = filter_df.groupby(["Country", "A3C"])["Value"].agg(["median"]).reset_index()
df.columns = ["Country", "A3C", "Median"]
show_on_map(df, "Median", "Violence against Women with Higher education - Country Median level of violence [%]")
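The last two cells repeat the same filter, group and rename steps with a different response value. A sketch of how that pattern could be factored into one helper is shown below; `median_by_response` is a hypothetical name, and the sample rows are illustrative.

```python
import pandas as pd

def median_by_response(df, response):
    """Median Value per country for one demographic response."""
    # Keep only the rows for the requested response (e.g. "Higher").
    subset = df.loc[df["Demographics Response"] == response]
    # One median per country, keeping the ISO code needed by the map.
    out = subset.groupby(["Country", "A3C"])["Value"].median().reset_index()
    return out.rename(columns={"Value": "Median"})

# Illustrative stand-in for the merged survey and country-code table.
sample = pd.DataFrame({
    "Country": ["Afghanistan"] * 4,
    "A3C": ["AFG"] * 4,
    "Demographics Response": ["Higher", "Higher", "No education", "No education"],
    "Value": [10.0, 20.0, 30.0, 50.0],
})

higher = median_by_response(sample, "Higher")
no_education = median_by_response(sample, "No education")
print(higher)
print(no_education)
```

Each result has the Country, A3C and Median columns that `show_on_map` reads, so it could be passed to that function directly.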