# Exploring Brazilian conflicts

This is an analysis of a dataset of recorded conflicts in Brazil. I am interested in visualizing the features of this data and answering the following questions:

- How did the conflicts evolve over time and did they become more violent?
- How are the conflicts distributed geographically?
- What kind of conflicts are more common?
- Who are the initiators, and who are they getting into conflicts with?
- Is there a relationship between conflicts and demographic indicators?


(To answer those questions I also use data from these other sources: [Shapefiles of Brazilian states](https://www.kaggle.com/datasets/rodsaldanha/brazilianstatesshapefiles) and [Brazilian cities](https://www.kaggle.com/datasets/crisparada/brazilian-cities).)

In [None]:
!pip install geoplot==0.5.1
!pip install geopandas==0.10.2
!pip3 install shapely==1.7.1

In [None]:

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

import os


import seaborn as sns
import matplotlib.pyplot as plt

# plot maps
import geoplot as gplt
import geoplot.crs as gcrs
import geopandas as gpd
from shapely.geometry import Point

#### Load data
data_conflicts = pd.read_csv('Brazil Political Violence and Protests Dataset.csv')
data_cities = pd.read_csv('BRAZIL_CITIES_REV2022.CSV')

plt.rcParams['figure.figsize'] = [12, 9]

## How did the conflicts evolve over time and did they become more violent?

The two graphs below show how the number of conflicts and the number of fatalities evolved over time. The vertical lines mark the presidential elections of October 2018 and October 2022. We can see two spikes in the number of conflicts. The peak in 2018 is due to the nationwide [protests of truck drivers](https://en.wikipedia.org/wiki/2018_Brazil_truck_drivers%27_strike), which lasted around two months. The (smaller) spike at the end of 2022 reflects the [dissatisfaction with the results of the presidential elections](https://en.wikipedia.org/wiki/2022%E2%80%932023_Brazilian_election_protests) that resulted in Lula being elected for his third term.

Protests in 2018 leading up to the election were associated with peak levels of violence. After the election, the number of fatalities plummeted and then increased steadily. Conflicts after the 2022 elections, however, did not see nearly as many fatalities.

In [None]:
# Number of conflicts per month
from datetime import datetime
data_conflicts['Count'] = 1  # one row per recorded event
data_conflicts['EVENT_MONTH_YEAR'] = pd.to_datetime(data_conflicts['EVENT_DATE']).dt.to_period('M')
# Sum only the two numeric columns that are plotted
data_conflicts2 = data_conflicts.groupby('EVENT_MONTH_YEAR')[['Count', 'FATALITIES']].sum()

data_conflicts2['Count'].plot()
plt.axvline(datetime(2018, 10,1),color='black')
plt.axvline(datetime(2022, 10,1),color='black')

In [None]:
# Number of fatalities per month
data_conflicts2['FATALITIES'].plot()
plt.axvline(datetime(2018, 10,1),color='black')
plt.axvline(datetime(2022, 10,1),color='black')
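Raw fatality counts tend to rise whenever there are more events, so one way to check whether conflicts became more violent is to look at fatalities per event in each month. The snippet below is a minimal sketch on a small made-up table (the dates and numbers are illustrative, not taken from the dataset); it only reuses the `EVENT_DATE` and `FATALITIES` column names from above.

```python
import pandas as pd

# Toy events table; the values are made up for illustration only.
toy = pd.DataFrame({
    "EVENT_DATE": ["2018-05-02", "2018-05-20", "2018-06-01", "2018-06-15", "2018-06-30"],
    "FATALITIES": [0, 2, 1, 0, 0],
})

# Bucket events by calendar month.
toy["EVENT_MONTH_YEAR"] = pd.to_datetime(toy["EVENT_DATE"]).dt.to_period("M")

# Number of events and total fatalities per month.
monthly = toy.groupby("EVENT_MONTH_YEAR").agg(
    Count=("FATALITIES", "size"),
    FATALITIES=("FATALITIES", "sum"),
)

# Fatalities per event: comparable across months with different event counts.
monthly["Fatality rate"] = monthly["FATALITIES"] / monthly["Count"]
print(monthly)
```

Plotting `monthly["Fatality rate"]` next to the raw counts would separate "more events" from "deadlier events".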

## Types of conflicts

The dataset contains information on different types of conflicts. The pie chart below shows the distribution of conflict types. Protests, battles, and violence against civilians account for 87% of all conflicts.

In [None]:
# pie chart
sns.set(rc={"figure.figsize":(12, 9)}) # width=12, height=9
data_conflicts2 = data_conflicts.groupby('EVENT_TYPE').agg({'Count': 'sum'})
data_conflicts2.head()

#create pie chart
plt.pie(data_conflicts2['Count'], labels = data_conflicts2.index,autopct='%.0f%%')
plt.show()
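The percentages printed on the pie chart can also be computed directly, which makes a combined share such as the one quoted above easy to verify. This is a small sketch on made-up counts; the category "Other" and all numbers are placeholders, not values from the dataset.

```python
import pandas as pd

# Toy table with one row per event; the counts are made up.
toy = pd.DataFrame({
    "EVENT_TYPE": ["Protests"] * 10
    + ["Battles"] * 5
    + ["Violence against civilians"] * 3
    + ["Other"] * 2
})

# Share of each event type in percent, largest first.
shares = toy["EVENT_TYPE"].value_counts(normalize=True).mul(100)

# Combined share of the three most common types.
top3 = shares.head(3).sum()
print(shares)
print(f"Top 3 types: {top3:.0f}% of events")
```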

## How are the conflicts distributed among urban and rural municipalities?

The bar plots below show that the majority of conflicts happen in urban areas. Protests are the most common type of conflict in both urban and rural areas.

In [None]:
data_conflicts2 = data_conflicts.groupby(['LOCATION','EVENT_TYPE']).agg({'Count': 'sum', 'FATALITIES':'sum'}).reset_index()
data_conflicts2['Count'] = data_conflicts2['Count'].fillna(0)
data_conflicts2['FATALITIES'] = data_conflicts2['FATALITIES'].fillna(0)

data_cities2= data_cities.rename(columns={'CITY':'LOCATION'})
data_merged = pd.merge(data_conflicts2,data_cities2,on='LOCATION',how='left')
data_merged['CAPITAL'] = data_merged['CAPITAL'].fillna(0)

# Is capital of the state?
data_merged_capital = data_merged.groupby(['CAPITAL','EVENT_TYPE']).sum().reset_index()
data_merged_capital['Fatality rate'] = data_merged_capital['FATALITIES']/data_merged_capital['Count']


# Is rural?
data_merged_rural = data_merged.groupby(['RURAL_URBAN','EVENT_TYPE']).sum().reset_index()
data_merged_rural['Fatality rate'] = data_merged_rural['FATALITIES']/data_merged_rural['Count']
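Because the merge above is a left join on the city name, conflict locations that have no match in the cities table end up with missing `CAPITAL` and `RURAL_URBAN` values. One way to see how many rows matched is the `indicator` option of `pd.merge`. The sketch below uses two made-up tables; the city names and values are placeholders.

```python
import pandas as pd

# Toy stand-ins for the aggregated conflicts and the cities table.
conflicts = pd.DataFrame({
    "LOCATION": ["City A", "City B", "City C"],
    "Count": [10, 4, 1],
})
cities = pd.DataFrame({
    "LOCATION": ["City A", "City B"],
    "CAPITAL": [1, 0],
})

# indicator=True adds a "_merge" column telling where each row came from.
merged = pd.merge(conflicts, cities, on="LOCATION", how="left", indicator=True)

# "both" = matched, "left_only" = conflict location missing from the cities table.
match_counts = merged["_merge"].value_counts()
unmatched = merged.loc[merged["_merge"] == "left_only", "LOCATION"].tolist()
print(match_counts)
print("Unmatched locations:", unmatched)
```

A large `left_only` count would mean that the capital and rural/urban comparisons only cover part of the conflicts.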

In [None]:
g = sns.catplot(data=data_merged_capital, kind="bar",
    x="EVENT_TYPE", y="Count", hue="CAPITAL", palette="dark", alpha=.6, height = 8, aspect=1.618)
g.despine(left=True)
g.set_axis_labels("", "Number of conflicts")
g.legend.set_title("")

In [None]:
g = sns.catplot(data=data_merged_rural, kind="bar",
    x="EVENT_TYPE", y="Count", hue="RURAL_URBAN", palette="dark", alpha=.6, height = 8, aspect=1.618)
g.despine(left=True)
g.set_axis_labels("", "Number of conflicts")
g.legend.set_title("")

The two bar graphs below show the fatality rate (number of fatalities divided by the total number of conflicts) in state capitals versus other municipalities, and in rural versus urban municipalities. The fatality rate is much higher in rural areas, where each battle has at least one fatality on average, and conflicts classified as violence against civilians have around 0.98 fatalities on average. This pattern is also prevalent when analysing rural and urban areas.

In [None]:
g = sns.catplot(data=data_merged_capital, kind="bar",
    x="EVENT_TYPE", y="Fatality rate", hue="CAPITAL", palette="dark", alpha=.6, height = 8, aspect=1.618)
g.despine(left=True)
g.set_axis_labels("", "Fatalities/Count")
g.legend.set_title("")

In [None]:
g = sns.catplot(data=data_merged_rural, kind="bar",
    x="EVENT_TYPE", y="Fatality rate", hue="RURAL_URBAN", palette="dark", alpha=.6, height = 8, aspect=1.618)
g.despine(left=True)
g.set_axis_labels("", "Fatalities/Count")
g.legend.set_title("")

## Who is involved in the conflicts?

The dataset also describes the actors involved in the conflicts.

### Who are the main actors?

Below is a pie chart of the top 10 main actors in the conflicts. Protesters are the largest group, with a margin of 12 percentage points over the next group in the breakdown. However, unidentified armed groups, gangs, or police militias account for 46% of all conflicts.

In [None]:
# Top 10 actors; all remaining actors are grouped under 'Other'
data_conflicts2 = data_conflicts.groupby('ACTOR1').agg({'Count': 'sum'})
data_conflicts2 = data_conflicts2.sort_values(by='Count',ascending=False)
data_conflicts2_sorted = data_conflicts2.iloc[:10].copy()
data_conflicts2_sorted.loc['Other'] = data_conflicts2.iloc[10:].sum()
data_conflicts2_sorted.head(11)

#create pie chart
plt.pie(data_conflicts2_sorted['Count'], labels = data_conflicts2_sorted.index,autopct='%.0f%%')
plt.show()
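The "top N plus Other" step is easy to get wrong by one position, so it can help to wrap it in a small function and check that no counts are lost. The helper below is a hypothetical sketch (the name `top_n_with_other` and the toy counts are not part of the notebook).

```python
import pandas as pd

def top_n_with_other(counts: pd.Series, n: int) -> pd.Series:
    """Keep the n largest entries and sum the remainder into 'Other'."""
    ordered = counts.sort_values(ascending=False)
    top = ordered.iloc[:n].copy()
    rest = ordered.iloc[n:].sum()
    if rest > 0:
        top.loc["Other"] = rest
    return top

# Toy counts per actor; names and numbers are made up.
toy_counts = pd.Series({"A": 50, "B": 30, "C": 10, "D": 6, "E": 4})
result = top_n_with_other(toy_counts, n=3)
print(result)

# Sanity check: grouping must not change the total.
assert result.sum() == toy_counts.sum()
```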

### Who's associated with whom?

Below is an alluvial graph showing how often each pair of actors appears together. The dataset description is not clear about what type of association this is, however. The main actor in the conflict is on the left side and the secondary actor is on the right side. The majority of conflicts in the dataset involve the police and different unidentified armed groups.

In [None]:
# TF-IDF feature generation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.tokenize import RegexpTokenizer

# Keep only rows without missing values; copy so new columns can be added safely
data_conflicts_NA = data_conflicts.dropna().copy()

# Split actor names into word tokens
tokenizer = RegexpTokenizer(r'\w+')

# Vectorize the actor names using TF-IDF on single words
tf_idf_vect = TfidfVectorizer(lowercase=True,
                        stop_words='english',
                        ngram_range = (1,1),
                        tokenizer = tokenizer.tokenize)
X_train_counts = tf_idf_vect.fit_transform(data_conflicts_NA['ACTOR1'])

nclust = 15
# Cluster the ACTOR1 names. random_state keeps the cluster numbers stable between runs;
# any manual labelling of clusters must be checked against the clusters actually produced
kmeans = KMeans(n_clusters=nclust, random_state=0).fit(X_train_counts)
data_conflicts_NA['ACTOR1_CLUSTER'] = kmeans.labels_

# Repeat for ACTOR2 (a separate fit, so its cluster numbers are independent of ACTOR1's)
X_train_counts = tf_idf_vect.fit_transform(data_conflicts_NA['ACTOR2'])
kmeans = KMeans(n_clusters=nclust, random_state=0).fit(X_train_counts)
data_conflicts_NA['ACTOR2_CLUSTER'] = kmeans.labels_

In [None]:
# This part I did manually by inspecting the actors in each cluster
cluster_labels = {
    0: 'Unidentified Gang and/or Police Militia',
    1: 'Unidentified Armed Group',
    2: 'Police',
    3: 'Unidentified Gang',
    4: 'Rioters',
    5: 'Criminal factions',
    6: 'Police',
    7: 'Police',
    8: 'Police',
    9: 'Criminal factions',
    10: 'Criminal factions',
    11: 'Police',
    12: 'Protesters',
    13: 'Other',
    14: 'Police',
}

# Apply the same mapping to both actor columns
data_conflicts_NA['ACTOR1_LABEL'] = data_conflicts_NA['ACTOR1_CLUSTER'].map(cluster_labels)
data_conflicts_NA['ACTOR2_LABEL'] = data_conflicts_NA['ACTOR2_CLUSTER'].map(cluster_labels)
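Cluster numbers produced by k-means depend on the initialisation, and the `ACTOR1` and `ACTOR2` columns are clustered in two independent fits, so a hand-made mapping from cluster number to label has to be re-checked for each fit. A different, deterministic approach (not the one used in this notebook) is to assign labels with keyword rules. Below is a sketch of that alternative; the rules and the example actor names are illustrative assumptions, not a list derived from the dataset.

```python
import pandas as pd

# Hypothetical keyword rules, checked in order; the first match wins.
RULES = [
    ("militia", "Unidentified Gang and/or Police Militia"),
    ("unidentified armed group", "Unidentified Armed Group"),
    ("unidentified gang", "Unidentified Gang"),
    ("police", "Police"),
    ("rioters", "Rioters"),
    ("protesters", "Protesters"),
]

def label_actor(name: str) -> str:
    """Return the label of the first rule whose keyword occurs in the name."""
    lowered = name.lower()
    for keyword, label in RULES:
        if keyword in lowered:
            return label
    return "Other"

# Example actor names (made up for illustration).
actors = pd.Series([
    "Police Forces of Brazil",
    "Protesters (Brazil)",
    "Unidentified Armed Group (Brazil)",
    "Unidentified Gang and/or Police Militia",
    "Some Other Actor",
])
labels = actors.map(label_actor)
print(pd.DataFrame({"actor": actors, "label": labels}))
```

Rules like these give the same label for the same name in both actor columns, at the cost of having to maintain the keyword list by hand.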


In [None]:
import plotly.express as px

df = data_conflicts_NA
#df = data_conflicts_NA.groupby(['ACTOR1_CLUSTER','ACTOR2_CLUSTER']).sum().reset_index()
df = df.rename(columns={'Count': 'size'})
df = df.drop(columns=['LATITUDE','LONGITUDE','FATALITIES'])
df['ACTOR1_CLUSTER'] = df['ACTOR1_CLUSTER'].apply(str)
df['ACTOR2_CLUSTER'] = df['ACTOR2_CLUSTER'].apply(str)
df = df.dropna()

fig = px.parallel_categories(df, dimensions=['ACTOR1_LABEL','ACTOR2_LABEL'])
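# Plotly sizes figures in pixels through the layout; figsize/dpi are matplotlib arguments
fig.update_layout(width=1500, height=2200)
fig.show()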
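The widths of the bands in the alluvial graph correspond to the number of rows for each (main actor, secondary actor) pair. Those numbers can be tabulated directly, which is handy for quoting exact counts. A minimal sketch on a made-up table that reuses the `ACTOR1_LABEL` and `ACTOR2_LABEL` column names:

```python
import pandas as pd

# Toy labelled events; the rows are made up for illustration.
toy = pd.DataFrame({
    "ACTOR1_LABEL": ["Police", "Police", "Police", "Protesters", "Unidentified Armed Group"],
    "ACTOR2_LABEL": ["Unidentified Armed Group", "Unidentified Armed Group",
                     "Criminal factions", "Police", "Police"],
})

# Number of events per (main actor, secondary actor) pair, largest first.
pair_counts = (
    toy.groupby(["ACTOR1_LABEL", "ACTOR2_LABEL"])
    .size()
    .sort_values(ascending=False)
)
top_pair = pair_counts.index[0]
print(pair_counts)
```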

## Geographic distribution of conflicts

The maps below show the geographic distribution of conflicts and fatalities. Unsurprisingly, conflicts are concentrated in the more populated regions, the southeast and the northeast. The distribution of fatalities looks very similar, suggesting that the level of violence in the conflicts is homogeneous across regions (i.e., there is no 'more violent' region when it comes to conflicts).

In [None]:
# convert latitude and longitude to points
# https://shakasom.medium.com/how-to-convert-latitude-longtitude-columns-in-csv-to-geometry-column-using-python-4219d2106dea


map_df = gpd.read_file('BRA_adm1.shp')

data_conflicts2 = data_conflicts.groupby(['LATITUDE','LONGITUDE']).sum().reset_index()

geometry = [Point(xy) for xy in zip(data_conflicts2['LONGITUDE'], data_conflicts2['LATITUDE'])]
data_conflicts_count = data_conflicts2.drop(columns=['LONGITUDE','LATITUDE','FATALITIES'])

crs = 'EPSG:4326'  # WGS84 latitude/longitude
gdf = gpd.GeoDataFrame(data_conflicts_count, crs=crs, geometry=geometry)

ax = gplt.polyplot(map_df,  zorder=1)
gplt.kdeplot(gdf, cmap='Blues', shade=True, clip=map_df.geometry, thresh=0.05, ax=ax)

## Geographic distribution of fatalities

In [None]:
data_conflicts_fatalities = data_conflicts2.drop(columns=['LONGITUDE','LATITUDE','Count'])

crs = 'EPSG:4326'  # WGS84 latitude/longitude
gdf = gpd.GeoDataFrame(data_conflicts_fatalities, crs=crs, geometry=geometry)

ax = gplt.polyplot(map_df,  zorder=1)
# Note: the density is estimated from the point locations; FATALITIES is not passed as a weight
gplt.kdeplot(gdf, cmap='Reds', shade=True, clip=map_df.geometry, thresh=0.05, ax=ax)
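If a density is estimated from point coordinates alone, it reflects where the recorded locations are rather than how many events or fatalities each location had. One way to bring the magnitudes in is to weight each location. The sketch below illustrates the idea with a coarse 2-D histogram in NumPy instead of a kernel density estimate; the coordinates, bin edges, and fatality numbers are made up.

```python
import numpy as np

# Toy locations (longitude, latitude) and fatalities per location; values are made up.
lon = np.array([-46.6, -46.5, -43.2, -38.5])
lat = np.array([-23.5, -23.6, -22.9, -3.7])
fatalities = np.array([0, 1, 5, 0])

# Coarse grid: longitude edges first, then latitude edges.
bins = [np.array([-50, -45, -40, -35]), np.array([-25, -15, -5, 5])]

# Unweighted: every location counts once.
unweighted, _, _ = np.histogram2d(lon, lat, bins=bins)

# Weighted: every location counts as many times as it has fatalities.
weighted, _, _ = np.histogram2d(lon, lat, bins=bins, weights=fatalities)

print("Busiest cell by locations: ", np.unravel_index(unweighted.argmax(), unweighted.shape))
print("Busiest cell by fatalities:", np.unravel_index(weighted.argmax(), weighted.shape))
```

In this toy example the cell with the most locations is not the cell with the most fatalities, which is the kind of difference an unweighted map cannot show.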

## Social indicators and conflicts

Are there patterns between social indicators and the number of conflicts? The two plots below relate the number of conflicts to GDP per capita and to a municipality-level human development index (HDI). While there does not seem to be a pattern between GDP per capita and the total number of conflicts, there seems to be a positive association between HDI and the total number of conflicts. Moreover, higher HDI values are associated with a greater dispersion in the number of conflicts.

In [None]:
# First, merge datasets...
data_cities= data_cities.rename(columns={'CITY':'LOCATION'})
data_merged = pd.merge(data_conflicts,data_cities,on='LOCATION',how='inner')
data_merged = data_merged.groupby('LOCATION').agg({'GDP_CAPITA':'mean', 'Count':'sum'})

# Keep only observations at or below the 95th percentile of each variable (trims outliers)
cutoff1 = data_merged['GDP_CAPITA'].quantile(q=0.95)
m1 = data_merged['GDP_CAPITA']<=cutoff1
cutoff2 = data_merged['Count'].quantile(q=0.95)
m2 = data_merged['Count']<=cutoff2

data_merged = data_merged[m1 & m2]
sns.scatterplot(data=data_merged, x="GDP_CAPITA", y="Count")
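Visual impressions from scatter plots like these can be backed up with two simple numbers: a rank correlation for the strength of the association, and the spread of the counts in the lower and upper half of the indicator for the dispersion. The sketch below uses a made-up table that borrows the `IDHM` and `Count` column names; the values are not from the dataset.

```python
import pandas as pd

# Toy municipality-level table; values are made up for illustration.
toy = pd.DataFrame({
    "IDHM": [0.55, 0.60, 0.65, 0.70, 0.75, 0.80],
    "Count": [1, 2, 2, 5, 3, 9],
})

# Spearman rank correlation, computed as the Pearson correlation of the ranks.
spearman = toy["IDHM"].rank().corr(toy["Count"].rank())

# Split municipalities into a lower and an upper half of the indicator
# and compare the standard deviation of the conflict counts.
toy["IDHM_half"] = pd.qcut(toy["IDHM"], q=2, labels=["low", "high"])
spread = toy.groupby("IDHM_half", observed=True)["Count"].std()

print(f"Spearman correlation: {spearman:.2f}")
print(spread)
```

The same two lines could be applied to `GDP_CAPITA` to check that the association there is indeed weak.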

In [None]:
# Merge the datasets again, this time aggregating the municipal HDI (IDHM)
data_cities= data_cities.rename(columns={'CITY':'LOCATION'})
data_merged = pd.merge(data_conflicts,data_cities,on='LOCATION',how='inner')
data_merged = data_merged.groupby('LOCATION').agg({'IDHM':'mean', 'Count':'sum'})

cutoff1 = data_merged['IDHM'].quantile(q=0.95)
m1 = data_merged['IDHM']<=cutoff1
cutoff2 = data_merged['Count'].quantile(q=0.95)
m2 = data_merged['Count']<=cutoff2

data_merged = data_merged[m1 & m2]
sns.scatterplot(data=data_merged, x="IDHM", y="Count")

# \/ different scatter