# Exercise from Stratascratch  

Imagine you are a security or defense analyst. Analyze the data and draw conclusions on the distribution and nature of terrorist incidents recorded around the world.  
In your analysis, include maps that visualize the location of different incidents. Your analysis may also provide answers to the following questions:  

1. How has the number of terrorist activities changed over the years? Are there certain regions where this trend is different from the global averages?  

2.  Is the number of incidents and the number of casualties correlated? Can you spot any irregularities or outliers?  

3. What are the most common methods of attacks? Does it differ in various regions or in time?  

4. Plot the locations of attacks on a map to visualize their regional spread  

You are also free to explore the data further and extract additional insights other than the questions above.  

Link: https://platform.stratascratch.com/data-projects/terrorism-hotspots

# Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

sns.set_style("darkgrid")
matplotlib.rcParams['font.size'] = 14
matplotlib.rcParams['figure.figsize'] = (15, 5)
matplotlib.rcParams['figure.facecolor'] = '#00000000'

import warnings
warnings.simplefilter(action='ignore')

# Importing Data

In [2]:
terrorism_df = pd.read_csv('./datasets/globalterrorismdb_0718dist.zip', encoding='ISO-8859-1', compression="zip") 
terrorism_df.head()

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,197000000001,1970,7,2,,0,,58,Dominican Republic,2,...,,,,,PGIS,0,0,0,0,
1,197000000002,1970,0,0,,0,,130,Mexico,1,...,,,,,PGIS,0,1,1,1,
2,197001000001,1970,1,0,,0,,160,Philippines,5,...,,,,,PGIS,-9,-9,1,1,
3,197001000002,1970,1,0,,0,,78,Greece,8,...,,,,,PGIS,-9,-9,1,1,
4,197001000003,1970,1,0,,0,,101,Japan,4,...,,,,,PGIS,-9,-9,1,1,


# EDA

In [3]:
terrorism_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Columns: 135 entries, eventid to related
dtypes: float64(55), int64(22), object(58)
memory usage: 187.1+ MB


The dataset is large: 181,691 rows and 135 columns.

In [4]:
# Count the columns that contain at least one missing value
int(terrorism_df.isna().any().sum())

106

Most of the columns (106 of 135) contain missing values. Fortunately, the analysis needs only a few of them.

Columns to use for the analysis:  

**success** - Success of a terrorist strike  

**suicide** - 1 = "Yes" The incident was a suicide attack. 0 = "No" There is no indication that the incident was a suicide  

**attacktype1** - The general method of attack  

**attacktype1_txt** - The general method of attack and broad class of tactics used  

**targtype1_txt** - The general type of target/victim  

**targsubtype1_txt** - The more specific target category  

**target1** - The specific person, building, or installation that was targeted and/or victimized  

**natlty1_txt** - The nationality of the target that was attacked  

**gname** - The name of the group that carried out the attack  

**gsubname** - Additional details about the group that carried out the attack, such as factions  

**nperps** - The total number of terrorists participating in the incident  

**weaptype1_txt** - General type of weapon used in the incident  

**weapsubtype1_txt** - More specific value for most of the weapon types  

**nkill** - The total number of confirmed fatalities for the incident  

**nkillus** - The number of U.S. citizens who died as a result of the incident  

In [5]:
t_df = terrorism_df[['iyear','region_txt','success', 'suicide', 'attacktype1', 'attacktype1_txt', 'targsubtype1_txt', 'target1', 'natlty1_txt', 'gname', 'gsubname',
                     'nperps', 'weaptype1_txt', 'weapsubtype1_txt', 'nkill', 'nkillus', 'country_txt' ,'city' ,'latitude', 'longitude']] 
t_df.head()

Unnamed: 0,iyear,region_txt,success,suicide,attacktype1,attacktype1_txt,targsubtype1_txt,target1,natlty1_txt,gname,gsubname,nperps,weaptype1_txt,weapsubtype1_txt,nkill,nkillus,country_txt,city,latitude,longitude
0,1970,Central America & Caribbean,1,0,1,Assassination,Named Civilian,Julio Guzman,Dominican Republic,MANO-D,,,Unknown,,1.0,,Dominican Republic,Santo Domingo,18.456792,-69.951164
1,1970,North America,1,0,6,Hostage Taking (Kidnapping),"Diplomatic Personnel (outside of embassy, cons...","Nadine Chaval, daughter",Belgium,23rd of September Communist League,,7.0,Unknown,,0.0,,Mexico,Mexico city,19.371887,-99.086624
2,1970,Southeast Asia,1,0,1,Assassination,Radio Journalist/Staff/Facility,Employee,United States,Unknown,,,Unknown,,1.0,,Philippines,Unknown,15.478598,120.599741
3,1970,Western Europe,1,0,3,Bombing/Explosion,Embassy/Consulate,U.S. Embassy,United States,Unknown,,,Explosives,Unknown Explosive Type,,,Greece,Athens,37.99749,23.762728
4,1970,East Asia,1,0,7,Facility/Infrastructure Attack,Embassy/Consulate,U.S. Consulate,United States,Unknown,,,Incendiary,,,,Japan,Fukouka,33.580412,130.396361


In [6]:
t_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 181691 entries, 0 to 181690
Data columns (total 20 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   iyear             181691 non-null  int64  
 1   region_txt        181691 non-null  object 
 2   success           181691 non-null  int64  
 3   suicide           181691 non-null  int64  
 4   attacktype1       181691 non-null  int64  
 5   attacktype1_txt   181691 non-null  object 
 6   targsubtype1_txt  171318 non-null  object 
 7   target1           181053 non-null  object 
 8   natlty1_txt       180132 non-null  object 
 9   gname             181691 non-null  object 
 10  gsubname          5890 non-null    object 
 11  nperps            110576 non-null  float64
 12  weaptype1_txt     181691 non-null  object 
 13  weapsubtype1_txt  160923 non-null  object 
 14  nkill             171378 non-null  float64
 15  nkillus           117245 non-null  float64
 16  country_txt       18

# Question 1  

How has the number of terrorist activities changed over the years?   
Are there certain regions where this trend is different from the global averages?

It is possible to show it using a lineplot.
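A minimal sketch of that line plot, assuming each row of `t_df` is one incident so that counting rows per `iyear` gives the yearly total. The function names are illustrative and do not come from the original notebook.

```python
import pandas as pd
import matplotlib.pyplot as plt


def attacks_per_year(df: pd.DataFrame) -> pd.Series:
    """Count the incidents recorded in each year (one row = one incident)."""
    return df.groupby("iyear").size().rename("incidents")


def plot_attacks_per_year(counts: pd.Series):
    """Draw the yearly counts as a line plot and return the Axes."""
    fig, ax = plt.subplots(figsize=(15, 5))
    ax.plot(counts.index, counts.values, marker="o")
    ax.set_xlabel("Year")
    ax.set_ylabel("Number of incidents")
    ax.set_title("Terrorist incidents per year")
    return ax


# Usage in the notebook (t_df is the frame selected above):
# plot_attacks_per_year(attacks_per_year(t_df))
```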

The number of terrorist attacks stayed fairly stable until around 2010, then escalated quickly.  
There may be several reasons behind this, but was the trend the same in every region?

Comparing the two trends.
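One way to put the regional trends next to the global one is to pivot the counts into a year-by-region table. This is a sketch under the same one-row-per-incident assumption; `attacks_per_year_by_region` is a hypothetical helper name.

```python
import pandas as pd


def attacks_per_year_by_region(df: pd.DataFrame) -> pd.DataFrame:
    """Return a year x region table of incident counts (missing pairs become 0)."""
    return (
        df.groupby(["iyear", "region_txt"])
        .size()
        .unstack("region_txt", fill_value=0)
    )


# Usage in the notebook:
# by_region = attacks_per_year_by_region(t_df)
# ax = by_region.plot(figsize=(15, 8))                 # one line per region
# by_region.sum(axis=1).plot(ax=ax, color="black",
#                            linestyle="--", label="World")
# ax.legend()
```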

The trends look similar, but the absolute numbers differ hugely between regions. Still, almost all of them have seen terrorist activity increase since 2010.

# Question 2  

Is the number of incidents and the number of casualties correlated? Can you spot any irregularities or outliers?  

A possible way to examine this correlation is to group by year, counting the incidents and summing the number of victims for each year.
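The grouping could look like the sketch below. It assumes that "victims" means the `nkill` column (the only casualty count among the selected columns besides `nkillus`); missing `nkill` values are skipped by the sum but still counted as incidents.

```python
import pandas as pd


def yearly_incidents_and_casualties(df: pd.DataFrame) -> pd.DataFrame:
    """Per year: number of incidents (rows) and total confirmed fatalities."""
    return df.groupby("iyear").agg(
        incidents=("nkill", "size"),   # size counts every row, NaN included
        casualties=("nkill", "sum"),   # sum ignores NaN
    )


# Usage in the notebook:
# yearly = yearly_incidents_and_casualties(t_df)
```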

Plotting the correlation. 
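A sketch of the scatter plot and of the Pearson coefficient, assuming a per-year frame with `incidents` and `casualties` columns (both column names are assumptions made for this example).

```python
import pandas as pd
import matplotlib.pyplot as plt


def pearson_r(yearly: pd.DataFrame) -> float:
    """Pearson correlation between yearly incidents and yearly casualties."""
    return yearly["incidents"].corr(yearly["casualties"])  # default: Pearson


def plot_incidents_vs_casualties(yearly: pd.DataFrame):
    """Scatter plot with one point per year; returns the Axes."""
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(yearly["incidents"], yearly["casualties"])
    ax.set_xlabel("Incidents per year")
    ax.set_ylabel("Casualties per year")
    ax.set_title(f"Pearson r = {pearson_r(yearly):.2f}")
    return ax
```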

As the graph already suggested, Pearson's r confirms that the correlation between incidents and casualties is strong and positive.  
The higher the number of incidents, the higher the number of victims.  
When the number of attacks increases, the number of deaths increases as well, which suggests that attacks may have grown not only in number but in violence too, although the correlation alone cannot confirm this.  
As for outliers, the six rightmost points lie far from the rest of the distribution. They probably correspond to the years after 2010, when the number of attacks started increasing dramatically.  
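To check which years those six points belong to, the per-year frame can be sorted by incident count. A small sketch, assuming a frame indexed by year with an `incidents` column (names are illustrative):

```python
import pandas as pd


def top_years(yearly: pd.DataFrame, n: int = 6) -> pd.DataFrame:
    """Return the n years with the most incidents, largest first."""
    return yearly.nlargest(n, "incidents")
```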

The table above confirms the previous guess: after 2012 the number of attacks increased rapidly.

To provide a different view of the phenomenon, the same process can be repeated by grouping per year and per region.   
This yields more observations and makes it possible to see where the highest number of attacks took place. 
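A sketch of the year-and-region grouping, again assuming `nkill` is the casualty measure. `rank_regions` is a hypothetical helper that sums the result per region to show which regions lead.

```python
import pandas as pd


def yearly_regional_totals(df: pd.DataFrame) -> pd.DataFrame:
    """One row per (year, region) with incident and fatality totals."""
    return (
        df.groupby(["iyear", "region_txt"])
        .agg(incidents=("nkill", "size"), casualties=("nkill", "sum"))
        .reset_index()
    )


def rank_regions(per_year_region: pd.DataFrame) -> pd.DataFrame:
    """Sum over all years and sort regions by incident count, largest first."""
    return (
        per_year_region.groupby("region_txt")[["incidents", "casualties"]]
        .sum()
        .sort_values("incidents", ascending=False)
    )
```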

The strong, positive relationship between the two variables is confirmed. In particular, South Asia and Middle East & North Africa appear as the regions with the highest number of attacks and deaths.

# Question 3   

What are the most common methods of attacks? Does it differ in various regions or in time?

Bombing/explosion is, by far, the most common attack type.  
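The ranking behind this statement can be reproduced with a simple frequency count. This is a sketch; the helper name and the `share` column are assumptions added for illustration.

```python
import pandas as pd


def attack_type_ranking(df: pd.DataFrame) -> pd.DataFrame:
    """Incident count and share of the total for each attack type."""
    counts = df["attacktype1_txt"].value_counts()  # sorted, largest first
    return counts.to_frame("incidents").assign(share=counts / counts.sum())
```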

Does it differ across time and regions? 

## Region

Plotting it. 
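A possible way to build this plot is a region-by-attack-type cross-tabulation drawn as grouped bars. The sketch below is an assumption about the approach, not the notebook's original code; `dominant_attack_type` is a hypothetical helper.

```python
import pandas as pd


def attack_types_by_region(df: pd.DataFrame) -> pd.DataFrame:
    """Region x attack-type table of incident counts."""
    return pd.crosstab(df["region_txt"], df["attacktype1_txt"])


def dominant_attack_type(table: pd.DataFrame) -> pd.Series:
    """Most frequent attack type for each region."""
    return table.idxmax(axis=1)


# Usage in the notebook:
# table = attack_types_by_region(t_df)
# table.plot(kind="bar", logy=True, figsize=(15, 8))  # log-scaled y-axis
```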

The y-axis had to be drawn on a logarithmic scale to make the results easier to interpret.  
Even so, bombing/explosion appears to be dominant in almost all regions. The exceptions are Central America & Caribbean and Sub-Saharan Africa, where armed assault is the dominant attack type. 

## Time Period

Plotting it. 
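For the time dimension, the same idea applies with the year on the rows. A minimal sketch with an illustrative helper name:

```python
import pandas as pd


def attack_types_over_time(df: pd.DataFrame) -> pd.DataFrame:
    """Year x attack-type table of incident counts."""
    return pd.crosstab(df["iyear"], df["attacktype1_txt"])


# Usage in the notebook:
# attack_types_over_time(t_df).plot(figsize=(15, 8))  # one line per attack type
```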

Bombing/explosion and armed assault have always been the most common attack types.  
The "Unknown" and "Kidnapping" categories show an upward trend, rising quickly over the years.

# Question 4  

Plot the locations of attacks on a map to visualize their regional spread.  

For this question, only 2017 will be considered. 

Filtering 2017 data only. 
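A sketch of the filter, followed by a count of the missing coordinates. The helper names are hypothetical.

```python
import pandas as pd


def filter_year(df: pd.DataFrame, year: int = 2017) -> pd.DataFrame:
    """Keep only the incidents of the given year (as an independent copy)."""
    return df.loc[df["iyear"] == year].copy()


def missing_coordinates(df: pd.DataFrame) -> pd.Series:
    """Number of missing values in the latitude and longitude columns."""
    return df[["latitude", "longitude"]].isna().sum()


# Usage in the notebook:
# t_2017 = filter_year(t_df, 2017)
# missing_coordinates(t_2017)
```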

47 values are missing, but they can still be recovered.  
First, let's check which locations are missing. 

Latitude and longitude are missing in the same observations. It is important to understand which locations the replacement values should refer to.

Since the city is almost always "NaN" where latitude and longitude are unknown, a possible solution is to use the coordinates of the country. Although the result is less accurate, it avoids wasting data. 

Selecting the rows with missing values and dropping them from the dataframe. 
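This split could be done with a boolean mask, as in the sketch below (the function name is an assumption for this example):

```python
import pandas as pd


def split_on_missing_coords(df: pd.DataFrame):
    """Return (rows with coordinates, rows missing latitude or longitude)."""
    mask = df["latitude"].isna() | df["longitude"].isna()
    return df.loc[~mask].copy(), df.loc[mask].copy()


# Usage in the notebook:
# t_2017_located, t_2017_missing = split_on_missing_coords(t_2017)
```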

Now, it is time to work on the missing values dataset. 
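The country-level fallback can be sketched as follows. The geocoder is passed in as a plain function so that each country is looked up only once; which geocoding service to use is an assumption, and the commented lines show how geopy's Nominatim could be plugged in.

```python
from typing import Callable, Optional, Tuple

import numpy as np
import pandas as pd

Coords = Optional[Tuple[float, float]]


def fill_with_country_coords(
    missing: pd.DataFrame, geocode: Callable[[str], Coords]
) -> pd.DataFrame:
    """Fill latitude/longitude with the coordinates of each row's country.

    `geocode` maps a country name to (lat, lon), or None when not found.
    """
    # Query the geocoder once per distinct country, not once per row
    lookup = {c: geocode(c) for c in missing["country_txt"].dropna().unique()}
    out = missing.copy()
    out["latitude"] = out["country_txt"].map(
        lambda c: lookup[c][0] if lookup.get(c) else np.nan
    )
    out["longitude"] = out["country_txt"].map(
        lambda c: lookup[c][1] if lookup.get(c) else np.nan
    )
    return out


# Possible geocoder (assumption: geopy is installed and network is available):
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="terrorism-hotspots-notebook")
# def geocode(name):
#     loc = geolocator.geocode(name)
#     return (loc.latitude, loc.longitude) if loc else None
```

Countries the geocoder cannot resolve keep `NaN` coordinates, which makes them easy to spot in the check that follows.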

Checking that every latitude and longitude has been retrieved properly.  

Adding to the original dataframe.  

## Creating a heatmap
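As a static approximation of a heatmap, incident coordinates can be binned into a latitude/longitude grid and drawn with matplotlib. This is a swapped-in technique for illustration, not necessarily the interactive map used in the original notebook; the bin size and function names are assumptions.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def attack_density(df: pd.DataFrame, bin_size_deg: float = 5.0):
    """Count incidents per lat/lon cell; rows are latitude bins, columns longitude bins."""
    coords = df[["latitude", "longitude"]].dropna()
    lat_edges = np.arange(-90, 90 + bin_size_deg, bin_size_deg)
    lon_edges = np.arange(-180, 180 + bin_size_deg, bin_size_deg)
    counts, _, _ = np.histogram2d(
        coords["latitude"], coords["longitude"], bins=[lat_edges, lon_edges]
    )
    return counts, lat_edges, lon_edges


def plot_attack_density(counts, lat_edges, lon_edges):
    """Draw the grid as a heatmap on a log scale; returns the Axes."""
    fig, ax = plt.subplots(figsize=(15, 7))
    mesh = ax.pcolormesh(lon_edges, lat_edges, np.log1p(counts), cmap="hot")
    fig.colorbar(mesh, ax=ax, label="log(1 + incidents)")
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")
    return ax
```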

## Creating a choropleth map

Counting the number of attacks. 
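A sketch of the per-country count. The `attacks` column name is an assumption; the result corresponds to what the text later calls `t_2017_countries`.

```python
import pandas as pd


def count_attacks_by_country(df: pd.DataFrame) -> pd.DataFrame:
    """One row per country with its number of incidents, largest first."""
    return (
        df.groupby("country_txt")
        .size()
        .reset_index(name="attacks")
        .sort_values("attacks", ascending=False, ignore_index=True)
    )


# Usage in the notebook:
# t_2017_countries = count_attacks_by_country(t_2017)
```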

Finding latitude and longitude.

It seems that the geocoder cannot retrieve the Gaza Strip coordinates.  
Let's try with another name.

The geocoder can locate the two territories separately, but not together.  
To solve this problem, the full name "West Bank and Gaza Strip" can be replaced with just one of the two.  
Although they are two different territories, they are located close to each other. 
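The renaming can be sketched as a simple value replacement. Which of the two names to keep is a free choice; "West Bank" below is only an example.

```python
import pandas as pd


def rename_country(df: pd.DataFrame, old: str, new: str) -> pd.DataFrame:
    """Return a copy of df with one country name replaced by another."""
    out = df.copy()
    out["country_txt"] = out["country_txt"].replace({old: new})
    return out


# Usage in the notebook:
# t_2017_countries = rename_country(
#     t_2017_countries, "West Bank and Gaza Strip", "West Bank"
# )
```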

Storing it.

Computing the Countries' geometry.

11 countries are missing. Let's check which ones.
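One way to list the countries without a matching geometry is a left merge with `indicator=True`. This is a sketch: it assumes the geometry table has a country-name column, called `name` here purely for illustration.

```python
import pandas as pd


def unmatched_countries(
    attacks: pd.DataFrame,
    shapes: pd.DataFrame,
    attacks_key: str = "country_txt",
    shapes_key: str = "name",
) -> list:
    """Country names in `attacks` that have no match in `shapes`."""
    merged = attacks.merge(
        shapes[[shapes_key]].drop_duplicates(),
        how="left",
        left_on=attacks_key,
        right_on=shapes_key,
        indicator=True,  # adds a _merge column: 'both' or 'left_only'
    )
    return merged.loc[merged["_merge"] == "left_only", attacks_key].tolist()
```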

They can be retrieved using their ISO3 codes and by replacing their current names in the t_2017_countries dataframe. 

Let's see the result.

Kosovo was the only country not found, so it is better to drop it. 

Now all the countries should be found. 

Everything is fine, so let's merge them. 

Storing it.

Setting up the dataframe for the choropleth map.

Reading the "geometry" column from a CSV file can lead to a TypeError, because the geometries come back as plain strings. To avoid this, the column must be parsed with "wkt.loads". 
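A minimal sketch of that fix, assuming the geometries were saved to CSV as WKT text and that shapely is installed. The function name and the file name in the usage comment are hypothetical.

```python
import pandas as pd
from shapely import wkt


def load_geometries(csv_source, geometry_col: str = "geometry") -> pd.DataFrame:
    """Read a CSV and turn its WKT strings back into shapely geometries."""
    df = pd.read_csv(csv_source)
    df[geometry_col] = df[geometry_col].apply(wkt.loads)
    return df


# Usage in the notebook (the file name is a placeholder):
# import geopandas as gpd
# df = load_geometries("countries_2017.csv")
# gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
```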

Creating the map.

With Kepler 

# Conclusion  

Analyzing this dataset was quite challenging.  
It also provided many useful hints about how to approach situations involving a large amount of data.  
It was especially helpful for improving data visualization skills.  
Many other insights could be extracted in future work.