Skip to content


Repository files navigation

Global Terrorism Database Clustering


The repository contains a visualization of the the Global Terrorism Database (GTD) that is hosted on Kaggle and is provided by the START Consortium. It contains more than 170,000 terrorist attacks from all over the world from 1970 to 2016 (Figure 1). To visualize this dataset, I decided to implement k-means in PySpark to cluster the different geo-locations of the terrorist attacks.


Preprocessing was relatively straightforward for this dataset. Using the Pandas library, I created a DataFrame with only the latitude and longitude. After this, the data was cleaned by dropping any row with either the latitude or longitude missing. From this, I created a CSV file without any indices or headers in order to run it from Amazon S3.


As an illustrative example of clustering on this dataset, I chose to set k=5 and to use the great-circle distance metric because it would roughly correspond to a clustering for each continent and also because we are dealing with the approximate spherical geometry of the earth. The resulting centroids (Figure 2) appear to correspond to actual centers of terrorist attacks.


Figure 1: The entire Global Terror Attacks (1970 - 2016) dataset. The solid red circles are terrorist attacks before clustering is performed.

Figure 2: A map containing the centroids resulting from the k-means algorithm using k=5 and the great-circle distance metric on the whole Global Terror Attacks (1970 - 2016) dataset. It shows clusters of terrorist attacks in the following locations: Oceania, South Asia, Central Africa, Middle East, and South America.

The five resulting centroid coordinates are:

Latitude Longitude
2.361956292733226 29.36602923946585
9.207740262716625 115.80481474889932
28.406414022007812 74.12808393025578
3.2801139683765075 -79.74785539671865
38.414250505940146 28.417748608803993

Figure 1: These are where terrorist attacks have occured. alt text

Figure 2: After clustering, we find the following centroids. alt text

Figure 3: Plotting both together we can see the following map. alt text


A visualization of k-means clustering on terrorist attack locations







No releases published


No packages published