---
title: Pirate Booty
authors: Hong Zhang, Samuel Bloom
---

## **Introduction** ##
<hr>
This project takes the pirate attacks reported over the course of a year and analyzes them by location and description. From these attributes, we can determine how the attacks typically ended and which areas are most susceptible, and better understand what drives pirates to steal certain materials.

By using online resources (data found through APIs and PDFs), we can visualize the pirate attacks across the globe, the outcomes that typically stem from them, and much of the nature of piracy. Using the exports and imports of nearby ports, we can further examine what pirates are stealing and whether it is related to the goods these ships carry.



```{python}
%pip install -r ~/pirate-tracker/Home/requirements.txt
```


In [2]:
import geodatasets
import geopandas
import matplotlib
import pandas
import pdfplumber
import regex

Also, run the cell below so that changes to the project's modules are reloaded automatically:

In [3]:
%load_ext autoreload
%autoreload 2

## **Methodology** ##
<hr>

### **Visualizing Pirate Attacks Across the Globe** ###
Using online resources, we found both an API for collecting port data (exports and imports) and a PDF (https://www.recaap.org/) with information on pirate attacks over the course of 2024. We used a library, `fitz`, to digest the PDF. It processes text block by block rather than line by line. This matters because it lets us analyze the **description** of each incident by testing different phrases against it (*a later problem*).

By processing the data through a function that calls `fitz`, `regex`, and `pandas` (libraries that digest text, analyze structure, and sort into dataframes, respectively), we produced `pirate_locations.csv`. The columns of the CSV are as follows:

* Index of Incident
* Latitude
* Longitude 
* Area Location
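
A file with these four columns can be read into one list per column using only the standard library. The sketch below is illustrative: the header names (`index`, `latitude`, `longitude`, `area`) and the two sample rows are made-up placeholders, not values from the real `pirate_locations.csv`, and `load_columns` is a hypothetical helper rather than one of the project's functions.

```python
import csv
import io

# Hypothetical header names and placeholder rows; the real file may differ.
SAMPLE_CSV = """index,latitude,longitude,area
1,1° 15' N,103° 45' E,Placeholder Area A
2,1° 03' N,103° 40' E,Placeholder Area B
"""


def load_columns(handle) -> dict[str, list[str]]:
    """Read a CSV file object into a dict mapping column name to a list of values."""
    reader = csv.DictReader(handle)
    columns = {name: [] for name in reader.fieldnames or []}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = load_columns(io.StringIO(SAMPLE_CSV))
print(columns["latitude"])
```

For the real data, the same helper would be called on `open("pirate_locations.csv", newline="", encoding="utf-8")` instead of the in-memory sample.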

This CSV file contains **107 entries**. We can store the data in lists and run operations over those lists. For example, we had to convert our latitude and longitude data from **DMS** format (degrees, minutes, seconds) to decimal degrees, a value that is easier to process and plot. With this conversion and a variety of libraries (`Geopandas`, `Geodatasets`, `Shapely`, and some previously used ones such as `regex`), we were able to plot the pirate attack incidents on a world map. Using the same data, we also zoomed in on Southeast Asia, a hotspot for real-world piracy, for further context.
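
The DMS conversion can be sketched as follows. This is a minimal example, not the project's actual function: `dms_to_decimal` is a hypothetical name, it uses the standard-library `re` module instead of `regex`, and it assumes coordinates look like `1° 15' N` or `103° 45' 36" E`, which may not match the PDF's exact formatting.

```python
import re

# Assumed input shape: degrees, optional minutes, optional seconds, then a
# hemisphere letter (N, S, E, or W). The real data may be formatted differently.
DMS_PATTERN = re.compile(
    r"""(?P<deg>\d+(?:\.\d+)?)\s*°\s*
        (?:(?P<min>\d+(?:\.\d+)?)\s*['′]\s*)?
        (?:(?P<sec>\d+(?:\.\d+)?)\s*["″]\s*)?
        (?P<hemi>[NSEW])""",
    re.VERBOSE,
)


def dms_to_decimal(text: str) -> float:
    """Convert a DMS coordinate string to signed decimal degrees."""
    match = DMS_PATTERN.search(text.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized coordinate: {text!r}")
    degrees = float(match["deg"])
    minutes = float(match["min"] or 0)
    seconds = float(match["sec"] or 0)
    value = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative in decimal degrees.
    return -value if match["hemi"] in "SW" else value


print(dms_to_decimal("1° 30' N"))
print(dms_to_decimal("10° 15' S"))
```

Signed decimal degrees are what plotting libraries generally expect: longitude goes on the x-axis and latitude on the y-axis.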

**P.S. If you are interested in what the specific functions do, see their docstrings, where each library's usage is explained.**



### **Visualizing Phrases Associated with the Outcomes of Real-World Piracy** ###

## **Results** ##
<hr>

With the functions complete and the libraries from `requirements.txt` installed, we can visualize all of our data:

## **Interpretation** ##
<hr>

In [3]:
from phrase_counter import extract_top_contextual_phrases

df = extract_top_contextual_phrases("incident_descriptions.txt")


Top 20 contextual phrases:
"the crew was not injured": 36
"crew members were accounted for": 20
"all crew members were accounted": 19
"sighted in the engine room": 17
"were sighted in the engine": 16
"crew mustered to conduct a": 15
"in the engine room the": 15
"crew was not injured the": 15
"all crew mustered to conduct": 11
"and all crew mustered to": 10
"all crew members were safe": 10
"the engine room the master": 9
"stolen the master reported the": 9
"engine room the master raised": 9
"was raised and all crew": 9
"alarm was raised and crew": 9
"raised and all crew mustered": 9
"engine spare parts were stolen": 8
"alarm and mustered the crew": 8
"was not injured the master": 8

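Every phrase in the output above is five words long, and some span sentence boundaries, which is what a sliding five-word window over punctuation-free text would produce. The sketch below shows one way such counts could be computed. It is not the actual implementation of `extract_top_contextual_phrases`: `top_ngrams` is a hypothetical helper and the sample text is invented for illustration.

```python
import re
from collections import Counter


def top_ngrams(text: str, n: int = 5, k: int = 20) -> list[tuple[str, int]]:
    """Return the k most common n-word phrases in text, ignoring case and punctuation."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    # Slide a window of n words across the text; each window is one phrase.
    ngrams = (" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return Counter(ngrams).most_common(k)


# Invented sample text, not taken from the incident reports.
sample = "The crew was not injured. Later the crew was not injured again."
for phrase, count in top_ngrams(sample, k=3):
    print(f'"{phrase}": {count}')
```

Because the windows overlap, neighboring phrases share most of their words, which explains near-duplicates in the list such as "crew members were accounted for" and "all crew members were accounted".
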

In [None]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Phrase counts from the incident descriptions (result of the counting loop)
phrase_counts = {
    "not injured": 38,
    "nothing was stolen": 24,
    "were accounted for": 22,
    "no further assistance was required": 10,
    "parts were stolen": 10,
    "was safe": 4,
    "were reported stolen": 3,
    "no injuries were reported": 2,
    "no property stolen": 1,
    "nothing was reported stolen": 1,
    "no property was stolen": 0
}

# Create a WordCloud object
wordcloud = WordCloud(
    width=800,
    height=400,
    background_color='white'
).generate_from_frequencies(phrase_counts)

# Display it using matplotlib
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.title("Phrase Cloud Based on Incident Reports", fontsize=16)
plt.show()
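
The `phrase_counts` dictionary above is hard-coded from an earlier counting loop. A loop of that kind might look like the sketch below, which assumes simple case-insensitive substring matching. `count_phrases` is a hypothetical helper, and the sample text is invented rather than taken from `incident_descriptions.txt`.

```python
import re


def count_phrases(text: str, phrases: list[str]) -> dict[str, int]:
    """Count case-insensitive occurrences of each phrase in text."""
    # Collapse runs of whitespace so phrases split across lines still match.
    normalized = re.sub(r"\s+", " ", text.lower())
    return {phrase: normalized.count(phrase.lower()) for phrase in phrases}


# Invented sample text for illustration only.
sample_text = "Nothing was stolen. The crew was not injured.\nNothing  was stolen."
counts = count_phrases(sample_text, ["nothing was stolen", "not injured", "was safe"])
print(counts)
```

The resulting dictionary can be passed directly to `generate_from_frequencies`.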

