# VAST Mini Challenge 03

This notebook performs an exploratory analysis of the data from the _VAST Mini Challenge 03_ and uses data visualization to answer the three questions the challenge poses.

The .csv and .geojson files of the Abila map were produced by converting the files provided with the mini challenge (in the Abila folder) with the tool https://mapshaper.org. To reproduce this, open the site, upload the Abila folder with its 4 files, and export as .csv and .geojson.

The code in the **geocode.js** file then loads the geojson file of the Abila map and the csv file of reports, iterates over the report locations, and looks up the coordinates of each one. The result is saved as a .json file named **coordinates.json**.

All these generated files are in the folder _abilaMap_processed_. 

**Attention**: running the js code that generates these files takes a few extra steps, so this notebook skips them and uses the files already generated.

In [12]:
import pandas as pd
import spacy as sp
from datetime import datetime
from nltk.corpus import stopwords
import plotly.express as px
import altair as alt

### Abila Map: Locations' Coordinates

In [42]:
coords = pd.read_json("abilaMap_processed/coordinates.json")

coords

Unnamed: 0,location,coords
0,Egeou St / Parla St,"[24.85526400000002, 36.05022]"
1,N. Els St / N. Polvo St,"[24.871374000000017, 36.051901]"
2,2099 Sannan Pky,"[24.89820500000002, 36.069383]"
3,3654 N. Barwyn St,"[24.875345818181838, 36.07422]"
4,3815 N. Blant St,"[24.87282157575759, 36.07712]"
...,...,...
147,Exadakitiou Way / Rist Way,"[24.882564000000016, 36.07542]"
148,N. Limnou St / N. Alm St,"[24.856464000000017, 36.06032]"
149,2299 N. Finiatur St,"[24.85096400000002, 36.06342]"
150,S. Mikonou St / S. Achilleos St,"[24.870004000000023, 36.045235]"
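
Each `coords` value in the table above is a two-element list, which is awkward to plot directly. The sketch below shows one way to split it into separate columns. It assumes the pairs are ordered `[longitude, latitude]` (the GeoJSON convention) and that `pd.read_json` yields Python lists in that column. The column names `longitude` and `latitude` and the variable `coords_flat` are illustrative, not part of the original notebook.

```python
import pandas as pd

# Two sample rows copied from the table above; the full frame is larger.
coords = pd.DataFrame({
    "location": ["Egeou St / Parla St", "2099 Sannan Pky"],
    "coords": [[24.85526400000002, 36.05022], [24.89820500000002, 36.069383]],
})

# Assumption: each pair is ordered [longitude, latitude].
pairs = pd.DataFrame(
    coords["coords"].tolist(),
    columns=["longitude", "latitude"],
    index=coords.index,
)

# Replace the list column with the two numeric columns.
coords_flat = coords.drop(columns="coords").join(pairs)
print(coords_flat)
```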


## Preprocessing
Preprocessing removes punctuation and stop words and lemmatizes the remaining tokens. After that, calls ('ccdata' type) and posts ('mbdata' type) are separated into different files.
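
To make the filtering step concrete, here is a minimal standard-library sketch of stop-word and punctuation removal on an already tokenized message. It is only an illustration: the stop-word list is a tiny hypothetical one (the notebook uses NLTK's English list), the sample tokens are invented, and lemmatization is left out because it needs a spaCy model. The helper name `filter_tokens` is also an assumption.

```python
import string

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "is", "on", "at", "of"}

def filter_tokens(tokens):
    """Drop stop words (case-insensitively) and pure-punctuation tokens."""
    return [
        t for t in tokens
        if t.lower() not in STOP_WORDS
        and not all(ch in string.punctuation for ch in t)
    ]

# Invented sample message, already split into tokens.
sample = ["The", "fire", "is", "spreading", "on", "Egeou", "St", "!"]
cleaned = " ".join(filter_tokens(sample))
print(cleaned)  # fire spreading Egeou St
```

Comparing lowercased tokens matters because stop-word lists such as NLTK's are all lowercase, so a capitalized "The" at the start of a sentence would otherwise survive the filter.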

In [3]:
# Loading the appropriate model
#
# The models must be installed first. Choose one (or both, for good measure):
#  - Larger, slower, but more accurate: python -m spacy download en_core_web_trf
#  - Smaller, faster, but less accurate: python -m spacy download en_core_web_sm
nlp = sp.load("en_core_web_trf")

In [6]:
# Loading and preprocessing the data sets.
#
# Time periods:
#  - Period one: 1700-1830
#  - Period two: 1831-2000
#  - Period three: 2001-2131

periods = ("1700-1830", "1831-2000", "2001-2131")
data_frames = [pd.read_csv(f"original_csv/csv-{period}.csv") for period in periods]
# A set gives fast membership tests; the corpus needs nltk.download("stopwords") once.
stop_words = set(stopwords.words("english"))

for index, data_frame in enumerate(data_frames):
    print(f"[{index + 1}/{len(data_frames)}] Processing data-frames...")
    data_frame_length = data_frame.shape[0]

    for row in range(data_frame_length):
        print(f" |_ [{row + 1}/{data_frame_length}] Processing row...", end="\r")

        message = data_frame.loc[row, "message"]
        document = nlp(message)
        tokens = [
            token.lemma_ for token in document
            # NLTK stop words are lowercase, so compare the lowercased text
            if token.text.lower() not in stop_words  # Remove stop words
            and not token.is_punct                   # Remove punctuation
        ]

        data_frame.loc[row, "message"] = " ".join(tokens)

    print(f" |_ [{data_frame_length}/{data_frame_length}] Processing completed.")

combined_csv = pd.concat(data_frames)
combined_csv.to_csv("processed/combined_csv.csv", index=False)

[1/3] Processing data-frames...
 |_ [1033/1033] Processing completed.
[2/3] Processing data-frames...
 |_ [1815/1815] Processing completed.
[3/3] Processing data-frames...
 |_ [1215/1215] Processing completed.


In [15]:
# Separate posts and calls (reports)
combined_csv = pd.read_csv("processed/combined_csv.csv")
combined_csv['timestamp'] = pd.to_datetime(
    combined_csv['date(yyyyMMddHHmmss)'].astype(str), format='%Y%m%d%H%M%S'
).dt.strftime("%Y-%m-%d %H:%M:%S")
combined_csv = combined_csv.drop(['date(yyyyMMddHHmmss)'], axis='columns')

posts = combined_csv[combined_csv['type'] == 'mbdata']

posts.to_csv("processed/posts.csv", index=False)

reports = combined_csv[combined_csv['type'] == 'ccdata']

reports = reports.drop(['author', 'longitude', 'latitude'], axis='columns')\
    .rename(columns={' location' : 'location'})

reports.to_csv("processed/reports.csv", index=False)
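
The timestamp conversion in the cell above parses the compact `yyyyMMddHHmmss` value and renders it again in a readable form. The standalone sketch below shows that round trip for a single value. The helper name `to_timestamp` and the sample values are illustrative and not taken from the data set.

```python
from datetime import datetime

def to_timestamp(raw):
    """Convert a yyyyMMddHHmmss value (int or str) to 'YYYY-MM-DD HH:MM:SS'."""
    parsed = datetime.strptime(str(raw), "%Y%m%d%H%M%S")
    return parsed.strftime("%Y-%m-%d %H:%M:%S")

# Illustrative value, not taken from the data set.
print(to_timestamp(20140123170000))  # 2014-01-23 17:00:00
```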

In [None]:
posts = pd.read_csv("processed/posts.csv") # tweets
reports = pd.read_csv("processed/reports.csv") # reports (emergency calls)
combined_csv = pd.read_csv("processed/combined_csv.csv") # csv containing the three periods (1700-1830, 1831-2000, 2001-2131)
coordinates = pd.read_json("abilaMap_processed/coordinates.json") # mapping from location to coordinates 
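
Since `coordinates` maps each location to its coordinates, a natural next step is to attach those coordinates to the reports. The sketch below does this with a left join on `location`. It uses invented sample rows and assumes the `location` strings match exactly in both frames (no stray whitespace). The names `located` and `unmatched` are hypothetical.

```python
import pandas as pd

# Invented sample rows; column names follow the files written above.
reports = pd.DataFrame({
    "type": ["ccdata", "ccdata"],
    "timestamp": ["2014-01-23 17:00:00", "2014-01-23 17:05:00"],
    "message": ["sample report a", "sample report b"],
    "location": ["Egeou St / Parla St", "Unknown Rd"],
})
coordinates = pd.DataFrame({
    "location": ["Egeou St / Parla St"],
    "coords": [[24.85526400000002, 36.05022]],
})

# Left join keeps every report; unmatched locations get NaN in "coords".
located = reports.merge(coordinates[["location", "coords"]], on="location", how="left")

# Count reports whose location was not found in the mapping.
unmatched = int(located["coords"].isna().sum())
print(located[["location", "coords"]])
print(f"{unmatched} report(s) without coordinates")
```

A left join is used here so that reports with an unrecognized location stay visible instead of silently disappearing, which makes it easier to spot mismatched location strings.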