# Visual Analytics Science and Technology (VAST) Mini Challenge 03

**Authors:**

- Gabriela S. Maximino
- Igor Matheus S. Moreira

**Objective:**

This notebook aims to perform an exploratory analysis of the data from the VAST Mini Challenge 03, in order to answer the three questions proposed through data visualization. The questions to be answered are the following:

1. Using visual analytics, characterize the different types of content in the dataset. What distinguishes meaningful event reports from typical chatter from junk or spam? *Please limit your answer to 8 images and 500 words.*
2. Use visual analytics to represent and evaluate how the level of the risk to the public evolves over the course of the evening. Consider the potential consequences of the situation and the number of people who could be affected. *Please limit your answer to 10 images and 1000 words.*
3. If you were able to send a team of first responders to any single place, where would it be? Provide your rationale. How might your response be different if you had to respond to the events in real time rather than retrospectively? *Please limit your answer to 8 images and 500 words.*

In [1]:
import altair as alt
import pandas as pd
import spacy as sp

In [2]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Preprocessing

Some preprocessing steps were performed outside this notebook:

1. The `.geo.json` and `.topo.json` versions of the Abila map were obtained by converting the files provided by the mini challenge (`Abila.dbf`, `Abila.prj`, `Abila.shp`, and `Abila.shx`) using [Mapshaper](https://mapshaper.org).
2. After obtaining `Abila.geo.json`, the code from `geocode.js` (kindly disclosed in a [GitHub Gist by Tiago Davi](https://gist.github.com/tiagodavi70/d86e7152a730d7c485883751504a6627)) was used to geocode the locations of the emergency calls into coordinates of `Abila.geo.json`, resulting in `coordinates.json`.

All aforementioned files are in the `abila` folder.

Within the scope of this notebook, the preprocessing steps contained herein are the following:

3. All three `.csv` files provided by the mini challenge (located in the `messages` folder) are loaded and concatenated into one data frame. During this process, column names are normalized and time data are parsed into a timestamp column.
4. `coordinates.json` is loaded. During this process, column names are normalized.
5. The coordinates produced in `coordinates.json` are merged into the `Latitude` and `Longitude` columns of the concatenated data frame. The `Location` column is kept (the line that would drop it is commented out) because the map tooltips display it.
6. Messages containing links are filtered out of the concatenated data frame.
7. The messages are separated by type (posts and emergency call reports).

In [71]:
# Loading and concatenating the `.csv` files.
#
# Time periods:
#  - Period one: 1700-1830
#  - Period two: 1831-2000
#  - Period three: 2001-2131

periods = ("1700-1830", "1831-2000", "2001-2131")
columns = ("Type", "Timestamp", "Username", "Message", "Latitude", "Longitude", "Location")
data_frames = [pd.read_csv(f"messages/csv-{period}.csv", header=0, names=columns, parse_dates=[1])
               for period in periods]

combined_data_frame = pd.concat(data_frames, ignore_index=True)
combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "mbdata", "Type"] = "Microblog"
combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "ccdata", "Type"] = "Call center"
combined_data_frame.head(5)

Unnamed: 0,Type,Timestamp,Username,Message,Latitude,Longitude,Location
0,Microblog,2014-01-23 17:00:00,POK,Follow us @POK-Kronos,,,
1,Microblog,2014-01-23 17:00:00,maha_Homeland,Don't miss a moment! Follow our live coverage...,,,
2,Microblog,2014-01-23 17:00:00,Viktor-E,Come join us in the Park! Music tonight at Abi...,,,
3,Microblog,2014-01-23 17:00:00,KronosStar,POK rally to start in Abila City Park. POK lea...,,,
4,Microblog,2014-01-23 17:00:00,AbilaPost,POK rally set to take place in Abila City Park...,,,


In [72]:
# Loading `coordinates.json`.

coordinates = pd.read_json("abila/coordinates.json")
coordinates.columns = ("Location", "Coordinates")
coordinates.head(5)

Unnamed: 0,Location,Coordinates
0,Egeou St / Parla St,"[24.85526400000002, 36.05022]"
1,N. Els St / N. Polvo St,"[24.871374000000017, 36.051901]"
2,2099 Sannan Pky,"[24.89820500000002, 36.069383]"
3,3654 N. Barwyn St,"[24.875345818181838, 36.07422]"
4,3815 N. Blant St,"[24.87282157575759, 36.07712]"


In [73]:
# Merging `coordinates` into `combined_data_frame`.
#
# This process makes the number of non-null Latitude/Longitude entries increase from 147 to 176.

for i in combined_data_frame.loc[~combined_data_frame.loc[:, "Location"].isna()].index:
    coordinate = coordinates.loc[coordinates.loc[:, "Location"] == combined_data_frame.loc[i, "Location"]]
    
    try:
        new_longitude, new_latitude = coordinate.reset_index().iloc[0, -1]
        combined_data_frame.loc[i, "Longitude"] = new_longitude
        combined_data_frame.loc[i, "Latitude"] = new_latitude
    except IndexError:
        # No geocoded match for this location; keep the original coordinates.
        continue
        
# combined_data_frame = combined_data_frame.drop("Location", axis="columns")
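
The row-by-row lookup above can also be written as a vectorized mapping. The following is only a sketch on toy data, not the notebook's actual pipeline: the frame `messages` stands in for `combined_data_frame`, the street name `Unknown St` and the pre-existing coordinate values are made up for illustration, and each `Coordinates` entry is assumed to be a `[longitude, latitude]` pair, as in the output shown earlier.

```python
import pandas as pd

# Toy stand-in for `coordinates` (one [longitude, latitude] pair per location).
coordinates = pd.DataFrame({
    "Location": ["Egeou St / Parla St", "2099 Sannan Pky"],
    "Coordinates": [[24.855264, 36.05022], [24.898205, 36.069383]],
})

# Toy stand-in for `combined_data_frame`; "Unknown St" is a hypothetical unmatched location.
messages = pd.DataFrame({
    "Location": ["Egeou St / Parla St", None, "Unknown St"],
    "Latitude": [float("nan"), 36.06, float("nan")],
    "Longitude": [float("nan"), 24.88, float("nan")],
})

# Build Location -> longitude and Location -> latitude dictionaries.
pairs = list(zip(coordinates["Location"], coordinates["Coordinates"]))
longitude_by_location = {location: pair[0] for location, pair in pairs}
latitude_by_location = {location: pair[1] for location, pair in pairs}

# Geocoded values take precedence; rows without a match keep their original values.
messages["Longitude"] = messages["Location"].map(longitude_by_location).fillna(messages["Longitude"])
messages["Latitude"] = messages["Location"].map(latitude_by_location).fillna(messages["Latitude"])
```

Like the loop, this overwrites coordinates only where the location was geocoded, and it leaves unmatched rows untouched without needing a `try`/`except` block.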

In [74]:
# Filtering messages containing links.

indices_to_drop = []

for i in combined_data_frame.index:
    message = combined_data_frame.loc[i, "Message"]
    message_parts = message.split(" ")
    
    for message_part in message_parts:
        if "." in message_part and "/" in message_part:
            indices_to_drop.append(i)
            break

combined_data_frame = combined_data_frame.drop(indices_to_drop, axis=0)
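
The same link heuristic (a space-separated token that contains both `.` and `/`) can be expressed as a single regular expression. This is a sketch on made-up messages, not part of the original analysis; the second message and its link-like token are hypothetical. Passing `na=False` is an added safeguard so that missing messages are treated as link-free instead of raising an error.

```python
import pandas as pd

# Illustrative messages only; the link-like token below is hypothetical.
messages = pd.Series([
    "Follow us @POK-Kronos",
    "Great deals at example.kronos/deals today",
    "Meeting at 5. Park/plaza entrance",
])

# A token holds both characters when "." and "/" are separated only by non-space characters.
link_pattern = r"\.[^ ]*/|/[^ ]*\."
has_link = messages.str.contains(link_pattern, regex=True, na=False)

# Keep only the messages that do not look like they contain a link.
filtered = messages.loc[~has_link]
```

The third message is kept because its `.` and `/` sit in different tokens. As with the loop, this remains a heuristic: a token such as `3/4.` would also be flagged as a link.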

In [75]:
# Separating posts from emergency call reports.
combined_data_frame["User mentioned"] = ""
combined_data_frame["Hashtag"] = ""

posts = combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "Microblog"]
reports = combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "Call center"]

In [76]:
combined_data_frame

Unnamed: 0,Type,Timestamp,Username,Message,Latitude,Longitude,Location,User mentioned,Hashtag
0,Microblog,2014-01-23 17:00:00,POK,Follow us @POK-Kronos,,,,,
1,Microblog,2014-01-23 17:00:00,maha_Homeland,Don't miss a moment! Follow our live coverage...,,,,,
2,Microblog,2014-01-23 17:00:00,Viktor-E,Come join us in the Park! Music tonight at Abi...,,,,,
3,Microblog,2014-01-23 17:00:00,KronosStar,POK rally to start in Abila City Park. POK lea...,,,,,
4,Microblog,2014-01-23 17:00:00,AbilaPost,POK rally set to take place in Abila City Park...,,,,,
...,...,...,...,...,...,...,...,...,...
4058,Microblog,2014-01-23 21:33:10,plasticParts,RT @AbilaPost unknown explosion heard from the...,,,,,
4059,Microblog,2014-01-23 21:33:45,klingon4real,RT @CentralBulletin explosion heard at dancing...,,,,,
4060,Microblog,2014-01-23 21:34:00,lindyT,RT @KronosStar There has been an explosion fro...,,,,,
4061,Microblog,2014-01-23 21:34:00,dolls4sale,RT @redisrad What was that? #boom,,,,,


In [83]:
# Logging mentions and hashtags

# Iterate over the actual index labels: rows were dropped by the link filter, so
# positions from `range(len(...))` no longer match the labels used by `.loc`.
for i in combined_data_frame.index:
    message = combined_data_frame.loc[i, "Message"]
    mentions = []
    hashtags = []

    for part in message.split(" "):
        part = part.strip()

        if len(part) == 0:
            continue
        if part[0] == "@":
            mentions.append(part[1:])
        elif part[0] == "#":
            hashtags.append(part[1:])

    if mentions:
        combined_data_frame.at[i, "User mentioned"] = mentions
    if hashtags:
        combined_data_frame.at[i, "Hashtag"] = hashtags

# One row per (mention, hashtag) combination; `drop=True` discards the old index.
combined_data_frame = combined_data_frame.explode("User mentioned").reset_index(drop=True)\
    .explode("Hashtag").reset_index(drop=True)
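
The token loop above can be approximated with `Series.str.findall`. The sketch below runs on three sample messages and is an alternative formulation, not the code that produced the outputs in this notebook. One difference to be aware of: messages without mentions or hashtags end up with missing values instead of empty strings.

```python
import pandas as pd

# Sample messages; the last one is made up to show the "no tags" case.
frame = pd.DataFrame({"Message": [
    "Follow us @POK-Kronos",
    "RT @redisrad What was that? #boom",
    "No tags here",
]})

# A mention or hashtag is a token that starts with "@" or "#" (not preceded by a non-space).
frame["User mentioned"] = frame["Message"].str.findall(r"(?<!\S)@(\S+)")
frame["Hashtag"] = frame["Message"].str.findall(r"(?<!\S)#(\S+)")

# One row per (mention, hashtag) combination; empty lists become missing values.
long_frame = (
    frame.explode("User mentioned")
    .reset_index(drop=True)
    .explode("Hashtag")
    .reset_index(drop=True)
)
```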

In [84]:
combined_data_frame

Unnamed: 0,Type,Timestamp,Username,Message,Latitude,Longitude,Location,User mentioned,Hashtag
0,Microblog,2014-01-23 17:00:00,POK,Follow us @POK-Kronos,,,,POK-Kronos,
1,Microblog,2014-01-23 17:00:00,maha_Homeland,Don't miss a moment! Follow our live coverage...,,,,,
2,Microblog,2014-01-23 17:00:00,Viktor-E,Come join us in the Park! Music tonight at Abi...,,,,,
3,Microblog,2014-01-23 17:00:00,KronosStar,POK rally to start in Abila City Park. POK lea...,,,,,KronosStar
4,Microblog,2014-01-23 17:00:00,AbilaPost,POK rally set to take place in Abila City Park...,,,,,AbilaPost
...,...,...,...,...,...,...,...,...,...
10572,Microblog,2014-01-23 21:34:00,lindyT,RT @KronosStar There has been an explosion fro...,,,,KronosStar,DancingDolphinFire
10573,Microblog,2014-01-23 21:34:00,lindyT,RT @KronosStar There has been an explosion fro...,,,,KronosStar,AFDHeroes
10574,Microblog,2014-01-23 21:34:00,dolls4sale,RT @redisrad What was that? #boom,,,,redisrad,boom
10575,Microblog,2014-01-23 21:34:45,worldWatcher,RT @CentralBulletin explosion heard at dancing...,,,,CentralBulletin,Abila


## Visualizing

In [52]:
entries_with_coordinates = combined_data_frame.dropna(subset=["Latitude", "Longitude"])
# `Type` was relabeled during loading, so filter on the new labels.
post_entries_with_coordinates = entries_with_coordinates.loc[entries_with_coordinates.loc[:, "Type"] == "Microblog"]
report_entries_with_coordinates = entries_with_coordinates.loc[entries_with_coordinates.loc[:, "Type"] == "Call center"]

In [53]:
abila_map = alt.topo_feature("abila/Abila.topo.json", feature="Abila-geojson")
type_selection = alt.selection_multi(fields=["Type"], bind="legend")
timeline_brush_area = alt.selection_interval(encodings=["x"])

In [54]:
timeline_base = alt.Chart(entries_with_coordinates).mark_bar().encode(
    alt.X("Timestamp:T", title="Hour of day", axis=alt.Axis(tickCount=20), bin=alt.Bin(maxbins=100)),
    alt.Y("count()", title="Count"),
    color=alt.Color("Type:N", scale=alt.Scale(scheme="set1")),
    opacity=alt.condition(type_selection, alt.value(.6), alt.value(.2)),
).properties(
    width=440,
    height=100,
)

In [55]:
timeline_background = timeline_base.encode(
    color=alt.value("lightgray")
).add_selection(
    timeline_brush_area
)

In [56]:
timeline_highlight = timeline_base.transform_filter(
    timeline_brush_area
).add_selection(
    type_selection
)

In [85]:
map_background = alt.Chart(abila_map).mark_geoshape(
    stroke="black",
    fill="None"
).properties(
    title="Location of posts and reports",
    width=440,
    height=400
)

In [86]:
map_points = alt.Chart(entries_with_coordinates).mark_circle(size=100).encode(
    longitude="Longitude:Q",
    latitude="Latitude:Q",
    color=alt.Color("Type:N", scale=alt.Scale(scheme="set1")),
    opacity=alt.condition(type_selection, alt.value(.4), alt.value(.1)),
    tooltip=["Username:N", "Message:N", "Location:N"]
).transform_filter(
    timeline_brush_area
).add_selection(alt.selection_single()) # Avoid tooltip bug

In [104]:
# hashtags_bar_plot = alt.Chart(entries_with_coordinates).mark_bar().encode(
#     alt.X("count()", title="Count"),
#     alt.Y("Hashtag:N", title="", sort=alt.EncodingSortField(op="count", order="descending")),
#     tooltip=[alt.Tooltip("count()", title="Number of mentions: ")]
# ).properties(
#     title="Hashtag occurrences",
#     width=300,
#     height=550
# ).add_selection(type_selection).transform_filter(type_selection).transform_filter(timeline_brush_area)


hashtags_bar_plot = alt.Chart(entries_with_coordinates).transform_filter(
    # The selection filters must run before the aggregation,
    # which discards the `Type` and `Timestamp` fields they rely on.
    type_selection
).transform_filter(
    timeline_brush_area
).transform_aggregate(
    Count="count()",
    groupby=["Hashtag"]
).transform_window(
    rank="rank(Count)",
    sort=[alt.SortField("Count", order="descending")]
).transform_filter(
    alt.datum.rank < 10
).mark_bar().encode(
    alt.X("Count:Q", title="Count"),
    alt.Y("Hashtag:N", title="", sort="-x"),  # Most frequent hashtag on top
    tooltip=[alt.Tooltip("Count:Q", title="Number of mentions: ")]
).properties(
    title="Hashtag occurrences",
    width=300,
    height=550
).add_selection(
    type_selection
)
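
The aggregate, rank, and filter steps of the chart can be cross-checked outside Altair. The following is a sketch in pandas on a made-up list of hashtags, with a hypothetical threshold of 3 instead of the 10 used in the chart. It assumes that the `rank` window operation gives tied counts the same rank and skips the following ranks, which corresponds to `method="min"` in pandas.

```python
import pandas as pd

# Made-up hashtags, one entry per occurrence.
hashtags = pd.Series(["a", "a", "a", "b", "b", "c", "c", "d"])

# Aggregate: number of occurrences per hashtag.
counts = hashtags.value_counts()

# Window: rank by descending count; ties share the lowest rank of their group.
ranks = counts.rank(method="min", ascending=False)

# Filter: keep hashtags whose rank is below the (hypothetical) threshold.
rank_threshold = 3
top_hashtags = counts[ranks < rank_threshold]
```

Because ties share a rank, a filter such as `rank < 10` can return more than nine hashtags when several of them have the same count.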

In [106]:
(((timeline_background + timeline_highlight) & (map_background + map_points)) | hashtags_bar_plot)