# Visual Analytics Science and Technology (VAST) Mini Challenge 03

**Authors:**

- Gabriela S. Maximino
- Igor Matheus S. Moreira

**Objective:**

This notebook performs an exploratory analysis of the data from the VAST Mini Challenge 03 and uses data visualization to answer the three questions posed by the challenge. The questions are the following:

1. Using visual analytics, characterize the different types of content in the dataset. What distinguishes meaningful event reports from typical chatter from junk or spam? *Please limit your answer to 8 images and 500 words.*
2. Use visual analytics to represent and evaluate how the level of the risk to the public evolves over the course of the evening. Consider the potential consequences of the situation and the number of people who could be affected. *Please limit your answer to 10 images and 1000 words.*
3. If you were able to send a team of first responders to any single place, where would it be? Provide your rationale. How might your response be different if you had to respond to the events in real time rather than retrospectively? *Please limit your answer to 8 images and 500 words.*

## Requirements

### Environment

In [1]:
import altair as alt
import pandas as pd
import spacy as sp

In [2]:
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

### Definitions

In [3]:
def get_hashtags(data_frame):
    # Collect one record per message and build the data frame once
    # (`DataFrame.append` was removed in pandas 2.0).
    records = []

    for row in data_frame.index:
        if data_frame.loc[row, "Type"] == "Call center":
            continue

        hashtags = []

        message = data_frame.loc[row, "Message"]
        parts = [part.strip() for part in message.split(" ")]

        for part in parts:
            if len(part) == 0:
                continue
            elif part[0] == "#":
                # Normalize five-character variants of "#pok" to "pok".
                if part[:4].lower() == "#pok" and len(part) == 5:
                    hashtags.append("pok")
                else:
                    hashtags.append(part[1:].lower().strip("'").strip('"'))

        # Messages without hashtags get the placeholder "N/D".
        if len(hashtags) == 0:
            hashtags = "N/D"

        records.append({"ID": row, "Hashtag": hashtags})

    hashtags_data_frame = pd.DataFrame(records, columns=["ID", "Hashtag"])
    hashtags_data_frame = hashtags_data_frame.dropna(subset=["Hashtag"]).explode("Hashtag", ignore_index=True)

    return hashtags_data_frame.infer_objects()

In [4]:
def get_mentions(data_frame):
    # Collect one record per message and build the data frame once
    # (`DataFrame.append` was removed in pandas 2.0).
    records = []

    for row in data_frame.index:
        if data_frame.loc[row, "Type"] == "Call center":
            continue

        mentions = []

        message = data_frame.loc[row, "Message"]
        parts = [part.strip() for part in message.split(" ")]

        for part in parts:
            if len(part) == 0:
                continue
            elif part[0] == "@":
                mentions.append(part[1:])

        # Messages without mentions get the placeholder "N/D".
        if len(mentions) == 0:
            mentions = "N/D"

        records.append({"ID": row, "Mention": mentions})

    mentions_data_frame = pd.DataFrame(records, columns=["ID", "Mention"])
    mentions_data_frame = mentions_data_frame.dropna(subset=["Mention"]).explode("Mention", ignore_index=True)

    return mentions_data_frame.infer_objects()
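
The two helpers above split each message on spaces and inspect the first character of each token. A roughly equivalent extraction can be sketched with regular expressions. The names `extract_hashtags` and `extract_mentions` and both patterns are assumptions made for illustration, and the sketch does not reproduce the `#pok` special case.

```python
import re

# Hypothetical patterns: a tag or mention must start a whitespace-delimited
# token, mirroring the `part[0] == "#"` and `part[0] == "@"` checks.
HASHTAG_PATTERN = re.compile(r"(?<!\S)#(\S+)")
MENTION_PATTERN = re.compile(r"(?<!\S)@(\S+)")


def extract_hashtags(message):
    """Return lower-cased hashtags without '#', stripped of surrounding quotes."""
    return [tag.lower().strip("'\"") for tag in HASHTAG_PATTERN.findall(message)]


def extract_mentions(message):
    """Return mentioned usernames without '@', with their case preserved."""
    return MENTION_PATTERN.findall(message)


print(extract_hashtags("RT @redisrad What was that? #boom"))
print(extract_mentions("Follow us @POK-Kronos"))
```

A per-message function like this could be applied with `Series.map` and followed by `explode`, which would replace the explicit loop over the index.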

## Preprocessing

Some preprocessing steps were performed outside this notebook:

1. The `.geo.json` and `.topo.json` versions of the Abila map were obtained by converting the files provided by the mini challenge (`Abila.dbf`, `Abila.prj`, `Abila.shp`, and `Abila.shx`) using [Mapshaper](https://mapshaper.org).
2. After obtaining `Abila.geo.json`, the code from `geocode.js` (kindly shared in a [GitHub Gist by Tiago Davi](https://gist.github.com/tiagodavi70/d86e7152a730d7c485883751504a6627)) was used to geocode the locations of the emergency calls into coordinates of `Abila.geo.json`, resulting in `coordinates.json`.

All aforementioned files are in the `abila` folder.

The preprocessing steps performed within this notebook are the following:

3. All three `.csv` files provided by the mini challenge (located in the `messages` folder) are loaded and concatenated into one data frame. During this process, column names are normalized and time data are parsed into a timestamp column.
4. `coordinates.json` is loaded. During this process, column names are normalized.
5. The coordinates from `coordinates.json` are merged into the `Latitude` and `Longitude` columns of the concatenated data frame. The `Location` column is kept (the line that drops it is commented out), since the map tooltips below still use it.
6. Messages containing links are filtered out of the concatenated data frame.
7. The messages are separated by type (posts and emergency call reports).

In [5]:
# Loading and concatenating the `.csv` files.
#
# Time periods:
#  - Period one: 1700-1830
#  - Period two: 1831-2000
#  - Period three: 2001-2131

periods = ("1700-1830", "1831-2000", "2001-2131")
columns = ("Type", "Timestamp", "Username", "Message", "Latitude", "Longitude", "Location")
data_frames = [pd.read_csv(f"messages/csv-{period}.csv", header=0, names=columns, parse_dates=[1])
               for period in periods]

combined_data_frame = pd.concat(data_frames, ignore_index=True)
combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "mbdata", "Type"] = "Microblog"
combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "ccdata", "Type"] = "Call center"
combined_data_frame.head(5)

| | Type | Timestamp | Username | Message | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|
| 0 | Microblog | 2014-01-23 17:00:00 | POK | Follow us @POK-Kronos | | | |
| 1 | Microblog | 2014-01-23 17:00:00 | maha_Homeland | Don't miss a moment! Follow our live coverage... | | | |
| 2 | Microblog | 2014-01-23 17:00:00 | Viktor-E | Come join us in the Park! Music tonight at Abi... | | | |
| 3 | Microblog | 2014-01-23 17:00:00 | KronosStar | POK rally to start in Abila City Park. POK lea... | | | |
| 4 | Microblog | 2014-01-23 17:00:00 | AbilaPost | POK rally set to take place in Abila City Park... | | | |


In [6]:
# Loading `coordinates.json`.

coordinates = pd.read_json("abila/coordinates.json")
coordinates.columns = ("Location", "Coordinates")
coordinates.head(5)

| | Location | Coordinates |
|---|---|---|
| 0 | Egeou St / Parla St | [24.85526400000002, 36.05022] |
| 1 | N. Els St / N. Polvo St | [24.871374000000017, 36.051901] |
| 2 | 2099 Sannan Pky | [24.89820500000002, 36.069383] |
| 3 | 3654 N. Barwyn St | [24.875345818181838, 36.07422] |
| 4 | 3815 N. Blant St | [24.87282157575759, 36.07712] |


In [7]:
# Merging `coordinates` into `combined_data_frame`.
#
# This process makes the number of non-null Latitude/Longitude entries increase from 147 to 176.

for i in combined_data_frame.loc[~combined_data_frame.loc[:, "Location"].isna()].index:
    coordinate = coordinates.loc[coordinates.loc[:, "Location"] == combined_data_frame.loc[i, "Location"]]
    
    try:
        new_longitude, new_latitude = coordinate.reset_index().iloc[0, -1]
        combined_data_frame.loc[i, "Longitude"] = new_longitude
        combined_data_frame.loc[i, "Latitude"] = new_latitude
    except (IndexError, TypeError, ValueError):  # No usable coordinates for this location.
        continue
        
# combined_data_frame = combined_data_frame.drop("Location", axis="columns")
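
The row-by-row loop above can also be expressed as a lookup. The following is a minimal sketch on toy data, assuming (as in the loop) that each `Coordinates` entry is a `[longitude, latitude]` pair and that location names in `coordinates` are unique. The frame name `messages`, the `Unknown St` row, and the rounded values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for `coordinates` and `combined_data_frame` (values illustrative).
coordinates = pd.DataFrame({
    "Location": ["Egeou St / Parla St"],
    "Coordinates": [[24.855264, 36.05022]],  # [longitude, latitude]
})
messages = pd.DataFrame({
    "Location": ["Egeou St / Parla St", "Unknown St", None],
    "Latitude": [None, None, 36.06],
    "Longitude": [None, None, 24.88],
}).astype({"Latitude": "float64", "Longitude": "float64"})

# Build one lookup Series per axis, indexed by location name.
lookup = coordinates.set_index("Location")["Coordinates"]
longitude_by_location = lookup.map(lambda pair: pair[0])
latitude_by_location = lookup.map(lambda pair: pair[1])

# Geocoded values take precedence; rows without a match keep what they had.
messages["Longitude"] = messages["Location"].map(longitude_by_location).fillna(messages["Longitude"])
messages["Latitude"] = messages["Location"].map(latitude_by_location).fillna(messages["Latitude"])

print(messages)
```

As in the loop, a geocoded location overrides any coordinates already present in the row, and unmatched locations are left untouched.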

In [8]:
# Filtering out messages that contain links (any token with both "." and "/").

indices_to_drop = []

for i in combined_data_frame.index:
    message = combined_data_frame.loc[i, "Message"]
    message_parts = message.split(" ")
    
    for message_part in message_parts:
        if "." in message_part and "/" in message_part:
            indices_to_drop.append(i)
            break

combined_data_frame = combined_data_frame.drop(indices_to_drop, axis=0)
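
The same link heuristic can be written as a boolean mask rather than a list of indices to drop. This is a minimal sketch on made-up messages; the helper name `has_link` and the example URL are hypothetical.

```python
import pandas as pd

# Made-up messages; the URL in the first row is illustrative only.
messages = pd.DataFrame({"Message": [
    "Follow our live coverage at abilapost.kronos/rally",
    "What was that? #boom",
    "Fire at 5 p.m. on Egeou St / Parla St",
]})


def has_link(message):
    """True when any space-separated token contains both '.' and '/'."""
    return any("." in token and "/" in token for token in message.split(" "))


# Keep only the rows whose message does not look like it contains a link.
without_links = messages.loc[~messages["Message"].map(has_link)]
print(without_links)
```

The third message is kept because the period and the slash occur in different tokens, which shows that the rule only fires when both characters appear in the same token.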

In [9]:
# Separating posts from emergency call reports.

posts = combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "Microblog"]
reports = combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "Call center"]

In [10]:
combined_data_frame

| | Type | Timestamp | Username | Message | Latitude | Longitude | Location |
|---|---|---|---|---|---|---|---|
| 0 | Microblog | 2014-01-23 17:00:00 | POK | Follow us @POK-Kronos | | | |
| 1 | Microblog | 2014-01-23 17:00:00 | maha_Homeland | Don't miss a moment! Follow our live coverage... | | | |
| 2 | Microblog | 2014-01-23 17:00:00 | Viktor-E | Come join us in the Park! Music tonight at Abi... | | | |
| 3 | Microblog | 2014-01-23 17:00:00 | KronosStar | POK rally to start in Abila City Park. POK lea... | | | |
| 4 | Microblog | 2014-01-23 17:00:00 | AbilaPost | POK rally set to take place in Abila City Park... | | | |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4058 | Microblog | 2014-01-23 21:33:10 | plasticParts | RT @AbilaPost unknown explosion heard from the... | | | |
| 4059 | Microblog | 2014-01-23 21:33:45 | klingon4real | RT @CentralBulletin explosion heard at dancing... | | | |
| 4060 | Microblog | 2014-01-23 21:34:00 | lindyT | RT @KronosStar There has been an explosion fro... | | | |
| 4061 | Microblog | 2014-01-23 21:34:00 | dolls4sale | RT @redisrad What was that? #boom | | | |


In [11]:
%%capture
# Separating tweets from retweets and counting retweets.

is_microblog = combined_data_frame.loc[:, "Type"] == "Microblog"
microblog = combined_data_frame.loc[is_microblog, :]

is_retweet = combined_data_frame.loc[:, "Message"].str.startswith("RT @").to_numpy()
is_tweet = ~is_retweet

# Copy the slices so the assignments below do not write to views of `combined_data_frame`.
tweets = combined_data_frame.loc[is_tweet].copy()
retweets = combined_data_frame.loc[is_retweet].copy()

# Drop the leading "RT" and "@user" tokens so each retweet matches the text of its original tweet.
retweets.loc[:, "Message"] = retweets.loc[:, "Message"].map(lambda retweet: " ".join(retweet.split(" ")[2:]))
tweets.loc[:, "Retweets"] = tweets.loc[:, "Message"]\
    .map(lambda message: (retweets.loc[:, "Message"] == message).sum())\
    .to_numpy()
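
The retweet count above compares every tweet against every retweet, which grows quadratically with the number of messages. A minimal sketch of the same idea using `value_counts`, on made-up rows (the frame name `messages` and the username `someone` are hypothetical):

```python
import pandas as pd

# Made-up rows: one original tweet, two retweets of it, and an unrelated tweet.
messages = pd.DataFrame({"Message": [
    "What was that? #boom",
    "RT @redisrad What was that? #boom",
    "RT @someone What was that? #boom",
    "Traffic being diverted from area. #HI",
]})

is_retweet = messages["Message"].str.startswith("RT @")
tweets = messages.loc[~is_retweet].copy()

# Drop the leading "RT" and "@user" tokens, then count each remaining text.
retweeted_text = messages.loc[is_retweet, "Message"].str.split(" ").str[2:].str.join(" ")
retweet_counts = retweeted_text.value_counts()

# Tweets that were never retweeted have no entry in `retweet_counts`.
tweets["Retweets"] = tweets["Message"].map(retweet_counts).fillna(0).astype(int)
print(tweets)
```

Like the original cell, this matches retweets to tweets by exact message text, so two different users posting identical text would share the same count.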

In [12]:
tweets.sort_values("Retweets", ascending=False)

| | Type | Timestamp | Username | Message | Latitude | Longitude | Location | Retweets |
|---|---|---|---|---|---|---|---|---|
| 129 | Microblog | 2014-01-23 17:11:53 | AbilaPost | POK rally expected to draw in excess of 1000 p... | | | | 8 |
| 894 | Microblog | 2014-01-23 18:15:14 | AbilaPost | Dr Audrey McConnell Newman begins her remarks ... | | | | 8 |
| 1693 | Microblog | 2014-01-23 19:05:00 | HomelandIlluminations | Traffic being diverted from area. #HI | | | | 8 |
| 2158 | Microblog | 2014-01-23 19:36:00 | HomelandIlluminations | Reports coming in about a possible hit and run... | | | | 8 |
| 3338 | Microblog | 2014-01-23 20:30:00 | KronosStar | Firemen are running from the building. Somethi... | | | | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1310 | Microblog | 2014-01-23 18:47:06 | KronosQuoth | Eighty percent of success is showing up. #Kro... | | | | 0 |
| 1311 | Microblog | 2014-01-23 18:47:06 | KronosQuoth | Happiness is not something readymade. It come... | | | | 0 |
| 1312 | Microblog | 2014-01-23 18:47:06 | KronosQuoth | The distance between insanity and genius is me... | | | | 0 |
| 1313 | Microblog | 2014-01-23 18:47:07 | trollingsnark | Find anything yet? Political enemies? Bribe mo... | | | | 0 |


## Visualizing

In [13]:
entries_with_coordinates = combined_data_frame.dropna(subset=["Latitude", "Longitude"]).reset_index()
entries_with_coordinates = entries_with_coordinates.rename(columns={"index": "ID"})

In [14]:
abila_map = alt.topo_feature("abila/Abila.topo.json", feature="Abila-geojson")
type_selection = alt.selection_multi(fields=["Type"], bind="legend")
timeline_brush_area = alt.selection_interval(encodings=["x"])

In [15]:
timeline_base = alt.Chart(entries_with_coordinates).mark_bar().encode(
    alt.X("Timestamp:T", title="Hour of day", axis=alt.Axis(tickCount=20), bin=alt.Bin(maxbins=100)),
    alt.Y("count()", title="Count"),
    color=alt.Color("Type:N", scale=alt.Scale(scheme="set1")),
    opacity=alt.condition(type_selection, alt.value(.6), alt.value(.2)),
    tooltip=[alt.Tooltip("count()", title="Number of occurrences: ")]
).properties(
    width=450,
    height=100,
    title="Message histogram"
)

In [16]:
timeline_background = timeline_base.encode(
    color=alt.value("lightgray")
).add_selection(
    timeline_brush_area
)

In [17]:
timeline_highlight = timeline_base.transform_filter(
    timeline_brush_area
).add_selection(
    type_selection
)

In [18]:
map_background = alt.Chart(abila_map).mark_geoshape(
    stroke="black",
    fill="None"
).properties(
    title="Location of posts and reports",
    width=450,
    height=400
)

In [19]:
map_points = alt.Chart(entries_with_coordinates).mark_circle(size=100).encode(
    longitude="Longitude:Q",
    latitude="Latitude:Q",
    color=alt.Color("Type:N", scale=alt.Scale(scheme="set1")),
    opacity=alt.condition(type_selection, alt.value(.4), alt.value(.1)),
    tooltip=["Username:N", "Message:N", "Location:N"]
).transform_filter(
    timeline_brush_area
).add_selection(alt.selection_single()) # Avoid tooltip bug

In [20]:
hashtags_bar_plot = alt.Chart(entries_with_coordinates).transform_lookup(
    lookup="ID",
    from_=alt.LookupData(data=get_hashtags(combined_data_frame), key="ID", fields=["Hashtag"]),
    default="N/D"
).transform_filter(
    (alt.datum.Hashtag != "") & (alt.datum.Hashtag != "N/D")
).mark_bar().encode(
    alt.X("count:Q", title="", scale=alt.Scale()),
    alt.Y("Hashtag:N", title="", sort='-x'),
    tooltip=[alt.Tooltip("count:Q", title="Occurrences")]
).properties(
    title="Tweet posts per hashtag",
    width=225,
    height=245
).add_selection(
    type_selection
).transform_filter(
    type_selection
).transform_filter(
    timeline_brush_area
).transform_aggregate(
    count="count()",
    groupby=["Hashtag"]
).transform_window(
    rank="rank(count)",
    sort=[alt.SortField("count", order="descending")]
)

In [21]:
users_bar_plot = alt.Chart(entries_with_coordinates).transform_filter(
    (alt.datum.Username != None)
).mark_bar().encode(
    alt.X("count:Q", title="", scale=alt.Scale(domain=(0, 64))),
    alt.Y("Username:N", title="", sort='-x'),
    tooltip=[alt.Tooltip("count:Q", title="Occurrences")]
).properties(
    title="Tweet posts per user",
    width=225,
    height=245
).add_selection(
    type_selection
).transform_filter(
    type_selection
).transform_filter(
    timeline_brush_area
).transform_aggregate(
    count="count()",
    groupby=["Username"]
).transform_window(
    rank="rank(count)",
    sort=[alt.SortField("count", order="descending")]
)

In [22]:
((timeline_background + timeline_highlight) & (map_background + map_points)) | (hashtags_bar_plot & users_bar_plot)

In [23]:
all_entries = combined_data_frame.loc[combined_data_frame.loc[:, "Type"] == "Microblog"].reset_index()
all_entries = all_entries.rename(columns={"index": "ID"})

In [24]:
timeline_brush_area_2 = alt.selection_interval(encodings=["x"])

In [25]:
timeline_all_base = alt.Chart(all_entries).mark_bar().encode(
    alt.X("Timestamp:T", title="Hour of day", axis=alt.Axis(tickCount=20), bin=alt.Bin(maxbins=100)),
    alt.Y("count()", title="Count"),
    color=alt.value("#377eb8"),
    tooltip=[alt.Tooltip("count()", title="Number of occurrences: ")]
).properties(
    width=815,
    height=100,
    title="Tweet histogram"
)

In [26]:
timeline_all_background = timeline_all_base.encode(
    color=alt.value("lightgray")
).add_selection(
    timeline_brush_area_2
)

In [27]:
timeline_all_highlight = timeline_all_base.transform_filter(
    timeline_brush_area_2
)

In [28]:
hashtags_all_bar_plot = alt.Chart(all_entries).transform_lookup(
    lookup="ID",
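    # Index by the original message ID so the lookup key matches the `ID` column of `all_entries`.
    from_=alt.LookupData(data=get_hashtags(all_entries.set_index("ID")), key="ID", fields=["Hashtag"]),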
    default="N/D"
).transform_filter(
    (alt.datum.Hashtag != "") & (alt.datum.Hashtag != "N/D")
).mark_bar().encode(
    alt.X("count:Q", title="", scale=alt.Scale()),
    alt.Y("Hashtag:N", title="", sort='-x'),
    tooltip=[alt.Tooltip("count:Q", title="Occurrences")]
).properties(
    title="Tweet posts per hashtag",
    width=343,
    height=400
).transform_filter(
    timeline_brush_area_2
).transform_aggregate(
    count="count()",
    groupby=["Hashtag"]
).transform_window(
    rank="rank(count)",
    sort=[alt.SortField("count", order="descending")]
).transform_filter(
    alt.datum.rank < 10
)

In [29]:
mentions_all_bar_plot = alt.Chart(all_entries).transform_lookup(
    lookup="ID",
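    # Index by the original message ID so the lookup key matches the `ID` column of `all_entries`.
    from_=alt.LookupData(data=get_mentions(all_entries.set_index("ID")), key="ID", fields=["Mention"]),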
    default="N/D"
).transform_filter(
    (alt.datum.Mention != "") & (alt.datum.Mention != "N/D")
).mark_bar().encode(
    alt.X("count:Q", title="", scale=alt.Scale()),
    alt.Y("Mention:N", title="", sort='-x'),
    tooltip=[alt.Tooltip("count:Q", title="Occurrences")]
).properties(
    title="User mentions in tweets",
    width=343,
    height=400
).transform_filter(
    timeline_brush_area_2
).transform_aggregate(
    count="count()",
    groupby=["Mention"]
).transform_window(
    rank="rank(count)",
    sort=[alt.SortField("count", order="descending")]
).transform_filter(
    alt.datum.rank < 10
)

In [30]:
((timeline_all_background + timeline_all_highlight) & (hashtags_all_bar_plot | mentions_all_bar_plot))

In [31]:
timeline_brush_area_3 = alt.selection_interval(encodings=["x"])
user_brush_selection = alt.selection_interval(encodings=["y"])

In [32]:
timeline_tweets_base = alt.Chart(tweets).transform_filter(
    alt.datum.Retweets > 0
).mark_bar().encode(
    alt.X("Timestamp:T", title="Hour of day", axis=alt.Axis(tickCount=20), bin=alt.Bin(maxbins=100)),
    alt.Y("count()", title="Tweets"),
    color=alt.value("#377eb8"),
    tooltip=[alt.Tooltip("count()", title="Number of occurrences: ")]
).properties(
    width=815,
    height=100,
    title="Tweet histogram"
)

In [33]:
timeline_tweets_background = timeline_tweets_base.encode(
    color=alt.value("lightgray")
).add_selection(
    timeline_brush_area_3
)

In [34]:
timeline_tweets_highlight = timeline_tweets_base.transform_filter(
    timeline_brush_area_3
)

In [35]:
retweets_bar_plot = alt.Chart(tweets).transform_filter(
    alt.datum.Retweets > 0
).transform_filter(
    timeline_brush_area_3
).mark_bar().encode(
    alt.X("count:Q", title="", scale=alt.Scale()),
    alt.Y("Username:N", title="", sort="-x"),
    color=alt.condition(user_brush_selection, alt.value("#377eb8"), alt.value("lightgray")),
    tooltip=[alt.Tooltip("count:Q", title="Occurrences")]
).properties(
    title="Most retweeted users",
    width=343,
    height=400
).transform_aggregate(
    count="sum(Retweets)",
    groupby=["Username"]
).transform_window(
    rank="rank(count)",
    sort=[alt.SortField("count", order="descending")]
).transform_filter(
    alt.datum.rank <= 10
).add_selection(
    user_brush_selection
)

In [36]:
users_stripplot = alt.Chart(tweets, height=60).transform_filter(
    alt.datum.Username != None
).mark_circle(
    size=20
).encode(
    x=alt.X("Timestamp:T"),
    y=alt.Y(
        "jitter:Q",
        title=None,
        axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False),
        scale=alt.Scale(),
    ),
    color=alt.Color("Retweets:Q", scale=alt.Scale(scheme="reds")),
    row=alt.Row(
        "Username:N",
        header=alt.Header(
            labelAngle=0,
            titleOrient='top',
            labelOrient='left',
            labelAlign='left',
            labelPadding=3,
        ),
        title="Tweets per user"
    ),
    tooltip=["Message:N"]
).transform_window(
    rank="rank(count)",
    sort=[alt.SortField("count", order="descending")]
).transform_calculate(
    jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
).properties(
    width=343
).transform_filter(
    user_brush_selection
).transform_filter(
    timeline_brush_area_3
).add_selection(
    alt.selection_single()
)
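
The `jitter` expression above has the form of the Box-Muller transform, which turns two uniform random numbers into one draw from a standard normal distribution. The following Python sketch illustrates the formula outside of Vega. The function name `gaussian_jitter`, the seeded generator, and the sample size are assumptions made for the example.

```python
import math
import random


def gaussian_jitter(rng):
    """One Box-Muller draw: sqrt(-2 * ln(u1)) * cos(2 * pi * u2)."""
    # 1 - random() lies in (0, 1], which avoids taking log(0).
    u1 = 1.0 - rng.random()
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


rng = random.Random(0)  # Seeded only to make the example repeatable.
samples = [gaussian_jitter(rng) for _ in range(10_000)]

# The sample mean should be near 0 and the sample variance near 1.
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(variance, 2))
```

Because the draws are normally distributed, most points in each strip cluster around the row's center line, with a few scattered further out.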

In [37]:
((timeline_tweets_background + timeline_tweets_highlight) & (retweets_bar_plot | users_stripplot)).configure_facet(
    spacing=0
).configure_view(
    stroke=None
)

## Dashboard

In [39]:
((timeline_background + timeline_highlight) & (map_background + map_points)) | (hashtags_bar_plot & users_bar_plot)

In [40]:
((timeline_all_background + timeline_all_highlight) & (hashtags_all_bar_plot | mentions_all_bar_plot))