# HCS Workshop 4: Data Science

Welcome to the data science workshop! 

## Part 1: Imports

We'll be using `pandas` for data storage and cleaning, along with `altair` and `seaborn` for exploratory visualization.

In [None]:
import json
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import altair as alt
from altair import datum, expr

alt.data_transformers.disable_max_rows()

## Part 2: Loading Data

The cell below will load your Facebook messages data into a table called `df`. Each row of this table corresponds to a single message, and it has four columns:

- `chat`: The ID of the messenger chat that this message was sent in.
- `sender`: The name of the user who sent the message.
- `time`: A timestamp specifying exactly when the message was sent.
- `content`: The text of the message.

The `build_dataframe()` function loads this data from the files you downloaded from Facebook. It takes one argument, `path`, which is the folder containing the unzipped data. If you're running this notebook in the same directory as your `messages` folder, you don't need to change anything; otherwise, pass the path to your `messages` folder as the argument.

In [None]:
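def build_dataframe(path='messages'):
    """Load every text message under `path`/inbox into a single DataFrame."""
    rows_list = []
    # Each chat has its own folder holding one or more message_*.json files.
    for filename in Path(path).glob('inbox/*/message_*.json'):
        chat = filename.parent.name
        with open(filename, 'r', encoding='utf-8') as f:
            obj = json.load(f)
        for entry in obj['messages']:
            # Keep only 'Generic' entries that actually carry text content.
            if entry['type'] == 'Generic' and entry.get('content') is not None:
                rows_list.append({
                    'chat': chat,
                    'sender': entry['sender_name'],
                    'time': pd.to_datetime(entry['timestamp_ms'], unit='ms'),
                    'content': entry['content'],
                })
    # Naming the columns keeps `df.chat` etc. valid even if no files were found.
    return pd.DataFrame(rows_list, columns=['chat', 'sender', 'time', 'content'])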

df = build_dataframe()
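
If emoji or accented characters look garbled in your `content` column, the export may have stored UTF-8 bytes as if they were Latin-1 characters. The sketch below shows one way to undo that, under the assumption that this is what happened to your data. `fix_mojibake` is a hypothetical helper written for this example and is not part of the workshop code or of pandas.

```python
def fix_mojibake(text):
    """Re-interpret Latin-1 characters as UTF-8 bytes, if that is possible.

    Hypothetical helper: assumes the text was decoded with the wrong codec.
    Strings that do not fit this pattern are returned unchanged.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# 'caf' followed by the two characters that the UTF-8 bytes of 'é' turn into.
garbled = 'caf\u00c3\u00a9'
repaired = fix_mojibake(garbled)
print(repaired)
```

If this helps, you could apply it to the whole table with `df['content'] = df['content'].map(fix_mojibake)`, and likewise for `sender`.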

Let's quickly check to make sure that our data was loaded properly.

In [None]:
df

In [None]:
df.chat.value_counts()
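
Beyond eyeballing the table, a few summary numbers can confirm that the load looks sensible. The sketch below runs on a small made-up table with the same four columns as `df`; the `summarize` helper and the toy rows are illustrative assumptions, not part of the workshop code.

```python
import pandas as pd

# Toy stand-in for the real `df`; chats, names, and messages are made up.
toy = pd.DataFrame({
    'chat': ['groupchat_abc', 'groupchat_abc', 'alice_xyz'],
    'sender': ['Alice', 'Bob', 'Alice'],
    'time': pd.to_datetime(
        [1600000000000, 1600000360000, 1610000000000], unit='ms'),
    'content': ['hello', 'hi there', 'see you soon'],
})


def summarize(frame):
    """Return a few quick facts about a messages table."""
    return {
        'rows': len(frame),
        'chats': frame['chat'].nunique(),
        'senders': frame['sender'].nunique(),
        'first': frame['time'].min(),
        'last': frame['time'].max(),
        'missing_content': int(frame['content'].isna().sum()),
    }


summary = summarize(toy)
print(summary)
```

Calling `summarize(df)` on your own data should show a date range that matches how long you have used Messenger, and zero missing `content` values.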

## Part 3: Data Visualization

Now we can start exploring! First, how many messages have we sent over time?

In [None]:
alt.Chart(df).mark_line().encode(
    x='yearmonth(time):T',
    y='count()',
)
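
To see what `yearmonth(time)` and `count()` are doing, it can help to compute the same numbers by hand in `pandas`. This is a rough equivalent on made-up timestamps, not the code Altair runs internally.

```python
import pandas as pd

# Made-up timestamps standing in for the `time` column of `df`.
toy = pd.DataFrame({
    'time': pd.to_datetime([
        '2021-01-03 10:00', '2021-01-20 18:30',
        '2021-02-01 09:15',
        '2021-04-11 22:45',
    ]),
})

# Truncate each timestamp to its month, then count the rows in each month.
monthly_counts = toy.groupby(toy['time'].dt.to_period('M')).size()
print(monthly_counts)
```

Note that March does not appear at all because it has no messages. The line chart behaves the same way: it connects the months that exist rather than dropping to zero in between.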

Let's color-code this graph by conversation. We can add tooltips and a title along the way.

In [None]:
alt.Chart(df).mark_bar().encode(
    x='yearmonth(time):T',
    y='count()',
    color='chat:N',
    tooltip=['chat', 'count()'],
).properties(
    title='Number of Facebook Messages',
)

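How about the number of messages sent by each participant in a particular chat? Replace `chat_id` in the first cell below with one of the chat IDs listed by `df.chat.value_counts()`.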

Which participants tend to send the most messages in each of your conversations, if any?

In [None]:
chat_id = '2ndfloorbois_xs_g9vj4ha'

alt.Chart(df.query(f'chat == "{chat_id}"')).mark_bar().encode(
    color='sender:N',
    x='yearmonth(time):T',
    y='count()',
    tooltip=['sender', 'count()'],
).properties(
    title='Facebook Messages in Group Chat',
)

In [None]:
alt.Chart(df).mark_bar().encode(
    alt.X('count()', stack='normalize', title='frequency'),
    alt.Y('chat'),
    alt.Color('sender'),
    tooltip=['sender', alt.Tooltip('count()', title='messages')],
).properties(
    title='Who Dominates the Conversation?',
)
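
The `stack='normalize'` option rescales each bar so that the senders' shares in a chat add up to 1. A rough `pandas` equivalent, shown here on made-up data, is a cross-tabulation normalized by row.

```python
import pandas as pd

# Made-up messages: four in a group chat, two in a one-on-one chat.
toy = pd.DataFrame({
    'chat': ['group', 'group', 'group', 'group', 'duo', 'duo'],
    'sender': ['Alice', 'Alice', 'Alice', 'Bob', 'Alice', 'Carol'],
})

# Count messages per (chat, sender), then divide each row by its total.
shares = pd.crosstab(toy['chat'], toy['sender'], normalize='index')
print(shares)
```

Each row of `shares` corresponds to one bar in the chart, and each column to one colored segment.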

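Finally, let's look at the number of messages (and words) you sent on each day, laid out as a calendar heatmap with one row per month and one column per day of the month. Set `sender` to your own name as it appears in the `sender` column.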

In [None]:
sender = 'Shreyas Iyer'

alt.Chart(df).mark_rect().encode(
    alt.X('date(time):O', title='day'),
    alt.Y('yearmonth(time):O', title='month'),
    alt.Color('count()', scale=alt.Scale(type='linear')),
    tooltip=[
        alt.Tooltip('count()', title='Messages'),
        alt.Tooltip('sum(words):Q', title='Words'),
    ],
).transform_filter(
    datum.sender == sender,
).transform_calculate(
    words=expr.length(expr.split(datum.content, ' ')),
).properties(
    title='Number of Messages Sent by Day',
)
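
The `words` field above is computed by splitting each message on a single space and counting the pieces. Here is the same idea in `pandas` on made-up messages, next to a variant that splits on any run of whitespace. The column names `words` and `words_ws` are just for this sketch.

```python
import pandas as pd

# Made-up messages, including a double space and an empty string.
toy = pd.DataFrame({'content': ['hello world', 'ok', 'two  spaces', '']})

# Split on a single space, as the chart's calculation does.
toy['words'] = toy['content'].str.split(' ').str.len()

# Split on runs of whitespace instead, which ignores repeated spaces.
toy['words_ws'] = toy['content'].str.split().str.len()

print(toy)
```

Splitting on a single space counts the gap inside a double space as an extra "word" and counts an empty message as one word, so treat the totals in the tooltip as approximate.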

## Part 4: Exercises

Create two more data visualizations, different from the ones above. What are you curious to learn about? Some suggestions:

- Change some of the chart types to line graphs or scatter plots, or experiment with [scales](https://vega.github.io/vega/docs/scales/) and [color schemes](https://vega.github.io/vega/docs/schemes/).
- On which hours and days of the week are you most active?
- Has the average length of your messages increased, decreased, or stayed the same over time?
- Filter your messages by a particular word. Who do you talk to most about, e.g., sports?
- Who sends you the most messages with [positive sentiment](https://www.nltk.org/api/nltk.sentiment.html)? Has this changed over time?

If you're looking for more inspiration, check out the [Altair example gallery](https://altair-viz.github.io/gallery/index.html)!
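
As a starting point for the hours-and-weekdays question, here is a minimal sketch on made-up timestamps. It assumes the `time` column holds UTC times without a time zone attached (which is what `pd.to_datetime(..., unit='ms')` produces from epoch milliseconds) and that you want US Eastern time; swap in your own zone name.

```python
import pandas as pd

# Made-up UTC timestamps standing in for the `time` column of `df`.
toy = pd.DataFrame({
    'time': pd.to_datetime([
        '2021-01-15 03:30', '2021-01-15 17:00', '2021-01-16 17:45',
    ]),
})

# Attach UTC, then convert to local time (assumed zone; change as needed).
local = toy['time'].dt.tz_localize('UTC').dt.tz_convert('America/New_York')

toy['weekday'] = local.dt.day_name()
toy['hour'] = local.dt.hour

# Number of messages for each (weekday, hour) combination.
activity = toy.groupby(['weekday', 'hour']).size()
print(activity)
```

Notice that the first message falls on Friday in UTC but on Thursday night in local time, which is why the conversion matters for this kind of question. From here, the `weekday` and `hour` columns can be used as chart encodings like any other column.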

In [None]:
# Viz 1: Your code here!

In [None]:
# Viz 2: Your code here!

## Part 5: Submission

If you're comping HCS and would like to receive credit for completing this workshop, here are the instructions to do so.

1. Click the Kernel >> "Restart Kernel and Run All Cells" button at the top-left of the screen. This will run your entire notebook from top to bottom, ensuring that your code is reproducible.
2. Click on the ellipsis icon at the top-right of your two personal visualizations, and save them as SVG graphics files.
3. Click the Kernel >> "Restart Kernel and Clear All Outputs" button at the top-left of the screen. This will remove all outputs from your notebook and leave only the code, which will greatly reduce file size.
4. Save the notebook, then drag-and-drop your `.ipynb` file to a [GitHub Gist](https://gist.github.com/).
5. Submit at the [Google Form](https://forms.gle/ssjtbjyGr6qtSL2P9).

Congratulations on finishing!