# HCS Workshop 4: Data Science

Welcome to the data science workshop! 

## Part 1: Imports

We'll be using `pandas` for data storage and cleaning, along with `altair` and `seaborn` for exploratory visualization.

In [13]:
!unzip facebook-jonathanluo1276.zip

Archive:  facebook-jonathanluo1276.zip
   creating: messages/
  inflating: messages/autofill_information.json  
  inflating: messages/secret_conversations.json  
  inflating: messages/secret_groups.json  
   creating: messages/inbox/
   creating: messages/inbox/lrjftp0eua/
  inflating: messages/inbox/lrjftp0eua/message_1.json  
   creating: messages/inbox/yyuuubwffw/
  inflating: messages/inbox/yyuuubwffw/message_1.json  
   creating: messages/inbox/190061_ic3lcvzz6w/
  inflating: messages/inbox/190061_ic3lcvzz6w/message_1.json  
   creating: messages/inbox/190061_ic3lcvzz6w/photos/
   creating: messages/inbox/190061_ic3lcvzz6w/files/
   creating: messages/inbox/190061_ic3lcvzz6w/gifs/
   creating: messages/inbox/190061_ic3lcvzz6w/videos/
   creating: messages/inbox/6006gang_gjitybvh7g/
  inflating: messages/inbox/6006gang_gjitybvh7g/message_1.json  
   creating: messages/inbox/6006gang_gjitybvh7g/files/
   creating: messages/inbox/6006gang_gjitybvh7g/photos/
   creating: messages/inbo

In [14]:
import json
import os
from pathlib import Path

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [15]:
import altair as alt
from altair import datum, expr

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

## Part 2: Loading Data

The cell below will load your Facebook messages data into a table called `df`. Each row of this table corresponds to a single message, and it has four columns:

- `chat`: The ID of the messenger chat that this message was sent in.
- `sender`: The name of the user who sent the message.
- `time`: A timestamp specifying exactly when the message was sent.
- `content`: The text of the message.

The `build_dataframe()` function loads this data from the files you downloaded from Facebook. It takes one argument, `path`, the folder containing the unzipped data. If you're running this notebook in the same directory as `messages`, you don't need to change anything; otherwise, pass the path to your `messages` folder as the argument.

In [16]:
def build_dataframe(path='messages'):
    """Load the text messages under `path`/inbox into a DataFrame, one row per message."""
    rows_list = []
    for filename in Path(path).glob('inbox/*/message_*.json'):
        chat = filename.parent.name  # the chat's folder name doubles as its ID
        with open(filename, 'r', encoding='utf-8') as f:
            obj = json.load(f)
        for entry in obj['messages']:
            # Keep only 'Generic' entries that actually carry text
            if entry['type'] == 'Generic' and entry.get('content') is not None:
                rows_list.append({
                    'chat': chat,
                    'sender': entry['sender_name'],
                    'time': pd.to_datetime(entry['timestamp_ms'], unit='ms'),
                    'content': entry['content'],
                })
    return pd.DataFrame(rows_list)

df = build_dataframe()
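
Facebook's JSON exports are widely reported to store non-ASCII text as UTF-8 bytes escaped one byte at a time, so accented names and emoji can come out garbled after `json.load`. The following is a minimal sketch of a repair helper, assuming that quirk applies to your export; `fix_mojibake` is a hypothetical name, not part of the workshop code.

```python
def fix_mojibake(text):
    """Re-decode text whose UTF-8 bytes were misread as Latin-1 characters.

    Returns the input unchanged if it does not fit that pattern.
    """
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Already-correct text (or text that is broken some other way) is left alone
        return text


garbled = 'caf\u00c3\u00a9'  # 'café' whose two UTF-8 bytes were read as two characters
print(fix_mojibake(garbled))
```

If your data shows this problem, one option is to apply the helper to the `sender` and `content` columns, for example `df['content'] = df['content'].map(fix_mojibake)`.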

Let's quickly check to make sure that our data was loaded properly.

In [17]:
df

| | chat | sender | time | content |
|---|---|---|---|---|
| 0 | jasminechan_xaqdc6hobq | Jasmine Chan | 2019-11-24 05:37:54.000 | You are now connected on Messenger |
| 1 | tessmcginley_wfkolyfpqa | Jonathan Luo | 2019-08-06 04:24:55.301 | You are now connected on Messenger |
| 2 | davidzhang_usotuebl0q | David Zhang | 2022-04-18 22:26:00.518 | no worries man! |
| 3 | davidzhang_usotuebl0q | Jonathan Luo | 2022-04-17 23:35:32.723 | Thx man! |
| 4 | davidzhang_usotuebl0q | David Zhang | 2022-04-17 01:45:24.134 | her name is Erika |
| ... | ... | ... | ... | ... |
| 2471 | elaineholliman_r_vrgdczsq | Elaine Holliman | 2019-08-28 11:26:18.599 | You are now connected on Messenger |
| 2472 | carahiggins_uvhj0g7k9w | Cara Higgins | 2019-08-25 14:04:06.980 | You are now connected on Messenger |
| 2473 | janellelee_ofy_oghiea | Janelle Lee | 2019-08-03 04:30:17.532 | You are now connected on Messenger |
| 2474 | shaineelim_qdobd7jpsw | Shainee Lim | 2019-10-24 18:16:28.945 | You are now connected on Messenger |


In [19]:
df.chat.value_counts()

190061_ic3lcvzz6w                             1430
6006gang_gjitybvh7g                            568
rosehong_a0emjjdcma                            147
cynthiacolinjulieannaand3others_djbsixwdtq     128
davidzhang_usotuebl0q                           87
                                              ... 
milosiegmannwatson_1e6vd6yrwq                    1
tessmcginley_wfkolyfpqa                          1
daniellenam_m0y862ipfg                           1
maddychen_iiuunxpyuq                             1
josephhartono_xllrvp_bna                         1
Name: chat, Length: 79, dtype: int64

## Part 3: Data Visualization

Now we can start exploring! First, how many messages were sent each month, across all chats and senders?

In [37]:
alt.Chart(df).mark_line().encode(
    x='yearmonth(time):T',
    y='count()',
)

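Let's color-code this graph by conversation, switching to stacked bars and adding tooltips and a title along the way. Note that the `monthdate` time unit bins by calendar day and ignores the year, so messages sent on the same date in different years are stacked together.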

In [49]:
alt.Chart(df).mark_bar().encode(
    x='monthdate(time):T',
    y='count()',
    color='chat:N',
    tooltip=['chat', 'count()'],
).properties(
    title='Number of Facebook Messages',
)

How about the number of messages sent by each participant in a particular chat?

Which participants tend to send the most messages in each of your conversations, if any?

In [25]:
chat_id = '190061_ic3lcvzz6w'

alt.Chart(df.query(f'chat == "{chat_id}"')).mark_bar().encode(
    color='sender:N',
    x='yearmonth(time):T',
    y='count()',
    tooltip=['sender', 'count()'],
).properties(
    title='Facebook Messages in Group Chat',
)

In [23]:
alt.Chart(df).mark_bar().encode(
    alt.X('count()', stack='normalize', title='frequency'),
    alt.Y('chat'),
    alt.Color('sender'),
    tooltip=['sender', alt.Tooltip('count()', title='messages')],
).properties(
    title='Who Dominates the Conversation?',
)
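
With `stack='normalize'`, each bar is rescaled so that its segments sum to 1, which means a segment's width is that sender's share of the chat's messages. As a rough illustration of the same calculation outside Altair, here is a sketch in plain Python; `sender_shares` and the toy data are made up for this example.

```python
from collections import Counter, defaultdict


def sender_shares(messages):
    """Map each chat to {sender: fraction of that chat's messages}."""
    counts = defaultdict(Counter)
    for chat, sender in messages:
        counts[chat][sender] += 1
    return {
        chat: {sender: n / sum(per_chat.values()) for sender, n in per_chat.items()}
        for chat, per_chat in counts.items()
    }


# Hypothetical (chat, sender) pairs standing in for rows of df
toy = [
    ('groupchat', 'Ana'),
    ('groupchat', 'Ana'),
    ('groupchat', 'Ben'),
    ('dm', 'Ana'),
]
shares = sender_shares(toy)
print(shares['groupchat'])  # Ana holds two thirds of this chat, Ben one third
```

With the real table, the pairs could be produced by something like `zip(df['chat'], df['sender'])`.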

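Finally, let's look at the number of messages (and words) you sent on each day, drawn as a heatmap with one row per month and one column per day of the month. Set `sender` to your own name as it appears in the data.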

In [31]:
sender = 'Jonathan Luo'

alt.Chart(df).mark_rect().encode(
    alt.X('date(time):O', title='day'),
    alt.Y('yearmonth(time):O', title='month'),
    alt.Color('count()', scale=alt.Scale(type='linear')),
    tooltip=[
        alt.Tooltip('count()', title='Messages'),
        alt.Tooltip('sum(words):Q', title='Words'),
    ],
).transform_filter(
    datum.sender == sender,
).transform_calculate(
    words=expr.length(expr.split(datum.content, ' ')),
).properties(
    title='Number of Messages Sent by Day',
)
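
The `words` field above counts the pieces produced by splitting `content` on a single space. Assuming Vega's `split` behaves like JavaScript's string split, consecutive spaces produce empty pieces that are counted as words, and an empty message counts as one word. The sketch below mirrors that rule in Python next to a whitespace-aware alternative; both function names are hypothetical.

```python
def count_words_single_space(text):
    """Mirror the chart's rule: the number of pieces after splitting on ' '."""
    return len(text.split(' '))


def count_words_whitespace(text):
    """Split on any run of whitespace, so empty pieces are never counted."""
    return len(text.split())


message = 'see you  soon'  # note the double space
print(count_words_single_space(message))  # 4
print(count_words_whitespace(message))    # 3
```

For short chat messages the two counts usually agree, so the simpler rule is a reasonable approximation for a heatmap.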

## Part 4: Exercises

Create two more data visualizations, different from the ones above. What are you curious to learn about? Some suggestions:

- Change some of the chart types to line graphs or scatter plots, or experiment with [scales](https://vega.github.io/vega/docs/scales/) and [color schemes](https://vega.github.io/vega/docs/schemes/).
- On which hours and days of the week are you most active?
- Has the average length of your messages increased, decreased, or stayed the same over time?
- Filter your messages by a particular word. Who do you talk to most about, e.g., sports?
- Who sends you the most messages with [positive sentiment](https://www.nltk.org/api/nltk.sentiment.html)? Has this changed over time?

If you're looking for more inspiration, check out the [Altair example gallery](https://altair-viz.github.io/gallery/index.html)!
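
As a possible starting point for two of the suggestions above, here is a small pandas sketch that summarizes the data before charting it. It assumes a table with the same four columns as `df`; the function names and the toy data are invented for illustration, and timestamps are used exactly as stored (no time zone conversion).

```python
import pandas as pd


def activity_by_weekday_hour(df, sender):
    """Count one sender's messages per (weekday, hour of day)."""
    mine = df.loc[df['sender'] == sender].assign(
        weekday=lambda d: d['time'].dt.day_name(),
        hour=lambda d: d['time'].dt.hour,
    )
    return mine.groupby(['weekday', 'hour']).size().reset_index(name='messages')


def chats_mentioning(df, word):
    """Count messages per chat whose text contains `word` (case-insensitive substring)."""
    hits = df[df['content'].str.contains(word, case=False, regex=False)]
    return hits['chat'].value_counts()


# Toy stand-in for df with the same columns
toy = pd.DataFrame({
    'chat': ['a', 'a', 'b'],
    'sender': ['Me', 'Me', 'Friend'],
    'time': pd.to_datetime(['2022-04-18 09:15', '2022-04-18 09:45', '2022-04-19 20:00']),
    'content': ['hi', 'running late', 'ok'],
})
print(activity_by_weekday_hour(toy, 'Me'))
print(chats_mentioning(toy, 'LATE'))
```

Either result can be passed to `alt.Chart(...)` like any other table, for example as a `mark_rect` heatmap of `hour` against `weekday`.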

In [145]:
# Viz 1: Interactive Histogram of Time of Day Messages were Sent into the Group Chat "19.0061"
# (Click and drag on bottom panel to see bins change on top panel)

chat_id = '190061_ic3lcvzz6w'
chat_name = '19.0061'

source = df.query(f'chat == "{chat_id}"')
brush = alt.selection_interval(encodings=['x'])

base = alt.Chart(source).transform_calculate(
    time="((hours(datum.time) + minutes(datum.time) / 60) + 20) % 24"  # Converting GMT to EDT
).mark_bar().encode(
    y='count():Q',
    color=alt.Color('count()', scale=alt.Scale(type='linear')),
    tooltip=[
        alt.Tooltip('count()', title='Messages'),
    ],
).properties(
    width=600,
    height=100,
    title=f'Time of Day at which Messages Were Sent into Chat "{chat_name}"',
)

x_axis_label = "Hour of Day (EDT)"

alt.vconcat(
    base.encode(
        alt.X(
            'time:Q',
            bin=alt.Bin(maxbins=30, extent=brush),
            scale=alt.Scale(domain=brush),
            title=x_axis_label,
        )
    ),
    base.encode(
        alt.X(
            'time:Q',
            bin=alt.Bin(maxbins=30),
            title=x_axis_label,
        )
    ).add_selection(brush),
)
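
The `+ 20` followed by `% 24` in the `time` expression is the same as subtracting four hours, since EDT is UTC-4. The sketch below checks that arithmetic against Python's `datetime` with a fixed UTC-4 offset. It is only an illustration: `edt_hour_from_utc` is a hypothetical helper, and a fixed offset does not account for the winter months, when Eastern time is UTC-5 (EST).

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC-4 offset; this deliberately ignores daylight saving transitions
EDT = timezone(timedelta(hours=-4), 'EDT')


def edt_hour_from_utc(ts):
    """Fractional hour of day in EDT for a naive UTC datetime, as in the chart expression."""
    return (ts.hour + ts.minute / 60 + 20) % 24


ts = datetime(2022, 4, 18, 2, 30)  # 02:30 UTC
local = ts.replace(tzinfo=timezone.utc).astimezone(EDT)

print(edt_hour_from_utc(ts))            # 22.5
print(local.hour + local.minute / 60)   # 22.5, i.e. 22:30 on the previous evening
```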

In [173]:
# Viz 2: Visualization of the Mean Number of Words per Message sent by Me
# Appears that aside from really popping off on Sept 8, 2020, I've been staying
# pretty consistent in terms of my message lengths (getting slightly longer in
# the last year though)

sender = 'Jonathan Luo'

alt.Chart(df).mark_rect().encode(
    alt.X('date(time):O', title='Day'),
    alt.Y('yearmonth(time):O', title='Month'),
    alt.Color('mean(words):Q', scale=alt.Scale(type='linear'), title='Mean Words per Message'),
    tooltip=[
        alt.Tooltip('yearmonthdate(time):O', title='Date'),
        alt.Tooltip('mean(words):Q', title='Mean Words per Message', format='.2f'),
        alt.Tooltip('count()', title='Messages'),
    ],
).transform_filter(
    datum.sender == sender,
).transform_calculate(
    words=expr.length(expr.split(datum.content, ' ')),
).properties(
    title='Mean Number of Words per Message by Day',
)

## Part 5: Submission

If you're comping HCS and would like to receive credit for completing this workshop, here are the instructions to do so.

1. Click the Kernel >> "Restart Kernel and Run All Cells" button at the top-left of the screen. This will run your entire notebook from top to bottom, ensuring that your code is reproducible.
2. Click on the ellipsis icon at the top-right of your two personal visualizations, and save them as SVG graphics files.
3. Click the Kernel >> "Restart Kernel and Clear All Outputs" button at the top-left of the screen. This will remove all outputs from your notebook and leave only the code, which will greatly reduce file size.
4. Save the notebook, then drag-and-drop your `.ipynb` file to a [GitHub Gist](https://gist.github.com/).
5. Submit at the [Google Form](https://forms.gle/ssjtbjyGr6qtSL2P9). After this, you're all set!

Congratulations on finishing!