# COGS 108 - Data Checkpoint

# Names

- Jianan Liu
- Casey Lee
- Mark Bussard
- Aryan Ziyar
- Hasan Liou

<a id='research_question'></a>
# Research Question

Given that distinct subcultures exist on Twitch, can we reliably identify a streamer's channel by applying machine learning to its chat logs? Furthermore, does Twitch emote usage vary from streamer to streamer?

# Dataset(s)


- Dataset Name: Twitch.tv Chat Log Data
- Dataset Link: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/VE0IVQ
- Number of Observations: 10113500 rows, 9 columns

The chat log data covers 2,162 Videos on Demand (VODs) from 52 streamers. To keep computation manageable, we subsample 10,000 chat messages from each of the following streamers: loltyler1, xqcow, imaqtpie, Ninja, TSM_Myth, and timthetatman.


# Setup

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns

import urllib.request  # urlopen lives in the urllib.request submodule
import json

# Data Cleaning

1. We loaded the chat log dataset from the Harvard Dataverse.

2. Using a third-party Twitch emote API, we identified the emotes that are not channel specific (also known as global emotes). For each message, we dropped the channel-specific emotes and stored the remaining global emotes in a new `emotes` column of our DataFrame.

3. We then created a new `text_only` column holding only the text fragments of each message, with all emotes removed (we decided to keep Unicode emojis). We also removed URLs using a regular expression and removed identifying information pertaining to the channel (i.e., if a chat message contained the streamer's name, we removed that name from the `text_only` column).

4. Finally, we parsed the `created_at` field into a new datetime column, `created_date`, which we will use for time-of-day analysis over a given period.

In [None]:
# Bring in subsamples of streamers
xqc = pd.read_pickle('D:/Documents/HW/COGS108/xqcow.pkl').sample(10000)
tim = pd.read_pickle('D:/Documents/HW/COGS108/timthetatman.pkl').sample(10000)
tyler1 = pd.read_pickle('D:/Documents/HW/COGS108/loltyler1.pkl').sample(10000)

qtpie = pd.read_pickle('D:/Documents/HW/COGS108/imaqtpie.pkl').sample(10000)
myth = pd.read_pickle('D:/Documents/HW/COGS108/tsm_myth.pkl').sample(10000)
ninja = pd.read_pickle('D:/Documents/HW/COGS108/ninja.pkl').sample(10000)

In [None]:
# Create list of global emotes

with urllib.request.urlopen("https://api.twitchemotes.com/api/v4/channels/0") as url:
    requested = json.loads(url.read().decode())
emote_list = requested['emotes']
id_list = [int(x['id']) for x in emote_list]

In [None]:
# Emote function (take only one of each global emote, ignore rest)
def make_emote_list(x):
    lst = []
    # For each fragment
    for fragment in x:
        # Check that we have an emote fragment, and that it is a global emote
        if ('emoticon_id' in fragment.keys()) and (int(fragment['emoticon_id']) in id_list):
            # Add emote
            lst.append(int(fragment['emoticon_id']))
    
    # Save only unique emotes
    return list(set(lst))

# Apply function to dataset fragments
xqc['emotes'] = xqc.fragments.apply(make_emote_list)
tim['emotes'] = tim.fragments.apply(make_emote_list)
tyler1['emotes'] = tyler1.fragments.apply(make_emote_list)

qtpie['emotes'] = qtpie.fragments.apply(make_emote_list)
myth['emotes'] = myth.fragments.apply(make_emote_list)
ninja['emotes'] = ninja.fragments.apply(make_emote_list)
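
As a minimal sketch of what `make_emote_list` does to a single message, the example below runs the same logic on a toy input. The fragment shape (a list of dicts, with emote fragments carrying an `emoticon_id` key) is inferred from how the function reads its input, and the emote IDs and the stand-in `id_list` are made up for illustration rather than taken from the API.

```python
# Hypothetical global emote IDs, standing in for the API-derived id_list
id_list = {25, 88}

def make_emote_list(x):
    lst = []
    for fragment in x:
        # Keep only emote fragments whose ID is in the global list
        if 'emoticon_id' in fragment and int(fragment['emoticon_id']) in id_list:
            lst.append(int(fragment['emoticon_id']))
    # Duplicates collapse, so each global emote counts once per message
    return list(set(lst))

# Toy message: one text fragment, a repeated global emote, one channel emote
toy_fragments = [
    {'text': 'hello '},
    {'emoticon_id': '25'},
    {'emoticon_id': '25'},
    {'emoticon_id': '99999'},
]

result = make_emote_list(toy_fragments)
print(result)
```

The repeated global emote is recorded once and the channel-specific ID is dropped, which is the behavior we want for comparing emote usage across channels.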

In [None]:
# Take only text (no emotes)
def get_text_only(x):
    # Combines all elements in dict with "text" key
    text_str = " ".join([y['text'] for y in x if 'text' in y.keys()])
    
    # Lowercase and strip leading/trailing whitespace for analysis
    return text_str.lower().strip()

xqc['text_only'] = xqc.fragments.apply(get_text_only)
tim['text_only'] = tim.fragments.apply(get_text_only)
tyler1['text_only'] = tyler1.fragments.apply(get_text_only)

qtpie['text_only'] = qtpie.fragments.apply(get_text_only)
myth['text_only'] = myth.fragments.apply(get_text_only)
ninja['text_only'] = ninja.fragments.apply(get_text_only)

In [None]:
# Regex out URLs
regex_str = r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
xqc['text_only'] = xqc.text_only.str.replace(regex_str, '', regex=True)
tim['text_only'] = tim.text_only.str.replace(regex_str, '', regex=True)
tyler1['text_only'] = tyler1.text_only.str.replace(regex_str, '', regex=True)

qtpie['text_only'] = qtpie.text_only.str.replace(regex_str, '', regex=True)
myth['text_only'] = myth.text_only.str.replace(regex_str, '', regex=True)
ninja['text_only'] = ninja.text_only.str.replace(regex_str, '', regex=True)
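
To sanity-check the URL pattern, the sketch below applies the same `regex_str` to a few toy messages using the standard `re` module instead of pandas. The sample strings are invented for illustration and are not from the dataset.

```python
import re

# Same pattern as in the cell above
regex_str = r'(https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|www\.[a-zA-Z0-9][a-zA-Z0-9-]+[a-zA-Z0-9]\.[^\s]{2,}|https?:\/\/(?:www\.|(?!www))[a-zA-Z0-9]+\.[^\s]{2,}|www\.[a-zA-Z0-9]+\.[^\s]{2,})'
url_pattern = re.compile(regex_str)

# Toy chat messages (hypothetical)
samples = [
    'check this https://www.example.com/page out',
    'clip at www.twitch.tv/somechannel',
    'no link here',
]

# Replace every URL match with the empty string
stripped = [url_pattern.sub('', s) for s in samples]
for before, after in zip(samples, stripped):
    print(repr(before), '->', repr(after))
```

Note that removing a URL from the middle of a message leaves the surrounding spaces in place, so a later whitespace-normalization step may be worthwhile.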

In [None]:
# Remove identifying information
xqc_identifying_info = '|'.join(['@xqcow', 'xqcow', 'xqc'])
tim_identifying_info = '|'.join(['@timthetatman', 'timthetatman', 'tatman', 'tim'])
tyler1_identifying_info = '|'.join(['@loltyler1', 'loltyler1', 'tyler1', 'tyler'])

qtpie_identifying_info = '|'.join(['@imaqtpie', 'imaqtpie', 'qtpie', 'qt'])
myth_identifying_info = '|'.join(['@myth', 'myth'])
ninja_identifying_info = '|'.join(['@ninja', 'ninja'])

# The patterns use '|' alternation, so they must be applied as regular expressions
xqc['text_only'] = xqc.text_only.str.replace(xqc_identifying_info, '', regex=True)
tim['text_only'] = tim.text_only.str.replace(tim_identifying_info, '', regex=True)
tyler1['text_only'] = tyler1.text_only.str.replace(tyler1_identifying_info, '', regex=True)

qtpie['text_only'] = qtpie.text_only.str.replace(qtpie_identifying_info, '', regex=True)
myth['text_only'] = myth.text_only.str.replace(myth_identifying_info, '', regex=True)
ninja['text_only'] = ninja.text_only.str.replace(ninja_identifying_info, '', regex=True)
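
One caveat with plain substring alternation is that short names such as `tim` or `qt` also match inside unrelated words (for example, `time`). The sketch below shows a possible variant that only matches whole tokens. This is a different technique from the cell above, not what we currently run; the helper name `build_name_pattern` and the toy messages are hypothetical.

```python
import re

def build_name_pattern(names):
    # Try longer names first so '@timthetatman' is preferred over 'tim'
    ordered = sorted(names, key=len, reverse=True)
    alternation = '|'.join(re.escape(n) for n in ordered)
    # The lookarounds require that no word character touches the match,
    # so 'tim' inside 'time' is left alone
    return re.compile(r'(?<!\w)(?:' + alternation + r')(?!\w)')

tim_pattern = build_name_pattern(['@timthetatman', 'timthetatman', 'tatman', 'tim'])

# Toy messages (hypothetical, already lowercased like text_only)
messages = ['tim is live', 'what time is it', 'hi @timthetatman']
cleaned = [tim_pattern.sub('', m).strip() for m in messages]
print(cleaned)
```

The compiled pattern's `.pattern` string could be passed to `Series.str.replace(..., regex=True)` if we decide to adopt this stricter matching.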

In [None]:
# Datetime parsing
xqc['created_date'] = pd.to_datetime(xqc['created_at'])
tim['created_date'] = pd.to_datetime(tim['created_at'])
tyler1['created_date'] = pd.to_datetime(tyler1['created_at'])

qtpie['created_date'] = pd.to_datetime(qtpie['created_at'])
myth['created_date'] = pd.to_datetime(myth['created_at'])
ninja['created_date'] = pd.to_datetime(ninja['created_at'])
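
As a rough sketch of how `created_date` could feed the time-of-day analysis, the example below derives the hour and weekday from parsed timestamps and counts messages per hour. The toy timestamps are invented, and the ISO 8601 format is an assumption about how `created_at` is stored.

```python
import pandas as pd

# Toy timestamps (hypothetical; assumes created_at holds ISO 8601 strings)
toy = pd.DataFrame({
    'created_at': [
        '2018-05-01T02:15:30Z',
        '2018-05-01T14:45:00Z',
        '2018-05-02T14:05:10Z',
    ]
})

# Same parsing step as in the cell above
toy['created_date'] = pd.to_datetime(toy['created_at'])

# Derive hour of day and weekday name via the .dt accessor
toy['hour'] = toy['created_date'].dt.hour
toy['day_name'] = toy['created_date'].dt.day_name()

# Count messages per hour of day
messages_per_hour = toy.groupby('hour').size()
print(messages_per_hour)
```

The hours here are in UTC because the toy strings end in `Z`; comparing streamers in different time zones would need a conversion step first.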
