# Introduction

This learning project is intended to help me learn the steps required to create a visualisation in Python, and also to expose how much information is stored by companies such as YouTube (Google).

## Objectives

* **Data Preparation:** Data preparation is often said to make up around 80% of the work in a Data Science pipeline. I'd like to explore the best techniques to go from raw exported data (here, a JSON file) to a cleaned-up tabular format that is ready for data visualisation.
    * [X] Data Preparation
    * [ ] Feature Engineering
* **Plot:** There are various modules that can be used to plot the data. I'll consider which is best for this use case, and find the best way to represent the data in hand.
    * [ ] Plot representation within Jupyter Notebook
    * [ ] Plot representation on webfront
* **Web Display:** To share this project, I'd like to find the best way to let users upload their YouTube data and view the visualisation, in a form that others can deploy on a server with minimal effort.

# Preprocessing

## Getting access to YouTube Data

YouTube deprecated their history API for privacy reasons a few years ago; however, it's still possible to download a copy of your YouTube history in JSON format by doing the following:

1. Go to [Google Takeout](https://takeout.google.com/)
2. Deselect All 
3. Select YouTube
4. Underneath YouTube, click on "All YouTube Data Included"
5. Deselect All
6. Select history

Once the export has been generated, a download link is sent to you. The download should include a file called `watch-history.json`, which we'll be making use of in the following steps.

## Data Preparation

We can take a look at the JSON and see what there is:

In [2]:
import json
import pandas as pd

# Open History File as dataframe
file = '../watch-history.json'
with open(file, encoding='utf8') as wh_file:
    wh_dict = json.load(wh_file)
    
wh = pd.DataFrame.from_dict(wh_dict)

In [3]:
display(wh)

Unnamed: 0,description,header,products,subtitles,time,title,titleUrl
0,,YouTube,[YouTube],"[{'name': 'StromaeVEVO', 'url': 'https://www.y...",2019-05-04T15:51:04.397Z,Watched Stromae - Papaoutai (Clip Officiel),https://www.youtube.com/watch?v=oiKj0Z_Xnjc
1,,YouTube,[YouTube],"[{'name': 'Mick J Clark', 'url': 'https://www....",2019-05-04T15:16:07.495Z,Watched Me My Body And I,https://www.youtube.com/watch?v=tCbPpfIGk-s
2,,YouTube,[YouTube],"[{'name': 'Tech Insider', 'url': 'https://www....",2019-05-04T14:24:33.373Z,Watched I Did Peloton For Two Weeks Straight A...,https://www.youtube.com/watch?v=erqLKwwZCVE
3,,YouTube,[YouTube],"[{'name': 'The Verge', 'url': 'https://www.you...",2019-05-04T14:10:05.115Z,Watched F8 2019 keynote in 12 minutes,https://www.youtube.com/watch?v=UtxPdezclYw
4,,YouTube,[YouTube],"[{'name': 'Dan Croitor', 'url': 'https://www.y...",2019-05-04T13:56:17.096Z,Watched Netflix culture deck via Reed Hastings,https://www.youtube.com/watch?v=2fuOs6nJSjY
5,,YouTube,[YouTube],"[{'name': 'Netflix', 'url': 'https://www.youtu...",2019-05-04T13:36:24.194Z,Watched Tuca & Bertie | Official Trailer [HD] ...,https://www.youtube.com/watch?v=ZybYIJtbcu0
6,,YouTube,[YouTube],"[{'name': 'Dan Croitor', 'url': 'https://www.y...",2019-05-04T13:36:04.262Z,Watched Netflix Interview (1 of 3): 2018 Cultu...,https://www.youtube.com/watch?v=8-4is78UJZ8
7,,YouTube,[YouTube],"[{'name': 'E4', 'url': 'https://www.youtube.co...",2019-05-03T22:20:56.255Z,"Watched ""You Make Me Sick So Much, Jimmy Carr""...",https://www.youtube.com/watch?v=Q8BeTjCjmwQ
8,,YouTube,[YouTube],"[{'name': 'ChrisFix', 'url': 'https://www.yout...",2019-05-03T22:20:08.582Z,Watched How to Install a Hidden Kill Switch in...,https://www.youtube.com/watch?v=XUhXLsrZiE0
9,,YouTube,[YouTube],"[{'name': 'TheEllenShow', 'url': 'https://www....",2019-05-03T22:19:19.939Z,Watched John Bradley Doesn’t Know What He Know...,https://www.youtube.com/watch?v=qPHlaLti3g0


Some fields, such as `title`, contain useful information, while others, like `titleUrl`, are of no use in this project. First, we can drop everything except `title` and `time`, as we don't need anything else.

In [4]:
# Drop columns except title and time
# (.copy() avoids pandas' SettingWithCopyWarning when columns are reassigned later)
wh = wh[['title', 'time']].copy()

We can clearly see that "Watched" is prepended to the `title` field. In addition, where a video has become unavailable, we get a title like "Watched https://www.youtube.com/watch?v=rG5tV7zcl1s". Let's take a look at the share of videos in the dataset that are no longer available.

In [5]:
len(wh[wh['title'].str.startswith('Watched https://www.youtube.com')]) / len(wh)

0.05473684210526316

It's about 5%. Not too bad. Let's remove the prefix "Watched" from the titles and also drop all the unavailable videos. We'll also strip punctuation and surrounding whitespace, and make everything lower case for simplicity.

In [6]:
import string

# Remove the "Watched " prefix: "Watched Cat Video" -> "Cat Video"
# (the space after "Watched" is removed too, so the URL check below can match)
wh['title'] = wh['title'].str.replace(r'^Watched\s+', '', regex=True)

# Remove unavailable videos, whose title is just the video URL
wh = wh[~wh['title'].str.startswith('https://www.youtube.com')].copy()

# Lower case: "Cat Video" -> "cat video"
wh['title'] = wh['title'].str.lower()

# Strip surrounding whitespace, then remove ASCII punctuation
wh['title'] = wh['title'].str.strip()
wh['title'] = wh['title'].str.translate(
    str.maketrans('', '', string.punctuation))

wh.head()

Unnamed: 0,title,time
0,stromae papaoutai clip officiel,2019-05-04T15:51:04.397Z
1,me my body and i,2019-05-04T15:16:07.495Z
2,i did peloton for two weeks straight and here’...,2019-05-04T14:24:33.373Z
3,f8 2019 keynote in 12 minutes,2019-05-04T14:10:05.115Z
4,netflix culture deck via reed hastings,2019-05-04T13:56:17.096Z
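
To make the cleaning steps easier to check by eye, here is a minimal standalone sketch that applies them to single strings rather than a dataframe. The names `clean_title` and `PREFIX` are hypothetical and not part of the notebook. The sketch also assumes two small deviations: it only removes the prefix when it is actually present, and it collapses repeated whitespace left behind after punctuation is removed.

```python
import string

# Hypothetical helper; mirrors the cleaning steps described above
PREFIX = "Watched "
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_title(raw_title):
    """Return a normalised title, or None for an unavailable video."""
    # Drop the "Watched " prefix only when it is actually present
    if raw_title.startswith(PREFIX):
        title = raw_title[len(PREFIX):]
    else:
        title = raw_title
    # Unavailable videos show the video URL in place of a title
    if title.startswith("https://www.youtube.com"):
        return None
    # Lower-case and remove ASCII punctuation
    title = title.lower().translate(_PUNCT_TABLE)
    # Assumption: collapse runs of whitespace left behind by removed punctuation
    return " ".join(title.split())


examples = [
    "Watched Stromae - Papaoutai (Clip Officiel)",
    "Watched https://www.youtube.com/watch?v=rG5tV7zcl1s",
]
cleaned = [clean_title(t) for t in examples]
print(cleaned)
```

Note that `string.punctuation` only covers ASCII characters, which is why a typographic apostrophe such as the one in "here’s" survives the cleaning.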


We only want the date from the `time` field, as the time of day is too granular for this particular visualisation. Let's convert this column so that it holds a `datetime` object containing only the date.

In [7]:
import datetime as dt

def get_date(datetime_str):
    # Keep only the date part before the 'T': '2019-05-04T15:51:04.397Z' -> 2019-05-04
    return dt.datetime.strptime(datetime_str.split('T')[0], '%Y-%m-%d')

wh['time'] = wh['time'].apply(get_date)

wh.head()

Unnamed: 0,title,time
0,stromae papaoutai clip officiel,2019-05-04
1,me my body and i,2019-05-04
2,i did peloton for two weeks straight and here’...,2019-05-04
3,f8 2019 keynote in 12 minutes,2019-05-04
4,netflix culture deck via reed hastings,2019-05-04
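
Splitting the string on `T` works, but it silently discards the timezone marker. As a hedged alternative, the sketch below parses the whole timestamp first and only then takes the date. The function name `parse_watch_time` is hypothetical, and the sketch assumes every timestamp has fractional seconds and a trailing `Z`, as in the sample rows shown earlier.

```python
from datetime import datetime, timezone


def parse_watch_time(timestamp):
    """Parse a timestamp such as '2019-05-04T15:51:04.397Z' as UTC."""
    # Assumption: fractional seconds and a trailing 'Z' are always present
    parsed = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    # The trailing 'Z' denotes UTC, so attach that timezone explicitly
    return parsed.replace(tzinfo=timezone.utc)


watched_at = parse_watch_time("2019-05-04T15:51:04.397Z")
watched_on = watched_at.date()
print(watched_on)  # 2019-05-04
```

Because the timestamps are in UTC, the resulting date is the UTC calendar date. For videos watched late at night in another timezone, this can differ from the local date by one day.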


## Feature Engineering

We now have a dataframe with the important information from the original history JSON we downloaded. Next, we'll add some features to get a better representation of the data. We have a large collection of titles, but titles such as "Body Workout Exercise" and "Full Body Workout", which have very similar themes, are treated as distinct occurrences unless we find some way to group entries with similar themes. We need a way to:

* Extract key themes from a video title
* Find other videos that have this theme

### Keyword Analysis

To get an aggregated sense of what the titles are about, we can look at the keywords they contain.

#### Tokenizing 

We could find the distribution of keywords such as "XBOX" or "Full-body Workout" over the range of the dataset. Notice that keywords can appear as unigrams (one word) or as bigrams (two words together), as in the latter example. We also want to remove stop words such as "and" and "with" from our tokens, as these won't add any value here.

In [8]:
import nltk
from nltk.corpus import stopwords

# Download the stop-word lists (only needed once per machine)
nltk.download('stopwords')
stop_words = stopwords.words('english')

def find_ngrams(input_list, n):
    if n > 1:
        return zip(*(input_list[i:] for i in range(n)))
    else:
        return input_list

# Split words and remove stopwords
wh['unigrams'] = wh['title'].apply(lambda x: [word for word in x.split(' ') if word not in stop_words and word != ''])
wh['bigrams'] = wh['unigrams'].apply(lambda x: list(find_ngrams(x, 2)))
wh.head()

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Oliver\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


Unnamed: 0,title,time,unigrams,bigrams
0,stromae papaoutai clip officiel,2019-05-04,"[stromae, papaoutai, clip, officiel]","[(stromae, papaoutai), (papaoutai, clip), (cli..."
1,me my body and i,2019-05-04,[body],[]
2,i did peloton for two weeks straight and here’...,2019-05-04,"[peloton, two, weeks, straight, here’s, happened]","[(peloton, two), (two, weeks), (weeks, straigh..."
3,f8 2019 keynote in 12 minutes,2019-05-04,"[f8, 2019, keynote, 12, minutes]","[(f8, 2019), (2019, keynote), (keynote, 12), (..."
4,netflix culture deck via reed hastings,2019-05-04,"[netflix, culture, deck, via, reed, hastings]","[(netflix, culture), (culture, deck), (deck, v..."
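
The `zip` trick used for the n-grams can be hard to read at first, so here is a small standalone sketch on a single made-up title. `STOP_WORDS` is a tiny illustrative list, not NLTK's, and this version of `find_ngrams` always returns a list of tuples (unlike the notebook's version, which returns the input list unchanged for `n = 1`).

```python
# Tiny illustrative stop-word list; the notebook uses NLTK's English list instead
STOP_WORDS = {"i", "me", "my", "and", "did", "for", "in", "the", "with"}


def find_ngrams(tokens, n):
    """Return the n-grams of a token list as a list of tuples."""
    # zip over n staggered copies of the list: tokens[0:], tokens[1:], ...
    # zip stops at the shortest copy, so incomplete n-grams are never produced
    return list(zip(*(tokens[i:] for i in range(n))))


title = "full body workout with the kettlebell"
tokens = [word for word in title.split() if word not in STOP_WORDS]
bigrams = find_ngrams(tokens, 2)

print(tokens)   # ['full', 'body', 'workout', 'kettlebell']
print(bigrams)  # [('full', 'body'), ('body', 'workout'), ('workout', 'kettlebell')]
```

One consequence of removing stop words before pairing is visible here: "workout" and "kettlebell" were not adjacent in the title, yet they form a bigram. A title with fewer than two remaining tokens yields no bigrams at all, which is why "me my body and i" has an empty list above.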


By creating a bag of words, we can explore the frequency of the unigrams and bigrams that are appearing in the dataset.

In [24]:
from collections import Counter

# Create N-Gram Counters
def get_counter_from_column(df, column_name):
    ct = Counter()
    for row in df[column_name]:
        for element in row:
            ct[element] += 1
    return ct

bag_1 = get_counter_from_column(wh, 'unigrams')
bag_2 = get_counter_from_column(wh, 'bigrams')

print("Bag 1\n", bag_1.most_common(10))
print("Bag 2\n", bag_2.most_common(10))

Bag 1
 [('video', 811), ('official', 606), ('trailer', 283), ('removed', 239), ('music', 218), ('1', 205), ('2', 196), ('hd', 190), ('life', 176), ('vs', 173)]
Bag 2
 [(('video', 'removed'), 239), (('official', 'video'), 209), (('music', 'video'), 154), (('official', 'music'), 137), (('official', 'trailer'), 135), (('movie', 'hd'), 53), (('trailer', '1'), 53), (('glass', 'animals'), 43), (('trailer', 'hd'), 43), (('part', '1'), 42)]
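
The nested loop above can also be written more compactly. The following is a minimal sketch on a toy list of token lists (a stand-in for the `unigrams` column, not real data) that flattens the lists with `itertools.chain.from_iterable` and hands the result straight to `Counter`.

```python
from collections import Counter
from itertools import chain

# Toy stand-in for the 'unigrams' column: one token list per video
token_lists = [
    ["official", "music", "video"],
    ["official", "trailer"],
    ["music", "video"],
]

# Flatten all token lists into one stream and count each token
bag = Counter(chain.from_iterable(token_lists))

print(bag["official"])  # 2
print(bag["missing"])   # 0
```

A `Counter` returns 0 for keys it has never seen instead of raising a `KeyError`, so looking up arbitrary words in the bag is always safe.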


In [23]:
print("movies: {}".format(bag_1['movies']))
print("movie: {}".format(bag_1['movie']))

movies: 8
movie: 105


#### Combine Pluralisms

The previous output shows that a word and its plural can both appear in the bag of words ("movie" occurs 105 times and "movies" 8 times). Semantically they mean the same thing, so we should find a way to combine these instances. In this example, we take every regular noun (one that is pluralised by just adding "s") for which both the singular and the plural occur in our unigrams. We then keep whichever form has more occurrences and replace the other with it, in both the bag of words and the dataframe.

In [35]:
# Merge singular/plural pairs among the 1000 most common words
def replace_count(counter, removed, added):
    counter[added] += counter[removed]
    del counter[removed]
    
remove_words = []
for word in bag_1.most_common(1000):
    word = word[0]
    if word.endswith('s'):
        singular = word[:-1]
        plural = word
    else:
        singular = word
        plural = word + 's'
    if plural in bag_1 and singular in bag_1:
        if bag_1[plural] >= bag_1[singular]:
            remove_words.append((singular, plural))
        else:
            remove_words.append((plural, singular))

for (removed, added) in remove_words:
    replace_count(bag_1, removed, added)
    wh['unigrams'] = wh['unigrams'].apply(lambda x: [added if unigram == removed else unigram for unigram in x])
    
bag_1.most_common(10)

[('video', 833),
 ('official', 606),
 ('trailer', 294),
 ('removed', 239),
 ('new', 223),
 ('music', 218),
 ('1', 205),
 ('2', 197),
 ('vs', 191),
 ('hd', 190)]
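
To see the merging rule in isolation, here is a minimal sketch on a toy `Counter` that reuses the "movie"/"movies" counts from earlier. The function name `merge_plurals` is hypothetical. Unlike the notebook's loop, it returns a new `Counter` rather than modifying the original, and it only starts from words that do not end in "s".

```python
from collections import Counter


def merge_plurals(counter):
    """Fold 'word' and 'words' into whichever form is more frequent."""
    merged = Counter(counter)  # work on a copy, leave the input untouched
    for word in list(counter):
        if word.endswith("s"):
            continue  # handle each pair once, starting from the singular
        plural = word + "s"
        if word not in merged or plural not in merged:
            continue
        # Ties go to the plural, matching the >= comparison in the notebook
        if merged[plural] >= merged[word]:
            keep, drop = plural, word
        else:
            keep, drop = word, plural
        merged[keep] += merged[drop]
        del merged[drop]
    return merged


bag = Counter({"movie": 105, "movies": 8, "trailer": 283})
merged_bag = merge_plurals(bag)
print(merged_bag["movie"])   # 113
print(merged_bag["movies"])  # 0
```

A rule this simple cannot tell real plurals from unrelated words that happen to differ by a trailing "s", such as "new" and "news". The appearance of "new" in the top ten after merging is consistent with that kind of false match, so the merged counts should be read with some caution.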

In [1]:
#!/usr/bin/env python

# Imports
import spacy
import itertools
import plotly.figure_factory as ff
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import numpy as np
import random
from datetime import datetime, timedelta
import plotly.graph_objs as go
import plotly.offline as py
import operator
import functools
import nltk


In [36]:
print(bag_1['glass'])
print(bag_1['animals'])
print(bag_2[('glass', 'animals')])

51
53
43


#### Duplicates in bigrams

We now have another problem to address. If we were to combine our unigrams and bigrams as they are and look at the distribution of each keyword, we'd count the same thing twice. For example, the British band "Glass Animals" appears 43 times, and most occurrences of "glass" and "animals" come from the band's name. We assume that if a single word occurs less than roughly three times as often as a bigram containing it, then most of its occurrences are due to that bigram, so we can remove the word. Likewise, if the bigram appears less than about a third as many times as the respective unigram, we can remove the bigram from the dataset.

While this is more anecdotal than rigorous, it was able to extract more of the key bigrams that I knew were in the dataset. Obviously this requires more rigor to give reproducible outcomes for other datasets.
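
To make the rule concrete, the sketch below applies the comparison to counts already reported in this notebook. The helper name `bigram_dominates` is hypothetical, and the threshold of 0.3 is the hand-picked value used in this notebook, not a tuned one.

```python
THRESHOLD = 0.3  # hand-picked value, not tuned


def bigram_dominates(unigram_count, bigram_count, threshold=THRESHOLD):
    """True when the bigram accounts for at least `threshold` of the unigram's count."""
    return unigram_count * threshold <= bigram_count


# 'glass' (51) and 'animals' (53) against ('glass', 'animals') (43)
print(bigram_dominates(51, 43))    # True: 15.3 <= 43, so 'glass' is dropped
print(bigram_dominates(53, 43))    # True: 15.9 <= 43, so 'animals' is dropped

# 'video' (833) against ('video', 'removed') (239)
print(bigram_dominates(833, 239))  # False: 249.9 > 239, so the bigram is dropped
```

With these numbers, the band name survives as a bigram while its two component words are removed, whereas "video" is common enough on its own that the bigram ("video", "removed") is discarded instead.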

In [37]:
THRESHOLD = 0.3
for (first, second), count in bag_2.most_common(1000):
    if bag_1[first] * THRESHOLD <= count:
        # The bigram accounts for a large share of the first word: drop the word
        del bag_1[first]
        # The second word is only checked when the first word was dropped
        if bag_1[second] * THRESHOLD <= count:
            del bag_1[second]
    else:
        # The bigram is rare relative to its first word: drop the bigram
        del bag_2[(first, second)]

# Note: ('video',) is a 1-tuple, which is never a key in bag_1, so this prints 0
print(bag_1[('video',)])
wh['ngrams'] = wh['unigrams'] + wh['bigrams']
bag_1_2 = bag_1 + bag_2

0


In [39]:
bag_1_2.most_common(10)

[('video', 833),
 ('removed', 239),
 ('new', 223),
 (('official', 'video'), 209),
 ('1', 205),
 ('2', 197),
 ('vs', 191),
 ('hd', 190),
 ('day', 189),
 ('life', 177)]