# COGS 108 - Data Checkpoint

# Names

- Jared (Ruotian) Chen
- Jimin Cheon
- Kane Gu
- Laurence D'Ercole
- Nisha Davankar

<a id='research_question'></a>
# Research Question

## YouTube
What attributes of a YouTube video affect its popularity/shareability (likes and/or shares)?

# Dataset(s)

## Dataset Information

- Dataset #1
    - Dataset Name: YouTube's [official data API](https://developers.google.com/youtube/v3) (video listing)
    - Link to the dataset: https://developers.google.com/youtube/v3/docs/videos/list
    - Number of observations: approx. `200 * 30 == 6000`

- Dataset #2
    - Dataset Name: YouTube's [official data API](https://developers.google.com/youtube/v3) (video searching)
    - Link to the dataset: https://developers.google.com/youtube/v3/docs/search/list
    - Number of observations: approx. `4000 * 30 == 120000` (subject to change)

- Dataset #3
    - Dataset Name: YouTube's internal video metadata API (specifically `window.ytInitialPlayerResponse`)
    - Link to the dataset: https://www.youtube.com/watch?v=[YOUTUBE_VIDEO_ID]
    - Number of observations: depends on how many videos have already been collected (one lookup per existing observation)
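
As a rough illustration of how the two official endpoints listed above are addressed, the sketch below only builds request URLs and makes no network call. The helper names `videos_list_url` and `search_list_url`, the default region, and the `YOUR_API_KEY` placeholder are assumptions for illustration; they are not taken from the project's `EDA.dcollect` module.

```python
from urllib.parse import urlencode

API_BASE = "https://www.googleapis.com/youtube/v3"


def videos_list_url(api_key, region="US", max_results=50, page_token=None):
    """Build a videos.list URL for the 'mostPopular' chart (Dataset #1)."""
    params = {
        "part": "id,snippet,statistics,contentDetails",
        "chart": "mostPopular",
        "regionCode": region,        # assumed default, adjust as needed
        "maxResults": max_results,   # the API caps this at 50 per page
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token  # continue from a previous page
    return f"{API_BASE}/videos?{urlencode(params)}"


def search_list_url(api_key, query, max_results=50, page_token=None):
    """Build a search.list URL restricted to videos (Dataset #2)."""
    params = {
        "part": "snippet",
        "type": "video",
        "q": query,
        "maxResults": max_results,
        "key": api_key,
    }
    if page_token:
        params["pageToken"] = page_token
    return f"{API_BASE}/search?{urlencode(params)}"


# Example: the key is a placeholder, never commit a real one
print(videos_list_url("YOUR_API_KEY"))
print(search_list_url("YOUR_API_KEY", "music video"))
```

Each response carries at most one page of results, so larger samples would be gathered by following the `nextPageToken` value from one response into the `page_token` argument of the next request.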


Datasets #1 and #2 supply likes/views/shares, genre, duration, date and time posted, content creator information (i.e. subscription count), and the number of hashtags added by the poster. Dataset #3 is used only to get the number of ads in a video.
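
One possible way to read Dataset #3 is to pull the JSON object assigned to `ytInitialPlayerResponse` out of a watch page's HTML. The sketch below works on an HTML string that has already been downloaded. The function names are hypothetical, and the use of an `adPlacements` list as the ad count is an assumption about an undocumented, internal format that may change without notice.

```python
import json

MARKER = "ytInitialPlayerResponse"


def extract_player_response(html):
    """Return the JSON object assigned to ytInitialPlayerResponse, or None."""
    decoder = json.JSONDecoder()
    start = html.find(MARKER)
    while start != -1:
        brace = html.find("{", start)
        if brace == -1:
            return None
        # Accept only an assignment ("marker = {"), skip other mentions
        between = html[start + len(MARKER):brace].strip()
        if between == "=":
            try:
                # raw_decode stops at the end of the first complete JSON value
                obj, _ = decoder.raw_decode(html, brace)
                return obj
            except json.JSONDecodeError:
                pass
        start = html.find(MARKER, start + len(MARKER))
    return None


def count_ads(player_response):
    """Count ad slots. ASSUMPTION: they are listed under 'adPlacements'."""
    return len(player_response.get("adPlacements", []))


# Synthetic page fragment for illustration only
sample_html = (
    '<script>var ytInitialPlayerResponse = '
    '{"videoDetails": {"videoId": "abc"}, "adPlacements": [{}, {}]};'
    'var other = 1;</script>'
)
response = extract_player_response(sample_html)
print(count_ads(response))
```

Using `raw_decode` rather than a regular expression avoids having to guess where the object ends, since nested braces and braces inside strings are handled by the JSON parser itself.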

Duplicates are removed using each video's unique ID. Data cleaning and categorization happen in real time, and invalid or irrelevant entries, if any, are dropped immediately. This design is intentional: we want the datasets to have as small a footprint as possible, both in memory and on disk. Data in memory may also be flushed to disk at any time, depending on memory usage.
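
A minimal sketch of this streaming approach is shown below. It is not the project's implementation: the `Collector` class, the validity rule (an entry needs an `id` and a view count), and the fixed buffer-size threshold standing in for a memory-usage check are all assumptions made for illustration.

```python
import json
import os
import tempfile


class Collector:
    """Deduplicate by video ID, drop invalid items, flush to disk in batches."""

    def __init__(self, out_path, max_buffered=1000):
        self.out_path = out_path
        self.max_buffered = max_buffered  # stand-in for a memory-usage check
        self.seen_ids = set()
        self.buffer = []

    def add(self, item):
        """Return True if the item was kept, False if it was dropped."""
        video_id = item.get("id")
        stats = item.get("stats") or {}
        # Hypothetical validity rule: require an ID and a view count
        if not video_id or stats.get("view") is None:
            return False
        if video_id in self.seen_ids:
            return False  # duplicate
        self.seen_ids.add(video_id)
        self.buffer.append(item)
        if len(self.buffer) >= self.max_buffered:
            self.flush()
        return True

    def flush(self):
        """Append buffered items to disk as JSON lines and free the buffer."""
        if not self.buffer:
            return
        with open(self.out_path, "a", encoding="utf-8") as f:
            for item in self.buffer:
                f.write(json.dumps(item) + "\n")
        self.buffer.clear()


# Small demonstration with made-up items
with tempfile.TemporaryDirectory() as tmp:
    collector = Collector(os.path.join(tmp, "videos.jsonl"), max_buffered=2)
    kept = [
        collector.add(video)
        for video in [
            {"id": "a", "stats": {"view": 10}},
            {"id": "a", "stats": {"view": 10}},  # duplicate ID
            {"id": "b", "stats": {}},            # missing view count
            {"id": "c", "stats": {"view": 5}},
        ]
    ]
    collector.flush()
    with open(collector.out_path, encoding="utf-8") as f:
        saved = [json.loads(line) for line in f]

print(kept, [video["id"] for video in saved])
```

Only the set of seen IDs stays in memory for the whole run, which keeps the footprint small even when many items are processed.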

# Setup

from os import path

import pandas as pd

df = None
# Cached sample; reused on later runs so the API is not queried again
df_cache = 'EDA/dsamples/dcheckpoint_sample.pickle'

if not path.exists(df_cache):
    from EDA.dcollect import *

    items = []
    def item_each_fn(item):
        # Callback invoked once per retrieved video; collect it for the DataFrame
        items.append(item)

    count = 10
    youtube_o = youtube(
        modules = {'http': fasthttp()},
        key = 'AIzaSyBKsF33Y1McGDdBWemcfcTbVyJu23XDNIk',
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
        }
    )
    youtube_o.trending(
        count = count,
        parts = [
            youtube.parts.ID,
            youtube.parts.SNIPPET,
            youtube.parts.STATS,
            youtube.parts.details.CONTENT
        ],
        want = {
            'id': None,
            'creator': {
                'id': None
            },
            'stats': {
                'like': None,
                'comment': None,
                'view': None
            },
            'time': None,
            'length': None
        },
        each_fn = item_each_fn
    )

    df = pd.json_normalize(items, sep = '.')
    df.to_pickle(df_cache)
else:
    df = pd.read_pickle(df_cache)

if df is not None:
    print(df.head())
    print(df.describe())


            id                      time          length  \
0  ssq6X6alZ3w 2021-02-12 05:00:14+00:00 0 days 00:04:04   
1  zzd4ydafGR0 2021-02-12 05:03:49+00:00 0 days 00:03:22   
2  aXzVF3XeS8M 2021-02-12 05:00:00+00:00 0 days 00:04:01   
3  pyLJIEROIZo 2021-02-12 00:21:01+00:00 0 days 00:12:48   
4  JSgrumHw-XA 2021-02-12 15:01:39+00:00 0 days 00:03:27   

                 creator.id  stats.like  stats.comment  stats.view  
0  UC0VOyT2OCBKdQhF3BAbZ-1g     1123835          64416     9839634  
1  UCEB4a5o_6KfjxHwNMnmj54Q      438828          42788     3433960  
2  UCANLZYMidaCbLQFWXBC95Jg      788872          72919     5553636  
3  UCt_DaLB_NDqPVxezyvcfRtg      169945          19872     1686460  
4  UCVIFCOJwv3emlVmBbPCZrvw      134165           4089      801628  
                          length    stats.like  stats.comment    stats.view
count                         10  1.000000e+01      10.000000  1.000000e+01
mean      0 days 00:04:42.200000  3.167475e+05   24145.500000  3.108622e+

# Data Cleaning


The YouTube API exposes a variety of datasets covering different aspects of the website. For our purposes, we will go through the API categories and focus on datasets such as the Most Popular list, or the list of video results associated with a specific search parameter. Because there are many functions and possible datasets to choose from, we will have to narrow our scope considerably. In addition, the YouTube API returns these lists in JSON format, which we will need to convert to CSV in order to conduct our EDA. See `EDA/{main,webapi,utils/decode}.py` for details.
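
The JSON-to-CSV step can be sketched with the standard library alone. The helpers `flatten` and `to_csv` below are hypothetical and are not the code in `EDA/`; they simply join nested keys with a dot, which mirrors the column names produced by `pd.json_normalize(items, sep = '.')` in the setup cell. The sample record reuses the first row of the output shown above.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts, e.g. {'stats': {'like': 1}} -> {'stats.like': 1}."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_csv(records):
    """Return CSV text with one row per record and one column per flat key."""
    rows = [flatten(record) for record in records]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)  # keep first-seen column order
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)  # keys missing from a row are written as empty
    return out.getvalue()


sample = {
    "id": "ssq6X6alZ3w",
    "creator": {"id": "UC0VOyT2OCBKdQhF3BAbZ-1g"},
    "stats": {"like": 1123835, "comment": 64416, "view": 9839634},
}
csv_text = to_csv([sample])
print(csv_text)
```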

# Project Proposal (updated)

| Meeting Date | Meeting Time    | Completed Before Meeting                | Discuss at Meeting                            |
|--------------|-----------------|-----------------------------------------|-----------------------------------------------|
| 2/2          | 6:00 PM         | Project Proposal done                   | Roles, data wrangling                         |
| 2/10         | 5:00 PM         | Review feedback, Part of data wrangling | EDA ideas, Updated Proposal                   |
| 2/11         | 7:00 PM         | Data wrangling and Partial EDA          | Discuss analysis and EDA                      |
| 3/2          | 6:00 PM         | Outline of Analysis and EDA             | Continue Analysis                             |
| 3/9          | 6:00 PM         | Finish Analysis                         | Ethics, Meeting with TA                       |
| 3/16         | 6:00 PM         | Draft done                              | Final review                                  |
| 3/19         | Before 11:59 PM | N/A                                     | Turn in Final Project & Group Project Surveys |