# YouTube Video Engagement

#### Isabella Sun and Josh Bernd

## Introduction
In this project, we explored the Trending YouTube Video Statistics dataset. Specifically, we were interested in different measures of engagement for this set of trending YouTube videos, and in how engagement rates relate to the other available information about each video.

## Data Description

The data describe top-trending videos on the YouTube platform in several countries. We primarily examined the data from the United States, but explored additional countries as well. The countries included in our analysis were:

- United States
- Great Britain
- Canada
- Germany
- France
- India

The main fields included in the data were:

- Trending date
- Video title
- YouTube channel
- Time the video was published
- Category
- Tags
- Number of views
- Number of likes
- Number of dislikes
- Number of comments
    
Since our focus for this project was engagement rates, our three most important variables were likes, dislikes, and comments. To normalize engagement, we computed three ratios for each video: likes plus dislikes divided by views, comments divided by views, and likes divided by dislikes.
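The three ratios can be sketched as follows. This is a minimal illustration on made-up rows, not our actual notebook code; it assumes the count columns are named `views`, `likes`, `dislikes`, and `comment_count`, and the names of the new ratio columns are hypothetical.

```python
import pandas as pd

# Toy rows standing in for the trending data; the values are made up.
df = pd.DataFrame({
    "views": [1000, 2000, 500],
    "likes": [100, 50, 20],
    "dislikes": [10, 50, 5],
    "comment_count": [20, 10, 5],
})

# Likes plus dislikes per view (hypothetical column name).
df["ratings_per_view"] = (df["likes"] + df["dislikes"]) / df["views"]

# Comments per view (hypothetical column name).
df["comments_per_view"] = df["comment_count"] / df["views"]

# Likes per dislike; this is inf or NaN for videos with zero dislikes.
df["likes_per_dislike"] = df["likes"] / df["dislikes"]

print(df[["ratings_per_view", "comments_per_view", "likes_per_dislike"]])
```

Dividing by views puts videos with very different audience sizes on a comparable scale, which is why raw counts are not used directly.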

### Missing/Excluded Data
For this project, we excluded any observations where comments or likes and dislikes were disabled for the video, as well as any videos that had an error or had been removed. In the US dataset, only 23 videos had an error or had been removed, and 696 videos had comments or likes disabled.
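A filter of this kind might look like the sketch below. It assumes the dataset exposes boolean flag columns named `comments_disabled`, `ratings_disabled`, and `video_error_or_removed`; the toy rows are made up for illustration.

```python
import pandas as pd

# Toy rows; the flag column names are assumed, the values are made up.
df = pd.DataFrame({
    "video_id": ["a", "b", "c", "d"],
    "comments_disabled": [False, True, False, False],
    "ratings_disabled": [False, False, True, False],
    "video_error_or_removed": [False, False, False, True],
})

# A row is excluded if any one of the three flags is set.
excluded = (
    df["comments_disabled"]
    | df["ratings_disabled"]
    | df["video_error_or_removed"]
)

clean = df[~excluded]
print(f"dropped {excluded.sum()} rows, kept {len(clean)}")
```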


## Analysis

First, we take a look at the top performing videos by our measures of engagement.

Below are the top 10 videos with the highest likes and dislikes per view. Most of them appear to be music videos or other music content.



![alt text](i1.png "Title")

Here are the top 10 videos with the highest comments per view. Many of them appear to be makeup-related content.

![alt text](i2.png "Title")

Here are the top 10 channels with the highest likes and dislikes per view:

![alt text](i3.png "Title")

And here are the top 10 channels with the highest comments per view:

![alt text](i4.png "Title")
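Rankings like the ones above can be produced with `nlargest`. The sketch below is illustrative only: it uses made-up rows, assumes `title` and `channel_title` columns, uses a hypothetical `comments_per_view` column, and averages the per-row rate within each channel, which may not be the aggregation behind the figures.

```python
import pandas as pd

# Toy rows; the values and the comments_per_view column name are illustrative.
df = pd.DataFrame({
    "title": ["v1", "v2", "v3", "v4"],
    "channel_title": ["A", "A", "B", "C"],
    "comments_per_view": [0.01, 0.03, 0.05, 0.01],
})

# Top videos by rate (top 2 here; the report uses the top 10).
top_videos = df.nlargest(2, "comments_per_view")

# Top channels by the mean rate of their videos (an assumed aggregation).
top_channels = (
    df.groupby("channel_title")["comments_per_view"]
    .mean()
    .nlargest(2)
)

print(top_videos[["title", "comments_per_view"]])
print(top_channels)
```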

### Relationship between the different measures of engagement

We also examined the relationship between the measures of engagement. The correlation between likes and dislikes per view and comments per view is **0.44**.
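A correlation between two rate columns can be computed as in the sketch below. It assumes a Pearson correlation (the pandas default) and uses hypothetical column names with made-up values, so it does not reproduce the 0.44 figure.

```python
import pandas as pd

# Toy rates; the column names and values are illustrative only.
df = pd.DataFrame({
    "ratings_per_view": [0.1, 0.2, 0.3, 0.4],
    "comments_per_view": [0.01, 0.02, 0.02, 0.04],
})

# Series.corr defaults to the Pearson correlation coefficient.
r = df["ratings_per_view"].corr(df["comments_per_view"])
print(round(r, 2))
```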

![alt text](i5.png "Title")

### Engagement Rates by Country

Below are figures visualizing the engagement rates by country. 

![alt text](i6.png "Title")
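One way to build a per-country comparison is to tag each country's rows, stack them, and group by country. The sketch below is illustrative: the frames are made-up stand-ins for the per-country files, the `country` and `comments_per_view` columns are hypothetical, and using the mean as the summary statistic is an assumption.

```python
import pandas as pd

# Toy per-country frames standing in for the per-country CSV files.
frames = {
    "US": pd.DataFrame({"views": [1000, 2000], "comment_count": [10, 40]}),
    "GB": pd.DataFrame({"views": [500, 1500], "comment_count": [5, 45]}),
}

# Tag each frame with its country code, then stack them into one table.
combined = pd.concat(
    [frame.assign(country=code) for code, frame in frames.items()],
    ignore_index=True,
)

combined["comments_per_view"] = combined["comment_count"] / combined["views"]

# Mean engagement rate per country (the choice of mean is an assumption).
by_country = combined.groupby("country")["comments_per_view"].mean()
print(by_country)
```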

## Visualizing Video View Depreciation

An interesting feature of these datasets is that a video's statistics were recorded each time it trended. We wanted to see how trending affected the number of new views a video received.

```python
import pandas as pd
import numpy as np

## Read in US data
us_df = pd.read_csv('case_study1/data/USvideos.csv')

## Set dataframe's index to video_id
us_df.set_index('video_id', inplace=True)

## Sort by views
us_df = us_df.sort_values(['views'], ascending=True)
us_df.head()
```


```python
import matplotlib.pyplot as plt


def view_growth(index):
    """Return (x_list, y_list) for the video with the given video_id.

    x_list numbers the video's trending entries 1..n.
    y_list holds 1 for the first entry and, for every later entry,
    (views - previous views) / previous views.
    """
    ## loc[[index], ...] always returns a Series, even if the video trended once
    views = us_df.loc[[index], 'views'].sort_values().to_numpy()
    x_list = list(range(1, len(views) + 1))
    y_list = [1.0]
    for prev, curr in zip(views, views[1:]):
        y_list.append((curr - prev) / prev)
    return x_list, y_list


## Try it on a single video
x_list, y_list = view_growth('2kyS6SvSYSE')


## Plot function that takes in the video's id
def plot(index):
    x_list, y_list = view_growth(index)

    plt.style.use('ggplot')

    plt.title("View depreciation of youtube videos")
    plt.xlabel('Trending #')
    plt.ylabel('Views as a percentage of Views \nwhen it started trending')

    plt.scatter(x_list, y_list,
        color='b',
        s=50,
        alpha=0.3
        )
```

```python
import matplotlib.pyplot as plt
## Jupyter magic: render plots inline (remove when running outside a notebook)
%matplotlib inline
plt.style.use('ggplot')
fig, ax = plt.subplots()

## Plot each selected video on the same axes
video_ids = [
    '1ZAPwfrtAFY', '2kyS6SvSYSE', '5qpjK5DgCt4', 'd380meD0W0M',
    'puqaWrEC7tY', 'jr9QtXwC9vc', 'YVfyYrEmzgM', 'ZAQs-ctOqXQ',
    'TaTleo4cOs8', 'GgVmn66oK_A', 'DM-ni_LSOFE', 'BWPrk9PUwQE',
    'ogYum4kWXgk',
]
for video_id in video_ids:
    plot(video_id)
```

![alt text](graph.png "Title")