**Introduction:**
YouTube keeps a list of the top trending videos on the platform. These videos are not necessarily the most viewed ones. In this project we aim to find patterns in different measures of these videos, including user interactions (views, likes, etc.) as well as title and category. We focus on data from the US.

Let's start by importing, understanding, and cleaning the data. The data comes from YouTube and is publicly available on kaggle.com.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

us_youtube = pd.read_csv("../input/youtube-new/USvideos.csv")

print(us_youtube.head())
print(us_youtube.columns)
print(us_youtube.dtypes)
print(us_youtube.shape)

print(us_youtube['video_id'].nunique())  # number of unique video IDs
print(us_youtube['title'].nunique())     # number of unique titles

The dataset has 40949 rows and 16 columns. We are interested in the trending and publishing dates, user interactions (views, likes, dislikes, number of comments), and categories.
The video_id column has 6351 unique values while the title column has 6455. It seems that some titles were changed while the video was on the trending list, so we will work with the video_id column.
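
One way to check that reading is to list the video IDs that appear under more than one title. The following is a minimal sketch on a hypothetical toy frame: the rows are invented for illustration, and only the column names `video_id` and `title` come from the dataset.

```python
import pandas as pd

# Toy stand-in for us_youtube; the rows are invented for illustration.
toy = pd.DataFrame({
    "video_id": ["a1", "a1", "b2", "b2", "c3"],
    "title": ["First cut", "First cut (official)", "Vlog", "Vlog", "Recipe"],
})

# Count distinct titles per video_id and keep the IDs with more than one.
titles_per_video = toy.groupby("video_id")["title"].nunique()
retitled = titles_per_video[titles_per_video > 1]

print(retitled)
```

Applied to `us_youtube` instead of `toy`, a non-empty result would support the idea that retitled videos explain the gap between the two counts.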

In the following, we perform some calculations on the selected columns and combine the results in a new data frame, df, on which we base our analysis.
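
As a compact illustration of this kind of per-video summary, the sketch below uses pandas named aggregation on a hypothetical toy frame (the values are invented). It assumes that counts only grow while a video is trending, so the minimum and maximum stand in for the first and last observed values; that assumption may not hold for every column, since likes can be withdrawn.

```python
import pandas as pd

# Toy stand-in for us_youtube; values are invented for illustration.
toy = pd.DataFrame({
    "video_id": ["a1", "a1", "a1", "b2"],
    "trending_date": ["17.14.11", "17.15.11", "17.16.11", "17.14.11"],
    "views": [100, 250, 400, 50],
    "likes": [10, 20, 35, 5],
})

# One row per video: days on the list, plus smallest and largest counts.
summary = toy.groupby("video_id").agg(
    trending_days=("trending_date", "nunique"),
    views_begining=("views", "min"),
    views_final=("views", "max"),
    likes_begining=("likes", "min"),
    likes_final=("likes", "max"),
).reset_index()

print(summary)
```

The cell that follows builds the same kind of table step by step, which makes it easier to add derived columns such as the publish hour.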

In [None]:
trending_days = us_youtube.groupby('video_id')['trending_date'].nunique().reset_index()
trending_days.rename(columns={'trending_date': 'trending_days'}, inplace=True)
    # number of days that a specific video was on the trending list

# select several columns with a list (double brackets); a bare tuple of names is rejected by recent pandas
count_cols = ['views', 'likes', 'dislikes', 'comment_count']
feedback_begining = us_youtube.groupby('video_id')[count_cols].min().reset_index()
feedback_final = us_youtube.groupby('video_id')[count_cols].max().reset_index()
    # smallest and largest counts per video, used as the values at the beginning and end of the trending period

us_youtube['trending_date'] = pd.to_datetime(us_youtube.trending_date,format='%y.%d.%m', utc=True)
us_youtube['publish_time'] = pd.to_datetime(us_youtube.publish_time)

publishtime = us_youtube.groupby('video_id')['publish_time'].min().reset_index()
publishtime = pd.to_datetime(publishtime['publish_time'])
    # publish time of each unique video ID
print(min(publishtime))

us_youtube['days_to_trending'] = us_youtube['trending_date']-us_youtube['publish_time'] 
days_to_trending = us_youtube.groupby('video_id')['days_to_trending'].min().reset_index()
    # time between publishing and the first trending date of each video (converted to whole days below)
    


def time_of_day(x):   # categorize the publish hour (0-23) into a part of the day
    if 4 <= x < 8:
        return 0  # 'early morning'
    if 8 <= x < 12:
        return 1  # 'morning'
    if 12 <= x < 16:
        return 2  # 'afternoon'
    if 16 <= x < 20:
        return 3  # 'evening'
    if 20 <= x < 24:
        return 4  # 'night'
    return 5      # 'late night' (hours 0 to 3)

publish_time_of_day = publishtime.dt.hour.apply(time_of_day)
    # part of the day in which each video was first published

category = us_youtube.groupby('video_id')['category_id'].min().reset_index()
    # category of each unique video ID, to be added to the data frame

df = pd.concat([trending_days, feedback_begining[['views', 'likes', 'dislikes', 'comment_count']], 
                feedback_final[['views', 'likes', 'dislikes', 'comment_count']], 
                days_to_trending.days_to_trending.dt.days, publishtime.dt.hour, publish_time_of_day, pd.Series(category['category_id'])], 
                axis=1)
df.columns = ['video_ID','trending_days', 'views_begining','likes_begining', 'dislikes_begining', 'num_comments_begining', 
              'views_final', 'likes_final', 'dislikes_final','num_comments_final','days_to_trending','trending_hour', 
              'time_of_day', 'category']
print(df)
    # data frame containing information for every unique video ID


Let's look closer at some of the data:

In [None]:
ave_trending_days = np.mean(df['trending_days'])
med_trending_days = np.median(df['trending_days'])
ave_days_to_trending = np.mean(df['days_to_trending'])
med_days_to_trending = np.median(df['days_to_trending'])
print('trending days average: ', round(ave_trending_days, 1), 'and median: ', med_trending_days)
print('days to trending average: ', round(ave_days_to_trending, 1), 'and median: ', med_days_to_trending)

trending_first_day = df[df['days_to_trending']==0].video_ID.count()/len(df)  # fraction of videos with days_to_trending equal to 0
print('percent of videos that were trending within a day of being published: ', round(100 * trending_first_day, 1))

ave_views = df.groupby('trending_days')['views_begining'].mean()      # mean of views for each trending duration
med_views = df.groupby('trending_days')['views_begining'].median()    # median of views for each trending duration
num_videos_per_day = df.groupby('trending_days')['video_ID'].count()  # number of videos with each trending duration

# groupby sorts its keys, so the three results already share the same ascending index;
# reordering them with an argsort of the unsorted unique values would scramble them
trending_days = np.array(ave_views.index)
ave_views = np.array(ave_views)
med_views = np.array(med_views)
num_videos_per_day = np.array(num_videos_per_day)

A large share of the trending videos (43%) reached the trending list within a day of being published.
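
A value of 0 in `days_to_trending` deserves a note. `trending_date` carries no time of day, so after parsing it sits at midnight, while `publish_time` keeps its clock time; the difference is then floored to whole days. The sketch below, with invented timestamps in the same formats as the dataset columns, contrasts that elapsed-time count with a calendar-day count. The `calendar_days` variant is an illustrative alternative, not part of the analysis above.

```python
import pandas as pd

# Invented timestamps, written in the same formats as the dataset columns.
trending = pd.to_datetime(
    pd.Series(["17.14.11", "17.14.11"]), format="%y.%d.%m", utc=True
)
published = pd.to_datetime(
    pd.Series(["2017-11-13T17:13:01.000Z", "2017-11-12T09:00:00.000Z"]), utc=True
)

# Timestamp difference, floored to whole days (what .dt.days returns).
elapsed_days = (trending - published).dt.days

# Calendar-day difference: drop the time of day before subtracting.
calendar_days = (trending - published.dt.normalize()).dt.days

print(elapsed_days.tolist())
print(calendar_days.tolist())
```

In this toy case the first video gets 0 elapsed days even though it was published on the previous calendar date, so "0 days" is best read as "less than 24 hours before the start of the first trending date".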
Let's plot some of the data.

In [None]:
plt.figure(figsize=(16, 14), dpi=80, facecolor='w', edgecolor='k')
plt.subplot(3,2,1)
plt.scatter(df['days_to_trending'] ,df['trending_days'])
plt.xlabel('days before trending')
plt.ylabel('trending duration (days)')

plt.subplot(3,2,2)
plt.bar(trending_days, num_videos_per_day)
plt.axvline(ave_trending_days, color='r', linestyle='solid', linewidth=3, label="Mean")
plt.xlabel('trending duration (days)')
plt.ylabel('number of videos')
plt.legend()

plt.subplot(3,2,3)
plt.scatter(df['trending_days'], df['views_begining'],alpha=0.5)
plt.plot(trending_days,ave_views, '-g')
plt.plot(trending_days,med_views, '-r')
plt.xlabel('trending duration (days)')
plt.ylabel('number of views')
plt.legend(['average', 'median'])
plt.ylim(top=2e7)

plt.subplot(3,2,4)
plt.scatter(df['trending_days'], df['likes_begining'],alpha=0.5)
plt.xlabel('trending duration (days)')
plt.ylabel('number of likes')

plt.subplot(3,2,5)
plt.scatter(df['trending_days'], df['dislikes_begining'],alpha=0.5)
plt.xlabel('trending duration (days)')
plt.ylabel('number of dislikes')

plt.subplot(3,2,6)
plt.scatter(df['trending_days'], df['num_comments_begining'],alpha=0.5)
plt.xlabel('trending duration (days)')
plt.ylabel('number of comments')
plt.show()

In general, the longer a video stays on the trending list, the higher its number of views tends to be.
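
A visual impression like this can be backed by a rank correlation, which is less sensitive to the few videos with extremely high view counts than a plain Pearson correlation. The sketch below computes Spearman's coefficient as the Pearson correlation of ranks, on invented numbers; the column names mirror `df`, but the values are hypothetical.

```python
import pandas as pd

# Invented values for illustration; column names mirror df above.
toy = pd.DataFrame({
    "trending_days": [1, 2, 3, 4, 5],
    "views_begining": [100, 300, 200, 900, 5000],
})

# Spearman's rho is the Pearson correlation computed on the ranks.
rho = toy["trending_days"].rank().corr(toy["views_begining"].rank())

print(round(rho, 2))
```

A coefficient close to 1 on the real data would support the statement above, while a value near 0 would suggest that the trend is driven by a handful of outliers.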