# Exploratory Data Analyses On Youtube Video Data from My Favorite Youtube Channels.

## Data limitations
The dataset is a real-world dataset and suitable for the research. However, the selection of only my favorite Youtube channels to include in the research might not be accurate. My definition is "popular" is only based on subscriber count but there are other metrics that could be taken into consideration as well (e.g. views, engagement).

## Ethics of data source
According to Youtube API's guide (https://developers.google.com/youtube/v3/getting-started), the usage of Youtube API is free of charge given that your application send requests within a quota limit. "The YouTube Data API uses a quota to ensure that developers use the service as intended and do not create applications that unfairly reduce service quality or limit access for others. " The default quota allocation for each application is 10,000 units per day, and you could request additional quota by completing a form to YouTube API Services if you reach the quota limit.

Since all data requested from Youtube API is public data (which everyone on the Internet can see on Youtube), there is no particular privacy issues as far as I am concerned. In addition, the data is obtained only for research purposes in this case and not for any commercial interests.

In [1]:
api_key= 'AIzaSyDqMY4K67Xi_dKjowMNIqOp9HU2eQ0CD3I'

In [2]:
# !pip install --upgrade google-api-python-client

In [3]:
# Google API
from googleapiclient.discovery import build
from IPython.display import JSON
import json
import pandas as pd
import isodate

In [4]:
from dateutil import parser

# Data visualization libraries
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
sns.set(style="darkgrid", color_codes=True)


In [5]:
# NLP libraries
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from wordcloud import WordCloud

In [6]:
channels_id=['UCSUf5_EPEfl4zlBKZHkZdmw',# Danny Gonzalez
            'UC7zsxKqd5MicTf4VhS9Y74g', # Kurtis Conner
            'UCTSRIY3GLFYIpkR2QwyeklA', # Drew Gooden
            'UCpmvp5czsIrQHsbqya4-6Jw', # Chad Chad
            'UCoLUji8TYrgDy74_iiazvYA', # Jarvis Johnson
            'UCfp86n--4JvqKbunwSI2lYQ', # Cody Ko
#             'UCnTr3rLQySAokkQjXbduk7Q', # nickisnotgreen
            'UCnmGIkw-KdI0W5siakKPKog', # Ryan Trahan
            'UCuo9VyowIT-ljA5G2ZuC6Yw', # Eddy Burback
            'UCoo-9GEm2mpyYIXSptrKIpA', # 2 Danny 2 Furious
#             'UCuTQDPUE12sy7g1xf1LAdTA', # Noel Miller    
            'UC0aSHiQNMy8IMmS7_8WbtdA', # FunkyFrogBait
#             'UCKGMrHJRXyM_WmPKS-lvE2A', # Casey Aonso
            ]

In [7]:
# api_service_name = "youtube"
# api_version = "v3"

# youtube =build(api_service_name, api_version, developerKey=api_key)
# request = youtube.channels().list( part="snippet,contentDetails,statistics",forUsername="MrBeast")
# response = request.execute()

# JSON(response)


In [8]:
api_service_name = "youtube"
api_version = "v3"

# Get credentials and create an API client
youtube = build(
    api_service_name, api_version, developerKey=api_key)


In [9]:
def channel_stats(channels_id, youtube):
    
    all_data = []
    request = youtube.channels().list(
    part="snippet,contentDetails,statistics",
    id=",".join(channels_id)
    )
    response = request.execute()
    response_str = json.dumps(response, indent=4)
    
#   loop through items
    for item in response['items']:
        data = {'channelName': item['snippet']['title'],
                'subscriberCount': item['statistics']['subscriberCount'],
                'views': item['statistics']['viewCount'],
                'totalVideos': item['statistics']['videoCount'],
                'playlistId': item['contentDetails']['relatedPlaylists']['uploads'],
               }
        all_data.append(data)
        
    return(pd.DataFrame(all_data))

## **Channel Statistics

Let's take a look at the channel statistics using the channel_stats function defined below We'll take a look at their channels as a whole.

In [10]:
channel_stats = channel_stats(channels_id, youtube)
# Converting string columns to numeric 
channel_stats['subscriberCount'] = pd.to_numeric(channel_stats['subscriberCount'])
channel_stats['views'] = pd.to_numeric(channel_stats['views'])
channel_stats['totalVideos'] = pd.to_numeric(channel_stats['totalVideos'])

HttpError: <HttpError 403 when requesting https://youtube.googleapis.com/youtube/v3/channels?part=snippet%2CcontentDetails%2Cstatistics&id=UCSUf5_EPEfl4zlBKZHkZdmw%2CUC7zsxKqd5MicTf4VhS9Y74g%2CUCTSRIY3GLFYIpkR2QwyeklA%2CUCpmvp5czsIrQHsbqya4-6Jw%2CUCoLUji8TYrgDy74_iiazvYA%2CUCfp86n--4JvqKbunwSI2lYQ%2CUCnmGIkw-KdI0W5siakKPKog%2CUCuo9VyowIT-ljA5G2ZuC6Yw%2CUCoo-9GEm2mpyYIXSptrKIpA%2CUC0aSHiQNMy8IMmS7_8WbtdA&key=AIzaSyDqMY4K67Xi_dKjowMNIqOp9HU2eQ0CD3I&alt=json returned "The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.". Details: "[{'message': 'The request cannot be completed because you have exceeded your <a href="/youtube/v3/getting-started#quota">quota</a>.', 'domain': 'youtube.quota', 'reason': 'quotaExceeded'}]">

In [None]:
channel_stats.sort_values(['subscriberCount'], ascending= False)

In [None]:
channel_stats['viewsPerVideo'] = ((channel_stats['views']/channel_stats['totalVideos']).round(2))

In [None]:
channel_stats.sort_values(['viewsPerVideo'], ascending= False)

 Interestingly, some channels have more subscribers but less views and vice versa. For example, Danny's and Cody Ko's channels has significantly more subscribers than Drew Gooden's channel, but slightly less average views.

Similarly chad chad and Jarvis has the same amount of subscribers but chad chad has significantly less videos uploaded. Quite interesting.

In [None]:
# Lets plot some graphs

sns.set(rc={'figure.figsize':(10,8)})

ax= sns.barplot(y= 'subscriberCount', x= 'channelName', data = channel_stats.sort_values('subscriberCount', ascending=False))
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)  # Rotate the x-axis labels by 45 degrees
ax.set_title('subscribes visualization of each channel')
ax.yaxis.set_major_formatter(ticker.FuncFormatter(lambda x, pos: '{:,.2f}'.format(x/1000000) + 'M'))
plt.show()

## Form functions to get the video details and comments

In [None]:
# now lets get all the video id from the channel using the playlist id
def get_video_ids(youtube, playlist_id):


    request = youtube.playlistItems().list(
        part="snippet,contentDetails",
        playlistId= playlist_id,
        maxResults = 50
    )
    response = request.execute()
    video_ids = []
    
    for i in range(len(response['items'])):
        video_ids.append(response['items'][i]['contentDetails']['videoId'])
        
#   to get all the vidIds since maax is 50 we will implement next_page_token
    next_page_token = response.get('nextPageToken')
    more_pages = True
    
    while more_pages:
        if next_page_token is None:
            more_pages = False
        else:
            request = youtube.playlistItems().list(
                        part='contentDetails',
                        playlistId = playlist_id,
                        maxResults = 50,
                        pageToken = next_page_token)
            response = request.execute()
    
            for i in range(len(response['items'])):
                video_ids.append(response['items'][i]['contentDetails']['videoId'])
            
            next_page_token = response.get('nextPageToken')
        
    return video_ids


In [None]:
# get video statistics of all the given videoIds
def get_video_stats(youtube, video_ids):
    all_video_info = []
    
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=','.join(video_ids[i:i+50])
        )
        response = request.execute() 

        for video in response['items']:
            stats_to_keep = {'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                             'statistics': ['viewCount', 'likeCount', 'favouriteCount', 'commentCount'],
                             'contentDetails': ['duration', 'definition', 'caption']
                            }
            video_info = {}
            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    try:
                        video_info[v] = video[k][v]
                    except:
                        video_info[v] = None

            all_video_info.append(video_info)
    return pd.DataFrame(all_video_info)

In [None]:
def get_comments_in_videos(youtube, video_ids):
    all_comments = []
    for video_id in video_ids:
        try:   
            request = youtube.commentThreads().list(
                part="snippet,replies",
                videoId=video_id
            )
            response = request.execute()
        
            comments_in_video = [comment['snippet']['topLevelComment']['snippet']['textOriginal'] for comment in response['items'][0:10]]
            comments_in_video_info = {'video_id': video_id, 'comments': comments_in_video}

            all_comments.append(comments_in_video_info)
            
        except: 
            # When error occurs - most likely because comments are disabled on a video
            print('Could not get comments for video ' + video_id)
        
    return pd.DataFrame(all_comments)  

In [None]:
# Create a dataframe with video statistics and comments from all channels

video_df = pd.DataFrame()
comments_df = pd.DataFrame()

for c in channel_stats['channelName'].unique():
    print("Getting video information from channel: " + c)
    playlist_id = channel_stats.loc[channel_stats['channelName']== c, 'playlistId'].iloc[0]
    video_ids = get_video_ids(youtube, playlist_id)
    
    # get video data
    video_data = get_video_stats(youtube, video_ids)
    # get comment data
    comments_data = get_comments_in_videos(youtube, video_ids)

    # append video data together and comment data toghether
    video_df = pd.concat([video_df, video_data], ignore_index=True)
    comments_df = pd.concat([comments_df, comments_data], ignore_index=True)

In [None]:
comments_df

In [None]:
video_df

## EDA & Feature engineering
In order to analyze the data effectively, there are a few pre-processing steps that need to be performed. Initially, it is
important to reformat certain columns, specifically the date and time columns like "publishedAt" and "duration". Additionally, it would be beneficial to enhance the data by incorporating new features that provide valuable insights into the characteristics of the videos.

## Check for the null values

In [None]:
video_df.isnull().any()

In [None]:
video_df['commentCount'].isnull().sum()

There are only 11 videos with no comment, I think we could drop those.

In [None]:
video_df['commentCount'] = pd.to_numeric(video_df['commentCount'])
video_df['viewCount'] = pd.to_numeric(video_df['viewCount'])
video_df['likeCount'] = pd.to_numeric(video_df['likeCount'])


In [None]:
video_df = video_df[video_df['commentCount'] != 0]
video_df

In [None]:
video_df_copy = video_df.copy()

In [None]:
video_df_copy.publishedAt.sort_values().value_counts()

There's no strange dates in the publish date column, videos were published between 2007 and 2023.

## Enriching data
I want to enrich the data for further analyses, for example:

-> create published date column with another column showing the day in the week the video was published, which will be useful for later analysis.

-> convert video duration to seconds instead of the current default string format

-> calculate number of tags for each video

-> calculate comments and likes per 1000 view ratio

In [None]:
# Create publish day (in the week) column
video_df_copy.loc[:, 'publishedAt'] =  video_df_copy['publishedAt'].apply(lambda x: parser.parse(x))
video_df_copy.loc[:, 'pushblishDayName'] = video_df_copy['publishedAt'].apply(lambda x: x.strftime("%A"))

In [None]:
# convert duration to seconds
video_df_copyvideo_df_copy['durationSecs'] = video_df_copy['duration'].apply(lambda x: isodate.parse_duration(x))
video_df_copy['durationSecs'] = video_df_copy['durationSecs'].astype('timedelta64[s]')

In [None]:
# Add number of tags
video_df_copy['tagsCount'] = video_df_copy['tags'].apply(lambda x: 0 if x is None else len(x))

In [None]:
video_df_copy['likeRatio'] = video_df_copy['likeCount'] / video_df_copy['viewCount'] * 1000
video_df_copy['commentRatio'] = video_df_copy['commentCount'] / video_df_copy['viewCount'] * 1000
