# Exploratory Data Analysis Using YouTube Video Data from the Most Popular Data Science Channels
## 1. Aims, objectives and background
### 1.1. Introduction
Founded in 2005, YouTube has grown to become the second largest search engine in the world (behind Google), processing more than 3 billion searches per month. How the YouTube algorithm works, however, remains largely a mystery: it is unclear what makes one video get views and be recommended over another. In fact, YouTube has one of the largest-scale and most sophisticated industrial recommendation systems in existence.

### 1.2. Aims and objectives
Within this project, I would like to explore the following:

- Getting to know the YouTube API and how to obtain video data.
- Analyzing video data and verifying different common "myths" about what makes a video do well on YouTube, for example:
    - Does the number of likes and comments matter for a video to get more views?
    - Does the video duration matter for views and interaction (likes/comments)?
    - Does title length matter for views?
    - How many tags do well-performing videos have? What are the common tags among these videos?
    - Across all the creators I take into consideration, how often do they upload new videos? On which days of the week?
- Exploring the trending topics using NLP techniques:
    - Which popular topics are being covered in the videos (e.g. using a word cloud for video titles)?
    - Which questions are being asked in the comment sections of the videos?
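
Several of the questions above (title length, tag count, duration, upload day) need features derived from the raw metadata. The following is a minimal sketch of how such features might be computed, not the method used later in this project. It assumes a hypothetical, already-flattened `video` dict with the keys `title`, `tags`, `publishedAt` and `duration` (an ISO 8601 duration string such as `PT15M33S`); the function names are illustrative only.

```python
import re
from datetime import datetime

# ISO 8601 duration pattern covering days, hours, minutes and seconds (e.g. PT1H2M3S).
_DURATION = re.compile(r"P(?:(\d+)D)?T?(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?")
_WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def duration_seconds(iso_duration):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION.fullmatch(iso_duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {iso_duration!r}")
    days, hours, minutes, seconds = (int(g or 0) for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def engineer_features(video):
    """Derive simple analysis features from one flattened video record."""
    published = datetime.strptime(video["publishedAt"], "%Y-%m-%dT%H:%M:%SZ")
    return {
        "title_length": len(video["title"]),
        "tag_count": len(video.get("tags") or []),  # videos may have no tags at all
        "duration_secs": duration_seconds(video["duration"]),
        "publish_day": _WEEKDAYS[published.weekday()],
    }


# Hypothetical record, for illustration only.
example = {
    "title": "Learn Python",
    "tags": ["python", "tutorial"],
    "publishedAt": "2023-05-10T11:17:00Z",
    "duration": "PT15M33S",
}
print(engineer_features(example))
```

Keeping the feature logic in one small function makes it easy to apply to every row of the dataset once it exists.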

### 1.3. Steps of the project
1. Obtain video metadata via the YouTube API for the top 10-15 channels in the data science niche (this includes several small steps: create a developer key, request data, and transform the responses into a usable data format)
2. Preprocess data and engineer additional features for analysis
3. Exploratory data analysis
4. Conclusions

### 1.4. Dataset
#### Data selection
As this project focuses specifically on data science channels, I found that few readily available online datasets are suitable for this purpose. The 2 alternative datasets I found are:

- The top trending YouTube videos on Kaggle: This dataset contains several months of data on daily trending YouTube videos for several countries, with up to 200 trending videos per day. However, it is not a good fit for this project because the trending videos cover a wide range of topics that are not necessarily related to data science.

- Another dataset comes from a GitHub repo by Vishwanath Seshagiri and contains the metadata of 0.5M+ YouTube videos along with their channel data. There is no clear documentation on how this dataset was created, but a quick look at the datasets in the repository suggests that the data was obtained by searching for popular keywords such as "football" or "science". There are also some relevant keywords such as "python". However, I decided not to use these datasets because they don't contain data for the channels I am interested in.

I therefore created my own dataset using the YouTube Data API v3. The exact steps of data creation are presented in section 2. Data Creation below.
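
As a rough illustration of what building such a dataset can involve, the sketch below requests basic statistics for a list of channels through the API's `channels().list` endpoint. It is an assumption about one possible first step, not the exact procedure of section 2: the function name `get_channel_stats` and the output keys are hypothetical, and `youtube` is expected to be a client created with `googleapiclient.discovery.build`.

```python
def get_channel_stats(youtube, channel_ids):
    """Return one dict of basic statistics per channel.

    `youtube` is a YouTube Data API v3 client; `channel_ids` is a list of
    channel ID strings (the API accepts them as one comma-separated value).
    """
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=",".join(channel_ids),
    )
    response = request.execute()

    rows = []
    for item in response.get("items", []):
        stats = item["statistics"]
        rows.append({
            "channel_name": item["snippet"]["title"],
            # Counts arrive as strings; some may be missing if a channel hides them.
            "subscribers": int(stats.get("subscriberCount", 0)),
            "views": int(stats.get("viewCount", 0)),
            "total_videos": int(stats.get("videoCount", 0)),
            # ID of the playlist that lists all uploads of the channel.
            "uploads_playlist_id": item["contentDetails"]["relatedPlaylists"]["uploads"],
        })
    return rows
```

The uploads playlist ID is kept because it is a convenient handle for listing every video of a channel in a later step.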

#### Data limitations
The dataset is a real-world dataset and suitable for the research. However, the selection of the top 10 YouTube channels to include is based purely on my own knowledge of channels in the data science field and might not be accurate. My definition of "popular" is based only on subscriber count, but other metrics could be taken into consideration as well (e.g. views, engagement). The cutoff at 10 channels is also somewhat arbitrary given the plethora of channels on YouTube. There might be smaller channels that are also very interesting to look into, which could be the next step of this project.

#### Ethics of data source
According to the YouTube API guide, usage of the API is free of charge provided that your application sends requests within a quota limit. "The YouTube Data API uses a quota to ensure that developers use the service as intended and do not create applications that unfairly reduce service quality or limit access for others." The default quota allocation for each application is 10,000 units per day, and you can request additional quota by submitting a form to YouTube API Services if you reach the quota limit.
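
To get a feel for how far 10,000 units go, the sketch below estimates the quota consumed by a planned set of requests. The per-call costs are assumptions taken from my reading of YouTube's published quota cost table (100 units for `search.list`, 1 unit for the other list calls shown) and should be checked against the current documentation; the request plan itself is hypothetical.

```python
DAILY_QUOTA = 10_000

# Assumed cost per call, in quota units (verify against the official cost table).
COST_PER_CALL = {
    "search.list": 100,
    "videos.list": 1,
    "channels.list": 1,
    "playlistItems.list": 1,
}


def quota_used(planned_calls):
    """Sum the quota units for a mapping of {method name: number of calls}."""
    return sum(COST_PER_CALL[method] * count
               for method, count in planned_calls.items())


# Hypothetical plan: 20 searches, 500 video lookups, 1 channel lookup.
plan = {"search.list": 20, "videos.list": 500, "channels.list": 1}
used = quota_used(plan)
print(f"used {used} units, {DAILY_QUOTA - used} remaining")
```

Under these assumed costs, searching is by far the most expensive operation, so it tends to pay off to collect video IDs through playlists and reserve `search.list` for cases where it is really needed.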

Since all data requested from the YouTube API is public data (which everyone on the Internet can see on YouTube), there are no particular privacy issues as far as I am concerned. In addition, the data is obtained only for research purposes in this case and not for any commercial interests.

In [5]:
from datetime import datetime, timedelta
    

In [41]:
hrs_to_subtract = 24
date_to = datetime.now()
date_from = date_to - timedelta(hours = hrs_to_subtract)

print(date_to)
print(type(date_to))
print(date_from)
print(type(date_from))

2023-06-13 14:48:16.573435
<class 'datetime.datetime'>
2023-06-12 14:48:16.573435
<class 'datetime.datetime'>
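
The datetimes printed above are naive (they carry no timezone), whereas the API's `publishedAfter` and `publishedBefore` parameters expect RFC 3339 timestamps ending in `Z` or a timezone offset. A minimal sketch of one way to produce such strings follows; the helper name `to_rfc3339` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_rfc3339(dt):
    """Format a datetime as an RFC 3339 UTC timestamp such as 2023-06-13T14:48:16Z.

    Note: astimezone() interprets a naive datetime as local time, so passing
    timezone-aware values is the safer choice.
    """
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


# Build the 24-hour window from timezone-aware values to avoid ambiguity.
window_end = datetime.now(timezone.utc)
window_start = window_end - timedelta(hours=24)
print(to_rfc3339(window_start), to_rfc3339(window_end))
```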


In [9]:
from googleapiclient.discovery import build

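In [28]:
import os

# Read the developer key from the environment instead of hard-coding it
# (the variable name YOUTUBE_API_KEY is an assumption; adapt it to your setup).
api_key = os.environ['YOUTUBE_API_KEY']
# No custom `http` object is defined in this notebook, so use the default transport.
youtube = build('youtube', 'v3', developerKey = api_key)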

In [25]:
from datetime import timezone

# The API rejects naive datetimes ("timestamps must end with 'Z' or have a valid
# timezone offset"), so convert to UTC and format as RFC 3339 first.
# search().list only accepts part = 'snippet'.
fmt = '%Y-%m-%dT%H:%M:%SZ'
youtube.search().list(
    part = 'snippet',
    order = 'viewCount',
    publishedAfter = date_from.astimezone(timezone.utc).strftime(fmt),
    publishedBefore = date_to.astimezone(timezone.utc).strftime(fmt),
    maxResults = 50).execute()

In [37]:
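def get_stats(youtube, date_from, date_to, max_results):
    """Search for videos published between date_from and date_to.

    Both bounds must be RFC 3339 strings (e.g. '2023-06-12T14:48:16Z').
    Returns the raw API response as a dict.
    """
    request = youtube.search().list(
        type = 'video',
        part = 'snippet',
        #order = 'viewCount',
        publishedAfter = date_from,
        publishedBefore = date_to,
        location = None,
        locationRadius = None,
        maxResults = max_results)
    return request.execute()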
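
A single search request returns at most 50 results, with further results available through the `nextPageToken` field of the response. The sketch below shows one way to follow those tokens; it is an illustration under assumptions rather than part of the original notebook. The function name `search_all_pages` and the `max_pages` guard are hypothetical, and the guard exists because each search call consumes quota.

```python
def search_all_pages(youtube, max_pages=3, **search_kwargs):
    """Collect items from up to `max_pages` pages of a search().list query.

    `search_kwargs` are passed straight to search().list, e.g.
    part='snippet', type='video', maxResults=50.
    """
    items = []
    page_token = None
    for _ in range(max_pages):
        kwargs = dict(search_kwargs)
        if page_token is not None:
            kwargs["pageToken"] = page_token  # continue where the last page ended
        response = youtube.search().list(**kwargs).execute()
        items.extend(response.get("items", []))
        page_token = response.get("nextPageToken")
        if page_token is None:  # no further pages available
            break
    return items
```

A call such as `search_all_pages(youtube, max_pages=2, part='snippet', type='video', maxResults=50)` would then gather up to 100 results.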
    


In [66]:
def youtube_search():
    """Return the snippets of the top 10 results of a search ordered by view count."""
    # Relies on the module-level `youtube` client built earlier.
    request = youtube.search().list(
        type = 'video',
        part = 'snippet',
        maxResults = 10,
        order = 'viewCount',
        location = None,
        locationRadius = None,
        )
    response = request.execute()
    # Keep only the 'snippet' part of each search result.
    return [item['snippet'] for item in response['items']]

In [67]:
print(youtube_search())

[{'publishedAt': '2023-05-10T11:17:00Z', 'channelId': 'UCRIg5SyEdNAWUPTAbb6XPWQ', 'title': 'which do you like? #shorts', 'description': '', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/LXtJvCGXkP0/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/LXtJvCGXkP0/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/LXtJvCGXkP0/hqdefault.jpg', 'width': 480, 'height': 360}}, 'channelTitle': 'SHIROKI☆しろき', 'liveBroadcastContent': 'none', 'publishTime': '2023-05-10T11:17:00Z'}, {'publishedAt': '2023-05-12T12:06:27Z', 'channelId': 'UCbjaJjcvq8Z6GXPa6J-gTNg', 'title': 'Unser Let&#39;s Go-Moment in Südfrankreich', 'description': '', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/illILvoVFoQ/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/illILvoVFoQ/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/illILvoVFoQ/hqdefault.jpg', 'width': 
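
The raw snippets above are deeply nested and contain HTML entities in the titles (for example `Let&#39;s`). The following sketch shows how each snippet might be reduced to a flat row with a cleaned title; the function name `snippet_to_row` and the choice of output columns are assumptions made for illustration.

```python
import html


def snippet_to_row(snippet):
    """Flatten one search-result snippet into a small dict of plain values."""
    return {
        "published_at": snippet["publishedAt"],
        "channel_id": snippet["channelId"],
        "channel_title": snippet["channelTitle"],
        # Titles come back HTML-escaped, so decode entities such as &#39;.
        "title": html.unescape(snippet["title"]),
    }


# Reduced copy of the second snippet shown in the output above.
sample = {
    "publishedAt": "2023-05-12T12:06:27Z",
    "channelId": "UCbjaJjcvq8Z6GXPa6J-gTNg",
    "title": "Unser Let&#39;s Go-Moment in Südfrankreich",
    "channelTitle": "example channel",  # placeholder value, not from the output
}
print(snippet_to_row(sample))
```

A list of such rows can be turned into a table directly, for instance with `pandas.DataFrame(rows)`.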

In [73]:

request = youtube.videos().list(
    part = 'statistics',
    chart="mostPopular",
    maxResults=10
    )
response = request.execute()

print(response)

{'kind': 'youtube#videoListResponse', 'etag': '5ZlbJo3aoj9cIrlpUJWXxFwGNlM', 'items': [{'kind': 'youtube#video', 'etag': 'YcJckY6N05YzN8p-IqXgEKf4Nko', 'id': 'qGtYn7DCIYo', 'statistics': {'viewCount': '2810690', 'likeCount': '103188', 'favoriteCount': '0', 'commentCount': '5723'}}, {'kind': 'youtube#video', 'etag': 'Z3ssvvMUwN6Pr3TCxbnr2xfr_h8', 'id': '48h57PspBec', 'statistics': {'viewCount': '71445187', 'likeCount': '3406250', 'favoriteCount': '0', 'commentCount': '93314'}}, {'kind': 'youtube#video', 'etag': 'rp5pqvrKuaRklmqQryrTMa9nyZw', 'id': '0ZWcFWPYxtk', 'statistics': {'viewCount': '1782286', 'likeCount': '53037', 'favoriteCount': '0', 'commentCount': '2164'}}, {'kind': 'youtube#video', 'etag': 'llNdIyhgCN5qw51QwWteWiz2yLA', 'id': 'XF0kMT39GNY', 'statistics': {'viewCount': '2005039', 'likeCount': '76041', 'favoriteCount': '0', 'commentCount': '7525'}}, {'kind': 'youtube#video', 'etag': 'a8D3ME20_9_1KgVYlFKdDAOln1o', 'id': '5emJgNiFbAA', 'statistics': {'viewCount': '734031', 'lik
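
Note that every count in the response above is a string. Before any comparison of likes, comments and views, the values need to be converted to integers. The sketch below does this and derives two simple engagement ratios; the function name `engagement` and the per-1,000-views metrics are illustrative assumptions, not definitions from the API.

```python
def engagement(item):
    """Convert one videos().list item to numeric stats plus engagement ratios."""
    stats = item["statistics"]
    views = int(stats["viewCount"])
    # likeCount and commentCount can be absent when a creator hides them.
    likes = int(stats.get("likeCount", 0))
    comments = int(stats.get("commentCount", 0))
    return {
        "id": item["id"],
        "views": views,
        "likes": likes,
        "comments": comments,
        "likes_per_1k_views": round(1000 * likes / views, 2) if views else None,
        "comments_per_1k_views": round(1000 * comments / views, 2) if views else None,
    }


# First item of the response printed above.
first_item = {
    "id": "qGtYn7DCIYo",
    "statistics": {"viewCount": "2810690", "likeCount": "103188",
                   "favoriteCount": "0", "commentCount": "5723"},
}
print(engagement(first_item))
```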