# Top 5000 YouTube Channels: Data Engineering

### YouTube is an American video-sharing website where thousands of videos of all kinds are uploaded daily by people across the world. Some of the uploaders also earn good money from it. YouTube is the world's largest video-sharing website, with about 3.25 billion hours of video watched each month worldwide. Also, the average number of mobile YouTube video views per day is 1,000,000,000.

### This notebook demonstrates how to engineer a dataset containing basic information about the top 5000 YouTube channels as ranked by Socialblade. Socialblade is a website where information about YouTube, Instagram and other social media accounts can be found. Thanks to socialblade.com for doing the hard work of collecting the data. 
For more about Socialblade and its product offerings, visit: <a href="https://socialblade.com/">Socialblade</a>

<b>Note:</b> This work is not sponsored by Socialblade and is just a fun project made using data science technologies. The project is not intended to violate any policy or anyone's privacy, since the data on the website is publicly available.

Importing the necessary libraries:

In [1]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

We will use Python's BeautifulSoup library (BS4) to scrape the data. Scrapy is another good library, but since I find it more complex than BS4, we will simply use BS4 in this project. 

A basic tutorial on BeautifulSoup web scraping can be found here: <a href="https://www.pythonforbeginners.com/python-on-the-web/web-scraping-with-beautifulsoup">Scraping with BeautifulSoup</a>

The web scraping begins:

In [2]:
# The URL of the webpage whose data we want to scrape.

URL = "https://socialblade.com/youtube/top/5000/"

In [3]:
# Sending a GET request to the URL; the returned Response object holds the page data.
req = requests.get(URL)

# Passing the response content to the BeautifulSoup constructor, which uses the html5lib parser to build a navigable HTML tree.
htmlData = BeautifulSoup(req.content, 'html5lib')

Now we have the structured HTML of the webpage where the rankings are displayed. Next, we traverse the HTML tree until we find the piece of HTML that contains the rankings data.
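A slightly more defensive version of the request step might look like the sketch below. The helper names `parse_html` and `fetch_soup`, the 30-second timeout, and the use of Python's built-in `html.parser` (instead of `html5lib`) are assumptions made for illustration and are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(markup):
    """Parse raw HTML (bytes or str) into a BeautifulSoup tree."""
    # "html.parser" ships with Python; the notebook itself uses "html5lib".
    return BeautifulSoup(markup, "html.parser")


def fetch_soup(url, timeout=30):
    """Hypothetical helper: download a page and fail loudly on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx replies
    return parse_html(response.content)


# Offline check of the parsing step on a tiny inline document.
sample = parse_html("<html><body><div>hello</div></body></html>")
print(sample.div.text)
```

Checking the status code before parsing helps avoid silently scraping an error page, which would otherwise only show up later as missing div tags.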

After a lot of experimentation and debugging, I located the relevant code. Next, we create an empty dataframe and populate it with the values for each channel. Reading the HTML manually, I found many <i>div</i> tags under the body section of the webpage, and the code we are interested in was in the 9th div tag. Inside that 9th tag there were two further divisions, left and right, and the rankings were in the second one. 

Each ranking was displayed in its own div, which held the data for that ranking. The div tags we wanted to fetch were the 5th through the 5004th. To sum up, we are about to fetch 5000 div tags, numbered 5 to 5004, from the second child div of the ninth div tag in the body. 
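To make the `nth-of-type` traversal concrete, here is a minimal sketch on a toy page. The markup and its values are made up for illustration and are far smaller than the real Socialblade page; only the selector pattern mirrors the description above.

```python
from bs4 import BeautifulSoup

# Toy page (made up): the second div under <body> holds one div per ranking.
toy_html = """
<html><body>
  <div>header</div>
  <div>
    <div><div>1st</div><div>A++</div></div>
    <div><div>2nd</div><div>A+</div></div>
  </div>
</body></html>
"""

toy_soup = BeautifulSoup(toy_html, "html.parser")

# Second div under body, then its first child div, then that div's second child div.
matches = toy_soup.select(
    "body > div:nth-of-type(2) > div:nth-of-type(1) > div:nth-of-type(2)"
)

# select() returns a list of Tag objects; .text gives the visible text.
grade = matches[0].text
print(grade)
```

The `>` combinator restricts each step to direct children, which is why the position of every div in the hierarchy has to be counted exactly.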


The code ahead uses two for loops. The outer loop iterates over the 5000 ranking div tags, and the inner loop iterates over the 6 values of each YouTube channel and appends them to a temporary list, which is then added to our dataframe as one row.

In [4]:
#Empty dataframe which will hold the data
## dataframe = DataFrame()

#Temporary list created to hold 6 values for each Youtube channel
## temp_list = []

#Looping begins
## for i in range(5,5005):
##    for j in range(1,7):
# Traversing the hierarchy of div tags for fetching values using select method of BS4 library. The method returns 
# a list.
##        html_text = htmlData.select("body > div:nth-of-type(9) > div:nth-of-type(2) \
##        > div:nth-of-type("+str(i)+") > div:nth-of-type("+str(j)+")")
    
# Converting the list to string. Now we have the html code which displays each value.
##        html_text1 = str(html_text[0])
    
# Converting to BS object.
##        html_text1_soup = BeautifulSoup(html_text1, "lxml")
    
# Fetching the textual value.
##        value = html_text1_soup.text
    
# Appending the value to the list
##        temp_list.append(value)

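# The list of 6 values is now ready for a particular channel. Adding it to the dataframe as a new row.
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used here instead.)
##    dataframe = pd.concat([dataframe, DataFrame([temp_list])], ignore_index=True)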
    
# Reinitializing the list for next iteration.
##    temp_list = []

    
# At the end of the execution of the above code, we get data into the dataframe.

It took about an hour for the loops to run and fill the dataframe. Since I'll be running the notebook cells multiple times before final deployment, I have commented out that code so that it doesn't run every time, and I use the exported CSV file from here on. If you want to fork the notebook and play with the code, kindly remove the <b>two</b> leading hashtags, which mark lines of Python code rather than comments. A <b>single</b> hashtag denotes an actual comment.
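Much of that hour is probably spent issuing one `select` call per cell and growing the dataframe one row at a time. A possibly faster variant, sketched here on made-up toy markup, selects all row divs in a single call, collects the values in plain Python lists, and builds the dataframe once at the end. The `#rows` container id and the cell values are assumptions for illustration only; the real page would need the selector described earlier.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Toy markup standing in for the rankings container (values are made up).
toy_html = """
<div id="rows">
  <div><div>1st</div><div>A++</div><div>Channel A</div></div>
  <div><div>2nd</div><div>A+</div><div>Channel B</div></div>
</div>
"""

soup = BeautifulSoup(toy_html, "html.parser")

rows = []
# One select() call fetches every row div instead of one call per cell.
for row_div in soup.select("#rows > div"):
    # recursive=False keeps only the direct child divs (the cells).
    cells = row_div.find_all("div", recursive=False)
    rows.append([cell.get_text(strip=True) for cell in cells])

# Build the DataFrame once instead of growing it row by row.
toy_frame = pd.DataFrame(rows)
print(toy_frame.shape)
```

Because `get_text(strip=True)` trims surrounding whitespace, this approach would also avoid most of the stray newline characters that the loop above leaves in the values.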


So now that we have the data ready, let us have a first look:

In [5]:
dataframe = pd.read_csv("scraped.csv")
dataframe.head()

Unnamed: 0,0,1,2,3,4,5
0,1st,\nA++,\n\nZee TV\n\n,82757,"\n18,752,951","\n20,869,786,591"
1,2nd,\nA++,\n\nT-Series\n\n,12661,"\n61,196,302","\n47,548,839,843"
2,3rd,\nA++,\n\nCocomelon - Nursery Rhymes\n\n,373,"\n19,238,251","\n9,793,305,082"
3,4th,\nA++,\n\nSET India\n\n,27323,"\n31,180,559","\n22,675,948,293"
4,5th,\nA++,\n\nWWE\n\n,36756,"\n32,852,346","\n26,273,668,433"


As we can notice, the values aren't properly formatted and the columns are yet to be named.

In [6]:
dataframe.columns = ["Rank","Grade","Channel name","Video Uploads","Subscribers","Video views"]

In [7]:
dataframe.replace(r'\n',r'',regex=True,inplace=True)

In [8]:
dataframe.head()

Unnamed: 0,Rank,Grade,Channel name,Video Uploads,Subscribers,Video views
0,1st,A++,Zee TV,82757,18752951,20869786591
1,2nd,A++,T-Series,12661,61196302,47548839843
2,3rd,A++,Cocomelon - Nursery Rhymes,373,19238251,9793305082
3,4th,A++,SET India,27323,31180559,22675948293
4,5th,A++,WWE,36756,32852346,26273668433


Finally, the data is well formatted and ready. Phew! That was nice!
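One further check that may be worthwhile before analysis: if the count columns are still stored as text (for example because of thousands separators), they need converting to numbers. The sketch below uses two rows copied from the raw output above; the `to_number` helper is hypothetical, and whether it is needed depends on how the CSV was actually written and read.

```python
import pandas as pd

# Two rows copied from the raw output above, kept as text on purpose.
raw = pd.DataFrame(
    {
        "Channel name": ["\n\nZee TV\n\n", "\n\nT-Series\n\n"],
        "Subscribers": ["\n18,752,951", "\n61,196,302"],
    }
)


def to_number(column):
    """Hypothetical helper: strip whitespace and commas, then parse as numbers."""
    cleaned = column.str.strip().str.replace(",", "", regex=False)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


raw["Channel name"] = raw["Channel name"].str.strip()
raw["Subscribers"] = to_number(raw["Subscribers"])
print(raw.dtypes)
```

Using `errors="coerce"` means placeholder values in the source would surface as `NaN`, which can then be counted or dropped explicitly.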

The dataset can be used for exploratory data analysis and visualizations, which may reveal correlations and insights about the factors behind the YouTube channel rankings. Though the data covers only some basic information about the YouTube channels, I'm pretty sure it can be very helpful to beginners in data science.

## Thanks for reading. 