# Web Scraping Youtube.com
### Task:
1. Go to Youtube.com
2. Use Selenium to type in a search term, e.g. "Kishore kumar" or "ISB".
3. Scrape the top 10 links that show up in the search results. Collect data on link url, link title, subscription channel, no. of views and time when video was first uploaded or posted.
4. Now for each of the top 10 video links, scrape the top 50 comments (Note: you may need selenium to do the 'sort by' top comments).
5. For each comment, collect data on who wrote it (user handle), when it was posted, how many replies, upvotes and downvotes it received.
6. Build a data frame containing the results and write it out as "output_scraping.csv".

# Solution

### Step 1 - Import the required packages 
We need to import the required packages. This example emulates searching for **Kishore Kumar** songs on YouTube, which is accomplished using Selenium. Ensure that:
* Selenium is installed on your machine. You can run `pip install selenium` in the Anaconda command prompt.
* The webdriver for Chrome is copied to a local folder so that the examples can use it. You can download the driver from [here](https://www.seleniumhq.org/download/).
* pandas is imported as well, since the scraped data is held in a data frame before being written to the file.

In [261]:
# required for all the interactions with the browser
from selenium import webdriver
# required to select various elements on the page by id, name, css, xpath etc
from selenium.webdriver.common.by import By
# to be able to use the standard Keys like RETURN
from selenium.webdriver.common.keys import Keys
# required for manipulating data using data frames etc.
import pandas as pd
# in order to write records to a csv file 
import csv
# induce sleep to wait for navigation to new url
import time
from time import sleep
import numpy as np

### Step 2 - Initialize the Selenium Chrome driver
We need to invoke the webdriver, which means instantiating the Chrome webdriver to create a driver object.
> ! Ensure your chromedriver path matches the path specified below, or update the path appropriately

In [266]:
locationWebDriver = "D:/webdriver/chromedriver.exe"
driver = webdriver.Chrome(locationWebDriver)

### Step 3 - Scrape the top 10 songs of Kishore Kumar
In this step we will perform the following actions:
* Navigate to the website youtube.com in the instantiated Chrome browser.
* Locate the element for the **Search** box in the DOM.
* To identify the specific element, follow these steps:
    * Open a separate browser window manually.
    * Navigate to https://youtube.com
    * Right-click the **Search** element on the page and choose **Inspect**. You should see something like the screenshot below.
    ![Find Element By ID](search.png)
    * In this case we can use the **Id** of the element, which is **"search"**.
    * Similarly, find the **Id** of the search button next to the text box.
* In the **Search** box, type Kishore kumar, then either send the **RETURN** key or click the search button.
* On the **Results** page, select all the nodes that contain the song details. Loop through each song node and extract the required attributes using XPath.
* Create a data frame with the details of the top 10 songs.

In [271]:
try:
    # navigate to the website youtube.com
    driver.get("https://youtube.com")
    # find the search box by its id on the page
    search_box = driver.find_element(By.ID, "search")
    search_box.send_keys("Kishore kumar")

    # one option is to press ENTER in the search box
    #search_box.send_keys(Keys.RETURN)

    # the alternative is to find the search button and click it
    driver.find_element(By.ID, 'search-icon-legacy').click()
    sleep(5)

    index = 1
    songList = []
    # reload the results page before locating its elements
    driver.get(driver.current_url)
    time.sleep(5)
    # scroll down so that more results are rendered
    driver.execute_script('window.scrollTo(1, 3000);')
    time.sleep(5)
    # each result node holds the details of one video
    songElements = driver.find_elements(By.XPATH, '//div[@id="dismissable"]')
    for song in songElements:
        # a missing attribute falls back to an empty string
        try:
            title = song.find_element(By.XPATH, './/*[@id="video-title"]').get_attribute("title")
        except Exception:
            title = ""
        try:
            url = song.find_element(By.XPATH, './/*[@id="video-title"]').get_attribute("href")
        except Exception:
            url = ""
        try:
            channel = song.find_element(By.XPATH, './/*[@id="byline"]').get_attribute("title")
        except Exception:
            channel = ""
        try:
            views = song.find_element(By.XPATH, './/*[@id="metadata-line"]/span[1]').text
        except Exception:
            views = ""

        # write the data to a list. We will use this to finally write the data to a file
        songList.append((index,
                         title, 
                         url, 
                         channel, 
                         views))

        # increment the index to break at count 10
        index = index + 1
        if(index>10):
            break

    cols=['pkey','title','url','channel','views']
    songData = pd.DataFrame(songList, columns=cols)
    print('Data frame created')
    
except Exception as e:
    print(e)
    driver.close()

Data frame created
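
The loop in Step 3 repeats the same try/except pattern for every attribute. The following is a minimal sketch of a helper that could collapse that repetition. `safe_extract` is a hypothetical name that does not appear in the original notebook, and the dictionary lookup only stands in for a Selenium element lookup so that the sketch runs without a browser.

```python
def safe_extract(getter, default=""):
    """Call getter() and return its result, or default if it raises."""
    try:
        return getter()
    except Exception:
        return default


# Stand-in for a scraped node: a dict lookup that can fail like a missing element
node = {"title": "Kishor Kumar Live On TV"}
title = safe_extract(lambda: node["title"])
channel = safe_extract(lambda: node["channel"])  # the KeyError is swallowed
print(repr(title), repr(channel))
```

In the scraping loop, the lambda would wrap the element lookup for each attribute, with `default="0"` for the numeric fields.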


### Step 4 - Check if the top 10 videos have been populated
* List a few records from the data frame

In [272]:
songData.head(10)

Unnamed: 0,pkey,title,url,channel,views
0,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views
1,2,Top 100 Songs Of R.D Burman & Kishore Kumar | ...,https://www.youtube.com/watch?v=uA67M0Lihz0,Saregama Music,3.7M views
2,3,Kishore Kumar Evergreen Hit Songs | Hindi Hit ...,https://www.youtube.com/watch?v=GSmXU2Q7TU8,Bollywood Classics,20M views
3,4,Kishore-Amitabh Ki Evergreen Jodi | Best of Ki...,https://www.youtube.com/watch?v=lKVIElw2IZM,Gaane Sune Ansune,6.4M views
4,5,Chala Jata Hoon - Rajesh Khanna-Kishore Kumar-...,https://www.youtube.com/watch?v=6vCbtPN9skk,Indrakr90,995K views
5,6,Best of Kishore Kumar Bangla Songs,https://www.youtube.com/watch?v=NQhPp_FyPzE,Up tv,2.9M views
6,7,Kishore Kumar Junior | Trailer | Prosenjit Ch...,https://www.youtube.com/watch?v=F5KO42OWQV0,Saregama Bengali,894K views
7,8,Best Of Lata Mangeshkar & Kishore Kumar Duets ...,https://www.youtube.com/watch?v=duee3ROzuKg,Gaane Sune Ansune,35M views
8,9,Kishore Kumar Top 10 Romantic Songs {HD} - Juk...,https://www.youtube.com/watch?v=aqBHgUkoZCE,Shemaroo Filmi Gaane,4.1M views
9,10,Kishor Kumar Live On TV,https://www.youtube.com/watch?v=FvqvlYR_YlQ,Dattatray Mirwankar,3.4M views
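
The `views` column holds display strings such as "17M views", which cannot be sorted or summed directly. Below is a rough sketch of one way to turn them into integers. `parse_views` is a hypothetical helper, and the "B" suffix is an assumption, since only "K" and "M" appear in the output above.

```python
import re
from decimal import Decimal

# Multipliers for the suffixes; "B" is assumed, it does not occur in the sample
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_views(text):
    """Convert a string such as '3.7M views' to an int, or None if it does not match."""
    match = re.match(r"\s*(\d[\d,]*(?:\.\d+)?)\s*([KMB]?)\s+views?\b", text, re.IGNORECASE)
    if match is None:
        return None
    # Decimal avoids floating-point rounding when scaling values like 3.7
    number = Decimal(match.group(1).replace(",", ""))
    return int(number * _SUFFIX[match.group(2).upper()])


for sample in ["17M views", "3.7M views", "995K views"]:
    print(sample, "->", parse_views(sample))
```

On the data frame this could be applied with `songData.views.apply(parse_views)`.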


### Step 5 - Scrape top 50 comments for each song
* For each video, collect the comments and their associated attributes: author, comment text, age, upvotes, downvotes and replies

In [273]:
try:
    commentList = []
    for index, row in songData.iterrows():
        driver.get(row["url"])
        sleep(5)
        # text that states when the video was published
        publishedDate = driver.find_element(By.XPATH, '//*[@id="upload-info"]/span').text
        sleep(3)
        # scroll down so that the comments section starts loading
        driver.execute_script('window.scrollTo(1, 500);')
        sleep(5)
        # click the sort control, then the first entry in its menu
        driver.find_element(By.ID, 'icon-label').click()
        sleep(5)
        driver.find_element(By.XPATH, '//*[@id="menu"]/a[1]/paper-item/paper-item-body/div[1]').click()
        sleep(5)
        # scroll twice so that further comments are loaded
        driver.execute_script('window.scrollTo(1, 20000);')
        sleep(5)
        driver.execute_script('window.scrollTo(1, 20000);')
        sleep(5)
        commentElements = driver.find_elements(By.XPATH, '//ytd-comment-thread-renderer[@class="style-scope ytd-item-section-renderer"]')
        counter = 1
        for comment in commentElements:
            # text fields fall back to "", counts fall back to "0"
            try:
                commentAuthor = comment.find_element(By.XPATH, './/*[@id="author-text"]/span').text
            except Exception:
                commentAuthor = ""
            try:
                commentAge = comment.find_element(By.XPATH, './/*[@id="published-time-text"]/a').text
            except Exception:
                commentAge = ""
            try:
                commentText = comment.find_element(By.XPATH, './/*[@id="content-text"]').text
            except Exception:
                commentText = ""
            try:
                commentUpvote = comment.find_element(By.XPATH, './/*[@id="vote-count-middle"]').text
            except Exception:
                commentUpvote = "0"
            try:
                commentDownvote = comment.find_element(By.XPATH, './/*[@id="dislike-button"]/a').text
            except Exception:
                commentDownvote = "0"
            try:
                commentReply = comment.find_element(By.XPATH, './/*[@id="more"]/div/paper-button').text
            except Exception:
                commentReply = "0"

            commentList.append((row["pkey"], row["title"], row["url"], row["channel"], row["views"], publishedDate, commentAuthor, commentAge, commentText, commentUpvote, commentDownvote, commentReply))
            counter = counter + 1
            if(counter > 50):
                break

    cols=['pkey', 'title', 'url', 'channel', 'views', 'publishedDate', 'commentAuthor', 'commentAge', 'commentText', 'commentUpvote', 'commentDownvote', 'commentReply']
    commentData = pd.DataFrame(commentList, columns=cols)
    print("Data frame created")
except Exception as e:
    print(e)
    driver.close()

Data frame created
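
Step 5 relies on fixed `sleep` calls, which can wait longer than needed or not long enough. Selenium provides `WebDriverWait` for this purpose. The sketch below shows only the underlying polling idea in plain Python, so that it runs without a browser. `wait_until` and `comments_loaded` are hypothetical names, not part of the original notebook or of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f seconds" % timeout)
        time.sleep(interval)


# Stand-in for "the comments have loaded": becomes truthy on the third poll
calls = []


def comments_loaded():
    calls.append(1)
    return ["c1", "c2"] if len(calls) >= 3 else []


comments = wait_until(comments_loaded, timeout=2.0, interval=0.01)
print(len(calls), comments)
```

With a real driver, the condition would be a lookup of the comment elements, so the loop continues as soon as the page has rendered them.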


In [274]:
commentData.head(500)

Unnamed: 0,pkey,title,url,channel,views,publishedDate,commentAuthor,commentAge,commentText,commentUpvote,commentDownvote,commentReply
0,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Hitler Rage,3 years ago,SADLY!!! KISHORE KUMAR died at the age of 58y...,140,,View 23 replies
1,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Saikat Kundu,5 months ago,Kishore Kumar is a genius. The people who disl...,57,,View 6 replies
2,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",SK Aryan,1 year ago,Kishore da lives in our hearts... will remain ...,53,,View 9 replies
3,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Jaswinder Singh,2 years ago,Only a dumb person with 0 knowledge of music c...,94,,View 13 replies
4,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Mazhar Ali Bhatt,2 years ago,Kishore da!! Incredible man ( Singer+ Comedian...,192,,View 12 replies
5,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",sarmila sunuwar,2 years ago,it's very true line that old is always be gold...,201,,View 26 replies
6,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Sourav Bhattacharjee,2 years ago,"2000 dislike ,,really???i mean really???",30,,View 11 replies
7,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Sudhanshu,1 year ago,Serenity! Guys play this playlist in the mid n...,25,,View reply
8,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Ankan Dutta,2 years ago,Happy Birthday The Legendary Kishor Kumar !!,18,,View 2 replies
9,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,"Published on Feb 25, 2013",Sazani Thapa,1 year ago,These songs give me chillllss. 🔥 lit songs. Th...,18,,View reply


### Step 6 - Close the driver

In [275]:
driver.close()

### Step 7 - Process the data before storing to file
* Change the vote count to 0 where it is empty
* Remove the "Published on " prefix from the published date
* Change the comment reply count to 0 where it is empty


In [286]:
commentData['commentReply'] = commentData.commentReply.str.replace("View all ", "")
commentData['commentReply'] = commentData.commentReply.str.replace(" replies", "")
commentData['commentReply'] = commentData.commentReply.str.replace("View reply", "1")
commentData['commentReply'] = commentData.commentReply.str.replace("View ", "")
commentData['publishedDate'] = commentData.publishedDate.str.replace("Published on ", "")
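# replace only the cells that are entirely empty; str.replace("", "0") would also insert "0" around every character of a non-empty value
commentData['commentDownvote'] = commentData.commentDownvote.replace("", "0")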
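
The reply labels come in several shapes ("View 23 replies", "View all 20 replies", "View reply", or nothing at all). As an alternative to chaining string replacements, the sketch below maps each label to an integer in one place. `parse_reply_count` is a hypothetical helper, and the set of label shapes is assumed from the output shown in this notebook.

```python
import re


def parse_reply_count(text):
    """Map the reply-button label to an integer reply count."""
    text = (text or "").strip()
    if not text:
        return 0                      # no reply button was found
    if text.isdigit():
        return int(text)              # already numeric, e.g. the "0" default
    match = re.search(r"(\d+)\s+repl", text)
    if match:
        return int(match.group(1))    # "View 23 replies", "View all 20 replies"
    if "reply" in text.lower():
        return 1                      # "View reply" means exactly one reply
    return 0


for label in ["View 23 replies", "View all 20 replies", "View reply", ""]:
    print(repr(label), "->", parse_reply_count(label))
```

On the data frame this could be applied with `commentData.commentReply.apply(parse_reply_count)`, which also yields a numeric column rather than strings.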

### Step 8 - Write data to the csv file

In [287]:
commentData.to_csv('output_scraping.csv', sep=',', encoding='utf-8', index=False)

### Step 9 - Validate the file is created with relevant records
* Read the recently created csv file and insert into a data frame
* Read the top few records from the data frame

In [84]:
data = pd.read_csv('output_scraping.csv', sep=',', encoding='utf-8') 
data.head()

Unnamed: 0,pkey,title,url,channel,views,age,commentAuthor,commentAge,commentText,commentUpvote,commentDownvote,commentReply
0,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,5 years ago,Hitler Rage,3 years ago,SADLY!!! KISHORE KUMAR died at the age of 58y...,138,,View 20 replies
1,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,5 years ago,Sanjay Patel,2 years ago,rajesh khann amazing hero,7,,
2,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,5 years ago,ganganna talapelli,1 year ago,Its very true old iss gold song,9,,
3,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,5 years ago,dinesh bangera,1 year ago,my favourite songs... old is gold..,6,,
4,1,Kishore Kumar Hit Songs Jukebox - Evergreen Ro...,https://www.youtube.com/watch?v=b_iSFNJmAhU,Rajshri,17M views,5 years ago,Arvind Shrivastav,2 years ago,the best songs of the century,196,,View 20 replies


## Open Questions
* Processing data 
  * Extract the date from publishedDate
  * Change downvote to 0
  * Reply count where it is 1 or 0
  * Reply Count
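
For the first open question, one possible approach is to parse the published-date text into a real date. This is a minimal sketch that assumes the "Published on Feb 25, 2013" format seen in the Step 5 output and an English locale for the month abbreviation. `parse_published_date` is a hypothetical helper.

```python
from datetime import datetime


def parse_published_date(text, prefix="Published on "):
    """Strip the prefix and parse a date such as 'Feb 25, 2013'; return None on failure."""
    cleaned = text.strip()
    if cleaned.startswith(prefix):
        cleaned = cleaned[len(prefix):]
    try:
        # %b is the abbreviated month name, which depends on the locale
        return datetime.strptime(cleaned, "%b %d, %Y").date()
    except ValueError:
        return None


print(parse_published_date("Published on Feb 25, 2013"))
print(parse_published_date("5 years ago"))
```

Relative values such as "5 years ago" return `None` here, since they do not carry an exact date.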