## Webscraping with Selenium

Selenium is a good fit for scraping pages that build their content with JavaScript, because it drives a real browser and gives you the page after its scripts have run.

What you'll need:

- [Chrome Version](https://www.google.com/chrome/update/)
- [ChromeDriver matching your version](https://chromedriver.chromium.org/downloads)
- [Webdriver Manager](https://pypi.org/project/webdriver-manager/)
- [Selenium](https://www.selenium.dev/downloads/)
- [Create Virtual Environment](https://realpython.com/lessons/creating-virtual-environment/)
- [Gitignore](https://www.toptal.com/developers/gitignore)

In [156]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

In [157]:
# Start the Browser
# Headless mode

options = Options()
options.add_argument('--headless')  # options.headless = True was deprecated and later removed in Selenium 4
options.add_argument('--disable-blink-features=AutomationControlled')

# Service wraps the path to the ChromeDriver executable
service = Service('\\Users\\nunto\\drivers\\chromedriver.exe')
driver = webdriver.Chrome(options=options, service=service)
driver.get('https://www.youtube.com/channel/UCOl_a9rl1FykCd3ZO0yN6uQ')


#print(driver.page_source)
#driver.quit()
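
The cell above leaves the browser running, and the commented-out driver.quit() is easy to forget. As a minimal sketch, the same setup can be wrapped in a function that always closes Chrome. The names `build_chrome_arguments` and `fetch_page_source` are hypothetical helpers, not part of Selenium, and `driver_path` is assumed to point at the ChromeDriver you downloaded.

```python
def build_chrome_arguments(headless=True):
    """Return the Chrome command-line flags used in this notebook."""
    arguments = ['--disable-blink-features=AutomationControlled']
    if headless:
        arguments.append('--headless')
    return arguments


def fetch_page_source(url, driver_path, headless=True):
    """Open url in Chrome, return the rendered HTML, and always close the browser."""
    # Imported here so build_chrome_arguments can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service

    options = Options()
    for argument in build_chrome_arguments(headless):
        options.add_argument(argument)

    driver = webdriver.Chrome(service=Service(driver_path), options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # runs even if get() raises, so no Chrome process is left behind
```

Calling `fetch_page_source` with the channel URL and your driver path would return the HTML as a string. The rest of this notebook keeps `driver` open instead, because the later cells query it interactively.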

- Open https://www.youtube.com/
- Right-click the page and select Inspect

[Locating Elements](https://selenium-python.readthedocs.io/locating-elements.html)

Locate elements on the page using find_element (first match only) or find_elements (all matches)

In [158]:
from selenium.webdriver.common.by import By
from selenium.common.exceptions import ElementClickInterceptedException
driver.find_element(By.TAG_NAME, 'script')

<selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="cb4d1a8e-7d80-4235-b849-6e1bf5a0f87e")>

Let's find all the videos on the Hack for LA YouTube page

In [159]:
links= driver.find_elements(By.TAG_NAME, 'a')

find_elements returns a list of the web elements that match the locator. Here are the first ten:

In [160]:
links[:10]

[<selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="05baa847-9704-4868-9543-80556138a6a0")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="dc6b8840-ec9d-45cf-bb35-83776f2d6e47")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="effaf605-01f0-47dd-99dd-046732297a33")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="31b4b331-423a-4704-9a35-7dc5942b2740")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="b992b260-05f9-4066-8362-0c038f8e7a12")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="960bfb8b-f25c-4380-b1fa-5c9d1e2d771e")>,
 <selenium.webdriver.remote.webelement.WebElement (session="9693185a59cf9f4832c443b905d30ddc", element="d01aa5eb-9f7f-40b1-bd1f-55

Create a list of possible video links, then put them in a dataframe.

In [161]:
# get a list of links
possible_links= [link.get_attribute('href') for link in links]
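
The comprehension above keeps only the href of each element. While exploring a page it can help to keep the visible link text as well. The sketch below assumes each item behaves like a Selenium WebElement, meaning it has a `text` attribute and a `get_attribute` method. `summarize_links` is a hypothetical helper name.

```python
def summarize_links(elements):
    """Return (text, href) pairs for the elements that have an href."""
    pairs = []
    for element in elements:
        href = element.get_attribute('href')
        if href:  # skip anchors that have no destination
            pairs.append((element.text.strip(), href))
    return pairs
```

With the notebook's objects this would be called as `summarize_links(driver.find_elements(By.TAG_NAME, 'a'))`.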

In [162]:
possible_links[:20]

[None,
 None,
 'https://www.youtube.com/',
 None,
 None,
 None,
 None,
 'https://accounts.google.com/ServiceLogin?service=youtube&uilel=3&passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26hl%3Den%26next%3Dhttps%253A%252F%252Fwww.youtube.com%252Fchannel%252FUCOl_a9rl1FykCd3ZO0yN6uQ&hl=en&ec=65620',
 'https://www.youtube.com/',
 'https://www.youtube.com/',
 'https://www.youtube.com/feed/explore',
 None,
 'https://www.youtube.com/feed/subscriptions',
 'https://www.youtube.com/feed/library',
 'https://www.youtube.com/feed/history',
 'https://www.youtube.com/channel/null',
 None,
 'https://www.youtube.com/redirect?event=channel_banner&redir_token=QUFFLUhqbFVJMFRzZUVHOTN1cFFJcVZCMVpvUjcyNE43Z3xBQ3Jtc0treE5EWDdtWDc1VVExZGpBNDJnaWgxQ0h1NENTRjhNVlo2QWhpSW4taTRwSnVFZFVFcTN0RXk5WHNiMU1obWhZdERkVjhmTVZwb0J2UEVrT1BTLWhVM2RNMk9tX3VQR0JwZDZ1WGlXV0VOT2wwcWoyRQ&q=hackforla.org',
 'https://www.youtube.com/redirect?event=channel_banner&redir_toke

In [163]:
# drop the None values (filter(None, ...) removes every falsy item)
possible_links= list(filter(None, possible_links))
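
To make the behaviour of filter(None, ...) concrete, here is a small sketch on a hand-written list. The sample values are illustrative: two URLs taken from the output above, plus a None and an empty string.

```python
hrefs = [None, 'https://www.youtube.com/', '', 'https://www.youtube.com/feed/explore']

# filter(None, ...) keeps only truthy items, so None and '' are both dropped.
with_filter = list(filter(None, hrefs))

# The same result written as a list comprehension.
with_comprehension = [href for href in hrefs if href]

# To drop only None and keep empty strings, test identity instead.
only_none_removed = [href for href in hrefs if href is not None]
```

For this notebook either of the first two forms works, since an empty href is no more useful than a missing one.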

Pick out the tutorial videos: their URLs contain watch?v

In [164]:
# keep only watch URLs; this rebinds links from WebElements to URL strings
links = [link for link in possible_links if 'watch?v' in link]

In [165]:
links

['https://www.youtube.com/watch?v=gM8ZTktaFmI&list=UUOl_a9rl1FykCd3ZO0yN6uQ',
 'https://www.youtube.com/watch?v=gM8ZTktaFmI',
 'https://www.youtube.com/watch?v=gM8ZTktaFmI',
 'https://www.youtube.com/watch?v=3NjJ3RXfLvQ',
 'https://www.youtube.com/watch?v=3NjJ3RXfLvQ',
 'https://www.youtube.com/watch?v=g04XJCspuJ4',
 'https://www.youtube.com/watch?v=g04XJCspuJ4',
 'https://www.youtube.com/watch?v=zAfOKQR_Sfc',
 'https://www.youtube.com/watch?v=zAfOKQR_Sfc',
 'https://www.youtube.com/watch?v=NRgztzW0zmM',
 'https://www.youtube.com/watch?v=NRgztzW0zmM',
 'https://www.youtube.com/watch?v=x9wBtYs9RnM',
 'https://www.youtube.com/watch?v=x9wBtYs9RnM']
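
The output above lists most videos twice, and the first entry differs from the second only by an extra list parameter. One possible way to collapse these, sketched with the standard library only, is to compare the v= parameter instead of the whole URL. `video_id` and `unique_video_ids` are hypothetical helper names.

```python
from urllib.parse import urlparse, parse_qs


def video_id(url):
    """Return the value of the v= query parameter, or None if it is absent."""
    values = parse_qs(urlparse(url).query).get('v')
    return values[0] if values else None


def unique_video_ids(urls):
    """Collect video IDs in first-seen order, without repeats."""
    ids = (video_id(url) for url in urls)
    # dict.fromkeys keeps insertion order and discards duplicate keys.
    return list(dict.fromkeys(i for i in ids if i))
```

Here `unique_video_ids(links)` would give one ID per video, and a clean URL can be rebuilt as `'https://www.youtube.com/watch?v=' + video_id(url)`.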

Add them to a dataframe

In [166]:
videodf=pd.DataFrame(links, columns=['tutorial_videos'])

In [167]:
videodf

| | tutorial_videos |
|---|---|
| 0 | https://www.youtube.com/watch?v=gM8ZTktaFmI&li... |
| 1 | https://www.youtube.com/watch?v=gM8ZTktaFmI |
| 2 | https://www.youtube.com/watch?v=gM8ZTktaFmI |
| 3 | https://www.youtube.com/watch?v=3NjJ3RXfLvQ |
| 4 | https://www.youtube.com/watch?v=3NjJ3RXfLvQ |
| 5 | https://www.youtube.com/watch?v=g04XJCspuJ4 |
| 6 | https://www.youtube.com/watch?v=g04XJCspuJ4 |
| 7 | https://www.youtube.com/watch?v=zAfOKQR_Sfc |
| 8 | https://www.youtube.com/watch?v=zAfOKQR_Sfc |
| 9 | https://www.youtube.com/watch?v=NRgztzW0zmM |
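
The dataframe repeats several rows. As a hedged sketch, pandas can drop exact repeats and hand back the result as CSV text. `deduplicated_frame` is a hypothetical helper, and `sample` is a short stand-in for the notebook's `links` list.

```python
import pandas as pd


def deduplicated_frame(links):
    """Build the one-column frame and drop rows whose URL is repeated."""
    frame = pd.DataFrame(links, columns=['tutorial_videos'])
    # drop_duplicates keeps the first occurrence; reset_index renumbers rows from 0.
    return frame.drop_duplicates().reset_index(drop=True)


sample = [
    'https://www.youtube.com/watch?v=gM8ZTktaFmI',
    'https://www.youtube.com/watch?v=gM8ZTktaFmI',
    'https://www.youtube.com/watch?v=3NjJ3RXfLvQ',
]
frame = deduplicated_frame(sample)
csv_text = frame.to_csv(index=False)  # with no path, to_csv returns the CSV as a string
```

This compares whole URLs, so a link with an extra &list= parameter still counts as a separate row. Passing a file name to `to_csv` would write the result to disk instead.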
