## Webscraping with Selenium

Part 3.0 Vocab

| Term | Description |
| -------- | ----------- |
| Selenium | an open source automation testing tool that supports a number of scripting languages like Python. |
| Webdriver | the Selenium interface that drives a real browser (such as Chrome) from code, the way a user would. |
| Headless | a browser mode that runs without a visible window (no graphical user interface), which is useful for automated scripts. |

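Selenium is a good tool for scraping web pages that build their content with JavaScript, because it drives a real browser and gives you the page as it looks after the scripts have run.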
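To see why a plain HTML parser is not enough for such pages, here is a minimal sketch using only the standard library. The `RAW_HTML` string and the `AnchorCollector` class are made up for illustration; they stand in for a page whose only link is created by a script at runtime.

```python
from html.parser import HTMLParser

# Hypothetical page source: the only link is created by JavaScript at runtime.
RAW_HTML = """
<html>
  <body>
    <div id="videos"></div>
    <script>
      const a = document.createElement('a');
      a.href = '/watch?v=example';
      document.getElementById('videos').appendChild(a);
    </script>
  </body>
</html>
"""


class AnchorCollector(HTMLParser):
    """Collect the href of every <a> tag present in the raw markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self.hrefs.append(dict(attrs).get('href'))


collector = AnchorCollector()
collector.feed(RAW_HTML)
static_links = collector.hrefs

# The parser never runs the script, so it finds no links at all.
print(static_links)
```

The parser sees no `<a>` tags because the link only exists once a browser executes the script, which is the step Selenium performs for you.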

What you'll need:

- [Chrome Version](https://www.google.com/chrome/update/)
- [ChromeDriver matching your version](https://chromedriver.chromium.org/downloads)
- [Webdriver Manager](https://pypi.org/project/webdriver-manager/)
- [Selenium](https://www.selenium.dev/downloads/)
- [Create Virtual Environment](https://realpython.com/lessons/creating-virtual-environment/)
- [Gitignore](https://www.toptal.com/developers/gitignore)

In [1]:
import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

In [2]:
# Start the Browser
# Headless mode

options = Options()
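# Run Chrome without a window; the options.headless setter is deprecated in recent Selenium 4 releases
options.add_argument('--headless')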
options.add_argument('--disable-blink-features=AutomationControlled')

# Service wraps the path to the local ChromeDriver executable; adjust the path for your machine
service = Service('\\Users\\nunto\\drivers\\chromedriver.exe')
driver = webdriver.Chrome(options=options, service=service)
driver.get('https://www.youtube.com/channel/UCOl_a9rl1FykCd3ZO0yN6uQ')


#print(driver.page_source)
#driver.quit()

- Open https://www.youtube.com/
- Right-click on the page and select Inspect

[Locating Elements](https://selenium-python.readthedocs.io/locating-elements.html)

Locate a single element on the page using find_element, which returns the first match

In [3]:
from selenium.webdriver.common.by import By
from selenium.common.exceptions import ElementClickInterceptedException
driver.find_element(By.TAG_NAME, 'script')

<selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="0a57c8e7-8a7c-4b13-94be-06356aca55a8")>

Let's find all the videos on the Hack for LA YouTube page, starting with every anchor tag

In [4]:
links= driver.find_elements(By.TAG_NAME, 'a')

find_elements returns a list of all the web elements that match the locator value

In [5]:
links[:10]

[<selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="1ab46784-ae05-4486-8dc2-b26fbeae649c")>,
 <selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="6c339d22-c35f-471b-83da-bc15ec1749e2")>,
 <selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="99987541-e00e-48f1-84ac-7662141ce945")>,
 <selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="1f083c28-f907-4112-9e37-eba12bbf7a77")>,
 <selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="ecafd494-bf81-4e42-8d73-8c945cd2d25f")>,
 <selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="23d7bd7d-663a-46f2-b3b4-21fd707e09c1")>,
 <selenium.webdriver.remote.webelement.WebElement (session="d6228f7c83f756672e3834eef253a690", element="26c57355-510b-4f8a-9ffa-77

Create a list of possible video links, then put them in a dataframe.

In [6]:
# get a list of links
possible_links= [link.get_attribute('href') for link in links]

In [7]:
possible_links[:20]

[None,
 None,
 'https://www.youtube.com/',
 None,
 None,
 None,
 None,
 'https://accounts.google.com/ServiceLogin?service=youtube&uilel=3&passive=true&continue=https%3A%2F%2Fwww.youtube.com%2Fsignin%3Faction_handle_signin%3Dtrue%26app%3Ddesktop%26hl%3Den%26next%3Dhttps%253A%252F%252Fwww.youtube.com%252Fchannel%252FUCOl_a9rl1FykCd3ZO0yN6uQ&hl=en&ec=65620',
 'https://www.youtube.com/',
 'https://www.youtube.com/',
 'https://www.youtube.com/feed/explore',
 None,
 'https://www.youtube.com/feed/subscriptions',
 'https://www.youtube.com/feed/library',
 'https://www.youtube.com/feed/history',
 'https://www.youtube.com/channel/null',
 None,
 'https://www.youtube.com/redirect?event=channel_banner&redir_token=QUFFLUhqbFlmMDJaUEpfM2VodEF4U1F1VzJ3eS1NVWNMUXxBQ3Jtc0ttcU1wV3pRbEdaR2h0NUc1NE5NdjRCcVFER2Y5bUhuTE5iYXNwZ2NBT05BbjRhbjBLdGZwWlJWYzg5MDlQQmlOZzdYY2lxTHN5bWw1SkNFcnBpRXNFRXVaVGQ0RmdNTkZoYWxYUDA1bHoxZnlqTmVuYw&q=hackforla.org',
 'https://www.youtube.com/redirect?event=channel_banner&redir_toke

In [8]:
# get rid of the None values (anchors that have no href)
possible_links= list(filter(None, possible_links))
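One detail worth knowing about `filter(None, ...)`: it drops every falsy value, not only `None`. The short sketch below uses a made-up `hrefs` list to compare it with an explicit `is not None` check.

```python
# Made-up sample of href values, including None and an empty string
hrefs = [None, 'https://www.youtube.com/', '', None,
         'https://www.youtube.com/feed/explore']

# filter(None, ...) keeps only truthy items, so '' is removed along with None
cleaned = list(filter(None, hrefs))

# An explicit check removes None but keeps the empty string
only_not_none = [h for h in hrefs if h is not None]

print(cleaned)
print(only_not_none)
```

For this scraping task the stricter behaviour is usually what you want, since an empty href is no more useful than a missing one.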

Pick out the tutorial videos by keeping only the URLs that contain 'watch?v'

In [9]:
# keep only video links; note this rebinds links from web elements to URL strings
links = [link for link in possible_links if 'watch?v' in link]

In [10]:
links

['https://www.youtube.com/watch?v=gM8ZTktaFmI&list=UUOl_a9rl1FykCd3ZO0yN6uQ',
 'https://www.youtube.com/watch?v=gM8ZTktaFmI',
 'https://www.youtube.com/watch?v=gM8ZTktaFmI',
 'https://www.youtube.com/watch?v=3NjJ3RXfLvQ',
 'https://www.youtube.com/watch?v=3NjJ3RXfLvQ',
 'https://www.youtube.com/watch?v=g04XJCspuJ4',
 'https://www.youtube.com/watch?v=g04XJCspuJ4',
 'https://www.youtube.com/watch?v=zAfOKQR_Sfc',
 'https://www.youtube.com/watch?v=zAfOKQR_Sfc',
 'https://www.youtube.com/watch?v=NRgztzW0zmM',
 'https://www.youtube.com/watch?v=NRgztzW0zmM',
 'https://www.youtube.com/watch?v=x9wBtYs9RnM',
 'https://www.youtube.com/watch?v=x9wBtYs9RnM']
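The output above lists each video more than once, and the first entry carries an extra `list` parameter. A possible cleanup, sketched with the standard library only, is to key each link on its `v` query parameter. The helper names `video_id` and `unique_videos` are hypothetical, and `sample` is a short excerpt of the output above.

```python
from urllib.parse import urlparse, parse_qs


def video_id(url):
    """Return the value of the `v` query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get('v')
    return values[0] if values else None


def unique_videos(urls):
    """Return one canonical watch URL per video ID, in first-seen order."""
    seen = {}
    for url in urls:
        vid = video_id(url)
        if vid and vid not in seen:
            seen[vid] = f'https://www.youtube.com/watch?v={vid}'
    return list(seen.values())


sample = [
    'https://www.youtube.com/watch?v=gM8ZTktaFmI&list=UUOl_a9rl1FykCd3ZO0yN6uQ',
    'https://www.youtube.com/watch?v=gM8ZTktaFmI',
    'https://www.youtube.com/watch?v=3NjJ3RXfLvQ',
    'https://www.youtube.com/watch?v=3NjJ3RXfLvQ',
]

unique_links = unique_videos(sample)
print(unique_links)
```

Passing the full `links` list to `unique_videos` before building the dataframe would give one row per video instead of two or three.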

Add the video links to a dataframe

In [11]:
videodf=pd.DataFrame(links, columns=['tutorial_videos'])

In [12]:
videodf

|  | tutorial_videos |
| -------- | ----------- |
| 0 | https://www.youtube.com/watch?v=gM8ZTktaFmI&li... |
| 1 | https://www.youtube.com/watch?v=gM8ZTktaFmI |
| 2 | https://www.youtube.com/watch?v=gM8ZTktaFmI |
| 3 | https://www.youtube.com/watch?v=3NjJ3RXfLvQ |
| 4 | https://www.youtube.com/watch?v=3NjJ3RXfLvQ |
| 5 | https://www.youtube.com/watch?v=g04XJCspuJ4 |
| 6 | https://www.youtube.com/watch?v=g04XJCspuJ4 |
| 7 | https://www.youtube.com/watch?v=zAfOKQR_Sfc |
| 8 | https://www.youtube.com/watch?v=zAfOKQR_Sfc |
| 9 | https://www.youtube.com/watch?v=NRgztzW0zmM |
