# Web Scraping 101: What you need to know and how to scrape with Python & Selenium Webdriver

**Link:** [Web Scraping 101: What you need to know and how to scrape with Python & Selenium Webdriver](http://www.byperth.com/2018/04/25/guide-web-scraping-101-what-you-need-to-know-and-how-to-scrape-with-python-selenium-webdriver/)

## Step 1) Import Selenium Webdriver & Test

In [5]:
# Import libraries
from selenium import webdriver # allow launching browser
from selenium.webdriver.common.by import By # allow search with parameters
from selenium.webdriver.support.ui import WebDriverWait # allow waiting for page to load
from selenium.webdriver.support import expected_conditions as EC # determine whether the web page has loaded
from selenium.common.exceptions import TimeoutException # handling timeout situation

In [64]:
# Prepare the code for easily opening new browser window
driver_option = webdriver.ChromeOptions()
driver_option.add_argument("--incognito") # open Chrome in incognito mode
chromedriver_path = '/home/igor/Estudos/Ciencia de Dados e Big Data/RI/chromedriver_linux'

def create_webdriver():
    return webdriver.Chrome(executable_path=chromedriver_path, options = driver_option)

## Step 2) Open the Github page & Extract the HTML elements we need

In [19]:
# Open the website
browser = create_webdriver()
browser.get("https://github.com/collections/machine-learning")

In [56]:
# Extract all projects
projects = browser.find_elements(By.XPATH, "//h1/a") # find_elements_by_xpath was removed in Selenium 4

In [60]:
# Extract information for each project
project_list = {}
for proj in projects:
    proj_name = proj.text # Project name
    proj_url = proj.get_attribute('href')
    project_list[proj_name] = proj_url
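
Because `project_list` is a dict keyed by project name, two links with the same text would collapse into one entry. A minimal sketch of an alternative, assuming a list of records is acceptable downstream, is shown below. The helper name `extract_projects` is hypothetical, and it relies only on the `.text` and `.get_attribute('href')` calls already used above.

```python
def extract_projects(elements):
    """Collect one record per link element, keeping entries that share a name."""
    records = []
    for element in elements:
        name = element.text.strip()
        url = element.get_attribute("href")
        if not name or not url:
            continue  # skip anchors with no visible text or no link target
        records.append({"project_name": name, "project_url": url})
    return records
```

Calling `extract_projects(projects)` would return a list of dicts, which `pd.DataFrame` accepts directly.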

In [63]:
# Close connection
browser.quit()

## Step 3) Save the data to CSV using Pandas

In [66]:
import pandas as pd

In [135]:
project_df = pd.DataFrame.from_dict(project_list, orient='index')
project_df

| | 0 |
|---|---|
| apache / spark | https://github.com/apache/spark |
| apache / hadoop | https://github.com/apache/hadoop |
| jbhuang0604 / awesome-computer-vision | https://github.com/jbhuang0604/awesome-compute... |
| GSA / data | https://github.com/GSA/data |
| GoogleTrends / data | https://github.com/GoogleTrends/data |
| nationalparkservice / data | https://github.com/nationalparkservice/data |
| fivethirtyeight / data | https://github.com/fivethirtyeight/data |
| beamandrew / medical-data | https://github.com/beamandrew/medical-data |
| src-d / awesome-machine-learning-on-source-code | https://github.com/src-d/awesome-machine-learn... |
| igrigorik / decisiontree | https://github.com/igrigorik/decisiontree |


In [136]:
project_df['project_name'] = project_df.index
project_df.columns = ['project_url','project_name']

In [138]:
project_df.reset_index(drop = True, inplace=True)
project_df

| | project_url | project_name |
|---|---|---|
| 0 | https://github.com/apache/spark | apache / spark |
| 1 | https://github.com/apache/hadoop | apache / hadoop |
| 2 | https://github.com/jbhuang0604/awesome-compute... | jbhuang0604 / awesome-computer-vision |
| 3 | https://github.com/GSA/data | GSA / data |
| 4 | https://github.com/GoogleTrends/data | GoogleTrends / data |
| 5 | https://github.com/nationalparkservice/data | nationalparkservice / data |
| 6 | https://github.com/fivethirtyeight/data | fivethirtyeight / data |
| 7 | https://github.com/beamandrew/medical-data | beamandrew / medical-data |
| 8 | https://github.com/src-d/awesome-machine-learn... | src-d / awesome-machine-learning-on-source-code |
| 9 | https://github.com/igrigorik/decisiontree | igrigorik / decisiontree |
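
The three cells above (build from the dict, copy the index into a column, reset the index) can likely be collapsed into a single construction. The sketch below is one way to do that, assuming the same `{name: url}` dict shape. The function name `projects_to_frame` is hypothetical, and the sample dict reuses two rows from the output above.

```python
import pandas as pd


def projects_to_frame(project_list):
    """Build a DataFrame with project_url and project_name columns."""
    frame = pd.DataFrame(
        list(project_list.items()),
        columns=["project_name", "project_url"],
    )
    # Match the column order used in the notebook output.
    return frame[["project_url", "project_name"]]


# Two entries taken from the scraped output, for illustration only.
example = {
    "apache / spark": "https://github.com/apache/spark",
    "apache / hadoop": "https://github.com/apache/hadoop",
}
example_df = projects_to_frame(example)
```

The result already has a default integer index, so no `reset_index` call is needed.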


In [143]:
# Export project dataframe to CSV
project_df.to_csv('project_list.csv', index= False)
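
For a dependency-free variant of the export, the standard library `csv` module can write the same two columns. This is a sketch under the assumption that the data is still in the `{name: url}` dict form. The function name `write_projects_csv` is hypothetical and this is not the approach the notebook itself uses.

```python
import csv


def write_projects_csv(project_list, path):
    """Write a {name: url} dict to `path` as project_url,project_name rows."""
    # newline="" lets the csv module control line endings on every platform.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["project_url", "project_name"])
        for name, url in project_list.items():
            writer.writerow([url, name])
```

Calling `write_projects_csv(project_list, 'project_list.csv')` would produce a file with the same header as the pandas export.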

## [Tip] Speed up web scraping with parallelization

In [None]:
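from concurrent.futures import ProcessPoolExecutor
import concurrent.futures

def scrape_url(url):
    # Each worker opens its own browser; a webdriver cannot be shared across processes
    new_browser = create_webdriver()
    new_browser.get(url)

    # Extract required data here
    # ...
    data = None # placeholder: replace with the extracted values

    new_browser.quit()

    return data

# urlarray must be defined beforehand as a list of URLs to scrape
with ProcessPoolExecutor(max_workers=4) as executor:
    future_results = {executor.submit(scrape_url, url) for url in urlarray}

    # Collect each result as soon as its worker finishes
    results = []
    for future in concurrent.futures.as_completed(future_results):
        results.append(future.result())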
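
The tip above can also be sketched with threads. The version below swaps in `ThreadPoolExecutor` for `ProcessPoolExecutor`, on the assumption that each worker spends most of its time waiting on the browser rather than computing. It also records failures per URL so that one bad page does not abort the batch. The names `scrape_many` and `scrape_one` are hypothetical; `scrape_one` stands for any function shaped like `scrape_url`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_many(urls, scrape_one, max_workers=4):
    """Run `scrape_one(url)` for every URL in parallel.

    Returns (results, errors): two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its URL so results can be labeled.
        future_to_url = {executor.submit(scrape_one, url): url for url in urls}
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going if a single page fails
                errors[url] = exc
    return results, errors
```

A call such as `results, errors = scrape_many(urlarray, scrape_url)` would return the scraped data and any exceptions separately.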

