**<center><font size="40">Web Scraping</font></center>**

# TABLES

## pandas

<code>pd.read_html()</code> - Reads HTML tables into a <code>list</code> of <code>DataFrame</code> objects.

In [21]:
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_" \
    "The_Simpsons_episodes_(seasons_1%E2%80%9320)"

# read html tables into list of df-s
df_list = pd.read_html(io=url)

print(type(df_list))
print(len(df_list))

<class 'list'>
23


In [12]:
df_list[1].head(2)

Unnamed: 0,No.overall,No. inseason,Title,Directed by,Written by,Original air date,Prod.code,U.S. viewers(millions)
0,1,1,"""Simpsons Roasting on an Open Fire""",David Silverman,Mimi Pond,"December 17, 1989",7G08,26.7[46]
1,2,2,"""Bart the Genius""",David Silverman,Jon Vitti,"January 14, 1990",7G02,24.5[46]


The <code>match</code> argument of <code>pd.read_html()</code> (str or compiled regular expression, optional) keeps only the tables containing text that matches it.



In [22]:
df_list = pd.read_html(
    io=url,
    match="Treehouse of Horror III")

df_list[0].head()

Unnamed: 0,No.overall,No. inseason,Title,Directed by,Written by,Original air date,Prod.code,U.S. viewers(millions)
0,60,1,"""Kamp Krusty""",Mark Kirkland,David M. Stern,"September 24, 1992",8F24,21.8[104]
1,61,2,"""A Streetcar Named Marge""",Rich Moore,Jeff Martin,"October 1, 1992",8F18,18.3[105]
2,62,3,"""Homer the Heretic""",Jim Reardon,George Meyer,"October 8, 1992",9F01,19.3[106]
3,63,4,"""Lisa the Beauty Queen""",Mark Kirkland,Jeff Martin,"October 15, 1992",9F02,19[107]
4,64,5,"""Treehouse of Horror III""",Carlos Baeza,Al Jean & Mike ReissJay Kogen & Wallace Woloda...,"October 29, 1992",9F04,25.1[108]
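
Since <code>match</code> also accepts a compiled regular expression, the same filtering can be tried offline on a small HTML snippet. The sketch below is illustrative only: the `html` string is made-up data, not the Wikipedia page, and it assumes an HTML parser backend for pandas (such as `lxml`) is installed.

```python
from io import StringIO
import re

import pandas as pd

# Hypothetical two-table document used instead of a live web page
html = """
<table>
  <tr><th>Title</th><th>Season</th></tr>
  <tr><td>Bart the Genius</td><td>1</td></tr>
</table>
<table>
  <tr><th>Title</th><th>Season</th></tr>
  <tr><td>Treehouse of Horror III</td><td>4</td></tr>
</table>
"""

# Plain string: only tables containing this text are returned
by_text = pd.read_html(io=StringIO(html), match="Treehouse of Horror III")

# Compiled regex: matches any "Treehouse of Horror" followed by a Roman numeral
pattern = re.compile(r"Treehouse of Horror [IVX]+")
by_regex = pd.read_html(io=StringIO(html), match=pattern)

print(len(by_text), len(by_regex))
print(by_regex[0])
```

If no table contains matching text, <code>pd.read_html()</code> raises a <code>ValueError</code> rather than returning an empty list, so wrapping the call in <code>try/except</code> may be useful when the pattern is uncertain.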


## requests & bs4

Selecting a table by its heading

Iterative strategy:
1. Find a table
2. Find the first preceding heading
3. Try matching it with the given heading

Working in the opposite order (heading first, then table) would make it too ambiguous to match a table with its corresponding heading, as headings can be nested.
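
The three steps above can be sketched on a small inline document. This is a minimal illustration, not the notebook's final code: the `html` string, the `HEADING_TAGS` list and the helper name `find_table_by_heading` are all hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a real page
html = """
<h2>Season 1</h2>
<p>Intro text.</p>
<table><tr><td>Bart the Genius</td></tr></table>
<h2>Season 2</h2>
<h3>Notes</h3>
<table><tr><td>Simpson and Delilah</td></tr></table>
"""

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def find_table_by_heading(soup, heading):
    """Return the first table whose nearest preceding sibling heading equals `heading`."""
    for table in soup.find_all("table"):                      # 1. find a table
        previous = table.find_previous_sibling(HEADING_TAGS)  # 2. first preceding heading
        if previous is None:                                  # table has no heading before it
            continue
        if previous.get_text(strip=True) == heading:          # 3. try matching
            return table
    return None


soup = BeautifulSoup(html, "html.parser")
table = find_table_by_heading(soup, "Season 1")
print(table.td.get_text())
```

In this snippet the second table sits under the nested heading "Notes", so searching for "Season 2" finds nothing. That is the ambiguity with nested headings: only the nearest preceding heading is compared.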

In [79]:
import requests
from bs4 import BeautifulSoup as BS
import re

In [44]:
# Creating the response object
response_object = requests.get(url)

# Extracting the text from the webpage
r_html = response_object.text

# Creating the soup object (an explicit parser avoids bs4's "no parser specified" warning)
soup = BS(r_html, "html.parser")

In [162]:
pd.read_html(io=soup.find(name='table').prettify())[0].head(1)

Unnamed: 0_level_0,Season,Episodes,Episodes,Originally aired,Originally aired,Households / viewers,Rank,Rating
Unnamed: 0_level_1,Season,Episodes,Episodes.1,First aired,Last aired,Households / viewers,Rank,Rating
0,1,13,13,"December 17, 1989","May 13, 1990",13.4m h. [n1] [12],30,14.5


In [171]:
#heading variable
heading = "Season 1 (1989–90)"

# finding all table soup objects on url
tables_soup_list = soup.find_all(name='table')

table_df = None
# iterating through tables
for table_soup in tables_soup_list:
    #find preceding heading if possible
    try:
        #find preceding heading -> try to find heading string
        #re.escape() escapes metacharacters in the string
        heading_ = table_soup \
            .find_previous_sibling(name=re.compile(r"h\d+")) \
            .find(string=re.compile(re.escape(heading)))
        
    except AttributeError: #skip Nonetype results
        continue
    
    #if match is found convert html code into DF
    if heading_ == heading:
        table_df = pd.read_html(io=table_soup.prettify())[0]
        break

table_df.head()

Unnamed: 0,No. overall,No. in season,Title,Directed by,Written by,Original air date,Prod. code,U.S. viewers (millions)
0,1,1,""" Simpsons Roasting on an Open Fire """,David Silverman,Mimi Pond,"December 17, 1989",7G08,26.7 [46]
1,2,2,""" Bart the Genius """,David Silverman,Jon Vitti,"January 14, 1990",7G02,24.5 [46]
2,3,3,""" Homer's Odyssey """,Wes Archer,Jay Kogen & Wallace Wolodarsky,"January 21, 1990",7G03,27.5 [47]
3,4,4,""" There's No Disgrace Like Home """,Gregg Vanzo & Kent Butterworth,Al Jean & Mike Reiss,"January 28, 1990",7G04,20.2 [48]
4,5,5,""" Bart the General """,David Silverman,John Swartzwelder,"February 4, 1990",7G05,27.1 [49]


# FILES
Automate saving multiple files from the web.
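
One possible way to do this with `requests` is sketched below. It is only an assumption about how the task could be approached: the helper names `filename_from_url` and `save_files`, the default `downloads` folder and the example URLs are all hypothetical.

```python
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse

import requests


def filename_from_url(url):
    """Derive a local file name from the last path segment of a URL."""
    name = PurePosixPath(unquote(urlparse(url).path)).name
    return name or "download"  # fallback when the URL has no file name


def save_files(urls, out_dir="downloads"):
    """Download each URL into out_dir and return the list of saved paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    saved = []
    for url in urls:
        response = requests.get(url, timeout=30)
        response.raise_for_status()          # stop on 4xx/5xx responses
        target = out_path / filename_from_url(url)
        target.write_bytes(response.content)  # binary-safe for images, PDFs, etc.
        saved.append(target)
    return saved


# Usage with placeholder URLs (not real files):
# save_files([
#     "https://example.com/files/report.pdf",
#     "https://example.com/files/data.csv",
# ])
```

Files that share the same last path segment would overwrite each other in this sketch, so a real script may need to make the names unique.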

# TEXT

## Selenium 4.x
[webdriver_manager](https://github.com/SergeyPirogov/webdriver_manager) is used to install the browser drivers.

### Opening the Browser

In [14]:
from selenium import webdriver
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

#installing drivers for Firefox
driver = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()))

[WDM] - Downloading: 16.2kB [00:00, 6.42MB/s]                   


In [18]:
#go to webpage
url = "https://www.thesun.co.uk/sport/football/"
driver.get(url)

#find all football news containers
containers = driver.find_elements(
    by='xpath', 
    value='//div[@class="teaser__copy-container"]'
    )
#find title, subtitle and hyperlink of the first 5 containers
for container in containers[:5]:

    title = container.find_element(by='xpath', value='./a/h2').text
    sub_title = container.find_element(by='xpath', value='./a/p').text
    href = container.find_element(by='xpath', value='./a').get_attribute("href")

    print(f"{title}\n{sub_title}\n{href}\n")

#close the browser and end the driver session
driver.quit()

BOOT OR BUST
Southgate admits he will be given the boot if England flop at World Cup
https://www.thesun.co.uk/sport/19917002/england-germany-gareth-southgate-world-cup-2022/

CUP FLOG FEAR
Female football fans risk jail or flogging at World Cup if raped in Qatar
https://www.thesun.co.uk/sport/19916698/female-football-fans-jail-world-cup-qatar-rape/

'THE PUSH'
Gareth Southgate casts more doubt on Ivan Toney making England's World Cup squad
https://www.thesun.co.uk/sport/19916726/england-world-cup-squad-ivan-toney/

ROAR-HEEM
Sterling vows to lift wounded Three Lions as Southgate under increasing pressure
https://www.thesun.co.uk/sport/19917138/raheem-sterling-england-germany-southgate-world-cup/

UP NEXT
England could face Kazakhstan, Montenegro and Albania after Nations League drop
https://www.thesun.co.uk/sport/19914624/relegated-england-possible-nations-league-opponents/



### Headless Mode

In [19]:
from selenium import webdriver
from selenium.webdriver.firefox.options import Options #for headless mode
from selenium.webdriver.firefox.service import Service as FirefoxService
from webdriver_manager.firefox import GeckoDriverManager

In [20]:
options = Options()
options.add_argument("-headless") #headless mode (the options.headless setter is deprecated in newer Selenium 4 releases)

driver = webdriver.Firefox(
    service=FirefoxService(GeckoDriverManager().install()), 
    options=options)

driver.get(url)

#find all football news containers
containers = driver.find_elements(
    by='xpath', 
    value='//div[@class="teaser__copy-container"]'
    )

print("Headless script:")

#find title, subtitle and hyperlink of the first 5 containers
for container in containers[:5]:

    title = container.find_element(by='xpath', value='./a/h2').text
    sub_title = container.find_element(by='xpath', value='./a/p').text
    href = container.find_element(by='xpath', value='./a').get_attribute("href")

    print(f"{title}\n{sub_title}\n{href}\n")

#close the browser and end the driver session
driver.quit()

[WDM] - Downloading: 16.2kB [00:00, 4.64MB/s]                   


Headless script:
BOOT OR BUST
Southgate admits he will be given the boot if England flop at World Cup
https://www.thesun.co.uk/sport/19917002/england-germany-gareth-southgate-world-cup-2022/

CUP FLOG FEAR
Female football fans risk jail or flogging at World Cup if raped in Qatar
https://www.thesun.co.uk/sport/19916698/female-football-fans-jail-world-cup-qatar-rape/

LOWERING THE TONE
Southgate casts more doubt on Ivan Toney making England's World Cup squad
https://www.thesun.co.uk/sport/19916726/england-world-cup-squad-ivan-toney/

ROAR-HEEM
Sterling vows to lift wounded Three Lions as Southgate under increasing pressure
https://www.thesun.co.uk/sport/19917138/raheem-sterling-england-germany-southgate-world-cup/

UP NEXT
England could face Kazakhstan, Montenegro and Albania after Nations League drop
https://www.thesun.co.uk/sport/19914624/relegated-england-possible-nations-league-opponents/

