#### Beautiful Soup Web Scraping Tutorial

Followed the tutorial [here](https://www.youtube.com/watch?v=GjKQ6V_ViQE&ab_channel=KeithGalli) by Keith Galli <br>
BS documentation [here](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) <br>
CSS Selector Reference [here](https://www.w3schools.com/cssref/css_selectors.asp)

In [2]:
import requests 
from bs4 import BeautifulSoup as bs

In [3]:
# Load WebPage
r = requests.get("https://keithgalli.github.io/web-scraping/example.html")

# Convert to a bs object
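# Name the parser explicitly so results do not depend on which parsers are installed
soup = bs(r.content, "html.parser")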

#print(soup.prettify())

### find & find all

In [4]:
first_header = soup.find('h2')
headers = soup.find_all('h2')

In [5]:
# Pass in a list of elements to look for
# first_header gets the first element of the qualified items
first_header = soup.find(["h1", "h2"])
headers = soup.find_all(["h1", "h2"])


In [6]:
# pass in attributes to the find/find_all function
# the code below finds the paragraph with a specific id
paragraph = soup.find_all("p", attrs={"id": "paragraph-id"})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]

In [7]:
# find/find_all calls can be nested
# body holds the whole <body> tag; calling find on body searches only inside it,
# so div ends up holding the first <div> within the body
body = soup.find('body')
div = body.find('div')
div

<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>

In [8]:
import re

# search for specific strings in find/find_all calls

# string= with a plain str only matches that exact text
specific_string = soup.find_all("p", string="Some bold text")

# combine it with a regex to find headers whose text contains a certain pattern
# re.compile builds a re.Pattern object
matching_headers = soup.find_all("h2", string=re.compile("(H|h)eader"))
matching_headers

[<h2>A Header</h2>, <h2>Another header</h2>]
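One detail worth keeping in mind from this section is what the two functions return when nothing matches. The sketch below uses a small made-up HTML string (not the tutorial page) so it runs without a network call; `doc`, `missing`, and the other variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this sketch (not the tutorial page)
html = "<body><h2>A Header</h2><p>Some text</p><h2>Another header</h2></body>"
doc = BeautifulSoup(html, "html.parser")

# find returns the first matching tag, or None when nothing matches
first_h2 = doc.find("h2")
missing = doc.find("h3")

# find_all always returns a list-like result, which is empty when nothing matches
all_h2 = doc.find_all("h2")
no_h3 = doc.find_all("h3")

print(first_h2.string)             # A Header
print(missing)                     # None
print([h.string for h in all_h2])  # ['A Header', 'Another header']
print(len(no_h3))                  # 0
```

Because `find` can return `None`, chaining another call onto its result (as in the nested `body.find('div')` example) raises an `AttributeError` if the first lookup fails, so a `None` check is a reasonable habit on pages you do not control.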

### selector (CSS Selector)

`select` takes a CSS selector. It is most useful when the elements you want sit at a known path in the document, for example a tag nested inside another tag.

In [9]:
#print(soup.body.prettify())

In [10]:
content = soup.select('p')
print(content)
# select the h1 inside div
content = soup.select('div h1')
print(content)
# select every paragraph that is a later sibling of an h2 (use 'h2 + p' for only the paragraph directly after an h2)
paragraphs = soup.select('h2 ~ p')
print(paragraphs)
# select bold text within a paragraph that has a specific id
bold_text = soup.select('p#paragraph-id b')
print(bold_text)


[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<h1>HTML Webpage</h1>]
[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<b>Some bold text</b>]
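The combinators used above are easy to mix up, so here is a minimal side-by-side sketch. The HTML is made up for illustration and is not the tutorial page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this sketch
html = """
<body>
  <h2>First</h2>
  <p>one</p>
  <p>two</p>
  <div><p>nested</p></div>
</body>
"""
doc = BeautifulSoup(html, "html.parser")

# 'a ~ b': every b that is a later sibling of a
general = [p.get_text() for p in doc.select("h2 ~ p")]
# 'a + b': only the b that comes directly after a
adjacent = [p.get_text() for p in doc.select("h2 + p")]
# 'a > b': b elements that are direct children of a
children = [p.get_text() for p in doc.select("body > p")]
# 'a b': b elements at any depth inside a
descendants = [p.get_text() for p in doc.select("body p")]

print(general)      # ['one', 'two']
print(adjacent)     # ['one']
print(children)     # ['one', 'two']
print(descendants)  # ['one', 'two', 'nested']
```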


In [11]:
# Nested calls
# select paragraphs that are direct children of the body
paragraphs = soup.select("body > p")
print(paragraphs)

for paragraph in paragraphs:
    print(paragraph.select("i"))

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]


In [12]:
# Grab elements by a specific attribute value
soup.select("[align=middle]")

[<div align="middle">
 <h1>HTML Webpage</h1>
 <p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
 </div>]

### Get different properties of the HTML

string, get_text, link


In [30]:
header = soup.find('h2')
print(header.string)

# If a tag has multiple child elements, use get_text
# .string returns None here because the div has more than one child (the h1 and the p),
# so bs cannot tell which child's text it should return
div = soup.find("div")
print(div.prettify())
print(div.get_text())


A Header
<div align="middle">
 <h1>
  HTML Webpage
 </h1>
 <p>
  Link to more interesting example:
  <a href="https://keithgalli.github.io/web-scraping/webpage.html">
   keithgalli.github.io/web-scraping/webpage.html
  </a>
 </p>
</div>


HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
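To make the `.string` versus `get_text` difference concrete, here is a small sketch on a made-up snippet shaped like the div above. The `separator` and `strip` arguments to `get_text` are optional extras that help when the joined text needs tidying.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this sketch
html = '<div><h1>Title</h1><p>Link: <a href="#">here</a></p></div>'
doc = BeautifulSoup(html, "html.parser")

# h1 has a single text child, so .string works
print(doc.h1.string)  # Title

# div has two children, so .string is None
print(doc.div.string)  # None

# get_text joins every piece of text under the tag
print(doc.div.get_text())  # TitleLink: here

# separator and strip tidy up the joined text
clean = doc.div.get_text(separator=" ", strip=True)
print(clean)  # Title Link: here
```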



In [14]:
# get link
link = soup.find("a")
link['href']

paragraphs = soup.select('p#paragraph-id')
print(paragraphs)
paragraphs[0]['id']

[<p id="paragraph-id"><b>Some bold text</b></p>]


'paragraph-id'
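Indexing a tag like a dictionary raises a `KeyError` when the attribute is missing, which matters once you loop over many tags. A minimal sketch, using made-up HTML and an example.com URL that is only a placeholder:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this sketch
html = '<a class="social instagram" href="https://example.com/profile">Profile</a><a>No link</a>'
doc = BeautifulSoup(html, "html.parser")
first, second = doc.find_all("a")

# Dictionary-style access works when the attribute exists
print(first["href"])   # https://example.com/profile

# class can hold several values, so it comes back as a list
print(first["class"])  # ['social', 'instagram']

# .get returns None (or a default) instead of raising when the attribute is missing
print(second.get("href"))          # None
print(second.get("href", "n/a"))   # n/a

# Dictionary-style access on a missing attribute raises KeyError
try:
    second["href"]
    raised = False
except KeyError:
    raised = True
print(raised)  # True
```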

### Code Navigation

path syntax
terms: parent, sibling, child


In [15]:
# path syntax
soup.body.div.h1.string

'HTML Webpage'

In [16]:
# Parent, sibling, child
soup.body.h2.find_next_siblings()

[<p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
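The parent, sibling, and child terms map onto a handful of attributes and methods. The sketch below walks a made-up snippet in each direction; the HTML is written without whitespace between tags so that `.children` yields only tags.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this sketch
html = "<body><h2>A</h2><p>first</p><h2>B</h2><p>second</p></body>"
doc = BeautifulSoup(html, "html.parser")
h2 = doc.body.h2

# parent: the tag that directly contains this one
print(h2.parent.name)  # body

# sibling: the next tag at the same level, optionally filtered by name
print(h2.find_next_sibling().get_text())  # first
print(h2.find_next_sibling("h2").string)  # B

# child: the tags directly inside body
child_names = [child.name for child in doc.body.children]
print(child_names)  # ['h2', 'p', 'h2', 'p']

# Walking backwards: the closest h2 before the paragraph "second"
owner = doc.find("p", string="second").find_previous_sibling("h2")
print(owner.string)  # B
```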

### Exercises

webpage: https://keithgalli.github.io/web-scraping/webpage.html


In [17]:
r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")

webpage = bs(r.content, "html.parser")
#print(webpage.prettify())

##### Grab all of the social weblinks from the page

In [18]:
# method 1
links_with_tags = webpage.select("ul.socials a")

links = [link["href"] for link in links_with_tags]
links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [19]:
# method 2
ul = webpage.find("ul", attrs={"class": "socials"})
# search for the links inside the ul found above
links_with_tags = ul.find_all("a")
links = [link["href"] for link in links_with_tags]
links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [20]:
# method 3 
links_with_tags = webpage.select("li.social a")
links_with_tags
links = [link["href"] for link in links_with_tags]
links


['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

#### explanation for method 3

In method 3, each social list item carries two classes separated by a space, for example `class="social instagram"`. The space does not make one long class name: the Instagram item belongs to both the `social` class and the `instagram` class, and a CSS selector can match on either one (or both) with the `.` operator.

In other words, to select the Instagram list item we can use either ```webpage.select("li.instagram")``` or ```webpage.select("li.social.instagram")```. The selector `li.social` matches every social list item.
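A quick way to check this behaviour is to count the matches for each selector on a small made-up list. The HTML and the example.com links below are placeholders, not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this sketch
html = """
<ul class="socials">
  <li class="social instagram"><a href="https://example.com/ig">IG</a></li>
  <li class="social twitter"><a href="https://example.com/tw">TW</a></li>
</ul>
"""
doc = BeautifulSoup(html, "html.parser")

# Either class on its own matches, and chaining requires both
print(len(doc.select("li.social")))            # 2
print(len(doc.select("li.instagram")))         # 1
print(len(doc.select("li.social.instagram")))  # 1

# find_all with class_ also matches a tag that has the class among several
print(len(doc.find_all("li", class_="social")))  # 2
```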

### Scrape the Table

In [31]:
import pandas as pd

table = webpage.select("table.hockey-stats")[0]
columns = table.find("thead").find_all('th')
columns_names = [c.string for c in columns]

table_rows = table.find('tbody').find_all('tr')
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    # get_text() already returns a str; strip the surrounding whitespace from each cell
    row = [cell.get_text().strip() for cell in cells]
    rows.append(row)

df = pd.DataFrame(rows, columns=columns_names)
df
    

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,
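The same header-plus-rows pattern also works without pandas. Here is a minimal sketch that turns each row into a dict keyed by the column names; the table, its `stats` class, and its values are invented sample data, not the hockey table from the page.

```python
from bs4 import BeautifulSoup

# Hypothetical table with invented sample values
html = """
<table class="stats">
  <thead><tr><th>Season</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>Year 1</td><td> Team A </td><td>10</td></tr>
    <tr><td>Year 2</td><td> Team B </td><td>12</td></tr>
  </tbody>
</table>
"""
doc = BeautifulSoup(html, "html.parser")
table = doc.select_one("table.stats")

# Column names come from the header cells
headers = [th.get_text(strip=True) for th in table.select("thead th")]

# Pair each cell with its column name, one dict per body row
records = [
    dict(zip(headers, (td.get_text(strip=True) for td in tr.find_all("td"))))
    for tr in table.select("tbody tr")
]

print(headers)     # ['Season', 'Team', 'GP']
print(records[0])  # {'Season': 'Year 1', 'Team': 'Team A', 'GP': '10'}
```

A list of dicts like `records` can be passed straight to `pd.DataFrame` if a DataFrame is wanted later. Note that every value is still a string at this point.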


In [48]:
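# Returns [] here: with a tag name, string= is compared against each tag's .string,
# which is None for a list item that contains nested tags (presumably the case on this page)
webpage.find_all("li",string=re.compile("video"))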

[]

In [79]:
fun_facts = webpage.select("ul.fun-facts li")
facts_with_is = [fact.find(string=re.compile("is")) for fact in fun_facts]

facts_with_is = [fact.find_parent().get_text() for fact in facts_with_is if fact]
facts_with_is

['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]
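An alternative to the find-then-`find_parent` approach is to filter the list items on their full text. The sketch below, on made-up list items, shows why `string=` misses items with nested tags while a `get_text()` filter does not.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML, made up for this sketch
html = """
<ul class="fun-facts">
  <li>Plain text fact about video games</li>
  <li>Favorite <i>video</i> game is chess</li>
  <li>Likes hiking</li>
</ul>
"""
doc = BeautifulSoup(html, "html.parser")
pattern = re.compile("video")

# string= only matches a tag whose .string is set, i.e. a tag with a single text child
by_string = doc.find_all("li", string=pattern)

# Filtering on get_text() also sees text that sits inside nested tags
by_text = [li for li in doc.select("ul.fun-facts li") if pattern.search(li.get_text())]

print(len(by_string))  # 1
print(len(by_text))    # 2
```

For the "is" search above, a pattern with word boundaries such as `re.compile(r"\bis\b")` would avoid matching words that merely contain those letters.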

### Download the images

In [113]:
img = webpage.select("body > div img")
paths = [i["src"] for i in img]
# take the last path segment as the file name, whatever the folder depth
file_names = [p.split('/')[-1] for p in paths]
paths

['images/italy/lake_como.jpg',
 'images/italy/pontevecchio.jpg',
 'images/italy/riomaggiore.jpg']

In [103]:
import os
os.getcwd()

'/Users/leksa/Documents/github_projects/web-scraping-academic/web-scraper'

In [114]:
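og_webpage = 'https://keithgalli.github.io/web-scraping/'
full_paths = [og_webpage + p for p in paths]

# Create the output folder once, before the loop
os.makedirs("images", exist_ok=True)

# Pair each file name with its URL instead of tracking an index by hand
for file_name, fp in zip(file_names, full_paths):
    image_bytes = requests.get(fp).content
    # A relative path is resolved against the current working directory
    with open(os.path.join("images", file_name), 'wb') as f:
        f.write(image_bytes)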
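String concatenation works here because every `src` is relative to the same folder. As a more general sketch, `urllib.parse.urljoin` from the standard library resolves a link against the URL of the page it came from. No request is made below; the example.com URL and `/favicon.ico` path are placeholders.

```python
import os
from urllib.parse import urljoin, urlparse

# URL of the page the links were scraped from
page_url = "https://keithgalli.github.io/web-scraping/webpage.html"

# A relative src replaces the last path segment of the page URL
full_url = urljoin(page_url, "images/italy/lake_como.jpg")
print(full_url)  # https://keithgalli.github.io/web-scraping/images/italy/lake_como.jpg

# A root-relative link is resolved against the site root
root_url = urljoin(page_url, "/favicon.ico")
print(root_url)  # https://keithgalli.github.io/favicon.ico

# An absolute link is returned unchanged (placeholder URL)
absolute = urljoin(page_url, "https://example.com/a.png")
print(absolute)  # https://example.com/a.png

# The file name is the last segment of the URL path
file_name = os.path.basename(urlparse(full_url).path)
print(file_name)  # lake_como.jpg
```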

### Mystery Challenge

In [120]:
# Collect the relative links to the challenge files, then make them absolute
f_paths = [a["href"] for a in webpage.select("div.block > ul > li a")]
full_f_paths = [og_webpage + p for p in f_paths]
full_f_paths

['https://keithgalli.github.io/web-scraping/challenge/file_1.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_2.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_3.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_4.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_5.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_6.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_7.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_8.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_9.html',
 'https://keithgalli.github.io/web-scraping/challenge/file_10.html']

In [132]:
# Try the first file on its own before looping over all of them
r = requests.get(full_f_paths[0]).content
f1 = bs(r, "html.parser")
a = f1.select("p#secret-word")
a[0].get_text()

'Make'

In [135]:
msg = []
for url in full_f_paths:
    html = requests.get(url).content
    page = bs(html, "html.parser")
    # Each challenge page keeps its word in <p id="secret-word">
    for tag in page.select("p#secret-word"):
        msg.append(tag.get_text())


In [141]:
" ".join(msg)

'Make sure to smash that like button and subscribe !!!'
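The per-page step can also be pulled into a small function, which makes it testable without any network access. This is a sketch only: `extract_secret_word` is a hypothetical helper name, and the pages are made-up strings standing in for the downloaded files.

```python
from bs4 import BeautifulSoup


def extract_secret_word(html):
    """Return the text of p#secret-word, or None if the page has no such tag."""
    doc = BeautifulSoup(html, "html.parser")
    tag = doc.select_one("p#secret-word")
    return tag.get_text(strip=True) if tag is not None else None


# Hypothetical page contents, made up for this sketch
pages = [
    '<html><body><p id="secret-word">Hello</p></body></html>',
    '<html><body><p>no secret here</p></body></html>',
    '<html><body><p id="secret-word"> world </p></body></html>',
]

# Skip pages without a secret word, then join the rest in order
words = [w for w in (extract_secret_word(p) for p in pages) if w is not None]
message = " ".join(words)
print(message)  # Hello world
```

`select_one` returns the first match or `None`, so a page without the tag is skipped rather than raising an error.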