##### Keith Galli's Beautiful Soup Tutorial
Link: https://www.youtube.com/watch?v=GjKQ6V_ViQE \


What is Web Scraping?\
Most websites are built from HTML and CSS. The goal of web scraping is to read through that HTML and CSS source code and collect the information we want. For example: go to a YouTube channel page, scrape the titles of that person's videos, and load them into a dataframe.

Simple Definitions:

HTML: Stands for "Hypertext Markup Language." HTML is the language used to create webpages. "Hypertext" refers to the hyperlinks that an HTML page may contain. "Markup language" refers to the way tags are used to define the page layout and the elements within the page: headers, paragraphs, the body, etc.

CSS: Stands for "Cascading Style Sheets." Cascading style sheets are used to format the layout of web pages. They can be used to define text styles, table sizes, and other aspects of web pages that previously could only be defined in a page's HTML.

#### Loading in the Necessary Libraries

In [1]:
## Import libraries

import requests # to load webpages
from bs4 import BeautifulSoup as bs

#### Loading our first page

In [2]:
# load the webpage content
r = requests.get('https://keithgalli.github.io/web-scraping/example.html')

# Convert to a Beautiful Soup object

soup = bs(r.content) # parses the HTML returned by the request above

# Print out our HTML

print(soup.prettify()) # prints out the HTML; .prettify() adds indentation

<html>
 <head>
  <title>
   HTML Example
  </title>
 </head>
 <body>
  <div align="middle">
   <h1>
    HTML Webpage
   </h1>
   <p>
    Link to more interesting example:
    <a href="https://keithgalli.github.io/web-scraping/webpage.html">
     keithgalli.github.io/web-scraping/webpage.html
    </a>
   </p>
  </div>
  <h2>
   A Header
  </h2>
  <p>
   <i>
    Some italicized text
   </i>
  </p>
  <h2>
   Another header
  </h2>
  <p id="paragraph-id">
   <b>
    Some bold text
   </b>
  </p>
 </body>
</html>
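
The call `bs(r.content)` leaves the choice of parser to Beautiful Soup, which depends on what is installed. A minimal sketch of naming the parser explicitly, assuming the built-in `html.parser` is good enough for this page; the inline `html` string and the `soup_demo` name are stand-ins so the snippet runs without a network call:

```python
from bs4 import BeautifulSoup as bs

# Inline stand-in for r.content (an assumption, so the sketch runs offline)
html = "<html><head><title>HTML Example</title></head><body><h1>HTML Webpage</h1></body></html>"

# Naming the parser keeps the result independent of which optional parsers are installed
soup_demo = bs(html, "html.parser")

page_title = soup_demo.title.string
print(page_title)
```

With a live request, calling `r.raise_for_status()` before parsing is one common way to stop early on an HTTP error instead of scraping an error page.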



#### Start using Beautiful Soup to scrape

In [3]:
# find and find_all

first_header = soup.find('h2') # .find('tag') returns the first matching element
print(first_header)

headers = soup.find_all('h2') # .find_all('tag') returns a list of every matching element
print(headers)

<h2>A Header</h2>
[<h2>A Header</h2>, <h2>Another header</h2>]
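
One difference between the two methods shows up when nothing matches: `find` returns `None`, while `find_all` returns an empty list. A small sketch on an inline snippet (the `html` string and variable names are made up for illustration), which also shows the `limit` argument of `find_all`:

```python
from bs4 import BeautifulSoup as bs

html = "<body><h1>HTML Webpage</h1><h2>A Header</h2><h2>Another header</h2></body>"
soup_demo = bs(html, "html.parser")

missing = soup_demo.find("h3")                   # no <h3> in the snippet, so this is None
empty = soup_demo.find_all("h3")                 # find_all gives an empty list instead
first_only = soup_demo.find_all("h2", limit=1)   # stop after the first match

# Guard against None before reading properties of a find() result
missing_text = missing.get_text() if missing is not None else ""
print(missing, len(empty), first_only, repr(missing_text))
```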


In [4]:
# Pass in a list of elements to look for
first_header = soup.find(['h2','h1']) # returns the first element in the page that matches any tag in the list
print(first_header)

headers = soup.find_all(['h2','h1']) # returns a list of every element that matches any tag in the list
print(headers)

<h1>HTML Webpage</h1>
[<h1>HTML Webpage</h1>, <h2>A Header</h2>, <h2>Another header</h2>]


In [5]:
# Passing in attributes to the find/find_all function
paragraph = soup.find_all('p', attrs = {'id': 'paragraph-id'})
paragraph

[<p id="paragraph-id"><b>Some bold text</b></p>]
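
Attribute filters can also be written as keyword arguments. A short sketch, assuming an inline snippet where the `note` class is a made-up example that does not appear on the tutorial page:

```python
from bs4 import BeautifulSoup as bs

html = '<p id="paragraph-id"><b>Some bold text</b></p><p class="note">A note</p>'
soup_demo = bs(html, "html.parser")

by_attrs = soup_demo.find_all("p", attrs={"id": "paragraph-id"})
by_keyword = soup_demo.find_all("p", id="paragraph-id")  # same filter as a keyword argument
by_class = soup_demo.find_all("p", class_="note")         # class_ because class is a reserved word in Python

print(by_attrs, by_keyword, by_class)
```

The `attrs` dictionary remains useful for attribute names that are not valid Python identifiers, such as names containing a hyphen.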

In [6]:
# Nest find/find_all calls to narrow down a search

body = soup.find('body') # narrows html code down just to a body section
print(body)
print("-----------")
div = body.find('div') # further narrows the body section to just a div section
print(div)
print("-----------")
header = div.find('h1') # further narrows the div section to its h1 header
print(header)

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
-----------
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
-----------
<h1>HTML Webpage</h1>


In [7]:
# search for specific strings

string_search = soup.find_all('p', string = 'Some bold text') # finds paragraphs whose text is exactly this string
print(string_search)
print('---------')

# To search by keyword instead (e.g. "Some"), use a regex
import re # regular expressions: character sequences that define a search pattern
string_search = soup.find_all('p', string = re.compile('Some'))
print(string_search)
print('---------')

# find all headers whose text contains "Header" or "header"

headers = soup.find_all('h2', string = re.compile('(H|h)eader'))
print(headers)
print(headers)



[<p id="paragraph-id"><b>Some bold text</b></p>]
---------
[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
---------
[<h2>A Header</h2>, <h2>Another header</h2>]
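
The `(H|h)eader` pattern handles two capitalizations by hand. A sketch of an alternative, assuming a case-insensitive match is what is wanted; the inline `html` string, including the `Contact` header, is made up for illustration:

```python
import re
from bs4 import BeautifulSoup as bs

html = "<h2>A Header</h2><h2>Another header</h2><h2>Contact</h2>"
soup_demo = bs(html, "html.parser")

# re.IGNORECASE matches "header", "Header", "HEADER", ...
pattern = re.compile("header", re.IGNORECASE)
headers_demo = soup_demo.find_all("h2", string=pattern)

header_texts = [h.string for h in headers_demo]
print(header_texts)
```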


#### select (CSS Selector)


In [8]:
print(soup.body)
print('---------')

# find all the paragraph contents

content = soup.select('p')
print(content)
print('---------')

# find paragraph inside div

content = soup.select('div p')
print(content)
print('---------')

# Grab all paragraphs that are preceded by an h2 at the same level (general sibling selector)
paragraphs = soup.select('h2 ~ p')
print(paragraphs)
print('---------')

# Grab specific elements by ID

bold_text = soup.select('p#paragraph-id b')
print(bold_text)
print('---------')

<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
---------
[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>, <p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
---------
[<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>]
---------
[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
---------
[<b>Some bold text</b>]
---------
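
The `~` combinator used above selects every later sibling, which is easy to confuse with `+`, which selects only the sibling immediately after. A small sketch contrasting the two, with `select_one` added for the case where only the first match is needed; the inline `html` string is a made-up stand-in for the page:

```python
from bs4 import BeautifulSoup as bs

html = (
    "<body><div><p>Inside div</p></div>"
    "<h2>A Header</h2><p>First</p><p>Second</p></body>"
)
soup_demo = bs(html, "html.parser")

general = [p.get_text() for p in soup_demo.select("h2 ~ p")]   # every later <p> sibling of an <h2>
adjacent = [p.get_text() for p in soup_demo.select("h2 + p")]  # only the <p> directly after an <h2>
first_p = soup_demo.select_one("p")                            # first match in document order, or None

print(general, adjacent, first_p)
```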


In [9]:
paragraphs = soup.select('body > p')
print(paragraphs)
# to iteratively find something within the list obtained by select
for paragraph in paragraphs:
    print(paragraph.select('i'))

[<p><i>Some italicized text</i></p>, <p id="paragraph-id"><b>Some bold text</b></p>]
[<i>Some italicized text</i>]
[]


In [10]:
#### Getting Different properties of the HTML 

header = soup.find('h2')
print(header)
print('---------')

# just to print the string inside the element

header = header.string
print(header)
print('---------')

# use get_text if there are multiple child elements
div = soup.find('div')
print(div.string) # None: .string only works when the element has a single child, and this div has several
print('---------')
print(div.get_text())
print(div.get_text())





<h2>A Header</h2>
---------
A Header
---------
None
---------

HTML Webpage
Link to more interesting example: keithgalli.github.io/web-scraping/webpage.html
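
The raw `get_text()` output keeps the newlines from the source HTML. A sketch of two ways to get cleaner text, assuming a shortened inline version of the `div` (the `html` string is a stand-in, not the real page):

```python
from bs4 import BeautifulSoup as bs

html = '<div><h1>HTML Webpage</h1><p>Link: <a href="#">example</a></p></div>'
soup_demo = bs(html, "html.parser")
div_demo = soup_demo.find("div")

# separator joins the text pieces; strip=True trims whitespace around each piece
clean_text = div_demo.get_text(separator=" ", strip=True)

# stripped_strings yields the same trimmed pieces one at a time
pieces = list(div_demo.stripped_strings)

print(clean_text)
print(pieces)
```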



In [11]:
# Get a specific property from an element
link = soup.find('a')
print(link)
print('-------')
print(link['href'])



<a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a>
-------
https://keithgalli.github.io/web-scraping/webpage.html
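
Indexing with `link['href']` raises a `KeyError` when the attribute is missing, so `.get` is a safer choice when some tags may lack it. The sketch below also resolves a relative link with the standard library's `urljoin`; the inline `html` string and the choice of `base` URL are assumptions for illustration:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup as bs

html = '<a href="./webpage.html">More</a><a>No target</a>'
soup_demo = bs(html, "html.parser")
first_link, second_link = soup_demo.find_all("a")

href = first_link["href"]                 # raises KeyError if the attribute is absent
missing_href = second_link.get("href")    # .get returns None instead of raising

# Resolve the relative href against the URL of the page it came from
base = "https://keithgalli.github.io/web-scraping/example.html"
absolute = urljoin(base, href)

print(href, missing_href, absolute)
```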


In [12]:
#### Code navigation (parents, children, siblings)

# Path syntax
print(soup)
print('-------')
print(soup.body)
print('-------')
print(soup.body.div)
print(soup.body.div.h1)
print('-------')
print(soup.body.div.h1.string)
print('-------')



<html>
<head>
<title>HTML Example</title>
</head>
<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
</html>

-------
<body>
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h2>A Header</h2>
<p><i>Some italicized text</i></p>
<h2>Another header</h2>
<p id="paragraph-id"><b>Some bold text</b></p>
</body>
-------
<div align="middle">
<h1>HTML Webpage</h1>
<p>Link to more interesting example: <a href="https://keithgalli.github.io/web-scraping/webpage.html">keithgalli.github.io/web-scraping/webpage.html</a></p>
</div>
<h1>HTML Webpage</h1>
-------
HTML Webpage
-------

In [13]:
# Know the terms: Parent, Child, Sibling
print(soup.body.prettify())
    # parent is body
    # children are div, h2, p
    # div, h2, and p are siblings of each other, and each can contain its own child elements
    
soup.body.find('div').find_next_siblings()

<body>
 <div align="middle">
  <h1>
   HTML Webpage
  </h1>
  <p>
   Link to more interesting example:
   <a href="https://keithgalli.github.io/web-scraping/webpage.html">
    keithgalli.github.io/web-scraping/webpage.html
   </a>
  </p>
 </div>
 <h2>
  A Header
 </h2>
 <p>
  <i>
   Some italicized text
  </i>
 </p>
 <h2>
  Another header
 </h2>
 <p id="paragraph-id">
  <b>
   Some bold text
  </b>
 </p>
</body>



[<h2>A Header</h2>,
 <p><i>Some italicized text</i></p>,
 <h2>Another header</h2>,
 <p id="paragraph-id"><b>Some bold text</b></p>]
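
Besides `find_next_siblings`, a few other navigation helpers cover the parent and child directions. A minimal sketch on an inline snippet (the `html` string is a compact stand-in for the page, written without whitespace between tags so the children are all tags):

```python
from bs4 import BeautifulSoup as bs

html = (
    "<body><div><h1>HTML Webpage</h1></div>"
    "<h2>A Header</h2><p>Text</p><h2>Another header</h2></body>"
)
soup_demo = bs(html, "html.parser")

# Parent: the element that directly contains the h1
parent_name = soup_demo.find("h1").parent.name

# Children: direct child tags of body only (recursive=False skips deeper levels)
child_names = [child.name for child in soup_demo.body.find_all(True, recursive=False)]

# Siblings: nearest matching sibling after or before the first h2
first_h2 = soup_demo.find("h2")
next_p = first_h2.find_next_sibling("p")
previous_tag = first_h2.find_previous_sibling()

print(parent_name, child_names, next_p, previous_tag.name)
```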

#### Let's Practice
https://keithgalli.github.io/web-scraping/webpage.html

In [14]:
# Load the webpage content
r = requests.get('https://keithgalli.github.io/web-scraping/webpage.html')

# Convert to a Beautiful Soup object

webpage = bs(r.content) # parses the HTML returned by the request above

# Print out our HTML

print(webpage.prettify()) # prints out the HTML; .prettify() adds indentation


<html>
 <head>
  <title>
   Keith Galli's Page
  </title>
  <style>
   table {
    border-collapse: collapse;
  }
  th {
    padding:5px;
  }
  td {
    border: 1px solid #ddd;
    padding: 5px;
  }
  tr:nth-child(even) {
    background-color: #f2f2f2;
  }
  th {
    padding-top: 12px;
    padding-bottom: 12px;
    text-align: left;
    background-color: #add8e6;
    color: black;
  }
  .block {
  width: 100px;
  /*float: left;*/
    display: inline-block;
    zoom: 1;
  }
  .column {
  float: left;
  height: 200px;
  /*width: 33.33%;*/
  padding: 5px;
  }

  .row::after {
    content: "";
    clear: both;
    display: table;
  }
  </style>
 </head>
 <body>
  <h1>
   Welcome to my page!
  </h1>
  <img src="./images/selfie1.jpg" width="300px"/>
  <h2>
   About me
  </h2>
  <p>
   Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
  </p>
  <p>
   Here is a link to my channel:
   <a href="https://www.youtube.com/kgmi

In [15]:
## Task 1
# Grab all the social links from the webpage in 3 different ways

# 1
social = webpage.find_all('ul')[1]
links = social.find_all('a')
for link in links:
    print(link['href'])
print('----------')

#2

social = webpage.select('ul')[1]
links = social.select('a')
for link in links:
    print(link['href'])
print('----------')
    
# 3
links = webpage.select('ul.socials a')
for link in links:
    print(link['href'])
print('----------')




https://www.instagram.com/keithgalli/
https://twitter.com/keithgalli
https://www.linkedin.com/in/keithgalli/
https://www.tiktok.com/@keithgalli
----------
https://www.instagram.com/keithgalli/
https://twitter.com/keithgalli
https://www.linkedin.com/in/keithgalli/
https://www.tiktok.com/@keithgalli
----------
https://www.instagram.com/keithgalli/
https://twitter.com/keithgalli
https://www.linkedin.com/in/keithgalli/
https://www.tiktok.com/@keithgalli
----------


In [16]:
# Scrape the table into a pandas DataFrame

import pandas as pd
table = webpage.select('table.hockey-stats')[0] # select the table's HTML
columns = table.find_all('th') # header cells of this table
column_names = [c.string for c in columns] # header text becomes the column names
table_rows = table.find('tbody').find_all('tr')
rows = []
for tr in table_rows:
    cells = tr.find_all('td')
    row = [cell.get_text().strip() for cell in cells] # one cleaned string per cell
    rows.append(row)

df = pd.DataFrame(rows, columns = column_names)
df.head()

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,
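
The same header-plus-rows idea can be sketched without pandas by pairing each row's cells with the column names. This assumes a trimmed, inline stand-in for the table (three columns, two rows, with `thead`/`tbody` markup that may differ from the real page):

```python
from bs4 import BeautifulSoup as bs

# Trimmed stand-in for the hockey-stats table (markup is an assumption)
html = """
<table class="hockey-stats">
  <thead><tr><th>S</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>2014-15</td><td>MIT (Mass. Inst. of Tech.)</td><td>17</td></tr>
    <tr><td>2015-16</td><td>MIT (Mass. Inst. of Tech.)</td><td>9</td></tr>
  </tbody>
</table>
"""
soup_demo = bs(html, "html.parser")
table_demo = soup_demo.select_one("table.hockey-stats")

# Column names come from the header cells
column_names = [th.get_text(strip=True) for th in table_demo.select("thead th")]

# Each body row becomes a dict keyed by column name
records = []
for tr in table_demo.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append(dict(zip(column_names, cells)))

print(records)
```

A list of dicts like `records` can be passed straight to `pd.DataFrame(records)` if a DataFrame is still the end goal.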


In [23]:
# grab all fun facts that contain the word "is"

facts = webpage.select('ul.fun-facts li')
matches = [fact.find(string = re.compile('is')) for fact in facts] # first matching string in each fact, or None
facts_with_is = [match.find_parent().get_text() for match in matches if match]
facts_with_is


['Middle name is Ronald',
 'Dunkin Donuts coffee is better than Starbucks',
 "A favorite book series of mine is Ender's Game",
 'Current video game of choice is Rocket League',
 "The band that I've seen the most times live is the Zac Brown Band"]
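
A pattern such as `re.compile('is')` matches those two letters anywhere, including inside words like "This" or "list". A sketch of a stricter variant using the word-boundary anchor `\b`; the list items below are made up for illustration, apart from the first, which appears in the output above:

```python
import re
from bs4 import BeautifulSoup as bs

html = """
<ul class="fun-facts">
  <li>Middle name is Ronald</li>
  <li>This list has three items</li>
  <li>Nothing to match here</li>
</ul>
"""
soup_demo = bs(html, "html.parser")
items = [li.get_text() for li in soup_demo.select("ul.fun-facts li")]

loose = re.compile("is")        # matches "is" anywhere, even inside "This" and "list"
strict = re.compile(r"\bis\b")  # matches "is" only as a standalone word

loose_hits = [text for text in items if loose.search(text)]
strict_hits = [text for text in items if strict.search(text)]

print(loose_hits)
print(strict_hits)
```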