
#Web Scraping
In today's lecture, we'll be looking at web scraping using Beautiful Soup. Before we start scraping data, let's look at what web scraping actually is.


##Site Layout
If you right-click on a page and choose "View Page Source", the browser shows the HTML code for the site. HTML is a markup language that describes the structure and content of a page; most of the styling is handled separately by CSS.

To look at one specific part of a site, right-click it and choose "Inspect". The developer tools that open also let you edit the page temporarily. The changes exist only in your browser and disappear once the page is refreshed.

Let's view the following website and see if we can load the page into Python:
https://www.york.ac.uk/teaching/cws/wws/webpage1.html

In [1]:
#Load in Libraries
import requests
from bs4 import BeautifulSoup as bs

First, we load the page content using requests, then turn that content into a BeautifulSoup object.

In [2]:
r = requests.get("https://www.york.ac.uk/teaching/cws/wws/webpage1.html")
soup = bs(r.content)
print(soup.prettify())


<html>
 <body>
  <hmtl>
   <title>
    webpage1
   </title>
   <table align="center" width="75%">
    <tr>
     <td>
      <div align="center">
       <h1>
        STARTING . . .
       </h1>
      </div>
      <div align="justify">
       <p>
        There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language -  HTML.
        <br/>
       </p>
       <p>
        HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!
       </p>
       <p>
 

Now let's get into some of the basics of the Beautiful Soup library. If you would like to read more, see the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
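For experimenting without a network connection, a soup can also be built from an inline string. The following is a minimal sketch, not part of the original lecture: `SAMPLE_HTML` and `make_soup` are hypothetical names, and it assumes Python's built-in `"html.parser"`, which may structure a malformed page slightly differently from other parsers such as lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a downloaded page; this is not the York page itself.
SAMPLE_HTML = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1>Hello</h1>
    <p>First paragraph.</p>
  </body>
</html>
"""


def make_soup(markup):
    """Parse markup (str or bytes) with Python's built-in HTML parser."""
    # Naming the parser explicitly keeps the result consistent across machines.
    return BeautifulSoup(markup, "html.parser")


sample = make_soup(SAMPLE_HTML)
print(sample.title.string)   # Sample page
print(sample.h1.get_text())  # Hello
```

The same helper should also accept the bytes returned by requests, for example `make_soup(r.content)`.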


##Find and Find_all Methods

As the names suggest, the find methods collect specific elements on a given webpage: find returns the first element that matches, while find_all returns a list of every match.


In [3]:
element = soup.find("p")
element

<p>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language -  HTML. 
<br/>
</p>

In [4]:
elements = soup.find_all("p")
elements

[<p>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language -  HTML. 
 <br/>
 </p>,
 <p>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</p>,
 <p>Learning HTML will enable you to:
 </p>,
 <p>A HTML web page is made up of tags. Tags are placed in brackets like this <b>&lt; tag &gt; </b>. A tag tells the browser how to display information. Most tags need to be opened &lt; tag &gt; and closed &lt; /tag &gt;.
 
 </p>,
 <p> To make a si

In [5]:
element = soup.find(["p", "i"])
element

<p>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language -  HTML. 
<br/>
</p>

In [6]:
elements = soup.find_all(["p", "i"])
elements

[<p>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language -  HTML. 
 <br/>
 </p>,
 <p>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</p>,
 <p>Learning HTML will enable you to:
 </p>,
 <p>A HTML web page is made up of tags. Tags are placed in brackets like this <b>&lt; tag &gt; </b>. A tag tells the browser how to display information. Most tags need to be opened &lt; tag &gt; and closed &lt; /tag &gt;.
 
 </p>,
 <p> To make a si
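To make the difference between find and find_all easier to see on a small input, here is a minimal sketch using an inline HTML snippet. The snippet and the variable names are hypothetical and chosen only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = """
<div>
  <p>one</p>
  <p>two <i>italic</i></p>
  <i>standalone</i>
</div>
"""
demo = BeautifulSoup(html, "html.parser")

first_p = demo.find("p")             # a single Tag: the first match
all_p = demo.find_all("p")           # a list-like result holding every match
mixed = demo.find_all(["p", "i"])    # a list of names matches any of them
missing = demo.find("table")         # no match: find returns None
none_found = demo.find_all("table")  # no match: find_all returns an empty list

print(first_p.get_text())            # one
print(len(all_p))                    # 2
print([tag.name for tag in mixed])   # ['p', 'p', 'i', 'i']
print(missing, len(none_found))      # None 0
```

Because find returns None when nothing matches, it is worth checking the result before calling methods on it.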

In [None]:
body = soup.find("body")
div = body.find("div")
'''
header = div.find("...")
header
'''

Using find/find_all, we can also search for specific strings of interest on our chosen webpage.

In [7]:
#string= is the newer name for the text= argument
strings = soup.find_all("p", string="Write a simple web page.")
strings

[<p>Write a simple web page.</p>]

This find_all call isn't very handy, as we have to type the whole string in order to find it. Let's try again using Python's built-in re (regular expression) module.

In [8]:
import re

In [9]:
paragraph = soup.find_all("p", string=re.compile("HTML"))
paragraph

#NOTE: Matching is case-sensitive. If case is an issue, we could use re.compile("(H|h)TML") or re.compile("html", re.IGNORECASE)

[<p>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</p>,
 <p>Learning HTML will enable you to:
 </p>,
 <p>If the page doesn't open, go back over your notepad typing and make sure that all the HTML tags are correct. Check there are no spaces between tags and internal text; check that all tags are closed; check that you haven't written &lt; HTLM &gt; or &lt; BDDY &gt;.  Your page will work eventually. 
 </p>]
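Notice that paragraphs containing other tags (such as the first one, which ends in a `<br/>`) are missing from the result above. The sketch below, on a hypothetical inline snippet, illustrates why and shows one possible workaround: passing a function that tests the full text of each tag. It also uses `re.IGNORECASE` for case-insensitive matching.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = """
<p>HTML is a markup language.</p>
<p>Learn html today.</p>
<p>More about <b>HTML</b> tags.</p>
<p>Nothing relevant here.</p>
"""
demo = BeautifulSoup(html, "html.parser")

pattern = re.compile("html", re.IGNORECASE)

# string= compares against tag.string, which is None when a tag has more
# than one child, so the third paragraph is skipped.
by_string = demo.find_all("p", string=pattern)
print(len(by_string))  # 2

# A function filter can test the combined text of each tag instead.
by_text = demo.find_all(
    lambda tag: tag.name == "p" and pattern.search(tag.get_text()) is not None
)
print(len(by_text))    # 3
```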

#CSS

CSS stands for Cascading Style Sheets. A web page's CSS describes how HTML elements are displayed on screen, on paper, or in other media. Because one stylesheet can be shared, CSS can control the layout of many web pages at once.

CSS selectors are the patterns a stylesheet uses to target elements, and we can use the same patterns to pick out the parts of a webpage that we want. Here is a cheat sheet of the CSS selector types:

https://www.w3schools.com/cssref/css_selectors.asp

##Select Method (CSS Selector)



In [10]:
#Paragraphs nested anywhere inside a div
content = soup.select("div p")
content

[<p>There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language -  HTML. 
 <br/>
 </p>,
 <p>HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!</p>,
 <p>Learning HTML will enable you to:
 </p>,
 <p>A HTML web page is made up of tags. Tags are placed in brackets like this <b>&lt; tag &gt; </b>. A tag tells the browser how to display information. Most tags need to be opened &lt; tag &gt; and closed &lt; /tag &gt;.
 
 </p>,
 <p> To make a si

In [11]:
#Select br elements that come after an i element and share its parent (general sibling selector)
content = soup.select("i ~ br")
content

[<br/>, <br/>, <br/>, <br/>, <br/>, <br/>, <br/>]
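The selectors above are only two of many. As a rough sketch on a hypothetical inline snippet (not the York page), the following compares a few common selector types side by side, including `select_one`, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = """
<div id="main">
  <p class="intro">Intro</p>
  <section><p>Nested</p></section>
  <i>note</i>
  <br/>
  <span>gap</span>
  <br/>
</div>
<p>Outside</p>
"""
demo = BeautifulSoup(html, "html.parser")

descendants = demo.select("div p")    # any <p> nested inside a <div>
children = demo.select("div > p")     # only <p> directly under a <div>
later_brs = demo.select("i ~ br")     # every <br> after an <i> with the same parent
next_br = demo.select("i + br")       # only a <br> immediately after an <i>
intro = demo.select_one("p.intro")    # class selector, first match only
main = demo.select_one("#main")       # id selector

print(len(descendants), len(children))  # 2 1
print(len(later_brs), len(next_br))     # 2 1
print(intro.get_text(), main["id"])     # Intro main
```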

In [12]:
print(soup.body.prettify())

<body>
 <hmtl>
  <title>
   webpage1
  </title>
  <table align="center" width="75%">
   <tr>
    <td>
     <div align="center">
      <h1>
       STARTING . . .
      </h1>
     </div>
     <div align="justify">
      <p>
       There are lots of ways to create web pages using already coded programmes. These lessons will teach you how to use the underlying HyperText Markup Language -  HTML.
       <br/>
      </p>
      <p>
       HTML isn't computer code, but is a language that uses US English to enable texts (words, images, sounds) to be inserted and formatting such as colo(u)r and centre/ering to be written in. The process is fairly simple; the main difficulties often lie in small mistakes - if you slip up while word processing your reader may pick up your typos, but the page will still be legible. However, if your HTML is inaccurate the page may not appear - writing web pages is, at the least, very good practice for proof reading!
      </p>
      <p>
       Learning HTML will enab

#Getting Different Properties of HTML

We can select information from our webpage and output it as a string:



In [13]:
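#.string returns None when a tag has more than one child, as <body> does here
body = soup.find("body")
print(body.string)

#.get_text() joins all of the text found inside the tag
div = soup.find("div")
print(div.get_text())
print(div.prettify())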

None
STARTING . . . 
<div align="center">
 <h1>
  STARTING . . .
 </h1>
</div>



In [14]:
#Get a specific attribute from an element
link = soup.find("a")
link['href']

'webpage2.html'
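The sketch below gathers these property lookups in one place on a hypothetical inline snippet. It also shows one way to turn a relative link such as `webpage2.html` into a full URL with the standard library's `urljoin`, assuming the page URL used earlier in this lecture as the base.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = '<div><h1>Title</h1><p>See <a href="webpage2.html">the next page</a>.</p></div>'
demo = BeautifulSoup(html, "html.parser")

print(demo.h1.string)     # Title (exactly one child, so .string works)
print(demo.p.string)      # None (the paragraph holds text plus an <a> tag)
print(demo.p.get_text())  # See the next page.

link = demo.find("a")
print(link["href"])       # webpage2.html (raises KeyError if the attribute is missing)
print(link.get("title"))  # None (.get is safer for optional attributes)
print(link.attrs)         # {'href': 'webpage2.html'}

# Resolve the relative link against the URL of the page it came from.
base_url = "https://www.york.ac.uk/teaching/cws/wws/webpage1.html"
full_url = urljoin(base_url, link["href"])
print(full_url)           # https://www.york.ac.uk/teaching/cws/wws/webpage2.html
```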

#Code Navigation

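As you have seen, Beautiful Soup can be used to extract specific pieces of information. Tag names can also be chained as attributes to walk down the tree; each step returns the first matching tag.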

In [15]:
print(soup.body.div)

<div align="center"><h1>STARTING . . . </h1></div>


##3 Need to knows - Parent, Sibling, Child

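Depending on how the HTML tags are nested (the indentation in prettify() output mirrors this nesting), an element can be a parent, sibling, or child of another element. A tag that directly contains another tag is its parent, the contained tag is its child, and tags that share the same parent are siblings. Beautiful Soup has attributes and methods for each relationship, such as .parent, .children, and find_next_sibling().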
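As a small illustration of these three relationships, here is a minimal sketch on a hypothetical list. The markup is written without whitespace between tags so that `.children` yields only tags and no blank text nodes.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only for illustration.
html = "<ul><li>first</li><li>second</li><li>third</li></ul>"
demo = BeautifulSoup(html, "html.parser")

second = demo.find("li", string="second")

# Parent: the tag that directly contains this one.
print(second.parent.name)                         # ul

# Siblings: tags that share the same parent.
print(second.find_previous_sibling().get_text())  # first
print(second.find_next_sibling().get_text())      # third

# Children: the tags directly inside a parent.
child_texts = [child.get_text() for child in demo.ul.children]
print(child_texts)                                # ['first', 'second', 'third']
```

On real pages, `.children` usually also yields the whitespace between tags, so `find_all(..., recursive=False)` can be a more convenient way to list direct child tags.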

#Example

Let's load a webpage and grab all text, social links and a table.



In [16]:
# Load the webpage content
r = requests.get("https://keithgalli.github.io/web-scraping/webpage.html")

web = bs(r.content)
print(web.prettify())

<html>
 <head>
  <title>
   Keith Galli's Page
  </title>
  <style>
   table {
    border-collapse: collapse;
  }
  th {
    padding:5px;
  }
  td {
    border: 1px solid #ddd;
    padding: 5px;
  }
  tr:nth-child(even) {
    background-color: #f2f2f2;
  }
  th {
    padding-top: 12px;
    padding-bottom: 12px;
    text-align: left;
    background-color: #add8e6;
    color: black;
  }
  .block {
  width: 100px;
  /*float: left;*/
    display: inline-block;
    zoom: 1;
  }
  .column {
  float: left;
  height: 200px;
  /*width: 33.33%;*/
  padding: 5px;
  }

  .row::after {
    content: "";
    clear: both;
    display: table;
  }
  </style>
 </head>
 <body>
  <h1>
   Welcome to my page!
  </h1>
  <img src="./images/selfie1.jpg" width="300px"/>
  <h2>
   About me
  </h2>
  <p>
   Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
  </p>
  <p>
   Here is a link to my channel:
   <a href="https://www.youtube.com/kgmi

In [17]:
#Select every <a> inside the ul with class "socials" and collect each href
links = web.select("ul.socials a")
social_links = [link['href'] for link in links]
social_links

['https://www.instagram.com/keithgalli/',
 'https://twitter.com/keithgalli',
 'https://www.linkedin.com/in/keithgalli/',
 'https://www.tiktok.com/@keithgalli']

In [18]:
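#Find the "Photos" header, then collect every sibling element that comes before it
header = web.body.find('h2', string="Photos")
previous_elements = header.find_previous_siblings()
#find_previous_siblings returns the nearest sibling first, so reverse to restore page order
previous_elements_sorted = previous_elements[::-1]
elements = [x.get_text() for x in previous_elements_sorted]
text = "\n".join(elements)
print(text)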

Welcome to my page!

About me
Hi, my name is Keith and I am a YouTuber who focuses on content related to programming, data science, and machine learning!
Here is a link to my channel: youtube.com/kgmit
I grew up in the great state of New Hampshire here in the USA. From an early age I always loved math. Around my senior year of high school, my brother first introduced me to programming. I found it a creative way to apply the same type of logical thinking skills that I enjoyed with math. This influenced me to study computer science in college and ultimately create a YouTube channel to share some things that I have learned along the way.
Hobbies
Believe it or not, I don't code 24/7. I love doing all sorts of active things. I like to play ice hockey & table tennis as well as run, hike, skateboard, and snowboard. In addition to sports, I am a board game enthusiast. The two that I've been playing the most recently are Settlers of Catan and Othello.
Fun Facts

Owned my dream car in high schoo

In [20]:
import pandas as pd

#Grab the stats table and read its column names from the header row
table = web.select("table.hockey-stats")[0]
columns = table.find("thead").find_all("th")
column_names = [c.string for c in columns]

#Build one list of cell values for each row of the table body
table_rows = table.find("tbody").find_all("tr")
rows = []
for tr in table_rows:
  cells = tr.find_all('td')
  row = [cell.get_text().strip() for cell in cells]
  rows.append(row)

df = pd.DataFrame(rows, columns=column_names)
df.head()

Unnamed: 0,S,Team,League,GP,G,A,TP,PIM,+/-,Unnamed: 10,POST,GP.1,G.1,A.1,TP.1,PIM.1,+/-.1
0,2014-15,MIT (Mass. Inst. of Tech.),ACHA II,17.0,3.0,9.0,12.0,20.0,,|,,,,,,,
1,2015-16,MIT (Mass. Inst. of Tech.),ACHA II,9.0,1.0,1.0,2.0,2.0,,|,,,,,,,
2,2016-17,MIT (Mass. Inst. of Tech.),ACHA II,12.0,5.0,5.0,10.0,8.0,0.0,|,,,,,,,
3,2017-18,Did not play,,,,,,,,|,,,,,,,
4,2018-19,MIT (Mass. Inst. of Tech.),ACHA III,8.0,5.0,10.0,15.0,8.0,,|,,,,,,,
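The same header-and-rows pattern can be tried on a much smaller table. The sketch below uses a hypothetical, trimmed-down table (the `stats` class name and the three columns are assumptions for illustration) and builds a list of dictionaries instead of a DataFrame, so it runs with Beautiful Soup alone.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down table used only for illustration.
html = """
<table class="stats">
  <thead><tr><th>Season</th><th>Team</th><th>GP</th></tr></thead>
  <tbody>
    <tr><td>2014-15</td><td> MIT </td><td>17</td></tr>
    <tr><td>2015-16</td><td> MIT </td><td>9</td></tr>
  </tbody>
</table>
"""
demo = BeautifulSoup(html, "html.parser")
table = demo.select_one("table.stats")

# Column names come from the header cells.
headers = [th.get_text(strip=True) for th in table.select("thead th")]

# Each body row becomes a dict that maps a column name to its cell text.
records = []
for tr in table.select("tbody tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    records.append(dict(zip(headers, cells)))

print(headers)     # ['Season', 'Team', 'GP']
print(records[0])  # {'Season': '2014-15', 'Team': 'MIT', 'GP': '17'}
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame(records)`. Note that every scraped value is still a string until it is converted, for example with `pd.to_numeric`.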
