# Web Scraping with BeautifulSoup

### But first... what is HTML?

**HyperText Markup Language**: it is NOT a programming language. As its name suggests, it is a *markup language*, used to tell the browser how to lay out content.

HTML is based on tags, which indicate what should be done with the content.

The most basic tag is `<html>`. Everything inside it is HTML. **Important:** Tags delimit a scope, so we use an opening and a closing tag, as in this example:

```html
<html>
...
</html>
```

Inside an `html` tag, we can use other tags. Usually, an HTML page has two other scopes defined by tags: `head` and `body`. The content of the web page goes into the body. The head contains metadata about the page, like its title (it sometimes also holds JavaScript, CSS, etc.).

When scraping, we usually focus on what is inside `<body> </body>`:

```html
<html>
   <head>
        ...
   </head>
   <body>
       ...
   </body>
</html>
```

There are many possible tags with different roles. For example, `<p>` delimits a paragraph, `<br>` breaks a line, and `<a>` represents a link:

<html>
   <head>
   </head>

   <body>
      <p>
         Paragraph
         <a href="https://www.github.com">Link to GitHub</a>
      </p>
      <p>
         See the link below:
         <a href="https://www.twitter.com">Twitter</a>
      </p>
   </body>
</html>

In the above example, the `<a>` tag has an `href` attribute, which determines where the link goes.

Elements (tags) may have multiple attributes that define their layout/behavior. The attribute `class`, for example, names the CSS class(es) whose styles are applied to the element. The attribute `id` is sometimes used to uniquely identify a tag.
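
To make tags and attributes concrete before we bring in BeautifulSoup, here is a minimal sketch that uses only Python's standard-library `html.parser` to list every opening tag with its attributes. The markup is a reduced version of the example page; the `class="intro"` and `id="first"` values and the `TagLister` class are assumptions added for illustration.

```python
from html.parser import HTMLParser

# Reduced version of the example page.
# The class and id values are hypothetical, added only for illustration.
PAGE = """
<html>
   <body>
      <p class="intro" id="first">
         Paragraph
         <a href="https://www.github.com">Link to GitHub</a>
      </p>
   </body>
</html>
"""


class TagLister(HTMLParser):
    """Record every opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.seen.append((tag, dict(attrs)))


lister = TagLister()
lister.feed(PAGE)
for tag, attrs in lister.seen:
    print(tag, attrs)
```

Each printed line pairs a tag name with a dictionary of its attributes, which is the same information we will query with BeautifulSoup below.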

### Let's scrape

First, we need to import the modules we are using: `requests` and BeautifulSoup.

In [2]:
import requests
from bs4 import BeautifulSoup

Let's get a page using `requests`:

In [5]:
result = requests.get("https://pythonprogramming.net/parsememcparseface/")

We take the content of the response and use it to instantiate our BeautifulSoup object, which gets the page ready to scrape.

In [7]:
content = result.content
soup = BeautifulSoup(content, "html.parser")
soup

<html>
<head>
<!--
		palette:
		dark blue: #003F72
		yellow: #FFD166
		salmon: #EF476F
		offwhite: #e7d7d7
		Light Blue: #118AB2
		Light green: #7DDF64
		-->
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<title>Python Programming Tutorials</title>
<meta content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free." name="description"/>
<link href="/static/favicon.ico" rel="shortcut icon"/>
<link href="/static/css/materialize.min.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<meta content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" name="google-site-verification">
<link href="/static/css/bootstrap.css" rel="stylesheet" type="text/css"/>
<!-- Compiled and minified CSS -->
<!-- Compiled and minified JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.4.min.js"></script>
<script src="https://cdnjs.cloudflare.com/aj
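
A side note on `result.content`: in `requests` it holds the raw bytes of the response, while `result.text` holds the decoded string, and BeautifulSoup accepts either. The sketch below avoids the network and uses a made-up one-paragraph page to show both inputs producing the same text. The markup and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a downloaded page.
html_text = "<html><body><p>café</p></body></html>"
html_bytes = html_text.encode("utf-8")  # the kind of value response.content holds

# from_encoding tells BeautifulSoup how to decode the bytes
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")
soup_from_text = BeautifulSoup(html_text, "html.parser")

print(soup_from_bytes.p.string)
print(soup_from_text.p.string)
```

When `from_encoding` is omitted, BeautifulSoup tries to detect the encoding of the bytes on its own.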

If we want, we can make it easier to read...

In [8]:
print(soup.prettify())

<html>
 <head>
  <!--
		palette:
		dark blue: #003F72
		yellow: #FFD166
		salmon: #EF476F
		offwhite: #e7d7d7
		Light Blue: #118AB2
		Light green: #7DDF64
		-->
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <title>
   Python Programming Tutorials
  </title>
  <meta content="Python Programming tutorials from beginner to advanced on a massive variety of topics. All video and text tutorials are free." name="description"/>
  <link href="/static/favicon.ico" rel="shortcut icon"/>
  <link href="/static/css/materialize.min.css" rel="stylesheet"/>
  <link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
  <meta content="3fLok05gk5gGtWd_VSXbSSSH27F2kr1QqcxYz9vYq2k" name="google-site-verification">
   <link href="/static/css/bootstrap.css" rel="stylesheet" type="text/css"/>
   <!-- Compiled and minified CSS -->
   <!-- Compiled and minified JavaScript -->
   <script src="https://code.jquery.com/jquery-2.1.4.min.js">
   </script>
   <

We can use multiple attributes/methods depending on what we wanna scrape/get!

In [13]:
print(soup.title)

<title>Python Programming Tutorials</title>


We can work with `soup.title` (which is a `Tag` object) to get its name, content, parent, etc.

In [11]:
print(soup.title.name)

title


In [12]:
print(soup.title.string)

Python Programming Tutorials


In [14]:
print(soup.title.parent.name)

head


In [21]:
print(soup.p)

<p class="introduction">Oh, hello! This is a <span style="font-size:115%">wonderful</span> page meant to let you practice web scraping. This page was originally created to help people work with the <a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="blank"><strong>Beautiful Soup 4</strong></a> library.</p>


In [43]:
soup.p['class']

['introduction']
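
The result above is a list because `class` can hold several values at once, so BeautifulSoup treats it as a multi-valued attribute. The sketch below uses a small made-up snippet (the class names and the `demo` variable are assumptions) to contrast that with a single-valued attribute, and to show the difference between `tag['attr']` and `tag.get('attr')` when the attribute is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet; the class names are made up for this example.
snippet = '<p class="introduction highlight" id="intro">Hello</p><p>No attributes</p>'
demo = BeautifulSoup(snippet, "html.parser")

first, second = demo.find_all("p")

print(first["class"])       # class is multi-valued, so this is a list
print(first["id"])          # id is single-valued, so this is a string
print(second.get("class"))  # .get() returns None when the attribute is missing

try:
    second["class"]         # square brackets raise KeyError instead
except KeyError:
    print("second <p> has no class attribute")
```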

We can find all the items with the same tag and get them as a list-like object (a `ResultSet`):

In [56]:
all_para = soup.find_all('p')
print(type(all_para[1]))

<class 'bs4.element.Tag'>


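We can even iterate :-) Note the difference between `.string` and `.text`: `.string` is `None` when a tag has more than one child, while `.text` joins all the text inside the tag.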

In [34]:
for para in all_para:
    print(para.string)
    print(str(para.text))
    print("----")

None
Oh, hello! This is a wonderful page meant to let you practice web scraping. This page was originally created to help people work with the Beautiful Soup 4 library.
----
None
The following table gives some general information for the following programming languages:
----
I think it's clear that, on a scale of 1-10, python is:
I think it's clear that, on a scale of 1-10, python is:
----
Javascript (dynamic data) test:
Javascript (dynamic data) test:
----
y u bad tho?
y u bad tho?
----
Whᶐt hαppéns now¿
Whᶐt hαppéns now¿
----
sitemap
sitemap
----
Contact: Harrison@pythonprogramming.net.
Contact: Harrison@pythonprogramming.net.
----
Programming is a superpower.
Programming is a superpower.
----
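
The `None` values in the output above come from paragraphs that contain more than one child, such as the first one with its nested `<span>` and `<a>`. Here is a minimal sketch of that behavior on a made-up snippet (the markup and the `demo` name are assumptions):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one plain paragraph, one with a nested tag.
snippet = (
    "<p>Only text here</p>"
    "<p>Text with a <span>nested</span> tag</p>"
)
demo = BeautifulSoup(snippet, "html.parser")
plain, nested = demo.find_all("p")

print(plain.string)   # a single text child, so .string returns it
print(nested.string)  # several children, so .string is None
print(nested.text)    # .text concatenates all the text inside the tag
print(nested.get_text(" ", strip=True))  # same idea, with control over spacing
```

When in doubt, `get_text()` is usually the safer choice for scraping, since it never returns `None`.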


In [52]:
links = soup.find_all('a')
print(links[2])
for url in links:
    print(url.text)
    print(url.get('href'))
    print(url.get('class'))
    print("---")


<a href="/">Home</a>

/
['brand-logo']
---

#
['button-collapse']
---
Home
/
None
---
+=1
/+=1/
['tooltipped']
---
Support the Content
/support/
None
---
Community
https://goo.gl/7zgAVQ
None
---
Log in
/login/
None
---
Sign up
/register/
None
---
Home
/
None
---
+=1
/+=1/
['tooltipped']
---
Support the Content
/support/
None
---
Community
https://goo.gl/7zgAVQ
None
---
Log in
/login/
None
---
Sign up
/register/
None
---
Beautiful Soup 4
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
None
---
sitemap
/sitemap.xml
None
---
Support this Website!
/support-donate/
['grey-text', 'text-lighten-3']
---
Consulting and Contracting
/consulting/
['grey-text', 'text-lighten-3']
---
Facebook
https://www.facebook.com/pythonprogramming.net/
['grey-text', 'text-lighten-3']
---
Twitter
https://twitter.com/sentdex
['grey-text', 'text-lighten-3']
---
Instagram
https://instagram.com/sentdex
['grey-text', 'text-lighten-3']
---
Terms and Conditions
/about/tos/
['grey-text', 'text-lighten-3']
---
Priv
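
Many of the `href` values above are relative (for example `/support/`), so they cannot be requested as they are. One possible way to turn them into full URLs is `urljoin` from the standard library, sketched here with the page address used earlier and a few of the values from the output:

```python
from urllib.parse import urljoin

# Page address used earlier in this notebook
base_url = "https://pythonprogramming.net/parsememcparseface/"

# A few href values copied from the output above
hrefs = ["/", "/support/", "/sitemap.xml", "https://twitter.com/sentdex"]

# Relative links are resolved against the page; absolute ones are kept as-is
absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```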

In [67]:
divs = soup.find_all('div', attrs={"class": "container", "style":"max-width:1500px; min-height:100%"})
len(divs)

1
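
The `attrs` dictionary is one of several ways to filter by attribute. The sketch below, on made-up markup (only the `container` class name comes from the cell above), compares it with the `class_` keyword and with a CSS selector passed to `select()`. Note that a class filter matches a tag if any one of its classes matches.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; only the "container" class name comes from the cell above.
snippet = """
<div class="container wide" style="max-width:1500px">outer</div>
<div class="container">inner</div>
<div class="sidebar">other</div>
"""
demo = BeautifulSoup(snippet, "html.parser")

by_attrs = demo.find_all("div", attrs={"class": "container"})
by_keyword = demo.find_all("div", class_="container")
by_selector = demo.select("div.container")

print(len(by_attrs), len(by_keyword), len(by_selector))

# Extra attributes narrow the match; a string value must match exactly.
styled = demo.find_all("div", attrs={"class": "container", "style": "max-width:1500px"})
print([div.text for div in styled])
```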

In [66]:
body = soup.find('div', attrs={"class": "body"})
print(body.prettify())

In [69]:
footer = soup.find('footer')
print(footer.prettify())

<footer class="page-footer">
 <div class="container">
  <div class="row">
   <div class="col l6 s12">
    <h5 class="white-text">
     You've reached the end!
    </h5>
    <p class="grey-text text-lighten-4">
     Contact: Harrison@pythonprogramming.net.
    </p>
    <ul>
     <li>
      <a class="grey-text text-lighten-3" href="/support-donate/">
       Support this Website!
      </a>
     </li>
     <li>
      <a class="grey-text text-lighten-3" href="/consulting/">
       Consulting and Contracting
      </a>
     </li>
     <li>
      <a class="grey-text text-lighten-3" href="https://www.facebook.com/pythonprogramming.net/">
       Facebook
      </a>
     </li>
     <li>
      <a class="grey-text text-lighten-3" href="https://twitter.com/sentdex">
       Twitter
      </a>
     </li>
     <li>
      <a class="grey-text text-lighten-3" href="https://instagram.com/sentdex">
       Instagram
      </a>
     </li>
    </ul>
   </div>
   <div class="col l4 offset-l2 s12">
    <h6 cla

In [70]:
for child in footer.children:
    print(child)
    print("---")



---
<div class="container">
<div class="row">
<div class="col l6 s12">
<h5 class="white-text">You've reached the end!</h5>
<p class="grey-text text-lighten-4">Contact: Harrison@pythonprogramming.net.</p>
<ul>
<li><a class="grey-text text-lighten-3" href="/support-donate/">Support this Website!</a></li>
<li><a class="grey-text text-lighten-3" href="/consulting/">Consulting and Contracting</a></li>
<li><a class="grey-text text-lighten-3" href="https://www.facebook.com/pythonprogramming.net/">Facebook</a></li>
<li><a class="grey-text text-lighten-3" href="https://twitter.com/sentdex">Twitter</a></li>
<li><a class="grey-text text-lighten-3" href="https://instagram.com/sentdex">Instagram</a></li>
</ul>
</div>
<div class="col l4 offset-l2 s12">
<h6 class="white-text">Legal stuff:</h6>
<ul>
<li><a class="grey-text text-lighten-3" href="/about/tos/">Terms and Conditions</a></li>
<li><a class="grey-text text-lighten-3" href="/about/privacy-policy/">Privacy Policy</a></li>
</ul>
</div>
</div>


In [73]:
len(list(body.children))

28
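
Counts like this one include text nodes, not only tags: the blank first child printed by the `footer.children` loop above is the newline between two tags. The sketch below uses a made-up footer (the markup and the `demo_footer` name are assumptions) to compare `.children`, direct child tags only, and `.descendants`.

```python
from bs4 import BeautifulSoup

# Hypothetical footer, laid out across lines like the real page
snippet = """<footer>
<div><p>Contact</p></div>
<ul><li>One</li><li>Two</li></ul>
</footer>"""
demo = BeautifulSoup(snippet, "html.parser")
demo_footer = demo.find("footer")

children = list(demo_footer.children)        # direct children, text nodes included
tag_children = demo_footer.find_all(True, recursive=False)  # direct child tags only
descendants = list(demo_footer.descendants)  # everything nested at any depth

print(len(children), len(tag_children), len(descendants))
print([child.name for child in tag_children])
```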

In [None]:
body.find('a')


In [None]:
for item in body.find_all('a'):
    # .string is None for links with nested tags, so use get_text() instead
    print(f"{item.get_text(strip=True)} is a link to {item.get('href')}")
    if item.has_attr('target'):
        print("target is: " + item.get("target"))

In [None]:
body.find_all('div')

In [None]:
body.find_all('img')

In [None]:
len(list(body.descendants))

In [None]:
print(body.a)

In [None]:
for a in soup.find_all("a"):
    if a.has_attr("data-delay"):
        print("YES: " + a.text)
    else:
        print("NO: " + a.text)
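
If only the links that do have the attribute are needed, the check in the loop above can also be pushed into the search itself. This is a minimal sketch on made-up links (the markup and the `demo` name are assumptions; `data-delay` is the attribute tested above):

```python
from bs4 import BeautifulSoup

# Hypothetical links; "data-delay" is the attribute tested in the loop above.
snippet = """
<a href="/one" data-delay="50">With delay</a>
<a href="/two">Without delay</a>
<a href="/three" data-delay="100">Also delayed</a>
"""
demo = BeautifulSoup(snippet, "html.parser")

# A value of True keeps only tags where the attribute is present
delayed = demo.find_all("a", attrs={"data-delay": True})
# The CSS attribute selector expresses the same filter
delayed_css = demo.select("a[data-delay]")

print([a.text for a in delayed])
print([a.text for a in delayed_css])
```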


In [74]:
response = requests.get('https://github.com/igorsteinmacher/')
content = response.content
soup = BeautifulSoup(content, "html.parser")
soup


<!DOCTYPE html>

<html lang="en">
<head>
<meta charset="utf-8"/>
<link href="https://github.githubassets.com" rel="dns-prefetch"/>
<link href="https://avatars0.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars1.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars2.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://avatars3.githubusercontent.com" rel="dns-prefetch"/>
<link href="https://github-cloud.s3.amazonaws.com" rel="dns-prefetch"/>
<link href="https://user-images.githubusercontent.com/" rel="dns-prefetch"/>
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/frameworks-2b0800d93d72358b528dc71d5a482020.css" integrity="sha512-KwgA2T1yNYtSjccdWkggIHevOjkIgZqdsUtA0GyLMFalickiFVC+1RUh0JVQ5/gWaj/HQ/p5JIbABYtB68iwcQ==" media="all" rel="stylesheet">
<link crossorigin="anonymous" href="https://github.githubassets.com/assets/site-81b8499d5fa110569cb6fa2fe68c5b52.css" integrity="sha512-gbhJnV+hEFactvov5oxbUhQzmfVk0

In [86]:
bio = soup.find('div', attrs={'class':'p-note'})
list(bio.children)[0].text


'Assistant Professor @ NAU.\nResearcher: Mining Software Repositories and behavior in Open Source are my main topics'
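
Real pages change over time, and `find()` returns `None` when nothing matches, so a lookup like the one above would then fail with an `AttributeError`. One possible guard is sketched below on made-up markup; the `text_of` helper, its parameters, and the snippet are assumptions for illustration (only the `p-note` class comes from the cell above).

```python
from bs4 import BeautifulSoup

# Hypothetical profile markup; only the "p-note" class comes from the cell above.
snippet = '<div class="p-note"><div>Researcher and teacher</div></div>'
demo = BeautifulSoup(snippet, "html.parser")


def text_of(soup, tag_name, class_name, default=""):
    """Return the stripped text of the first matching tag, or default."""
    tag = soup.find(tag_name, class_=class_name)
    return tag.get_text(strip=True) if tag is not None else default


print(text_of(demo, "div", "p-note"))
print(text_of(demo, "div", "missing-class", default="(no bio found)"))
```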