# Intro to Webscraping
- Use `requests` to download the HTML
- Use BeautifulSoup to parse the HTML and extract the parts you need

## Process
- Step 1: use the `requests` library to make an HTTP request to the page.
- Step 2: read the `response.text` property of the returned response to get the HTML as a string.

In [1]:
from requests import get
from bs4 import BeautifulSoup

In [2]:
url = "https://site-to-scrape.glitch.me/"
# Headers tell the server who is making the request:
headers = {'User-Agent': 'Codeup Data Science'} # Some websites don't accept the default python-requests user agent

In [3]:
response = get(url, headers=headers)
response

<Response [200]>
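
A `200` status means the request succeeded. As a minimal sketch of checking this in code, the snippet below uses `response.ok` and `raise_for_status()` from `requests`. To avoid a network call it builds a bare `Response` by hand and gives it a 404 status; that hand-built object and its URL are assumptions for illustration only. In the notebook you would call these on the `response` returned by `get`.

```python
from requests import Response
from requests.exceptions import HTTPError

# Hand-built response standing in for a failed request (illustration only).
fake_response = Response()
fake_response.status_code = 404
fake_response.url = "https://site-to-scrape.glitch.me/missing"  # hypothetical path

# .ok is False for 4xx and 5xx status codes.
is_ok = fake_response.ok

# raise_for_status() turns a bad status into an exception.
try:
    fake_response.raise_for_status()
    raised = False
except HTTPError:
    raised = True

print(is_ok, raised)
```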

In [4]:
response.text

'<!DOCTYPE html>\n<html lang="en">\n  <head>\n    <title>Site to Scrape!</title>\n    <meta charset="utf-8">\n    <meta http-equiv="X-UA-Compatible" content="IE=edge">\n    <meta name="viewport" content="width=device-width, initial-scale=1">\n    \n    <!-- import the webpage\'s stylesheet -->\n    <link rel="stylesheet" href="/style.css">\n    \n    <!-- import the webpage\'s javascript file -->\n    <script src="/script.js" defer></script>\n  </head>  \n  <body>\n    <header>\n      <h1>This is the header!</h1>\n      <hr>\n    </header>\n    \n    <main>\n      <div>\n        <h1 class="first">\n        This is the main\n        </h1>\n        <h2>\n          This is an h2 of main\n        </h2>\n        <h3>\n          H3 inside of first div inside of main\n        </h3>\n      </div>\n      <div>\n        <h3 class="first">\n          H3 inside of second div inside of main.\n        </h3>\n        <p>\n          Here\'s some text content for us to scrape! 👽\n        </p>\n        

In [5]:
# Make a soup variable holding the response content
soup = BeautifulSoup(response.content, 'html.parser')

In [6]:
# Displaying the soup shows the parsed HTML document:
soup

<!DOCTYPE html>

<html lang="en">
<head>
<title>Site to Scrape!</title>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<!-- import the webpage's stylesheet -->
<link href="/style.css" rel="stylesheet"/>
<!-- import the webpage's javascript file -->
<script defer="" src="/script.js"></script>
</head>
<body>
<header>
<h1>This is the header!</h1>
<hr/>
</header>
<main>
<div>
<h1 class="first">
        This is the main
        </h1>
<h2>
          This is an h2 of main
        </h2>
<h3>
          H3 inside of first div inside of main
        </h3>
</div>
<div>
<h3 class="first">
          H3 inside of second div inside of main.
        </h3>
<p>
          Here's some text content for us to scrape! 👽
        </p>
<p>
          Here's another paragraph of content! ☠️
        </p>
<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>
</div>
</main>
<footer>
<h1>This 

In [7]:
# You can access the first element of a given tag as an attribute of the soup:
soup.title

<title>Site to Scrape!</title>

In [8]:
soup.h1

<h1>This is the header!</h1>

In [9]:
# For just the text inside an element, add .text at the end:
soup.h1.text

'This is the header!'

In [10]:
soup.body

<body>
<header>
<h1>This is the header!</h1>
<hr/>
</header>
<main>
<div>
<h1 class="first">
        This is the main
        </h1>
<h2>
          This is an h2 of main
        </h2>
<h3>
          H3 inside of first div inside of main
        </h3>
</div>
<div>
<h3 class="first">
          H3 inside of second div inside of main.
        </h3>
<p>
          Here's some text content for us to scrape! 👽
        </p>
<p>
          Here's another paragraph of content! ☠️
        </p>
<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>
</div>
</main>
<footer>
<h1>This is the footer</h1>
<img alt="" aria-hidden="true" src="https://traffic-analytics.glitch.me/counter.png?fallback=MY_WEBSITE&amp;color=black" style="vertical-align: bottom;"/>
</footer>
</body>

In [11]:
# .text is a plain string, so string methods such as .strip() and slicing work on it:
soup.h2.text.strip()[-5:]

' main'

In [12]:
# In comparison, the raw text keeps the surrounding whitespace:
soup.h2.text

'\n          This is an h2 of main\n        '
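
As a small sketch of the same cleanup, `get_text(strip=True)` strips the whitespace while extracting the text. The HTML string here is a made-up stand-in for the page's `h2`, so the example runs without a network request.

```python
from bs4 import BeautifulSoup

# Stand-in HTML mirroring the h2 on the page (assumed for illustration).
html = "<h2>\n          This is an h2 of main\n        </h2>"
h2 = BeautifulSoup(html, "html.parser").h2

raw_text = h2.text                   # keeps the newlines and indentation
stripped = h2.text.strip()           # plain string method
cleaned = h2.get_text(strip=True)    # Beautiful Soup strips each text piece

print(repr(raw_text))
print(stripped == cleaned)
```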

In [13]:
# soup.prettify() returns the HTML as an indented string; wrap it in print() to view it with real line breaks
soup.prettify()


'<!DOCTYPE html>\n<html lang="en">\n <head>\n  <title>\n   Site to Scrape!\n  </title>\n  <meta charset="utf-8"/>\n  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>\n  <meta content="width=device-width, initial-scale=1" name="viewport"/>\n  <!-- import the webpage\'s stylesheet -->\n  <link href="/style.css" rel="stylesheet"/>\n  <!-- import the webpage\'s javascript file -->\n  <script defer="" src="/script.js">\n  </script>\n </head>\n <body>\n  <header>\n   <h1>\n    This is the header!\n   </h1>\n   <hr/>\n  </header>\n  <main>\n   <div>\n    <h1 class="first">\n     This is the main\n    </h1>\n    <h2>\n     This is an h2 of main\n    </h2>\n    <h3>\n     H3 inside of first div inside of main\n    </h3>\n   </div>\n   <div>\n    <h3 class="first">\n     H3 inside of second div inside of main.\n    </h3>\n    <p>\n     Here\'s some text content for us to scrape! 👽\n    </p>\n    <p>\n     Here\'s another paragraph of content! ☠️\n    </p>\n    <a href="https://github.com/r

In [14]:
# soup.find_all("a") finds all the anchor tags, or whatever tag name is passed as the argument.
soup.find_all("a")

[<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>]
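
`find_all` can also filter by class, and `find` returns only the first match. The following is a rough sketch on made-up HTML (the markup and the `demo_soup` name are assumptions, not the scraped page):

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<main>
  <div><h1 class="first">This is the main</h1><h3>First h3</h3></div>
  <div><h3 class="first">Second h3</h3><a href="https://example.com">A link</a></div>
</main>
"""
demo_soup = BeautifulSoup(html, "html.parser")

h3_tags = demo_soup.find_all("h3")                    # every h3
first_h3 = demo_soup.find_all("h3", class_="first")   # h3 tags with class "first"
one_link = demo_soup.find("a")                        # first match only, or None

print(len(h3_tags), len(first_h3), one_link["href"])
```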

_____________________________________

### soup.select is a really good one to understand and know out of all of this!

In [15]:
# The soup.select() method takes a CSS selector as a string and returns all matching elements.
soup.select("header")

[<header>
 <h1>This is the header!</h1>
 <hr/>
 </header>]

In [16]:
type(soup.select('p'))

bs4.element.ResultSet

In [17]:
soup.select('p')

[<p>
           Here's some text content for us to scrape! 👽
         </p>,
 <p>
           Here's another paragraph of content! ☠️
         </p>]

A `ResultSet` has no `.text` of its own, so we can't grab the text of every `p` tag at once, but we can iterate through the results with a for loop:

In [18]:
for p in soup.select('p'):
    print(p.text)


          Here's some text content for us to scrape! 👽
        

          Here's another paragraph of content! ☠️
        


#### Notes:
`.select` always returns a `ResultSet`, which behaves like a list, even if there is only one match.

In [19]:
#example:
soup.select('img')

[<img alt="" aria-hidden="true" src="https://traffic-analytics.glitch.me/counter.png?fallback=MY_WEBSITE&amp;color=black" style="vertical-align: bottom;"/>]
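
To make the list-like behavior concrete, here is a minimal sketch on made-up HTML (the markup and variable names are assumptions). It shows that a single match still needs indexing, that no match gives an empty result, and that `.text` belongs to individual elements rather than to the `ResultSet`.

```python
from bs4 import BeautifulSoup

html = '<footer><img alt="" src="/counter.png"/></footer>'  # made-up HTML
demo_soup = BeautifulSoup(html, "html.parser")

images = demo_soup.select("img")      # one match, still a list-like ResultSet
videos = demo_soup.select("video")    # no match gives an empty ResultSet

is_list = isinstance(images, list)
first_src = images[0]["src"]          # index first, then treat it as a Tag

# Asking the whole ResultSet for .text fails; it belongs to single elements.
try:
    images.text
    result_set_has_text = True
except AttributeError:
    result_set_has_text = False

print(is_list, len(images), len(videos), first_src, result_set_has_text)
```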

________________________________

In [20]:
#if you only want to select one match, you can use select_one
soup.select_one('a')

<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>
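
One thing to watch for: `select_one` returns `None` when nothing matches, so reading `.text` or an attribute from the result can fail. A minimal sketch of guarding against that, using made-up HTML:

```python
from bs4 import BeautifulSoup

html = '<div><a href="https://example.com">A link</a></div>'  # made-up HTML
demo_soup = BeautifulSoup(html, "html.parser")

link = demo_soup.select_one("a")          # first matching Tag
missing = demo_soup.select_one("table")   # nothing matches, so this is None

# Guard against None before reading text or attributes.
table_text = missing.text if missing is not None else ""

print(link.text, missing, repr(table_text))
```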

_______________________________

### Nested tags

In [21]:
soup.select('body')[0]

<body>
<header>
<h1>This is the header!</h1>
<hr/>
</header>
<main>
<div>
<h1 class="first">
        This is the main
        </h1>
<h2>
          This is an h2 of main
        </h2>
<h3>
          H3 inside of first div inside of main
        </h3>
</div>
<div>
<h3 class="first">
          H3 inside of second div inside of main.
        </h3>
<p>
          Here's some text content for us to scrape! 👽
        </p>
<p>
          Here's another paragraph of content! ☠️
        </p>
<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>
</div>
</main>
<footer>
<h1>This is the footer</h1>
<img alt="" aria-hidden="true" src="https://traffic-analytics.glitch.me/counter.png?fallback=MY_WEBSITE&amp;color=black" style="vertical-align: bottom;"/>
</footer>
</body>

In [22]:
# Dropping the tags and keeping just the text:
soup.select('body')[0].text

"\n\nThis is the header!\n\n\n\n\n\n        This is the main\n        \n\n          This is an h2 of main\n        \n\n          H3 inside of first div inside of main\n        \n\n\n\n          H3 inside of second div inside of main.\n        \n\n          Here's some text content for us to scrape! 👽\n        \n\n          Here's another paragraph of content! ☠️\n        \nClick here to visit my portfolio\n\n\n\nThis is the footer\n\n\n"

In [23]:
# Use dictionary syntax to access an attribute value (here, the URL the link points to):
soup.select_one('a')['href']

'https://github.com/ryanorsinger'
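
Beyond `tag['href']`, a tag offers a few other ways to read attributes. The sketch below uses made-up HTML (an assumption for illustration) to show `.attrs`, the list returned for `class`, and `.get()` for attributes that may be absent.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one anchor with attributes, one without an href.
html = '<a href="https://example.com" class="btn primary">Go</a><a>No target</a>'
demo_soup = BeautifulSoup(html, "html.parser")
with_href, without_href = demo_soup.select("a")

href = with_href["href"]               # dict syntax; raises KeyError if absent
all_attrs = with_href.attrs            # every attribute as a dict
classes = with_href["class"]           # class is multi-valued, so it is a list
fallback = without_href.get("href")    # .get() returns None instead of raising

print(href, all_attrs, classes, fallback)
```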

In [26]:
# Assign the link's URL to a variable:
url = soup.select_one('a')['href']
url

'https://github.com/ryanorsinger'

In [33]:
# Request the new URL with the same headers:
response2 = get(url, headers=headers)
response2

<Response [200]>

In [31]:
# Parse the second response with BeautifulSoup as well:
github_soup = BeautifulSoup(response2.content, 'html.parser')

In [32]:
# Collect the URL from every link on the page:
anchors = github_soup.select('a')
urls = []
for a in anchors:
    # To access an HTML tag's attribute, use dict syntax
    href = a['href']
    urls.append(href)
    
urls

['#start-of-content',
 'https://github.com/',
 '/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E&source=header',
 '/features',
 '/mobile',
 '/features/actions',
 '/features/codespaces',
 '/features/copilot',
 '/features/packages',
 '/features/security',
 '/features/code-review',
 '/features/issues',
 '/features/discussions',
 '/features/integrations',
 '/sponsors',
 '/customer-stories',
 '/team',
 '/enterprise',
 '/explore',
 '/topics',
 '/collections',
 '/trending',
 'https://skills.github.com/',
 '/sponsors/explore',
 'https://opensource.guide',
 '/readme',
 '/events',
 'https://github.community',
 'https://education.github.com',
 'https://stars.github.com',
 '/marketplace',
 '/pricing',
 '/pricing#compare-features',
 'https://github.com/enterprise/contact',
 'https://education.github.com',
 '',
 '',
 '',
 '',
 '/login?return_to=https%3A%2F%2Fgithub.com%2Fryanorsinger',
 '/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F%3Cuser-name%3E&source
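
The loop above raises a `KeyError` if any anchor lacks an `href`, and the output contains empty strings. One possible way to handle both, sketched on made-up HTML rather than the real GitHub page, is to select only anchors that have the attribute and skip empty values:

```python
from bs4 import BeautifulSoup

# Made-up HTML with a missing href and an empty href.
html = """
<a href="/features">Features</a>
<a>No href at all</a>
<a href="">Empty href</a>
<a href="https://github.com/">Home</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# 'a[href]' matches only anchors that have an href attribute,
# and the condition drops the empty strings.
demo_urls = [a["href"] for a in demo_soup.select("a[href]") if a["href"]]

print(demo_urls)
```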

In [35]:
# Many of the links above are relative; response2.url holds the final URL of the response, which can serve as the base when building full URLs:
response2.url

'https://github.com/ryanorsinger'
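
As a sketch of how that base URL might be used, the standard library's `urljoin` resolves relative links against it. The `hrefs` list is a hand-picked sample of the values shown earlier, and `base_url` is typed in by hand here so the snippet runs on its own.

```python
from urllib.parse import urljoin

base_url = "https://github.com/ryanorsinger"  # the value of response2.url above
hrefs = ["/features", "https://skills.github.com/", "#start-of-content"]

# urljoin leaves absolute URLs alone and resolves relative ones against the base.
full_urls = [urljoin(base_url, href) for href in hrefs]

print(full_urls)
```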

In [None]:
# Indexing a result gives a single Tag element, so .text works on it:
soup.find_all('h3')[0].text

In [None]:
# Looking at child elements:
list(soup.children)
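
Note that `.children` yields text nodes as well as tags, so whitespace between elements shows up in the list. A minimal sketch, on made-up HTML, of keeping only the tags:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div>\n<p>one</p>\n<p>two</p>\n</div>"  # made-up HTML
div = BeautifulSoup(html, "html.parser").div

children = list(div.children)   # includes the newline strings between tags
tag_children = [child for child in children if isinstance(child, Tag)]
tag_names = [child.name for child in tag_children]

print(len(children), tag_names)
```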

### CSS selectors:

In [36]:
# .select and .select_one take CSS selectors
# A bare tag name selects every element of that type
soup.select('p')

[<p>
           Here's some text content for us to scrape! 👽
         </p>,
 <p>
           Here's another paragraph of content! ☠️
         </p>]

In [37]:
soup.select('a')

[<a href="https://github.com/ryanorsinger">Click here to visit my portfolio</a>]

In [38]:
# .classname: the leading . means 'hey, I'm looking for a class'
soup.select(".first")

[<h1 class="first">
         This is the main
         </h1>,
 <h3 class="first">
           H3 inside of second div inside of main.
         </h3>]
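
Selectors can be combined. The sketch below runs a few common patterns against made-up HTML; the markup, the `content` id, and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML for illustration.
html = """
<main id="content">
  <div><h1 class="first">Main heading</h1><h3>Plain h3</h3></div>
  <div><h3 class="first">Classed h3</h3><p>Text</p></div>
</main>
"""
demo_soup = BeautifulSoup(html, "html.parser")

by_class = demo_soup.select(".first")          # any tag with class "first"
tag_and_class = demo_soup.select("h3.first")   # only h3 tags with that class
by_id = demo_soup.select("#content")           # the element with id "content"
direct_child = demo_soup.select("div > p")     # p tags directly inside a div

print(len(by_class), tag_and_class[0].text, by_id[0].name, len(direct_child))
```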

In [40]:
# Grabbing a link on the GitHub page: find its selector via Inspect in the browser, then pull an attribute or the text from the element
github_soup.select('a.Link--primary')[0]['href']

'/features'

In [42]:
# Grabbing the baby shark icon's link on the webpage, using a full selector path:
baby_shark_selector = '#js-pjax-container > div.container-xl.px-3.px-md-4.px-lg-5 > div > div.Layout-sidebar > div > div.js-profile-editable-replace > div:nth-child(4) > div.d-flex.flex-wrap > a:nth-child(2)'
github_soup.select(baby_shark_selector)[0]['href']

'/ryanorsinger?achievement=pull-shark&tab=achievements'
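
Long selector paths like this one depend on the exact page layout. One possible alternative, sketched here on made-up HTML that only loosely imitates a profile sidebar, is an attribute selector that matches on part of the `href`. Whether this would work on the real page is an assumption that would need checking in the browser.

```python
from bs4 import BeautifulSoup

# Made-up HTML loosely imitating a profile sidebar; the real page differs.
html = """
<div class="Layout-sidebar">
  <div class="d-flex flex-wrap">
    <a href="/someuser?achievement=quickdraw">Quickdraw</a>
    <a href="/someuser?achievement=pull-shark">Pull Shark</a>
  </div>
</div>
"""
demo_soup = BeautifulSoup(html, "html.parser")

# [href*="..."] matches when the attribute value contains the given substring.
shark_link = demo_soup.select_one('a[href*="achievement=pull-shark"]')

print(shark_link["href"])
```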

#### Grabbing dictionary tags (or items that are holding other items)

In [46]:
peace_icon = github_soup.select('#js-pjax-container > div.container-xl.px-3.px-md-4.px-lg-5 > div > div.Layout-sidebar > div > div.js-profile-editable-replace > div.border-top.color-border-muted.pt-3.mt-3.clearfix.hide-sm.hide-md > a:nth-child(3) > img')

In [48]:
# Indexing the ResultSet gives a single Tag element:
peace_icon[0]

<img alt="@codeup-ad-lister" class="avatar" data-view-component="true" height="32" size="32" src="https://avatars.githubusercontent.com/u/11996572?s=64&amp;v=4" width="32"/>

In [49]:
peace_icon[0].h2