<h1 align="center">WEB SCRAPING</h1>
<h2 align="left"><ins>Lesson Guide</ins></h2>

- [**BASIC COMPONENTS OF A WEBSITE**](#intro)
    - [**html**](#html)
    - [**css**](#css)
- [**WHAT IS HTML PARSING?**](#parsing)
    - [**Import a Python HTML Parser**](#python)
    - [**Exploring the Properties of html Objects**](#explore)
        - [**`find` and `find_all`**](#find)
- [**WEB SCRAPING WITH BEAUTIFULSOUP**](#soup)
    - [**Grabbing the title of a page**](#title_page)
    - [**Grabbing all elements of a class**](#classes)
    - [**Getting an Image from a Website**](#images)
- [**Example 1: Scrape data from http://books.toscrape.com/**](#eg1)
- [**Example 2: Working with Multiple Pages and Items**](#eg2)
- [**Example 3: www.basketball-reference.com**](#eg3)
- [**Example 4: Abraham Lincoln Quotes**](#eg4)
- [**Web Services and APIs**](#apis)
- [**Example 5: Hacker News**](#hacker)
- [**XPath, Scrapy Selector and Scrapy Framework**](#xpath)
- [**Selenium**](#selenium)

### Documentation
https://www.crummy.com/software/BeautifulSoup/bs4/doc/<br>
https://developer.mozilla.org/en-US/docs/Learn/CSS/Building_blocks/Selectors


[Additional Resource: best practices for web scraping](https://www.synerzip.com/blogs/web-scraping-introduction-applications-and-best-practices/#:~:text=Web%20scraping%20typically%20extracts%20large,show%20data%20from%20a%20website.)

#### First Things First
The following packages are required:
- `pip install requests`
- `pip install selenium`
- `conda install lxml`
- `conda install scrapy`
- `conda install beautifulsoup4`

Check the version of your Chrome browser (version 74, 73, 72 etc.) and install the matching web driver from the following link:<br>
[Selenium Web Drivers for Chrome](https://sites.google.com/a/chromium.org/chromedriver/downloads)

The following tool will help you build/test XPaths from an HTML tag.<br>
[XPath Helper for Chrome](https://chrome.google.com/webstore/detail/xpath-helper/hgimnogjllphhhkhlmebbmlgjoejdpjl?hl=en)

#### How to check what is allowed to be scraped:

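Append `/robots.txt` to the site's root URL, for example https://en.wikipedia.org/robots.txt. The file lists the paths that crawlers are asked not to visit.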
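
The same check can be done in code. The sketch below uses the standard-library `urllib.robotparser` module on a made-up `robots.txt` (the rules and the `www.example.com` URLs are assumptions for illustration, not taken from any real site), so it runs without network access. For a live site you would call `set_url()` and `read()` on the parser instead of `parse()`.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed from a string so the sketch runs offline.
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(useragent, url) reports whether the rules permit fetching the URL.
allowed_home = parser.can_fetch("*", "http://www.example.com/index.html")
allowed_private = parser.can_fetch("*", "http://www.example.com/private/data.html")
print(allowed_home, allowed_private)
```

A `robots.txt` file is a request from the site owner rather than a technical barrier, so it is still worth reading the site's terms of use.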

In [1]:
from bs4 import BeautifulSoup
import lxml
import requests
import pandas as pd
# from scrapy.selector import Selector
# from scrapy.http import HtmlResponse

<a id='intro'></a>
## BASIC COMPONENTS OF A WEBSITE

#### Let's Familiarize Ourselves with the Jargon:
1. **WWW:** World Wide Web (simply put....The collection of web pages and resources that we reach over the Internet)
2. **HTTP:** Hyper Text Transfer Protocol (simply put.....The protocol two devices use to communicate with each other over the Internet)
3. **HTML:** Hyper Text Markup Language (simply put....Language to write web pages)
4. **XML:** eXtensible Markup Language (simply put....A text format for exchanging data, organized as nested tags)
5. **JSON:** JavaScript Object Notation (simply put....A text format for exchanging data, organized as nested key-value pairs and lists)
6. **API:** Application Programming Interface (simply put....A defined way for two programs to communicate with each other, irrespective of the programming languages they are individually written in)
7. **CSS:** Cascading Style Sheets (simply put....A style guide used to beautify web/HTML pages)
8. **DOM:** Document Object Model (simply put.....The tree-like organization of elements inside an HTML/XML page)
9. **XPATH:** XML Path (simply put....A query language which understands the organization inside HTML/XML files and can query information from any part of those files)
10. **URL:** Uniform Resource Locator (simply put...Web Address :) )

<a id='html'></a>
### <ins>html</ins>
HTML stands for Hypertext Markup Language, and virtually every website uses it to display information. Even the Jupyter Notebook system uses it to display this notebook in your browser. If you right-click on a website and select "View Page Source" you can see the raw HTML of a web page. This is the text that Python will read to extract information. Let's take a look at a simple webpage's HTML:

    <!DOCTYPE html>  
    <html>  
        <head>
            <title>Title on Browser Tab</title>
        </head>
        <body>
            <h1> Website Header </h1>
            <p> Some Paragraph </p>
        </body>
    </html>

Let's break down these components.

Every `<tag>` indicates a specific block type on the webpage:

    1. <!DOCTYPE html> HTML documents always start with this declaration, letting the browser know it's an HTML file.
    2. The component blocks of the HTML document are placed between <html> and </html>.
    3. Metadata and script connections (like a link to a CSS file or a JS file) are often placed in the <head> block.
    4. The <title> tag block defines the title of the webpage (it's what shows up in the tab of a website you're visiting).
    5. Between the <body> and </body> tags are the blocks that will be visible to the site visitor.
    6. Headings are defined by the <h1> through <h6> tags, where the number represents the level of the heading (<h1> is the top level and, by default, the largest).
    7. Paragraphs are defined by the <p> tag; this is essentially just normal text on the website.

    There are many more tags than just these, such as <a> for hyperlinks, <table> for tables, <tr> for table rows, <td> for table cells, and many more.

<a id='css'></a>
### <ins>css</ins>
CSS stands for Cascading Style Sheets. This is what gives "style" to a website, including colors and fonts, and even some animations! CSS uses attributes such as **id** or **class** to connect an HTML element to a CSS rule, such as a particular color. An **id** must be unique within the HTML document, so it is basically a single-use connection. A **class** defines a general style that can then be linked to multiple HTML tags. Basically, if you only want a single HTML tag to be red, you would use an id; if you wanted several HTML tags/blocks to be red, you would create a class in your CSS doc and then link it to each of those blocks.

<a id='parsing'></a>
## WHAT IS HTML PARSING?
Below is some HTML. To Python, however, it is just a multiline string: Python does not know that this string contains a head tag with a corresponding /head closing tag.

For this reason, we need an HTML parser which can parse through this string and understand the HTML hierarchy with starting and corresponding closing tags.

In [2]:
html_str = '''
<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first line of the first paragraph.\nThis is the second line of the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <!-- this is the end -->
  </body>
</html>
'''
print (type(html_str))
print (html_str)

<class 'str'>

<html>
  <head>
    <title>My page</title>
  </head>
  <body>
    <h2>Welcome to my <a href="#">page</a></h2>
    <p>This is the first line of the first paragraph.
This is the second line of the first paragraph.</p>
    <p>This is the second paragraph.</p>
    <!-- this is the end -->
  </body>
</html>



<a id='python'></a>
### <ins>Import a Python HTML Parser</ins>

In [3]:
# from bs4 import BeautifulSoup
# A data structure representing a parsed HTML or XML document.

html = BeautifulSoup(markup=html_str,features='lxml')     # features = 'html.parser'
print (type(html),end='\n------\n')
print (html)

<class 'bs4.BeautifulSoup'>
------
<html>
<head>
<title>My page</title>
</head>
<body>
<h2>Welcome to my <a href="#">page</a></h2>
<p>This is the first line of the first paragraph.
This is the second line of the first paragraph.</p>
<p>This is the second paragraph.</p>
<!-- this is the end -->
</body>
</html>



<a id='explore'></a>
### <ins>Exploring the Properties of html Objects</ins>

In [4]:
print(html.title)

<title>My page</title>


In [5]:
print(html.title.text)

My page


In [6]:
print(html.body.text)


Welcome to my page
This is the first line of the first paragraph.
This is the second line of the first paragraph.
This is the second paragraph.




In [7]:
html.p

<p>This is the first line of the first paragraph.
This is the second line of the first paragraph.</p>

In [8]:
html.text

'\n\nMy page\n\n\nWelcome to my page\nThis is the first line of the first paragraph.\nThis is the second line of the first paragraph.\nThis is the second paragraph.\n\n\n\n'

In [9]:
[line for line in html.text.split('\n') if line !='']

['My page',
 'Welcome to my page',
 'This is the first line of the first paragraph.',
 'This is the second line of the first paragraph.',
 'This is the second paragraph.']

<a id='find'></a>
### <ins>`find` and `find_all`</ins>

In [10]:
html.find('p')

<p>This is the first line of the first paragraph.
This is the second line of the first paragraph.</p>

In [11]:
html.find('p').text

'This is the first line of the first paragraph.\nThis is the second line of the first paragraph.'

In [12]:
html.find_all('p')

[<p>This is the first line of the first paragraph.
 This is the second line of the first paragraph.</p>,
 <p>This is the second paragraph.</p>]

In [14]:
[line.text for line in html.find_all('p')]

['This is the first line of the first paragraph.\nThis is the second line of the first paragraph.',
 'This is the second paragraph.']

In [15]:
[line.text.split('\n') for line in html.find_all('p')]

[['This is the first line of the first paragraph.',
  'This is the second line of the first paragraph.'],
 ['This is the second paragraph.']]

In [16]:
# Using a for loop

# flat_list = []
# for line in html.find_all('p'):
#     for item in line.text.split('\n'):
#         flat_list.append(item)
# flat_list

# Using a list comprehension
flat_list = [item for line in html.find_all('p') for item in line.text.split('\n')]
flat_list

['This is the first line of the first paragraph.',
 'This is the second line of the first paragraph.',
 'This is the second paragraph.']

In [17]:
import itertools

list2d = [line.text.split('\n') for line in html.find_all('p')]

# merged = list(itertools.chain(*list2d))    # unpacking the list with the * operator
merged = list(itertools.chain.from_iterable(list2d))
merged

['This is the first line of the first paragraph.',
 'This is the second line of the first paragraph.',
 'This is the second paragraph.']

In [18]:
import functools
import operator

functools.reduce(operator.iconcat, list2d, [])

['This is the first line of the first paragraph.',
 'This is the second line of the first paragraph.',
 'This is the second paragraph.']

Notice how the line `<!-- this is the end -->` does not appear in any of the text output. It is an HTML comment: comments are not displayed to the visitor and are only seen when you view the actual HTML. They are also useful for commenting out a block of code while you're testing, which saves you from cutting the code, pasting it elsewhere, and copying it back later. Comments are not tags, so `find` and `find_all` do not match them by name:

In [19]:
print(html.find('!--'))

None


In [20]:
html.find_all('!--')

[]
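
Comments can still be retrieved, because BeautifulSoup stores them as `Comment` objects (a special kind of string) rather than as tags. The sketch below is one way to collect them; the `snippet` string is a made-up example, and it uses Python's built-in `html.parser` only so that it runs without `lxml` installed.

```python
from bs4 import BeautifulSoup, Comment

# Made-up snippet for illustration; "html.parser" ships with Python.
snippet = "<body><p>Visible text</p><!-- this is the end --></body>"
soup = BeautifulSoup(snippet, "html.parser")

# Match strings (not tags) and keep only those that are Comment objects.
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [c.strip() for c in comments]
print(comment_texts)
```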

<a id='soup'></a>
## WEB SCRAPING WITH BEAUTIFULSOUP

<a id='title_page'></a>
### <ins>Grabbing the title of a page</ins>

Let's start very simple: we will grab the title of a page. Remember that this is the HTML block with the **title** tag. For this task we will use **www.example.com**, which is a website specifically made to serve as an example domain. Let's go through the main steps:

#### Step 1: Use the requests library to grab the page

In [21]:
# Note, this may fail if you have a firewall blocking Python/Jupyter 
# Note sometimes you need to run this twice if it fails the first time

res = requests.get("http://www.example.com")

This object is a `requests.models.Response` object and it actually contains the information from the website, for example:

In [22]:
type(res)

requests.models.Response
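
Before reading the page text, it can help to confirm that the request actually succeeded. The helper below is a minimal sketch: the name `fetch_html` and the 10-second timeout are assumptions chosen for illustration, while `timeout=` and `raise_for_status()` are part of the `requests` API.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, raising if the request fails."""
    # timeout stops the call from hanging forever on an unresponsive server.
    res = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    res.raise_for_status()
    return res.text


# Example call (needs network access):
# page = fetch_html("http://www.example.com")
```

Failing loudly on a bad status code avoids parsing an error page as if it were the content you wanted.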

In [23]:
res.text

'<!doctype html>\n<html>\n<head>\n    <title>Example Domain</title>\n\n    <meta charset="utf-8" />\n    <meta http-equiv="Content-type" content="text/html; charset=utf-8" />\n    <meta name="viewport" content="width=device-width, initial-scale=1" />\n    <style type="text/css">\n    body {\n        background-color: #f0f0f2;\n        margin: 0;\n        padding: 0;\n        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;\n        \n    }\n    div {\n        width: 600px;\n        margin: 5em auto;\n        padding: 2em;\n        background-color: #fdfdff;\n        border-radius: 0.5em;\n        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n    }\n    a:link, a:visited {\n        color: #38488f;\n        text-decoration: none;\n    }\n    @media (max-width: 700px) {\n        div {\n            margin: 0 auto;\n            width: auto;\n        }\n    }\n    </style>    \n</head>\n\n<body>\n<div>\n    <

Now we use BeautifulSoup to analyze the extracted page. Technically we could use our own custom script to look for items in the string of **res.text** but the BeautifulSoup library already has lots of built-in tools and methods to grab information from a string of this nature (basically an HTML file). Using BeautifulSoup we can create a "soup" object that contains all the "ingredients" of the webpage. Don't ask me about the weird library names, I didn't choose them! :)

In [24]:
import bs4

In [25]:
soup_html = bs4.BeautifulSoup(markup=res.text,features="lxml")

In [26]:
soup_html

<!DOCTYPE html>
<html>
<head>
<title>Example Domain</title>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-type"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<style type="text/css">
    body {
        background-color: #f0f0f2;
        margin: 0;
        padding: 0;
        font-family: -apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;
        
    }
    div {
        width: 600px;
        margin: 5em auto;
        padding: 2em;
        background-color: #fdfdff;
        border-radius: 0.5em;
        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);
    }
    a:link, a:visited {
        color: #38488f;
        text-decoration: none;
    }
    @media (max-width: 700px) {
        div {
            margin: 0 auto;
            width: auto;
        }
    }
    </style>
</head>
<body>
<div>
<h1>Example Domain</h1>
<p>This domain is for use in illustrative examples

Now let's use the **.select()** method to grab elements. We are looking for the 'title' tag, so we will pass in 'title'


In [27]:
soup_html.select('title')

[<title>Example Domain</title>]

Notice what is returned here: it's actually a list containing all the title elements (along with their tags). You can use indexing or even looping to grab the elements from the list. Since each element is still a specialized Tag object, we can use method calls to grab just the text.

In [28]:
title_tag = soup_html.select('title')

In [29]:
title_tag[0]

<title>Example Domain</title>

In [30]:
type(title_tag[0])

bs4.element.Tag

In [31]:
title_tag[0].getText()

'Example Domain'

In [32]:
soup_html.select('title')[0].getText()

'Example Domain'

In [33]:
[title.getText() for title in soup_html.select('title')]

['Example Domain']

In [34]:
res.close()

<a id='classes'></a>
### <ins>Grabbing all elements of a class</ins>
Let's try to grab all the section headings of the Wikipedia Article on Room 641A from this URL: https://en.wikipedia.org/wiki/Room_641A

In [35]:
# First get the request
res = requests.get('https://en.wikipedia.org/wiki/Room_641A')
res

<Response [200]>

In [36]:
# Create a soup from request
soup_wiki = bs4.BeautifulSoup(res.text,"lxml")

Now it's time to figure out what we are actually looking for. Inspect the element on the page to see that the section headers have the class "mw-headline". Because this is a class and not a plain tag, we need to follow CSS selector syntax. The most common patterns are:

<table>

<thead >
<tr>
<th>
<p>Syntax to pass to the .select() method</p>
</th>
<th>
<p>Match Results</p>
</th>
</tr>
</thead>
<tbody>
<tr>
<td>
<p><code>soup.select('div')</code></p>
</td>
<td>
<p>All elements with the <code>&lt;div&gt;</code> tag</p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('#some_id')</code></p>
</td>
<td>
<p>The HTML element containing the <code>id</code> attribute of <code>some_id</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('.notice')</code></p>
</td>
<td>
<p>All the HTML elements with the CSS <code>class</code> named <code>notice</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div span')</code></p>
</td>
<td>
<p>Any elements named <code>&lt;span&gt;</code> that are within an element named <code>&lt;div&gt;</code></p>
</td>
</tr>
<tr>
<td>
<p><code>soup.select('div &gt; span')</code></p>
</td>
<td>
<p>Any elements named <code class="literal2">&lt;span&gt;</code> that are <span><em >directly</em></span> within an element named <code class="literal2">&lt;div&gt;</code>, with no other element in between</p>
</td>
</tr>
</tbody>
</table>
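
The selector patterns in the table can be tried on a small snippet before using them on a real page. The HTML below is made up for illustration (the `main` id and `notice` class are assumptions), and the built-in `html.parser` is used so the sketch runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Made-up snippet for illustration; "html.parser" ships with Python.
demo_html = """
<div id="main">
  <p class="notice">First notice</p>
  <span>direct child</span>
  <p><span>nested span</span></p>
</div>
<p class="notice">Second notice</p>
"""
demo = BeautifulSoup(demo_html, "html.parser")

by_tag = demo.select("div")        # every <div> element
by_id = demo.select("#main")       # the element whose id is "main"
by_class = [p.get_text() for p in demo.select(".notice")]         # class "notice"
descendants = [s.get_text() for s in demo.select("div span")]     # any depth inside <div>
children = [s.get_text() for s in demo.select("div > span")]      # direct children only

print(by_class, descendants, children)
```

Note how `div span` also finds the span nested inside a paragraph, while `div > span` keeps only the span that sits directly under the `<div>`.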

After inspecting the website, there are multiple ways we can go about grabbing the section heading names.

In [37]:
soup_wiki.select(".mw-headline")

[<span class="mw-headline" id="Description">Description</span>,
 <span class="mw-headline" id="Lawsuits">Lawsuits</span>,
 <span class="mw-headline" id="Gallery">Gallery</span>,
 <span class="mw-headline" id="See_also">See also</span>,
 <span class="mw-headline" id="References">References</span>,
 <span class="mw-headline" id="External_links">External links</span>]

In [38]:
for item in soup_wiki.select(".mw-headline"):
    print(item.text)

Description
Lawsuits
Gallery
See also
References
External links


In [39]:
for content in soup_wiki.select(".toctext"):
    print(content.text)

Description
Lawsuits
Gallery
See also
References
External links


In [40]:
res.close()

<a id='images'></a>
### <ins>Getting an Image from a Website</ins>
Let's attempt to grab the Cicada image on this Wikipedia Page: https://en.wikipedia.org/wiki/Cicada_3301

In [41]:
res = requests.get("https://en.wikipedia.org/wiki/Cicada_3301")
res

<Response [200]>

In [42]:
soup_cicada = BeautifulSoup(res.text,'lxml')

In [43]:
soup_cicada.select('.thumbimage')

[<img alt="" class="thumbimage" data-file-height="246" data-file-width="405" decoding="async" height="134" src="//upload.wikimedia.org/wikipedia/en/thumb/7/7e/Cicada_3301_logo.jpg/220px-Cicada_3301_logo.jpg" srcset="//upload.wikimedia.org/wikipedia/en/thumb/7/7e/Cicada_3301_logo.jpg/330px-Cicada_3301_logo.jpg 1.5x, //upload.wikimedia.org/wikipedia/en/7/7e/Cicada_3301_logo.jpg 2x" width="220"/>,
 <img alt="" class="thumbimage" data-file-height="661" data-file-width="3961" decoding="async" height="37" src="//upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cicada_3301_poster_locations.png/220px-Cicada_3301_poster_locations.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cicada_3301_poster_locations.png/330px-Cicada_3301_poster_locations.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/e/e1/Cicada_3301_poster_locations.png/440px-Cicada_3301_poster_locations.png 2x" width="220"/>]

In [44]:
image_info = soup_cicada.select('.thumbimage')[0]

In [45]:
image_info

<img alt="" class="thumbimage" data-file-height="246" data-file-width="405" decoding="async" height="134" src="//upload.wikimedia.org/wikipedia/en/thumb/7/7e/Cicada_3301_logo.jpg/220px-Cicada_3301_logo.jpg" srcset="//upload.wikimedia.org/wikipedia/en/thumb/7/7e/Cicada_3301_logo.jpg/330px-Cicada_3301_logo.jpg 1.5x, //upload.wikimedia.org/wikipedia/en/7/7e/Cicada_3301_logo.jpg 2x" width="220"/>

In [46]:
type(image_info)

bs4.element.Tag

You can make dictionary-like lookups for the attributes of a Tag. In this case, we are interested in the **src**, or "source", of the image, which should be its own .jpg or .png link:

In [47]:
image_info['src']

'//upload.wikimedia.org/wikipedia/en/thumb/7/7e/Cicada_3301_logo.jpg/220px-Cicada_3301_logo.jpg'

Now that you have the actual src link, you can grab the image with `requests.get` and read the raw bytes from the `.content` attribute. Note that the src value starts with `//` and has no scheme, so we prepend `http:` to it; if you don't do this, requests will complain (but it gives you a pretty descriptive error message).

In [48]:
# image_link = requests.get('http://upload.wikimedia.org/wikipedia/en/thumb/7/7e/Cicada_3301_logo.jpg/220px-Cicada_3301_logo.jpg')
image_link = requests.get('http:'+image_info['src'])
image_link

<Response [200]>
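
Instead of gluing `'http:'` onto the string by hand, the standard library's `urllib.parse.urljoin` can resolve a scheme-less `src` against the URL of the page it came from. This is only an alternative sketch; the variable names `page_url`, `src`, and `image_url` are chosen for illustration.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Cicada_3301"
src = "//upload.wikimedia.org/wikipedia/en/thumb/7/7e/Cicada_3301_logo.jpg/220px-Cicada_3301_logo.jpg"

# A URL that starts with "//" inherits the scheme of the page it was found on.
image_url = urljoin(page_url, src)
print(image_url)
```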

In [49]:
# The raw content (it's binary data, meaning we will need to use binary read/write methods for saving it)
image_link.content

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xfe\x00FFile source: https://en.wikipedia.org/wiki/File:Cicada_3301_logo.jpg\xff\xdb\x00C\x00\x06\x04\x05\x06\x05\x04\x06\x06\x05\x06\x07\x07\x06\x08\n\x10\n\n\t\t\n\x14\x0e\x0f\x0c\x10\x17\x14\x18\x18\x17\x14\x16\x16\x1a\x1d%\x1f\x1a\x1b#\x1c\x16\x16 , #&\')*)\x19\x1f-0-(0%()(\xff\xc0\x00\x0b\x08\x00\x86\x00\xdc\x01\x01"\x00\xff\xc4\x00\x1c\x00\x01\x00\x03\x01\x01\x01\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01\x06\x07\x05\x04\x03\x08\x02\xff\xc4\x007\x10\x00\x01\x03\x04\x01\x03\x02\x04\x05\x03\x04\x01\x05\x00\x00\x00\x01\x02\x03\x04\x00\x05\x06\x11\x12\x07!1\x13A\x14"Qa\x08\x152q\x81#R\x91$3B\xa1\x17\x16C\x82\xc1\xe1\xff\xda\x00\x08\x01\x01\x00\x00?\x00\xfdSJR\x94\xa5)JR\x94\xa7\xbd)JR\x94\xa5)JR\x94\xa5)JR\x94\xa5)JR\x94\xa5)JR\x94\xa5)JR\x94\xa5)JR\x94\xa5)JR\x94\xa5)JR\x944\xa5(iJR\x94\xa5)\xefJR\x94\xa5)J\x0fzR\x95\x1b\xd5Ss.\xa6\xe2\x18z\xcb7\xdb\xdcf\xa5\xf6\xff\x00J\xd6\xde{\xbf\x8f\x91\x1b#\xf9\xd5

**Let's write this to a file. Note the 'wb' mode, which opens the file for binary writing.**

In [50]:
f = open('cicada_webscrape.jpg','wb')

In [51]:
f.write(image_link.content)

5892

In [52]:
f.close()
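
The open/write/close steps above can also be written with a `with` block, which closes the file automatically even if the write raises an error. In this sketch the bytes and the output path are stand-ins (assumptions for illustration); in the notebook you would write `image_link.content` to `'cicada_webscrape.jpg'`.

```python
import os
import tempfile

# Stand-in bytes; in the notebook this would be image_link.content.
image_bytes = b"\xff\xd8\xff\xe0 placeholder bytes, not a real image"

# Hypothetical output path in the system temp directory.
out_path = os.path.join(tempfile.gettempdir(), "cicada_webscrape_demo.jpg")

# "wb" opens the file for binary writing; the file is closed when the block ends.
with open(out_path, "wb") as f:
    written = f.write(image_bytes)

print(written)
```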

Now we can display this file right here in the notebook using Markdown:

    <img src='cicada_webscrape.jpg'>
    
Just write the above line in a new Markdown cell and it will display the image we just downloaded!

<img src='cicada_webscrape.jpg'>

<a id='eg1'></a>
### Example 1: Scrape data from http://books.toscrape.com/ and load into a Pandas Dataframe

In [53]:
# import requests

url = 'http://books.toscrape.com/'
res = requests.get(url)
print(res)

<Response [200]>


In [54]:
type(res)

requests.models.Response

In [55]:
# res.text

# res.content

print(type(res.text))
print(type(res.content))

<class 'str'>
<class 'bytes'>


In [56]:
soup_bts = BeautifulSoup(markup=res.text, features='lxml')
# soup_bts

In [57]:
type(soup_bts)

bs4.BeautifulSoup

Suppose we wanted to grab all of the titles from the webpage. There are multiple ways of doing this.

In [58]:
# Method 1
soup_bts.find_all('h3')

[<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>,
 <h3><a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a></h3>,
 <h3><a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>,
 <h3><a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>,
 <h3><a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>,
 <h3><a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>,
 <h3><a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/ind

In [59]:
# this grabs the first title in the list
print(soup_bts.find_all('h3')[0],'\n')

# soup_bts.find_all('h3')[0].find('a').attrs['title']
soup_bts.find_all('h3')[0].find('a')['title']

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3> 



'A Light in the Attic'

In [60]:
# Method 2
soup_bts.select('h3')

[<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 <h3><a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>,
 <h3><a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a></h3>,
 <h3><a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a></h3>,
 <h3><a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a></h3>,
 <h3><a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a></h3>,
 <h3><a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a></h3>,
 <h3><a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/ind

In [61]:
print(soup_bts.select('h3')[0])

soup_bts.select('h3')[0].find('a').attrs['title']

<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>


'A Light in the Attic'

In [62]:
# Method 3
soup_bts.select('h3 a')

[<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>,
 <a href="catalogue/tipping-the-velvet_999/index.html" title="Tipping the Velvet">Tipping the Velvet</a>,
 <a href="catalogue/soumission_998/index.html" title="Soumission">Soumission</a>,
 <a href="catalogue/sharp-objects_997/index.html" title="Sharp Objects">Sharp Objects</a>,
 <a href="catalogue/sapiens-a-brief-history-of-humankind_996/index.html" title="Sapiens: A Brief History of Humankind">Sapiens: A Brief History ...</a>,
 <a href="catalogue/the-requiem-red_995/index.html" title="The Requiem Red">The Requiem Red</a>,
 <a href="catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html" title="The Dirty Little Secrets of Getting Your Dream Job">The Dirty Little Secrets ...</a>,
 <a href="catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html" title="The Coming Woman: A Novel Based on the Life of the 

In [63]:
print(soup_bts.select('h3 a')[0])

# soup_bts.select('h3 a')[0].attrs['title']
soup_bts.select('h3 a')[0]['title']

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>


'A Light in the Attic'

Now suppose we wanted to grab more information about each book and put this into a pandas dataframe. We can do so by grabbing the information we need as follows:

In [64]:
# this contains all the info we will need for each book
soup_bts.find_all('li',{'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

[<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image_container">
 <a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>
 </li>,
 <li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
 <article class="product_pod">
 <div class="image

In [65]:
book_info = soup_bts.find_all('li',{'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})

In [66]:
# looking at the first book
book_info[0]

<li class="col-xs-6 col-sm-4 col-md-3 col-lg-3">
<article class="product_pod">
<div class="image_container">
<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
</div>
<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>
<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
<div class="product_price">
<p class="price_color">Â£51.77</p>
<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>
<form>
<button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
</form>
</div>
</article>
</li>

#### Grabbing the Titles

In [67]:
print(book_info[0].select('h3'))
print()
print(book_info[0].select('h3')[0].find('a'))
print()
print(book_info[0].select('h3')[0].find('a').attrs['title'])
print(book_info[0].select('h3')[0].find('a')['title'])

[<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>]

<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

A Light in the Attic
A Light in the Attic


In [68]:
print(book_info[0].select('h3 a'))
print()
print(book_info[0].select('h3 a')[0]['title'])
print(book_info[0].select('h3 a')[0].attrs['title'])

[<a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

A Light in the Attic
A Light in the Attic


In [69]:
# ensuring we are able to grab all the titles for the dataframe
for book in book_info:
    title = book.select('h3 a')[0]['title']
    print(title)

A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas


#### Grabbing the Links

In [70]:
print(book_info[0].find_all('h3'))
print()
print(book_info[0].find_all('h3')[0].find('a').attrs['href'])
print(book_info[0].find_all('h3')[0].find('a')['href'])

[<h3><a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>]

catalogue/a-light-in-the-attic_1000/index.html
catalogue/a-light-in-the-attic_1000/index.html


In [71]:
print(book_info[0].find_all('a'))
print()
print(book_info[0].find_all('a')[0].attrs['href'])
print(book_info[0].find_all('a')[1]['href'])

[<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>, <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

catalogue/a-light-in-the-attic_1000/index.html
catalogue/a-light-in-the-attic_1000/index.html


In [72]:
print(book_info[0].select('div a'))
print()
print(book_info[0].select('div a')[0]['href'])
print(book_info[0].select('div a')[1].attrs['href'])

[<a href="catalogue/a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>, <a href="catalogue/a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

catalogue/a-light-in-the-attic_1000/index.html
catalogue/a-light-in-the-attic_1000/index.html


In [73]:
book_info[0].select('h3 a')[0]['href']

'catalogue/a-light-in-the-attic_1000/index.html'
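
The `href` values above are relative paths, so they cannot be requested directly. A minimal sketch of turning them into full URLs with the standard library's `urljoin` follows; the `page_url` and `catalogue_url` values are assumptions about which page each link was scraped from.

```python
from urllib.parse import urljoin

# URL of the page the link was scraped from (assumed here to be the front page).
page_url = 'http://books.toscrape.com/index.html'

# Relative href as returned by book_info[0].select('h3 a')[0]['href'].
relative_href = 'catalogue/a-light-in-the-attic_1000/index.html'

absolute_url = urljoin(page_url, relative_href)
print(absolute_url)

# On a catalogue page the same book has a shorter relative href,
# but it resolves to the same absolute URL.
catalogue_url = 'http://books.toscrape.com/catalogue/page-1.html'
same_book = urljoin(catalogue_url, 'a-light-in-the-attic_1000/index.html')
print(same_book == absolute_url)
```

Resolving against the page the link came from, rather than gluing strings together, keeps the result correct when the relative path changes between pages.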

#### Grabbing the Prices

In [74]:
print(book_info[0].select('div p')[1])
print(book_info[0].select('div p')[1].text[1:])

<p class="price_color">Â£51.77</p>
£51.77


In [75]:
print(book_info[0].find('p',{'class': 'price_color'}))
print(book_info[0].find('p',{'class': 'price_color'}).text[1:])

<p class="price_color">Â£51.77</p>
£51.77


In [76]:
print(book_info[0].select('.price_color'))
print(book_info[0].select('.price_color')[0].text[1:])

[<p class="price_color">Â£51.77</p>]
£51.77
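
The stray `Â` in front of the pound sign is most likely an encoding mismatch: the page is UTF-8, but when a response does not declare a charset, `requests` may decode `res.text` as ISO-8859-1. The sketch below reproduces the effect with plain strings and shows a number-based extraction that does not depend on slicing off the first character. `price_to_float` is a hypothetical helper, not part of the notebook.

```python
import re

# The pound sign is two bytes in UTF-8 (0xC2 0xA3).
raw_bytes = '£51.77'.encode('utf-8')

# Decoding those bytes as Latin-1 turns each byte into its own character.
garbled = raw_bytes.decode('latin-1')
print(garbled == 'Â£51.77')


def price_to_float(text):
    """Pull the first decimal number out of a price string."""
    match = re.search(r'\d+(?:\.\d+)?', text)
    if match is None:
        raise ValueError(f'no number found in {text!r}')
    return float(match.group())


print(price_to_float(garbled))
print(price_to_float('£51.77'))
```

A common way to avoid the garbling at the source is to pass `res.content` (bytes) to `BeautifulSoup`, or to set `res.encoding = 'utf-8'` before reading `res.text`.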


#### Grabbing the Ratings

In [77]:
print(book_info[0].find_all('p'))
print()
print(book_info[0].find_all('p')[0].attrs['class'])
print(book_info[0].find_all('p')[0].attrs['class'][1])

[<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>, <p class="price_color">Â£51.77</p>, <p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>]

['star-rating', 'Three']
Three


In [78]:
print(book_info[0].find('p'))
print()
print(book_info[0].find('p').attrs['class'])
print(book_info[0].find('p').attrs['class'][1])

<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>

['star-rating', 'Three']
Three
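
The rating comes back as a word, which is awkward to sort or compare. As a small sketch, a lookup table can convert the class list into a number; `RATING_WORDS` and `rating_to_int` are hypothetical names introduced for illustration.

```python
# Lookup table from the class word to a number of stars.
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def rating_to_int(class_list):
    """Return the star count encoded in a list such as ['star-rating', 'Three']."""
    for name in class_list:
        if name in RATING_WORDS:
            return RATING_WORDS[name]
    return None  # no rating word present


print(rating_to_int(['star-rating', 'Three']))
```

Searching the whole list, instead of indexing position `[1]`, keeps the helper working if the order of the classes ever changes.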


#### Grabbing the Availability Status

In [79]:
print(book_info[0].find('p',{'class': 'instock availability'}))
print()
print(book_info[0].find('p',{'class': 'instock availability'}).text)
print(book_info[0].find('p',{'class': 'instock availability'}).text.replace('\n','').strip())

<p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>



    
        In stock
    

In stock


In [80]:
print(book_info[0].select('div p'))
print()
print(book_info[0].select('div p')[2].text.replace('\n','').strip())

[<p class="star-rating Three">
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
<i class="icon-star"></i>
</p>, <p class="price_color">Â£51.77</p>, <p class="instock availability">
<i class="icon-ok"></i>
    
        In stock
    
</p>]

In stock
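
An alternative to chaining `.replace('\n','')` and `.strip()` is to split on whitespace and re-join, which also collapses runs of spaces inside the text. The sketch below works on a plain string shaped like the `.text` output shown above.

```python
# Text shaped like the .text of the availability tag above.
raw_text = '\n\n    \n        In stock\n    \n'

# str.split() with no arguments splits on any run of whitespace
# and drops empty pieces, so re-joining gives single spaces only.
cleaned = ' '.join(raw_text.split())
print(cleaned)
```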


#### Now to put the results for the first page into a dataframe.

In [81]:
df = pd.DataFrame()
df['Titles'] = [book.select('h3 a')[0]['title'] for book in book_info]
df['link'] = [book.select('h3 a')[0]['href'] for book in book_info]
df['Rating'] = [book.find('p').attrs['class'][1] for book in book_info]
df['Price'] = [book.select('div p')[1].text[1:] for book in book_info]
df['Availability'] = [book.find('p',{'class': 'instock availability'}).text.replace('\n','').strip() for book in book_info]
df.head()

Unnamed: 0,Titles,link,Rating,Price,Availability
0,A Light in the Attic,catalogue/a-light-in-the-attic_1000/index.html,Three,£51.77,In stock
1,Tipping the Velvet,catalogue/tipping-the-velvet_999/index.html,One,£53.74,In stock
2,Soumission,catalogue/soumission_998/index.html,One,£50.10,In stock
3,Sharp Objects,catalogue/sharp-objects_997/index.html,Four,£47.82,In stock
4,Sapiens: A Brief History of Humankind,catalogue/sapiens-a-brief-history-of-humankind...,Five,£54.23,In stock


In [82]:
df.shape

(20, 5)

#### Now let's see if we can do this for all 1000 books

In [83]:
df = pd.DataFrame()

for i in range(1,51):
    url = f'http://books.toscrape.com/catalogue/page-{i}.html'    # crawl and fetch
    res = requests.get(url)

    soup_full = BeautifulSoup(res.text, features='lxml')    # or 'html.parser'
    
    book_info = soup_full.find_all('li',{'class': 'col-xs-6 col-sm-4 col-md-3 col-lg-3'})
    
    df_test = pd.DataFrame()
    df_test['Titles'] = [book.select('h3 a')[0]['title'] for book in book_info]
    df_test['link'] = [book.select('h3 a')[0]['href'] for book in book_info]
    df_test['Rating'] = [book.find('p').attrs['class'][1] for book in book_info]
    df_test['Price'] = [book.select('div p')[1].text[1:] for book in book_info]
    df_test['Availability'] = [book.find('p',{'class': 'instock availability'}).text.replace('\n','').strip() for book in book_info]
    
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    df = pd.concat([df, df_test], ignore_index=True)

In [84]:
df.head()

Unnamed: 0,Titles,link,Rating,Price,Availability
0,A Light in the Attic,a-light-in-the-attic_1000/index.html,Three,£51.77,In stock
1,Tipping the Velvet,tipping-the-velvet_999/index.html,One,£53.74,In stock
2,Soumission,soumission_998/index.html,One,£50.10,In stock
3,Sharp Objects,sharp-objects_997/index.html,Four,£47.82,In stock
4,Sapiens: A Brief History of Humankind,sapiens-a-brief-history-of-humankind_996/index...,Five,£54.23,In stock


In [85]:
df.shape

(1000, 5)

In [86]:
df.tail()

Unnamed: 0,Titles,link,Rating,Price,Availability
995,Alice in Wonderland (Alice's Adventures in Won...,alice-in-wonderland-alices-adventures-in-wonde...,One,£55.53,In stock
996,"Ajin: Demi-Human, Volume 1 (Ajin: Demi-Human #1)",ajin-demi-human-volume-1-ajin-demi-human-1_4/i...,Four,£57.06,In stock
997,A Spy's Devotion (The Regency Spies of London #1),a-spys-devotion-the-regency-spies-of-london-1_...,Five,£16.97,In stock
998,1st to Die (Women's Murder Club #1),1st-to-die-womens-murder-club-1_2/index.html,One,£53.98,In stock
999,"1,000 Places to See Before You Die",1000-places-to-see-before-you-die_1/index.html,Five,£26.08,In stock


Suppose we wanted a list of all the books that have a rating of two.

In [87]:
df[df['Rating']=='Two']

Unnamed: 0,Titles,link,Rating,Price,Availability
10,"Starving Hearts (Triangular Trade Trilogy, #1)",starving-hearts-triangular-trade-trilogy-1_990...,Two,£13.99,In stock
18,Libertarianism for Beginners,libertarianism-for-beginners_982/index.html,Two,£51.33,In stock
19,It's Only the Himalayas,its-only-the-himalayas_981/index.html,Two,£45.17,In stock
21,How Music Works,how-music-works_979/index.html,Two,£37.32,In stock
36,Maude (1883-1993):She Grew Up with the country,maude-1883-1993she-grew-up-with-the-country_96...,Two,£18.02,In stock
...,...,...,...,...,...
963,Of Mice and Men,of-mice-and-men_37/index.html,Two,£47.11,In stock
965,My Perfect Mistake (Over the Top #1),my-perfect-mistake-over-the-top-1_35/index.html,Two,£38.92,In stock
967,Meditations,meditations_33/index.html,Two,£25.89,In stock
980,Frankenstein,frankenstein_20/index.html,Two,£38.00,In stock


In [88]:
two_star_list = df[df['Rating']=='Two']['Titles'].to_list()
two_star_list

['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century',
 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes and Stories',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)',
 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)',
 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)',
 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)',
 'Giant Days, Vol. 2 (Giant Day

In [89]:
len(two_star_list)

196

<a id='eg2'></a>
## Example 2: Working with Multiple Pages and Items

Let's work through a more realistic example of scraping a full site. The website http://books.toscrape.com/index.html is specifically designed for scraping practice. Let's try to get the title of every book that has a 2-star rating and end up with a Python list of those titles.

We will do the following:

1. Figure out the URL structure to go through every page
2. Scrape every page in the catalogue
3. Figure out what tag/class represents the Star rating
4. Filter by that star rating using an if statement
5. Store the results to a list

We can see that the URL structure is the following:

    http://books.toscrape.com/catalogue/page-1.html

In [90]:
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

We can then fill in the page number with .format()

In [91]:
res = requests.get(base_url.format('1'))

In [92]:
res

<Response [200]>
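
Since only the page number changes, the full list of catalogue URLs can be generated up front. This is a minimal sketch assuming the catalogue has 50 pages, as the loops in this lesson do.

```python
base_url = 'http://books.toscrape.com/catalogue/page-{}.html'

# One URL per catalogue page; range(1, 51) yields 1 through 50.
page_urls = [base_url.format(n) for n in range(1, 51)]

print(len(page_urls))
print(page_urls[0])
print(page_urls[-1])
```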

Now let's grab the products (books) from the get request result:

In [93]:
soup = BeautifulSoup(res.text,"lxml")

In [94]:
soup.select(".product_pod")

[<article class="product_pod">
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>
 </article>,
 <article class="product_pod">
 <div class="image_container">
 <a href="tipping-the-velvet_999/index.html"><img alt="Tipping the Velvet" class="thumbnail" src="../media/cach

Now we can see that each book has the product_pod class. We can select any tag with this class, and then further reduce it by its rating.

In [95]:
products = soup.select(".product_pod")

In [96]:
book1 = products[0]

In [97]:
type(book1)

bs4.element.Tag

In [98]:
book1.attrs

{'class': ['product_pod']}

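By inspecting the site we can see that the class we want is `class="star-rating Two"`. The space means the tag carries two separate classes, `star-rating` and `Two`. If you click on the element in your browser, you'll notice it displays the space as a `.`, because a CSS selector chains multiple classes with dots. That means we want to search for `.star-rating.Two`.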
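
The translation from a `class` attribute to a CSS selector can be sketched as a small helper. `class_attr_to_selector` is a hypothetical name introduced for illustration; it only handles plain class names.

```python
def class_attr_to_selector(class_attr):
    """Turn a class attribute such as 'star-rating Two' into '.star-rating.Two'."""
    names = class_attr.split()  # one entry per class on the tag
    return ''.join('.' + name for name in names)


print(class_attr_to_selector('star-rating Two'))
print(class_attr_to_selector('instock availability'))
```

The resulting string can be passed to `select`, for example `book1.select('.star-rating.Two')`.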

In [99]:
book1.children

<list_iterator at 0x1a76b988408>

In [100]:
list(book1.children)

['\n',
 <div class="image_container">
 <a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>
 </div>,
 '\n',
 <p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>,
 '\n',
 <h3><a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a></h3>,
 '\n',
 <div class="product_price">
 <p class="price_color">Â£51.77</p>
 <p class="instock availability">
 <i class="icon-ok"></i>
     
         In stock
     
 </p>
 <form>
 <button class="btn btn-primary btn-block" data-loading-text="Adding..." type="submit">Add to basket</button>
 </form>
 </div>,
 '\n']

In [101]:
book1.select('.star-rating.Three')

[<p class="star-rating Three">
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 <i class="icon-star"></i>
 </p>]

But we are looking for 2 stars, so we can simply check whether anything was returned: an empty list means the book does not have that rating.

In [102]:
book1.select('.star-rating.Two')

[]

In [103]:
len(book1.select('.star-rating.Two'))

0

Alternatively, we can just quickly check the text string to see if "star-rating Two" is in it. Either approach is fine (there are also many other alternative approaches.)

In [104]:
book1.select('.star-rating.Three')[0].attrs

{'class': ['star-rating', 'Three']}

In [105]:
book1.select('.star-rating.Three')[0]['class'][1]

'Three'
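
The text-string check mentioned above can be sketched on the markup as a plain string (in the notebook, `str(book1)` would supply it). The `book_html` snippet is a shortened copy of the first book, and `has_rating` is a hypothetical helper.

```python
# Shortened markup for one book; str(book1) would give the full version.
book_html = '<article class="product_pod"><p class="star-rating Three"></p></article>'


def has_rating(html_text, word):
    """Check whether the markup contains a star-rating tag with the given word."""
    return f'class="star-rating {word}"' in html_text


print(has_rating(book_html, 'Three'))
print(has_rating(book_html, 'Two'))
```

This approach is quick but brittle: it depends on the exact attribute text, so the selector-based check is usually the safer choice.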

Now let's see how we can get the title if we have a 2-star match:

In [106]:
book1.select('a')

[<a href="a-light-in-the-attic_1000/index.html"><img alt="A Light in the Attic" class="thumbnail" src="../media/cache/2c/da/2cdad67c44b002e7ead0cc35693c0e8b.jpg"/></a>,
 <a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>]

In [107]:
book1.select('a')[1]

<a href="a-light-in-the-attic_1000/index.html" title="A Light in the Attic">A Light in the ...</a>

In [108]:
book1.select('a')[1]['title']

'A Light in the Attic'

Okay, let's give it a shot by combining all the ideas we've talked about! This should take about 20-60 seconds to run. Be aware that a firewall may prevent this script from running. If you get a no-response error, try adding a pause between requests with `time.sleep(1)`.

In [110]:
two_star_titles = []

for n in range(1,51):

    scrape_url = base_url.format(n)
    res = requests.get(scrape_url)
    
    soup = BeautifulSoup(res.text,"lxml")
    books = soup.select(".product_pod")
    
    for book in books:
        if len(book.select('.star-rating.Two')) != 0:
            two_star_titles.append(book.select('a')[1]['title'])

In [111]:
two_star_titles

['Starving Hearts (Triangular Trade Trilogy, #1)',
 'Libertarianism for Beginners',
 "It's Only the Himalayas",
 'How Music Works',
 'Maude (1883-1993):She Grew Up with the country',
 "You can't bury them all: Poems",
 'Reasons to Stay Alive',
 'Without Borders (Wanderlove #1)',
 'Soul Reader',
 'Security',
 'Saga, Volume 5 (Saga (Collected Editions) #5)',
 'Reskilling America: Learning to Labor in the Twenty-First Century',
 'Political Suicide: Missteps, Peccadilloes, Bad Calls, Backroom Hijinx, Sordid Pasts, Rotten Breaks, and Just Plain Dumb Mistakes in the Annals of American Politics',
 'Obsidian (Lux #1)',
 'My Paris Kitchen: Recipes and Stories',
 'Masks and Shadows',
 'Lumberjanes, Vol. 2: Friendship to the Max (Lumberjanes #5-8)',
 'Lumberjanes Vol. 3: A Terrible Plan (Lumberjanes #9-12)',
 'Judo: Seven Steps to Black Belt (an Introductory Guide for Beginners)',
 'I Hate Fairyland, Vol. 1: Madly Ever After (I Hate Fairyland (Compilations) #1-5)',
 'Giant Days, Vol. 2 (Giant Day

In [112]:
len(two_star_titles)

196
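
When a loop like the one above hits an occasional connection failure, pausing and retrying is often enough. The sketch below is one way to do it, not the notebook's method: `fetch_with_retry` is a hypothetical helper that takes the fetching function as an argument, and `flaky_fetch` is a stand-in for `requests.get` so the example runs without network access.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0):
    """Call fetch(url), sleeping `delay` seconds between failed attempts."""
    if retries < 1:
        raise ValueError('retries must be at least 1')
    last_error = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
            if attempt < retries - 1:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


# Stand-in for requests.get that fails twice, then succeeds.
calls = []


def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError('simulated network failure')
    return f'<html>page for {url}</html>'


page = fetch_with_retry(
    flaky_fetch, 'http://books.toscrape.com/catalogue/page-1.html', delay=0
)
print(len(calls))
print(page)
```

In the scraping loop, `requests.get` could be passed as `fetch`. Note that it raises only on connection problems, not on HTTP error status codes, so checking `res.status_code` is still worthwhile.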

<a id='eg3'></a>
### Example 3: www.basketball-reference.com

In [113]:
# import requests

url = 'https://www.basketball-reference.com/'
res_bb = requests.get(url)
print(res_bb)

soup_bb = BeautifulSoup(markup=res_bb.content, features='lxml')

<Response [200]>


In [114]:
teams = soup_bb.find('div',{'class':'table_container is_setup'})
teams

In [115]:
teams = soup_bb.find('table', {'id':'confs_standings_E'}).find('tbody').find_all('tr')
teams

[<tr class="full_table"><th class="left" data-stat="team_name" scope="row"><a href="/teams/PHI/2021.html" title="Philadelphia 76ers">PHI</a> * <span class="seed">(1) </span></th><td class="center" data-stat="franchise_text"><a href="/teams/PHI/" title="Philadelphia 76ers Franchise Index">F</a></td><td class="right" data-stat="payroll_text"><a href="/contracts/PHI.html" title="Philadelphia 76ers Team Payroll">$</a></td><td class="right" data-stat="wins">49</td><td class="right" data-stat="losses">23</td></tr>,
 <tr class="full_table"><th class="left" data-stat="team_name" scope="row"><a href="/teams/BRK/2021.html" title="Brooklyn Nets">BRK</a> * <span class="seed">(2) </span></th><td class="center" data-stat="franchise_text"><a href="/teams/NJN/" title="Brooklyn Nets Franchise Index">F</a></td><td class="right" data-stat="payroll_text"><a href="/contracts/BRK.html" title="Brooklyn Nets Team Payroll">$</a></td><td class="right" data-stat="wins">48</td><td class="right" data-stat="losses"

In [116]:
teams = []
for conf in ['E', 'W']:
    table = soup_bb.find('table', {'id':'confs_standings_' + conf})
    for row in table.find('tbody').find_all('tr'):
        team = {}
        team['slug'] = row.find('a').text
        team['name'] = row.find('a').attrs['title']
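        # wins could also be read by position with row.find_all('td')[2].text,
        # but looking it up by its data-stat attribute survives column changes
        team['wins'] = row.find('td', {'data-stat': 'wins'}).text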
        team['losses'] = row.find('td', {'data-stat': 'losses'}).text
        team['rank'] = row.find('span').text.strip()[1:-1]
        team['conference'] = conf

        teams.append(team)
df = pd.DataFrame(teams)
df

Unnamed: 0,slug,name,wins,losses,rank,conference
0,PHI,Philadelphia 76ers,49,23,1,E
1,BRK,Brooklyn Nets,48,24,2,E
2,MIL,Milwaukee Bucks,46,26,3,E
3,NYK,New York Knicks,41,31,4,E
4,ATL,Atlanta Hawks,41,31,5,E
5,MIA,Miami Heat,40,32,6,E
6,BOS,Boston Celtics,36,36,7,E
7,WAS,Washington Wizards,34,38,8,E
8,IND,Indiana Pacers,34,38,9,E
9,CHO,Charlotte Hornets,33,39,10,E
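
Every value scraped into the table above is a string, so numeric work needs a conversion step first. The sketch below uses two rows copied from the output; `parse_seed` and `win_pct` are hypothetical helpers.

```python
# Two rows copied from the table above; every scraped value is a string.
teams = [
    {'slug': 'PHI', 'wins': '49', 'losses': '23'},
    {'slug': 'BRK', 'wins': '48', 'losses': '24'},
]


def parse_seed(seed_text):
    """Turn the seed span's text, e.g. '(1) ', into the integer 1."""
    return int(seed_text.strip().strip('()'))


def win_pct(team):
    """Share of games won, rounded to three decimals."""
    wins = int(team['wins'])
    losses = int(team['losses'])
    return round(wins / (wins + losses), 3)


print(parse_seed('(1) '))
for team in teams:
    print(team['slug'], win_pct(team))
```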


<a id='eg4'></a>
### Example 4: Abraham Lincoln Quotes
Produce a list of all of Lincoln's quotes available on this page: http://www.successories.com/iquote/author/291/abraham-lincoln-quotes/1

In [117]:
# declare a list
lincoln_quotes = []

# iterate through the 44 pages of Lincoln quotes
for page in range(1,45):
    
    # make request for that page
    r = requests.get("http://www.successories.com/iquote/author/291/abraham-lincoln-quotes/%s" % page)
    
    # turn into a BeautifulSoup object
    soup = BeautifulSoup(r.text, 'lxml')
    
    # find quotes on page
    quotes = soup.find_all(name='div', attrs={'class':'quote'})
    
    # add to our quotes list
    for quote in quotes:
        lincoln_quotes.append(quote.text)

In [118]:
print(len(lincoln_quotes))
print(lincoln_quotes[0:25])

40
['"Human action can be modified to some extent, but human nature cannot be changed."', '"To stand in silence when they should be protesting makes cowards out of men"', '"I do the very best I know how, the very best I can, and I mean to keep on doing so until the end"', '"That this nation, under God, shall have a new birth of freedom; and that government of the people, by the people, and for the people, shall not perish from the earth."', '"That some achieve great success, is proof to all that others can achieve it as well."', '"I am a success today because I had a friend who believed in me and I didn\'t have the heart to let him down..."', '"I believe, if we take habitual drunkards as a class, their heads and their hearts will bear an advantageous comparison with those of any other class"', '"Books serve to show a man that those original thoughts of his aren\'t very new after all"', '"Things may come to those who wait...but only the things left by those who hustle."', '"You cannot b
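
The scraped quotes still carry their surrounding quotation marks, and a multi-page scrape can pick up repeats. The sketch below shows one way to tidy such a list. The first two entries are copied from the output above, and the third is a deliberate duplicate added for illustration.

```python
raw_quotes = [
    '"Human action can be modified to some extent, but human nature cannot be changed."',
    '"To stand in silence when they should be protesting makes cowards out of men"',
    # Deliberate duplicate, added only to demonstrate de-duplication.
    '"Human action can be modified to some extent, but human nature cannot be changed."',
]

# Strip surrounding whitespace, then the wrapping double quotes.
cleaned = [quote.strip().strip('"') for quote in raw_quotes]

# dict.fromkeys keeps the first occurrence of each quote, in order.
unique_quotes = list(dict.fromkeys(cleaned))

print(len(unique_quotes))
print(unique_quotes[0])
```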

<a id='apis'></a>
## Web Services and APIs

### <u>What is an API?</u>
An API (Application Programming Interface) is a set of routines, protocols, and tools for building software applications. It specifies how software components should interact.

APIs are a way for developers to abstract access to the data, devices, and other resources they provide.

Some examples include:

- Connectivity to a variety of databases
- Python modules that can turn LED lights on and off
- Applications that run natively on Windows, macOS, or Linux
- Libraries that post content on Twitter, Facebook, Yelp, or LinkedIn
- Web services for accessing currency or stock prices

More abstract examples:
- Adding your own functions to NumPy itself
- Extending Python with C code
- Testing Frameworks

In the context of data science, APIs are a very common way to interact with data hosted by third parties, most commonly through **Web Service APIs**.

### <u>JSON</u>
JSON is short for _JavaScript Object Notation_, and is a way to store information in an organized, easy-to-access manner. In a nutshell, it gives us a human-readable collection of data that we can access in a really logical manner.

**JSON is built on two structures:**
* A collection of name/value pairs. In various languages, this is realized as an object, record, structure, dictionary, hash table, keyed list, or associative array.
* An ordered list of values. In most languages, this is realized as an array, vector, list, or sequence.

These are universal data structures. Virtually all modern programming languages support them in one form or another. It makes sense that a data format that is interchangeable with programming languages also be based on these structures.

### <u>JSON objects</u> 
An object is an unordered set of name/value pairs, like a Python dictionary. An object begins with `{` (left brace) and ends with `}` (right brace). Each name is followed by `:` (colon) and the name/value pairs are separated by `,` (comma).

The syntax is as follows:

```
{ string : value, .......}
```
like:
```
{"count": 1, ...}
```
_Seems an awful lot like a Python dictionary._
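
The resemblance can be checked directly with the standard library's `json` module. The `json_text` below is a made-up example string, not a response from any real API.

```python
import json

# Made-up JSON text: an object holding a number, an array, and a null.
json_text = '{"count": 1, "tags": ["a", "b"], "next": null}'

data = json.loads(json_text)

print(type(data).__name__)          # JSON object becomes a Python dict
print(type(data['tags']).__name__)  # JSON array becomes a Python list
print(data['next'] is None)         # JSON null becomes None

# json.dumps goes the other way, from Python objects back to JSON text.
round_trip = json.dumps(data)
print(json.loads(round_trip) == data)
```

This is the same conversion that `result.json()` performs on a `requests` response body.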

In [119]:
# Request example using the Star Wars API (SWAPI)
import pandas as pd
import requests

results = []
for i in range(1, 10):
    url = "http://swapi.dev/api/people/" + str(i) + "/"
    print(url)
    result = requests.get(url)
    results.append(result.json())
df = pd.DataFrame(results)
df

http://swapi.dev/api/people/1/
http://swapi.dev/api/people/2/
http://swapi.dev/api/people/3/
http://swapi.dev/api/people/4/
http://swapi.dev/api/people/5/
http://swapi.dev/api/people/6/
http://swapi.dev/api/people/7/
http://swapi.dev/api/people/8/
http://swapi.dev/api/people/9/


Unnamed: 0,name,height,mass,hair_color,skin_color,eye_color,birth_year,gender,homeworld,films,species,vehicles,starships,created,edited,url
0,Luke Skywalker,172,77,blond,fair,blue,19BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],"[https://swapi.dev/api/vehicles/14/, https://s...","[https://swapi.dev/api/starships/12/, https://...",2014-12-09T13:50:51.644000Z,2014-12-20T21:17:56.891000Z,https://swapi.dev/api/people/1/
1,C-3PO,167,75,,gold,yellow,112BBY,,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],[],[],2014-12-10T15:10:51.357000Z,2014-12-20T21:17:50.309000Z,https://swapi.dev/api/people/2/
2,R2-D2,96,32,,"white, blue",red,33BBY,,https://swapi.dev/api/planets/8/,"[https://swapi.dev/api/films/1/, https://swapi...",[https://swapi.dev/api/species/2/],[],[],2014-12-10T15:11:50.376000Z,2014-12-20T21:17:50.311000Z,https://swapi.dev/api/people/3/
3,Darth Vader,202,136,none,white,yellow,41.9BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[],[https://swapi.dev/api/starships/13/],2014-12-10T15:18:20.704000Z,2014-12-20T21:17:50.313000Z,https://swapi.dev/api/people/4/
4,Leia Organa,150,49,brown,light,brown,19BBY,female,https://swapi.dev/api/planets/2/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[https://swapi.dev/api/vehicles/30/],[],2014-12-10T15:20:09.791000Z,2014-12-20T21:17:50.315000Z,https://swapi.dev/api/people/5/
5,Owen Lars,178,120,"brown, grey",light,blue,52BBY,male,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[],[],2014-12-10T15:52:14.024000Z,2014-12-20T21:17:50.317000Z,https://swapi.dev/api/people/6/
6,Beru Whitesun lars,165,75,brown,light,blue,47BBY,female,https://swapi.dev/api/planets/1/,"[https://swapi.dev/api/films/1/, https://swapi...",[],[],[],2014-12-10T15:53:41.121000Z,2014-12-20T21:17:50.319000Z,https://swapi.dev/api/people/7/
7,R5-D4,97,32,,"white, red",red,unknown,,https://swapi.dev/api/planets/1/,[https://swapi.dev/api/films/1/],[https://swapi.dev/api/species/2/],[],[],2014-12-10T15:57:50.959000Z,2014-12-20T21:17:50.321000Z,https://swapi.dev/api/people/8/
8,Biggs Darklighter,183,84,black,light,brown,24BBY,male,https://swapi.dev/api/planets/1/,[https://swapi.dev/api/films/1/],[],[],[https://swapi.dev/api/starships/12/],2014-12-10T15:59:50.509000Z,2014-12-20T21:17:50.323000Z,https://swapi.dev/api/people/9/
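
Note that the API returns numbers such as `height` and `mass` as strings, and some fields hold placeholders such as `unknown`. The sketch below shows one way to convert them safely; `to_number` is a hypothetical helper, and the values are copied from the table above.

```python
def to_number(value):
    """Convert a numeric string to float; return None for placeholders."""
    try:
        return float(value)
    except ValueError:
        return None


# Heights of the first three people, plus a placeholder like R5-D4's birth_year.
values = ['172', '167', '96', 'unknown']

numbers = [to_number(value) for value in values]
known = [number for number in numbers if number is not None]
average = sum(known) / len(known)

print(numbers)
print(average)
```

With a DataFrame, `pd.to_numeric(df['height'], errors='coerce')` is a comparable option that turns unparseable entries into `NaN`.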


In [120]:
results

[{'name': 'Luke Skywalker',
  'height': '172',
  'mass': '77',
  'hair_color': 'blond',
  'skin_color': 'fair',
  'eye_color': 'blue',
  'birth_year': '19BBY',
  'gender': 'male',
  'homeworld': 'https://swapi.dev/api/planets/1/',
  'films': ['https://swapi.dev/api/films/1/',
   'https://swapi.dev/api/films/2/',
   'https://swapi.dev/api/films/3/',
   'https://swapi.dev/api/films/6/'],
  'species': [],
  'vehicles': ['https://swapi.dev/api/vehicles/14/',
   'https://swapi.dev/api/vehicles/30/'],
  'starships': ['https://swapi.dev/api/starships/12/',
   'https://swapi.dev/api/starships/22/'],
  'created': '2014-12-09T13:50:51.644000Z',
  'edited': '2014-12-20T21:17:56.891000Z',
  'url': 'https://swapi.dev/api/people/1/'},
 {'name': 'C-3PO',
  'height': '167',
  'mass': '75',
  'hair_color': 'n/a',
  'skin_color': 'gold',
  'eye_color': 'yellow',
  'birth_year': '112BBY',
  'gender': 'n/a',
  'homeworld': 'https://swapi.dev/api/planets/1/',
  'films': ['https://swapi.dev/api/films/1/',
 

<a id='hacker'></a>
## Example 5: Hacker News

In [121]:
from bs4 import BeautifulSoup
import lxml
import requests

In [122]:
url = 'https://news.ycombinator.com/news'
res_hack = requests.get(url)
print(res_hack)
print(res_hack.text)
#soup_bb = BeautifulSoup(markup=res_bb.content, features='lxml')

<Response [200]>
<html lang="en" op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"><link rel="stylesheet" type="text/css" href="news.css?C7yC4eDsAdgGHWMz9pg3">
        <link rel="shortcut icon" href="favicon.ico">
          <link rel="alternate" type="application/rss+xml" title="RSS" href="rss">
        <title>Hacker News</title></head><body><center><table id="hnmain" border="0" cellpadding="0" cellspacing="0" width="85%" bgcolor="#f6f6ef">
        <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" width="100%" style="padding:2px"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img src="y18.gif" width="18" height="18" style="border:1px white solid;"></a></td>
                  <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
              <a href="newest">new</a> | <a href="front">pa

In [123]:
# Parse the HTML string above into a BeautifulSoup object that we can search

soup = BeautifulSoup(markup=res_hack.text, features='html.parser')
print(soup)

<html lang="en" op="news"><head><meta content="origin" name="referrer"/><meta content="width=device-width, initial-scale=1.0" name="viewport"/><link href="news.css?C7yC4eDsAdgGHWMz9pg3" rel="stylesheet" type="text/css"/>
<link href="favicon.ico" rel="shortcut icon"/>
<link href="rss" rel="alternate" title="RSS" type="application/rss+xml"/>
<title>Hacker News</title></head><body><center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
<tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>
<td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
<a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href

In [124]:
# soup.body
soup.body.contents

[<center><table bgcolor="#f6f6ef" border="0" cellpadding="0" cellspacing="0" id="hnmain" width="85%">
 <tr><td bgcolor="#ff6600"><table border="0" cellpadding="0" cellspacing="0" style="padding:2px" width="100%"><tr><td style="width:18px;padding-right:4px"><a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a></td>
 <td style="line-height:12pt; height:10px;"><span class="pagetop"><b class="hnname"><a href="news">Hacker News</a></b>
 <a href="newest">new</a> | <a href="front">past</a> | <a href="newcomments">comments</a> | <a href="ask">ask</a> | <a href="show">show</a> | <a href="jobs">jobs</a> | <a href="submit">submit</a> </span></td><td style="text-align:right;padding-right:4px;"><span class="pagetop">
 <a href="login?goto=news">login</a>
 </span></td>
 </tr></table></td></tr>
 <tr id="pagespace" style="height:10px" title=""></tr><tr><td><table border="0" cellpadding="0" cellspacing="0" class="itemlist">
 <tr class="ath

In [125]:
# all the links of a page
soup.find_all('a')

[<a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>,
 <a href="news">Hacker News</a>,
 <a href="newest">new</a>,
 <a href="front">past</a>,
 <a href="newcomments">comments</a>,
 <a href="ask">ask</a>,
 <a href="show">show</a>,
 <a href="jobs">jobs</a>,
 <a href="submit">submit</a>,
 <a href="login?goto=news">login</a>,
 <a href="vote?id=28099264&amp;how=up&amp;goto=news" id="up_28099264"><div class="votearrow" title="upvote"></div></a>,
 <a class="storylink" href="https://www.openwall.com/lists/oss-security/2021/08/07/1">Bug in Lynx' SSL certificate validation – leaks password in clear text via SNI</a>,
 <a href="from?site=openwall.com"><span class="sitestr">openwall.com</span></a>,
 <a class="hnuser" href="user?id=jwilk">jwilk</a>,
 <a href="item?id=28099264">59 minutes ago</a>,
 <a href="hide?id=28099264&amp;goto=news">hide</a>,
 <a href="item?id=28099264">12 comments</a>,
 <a href="vote?id=28098578&amp;how=up&amp;go

In [126]:
soup.select('a')

[<a href="https://news.ycombinator.com"><img height="18" src="y18.gif" style="border:1px white solid;" width="18"/></a>,
 <a href="news">Hacker News</a>,
 <a href="newest">new</a>,
 <a href="front">past</a>,
 <a href="newcomments">comments</a>,
 <a href="ask">ask</a>,
 <a href="show">show</a>,
 <a href="jobs">jobs</a>,
 <a href="submit">submit</a>,
 <a href="login?goto=news">login</a>,
 <a href="vote?id=28099264&amp;how=up&amp;goto=news" id="up_28099264"><div class="votearrow" title="upvote"></div></a>,
 <a class="storylink" href="https://www.openwall.com/lists/oss-security/2021/08/07/1">Bug in Lynx' SSL certificate validation – leaks password in clear text via SNI</a>,
 <a href="from?site=openwall.com"><span class="sitestr">openwall.com</span></a>,
 <a class="hnuser" href="user?id=jwilk">jwilk</a>,
 <a href="item?id=28099264">59 minutes ago</a>,
 <a href="hide?id=28099264&amp;goto=news">hide</a>,
 <a href="item?id=28099264">12 comments</a>,
 <a href="vote?id=28098578&amp;how=up&amp;go

In [127]:
soup.select('.score')

[<span class="score" id="score_28099264">55 points</span>,
 <span class="score" id="score_28098578">68 points</span>,
 <span class="score" id="score_28098853">46 points</span>,
 <span class="score" id="score_28068248">17 points</span>,
 <span class="score" id="score_28098888">51 points</span>,
 <span class="score" id="score_28084166">121 points</span>,
 <span class="score" id="score_28098664">67 points</span>,
 <span class="score" id="score_28096493">30 points</span>,
 <span class="score" id="score_28083440">86 points</span>,
 <span class="score" id="score_28096710">62 points</span>,
 <span class="score" id="score_28098658">52 points</span>,
 <span class="score" id="score_28085642">17 points</span>,
 <span class="score" id="score_28085920">123 points</span>,
 <span class="score" id="score_28085526">51 points</span>,
 <span class="score" id="score_28097600">28 points</span>,
 <span class="score" id="score_28095632">543 points</span>,
 <span class="score" id="score_28084861">58 points</s

In [128]:
soup.select('#score_26628233')

[]

In [129]:
soup.select('.storylink')[0]

<a class="storylink" href="https://www.openwall.com/lists/oss-security/2021/08/07/1">Bug in Lynx' SSL certificate validation – leaks password in clear text via SNI</a>

In [130]:
soup.select('.score')[0]

<span class="score" id="score_28099264">55 points</span>

In [131]:
links = soup.select('.storylink')
votes = soup.select('.score')

In [132]:
def create_custom_hn(links, votes):
    hn = []
    
    for index, item in enumerate(links):
        title = links[index].getText()
        hn.append(title)
    
    return hn

In [133]:
create_custom_hn(links,votes)

["Bug in Lynx' SSL certificate validation – leaks password in clear text via SNI",
 'Framework Patterns (2019)',
 'Powering the Lunar Base',
 'Benjamin Banneker’s Broods of Cicadas',
 'Qatar Airways grounds 13 Airbus A350s as fuselage degrading',
 'Body mapping study suggests chronic pain comes in nine distinct types',
 'The Era of Cheap Natural Gas Ends as Prices Surge by 1000%',
 'Pandas Manual [pdf]',
 'Feelings at the Fall of the Republic, Ancient and Medieval Living Standards',
 'Everything has changed in iOS 14, but Jailbreak is eternal [pdf]',
 'NASA is looking for people who want to spend a year simulating a mission on Mars',
 'Incorrect expression calculation between programming languages',
 'Show HN: Paper Time – Listen to abstracts of CS papers, like a custom podcast',
 'UPchieve (YC W21) is hiring a mobile engineer to democratize free tutoring',
 'In praise of habits',
 'Planning for Servers in 2022 and Beyond',
 'Swiss Ph.D student’s dismissal spotlights China’s influence'

In [134]:
def create_custom_hn(links, votes):
    hn = []
    
    for index, item in enumerate(links):
        title = links[index].getText()
        href = links[index].get('href', None)
        points = int(votes[index].getText().replace(' points', ''))
        print(points)
        hn.append({'title':title, 'link':href, 'points':points})
    
    return hn

In [135]:
create_custom_hn(links,votes)

55
68
46
17
51
121
67
30
86
62
52
17
123
51
28
543
58
873
13
588
14
31
63
69
28
113
73
44
160


IndexError: list index out of range
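
The `IndexError` is consistent with `soup.select('.score')` returning fewer elements than `soup.select('.storylink')`: the output above prints 29 scores before failing, so at least one story row apparently carries no score. Pairing the two flat lists by position then goes wrong twice: every title after the score-less row is given its neighbour's score, and the final lookup runs off the end of the shorter list. A minimal sketch with made-up stand-in data (the titles and numbers below are hypothetical, not scraped):

```python
# Hypothetical stand-in data: three story titles but only two scores,
# because the middle entry (a job ad, say) has no score element.
titles = ['Story A', 'Job ad', 'Story B']
scores = [10, 20]

def pair_by_position(titles, scores):
    """Pair titles and scores by index, as the failing cell does."""
    paired = []
    for index, title in enumerate(titles):
        paired.append((title, scores[index]))  # fails on the last title
    return paired

try:
    pair_by_position(titles, scores)
except IndexError as exc:
    error_message = str(exc)
    print('IndexError:', error_message)

# Even before the crash, 'Job ad' was wrongly given Story B's score of 20.
misaligned = list(zip(titles, scores))
print(misaligned)
```

Looking up the score inside each story's own `.subtext` block, as the next cell does, avoids both problems.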

In [136]:
url = 'https://news.ycombinator.com/news'
res_hack = requests.get(url)

soup = BeautifulSoup(markup=res_hack.text, features='html.parser')
links = soup.select('.storylink')
subtext = soup.select('.subtext')

def create_custom_hn(links, subtext):
    hn = []
    
    for index, item in enumerate(links):
        title = links[index].getText()
        href = links[index].get('href', None)
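        vote = subtext[index].select('.score')
        
        # Some rows have no score element; record None for them instead of
        # reusing the previous story's points (or hitting an undefined name).
        points = int(vote[0].getText().replace(' points', '')) if vote else None
        
        hn.append({'title':title, 'link':href, 'points':points})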
    
    return hn

In [137]:
create_custom_hn(links,subtext)

[{'title': "Bug in Lynx' SSL certificate validation – leaks password in clear text via SNI",
  'link': 'https://www.openwall.com/lists/oss-security/2021/08/07/1',
  'points': 55},
 {'title': 'Framework Patterns (2019)',
  'link': 'https://blog.startifact.com/posts/framework-patterns.html#',
  'points': 68},
 {'title': 'Powering the Lunar Base',
  'link': 'https://caseyhandmer.wordpress.com/2021/04/25/powering-the-lunar-base/',
  'points': 46},
 {'title': 'Benjamin Banneker’s Broods of Cicadas',
  'link': 'https://www.historytoday.com/archive/natural-histories/benjamin-bannekers-broods-cicadas',
  'points': 17},
 {'title': 'Qatar Airways grounds 13 Airbus A350s as fuselage degrading',
  'link': 'https://www.msn.com/en-us/money/news/qatar-airways-grounds-13-airbus-a350s-as-fuselage-degrading/ar-AAMYDOB',
  'points': 51},
 {'title': 'Body mapping study suggests chronic pain comes in nine distinct types',
  'link': 'https://www.sciencealert.com/large-body-map-study-suggests-chronic-pain-co

In [138]:
url = 'https://news.ycombinator.com/news'
res_hack = requests.get(url)

soup = BeautifulSoup(markup=res_hack.text, features='html.parser')
links = soup.select('.storylink')
subtext = soup.select('.subtext')

def sort_stories_by_votes(hnlist):
    return sorted(hnlist, key=lambda k: k['votes'], reverse=True)
    

def create_custom_hn(links, subtext):
    hn = []
    
    for index, item in enumerate(links):
        title = item.getText()
        href = item.get('href', None)
        vote = subtext[index].select('.score')
        
        if len(vote):
            points = int(vote[0].getText().replace(' points', ''))
            if points > 99:
                hn.append({'title':title, 'link':href, 'votes':points})
    
    return sort_stories_by_votes(hn)
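
`replace(' points', '')` assumes the score text always ends in the plural ` points`. If a story is rendered with a singular label such as `1 point` (an assumption about the site's markup, not something shown in the output above), `int()` would raise `ValueError`. A small sketch of a more tolerant parser; the helper name `parse_points` is hypothetical:

```python
import re

def parse_points(text):
    """Return the leading integer in a score label, or None if there is none."""
    match = re.match(r'\s*(\d+)', text)
    return int(match.group(1)) if match else None

print(parse_points('55 points'))  # 55
print(parse_points('1 point'))    # 1
print(parse_points('no score'))   # None
```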

In [139]:
create_custom_hn(links,subtext)

[{'title': 'CalyxOS – De-Googled Android Alternative',
  'link': 'https://calyxos.org/',
  'votes': 873},
 {'title': 'The Problem with Perceptual Hashes',
  'link': 'https://rentafounder.com/the-problem-with-perceptual-hashes/',
  'votes': 588},
 {'title': 'Swiss Ph.D student’s dismissal spotlights China’s influence',
  'link': 'https://www.nzz.ch/english/swiss-phd-students-dismissal-spotlights-chinas-influence-ld.1638771',
  'votes': 543},
 {'title': 'Crypto community slams ‘disastrous’ new amendment to big infrastructure bill',
  'link': 'https://techcrunch.com/2021/08/06/crypto-biden-amendment-infrastructure-bill-proof-of-work/',
  'votes': 160},
 {'title': 'Show HN: Paper Time – Listen to abstracts of CS papers, like a custom podcast',
  'link': 'https://papertime.app',
  'votes': 123},
 {'title': 'Body mapping study suggests chronic pain comes in nine distinct types',
  'link': 'https://www.sciencealert.com/large-body-map-study-suggests-chronic-pain-comes-in-9-distinct-types',
  '

In [140]:
# challenge - for multiple pages:

def sort_stories_by_votes(hnlist):
    return sorted(hnlist, key=lambda k: k['votes'], reverse=True)
    

# Note: the links and subtext arguments are ignored here; both are re-fetched for each page below.
def create_custom_hn(links, subtext):
    hn = []
    for i in range(1,3):
        url = f'https://news.ycombinator.com/news?p={i}'
        res_hack = requests.get(url)

        soup = BeautifulSoup(markup=res_hack.text, features='html.parser')
        links = soup.select('.storylink')
        subtext = soup.select('.subtext')
        
        for index, item in enumerate(links):
            title = item.getText()
            href = item.get('href', None)
            vote = subtext[index].select('.score')
        
            if len(vote):
                points = int(vote[0].getText().replace(' points', ''))
                if points > 99:
                    hn.append({'title':title, 'link':href, 'votes':points})
    
    return sort_stories_by_votes(hn)
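
The page loop above issues its requests back to back. A hedged sketch of how the fetching step could be separated out so that it pauses between pages; `page_urls` and `fetch_pages` are hypothetical helper names, and the `fetch` argument stands in for something like `lambda u: requests.get(u, timeout=10).text`:

```python
import time

def page_urls(base_url, first_page, last_page):
    """Build the list of paginated URLs, e.g. ...?p=1, ...?p=2."""
    return [f'{base_url}?p={page}' for page in range(first_page, last_page + 1)]

def fetch_pages(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between requests to stay polite."""
    pages = []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay_seconds)
        pages.append(fetch(url))
    return pages

urls = page_urls('https://news.ycombinator.com/news', 1, 2)
print(urls)

# A fake fetcher keeps the sketch runnable offline.
pages = fetch_pages(urls, fetch=lambda url: f'<html>{url}</html>', delay_seconds=0)
print(len(pages))
```

Passing the fetcher in as an argument also makes the parsing logic easy to exercise against saved HTML instead of the live site.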

In [141]:
test1 = create_custom_hn(links,subtext)
test1[0]

{'title': "An open letter against Apple's new privacy-invasive client-side content scanning",
 'link': 'https://github.com/nadimkobeissi/appleprivacyletter',
 'votes': 941}

In [142]:
from selenium import webdriver

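# Selenium 3 style; Selenium 4 expects webdriver.Chrome(service=Service(path)),
# with Service imported from selenium.webdriver.chrome.service.
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver.exe")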
driver.get(test1[0]['link'])

In [143]:
driver.close()

<a id='xpath'></a>
## XPath, Scrapy Selector and Scrapy Framework
- What is a CSS selector?
- What is an XPath selector?
- How to use XPath with Scrapy
- Scrapy as a framework

### Learn XPath by Finding Waldo
[XPath Cheatsheet](https://www.red-gate.com/simple-talk/wp-content/uploads/imported/1269-Locators_groups_1_0_2.pdf?file=4938)

In [144]:
import requests
from scrapy.selector import Selector
import pandas as pd

In [145]:
HTML = """
<html>
    <body>
        
        <ul id="waldo">
            <li class="waldo">
                <span> yo Im not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">parker</div>
            </li>
        </ul>
        
        <ul id="tim">
            <li class="tdawg">
                <span>yo im here</span>
            </li>
        </ul>
        <li>stuff</li>
        <li>stuff2</li>
        
        <div id="cooldiv">
            <span class="dsi-rocks">
               YO!
            </span>
        </div>
        <waldo>Waldo</waldo>
        <waldo>Waldo2</waldo>
    </body>
</html>
"""

### Scrapy Selector

In [146]:
xpath_selector = Selector(text=HTML)

##### Absolute XPath

In [147]:
xpath_selector.xpath('/html/body/waldo/text()').extract()

['Waldo', 'Waldo2']

##### Relative XPath

In [148]:
xpath_selector.xpath('//waldo/text()').extract()

['Waldo', 'Waldo2']
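
The two queries return the same result here only because both `<waldo>` elements sit directly under `<body>`. An absolute path spells out every step from the root, so it stops matching as soon as an element moves, while `//` searches all descendants. The sketch below illustrates the difference with the standard library's `xml.etree.ElementTree`, which supports only a limited XPath subset (paths are written relative to the root element, and the input must be well-formed XML), so it illustrates the idea rather than Scrapy's `Selector`. The sample document is made up.

```python
import xml.etree.ElementTree as ET

# Made-up document: one <waldo> directly under <body>, one nested in a <div>.
DOC = """
<html>
    <body>
        <waldo>Waldo</waldo>
        <div>
            <waldo>Waldo2</waldo>
        </div>
    </body>
</html>
"""

root = ET.fromstring(DOC)  # root is the <html> element

# Step-by-step path from the root: only the direct child of <body> matches.
fixed_path = [el.text for el in root.findall('./body/waldo')]

# Descendant search: matches <waldo> at any depth.
anywhere = [el.text for el in root.findall('.//waldo')]

print(fixed_path)  # ['Waldo']
print(anywhere)    # ['Waldo', 'Waldo2']
```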

**Find attribute(s) 'waldo'**

In [149]:
# First element whose id attribute equals "waldo"
print(xpath_selector.xpath('//*[@id="waldo"]').extract()[0])

<ul id="waldo">
            <li class="waldo">
                <span> yo Im not here</span>
            </li>
            <li class="waldo">Height:  ???</li>
            <li class="waldo">Weight:  ???</li>
            <li class="waldo">Last Location:  ???</li>
            <li class="nerds">
                <div class="alpha">Bill gates</div>
                <div class="alpha">Zuckerberg</div>
                <div class="beta">Theil</div>
                <div class="animal">parker</div>
            </li>
        </ul>


In [150]:
# All elements whose class attribute is exactly "waldo"
xpath_selector.xpath('//*[@class="waldo"]').extract()

['<li class="waldo">\n                <span> yo Im not here</span>\n            </li>',
 '<li class="waldo">Height:  ???</li>',
 '<li class="waldo">Weight:  ???</li>',
 '<li class="waldo">Last Location:  ???</li>']
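
`@class="waldo"` compares the whole attribute string, so an element with `class="waldo big"` would not match. In XPath 1.0 the usual workaround is `contains(concat(" ", normalize-space(@class), " "), " waldo ")`. The sketch below shows the same token test in plain Python with `xml.etree.ElementTree` (which lacks `contains()`); the sample markup and the helper name `has_class` are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """
<ul>
    <li class="waldo">plain</li>
    <li class="waldo big">two classes</li>
    <li class="nerds">other</li>
</ul>
"""

root = ET.fromstring(DOC)

def has_class(element, name):
    """True if name is one of the whitespace-separated class tokens."""
    return name in element.get('class', '').split()

# Exact string comparison on the attribute value.
exact = [li.text for li in root.findall(".//li[@class='waldo']")]

# Token-based comparison, closer to how CSS class selectors behave.
by_token = [li.text for li in root.iter('li') if has_class(li, 'waldo')]

print(exact)     # ['plain']
print(by_token)  # ['plain', 'two classes']
```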

In [151]:
# Elements whose text is exactly "Waldo" (so "Waldo2" does not match)
xpath_selector.xpath('//*[text()="Waldo"]').extract()

['<waldo>Waldo</waldo>']

### Practice with perfumery.com.au Site

In [152]:
homepage = 'https://www.perfumery.com.au/womens/fragrances.html'
response = requests.get(homepage)
print(response.status_code)
HTML = response.text
HTML[0:150]

200


'<!DOCTYPE html>\n<html lang="en">\n<head>\n<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />\n<meta name="keywords" content="Women&#39'

In [153]:
xpath_selector_perfume = Selector(text=HTML)

##### Collect Data

In [154]:
links = xpath_selector_perfume.xpath('//div[@class="wrapper-thumbnail col-xs-8 col-sm-6 col-md-4 col-lg-4"]').extract()


In [155]:
xpath_name='//*[@id="_jstl__header_r"]/div/div[3]/h1'

# '//h1[@itemprop="name"]'
# //div[@class="wrapper-product-title col-xs-12"]/h1

xpath_selector_perfume.xpath(xpath_name).extract()

[]

In [156]:
xpath_product = '//*[@id="_jstl__header_r"]/div/div[3]/h1/text()'
xpath_price = '//*[@id="_jstl__pricing_r"]/div/div[2]/div[1]/div[2]/text()'
xpath_size = '//*[@id="main-content"]/div/div/div[2]/div[2]/div[2]/form/div/div/div[2]/div/span/select/option[1]/text()'
xpath_rrp = '//*[@id="_jstl__pricing_r"]/div/div[2]/div[1]/div[1]/text()' 
xpath_description = '//*[@id="description"]'

products = []
prices = []
sizes = []
rrps = []
descriptions = []

# Visit each product's detail page and pull out one value per field.
# extract()[0] raises IndexError when the XPath matches nothing, so only
# that exception is caught and 'NA' is stored in its place.
for itm in xpath_selector_perfume.xpath('//div[@class="wrapper-thumbnail col-xs-8 col-sm-6 col-md-4 col-lg-4"]/div/div/h3/a/@href').extract():
    res_details = requests.get(itm)
    html_details = res_details.text
    details_selector = Selector(text=html_details)
    try:
        product = details_selector.xpath(xpath_product).extract()[0]
        products.append(product)
    except IndexError:
        products.append('NA')
    
    try:
        price = details_selector.xpath(xpath_price).extract()[0]
        prices.append(price)
    except IndexError:
        prices.append('NA')
    
    try:
        size = details_selector.xpath(xpath_size).extract()[0]
        sizes.append(size)
    except IndexError:
        sizes.append('NA')
    
    try:
        rrp = details_selector.xpath(xpath_rrp).extract()[0]
        rrps.append(rrp)
    except IndexError:
        rrps.append('NA')
    
    try:
        description = details_selector.xpath(xpath_description).extract()[0]
        descriptions.append(description)
    except IndexError:
        descriptions.append('NA')
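
The five `try`/`except` blocks repeat one pattern: take the first extracted value, or fall back to `'NA'` when the XPath matched nothing. That pattern can be factored into a helper; `first_or_default` below is a hypothetical name, and plain lists stand in for the result of `.extract()`:

```python
def first_or_default(values, default='NA'):
    """Return the first item of values, or default if values is empty."""
    return values[0] if values else default

print(first_or_default(['100ml EDT Spray']))  # 100ml EDT Spray
print(first_or_default([]))                   # NA
```

With it, each field would reduce to a line such as `products.append(first_or_default(details_selector.xpath(xpath_product).extract()))`.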


In [157]:
data = {'products':products, 
        'price':prices,
        'size':sizes,
        'rrp':rrps, 
        'description':descriptions}

##### Create DataFrame

In [158]:
df_perfume = pd.DataFrame(data)
df_perfume

| | products | price | size | rrp | description |
|---|---|---|---|---|---|
| 0 | 1000 by Jean Patou EDT Spray 75ml For Women | `\nWAS\n$99.00\n` | 75ml EDT Spray | `\nRRP\n$170.01\n` | `<div class="tab-pane active" id="description">...` |
| 1 | 1881 Pour Femme by Cerruti EDT Spray 100ml For... | `\n$59.00\n` | 100ml EDT Spray | `\nRRP:\n$120.00\n` | `<div class="tab-pane active" id="description">...` |
| 2 | 212 Sexy by Carolina Herrera EDP Spray 100ml F... | `\n$96.00\n` | 100ml EDP Spray | `\nRRP:\n$145.00\n` | `<div class="tab-pane active" id="description">...` |
| 3 | 212 VIP by Carolina Herrera EDP Spray 50ml For... | `\nWAS\n$80.00\n` | 50ml EDP Spray | `\nRRP\n$160.00\n` | `<div class="tab-pane active" id="description">...` |
| 4 | 24 Faubourg by Hermes EDT Spray 30ml For Women | | | `\n$126.01\n` | `<div class="tab-pane active" id="description">...` |
| 5 | 24k Women by Jivago EDP Spray 75ml (with Rock ... | `\nWAS\n$155.00\n` | 75ml EDP Spray | `\nRRP\n$220.00\n` | `<div class="tab-pane active" id="description">...` |
| 6 | 273 Rodeo Drive by Fred Hayman 75ml EDP Spray ... | `\nWAS\n$45.00\n` | 75ml EDP Spray | `\nRRP\n$110.00\n` | `<div class="tab-pane active" id="description">...` |
| 7 | 4711 by Maurer & Wirtz Cologne 300ml For Unisex | | 300ml EDC | | `<div class="tab-pane active" id="description">...` |
| 8 | 5th Avenue by Elizabeth Arden EDP Spray 125ml ... | `\n$33.00\n` | 125ml EDP Spray | `\nRRP:\n$49.00\n` | `<div class="tab-pane active" id="description">...` |
| 9 | 5th Avenue NYC by Elizabeth Arden EDP Spray 12... | `\n$25.00\n` | | `\nRRP:\n$49.00\n` | `<div class="tab-pane active" id="description">...` |
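
The scraped `price` and `rrp` values still contain newlines and labels such as `WAS` or `RRP:`. A small sketch of one way to turn them into numbers, assuming every price is written as a dollar amount like `$99.00`; `parse_price` is a hypothetical helper:

```python
import re

def parse_price(raw):
    """Extract the dollar amount from a scraped price string, or None."""
    if not isinstance(raw, str):
        return None  # e.g. a missing value that is not a string
    match = re.search(r'\$(\d+(?:\.\d+)?)', raw)
    return float(match.group(1)) if match else None

print(parse_price('\nWAS\n$99.00\n'))    # 99.0
print(parse_price('\nRRP:\n$120.00\n'))  # 120.0
print(parse_price('NA'))                 # None
```

Applied to the DataFrame, this could look like `df_perfume['price'].apply(parse_price)`.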


### Let's Learn to Use Scrapy as a Framework
Scrapy cannot execute JavaScript on its own. The scrapy-splash plugin linked below adds JavaScript rendering support.

https://github.com/scrapy-plugins/scrapy-splash

#### Scrapy Tutorial
[Scrapy Tutorial](https://docs.scrapy.org/en/latest/intro/tutorial.html)

<a id='selenium'></a>
## Selenium

https://ivantay2003.medium.com/selenium-cheat-sheet-in-python-87221ee06c83<br>
https://dev.to/razgandeanu/selenium-cheat-sheet-9lc

In [159]:
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Selenium 3 style; Selenium 4 expects webdriver.Chrome(service=Service(path)),
# with Service imported from selenium.webdriver.chrome.service.
chrome_browser = webdriver.Chrome('./chromedriver/chromedriver.exe')

# print(chrome_browser)

chrome_browser.maximize_window()
chrome_browser.get('https://www.seleniumeasy.com/test/basic-first-form-demo.html')

# Give the lightbox time to appear, then close it
time.sleep(2)
lightbox_close_x = chrome_browser.find_element(By.ID, "at-cv-lightbox-close")
lightbox_close_x.click()
time.sleep(1)
# print(chrome_browser.title)
assert 'Selenium Easy Demo' in chrome_browser.title

# find_element(By.<strategy>, value) replaces the find_element_by_* helpers,
# which were removed in Selenium 4
show_message_button = chrome_browser.find_element(By.CLASS_NAME, 'btn-default')
# print(show_message_button.get_attribute('innerHTML'))

assert 'Show Message' in chrome_browser.page_source
user_message = chrome_browser.find_element(By.ID, 'user-message')
user_message.clear()
user_message.send_keys('I am learning python')
show_message_button.click()

my_message = chrome_browser.find_element(By.ID, 'display')
print(my_message.get_attribute('innerHTML'))  # raw inner HTML of the element
print(my_message.text)                        # visible text only

# quit() closes every window and ends the driver session
chrome_browser.quit()

I am learning python
I am learning python
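
Fixed `time.sleep()` calls either waste time or fail when the page is slow. Selenium provides `WebDriverWait` in `selenium.webdriver.support.ui` for waiting on a condition instead. The sketch below only illustrates the underlying polling idea in plain Python so that it runs without a browser; `wait_until` is a hypothetical helper, not part of Selenium:

```python
import time

def wait_until(condition, timeout=5.0, poll_interval=0.1):
    """Poll condition() until it returns a truthy value or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError('condition not met within timeout')
        time.sleep(poll_interval)

# Stand-in for "the element has appeared": becomes true on the third check.
attempts = {'count': 0}

def element_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3

print(wait_until(element_ready, timeout=2.0, poll_interval=0.01))  # True
```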
