# Web Scraping 101 (oDCM)

*Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce pretium risus at ultricies egestas. Vivamus sit amet arcu sem. In hac habitasse platea dictumst. Nulla pharetra vitae mauris sed mollis. Pellentesque placerat mauris dui, in venenatis nisl posuere ac. Nunc vitae tincidunt risus, ut pellentesque odio. Donec quam neque, iaculis id eros et, condimentum vulputate nulla. Nullam sed ligula leo.*

--- 

## Learning Objectives

Students will be able to: 
* A
* B
* C


--- 

## Acknowledgements
This course draws on online resources built by Brian Keegan, Colt Steele, David Amos, Hannah Cushman Garland, Kimberly Fessel, and Thomas Laetsch. 


--- 

## Contact
For technical issues, be as specific as possible (e.g., include screenshots, your notebook, and error messages) so that we can help you better.

**WhatsApp**  
+31 13 466 8938

**Email**  
odcm@uvt.nl

---

## 1. Generating Seeds
* Scrape all fiction books (65 books) across 4 pages
    * Books on each page: collect the link from each `href` attribute
    * Page numbers are part of the URL, e.g. https://books.toscrape.com/catalogue/category/books_1/page-2.html
    * If the URLs follow no predictable pattern, follow the next button instead
        * base_url (quotes.toscrape.com)
        * url = /page/1
    * Timers: pause between requests
* Scrapy
    * A couple of lines give really impressive results, but you have to follow the rules of the framework: a trade-off between speed of use and flexibility
        * e.g., fixed attribute names such as `start_urls`
    * Good documentation
    * Crawl = scraping a page, finding a link to another page on that site, following it, and repeating
    
https://quotes.toscrape.com
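
The two ways of generating seeds described above can be sketched with the standard library alone. This is a minimal illustration, not a full scraper: the page count, the helper name `build_page_urls`, and the `next_href` value are assumptions made for the example.

```python
from urllib.parse import urljoin

# Assumed values for illustration: the URL pattern from the notes above
# and a page count of 4.
BASE_URL = "https://books.toscrape.com/catalogue/category/books_1/"
NUM_PAGES = 4


def build_page_urls(base_url, num_pages):
    """Return one seed URL per results page (page-1.html, page-2.html, ...)."""
    return [urljoin(base_url, f"page-{n}.html") for n in range(1, num_pages + 1)]


page_urls = build_page_urls(BASE_URL, NUM_PAGES)

# When the URLs follow no pattern, the href of the next button is usually
# relative, so it is joined with the base URL before it can be requested.
next_href = "/page/2/"  # hypothetical href taken from a next button
next_url = urljoin("https://quotes.toscrape.com", next_href)

print(page_urls[1])
print(next_url)
```

Building the list up front works when the page count is known; following the next button is the fallback when it is not.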


In [None]:
from time import sleep

# don't overload server
sleep(2)

In [None]:
from csv import writer

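# write data to csv (newline="" prevents blank rows on Windows)
with open("blog_data.csv", "w", newline="") as csv_file:
    csv_writer = writer(csv_file)
    csv_writer.writerow(["title", "link", "date"])
    
    # `articles` holds the scraped elements; fill in the extraction for each field
    for article in articles: 
        title = ...
        url = ... 
        date = ...
        csv_writer.writerow([title, url, date])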
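
To see the CSV pattern run end to end, here is a small sketch with made-up records in place of scraped articles. The `articles` list and its field values are hypothetical, and the file is written to a temporary directory so nothing in the working folder is overwritten.

```python
import csv
import os
import tempfile

# Hypothetical records standing in for scraped articles.
articles = [
    {"title": "First post", "link": "/blog/first-post", "date": "2021-01-05"},
    {"title": "Second post", "link": "/blog/second-post", "date": "2021-02-11"},
]

csv_path = os.path.join(tempfile.mkdtemp(), "blog_data.csv")

# newline="" lets the csv module control line endings itself.
with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
    csv_writer = csv.writer(csv_file)
    csv_writer.writerow(["title", "link", "date"])
    for article in articles:
        csv_writer.writerow([article["title"], article["link"], article["date"]])

# Read the file back to check what was written (header row first).
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))

print(rows)
```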

* https://github.com/kimfetti/Conferences/tree/master/PyCon_2020
* https://www.youtube.com/watch?v=RUQWPJ1T6Zc&t=190s
* https://github.com/hancush/web-scraping-with-python/blob/master/session/web-scraping-with-python.ipynb#HTML-basics
* https://www.udemy.com/course/the-modern-python3-bootcamp/learn/lecture/7991196#overview
* https://campus.datacamp.com/courses/web-scraping-with-python/introduction-to-html?ex=1
* https://realpython.com/python-web-scraping-practical-introduction/
* https://github.com/CU-ITSS/Web-Data-Scraping-S2019

#### Classes and ids

In [None]:
#print(soup.find(class_='price_color'))
#print(soup.find(id='product_gallery'))

In [None]:
soup.find(attrs={"data-example": "yes"})

#### CSS Selectors
* Select by id of foo: #foo
* Select by class of bar: .bar
* Select children: div > p
* Select descendants: div p

* name - tag name
* attrs - dictionary of attributes, e.g. the number of stars, which is not stored as separate text but in an attribute
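
The selectors listed above can be tried on a small inline document. The HTML fragment below is hypothetical (loosely modelled on a book listing), so the example runs without a network request; it assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, loosely modelled on a book listing.
html = """
<div id="product_gallery">
  <article class="product_pod" data-example="yes">
    <p class="star-rating Three"></p>
    <h3><a href="book_1.html" title="Book One">Book One</a></h3>
    <div class="product_price"><p class="price_color">10.00</p></div>
  </article>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

by_id = soup.select_one("#product_gallery")         # select by id
by_class = soup.select_one(".price_color")          # select by class
children = soup.select("article > p")               # direct children only
descendants = soup.select("article p")              # any depth below <article>
by_attr = soup.find(attrs={"data-example": "yes"})  # any other attribute

# The rating is not text; it is the second class name on the element.
stars = soup.find(class_="star-rating")["class"][1]

print(by_class.get_text(), len(children), len(descendants), stars)
```

Only one `<p>` is a direct child of `<article>`, while two `<p>` elements sit somewhere below it, which is the difference between `div > p` and `div p`.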

In [None]:
soup.find('body').find('h1')

In [None]:
# you could also do this with find(class_ = "... "), which is easier
soup.select('.price_color')[0].get_text()

In [None]:
# number of stars: stored as the second class name (e.g., "Three")
soup.find(class_="star-rating").attrs["class"][1]

In [None]:
# .contents is a list of child nodes (including '\n' strings); [1] picks the second one
soup.body.contents[1]

In [None]:
# sibling is on the same level of the hierarchy 
soup.body.contents[1].next_sibling

In [None]:
soup.body.contents[1].find_next_sibling()

* `.parent`
* `.contents`
* `.next_sibling`
* `.previous_sibling`
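
A short sketch of how these navigation attributes behave, using a hypothetical inline fragment and assuming `beautifulsoup4` is installed. It shows why `next_sibling` often returns a newline, whereas `find_next_sibling()` skips to the next tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; the newlines between tags become text nodes.
html = """<body>
<h1>Books</h1>
<p class="intro">First paragraph</p>
<p>Second paragraph</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")
body = soup.body

first = body.contents[0]    # the "\n" right after <body>
heading = body.contents[1]  # the <h1> tag

after_heading = heading.next_sibling      # the "\n" text node after </h1>
next_tag = heading.find_next_sibling()    # skips text, returns <p class="intro">
parent_name = heading.parent.name         # the tag one level up
before_tag = next_tag.previous_sibling    # again the "\n" text node

print(repr(after_heading), next_tag.get_text(), parent_name)
```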

Project
* Scrape data into CSV
* Goal: Grab all links from blog
* Data: store URL, anchor tag text, and date


* Looping through a list of books: `for book in books:`

---

## XPath & Selectors 

* A single forward slash `/` moves forward one generation (from an element to its direct children)
* Double forward slashes `//` select matching elements anywhere in the HTML document, at any depth
* Tag names between slashes give the direction to the element(s) to select
* Brackets `[]` after a tag name tell us which of the selected siblings to choose (XPath counts from 1)

In case the cell below throws a `ModuleNotFoundError` at you, you first need to install the `scrapy` package. Go to your terminal and type `conda install scrapy` and press `y` to proceed.

* select the same book title with XPath

In [1]:
import scrapy 

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = ['http://books.toscrape.com']
    
    def parse(self, response):
        # response is what you get back from the HTTP request
        # yield instead of return, so the spider emits one item per book
        for article in response.css('article.product_pod'):
            yield {
                # .get() extracts the text of the first match as a string
                'price': article.css(".price_color::text").get()
            }

In [None]:
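import requests
from scrapy import Selector

# `url` is assumed to be defined in an earlier cell
# Selector(text=...) expects a string, so use .text rather than .content (bytes)
html = requests.get(url).text
sel = Selector(text=html)

sel.xpath("//h1")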

## Legality
* Some websites don't want people scraping them
* Best practice: consult the robots.txt file 
    * It is the website's way of saying: we don't want any code accessing these pages, but this page is OK
    * It's not law; it's convention
    * imdb.com/robots.txt
    * Different rules apply to Yahoo's crawler than to other search engines and crawlers (everything is disallowed)
    * rithmschool.com/robots.txt
        * `User-agent: *` = the rules that follow apply to every crawler, wherever you're coming from
        * `Allow: /` = you're allowed to access everything
* If you make many requests, space them out (you don't want to fire requests one after another far faster than any human would)
    * One reason is to be polite (don't overload their servers)
    * Another is that when the developers or a server notice 100,000 requests coming from one IP address, it is very clear that somebody is scraping them
    * https://en.wikipedia.org/wiki/Craigslist_Inc._v._3Taps_Inc.
        * The company 3Taps Inc. was scraping Craigslist and using the data to build its own website (not just analyzing it)
        * Scraping by itself is generally not what gets you sued, jailed, or fined. In this case, however, the court held that sending a cease-and-desist letter and enacting an IP address block is sufficient notice of online trespassing, which a plaintiff can use to claim a violation of the Computer Fraud and Abuse Act.
        * You don't want to launch a website or a company that relies on scraping, especially once you get a cease-and-desist letter (or they block your IP address)
        * https://www.lexology.com/library/detail.aspx?g=210e78b2-41df-4f75-a7fb-4e60909e231a
        * We're going to stay away from anything potentially in that gray area. We're going to scrape my own sites and sites that I've worked on, where we have permission to scrape.
* If you're too aggressive, your IP can be blocked
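
Checking robots.txt can be automated with Python's built-in `urllib.robotparser`. The sketch below parses a hypothetical robots.txt from a string so it runs offline; the rules, the `example.com` URLs, and the user-agent name are all assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real scraper would download the file
# from the site (e.g. with set_url() followed by read()).
robots_txt = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

USER_AGENT = "odcm-student-bot"  # assumed name for our scraper

allowed = parser.can_fetch(USER_AGENT, "https://example.com/catalogue/page-1.html")
blocked = parser.can_fetch(USER_AGENT, "https://example.com/private/data.html")

# Seconds to wait between requests; fall back to 2 if the file gives none.
delay = parser.crawl_delay(USER_AGENT) or 2

print(allowed, blocked, delay)
```

The resulting `delay` can be passed to `sleep()` between requests, which covers both the politeness and the IP-blocking concerns above.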