# Data Scraping

## Outline
* Access Structured Data
    * accessing ***REST*** APIs
    * common bus systems
        
* Access Unstructured Data
    * web scraping
    * PDF scraping
    

## Data Science Processing Pipeline
<img src="IMG/workflow.png" width=1200>

## Recall: JSON
### *Data with less structure than tables: sparse entries or flexible schema*
<img SRC="IMG/json-logo.png" width=400>

***JavaScript Object Notation (JSON)*** is an open-standard file format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays.
JSON is a language-independent data format. It was derived from JavaScript, but many programming languages now include code to generate and parse JSON-format data.
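
As a minimal sketch of what "flexible schema" means in practice, the following example serializes a list of records whose fields differ from entry to entry and parses the text back. It uses only Python's built-in `json` module; the sensor records are made-up illustration data, not taken from any real service.

```python
import json

# Made-up records: every entry has a different set of fields (flexible schema)
sensors = [
    {"id": "s1", "type": "temperature", "value": 21.5, "unit": "C"},
    {"id": "s2", "type": "door", "open": False},
    {"id": "s3", "type": "gps", "position": {"lat": 48.47, "lon": 7.94}, "tags": ["mobile"]},
]

# Python objects -> JSON text (human-readable thanks to indent)
text = json.dumps(sensors, indent=2)
print(text)

# JSON text -> Python objects (dicts, lists, str, float, bool, None)
restored = json.loads(text)
print(restored[2]["position"]["lat"])
```

Nested objects and arrays map directly to Python dictionaries and lists, which is why a table with fixed columns would be a poor fit for data like this.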

### JSON Document Tree
<img SRC="IMG/json.png">

### JSON in Python


In [1]:
import pandas as pd

Data = {'Product': ['Desktop Computer','Tablet','iPhone','Laptop'],
        'Price': [700,250,800,1200]
        }

df = pd.DataFrame(Data, columns=['Product', 'Price'])

print(df)


            Product  Price
0  Desktop Computer    700
1            Tablet    250
2            iPhone    800
3            Laptop   1200


In [2]:
# native JSON support in pandas
# to_json() writes the file and returns None when a path is given
df.to_json('Export_DataFrame.json')
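
When `to_json` is called without a path, it returns the JSON text instead of writing a file, which makes it easy to inspect the layout. The sketch below compares the default column-oriented layout with `orient="records"`; the two-row DataFrame is a shortened copy of the example above.

```python
import json
import pandas as pd

df = pd.DataFrame({"Product": ["Desktop Computer", "Tablet"], "Price": [700, 250]})

# Default layout: {column -> {index -> value}}
by_column = json.loads(df.to_json())
print(by_column)

# One JSON object per row, often closer to what REST services deliver
by_record = json.loads(df.to_json(orient="records"))
print(by_record)
```

Which layout is preferable depends on the consumer: the record layout is easier to read row by row, while the column layout keeps the index.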

#### Use ***Colab*** to browse the JSON file.

## REST-APIs: Getting Data from Sensors and Services
### *IoT use cases and mashups*

* **REST**: **RE**presentational **S**tate **T**ransfer, the de facto standard for network (HTTP) communication
* designed for performance, scalability, simplicity, and reliability in **client-server** data exchange



Also see: [https://en.wikipedia.org/wiki/Representational_state_transfer](https://en.wikipedia.org/wiki/Representational_state_transfer)

### REST
* **Stateless:** The server won’t maintain any state between requests from the client.
* **Client-server:** The client and server must be decoupled from each other, allowing each to develop independently.
* **Cacheable:** The data retrieved from the server should be cacheable either by the client or by the server.
* **Uniform interface:** The server will provide a uniform interface for accessing resources without defining their representation.
* **Layered system:** The client may access the resources on the server indirectly through other layers such as a proxy or load balancer.

### REST Schema
<img src='IMG/REST.png'>
[image from Wikipedia]

### Example

* ***GitHub*** REST API -> User information

[https://api.github.com/users/keuperj](https://api.github.com/users/keuperj)

### REST communication via HTTP(s) requests

* GET: Retrieve an existing resource.
* POST: Create a new resource.
* PUT: Update an existing resource.
* PATCH: Partially update an existing resource.
* DELETE: Delete a resource.
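
The mapping from HTTP methods to operations on a resource can be sketched without any network access. The class below is a hypothetical in-memory stand-in for a `/todos` endpoint; the name `TodoStore` and its return convention of `(status_code, body)` are assumptions for illustration and not part of any library.

```python
class TodoStore:
    """Hypothetical in-memory stand-in for a REST resource collection such as /todos."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def post(self, data):
        # POST: create a new resource, the server assigns the id
        item = {**data, "id": self._next_id}
        self._items[item["id"]] = item
        self._next_id += 1
        return 201, item

    def get(self, item_id):
        # GET: retrieve an existing resource without changing it
        item = self._items.get(item_id)
        return (200, item) if item is not None else (404, None)

    def put(self, item_id, data):
        # PUT: replace the whole resource with the new representation
        if item_id not in self._items:
            return 404, None
        self._items[item_id] = {**data, "id": item_id}
        return 200, self._items[item_id]

    def patch(self, item_id, changes):
        # PATCH: change only the given fields, keep the rest
        if item_id not in self._items:
            return 404, None
        self._items[item_id].update(changes)
        return 200, self._items[item_id]

    def delete(self, item_id):
        # DELETE: remove the resource
        if self._items.pop(item_id, None) is None:
            return 404, None
        return 204, None


store = TodoStore()
status, todo = store.post({"title": "Buy milk", "completed": False})
print(status, todo)
print(store.patch(todo["id"], {"completed": True}))
print(store.put(todo["id"], {"title": "Buy oat milk"}))
print(store.delete(todo["id"]))
print(store.get(todo["id"]))
```

Note the difference between the two update methods: after `put`, the `completed` field is gone because the whole resource was replaced, whereas `patch` only touched the field that was sent.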

#### Data payload -> JSON !

### REST interactions in Python
#### with the Requests library

<center>
<img src="IMG/requests.png" width=300>
</center>

* [https://docs.python-requests.org/en/master/](https://docs.python-requests.org/en/master/)

In [1]:
# Example: read data from service

import requests
api_url = "https://jsonplaceholder.typicode.com/todos/23" # open REST service for tests
response = requests.get(api_url)
response.json()

{'userId': 2,
 'id': 23,
 'title': 'et itaque necessitatibus maxime molestiae qui quas velit',
 'completed': False}

We use the open REST test server at: https://jsonplaceholder.typicode.com

In [2]:
# write data to service

api_url = "https://jsonplaceholder.typicode.com/todos"
todo = {"userId": 1, "title": "Buy milk", "completed": False}
response = requests.post(api_url, json=todo)
response.json()

{'userId': 1, 'title': 'Buy milk', 'completed': False, 'id': 201}

In [3]:
#check transaction
response.status_code

201

HTTP status codes:

* 2xx - SUCCESS
* 4xx - Client Error
* 5xx - Server Error
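
The status classes listed above can be checked in code. This is a small sketch using only the standard library; `status_class` is a hypothetical helper name. With Requests, `response.ok` and `response.raise_for_status()` cover the common case of detecting 4xx and 5xx responses.

```python
from http import HTTPStatus


def status_class(code):
    """Map a numeric HTTP status code to the classes used in the list above."""
    if 200 <= code < 300:
        return "success"
    if 400 <= code < 500:
        return "client error"
    if 500 <= code < 600:
        return "server error"
    return "other"


for code in (200, 201, 404, 500):
    # HTTPStatus provides the standard reason phrase for each code
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```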

#### Large-scale JSON handling and queries -> MongoDB!

#### Other communication libraries are available, e.g. for CAN bus: [https://python-can.readthedocs.io/en/master/](https://python-can.readthedocs.io/en/master/)

## Getting Data from Web Resources
#### *Data Scraping*

* In some cases data is not *provided* via a defined API, but needs to be collected
   * e.g. from unstructured web data

In [4]:
# example using requests
import requests

r = requests.get('https://www.google.com')
print(r.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="VjQWxBSbl9T4IkBF-rQxLg">(function(){var _g={kEI:'5JYTZpu5MZusi-gP3sCyyAg',kEXPI:'0,1365468,206,4804,2329817,654,361,424527,38927,14764,4998,55519,2872,2891,3926,7828,30449,825,32636,13491,230,37802,69410,6666,7596,1,42154,2,39761,6700,31121,4569,6255,24673,59701,8155,8861,14490,8702,13733,9779,42459,20198,57055,16124,3030,15816,1804,35269,11813,1632,13495,29968,5222327,711,2,296,1347,853,39,106,5992324,2837752,1,58,7444385,19234296,1305643,16672,43887,3,318,4,1281,3,2121778,2585,16815,1,3,23017956,7375,8408,10756,5909,16632,819,3851,4821,5,1899,1625,35245,1922,7288,3671,4834,1573,8049,5796,10052,2901,2211,154,7815,213,391,7526,5328,2781,1720,5,1440,2,2704,4,1061,8833,10707,5134,1860,2,1071,2076,6870,662,443

### How to get structured information from Websites?
#### BeautifulSoup
* Docs: https://beautiful-soup-4.readthedocs.io/en/latest/


* Example for news website: https://news.ycombinator.com

In [5]:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.find_all('tr', class_='athing')

formatted_links = []

for link in links:
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        "url": link.find_all('td')[2].a['href'],
        # rank of this row, e.g. '1.' -> 1
        "rank": int(link.td.span.text.replace('.', ''))
    }
    formatted_links.append(data)



In [6]:
for i in range(10):
    print(formatted_links[i])

{'id': '39965006', 'title': 'When a black hole and a neutron star merge', 'url': 'https://www.mpg.de/21778967/0404-grav-mysterious-object-in-the-gap-152520-x', 'rank': 1}
{'id': '39962023', 'title': 'PumpkinOS, a Re-Implementation of PalmOS', 'url': 'https://github.com/migueletto/PumpkinOS', 'rank': 2}
{'id': '39965267', 'title': 'Phytomining – Extracting Minerals via Plants', 'url': 'https://arpa-e.energy.gov/news-and-media/press-releases/us-department-energy-announces-10-million-explore-using-plants', 'rank': 3}
{'id': '39966978', 'title': 'Bitmovin (YC S15) Is Hiring a Full Stack Engineer in Austria', 'url': 'https://bitmovin.com/careers/full-stack-engineer-7260509002', 'rank': 4}
{'id': '39961910', 'title': 'What John von Neumann did at Los Alamos (2020)', 'url': 'https://3quarksdaily.com/3quarksdaily/2020/10/what-john-von-neumann-really-did-at-los-alamos.html', 'rank': 5}
{'id': '39950989', 'title': 'Programming with DOS Debugger (2003)', 'url': 'https://susam.net/programming-with

In [7]:
#another example
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

In [8]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

In [9]:
#get the clean title
soup.title.string

"The Dormouse's story"

In [10]:
#iterate over all links
for link in soup.find_all('a'):
    print(link.get('href'))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie


In [11]:
#get all text
print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
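
Besides `find_all`, BeautifulSoup also supports CSS selectors via `select`, which is often convenient for turning matching elements into structured records. The sketch below works on a small inline HTML snippet modelled on the example above; the extra `About` link is added only to show that the selector filters by class.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/about" class="nav" id="about">About</a>
</p>"""

soup = BeautifulSoup(html, "html.parser")

# 'a.sister' selects only <a> elements that carry the CSS class 'sister'
sisters = [
    {"id": a["id"], "name": a.get_text(strip=True), "url": a["href"]}
    for a in soup.select("a.sister")
]
print(sisters)
```

A list of dictionaries like this can be passed directly to `pd.DataFrame` for further analysis.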



#### get table and convert to pandas
Example: https://www.w3schools.com/html/html_tables.asp

In [12]:
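import bs4 as bs
import urllib.request
from io import StringIO
import pandas as pd

source = urllib.request.urlopen('https://www.w3schools.com/html/html_tables.asp').read()
soup = bs.BeautifulSoup(source, 'lxml')

# collect all <table> elements and let pandas parse the first one
# (recent pandas versions expect literal HTML wrapped in a file-like object)
table = soup.find_all('table')
df = pd.read_html(StringIO(str(table)))[0]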

In [13]:
df.head()

                        Company          Contact  Country
0           Alfreds Futterkiste     Maria Anders  Germany
1    Centro comercial Moctezuma  Francisco Chang   Mexico
2                  Ernst Handel    Roland Mendel  Austria
3                Island Trading    Helen Bennett       UK
4  Laughing Bacchus Winecellars  Yoshi Tannamuri   Canada


## Scaling Web Scraping with Scrapy
#### Crawling the web

<img SRC="IMG/scrapy.jpg">

#### [https://scrapy.org/](https://scrapy.org/)

#### Example

Scraping [http://quotes.toscrape.com/page/1/](http://quotes.toscrape.com/page/1/)

In [14]:
%%writefile myCrwler.py 

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class TheFriendlyNeighbourhoodSpider(CrawlSpider):
    name = 'TheFriendlyNeighbourhoodSpider'

    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1/']

    custom_settings = {
    'LOG_LEVEL': 'INFO'
    }

    rules = (
        Rule(LinkExtractor(), callback='parse_item', follow=True),
    )


    def parse_item(self, response):
        print('Downloaded... ', response.url)
        filename = "download/"+str(response.url.split("/")[-2]) +  '.html'
        print('Saving as :', filename)
        with open(filename, 'wb') as f:
            f.write(response.body)

Writing myCrwler.py


### Run in Shell:

In [15]:
# This downloads all linked HTML files, which we can then use for analysis
!mkdir download
!scrapy runspider myCrwler.py

2024-04-08 09:05:11 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2024-04-08 09:05:11 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.11.4 (main, Jul  5 2023, 14:15:25) [GCC 11.2.0], pyOpenSSL 23.2.0 (OpenSSL 1.1.1u  30 May 2023), cryptography 41.0.2, Platform Linux-6.1.0-1036-oem-x86_64-with-glibc2.35
2024-04-08 09:05:11 [scrapy.crawler] INFO: Overridden settings:
{'LOG_LEVEL': 'INFO', 'SPIDER_LOADER_WARN_ONLY': True}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2024-04-08 09:05:11 [scrapy.extensions.telnet] INFO: Telnet Password: 50cd564a60404e99
2024-04-08 09:05:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']

Downloaded...  http://quotes.toscrape.com/tag/tea/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/author/Stephenie-Meyer/
Saving as : download/Stephenie-Meyer.html
Downloaded...  http://quotes.toscrape.com/tag/wisdom/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/understanding/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/knowledge/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/author/Alfred-Tennyson/
Saving as : download/Alfred-Tennyson.html
Downloaded...  http://quotes.toscrape.com/tag/library/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/author/J-D-Salinger/
Saving as : download/J-D-Salinger.html
Downloaded...  http://quotes.toscrape.com/tag/reading/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/contentment/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/aut

Downloaded...  http://quotes.toscrape.com/tag/inspirational/page/2/
Saving as : download/2.html
Downloaded...  http://quotes.toscrape.com/author/Mother-Teresa/
Saving as : download/Mother-Teresa.html
Downloaded...  http://quotes.toscrape.com/tag/attributed-no-source/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/imagination/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/music/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/page/4/
Saving as : download/4.html
Downloaded...  http://quotes.toscrape.com/author/Allen-Saunders/
Saving as : download/Allen-Saunders.html
Downloaded...  http://quotes.toscrape.com/author/Douglas-Adams/
Saving as : download/Douglas-Adams.html
Downloaded...  http://quotes.toscrape.com/tag/happiness/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/hope/page/1/
Saving as : download/1.html
Downloaded...  http://quotes.toscrape.com/tag/de

In [16]:
!ls download

10.html			   Eleanor-Roosevelt.html    Jorge-Luis-Borges.html
1.html			   Elie-Wiesel.html	     J-R-R-Tolkien.html
2.html			   Ernest-Hemingway.html     Khaled-Hosseini.html
3.html			   Friedrich-Nietzsche.html  life.html
4.html			   friendship.html	     love.html
5.html			   friends.html		     Madeleine-LEngle.html
6.html			   Garrison-Keillor.html     Marilyn-Monroe.html
7.html			   George-Bernard-Shaw.html  Mark-Twain.html
8.html			   George-Carlin.html	     Martin-Luther-King-Jr.html
9.html			   George-Eliot.html	     Mother-Teresa.html
Albert-Einstein.html	   George-R-R-Martin.html    Pablo-Neruda.html
Alexandre-Dumas-fils.html  Harper-Lee.html	     quotes.toscrape.com.html
Alfred-Tennyson.html	   Haruki-Murakami.html      Ralph-Waldo-Emerson.html
Allen-Saunders.html	   Helen-Keller.html	     reading.html
Andre-Gide.html		   humor.html		     simile.html
Ayn-Rand.html		   inspirational.html	     Stephenie-Meyer.html
Bob-Marley.html		   James-Baldwin.html	     Steve-
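
In the log above, many different pages are saved as `download/1.html`, because the spider uses only the second-to-last URL segment as file name, so later pages overwrite earlier ones. One possible way to avoid such collisions is to build the name from the full URL path. The helper `url_to_filename` below is a hypothetical sketch based on the standard library, not part of Scrapy.

```python
from urllib.parse import urlparse


def url_to_filename(url):
    """Build a file name from all path segments of a URL (hypothetical helper)."""
    parsed = urlparse(url)
    # drop empty segments caused by leading and trailing slashes
    segments = [s for s in parsed.path.split("/") if s]
    # fall back to the host name for the start page, which has no path segments
    stem = "_".join(segments) if segments else parsed.netloc
    return stem + ".html"


print(url_to_filename("http://quotes.toscrape.com/tag/tea/page/1/"))
print(url_to_filename("http://quotes.toscrape.com/tag/wisdom/page/1/"))
print(url_to_filename("http://quotes.toscrape.com/author/Mother-Teresa/"))
```

Inside `parse_item`, the file name could then be built as `"download/" + url_to_filename(response.url)`.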

## Scraping PDFs
* using ``tabula-py``
* Docs: https://tabula-py.readthedocs.io/en/latest/

In [17]:
pip install tabula-py

Collecting tabula-py
  Obtaining dependency information for tabula-py from https://files.pythonhosted.org/packages/d5/77/b34088cbb55ba59e1cc6512ab2ff3b7679102b7f7577982a96cbdcddb90c/tabula_py-2.9.0-py3-none-any.whl.metadata
  Downloading tabula_py-2.9.0-py3-none-any.whl.metadata (7.5 kB)
Downloading tabula_py-2.9.0-py3-none-any.whl (12.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.0/12.0 MB 11.5 MB/s eta 0:00:00
Installing collected packages: tabula-py
Successfully installed tabula-py-2.9.0
Note: you may need to restart the kernel to use updated packages.


#### Example: 
* a UN report as [PDF](https://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf)

In [6]:
import tabula
url = "https://www.un.org/en/development/desa/population/publications/pdf/urbanization/the_worlds_cities_in_2016_data_booklet.pdf"

# get the tables on PDF page 6
df = tabula.read_pdf(url, pages='6')


HTTPError: HTTP Error 403: Forbidden
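
The 403 response above indicates that the server refused the direct download. A possible workaround, sketched here with the standard library, is to fetch the PDF separately with a browser-like `User-Agent` header and pass the local file to `tabula.read_pdf`. Whether this server accepts such a request is an assumption; `build_request`, `download_pdf`, `read_tables`, the header value and the file name `report.pdf` are hypothetical names chosen for illustration.

```python
import urllib.request

PDF_URL = ("https://www.un.org/en/development/desa/population/publications/pdf/"
           "urbanization/the_worlds_cities_in_2016_data_booklet.pdf")
USER_AGENT = "Mozilla/5.0 (data science course example)"  # assumed header value


def build_request(url, user_agent=USER_AGENT):
    # attach a User-Agent header; some servers reject requests without one
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download_pdf(url, target="report.pdf"):
    # store the PDF locally so that tabula does not need to fetch it itself
    with urllib.request.urlopen(build_request(url)) as resp, open(target, "wb") as f:
        f.write(resp.read())
    return target


def read_tables(path, pages="6"):
    # imported here so the download helpers also work without tabula installed
    import tabula
    return tabula.read_pdf(path, pages=pages)


# Usage (requires network access, tabula-py and a Java runtime):
# tables = read_tables(download_pdf(PDF_URL))
# tables[0].head(10)
```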

In [29]:
#returns list of all tables - get the first one
df[0].head(10)

Unnamed: 0.1,"Of the world’s 31 megacities (that is,",Unnamed: 0,Population,Unnamed: 1,Population.1
0,,,in 2016,,in 2030
1,cities with 10 million inhabitants or Rank,"City, Country",(thousands),"City, Country",(thousands)
2,"more) in 2016, 24 are located in the 1","Tokyo, Japan",38 140,"Tokyo, Japan",37 190
3,less developed regions or the “global 2 3,"Delhi, India Shanghai, China",26 454 24 484,"Delhi, India Shanghai, China",36 06030 751
4,South”. China alone was home to six 4,"Mumbai (Bombay), India",21 357,"Mumbai (Bombay), India",27 797
5,"megacities in 2016, while India had 5","São Paulo, Brazil",21 297,"Beijing, China",27 706
6,27 374five. 6 7,"Beijing, China Ciudad de México (Mexico City),...",21 240 21 157,"Dhaka, Bangladesh Karachi, Pakistan",24 838
7,8,"Kinki M.M.A. (Osaka), Japan",20 337,"Al-Qahirah (Cairo), Egypt",24 502
8,The 10 cities that are projected to be- 9,"Al-Qahirah (Cairo), Egypt",19 128,"Lagos, Nigeria",24 239
9,10,"New York-Newark, USA",18 604,"Ciudad de México (Mexico City), Mexico",23 865
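
The extracted population cells use a space as thousands separator, and some cells contain the values of two merged rows (for example `26 454 24 484`). The following sketch shows one heuristic way to recover the numbers from such cells with a regular expression. `parse_population_cell` is a hypothetical helper; it is meant for the data rows only, so header rows such as `in 2016` should be dropped first.

```python
import re


def parse_population_cell(cell):
    """Return all numbers found in a cell such as '38 140' or '26 454 24 484'."""
    # one number = 1-3 leading digits followed by groups of ' ' + exactly 3 digits
    groups = re.findall(r"\d{1,3}(?: \d{3})*", cell)
    return [int(g.replace(" ", "")) for g in groups]


print(parse_population_cell("38 140"))
print(parse_population_cell("26 454 24 484"))
print(parse_population_cell("36 06030 751"))
```

Heuristics like this are fragile, so the cleaned values should be checked against the source PDF.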


# Discussion