# Data Collection

## Outline
* Data Science Processing Pipeline
* Access Structured Data
    * ***Pandas*** -> working with tables in Python
    * Relational Databases: Python and ***SQL***
    * ***NumPy*** -> working with array data in Python
* Access Unstructured Data
    * accessing ***REST*** APIs
        * JSON files
    * web scraping
    

## Data Science Processing Pipeline
<img src="IMG/workflow.png" width=1200>

## ***Tabular Data*** in Python with Pandas
Started as "***spreadsheets for Python***", pandas has become one of the most important ***data wrangling*** and **EDA** (exploratory data analysis) tools in ***Python***.<BR>
<img src="IMG/pandas_logo.png">

***pandas*** is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
Python has long been great for data munging and preparation, but less so for data analysis and modeling. pandas helps fill this gap, enabling you to carry out your entire data analysis workflow in Python without having to switch to a more domain specific language like **R**.[pandas website]




### Pandas Documentation
* Pandas website:  https://pandas.pydata.org/
* Pandas user guide: http://pandas.pydata.org/pandas-docs/stable/user_guide/index.html
* Pandas API documentation: http://pandas.pydata.org/pandas-docs/stable/reference/index.html
* Very useful: Pandas Cheat Sheet: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

### Pandas in a Nutshell


In [1]:
#import the pandas module
import pandas as pd #naming convention for pandas is pd

#### The central element of ***Pandas*** is the ***DataFrame***
* spreadsheet-like data structure
* organizes data into tables
* database-like functionality
* array compatible
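
The following is a minimal sketch of these four properties on a small hand-made table. The column names and values are invented for illustration and are not part of the lecture data.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration
df = pd.DataFrame({
    "city": ["Berlin", "Paris", "Berlin", "Rome"],
    "temp_c": [9.5, 12.0, 11.5, 16.0],
})

# Spreadsheet-like: labelled columns, derived columns computed for all rows
df["temp_f"] = df["temp_c"] * 9 / 5 + 32

# Database-like: select rows with a condition (similar to SQL WHERE)
berlin = df[df["city"] == "Berlin"]

# Array-compatible: numeric columns convert to a NumPy array
values = df[["temp_c", "temp_f"]].to_numpy()

print(berlin)
print(values.shape)
```

The boolean mask `df["city"] == "Berlin"` is itself a column of True/False values, which is why row selection reads much like a filter condition.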

In [None]:
#get the data
!git clone https://github.com/keuperj/DATA.git

In [2]:
# Accessing tabular data from CSV files
d = pd.read_csv('DATA/weather.csv')
d.head()  # show the first rows of the DataFrame

Unnamed: 0,Formatted Date,Summary,Precip Type,Temperature (C),Apparent Temperature (C),Humidity,Wind Speed (km/h),Wind Bearing (degrees),Visibility (km),Loud Cover,Pressure (millibars),Daily Summary
0,2006-04-01 00:00:00.000 +0200,Partly Cloudy,rain,9.472222,7.388889,0.89,14.1197,251.0,15.8263,0.0,1015.13,Partly cloudy throughout the day.
1,2006-04-01 01:00:00.000 +0200,Partly Cloudy,rain,9.355556,7.227778,0.86,14.2646,259.0,15.8263,0.0,1015.63,Partly cloudy throughout the day.
2,2006-04-01 02:00:00.000 +0200,Mostly Cloudy,rain,9.377778,9.377778,0.89,3.9284,204.0,14.9569,0.0,1015.94,Partly cloudy throughout the day.
3,2006-04-01 03:00:00.000 +0200,Partly Cloudy,rain,8.288889,5.944444,0.83,14.1036,269.0,15.8263,0.0,1016.41,Partly cloudy throughout the day.
4,2006-04-01 04:00:00.000 +0200,Mostly Cloudy,rain,8.755556,6.977778,0.83,11.0446,259.0,15.8263,0.0,1016.51,Partly cloudy throughout the day.


#### Pandas Features
* Data in- and export
* DataFrame (DF) data structure with functionality of
    * spreadsheet
    * relational database
* DF Statistics
* DF Visualization
* Rich library of ***wrangling*** methods -> Data Science lecture
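
As a rough sketch of some of these features, the example below reads a CSV, computes summary statistics, aggregates by group, and exports the result. The CSV content and column names are assumptions made up for this example; an in-memory string stands in for a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """station,temp_c,humidity
A,9.5,0.89
A,8.3,0.83
B,12.1,0.70
B,13.4,0.65
"""

# Data import: read_csv accepts file paths as well as file-like objects
df = pd.read_csv(io.StringIO(csv_text))

# Statistics: summary of all numeric columns
summary = df.describe()

# Database-like functionality: group rows and aggregate (like SQL GROUP BY)
mean_temp = df.groupby("station")["temp_c"].mean()

# Data export: with no path given, to_csv returns the CSV text as a string
exported = df.to_csv(index=False)

print(summary)
print(mean_temp)
```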

#### <font color="red">Detailed introduction in the lab session!</font>



## Accessing Structured Data from Relational Databases

* Data structure: tables
* Relational Algebra 

<img src="IMG/mtable.png" width=700>

#### <font color="red">S</font>tructured <font color="red">Q</font>uery <font color="red">L</font>anguage <font color="red">: SQL</font>

Structured Query Language is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). 



 

#### In a Nutshell: ACID properties of relational databases 
* <font color="red">A</font>tomicity 
* <font color="red">C</font>onsistency 
* <font color="red">I</font>solation 
* <font color="red">D</font>urability
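
Atomicity and consistency can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are hypothetical and exist only for this example; SQLite stands in for a full database server.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
con.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
con.commit()


def transfer(con, src, dst, amount):
    """Move `amount` from src to dst as one all-or-nothing transaction."""
    try:
        with con:  # commits on success, rolls back if an exception is raised
            con.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            con.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(con, "alice", "bob", 30)       # both updates are committed
failed = transfer(con, "alice", "bob", 500)  # CHECK fails, credit is rolled back

balances = dict(con.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failing transfer the credit to `bob` has already been executed when the debit violates the `CHECK` constraint. The rollback undoes it, so the database never exposes a half-finished transfer.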

### SQL in ***Pandas***
Load the result of an SQL query into a ***pandas*** DataFrame:

**NOTE:** not running code - no SQL server here...

```
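from datetime import datetime
import pandas as pd

# db: an open connection to the SQL server (not defined here)
df = pd.read_sql(('select "Timestamp","Value" from "MyTable" '
                  'where "Timestamp" BETWEEN %(dstart)s AND %(dfinish)s'),
                 db, params={"dstart": datetime(2014, 6, 24, 16, 0), "dfinish": datetime(2014, 6, 24, 17, 0)},
                 index_col=['Timestamp'])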
```
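
Since no SQL server is available here, the same pattern can be tried with an in-memory SQLite database. This is only a sketch: the table contents are invented, timestamps are stored as text, and the query uses SQLite's `:name` placeholders instead of the `%(name)s` style of the snippet above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a real SQL server
db = sqlite3.connect(":memory:")
db.execute('CREATE TABLE "MyTable" ("Timestamp" TEXT, "Value" REAL)')
db.executemany(
    'INSERT INTO "MyTable" VALUES (?, ?)',
    [
        ("2014-06-24 15:30:00", 1.0),
        ("2014-06-24 16:15:00", 2.0),
        ("2014-06-24 16:45:00", 3.0),
        ("2014-06-24 17:30:00", 4.0),
    ],
)
db.commit()

# The query parameters are passed separately from the SQL text,
# so they are never pasted into the query string by hand
df = pd.read_sql(
    'SELECT "Timestamp", "Value" FROM "MyTable" '
    'WHERE "Timestamp" BETWEEN :dstart AND :dfinish',
    db,
    params={"dstart": "2014-06-24 16:00:00", "dfinish": "2014-06-24 17:00:00"},
    index_col="Timestamp",
)
print(df)
```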

#### <font color="red">Data Mining practice: use <i>SQL</i> in pre-processing to get the data, then use pandas for fine-grained data selection and pre-processing (wrangling)</font>

-> more in the lab session this week

## Working with Array Data
### *Data with more structure than tables*
<center>
    <img src="IMG/numpy.jpeg">
    <br><br>
    <A HREF="https://docs.scipy.org/doc/numpy/">https://docs.scipy.org/doc/numpy/</A>
</center>

In [3]:
#community convention: import numpy as "np"
import numpy as np

In [6]:
#Example: 2D array (=matrix) in NumPy
A=np.array([[1,2,3,4],[1,2,3,4],[5,6,7,8]])
A

array([[1, 2, 3, 4],
       [1, 2, 3, 4],
       [5, 6, 7, 8]])

In [7]:
#Example: 3D array in NumPy
B=np.array([[[1,2,3,4],[1,2,3,4],[5,6,7,8]],[[1,2,3,4],[1,2,3,4],[5,6,7,8]],[[1,2,3,4],[1,2,3,4],[5,6,7,8]]])
B

array([[[1, 2, 3, 4],
        [1, 2, 3, 4],
        [5, 6, 7, 8]],

       [[1, 2, 3, 4],
        [1, 2, 3, 4],
        [5, 6, 7, 8]],

       [[1, 2, 3, 4],
        [1, 2, 3, 4],
        [5, 6, 7, 8]]])
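
A short sketch of what can be done with such arrays: inspecting their shape, indexing, and reducing along an axis. It rebuilds `A` and `B` from the cells above (using `np.stack`, which is a choice made for this example) so that it runs on its own.

```python
import numpy as np

A = np.array([[1, 2, 3, 4], [1, 2, 3, 4], [5, 6, 7, 8]])  # 2D: 3 rows x 4 columns
B = np.stack([A, A, A])                                    # 3D: three copies of A

# Shape and number of dimensions
print(A.shape, A.ndim)
print(B.shape, B.ndim)

# Indexing: row 2, column 1 of A; the first "layer" of B is A again
print(A[2, 1])
print(np.array_equal(B[0], A))

# Reduction along an axis: column sums of A
print(A.sum(axis=0))

# Element-wise arithmetic without explicit loops
print(A * 2)
```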

#### <font color="red">Detailed introduction in the lab session next week</font>


## JSON
### *Data with less structure than tables: sparse entries or flexible schema*
<img SRC="IMG/json-logo.png" width=400>

***JavaScript Object Notation (JSON)*** is an open-standard file format that uses human-readable text to transmit data objects consisting of attribute–value pairs and array data types. 
JSON is a language-independent data format. It was derived from JavaScript, but as of 2017 many programming languages include code to generate and parse JSON-format data. 
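
As an illustration, Python's standard `json` module converts between JSON text and nested dictionaries and lists. The records below are hypothetical. The second one carries a nested attribute and lacks a price, which is the kind of flexible schema a fixed table handles poorly.

```python
import json

# Hypothetical records, invented for this illustration
records = [
    {"product": "Tablet", "price": 250},
    {"product": "Laptop", "specs": {"ram_gb": 16, "ports": ["USB-C", "HDMI"]}},
]

text = json.dumps(records, indent=2)   # Python objects -> JSON text
restored = json.loads(text)            # JSON text -> Python objects

print(text)
print(restored[1]["specs"]["ports"][0])  # walk down the nested structure
print(restored[1].get("price"))          # missing attribute -> None
```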

### JSON Document Tree
<img SRC="IMG/json.png">

### JSON in Python


In [8]:
import pandas as pd

Data = {'Product': ['Desktop Computer','Tablet','iPhone','Laptop'],
        'Price': [700,250,800,1200]
        }

df = pd.DataFrame(Data, columns= ['Product', 'Price'])
 
print(df)


            Product  Price
0  Desktop Computer    700
1            Tablet    250
2            iPhone    800
3            Laptop   1200


In [9]:
#native JSON support in pandas
df.to_json('Export_DataFrame.json')  # writes the file; returns None when a path is given

#### Use ***Jupyter*** to browse the JSON file.

## REST-APIs: Getting Data from Sensors and Services
### *IoT use cases and mashups*

* **REST**: **RE**presentational **S**tate **T**ransfer, the de facto standard architectural style for web services communicating over HTTP
* design goals: performance, scalability, simplicity, and reliability for **client-server** data exchange



Also see: [https://en.wikipedia.org/wiki/Representational_state_transfer](https://en.wikipedia.org/wiki/Representational_state_transfer)

### REST
* **Stateless:** The server won’t maintain any state between requests from the client.
* **Client-server:** The client and server must be decoupled from each other, allowing each to develop independently.
* **Cacheable:** The data retrieved from the server should be cacheable either by the client or by the server.
* **Uniform interface:** The server will provide a uniform interface for accessing resources without defining their representation.
* **Layered system:** The client may access the resources on the server indirectly through other layers such as a proxy or load balancer.

### REST Schema
<img src='IMG/REST.png'>
[image from Wikipedia]

### Example

* ***GitHub*** REST API -> User information

[https://api.github.com/users/keuperj](https://api.github.com/users/keuperj)

### REST communication via HTTP(s) requests

* GET	Retrieve an existing resource.
* POST	Create a new resource.
* PUT	Update an existing resource.
* PATCH	Partially update an existing resource.
* DELETE	Delete a resource.

#### Data payload -> JSON !
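
To make the mapping between operations and HTTP methods concrete, the sketch below builds one request per method with a JSON body where applicable, without sending anything. It uses the standard library's `urllib.request` for this purpose. The base URL and the `build_request` helper are placeholders invented for the example.

```python
import json
import urllib.request

BASE = "https://api.example.com/todos"  # placeholder URL, not a real service


def build_request(method, url, payload=None):
    """Build (but do not send) an HTTP request with an optional JSON body."""
    data = json.dumps(payload).encode("utf-8") if payload is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


crud = [
    build_request("GET", f"{BASE}/1"),                        # retrieve
    build_request("POST", BASE, {"title": "Buy milk"}),       # create
    build_request("PUT", f"{BASE}/1",
                  {"title": "Buy milk", "completed": True}),  # update (replace)
    build_request("PATCH", f"{BASE}/1", {"completed": True}), # partial update
    build_request("DELETE", f"{BASE}/1"),                     # delete
]

for req in crud:
    print(req.get_method(), req.full_url, req.data)
```

Note that the collection URL is used for creation, while the other operations address a single resource by its id.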

### REST interactions in Python
#### with the REQUESTS lib

<center>
<img src="IMG/requests.png" width=300>
</center>

* [https://docs.python-requests.org/en/master/](https://docs.python-requests.org/en/master/)

In [10]:
# read data from service

import requests
api_url = "https://jsonplaceholder.typicode.com/todos/1" # open REST service for tests
response = requests.get(api_url)
response.json()

{'userId': 1, 'id': 1, 'title': 'delectus aut autem', 'completed': False}

In [11]:
# write data to service

api_url = "https://jsonplaceholder.typicode.com/todos"
todo = {"userId": 1, "title": "Buy milk", "completed": False}
response = requests.post(api_url, json=todo)
response.json()

{'userId': 1, 'title': 'Buy milk', 'completed': False, 'id': 201}

In [12]:
#check transaction
response.status_code

201

Status codes:

* 2xx - SUCCESS
* 4xx - Client Error
* 5xx - Server Error
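
A small sketch of how these classes can be told apart in code. The `classify` helper is hypothetical; `http.HTTPStatus` from the standard library supplies the official reason phrases.

```python
from http import HTTPStatus


def classify(status_code):
    """Map an HTTP status code to its class by the leading digit."""
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return classes.get(status_code // 100, "unknown")


for code in (200, 201, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", classify(code))
```

With the requests library, calling `response.raise_for_status()` raises an exception for 4xx and 5xx responses, which avoids checking the code by hand.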

#### Large scale JSON handling and Queries -> second lecture of this block 

#### Other communication libraries are available, e.g. for CAN bus: [https://python-can.readthedocs.io/en/master/](https://python-can.readthedocs.io/en/master/)

## Getting Data from Web Resources
#### *Data Scraping*

* In some cases data is not *provided* via a defined API, but needs to be collected
   * e.g. from unstructured web data

In [13]:
# example using requests
import requests

r = requests.get('https://www.google.com')
print(r.text)

<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="de"><head><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="/B30J0v51vTgCYxzac8+YA==">(function(){window.google={kEI:'73NlYeT-ENaQxc8Pi9GN4AI',kEXPI:'0,1302536,56873,6059,206,4804,2316,383,246,5,1354,5251,1122515,1197707,184,501,328875,51223,16115,28684,17572,4858,1362,9290,3030,17579,4020,978,13228,2676,1171,10626,19040,2778,919,5081,1593,1279,2212,530,149,1103,840,1983,4314,3514,606,2024,1776,520,14670,3227,419,1570,856,7,5599,6755,5096,598,15722,908,2,941,15756,1,2,346,230,6182,278,148,13975,4,1253,275,2301,1241,5801,4684,2014,18375,2658,7357,30,13628,2305,638,18280,2521,3309,2527,992,3100,2,3138,7,907,3,3541,1,14710,1814,283,38,876,5990,15447,8,1273,1715,2,3037,5564,20,1218,1,35,1,4146,1244,1,686,1094,1,4494,743,5853,1576,3,8884,1160,1266,4924,2,507,2381,2719,37

### How to get structured information from Websites?
#### BeautifulSoup

In [14]:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://news.ycombinator.com')
soup = BeautifulSoup(r.text, 'html.parser')
links = soup.find_all('tr', class_='athing')

formatted_links = []

for link in links:
    data = {
        'id': link['id'],
        'title': link.find_all('td')[2].a.text,
        "url": link.find_all('td')[2].a['href'],
        "rank": int(link.td.span.text.replace('.', ''))
    }
    formatted_links.append(data)



In [15]:
for i in range(10):
    print(formatted_links[i])

{'id': '28838112', 'title': 'Magic Leap 2 announced for 2022', 'url': 'https://www.magicleap.com/en-us/news/op-ed/my-first-year-at-magic-leap-and-the-opportunity-ahead', 'rank': 1}
{'id': '28835690', 'title': 'Rancher Desktop, a Docker Desktop Replacement', 'url': 'https://rancherdesktop.io/', 'rank': 2}
{'id': '28838099', 'title': 'Effective Concurrency with Algebraic Effects in Multicore OCaml', 'url': 'https://kcsrk.info/ocaml/multicore/2015/05/20/effects-multicore/', 'rank': 3}
{'id': '28836382', 'title': 'Use Raspberry Pi as Airplay server to screen mirror on TVs, monitors, projectors', 'url': 'https://github.com/rahul-thakoor/air-pi-play', 'rank': 4}
{'id': '28837181', 'title': 'Tricks I wish I knew when I learned TypeScript', 'url': 'https://www.cstrnt.dev/blog/three-typescript-tricks', 'rank': 5}
{'id': '28837998', 'title': 'Cynthia Rudin wins the 2021 AAAI Squirrel AI Award', 'url': 'https://pratt.duke.edu/about/news/rudin-squirrel-award', 'rank': 6}
{'id': '28833933', 'title'
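
The same extraction pattern can be tried offline on a fixed HTML snippet, which is easier to reason about than a live page whose markup may change. The HTML below is invented for this sketch and only loosely resembles a real site.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this illustration
html = """
<table>
  <tr class="item" id="1"><td><a href="https://example.com/a">First</a></td></tr>
  <tr class="item" id="2"><td><a href="https://example.com/b">Second</a></td></tr>
  <tr class="ad"><td>Sponsored</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
# Select only the table rows with the CSS class "item"
for tr in soup.find_all("tr", class_="item"):
    link = tr.find("a")  # first <a> element inside the row
    rows.append({
        "id": tr["id"],                       # attribute access
        "title": link.get_text(strip=True),   # element text
        "url": link["href"],
    })

print(rows)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(rows)` to continue with tabular tools.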

## Scaling Web Scraping with Scrapy
#### Crawling the web

<img SRC="IMG/scrapy.jpg">

#### [https://scrapy.org/](https://scrapy.org/)

#### Example

Scraping [http://quotes.toscrape.com/page/1/](http://quotes.toscrape.com/page/1/)

In [16]:
%%writefile myCrwler.py 

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f'quotes-{page}.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Overwriting myCrwler.py


### Run in Shell:

In [None]:
# This would save the HTML of each page in start_urls to a file for later analysis
#!scrapy runspider myCrwler.py

# Discussion