# Data Collection

The first step of most data science pipelines, as you may imagine, is to get some data. The data you typically use comes from many different sources. If you’re lucky, someone may directly hand you a file, such as a CSV. Or sometimes you’ll need to issue a database query to collect the relevant data. But in this lecture, we’ll talk about collecting data from two main sources: 1) querying an API (the majority of which are web-based, these days); and 2) scraping data from a web page.

## Collecting data from web-based sources

The vast majority of automated data queries you will run will use HTTP requests
(HTTP has become the dominant protocol for much more than just querying web pages).

In [2]:
import requests
response = requests.get("https://fmi.chnu.edu.ua/")

print("Status Code:", response.status_code)
print("Headers:", response.headers)

Status Code: 200
Headers: {'Date': 'Sun, 18 Aug 2024 13:47:40 GMT', 'Content-Type': 'text/html; charset=utf-8', 'Transfer-Encoding': 'chunked', 'Connection': 'keep-alive', 'vary': 'Accept-Encoding', 'content-security-policy': "default-src 'self' data: 'unsafe-eval' 'unsafe-inline' *.gstatic.com *.googleapis.com *.googletagmanager.com *.addtoany.com *.youtube-nocookie.com *.google.com *.google-analytics.com *.ytimg.com *.facebook.com forms.gle *.chnu.edu.ua madmagz.com", 'x-frame-options': 'SAMEORIGIN, SAMEORIGIN', 'x-content-type-options': 'nosniff', 'strict-transport-security': 'max-age=31536000', 'referrer-policy': 'no-referrer', 'permissions-policy': 'accelerometer=(), camera=(), geolocation=*, gyroscope=(), magnetometer=(), microphone=(), payment=(), usb=()', 'CF-Cache-Status': 'DYNAMIC', 'Report-To': '{"endpoints":[{"url":"https:\\/\\/a.nel.cloudflare.com\\/report\\/v4?s=jzsoc%2F9fmYB%2BNLhbPj7MGfIeQ0snoYVSMd7GURUDaxyfE6FjjTLRe%2BUn4%2BIEspNxELuFs%2FjkJ4H2FJBZlWAWcLqJjIeQ9bUJi2X3R

In [5]:
print(response.text[:480])

<!DOCTYPE html>
<html lang="uk" prefix="og: https://ogp.me/ns#">
<head>
    <meta charset="utf-8" />
    <meta name="viewport" content="width=device-width, initial-scale=1">
        <title>&#x413;&#x43E;&#x43B;&#x43E;&#x432;&#x43D;&#x430; - &#x424;&#x430;&#x43A;&#x443;&#x43B;&#x44C;&#x442;&#x435;&#x442; &#x43C;&#x430;&#x442;&#x435;&#x43C;&#x430;&#x442;&#x438;&#x43A;&#x438; &#x442;&#x430; &#x456;&#x43D;&#x444;&#x43E;&#x440;&#x43C;&#x430;&#x442;&#x438;&#x43A;&#x438;</title


You’ve seen URLs like these:
https://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=9&cad=rja&uact=8…
The odd-looking key=value pairs after the ? are query parameters. You can provide them
with the requests library like this:

In [None]:
# Query parameters as a dict; requests encodes them into the URL for you
params = {"sa": "t", "rct": "j", "q": "", "esrc": "s",
          "source": "web", "cd": "9", "cad": "rja", "uact": "8"}
response = requests.get("https://www.google.com/url", params=params)
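
To see what passing a dict of parameters does to the URL, the encoding step can be sketched with the standard library alone, so no network access is needed. This is an illustration of the idea, not the exact code path inside requests, and `build_url` is a hypothetical helper name introduced here.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base, params):
    """Append URL-encoded query parameters to a base URL (hypothetical helper)."""
    return base + "?" + urlencode(params)

params = {"sa": "t", "rct": "j", "q": "", "esrc": "s",
          "source": "web", "cd": "9", "cad": "rja", "uact": "8"}
url = build_url("https://www.google.com/url", params)
print(url)

# Going the other way: split the query string back into a dict.
# keep_blank_values=True preserves the empty "q" parameter.
decoded = parse_qs(urlsplit(url).query, keep_blank_values=True)
print(decoded["source"])
```

Note that `parse_qs` returns a list for each key, because a parameter may legally appear more than once in a query string.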

HTTP GET is the most common method, but there are also the PUT, POST, and DELETE
methods, which change some state on the server.
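
The method is just one field of the request. As a rough sketch, the standard library's `urllib.request.Request` lets you build requests with different methods and inspect them without sending anything. The example.com URL and the JSON payload below are placeholders, not a real API.

```python
import json
from urllib.request import Request

# Placeholder endpoint, used only for illustration; nothing is sent.
url = "https://example.com/items/1"

get_req = Request(url)  # no body, so the method defaults to GET
put_req = Request(
    url,
    data=json.dumps({"name": "new name"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",  # asks the server to replace the resource's state
)
delete_req = Request(url, method="DELETE")  # asks the server to remove it

for req in (get_req, put_req, delete_req):
    print(req.get_method(), req.full_url)
```

With the requests library, the corresponding calls are `requests.put`, `requests.post`, and `requests.delete`, which take the same kind of arguments as `requests.get`.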

## RESTful APIs

If you move beyond just querying web pages to web APIs, you’ll most likely
encounter REST APIs (REST stands for Representational State Transfer).
REST is more of a design architecture than a strict standard, but a few key points:
1. Uses the standard HTTP interface and methods (GET, PUT, POST, DELETE).
2. Stateless: the server doesn’t remember what you were doing.

Rule of thumb: if you’re sending your account key along with each API call,
you’re probably using a REST API.

You query a REST API much like you make a standard HTTP request, but you will almost always
need to include parameters.

Get your own access token at https://github.com/settings/tokens/new
The GitHub API uses GET/PUT/DELETE to let you query or update elements in your
GitHub account automatically.
This is an example of REST: the server doesn’t remember your last queries, so, for instance, you
always need to include your access token when using it this way.

In [6]:
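token = ""  # paste your personal access token here
headers = {"Authorization": "token " + token}  # sent with every call, since the API is stateless
response = requests.get("https://api.github.com/user", headers=headers)
print(response.content)  # raw response body as bytes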


b'{"login":"maxvonlancaster","id":38358502,"node_id":"MDQ6VXNlcjM4MzU4NTAy","avatar_url":"https://avatars.githubusercontent.com/u/38358502?v=4","gravatar_id":"","url":"https://api.github.com/users/maxvonlancaster","html_url":"https://github.com/maxvonlancaster","followers_url":"https://api.github.com/users/maxvonlancaster/followers","following_url":"https://api.github.com/users/maxvonlancaster/following{/other_user}","gists_url":"https://api.github.com/users/maxvonlancaster/gists{/gist_id}","starred_url":"https://api.github.com/users/maxvonlancaster/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/maxvonlancaster/subscriptions","organizations_url":"https://api.github.com/users/maxvonlancaster/orgs","repos_url":"https://api.github.com/users/maxvonlancaster/repos","events_url":"https://api.github.com/users/maxvonlancaster/events{/privacy}","received_events_url":"https://api.github.com/users/maxvonlancaster/received_events","type":"User","site_admin":false,"name":
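
The response body above is raw bytes containing JSON. A small sketch of turning such bytes into a Python dict follows; the payload is a shortened, hand-copied subset of the fields shown above, hard-coded so that the example runs without a token or network access. With requests, `response.json()` performs the same decoding.

```python
import json

# Shortened sample of the bytes printed above (a subset of the fields only).
content = (b'{"login":"maxvonlancaster","id":38358502,'
           b'"type":"User","site_admin":false}')

user = json.loads(content)  # bytes -> dict; JSON false becomes Python False
print(user["login"], user["id"], user["site_admin"])
```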

## Data formats

The three most common formats:
1. CSV (comma-separated values) files
2. JSON (JavaScript Object Notation) files and strings
3. HTML/XML (HyperText Markup Language / Extensible Markup Language) files and strings
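
As a minimal sketch, each of these formats can be parsed with Python's standard library. The sample strings and the `TitleParser` class are made up for illustration; real pipelines often use dedicated libraries instead.

```python
import csv
import io
import json
from html.parser import HTMLParser

# 1. CSV: each row becomes a dict keyed by the header line (values stay strings).
csv_text = "name,score\nalice,90\nbob,85\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))

# 2. JSON: a string becomes nested dicts and lists.
record = json.loads('{"name": "alice", "scores": [90, 85]}')

# 3. HTML: collect the text inside the <title> tag.
class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>Home page</title></head><body></body></html>")

print(rows[0]["name"], record["scores"], parser.title)
```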