<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Overview</a></span><ul class="toc-item"><li><span><a href="#Elapsed-Time-Format" data-toc-modified-id="Elapsed-Time-Format-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Elapsed Time Format</a></span></li></ul></li><li><span><a href="#Viewing-Headers" data-toc-modified-id="Viewing-Headers-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Viewing Headers</a></span></li><li><span><a href="#SSL" data-toc-modified-id="SSL-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>SSL</a></span></li><li><span><a href="#Image-Request" data-toc-modified-id="Image-Request-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Image Request</a></span></li><li><span><a href="#JSON-Request" data-toc-modified-id="JSON-Request-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>JSON Request</a></span></li></ul></div>

# Basic Scraping with Requests

## Overview

The Requests library is a simple HTTP library for Python. Import it with a short alias because you're cool.

In [2]:
#Request Library
import requests as r

Make a request with the requests.get function (I'm using the r alias from now on).

In [3]:
#Request a page, store in 'res'
res = r.get('https://www.depts.ttu.edu/rawlsbusiness/people/faculty/isqs/david-lucus/index.php')

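Use res.text or res.content to read the body of our response 'res' and see the HTML. res.text is the body decoded to a string using the encoding Requests settled on; res.content is the raw bytes. Prof explained why res.content was better and I didn't fully catch it. My best understanding is that the raw bytes let an HTML parser work out the encoding itself instead of trusting that guess.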

I'm leaving the HTML output out of this notebook because it's ugly. Suffice to say, it's HTML alright.

In [5]:
#Let's see our HTML
#Use these attributes to read the body of the response
res.text #response decoded to a str (unicode)
res.content #response as raw bytes
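
To make the difference between the two concrete, here is a minimal offline sketch. The byte string is a hypothetical stand-in for res.content (it is not the real page), and the two decode calls imitate what res.text would return with a correct and an incorrect encoding guess.

```python
# Hypothetical sample standing in for res.content: UTF-8 bytes of a tiny page.
raw = "<p>café</p>".encode("utf-8")

print(type(raw).__name__)    # bytes, the same type res.content returns

# Decoding with the right encoding recovers the original text.
good = raw.decode("utf-8")
print(good)                  # <p>café</p>

# Decoding with the wrong encoding silently garbles non-ASCII characters.
bad = raw.decode("latin-1")
print(bad)                   # <p>cafÃ©</p>
```

Because the bytes are untouched, they can always be decoded again later; a badly decoded string cannot be repaired as easily.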

We can try a few more of these attributes to get the status code, encoding, and how long the request took (elapsed time).

In [17]:
#Check HTML Status Code
#Should return 200 (OK) if the request succeeded
res.status_code

200
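
The number on its own is easy to misread, so here is a small sketch that turns a status code into a readable label using only the standard library. The helper name describe_status is made up for this example; it is not part of Requests.

```python
from http import HTTPStatus

def describe_status(code: int) -> str:
    """Return a label such as '200 OK (success)' for a known status code."""
    # The first digit of a status code gives its general class.
    classes = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    phrase = HTTPStatus(code).phrase  # raises ValueError for unknown codes
    return f"{code} {phrase} ({classes[code // 100]})"

print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (client error)
```

With a live response you would pass res.status_code to the helper.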

### Elapsed Time Format
- res.elapsed is a time delta (datetime.timedelta)
- A time delta stores only days, seconds, and microseconds; milliseconds, minutes, and hours are converted into those three
    - Output may leave out values that are zero
    - In this case we are shown 0 days, 0 seconds, and 133482 microseconds
- You can also access a specific time value, here we pull microseconds

In [19]:
#How long did the request take?
res.elapsed

#Can access each item as necessary
res.elapsed.microseconds

#Pull encoding
res.encoding

'UTF-8'
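
One thing to watch with the elapsed time: microseconds is only the sub-second part of the value, not the whole duration. The sketch below builds time deltas by hand (hypothetical values, the first one matching the 133482 microseconds above) to show the difference between the individual fields and total_seconds().

```python
from datetime import timedelta

# Hypothetical value matching the 133482 microseconds reported above.
elapsed = timedelta(microseconds=133482)
print(elapsed.days, elapsed.seconds, elapsed.microseconds)  # 0 0 133482
print(elapsed.total_seconds())                              # 0.133482

# Larger units are folded into days, seconds, and microseconds.
longer = timedelta(minutes=1, milliseconds=1500)
print(longer.seconds, longer.microseconds)                  # 61 500000
print(longer.total_seconds())                               # 61.5
```

For timing a request, res.elapsed.total_seconds() is usually the safer number to report.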

## Viewing Headers

View the response headers. For example, the Server and X-Powered-By headers show that this is a Microsoft IIS server running ASP.NET.

In [7]:
#Returns server response headers
#note the reported IIS server (tells you the server is Microsoft based)
res.headers

{'Content-Type': 'text/html; charset=UTF-8', 'Content-Encoding': 'gzip', 'Vary': 'Accept-Encoding', 'Server': 'Microsoft-IIS/10.0', 'X-Powered-By': 'ASP.NET', 'Date': 'Tue, 01 Sep 2020 02:58:56 GMT', 'Content-Length': '15552'}
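
Individual header values can be pulled apart with the standard library. This is a minimal sketch using a plain dict with three headers copied from the output above; with a live response you would read them from res.headers, which also accepts header names in any letter case.

```python
from email.message import Message

# Headers copied from the output above, stored in a plain dict for this sketch.
headers = {
    "Content-Type": "text/html; charset=UTF-8",
    "Server": "Microsoft-IIS/10.0",
    "X-Powered-By": "ASP.NET",
}

# Message knows how to split a Content-Type value into its parts.
msg = Message()
msg["Content-Type"] = headers["Content-Type"]
media_type = msg.get_content_type()    # the part before the semicolon
charset = msg.get_content_charset()    # the charset parameter, lowercased

# The Server header is conventionally "product/version".
server_name, _, server_version = headers["Server"].partition("/")

print(media_type, charset)             # text/html utf-8
print(server_name, server_version)     # Microsoft-IIS 10.0
```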

View some other site headers:

In [9]:
#Let's look at some other sites
new_res = r.get("http://apache.org")
new_res.headers

new_res = r.get("http://amazon.com")
new_res.headers

{'Content-Type': 'text/html', 'Content-Length': '1203', 'Connection': 'keep-alive', 'Server': 'Server', 'Date': 'Tue, 01 Sep 2020 03:16:01 GMT', 'x-amz-rid': '0WTHMV185BT5QN9Y369Q', 'Vary': 'Content-Type,Accept-Encoding,X-Amzn-CDN-Cache,X-Amzn-AX-Treatment,User-Agent', 'Last-Modified': 'Wed, 15 Jul 2020 01:20:52 GMT', 'ETag': '"a6f-5aa70bc136500-gzip"', 'Accept-Ranges': 'bytes', 'Content-Encoding': 'gzip', 'X-Cache': 'Error from cloudfront', 'Via': '1.1 e5d1ee16650956f26121cb5bd640c9e0.cloudfront.net (CloudFront)', 'X-Amz-Cf-Pop': 'DFW3-C1', 'X-Amz-Cf-Id': '2ePcaikgq7nxOjnHdDxdx8HqpaIO2B5X7Juo_XKmJvl-KI1NAF92Cg=='}

## SSL

If we request GitHub over plain http, we don't stay on the http site: the request gets redirected.
- Use new_res.history to see what happened during the request
- The history holds a 301 response, which indicates that the resource has moved permanently
- new_res.url shows the final URL, which tells us GitHub wants us to use https

In [10]:
#Let's look at redirects. Notice the request is for the non-SSL (http) site
new_res = r.get("http://github.com")
new_res.history
new_res.url

'https://github.com/'

If we want, we can stop redirects in our get request.
- You can also check whether the response is a redirect or a permanent redirect with is_redirect and is_permanent_redirect

In [11]:
#Note, the previous request was redirected to the SSL site. We can stop this with allow_redirects=False
new_res = r.get("http://github.com", allow_redirects=False)
new_res.history
new_res.status_code
#code 301 is: Moved Permanently
new_res.is_redirect
new_res.is_permanent_redirect

True
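
The same checks can be reasoned about offline. The sketch below uses hard-coded values standing in for the requested URL, new_res.url, and the 301 status code; the helper upgraded_to_https is a made-up name for this example. It treats 301 and 308 as the permanent redirect codes, which is my understanding of what is_permanent_redirect looks for.

```python
from http import HTTPStatus
from urllib.parse import urlsplit

# Hypothetical values standing in for the requested URL and new_res.url.
requested_url = "http://github.com"
final_url = "https://github.com/"

def upgraded_to_https(requested: str, final: str) -> bool:
    """Return True when an http request ended on https at the same host."""
    before, after = urlsplit(requested), urlsplit(final)
    return (
        before.scheme == "http"
        and after.scheme == "https"
        and before.hostname == after.hostname
    )

# 301 and 308 are permanent redirects; 302, 303, and 307 are temporary.
PERMANENT = {HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.PERMANENT_REDIRECT}

status = HTTPStatus(301)
print(status.phrase)                                  # Moved Permanently
print(status in PERMANENT)                            # True
print(upgraded_to_https(requested_url, final_url))    # True
```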

## Image Request

Notice in the headers that Content-Type is now image/jpeg rather than text/html.

In [12]:
#Image request
img_res = r.get('https://www.depts.ttu.edu/rawlsbusiness/people/faculty/isqs/david-lucus/images/Photo-David-Lucas.jpg')
img_res.headers

{'Content-Type': 'image/jpeg', 'Last-Modified': 'Wed, 15 Jan 2020 23:55:51 GMT', 'Accept-Ranges': 'bytes', 'ETag': '"94193b51ffcbd51:0"', 'Server': 'Microsoft-IIS/10.0', 'X-Powered-By': 'ASP.NET', 'Date': 'Tue, 01 Sep 2020 03:34:22 GMT', 'Content-Length': '14209'}
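
Since an image comes back as bytes, saving it is just a binary file write. Here is a minimal sketch with placeholder bytes standing in for img_res.content; the helper name save_image and the file name are made up for this example.

```python
import tempfile
from pathlib import Path

# Hypothetical placeholder bytes standing in for img_res.content.
# Real JPEG data starts with the bytes FF D8 FF.
image_bytes = b"\xff\xd8\xff\xe0" + b"\x00" * 16

def save_image(data: bytes, path: Path) -> int:
    """Write raw image bytes to disk in binary mode; return bytes written."""
    return path.write_bytes(data)

# Hypothetical file name inside the system temp directory.
target = Path(tempfile.gettempdir()) / "photo-example.jpg"
written = save_image(image_bytes, target)

print(written)                                          # 20
print(target.read_bytes().startswith(b"\xff\xd8\xff"))  # True
```

With a live response you would pass img_res.content, never img_res.text, because decoding image bytes as text corrupts them.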

## JSON Request

Content-Type is now JSON, but the raw response may be difficult to read in a Python environment.
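
One way to make a JSON body readable is to parse it into Python objects. This is a minimal sketch using a hypothetical trimmed sample of the dataset's metadata in place of json_res.content; the real response is much larger.

```python
import json

# Hypothetical trimmed sample of the payload, not the full response.
raw = (
    b'{"meta": {"view": {"id": "kku6-nxdu", '
    b'"name": "Demographic Statistics By Zip Code", '
    b'"category": "City Government"}}}'
)

# json.loads accepts bytes as well as str and returns nested dicts and lists.
data = json.loads(raw)
view = data["meta"]["view"]

print(view["name"])                # Demographic Statistics By Zip Code
print(json.dumps(view, indent=2))  # pretty-printed version of the same data
```

With a live response, json_res.json() does the parsing step for you and returns the same kind of nested dictionary.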

In [13]:
json_res = r.get("https://data.cityofnewyork.us/api/views/kku6-nxdu/rows.json?accessType=DOWNLOAD")
json_res.headers
json_res.content

b'{\n  "meta" : {\n    "view" : {\n      "id" : "kku6-nxdu",\n      "name" : "Demographic Statistics By Zip Code",\n      "attribution" : "Department of Youth and Community Development (DYCD)",\n      "averageRating" : 0,\n      "category" : "City Government",\n      "createdAt" : 1311775554,\n      "description" : "Demographic statistics broken down by zip code",\n      "displayType" : "table",\n      "downloadCount" : 979790,\n      "hideFromCatalog" : false,\n      "hideFromDataJson" : false,\n      "indexUpdatedAt" : 1536596131,\n      "newBackend" : true,\n      "numberOfComments" : 3,\n      "oid" : 4208790,\n      "provenance" : "official",\n      "publicationAppendEnabled" : false,\n      "publicationDate" : 1372266760,\n      "publicationGroup" : 238846,\n      "publicationStage" : "published",\n      "rowClass" : "",\n      "rowsUpdatedAt" : 1372266747,\n      "rowsUpdatedBy" : "uurm-7z6x",\n      "tableId" : 942474,\n      "totalTimesRated" : 0,\n      "viewCount" : 63220,\n