# Python Requests

### by: Jean-Christophe Chouinard

### Twitter: @ChouinardJC

### Blog: jcchouinard.com

### Link to article: jcchouinard.com/python-requests/

---------

## Install Package

Python installation guide: jcchouinard.com/install-python-with-anaconda-on-windows/

`$ pip install requests`

`$ pip install beautifulsoup4`

`urllib` is part of the Python standard library, so it does not need to be installed with pip.

---------

## Requests basics

- GET: retrieve the content of a page
- POST: send data to a server, for example to submit a form or post a status on social media
- Other methods (PUT, DELETE, HEAD, etc.) are not covered here

### Simple GET request

In [17]:
import requests 

url = 'https://crawler-test.com/'
response = requests.get(url)

print('URL: ',  response.url)
print('Status code: ', response.status_code)
print('HTTP header: ', response.headers)

URL:  https://crawler-test.com/
Status code:  200
HTTP header:  {'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=utf-8', 'Date': 'Wed, 06 Oct 2021 00:11:45 GMT', 'Server': 'nginx/1.10.3', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8098', 'Connection': 'keep-alive'}


### Simple POST request

In [18]:
import requests 

url = 'https://httpbin.org/post'
payload = {
    'name':'Jean-Christophe',
    'last_name':'Chouinard',
    'website':'https://www.jcchouinard.com/'
    }
r = requests.post(url, data=payload)
r.json()

{'args': {},
 'data': '',
 'files': {},
 'form': {'last_name': 'Chouinard',
  'name': 'Jean-Christophe',
  'website': 'https://www.jcchouinard.com/'},
 'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Content-Length': '85',
  'Content-Type': 'application/x-www-form-urlencoded',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.24.0',
  'X-Amzn-Trace-Id': 'Root=1-615ce9f8-5b5ab52047b4e17c398995b2'},
 'json': None,
 'origin': '149.167.130.162',
 'url': 'https://httpbin.org/post'}
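
The POST example above sends the payload form-encoded (`data=`). As a minimal sketch that is not part of the original notebook, the snippet below uses `requests.Request(...).prepare()` to build two requests locally, without sending them, and compares what `data=` and `json=` would put on the wire. The payload is the one from the cell above.

```python
import json
import requests

url = 'https://httpbin.org/post'
payload = {
    'name': 'Jean-Christophe',
    'last_name': 'Chouinard',
    'website': 'https://www.jcchouinard.com/'
}

# Build the requests locally; nothing is sent over the network.
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

# data= produces a form-encoded body
print(form_req.headers['Content-Type'])
print(form_req.body)

# json= produces a JSON body
print(json_req.headers['Content-Type'])
print(json.loads(json_req.body))
```

Use `json=` when the API expects a JSON document, and `data=` when it expects an HTML-form style body.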

## Response Methods and Attributes

In [19]:
import requests 

url = 'https://crawler-test.com/'
r = requests.get(url)
r

<Response [200]>

In [20]:
help(r)

Help on Response in module requests.models object:

class Response(builtins.object)
 |  The :class:`Response <Response>` object, which contains a
 |  server's response to an HTTP request.
 |  
 |  Methods defined here:
 |  
 |  __bool__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if the status code of the response is between
 |      400 and 600 to see if there was a client error or a server error. If
 |      the status code, is between 200 and 400, this will return True. This
 |      is **not** a check to see if the response code is ``200 OK``.
 |  
 |  __enter__(self)
 |  
 |  __exit__(self, *args)
 |  
 |  __getstate__(self)
 |  
 |  __init__(self)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self)
 |      Allows you to use a response as an iterator.
 |  
 |  __nonzero__(self)
 |      Returns True if :attr:`status_code` is less than 400.
 |      
 |      This attribute checks if

- text, data descriptor: content of the response, decoded to a unicode string.

- content, data descriptor: content of the response, in bytes.

- url, attribute: final URL of the response, after any redirects.

- status_code, attribute: status code returned by the server.

- headers, attribute: HTTP headers returned by the server.

- history, attribute: list of Response objects for the redirects followed to reach the final URL.

- links, attribute: the parsed header links of the response, if any.

- json, method: parses the JSON body of the response, if any, and returns it as Python objects.
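
To see these attributes side by side without making a network call, here is a rough sketch that builds a `Response` by hand. This is an assumption made for illustration only: the URL and body are made up, and setting the private `_content` attribute is a testing trick, not how responses are normally created.

```python
from requests.models import Response

# Build a Response by hand so the example runs offline.
# Assumption: `_content` is a private attribute, set here only for illustration;
# real responses come from requests.get() or requests.post().
r = Response()
r.status_code = 200
r.url = 'https://example.com/data.json'  # made-up URL
r.encoding = 'utf-8'
r.headers['Content-Type'] = 'application/json'
r._content = b'{"status": "ok", "items": [1, 2, 3]}'

print(r.status_code)              # integer status code
print(r.ok)                       # True when the status code is below 400
print(r.headers['content-type'])  # header lookup is case-insensitive
print(r.content)                  # raw bytes
print(r.text)                     # bytes decoded using r.encoding
print(r.json())                   # JSON body parsed into Python objects
```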

## Access the Response Methods and Attributes

### Access attributes

In [22]:
import requests

url = 'https://crawler-test.com/'
r = requests.get(url)

In [26]:
r.headers

{'Content-Encoding': 'gzip', 'Content-Type': 'text/html;charset=utf-8', 'Date': 'Wed, 06 Oct 2021 00:14:21 GMT', 'Server': 'nginx/1.10.3', 'Vary': 'Accept-Encoding', 'X-Content-Type-Options': 'nosniff', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'Content-Length': '8099', 'Connection': 'keep-alive'}

In [27]:
# check redirections
url = 'https://crawler-test.com/redirects/redirect_chain_allowed'
r = requests.get(url)

In [29]:
for redirect in r.history:
    print(redirect.url, redirect.status_code)
print(r.url, r.status_code)

https://crawler-test.com/redirects/redirect_chain_allowed 301
https://crawler-test.com/redirects/redirect_chain_disallowed 301
https://crawler-test.com/redirects/redirect_target 200


In [30]:
r = requests.get(url, allow_redirects=False)
print(r.url, r.status_code)

https://crawler-test.com/redirects/redirect_chain_allowed 301
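
The loop over `r.history` can be wrapped in a small helper. The sketch below is only one possible way to do it: `redirect_chain` and `fake_response` are hypothetical names, and the responses are built by hand with made-up URLs so the example runs offline.

```python
from requests.models import Response

def redirect_chain(response):
    """Return (url, status_code) for every hop, ending with the final response."""
    hops = [(hop.url, hop.status_code) for hop in response.history]
    hops.append((response.url, response.status_code))
    return hops

def fake_response(url, status_code):
    """Hypothetical helper: build an offline Response for the demo."""
    r = Response()
    r.url = url
    r.status_code = status_code
    return r

# Simulate a two-step redirect chain that ends on a 200 page.
final = fake_response('https://example.com/target', 200)
final.history = [
    fake_response('https://example.com/a', 301),
    fake_response('https://example.com/b', 301),
]

chain = redirect_chain(final)
for hop_url, status in chain:
    print(hop_url, status)
```

With a real request, `redirect_chain(requests.get(url))` would list the same information as the loop above.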


### Access methods

In [32]:
url = 'http://archive.org/wayback/available?url=jcchouinard.com'
r = requests.get(url)
r.json()

{'url': 'jcchouinard.com',
 'archived_snapshots': {'closest': {'status': '200',
   'available': True,
   'url': 'http://web.archive.org/web/20211005023018/https://www.jcchouinard.com/',
   'timestamp': '20211005023018'}}}
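
`r.json()` raises an error when the body is not valid JSON (an HTML error page, for instance). A minimal sketch of a defensive wrapper follows; `safe_json` is a hypothetical helper name, and the response is built by hand so the example runs offline.

```python
from requests.models import Response

def safe_json(response, default=None):
    """Return the parsed JSON body, or `default` when the body is not valid JSON."""
    try:
        return response.json()
    except ValueError:  # the JSON decoding errors raised by requests subclass ValueError
        return default

# Offline demo: a hand-built response whose body is HTML, not JSON.
# Assumption: `_content` is a private attribute, set here only for illustration.
html_response = Response()
html_response.status_code = 200
html_response.encoding = 'utf-8'
html_response._content = b'<html>not json</html>'

data = safe_json(html_response, default={})
print(data)
```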

## Process the response

### Parse the HTML of a page with BeautifulSoup

In [36]:
from bs4 import BeautifulSoup

url = 'https://crawler-test.com/'
r = requests.get(url)

soup = BeautifulSoup(r.text, 'html.parser')
soup.find('title')


<title>Crawler Test Site</title>

In [39]:
soup.find_all('meta', attrs={'name':'description'})[0]

<meta content="Default description XIbwNE7SSUJciq0/Jyty" name="description"/>

In [41]:
from bs4 import BeautifulSoup
import requests

# Make the request
url = 'https://crawler-test.com/'
r = requests.get(url)

# Parse the HTML
soup = BeautifulSoup(r.text, 'html.parser')

title = soup.find('title')
h1 = soup.find('h1')
description = soup.find('meta', attrs={'name':'description'})
meta_robots = soup.find('meta', attrs={'name':'robots'})
canonical = soup.find('link', {'rel': 'canonical'})

In [44]:
title

<title>Crawler Test Site</title>

In [47]:
title = title.get_text() if title else ''
h1 = h1.get_text() if h1 else ''
description = description['content'] if description else ''
canonical = canonical['href'] if canonical else ''
meta_robots = meta_robots['content'] if meta_robots else ''

# Print the tags
print('Title: ', title)
print('h1: ', h1)
print('description: ', description)
print('meta_robots: ', meta_robots)
print('canonical: ', canonical)

Title:  Crawler Test Site
h1:  Crawler Test Site
description:  Default description XIbwNE7SSUJciq0/Jyty
meta_robots:  
canonical:  
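
The extraction steps above can be grouped into a reusable function. This is a sketch under stated assumptions: `extract_seo_tags` is a hypothetical name, and it is run on a made-up HTML string instead of a live page so that it works offline.

```python
from bs4 import BeautifulSoup

def extract_seo_tags(html):
    """Return the main SEO tags of a page, using '' for any tag that is missing."""
    soup = BeautifulSoup(html, 'html.parser')

    title = soup.find('title')
    h1 = soup.find('h1')
    description = soup.find('meta', attrs={'name': 'description'})
    meta_robots = soup.find('meta', attrs={'name': 'robots'})
    canonical = soup.find('link', attrs={'rel': 'canonical'})

    return {
        'title': title.get_text(strip=True) if title else '',
        'h1': h1.get_text(strip=True) if h1 else '',
        'description': description.get('content', '') if description else '',
        'meta_robots': meta_robots.get('content', '') if meta_robots else '',
        'canonical': canonical.get('href', '') if canonical else '',
    }

# Made-up HTML used only for the demo; it has no meta robots tag.
sample_html = """
<html><head>
  <title>Sample page</title>
  <meta name="description" content="A short description">
  <link rel="canonical" href="https://example.com/">
</head><body><h1>Hello</h1></body></html>
"""

tags = extract_seo_tags(sample_html)
for name, value in tags.items():
    print(name + ': ', value)
```

With a live page, pass `r.text` to the function instead of `sample_html`.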


### Get all links on the page

In [52]:
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin

url = 'https://crawler-test.com/'
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

links = []
for link in soup.find_all('a', href=True):
    links.append(urljoin(url, link['href']))

links[:10]

['https://crawler-test.com/',
 'https://crawler-test.com/mobile/separate_desktop',
 'https://crawler-test.com/mobile/desktop_with_AMP_as_mobile',
 'https://crawler-test.com/mobile/separate_desktop_with_different_h1',
 'https://crawler-test.com/mobile/separate_desktop_with_different_title',
 'https://crawler-test.com/mobile/separate_desktop_with_different_wordcount',
 'https://crawler-test.com/mobile/separate_desktop_with_different_links_in',
 'https://crawler-test.com/mobile/separate_desktop_with_different_links_out',
 'https://crawler-test.com/mobile/separate_desktop_with_mobile_not_subdomain',
 'https://crawler-test.com/mobile/desktop_with_self_canonical_mobile_and_amp']
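
`urljoin` is what turns the relative `href` values into absolute URLs. The short sketch below shows how different kinds of `href` resolve, and how the result could be filtered to internal links only. The `href` values and the page URL are made up for the demo.

```python
from urllib.parse import urljoin, urlparse

page_url = 'https://crawler-test.com/mobile/page'

# Made-up href values covering the common cases
hrefs = [
    '/about',                    # root-relative
    'separate_desktop',          # relative to the current directory
    '../robots.txt',             # parent directory
    '#top',                      # fragment on the same page
    '//cdn.example.com/lib.js',  # protocol-relative
    'https://www.example.com/',  # already absolute
]

absolute = [urljoin(page_url, href) for href in hrefs]

# Keep only the links that stay on the same host as the page
host = urlparse(page_url).netloc
internal = [link for link in absolute if urlparse(link).netloc == host]

for link in absolute:
    print(link)
```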

## Improve the request

### Handle errors

In [55]:
url = 'bad url'

try:
    r = requests.get(url)
except Exception as e:
    print(e)

Invalid URL 'bad url': No schema supplied. Perhaps you meant http://bad url?
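
Catching `Exception` works, but Requests also provides more specific exception classes in `requests.exceptions`. The following is a sketch of one way to separate the common failure modes; `fetch` is a hypothetical helper name and the error labels are arbitrary.

```python
import requests

def fetch(url, timeout=5):
    """Return (response, error). Exactly one of the two is None."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # turn 4xx/5xx status codes into HTTPError
        return response, None
    except (requests.exceptions.MissingSchema, requests.exceptions.InvalidURL) as e:
        return None, f'invalid URL: {e}'
    except requests.exceptions.Timeout as e:
        return None, f'timed out: {e}'
    except requests.exceptions.ConnectionError as e:
        return None, f'connection failed: {e}'
    except requests.exceptions.HTTPError as e:
        return None, f'bad status: {e}'
    except requests.exceptions.RequestException as e:
        # Base class of the exceptions above; catches anything else Requests raises
        return None, f'request failed: {e}'

# The URL has no scheme, so the request fails before any network call is made.
response, error = fetch('crawler-test.com/')
print(error)
```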


### Change user-agent

In [56]:
import requests 

# https://www.whatismybrowser.com/detect/what-is-my-user-agent
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'

url = 'https://www.reddit.com/r/python/top.json?limit=1&t=day'

headers = {
    'User-Agent': user_agent
}

r = requests.get(url, headers=headers)
r.json()

{'kind': 'Listing',
 'data': {'after': 't3_q1z7q9',
  'dist': 1,
  'modhash': '',
  'geo_filter': '',
  'children': [{'kind': 't3',
    'data': {'approved_at_utc': None,
     'subreddit': 'Python',
     'selftext': "I have loved Pycharm ever since I first started to use it, it's hard for me to let go, but....\n\nAt my work I have successfully transitioned from an intern to a salaried worker for a small non-profit. I have been given some projects and one of which I think is appropriate for writing Python code (doing excel automation). \n\nAnyhow, given how small these projects are and how infrequently I may or may not use Python/code, I cannot afford to purchase a business license for Pycharm. Does anyone have recommendations for a very similar IDE that would be free for business use?",
     'author_fullname': 't2_ae7gi6fr',
     'saved': False,
     'mod_reason_title': None,
     'gilded': 0,
     'clicked': False,
     'title': 'IDE Similar to PyCharm for Work',
     'link_flair_richt

### Add Timeout to a request

In [57]:
url = 'http://httpbin.org/basic-auth/user/pass'

try:
    requests.get(url, timeout=0.1)
except Exception as e:
    print(e)


HTTPConnectionPool(host='httpbin.org', port=80): Max retries exceeded with url: /basic-auth/user/pass (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f96dc699950>, 'Connection to httpbin.org timed out. (connect timeout=0.1)'))
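
`timeout` also accepts a `(connect, read)` tuple, so the two phases can have different limits. The sketch below is an offline illustration built on assumptions: it opens a local socket that accepts connections but never answers, so the read timeout is the one that fires. The session with `trust_env = False` is only there so that proxy environment variables do not redirect the local request.

```python
import socket
import requests

# Hypothetical local "server": accepts TCP connections but never replies.
silent = socket.socket()
silent.bind(('127.0.0.1', 0))  # port 0 lets the OS pick a free port
silent.listen(1)
url = 'http://127.0.0.1:%d/' % silent.getsockname()[1]

session = requests.Session()
session.trust_env = False  # ignore proxy settings from the environment

try:
    # 2 seconds to connect, 0.5 seconds to wait for the server to send data
    session.get(url, timeout=(2, 0.5))
    outcome = 'ok'
except requests.exceptions.ConnectTimeout:
    outcome = 'connect timeout'
except requests.exceptions.ReadTimeout:
    outcome = 'read timeout'
finally:
    session.close()
    silent.close()

print(outcome)
```

Both exceptions inherit from `requests.exceptions.Timeout`, which can be caught instead when the distinction does not matter.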


### Use Proxies

In [58]:
url = 'https://crawler-test.com/'

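# Keys are URL schemes. This 'http' entry only applies to http:// URLs,
# so the https:// request below bypasses the proxy unless an 'https' entry is added.
proxies = {
    'http': '128.199.237.57:8080'
}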

requests.get(url, proxies=proxies)

<Response [200]>
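
The keys of the `proxies` dictionary are matched against the scheme of the requested URL. As a quick offline check, the sketch below uses the `select_proxy` utility from `requests.utils` to show which proxy would be picked for an HTTPS URL. The proxy address is the one from the cell above and is not guaranteed to be a working proxy.

```python
from requests.utils import select_proxy

url = 'https://crawler-test.com/'
proxy_address = 'http://128.199.237.57:8080'  # address from the example above

http_only = {'http': proxy_address}
http_and_https = {'http': proxy_address, 'https': proxy_address}

# No 'https' key: no proxy is selected for an https:// URL
chosen_http_only = select_proxy(url, http_only)

# With an 'https' key: the proxy is selected
chosen_both = select_proxy(url, http_and_https)

print(chosen_http_only)
print(chosen_both)
```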

### Add Headers to request

In [59]:
url = 'http://httpbin.org/headers'

access_token = {
    'Authorization': 'Bearer {access_token}'
}

r = requests.get(url, headers=access_token)
r.json()

{'headers': {'Accept': '*/*',
  'Accept-Encoding': 'gzip, deflate',
  'Authorization': 'Bearer {access_token}',
  'Host': 'httpbin.org',
  'User-Agent': 'python-requests/2.24.0',
  'X-Amzn-Trace-Id': 'Root=1-615cee98-2dbce24d25b6481d399ff5fd'}}
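
In the cell above, `{access_token}` is a literal placeholder. A minimal sketch of how a real token could be injected with an f-string follows; the token value is made up, and the request is only prepared locally, not sent.

```python
import requests

access_token = 'example-token'  # made-up value; use your real token here

headers = {
    'Authorization': f'Bearer {access_token}',
    'Accept': 'application/json',
}

# Prepare the request locally to inspect the headers without sending it
prepared = requests.Request('GET', 'http://httpbin.org/headers', headers=headers).prepare()

print(prepared.headers['Authorization'])
print(prepared.headers['accept'])  # header names are case-insensitive
```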

## Requests Session

The session object is useful when you need to make requests with parameters that persist across all the requests in a single session.

In [60]:
import requests

session = requests.Session()

url = 'https://httpbin.org/headers'

access_token = {
    'Authorization': 'Bearer {access_token}'
    }

session.headers.update(access_token)

r1 = session.get(url)
r2 = session.get(url)

print('r1: ', r1.json()['headers']['Authorization'])
print('r2: ', r2.json()['headers']['Authorization'])

r1:  Bearer {access_token}
r2:  Bearer {access_token}
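
Session-level headers are merged with the headers passed on each request, and the per-request value wins on conflict. The sketch below illustrates this offline with `Session.prepare_request`, so nothing is sent. The token value is made up.

```python
import requests

# Using the session as a context manager closes it automatically
with requests.Session() as session:
    session.headers.update({'Authorization': 'Bearer example-token'})  # made-up token

    # Per-request headers are merged with the session headers
    request = requests.Request(
        'GET',
        'https://httpbin.org/headers',
        headers={'Accept': 'application/json'},
    )
    prepared = session.prepare_request(request)

print(prepared.headers['Authorization'])  # from the session
print(prepared.headers['Accept'])         # per-request value overrides the session default
print(prepared.headers['User-Agent'])     # session default
```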
