In [None]:
! pip show requests
! pip show urllib3

Name: requests
Version: 2.23.0
Summary: Python HTTP for Humans.
Home-page: https://requests.readthedocs.io
Author: Kenneth Reitz
Author-email: me@kennethreitz.org
License: Apache 2.0
Location: /usr/local/lib/python3.7/dist-packages
Requires: idna, chardet, urllib3, certifi
Required-by: tweepy, torchtext, tensorflow-datasets, tensorboard, Sphinx, spacy, requests-oauthlib, pymystem3, pooch, panel, pandas-datareader, kaggle, gspread, google-colab, google-api-core, gdown, folium, fix-yahoo-finance, fastai, coveralls, community, CacheControl
Name: urllib3
Version: 1.24.3
Summary: HTTP library with thread-safe connection pooling, file post, and more.
Home-page: https://urllib3.readthedocs.io/
Author: Andrey Petrov
Author-email: andrey.petrov@shazow.net
License: MIT
Location: /usr/local/lib/python3.7/dist-packages
Requires: 
Required-by: requests, kaggle, httplib2shim


In [None]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import requests

**requests library**

- Sends an HTTP request and returns the server's response, for example an HTML document.

- The HTML document arrives as a raw string.

## 1. Get response

- `requests.get(url)` sends a GET request and returns a `Response` object.

In [None]:
# requests.get() sends a GET request and returns a Response object

res = requests.get("https://google.com")

### 1.1 Handling Errors

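* Status code  
    - 200 OK  
    - 4xx Client Error  
    - 5xx Server Error

* HTTP Error 403 Forbidden  
  The website identifies the request as coming from a bot rather than a browser and denies access. 

  Solution: add a `"User-Agent"` header.  

* `raise_for_status()` method: raises `requests.exceptions.HTTPError` when the status code is 4xx or 5xx, and does nothing otherwise.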

In [None]:
print("Status code : ", res.status_code)
print(requests.codes.ok)

if res.status_code != requests.codes.ok:
    print("Error code ", res.status_code)

Status code :  200
200


In [None]:
# Client error

url = 'https://en.wikipedia.org/wiki/nonexistingpage'
res = requests.get(url)

res.raise_for_status()  # Raises HTTPError if the status code is 4xx or 5xx.

HTTPError: ignored
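
In a script, `raise_for_status()` is usually wrapped in `try`/`except` so the error can be handled instead of stopping the program. The sketch below builds a `Response` object by hand, purely so that it runs without network access; the helper name `check` and the URL are hypothetical.

```python
import requests


def check(res):
    """Return None on success, or the error message if the status is 4xx/5xx."""
    try:
        res.raise_for_status()
    except requests.exceptions.HTTPError as err:
        return str(err)
    return None


# Hand-built stand-in for the result of requests.get(); no request is sent.
fake = requests.Response()
fake.status_code = 404
fake.reason = "Not Found"
fake.url = "https://example.com/missing"  # hypothetical URL

error_message = check(fake)
print(error_message)
```

With a real request, `check(requests.get(url))` would be used the same way.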

In [None]:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
}

res = requests.get(url, headers=headers)

### 1.2 Response object

`res.text` Response body decoded to `str` using `res.encoding`  
`res.content` Response body as raw `bytes` (use this for media files such as images)

In [None]:
print(len(res.text))

14042


In [None]:
res.encoding

'ISO-8859-1'

In [None]:
res.text

'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="7TH85VM8HPQ0B/BKpqx5jA==">(function(){window.google={kEI:\'tV9lYpD1BPqUwbkP7rWsuAM\',kEXPI:\'0,1302536,56873,6058,207,2414,2390,2316,383,246,5,1354,4013,1238,1122515,1197730,664,380097,16108,28690,17572,4858,1362,9290,3028,17581,4020,978,13228,3847,10622,14763,7350,628,6674,1279,2453,289,149,1103,840,1983,214,4100,3514,606,2023,2299,14668,3227,1990,855,7,17450,7540,8780,17175,432,3,346,1244,1,5444,149,11323,991,1661,4,1528,2304,7039,22023,3050,2658,7356,13659,2980,1457,15351,14

In [None]:
type(res.text)

str

In [None]:
res.content

b'<!doctype html><html itemscope="" itemtype="http://schema.org/WebPage" lang="en"><head><meta content="Search the world\'s information, including webpages, images, videos and more. Google has many special features to help you find exactly what you\'re looking for." name="description"><meta content="noodp" name="robots"><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"><meta content="/images/branding/googleg/1x/googleg_standard_color_128dp.png" itemprop="image"><title>Google</title><script nonce="7TH85VM8HPQ0B/BKpqx5jA==">(function(){window.google={kEI:\'tV9lYpD1BPqUwbkP7rWsuAM\',kEXPI:\'0,1302536,56873,6058,207,2414,2390,2316,383,246,5,1354,4013,1238,1122515,1197730,664,380097,16108,28690,17572,4858,1362,9290,3028,17581,4020,978,13228,3847,10622,14763,7350,628,6674,1279,2453,289,149,1103,840,1983,214,4100,3514,606,2023,2299,14668,3227,1990,855,7,17450,7540,8780,17175,432,3,346,1244,1,5444,149,11323,991,1661,4,1528,2304,7039,22023,3050,2658,7356,13659,2980,1457,15351,1

In [None]:
type(res.content)

bytes

### 1.3 "params" option

`requests.get(url, params=params)`

Example: search for "tesla" on The New York Times.

`https://www.nytimes.com/search?query=tesla`

Applying the site's search filters shows that every option becomes a query parameter (the part after "?"). Passing these options as a dict via `params` lets requests build the query string.  

`https://www.nytimes.com/search?dropmab=true&endDate=20220424&query=tesla&sort=newest&startDate=20220417&types=article`

In [None]:
url = "https://www.nytimes.com/search"  # requests adds the "?" and the query string
params = {
    'query': 'tesla',
    'startDate': '20220301',
    'endDate': '20220330',
    'types': 'Video',
    'sort': 'best'
}

res = requests.get(url, params=params)

In [None]:
len(res.text)

132363
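
To see the URL that `params` produces without sending anything, the request can be prepared locally. This is a minimal sketch using `requests.Request(...).prepare()`; the variable names `prepared` and `built_url` are illustrative, and only a subset of the parameters is shown.

```python
import requests

url = "https://www.nytimes.com/search"
params = {
    "query": "tesla",
    "startDate": "20220301",
    "endDate": "20220330",
}

# prepare() builds the final URL locally; no request is sent.
prepared = requests.Request("GET", url, params=params).prepare()
built_url = prepared.url
print(built_url)
```

After a real request, the same information is available as `res.url`.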

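### 1.4 Encoding problems

- `res.text` is of type `str`.  
- requests reads the encoding from the HTTP headers, stores it in `res.encoding`, and uses that value to decode the body when `res.text` is accessed. When a `text/*` Content-Type header gives no charset, requests falls back to `ISO-8859-1`, which garbles Korean text.  

Solutions  

- Set `res.encoding = 'euc-kr'` explicitly.  
- Set `res.encoding = None`, so that `res.text` falls back to the encoding detected from the body (`res.apparent_encoding`).  
- `res.content.decode('euc-kr')`: not recommended, because strict decoding raises `UnicodeDecodeError` on any invalid byte, whereas `res.text` replaces such bytes.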
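
The effect of a wrong `res.encoding` can be reproduced with plain `bytes`, without any request. This is a minimal sketch, assuming the server sends EUC-KR bytes; the sample word and the variable names are illustrative.

```python
# Bytes as an EUC-KR server would send them (sample word chosen for illustration).
raw = "테슬라".encode("euc-kr")

# ISO-8859-1 maps every byte to a character, so the wrong codec never raises;
# it silently produces garbled text.
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec restores the original text.
fixed = raw.decode("euc-kr")

print(garbled)
print(fixed)
```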

In [None]:
res1 = requests.get("https://www.mk.co.kr/news/business/view/2022/04/357729/")
res1.encoding

'ISO-8859-1'

In [None]:
res1.text  # garbled

'<!DOCTYPE html>\n<html lang="ko" itemId="https://www.mk.co.kr/news/business/view/2022/04/357729/" itemType="http://schema.org/NewsArticle" itemScope="" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n<head>\n    <!-- Google Tag Manager -->\n    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\n                new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\n            j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n            \'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n        })(window,document,\'script\',\'dataLayer\',\'GTM-KTPV2SK\');</script>\n    <!-- End Google Tag Manager -->\n    <title data-rh="true">Å×½½¶ó, `°ªÁú`ÇÏ´Ù ¸Á½Å»ì¡¦ÇÑ±¹Â÷¿¡ ±¼¿å, ¸ðµ¨3µµ `³Ñ¹ö3` Ãß¶ô [¿Ö¸ô¶úÀ»Ä«] - ¸ÅÀÏ°æÁ¦</title>\n        <link rel="canonical" href="https://www.mk.co.kr/news/business/view/2022/04/357729/" />\n    <link rel="amphtml" href="https://m.mk.co.kr/new

In [None]:
res1.encoding = "euc-kr"
res1.text

'<!DOCTYPE html>\n<html lang="ko" itemId="https://www.mk.co.kr/news/business/view/2022/04/357729/" itemType="http://schema.org/NewsArticle" itemScope="" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n<head>\n    <!-- Google Tag Manager -->\n    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\n                new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\n            j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n            \'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n        })(window,document,\'script\',\'dataLayer\',\'GTM-KTPV2SK\');</script>\n    <!-- End Google Tag Manager -->\n    <title data-rh="true">테슬라, `값질`하다 망신살…한국차에 굴욕, 모델3도 `넘버3` 추락 [왜몰랐을카] - 매일경제</title>\n        <link rel="canonical" href="https://www.mk.co.kr/news/business/view/2022/04/357729/" />\n    <link rel="amphtml" href="https://m.mk.co.kr/news/business/view-amp/2022/04/35772

In [None]:
res1 = requests.get("https://www.mk.co.kr/news/business/view/2022/04/357729/")
res1.encoding = None
res1.text

'<!DOCTYPE html>\n<html lang="ko" itemId="https://www.mk.co.kr/news/business/view/2022/04/357729/" itemType="http://schema.org/NewsArticle" itemScope="" class="story" xmlns:og="http://opengraphprotocol.org/schema/">\n<head>\n    <!-- Google Tag Manager -->\n    <script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({\'gtm.start\':\n                new Date().getTime(),event:\'gtm.js\'});var f=d.getElementsByTagName(s)[0],\n            j=d.createElement(s),dl=l!=\'dataLayer\'?\'&l=\'+l:\'\';j.async=true;j.src=\n            \'https://www.googletagmanager.com/gtm.js?id=\'+i+dl;f.parentNode.insertBefore(j,f);\n        })(window,document,\'script\',\'dataLayer\',\'GTM-KTPV2SK\');</script>\n    <!-- End Google Tag Manager -->\n    <title data-rh="true">테슬라, `값질`하다 망신살…한국차에 굴욕, 모델3도 `넘버3` 추락 [왜몰랐을카] - 매일경제</title>\n        <link rel="canonical" href="https://www.mk.co.kr/news/business/view/2022/04/357729/" />\n    <link rel="amphtml" href="https://m.mk.co.kr/news/business/view-amp/2022/04/35772

In [None]:
res1 = requests.get("https://www.mk.co.kr/news/business/view/2022/04/357729/")
res1.content.decode('euc-kr')

UnicodeDecodeError: ignored

In [None]:
res1 = requests.get("https://www.mk.co.kr/news/business/view/2022/04/357729/")
res1.content.decode('utf-8')

UnicodeDecodeError: ignored

In [None]:
# Prefer setting res.encoding (explicitly or to None) over decoding res.content by hand
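
One likely reason the manual `decode` calls above fail while `res.text` works: `bytes.decode` is strict by default, whereas requests decodes `res.text` with `errors="replace"`. The sketch below reproduces the difference on hand-made bytes; the stray `0xff` byte is an assumption standing in for whatever invalid byte the real page contains.

```python
# EUC-KR bytes followed by one byte that is invalid in EUC-KR (assumed for illustration).
raw = "테슬라".encode("euc-kr") + b"\xff"

try:
    raw.decode("euc-kr")  # strict decoding, like res.content.decode('euc-kr')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Lenient decoding replaces the undecodable byte with U+FFFD instead of raising.
lenient = raw.decode("euc-kr", errors="replace")

print(strict_failed)
print(lenient)
```

If manual decoding is really needed, `res.content.decode('euc-kr', errors='replace')` is the lenient form.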