<p style="font-size:15pt; text-align:center">
    Introduction to Data Science
</p>
<p style="font-size:20pt; text-align:center">
    Data Collection
</p>

In [1]:
import pandas as pd
import requests

# HTTP

The `requests` library lets us make HTTP requests from Python code.

In [2]:
url = "https://httpbin.org/html"
response = requests.get(url)
response

<Response [200]>

## The Request

Let’s take a closer look at the request we made. We can access the original request through the `response` object; we display the request’s HTTP headers below:

In [None]:
request = response.request
for key in request.headers: # The request headers are stored in a dictionary-like object.
    print(f'{key}: {request.headers[key]}')

Every HTTP request has a type, also called its method. In this case, we used a `GET` request, which retrieves information from a server.

In [None]:
request.method

## The Response

Let’s examine the response we received from the server. First, we will print the response’s HTTP headers.

In [None]:
for key in response.headers:
    print(f'{key}: {response.headers[key]}')

An HTTP response contains a status code, a special number that indicates whether the request succeeded or failed. The status code `200` indicates that the request succeeded.

In [None]:
response.status_code
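
In scripts it is often more convenient to fail loudly on an error status than to inspect the number by hand. The sketch below shows `Response.ok` and `Response.raise_for_status()` from `requests`. The `Response` objects are built by hand purely to avoid a network call, and `fake_response` is a hypothetical helper, not part of the library.

```python
import requests


def fake_response(status_code: int) -> requests.Response:
    """Hypothetical helper: build a bare Response with only a status code set."""
    resp = requests.Response()
    resp.status_code = status_code
    return resp


ok_resp = fake_response(200)
missing_resp = fake_response(404)

print(ok_resp.ok)       # .ok is True when the status code is below 400
print(missing_resp.ok)  # .ok is False for 4xx and 5xx status codes

try:
    missing_resp.raise_for_status()  # raises HTTPError for 4xx and 5xx codes
except requests.HTTPError as err:
    print("Request failed:", err)
```

With a real call, the same pattern is simply `response = requests.get(url)` followed by `response.raise_for_status()`.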

Finally, we display the first 100 characters of the response’s content (the entire response content is too long to display nicely here).

In [None]:
response.text[:100]

## Types of Requests

### GET Requests

The `GET` request is used to retrieve information from the server. Since your web browser makes a `GET` request whenever you enter a URL into its address bar, `GET` requests are the most common type of HTTP request.

### POST Requests

The `POST` request is used to send information from the client to the server. For example, some web pages contain forms for the user to fill out, such as a login form. After the user clicks the “Submit” button, most web browsers make a `POST` request to send the form data to the server for processing.
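To make the difference between the two request types concrete, here is a minimal sketch that builds one `GET` and one `POST` request with `requests` but never sends them, so it runs without network access. The `httpbin.org` endpoints, the query parameter, and the form field are illustrative assumptions, not part of the original notebook.

```python
import requests

# Build the requests without sending them (no network traffic happens here).
# The URLs, the "q" parameter, and the "username" field are illustrative only.
get_req = requests.Request(
    "GET", "https://httpbin.org/get", params={"q": "data science"}
).prepare()
post_req = requests.Request(
    "POST", "https://httpbin.org/post", data={"username": "alice"}
).prepare()

# GET: parameters travel in the URL's query string, and there is no body.
print(get_req.method, get_req.url, get_req.body)

# POST: form fields travel URL-encoded in the request body.
print(post_req.method, post_req.url, post_req.body)
print(post_req.headers["Content-Type"])
```

Calling `.prepare()` produces the same kind of object that `response.request` exposes after a real call, which makes it a convenient way to inspect what would go over the wire.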

## Types of Response Status Codes

* ***100s*** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)

* ***200s*** - Success: The client’s request was successful (e.g. 200 OK, 202 Accepted)

* ***300s*** - Redirection: The requested URL is located elsewhere; further action from the user may be needed (e.g. 300 Multiple Choices, 301 Moved Permanently)

* ***400s*** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)

* ***500s*** - Server Error: Server-side error or server is incapable of performing the request (e.g. 500 Internal Server Error, 503 Service Unavailable)
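
The five classes above are determined by the first digit of the status code. As a small sketch using only the standard library, the snippet below maps a few codes to their class and looks up their standard reason phrase with `http.HTTPStatus`; the `status_class` function is a hypothetical helper written for this illustration.

```python
from http import HTTPStatus

# Class names keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code: int) -> str:
    """Hypothetical helper: return the class name for an HTTP status code."""
    return STATUS_CLASSES.get(code // 100, "Unknown")


for code in (100, 200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g. "Not Found".
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```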

# Web Crawler

## Get the HTML Content

In [None]:
url = 'https://vp.fact.qq.com/home'
content = requests.get(url)

In [None]:
help(requests.get)

In [None]:
content.text

## Parse Content from HTML

***Beautiful Soup***

> Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

- Beautiful Soup provides a few simple methods. It doesn't take much code to write an application.
- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it. Then you just have to specify the original encoding.
- Beautiful Soup sits on top of popular Python parsers like `lxml` and `html5lib`.

***Useful methods in Beautiful Soup:***
* ***select***
* ***find***
* ***find_all***
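
The three methods listed above can be tried on a small inline document before pointing them at a real page. This is a minimal sketch: the HTML snippet, its `id`, and its class names are made up for illustration and do not come from the site scraped in this notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet used only for illustration.
html = """
<div id="news">
  <p class="item">First claim</p>
  <p class="item">Second claim</p>
  <a href="/about">About</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                # first matching tag, or None if absent
all_p = soup.find_all("p")              # list-like collection of every matching tag
items = soup.select("div#news p.item")  # tags matching a CSS selector

print(first_p.text)
print([p.text for p in all_p])
print([tag.text for tag in items])
print(soup.find("a")["href"])           # attributes are read like dictionary keys
```

`find` and `find_all` match on tag names and attributes, while `select` takes a CSS selector, which is usually the shortest way to target elements by class.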

In [None]:
from bs4 import BeautifulSoup # if this import fails, install the package first: pip install beautifulsoup4

In [None]:
soup = BeautifulSoup(content.text, 'html.parser') 

In [None]:
soup

In [None]:
[i.text for i in soup.select('.InfiniteList_listContentText__89_ud')]
#[i.text.split('\xa0')[0].replace('\u200b', '') for i in soup.select('.InfiniteList_listContentText__89_ud')]

In [None]:
[i.text for i in soup.select('.InfiniteList_backgroundfalse__JK57x')]

## Parse Tables with `pandas`

In [None]:
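from io import StringIO

#url = "https://www.douban.com/group/730749/discussion?start=0"
url = "https://www.douban.com/group/14771/"
# Send a browser-like User-Agent; some sites reject the default one used by requests.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
}
response = requests.get(url, headers=headers)
# read_html returns a list of DataFrames, one per <table> found in the page.
# Wrapping the text in StringIO avoids the literal-HTML deprecation warning in recent pandas.
tables = pd.read_html(StringIO(response.text))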


In [None]:
response

In [None]:
len(tables)

In [None]:
tables[0]
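
To see what `pd.read_html` returns without depending on a live website, here is a minimal sketch that parses an inline table. The HTML and its column names are invented for illustration, and the call assumes an HTML parser that pandas can use (`lxml`, or `beautifulsoup4` with `html5lib`) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML table used only for illustration.
html = """
<table>
  <thead>
    <tr><th>Topic</th><th>Replies</th></tr>
  </thead>
  <tbody>
    <tr><td>Welcome</td><td>12</td></tr>
    <tr><td>Reading list</td><td>7</td></tr>
  </tbody>
</table>
"""

# read_html returns a list with one DataFrame per <table> element found.
tables = pd.read_html(StringIO(html))
df = tables[0]

print(len(tables))
print(df)
print(df["Replies"].sum())  # numeric columns are parsed as numbers
```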

# The End 

**Source**

This notebook was adapted from:
* Data 100: Principles and Techniques of Data Science
* Introduction to Computational Communication by Chengjun Wang