<p style="font-size:15pt; text-align:center">
    Introduction to Data Science
</p>
<p style="font-size:20pt; text-align:center">
    Data Collection
</p>

In [1]:
import pandas as pd
import requests

# HTTP

The `requests` library lets us make HTTP requests from Python.

In [4]:
url = "https://httpbin.org/html"
response = requests.get(url)
response

<Response [200]>

## The Request

Let’s take a closer look at the request we made. We can access the original request through the `response` object; we display the request’s HTTP headers below:

In [3]:
request = response.request
for key in request.headers: # The request headers are stored in a dictionary-like object.
    print(f'{key}: {request.headers[key]}')

User-Agent: python-requests/2.27.1
Accept-Encoding: gzip, deflate, br
Accept: */*
Connection: keep-alive
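
The headers above were filled in by `requests` itself. As a minimal sketch, we can inspect these defaults without sending anything by letting a `Session` prepare a request. The names `session`, `unsent`, and `prepared` are chosen here for illustration, and the exact header values depend on the installed version of `requests`.

```python
import requests

# Build a request object but do not send it over the network.
session = requests.Session()
unsent = requests.Request("GET", "https://httpbin.org/html")

# The session merges its default headers into the prepared request.
prepared = session.prepare_request(unsent)

print(prepared.method)
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```
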


Every HTTP request has a type, also called a method. In this case, we used a `GET` request, which retrieves information from a server.

In [5]:
request.method

'GET'

## The Response

Let’s examine the response we received from the server. First, we will print the response’s HTTP headers.

In [6]:
for key in response.headers:
    print(f'{key}: {response.headers[key]}')

Date: Wed, 14 Dec 2022 09:28:04 GMT
Content-Type: text/html; charset=utf-8
Content-Length: 3741
Connection: keep-alive
Server: gunicorn/19.9.0
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true


An HTTP response contains a status code, a special number that indicates whether the request succeeded or failed. The status code `200` indicates that the request succeeded.

In [7]:
response.status_code

200
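
Rather than comparing the status code by hand, we can let `requests` decide whether a response counts as a failure. The sketch below uses `raise_for_status()` and the `ok` attribute. To avoid any network traffic it builds `Response` objects by hand, which is an assumption made purely for illustration; normally `requests.get` creates them. The helper name `describe` is hypothetical.

```python
import requests

def describe(response):
    """Return a short label saying whether the response indicates success."""
    try:
        response.raise_for_status()  # raises requests.HTTPError for 4xx and 5xx codes
    except requests.HTTPError:
        return f"failed ({response.status_code})"
    return f"succeeded ({response.status_code})"

# Hand-built responses for illustration only; nothing is sent over the network.
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(describe(ok_response))
print(describe(missing_response))
print(ok_response.ok, missing_response.ok)
```
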

Finally, we display the response’s content. The output below is truncated because the entire content is too long to display nicely here.

In [8]:
response.text

"<!DOCTYPE html>\n<html>\n  <head>\n  </head>\n  <body>\n      <h1>Herman Melville - Moby-Dick</h1>\n\n      <div>\n        <p>\n          Availing himself of the mild, summer-cool weather that now reigned in these latitudes, and in preparation for the peculiarly active pursuits shortly to be anticipated, Perth, the begrimed, blistered old blacksmith, had not removed his portable forge to the hold again, after concluding his contributory work for Ahab's leg, but still retained it on deck, fast lashed to ringbolts by the foremast; being now almost incessantly invoked by the headsmen, and harpooneers, and bowsmen to do some little job for them; altering, or repairing, or new shaping their various weapons and boat furniture. Often he would be surrounded by an eager circle, all waiting to be served; holding boat-spades, pike-heads, harpoons, and lances, and jealously watching his every sooty movement, as he toiled. Nevertheless, this old man's was a patient hammer wielded by a patient arm.

## Types of Requests

### GET Requests

The `GET` request is used to retrieve information from the server. Since your web browser makes a `GET` request whenever you enter a URL into its address bar, `GET` is the most common type of HTTP request.
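
A `GET` request often carries extra information in the URL's query string. As a rough sketch, the `params` argument of `requests` encodes a dictionary into the URL for us. The request is only prepared here, not sent, and the parameter names `q` and `page` are made up for illustration.

```python
import requests

unsent = requests.Request(
    "GET",
    "https://httpbin.org/get",
    params={"q": "data science", "page": 2},  # hypothetical query parameters
)
prepared = unsent.prepare()  # encodes the parameters into the URL, sends nothing

print(prepared.method)
print(prepared.url)
```
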

### POST Requests

The `POST` request is used to send information from the client to the server. For example, some web pages contain forms for the user to fill out, such as a login form. After the user clicks the “Submit” button, the browser typically makes a `POST` request to send the form data to the server for processing.
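
To see what a form submission looks like on the wire, we can prepare a `POST` request without sending it. This is a sketch under assumptions: the field names and values in `form_data` are invented, and the URL is httpbin's echo endpoint rather than a real login page.

```python
import requests

# Hypothetical form fields, as a login form might submit them.
form_data = {"username": "alice", "password": "not-a-real-password"}

unsent = requests.Request("POST", "https://httpbin.org/post", data=form_data)
prepared = unsent.prepare()  # builds the request body, sends nothing

print(prepared.method)
print(prepared.headers["Content-Type"])
print(prepared.body)
```

Unlike a `GET` request, the form data travels in the request body rather than in the URL.
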

## Types of Response Status Codes

* ***100s*** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)

* ***200s*** - Success: The client’s request was successful (e.g. 200 OK, 202 Accepted)

* ***300s*** - Redirection: The requested URL is located elsewhere; further action may be needed (e.g. 300 Multiple Choices, 301 Moved Permanently)

* ***400s*** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)

* ***500s*** - Server Error: Server-side error or server is incapable of performing the request (e.g. 500 Internal Server Error, 503 Service Unavailable)
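
The five classes above follow from the first digit of the status code. A small sketch using the standard library's `http.HTTPStatus` makes this concrete; the helper `describe_status` and the `STATUS_CLASSES` table are names introduced here for illustration.

```python
from http import HTTPStatus

# The first digit of a status code determines its class.
STATUS_CLASSES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}

def describe_status(code):
    """Return a string such as '404 Not Found (Client Error)'."""
    status_class = STATUS_CLASSES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:  # the code is not a registered HTTP status
        phrase = "Unregistered"
    return f"{code} {phrase} ({status_class})"

for code in (100, 200, 301, 404, 503):
    print(describe_status(code))
```
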

# Web Crawler

## Get the HTML Content

In [9]:
url = 'https://vp.fact.qq.com/home'
content = requests.get(url)

In [12]:
content

<Response [200]>

In [13]:
help(requests.get)

Help on function get in module requests.api:

get(url, params=None, **kwargs)
    Sends a GET request.
    
    :param url: URL for the new :class:`Request` object.
    :param params: (optional) Dictionary, list of tuples or bytes to send
        in the query string for the :class:`Request`.
    :param \*\*kwargs: Optional arguments that ``request`` takes.
    :return: :class:`Response <Response>` object
    :rtype: requests.Response



In [14]:
content.text

'<!DOCTYPE html><html lang="zh-CN"><head><meta charSet="utf-8"/><meta name="description" content="查真假，就上腾讯新闻较真平台"/><title>腾讯较真实时辟谣</title><meta name="apple-mobile-web-app-capable" content="yes"/><meta name="browsermode" content="application"/><meta name="viewport" content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no"/><meta name="next-head-count" content="6"/><script src="https://beacon.cdn.qq.com/sdk/4.5.9/beacon_web.min.js"></script><script src="https://mat1.gtimg.com/www/js/emonitor/custom_e439ff7f.js" charSet="utf-8"></script><link rel="preload" href="https://tnfe.gtimg.com/jiaozhen/h5ssr/_next/static/css/e930cd01eea79c4b.css" as="style"/><link rel="stylesheet" href="https://tnfe.gtimg.com/jiaozhen/h5ssr/_next/static/css/e930cd01eea79c4b.css" data-n-g=""/><link rel="preload" href="https://tnfe.gtimg.com/jiaozhen/h5ssr/_next/static/css/a9a2977851c84c91.css" as="style"/><link rel="stylesheet" href="https://tnfe.gtimg.com/jiaozhen/h5ss

## Parse Content from HTML

***Beautiful Soup***

> Beautiful Soup is a Python library designed for quick turnaround projects like screen-scraping. Three features make it powerful:

- Beautiful Soup provides a few simple methods. It doesn't take much code to write an application.
- Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8. You don't have to think about encodings unless the document doesn't specify one and Beautiful Soup can't detect it; then you just have to specify the original encoding.
- Beautiful Soup sits on top of popular Python parsers like `lxml` and `html5lib`.

***Useful methods in Beautiful Soup:***
* ***select***
* ***find***
* ***find_all***
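
The three methods listed above can be tried on a small inline document. This is a minimal sketch: the HTML snippet, its class names, and the variable `soup_demo` are made up for illustration and are unrelated to the page scraped in this notebook.

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <h1 id="title">Fact checks</h1>
  <ul>
    <li class="claim">Claim A</li>
    <li class="claim">Claim B</li>
    <li class="note">Editor's note</li>
  </ul>
</body></html>
"""
soup_demo = BeautifulSoup(html, "html.parser")

first_item = soup_demo.find("li")        # the first matching tag only
all_items = soup_demo.find_all("li")     # every matching tag, as a list
claims = soup_demo.select("li.claim")    # every tag matching a CSS selector

print(first_item.text)
print(len(all_items))
print([tag.text for tag in claims])
```

`find` and `find_all` match on tag names and attributes, while `select` takes a CSS selector, which is convenient when the target is identified by a class name.
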

In [15]:
from bs4 import BeautifulSoup # if this raises ImportError, install the package: pip install beautifulsoup4

In [16]:
soup = BeautifulSoup(content.text, 'html.parser') 

In [17]:
soup

<!DOCTYPE html>
<html lang="zh-CN"><head><meta charset="utf-8"/><meta content="查真假，就上腾讯新闻较真平台" name="description"/><title>腾讯较真实时辟谣</title><meta content="yes" name="apple-mobile-web-app-capable"/><meta content="application" name="browsermode"/><meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/><meta content="6" name="next-head-count"/><script src="https://beacon.cdn.qq.com/sdk/4.5.9/beacon_web.min.js"></script><script charset="utf-8" src="https://mat1.gtimg.com/www/js/emonitor/custom_e439ff7f.js"></script><link as="style" href="https://tnfe.gtimg.com/jiaozhen/h5ssr/_next/static/css/e930cd01eea79c4b.css" rel="preload"/><link data-n-g="" href="https://tnfe.gtimg.com/jiaozhen/h5ssr/_next/static/css/e930cd01eea79c4b.css" rel="stylesheet"/><link as="style" href="https://tnfe.gtimg.com/jiaozhen/h5ssr/_next/static/css/a9a2977851c84c91.css" rel="preload"/><link data-n-p="" href="https://tnfe.gtimg.com/jiaozhen/h5ssr/_ne

In [18]:
[i.text for i in soup.select('.InfiniteList_listContentText__89_ud')]
# Alternative that drops everything after a non-breaking space and removes zero-width spaces:
#[i.text.split('\xa0')[0].replace('\u200b', '') for i in soup.select('.InfiniteList_listContentText__89_ud')]

['感染一次之后三个月内不可能再被感染，因此早阳早好',
 '打疫苗时的症状乘十倍，就是感染后的症状',
 '如果感染过但病毒载量低，就可能短期再次感染',
 '长得好看的人不容易得新冠',
 '新冠病毒能杀死肺癌细胞',
 '喝白酒能有效预防新冠',
 '奥密克戎是小艾滋，反复感染会免疫衰竭',
 '喝电解质水缓解新冠症状比喝白开水更有效',
 '感染新冠后不能吃布洛芬，否则会加重病情',
 '国务院联防联控小组正式摘牌',
 '新冠患者不能服用布洛芬',
 '新冠感染者居家隔离，会经楼道气溶胶传播病毒',
 '羊城晚报：奥密克戎病毒正式更名为新型冠状病毒感冒',
 '常戴口罩会吸入口罩纤维到肺里，患上肺结节',
 '抗原检测试剂盒不准，用橘子汁做都能阳',
 '高烧四十度也是无症状，出现肺炎才算确诊',
 '重庆火车站拉人到方舱医院有补贴',
 '石家庄查出白肺，可能是新冠变异',
 '天津老太太发口罩拐骗儿童',
 '北京重庆河北山东山西等地飞机喷洒打药全城消杀']

In [20]:
[i.text for i in soup.select('.InfiniteList_cardTitle__cPRz5')]

[' 查证者',
 'Y博',
 ' 查证者',
 '张卫',
 ' 查证者',
 'Y博',
 ' 查证者',
 '一节生姜',
 ' 查证者',
 '一节生姜',
 ' 查证者',
 '李治中',
 ' 查证者',
 'Y博',
 ' 查证者',
 '谢望时',
 ' 查证者',
 '较真团队',
 ' 查证者',
 '中国互联网联合辟谣平台',
 ' 查证者',
 '阿司匹林42195米',
 ' 查证者',
 '阿司匹林42195米',
 ' 查证者',
 '羊城派',
 ' 查证者',
 '科普中国',
 ' 查证者',
 '钟堃',
 ' 查证者',
 'Y博',
 ' 查证者',
 '中国新闻网',
 ' 查证者',
 '中国互联网联合辟谣平台',
 ' 查证者',
 '天津辟谣',
 ' 查证者',
 '中国互联网联合辟谣平台']
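
In the output above, every other element is the label ' 查证者' and the element after it is the fact-checker's name. Assuming this strict alternation holds, slicing with a step of 2 separates the two. The sketch uses the first six values from the output; `titles`, `labels`, and `checkers` are names chosen here.

```python
# The first six values from the output above.
titles = [" 查证者", "Y博", " 查证者", "张卫", " 查证者", "Y博"]

labels = [t.strip() for t in titles[0::2]]   # elements 0, 2, 4: the repeated label
checkers = titles[1::2]                      # elements 1, 3, 5: the names

print(labels)
print(checkers)
```
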

In [19]:
[i.text for i in soup.select('.InfiniteList_backgroundfalse__JK57x')]

['谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '伪常识',
 '谣言',
 '谣言',
 '谣言',
 '谣言',
 '谣言']
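
The list of claims printed earlier has 20 entries, but this list of labels has only 18, so pairing the lists by position could attach a label to the wrong claim. One way around this, sketched below, is to select each card first and then look up the fields inside it. The HTML and the class names `card`, `claim`, `verdict-false`, and `verdict-true` are hypothetical stand-ins, not the real page's markup.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second card uses a different class for its label.
html = """
<div class="card"><p class="claim">Claim A</p><span class="verdict-false">Rumor</span></div>
<div class="card"><p class="claim">Claim B</p><span class="verdict-true">Verified</span></div>
<div class="card"><p class="claim">Claim C</p><span class="verdict-false">Rumor</span></div>
"""
cards = BeautifulSoup(html, "html.parser").select("div.card")

records = []
for card in cards:
    claim = card.select_one(".claim")
    verdict = card.select_one(".verdict-false")  # missing on some cards
    records.append({
        "claim": claim.text if claim is not None else None,
        "rumor_label": verdict.text if verdict is not None else None,
    })

for record in records:
    print(record)
```

Because each record is built from a single card, a missing label shows up as `None` instead of shifting the labels that follow it.
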

## Parsing Tables with `pandas`

In [22]:
# pandas can only scrape tabular data (HTML <table> elements)
#url = "https://www.douban.com/group/730749/discussion?start=0"
url = "https://www.douban.com/group/14771/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4343.0 Safari/537.36',
}
response = requests.get(url, headers=headers)
tables  = pd.read_html(response.text)
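
The request above passes a browser-like `User-Agent` because some sites respond differently to the default `python-requests` identifier. As a sketch, we can check that a custom header replaces the default one by preparing the request without sending it. The header value below is a shortened placeholder, not a real browser string.

```python
import requests

# Placeholder value for illustration; a real browser string is much longer.
custom_headers = {"User-Agent": "Mozilla/5.0 (example browser string)"}

session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", "https://www.douban.com/group/14771/", headers=custom_headers)
)

# The custom User-Agent overrides the default; the other defaults are kept.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```
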


In [23]:
response

<Response [200]>

In [24]:
len(tables)

1

In [27]:
tables[0]

|   | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 讨论 | 作者 | 回应 | 最后回应 |
| 1 | 新人来报道，大家互相认识一下 | °D | 2645 | 2019-07-12 |
| 2 | ♥你想要不变心的情人 还是永远不老的青春 | 蜉蝣。 | 792 | 12-13 12:07 |
| 3 | 一见倾心的短文字 | 欧喏 | 892 | 12-12 23:53 |
| 4 | ❤ 可以不光芒万丈，但不能停止自己发光 ❤ | 茶㑍凁 | 28118 | 12-12 10:50 |
| 5 | 【不過•遇见】——而已 | C° | 873 | 12-12 00:30 |
| 6 | 谁省，谁省。从此簟纹灯影。 | Sukie | 199 | 12-10 00:31 |
| 7 | 总有一句 正中你心（我不在的时候 请自行进入我的... | 邵略略 | 8934 | 12-08 21:08 |
| 8 | 心硬者得世界，温柔者成神。 | Cat_Ting | 5326 | 12-08 18:08 |
| 9 | 【若旧人终是不覆，让时间替我送上祝福】 | Crush | 1752 | 12-07 23:45 |
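
In the table above, the real column names (讨论, 作者, 回应, 最后回应) ended up in row 0, and the columns are simply numbered. A minimal sketch of promoting that row to the header follows. It rebuilds a small frame by hand from two of the rows above, so `raw` and `table` are names introduced here. Passing `header=0` to `pd.read_html` may achieve the same result directly.

```python
import pandas as pd

# A hand-built stand-in for tables[0], using values from the output above.
raw = pd.DataFrame([
    ["讨论", "作者", "回应", "最后回应"],
    ["新人来报道，大家互相认识一下", "°D", "2645", "2019-07-12"],
    ["一见倾心的短文字", "欧喏", "892", "12-12 23:53"],
])

table = raw.iloc[1:].reset_index(drop=True)   # drop the header row from the data
table.columns = raw.iloc[0].tolist()          # use it as the column names
table["回应"] = table["回应"].astype(int)      # reply counts become integers

print(table)
```
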


# The End 

**Source**

This notebook was adapted from:
* Data 100: Principles and Techniques of Data Science
* Introduction to Computational Communication by Chengjun Wang