<a href="https://colab.research.google.com/github/njulhy/funny_code/blob/main/spider/get_socks_proxy.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

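# Extracting free SOCKS5 proxies from spys.one
## Introduction
&emsp;&emsp;When learning or running web crawlers, we need proxy IPs to avoid being blocked by anti-scraping measures. Beginners, and anyone without a stable long-term need, tend to look for free proxy IPs. Such IPs are often unreliable: one that works today may be dead tomorrow. It is therefore worth writing a script that scrapes them automatically from websites that list free proxies.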
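## Python libraries used in this article
&emsp;&emsp;There are two main ones:
1. requests: provides the common request methods such as get and post, and makes building request headers straightforward
2. re: Python's regular expression library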

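## The site scraped in this article
&emsp;&emsp;https://spys.one/en is a well-known site that lists free proxy IPs. This article extracts free SOCKS5 proxies from it, specifically from https://spys.one/en/socks-proxy-list/.

&emsp;&emsp;In a browser the page looks like this: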

<center>
<img src="https://img-blog.csdnimg.cn/20201011005338145.png" alt="Image could not be loaded" width=50%>
</center>

First, import the Python libraries we use.

In [1]:
import requests
import re
import sys

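&emsp;&emsp;Looking at the page source, the IP is easy to extract, while the port is lightly obfuscated.
The obfuscation on this page is simple: a lookup table is given directly in the head, in the form d4w3c3=0^j0h8;j0c3q7=1^z6f6;m3b2m3=2^r8o5; so, for example, d4w3c3 stands for 0. Each port appears in the source as (l2z6a1^v2l2)+(j0c3q7^z6f6)+(l2z6a1^v2l2)+(p6u1v2^x4u1)), with one parenthesized term per digit. Because ^ is XOR and the table defines d4w3c3 as 0^j0h8, a term such as (d4w3c3^j0h8) evaluates back to 0, so the first name in each term identifies the digit.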
<center>
<img src="https://img-blog.csdnimg.cn/20201011015238495.png" alt="Image could not be loaded" width=50%><br>
<font size="2">Lookup table used for decoding<br>
<img src="https://img-blog.csdnimg.cn/20201011015238522.png" alt="Image could not be loaded" width=50%><br>
How a port appears in the source</font>
</center>
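
&emsp;&emsp;To see why each name can be read directly as a digit, here is a small worked sketch that evaluates the XOR expressions numerically. The table entries are sample values of the kind the page produces (they change between page loads), and the helper name `evaluate_table` is illustrative only, not something the page or this article defines.

```python
# Sample decode-table entries; real values differ on every page load.
table_text = "q7l2=6578;o5g7=7425;c3k1=4556;r8c3f6=0^q7l2;b2k1l2=1^o5g7;z6v2g7=8^c3k1;"

def evaluate_table(text):
    """Evaluate each assignment numerically, treating ^ as bitwise XOR."""
    values = {}
    for statement in filter(None, text.split(";")):
        name, expr = statement.split("=")
        if "^" in expr:
            digit, other = expr.split("^")
            values[name] = int(digit) ^ values[other]  # e.g. 0 ^ 6578
        else:
            values[name] = int(expr)
    return values

values = evaluate_table(table_text)

port_expr = "(b2k1l2^o5g7)+(r8c3f6^q7l2)+(z6v2g7^c3k1)+(r8c3f6^q7l2)"
digits = []
for term in port_expr.split("+"):
    left, right = term.strip("()").split("^")
    # (digit ^ key) ^ key == digit, so the XOR cancels out
    digits.append(str(values[left] ^ values[right]))

port = "".join(digits)
print(port)  # 1080
```

&emsp;&emsp;Since every term collapses to the digit stored in the table, a plain name-to-digit lookup is enough and no XOR arithmetic is needed when decoding.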

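&emsp;&emsp;<br>The work to be done in this article is therefore:
1. Build regular expressions to extract all IPs, the encoded ports, and the decode table
2. Use the requests library to fetch the page source and extract all IPs, the encoded ports, and the decode table
3. Format the decode table as a dict, e.g. {'d4w3c3': '0', 'j0c3q7': '1'}
4. Replace the encoded ports with their numeric form
5. Join each IP and port as ip:port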



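&emsp;&emsp;<b>The five code blocks below implement the five steps above, one each</b><br>
&emsp;&emsp;To keep the code blocks similar in size, the first block also defines the parameters passed to requests.post. Note that this page has to be fetched with a POST request whose data holds the five options that can be selected on the page. The first one, for example, is the number of IPs to list; the image below shows how it appears in the source<br>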
<center>
<img src="https://img-blog.csdnimg.cn/20201011015238494.png" alt="Image could not be loaded" width=60%><br>
<font size="2">The xpp parameter in the source (xpp=5 requests 500 entries)</font>
</center>
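
&emsp;&emsp;As a rough illustration of what this POST carries, the sketch below form-encodes the five parameters with the standard library. When a dict is passed as `data`, requests sends a form-encoded body of this kind. Only the meaning of xpp is stated in this article; the other field meanings in the comments are assumptions inferred from the variable names SHOW, AHM, SSL, Port and Type used in the next cell.

```python
from urllib.parse import urlencode

# The five form fields the page expects.
form_fields = {
    "xpp": "5",  # number of rows to show (5 means 500)
    "xf1": "4",  # anonymity filter (assumed meaning)
    "xf2": "0",  # SSL filter (assumed meaning)
    "xf4": "0",  # port filter (assumed meaning)
    "xf5": "2",  # proxy type filter (assumed meaning)
}

body = urlencode(form_fields)
print(body)  # xpp=5&xf1=4&xf2=0&xf4=0&xf5=2
```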

In [3]:
# Regular expressions for extracting the data
ip_pattern = r"onmouseout.*?class=spy14>(.*?)<script.*?font>" # ip
port_pattern = r"\d<script.*?document.*?font>.\+(.*?)</script>" # encoded port
decode_text_pattern = r'''</table><script type="text/javascript">(.*?)</script>''' # decode table

# Parameters for requests.post
user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.2 Safari/605.1.15"
headers = {'User-Agent': user_agent}
url = "https://spys.one/en/socks-proxy-list/"
SHOW, AHM, SSL, Port, Type = '5', '4', '0', '0', '2'
data = {
    'xpp': SHOW,
    'xf1': AHM,
    'xf2': SSL,
    'xf4': Port,
    'xf5': Type
}
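
&emsp;&emsp;To make the two row-level patterns easier to follow, the sketch below runs them on a single hand-written table row. The row is an assumption: it only imitates the markup that the patterns imply and is not copied from the real page source, so the real page may differ in detail.

```python
import re

ip_pattern = r"onmouseout.*?class=spy14>(.*?)<script.*?font>"
port_pattern = r"\d<script.*?document.*?font>.\+(.*?)</script>"

# Hand-written row imitating the page's markup (an assumption, not real page source).
sample_row = (
    '<tr onmouseout="x"><td>'
    '<font class=spy14>161.35.173.134'
    '<script type="text/javascript">'
    'document.write("<font class=spy2>:<\\/font>"'
    '+(b2k1l2^o5g7)+(r8c3f6^q7l2)+(z6v2g7^c3k1)+(r8c3f6^q7l2))'
    '</script></font></td></tr>'
)

sample_ips = re.findall(ip_pattern, sample_row)
sample_ports = re.findall(port_pattern, sample_row)
print(sample_ips)    # ['161.35.173.134']
print(sample_ports)  # ['(b2k1l2^o5g7)+(r8c3f6^q7l2)+(z6v2g7^c3k1)+(r8c3f6^q7l2))']
```

&emsp;&emsp;The captured port text ends with an extra closing parenthesis because the capture runs up to `</script>` and so includes the one that closes the surrounding call. This is harmless, since only the names before each `^` are used later.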

In [5]:
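def get_all_information(url, headers, data):
    '''
    Request the page and return all IPs, all encoded ports, and the decode table text
    '''
    try:
        # The request is sent directly. To route it through a SOCKS5 proxy, pass
        # proxies={'https': 'socks5://ip:port'} (requires the requests[socks] extra).
        response = requests.post(url=url, data=data, headers=headers, timeout=30)
        response.raise_for_status()  # treat 4xx/5xx responses as failures
    except requests.exceptions.RequestException as error:
        sys.exit("Request failed: {}".format(error))
    print('You have got the source code of "{}".'.format(url))
    html_text = response.text
    all_ip = re.findall(ip_pattern, html_text)
    all_port_text = re.findall(port_pattern, html_text) # (t0f6u1^h8q7)+(a1e5a1^m3h8)+(t0f6u1^h8q7)+(g7z6n4^d4o5))
    decode_match = re.search(decode_text_pattern, html_text)
    if decode_match is None:
        sys.exit("Decode table not found in the page source")
    return all_ip, all_port_text, decode_match.group(1)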
all_ip, all_port_text, decode_text = get_all_information(url, headers, data)
print("There are {} proxy ips and the first ip is {}".format(len(all_ip), all_ip[0]))
print("There are {} ports of proxy ips and the first port is {}".format(len(all_port_text), all_port_text[0]))
print("The decode text is", decode_text)

You have got the source code of "https://spys.one/en/socks-proxy-list/".
There are 500 proxy ips and the first ip is 161.35.173.134
There are 500 ports of proxy ips and the first port is (b2k1l2^o5g7)+(r8c3f6^q7l2)+(z6v2g7^c3k1)+(r8c3f6^q7l2))
The decode text is q7l2=6578;o5g7=7425;e5c3=8726;v2d4=5990;b2x4=2255;r8b2=5155;w3f6=6096;x4o5=8083;c3k1=4556;j0i9=9618;r8c3f6=0^q7l2;b2k1l2=1^o5g7;v2p6w3=2^e5c3;q7a1m3=3^v2d4;s9h8c3=4^b2x4;g7n4z6=5^r8b2;a1j0s9=6^w3f6;x4e5r8=7^x4o5;z6v2g7=8^c3k1;u1g7q7=9^j0i9;


In [6]:
def get_decode_key(decode_text):
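    '''
    Turn text such as 155;k1s9=8233;d4w3c3=0^j0h8 into a dict such as {'d4w3c3': '0'}
    '''
    # Digit entries look like ";name=digit^other"; capture the "name=digit" part
    decode_key_chart = re.findall(r";(\w+=\d)\^", decode_text)
    decode_dict = {}
    for item in decode_key_chart:
        key, value = item.split("=")
        decode_dict[key] = value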
    return decode_dict
decode_dict = get_decode_key(decode_text)
print("The decode dict is {}".format(decode_dict))

The decode dict is {'r8c3f6': '0', 'b2k1l2': '1', 'v2p6w3': '2', 'q7a1m3': '3', 's9h8c3': '4', 'g7n4z6': '5', 'a1j0s9': '6', 'x4e5r8': '7', 'z6v2g7': '8', 'u1g7q7': '9'}
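
&emsp;&emsp;The same dictionary can also be built in one step. This is a compact sketch rather than a replacement for the function above: with two capture groups, `re.findall` returns (name, digit) pairs, which `dict()` accepts directly. The sample text is a shortened decode table, and the name `compact_dict` is illustrative.

```python
import re

sample_decode_text = (
    "q7l2=6578;o5g7=7425;c3k1=4556;"
    "r8c3f6=0^q7l2;b2k1l2=1^o5g7;z6v2g7=8^c3k1;"
)

# Two capture groups make findall return (name, digit) tuples.
compact_dict = dict(re.findall(r";(\w+)=(\d)\^", sample_decode_text))
print(compact_dict)  # {'r8c3f6': '0', 'b2k1l2': '1', 'z6v2g7': '8'}
```

&emsp;&emsp;Like the original pattern, this relies on each digit entry being preceded by a semicolon, which holds for the table shown above because the plain numeric entries come first.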


In [7]:
def decode_port(all_port_text, decode_dict):
    '''
    Convert ports of the form (l2z6a1^v2l2)+(j0c3q7^z6f6)+(l2z6a1^v2l2)+(p6u1v2^x4u1)) into digits using the decode_dict built above
    '''
    all_port = []
    for current_port_text in all_port_text:
        current_port = ""
        current_port_pre = re.findall(r"(\w+)\^", current_port_text)

        for key in current_port_pre:
            current_port += decode_dict[key]
        all_port.append(current_port)
    return all_port
all_port = decode_port(all_port_text, decode_dict)
print("Take the first port as example:\nencrypted code:{}\nafter decode:{}".format(all_port_text[0], all_port[0]))

Take the first port as example:
encrypted code:(b2k1l2^o5g7)+(r8c3f6^q7l2)+(z6v2g7^c3k1)+(r8c3f6^q7l2))
after decode:1080
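
&emsp;&emsp;The decode step above raises a KeyError if a port refers to a name that is missing from the table. A more tolerant variant might look like the sketch below. The function name `safe_decode_port` and the range check are additions for illustration, not part of the original notebook.

```python
import re

def safe_decode_port(port_text, decode_dict):
    """Return the decoded port as an int, or None if it cannot be decoded."""
    keys = re.findall(r"(\w+)\^", port_text)
    try:
        port = int("".join(decode_dict[key] for key in keys))
    except (KeyError, ValueError):
        # Unknown name, or no names at all (int("") raises ValueError)
        return None
    # Valid TCP ports lie between 1 and 65535
    return port if 1 <= port <= 65535 else None

sample_dict = {"r8c3f6": "0", "b2k1l2": "1", "z6v2g7": "8"}
good = safe_decode_port("(b2k1l2^o5g7)+(r8c3f6^q7l2)+(z6v2g7^c3k1)+(r8c3f6^q7l2))", sample_dict)
bad = safe_decode_port("(unknown^o5g7)", sample_dict)
print(good)  # 1080
print(bad)   # None
```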


In [8]:
def get_proxy(all_ip, all_port):
    '''
    Join the IPs and ports from above into ip:port strings
    '''
    socks_pro = []
    for ip, port in zip(all_ip, all_port):
        socks_pro.append(ip+":"+port)
    return socks_pro
socks_pro = get_proxy(all_ip, all_port)
print("You have got {} proxies and the top ten are \n{}".format(len(socks_pro), socks_pro[:10]))

You have got 500 proxies and the top ten are 
['161.35.173.134:1080', '8.129.208.113:1080', '98.161.153.202:4145', '184.176.166.13:4145', '181.3.135.150:1080', '98.190.102.62:4145', '174.76.48.246:4145', '167.99.93.186:1080', '174.75.211.222:4145', '161.35.165.6:1080']
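
&emsp;&emsp;To use one of these entries with requests, the proxy has to be given as a URL and keyed by the scheme of the target URL. The sketch below only builds that mapping; the helper name `build_requests_proxies` is illustrative, and the commented-out request is not executed here because it needs network access, a proxy that is still alive, and SOCKS support installed via `pip install requests[socks]`.

```python
def build_requests_proxies(proxy):
    """Map both URL schemes to a SOCKS5 proxy given as 'ip:port'."""
    proxy_url = "socks5://{}".format(proxy)
    return {"http": proxy_url, "https": proxy_url}

proxies = build_requests_proxies("161.35.173.134:1080")
print(proxies)
# {'http': 'socks5://161.35.173.134:1080', 'https': 'socks5://161.35.173.134:1080'}

# Usage (not executed here):
# import requests
# requests.get("https://example.com", proxies=proxies, timeout=10)
```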


In [9]:
print(socks_pro)

['161.35.173.134:1080', '8.129.208.113:1080', '98.161.153.202:4145', '184.176.166.13:4145', '181.3.135.150:1080', '98.190.102.62:4145', '174.76.48.246:4145', '167.99.93.186:1080', '174.75.211.222:4145', '161.35.165.6:1080', '192.252.214.20:4145', '192.252.208.67:4145', '207.97.174.134:1080', '128.199.10.11:1080', '178.128.227.146:1080', '68.183.206.244:1080', '165.227.45.235:1080', '178.165.44.122:1080', '161.35.166.11:1080', '68.183.35.168:1080', '8.129.72.108:1080', '181.6.63.242:1080', '204.101.61.82:4145', '181.6.17.143:1080', '128.199.5.133:1080', '98.185.94.94:4145', '8.129.216.52:1080', '174.77.111.196:4145', '178.128.235.55:1080', '134.209.230.189:1080', '98.162.96.52:4145', '192.111.129.148:4145', '165.227.42.108:1080', '181.3.202.140:1080', '192.111.135.17:4145', '134.122.36.145:1080', '163.172.101.112:1080', '142.93.144.41:1080', '159.203.20.113:1080', '178.128.237.169:1080', '174.64.199.82:4145', '113.160.188.21:1080', '178.128.231.71:1080', '98.188.47.132:4145', '192.111.1