### **What is a User-Agent?**

The User-Agent (UA) is a string that a client (such as a web browser or a bot) sends in the `User-Agent` request header to identify itself to the server. It typically describes the client's browser, operating system, and device. For example, when you open a website in Chrome or Firefox, the browser sends its User-Agent string with every request to that website.
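To make this concrete, here is a minimal sketch that prepares two requests without sending them and prints the User-Agent each one would carry. It assumes the `requests` library is installed; `Session.prepare_request` is used only so that nothing goes over the network, and the variable names are illustrative.

```python
import requests

url = "https://httpbin.org/get"

# Prepare (but do not send) a request with the library defaults.
session = requests.Session()
default_request = session.prepare_request(requests.Request("GET", url))

# Prepare the same request with a browser-like User-Agent.
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
)
custom_request = session.prepare_request(
    requests.Request("GET", url, headers={"User-Agent": browser_ua})
)

# The default identifies the client as a Python script, not a browser.
print("Default:", default_request.headers["User-Agent"])
print("Custom: ", custom_request.headers["User-Agent"])
```

The default value starts with `python-requests/`, which is why a server can easily tell a script apart from a browser unless the header is overridden.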

In [None]:
import requests
import httpx

In [None]:
"""
Objective: Understanding Request Headers
"""
# TODO: Send a request to https://httpbin.org/get using both requests and httpx
# TODO: Get the responses from both requests
# TODO: Compare the request headers and note the differences

In [None]:
"""
Objective: Modify request headers
"""
# TODO: Send request to https://httpbin.org/get using requests
# TODO: Get the request headers from the response

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}

# TODO: Send a new request with the modified headers by passing the headers parameter to the get method
# TODO: Get the response
# TODO: Compare the two responses and understand the difference
# TODO: Experiment with different User-Agents and share your thoughts

In [None]:
"""
Objective: Bypassing User-Agent Blocking
"""
# TODO: Send request to https://gamefaqs.gamespot.com/news using requests with and without custom headers
# TODO: Compare the responses: which one is blocked and which one is not?

In [None]:
"""
Objective: Understanding User-Agent Rotation
If everyone you met today were wearing the same shirt, what would you think?
"""
# List of common User-Agent strings
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36',
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:50.0) Gecko/20100101 Firefox/50.0',
    'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:92.0) Gecko/20100101 Firefox/92.0'
]

# TODO: Send a request to https://httpbin.org/get using a random User-Agent from the list
# TODO: Get the response and print the User-Agent that was used
# TODO: Execute it again. Is the User-Agent still the same?
# TODO: Loop to send up to 10 requests, choosing a User-Agent from the list for each one, and print the User-Agent used each time

In [None]:
"""
Objective: Using the fake_useragent library
"""
# TODO: Install fake_useragent using pip
# TODO: Create a UserAgent object
# TODO: To get a random User-Agent, use ua.random
# TODO: Send request to https://httpbin.org/get using random User-Agent from the fake_useragent
# TODO: Get the response and print the used User-Agent
# TODO: Try to execute it again, is the User-Agent still the same?

In [None]:
"""
Objective: Improve web scraping by using fake_useragent and logging
"""
# TODO: Visit https://www.cmegroup.com/markets/energy/crude-oil/light-sweet-crude.settlements.html
# TODO: Try sending a request to that site without custom headers
# TODO: If the request fails, retry with a random User-Agent from fake_useragent
# TODO: Extract the data table and save it to a json file

### **Proxy Rotation**

A proxy acts as an intermediary server between your scraping script (client) and the target website (server). When you send a request through a proxy, the request is routed through the proxy server, which then forwards the request to the destination website. The website sees the request coming from the proxy’s IP address instead of your actual IP address. This makes proxies particularly useful in web scraping, as they help with anonymity, bypassing rate limits, and preventing IP bans.
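Rotation means switching to a different proxy from one request to the next. The following is a minimal sketch of round-robin rotation, not a production implementation. The helper names `build_proxies` and `rotate` are hypothetical, and the addresses are placeholders from a documentation-only IP range, not real proxies.

```python
from itertools import cycle


def build_proxies(proxy_url):
    """Return a proxies mapping that routes both HTTP and HTTPS traffic through one proxy."""
    return {"http": proxy_url, "https": proxy_url}


def rotate(proxy_urls):
    """Yield a proxies mapping for each proxy in turn, starting over at the end of the list."""
    for proxy_url in cycle(proxy_urls):
        yield build_proxies(proxy_url)


# Placeholder addresses (reserved for documentation); replace with working proxies.
proxy_pool = ["http://192.0.2.10:8080", "http://192.0.2.11:8080"]

rotation = rotate(proxy_pool)
first = next(rotation)   # uses proxy_pool[0]
second = next(rotation)  # uses proxy_pool[1]
third = next(rotation)   # wraps around to proxy_pool[0]

print(first)
print(second)
print(third)
```

Each mapping can be passed as the `proxies` argument of `requests.get`, so consecutive requests reach the target site from different IP addresses.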

In [None]:
""" 
Objective: Understanding rate limits
"""
# TODO: Execute this cell

url = "https://api.github.com/users/octocat"

for i in range(1, 66):
    response = requests.get(url)
    print(f"Sending request {i}. Status code: {response.status_code}")

In [None]:
""" 
Objective: Understanding Proxy
"""
# Sample code: the proxy address below is a placeholder, replace it with a working proxy before running
import requests

# Define the URL you want to scrape
url = 'https://httpbin.org/ip'  # httpbin provides your IP address for testing

# Define the proxy to use
proxy = {
    'http': 'http://10.10.1.10:8080',
    'https': 'http://10.10.1.10:8080'
}

# Send a request through the proxy
response = requests.get(url, proxies=proxy, timeout=5)

# Print the response (this will show the IP address the website sees)
print(response.text)

In [None]:
""" 
Objective: Finding free proxies
"""
# TODO: Do a Google search for free proxies and share your thoughts


In [None]:
""" 
Objective: Using free proxies
"""

proxy_url = "https://api.proxyscrape.com/v4/free-proxy-list/get?request=display_proxies&proxy_format=protocolipport&format=text"

response = requests.get(proxy_url, timeout=10)  # Get the proxy list (fail fast if the API hangs)
proxy_list = response.text.splitlines()  # One proxy per line, works for both \n and \r\n endings
proxy_list = [proxy.strip() for proxy in proxy_list if proxy.strip().startswith('http')]  # Keep only http/https entries
print(len(proxy_list))

url = "https://api.github.com/users/octocat"

# TODO: Trigger blocking by sending 60 or more requests to the URL
# TODO: Once we hit the rate limit, we need to use a proxy
# TODO: A free proxy may not always work, so loop through the list to find a working proxy (response.status_code == 200)
# TODO: Once a request succeeds, print the proxy used and exit the loop

### **Reflection**
Which User-Agents typically get blocked? Can you name some?

(answer here)

### **Exploration**
There are many proxy providers out there. Research at least three of them and compare them.