## **A Basic Web Scraper in Python**

## **Introduction**
Web scraping is a powerful method for acquiring data from websites, especially when the information you need isn’t readily available in a structured format. By setting up a web scraper in your local environment, you can automate the process of gathering large amounts of data from the web.

### **Step 1: Import the necessary libraries**

In [2]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### **Step 2: Send an HTTP request to the website**
Use the requests library to send an HTTP GET request to the website you want to scrape:

In [32]:
url = "https://en.wikipedia.org/wiki/Cloud-computing_comparison"

response = requests.get(url)

if response.status_code == 200:
  print("Request successful!")
else:
  print("Request failed")

Request successful!


### **Step 3: Parse the HTML content**
Once you’ve successfully retrieved the web page, use BeautifulSoup to parse the HTML content:

In [33]:
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title.text)

Cloud-computing comparison - Wikipedia


### **Step 4: Extract the data you need**
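Now that you have the HTML parsed, you can start extracting the data you’re interested in. Let’s say you want to scrape the first table on the web page. `soup.find("table")` returns the first `<table>` element, and each `<tr>` inside it is one row; the first row holds the column names: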

In [37]:
table = soup.find("table")
rows = table.find_all("tr")

# getting the column names
headers = [header.text.strip() for header in rows[0].find_all("th")]
headers

['Provider',
 'Launched',
 'Block storage',
 'Assignable IPs',
 'SMTP support',
 'IOPS Guaranteed minimum',
 'Security',
 'Locations',
 'Notes']

In [38]:
data = []
for row in rows[1:]:
  cols = row.find_all("td")
  cols = [col.text.strip() for col in cols]
  data.append(cols)

# now, we store the data as a pandas DataFrame
df = pd.DataFrame(data, columns=headers)
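
The loop above assumes that every row after the header contains only `<td>` cells, one per column. Some tables mark the first cell of each row as `<th>` or include rows that span several columns, and then the number of cells no longer matches the headers. The following is a minimal sketch of a more defensive variant, not the method used for the output in this notebook. The `SAMPLE_HTML` markup, the `wikitable` class filter, and the `extract_rows` helper are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a real page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Provider</th><th>Launched</th></tr>
  <tr><th>Example Cloud</th><td>2013</td></tr>
  <tr><td colspan="2">Footer note</td></tr>
  <tr><td>Other Cloud</td><td>2006</td></tr>
</table>
"""


def extract_rows(table):
    """Return (headers, data), keeping only rows that match the header width."""
    rows = table.find_all("tr")
    headers = [cell.get_text(strip=True) for cell in rows[0].find_all("th")]
    data = []
    for row in rows[1:]:
        # Collect both <th> and <td> cells so row headers are not lost.
        cells = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        # Skip rows (such as spanning notes) that do not fit the header.
        if len(cells) == len(headers):
            data.append(cells)
    return headers, data


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
# Select the table by class instead of relying on it being the first one.
table = soup.find("table", class_="wikitable")
headers, data = extract_rows(table)
print(headers)
print(data)
```

Dropping mismatched rows is only one possible policy; depending on the page, you may prefer to pad short rows or log them for inspection.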

In [39]:
df.head()

| | Provider | Launched | Block storage | Assignable IPs | SMTP support | IOPS Guaranteed minimum | Security | Locations | Notes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Google Cloud Platform | 2013 | Yes | No | No[1] | Yes | Yes[2] | br, ca, cl, us, be, ch, de, es, fi, it, po, nl... | SMTP blocked.[4] |
| 1 | Oracle Cloud Infrastructure | 2014 | Yes | Yes | Yes | Yes | Yes[5] | us, ca, br, de, uk, nl, ch, in, aus, jp, kr, saud | |
| 2 | Amazon Web Services | 2006 | Yes | Yes | Partial[6] | Yes | Yes[7] | us, ca, br, ie, de, uk, cn, sg, au, jp, kr, in... | List of bugs[8] |
| 3 | IBM Cloud | 2005 | Yes | Yes | No[9] | Yes | Yes[10] | us, gb, fr, de, nl, in, au, hk, kr, it, jp, no... | |
| 4 | Microsoft Azure | 2010 | Yes | Yes | Yes[11] | Yes | Yes[12] | ca, us, br, ie, nl, de, uk, cn, au, jp, in, kr... | List of bugs[13] |
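
The scraped cells still carry Wikipedia footnote markers such as `[1]` or `[12]`. If you want clean values, one option is to strip them with a regular expression. This is a small sketch using only the standard library; the `strip_footnotes` helper and the `cells` sample list are hypothetical names introduced here, with values copied from the output above.

```python
import re

# Matches bracketed numeric footnote markers such as [1] or [12].
FOOTNOTE_MARKER = re.compile(r"\[\d+\]")


def strip_footnotes(text):
    """Remove footnote markers and surrounding whitespace from a cell."""
    return FOOTNOTE_MARKER.sub("", text).strip()


# Sample cell values taken from the table output above.
cells = ["No[1]", "Yes[12]", "SMTP blocked.[4]", "Yes"]
cleaned = [strip_footnotes(cell) for cell in cells]
print(cleaned)
```

For a whole DataFrame, `df.replace(r"\[\d+\]", "", regex=True)` should have the same effect on string columns.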


In [41]:
# if you want to store the table as a CSV or Excel file
# (to_excel needs an Excel writer such as openpyxl to be installed)
df.to_csv("cloud_computing_providers.csv", sep=",", index=False)
df.to_excel("cloud_computing_providers.xlsx", sheet_name="main", index=False)

### **Important considerations**

**Respect the website’s terms of service:** Always check the website’s terms of service to ensure that you’re allowed to scrape its content. Some websites explicitly prohibit scraping.

**Be mindful of rate limits:** Avoid sending too many requests in a short period to prevent overloading the website’s server. Implement delays between requests if necessary.

**Handle errors gracefully:** Always include error handling in your script to manage situations where the website structure changes or the page fails to load.
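
The last two points can be combined in a small retry helper that pauses between attempts and raises a clear error when it gives up. The sketch below is one possible approach under stated assumptions, not a definitive implementation: `fetch_with_retries`, its parameters, and the linear back-off are all hypothetical choices, and `get` stands for any callable that returns an object with a `status_code` attribute, for example `requests.get`.

```python
import time


def fetch_with_retries(get, url, retries=3, delay_seconds=1.0, sleep=time.sleep):
    """Call get(url) up to `retries` times, pausing between attempts.

    Returns the first response whose status code is 200 and raises
    RuntimeError if every attempt fails.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            response = get(url)
            if response.status_code == 200:
                return response
            last_error = RuntimeError(f"HTTP {response.status_code} for {url}")
        except Exception as exc:
            # Broad on purpose for this sketch; with requests you would
            # usually catch requests.RequestException instead.
            last_error = exc
        if attempt < retries:
            # Linear back-off: wait longer after each failed attempt.
            sleep(delay_seconds * attempt)
    raise RuntimeError(f"Giving up on {url} after {retries} attempts") from last_error
```

With the `requests` library from Step 1, a call could look like `fetch_with_retries(requests.get, url)`. The `sleep` parameter exists only so the delay can be replaced in tests.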

## **Conclusion**

By setting up a basic web scraper in Python, you can automate the process of gathering data from websites, making it easier to acquire the information you need for your AI/ML projects.