# Introduction

Web scraping is a powerful method for acquiring data from websites, especially when the information you need isn’t readily available in a structured format. By setting up a web scraper in your local environment, you can automate the process of gathering large amounts of data from the web. 

By the end of this reading, you will be able to: 

- Set up a basic web scraper in Python by following code snippets and explanations that walk through each step.

## Prerequisites

Before diving into the code, make sure the following tools are installed in your local environment:

- Python 3.x: Python is the language we’ll use to build our web scraper.

- pip: pip is Python’s package installer, which you’ll use to install the necessary libraries.

- A code editor: Examples include Jupyter Notebooks, VS Code, PyCharm, or even a simple text editor such as Sublime Text.

You’ll also need a basic understanding of HTML, as web scraping involves interacting with the HTML structure of a web page.
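
Before running the examples, it can help to confirm that the third-party libraries are importable. The sketch below is one possible check, not part of the original walkthrough: the helper name `missing_modules` is hypothetical, and it assumes the libraries were installed under their usual pip names (`requests`, `beautifulsoup4`, `pandas`).

```python
from importlib.util import find_spec

# Import names, not pip names: the beautifulsoup4 package is imported as bs4.
REQUIRED_MODULES = ["requests", "bs4", "pandas"]


def missing_modules(names):
    """Return the top-level import names in `names` that cannot be found."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
    print("Try: pip install requests beautifulsoup4 pandas")
else:
    print("All required modules are available.")
```

If anything is reported missing, install it with pip and rerun the check before continuing.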

# Example 1: Microsoft Learn

## Step 1: Import the necessary libraries

Start by importing the libraries you’ll need, if they’re not already in your kernel:

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

## Step 2: Send an HTTP request to the website

Use the requests library to send an HTTP GET request to the website you want to scrape:

In [45]:
url = 'https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.automl?view=azure-python'

try:
    response = requests.get(url, timeout=10)  # Avoid waiting forever on a stalled server
    response.raise_for_status()  # Raises an HTTPError for 4xx/5xx responses
except requests.exceptions.HTTPError as err:
    print('HTTP error occurred:', err)
except requests.exceptions.RequestException as err:
    # Covers connection errors, timeouts, and other request failures
    print('Other error occurred:', err)
else:
    # Runs only if no exception was raised, so `response` is guaranteed to exist
    print('Request successful!')

Request successful!
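
If you plan to fetch more than one page, the request-and-check pattern above can be wrapped in a small function. This is a minimal sketch under stated assumptions: the name `fetch_html`, the optional `session` argument, and the 10-second default timeout are illustrative choices, not part of the original example.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page body as text, or None if the request fails.

    `session` is a hypothetical hook: pass a requests.Session to reuse
    connections, or leave it as None to use the module-level requests.get.
    """
    client = session if session is not None else requests
    try:
        response = client.get(url, timeout=timeout)
        response.raise_for_status()  # Turn 4xx/5xx responses into exceptions
    except requests.exceptions.RequestException as err:
        # RequestException is the base class for HTTPError, ConnectionError,
        # Timeout, and the other errors that requests raises.
        print('Request failed:', err)
        return None
    return response.text
```

A call such as `html = fetch_html(url)` then returns either the page source or `None`, so later steps can check for `None` instead of hitting an undefined variable.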


## Step 3: Parse the HTML content

Once you’ve successfully retrieved the web page, use BeautifulSoup to parse the HTML content:

In [4]:
soup = BeautifulSoup(response.content, 'html.parser')

# Print the title of the webpage to verify
print(soup.title.text)

azure.ai.ml.automl package | Microsoft Learn
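
To see how BeautifulSoup navigates a parsed document without depending on a live page, you can experiment with a small HTML string. The markup below is made up for illustration; only the `BeautifulSoup` calls mirror the ones used in this reading.

```python
from bs4 import BeautifulSoup

# A made-up page used only for illustration
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="main-heading">Parameters</h1>
    <p class="note">First note</p>
    <p class="note">Second note</p>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

title = soup.title.text                                      # Text of the <title> tag
heading = soup.find('h1', id='main-heading').text            # First match by id
notes = [p.text for p in soup.find_all('p', class_='note')]  # All matches by class
missing = soup.find('table')                                 # No <table> here, so None

print(title)
print(heading)
print(notes)
print(missing)
```

Note that `find` returns `None` when nothing matches, which is why it is worth checking the result before calling methods on it.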


## Step 4: Extract the data you need

Now that the HTML is parsed, you can extract the data you’re interested in. The example below reads a table from the page in which each row holds a parameter name and its description:

In [None]:
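# Find the table containing the data. The attribute filter below matches this page;
# replace it with the id, class, or other attribute of the table you want.
table = soup.find('table', {'title': 'python-keyword-only-parameter-table'})
if table is None:
    raise ValueError('Table not found; check the attribute filter')

# Extract table rows
rows = table.find_all('tr')

# Loop through the rows and keep those with exactly two cells (name, description)
data = []
for row in rows[1:]:  # Skip the header row
    cols = row.find_all('td')
    if len(cols) == 2:
        data.append([col.text.strip() for col in cols])

# No delay is needed in this loop: it reads HTML that is already downloaded and
# sends no requests. Use time.sleep() between requests when fetching several pages.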

# Convert the data into a pandas DataFrame for easier manipulation
df = pd.DataFrame(data, columns=['Name', 'Description'])

# Replace line breaks in the 'Description' column with spaces
df['Description'] = df['Description'].str.replace('\n', ' ', regex=False)

# Remove any rows with missing values
df = df.dropna()

# Display the scraped data
df

Unnamed: 0,Name,Description
0,training_data,Input The training data to be used within th...
1,target_column_name,str The name of the label column. This param...
2,primary_metric,The metric that Automated Machine Learning wil...
3,enable_model_explainability,bool Whether to enable explaining the best A...
4,weight_column_name,str The name of the sample weight column. Au...
5,validation_data,Input The validation data to be used within ...
6,validation_data_size,float What fraction of the data to hold out ...
7,n_cross_validations,"Union[str, int] \t\t How many cross validatio..."
8,cv_split_column_names,List[str] \t\t List of names of the columns t...
9,test_data,Input The Model Test feature using test data...
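
The extraction logic in this step can also be packaged as a reusable function and tried on a small inline table. The following is a sketch rather than a definitive implementation: `table_to_dataframe` is a hypothetical helper, the sample HTML is invented, and rows whose cell count does not match the expected columns are simply skipped.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Invented sample markup: a header row, two valid rows, and one malformed row
SAMPLE_HTML = """
<table id="params">
  <tr><th>Name</th><th>Description</th></tr>
  <tr><td>alpha</td><td>First
      value</td></tr>
  <tr><td>beta</td><td>Second value</td></tr>
  <tr><td colspan="2">Malformed row</td></tr>
</table>
"""


def table_to_dataframe(soup, columns, **attrs):
    """Build a DataFrame from the first <table> matching `attrs`."""
    table = soup.find('table', attrs=attrs)
    if table is None:
        raise ValueError('No table matches the given attributes')

    records = []
    for row in table.find_all('tr'):
        # Collapse line breaks and repeated whitespace inside each cell
        cells = [' '.join(td.get_text().split()) for td in row.find_all('td')]
        # Header rows contain <th> cells only, so they yield an empty list here
        if len(cells) == len(columns):
            records.append(cells)
    return pd.DataFrame(records, columns=columns)


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
df = table_to_dataframe(soup, ['Name', 'Description'], id='params')
print(df)
```

Raising an error when the table is missing makes a changed page layout visible immediately, instead of failing later with a less obvious `AttributeError`.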


## Step 5: Save the scraped data

Finally, you can save the scraped data to a file for further analysis:

In [23]:
# Save the DataFrame to a CSV file
df.to_csv('scraped_data_microsoft_learn.csv', index=False)
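
The `index=False` argument matters when the file is read back later. The sketch below uses a small made-up DataFrame and in-memory buffers instead of files to show the difference; a column labeled `Unnamed: 0` typically appears when a CSV that was saved with its index is reloaded.

```python
import io

import pandas as pd

# Made-up data standing in for the scraped table
df = pd.DataFrame({
    'Name': ['alpha', 'beta'],
    'Description': ['First value', 'Second value'],
})

# Saved with the index (the default): the index becomes an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# Saved with index=False: only the real columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)

print(cols_with_index)
print(restored.columns.tolist())
```

Reading the file back once after saving is a quick way to confirm that the columns and row count match what you scraped.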

# Example 2: Scraping information from Wikipedia

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send an HTTP request to the webpage and stop early on a 4xx/5xx response
url = 'https://en.wikipedia.org/wiki/Cloud-computing_comparison'
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Print the title of the webpage to verify
print("Title: " + soup.title.text)

# Find the table containing the data (selecting the first table by default)
table = soup.find('table')

# Extract table rows
rows = table.find_all('tr')

# Extract headers from the first row (using <th> tags)
headers = [header.text.strip() for header in rows[0].find_all('th')]

# Loop through the rows and extract data (skip the first row with headers)
data = []
for row in rows[1:]:  # Start from the second row onwards
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append(cols)

# Convert the data into a pandas DataFrame, using the extracted headers as column names
df = pd.DataFrame(data, columns=headers)

# Display the first few rows of the DataFrame to verify
print(df.head())  

# Save the DataFrame to a CSV file
df.to_csv('scraped_data_wikipedia.csv', index=False)

Title: Cloud-computing comparison - Wikipedia
                      Provider Launched Block storage Assignable IPs  \
0        Google Cloud Platform     2013           Yes             No   
1  Oracle Cloud Infrastructure     2014           Yes            Yes   
2          Amazon Web Services     2006           Yes            Yes   
3                    IBM Cloud     2005           Yes            Yes   
4              Microsoft Azure     2010           Yes            Yes   

  SMTP support IOPS Guaranteed minimum Security  \
0        No[1]                     Yes   Yes[2]   
1          Yes                     Yes   Yes[5]   
2   Partial[6]                     Yes   Yes[7]   
3        No[9]                     Yes  Yes[10]   
4      Yes[11]                     Yes  Yes[12]   

                                           Locations             Notes  
0  br, ca, cl, us, be, ch, de, es, fi, it, po, nl...  SMTP blocked.[4]  
1  us, ca, br, de, uk, nl, ch, in, aus, jp, kr, saud                
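
The output above still contains Wikipedia footnote markers such as `[1]` and `[12]`. One possible cleanup step, sketched here with hypothetical helper names (`clean_cell`, `fit_row`), removes those markers and pads or trims each row to the header width before the DataFrame is built. The sample rows reuse a few cell values from the output above; the second row is deliberately cut short to show the padding.

```python
import re

# Matches numeric footnote markers such as [1] or [12]
FOOTNOTE = re.compile(r'\[\d+\]')


def clean_cell(text):
    """Remove footnote markers and surrounding whitespace from a cell."""
    return FOOTNOTE.sub('', text).strip()


def fit_row(cells, width, fill=''):
    """Trim or pad a row so it has exactly `width` cells."""
    cells = list(cells)[:width]
    return cells + [fill] * (width - len(cells))


raw_rows = [
    ['Google Cloud Platform', '2013', 'No[1]', 'Yes[2]'],
    ['Amazon Web Services', '2006', 'Partial[6]'],  # Shortened for illustration
]

cleaned = [fit_row([clean_cell(cell) for cell in row], 4) for row in raw_rows]
for row in cleaned:
    print(row)
```

In the scraping loop, `clean_cell` could be applied where `col.text.strip()` is called, so that the saved CSV contains plain values such as `Yes` and `No`.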