# Practice Activity: Setup of a Basic Data Scraper in Python
## Introduction

Web scraping is a powerful method for acquiring data from websites, especially when the information you need isn’t readily available in a structured format. By setting up a web scraper in your local environment, you can automate the process of gathering large amounts of data from the web. 

By the end of this reading, you will be able to: 

- Set up a basic data scraper in Python by following code snippets and explanations that walk through each step.

### 1. Prerequisites

Before diving into the code, ensure you have the following tools installed in your local environment:

- **Python 3.x**: Python is the language we’ll use to build our web scraper.
- **pip**: pip is Python’s package installer, which you’ll use to install the necessary libraries.
- **A code editor**: Examples include Jupyter Notebooks, VS Code, PyCharm, or even a simple text editor such as Sublime Text.

You’ll also need a basic understanding of HTML, as web scraping involves interacting with the HTML structure of a web page.
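
Before writing any scraping code, it can help to confirm that the three libraries imported in Step 1 are available. The following is a minimal sketch using only the standard library; the helper name `check_modules` is hypothetical and not part of any package.

```python
from importlib.util import find_spec

# Import names used in this activity. Note that BeautifulSoup is
# imported as "bs4" even though it is installed as "beautifulsoup4".
REQUIRED_MODULES = ["requests", "bs4", "pandas"]


def check_modules(names):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = check_modules(REQUIRED_MODULES)
for name, available in status.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

If any entry is reported as missing, running `pip install requests beautifulsoup4 pandas` in your terminal should install all three.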

### 2. Writing the Python script

Now, let’s walk through the code to set up a basic web scraper that extracts data from a web page.

Step-by-step guide:

#### Step 1: Import the necessary libraries
Start by importing the libraries you’ll need, if they’re not already in your kernel:

In [47]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

#### Step 2: Send an HTTP request to the website

Use the requests library to send an HTTP GET request to the website you want to scrape:

In [48]:
# url = 'https://example.com'  # Replace with the URL of the website you want to scrape
url = 'https://pandiarajan-src.github.io/Pandi-PeriodicTable/'
response = requests.get(url)

# Check if the request was successful
if response.status_code == 200:
    print('Request successful!')
else:
    print('Failed to retrieve the webpage')

Request successful!


#### Step 3: Parse the HTML content

Once you’ve successfully retrieved the web page, use BeautifulSoup to parse the HTML content:

In [49]:
soup = BeautifulSoup(response.content, 'html.parser')

# Print the title of the webpage to verify
print(soup.title.text)
# print(soup.get_text())  # Uncomment to print all of the text on the page

Pandi's Chemistry App
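
To see how BeautifulSoup navigates HTML without depending on a live website, here is a small sketch that parses an inline string. The HTML, the `main-heading` id, and the `item` class are all made up for illustration.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for illustration.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1 id="main-heading">Fruit list</h1>
    <ul>
      <li class="item">Apple</li>
      <li class="item">Banana</li>
    </ul>
  </body>
</html>
"""

soup_demo = BeautifulSoup(sample_html, "html.parser")

# .title gives the <title> tag; find() returns the first match,
# and find_all() returns every match as a list.
page_title = soup_demo.title.text
heading = soup_demo.find("h1", {"id": "main-heading"}).text
items = [li.text.strip() for li in soup_demo.find_all("li", class_="item")]

print(page_title)
print(heading)
print(items)
```

The same `find` and `find_all` calls are what Step 4 relies on to locate a table and its rows.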


#### Step 4: Extract the data you need

Now that you have the HTML parsed, you can start extracting the data you’re interested in. Let’s say you want to scrape a list of items from a table on the web page:

In [51]:
# Find the table containing the data
table = soup.find('table', {'id': 'element-categories'})  # Replace 'element-categories' with the actual id of your table

# Extract table rows
rows = table.find_all('tr')

# Loop through every row and collect the text of its <td> cells.
# The header row contains only <th> cells, so it yields an empty list,
# which shows up as a row of None values in the output below.
data = []
for row in rows:
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append(cols)

# Convert the data into a pandas DataFrame (columns get default integer names)
df = pd.DataFrame(data)

# Display the scraped data
print(df)

                         0                                                  1
0                     None                                               None
1            Alkali Metals  Highly reactive metals in Group 1 of the perio...
2    Alkaline Earth Metals  Reactive metals in Group 2 of the periodic table.
3        Transition Metals  Metals with variable oxidation states, found i...
4   Post Transition Metals  Metals with lower melting points and softer th...
5               Metalloids  Elements with properties of both metals and no...
6                Nonmetals  Elements that are poor conductors of heat and ...
7                 Halogens  Highly reactive nonmetals in Group 17 of the p...
8              Noble Gases     Inert gases in Group 18 of the periodic table.
9              Lanthanides   Rare earth metals, part of the f-block elements.
10               Actinides  Radioactive elements, part of the f-block elem...
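
The row of `None` values in the output comes from the header row, which has `<th>` cells rather than `<td>` cells. One possible variation, sketched below on a made-up inline table, uses the `<th>` text as column names, skips the header row, and returns `None` when the table is not found. The function name `table_to_dataframe` and the `demo-table` id are hypothetical.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table that mimics the two-column layout scraped above.
sample_html = """
<table id="demo-table">
  <tr><th>Category</th><th>Description</th></tr>
  <tr><td>Alkali Metals</td><td>Group 1 metals</td></tr>
  <tr><td>Noble Gases</td><td>Group 18 gases</td></tr>
</table>
"""


def table_to_dataframe(html, table_id):
    """Return the table with the given id as a DataFrame, or None if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"id": table_id})
    if table is None:  # find() returns None when nothing matches
        return None

    rows = table.find_all("tr")
    headers = [th.text.strip() for th in rows[0].find_all("th")]

    data = []
    for row in rows[1:]:  # skip the header row
        cols = [td.text.strip() for td in row.find_all("td")]
        if len(cols) == len(headers):  # ignore rows that do not line up
            data.append(cols)

    return pd.DataFrame(data, columns=headers)


demo_df = table_to_dataframe(sample_html, "demo-table")
print(demo_df)
```

Checking for `None` before calling `find_all` avoids an `AttributeError` if the page layout changes and the table id no longer exists.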


#### Step 5: Save the scraped data

Finally, you can save the scraped data to a file for further analysis:

In [52]:
# Save the DataFrame to a CSV file
df.to_csv('scraped_data.csv', index=False)

### 3. Example: Scraping information from Wikipedia

Let’s go through a more concrete example where we scrape information about cloud computing platforms:

In [53]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send an HTTP request to the webpage
url = 'https://en.wikipedia.org/wiki/Cloud-computing_comparison'  
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Print the title of the webpage to verify
print("Title: " + soup.title.text)

# Find the table containing the data (selecting the first table by default)
table = soup.find('table')

# Extract table rows
rows = table.find_all('tr')

# Extract headers from the first row (using <th> tags)
headers = [header.text.strip() for header in rows[0].find_all('th')]

# Loop through the rows and extract data (skip the first row with headers)
data = []
for row in rows[1:]:  # Start from the second row onwards
    cols = row.find_all('td')
    cols = [col.text.strip() for col in cols]
    data.append(cols)

# Convert the data into a pandas DataFrame, using the extracted headers as column names
df = pd.DataFrame(data, columns=headers)

# Display the first few rows of the DataFrame to verify
print(df.head())  

# Save the DataFrame to a CSV file
df.to_csv('scraped_data.csv', index=False)

Title: Cloud-computing comparison - Wikipedia
                      Provider Launched Block storage Assignable IPs  \
0        Google Cloud Platform     2013           Yes             No   
1  Oracle Cloud Infrastructure     2014           Yes            Yes   
2          Amazon Web Services     2006           Yes            Yes   
3                    IBM Cloud     2005           Yes            Yes   
4              Microsoft Azure     2010           Yes            Yes   

  SMTP support IOPS Guaranteed minimum Security  \
0        No[1]                     Yes   Yes[2]   
1          Yes                     Yes   Yes[5]   
2   Partial[6]                     Yes   Yes[7]   
3        No[9]                     Yes  Yes[10]   
4      Yes[11]                     Yes  Yes[12]   

                                           Locations             Notes  
0  br, ca, cl, us, be, ch, de, es, fi, it, po, nl...  SMTP blocked.[4]  
1  us, ca, br, de, uk, nl, ch, in, aus, jp, kr, saud                
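
Several cells in the output above carry Wikipedia footnote markers such as `[1]` or `[12]`. If those markers are not wanted in the final dataset, a regular expression can strip them. This is a sketch under the assumption that markers are always digits in square brackets; `strip_footnotes` is a hypothetical helper name.

```python
import re

# Matches bracketed numeric footnote markers such as "[1]" or "[12]".
FOOTNOTE_PATTERN = re.compile(r"\[\d+\]")


def strip_footnotes(text):
    """Remove markers like "[7]" from a cell value and trim whitespace."""
    return FOOTNOTE_PATTERN.sub("", text).strip()


# Sample values copied from the output above.
cells = ["No[1]", "Partial[6]", "Yes[12]", "SMTP blocked.[4]", "Yes"]
cleaned = [strip_footnotes(cell) for cell in cells]
print(cleaned)
```

The helper could be applied to each string cell before saving to CSV, taking care to skip cells that are missing.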

### 4. Important considerations

**Respect the website’s terms of service:** Always check the website’s terms of service to ensure that you’re allowed to scrape its content. Some websites explicitly prohibit scraping.

**Be mindful of rate limits:** Avoid sending too many requests in a short period to prevent overloading the website’s server. Implement delays between requests if necessary.
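
One simple way to add a delay between requests is to pause before every request except the first. The sketch below assumes a fixed one-second default, which is an illustrative value rather than a recommendation for any particular site. The `fetch_all` helper and the stand-in `fake_fetch` function are hypothetical.

```python
import time


def fetch_all(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results


# Demonstration with a stand-in fetch function so nothing is actually requested.
def fake_fetch(url):
    return f"fetched {url}"


pages = fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fetch=fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

In a real script, `fetch` could be a function that wraps `requests.get`. Passing `sleep` as a parameter makes the pause easy to replace when testing.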

**Handle errors gracefully:** Always include error handling in your script to manage situations where the website structure changes or the page fails to load.
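
As a minimal sketch of error handling for the request step, the function below wraps `requests.get` in a `try`/`except` block, sets a timeout, and treats HTTP error status codes as failures. The name `fetch_html` and the 10-second timeout are assumptions chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # raises HTTPError for 4xx and 5xx responses
    except requests.exceptions.RequestException as error:
        print(f"Could not retrieve {url}: {error}")
        return None
    return response.text


# A URL without a scheme fails validation before any network traffic is sent.
result = fetch_html("not-a-valid-url")
print(result)
```

Because connection errors, timeouts, and invalid URLs all derive from `requests.exceptions.RequestException`, a single `except` clause covers them. The caller only needs to check whether the result is `None` before parsing.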

## Conclusion

By setting up a basic web scraper in Python, you can automate the process of gathering data from websites, making it easier to acquire the information you need for your AI/ML projects. 

You've just learned the fundamentals of web scraping, from sending HTTP requests and parsing HTML to extracting and saving data. With this foundation, you can expand your scraper to handle more complex scenarios and integrate the data into your ML models.

Continue practicing with different websites and data structures, and always remember to scrape responsibly and ethically.