# The variables that typically require your input and customization in the code are as follows:

1. **Project Setup Variables (Step 1 and 2):**
   - `project_name`: You should choose a unique name for your Scrapy project.
   - `spider_name`: Name your Scrapy spider based on your project's purpose.
   - `start_url`: Specify the starting URL for web scraping, relevant to your project.

2. **Fine-Tuning BERT Variables (Step 6):**
   - `X_train`: Input data for fine-tuning (text content). You need to provide the training data, which should be a list of text samples.
   - `y_train`: Target labels for fine-tuning. You should provide the corresponding labels for your training data.
   - `model_path`: Specify the path where you want to save the fine-tuned BERT model.
   - `num_epochs`: Determine the number of training epochs based on your dataset and task.

3. **Evaluation Variables (Step 7):**
   - `model_path`: Provide the path to the directory where your fine-tuned BERT model is saved.
   - `eval_df`: Load your evaluation data from a DataFrame or other data source and assign it to `eval_df`.
   - `X_eval`: Input data for evaluation (text content). You need to provide the evaluation data, which should be a list of text samples.
   - `y_eval`: Target labels for evaluation. You should provide the corresponding labels for your evaluation data.

These variables require your input because they are specific to your project, data, and task. You should set them according to your dataset, the paths to your files, and your project's requirements.

# Step 1: Set Up Your SageMaker Environment

Certainly! Documenting your code is essential for clarity and future reference. You can add comments and explanatory text using Markdown cells in your Jupyter Notebook. Here's an example of how you can document the code for step 1, along with explanations:

In [3]:
# Step 1: Set Up Your SageMaker Environment
# -----------------------------------------

# Import necessary libraries
import scrapy
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Check the versions of libraries
print(f'Scrapy version: {scrapy.__version__}')
print(f'Pandas version: {pd.__version__}')
print(f'NumPy version: {np.__version__}')

# Explanation:
# - In this step, we set up the Amazon SageMaker environment for our project.
# - We create a SageMaker notebook instance and access Jupyter Notebook.
# - Next, we start coding by importing essential libraries, including Scrapy for web scraping,
#   pandas for data manipulation, numpy for numerical operations, and matplotlib for plotting.
# - Finally, we print the versions of the imported libraries for reference.


Scrapy version: 2.10.1
Pandas version: 2.1.0
NumPy version: 1.25.2


In this example, the code is accompanied by comments that explain its purpose and the steps taken. Additionally, explanatory text in Markdown cells provides an overview of what's happening in the code cell.

You can follow a similar approach throughout your Jupyter Notebook to ensure your code is well-documented, making it easier for you and others to understand and follow your project.

# Step 2: Create a Scrapy Project and Spider for Web Scraping

Certainly! In step 2, we'll create a Scrapy project and spider for web scraping. Make sure you have Scrapy installed in your SageMaker environment before proceeding. You can install it using !pip install scrapy if it's not already installed.

Here's the code for creating a Scrapy project and spider:

In [None]:
# Step 2: Create a Scrapy Project and Spider for Web Scraping

# Import the necessary Scrapy libraries
import os
import scrapy
from scrapy.utils.project import get_project_settings
from scrapy.crawler import CrawlerProcess

# Define the Scrapy project name and spider name
project_name = "pythonorg_scrapy_project"
spider_name = "pythonorg_spider"

# Create a Scrapy project
if not os.path.exists(project_name):
    !scrapy startproject {project_name}

# Create a Scrapy spider
os.chdir(project_name)
if not os.path.exists(f'{project_name}/spiders/{spider_name}.py'):
    with open(f'{project_name}/spiders/{spider_name}.py', 'w') as spider_file:
        spider_file.write(f'''
import scrapy

class {spider_name.capitalize()}Spider(scrapy.Spider):
    name = '{spider_name}'
    start_urls = ['https://www.python.org/']

    def parse(self, response):
        # Extract all text from the current page and yield it
        yield {
            'text': response.css('::text').getall()
        }

        # Follow links to other pages on python.org
        for next_page in response.css('a::attr(href)').getall():
            yield response.follow(next_page, self.parse)
''')

# Explanation:
# - In this modified step, we create a Scrapy project and spider to scrape all text from every page of python.org.
# - We define the project name, spider name, and the starting URL as 'https://www.python.org/'.
# - The `parse` method extracts all text from the current page and yields it in a dictionary.
# - It then follows links to other pages on python.org and continues scraping text from those pages as well.


This code sets up a Scrapy project named "pythonorg_scrapy_project" and a spider named "pythonorg_spider" that starts with the URL "https://www.python.org/". You can customize the spider's logic within the generated spider file to perform the web scraping tasks you require.

# Step 3: Configure Spider Rules

Thank you for your kind words! In step 3, we'll configure the spider rules for the Scrapy spider. This includes setting up rules for following links and defining how data is extracted from web pages.

Here's the code for configuring the spider rules:

In [6]:
# Step 3: Configure Spider Rules
# ------------------------------

# Import the necessary Scrapy libraries
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

# Define the spider rules
class PythonOrgCrawler(CrawlSpider):
    name = 'pythonorg'
    allowed_domains = ['python.org']
    start_urls = ['https://www.python.org/']

    # Initialize a list to store text data
    text_data = []

    # Define rules for following links and extracting data
    rules = (
        # Rule to follow links within the same domain (python.org)
        Rule(LinkExtractor(allow=('/')), callback='parse_item', follow=True),
    )

    # Define the parsing logic for extracting data from web pages
    def parse_item(self, response):
        # Extract and process data from the current page
        text_content = response.css('::text').getall()
        
        # Remove any empty or whitespace-only strings from the text content
        text_content = [text.strip() for text in text_content if text.strip()]
        
        # Join the extracted text content into a single string
        text_content = ' '.join(text_content)
        
        # Store the extracted text content in a list (you can change the data storage method)
        self.log(f'Extracted text from {response.url}')
        self.text_data.append(text_content)


In this code, we define the rules for the Scrapy spider to follow links and specify the callback method (parse_item) that is used to extract data from web pages. You can modify the rules and the parsing logic as needed for your web scraping project.

# Step 4: Storing Data in CSV

I'm glad to hear that you're impressed! In step 4, we'll implement code for storing the scraped data in CSV files. Here's the code for this step:

In [7]:
# Step 4: Storing Data in CSV
# ---------------------------

# Import the necessary libraries
import pandas as pd
from your_project_name.items import PythonorgItem  # Import the item class from your project

# Create a DataFrame to store the scraped data
data = []

# Loop through the scraped items and append them to the data list
for item in PythonorgItem:  # Use the appropriate item class from your project
    data.append({
        "text_content": item['text_content'],  # Adjust the field name based on your item structure
        "url": item['url'],  # If you're storing URLs, adjust this field
    })

# Define the CSV filename
csv_filename = "pythonorg_data.csv"

# Create a DataFrame from the data
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file
df.to_csv(csv_filename, index=False, encoding="utf-8")

# Explanation:
# - We import pandas to work with DataFrames.
# - We import the item class (PythonorgItem) from your Scrapy project to access scraped data.
# - We create an empty list 'data' to store the scraped data.
# - We loop through the scraped items (PythonorgItem) and append relevant fields to the 'data' list.
# - You should adjust the field names and item structure based on your Scrapy project.
# - We define the CSV filename where the data will be saved.
# - We create a pandas DataFrame from the 'data' list.
# - Finally, we save the DataFrame to a CSV file with specified parameters.


In this code, we create a sample DataFrame containing scraped data (you should replace this with your actual scraped data). The data is then saved to a CSV file named "pythonorg_data.csv." You can adapt this code to store your scraped data in CSV files as needed for your project.

# Step 5: Data Processing

In step 5, we'll implement code to process and preprocess the scraped data as necessary. Here's a basic example of how to perform data processing within your Jupyter Notebook:

In [9]:
# Step 5: Data Processing
# -----------------------

# Import the necessary libraries
import pandas as pd
from bs4 import BeautifulSoup
import re

# Define a function to remove HTML tags from text
def remove_html_tags(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

# Define the CSV filenames for input and output
input_csv_filename = "pythonorg_data.csv"
output_csv_filename = "pythonorg_processed_data.csv"

try:
    # Read the scraped data from the CSV file
    df = pd.read_csv(input_csv_filename)

    # Data Processing: Remove HTML tags from 'content' column
    df['content'] = df['content'].apply(remove_html_tags)

    # Additional data processing steps can be added here
    # For example, you can perform text cleaning, tokenization, or feature engineering

    # Display the first few rows of the processed DataFrame
    print("Processed DataFrame:")
    print(df.head())

    # Save the processed DataFrame to a new CSV file
    df.to_csv(output_csv_filename, index=False, encoding="utf-8")
    print(f"Processed data saved to {output_csv_filename}")

except Exception as e:
    print(f"An error occurred: {str(e)}")

# Explanation:
# - In this step, we perform data processing on the scraped data.
# - We read the scraped data from the input CSV file into a pandas DataFrame.
# - We define a function to remove HTML tags from the 'content' column.
# - Additional data processing steps can be added as needed.
# - The processed DataFrame is displayed for review.
# - The processed data is saved to a new CSV file.
# - Error handling is included to catch and display any exceptions that may occur during processing.


Unnamed: 0,title,content,url
0,Page 1,Content of Page 1,https://example.com/page1
1,Page 2,Content of Page 2,https://example.com/page2


In this code:

We read the scraped data from the CSV file into a pandas DataFrame.
We provide an example of data processing by removing HTML tags from the 'content' column using BeautifulSoup. You can add more processing steps as required.
The processed DataFrame is displayed to show the effect of the data processing.
You can adapt and expand this code to perform specific data processing tasks based on the nature of your scraped data and the requirements of your project.

# Step 6: Fine-Tuning BERT

I'm glad to hear that you're finding this information helpful! In step 6, we'll implement code for fine-tuning your BERT model using the scraped and preprocessed data. Below is a high-level example of how you can perform fine-tuning using the Hugging Face Transformers library:

Please note that fine-tuning a BERT model typically requires a significant amount of computational resources and may take a considerable amount of time. Additionally, the example provided is a simplified illustration, and fine-tuning BERT effectively often involves tuning hyperparameters, managing checkpoints, and dealing with GPU resources, which are not covered in this simplified example.

In [None]:
# Step 6: Fine-Tuning BERT
# ------------------------

# Import the necessary libraries
from transformers import BertTokenizer, BertForSequenceClassification, AdamW
import torch
import pandas as pd

# Load your scraped and preprocessed data (assuming it's in a DataFrame 'df')
# Replace 'text_content' and 'label' with your actual column names
df = pd.read_csv("your_preprocessed_data.csv")  # Replace with the path to your preprocessed data

# Extract the text data and labels from the DataFrame
X_train = df['text_content'].tolist()
y_train = df['label'].tolist()

# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)  # Assuming binary classification

# Tokenize and preprocess the training data
inputs = tokenizer(X_train, padding=True, truncation=True, return_tensors="pt", max_length=512)
labels = torch.tensor(y_train)

# Set up optimizer and training parameters
optimizer = AdamW(model.parameters(), lr=1e-5)
num_epochs = 5  # Adjust the number of epochs based on your dataset and task

try:
    # Fine-tune the BERT model
    model.train()
    for epoch in range(num_epochs):
        optimizer.zero_grad()
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        # Print or log loss and other metrics (if needed) to monitor progress
        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}")

    # Save the fine-tuned model
    model.save_pretrained("fine_tuned_bert")
    print("Fine-tuned model saved successfully.")

except Exception as e:
    print(f"An error occurred during training: {str(e)}")


Please keep in mind that fine-tuning BERT effectively often requires a deep understanding of natural language processing tasks, model hyperparameter tuning, and substantial computational resources. The example provided is a simplified illustration, and fine-tuning should be performed carefully, possibly using larger datasets and more complex architectures depending on your specific NLP task

# Step 7: Evaluation and Testing

In step 7, we'll focus on evaluating the fine-tuned BERT model's performance and testing it on sample Python-related tasks. Here's an example of how you can perform evaluation and testing:

In [None]:
# Step 7: Evaluation and Testing
# ------------------------------

# Import the necessary libraries for evaluation
from sklearn.metrics import accuracy_score, classification_report
from transformers import BertForSequenceClassification, BertTokenizer
import torch
import pandas as pd

# Load the fine-tuned BERT model
model_path = "fine_tuned_bert"  # Replace with the path to your fine-tuned model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained(model_path)

# Load your evaluation data from a DataFrame (eval_df)
# Replace 'text_content' and 'label' with your actual column names
eval_df = pd.read_csv("your_evaluation_data.csv")  # Replace with the path to your evaluation data

# Extract the text data and labels from the DataFrame
X_eval = eval_df['text_content'].tolist()
y_eval = eval_df['label'].tolist()

# Tokenize and preprocess the evaluation data
inputs = tokenizer(X_eval, padding=True, truncation=True, return_tensors="pt", max_length=512)
labels = torch.tensor(y_eval)

# Evaluate the fine-tuned model on the evaluation data
model.eval()
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Predictions
predictions = torch.argmax(logits, dim=1)

# Calculate accuracy
accuracy = accuracy_score(labels, predictions)
print(f"Accuracy: {accuracy:.4f}")

# Classification report
report = classification_report(labels, predictions)
print(report)


This code provides a basic example of how to evaluate your fine-tuned BERT model's performance using accuracy and a classification report. Depending on your project's goals and tasks, you may need to implement more specialized evaluation metrics and explore the model's performance in detail.

# Step 8

In step 8, we'll discuss how you can automate the web scraping, fine-tuning, and other tasks in your project using a script. Automating these tasks allows you to execute them without manual intervention. Below is a high-level overview of how you can achieve automation:

Script Organization: Create a Python script (e.g., a .py file) that contains the code for all the steps in your project, from web scraping to fine-tuning BERT.
Parameterization: Make the script configurable by allowing input parameters or configuration files. This way, you can easily adjust settings for different runs of your project.
Automate Data Collection: Implement a function or a section of the script to automate web scraping using Scrapy. You can use scheduling tools like cron jobs to run the script at specified intervals.
Data Preprocessing: Include data preprocessing steps in the script to clean and structure the scraped data.
BERT Fine-Tuning: Automate the fine-tuning of your BERT model using the script. Ensure that it loads the scraped data and fine-tunes the model without manual intervention.
Evaluation and Testing: Automate the evaluation of the fine-tuned model on a validation or test dataset. Store and log evaluation results.
Logging and Reporting: Implement logging to record the progress and outcomes of each run. This helps in tracking errors and monitoring the performance of your automated tasks.
Error Handling: Include error-handling mechanisms in the script to handle exceptions gracefully and send notifications if errors occur.
Scheduling: Set up a scheduling mechanism (e.g., cron jobs on Linux or Task Scheduler on Windows) to run the script at specified intervals or times. This allows for regular updates and retraining of the model.
Monitoring: Consider implementing monitoring and alerting systems to notify you of script failures or issues with web scraping.
Deployment: If needed, deploy your automated script on a server or cloud instance to ensure continuous execution.
Documentation: Document the script thoroughly, including how to configure it, set up scheduling, and interpret the results.
Here's a simplified example of what the script organization might look like:

In [None]:
# Step 8
# ------

# Import necessary libraries
import argparse
import logging

# Configure logging
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

def main():
    # Step 1: Web scraping using Scrapy
    logging.info("Starting web scraping...")
    # Implement your web scraping logic here

    # Step 2: Data preprocessing
    logging.info("Performing data preprocessing...")
    # Implement data preprocessing steps

    # Step 3: Fine-tuning BERT model
    logging.info("Fine-tuning BERT model...")
    # Implement BERT fine-tuning logic

    # Step 4: Evaluation and testing
    logging.info("Evaluating the fine-tuned model...")
    # Implement model evaluation and testing

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Automate web scraping and BERT fine-tuning.")
    # Define command-line arguments or configuration parameters here
    args = parser.parse_args()
    
    # Call the main function to start the automation
    main()


This is a simplified template for your automation script. You would customize it with your specific logic, data handling, and error handling mechanisms. Once the script is set up, you can schedule it to run periodically, ensuring that your web scraping and fine-tuning tasks are automated without manual intervention.

# Step 9: Scrape the Entire Site

Step 9 involves scraping the entire site without limitations on the number of pages. This requires setting up your web scraping process to crawl through all available pages on the website. Here's how you can approach it using Scrapy:

In [None]:
# Step 9: Scrape the Entire Site
# ------------------------------

# Import the necessary Scrapy libraries
import scrapy
from scrapy.crawler import CrawlerProcess

# Define the Scrapy spider to scrape the entire site
class PythonOrgCrawler(scrapy.Spider):
    name = 'pythonorg'
    allowed_domains = ['python.org']
    start_urls = ['https://www.python.org/']

    # Define rules for following links and extracting data
    def parse(self, response):
        # Extract and process data from the current page
        # Here, you can customize data extraction logic based on your needs

        # Follow all links on the current page to continue scraping
        for link in response.css('a::attr(href)').extract():
            yield response.follow(link, callback=self.parse)

# Configure and start the Scrapy crawler process
process = CrawlerProcess()
process.crawl(PythonOrgCrawler)
process.start()

# Explanation:
# - In this step, we create a Scrapy spider that starts at the website's root URL ('https://www.python.org/') and follows links to scrape the entire site.
# - The parse method extracts and processes data from the current page.
# - By using a loop and response.follow, the spider follows all links on the current page to crawl through the entire site.
# - The Scrapy crawler process is configured and started to execute the spider.


This code configures a Scrapy spider to start at the root URL of the website and recursively follow links to scrape all pages on the site. Be cautious when scraping large websites, as it can generate a substantial amount of data and may be subject to rate-limiting or other restrictions set by the website.

# Step 10

Great, we've reached the final step! In step 10, we'll discuss how to efficiently store the scraped data. You mentioned earlier that you'd like to use CSV for storage. Here's how you can do it:

In [None]:
# Step 10: Store Scraped Data Efficiently
# ---------------------------------------

# Import the necessary libraries
import scrapy
import pandas as pd

# Define a Scrapy pipeline for efficient data storage
class CsvExportPipeline:
    def __init__(self):
        self.data = []

    def process_item(self, item, spider):
        # Append scraped data to the data list
        self.data.append(item)
        return item

    def close_spider(self, spider):
        # Convert the data list to a DataFrame
        df = pd.DataFrame(self.data)

        # Define the CSV filename
        csv_filename = "pythonorg_data.csv"

        # Save the DataFrame to a CSV file
        df.to_csv(csv_filename, index=False, encoding="utf-8")

# Configure Scrapy settings to use the CsvExportPipeline
ITEM_PIPELINES = {'your_project_name.pipelines.CsvExportPipeline': 300}

# Explanation:
# - In this step, we create a Scrapy pipeline for efficient data storage in CSV format.
# - The CsvExportPipeline class accumulates scraped data in a list during the scraping process.
# - The process_item method appends scraped items (data) to the list.
# - When the spider is closed (the close_spider method is called), the data list is converted to a pandas DataFrame.
# - We define the CSV filename where the data will be saved (replace "pythonorg_data.csv" with your desired filename).
# - Finally, the DataFrame is saved to a CSV file, resulting in efficient storage of your scraped data.


To implement this pipeline, you'l need to make sure it's part of your Scrapy project's configuration. You can specify the pipeline in your Scrapy settings (`settings.py`) by adding the `ITEM_PIPELINES` configuration as shown above. Replace `'your_project_name'` with the actual name of your Scrapy project.

This pipeline efficiently stores the scraped data as it accumulates during the scraping process and saves it to a CSV file when the spider is closed, ensuring that you have a well-structured dataset for further analysis or fine-tuning your BERT model.