# Guide for Scraping Web Content and Storing in DeepLake

This guide will walk you through the process of scraping web content and storing it in a DeepLake vector store using the Python script provided. We'll scrape Hugging Face's documentation pages, process the scraped data, and then load it into a DeepLake database.

## Import Libraries

First, we'll need to import the required libraries.

In [None]:
import os
import requests
from bs4 import BeautifulSoup
from dotenv import load_dotenv
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import DeepLake
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import TextLoader
import re

## Load Environment Variables
Next, we'll load the environment variables from the .env file and get the dataset path.

In [None]:
load_dotenv()
dataset_path = os.environ.get('DEEPLAKE_DATASET_PATH')
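
Because `os.environ.get` returns `None` when a variable is missing, a misspelled or absent entry in the .env file would go unnoticed at this point. The following is a minimal sketch of an early check; the helper name `require_env` is hypothetical and not part of the original script.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    # Hypothetical helper for illustration; not part of the original script.
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; check your .env file")
    return value
```

With this in place, `dataset_path = require_env('DEEPLAKE_DATASET_PATH')` could replace the plain lookup after `load_dotenv()` has run.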

## Instantiate OpenAIEmbeddings
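We create an instance of OpenAIEmbeddings(), which we'll use to transform our text data into embeddings. By default it reads the OpenAI API key from the OPENAI_API_KEY environment variable, so that key should be available in the environment (for example, via the .env file).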

In [None]:
embeddings = OpenAIEmbeddings()

## Define URL List
We define a function get_documentation_urls() which will return a list of relative URLs we intend to scrape. Modify this list as needed.

In [None]:
def get_documentation_urls():
    return ['/docs/huggingface_hub/guides/overview', '/docs/huggingface_hub/guides/download']  # list of URLs, can include more if needed

## Construct Full URLs
We construct the full URL by appending the relative URL to the base URL using the construct_full_url() function.

In [None]:
def construct_full_url(base_url, relative_url):
    # Construct the full URL by appending the relative URL to the base URL
    return base_url + relative_url
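
Plain concatenation works here because the base URL has no trailing slash and every relative URL starts with one. If the inputs might not follow that convention, the standard library's `urllib.parse.urljoin` is one alternative. The sketch below is a possible variant (the name `construct_full_url_safe` is hypothetical), not the script's own approach.

```python
from urllib.parse import urljoin


def construct_full_url_safe(base_url, relative_url):
    # Hypothetical variant: urljoin resolves the path against the base URL,
    # so a trailing slash on the base does not produce a double slash.
    return urljoin(base_url, relative_url)


print(construct_full_url_safe('https://huggingface.co/', '/docs/huggingface_hub/guides/overview'))
# prints: https://huggingface.co/docs/huggingface_hub/guides/overview
```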

## Scrape Page Content
The scrape_page_content() function sends a GET request to the URL, parses the HTML response, and then extracts and cleans the page's body text.

In [None]:
def scrape_page_content(url):
    # Send a GET request to the URL and fail early on HTTP errors (4xx/5xx)
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # Parse the HTML response using BeautifulSoup
    soup = BeautifulSoup(response.text, 'html.parser')
    # Extract the desired content from the page (in this case, the body text)
    text = soup.body.text.strip()
    # Remove control characters and characters in the \x7f-\xff range
    # (other non-ASCII characters are left in place)
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    # Collapse runs of whitespace, including newlines, into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text.strip()
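
To see what the two substitutions do without making a network request, they can be tried on a plain string. The sketch below isolates them in a hypothetical helper named `clean_text`; the sample string is made up for illustration.

```python
import re


def clean_text(text):
    # Same two substitutions as in scrape_page_content(), isolated so they
    # can be tried on a plain string without any network access.
    text = re.sub(r'[\x00-\x08\x0b-\x0c\x0e-\x1f\x7f-\xff]', '', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


# A NUL byte, an accented character in the \x7f-\xff range, and mixed whitespace
sample = "  Hugging\x00 Face\n\n   Hub\tguide \xe9 "
print(clean_text(sample))
# prints: Hugging Face Hub guide
```

Note that the character class only covers code points up to `\xff`, so characters outside that range pass through unchanged.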

## Write Scraped Content to File
We scrape the content from all URLs and write it to a file, one page per line, using the scrape_all_content() function.

In [None]:
def scrape_all_content(base_url, relative_urls, filename):
    # Loop through the list of URLs, scrape content, and add it to the content list
    content = []
    for relative_url in relative_urls:
        full_url = construct_full_url(base_url, relative_url)
        scraped_content = scrape_page_content(full_url)
        content.append(scraped_content.rstrip('\n'))

    # Write the scraped content to a file
    with open(filename, 'w', encoding='utf-8') as file:
        for item in content:
            file.write("%s\n" % item)
    
    return content
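
When the URL list grows, pausing between requests keeps the load on the server low. The sketch below shows one way to do that. The helper `scrape_politely` and its parameters are hypothetical; `fetch` stands for any callable that takes a URL and returns text, such as scrape_page_content().

```python
import time


def scrape_politely(urls, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    # Hypothetical helper for illustration; not part of the original script.
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results.append(fetch(url))
    return results
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` is enough.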

## Load Documents from a File
Next, we define a function load_docs() to load the scraped content from the file.

In [None]:
def load_docs(root_dir,filename):
    # Create an empty list to hold the documents
    docs = []
    try:
        # Load the file using the TextLoader class and UTF-8 encoding
        loader = TextLoader(os.path.join(
            root_dir, filename), encoding='utf-8')
        # Split the loaded file into separate documents and add them to the list of documents
        docs.extend(loader.load_and_split())
    except Exception as e:
        # If an error occurs during loading, report it and fall through to
        # return an empty list of documents
        print(f"Could not load {filename}: {e}")
    # Return the list of documents
    return docs

## Split Documents into Chunks
We then split the loaded documents into smaller chunks using the split_docs() function.

In [None]:
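def split_docs(docs):
    # The scraped text has all whitespace collapsed to single spaces, so split on
    # spaces; the default separator (a blank line) would never occur in it
    text_splitter = CharacterTextSplitter(separator=' ', chunk_size=1000, chunk_overlap=0)
    return text_splitter.split_documents(docs)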
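
To make the idea of chunking concrete, here is a minimal, library-free sketch that greedily packs space-separated words into chunks up to a size limit. It is an illustration of the general technique under simple assumptions, not the CharacterTextSplitter implementation, and the name `split_on_spaces` is hypothetical.

```python
def split_on_spaces(text, chunk_size=1000):
    """Greedily pack words into chunks of at most chunk_size characters.

    A single word longer than chunk_size is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= chunk_size or not current:
            # Either the word fits, or it is an over-long word starting a chunk
            current = candidate
        else:
            # The word does not fit: close the current chunk and start a new one
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_on_spaces("the quick brown fox jumps", chunk_size=10))
# prints: ['the quick', 'brown fox', 'jumps']
```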

## Load Vectors into DeepLake
The load_vectors_into_deeplake() function loads these chunks into a DeepLake database.

In [None]:
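def load_vectors_into_deeplake(dataset_path, source_chunks):
    # Initialize the DeepLake database with the dataset path and embedding function
    deeplake_db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
    # Add the chunks to the database. They are Document objects (as returned by
    # split_docs), so use add_documents rather than add_texts, which expects strings
    deeplake_db.add_documents(source_chunks)
    # Return the database handle so it can be queried later
    return deeplake_db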

## Main Function
The main script below runs the entire process by calling the functions we've defined in the correct order.

In [None]:
base_url = 'https://huggingface.co'
# Set the name of the file to which the scraped content will be saved
filename='content.txt'
# Set the root directory where the content file will be saved
root_dir ='./'
relative_urls = get_documentation_urls()
# Scrape all the content from the relative urls and save it to the content file
content = scrape_all_content(base_url, relative_urls,filename)
# Load the content from the file
docs = load_docs(root_dir,filename)
# Split the loaded documents into smaller chunks
texts = split_docs(docs)
# Create a DeepLake database with the given dataset path and embedding function
db = DeepLake(dataset_path=dataset_path, embedding_function=embeddings)
# Add the individual documents to the database
db.add_documents(texts)
# Clean up by deleting the content file
os.remove(filename)

## Conclusion
In this guide, we walked through a script that scrapes web content, specifically Hugging Face's documentation pages, and stores it in a DeepLake vector store. The script combines web scraping, text cleaning, document loading, text splitting, and embedding generation, which covers the steps needed to extract and prepare data for further analysis or for use in machine learning tasks.

Using this script as a base, you can adapt it to your needs, for example by scraping different websites or storing the data in a different kind of database. Understanding each step is what makes it possible to modify and improve the script.

Remember to respect the terms and conditions of the website you are scraping and avoid overloading servers with excessive requests. Happy scraping!