👋 Hello! Welcome to this amazing notebook! 🎉

This notebook uses the Scrapy library to build a web crawler that extracts data from web pages for later analysis. 😄

The purpose of this notebook is to demonstrate how to use Scrapy to crawl the LangChain Python documentation site (https://python.langchain.com/en/latest/index.html) and extract all the URLs found on its pages. 🕷️🌐

The notebook begins by installing Scrapy with the `!pip install scrapy` command. It then defines a custom spider named `WebCrawler`, which starts crawling from the given URL, records every link it encounters, and follows each link in turn. The extracted URLs are written to a CSV file named `url_list.csv` through Scrapy's `FEEDS` setting; the `ExtractUrls` pipeline passes each item through unchanged.

Once the crawl is complete, the notebook reads the CSV file into a Pandas DataFrame (`df`). It removes duplicate rows and filters the DataFrame to keep only URLs containing `python.langchain.com/en/latest/`, that is, pages of the LangChain Python documentation. The resulting DataFrame is displayed so the extracted URLs can be inspected.

Next, the notebook saves the filtered list of URLs in a text file named `urls_complete.txt`. It also selects the URLs that contain the term "started" and saves them in a second text file named `urls_get_started.txt`.

That's it! This notebook provides you with a simple and efficient way to crawl a website, extract URLs, and perform further analysis on the extracted data. Happy crawling! 🚀🔍


In [None]:
!pip install scrapy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting scrapy
  Downloading Scrapy-2.9.0-py2.py3-none-any.whl (277 kB)
Collecting Twisted>=18.9.0 (from scrapy)
  Downloading Twisted-22.10.0-py3-none-any.whl (3.1 MB)
Collecting cssselect>=0.9.1 (from scrapy)
  Downloading cssselect-1.2.0-py2.py3-none-any.whl (18 kB)
Collecting itemloaders>=1.0.1 (from scrapy)
  Downloading itemloaders-1.1.0-py3-none-any.whl (11 kB)
Collecting parsel>=1.5.0 (from scrapy)
  Downloading parsel-1.8.1-py2.py3-none-any.whl (17 kB)
Collecting pyOpenSSL>=21.0.0 (from scrapy)
  Downloading pyOpenSSL-23.1.1-py3-none-any.whl (57 kB)

In [None]:
import scrapy
from scrapy.crawler import CrawlerProcess

class WebCrawler(scrapy.Spider):
    name = "WebCrawler"
    start_urls = ['https://python.langchain.com/en/latest/index.html']
    custom_settings = {
        'ITEM_PIPELINES': {
            '__main__.ExtractUrls': 1
        },
        'FEEDS': {
            'url_list.csv': {
                'format': 'csv',
                'overwrite': True
            }
        }
    }

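    def parse(self, response):
        """Yield every link on the page as an item, then follow it."""
        for href in response.css('a::attr(href)').getall():
            # Resolve relative links against the current page URL
            url = response.urljoin(href)
            yield {'url': url}
            # Follows every link, including off-site ones; Scrapy's
            # default duplicate filter skips requests already seen
            yield response.follow(url, self.parse)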


class ExtractUrls:
    """Pass-through pipeline; the CSV itself is written by the FEEDS setting."""
    def process_item(self, item, spider):
        return item

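# Run the crawl; start() blocks until it finishes. The Twisted reactor
# cannot be restarted, so re-running this cell needs a fresh kernel.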
process = CrawlerProcess()
process.crawl(WebCrawler)
process.start()
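
The results further down include URLs on `aws.amazon.com` and `pages.awscloud.com`, which suggests the spider also followed links that leave the documentation site. One way to keep a crawl on a single site is to check each link's host before following it. The sketch below uses only the standard library; `is_same_site`, `ALLOWED_HOST`, and the sample links are hypothetical names chosen for illustration, not part of Scrapy or of this notebook. Scrapy also offers the spider attribute `allowed_domains`, which serves a similar purpose.

```python
from urllib.parse import urljoin, urlparse

# Assumption: the only host we want the crawler to stay on.
ALLOWED_HOST = "python.langchain.com"


def is_same_site(url, allowed_host=ALLOWED_HOST):
    """Return True if url is an http(s) URL whose host equals allowed_host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and parts.hostname == allowed_host


base = "https://python.langchain.com/en/latest/index.html"
# Hypothetical hrefs: one relative, one off-site, one non-HTTP.
hrefs = [
    "example/page.html",
    "https://aws.amazon.com/marketplace",
    "mailto:someone@example.com",
]

# Resolve each href against the page URL, then keep only on-site links.
links = [urljoin(base, href) for href in hrefs]
on_site = [url for url in links if is_same_site(url)]
print(on_site)
```

Inside `parse`, a check like this could guard the `response.follow` call so that off-site URLs are still recorded as items but never requested.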


In [1]:
import pandas as pd

In [2]:
# Same relative path the crawler wrote to (on Colab this resolves to /content/url_list.csv)
df = pd.read_csv("url_list.csv")

In [3]:
df

Unnamed: 0,url
0,https://python.langchain.com/en/latest/index.h...
1,https://python.langchain.com/en/latest/index.html
2,https://python.langchain.com/en/latest/getting...
3,https://python.langchain.com/en/latest/getting...
4,https://python.langchain.com/en/latest/getting...
...,...
47431,https://pages.awscloud.com/awsmp-contact-us.ht...
47432,https://aws.amazon.com/marketplace/b/27afd715-...
47433,https://aws.amazon.com/marketplace/b/cb276aab-...
47434,https://aws.amazon.com/marketplace/b/3141913d-...


In [4]:
df.drop_duplicates(inplace=True)

In [5]:
df

Unnamed: 0,url
0,https://python.langchain.com/en/latest/index.h...
1,https://python.langchain.com/en/latest/index.html
2,https://python.langchain.com/en/latest/getting...
3,https://python.langchain.com/en/latest/getting...
4,https://python.langchain.com/en/latest/getting...
...,...
47424,https://docs.aws.amazon.com/iam/index.html
47425,https://docs.aws.amazon.com/IAM/latest/UserGui...
47427,https://pages.awscloud.com/awsmp-contact-us.ht...
47431,https://pages.awscloud.com/awsmp-contact-us.ht...
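
The index labels in the output above skip values (for example 47425 is followed by 47427) because `drop_duplicates` keeps the original row labels of the rows that survive. The following sketch shows this on toy data and, as one option, renumbers the rows with `reset_index(drop=True)`. The `example.com` URLs are placeholders, not crawl results.

```python
import pandas as pd

# Toy data standing in for the crawl output; these URLs are placeholders.
toy = pd.DataFrame({"url": [
    "https://example.com/a",
    "https://example.com/b",
    "https://example.com/a",  # duplicate of row 0
    "https://example.com/c",
]})

# The first occurrence of each URL is kept, along with its original label.
deduped = toy.drop_duplicates(subset="url")
print(deduped.index.tolist())     # [0, 1, 3]

# Optionally renumber the rows so the index is contiguous again.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Renumbering is cosmetic here; the later cells select rows by the `url` column, so they work with either index.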


In [7]:
# regex=False matches the text literally, so '.' is not treated as a wildcard
df = df[df['url'].str.contains('python.langchain.com/en/latest/', regex=False)]
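
A note on this kind of filter: `Series.str.contains` interprets its pattern as a regular expression unless `regex=False` is passed, and in a regular expression each `.` matches any character. For this dataset the difference is unlikely to change the result, but the sketch below shows the distinction using only the standard `re` module. The lookalike URL is made up for the demonstration.

```python
import re

needle = "python.langchain.com/en/latest/"
# Hypothetical URL in which the first '.' of the host is replaced by 'X'.
lookalike = "https://pythonXlangchain.com/en/latest/page.html"

# As a regular expression, '.' matches the 'X', so the lookalike is accepted.
regex_hit = re.search(needle, lookalike) is not None

# A plain substring test rejects it.
literal_hit = needle in lookalike

# Escaping the pattern makes the regex behave like the literal test.
escaped_hit = re.search(re.escape(needle), lookalike) is not None

print(regex_hit, literal_hit, escaped_hit)  # True False False
```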

In [None]:
df

In [14]:
urls = list(df['url'])

In [15]:
with open('urls_complete.txt', 'w') as file:
    for url in urls:
        file.write(url + '\n')


In [18]:
df_get_started = df[df.url.str.contains('started')]

In [19]:
get_started_urls = list(df_get_started['url'])

In [26]:
with open('urls_get_started.txt', 'w') as file:
    for url in get_started_urls:
        file.write(url + '\n')
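
As a quick sanity check, the saved files can be read back and compared with the lists in memory. The sketch below wraps the write loop used above in a helper and adds a matching reader. `save_urls`, `load_urls`, the sample URLs, and the temporary file name are hypothetical and used only for illustration.

```python
import os
import tempfile


def save_urls(path, urls):
    """Write one URL per line, mirroring the cells above."""
    with open(path, "w", encoding="utf-8") as file:
        for url in urls:
            file.write(url + "\n")


def load_urls(path):
    """Read the file back into a list, skipping blank lines."""
    with open(path, encoding="utf-8") as file:
        return [line.strip() for line in file if line.strip()]


# Placeholder URLs; in the notebook these would be `urls` or `get_started_urls`.
sample = ["https://example.com/a", "https://example.com/b"]

# Write to a temporary directory so the check leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "urls_sample.txt")
    save_urls(path, sample)
    restored = load_urls(path)

print(restored == sample)  # True
```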
