# Scrapy from Jupyter Notebook

**Why Jupyter Notebook?**

Jupyter Notebook lets you:
- Write, run, and visualize Python code in an interactive environment.
- Explore and analyze data, and showcase the results.
- Combine code, explanations, and visualizations in one place.
- Share your work and collaborate with others.


**How to Install Jupyter Notebook (via Anaconda) and Scrapy**

1. **Install Jupyter Notebook using Anaconda:**
   - Download and install Anaconda from the official website (https://www.anaconda.com/products/individual).
   - Follow the installation instructions for your operating system.
   - Open the Anaconda Navigator application after installation.

2. **Launch Jupyter Notebook:**
   - In Anaconda Navigator, click on "Launch" under the Jupyter Notebook icon.
   - A new tab will open in your web browser, showing the Jupyter Notebook dashboard.

3. **Create a New Jupyter Notebook:**
   - Click on "New" and select "Python 3" to create a new Jupyter Notebook.

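4. **Install Scrapy and Crochet via pip:**
   - In a cell of the newly created notebook, type `%pip install scrapy crochet` and press Shift + Enter. The `%pip` magic installs into the environment of the running kernel; `!pip install scrapy crochet` also works when the `pip` on your PATH belongs to that same environment.
   - Scrapy will be installed in your Jupyter environment, together with Crochet, which the code below imports to run Scrapy from inside the notebook.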

Now you have Jupyter Notebook installed and can use Scrapy to scrape websites directly from your Jupyter environment. Happy coding!
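
Before moving on, it can help to confirm that the packages are importable from the kernel the notebook is actually using. The sketch below relies only on the standard library; `missing_packages` is a hypothetical helper name chosen for illustration, not part of Scrapy or Jupyter.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# The notebook below imports both scrapy and crochet
missing = missing_packages(["scrapy", "crochet"])
if missing:
    print("Not installed in this kernel:", ", ".join(missing))
else:
    print("scrapy and crochet are importable")
```

If a package is reported as missing even though it was installed, the install most likely went into a different environment than the one the kernel runs in.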


## Import necessary modules

In [None]:
# Install the libraries if you are using them for the first time
#%pip install scrapy crochet

In [1]:
# Import necessary modules
import scrapy
from scrapy.crawler import CrawlerRunner
import re
from crochet import setup, wait_for

# Set up Crochet so that Scrapy's Twisted reactor runs in a background thread,
# which lets the spider be started (and re-run) from notebook cells
setup()

In [14]:
# A Scrapy spider is a program that crawls websites, extracts data, and saves it, acting like a web-scraping robot.
# Define the Spider to scrape quotes from a webpage and save them to a CSV file
class QuotesToCsv(scrapy.Spider):
    name = "MJKQuotesToCsv"  # Spider name
    start_urls = [
        'https://en.wikiquote.org/wiki/Maynard_James_Keenan',  # URL to start scraping
    ]
    custom_settings = {
        'ITEM_PIPELINES': {
            # Send every item through the ExtractFirstLine pipeline, which keeps
            # the first line of each scraped quote and removes its HTML tags.
            '__main__.ExtractFirstLine': 1
        },
        'FEEDS': {
            'Output/Quotes.csv': {
                'format': 'csv',  # Save the scraped data in CSV format
                'overwrite': True  # Overwrite the file if it already exists
            }
        }
    }

    def parse(self, response):
        # Parse data from the URLs
        for quote in response.css('div.mw-parser-output > ul > li'):
            yield {'quote': quote.extract()}  # Collect and output the extracted quote


# Pipeline to extract the first line from the scraped quote and remove HTML tags
class ExtractFirstLine:
    # Compile the tag pattern once instead of on every item
    HTML_TAG = re.compile(r'<.*?>')

    def process_item(self, item, spider):
        lines = dict(item)["quote"].splitlines()  # Split the quote into lines
        # Guard against an empty quote, for which splitlines() returns []
        first_line = self._remove_html_tags(lines[0]) if lines else ''
        return {'quote': first_line}  # Return the quote with only the first line

    def _remove_html_tags(self, text):
        # Remove HTML tags from the given text using a non-greedy regex
        return self.HTML_TAG.sub('', text)


# Function to run the spider with CrawlerRunner
@wait_for(10)  # Block for up to 10 seconds; Crochet raises a TimeoutError if the crawl takes longer
def run_spider():
    runner = CrawlerRunner()
    return runner.crawl(QuotesToCsv)  # Deferred that fires when the crawl finishes

# Run the spider
run_spider()
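
To see what the `ExtractFirstLine` pipeline does to a single item without running a crawl, the same two steps (take the first line, then strip tags with the `<.*?>` regex) can be applied to a hand-written fragment. This is only a sketch: the sample HTML is made up for illustration and is not taken from the Wikiquote page, and `first_line_text` is a hypothetical helper name.

```python
import re

HTML_TAG = re.compile(r"<.*?>")  # same non-greedy pattern the pipeline uses


def first_line_text(html):
    """Keep the first line of an HTML fragment and drop its tags."""
    lines = html.splitlines()
    return HTML_TAG.sub("", lines[0]) if lines else ""


# Made-up <li> fragment: a quote on line one, a nested source list after it
sample = "<li>An <b>example</b> quote.\n<ul><li>Example source</li></ul></li>"
print(first_line_text(sample))  # prints: An example quote.
```

A regex is enough for this simple case, but it does not decode entities such as `&amp;` and can misfire on tags that span several lines.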

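
Once the crawl finishes, the feed export should have written `Output/Quotes.csv` with a single `quote` column, as configured under `FEEDS`. A minimal sketch for loading it back with the standard library follows; `read_quotes` is a hypothetical helper name, and the file only exists after a successful run.

```python
import csv
from pathlib import Path


def read_quotes(path):
    """Return the values of the 'quote' column from a CSV feed file."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["quote"] for row in csv.DictReader(f)]


feed = Path("Output/Quotes.csv")  # path taken from the spider's FEEDS setting
if feed.exists():
    for quote in read_quotes(feed)[:5]:  # preview the first five quotes
        print(quote)
else:
    print("No feed file yet; run the spider first")
```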

In [6]:
# Run the spider from a Python file (scrape_webpage.py must be in the notebook's working directory)
!python scrape_webpage.py

2023-07-25 17:28:21 [scrapy.utils.log] INFO: Scrapy 2.8.0 started (bot: scrapybot)
2023-07-25 17:28:21 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.1.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.2.0, Python 3.10.9 | packaged by Anaconda, Inc. | (main, Mar  1 2023, 18:18:15) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1t  7 Feb 2023), cryptography 39.0.1, Platform Windows-10-10.0.19045-SP0
2023-07-25 17:28:21 [scrapy.crawler] INFO: Overridden settings:
{}


See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
  return cls(crawler)

2023-07-25 17:28:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2023-07-25 17:28:21 [scrapy.extensions.telnet] INFO: Telnet Password: e5ac53ba4250460e
2023-07-25 17:28:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 