## Simple parsing with HTMLParser

In this notebook you will practice one of the workflows for using `HTMLParser` effectively. As you already know, `HTMLParser` is a streaming parser: the data arrives in chunks, and each chunk is delimited by markers such as tags.

Having separate methods to inspect tags and to process data can feel a bit complicated; this is one of the caveats of using a streaming parser.
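To make the callback model concrete, here is a minimal sketch that records the order in which `HTMLParser` calls its handler methods. The class name `EventLogger`, its `events` list, and the sample markup are illustrative assumptions, not part of the exercise.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback as an (event, value) tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Called once for every opening tag the parser encounters.
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        # Called once for every closing tag.
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Called for the text found between tags.
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<p>Hello</p>")
logger.close()
for event in logger.events:
    print(event)
```

The tag handlers and the data handler know nothing about each other, so any relationship between them (for example, "this text belongs to that tag") has to be tracked in attributes on the parser instance.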

For this exercise, you will use predefined variables that hold raw HTML ready to be parsed. Instead of requesting the data from the web, the content is already defined and available for processing. The parsing process is the same as when you scrape HTML from a live page.

In [6]:
content = """
<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>1992 World Junior Championships in Athletics – Men's high jump - Wikipedia</title>
"""

In [9]:
from html.parser import HTMLParser

class Parser(HTMLParser):

    def __init__(self):
        super().__init__()
        self.recording = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.recording = True
        else:
            self.recording = False
            
    def handle_data(self, data):
        if self.recording:
            print(f"Found data for tag: {data}")
            

In [10]:
p = Parser()
p.feed(content)

Found data for tag: 1992 World Junior Championships in Athletics – Men's high jump - Wikipedia
Found data for tag: 



Why is `handle_data()` printing twice? The second line appears to contain _empty_ data. Here is one way to find out: update the `handle_data()` method so that it displays the string with the `repr()` built-in function:

```python
    def handle_data(self, data):
        if self.recording:
            print(f"Found data for tag: {repr(data)}")
```

Re-run the cell that defines the class, then re-run the Parser cell to see if you can spot the problem.
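Once you have spotted the problem, one possible fix is sketched below. It assumes the extra output comes from the text that follows the closing `</title>` tag, and it turns recording off in `handle_endtag()`. The class name `TitleParser`, the `chunks` list, and the shortened sample markup are illustrative assumptions.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Sketch: collect only the text between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.recording = False
        self.chunks = []  # hypothetical attribute holding the recorded text

    def handle_starttag(self, tag, attrs):
        # Record only while inside a <title> element.
        self.recording = tag == "title"

    def handle_endtag(self, tag):
        # Stop recording as soon as the title closes, so the newline
        # after </title> is not treated as title data.
        if tag == "title":
            self.recording = False

    def handle_data(self, data):
        if self.recording:
            self.chunks.append(data)


sample = "<head>\n<title>Example page</title>\n</head>\n"
title_parser = TitleParser()
title_parser.feed(sample)
title_parser.close()
print(repr("".join(title_parser.chunks)))
```

Collecting the text in a list instead of printing it keeps the parser reusable, since the caller decides what to do with the result.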

# Scrapy and XPath in Python

python3 -m venv venv

source venv/bin/activate


pip install scrapy


In [12]:
pip install scrapy

scrapy --help


scrapy startproject myfirst_scrapy_projcet_name

dir

cd myfirst_scrapy_projcet_name

scrapy genspider name_of_the_spider_is_first any_domain_name.org


cd myfirst_scrapy_projcet_name/spiders

code .


