In [1]:
pip install minet

Note: you may need to restart the kernel to use updated packages.


# Create an instance of Minet's `Scraper`

To use the **Scraper** class from *Minet*, we have to first create an instance of it in our program. The **Scraper** class can be instantiated with only one argument: a dictionary of instructions (known as a "definition") that tells the scraper how to navigate the HTML tree in order to extract the desired data. The definition's logic is expressed in a special [Domain Specific Language (DSL) created for *Minet*](https://github.com/medialab/minet/blob/master/cookbook/scraping_dsl.md). Minet's DSL is always in some kind of embedded data format that Python can read as a dictionary. This means that the **Scraper** class's definition can be (a) a YAML file, (b) a JSON file, or (c) a Python dictionary. When a file path (i.e. path to a YAML or JSON file) is given as the "definition", *Minet* tries to open and parse the file using the `load_definition()` function in *minet/utils.py*. If the file extension isn't ".json", ".yml", or ".yaml" *Minet* returns a `DefinitionInvalidFormatError`.

## Step 1. Checking the definition
Either having parsed the JSON or YAML file as a Python dictionary, or having been given directly a Python dictionary, the **Scraper** class then checks that the defintion is indeed valid using the `validate()` function from *minet/scrape/analysis.py*. The `validate()` function is extensive and appends to a list of errors any conflicts or errors it detected. If the list of errors is empty, the definition is considered valid and the instance of the **Scraper** class assigns the valid definition to itself.


## Step 2. Analyzing the definition
The instantiated **Scraper** class also analyzes its own definition (using the function `analyse()` from *minet/scrape/analysis*) in order to determine three parameters:

   1. what headers the output should have (`headers`)
   2. whether one node or multiple nodes are targeted (`plural`)
   3. what type of output is requested. (`output_type`)
   
Having analyzed the definition with `analyse()`, the instantiation of the **Scraper** class updates its arguments with the values of the analysis's arguments (`headers`, `plural`, `output_type`).
   
### More detail on the headers

> *Minet* exports scraped results in CSV or JSON format. In both formats, it allows for the user to specify under what customized headers certain data will be presented. If the definition contains the dictionary key called "tabulate," for example, the scraper will confirm that tabulate's value is also a dictionary and then get the value of the key "headers." (The value of "headers" should be/is probably a list, but it doesn't check for that.)

> DSL to declare headers with the "tabulate" dictionary key:
```
        "tabulate": 
                "headers": ["header1", "header2", "header3"]
```

> Another way *Minet's* DSL lets the user assign headers to the output data is with the key "fields." The value of "fields" should a dictionary; each key should be the name of a header, and each value should be a dictionary defining what data to extract. The function `headers_from_definition()` in *minet/scrape/analysis.py* checks that none of the keys in the "fields" dictionary has the value "fields."

> DSL to declare headers with the "fields" dictionary key:
```
        "fields":
            "content": "text"
            "link": "href"
```

## Step 3. Straining
Finally, the instatiation of *Minet's* **Scraper** class checks whether it has been given a "strainer," a clever term from the *BeautifulSoup* library, with which to parse HTML more efficiently. *BeautifulSoup* provides a **SoupStrainer** class that ["allows you to choose which parts of an incoming document are parsed."](https://beautiful-soup-4.readthedocs.io/en/latest/#soupstrainer) Minet instantiates *BeautifulSoup's* **SoupStrainer** class with the variable `strainer_from_css`.

... à mieux comprendre

# Give the scraper some HTML data to scrape
Once instantiated with a valid definition, the scraper is given some HTML to parse. The instantiated Scraper class takes only one argument and it must be valid HTML.

In [63]:
# import the Scraper class 
# (the class is defined in minet/scrape/__init__.py)
from minet import Scraper

def iterate_p(html):
    # define the HTML branches you want to scrape
    # (the definition is validated according to the function validate() in minet/scrape/analysis.py)
    scraper_definition = {
        'iterator':  'p' # scrape all the elements <p> in the HTML file
    }

    # instantiate the class Scraper with your definition
    scraper = Scraper(scraper_definition)

    # give to the instantiated Scraper some HTML 
    data = scraper(html)

The scraper can prase the HTML it's given. To this object, the scraper recursively applies its de logic of its definition.

The `scraper` acts on the HTML object using functions defined in `minet/scrape/interpreter.py`. 

In [49]:
# Create some HTML for the Scraper class 
some_html = """
<html>
  <head>First Scrape</head>
  <body>
    <p>Hello World!</p>
  </body>
</html>
"""

In [57]:
# apply the instantiated/defined Scraper to the HTML
iterate_p(some_html)

# verify the type of data produced by the instantiated Scraper
print(type(data))
# examine at the results
print(data)

<class 'list'>
['First Scrape', 'Hello World!']


In [48]:
# Create some more HTML for the Scraper class
some_other_html = """
<html>
  <head>
    <title>First Scrape</title>
  </head>
  <body>
    <p>Hello World!</p>
  </body>
</html>
"""

In [58]:
# apply the instantiated/defined Scraper to the HTML
iterate_p(some_other_html)

# verify the type of data produced by the instantiated Scraper
print(type(data))
# examine at the results
print(data)

<class 'list'>
['First Scrape', 'Hello World!']
