<a href="https://colab.research.google.com/github/ifeuerstein/SJICWeek5/blob/main/DS_Try_Assembl%C3%A9e_nationalePage1a4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# An example scraper using a list

The code below can be copied and adapted to create your own scraper.

The first part installs all the libraries. I've kept this separate from the other parts so that you don't have to install them every time you want to run the scraper itself.

In [7]:
#install the libraries 
#scraperwiki is a library for scraping webpages
!pip install scraperwiki
import scraperwiki
#lxml.html is used to convert it into xml (more structured)
import lxml.html
#cssselect is used to drill down into that and find data in tags
!pip install cssselect
import cssselect
#the pandas library which is used to work with data - we call it 'pd' here so we have to type less!
import pandas as pd


## Using a list

Below we write some code to create a list of page numbers that can be used to generate URLs on the Assemblée nationale website.

We also store the 'base URL' that we will add to each item in the list to create a full URL.

In [8]:
#create a list of page numbers that we will need to generate URLs
pages = ["1","2","3","4"]
#store the base URL we will add those to
baseurl = "https://www.assemblee-nationale.fr/dyn/15/amendements?dossier_legislatif=DLR5L15N40245&examen=EXANR5L15PO59048B3360P1D1&page="

## Using a loop

Next we loop through each item in the list and add it to that base url using the `+` operator.

We add a `print` function inside the loop to check that it works each time - and copy those links into a browser to check that they are the right links.

In [9]:
#start looping through our list
for i in pages:
  fullurl = baseurl+i
  print(fullurl)

https://www.assemblee-nationale.fr/dyn/15/amendements?dossier_legislatif=DLR5L15N40245&examen=EXANR5L15PO59048B3360P1D1&page=1
https://www.assemblee-nationale.fr/dyn/15/amendements?dossier_legislatif=DLR5L15N40245&examen=EXANR5L15PO59048B3360P1D1&page=2
https://www.assemblee-nationale.fr/dyn/15/amendements?dossier_legislatif=DLR5L15N40245&examen=EXANR5L15PO59048B3360P1D1&page=3
https://www.assemblee-nationale.fr/dyn/15/amendements?dossier_legislatif=DLR5L15N40245&examen=EXANR5L15PO59048B3360P1D1&page=4
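
As an alternative to joining strings with `+`, the part of the address after the question mark can be built from its separate pieces. This is only a sketch and not part of the original scraper: it uses `urlencode` from Python's standard library, and the names `endpoint`, `params` and `urls` are made up for illustration.

```python
from urllib.parse import urlencode

# The part of the address that comes before the question mark
endpoint = "https://www.assemblee-nationale.fr/dyn/15/amendements"

urls = []
for page in ["1", "2", "3", "4"]:
    # Each key/value pair becomes key=value, and the pairs are joined with &
    params = {
        "dossier_legislatif": "DLR5L15N40245",
        "examen": "EXANR5L15PO59048B3360P1D1",
        "page": page,
    }
    urls.append(endpoint + "?" + urlencode(params))

for url in urls:
    print(url)
```

This can be easier to read when a URL has several settings, because each one sits on its own line.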


## Scraping each URL as we loop

Now that we know the loop works in generating the right URLs, we can extend the code inside the loop so that it *scrapes* each URL.

At this point we are using some of the libraries we imported at the start. `scraperwiki.scrape()`, for example, is the `scrape()` function from the `scraperwiki` library. 

Let's look at the code first, and then explain it...

In [10]:
#start looping through our list
for page in pages:
  fullurl = baseurl+page
  print(fullurl)
  #Scrape the html at that url
  html = scraperwiki.scrape(fullurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html) 
  #The article references and the names are all in <p> tags
  #This targets the contents of those html tags
  deputes = root.cssselect('p')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in deputes:
    #each item in the list is called i as it loops
    print(i)
    #on its own it looks odd, but we can attach .text_content() to translate it into text
    depute = i.text_content()
    print(depute)


https://www.assemblee-nationale.fr/dyn/15/amendements?dossier_legislatif=DLR5L15N40245&examen=EXANR5L15PO59048B3360P1D1&page=1
<Element p at 0x7f513354bf50>
Article UNIQUE, alinéa 2
<Element p at 0x7f513354bfb0>
M. Charles de Courson
<Element p at 0x7f5133563050>
Article UNIQUE, alinéa 2
<Element p at 0x7f51335630b0>
Mme Valérie Rabault
<Element p at 0x7f5133563110>
Article UNIQUE, alinéa 2
<Element p at 0x7f5133563170>
Mme Marie-Christine Dalloz
<Element p at 0x7f51335631d0>
Article UNIQUE, alinéa 2
<Element p at 0x7f5133563230>
Mme Marie-Christine Dalloz
<Element p at 0x7f5133563290>
Article UNIQUE, alinéa 2
<Element p at 0x7f51335632f0>
M. Éric Woerth
<Element p at 0x7f5133563350>
Article UNIQUE, alinéa 2
<Element p at 0x7f51335633b0>
M. Éric Woerth
<Element p at 0x7f5133563410>
Article UNIQUE
<Element p at 0x7f5133563470>
Mme Valérie Rabault
<Element p at 0x7f51335634d0>
Article 2, alinéa 7
<Element p at 0x7f5133563530>
M. Jean-Paul Mattei
<Element p at 0x7f5133563590>
A

## The functions we are using

Let's break some of this down.

So `scraperwiki.scrape()` is the `scrape()` function from the `scraperwiki` library. The *ingredient* we give to that function is the URL we stored in the `fullurl` variable.

The `scrape()` function basically fetches the whole webpage at a given address (the ingredient it's given).

The results of running that function are stored in a new variable called `html`.

This isn't in a form we can easily work with, yet, so we need another function to convert it to something we can drill down into. 

That function is the `fromstring()` function from the `lxml.html` library. The *ingredient* we give to that function is the `html` variable we just created.

The results are stored in another new variable, `root`.

This variable is a particular type of object (an "lxml object" if you need to know) that can be drilled down into using the `cssselect` function. That function will grab elements that match the *CSS selectors* that you give it as an ingredient.

In this case we specify `'p'`, which means "any p tag" - so it will grab the contents of any p tags in the page.

Don't worry about memorising any of the code above: this is code that you can re-use time and time again. The only bit you will need to change is the selector, in order to specify the particular HTML you're after. 

To work out the selector you need, you'll often need to Google around, learning as you go, but selectors are pretty easy to get the hang of, and I'll talk about it more below.

## Using CSS selectors

**CSS selectors** are used to target different elements in an HTML page. A basic selector can target just one type of HTML tag, like `<h2>` or `<p>`, but you can also target a combination of tags (such as any `<strong>` tags within `<p>` tags). 

More complicated selectors can also be used to target tags based on their attributes (e.g. not just `<p>` but specifically `<p class="summary">`).

You can find lots of resources to help you with CSS selectors, such as [this one](https://www.w3schools.com/cssref/css_selectors.asp). Many will relate to styling webpages (which is how CSS selectors are most often used - selectors are used to target the HTML elements that you want to style), but the principles are the same.


## Saving the information we've grabbed

Now we've grabbed some information we can extend the code further to save it.

At this point we need to use functions from another library: `pandas`. This is a library for data storage and analysis. When we imported `pandas` we called it `pd` for short. This is quite common. Any reference to `pd` in the code, then, means `pandas`.

First, we use the function `DataFrame()` which creates a pandas dataframe. As ingredients it needs to know the names of any columns.

You will see below that we add a line *before* the loop which uses that to create an empty dataframe to store the data in.

Then, inside the loop, the data we extract is added to the dataframe.

Here's the code first - then I'll explain the new bits after.


In [11]:
#Create a dataframe to store the data we are about to scrape
#It has one column called 'Nom'
#We call this dataframe 'df'
df = pd.DataFrame(columns=["Nom"])

#start looping through our list
for page in pages:
  fullurl = baseurl+page
  print(fullurl)
  #Scrape the html at that url
  html = scraperwiki.scrape(fullurl)
  # turn our HTML into an lxml object
  root = lxml.html.fromstring(html) 
  #The article references and the names are all in <p> tags
  #This targets the contents of those html tags
  deputes = root.cssselect('p')
  #the results are always a list so we have to loop through it using a 'for' loop
  for i in deputes:
    #each item in the list is called i as it loops
    print(i)
    #on its own it looks odd, but we can attach .text_content() to translate it into text
    depute = i.text_content()
    print(depute)
    #Now we need to store it in that variable called 'df' 
    #(df.append only works in pandas versions before 2.0, where it was removed)
    df = df.append({
      "Nom" : depute
      }, ignore_index=True)


https://www.assemblee-nationale.fr/dyn/15/amendements?dossier_legislatif=DLR5L15N40245&examen=EXANR5L15PO59048B3360P1D1&page=1
<Element p at 0x7f513357c290>
Article UNIQUE, alinéa 2
<Element p at 0x7f513357c2f0>
M. Charles de Courson
<Element p at 0x7f513357c350>
Article UNIQUE, alinéa 2
<Element p at 0x7f513357c3b0>
Mme Valérie Rabault
<Element p at 0x7f513357c410>
Article UNIQUE, alinéa 2
<Element p at 0x7f513357c470>
Mme Marie-Christine Dalloz
<Element p at 0x7f513357c4d0>
Article UNIQUE, alinéa 2
<Element p at 0x7f513357c530>
Mme Marie-Christine Dalloz
<Element p at 0x7f513357c590>
Article UNIQUE, alinéa 2
<Element p at 0x7f513357c5f0>
M. Éric Woerth
<Element p at 0x7f513357c650>
Article UNIQUE, alinéa 2
<Element p at 0x7f513357c6b0>
M. Éric Woerth
<Element p at 0x7f513357c710>
Article UNIQUE
<Element p at 0x7f513357c770>
Mme Valérie Rabault
<Element p at 0x7f513357c7d0>
Article 2, alinéa 7
<Element p at 0x7f513357c830>
M. Jean-Paul Mattei
<Element p at 0x7f513357c890>
A

## The new code

The first line of new code is this:

`df = pd.DataFrame(columns=["Nom"])`

We are creating a new variable here, called `df`, and assigning to it the results of using a function: `pd.DataFrame()` (the `pandas` function `DataFrame`).

That takes an ingredient which specifies the columns as being a list (note the square brackets) of one string: `"Nom"`.

The second line of new code is this:

```
df = df.append({
      "Nom" : depute
      }, ignore_index=True)
```

This takes the `df` variable and updates it. 

On the right of the equals sign is `df.append()` - this means it is using a function called `append` to append (add) new data to the `df` variable it's attached to.

The `append` function [can include various ingredients](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html): firstly the data that you want to append to the dataframe; but also settings, such as whether you want something called `ignore_index` to be `True` or `False`. Setting this to `True` tells pandas to renumber the rows as it goes, which is required when the data you append is a dictionary. Note that `append` was deprecated in pandas 1.4 and removed in pandas 2.0, so this line only runs on older versions of pandas.

What about the data that you are appending? Well, this has to be in the form of a **dictionary**. A dictionary is like a list, but with two key differences: firstly that it uses curly brackets instead of square ones: `{}`, and secondly it's a list of *pairs*: a 'key', and a 'value', separated by a colon.

Here's the dictionary in our code:

`{"Nom" : depute}`

The first part, `"Nom"` is the **key**. This matches the column heading in the empty data frame. Note that it's a **string**: a label, basically.

The second part, `depute`, is the **value**. This isn't in quotes so it's not a string - it's a variable. A few lines earlier we created this variable with `depute = i.text_content()`

So having extracted that information and stored it in `depute`, the line of code is storing it in a dataframe with the label (key) "Nom":

```
df = df.append({
      "Nom" : depute
      }, ignore_index=True)
```

We can print the dataframe to see what's in there:


In [12]:
#Once the loop has finished we can take a look at the data
print(df)

                                                   Nom
0                             Article UNIQUE, alinéa 2
1                                M. Charles de Courson
2                             Article UNIQUE, alinéa 2
3                                  Mme Valérie Rabault
4                             Article UNIQUE, alinéa 2
..                                                 ...
163                                     M. Marc Le Fur
164                                  Après l'Article 2
165                                     M. Marc Le Fur
166  \n                                    Inscrive...
167  \n                                    Votre ad...

[168 rows x 1 columns]
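
The `append` function was removed from `pandas` in version 2.0, so on a recent installation `df.append()` will raise an error. One common alternative, sketched here with a short made-up list standing in for the scraped text, is to collect a dictionary for each row in an ordinary list and build the dataframe once at the end. The names `scraped` and `rows` are invented for illustration.

```python
import pandas as pd

# Stand-in for the text scraped from each <p> tag
scraped = ["Article UNIQUE, alinéa 2", "M. Charles de Courson", "Mme Valérie Rabault"]

# Collect one dictionary per row in an ordinary list
rows = []
for depute in scraped:
    rows.append({"Nom": depute})

# Build the dataframe once, after the loop has finished
df = pd.DataFrame(rows, columns=["Nom"])
print(df)
```

Here `rows.append()` is the ordinary list function, not the pandas one, so it works on any version. Building the dataframe once is also usually quicker than growing it row by row.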


## Exporting the data

The `pandas` library has another function for exporting data: `to_csv()`.

It needs to be attached to the name of the dataframe variable with a period, then, in the brackets, you specify the name of the file you want to export it as. Make sure this ends in '.csv' so it can be used in a spreadsheet.

In [14]:
#And we can export it
df.to_csv("DeputesPLFPage1à4.csv")
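
Two optional settings for `to_csv()` can be handy with data like this. The sketch below uses a made-up two-row dataframe and writes to a temporary folder, so the file name and path are assumptions and not the ones used above.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Nom": ["M. Éric Woerth", "Mme Valérie Rabault"]})

# A throwaway folder for the example; "deputes.csv" is a made-up file name
folder = tempfile.mkdtemp()
path = os.path.join(folder, "deputes.csv")

# index=False leaves out the 0, 1, 2... row numbers
# encoding="utf-8-sig" adds a marker that helps Excel recognise accented characters
df.to_csv(path, index=False, encoding="utf-8-sig")

# Read the file back in to check what was written
check = pd.read_csv(path, encoding="utf-8-sig")
print(check)
```

Without `index=False` the exported file has an extra first column containing the row numbers.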

## Downloading the data

Once exported, it should appear in the file explorer in Google Colab on the left-hand side. Click on the folder icon to open this up and you should see the file you just created (there's a refresh button above if you can't).

Hover over the file name to see three dots, then click on those to select **Download** and download to your computer.

## How to adapt it

You can use most of this code without having to change it. All you *need* to change are the lines specifying the base URL, and the list of items to add to it.

And the selector inside this function, which specifies what you want to scrape from that page:

`root.cssselect('p')`

If you're scraping one type of information from one page, that will be enough. 

For the CSS selector you will need to identify the HTML in the page you are scraping, and the combination of tags that is being used. 

Some [reading around CSS selectors](https://www.w3schools.com/cssref/css_selectors.asp) will help you here, but a couple of useful things to know include:

* A period `.` means `class="`
* A hash `#` means `id="`

So `'div.title a'` means `<div class="title"><a ...>` - or, in other words, anything on the page inside an `<a>` tag (a link) within a `<div class="title">` tag.

The words used for variables (like "baseurl" and "depute" above) may not be relevant to what you are scraping - but that doesn't matter, because those words are arbitrary. If you do decide to change them, make sure you change them *throughout* the code, or it will create an error.


## Generating URLs for a scraper to loop through

Alternatively you might *generate* the URLs: for example, if they end in a number that goes up by 1 each time you can use `range` to generate that list of numbers and add them to the URL using `+`.

However, you cannot mix numbers and strings, so you need to convert the numbers to a string as you do this. Here's an example:

In [None]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Combine the two - 
  #this will generate an error because we are trying to combine a string and a number
  fullurl = baseurl+i

TypeError: ignored

## Tip: converting numbers into strings

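The code above produces a `TypeError`. Depending on your version of Python, the message reads `must be str, not int` or `can only concatenate str (not "int") to str` - in other words the second part must be a string, not an integer.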

To fix that you can use the `str()` function, which will convert a number into a string.

In [None]:
#Create the basic URL that appears before the number
baseurl = "http://mypage.com?page="
#Create a list of numbers to put on the end
pagenums = range(1,11)
#Now generate the URLs by looping through the list and adding it to the URL
for i in pagenums:
  #Convert i to a string
  i = str(i)
  #Combine the two
  fullurl = baseurl+i
  #print it
  print(fullurl)

http://mypage.com?page=1
http://mypage.com?page=2
http://mypage.com?page=3
http://mypage.com?page=4
http://mypage.com?page=5
http://mypage.com?page=6
http://mypage.com?page=7
http://mypage.com?page=8
http://mypage.com?page=9
http://mypage.com?page=10
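
A more compact way to write the same thing, offered here as an optional sketch, is to use an f-string inside a list comprehension. The f-string converts each number to text for you, so `str()` is not needed. The variable name `urls` is made up for illustration.

```python
# The basic URL that appears before the number
baseurl = "http://mypage.com?page="

# Build all ten URLs in one line: {n} is converted to text automatically
urls = [f"{baseurl}{n}" for n in range(1, 11)]

for url in urls:
    print(url)
```

Storing the URLs in a list like this also means you can check how many there are with `len(urls)` before you start scraping.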
