# Web Scraping with Beautiful Soup

* * * 

### Icons used in this notebook
🔔 **Question**: A quick question to help you understand what's going on.<br>
🥊 **Challenge**: Interactive exercise. We'll work through these in the workshop!<br>
⚠️ **Warning**: Heads-up about tricky stuff or common mistakes.<br>
💡 **Tip**: How to do something a bit more efficiently or effectively.<br>
🎬 **Demo**: Showing off something more advanced – so you know what Python can be used for!<br>

### Learning Objectives
1. [Reflection: To Scrape Or Not To Scrape](#when)
2. [Extracting and Parsing HTML](#extract)
3. [Scraping the Illinois General Assembly](#scrape)

<a id='when'></a>

# To Scrape Or Not To Scrape

When we'd like to access data from the web, we first have to check whether the website we are interested in offers a Web API. Platforms like Twitter, Reddit, and the New York Times offer APIs. **Check out D-Lab's [Python Web APIs](https://github.com/dlab-berkeley/Python-Web-APIs) workshop if you want to learn how to use APIs.**

However, there are often cases when a Web API does not exist. In these cases, we may have to resort to web scraping, where we extract the underlying HTML from a web page and directly obtain the information we want. There are several packages in Python we can use to accomplish these tasks. We'll focus on two packages: Requests and Beautiful Soup.

Our case study will be scraping information on the [state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) each senator has sponsored. Before we get started, peruse these websites to take a look at their structure.

## Installation

We will use two main packages: [Requests](http://docs.python-requests.org/en/latest/user/quickstart/) and [Beautiful Soup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/). Go ahead and install these packages, if you haven't already:

Install the libraries:

In [None]:
%pip install requests

In [None]:
%pip install beautifulsoup4

We'll also install the `lxml` package, which helps support some of the parsing that Beautiful Soup performs:

In [None]:
%pip install lxml

Import the libraries:

In [2]:
# Import required libraries
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import time

<a id='extract'></a>

# Extracting and Parsing HTML 

In order to successfully scrape and analyze HTML, we'll be going through the following 4 steps:
1. Make a GET request
2. Parse the page with Beautiful Soup
3. Search for HTML elements
4. Get attributes and text of these elements

## Step 1: Make a GET Request to Obtain a Page's HTML

We can use the Requests library to:

1. Make a GET request to the page, and
2. Read in the webpage's HTML code.

The process of making a request and obtaining a result resembles that of the Web API workflow. Now, however, we're making a request directly to the website, and we're going to have to parse the HTML ourselves. This is in contrast to being provided data organized into a more straightforward `JSON` or `XML` output.
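A request can fail or come back with an error page, so it can help to check the response before parsing it. The sketch below wraps the call in a small helper. The name `fetch_html`, its `session` parameter, and the 10-second timeout are illustrative choices and are not part of the original workshop code.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the HTML of `url` as text, raising an error on a bad status code."""
    # `session` is a hypothetical hook: any object with a requests-style .get()
    # method works, which lets the helper be exercised without network access.
    client = session if session is not None else requests
    response = client.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, `src = fetch_html('http://www.ilga.gov/senate/default.asp')` would stand in for the `requests.get(...)` and `.text` pair, and would stop early if the server reported an error.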

In [3]:
# Make a GET request to the URL to retrieve the page
req = requests.get('http://www.ilga.gov/senate/default.asp')
# Read the content of the server's response
src = req.text
# Print the first 1000 characters of the response
print(src[:1000])

<!DOCTYPE html>
<html lang="en">
<head id="Head1">
    <meta name="viewport" content="width=device-width, initial-scale=1.0" />
    <meta http-equiv="content-type" content="text/html;charset=utf-8" />
    <meta http-equiv="X-UA-Compatible" content="IE=Edge" />
    <meta charset="utf-8" />
    <meta charset="UTF-8">
    <!-- Meta Description -->
    <meta name="description" content="Welcome to the official government website of the Illinois General Assembly">
    <meta name="contactName" content="Legislative Information System">
    <meta name="contactOrganization" content="LIS Staff Services">
    <meta name="contactStreetAddress1" content="705 Stratton Office Building">
    <meta name="contactCity" content="Springfield">
    <meta name="contactZipcode" content="62706">
    <meta name="contactNetworkAddress" content="webmaster@ilga.gov">
    <meta name="contactPhoneNumber" content="217-782-3944">
    <meta name="contactFaxNumber" content="217-524-6059">
    <meta name


## Step 2: Parse the Page with Beautiful Soup

Now, we use the `BeautifulSoup` function to parse the response into an HTML tree. This returns an object (called a **soup object**) which contains all of the HTML in the original document.

If you run into an error about a parser library, make sure you've installed the `lxml` package to provide Beautiful Soup with the necessary parsing tools.
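If installing `lxml` is not an option, one possible workaround is to fall back to `html.parser`, which ships with Python. The helper name `make_soup` below is hypothetical. The sketch assumes the parser error surfaces as `bs4.FeatureNotFound`, which is what Beautiful Soup raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse `html` with lxml when available, otherwise with html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser bundled with Python instead
        return BeautifulSoup(html, "html.parser")


demo_soup = make_soup("<p>Hello, soup!</p>")
print(demo_soup.p.text)
```

The two parsers can build slightly different trees for malformed HTML, so results may not be identical on every page.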

In [4]:
# Parse the response into an HTML tree.
# Beautiful Soup is a Python library for parsing and navigating HTML or XML documents.
# `src` must be a string containing HTML; 'lxml' is the parser Beautiful Soup uses to read it.
soup = BeautifulSoup(src, 'lxml')
# Take a look: prettify() renders the tree as indented HTML; we print only the first 1000 characters.
print(soup.prettify()[:1000])

<!DOCTYPE html>
<html lang="en">
 <head id="Head1">
  <meta content="width=device-width, initial-scale=1.0" name="viewport"/>
  <meta content="text/html;charset=utf-8" http-equiv="content-type"/>
  <meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
  <meta charset="utf-8"/>
  <meta charset="utf-8"/>
  <!-- Meta Description -->
  <meta content="Welcome to the official government website of the Illinois General Assembly" name="description"/>
  <meta content="Legislative Information System" name="contactName"/>
  <meta content="LIS Staff Services" name="contactOrganization"/>
  <meta content="705 Stratton Office Building" name="contactStreetAddress1"/>
  <meta content="Springfield" name="contactCity"/>
  <meta content="62706" name="contactZipcode"/>
  <meta content="webmaster@ilga.gov" name="contactNetworkAddress"/>
  <meta content="217-782-3944" name="contactPhoneNumber"/>
  <meta content="217-524-6059" name="contactFaxNumber"/>
  <meta content="State Of Illinois" name="originatorJur

The output looks pretty similar to the above, but now it's organized in a `soup` object which allows us to more easily traverse the page.
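To illustrate what traversing the tree can look like, here is a minimal sketch on a short excerpt modeled on the header printed above. The variable names are illustrative, and `html.parser` is used only so the example runs without `lxml`.

```python
from bs4 import BeautifulSoup

# A small excerpt modeled on the page header printed above
html = """
<html lang="en">
 <head id="Head1">
  <meta charset="utf-8"/>
  <meta content="Springfield" name="contactCity"/>
  <meta content="62706" name="contactZipcode"/>
 </head>
</html>
"""

soup_demo = BeautifulSoup(html, "html.parser")

# Attribute-style access returns the first tag with that name
head = soup_demo.head
print(head["id"])

# find() returns the first tag matching the given attributes.
# `attrs` is needed here because `name` is already the first parameter of find().
city = soup_demo.find("meta", attrs={"name": "contactCity"})
print(city["content"])
```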

## Step 3: Search for HTML Elements

Beautiful Soup has a number of functions to find useful components on a page. Beautiful Soup lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors

Let's search first for **HTML tags**. 

The function `find_all` searches the `soup` tree for all the elements with a particular HTML tag and returns them.

What does the example below do?

In [5]:
# Find every link on the page, i.e. every element with the HTML <a> tag.
a_tags = soup.find_all("a")
print(a_tags[:10])

[<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="af" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-za"></span> Afrikaans
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="sq" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-al"></span> Albanian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="ar" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-ae"></span> Arabic
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="hy" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-am"></span> Armenian
                            </a>, <a b-0yw6sxot5c="" class="dropdown-item" data-lang="az" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-az"></span> Azerbaijani
            

Because `find_all()` is the most popular method in the Beautiful Soup search API, you can use a shortcut for it. If you treat the BeautifulSoup object as though it were a function, then it’s the same as calling `find_all()` on that object. 

These two lines of code are equivalent:

In [6]:
# Find all links on the page with both forms of the call, then print the first match from each
a_tags = soup.find_all("a")
a_tags_alt = soup("a")
print(a_tags[0])
print(a_tags_alt[0])

<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>
<a b-0yw6sxot5c="" class="dropdown-item" data-lang="en" href="#">
<span b-0yw6sxot5c="" class="flag-icon flag-icon-us"></span> English
                            </a>


How many links did we obtain?

In [7]:
# Count how many links (<a> tags) were found on the page (stored in the list a_tags)
print(len(a_tags))

270


That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get many hits, most of which you might not want. Remember, the `a` tag defines a hyperlink, so you'll usually find many on any given page.

What if we wanted to search for HTML tags with certain attributes, such as particular CSS classes? 

We can do this by passing an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class_="sidemenu"`.

In [8]:
# Extract the first 5 links in the page's side menu.
# (The current version of the site no longer uses this class, so the result below is empty.)
side_menus = soup("a", class_="sidemenu")
side_menus[:5]

[]

A more efficient way to search for elements on a website is via a **CSS selector**. For this we have to use a different method called `select()`. Just pass a string into the `.select()` to get all elements with that string as a valid CSS selector.

In the example above, we can use `"a.sidemenu"` as a CSS selector, which returns all `a` tags with class `sidemenu`.
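Since the live page no longer has a `sidemenu` class, the equivalence between the two approaches can be checked on a small inline snippet instead. This is a minimal sketch: the `<ul>` wrapper is made up for illustration, while the links are copied from the outputs in this notebook.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a></li>
  <li><a class="btn" href="/Senate/Members/List">Members</a></li>
  <li><a class="notranslate" href="/Senate/Members/Details/3316">Omar Aquino</a></li>
</ul>
"""
demo = BeautifulSoup(html, "html.parser")

# Filter by tag name plus class with find_all ...
by_argument = demo.find_all("a", class_="notranslate")
# ... or express the same condition as a CSS selector
by_selector = demo.select("a.notranslate")

# Compare the two result lists and show the matched link texts
print(list(by_argument) == list(by_selector))
print([a.text for a in by_selector])
```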

In [9]:
# Get the first 5 side-menu links again, this time using CSS selector syntax
selected = soup.select("a.sidemenu")
selected[:5]

[]

## 🥊 Challenge: Find All

Use BeautifulSoup to find all the `a` elements with class `mainmenu`.

In [None]:
# Select all links that belong to the page's main menu.
#soup.select("a.mainmenu")
# That selector returned an empty list on the current site, so we use the class `notranslate` to continue testing.
soup.select("a.notranslate")

[<a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>,
 <a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>,
 <a class="notranslate" href="/Senate/Members/Details/3316">Omar Aquino</a>,
 <a class="notranslate" href="/Senate/Members/Details/3316">Omar Aquino</a>,
 <a class="notranslate" href="/Senate/Members/Details/3383">Li Arellano, Jr.</a>,
 <a class="notranslate" href="/Senate/Members/Details/3383">Li Arellano, Jr.</a>,
 <a class="notranslate" href="/Senate/Members/Details/3413">Chris Balkema</a>,
 <a class="notranslate" href="/Senate/Members/Details/3413">Chris Balkema</a>,
 <a class="notranslate" href="/Senate/Members/Details/3337">Christopher Belt</a>,
 <a class="notranslate" href="/Senate/Members/Details/3337">Christopher Belt</a>,
 <a class="notranslate" href="/Senate/Members/Details/3386">Terri Bryant</a>,
 <a class="notranslate" href="/Senate/Members/Details/3386">Terri Bryant</a>,
 <a class="notranslate" href="/Senate/Members/

## Step 4: Get Attributes and Text of Elements

Once we identify elements, we want to access the information in them. Usually, this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` member of a `tag` object:

In [None]:
# Get the side-menu links, show the first one, and check its object type.
# Get all side-menu links as a list
#side_menu_links = soup.select("a.sidemenu")
# (Switched to the class `notranslate` to continue testing.)
side_menu_links = soup.select("a.notranslate")

# Examine the first link
first_link = side_menu_links[0]
print(first_link)

# What class is this variable?
print('Class: ', type(first_link))

# The class turns out to be bs4.element.Tag

<a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a>
Class:  <class 'bs4.element.Tag'>


It's a Beautiful Soup tag! This means it has a `text` member:

In [24]:
# Print only the visible text of the link stored in first_link
print(first_link.text)

Neil Anderson


Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

💡 **Tip**: You can access a tag’s attributes by treating the tag like a dictionary:

In [25]:
# Print the URL that this link points to
print(first_link['href'])

/Senate/Members/Details/3312
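Dictionary-style access raises a `KeyError` when the attribute is missing, which can interrupt a scraping loop. The sketch below contrasts it with `.get()` on a made-up link that has no `href`.

```python
from bs4 import BeautifulSoup

demo = BeautifulSoup('<a class="notranslate">No link here</a>', "html.parser")
link = demo.a

# .get() returns None (or a default you supply) when the attribute is absent
print(link.get("href"))
print(link.get("href", "#"))

# Square brackets raise KeyError for a missing attribute
try:
    link["href"]
except KeyError:
    print("no href attribute")

# Multi-valued attributes such as class come back as a list
print(link["class"])
```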


## 🥊 Challenge: Extract specific attributes

Extract all `href` attributes for each `mainmenu` URL.

In [27]:
# Build a list of the link URLs in the page's main menu
[link['href'] for link in soup.select("a.mainmenu")]

# That list is still empty on the current site, so we repeat the procedure with the class `btn`
[link['href'] for link in soup.select("a.btn")]

['/Senate/Members/List',
 '/Documents/Senate/104th_Senate_Officers.pdf',
 '/Documents/Senate/104th_Senate_Leadership.pdf',
 'https://www.ilga.gov/Documents/Senate/104th_Senate_Seating_Chart.pdf',
 'Members/rptMemberList']
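The list above mixes root-relative, absolute, and page-relative links. One way to turn all of them into absolute URLs is `urljoin` from the standard library. This sketch assumes the page was served from the URL that was requested; if the server redirected the request, `req.url` would be the safer base.

```python
from urllib.parse import urljoin

# The page the links were scraped from (assumed not to have been redirected)
base_url = "http://www.ilga.gov/senate/default.asp"

hrefs = [
    "/Senate/Members/List",  # root-relative
    "https://www.ilga.gov/Documents/Senate/104th_Senate_Seating_Chart.pdf",  # already absolute
    "Members/rptMemberList",  # relative to the current page
]

# urljoin resolves each href against the base URL and leaves absolute URLs untouched
absolute = [urljoin(base_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Using `urljoin` instead of string concatenation also avoids doubled or missing slashes between the host and the path.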

<a id='scrape'></a>

# Scraping the Illinois General Assembly

Believe it or not, those are really the fundamental tools you need to scrape a website. Once you spend more time familiarizing yourself with HTML and CSS, then it's simply a matter of understanding the structure of a particular website and intelligently applying the tools of Beautiful Soup and Python.

Let's apply these skills to scrape the [Illinois 98th General Assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Specifically, our goal is to scrape information on each senator, including their name, district, and party.

## Scrape and Soup the Webpage

Let's scrape and parse the webpage, using the tools we learned in the previous section.

In [28]:
# Download a web page and prepare it for analysis with Beautiful Soup.
# Download the page content
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Store the HTML as text
src = req.text
# Turn the HTML into an object we can navigate
soup = BeautifulSoup(src, "lxml")

## Search for the Table Elements

Our goal is to obtain the elements in the table on the webpage. Remember: rows are identified by the `tr` tag. Let's use `find_all` to obtain these elements.

In [31]:
# Get every table row on the page and count them
rows = soup.find_all("tr")
len(rows)

# That search comes back empty because the page now uses a Bootstrap grid; for this test we use the h5 elements instead, which are the card titles
rows = soup.find_all("h5")
len(rows)

122

⚠️ **Warning**: Keep in mind: `find_all` gets *all* the elements with the `tr` tag. We only want some of them. If we use the 'Inspect' function in Google Chrome and look carefully, then we can use some CSS selectors to get just the rows we're interested in. Specifically, we want the inner rows of the table:

In [96]:
# Select the rows and print the first 5
# rows = soup.select('tr tr tr')
rows = soup.select('div.member-overlay')

for row in rows[:5]:
    print(row, '\n')

<div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
<p class="card-text">
                                            Republican Caucus Chair
                                            <br/>47th District
                                        </p>
</div> 

<div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
<p class="card-text">
                                            Republican Caucus Chair
                                            <br/>47th District
                                        </p>
</div> 

<div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3316">Omar Aquino</a> (D)</h5>
<p class="card-text">
                                            Majority Caucus Chair
                                            <br/>2nd District
                        

It looks like we want everything after the first two rows. Let's work with a single row to start, and build our loop from there.

In [59]:
example_row = rows[2]
# Print example_row in an indented, line-by-line layout using prettify().
print(example_row.prettify())

<div class="member-overlay">
 <h5 class="card-title">
  <a class="notranslate" href="/Senate/Members/Details/3316">
   Omar Aquino
  </a>
  (D)
 </h5>
 <p class="card-text">
  Majority Caucus Chair
  <br/>
  2nd District
 </p>
</div>



Let's break this row down into its component cells/columns using the `select` method with CSS selectors. Looking closely at the HTML, there are a couple of ways we could do this.

* We could identify the cells by their tag `td`.
* We could use the class name `.detail`.
* We could combine both and use the selector `td.detail`.

In [63]:
""" for cell in example_row.select('td'):
    print(cell)
print()

for cell in example_row.select('.detail'):
    print(cell)
print()

for cell in example_row.select('td.detail'):
    print(cell)
print() """

for cell in example_row.select('p'):
    print(cell)
print()

for cell in example_row.select('.card-text'):
    print(cell)
print()

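# 'a.card-text' matches nothing here: the card-text class sits on the <p> tag, not on an <a> tag
for cell in example_row.select('a.card-text'):
    print(cell)
print()

# There is no table-row layout anymore, so we use the elements that are available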

<p class="card-text">
                                            Majority Caucus Chair
                                            <br/>2nd District
                                        </p>

<p class="card-text">
                                            Majority Caucus Chair
                                            <br/>2nd District
                                        </p>




We can confirm that these are all the same.

In [None]:
# assert states a condition that Python checks. If it does not hold, Python raises an AssertionError.
# Assert that the three results are equal (in content and order).
#assert example_row.select('td') == example_row.select('.detail') == example_row.select('td.detail')
assert example_row.select('p') == example_row.select('.card-text') == example_row.select('p.card-text')

Let's use the selector `td.detail` to be as specific as possible.

In [None]:
# Select only those 'td' tags with class 'detail'
# (on the current page, 'td' becomes 'p' and 'detail' becomes 'card-text')
#detail_cells = example_row.select('td.detail')
detail_cells = example_row.select('p.card-text')
# Display the list of matching elements, rendered as HTML
detail_cells

[<p class="card-text">
                                             Majority Caucus Chair
                                             <br/>2nd District
                                         </p>]

Most of the time, we're interested in the actual **text** of a website, not its tags. Recall that to get the text of an HTML element, we use the `text` member:

In [66]:
# Keep only the text in each of those cells
# Loop over the cells in detail_cells, take the text of each one, and store the results in a list called row_data.
row_data = [cell.text for cell in detail_cells]
print(row_data)

['\r\n                                            Majority Caucus Chair\r\n                                            2nd District\r\n                                        ']


Looks good! Now we just use our basic Python knowledge to get the elements of this list that we want. Remember, we want the senator's name, their district, and their party.
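The raw text above still carries line breaks and indentation from the page source. As a minimal sketch, Beautiful Soup's `stripped_strings` and `get_text(..., strip=True)` can clean this up. The markup below is copied from one of the cards printed earlier, with the indentation shortened.

```python
from bs4 import BeautifulSoup

# Markup copied from one of the cards printed above
html = """<p class="card-text">
            Majority Caucus Chair
            <br/>2nd District
        </p>"""
cell = BeautifulSoup(html, "html.parser").p

# stripped_strings yields each text fragment with surrounding whitespace removed
parts = list(cell.stripped_strings)
print(parts)

# get_text() can strip the fragments and join them with a separator in one call
joined = cell.get_text(" | ", strip=True)
print(joined)
```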

In [69]:
# Print elements of row_data. The original table layout produced at least 5 elements per row.
print(row_data[0]) # Card text (role and district)
#print(row_data[3]) # District
#print(row_data[4]) # Party

# The other lines are commented out because the page no longer uses table rows


                                            Majority Caucus Chair
                                            2nd District
                                        


## Getting Rid of Junk Rows

We saw at the beginning that not all of the rows we got actually correspond to a senator. We'll need to do some cleaning before we can proceed forward. Take a look at some examples:

In [70]:
# Print the first element of rows. rows[0] is a complete HTML block.
print('Row 0:\n', rows[0], '\n')
# Print the second element of rows.
print('Row 1:\n', rows[1], '\n')
# Print the last element of rows.
print('Last Row:\n', rows[-1])

Row 0:
 <div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
<p class="card-text">
                                            Republican Caucus Chair
                                            <br/>47th District
                                        </p>
</div> 

Row 1:
 <div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
<p class="card-text">
                                            Republican Caucus Chair
                                            <br/>47th District
                                        </p>
</div> 

Last Row:
 <div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3315">Dan McConchie</a> (R)</h5>
<p class="card-text">
                                            Senator
                                            <br/>26th District
        

When we write our for loop, we only want it to apply to the relevant rows. So we'll need to filter out the irrelevant rows. The way to do this is to compare some of these to the rows we do want, see how they differ, and then formulate that in a conditional.

As you can imagine, there are a lot of possible ways to do this, and it'll depend on the website. We'll show some here to give you an idea of how to do this.

In [None]:
# len() of a Tag counts the direct children of the <div>, not how many <p> cells it contains.
# Bad rows
print(len(rows[0]))
print(len(rows[1]))

# Good rows
print(len(rows[2]))
print(len(rows[3]))

5
5
5
5
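The value 5 comes from how Beautiful Soup measures a tag: `len(tag)` counts its direct children, including the whitespace strings between child tags. The sketch below illustrates this on markup adapted from the output above. It is parsed with `html.parser`, so the result on the live page with `lxml` could differ slightly.

```python
from bs4 import BeautifulSoup

html = """<div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3316">Omar Aquino</a> (D)</h5>
<p class="card-text">Majority Caucus Chair<br/>2nd District</p>
</div>"""
card = BeautifulSoup(html, "html.parser").div

# len(tag) is the number of direct children, whitespace strings included
print(len(card), len(card.contents))

# Counting only child tags ignores the newline strings between them
child_tags = card.find_all(True, recursive=False)
print([tag.name for tag in child_tags])
```

Because every card has the same shape, a length check alone cannot separate wanted from unwanted rows on this layout.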


Perhaps good rows have a length of 5. Let's check:

In [72]:
# Build a new list, good_rows, keeping only the rows with exactly 5 children.
good_rows = [row for row in rows if len(row) == 5]
# Let's check some rows
# (left commented out: these raised an error with the card-based layout)
# print(good_rows[0], '\n')
# print(good_rows[-2], '\n')
# print(good_rows[-1])

We found a footer row in our list that we'd like to avoid. Let's try something else:

In [73]:
# Use a selector to find the cells inside the third element of rows
# (originally 'td.detail'; on the current page, 'p.card-text')
#rows[2].select('td.detail') 
rows[2].select('p.card-text') 

[<p class="card-text">
                                             Majority Caucus Chair
                                             <br/>2nd District
                                         </p>]

In [74]:
# Select the matching cells inside a row.
# An empty list marks a bad row; a non-empty list marks a good row.
# Original selectors (these find nothing on the current page)
# Bad row
print(rows[-1].select('td.detail'), '\n')
# Good row
print(rows[5].select('td.detail'), '\n')

# Same checks with the selector for the card layout
# Bad row
print(rows[-1].select('p.card-text'), '\n')
# Good row
print(rows[5].select('p.card-text'), '\n')

# Keep only the rows that contain a matching cell, discarding empty rows.
# How about this?
#good_rows = [row for row in rows if row.select('td.detail')]
good_rows = [row for row in rows if row.select('p.card-text')]

print("Checking rows...\n")
# Print the first and last elements of good_rows.
print(good_rows[0], '\n')
print(good_rows[-1])

[] 

[] 

[<p class="card-text">
                                            Senator
                                            <br/>26th District
                                        </p>] 

[<p class="card-text">
                                            Senator
                                            <br/>37th District
                                        </p>] 

Checking rows...

<div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
<p class="card-text">
                                            Republican Caucus Chair
                                            <br/>47th District
                                        </p>
</div> 

<div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3315">Dan McConchie</a> (R)</h5>
<p class="card-text">
                                            Senator
                                           

Looks like we found something that worked!

## Loop it All Together

Now that we've seen how to get the data we want from one row, as well as filter out the rows we don't want, let's put it all together into a loop.

In [84]:
# Define storage list
members = []

# Get rid of junk rows: keep only rows that contain a matching cell
# valid_rows = [row for row in rows if row.select('td.detail')]
valid_rows = [row for row in rows if row.select('p.card-text')]

# Loop through all valid rows
for row in valid_rows:
    # Select the detail cells (originally 'td' tags with class 'detail')
    #detail_cells = row.select('td.detail')
    detail_cells = row.select('p.card-text')

    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    # The card layout has a single cell, so its whole text is stored in one field
    name = row_data[0]
    #district = int(row_data[3])
    #party = row_data[4]

    # Store the record (the original code built a tuple of three fields)
    #senator = (name, district, party)
    # Note: (name) without a trailing comma is just the string, not a tuple
    senator = name

    # Append to list
    members.append(senator)

In [85]:
# The original table layout gave 61; the card layout returns a different count
len(members)

120

Let's take a look at what we have in `members`.

In [86]:
print(members[:5])

['\r\n                                            Republican Caucus Chair\r\n                                            47th District\r\n                                        ', '\r\n                                            Republican Caucus Chair\r\n                                            47th District\r\n                                        ', '\r\n                                            Majority Caucus Chair\r\n                                            2nd District\r\n                                        ', '\r\n                                            Majority Caucus Chair\r\n                                            2nd District\r\n                                        ', '\r\n                                            Senator\r\n                                            37th District\r\n                                        ']


## 🥊  Challenge: Get `href` elements pointing to members' bills 

The code above retrieves information on:  

- the senator's name,
- their district number,
- and their party.
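With the card-based markup shown in the outputs above, those three fields could be recovered along the lines of the sketch below. The helper `parse_card` is hypothetical, and its regular expressions assume the formats seen in the output, that is, a party letter such as `(R)` after the name and a district written like `47th District`.

```python
import re

from bs4 import BeautifulSoup

html = """<div class="member-overlay">
<h5 class="card-title"><a class="notranslate" href="/Senate/Members/Details/3312">Neil Anderson</a> (R)</h5>
<p class="card-text">
    Republican Caucus Chair
    <br/>47th District
</p>
</div>"""


def parse_card(card):
    """Return (name, district, party) from one member card (hypothetical helper)."""
    title = card.select_one("h5.card-title")
    name = title.a.get_text(strip=True)

    # The party letter sits in parentheses after the link, e.g. "(R)"
    party_match = re.search(r"\(([A-Z])\)", title.get_text())
    party = party_match.group(1) if party_match else None

    # The district appears as e.g. "47th District" inside the card text
    text = card.select_one("p.card-text").get_text()
    district_match = re.search(r"(\d+)\w*\s+District", text)
    district = int(district_match.group(1)) if district_match else None

    return name, district, party


card = BeautifulSoup(html, "html.parser").select_one("div.member-overlay")
record = parse_card(card)
print(record)
```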

We now want to retrieve the URL for each senator's list of bills. Each URL will follow a specific format. 

The format for the list of bills for a given senator is:

`http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=[MEMBER_ID]&Primary=True`

to get something like:

`http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True`

in which `MEMBER_ID=1911`. 

You should be able to see that, unfortunately, `MEMBER_ID` is not currently something pulled out in our scraping code.

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips: 

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a BeautifulSoup `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details.
* There are a _lot_ of different ways to use BeautifulSoup to get things done. Whatever you need to do to pull the `href` out is fine.

The code has been partially filled out for you. Fill it in where it says `#YOUR CODE HERE`. Save the path into an object called `full_path`.

In [116]:
# Make a GET request
req = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')
# Read the content of the page
src = req.text
# Parse it into a soup object
soup = BeautifulSoup(src, "lxml")
# Create an empty list to store the members
members = []

# The original selected every 'tr tr tr' CSS match on the page;
# the page no longer uses table rows, so we select div.member-overlay instead
rows = soup.select('div.member-overlay')
# Get rid of junk rows
# (the page no longer uses table cells, so we filter on the a tag instead)
rows = [row for row in rows if row.select('a.notranslate')]

# Loop over all rows
for row in rows:
    # Select only the a tags with class notranslate
    detail_cells = row.select('a.notranslate') 
    # Keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    # Collect information
    name = row_data[0]
    #district = int(row_data[3])
    #party = row_data[4]
    # Only the name is collected here

    # YOUR CODE HERE
    # extraemos el href obteniendolo del elemento a de row
    href = detail_cells[0]['href']
    # creamos el path completo
    full_path = "http://www.ilga.gov/" + href

    # Guardamos en una tupla
    senator = (name, full_path)
    # Se inserta en la lista members
    members.append(senator)

In [117]:
# Descomentamos para probar
members[:5]

[('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'),
 ('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'),
 ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'),
 ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'),
 ('Li Arellano, Jr.', 'http://www.ilga.gov//Senate/Members/Details/3383')]
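
The listing above shows each senator twice. If one entry per senator is wanted, a minimal sketch of dropping repeats while preserving order uses `dict.fromkeys`. The sample list copies the tuples shown above; `members_sample` and `unique_members` are names introduced here for illustration.

```python
# (name, url) tuples copied from the output above.
members_sample = [
    ('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'),
    ('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'),
    ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'),
    ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'),
    ('Li Arellano, Jr.', 'http://www.ilga.gov//Senate/Members/Details/3383'),
]

# Dictionary keys are unique and keep insertion order (Python 3.7+),
# so this removes repeated tuples without reordering the rest.
unique_members = list(dict.fromkeys(members_sample))

print(unique_members)
```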

## 🥊  Challenge: Modularize Your Code

Turn the code above into a function that accepts a URL, scrapes the URL for its senators, and returns a list of tuples containing information about each senator. 

In [123]:
# YOUR CODE HERE
# Define the function
def get_members(url):
    # Make a GET request
    req = requests.get(url)
    # Read the content of the page
    src = req.text
    # Parse the content into a 'soup' object
    soup = BeautifulSoup(src, "lxml")
    # Create an empty list to store the members
    members = []

    # The 'tr tr tr' CSS selector does not work on this page,
    # so select the div.member-overlay elements instead
    rows = soup.select('div.member-overlay')
    # Get rid of junk rows
    # Columns do not work either, so keep rows that contain an a.notranslate anchor
    rows = [row for row in rows if row.select('a.notranslate')]

    # Loop over all rows
    for row in rows:
        # Select only the <a> tags with class notranslate
        detail_cells = row.select('a.notranslate')
        # Keep only the text in each of those cells
        row_data = [cell.text for cell in detail_cells]
        # Collect information (only the name is extracted here)
        name = row_data[0]
        #district = int(row_data[3])
        #party = row_data[4]

        # Extract the href from the first <a> element in the row
        href = detail_cells[0]['href']
        # Build the full path
        # (the href already starts with '/', so the result contains '//')
        full_path = "http://www.ilga.gov/" + href

        # Store the result in a tuple
        senator = (name, full_path)
        # Append the tuple to the members list
        members.append(senator)

    # Note: the list of tuples is wrapped in an outer list,
    # so callers reach the members with result[0]
    return [members]


In [125]:
# Test your code
url = 'http://www.ilga.gov/senate/default.asp?GA=98'
senate_members = get_members(url)
print(senate_members)
len(senate_members[0])

[[('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'), ('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'), ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'), ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'), ('Li Arellano, Jr.', 'http://www.ilga.gov//Senate/Members/Details/3383'), ('Li Arellano, Jr.', 'http://www.ilga.gov//Senate/Members/Details/3383'), ('Chris Balkema', 'http://www.ilga.gov//Senate/Members/Details/3413'), ('Chris Balkema', 'http://www.ilga.gov//Senate/Members/Details/3413'), ('Christopher Belt', 'http://www.ilga.gov//Senate/Members/Details/3337'), ('Christopher Belt', 'http://www.ilga.gov//Senate/Members/Details/3337'), ('Terri Bryant', 'http://www.ilga.gov//Senate/Members/Details/3386'), ('Terri Bryant', 'http://www.ilga.gov//Senate/Members/Details/3386'), ('Cristina Castro', 'http://www.ilga.gov//Senate/Members/Details/3317'), ('Cristina Castro', 'http://www.ilga.gov//Senate/Members/Details/33

120

## 🥊 Take-home Challenge: Writing a Scraper Function

We want to scrape the webpages corresponding to the bills sponsored by each senator.

Write a function called `get_bills(url)` to parse a given bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - returning a _list_ of tuples, each with:
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)
      
This function has been partially completed. Fill in the rest.

In [None]:
def get_bills(url):
    src = requests.get(url).text
    # Name the parser explicitly, as in the earlier cells
    soup = BeautifulSoup(src, "lxml")
    rows = soup.select('tr')
    bills = []
    # Iterate over the rows
    for row in rows:
        # Select the header cells (<th>) and the data cells (<td>)
        cells1 = row.select('th')
        cells2 = row.select('td')
        # The senator's name is not a billlist cell; only rows with five <td> cells are bills
        if len(cells2) == 5:
            row_text = [cell.text for cell in cells2]
            # Extract the information
            description = row_text[1]
            chamber = row_text[2]
            last_action = row_text[3]
            last_action_date = row_text[4]

            # Extract the bill id from the header cell
            row_text1 = [cell.text for cell in cells1]
            bill_id = row_text1[0]

            # Combine the information into a tuple
            bill = (bill_id, description, chamber, last_action, last_action_date)
            bills.append(bill)
    return bills

In [None]:
# Test the function
test_url = senate_members[0][1][1]
get_bills(test_url)[0:5]

[('SB0023',
  'AUDIT-IPA RENEWABLE',
  'S',
  'Referred to Assignments',
  '1/13/2025'),
 ('SB0038',
  'COUNTIES-WIND & SOLAR ENERGY',
  'S',
  'Referred to Assignments',
  '1/13/2025'),
 ('SB0039',
  'VETS-TINY HOMES-EV EXEMPTION',
  'S',
  'Public Act . . . . . . . . . 104-0341',
  '8/15/2025'),
 ('SB0081',
  'CRIM CD-AGG BAT-DCFS WORKER',
  'S',
  'Rule 3-9(a) / Re-referred to Assignments',
  '3/21/2025'),
 ('SB0096',
  'SCH CD-NONPUBLIC STUDENT-SPORT',
  'S',
  'Referred to Assignments',
  '1/17/2025')]

### Scrape All Bills

Finally, create a dictionary `bills_dict` which maps a district number (the key) onto a list of bills (the value) coming from that district. You can do this by looping over all of the senate members in `members_dict` and calling `get_bills()` for each of their associated bill URLs.

**NOTE:** Please call the function `time.sleep(1)` on each iteration of the loop, so that we don't overload the state's website.
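
The loop described above can be sketched as follows. To keep the example runnable offline, `fake_get_bills` is a hypothetical stand-in for `get_bills()`, the member list is made up, and `scrape_all` with its `delay` parameter is a name introduced here. The sketch keys the dictionary by senator name and skips URLs it has already fetched; both choices are assumptions, not requirements of the exercise.

```python
import time

def fake_get_bills(url):
    # Hypothetical stand-in for get_bills(); returns placeholder data.
    return [("BILL-" + url[-1], "placeholder description")]

def scrape_all(members, fetch, delay=1.0):
    """Map each member's name to fetch(url), pausing between requests."""
    results = {}
    seen_urls = set()
    for name, url in members:
        if url in seen_urls:
            continue  # skip repeated entries so each page is requested once
        seen_urls.add(url)
        results[name] = fetch(url)
        time.sleep(delay)  # pause so the server is not flooded with requests
    return results

# Made-up members list, including one repeated entry.
members = [
    ("Senator A", "http://example.com/1"),
    ("Senator A", "http://example.com/1"),
    ("Senator B", "http://example.com/2"),
]

# delay=0 keeps the demonstration instant; use 1 second or more on a real site.
bills_by_member = scrape_all(members, fake_get_bills, delay=0)
print(bills_by_member)
```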

In [None]:
import time

# Create a dictionary to store the results
bills_dict = {}
print(senate_members[0][:5])
for member in senate_members[0][:5]:
    bills_dict[member[0]] = get_bills(member[1])
    # Pause for 1 second between requests so as not to overload the server
    time.sleep(1)

[('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'), ('Neil Anderson', 'http://www.ilga.gov//Senate/Members/Details/3312'), ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'), ('Omar Aquino', 'http://www.ilga.gov//Senate/Members/Details/3316'), ('Li Arellano, Jr.', 'http://www.ilga.gov//Senate/Members/Details/3383')]


In [None]:
# The keys in bills_dict are senator names, not district numbers
len(bills_dict['Neil Anderson'])