# Scraping lawmakers from LegiScan

The aim of this script is to scrape LegiScan for information about the lawmakers sponsoring anti-LGBTQ bills, together with their IDs on the FollowTheMoney website.

This is the first time I have tried to write a fully functioning scraper, so there may be things to fix. Please make changes wherever you feel it's necessary.

This attempt is for educational purposes only. This is something I am trying to learn and, instead of practising on a random website, I tried to apply it to our current work at OpenDemocracy. I will not invoice OpenDemocracy for the hours I've worked on this.

In [39]:
# Import libraries

import pandas as pd
from bs4 import BeautifulSoup
import requests
import csv
import re

The first thing I am going to do is store my URL and select the pieces of information that I will use to name the .csv file I am going to create. Because the final goal is to repeat this process for all the various URLs, I can't assign each name individually.

I decided that the files I am going to create will be formatted in the following way:

* State + Bill Number + Year

One note: because the length of bill numbers can vary from three to five digits, I will only select the three digits that are always available. By doing so we should avoid picking up "/" characters, which are not allowed in a file name.


# Store URL, create file, scraping

The first thing I am going to do is store my URL and select some pieces of information that will be useful later, when I save the scraped data in a new .csv file that needs a unique name.


In [43]:

url = 'https://legiscan.com/OH/sponsors/HB454/2021'

# From the url I create the name I will use for the file with State, Bill Number and Year

fileName = url.replace('https://legiscan.com/','')
fileName = fileName.replace('/','')
fileName = fileName.replace('sponsors','')


In [44]:
fileName

'OHHB4542021'
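
Since the goal is to repeat this for many URLs, the naming step could also be wrapped in a small function. The following is only a sketch of one possible approach: `file_name_from_url` is a hypothetical helper name, and it assumes every URL follows the same `/State/sponsors/Bill/Year` layout as the one above.

```python
from urllib.parse import urlparse

def file_name_from_url(url):
    """Build 'State + Bill Number + Year' from a LegiScan sponsors URL."""
    # The path looks like /OH/sponsors/HB454/2021; keep its non-empty parts
    parts = [p for p in urlparse(url).path.split('/') if p]
    # Assumption: the layout is always State / 'sponsors' / Bill / Year
    if len(parts) != 4 or parts[1] != 'sponsors':
        raise ValueError(f"Unexpected URL layout: {url}")
    state, _, bill, year = parts
    return state + bill + year

urls = ['https://legiscan.com/OH/sponsors/HB454/2021']
for u in urls:
    print(file_name_from_url(u))
```

Splitting on the path segments means the bill number is taken whole, whatever its length, and a URL with a different shape raises an error instead of silently producing an odd file name.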

In [45]:
# create our file, make sure it's writeable, set some defaults
f = open(fileName + ".csv", 'w', encoding='utf8', newline='')

# create a writer to write data to the CSV file
writer = csv.writer(f, delimiter=',')
# use the writer to write the first row, the column headers, to the file
writer.writerow(['legiscan', 'sponsorName', 'followTheMoney'])

# and here's the actual scraping. Although there's only one table in the page, we use the ID just to be sure.

page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table', id="gaits-sponsorlist")

# for every row in the table...

for row in table.find_all('tr'):
    
    #...save the data in each cell inside that row to cells
    
    cells = row.find_all(['th','td'])
    
    # ...get the data from each of the cells and save it to a variable
    
    sponsorName = cells[0].text
    sponsorType = cells[1].text
    sponsorship = cells[2].text
    district = cells[3].text
    followTheMoney = cells[4].find_all("a")
    ballotpedia = cells[5].text
    biography = cells[6].text
    
    # ... create a list called rowData with the variables. I only take what I need (sponsorName and followTheMoney)
    
    # add a column with the url (we will use it to retrieve the bill number.)
    
    rowData = [url, sponsorName, followTheMoney]
    
    # write our data in the file
    writer.writerow(rowData)

f.close()
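
The loop above also writes the table's header row and stores the whole `<a>` tag as text, which is why some cleaning is needed afterwards. As a rough sketch of an alternative, the parsing could be done by a function that skips header rows and keeps only the link's `href`. `parse_sponsors` is a hypothetical name, the sample HTML is made up for illustration, and the sketch assumes the header cells are `<th>` elements and that the FollowTheMoney link sits in the fifth column, as in the code above.

```python
from bs4 import BeautifulSoup

def parse_sponsors(html, table_id="gaits-sponsorlist"):
    """Return (sponsor name, FollowTheMoney href) pairs from a sponsors page."""
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find('table', id=table_id)
    if table is None:
        return []  # table not found, e.g. the page layout changed
    rows = []
    for row in table.find_all('tr'):
        # Assumption: header rows contain only <th>, so they yield no <td> cells
        cells = row.find_all('td')
        if len(cells) < 5:
            continue
        link = cells[4].find('a')
        rows.append((cells[0].get_text(strip=True),
                     link['href'] if link else None))
    return rows

# Made-up miniature of the sponsors table, for illustration only
sample_html = """
<table id="gaits-sponsorlist">
  <tr><th>Sponsor</th><th>Type</th><th>Sponsorship</th><th>District</th><th>FTM</th></tr>
  <tr><td>Representative Example [R]</td><td>Primary</td><td>Sponsor</td><td>HD-001</td>
      <td><a href="https://www.followthemoney.org/entity-details?eid=123&amp;default=candidate">FollowTheMoney</a></td></tr>
</table>
"""
print(parse_sponsors(sample_html))
```

With a real page, `parse_sponsors(page.content)` would play the role of the loop above, and each returned pair could be passed to `writer.writerow` together with the URL.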

In [46]:
# Now, some cleaning is needed. We remove the first row, which holds the table's header cells scraped as data.

df = pd.read_csv(fileName + ".csv")

df.drop(df.index[0], inplace=True)

df

Unnamed: 0,legiscan,sponsorName,followTheMoney
1,https://legiscan.com/OH/sponsors/HB454/2021,Representative Gary Click [R],"[<a href=""https://www.followthemoney.org/entit..."
2,https://legiscan.com/OH/sponsors/HB454/2021,Representative Diane Grendell [R],"[<a href=""https://www.followthemoney.org/entit..."
3,https://legiscan.com/OH/sponsors/HB454/2021,Representative Adam Bird [R],"[<a href=""https://www.followthemoney.org/entit..."
4,https://legiscan.com/OH/sponsors/HB454/2021,Representative Rodney Creech [R],"[<a href=""https://www.followthemoney.org/entit..."
5,https://legiscan.com/OH/sponsors/HB454/2021,Representative Bill Dean [R],"[<a href=""https://www.followthemoney.org/entit..."
6,https://legiscan.com/OH/sponsors/HB454/2021,Representative Ron Ferguson [R],"[<a href=""https://www.followthemoney.org/entit..."
7,https://legiscan.com/OH/sponsors/HB454/2021,Representative Sarah Fowler Arthur [R],"[<a href=""https://www.followthemoney.org/entit..."
8,https://legiscan.com/OH/sponsors/HB454/2021,Representative Jennifer Gross [R],"[<a href=""https://www.followthemoney.org/entit..."
9,https://legiscan.com/OH/sponsors/HB454/2021,Representative Thomas Hall [R],"[<a href=""https://www.followthemoney.org/entit..."
10,https://legiscan.com/OH/sponsors/HB454/2021,Representative Adam Holmes [R],"[<a href=""https://www.followthemoney.org/entit..."


# Get the FollowTheMoney lawmaker ID number

Now for the most important part. We want to extract the ID number of each legislator from the FollowTheMoney link, which looks like this:

href="https://www.followthemoney.org/entity-details?eid=7247802&amp;default=candidate" target="_blank" title="View FollowTheMoney financial contribution information for Senator Marty Harbin [R]">FollowTheMoney

Of all that, 7247802 is the only part that interests us. How do we extract it?

There are probably various, more efficient ways. This is the one I chose.

* First, we cut everything that comes before and after our ID number. We'll leave some additional characters just to be sure we have some wiggle room for longer or shorter IDs.

* Second, we remove all non-numeric characters from the string.

In [49]:
# create a new column with only the juicy part of our string

df['sponsorID'] = df['followTheMoney'].str[60:75]

# now we keep only numeric values...

def find_number(text):
    # collect every run of digits in the text and join them into one string
    num = re.findall(r'[0-9]+', text)
    return " ".join(num)

df['sponsorID'] = df['sponsorID'].apply(find_number)

# ...and here are our ID numbers!

In [48]:
df['sponsorID']

1     25002859
2      6564127
3     48809001
4     16065548
5     12107035
6     23442194
7     16072638
8     27966956
9     48808984
10    47141673
11     3389117
12    37049242
13    14913210
14     6674692
15     3104766
16    18906040
17    48808997
18    44259677
19     2855152
20    44423243
21    44043306
22     2787667
23    48808966
24    13004657
25     3215368
Name: sponsorID, dtype: object
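
The slice `str[60:75]` relies on the ID always sitting at the same position in the string. One possible alternative, shown here only as a sketch, is to read the `eid` parameter from the link itself using the standard library. `extract_eid` is a hypothetical helper name, and it assumes the cell holds the link as text with a quoted `href`, as in the example link shown earlier.

```python
import re
from html import unescape
from urllib.parse import urlparse, parse_qs

def extract_eid(link_html):
    """Return the eid value from the first href in link_html, or None."""
    match = re.search(r'href="([^"]+)"', link_html)
    if match is None:
        return None
    href = unescape(match.group(1))  # turns &amp; back into &
    # parse_qs maps each query parameter to a list of its values
    values = parse_qs(urlparse(href).query).get('eid')
    return values[0] if values else None

sample = ('<a href="https://www.followthemoney.org/entity-details'
          '?eid=7247802&amp;default=candidate" target="_blank">FollowTheMoney</a>')
print(extract_eid(sample))
```

On the DataFrame this could be applied as `df['followTheMoney'].apply(extract_eid)`, provided every cell is a string. A missing value (NaN) would need to be handled first.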

In [50]:
# we remove the old column we no longer need

df = df.drop(columns='followTheMoney')


In [51]:
df

Unnamed: 0,legiscan,sponsorName,sponsorID
1,https://legiscan.com/OH/sponsors/HB454/2021,Representative Gary Click [R],25002859
2,https://legiscan.com/OH/sponsors/HB454/2021,Representative Diane Grendell [R],6564127
3,https://legiscan.com/OH/sponsors/HB454/2021,Representative Adam Bird [R],48809001
4,https://legiscan.com/OH/sponsors/HB454/2021,Representative Rodney Creech [R],16065548
5,https://legiscan.com/OH/sponsors/HB454/2021,Representative Bill Dean [R],12107035
6,https://legiscan.com/OH/sponsors/HB454/2021,Representative Ron Ferguson [R],23442194
7,https://legiscan.com/OH/sponsors/HB454/2021,Representative Sarah Fowler Arthur [R],16072638
8,https://legiscan.com/OH/sponsors/HB454/2021,Representative Jennifer Gross [R],27966956
9,https://legiscan.com/OH/sponsors/HB454/2021,Representative Thomas Hall [R],48808984
10,https://legiscan.com/OH/sponsors/HB454/2021,Representative Adam Holmes [R],47141673


In [52]:
# save everything in our file and we're done!

df.to_csv(fileName + ".csv", encoding='utf8')
