# saving scraped data
For our project, remember that we want to scrape information about each bill contained within the bill cards. 

And like all good programmers, we broke our task up into a number of steps, some of which we've already done in the previous notebook:
1. isolate the bill_cards data from the rest of the webpage (already done)
2. pick out the information we want from the bill cards (already done)
3. process our information into lists
4. clean up our information
5. save that information to a csv file

Now we are on step three: processing the elements and saving them into lists. Each of these steps itself contains smaller steps, which we will figure out as we go along. 

Before continuing our work, we will import the libraries we need and create our `soup` object (that holds our website content), and our `bill_cards` object (which holds our bill card data).

In [1]:
import requests
from bs4 import BeautifulSoup

In [2]:
site = requests.get('https://translegislation.com/bills/2024/US')
html_code = site.content
soup = BeautifulSoup(html_code, 'lxml')

In [3]:
# to get the element and class for the cards, use the inspector

bill_cards = soup.find_all('div', class_ ='css-4rck61')

Now, we can write our loop that grabs all the elements we want. 

In [4]:
# runs the loop on the bill cards

for item in bill_cards[:10]: # only the first ten cards, just to check if it is working
    print(item.h3.text) # title
    print(item.h2.text) # caption
    print(item.span.text) # category
    print(item.p.text) # description (if any)
    print(item.a['href']) # relative link; the root https://translegislation.com still needs to be added

US HB1064
Ensuring Military Readiness Act of 2023
MILITARY
To provide requirements related to the eligibility of transgender individuals from serving in the Armed Forces.
/bills/2024/US/HB1064
US HB1112
Ensuring Military Readiness Act of 2023
MILITARY
To provide requirements related to the eligibility of individuals who identify as transgender from serving in the Armed Forces.
/bills/2024/US/HB1112
US HB1276
Protect Minors from Medical Malpractice Act of 2023
HEALTHCARE
To protect children from medical malpractice in the form of gender transition procedures.
/bills/2024/US/HB1276
US HB1399
Protect Children’s Innocence Act
HEALTHCARE
To amend chapter 110 of title 18, United States Code, to prohibit gender affirming care on minors, and for other purposes.
/bills/2024/US/HB1399
US HB1490
Preventing Violence Against Female Inmates Act of 2023
INCARCERATION
To secure the dignity and safety of incarcerated women.
/bills/2024/US/HB1490
US HB1585
Prohibiting Parental Secrecy Policies In School

## step 3: process our information into lists
Now, the next step is to assign a variable for each item. This allows us to save the data to the variable name, and later, to add it to a list.

In [5]:
for item in bill_cards[:10]:
    title = item.h3.text
    caption = item.h2.text
    category = item.find('span').text
    description = item.p.text
    link = 'https://translegislation.com' + item.a['href']
    print(title, caption, category, description, link)

US HB1064 Ensuring Military Readiness Act of 2023 MILITARY To provide requirements related to the eligibility of transgender individuals from serving in the Armed Forces. https://translegislation.com/bills/2024/US/HB1064
US HB1112 Ensuring Military Readiness Act of 2023 MILITARY To provide requirements related to the eligibility of individuals who identify as transgender from serving in the Armed Forces. https://translegislation.com/bills/2024/US/HB1112
US HB1276 Protect Minors from Medical Malpractice Act of 2023 HEALTHCARE To protect children from medical malpractice in the form of gender transition procedures. https://translegislation.com/bills/2024/US/HB1276
US HB1399 Protect Children’s Innocence Act HEALTHCARE To amend chapter 110 of title 18, United States Code, to prohibit gender affirming care on minors, and for other purposes. https://translegislation.com/bills/2024/US/HB1399
US HB1490 Preventing Violence 

It works! Now let's save it to lists. 

In [6]:
# a bunch of empty lists where we will dump our data
titles = []
captions = []
categories = []
descriptions = []
links = []

# our for loop that saves each item we want from the bill_cards
for item in bill_cards:
    title = item.h3.text
    category = item.find('span').text
    caption = item.h2.text
    description = item.p.text
    link = 'https://translegislation.com' + item.a['href']
    
    # adding the items to the empty lists
    titles.append(title)
    categories.append(category)
    captions.append(caption)
    descriptions.append(description)
    links.append(link)

AttributeError: 'NoneType' object has no attribute 'text'

### individual challenge:
Google this error and try to understand what it means. Then try out a solution from Stack Overflow, making sure to swap in our own variable names.

In [14]:
# a bunch of empty lists where we will dump our data
titles = []
captions = []
categories = []
descriptions = []
links = []

# our for loop that saves each item we want from the bill_cards
for item in bill_cards:
    title = item.h3.text
    category = item.find('span').text
    caption = item.h2.text
    # some cards have no <p> tag, so item.p is None; check before reading .text
    if item.p is not None:
        description = item.p.text
    else:
        description = 'No bill description'
    link = 'https://translegislation.com' + item.a['href']
    
    # adding the items to the empty lists
    titles.append(title)
    categories.append(category)
    captions.append(caption)
    descriptions.append(description)
    links.append(link)
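
To see where the `AttributeError` comes from, here is a minimal sketch on two made-up cards (the HTML and the `sample_` names are invented for illustration, not taken from the site). When a card has no `<p>` tag, `item.p` is `None`, and `None` has no `.text` attribute. The sketch uses Python's built-in `html.parser` so that it runs without `lxml`.

```python
from bs4 import BeautifulSoup

# Invented sample HTML: the second card has no <p> description
sample_html = """
<div class="card"><h3>US HB1</h3><h2>First Act</h2><p>To do something.</p></div>
<div class="card"><h3>US HB2</h3><h2>Second Act</h2></div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_cards = sample_soup.find_all('div', class_='card')

sample_descriptions = []
for item in sample_cards:
    # item.p is None when the card has no <p> tag,
    # so we check for None before asking for .text
    if item.p is not None:
        sample_descriptions.append(item.p.text)
    else:
        sample_descriptions.append('No bill description')

print(sample_descriptions)
```

The first card keeps its own description, while the second gets the placeholder text instead of raising an error.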

## step 4: cleaning up our lists
Before saving our dataset to a spreadsheet, we are going to do a bit more data processing and gathering. This will enable us to make a more robust dataset at the end. Here, we are going to get the link directly to each bill's page on LegiScan.

Like the previous sections, I'm going to use comments to write some pseudo-code that separates out the steps of the larger task. This is good practice for all programmers. 

In [None]:
## now, we will get the link to each bill's LegiScan page, in the following steps:

## first, add each bill card's link extension to the root URL, saving the results in a list called "urls"
## then, for each URL, make a soup
## then, for each soup, find the link to the bill's page on LegiScan
## finally, add those links to a new list, called "legiscan_links"

In [20]:
for item in bill_cards[:10]:
    extension = 'https://translegislation.com/' + item.a['href']
    print(extension)

https://translegislation.com//bills/2024/US/HB1064
https://translegislation.com//bills/2024/US/HB1112
https://translegislation.com//bills/2024/US/HB1276
https://translegislation.com//bills/2024/US/HB1399
https://translegislation.com//bills/2024/US/HB1490
https://translegislation.com//bills/2024/US/HB1585
https://translegislation.com//bills/2024/US/HB216
https://translegislation.com//bills/2024/US/HB3101
https://translegislation.com//bills/2024/US/HB3102
https://translegislation.com//bills/2024/US/HB3328


In [21]:
urls = []
for item in bill_cards:
    extension = 'https://translegislation.com/' + item.a['href']
    urls.append(extension)

In [22]:
urls[:10]

['https://translegislation.com//bills/2024/US/HB1064',
 'https://translegislation.com//bills/2024/US/HB1112',
 'https://translegislation.com//bills/2024/US/HB1276',
 'https://translegislation.com//bills/2024/US/HB1399',
 'https://translegislation.com//bills/2024/US/HB1490',
 'https://translegislation.com//bills/2024/US/HB1585',
 'https://translegislation.com//bills/2024/US/HB216',
 'https://translegislation.com//bills/2024/US/HB3101',
 'https://translegislation.com//bills/2024/US/HB3102',
 'https://translegislation.com//bills/2024/US/HB3328']
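
Note the double slash after the domain in the URLs above: the root ends in `/` and each `href` also starts with `/`. The pages still loaded in this notebook, but one way to avoid the doubled slash is the standard library's `urljoin`, sketched here as an alternative to string concatenation:

```python
from urllib.parse import urljoin

root = 'https://translegislation.com/'
href = '/bills/2024/US/HB1064'  # a relative link, like the ones in item.a['href']

# urljoin resolves the relative link against the root,
# so the slashes are handled for us
full_url = urljoin(root, href)
print(full_url)
```

This prints `https://translegislation.com/bills/2024/US/HB1064`, with a single slash between the domain and the path.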

In [23]:
# making a soup object of *every* page that is linked
# this may take several seconds

soups = []
for item in urls:
    site = requests.get(item)
    html_code = site.content
    soup = BeautifulSoup(html_code, 'lxml')
    soups.append(soup)
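
The loop above sends one request per bill page in quick succession. A more defensive variant, sketched below, adds a timeout, checks the HTTP status, and pauses between requests. The function name `fetch_soup`, the 10-second timeout, the one-second delay, and the use of the built-in `html.parser` are all illustrative assumptions, not requirements of the site or of this notebook.

```python
import time

import requests
from bs4 import BeautifulSoup


def fetch_soup(url, delay=1.0, get=requests.get):
    """Download one page and return it as a BeautifulSoup object.

    `get` defaults to requests.get; it is a parameter only so the
    function can be tried out with a stand-in that makes no network calls.
    """
    response = get(url, timeout=10)  # give up if the server takes over 10 seconds
    response.raise_for_status()      # raise an error for 4xx/5xx responses
    time.sleep(delay)                # pause so we don't flood the server
    return BeautifulSoup(response.content, 'html.parser')
```

With this helper, the list of soups could be built as `soups = [fetch_soup(url) for url in urls]`.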

In [24]:
legiscan_links = []
for item in soups:
    # get the link to the bill's page on LegiScan
    anchor_tag = item.find('a', class_='chakra-link css-oga2ct')
    link = anchor_tag['href']
    legiscan_links.append(link)

In [25]:
legiscan_links

['https://legiscan.com/US/text/HB1064/id/2737306',
 'https://legiscan.com/US/text/HB1112/id/2742708',
 'https://legiscan.com/US/text/HB1276/id/2755407',
 'https://legiscan.com/US/text/HB1399/id/2796538',
 'https://legiscan.com/US/text/HB1490/id/2761146',
 'https://legiscan.com/US/text/HB1585/id/2763467',
 'https://legiscan.com/US/text/HB216/id/2654610',
 'https://legiscan.com/US/text/HB3101/id/2830677',
 'https://legiscan.com/US/text/HB3102/id/2815463',
 'https://legiscan.com/US/text/HB3328/id/2818358',
 'https://legiscan.com/US/text/HB3329/id/2818922',
 'https://legiscan.com/US/text/HB3462/id/2827206',
 'https://legiscan.com/US/text/HB3887/id/2833147',
 'https://legiscan.com/US/text/HB429/id/2674746',
 'https://legiscan.com/US/text/HB4365/id/2846650',
 'https://legiscan.com/US/text/HB4367/id/2866079',
 'https://legiscan.com/US/text/HB4398/id/2835893',
 'https://legiscan.com/US/text/HB4665/id/2843997',
 'https://legiscan.com/US/text/HB4821/id/2849882',
 'https://legiscan.com/US/text/HB

## step 5: saving our data to a CSV
This is the final step. First, we will import two libraries for working with tabular data: `pandas` and `csv`. 

Then, we will add each of our lists into the "DataFrame" (the `pandas` term for a tabular type of object), where they will appear as separate columns. Finally, we will save our DataFrame as a .csv file. 
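
As a small illustration of how lists become columns, here is a sketch with two made-up rows (the `sample_` values are placeholders, not scraped data). Every list must have the same length; otherwise `pd.DataFrame` raises a `ValueError`.

```python
import pandas as pd

# Placeholder lists standing in for the scraped data
sample_titles = ['US HB1', 'US HB2']
sample_categories = ['MILITARY', 'HEALTHCARE']

# Each dictionary key becomes a column name,
# and each list becomes the values in that column
sample_df = pd.DataFrame({
    'title': sample_titles,
    'category': sample_categories,
})

print(sample_df.shape)  # (number of rows, number of columns)
```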

In [26]:
# importing the necessary libraries

import pandas as pd
import csv

In [28]:
# creating empty lists to hold all of our data

titles = []
captions = []
categories = []
descriptions = []

# extracting the data from the bill cards
for item in bill_cards:
    title = item.h3.text
    category = item.find('span').text
    caption = item.h2.text
    # some cards have no <p> tag, so item.p is None; check before reading .text
    if item.p is not None:
        description = item.p.text
    else:
        description = 'No bill description'
    
    # adding the items to the empty lists
    titles.append(title)
    categories.append(category)
    captions.append(caption)
    descriptions.append(description)
    # remember that "urls" and "legiscan_links" are already saved as lists, so we don't have to create them here

In [38]:
# creating a dataframe, with separate columns to hold each of our lists
df = pd.DataFrame(
    {'title': titles,
     'caption': captions,
     'category': categories,
     'description': descriptions,
     'url': urls,
     'legiscan': legiscan_links
    })

In [39]:
# checking the first 5 lines of the dataframe

df.head()

| | title | caption | category | description | url | legiscan |
|---|---|---|---|---|---|---|
| 0 | US HB1064 | Ensuring Military Readiness Act of 2023 | MILITARY | To provide requirements related to the eligibi... | https://translegislation.com//bills/2024/US/HB... | https://legiscan.com/US/text/HB1064/id/2737306 |
| 1 | US HB1112 | Ensuring Military Readiness Act of 2023 | MILITARY | To provide requirements related to the eligibi... | https://translegislation.com//bills/2024/US/HB... | https://legiscan.com/US/text/HB1112/id/2742708 |
| 2 | US HB1276 | Protect Minors from Medical Malpractice Act of... | HEALTHCARE | To protect children from medical malpractice i... | https://translegislation.com//bills/2024/US/HB... | https://legiscan.com/US/text/HB1276/id/2755407 |
| 3 | US HB1399 | Protect Children’s Innocence Act | HEALTHCARE | To amend chapter 110 of title 18, United State... | https://translegislation.com//bills/2024/US/HB... | https://legiscan.com/US/text/HB1399/id/2796538 |
| 4 | US HB1490 | Preventing Violence Against Female Inmates Act... | INCARCERATION | To secure the dignity and safety of incarcerat... | https://translegislation.com//bills/2024/US/HB... | https://legiscan.com/US/text/HB1490/id/2761146 |


In [40]:
# saving the dataframe as a csv file

df.to_csv('bill_data.csv')
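
By default, `to_csv` also writes the DataFrame's row index as an unnamed first column. If that column is not wanted, passing `index=False` leaves it out. Here is a sketch with placeholder data, writing to an in-memory buffer rather than to a file on disk:

```python
import io

import pandas as pd

# Placeholder data standing in for the scraped DataFrame
sample_df = pd.DataFrame({'title': ['US HB1'], 'category': ['MILITARY']})

buffer = io.StringIO()                  # stands in for a file on disk
sample_df.to_csv(buffer, index=False)   # index=False skips the row numbers

csv_text = buffer.getvalue()
print(csv_text)
```

The first line of the output holds only the column names, `title,category`, with no extra leading column for the index.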

And that's all! If you are on Google Colab, check your sidebar under the "files" tab. You should see a .csv file containing the data we've scraped from the `translegislation.com` website. Well done!

In the next section, we will look at an API method for getting legislative data, and save that data to a CSV file. In that activity, you'll see the differences in handling data across web scraping and API methods.