## Scraping basics

If you're reading this in advance of the scraping class, consider attending Melissa's session on How the Internet Works. It'll be a useful introduction to the processes underlying the code you write.

Code-free tools for data harvesting are handy in a pinch, but scripts written in Python or another language are more flexible and adaptable. They can also run automatically in the background on a schedule. And you don't have to worry about a service or tool disappearing and taking all your hard work with it.

Using code -- especially if you host it on a site like GitHub -- also makes your work reproducible, saving you work the next time you need to complete a similar task, and making your methods more transparent.

<hr>

This exercise uses a simple HTML table as an example before trying a live site.

In [None]:
# import modules to facilitate the scrape
import csv

from bs4 import BeautifulSoup

### Load the file
```python
open('some file', 'read/write/append?')
# open a file and tell Python how to treat it
# the second argument is the mode: 'r' to read, 'w' to write, 'a' to append
```
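
As a rough illustration of the three modes, here is a minimal sketch that writes to, appends to, and then reads a scratch file. The file name `scratch.txt` and the temporary directory are assumptions made for this example and are not part of the exercise. The `with` statement used here closes each file automatically when its block ends.

```python
import os
import tempfile

# hypothetical scratch file, created in a temporary directory
scratch_path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

# 'w' creates the file (or overwrites an existing one)
with open(scratch_path, 'w') as f:
    f.write('first line\n')

# 'a' adds to the end of the file instead of overwriting it
with open(scratch_path, 'a') as f:
    f.write('second line\n')

# 'r' opens the file for reading only
with open(scratch_path, 'r') as f:
    lines = f.readlines()

print(lines)
```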

In [None]:
# open the HTML file with open() and stick it in a variable
our_file = open('simple_table.html', 'r')

### Parse HTML with `BeautifulSoup`

In [None]:
# let BeautifulSoup parse the HTML
soup = BeautifulSoup(our_file, 'html.parser')

# close the initial file we opened
our_file.close()

### Target the data

Two key ways (among many) to isolate specific sections of the web page in question with `BeautifulSoup`:
```python
soup.find('some HTML tag')
# returns the first tag that matches, or None if nothing matches

soup.find_all('some HTML tag')
# returns a list of all tags that match
```

(`BeautifulSoup` also has [detailed documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) for the various ways in which it can parse HTML and XML.)
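
To see the difference between the two methods without opening a file, here is a small sketch that parses an HTML string directly. The `sample_html` string and the variable names are made up for illustration and are not the contents of `simple_table.html`.

```python
from bs4 import BeautifulSoup

# hypothetical stand-in for the contents of an HTML file
sample_html = """
<table>
  <tr><th>Name</th><th>Team</th></tr>
  <tr><td>Ada</td><td>Data</td></tr>
  <tr><td>Grace</td><td>Graphics</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# .find() hands back a single tag: the first <tr> on the page
first_row = sample_soup.find('tr')

# .find_all() hands back a list-like collection of every <tr>
all_rows = sample_soup.find_all('tr')

# when nothing matches, .find() gives None and .find_all() gives an empty list
missing = sample_soup.find('img')
missing_all = sample_soup.find_all('img')

print(first_row)
print(len(all_rows))
print(missing, missing_all)
```

Because `.find()` can return `None`, it is worth checking the result before calling another method on it when you move on to a live site.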

In [None]:
# snip out the table and pass it to a new variable
tbl = soup.find('table')

In [None]:
# use .find_all to create a list of rows in the table
tbl_rows = tbl.find_all('tr')

In [None]:
# isolate the second row and print it
print(tbl_rows[1])

In [None]:
# use .find_all again to generate a list of the row's cells and display it
cells = tbl_rows[1].find_all('td')
cells

`BeautifulSoup` lets you fetch text from within HTML tags:
```python
soup.text
# returns the text in a tag as a string
```

In [None]:
# print the first item in cells, then try printing what .text can extract from the same thing
print(cells[0])
print(cells[0].text)

### Extract the data

OK, now for the tricky part. We need to loop through the rows in the table and extract the text from each cell. We'll set up an empty list beforehand and append each row of extracted data to it as its own list.

In [None]:
# make an empty list to hold extracted data
table_data = []

# loop through rows, and then each cell in each row, appending a list of extracted text
for row in tbl.find_all('tr'):
    cells = row.find_all(['th', 'td'])
    # build the row from every cell, however many columns the table has
    table_data.append([cell.text for cell in cells])
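
To check what shape the loop produces, here is a self-contained sketch that runs the same pattern on a small inline table. The `sample_html` string and the `sample_` variable names are hypothetical, and the sketch adds `.strip()` to trim whitespace, which the exercise code does not do.

```python
from bs4 import BeautifulSoup

# hypothetical three-column table standing in for simple_table.html
sample_html = """
<table>
  <tr><th>Name</th><th>Team</th><th>Year</th></tr>
  <tr><td>Ada</td><td>Data</td><td>2019</td></tr>
  <tr><td>Grace</td><td>Graphics</td><td>2021</td></tr>
</table>
"""

sample_tbl = BeautifulSoup(sample_html, 'html.parser').find('table')

sample_data = []
for row in sample_tbl.find_all('tr'):
    # passing a list matches header cells (<th>) and data cells (<td>) alike
    cells = row.find_all(['th', 'td'])
    # one inner list per row, one string per cell
    sample_data.append([cell.text.strip() for cell in cells])

print(sample_data)
```

The result is a list of lists, which is the structure `csv.writer` expects when writing several rows at once.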

### Write the data to CSV

```python
writer_obj = csv.writer('some file we opened')
# make a writer object that can move information from your script to a file in CSV form

writer_obj.writerow('some list of strings')
# write a single row

writer_obj.writerows('some list of lists of strings')
# write a bunch of rows
```

Check out [the documentation](https://docs.python.org/3/library/csv.html) for more examples of how the `csv` module works.
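
To try the writer without creating a file, the sketch below writes into an in-memory text buffer (`io.StringIO` from the standard library) and prints the result. The rows are made-up sample data. Note how the writer quotes a value that contains a comma.

```python
import csv
import io

# an in-memory stand-in for an open file
buffer = io.StringIO()
writer_obj = csv.writer(buffer)

# writerow() takes one list of values
writer_obj.writerow(['name', 'team'])

# writerows() takes a list of lists
writer_obj.writerows([['Ada', 'Data'], ['Grace', 'Graphics, Web']])

csv_text = buffer.getvalue()
print(csv_text)
```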

In [None]:
# open a file and write our data to it
# newline='' stops Python 3 from adding blank lines between rows on Windows
with open('simple.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    writer.writerows(table_data)

A "manual" version of the above, which opens, writes to, and closes the file without `with`:

```python
outfile = open('simple.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerows(table_data)
outfile.close()
```
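
One way to confirm the file holds what you expect is to read it back with `csv.reader`. This is a minimal sketch, assuming a hypothetical `sample.csv` written to a temporary directory rather than the `simple.csv` from the exercise.

```python
import csv
import os
import tempfile

# hypothetical data and file path for this example
sample_rows = [['Name', 'Team'], ['Ada', 'Data']]
sample_path = os.path.join(tempfile.mkdtemp(), 'sample.csv')

# write the rows; the file is closed when the with block ends
with open(sample_path, 'w', newline='') as outfile:
    csv.writer(outfile).writerows(sample_rows)

# read the rows back; csv.reader yields each row as a list of strings
with open(sample_path, 'r', newline='') as infile:
    rows_read = list(csv.reader(infile))

print(rows_read)
```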