# School Board Minutes

Scrape all of the school board minutes from http://www.mineral.k12.nv.us/pages/School_Board_Minutes

Save a CSV called `minutes.csv` with the date and the URL to the file. The date should be formatted as YYYY-MM-DD.

**Bonus:** Download the PDF files

**Bonus 2:** Use [PDF OCR X](https://solutions.weblite.ca/pdfocrx/index.php) on one of the PDF files and see if it can be converted into text successfully.

* **Hint:** If you're just looking for links, there are a lot of other links on that page! Can you tell from the link itself whether it points to minutes or not? You'll want to use an "if" statement.
* **Hint:** You could also filter out bad links later on using pandas instead of when scraping
* **Hint:** If you get a weird error that you can't really figure out, you can always tell Python to just ignore it using `try` and `except`, like below. Python will try to do the stuff inside of 'try', but if it hits an error it will skip right out.
* **Hint:** Remember the codes at http://strftime.org
* **Hint:** If you have a date that you've parsed, you can use `.dt.strftime` to turn it into a specially-formatted string. You use the same codes (like %B etc) that you use for converting strings into dates.

```python
try:
    # blah blah your code
    # your code
    # your code
    pass
except Exception:
    pass
```

* **Hint:** You can use `.apply` to download each pdf, or you can use one of a thousand other ways. It'd be good `.apply` practice though!
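
The "if" statement hint above can be sketched roughly as follows. This is only an illustration under an assumed pattern: the helper name `looks_like_minutes` is made up, and the rule (link starts with `/files/` and ends with `.pdf`) is a guess based on how the minutes links on the page appear to be named. The last two sample links are invented stand-ins for the unrelated links on the page.

```python
def looks_like_minutes(href):
    """Guess whether an href points at a minutes PDF (assumed pattern)."""
    if href is None:
        return False
    href = href.strip().lower()
    return href.startswith("/files/") and href.endswith(".pdf")


# The first three hrefs follow the naming used on the page;
# the last two are made-up examples of unrelated links.
hrefs = [
    "/files/9.1.20_minutes.pdf",
    "/files/1.21.20.pdf",
    "/files/2-4-20_minutes.pdf",
    "/pages/Home",
    "mailto:someone@example.com",
]

minutes_links = []
for href in hrefs:
    if looks_like_minutes(href):
        minutes_links.append(href)

print(minutes_links)
```

Checking for the word "minutes" alone would miss files such as `/files/1.21.20.pdf`, which is why this sketch looks at the folder and extension instead.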

In [1]:
import re
import pandas as pd
from bs4 import BeautifulSoup
import requests

In [2]:
my_url = "http://www.mineral.k12.nv.us/pages/School_Board_Minutes"
raw_html = requests.get(my_url).content

soup_doc = BeautifulSoup(raw_html, "html.parser")
school = soup_doc('p', align='center')

In [7]:
school

[<p align="center" style="text-align: left;"><span style="color: #000000;"><a href="/files/9.1.20_minutes.pdf"><span style="font-family: arial, helvetica, sans-serif; color: #000000;">September 1, 2020</span></a></span></p>,
 <p align="center" style="text-align: left;"><a href="/files/8.11.20_minutes.pdf"><span style="color: #000000;"><span style="font-family: 'arial black', 'avant garde'; color: #000000;"><span style="font-family: arial, helvetica, sans-serif;">August 11, 2020</span></span></span></a></p>,
 <p align="center" style="text-align: left;"><a href="/files/7.28.20_minutes.pdf"><span style="color: #000000;"><span style="font-family: 'arial black', 'avant garde'; color: #000000;"><span style="font-family: arial, helvetica, sans-serif;">July 28, 2020</span></span></span></a></p>,
 <p align="center" style="text-align: left;"><a href="/files/7.14.20_minutes.pdf"><span style="color: #000000;"><span style="font-family: 'arial black', 'avant garde'; color: #000000;"><span style="fon

In [8]:
school_list = []
error_log = []

for each in school:
    try:
        school_dict = {}
        school_dict['date'] = each.text
        school_dict['link'] = each.a['href']
        school_list.append(school_dict)
    except (TypeError, KeyError):
        # No <a> tag (each.a is None) or no href: keep the paragraph for review
        error_log.append(each)

school_list

[{'date': 'September 1, 2020', 'link': '/files/9.1.20_minutes.pdf'},
 {'date': 'August 11, 2020', 'link': '/files/8.11.20_minutes.pdf'},
 {'date': 'July 28, 2020', 'link': '/files/7.28.20_minutes.pdf'},
 {'date': 'July 14, 2020', 'link': '/files/7.14.20_minutes.pdf'},
 {'date': 'June 16, 2020', 'link': '/files/6.16.20_minutes.pdf'},
 {'date': 'May 20, 2020', 'link': '/files/5.20.20_minutes.pdf'},
 {'date': 'April 7, 2020\xa0', 'link': '/files/4.7.20_minutes.pdf'},
 {'date': 'March 12, 2020', 'link': '/files/3.12.20_minutes.pdf'},
 {'date': 'March 5, 2020', 'link': '/files/3.5.20_minutes.pdf'},
 {'date': 'February 21, 2020', 'link': '/files/2.21.20_minutes.pdf'},
 {'date': 'February 4, 2020', 'link': '/files/2-4-20_minutes.pdf'},
 {'date': 'January 21, 2020', 'link': '/files/1.21.20.pdf'},
 {'date': 'January 7, 2020', 'link': '/files/1.7.20_pdf.pdf'},
 {'date': 'December 16, 2019', 'link': '/files/12.16.19_minutes.pdf'},
 {'date': 'December 3, 2019', 'link': '/files/12.3.19_minutes.pdf'

In [9]:
error_log

[<p align="center" style="text-align: left;"><strong style="color: #0000ff; font-family: 'times new roman', times; font-size: large;">2019 Board Meeting Minutes</strong></p>,
 <p align="center" style="text-align: left;">May 21, 2019 CANCELLED</p>,
 <p align="center" style="text-align: left;"> </p>,
 <p align="center" style="text-align: left;"><strong style="color: #0000ff; font-size: large; font-family: 'times new roman', times;">2018 Board Meeting Minutes</strong></p>,
 <p align="center" style="text-align: left;"><span>October 16, 2018</span>  </p>,
 <p align="center" style="text-align: left;"><span style="color: #000000;">June 6, 2018</span></p>]

In [12]:
#Converting list of dictionaries to dataframe

school_df = pd.DataFrame(school_list)
school_df

| | date | link |
|---|---|---|
| 0 | September 1, 2020 | /files/9.1.20_minutes.pdf |
| 1 | August 11, 2020 | /files/8.11.20_minutes.pdf |
| 2 | July 28, 2020 | /files/7.28.20_minutes.pdf |
| 3 | July 14, 2020 | /files/7.14.20_minutes.pdf |
| 4 | June 16, 2020 | /files/6.16.20_minutes.pdf |
| 5 | May 20, 2020 | /files/5.20.20_minutes.pdf |
| 6 | April 7, 2020 | /files/4.7.20_minutes.pdf |
| 7 | March 12, 2020 | /files/3.12.20_minutes.pdf |
| 8 | March 5, 2020 | /files/3.5.20_minutes.pdf |
| 9 | February 21, 2020 | /files/2.21.20_minutes.pdf |


In [19]:
school_df['date'] = pd.to_datetime(school_df['date'])
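
The assignment asks for dates as YYYY-MM-DD and a `minutes.csv` file, so the remaining steps might look something like the sketch below. It runs on a two-row sample copied from the scraped output rather than on `school_df`, and it assumes that "the URL to the file" means the full address, so the site root is prepended. The names `sample_df`, `cleaned`, and `parsed` are illustrative only.

```python
import pandas as pd

# Two sample rows copied from the scraped output; the second has a trailing
# non-breaking space (\xa0), as seen in the scraped data.
sample_df = pd.DataFrame([
    {"date": "September 1, 2020", "link": "/files/9.1.20_minutes.pdf"},
    {"date": "April 7, 2020\xa0", "link": "/files/4.7.20_minutes.pdf"},
])

# Strip stray whitespace, then parse with an explicit format
# (%B = full month name, %d = day of month, %Y = four-digit year).
cleaned = sample_df["date"].str.strip()
parsed = pd.to_datetime(cleaned, format="%B %d, %Y")

# Turn the parsed dates back into strings in YYYY-MM-DD form
sample_df["date"] = parsed.dt.strftime("%Y-%m-%d")

# Assumption: the CSV should hold the full URL, so prepend the site root
sample_df["url"] = "http://www.mineral.k12.nv.us" + sample_df["link"]

# Write only the two requested columns, without the index
sample_df[["date", "url"]].to_csv("minutes.csv", index=False)
```

Swapping `sample_df` for the full `school_df` should give the complete file, provided every date on the page follows the same "Month day, year" pattern.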

In [20]:
import urllib.request  # Python 3: urllib2 was split into urllib.request and urllib.error

In [44]:
#Downloading the PDFs

def get_my_pdf(link):
    url = "http://www.mineral.k12.nv.us" + link
    pdf_file = requests.get(url)
    # Save under the bare filename; the path '/files/...' does not exist locally
    filename = link.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(pdf_file.content)

In [45]:
get_my_pdf('/files/6.28.18.pdf')
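
To download every file rather than a single one, a helper along the lines below could be handed to `.apply`. This is a sketch, not the notebook's own code: `download_target` and `download_pdf` are hypothetical names, the 30-second timeout is an arbitrary choice, and `requests` is imported inside the function so that the path logic can be tried without it. Each file is saved under its bare filename in the current directory, which avoids writing into a local `/files/` folder that does not exist.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin

BASE_URL = "http://www.mineral.k12.nv.us"


def download_target(link):
    """Return (absolute URL, local filename) for a site-relative link."""
    return urljoin(BASE_URL, link), PurePosixPath(link).name


def download_pdf(link):
    """Fetch one PDF and save it in the current directory."""
    import requests  # imported here so download_target works without it

    url, filename = download_target(link)
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on 404s instead of saving an error page
    with open(filename, "wb") as f:
        f.write(response.content)
    return filename


print(download_target("/files/9.1.20_minutes.pdf"))
```

Assuming `school_df` still has its `link` column, `school_df['link'].apply(download_pdf)` would fetch each file in turn and return the saved filenames.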