# Collecting and Appending Data
#### 11/28/2017 | Hunter Heaivilin | Sprint 2
#### Description
Exploring various tools to grab data (downloads, web scraping, APIs), clean it as needed, and add the scraped results to an existing dataset.

## Skill Backlog User Story
As a researcher, I need to understand command line-based applications, basic scripting, databases, web resources, APIs, and data warehouses so that I can download data from websites, connect to and query databases, interface with APIs, and web scrape.

### Possible Projects

- Go to website ([AGRICULTURAL DEDICATIONS](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/))
- Grab pdf ([2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf))
- Parse pdf for relevant data
- Compile data into some new format (e.g., csv)
- Add a bit of the data onto a url as a custom suffix
- Go to url with custom suffix (e.g., http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY=290170150000)
- Scrape data from url pages for more data that I want
- Add new data from url to existing dataset from pdf
- Join data with spatial dataset


- Pull data from API ([Organic Integrity Database API](https://organic.ams.usda.gov/integrity/Developer/APIHelp.aspx))
- Geocode the [2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf) from site address (or TMK, if I can find good data) to a csv with lat/long (via geocodio or similar) and map it with Python using Shapely and Fiona.
- Do the same with the [Organic Integrity Database API](https://organic.ams.usda.gov/integrity/Developer/APIHelp.aspx) to map certified organic farms in the state.

Project Proposal
--
This is what I propose to do in this project and why I think it will be useful to me and my overall objective


Go to a website ([AGRICULTURAL DEDICATIONS](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/)) and download a pdf with tables in it ([2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf)). Parse the pdf for relevant data and compile the data into a more useful format (e.g., csv).
- Add a bit of the data onto a url as a custom suffix
- Go to url with custom suffix (e.g., http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY=290170150000)
- Parse url page for more data that I want
- Add new data from url to existing dataset from pdf
- Join data with spatial dataset



## Key Questions
- What doth R?
- Call data from an existing set and use it to create custom urls
- Webscraping tools and basic operation

## Key Findings
- R is wonderful
- CRAN is my bu

## Gameplan
Here is my overall approach:
1. Use R to go to the City and County of Honolulu's Real Property Assessment Division [website](https://www.realpropertyhonolulu.com/) to grab data about [agricultural dedications](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/) in the county.
2. Download the pdf with tables of the dedicated agricultural parcels.
3. Parse the pdf for relevant data and compile the data into a more useful format (e.g., csv). 
4. Clean, if needed, the resulting file output
5. Find a way to add the Tax Map Key (TMK, e.g., 290170150000) from the csv file onto a url as a suffix to create custom urls (e.g., http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY=290170150000)
6. Find a way to do this over and over
7. Go to each url with its custom suffix
8. Scrape each url for more data that I want
9. Add new data from url to existing dataset from pdf/csv

*Stretch Goals*
10. Geocode address data from pdf/csv
11. Join with spatial dataset
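
Steps 5 and 6 of the gameplan can be sketched in a few lines of base R. This is only a minimal sketch: the `parcels` data frame and its `tmk` column are hypothetical stand-ins for the csv built from the pdf, and the second TMK is a made-up placeholder. The base url is taken from the example above.

```r
# Base of the parcel page url, copied from the example above
base_url <- "http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY="

# Hypothetical stand-in for the TMK column of the csv built from the pdf.
# TMKs are kept as character so that R never reformats them as numbers.
parcels <- data.frame(
  tmk = c("290170150000", "000000000000"),  # second value is a placeholder
  stringsAsFactors = FALSE
)

# paste0() is vectorised: one call builds a url for every TMK,
# which covers the "do this over and over" step without an explicit loop
parcels$url <- paste0(base_url, parcels$tmk)

print(parcels$url)
```

Each element of `parcels$url` could then be handed to whatever scraping tool ends up being used for steps 7 and 8.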



Day 1 Work
--
#### Compiling Data on Multiple Excel Sheets
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/80/Seal_of_Honolulu%2C_Hawaii.svg/2000px-Seal_of_Honolulu%2C_Hawaii.svg.png" alt="Drawing" align="right" style="width: 70px" />


-  Went to the RPAD [agricultural dedications](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/) page via browser and manually downloaded the [2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf) pdf.
- - - -



<img src="https://cdns.tblsft.com/sites/all/themes/tabwow/logo.png" alt="Drawing" align="right" style="width: 100px"/>

-  Opened the pdf in [Tableau](https://www.tableau.com/), which has a built-in [Data Interpreter](https://onlinehelp.tableau.com/current/pro/desktop/en-us/data_interpreter.html) designed to help clean data. It converts and opens the pdf as an .xlsx file, but puts each of the 33 pages on its own sheet. I now needed a way to compile the data from the 34 sheets (minus the extra first sheet, which holds the Tableau interpreter key) into a single Excel or csv sheet.

----
<img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" alt="Drawing" align="right" style="width: 70px;" />

- The [readxl](http://readxl.tidyverse.org/index.html) R package can read, fairly simply, from multiple sheets in either an .xls or .xlsx file.




In [None]:
# Download file from url 
download.file("https://www.realpropertyhonolulu.com/media/1465/ag.pdf", "/Users/hunterheaivilin/agdedis.pdf")

#Convert pdf into excel or csv


# Install necessary packages (map_df() in purrr relies on dplyr to bind rows)
install.packages("readxl")
install.packages("purrr")
install.packages("dplyr")

#Call package libraries
library(readxl)
library(purrr)

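#Define file (path to the .xlsx that Tableau produced from the pdf)
file <- 'path'

#List the sheet names in the workbook
sheets <- excel_sheets(file)

#Read the same range from every sheet and bind the rows into one data frame.
#sheets[-1] drops the first sheet, which holds the Tableau interpreter key.
df <- map_df(sheets[-1], ~ read_excel(file, sheet = .x, range = "A2:D45"))

write.csv(df, file = "df.csv")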

Day 2 Work
--
#### Downloading a File at a URL
Used [download.file](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/download.file.html) to grab the file from the website.


- - - -
#### Extracting Data from Tables in a PDF

 - - - -
<img src="http://blog.infographics.tw/wp-content/uploads/2015/06/cover4.jpg" alt="Drawing" align="right" style="width: 200px;"/>


1. Tried to follow [Extracting Tables from PDFs in R using the Tabulizer Package](https://www.r-bloggers.com/extracting-tables-from-pdfs-in-r-using-the-tabulizer-package/), which shows how to use the [Tabulizer](https://github.com/ropensci/tabulizer) package for R ([tutorial](https://ropensci.org/tutorials/tabulizer_tutorial/)), but unfortunately the install failed:
 
> install.packages("tabulizer")

> Warning in install.packages :

> package ‘tabulizer’ is not available (for R version 3.4.2)

Apparently others have had [this issue](https://github.com/ropensci/tabulizer/issues/44).
Fortunately there is an app version called [Tabula](http://tabula.technology/) that works quite well!

***Caveat:*** The csv output retained the column headers from each page, meaning some further post-processing is still needed.
<br />
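
One possible way to handle the repeated headers is sketched below in base R. It is only an illustration under assumptions: the `raw` data frame is a toy stand-in for the Tabula csv read without a header row, and its column names and values are hypothetical, not taken from the real parcel list.

```r
# Toy stand-in for the Tabula output: the header row repeats at each page break.
# All values here are hypothetical.
raw <- data.frame(
  V1 = c("TMK", "111", "222", "TMK", "333"),
  V2 = c("Acres", "1.5", "2.0", "Acres", "0.7"),
  stringsAsFactors = FALSE
)

# Treat the first row as the header
header <- unlist(raw[1, ], use.names = FALSE)

# Flag every row whose cells all match the header (the first row included)
is_header <- apply(raw, 1, function(row) all(row == header))

# Keep only the data rows and apply the header as column names
clean <- raw[!is_header, , drop = FALSE]
names(clean) <- header
rownames(clean) <- NULL

print(clean)
```

With a real file, `raw` would come from something like `read.csv(path, header = FALSE, stringsAsFactors = FALSE)`, so that every header row arrives as ordinary data and can be compared against the first.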
- - - -
<img src="https://pdftables.com/images/pdftables-logo.svg" alt="Drawing" align="right" style="width: 200px;"/>

2. [PDFTables](https://pdftables.com/) also has an [R package](https://github.com/expersso/pdftables) that wraps the API of the [web app](https://pdftables.com/). There is sparse [CRAN documentation](https://cran.r-project.org/web/packages/pdftables/pdftables.pdf) and a [tutorial](https://cran.r-project.org/web/packages/pdftables/vignettes/convert_pdf_tables.html).

- - - - 
3. [pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf)






## Peer Feedback on Day 3

After talking it over with a peer, I received the following feedback and decided to make these changes

## Here are some overall notes on the skills I learned
And perhaps some stream of consciousness notes about what I did, and other questions I might have

In [6]:
%load_ext rmagic



In [1]:
rmagic

NameError: name 'rmagic' is not defined

In [7]:
%R getwd()

ERROR:root:Line magic function `%R` not found.
