# Collecting and Appending Data
#### 11/28/2017 | Hunter Heaivilin | Sprint 2
#### Description
Exploring various tools to grab data (downloads, web scraping, APIs), clean it as needed, and add data scraped from websites to the existing data.

## Skill Backlog User Story
As a researcher, I need to understand command line-based applications, basic scripting, databases, web resources, APIs, and data warehouses so that I can download data from websites, connect to and query databases, interface with APIs, and web scrape.

### Possible Projects

- Go to website ([AGRICULTURAL DEDICATIONS](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/))
- Grab pdf ([2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf))
- Parse pdf for relevant data
- Compile data into some new format (e.g., csv)
- Add a bit of the data onto a url as custom suffix
- Go to url with custom suffix (e.g., http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY=290170150000)
- Scrape data from url pages for more data that I want
- Add new data from url to existing dataset from pdf
- Join data with spatial dataset


- Pull data from APIs ([Organic Integrity Database API](https://organic.ams.usda.gov/integrity/Developer/APIHelp.aspx) or one of [these](https://www.programmableweb.com/category/agriculture/api) ag related ones)
- Geocode the [2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf) from site address (or TMK if I can find good data) to a csv with lat/long (via geocodio or similar) and map it with Python using Shapely and Fiona.
- Perform similar but using [Organic Integrity Database API](https://organic.ams.usda.gov/integrity/Developer/APIHelp.aspx) to map certified organic farms in the state.

Project Proposal
--
This is what I propose to do in this project and why I think it will be useful to me and my overall objective


Go to a website ([AGRICULTURAL DEDICATIONS](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/)) and download a pdf with tables in it ([2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf)). Parse the pdf for relevant data and compile the data into a more useful format (e.g., csv).
- Add a bit of the data onto a url as custom suffix
- Go to url with custom suffix (e.g., http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY=290170150000)
- Parse url page for more data that I want
- Add new data from url to existing dataset from pdf
- Join data with spatial dataset
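
The "add new data to the existing dataset" step is essentially a keyed join on the TMK. Here is a minimal sketch in base R, assuming both tables carry a `TMK` column stored as text; the column names `Acres` and `LandClass` and all values are hypothetical placeholders, not fields confirmed from the pdf or the parcel pages.

```r
# Parcels parsed from the pdf (hypothetical columns and values; TMK kept as text)
parcels <- data.frame(
  TMK = c("290170150000", "290170160000"),
  Acres = c(2.5, 10),
  stringsAsFactors = FALSE
)

# Extra details scraped from the per-parcel pages (hypothetical)
scraped <- data.frame(
  TMK = c("290170150000"),
  LandClass = c("AGRICULTURAL"),
  stringsAsFactors = FALSE
)

# Left join: keep every parcel, fill NA where nothing was scraped
combined <- merge(parcels, scraped, by = "TMK", all.x = TRUE)
print(combined)
```

Using `all.x = TRUE` keeps parcels whose page could not be scraped, so gaps stay visible instead of silently dropping rows. The same key could later be used to join against the attribute table of a spatial dataset, assuming that dataset also carries the TMK.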



## Key Questions
- What doth R?
- Call data from an existing set and use it to create custom urls
- Webscraping tools and basic operation

## Key Findings
- R is wonderful
- CRAN is my bu

## Gameplan
Here is my overall approach:
1. Use R to go to the City and County of Honolulu's Real Property Assessment Division [website](https://www.realpropertyhonolulu.com/) to grab data about [agricultural dedications](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/) in the county.
2. Download a pdf with tables of the dedicated agricultural parcels.
3. Parse the pdf for relevant data and compile the data into a more useful format (e.g., csv). 
4. Clean, if needed, the resulting file output
5. Find a way to add the Tax Map Key (TMK, e.g., 290170150000) from the csv file onto a url as a suffix to create custom urls (e.g., http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY=290170150000)
6. Find a way to do this over and over
7. Go to url with custom suffix
8. Scrape the url for more data that I want
9. Add new data from url to existing dataset from pdf/csv

*Stretch Goals*
10. Geocode address data from pdf/csv
11. Join with spatial dataset
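
Steps 5 and 6 of the gameplan (append a TMK to a base url, then repeat for every TMK) can be sketched with `paste0()`, which is vectorized and so handles the "over and over" part without an explicit loop. The base url is the one from the example above; the second TMK is a made-up value for illustration only.

```r
# Base url taken from the example above; the TMK goes at the end
base_url <- "http://qpublic9.qpublic.net/hi_honolulu_display.php?county=hi_honolulu&KEY="

# TMKs kept as character strings so they are never reformatted as numbers
# (the second value is hypothetical)
tmks <- c("290170150000", "290170160000")

# paste0() is vectorized: one url per TMK
urls <- paste0(base_url, tmks)
print(urls)
```

When the TMKs come from a csv, reading that column as text (for example with `colClasses = "character"` in `read.csv()`) avoids any risk of the keys being altered by numeric formatting.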



Day 1 Work
--
#### Compiling Data on Multiple Excel Sheets
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/80/Seal_of_Honolulu%2C_Hawaii.svg/2000px-Seal_of_Honolulu%2C_Hawaii.svg.png" alt="Drawing" align="right" style="width: 70px" />


-  Went to the RPAD [agricultural dedications](https://www.realpropertyhonolulu.com/dedications/agricultural-dedications/) page via browser and manually downloaded the [2017 Dedicated Agricultural Parcels list](https://www.realpropertyhonolulu.com/media/1465/ag.pdf) pdf.
- - - -



<img src="https://cdns.tblsft.com/sites/all/themes/tabwow/logo.png" alt="Drawing" align="right" style="width: 100px"/>

-  Opened the pdf in [Tableau](https://www.tableau.com/), which has a built-in [Data Interpreter](https://onlinehelp.tableau.com/current/pro/desktop/en-us/data_interpreter.html) designed to help clean data. It converts and opens the pdf as an .xlsx file, but each of the 33 pages lands on its own sheet. I now needed a way to compile the data from the 34 sheets (minus the extra first sheet that has the Tableau interpreter key) into a single Excel or csv sheet.

----
<img src="https://www.rstudio.com/wp-content/uploads/2017/05/readxl-259x300.png" alt="Drawing" align="right" style="width: 70px;" />

- The [readxl](http://readxl.tidyverse.org/index.html) R package can read, fairly simply, from multiple sheets in either an .xls or .xlsx file.




In [4]:
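# Download file from url (mode = "wb" keeps the binary pdf intact on Windows)
download.file("https://www.realpropertyhonolulu.com/media/1465/ag.pdf", "/Users/hunterheaivilin/agdedis.pdf", mode = "wb")

# Convert pdf into excel or csv (done manually in Tableau, see above)


# Install necessary packages (map_df() also needs dplyr installed for the row binding)
install.packages("readxl")
install.packages("purrr")

# Call package libraries
library(readxl)
library(purrr)

# Define file (path to the .xlsx file exported from Tableau)
file <- 'path'

# Make variable from excel sheet names, dropping the first sheet (the Tableau interpreter key)
sheets <- excel_sheets(file)[-1]

# Make dataframe from the same range of cells on each sheet, stacked row-wise
# (.x is the current sheet name; a fixed sheet = 2 would read the same sheet every time)
df <- map_df(sheets, ~ read_excel(file, sheet = .x, range = "A2:D45"))

# Write the combined data to csv without the row-number column
write.csv(df, file = "df.csv", row.names = FALSE)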

NameError: name 'download' is not defined
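
The sheet-combining idea can also be tried without any Excel file. Below is a base R sketch that stacks a list of data frames and records which sheet each row came from. The two small data frames stand in for what `read_excel()` would return, and their column names and values are hypothetical.

```r
# Hypothetical stand-ins for two sheets read with read_excel()
sheet_data <- list(
  Sheet2 = data.frame(TMK = c("290170150000"), Acres = c(2.5),
                      stringsAsFactors = FALSE),
  Sheet3 = data.frame(TMK = c("290170160000", "290170170000"), Acres = c(10, 1.2),
                      stringsAsFactors = FALSE)
)

# Tag every row with the name of the sheet it came from
tagged <- Map(function(df, name) {
  df$source_sheet <- name
  df
}, sheet_data, names(sheet_data))

# Stack all sheets into one data frame
combined <- do.call(rbind, tagged)
rownames(combined) <- NULL
print(combined)
```

Keeping a `source_sheet` column makes it easier to trace a suspicious row back to the pdf page it came from.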

Day 2 Work
--
#### Downloading a File at a URL
Used [download.file](https://stat.ethz.ch/R-manual/R-devel/library/utils/html/download.file.html) to grab the file from the website.


- - - -
#### Extracting Data from Tables in a PDF

 - - - -
<img src="http://blog.infographics.tw/wp-content/uploads/2015/06/cover4.jpg" alt="Drawing" align="right" style="width: 200px;"/>


1. Tried to follow [Extracting Tables from PDFs in R using the Tabulizer Package](https://www.r-bloggers.com/extracting-tables-from-pdfs-in-r-using-the-tabulizer-package/), which shows how to use the [Tabulizer](https://github.com/ropensci/tabulizer) package for R ([tutorial](https://ropensci.org/tutorials/tabulizer_tutorial/)), but unfortunately the install failed:
 
> install.packages("tabulizer")

> Warning in install.packages :

> package ‘tabulizer’ is not available (for R version 3.4.2)

Apparently others have had [this issue](https://github.com/ropensci/tabulizer/issues/44).
Fortunately there is an app version called [Tabula](http://tabula.technology/) that works quite well!

***Caveat:*** The csv output retained the column headers from each page, meaning some further post-processing is still needed.
<br />
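
One way to handle that post-processing is to drop any row that is just the header repeated. This is a small sketch in base R, assuming the repeated header rows are exact copies of the first header line; the column names and values in `csv_text` are hypothetical stand-ins for the Tabula output.

```r
# Hypothetical csv text with the header repeated where a new page started
csv_text <- "TMK,Address
111,1 Hypothetical St
TMK,Address
222,2 Hypothetical Ave"

# Read everything as text so the keys are not converted to numbers
parcels <- read.csv(text = csv_text, colClasses = "character")

# A repeated header row has the column name sitting in the first column
is_repeated_header <- parcels[[1]] == names(parcels)[1]
parcels <- parcels[!is_repeated_header, , drop = FALSE]
rownames(parcels) <- NULL
print(parcels)
```

For the real file, `read.csv(text = csv_text, ...)` would be replaced with a read of the csv exported from Tabula.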
- - - -
<img src="https://pdftables.com/images/pdftables-logo.svg" alt="Drawing" align="right" style="width: 200px;"/>

2. [PDFTables](https://pdftables.com/) also has an [R package](https://github.com/expersso/pdftables) that wraps the API of the [web app](https://pdftables.com/). There is sparse [CRAN documentation](https://cran.r-project.org/web/packages/pdftables/pdftables.pdf) and a [tutorial](https://cran.r-project.org/web/packages/pdftables/vignettes/convert_pdf_tables.html).

- - - - 
3. [pdftools](https://cran.r-project.org/web/packages/pdftools/pdftools.pdf)
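
With pdftools, `pdf_text()` returns one long character string per page, so the table structure has to be rebuilt by hand. The sketch below shows only that rebuilding step on a hypothetical page string. It assumes columns are separated by runs of two or more spaces, which may not hold for the real pdf.

```r
# In practice the page text would come from pdftools, e.g.:
#   pages <- pdftools::pdf_text("/Users/hunterheaivilin/agdedis.pdf")
#   page_text <- pages[1]
# Here a hypothetical page is used so the example runs on its own
page_text <- paste(
  "TMK            ADDRESS              ACRES",
  "290170150000   1 HYPOTHETICAL RD    2.5",
  "290170160000   2 HYPOTHETICAL RD    10.0",
  sep = "\n"
)

# One element per line of the page
lines <- strsplit(page_text, "\n")[[1]]

# Split each line wherever two or more spaces appear (assumed column gap)
rows <- strsplit(trimws(lines), " {2,}")

# First line holds the headers; the rest are data rows
parcel_table <- as.data.frame(do.call(rbind, rows[-1]), stringsAsFactors = FALSE)
names(parcel_table) <- rows[[1]]
print(parcel_table)
```

Single spaces inside a value (like the address) survive because only gaps of two or more spaces are treated as column breaks.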



### Scraping with Various Tools

**tl;dr:** Scraping an old website is a considerable pain, even when following a [handy tutorial](https://bradleyboehmke.github.io//2015/12/scraping-html-tables.html).

In [2]:
# Check for packages, install if necessary

# Create list of packages needed for process
list.of.packages <- c("rvest", "magrittr")

# Check list of needed packages against list of already installed packages and return those not installed to the new.packages variable
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]

# Check new.packages variable for any data, if present, install new packages
if(length(new.packages)) install.packages(new.packages)


# Call package libraries
library(rvest)
library(magrittr)

# Create variable with html of webpage
webpage <- read_html("http://ecocrop.fao.org/ecocrop/srv/en/dataSheet?id=17807")

# Grab all the tables from the webpage
tbls <- html_nodes(webpage, "table")


# Or, since none of the tables have unique identifiers (<table> for all!) you can
# create empty list to add table data to
tbls2_ls <- list()

# then specify which table(s) you want to grab & name them something useful (e.g., Ecology, ... , Uses)
tbls2_ls$Description <- webpage %>%
  html_nodes("table") %>%
  .[1] %>%
  html_table(fill = TRUE) %>%
  .[[1]]

tbls2_ls$Ecology <- webpage %>%
  html_nodes("table") %>%
  .[2] %>%
  html_table(fill = TRUE) %>%
  .[[1]]

tbls2_ls$ClimaticZone <- webpage %>%
  html_nodes("table") %>%
  .[3] %>%
  html_table(fill = TRUE) %>%
  .[[1]]

tbls2_ls$Cultivation <- webpage %>%
  html_nodes("table") %>%
  .[4] %>%
  html_table(fill = TRUE) %>%
  .[[1]]

tbls2_ls$Uses <- webpage %>%
  html_nodes("table") %>%
  .[6] %>%
  html_table(fill = TRUE) %>%
  .[[1]]
 


IndentationError: unindent does not match any outer indentation level (<tokenize>, line 37)
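
Since the five blocks above differ only in the table position, the pattern could be folded into a helper. This is a sketch only: `get_table()` is a hypothetical name, and the inline HTML is a tiny stand-in for the real page, so it runs without hitting the website.

```r
library(rvest)

# Tiny inline page standing in for the real site (hypothetical content)
page <- read_html("<html><body>
  <table><tr><td>Life form</td><td>tree</td></tr></table>
  <table>
    <tr><td>Soil depth</td><td>deep</td></tr>
    <tr><td>Soil texture</td><td>medium</td></tr>
  </table>
</body></html>")

# Hypothetical helper: return the n-th <table> on the page as a data frame,
# or NULL when the page has fewer tables than expected
get_table <- function(page, n) {
  tables <- html_nodes(page, "table")
  if (n > length(tables)) return(NULL)
  as.data.frame(html_table(tables[[n]]))
}

# Map useful names to table positions, then grab them all in one pass
wanted <- c(Description = 1, Ecology = 2)
tbls <- lapply(wanted, get_table, page = page)
print(tbls$Ecology)
```

The named vector `wanted` keeps the name-to-position mapping in one place, which helps when a page skips a table (as with position 5 above).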

In [2]:
#Clean up list tables into a better format

# Rename columns with variables in first row
colnames(tbls2_ls$Uses) <- tbls2_ls$Uses[1,]

#Remove first row
tbls2_ls$Uses <- tbls2_ls$Uses[-1,]


SyntaxError: invalid syntax (<ipython-input-2-cf6aa2744b11>, line 4)
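
The two cleanup steps above (use the first row as column names, then drop it) can be wrapped in a small function so they apply to any of the scraped tables. `promote_first_row()` is a hypothetical helper name, and the example table is made up to mimic the shape of the 'Uses' table.

```r
# Hypothetical helper: turn the first row of a table into its column names
promote_first_row <- function(tbl) {
  tbl <- as.data.frame(tbl, stringsAsFactors = FALSE)
  # unlist() flattens the one-row data frame into a plain character vector
  colnames(tbl) <- as.character(unlist(tbl[1, ]))
  # Drop the row that is now the header and renumber the remaining rows
  tbl <- tbl[-1, , drop = FALSE]
  rownames(tbl) <- NULL
  tbl
}

# Made-up table whose real headers sit in the first row
uses_raw <- data.frame(
  X1 = c("Main use", "food"),
  X2 = c("Detailed use", "fruit"),
  stringsAsFactors = FALSE
)

uses <- promote_first_row(uses_raw)
print(uses)
```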




## Peer Feedback on Day 3

After talking it over with a peer, I received the following feedback and decided to make these changes

## Here are some overall notes on the skills I learned
And perhaps some stream of consciousness notes about what I did, and other questions I might have

### Scraping from HTML tables into R

In [1]:

# Call package libraries
library(rvest)
library(magrittr)

# Grab list of crop variables from a csv
item_list <- read.csv("/Users/hunterheaivilin/Downloads/2num.csv", header = FALSE)
izem <- data.frame(item_list[1,])

# Grab list of crop codes and urls
urllists <- read.csv("/Users/hunterheaivilin/Downloads/datasheeturl.csv")

# Create variable of all urls
webpages <- urllists

# Work with just the first three rows while testing
truncated <- data.frame(urllists[1:3,])


# for loop to move through each url in the truncated list
for(i in 1:nrow(truncated)) {
  url <- toString(truncated[i,1])
  html <- read_html(url)

  #Grab the species name
  species <- html_text(html_nodes(html, "h2"))
  
  # Create empty list to add table data into
  tbls2_ls <- list()
  
  # Specify which table(s) from html you want to grab & name them something useful (e.g., Ecology, ... , Uses)
  tbls2_ls$Description <- html %>%
    html_nodes("table") %>% 
    .[1] %>%
    html_table(fill = TRUE) %>%
    .[[1]]
  
  tbls2_ls$Ecology <- html %>%
    html_nodes("table") %>% 
    .[2] %>%
    html_table(fill = TRUE) %>%
    .[[1]]
  
  tbls2_ls$ClimaticZone <- html %>%
    html_nodes("table") %>% 
    .[3] %>%
    html_table(fill = TRUE) %>%
    .[[1]]
  
  tbls2_ls$Cultivation <- html %>%
    html_nodes("table") %>% 
    .[4] %>%
    html_table(fill = TRUE) %>%
    .[[1]]
  
  tbls2_ls$Uses <- html %>%
    html_nodes("table") %>% 
    .[6] %>%
    html_table(fill = TRUE) %>%
    .[[1]]
  
  
  #Clean up 'Uses' table into a better format
  # Rename columns with variables in first row
  colnames(tbls2_ls$Uses) <- tbls2_ls$Uses[1,]
  
  #Remove first row
  tbls2_ls$Uses <- tbls2_ls$Uses[-1,]
  
  # Assign variables from table data
  # Create variables that coincide with item_list
  
  c1 <- "crop code should go here"
  c2 <- species
  c3 <- tbls2_ls$Description[1, 2]
  c4 <- tbls2_ls$Description[2, 2]
  c5 <- tbls2_ls$Description[3, 2]
  c6 <- tbls2_ls$Description[1, 4]
  c7 <- tbls2_ls$Description[2, 4]
  c8 <- tbls2_ls$Description[3, 4]
  c9 <- tbls2_ls$Ecology[3,2]
  c10 <- tbls2_ls$Ecology[3,3]
  c11 <- tbls2_ls$Ecology[3,4]
  c12 <- tbls2_ls$Ecology[3,5]
  c13 <- tbls2_ls$Ecology[4,2]
  c14 <- tbls2_ls$Ecology[4,3]
  c15 <- tbls2_ls$Ecology[4,4]
  c16 <- tbls2_ls$Ecology[4,5]
  c17 <- tbls2_ls$Ecology[5,2]
  c18 <- tbls2_ls$Ecology[5,3]
  c19 <- tbls2_ls$Ecology[5,4]
  c20 <- tbls2_ls$Ecology[5,5]
  c21 <- tbls2_ls$Ecology[6,2]
  c22 <- tbls2_ls$Ecology[6,3]
  c23 <- tbls2_ls$Ecology[6,4]
  c24 <- tbls2_ls$Ecology[6,5]
  c25 <- tbls2_ls$Ecology[7,2]
  c26 <- tbls2_ls$Ecology[7,3]
  c27 <- tbls2_ls$Ecology[7,4]
  c28 <- tbls2_ls$Ecology[7,5]
  c29 <- tbls2_ls$Ecology[8,2]
  c30 <- tbls2_ls$Ecology[8,3]
  c31 <- tbls2_ls$Ecology[8,4]
  c32 <- tbls2_ls$Ecology[8,5]
  c33 <- tbls2_ls$Ecology[2,7]
  c34 <- tbls2_ls$Ecology[2,8]
  c35 <- tbls2_ls$Ecology[3,7]
  c36 <- tbls2_ls$Ecology[3,8]
  c37 <- tbls2_ls$Ecology[4,7]
  c38 <- tbls2_ls$Ecology[4,8]
  c39 <- tbls2_ls$Ecology[5,7]
  c40 <- tbls2_ls$Ecology[5,8]
  c41 <- tbls2_ls$Ecology[6,7]
  c42 <- tbls2_ls$Ecology[6,8]
  c43 <- tbls2_ls$Ecology[7,7]
  c44 <- tbls2_ls$Ecology[7,8]
  c45 <- tbls2_ls$ClimaticZone[1,2]
  c46 <- tbls2_ls$ClimaticZone[1,4]
  c47 <- tbls2_ls$ClimaticZone[2,2]
  c48 <- tbls2_ls$ClimaticZone[2,4]
  c49 <- tbls2_ls$ClimaticZone[3,2]
  c50 <- tbls2_ls$ClimaticZone[3,4]
  c51 <- tbls2_ls$ClimaticZone[4,2]
  c52 <- tbls2_ls$Cultivation[2,2]
  c53 <- tbls2_ls$Cultivation[3,1]
  c54 <- tbls2_ls$Cultivation[3,2]
  c55 <- tbls2_ls$Cultivation[3,3]
  c56 <- tbls2_ls$Cultivation[3,4]
  c57 <- tbls2_ls$Cultivation[3,5]
  c58 <- tbls2_ls$Cultivation[2,4]
  c59 <- tbls2_ls$Cultivation[2,5]
  c60 <- c(tbls2_ls$Uses[1,1])
  c61 <- c(tbls2_ls$Uses[1,2])
  c62 <- c(tbls2_ls$Uses[1,3])
  c63 <- url
  
  # Make a big 'ol list
  crop_data <- list (c1, c2, c3, c4 ,c5, c6, c7, c8, c9, c10, c11, c12, c13, c14, c15, c16, c17, c18, c19, c20, c21, c22, c23, c24, c25, c26, c27, c28, c29, c30, c31, c32, c33, c34, c35, c36, c37, c38, c39, c40, c41, c42, c43, c44, c45, c46, c47, c48, c49, c50, c51, c52, c53, c54, c55, c56, c57, c58, c59, c60, c61, c62, c63)

  # Flatten each value to a single string, then transpose into a one-row data frame
  crop_row <- data.frame(t(sapply(crop_data, function(x) paste(as.character(x), collapse = "; "))),
                         stringsAsFactors = FALSE)

  print(crop_row)

  # Attach this crop as a row in the running data frame
  if (i == 1) {
    crop_df <- crop_row
  } else {
    crop_df <- rbind(crop_df, crop_row)
  }
}

# crop_df now holds one row per url that was read
newb <- crop_df

# parse and return multiple grabs as a ____



 


SyntaxError: invalid syntax (<ipython-input-1-e06ded7b64df>, line 20)
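
One option for returning multiple grabs is a list of one-row data frames that gets stacked at the end, with `tryCatch()` guarding each url so a single broken page does not stop the whole run. In this sketch `scrape_one()` is a hypothetical stand-in for the real scraping steps, and the urls are placeholders.

```r
# Hypothetical stand-in for the real scraping steps: fails on a bad url,
# otherwise returns a one-row data frame
scrape_one <- function(url) {
  if (!grepl("^http", url)) stop("not a url: ", url)
  data.frame(url = url, n_chars = nchar(url), stringsAsFactors = FALSE)
}

# Placeholder urls; the middle one is deliberately broken
urls <- c("http://example.com/a", "broken", "http://example.com/b")

# Scrape each url, returning NULL instead of stopping when one fails
results <- lapply(urls, function(u) {
  tryCatch(scrape_one(u), error = function(e) NULL)
})

# Drop the failures and stack the rest into one data frame
results <- Filter(Negate(is.null), results)
all_rows <- do.call(rbind, results)
print(all_rows)
```

Collecting rows in a list and binding once at the end also avoids growing a data frame inside the loop, which gets slow as the number of urls increases.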