# Data Wrangling in R (part 2)

## Table of Contents

- [Reshaping Data with tidyr](#resh)
- [Accessing databases with dbplyr](#db)
- [Accessing Web APIs](#api)

---
<a id='resh'></a>

## Reshaping Data with tidyr

One of the most common data wrangling challenges is adjusting how rows and columns are used to represent data. [**tidyr ("tidy-er")**](https://tidyr.tidyverse.org/) is a package that helps you restructure data frames into the shape you need (changing their orientation) for visualization, running a statistical model, or implementing a machine learning algorithm. **tidyr** helps you follow the **principles of tidy data**. Tidy data is data where:

- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.


<img src="images/data-wide.png" alt="" style="width: 500px;"/>

The format is wide because the price data is spread across multiple columns.

<img src="images/data-long.png" alt="" style="width: 500px;"/>

The format is long because the price data has its own column. This format repeats cities and bands across rows.
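To experiment with the wide and long layouts described above, you need a small wide-format data frame to start from. The following is a minimal sketch: the city names, band column names, and prices are all invented for illustration and are not taken from the images.

```r
# Hypothetical wide-format data: one row per city, one price column per band.
# All names and numbers here are made up for illustration.
band_data_wide <- data.frame(
  city = c("Seattle", "Portland", "Denver"),
  band_a = c(40, 40, 20),
  band_b = c(30, 20, 40),
  stringsAsFactors = FALSE
)

# Three cities (rows) by one city column plus two price columns
print(band_data_wide)
print(dim(band_data_wide))
```

Each band occupies its own column here, which is what makes the layout "wide".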

In [None]:
# gather() - move from wide format to long format
#  (gather all of the prices into a single column)
library("tidyr")

# Reshape by gathering prices into a single feature
band_data_long <- gather(
    band_data_wide, # data frame to gather from
    # name for new column listing the gathered features
    # (will contain the column names from the wide form)
    key = band,
    # name for new column listing the gathered values
    # (here will be all gathered values)
    value = price,
    # columns to gather data from
    # (gather from all columns except city)
    -city
)

<img src="images/data-tidyr-gather.png" alt="" style="width: 600px;"/>
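In tidyr 1.0.0 and later, `gather()` is superseded by `pivot_longer()`, which does the same wide-to-long reshaping with more explicit argument names. The sketch below assumes a small hypothetical data frame (the cities, band names, and prices are invented) and mirrors the `gather()` call above.

```r
library(tidyr)

# Hypothetical wide data, invented for illustration
band_data_wide <- data.frame(
  city = c("Seattle", "Portland"),
  band_a = c(40, 30),
  band_b = c(25, 35),
  stringsAsFactors = FALSE
)

# Same idea as gather(band_data_wide, key = band, value = price, -city)
band_data_long <- pivot_longer(
  band_data_wide,
  cols = -city,         # reshape every column except city
  names_to = "band",    # old column names go into this column
  values_to = "price"   # old cell values go into this column
)

print(band_data_long)
```

Note that `names_to` and `values_to` take quoted strings, whereas `gather()` accepts bare names for `key` and `value`.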


In [None]:
# spread() - from rows to columns, from long into wide format

# Reshape by spreading prices out among multiple features
price_by_band <- spread(
    band_data_long, # data frame to spread from
    key = city, # get new column names from this column
    value = price # get values for the new columns from this column
)

<img src="images/data-tidyr-spread.png" alt="" style="width: 600px;"/>
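The counterpart of `spread()` in tidyr 1.0.0 and later is `pivot_wider()`. As a rough sketch under the same assumptions as before (invented cities, bands, and prices), the call above could be written like this:

```r
library(tidyr)

# Hypothetical long data, invented for illustration
band_data_long <- data.frame(
  city = c("Seattle", "Seattle", "Portland", "Portland"),
  band = c("band_a", "band_b", "band_a", "band_b"),
  price = c(40, 25, 30, 35),
  stringsAsFactors = FALSE
)

# Same idea as spread(band_data_long, key = city, value = price)
price_by_band <- pivot_wider(
  band_data_long,
  names_from = city,    # new column names come from this column
  values_from = price   # new column values come from this column
)

print(price_by_band)
```

The result has one row per band and one column per city, so the orientation is flipped relative to the original wide table.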


In [None]:
# Unite multiple columns into a single column
# unite()

# Separate a single column into multiple columns
# separate() 
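The cell above only names `unite()` and `separate()`, so here is a minimal sketch of both. The `concerts` data frame, its column names, and its values are hypothetical and exist only to show the mechanics.

```r
library(tidyr)

# Hypothetical data, invented for illustration
concerts <- data.frame(
  city = c("Seattle", "Portland"),
  state = c("WA", "OR"),
  date = c("2020-03-14", "2020-04-02"),
  stringsAsFactors = FALSE
)

# unite(): paste the city and state columns into one "location" column
united <- unite(concerts, col = "location", city, state, sep = ", ")
print(united)

# separate(): split the date column into three character columns
separated <- separate(
  united,
  col = date,
  into = c("year", "month", "day"),
  sep = "-"
)
print(separated)
```

By default both functions drop the input columns, and `separate()` leaves the new columns as character vectors unless you pass `convert = TRUE`.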

---
<a id='db'></a>

## Accessing databases with dbplyr

**[dbplyr](https://github.com/tidyverse/dbplyr)** is the database backend for [dplyr](https://dplyr.tidyverse.org/). It allows you to use remote database tables as if they were in-memory data frames by automatically converting dplyr code into SQL.

To learn more about why you might use dbplyr instead of writing SQL, see `vignette("sql")`. To learn more about the details of the SQL translation, see `vignette("translation-verb")` and `vignette("translation-function")`.
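To make the dplyr-to-SQL conversion concrete, the sketch below builds a throwaway in-memory SQLite database and asks dbplyr to print the SQL it generates. The `tracks` table and its contents are hypothetical, and the exact SQL text printed by `show_query()` can vary between dbplyr versions.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Temporary in-memory database holding a small, made-up table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "tracks", data.frame(
  name = c("a", "b", "c"),
  genre_id = c(1L, 1L, 2L),
  milliseconds = c(200000L, 310000L, 180000L),
  stringsAsFactors = FALSE
))

# Build a query with dplyr verbs; nothing is executed yet
long_tracks <- tbl(con, "tracks") %>%
  filter(milliseconds > 190000) %>%
  select(name, genre_id)

# Print the SQL that dbplyr generated for the pipeline
show_query(long_tracks)

# Run the query and pull the rows into an in-memory data frame
result <- collect(long_tracks)
print(result)

dbDisconnect(con)
```

The pipeline is only sent to the database when the result is printed or collected, which is why the query can be inspected before any rows are fetched.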

In [2]:
install.packages("dbplyr")
library("DBI")
library("dplyr")





Attaching package: ‘dplyr’


The following objects are masked from ‘package:stats’:

    filter, lag


The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union




In [3]:
# To access an SQLite database 
install.packages("RSQLite") # once per machine 
library("RSQLite") # in each relevant script

# To access a Postgres database 
install.packages("RPostgreSQL") # once per machine 
library("RPostgreSQL") # in each relevant script



In [7]:
# Load the `dplyr`, `DBI`, and `RSQLite` packages for accessing
# databases (install them first if you have not already)
library("dplyr")
library("DBI")
library("RSQLite")

In [8]:
# Create a connection to the `Chinook_Sqlite.sqlite` file in the `data` folder
# Be sure to set your working directory!
db_connection <- dbConnect(SQLite(), dbname = "data/r/Chinook_Sqlite.sqlite")

In [9]:
# Use the `dbListTables()` function (passing in the connection) to get a list
# of tables in the database.
dbListTables(db_connection)

In [12]:
# Use the `tbl()` function to create a reference to the table of music genres.
# Print out the table to confirm that you've accessed it.
genre_tbl <- tbl(db_connection, "Genre")
genre_tbl

# Source:   table<Genre> [?? x 2]
# Database: sqlite 3.29.0
#   [/Users/ksatola/Documents/git/Data-Science-Notes/data/r/Chinook_Sqlite.sqlite]
   GenreId Name              
     <int> <chr>             
 1       1 Rock              
 2       2 Jazz              
 3       3 Metal             
 4       4 Alternative & Punk
 5       5 Rock And Roll     
 6       6 Blues             
 7       7 Latin             
 8       8 Reggae            
 9       9 Pop               
10      10 Soundtrack        
# … with more rows

In [13]:
# Try to use `View()` to see the contents of the table. What happened?
View(genre_tbl)

ERROR: Error in View(genre_tbl): ‘View()’ not yet supported in the Jupyter R kernel


In [14]:
# Use the `collect()` function to actually load the genre table into memory
# as a data frame. View that data frame.
genre_df <- collect(genre_tbl)
View(genre_df)

ERROR: Error in View(genre_df): ‘View()’ not yet supported in the Jupyter R kernel
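Because `View()` is unavailable in the Jupyter R kernel, printing or calling `head()` is a workable substitute. The sketch below also shows the difference between the lazy table reference and the collected data frame. It uses a throwaway in-memory database with a three-row stand-in for the `Genre` table, so the contents are an assumption, not the real Chinook data.

```r
library(DBI)
library(dplyr)
library(dbplyr)

# Throwaway in-memory database with a made-up stand-in for the Genre table
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "Genre", data.frame(
  GenreId = 1:3,
  Name = c("Rock", "Jazz", "Metal"),
  stringsAsFactors = FALSE
))

# tbl() returns a lazy reference to the remote table, not a data frame
genre_tbl <- tbl(con, "Genre")
is_lazy <- inherits(genre_tbl, "tbl_lazy")
print(is_lazy)

# collect() runs the query and returns an ordinary in-memory data frame
genre_df <- collect(genre_tbl)
is_in_memory <- is.data.frame(genre_df)
print(is_in_memory)

# head() (or print()) works in Jupyter where View() does not
print(head(genre_df))

dbDisconnect(con)
```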


In [15]:
# Use dplyr's `count()` function to see how many rows are in the genre table
genre_tbl %>% count()

# Source:   lazy query [?? x 1]
# Database: sqlite 3.29.0
#   [/Users/ksatola/Documents/git/Data-Science-Notes/data/r/Chinook_Sqlite.sqlite]
      n
  <int>
1    25

In [16]:
# Use the `tbl()` function to create a reference to the table with track data.
# Print out the table to confirm that you've accessed it.
track_tbl <- tbl(db_connection, "Track")
print(track_tbl)

# Source:   table<Track> [?? x 9]
# Database: sqlite 3.29.0
#   [/Users/ksatola/Documents/git/Data-Science-Notes/data/r/Chinook_Sqlite.sqlite]
   TrackId Name  AlbumId MediaTypeId GenreId Composer Milliseconds  Bytes
     <int> <chr>   <int>       <int>   <int> <chr>           <int>  <int>
 1       1 For …       1           1       1 Angus Y…       343719 1.12e7
 2       2 Ball…       2           2       1 NA             342562 5.51e6
 3       3 Fast…       3           2       1 F. Balt…       230619 3.99e6
 4       4 Rest…       3           2       1 F. Balt

In [17]:
# Use dplyr functions to query for a list of artists in descending order by
# popularity in the database (e.g., the artist with the most tracks at the top)
# - Start by filtering for rows that have an artist listed (use `is.na()`), then
#   group rows by the artist and count them. Finally, arrange the results.
# - Use pipes to do this all as one statement without collecting the data into
#   memory!
popular_artists <- track_tbl %>%
  filter(!is.na(Composer)) %>%
  group_by(Composer) %>%
  count() %>%
  arrange(-n)
print(popular_artists)

# Source:     lazy query [?? x 2]
# Database:   sqlite 3.29.0
#   [/Users/ksatola/Documents/git/Data-Science-Notes/data/r/Chinook_Sqlite.sqlite]
# Groups:     Composer
# Ordered by: -n
   Composer                                           n
   <chr>                                          <int>
 1 Steve Harris                                      80
 2 U2                                                44
 3 Jagger/Richards                                   35
 4 Billy Corgan                                      31
 5 Kurt Cobain                                       26
 6 Bill Berry-Peter Buck-Mike Mills-Michael Stipe    25
 7 The Tea Party                                     24
 8 Chico Science                                     23
 9

In [19]:
# Use dplyr functions to query for the most popular _genre_ in the library.
# You will need to count the number of occurrences of each genre, and join the
# two tables together in order to also access the genre name.
# Collect the resulting data into memory in order to access the specific row of
# interest
# (Start from `track_tbl`; `count()` stores the counts in a column named `n`.)
genre_count_names_df <- track_tbl %>%
  group_by(GenreId) %>%
  count() %>%
  left_join(genre_tbl, by = "GenreId") %>%
  arrange(-n) %>%
  collect()
print(genre_count_names_df[1, "Name"])


In [20]:
# Bonus: Query for a list of the most popular artist for each genre in the
# library (a "representative" artist for each).
# Consider using multiple grouping operations. Note that you can only filter
# for a `max()` value if you've collected the data into memory.
track_tbl %>%
  filter(is.na(Composer) == FALSE) %>%
  group_by(GenreId, Composer) %>%
  count() %>%
  left_join(genre_tbl) %>%
  select(Genre = Name, Composer, count = n) %>%
  collect() %>%
  group_by(Genre) %>%
  filter(count == max(count)) %>%
  arrange(-count)

Joining, by = "GenreId"



| GenreId | Genre | Composer | count |
|---|---|---|---|
| `<int>` | `<chr>` | `<chr>` | `<int>` |
| 1 | Rock | U2 | 44 |
| 3 | Metal | Steve Harris | 36 |
| 4 | Alternative & Punk | Billy Corgan | 31 |
| 2 | Jazz | Miles Davis | 23 |
| 7 | Latin | Chico Science | 23 |
| 6 | Blues | Chris Robinson/Rich Robinson | 18 |
| 10 | Soundtrack | Brian Eno, Bono, Adam Clayton, The Edge & Larry Mullen Jnr. | 14 |
| 16 | World | João Suplicy | 14 |
| 13 | Heavy Metal | Steve Harris | 13 |
| 23 | Alternative | Chris Cornell | 13 |


In [21]:
# Remember to disconnect from the database once you are done with it!
dbDisconnect(db_connection)
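One way to avoid forgetting the disconnect step is to open the connection inside a function and register the cleanup with `on.exit()`, so it runs even if a query fails. This is only a sketch: `count_rows()` is a hypothetical helper, and the temporary database file and its `Genre` table are made up for the demonstration.

```r
library(DBI)

# Hypothetical helper: open a connection, count the rows of one table,
# and always disconnect when the function exits (even on error).
count_rows <- function(db_path, table_name) {
  con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
  on.exit(dbDisconnect(con), add = TRUE)
  query <- paste0(
    "SELECT COUNT(*) AS n FROM ",
    dbQuoteIdentifier(con, table_name)
  )
  dbGetQuery(con, query)$n
}

# Build a small throwaway database file to try the helper on
db_path <- tempfile(fileext = ".sqlite")
setup_con <- dbConnect(RSQLite::SQLite(), dbname = db_path)
dbWriteTable(setup_con, "Genre", data.frame(GenreId = 1:3))
dbDisconnect(setup_con)

n_rows <- count_rows(db_path, "Genre")
print(n_rows)
```

Quoting the table name with `dbQuoteIdentifier()` keeps the pasted SQL valid for names that contain spaces or reserved words.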

---
<a id='api'></a>

## Accessing Web APIs

Access tokens are a lot like passwords: keep them secret and do not share them with others. In particular, do not include them in any file you commit to git and push to GitHub. A straightforward way to keep access tokens out of your main code in R is to create a separate script file in your repo (e.g., `api_keys.R`) that contains exactly one line, assigning the key to a variable:

In [22]:
# Store your API key from a web service in a variable 
# It should be in a separate file (e.g., `api_keys.R`) 
api_key <- "123456789abcdefg"

In [24]:
# In your "main" script (e.g., `my_script.R`) load your API key from another file
# (Make sure working directory is set before running the following code!)
source("data/r/api_keys.R") # load the script using a *relative path* 
print(api_key) # the key is now available!

[1] "123456789abcdefg"


Anyone else who runs the script will need to provide their own `api_key` variable, so that they access the API with their own key. This practice keeps everyone's account separate.
You can keep your `api_keys.R` file from being committed by adding the filename to the `.gitignore` file in your repo, so that git never picks it up along with your code.
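Since `api_keys.R` is deliberately left out of the repo, a script that depends on it may be run on a machine where the file does not exist. The sketch below shows one way to fail with a clear message in that case. `load_api_key()` is a hypothetical helper, and the fallback to an environment variable named `MY_API_KEY` is an assumption added for illustration, not something the workflow above requires.

```r
# Hypothetical helper: read `api_key` from a key file if it exists,
# otherwise fall back to an environment variable (an assumed convention).
load_api_key <- function(key_file = "api_keys.R", env_var = "MY_API_KEY") {
  if (file.exists(key_file)) {
    # Source into a private environment so nothing else leaks out
    key_env <- new.env()
    source(key_file, local = key_env)
    return(get("api_key", envir = key_env))
  }
  key <- Sys.getenv(env_var)
  if (!nzchar(key)) {
    stop("No API key found: create ", key_file, " or set ", env_var)
  }
  key
}

# Demonstration with a temporary key file holding the placeholder key
demo_file <- tempfile(fileext = ".R")
writeLines('api_key <- "123456789abcdefg"', demo_file)
api_key <- load_api_key(demo_file)
print(api_key)
```

The matching `.gitignore` entry is simply a line containing `api_keys.R`.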

---
<a id='data'></a>