### Housing Project step 1
#### Data Cleaning
##### Group: Karan, Lance, Gil, Rachel, Alex, Travis

### This is a simple R script that makes converting your columns a bit easier
### We can save a lot of time using this tool

* First, we're going to read in the data from all the years. 
* __Double check your directory path__; my .csv files happen to be <br>
in the parent directory of my working directory.
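
One quick way to double check the paths before reading anything is to build them up front and ask R which ones exist. This is only a sketch: it assumes the files follow the `nycHousing<year>.csv` naming used in the import cell and sit in the parent directory. `csv_paths` and `missing_files` are names made up for this example.

```r
# Survey years used in this project (same values as in the import cell)
years <- c(1991, seq(1993, 2017, 3))

# Build the expected path for each year, relative to the working directory
csv_paths <- file.path("..", paste0("nycHousing", years, ".csv"))

# Report any path that does not exist, so the directory can be fixed
# before the read loop runs
missing_files <- csv_paths[!file.exists(csv_paths)]
if (length(missing_files) > 0) {
    message("Missing files: ", paste(missing_files, collapse = ", "))
}
```

If nothing is reported, every file the loop expects is where it should be.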

In [7]:
# Imports
library(dplyr)

# Create a vector of the years for which we have data
years <- c(1991, seq(1993, 2017, 3))

# Declare a list to store our dataframes in
df_list <- list()

# Read in the file for each year
for (i in seq_along(years)) {

    # Data-only data frame (with no headers):
    data <- read.csv(paste('../nycHousing', years[i], '.csv', sep = ''), 
                     skip = 2, header = FALSE)

    # Temporary data frame from which to extract the 
    # first row of headers, nrows = 2 so we don't waste time reading
    # the whole csv again.
    tmp <- read.csv(paste('../nycHousing', years[i], '.csv', sep = ''), 
                    header = TRUE, nrows = 2)

    # Use headers from tmp for data:
    names(data) <- names(tmp)

    # Remove the temporary data frame:
    rm(tmp)
    
    # Append the dataframe to a list of dataframes
    df_list[[i]] <- data

} # End for loop
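
The two-pass read above (names from the first header row, data from row three onward) can be tried out on a tiny throwaway file. The sketch below is not our real data: the file contents are made up, and it assumes the second row of each .csv is a row of descriptions that we want to skip.

```r
# Write a toy csv with two header rows (contents are invented for this demo)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("borough,X_32a,year",
             "Borough code,Heating breakdown,Survey year",
             "1,0,91",
             "2,1,91"),
           csv_path)

# Pass 1: read a single row just to pick up the names from the first line
header_names <- names(read.csv(csv_path, header = TRUE, nrows = 1))

# Pass 2: skip both header rows and apply the names from pass 1
toy_data <- read.csv(csv_path, skip = 2, header = FALSE,
                     col.names = header_names)

str(toy_data)
```

Passing `col.names` to the second read does the same job as assigning `names(data)` afterwards, so either form should work.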

### Make your codebooks, y'all.

In [2]:
# Create dataframe codebooks for appropriate columns.
codebook1 <- data.frame(borough = c(1, 2, 3, 4, 5),
                       boroughName = c("Bronx", "Brooklyn", "Manhattan",
                                       "Queens", "Staten Island"))

codebook2 <- data.frame(X_32a = c(0, 1, 8),
                       heating_breakdown = c('Yes', 'No', 'Not Reported'))

codebook3 <- data.frame(X_32b = c(2, 3, 4, 5, 8, 9),
                       num_heat_breakdowns = c("1", "2", "3",
                                        "4+", "Not Reported",
                                        "No Breakdowns"))

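Before using a codebook, it can help to check that it lists every code that actually shows up in the column. Here is a minimal sketch of that check on made-up data: `toy_df` and `unmapped` are hypothetical names, and the code `9` is invented to show what a gap looks like.

```r
# Same codebook as above for the heating breakdown question
codebook2 <- data.frame(X_32a = c(0, 1, 8),
                        heating_breakdown = c("Yes", "No", "Not Reported"))

# Toy stand-in for one year of survey data (values are made up)
toy_df <- data.frame(X_32a = c(0, 1, 1, 8, 9))

# Codes present in the data but absent from the codebook
unmapped <- setdiff(unique(toy_df$X_32a), codebook2$X_32a)

if (length(unmapped) > 0) {
    message("Codes missing from codebook: ",
            paste(unmapped, collapse = ", "))
}
```

Any code reported here would not get a label, so it is worth adding it to the codebook first.
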
### These are the functions to do the 'ole swap-a-roo

In [3]:
# The following is a function to change numerical values to appropriate
# categorical (named) values.
#
# Each function returns a dataframe containing only the new, labelled column
# (plus year for rf_func_year); the original coded column is dropped.
#
# The inputs are as follows:
# orig_df - the unaltered dataframe from your .csv import (df object)
# codebook_df - the dataframe that represents your 'dictionary' (df object)
# orig_name - the original name of the column (string)
# new_name - a new (meaningful) name you would like for the column (string)
#
# Note that orig_name and new_name have to match the column names specified
# in your codebook dataframe object

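# This function also keeps the year column. It only needs to be used on one
# of the columns to maintain a year column in the compiled dataframe
rf_func_year <- function(orig_df, 
                         codebook_df, 
                         orig_name,
                         new_name) {
    
    # Attach the codebook labels to the data via the shared code column.
    # The codebook is the left table, so rows come back in codebook order
    # and rows whose code is not in the codebook are dropped.
    df <- left_join(x = codebook_df, y = orig_df, by = orig_name)
    # Keep only the labelled column and the year; all_of() selects the
    # column named by the string stored in new_name
    df <- select(df, all_of(new_name), year)
    return(df)
}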

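# This function doesn't keep the year. It should be used for all 
# but the last column converted 
rf_func <- function(orig_df, 
                    codebook_df, 
                    orig_name,
                    new_name) {
    
    # Attach the codebook labels to the data via the shared code column.
    # The codebook is the left table, so rows come back in codebook order
    # and rows whose code is not in the codebook are dropped.
    df <- left_join(x = codebook_df, y = orig_df, by = orig_name)
    # Keep only the labelled column; all_of() selects the column named
    # by the string stored in new_name
    df <- select(df, all_of(new_name))
    return(df)
}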
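
Because the functions above join with the codebook on the left, the rows they return follow the codebook's order, not the order of the original data. If you need the labels to line up row for row with the input, one alternative is a plain lookup with base R's `match()`. This is a sketch of that idea, not a replacement for the functions above; `recode_with_codebook` and `toy_codes` are hypothetical names, and the code `7` is made up to show an unmatched value.

```r
# Look up a label for each code, keeping the order of `codes`
recode_with_codebook <- function(codes, codebook_df, code_col, label_col) {
    # match() gives, for each code, its row position in the codebook
    # (NA when the code is not listed there)
    idx <- match(codes, codebook_df[[code_col]])
    # as.character() guards against the label column being a factor
    as.character(codebook_df[[label_col]])[idx]
}

codebook1 <- data.frame(borough = c(1, 2, 3, 4, 5),
                        boroughName = c("Bronx", "Brooklyn", "Manhattan",
                                        "Queens", "Staten Island"))

# Toy codes in an arbitrary order, including one that is not in the codebook
toy_codes <- c(3, 1, 5, 7)
toy_labels <- recode_with_codebook(toy_codes, codebook1,
                                   "borough", "boroughName")
print(toy_labels)
```

The result has one label per input code, in input order, with `NA` for any code the codebook does not list.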

### Use the above functions to clean 'em up. Then slap 'em together.
* __Notice that you'll have to change the parameters__ of the function<br>
calls to match the column names in your codebooks.
* This will be great to use for visualizations, but not so good for building<br>
an index. You can't do arithmetic on the data in this form, so it is<br>
of little use for calculating a quality index.
* We'll end up converting it back to integers (or maybe modifying it differently,<br>
with NAs for some of the categories, to create a weighted numerical index?)
* You tell me.
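
To make the "back to numbers with NAs" idea concrete, here is one possible shape it could take. Everything about the scoring is an assumption for illustration (we haven't agreed on weights): `breakdown_scores` is a made-up mapping that treats "Not Reported" as missing and "4+" as 4.

```r
# Hypothetical scores for num_heat_breakdowns; "Not Reported" becomes NA
# so it drops out of any arithmetic instead of counting as a number
breakdown_scores <- c("No Breakdowns" = 0, "1" = 1, "2" = 2, "3" = 3,
                      "4+" = 4, "Not Reported" = NA)

# Toy labels standing in for the cleaned column (values are made up)
toy_labels <- c("1", "4+", "Not Reported", "No Breakdowns")

# Look up each label by name; as.character() matters if the column is a
# factor, because indexing by a factor would use its integer codes instead
toy_scores <- unname(breakdown_scores[as.character(toy_labels)])

# Average over the reported answers only
mean_score <- mean(toy_scores, na.rm = TRUE)
print(mean_score)
```

Whether missing answers should be dropped like this, or handled some other way, is exactly the question in the bullets above.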

In [6]:
# Declare a list to capture the cleaned dataframes
clean_df_list <- list()

# Counter
i <- 1

# Iterate through the uncleaned dataframes, clean and factor them
for (dataframe in df_list) {
    
    # Use the appropriate function with parameters to rename your answers
    # and drop the unnecessary columns
    df1 <- rf_func(dataframe, codebook1, 'borough', 'boroughName')
    df2 <- rf_func(dataframe, codebook2, 'X_32a', 'heating_breakdown')
    df3 <- rf_func_year(dataframe, codebook3, 'X_32b', 'num_heat_breakdowns')
    
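    # Column-bind the cleaned data frames (cbind pairs rows by position, so
    # all three must have the same number of rows in the same order)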
    df <- cbind(df1, df2, df3)
    
    # Append the clean dataframe to the list
    clean_df_list[[i]] <- df
    
    # Increase the counter
    i <- i + 1
}

# Row bind the list of cleaned dataframes, in order
aggregated_df <- bind_rows(clean_df_list)

# Factor the year column
aggregated_df$year <- as.factor(aggregated_df$year)

# Show some proof that this works
print(c('total samples: ', nrow(aggregated_df)))
head(aggregated_df)
str(aggregated_df)

[1] "total samples: " "156230"         


boroughName,heating_breakdown,num_heat_breakdowns,year
Bronx,Yes,1,91
Bronx,Yes,1,91
Bronx,Yes,1,91
Bronx,Yes,1,91
Bronx,Yes,1,91
Bronx,Yes,1,91


'data.frame':	156230 obs. of  4 variables:
 $ boroughName        : Factor w/ 5 levels "Bronx","Brooklyn",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ heating_breakdown  : Factor w/ 3 levels "No","Not Reported",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ num_heat_breakdowns: Factor w/ 6 levels "1","2","3","4+",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ year               : Factor w/ 10 levels "91","93","96",..: 1 1 1 1 1 1 1 1 1 1 ...
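
Since `head()` only shows the first few rows, a cross-tabulation is another cheap way to eyeball the result. The sketch below runs on a tiny made-up dataframe (`toy_df`) with the same column names as `aggregated_df`; the counts are not from our data.

```r
# Toy stand-in with the same column names as aggregated_df (values invented)
toy_df <- data.frame(
    boroughName = c("Bronx", "Bronx", "Queens", "Queens", "Queens"),
    year = factor(c("91", "93", "91", "91", "93"))
)

# Count rows for every year / borough combination
counts <- table(toy_df$year, toy_df$boroughName)
print(counts)
```

On the real data, a year or borough with a zero or suspiciously small count would be a hint that something went wrong upstream.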
