# Sprint Journal for Michael Allen

### Title: Cleaning “Dirty Data” from Excel And Making it “Tidy” in R  

#### Project Description:  To utilize R to tidy (clean) real Survey Monkey "End of Course Survey" data (known as Dirty Data) collected from February 2016 through August 2016, then export (write) the tidy data back to a CSV file.

### Game Plan 
*1. Export “Real” Excel SurveyMonkey “End of Course Survey Data” (Dirty Data)*  
*2. “Read” the Exported Excel Dirty Data into R*  
*3. Write a Script in R to “Tidy” (Clean) the  Dirty Data and “View” it in R.*  
*4. Export the R Tidy Data back to a CSV file*  





---
## DAY 1: Tuesday (Week 1)
### What I Expect to Learn
* To learn how to take dirty Excel data and make it tidy so that I can work with it in R for future statistical analysis.  
### Project References
- Project Pitch: To utilize R to clean (real) Survey Monkey End of Course Survey Data (known as Dirty Data) for the time period of (February 2016 through August 2016) which has been exported to R. And then export it back to a CSV file.


#### When it comes to clumsy column headers (namely, wide ones with spaces and special characters), I see many people panic and change the headers in the source file, which is an awkward option given the variety of alternatives that exist in R for handling them.    

One easy way to handle such cases is library(janitor), which, as the name suggests, can be used for cleaning and maintenance. Janitor has a function named clean_names() that can be applied while importing the data itself, as shown in the example below:  
library(janitor); newdataobject <- read.csv("yourcsvfilewithpath.csv", header = TRUE) %>% clean_names()   

https://www.r-bloggers.com/clean-or-shorten-column-names-while-importing-the-data-itself/  
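
To see what this does without touching my real file, here is a minimal sketch. It assumes the janitor package is installed (the call is guarded in case it is not), and it reads a tiny made-up CSV from a string. The two headers are borrowed from my survey; the data values are invented for illustration.

```r
# Illustrative only: a two-column CSV held in a string instead of a file.
csv_text <- 'Start Date,What is your name (optional)?
02/01/2016,Pat
03/15/2016,Sam'

raw <- read.csv(text = csv_text, header = TRUE)
print(names(raw))  # base R has already replaced spaces and symbols with dots

# clean_names() comes from janitor; this step is skipped if the package is missing.
if (requireNamespace("janitor", quietly = TRUE)) {
  cleaned <- janitor::clean_names(raw)
  print(names(cleaned))
}
```

Calling janitor::clean_names() directly avoids needing the %>% pipe for a one-step example.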


### Past (Real) Survey Monkey Data from End of Course Survey dated: February 2016 - August 2016  

https://www.survey.monkey.com  

#### Export a CSV file back to Excel  
If you want to export a csv file to Excel, use write_excel_csv() — this writes a special character (a “byte order mark”) at the start of the file which tells Excel that you’re using the UTF-8 encoding.  

http://r4ds.had.co.nz/data-import.html  
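
A small sketch of that tip, assuming the readr package (part of the tidyverse) is installed. The data frame and the temporary file are made up for illustration; the check at the end simply reads back the first three bytes of the file, which is where the byte order mark should sit.

```r
# Hypothetical two-row data frame standing in for the tidy survey data.
tidy_demo <- data.frame(Instructor = c("Pat", "Sam"), Score = c(5, 4))

# Write to a temporary file so nothing in the project folder is touched.
out_file <- tempfile(fileext = ".csv")

# write_excel_csv() comes from readr; this step is skipped if readr is missing.
if (requireNamespace("readr", quietly = TRUE)) {
  readr::write_excel_csv(tidy_demo, out_file)
  # Read the first three bytes back to look at the byte order mark.
  first_bytes <- readBin(out_file, what = "raw", n = 3)
  print(first_bytes)
}
```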

#### Datacamp Cleaning data in R:    
https://campus.datacamp.com/courses/cleaning-data-in-r/chapter-1-introduction-and-exploring-raw-data?ex=1  


#### Potential Tips that may be handy later:  

Other types of data
To get other types of data into R, we recommend starting with the tidyverse packages listed below. They’re certainly not perfect, but they are a good place to start. For rectangular data:  
haven reads SPSS, Stata, and SAS files.  
readxl reads excel files (both .xls and .xlsx).    
DBI, along with a database specific backend (e.g. RMySQL, RSQLite, RPostgreSQL etc) allows you to run SQL queries against a database and return a data frame.  
For hierarchical data: use jsonlite (by Jeroen Ooms) for json, and xml2 for XML. Jenny Bryan has some excellent worked examples at https://jennybc.github.io/purrr-tutorial/.  
For other file types, try the R data import/export manual and the rio package.  



--- 
## DAY 2: Wednesday (Week 1)

### Prototype Notes  
There are three interrelated rules which make a dataset tidy:  
Each variable must have its own column.  
Each observation must have its own row.  
Each value must have its own cell.  
Figure 12.1 shows the rules visually.  

![](_mallen/Three_rules_for_tidy_data.jpg)
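
To make the three rules concrete, here is a rough base R sketch on made-up survey scores. It assumes that "question" should be treated as a variable, so each row of the tidy version is one respondent's answer to one question. The column names and values are hypothetical and are not taken from my real data.

```r
# Made-up scores: one row per respondent, one column per survey question.
wide <- data.frame(
  respondent = c(101, 102),
  lecture_clear = c(5, 4),
  guide_organized = c(3, 5)
)

# Reshape so each row is one (respondent, question) observation.
questions <- c("lecture_clear", "guide_organized")
long <- data.frame(
  respondent = rep(wide$respondent, times = length(questions)),
  question = rep(questions, each = nrow(wide)),
  score = unlist(wide[questions], use.names = FALSE)
)
print(long)
```

Whether the wide layout counts as untidy depends on the analysis; the tidyverse also has dedicated reshaping functions that I have not used here.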
 

### Pair Show & Tell Comments
Hunter gave me a couple of ideas on how to proceed.  
Essentially, lay things out in Excel first so that I know my ultimate goal in R.

#### Code that I ran/installed for R  
RStudio: installed tidyverse & janitor  
install.packages("tidyverse")  
library(tidyverse)    

install.packages("janitor")  
library(janitor)  


#### From the DataCamp Course examples but on my data set in R:  
https://campus.datacamp.com/courses/cleaning-data-in-r/chapter-1-introduction-and-exploring-raw-data?ex=5  


Understanding the structure of my existing (dirty) data:  
#Setting the initial working directory

setwd("C:/Users/micha/Desktop/DevLeague Begins Nov 7 2017/Project_Sprint_3")  

read.csv("Real_CSV_EOC_Survey.csv")

Sprint_3 <- read.csv("Real_CSV_EOC_Survey.csv")

#View its Class  
class(Sprint_3)
[1] "data.frame"  

#View its Dimensions   
dim(Sprint_3)  
[1] 491  71  
#The above refers to 491 Rows and 71 Columns

#Look at the existing Column names  

names(Sprint_3)
 [1] "Respondent.ID"                                                                                                                                                  
 [2] "Collector.ID"                                                                                                                                                   
 [3] "Start.Date"                                                                                                                                                     
 [4] "End.Date"                                                                                                                                                       
 [5] "IP.Address"                                                                                                                                                     
 [6] "Email.Address"                                                                                                                                                  
 [7] "First.Name"                                                                                                                                                     
 [8] "Last.Name"                                                                                                                                                      
 [9] "Custom.Data.1"                                                                                                                                                  
[10] "What.is.your.name..optional.."                                                                                                                                  
[11] "Which.company.do.you.represent."                                                                                                                                
[12] "Where.did.you.attend.the.training.course."                                                                                                                      
[13] "X"                                                                                                                                                              
[14] "What.training.course.did.you.attend."                                                                                                                           
[15] "X.1"                                                                                                                                                            
[16] "Who.was.your.training.instructor.s.."                                                                                                                           
[17] "X.2"                                                                                                                                                            
[18] "X.3"                                                                                                                                                            
[19] "X.4"                                                                                                                                                            
[20] "X.5"                                                                                                                                                            
[21] "X.6"                                                                                                                                                            
[22] "X.7"                                                                                                                                                            
[23] "X.8"                                                                                                                                                            
[24] "X.9"                                                                                                                                                            
[25] "X.10"                                                                                                                                                           
[26] "X.11"                                                                                                                                                           
[27] "X.12"                                                                                                                                                           
[28] "X.13"                                                                                                                                                           
[29] "X.14"                                                                                                                                                           
[30] "X.15"                                                                                                                                                           
[31] "X.16"                                                                                                                                                           
[32] "X.17"                                                                                                                                                           
[33] "X.18"                                                                                                                                                           
[34] "The.instructor.s.lecture.was.delivered.clearly.and.effectively."                                                                                                
[35] "X.19"                                                                                                                                                           
[36] "The.instructor.was.responsive.to.my.questions."                                                                                                                 
[37] "X.20"                                                                                                                                                           
[38] "The.technical.details.in.the.course.were.appropriate.for.my.learning..If.you.Disagree.or.Strongly.Disagree..please.also.mark.if.it.was.Too.Much.or.Too.Little.."
[39] "X.21"                                                                                                                                                           
[40] "X.22"                                                                                                                                                           
[41] "X.23"                                                                                                                                                           
[42] "X.24"                                                                                                                                                           
[43] "X.25"                                                                                                                                                           
[44] "X.26"                                                                                                                                                           
[45] "The.Student.Guide.was.organized.and.easy.to.use."                                                                                                               
[46] "X.27"                                                                                                                                                           
[47] "The.Lab.Exercises.were.relevant.and.useful.to.my.learning."                                                                                                     
[48] "X.28"                                                                                                                                                           
[49] "The.classroom.was.comfortable..examples..not.too.hot..not.too.cold..not.too.noisy..etc..."                                                                      
[50] "X.29"                                                                                                                                                           
[51] "The.computer.I.used.worked.sufficiently.to.perform.the.hands.on.lab.exercises."                                                                                 
[52] "X.30"                                                                                                                                                           
[53] "The.network.connectivity.and.speed.was.satisfactory.for.my.learning."                                                                                           
[54] "X.31"                                                                                                                                                           
[55] "I.could.adequately.see.and.hear.the.presentation."                                                                                                              
[56] "X.32"                                                                                                                                                           
[57] "How.many.total.years.of.experience.do.you.have.working.in.a.technical.discipline..i.e...Telecommunications..IT..IP..Computer.Information.Systems.."             
[58] "X.33"                                                                                                                                                           
[59] "Is.your.background.rooted.in.IP.Networking.or.Telephony."                                                                                                       
[60] "X.34"                                                                                                                                                           
[61] "I.had.the.knowledge.and.or.skills.required.to.attend.this.course."                                                                                              
[62] "X.35"                                                                                                                                                           
[63] "Prior.to.attending.this.course..how.much.experience.did.you.have.working.with.any.of.your.Metaswitch.equipment."                                                
[64] "X.36"                                                                                                                                                           
[65] "Do.you.think.you.attended.this.class.at.the.appropriate.time.in.relation.to.your.experience.on.your.Metaswitch.equipment...If.No..please.comment."              
[66] "X.37"                                                                                                                                                           
[67] "I.learned.what.I.needed.to.learn.in.this.training.course."                                                                                                      
[68] "X.38"                                                                                                                                                           
[69] "Given.your.experience.in.this.class..would.you.recommend.this.course.to.your.colleagues."                                                                       
[70] "X.39"                                                                                                                                                           
[71] "Is.there.any.other.feedback.or.suggestions.you.would.like.to.offer.regarding.any.aspect.of.your.experience.with.your.Metaswitch.training."  
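
The X, X.1, X.2 ... entries above are names that R generates when a header cell in the file is blank. As a sketch for listing such placeholder columns before deciding what to do with them, the snippet below uses a made-up four-column CSV; the pattern (an X, optionally followed by a dot and a number) is my assumption about how these names look.

```r
# Made-up export with two blank header cells at the end of the header row.
csv_text <- 'Start Date,Where did you attend?,,
02/01/2016,Dallas,,Other site'

demo <- read.csv(text = csv_text)

# Flag names that look like R's automatic placeholders: X, X.1, X.2, ...
is_placeholder <- grepl("^X(\\.[0-9]+)?$", names(demo))
print(names(demo)[is_placeholder])
```

A real column that happens to be called X would also match, so the result should be checked by eye.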




### Proposed Plan: Key Milestones by Day



##### Day 3 (Thu Week 1):
- *Milestone 1: To be able to rewrite the long-winded column headers as one- or two-word descriptions using R commands*   

*Begin working column by column (there are currently 71 columns), learning and using R commands to tidy the data*


##### Day 4 (Tue Week 2): 
- Complete Repo/README and project work.
- Push docs / repo

##### Day 5 (Wed Week 2):
- Project highlights
- Identify question or cohort knowledge gap for sprint review
- Develop Topic Project + Presentation
- Push Repo / docs / Presentation

### Project Definition and README.MD Discussion 
*As a Technology Teacher I need to learn how to take dirty data (From Excel during this Sprint) and make it tidy in order to eventually perform statistical analysis with it in R. I hope to be able to use this code/concepts as I move further along in this program and also outside of this program in the future.*

### Proposed Plan: Key Milestones by Day


##### Day 2 (Wed Week 1):
- Develop Project Proposal
- Push Docs / Repo / Roadmap Update 
*Export “Real” SurveyMonkey “End of Course Survey Data” (Dirty Data)*  
*“Read” the Exported Excel Dirty Data into R*  


##### Day 3 (Thu Week 1):
- *Frustrating day trying to find the proper command to copy / collapse multiple columns into a single column.*  
*Still very much stuck on it over 24 hours later (even after asking for help...).*

*For some reason the Slack messages were not coming through. Eventually I got the suggestion from Justin to use the paste command, which I had actually tried unsuccessfully the day before because I did not know the proper syntax. I wrote up a specific slide in PowerPoint that illustrates the commands.*

##### Day 4 (Tue Week 2): 
- *Begin working column by column (there are currently 71 columns), learning and using R commands to tidy the data*
- *Write the cleaned tidy data from R back to a CSV file (named Tidy_Data_Aug_2016.csv)*  

write.csv(Aug_2016, "Tidy_Data_Aug_2016.csv")  
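
A small sketch of the export step on a made-up data frame (the column names, values, and temporary file name are hypothetical). One thing worth knowing: by default write.csv() also writes R's row numbers as an extra unnamed first column, so this sketch passes row.names = FALSE to leave them out.

```r
# Hypothetical stand-in for the tidy survey data.
tidy_demo <- data.frame(
  Start_Date = c("02/01/2016", "03/15/2016"),
  Instructor = c("Pat", "Sam")
)

# Write to the session's temporary folder rather than the project folder.
out_file <- file.path(tempdir(), "Tidy_Data_demo.csv")
write.csv(tidy_demo, out_file, row.names = FALSE)

# Read the file back in to confirm the columns survived the round trip.
round_trip <- read.csv(out_file)
print(names(round_trip))
```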


##### Day 5 (Wed Week 2):
- Project highlights

- Develop Topic Project + Presentation
- Push Repo / docs / Presentation




### Continuing from above 
#loaded the library(dplyr) - which is designed to abstract over how the data is stored. That means as well as working with local data frames, you can also work with remote database tables, using exactly the same R code.  

library(dplyr)  

#view a summary  (Awesome Command..!)  
summary(Sprint_3)  


#view just the top Six lines of the data (this is the default)    
head(Sprint_3)  

#view the top 15 lines of the data  
head(Sprint_3, n = 15)  

#view just the bottom six lines of the data (this is the default)
tail(Sprint_3) 
#view the bottom 15 lines of the data
tail(Sprint_3, n = 15) 
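
For a quick check of how head() and tail() behave, here is a sketch on a made-up 20-row data frame (not my survey data), where the id column makes it easy to see which rows come back.

```r
# Made-up data: ids 1 to 20 so the returned rows are easy to recognise.
demo <- data.frame(id = 1:20, score = rep(c(4, 5), times = 10))

first_six <- head(demo)           # default n = 6: rows 1 to 6
last_six <- tail(demo)            # default n = 6: rows 15 to 20
last_three <- tail(demo, n = 3)   # rows 18 to 20

print(first_six$id)
print(last_six$id)
print(last_three$id)
```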



#### Here is a link describing how to change the name of the column headers.

(I think this is slightly different from the email that I sent myself)  
http://rprogramming.net/rename-columns-in-r/  
    
    

#### Removing entire columns
To remove one or more columns by name, when the column names are known (as opposed to being determined at run-time), I like the subset() syntax. E.g. for the data frame  

df <- data.frame(a=1:3, d=2:4, c=3:5, b=4:6)  

to remove just the a column you could do  

df <- subset(df, select = -a)  

and to remove the b and d columns you could do  

df <- subset(df, select = -c(d, b))  

You can remove all columns between d and b with:  

df <- subset(df, select = -c(d:b))  

(Each command above is meant to be run on the original df, not one after another.)  

*Note: the documentation was actually wrong on the line above, as it showed only one closing parenthesis when in reality it needs two..!*

https://stackoverflow.com/questions/6286313/remove-an-entire-column-from-a-data-frame-in-r  
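
Here is the same idea as a runnable sketch. Each result is stored under its own (hypothetical) name so that the original df stays intact between steps; the last line is a plain base R alternative that I have added for comparison.

```r
df <- data.frame(a = 1:3, d = 2:4, c = 3:5, b = 4:6)

# Drop a single column by name.
no_a <- subset(df, select = -a)

# Drop two named columns.
no_d_b <- subset(df, select = -c(d, b))

# Drop every column from d through b (by position in the data frame).
no_range <- subset(df, select = -c(d:b))

# Alternative without subset(): keep the columns whose names are not listed.
by_name <- df[, !(names(df) %in% c("d", "b")), drop = FALSE]

print(names(no_a))
print(names(no_d_b))
print(names(no_range))
print(names(by_name))
```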


When a column name contains spaces or special characters, read.csv() replaces each of those characters with a dot when the file is imported, and that dotted name is what RStudio shows. Use the exact dotted name in order to delete the column..!!  
As an example: The true Excel column name was this:  
What is your name (optional)?  

But RStudio at the top (View) showed it as this:  
What.is.your.name..optional..  
So the actual command to delete the column header and entire column was this:  
Aug_2016 <- subset(Aug_2016, select = -What.is.your.name..optional..)  

So each dot stands in for a space or a special character: the first double dot is the space plus the opening parenthesis, and the trailing double dot is the closing parenthesis plus the question mark.  
And Aug_2016 is my variable name for the entire dataset of February 2016 to August 2016.  
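
The dotted name comes from base R: by default read.csv() passes the headers through make.names(). The sketch below shows that conversion on the header from my survey, and then, as an alternative I have not used in this project, reads a tiny made-up CSV with check.names = FALSE so the original header is kept.

```r
original <- "What is your name (optional)?"

# Every space and special character becomes a dot.
dotted <- make.names(original)
print(dotted)  # "What.is.your.name..optional.."

# Made-up two-column CSV; check.names = FALSE keeps the headers untouched.
csv_text <- 'What is your name (optional)?,Score
Pat,5'
kept <- read.csv(text = csv_text, check.names = FALSE)
print(names(kept))

# Names with spaces then need [[ ]] (or backticks) to be accessed.
print(kept[["What is your name (optional)?"]])
```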




--- 
## DAY 3: Thursday/Friday (Week 1)

#### The Big Hurdle
*Trying to copy one column into another, or in my case collapse approximately 13 columns into one column, took many hours of going in circles.*  

*But essentially the paste command will do it.*

*From the website (And Justin) it states:*  
Use paste.  
 df$x <- paste(df$n,df$s)  
 
 *https://stackoverflow.com/questions/18115550/how-to-combine-two-or-more-columns-in-a-dataframe-into-a-new-column-with-a-new-n*  
 
 
 
 *Which equates for me to*  
 Aug_2016$Instructor <- paste(Aug_2016$X.14, Aug_2016$X.15)  
 
 *which takes my data frame (variable name) Aug_2016, creates a new column named "Instructor", and pastes together the two columns X.14 and X.15 (placeholder column names from the Survey Monkey export), which held two different instructor names.*  
 
 *After seeing how the command works, I tried pasting another two columns into the Instructor column, but assigning to the same column overwrites its original contents.*  
 
 *So the final line of code to collapse all the various instructor columns into a single instructor column is this:*  
 Aug_2016$X.3 <- paste(Aug_2016$X.4, Aug_2016$X.6, Aug_2016$X.7, Aug_2016$X.8,  
 Aug_2016$X.9, Aug_2016$X.10, Aug_2016$X.11, Aug_2016$X.13,  
 Aug_2016$X.14, Aug_2016$X.15, Aug_2016$X.17, Aug_2016$X.18)  
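
Here is a small sketch of the paste approach on made-up data. It assumes each respondent has a name in only one of the instructor columns and empty strings in the others; the column names and instructor names are hypothetical. Because paste() puts a space between every piece, trimws() is used to strip the leftover spaces, and do.call() is one way to avoid typing every column out by hand.

```r
# Made-up data: each row has an instructor name in exactly one column.
demo <- data.frame(
  X.14 = c("Alice", "", ""),
  X.15 = c("", "Bob", ""),
  X.16 = c("", "", "Carol"),
  stringsAsFactors = FALSE
)

# Paste the columns together, then trim the extra spaces left by the blanks.
demo$Instructor <- trimws(paste(demo$X.14, demo$X.15, demo$X.16))

# Same result without listing each column: hand the chosen columns to paste().
instructor_cols <- c("X.14", "X.15", "X.16")
demo$Instructor_2 <- trimws(do.call(paste, demo[instructor_cols]))

print(demo$Instructor)
```

Missing values (NA) would be pasted in as the text "NA", so blanks in the real data may need checking first.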
 
 And I have no idea why the year 2016 is not being written straight out but instead it lowers the 2 (two) in this program...  
 
 
 
 
 
 
 
 
 
 






---
## DAY 4: Tuesday (Week 2)

#### Work in Progress Feedback 
*Finish up renaming the column names, then push up the repo and README docs.*




--- 

## DAY 5: Wednesday (Week 2)

### Project Highlights: The things I am most excited about in my project
Me
- To eventually be able to take the Dirty Data from Excel and have it Tidy in R.





### Something else I learned which clarified things from an earlier Sprint.
Tibbles are designed so that you don’t accidentally overwhelm your console when you print large data frames. But sometimes you need more output than the default display. There are a few options that can help.
First, you can explicitly print() the data frame and control the number of rows (n) and the width of the display. width = Inf will display all columns:
nycflights13::flights %>% 
  print(n = 10, width = Inf)  





### Day 5 (Wed Week 2):
- Project highlights
- Identify question or cohort knowledge gap for sprint review
- Develop Topic Project + Presentation
- Push Repo / docs / Presentation

#### Project Definition and README.MD Discussion
This is a discussion of how this project will fit into my overall roadmap. I will update my roadmap with the following project definition.
I will focus my project repo's README.MD on the same topic, but with this additional detail.











### Here is a nice graph that outlines vector/dataframe/table
![](_mallen/Pictorial_description_of_dataframe_vector_table.jpg)


#### Important Learning in R

*Be careful when cutting and pasting from the script to the Console and from the Console to the script. R is case-sensitive: the script appeared to show everything in lower case, but the Console needs names capitalized exactly as they are in the data (for example, to match a column header).*  

#To remove the original column "Where did you attend the training course"   
Aug_2016 <- subset(Aug_2016, select = -Where.did.you.attend.the.training.course.)  

#### Renaming Column names:
It’s also possible to rename by index in the names vector as follows.  

names(my_data)[1] <- "sepal_length"  
names(my_data)[2] <- "sepal_width"  
http://www.sthda.com/english/wiki/renaming-data-frame-columns-in-r


My actual code for the first column:  
#Renaming first column to Start_Date  
names(Aug_2016)[1] <- "Start_Date"  
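
A sketch of both ways to rename on a made-up one-row data frame. Renaming by index is what I used; renaming by matching the old name is an alternative I have added here because it keeps working even if columns are deleted or reordered first. The new name "Instructor" is just an example.

```r
# Made-up data frame using two of the dotted names from my survey export.
demo <- data.frame(
  Start.Date = "02/01/2016",
  Who.was.your.training.instructor.s.. = "Pat"
)

# Rename by position in the names vector.
names(demo)[1] <- "Start_Date"

# Rename by matching the old name, regardless of its position.
names(demo)[names(demo) == "Who.was.your.training.instructor.s.."] <- "Instructor"

print(names(demo))
```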

#### Learning Summary:  
During Sprint #3 I learned how to take dirty data exported from Survey Monkey as an Excel/CSV file and make it tidy (clean) through the following steps. I read in the exported dirty data file with the read.csv command. I learned how to collapse the multiple columns that Survey Monkey created for the various instructors into one Instructor column with the paste command. I learned how to rename the long column names with the names command, using each column's index in the names vector. I learned how to delete the extra columns that were added by Survey Monkey with the subset command. The last (and easiest) thing I learned was how to export the tidy data from R back to a clean CSV file with the write.csv command. During this process I learned how to write all of this in an R script, along with comments, and I titled it "Cleaning_the_Excel_Dirty_Data.R".  
