# How to get tidy data & how to rename a column in a data frame

- The data we are using below came from https://www.kaggle.com/spscientist/students-performance-in-exams
- What is tidy data? https://www.jstatsoft.org/article/view/v059i10

In [16]:
# load packages
using DataFrames, CSV

In [17]:
# Import data

df = CSV.read("data/StudentPerformance.csv", DataFrame)

Row,gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
Type,String7,String7,String31,String15,String15,Int64,Int64,Int64
1,female,group B,bachelor's degree,standard,none,72,72,74
2,female,group C,some college,standard,completed,69,90,88
3,female,group B,master's degree,standard,none,90,95,93
4,male,group A,associate's degree,free/reduced,none,47,57,44
5,male,group C,some college,standard,none,76,78,75
6,female,group B,associate's degree,standard,none,71,83,78
7,female,group B,some college,standard,completed,88,95,92
8,male,group B,some college,free/reduced,none,40,43,39
9,male,group D,high school,free/reduced,completed,64,64,67
10,female,group B,high school,free/reduced,none,38,60,50


## Now, let's rename the last three columns

Why do we need to rename them? To make this data frame tidy, we will stack the math, reading, and writing score columns into two columns: one holds the subject (math, reading, or writing) and the other holds the score. If we don't rename the columns before the transformation, the values in the subject column will be "math score", "reading score", and "writing score". This is not what we want:

In [18]:
df_tidy = stack(df, 6:8)

Row,gender,race/ethnicity,parental level of education,lunch,test preparation course,variable,value
Type,String7,String7,String31,String15,String15,Cat…,Int64
1,female,group B,bachelor's degree,standard,none,math score,72
2,female,group C,some college,standard,completed,math score,69
3,female,group B,master's degree,standard,none,math score,90
4,male,group A,associate's degree,free/reduced,none,math score,47
5,male,group C,some college,standard,none,math score,76
6,female,group B,associate's degree,standard,none,math score,71
7,female,group B,some college,standard,completed,math score,88
8,male,group B,some college,free/reduced,none,math score,40
9,male,group D,high school,free/reduced,completed,math score,64
10,female,group B,high school,free/reduced,none,math score,38
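
To see what `stack` does in isolation, here is a minimal sketch on a toy data frame. The frame `toy` and its values are made up for illustration and are not read from the Kaggle file. It also selects the columns by name rather than by position, which may be easier to read than `6:8`.

```julia
using DataFrames

# Hypothetical toy data, not taken from the Kaggle file
toy = DataFrame("gender" => ["female", "male"],
                "math score" => [72, 47],
                "reading score" => [72, 57])

# Stack the two score columns, selected by name instead of by position
long = stack(toy, ["math score", "reading score"])

# Each input row now appears once per stacked column
nrow(long)    # 4
names(long)   # ["gender", "variable", "value"]
```

The old column names end up as the values of `variable`, which is why the " score" suffix shows up in every row.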


In [19]:
# rename the last three columns:
rename!(df, Dict("math score" => "math", "reading score" => "reading", "writing score" => "writing"))

Row,gender,race/ethnicity,parental level of education,lunch,test preparation course,math,reading,writing
Type,String7,String7,String31,String15,String15,Int64,Int64,Int64
1,female,group B,bachelor's degree,standard,none,72,72,74
2,female,group C,some college,standard,completed,69,90,88
3,female,group B,master's degree,standard,none,90,95,93
4,male,group A,associate's degree,free/reduced,none,47,57,44
5,male,group C,some college,standard,none,76,78,75
6,female,group B,associate's degree,standard,none,71,83,78
7,female,group B,some college,standard,completed,88,95,92
8,male,group B,some college,free/reduced,none,40,43,39
9,male,group D,high school,free/reduced,completed,64,64,67
10,female,group B,high school,free/reduced,none,38,60,50
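
A `Dict` is not the only way to rename columns. The sketch below, again on a made-up `toy` frame, shows two alternatives that DataFrames.jl also accepts: `old => new` pairs, and a function applied to every column name. Note that the function form touches all columns, so it only fits when no other column name contains " score".

```julia
using DataFrames

# Hypothetical toy data, not taken from the Kaggle file
toy = DataFrame("gender" => ["female", "male"],
                "math score" => [72, 47],
                "reading score" => [72, 57])

# Pairs instead of a Dict; rename (no !) returns a renamed copy and leaves toy unchanged
renamed = rename(toy, "math score" => "math")

# Function form: applied to every column name, here dropping the " score" suffix
rename!(name -> replace(name, " score" => ""), toy)

names(toy)    # ["gender", "math", "reading"]
```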


In [20]:
df_tidy = stack(df, 6:8)

Row,gender,race/ethnicity,parental level of education,lunch,test preparation course,variable,value
Type,String7,String7,String31,String15,String15,Cat…,Int64
1,female,group B,bachelor's degree,standard,none,math,72
2,female,group C,some college,standard,completed,math,69
3,female,group B,master's degree,standard,none,math,90
4,male,group A,associate's degree,free/reduced,none,math,47
5,male,group C,some college,standard,none,math,76
6,female,group B,associate's degree,standard,none,math,71
7,female,group B,some college,standard,completed,math,88
8,male,group B,some college,free/reduced,none,math,40
9,male,group D,high school,free/reduced,completed,math,64
10,female,group B,high school,free/reduced,none,math,38


Now we need to rename the `variable` and `value` columns in the tidy data frame above:

In [21]:
# rename! modifies df_tidy in place, so no reassignment is needed
rename!(df_tidy, Dict(:variable => :subject, :value => :score))

Row,gender,race/ethnicity,parental level of education,lunch,test preparation course,subject,score
Type,String7,String7,String31,String15,String15,Cat…,Int64
1,female,group B,bachelor's degree,standard,none,math,72
2,female,group C,some college,standard,completed,math,69
3,female,group B,master's degree,standard,none,math,90
4,male,group A,associate's degree,free/reduced,none,math,47
5,male,group C,some college,standard,none,math,76
6,female,group B,associate's degree,standard,none,math,71
7,female,group B,some college,standard,completed,math,88
8,male,group B,some college,free/reduced,none,math,40
9,male,group D,high school,free/reduced,completed,math,64
10,female,group B,high school,free/reduced,none,math,38
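
As an alternative to renaming after the fact, recent versions of DataFrames.jl let `stack` name the two new columns directly through the `variable_name` and `value_name` keyword arguments. The sketch below assumes such a version and uses a made-up `toy` frame whose score columns have already been renamed.

```julia
using DataFrames

# Hypothetical toy data, not taken from the Kaggle file
toy = DataFrame("gender" => ["female", "male"],
                "math" => [72, 47],
                "reading" => [72, 57])

# Name the stacked columns in the same call, so no rename! is needed afterwards
tidy = stack(toy, ["math", "reading"]; variable_name = :subject, value_name = :score)

names(tidy)   # ["gender", "subject", "score"]
```

With this form, the second `rename!` step becomes unnecessary, although the two-step version above produces the same column names.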
