### HSML 6295 Preparing Your Data Set

#### I. Properties of a .csv file suitable for analysis    

The .csv should be laid out as follows:

1. 
Each row (except the first) should represent an observation, e.g. a patient, a hospital, a jurisdiction or country.

2. 
The first line (row) contains the variable names.

3. 
The second through last rows contain the values of the variables.

4.
Each column (except possibly the first) should represent a variable -- either the response or a predictor.

5.
If observation identifiers (e.g. names or ID numbers) are available, they go in the first column, where they serve as data point labels in graphs. This column should be called "label".

6.
The second column should be the variable representing the response.

7.
The third through last columns should be the variables representing the possible predictors.

8.
The data set should contain no missing values, which may appear as blank cells, a dot ".", or "NA".
See sections III and IV for code to remove variables and observations with missing values.

9.
All the variables should be numeric, i.e. they should have numbers as values, not strings (text).
See section V for code to convert categorical string variables into sets of binary indicator ("dummy") variables.

10. 
The ideal data set should have at least 20 observations (rows) and at least 10 variables (columns). The techniques that we are studying tend to give unreliable results when the data set is smaller than that.

11.
The number of variables should not exceed the number of observations. Again, the techniques that we are studying work better when the number of observations, denoted $n$, is substantially larger than the number of variables, denoted $p$, i.e. $n \gg p$. Section 6.4, "Considerations in High Dimensions", in the textbook covers how the methods need to be adapted when $p > n$.

12.
The variables should capture different dimensions of the problem. If they do not, the techniques still work, but the results (e.g. which variable is the strongest predictor) will not be very informative; predictive performance may also improve if you add variables that capture additional dimensions. For instance, suppose your predictors consist of 5 variables, each measuring the proportion of the population in a 20-year age bracket, and 4 more variables, each measuring the proportion of a religious group in the population. In this case, your 9 predictors represent only two dimensions, namely the age structure and the religious makeup of the population.

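Several of the properties above can be checked mechanically before any modeling. The following is a minimal sketch, not part of the course materials: `check_layout` is a hypothetical helper name, the toy data frame is made up purely for illustration, and the sketch assumes that the "label" column does not count toward the number of variables.

```r
# Hypothetical helper: checks a data frame against the layout rules above
# and returns a named logical vector (TRUE = rule satisfied).
check_layout <- function(df) {
  vars <- df[names(df) != "label"]   # response and predictors only (assumption)
  n <- nrow(df)                      # number of observations
  p <- ncol(vars)                    # number of variables
  c(
    first_column_is_label = names(df)[1] == "label",
    no_missing_values     = !anyNA(df),
    all_numeric           = all(sapply(vars, is.numeric)),
    at_least_20_rows      = n >= 20,
    at_least_10_variables = p >= 10,
    more_rows_than_vars   = n > p
  )
}

# Toy data, made up for illustration: 25 observations, 10 numeric variables
set.seed(1)
toy <- data.frame(label = paste0("obs", 1:25),
                  matrix(rnorm(25 * 10), nrow = 25))
check_layout(toy)
```

Because the "label" column is optional, a `FALSE` for `first_column_is_label` is not necessarily a problem. Also note that missing values coded as "." are only caught here indirectly: they typically turn the column into text, so `all_numeric` fails.
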
#### II. Load a .csv File


In [None]:
# read in the data set, treating blank cells, "." and "NA" as missing values:
mydata = read.csv("HSML 6295 ds High School Scores.csv", na.strings = c("", ".", "NA"))
# describe the data set:
str(mydata)                               


#### III. Drop variables (columns) with too many missing values

Show number and proportion of missing values for each variable:


In [None]:
# define the function that computes the number and proportion of missing values
propmiss <- function(dataframe) {
  nmiss <- colSums(is.na(dataframe))   # number of missing values per variable
  d <- data.frame(
    variable = names(dataframe),
    nmiss = nmiss,
    n = nrow(dataframe),
    propmiss = nmiss / nrow(dataframe),
    row.names = NULL
  )
  # sort so that the variables with the most missing values come first
  return(d[order(-d$propmiss), ])
}
# run the function that computes the proportion of missing values
propmiss(mydata)



Drop selected variables, e.g. those with at least 10% missing values:


In [None]:
dim(mydata)   # dimensions before dropping
# flag the variables to drop (TRUE for each column named below):
vars_to_drop = names(mydata) %in% c("socst", "science")
mydata <- mydata[!vars_to_drop]
dim(mydata)   # dimensions after dropping

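Instead of typing the variable names by hand, the list of variables to drop can also be derived from the 10% threshold itself. The sketch below is one possible way to do this in base R; the data frame `toy` and the name `threshold` are made up for illustration and are not part of the course data set.

```r
# Toy data, made up for illustration (10 observations, 3 variables)
toy <- data.frame(
  read    = c(57, 68, 44, 63, 47, 50, 34, 63, 57, 60),
  science = c(47, NA, 39, NA, 53, 51, 41, 55, NA, 58),
  socst   = c(57, NA, 56, 61, NA, 52, 46, 61, 66, 41)
)

threshold <- 0.10                        # drop variables with >= 10% missing
prop_missing <- colMeans(is.na(toy))     # proportion of missing values per column
vars_to_drop <- names(prop_missing)[prop_missing >= threshold]
vars_to_drop                             # names of the variables that will be dropped

toy_reduced <- toy[!(names(toy) %in% vars_to_drop)]
dim(toy_reduced)
```

The right threshold is a judgment call: a stricter one keeps more observations in the next step but discards more variables.
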

#### IV. Drop observations with at least one missing value



In [None]:
mydata = na.omit(mydata)
dim(mydata)

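Since `na.omit` removes rows without reporting which ones, it can be useful to look at the incomplete rows first, for example to check that at least 20 observations will remain. A minimal sketch using base R's `complete.cases` on a made-up data frame:

```r
# Toy data, made up for illustration
toy <- data.frame(
  label = c("a", "b", "c", "d"),
  x     = c(1, NA, 3, 4),
  y     = c(2, 5, NA, 8)
)

complete <- complete.cases(toy)   # TRUE for rows without any missing value
sum(!complete)                    # number of rows that would be dropped
toy[!complete, ]                  # inspect the rows that would be dropped

toy_clean <- toy[complete, ]      # same rows that na.omit(toy) keeps
dim(toy_clean)
```
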

#### V. Convert categorical string variables

Convert categorical string variables to sets of binary indicator ("dummy") variables:


In [None]:
# install.packages("dummies")
# load the library "dummies":
library(dummies)
# create new data frame in which the factor variables 
#    have been replaced by sets of "dummies":
mydata <- dummy.data.frame(mydata, sep = ".")
# display the result:
str(mydata)

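If the `dummies` package is not available for your R installation, the same kind of conversion can be sketched in base R with `model.matrix`. This is an alternative illustration under stated assumptions, not the method used above: the data frame `toy` is made up, and the sketch deliberately leaves the "label" column untouched so that the identifiers are not expanded into indicator columns.

```r
# Toy data, made up for illustration
toy <- data.frame(
  label  = c("s1", "s2", "s3", "s4"),
  write  = c(52, 59, 33, 44),
  gender = c("female", "male", "female", "male"),
  prog   = c("academic", "general", "vocation", "academic"),
  stringsAsFactors = FALSE
)

# Columns to expand: every character/factor column except the identifiers
is_cat   <- sapply(toy, function(x) is.character(x) || is.factor(x))
cat_vars <- setdiff(names(toy)[is_cat], "label")

# One 0/1 indicator column per category, named "<variable>.<category>"
dummy_cols <- do.call(cbind, lapply(cat_vars, function(v) {
  f <- factor(toy[[v]])
  m <- model.matrix(~ f - 1)   # "- 1" removes the intercept: one column per level
  colnames(m) <- paste(v, levels(f), sep = ".")
  m
}))

# Replace the original categorical columns with their indicator columns
toy_dummies <- cbind(toy[!(names(toy) %in% cat_vars)], dummy_cols)
str(toy_dummies)
```
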

#### VI. Display summary statistics for new data frame

The mean of each dummy variable is the proportion of observations in the corresponding category, so the means show how the categories are distributed.


In [None]:
library(stargazer)
stargazer(mydata, 
          summary.stat = c("n", "mean", "median", "sd"),
          type = "text", title="Descriptive statistics", digits=2)

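To see why the mean of a 0/1 indicator can be read as a proportion, consider this small base-R sketch (the vector `gender` is made up for illustration):

```r
# Toy categorical variable, made up for illustration
gender <- c("female", "male", "female", "female")

# 0/1 indicator for the category "female"
gender.female <- as.integer(gender == "female")

mean(gender.female)          # share of observations coded 1
prop.table(table(gender))    # category proportions computed directly
```

Both lines report the same share for "female", because the mean of a 0/1 variable is the number of ones divided by the number of observations.
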

#### VII. Save the new data frame as a .csv file



In [None]:
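# row.names = FALSE keeps R's row numbers from being written as an extra first column:
write.csv(mydata, file = "HSML 6295 ds High School Scores CLEAN.csv", row.names = FALSE)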

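It can be worth reading the saved file back in to confirm that it still follows the layout from section I. The sketch below uses a temporary file and a made-up data frame rather than the course data; it passes `row.names = FALSE` so that `write.csv` does not prepend a column of row names.

```r
# Toy data, made up for illustration
toy <- data.frame(
  label = c("a", "b", "c"),
  y     = c(1.5, 2, 3.25),
  x1    = c(0, 1, 0)
)

path <- tempfile(fileext = ".csv")          # temporary file for the round trip
write.csv(toy, file = path, row.names = FALSE)

reloaded <- read.csv(path)
names(reloaded)                             # column names as read back from the file
identical(names(toy), names(reloaded))      # same columns, in the same order?
dim(reloaded)
```
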


#### VIII. Sample code for the results table


|                   | **Model 1**   | **Model 2**   | **Model 3**
| ---               | ---           | ---           | ---
| **Training Set**  | PP1_train     | PP2_train     | PP3_train
| **Test Set**      | PP1_test      | PP2_test      | PP3_test
