The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. This test provides a quantitative way to compare the performance of students from different parts of the world. In this homework assignment, we will predict the reading scores of students from the United States of America on the 2009 PISA exam.

The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES). While the datasets are not supposed to contain identifying information about students taking the test, by using the data you are bound by the NCES data use agreement, which prohibits any attempt to determine the identity of any student in the datasets.

Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:

- grade: The grade in school of the student (most 15-year-olds in America are in 10th grade)

- male: Whether the student is male (1/0)

- raceeth: The race/ethnicity composite of the student

- preschool: Whether the student attended preschool (1/0)

- expectBachelors: Whether the student expects to obtain a bachelor's degree (1/0)

- motherHS: Whether the student's mother completed high school (1/0)

- motherBachelors: Whether the student's mother obtained a bachelor's degree (1/0)

- motherWork: Whether the student's mother has part-time or full-time work (1/0)

- fatherHS: Whether the student's father completed high school (1/0)

- fatherBachelors: Whether the student's father obtained a bachelor's degree (1/0)

- fatherWork: Whether the student's father has part-time or full-time work (1/0)

- selfBornUS: Whether the student was born in the United States of America (1/0)

- motherBornUS: Whether the student's mother was born in the United States of America (1/0)

- fatherBornUS: Whether the student's father was born in the United States of America (1/0)

- englishAtHome: Whether the student speaks English at home (1/0)

- computerForSchoolwork: Whether the student has access to a computer for schoolwork (1/0)

- read30MinsADay: Whether the student reads for pleasure for 30 minutes/day (1/0)

- minutesPerWeekEnglish: The number of minutes per week the student spends in English class

- studentsInEnglish: The number of students in this student's English class at school

- schoolHasLibrary: Whether this student's school has a library (1/0)

- publicSchool: Whether this student attends a public school (1/0)

- urban: Whether this student's school is in an urban area (1/0)

- schoolSize: The number of students in this student's school

- readingScore: The student's reading score, on a 1000-point scale

Load the training and testing sets using the read.csv() function, and save them as variables with the names pisaTrain and pisaTest.

How many students are there in the training set?

In [1]:
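# stringsAsFactors = TRUE keeps text columns such as raceeth as factors
# (the default before R 4.0; R 4.0+ would otherwise read them as character)
pisaTrain = read.csv('./dataset/pisa2009train.csv', stringsAsFactors = TRUE)
pisaTest = read.csv('./dataset/pisa2009test.csv', stringsAsFactors = TRUE)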

In [2]:
str(pisaTrain)

'data.frame':	3663 obs. of  24 variables:
 $ grade                : int  11 11 9 10 10 10 10 10 9 10 ...
 $ male                 : int  1 1 1 0 1 1 0 0 0 1 ...
 $ raceeth              : Factor w/ 7 levels "American Indian/Alaska Native",..: NA 7 7 3 4 3 2 7 7 5 ...
 $ preschool            : int  NA 0 1 1 1 1 0 1 1 1 ...
 $ expectBachelors      : int  0 0 1 1 0 1 1 1 0 1 ...
 $ motherHS             : int  NA 1 1 0 1 NA 1 1 1 1 ...
 $ motherBachelors      : int  NA 1 1 0 0 NA 0 0 NA 1 ...
 $ motherWork           : int  1 1 1 1 1 1 1 0 1 1 ...
 $ fatherHS             : int  NA 1 1 1 1 1 NA 1 0 0 ...
 $ fatherBachelors      : int  NA 0 NA 0 0 0 NA 0 NA 0 ...
 $ fatherWork           : int  1 1 1 1 0 1 NA 1 1 1 ...
 $ selfBornUS           : int  1 1 1 1 1 1 0 1 1 1 ...
 $ motherBornUS         : int  0 1 1 1 1 1 1 1 1 1 ...
 $ fatherBornUS         : int  0 1 1 1 0 1 NA 1 1 1 ...
 $ englishAtHome        : int  0 1 1 1 1 1 1 1 1 1 ...
 $ computerForSchoolwork: int  1 1 1 1 1 1 1 1 1 1 ...
 $ re

Using tapply() on pisaTrain, what is the average reading test score of males? Of females?
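
As a rough illustration of how tapply() groups values before applying a function, here is a minimal sketch on a made-up data frame. The name `toy` and its scores are hypothetical and are not taken from the PISA data.

```R
# Toy data, invented for illustration only (not PISA values)
toy = data.frame(
  score = c(500, 520, 480, 510),
  male  = c(1, 0, 1, 0)
)

# tapply(values, groups, FUN) applies FUN to the values within each group
groupMeans = tapply(toy$score, toy$male, mean)
groupMeans
```

The result is indexed by group label, so `groupMeans["1"]` holds the mean for the rows where `male` is 1.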

In [6]:
tapply(pisaTrain$readingScore, pisaTrain$male, mean)

Which variables are missing data in at least one observation in the training set? Select all that apply.
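
Besides reading the NA's rows in summary(), one possible way to list the columns with missing values is to count NAs per column. The sketch below uses a made-up data frame (`toy` and its values are hypothetical); the same two calls could be pointed at pisaTrain.

```R
# Toy data, invented for illustration only
toy = data.frame(
  grade     = c(10, 9, NA, 10),
  preschool = c(1, NA, NA, 0),
  male      = c(1, 0, 1, 0)
)

# is.na() marks missing cells; colSums() counts them per column
naCounts = colSums(is.na(toy))
naCounts

# Names of the columns with at least one missing value
names(naCounts)[naCounts > 0]
```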

In [7]:
summary(pisaTrain)

     grade            male                      raceeth       preschool     
 Min.   : 8.00   Min.   :0.0000   White             :2015   Min.   :0.0000  
 1st Qu.:10.00   1st Qu.:0.0000   Hispanic          : 834   1st Qu.:0.0000  
 Median :10.00   Median :1.0000   Black             : 444   Median :1.0000  
 Mean   :10.09   Mean   :0.5111   Asian             : 143   Mean   :0.7228  
 3rd Qu.:10.00   3rd Qu.:1.0000   More than one race: 124   3rd Qu.:1.0000  
 Max.   :12.00   Max.   :1.0000   (Other)           :  68   Max.   :1.0000  
                                  NA's              :  35   NA's   :56      
 expectBachelors     motherHS    motherBachelors    motherWork    
 Min.   :0.0000   Min.   :0.00   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:1.0000   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :1.0000   Median :1.00   Median :0.0000   Median :1.0000  
 Mean   :0.7859   Mean   :0.88   Mean   :0.3481   Mean   :0.7345  
 3rd Qu.:1.0000   3rd Qu.:1.00   3rd Qu.:1.0000  

By default, R's lm() function drops any observation that has a missing value in a variable used by the model, so we will remove all such observations from the training and testing sets up front. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible estimates.
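
To make "removing observations with any missing value" concrete, here is a small sketch on a made-up data frame (`toy` and `complete` are hypothetical names). It uses na.omit(), which drops every row containing at least one NA, and complete.cases(), which flags the rows that would be kept.

```R
# Toy data, invented for illustration only
toy = data.frame(
  x = c(1, 2, NA, 4),
  y = c(10, NA, 30, 40)
)

# na.omit() keeps only rows with no missing values
complete = na.omit(toy)
nrow(toy)
nrow(complete)

# complete.cases() returns TRUE for rows without any NA
sum(complete.cases(toy))
```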

Type the following commands into your R console to remove observations with any missing value from pisaTrain and pisaTest:
```R
pisaTrain = na.omit(pisaTrain)

pisaTest = na.omit(pisaTest)
```
How many observations are now in the training set?
How many observations are now in the testing set?

In [8]:
pisaTrain = na.omit(pisaTrain)
pisaTest = na.omit(pisaTest)

In [9]:
nrow(pisaTrain)
nrow(pisaTest)

Because the raceeth variable takes on text values, it was loaded as a factor variable when we read in the dataset with read.csv(); you can see this when you run str(pisaTrain) or str(pisaTest). (In R 4.0 and later, read.csv() creates factors only when called with stringsAsFactors = TRUE.) However, by default R selects the first level alphabetically ("American Indian/Alaska Native") as the reference level of our factor instead of the most common level ("White"). Set the reference level of the factor by typing the following two lines in your R console:

```R
pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")

pisaTest$raceeth = relevel(pisaTest$raceeth, "White")
```
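
The effect of relevel() can be checked on a small factor. This is a sketch with a made-up vector (`race` and `race2` are hypothetical names): the levels start in alphabetical order, and relevel() moves the chosen level to the front, where lm() treats it as the reference category.

```R
# Toy factor, invented for illustration only
race = factor(c("White", "Asian", "Black", "White", "Asian"))

# Levels are sorted alphabetically; the first one is the reference level
levels(race)

# relevel() moves "White" to the first position
race2 = relevel(race, "White")
levels(race2)
```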
Now, build a linear regression model (call it lmScore) using the training set to predict readingScore using all the remaining variables.

It would be time-consuming to type all the variables, but R provides the shorthand notation "readingScore ~ ." to mean "predict readingScore using all the other variables in the data frame." The period is used to replace listing out all of the independent variables. As an example, if your dependent variable is called "Y", your independent variables are called "X1", "X2", and "X3", and your training data set is called "Train", instead of the regular notation:
```R
LinReg = lm(Y ~ X1 + X2 + X3, data = Train)
```
You would use the following command to build your model:
```R
LinReg = lm(Y ~ ., data = Train)
```
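
As a quick check that the two notations fit the same model, the sketch below builds a small simulated `Train` data frame (the simulated values and the names `full` and `dot` are assumptions made for illustration) and compares the coefficients.

```R
# Simulated data, invented for illustration only
set.seed(1)
Train = data.frame(X1 = rnorm(20), X2 = rnorm(20), X3 = rnorm(20))
Train$Y = 1 + 2 * Train$X1 - Train$X2 + rnorm(20)

# Explicit formula versus the "." shorthand
full = lm(Y ~ X1 + X2 + X3, data = Train)
dot  = lm(Y ~ ., data = Train)

# The "." expands to every column other than Y, so the fits should match
all.equal(coef(full), coef(dot))
```

Note that "." picks up every other column in the data frame, so any column that should not be a predictor has to be removed before fitting.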
What is the Multiple R-squared value of lmScore on the training set?

In [10]:
pisaTrain$raceeth = relevel(pisaTrain$raceeth, "White")
pisaTest$raceeth = relevel(pisaTest$raceeth, "White")

In [11]:
str(pisaTrain)

'data.frame':	2414 obs. of  24 variables:
 $ grade                : int  11 10 10 10 10 10 10 10 11 9 ...
 $ male                 : int  1 0 1 0 1 0 0 0 1 1 ...
 $ raceeth              : Factor w/ 7 levels "White","American Indian/Alaska Native",..: 1 4 5 1 6 5 1 5 1 1 ...
 $ preschool            : int  0 1 1 1 1 1 1 1 1 1 ...
 $ expectBachelors      : int  0 1 0 1 1 1 1 0 1 1 ...
 $ motherHS             : int  1 0 1 1 1 1 1 0 1 1 ...
 $ motherBachelors      : int  1 0 0 0 1 0 0 0 0 1 ...
 $ motherWork           : int  1 1 1 0 1 1 1 0 0 1 ...
 $ fatherHS             : int  1 1 1 1 0 1 1 0 1 1 ...
 $ fatherBachelors      : int  0 0 0 0 0 0 1 0 1 1 ...
 $ fatherWork           : int  1 1 0 1 1 0 1 1 1 1 ...
 $ selfBornUS           : int  1 1 1 1 1 0 1 0 1 1 ...
 $ motherBornUS         : int  1 1 1 1 1 0 1 0 1 1 ...
 $ fatherBornUS         : int  1 1 0 1 1 0 1 0 1 1 ...
 $ englishAtHome        : int  1 1 1 1 1 0 1 0 1 1 ...
 $ computerForSchoolwork: int  1 1 1 1 1 0 1 1 1 1 ...
 $ read30Mi

In [12]:
lmScore = lm(readingScore ~ ., data=pisaTrain)
summary(lmScore)


Call:
lm(formula = readingScore ~ ., data = pisaTrain)

Residuals:
    Min      1Q  Median      3Q     Max 
-247.44  -48.86    1.86   49.77  217.18 

Coefficients:
                                                Estimate Std. Error t value
(Intercept)                                   143.766333  33.841226   4.248
grade                                          29.542707   2.937399  10.057
male                                          -14.521653   3.155926  -4.601
raceethAmerican Indian/Alaska Native          -67.277327  16.786935  -4.008
raceethAsian                                   -4.110325   9.220071  -0.446
raceethBlack                                  -67.012347   5.460883 -12.271
raceethHispanic                               -38.975486   5.177743  -7.528
raceethMore than one race                     -16.922522   8.496268  -1.992
raceethNative Hawaiian/Other Pacific Islander  -5.101601  17.005696  -0.300
preschool                                      -4.463670   3.486055  -1.280

In [18]:
# Training-set error: sum of squared residuals and root-mean-squared error
SSE = sum(lmScore$residuals^2)
SSE
RMSE = sqrt(SSE / nrow(pisaTrain))
RMSE

### Predicting on unseen data

In [19]:
predTest = predict(lmScore, newdata = pisaTest)
summary(predTest)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  353.2   482.0   524.0   516.7   555.7   637.7 

In [21]:
# Range between the maximum and minimum predicted test scores
637.7 - 353.2

### Test set SSE and RMSE

In [23]:
# Out-of-sample error
SSE = sum((predTest - pisaTest$readingScore)^2)
SSE
RMSE = sqrt(SSE / nrow(pisaTest))
RMSE

### Baseline prediction and test-set SSE

In [24]:
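# Baseline model: predict the training-set mean score for every test student
mean(pisaTrain$readingScore)

# SST: sum of squared errors of the baseline on the test set
SST = sum((mean(pisaTrain$readingScore) - pisaTest$readingScore)^2)
SST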

### Test-set R-squared

In [26]:
R_squared = 1 - SSE / SST
R_squared
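
For reference, the test-set quantities computed in the cells above can be written out as follows. The symbols are notation assumed here rather than taken from the original: $y_i$ is the actual reading score of test student $i$, $\hat{y}_i$ the model's prediction, $\bar{y}_{\text{train}}$ the mean reading score in the training set, and $n$ the number of test observations.

$$
\mathrm{SSE} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{n}}, \qquad
\mathrm{SST} = \sum_{i=1}^{n} (\bar{y}_{\text{train}} - y_i)^2, \qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
$$

Because SST uses the training-set mean as the baseline, the test-set $R^2$ measures improvement over a predictor that never saw the test data, and it can be negative if the model does worse than that baseline.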