In [11]:
library(dplyr)

#### Read Input Data

In [12]:
# Read the comma-separated data file (it has no header row)
df <- read.table("auto-mpg.data", 
                 header = FALSE,
                 sep = ",")

# Display the first 5 rows of the data
head(df, 5)

V1,V2,V3,V4,V5,V6,V7,V8,V9
18,8,307,130,3504,12.0,70,1,chevrolet chevelle malibu
15,8,350,165,3693,11.5,70,1,buick skylark 320
18,8,318,150,3436,11.0,70,1,plymouth satellite
16,8,304,150,3433,12.0,70,1,amc rebel sst
17,8,302,140,3449,10.5,70,1,ford torino


#### Preprocess The Data: Drop NA rows, car name column

In [13]:
# Keep only the numeric columns (this drops the car name column)
nums <- sapply(df, is.numeric)
df <- df[,nums]

# Remove rows that contain NA values
df <- na.omit(df)

# Display processed data
head(df, 5)

V1,V2,V3,V4,V5,V6,V7,V8
18,8,307,130,3504,12.0,70,1
15,8,350,165,3693,11.5,70,1
18,8,318,150,3436,11.0,70,1
16,8,304,150,3433,12.0,70,1
17,8,302,140,3449,10.5,70,1


#### Split Data into 2 Portions: 80% for Training, 20% for Testing

In [14]:
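# Randomly sample 80% of the rows for training (no seed is set, so the split changes on every run)
training_set <- df[sample(nrow(df), floor(nrow(df)*0.8)),]
# Use the rows of df that match no training row (on all columns) for testing
test_set <- anti_join(df, training_set)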

Joining, by = c("V1", "V2", "V3", "V4", "V5", "V6", "V7", "V8")
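Because `anti_join` matches rows by value, any row of `df` that is identical to a training row is also left out of the test set, so the two portions may not add up to the full data. A minimal alternative sketch, assuming an index-based split is acceptable here: the toy data frame, the seed value, and the names `toy`, `train_idx`, `toy_train`, and `toy_test` are all hypothetical and not part of the original notebook.

```r
# Hypothetical toy data; the repeated rows mimic cars with identical measurements
toy <- data.frame(V1 = c(18, 15, 18, 16, 17, 18, 15, 14, 24, 22),
                  V2 = c(8, 8, 8, 8, 8, 8, 8, 8, 4, 6))

# Arbitrary seed, used only to make the random split repeatable
set.seed(42)

# Sample 80% of the row indices for training
train_idx <- sample(nrow(toy), floor(nrow(toy) * 0.8))

# Select training rows by index and take every remaining row for testing
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# The two portions always partition the data, even when rows are duplicated
cat("Training rows:", nrow(toy_train), " Test rows:", nrow(toy_test), "\n")
```

Splitting by index keeps duplicated rows in whichever portion they were sampled into and needs no join.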


#### Print The Equation for The Linear Model
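Note: This model omits the last variable, car name. Columns V1 through V8 correspond, in order, to mpg, cylinders, displacement, horsepower, weight, acceleration, model year, and origin.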

$$ mpg = \beta_0 + \beta_1\times(cylinders) + \beta_2\times(displacement) + \beta_3\times(horsepower) + \beta_4\times(weight) + \beta_5\times(acceleration) + \beta_6\times(model\_year) + \beta_7\times(origin) $$ 
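The coefficients in this equation are the ordinary least squares estimates, which can be written as $\hat{\beta} = (X^TX)^{-1}X^Ty$. The sketch below illustrates that relationship on R's built-in `mtcars` data rather than on the notebook's data; the choice of predictors and the names `X`, `y`, `beta_hat`, and `max_diff` are assumptions made for illustration only.

```r
# Design matrix with an intercept column, built from the built-in mtcars data
X <- model.matrix(mpg ~ cyl + disp + hp + wt, data = mtcars)
y <- mtcars$mpg

# Solve the normal equations (X'X) beta = X'y
beta_hat <- solve(crossprod(X), crossprod(X, y))

# Fit the same model with lm for comparison
fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)

# Largest absolute difference between the two sets of estimates
max_diff <- max(abs(drop(beta_hat) - coef(fit)))
cat("Largest difference from lm coefficients:", max_diff, "\n")
```

`lm` uses a QR decomposition internally rather than forming $X^TX$, so the two results agree only up to floating-point error.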

#### Fit Linear Model

In [15]:
fitted_model <- lm(V1~V2+V3+V4+V5+V6+V7+V8, data=training_set)
summary(fitted_model)


Call:
lm(formula = V1 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8, data = training_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.9232 -2.2309 -0.2106  1.8222 12.5117 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.704e+01  5.188e+00  -3.284  0.00114 ** 
V2          -7.079e-01  3.614e-01  -1.959  0.05103 .  
V3           2.689e-02  8.580e-03   3.134  0.00189 ** 
V4          -1.973e-02  1.592e-02  -1.239  0.21612    
V5          -6.593e-03  7.298e-04  -9.034  < 2e-16 ***
V6           1.483e-01  1.108e-01   1.339  0.18159    
V7           7.332e-01  5.700e-02  12.864  < 2e-16 ***
V8           1.841e+00  3.223e-01   5.713 2.64e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.323 on 305 degrees of freedom
Multiple R-squared:  0.8277,	Adjusted R-squared:  0.8237 
F-statistic: 209.3 on 7 and 305 DF,  p-value: < 2.2e-16
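The values printed by `summary` can also be pulled out programmatically, for example to list only the terms whose p-value falls below 0.05. This is a small sketch on the built-in `mtcars` data, not on the notebook's training set; the 0.05 cutoff and the names `coef_table`, `significant`, `r2`, and `adj_r2` are illustrative assumptions.

```r
# Fit an example model on the built-in mtcars data
fit <- lm(mpg ~ cyl + disp + hp + wt, data = mtcars)
fit_summary <- summary(fit)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(fit_summary)

# Keep only the rows whose p-value is below the chosen cutoff
significant <- coef_table[coef_table[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
print(round(significant, 4))

# R-squared and adjusted R-squared as plain numbers
r2 <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared
cat("R-squared:", r2, " Adjusted R-squared:", adj_r2, "\n")
```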


#### Make Prediction Using The Test Set and The Fitted Model

In [16]:
print(predict(fitted_model, test_set))

        1         2         3         4         5         6         7         8 
10.571237 20.901423 25.961561 27.784088  6.379957 16.220921 18.131728 19.857537 
        9        10        11        12        13        14        15        16 
25.296930 13.058916 12.722067 20.464497 24.476249 22.932668 13.586592  9.352859 
       17        18        19        20        21        22        23        24 
21.305273 21.599129  9.636520 10.718981 27.846522 27.994996 21.918657 12.088565 
       25        26        27        28        29        30        31        32 
27.051182 16.001430 31.551895 13.533777 10.553860 30.555766 25.313412 27.415042 
       33        34        35        36        37        38        39        40 
19.768342 21.339268 13.487614 18.543157 17.589331 26.545252 27.676516 24.183985 
       41        42        43        44        45        46        47        48 
20.977899 28.161288 17.566042 15.888086 24.620547 28.107681 30.766879 33.709912 
       49        50        5

#### Report Performance by Computing The Sum Squared Error

In [17]:
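# Predict mpg (V1) for the test rows
predicted_vals <- as.matrix(predict(fitted_model, test_set))
# Sum the squared differences between the actual and predicted mpg
SSRes <- sum((test_set['V1']-predicted_vals)^2)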

In [18]:
cat("Sum squared error:", SSRes)

Sum squared error: 926.4498
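The sum of squared errors grows with the number of test rows, so it is hard to compare across test sets of different sizes. A short sketch of the per-row variants (mean squared error and root mean squared error) follows; the toy vectors `actual` and `predicted` are hypothetical values chosen for easy arithmetic, not results from this notebook.

```r
# Hypothetical actual and predicted values
actual    <- c(3, 5, 7)
predicted <- c(2, 5, 9)

# Prediction errors for each row: 1, 0, -2
errors <- actual - predicted

# Sum of squared errors: 1 + 0 + 4 = 5
sse <- sum(errors^2)

# Mean squared error divides by the number of rows
mse <- sse / length(actual)

# Root mean squared error is in the same units as the response
rmse <- sqrt(mse)

cat("SSE:", sse, " MSE:", mse, " RMSE:", rmse, "\n")
```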

#### Fit The Linear Model on The Training Set and Test The Model on The Test Set Used in Question 1

In [64]:
# Read training set file
training_set_2 <- read.table("training_set.csv", header = TRUE, sep = ",")

# Read test set file
test_set_2 <- read.table("test_set.csv", header = TRUE, sep = ",") 
test_set_2 <- test_set_2[c("mpg", "cylinders", "displacement", "horsepower", "weight", "acceleration", "model.year", "origin")]

“EOF within quoted string”
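This warning usually means that a field contained an unmatched quote character, in which case `read.table` may silently merge the remaining lines into one field. One plausible cause here, which is an assumption since the CSV files are not shown, is an apostrophe in a car name: `read.table` treats both `"` and `'` as quote characters by default, whereas `read.csv` treats only `"` as one. The sketch below uses a hypothetical temporary file with made-up rows to show a read that is not affected by apostrophes.

```r
# Hypothetical file contents; the apostrophe in the second car name is unmatched
csv_lines <- c("mpg,car.name",
               "18,chevrolet chevelle malibu",
               "14,plymouth 'cuda 340",
               "15,buick skylark 320",
               "24,toyota corona")
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Restrict quoting to double quotes so the apostrophe is read as ordinary text
cars_a <- read.table(tmp, header = TRUE, sep = ",", quote = "\"")

# read.csv uses the same quote setting by default
cars_b <- read.csv(tmp)

# Both reads should return every data row in the file
cat("Rows read:", nrow(cars_a), "and", nrow(cars_b), "\n")

# Remove the temporary file
unlink(tmp)
```

If the warning has this cause, comparing `nrow` of the loaded data frame with the number of lines in the file is a quick way to check whether rows were lost.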

In [60]:
fitted_model_2 <- lm(mpg~cylinders+displacement+horsepower+weight+acceleration+model.year+origin, data=training_set_2)
summary(fitted_model_2)


Call:
lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
    acceleration + model.year + origin, data = training_set_2)

Residuals:
   Min     1Q Median     3Q    Max 
-7.428 -2.765 -0.133  2.349 12.220 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -17.365671  12.987420  -1.337 0.184883    
cylinders     -1.303389   0.736752  -1.769 0.080595 .  
displacement   0.027050   0.017610   1.536 0.128377    
horsepower    -0.056926   0.033995  -1.675 0.097833 .  
weight        -0.004789   0.001329  -3.604 0.000535 ***
acceleration  -0.346059   0.220746  -1.568 0.120808    
model.year     0.874560   0.142334   6.144 2.76e-08 ***
origin         1.556091   0.695449   2.238 0.027964 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.963 on 82 degrees of freedom
Multiple R-squared:  0.8152,	Adjusted R-squared:  0.7995 
F-statistic: 51.69 on 7 and 82 DF,  p-value: < 2.2e-16


#### Report Performance by Computing The Sum Squared Error

In [69]:
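# Predict mpg for the Question 1 test rows
predicted_vals_2 <- as.matrix(predict(fitted_model_2, test_set_2))
# Sum the squared differences between the actual and predicted mpg
SSRes <- sum((test_set_2['mpg']-predicted_vals_2)^2)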

In [76]:
cat("Sum squared error:", SSRes , "\n")
print("The sum of squared errors obtained with the lm function in R is slightly higher than the one obtained with scikit-learn in Python.")

Sum squared error: 616.2629 
[1] "The sum of squared errors obtained with the lm function in R is slightly higher than the one obtained with scikit-learn in Python."
