In [2]:
%load_ext rpy2.ipython

---
title: "M2R_Parallel_Quicksort"
output: html_notebook
---
# Task 1: Compute confidence intervals for the data from M2R Parallel quicksort experiment

The data is contained in "measurements_03_47.csv".
First we read the data:

In [14]:
%%R 

data <- read.csv("measurements_03_47.csv")



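Then we split the data into separate tables by experiment type: sequential, parallel, and built-in. The row selection below relies on the rows cycling through the three types in exactly that order.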

In [15]:
%%R
data_sequential = data[c(TRUE,FALSE, FALSE), ] 
data_parallel = data[c(FALSE,TRUE,FALSE), ]
data_built_in = data[c(FALSE, FALSE, TRUE), ]

data




      Size        Type     Time
1      100  Sequential 0.000010
2      100    Parallel 0.004024
3      100    Built-in 0.000013
4      100  Sequential 0.000010
5      100    Parallel 0.004448
6      100    Built-in 0.000014
7      100  Sequential 0.000009
8      100    Parallel 0.003384
9      100    Built-in 0.000013
10     100  Sequential 0.000010
11     100    Parallel 0.003738
12     100    Built-in 0.000012
13     100  Sequential 0.000010
14     100    Parallel 0.003133
15     100    Built-in 0.000011
16    1000  Sequential 0.000128
17    1000    Parallel 0.020407
18    1000    Built-in 0.000209
19    1000  Sequential 0.000126
20    1000    Parallel 0.022003
21    1000    Built-in 0.000201
22    1000  Sequential 0.000128
23    1000    Parallel 0.016149
24    1000    Built-in 0.000210
25    1000  Sequential 0.000128
26    1000    Parallel 0.014594
27    1000    Built-in 0.000209
28    1000  Sequential 0.000129
29    1000    Parallel 0.014905
30    1000    Built-in 0.000210
31   100
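
Selecting every third row works only while the file keeps the Sequential, Parallel, Built-in order. A minimal sketch of selecting by the `Type` column instead, using a small hand-made data frame: the name `toy` and its six rows are illustrative, and `trimws()` is applied on the assumption that the CSV values may carry leading spaces.

```r
# Toy data in the same layout as measurements_03_47.csv
# (values copied from the first six rows printed above).
toy <- data.frame(
  Size = rep(100, 6),
  Type = c(" Sequential", " Parallel", " Built-in",
           " Sequential", " Parallel", " Built-in"),
  Time = c(0.000010, 0.004024, 0.000013, 0.000010, 0.004448, 0.000014)
)

# Strip surrounding whitespace, then split into one data frame per type.
toy$Type <- trimws(toy$Type)
by_type <- split(toy, toy$Type)

toy_sequential <- by_type[["Sequential"]]
toy_parallel   <- by_type[["Parallel"]]
toy_built_in   <- by_type[["Built-in"]]

print(toy_sequential)
```

Selecting by value keeps the split correct even if rows are reordered or a measurement is missing.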

Let's compute confidence intervals for the mean time for the different sizes:
1. for each size (with $n = 5$ measurements $x_1, \dots, x_n$)\
compute the sample mean time $\mu = \frac{1}{n} \times \sum_{i=1}^{n} x_i$\
compute the sample standard deviation $\sigma = \sqrt{\frac{1}{n-1} \times \sum_{i=1}^{n} (x_i - \mu)^2}$\
compute the approximate 95% confidence interval $[\mu - 2 \times \frac{\sigma}{\sqrt{n}} , \mu + 2 \times \frac{\sigma}{\sqrt{n}} ]$
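
As a worked instance of these steps, here is a small sketch for the five sequential timings at Size = 100. The factor 2 approximates the normal quantile 1.96. As an assumption added for comparison (it is not part of the recipe above), the block also computes the interval with Student's t quantile for $n - 1 = 4$ degrees of freedom, which is about 2.78 and therefore gives a wider interval for such a small sample.

```r
# Five sequential timings for Size = 100, taken from the data table.
x <- c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010)
n <- length(x)

mu    <- sum(x) / n                        # sample mean
sigma <- sqrt(sum((x - mu)^2) / (n - 1))   # sample standard deviation
se    <- sigma / sqrt(n)                   # standard error of the mean

# Interval with the factor 2 used in the text.
ci_approx <- c(mu - 2 * se, mu + 2 * se)

# Assumption for comparison: Student's t quantile with n - 1 degrees of freedom.
t_crit <- qt(0.975, df = n - 1)
ci_t   <- c(mu - t_crit * se, mu + t_crit * se)

print(ci_approx)
print(ci_t)
```

The hand-written `sigma` agrees with R's built-in `sd(x)`, which also divides by $n - 1$.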
   

In [5]:
%%R
library(dplyr)

data_sequential_mean = data_sequential %>% group_by(Size) %>% summarize(mean = mean(Time)) # 5x2 table holding Size and mean
replication_times = c(5, 5, 5, 5, 5) # to extend the mean array to 25x2
data_sequential_mean_long = data_sequential_mean[rep(row.names(data_sequential_mean), times = replication_times),] # 25x2 array holding Size and mean


data_sequential$squared_difference = (data_sequential_mean_long$mean - data_sequential$Time)^2



Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union



In [6]:
%%R
data_sequential_variance = data_sequential %>% group_by(Size) %>% summarize(variance = sum(squared_difference)/4)



In [7]:
%%R
data_sequential_stdeviation = (data_sequential_variance$variance)^(1/2)  # vector of the sample standard deviations, one per Size


Now make this list of standard deviations into a list of confidence intervals:

In [8]:
%%R
# standard error = standard deviation / sqrt(n), with n = 5 runs per Size
data_sequential_mean$lower_bound = data_sequential_mean$mean - 2*data_sequential_stdeviation/sqrt(5)
data_sequential_mean$upper_bound = data_sequential_mean$mean + 2*data_sequential_stdeviation/sqrt(5)
data_sequential_mean


# A tibble: 5 × 4
     Size      mean lower_bound upper_bound
    <int>     <dbl>       <dbl>       <dbl>
1     100 0.0000098   0.0000094   0.0000102
2    1000 0.000128    0.000127    0.000129 
3   10000 0.00170     0.00165     0.00174  
4  100000 0.0199      0.0198      0.0200   
5 1000000 0.234       0.230       0.237    


The lower_bound and upper_bound columns in data_sequential_mean now hold the corresponding confidence intervals.

## Redoing the computations in a simpler way, after feedback:

In [13]:
%%R
data2 <- read.csv("measurements_03_47.csv")

In [10]:
%%R


data_sequential = data2[c(TRUE,FALSE, FALSE), ] 
data_parallel = data2[c(FALSE,TRUE,FALSE), ]
data_built_in = data2[c(FALSE, FALSE, TRUE), ]



In [11]:
%%R
library(dplyr)
# using the summarise() function to obtain confidence intervals:

sequential = summarise(data_sequential, mean=mean(Time), n=n(), err = sd(Time)/sqrt(n), lb=mean-2*err, ub=mean+2*err)
print(paste("The confidence interval for the sequential method is:",sequential$lb,",",sequential$ub))

parallel = summarise(data_parallel, mean=mean(Time), n=n(), err = sd(Time)/sqrt(n), lb=mean-2*err, ub=mean+2*err)
print(paste("The confidence interval for the parallel method is:",parallel$lb,",",parallel$ub))

built_in = summarise(data_built_in, mean=mean(Time), n=n(), err = sd(Time)/sqrt(n), lb=mean-2*err, ub=mean+2*err)
print(paste("The confidence interval for the built_in method is:",built_in$lb,",",built_in$ub))

## This doesn't seem right: without group_by(Size), summarise() pools all sizes, so each method gets only one confidence interval.
## I think I should construct one confidence interval per Size = [100, 1000, ...]





[1] "The confidence interval for the sequential method is: 0.0136736801858198 , 0.0884414398141802"
[1] "The confidence interval for the parallel method is: 0.0260325319851516 , 0.0735357880148484"
[1] "The confidence interval for the built_in method is: 0.0141461169770554 , 0.0916996430229446"
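
As the comment in the cell above notes, one interval per method mixes all sizes together. A minimal base-R sketch of the per-size version, using the sequential timings from `measurements_03_47.csv` as printed in this notebook. The helper name `ci_one_group` is hypothetical, and the factor 2 is the same approximation used earlier.

```r
# Sequential timings, five runs per size, copied from the notebook's data listing.
sizes <- c(100, 1000, 10000, 100000, 1000000)
data_sequential <- data.frame(
  Size = rep(sizes, each = 5),
  Time = c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010,
           0.000128, 0.000126, 0.000128, 0.000128, 0.000129,
           0.001774, 0.001698, 0.001652, 0.001680, 0.001675,
           0.020040, 0.020004, 0.019763, 0.019913, 0.019726,
           0.230648, 0.235778, 0.238383, 0.232921, 0.230096)
)

# Hypothetical helper: mean, run count and approximate 95% bounds for one group.
ci_one_group <- function(time) {
  n   <- length(time)
  err <- sd(time) / sqrt(n)   # standard error of the mean
  c(mean = mean(time), n = n,
    lb = mean(time) - 2 * err, ub = mean(time) + 2 * err)
}

# Apply the helper to each Size separately and stack the results.
groups   <- split(data_sequential$Time, data_sequential$Size)
per_size <- do.call(rbind, lapply(groups, ci_one_group))
per_size <- data.frame(Size = as.numeric(rownames(per_size)),
                       per_size, row.names = NULL)
print(per_size)
```

With dplyr, the equivalent change would be to call `group_by(Size)` before `summarise()`.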


## Fitting a linear model for the data:
### Let's first look for outliers

In [26]:
%%R
library(ggplot2)

data_sequential




      Size        Type     Time
1      100  Sequential 0.000010
4      100  Sequential 0.000010
7      100  Sequential 0.000009
10     100  Sequential 0.000010
13     100  Sequential 0.000010
16    1000  Sequential 0.000128
19    1000  Sequential 0.000126
22    1000  Sequential 0.000128
25    1000  Sequential 0.000128
28    1000  Sequential 0.000129
31   10000  Sequential 0.001774
34   10000  Sequential 0.001698
37   10000  Sequential 0.001652
40   10000  Sequential 0.001680
43   10000  Sequential 0.001675
46  100000  Sequential 0.020040
49  100000  Sequential 0.020004
52  100000  Sequential 0.019763
55  100000  Sequential 0.019913
58  100000  Sequential 0.019726
61 1000000  Sequential 0.230648
64 1000000  Sequential 0.235778
67 1000000  Sequential 0.238383
70 1000000  Sequential 0.232921
73 1000000  Sequential 0.230096
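
Reading the listing by eye is one way to look for outliers. As a rough sketch, the check can also be done per size with the 1.5 × IQR rule. That rule is an assumption here (one common convention, not something this notebook prescribes), and `is_outlier` is a hypothetical helper name.

```r
# Sequential timings, five runs per size, copied from the listing above.
data_sequential <- data.frame(
  Size = rep(c(100, 1000, 10000, 100000, 1000000), each = 5),
  Time = c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010,
           0.000128, 0.000126, 0.000128, 0.000128, 0.000129,
           0.001774, 0.001698, 0.001652, 0.001680, 0.001675,
           0.020040, 0.020004, 0.019763, 0.019913, 0.019726,
           0.230648, 0.235778, 0.238383, 0.232921, 0.230096)
)

# Hypothetical helper: TRUE for values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
is_outlier <- function(x) {
  q   <- unname(quantile(x, c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# ave() applies the helper within each Size; it returns 0/1, so compare with 1.
data_sequential$outlier <-
  ave(data_sequential$Time, data_sequential$Size, FUN = is_outlier) == 1

print(data_sequential[data_sequential$outlier, ])
```

With only five runs per size the rule is crude: when most runs in a group are identical the IQR is zero, so any differing run is flagged. Flagged rows are candidates for a closer look, not automatic removals.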


## Fitting the linear models

In [35]:
%%R
sequential_model = lm(Time ~ Size, data = data_sequential)
summary(sequential_model)


Call:
lm(formula = Time ~ Size, data = data_sequential)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0032211 -0.0023774  0.0004466  0.0010015  0.0050659 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.016e-03  4.583e-04  -2.217   0.0368 *  
Size         2.343e-07  1.020e-09 229.807   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.001992 on 23 degrees of freedom
Multiple R-squared:  0.9996,	Adjusted R-squared:  0.9995 
F-statistic: 5.281e+04 on 1 and 23 DF,  p-value: < 2.2e-16



In [37]:
%%R
parallel_model = lm(Time ~ Size, data = data_parallel)
summary(parallel_model)


Call:
lm(formula = Time ~ Size, data = data_parallel)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.014052 -0.002723 -0.001290  0.004686  0.018888 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.717e-02  2.364e-03   7.263 2.16e-07 ***
Size        1.468e-07  5.260e-09  27.904  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01027 on 23 degrees of freedom
Multiple R-squared:  0.9713,	Adjusted R-squared:  0.9701 
F-statistic: 778.6 on 1 and 23 DF,  p-value: < 2.2e-16



In [40]:
%%R
built_in_model = lm(Time ~ Size, data = data_built_in)
summary(built_in_model)


Call:
lm(formula = Time ~ Size, data = data_built_in)

Residuals:
       Min         1Q     Median         3Q        Max 
-0.0029607 -0.0003849  0.0005001  0.0010631  0.0010858 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.096e-03  3.513e-04   -3.12  0.00482 ** 
Size         2.431e-07  7.817e-10  310.98  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.001527 on 23 degrees of freedom
Multiple R-squared:  0.9998,	Adjusted R-squared:  0.9998 
F-statistic: 9.671e+04 on 1 and 23 DF,  p-value: < 2.2e-16



All three linear models fit well: the R-squared values are close to 1 (0.9996, 0.9713, and 0.9998), meaning that Size explains almost all of the variance in Time and little is left unexplained, as expected based on what we know about the experiment. The p-values for the Size coefficient are very small (< 2e-16), which is strong evidence of a relationship between the Size (of the dataset) and the Time taken to sort.
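
The quantities discussed here can also be read from the fitted model programmatically rather than from the printed summary. A small sketch for the sequential model, rebuilt from the sequential timings listed earlier so that the block runs on its own. The `confint()` call is an addition for illustration, giving a 95% interval for the slope.

```r
# Sequential timings, five runs per size, copied from the notebook's data listing.
data_sequential <- data.frame(
  Size = rep(c(100, 1000, 10000, 100000, 1000000), each = 5),
  Time = c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010,
           0.000128, 0.000126, 0.000128, 0.000128, 0.000129,
           0.001774, 0.001698, 0.001652, 0.001680, 0.001675,
           0.020040, 0.020004, 0.019763, 0.019913, 0.019726,
           0.230648, 0.235778, 0.238383, 0.232921, 0.230096)
)

sequential_model <- lm(Time ~ Size, data = data_sequential)
fit <- summary(sequential_model)

r_squared  <- fit$r.squared   # share of the variance in Time explained by Size
coef_table <- coef(fit)       # columns: Estimate, Std. Error, t value, Pr(>|t|)
slope      <- coef_table["Size", "Estimate"]
slope_p    <- coef_table["Size", "Pr(>|t|)"]

# 95% confidence interval for the slope (seconds per element).
slope_ci <- confint(sequential_model, "Size", level = 0.95)

print(r_squared)
print(coef_table)
print(slope_ci)
```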

## Let's plot the Residuals vs Leverage plot and analyse: #TODO

In [39]:
%%R
# plot(sequential_model)

NULL
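
For the pending analysis, a possible starting point is to compute the numbers behind a Residuals vs Leverage plot directly. The sketch below does this for the sequential model using base R's `hatvalues()`, `rstandard()` and `cooks.distance()`. The data frame is rebuilt from the sequential timings listed earlier so the block runs on its own, and the name `diagnostics` is illustrative.

```r
# Sequential timings, five runs per size, copied from the notebook's data listing.
data_sequential <- data.frame(
  Size = rep(c(100, 1000, 10000, 100000, 1000000), each = 5),
  Time = c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010,
           0.000128, 0.000126, 0.000128, 0.000128, 0.000129,
           0.001774, 0.001698, 0.001652, 0.001680, 0.001675,
           0.020040, 0.020004, 0.019763, 0.019913, 0.019726,
           0.230648, 0.235778, 0.238383, 0.232921, 0.230096)
)
sequential_model <- lm(Time ~ Size, data = data_sequential)

# Per-observation diagnostics used by the Residuals vs Leverage plot.
diagnostics <- data.frame(
  Size      = data_sequential$Size,
  leverage  = hatvalues(sequential_model),       # diagonal of the hat matrix
  std_resid = rstandard(sequential_model),       # standardized residuals
  cooks_d   = cooks.distance(sequential_model)   # influence of each point
)

# Show the five most influential observations.
print(head(diagnostics[order(-diagnostics$cooks_d), ], 5))

# The plot itself is panel 5 of plot.lm:
# plot(sequential_model, which = 5)
```

Because the sizes are spaced by powers of ten, the runs at Size = 1000000 lie far from the mean of `Size` and carry most of the leverage, so they largely determine the fitted slope.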
