In [3]:
%load_ext rpy2.ipython

---
title: "M2R_Parallel_Quicksort"
output: html_notebook
---
# Task 1: Compute confidence intervals for the data from the M2R parallel quicksort experiment

The data is contained in "measurements_03_47.csv".
First we read the data:

In [23]:
%%R 

data <- read.csv("measurements_03_47.csv")



Then we split the data into separate tables by experiment type: sequential, parallel, and built-in.

In [50]:
%%R
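# Rows cycle through Sequential, Parallel, Built-in, so a recycled
# logical mask of length 3 selects every third row for each type.
data_sequential = data[c(TRUE, FALSE, FALSE), ]
data_parallel = data[c(FALSE, TRUE, FALSE), ]
data_built_in = data[c(FALSE, FALSE, TRUE), ]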

data




      Size        Type     Time
1      100  Sequential 0.000010
2      100    Parallel 0.004024
3      100    Built-in 0.000013
4      100  Sequential 0.000010
5      100    Parallel 0.004448
6      100    Built-in 0.000014
7      100  Sequential 0.000009
8      100    Parallel 0.003384
9      100    Built-in 0.000013
10     100  Sequential 0.000010
11     100    Parallel 0.003738
12     100    Built-in 0.000012
13     100  Sequential 0.000010
14     100    Parallel 0.003133
15     100    Built-in 0.000011
16    1000  Sequential 0.000128
17    1000    Parallel 0.020407
18    1000    Built-in 0.000209
19    1000  Sequential 0.000126
20    1000    Parallel 0.022003
21    1000    Built-in 0.000201
22    1000  Sequential 0.000128
23    1000    Parallel 0.016149
24    1000    Built-in 0.000210
25    1000  Sequential 0.000128
26    1000    Parallel 0.014594
27    1000    Built-in 0.000209
28    1000  Sequential 0.000129
29    1000    Parallel 0.014905
30    1000    Built-in 0.000210
31   100
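
The positional split above depends on the rows always repeating in the order Sequential, Parallel, Built-in. A minimal sketch of selecting by the `Type` column instead is shown here on a small stand-in data frame (`toy` and the `toy_*` tables are illustrative names, not part of the notebook). It assumes the labels are exactly "Sequential", "Parallel", and "Built-in" apart from possible surrounding spaces, which `trimws()` removes.

```r
# Stand-in for the CSV: same columns, rows deliberately out of order
toy <- data.frame(
  Size = c(100, 100, 100, 1000, 1000, 1000),
  Type = c("Parallel", " Sequential", "Built-in", "Built-in", "Parallel", "Sequential"),
  Time = c(0.004024, 0.000010, 0.000013, 0.000209, 0.020407, 0.000128)
)

# Compare on the trimmed label so stray spaces in the file do not matter
type <- trimws(toy$Type)
toy_sequential <- toy[type == "Sequential", ]
toy_parallel   <- toy[type == "Parallel", ]
toy_built_in   <- toy[type == "Built-in", ]
```

Selecting by label keeps the split correct even if the measurement order in the file changes.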

Let's compute confidence intervals for the mean time for the different sizes:
1. for each size\
compute the sample mean time $\mu = \frac{1}{5} \times \sum_{i=1}^{5} x_i$\
compute the sample standard deviation $\sigma = \sqrt{\frac{1}{4} \sum_{i=1}^{5} (x_i - \mu)^2}$\
compute the 95% confidence interval = $[\mu - 2 \times \frac{\sigma}{\sqrt{n}} , \mu + 2 \times \frac{\sigma}{\sqrt{n}} ]$, with $n = 5$
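
As a worked check, the sketch below applies the mean, standard deviation, and interval computation to the five sequential measurements for Size 100 from the printed data. The names `x`, `mu`, `sigma`, `half_width`, and `ci` are chosen here for illustration. The standard deviation uses the $n - 1$ denominator, as R's `sd()` does.

```r
# Five sequential timings for Size = 100, copied from the printed data
x <- c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010)
n <- length(x)

mu <- sum(x) / n                           # sample mean
sigma <- sqrt(sum((x - mu)^2) / (n - 1))   # sample standard deviation, same as sd(x)
half_width <- 2 * sigma / sqrt(n)          # approximate 95% margin
ci <- c(mu - half_width, mu + half_width)
print(ci)
```

For these five values the interval is approximately [9.4e-06, 1.02e-05] seconds.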
   

In [36]:
%%R
library(dplyr)

# Mean time per size (5 x 2 tibble holding Size and mean)
data_sequential_mean = data_sequential %>% group_by(Size) %>% summarize(mean = mean(Time))
replication_times = c(5, 5, 5, 5, 5) # each size has 5 measurements
# Repeat each size's row 5 times to line up with the 25 measurements (25 x 2)
data_sequential_mean_long = data_sequential_mean[rep(seq_len(nrow(data_sequential_mean)), times = replication_times),]

# Squared deviation of each measurement from the mean of its size
data_sequential$squared_difference = (data_sequential_mean_long$mean - data_sequential$Time)^2


# A tibble: 5 × 2
     Size      mean
    <int>     <dbl>
1     100 0.0000098
2    1000 0.000128 
3   10000 0.00170  
4  100000 0.0199   
5 1000000 0.234    


In [7]:
%%R
# Sample variance per size: sum of squared deviations divided by n - 1 = 4
data_sequential_variance = data_sequential %>% group_by(Size) %>% summarize(variance = sum(squared_difference)/4)



In [8]:
%%R
data_sequential_stdeviation = sqrt(data_sequential_variance$variance)  # vector of sample standard deviations, one per size


Now make this list of standard deviations into a list of confidence intervals:

In [9]:
%%R
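# Half-width used here: 2 * sd / 2, i.e. one sample standard deviation
# (the divisor 2 stands in for sqrt(n); with n = 5, sqrt(5) is about 2.236)
data_sequential_mean$lower_bound = data_sequential_mean$mean - 2*data_sequential_stdeviation/2
data_sequential_mean$upper_bound = data_sequential_mean$mean + 2*data_sequential_stdeviation/2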
data_sequential_mean


# A tibble: 5 × 4
     Size      mean lower_bound upper_bound
    <int>     <dbl>       <dbl>       <dbl>
1     100 0.0000098  0.00000935   0.0000102
2    1000 0.000128   0.000127     0.000129 
3   10000 0.00170    0.00165      0.00174  
4  100000 0.0199     0.0197       0.0200   
5 1000000 0.234      0.230        0.237    


The lower_bound and upper_bound columns in data_sequential_mean now hold the corresponding confidence intervals.
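
The factor 2 approximates the normal quantile 1.96. With only five measurements per size, a common alternative is the Student t quantile with $n - 1$ degrees of freedom, which gives a wider interval. The sketch below is an illustration under the assumption of roughly normally distributed timings; `t_factor` and `ci_t` are names introduced here, and `x` holds the Size 100 sequential timings from the printed data.

```r
x <- c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010)
n <- length(x)

t_factor <- qt(0.975, df = n - 1)          # two-sided 95% quantile, about 2.776 for 4 df
half_width <- t_factor * sd(x) / sqrt(n)   # margin based on the standard error
ci_t <- c(mean(x) - half_width, mean(x) + half_width)
print(ci_t)
```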

## Redoing the computations in a simpler way, after feedback:

In [11]:
%%R
data2 <- read.csv("measurements_03_47.csv")

In [91]:
%%R


data_sequential = data2[c(TRUE,FALSE, FALSE), ] 
data_parallel = data2[c(FALSE,TRUE,FALSE), ]
data_built_in = data2[c(FALSE, FALSE, TRUE), ]



In [97]:
%%R
library(dplyr)
# using the summarise() function to obtain confidence intervals:

sequential = summarise(data_sequential, mean=mean(Time), n=n(), err = sd(Time)/sqrt(n), lb=mean-2*err, ub=mean+2*err)
print(paste("The confidence interval for the sequential method is:",sequential$lb,",",sequential$ub))

parallel = summarise(data_parallel, mean=mean(Time), n=n(), err = sd(Time)/sqrt(n), lb=mean-2*err, ub=mean+2*err)
print(paste("The confidence interval for the parallel method is:",parallel$lb,",",parallel$ub))

built_in = summarise(data_built_in, mean=mean(Time), n=n(), err = sd(Time)/sqrt(n), lb=mean-2*err, ub=mean+2*err)
print(paste("The confidence interval for the built_in method is:",built_in$lb,",",built_in$ub))

## This doesn't seem right because now for each method, there is only one confidence interval. I think I should construct one 
## confidence interval for the data for each Size = [100, 1000, ...]





[1] "The confidence interval for the sequential method is: 0.0136736801858198 , 0.0884414398141802"
[1] "The confidence interval for the parallel method is: 0.0260325319851516 , 0.0735357880148484"
[1] "The confidence interval for the built_in method is: 0.0141461169770554 , 0.0916996430229446"
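
To get one interval per size, as the comment in the cell suggests, one option is to call `group_by(Size)` before `summarise()`. The sketch below shows this on a stand-in table, `toy_sequential`, built from the Size 100 and Size 1000 sequential timings in the printed data. The names `toy_sequential` and `ci_by_size` are illustrative.

```r
library(dplyr)

# Stand-in for data_sequential: two sizes, five timings each
toy_sequential <- data.frame(
  Size = rep(c(100, 1000), each = 5),
  Time = c(0.000010, 0.000010, 0.000009, 0.000010, 0.000010,
           0.000128, 0.000126, 0.000128, 0.000128, 0.000129)
)

# One row per Size: mean, count, standard error, and the 2-standard-error interval
ci_by_size <- toy_sequential %>%
  group_by(Size) %>%
  summarise(mean = mean(Time), n = n(), err = sd(Time) / sqrt(n),
            lb = mean - 2 * err, ub = mean + 2 * err)
print(ci_by_size)
```

The same pipeline should apply unchanged to `data_sequential`, `data_parallel`, and `data_built_in`, giving five intervals per method.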


## Fitting a linear model for the data: