# Performing Parallelized Monte-Carlo Simulations in R 

Note: Prior to running this notebook, please select the "R" kernel; otherwise, the code will not run.

## Introduction

When performing simulations, the ability to parallelize processes can be very important. Distributing work among different processors allows for parallel computation, reducing the time to final results.
This demo shows how a large SageMaker notebook instance can be leveraged to perform Monte-Carlo simulations in a parallelized fashion.

We will perform Monte-Carlo simulations to explore social distancing. In the context of infectious diseases (e.g. COVID-19), social distancing is an essential practice to minimize infection.
Given a room of a certain size (e.g. 10 feet by 10 feet) and a certain number of people in the room, we will estimate the expected number of social distancing violations per person, assuming that people are randomly distributed in the room and that none of them are moving. A violation is a pair of people less than 6 units (e.g. feet) apart.


This notebook was originally executed on an **ml.m5.24xlarge** instance. This instance has 96 vCPUs and can thus take advantage of parallelization across many CPUs. If you are using a smaller instance, you should parallelize across fewer CPUs by lowering `num_to_parallelize` in the setup cell.
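As a rough illustration of sizing the worker pool to the machine, the sketch below derives a worker count from the number of detected cores instead of hard-coding it. The helper `choose_num_workers` and its `reserve` and `cap` parameters are hypothetical names introduced here, not part of the original notebook.

```r
library(parallel)

# Hypothetical helper (not in the original notebook): pick a worker count
# that leaves some cores free for the notebook kernel and the OS.
choose_num_workers <- function(reserve = 1, cap = 90) {
    available <- detectCores(logical = TRUE)
    if (is.na(available)) {
        available <- 1  # detectCores() can return NA on some platforms
    }
    max(1, min(cap, available - reserve))
}

num_to_parallelize <- choose_num_workers()
cat("Using", num_to_parallelize, "workers\n")
```

With the default `cap` of 90, a 96-vCPU instance would yield the same 90 workers used in this notebook, while a smaller instance would get fewer.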

In [2]:
#skip if installed already
install.packages("doParallel")

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done


In [1]:
library('foreach')
library('doParallel')
detectCores(all.tests = FALSE, logical = TRUE) #find out how many cores are on the machine
num_to_parallelize=90 #change depending on the instance being used
cl <- makeCluster(num_to_parallelize)
registerDoParallel(cl)

Loading required package: iterators
Loading required package: parallel


In [2]:
perform_room_simulation <- function(x_length,y_length,num_people){
    # Given the dimensions of the room and the number of people,
    # place everyone uniformly at random and return all pairwise distances.
    x=runif(num_people, min = 0, max = x_length)
    y=runif(num_people, min = 0, max = y_length)
    df=cbind(x,y)      # one row per person, columns are the coordinates
    df_dist=dist(df)   # Euclidean distance for every unordered pair
    to_return=as.list(df_dist)
    to_return=lapply(to_return,round,5) # round each distance to 5 decimals
    return(to_return)
}
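To make the counting step concrete, here is a minimal sketch that counts violating pairs for a fixed set of positions. The helper `count_violations` and its `min_distance` parameter are hypothetical names used only for illustration; the threshold of 6 matches the one used in the simulation cells.

```r
# Hypothetical helper: count pairs of people closer than min_distance.
count_violations <- function(x, y, min_distance = 6) {
    pairwise <- dist(cbind(x, y))   # Euclidean distance for every unordered pair
    sum(pairwise < min_distance)    # each violating pair is counted once
}

# Three people at fixed positions.
x <- c(0, 3, 10)
y <- c(0, 4, 10)
num_violations <- count_violations(x, y)
print(num_violations)
```

Here the first two people are exactly 5 units apart and the third is more than 9 units from both, so the count is 1.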

We will start with a small simulation: 5 people in a 10 by 10 room, repeated for 100 iterations.

In [3]:
start.time <- Sys.time()
x_length=10
y_length=10
num_people=5
max_iterations=100
# each iteration yields (number of violating pairs) / (number of people)
result <- foreach (i =1:max_iterations) %do%
{
    mini_result=perform_room_simulation(x_length=x_length,y_length=y_length,num_people=num_people)
    mini_result=unlist(mini_result)
    num_violates=sum(mini_result < 6) #number of pairs closer than 6, i.e. violations of social distancing
    num_violates/num_people #foreach collects the value of the last expression
}
result=unlist(result)
end.time <- Sys.time()
#report elapsed time in seconds regardless of how long the run takes
time_taken <- as.numeric(difftime(end.time, start.time, units = "secs"))
cat("Finished simulation",time_taken,"\n")


Finished simulation 0.04391122 


In [4]:
cat ("Simulation Mean: ",mean(result),"\n")
cat("Simulation Standard Deviation: ", sd(result))

Simulation Mean:  1.196 
Simulation Standard Deviation:  0.4148555

Thus, we have an average of about 1.2 violations per person.

Let's increase the number of people and the size of the room, keeping the number of iterations at 100. This new simulation takes considerably longer, a little over half a minute on the instance used here.

In [5]:
start.time <- Sys.time()
x_length=100
y_length=100
num_people=1000
max_iterations=100
# each iteration yields (number of violating pairs) / (number of people)
result <- foreach (i =1:max_iterations) %do%
{
    mini_result=perform_room_simulation(x_length=x_length,y_length=y_length,num_people=num_people)
    mini_result=unlist(mini_result)
    num_violates=sum(mini_result < 6) #number of pairs closer than 6, i.e. violations of social distancing
    num_violates/num_people #foreach collects the value of the last expression
}
result=unlist(result)
end.time <- Sys.time()
#report elapsed time in seconds regardless of how long the run takes
time_taken <- as.numeric(difftime(end.time, start.time, units = "secs"))
cat("Finished simulation",time_taken,"\n")

Finished simulation 37.03234 


In [6]:
cat ("Simulation Mean: ",mean(result),"\n")
cat("Simulation Standard Deviation: ", sd(result))

Simulation Mean:  5.36624 
Simulation Standard Deviation:  0.08609193

As expected, both the mean and the standard deviation have changed. There are now about 5.4 violations per person.
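One way to sanity-check a mean like this: each iteration divides the number of violating pairs by the number of people, so by linearity of expectation its mean is `choose(n, 2) * p / n`, where `p` is the probability that two uniformly placed people are closer than 6. The sketch below estimates `p` by sampling independent pairs. The helper `estimate_pair_probability`, the number of sampled pairs, and the seed are illustrative assumptions, not part of the original notebook.

```r
set.seed(42)  # arbitrary seed, only to make the sketch repeatable

# Hypothetical helper: Monte-Carlo estimate of P(distance < min_distance)
# for two people placed uniformly at random in the room.
estimate_pair_probability <- function(x_length, y_length, min_distance = 6,
                                      num_pairs = 1e6) {
    dx <- runif(num_pairs, 0, x_length) - runif(num_pairs, 0, x_length)
    dy <- runif(num_pairs, 0, y_length) - runif(num_pairs, 0, y_length)
    mean(sqrt(dx^2 + dy^2) < min_distance)
}

num_people <- 1000
p <- estimate_pair_probability(100, 100)
# choose(n, 2) pairs, each violating with probability p, divided by n people
expected_per_person <- choose(num_people, 2) * p / num_people
cat("Expected violations per person:", expected_per_person, "\n")
```

Under these assumptions the estimate should land close to the simulated mean reported above. Note that each close pair is counted once, so the statistic is violating pairs per person rather than the number of close neighbours each person has.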

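To help scale, we will take advantage of parallel processing. The only change to the loop is replacing `%do%` with `%dopar%`, which distributes the iterations across the registered workers.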

In [7]:
#do it in parallel
start.time <- Sys.time()
x_length=100
y_length=100
num_people=1000
max_iterations=100
# each iteration yields (number of violating pairs) / (number of people)
result <- foreach (i =1:max_iterations) %dopar% #parallelize each iteration
{
    mini_result=perform_room_simulation(x_length=x_length,y_length=y_length,num_people=num_people)
    mini_result=unlist(mini_result)
    num_violates=sum(mini_result < 6) #number of pairs closer than 6, i.e. violations of social distancing
    num_violates/num_people #foreach collects the value of the last expression
}
result=unlist(result)
end.time <- Sys.time()
#report elapsed time in seconds regardless of how long the run takes
time_taken <- as.numeric(difftime(end.time, start.time, units = "secs"))
cat("Finished simulation",time_taken,"\n")


Finished simulation 2.928248 


In [8]:
cat ("Simulation Mean: ",mean(result),"\n")
cat("Simulation Standard Deviation: ", sd(result))

Simulation Mean:  5.36654 
Simulation Standard Deviation:  0.08450526
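The notebook never shuts down the cluster it created. As a minimal sketch of the same `%dopar%` pattern with explicit cleanup, the example below starts a deliberately small cluster, runs a toy loop, and releases the workers even if the loop fails. The worker count of 2 and the toy computation are illustrative choices only.

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)        # small worker count, chosen only for illustration
registerDoParallel(cl)

squares <- tryCatch({
    # .combine = c returns a plain numeric vector instead of a list
    foreach(i = 1:4, .combine = c) %dopar% {
        i^2
    }
}, finally = {
    stopCluster(cl)         # release the worker processes
    registerDoSEQ()         # fall back to the sequential backend
})
print(squares)
```

In this notebook the equivalent step would be calling `stopCluster(cl)` on the cluster created in the setup cell once all parallel runs are finished. Using `.combine = c` would also make the later `unlist(result)` call unnecessary.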

## Conclusion

The parallelized version of this simulation took ~2.9 seconds, while the non-parallelized version took ~37 seconds: roughly a 13-fold improvement. Note that your specific results may differ depending on the size of the instance you are running.
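The speedup can be computed directly from the two timings printed earlier in this notebook. The sketch below also reports parallel efficiency (speedup divided by worker count), a metric added here for illustration that the original notebook does not discuss.

```r
# Wall-clock times in seconds, copied from the outputs shown in this notebook
sequential_seconds <- 37.03234
parallel_seconds <- 2.928248
num_workers <- 90   # value of num_to_parallelize in the setup cell

speedup <- sequential_seconds / parallel_seconds
efficiency <- speedup / num_workers   # fraction of ideal linear speedup

cat("Speedup:", round(speedup, 1), "\n")
cat("Parallel efficiency:", round(efficiency, 2), "\n")
```

The speedup is well below the worker count. One plausible contributor is that 100 iterations cannot be split evenly across 90 workers, so at least one worker must run two iterations; starting tasks and collecting results also adds overhead.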
