## Simple simulation showing the use of `mclapply` in R and `Pool` in Python

In [1]:
%load_ext rpy2.ipython

### R and mclapply for 10 million observations

In [11]:
%%R

library(data.table)
library(parallel)
library(dplyr)

#Make dummy data with simple groups, and dummy function to parallelize over

color_groups <- c("blue", "green", "red", "yellow", "purple")
df <- data.table( grouping = rep(color_groups, 10^7/5), val = runif(10^7))

print(head(df))

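#Function to calculate the logit of val for the group passed to the function
stupid_function <- function(group){
    #Filtering returns a new data.table, so := below does not modify df
    #(named group_dt to avoid shadowing base::subset)
    group_dt <- df[grouping == group, ]
    group_dt[, val_logit := log(val/(1-val))]
    return(group_dt)
}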

#Check how many cores we can use
cores <- detectCores()
print(sprintf("Number of cores: %s", cores))

#Parallelize over the groups (one forked job per group) and take a peek at the results
new_df <- mclapply(color_groups, stupid_function, mc.cores = cores, mc.preschedule = FALSE) %>%
          rbindlist()

print(head(new_df))

   grouping        val
1:     blue 0.39611141
2:    green 0.58412509
3:      red 0.06484023
4:   yellow 0.52193571
5:   purple 0.24946483
6:     blue 0.03084971
[1] "Number of cores: 56"
   grouping        val   val_logit
1:     blue 0.39611141 -0.42169421
2:     blue 0.03084971 -3.44729227
3:     blue 0.86535328  1.86048339
4:     blue 0.52144631  0.08583789
5:     blue 0.75221600  1.11046613
6:     blue 0.28713435 -0.90934275
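
As a quick sanity check on the output above: the function computes the logit, log(val / (1 - val)), so the printed `val_logit` values can be recomputed by hand. A minimal sketch in plain Python follows; the `logit` helper is a name introduced here for illustration and is not part of the notebook.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1 - p))


# (val, val_logit) pairs copied from the first rows of the R output above
printed_rows = [
    (0.39611141, -0.42169421),
    (0.03084971, -3.44729227),
    (0.86535328, 1.86048339),
]

for val, expected in printed_rows:
    # The printed values are rounded to 8 decimals, so compare with a tolerance
    assert math.isclose(logit(val), expected, abs_tol=1e-6)
    print(f"logit({val}) = {logit(val):.6f}")
```

Values below 0.5 give a negative logit and values above 0.5 a positive one, which matches the signs in the printed table.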


### Python and multiprocessing for 10 million observations

In [25]:
import multiprocessing as mp
from multiprocessing import Pool
import pandas as pd
import numpy as np

#Build dummy dataframe and groups
color_groups = ["blue", "green", "red", "yellow", "purple"]
#Integer division: np.repeat expects an integer repeat count
obs = {'grouping':np.repeat(color_groups, 10**7 // 5), 'val':np.random.uniform(size=10**7)}
df = pd.DataFrame(data = obs)

#Take a peek at the dataset
print(df.head())

#Set up function
def stupid_function(group):
    #.copy() makes the subset an independent frame, so adding a column
    #does not trigger pandas' SettingWithCopyWarning
    subset = df[df.grouping == group].copy()
    subset["val_logit"] = np.log(subset["val"] / (1 - subset["val"]))
    return subset

#Check number of cores
cores = mp.cpu_count()
print("Number of cores: {}".format(cores))

#Parallelize over the groups; the context manager shuts the pool down afterwards
with Pool(cores) as p:
    new_df = p.map(stupid_function, color_groups)
new_df = pd.concat(new_df)

print(new_df.head())

  grouping       val
0     blue  0.397099
1     blue  0.015617
2     blue  0.494578
3     blue  0.676089
4     blue  0.526428
Number of cores: 56

  grouping       val  val_logit
0     blue  0.397099  -0.417566
1     blue  0.015617  -4.143638
2     blue  0.494578  -0.021687
3     blue  0.676089   0.735856
4     blue  0.526428   0.105811
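
The cell above has each worker read the global `df` and filter it by group name. One possible variation, sketched below under stated assumptions, splits the frame once with `groupby` and hands each piece to a worker, and caps the pool size at the number of groups (with five groups, any additional workers would have nothing to do). The names `add_logit` and `parallel_logit` are hypothetical helpers introduced for this sketch, and the demo uses a smaller frame than the 10 million rows above so that it runs quickly.

```python
import multiprocessing as mp

import numpy as np
import pandas as pd


def add_logit(subset):
    """Return a copy of `subset` with a logit-transformed `val` column."""
    out = subset.copy()  # explicit copy, so the input frame is left unchanged
    out["val_logit"] = np.log(out["val"] / (1 - out["val"]))
    return out


def parallel_logit(df, n_workers=None):
    """Split `df` by `grouping` and apply `add_logit` to each piece in a pool."""
    pieces = [piece for _, piece in df.groupby("grouping", sort=False)]
    # More workers than groups would sit idle, so cap at the number of groups.
    n_workers = min(n_workers or mp.cpu_count(), len(pieces))
    with mp.Pool(n_workers) as pool:
        results = pool.map(add_logit, pieces)
    return pd.concat(results)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    groups = ["blue", "green", "red", "yellow", "purple"]
    n = 10**5  # smaller than the 10 million rows above, to keep the demo quick
    demo = pd.DataFrame({
        "grouping": np.repeat(groups, n // 5),
        "val": rng.uniform(size=n),
    })
    print(parallel_logit(demo).head())
```

The trade-off is that every piece is pickled and sent to its worker, which adds transfer cost for large frames. In exchange, the worker function no longer depends on a global variable being visible inside the worker processes.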
