# Simulated Dataset

In this notebook you can find the code snippets that I keep reusing to simulate datasets.

- toc: true 
- badges: true
- comments: false
- categories: [r]


*TOC*
* writing a function for creating a dataset with a desired number of rows and columns, given one mean and one sd shared by all columns
* writing a function for creating a small dataset (fewer than 5 columns) with a desired number of rows, given a different mean and sd for each column
* writing a function for creating a larger dataset (more than 5 columns) with a desired number of rows, given a different mean and sd for each column
* writing functions for creating automatic labels for IDs and categories

In [1]:

#°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  loading required libraries for this notebook
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

#loading libraries

library(ggplot2)
library(gridExtra)
library(data.table)

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Example 1 a very simple test dataset
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
# we build a data frame from a matrix: replicate() draws number_of_cols vectors,
# each of length number_of_rows, from a normal distribution (rnorm)
# with mean my_mean and standard deviation my_sd

number_of_rows <- 7
number_of_cols <- 6
my_mean <- 2
my_sd <- 0.5

newdat <-  as.data.frame( 
           replicate(
           number_of_cols,
           rnorm(n = number_of_rows, mean = my_mean, sd = my_sd ))
           )



# display the data frame (the notebook renders it as a table)
newdat


V1,V2,V3,V4,V5,V6
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
1.8166731,1.591285,2.791132,1.870521,1.152394,2.171957
1.7893052,2.520669,2.660691,3.267868,1.748584,1.68079
1.7679939,2.567371,1.817092,1.692758,2.386137,1.225817
0.7540983,1.721482,1.663453,1.836473,1.472142,1.833182
2.5337442,1.016417,2.629636,1.321766,2.336993,1.5444
1.9064151,1.989799,2.237042,1.594813,2.561147,2.068847
1.9511049,1.784966,1.74462,1.929755,2.055294,1.780852


*note about code:* we create a `data.frame` by using `replicate` to generate a series of normally distributed vectors with `rnorm`, one vector per column
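The recipe above can be wrapped in a small function so that it is reusable. The following is only a sketch: the name `simulate_uniform_cols` and the optional `seed` argument are assumptions of mine, not part of the original snippet, and the matrix is filled with a single `rnorm` call rather than with `replicate`.

```r
# Hypothetical helper: every column shares the same mean and sd.
simulate_uniform_cols <- function(number_of_rows, number_of_cols,
                                  my_mean = 0, my_sd = 1, seed = NULL) {
  # Setting a seed is optional; it makes the draw reproducible.
  if (!is.null(seed)) set.seed(seed)
  # One rnorm() call produces all values; matrix() arranges them by column.
  values <- rnorm(number_of_rows * number_of_cols, mean = my_mean, sd = my_sd)
  as.data.frame(matrix(values, nrow = number_of_rows, ncol = number_of_cols))
}

newdat <- simulate_uniform_cols(7, 6, my_mean = 2, my_sd = 0.5, seed = 42)
dim(newdat)
```

As in the original recipe, the columns get the default names `V1`, `V2`, and so on.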

-  Code for creating a small dataset (fewer than 5 columns).
Each column has its own mean and sd.
In the example below we have 3 columns (A, B, C)
with 5 rows. The means are 100, 110, 120 and the sds are 1, 2, 2.


In [2]:

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Ex 2 Another way for a simple dataset
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

# to ALWAYS obtain the same "random" results, REMEMBER to initialize the seed
# (uncomment the next line to do so)
#set.seed(42) 

number_of_rows <- 5
A <- rnorm( n=number_of_rows, mean=100, sd=1 ) 
B <- rnorm( n=number_of_rows, mean=110, sd=2 ) 
C <- rnorm( n=number_of_rows, mean=120, sd=2 ) 

dat <- data.frame(A, B, C)

dat



A,B,C
<dbl>,<dbl>,<dbl>
99.37304,114.1706,121.2134
100.82189,106.1683,120.7714
98.49317,108.5022,119.126
101.78762,110.6774,115.1857
101.42265,103.3642,117.4354
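The seed reminder in the cell above can be checked directly. This is a minimal sketch, assuming any fixed integer seed (42 is simply the value from the commented line): resetting the seed before each draw should give identical data frames.

```r
number_of_rows <- 5

set.seed(42)  # any fixed integer works; 42 is the value from the comment
first_draw <- data.frame(A = rnorm(number_of_rows, mean = 100, sd = 1),
                         B = rnorm(number_of_rows, mean = 110, sd = 2))

set.seed(42)  # resetting the seed restarts the random number stream
second_draw <- data.frame(A = rnorm(number_of_rows, mean = 100, sd = 1),
                          B = rnorm(number_of_rows, mean = 110, sd = 2))

# TRUE when both draws contain exactly the same values
identical(first_draw, second_draw)
```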


- Same as the one above, but more practical for datasets with many columns (more than 5), because the means and sds are passed as vectors

In [3]:

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Ex 3 recipes for adding labels
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

number_of_rows <- 3
means <- c(100, 120, 130, 145)
sds <- c(10, 20, 40, 10)

# one vector per column, each drawn with its own mean and sd
dat <- lapply(
       seq_along(means),
       function(x) rnorm(number_of_rows, mean = means[x], sd = sds[x])
       )
dat <- as.data.frame(do.call(cbind, dat))
 
names_length <- 3
dictionary_size <- 10
# random labels of names_length capital letters, sorted alphabetically
my_labels <- sort(
            replicate(
            length(means),
            paste(sample(LETTERS[1:dictionary_size],
            names_length,
            replace = TRUE),
            collapse="")
            )
            )
colnames(dat) <- my_labels

dat

CCD,DJH,HHH,JFD
<dbl>,<dbl>,<dbl>,<dbl>
97.88498,119.5026,45.92577,142.0278
110.03806,157.6695,176.53169,140.7523
103.59204,114.9973,121.65529,149.3127
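Because the labels above are sampled with replacement, two columns can occasionally receive the same name. A possible guard is sketched below; the helper name `unique_labels` is hypothetical, and it assumes the label space (`dictionary_size^names_length`) is large enough for the number of labels requested.

```r
# Hypothetical helper: keep drawing random labels until n distinct ones exist.
unique_labels <- function(n, names_length = 3, dictionary_size = 10) {
  # Assumption: enough distinct labels exist, otherwise the loop never ends.
  stopifnot(dictionary_size <= 26, dictionary_size^names_length >= n)
  labels <- character(0)
  while (length(labels) < n) {
    candidate <- paste(sample(LETTERS[1:dictionary_size], names_length,
                              replace = TRUE),
                       collapse = "")
    # unique() silently drops a candidate that was already drawn
    labels <- unique(c(labels, candidate))
  }
  sort(labels)
}

my_labels <- unique_labels(4)
anyDuplicated(my_labels)  # 0 means no repeated label
```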


- The same idea as above, wrapped in a reusable function

In [4]:

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Ex 4 a variation on recipe 2 
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

# building a function for generating data with custom number of rows, means and sds

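simpleDataset <- function(number_of_rows, means, sds)
{
  l <- length(means)
  # build each rnorm() call as text, then parse and evaluate it
  res <- lapply(seq_len(l), function(x)
    eval(
      parse(text = paste("rnorm(", number_of_rows, ",", means[x], ",", sds[x], ")", sep = ""))
    )
  )
  # bind the vectors as columns (named X1, X2, ...)
  dat <- data.frame(sapply(res, c))
  # use the row names as an id column
  id <- rownames(dat)
  dat <- cbind(id = id, dat)
  dt <- data.table(dat)
  return(dt)
}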


dat1 <- simpleDataset(number_of_rows = 30,
                      means = c(180, 200, 205),
                      sds = c(30, 20, 25))

dat2 <- simpleDataset(number_of_rows = 30,
                      means = c(45, 50, 35),
                      sds = c(2, 10, 5))
					  
dat <-  rbind(dat1,dat2)
# reshaping the table to long format with melt from data.table
dt.melt <- melt(dat, id.vars="id")
colnames(dt.melt) <- c("id","category","var1")
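`simpleDataset` above assembles each `rnorm` call as text and then evaluates it. The same columns can be drawn by calling `rnorm` directly, which avoids converting the numbers to text and back. The sketch below is one possible variant, not the original method: the name `simple_dataset_direct` is hypothetical, and it returns a base R `data.frame` (wrap the result in `data.table()` if that class is needed).

```r
# Hypothetical variant of simpleDataset() without eval(parse()).
simple_dataset_direct <- function(number_of_rows, means, sds) {
  stopifnot(length(means) == length(sds))
  # Map() walks over means and sds in parallel, one rnorm() call per column.
  cols <- Map(function(m, s) rnorm(number_of_rows, mean = m, sd = s),
              means, sds)
  # Same default column names that data.frame(sapply(...)) produces.
  names(cols) <- paste0("X", seq_along(cols))
  data.frame(id = as.character(seq_len(number_of_rows)), cols)
}

dat_direct <- simple_dataset_direct(number_of_rows = 30,
                                    means = c(180, 200, 205),
                                    sds = c(30, 20, 25))
str(dat_direct)
```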


*note on the code*: the core line of `simpleDataset` is `eval(parse(text=paste("rnorm(",number_of_rows,",",means[x],",",sds[x],")",sep="")))`, where we build the `rnorm` call as a string and then evaluate it with `eval` and `parse` inside the `lapply`


- To create sample names or labels (see https://stackoverflow.com/a/60789938/6483091)

In [5]:

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  More recipes for labelling
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

#short names
my_labels <- letters[1:5]
my_labels
# or, in upper case
my_labels <- LETTERS[1:5]
my_labels
# or labels with an arbitrary number of letters, sampled from the
# built-in constants letters or LETTERS
# (these are character vectors, not functions)

dictionary_size <- 7
label_length <- 5 
n_replicates <- 3

#random

my_labels <- replicate(
             n_replicates,
             paste(sample(LETTERS[1:dictionary_size],
             label_length,
             replace = TRUE),
             collapse="")
             )
my_labels

#sorted

my_labels_sorted <- sort(replicate(
                    n_replicates,
                    paste(sample(LETTERS[1:dictionary_size],
                                 label_length,
                                 replace = TRUE),
                          collapse="")
                    )
                    )
my_labels_sorted
#if you want to mix letters and numbers

# paste() takes the separator through sep; paste0() has no sep argument
alfanum_labels <- paste(rep(LETTERS[1:dictionary_size],
                            each = n_replicates),
                        1:n_replicates,
                        sep = "-")
alfanum_labels
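For ID and category columns, zero-padded identifiers and balanced groups are often handy. This is a minimal sketch using `sprintf` and `rep`; the variable names, the `ID` prefix, and the group sizes are illustrative assumptions.

```r
n_subjects <- 12
n_groups <- 3

# sprintf() pads the number with zeros to a fixed width of three digits.
ids <- sprintf("ID%03d", seq_len(n_subjects))

# rep() with each = ... assigns the same number of subjects to every group.
groups <- factor(rep(LETTERS[1:n_groups], each = n_subjects / n_groups))

labelled <- data.frame(id = ids, category = groups)
head(labelled, 3)
```

Fixed-width IDs sort correctly as text, which plain `paste0("ID", 1:12)` does not (`ID10` would sort before `ID2`).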