# Simulated Dataset

This notebook collects the code snippets I keep reusing to simulate datasets.

- toc: true 
- badges: true
- comments: false
- categories: [r]


*TOC*
* code for creating a dataset with a desired number of rows and columns, given one mean and one sd (the same for all columns)
* code for creating a small dataset (fewer than 5 columns) with a desired number of rows, given a mean and an sd for each column
* code for creating a larger dataset (more than 5 columns) with a desired number of rows, given a mean and an sd for each column
* code for creating automatic labels for IDs and categories

In [1]:
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  loading required libraries for this notebook
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

#loading libraries

library(ggplot2)
library(gridExtra)
library(data.table)

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Example 1 a very simple test dataset
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
# build a data frame from a matrix: replicate() draws number_of_cols vectors,
# each of length number_of_rows, from a normal distribution (rnorm) with
# mean my_mean and standard deviation my_sd

number_of_rows <- 7
number_of_cols <- 6
my_mean <- 2
my_sd <- 0.5

newdat <-  as.data.frame( 
           replicate(
           number_of_cols,
           rnorm(n = number_of_rows, mean = my_mean, sd = my_sd ))
           )



# print the data frame (the notebook renders it as a table)
newdat


V1,V2,V3,V4,V5,V6
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
2.321458,2.299545,1.755115,1.870463,2.056384,2.482437
1.402577,2.516995,2.834609,1.863092,1.895504,2.307111
2.306785,1.086934,1.663395,1.640607,2.098327,1.021668
2.304635,1.754539,2.539501,1.841906,1.655369,1.674042
1.904062,2.189214,2.646093,2.511605,2.371681,2.094404
2.628779,1.257258,1.890032,2.057319,2.131018,1.740958
1.903312,1.81574,2.140261,1.585239,1.493231,1.472626
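The recipe above can be wrapped in a small function so that the same call works for any size. This is a minimal sketch: the name `simulate_dataset` and its default arguments are assumptions made for illustration and are not part of the original notebook.

```r
# Hypothetical helper (name and defaults are illustrative):
# every column is drawn from the same normal distribution.
simulate_dataset <- function(number_of_rows, number_of_cols,
                             my_mean = 0, my_sd = 1) {
  # draw all values at once, then shape them into a rows x cols matrix
  values <- rnorm(number_of_rows * number_of_cols, mean = my_mean, sd = my_sd)
  as.data.frame(matrix(values, nrow = number_of_rows, ncol = number_of_cols))
}

newdat <- simulate_dataset(number_of_rows = 7, number_of_cols = 6,
                           my_mean = 2, my_sd = 0.5)
dim(newdat)   # 7 rows, 6 columns
```

As with `replicate()`, the columns get the default names V1, V2, and so on.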


- Code for creating a small dataset (fewer than 5 columns).
Each column has its own mean and sd.
The example below has 3 columns (A, B, C)
and 5 rows. The means are 100, 110, 120 and the sds are 1, 2, 2.


In [2]:
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Ex 2 Another way for a simple dataset
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

# to ALWAYS obtain the same "random" results, REMEMBER TO set the seed first
#set.seed(42) 

number_of_rows <- 5
A <- rnorm( n=number_of_rows, mean=100, sd=1 ) 
B <- rnorm( n=number_of_rows, mean=110, sd=2 ) 
C <- rnorm( n=number_of_rows, mean=120, sd=2 ) 

dat=data.frame(A,B,C) 

dat



A,B,C
<dbl>,<dbl>,<dbl>
100.80638,108.6651,119.7474
99.64688,112.5239,121.8453
101.57203,109.7251,121.3858
100.69102,107.9891,119.74
99.62137,108.9515,118.5509
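The reminder about the seed can be checked directly. In this small sketch (the variable names are illustrative), resetting the seed to the same value reproduces the same draws, while continuing without a reset gives new ones.

```r
# same seed, same "random" numbers
set.seed(42)
first_draw <- rnorm(n = 5, mean = 100, sd = 1)

set.seed(42)
second_draw <- rnorm(n = 5, mean = 100, sd = 1)

identical(first_draw, second_draw)   # TRUE

# without resetting the seed, the random stream simply continues
third_draw <- rnorm(n = 5, mean = 100, sd = 1)
identical(first_draw, third_draw)    # FALSE
```

The seed value itself (42 here) is arbitrary; what matters is setting it before the first random call.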


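- Same as the one above, but more practical for datasets with many columns (more than 5): the means and sds are passed as vectors. The example uses 4 columns to keep the output small, and the columns get random three-letter labels.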

In [3]:
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Ex 3 recipes for adding labels
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

number_of_rows <- 3
means <- c(100, 120, 130, 145)
sds <- c(10, 20, 40, 10)

# one rnorm() draw per column, each with its own mean and sd
dat <- lapply(
       seq_along(means),
       function(x) rnorm(number_of_rows, mean = means[x], sd = sds[x])
       )
dat <- as.data.frame(do.call(cbind, dat))
 
# random column labels: names_length letters sampled from the first
# dictionary_size capital letters (duplicates are possible by chance)
names_length <- 3
dictionary_size <- 10
my_labels <- sort(
            replicate(
            length(means),
            paste(sample(LETTERS[1:dictionary_size],
            names_length,
            replace = TRUE),
            collapse="")
            )
            )
colnames(dat) <- my_labels

dat

AII,CAE,HBE,IGG
<dbl>,<dbl>,<dbl>,<dbl>
108.75016,131.5941,37.42085,153.1532
91.07011,153.6062,149.42347,163.8499
104.73926,145.7393,112.96202,144.1259
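The per-column draw can also be written with `mapply`, which walks over the means and sds in parallel instead of indexing them by position. This is a sketch of an alternative, not the approach used in the cell above; the argument names `m` and `s` are illustrative.

```r
set.seed(1)
number_of_rows <- 3
means <- c(100, 120, 130, 145)
sds   <- c(10, 20, 40, 10)

# mapply pairs means[i] with sds[i]; each call returns one column,
# and the columns are collected into a number_of_rows x length(means) matrix
dat <- as.data.frame(
  mapply(function(m, s) rnorm(number_of_rows, mean = m, sd = s), means, sds)
)
dim(dat)   # 3 rows, 4 columns
```

One caveat: with a single row, `mapply` simplifies its result to a plain vector rather than a matrix, so the `lapply` plus `cbind` version is the safer choice in that edge case.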


- The same idea wrapped in a reusable function, followed by reshaping the result to long format with melt

In [1]:

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  Ex 4 a variation on recipe 2 
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

# building a function for generating data with custom number of rows, means and sds

simpleDataset <- function(number_of_rows, means, sds)
{
# one vector of number_of_rows normal draws per (mean, sd) pair
res <- lapply(seq_along(means), function(x)
	   rnorm(number_of_rows, mean = means[x], sd = sds[x])
	   )
# bind the vectors as columns (named X1, X2, ...)
dat <- data.frame(sapply(res, c))
# use the row names ("1", "2", ...) as an id column
id <- rownames(dat)
dat <- cbind(id = id, dat)
dt <- data.table(dat)
return(dt)
}


dat1 <- simpleDataset(number_of_rows=30,
                      means=c(180,200,205),
                      sds=c(30,20,25))

dat2 <- simpleDataset(number_of_rows=30,
                      means=c(45,50,35),
                      sds=c(2,10,5))

dat <- rbind(dat1,dat2)
# reshaping the table from wide to long format using melt from data.table
dt.melt <- melt(dat, id.vars="id")
colnames(dt.melt) <- c("id","category","var1")

# show the first rows
head(dt.melt)


id,category,var1
<chr>,<fct>,<dbl>
1,X1,88.28721
2,X1,92.59345
3,X1,89.94163
4,X1,88.66391
5,X1,89.01439
6,X1,88.57556
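When two simulated tables are stacked with `rbind`, the row ids restart at 1 in the second block and nothing records which block a row came from. The sketch below shows one way to keep track of both, using base R only; the column name `group` and the values `g1` and `g2` are assumptions chosen for illustration.

```r
set.seed(1)
# two hypothetical blocks simulated with different parameters
dat1 <- data.frame(X1 = rnorm(30, mean = 180, sd = 30),
                   X2 = rnorm(30, mean = 200, sd = 20))
dat2 <- data.frame(X1 = rnorm(30, mean = 45, sd = 2),
                   X2 = rnorm(30, mean = 50, sd = 10))

# tag each block before stacking so the origin of every row is kept
dat1$group <- "g1"
dat2$group <- "g2"
dat <- rbind(dat1, dat2)

# build an id that is unique across the combined table
dat$id <- seq_len(nrow(dat))

table(dat$group)   # 30 rows in each group
```

With the `group` column in place, the long-format table can carry it along as an extra id variable when reshaping.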


- To create sample names or labels (see https://stackoverflow.com/a/60789938/6483091)


In [3]:

#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°
#  More recipes for labelling
#+++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°+++++++++°

#short names
my_labels <- letters[1:5]
my_labels
# or, in upper case
my_labels <- LETTERS[1:5]
my_labels
# or labels of arbitrary length built from the Latin alphabet,
# using the built-in constants letters and LETTERS
# (they are character vectors, not functions)

dictionary_size <- 7
label_length <- 5 
n_replicates <- 3

#random

my_labels <- replicate(
             n_replicates,
             paste(sample(LETTERS[1:dictionary_size],
             label_length,
             replace = TRUE),
             collapse="")
             )
my_labels

#sorted

my_labels_sorted <- sort(replicate(
                    n_replicates, 
                    paste
                    (sample(LETTERS[1:dictionary_size],
                     label_length,
                     replace = TRUE),
                     collapse="")
                    )
                    )
my_labels_sorted
#if you want to mix letters and numbers

# paste0() has no sep argument, so paste() with sep = "-" is used here
alfanum_labels <- paste(rep(LETTERS[1:dictionary_size],
                            each = n_replicates),
                        1:n_replicates,
                        sep = "-")
alfanum_labels
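Two more labelling recipes fit the same theme. This sketch assumes base R only; the names `n_samples`, `sample_ids` and `raw_labels` are illustrative. It builds zero-padded IDs with `sprintf` and uses `make.unique` to repair duplicates, which random labels can produce by chance.

```r
n_samples <- 12

# zero-padded ids keep their numeric order when sorted as text
sample_ids <- sprintf("ID%03d", seq_len(n_samples))
sample_ids[c(1, 12)]   # "ID001" "ID012"

# make.unique() appends a numeric suffix to repeated labels
raw_labels <- c("ABC", "ABD", "ABC")
make.unique(raw_labels)   # "ABC" "ABD" "ABC.1"
```

The padding width (3 here) should be at least the number of digits in the largest ID, otherwise the text sort order no longer matches the numeric order.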