<h1 align="center"> Creating the Artificial Dataset </h1>

<h2> 1. Introduction </h2>

This notebook describes in detail how I created the ArtificialDataset, following the instructions in the provided Metadata file.
First, I created the full dataset, with every variable stored in the correct data type. Second, I introduced the missing values at the percentages given in the Metadata file.
I followed this order because some variables are derived from others (for example, Profit from Revenue and Expenses): introducing missing values before deriving them would carry each gap into the dependent variables and produce more missing values than specified.
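
To make the reasoning concrete, here is a minimal sketch of what would happen if a gap were introduced before the derived columns were computed. The data frame `toy` and its values are invented for illustration and are not part of the ArtificialDataset.

```r
# Toy data (made-up values): one missing Revenue entry
toy <- data.frame(Revenue = c(100, NA, 300), Expenses = c(40, 50, 60))

# Derived columns inherit the gap from their input
toy$Profit   <- toy$Revenue - toy$Expenses      # NA wherever Revenue is NA
toy$LossFlag <- ifelse(toy$Profit < 0, 1, 0)    # the NA carries over again

# Count missing values per column
colSums(is.na(toy))
```

One missing input becomes three missing cells in the same row, which is why the gaps are added only after every variable exists.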

<h2> 2. Creating the Full Dataset </h2>

In [1]:
#Display numbers in fixed notation instead of scientific notation
options(scipen = 999)

<h3> 2.1 CompanyID </h3>

First, I created the CompanyID variable as a unique five-digit identifier per company, running from 10001 to 10100. Each ID is repeated 12 times, once per monthly observation.

In [2]:
a <- seq(from = 10001, to = 10100, by = 1)  # Unique Company ID
b <- NULL
for (i in 1:100){
  b <- c(b, rep(a[i], 12))
}
CompanyID <- sort(b) # CompanyID vector

In [3]:
#Inspecting the results
length(unique(CompanyID))
CompanyID
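
The loop above can also be written as a single vectorized call. The sketch below is an alternative, not the code used in this notebook, and the names `ids` and `CompanyID_alt` are introduced only for illustration.

```r
# 100 unique IDs, each repeated 12 times in consecutive positions
ids <- seq(from = 10001, to = 10100, by = 1)
CompanyID_alt <- rep(ids, each = 12)

# Same checks as in the inspection cell
length(CompanyID_alt)
length(unique(CompanyID_alt))
```

Because `rep(..., each = 12)` already returns the IDs in ascending order, no call to `sort()` is needed.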

<h3> 2.2 Date and ArtificialDataset Template </h3>

In [4]:
Date <- NA

In [5]:
ArtificialDataset <- data.frame(Date, CompanyID)
ArtificialDataset$CompanyID <- as.factor(ArtificialDataset$CompanyID)

In [6]:
ArtificialDataset$Date <- seq(as.Date('2016/01/01'), as.Date('2016/12/01'), by = "month")
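
Assigning 12 dates to a 1,200-row column works because R recycles the shorter vector when the number of rows is an exact multiple of its length. The sketch below makes that repetition explicit with `rep()`. The names `months`, `ids`, and `template` are illustrative and do not appear elsewhere in the notebook.

```r
# Twelve month starts for 2016
months <- seq(as.Date("2016-01-01"), as.Date("2016-12-01"), by = "month")

# 100 companies with 12 consecutive rows each
ids <- rep(10001:10100, each = 12)

# Repeat the 12 months once per company so every (company, month) pair appears once
template <- data.frame(Date = rep(months, times = 100),
                       CompanyID = factor(ids))

str(template)
```

Spelling out the repetition avoids relying on silent recycling, which would raise an error if the row count were not a multiple of 12.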

In [7]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)
tail(ArtificialDataset)
str(ArtificialDataset)

Date,CompanyID
2016-01-01,10001
2016-02-01,10001
2016-03-01,10001
2016-04-01,10001
2016-05-01,10001
2016-06-01,10001


Row,Date,CompanyID
1195,2016-07-01,10100
1196,2016-08-01,10100
1197,2016-09-01,10100
1198,2016-10-01,10100
1199,2016-11-01,10100
1200,2016-12-01,10100


'data.frame':	1200 obs. of  2 variables:
 $ Date     : Date, format: "2016-01-01" "2016-02-01" ...
 $ CompanyID: Factor w/ 100 levels "10001","10002",..: 1 1 1 1 1 1 1 1 1 1 ...


<h3> 2.3 Revenue </h3>

In [8]:
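#Monthly revenue between 0 and 1,000,000, rounded to two decimals: draw a pool of 12,000 uniform values and sample 1,200 of them
ArtificialDataset$Revenue <- sample(round(runif(1200*10, min = 0, max = 1)*10^6, digits = 2), 1200, replace = TRUE)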

In [9]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue
2016-01-01,10001,690710.54
2016-02-01,10001,218199.33
2016-03-01,10001,887134.93
2016-04-01,10001,505122.23
2016-05-01,10001,312168.56
2016-06-01,10001,43971.11


<h3> 2.4 Expenses </h3>

In [10]:
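#Monthly expenses between 0 and 500,000, rounded to two decimals, using the same pool-and-sample scheme as Revenue
ArtificialDataset$Expenses <- sample(round(runif(1200*10, min = 0, max = 0.5)*10^6, digits = 2), 1200, replace = TRUE)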
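
The Revenue and Expenses cells draw ten times more uniform values than needed and then resample 1,200 of them. A more direct route, sketched here as an alternative rather than as the method used in this notebook, draws the 1,200 values in one step. `set.seed()` is added so that reruns give the same numbers; the seed value and the `_alt` names are assumptions made for this example.

```r
set.seed(2016)   # arbitrary seed, chosen only to make the sketch repeatable
n_rows <- 1200

# Uniform draws on the target ranges, rounded to two decimals
Revenue_alt  <- round(runif(n_rows, min = 0, max = 1e6), digits = 2)
Expenses_alt <- round(runif(n_rows, min = 0, max = 5e5), digits = 2)

summary(Revenue_alt)
summary(Expenses_alt)
```

Without a fixed seed, every run of the notebook produces a different dataset, so the printed outputs cannot be reproduced exactly.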

In [11]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue,Expenses
2016-01-01,10001,690710.54,399434.3
2016-02-01,10001,218199.33,357003.3
2016-03-01,10001,887134.93,163712.5
2016-04-01,10001,505122.23,168801.5
2016-05-01,10001,312168.56,417686.1
2016-06-01,10001,43971.11,160718.6


<h3> 2.5 Profit </h3>

In [12]:
ArtificialDataset$Profit <- ArtificialDataset$Revenue - ArtificialDataset$Expenses

In [13]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue,Expenses,Profit
2016-01-01,10001,690710.54,399434.3,291276.2
2016-02-01,10001,218199.33,357003.3,-138804.0
2016-03-01,10001,887134.93,163712.5,723422.5
2016-04-01,10001,505122.23,168801.5,336320.7
2016-05-01,10001,312168.56,417686.1,-105517.6
2016-06-01,10001,43971.11,160718.6,-116747.5


<h3> 2.6 LossFlag </h3>

In [14]:
ArtificialDataset$LossFlag <- ifelse(ArtificialDataset$Profit < 0, 1, 0)

In [15]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue,Expenses,Profit,LossFlag
2016-01-01,10001,690710.54,399434.3,291276.2,0
2016-02-01,10001,218199.33,357003.3,-138804.0,1
2016-03-01,10001,887134.93,163712.5,723422.5,0
2016-04-01,10001,505122.23,168801.5,336320.7,0
2016-05-01,10001,312168.56,417686.1,-105517.6,1
2016-06-01,10001,43971.11,160718.6,-116747.5,1


<h3> 2.7 Employees </h3>

In [16]:
ArtificialDataset$Employees <- NA

for(i in 1:100){
  a <- ArtificialDataset[which(ArtificialDataset$CompanyID == 
                                 levels(ArtificialDataset$CompanyID)[i]),]
  
  a$Employees <- sample(seq(from = 10, to = 100, by = 1), 1)  # one headcount per company, recycled over its 12 rows
  
  ArtificialDataset[which(ArtificialDataset$CompanyID == 
                            levels(ArtificialDataset$CompanyID)[i]),]$Employees <-
    a$Employees
}

In [17]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue,Expenses,Profit,LossFlag,Employees
2016-01-01,10001,690710.54,399434.3,291276.2,0,34
2016-02-01,10001,218199.33,357003.3,-138804.0,1,34
2016-03-01,10001,887134.93,163712.5,723422.5,0,34
2016-04-01,10001,505122.23,168801.5,336320.7,0,34
2016-05-01,10001,312168.56,417686.1,-105517.6,1,34
2016-06-01,10001,43971.11,160718.6,-116747.5,1,34


<h3> 2.8 Region </h3>

In [18]:
ArtificialDataset$Region <- NA    

for(i in 1:100){    
  a <- ArtificialDataset[which(ArtificialDataset$CompanyID == 
                                 levels(ArtificialDataset$CompanyID)[i]),]
  
  a$Region <- sample(c("A", "B", "C", "D", "E"),
                     1,
                     replace = TRUE,
                     prob = c(0.25, 0.20, 0.10, 0.05, 0.40))
  
  ArtificialDataset[which(ArtificialDataset$CompanyID == 
                            levels(ArtificialDataset$CompanyID)[i]),]$Region <-
    a$Region
}
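
The Employees and Region loops share one pattern: draw a single value per company and copy it to that company's 12 rows. The sketch below expresses the same idea without a loop by indexing the per-company draws with the factor's integer codes. It is an illustrative alternative; `company`, `employees_by_company`, `region_by_company`, and the `_alt` names are hypothetical, and the seed is arbitrary.

```r
set.seed(1)   # arbitrary seed, added only so the sketch is repeatable
company <- factor(rep(10001:10100, each = 12))
region_levels <- c("A", "B", "C", "D", "E")

# One draw per company
employees_by_company <- sample(10:100, size = nlevels(company), replace = TRUE)
region_by_company <- sample(region_levels,
                            size = nlevels(company),
                            replace = TRUE,
                            prob = c(0.25, 0.20, 0.10, 0.05, 0.40))

# Expand to one value per row: as.integer(company) gives each row's company index
Employees_alt <- employees_by_company[as.integer(company)]
Region_alt <- factor(region_by_company[as.integer(company)], levels = region_levels)

table(Region_alt) / 12   # number of companies per region
```

With only 100 companies, the observed regional shares will deviate somewhat from the specified probabilities.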

In [19]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue,Expenses,Profit,LossFlag,Employees,Region
2016-01-01,10001,690710.54,399434.3,291276.2,0,34,B
2016-02-01,10001,218199.33,357003.3,-138804.0,1,34,B
2016-03-01,10001,887134.93,163712.5,723422.5,0,34,B
2016-04-01,10001,505122.23,168801.5,336320.7,0,34,B
2016-05-01,10001,312168.56,417686.1,-105517.6,1,34,B
2016-06-01,10001,43971.11,160718.6,-116747.5,1,34,B


<h3> 2.9 BusinessValuation </h3>

In [20]:
ArtificialDataset$BusinessValuation <- NA

for(i in 1:100){    
  a <- ArtificialDataset[which(ArtificialDataset$CompanyID == 
                                 levels(ArtificialDataset$CompanyID)[i]),]
  
  for(j in 1:12){      
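    #Pick a valuation between 3% and 10% of that month's profit (negative when the month closed with a loss)
    a$BusinessValuation[j] <- round(sample(seq(from = 0.03*a$Profit[j], to = 0.10*a$Profit[j]), 1, replace = TRUE),
                                    digits = 2)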
  }
 
  ArtificialDataset[which(ArtificialDataset$CompanyID == 
                            levels(ArtificialDataset$CompanyID)[i]),]$BusinessValuation <-
    a$BusinessValuation
}
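
The nested loop picks each valuation from a unit-step sequence between 3% and 10% of the monthly profit. A continuous variant of the same idea, sketched below, draws a multiplier between 0.03 and 0.10 for every row and scales the profit by it. This is a different sampling scheme from the one used above, shown only for comparison; the `Profit` vector holds three values from the output above plus a zero as an edge case, and the seed is arbitrary.

```r
set.seed(42)   # arbitrary seed for repeatability
Profit <- c(291276.23, -138803.95, 723422.46, 0)

# One multiplier per row, uniform on [0.03, 0.10]
multiplier <- runif(length(Profit), min = 0.03, max = 0.10)

# The valuation keeps the sign of the profit
BusinessValuation_alt <- round(multiplier * Profit, digits = 2)
BusinessValuation_alt
```

Being vectorized, this version needs no per-company subsetting and handles all 1,200 rows in a single call.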

In [21]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue,Expenses,Profit,LossFlag,Employees,Region,BusinessValuation
2016-01-01,10001,690710.54,399434.3,291276.2,0,34,B,29038.29
2016-02-01,10001,218199.33,357003.3,-138804.0,1,34,B,-5387.12
2016-03-01,10001,887134.93,163712.5,723422.5,0,34,B,44855.67
2016-04-01,10001,505122.23,168801.5,336320.7,0,34,B,28444.62
2016-05-01,10001,312168.56,417686.1,-105517.6,1,34,B,-7074.53
2016-06-01,10001,43971.11,160718.6,-116747.5,1,34,B,-8930.43


<h3> 2.10 ClosedFlag </h3>

In [22]:
ArtificialDataset$ClosedFlag <- NA

for(i in 1:100){    
  a <- ArtificialDataset[which(ArtificialDataset$CompanyID == 
                                 levels(ArtificialDataset$CompanyID)[i]),]
  
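  #Companies with more than 3 loss-making months close with probability 0.10, all others with probability 0.005
  a$ClosedFlag <- ifelse(sum(a$Profit < 0) > 3, 
                         sample(c(0, 1), 1, replace = TRUE, prob = c(0.9, 0.1)),
                         sample(c(0, 1), 1, replace = TRUE, prob = c(0.995, 0.005)))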
  
  ArtificialDataset[which(ArtificialDataset$CompanyID == 
                            levels(ArtificialDataset$CompanyID)[i]),]$ClosedFlag <-
    a$ClosedFlag
}
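
The rule in the loop can be checked on a small example: count the loss-making months per company, map the count to a closing probability, and make one Bernoulli draw per company. The sketch below uses `tapply()` and `rbinom()` in place of the loop and `sample()`, so it is an equivalent formulation rather than the notebook's own code. The three toy companies and all variable names are invented for illustration.

```r
set.seed(7)   # arbitrary seed for repeatability
company <- factor(rep(c("A", "B", "C"), each = 12))
Profit <- c(rep(-1, 5), rep(1, 7),    # company A: 5 loss months
            rep(-1, 3), rep(1, 9),    # company B: 3 loss months
            rep(1, 12))               # company C: no loss months

# Number of months with negative profit, per company
loss_months <- tapply(Profit < 0, company, sum)

# More than 3 loss months gives a 10% closing probability, otherwise 0.5%
p_close <- ifelse(loss_months > 3, 0.10, 0.005)

# One 0/1 draw per company, then expanded to that company's rows
closed <- rbinom(length(p_close), size = 1, prob = p_close)
ClosedFlag_alt <- closed[as.integer(company)]

loss_months
p_close
```

Note that a company with exactly 3 loss months falls into the low-probability group, because the condition is strictly greater than 3.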

In [23]:
#Inspecting the ArtificialDataset
head(ArtificialDataset)

Date,CompanyID,Revenue,Expenses,Profit,LossFlag,Employees,Region,BusinessValuation,ClosedFlag
2016-01-01,10001,690710.54,399434.3,291276.2,0,34,B,29038.29,0
2016-02-01,10001,218199.33,357003.3,-138804.0,1,34,B,-5387.12,0
2016-03-01,10001,887134.93,163712.5,723422.5,0,34,B,44855.67,0
2016-04-01,10001,505122.23,168801.5,336320.7,0,34,B,28444.62,0
2016-05-01,10001,312168.56,417686.1,-105517.6,1,34,B,-7074.53,0
2016-06-01,10001,43971.11,160718.6,-116747.5,1,34,B,-8930.43,0


<h2> 3. Inspecting the Full ArtificialDataset </h2>

In [24]:
ArtificialDataset

Date,CompanyID,Revenue,Expenses,Profit,LossFlag,Employees,Region,BusinessValuation,ClosedFlag
2016-01-01,10001,690710.54,399434.31,291276.23,0,34,B,29038.29,0
2016-02-01,10001,218199.33,357003.28,-138803.95,1,34,B,-5387.12,0
2016-03-01,10001,887134.93,163712.47,723422.46,0,34,B,44855.67,0
2016-04-01,10001,505122.23,168801.52,336320.71,0,34,B,28444.62,0
2016-05-01,10001,312168.56,417686.14,-105517.58,1,34,B,-7074.53,0
2016-06-01,10001,43971.11,160718.63,-116747.52,1,34,B,-8930.43,0
2016-07-01,10001,972983.26,149589.18,823394.08,0,34,B,56004.82,0
2016-08-01,10001,107875.96,863.45,107012.51,0,34,B,9702.38,0
2016-09-01,10001,292321.96,179847.88,112474.08,0,34,B,8319.22,0
2016-10-01,10001,733022.87,382622.34,350400.53,0,34,B,17880.02,0


In [25]:
dim(ArtificialDataset)
str(ArtificialDataset)

'data.frame':	1200 obs. of  10 variables:
 $ Date             : Date, format: "2016-01-01" "2016-02-01" ...
 $ CompanyID        : Factor w/ 100 levels "10001","10002",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Revenue          : num  690711 218199 887135 505122 312169 ...
 $ Expenses         : num  399434 357003 163712 168802 417686 ...
 $ Profit           : num  291276 -138804 723422 336321 -105518 ...
 $ LossFlag         : num  0 1 0 0 1 1 0 0 0 0 ...
 $ Employees        : num  34 34 34 34 34 34 34 34 34 34 ...
 $ Region           : chr  "B" "B" "B" "B" ...
 $ BusinessValuation: num  29038 -5387 44856 28445 -7075 ...
 $ ClosedFlag       : num  0 0 0 0 0 0 0 0 0 0 ...


The structure of the dataset makes clear that the types of some variables needed to change: LossFlag, Region, and ClosedFlag should be factors.

In [26]:
ArtificialDataset$LossFlag <- as.factor(ArtificialDataset$LossFlag)
ArtificialDataset$Region <- as.factor(ArtificialDataset$Region)
ArtificialDataset$ClosedFlag <- as.factor(ArtificialDataset$ClosedFlag)

In [27]:
#New inspection of the ArtificialDataset
str(ArtificialDataset)
summary(ArtificialDataset)

'data.frame':	1200 obs. of  10 variables:
 $ Date             : Date, format: "2016-01-01" "2016-02-01" ...
 $ CompanyID        : Factor w/ 100 levels "10001","10002",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Revenue          : num  690711 218199 887135 505122 312169 ...
 $ Expenses         : num  399434 357003 163712 168802 417686 ...
 $ Profit           : num  291276 -138804 723422 336321 -105518 ...
 $ LossFlag         : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 1 1 1 ...
 $ Employees        : num  34 34 34 34 34 34 34 34 34 34 ...
 $ Region           : Factor w/ 5 levels "A","B","C","D",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ BusinessValuation: num  29038 -5387 44856 28445 -7075 ...
 $ ClosedFlag       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...


      Date              CompanyID       Revenue            Expenses       
 Min.   :2016-01-01   10001  :  12   Min.   :   525.1   Min.   :   863.5  
 1st Qu.:2016-03-24   10002  :  12   1st Qu.:259395.6   1st Qu.:130226.8  
 Median :2016-06-16   10003  :  12   Median :510784.8   Median :259458.4  
 Mean   :2016-06-16   10004  :  12   Mean   :508448.2   Mean   :254895.0  
 3rd Qu.:2016-09-08   10005  :  12   3rd Qu.:759357.1   3rd Qu.:381613.0  
 Max.   :2016-12-01   10006  :  12   Max.   :999428.7   Max.   :499749.5  
                      (Other):1128                                        
     Profit        LossFlag   Employees      Region  BusinessValuation  
 Min.   :-452175   0:900    Min.   : 11.00   A:300   Min.   :-40385.25  
 1st Qu.:   1071   1:300    1st Qu.: 25.50   B:240   1st Qu.:    98.57  
 Median : 255008            Median : 55.00   C:156   Median : 14738.09  
 Mean   : 253553            Mean   : 53.45   D: 60   Mean   : 16238.44  
 3rd Qu.: 498591            3rd Qu.

Every variable is now of the correct type and within the expected range.

<h2> 4. Creating the missing values of the ArtificialDataset </h2>

<h3> 4.1 Revenue </h3>

In [28]:
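#Set 1% of the rows to NA; replace = FALSE keeps the sampled rows distinct, so exactly 12 values go missing
for(i in sample(1:1200, 0.01*nrow(ArtificialDataset), replace = FALSE)){
  ArtificialDataset$Revenue[i] <- NA
}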

<h3> 4.2 Expenses </h3>

In [29]:
#1% of the rows, sampled without replacement so that no row is picked twice
for(i in sample(1:1200, 0.01*nrow(ArtificialDataset), replace = FALSE)){
  ArtificialDataset$Expenses[i] <- NA
}

<h3> 4.3 Profit </h3>

In [30]:
#1% of the rows, sampled without replacement so that no row is picked twice
for(i in sample(1:1200, 0.01*nrow(ArtificialDataset), replace = FALSE)){
  ArtificialDataset$Profit[i] <- NA
}

<h3> 4.4 LossFlag </h3>

In [31]:
#1% of the rows, sampled without replacement so that no row is picked twice
for(i in sample(1:1200, 0.01*nrow(ArtificialDataset), replace = FALSE)){
  ArtificialDataset$LossFlag[i] <- NA
}
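
The four cells in this section repeat one operation, so it can be wrapped in a small helper. The sketch below is a possible refactoring under the assumption that each variable should lose an exact fraction of its values; `inject_na()` is a hypothetical function and `toy` is stand-in data, neither taken from the notebook.

```r
# Replace a given fraction of the entries of x with NA, using distinct positions
inject_na <- function(x, fraction) {
  n_missing <- round(fraction * length(x))
  x[sample.int(length(x), size = n_missing)] <- NA   # sample.int draws without replacement by default
  x
}

set.seed(123)   # arbitrary seed for repeatability
toy <- data.frame(Revenue  = runif(1200),
                  LossFlag = factor(sample(0:1, 1200, replace = TRUE)))

toy$Revenue  <- inject_na(toy$Revenue, 0.01)
toy$LossFlag <- inject_na(toy$LossFlag, 0.01)

colSums(is.na(toy))
```

Sampling positions without replacement guarantees the target count. Sampling with replacement can pick the same row twice and leave fewer gaps than intended.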

<h2> 5. Inspecting the Final ArtificialDataset </h2>

In [32]:
ArtificialDataset

Date,CompanyID,Revenue,Expenses,Profit,LossFlag,Employees,Region,BusinessValuation,ClosedFlag
2016-01-01,10001,690710.54,399434.31,291276.23,0,34,B,29038.29,0
2016-02-01,10001,218199.33,357003.28,-138803.95,1,34,B,-5387.12,0
2016-03-01,10001,887134.93,,723422.46,0,34,B,44855.67,0
2016-04-01,10001,505122.23,168801.52,336320.71,0,34,B,28444.62,0
2016-05-01,10001,312168.56,417686.14,-105517.58,1,34,B,-7074.53,0
2016-06-01,10001,43971.11,160718.63,-116747.52,1,34,B,-8930.43,0
2016-07-01,10001,972983.26,149589.18,823394.08,0,34,B,56004.82,0
2016-08-01,10001,107875.96,863.45,107012.51,0,34,B,9702.38,0
2016-09-01,10001,292321.96,179847.88,112474.08,0,34,B,8319.22,0
2016-10-01,10001,733022.87,382622.34,350400.53,0,34,B,17880.02,0


In [33]:
str(ArtificialDataset)
summary(ArtificialDataset)

'data.frame':	1200 obs. of  10 variables:
 $ Date             : Date, format: "2016-01-01" "2016-02-01" ...
 $ CompanyID        : Factor w/ 100 levels "10001","10002",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Revenue          : num  690711 218199 887135 505122 312169 ...
 $ Expenses         : num  399434 357003 NA 168802 417686 ...
 $ Profit           : num  291276 -138804 723422 336321 -105518 ...
 $ LossFlag         : Factor w/ 2 levels "0","1": 1 2 1 1 2 2 1 1 1 1 ...
 $ Employees        : num  34 34 34 34 34 34 34 34 34 34 ...
 $ Region           : Factor w/ 5 levels "A","B","C","D",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ BusinessValuation: num  29038 -5387 44856 28445 -7075 ...
 $ ClosedFlag       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...


      Date              CompanyID       Revenue            Expenses       
 Min.   :2016-01-01   10001  :  12   Min.   :   525.1   Min.   :   863.5  
 1st Qu.:2016-03-24   10002  :  12   1st Qu.:259395.6   1st Qu.:130226.8  
 Median :2016-06-16   10003  :  12   Median :507714.4   Median :259458.4  
 Mean   :2016-06-16   10004  :  12   Mean   :508061.0   Mean   :254673.2  
 3rd Qu.:2016-09-08   10005  :  12   3rd Qu.:759357.1   3rd Qu.:381613.0  
 Max.   :2016-12-01   10006  :  12   Max.   :999428.7   Max.   :499749.5  
                      (Other):1128   NA's   :12         NA's   :12        
     Profit        LossFlag     Employees      Region  BusinessValuation  
 Min.   :-452175   0   :890   Min.   : 11.00   A:300   Min.   :-40385.25  
 1st Qu.:   4218   1   :298   1st Qu.: 25.50   B:240   1st Qu.:    98.57  
 Median : 255008   NA's: 12   Median : 55.00   C:156   Median : 14738.09  
 Mean   : 253440              Mean   : 53.45   D: 60   Mean   : 16238.44  
 3rd Qu.: 498591         

<h2> 6. Exporting the ArtificialDataset </h2>

In [34]:
write.csv(ArtificialDataset, file = "ArtificialDataset.csv", row.names = FALSE)
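
A CSV file stores only text, so the Date and factor types are not preserved on export and have to be restored when the file is read back. The round trip below is a minimal sketch on a two-row stand-in data frame written to a temporary file; `toy`, `path`, and `back` are illustrative names.

```r
toy <- data.frame(Date      = as.Date(c("2016-01-01", "2016-02-01")),
                  CompanyID = factor(c("10001", "10001")),
                  Revenue   = c(690710.54, NA))

# Write to a temporary file so nothing in the working directory is overwritten
path <- tempfile(fileext = ".csv")
write.csv(toy, file = path, row.names = FALSE)

# Restore the column types explicitly while reading
back <- read.csv(path,
                 colClasses = c(Date = "Date",
                                CompanyID = "factor",
                                Revenue = "numeric"))
str(back)
```

Missing values are written as `NA` and read back as missing, so the gaps introduced in section 4 survive the export.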