# Comparing model performances

The setup cell installs and loads the required packages, defines `one_way_2()` (a helper that runs Welch's one-way ANOVA followed by a Games-Howell post-hoc test), and sets the numeric output options. Here's a detailed breakdown of what each part of the code does:

### **1. Installing and Loading Required Packages:**

```r
list.of.packages <- c("tidyverse", "reticulate", "lubridate", "Benchmarking", "rstatix", "car", "knitr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

for (i in list.of.packages) {
    library(i, character.only = TRUE)
}
```
- **Package List**: `list.of.packages` is a vector containing the names of the R packages used in the script. These include packages for data manipulation (`tidyverse`, `lubridate`), statistical testing (`rstatix`, `car`), Data Envelopment Analysis (`Benchmarking`), Python interoperability (`reticulate`), and table output (`knitr`).
- **Package Installation**: It checks which of the listed packages are not already installed using `installed.packages()` and installs them with `install.packages()`.
- **Loading Libraries**: Once the required packages are installed, it loads them using `library()`.

### **2. Defining the `one_way_2` Function:**

```r
one_way_2 <- function (y="AP", x="Country", data) {
  data <- data |> rename(y = {{y}}, x = {{x}})
  data <- data |> mutate(x = factor(x))
  print(leveneTest(y ~ x , data = data))
  
  model <- oneway.test(y ~ x, data = data, var.equal = FALSE)
  print(model)
  
  mns <- data |>
    group_by(x) |>
    summarise(mean = mean(y), sd = sd(y)) |>
    arrange(desc(mean))
  
  lst <- rstatix::games_howell_test(data, formula = y ~ x, conf.level = 0.95, detailed = FALSE)
  
  a <- lst |> 
    mutate(gr1 = factor(group1, levels = mns$x), gr2 = factor(group2, levels = mns$x)) |>
    filter(p.adj.signif == "ns") |> select(gr1, gr2)
  
  a1 <- a
  colnames(a1) <- c("gr2", "gr1")
  tbl <- rbind(a, a1) |> table()
  
  mns <- mns |> mutate(l = "")
  tbl
  
  lidx <- 1
  excl <- NULL
  
  for (i in 1:nrow(tbl)) {
    nl <- letters[lidx]
    c1 <- rownames(tbl)[i]
    if (i %in% excl) next
    if (sum(tbl[i,]) == 0) {
      mns$l[i] = paste0(mns$l[i], nl)
      lidx <- lidx + 1
      excl <- c(excl, i)
    } else {
      idx <- which(tbl[i,] == 1)
      mns$l[i] = paste0(mns$l[i], nl)
      mns$l[idx] = paste0(mns$l[idx], nl)
      excl <- c(excl, i, idx)
      lidx <- lidx + 1
    }
  }
  
  return(mns)
}
```

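- **Function Definition (`one_way_2`)**: This function runs Welch's one-way ANOVA (a one-way analysis of means that does not assume equal variances) for a dependent variable (`y`) and a grouping factor (`x`) taken from the input `data`, prints the test results, and returns a table of group means and standard deviations.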
    - **Parameters**: 
        - `y`: The dependent variable (default is `"AP"`).
        - `x`: The grouping factor (default is `"Country"`).
        - `data`: The data frame that contains the data for the analysis.
    
- **Levene's Test**: It first performs Levene's Test to check for the homogeneity of variances between groups using `leveneTest()` from the `car` package.
  
- **One-Way ANOVA**: It runs `oneway.test()` with `var.equal = FALSE`, which is Welch's version of the one-way test and remains valid when group variances differ.

- **Summary Statistics**: The function calculates the mean and standard deviation of the dependent variable (`y`) for each group in the factor (`x`), and arranges them in descending order.

- **Post-Hoc Analysis (Games-Howell Test)**: `games_howell_test()` from `rstatix` compares every pair of groups without assuming equal variances. The function keeps only the pairs whose adjusted p-value is marked non-significant (`p.adj.signif == "ns"`).

- **Tables**: The non-significant pairs are mirrored and cross-tabulated into a symmetric table, with groups ordered by descending mean. A loop then walks through the rows and writes letters into a new column `l`: groups that share a letter are not significantly different from each other. Note that the loop gives one letter to a group and all of its non-significant partners without checking the partners against each other, so the labels are a simplified grouping.
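
The letter-assignment loop can be traced on a toy example. The sketch below uses base R only; the group names are taken from the results later in the notebook, but the list of non-significant pairs is an assumption made up for illustration, not the actual Games-Howell output.

```r
# Groups already sorted by descending mean (as in `mns`)
groups <- c("MLP", "DT", "RF", "LR")

# Hypothetical non-significant pairs (assumed for illustration only)
ns_pairs <- data.frame(
  gr1 = factor(c("MLP", "MLP", "DT", "RF"), levels = groups),
  gr2 = factor(c("DT", "RF", "RF", "LR"), levels = groups)
)

# Mirror the pairs so the cross-table is symmetric
mirrored <- data.frame(gr1 = ns_pairs$gr2, gr2 = ns_pairs$gr1)
tbl <- table(rbind(ns_pairs, mirrored))

labels <- setNames(rep("", length(groups)), groups)
lidx <- 1      # index of the next letter to hand out
excl <- NULL   # rows that already received a letter and are skipped

for (i in seq_len(nrow(tbl))) {
  if (i %in% excl) next
  nl <- letters[lidx]
  idx <- which(tbl[i, ] == 1)   # partners of row i (empty if none)
  labels[c(i, idx)] <- paste0(labels[c(i, idx)], nl)
  excl <- c(excl, i, idx)
  lidx <- lidx + 1
}

labels
```

With these assumed pairs, MLP and DT receive `a`, RF receives `ab`, and LR receives `b`. The two branches of the original loop are merged here because an empty `idx` leaves only row `i` to label.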

### **3. Setting Output Options:**

```r
options(scipen = 99, digits = 2)
```
- This sets the output options for numbers:
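    - **`scipen = 99`**: Adds a large penalty against scientific notation, so numbers are printed in fixed notation (useful for very small p-values).
    - **`digits = 2`**: Sets the number of significant digits used when printing to 2. This counts significant digits, not decimal places, and R treats it as a suggestion, so some values show more digits.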
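
A small base R sketch of how these two options affect formatting. The numbers are arbitrary examples, and the previous options are restored at the end so the sketch does not change the session.

```r
# Apply the same options as the notebook, remembering the old values
old <- options(scipen = 99, digits = 2)

two_sig   <- format(3.14159)      # two significant digits
large_num <- format(123456.789)   # integer part is kept in full
small_num <- format(0.000012345)  # fixed notation instead of 1.2e-05

# Restore the previous settings
options(old)

c(two_sig, large_num, small_num)
```

The second value illustrates that `digits` does not mean "two decimal places": the integer part is never truncated.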

### **Summary of What the Code Does:**
- The code installs and loads several essential R packages.
- It defines a function `one_way_2()` that:
    - Performs Levene’s test for homogeneity of variances.
    - Runs a one-way ANOVA to compare groups.
    - Uses the Games-Howell test for post-hoc comparisons.
    - Creates a summary table of the results with additional labeling of groups with non-significant differences.
    - Returns a table of mean and standard deviation values with group labels.

This setup allows you to quickly conduct one-way ANOVA tests with the specified dependent and grouping variables and apply post-hoc analysis when necessary.
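
To see what `var.equal = FALSE` changes, the following sketch compares the classical and Welch versions of `oneway.test()` on simulated data. The data frame `scores` and its group parameters are invented for illustration and are unrelated to the notebook's data.

```r
set.seed(1)

# Three simulated groups with clearly different spreads (assumed values)
scores <- data.frame(
  y = c(rnorm(20, 0.80, 0.02), rnorm(20, 0.78, 0.05), rnorm(20, 0.73, 0.10)),
  x = factor(rep(c("A", "B", "C"), each = 20))
)

# Classical one-way ANOVA: assumes equal variances
classic <- oneway.test(y ~ x, data = scores, var.equal = TRUE)

# Welch's version: does not assume equal variances
welch <- oneway.test(y ~ x, data = scores, var.equal = FALSE)

classic$parameter  # denominator df is N - k
welch$parameter    # denominator df is adjusted for unequal variances
```

The Welch test reports a fractional denominator degrees of freedom, which is why the outputs in this notebook show `denom df` values such as 53 rather than 96.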

In [2]:
list.of.packages <- c("tidyverse", "reticulate","Benchmarking",
                      "rstatix","car","knitr")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

for (i in list.of.packages) {
    library(i, character.only = TRUE)
}
one_way_2 <- function (y="AP",x="Country",data) {
data <- data |> rename(y = {{y}}, x = {{x}})
data <- data |> mutate(x = factor(x))
print(leveneTest(y ~ x , data = data))

#     print(data)
model <- oneway.test(y~x, data = data, var.equal = FALSE)
# summary(model)
print(model)
mns <- data |>
    group_by(x) |>
    summarise(mean = mean(y), sd = sd(y)) |>
    arrange(desc(mean))
lst <- rstatix::games_howell_test(data, formula=y~x, 
                         conf.level = 0.95, detailed = FALSE) 
a <- lst |>  mutate(gr1 = factor (group1, levels = mns$x),
              gr2 = factor (group2, levels = mns$x)) |>
filter(p.adj.signif=="ns") |> select(gr1,gr2) 
a1 <- a
colnames(a1) <- c("gr2","gr1")
tbl <- rbind(a,a1) |> table()

mns <- mns |> mutate(l = "")
tbl
lidx <- 1
excl <- NULL
for (i in 1:nrow(tbl)) {
    nl <- letters[lidx]
    c1 <- rownames(tbl)[i]
    if (i %in% excl) next
    if (sum(tbl[i,])==0) {
            mns$l[i] = paste0(mns$l[i],nl)
            lidx <- lidx +1
            excl <- c(excl,i)
        } else {
          idx<- which(tbl[i,]==1)
        mns$l[i] = paste0(mns$l[i],nl)
        mns$l[idx] = paste0(mns$l[idx],nl)
            excl <- c(excl,i, idx)
                lidx <- lidx +1
        }}
return(mns)
}
options(scipen = 99, digits = 2)

The next cell reads a previously saved `.Rds` file and displays the data sorted by the `test_score` column in descending order.

Here’s a breakdown of the code:

### 1. **Reading the `.Rds` File:**
```r
data <- read_rds(file = "./data/cau_results.Rds")
```
- **`read_rds()`**: This function from the `readr` package (loaded with `tidyverse`) loads an object saved in an RDS file. Here it reads `cau_results.Rds` from the `./data/` directory.
- The data from this file is assigned to the variable `data`.

### 2. **Arranging Data by `test_score`:**
```r
data |> arrange(desc(test_score))
```
- **`arrange()`**: This function is from the `dplyr` package and is used to reorder the rows of a data frame.
- **`desc(test_score)`**: This indicates that the `data` should be sorted in descending order based on the `test_score` column.
- **`|>`** is R's native pipe operator, available since R 4.1.0. It passes the value on its left as the first argument of the call on its right.

### **Summary:**
- The code loads the data from the `cau_results.Rds` file into the `data` variable.
- It then displays the rows sorted in descending order by `test_score`, so the highest-scoring runs appear at the top. The sorted result is not assigned, so `data` itself keeps its original order.


In [3]:
data <- read_rds(file = "./data/cau_results.Rds")
data |> arrange(desc(test_score)) 

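| seed `<dbl>` | model_name `<chr>` | time `<dbl>` | train_score `<dbl>` | test_score `<dbl>` |
|-----:|:-----|-----:|-----:|-----:|
| 362 | MLP | 6.39 | 0.95 | 0.88 |
| 475 | MLP | 6.01 | 0.93 | 0.88 |
| 362 | RF | 8.92 | 0.92 | 0.87 |
| 438 | DT | 0.31 | 1.00 | 0.87 |
| 812 | LR | 0.35 | 0.76 | 0.87 |
| 100 | MLP | 5.79 | 0.94 | 0.87 |
| 288 | MLP | 6.31 | 0.94 | 0.87 |
| 400 | DT | 0.33 | 1.00 | 0.87 |
| 925 | MLP | 6.17 | 0.92 | 0.87 |
| 1000 | DT | 0.36 | 0.95 | 0.87 |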


The next cell runs `one_way_2()` to compare `test_score` across the `model_name` groups, then formats the result and presents it as a Markdown table with `kable()`. Here's a breakdown of what each part does:

### 1. **Calling `one_way_2` Function:**
```r
one_way_2(y = test_score, x = model_name, data = data)
```
- **`one_way_2()`**: The function defined earlier is called with `test_score` as the dependent variable (`y`) and `model_name` as the grouping factor (`x`) in the dataset `data`. It prints Levene's test and the Welch ANOVA result.
- The function returns a data frame with one row per model: the group name (`x`), the mean and standard deviation of `test_score` (`mean`, `sd`), and the grouping letters (`l`).

### 2. **Mutating Results to Combine Mean and Standard Deviation:**
```r
mutate(mean = paste0(round(mean, 2), l, "+-", round(sd, 2)))
```
- **`mutate()`**: This function is used to modify or create new columns in the data frame.
- **`mean = paste0(round(mean, 2), l, "+-", round(sd, 2))`**: Here, a new `mean` column is created by concatenating:
    - The rounded `mean` (to two decimal places).
    - The label `l` (which was generated in the previous steps of the `one_way_2()` function).
    - The rounded standard deviation (`sd`) with `"+-"` to indicate the variation around the mean.
  
### 3. **Selecting the Relevant Columns:**
```r
select(-sd, -l)
```
- **`select(-sd, -l)`**: This removes the `sd` (standard deviation) and `l` (group letters) columns, leaving only the group column (`x`, which holds the model names) and the combined `mean` column for the final output.

### 4. **Formatting the Output as a Markdown Table:**
```r
kable(format = "markdown", booktabs = TRUE, col.names = c("Model", "Score"))
```
- **`kable()`**: This function from the `knitr` package is used to create tables in R.
- **`format = "markdown"`**: Specifies that the output table should be formatted in Markdown.
- **`booktabs = TRUE`**: Requests booktabs-style rules, which applies to LaTeX output only. It has no effect when `format = "markdown"`.
- **`col.names = c("Model", "Score")`**: Specifies the column names for the output table, labeling them as "Model" for the `model_name` and "Score" for the concatenated mean ± standard deviation.

### **Full Breakdown:**
This entire block of code:
- Runs Welch's one-way ANOVA on `test_score` by `model_name` using the `one_way_2()` function.
- Mutates the results to combine the mean, the grouping letters, and the standard deviation into a single string.
- Selects only the relevant columns (the model name and the combined `mean ± sd` column).
- Converts the final data frame into a nicely formatted Markdown table with `kable()`.

### **Result:**
The output will be a table in Markdown format that shows the `model_name` and the associated `mean ± sd` values, making it easier to visualize and interpret the results.
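
Because `round()` drops trailing zeros, a mean of 0.80 is shown as `0.8` in the combined string. One possible alternative, sketched here with base R on a made-up two-row summary (the data frame `summary_tbl` is hypothetical), is `sprintf()` with a fixed number of decimals.

```r
# Hypothetical summary rows in the shape returned by one_way_2()
summary_tbl <- data.frame(
  x    = c("MLP", "LR"),
  mean = c(0.80, 0.73),
  sd   = c(0.06, 0.08),
  l    = c("a", "b")
)

# Original approach: round() drops trailing zeros
rounded <- paste0(round(summary_tbl$mean, 2), summary_tbl$l, "+-",
                  round(summary_tbl$sd, 2))

# Alternative: sprintf() always prints two decimals
fixed <- sprintf("%.2f%s+-%.2f", summary_tbl$mean, summary_tbl$l,
                 summary_tbl$sd)

rounded
fixed
```

For the first row, the original approach gives `0.8a+-0.06` while the `sprintf()` version gives `0.80a+-0.06`, which keeps the columns aligned in the final table.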


In [4]:
one_way_2(y = test_score,x = model_name, data=data)|> 
    mutate(mean = paste0(round(mean,2),l,"+-",round(sd,2))) |>
    select(-sd, -l) |>
    kable(format = "markdown", booktabs = TRUE, 
          col.names = c("Model","Score"))

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)  
group  3    2.97  0.036 *
      96                 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	One-way analysis of means (not assuming equal variances)

data:  y and x
F = 4, num df = 3, denom df = 53, p-value = 0.008





|Model |Score        |
|:-----|:------------|
|MLP   |0.8a+-0.06   |
|DT    |0.8a+-0.05   |
|RF    |0.78ab+-0.05 |
|LR    |0.73b+-0.08  |

## The same evaluation procedure applied to the training scores of each model

In [5]:
one_way_2(y = train_score,x = model_name, data=data)|> 
    mutate(mean = paste0(round(mean,2),l,"+-",round(sd,2))) |>
    select(-sd, -l) |>
    kable(format = "markdown", booktabs = TRUE, 
          col.names = c("Model","Score"))

Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)   
group  3     4.1 0.0088 **
      96                  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	One-way analysis of means (not assuming equal variances)

data:  y and x
F = 268, num df = 3, denom df = 51, p-value <0.0000000000000002





|Model |Score       |
|:-----|:-----------|
|DT    |0.98a+-0.02 |
|RF    |0.96a+-0.04 |
|MLP   |0.92b+-0.06 |
|LR    |0.8c+-0.02  |

## The same evaluation procedure applied to the times of each model

In [6]:
one_way_2(y = time,x = model_name, data=data)|> 
    mutate(mean = paste0(round(mean,2),l,"+-",round(sd,2))) |>
    select(-sd, -l) |>
    kable(format = "markdown", booktabs = TRUE, 
          col.names = c("Model","Time"))

Levene's Test for Homogeneity of Variance (center = median)
      Df F value     Pr(>F)    
group  3    13.6 0.00000018 ***
      96                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	One-way analysis of means (not assuming equal variances)

data:  y and x
F = 15400, num df = 3, denom df = 48, p-value <0.0000000000000002





|Model |Time        |
|:-----|:-----------|
|RF    |9a+-0.2     |
|MLP   |6.61b+-0.88 |
|LR    |0.36c+-0.02 |
|DT    |0.32d+-0.02 |

## DEA Analysis
The next cell performs Data Envelopment Analysis (DEA) using the per-model means of the `test_score` and `time` columns from `data`. Here's a breakdown of each part of the code:

### 1. **Preparing the Data:**
```r
data2 <- data |> 
    group_by(model_name) |> 
    summarise(score = mean(test_score), time = mean(time))
```
- **`group_by(model_name)`**: Groups the data by the `model_name` column.
- **`summarise(score = mean(test_score), time = mean(time))`**: For each group (`model_name`), it calculates the mean `test_score` and the mean `time` for that group, resulting in a data frame with columns for `model_name`, `score`, and `time`.

### 2. **Ungrouping the Data:**
```r
data3 <- data2 |> ungroup()
```
- **`ungroup()`**: Removes any remaining grouping structure. Because `summarise()` already drops the single `model_name` grouping, `data2` is ungrouped and this step is only a safeguard.

### 3. **Defining Inputs and Outputs:**
```r
inputs <- as.matrix(data3 |> select(time))
outputs <- as.matrix(data3 |> select(score))
```
- **`inputs`**: Extracts the `time` column as the input matrix for the DEA model.
- **`outputs`**: Extracts the `score` column as the output matrix for the DEA model.
- **`as.matrix()`**: Converts these columns into matrices, as required by the `dea()` function.

### 4. **Defining Model Names:**
```r
nms <- paste0(data2$model_name)
```
- **`nms`**: Creates a vector of model names (from the `model_name` column) to be used for labeling the results.

### 5. **Running the DEA Analysis:**
```r
result <- dea(inputs, outputs, RTS = "crs")
```
- **`dea()`**: The main function for running Data Envelopment Analysis.
  - `inputs`: The input data matrix (time).
  - `outputs`: The output data matrix (score).
  - `RTS = "crs"`: Specifies constant returns to scale (CRS), meaning efficiency is measured under the assumption that scaling the inputs up or down scales the outputs proportionally. The orientation is left at its default, which is input-oriented.
  
  The result of the DEA is stored in the `result` variable.

### 6. **Displaying Efficiency Scores:**
```r
print(data.frame(model = nms, Efficiency = result$eff) |> arrange(desc(Efficiency))) |> kable("markdown")
```
- **`result$eff`**: Extracts the efficiency scores for each model from the `result` object.
- **`data.frame(model = nms, Efficiency = result$eff)`**: Creates a data frame with the model names and their corresponding efficiency scores.
- **`arrange(desc(Efficiency))`**: Sorts the models by efficiency in descending order.
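- **`kable("markdown")`**: Converts the data frame into a Markdown table for display.
- **`print()`**: Returns its argument invisibly after printing it, so the data frame is shown once as plain text and then again as the `kable()` table.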

### 7. **Summary of DEA Results:**
```r
summary(result)
```
- **`summary(result)`**: Prints a summary of the DEA run: the technology and orientation, the number of units with efficiency equal to 1, the mean efficiency, the distribution of units across efficiency ranges, and the quartiles of the scores.

### **Full Workflow Summary:**
- The code first prepares the dataset by grouping it by `model_name`, calculating the mean `test_score` and `time` for each model.
- It then defines the input (`time`) and output (`score`) matrices for DEA analysis.
- The DEA analysis is run using the `dea()` function with constant returns to scale (CRS).
- The results are printed as a Markdown table showing the efficiency scores of each model, sorted in descending order.
- A detailed summary of the DEA analysis is provided via the `summary(result)` function.

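This shows the relative efficiency of each model. A score of `1.000` marks a model on the efficient frontier, and because the analysis is input-oriented, a lower score is the fraction of its current time in which the model would need to reach the same score to be efficient. Here DT defines the frontier, LR follows at 0.83, and MLP and RF fall below 0.05 because they take far longer for similar test scores.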
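
With a single input and a single output under constant returns to scale, the input-oriented efficiency reduces to each model's output/input ratio divided by the best ratio. The sketch below checks this by hand in base R. It uses the rounded means from the tables above as assumed inputs, so the results are close to, but not identical with, the `dea()` scores.

```r
# Rounded per-model means copied from the earlier tables (approximate)
means <- data.frame(
  model = c("DT", "LR", "MLP", "RF"),
  score = c(0.80, 0.73, 0.80, 0.78),
  time  = c(0.32, 0.36, 6.61, 9.00)
)

# Output per unit of input for each model
ratio <- means$score / means$time

# Efficiency relative to the best ratio (the CRS frontier)
eff <- setNames(ratio / max(ratio), means$model)

round(eff, 3)
```

DT has the highest score per unit of time and therefore gets an efficiency of exactly 1, and the ranking of the other models matches the one reported by `dea()`.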

In [7]:
data2 <- data |>
    group_by(model_name) |>
    summarise(score=mean(test_score), time = mean(time)) 
data3 <- data2 |> ungroup()
inputs <- as.matrix(data3 |> select(time))
outputs <- as.matrix(data3 |> select(score))
nms <- paste0(data2$model_name)
result <- dea(inputs, outputs, RTS = "crs")  
print(data.frame(model = nms, Efficiency = result$eff) |> arrange(desc(Efficiency))) |> kable("markdown") # Display the efficiency scores
summary(result)

  model Efficiency
1    DT      1.000
2    LR      0.827
3   MLP      0.049
4    RF      0.035




|model | Efficiency|
|:-----|----------:|
|DT    |       1.00|
|LR    |       0.83|
|MLP   |       0.05|
|RF    |       0.04|

Summary of efficiencies
CRS technology and input orientated efficiency
Number of firms with efficiency==1 are 1 out of 4 
Mean efficiency: 0.478 
---                
  Eff range      #  %
  0<= E <0.1     2 50
  0.1<= E <0.2   0  0
  0.2<= E <0.3   0  0
  0.3<= E <0.4   0  0
  0.4<= E <0.5   0  0
  0.5<= E <0.6   0  0
  0.6<= E <0.7   0  0
  0.7<= E <0.8   0  0
  0.8<= E <0.9   1 25
  0.9<= E <1     0  0
        E ==1    1 25
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.04    0.05    0.44    0.48    0.87    1.00 
