# Abstract

Four models have been compared for fitting SARS-nCoV-2 total cumulative case curves in 185 countries over a period of 97 days. The evaluated models were: _Simple Logistic Function_ (**SLF**), _Simple Gompertz Function_ (**SGF**), _Double Logistic Function_ (**DLF**) and _Double Gompertz Function_ (**DGF**). The **DGF** model showed lower MSE, RMSE, NRMSE, MAE and $I_2$ and a higher Pearson R than the others. **DGF** AIC scores were higher in 137 countries when compared with **DLF** ($\mu=0.991$, $\sigma=0.044$), in 153 countries when compared with **SGF** ($\mu=0.991$, $\sigma=0.049$) and in 158 countries when compared with **SLF** ($\mu=0.985$, $\sigma=0.068$). Results suggest that the _Double Gompertz Function_ may be a good fitting model for SARS-nCoV-2 analysis and forecasting.

# Methods

## Data

SARS-nCoV-2 total cumulative cases data have been gathered from the Johns Hopkins University GitHub repository [REF] and aggregated by country where regional-level data were provided [REF]. Data have been used "as is", without rejecting any outliers and/or errors (negative daily $\Delta$).
Data and results have been stored in a `pandas` n-dimensional `DataFrame`.

Raw data contained 185 countries and daily cumulative confirmed cases for 97 days, from 2020-01-22 to 2020-04-29.
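The aggregation step described above can be sketched as follows. This is only a minimal illustration on an invented toy table: the column names `Province/State` and `Country/Region` and the wide one-column-per-date layout are assumptions about the source file, and the variable names are hypothetical.

```python
import pandas as pd

# Toy table in an assumed wide layout (one column per date); values are invented.
raw = pd.DataFrame(
    {
        "Province/State": ["A", "B", None],
        "Country/Region": ["X", "X", "Y"],
        "1/22/20": [1, 2, 0],
        "1/23/20": [3, 2, 5],
        "1/24/20": [4, 6, 4],
    }
)

date_cols = [c for c in raw.columns if c not in ("Province/State", "Country/Region")]

# Sum regional rows into a single cumulative series per country.
by_country = raw.groupby("Country/Region")[date_cols].sum()

# Daily increments; a negative value marks a downward correction in the source.
daily = by_country.diff(axis=1)
negative_days = (daily < 0).sum(axis=1)
```

Here the negative increments are only counted, not removed, in line with the choice of using the data "as is".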

## Models

Models have been defined with `lmfit` (an extension of the classical `curve_fit` approach in `scipy`) using the Nelder-Mead method for fitting [REF].

The total residuals of each function have first been compared (unsorted, sorted, Gaussian distribution) to find the model with residual mean $\mu$ closest to $0$ and smallest standard deviation $\sigma$. The mean _Akaike Information Criterion_ (**AIC**) has then been used to identify the likely best fitting model, which has finally been compared, country by country, with the others using the Akaike weights distribution to obtain the _AIC score_ probability **AICp**, defined as

$$ \mathbf{AIC_p} =
\frac{
    e^{
        -0.5 \cdot (\mathbf{AIC_1} - \mathbf{AIC_0})
    }
}{
    1 + e^{
        -0.5 \cdot (\mathbf{AIC_1} - \mathbf{AIC_0})
    }
} $$

where $\mathbf{AIC_0} \leq \mathbf{AIC_1}$.
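As a minimal sketch, the expression above can be evaluated for a pair of AIC values as follows. The function name `aic_probability` and the numeric inputs are hypothetical. For two models, the expression and its complement are the two Akaike weights: with $\mathbf{AIC_0} \leq \mathbf{AIC_1}$ the expression itself is at most $0.5$, and its complement is the weight of the lower-AIC model.

```python
import math

def aic_probability(aic_a, aic_b):
    """Evaluate the AIC_p expression for a pair of AIC values."""
    aic_0, aic_1 = sorted((aic_a, aic_b))  # enforce AIC_0 <= AIC_1
    w = math.exp(-0.5 * (aic_1 - aic_0))
    return w / (1.0 + w)

# Invented AIC values, for illustration only.
p = aic_probability(100.0, 110.0)
complement = 1.0 - p  # Akaike weight of the lower-AIC model
```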

Models have been defined as follows:

- Simple Logistic Function (**SLF**):

```python
import numpy as np

def logit_function(x, a, b, k, e):
    # exponent of the logistic term: k * (b - t)
    d = k * (b - np.array(x))
    return (a / (1 + np.exp(d))) + e
```
$$ f(t) = \frac{ a }{ 1 + e^{ k (b - t) } } + \varepsilon $$

- Double Logistic Function (**DLF**):

```python
def double_logit_function(x, a1, b1, k1, a2, b2, k2, e):
    d1 = k1 * (b1 - np.array(x))
    g1 = a1 / (1 + np.exp(d1))
    d2 = k2 * (b2 - np.array(x))
    g2 = (a2 - a1) / (1 + np.exp(d2))
    return g1 + g2 + e
```
$$ f(t) = \frac{ a_1 }{ 1 + e^{ k_1 (b_1 - t) } } + \frac{ a_2 - a_1 }{ 1 + e^{ k_2 (b_2 - t) } } + \varepsilon $$

- Simple Gompertz Function (**SGF**):

```python
def gompertz_function(x, a, b, k, e):
    exp = - np.exp(k * (b - x))
    return a * np.exp(exp) + e
```
$$ f(t) = a \cdot e^{ -e^{ k (b - t) } } + \varepsilon $$

- Double Gompertz Function (**DGF**):

```python
def double_gompertz_function(x, a1, b1, k1, a2, b2, k2, e):
    exp1 = - np.exp(k1 * (b1 - x))
    g1 = a1 * np.exp(exp1)
    exp2 = - np.exp(k2 * (b2 - x))
    g2 = (a2 - a1) * np.exp(exp2)
    return g1 + g2 + e
```
$$ f(t) = a_1 \cdot e^{ -e^{ k_1 (b_1 - t) } } + (a_2 - a_1) \cdot e^{ -e^{ k_2 (b_2 - t) } } + \varepsilon $$
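Two properties follow directly from the definitions above and may help when reading the parameters: with $a_2 = a_1$ the second term vanishes and **DGF** reduces to **SGF**, and for large $t$ both exponentials tend to $1$, so that $f(t) \to a_1 + (a_2 - a_1) + \varepsilon = a_2 + \varepsilon$. The sketch below checks both numerically; the parameter values are invented for illustration.

```python
import numpy as np

def gompertz_function(x, a, b, k, e):
    return a * np.exp(-np.exp(k * (b - x))) + e

def double_gompertz_function(x, a1, b1, k1, a2, b2, k2, e):
    g1 = a1 * np.exp(-np.exp(k1 * (b1 - x)))
    g2 = (a2 - a1) * np.exp(-np.exp(k2 * (b2 - x)))
    return g1 + g2 + e

x = np.arange(0, 400, dtype=float)

# With a2 == a1 the second term is zero and DGF equals SGF.
sgf = gompertz_function(x, a=1000.0, b=40.0, k=0.1, e=5.0)
dgf_same = double_gompertz_function(x, 1000.0, 40.0, 0.1, 1000.0, 80.0, 0.1, 5.0)
reduces_to_sgf = np.allclose(sgf, dgf_same)

# For large x the curve approaches a2 + e (here 3000 + 5).
dgf = double_gompertz_function(x, 1000.0, 40.0, 0.1, 3000.0, 120.0, 0.1, 5.0)
plateau = dgf[-1]
```

In this parametrization $a_2$ is therefore the overall asymptote of the cumulative curve, not the size of the second wave alone.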

# Model fitting

Fitting has been performed with `lmfit` using the Nelder-Mead method:

```python
model = lmfit.Model(function)
result = model.fit(data=y, params=p, x=x, method='Nelder', nan_policy='omit')
```

guessing initial parameters `p` as:

- **SLF** and **SGF**

```python
p = model.make_params(
    a=y[-1],
    b=max_y_i,
    k=.1,
    e=y[0]
)
```

$$ a = \hat{y}_{-1} \\%
b = x_{\mathbf{max}(d\hat{y})} \\%
k = 0.1 \\%
\varepsilon = \hat{y}_0$$

- **DLF** and **DGF**

```python
p = model.make_params(
    a1=y[max_y_i] * 2,
    b1=max_y_i,
    k1=.1,
    a2=max(y),
    b2=len(y),
    k2=.1,
    e=y[0]
)
```

$$ a_1 = 2 \hat{y}_{\mathbf{max}(d\hat{y})} \\%
b_1 = x_{\mathbf{max}(d\hat{y})} \\%
k_1 = 0.1 \\%
a_2 = \mathbf{max}(\hat{y}) \\%
b_2 = x_{\hat{y}_{-1}} \\%
k_2 = 0.1 \\%
\varepsilon = \hat{y}_0$$
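The index `max_y_i` is not defined in the snippets above. The sketch below assumes it is the index of the day with the largest daily increment, which is consistent with $a_1 = 2 \hat{y}_{\mathbf{max}(d\hat{y})}$ (a logistic curve reaches half of its asymptote at the inflection point). The helper names `guess_single` and `guess_double` and the toy series are hypothetical.

```python
import numpy as np

def guess_single(y):
    """Initial guesses for SLF/SGF from a cumulative series y."""
    y = np.asarray(y, dtype=float)
    # Assumption: max_y_i is the index of the day with the largest increase.
    max_y_i = int(np.argmax(np.diff(y))) + 1
    return {"a": y[-1], "b": max_y_i, "k": 0.1, "e": y[0]}

def guess_double(y):
    """Initial guesses for DLF/DGF from a cumulative series y."""
    y = np.asarray(y, dtype=float)
    max_y_i = int(np.argmax(np.diff(y))) + 1  # same assumption as above
    return {
        "a1": 2 * y[max_y_i], "b1": max_y_i, "k1": 0.1,
        "a2": y.max(), "b2": len(y), "k2": 0.1,
        "e": y[0],
    }

# Invented cumulative series, for illustration only.
y = [0, 1, 3, 8, 20, 35, 45, 50, 52, 53]
single = guess_single(y)
double = guess_double(y)
```

The returned dictionaries can be passed to `model.make_params(**single)` or `model.make_params(**double)`.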

Fitting failed for one country only (Yemen); best-fit information was therefore obtained for 184 countries, for a total of 17,848 points for each of the four models.

The complete `python` backend for data gathering, fitting and analysis, along with a `pickle`-saved dataframe of all measured data and results, is available online at [REF].

Fitting examples are reported in figures [REF] [REF] [REF].

# Analysis

## Residual

The total residuals of each model have been collected and compared to get a first "rough" indication of the most likely best fitting model [FIG].

![img](modelresidual.png)

**SLF** showed the residual mean closest to $0$ ($\mu=1.694$) but the widest standard deviation ($\sigma=1247$). **DGF** showed the lowest residual standard deviation ($\sigma=351$) with a mean of $\mu=-5.197$.

Mean Squared Error (**MSE**, Variance), Root Mean Squared Error (**RMSE**, Standard Deviation), Normalized Root Mean Squared Error (**NRMSE**, Normalized Standard Deviation), Pearson Correlation (**Pearson R**), Mean Absolute Error (**MAE**) and (**I2**) have been calculated for the residuals of all models [FIG]. **DGF** showed the best results for all indices, confirming the first hypothesis that **DGF** could be the best fitting model among the chosen ones.

![img](residualstats.png)
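Most of the indices listed above can be computed from observed and fitted values as in the following sketch. The normalization used for **NRMSE** is not stated in the text, so division by the observed range is an assumption, as is taking **Pearson R** between observed and fitted values; **I2** is left out because its definition is not given. The function name and the toy values are hypothetical.

```python
import numpy as np

def residual_stats(y_true, y_fit):
    """Summary indices for one fit (NRMSE normalization is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_fit = np.asarray(y_fit, dtype=float)
    res = y_true - y_fit
    mse = np.mean(res ** 2)
    rmse = np.sqrt(mse)
    return {
        "MSE": mse,
        "RMSE": rmse,
        # Assumption: normalized by the observed range; other conventions exist.
        "NRMSE": rmse / (y_true.max() - y_true.min()),
        "MAE": np.mean(np.abs(res)),
        "Pearson R": np.corrcoef(y_true, y_fit)[0, 1],
    }

# Invented observed and fitted values, for illustration only.
stats = residual_stats([0, 10, 20, 30], [1, 9, 21, 29])
```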

Since the compared models are nested (**SLF** and **SGF** have 4 parameters; **DLF** and **DGF** have 7 parameters), the Akaike Information Criterion (**AIC**) and score have been used instead of the classical $H_0$ null hypothesis test. **AIC** values have then been collected from all fits and compared with each other, country by country, using the values returned by `lmfit.minimize`. Plots are reported in [FIG] [FIG] [FIG].

![img](aicscores.png)
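To make the penalty for additional parameters explicit, the AIC of a least-squares fit can be written as $N \ln(\chi^2 / N) + 2 N_{varys}$, which is the form documented for `lmfit`. The sketch below applies it to two invented residual vectors, one from a hypothetical 4-parameter fit and one from a hypothetical 7-parameter fit; the function name is also hypothetical.

```python
import math

def aic_least_squares(residuals, n_varys):
    """AIC for a least-squares fit: N * ln(chisqr / N) + 2 * n_varys."""
    n = len(residuals)
    chisqr = sum(r * r for r in residuals)
    return n * math.log(chisqr / n) + 2 * n_varys

# Invented residuals: the 7-parameter fit halves the residual amplitude.
res_simple = [2.0, -2.0] * 5
res_double = [1.0, -1.0] * 5

aic_simple = aic_least_squares(res_simple, n_varys=4)
aic_double = aic_least_squares(res_double, n_varys=7)
```

In this toy case the lower residuals outweigh the three extra parameters, so the 7-parameter fit has the lower AIC; with a smaller improvement in residuals the ranking would reverse.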

Calculating the mean $\mu$ and standard deviation $\sigma$ of the **AIC** scores strongly confirmed the _Double Gompertz Function_ as the best fitting model for SARS-nCoV-2 cumulative cases curve fitting. Results also showed that **DGF** fits much better not only than the models with fewer parameters (**SLF** and **SGF**) but also than the _Double Logistic Function_.

# Conclusions

Among the compared models, the _Double Gompertz Function_ showed the best results and scores in fitting SARS-nCoV-2 cumulative cases data, suggesting that this model should be studied more deeply (and possibly improved) and compared with other existing models for further analysis, including forecasting capabilities.

# Plots

## AIC scores

### DGF vs SLF

![img](dgf_slf.png)

### DGF vs SGF

![img](dgf_sgf.png)

### DGF vs DLF

![img](dgf_dlf.png)

## Fit examples

![img](fit1.png)

***



![img](fit2.png)

***



![img](fit3.png)

***



![img](fit4.png)

***



![img](fit5.png)

***



![img](fit6.png)

***

