# Completing missing values in the Diamonds dataset and building forecasts based on the completed data

In [1]:
import random

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
from sklearn.metrics import mean_squared_error

## Loading data

The analysis uses the Diamonds dataset from [**Kaggle**](https://www.kaggle.com/datasets/shivam2503/diamonds), which contains the prices and other attributes of nearly 54,000 diamonds.

### Attributes

**price** - Price in US dollars (326 - 18823)

**carat** - Diamond weight in carats (0.2 - 5.01)

**cut** - Cut quality (Fair, Good, Very Good, Premium, Ideal)

**color** - The color of the diamond, from J (worst) to D (best)

**clarity** - A measure of how clear a diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

**x** - Length in mm (0 - 10.74)

**y** - Width in mm (0 - 58.9)

**z** - Depth in mm (0 - 31.8)

**depth** - Total depth percentage = 100 * z / mean(x, y) = 200 * z / (x + y) (43 - 79)

**table** - Width of the upper part of the diamond relative to the widest point (43 - 95)

In [2]:
data = pd.read_csv('diamonds.csv')

In [3]:
data.head(10)

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,0.23,Ideal,E,SI2,61.5,55.0,326,3.95,3.98,2.43
1,2,0.21,Premium,E,SI1,59.8,61.0,326,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334,4.2,4.23,2.63
4,5,0.31,Good,J,SI2,63.3,58.0,335,4.34,4.35,2.75
5,6,0.24,Very Good,J,VVS2,62.8,57.0,336,3.94,3.96,2.48
6,7,0.24,Very Good,I,VVS1,62.3,57.0,336,3.95,3.98,2.47
7,8,0.26,Very Good,H,SI1,61.9,55.0,337,4.07,4.11,2.53
8,9,0.22,Fair,E,VS2,65.1,61.0,337,3.87,3.78,2.49
9,10,0.23,Very Good,H,VS1,59.4,61.0,338,4.0,4.05,2.39


In [4]:
data.describe()

Unnamed: 0.1,Unnamed: 0,carat,depth,table,price,x,y,z
count,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0,53940.0
mean,26970.5,0.79794,61.749405,57.457184,3932.799722,5.731157,5.734526,3.538734
std,15571.281097,0.474011,1.432621,2.234491,3989.439738,1.121761,1.142135,0.705699
min,1.0,0.2,43.0,43.0,326.0,0.0,0.0,0.0
25%,13485.75,0.4,61.0,56.0,950.0,4.71,4.72,2.91
50%,26970.5,0.7,61.8,57.0,2401.0,5.7,5.71,3.53
75%,40455.25,1.04,62.5,59.0,5324.25,6.54,6.54,4.04
max,53940.0,5.01,79.0,95.0,18823.0,10.74,58.9,31.8


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB


Percentage of missing values in each column

In [6]:
round(100 * (data.isnull().sum() / len(data)), 2)

Unnamed: 0    0.0
carat         0.0
cut           0.0
color         0.0
clarity       0.0
depth         0.0
table         0.0
price         0.0
x             0.0
y             0.0
z             0.0
dtype: float64

The dataset has no missing values.

# Completing missing values

## Adding artificial gaps

Let's create artificial gaps in the `carat`, `depth` and `price` features, covering 5%, 10% and 15% of the rows respectively.

In [7]:
missing_data_columns = ['carat', 'depth', 'price']
missing_data_ratios = [0.05, 0.10, 0.15]

In [8]:
data_with_missing = data.copy()
for i in range(len(missing_data_columns)):
    num_missing = int(len(data_with_missing) * missing_data_ratios[i])
    missing_indices = random.sample(range(len(data_with_missing)), num_missing)
    data_with_missing.loc[missing_indices, missing_data_columns[i]] = np.nan
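
The cell above draws the row indices with `random.sample` and no fixed seed, so the gaps land in different rows on every run. A minimal sketch of a reproducible variant is given here; the helper name `add_missing`, its `seed` parameter and the toy frame are assumptions introduced for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def add_missing(df, columns, ratios, seed=0):
    """Return a copy of df with the given fraction of each column set to NaN."""
    rng = np.random.default_rng(seed)  # fixed seed -> same gaps on every run
    out = df.copy()
    for column, ratio in zip(columns, ratios):
        n_missing = int(len(out) * ratio)
        # Sample row labels without replacement
        rows = rng.choice(out.index.to_numpy(), size=n_missing, replace=False)
        # Cast to float first so that integer columns can hold NaN
        out[column] = out[column].astype(float)
        out.loc[rows, column] = np.nan
    return out

# Toy frame standing in for the diamonds data (hypothetical values)
toy = pd.DataFrame({"carat": np.linspace(0.2, 2.0, 100),
                    "price": np.arange(100, 200)})
toy_missing = add_missing(toy, ["carat", "price"], [0.1, 0.2], seed=42)
print(toy_missing.isnull().sum())
```

With a fixed seed, the imputation results further down would be comparable between runs.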

In [9]:
data_with_missing.head(10)

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1,,Ideal,E,SI2,61.5,55.0,,3.95,3.98,2.43
1,2,,Premium,E,SI1,59.8,61.0,326.0,3.89,3.84,2.31
2,3,0.23,Good,E,VS1,56.9,65.0,327.0,4.05,4.07,2.31
3,4,0.29,Premium,I,VS2,62.4,58.0,334.0,4.2,4.23,2.63
4,5,,Good,J,SI2,63.3,58.0,,4.34,4.35,2.75
5,6,0.24,Very Good,J,VVS2,,57.0,336.0,3.94,3.96,2.48
6,7,0.24,Very Good,I,VVS1,62.3,57.0,,3.95,3.98,2.47
7,8,0.26,Very Good,H,SI1,61.9,55.0,337.0,4.07,4.11,2.53
8,9,0.22,Fair,E,VS2,65.1,61.0,337.0,3.87,3.78,2.49
9,10,0.23,Very Good,H,VS1,59.4,61.0,338.0,4.0,4.05,2.39


Percentage of missing values in each column

In [10]:
round(100 * (data_with_missing.isnull().sum() / len(data_with_missing)), 2)

Unnamed: 0     0.0
carat          5.0
cut            0.0
color          0.0
clarity        0.0
depth         10.0
table          0.0
price         15.0
x              0.0
y              0.0
z              0.0
dtype: float64

The number of missing values in each column

In [11]:
data_with_missing.isnull().sum()

Unnamed: 0       0
carat         2697
cut              0
color            0
clarity          0
depth         5394
table            0
price         8091
x                0
y                0
z                0
dtype: int64

The dataset with artificial gaps contains 2,697 missing values in the `carat` column, 5,394 in the `depth` column, and 8,091 in the `price` column.

## Completing missing values

Summary statistics of the selected features in the original dataset

In [12]:
data[missing_data_columns].describe()

Unnamed: 0,carat,depth,price
count,53940.0,53940.0,53940.0
mean,0.79794,61.749405,3932.799722
std,0.474011,1.432621,3989.439738
min,0.2,43.0,326.0
25%,0.4,61.0,950.0
50%,0.7,61.8,2401.0
75%,1.04,62.5,5324.25
max,5.01,79.0,18823.0


Functions for encoding labels

In [13]:
def label_encoding(data, variables):
    """Replace each categorical column with integer codes and return the mapping used."""
    encoded_data = data.copy()
    mapping = dict()
    for variable in variables:
        # Codes are assigned in order of first appearance; NaN is left unmapped
        labels = encoded_data[variable].dropna().unique()
        mapping[variable] = {label: code for code, label in enumerate(labels)}
        encoded_data[variable] = encoded_data[variable].map(mapping[variable])
    return encoded_data, mapping

def inverse_label_encoding(data, variables, mapping):
    """Restore the original labels from the integer codes."""
    decoded_data = data.copy()
    for variable in variables:
        inverse_mapping = {code: label for label, code in mapping[variable].items()}
        decoded_data[variable] = decoded_data[variable].map(inverse_mapping)
    return decoded_data
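
`label_encoding` numbers the categories in the order in which they first appear, so the codes do not follow the quality order given in the attribute list. Since both `KNNImputer` and linear regression treat the codes as numbers, an ordered encoding may be a more natural choice. The sketch below is one possible way to build such a mapping; the constant names and the helper `ordinal_mapping` are illustrative assumptions, and the orderings are taken from the attribute descriptions at the top of the notebook.

```python
import pandas as pd

# Quality orderings, worst to best, as listed in the attribute description
CUT_ORDER = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
COLOR_ORDER = ["J", "I", "H", "G", "F", "E", "D"]
CLARITY_ORDER = ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"]

def ordinal_mapping(order):
    """Map each label to its rank in the given worst-to-best ordering."""
    return {label: code for code, label in enumerate(order)}

# Same structure as the mapping returned by label_encoding
ordered_mapping = {
    "cut": ordinal_mapping(CUT_ORDER),
    "color": ordinal_mapping(COLOR_ORDER),
    "clarity": ordinal_mapping(CLARITY_ORDER),
}

# Three rows shaped like the first rows of the dataset
sample = pd.DataFrame({"cut": ["Ideal", "Premium", "Good"],
                       "color": ["E", "E", "J"],
                       "clarity": ["SI2", "SI1", "VS1"]})
encoded = sample.copy()
for column, codes in ordered_mapping.items():
    encoded[column] = encoded[column].map(codes)
print(encoded)
```

Because `ordered_mapping` has the same shape as the mapping produced by `label_encoding`, it could be passed to `inverse_label_encoding` unchanged.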

Completing missing values using the mean

In [14]:
data_mean = data_with_missing.copy()
data_mean, mapping = label_encoding(data_mean, ['cut', 'color', 'clarity'])
imputer = SimpleImputer(strategy='mean')
data_mean = pd.DataFrame(imputer.fit_transform(data_mean), columns=data_mean.columns)
data_mean = inverse_label_encoding(data_mean, ['cut', 'color', 'clarity'], mapping)

In [15]:
data_mean.head(10)

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1.0,0.79685,Ideal,E,SI2,61.5,55.0,3932.233266,3.95,3.98,2.43
1,2.0,0.79685,Premium,E,SI1,59.8,61.0,326.0,3.89,3.84,2.31
2,3.0,0.23,Good,E,VS1,56.9,65.0,327.0,4.05,4.07,2.31
3,4.0,0.29,Premium,I,VS2,62.4,58.0,334.0,4.2,4.23,2.63
4,5.0,0.79685,Good,J,SI2,63.3,58.0,3932.233266,4.34,4.35,2.75
5,6.0,0.24,Very Good,J,VVS2,61.755706,57.0,336.0,3.94,3.96,2.48
6,7.0,0.24,Very Good,I,VVS1,62.3,57.0,3932.233266,3.95,3.98,2.47
7,8.0,0.26,Very Good,H,SI1,61.9,55.0,337.0,4.07,4.11,2.53
8,9.0,0.22,Fair,E,VS2,65.1,61.0,337.0,3.87,3.78,2.49
9,10.0,0.23,Very Good,H,VS1,59.4,61.0,338.0,4.0,4.05,2.39


In [16]:
data_mean.isnull().sum()

Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64

In [17]:
data_mean[missing_data_columns].describe()

Unnamed: 0,carat,depth,price
count,53940.0,53940.0,53940.0
mean,0.79685,61.755706,3932.233266
std,0.461501,1.361125,3684.924145
min,0.2,43.0,326.0
25%,0.4,61.2,1069.0
50%,0.71,61.755706,3310.0
75%,1.03,62.4,4710.0
max,5.01,79.0,18823.0


Completing missing values using the median

In [18]:
data_median = data_with_missing.copy()
data_median, mapping = label_encoding(data_median, ['cut', 'color', 'clarity'])
imputer = SimpleImputer(strategy='median')
data_median = pd.DataFrame(imputer.fit_transform(data_median), columns=data_median.columns)
data_median = inverse_label_encoding(data_median, ['cut', 'color', 'clarity'], mapping)

In [19]:
data_median.head(10)

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1.0,0.7,Ideal,E,SI2,61.5,55.0,2400.0,3.95,3.98,2.43
1,2.0,0.7,Premium,E,SI1,59.8,61.0,326.0,3.89,3.84,2.31
2,3.0,0.23,Good,E,VS1,56.9,65.0,327.0,4.05,4.07,2.31
3,4.0,0.29,Premium,I,VS2,62.4,58.0,334.0,4.2,4.23,2.63
4,5.0,0.7,Good,J,SI2,63.3,58.0,2400.0,4.34,4.35,2.75
5,6.0,0.24,Very Good,J,VVS2,61.8,57.0,336.0,3.94,3.96,2.48
6,7.0,0.24,Very Good,I,VVS1,62.3,57.0,2400.0,3.95,3.98,2.47
7,8.0,0.26,Very Good,H,SI1,61.9,55.0,337.0,4.07,4.11,2.53
8,9.0,0.22,Fair,E,VS2,65.1,61.0,337.0,3.87,3.78,2.49
9,10.0,0.23,Very Good,H,VS1,59.4,61.0,338.0,4.0,4.05,2.39


In [20]:
data_median.isnull().sum()

Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64

In [21]:
data_median[missing_data_columns].describe()

Unnamed: 0,carat,depth,price
count,53940.0,53940.0,53940.0
mean,0.792008,61.760135,3702.398276
std,0.461983,1.36119,3725.319879
min,0.2,43.0,326.0
25%,0.4,61.2,1069.0
50%,0.7,61.8,2400.0
75%,1.03,62.4,4710.0
max,5.01,79.0,18823.0


Completing missing values using k-Nearest Neighbors

In [22]:
data_KNN = data_with_missing.copy()
data_KNN, mapping = label_encoding(data_KNN, ['cut', 'color', 'clarity'])
imputer = KNNImputer(n_neighbors=3, weights='uniform')
data_KNN = pd.DataFrame(imputer.fit_transform(data_KNN), columns=data_KNN.columns)
data_KNN = inverse_label_encoding(data_KNN, ['cut', 'color', 'clarity'], mapping)
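
To make the difference between the imputers concrete, here is a small sketch on a made-up 4x2 array (the numbers are illustrative only). `SimpleImputer` fills the gap with the column mean, while `KNNImputer` averages the values of the nearest rows, with distances computed on the features that are present.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Second row has a gap in the second column
X = np.array([[1.0, 2.0],
              [2.0, np.nan],
              [3.0, 6.0],
              [10.0, 20.0]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)
knn_filled = KNNImputer(n_neighbors=2, weights="uniform").fit_transform(X)

# Column mean of 2, 6 and 20
print(mean_filled[1, 1])
# Mean of the two nearest rows (first column 1 and 3), i.e. of 2 and 6
print(knn_filled[1, 1])
```

The outlying row (10, 20) pulls the column mean up, but it is not among the two nearest neighbours and so does not affect the KNN estimate.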

In [23]:
data_KNN.head(10)

Unnamed: 0.1,Unnamed: 0,carat,cut,color,clarity,depth,table,price,x,y,z
0,1.0,0.263333,Ideal,E,SI2,61.5,55.0,332.0,3.95,3.98,2.43
1,2.0,0.253333,Premium,E,SI1,59.8,61.0,326.0,3.89,3.84,2.31
2,3.0,0.23,Good,E,VS1,56.9,65.0,327.0,4.05,4.07,2.31
3,4.0,0.29,Premium,I,VS2,62.4,58.0,334.0,4.2,4.23,2.63
4,5.0,0.263333,Good,J,SI2,63.3,58.0,335.666667,4.34,4.35,2.75
5,6.0,0.24,Very Good,J,VVS2,62.2,57.0,336.0,3.94,3.96,2.48
6,7.0,0.24,Very Good,I,VVS1,62.3,57.0,335.666667,3.95,3.98,2.47
7,8.0,0.26,Very Good,H,SI1,61.9,55.0,337.0,4.07,4.11,2.53
8,9.0,0.22,Fair,E,VS2,65.1,61.0,337.0,3.87,3.78,2.49
9,10.0,0.23,Very Good,H,VS1,59.4,61.0,338.0,4.0,4.05,2.39


In [24]:
data_KNN.isnull().sum()

Unnamed: 0    0
carat         0
cut           0
color         0
clarity       0
depth         0
table         0
price         0
x             0
y             0
z             0
dtype: int64

In [25]:
data_KNN[missing_data_columns].describe()

Unnamed: 0,carat,depth,price
count,53940.0,53940.0,53940.0
mean,0.797686,61.757018,3933.013663
std,0.473055,1.385352,3988.510343
min,0.2,43.0,326.0
25%,0.4,61.1,952.0
50%,0.7,61.8,2401.0
75%,1.04,62.5,5324.0
max,5.01,79.0,18823.0


## Comparison of the quality of methods

Root mean square error (RMSE) for the data imputed with the mean

In [26]:
rmse_mean = np.sqrt(mean_squared_error(data[missing_data_columns], data_mean[missing_data_columns]))
rmse_mean

882.5947874495396

Root mean square error (RMSE) for the data imputed with the median

In [27]:
rmse_median = np.sqrt(mean_squared_error(data[missing_data_columns], data_median[missing_data_columns]))
rmse_median

947.068702376986

Root mean square error (RMSE) for the data imputed with k-Nearest Neighbors

In [28]:
rmse_KNN = np.sqrt(mean_squared_error(data[missing_data_columns], data_KNN[missing_data_columns]))
rmse_KNN

48.371787765469215

The k-Nearest Neighbors method gives by far the lowest RMSE (about 48, compared with about 883 for the mean and 947 for the median).
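
The RMSE values above are computed over all three columns and all rows at once. With its default `multioutput='uniform_average'`, `mean_squared_error` averages the per-column MSE, so the result is dominated by `price`, and the rows that were never removed contribute zero error. A possible complement, sketched below under the assumption of a hypothetical helper `rmse_on_imputed` and toy data, is to score each column separately and only on the cells that were actually imputed.

```python
import numpy as np
import pandas as pd

def rmse_on_imputed(original, with_missing, imputed, columns):
    """RMSE per column, using only the cells that were actually imputed."""
    scores = {}
    for column in columns:
        mask = with_missing[column].isnull().to_numpy()
        diff = original[column].to_numpy()[mask] - imputed[column].to_numpy()[mask]
        scores[column] = float(np.sqrt(np.mean(diff ** 2)))
    return pd.Series(scores)

# Toy data standing in for data, data_with_missing and data_mean
original = pd.DataFrame({"carat": [0.2, 0.4, 0.6, 0.8],
                         "price": [300.0, 500.0, 700.0, 900.0]})
with_missing = original.copy()
with_missing.loc[1, "carat"] = np.nan
with_missing.loc[[0, 3], "price"] = np.nan
imputed = with_missing.fillna(with_missing.mean())  # mean imputation

scores = rmse_on_imputed(original, with_missing, imputed, ["carat", "price"])
print(scores)
```

In the notebook the same call would take `data`, `data_with_missing` and one of the imputed dataframes, with `missing_data_columns` as the column list.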

## Data filtering

Data filtering using exponential smoothing (exponential filter)

In [29]:
def exponential_filter(data, column, alpha):
    data_filtered = data.copy()
    filtered_column = [data_filtered[column].iloc[0]]
    for i in range(1, len(data_filtered)):
        filtered_column.append(alpha * data_filtered[column].iloc[i] + (1 - alpha) * filtered_column[i-1])
    data_filtered[column+'_filtered'] = pd.Series(filtered_column, index=data_filtered.index)
    return data_filtered
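
The loop in `exponential_filter` implements the recurrence s[0] = x[0], s[t] = α·x[t] + (1 − α)·s[t−1]. pandas offers the same recurrence through `Series.ewm(alpha=..., adjust=False).mean()`, which avoids the Python-level loop. The following sketch checks the equivalence on a few made-up prices; the function name `exponential_filter_loop` is an assumption used only for this comparison.

```python
import numpy as np
import pandas as pd

def exponential_filter_loop(values, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [values[0]]
    for x in values[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

prices = pd.Series([326.0, 326.0, 327.0, 334.0, 335.0])
alpha = 0.3

loop_result = exponential_filter_loop(prices.to_numpy(), alpha)
# adjust=False selects the recursive form used by the loop above
ewm_result = prices.ewm(alpha=alpha, adjust=False).mean().to_numpy()

print(np.allclose(loop_result, ewm_result))
```

The smaller α is, the more weight the previous smoothed value receives and the smoother the resulting series.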

Smoothing at α = 0.3

In [30]:
data_KNN_filtered_1 = exponential_filter(data_KNN, 'price', 0.3)
data_KNN_filtered_1[['price', 'price_filtered']].describe()

Unnamed: 0,price,price_filtered
count,53940.0,53940.0
mean,3933.013663,3932.908765
std,3988.510343,3956.147261
min,326.0,329.24
25%,952.0,980.696301
50%,2401.0,2415.897245
75%,5324.0,5298.836701
max,18823.0,18810.932082


Smoothing at α = 0.5

In [31]:
data_KNN_filtered_2 = exponential_filter(data_KNN, 'price', 0.5)
data_KNN_filtered_2[['price', 'price_filtered']].describe()

Unnamed: 0,price,price_filtered
count,53940.0,53940.0
mean,3933.013663,3932.968706
std,3988.510343,3972.912823
min,326.0,328.0
25%,952.0,964.995931
50%,2401.0,2408.994046
75%,5324.0,5312.734848
max,18823.0,18817.240086


Smoothing at α = 0.7

In [32]:
data_KNN_filtered_3 = exponential_filter(data_KNN, 'price', 0.7)
data_KNN_filtered_3[['price', 'price_filtered']].describe()

Unnamed: 0,price,price_filtered
count,53940.0,53940.0
mean,3933.013663,3932.994396
std,3988.510343,3980.815174
min,326.0,327.24
25%,952.0,957.0
50%,2401.0,2402.33245
75%,5324.0,5317.098744
max,18823.0,18820.488578


Smoothing at α = 0.9

In [33]:
data_KNN_filtered_4 = exponential_filter(data_KNN, 'price', 0.9)
data_KNN_filtered_4[['price', 'price_filtered']].describe()

Unnamed: 0,price,price_filtered
count,53940.0,53940.0
mean,3933.013663,3933.008668
std,3988.510343,3986.148664
min,326.0,326.6
25%,952.0,955.0
50%,2401.0,2401.0
75%,5324.0,5322.000747
max,18823.0,18822.398838


## Conclusions

Three gap-filling methods were applied to the Diamonds dataset: the mean, the median, and k-Nearest Neighbors (KNN). The first two performed poorly (RMSE of about 883 and 947) and are not well suited to this data. KNN gave a far lower RMSE (about 48), and the values it fills in are close to the real ones. An exponential filter with the parameter α equal to 0.3, 0.5, 0.7 and 0.9 was then applied to the `price` column of the KNN-imputed data.

# Prediction

For each prepared dataset, we build a forecast with the sample split into training and test sets in ratios of 50/50, 60/40, 70/30, 80/20 and 90/10.

In [34]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

Let's create auxiliary variables in order to automate the calculation of metric values

In [35]:
result_table = pd.DataFrame(columns=['R2', 'DW', 'SSE', 'MSE', 'MAE', 'MAPE', 'Theil'])
split_size = [0.5, 0.6, 0.7, 0.8, 0.9]
dataframes = [data, data_mean, data_median, data_KNN, data_KNN_filtered_1, data_KNN_filtered_2, data_KNN_filtered_3, data_KNN_filtered_4]
dataframes_name = ['data', 'data_mean', 'data_median', 'data_KNN', 'data_KNN_filtered_1', 'data_KNN_filtered_2', 'data_KNN_filtered_3', 'data_KNN_filtered_4']

The `linear_model` function below takes four arguments: the list of dataframes `dataframes`, the results table `result_table`, the list of training-set fractions `split_size`, and the list of dataframe names `dataframes_name`. It fits a linear regression on each dataframe in `dataframes` for every training fraction in `split_size`. The following metrics are described for each model:

* R^2 (R-squared), the coefficient of determination, measures how well the model fits the data. It indicates the proportion of the variance of the response that is explained by the independent variables. A value of 1 means that the model explains all of the variance and 0 means that it explains none; on test data R^2 can even be negative if the model predicts worse than a constant mean.
* DW (Durbin-Watson statistic) is a measure of autocorrelation of the model residuals. It lies between 0 and 4. A value close to 2 indicates no autocorrelation, a value below 2 indicates positive autocorrelation, and a value above 2 indicates negative autocorrelation.
* MSE (Mean Squared Error) is the mean of the squared differences between the predicted and the actual values. Because the differences are squared, large errors are penalized more heavily.
* MAE (Mean Absolute Error) is the mean of the absolute differences between the predicted and the actual values. It shows how far the predictions deviate from the actual values on average, regardless of the direction of the deviation.
* MAPE (Mean Absolute Percentage Error) is the mean absolute relative error between the predicted and the actual values. It reflects the average percentage by which the predictions deviate from the actual values.
* Theil's U-statistic (Theil's coefficient) is a relative measure of forecast accuracy. Here it is calculated as the ratio of the root mean square error to the root mean square of the actual values, so 0 corresponds to a perfect forecast.
* RMSE (Root Mean Squared Error) is the square root of the MSE. It expresses the typical size of the deviations in the same units as the target variable.
* SSE (Sum of Squared Errors) is the sum of the squared differences between the actual and the predicted values, expressed in squared units of the target variable. When models are compared on the same test set, the one with the smaller SSE is more accurate.

Metric values are stored in the specified result table with the corresponding dataframe name and training sample size. At the output, the function returns a table of results.
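
For reference, the metrics listed above can be written out directly with NumPy. The sketch below uses the textbook definitions for SSE and the Durbin-Watson statistic (both based on the residuals), which differ from the expressions used in `linear_model`, and the same Theil variant as the notebook (RMSE divided by the root mean square of the actual values). The function name `regression_metrics` and the toy numbers are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Textbook versions of the metrics described above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred

    sse = float(np.sum(residuals ** 2))             # sum of squared errors
    mse = sse / len(y_true)                         # mean squared error
    rmse = float(np.sqrt(mse))                      # root mean squared error
    mae = float(np.mean(np.abs(residuals)))         # mean absolute error
    mape = float(np.mean(np.abs(residuals / y_true)) * 100)  # needs y_true != 0
    sst = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - sse / sst                            # coefficient of determination
    # Durbin-Watson: squared successive differences of the residuals over SSE
    dw = float(np.sum(np.diff(residuals) ** 2)) / sse
    # Theil variant: RMSE relative to the root mean square of the actual values
    theil = rmse / float(np.sqrt(np.mean(y_true ** 2)))

    return {"R2": r2, "DW": dw, "SSE": sse, "MSE": mse, "RMSE": rmse,
            "MAE": mae, "MAPE": mape, "Theil": theil}

# Made-up actual and predicted values
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy case the residuals alternate in sign, so the Durbin-Watson value is above 2, which corresponds to negative autocorrelation.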

In [36]:
def linear_model(dataframes, result_table, split_size, dataframes_name):
    for s in split_size:
        for name, df in zip(dataframes_name, dataframes):
            df, _ = label_encoding(df, ['cut', 'color', 'clarity'])

            y = df['price']
            # X keeps every other column, including 'Unnamed: 0' and, for the
            # filtered dataframes, 'price_filtered' (which is derived from price)
            X = df.drop('price', axis=1)

            X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=s, random_state=42)

            # Creating a linear regression model object
            lr = LinearRegression()

            # Model training on training data
            lr.fit(X_train, y_train)

            # Prediction based on test data
            y_pred = lr.predict(X_test)

            # Calculation of R^2, DW, MSE, MAE, MAPE, Theil metrics
            r2 = r2_score(y_test, y_pred)
            # Durbin-Watson-style ratio; it is computed on the predictions here,
            # whereas the textbook statistic uses the residuals y_test - y_pred
            dw = np.sum(np.diff(y_pred)**2) / np.sum(y_pred**2)
            mse = mean_squared_error(y_test, y_pred)
            mae = mean_absolute_error(y_test, y_pred)
            mape = np.mean(np.abs((y_test - y_pred) / y_test)) * 100
            theil = np.sqrt(np.mean((y_test - y_pred)**2) / np.mean(y_test**2))
            # Calculation of RMSE (same value as mean_squared_error(..., squared=False),
            # an argument that recent scikit-learn releases no longer accept)
            rmse = np.sqrt(mse)
            # Value stored in the SSE column: RMSE * sqrt(2) * len(y_train);
            # the textbook sum of squared errors would be mse * len(y_test)
            n = len(y_train)
            sse = rmse * np.sqrt(2) * n

            result_table.loc[f'{name} {s}'] = [r2, dw, sse, mse, mae, mape, theil]

    return result_table

## Table of prediction results

In [37]:
linear_model(dataframes, result_table, split_size, dataframes_name)

Unnamed: 0,R2,DW,SSE,MSE,MAE,MAPE,Theil
data 0.5,0.87266,0.936756,54108010.0,2012478.0,860.944381,34.750934,0.254326
data_mean 0.5,0.70935,0.767261,75485840.0,3916869.0,1301.543375,46.954633,0.368598
data_median 0.5,0.693449,0.823512,78365410.0,4221404.0,1277.794448,52.394771,0.392763
data_KNN 0.5,0.87395,0.937414,53812990.0,1990593.0,862.723432,35.091506,0.252989
data_KNN_filtered_1 0.5,0.98994,1.001783,15202350.0,158865.5,133.627744,9.370212,0.07147
data_KNN_filtered_2 0.5,0.996379,1.004824,9121290.0,57190.07,57.508734,4.327506,0.042882
data_KNN_filtered_3 0.5,0.998918,1.005995,4984927.0,17081.49,22.891573,1.861017,0.023435
data_KNN_filtered_4 0.5,0.999889,1.006455,1595621.0,1750.12,5.391891,0.470088,0.007501
data 0.6,0.874034,0.936645,64643110.0,1994758.0,857.42778,34.459304,0.252819
data_mean 0.6,0.711369,0.767557,90425900.0,3903295.0,1299.69661,46.636561,0.367006


The table contains the prediction results for each dataset variant and each training fraction. The most important metrics for evaluating the forecasts are R2, MSE, and MAE; DW, SSE, MAPE and Theil are also useful but secondary.

## Conclusions

Among the unfiltered datasets, the original dataset `data` (the one without artificial gaps) has some of the best R2, MSE and MAE values, with `data_KNN` essentially on par (R2 of about 0.873 and 0.874 respectively at a 50/50 split). Since the task is to compare datasets in which gaps were filled, the original dataset is left out of the comparison below.

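The datasets that were additionally filtered (`data_KNN_filtered_1` to `data_KNN_filtered_4`) have the smallest MAPE, that is, the smallest mean relative error: from about 9.4% down to about 0.5% at a 50/50 split. Note that for these datasets the predictors include the `price_filtered` column, which is itself computed from `price`, so part of this gain comes from information about the target being present among the features.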

The dataset imputed with k-nearest neighbors (`data_KNN`) gives clearly higher accuracy than the mean- and median-imputed datasets, but its MAPE (about 35% at a 50/50 split) is much worse than that of the filtered variants. This may indicate some tendency of the model to overfit on part of the data.

The datasets imputed with the mean (`data_mean`) and the median (`data_median`) have metric values that are significantly worse than those of the other datasets (R2 of about 0.71 and 0.69 at a 50/50 split). These imputation methods are therefore a poor choice for prediction in this case.

Thus, the best metric values for the linear regression model were obtained for the dataset with gaps filled by the **KNN method** and then subjected to **exponential filtering** with the parameter $\alpha$ = 0.9 (`data_KNN_filtered_4`), when splitting the sample into training and test sets with a **ratio of 60/40**.