## Evaluation
Submissions are scored on the root mean squared error.
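
For reference, the metric can be sketched in a few lines. The text above only names the metric, so this assumes the usual definition (square root of the mean squared difference); the helper name `rmse` and the toy values are illustrative and not taken from the competition.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally shaped arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values, made up for illustration only
example_true = [1.0, 2.0, 3.0]
example_pred = [1.0, 2.0, 5.0]
example_score = rmse(example_true, example_pred)
print(example_score)
```

Because the errors are squared before averaging, a single large miss affects the score more than several small ones.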

## References
https://www.kaggle.com/code/sebastianvangerwen/1st-place-solution-tps-jun-denoising-ae by **@SEBASTIAN VAN GERWEN**<br>
https://towardsdatascience.com/denoising-autoencoders-dae-how-to-use-neural-networks-to-clean-up-your-data-cd9c19bc6915 by **@Saul Dobilas**

## Blueprint

## 0. Import Packages

In [1]:
import math
import pandas as pd
import numpy as np

from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error

import torch
import torch.nn.functional as F
import torch.utils.data
from torch import nn

from tqdm import tqdm

## 1. Data Loading

In [2]:
# Load dataset
data = pd.read_csv('data.csv')
data.shape

(1000000, 81)

In [3]:
# Check data types and missing values
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 81 columns):
 #   Column  Non-Null Count    Dtype  
---  ------  --------------    -----  
 0   row_id  1000000 non-null  int64  
 1   F_1_0   981603 non-null   float64
 2   F_1_1   981784 non-null   float64
 3   F_1_2   981992 non-null   float64
 4   F_1_3   981750 non-null   float64
 5   F_1_4   981678 non-null   float64
 6   F_1_5   981911 non-null   float64
 7   F_1_6   981867 non-null   float64
 8   F_1_7   981872 non-null   float64
 9   F_1_8   981838 non-null   float64
 10  F_1_9   981751 non-null   float64
 11  F_1_10  982039 non-null   float64
 12  F_1_11  981830 non-null   float64
 13  F_1_12  981797 non-null   float64
 14  F_1_13  981602 non-null   float64
 15  F_1_14  981961 non-null   float64
 16  F_2_0   1000000 non-null  int64  
 17  F_2_1   1000000 non-null  int64  
 18  F_2_2   1000000 non-null  int64  
 19  F_2_3   1000000 non-null  int64  
 20  F_2_4   1000000 non-null 

**Comments**: Columns `F_1_0` ~ `F_1_14`, `F_3_0` ~ `F_3_24`, and `F_4_0` ~ `F_4_14` contain missing values. Every column with missing values has dtype `float64`.
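
The same check can be done programmatically instead of reading the `info()` output by eye. The following is a minimal sketch on a hypothetical toy frame (`toy`, with made-up values) standing in for `data`; it lists the columns that contain missing values and the dtypes of those columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `data`; the values are made up
toy = pd.DataFrame({
    "row_id": [0, 1, 2, 3],
    "F_1_0": [0.5, np.nan, 1.2, -0.3],
    "F_2_0": [1, 2, 3, 4],
    "F_3_0": [np.nan, np.nan, 0.1, 0.7],
})

# Number of missing values per column
missing_counts = toy.isna().sum()

# Columns that contain at least one missing value
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

# Distinct dtypes among those columns
dtypes_of_missing = toy[cols_with_missing].dtypes.astype(str).unique().tolist()

print(cols_with_missing)
print(dtypes_of_missing)
```

Running the same three statements on `data` should reproduce the observation above, assuming the file loads as shown.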

In [16]:
# List of features
features = data.columns.drop('row_id').tolist()
features

['F_1_0',
 'F_1_1',
 'F_1_2',
 'F_1_3',
 'F_1_4',
 'F_1_5',
 'F_1_6',
 'F_1_7',
 'F_1_8',
 'F_1_9',
 'F_1_10',
 'F_1_11',
 'F_1_12',
 'F_1_13',
 'F_1_14',
 'F_2_0',
 'F_2_1',
 'F_2_2',
 'F_2_3',
 'F_2_4',
 'F_2_5',
 'F_2_6',
 'F_2_7',
 'F_2_8',
 'F_2_9',
 'F_2_10',
 'F_2_11',
 'F_2_12',
 'F_2_13',
 'F_2_14',
 'F_2_15',
 'F_2_16',
 'F_2_17',
 'F_2_18',
 'F_2_19',
 'F_2_20',
 'F_2_21',
 'F_2_22',
 'F_2_23',
 'F_2_24',
 'F_3_0',
 'F_3_1',
 'F_3_2',
 'F_3_3',
 'F_3_4',
 'F_3_5',
 'F_3_6',
 'F_3_7',
 'F_3_8',
 'F_3_9',
 'F_3_10',
 'F_3_11',
 'F_3_12',
 'F_3_13',
 'F_3_14',
 'F_3_15',
 'F_3_16',
 'F_3_17',
 'F_3_18',
 'F_3_19',
 'F_3_20',
 'F_3_21',
 'F_3_22',
 'F_3_23',
 'F_3_24',
 'F_4_0',
 'F_4_1',
 'F_4_2',
 'F_4_3',
 'F_4_4',
 'F_4_5',
 'F_4_6',
 'F_4_7',
 'F_4_8',
 'F_4_9',
 'F_4_10',
 'F_4_11',
 'F_4_12',
 'F_4_13',
 'F_4_14']

## 2. Mask Generation

In [29]:
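# Build an (n, k) 0/1 mask: a Bernoulli(0.5) mask multiplied by a mask that
# forces one randomly chosen entry per row to 0, so every row has at least one 0
def random_mask(n, k):
    mask = np.ones((n, k))

    # Set one randomly chosen column in every row to 0
    mask[np.arange(n), np.random.randint(0, k, n)] = 0

    # Additionally set each entry to 0 independently with probability 0.5
    b_mask = np.random.binomial(1, 0.5, (n, k))    # 1 trial, p=0.5
    return mask * b_mask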
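
To see what `random_mask` produces, the sketch below repeats the function so the block runs on its own and applies the mask to a small random array. Multiplying the input by the mask (so masked entries become 0) is an assumption about how the mask might be used to corrupt inputs for a denoising autoencoder; the notebook excerpt does not show that step. The names `x` and `x_corrupted` are illustrative.

```python
import numpy as np

def random_mask(n, k):
    # Same logic as the cell above, repeated so this block is self-contained
    mask = np.ones((n, k))
    mask[np.arange(n), np.random.randint(0, k, n)] = 0
    b_mask = np.random.binomial(1, 0.5, (n, k))
    return mask * b_mask

np.random.seed(0)                 # fixed seed so the run is repeatable
x = np.random.randn(4, 6)         # toy input: 4 rows, 6 features
mask = random_mask(*x.shape)

# Assumed usage: zero out the masked entries of the input
x_corrupted = x * mask

# Every row has at least one zero; on average about half the entries are zero
zeros_per_row = (mask == 0).sum(axis=1)
print(mask)
print(zeros_per_row)
```

Because the forced zero and the binomial zeros can fall on the same entry, the number of zeros per row varies; only the lower bound of one is guaranteed.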

In [70]:
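def mask_n_rows(n, k, n_missing):
    # Argsorting uniform random values yields a random column permutation per row;
    # its first n_missing entries are the columns to set to 0 in that row
    idx = np.random.rand(n, k).argsort(1)[:, :n_missing]

    # Flatten to paired (row, column) index arrays for fancy indexing
    col_idx = idx.flatten()
    row_idx = np.arange(n).repeat(n_missing)

    mask = np.ones((n, k))
    mask[(row_idx, col_idx)] = 0
    return mask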
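
Unlike `random_mask`, this function zeroes exactly `n_missing` entries in every row, because the first `n_missing` entries of a permutation are always distinct. The sketch below repeats the function in compact form so it runs on its own and checks that property on a small example; `demo_mask` and the chosen sizes are illustrative.

```python
import numpy as np

def mask_n_rows(n, k, n_missing):
    # Same logic as the cell above, repeated so this block is self-contained
    idx = np.random.rand(n, k).argsort(1)[:, :n_missing]
    mask = np.ones((n, k))
    mask[np.arange(n).repeat(n_missing), idx.flatten()] = 0
    return mask

np.random.seed(0)                      # fixed seed so the run is repeatable
demo_mask = mask_n_rows(n=5, k=8, n_missing=3)

# Count zeros per row; each row should contain exactly n_missing of them
zeros_per_row = (demo_mask == 0).sum(axis=1)
print(demo_mask)
print(zeros_per_row)
```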