# **How the noisy diabetes datasets were created using Scikit-learn.**

The toy datasets bundled with Scikit-learn tend to yield unrealistically high accuracy in machine learning exercises, so noise is added here to lower the accuracy on purpose.
The resulting noisy data was arranged as a Pandas DataFrame and saved as a CSV file. All computation was performed on Google Colab. The result is a custom noisy toy dataset whose loaders are API-compatible with `sklearn.datasets.load_diabetes()`. Because it relies only on publicly available data and code, it can be processed in the cloud without any concerns, and learners can `!git clone` the public GitHub repository directly on Colab.

In [None]:
# import libraries
import numpy as np
import pandas as pds
from pandas import Series,DataFrame
import csv
import random

from sklearn.datasets import load_diabetes
from sklearn.utils import Bunch

In [None]:
# use diabetes sample data from sklearn
diabetes = load_diabetes()
# show the input variables from the data
pds.DataFrame(diabetes.data, columns=("age", "sex", "bmi", "map", "tc", "ldl", "hdl", "tch", "ltg", "glu"))

Unnamed: 0,age,sex,bmi,map,tc,ldl,hdl,tch,ltg,glu
0,0.038076,0.050680,0.061696,0.021872,-0.044223,-0.034821,-0.043401,-0.002592,0.019907,-0.017646
1,-0.001882,-0.044642,-0.051474,-0.026328,-0.008449,-0.019163,0.074412,-0.039493,-0.068332,-0.092204
2,0.085299,0.050680,0.044451,-0.005670,-0.045599,-0.034194,-0.032356,-0.002592,0.002861,-0.025930
3,-0.089063,-0.044642,-0.011595,-0.036656,0.012191,0.024991,-0.036038,0.034309,0.022688,-0.009362
4,0.005383,-0.044642,-0.036385,0.021872,0.003935,0.015596,0.008142,-0.002592,-0.031988,-0.046641
...,...,...,...,...,...,...,...,...,...,...
437,0.041708,0.050680,0.019662,0.059744,-0.005697,-0.002566,-0.028674,-0.002592,0.031193,0.007207
438,-0.005515,0.050680,-0.015906,-0.067642,0.049341,0.079165,-0.028674,0.034309,-0.018114,0.044485
439,0.041708,0.050680,-0.015906,0.017293,-0.037344,-0.013840,-0.024993,-0.011080,-0.046883,0.015491
440,-0.045472,-0.044642,0.039062,0.001215,0.016318,0.015283,-0.028674,0.026560,0.044529,-0.025930


Gaussian noise with a mean of 0 and a standard deviation of 0.1 was added to the data to deliberately reduce its accuracy.
To simulate repeated measurements, with data augmentation in mind, uniform noise in the range [-0.001, 0.001] was then added five times to each default row obtained from the Gaussian-noise step, and the corresponding participant and session numbers were attached to every generated row.
The data was written with the Pandas `to_csv()` function using the `index=False` option; when reading it back with `read_csv()`, `dtype={'participants': int, 'sessions': int}` was specified in addition to the existing feature names of the diabetes dataset.
The total (augmented) noisy data 'X_noisy' was created in this way, with five additional sessions for each participant.

Strictly speaking, this way of adding noise is not correct: one should take the raw data (diabetes_data_raw.csv.gz) from GitHub, compute the mean and standard deviation as follows, and apply the inverse transformation. For the sake of simplicity, that approach was not used here.

    df_raw = pds.read_csv("diabetes_data_raw_revised.csv", header=None)
    X_raw = df_raw.iloc[:, :-1]
    y_raw = df_raw.iloc[:, -1]
    mean = X_raw.mean(axis=0)
    std = X_raw.std(axis=0)
    #X_recovered = X_std * std.values + mean.values
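
The inverse transformation mentioned above can be sketched on synthetic data. This is only an illustration under stated assumptions: the matrix `X_raw` below is randomly generated (it is not the real raw file), the scaling follows the description in the scikit-learn documentation (each column is mean centered and divided by the standard deviation times the square root of `n_samples`), and the raw-unit noise level of 0.1 standard deviations is a hypothetical choice.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the raw feature matrix (442 rows, 3 columns); not the real data.
X_raw = rng.normal(loc=[50.0, 26.0, 95.0], scale=[13.0, 4.0, 14.0], size=(442, 3))

mean = X_raw.mean(axis=0)
# Assumed scaling: centered columns divided by std * sqrt(n_samples),
# so that the sum of squares of every scaled column equals 1.
scale = X_raw.std(axis=0) * np.sqrt(X_raw.shape[0])

X_scaled = (X_raw - mean) / scale       # forward transformation
X_recovered = X_scaled * scale + mean   # inverse transformation back to raw units

# Hypothetical noise level: 0.1 standard deviations per column, added in raw units,
# then re-scaled with the statistics of the clean data.
X_raw_noisy = X_raw + rng.normal(0.0, 0.1 * X_raw.std(axis=0), size=X_raw.shape)
X_scaled_noisy = (X_raw_noisy - mean) / scale
```

Adding the noise in raw units keeps its size interpretable per feature, which is the point of going through the inverse transformation first.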

In [None]:
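def add_gaussian_noise_np(X, mean=0.0, sd=0.1):
    """Add Gaussian noise, then clip each column to its original [min, max] range."""
    X = np.array(X)
    noise = np.random.normal(loc=mean, scale=sd, size=X.shape)
    noisy = X + noise
    # Column-wise bounds of the clean data keep the noisy values in a plausible range.
    col_mins = X.min(axis=0)
    col_maxs = X.max(axis=0)
    return np.clip(noisy, col_mins, col_maxs)


def add_uniform_noise_np(X, low=-0.01, high=0.01):
    """Add uniform noise drawn from [low, high) to every element (no clipping)."""
    X = np.array(X)
    noise = np.random.uniform(low=low, high=high, size=X.shape)
    noisy = X + noise
    return noisy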
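
The helpers above draw from NumPy's global random state, so every run produces different noise. A minimal sketch of a reproducible variant follows; the function name `add_gaussian_noise_seeded` and the stand-in matrix `X_demo` are hypothetical and not part of the original notebook.

```python
import numpy as np


def add_gaussian_noise_seeded(X, rng, mean=0.0, sd=0.1):
    """Hypothetical variant: same clipping logic, but noise comes from a passed-in Generator."""
    X = np.asarray(X, dtype=float)
    noisy = X + rng.normal(loc=mean, scale=sd, size=X.shape)
    # Clip each column to the range observed in the clean data.
    return np.clip(noisy, X.min(axis=0), X.max(axis=0))


# Stand-in for diabetes.data (same shape, arbitrary values).
X_demo = np.random.default_rng(42).uniform(-0.1, 0.1, size=(442, 10))

# Two generators with the same seed give identical noisy matrices.
X_a = add_gaussian_noise_seeded(X_demo, np.random.default_rng(0))
X_b = add_gaussian_noise_seeded(X_demo, np.random.default_rng(0))

# Share of entries that ended up on a column bound after clipping.
on_bound = (X_a == X_demo.min(axis=0)) | (X_a == X_demo.max(axis=0))
print(on_bound.mean())
```

Because a standard deviation of 0.1 is comparable to the range of the scaled features, a noticeable share of values is clipped to the column minimum or maximum, which is also visible in the printed output of the notebook.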

In [None]:
X = diabetes.data.copy()
X_noisy_single = add_gaussian_noise_np(X,0.0,0.1)
print(X_noisy_single)

[[ 0.11072668  0.05068012  0.1017001  ...  0.08818906  0.06087051
  -0.13776723]
 [-0.07303296 -0.04464164 -0.03887381 ... -0.0763945  -0.12609712
  -0.09785742]
 [ 0.07219719  0.05068012  0.13596191 ...  0.04559019  0.02351131
  -0.00352858]
 ...
 [-0.01096403 -0.00543593  0.07801133 ...  0.04931533 -0.03595717
   0.04320936]
 [-0.00086888  0.05068012  0.11049561 ...  0.18523444 -0.11804317
  -0.03093692]
 [-0.04428999 -0.04464164 -0.07312773 ... -0.0763945  -0.08431455
  -0.01876428]]


In [None]:
# Session 0 keeps the row from X unchanged; sessions 1-5 are copies perturbed by
# uniform noise in [-0.001, 0.001].
# Note: X is the original data here; the Gaussian-noised rows are in X_noisy_single.
lsted = [[X[i], [add_uniform_noise_np(X[i], -0.001, 0.001) for _ in range(5)]] for i in range(len(X))]
# lsted is nested as [row, [5 noisy rows]]; flatten each entry into a list of 6 rows.
X_augmented = [[row] + noisy_rows for row, noisy_rows in lsted]
# Concatenate the per-participant lists into one flat list of rows.
X_repeated = [row for rows in X_augmented for row in rows]
# Participant and session indices: each participant has 6 sessions (0-5).
participants = np.repeat(np.arange(len(X)), 6).tolist()
sessions = np.tile(np.arange(6), len(X)).tolist()
# Prepend the participant and session columns to the feature columns.
X_noisy_total = np.insert(np.insert(X_repeated, 0, sessions, axis=1), 0, participants, axis=1)
# np.insert yields a float array, so convert the first two columns back to int.
X_noisy_total = [
    list(map(int, row[:2])) + list(map(float, row[2:]))
    for row in X_noisy_total
]
print(X_noisy_total)

[[0, 0, 0.038075906433423026, 0.05068011873981862, 0.061696206518683294, 0.0218723855140367, -0.04422349842444599, -0.03482076283769895, -0.04340084565202491, -0.002592261998183278, 0.019907486170462722, -0.01764612515980379], [0, 1, 0.0374427867874237, 0.04988977472612088, 0.062317062385228626, 0.02261231069714533, -0.04400423421052884, -0.034509613544107115, -0.04384941072149081, -0.003051596033338101, 0.0200490156528441, -0.016805136069356023], [0, 2, 0.03893394625091848, 0.05109608440297945, 0.06132067569942399, 0.022568414321446, -0.044690640682832086, -0.03538113212401081, -0.042860018004795046, -0.00312169705272072, 0.02073321742092966, -0.01797164352014097], [0, 3, 0.037424187553520984, 0.05015879913212133, 0.061740852475485414, 0.021596312776973294, -0.04339298440988627, -0.035772647062396204, -0.044329930804758444, -0.0025491768941866434, 0.019207533252775418, -0.01726619613878872], [0, 4, 0.03859334063864335, 0.049971747093192916, 0.06090378137124669, 0.02199156977947745, -0
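
The CSV round trip described earlier (`to_csv()` with `index=False`, then `read_csv()` with an explicit `dtype`) can be sketched as follows. This is an illustration only: it uses a small randomly generated table and an in-memory buffer instead of the actual files, and the sizes `n_participants` and `n_sessions` are chosen for the demo.

```python
import io

import numpy as np
import pandas as pds

feature_names = ["age", "sex", "bmi", "map", "tc", "ldl", "hdl", "tch", "ltg", "glu"]
columns = ["participants", "sessions"] + feature_names

# Small stand-in table: 3 participants x 6 sessions, random feature values.
rng = np.random.default_rng(0)
n_participants, n_sessions = 3, 6
features = rng.uniform(-0.1, 0.1, size=(n_participants * n_sessions, len(feature_names)))
df = pds.DataFrame(features, columns=feature_names)
df.insert(0, "sessions", np.tile(np.arange(n_sessions), n_participants))
df.insert(0, "participants", np.repeat(np.arange(n_participants), n_sessions))

# index=False keeps the row index out of the file, so no "Unnamed: 0" column
# appears when the file is read back.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)

# Force the two bookkeeping columns to integers on load.
df_back = pds.read_csv(buffer, dtype={"participants": int, "sessions": int})
```

Replacing `buffer` with a file path gives the on-disk version of the same round trip.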

The total (augmented) noisy target values 'y_noisy' for the five additional sessions are created by adding random integers, drawn from a uniform distribution over [-2, 2], to each original y value.

Note that the first element in each nested list is the original target value (that is, without noise) as provided by `load_diabetes()` in Scikit-learn.


In [None]:
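y = diabetes.target.copy()
rng = np.random.default_rng()
# For each target value k: session 0 keeps k, sessions 1-5 add a random integer
# drawn uniformly from [-2, 2] (endpoint=True makes the upper bound inclusive).
y_noisy = np.ravel([
    k + np.append([0.0], rng.integers(-2, 2, size=5, endpoint=True))
    for k in y
])
print(y_noisy)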

[151. 153. 152. ...  57.  56.  59.]
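
The same target augmentation can also be written with broadcasting instead of a Python loop. The sketch below is an alternative formulation, not the notebook's own code; `y_demo` is a short stand-in for `diabetes.target`, and the fixed seed is an assumption added for reproducibility.

```python
import numpy as np

rng = np.random.default_rng(0)
y_demo = np.array([151.0, 75.0, 141.0])  # short stand-in for diabetes.target

# One row of offsets per participant: column 0 is zero (original value),
# columns 1-5 are integers drawn uniformly from [-2, 2].
offsets = rng.integers(-2, 2, size=(y_demo.size, 5), endpoint=True)
offsets = np.hstack([np.zeros((y_demo.size, 1)), offsets])

# Broadcasting adds each participant's target to its row of offsets;
# ravel() lays the sessions out participant by participant.
y_noisy_demo = (y_demo[:, None] + offsets).ravel()
```

Every sixth element of `y_noisy_demo`, starting at index 0, is the unmodified target, which matches the row order of the augmented feature table.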


To create a custom noisy toy dataset that is API-compatible with `sklearn.datasets.load_diabetes()`, I defined functions that return a `Bunch` object, so that `load_noisy_diabetes()` and `load_total_noisy_diabetes()` can be used in the same way as `load_diabetes()`.

Note that the CSV files were originally loaded on Google Colab; the paths below point to a Google Drive folder mounted there.

In [None]:
def load_noisy_diabetes(return_X_y=False, as_frame=False, scaled=True):
    """Load the noisy diabetes data as a Bunch, mirroring load_diabetes()."""
    feature_names = ["age", "sex", "bmi", "map", "tc", "ldl", "hdl", "tch", "ltg", "glu"]
    # The targets are taken unchanged from the original dataset.
    diabetes = load_diabetes(return_X_y=False, as_frame=as_frame, scaled=scaled)

    # The with-block closes the file automatically.
    with open('/content/drive/My Drive/Colab Notebooks/noisy_diabetes_data.csv') as f:
        rows = list(csv.reader(f))
    del rows[0]  # drop the header row
    X_noisy = [[float(x) for x in row] for row in rows]
    y = diabetes.target.copy()

    if return_X_y:
        return X_noisy, y

    return Bunch(
        data=X_noisy,
        target=y,
        feature_names=feature_names,
        frame=pds.DataFrame(X_noisy, columns=feature_names) if as_frame else None,
        DESCR="Noisy version of the diabetes dataset",
        data_filename=None,
        target_filename=None,
    )


In [None]:
def load_total_noisy_diabetes(return_X_y=False, as_frame=False, scaled=True):
    """Load the total (augmented) noisy diabetes data as a Bunch, mirroring load_diabetes()."""
    feature_names = ["participants", "sessions", "age", "sex", "bmi", "map", "tc", "ldl", "hdl", "tch", "ltg", "glu"]
    # Kept for interface parity; the augmented targets are read from CSV below.
    diabetes = load_diabetes(return_X_y=False, as_frame=as_frame, scaled=scaled)

    # The with-block closes the file automatically.
    with open('/content/drive/My Drive/Colab Notebooks/total_noisy_diabetes_data.csv') as f:
        rows = list(csv.reader(f))
    del rows[0]  # drop the header row
    # The first two columns (participants, sessions) are integers; the rest are floats.
    X_noisy = [
        list(map(int, row[:2])) + list(map(float, row[2:]))
        for row in rows
    ]

    with open('/content/drive/My Drive/Colab Notebooks/total_noisy_diabetes_target.csv') as f:
        rows = list(csv.reader(f))
    del rows[0]  # drop the header row
    y_noisy = [[float(x) for x in row] for row in rows]

    if return_X_y:
        return X_noisy, y_noisy

    return Bunch(
        data=X_noisy,
        target=y_noisy,
        feature_names=feature_names,
        frame=pds.DataFrame(X_noisy, columns=feature_names) if as_frame else None,
        DESCR="Total (augmented) version of the noisy diabetes dataset",
        data_filename=None,
        target_filename=None,
    )
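
The loaders above read from fixed Google Drive paths, so they only run on Colab with the drive mounted. A possible way to try the same `Bunch` pattern elsewhere is to pass the CSV path as an argument. The helper `load_noisy_from_csv`, the three-column demo file, and its values are all hypothetical and serve only to show how the returned object is used.

```python
import csv
import os
import tempfile

from sklearn.utils import Bunch


def load_noisy_from_csv(path, target, feature_names):
    """Hypothetical helper: read a headered CSV of floats and wrap it in a Bunch."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        data = [[float(x) for x in row] for row in reader]
    return Bunch(data=data, target=target, feature_names=feature_names)


feature_names = ["age", "sex", "bmi"]
with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny demo file so the sketch runs without Google Drive.
    path = os.path.join(tmp, "noisy_demo.csv")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(feature_names)
        writer.writerows([[0.01, 0.05, -0.02], [-0.03, -0.04, 0.06]])
    bunch = load_noisy_from_csv(path, target=[151.0, 75.0], feature_names=feature_names)

# A Bunch supports both attribute access and dictionary-style access.
first_row = bunch.data[0]
targets = bunch["target"]
```

The attribute and key access shown at the end works the same way on the objects returned by `load_noisy_diabetes()` and `load_total_noisy_diabetes()`.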
