# Intellligent Data Analysis

- **Data Set:** #31
- **Author 1:** Juraj Baráth
- **Author 2:** Vladimír Svitok

## Strojové učenie (max. 15b)

Pri dátovej analýze nemusí byť naším cieľom získať len znalosti obsiahnuté v aktuálnych dátach, ale aj natrénovať model, ktorý bude schopný robiť rozumné predikcie pre nové pozorovania. Na to sa využívajú techniky strojového učenia. V tomto projekte sa zameriame na rozhodovacie stromy vzhľadom na ich jednoduchú interpretovateľnosť.

V tejto fáze dostanete nový dataset, na ktorom oddemonštrujete znovupoužiteľnosť vami realizovaného predspracovania. Vami natrénované klasifikátory budú porovnané medzi sebou; uvidíte tak, ako dobre ste sa umiestnili v rámci vášho cvičenia, resp. celého predmetu.

V poslednej fáze sa od vás očakáva:
- **Predspracovanie nového datasetu vami realizovaným postupom predspracovania a opis prípadných zmien (2b).** Spustite postup predspracovania realizovaný v predchádzajúcej fáze nad novým datasetom. Nový dataset bude mať rovnakú štruktúru ako váš pôvodný, nebudú sa v ňom však možno nachádzať niektoré problémy (nové vám nepribudnú). Ak si spustenie predspracovania vyžiada zmeny v kóde, opíšte ich.
- **Manuálne vytvorenie a vyhodnotenie rozhodovacích pravidiel pre klasifikáciu (3b).** Vyskúšajte jednoduché pravidlá zahŕňajúce jeden atribút, ale aj komplikovanejšie zahŕňajúce viacero atribútov (ich kombinácie). Pravidlá by v tomto kroku mali byť vytvorené manuálne na základe pozorovaných závislostí v dátach. Pravidlá (manuálne vytvorené klasifikátory) vyhodnoťte pomocou metrík správnosť (angl. *accuracy*), presnosť (angl. *precision*) a úplnosť (angl. *recall*). 
- **Natrénovanie a vyhodnotenie klasifikátora s využitím rozhodovacích stromov (4b).** Na trénovanie využite algoritmus dostupný v knižnici `scikit-learn` (`CART`). Vizualizujte natrénované pravidlá. Vyhodnoťte natrénovaný rozhodovací strom pomocou metrík správnosť (angl. *accuracy*), presnosť (angl. *precision*) a úplnosť (angl. *recall*). Porovnajte natrénovaný klasifikátor s vašimi manuálne vytvorenými pravidlami z druhého kroku. 
- **Optimalizácia hyperparametrov (4b).** Preskúmajte hyperparametre klasifikačného algoritmu `CART` a vyskúšajte ich rôzne nastavenie tak, aby ste minimalizovali preučenie. Vysvetlite, čo jednotlivé hyperparametre robia. Pri nastavovaní hyperparametrov algoritmu využite 10-násobnú krížovú validáciu na trénovacej množine.
- **Vyhodnotenie vplyvu zvolenej stratégie riešenia chýbajúcich hodnôt na správnosť klasifikácie (2b).** Zistite, či použitie zvolených stratégií riešenia chýbajúcich hodnôt vplýva na správnosť (angl. accuracy) klasifikácie. Ktorá stratégia sa ukázala ako vhodnejšia pre daný problém?

Správa sa odovzdáva v 12. týždni semestra na cvičení (dvojica svojmu cvičiacemu odprezentuje výsledky strojového učenia v `Jupyter Notebooku`). Následne správu elektronicky odovzdá jeden člen z  dvojice do systému AIS do **nedele 15.12.2019 do 23:59**.


In [1]:
# Automatically reformat python code
# Check https://medium.com/openplanetary/code-formatting-in-jupyter-cells-8fee4eda072f for more info
# %load_ext lab_black

# Data analysis and debugging helper methods

In [2]:
"""
    Checks the not availability of a value
"""


def is_nan(v):
    sv = str(v)
    return sv == "?" or sv == "NaN" or sv == "nan"

In [3]:
"""
    Prints the possible values of columns having less than 50 possible values
    For columns having numeric values prints the min, 5% percentile, 95% percentile, max, mean, median
"""


def print_col_values(d):

    """ Formats a number by converting floats to 4 decimals and keeps integers"""

    def format_num(num):
        num = round(num, 4)
        return int(num) if num == int(num) else num

    cols = list(d.columns)
    cols.sort()

    for c in cols:
        if str(d.iloc[0][c]).replace(".", "").isdigit():
            v = d[c].apply(
                lambda n: float(n) if str(n).replace(".", "").isdigit() else np.nan
            )
            print(
                "••► " + c + ": min = {}, 5% = {}, 95% = {}, max = {},"
                " mean = {}, median = {}".format(
                    format_num(v.min()),
                    format_num(np.nanpercentile(v, 5)),
                    format_num(np.nanpercentile(v, 95)),
                    format_num(v.max()),
                    format_num(v.mean()),
                    format_num(v.median()),
                ),
                end="\n\n",
            )
        else:
            l = list(d.drop_duplicates(c)[c])
            if len(l) < 50:
                print("••► " + c + " (" + str(len(l)) + "): " + str(l), end="\n\n")
            else:
                print("••► " + c + " (" + str(len(l)) + ")", end="\n\n")

In [4]:
"""
    Prints the statistics of both personal and other train data at a certain stage
"""


def print_data_stats(stage, data, data2):
    print(
        "=========================="
        + stage
        + " ======================"
    )
    print("===== Data =====")
    print_col_values(data)
    print("==== Data 2 =====")
    print_col_values(data2)

In [5]:
"""
    Draws a chart of a column in the data set with the given label
"""


def draw_chart(label, data, column):
    data.plot.scatter(label=label, x="id", y=column, figsize=(15, 5)).plot()

In [6]:
"""
    Draws charts of the medical metrics data against the class with the given titles in the given data sets
"""


def draw_md_charts(titles, data_array):
    for gl in [
        "skewness_glucose",
        "mean_glucose",
        "kurtosis_glucose",
        "std_glucose",
        "mean_oxygen",
        "std_oxygen",
        "kurtosis_oxygen",
        "skewness_oxygen",
    ]:
        for cl in range(2):
            for i in range(len(data_array)):
                d_source = data_array[i]
                d = d_source[d_source["class"] == cl]
                draw_chart(titles[i] + " - " + gl + " - Class " + str(cl), d, gl)

# Data transformations

In [7]:
"""
    Renames the Unnamed: 0 column to id
"""


def rename_id_column(data, data2, debug):
    data.rename(columns={data.columns[0]: "id"}, inplace=True)
    data2.rename(columns={data2.columns[0]: "id"}, inplace=True)
    return data, data2

In [8]:
""" 
    Extracts the 4 columns of the medical info from data2 and puts them as new columns, removes the old column
"""


def extract_medical_info(data, data2, debug):
    # Generate the list of columns we need to extract from medical_info
    mi_cols = list(dict(eval(data2["medical_info"][0])).keys())
    # Extract the 4 data included in the medical_info column and put it to 4 separate columns
    for c in mi_cols:
        data2[c] = data2["medical_info"].apply(
            lambda d: 0.0 if str(d) == "nan" else float(dict(eval(str(d)))[c])
        )
    data2.drop(columns="medical_info", inplace=True)
    return data, data2

In [9]:
""" 
    Removes the time from the date_of_birth and replaces "/" characters to "-" characters in it
"""


def fix_date_of_birth(data, data2, debug):
    # Remove time from the date_of_birth
    data["date_of_birth"] = data["date_of_birth"].str.split(" ", expand=True)[0]

    # Replace '/' characters to '-' characters in date_of_birth
    data["date_of_birth"] = data["date_of_birth"].str.replace("/", "-")
    return data, data2

In [10]:
"""
    Replace non numeric age values with 0
"""


def fix_non_numeric_age_values(data, data2, debug):
    data["age"] = data["age"].apply(
        lambda age: int(age) if str(age).replace(".", "", 1).isdigit() else 0
    )
    return data, data2

In [11]:
"""
    Fix the date format in the date_of_birth, make it consistently being yyyy-mm-dd
"""


def fix_date_format(data, data2, debug):
    def fix_date(date):
        s = date.split("-")
        if (
            len(s[0]) == 4
        ):  # Don't need to fix anything if the first field is a 4 digit year number
            return date
        if len(s[2]) == 4:  # Replace dd-mm-yyyy format to yyyy-mm-dd format
            return s[2] + "-" + s[1] + "-" + s[0]

        # Replace yy-mm-dd format to yyyy-mm-dd format
        #
        # If the last 2 digit of the 4 digit year field is bigger than 20,
        # then we expect a date of birth between 1921 and 1999, otherwise between 2000 and 2020,
        # because we have verified earlier that we don't have people older than 90 years old in such date format
        if int(s[0]) > 20:
            return "19" + s[0] + "-" + s[1] + "-" + s[2]
        else:
            return "20" + s[0] + "-" + s[1] + "-" + s[2]

        return date

    data["date_of_birth"] = data["date_of_birth"].apply(fix_date)
    return data, data2

In [12]:
"""
    Calculate the missing ages of people based on the dataset date
"""


def calc_missing_ages_from_date_of_birth(data, data2, debug):
    # Approximate the date of the dataset by adding the age of people to their date_of_birth. Choose the highest value.
    def add_age(row):
        return (
            str(int(row["date_of_birth"][:4]) + row["age"]) + row["date_of_birth"][4:]
        )

    # Fixes the age based on the previously calculated dataset age
    def fix_age(row):
        if row["age"] == 0:
            birth = [int(i) for i in row["date_of_birth"].split("-")]
            dif = [ds_date[0] - birth[0], ds_date[1] - birth[1], ds_date[2] - birth[2]]
            if debug:
                print(ds_date, "-", birth, "=", dif, end=" --> age = ")
            if dif[2] < 0:
                dif[1] -= 1
            if dif[1] < 0:
                dif[0] -= 1
            if debug:
                print(dif[0])
            return dif[0]
        return row["age"]

    ds_date = [int(i) for i in data.apply(add_age, axis=1).max().split("-")]
    if debug:
        print("Data set age:", ds_date)

    data["age"] = data.apply(lambda row: fix_age(row), axis=1)
    return data, data2

In [13]:
"""
    Trim all the leading spaces from every String data
"""


def trim_leading_spaces(data, data2, debug):
    for d in (data, data2):
        for c in d.columns:
            d[c] = d[c].apply(lambda s: s if type(s) != str else s.strip())
    return data, data2

In [14]:
"""
    Merge duplicate rows by choosing the average of every available data,
    for non numeric data using the first available match.

    If one of the numeric value is 0, while a none zero value exist, ignore the zero value
"""


def merge_duplicate_rows(data, data2, debug):
    if debug:
        print("Detecting duplicates based on names...")

    duplicate_names = list(data2[data2.duplicated(["name"])]["name"])
    if debug:
        print("Found", len(duplicate_names), "duplicates, merging them...")

    fixed_rows = []

    for name in duplicate_names:
        rows = data2[data2["name"] == name]
        l = len(rows)
        if debug:
            print("\nMerging", l, "rows of name " + name + "...")
        bestRow = rows.iloc[0].copy()
        for c in data2.columns:
            if c == "id":  # Do not merge the id column
                continue
            if debug:
                print(c, list(rows[c]), end=" --> ")
            m = 1
            for i in range(1, l):
                row = rows.iloc[i]
                if is_nan(row[c]):
                    continue
                if is_nan(bestRow[c]):
                    bestRow[c] = row[c]
                    m = 1
                elif (type(row[c]) != str) and row[c] != 0 and row[c] != 0.0:
                    bestRow[c] += row[c]
                    if bestRow[c] != row[c]:
                        m += 1
            if type(bestRow[c]) != str and m > 0:
                bestRow[c] /= m
            if debug:
                print([bestRow[c]], end="\n")
        fixed_rows.append(bestRow)

    if debug:
        print("Applying changes...")

    data2.drop_duplicates(subset=["name"], keep=False, inplace=True)
    for r in fixed_rows:
        data2 = data2.append(r)
    return data, data2

In [15]:
"""
    Fixes the pregnant column values, keeps nan rows, changes others to 1 
    if they start with t (case insensitive), changes them to 0 otherwise
"""


def fix_pregnant(data, data2, debug):
    data2["pregnant"] = data2["pregnant"].apply(
        lambda p: np.nan if str(p) == "nan" else int(p[0].lower() == "t")
    )
    return data, data2

In [16]:
"""
    Replaces missing data with median values for numbers and
    with the data having the highest occurance at Strings
"""


def replace_missing_data_with_medians(data, data2, debug):
    if debug:
        print("Filling missing data with median values...")
    data2med = data2.copy()
    for c in data2.columns:
        med = (
            data2[c].value_counts().idxmax()
            if type(data2.iloc[0][c]) == str
            else data2[c].median()
        )
        if debug:
            print(c, "-->", med)
        rdata2[c] = rdata2[c].apply(lambda v: med if is_nan(v) else v)
    return data2, rdata2

In [17]:
"""
    Replaces missing data with linear regression for numbers and
    with the data having the highest occurance at Strings
"""


def replace_missing_data_with_regr(data, data2, rdata, rdata2, debug):
    if debug:
        print("Filling missing data with linear regression values...")
    model1 = LinearRegression()
    model2 = LinearRegression()
    modeldata = data2[data2[['mean_glucose','std_glucose']].notnull().all(axis=1)]
    x = np.array(modeldata['mean_glucose']).reshape(-1,1)
    y = np.array(modeldata['std_glucose']).reshape(-1,1)
    model1.fit(x, y)
    model2.fit(y, x)
    data2['mean_glucose'] = data2.apply(lambda row: (
        row['mean_glucose'] if not is_nan(row['mean_glucose']) else (model2.intercept_ + model2.coef_ * row['std_glucose'])[0][0] if not is_nan(row['std_glucose']) else np.nan
    ), axis=1)
    data2['std_glucose'] = data2.apply(lambda row: (
        row['std_glucose'] if not is_nan(row['std_glucose']) else (model1.intercept_ + model1.coef_ * row['mean_glucose'])[0][0] if not is_nan(row['mean_glucose']) else np.nan
    ), axis=1)
    rdata2['mean_glucose'] = rdata2.apply(lambda row: (
        row['mean_glucose'] if not is_nan(row['mean_glucose']) else (model2.intercept_ + model2.coef_ * row['std_glucose'])[0][0] if not is_nan(row['std_glucose']) else np.nan
    ), axis=1)
    rdata2['std_glucose'] = rdata2.apply(lambda row: (
        row['std_glucose'] if not is_nan(row['std_glucose']) else (model1.intercept_ + model1.coef_ * row['mean_glucose'])[0][0] if not is_nan(row['mean_glucose']) else np.nan
    ), axis=1)
    return data2, rdata2

In [18]:
"""
    Replaces outlying values in every numeric column the following way:
    --> Applies 95% percentil for values >90% percentil * 2 of the data set
    --> Applies 5% percentil for values <10% percentil * 0.5 of the data set
"""


def replace_outlying_0595(data, data2, rdata, rdata2, debug):
    for c in data2.columns:
        if type(data2.iloc[0][c]) == str or len(list(data2.drop_duplicates(c)[c])) <= 10:
            continue
        pct05 = np.nanpercentile(data2[c], 5)
        pct95 = np.nanpercentile(data2[c], 95)
        data2[c] = data2[c].apply(
            lambda d: pct95 if d > pct95 * 2 else pct05 if d < pct05 else d
        )
        rdata2[c] = rdata2[c].apply(
            lambda d: pct95 if d > pct95 * 2 else pct05 if d < pct05 else d
        )

    return data2, rdata2

# Invokers of the data transformations

In [19]:
"""
    Transforms the train data into a usable trainable form
"""


def fix_data(data, data2, rdata, rdata2, debug):

    data, data2 = rename_id_column(data, data2, debug)
    data, data2 = extract_medical_info(data, data2, debug)
    data, data2 = fix_date_of_birth(data, data2, debug)
    data, data2 = fix_non_numeric_age_values(data, data2, debug)
    data, data2 = fix_date_format(data, data2, debug)
    data, data2 = calc_missing_ages_from_date_of_birth(data, data2, debug)
    data, data2 = trim_leading_spaces(data, data2, debug)
    data, data2 = merge_duplicate_rows(data, data2, debug)
    data, data2 = fix_pregnant(data, data2, debug)
    
    rdata, rdata2 = rename_id_column(rdata, rdata2, debug)
    rdata, rdata2 = extract_medical_info(rdata, rdata2, debug)
    rdata, rdata2 = fix_date_of_birth(rdata, rdata2, debug)
    rdata, rdata2 = fix_non_numeric_age_values(rdata, rdata2, debug)
    rdata, rdata2 = fix_date_format(rdata, rdata2, debug)
    rdata, rdata2 = calc_missing_ages_from_date_of_birth(rdata, rdata2, debug)
    rdata, rdata2 = trim_leading_spaces(rdata, rdata2, debug)
    rdata, rdata2 = merge_duplicate_rows(rdata, rdata2, debug)
    rdata, rdata2 = fix_pregnant(rdata, rdata2, debug)
    
    data2, rdata2 = replace_missing_data_with_medians(data, replace_missing_data_with_regr(data, data2, rdata, rdata2, debug)[1], debug)
    data2, rdata2 = replace_outlying_0595(data, data2, rdata, rdata2, debug)
    return pd.merge(data, data2, on = 'name'), pd.merge(rdata, rdata2, on = 'name')

# Here starts the actual program

In [20]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import numpy as np
from sklearn.linear_model import LinearRegression
import json
import seaborn as sea

In [21]:
# Read train data
data = pd.read_csv("personal_train.csv")
data2 = pd.read_csv("other_train.csv")
rdata = pd.read_csv("personal_valid.csv")
rdata2 = pd.read_csv("other_valid.csv")

In [22]:
# Print train data before fix
print_data_stats("BEFORE FIX - TRAIN", data, data2)

===== Data =====
••► Unnamed: 0: min = 0, 5% = 196.6, 95% = 3735.4, max = 3932, mean = 1966, median = 1966

••► address (3933)

••► age: min = 3, 5% = 32, 95% = 70, max = 99, mean = 51.8638, median = 52

••► date_of_birth (3706)

••► name (3933)

••► sex (2): [' Male', ' Female']

==== Data 2 =====
••► Unnamed: 0: min = 0, 5% = 199.1, 95% = 3782.9, max = 3982, mean = 1991, median = 1991

••► address (3933)

••► capital-gain: min = 0, 5% = 0, 95% = 4697.95, max = 99999, mean = 1021.0453, median = 0

••► capital-loss: min = 0, 5% = 0, 95% = 0, max = 4356, mean = 89.8525, median = 0

••► class: min = 0, 5% = 0, 95% = 1, max = 1, mean = 0.2559, median = 0

••► education (17): [' Assoc-acdm', ' HS-grad', ' Some-college', ' Bachelors', ' Assoc-voc', ' 10th', ' Doctorate', ' 5th-6th', ' Masters', ' 12th', ' 9th', ' 11th', ' 7th-8th', nan, ' Prof-school', ' 1st-4th', ' Preschool']

••► education-num: min = 1, 5% = 5, 95% = 14, max = 16, mean = 10.1058, median = 10

••► fnlwgt: min = 19302, 5% 

In [23]:
# Print real data before fix
print_data_stats("BEFORE FIX - REAL", rdata, rdata2)

===== Data =====
••► Unnamed: 0: min = 0, 5% = 65.5, 95% = 1244.5, max = 1310, mean = 655, median = 655

••► address (1311)

••► age: min = 15, 5% = 32, 95% = 71, max = 98, mean = 51.9983, median = 52

••► date_of_birth (1289)

••► name (1311)

••► sex (2): [' Female', ' Male']

==== Data 2 =====
••► Unnamed: 0: min = 0, 5% = 68, 95% = 1292, max = 1360, mean = 680, median = 680

••► address (1311)

••► capital-gain: min = 0, 5% = 0, 95% = 4646.65, max = 99999, mean = 886.2723, median = 0

••► capital-loss: min = 0, 5% = 0, 95% = 0, max = 2547, mean = 81.8586, median = 0

••► class: min = 0, 5% = 0, 95% = 1, max = 1, mean = 0.2628, median = 0

••► education (17): [' HS-grad', ' Some-college', ' Bachelors', ' Masters', ' Prof-school', ' 12th', ' Assoc-acdm', ' 7th-8th', ' 9th', ' 10th', ' 11th', ' 5th-6th', ' Assoc-voc', ' Doctorate', nan, ' 1st-4th', ' Preschool']

••► education-num: min = 1, 5% = 6, 95% = 14, max = 16, mean = 10.1119, median = 10

••► fnlwgt: min = 20534, 5% = 41814.8,

In [24]:
data.head(50)

Unnamed: 0.1,Unnamed: 0,name,address,age,sex,date_of_birth
0,0,Roscoe Bohannon,"7183 Osborne Ways Apt. 651\r\nEast Andrew, OH ...",60,Male,1959-09-26
1,1,Ernest Kline,"391 Ball Road Suite 961\r\nFlowersborough, IN ...",52,Male,1966-10-28
2,2,Harold Hendriks,"8702 Vincent Square\r\nNew Jerryfurt, CO 30614",60,Male,1959-06-16
3,3,Randy Baptiste,"2751 Harris Crossroad\r\nWest Ashley, CA 30311",39,Male,1980-09-09
4,4,Anthony Colucci,"904 Robert Cliffs Suite 186\r\nWest Kyle, CO 7...",49,Male,1970-02-22
5,5,Ronald Lange,"30973 Martinez Shores\r\nJameston, CA 70245",46,Male,1973-09-25
6,6,Boyd Eiselein,29941 Christopher Curve Apt. 682\r\nRaychester...,54,Female,1964-10-31
7,7,Raymond Smith,53487 Scott Extensions Apt. 824\r\nMccartytown...,67,Male,1952-07-23
8,8,Harold Miller,"8514 Elizabeth Crescent\r\nWest Joseland, GA 4...",52,Male,1967-06-28
9,9,Charles Czachorowski,"6798 Wagner Locks Suite 377\r\nLake Brenda, DC...",30,Male,1988-11-07


In [25]:
data2.head(50)

Unnamed: 0.1,Unnamed: 0,name,address,race,marital-status,occupation,pregnant,education-num,relationship,skewness_glucose,...,education,fnlwgt,class,std_glucose,income,medical_info,native-country,hours-per-week,capital-loss,workclass
0,0,Mike Riley,023 Joseph Estate Suite 799\r\nLake Andrewvill...,White,Married-civ-spouse,Prof-specialty,f,12.0,Husband,1.60542,...,Assoc-acdm,251396.0,0.0,37.450318,>50K,"{'mean_oxygen':'10.03511706','std_oxygen':'35....",Canada,45.0,0.0,Local-gov
1,1,Earl Hoffmann,"700 Darlene Mill\r\nJackburgh, GA 99369",White,Divorced,Craft-repair,f,9.0,Unmarried,10.994893,...,HS-grad,157446.0,1.0,33.842927,<=50K,"{'mean_oxygen':'7.834448161','std_oxygen':'33....",United-States,45.0,0.0,Private
2,2,Lorenzo Mann,"56863 Stephen Island\r\nSouth Danielle, NV 40760",White,Divorced,Craft-repair,t,10.0,Not-in-family,-0.290893,...,Some-college,106014.0,0.0,51.123164,<=50K,"{'mean_oxygen':'3.505852843','std_oxygen':'19....",United-States,60.0,0.0,Private
3,3,Justin Trevino,"18446 Pace Junction\r\nNew Christyfurt, SD 32280",White,Never-married,Tech-support,f,13.0,Not-in-family,13.476089,...,Bachelors,189590.0,1.0,45.923142,<=50K,"{'mean_oxygen':'83.55351171','std_oxygen':'66....",United-States,40.0,0.0,Private
4,4,Thomas Davis,"67753 Wilson Ford\r\nNew Rachelport, NV 50148",White,Married-civ-spouse,Transport-moving,f,9.0,Husband,0.680234,...,HS-grad,134768.0,0.0,46.867134,>50K,"{'mean_oxygen':'2.602842809','std_oxygen':'18....",United-States,40.0,0.0,Private
5,5,Lynn Quist,"2232 Flores Ridge\r\nSanchezstad, IL 31455",White,Divorced,Transport-moving,f,9.0,Not-in-family,1.315007,...,HS-grad,143542.0,0.0,40.466252,<=50K,"{'mean_oxygen':'4.886287625','std_oxygen':'24....",United-States,40.0,0.0,Private
6,6,Roscoe Shelton,"6954 Carrillo Shoals Apt. 139\r\nSandersview, ...",White,Married-civ-spouse,Adm-clerical,f,13.0,Husband,2.731511,...,Bachelors,259840.0,0.0,35.750382,<=50K,"{'mean_oxygen':'1.705685619','std_oxygen':'17....",United-States,45.0,0.0,Private
7,7,Jose Aigner,"76756 Ashley Mount\r\nGomezchester, NV 49283",White,Divorced,Sales,f,13.0,Not-in-family,2.022478,...,Bachelors,364548.0,0.0,39.279398,>50K,"{'mean_oxygen':'2.117056856','std_oxygen':'15....",United-States,40.0,0.0,Private
8,8,Robert Gerke,"820 Mark Drives\r\nMichaelchester, OH 64396",White,Married-civ-spouse,Sales,F,10.0,Husband,0.169639,...,Some-college,114520.0,0.0,49.502052,<=50K,"{'mean_oxygen':'5.197324415','std_oxygen':'26....",United-States,40.0,0.0,Self-emp-not-inc
9,9,Andre Bennett,"25878 Hector Canyon\r\nJerryfurt, AZ 38098",White,Never-married,Prof-specialty,f,11.0,Not-in-family,0.743007,...,Assoc-voc,436770.0,0.0,40.479149,<=50K,"{'mean_oxygen':'0.766722408','std_oxygen':'9.5...",United-States,40.0,0.0,Private


In [26]:
# Fix data
data, rdata = fix_data(data, data2, rdata, rdata2, True)

Data set age: [2019, 12, 2]
[2019, 12, 2] - [1985, 11, 3] = [34, 1, -1] --> age = 34
[2019, 12, 2] - [1961, 4, 3] = [58, 8, -1] --> age = 58
[2019, 12, 2] - [1980, 7, 30] = [39, 5, -28] --> age = 39
[2019, 12, 2] - [1974, 11, 19] = [45, 1, -17] --> age = 45
[2019, 12, 2] - [1970, 8, 18] = [49, 4, -16] --> age = 49
[2019, 12, 2] - [1984, 9, 9] = [35, 3, -7] --> age = 35
[2019, 12, 2] - [1937, 8, 18] = [82, 4, -16] --> age = 82
[2019, 12, 2] - [1981, 10, 15] = [38, 2, -13] --> age = 38
[2019, 12, 2] - [1972, 12, 18] = [47, 0, -16] --> age = 46
[2019, 12, 2] - [1955, 11, 14] = [64, 1, -12] --> age = 64
[2019, 12, 2] - [1953, 9, 30] = [66, 3, -28] --> age = 66
[2019, 12, 2] - [1955, 8, 13] = [64, 4, -11] --> age = 64
[2019, 12, 2] - [1961, 5, 29] = [58, 7, -27] --> age = 58
[2019, 12, 2] - [1976, 7, 23] = [43, 5, -21] --> age = 43
[2019, 12, 2] - [1963, 3, 24] = [56, 9, -22] --> age = 56
[2019, 12, 2] - [1955, 5, 2] = [64, 7, 0] --> age = 64
[2019, 12, 2] - [1967, 3, 7] = [52, 9, -5] --> a

[2019, 12, 2] - [1978, 6, 26] = [41, 6, -24] --> age = 41
[2019, 12, 2] - [1986, 8, 27] = [33, 4, -25] --> age = 33
[2019, 12, 2] - [1970, 1, 16] = [49, 11, -14] --> age = 49
[2019, 12, 2] - [1985, 8, 24] = [34, 4, -22] --> age = 34
[2019, 12, 2] - [1955, 12, 30] = [64, 0, -28] --> age = 63
[2019, 12, 2] - [1969, 6, 24] = [50, 6, -22] --> age = 50
[2019, 12, 2] - [1967, 11, 13] = [52, 1, -11] --> age = 52
[2019, 12, 2] - [1958, 11, 27] = [61, 1, -25] --> age = 61
[2019, 12, 2] - [1968, 6, 8] = [51, 6, -6] --> age = 51
[2019, 12, 2] - [1969, 7, 29] = [50, 5, -27] --> age = 50
[2019, 12, 2] - [1966, 3, 6] = [53, 9, -4] --> age = 53
[2019, 12, 2] - [1972, 6, 13] = [47, 6, -11] --> age = 47
[2019, 12, 2] - [1998, 9, 29] = [21, 3, -27] --> age = 21
[2019, 12, 2] - [1973, 5, 21] = [46, 7, -19] --> age = 46
[2019, 12, 2] - [1959, 1, 29] = [60, 11, -27] --> age = 60
[2019, 12, 2] - [1971, 6, 17] = [48, 6, -15] --> age = 48
[2019, 12, 2] - [1979, 2, 20] = [40, 10, -18] --> age = 40
[2019, 12, 2

hours-per-week [40.0, 40.0] --> [40.0]
capital-loss [0.0, 0.0] --> [0.0]
workclass [nan, 'Private'] --> ['Private']
mean_oxygen [91.55183946, 91.55183946] --> [91.55183946]
std_oxygen [102.500925, 102.500925] --> [102.500925]
kurtosis_oxygen [0.330277214, 0.330277214] --> [0.330277214]
skewness_oxygen [-1.784581873, -1.784581873] --> [-1.784581873]

Merging 2 rows of name Michael Barnum...
name ['Michael Barnum', 'Michael Barnum'] --> ['Michael Barnum']
address ['9740 Joshua Roads\r\nVazquezmouth, TN 15386', '9740 Joshua Roads\r\nVazquezmouth, TN 15386'] --> ['9740 Joshua Roads\r\nVazquezmouth, TN 15386']
race ['White', nan] --> ['White']
marital-status ['Married-civ-spouse', 'Married-civ-spouse'] --> ['Married-civ-spouse']
occupation ['Exec-managerial', 'Exec-managerial'] --> ['Exec-managerial']
pregnant ['f', 'f'] --> ['f']
education-num [9.0, nan] --> [9.0]
relationship ['Husband', 'Husband'] --> ['Husband']
skewness_glucose [0.796409978, 0.796409978] --> [0.796409978]
mean_glucose 

name ['James Limerick', 'James Limerick'] --> ['James Limerick']
address ['78039 Velez Streets\r\nNew Wendytown, CO 99241', '78039 Velez Streets\r\nNew Wendytown, CO 99241'] --> ['78039 Velez Streets\r\nNew Wendytown, CO 99241']
race ['Black', 'Black'] --> ['Black']
marital-status ['Married-civ-spouse', 'Married-civ-spouse'] --> ['Married-civ-spouse']
occupation [nan, 'Craft-repair'] --> ['Craft-repair']
pregnant ['f', 'f'] --> ['f']
education-num [10.0, 10.0] --> [10.0]
relationship ['Husband', 'Husband'] --> ['Husband']
skewness_glucose [0.210647601, 0.210647601] --> [0.210647601]
mean_glucose [126.8828125, 126.8828125] --> [126.8828125]
capital-gain [0.0, nan] --> [0.0]
kurtosis_glucose [0.046673846, nan] --> [0.046673846]
education ['Some-college', 'Some-college'] --> ['Some-college']
fnlwgt [178383.0, 178383.0] --> [178383.0]
class [0.0, 0.0] --> [0.0]
std_glucose [48.68402926, 48.68402926] --> [48.68402926]
income ['<=50K', nan] --> ['<=50K']
native-country ['United-States', nan]


Merging 2 rows of name Allen Brickley...
name ['Allen Brickley', 'Allen Brickley'] --> ['Allen Brickley']
address ['58775 Thomas Mills Apt. 873\r\nLake Richardhaven, KY 78090', '58775 Thomas Mills Apt. 873\r\nLake Richardhaven, KY 78090'] --> ['58775 Thomas Mills Apt. 873\r\nLake Richardhaven, KY 78090']
race ['White', 'White'] --> ['White']
marital-status ['Married-civ-spouse', 'Married-civ-spouse'] --> ['Married-civ-spouse']
occupation ['Other-service', 'Other-service'] --> ['Other-service']
pregnant [nan, 'f'] --> ['f']
education-num [9.0, nan] --> [9.0]
relationship ['Husband', 'Husband'] --> ['Husband']
skewness_glucose [5.954116332000001, 5.954116332000001] --> [5.954116332000001]
mean_glucose [76.046875, 76.046875] --> [76.046875]
capital-gain [0.0, 0.0] --> [0.0]
kurtosis_glucose [1.690227856, 1.690227856] --> [1.690227856]
education [nan, 'HS-grad'] --> ['HS-grad']
fnlwgt [211075.0, 211075.0] --> [211075.0]
class [1.0, 1.0] --> [1.0]
std_glucose [nan, nan] --> [nan]
income ['

marital-status ['Never-married', 'Never-married'] --> ['Never-married']
occupation ['Other-service', 'Other-service'] --> ['Other-service']
pregnant ['f', 'f'] --> ['f']
education-num [11.0, 11.0] --> [11.0]
relationship ['Own-child', nan] --> ['Own-child']
skewness_glucose [0.213604545, 0.213604545] --> [0.213604545]
mean_glucose [94.8125, 94.8125] --> [94.8125]
capital-gain [0.0, 0.0] --> [0.0]
kurtosis_glucose [0.4985249470000001, nan] --> [0.4985249470000001]
education ['Assoc-voc', 'Assoc-voc'] --> ['Assoc-voc']
fnlwgt [170800.0, nan] --> [170800.0]
class [0.0, 0.0] --> [0.0]
std_glucose [nan, 50.88878038] --> [50.88878038]
income ['<=50K', '<=50K'] --> ['<=50K']
native-country ['United-States', 'United-States'] --> ['United-States']
hours-per-week [12.0, 12.0] --> [12.0]
capital-loss [0.0, nan] --> [0.0]
workclass ['Private', 'Private'] --> ['Private']
mean_oxygen [0.488294314, 0.488294314] --> [0.488294314]
std_oxygen [9.561140874, 9.561140874] --> [9.561140874]
kurtosis_oxygen 

skewness_oxygen [178.05562450000005, 178.05562450000005] --> [178.05562450000005]

Merging 2 rows of name Robert Michels...
name ['Robert Michels', 'Robert Michels'] --> ['Robert Michels']
address ['81903 Derrick Prairie Suite 019\r\nDonaldburgh, HI 14855', '81903 Derrick Prairie Suite 019\r\nDonaldburgh, HI 14855'] --> ['81903 Derrick Prairie Suite 019\r\nDonaldburgh, HI 14855']
race ['White', 'White'] --> ['White']
marital-status [nan, 'Never-married'] --> ['Never-married']
occupation ['Adm-clerical', 'Adm-clerical'] --> ['Adm-clerical']
pregnant [nan, 'f'] --> ['f']
education-num [nan, 11.0] --> [11.0]
relationship ['Not-in-family', 'Not-in-family'] --> ['Not-in-family']
skewness_glucose [3.891114308, 3.891114308] --> [3.891114308]
mean_glucose [68.90625, 68.90625] --> [68.90625]
capital-gain [0.0, 0.0] --> [0.0]
kurtosis_glucose [1.940449831, 1.940449831] --> [1.940449831]
education [nan, 'Assoc-voc'] --> ['Assoc-voc']
fnlwgt [216608.0, 216608.0] --> [216608.0]
class [1.0, 1.0] -->

skewness_oxygen [60.3107299, 60.3107299] --> [60.3107299]

Merging 2 rows of name Maurice Riley...
name ['Maurice Riley', 'Maurice Riley'] --> ['Maurice Riley']
address ['USS Allen\r\nFPO AP 87677', 'USS Allen\r\nFPO AP 87677'] --> ['USS Allen\r\nFPO AP 87677']
race ['White', 'White'] --> ['White']
marital-status ['Never-married', nan] --> ['Never-married']
occupation ['Adm-clerical', 'Adm-clerical'] --> ['Adm-clerical']
pregnant ['F', 'F'] --> ['F']
education-num [10.0, nan] --> [10.0]
relationship ['Not-in-family', 'Not-in-family'] --> ['Not-in-family']
skewness_glucose [-0.095817466, -0.095817466] --> [-0.095817466]
mean_glucose [136.6171875, 136.6171875] --> [136.6171875]
capital-gain [0.0, 0.0] --> [0.0]
kurtosis_glucose [0.120976311, 0.120976311] --> [0.120976311]
education [nan, nan] --> [nan]
fnlwgt [183945.0, 183945.0] --> [183945.0]
class [0.0, 0.0] --> [0.0]
std_glucose [44.2867293, 44.2867293] --> [44.2867293]
income ['<=50K', nan] --> ['<=50K']
native-country ['United-Stat

workclass ['Private', 'Private'] --> ['Private']
mean_oxygen [3.022575251, 3.022575251] --> [3.022575251]
std_oxygen [19.59541425, 19.59541425] --> [19.59541425]
kurtosis_oxygen [8.179942886000001, 8.179942886000001] --> [8.179942886000001]
skewness_oxygen [75.08747933, 75.08747933] --> [75.08747933]

Merging 2 rows of name Leon Ack...
name ['Leon Ack', 'Leon Ack'] --> ['Leon Ack']
address ['12348 Ford Bypass\r\nPort Zacharyville, RI 36085', '12348 Ford Bypass\r\nPort Zacharyville, RI 36085'] --> ['12348 Ford Bypass\r\nPort Zacharyville, RI 36085']
race ['White', 'White'] --> ['White']
marital-status ['Married-civ-spouse', 'Married-civ-spouse'] --> ['Married-civ-spouse']
occupation [nan, 'Exec-managerial'] --> ['Exec-managerial']
pregnant ['f', 'f'] --> ['f']
education-num [nan, 13.0] --> [13.0]
relationship ['Wife', 'Wife'] --> ['Wife']
skewness_glucose [32.24540555, 32.24540555] --> [32.24540555]
mean_glucose [29.6875, 29.6875] --> [29.6875]
capital-gain [0.0, 0.0] --> [0.0]
kurtosis

class [nan, 1.0] --> [1.0]
std_glucose [37.99430644, 37.99430644] --> [37.99430644]
income ['<=50K', '<=50K'] --> ['<=50K']
native-country ['United-States', 'United-States'] --> ['United-States']
hours-per-week [40.0, 40.0] --> [40.0]
capital-loss [0.0, 0.0] --> [0.0]
workclass [nan, 'Private'] --> ['Private']
mean_oxygen [34.35451505, 34.35451505] --> [34.35451505]
std_oxygen [59.02645009, 59.02645009] --> [59.02645009]
kurtosis_oxygen [1.850542127, 1.850542127] --> [1.850542127]
skewness_oxygen [2.812553835, 2.812553835] --> [2.812553835]
Applying changes...
Filling missing data with linear regression values...
Filling missing data with median values...
id --> 670.0
name --> Jorge Clark
address --> 038 Diaz Motorway Apt. 382
Grayberg, OK 51763
race --> White
marital-status --> Married-civ-spouse
occupation --> Craft-repair
pregnant --> 0.0
education-num --> 10.0
relationship --> Husband
skewness_glucose --> 0.40531565
mean_glucose --> 110.0546875
capital-gain --> 0.0
kurtosis_glucose

In [27]:
# Print train data after fix
print("========================== Train Data Statistics - AFTER FIX ======================")
print_col_values(data)

••► address_x (1311)

••► address_y (1311)

••► age: min = 12, 5% = 31, 95% = 70, max = 113, mean = 51.7007, median = 52

••► capital-gain: min = 0, 5% = 0, 95% = 4467.2025, max = 8614, mean = 377.5377, median = 0

••► capital-loss: min = 0, 5% = 0, 95% = 0, max = 0, mean = 0, median = 0

••► class: min = 0, 5% = 0, 95% = 1, max = 1, mean = 0.2643, median = 0

••► date_of_birth (1256)

••► education (17): ['Some-college', 'HS-grad', nan, 'Bachelors', '11th', 'Masters', 'Assoc-voc', '9th', '10th', 'Assoc-acdm', '12th', 'Prof-school', 'Doctorate', '1st-4th', '7th-8th', '5th-6th', 'Preschool']

••► education-num: min = 6, 5% = 6, 95% = 14, max = 16, mean = 10.2034, median = 10

••► fnlwgt: min = 41658.25, 5% = 41741.1625, 95% = 363341.1375, max = 556652, mean = 185901.5112, median = 176813.5

••► hours-per-week: min = 16, 5% = 16, 95% = 60, max = 99, mean = 40.4134, median = 40

••► id_x: min = 100, 5% = 165.6, 95% = 1344.4, max = 1410, mean = 755.1706, median = 756

••► id_y: min = 65.6,

In [28]:
# Real data after fix
print("========================== Train Data Statistics - AFTER FIX ======================")
print_col_values(rdata)

••► address_x (1263)

••► address_y (1263)

••► age: min = 14, 5% = 32, 95% = 71, max = 98, mean = 51.9042, median = 52

••► capital-gain: min = 0, 5% = 0, 95% = 4548.405, max = 8614, mean = 377.4552, median = 0

••► capital-loss: min = 0, 5% = 0, 95% = 0, max = 0, mean = 0, median = 0

••► class: min = 0, 5% = 0, 95% = 1, max = 1, mean = 0.2644, median = 0

••► date_of_birth (1213)

••► education (16): ['Some-college', 'HS-grad', 'Bachelors', '11th', 'Masters', 'Assoc-voc', '9th', '10th', 'Assoc-acdm', '12th', 'Prof-school', 'Doctorate', '1st-4th', '7th-8th', '5th-6th', 'Preschool']

••► education-num: min = 6, 5% = 6, 95% = 14, max = 16, mean = 10.209, median = 10

••► fnlwgt: min = 41658.25, 5% = 41811.9, 95% = 363595.275, max = 556652, mean = 186447.8519, median = 176711

••► hours-per-week: min = 16, 5% = 16, 95% = 60, max = 99, mean = 40.2985, median = 40

••► id_x: min = 0, 5% = 66.2, 95% = 1244.9, max = 1310, mean = 654.038, median = 652

••► id_y: min = 65.6, 5% = 66.1, 95% = 

In [29]:
data.head(50)

Unnamed: 0,id_x,name,address_x,age,sex,date_of_birth,id_y,address_y,race,marital-status,...,std_glucose,income,native-country,hours-per-week,capital-loss,workclass,mean_oxygen,std_oxygen,kurtosis_oxygen,skewness_oxygen
0,100,Philip Miller,"76348 Tran Harbor Apt. 760\r\nHarrisonton, GA ...",43,Male,1976-08-05,281.0,"7910 Rosales Plain Apt. 454\r\nPort Carl, GA 6...",White,Never-married,...,35.629364,<=50K,United-States,16.0,0.0,State-gov,3.311873,19.789627,7.709831,68.631028
1,101,Mitch Wilson,"309 James Hill Apt. 427\r\nPort Veronica, AR 6...",47,Male,1972-08-04,1092.0,055 Morgan Plains Suite 225\r\nEast Darrylmout...,White,Married-civ-spouse,...,34.900962,<=50K,United-States,40.0,0.0,?,1.799331,14.546599,10.377052,127.849582
2,102,James Olsen,"768 Brett Keys Suite 702\r\nSouth Tarashire, I...",22,Male,1997-02-23,180.0,"PSC 7359, Box 2088\r\nAPO AE 62717",White,Never-married,...,52.997015,<=50K,United-States,40.0,0.0,Private,6.313545,28.933747,4.957514,25.549865
3,103,Maurice Riley,"58487 Schneider Street\r\nGriffinfurt, RI 56689",41,Male,1978-04-03,90.0,USS Allen\r\nFPO AP 87677,White,Never-married,...,44.286729,<=50K,United-States,60.0,0.0,Private,3.30602,21.814243,7.619408,63.388103
4,104,Larry Stanley,"505 Mary Greens Suite 084\r\nNew Michael, CO 4...",49,Male,1969-12-24,410.0,"491 Fields Key Suite 544\r\nGracestad, MT 13456",White,Married-civ-spouse,...,51.490736,<=50K,United-States,40.0,0.0,State-gov,17.957358,54.002406,2.802044,6.229853
5,105,Stanley Orndorff,"836 Donna Vista Suite 550\r\nMikestad, NE 27413",53,Male,1966-05-22,282.0,"1989 Dylan Inlet Apt. 862\r\nCoreystad, MA 30654",White,Never-married,...,47.861009,>50K,United-States,45.0,0.0,Private,2.219064,13.922402,9.65565,123.287033
6,106,Mark Bierlein,"144 Gary Trail Suite 203\r\nNew Christopher, D...",38,Female,1980-11-20,679.0,Unit 7145 Box 9167\r\nDPO AA 86245,White,Married-civ-spouse,...,52.020039,<=50K,United-States,48.0,0.0,Private,3.0,17.994193,8.505253,85.741876
7,107,Matthew Spence,"6415 Martin Dale Apt. 748\r\nBrandonhaven, NH ...",46,Male,1972-12-18,178.0,741 Alvarez Village Suite 345\r\nNew Stephanie...,White,Widowed,...,41.638119,<=50K,United-States,40.0,0.0,Private,40.301003,71.851596,1.547338,0.974635
8,108,Willie Williams,"458 Moreno Unions Suite 381\r\nSouth Pamela, M...",43,Male,1976-09-30,442.0,"149 Parker Tunnel\r\nRyanside, VA 94814",White,Never-married,...,37.858543,<=50K,United-States,40.0,0.0,Private,1.087625,12.929366,15.195374,254.685045
9,109,Curtis Johnson,"6711 Cristina River Apt. 482\r\nKimberlyfurt, ...",50,Male,1969-07-12,111.0,08244 Burton Junctions Suite 409\r\nScottmouth...,White,Never-married,...,50.446403,<=50K,United-States,40.0,0.0,Private,4.761706,24.852493,6.614463,48.131944


# Documentation

## Integrácia dát a prípadná deduplikácia záznamov

Tabuľky sme spájali podľa atribútu `name`. Spojili sme tabuľky `data`, ktorá obsahuje upravené data a tabuľku `data2reg0595`, ktorá obsahuje funkciami opravené dáta z tabuľky `data2`
Spájanie tabuliek robíme až po spracovaní pôvodných tabuliek a oprave dát

## Realizácia krokov predspracovania dát a ich zdokumentovanie

Na odstránenie vychýlených hodnôt sme použili metódy **nahradenie vychýlenej hodnoty hraničnými hodnotami rozdelenia** a **transformácia atribútu s vychýlenými hodnotami pomocou zvolenej funkcie** - konkrétne pomocou logaritmu.

Pomocou percentilu sme upravovali väčšinu číselných hodnôt pretože obsahovali aj záporné čisla.
Odchylku v atribúte `mean_oxygen` sme opravili pomocou logaritmu pretože všetky jeho hodnoty boli kladné

Na riešenie chýbajúcich hodnôt sme použili metódy **nahradenie chýbajúcej hodnoty mediánom** a **nahradenie chýbajúcej hodnoty pomocou lineárnej regresie**

Nahradenie hodnôt regresiou sme použili v prípade hodnôt `mean_glucose` a `std_glucose` kde vidno že v priemere je `mean_glucose` násobok hodnoty `std_glucose`

Na zvyšné hodnoty sme použili dopočítavanie mediánom.

## Opätovná realizácia podstatných častí prieskumnej analýzy

**Všeobecné**
- Niektoré stĺpce obsahovali medzeru pred hodnotou. Tieto medzeri sme odstránili

- Spájanie riadkov v prípade duplicitných záznamov: pokiaľ záznam obsahuje “nan” alebo “?”, doplní sa z iného záznamu. Pokiaľ oba záznamy obsahujú hodnotu v stĺpci, numerické hodnoty zapíšu priemer hodnôt a nenumerické hodnoty vyberú prvú z možností 

**Date_of_birth**
- Dátum je spracovaný do formátu yyyy-mm-dd. Boli z neho odstránené časy narodenia. „/“ bola nahradená „-“. Rok narodenia zapísaný v tvare napr. 68 (1968) bol podľa veku upravený na požadovaný formát.

**Age**
- Z upravených dátumov narodenia sa dopočítal chýbajúci vek

**Sex**
- Zjednotené na "Male" "Female" 

**Pregnant**
- Hodnoty boli zjednotené na „True“ a „False“

**Education_num**
- `nan` hodnoty boli nahradené mediánom

**Income**
- Zjednotené na 2 hodnoty “>50K“ a „<=50K” pomocou mediánu

**Class**
- Hodnoty sme zmenili na integer a obsahujú len hodnoty “1” a “0”

**Std_glucose**
- Chýbajúce hodnoty boli dopočítané priemerom. Odľahlé hodnoty sme nahradili hraničnou hodnotou rozdelenia  - 95 percentilom a 5 percentilom

**Skewness_oxygen**
- Chýbajúce hodnoty boli dopočítané priemerom. Odľahlé hodnoty sme nahradili hraničnou hodnotou rozdelenia  - 95 percentilom

**mean_oxygen**
- Chýbajúce hodnoty boli dopočítané priemerom. Odľahlé hodnoty sme nahradili logaritmom hodnôt.

**Skewness_glucose**
- Chýbajúce hodnoty boli dopočítané priemerom. Odľahlé hodnoty sme nahradili hraničnou hodnotou rozdelenia  - 95 percentilom

**Std_oxygen, mean_glucose, kurtosis_glucose, kurtosis_oxygen**
- Chýbajúce hodnoty boli nahradené priemerom

**Capital_loss, capital_gain, fnlwgt, hour-per-week**
- Chýbajúce hodnoty boli nahradené mediánom

**Education, marital_status, native-country, occupation, race, relationship, workclass**
- Chýbajúce hodnoty boli doplnené najčastejšie sa vyskytujúcim prvkom


## Znovupoužiteľnosť predspracovania

Keďže je naša oprava dát písaná vo funkciach je dobre použiteľná aj na ďalšie datasety.

- Funkcia `fix_date_format` opraví akýkoľvek formát dátumu do nášho požadovaného formátu.

- Funkcia `calc_missing_ages_from_date_of_birth` dopočíta chýbajúce údaje o veku pomocou dátumu narodenia.

- `merge_duplicate_rows` je funkcia, ktorá nájde duplicitné záznamy podľa mena pacienta a spojí ich do jedného záznamu.  Z číselných hodnôt sa vypočíta priemer bez nulových hodnôt a z nečíselných hodnôt sa zapíše prvá vyskitnutá hodnota.

- `fix_pregnant` zmení hodnoty stĺpca **pregnant** na jednoznačné `nan`, `0` a `1` počiatočné hodnoty.

- `replace_missing_data_with_medians` pre čiselné atribúty doplní medián atribútu a pre nečiselné hodnoty doplní prvok s najčastejším výskytom v datasete

- `replace_missing_data_with_regr` slúži na doplnenie chýbajúcich dát pomocou lineárnej regresie

Tieto funkcie sú aplikovatelné na akýkoľvek dataset a mali by byť schopné opraviť / doplniť hodnoty, pre ktoré sú navrhnuté:

- `replace_outlying_log` opravuje vychýlené hodnoty pomocou logaritmu
- `replace_outlying_0595` opravuje vychýlené hodnoty pomocou percentilu

Na opravu vychýlených hodnôt sa bude musieť dávat pozor, ktorá funkcia sa používa.

Oprava pomocou percentilu je použitelná kedykoľvek, ale oprava pomocou logaritmu by sa teoreticky dalo len použiť na dáta, ktoré obsahujú len kladné hodnoty čísla. Na riešenie tohto problému urobíme logaritmus z absolútnych hodnôt dát.