# IAU - Project

**Authors:** Peter Mačinec, Lukáš Janík

## Setup and import libraries

In [118]:
# import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

# pipelines
from sklearn.pipeline import Pipeline
from sklearn.base import TransformerMixin

# models
from sklearn.linear_model import LinearRegression
from sklearn import model_selection as ms
from sklearn import metrics
from functools import reduce

## Read the data

The data are split into two files, personal and other, so we need to read both:

In [119]:
# read datasets
df1 = pd.read_csv('data/personal_train.csv', index_col=0)
df2 = pd.read_csv('data/other_train.csv', index_col=0)

## Preprocessing

### Merge datasets

First, we need to merge the two datasets into one. The previous analysis showed that name and address can serve as the merge key:

In [120]:
df_train = pd.merge(df1, df2, on=["name", "address"])
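
To make the effect of this join concrete, here is a minimal sketch on toy data. The two frames and all their values are invented for illustration; `validate` is a real `pd.merge` parameter, but the notebook itself does not use it.

```python
import pandas as pd

# Toy stand-ins for the personal and medical files (all values are invented).
personal = pd.DataFrame({
    "name": ["Ann Lee", "Bob Ray"],
    "address": ["1 Main St", "2 Oak Ave"],
    "age": [34, 51],
})
medical = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Ray"],
    "address": ["1 Main St", "1 Main St", "2 Oak Ave"],
    "TT4": [82.0, None, 95.0],
})

# Inner join on the composite key. validate="one_to_many" raises an error
# if the left frame contains the same (name, address) pair more than once.
merged = pd.merge(personal, medical, on=["name", "address"], validate="one_to_many")
print(merged.shape)
```

Because the medical frame holds two rows for the same person, the merged frame also holds two rows for that person, which is how duplicated records can end up in the merged data.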

The descriptive analysis revealed duplicates in the second dataset (medical information), so we will merge their values and drop the duplicated rows.

### Data repairing

We know from the previous analysis that some data need to be repaired: some columns represent a single value with several different strings, others hold several values that need to be expanded into separate columns, and so on. In this section, the data are repaired first so that missing values can be filled in the next step.

All operations are implemented as **Pipeline** steps, so the whole preprocessing process is reusable.
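
As a rough illustration of how custom steps chain inside a scikit-learn `Pipeline`, the sketch below defines two small transformers and applies them in order. `LowercaseColumn`, `StripColumn`, and the toy data are hypothetical names and values, not part of the project. The sketch also inherits from `BaseEstimator`, which is an assumption made here for compatibility with recent scikit-learn versions.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class LowercaseColumn(BaseEstimator, TransformerMixin):
    """Hypothetical step: lower-case one string column."""

    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None):
        return self  # nothing to learn from the data

    def transform(self, df):
        df_copy = df.copy()
        df_copy[self.column] = df_copy[self.column].str.lower()
        return df_copy


class StripColumn(BaseEstimator, TransformerMixin):
    """Hypothetical step: strip surrounding whitespace in one string column."""

    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None):
        return self

    def transform(self, df):
        df_copy = df.copy()
        df_copy[self.column] = df_copy[self.column].str.strip()
        return df_copy


toy = pd.DataFrame({"sex": [" M", "f ", "F"]})

# Each step receives the output of the previous one.
pipeline = Pipeline([
    ("lower", LowercaseColumn("sex")),
    ("strip", StripColumn("sex")),
])
result = pipeline.fit_transform(toy)
```

Each transformer returns a modified copy, so the input frame stays untouched and the same fitted pipeline can later be applied to the test data.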

#### Merge and drop duplicates

As mentioned before, there are some duplicates. Let's check them:

In [121]:
duplicates = df_train[df_train.duplicated(['name', 'address'], keep='first')].sort_values('name')

In [122]:
duplicates.head()

Unnamed: 0,name,address,age,sex,date_of_birth,query hyperthyroid,T4U measured,FTI measured,lithium,TT4,...,personal_info,T3 measured,on antithyroid medication,referral source,education-num,psych,occupation,TBG measured,TBG,pregnant
1656,Alfred Still,"4175 Smith Keys\r\nNew Taylor, NH 39815",57.0,M,1960-11-02,f,t,t,f,82.0,...,,f,f,other,13.0,f,Prof-specialty,f,?,f
855,Amelia Rodriguez,"087 Gary Port\r\nWest Sarah, KY 66896",77.0,F,1941-03-17,f,,t,f,84.0,...,White|United-States\r\nBachelors -- Widowed|Un...,t,f,SVI,,f,Sales,f,?,f
904,Angela Boyer,"3750 Chen Groves\r\nPamelatown, ME 02894",75.0,F,1942-12-28,,,t,f,92.0,...,White|United-States\r\nHS-grad -- Divorced|Own...,t,f,SVI,9.0,f,Priv-house-serv,f,?,f
1597,Anna Garcia,"71052 Annette Roads\r\nChristinechester, MT 16249",65.0,F,1953-05-06,f,f,f,f,,...,White|United-States\r\nHS-grad -- Never-marrie...,f,f,,9.0,f,Handlers-cleaners,f,?,
2204,Annette Hunt,USNV Lamb\r\nFPO AA 85130,33.0,F,1984-12-08,f,f,f,f,,...,White|United-States\r\nSome-college -- Married...,f,f,,10.0,f,Adm-clerical,f,?,f


We can see duplicates with the same name and address that do not even represent different medical records (the measurements are the same). For some attributes, one duplicate holds a value while the other is missing it. That means we need to merge those records before dropping duplicates.

In [123]:
class MergeRemoveDuplicates(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, df, y=None, **fit_params):
        return self

    def merge(self, duplicates):
        # keep the first value that is neither NaN nor the '?' placeholder
        return reduce(lambda x, y: x if not (pd.isna(x) or str(x).startswith('?')) else y, duplicates)

    def transform(self, df, **transform_params):
        # flag every row whose (name, address) pair occurs more than once
        is_duplicated = df.duplicated(['name', 'address'], keep=False)
        merged = df[is_duplicated].groupby(['name', 'address']).agg(self.merge).reset_index()

        # DataFrame.append was removed in pandas 2.0, so concatenate instead
        return pd.concat([df[~is_duplicated], merged])

**Note:** This class will be used for preprocessing in **Pipelines**.
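
The merging idea can be tried in isolation on a toy frame. This is only a sketch: the data are invented, `first_known` is a hypothetical helper name, and it assumes the intent is to keep the first value in each group that is neither NaN nor the `'?'` placeholder.

```python
from functools import reduce

import numpy as np
import pandas as pd


def first_known(values):
    """Return the first value that is neither NaN nor the '?' placeholder."""
    return reduce(
        lambda x, y: x if not (pd.isna(x) or str(x).startswith("?")) else y,
        values,
    )


# Two complementary records for the same person plus one unique record.
toy = pd.DataFrame({
    "name": ["Ann Lee", "Ann Lee", "Bob Ray"],
    "address": ["1 Main St", "1 Main St", "2 Oak Ave"],
    "TT4": [np.nan, 84.0, 95.0],
    "sex": ["F", "?", "M"],
})

# One row per (name, address); every column is collapsed with first_known.
merged = toy.groupby(["name", "address"]).agg(first_known).reset_index()
print(merged)
```

If every value in a group is missing, the reduction simply returns the last missing value, so the gap is left for the later filling step.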

#### Drop rows with missing values in predicted attribute

Rows where the predicted attribute itself is missing will not help a classifier in *supervised learning*, so they will be dropped. Let's check the records with missing values in the **class** attribute:

In [124]:
df_train[df_train['class'].isnull()][['name', 'class']]

Unnamed: 0,name,class
362,Frank Gerace,
575,Carol Crum,
1321,Cynthia Schmidtke,
1519,Don Carroll,
1675,Shirley Kiser,
1771,Lila Womack,
1840,Jane Little,


To make this operation reusable, it is better to write a custom transformer that takes the column as a parameter and drops every row with a missing value in that column.

In [125]:
class DropRowsNanColumn(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None, **fit_params):
        return self

    def transform(self, df, **transform_params):
        df = df[pd.notnull(df[self.column])]
        return df
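
For comparison, the same row filter can be written with `DataFrame.dropna`. The toy frame and its class labels below are invented; the sketch only checks that both spellings keep the same rows.

```python
import numpy as np
import pandas as pd

# Invented records; the second one has no label.
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "class": ["negative", np.nan, "positive"],
})

# Boolean-mask form, as used in the transformer above.
by_mask = toy[pd.notnull(toy["class"])]

# Equivalent built-in form.
by_dropna = toy.dropna(subset=["class"])
```

Both variants return a new frame and keep the original index labels of the surviving rows.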

#### Missing values unifying

In some columns, missing values appear either as *NaN* or as the *'?'* character. These representations need to be unified so that a single, universal pipeline step can fill them later.

In [126]:
class NanUnifier(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None, **fit_params):
        return self

    def transform(self, df, **transform_params):
        df_copy = df.copy()
        # np.NaN was removed in NumPy 2.0; np.nan works in every version
        df_copy.loc[df_copy[self.column].str.strip() == '?', self.column] = np.nan
        return df_copy
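
A short sketch of the unification on a toy column follows. The values are invented, and the final conversion to numbers is included only to show why the placeholder has to go first.

```python
import numpy as np
import pandas as pd

# Invented FTI-like column: numbers stored as strings, '?' placeholders, one real NaN.
toy = pd.DataFrame({"FTI": ["101", " ?", "?", np.nan, "87"]})

# True only for cells that are '?' after stripping whitespace; NaN compares as False.
is_placeholder = toy["FTI"].str.strip() == "?"

cleaned = toy.copy()
cleaned.loc[is_placeholder, "FTI"] = np.nan

# With the placeholders gone, the column converts to a float column.
cleaned["FTI"] = pd.to_numeric(cleaned["FTI"])
```

After this step the column has one representation for "missing", so a single filling strategy can handle it.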

#### Boolean unifying

Many columns that store boolean values, mostly flags for whether a measurement was done, represent them inconsistently (t, t.19, ...). It is better to unify them: in a categorical attribute, every meaningful value should have exactly one representation.

In [127]:
class BooleanUnifier(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None, **fit_params):
        return self

    def transform(self, df, **transform_params):
        df_copy = df.copy()
        df_copy[self.column] = df_copy[self.column].map(lambda x: str(x).lower().startswith('t'), na_action='ignore')
        return df_copy
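
The mapping rule can be checked on a handful of raw values. The series below is a made-up sample in the style of the values seen in the data (`t`, `t.19`, `f.14`).

```python
import numpy as np
import pandas as pd

raw = pd.Series(["t", "f", "t.19", "f.14", "T", np.nan])

# Anything starting with 't' (case-insensitive) becomes True, everything else False.
# na_action="ignore" leaves missing values untouched instead of turning them into False.
unified = raw.map(lambda x: str(x).lower().startswith("t"), na_action="ignore")
```

Note that this rule maps every non-missing value that does not start with `t` to `False`, so an unexpected string would silently count as false.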

#### Drop useless columns

Some columns will not help us predict the patient's class because they store only one value, such as the *TBG measured* attribute:

In [128]:
df_train['TBG measured'].unique()

array(['f', nan, 'f.14'], dtype=object)

After unifying boolean values, it will contain only the *false* value. It is better to drop columns like this one.

In [129]:
class DropColumn(TransformerMixin):
    def __init__(self, column):
        self.column = column
        
    def fit(self, df, y=None, **fit_params):
        return self
    
    def transform(self, df, **transform_params):
        df = df.drop([self.column], axis=1)
        return df
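
Constant columns could also be detected automatically rather than by inspection. The sketch below is one possible approach on invented data, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Invented frame: the first column holds a single distinct value once NaN is ignored.
toy = pd.DataFrame({
    "TBG measured": [False, False, np.nan, False],
    "T3 measured": [True, False, True, np.nan],
})

# nunique() ignores NaN by default, so a column with at most one
# distinct value carries no information for prediction.
constant_columns = [c for c in toy.columns if toy[c].nunique() <= 1]
reduced = toy.drop(columns=constant_columns)
```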

#### Expanding columns

The analysis also found a column that stores several attributes and their values in one string. Machine learning algorithms cannot work with this format, even though it may hold information that is important for prediction. We need to expand it into standalone columns.

In [130]:
class ColumnExpander(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, df, y=None, **fit_params):
        return self

    def transform(self, df, **transform_params):
        df['bred'] = df['personal_info'].str.extract('(^[^|]+)', expand=False).str.strip().str.lower()
        df['origin'] = df['personal_info'].str.extract('[|](.*)\r', expand=False).str.strip().str.lower()
        df['study'] = df['personal_info'].str.extract('[\n](.*)--', expand=False).str.strip().str.lower()
        df['status1'] = df['personal_info'].str.extract('--(.*)[|]', expand=False).str.strip().str.lower()
        df['status2'] = df['personal_info'].str.extract('--.*[|](.*)', expand=False).str.strip().str.lower()
        return df
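
To see what each regular expression captures, the patterns can be run on a single string. The value below is hypothetical: it only imitates the layout visible in the `personal_info` preview (two fields separated by `|`, a `\r\n` line break, then a field, ` -- `, and two more fields separated by `|`).

```python
import pandas as pd

# Hypothetical value imitating the layout of personal_info.
info = pd.Series(["White|United-States\r\nHS-grad -- Divorced|Not-in-family"])


def extract(pattern):
    """Apply one pattern and normalise the captured text."""
    return info.str.extract(pattern, expand=False).str.strip().str.lower()


bred = extract(r"(^[^|]+)")        # text before the first '|'
origin = extract(r"[|](.*)\r")     # text between the first '|' and the carriage return
study = extract(r"[\n](.*)--")     # text between the line feed and '--'
status1 = extract(r"--(.*)[|]")    # text between '--' and the last '|'
status2 = extract(r"--.*[|](.*)")  # text after the last '|'
```

Raw strings are used here so that `\r` and `\n` are interpreted by the regex engine. The plain strings in the class above match the same characters.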

#### Columns data type transformations

Some columns contain numbers but are stored as strings. We need to convert these attributes to a numerical type so that algorithms treat them as numbers, not categories.

In [111]:
class ColumnToNumber(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None, **fit_params):
        return self

    def transform(self, df, **transform_params):
        df[self.column] = pd.to_numeric(df[self.column])
        return df
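
`pd.to_numeric` raises an error on strings it cannot parse, which is why the `'?'` placeholders must be unified before this step runs. The sketch below shows the strict default next to `errors="coerce"`, using invented values.

```python
import pandas as pd

fti = pd.Series(["101", "87.5", "?", None])

# Strict conversion (the default) fails on the '?' placeholder.
strict_failed = False
try:
    pd.to_numeric(fti)
except (ValueError, TypeError):
    strict_failed = True

# errors="coerce" turns unparseable entries into NaN instead of raising.
lenient = pd.to_numeric(fti, errors="coerce")
```

Coercion is convenient but can hide unexpected garbage, so the explicit unification step remains the more transparent option.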

#### Repair data using Pipeline

A Pipeline is used for data repair. All custom transformer classes have already been defined in the sections of **data repairing** above, so we can just use them now:

In [112]:
repair_ppl = Pipeline([
                # unify boolean values
                ('ub01', BooleanUnifier('query hyperthyroid')),
                ('ub02', BooleanUnifier('T4U measured')),
                ('ub03', BooleanUnifier('on thyroxine')),
                ('ub04', BooleanUnifier('FTI measured')),
                ('ub05', BooleanUnifier('lithium')),
                ('ub06', BooleanUnifier('TT4 measured')),
                ('ub07', BooleanUnifier('query hypothyroid')),
                ('ub08', BooleanUnifier('query on thyroxine')),
                ('ub09', BooleanUnifier('tumor')),
                ('ub10', BooleanUnifier('T3 measured')),
                ('ub11', BooleanUnifier('sick')),
                ('ub12', BooleanUnifier('thyroid surgery')),
                ('ub13', BooleanUnifier('I131 treatment')),
                ('ub14', BooleanUnifier('goitre')),
                ('ub15', BooleanUnifier('TSH measured')),
                ('ub16', BooleanUnifier('on antithyroid medication')),
                ('ub17', BooleanUnifier('psych')),
                ('ub18', BooleanUnifier('TBG measured')),
                ('ub19', BooleanUnifier('pregnant')),
                ('ub20', BooleanUnifier('hypopituitary')),

                # drop column
                ('drop_TBG_measured', DropColumn('TBG measured')),
                ('drop_TBG', DropColumn('TBG')),

                # expand column
                ('expand_personal_info', ColumnExpander()),
    
                # unify nan values
                ('nan_unify_FTI', NanUnifier('FTI')),
                ('nan_unify_sex', NanUnifier('sex')),
                ('nan_unify_origin', NanUnifier('origin')),
                ('nan_unify_occupation', NanUnifier('occupation')),
        
                # transform data type
                ('column_to_number_FTI', ColumnToNumber('FTI')),

                # drop, where are nan values
                ('drop_class', DropRowsNanColumn('class')),
    
                # merge and remove duplicates
                ('test',MergeRemoveDuplicates())
              ])

In [113]:
model = repair_ppl.fit(df_train)

In [114]:
transformed = model.transform(df_train)

### Normalize and remove outliers

Outliers were found in many columns, mostly those storing measurement values. For some algorithms, it is better to remove them or replace them with quantile values. Before doing this, the values should be normalized.

In [0]:
from scipy.stats import boxcox

#### Normalize numerical attributes 

In [0]:
class Normalizer(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None, **fit_params):
        _, self.lmbda = boxcox(df[self.column]+2)
        return self

    def transform(self, df, **transform_params):
        df_copy = df.copy()
        # reuse the lambda estimated in fit() so every dataset gets the same transformation
        df_copy[self.column] = boxcox(df_copy[self.column]+2, lmbda=self.lmbda)
        return df_copy
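
For reference, the Box-Cox transformation used by `Normalizer` can be written as follows, with the shift of 2 taken from the code. When no `lmbda` is passed, `scipy.stats.boxcox` estimates $\lambda$ by maximum likelihood and returns it together with the transformed data.

```latex
y_i^{(\lambda)} =
\begin{cases}
\dfrac{(x_i + 2)^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\ln(x_i + 2), & \lambda = 0,
\end{cases}
\qquad x_i + 2 > 0 .
```

The transformation is defined only for strictly positive input, so the shift works only for columns whose values are all greater than $-2$.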

#### Remove outliers

In [0]:
class OutliersRemover(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None, **fit_params):
        self.quantile_05 = df[self.column].quantile(.05)
        self.quantile_95 = df[self.column].quantile(.95)
        return self

    def transform(self, df, **transform_params):
        df_copy = df.copy()
        # replace values below the 5th percentile and above the 95th percentile
        df_copy.loc[df_copy[self.column] < self.quantile_05, self.column] = self.quantile_05
        df_copy.loc[df_copy[self.column] > self.quantile_95, self.column] = self.quantile_95
        return df_copy
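
The same quantile replacement can be expressed with `Series.clip`. The sketch below uses invented numbers and computes the bounds on a "training" series before applying them to new values, mirroring the fit/transform split.

```python
import pandas as pd

# Invented training values with one extreme outlier.
train = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0])

# Bounds are learned from the training data only.
low = train.quantile(0.05)
high = train.quantile(0.95)

# New values are clipped to the learned bounds.
new = pd.Series([-50.0, 5.0, 500.0])
clipped = new.clip(lower=low, upper=high)
```

Learning the bounds once and reusing them keeps the treatment of training and test data consistent.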

In [0]:
normalize_ppl = Pipeline([
                    ('nieco', Normalizer('column')),
                    ('nieco2', OutliersRemover('column'))
              ])

In [0]:
model = normalize_ppl.fit(df_train)

In [0]:
transformed = model.transform(df_train)

### Filling missing values

Our dataset also contains missing values (NaN) that should be filled before the data are used in a machine learning algorithm. This applies to both numerical and categorical attributes.

#### Fill numerical with median

TODO: describe the imputer

In [0]:
class NumMedianFiller(TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, df, y=None, **fit_params):
        self.median = df[self.column].median()
        return self

    def transform(self, df, **transform_params):
        df_copy = df.copy()
        df_copy.loc[df_copy[self.column].isnull(), self.column] = self.median
        return df_copy
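
The median fill can also be written with `fillna()`. In this sketch the two frames and their values are invented. The point is that the median is computed on the training data and then reused unchanged for other data.

```python
import numpy as np
import pandas as pd

train = pd.DataFrame({"T4U": [0.9, 1.1, np.nan, 1.3]})
test = pd.DataFrame({"T4U": [np.nan, 2.0]})

# median() skips NaN, so it is computed from the three known training values.
median = train["T4U"].median()

# The same training median fills the gaps in both frames.
train_filled = train.fillna({"T4U": median})
test_filled = test.fillna({"T4U": median})
```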

#### Fill numerical with Linear Regression algorithm

In [0]:
class NumModelFiller(TransformerMixin):
    def __init__(self, model):
        self.model = model

    def fit(self, df, y=None, **fit_params):
        self.model.fit(df)
        return self

    def transform(self, df, **transform_params):
        df_copy = df.copy()
        # TODO: write self.model.predict(...) into the rows with missing values
        return df_copy
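
One way regression-based filling could work is sketched below. The class above does not say which column is the target or which columns are predictors, so `FTI` as target and `TT4` as the only predictor are assumptions chosen for illustration, and the values are invented.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data in which FTI equals TT4, so the expected fill values are obvious.
toy = pd.DataFrame({
    "TT4": [80.0, 100.0, 120.0, 140.0, 110.0],
    "FTI": [80.0, 100.0, 120.0, np.nan, np.nan],
})

target, features = "FTI", ["TT4"]  # assumed roles, not taken from the notebook

# Train only on rows where both the target and the predictors are known.
known = toy.dropna(subset=[target] + features)
model = LinearRegression().fit(known[features], known[target])

# Predict only rows where the target is missing and the predictors are complete.
missing = toy[target].isnull() & toy[features].notnull().all(axis=1)
filled = toy.copy()
filled.loc[missing, target] = model.predict(filled.loc[missing, features])
```

Rows whose predictors are themselves missing stay unfilled in this sketch and would need a fallback such as the median.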

#### Fill categorical with most frequent values

TODO: describe the imputer

In [0]:
class CategoricalMostFrequentFiller(TransformerMixin):
    def __init__(self, column):
        self.column = column
        
    def fit(self, df, y=None, **fit_params):
        self.most_frequent = df[self.column].value_counts().index[0]
        return self
    
    def transform(self, df, **transform_params):
        df_copy = df.copy()
        df_copy.loc[df_copy[self.column].isnull(), self.column] = self.most_frequent
        return df_copy
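
A compact version of the most-frequent fill on an invented column:

```python
import numpy as np
import pandas as pd

sex = pd.Series(["F", "M", "F", np.nan, "F", np.nan])

# value_counts() excludes NaN and sorts by frequency, so index[0] is the mode.
most_frequent = sex.value_counts().index[0]
filled = sex.fillna(most_frequent)
```

When two categories are tied, which one ends up first is not something to rely on, so a tie may deserve an explicit rule.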

#### Fill categorical with k-NN (k-nearest neighbours) algorithms

In [0]:
class CategoricalModelFiller(TransformerMixin):
    def __init__(self):
        pass

    def fit(self, df, y=None, **fit_params):
        return self

    def transform(self, df, **transform_params):
        # TODO: not implemented yet; return the data unchanged so that a pipeline does not break
        return df
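
Since the class above is still empty, the following is only a sketch of how k-NN filling of a categorical column might look. The choice of `occupation` as target, of `age` and `hours-per-week` as predictors, and of `n_neighbors=3` are all assumptions, and the data are invented.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Invented data: two well-separated groups and two rows with unknown occupation.
toy = pd.DataFrame({
    "age": [20.0, 22.0, 21.0, 60.0, 62.0, 61.0, 23.0, 63.0],
    "hours-per-week": [40.0, 38.0, 42.0, 20.0, 18.0, 22.0, 41.0, 19.0],
    "occupation": ["Sales", "Sales", "Sales",
                   "Adm-clerical", "Adm-clerical", "Adm-clerical",
                   np.nan, np.nan],
})

target, features = "occupation", ["age", "hours-per-week"]  # assumed roles

# Fit the classifier on rows with a known category.
known = toy[toy[target].notnull()]
knn = KNeighborsClassifier(n_neighbors=3).fit(known[features], known[target])

# Assign each missing category by majority vote of the 3 nearest known rows.
missing = toy[target].isnull()
filled = toy.copy()
filled.loc[missing, target] = knn.predict(filled.loc[missing, features])
```

k-NN relies on distances, so the predictors must be complete and should be on comparable scales before a step like this is used on real data.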

In [0]:
# fill_ppl = Pipeline([
#                 ('nieco', Imputer(strategy='median')),
#                 ('nieco2', Imputer(strategy='most_frequent'))
#     ])
fill_ppl = Pipeline([
                ('nieco', NumMedianFiller('T4U')),
                ('nieco2', CategoricalMostFrequentFiller('sex'))
            ])

In [0]:
transformed.sex.isnull().sum()

In [0]:
model = fill_ppl.fit(transformed)

In [0]:
transf2 = model.transform(transformed)

In [0]:
fill_model_ppl = Pipeline([
                    ('nieco', NumModelFiller('column')),
                    ('nieco2', CategoricalModelFiller('column'))
                ])

In [0]:
model = fill_ppl.fit(df_train)

In [0]:
transformed = model.transform(df_train)

# !!! Compare them as they wrote

In [0]:
columns = ['TT4', 'T4U','capital-loss', 'capital-gain', 'TSH', 'T3', 'fnlwgt', 'hours-per-week', 'education-num']

# keep only rows where the target (FTI) and every predictor are present
tmp = transformed.dropna(subset=['FTI'] + columns)

X = tmp[columns]
y = tmp['FTI']

X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
train_preds = model.predict(X_test)
metrics.mean_absolute_error(y_test, train_preds)


In [0]:
len(X_train)

In [0]:
len(train_preds)

In [0]:
from sklearn.decomposition import TruncatedSVD

pca = TruncatedSVD(n_components=1)
pca.fit(X_train)

In [0]:
train1 = pca.transform(X_train)
test1 = pca.transform(X_test)

In [0]:
train_preds = model.predict(X_train)

In [0]:
plt.scatter(test1, y_test, color = 'red')
plt.plot(train1, train_preds, color = 'blue')
plt.show()

# TO-DO
* convert to numeric what is not numeric but should be (FTI)
* compare the approaches to filling in missing values
* after data repair, show that we repaired the values...
* test the normalizer
* test the outlier remover
* decide which columns need to be normalized and add them to the pipeline
* decide where outliers need to be removed and add that to the pipeline
* comment the section on normalization and outliers
* change our imputers to use fillna()