# Data Preprocessing

The goal of this lab is to introduce you to data preprocessing techniques in order to make your data suitable for applying a learning algorithm.

## 1. Handling Missing Values

A common (and very unfortunate) property of real-world data is the occurrence of missing and erroneous values across multiple features. For this exercise we will use a data set about abalone snails.
The data set is contained in the Zip file you downloaded from Moodle (abalone.csv).

To determine the age of an abalone snail you have to kill the snail and count the annual
rings. You are told to estimate the age of a snail on the basis of the following attributes:
1. type: male (0), female (1) and infant (2)
2. length in mm
3. width in mm
4. height in mm
5. total weight in grams
6. weight of the meat in grams
7. drained weight in grams
8. weight of the shell in grams
9. number of annual rings (number of rings + 1.5 yields the age in years)

However, the data is incomplete. Missing values are marked with −1.
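
One possible way to work with the −1 sentinel is to convert it to `NaN`, because pandas aggregations skip `NaN` by default. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from abalone.csv), and it assumes that −1 never occurs as a legitimate measurement:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; -1 marks a missing value
toy = pd.DataFrame({
    "type": [0, 1, -1, 2],
    "length": [0.35, -1.0, 0.44, 0.425],
    "num_rings": [-1, 9, 10, 7],
})

# Replace the sentinel with NaN so that pandas treats it as missing
toy_nan = toy.replace(-1, np.nan)

# Count the missing entries in each column
missing_per_column = toy_nan.isna().sum()
print(missing_per_column)
```

Counting the missing entries per column first gives a rough idea of how much of the data any later imputation would affect.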

In [118]:
import pandas as pd
# load data
df = pd.read_csv("abalone.csv") #if this fails, use the csv file that was part of the zip file
df.columns=['type','length','width','height','total_weight','meat_weight','drained_weight','shell_weight','num_rings']
df.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,-1
1,1,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
2,0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
3,2,-1.0,0.255,0.08,0.205,0.0895,0.0395,0.055,7
4,2,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8


### Exercise 1.1

Compute the mean of each numeric column and the counts of each categorical column, excluding the missing values.

In [119]:
##################
#INSERT CODE HERE#
##################

def mean(dataframe, columns):
    means = [] #list where we collect the mean of each column
    for column in columns: #loop through all columns
        count = 0 #we count how many entries are valid
        result = 0 #our result variable
        for i in range(0, len(dataframe)): #loop through dataframe
            if (dataframe[column][i] == -1): #check if entry is not valid
                continue #entry is not valid and we do nothing
            count += 1 #increase count since we found a valid entry
            result += dataframe[column][i] #add entry
        means.append(result/count) #divide by the number of valid entries
    return means #return list with the means

print("The means of the numeric columns:")
print(mean(df, df.columns))


The means of the numeric columns:
[0.9535338713621913, 0.5236920039486674, 0.40795533070089013, 0.13961006910167725, 0.8288428746928771, 0.3592626511972346, 0.18024858618146095, 0.23860444280805088, 9.921756193279371]
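
The exercise also asks for the counts of the categorical column, for which a mean is not very meaningful. As a hedged sketch of how both parts could be done with built-in pandas methods, the example below uses a small made-up frame (not the abalone data) and assumes that −1 is the only marker for missing values:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; -1 marks a missing value
toy = pd.DataFrame({
    "type": [0, 1, -1, 2, 1],
    "length": [0.30, -1.0, 0.40, 0.50, 0.60],
})

# Mask the sentinel so that it is ignored by the aggregations below
masked = toy.replace(-1, np.nan)

# Mean of the numeric column; NaN entries are skipped by default
length_mean = masked["length"].mean()

# Number of rows per category; NaN entries are excluded by default
type_counts = masked["type"].value_counts()

print(length_mean)
print(type_counts)
```

Note that masking turns the integer `type` column into floats, so the categories appear as 0.0, 1.0 and 2.0 in the counts.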


### Exercise 1.2

Compute the median of each numeric column, excluding the missing values.

In [120]:
##################
#INSERT CODE HERE#
##################

def median(dataframe, columns):
    medians = [] #list where we collect the median of each column
    for column in columns: #loop through all columns
        values = [] #list where we collect all valid entries of a column
        for i in range(0,len(dataframe)): #loop through dataframe
            if(dataframe[column][i] == -1): #check if entry is not valid
                continue #entry is not valid and we do nothing
            values.append(dataframe[column][i]) #add value to list since it is valid
        values.sort() #sort all values so we can determine the median
        medians.append(values[len(values)//2]) #get the value in the middle of the list (for an even number of values this is the upper of the two middle values)
    return medians #return list with medians

print("The medians of the numeric columns:")
print(median(df, df.columns))


The medians of the numeric columns:
[1, 0.545, 0.425, 0.14, 0.802, 0.336, 0.1705, 0.2335, 9]
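
Picking the element at index `len(values)//2` differs from the conventional median when the number of valid values is even, where the two middle values are usually averaged. The sketch below illustrates the difference on a made-up series (not the abalone data), again assuming −1 marks a missing value:

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration; -1 marks a missing value
values = pd.Series([0.2, -1.0, 0.4, 0.1, 0.3])

# Keep only the valid entries (four of them, so the count is even)
valid = values.replace(-1, np.nan).dropna()

# Conventional median: average of the two middle values
conventional = valid.median()

# Upper-middle element, as obtained by indexing the sorted list at len // 2
sorted_values = sorted(valid)
upper_middle = sorted_values[len(sorted_values) // 2]

print(conventional, upper_middle)
```

For an odd number of valid values both variants return the same element.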


### Exercise 1.3

Handle the missing values in a way that you find suitable. Think about different ways. Discuss the advantages and disadvantages of your approach and justify your choices.


In [121]:
##################
#INSERT CODE HERE#
##################

def fix_dataframe(dataframe, columns):
    means = mean(dataframe, columns) #get means of the dataframe
    for i in range(0, len(columns)): #loop through columns
        for j in range(0, len(dataframe)): #loop through rows
            if (dataframe[columns[i]][j] == -1): #check if entry is not valid
                dataframe.at[j, columns[i]] = means[i] #replace entry with mean of the column

fix_dataframe(df, df.columns)

"""
The approach I chose is to replace each missing value with the mean of its column.
Without further information, the column mean is a plausible estimate of the value
we were not able to record. Of course it is not guaranteed to be close to the actual
value, and it is very unlikely to be the exact value. Because of that, the imputed
entries might mislead a later decision. Note also that for the categorical column
"type" the mean is not a valid category. Another way would be to drop the entries
with missing information, or at least to drop those that have multiple missing values.
"""

df.head()

Unnamed: 0,type,length,width,height,total_weight,meat_weight,drained_weight,shell_weight,num_rings
0,0.0,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,9.921756
1,1.0,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9.0
2,0.0,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10.0
3,2.0,0.523692,0.255,0.08,0.205,0.0895,0.0395,0.055,7.0
4,2.0,0.425,0.3,0.095,0.3515,0.141,0.0775,0.12,8.0
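
For comparison, the alternatives mentioned above could look roughly as follows. This is only a sketch on a made-up frame (not the abalone data); the threshold of at most one missing value per row and the choice of mode and median as fill values are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration; -1 marks a missing value
toy = pd.DataFrame({
    "type": [0, 1, -1, 1],
    "length": [0.30, -1.0, 0.40, 0.50],
    "num_rings": [7, 9, -1, 10],
})
masked = toy.replace(-1, np.nan)

# Option A: drop every row that has at least one missing value
dropped = masked.dropna()

# Option B: drop only the rows with more than one missing value
few_missing = masked[masked.isna().sum(axis=1) <= 1]

# Option C: impute the categorical column with its most frequent value
# and the numeric columns with their medians
imputed = masked.copy()
imputed["type"] = imputed["type"].fillna(imputed["type"].mode()[0])
numeric_cols = ["length", "num_rings"]
imputed[numeric_cols] = imputed[numeric_cols].fillna(imputed[numeric_cols].median())

print(len(dropped), len(few_missing))
print(imputed)
```

Dropping rows keeps only observed values but shrinks the data set, whereas imputation keeps every row at the cost of introducing estimated values.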


### Exercise 1.4

Perform Z-score normalization on every column (except the type of course!)

In [122]:
##################
#INSERT CODE HERE#
##################

def mean_of_column(column):
    result = 0
    for i in column: #loop through the array
        result += i
    return result/column.size #return mean

def variance(column):
    mean = mean_of_column(column)
    result = 0
    for i in column: #loop through the array
        result += (i-mean)**2
    return result/column.size #return variance

def standard_deviation(column):
    return variance(column)**(1/2)

def z_score_normalization(df, columns):
    for column in columns: #loop through columns
        mean = mean_of_column(df[column]) #get mean
        sd = standard_deviation(df[column]) #get standard deviation
        for i in range(0, len(df)): #loop through dataframe
            df.at[i, column] = (df.at[i, column] - mean) / sd #replace old value with the new one

z_score_normalization(df, df.columns[1:]) #normalize all columns except "type"

"""
After normalization, every column should have a mean of 0 and a standard
deviation of 1. With the following loop we check whether the normalization
worked. Note that floating-point rounding errors cause slight deviations
from 0 and 1.
"""

for i in df.columns[1:]:
    print("Mean of", i, ":", mean_of_column(df[i]))
    print("Standard deviation of", i, ":", standard_deviation(df[i]))

Mean of length : 6.823086157373052e-15
Standard deviation of length : 0.9999999999999927
Mean of width : -6.9622362366567704e-15
Standard deviation of width : 0.999999999999997
Mean of height : 4.033368998518127e-14
Standard deviation of height : 1.0000000000000184
Mean of total_weight : -1.6130136817542504e-15
Standard deviation of total_weight : 0.9999999999999978
Mean of meat_weight : 3.3130161024111232e-15
Standard deviation of meat_weight : 0.9999999999999988
Mean of drained_weight : -4.445252169765916e-15
Standard deviation of drained_weight : 0.999999999999998
Mean of shell_weight : -7.512083760490523e-16
Standard deviation of shell_weight : 1.0000000000000067
Mean of num_rings : -8.852273958559348e-16
Standard deviation of num_rings : 1.0000000000000175
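
The same normalization, z = (x − mean) / standard deviation, can also be written without explicit loops. The following is a minimal sketch on a made-up frame (not the abalone data); it assumes the population standard deviation (`ddof=0`), which corresponds to dividing by the number of entries as `variance` does above:

```python
import pandas as pd

# Toy data, made up for illustration and without missing values
toy = pd.DataFrame({
    "type": [0, 1, 2, 1],
    "length": [0.30, 0.40, 0.50, 0.60],
    "num_rings": [7.0, 9.0, 10.0, 8.0],
})

# Normalize every column except the categorical "type"
numeric_cols = toy.columns[1:]
col_mean = toy[numeric_cols].mean()
col_std = toy[numeric_cols].std(ddof=0)  # population standard deviation

normalized = toy.copy()
normalized[numeric_cols] = (toy[numeric_cols] - col_mean) / col_std

print(normalized)
```

pandas uses `ddof=1` (sample standard deviation) by default, so leaving out `ddof=0` would give slightly different values.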
