In this document, we introduce a modified iris dataset. This dataset has undergone some changes, including the deletion of certain values and the addition of a nominal feature named <em>petal width</em>. Our primary objective is to demonstrate the processes of data loading, handling missing values, normalizing data, and evaluating predictions.

## 1. Data Preparation

Access the [Iris](https://archive.ics.uci.edu/dataset/53/iris) dataset from https://archive.ics.uci.edu/, the University of California, Irvine (UCI) Machine Learning Repository. UCI is a well-known online repository that hosts various datasets for machine learning research. This toy dataset is widely used for testing out machine learning algorithms and visualizations.

**More Information on the Modified Iris Dataset**: The dataset contains 150 instances of iris flowers. Each instance is described by four features (<em>sepal length</em>, <em>sepal width</em>, <em>petal length</em>, and <em>petal width</em>) and is assigned to one of three iris species. The first three features are **numerical**, and the last one, *petal width*, is **nominal**, taking a value from {0,1,2,3,4}.

## 2. Data Loading

We first import the packages that will be used in this document.

1. [Pandas](https://pandas.pydata.org/): Pandas is an open-source Python library widely used for data manipulation, analysis, and cleaning tasks. The central data structure in Pandas is the [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) which provides methods to facilitate the preliminary examination of essential properties, statistical summaries, and a select number of rows for a cursory exploration of the data.

2. [sklearn.preprocessing.MinMaxScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html): MinMaxScaler is a class from scikit-learn (sklearn) used for normalization. It scales each feature to a given range (by default, between 0 and 1).

3. [sklearn.preprocessing.StandardScaler](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html): StandardScaler is another class from scikit-learn (sklearn) used for normalization. It scales each feature to have a mean of 0 and a standard deviation of 1.

4. [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html): train_test_split() is used to split a dataset into training and testing subsets, allowing users to evaluate the performance of machine learning models on unseen data.

5. [sklearn.metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics): sklearn.metrics includes performance metrics functions used to evaluate a classifier's performance.

These packages will be utilized later in the following tasks for data preprocessing and evaluating the performance of a classifier.

In [1]:
import pandas as pd 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics

After importing these packages, we can begin to load the dataset. The dataset is stored in the <em>iris_modified.csv</em> file. We will use the [read_csv()](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) for loading this file and use [info()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.info.html) to get an initial overview of the dataset.



In [2]:
df = pd.read_csv('iris_modified.csv')
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sepal length in cm  148 non-null    float64
 1   sepal width in cm   148 non-null    float64
 2   petal length in cm  150 non-null    float64
 3   petal width         147 non-null    float64
 4   class               150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None


In this dataset, we observe a total of 150 instances, with each instance containing 5 columns. The first three columns represent the **numerical** values of <em>sepal length</em>, <em>sepal width</em>, and <em>petal length</em>, respectively. The fourth column denotes the **nominal** feature <em>petal width</em> of each instance whose values are among {0,1,2,3,4}. Lastly, the fifth column provides the class label for each instance. 

We can then get some statistical information on each feature of the dataset by [describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html):

In [3]:
print(df.describe())

       sepal length in cm  sepal width in cm  petal length in cm  petal width
count          148.000000         148.000000          150.000000   147.000000
mean             5.831757           3.053378            3.758667     1.476190
std              0.819328           0.435096            1.764420     1.206906
min              4.300000           2.000000            1.000000     0.000000
25%              5.100000           2.800000            1.600000     0.000000
50%              5.800000           3.000000            4.350000     2.000000
75%              6.400000           3.300000            5.100000     2.000000
max              7.900000           4.400000            6.900000     4.000000


Although there are 150 instances, some features have only 148 or 147 values, so the dataset contains missing values. Moreover, the features have different value ranges, so normalizing the data may help achieve better performance.
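
The missing values can also be counted directly instead of being inferred from the `count` row. The following is a minimal sketch on a small made-up DataFrame (`toy` is a hypothetical stand-in for the contents of iris_modified.csv, not the real data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for iris_modified.csv
toy = pd.DataFrame({
    "sepal length in cm": [5.1, np.nan, 4.7, 6.3],
    "petal width": [0.0, 1.0, np.nan, np.nan],
    "class": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-virginica"],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that contain at least one missing value
rows_with_missing = int(toy.isna().any(axis=1).sum())
print(rows_with_missing)
```

Applying the same two expressions to `df` should report which columns and how many rows are affected.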

Let's have a look at the first 10 rows of the dataset by [head()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.head.html).

In [4]:
df.head(10)

   sepal length in cm  sepal width in cm  petal length in cm  petal width        class
0                 5.1                3.5                 1.4          0.0  Iris-setosa
1                 4.9                3.0                 1.4          0.0  Iris-setosa
2                 4.7                3.2                 1.3          0.0  Iris-setosa
3                 4.6                3.1                 1.5          0.0  Iris-setosa
4                 5.0                3.6                 1.4          0.0  Iris-setosa
5                 5.4                3.9                 1.7          0.0  Iris-setosa
6                 4.6                3.4                 1.4          0.0  Iris-setosa
7                 5.0                3.4                 1.5          0.0  Iris-setosa
8                 4.4                2.9                 1.4          0.0  Iris-setosa
9                 4.9                3.1                 1.5          0.0  Iris-setosa


We can see the detailed information of the first 10 instances, all of which belong to the Iris-setosa class. Note that checking the dataset and understanding its properties is always necessary before we conduct further steps!

## 3. Missing Value

As discussed in the lecture, there are two main ways to handle missing values in a dataset: removing every instance that contains a missing value, or imputing the missing values.
Imputation in turn has two strategies: filling with a value aggregated over all instances, or filling with a value aggregated over the instances of the same class.

### 3.1. Removal

This can be done with the [DataFrame](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) method [dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html).

In [5]:
df_deleting = df.dropna()

In [6]:
print(df_deleting.info())

<class 'pandas.core.frame.DataFrame'>
Int64Index: 143 entries, 0 to 149
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sepal length in cm  143 non-null    float64
 1   sepal width in cm   143 non-null    float64
 2   petal length in cm  143 non-null    float64
 3   petal width         143 non-null    float64
 4   class               143 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.7+ KB
None


We can see there are 143 instances left after the removal of any instance with a missing value.
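
Two details of `dropna()` are easy to miss: it keeps the original row labels (which is why the output above still says "0 to 149"), and it can be restricted to selected columns with `subset`. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value in each column
toy = pd.DataFrame({
    "sepal length in cm": [5.1, np.nan, 4.7, 6.3],
    "petal width": [0.0, 1.0, np.nan, 2.0],
})

# Default: drop every row that has at least one missing value
dropped_any = toy.dropna()

# Drop only the rows where "petal width" is missing
dropped_subset = toy.dropna(subset=["petal width"])

# The surviving rows keep their old labels; reset_index renumbers them from 0
dropped_any_reset = dropped_any.reset_index(drop=True)

print(len(dropped_any), len(dropped_subset))
print(list(dropped_any.index), list(dropped_any_reset.index))
```

Whether to reset the index is a matter of taste; it only matters if later code relies on positional row labels.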

### 3.2. Imputation by all values 

Utilize the [fillna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) method to impute missing or NaN values within the DataFrame by replacing them with the respective feature's average computed from all available non-null values.


Firstly, we apply the [DataFrame.copy()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.copy.html) method to create a new DataFrame, ensuring that any modifications made will not affect the original one. This precautionary step allows us to work with a separate copy, preserving the integrity of the original data for further analysis or comparison.

In [7]:
df_impu_all = df.copy()

#### 3.2.1 Impute the numerical features

We then impute each numerical feature with the mean of all its non-missing values.
We use [DataFrame.iloc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html) together with [fillna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html) to impute the first three (numerical) columns.

In [8]:
df_impu_all.iloc[:,:3] = df_impu_all.iloc[:,:3].fillna(df_impu_all.iloc[:,:3].mean())

In [9]:
print(df_impu_all.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sepal length in cm  150 non-null    float64
 1   sepal width in cm   150 non-null    float64
 2   petal length in cm  150 non-null    float64
 3   petal width         147 non-null    float64
 4   class               150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None


After printing the information and observing that all the numerical values have been imputed (the Non-Null values increase from 148 to 150), we proceed to handle the missing values in the nominal feature. 

#### 3.2.2 Impute the nominal features

For this task, we will use the [DataFrame.mode()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mode.html) method, which allows us to impute the missing categorical values with the most common one in the corresponding column. 

In [10]:
df_impu_all.iloc[:,3] = df_impu_all.iloc[:,3].fillna(df_impu_all.iloc[:,3].mode().iloc[0])

In [11]:
print(df_impu_all.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sepal length in cm  150 non-null    float64
 1   sepal width in cm   150 non-null    float64
 2   petal length in cm  150 non-null    float64
 3   petal width         150 non-null    float64
 4   class               150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None


The printed information confirms that the imputation of both the numerical and the nominal features succeeded: every column now has 150 non-null values.
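
A note on the `.mode().iloc[0]` expression used for the nominal feature: `mode()` returns a Series rather than a single value, because several values can be tied for most frequent. The sketch below uses a made-up column (`s` is hypothetical) to show the behaviour; taking `.iloc[0]` picks the smallest of the tied values, which is a convention rather than a principled choice.

```python
import numpy as np
import pandas as pd

# Hypothetical nominal column in which 0.0 and 2.0 are tied for most frequent
s = pd.Series([2.0, 0.0, 2.0, 0.0, 1.0, np.nan])

modes = s.mode()             # all most frequent values, sorted; NaN is ignored
fill_value = modes.iloc[0]   # the first (smallest) mode
filled = s.fillna(fill_value)

print(modes.tolist())
print(filled.tolist())
```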

### 3.3. Imputation by class-specific values

In this part, we handle missing or NaN values in a different way: each missing value is replaced with the mean (for numerical features) or the most common value (for the nominal feature) of that feature within the instance's own class.

To obtain the list of unique classes, we use the [unique()](https://pandas.pydata.org/docs/reference/api/pandas.unique.html) method. In the following code, a for loop iterates over the classes and performs the imputation for each one. We use [DataFrame.loc[]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html) to select the rows of the current class.

In [12]:
df_impu_class = df.copy()
class_col = df_impu_class.columns[4]   # class label column
num_cols = df_impu_class.columns[:3]   # three numerical features
nom_col = df_impu_class.columns[3]     # nominal feature
cat_list = df_impu_class[class_col].unique()
for cat in cat_list:
    in_class = df_impu_class[class_col] == cat  # rows belonging to the current class
    # impute numerical features with the class-specific mean
    df_impu_class.loc[in_class, num_cols] = df_impu_class.loc[in_class, num_cols].fillna(
        df_impu_class.loc[in_class, num_cols].mean()
    )
    # impute the nominal feature with the class-specific mode
    df_impu_class.loc[in_class, nom_col] = df_impu_class.loc[in_class, nom_col].fillna(
        df_impu_class.loc[in_class, nom_col].mode().iloc[0]
    )

In [13]:
print(df_impu_class.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   sepal length in cm  150 non-null    float64
 1   sepal width in cm   150 non-null    float64
 2   petal length in cm  150 non-null    float64
 3   petal width         150 non-null    float64
 4   class               150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB
None
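
The class-specific imputation above can also be written without an explicit loop. The following is a sketch of an alternative based on `groupby(...).transform(...)`, not the approach used in this document; the DataFrame `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: two classes, one numerical and one nominal feature
toy = pd.DataFrame({
    "petal length in cm": [1.4, np.nan, 1.6, 4.0, 5.0, np.nan],
    "petal width": [0.0, 0.0, np.nan, 2.0, np.nan, 2.0],
    "class": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3,
})

# For every row, the mean / mode of the feature within that row's class
class_mean = toy.groupby("class")["petal length in cm"].transform("mean")
class_mode = toy.groupby("class")["petal width"].transform(
    lambda col: col.mode().iloc[0]
)

# fillna with a Series fills each missing entry with the value at the same index
toy["petal length in cm"] = toy["petal length in cm"].fillna(class_mean)
toy["petal width"] = toy["petal width"].fillna(class_mode)
print(toy)
```

Here `transform` returns a Series aligned with the original rows, so `fillna` can pick the right class value for each missing entry.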


## 4. Normalization

After imputing the dataset with class-specific values, we proceed to normalization. Note that normalization applies only to numerical features, so we must select those columns first.

Before doing normalization, let us have a look again at the dataset by [describe()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html).

In [14]:
print(df_impu_class.describe())

       sepal length in cm  sepal width in cm  petal length in cm  petal width
count          150.000000         150.000000          150.000000   150.000000
mean             5.837374           3.053918            3.758667     1.480000
std              0.816055           0.433823            1.764420     1.208027
min              4.300000           2.000000            1.000000     0.000000
25%              5.100000           2.800000            1.600000     0.000000
50%              5.800000           3.000000            4.350000     2.000000
75%              6.400000           3.300000            5.100000     2.000000
max              7.900000           4.400000            6.900000     4.000000


### 4.1. Max-min normalization

We apply the [MinMaxScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) to achieve the max-min normalization.

In [15]:
df_class_minmax = df_impu_class.copy()
scaler = MinMaxScaler()
scaler.fit(df_class_minmax.loc[:,df_impu_class.columns[:3]])
df_class_minmax.loc[:,df_impu_class.columns[:3]] = scaler.transform(df_class_minmax.loc[:,df_impu_class.columns[:3]])

Let us have a look at the data again.

In [16]:
print(df_class_minmax.describe())

       sepal length in cm  sepal width in cm  petal length in cm  petal width
count          150.000000         150.000000          150.000000   150.000000
mean             0.427048           0.439133            0.467571     1.480000
std              0.226682           0.180760            0.299054     1.208027
min              0.000000           0.000000            0.000000     0.000000
25%              0.222222           0.333333            0.101695     0.000000
50%              0.416667           0.416667            0.567797     2.000000
75%              0.583333           0.541667            0.694915     2.000000
max              1.000000           1.000000            1.000000     4.000000


All three numerical features are now normalized to the range of [0, 1].
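
With its default range of [0, 1], MinMaxScaler computes x' = (x - min) / (max - min) for each column. The sketch below checks this by hand on three made-up values (the array `X` is hypothetical, chosen to resemble sepal lengths):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column feature matrix
X = np.array([[4.3], [5.8], [7.9]])

# Max-min normalization computed by hand, column by column
manual = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# The same transformation with scikit-learn
scaled = MinMaxScaler().fit_transform(X)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The smallest value maps to 0, the largest to 1, and everything else falls proportionally in between.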

### 4.2. Standardization

We then apply the [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to achieve the standardization (or z-score normalization).

In [17]:
df_class_standard = df_impu_class.copy()
scaler = StandardScaler()
scaler.fit(df_class_standard.loc[:,df_impu_class.columns[:3]])
df_class_standard.loc[:,df_impu_class.columns[:3]] = scaler.transform(df_class_standard.loc[:,df_impu_class.columns[:3]])

We can also have a brief look at the dataset after standardization.

In [18]:
print(df_class_standard.describe())

       sepal length in cm  sepal width in cm  petal length in cm  petal width
count        1.500000e+02       1.500000e+02        1.500000e+02   150.000000
mean         1.421085e-15      -1.586879e-15        3.315866e-16     1.480000
std          1.003350e+00       1.003350e+00        1.003350e+00     1.208027
min         -1.890220e+00      -2.437514e+00       -1.568735e+00     0.000000
25%         -9.066106e-01      -5.872651e-01       -1.227541e+00     0.000000
50%         -4.595198e-02      -1.247030e-01        3.362659e-01     2.000000
75%          6.917554e-01       5.691402e-01        7.627586e-01     2.000000
max          2.536024e+00       3.113232e+00        1.786341e+00     4.000000


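Ideally, the mean should be 0 and the standard deviation 1 after z-score normalization. The means above differ from 0 only by floating-point error. The standard deviations appear as 1.00335 rather than 1 because `describe()` reports the sample standard deviation (dividing by n − 1), whereas StandardScaler uses the population standard deviation (dividing by n); for n = 150 the ratio is sqrt(150/149) ≈ 1.00335. Rounding the results hides both small deviations.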

In [19]:
print(df_class_standard.describe().round(2))

       sepal length in cm  sepal width in cm  petal length in cm  petal width
count              150.00             150.00              150.00       150.00
mean                 0.00              -0.00                0.00         1.48
std                  1.00               1.00                1.00         1.21
min                 -1.89              -2.44               -1.57         0.00
25%                 -0.91              -0.59               -1.23         0.00
50%                 -0.05              -0.12                0.34         2.00
75%                  0.69               0.57                0.76         2.00
max                  2.54               3.11                1.79         4.00
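
The two conventions for the standard deviation (dividing by n versus n − 1) can be compared directly. A minimal sketch with 150 made-up values (the column `value` is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# 150 hypothetical values, matching the number of instances in this dataset
x = pd.DataFrame({"value": np.arange(150, dtype=float)})
z = pd.DataFrame(StandardScaler().fit_transform(x), columns=x.columns)

pop_std = z["value"].std(ddof=0)   # population std, which StandardScaler sets to 1
sample_std = z["value"].std()      # pandas default (ddof=1), as reported by describe()

print(round(pop_std, 6), round(sample_std, 6))
print(round(float(np.sqrt(150 / 149)), 6))
```

The sample standard deviation equals sqrt(150/149) regardless of the data, which matches the `1.003350e+00` shown by `describe()` before rounding.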


## 5. Evaluation

Since we don't have a separate test file, we split the dataset into two subsets, a training set (used for training) and a test set (used for validation or testing), with the [train_test_split()](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function. By default, the test set contains 25% of the original data. For the current dataset of 150 instances, this gives 38 test instances (37.5 rounded up).

In [20]:
y = df_class_standard.iloc[:,-1].values
X = df_class_standard.iloc[:,0:4].values
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
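
The sizes produced by the default split can be checked on placeholder data of the same shape. The sketch below uses made-up arrays (`X` and `y` here are hypothetical, not the iris features) and also shows the optional `stratify` argument, which this document does not use:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data with the same shape as the iris features and labels
X = np.arange(150 * 4).reshape(150, 4)
y = np.repeat(["Iris-setosa", "Iris-versicolor", "Iris-virginica"], 50)

# Default split: 25% of the instances go to the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print(X_train.shape, X_test.shape)

# Optional: stratify=y keeps the class proportions similar in both subsets
_, _, _, y_test_strat = train_test_split(X, y, random_state=0, stratify=y)
labels, counts = np.unique(y_test_strat, return_counts=True)
print(dict(zip(labels, counts)))
```

Fixing `random_state` makes the split reproducible, which matters here because the provided prediction files must line up with the same test instances.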

Since constructing classifiers is only introduced from Week 3, we proceed directly to the evaluation phase using the provided predictions. Classifier_A's predictions on the test data are stored in *prediction_1.csv*, and Classifier_B's in *prediction_2.csv*. In the following, we evaluate Classifier_B and leave evaluating Classifier_A as optional homework.

First, we load the prediction file *prediction_2.csv* and have a look at it.

In [21]:
df_prediction_1 = pd.read_csv('prediction_2.csv')
df_prediction_1.head(38)

Unnamed: 0,predicted class
0,Iris-virginica
1,Iris-versicolor
2,Iris-setosa
3,Iris-virginica
4,Iris-setosa
5,Iris-virginica
6,Iris-setosa
7,Iris-versicolor
8,Iris-virginica
9,Iris-versicolor


We can observe that the file provides the predicted class, which we will use for evaluation, so we select that column.

In [22]:
y_predict = df_prediction_1.iloc[:,-1].values

We can then show the accuracy of the classifier by [accuracy_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

In [23]:
acc = metrics.accuracy_score(y_test, y_predict)
print("The prediction accuracy is: ", acc)

The prediction accuracy is:  0.9473684210526315
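
Accuracy is simply the fraction of predictions that match the true labels; the value above corresponds to 36 correct predictions out of 38. A minimal sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not the contents of the prediction files):

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for eight instances
y_true = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa",
                   "Iris-versicolor", "Iris-virginica", "Iris-setosa", "Iris-versicolor"])
y_pred = np.array(["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa",
                   "Iris-virginica", "Iris-virginica", "Iris-setosa", "Iris-setosa"])

# Fraction of positions where the prediction equals the true label
manual_acc = (y_true == y_pred).mean()
acc = metrics.accuracy_score(y_true, y_pred)

print(manual_acc, acc)
```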


Also, we can get the macro f1-score by [metrics.f1_score()](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html).

In [24]:
f1 = metrics.f1_score(y_test, y_predict, average='macro')
print("The prediction macro f1-score is: ", f1)

The prediction macro f1-score is:  0.9444444444444445
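
The macro f1-score is the unweighted mean of the per-class f1-scores, so every class counts equally regardless of its size. The sketch below illustrates this with made-up labels (`y_true` and `y_pred` are hypothetical); `average=None` returns one score per class, in sorted label order.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted labels for eight instances
y_true = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa",
          "Iris-versicolor", "Iris-virginica", "Iris-setosa", "Iris-versicolor"]
y_pred = ["Iris-setosa", "Iris-versicolor", "Iris-virginica", "Iris-setosa",
          "Iris-virginica", "Iris-virginica", "Iris-setosa", "Iris-setosa"]

# One f1-score per class (setosa, versicolor, virginica)
per_class_f1 = metrics.f1_score(y_true, y_pred, average=None)

# Macro average: plain mean of the per-class scores
macro_f1 = metrics.f1_score(y_true, y_pred, average="macro")

print(per_class_f1)
print(macro_f1, per_class_f1.mean())

# Rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(y_true, y_pred))
```

Looking at the per-class scores or the confusion matrix shows which class pulls the macro average down, something a single accuracy figure cannot reveal.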


Please try to evaluate Classifier_A by yourself:)

Author: *Kaki Zhou* 28/7/2023