# Practical Session 2
This session covers the first steps of a machine learning workflow:

*   loading data
*   data exploration
*   data pre-processing including
    * dealing with missing values
    * encoding categorical features
    * feature scaling

In this practical, we will use the widely used Titanic dataset.

Python packages used in this practical:
* scikit-learn (sklearn)
* pandas
* numpy
* matplotlib
* seaborn

Author: Yuhua Li

Date:   November 2022 updated

In [None]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt


# Get data

Get the Titanic data and have a simple inspection of the data

---



In [None]:
# load Titanic data from URL
titanic = pd.read_csv('http://bit.ly/kaggletrain')
#titanic = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/titanic_train.csv')


Have an initial inspection of the data

In [None]:
pd.set_option('display.max_columns', None, 'max_colwidth', None, 'display.expand_frame_repr', False) # print all columns in full, prevent line break

print('\n--- Show the number of data points (rows) and features (columns)---\n', titanic.shape)

print('\n---Information of the titanic dataset --- \n')
titanic.info()   # info() prints its summary directly and returns None, so wrapping it in print() would also print 'None'
print('\n ---Column names of the dataset --- \n', titanic.columns)


In [None]:
# Look at the first 10 and the last 5 rows of the dataset
print('\nBelow are the first 10 rows of the dataset......\n', titanic.head(10))
print('\n\nBelow are the last 5 rows of the dataset......\n', titanic.tail(5))

In [None]:
# Get the summary statistics of the dataset. Note that by default describe() only shows results for features with numerical values
print('\nBelow is the statistics of the dataset......\n\n', titanic.describe())

### Remove features that are not relevant to modelling

The dataset contains columns for passenger IDs, names and ticket numbers, which are not useful for machine learning modelling, so we remove the 'PassengerId', 'Name' and 'Ticket' columns from the original dataset. Note that, in general, personally identifiable information should also be removed from a dataset to avoid violating data protection law such as the GDPR (General Data Protection Regulation).

In [None]:
try:
  titanic.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace = True)
except KeyError:
  print('Attributes already removed')

print(titanic.head(10))

# Dealing with missing values

We first check whether there are missing values in each feature: 0 means no missing values, and any other number is the count of missing values. You will see that there are missing values for Age, Cabin and Embarked.

In [None]:
titanic.isna().sum()   

As you can see from above, there are missing values in this Titanic dataset. 

Some machine learning packages can handle datasets with missing values, so they can take the data directly without us explicitly dealing with the missing values. However, many others need us to process missing values before feeding the dataset to machine learning modelling.

You have a few options to deal with missing values:
1. Get rid of the corresponding instances.
2. Get rid of the whole attribute.
3. Set the values to some value (zero, the mean, the median, etc.).
4. Use imputation methods.

## 1. Get rid of the instances (data points) that contain missing values. 
If the number of instances in a dataset is large and the fraction of instances with missing values is small, an easy way is simply to remove those instances containing missing values.
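
Before dropping rows, it helps to check how much data would be lost. The following is a minimal sketch on a small made-up table (the values are hypothetical and not taken from the Titanic data); it counts the rows that contain at least one missing value and the fraction of missing values per column.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not the Titanic dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": ["C85", np.nan, np.nan, "E46"],
    "Embarked": ["S", "C", "S", np.nan],
})

# True for every row that contains at least one missing value
rows_with_na = toy.isna().any(axis=1)
fraction_lost = rows_with_na.mean()
print(f"dropna() would remove {rows_with_na.sum()} of {len(toy)} rows ({fraction_lost:.0%})")

# Fraction of missing values in each column
print(toy.isna().mean())
```

If the fraction of affected rows is small, removing them is cheap; if it is large, one of the other options is usually the better choice.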


In [None]:
print('Data size BEFORE deleting instances with missing values: ', titanic.shape)

titanic_ins = titanic.dropna(subset=['Age', 'Cabin', 'Embarked'])   # delete all instances that have missing values for the features of Age, Cabin, and Embarked
print('\nData size AFTER deleting instances containing missing values: ', titanic_ins.shape)
titanic_ins.isna().sum()

## 2. Get rid of the whole attribute.

As seen above, the feature Cabin contains 687 missing values, which is a significant portion of the total instances (891). If we remove those instances based on the Cabin feature (and Age and Embarked), only 183 of the original 891 instances are left, which means we lose most of the original dataset. So it is better to drop the *Cabin* feature entirely.

In [None]:
titanic1 = titanic.drop("Cabin", axis=1)
print(titanic1.head())
titanic1.isna().sum()

## 3. Set the missing values to some value
After removing the entire feature of *Cabin*, the resulting dataset ***titanic1*** still contains missing values for *Age* and *Embarked*. We may fill the missing values by statistics, e.g., mean and median of a numeric feature or the most frequent value of a categorical feature, etc.

In [None]:
# Age is a numeric feature, so we may replace missing values by the median of Age
median = titanic1["Age"].median() # option 3
titanic1["Age"] = titanic1["Age"].fillna(median)   # assign back instead of inplace=True on a column, which is unreliable in recent pandas
print('After filling missing values of Age\n', titanic1.isna().sum())

## 4. Use imputation methods.
scikit-learn provides several imputation methods. Here we replace missing values of *Embarked* using its mode, i.e., its most frequent value.


In [None]:
from sklearn.impute import SimpleImputer    # For more information about SimpleImputer, see https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(titanic1[['Embarked']])                 # fit on the Embarked column only (2D input is required)
embarked = imputer.transform(titanic1[['Embarked']])

#print(embarked[:20, ])
print(embarked.shape)
print(pd.DataFrame(embarked).isna().sum())
pd.DataFrame(embarked).info()

# Data Exploration

## Scatter plot

A scatter plot shows the joint distribution of a pair of features and can visually reveal the relationship between them.

In [None]:
# use scatter_matrix of pandas.plotting     https://pandas.pydata.org/docs/reference/plotting.html
from pandas.plotting import scatter_matrix
scatter_matrix(titanic1[['Age', 'SibSp', 'Parch', 'Fare']], figsize=(12, 8))

In [None]:
# We can also use plot of pandas DataFrame to plot a pair of features: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
titanic1.plot(kind='scatter', x='Age', y='Fare')

## Box plot

In [None]:
# We demonstrate boxplot for 'Age' and 'Fare'. A boxplot visualises the distribution of a feature and helps to spot outliers for removal if needed.

print(titanic1[['Age', 'Fare']].describe())
titanic1[['Age', 'Fare']].boxplot()
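
The points drawn beyond the whiskers of a box plot are, by the usual default convention, the values lying more than 1.5 times the interquartile range (IQR) outside the quartiles. The sketch below applies that rule by hand to a small made-up series (hypothetical values, not the Titanic fares); whether such points should actually be removed is a modelling decision.

```python
import pandas as pd

# Hypothetical fare-like values with one extreme value
fares = pd.Series([7.0, 8.0, 10.0, 13.0, 15.0, 26.0, 30.0, 512.0])

q1, q3 = fares.quantile(0.25), fares.quantile(0.75)
iqr = q3 - q1                                     # interquartile range
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # 1.5 * IQR rule

outliers = fares[(fares < lower) | (fares > upper)]
print(f"bounds: [{lower}, {upper}]")
print("flagged as outliers:", outliers.tolist())
```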

## Correlation
The correlation coefficient measures the strength of the linear relationship between a pair of variables and lies in the range [-1, 1]: 1 indicates a perfect positive linear relationship, 0 no linear relationship, and -1 a perfect negative linear relationship. It can be used in feature selection to remove redundant (highly correlated) features.
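
To make the three reference values concrete, here is a small sketch on made-up data (the column names are hypothetical). Note that the last column depends strongly on `x`, yet its correlation with `x` is 0 because the dependence is not linear.

```python
import pandas as pd

toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})
toy["linear_up"] = 2 * toy["x"] + 1        # perfect positive linear relationship with x
toy["linear_down"] = -3 * toy["x"]         # perfect negative linear relationship with x
toy["quadratic"] = (toy["x"] - 3) ** 2     # symmetric about x = 3: no linear trend

corr = toy.corr()
print(corr.round(2))
```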

In [None]:
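corr_matrix = titanic1.corr(numeric_only=True)   # restrict to numeric columns; pandas >= 2.0 raises an error on text columns such as Sex otherwise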
print(corr_matrix)

import seaborn as sns
plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix, vmax=0.9, square=True, annot=True, linewidths=0.3, cmap="YlGnBu", fmt=".1f")

## Histogram
We use the histogram to see what a feature distribution looks like.

In [None]:
titanic1['Age'].hist()

# Feature scaling
Feature scaling is an important step for many machine learning methods (for example, distance-based methods such as k-nearest neighbours), helping to achieve good learning performance and a faster learning process.

There are two commonly used feature scaling methods: 
1. Scaling features to a defined range: e.g., min-max scaling
2. Standardization: zero mean and unit variance

## 1. Scaling features to a range that you define
Commonly used ranges are [0, 1] and [-1, 1]
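
As a minimal sketch of both ranges, the example below scales one made-up feature (hypothetical values) with `MinMaxScaler`, using its `feature_range` parameter for [-1, 1], and compares the default result with the formula x' = (x - min) / (max - min) computed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([[0.0], [20.0], [40.0], [80.0]])   # one feature, hypothetical values

scaled_01 = MinMaxScaler().fit_transform(ages)                         # default range [0, 1]
scaled_pm1 = MinMaxScaler(feature_range=(-1, 1)).fit_transform(ages)   # range [-1, 1]

# The same [0, 1] result by hand: x' = (x - min) / (max - min)
by_hand = (ages - ages.min()) / (ages.max() - ages.min())

print(scaled_01.ravel())
print(scaled_pm1.ravel())
```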

In [None]:
from sklearn.preprocessing import MinMaxScaler

minMax_scale = MinMaxScaler()   # to default range [0, 1]
titanic1['Age'] = minMax_scale.fit_transform(titanic1[['Age']])

titanic1

## 2. Standardization (z-score)
Standardization rescales a feature so that it has 0 mean and 1 standard deviation.
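
The z-score can also be computed by hand as z = (x - mean) / std. The sketch below, on made-up values, checks that this matches `StandardScaler`; the one assumption worth noting is that the scaler uses the population standard deviation, which is also the default of `np.std`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

fares = np.array([[7.25], [8.05], [26.0], [71.28]])   # one feature, hypothetical values

scaled = StandardScaler().fit_transform(fares)

# z = (x - mean) / std, with the population standard deviation (ddof=0)
by_hand = (fares - fares.mean()) / fares.std()

print(scaled.ravel())
print("mean:", scaled.mean().round(6), "std:", scaled.std().round(6))
```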

In [None]:
from sklearn.preprocessing import StandardScaler

standard_scale = StandardScaler()   # to default 0 mean and 1 standard deviation
titanic1['Fare'] = standard_scale.fit_transform(titanic1[['Fare']])

titanic1

# Encoding categorical features

Most machine learning methods accept only numerical data; decision-tree-based methods are an exception, as they can handle both numerical and categorical features directly. So we usually need to convert categories to numbers.

## One hot encoder

In [None]:
from sklearn.preprocessing import OneHotEncoder

titanic1 = titanic[['Pclass', 'Sex', 'Embarked']].dropna()
#print(titanic1.info())
print('\n---The original data---\n', titanic1.head())

enc = OneHotEncoder(sparse_output=False, handle_unknown='ignore')   # sparse_output requires scikit-learn >= 1.2; use sparse=False on older versions
enc.fit(titanic1)

print('\n', enc.categories_)

titanic2 = pd.DataFrame(enc.transform(titanic1))
titanic2.columns = np.concatenate(enc.categories_).ravel().tolist()
print('\n---The data after one hot encoding---\n', titanic2.head(5))

## Dummy encoding (dropping one category)
A feature with N categories produces N binary features after one hot encoding as above. Because exactly one of them is 1 in every row, any one of the N binary features is perfectly collinear with the other N-1. This means there are only N-1 non-collinear binary features for an N-category feature, so we can drop one of the derived binary features as follows.
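
The collinearity can be checked directly. This sketch uses `pd.get_dummies` on a small made-up column (hypothetical values) rather than the encoder used in this notebook: every one hot row sums to 1, so the dropped column can be recovered from the remaining N-1 columns.

```python
import pandas as pd

embarked = pd.Series(["S", "C", "Q", "S", "C"], name="Embarked")   # hypothetical values

one_hot = pd.get_dummies(embarked, dtype=int)                    # N = 3 columns: C, Q, S
reduced = pd.get_dummies(embarked, dtype=int, drop_first=True)   # N - 1 = 2 columns: Q, S

# Every row of the full encoding sums to 1 ...
print(one_hot.sum(axis=1).tolist())

# ... so the dropped column carries no extra information
recovered_C = 1 - reduced["Q"] - reduced["S"]
print((recovered_C == one_hot["C"]).all())
```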

In [None]:
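enc = OneHotEncoder(drop='first', sparse_output=False)   # sparse_output requires scikit-learn >= 1.2; use sparse=False on older versions
titanic1 = titanic[['Pclass', 'Sex', 'Embarked']].dropna()
print(titanic1.head())
enc.fit(titanic1)
# keep the original row index so that the concat below aligns rows correctly
titanic2 = pd.DataFrame(enc.transform(titanic1), index=titanic1.index, columns=enc.get_feature_names_out())
print(titanic2.shape)

print('\n---The data after one hot encoding with the first category dropped---\n', titanic2.head())

# add the numeric features for the same rows (rows with a missing Embarked value were dropped above)
titanic2 = pd.concat([titanic2, titanic.loc[titanic1.index, ['Age', 'SibSp', 'Parch', 'Fare']]], axis=1)
titanic2.head()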

## Use Transformer
Alternatively, use make_column_transformer to encode the named categorical features and keep the other features unchanged.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")
imputer.fit(pd.DataFrame(titanic1["Embarked"]))

In [None]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer

titanic3 = titanic[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']].copy()   # copy so that later column assignments do not trigger SettingWithCopyWarning
print('Before filling missing values of Age\n', titanic3.isna().sum())
print(titanic3.shape)
print(titanic3)

In [None]:
imputer = SimpleImputer(strategy="most_frequent")
titanic3["Embarked"] = imputer.fit_transform(titanic3[["Embarked"]]).ravel()   # fit_transform returns a 2D array; ravel() flattens it to a single column
imputer = SimpleImputer(strategy="median")
titanic3["Age"] = imputer.fit_transform(titanic3[["Age"]]).ravel()
print('After filling missing values of Age and Embarked\n', titanic3.isna().sum())

print(titanic3.shape)
print(titanic3)

In [None]:

col_trans = make_column_transformer(
    (SimpleImputer(strategy="most_frequent", add_indicator=True), ["Embarked"]),
    (SimpleImputer(strategy="median", add_indicator=True), ["Age"]),
    (OneHotEncoder(), ['Sex', 'Embarked']),
    remainder = 'passthrough')

titanic_ce = col_trans.fit_transform(titanic3)

np.set_printoptions(threshold=np.inf, linewidth=np.inf, suppress=True, precision=2)
print(titanic_ce[0:20, ])

## Exercise

### Tips
There are many packages at our disposal. It is nearly impossible for us to remember all functionalities and details of every package. A good practice is always to consult the package documentation first when you have questions about the use of a package and its methods or algorithms.

So, have a look at sklearn.preprocessing for its main modules that you will use to preprocess data https://scikit-learn.org/stable/modules/preprocessing.html#

Read the Category Encoders for other encoding methods at https://contrib.scikit-learn.org/category_encoders/

# First machine learning project
Finally, we put things together to practise our first machine learning project: here we build a k-nearest neighbours model to predict survival based on passengers' data.

## Get data

In [None]:
import pandas as pd 

# Get a new copy of titanic data. 
# We ignore 'PassengerId', 'Name' and 'Ticket', as they are not useful for machine learning.
# We also ignore 'Cabin' as it contains too many missing values.
titanic = pd.read_csv('http://bit.ly/kaggletrain')
titanic.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace = True)

pd.set_option('display.max_columns', None, 'max_colwidth', None, 'display.expand_frame_repr', False)
print(titanic)

# split the data into input and output
titanic_x = titanic[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']]
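titanic_y = titanic['Survived']   # a 1D Series; a one-column DataFrame would trigger a DataConversionWarning when fitting the classifier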

print(titanic_x.head(20))
#print(titanic_x['Embarked'].isna().sum())

## Split data into training set and test set

In [None]:
from sklearn.model_selection import train_test_split

# 70% of the data for training, 30% for testing, i.e., test_size=0.3
X_train, X_test, y_train, y_test = train_test_split(titanic_x, titanic_y, test_size=0.3, random_state=42, stratify=titanic_y) 
X_train
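
The `stratify` argument used above keeps the class proportions the same in the training and test sets. A minimal sketch on made-up labels (hypothetical data, 40% positive) illustrates the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)     # 100 hypothetical samples with one feature
y = np.array([0] * 60 + [1] * 40)     # 40% of the labels are positive

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Both splits keep the 40% positive rate of the full data
print("train size:", len(y_tr), "positive rate:", y_tr.mean())
print("test size: ", len(y_te), "positive rate:", y_te.mean())
```

Without `stratify`, the proportions in each split would vary with the random seed, which matters most for small or imbalanced datasets.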

## Preprocessing Pipeline & ColumnTransformer
We now combine all transformations using Pipeline & ColumnTransformer. Doing so not only keeps the code tidy but also has further advantages:

*   it allows the preprocessing steps to be included in hyperparameter tuning (covered later)
*   it avoids data leakage, i.e., it prevents test data from being used by mistake during model training
*   it guarantees that your data is always preprocessed in the same way. For example, if a categorical feature has a category in the test set that does not occur in the training set (or vice versa), the encoder fitted on the training set still produces the same columns for both sets.
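
The last point can be illustrated with a small sketch on made-up splits (hypothetical values): the encoder is fitted on the training data only, and with `handle_unknown='ignore'` a category never seen during fitting is encoded as a row of zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Embarked": ["S", "C", "S"]})   # hypothetical training split without 'Q'
test = pd.DataFrame({"Embarked": ["Q", "C"]})         # 'Q' was never seen during fitting

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)                                        # learn the categories from training data only

encoded_test = enc.transform(test).toarray()
print(enc.categories_)    # categories learned from the training split
print(encoded_test)       # the unseen 'Q' becomes an all-zero row
```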



In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler, StandardScaler
from sklearn.compose import ColumnTransformer
#from sklearn.compose import make_column_transformer

# transformer for categorical features
categorical_features = ['Pclass', 'Sex', 'Embarked']
categorical_transformer = Pipeline(
    [
        ('imputer_cat', SimpleImputer(strategy = 'most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
    ]
)

# transformer for numerical features
numeric_features = ['Age', 'SibSp', 'Parch', 'Fare']
numeric_transformer = Pipeline(
    [
        ('imputer_num', SimpleImputer(strategy = 'median')),
        ('scaler', StandardScaler())
    ]
)

# combine them in a single ColumnTransformer
preprocessor = ColumnTransformer(
    [
        ('categoricals', categorical_transformer, categorical_features),
        ('numericals', numeric_transformer, numeric_features)
    ],
    remainder = 'drop'
)



In [None]:

X_train_processed = preprocessor.fit_transform(X_train) # fit and transform X_train
X_test_processed = preprocessor.transform(X_test) # transform X_test using the model fitted on X_train

np.set_printoptions(threshold=np.inf, linewidth=np.inf, suppress=True, precision=2)

print(X_train_processed[0:20, :])

## Define the model
We use a pipeline to put together the preprocessor from above and the k-nearest neighbours classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

myClassfier = Pipeline(
    [
     ('preprocessing', preprocessor),
     ('classifier', KNeighborsClassifier())
    ]
)

## Train the model

In [None]:

myClassfier.fit(X_train, y_train)


## Evaluate the model
We evaluate the performance of the trained classifier on the test set.

In [None]:
from sklearn.metrics import accuracy_score

y_pred = myClassfier.predict(X_test)
accuracy_score(y_test, y_pred)
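
For reference, accuracy is simply the fraction of predictions that match the true labels. The sketch below checks this on made-up labels (hypothetical values, unrelated to the model above):

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 1, 1, 0, 0])   # hypothetical true labels
y_hat = np.array([1, 0, 0, 1, 0, 1])    # hypothetical predictions

acc = accuracy_score(y_true, y_hat)
by_hand = (y_true == y_hat).mean()      # fraction of matching entries

print(acc, by_hand)
```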

# Exercise
Download the Credit Approval Data Set from the UCI Machine Learning Repository. Then practise:

- exploratory data analysis
- dealing with missing values, if any
- encoding categorical features
- scaling features
- if you have time, implementing a classifier to predict whether a credit card application is approved (+ in the last column) or rejected (- in the last column)

You can read more information about the data set from https://archive.ics.uci.edu/ml/datasets/Credit+Approval


In [None]:
import pandas as pd

# Get Credit Approval Data Set
crx = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/credit-screening/crx.data', header=None, na_values='?')   # the file has no header row; missing values are marked with '?'
crx.columns = ['A1', 'A2', 'A3', 'A4', 'A5', 'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A12', 'A13', 'A14', 'A15', 'Target']

# Start writing your IPython notebook............
