# PARENT/AI-4-NICU Training School Hands-On Workshops


## Lab 1. Introduction to data manipulation and processing with Python

Let's first introduce our working environment. Welcome to Google Colab. It's a hosted service built on the open-source Jupyter Notebook project, which lets us create interactive notebooks (documents that conveniently interleave code with text descriptions). As soon as we attempt to execute a code cell, we're connected to a virtual machine in Google's cloud environment.

Today we're going to be working on the dataset available [HERE](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic).

1. Let's try to download the mentioned dataset using the `wget` tool available in most Linux distributions. Example usage could be as follows:
```bash
!wget -O dataset.zip <DATASET_URL>
```
This command downloads the file from the provided URL and saves it to the path given by the `-O` option. You may have noticed that the command is preceded by an exclamation mark. It's used to distinguish shell commands, which are executed directly on the machine, from lines of Python code.

In [None]:
# TODO: Download the mentioned dataset to the Colab machine
!wget -O dataset.zip https://archive.ics.uci.edu/static/public/17/breast+cancer+wisconsin+diagnostic.zip

2. The dataset is distributed in the form of a compressed `.zip` file. Before we start working on the data, we need to extract it. The `unzip` tool should be a perfect fit for that. To extract the contents to the working directory we can just run:
```bash
!unzip <ARCHIVE_PATH>
```

In [None]:
# TODO: Unzip the dataset
!unzip dataset.zip
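
The same extraction can also be done from Python with the standard-library `zipfile` module. The sketch below is only an illustration: it builds a tiny archive in a temporary directory (the file name `example.data` and its contents are made up) and then extracts it, which mirrors what `unzip` does for `dataset.zip`.

```python
import tempfile
import zipfile
from pathlib import Path

# Work in a temporary directory so that nothing in the notebook is overwritten.
workdir = Path(tempfile.mkdtemp())
archive_path = workdir / "example.zip"

# Create a small archive with one made-up file (a stand-in for dataset.zip).
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("example.data", "1,M,17.99\n2,B,13.54\n")

# List the archive members and extract them, similar to what `unzip` does.
with zipfile.ZipFile(archive_path) as archive:
    members = archive.namelist()
    archive.extractall(workdir / "extracted")

extracted_text = (workdir / "extracted" / "example.data").read_text()
print(members)
print(extracted_text)
```

In the lab itself, pointing `zipfile.ZipFile` at `dataset.zip` should work the same way.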

3. Our archive contained two files: `wdbc.data` and `wdbc.names`. Let's peek into both of them and try to understand their contents. There are two commands that may come in handy in this kind of situation:
  - `cat <FILE_PATH>` - prints the whole file to the console
  - `head <FILE_PATH>` - prints the first few lines (the number can be set with the `-n` option)

In [None]:
# TODO: View contents of the wdbc.names file
!cat wdbc.names

In [None]:
# TODO: Print the first line of the dataset
!head -n 1 wdbc.data

4. Now that we understand the structure of the dataset, we can "read" it (import it into our script). Before that, we need to import the [pandas](https://pandas.pydata.org/) library, which is one of the most popular data analysis and manipulation libraries for Python.

In [None]:
import pandas as pd

5. We can use the convenient `read_csv` function ([DOCS](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)) available in the pandas package to read a file in the CSV format and convert it to the tabular dataframe format implemented in pandas.

In [None]:
# TODO: Import the dataset into a dataframe using the read_csv function
data = pd.read_csv("wdbc.data")

6. To view and examine the data we can call the following methods on the dataframe:
- `.head()`
- `.tail()`
- `.describe()`

In [None]:
# TODO: Use the .head() method to view the first 10 lines in the dataset
data.head(10)

7. You've probably noticed that the first row of the data was used as the column labels. That's not ideal, because the file has no header row and we lose one record this way. Let's fix that.
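
To see why this happens, here is a minimal sketch on a made-up, two-row CSV held in memory (the values and the three-column layout are illustrative only). It compares the default behaviour of `read_csv` with the `header=None` and `names=` options.

```python
from io import StringIO

import pandas as pd

# Two made-up rows in the same "no header" layout as wdbc.data.
raw = "101,M,17.99\n102,B,13.54\n"

# Default behaviour: the first row is consumed as the header.
with_default = pd.read_csv(StringIO(raw))

# header=None keeps every row as data and numbers the columns 0, 1, 2, ...
without_header = pd.read_csv(StringIO(raw), header=None)

# names=[...] also keeps every row and assigns readable labels.
with_names = pd.read_csv(StringIO(raw), names=["ID", "Diagnosis", "Radius MEAN"])

print(with_default.shape, without_header.shape, with_names.shape)
print(list(with_default.columns))
```

With the default settings one record silently turns into column labels, which is easy to miss on a file with hundreds of rows.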

In [None]:
FEATURE_NAMES = ["Radius", "Texture", "Perimeter", "Area", "Smoothness", "Compactness", "Concavity", "Concave points", "Symmetry", "Fractal dimension"]
FEATURE_STATISTICS = ["MEAN", "SE", "WORST"]

In [None]:
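# The code below assembles the list of the dataset columns.
# In wdbc.data all ten means come first, then the ten standard errors,
# then the ten "worst" values, so the statistic has to be the outer loop.

FEATURES = []

for statistic in FEATURE_STATISTICS:
  for feature_name in FEATURE_NAMES:
    FEATURES.append(f"{feature_name} {statistic}")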

# and prints it
print(FEATURES)
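
The list of column names can also be sketched with `itertools.product`. This assumes the column order described in `wdbc.names` (field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius), i.e. all ten means come first, then the ten standard errors, then the ten worst values. The lowercase variable names are illustrative.

```python
from itertools import product

feature_names = ["Radius", "Texture", "Perimeter", "Area", "Smoothness",
                 "Compactness", "Concavity", "Concave points", "Symmetry",
                 "Fractal dimension"]
feature_statistics = ["MEAN", "SE", "WORST"]

# product() varies its last argument fastest, so putting the statistics first
# yields all MEAN columns, then all SE columns, then all WORST columns.
columns = [f"{name} {statistic}"
           for statistic, name in product(feature_statistics, feature_names)]

print(len(columns))
print(columns[0], "|", columns[10], "|", columns[20])
```

Printing the 1st, 11th and 21st names is a quick way to check the order against the three fields mentioned in `wdbc.names`.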

In [None]:
# TODO: Read the dataset properly - utilising the computed columns' names
data = pd.read_csv("wdbc.data", names=["ID", "Diagnosis"] + FEATURES)

In [None]:
# TODO: View the first 10 lines in the properly imported dataset
data.head(10)

In [None]:
# TODO: Test the .describe() method of the dataframe
data.describe()

In [None]:
data['Diagnosis'].value_counts()

Diagnosis
B    357
M    212
Name: count, dtype: int64

8. Now that our data has been successfully imported and labeled, let's perform some aggregations to gain insights into it.

In [None]:
# TODO: Filter just the malignant records
data[data["Diagnosis"] == 'M']

In [None]:
data.query("Diagnosis == 'M'")

In [None]:
# TODO: Get the number of malignant records
# .shape is a (rows, columns) tuple, so its first element is the record count
data[data["Diagnosis"] == 'M'].shape[0]

In [None]:
# TODO: Display the records with mean radius greater than 25
data[data["Radius MEAN"] > 25]

In [None]:
# TODO: Display the records with mean radius greater than 25 and less than 28
data[(data["Radius MEAN"] > 25) & (data["Radius MEAN"] < 28)]

In [None]:
data.query("`Radius MEAN` > 25 and `Radius MEAN` < 28")
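
The two filtering styles used above, boolean masks and `query`, select the same rows. A minimal sketch on a made-up four-row dataframe (only the column names follow the lab's dataset):

```python
import pandas as pd

# Made-up toy records; only the column names follow the lab's dataset.
toy = pd.DataFrame({
    "Diagnosis": ["M", "B", "M", "B"],
    "Radius MEAN": [26.0, 12.5, 28.5, 25.5],
})

# Boolean mask: each condition needs its own parentheses because & binds
# more tightly than the comparison operators.
by_mask = toy[(toy["Radius MEAN"] > 25) & (toy["Radius MEAN"] < 28)]

# query(): column names that contain spaces are wrapped in backticks.
by_query = toy.query("`Radius MEAN` > 25 and `Radius MEAN` < 28")

print(by_mask)
print(by_mask.equals(by_query))
```

Which style to use is mostly a matter of taste; `query` tends to read better for long conditions, while masks are easier to build programmatically.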

In [None]:
# TODO: Display five top records in terms of the mean radius
data.sort_values(["Radius MEAN"], ascending=False).head()

In [None]:
# TODO: Present the distribution of malignant and benign records depending on the mean radius
data.groupby([data["Radius MEAN"].astype(int), data["Diagnosis"]]).agg({"Diagnosis": "count"})
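
The grouping above uses the integer part of the mean radius as a bin and counts the records per bin and diagnosis. As an alternative view of the same idea, here is a sketch with `pd.crosstab` on made-up values, which puts the diagnoses in columns:

```python
import pandas as pd

# Made-up toy records; only the column names follow the lab's dataset.
toy = pd.DataFrame({
    "Diagnosis": ["B", "B", "M", "M", "M", "B"],
    "Radius MEAN": [11.2, 11.9, 11.5, 20.1, 20.7, 12.3],
})

# Truncate the radius to its integer part, which acts as a bin of width 1.
radius_bin = toy["Radius MEAN"].astype(int)

# Rows are radius bins, columns are diagnoses, cells are record counts.
table = pd.crosstab(radius_bin, toy["Diagnosis"])
print(table)
```

Unlike the `groupby` version, the cross-tabulation fills missing bin and diagnosis combinations with zeros.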

In [None]:
# TODO: Save results of your aggregations to a file
data.sort_values(["Radius MEAN"], ascending=False).head().to_csv("top_5_by_mean_radius.csv", sep=";")
data.sort_values(["Radius MEAN"], ascending=False).head().to_excel("top_5_by_mean_radius.xlsx")

## Lab 2. Shallow machine learning methods

1. Let's take a look at the data that we have available.

In [None]:
# TODO: Display the dataset columns
data.columns

2. Since we're provided with a labeled dataset, the first thing that comes to mind is training a model that processes the remaining columns and predicts the label. We'll implement such models and learn how to measure their performance, all with the help of the [scikit-learn](https://scikit-learn.org/stable/) library. The interface of the machine learning models available in the library requires us to split the data into two distinct parts: the input data, a.k.a. features, and the expected output of the model, a.k.a. labels. It's customary to name those x and y respectively, as the models are essentially transformations (functions) that map x -> y.

In [None]:
# TODO: Prepare the features set
x = data.drop(['ID', 'Diagnosis'], axis=1)
x

In [None]:
# TODO: Prepare the labels set
y = data[['Diagnosis']] == 'M'
y

3. We're now almost ready to train a model, but we're still missing one important step: splitting the dataset into training and test sets. Evaluating the model on records it has not seen during training is what makes the performance measurement fair and lets us detect overfitting. We'll use the [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function available in scikit-learn.

In [None]:
# TODO: Prepare training and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [None]:
# TODO: Verify the class distribution in both sets
print(y_train.groupby("Diagnosis").agg({"Diagnosis":"count"}))

In [None]:
print(y_test.groupby("Diagnosis").agg({"Diagnosis":"count"}))
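
A plain random split does not guarantee that both sets contain the classes in the same proportions. If that matters, `train_test_split` accepts a `stratify` argument. A minimal sketch on made-up labels (80 negatives, 20 positives; the variable names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives and 20 positives.
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

# stratify=labels asks for the same class proportions in both parts.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=1, stratify=labels
)

print(l_train.mean(), l_test.mean())
```

In the lab this would correspond to passing `stratify=y` to the call above; whether it is needed depends on how uneven the unstratified split turns out to be.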

4. Now that our data is prepared, let's see it in action. We'll start by training a simple decision tree model.

In [None]:
# TODO: Train a decision tree classifier
from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier()
model.fit(x_train, y_train.values.ravel())

5. The model is trained, so what do we do now? We need to evaluate its performance somehow. There are plenty of ways to do that and many different metrics we can use. We'll start with a so-called confusion matrix, which shows what kinds of mistakes our model makes.

In [None]:
# TODO: Predict the labels for the records in test set
predicted = model.predict(x_test)

In [None]:
# TODO: Compute and display the confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, predicted)

print(cm)
ConfusionMatrixDisplay(confusion_matrix=cm).plot()
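
To make the layout of the matrix easier to read, here is a small sketch on made-up labels (1 stands for malignant, 0 for benign; the lab itself uses booleans). It unpacks the four cells into named counts.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 means malignant, 0 means benign.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
guessed = [1, 1, 0, 0, 0, 0, 0, 1]

matrix = confusion_matrix(actual, guessed)

# With labels sorted as [0, 1], rows are the actual classes and columns are
# the predicted ones, so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = (int(value) for value in matrix.ravel())

print(matrix)
print("tn:", tn, "fp:", fp, "fn:", fn, "tp:", tp)
```

In a diagnostic setting the false negatives (malignant records predicted as benign) are usually the most costly kind of mistake, so that cell deserves special attention.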

6. Scikit-learn provides a whole framework for calculating model metrics. A convenient summary is available in the form of the [classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#classification-report) function.

In [None]:
# TODO: Compute and display the classification metrics
from sklearn.metrics import classification_report

predicted = model.predict(x_test)

print(classification_report(y_test, predicted))
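
The precision, recall and F1 columns of the report can be reproduced by hand from the confusion matrix counts. A minimal sketch on made-up labels (1 is the positive class; the variable names are illustrative), cross-checked against scikit-learn's own metric functions:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 1 means malignant (positive), 0 means benign.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
guessed = [1, 1, 0, 0, 0, 0, 0, 1]

# Counts for the positive class.
tp = sum(1 for a, g in zip(actual, guessed) if a == 1 and g == 1)
fp = sum(1 for a, g in zip(actual, guessed) if a == 0 and g == 1)
fn = sum(1 for a, g in zip(actual, guessed) if a == 1 and g == 0)

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of both

print(precision, precision_score(actual, guessed))
print(recall, recall_score(actual, guessed))
print(f1, f1_score(actual, guessed))
```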

In [None]:
# TODO: Discover the insights of a decision tree
from sklearn.tree import plot_tree
from matplotlib import pyplot as plt

plt.figure(figsize=(25, 10))
plot_tree(model, fontsize=11, feature_names=x.columns)
plt.show()

7. Let's now train multiple decision trees in parallel and elect the most probable label. That kind of architecture is called a random forest.

In [None]:
# TODO: Repeat the previous steps with the random forest classifier
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=100)
model.fit(x_train, y_train.values.ravel())
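
The evaluation steps can be repeated for the forest in the same way as for the single tree. The sketch below is self-contained, so instead of the notebook's `data` it assumes scikit-learn's bundled copy of the Wisconsin diagnostic dataset (`load_breast_cancer`), where target 0 is malignant and 1 is benign. The `random_state` values and variable names are arbitrary choices made for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Bundled copy of the dataset: 569 records, 30 features.
features, labels = load_breast_cancer(return_X_y=True)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=1
)

# 100 trees, each trained on a bootstrap sample of the training set.
forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(f_train, l_train)

forest_predicted = forest.predict(f_test)
forest_accuracy = accuracy_score(l_test, forest_predicted)

print(len(forest.estimators_))
print(forest_accuracy)
print(classification_report(l_test, forest_predicted))
```

In the notebook itself, the equivalent would be calling `model.predict(x_test)` and reusing the confusion matrix and classification report cells from the previous steps.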

8. Let's now train a completely different model, a logistic-regression-based classifier, but before that we'll need to adjust our input data a bit, by performing scaling.

In [None]:
# TODO: Use the StandardScaler provided by scikit-learn to transform the train and test feature sets.
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)
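
Note that the scaler is fitted on the training set only and then reused for the test set, so no information from the test records leaks into the preprocessing. A minimal sketch of what that means, on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up feature values on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

# After scaling, every training column has mean 0 and standard deviation 1.
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))

# The test row is expressed in "training standard deviations" from the training mean.
print(test_scaled)
```

The test values do not have to end up with zero mean or unit variance, and that is expected.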

In [None]:
# TODO: Train the LogisticRegression model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train_scaled, y_train.values.ravel())

In [None]:
# TODO: Study the classification report
from sklearn.metrics import classification_report

predicted = model.predict(x_test_scaled)

print(classification_report(y_test, predicted))

In [None]:
pd.DataFrame(zip(x_train.columns, model.coef_[0]))

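9. Overfitting. Let's build a heavily imbalanced dataset that keeps all benign records but only 10 malignant ones, train a model on it, and compare the classification reports for the test and training sets.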

In [None]:
benign = data[data["Diagnosis"] == 'B']
malignant = data[data["Diagnosis"] == 'M'].head(10)

data_filtered = pd.concat([benign, malignant])

In [None]:
x = data_filtered.drop(['ID', 'Diagnosis'], axis=1)

In [None]:
y = data_filtered[['Diagnosis']] == 'M'

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

In [None]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train_scaled = sc.fit_transform(x_train)
x_test_scaled = sc.transform(x_test)

In [None]:
from sklearn.linear_model import LogisticRegression

# Try setting the max_iter argument to a very low number.
# This may stop the solver before it converges (scikit-learn then emits a ConvergenceWarning);
# compare the train and test reports below to see how the results change.
model = LogisticRegression(random_state=2)
model.fit(x_train_scaled, y_train.values.ravel())

In [None]:
from sklearn.metrics import classification_report

predicted = model.predict(x_test_scaled)

print(classification_report(y_test, predicted))

In [None]:
predicted = model.predict(x_train_scaled)

print(classification_report(y_train, predicted))
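
Comparing the same metric on the training and test sets, as the two reports above do, is a common way to spot overfitting. The sketch below shows the same comparison with a different model: a decision tree with unlimited depth versus one limited to depth 2. It is self-contained and assumes scikit-learn's bundled copy of the dataset (`load_breast_cancer`); the depth values and variable names are arbitrary illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

features, labels = load_breast_cancer(return_X_y=True)
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=1
)

scores = {}
for depth in (None, 2):  # None means the tree may grow without a depth limit
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(f_train, l_train)
    # score() returns the accuracy on the given set.
    scores[depth] = (tree.score(f_train, l_train), tree.score(f_test, l_test))
    print("max_depth:", depth, "train/test accuracy:", scores[depth])
```

A large gap between the training and test accuracy suggests that the model has memorised the training records rather than learned a pattern that generalises.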