
Multilayer-Perceptron

1. Project overview

The project is part of my studies at Hive (Helsinki) Coding School. The programs are written in C and Python. For an additional visualization (the bonus part of the project) I used InfluxDB and Grafana, which run inside a Docker container, together with a Jupyter notebook. The picture below gives a general overview of the curriculum (Holy Graph) and my progress at Hive Coding School.

(Figure: Hive Holy Graph)

2. Project description

Note: the text in this chapter (project description) was copied (partly or fully) from the Hive Coding School subject document.

2.1 Introduction

In the language of your choice, you are going to implement a multilayer perceptron in order to predict whether a cancer is malignant or benign, using the Wisconsin breast cancer diagnosis dataset.

2.2 Mandatory part

2.2.1 Foreword

A non-negligible part of the evaluation will be based on your understanding of the training phase (also called the learning phase) and the underlying algorithms. You will be asked to explain to your corrector the notions of feedforward, backpropagation and gradient descent. Points will be attributed depending on the clarity of your explanations. These notions are important for the next projects of the branch and will represent a real asset if you wish to continue in this field.

2.2.2 Dataset

The dataset is provided in the resources. It is a CSV file of 32 columns, the column diagnosis being the label you want to learn given all the other features of an example; it can be either the value M or B (for malignant or benign). The features of the dataset describe the characteristics of a cell nucleus of a breast mass extracted with fine-needle aspiration. (For more detailed information, go here.) As you will see, there is important data-understanding work to do before starting to implement the algorithm which will be able to classify it. A good practice would be to begin by playing with the dataset: displaying it with graphs, visualizing and manipulating its different features.
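As a sketch of that kind of first exploration, something along these lines could be used (this assumes the resource file is named `data.csv` and has no header row; pandas is my choice here, not mandated by the subject):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Assumed layout: column 0 is the ID, column 1 the diagnosis (M/B),
# columns 2..31 the 30 numeric features.
df = pd.read_csv("data.csv", header=None)
df = df.rename(columns={0: "id", 1: "diagnosis"})

print(df.shape)                        # (569, 32) for the Wisconsin dataset
print(df["diagnosis"].value_counts())  # class balance, B vs. M

# Quick look at the feature distributions
df.hist(figsize=(20, 15))
plt.show()
```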

2.2.3 Implementation

Your implementation of the neural network must contain at least two hidden layers (the idea is to make you write a slightly more modular program; although this won't be graded, it is still a good habit to take). You must also implement the softmax function on the output layer in order to obtain the output as a probability distribution. In order to evaluate the performance of your model in a robust way during training, you will split your dataset in two parts, one for training and one for validation (the validation dataset is used to determine the accuracy of your model on unknown examples). To visualize your model's performance during training, you will display the training and validation metrics at each epoch, for example:

    epoch 39/70 - loss: 0.0750 - val_loss: 0.0406
    epoch 40/70 - loss: 0.0749 - val_loss: 0.0404
    epoch 41/70 - loss: 0.0747 - val_loss: 0.0403

You will also implement a learning curve graph displayed at the end of the training phase (you are free to use any library you want for this purpose).
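A minimal matplotlib sketch for such a learning curve (assuming `train_losses` and `val_losses` are lists collected during training; the names are illustrative):

```python
import matplotlib.pyplot as plt

def plot_learning_curve(train_losses, val_losses):
    """Plot training and validation loss per epoch."""
    epochs = range(1, len(train_losses) + 1)
    plt.plot(epochs, train_losses, label="loss")
    plt.plot(epochs, val_losses, label="val_loss")
    plt.xlabel("epoch")
    plt.ylabel("binary cross-entropy")
    plt.legend()
    plt.show()
```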

2.2.4 Submission

You will submit two programs, the first for the training phase and the second for the prediction phase (or you can submit a single program with an option to switch between the two phases):

  • The training program will use backpropagation and gradient descent to learn on the training dataset and will save the model (network topology and weights) at the end of its execution.
  • The prediction program will load the weights learned in the previous phase, perform a prediction on a given set (which will also be loaded), then evaluate it using the binary cross-entropy error function:

$$E = -\frac{1}{N} \sum_{n=1}^{N} \left[ y_n \log p_n + (1 - y_n) \log (1 - p_n) \right]$$

2.2.5 Input file

    ID number Diagnosis Mean Radius Mean Texture Mean Perimeter Mean Area Mean Smoothness Mean Compactness Mean Concavity Mean Concave points Mean Symmetry Mean Fractal dimension Radius SE Texture SE Perimeter SE Area SE Smoothness SE Compactness SE Concavity SE Concave points SE Symmetry SE Fractal dimension SE Worst Radius Worst Texture Worst Perimeter Worst Area Worst Smoothness Worst Compactness Worst Concavity Worst Concave points Worst Symmetry Worst Fractal dimension
    842302 M 17.99 10.38 122.8 1001 0.1184 0.2776 0.3001 0.1471 0.2419 0.07871 1.095 0.9053 8.589 153.4 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.6 2019 0.1622 0.6656 0.7119 0.2654 0.4601 0.1189
    842517 M 20.57 17.77 132.9 1326 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.0186 0.0134 0.01389 0.003532 24.99 23.41 158.8 1956 0.1238 0.1866 0.2416 0.186 0.275 0.08902
    84300903 M 19.69 21.25 130 1203 0.1096 0.1599 0.1974 0.1279 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.00615 0.04006 0.03832 0.02058 0.0225 0.004571 23.57 25.53 152.5 1709 0.1444 0.4245 0.4504 0.243 0.3613 0.08758
    :
    926954 M 16.6 28.08 108.3 858.1 0.08455 0.1023 0.09251 0.05302 0.159 0.05648 0.4564 1.075 3.425 48.55 0.005903 0.03731 0.0473 0.01557 0.01318 0.003892 18.98 34.12 126.7 1124 0.1139 0.3094 0.3403 0.1418 0.2218 0.0782
    927241 M 20.6 29.33 140.1 1265 0.1178 0.277 0.3514 0.152 0.2397 0.07016 0.726 1.595 5.772 86.22 0.006522 0.06158 0.07117 0.01664 0.02324 0.006185 25.74 39.42 184.6 1821 0.165 0.8681 0.9387 0.265 0.4087 0.124
    92751 B 7.76 24.54 47.92 181 0.05263 0.04362 0 0 0.1587 0.05884 0.3857 1.428 2.548 19.15 0.007189 0.00466 0 0 0.02676 0.002783 9.456 30.37 59.16 268.6 0.08996 0.06444 0 0 0.2871 0.07039

3. Solution

3.1 Exploratory data analysis

The aim of this chapter is to give an overview of the dataset's content.

3.1.1 Aggregate functions

Based on the results I made the following observations:

  • The Count function shows that the examples (rows in the dataset) include values for all columns.
  • The Mode function is missing from the list. I think that is not a big issue, because all columns already contain values, so there is no need to impute missing values.
  • For some reason (maybe a lack of competence) I cannot say much more about these numbers.

(Figure: aggregate functions)

3.1.2 Histogram

Six of the mean gauges (Radius, Perimeter, Area, Compactness, Concavity and Concave points) separate the cancer types very well; the remaining four overlap. So maybe only those six gauges could be used as input for a machine-learning model. The error (SE) gauges overlap much more. Since I have no experience with cancer-related medical details, I cannot judge how important the error gauges are from an analysis point of view. The worst gauges behave similarly to the mean gauges. So, based on the histogram visualization, I think the mean gauge values are enough for a valid cancer type prediction.

(Figure: histograms of the mean gauges)

(Figure: histograms of the SE gauges)

(Figure: histograms of the worst gauges)

3.1.3 Scatter matrix

There are three gauges (Radius, Perimeter and Area) that correlate very strongly with each other, so maybe just one of them is enough for cancer type prediction (see the sketch after the figures).

(Figure: scatter matrix of the mean gauges)

(Figure: scatter matrix of the SE gauges)

(Figure: scatter matrix of the worst gauges)
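One way to check this observation numerically is to compute the pairwise correlations directly. A minimal sketch, assuming the dataset is loaded into a pandas DataFrame `df` with the (hypothetical) column names below:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical column names for the three strongly related mean gauges
cols = ["mean_radius", "mean_perimeter", "mean_area"]
print(df[cols].corr())  # r > 0.98 between all three pairs

# Scatter matrix of the same gauges
pd.plotting.scatter_matrix(df[cols], figsize=(8, 8))
plt.show()
```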

3.1.4 Box plot

All gauges have some values that fall outside the box (outliers). So maybe those measurement results should be removed. I still have to check (I don't know yet how to do that professionally) whether those examples are invalid across all gauge values or just contain a random faulty result here and there.

(Figure: box plots of the mean gauges)

(Figure: box plots of the SE gauges)

(Figure: box plots of the worst gauges)

3.1.5 Correlation

The correlation pictures indicate similar things to the scatter matrix pictures, so they give no additional information.

(Figure: correlation matrix of the mean gauges)

(Figure: correlation matrix of the SE gauges)

(Figure: correlation matrix of the worst gauges)

3.2 Implemented functions

The implemented functions are used for logistic regression.

3.2.1 Main steps

  • Input parameter normalization
  • Gradient descent
    • Initialize parameters
    • Forward propagation
    • Cost calculation
    • Initialize backward propagation
    • Backward propagation
    • Update parameters
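As an illustration of these steps, here is a minimal, self-contained Python/NumPy sketch of the one-layer logistic regression case (the actual project code is in C; names and shapes here are illustrative, and normalization is assumed to be done beforehand):

```python
import numpy as np

def train_logreg(X, Y, learning_rate=0.04, epochs=50000):
    """One-layer logistic regression trained with batch gradient descent.

    X: (n_features, m) normalized inputs; Y: (1, m) labels in {0, 1}.
    """
    n, m = X.shape
    W = np.random.rand(1, n)          # initialize parameters: random weights (0..1)
    b = np.zeros((1, 1))              # and zero bias

    for epoch in range(epochs):
        # forward propagation
        Z = W @ X + b
        A = 1.0 / (1.0 + np.exp(-Z))  # sigmoid
        # cost calculation (binary cross-entropy)
        cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
        # backward propagation
        dZ = A - Y
        dW = (dZ @ X.T) / m
        db = np.mean(dZ, keepdims=True)
        # update parameters
        W -= learning_rate * dW
        b -= learning_rate * db
    return W, b
```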

3.2.2 Logistic regression

3.2.2.1 Linear function

$$z = w^{T} x + b$$

3.2.2.2 Non-linear functions
3.2.2.2.1 Sigmoid function

The sigmoid function is used in the last layer to get a result (0 or 1) from the model. Why did I decide to use sigmoid? To be honest, I don't know (maybe in the future I will) of any other function that produces a logistic result (0 < x < 1), and sigmoid is widely used in machine learning models.

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

(Figure: sigmoid curve)

3.2.2.2.2 ReLU function (Rectified Linear Unit)

The main reasons to use the ReLU function:

  • The rectified linear activation is the default activation when developing multilayer perceptrons and convolutional neural networks.
  • The rectified linear activation function overcomes the vanishing gradient problem, allowing models to learn faster and perform better.
  • The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.

$$\mathrm{ReLU}(z) = \max(0, z)$$

(Figure: ReLU curve)
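A minimal NumPy sketch of both activations and their derivatives (the derivatives are what backpropagation needs in chapter 3.2.5):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Gradient is 1 where the unit is active, 0 elsewhere
    return (z > 0).astype(z.dtype)
```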

3.2.3 Gradient descent

3.2.3.1 Overview

(Figure: gradient descent overview)

3.2.4 Cost functions
3.2.4.1 Cross Entropy

$$J = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log (1 - \hat{y}^{(i)}) \right]$$
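A minimal NumPy sketch of this cost function (the `eps` clipping is an implementation detail I am adding to guard against log(0); it is not part of the formula):

```python
import numpy as np

def cross_entropy(Y_hat, Y, eps=1e-12):
    """Binary cross-entropy cost averaged over the m examples."""
    Y_hat = np.clip(Y_hat, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(Y * np.log(Y_hat) + (1.0 - Y) * np.log(1.0 - Y_hat))
```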

3.2.4.2 Other cost functions
3.2.5 Derivative
3.2.5.1 Overview

(Figure: derivative overview)

3.2.5.2 y hat

$$L = -\left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right]$$

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}}$$

3.2.5.3 Z output layer

$$dZ^{[L]} = A^{[L]} - Y$$

3.2.5.4 Z hidden layer

$$dZ^{[l]} = \left( W^{[l+1]T} dZ^{[l+1]} \right) \ast g'\left( Z^{[l]} \right)$$

3.2.5.5 weight

$$dW^{[l]} = \frac{1}{m} \, dZ^{[l]} A^{[l-1]T}$$

3.2.5.6 bias

$$db^{[l]} = \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)}$$

3.2.5.7 Update weight

$$W^{[l]} := W^{[l]} - \alpha \, dW^{[l]}$$

3.2.5.8 Update bias

$$b^{[l]} := b^{[l]} - \alpha \, db^{[l]}$$
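Put together, the formulas in chapters 3.2.5.4 to 3.2.5.8 form one backward step per hidden layer. A NumPy sketch of that step (names and shapes are illustrative, not the actual C API; `dZ_next` and `W_next` come from layer l+1, `A_prev` is the activation of layer l-1, and `g_prime` is the activation derivative):

```python
import numpy as np

def backward_layer(dZ_next, W_next, Z, A_prev, g_prime, learning_rate, W, b):
    """One hidden-layer backward step followed by the parameter update."""
    m = Z.shape[1]                              # number of examples
    dZ = (W_next.T @ dZ_next) * g_prime(Z)      # 3.2.5.4: hidden-layer dZ
    dW = (dZ @ A_prev.T) / m                    # 3.2.5.5: weight gradient
    db = np.sum(dZ, axis=1, keepdims=True) / m  # 3.2.5.6: bias gradient
    W -= learning_rate * dW                     # 3.2.5.7: update weight
    b -= learning_rate * db                     # 3.2.5.8: update bias
    return dZ, W, b
```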

3.2.6 Softmax

The softmax function is a function that turns a vector of K real values into a vector of K real values that sum to 1. The input values can be positive, negative, zero, or greater than one, but the softmax transforms them into values between 0 and 1, so that they can be interpreted as probabilities. If one of the inputs is small or negative, the softmax turns it into a small probability, and if an input is large, then it turns it into a large probability, but it will always remain between 0 and 1. Refer to Softmax Function
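A minimal NumPy sketch of softmax for a single output vector (subtracting the maximum before exponentiating is a common numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    """Turn a vector of K real values into K probabilities that sum to 1."""
    e = np.exp(z - np.max(z))  # shift by the max to avoid overflow
    return e / np.sum(e)
```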

3.2.7 Argmax

In mathematics, the arguments of the maxima (abbreviated arg max or argmax) are the points, or elements, of the domain of some function at which the function values are maximized. In contrast to global maxima, which refers to the largest outputs of a function, arg max refers to the inputs, or arguments, at which the function outputs are as large as possible. Refer to Arg max
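In this project, argmax simply picks the output node with the highest softmax probability, for example:

```python
import numpy as np

probs = np.array([0.83, 0.17])     # e.g. softmax output of a two-node layer
predicted_node = np.argmax(probs)  # -> 0, the first output node wins
```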

3.3 Neural network models

3.3.1 One layer model

The one-layer model is the first model I implemented. The purpose of the model is to implement basic functionality such as normalization, gradient descent, and the linear, sigmoid, cost and argmax functions (based on my previous DSLR project), plus an additional logistic function (softmax). Based on my learnings from previous projects I restructured the C code to support a multilayer implementation with several hidden layers. The picture below highlights the key functions in the model.

(Figure: one-layer model key functions)

3.3.1.1 Naming rules and equations

$$w_j^{[l]}, \quad b_j^{[l]}, \quad z_j^{[l](i)}, \quad a_j^{[l](i)}$$

  • [l]: the superscript value inside square brackets is the layer number. The input layer is always number 0.
  • (i): the superscript value inside round brackets is the example number. The first example is number 1.
  • j: the subscript value is the node number. In this model I have two nodes.

$$z_j^{[1](i)} = w_j^{[1]T} x^{(i)} + b_j^{[1]}$$

$$a_j^{[1](i)} = \mathrm{softmax}\left( z^{[1](i)} \right)_j = \frac{e^{z_j^{[1](i)}}}{\sum_k e^{z_k^{[1](i)}}}$$

$$\hat{y}^{(i)} = \arg\max_j \, a_j^{[1](i)}$$

$$J = -\frac{1}{m} \sum_{i=1}^{m} \sum_j y_j^{(i)} \log a_j^{[1](i)}$$

3.3.2 Three layers model

(Figure: three-layer model)

3.3.2.1 Layer 2

Layer 2 is a hidden layer with three nodes (units). The non-linear function can be either sigmoid or ReLU.

(Figure: layer 2)

3.3.2.2 Layer 3

Layer 3 is a hidden layer with three nodes (units). The non-linear function can be either sigmoid or ReLU.

(Figure: layer 3)
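A sketch of the forward pass through such a multilayer network (ReLU in the hidden layers, softmax on the output; shapes follow the naming rules of chapter 3.3.1.1, but the function itself is illustrative, not the actual C implementation):

```python
import numpy as np

def forward(X, weights, biases):
    """Propagate X (n_features, m) through all layers.

    weights[l] has shape (n_l, n_{l-1}); biases[l] has shape (n_l, 1).
    """
    A = X
    for W, b in zip(weights[:-1], biases[:-1]):
        A = np.maximum(0.0, W @ A + b)    # hidden layers: ReLU
    Z = weights[-1] @ A + biases[-1]      # output layer: softmax
    e = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)
```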

4. Lessons learned

The hyperparameters used:

  • Learning rate: 0.04
  • Epochs: 50000
  • Activation functions: ReLU and sigmoid
  • Weight start values: random (0...1)
  • Bias start values: 0
  • Cost function: refer to chapter 3.2.4.1 (Cross Entropy)
  • Input parameter normalization: values between 0 and 1
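A sketch of how these start values could be set up (NumPy, illustrative; a `layer_dims` such as [30, 3, 3, 2] lists the input size followed by the node count of each layer):

```python
import numpy as np

def initialize_parameters(layer_dims, seed=None):
    """Random weights in (0, 1), zero biases, one pair per layer."""
    rng = np.random.default_rng(seed)
    weights, biases = [], []
    for l in range(1, len(layer_dims)):
        weights.append(rng.random((layer_dims[l], layer_dims[l - 1])))  # 0..1
        biases.append(np.zeros((layer_dims[l], 1)))
    return weights, biases
```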

4.1 Three layers neural network

4.1.1 Bias and Cost relation

(Figure: bias vs. cost)

4.1.2 Weight and Cost relation

(Figure: weight vs. cost)

(Figure: weight vs. cost, continued)

4.2 Four layers neural network

4.2.1 Bias and Cost relation

(Figure: bias vs. cost)

4.2.2 Weight and Cost relation

(Figure: weight vs. cost)

(Figure: weight vs. cost, continued)
