README: Advanced usage example on Logistic regression (#139)

gyrdym committed Jun 23, 2020
1 parent 492f6bc commit 470dc08
Showing 3 changed files with 115 additions and 98 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
# Changelog

## 14.1.1
- `README`: Advanced usage example on Logistic regression added

## 14.1.0
- `Model selection`: `splitData` helper added

208 changes: 111 additions & 97 deletions README.md
@@ -16,7 +16,7 @@ the lib, please do not use it in a browser.
- #### Model selection
- [CrossValidator](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/cross_validator/cross_validator.dart).
Factory that creates instances of cross validators. Cross validation allows researchers to fit different
[hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) of machine learning algorithms,
assessing prediction quality on different parts of a dataset.

- #### Classification algorithms
@@ -33,38 +33,38 @@

- [KnnClassifier](https://github.com/gyrdym/ml_algo/blob/master/lib/src/classifier/knn_classifier/knn_classifier.dart)
A class that performs classification using the `k nearest neighbours algorithm` - it makes predictions based on
the first `k` closest observations to the given one.

- #### Regression algorithms
- [LinearRegressor](https://github.com/gyrdym/ml_algo/blob/master/lib/src/regressor/linear_regressor/linear_regressor.dart).
A class that finds a linear pattern in training data and predicts outcomes as real numbers based on that pattern.

- [KnnRegressor](https://github.com/gyrdym/ml_algo/blob/master/lib/src/regressor/knn_regressor/knn_regressor.dart)
A class that makes a prediction for each new observation based on the first `k` closest observations from the
training data. It can capture a non-linear pattern in the data (see the short sketch right after this list).
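
As a quick taste of the API, here is a minimal sketch that constructs one of the regressors above. It mirrors the `KnnRegressor` call used in the library's examples; the CSV path, the headless-CSV assumption and the `col_13` target name are placeholders rather than references to a real dataset:

```dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';

void main() async {
  // Placeholder dataset: a headless CSV, so the columns get autogenerated names (col_0, col_1, ...).
  final samples = await fromCsv('datasets/housing.csv', headerExists: false);

  print(samples.header);

  // Fit a k-nearest-neighbours regressor that predicts `col_13` using the 4 closest observations.
  final regressor = KnnRegressor(samples, 'col_13', 4);
}
```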

For more information on the library's API, please visit [API reference](https://pub.dev/documentation/ml_algo/latest/ml_algo/ml_algo-library.html)

## Examples

### Logistic regression

Let's classify records from a well-known dataset - the [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database) -
via [Logistic regressor](https://github.com/gyrdym/ml_algo/blob/master/lib/src/classifier/logistic_regressor/logistic_regressor.dart).

Import all necessary packages. First, make sure that you have the `ml_preprocessing` and `ml_dataframe` packages
in your dependencies:

````
dependencies:
  ml_dataframe: ^0.1.1
  ml_preprocessing: ^5.1.0
````

We need these packages to parse the raw data so that we can use it further. For more details, please
visit the [ml_preprocessing](https://github.com/gyrdym/ml_preprocessing) repository page.

````dart
import 'dart:async';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';
````
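
Next, let's read the dataset into a `DataFrame` (the very same call appears in the complete example at the end of this section):

```dart
final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv', headerExists: true);
```
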
The column named `class variable (0 or 1)` contains a class label on each row. This column is our target - we should predict a class label for each observation.
````dart
final targetColumnName = 'class variable (0 or 1)';
````

Now it's time to prepare the data splits. Since we have a smallish dataset (only 768 records), we can't afford to
split the data into just train and test sets and evaluate the model on them; the best approach in our case is
cross-validation. Accordingly, let's split the data in the following way using the library's [splitData](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/split_data.dart)
function:

```dart
final splits = splitData(samples, [0.7]);
final validationData = splits[0];
final testData = splits[1];
```

`splitData` accepts a `DataFrame` instance as the first argument and a list of ratios as the second one. Now we have 70% of our
data as a validation set and 30% as a test set for evaluating the generalization error.
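
If you'd like to double-check the split, printing the row counts should show roughly 70% and 30% of the 768 records (an optional sanity check):

```dart
// Optional sanity check of the split sizes.
print(validationData.rows.length); // roughly 70% of the 768 rows
print(testData.rows.length);       // roughly 30% of the 768 rows
```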

Then we may create an instance of the `CrossValidator` class to fit the [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
of our model. We should pass the validation data (our `validationData` variable), a list of target column names (in our case it's
just the one name stored in the `targetColumnName` variable) and the number of folds into the `CrossValidator` constructor.
The validator will split `validationData` into 5 folds and, for every fold, train the model on the remaining folds and
measure its quality on the held-out one.

````dart
final validator = CrossValidator.KFold(validationData, [targetColumnName], numberOfFolds: 5);
````

Let's create a factory for the classifier with the desired hyperparameters. After the cross-validation we will decide
whether the selected hyperparameters are good enough:

```dart
final createClassifier = (DataFrame samples, _) =>
    LogisticRegressor(
      samples,
      targetColumnName,
      optimizerType: LinearOptimizerType.gradient,
      iterationsLimit: 90,
      learningRateType: LearningRateType.decreasingAdaptive,
      batchSize: samples.rows.length,
      probabilityThreshold: 0.7,
    );
```

Let's describe our hyperparameters:
- `optimizerType` - the type of optimization algorithm that will be used to learn the coefficients of our model; this time
we decided to use the vanilla gradient ascent algorithm
- `iterationsLimit` - the number of learning iterations. The selected optimization algorithm (gradient ascent in our case)
will be run at most this number of times
- `learningRateType` - a strategy for updating the learning rate. In our case the learning rate will decrease after every
iteration
- `batchSize` - the amount of data (in rows) that will be used per iteration. As we have a really small dataset, we may use
full-batch gradient ascent - that's why we use `samples.rows.length` here, the total number of rows.
- `probabilityThreshold` - the lower bound for the positive label probability (see the tiny illustration right after this list)
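
To make the last point concrete, here is a tiny illustration of what a probability threshold does (a sketch of the idea only, not the library's internal implementation):

```dart
// Illustration only: with a threshold of 0.7, a predicted probability of 0.65
// maps to the negative label, while 0.75 maps to the positive one.
int toLabel(double probability, {double threshold = 0.7}) =>
    probability >= threshold ? 1 : 0;
```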

Assume we have chosen good hyperparameters which can lead to a high-performing model. In order to validate our
hypothesis, let's use the `CrossValidator` instance created before:

````dart
final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
````

Since the `CrossValidator` instance returns a [Vector](https://github.com/gyrdym/ml_linalg/blob/master/lib/vector.dart) of scores as a result of our predictor evaluation, we may reduce
all the collected scores to a single number in any way we like - for instance, by using the Vector's `mean` method:

```dart
final accuracy = scores.mean();
```

Let's print the score:
````dart
print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');
````

We will see something like this:

````
accuracy on k fold validation: 0.65
````

Let's assess the model with the chosen hyperparameters on the test set in order to evaluate its generalization error:

```dart
// Train a classifier with the chosen hyperparameters on 80% of the test data
// and assess it on the remaining 20%.
final testSplits = splitData(testData, [0.8]);
final classifier = createClassifier(testSplits[0], [targetColumnName]);
final finalScore = classifier.assess(testSplits[1], [targetColumnName], MetricType.accuracy);
```

The final score will be something like this:

```dart
print(finalScore.toStringAsFixed(2)); // approx. 0.75
```

It seems our model has good generalization ability, which means we may use it in the future. To do so, let's store
the model to a file as JSON:

```dart
await classifier.saveAsJson('diabetes_classifier.json');
```

After that we can simply read the model from the file:

```dart
import 'dart:io';

final fileName = 'diabetes_classifier.json';
final file = File(fileName);
final encodedData = await file.readAsString();
final classifier = LogisticRegressor.fromJson(encodedData);
```
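
Once the model is restored, it can be used right away - for instance, for making predictions on new data. A short sketch, assuming a hypothetical `newObservations` DataFrame that has the same feature columns as the training data:

```dart
// `newObservations` is a hypothetical DataFrame with the same feature columns
// as the training data; `predict` returns a DataFrame with the predicted labels.
final prediction = classifier.predict(newObservations);

print(prediction.header);
```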

All the code above put together:

````dart
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';

void main() async {
  final samples = await fromCsv('datasets/pima_indians_diabetes_database.csv', headerExists: true);
  final targetColumnName = 'class variable (0 or 1)';
  final splits = splitData(samples, [0.7]);
  final validationData = splits[0];
  final testData = splits[1];
  final validator = CrossValidator.KFold(validationData, [targetColumnName], numberOfFolds: 5);
  final createClassifier = (DataFrame samples, _) =>
      LogisticRegressor(
        samples,
        targetColumnName,
        optimizerType: LinearOptimizerType.gradient,
        iterationsLimit: 90,
        learningRateType: LearningRateType.decreasingAdaptive,
        batchSize: samples.rows.length,
        probabilityThreshold: 0.7,
      );
  final scores = await validator.evaluate(createClassifier, MetricType.accuracy);
  final accuracy = scores.mean();

  print('accuracy on k fold validation: ${accuracy.toStringAsFixed(2)}');

  final testSplits = splitData(testData, [0.8]);
  final classifier = createClassifier(testSplits[0], [targetColumnName]);
  final finalScore = classifier.assess(testSplits[1], [targetColumnName], MetricType.accuracy);

  print(finalScore.toStringAsFixed(2));

  await classifier.saveAsJson('diabetes_classifier.json');
}
````

### Contacts
2 changes: 1 addition & 1 deletion pubspec.yaml
@@ -1,6 +1,6 @@
name: ml_algo
description: Machine learning algorithms, Machine learning models performance evaluation functionality
version: 14.1.1
homepage: https://github.com/gyrdym/ml_algo

environment: