Merge pull request #91 from gyrdym/add-more-examples-to-readme
Add softmax regression example to README, one-hot encoding documentation extended
gyrdym committed Mar 11, 2019
2 parents 6f8e87b + e01212a commit f284ced
Showing 4 changed files with 132 additions and 19 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,8 @@
# Changelog

## 9.2.2
- Softmax regression example added to README

## 9.2.1
- README corrected

118 changes: 109 additions & 9 deletions README.md
@@ -8,20 +8,22 @@
**Table of contents**
- [What for is the library?](#what-is-the-ml_algo-for)
- [The library's structure](#the-librarys-structure)
- [Usage](#usage)
- [Examples](#examples)
- [Logistic regression](#logistic-regression)
- [Softmax regression](#softmax-regression)

## What is the ml_algo for?

The main purpose of the library - to give developers, interested both in Dart language and data science, native Dart
implementation of machine learning algorithms. This library targeted to dart vm, so, to get smoothest experience with
the lib, please, do not use it in a browser.

Following algorithms are implemented:
- Linear regression:
**Following algorithms are implemented:**
- *Linear regression:*
- Gradient descent algorithm (batch, mini-batch, stochastic) with ridge regularization
- Lasso regression (feature selection model)

- Linear classifier:
- *Linear classifier:*
- Logistic regression (with "one-vs-all" multiclass classification)
- Softmax regression

@@ -31,7 +33,7 @@ To provide main purposes of machine learning, the library exposes the following

- [DataFrame](https://github.com/gyrdym/ml_algo/blob/master/lib/src/data_preprocessing/data_frame/data_frame.dart).
Factory, that creates instances of different adapters for data. For example, one can create a csv reader, that makes
work with csv data easier: you just need to point, where your dataset resides and then get features and labels in
work with csv data easier: it's just needed to point, where a dataset resides and then get features and labels in
convenient data science friendly format.

- [CrossValidator](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/cross_validator/cross_validator.dart). Factory, that creates
@@ -56,9 +58,9 @@ that performs feature selection along with regression process. It uses [coordina
instead of [gradient descent optimization]() and [gradient vector]() like in `LinearRegressor.gradient` to provide
regression. If you want to decide, which features are less important - go ahead and use this regressor.

## Usage
## Examples

### Real life example
### Logistic regression

Let's classify records from a well-known dataset - [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
via [Logistic regressor](https://github.com/gyrdym/ml_algo/blob/master/lib/src/classifier/linear_classifier.dart)
@@ -85,7 +87,7 @@ on each row. This column is our target - we should predict values of class label
should point, where to get label values. Let's use `labelName` parameter for that (labels column name, 'class variable
(0 or 1)' in our case).

Processed features and labels are contained in a data structure of `Matrix` type. To get more information about
Processed features and labels are contained in data structures of `Matrix` type. To get more information about
`Matrix` type, please, visit [ml_linalg repo](https://github.com/gyrdym/ml_linalg)

Then, we should create an instance of `CrossValidator` class for fitting [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
@@ -103,7 +105,7 @@ final model = LinearClassifier.logisticRegressor(
iterationsLimit: 500,
gradientType: GradientType.batch,
fitIntercept: true,
interceptScale: .1,
interceptScale: 0.1,
learningRateType: LearningRateType.constant);
````

@@ -152,6 +154,104 @@ Future main() async {
}
````

### Softmax regression
Let's classify another famous dataset, the [Iris dataset](https://www.kaggle.com/uciml/iris). The data in this csv file is separated into 3 classes, so we need
a different approach to classification: [Softmax regression](http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/).
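
For readers new to the technique, softmax turns a vector of raw class scores into probabilities that sum to 1. The following is a minimal sketch in plain Dart, not the library's implementation; the `softmax` helper and the sample scores are assumptions made for illustration only:

```dart
import 'dart:math' as math;

/// Hypothetical helper, not part of ml_algo: converts raw class scores
/// into probabilities that sum to 1.
List<double> softmax(List<double> scores) {
  // Subtract the largest score first so that exp() cannot overflow.
  final maxScore = scores.reduce(math.max);
  final exps = scores.map((s) => math.exp(s - maxScore)).toList();
  final sum = exps.reduce((a, b) => a + b);
  return exps.map((e) => e / sum).toList();
}

void main() {
  // Three made-up scores, one per class.
  final probabilities = softmax([2.0, 1.0, 0.1]);
  print(probabilities);
}
```

The class with the highest score also gets the highest probability, which is what a classifier would report as its prediction.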

As usual, start with data preparation:
````Dart
final data = DataFrame.fromCsv('datasets/iris.csv',
  labelName: 'Species',
  columns: [const Tuple2(1, 5)],
  categoryNameToEncoder: {
    'Species': CategoricalDataEncoderType.oneHot,
  },
);
final features = await data.features;
final labels = await data.labels;
````

The csv file has 6 columns, but we need to drop the first one, because it contains only the ID of each
observation, which carries no information for classification. That is why we provided a column range that excludes the ID column:

````Dart
columns: [const Tuple2(1, 5)]
````

Also, since the label column 'Species' has categorical data, we encoded it to numerical format:

````Dart
categoryNameToEncoder: {
'Species': CategoricalDataEncoderType.oneHot,
},
````

To see how encoding works, visit the [api reference](https://pub.dartlang.org/documentation/ml_algo/latest/ml_algo/CategoricalDataEncoderType-class.html).
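
The idea can also be sketched in a few lines of plain Dart. This is an illustration only, not the encoder used by the library: the `oneHot` helper is hypothetical, and the category list assumes the three species names found in the Kaggle Iris csv.

```dart
/// Hypothetical helper, not the library's encoder: returns a list with 1.0
/// at the position of [value] in [categories] and 0.0 everywhere else.
List<double> oneHot(String value, List<String> categories) {
  final index = categories.indexOf(value);
  if (index == -1) {
    throw ArgumentError('Unknown category value: $value');
  }
  return List<double>.generate(
      categories.length, (i) => i == index ? 1.0 : 0.0);
}

void main() {
  // Assumed label values of the 'Species' column.
  const species = ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'];
  for (final name in species) {
    print('$name -> ${oneHot(name, species)}');
  }
}
```

Each label becomes a vector whose length equals the number of classes, which is the shape softmax regression expects for its targets.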

Next step - create a cross validator instance:

````Dart
final validator = CrossValidator.kFold(numberOfFolds: 5);
````
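
As a rough illustration of what a k-fold split does, the sketch below partitions observation indices into contiguous folds, each of which serves once as the test set. The `kFoldIndices` helper is hypothetical; how `CrossValidator.kFold` actually splits or shuffles the data is not shown in this README, so treat this as an assumption-laden sketch.

```dart
/// Hypothetical helper: splits indices 0..n-1 into k contiguous folds.
/// A real validator may shuffle the observations first.
List<List<int>> kFoldIndices(int numberOfObservations, int numberOfFolds) {
  final folds = <List<int>>[];
  final baseSize = numberOfObservations ~/ numberOfFolds;
  final remainder = numberOfObservations % numberOfFolds;
  var start = 0;
  for (var fold = 0; fold < numberOfFolds; fold++) {
    // The first `remainder` folds get one extra observation.
    final size = baseSize + (fold < remainder ? 1 : 0);
    final offset = start;
    folds.add(List<int>.generate(size, (i) => offset + i));
    start += size;
  }
  return folds;
}

void main() {
  // 150 observations (the size of the Iris dataset) split into 5 folds.
  final folds = kFoldIndices(150, 5);
  for (final testIndices in folds) {
    print('test fold: ${testIndices.first}..${testIndices.last}');
  }
}
```

On each iteration the model is trained on the other four folds and scored on the held-out one; the final metric is typically the average over all folds.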

And finally, create an instance of the classifier:

````Dart
final softmaxRegressor = LinearClassifier.softmaxRegressor(
    initialLearningRate: 0.03,
    iterationsLimit: null,
    minWeightsUpdate: 1e-6,
    randomSeed: 46,
    learningRateType: LearningRateType.constant);
````

Evaluate quality of prediction:

````Dart
final accuracy = validator.evaluate(softmaxRegressor, features, labels, MetricType.accuracy);
print('Iris dataset, softmax regression: accuracy is '
'${accuracy.toStringAsFixed(2)}'); // It yields 0.93
````
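
For clarity, accuracy is simply the share of observations whose predicted class matches the true class. A minimal sketch in plain Dart, using a hypothetical `accuracy` helper and made-up class indices rather than the library's `MetricType.accuracy` implementation:

```dart
/// Hypothetical helper: share of predictions that match the true labels.
double accuracy(List<int> predicted, List<int> actual) {
  if (predicted.length != actual.length || predicted.isEmpty) {
    throw ArgumentError('Both lists must be non-empty and of equal length');
  }
  var correct = 0;
  for (var i = 0; i < predicted.length; i++) {
    if (predicted[i] == actual[i]) correct++;
  }
  return correct / predicted.length;
}

void main() {
  // Made-up class indices: 4 of the 5 predictions are correct.
  final score = accuracy([0, 1, 2, 2, 1], [0, 1, 2, 1, 1]);
  print(score.toStringAsFixed(2)); // 0.80
}
```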

Putting all the code above together:

````Dart
import 'dart:async';
import 'package:ml_algo/ml_algo.dart';
import 'package:tuple/tuple.dart';

Future main() async {
  final data = DataFrame.fromCsv('datasets/iris.csv',
    labelName: 'Species',
    columns: [const Tuple2(1, 5)],
    categoryNameToEncoder: {
      'Species': CategoricalDataEncoderType.oneHot,
    },
  );
  final features = await data.features;
  final labels = await data.labels;
  final validator = CrossValidator.kFold(numberOfFolds: 5);
  final softmaxRegressor = LinearClassifier.softmaxRegressor(
      initialLearningRate: 0.03,
      iterationsLimit: null,
      minWeightsUpdate: 1e-6,
      randomSeed: 46,
      learningRateType: LearningRateType.constant);
  final accuracy = validator.evaluate(
      softmaxRegressor, features, labels, MetricType.accuracy);
  print('Iris dataset, softmax regression: accuracy is '
      '${accuracy.toStringAsFixed(2)}');
}
````

For more examples, please see the [examples folder](https://github.com/gyrdym/dart_ml/tree/master/example).

### Contacts
28 changes: 19 additions & 9 deletions lib/src/data_preprocessing/categorical_encoder/encoder_type.dart
@@ -1,25 +1,35 @@
/// Types of categorical data encoders
///
/// [CategoricalDataEncoderType.oneHot] One-hot encoder. Encodes every
/// categorical value to a sequence of all possible values of its category: `1`
/// for the given value, `0` - for the rest values.
/// categorical value to a list of length, that is equal to the number of all
/// possible category's values. Each element of the list is a binary value: `1`
/// for the current value, `0` - for the rest values.
///
/// For example:
///
/// Category `'GENDER'` given. Its possible values:
/// Category `'AGE'` given. Its possible values:
/// ```
/// ['female', 'male']
/// ['0-17', '18-30', '31+']
/// ```
/// Also, we have some data to encode - a list of `'GENDER'` values:
///
/// '0-17' will be encoded as [1.0, 0.0, 0.0]
///
/// '18-30' will be encoded as [0.0, 1.0, 0.0]
///
/// '31+' will be encoded as [0.0, 0.0, 1.0]
///
/// Also, we have some data of this category - a list of `'AGE'` values:
/// ```
/// ['female', 'female', 'male', 'male', 'male', 'female']
/// ['0-17', '0-17', '18-30', '18-30', '18-30', '31+']
/// ```
/// After one-hot encoding the data will be as:
///
/// After one-hot encoding the data will look as follows:
/// ```
/// [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]]
/// [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
/// ```
///
/// [CategoricalDataEncoderType.ordinal] Ordinal encoder. Encodes every
/// categorical value to an ordinal number
/// categorical value to an ordinal number.
///
enum CategoricalDataEncoderType {
ordinal,
2 changes: 1 addition & 1 deletion pubspec.yaml
@@ -1,6 +1,6 @@
name: ml_algo
description: Machine learning algorithms written in native dart (without bindings to any popular ML libraries, just pure Dart implementation)
version: 9.2.1
version: 9.2.2
author: Ilia Gyrdymov <ilgyrd@gmail.com>
homepage: https://github.com/gyrdym/ml_algo
