Commit

Merge pull request #83 from gyrdym/softmax-unit-tests
DataFrame introduced, ml_linalg 6.0.0 supported, softmax regression unit tests added, optimizer api changed - Vector -> Matrix
gyrdym committed Mar 5, 2019
2 parents 98ebbc8 + e495e97 commit 75d19a5
Showing 114 changed files with 1,462 additions and 1,134 deletions.
2 changes: 1 addition & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
language: dart
dart:
- "2.1.0"
- "2.2.0"
dart_task:
- test: --platform vm
- dartanalyzer: true
6 changes: 6 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,11 @@
# Changelog

## 9.0.0
- `ml_linalg` v6.0.2 supported
- `Classifier`: type of `weightsByClasses` changed from `Map` to `Matrix`
- `SoftmaxRegressor`: more detailed unit tests for softmax regression added
- Data preprocessing: `DataFrame` introduced (former `MLData`)

## 8.0.0
- `LinearClassifier.softmaxRegressor` implemented
- `Metric` interface refactored (`getError` renamed to `getScore`)
37 changes: 25 additions & 12 deletions README.md
@@ -29,9 +29,10 @@ Following algorithms are implemented:

To provide main purposes of machine learning, the library exposes the following classes:

- [MLData](https://github.com/gyrdym/ml_algo/blob/master/lib/src/data_preprocessing/ml_data/ml_data.dart). Factory, that creates instances of
different adapters for data. For example, one can create a csv reader, that makes work with csv data easier: you just
need to point, where your dataset resides and then get features and labels in convenient data science friendly format.
- [DataFrame](https://github.com/gyrdym/ml_algo/blob/master/lib/src/data_preprocessing/data_frame/data_frame.dart).
Factory, that creates instances of different adapters for data. For example, one can create a csv reader, that makes
work with csv data easier: you just need to point, where your dataset resides and then get features and labels in
convenient data science friendly format.

- [CrossValidator](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/cross_validator/cross_validator.dart). Factory, that creates
instances of a cross validator. In a few words, this entity allows researchers to fit different [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) of machine learning
@@ -70,17 +71,25 @@ import 'dart:async';
import 'package:ml_algo/ml_algo.dart';
````

Read `csv`-file `pima_indians_diabetes_database.csv` with test data. You can use csv from the library's
Read `csv`-file `pima_indians_diabetes_database.csv` with test data. You can use a csv file from the library's
[datasets directory](https://github.com/gyrdym/ml_algo/tree/master/datasets):
````dart
final data = MLData.fromCsvFile('datasets/pima_indians_diabetes_database.csv');
final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv',
labelIdx: 8,
categoryNameToEncoder: {
'class variable (0 or 1)': CategoricalDataEncoderType.oneHot,
});
final features = await data.features;
final labels = await data.labels;
````

Data in this file is represented by 768 records and 8 features. Processed features are contained in a data structure of
`MLMatrix` type and processed labels are contained in a data structure of `MLVector` type. To get
more information about these types, please, visit [ml_linalg repo](https://github.com/gyrdym/ml_linalg)
The data in this file consists of 768 records and 8 features. The 9th column is the label column; it contains either
0 or 1 in each row. This column is our target: we should predict the class label for each observation. Therefore, we
specify where to get the label values with the `labelIdx` parameter (the zero-based index of the label column, 8 in our
case), and we also specify how to encode the labels (one-hot encoding in our case).

Processed features and processed labels are both contained in data structures of `Matrix` type. For more information
about the `Matrix` type, please visit the [ml_linalg repo](https://github.com/gyrdym/ml_linalg).
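
To illustrate what one-hot encoding of a label column produces, here is a minimal sketch in plain Dart. It is not the library's encoder: the function name `oneHotEncode` is hypothetical, and ordering the categories by first appearance is an assumption made only for this example.

```dart
/// Maps every label to a one-hot row. Categories are numbered in order of
/// first appearance (an assumption of this sketch, not library behaviour).
List<List<double>> oneHotEncode(List<String> labels) {
  final categories = <String>[];
  for (final label in labels) {
    if (!categories.contains(label)) categories.add(label);
  }
  return labels.map((label) {
    // a row of zeros with a single 1.0 at the position of the label's category
    final row = List<double>.filled(categories.length, 0.0);
    row[categories.indexOf(label)] = 1.0;
    return row;
  }).toList();
}

void main() {
  // Raw values of a binary label column, as they might appear in a csv file
  final encoded = oneHotEncode(['0', '1', '1', '0']);
  encoded.forEach(print);
}
```

Each label becomes a row with as many columns as there are distinct classes, which is why the processed labels fit a matrix rather than a vector.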

Then, we should create an instance of `CrossValidator` class for fitting [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
of our model
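
For context, k-fold cross-validation splits the observations into k folds and holds each fold out in turn for testing. The sketch below shows one simple way to compute such folds in plain Dart. The function name `kFoldIndices` is hypothetical, and the library's `CrossValidator.kFold` may split or shuffle the data differently.

```dart
/// Splits the indices 0..n-1 into k consecutive folds whose sizes differ
/// by at most one. No shuffling is performed in this sketch.
List<List<int>> kFoldIndices(int n, int k) {
  final folds = <List<int>>[];
  var start = 0;
  for (var i = 0; i < k; i++) {
    // the first (n % k) folds receive one extra observation
    final size = n ~/ k + (i < n % k ? 1 : 0);
    folds.add(List<int>.generate(size, (j) => start + j));
    start += size;
  }
  return folds;
}

void main() {
  // 768 records, as in the dataset above, split into 5 folds
  final folds = kFoldIndices(768, 5);
  for (var i = 0; i < folds.length; i++) {
    // fold i is held out for testing; the other folds are used for training
    print('fold $i: ${folds[i].length} test observations');
  }
}
```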
@@ -118,15 +127,15 @@ if (accuracy > maxAccuracy) {

Let's print score:
````dart
print('best accuracy on classification: ${(maxAccuracy * 100).toFixed(2)}');
print('best accuracy on classification: ${maxAccuracy.toFixed(2)}');
print('best learning rate: ${bestLearningRate.toFixed(3)}');
````

Best model parameters search takes much time so far, so be patient. After the search is over, we will see something like
this:

````
best accuracy on classification: 67.0%
best accuracy on classification: 0.68
best learning rate: 0.155
````

@@ -137,7 +146,11 @@ import 'dart:async';
import 'package:ml_algo/ml_algo.dart';
Future<double> logisticRegression() async {
final data = CsvMLData.fromFile('datasets/pima_indians_diabetes_database.csv');
final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv',
labelIdx: 8,
categoryNameToEncoder: {
'class variable (0 or 1)': CategoricalDataEncoderType.oneHot,
});
final features = await data.features;
final labels = await data.labels;
@@ -161,7 +174,7 @@ Future<double> logisticRegression() async {
}
}
print('best accuracy on classification: ${(maxAccuracy * 100).toFixed(2)}');
print('best accuracy on classification: ${maxAccuracy.toFixed(2)}');
print('best learning rate: ${bestLearningRate.toFixed(3)}');
}
````
7 changes: 3 additions & 4 deletions benchmark/gradient_descent_regression.dart
@@ -4,10 +4,9 @@ import 'dart:typed_data';
import 'package:benchmark_harness/benchmark_harness.dart';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_linalg/matrix.dart';
import 'package:ml_linalg/vector.dart';

MLMatrix features;
MLVector labels;
Matrix features;
Matrix labels;
LinearRegressor regressor;

class GDRegressorBenchmark extends BenchmarkBase {
@@ -31,7 +30,7 @@ class GDRegressorBenchmark extends BenchmarkBase {
}

Future gradientDescentRegressionBenchmark() async {
final data = MLData.fromCsvFile('datasets/advertising.csv',
final data = DataFrame.fromCsv('datasets/advertising.csv',
dtype: Float32x4, labelIdx: 3);
features = await data.features;
labels = await data.labels;
7 changes: 3 additions & 4 deletions benchmark/logistic_regression.dart
@@ -4,10 +4,9 @@ import 'dart:typed_data';
import 'package:benchmark_harness/benchmark_harness.dart';
import 'package:ml_algo/ml_algo.dart';
import 'package:ml_linalg/matrix.dart';
import 'package:ml_linalg/vector.dart';

MLMatrix features;
MLVector labels;
Matrix features;
Matrix labels;
LinearClassifier regressor;

class LogisticRegressorBenchmark extends BenchmarkBase {
@@ -31,7 +30,7 @@ class LogisticRegressorBenchmark extends BenchmarkBase {
}

Future logisticRegressionBenchmark() async {
final data = MLData.fromCsvFile('datasets/pima_indians_diabetes_database.csv',
final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv',
labelIdx: 8, dtype: Float32x4);
features = await data.features;
labels = await data.labels;
6 changes: 3 additions & 3 deletions benchmark/one_hot_encoder.dart
@@ -2,12 +2,12 @@ import 'package:benchmark_harness/benchmark_harness.dart';
import 'package:ml_algo/src/data_preprocessing/categorical_encoder/one_hot_encoder.dart';

class OneHotEncoderBenchmark extends BenchmarkBase {
final OneHotEncoder _encoder;
final Iterable<Object> _data;

OneHotEncoderBenchmark(this._encoder, this._data)
: super('One Hot Encoder benchmark');

final OneHotEncoder _encoder;
final Iterable<Object> _data;

@override
void run() {
_data.forEach(_encoder.encodeSingle);
20 changes: 11 additions & 9 deletions example/classification/logistic_regression.dart
@@ -1,26 +1,28 @@
import 'dart:async';
import 'dart:typed_data';

import 'package:ml_algo/ml_algo.dart';

Future main() async {
final data = MLData.fromCsvFile('datasets/pima_indians_diabetes_database.csv',
labelIdx: 8, dtype: Float32x4);
final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv',
labelIdx: 8,
categoryNameToEncoder: {
'class variable (0 or 1)': CategoricalDataEncoderType.oneHot,
},
);

final features = await data.features;
final labels = await data.labels;

final validator = CrossValidator.kFold(numberOfFolds: 5, dtype: Float32x4);

// lr=0.0102, randomSeed=134, minWeightsUpdate: 0.000000000001, iterationLimit: 100 => error = 0.3449
final validator = CrossValidator.kFold(numberOfFolds: 5);

final logisticRegressor = LinearClassifier.logisticRegressor(
initialLearningRate: 0.0102,
initialLearningRate: 0.00001,
iterationsLimit: 7000,
learningRateType: LearningRateType.constant,
randomSeed: 134);
randomSeed: 150);

final accuracy = validator.evaluate(
logisticRegressor, features, labels, MetricType.accuracy);

print('Accuracy is ${(accuracy * 100).toStringAsFixed(2)}%');
print('Accuracy is ${accuracy.toStringAsFixed(2)}');
}
15 changes: 7 additions & 8 deletions example/classification/sofmax_regression.dart
@@ -4,12 +4,11 @@ import 'package:ml_algo/ml_algo.dart';
import 'package:tuple/tuple.dart';

Future main() async {
final data = MLData.fromCsvFile('datasets/iris.csv',
final data = DataFrame.fromCsv('datasets/iris.csv',
labelIdx: 5,
columns: [const Tuple2<int, int>(1, 5)],
encoderType: CategoricalDataEncoderType.ordinal,
categories: {
'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
categoryNameToEncoder: {
'Species': CategoricalDataEncoderType.oneHot,
},
);

@@ -19,14 +18,14 @@ Future main() async {
final validator = CrossValidator.kFold(numberOfFolds: 5);

final softmaxRegressor = LinearClassifier.softmaxRegressor(
initialLearningRate: 0.00053,
iterationsLimit: 500,
minWeightsUpdate: null,
initialLearningRate: 0.03,
iterationsLimit: null,
minWeightsUpdate: 1e-6,
randomSeed: 46,
learningRateType: LearningRateType.constant);

final accuracy = validator.evaluate(
softmaxRegressor, features, labels, MetricType.accuracy);

print('Accuracy is ${(accuracy * 100).toStringAsFixed(2)}%');
print('Accuracy is ${accuracy.toStringAsFixed(2)}');
}
23 changes: 14 additions & 9 deletions example/main.dart
@@ -2,29 +2,34 @@ import 'dart:async';

import 'package:ml_algo/ml_algo.dart';
import 'package:ml_linalg/matrix.dart';
import 'package:ml_linalg/vector.dart';

/// A simple usage example using synthetic data. To see more complex examples, please, visit other directories in this
/// folder
/// A simple usage example using synthetic data. To see more complex examples,
/// please, visit other directories in this folder
Future main() async {
// Let's create a feature matrix (a set of independent variables)
final features = MLMatrix.from([
final features = Matrix.from([
[2.0, 3.0, 4.0, 5.0],
[12.0, 32.0, 1.0, 3.0],
[27.0, 3.0, 0.0, 59.0],
]);

// Let's create dependent variables vector. It will be used as `true` values to adjust regression coefficients
final labels = MLVector.from([4.3, 3.5, 2.1]);
// Let's create dependent variables vector. It will be used as `true` values
// to adjust regression coefficients
final labels = Matrix.from([
[4.3],
[3.5],
[2.1]
]);

// Let's create a regressor itself. With its help we can train some linear model to predict a label value for a new
// features
// Let's create a regressor itself. With its help we can train some linear
// model to predict label values for new features
final regressor = LinearRegressor.gradient(
iterationsLimit: 100,
initialLearningRate: 0.0005,
learningRateType: LearningRateType.constant);

// Let's train our model (training or fitting is a coefficients adjusting process)
// Let's train our model (training or fitting is a coefficients
// adjusting process)
regressor.fit(features, labels);

// Let's see adjusted coefficients
2 changes: 1 addition & 1 deletion example/regression/lasso_regression.dart
@@ -5,7 +5,7 @@ import 'package:ml_algo/ml_algo.dart';
import 'package:tuple/tuple.dart';

Future main() async {
final data = MLData.fromCsvFile('datasets/advertising.csv',
final data = DataFrame.fromCsv('datasets/advertising.csv',
columns: [const Tuple2<int, int>(1, 4)], labelIdx: 4, dtype: Float32x4);
final features = await data.features;
final labels = await data.labels;
4 changes: 2 additions & 2 deletions example/regression/stochastic_gradient_descent.dart
@@ -4,7 +4,7 @@ import 'package:ml_algo/ml_algo.dart';
import 'package:tuple/tuple.dart';

Future main() async {
final data = MLData.fromCsvFile(
final data = DataFrame.fromCsv(
'datasets/black_friday.csv',
labelIdx: 11,
rows: [const Tuple2(0, 2999)],
@@ -36,5 +36,5 @@ Future main() async {
final error =
validator.evaluate(regressor, features, labels, MetricType.mape);

print('MAPE error on k-fold validation: $error');
print('MAPE error on k-fold validation: ${error.toStringAsFixed(2)}%');
}
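
The changed print statement appends `%` to the score, which suggests the MAPE value is reported as a percentage. A minimal sketch of that metric in plain Dart follows. The function name `mape` is hypothetical, and scaling by 100 is an assumption inferred from the `%` suffix rather than something the diff confirms about the library's metric.

```dart
/// Mean absolute percentage error: the mean of |(actual - predicted) / actual|,
/// scaled by 100. Assumes no actual value is zero.
double mape(List<double> actual, List<double> predicted) {
  var sum = 0.0;
  for (var i = 0; i < actual.length; i++) {
    sum += ((actual[i] - predicted[i]) / actual[i]).abs();
  }
  return 100 * sum / actual.length;
}

void main() {
  // both predictions are off by 10% of the true value
  final error = mape([100.0, 200.0], [110.0, 180.0]);
  print('MAPE: ${error.toStringAsFixed(2)}%'); // prints "MAPE: 10.00%"
}
```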
4 changes: 2 additions & 2 deletions lib/ml_algo.dart
@@ -4,13 +4,13 @@ export 'package:ml_algo/src/classifier/classifier.dart';
export 'package:ml_algo/src/classifier/linear_classifier.dart';
export 'package:ml_algo/src/data_preprocessing/categorical_encoder/encode_unknown_strategy_type.dart';
export 'package:ml_algo/src/data_preprocessing/categorical_encoder/encoder_type.dart';
export 'package:ml_algo/src/data_preprocessing/ml_data/ml_data.dart';
export 'package:ml_algo/src/data_preprocessing/data_frame/data_frame.dart';
export 'package:ml_algo/src/metric/classification/type.dart';
export 'package:ml_algo/src/metric/metric_type.dart';
export 'package:ml_algo/src/metric/regression/type.dart';
export 'package:ml_algo/src/model_selection/cross_validator/cross_validator.dart';
export 'package:ml_algo/src/optimizer/gradient/learning_rate_generator/learning_rate_type.dart';
export 'package:ml_algo/src/optimizer/optimizer_type.dart';
export 'package:ml_algo/src/predictor.dart';
export 'package:ml_algo/src/predictor/predictor.dart';
export 'package:ml_algo/src/regressor/gradient_type.dart';
export 'package:ml_algo/src/regressor/linear_regressor.dart';
17 changes: 8 additions & 9 deletions lib/src/classifier/classifier.dart
@@ -1,24 +1,23 @@
import 'package:ml_algo/src/predictor.dart';
import 'package:ml_algo/src/predictor/predictor.dart';
import 'package:ml_linalg/matrix.dart';
import 'package:ml_linalg/vector.dart';

/// An interface for any classifier (linear, non-linear, parametric,
/// non-parametric, etc.)
abstract class Classifier implements Predictor {
/// A map, where each key is a class label and each value, associated with
/// the key, is a set of weights (coefficients), specific for the class
Map<double, MLVector> get weightsByClasses;
/// A matrix, where each column is a vector of weights, associated with
/// the specific class
Matrix get weightsByClasses;

/// A collection of encoded class labels. Can be transformed back to original
/// A collection of class labels. Can be transformed back to original
/// labels by a [MLData] instance, that was used previously to encode the
/// labels
Iterable<double> get classLabels;
Matrix get classLabels;

/// Returns predicted distribution of probabilities for each observation in
/// the passed [features]
MLMatrix predictProbabilities(MLMatrix features);
Matrix predictProbabilities(Matrix features);

/// Return a collection of predicted class labels for each observation in the
/// passed [features]
MLVector predictClasses(MLMatrix features);
Matrix predictClasses(Matrix features);
}
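
The updated doc comment describes `weightsByClasses` as a matrix in which each column holds the weights of one class. The sketch below shows how such a column-per-class layout can yield a probability distribution through softmax. It uses plain Dart lists instead of the `Matrix` type from ml_linalg, the function name `softmaxProbabilities` is hypothetical, and it is not the library's implementation.

```dart
import 'dart:math' as math;

/// Returns class probabilities for a single observation.
/// [weights] has one row per feature and one column per class.
List<double> softmaxProbabilities(
    List<double> features, List<List<double>> weights) {
  final classCount = weights.first.length;
  // the score of class c is the dot product of the features and column c
  final scores = List<double>.generate(classCount, (c) {
    var sum = 0.0;
    for (var f = 0; f < features.length; f++) {
      sum += features[f] * weights[f][c];
    }
    return sum;
  });
  // subtracting the maximum score keeps the exponentials numerically stable
  final maxScore = scores.reduce(math.max);
  final exps = scores.map((s) => math.exp(s - maxScore)).toList();
  final total = exps.reduce((a, b) => a + b);
  return exps.map((e) => e / total).toList();
}

void main() {
  // 2 features, 3 classes: column c holds the weights of class c
  final weights = [
    [0.5, -0.2, 0.1],
    [0.3, 0.8, -0.4],
  ];
  final probabilities = softmaxProbabilities([1.0, 2.0], weights);
  print(probabilities); // three positive values that sum to 1
}
```

Keeping one column per class means the scores for all classes come from a single matrix product of the features and the weights, which fits the `Vector -> Matrix` change described in the commit message.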
5 changes: 0 additions & 5 deletions lib/src/classifier/labels_processor/labels_processor.dart

This file was deleted.
