Merge pull request #83 from gyrdym/softmax-unit-tests

DataFrame introduced, ml_linalg 6.0.0 supported, softmax regression unit tests added, optimizer api changed - Vector -> Matrix
gyrdym · Mar 5, 2019 · 75d19a5
2 parents 98ebbc8 + e495e97

Showing 114 changed files with 1,462 additions and 1,134 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -1,6 +1,6 @@
 language: dart
 dart:
-  - "2.1.0"
+  - "2.2.0"
 dart_task:
   - test: --platform vm
   - dartanalyzer: true

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,11 @@
 # Changelog
 
+## 9.0.0
+- `ml_linalg` v6.0.2 supported
+- `Classifier`: type of `weightsByClasses` changed from `Map` to `Matrix` 
+- `SoftmaxRegressor`: more detailed unit tests for softmax regression added
+- Data preprocessing: `DataFrame` introduced (formerly `MLData`)
+
 ## 8.0.0
 - `LinearClassifier.softmaxRegressor` implemented
 - `Metric` interface refactored (`getError` renamed to `getScore`)

diff --git a/README.md b/README.md
@@ -29,9 +29,10 @@ Following algorithms are implemented:
 
 To provide main purposes of machine learning, the library exposes the following classes:
 
-- [MLData](https://github.com/gyrdym/ml_algo/blob/master/lib/src/data_preprocessing/ml_data/ml_data.dart). Factory, that creates instances of 
-different adapters for data. For example, one can create a csv reader, that makes work with csv data easier: you just 
-need to point, where your dataset resides and then get features and labels in convenient data science friendly format.
+- [DataFrame](https://github.com/gyrdym/ml_algo/blob/master/lib/src/data_preprocessing/data_frame/data_frame.dart). 
+Factory, that creates instances of different adapters for data. For example, one can create a csv reader, that makes 
+work with csv data easier: you just need to point, where your dataset resides and then get features and labels in 
+convenient data science friendly format.
 
 - [CrossValidator](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/cross_validator/cross_validator.dart). Factory, that creates 
 instances of a cross validator. In a few words, this entity allows researchers to fit different [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) of machine learning
@@ -70,17 +71,25 @@ import 'dart:async';
 import 'package:ml_algo/ml_algo.dart';
 ````
 
-Read `csv`-file `pima_indians_diabetes_database.csv` with test data. You can use csv from the library's 
+Read `csv`-file `pima_indians_diabetes_database.csv` with test data. You can use a csv file from the library's 
 [datasets directory](https://github.com/gyrdym/ml_algo/tree/master/datasets):
 ````dart
-final data = MLData.fromCsvFile('datasets/pima_indians_diabetes_database.csv');
+final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv', 
+  labelIdx: 8,
+  categoryNameToEncoder: {
+    'class variable (0 or 1)': CategoricalDataEncoderType.oneHot,
+  });
 final features = await data.features;
 final labels = await data.labels;
 ````
 
-Data in this file is represented by 768 records and 8 features. Processed features are contained in a data structure of 
-`MLMatrix` type and processed labels are contained in a data structure of `MLVector` type. To get 
-more information about these types, please, visit [ml_linal repo](https://github.com/gyrdym/ml_linalg)
+Data in this file is represented by 768 records and 8 features. The 9th column is the label column; it contains either 0 or 1
+ in each row. This column is our target: we should predict the class label for each observation. Therefore, we
+ should specify where to get the label values via the `labelIdx` parameter (the label column index, 8 in our case), and
+ we should also specify how to encode the labels (one-hot encoding in our case).
+
+ Processed features are contained in a data structure of `Matrix` type, and processed labels are also contained in a data 
+ structure of `Matrix` type. To get more information about the `Matrix` type, please visit the [ml_linalg repo](https://github.com/gyrdym/ml_linalg)
 
 Then, we should create an instance of `CrossValidator` class for fitting [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) 
 of our model
@@ -118,15 +127,15 @@ if (accuracy > maxAccuracy) {
 
 Let's print score:
 ````dart
-print('best accuracy on classification: ${(maxAccuracy * 100).toFixed(2)}');
+print('best accuracy on classification: ${maxAccuracy.toStringAsFixed(2)}');
 print('best learning rate: ${bestLearningRate.toStringAsFixed(3)}');
 ````
 
 Best model parameters search takes much time so far, so be patient. After the search is over, we will see something like 
 this:
 
 ````
-best acuracy on classification: 67.0%
+best accuracy on classification: 0.68
 best learning rate: 0.155
 ````
 
@@ -137,7 +146,11 @@ import 'dart:async';
 import 'package:ml_algo/ml_algo.dart';
 
 Future<double> logisticRegression() async {
-  final data = CsvMLData.fromFile('datasets/pima_indians_diabetes_database.csv');
+  final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv', 
+     labelIdx: 8,
+     categoryNameToEncoder: {
+       'class variable (0 or 1)': CategoricalDataEncoderType.oneHot,
+  });
   final features = await data.features;
   final labels = await data.labels;
 
@@ -161,7 +174,7 @@ Future<double> logisticRegression() async {
     }
   }
 
-  print('best accuracy on classification: ${(maxAccuracy * 100).toFixed(2)}');
+  print('best accuracy on classification: ${maxAccuracy.toStringAsFixed(2)}');
   print('best learning rate: ${bestLearningRate.toStringAsFixed(3)}');
 }
 ````
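
Note on the README change above: the label column is now one-hot encoded, so the labels arrive as a matrix with one column per class. The following is a minimal hand-rolled sketch of that idea, assuming categories are ordered by first appearance. `oneHotEncode` is a hypothetical helper written for illustration; it is not the encoder used by `ml_algo`, whose column ordering may differ.

```dart
// Illustrative only: a hand-rolled one-hot encoder, not ml_algo's code.
List<List<double>> oneHotEncode(List<String> values) {
  // Collect the distinct categories in order of first appearance.
  final categories = <String>[];
  for (final value in values) {
    if (!categories.contains(value)) {
      categories.add(value);
    }
  }
  // Each row gets 1.0 in the column of its own category and 0.0 elsewhere.
  return values
      .map((value) => List<double>.generate(
          categories.length, (i) => categories[i] == value ? 1.0 : 0.0))
      .toList();
}

void main() {
  // A binary label column, as in the diabetes dataset (values 0 or 1).
  final encoded = oneHotEncode(['0', '1', '1', '0']);
  encoded.forEach(print);
}
```

For a binary label column this yields two columns, which is consistent with labels being returned as a `Matrix` rather than a vector.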

diff --git a/benchmark/gradient_descent_regression.dart b/benchmark/gradient_descent_regression.dart
@@ -4,10 +4,9 @@ import 'dart:typed_data';
 import 'package:benchmark_harness/benchmark_harness.dart';
 import 'package:ml_algo/ml_algo.dart';
 import 'package:ml_linalg/matrix.dart';
-import 'package:ml_linalg/vector.dart';
 
-MLMatrix features;
-MLVector labels;
+Matrix features;
+Matrix labels;
 LinearRegressor regressor;
 
 class GDRegressorBenchmark extends BenchmarkBase {
@@ -31,7 +30,7 @@ class GDRegressorBenchmark extends BenchmarkBase {
 }
 
 Future gradientDescentRegressionBenchmark() async {
-  final data = MLData.fromCsvFile('datasets/advertising.csv',
+  final data = DataFrame.fromCsv('datasets/advertising.csv',
       dtype: Float32x4, labelIdx: 3);
   features = await data.features;
   labels = await data.labels;

diff --git a/benchmark/logistic_regression.dart b/benchmark/logistic_regression.dart
@@ -4,10 +4,9 @@ import 'dart:typed_data';
 import 'package:benchmark_harness/benchmark_harness.dart';
 import 'package:ml_algo/ml_algo.dart';
 import 'package:ml_linalg/matrix.dart';
-import 'package:ml_linalg/vector.dart';
 
-MLMatrix features;
-MLVector labels;
+Matrix features;
+Matrix labels;
 LinearClassifier regressor;
 
 class LogisticRegressorBenchmark extends BenchmarkBase {
@@ -31,7 +30,7 @@ class LogisticRegressorBenchmark extends BenchmarkBase {
 }
 
 Future logisticRegressionBenchmark() async {
-  final data = MLData.fromCsvFile('datasets/pima_indians_diabetes_database.csv',
+  final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv',
       labelIdx: 8, dtype: Float32x4);
   features = await data.features;
   labels = await data.labels;

diff --git a/benchmark/one_hot_encoder.dart b/benchmark/one_hot_encoder.dart
@@ -2,12 +2,12 @@ import 'package:benchmark_harness/benchmark_harness.dart';
 import 'package:ml_algo/src/data_preprocessing/categorical_encoder/one_hot_encoder.dart';
 
 class OneHotEncoderBenchmark extends BenchmarkBase {
-  final OneHotEncoder _encoder;
-  final Iterable<Object> _data;
-
   OneHotEncoderBenchmark(this._encoder, this._data)
       : super('One Hot Encoder benchmark');
 
+  final OneHotEncoder _encoder;
+  final Iterable<Object> _data;
+
   @override
   void run() {
     _data.forEach(_encoder.encodeSingle);

diff --git a/example/classification/logistic_regression.dart b/example/classification/logistic_regression.dart
@@ -1,26 +1,28 @@
 import 'dart:async';
-import 'dart:typed_data';
 
 import 'package:ml_algo/ml_algo.dart';
 
 Future main() async {
-  final data = MLData.fromCsvFile('datasets/pima_indians_diabetes_database.csv',
-      labelIdx: 8, dtype: Float32x4);
+  final data = DataFrame.fromCsv('datasets/pima_indians_diabetes_database.csv',
+    labelIdx: 8,
+    categoryNameToEncoder: {
+      'class variable (0 or 1)': CategoricalDataEncoderType.oneHot,
+    },
+  );
 
   final features = await data.features;
   final labels = await data.labels;
 
-  final validator = CrossValidator.kFold(numberOfFolds: 5, dtype: Float32x4);
-
-  // lr=0.0102, randomSeed=134, minWeightsUpdate: 0.000000000001, iterationLimit: 100 => error = 0.3449
+  final validator = CrossValidator.kFold(numberOfFolds: 5);
 
   final logisticRegressor = LinearClassifier.logisticRegressor(
-      initialLearningRate: 0.0102,
+      initialLearningRate: 0.00001,
+      iterationsLimit: 7000,
       learningRateType: LearningRateType.constant,
-      randomSeed: 134);
+      randomSeed: 150);
 
   final accuracy = validator.evaluate(
       logisticRegressor, features, labels, MetricType.accuracy);
 
-  print('Accuracy is ${(accuracy * 100).toStringAsFixed(2)}%');
+  print('Accuracy is ${accuracy.toStringAsFixed(2)}');
 }
diff --git a/example/classification/sofmax_regression.dart b/example/classification/sofmax_regression.dart
@@ -4,12 +4,11 @@ import 'package:ml_algo/ml_algo.dart';
 import 'package:tuple/tuple.dart';
 
 Future main() async {
-  final data = MLData.fromCsvFile('datasets/iris.csv',
+  final data = DataFrame.fromCsv('datasets/iris.csv',
     labelIdx: 5,
     columns: [const Tuple2<int, int>(1, 5)],
-    encoderType: CategoricalDataEncoderType.ordinal,
-    categories: {
-      'Species': ['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'],
+    categoryNameToEncoder: {
+      'Species': CategoricalDataEncoderType.oneHot,
     },
   );
 
@@ -19,14 +18,14 @@ Future main() async {
   final validator = CrossValidator.kFold(numberOfFolds: 5);
 
   final softmaxRegressor = LinearClassifier.softmaxRegressor(
-      initialLearningRate: 0.00053,
-      iterationsLimit: 500,
-      minWeightsUpdate: null,
+      initialLearningRate: 0.03,
+      iterationsLimit: null,
+      minWeightsUpdate: 1e-6,
       randomSeed: 46,
       learningRateType: LearningRateType.constant);
 
   final accuracy = validator.evaluate(
       softmaxRegressor, features, labels, MetricType.accuracy);
 
-  print('Accuracy is ${(accuracy * 100).toStringAsFixed(2)}%');
+  print('Accuracy is ${accuracy.toStringAsFixed(2)}');
 }
diff --git a/example/main.dart b/example/main.dart
@@ -2,29 +2,34 @@ import 'dart:async';
 
 import 'package:ml_algo/ml_algo.dart';
 import 'package:ml_linalg/matrix.dart';
-import 'package:ml_linalg/vector.dart';
 
-/// A simple usage example using synthetic data. To see more complex examples, please, visit other directories in this
-/// folder
+/// A simple usage example using synthetic data. To see more complex examples,
+/// please, visit other directories in this folder
 Future main() async {
   // Let's create a feature matrix (a set of independent variables)
-  final features = MLMatrix.from([
+  final features = Matrix.from([
     [2.0, 3.0, 4.0, 5.0],
     [12.0, 32.0, 1.0, 3.0],
     [27.0, 3.0, 0.0, 59.0],
   ]);
 
-  // Let's create dependent variables vector. It will be used as `true` values to adjust regression coefficients
-  final labels = MLVector.from([4.3, 3.5, 2.1]);
+  // Let's create dependent variables vector. It will be used as `true` values
+  // to adjust regression coefficients
+  final labels = Matrix.from([
+    [4.3],
+    [3.5],
+    [2.1]
+  ]);
 
-  // Let's create a regressor itself. With its help we can train some linear model to predict a label value for a new
-  // features
+  // Let's create a regressor itself. With its help we can train some linear
+  // model to predict label values for new features
   final regressor = LinearRegressor.gradient(
       iterationsLimit: 100,
       initialLearningRate: 0.0005,
       learningRateType: LearningRateType.constant);
 
-  // Let's train our model (training or fitting is a coefficients adjusting process)
+  // Let's train our model (training or fitting is a coefficients
+  // adjusting process)
   regressor.fit(features, labels);
 
   // Let's see adjusted coefficients
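
Note on the hunk above: labels that used to be a flat `MLVector` are now passed as a single-column `Matrix`. Callers who hold labels as a flat list might reshape them first. The sketch below uses plain Dart only; `toColumn` is a hypothetical helper, not part of `ml_algo` or `ml_linalg`.

```dart
// Hypothetical helper: wraps each value in its own row, producing the
// nested-list column layout used for `labels` in the example above.
List<List<double>> toColumn(List<double> values) =>
    values.map((value) => [value]).toList();

void main() {
  final column = toColumn([4.3, 3.5, 2.1]);
  print(column); // [[4.3], [3.5], [2.1]]
}
```

The resulting nested list has the same shape as the literal passed to `Matrix.from` for `labels` in this file.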

diff --git a/example/regression/lasso_regression.dart b/example/regression/lasso_regression.dart
@@ -5,7 +5,7 @@ import 'package:ml_algo/ml_algo.dart';
 import 'package:tuple/tuple.dart';
 
 Future main() async {
-  final data = MLData.fromCsvFile('datasets/advertising.csv',
+  final data = DataFrame.fromCsv('datasets/advertising.csv',
       columns: [const Tuple2<int, int>(1, 4)], labelIdx: 4, dtype: Float32x4);
   final features = await data.features;
   final labels = await data.labels;

diff --git a/example/regression/stochastic_gradient_descent.dart b/example/regression/stochastic_gradient_descent.dart
@@ -4,7 +4,7 @@ import 'package:ml_algo/ml_algo.dart';
 import 'package:tuple/tuple.dart';
 
 Future main() async {
-  final data = MLData.fromCsvFile(
+  final data = DataFrame.fromCsv(
     'datasets/black_friday.csv',
     labelIdx: 11,
     rows: [const Tuple2(0, 2999)],
@@ -36,5 +36,5 @@ Future main() async {
   final error =
       validator.evaluate(regressor, features, labels, MetricType.mape);
 
-  print('MAPE error on k-fold validation: $error');
+  print('MAPE error on k-fold validation: ${error.toStringAsFixed(2)}%');
 }
diff --git a/lib/ml_algo.dart b/lib/ml_algo.dart
@@ -4,13 +4,13 @@ export 'package:ml_algo/src/classifier/classifier.dart';
 export 'package:ml_algo/src/classifier/linear_classifier.dart';
 export 'package:ml_algo/src/data_preprocessing/categorical_encoder/encode_unknown_strategy_type.dart';
 export 'package:ml_algo/src/data_preprocessing/categorical_encoder/encoder_type.dart';
-export 'package:ml_algo/src/data_preprocessing/ml_data/ml_data.dart';
+export 'package:ml_algo/src/data_preprocessing/data_frame/data_frame.dart';
 export 'package:ml_algo/src/metric/classification/type.dart';
 export 'package:ml_algo/src/metric/metric_type.dart';
 export 'package:ml_algo/src/metric/regression/type.dart';
 export 'package:ml_algo/src/model_selection/cross_validator/cross_validator.dart';
 export 'package:ml_algo/src/optimizer/gradient/learning_rate_generator/learning_rate_type.dart';
 export 'package:ml_algo/src/optimizer/optimizer_type.dart';
-export 'package:ml_algo/src/predictor.dart';
+export 'package:ml_algo/src/predictor/predictor.dart';
 export 'package:ml_algo/src/regressor/gradient_type.dart';
 export 'package:ml_algo/src/regressor/linear_regressor.dart';
diff --git a/lib/src/classifier/classifier.dart b/lib/src/classifier/classifier.dart
@@ -1,24 +1,23 @@
-import 'package:ml_algo/src/predictor.dart';
+import 'package:ml_algo/src/predictor/predictor.dart';
 import 'package:ml_linalg/matrix.dart';
-import 'package:ml_linalg/vector.dart';
 
 /// An interface for any classifier (linear, non-linear, parametric,
 /// non-parametric, etc.)
 abstract class Classifier implements Predictor {
-  /// A map, where each key is a class label and each value, associated with
-  /// the key, is a set of weights (coefficients), specific for the class
-  Map<double, MLVector> get weightsByClasses;
+  /// A matrix, where each column is a vector of weights, associated with
+  /// the specific class
+  Matrix get weightsByClasses;
 
-  /// A collection of encoded class labels. Can be transformed back to original
+  /// A collection of class labels. Can be transformed back to original
   /// labels by a [MLData] instance, that was used previously to encode the
   /// labels
-  Iterable<double> get classLabels;
+  Matrix get classLabels;
 
   /// Returns predicted distribution of probabilities for each observation in
   /// the passed [features]
-  MLMatrix predictProbabilities(MLMatrix features);
+  Matrix predictProbabilities(Matrix features);
 
   /// Return a collection of predicted class labels for each observation in the
   /// passed [features]
-  MLVector predictClasses(MLMatrix features);
+  Matrix predictClasses(Matrix features);
 }
diff --git a/lib/src/classifier/labels_processor/labels_processor.dart b/lib/src/classifier/labels_processor/labels_processor.dart
diff --git a/lib/src/classifier/labels_processor/labels_processor_factory.dart b/lib/src/classifier/labels_processor/labels_processor_factory.dart
diff --git a/lib/src/classifier/labels_processor/labels_processor_factory_impl.dart b/lib/src/classifier/labels_processor/labels_processor_factory_impl.dart