Merge pull request #91 from gyrdym/add-more-examples-to-readme

Add softmax regression example to README, one-hot encoding documentation extended
gyrdym · Mar 11, 2019 · f284ced
2 parents 6f8e87b + e01212a
Showing 4 changed files with 132 additions and 19 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,8 @@
 # Changelog
 
+## 9.2.2
+- Softmax regression example added to README
+
 ## 9.2.1
 - README corrected
 

diff --git a/README.md b/README.md
@@ -8,20 +8,22 @@
 **Table of contents**
 - [What is the library for?](#what-is-the-ml_algo-for)
 - [The library's structure](#the-librarys-structure)
-- [Usage](#usage)
+- [Examples](#examples)
+    - [Logistic regression](#logistic-regression)
+    - [Softmax regression](#softmax-regression)
 
 ## What is the ml_algo for?
 
 The main purpose of the library is to give developers interested in both the Dart language and data science a native Dart
 implementation of machine learning algorithms. The library targets the Dart VM, so for the smoothest experience,
 please do not use it in a browser.
 
-Following algorithms are implemented:
-- Linear regression:
+**The following algorithms are implemented:**
+- *Linear regression:*
     - Gradient descent algorithm (batch, mini-batch, stochastic) with ridge regularization
     - Lasso regression (feature selection model)
 
-- Linear classifier:
+- *Linear classifier:*
     - Logistic regression (with "one-vs-all" multiclass classification)
     - Softmax regression
 
@@ -31,7 +33,7 @@ To provide main purposes of machine learning, the library exposes the following
 
 - [DataFrame](https://github.com/gyrdym/ml_algo/blob/master/lib/src/data_preprocessing/data_frame/data_frame.dart). 
 Factory, that creates instances of different adapters for data. For example, one can create a csv reader, that makes 
-work with csv data easier: you just need to point, where your dataset resides and then get features and labels in 
+work with csv data easier: you only need to point to where a dataset resides, and then get features and labels in 
 convenient data science friendly format.
 
 - [CrossValidator](https://github.com/gyrdym/ml_algo/blob/master/lib/src/model_selection/cross_validator/cross_validator.dart). Factory, that creates 
@@ -56,9 +58,9 @@ that performs feature selection along with regression process. It uses [coordina
 instead of [gradient descent optimization]() and [gradient vector]() like in `LinearRegressor.gradient` to provide 
 regression. If you want to decide, which features are less important - go ahead and use this regressor. 
 
-## Usage
+## Examples
 
-### Real life example
+### Logistic regression
 
 Let's classify records from a well-known dataset - [Pima Indians Diabetes Database](https://www.kaggle.com/uciml/pima-indians-diabetes-database)
 via [Logistic regressor](https://github.com/gyrdym/ml_algo/blob/master/lib/src/classifier/linear_classifier.dart)
@@ -85,7 +87,7 @@ on each row. This column is our target - we should predict values of class label
 should point, where to get label values. Let's use `labelName` parameter for that (labels column name, 'class variable 
 (0 or 1)' in our case).  
 
-Processed features and labels are contained in a data structure of `Matrix` type. To get more information about 
+Processed features and labels are contained in data structures of `Matrix` type. To get more information about 
 `Matrix` type, please visit the [ml_linalg repo](https://github.com/gyrdym/ml_linalg)
 
 Then, we should create an instance of `CrossValidator` class for fitting [hyperparameters](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning))
@@ -103,7 +105,7 @@ final model = LinearClassifier.logisticRegressor(
     iterationsLimit: 500,
     gradientType: GradientType.batch,
     fitIntercept: true,
-    interceptScale: .1,
+    interceptScale: 0.1,
     learningRateType: LearningRateType.constant);
 ````
 
@@ -152,6 +154,104 @@ Future main() async {
 }
 ````
 
+### Softmax regression
+Let's classify another famous dataset - [Iris dataset](https://www.kaggle.com/uciml/iris). Data in this csv is separated into 3 classes, so we need
+to use a different approach to classification - [Softmax regression](http://deeplearning.stanford.edu/tutorial/supervised/SoftmaxRegression/).
+
+As usual, start with data preparation:
+````Dart
+final data = DataFrame.fromCsv('datasets/iris.csv',
+    labelName: 'Species',
+    columns: [const Tuple2(1, 5)],
+    categoryNameToEncoder: {
+      'Species': CategoricalDataEncoderType.oneHot,
+    },
+);
+
+final features = await data.features;
+final labels = await data.labels;
+````
+
+The csv dataset has 6 columns, but we need to drop the first one, because it contains only the ID of every 
+observation, which carries no information for classification. So, as you may notice, we provided a column range to exclude the ID column:
+
+````Dart
+columns: [const Tuple2(1, 5)]
+````
+
+Also, since the label column 'Species' contains categorical data, we encoded it to a numerical format:
+
+````Dart
+categoryNameToEncoder: {
+  'Species': CategoricalDataEncoderType.oneHot,
+},
+````
+
+To see how encoding works, visit the [api reference](https://pub.dartlang.org/documentation/ml_algo/latest/ml_algo/CategoricalDataEncoderType-class.html).
+
+Next step - create a cross validator instance:
+
+````Dart
+final validator = CrossValidator.kFold(numberOfFolds: 5);
+````
+
+And finally, create an instance of the classifier:
+
+````Dart
+final softmaxRegressor = LinearClassifier.softmaxRegressor(
+      initialLearningRate: 0.03,
+      iterationsLimit: null,
+      minWeightsUpdate: 1e-6,
+      randomSeed: 46,
+      learningRateType: LearningRateType.constant);
+````
+
+Evaluate the quality of prediction:
+
+````Dart
+final accuracy = validator.evaluate(softmaxRegressor, features, labels, MetricType.accuracy);
+
+print('Iris dataset, softmax regression: accuracy is '
+  '${accuracy.toStringAsFixed(2)}'); // It yields 0.93
+````
+
+Putting all the code above together:
+
+````Dart
+import 'dart:async';
+
+import 'package:ml_algo/ml_algo.dart';
+import 'package:tuple/tuple.dart';
+
+Future main() async {
+  final data = DataFrame.fromCsv('datasets/iris.csv',
+    labelName: 'Species',
+    columns: [const Tuple2(1, 5)],
+    categoryNameToEncoder: {
+      'Species': CategoricalDataEncoderType.oneHot,
+    },
+  );
+
+  final features = await data.features;
+  final labels = await data.labels;
+
+  final validator = CrossValidator.kFold(numberOfFolds: 5);
+
+  final softmaxRegressor = LinearClassifier.softmaxRegressor(
+      initialLearningRate: 0.03,
+      iterationsLimit: null,
+      minWeightsUpdate: 1e-6,
+      randomSeed: 46,
+      learningRateType: LearningRateType.constant);
+
+  final accuracy = validator.evaluate(
+      softmaxRegressor, features, labels, MetricType.accuracy);
+
+  print('Iris dataset, softmax regression: accuracy is '
+      '${accuracy.toStringAsFixed(2)}');
+}
+````
+
 For more examples please see [examples folder](https://github.com/gyrdym/dart_ml/tree/master/example)
 
 ### Contacts
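
Note on the softmax regression example added to the README in this change: the text links to a tutorial but does not show what softmax itself computes. The following is a minimal sketch, assuming the standard softmax definition (exponentiate each class score, then normalize so the results sum to 1). The `softmax` function is a hypothetical standalone helper written for illustration; it is not part of the ml_algo API and does not claim to mirror the library's internal implementation.

```dart
import 'dart:math' as math;

/// Hypothetical helper (not ml_algo API): turns raw class scores into
/// probabilities that sum to 1.
List<double> softmax(List<double> scores) {
  // Subtract the maximum score before exponentiating to avoid overflow;
  // this does not change the result.
  final maxScore = scores.reduce(math.max);
  final exps = scores.map((s) => math.exp(s - maxScore)).toList();
  final sum = exps.reduce((a, b) => a + b);
  return exps.map((e) => e / sum).toList();
}

void main() {
  // Three scores, one per class (e.g. the three Iris species).
  final probabilities = softmax([1.0, 2.0, 3.0]);
  // The class with the largest score gets the largest probability.
  print(probabilities);
}
```

With three classes, as in the Iris dataset, each observation gets three probabilities, and the predicted class is the one with the highest value.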

diff --git a/lib/src/data_preprocessing/categorical_encoder/encoder_type.dart b/lib/src/data_preprocessing/categorical_encoder/encoder_type.dart
@@ -1,25 +1,35 @@
 /// Types of categorical data encoders
 ///
 /// [CategoricalDataEncoderType.oneHot] One-hot encoder. Encodes every
-/// categorical value to a sequence of all possible values of its category: `1`
-/// for the given value, `0` - for the rest values.
+/// categorical value to a list whose length equals the number of all possible
+/// values of the category. Each element of the list is a binary value: `1`
+/// for the current value and `0` for all the other values.
 ///
 /// For example:
 ///
-/// Category `'GENDER'` given. Its possible values:
+/// Category `'AGE'` given. Its possible values:
 /// ```
-/// ['female', 'male']
+/// ['0-17', '18-30', '31+']
 /// ```
-/// Also, we have some data to encode - a list of `'GENDER'` values:
+///
+/// '0-17' will be encoded as [1.0, 0.0, 0.0]
+///
+/// '18-30' will be encoded as [0.0, 1.0, 0.0]
+///
+/// '31+' will be encoded as [0.0, 0.0, 1.0]
+///
+/// Also, we have some data of this category - a list of `'AGE'` values:
 /// ```
-/// ['female', 'female', 'male', 'male', 'male', 'female']
+/// ['0-17', '0-17', '18-30', '18-30', '18-30', '31+']
 /// ```
-/// After one-hot encoding the data will be as:
+///
+/// After one-hot encoding the data will look like this:
 /// ```
-/// [[1, 0], [1, 0], [0, 1], [0, 1], [0, 1], [1, 0]]
+/// [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
 /// ```
+///
 /// [CategoricalDataEncoderType.ordinal] Ordinal encoder. Encodes every
-/// categorical value to an ordinal number
+/// categorical value to an ordinal number.
 ///
 enum CategoricalDataEncoderType {
   ordinal,
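
Note on the extended one-hot documentation above: the `'AGE'` example can be reproduced with a few lines of plain Dart. This is a minimal sketch of the encoding as the doc comment describes it; `oneHotEncode` is a hypothetical helper introduced for illustration, not the library's encoder, and it assumes the list of possible category values is known in advance.

```dart
/// Hypothetical helper (not ml_algo API): encodes every value from [data]
/// as a list whose length equals the number of [categoryValues].
List<List<double>> oneHotEncode(
    List<String> categoryValues, List<String> data) {
  return data.map((value) {
    final index = categoryValues.indexOf(value);
    if (index == -1) {
      throw ArgumentError('Unknown category value: $value');
    }
    // 1.0 at the position of the current value, 0.0 everywhere else.
    return List<double>.generate(
        categoryValues.length, (i) => i == index ? 1.0 : 0.0);
  }).toList();
}

void main() {
  const ageValues = ['0-17', '18-30', '31+'];
  final encoded = oneHotEncode(ageValues, ['0-17', '18-30', '31+']);
  // On the Dart VM this prints:
  // [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
  print(encoded);
}
```

Throwing on an unknown value is a design choice of this sketch; how the library itself treats values outside the known category is not stated in the doc comment.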

diff --git a/pubspec.yaml b/pubspec.yaml
@@ -1,6 +1,6 @@
 name: ml_algo
 description: Machine learning algorithms written in native dart (without bindings to any popular ML libraries, just pure Dart implementation)
-version: 9.2.1
+version: 9.2.2
 author: Ilia Gyrdymov <ilgyrd@gmail.com>
 homepage: https://github.com/gyrdym/ml_algo