Merge pull request #14 from gyrdym/add-normalizer-and-scaler

Normalizer entity
gyrdym · Sep 17, 2019 · dd5d56d
2 parents ecc528c + 74f5798
commit dd5d56d
Showing 18 changed files with 197 additions and 64 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,10 @@
 # Changelog
 
+## 5.0.0
+- `Encoder` interface changed: there is no more `encode` method, use `process` from `Pipeable` instead
+- `Normalizer` entity added
+- `normalize` operator added
+
 ## 4.0.0
 - `DataFrame` class split up into separate smaller entities
 - `DataFrame` class core moved to separate repository

diff --git a/README.md b/README.md
@@ -47,11 +47,14 @@ before doing preprocessing. An example with a part of pubspec.yaml:
 ````
 dependencies:
   ...
-  ml_dataframe: ^0.0.3
+  ml_dataframe: ^0.0.4
   ...
 ````
 
-## A simple usage example
+## Usage examples
+
+### Getting started
+
 Let's download some data from [Kaggle](https://www.kaggle.com) - let it be the amazing [black friday](https://www.kaggle.com/mehdidag/black-friday) 
 dataset. It's pretty interesting data with a huge number of observations (approx. 538000 rows) and a good number of 
 categorical features.
@@ -61,7 +64,6 @@ First, import all necessary libraries:
 ````dart
 import 'package:ml_dataframe/ml_dataframe.dart';
 import 'package:ml_preprocessing/ml_preprocessing.dart';
-import 'package:xrange/zrange.dart';
 ````
 
 Then, we should read the csv and create a data frame:
@@ -71,19 +73,25 @@ final dataFrame = await fromCsv('example/black_friday/black_friday.csv',
   columns: [2, 3, 5, 6, 7, 11]);
 ````
 
+### Categorical data
+
 After we get a dataframe, we may encode all the needed features. Let's analyze the dataset and decide which features 
 should be encoded. In our case these are:
 
 ````dart
 final featureNames = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status'];
 ````
 
-Let's fit the encoder. 
+### One-hot encoding
+
+Let's fit the one-hot encoder. 
 
-Why should we fit it? Categorical data encoder fitting is a process, when all the unique category values are being 
+Why should we fit it? Fitting a categorical data encoder is a process in which all the unique category values are being 
 collected in order to create a list of encoded labels. After the fitting is complete, one may use the fitted encoder for 
-new data of the same source. In order to fit the encoder it's needed to create the entity and pass the fitting data as 
-an argument to the constructor, along with the features to be encoded:
+new data from the same source. 
+
+To fit the encoder, create the entity and pass the fitting data as an argument to the 
+constructor, along with the features to be encoded:
 
 
 ````dart
@@ -97,7 +105,7 @@ final encoder = Encoder.oneHot(
 Let's encode the features:
 
 ````dart
-final encoded = encoder.encode(dataFrame);
+final encoded = encoder.process(dataFrame);
 ````
 
 We used the same dataframe here - it's absolutely normal, since when we created the encoder, we just fit it with the 
@@ -112,3 +120,52 @@ print(data);
 ```` 
 
 In the output we will see just numerical data, which is exactly what we wanted.
+
+### Label encoding
+
+Label encoding is another well-known encoding method. The technique is the same: first we fit the encoder, and after that we
+may use this "trained" encoder in our application:
+
+````dart
+final encoder = Encoder.label(
+  dataFrame,
+  featureNames: featureNames,
+);
+
+final encoded = encoder.process(dataFrame);
+````
+
+### Numerical data normalizing
+
+Sometimes we need our numerical features normalized, which means treating every dataframe row as a 
+vector and dividing this vector element-wise by its norm (Euclidean, Manhattan, etc.). To do so, the library exposes
+the `Normalizer` entity:
+
+````dart
+final normalizer = Normalizer(); // by default Euclidean norm will be used
+final transformed = normalizer.process(dataFrame);
+```` 
+
+Please note that if your data has raw categorical values, normalization will fail, as it requires only numerical 
+values. In this case you should encode the data (e.g. using one-hot encoding) before normalization.
+
+### Pipeline
+
+There is a convenient way to organize a bunch of data preprocessing operations - `Pipeline`:
+
+````dart
+final pipeline = Pipeline(dataFrame, [
+  encodeAsOneHotLabels(featureNames: ['Gender', 'Age', 'City_Category']),
+  encodeAsIntegerLabels(featureNames: ['Stay_In_Current_City_Years', 'Marital_Status']),
+  normalize(),
+]);
+````
+
+Once you create (or rather fit) a pipeline, you may use it further in your application:
+
+````dart
+final processed = pipeline.process(dataFrame);
+````
+
+`encodeAsOneHotLabels`, `encodeAsIntegerLabels` and `normalize` are pipeable operator functions. A pipeable operator 
+function is a factory that takes fitting data and creates a fitted pipeable entity (e.g., a `Normalizer` instance).
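The pipeable-operator idea described in the README addition (a factory that receives fitting data and returns a fitted entity) can be sketched in plain Dart. This is only an illustration under stated assumptions: `Step`, `OperatorFn`, `MiniPipeline`, `center` and `scale` are hypothetical names, nested lists stand in for `DataFrame`, and fitting every operator on the same original data is an assumption of this sketch, not a confirmed detail of the library's `Pipeline`.

```dart
/// A fitted processing step (hypothetical analogue of a pipeable entity).
abstract class Step {
  List<List<double>> process(List<List<double>> input);
}

/// An operator function: takes fitting data and returns a fitted step.
typedef OperatorFn = Step Function(List<List<double>> fittingData);

/// Subtracts column means that were learned from the fitting data.
class Center implements Step {
  Center(this._means);

  final List<double> _means;

  @override
  List<List<double>> process(List<List<double>> input) => input
      .map((row) =>
          List<double>.generate(row.length, (i) => row[i] - _means[i]))
      .toList();
}

/// Multiplies every value by a fixed factor; needs no fitting.
class Scale implements Step {
  Scale(this._factor);

  final double _factor;

  @override
  List<List<double>> process(List<List<double>> input) =>
      input.map((row) => row.map((x) => x * _factor).toList()).toList();
}

/// Operator that uses the fitting data: it learns the column means.
OperatorFn center() => (fittingData) {
      final columns = fittingData.first.length;
      final means = List<double>.generate(
          columns,
          (j) =>
              fittingData.fold<double>(0.0, (sum, row) => sum + row[j]) /
              fittingData.length);
      return Center(means);
    };

/// Operator that ignores the fitting data, much like `normalize` in the diff.
OperatorFn scale(double factor) => (_) => Scale(factor);

/// Fits all operators once, then applies the fitted steps in order.
/// Assumption: every operator is fitted on the same original fitting data.
class MiniPipeline {
  MiniPipeline(List<List<double>> fittingData, Iterable<OperatorFn> operators)
      : _steps = operators.map((op) => op(fittingData)).toList();

  final List<Step> _steps;

  List<List<double>> process(List<List<double>> input) =>
      _steps.fold<List<List<double>>>(
          input, (data, step) => step.process(data));
}

void main() {
  final fittingData = [
    [1.0, 10.0],
    [3.0, 30.0],
  ];
  final pipeline = MiniPipeline(fittingData, [center(), scale(0.5)]);

  // Column means are (2, 20), so [4, 10] is centered to [2, -10] and halved.
  print(pipeline.process([
    [4.0, 10.0],
  ])); // prints [[1.0, -5.0]]
}
```

Separating the factory from the fitted step lets a pipeline be declared before any data is seen, while fitted steps stay reusable on new data.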
diff --git a/example/black_friday/black_friday.dart b/example/black_friday/black_friday.dart
@@ -12,7 +12,7 @@ Future processDataSetWithCategoricalData() async {
     dataFrame,
     featureNames: ['Gender', 'Age', 'City_Category',
       'Stay_In_Current_City_Years', 'Marital_Status'],
-  ).encode(dataFrame);
+  ).process(dataFrame);
 
   final observations = encoded.toMatrix();
   final genderEncoded = observations.submatrix(columns: ZRange.closed(0, 1));

diff --git a/example/main.dart b/example/main.dart
@@ -2,8 +2,8 @@ import 'dart:async';
 
 import 'package:ml_dataframe/ml_dataframe.dart';
 import 'package:ml_preprocessing/ml_preprocessing.dart';
-import 'package:ml_preprocessing/src/encoder/pipeable/label_encode.dart';
-import 'package:ml_preprocessing/src/encoder/pipeable/one_hot_encode.dart';
+import 'package:ml_preprocessing/src/encoder/encode_as_integer_labels.dart';
+import 'package:ml_preprocessing/src/encoder/encode_as_one_hot_labels.dart';
 import 'package:ml_preprocessing/src/pipeline/pipeline.dart';
 
 Future main() async {
@@ -12,11 +12,11 @@ Future main() async {
 
   final pipeline = Pipeline(dataFrame, [
     encodeAsOneHotLabels(
-      columnNames: ['position'],
+      featureNames: ['position'],
       headerPostfix: '_position',
     ),
     encodeAsIntegerLabels(
-      columnNames: ['country'],
+      featureNames: ['country'],
     ),
   ]);
 

diff --git a/lib/ml_preprocessing.dart b/lib/ml_preprocessing.dart
@@ -1,4 +1,7 @@
+export 'package:ml_linalg/norm.dart';
+export 'package:ml_preprocessing/src/encoder/encode_as_integer_labels.dart';
+export 'package:ml_preprocessing/src/encoder/encode_as_one_hot_labels.dart';
 export 'package:ml_preprocessing/src/encoder/encoder.dart';
-export 'package:ml_preprocessing/src/encoder/pipeable/label_encode.dart';
-export 'package:ml_preprocessing/src/encoder/pipeable/one_hot_encode.dart';
+export 'package:ml_preprocessing/src/normalizer/normalize.dart';
+export 'package:ml_preprocessing/src/normalizer/normalizer.dart';
 export 'package:ml_preprocessing/src/pipeline/pipeline.dart';
diff --git a/lib/src/encoder/pipeable/label_encode.dart → ...src/encoder/encode_as_integer_labels.dart
@@ -5,16 +5,16 @@ import 'package:ml_preprocessing/src/pipeline/pipeable.dart';
 
 /// A factory function to use label categorical data encoder in pipeline
 PipeableOperatorFn encodeAsIntegerLabels({
-  Iterable<int> columns,
-  Iterable<String> columnNames,
+  Iterable<int> features,
+  Iterable<String> featureNames,
   String headerPrefix,
   String headerPostfix,
 }) => (data) => EncoderImpl(
   data,
   EncoderType.label,
   SeriesEncoderFactoryImpl(),
-  featureNames: columnNames,
-  featureIds: columns,
+  featureIds: features,
+  featureNames: featureNames,
   encodedHeaderPostfix: headerPostfix,
   encodedHeaderPrefix: headerPrefix,
 );
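The fit-then-process flow behind label encoding can be illustrated with a minimal plain-Dart sketch. It is not the code of `EncoderImpl`: `fitLabels` and `encodeLabels` are hypothetical helpers, and assigning integers in order of first appearance is an assumption made here for illustration.

```dart
/// Fits a label encoder: every distinct value gets the next free integer.
/// Assumption: labels are numbered in order of first appearance.
Map<String, int> fitLabels(Iterable<String> values) {
  final labels = <String, int>{};
  for (final value in values) {
    labels.putIfAbsent(value, () => labels.length);
  }
  return labels;
}

/// Encodes values with previously fitted labels.
/// Throws if a value was not present in the fitting data.
List<int> encodeLabels(Map<String, int> labels, Iterable<String> values) =>
    values.map((value) {
      final code = labels[value];
      if (code == null) {
        throw ArgumentError('Unknown category: $value');
      }
      return code;
    }).toList();

void main() {
  // Fit on one column of categorical values...
  final labels = fitLabels(['M', 'F', 'M']);
  print(labels); // prints {M: 0, F: 1}

  // ...then reuse the fitted mapping on new data from the same source.
  print(encodeLabels(labels, ['F', 'M', 'F'])); // prints [1, 0, 1]
}
```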
diff --git a/lib/src/encoder/pipeable/one_hot_encode.dart → ...src/encoder/encode_as_one_hot_labels.dart
@@ -5,16 +5,16 @@ import 'package:ml_preprocessing/src/pipeline/pipeable.dart';
 
 /// A factory function to use `one hot` categorical data encoder in pipeline
 PipeableOperatorFn encodeAsOneHotLabels({
-  Iterable<int> columns,
-  Iterable<String> columnNames,
+  Iterable<int> features,
+  Iterable<String> featureNames,
   String headerPrefix,
   String headerPostfix,
 }) => (data) => EncoderImpl(
   data,
   EncoderType.oneHot,
   SeriesEncoderFactoryImpl(),
-  featureNames: columnNames,
-  featureIds: columns,
+  featureIds: features,
+  featureNames: featureNames,
   encodedHeaderPostfix: headerPostfix,
   encodedHeaderPrefix: headerPrefix,
 );
diff --git a/lib/src/encoder/encoder.dart b/lib/src/encoder/encoder.dart
@@ -2,11 +2,12 @@ import 'package:ml_dataframe/ml_dataframe.dart';
 import 'package:ml_preprocessing/src/encoder/encoder_impl.dart';
 import 'package:ml_preprocessing/src/encoder/encoder_type.dart';
 import 'package:ml_preprocessing/src/encoder/series_encoder/series_encoder_factory_impl.dart';
+import 'package:ml_preprocessing/src/pipeline/pipeable.dart';
 
 final _seriesEncoderFactory = SeriesEncoderFactoryImpl();
 
 /// Categorical data encoder factory
-abstract class Encoder {
+abstract class Encoder implements Pipeable {
   factory Encoder.oneHot(DataFrame fittingData, {
     Iterable<int> featureIds,
     Iterable<String> featureNames,
@@ -32,6 +33,4 @@ abstract class Encoder {
     featureNames: featureNames,
     featureIds: featureIds,
   );
-
-  DataFrame encode(DataFrame data);
 }
diff --git a/lib/src/encoder/encoder_impl.dart b/lib/src/encoder/encoder_impl.dart
@@ -4,9 +4,8 @@ import 'package:ml_preprocessing/src/encoder/encoder_type.dart';
 import 'package:ml_preprocessing/src/encoder/helpers/create_encoder_to_series_mapping.dart';
 import 'package:ml_preprocessing/src/encoder/series_encoder/series_encoder.dart';
 import 'package:ml_preprocessing/src/encoder/series_encoder/series_encoder_factory.dart';
-import 'package:ml_preprocessing/src/pipeline/pipeable.dart';
 
-class EncoderImpl implements Pipeable, Encoder {
+class EncoderImpl implements Encoder {
   EncoderImpl(
       DataFrame fittingData,
       EncoderType encoderType,
@@ -35,7 +34,4 @@ class EncoderImpl implements Pipeable, Encoder {
           : [series]);
     return DataFrame.fromSeries(encoded);
   }
-
-  @override
-  DataFrame encode(DataFrame data) => process(data);
 }
diff --git a/lib/src/normalizer/normalize.dart b/lib/src/normalizer/normalize.dart
@@ -0,0 +1,6 @@
+import 'package:ml_linalg/norm.dart';
+import 'package:ml_preprocessing/src/normalizer/normalizer.dart';
+import 'package:ml_preprocessing/src/pipeline/pipeable.dart';
+
+PipeableOperatorFn normalize([Norm norm = Norm.euclidean]) =>
+        (_) => Normalizer(norm);
diff --git a/lib/src/normalizer/normalizer.dart b/lib/src/normalizer/normalizer.dart
@@ -0,0 +1,18 @@
+import 'package:ml_dataframe/ml_dataframe.dart';
+import 'package:ml_linalg/linalg.dart';
+import 'package:ml_preprocessing/src/pipeline/pipeable.dart';
+
+class Normalizer implements Pipeable {
+  Normalizer([this._norm = Norm.euclidean]);
+
+  final Norm _norm;
+
+  @override
+  DataFrame process(DataFrame input) {
+    final transformed = input
+        .toMatrix()
+        .mapRows((row) => row.normalize(_norm));
+
+    return DataFrame.fromMatrix(transformed, header: input.header);
+  }
+}
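What `Normalizer.process` computes, dividing each row by its own norm, can be sketched without the `ml_linalg` types. The following is a rough plain-Dart illustration, not the library's implementation: `euclideanNorm`, `manhattanNorm` and `normalizeRows` are hypothetical helpers, nested lists stand in for the matrix, and leaving all-zero rows unchanged is an assumption of this sketch.

```dart
import 'dart:math' as math;

/// Euclidean (L2) norm: square root of the sum of squares.
double euclideanNorm(List<double> row) =>
    math.sqrt(row.fold<double>(0.0, (sum, x) => sum + x * x));

/// Manhattan (L1) norm: sum of absolute values.
double manhattanNorm(List<double> row) =>
    row.fold<double>(0.0, (sum, x) => sum + x.abs());

/// Divides every row element-wise by that row's norm.
/// Assumption: all-zero rows are copied unchanged to avoid division by zero.
List<List<double>> normalizeRows(
  List<List<double>> rows, {
  double Function(List<double>) norm = euclideanNorm,
}) =>
    rows.map((row) {
      final n = norm(row);
      return n == 0 ? List<double>.of(row) : row.map((x) => x / n).toList();
    }).toList();

void main() {
  final rows = [
    [3.0, 4.0],
    [1.0, -3.0],
  ];

  // With the Euclidean norm the first row [3, 4] has norm 5,
  // so it becomes [0.6, 0.8].
  print(normalizeRows(rows));

  // With the Manhattan norm the second row [1, -3] has norm 4,
  // so it becomes [0.25, -0.75].
  print(normalizeRows(rows, norm: manhattanNorm));
}
```

After normalization every non-zero row has norm 1 under the chosen norm, which is why raw categorical values must be encoded to numbers first.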
diff --git a/pubspec.yaml b/pubspec.yaml
@@ -1,17 +1,16 @@
 name: ml_preprocessing
-description: Implementaion of popular algorithms of data preprocessing for machine learning
-version: 4.0.0
+description: Popular algorithms of data preprocessing for machine learning
+version: 5.0.0
 author: Ilia Gyrdymov <ilgyrd@gmail.com>
 homepage: https://github.com/gyrdym/ml_preprocessing
 
 environment:
   sdk: '>=2.4.0 <3.0.0'
 
 dependencies:
-  ml_dataframe: ^0.0.3
-  ml_linalg: ^10.0.3
+  ml_dataframe: ^0.0.4
+  ml_linalg: ^11.0.0
   quiver: ^2.0.2
-  tuple: ^1.0.2
   xrange: ^0.0.4
 
 dev_dependencies:

diff --git a/test/encoder/encoder_impl_test.dart b/test/encoder/encoder_impl_test.dart
@@ -20,7 +20,7 @@ void main() {
         final dataFrame = DataFrame(data);
         final encoder = Encoder.oneHot(dataFrame,
             featureNames: ['second', 'third', 'fourth']);
-        final encoded = encoder.encode(dataFrame);
+        final encoded = encoder.process(dataFrame);
 
         encoded.toMatrix();
 
@@ -39,7 +39,7 @@ void main() {
         final dataFrame = DataFrame(data);
         final encoder = Encoder.oneHot(dataFrame,
             featureIds: [1, 2, 3]);
-        final encoded = encoder.encode(dataFrame);
+        final encoded = encoder.process(dataFrame);
 
         encoded.toMatrix();
 
@@ -60,7 +60,7 @@ void main() {
         final dataFrame = DataFrame(data);
         final encoder = Encoder.label(dataFrame,
             featureNames: ['second', 'third', 'fourth']);
-        final encoded = encoder.encode(dataFrame);
+        final encoded = encoder.process(dataFrame);
 
         encoded.toMatrix();
 
@@ -78,7 +78,7 @@ void main() {
       test('should use indices to access the needed series while encoding', () {
         final dataFrame = DataFrame(data);
         final encoder = Encoder.label(dataFrame, featureIds: [1, 2, 3]);
-        final encoded = encoder.encode(dataFrame);
+        final encoded = encoder.process(dataFrame);
 
         encoded.toMatrix();
 

diff --git a/test/mocks.dart b/test/mocks.dart
diff --git a/test/normalizer/normalize_test.dart b/test/normalizer/normalize_test.dart
@@ -0,0 +1,14 @@
+import 'package:ml_preprocessing/src/normalizer/normalize.dart';
+import 'package:ml_preprocessing/src/normalizer/normalizer.dart';
+import 'package:test/test.dart';
+
+void main() {
+  group('normalize', () {
+    test('should return normalizer factory', () {
+      final normalizerFactory = normalize();
+      final normalizer = normalizerFactory(null);
+
+      expect(normalizer, isA<Normalizer>());
+    });
+  });
+}