Merge pull request #14 from gyrdym/add-normalizer-and-scaler
Normalizer entity
gyrdym committed Sep 17, 2019
2 parents ecc528c + 74f5798 commit dd5d56d
Showing 18 changed files with 197 additions and 64 deletions.
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,10 @@
# Changelog

## 5.0.0
- `Encoder` interface changed: there is no more `encode` method, use `process` from `Pipeable` instead
- `Normalizer` entity added
- `normalize` operator added

## 4.0.0
- `DataFrame` class split up into separate smaller entities
- `DataFrame` class core moved to separate repository
73 changes: 65 additions & 8 deletions README.md
@@ -47,11 +47,14 @@ before doing preprocessing. An example with a part of pubspec.yaml:
````
dependencies:
...
ml_dataframe: ^0.0.3
ml_dataframe: ^0.0.4
...
````

## A simple usage example
## Usage examples

### Getting started

Let's download some data from [Kaggle](https://www.kaggle.com) - let it be amazing [black friday](https://www.kaggle.com/mehdidag/black-friday)
dataset. It's pretty interesting data with huge amount of observations (approx. 538000 rows) and a good number of
categorical features.
@@ -61,7 +64,6 @@ First, import all necessary libraries:
````dart
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:xrange/zrange.dart';
````

Then, we should read the csv and create a data frame:
@@ -71,19 +73,25 @@ final dataFrame = await fromCsv('example/black_friday/black_friday.csv',
columns: [2, 3, 5, 6, 7, 11]);
````

### Categorical data

After we get a dataframe, we may encode all the needed features. Let's analyze the dataset and decide, what features
should be encoded. In our case these are:

````dart
final featureNames = ['Gender', 'Age', 'City_Category', 'Stay_In_Current_City_Years', 'Marital_Status'];
````

Let's fit the encoder.
### One-hot encoding

Let's fit the one-hot encoder.

Why should we fit it? Categorical data encoder fitting is a process, when all the unique category values are being
Why should we fit it? Categorical data encoder fitting - a process, when all the unique category values are being
searched for in order to create an encoded labels list. After the fitting is complete, one may use the fitted encoder for
new data of the same source. In order to fit the encoder it's needed to create the entity and pass the fitting data as
an argument to the constructor, along with the features to be encoded:
the new data of the same source.

In order to fit the encoder it's needed to create the entity and pass the fitting data as an argument to the
constructor, along with the features to be encoded:


````dart
@@ -97,7 +105,7 @@ final encoder = Encoder.oneHot(
Let's encode the features:

````dart
final encoded = encoder.encode(dataFrame);
final encoded = encoder.process(dataFrame);
````

We used the same dataframe here - it's absolutely normal, since when we created the encoder, we just fit it with the
@@ -112,3 +120,52 @@ print(data);
````

In the output we will see only numerical data, which is exactly what we wanted.
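
To make the idea of "fitting" concrete, here is a minimal sketch in plain Dart that does not use the library at all. `fitCategories` and `oneHot` are hypothetical helper names introduced only for illustration, and treating unknown values as an all-zero vector is an assumption of this sketch, not documented behaviour of `Encoder.oneHot`:

````dart
/// Hypothetical helper (not part of ml_preprocessing): "fits" on a column by
/// collecting its unique values in order of first appearance.
List<String> fitCategories(Iterable<String> column) {
  final seen = <String>{}; // a set literal keeps insertion order
  for (final value in column) {
    seen.add(value);
  }
  return seen.toList();
}

/// Encodes one value as a one-hot vector against the fitted categories.
/// A value that was not seen during fitting yields an all-zero vector
/// (an assumption of this sketch).
List<int> oneHot(String value, List<String> categories) =>
    [for (final category in categories) category == value ? 1 : 0];

void main() {
  // Fit once on the fitting data...
  final categories = fitCategories(['A', 'B', 'A', 'C']);

  // ...and reuse the fitted categories for new data from the same source.
  for (final value in ['C', 'A']) {
    print('$value -> ${oneHot(value, categories)}');
  }
}
````

The point of the split is that the list of categories is computed once and then reused, so new data is always encoded with the same column layout.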

### Label encoding

Another well-known encoding method. The technique is the same: first we fit the encoder, and after that we
may use this "trained" encoder in our application:

````dart
final encoder = Encoder.label(
dataFrame,
featureNames: featureNames,
);
final encoded = encoder.process(dataFrame);
````
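
As a rough illustration of what label encoding produces, the following plain-Dart sketch assigns an integer to each unique value. `fitLabels` is a hypothetical name, and numbering the labels in order of first appearance is an assumption of the sketch; the library may order its labels differently:

````dart
/// Hypothetical helper (not part of ml_preprocessing): maps every unique
/// value of a column to an integer label, numbered by first appearance.
Map<String, int> fitLabels(Iterable<String> column) {
  final labels = <String, int>{};
  for (final value in column) {
    // The next free label equals the number of labels assigned so far.
    labels.putIfAbsent(value, () => labels.length);
  }
  return labels;
}

void main() {
  final labels = fitLabels(['M', 'F', 'F', 'M']);

  // Encode new data with the already fitted labels.
  final encoded = ['F', 'M', 'F'].map((value) => labels[value]).toList();
  print(encoded);
}
````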

### Numerical data normalizing

Sometimes we need our numerical features normalized: that is, we treat every dataframe row as a
vector and divide this vector element-wise by its norm (Euclidean, Manhattan, etc.). To do so, the library exposes
the `Normalizer` entity:

````dart
final normalizer = Normalizer(); // by default Euclidean norm will be used
final transformed = normalizer.process(dataFrame);
````

Please note that if your data has raw categorical values, normalization will fail, since it requires numerical
values only. In this case, encode the data (e.g. using one-hot encoding) before normalization.
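
The row-wise operation described above can be sketched in plain Dart without the library. `euclideanNorm`, `manhattanNorm` and `normalizeRow` are hypothetical helpers, and returning an all-zero row unchanged is an assumption of this sketch rather than documented `Normalizer` behaviour:

````dart
import 'dart:math' as math;

/// Euclidean (L2) norm: square root of the sum of squared elements.
double euclideanNorm(List<double> row) =>
    math.sqrt(row.fold<double>(0, (sum, x) => sum + x * x));

/// Manhattan (L1) norm: sum of absolute values.
double manhattanNorm(List<double> row) =>
    row.fold<double>(0, (sum, x) => sum + x.abs());

/// Divides every element of the row by the row's norm.
/// A row whose norm is zero is returned as a copy (assumption of this sketch).
List<double> normalizeRow(
    List<double> row, double Function(List<double>) norm) {
  final length = norm(row);
  if (length == 0) return List.of(row);
  return [for (final x in row) x / length];
}

void main() {
  print(normalizeRow([3.0, 4.0], euclideanNorm)); // [0.6, 0.8]
  print(normalizeRow([1.0, 3.0], manhattanNorm)); // [0.25, 0.75]
}
````

After normalization each row has norm 1 under the chosen norm, which is why the choice of norm matters for the result.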

### Pipeline

There is a convenient way to organize a bunch of data preprocessing operations - `Pipeline`:

````dart
final pipeline = Pipeline(dataFrame, [
encodeAsOneHotLabels(featureNames: ['Gender', 'Age', 'City_Category']),
encodeAsIntegerLabels(featureNames: ['Stay_In_Current_City_Years', 'Marital_Status']),
normalize(),
]);
````

Once you create (or rather fit) a pipeline, you may use it further in your application:

````dart
final processed = pipeline.process(dataFrame);
````

`encodeAsOneHotLabels`, `encodeAsIntegerLabels` and `normalize` are pipeable operator functions. A pipeable operator
function is a factory that takes fitting data and creates a fitted pipeable entity (e.g., a `Normalizer` instance).
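
The factory pattern behind pipeable operator functions can be sketched with simplified stand-ins. Everything below (`Step`, `StepFactory`, `Scale`, `scaleByMax`, `MiniPipeline`) is hypothetical and only mirrors the idea; in particular, fitting every step on the same fitting data is an assumption of the sketch, not a statement about how `Pipeline` is implemented:

````dart
/// Stand-in for a pipeable entity: something that transforms data.
abstract class Step {
  List<List<double>> process(List<List<double>> input);
}

/// Stand-in for a pipeable operator function's result: a factory that
/// receives fitting data and returns a fitted step.
typedef StepFactory = Step Function(List<List<double>> fittingData);

/// A step that multiplies every element by a fixed factor.
class Scale implements Step {
  Scale(this.factor);

  final double factor;

  @override
  List<List<double>> process(List<List<double>> input) =>
      [for (final row in input) [for (final x in row) x * factor]];
}

/// Operator function: the returned factory "fits" by finding the largest
/// element of the fitting data (which must be non-empty).
StepFactory scaleByMax() => (data) {
      final max = data.expand((row) => row).reduce((a, b) => a > b ? a : b);
      return Scale(1 / max);
    };

/// Fits all steps once in the constructor, then applies them in order.
class MiniPipeline {
  MiniPipeline(List<List<double>> fittingData, Iterable<StepFactory> factories)
      : _steps = [for (final create in factories) create(fittingData)];

  final List<Step> _steps;

  List<List<double>> process(List<List<double>> input) =>
      _steps.fold<List<List<double>>>(
          input, (data, step) => step.process(data));
}

void main() {
  final data = [
    [1.0, 2.0],
    [4.0, 8.0],
  ];
  final pipeline = MiniPipeline(data, [scaleByMax()]);
  print(pipeline.process(data)); // [[0.125, 0.25], [0.5, 1.0]]
}
````

Separating the operator function from the fitted step lets the caller describe a pipeline declaratively, while the fitting itself happens once, when the pipeline is constructed.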
2 changes: 1 addition & 1 deletion example/black_friday/black_friday.dart
@@ -12,7 +12,7 @@ Future processDataSetWithCategoricalData() async {
dataFrame,
featureNames: ['Gender', 'Age', 'City_Category',
'Stay_In_Current_City_Years', 'Marital_Status'],
).encode(dataFrame);
).process(dataFrame);

final observations = encoded.toMatrix();
final genderEncoded = observations.submatrix(columns: ZRange.closed(0, 1));
8 changes: 4 additions & 4 deletions example/main.dart
@@ -2,8 +2,8 @@ import 'dart:async';

import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/ml_preprocessing.dart';
import 'package:ml_preprocessing/src/encoder/pipeable/label_encode.dart';
import 'package:ml_preprocessing/src/encoder/pipeable/one_hot_encode.dart';
import 'package:ml_preprocessing/src/encoder/encode_as_integer_labels.dart';
import 'package:ml_preprocessing/src/encoder/encode_as_one_hot_labels.dart';
import 'package:ml_preprocessing/src/pipeline/pipeline.dart';

Future main() async {
@@ -12,11 +12,11 @@ Future main() async {

final pipeline = Pipeline(dataFrame, [
encodeAsOneHotLabels(
columnNames: ['position'],
featureNames: ['position'],
headerPostfix: '_position',
),
encodeAsIntegerLabels(
columnNames: ['country'],
featureNames: ['country'],
),
]);

7 changes: 5 additions & 2 deletions lib/ml_preprocessing.dart
@@ -1,4 +1,7 @@
export 'package:ml_linalg/norm.dart';
export 'package:ml_preprocessing/src/encoder/encode_as_integer_labels.dart';
export 'package:ml_preprocessing/src/encoder/encode_as_one_hot_labels.dart';
export 'package:ml_preprocessing/src/encoder/encoder.dart';
export 'package:ml_preprocessing/src/encoder/pipeable/label_encode.dart';
export 'package:ml_preprocessing/src/encoder/pipeable/one_hot_encode.dart';
export 'package:ml_preprocessing/src/normalizer/normalize.dart';
export 'package:ml_preprocessing/src/normalizer/normalizer.dart';
export 'package:ml_preprocessing/src/pipeline/pipeline.dart';
@@ -5,16 +5,16 @@ import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

/// A factory function to use label categorical data encoder in pipeline
PipeableOperatorFn encodeAsIntegerLabels({
Iterable<int> columns,
Iterable<String> columnNames,
Iterable<int> features,
Iterable<String> featureNames,
String headerPrefix,
String headerPostfix,
}) => (data) => EncoderImpl(
data,
EncoderType.label,
SeriesEncoderFactoryImpl(),
featureNames: columnNames,
featureIds: columns,
featureIds: features,
featureNames: featureNames,
encodedHeaderPostfix: headerPostfix,
encodedHeaderPrefix: headerPrefix,
);
@@ -5,16 +5,16 @@ import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

/// A factory function to use `one hot` categorical data encoder in pipeline
PipeableOperatorFn encodeAsOneHotLabels({
Iterable<int> columns,
Iterable<String> columnNames,
Iterable<int> features,
Iterable<String> featureNames,
String headerPrefix,
String headerPostfix,
}) => (data) => EncoderImpl(
data,
EncoderType.oneHot,
SeriesEncoderFactoryImpl(),
featureNames: columnNames,
featureIds: columns,
featureIds: features,
featureNames: featureNames,
encodedHeaderPostfix: headerPostfix,
encodedHeaderPrefix: headerPrefix,
);
5 changes: 2 additions & 3 deletions lib/src/encoder/encoder.dart
@@ -2,11 +2,12 @@ import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_preprocessing/src/encoder/encoder_impl.dart';
import 'package:ml_preprocessing/src/encoder/encoder_type.dart';
import 'package:ml_preprocessing/src/encoder/series_encoder/series_encoder_factory_impl.dart';
import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

final _seriesEncoderFactory = SeriesEncoderFactoryImpl();

/// Categorical data encoder factory
abstract class Encoder {
abstract class Encoder implements Pipeable {
factory Encoder.oneHot(DataFrame fittingData, {
Iterable<int> featureIds,
Iterable<String> featureNames,
@@ -32,6 +33,4 @@ abstract class Encoder {
featureNames: featureNames,
featureIds: featureIds,
);

DataFrame encode(DataFrame data);
}
6 changes: 1 addition & 5 deletions lib/src/encoder/encoder_impl.dart
@@ -4,9 +4,8 @@ import 'package:ml_preprocessing/src/encoder/encoder_type.dart';
import 'package:ml_preprocessing/src/encoder/helpers/create_encoder_to_series_mapping.dart';
import 'package:ml_preprocessing/src/encoder/series_encoder/series_encoder.dart';
import 'package:ml_preprocessing/src/encoder/series_encoder/series_encoder_factory.dart';
import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

class EncoderImpl implements Pipeable, Encoder {
class EncoderImpl implements Encoder {
EncoderImpl(
DataFrame fittingData,
EncoderType encoderType,
@@ -35,7 +34,4 @@ class EncoderImpl implements Pipeable, Encoder {
: [series]);
return DataFrame.fromSeries(encoded);
}

@override
DataFrame encode(DataFrame data) => process(data);
}
6 changes: 6 additions & 0 deletions lib/src/normalizer/normalize.dart
@@ -0,0 +1,6 @@
import 'package:ml_linalg/norm.dart';
import 'package:ml_preprocessing/src/normalizer/normalizer.dart';
import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

PipeableOperatorFn normalize([Norm norm = Norm.euclidean]) =>
(_) => Normalizer(norm);
18 changes: 18 additions & 0 deletions lib/src/normalizer/normalizer.dart
@@ -0,0 +1,18 @@
import 'package:ml_dataframe/ml_dataframe.dart';
import 'package:ml_linalg/linalg.dart';
import 'package:ml_preprocessing/src/pipeline/pipeable.dart';

class Normalizer implements Pipeable {
Normalizer([this._norm = Norm.euclidean]);

final Norm _norm;

@override
DataFrame process(DataFrame input) {
final transformed = input
.toMatrix()
.mapRows((row) => row.normalize(_norm));

return DataFrame.fromMatrix(transformed, header: input.header);
}
}
9 changes: 4 additions & 5 deletions pubspec.yaml
@@ -1,17 +1,16 @@
name: ml_preprocessing
description: Implementaion of popular algorithms of data preprocessing for machine learning
version: 4.0.0
description: Popular algorithms of data preprocessing for machine learning
version: 5.0.0
author: Ilia Gyrdymov <ilgyrd@gmail.com>
homepage: https://github.com/gyrdym/ml_preprocessing

environment:
sdk: '>=2.4.0 <3.0.0'

dependencies:
ml_dataframe: ^0.0.3
ml_linalg: ^10.0.3
ml_dataframe: ^0.0.4
ml_linalg: ^11.0.0
quiver: ^2.0.2
tuple: ^1.0.2
xrange: ^0.0.4

dev_dependencies:
8 changes: 4 additions & 4 deletions test/encoder/encoder_impl_test.dart
@@ -20,7 +20,7 @@ void main() {
final dataFrame = DataFrame(data);
final encoder = Encoder.oneHot(dataFrame,
featureNames: ['second', 'third', 'fourth']);
final encoded = encoder.encode(dataFrame);
final encoded = encoder.process(dataFrame);

encoded.toMatrix();

@@ -39,7 +39,7 @@ void main() {
final dataFrame = DataFrame(data);
final encoder = Encoder.oneHot(dataFrame,
featureIds: [1, 2, 3]);
final encoded = encoder.encode(dataFrame);
final encoded = encoder.process(dataFrame);

encoded.toMatrix();

@@ -60,7 +60,7 @@ void main() {
final dataFrame = DataFrame(data);
final encoder = Encoder.label(dataFrame,
featureNames: ['second', 'third', 'fourth']);
final encoded = encoder.encode(dataFrame);
final encoded = encoder.process(dataFrame);

encoded.toMatrix();

@@ -78,7 +78,7 @@ void main() {
test('should use indices to access the needed series while encoding', () {
final dataFrame = DataFrame(data);
final encoder = Encoder.label(dataFrame, featureIds: [1, 2, 3]);
final encoded = encoder.encode(dataFrame);
final encoded = encoder.process(dataFrame);

encoded.toMatrix();

Empty file removed test/mocks.dart
Empty file.
14 changes: 14 additions & 0 deletions test/normalizer/normalize_test.dart
@@ -0,0 +1,14 @@
import 'package:ml_preprocessing/src/normalizer/normalize.dart';
import 'package:ml_preprocessing/src/normalizer/normalizer.dart';
import 'package:test/test.dart';

void main() {
group('normalize', () {
test('should return normalizer factory', () {
final normalizerFactory = normalize();
final normalizer = normalizerFactory(null);

expect(normalizer, isA<Normalizer>());
});
});
}
