
Commit: categorical_features_support
lukapecnik committed Dec 6, 2020
1 parent c107206 commit 45bbae8
Showing 58 changed files with 719 additions and 226 deletions.
35 changes: 29 additions & 6 deletions README.md
@@ -15,7 +15,7 @@

NiaAML is an automated machine learning Python framework based on nature-inspired algorithms for optimization. The name comes from the automated machine learning method of the same name [[1]](#1). Its goal is to efficiently compose the best possible classification pipeline for the given task using the components on the input. The components are divided into three groups: feature selection algorithms, feature transformation algorithms and classifiers. The framework uses nature-inspired algorithms for optimization to choose the best set of components for the classification pipeline on the output and to optimize their parameters. We use the <a href="https://github.com/NiaOrg/NiaPy">NiaPy framework</a>, a popular Python collection of nature-inspired algorithms, for the optimization process. The NiaAML framework is easy to use and to customize or extend to suit your needs.

-The NiaAML framework allows you not only to run full pipeline optimization, but also separate implemented components such as classifiers, feature selection algorithms, etc. It currently supports only numeric features on the input. **However, we are planning to add support for categorical features too.**
+The NiaAML framework allows you not only to run full pipeline optimization, but also separate implemented components such as classifiers, feature selection algorithms, etc. **It supports numerical and categorical features.**

* **Free software:** MIT license
* **Documentation:** https://niaaml.readthedocs.io/en/latest/
@@ -37,7 +37,7 @@ pip install niaaml --pre

## Components

-In the following sections you can see a list of currently implemented components divided into groups: classifiers, feature selection algorithms and feature transformation algorithms. At the end you can also see a list of currently implemented fitness functions for the optimization process. All of the components are passed into the optimization process using their class names. Let's say we want to choose between Adaptive Boosting, Bagging and Multi Layer Perceptron classifiers, Select K Best and Select Percentile feature selection algorithms and Normalizer as the feature transformation algorithm (may not be selected during the optimization process).
+In the following sections you can see a list of currently implemented components divided into groups: classifiers, feature selection algorithms and feature transformation algorithms. At the end you can also see a list of currently implemented fitness functions for the optimization process and categorical feature encoders. All of the components are passed into the optimization process using their class names. Let's say we want to choose between the Adaptive Boosting, Bagging and Multi Layer Perceptron classifiers, the Select K Best and Select Percentile feature selection algorithms, and Normalizer as the feature transformation algorithm (a feature transformation algorithm may not be selected during the optimization process at all).

@@ -48,7 +48,19 @@ PipelineOptimizer(
```python
PipelineOptimizer(
    data=...,
    classifiers=['AdaBoost', 'Bagging', 'MultiLayerPerceptron'],
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile'],
    feature_transform_algorithms=['Normalizer']
)
```

-For a full example see the [Examples section](#examples).
+The `categorical_features_encoder` argument of PipelineOptimizer is `None` by default. If your dataset contains any categorical features, you need to specify an encoder to use.

```python
PipelineOptimizer(
    data=...,
    classifiers=['AdaBoost', 'Bagging', 'MultiLayerPerceptron'],
    feature_selection_algorithms=['SelectKBest', 'SelectPercentile'],
    feature_transform_algorithms=['Normalizer'],
    categorical_features_encoder='OneHotEncoder'
)
```

+For a full example see the [Examples section](#examples) or the list of implemented examples [here](examples).
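
Once the optimizer is configured, you run it by passing the fitness function's class name along with the population sizes, numbers of evaluations and nature-inspired algorithms for both optimization stages. The sketch below assumes the `PipelineOptimizer` instance above is bound to a `pipeline_optimizer` variable; check the documentation for the exact signature.

```python
# find the best pipeline for the given data
# (fitness function, population sizes and evaluation counts for the
# component-selection and parameter-tuning stages, and the optimization
# algorithm to use in each stage)
pipeline = pipeline_optimizer.run('Accuracy', 15, 50, 400, 400, 'ParticleSwarmAlgorithm', 'ParticleSwarmAlgorithm')
```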

### Classifiers

@@ -85,6 +97,10 @@ For a full example see the [Examples section](#examples).
* F1-Score (F1),
* Precision (Precision).

### Categorical Feature Encoders

* One-Hot Encoder (OneHotEncoder).
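
To illustrate what one-hot encoding does to a categorical column, here is a generic pandas sketch (the idea behind the encoder, not NiaAML's internal implementation):

```python
import pandas

# a single categorical feature with three levels: a, b and c
feature = pandas.DataFrame({'f6': ['a', 'b', 'b', 'c']})

# one-hot encoding replaces the column with one binary column per level
print(pandas.get_dummies(feature, columns=['f6']))
#    f6_a  f6_b  f6_c
# 0     1     0     0
# 1     0     1     0
# 2     0     1     0
# 3     0     0     1
```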

## Optimization Process And Parameter Tuning

In NiaAML there are two types of optimization. The goal of the first type is to find an optimal set of components (feature selection algorithm, feature transformation algorithm and classifier). The next step is to find optimal parameters for the selected set of components, and that is the goal of the second type of optimization. Each component has an attribute `_params`, which is a dictionary of parameters and their possible values.
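
For illustration, a component's `_params` dictionary might look like the sketch below, modeled on the AdaBoost classifier's implementation (it assumes `ParameterDefinition` and `MinMax` are importable from `niaaml.utilities`; see the linked sources to verify):

```python
import numpy as np
from niaaml.utilities import MinMax, ParameterDefinition

# each entry lists the values the parameter-tuning stage may choose from:
# a numeric range is given as a MinMax with a target dtype, a discrete
# choice as a plain list of values
_params = dict(
    n_estimators=ParameterDefinition(MinMax(min=10, max=111), np.uint),
    algorithm=ParameterDefinition(['SAMME', 'SAMME.R'])
)
```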
@@ -112,6 +128,7 @@ Load data and try to find the optimal pipeline for the given components. The exa
from niaaml import PipelineOptimizer, Pipeline
from niaaml.data import BasicDataReader
import numpy
+import pandas

# dummy random data
data_reader = BasicDataReader(
@@ -140,7 +157,7 @@ And also load it from a file and use the pipeline.
loaded_pipeline = Pipeline.load('pipeline.ppln')

# some features (can be loaded using DataReader object instances)
-x = numpy.array([[0.35, 0.46, 5.32], [0.16, 0.55, 12.5]], dtype=float)
+x = pandas.DataFrame([[0.35, 0.46, 5.32], [0.16, 0.55, 12.5]])
y = loaded_pipeline.run(x)
```

@@ -152,18 +169,24 @@ pipeline.export_text('pipeline.txt')

This is a very simple example with dummy data. It is only intended to give you a basic idea of how to use the framework.

-### Example of a Pipeline Component Implementation
+### Example of a Pipeline Component's Implementation

The NiaAML framework is easily extensible, as you can implement components by overriding the base classes' methods. To implement a classifier you should inherit from the [Classifier](niaaml/classifiers/classifier.py) class, and you can do the same with the [FeatureSelectionAlgorithm](niaaml/preprocessing/feature_selection/feature_selection_algorithm.py) and [FeatureTransformAlgorithm](niaaml/preprocessing/feature_transform/feature_transform_algorithm.py) classes. All of the mentioned classes inherit from the [PipelineComponent](niaaml/pipeline_component.py) class.

Take a look at the [Classifier](niaaml/classifiers/classifier.py) class and the implementation of the [AdaBoost](niaaml/classifiers/ada_boost.py) classifier that inherits from it.
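
For instance, a minimal custom classifier wrapping scikit-learn's Gaussian Naive Bayes could look roughly like the sketch below (the `fit`/`predict` methods, the `Name` attribute and the `_params` dictionary are assumed from the AdaBoost example; consult the linked sources for the exact base-class contract):

```python
from sklearn.naive_bayes import GaussianNB
from niaaml.classifiers import Classifier

class NaiveBayes(Classifier):
    """Gaussian Naive Bayes classifier (sketch of a custom component)."""
    Name = 'Naive Bayes'

    def __init__(self, **kwargs):
        self._params = dict()  # no parameters for the tuning stage to optimize
        self.__model = GaussianNB()

    def fit(self, x, y, **kwargs):
        # fit the wrapped scikit-learn model to the training data
        self.__model.fit(x, y)

    def predict(self, x, **kwargs):
        # return predicted class labels for the given features
        return self.__model.predict(x)
```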

-### Example of a Fitness Function Implementation
+### Example of a Fitness Function's Implementation

The NiaAML framework also allows you to implement your own fitness function. All you need to do is extend the [FitnessFunction](niaaml/fitness/fitness_function.py) class.

Take a look at the [Accuracy](niaaml/fitness/accuracy.py) implementation.
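
A custom fitness function can be as small as a single method. The sketch below assumes, based on the Accuracy implementation, that `get_fitness` receives the predicted and expected labels and returns a score for the optimizer to maximize:

```python
from sklearn.metrics import f1_score
from niaaml.fitness import FitnessFunction

class WeightedF1(FitnessFunction):
    """Weighted F1-score as a pipeline fitness measure (sketch)."""
    Name = 'Weighted F1'

    def get_fitness(self, predicted, expected):
        # higher is better; the optimizer maximizes this value
        return f1_score(expected, predicted, average='weighted')
```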

### Example of a Feature Encoder's Implementation

The NiaAML framework also allows you to implement your own feature encoder. All you need to do is extend the [FeatureEncoder](niaaml/preprocessing/encoding/feature_encoder.py) class.

Take a look at the [OneHotEncoder](niaaml/preprocessing/encoding/one_hot_encoder.py) implementation.
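
By analogy with the OneHotEncoder implementation, a custom encoder fits on a single categorical column and transforms it into numeric columns. Here is a sketch wrapping scikit-learn's OrdinalEncoder (the `fit`/`transform` method names and the `FeatureEncoder` import path are assumptions based on the example above):

```python
import pandas
from sklearn.preprocessing import OrdinalEncoder
from niaaml.preprocessing.encoding import FeatureEncoder

class OrdinalFeatureEncoder(FeatureEncoder):
    """Encodes a categorical feature as consecutive integers (sketch)."""
    Name = 'Ordinal Encoder'

    def __init__(self, **kwargs):
        self.__encoder = OrdinalEncoder()

    def fit(self, feature):
        # learn the category-to-integer mapping from one categorical column
        self.__encoder.fit(feature)

    def transform(self, feature):
        # replace category labels with their integer codes
        return pandas.DataFrame(self.__encoder.transform(feature))
```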

### More

You can find more examples [here](examples).
2 changes: 1 addition & 1 deletion README.rst
@@ -30,7 +30,7 @@ which is a popular Python collection of nature-inspired algorithms. The
NiaAML framework is easy to use and customize or expand to suit your
needs.

-The NiaAML framework allows you not only to run full pipeline optimization, but also separate implemented components such as classifiers, feature selection algorithms, etc. It currently supports only numeric features on the input. **However, we are planning to add support for categorical features too.**
+The NiaAML framework allows you not only to run full pipeline optimization, but also separate implemented components such as classifiers, feature selection algorithms, etc. **It supports numerical and categorical features.**

- **Documentation:** https://niaaml.readthedocs.io/en/latest/

2 changes: 1 addition & 1 deletion docs/getting_started.rst
@@ -54,7 +54,7 @@ If you want to load and use the saved pipeline later, you can use the following
x = numpy.array([[0.35, 0.46, 5.32], [0.16, 0.55, 12.5]], dtype=float)
y = loaded_pipeline.run(x)
-This is a very simple example with dummy data. It is only intended to give you a basic idea on how to use the framework. **NiaAML currently supports only numeric features. However, we are planning to add support for categorical features too.**
+This is a very simple example with dummy data. It is only intended to give you a basic idea of how to use the framework. **NiaAML supports numerical and categorical features.**

Find more examples `here <https://github.com/lukapecnik/NiaAML/tree/master/examples>`_

100 changes: 100 additions & 0 deletions examples/example_files/dataset_categorical.csv
@@ -0,0 +1,100 @@
11.18628795,8.553093649,2.337681394,7.740221556,2.076186822,12.53763102,a,Class 1
10.14583771,8.771100076,9.617580733,11.28898328,3.731375463,3.61496461,a,Class 1
7.608378163,9.215208082,12.21721984,0.217171385,10.84786072,0.068879981,b,Class 1
9.741381036,4.373282412,0.46661895,9.670759021,3.769897645,3.166427585,b,Class 1
6.326460397,4.950181274,0.955970145,13.27044224,13.3803711,6.7042629,b,Class 2
6.614221714,5.139908792,6.322918249,7.613060017,2.965112605,8.757576779,c,Class 1
9.77079694,13.22809724,12.27910247,4.486662894,1.637543898,3.145057057,b,Class 2
5.986399993,3.575034439,4.436729038,10.27403804,6.085388637,9.638152072,a,Class 2
9.306449023,0.419825519,5.207460405,3.352657815,8.955085451,13.74251068,c,Class 2
12.85278467,0.554461277,13.25029775,7.70837554,12.81365721,8.448635991,c,Class 2
10.32440339,3.195964543,1.215275549,3.741461311,11.6736581,6.435247906,a,Class 1
12.45890387,6.754763706,3.568497268,6.787252535,4.528132031,11.24413956,a,Class 1
4.92766004,8.894839385,1.599234587,6.491628957,1.003488457,8.990811618,a,Class 1
14.62465143,8.299301507,12.02718728,3.868350191,5.604315633,13.4574281,a,Class 1
5.548176135,6.507710991,0.798813746,13.28445746,7.037019737,13.71370375,b,Class 2
4.414357119,9.250023962,8.716711231,7.733484723,3.661143919,14.63698956,b,Class 1
7.794274494,0.554776276,6.717665128,3.422141362,12.80249591,3.744667173,b,Class 2
10.46207608,14.78604433,11.14501886,13.28194261,13.35036026,8.342502238,c,Class 1
11.38516345,11.33272181,1.919660335,4.978216028,8.668419104,6.052792392,c,Class 2
11.19640261,10.3436337,0.939034115,14.91069148,7.269366966,12.53406348,c,Class 2
11.06549373,0.091051491,13.96718884,12.53348993,4.476687297,11.87992708,a,Class 1
5.721763439,10.70136406,4.677923895,12.04602629,6.630499903,13.04574224,c,Class 2
14.87203026,4.717515614,12.16090195,10.17484858,1.258457287,3.762734746,c,Class 1
9.517250388,14.61073986,10.55186687,12.13409641,4.195938316,14.9085867,b,Class 2
5.490151571,11.07922707,2.912349404,11.26243041,6.909836863,12.93169762,b,Class 1
3.597959325,7.3606205,1.89533481,8.407778067,12.94742999,9.956797585,a,Class 2
13.99187099,6.16144391,4.430074749,10.48992388,6.724889945,11.63545045,a,Class 1
13.87167852,11.47473231,12.91040409,5.329482463,12.41092153,9.923540019,c,Class 2
6.884021,0.536048784,13.77495679,6.51467553,4.70254023,8.780237509,b,Class 2
2.208914531,12.70665676,13.62555578,9.598180651,7.438779306,7.81610053,b,Class 1
8.659531425,8.209053873,6.907242925,9.847209807,7.643627147,1.24454444,b,Class 1
1.448257785,10.22497998,1.269324615,3.714269901,13.03906827,2.870250771,b,Class 1
10.19241437,4.058700021,8.886001739,5.828695389,2.605134041,1.19188785,b,Class 1
7.777359497,10.96783191,4.890083745,5.284618971,4.411218163,8.605757632,b,Class 2
1.056011622,7.844004878,10.65020289,4.234763934,6.43943205,1.262495126,a,Class 2
7.648844009,10.14403542,9.539688734,13.66072313,0.330411845,5.610949661,a,Class 1
4.321962749,5.604955856,7.525456962,2.795185293,0.557651224,9.096120183,c,Class 2
7.580303996,14.13657189,2.208779404,12.65807527,8.616258995,14.2742891,c,Class 1
6.617679318,12.17838447,6.219814209,9.278219597,2.627013838,10.26198055,a,Class 1
10.42636852,11.37466476,1.605370071,13.38238859,13.4486372,0.658796404,a,Class 1
10.52477845,5.275716432,11.83515271,0.617870822,0.921374579,8.348557261,b,Class 2
12.42122246,0.012249697,7.74555252,12.02705019,3.442939685,7.110063876,b,Class 1
2.721191099,14.56211777,12.2194075,8.457083772,1.843488398,8.775189039,b,Class 2
11.93193151,7.265208519,0.45505228,4.217468632,13.6978792,11.24703349,c,Class 2
0.493376888,0.414245824,5.492426678,3.926579473,5.14363276,9.3274729,b,Class 2
11.99067679,2.224771613,14.51607498,13.038479,2.048398725,10.21056055,a,Class 2
4.39009476,2.715892095,13.65208099,1.276459275,3.1947636,5.738578547,c,Class 2
9.517850998,7.870570236,11.66133708,3.158457987,0.994101959,1.760078291,c,Class 2
1.67226824,1.257990444,13.35645397,6.432977593,8.173353149,10.47964661,a,Class 2
11.40110217,3.755922456,14.78250639,9.12235283,5.463228968,8.004612121,a,Class 1
0.51828403,7.467344863,0.403372329,1.324884922,6.204846153,0.397427501,a,Class 1
8.56890712,3.700288257,14.23433924,1.836880065,0.168958671,1.260377664,a,Class 2
0.927565538,0.256079044,8.244925899,10.78666638,13.47379713,0.009413535,b,Class 2
11.12587642,0.512971929,10.37022985,9.927926894,8.924001776,5.446182158,b,Class 1
8.296166587,3.881765358,12.50788295,6.751744679,3.270039419,13.16076438,b,Class 2
5.342454277,14.06289262,6.052115238,4.60660751,14.92130785,9.251614117,c,Class 1
10.76413906,3.108242418,7.407200789,0.124640979,6.315064545,6.974791847,c,Class 2
11.60334067,7.869475964,14.06285885,3.010169778,0.21862987,10.7119562,c,Class 1
0.675107878,10.62554075,9.16909931,1.51930099,9.054828927,8.018314854,a,Class 2
0.136391516,12.15438651,13.10410369,4.54379884,1.467941336,3.708272962,c,Class 2
11.20314646,7.917973833,7.205146518,14.47482833,5.385158132,3.962870806,c,Class 1
12.96777011,7.276652989,12.46734536,8.774357457,14.49755617,1.021454967,b,Class 1
7.259751863,5.37753719,8.753494011,1.105904802,7.423842186,7.060245922,b,Class 1
5.550401633,10.28344926,8.849268232,10.35224505,11.42901447,2.015178403,a,Class 2
8.724250626,5.144158413,8.881589983,5.654339781,3.348767179,7.567443724,a,Class 2
1.505308287,8.327887318,5.967980754,5.861512631,1.942362782,12.08752455,c,Class 1
11.95352828,13.83709019,1.484043207,14.9990425,0.358430191,0.936128377,b,Class 2
14.1424292,5.653091086,14.75697191,6.534335531,14.59624216,4.217427045,b,Class 1
2.096214566,13.41972927,5.026757888,10.15382225,10.69199037,8.119000359,b,Class 2
4.658969577,8.152082829,10.69897004,3.807611391,9.432866697,7.469063458,b,Class 1
9.880365248,6.857500577,9.50270486,6.185811124,6.801593649,1.426651215,b,Class 2
11.38053935,5.64968146,13.16726558,3.969547861,3.409401613,6.754962952,b,Class 1
2.549597194,3.81373774,3.381883424,12.54165021,7.238285696,5.014469506,a,Class 1
2.149956112,14.18695148,4.495586504,1.193989236,0.629565843,10.71726557,a,Class 1
0.633065458,10.57661883,3.911208047,4.737683148,10.67249983,11.44130896,c,Class 2
10.98958055,8.538690522,2.221702088,7.94460522,7.268542052,13.0506614,c,Class 2
14.05448371,5.906069731,11.02070992,14.78464345,1.395098041,12.45034592,a,Class 1
4.849203233,14.92593789,14.83374088,13.33589083,10.91265222,6.015994872,a,Class 2
1.788538553,1.189933547,13.37927743,7.078881338,0.115268965,1.102757553,b,Class 1
1.520260264,4.390949317,8.961363089,9.116191933,4.902286012,13.82917543,b,Class 2
5.143515013,1.626830627,5.011771888,14.53607373,9.254769126,5.987742339,b,Class 2
0.383485623,5.893120492,2.198084919,3.607295516,11.2909701,14.19259294,c,Class 1
3.543625982,1.817300049,12.79701902,9.150819857,4.270171936,1.046802952,b,Class 2
9.014121301,8.894615211,14.32697976,12.05396604,6.610724668,12.9453385,a,Class 1
2.178293829,11.00240774,3.4661733,4.216419592,14.36522422,3.571201671,c,Class 1
9.218901263,8.682081223,12.48795288,8.796277452,13.72799658,1.414017549,c,Class 2
1.417376539,13.2588434,13.00750995,9.108292367,5.332117011,3.7214796,a,Class 2
11.40541996,10.59274384,11.90631845,4.497592473,4.532009755,4.117336922,a,Class 1
5.547732807,6.107428176,13.30160131,8.442144861,9.854871343,3.268384157,a,Class 2
2.558435481,12.36056669,7.777967112,6.812644994,8.532351866,6.71817697,a,Class 2
2.349328005,11.73919423,11.20515163,11.47196866,13.24600243,1.770939874,b,Class 1
13.34706077,13.86142631,0.291296401,0.12119829,5.885044406,8.475403207,b,Class 2
5.351503888,6.40942837,11.07531808,8.972571254,3.233818614,12.43439266,b,Class 2
6.693621558,13.96686031,1.475546478,12.35803005,0.873347546,0.688133753,c,Class 2
10.48922559,6.646089272,7.4076759,7.873827219,5.742578275,1.806450848,c,Class 2
1.365010518,0.840426698,6.044826791,12.33437799,5.33827304,14.55706457,c,Class 2
6.145883127,12.20161505,1.162956248,11.67002394,6.279495076,5.709716727,a,Class 1
12.99028641,3.448828215,4.946279072,10.87002826,14.83427318,9.154544604,c,Class 2
1.109266891,2.564645156,10.64938657,7.677215295,8.625541169,8.960849049,c,Class 1
6.891117595,13.9566784,0.952437927,6.585976751,13.16019122,7.78218351,b,Class 1
13 changes: 9 additions & 4 deletions examples/factories.py
@@ -2,6 +2,7 @@
from niaaml.preprocessing.feature_selection import FeatureSelectionAlgorithmFactory
from niaaml.preprocessing.feature_transform import FeatureTransformAlgorithmFactory
from niaaml.fitness import FitnessFactory
+from niaaml.preprocessing.encoding import EncoderFactory

"""
In this example, we show how to use all of the implemented factories to create new object instances using their class names. You may also
@@ -12,7 +13,8 @@
classifier_factory = ClassifierFactory()
fsa_factory = FeatureSelectionAlgorithmFactory()
fta_factory = FeatureTransformAlgorithmFactory()
-f = FitnessFactory()
+f_factory = FitnessFactory()
+e_factory = EncoderFactory()

# get an instance of the MultiLayerPerceptron class
mlp = classifier_factory.get_result('MultiLayerPerceptron')
@@ -23,7 +25,10 @@
# get an instance of the Normalizer class
normalizer = fta_factory.get_result('Normalizer')

-# get an instace of the Precision class
-precision = f.get_result('Precision')
+# get an instance of the Precision class
+precision = f_factory.get_result('Precision')

-# variables mlp, pso, normalizer and precision contain instances of the classes with the passed names
+# get an instance of the OneHotEncoder class
+ohe = e_factory.get_result('OneHotEncoder')
+
+# variables mlp, pso, normalizer, precision and ohe contain instances of the classes with the passed names
23 changes: 23 additions & 0 deletions examples/feature_encoding.py
@@ -0,0 +1,23 @@
from niaaml.preprocessing.encoding import OneHotEncoder, encode_categorical_features
import os
from niaaml.data import CSVDataReader

"""
In this example, we show how to individually use an implemented categorical feature encoder and its methods. In this case we use OneHotEncoder for demonstration, but
you can use any of the implemented encoders in the same way.
"""

# prepare data reader using csv file
data_reader = CSVDataReader(src=os.path.dirname(os.path.abspath(__file__)) + '/example_files/dataset_categorical.csv', has_header=False, contains_classes=True)

# instantiate OneHotEncoder
ohe = OneHotEncoder()

# fit, transform and print the categorical feature in the dataset (index 6)
features = data_reader.get_x()
ohe.fit(features[[6]])
f = ohe.transform(features[[6]])
print(f)

# if you wish to get an array of encoders for all categorical features in a dataset (along with the transformed DataFrame of features), you may use the utility method encode_categorical_features
transformed_features, encoders = encode_categorical_features(features, 'OneHotEncoder')
3 changes: 2 additions & 1 deletion examples/optimize_run_pipeline.py
@@ -5,6 +5,7 @@
from niaaml.data import CSVDataReader
import os
import numpy
+import pandas

"""
In this example, we show how to individually use the Pipeline class. You may use this if you want to test out a specific classification pipeline.
@@ -25,6 +26,6 @@

# run the pipeline using dummy data
# you could run the pipeline before the optimization process, but you would get wrong predictions, as nothing in the pipeline has been fit to the given dataset
-predicted = pipeline.run(numpy.random.uniform(low=0.0, high=15.0, size=(30, data_reader.get_x().shape[1])))
+predicted = pipeline.run(pandas.DataFrame(numpy.random.uniform(low=0.0, high=15.0, size=(30, data_reader.get_x().shape[1]))))

# pipeline variable contains Pipeline object that can be used for further classification, exported as an object (that can be later loaded and used) or exported as text file