This repository was archived by the owner on Aug 25, 2024. It is now read-only.

Commit afccb0f

0dust authored and pdxjohnny committed
model: scikit: Add clustering models
Signed-off-by: John Andersen <johnandersenpdx@gmail.com>
1 parent 88d7b78 commit afccb0f

14 files changed: +776 -128 lines changed

CHANGELOG.md

Lines changed: 10 additions & 0 deletions
@@ -6,6 +6,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [Unreleased]
 ### Added
+- scikit models
+  - Clusterers
+    - KMeans
+    - Birch
+    - MiniBatchKMeans
+    - AffinityPropagation
+    - MeanShift
+    - SpectralClustering
+    - AgglomerativeClustering
+    - OPTICS
 - `allowempty` added to source config parameters.
 - Quickstart document to show how to use models from Python.
 - The latest release of the documentation now includes a link to the

dffml/util/asynctestcase.py

Lines changed: 66 additions & 1 deletion
@@ -2,6 +2,17 @@
 # Copyright (c) 2019 Intel Corporation
 """
 Adds support for test cases which need to be run in an event loop.
+
+Also contains a class integration tests can derive from. The integration
+tests can declare which of the plugins (that are a part of the main repo) they
+require to run. The test will be skipped if the plugin isn't installed in
+development mode.
+
+To install all plugins in development mode
+    $ dffml service dev install
+
+Add the -user flag to install to ~/.local
+
 """
 import os
 import random
@@ -12,7 +23,22 @@
 import unittest
 import tempfile
 import contextlib
-from typing import Optional
+
+import re
+import io
+import json
+from typing import Dict, Any, Optional
+
+from dffml.repo import Repo
+from dffml.base import config
+from dffml.df.types import Definition, Operation, DataFlow, Input
+from dffml.df.base import op
+from dffml.cli.cli import CLI
+from dffml.model.model import Model
+from dffml.service.dev import Develop
+from dffml.util.packaging import is_develop
+from dffml.util.entrypoint import load
+from dffml.config.config import BaseConfigLoader


 class AsyncTestCase(unittest.TestCase):
@@ -97,3 +123,42 @@ def mktempfile(
         if text:
             pathlib.Path(filename).write_text(inspect.cleandoc(text) + "\n")
         return filename
+
+
+def relative_path(*args):
+    """
+    Returns a pathlib.Path object with the path relative to this file.
+    """
+    target = pathlib.Path(__file__).parents[0] / args[0]
+    for path in list(args)[1:]:
+        target /= path
+    return target
+
+
+@contextlib.contextmanager
+def relative_chdir(*args):
+    """
+    Change directory to a location relative to the location of this file.
+    """
+    target = relative_path(*args)
+    orig_dir = os.getcwd()
+    try:
+        os.chdir(target)
+        yield target
+    finally:
+        os.chdir(orig_dir)
+
+
+class IntegrationCLITestCase(AsyncExitStackTestCase):
+    REQUIRED_PLUGINS = []
+
+    async def setUp(self):
+        await super().setUp()
+        self.required_plugins(*self.REQUIRED_PLUGINS)
+        self.stdout = io.StringIO()
+
+    def required_plugins(self, *args):
+        if not all(map(is_develop, args)):
+            self.skipTest(
+                f"Required plugins: {', '.join(args)} must be installed in development mode"
+            )
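
The skip-if-missing pattern that `IntegrationCLITestCase` introduces can be illustrated without DFFML installed. The following is a minimal sketch, not the project's implementation: `is_installed` is a hypothetical stand-in for `dffml.util.packaging.is_develop` and only checks that a top-level module is importable, not that it is installed in development mode. Unlike the diff above, it lists only the missing names in the skip message.

```python
import importlib.util
import unittest


def is_installed(name: str) -> bool:
    """Hypothetical stand-in for is_develop: True if the top-level module is importable."""
    return importlib.util.find_spec(name) is not None


class PluginTestCase(unittest.TestCase):
    # Subclasses list the modules they need; an empty list means "always run".
    REQUIRED_PLUGINS = []

    def setUp(self):
        super().setUp()
        missing = [n for n in self.REQUIRED_PLUGINS if not is_installed(n)]
        if missing:
            # skipTest raises unittest.SkipTest, so the test body never runs.
            self.skipTest(f"Required plugins: {', '.join(missing)} must be installed")


class NeedsJson(PluginTestCase):
    REQUIRED_PLUGINS = ["json"]  # part of the standard library, so this test runs

    def test_runs(self):
        self.assertTrue(True)


class NeedsMissing(PluginTestCase):
    REQUIRED_PLUGINS = ["surely_not_a_real_plugin_name"]  # hypothetical missing plugin

    def test_skipped(self):
        self.fail("never reached, the test is skipped in setUp")


def run(case):
    """Run one TestCase class and return its unittest.TestResult."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(case)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

Calling `run(NeedsMissing)` reports one skipped test and no failures, which is the behaviour the docstring in this file describes for plugins that are not installed.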

docs/plugins/dffml_model.rst

Lines changed: 108 additions & 3 deletions
@@ -471,6 +471,22 @@ Predicting with trained model:
 | +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 | | MultinomialNB | scikitmnb | `scikitmnb <https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB/>`_ |
 +----------------+-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| Clustering | KMeans | scikitkmeans | `scikitkmeans <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans/>`_ |
+| +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| | Birch | scikitbirch | `scikitbirch <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.Birch.html#sklearn.cluster.Birch/>`_ |
+| +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| | MiniBatchKMeans | scikitmbkmeans | `scikitmbkmeans <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html#sklearn.cluster.MiniBatchKMeans/>`_ |
+| +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| | AffinityPropagation | scikitap | `scikitap <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html#sklearn.cluster.AffinityPropagation/>`_ |
+| +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| | MeanShift | scikitms | `scikitms <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MeanShift.html#sklearn.cluster.MeanShift/>`_ |
+| +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| | SpectralClustering | scikitsc | `scikitsc <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.SpectralClustering.html#sklearn.cluster.SpectralClustering/>`_ |
+| +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| | AgglomerativeClustering | scikitac | `scikitac <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html#sklearn.cluster.AgglomerativeClustering/>`_ |
+| +-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
+| | OPTICS | scikitoptics | `scikitoptics <https://scikit-learn.org/stable/modules/generated/sklearn.cluster.OPTICS.html#sklearn.cluster.OPTICS/>`_ |
++----------------+-------------------------------+----------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


 **Usage Example:**
@@ -512,14 +528,14 @@ Let us take a simple example:
     $ dffml train \
         -model scikitlr \
         -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-        -model-predict Salary \
+        -model-predict Salary:float:1 \
         -sources f=csv \
         -source-filename train.csv \
         -log debug
     $ dffml accuracy \
         -model scikitlr \
         -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-        -model-predict Salary \
+        -model-predict Salary:float:1 \
         -sources f=csv \
         -source-filename test.csv \
         -log debug
@@ -528,7 +544,7 @@ Let us take a simple example:
       dffml predict all \
         -model scikitlr \
         -model-features Years:int:1 Expertise:int:1 Trust:float:1 \
-        -model-predict Salary \
+        -model-predict Salary:float:1 \
         -sources f=csv \
         -source-filename /dev/stdin \
         -log debug
@@ -549,3 +565,92 @@ Let us take a simple example:
         }
     ]

+
+Example below uses KMeans Clustering Model on a small randomly generated dataset.
+
+.. code-block:: console
+
+    $ cat > train.csv << EOF
+    Col1, Col2, Col3, Col4
+    5.05776417, 8.55128116, 6.15193196, -8.67349666
+    3.48864265, -7.25952218, -4.89216256, 4.69308946
+    -8.16207603, 5.16792984, -2.66971993, 0.2401882
+    6.09809669, 8.36434181, 6.70940915, -7.91491768
+    -9.39122566, 5.39133807, -2.29760281, -1.69672981
+    0.48311336, 8.19998973, 7.78641979, 7.8843821
+    2.22409135, -7.73598586, -4.02660224, 2.82101794
+    2.8137247 , 8.36064298, 7.66196849, 3.12704676
+    EOF
+    $ cat > test.csv << EOF
+    Col1, Col2, Col3, Col4, cluster
+    -10.16770144, 2.73057215, -1.49351481, 2.43005691, 6
+    3.59705381, -4.76520663, -3.34916068, 5.72391486, 1
+    4.01612313, -4.641852 , -4.77333308, 5.87551683, 0
+    EOF
+    $ dffml train \
+        -model scikitkmeans \
+        -model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
+        -sources f=csv \
+        -source-filename train.csv \
+        -source-readonly \
+        -log debug
+    $ dffml accuracy \
+        -model scikitkmeans \
+        -model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1\
+        -model-tcluster cluster:int:1 \
+        -sources f=csv \
+        -source-filename test.csv \
+        -source-readonly \
+        -log debug
+    0.6365141682948129
+    $ echo -e 'Col1,Col2,Col3,Col4\n6.09809669,8.36434181,6.70940915,-7.91491768\n' | \
+      dffml predict all \
+        -model scikitkmeans \
+        -model-features Col1:float:1 Col2:float:1 Col3:float:1 Col4:float:1 \
+        -sources f=csv \
+        -source-filename /dev/stdin \
+        -source-readonly \
+        -log debug
+    [
+        {
+            "extra": {},
+            "features": {
+                "Col1": 6.09809669,
+                "Col2": 8.36434181,
+                "Col3": 6.70940915,
+                "Col4": -7.91491768
+            },
+            "last_updated": "2020-01-12T22:51:15Z",
+            "prediction": {
+                "confidence": 0.6365141682948129,
+                "value": 2
+            },
+            "src_url": "0"
+        }
+    ]
+
+**NOTE**: `Transductive <https://scikit-learn.org/stable/glossary.html#term-transductive/>`_ Clusterers(scikitsc, scikitac, scikitoptics) cannot handle unseen data.
+Ensure that `predict` and `accuracy` for these algorithms uses training data.
+
+**Args**
+
+- predict: Feature
+
+  - Label or the value to be predicted
+  - Only used by classification and regression models
+
+- tcluster: Feature
+
+  - True cluster, only used by clustering models
+  - Passed with `accuracy` to return `mutual_info_score`
+  - If not passed `accuracy` returns `silhouette_score`
+
+- features: List of features
+
+  - Features to train on
+
+- directory: String
+
+  - default: /home/user/.cache/dffml/scikit-{Entrypoint}
+  - Directory where state should be saved
+
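
The documentation added above says that passing `tcluster` makes `accuracy` return `mutual_info_score`. As a rough illustration of what that metric measures, here is a pure-Python sketch of the mutual information between two labelings, using natural logarithms as scikit-learn's `sklearn.metrics.mutual_info_score` does. The function name `mutual_info` is an assumption made for this example, and the sketch is not the code path the plugin uses.

```python
import math
from collections import Counter


def mutual_info(labels_true, labels_pred):
    """Mutual information (in nats) between two cluster labelings of the same samples."""
    n = len(labels_true)
    if n != len(labels_pred):
        raise ValueError("both labelings must cover the same samples")
    joint = Counter(zip(labels_true, labels_pred))  # contingency table n_ij
    true_counts = Counter(labels_true)              # row sums a_i
    pred_counts = Counter(labels_pred)              # column sums b_j
    mi = 0.0
    for (t, p), n_tp in joint.items():
        # Each non-empty cell contributes (n_ij / n) * log(n * n_ij / (a_i * b_j)).
        mi += (n_tp / n) * math.log(n * n_tp / (true_counts[t] * pred_counts[p]))
    return mi


# The score depends only on how samples are grouped, not on the label values,
# so true labels such as 6, 1, 0 need not match the IDs a clusterer assigns.
same_grouping = mutual_info([6, 6, 1, 1], [0, 0, 1, 1])
unrelated = mutual_info([0, 0, 1, 1], [0, 1, 0, 1])
```

Here `same_grouping` equals `math.log(2)` because the two labelings partition the samples identically, while `unrelated` is `0.0` because knowing one labeling says nothing about the other.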
