diff --git a/README.md b/README.md
index 6e18f2df..14c03df2 100644
--- a/README.md
+++ b/README.md
@@ -4,7 +4,7 @@
 ML.NET was originally developed in Microsoft Research and is used across many product groups in Microsoft like Windows, Bing, PowerPoint, Excel and others.
 `nimbusml` was built to enable data science teams that are more familiar with Python to take advantage of ML.NET's functionality and performance.
-This package enables training ML.NET pipelines or integrating ML.NET components directly into Scikit-Learn pipelines (it supports `numpy.ndarray`, `scipy.sparse_cst`, and `pandas.DataFrame` as inputs).
+This package enables training ML.NET pipelines or integrating ML.NET components directly into [scikit-learn](https://scikit-learn.org/stable/) pipelines (it supports `numpy.ndarray`, `scipy.sparse_cst`, and `pandas.DataFrame` as inputs).
 Documentation can be found [here](https://docs.microsoft.com/en-us/NimbusML/overview) and additional notebook samples can be found [here](https://github.com/Microsoft/NimbusML-Samples).
@@ -48,7 +48,7 @@ pipeline.fit(train_data)
 results = pipeline.predict(test_data)
 ```
-Instead of creating an `nimbusml` pipeline, you can also integrate components into Scikit-Learn pipelines:
+Instead of creating an `nimbusml` pipeline, you can also integrate components into scikit-learn pipelines:
 ```python
 from sklearn.pipeline import Pipeline
diff --git a/src/python/docs/sphinx/concepts/datasources.rst b/src/python/docs/sphinx/concepts/datasources.rst
index 23687724..c1fd099d 100644
--- a/src/python/docs/sphinx/concepts/datasources.rst
+++ b/src/python/docs/sphinx/concepts/datasources.rst
@@ -122,7 +122,7 @@ Output Data Types of Transforms
 The return type of all of the transforms is a ``pandas.DataFrame``, when they are used inside a `sklearn.pipeline.Pipeline
-`_
+`_
 or when they are used individually.
 However, when used inside a :py:class:`nimbusml.Pipeline`, the outputs are often stored in
diff --git a/src/python/docs/sphinx/concepts/experimentvspipeline.rst b/src/python/docs/sphinx/concepts/experimentvspipeline.rst
index c0e3116c..d796792a 100644
--- a/src/python/docs/sphinx/concepts/experimentvspipeline.rst
+++ b/src/python/docs/sphinx/concepts/experimentvspipeline.rst
@@ -9,7 +9,7 @@ nimbusml.Pipeline() versus sklearn.Pipeline()
 .. contents::
    :local:
-This sections highlights the differences between using a `sklearn.Pipeline `_
+This sections highlights the differences between using a `sklearn.Pipeline `_
 and :py:class:`nimbusml.Pipeline` to compose a sequence of transformers and/or trainers.
@@ -17,7 +17,7 @@ sklearn.Pipeline
 ----------------
 ``nimbusml`` transforms and trainers are designed to be compatible with
-`sklearn.Pipeline `_.
+`sklearn.Pipeline `_.
 For fully optimized performance and added functionality, it is recommended to use :py:class:`nimbusml.Pipeline`. See below for more details.
@@ -38,7 +38,7 @@ files that are too large to fit into memory, there is no easy way to train estim
 streaming the examples one at a time.
 The :py:class:`nimbusml.Pipeline` module accepts inputs X and y similarly to
-`sklearn.Pipeline `_, but also
+`sklearn.Pipeline `_, but also
 inputs of type :py:class:`nimbusml.FileDataStream`, which is an optimized streaming file reader class. This is highly recommended for large datasets. See [Data Sources](datasources.md#data-from-a-filedatastream) for an
 example of using Pipeline with FileDataStream to read data in files.
@@ -46,7 +46,7 @@ example of using Pipeline with FileDataStream to read data in files.
 Select which Columns to Transform
 """""""""""""""""""""""""""""""""
-When using `sklearn.Pipeline `_
+When using `sklearn.Pipeline `_
 the data columns of X and y (of type``numpy.array`` or ``scipy.sparse_csr``) are anonymous and cannot be referenced by name.
 Operations and transformations are therefore performed on all columns of the data.
@@ -66,7 +66,7 @@ Optimized Chaining of Trainers/Transforms
 Using NimbusML, trainers and transforms within a :py:class:`nimbusml.Pipeline` will generally result in better performance compared to using them in a
-`sklearn.Pipeline `_.
+`sklearn.Pipeline `_.
 Data copying is minimized when processing is limited to within the C# libraries, and if all components are in the same pipeline, data copies between C# and Python is reduced.
diff --git a/src/python/docs/sphinx/concepts/types.rst b/src/python/docs/sphinx/concepts/types.rst
index e1d53858..32fadb86 100644
--- a/src/python/docs/sphinx/concepts/types.rst
+++ b/src/python/docs/sphinx/concepts/types.rst
@@ -61,7 +61,7 @@ dataframe and therefore the column_name can still be used to refer to the Vector
 efficiently without any conversion to a dataframe. Since the ``column_name`` of the vector is also preserved, it is possible to refer to it by downstream transforms by name. However, when transforms are used inside a `sklearn.pipeline.Pipeline()
- `_, the output
+ `_, the output
 of every transform is converted to a ``pandas.DataFrame`` first where the names of ``slots`` are preserved, but the ``column_name`` of the vector is dropped.
diff --git a/src/python/docs/sphinx/metrics.rst b/src/python/docs/sphinx/metrics.rst
index da748a0c..4efe0103 100644
--- a/src/python/docs/sphinx/metrics.rst
+++ b/src/python/docs/sphinx/metrics.rst
@@ -58,7 +58,7 @@ This corresponds to evaltype='binary'.
 in `ML.NET `_). This expression is asymptotically equivalent to the area under the curve which is what
- `scikit-learn `_ computation.
+ `scikit-learn `_ computation.
 computes (see `auc `_). That explains discrepencies on small test sets.
diff --git a/src/python/nimbusml/datasets/datasets.py b/src/python/nimbusml/datasets/datasets.py
index 9cb85145..56c325a6 100644
--- a/src/python/nimbusml/datasets/datasets.py
+++ b/src/python/nimbusml/datasets/datasets.py
@@ -75,7 +75,7 @@ def as_df(self):
 class DataSetIris(DataSet):
     """
-    `Iris dataset `_ dataset.
     """