From af401561468a92cb407bea8157ff8e6dc57c4325 Mon Sep 17 00:00:00 2001 From: Zach Kurtz Date: Sat, 26 May 2018 15:14:38 -0400 Subject: [PATCH] [docs] Edits for grammar and clarity (#1389) * A nitpicky grammar edit with minor clarifications added. * fix link * strike s * try a different optimal-split link, clarify experimental details * smoothing the FAQ * edit Features.rst * several minor edits throughout docs * histogram-based --- docs/Advanced-Topics.rst | 27 ++++++------ docs/Experiments.rst | 32 +++++++-------- docs/FAQ.rst | 77 +++++++++++++++++----------------- docs/Features.rst | 84 +++++++++++++++++--------------------- docs/Parameters-Tuning.rst | 9 ++-- docs/Parameters.rst | 2 +- docs/Python-Intro.rst | 2 +- 7 files changed, 111 insertions(+), 122 deletions(-) diff --git a/docs/Advanced-Topics.rst b/docs/Advanced-Topics.rst index 563c8d5263f..ee1a6cd7a31 100644 --- a/docs/Advanced-Topics.rst +++ b/docs/Advanced-Topics.rst @@ -4,35 +4,38 @@ Advanced Topics Missing Value Handle -------------------- -- LightGBM enables the missing value handle by default, you can disable it by set ``use_missing=false``. +- LightGBM enables the missing value handle by default. Disable it by setting ``use_missing=false``. -- LightGBM uses NA (NaN) to represent the missing value by default, you can change it to use zero by set ``zero_as_missing=true``. +- LightGBM uses NA (NaN) to represent missing values by default. Change it to use zero by setting ``zero_as_missing=true``. -- When ``zero_as_missing=false`` (default), the unshown value in sparse matrices (and LightSVM) is treated as zeros. +- When ``zero_as_missing=false`` (default), the unshown values in sparse matrices (and LightSVM) are treated as zeros. -- When ``zero_as_missing=true``, NA and zeros (including unshown value in sparse matrices (and LightSVM)) are treated as missing.
Categorical Feature Support --------------------------- -- LightGBM can offer a good accuracy when using native categorical features. Not like simply one-hot encoding, LightGBM can find the optimal split of categorical features. - Such an optimal split can provide the much better accuracy than one-hot encoding solution. +- LightGBM offers good accuracy with integer-encoded categorical features. LightGBM applies + `Fisher (1958) `_ + to find the optimal split over categories as + `described here <./Features.rst#optimal-split-for-categorical-features>`_. This often performs better than one-hot encoding. - Use ``categorical_feature`` to specify the categorical features. Refer to the parameter ``categorical_feature`` in `Parameters <./Parameters.rst>`__. -- Converting to ``int`` type is needed first, and there is support for non-negative numbers only. Also, all values should be less than ``Int32.MaxValue`` (2147483647). - It is better to convert into continues ranges. +- Categorical features must be encoded as non-negative integers (``int``) less than ``Int32.MaxValue`` (2147483647). + It is best to use a contiguous range of integers. -- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting - (when ``#data`` is small or ``#category`` is large). +- Use ``min_data_per_group``, ``cat_smooth`` to deal with over-fitting (when ``#data`` is small or ``#category`` is large). -- For categorical features with high cardinality (``#category`` is large), it is better to convert it to numerical features. +- For a categorical feature with high cardinality (``#category`` is large), it often works best to + treat the feature as numeric, either by simply ignoring the categorical interpretation of the integers or + by embedding the categories in a low-dimensional numeric space. LambdaRank ---------- -- The label should be ``int`` type, and larger numbers represent the higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect). 
+- The label should be of type ``int``, such that larger numbers correspond to higher relevance (e.g. 0:bad, 1:fair, 2:good, 3:perfect). - Use ``label_gain`` to set the gain(weight) of ``int`` label. diff --git a/docs/Experiments.rst b/docs/Experiments.rst index d11851b5871..d0a94e11ff0 100644 --- a/docs/Experiments.rst +++ b/docs/Experiments.rst @@ -28,7 +28,7 @@ We used 5 datasets to conduct our comparison experiments. Details of data are li Environment ^^^^^^^^^^^ -We used one Linux server as experiment platform, details are listed in the following table: +We ran all experiments on a single Linux server with the following specifications: +------------------+-----------------+---------------------+ | OS | CPU | Memory | +------------------+-----------------+---------------------+ @@ -46,7 +46,7 @@ Both xgboost and LightGBM were built with OpenMP support. Settings ^^^^^^^^ -We set up total 3 settings for experiments, the parameters of these settings are: +We set up a total of 3 settings for the experiments. The parameters of these settings are: 1. xgboost: @@ -84,8 +84,8 @@ We set up total 3 settings for experiments, the parameters of these settings are min_data_in_leaf = 0 min_sum_hessian_in_leaf = 100 -xgboost grows tree depth-wise and controls model complexity by ``max_depth``. -LightGBM uses leaf-wise algorithm instead and controls model complexity by ``num_leaves``. +xgboost grows trees depth-wise and controls model complexity by ``max_depth``. +LightGBM uses a leaf-wise algorithm instead and controls model complexity by ``num_leaves``. So we cannot compare them in the exact same model setting. For the tradeoff, we use xgboost with ``max_depth=8``, which will have max number leaves to 255, to compare with LightGBM with ``num_leves=255``. Other parameters are default values. @@ -96,7 +96,7 @@ Result Speed ''''' -For speed comparison, we only run the training task, which was without any test or metric output. And we didn't count the time for IO. +We compared speed using only the training task without any test or metric output.
We didn't count the time for IO. The following table is the comparison of time cost: @@ -114,12 +114,12 @@ The following table is the comparison of time cost: | Allstate | 2867.22 s | 1355.71 s | **348.084475 s** | +-----------+-----------+---------------+------------------+ -We found LightGBM is faster than xgboost on all experiment data sets. +LightGBM ran faster than xgboost on all experiment data sets. Accuracy '''''''' -For accuracy comparison, we used the accuracy on test data set to have a fair comparison. +We computed all accuracy metrics only on the test data set. +-----------+-----------------+----------+---------------+----------+ | Data | Metric | xgboost | xgboost\_hist | LightGBM | +-----------+-----------------+----------+---------------+----------+ @@ -150,8 +150,8 @@ For accuracy comparison, we used the accuracy on test data set to have a fair co Memory Consumption '''''''''''''''''' -We monitored RES while running training task. And we set ``two_round=true`` (will increase data-loading time, -but reduce peak memory usage, not affect training speed or accuracy) in LightGBM to reduce peak memory usage. +We monitored RES while running the training task. We set ``two_round=true`` in LightGBM (this increases data-loading time and +reduces peak memory usage but does not affect training speed or accuracy). +-----------+---------+---------------+-------------+ | Data | xgboost | xgboost\_hist | LightGBM | +-----------+---------+---------------+-------------+ @@ -181,15 +181,15 @@ We used a terabyte click log dataset to conduct parallel experiments. Details ar | Criteo | Binary classification | `link`_ | 1,700,000,000 | 67 | +--------+-----------------------+---------+---------------+----------+ -This data contains 13 integer features and 26 category features of 24 days click log. -We statisticized the CTR and count for these 26 category features from the first ten days, -then used next ten days' data, which had been replaced the category features by the corresponding CTR and count, as training data.
+This data contains 13 integer features and 26 categorical features for 24 days of click logs. +We computed the clickthrough rate (CTR) and count for these 26 categorical features from the first ten days. +Then we used the next ten days' data, after replacing the categorical features by the corresponding CTR and count, as training data. The processed training data have a total of 1.7 billions records and 67 features. Environment ^^^^^^^^^^^ -We used 16 Windows servers as experiment platform, details are listed in following table: +We ran our experiments on 16 Windows servers with the following specifications: +---------------------+-----------------+---------------------+-------------------------------------------+ | OS | CPU | Memory | Network Adapter | +---------------------+-----------------+---------------------+-------------------------------------------+ @@ -208,9 +208,7 @@ Settings num_thread = 16 tree_learner = data -We used data parallel here, since this data is large in ``#data`` but small in ``#feature``. - -Other parameters were default values. +We used data parallel here because this data is large in ``#data`` but small in ``#feature``. Other parameters were default values. Results ^^^^^^^ @@ -229,7 +227,7 @@ Results | 16 | 42 s | 11GB | +----------+---------------+---------------------------+ -From the results, we found that LightGBM performs linear speed up in parallel learning. +The results show that LightGBM achieves a linear speedup with parallel learning. GPU Experiments --------------- diff --git a/docs/FAQ.rst b/docs/FAQ.rst index 6698b73def7..a4b20915838 100644 --- a/docs/FAQ.rst +++ b/docs/FAQ.rst @@ -17,13 +17,21 @@ Contents Critical ~~~~~~~~ -You encountered a critical issue when using LightGBM (crash, prediction error, non sense outputs...). Who should you contact? +Please post an issue in `Microsoft/LightGBM repository `__ for any +LightGBM issues you encounter.
For critical issues (crash, prediction error, nonsense outputs...), you may also ping a +member of the core team according to the relevant area of expertise by mentioning them with the arobase (@) symbol: -If your issue is not critical, just post an issue in `Microsoft/LightGBM repository `__. +- `@guolinke `__ (C++ code / R-package / Python-package) +- `@chivee `__ (C++ code / Python-package) +- `@Laurae2 `__ (R-package) +- `@wxchan `__ (Python-package) +- `@henry0312 `__ (Python-package) +- `@StrikerRUS `__ (Python-package) +- `@huanzhang12 `__ (GPU support) -If it is a critical issue, identify first what error you have: +Please include as much of the following information as possible when submitting a critical issue: -- Do you think it is reproducible on CLI (command line interface), R, and/or Python? +- Is it reproducible on CLI (command line interface), R, and/or Python? - Is it specific to a wrapper? (R or Python?) @@ -33,19 +41,9 @@ If it is a critical issue, identify first what error you have: - Are you able to reproduce this issue with a simple case? -- Are you able to (not) reproduce this issue after removing all optimization flags and compiling LightGBM in debug mode? - -Depending on the answers, while opening your issue, feel free to ping (just mention them with the arobase (@) symbol) appropriately so we can attempt to solve your problem faster: - -- `@guolinke `__ (C++ code / R-package / Python-package) - `@chivee `__ (C++ code / Python-package) - `@Laurae2 `__ (R-package) - `@wxchan `__ (Python-package) - `@henry0312 `__ (Python-package) - `@StrikerRUS `__ (Python-package) - `@huanzhang12 `__ (GPU support) +- Does the issue persist after removing all optimization flags and compiling LightGBM in debug mode? -Remember this is a free/open community support. We may not be available 24/7 to provide support. +When submitting issues, please keep in mind that this is largely a volunteer effort, and we may not be available 24/7 to provide support.
-------------- @@ -54,11 +52,11 @@ LightGBM - **Question 1**: Where do I find more details about LightGBM parameters? -- **Solution 1**: Take a look at `Parameters <./Parameters.rst>`__ and `Laurae++/Parameters `__ website. +- **Solution 1**: Take a look at `Parameters <./Parameters.rst>`__ and the `Laurae++/Parameters `__ website. -------------- -- **Question 2**: On datasets with million of features, training do not start (or starts after a very long time). +- **Question 2**: On datasets with millions of features, training does not start (or starts after a very long time). - **Solution 2**: Use a smaller value for ``bin_construct_sample_cnt`` and a larger value for ``min_data``. @@ -66,52 +64,51 @@ LightGBM - **Question 3**: When running LightGBM on a large dataset, my computer runs out of RAM. -- **Solution 3**: Multiple solutions: set ``histogram_pool_size`` parameter to the MB you want to use for LightGBM (histogram\_pool\_size + dataset size = approximately RAM used), +- **Solution 3**: Multiple solutions: set the ``histogram_pool_size`` parameter to the MB you want to use for LightGBM (histogram\_pool\_size + dataset size = approximately RAM used), lower ``num_leaves`` or lower ``max_bin`` (see `Microsoft/LightGBM#562 `__). -------------- - **Question 4**: I am using Windows. Should I use Visual Studio or MinGW for compiling LightGBM? -- **Solution 4**: It is recommended to `use Visual Studio `__ as its performance is higher for LightGBM. +- **Solution 4**: Visual Studio `performs best for LightGBM `__. -------------- - **Question 5**: When using LightGBM GPU, I cannot reproduce results over several runs. -- **Solution 5**: It is a normal issue, there is nothing we/you can do about, - you may try to use ``gpu_use_dp = true`` for reproducibility (see `Microsoft/LightGBM#560 `__). - You may also use CPU version.
+- **Solution 5**: This is normal and expected behaviour, but you may try to use ``gpu_use_dp = true`` for reproducibility + (see `Microsoft/LightGBM#560 `__). + You may also use the CPU version. -------------- - **Question 6**: Bagging is not reproducible when changing the number of threads. -- **Solution 6**: As LightGBM bagging is running multithreaded, its output is dependent on the number of threads used. +- **Solution 6**: LightGBM bagging is multithreaded, so its output depends on the number of threads used. There is `no workaround currently `__. -------------- - **Question 7**: I tried to use Random Forest mode, and LightGBM crashes! -- **Solution 7**: It is by design. - You must use ``bagging_fraction`` and ``feature_fraction`` different from 1, along with a ``bagging_freq``. - See `this thread `__ as an example. +- **Solution 7**: This is expected behaviour unless the required parameters are set. To enable Random Forest, + you must use ``bagging_fraction`` and ``feature_fraction`` different from 1, along with a ``bagging_freq``. + `This thread `__ includes an example. -------------- -- **Question 8**: CPU are not kept busy (like 10% CPU usage only) in Windows when using LightGBM on very large datasets with many core systems. +- **Question 8**: CPU usage is low (like 10%) in Windows when using LightGBM on very large datasets with many-core systems. - **Solution 8**: Please use `Visual Studio `__ as it may be `10x faster than MinGW `__ especially for very large trees. -------------- -- **Question 9**: When I'm trying to specify some column as categorical by using ``categorical_feature`` parameter, I get segmentation fault in LightGBM. +- **Question 9**: When I'm trying to specify a categorical column with the ``categorical_feature`` parameter, I get a segmentation fault. -- **Solution 9**: Probably you're trying to pass via ``categorical_feature`` parameter a column with very large values. For instance, it can be some IDs.
- In LightGBM categorical features are limited by int32 range, so you cannot pass values that are greater than ``Int32.MaxValue`` (2147483647) as categorical features - (see `Microsoft/LightGBM#1359 `__). You should convert them into integer range from zero to number of categories first. +- **Solution 9**: The column you're trying to pass via ``categorical_feature`` likely contains very large values. + Categorical features in LightGBM are limited by int32 range, so you cannot pass values that are greater than ``Int32.MaxValue`` (2147483647) as categorical features (see `Microsoft/LightGBM#1359 `__). You should convert them to integers ranging from zero to the number of categories first. -------------- @@ -156,7 +153,7 @@ Python-package Cannot get/set label/weight/init_score/group/num_data/num_feature before construct dataset - but I've already constructed dataset by some code like + but I've already constructed a dataset by some code like :: @@ -169,16 +166,16 @@ Python-package Cannot set predictor/reference/categorical feature after freed raw data, set free_raw_data=False when construct Dataset to avoid this. - **Solution 2**: Because LightGBM constructs bin mappers to build trees, and train and valid Datasets within one Booster share the same bin mappers, - categorical features and feature names etc., the Dataset objects are constructed when construct a Booster. - And if you set ``free_raw_data=True`` (default), the raw data (with Python data struct) will be freed. + categorical features and feature names etc., the Dataset objects are constructed when constructing a Booster. + If you set ``free_raw_data=True`` (default), the raw data (with Python data struct) will be freed. 
So, if you want to: - - get label(or weight/init\_score/group) before construct dataset, it's same as get ``self.label`` + - get label (or weight/init\_score/group) before constructing a dataset, it's the same as getting ``self.label`` - - set label(or weight/init\_score/group) before construct dataset, it's same as ``self.label=some_label_array`` + - set label (or weight/init\_score/group) before constructing a dataset, it's the same as ``self.label=some_label_array`` - - get num\_data(or num\_feature) before construct dataset, you can get data with ``self.data``, - then if your data is ``numpy.ndarray``, use some code like ``self.data.shape`` + - get num\_data (or num\_feature) before constructing a dataset, you can get data with ``self.data``. + Then, if your data is ``numpy.ndarray``, use some code like ``self.data.shape`` - - set predictor(or reference/categorical feature) after construct dataset, + - set predictor (or reference/categorical feature) after constructing a dataset, you should set ``free_raw_data=False`` or init a Dataset object with the same raw data diff --git a/docs/Features.rst b/docs/Features.rst index 391f0d43244..0b084e61204 100644 --- a/docs/Features.rst +++ b/docs/Features.rst @@ -1,35 +1,30 @@ Features ======== -This is a short introduction for the features and algorithms used in LightGBM\ `[1] <#references>`__. - -This page doesn't contain detailed algorithms, please refer to cited papers or source code if you are interested. +This is a conceptual overview of how LightGBM works\ `[1] <#references>`__. We assume familiarity with decision tree boosting algorithms to focus instead on aspects of LightGBM that may differ from other boosting packages. For detailed algorithms, please refer to the citations or source code. Optimization in Speed and Memory Usage -------------------------------------- -Many boosting tools use pre-sorted based algorithms\ `[2, 3] <#references>`__ (e.g. default algorithm in xgboost) for decision tree learning.
It is a simple solution, but not easy to optimize. - -LightGBM uses the histogram based algorithms\ `[4, 5, 6] <#references>`__, which bucketing continuous feature(attribute) values into discrete bins, to speed up training procedure and reduce memory usage. -Following are advantages for histogram based algorithms: +Many boosting tools use pre-sort-based algorithms\ `[2, 3] <#references>`__ (e.g. default algorithm in xgboost) for decision tree learning. It is a simple solution, but not easy to optimize. -- **Reduce calculation cost of split gain** +LightGBM uses histogram-based algorithms\ `[4, 5, 6] <#references>`__, which bucket continuous feature (attribute) values into discrete bins. This speeds up training and reduces memory usage. Advantages of histogram-based algorithms include the following: - - Pre-sorted based algorithms need ``O(#data)`` times calculation +- **Reduced cost of calculating the gain for each split** - - Histogram based algorithms only need to calculate ``O(#bins)`` times, and ``#bins`` is far smaller than ``#data`` + - Pre-sort-based algorithms have time complexity ``O(#data)`` - - It still needs ``O(#data)`` times to construct histogram, which only contain sum-up operation + - Computing the histogram has time complexity ``O(#data)``, but this involves only a fast sum-up operation. Once the histogram is constructed, a histogram-based algorithm has time complexity ``O(#bins)``, and ``#bins`` is far smaller than ``#data``. 
-- **Use histogram subtraction for further speed-up** +- **Use histogram subtraction for further speedup** - - To get one leaf's histograms in a binary tree, can use the histogram subtraction of its parent and its neighbor + - To get one leaf's histograms in a binary tree, use the histogram subtraction of its parent and its neighbor - - So it only need to construct histograms for one leaf (with smaller ``#data`` than its neighbor), then can get histograms of its neighbor by histogram subtraction with small cost (``O(#bins)``) + - So it needs to construct histograms for only one leaf (with smaller ``#data`` than its neighbor). It then can get histograms of its neighbor by histogram subtraction with small cost (``O(#bins)``) - **Reduce memory usage** - - Can replace continuous values to discrete bins. If ``#bins`` is small, can use small data type, e.g. uint8\_t, to store training data + - Replaces continuous values with discrete bins. If ``#bins`` is small, can use small data type, e.g. uint8\_t, to store training data - No need to store additional information for pre-sorting feature values @@ -38,7 +33,7 @@ Following are advantages for histogram based algorithms: Sparse Optimization ------------------- -- Only need ``O(2 * #non_zero_data)`` to construct histogram for sparse features +- Need only ``O(2 * #non_zero_data)`` to construct histogram for sparse features Optimization in Accuracy ------------------------ @@ -46,16 +41,15 @@ Optimization in Accuracy Leaf-wise (Best-first) Tree Growth ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -Most decision tree learning algorithms grow tree by level (depth)-wise, like the following image: +Most decision tree learning algorithms grow trees by level (depth)-wise, like the following image: .. image:: ./_static/images/level-wise.png :align: center -LightGBM grows tree by leaf-wise (best-first)\ `[7] <#references>`__. It will choose the leaf with max delta loss to grow. 
-When growing same ``#leaf``, leaf-wise algorithm can reduce more loss than level-wise algorithm. +LightGBM grows trees leaf-wise (best-first)\ `[7] <#references>`__. It will choose the leaf with max delta loss to grow. +Holding ``#leaf`` fixed, leaf-wise algorithms tend to achieve lower loss than level-wise algorithms. -Leaf-wise may cause over-fitting when ``#data`` is small. -So, LightGBM can use an additional parameter ``max_depth`` to limit depth of tree and avoid over-fitting (tree still grows by leaf-wise). +Leaf-wise may cause over-fitting when ``#data`` is small, so LightGBM includes the ``max_depth`` parameter to limit tree depth. However, trees still grow leaf-wise even when ``max_depth`` is specified. .. image:: ./_static/images/leaf-wise.png :align: center @@ -63,15 +57,13 @@ So, LightGBM can use an additional parameter ``max_depth`` to limit depth of tre Optimal Split for Categorical Features ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -We often convert the categorical features into one-hot encoding. -However, it is not a good solution in tree learner. -The reason is, for the high cardinality categorical features, it will grow the very unbalance tree, and needs to grow very deep to achieve the good accuracy. +It is common to represent categorical features with one-hot encoding, but this approach is suboptimal for tree learners. Particularly for high-cardinality categorical features, a tree built on one-hot features tends to be unbalanced and needs to grow very deep to achieve good accuracy. -Actually, the optimal solution is partitioning the categorical feature into 2 subsets, and there are ``2^(k-1) - 1`` possible partitions. -But there is a efficient solution for regression tree\ `[8] <#references>`__. It needs about ``k * log(k)`` to find the optimal partition. +Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. 
If the feature has ``k`` categories, there are ``2^(k-1) - 1`` possible partitions. +But there is an efficient solution for regression trees\ `[8] <#references>`__. It needs about ``O(k * log(k))`` to find the optimal partition. -The basic idea is reordering the categories according to the relevance of training target. -More specifically, reordering the histogram (of categorical feature) according to it's accumulate values (``sum_gradient / sum_hessian``), then find the best split on the sorted histogram. +The basic idea is to sort the categories according to the training objective at each split. +More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (``sum_gradient / sum_hessian``) and then finds the best split on the sorted histogram. Optimization in Network Communication ------------------------------------- @@ -83,7 +75,7 @@ These collective communication algorithms can provide much better performance th Optimization in Parallel Learning --------------------------------- -LightGBM provides following parallel learning algorithms. +LightGBM provides the following parallel learning algorithms. Feature Parallel ~~~~~~~~~~~~~~~~ @@ -91,7 +83,7 @@ Feature Parallel Traditional Algorithm ^^^^^^^^^^^^^^^^^^^^^ -Feature parallel aims to parallel the "Find Best Split" in the decision tree. The procedure of traditional feature parallel is: +Feature parallel aims to parallelize the "Find Best Split" in the decision tree. The procedure of traditional feature parallel is: 1. Partition data vertically (different machines have different feature set) @@ -103,19 +95,19 @@ Feature parallel aims to parallel the "Find Best Split" in the decision tree. Th 5. Other workers split data according received data -The shortage of traditional feature parallel: +The shortcomings of traditional feature parallel: - Has computation overhead, since it cannot speed up "split", whose time complexity is ``O(#data)``. 
Thus, feature parallel cannot speed up well when ``#data`` is large. -- Need communication of split result, which cost about ``O(#data / 8)`` (one bit for one data). +- Need communication of split result, which costs about ``O(#data / 8)`` (one bit for one data). Feature Parallel in LightGBM ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Since feature parallel cannot speed up well when ``#data`` is large, we make a little change here: instead of partitioning data vertically, every worker holds the full data. -Thus, LightGBM doesn't need to communicate for split result of data since every worker know how to split data. -And ``#data`` won't be larger, so it is reasonable to hold full data in every machine. +Since feature parallel cannot speed up well when ``#data`` is large, we make a little change: instead of partitioning data vertically, every worker holds the full data. +Thus, LightGBM doesn't need to communicate for split result of data since every worker knows how to split data. +And ``#data`` won't be larger, so it is reasonable to hold the full data in every machine. The procedure of feature parallel in LightGBM: @@ -134,7 +126,7 @@ Data Parallel Traditional Algorithm ^^^^^^^^^^^^^^^^^^^^^ -Data parallel aims to parallel the whole decision learning. The procedure of data parallel is: +Data parallel aims to parallelize the whole decision learning. The procedure of data parallel is: 1. Partition data horizontally @@ -144,7 +136,7 @@ Data parallel aims to parallel the whole decision learning. The procedure of dat 4. Find best split from merged global histograms, then perform splits -The shortage of traditional data parallel: +The shortcomings of traditional data parallel: - High communication cost. If using point-to-point communication algorithm, communication cost for one machine is about ``O(#machine * #feature * #bin)``. @@ -156,18 +148,18 @@ Data Parallel in LightGBM We reduce communication cost of data parallel in LightGBM: 1. 
Instead of "Merge global histograms from all local histograms", LightGBM use "Reduce Scatter" to merge histograms of different (non-overlapping) features for different workers. - Then workers find local best split on local merged histograms and sync up global best split. + Then workers find the local best split on local merged histograms and sync up the global best split. -2. As aforementioned, LightGBM use histogram subtraction to speed up training. +2. As aforementioned, LightGBM uses histogram subtraction to speed up training. Based on this, we can communicate histograms only for one leaf, and get its neighbor's histograms by subtraction as well. -Above all, we reduce communication cost to ``O(0.5 * #feature * #bin)`` for data parallel in LightGBM. +All things considered, data parallel in LightGBM has communication cost ``O(0.5 * #feature * #bin)``. Voting Parallel ~~~~~~~~~~~~~~~ -Voting parallel further reduce the communication cost in `Data Parallel <#data-parallel>`__ to constant cost. -It uses two stage voting to reduce the communication cost of feature histograms\ `[10] <#references>`__. +Voting parallel further reduces the communication cost in `Data Parallel <#data-parallel>`__ to constant cost. +It uses two-stage voting to reduce the communication cost of feature histograms\ `[10] <#references>`__.
GPU Support ----------- @@ -181,7 +173,7 @@ Thanks `@huanzhang12 `__ for contributing this f Applications and Metrics ------------------------ -Support following application: +LightGBM supports the following applications: - regression, the objective function is L2 loss @@ -189,11 +181,11 @@ Support following application: - multi classification -- cross-entropy +- cross-entropy, the objective function is logloss and supports training on non-binary labels - lambdarank, the objective function is lambdarank with NDCG -Support following metrics: +LightGBM supports the following metrics: - L1 loss @@ -242,7 +234,7 @@ Other Features - Bagging -- Column(feature) sub-sample +- Column (feature) sub-sample - Continued train with input GBDT model diff --git a/docs/Parameters-Tuning.rst b/docs/Parameters-Tuning.rst index 9a13543f538..f416a374b32 100644 --- a/docs/Parameters-Tuning.rst +++ b/docs/Parameters-Tuning.rst @@ -18,16 +18,15 @@ However, the leaf-wise growth may be over-fitting if not used with the appropria To get good results using a leaf-wise tree, these are some important parameters: 1. ``num_leaves``. This is the main parameter to control the complexity of the tree model. - Theoretically, we can set ``num_leaves = 2^(max_depth)`` to convert from depth-wise tree. + Theoretically, we can set ``num_leaves = 2^(max_depth)`` to obtain the same number of leaves as depth-wise tree. However, this simple conversion is not good in practice. - The reason is, when number of leaves are the same, the leaf-wise tree is much deeper than depth-wise tree. As a result, it may be over-fitting. + The reason is that a leaf-wise tree is typically much deeper than a depth-wise tree for a fixed number of leaves. Unconstrained depth can induce over-fitting. Thus, when trying to tune the ``num_leaves``, we should let it be smaller than ``2^(max_depth)``. 
For example, when the ``max_depth=7`` the depth-wise tree can get good accuracy, but setting ``num_leaves`` to ``127`` may cause over-fitting, and setting it to ``70`` or ``80`` may get better accuracy than depth-wise. - Actually, the concept ``depth`` can be forgotten in leaf-wise tree, since it doesn't have a correct mapping from ``leaves`` to ``depth``. -2. ``min_data_in_leaf``. This is a very important parameter to deal with over-fitting in leaf-wise tree. - Its value depends on the number of training data and ``num_leaves``. +2. ``min_data_in_leaf``. This is a very important parameter to prevent over-fitting in a leaf-wise tree. + Its optimal value depends on the number of training samples and ``num_leaves``. Setting it to a large value can avoid growing too deep a tree, but may cause under-fitting. In practice, setting it to hundreds or thousands is enough for a large dataset. diff --git a/docs/Parameters.rst b/docs/Parameters.rst index e02509f2783..62d901cd8a4 100644 --- a/docs/Parameters.rst +++ b/docs/Parameters.rst @@ -548,7 +548,7 @@ Objective Parameters - ``boost_from_average``, default=\ ``true``, type=bool - - only used in ``regression`` task + - used only in ``regression``, ``binary``, and ``xentropy`` tasks (others may get added) - adjust initial score to the mean of labels for faster convergence diff --git a/docs/Python-Intro.rst b/docs/Python-Intro.rst index 39074a7d3f4..198108465c9 100644 --- a/docs/Python-Intro.rst +++ b/docs/Python-Intro.rst @@ -32,7 +32,7 @@ To verify your installation, try to ``import lightgbm`` in Python: Data Interface -------------- -The LightGBM Python module is able to load data from: +The LightGBM Python module can load data from: - libsvm/tsv/csv/txt format file
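A note for reviewers of this patch (appended after the diff so it does not affect the hunks): the optimal categorical split that the reworded Features.rst text describes — sort the categories by ``sum_gradient / sum_hessian``, then scan the sorted histogram for the best binary partition — can be sketched in a few lines of Python. All names below are hypothetical illustrations, not LightGBM internals; the gain is the usual ``G^2/H`` split score without LightGBM's regularization, ``cat_smooth`` smoothing, or ``min_data_per_group`` handling.

```python
# Illustrative sketch (assumed names, not LightGBM's actual code) of the
# sorted-histogram categorical split: order categories by
# sum_gradient / sum_hessian, then scan that order for the best
# left/right partition -- O(k log k) work instead of enumerating all
# 2^(k-1) - 1 subsets.

def best_categorical_split(hist, eps=1e-9):
    """hist: {category: (sum_gradient, sum_hessian)} for one feature.

    Returns (left_categories, gain) for the best binary partition
    found by scanning the categories in sorted order.
    """
    order = sorted(hist, key=lambda c: hist[c][0] / (hist[c][1] + eps))
    total_g = sum(g for g, _ in hist.values())
    total_h = sum(h for _, h in hist.values())
    best_gain, best_k = float("-inf"), 0
    left_g = left_h = 0.0
    for k, cat in enumerate(order[:-1], start=1):  # never put every category left
        g, h = hist[cat]
        left_g += g
        left_h += h
        # Standard split score: G_L^2 / H_L + G_R^2 / H_R
        gain = (left_g ** 2 / (left_h + eps)
                + (total_g - left_g) ** 2 / (total_h - left_h + eps))
        if gain > best_gain:
            best_gain, best_k = gain, k
    return set(order[:best_k]), best_gain
```

For example, with three categories whose gradient statistics clearly separate one of them, the scan groups the two similar categories on one side of the split rather than trying every subset.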