
Commit

Merge pull request #744 from az0/staging
Fix spelling, capitalization, and whitespace
miguelgfierro committed Apr 18, 2019
2 parents cf2f3d9 + 2705660 commit a9e9cfd
Showing 34 changed files with 61 additions and 61 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ To setup on your local machine:
cd notebooks
jupyter notebook
```
- 6. Run the [SAR Python CPU Movielens](notebooks/00_quick_start/sar_movielens.ipynb) notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".
+ 6. Run the [SAR Python CPU MovieLens](notebooks/00_quick_start/sar_movielens.ipynb) notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".

**NOTE** - The [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_movielens.ipynb) notebooks require a PySpark environment to run. Please follow the steps in the [setup guide](SETUP.md#dependencies-setup) to run these notebooks in a PySpark environment.

4 changes: 2 additions & 2 deletions notebooks/00_quick_start/als_movielens.ipynb
@@ -84,7 +84,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
@@ -256,7 +256,7 @@
"source": [
"In the movie recommendation use case, recommending movies that have been rated by the users do not make sense. Therefore, the rated movies are removed from the recommended items.\n",
"\n",
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training datatset."
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training dataset."
]
},
{
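For context, the "remove seen items" step described in the hunk above amounts to a cross join plus a left anti join. A minimal PySpark sketch, assuming a trained ALS `model` and a `train` DataFrame with `UserId`/`MovieId`/`prediction` columns (all names are illustrative, not the notebook's exact variables):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Score every (user, movie) combination with the trained model ...
users = train.select("UserId").distinct()
items = train.select("MovieId").distinct()
scored = model.transform(users.crossJoin(items))

# ... then drop the pairs that already appear in the training data.
unseen = scored.join(
    train.select("UserId", "MovieId"),
    on=["UserId", "MovieId"],
    how="left_anti",
)

# Keep the top k remaining movies per user.
window = Window.partitionBy("UserId").orderBy(F.col("prediction").desc())
top_k = unseen.withColumn("rank", F.row_number().over(window)).filter(F.col("rank") <= 10)
```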
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/fastai_movielens.ipynb
@@ -93,7 +93,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"\n",
"# Model parameters\n",
4 changes: 2 additions & 2 deletions notebooks/00_quick_start/ncf_movielens.ipynb
@@ -13,7 +13,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Neural Collaborative Filtering on Movielens dataset.\n",
"# Neural Collaborative Filtering on MovieLens dataset.\n",
"\n",
"Neural Collaborative Filtering (NCF) is a well known recommendation algorithm that generalizes the matrix factorization problem with multi-layer perceptron. \n",
"\n",
@@ -78,7 +78,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"\n",
"# Model parameters\n",
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/rbm_movielens.ipynb
@@ -109,7 +109,7 @@
},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/sar_movielens.ipynb
@@ -101,7 +101,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/wide_deep_movielens.ipynb
@@ -88,7 +88,7 @@
"\n",
"# Recommend top k items\n",
"TOP_K = 10\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"# Metrics to use for evaluation. reco_utils.evaluation.python_evaluation function names\n",
"RANKING_METRICS = ['map_at_k', 'ndcg_at_k', 'precision_at_k', 'recall_at_k']\n",
4 changes: 2 additions & 2 deletions notebooks/01_prepare_data/data_split.ipynb
@@ -109,7 +109,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration purpose, the data used in the examples below is the Movielens-100K dataset."
"For illustration purpose, the data used in the examples below is the MovieLens-100K dataset."
]
},
{
@@ -1091,7 +1091,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, the below illustrates how to do a random split on the given Spark DataFrame. For simplicity reason, the same Movielens data, which is in Pandas DataFrame, is transformed into Spark DataFrame and used for splitting."
"For example, the below illustrates how to do a random split on the given Spark DataFrame. For simplicity reason, the same MovieLens data, which is in Pandas DataFrame, is transformed into Spark DataFrame and used for splitting."
]
},
{
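As a rough sketch of the conversion and split described above — assuming `pandas_df` already holds the MovieLens ratings, and that the splitter lives at the path used elsewhere in this repo (unverified against this commit):

```python
from pyspark.sql import SparkSession
from reco_utils.dataset.spark_splitters import spark_random_split

spark = SparkSession.builder.getOrCreate()

# Convert the pandas MovieLens frame to Spark, then split 75/25 at random.
spark_df = spark.createDataFrame(pandas_df)
train, test = spark_random_split(spark_df, ratio=0.75, seed=42)
```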
4 changes: 2 additions & 2 deletions notebooks/02_model/als_deep_dive.ipynb
@@ -73,9 +73,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 Spark ALS based Movielens recommender\n",
"## 3 Spark ALS based MovieLens recommender\n",
"\n",
"In the following code, the Movielens-100K dataset is used to illustrate the ALS algorithm in Spark."
"In the following code, the MovieLens-100K dataset is used to illustrate the ALS algorithm in Spark."
]
},
{
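A hedged illustration of the Spark ALS usage this hunk refers to; the hyperparameter values and column names below are placeholders, not the notebook's chosen settings:

```python
from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="UserId",
    itemCol="MovieId",
    ratingCol="Rating",
    rank=10,                   # number of latent factors
    maxIter=15,
    regParam=0.05,
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
    seed=42,
)
model = als.fit(train)
predictions = model.transform(test)
```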
6 changes: 3 additions & 3 deletions notebooks/02_model/ncf_deep_dive.ipynb
@@ -76,7 +76,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"\n",
"# Model parameters\n",
@@ -172,9 +172,9 @@
"source": [
"## 2 TensorFlow implementation of NCF\n",
"\n",
"We will use the Movielens dataset, which is composed of integer ratings from 1 to 5.\n",
"We will use the MovieLens dataset, which is composed of integer ratings from 1 to 5.\n",
"\n",
"We convert Movielens into implicit feedback, and evaluate under our *leave-one-out* evaluation protocol.\n",
"We convert MovieLens into implicit feedback, and evaluate under our *leave-one-out* evaluation protocol.\n",
"\n",
"You can check the details of implementation in `reco_utils/recommender/ncf`\n"
]
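A small pandas sketch of the conversion and protocol mentioned above — binarizing ratings into implicit feedback and holding out each user's most recent interaction — with column names assumed for illustration:

```python
import pandas as pd

def leave_one_out_split(df, col_user="userID", col_time="timestamp"):
    """Hold out each user's most recent interaction as the test example."""
    df = df.sort_values(col_time)
    test = df.groupby(col_user).tail(1)
    train = df.drop(test.index)
    return train, test

# Any observed interaction counts as a positive signal (implicit feedback).
ratings["rating"] = 1.0
train, test = leave_one_out_split(ratings)
```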
2 changes: 1 addition & 1 deletion notebooks/02_model/rbm_deep_dive.ipynb
@@ -271,7 +271,7 @@
"source": [
"# 3 Data preparation and inspection \n",
"\n",
"The Movielens dataset comes in different sizes, denoting the number of available ratings. The number of users and rated movies also changes across the different dataset. The data are imported in a pandas dataframe including the **user ID**, the **item ID**, the **ratings** and a **timestamp** denoting when a particular user rated a particular item. Although this last feature could be explicitely included, it will not be considered here. The underlying assumption of this choice is that user's tastes are weakly time dependent, i.e. a user's taste typically chage on time scales (usually years) much longer than the typical recommendation time scale (e.g. hours/days). As a consequence, the joint probability distribution we want to learn can be safely considered as time dependent. Nevertheless, timestamps could be used as *contextual variables*, e.g. recommend a certain movie during the weekend and another during weekdays. \n",
"The MovieLens dataset comes in different sizes, denoting the number of available ratings. The number of users and rated movies also changes across the different dataset. The data are imported in a pandas dataframe including the **user ID**, the **item ID**, the **ratings** and a **timestamp** denoting when a particular user rated a particular item. Although this last feature could be explicitely included, it will not be considered here. The underlying assumption of this choice is that user's tastes are weakly time dependent, i.e. a user's taste typically chage on time scales (usually years) much longer than the typical recommendation time scale (e.g. hours/days). As a consequence, the joint probability distribution we want to learn can be safely considered as time dependent. Nevertheless, timestamps could be used as *contextual variables*, e.g. recommend a certain movie during the weekend and another during weekdays. \n",
"\n",
"Below, we first load the different movielens data in pandas dataframes, explain how the user/affinity matrix is built and how the train/test set is generated. As this procedure is common to all the datasets considered here, we explain it in details only for the 1m dataset. \n",
"\n",
2 changes: 1 addition & 1 deletion notebooks/02_model/sar_deep_dive.ipynb
@@ -152,7 +152,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
4 changes: 2 additions & 2 deletions notebooks/02_model/surprise_svd_deep_dive.ipynb
@@ -77,7 +77,7 @@
"source": [
"## 3 Surprise SVD movie recommender\n",
"\n",
"We will use the Movielens dataset, which is composed of integer ratings from 1 to 5. \n",
"We will use the MovieLens dataset, which is composed of integer ratings from 1 to 5. \n",
"\n",
"Surprise supports dataframes as long as they have three colums reprensenting the user ids, item ids, and the ratings (in this order)."
]
@@ -124,7 +124,7 @@
},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
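A minimal sketch of feeding such a dataframe to Surprise, assuming columns named `userID`, `itemID`, and `rating` in that order:

```python
import surprise

# Surprise reads (user id, item id, rating) columns, in that order.
reader = surprise.Reader(rating_scale=(1, 5))
data = surprise.Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)

trainset = data.build_full_trainset()
svd = surprise.SVD(random_state=0)
svd.fit(trainset)
```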
4 changes: 2 additions & 2 deletions notebooks/02_model/vowpal_wabbit_deep_dive.ipynb
@@ -39,7 +39,7 @@
"source": [
"<h3>Vowpal Wabbit for Recommendations</h3>\n",
"\n",
"In this notebook we demonstrate how to use the VW library to generate recommendations on the [Movielens](https://grouplens.org/datasets/movielens/) dataset.\n",
"In this notebook we demonstrate how to use the VW library to generate recommendations on the [MovieLens](https://grouplens.org/datasets/movielens/) dataset.\n",
"\n",
"Several things are worth noting in how VW is being used in this notebook:\n",
"\n",
@@ -216,7 +216,7 @@
},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"TOP_K = 10"
]
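For reference, a tiny hypothetical helper showing VW's plain-text input format for a rating example (label first, then namespaced user and item features); this is not the notebook's own conversion code:

```python
def to_vw_line(user, item, rating):
    """Format one rating as a VW example: label, then namespaced features."""
    return f"{rating} |user {user} |item {item}"

print(to_vw_line(196, 242, 3))  # -> "3 |user 196 |item 242"
```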
2 changes: 1 addition & 1 deletion notebooks/03_evaluate/evaluation.ipynb
@@ -1633,7 +1633,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the similary process it repeated for creating a more balanced dataset which uses the threshold of 3. Another prediction dataset is also created by using the balanced dataset. Again, the probabilities of predicting label 1 and label 0 are fixed as 0.6 and 0.4, respectively. **NOTE**, same as above, in this case, the prediction also gives us a 100% precision. The only difference is the proportion of binary labels."
"For comparison, a similar process is used with a threshold value of 3 to create a more balanced dataset. Another prediction dataset is also created by using the balanced dataset. Again, the probabilities of predicting label 1 and label 0 are fixed as 0.6 and 0.4, respectively. **NOTE**, same as above, in this case, the prediction also gives us a 100% precision. The only difference is the proportion of binary labels."
]
},
{
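A rough numpy sketch of the setup described above, with all names assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Binarize ratings with a threshold of 3, then simulate a predictor that
# outputs label 1 with probability 0.6 and label 0 with probability 0.4.
labels = (ratings >= 3).astype(int)
predictions = rng.choice([1, 0], size=labels.size, p=[0.6, 0.4])
```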
@@ -252,7 +252,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
@@ -491,7 +491,7 @@
" script_params['--recommend-seen'] = ''\n",
" \n",
"# hyperparameters search space\n",
"# We do not set 'lr_all' and 'reg_all' because they will be overriden by the other lr_ and reg_ parameters\n",
"# We do not set 'lr_all' and 'reg_all' because they will be overwritten by the other lr_ and reg_ parameters\n",
"\n",
"hyper_params = {\n",
" 'n_factors': hd.choice(10, 50, 100, 150, 200),\n",
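The comment about `lr_all`/`reg_all` refers to Surprise's SVD, where each bias and factor term can take its own learning rate and regularization; a sketch with illustrative values:

```python
from surprise import SVD

# Setting the per-parameter terms directly makes the blanket lr_all/reg_all
# knobs redundant - each bias/factor term gets its own value.
svd = SVD(
    n_factors=100,
    lr_bu=0.005, lr_bi=0.005, lr_pu=0.005, lr_qi=0.005,
    reg_bu=0.02, reg_bi=0.02, reg_pu=0.02, reg_qi=0.02,
)
```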
@@ -141,7 +141,7 @@
"\n",
"# Recommend top k items\n",
"TOP_K = 10\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"EPOCHS = 50\n",
"# Metrics to track\n",
@@ -198,7 +198,7 @@
"source": [
"### 2. Create Remote Compute Target\n",
"\n",
"We create a gpu cluster as our **remote compute target**. If a cluster with the same name is already exist in your workspace, the script will load it instead. You can see [this document](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets) to learn more about setting up a compute target on different locations.\n",
"We create a GPU cluster as our **remote compute target**. If a cluster with the same name is already exist in your workspace, the script will load it instead. You can see [this document](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets) to learn more about setting up a compute target on different locations.\n",
"\n",
"This notebook selects **STANDARD_NC6** virtual machine (VM) and sets it's priority as *lowpriority* to save the cost.\n",
"\n",
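A sketch of the get-or-create pattern described above, using the Azure ML SDK of that era; `ws` is an assumed `Workspace` handle and the cluster name is a placeholder:

```python
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = "gpu-cluster"
try:
    # Reuse the cluster if one with this name already exists in the workspace.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except Exception:
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6",     # one NVIDIA K80 GPU
        vm_priority="lowpriority",  # cheaper, preemptible nodes
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```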
@@ -155,7 +155,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Movielens 100k dataset is used for running the demonstration."
"MovieLens 100k dataset is used for running the demonstration."
]
},
{
2 changes: 1 addition & 1 deletion notebooks/05_operationalize/README.md
@@ -4,7 +4,7 @@ In this directory, a notebook is provided to demonstrate how recommendation syst

| Notebook | Description |
| --- | --- |
- | [als_movie_o16n](als_movie_o16n.ipynb) | End-to-end examples demonstrate how to build, evaluate, and deploye a Spark ALS based movie recommender with Azure services such as [Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction), and [Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/).
+ | [als_movie_o16n](als_movie_o16n.ipynb) | End-to-end examples demonstrate how to build, evaluate, and deploy a Spark ALS based movie recommender with Azure services such as [Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction), and [Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/).


## Workflow
6 changes: 3 additions & 3 deletions notebooks/05_operationalize/als_movie_o16n.ipynb
@@ -319,7 +319,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
@@ -379,7 +379,7 @@
"source": [
"### 2.3 Train the ALS model on the training data, and get the top-k recommendations for our testing data\n",
"\n",
"To predict movie ratings, we use the rating data in the training set as users' explicit feedbacks. The hyperparameters used to estimate the model are set based on [this page](http://mymedialite.net/examples/datasets.html).\n",
"To predict movie ratings, we use the rating data in the training set as users' explicit feedback. The hyperparameters used to estimate the model are set based on [this page](http://mymedialite.net/examples/datasets.html).\n",
"\n",
"Under most circumstances, you would explore the hyperparameters and choose an optimal set based on some criteria. For additional details on this process, please see additional information in the deep dives [here](../04_model_select_and_optimize/hypertune_spark_deep_dive.ipynb)."
]
@@ -428,7 +428,7 @@
"source": [
"In the movie recommendation use case, recommending movies that have been rated by the users do not make sense. Therefore, the rated movies are removed from the recommended items.\n",
"\n",
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training datatset."
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training dataset."
]
},
{
4 changes: 2 additions & 2 deletions reco_utils/README.md
@@ -8,10 +8,10 @@ This module (reco_utils) contains functions to simplify common tasks used when d
This submodule contains high-level utilities for defining constants used in most algorithms as well as helper functions for managing aspects of different frameworks: gpu, spark, jupyter notebook.

### [Dataset](./dataset)
- Dataset includes helper functions for interacting with Azure Cosmos databases, pulling different sizes of the Movielens dataset and formatting them appropriately as well as utilities for splitting data for training / testing.
+ Dataset includes helper functions for interacting with Azure Cosmos databases, pulling different sizes of the MovieLens dataset and formatting them appropriately as well as utilities for splitting data for training / testing.

#### Data Loading
- The movielens module will allow you to load a dataframe in pandas or spark formats from the Movielens dataset, with sizes of 100k, 1M, 10M, or 20M to test algorithms and evaluate performance benchmarks.
+ The movielens module will allow you to load a dataframe in pandas or spark formats from the MovieLens dataset, with sizes of 100k, 1M, 10M, or 20M to test algorithms and evaluate performance benchmarks.
```python
df = movielens.load_pandas_df(size="100k")
```
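The Spark counterpart presumably looks like the following; the `load_spark_df` name and signature are assumed from the module description, not verified against this commit:

```python
from pyspark.sql import SparkSession
from reco_utils.dataset import movielens

spark = SparkSession.builder.getOrCreate()
spark_df = movielens.load_spark_df(spark, size="100k")
```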
6 changes: 3 additions & 3 deletions reco_utils/common/general_utils.py
@@ -7,8 +7,8 @@

def invert_dictionary(dictionary):
"""Invert a dictionary
- NOTE: If the dictionary has unique keys and unique values, the invertion would be perfect. However, if there are
- repeated values, the invertion can take different keys
+ NOTE: If the dictionary has unique keys and unique values, the inversion would be perfect. However, if there are
+ repeated values, the inversion can take different keys
Args:
dictionary (dict): A dictionary
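Judging from the documented note, the function presumably behaves like this sketch — a plain dict comprehension where repeated values mean later keys silently win (the real body may differ):

```python
def invert_dictionary(dictionary):
    """Swap keys and values; with duplicate values, the last key wins."""
    return {value: key for key, value in dictionary.items()}

invert_dictionary({"a": 1, "b": 2})  # {1: "a", 2: "b"}
invert_dictionary({"a": 1, "b": 1})  # {1: "b"} - the key "a" is dropped
```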
@@ -37,7 +37,7 @@ def get_number_processors():
try:
num = os.cpu_count()
except Exception:
- import multiprocessing # force exception in case mutiprocessing is not installed
+ import multiprocessing # force exception in case multiprocessing is not installed

num = multiprocessing.cpu_count()
return num
2 changes: 1 addition & 1 deletion reco_utils/common/notebook_memory_management.py
@@ -16,7 +16,7 @@
from __future__ import print_function # force use of print("hello")
from __future__ import (
unicode_literals
- ) # force unadorned strings "" to be unicode without prepending u""
+ ) # force unadorned strings "" to be Unicode without prepending u""
import time
import memory_profiler
from IPython import get_ipython
2 changes: 1 addition & 1 deletion reco_utils/dataset/criteo.py
@@ -136,7 +136,7 @@ def extract_criteo(size, compressed_file, path=None):
"""Extract Criteo dataset tar.
Args:
- size (str): Size of criteo dataset. It can be "full" or "sample".
+ size (str): Size of Criteo dataset. It can be "full" or "sample".
compressed_file (str): Path to compressed file.
path (str): Path to extract the file.
2 changes: 1 addition & 1 deletion reco_utils/dataset/download_utils.py
@@ -16,7 +16,7 @@ class TqdmUpTo(tqdm):
"""Wrapper class for the progress bar tqdm to get `update_to(n)` functionality"""

def update_to(self, b=1, bsize=1, tsize=None):
"""A progress bar showing how much is left to finish the opperation
"""A progress bar showing how much is left to finish the operation
Args:
b (int): Number of blocks transferred so far.
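A self-contained sketch of how such a wrapper is typically wired into `urlretrieve` (the standard tqdm recipe; details may differ from `download_utils.py`):

```python
from urllib.request import urlretrieve
from tqdm import tqdm

class TqdmUpTo(tqdm):
    """tqdm wrapper exposing an update_to(b, bsize, tsize) reporthook."""

    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize           # total download size, if known
        self.update(b * bsize - self.n)  # advance the bar to b*bsize bytes

url = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
with TqdmUpTo(unit="B", unit_scale=True, desc="ml-100k.zip") as t:
    urlretrieve(url, "ml-100k.zip", reporthook=t.update_to)
```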