
Commit

Merge pull request #744 from az0/staging
Fix spelling, capitalization, and whitespace
miguelgfierro committed Apr 18, 2019
2 parents cf2f3d9 + 2705660 commit a9e9cfd
Showing 34 changed files with 61 additions and 61 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -35,7 +35,7 @@ To setup on your local machine:
cd notebooks
jupyter notebook
```
- 6. Run the [SAR Python CPU Movielens](notebooks/00_quick_start/sar_movielens.ipynb) notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".
+ 6. Run the [SAR Python CPU MovieLens](notebooks/00_quick_start/sar_movielens.ipynb) notebook under the 00_quick_start folder. Make sure to change the kernel to "Python (reco)".

**NOTE** - The [Alternating Least Squares (ALS)](notebooks/00_quick_start/als_movielens.ipynb) notebooks require a PySpark environment to run. Please follow the steps in the [setup guide](SETUP.md#dependencies-setup) to run these notebooks in a PySpark environment.

4 changes: 2 additions & 2 deletions notebooks/00_quick_start/als_movielens.ipynb
@@ -84,7 +84,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
@@ -256,7 +256,7 @@
"source": [
"In the movie recommendation use case, recommending movies that have been rated by the users do not make sense. Therefore, the rated movies are removed from the recommended items.\n",
"\n",
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training datatset."
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training dataset."
]
},
{
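For context, the "remove seen items" step described in the hunk above amounts to a cross join plus a left anti join. A minimal PySpark sketch, assuming a trained ALS `model` and a `train` DataFrame with `UserId`/`MovieId`/`prediction` columns (all names are illustrative, not the notebook's exact variables):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Score every (user, movie) combination with the trained model ...
users = train.select("UserId").distinct()
items = train.select("MovieId").distinct()
scored = model.transform(users.crossJoin(items))

# ... then drop the pairs that already appear in the training data.
unseen = scored.join(
    train.select("UserId", "MovieId"),
    on=["UserId", "MovieId"],
    how="left_anti",
)

# Keep the top k remaining movies per user.
window = Window.partitionBy("UserId").orderBy(F.col("prediction").desc())
top_k = unseen.withColumn("rank", F.row_number().over(window)).filter(F.col("rank") <= 10)
```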
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/fastai_movielens.ipynb
@@ -93,7 +93,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"\n",
"# Model parameters\n",
4 changes: 2 additions & 2 deletions notebooks/00_quick_start/ncf_movielens.ipynb
@@ -13,7 +13,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Neural Collaborative Filtering on Movielens dataset.\n",
"# Neural Collaborative Filtering on MovieLens dataset.\n",
"\n",
"Neural Collaborative Filtering (NCF) is a well known recommendation algorithm that generalizes the matrix factorization problem with multi-layer perceptron. \n",
"\n",
@@ -78,7 +78,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"\n",
"# Model parameters\n",
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/rbm_movielens.ipynb
@@ -109,7 +109,7 @@
},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/sar_movielens.ipynb
@@ -101,7 +101,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
2 changes: 1 addition & 1 deletion notebooks/00_quick_start/wide_deep_movielens.ipynb
@@ -88,7 +88,7 @@
"\n",
"# Recommend top k items\n",
"TOP_K = 10\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"# Metrics to use for evaluation. reco_utils.evaluation.python_evaluation function names\n",
"RANKING_METRICS = ['map_at_k', 'ndcg_at_k', 'precision_at_k', 'recall_at_k']\n",
4 changes: 2 additions & 2 deletions notebooks/01_prepare_data/data_split.ipynb
@@ -109,7 +109,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For illustration purpose, the data used in the examples below is the Movielens-100K dataset."
"For illustration purpose, the data used in the examples below is the MovieLens-100K dataset."
]
},
{
@@ -1091,7 +1091,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, the below illustrates how to do a random split on the given Spark DataFrame. For simplicity reason, the same Movielens data, which is in Pandas DataFrame, is transformed into Spark DataFrame and used for splitting."
"For example, the below illustrates how to do a random split on the given Spark DataFrame. For simplicity reason, the same MovieLens data, which is in Pandas DataFrame, is transformed into Spark DataFrame and used for splitting."
]
},
{
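As a rough sketch of the conversion and split described above — assuming `pandas_df` already holds the MovieLens ratings, and that the splitter lives at the path used elsewhere in this repo (unverified against this commit):

```python
from pyspark.sql import SparkSession
from reco_utils.dataset.spark_splitters import spark_random_split

spark = SparkSession.builder.getOrCreate()

# Convert the pandas MovieLens frame to Spark, then split 75/25 at random.
spark_df = spark.createDataFrame(pandas_df)
train, test = spark_random_split(spark_df, ratio=0.75, seed=42)
```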
4 changes: 2 additions & 2 deletions notebooks/02_model/als_deep_dive.ipynb
@@ -73,9 +73,9 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 Spark ALS based Movielens recommender\n",
"## 3 Spark ALS based MovieLens recommender\n",
"\n",
"In the following code, the Movielens-100K dataset is used to illustrate the ALS algorithm in Spark."
"In the following code, the MovieLens-100K dataset is used to illustrate the ALS algorithm in Spark."
]
},
{
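A hedged illustration of the Spark ALS usage this hunk refers to; the hyperparameter values and column names below are placeholders, not the notebook's chosen settings:

```python
from pyspark.ml.recommendation import ALS

als = ALS(
    userCol="UserId",
    itemCol="MovieId",
    ratingCol="Rating",
    rank=10,                   # number of latent factors
    maxIter=15,
    regParam=0.05,
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
    seed=42,
)
model = als.fit(train)
predictions = model.transform(test)
```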
6 changes: 3 additions & 3 deletions notebooks/02_model/ncf_deep_dive.ipynb
@@ -76,7 +76,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"\n",
"# Model parameters\n",
@@ -172,9 +172,9 @@
"source": [
"## 2 TensorFlow implementation of NCF\n",
"\n",
"We will use the Movielens dataset, which is composed of integer ratings from 1 to 5.\n",
"We will use the MovieLens dataset, which is composed of integer ratings from 1 to 5.\n",
"\n",
"We convert Movielens into implicit feedback, and evaluate under our *leave-one-out* evaluation protocol.\n",
"We convert MovieLens into implicit feedback, and evaluate under our *leave-one-out* evaluation protocol.\n",
"\n",
"You can check the details of implementation in `reco_utils/recommender/ncf`\n"
]
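A small pandas sketch of the conversion and protocol mentioned above — binarizing ratings into implicit feedback and holding out each user's most recent interaction — with column names assumed for illustration:

```python
import pandas as pd

def leave_one_out_split(df, col_user="userID", col_time="timestamp"):
    """Hold out each user's most recent interaction as the test example."""
    df = df.sort_values(col_time)
    test = df.groupby(col_user).tail(1)
    train = df.drop(test.index)
    return train, test

# Any observed interaction counts as a positive signal (implicit feedback).
ratings["rating"] = 1.0
train, test = leave_one_out_split(ratings)
```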
2 changes: 1 addition & 1 deletion notebooks/02_model/rbm_deep_dive.ipynb
@@ -271,7 +271,7 @@
"source": [
"# 3 Data preparation and inspection \n",
"\n",
"The Movielens dataset comes in different sizes, denoting the number of available ratings. The number of users and rated movies also changes across the different dataset. The data are imported in a pandas dataframe including the **user ID**, the **item ID**, the **ratings** and a **timestamp** denoting when a particular user rated a particular item. Although this last feature could be explicitely included, it will not be considered here. The underlying assumption of this choice is that user's tastes are weakly time dependent, i.e. a user's taste typically chage on time scales (usually years) much longer than the typical recommendation time scale (e.g. hours/days). As a consequence, the joint probability distribution we want to learn can be safely considered as time dependent. Nevertheless, timestamps could be used as *contextual variables*, e.g. recommend a certain movie during the weekend and another during weekdays. \n",
"The MovieLens dataset comes in different sizes, denoting the number of available ratings. The number of users and rated movies also changes across the different dataset. The data are imported in a pandas dataframe including the **user ID**, the **item ID**, the **ratings** and a **timestamp** denoting when a particular user rated a particular item. Although this last feature could be explicitely included, it will not be considered here. The underlying assumption of this choice is that user's tastes are weakly time dependent, i.e. a user's taste typically chage on time scales (usually years) much longer than the typical recommendation time scale (e.g. hours/days). As a consequence, the joint probability distribution we want to learn can be safely considered as time dependent. Nevertheless, timestamps could be used as *contextual variables*, e.g. recommend a certain movie during the weekend and another during weekdays. \n",
"\n",
"Below, we first load the different movielens data in pandas dataframes, explain how the user/affinity matrix is built and how the train/test set is generated. As this procedure is common to all the datasets considered here, we explain it in details only for the 1m dataset. \n",
"\n",
2 changes: 1 addition & 1 deletion notebooks/02_model/sar_deep_dive.ipynb
@@ -152,7 +152,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
4 changes: 2 additions & 2 deletions notebooks/02_model/surprise_svd_deep_dive.ipynb
@@ -77,7 +77,7 @@
"source": [
"## 3 Surprise SVD movie recommender\n",
"\n",
"We will use the Movielens dataset, which is composed of integer ratings from 1 to 5. \n",
"We will use the MovieLens dataset, which is composed of integer ratings from 1 to 5. \n",
"\n",
"Surprise supports dataframes as long as they have three colums reprensenting the user ids, item ids, and the ratings (in this order)."
]
@@ -124,7 +124,7 @@
},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
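A minimal sketch of feeding such a dataframe to Surprise, assuming columns named `userID`, `itemID`, and `rating` in that order:

```python
import surprise

# Surprise reads (user id, item id, rating) columns, in that order.
reader = surprise.Reader(rating_scale=(1, 5))
data = surprise.Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)

trainset = data.build_full_trainset()
svd = surprise.SVD(random_state=0)
svd.fit(trainset)
```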
4 changes: 2 additions & 2 deletions notebooks/02_model/vowpal_wabbit_deep_dive.ipynb
@@ -39,7 +39,7 @@
"source": [
"<h3>Vowpal Wabbit for Recommendations</h3>\n",
"\n",
"In this notebook we demonstrate how to use the VW library to generate recommendations on the [Movielens](https://grouplens.org/datasets/movielens/) dataset.\n",
"In this notebook we demonstrate how to use the VW library to generate recommendations on the [MovieLens](https://grouplens.org/datasets/movielens/) dataset.\n",
"\n",
"Several things are worth noting in how VW is being used in this notebook:\n",
"\n",
@@ -216,7 +216,7 @@
},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"TOP_K = 10"
]
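For reference, a tiny hypothetical helper showing VW's plain-text input format for a rating example (label first, then namespaced user and item features); this is not the notebook's own conversion code:

```python
def to_vw_line(user, item, rating):
    """Format one rating as a VW example: label, then namespaced features."""
    return f"{rating} |user {user} |item {item}"

print(to_vw_line(196, 242, 3))  # -> "3 |user 196 |item 242"
```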
2 changes: 1 addition & 1 deletion notebooks/03_evaluate/evaluation.ipynb
@@ -1633,7 +1633,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the similary process it repeated for creating a more balanced dataset which uses the threshold of 3. Another prediction dataset is also created by using the balanced dataset. Again, the probabilities of predicting label 1 and label 0 are fixed as 0.6 and 0.4, respectively. **NOTE**, same as above, in this case, the prediction also gives us a 100% precision. The only difference is the proportion of binary labels."
"For comparison, a similar process is used with a threshold value of 3 to create a more balanced dataset. Another prediction dataset is also created by using the balanced dataset. Again, the probabilities of predicting label 1 and label 0 are fixed as 0.6 and 0.4, respectively. **NOTE**, same as above, in this case, the prediction also gives us a 100% precision. The only difference is the proportion of binary labels."
]
},
{
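A rough numpy sketch of the setup described above, with all names assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Binarize ratings with a threshold of 3, then simulate a predictor that
# outputs label 1 with probability 0.6 and label 0 with probability 0.4.
labels = (ratings >= 3).astype(int)
predictions = rng.choice([1, 0], size=labels.size, p=[0.6, 0.4])
```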
@@ -252,7 +252,7 @@
"metadata": {},
"outputs": [],
"source": [
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
@@ -491,7 +491,7 @@
" script_params['--recommend-seen'] = ''\n",
" \n",
"# hyperparameters search space\n",
"# We do not set 'lr_all' and 'reg_all' because they will be overriden by the other lr_ and reg_ parameters\n",
"# We do not set 'lr_all' and 'reg_all' because they will be overwritten by the other lr_ and reg_ parameters\n",
"\n",
"hyper_params = {\n",
" 'n_factors': hd.choice(10, 50, 100, 150, 200),\n",
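The comment about `lr_all`/`reg_all` refers to Surprise's SVD, where each bias and factor term can take its own learning rate and regularization; a sketch with illustrative values:

```python
from surprise import SVD

# Setting the per-parameter terms directly makes the blanket lr_all/reg_all
# knobs redundant - each bias/factor term gets its own value.
svd = SVD(
    n_factors=100,
    lr_bu=0.005, lr_bi=0.005, lr_pu=0.005, lr_qi=0.005,
    reg_bu=0.02, reg_bi=0.02, reg_pu=0.02, reg_qi=0.02,
)
```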
@@ -141,7 +141,7 @@
"\n",
"# Recommend top k items\n",
"TOP_K = 10\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'\n",
"EPOCHS = 50\n",
"# Metrics to track\n",
@@ -198,7 +198,7 @@
"source": [
"### 2. Create Remote Compute Target\n",
"\n",
"We create a gpu cluster as our **remote compute target**. If a cluster with the same name is already exist in your workspace, the script will load it instead. You can see [this document](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets) to learn more about setting up a compute target on different locations.\n",
"We create a GPU cluster as our **remote compute target**. If a cluster with the same name is already exist in your workspace, the script will load it instead. You can see [this document](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-set-up-training-targets) to learn more about setting up a compute target on different locations.\n",
"\n",
"This notebook selects **STANDARD_NC6** virtual machine (VM) and sets it's priority as *lowpriority* to save the cost.\n",
"\n",
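A sketch of the get-or-create pattern described above, using the Azure ML SDK of that era; `ws` is an assumed `Workspace` handle and the cluster name is a placeholder:

```python
from azureml.core.compute import AmlCompute, ComputeTarget

cluster_name = "gpu-cluster"
try:
    # Reuse the cluster if one with this name already exists in the workspace.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
except Exception:
    config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_NC6",     # one NVIDIA K80 GPU
        vm_priority="lowpriority",  # cheaper, preemptible nodes
        max_nodes=4,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, config)
    compute_target.wait_for_completion(show_output=True)
```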
@@ -155,7 +155,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"Movielens 100k dataset is used for running the demonstration."
"MovieLens 100k dataset is used for running the demonstration."
]
},
{
2 changes: 1 addition & 1 deletion notebooks/05_operationalize/README.md
@@ -4,7 +4,7 @@ In this directory, a notebook is provided to demonstrate how recommendation syst

| Notebook | Description |
| --- | --- |
- | [als_movie_o16n](als_movie_o16n.ipynb) | End-to-end examples demonstrate how to build, evaluate, and deploye a Spark ALS based movie recommender with Azure services such as [Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction), and [Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/).
+ | [als_movie_o16n](als_movie_o16n.ipynb) | End-to-end examples demonstrate how to build, evaluate, and deploy a Spark ALS based movie recommender with Azure services such as [Databricks](https://azure.microsoft.com/en-us/services/databricks/), [Cosmos DB](https://docs.microsoft.com/en-us/azure/cosmos-db/introduction), and [Kubernetes Services](https://azure.microsoft.com/en-us/services/kubernetes-service/).


## Workflow
6 changes: 3 additions & 3 deletions notebooks/05_operationalize/als_movie_o16n.ipynb
@@ -319,7 +319,7 @@
"# top k items to recommend\n",
"TOP_K = 10\n",
"\n",
"# Select Movielens data size: 100k, 1m, 10m, or 20m\n",
"# Select MovieLens data size: 100k, 1m, 10m, or 20m\n",
"MOVIELENS_DATA_SIZE = '100k'"
]
},
@@ -379,7 +379,7 @@
"source": [
"### 2.3 Train the ALS model on the training data, and get the top-k recommendations for our testing data\n",
"\n",
"To predict movie ratings, we use the rating data in the training set as users' explicit feedbacks. The hyperparameters used to estimate the model are set based on [this page](http://mymedialite.net/examples/datasets.html).\n",
"To predict movie ratings, we use the rating data in the training set as users' explicit feedback. The hyperparameters used to estimate the model are set based on [this page](http://mymedialite.net/examples/datasets.html).\n",
"\n",
"Under most circumstances, you would explore the hyperparameters and choose an optimal set based on some criteria. For additional details on this process, please see additional information in the deep dives [here](../04_model_select_and_optimize/hypertune_spark_deep_dive.ipynb)."
]
@@ -428,7 +428,7 @@
"source": [
"In the movie recommendation use case, recommending movies that have been rated by the users do not make sense. Therefore, the rated movies are removed from the recommended items.\n",
"\n",
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training datatset."
"In order to achieve this, we recommend all movies to all users, and then remove the user-movie pairs that exist in the training dataset."
]
},
{
4 changes: 2 additions & 2 deletions reco_utils/README.md
@@ -8,10 +8,10 @@ This module (reco_utils) contains functions to simplify common tasks used when d
This submodule contains high-level utilities for defining constants used in most algorithms as well as helper functions for managing aspects of different frameworks: gpu, spark, jupyter notebook.

### [Dataset](./dataset)
- Dataset includes helper functions for interacting with Azure Cosmos databases, pulling different sizes of the Movielens dataset and formatting them appropriately as well as utilities for splitting data for training / testing.
+ Dataset includes helper functions for interacting with Azure Cosmos databases, pulling different sizes of the MovieLens dataset and formatting them appropriately as well as utilities for splitting data for training / testing.

#### Data Loading
- The movielens module will allow you to load a dataframe in pandas or spark formats from the Movielens dataset, with sizes of 100k, 1M, 10M, or 20M to test algorithms and evaluate performance benchmarks.
+ The movielens module will allow you to load a dataframe in pandas or spark formats from the MovieLens dataset, with sizes of 100k, 1M, 10M, or 20M to test algorithms and evaluate performance benchmarks.
```python
df = movielens.load_pandas_df(size="100k")
```
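The Spark counterpart presumably looks like the following; the `load_spark_df` name and signature are assumed from the module description, not verified against this commit:

```python
from pyspark.sql import SparkSession
from reco_utils.dataset import movielens

spark = SparkSession.builder.getOrCreate()
spark_df = movielens.load_spark_df(spark, size="100k")
```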
6 changes: 3 additions & 3 deletions reco_utils/common/general_utils.py
@@ -7,8 +7,8 @@

def invert_dictionary(dictionary):
"""Invert a dictionary
- NOTE: If the dictionary has unique keys and unique values, the invertion would be perfect. However, if there are
- repeated values, the invertion can take different keys
+ NOTE: If the dictionary has unique keys and unique values, the inversion would be perfect. However, if there are
+ repeated values, the inversion can take different keys
Args:
dictionary (dict): A dictionary
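Judging from the documented note, the function presumably behaves like this sketch — a plain dict comprehension where repeated values mean later keys silently win (the real body may differ):

```python
def invert_dictionary(dictionary):
    """Swap keys and values; with duplicate values, the last key wins."""
    return {value: key for key, value in dictionary.items()}

invert_dictionary({"a": 1, "b": 2})  # {1: "a", 2: "b"}
invert_dictionary({"a": 1, "b": 1})  # {1: "b"} - the key "a" is dropped
```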
@@ -37,7 +37,7 @@ def get_number_processors():
try:
num = os.cpu_count()
except Exception:
- import multiprocessing # force exception in case mutiprocessing is not installed
+ import multiprocessing # force exception in case multiprocessing is not installed

num = multiprocessing.cpu_count()
return num
2 changes: 1 addition & 1 deletion reco_utils/common/notebook_memory_management.py
@@ -16,7 +16,7 @@
from __future__ import print_function # force use of print("hello")
from __future__ import (
unicode_literals
- ) # force unadorned strings "" to be unicode without prepending u""
+ ) # force unadorned strings "" to be Unicode without prepending u""
import time
import memory_profiler
from IPython import get_ipython
2 changes: 1 addition & 1 deletion reco_utils/dataset/criteo.py
@@ -136,7 +136,7 @@ def extract_criteo(size, compressed_file, path=None):
"""Extract Criteo dataset tar.
Args:
- size (str): Size of criteo dataset. It can be "full" or "sample".
+ size (str): Size of Criteo dataset. It can be "full" or "sample".
compressed_file (str): Path to compressed file.
path (str): Path to extract the file.
2 changes: 1 addition & 1 deletion reco_utils/dataset/download_utils.py
@@ -16,7 +16,7 @@ class TqdmUpTo(tqdm):
"""Wrapper class for the progress bar tqdm to get `update_to(n)` functionality"""

def update_to(self, b=1, bsize=1, tsize=None):
"""A progress bar showing how much is left to finish the opperation
"""A progress bar showing how much is left to finish the operation
Args:
b (int): Number of blocks transferred so far.
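A self-contained sketch of how such a wrapper is typically wired into `urlretrieve` (the standard tqdm recipe; details may differ from `download_utils.py`):

```python
from urllib.request import urlretrieve
from tqdm import tqdm

class TqdmUpTo(tqdm):
    """tqdm wrapper exposing an update_to(b, bsize, tsize) reporthook."""

    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize           # total download size, if known
        self.update(b * bsize - self.n)  # advance the bar to b*bsize bytes

url = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
with TqdmUpTo(unit="B", unit_scale=True, desc="ml-100k.zip") as t:
    urlretrieve(url, "ml-100k.zip", reporthook=t.update_to)
```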