# MovieLens API


Full: 27,000,000 ratings and 1,100,000 tag applications applied to 58,000 movies by 280,000 users. Includes tag genome data with 14 million relevance scores across 1,100 tags. Last updated 9/2018.

Using the MovieLens dataset, build a solution to the following activities:

Part 1:

- Download the dataset in an automated way.
- Create an RDD from the dataset.
- Train a recommendation engine using ALS.
- Evaluate the engine's performance.

Part 2. Create three files that provide the following capabilities:

**engine.py**

Functions of the recommendation engine.

- Function `get_counts_and_averages`: Given a tuple (movieID, ratings_iterable), returns (movieID, (ratings_count, ratings_avg)).
- Class `RecommendationEngine` with the following methods:
    - `count_and_average_ratings`: Updates the movie rating counts using the ratings DataFrame.
    - `train_model`: Trains an ALS model on the ratings DataFrame.
    - `predict_ratings`: Given a tuple (userID, movieID), generates predictions.
        Returns: an RDD with the format (movieTitle, movieRating, numRatings).
    - `add_ratings`:
        - Adds new ratings, in the format (user_id, movie_id, rating), to the existing DataFrame.
        - Recomputes the rating counts (`count_and_average_ratings()`).
        - Retrains the model with the new ratings.
    - `get_ratings_for_movie_ids`: Given a user_id and a list of movie_ids, predicts ratings for those movies.
    - `get_top_ratings`: Given the tuple (user_id, movies_count), recommends up to movies_count movies that user_id has not yet rated.
    - `__init__`: Instantiates the ratings DataFrame, computes the rating counts, and trains the model.
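
As a rough illustration of the `get_counts_and_averages` contract described above, the helper can be sketched in plain Python. This is only one possible shape, not the required implementation; the handling of a movie with no ratings (average of 0.0) is an assumption that the brief does not specify.

```python
from typing import Iterable, Tuple


def get_counts_and_averages(
    movie_id_and_ratings: Tuple[int, Iterable[float]]
) -> Tuple[int, Tuple[int, float]]:
    """Map (movieID, ratings_iterable) to (movieID, (ratings_count, ratings_avg))."""
    movie_id, ratings_iterable = movie_id_and_ratings
    # Materialize once: the iterable may be a one-shot grouped iterator.
    ratings = list(ratings_iterable)
    count = len(ratings)
    # Assumption: a movie with no ratings gets an average of 0.0.
    average = sum(ratings) / count if count else 0.0
    return movie_id, (count, average)
```

A function of this shape could be applied with `map` to the output of a `groupByKey()` over (movieID, rating) pairs.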




**app.py**

A Flask Blueprint router that handles the following routes and functions:

```python
@main.route("/<int:user_id>/ratings/top/<int:count>", methods=["GET"])
def top_ratings(user_id, count):
    ...
    return json.dumps(top_ratings)
    

```

----

```python
@main.route("/<int:user_id>/ratings/<int:movie_id>", methods=["GET"])
def movie_ratings(user_id, movie_id):
    ...
    return json.dumps(ratings)

```

----

```python
@main.route("/<int:user_id>/ratings", methods = ["POST"])
def add_ratings(user_id):
    # get the ratings from the Flask POST request object
    # create a list with the format required by the engine (user_id, movie_id, rating)
    # add them to the model using the engine API
    ...
    return json.dumps(ratings)

```
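
The parsing step that the comments in `add_ratings` describe might look like the sketch below. The request body format (one `movie_id,rating` pair per line) and the helper name `parse_ratings` are assumptions made for illustration; the brief does not fix either.

```python
from typing import List, Tuple


def parse_ratings(user_id: int, raw_body: str) -> List[Tuple[int, int, float]]:
    """Turn 'movie_id,rating' lines into (user_id, movie_id, rating) tuples."""
    ratings = []
    for line in raw_body.strip().splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        movie_id, rating = line.split(",")
        ratings.append((user_id, int(movie_id), float(rating)))
    return ratings
```

Inside the route, the raw body could be obtained with `request.get_data(as_text=True)` and the resulting list passed to the engine's `add_ratings` method.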

----

```python
def create_app(spark_context, dataset_path):
    global recommendation_engine 

    recommendation_engine = RecommendationEngine(spark_context, dataset_path)    
    
    app = Flask(__name__)
    app.register_blueprint(main)
    return app 

```

Hint:
This router must import the `RecommendationEngine` class from the `engine.py` script:
```python
from engine import RecommendationEngine
```



**server.py**

The script that instantiates:
 * The web server for the router in **app.py**
 * The Spark context that the API created in **engine.py** needs in order to work

This is the script submitted to **spark-submit**:
`$SPARK_HOME/bin/spark-submit --master local[*] --total-executor-cores 4 --executor-memory 8g server.py`

---




## Part 1:

### dataset_downloader.sh


Create the file `dataset_downloader.sh` with the following contents:

```bash
#!/usr/bin/env bash

hash wget 2>/dev/null || { echo >&2 "Wget required.  Aborting."; exit 1; }
hash unzip 2>/dev/null || { echo >&2 "unzip required.  Aborting."; exit 1; }

wget http://files.grouplens.org/datasets/movielens/ml-latest.zip
unzip -o "ml-latest.zip"
DESTINATION="./datasets/"
mkdir -p "$DESTINATION"
mv ml-latest "$DESTINATION"
```
---

Once it is created, run the commands:

```bash

chmod +x dataset_downloader.sh

./dataset_downloader.sh
```



---

## ALS Model

Import the dependencies:

In [None]:
import os

from pyspark import SparkContext, SparkConf


from pyspark.sql import Column, Row
from pyspark.sql.functions import col, expr


# Machine Learning

from pyspark.ml.recommendation import ALS

# Metrics / Eval

from pyspark.ml.evaluation import RegressionEvaluator

from pyspark.mllib.evaluation import RegressionMetrics, RankingMetrics


Instantiate a Spark session:

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local") \
    .appName("API") \
    .getOrCreate()

Create a DataFrame that reads the `ratings.csv` source:

In [None]:
# dataset_path = "./datasets/ml-latest/"

dataset_path = "/home/jovyan/work/spark-movie-lens/datasets/ml-latest/"

# ratings = spark.read.option("header","false").text(os.path.join(dataset_path, 'ratings.csv'))\
#     .rdd.toDF()\
#     .selectExpr("split(value, ',') as col")\
#     .selectExpr(
#                 "cast(col[0] as int) as userId",
#                 "cast(col[1] as int) as movieId",
#                 "cast(col[2] as float) as rating",
#                 "cast(col[3] as long) as timestamp")\
#     .dropna()\
#     .cache()
# Read each line as text, split it on commas, and cast the first three fields.
# The header row fails the casts, yields nulls (non-ANSI mode), and is removed by dropna().
ratings = spark.read.text(os.path.join(dataset_path, 'ratings.csv'))\
    .selectExpr("split(value, ',') as col")\
    .selectExpr(
                "cast(col[0] as int) as userId",
                "cast(col[1] as int) as movieId",
                "cast(col[2] as float) as rating")\
    .dropna()\
    .cache()



ratings_short = ratings.sample(fraction=.0001, seed=42)

In [None]:
ratings_short.show()

In [None]:
training, test = ratings_short.randomSplit([0.8, 0.2])
als = ALS()\
    .setMaxIter(5)\
    .setRegParam(0.01)\
    .setUserCol("userId")\
    .setItemCol("movieId")\
    .setRatingCol("rating")\
    .setColdStartStrategy("drop")  # drop NaN predictions for users/movies unseen in training
print(als.explainParams())

alsModel = als.fit(training)
predictions = alsModel.transform(test)

We can now output the top *k* recommendations for each user or movie. The model's `recommendForAllUsers` method returns a DataFrame with a userId and an array of recommendations, each with a predicted rating for the recommended movie. `recommendForAllItems` returns a DataFrame with a movieId and the top users for that movie:

In [None]:
alsModel.recommendForAllUsers(10)\
  .selectExpr("userId", "explode(recommendations)").show()
alsModel.recommendForAllItems(10)\
  .selectExpr("movieId", "explode(recommendations)").show()
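
To give some intuition for what a top-*k* recommendation involves, recall that ALS factorizes the rating matrix into user and item latent factors and scores a (user, item) pair with the dot product of the two vectors. The plain-Python sketch below ranks items for one user that way. The helper name `top_k_items` and the toy factors are invented for illustration; this is not how Spark implements the method internally.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def top_k_items(
    user_factors: Sequence[float],
    item_factors: Dict[int, Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return the k (item_id, score) pairs with the highest dot-product score."""
    scores = (
        (item_id, sum(u * v for u, v in zip(user_factors, factors)))
        for item_id, factors in item_factors.items()
    )
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])


# Toy factors, invented for illustration only.
user = [1.0, 0.5]
items = {10: [4.0, 0.0], 20: [1.0, 2.0], 30: [0.0, 1.0]}
top_two = top_k_items(user, items, k=2)
```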

## Evaluators for Recommendation


The cold-start strategy matters when setting up an automatic model evaluator for ALS. One thing that may not be immediately obvious is that this recommendation problem is really just a kind of regression problem. Since we're predicting values (ratings) for given users, we want to minimize the total difference between the predicted ratings and the true values. We can do this using the same `RegressionEvaluator` used for ordinary regression models. You can place this in a pipeline to automate the training process. When doing this, you should also set the cold-start strategy to `drop` instead of `NaN`, and then switch it back to `NaN` when it comes time to actually make predictions in your production system:

In [None]:
evaluator = RegressionEvaluator()\
  .setMetricName("rmse")\
  .setLabelCol("rating")\
  .setPredictionCol("prediction")
rmse = evaluator.evaluate(predictions)

print("Root-mean-square error = %f" % rmse)
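
To see why the cold-start setting affects evaluation, the following plain-Python sketch computes RMSE by hand on a few toy (prediction, rating) pairs. The data is invented, and the `rmse` helper is only an illustration of the metric, not Spark's implementation.

```python
import math
from typing import Iterable, Tuple


def rmse(pairs: Iterable[Tuple[float, float]]) -> float:
    """Root-mean-square error over (prediction, rating) pairs."""
    squared_errors = [(prediction - rating) ** 2 for prediction, rating in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy (prediction, rating) pairs; the NaN mimics a cold-start user or movie.
pairs = [(4.0, 3.0), (5.0, 5.0), (float("nan"), 2.0)]

with_nan = rmse(pairs)  # NaN: one missing prediction poisons the metric
dropped = rmse([p for p in pairs if not math.isnan(p[0])])  # mimics "drop"
```

A single NaN prediction makes the whole metric NaN, which is why dropping those rows is useful during evaluation.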

## Metrics
Recommendation results can be measured using both the standard regression metrics and some recommendation-specific metrics. It should come as no surprise that there are more sophisticated ways of measuring recommendation success than simply evaluating based on regression. These metrics are particularly useful for evaluating your final model.


### Regression Metrics


We can recycle the regression metrics for recommendation. This is because we can simply see how close each prediction is to the actual rating for that user and item:

In [None]:
# RegressionMetrics expects (prediction, observation) pairs of floats.
regComparison = predictions.select("prediction", "rating")\
  .rdd.map(lambda row: (float(row[0]), float(row[1])))
metrics = RegressionMetrics(regComparison)

### Ranking Metrics

More interestingly, we also have another tool: ranking metrics. `RankingMetrics` lets us compare our recommendations with an actual set of ratings (or preferences) expressed by a given user. It does not focus on the predicted rating value but on whether our algorithm recommends items the user has already rated highly. This requires some data preparation on our part. First, we need to collect a set of highly rated movies for each user. In our case, we use a rather low threshold: movies rated above 2.5. Tuning this value will largely be a business decision:

In [None]:
perUserActual = predictions\
  .where("rating > 2.5")\
  .groupBy("userId")\
  .agg(expr("collect_set(movieId) as movies"))

At this point, we have a collection of users, along with a truth set of previously ranked movies for each user. Now we will get our top 10 recommendations from our algorithm on a per-user basis. We will then see if the top 10 recommendations show up in our truth set. If we have a well-trained model, it will correctly recommend the movies a user already liked. If it doesn’t, it may not have learned enough about each particular user to successfully reflect their preferences:

In [None]:
perUserPredictions = predictions\
  .orderBy(col("userId"), expr("prediction DESC"))\
  .groupBy("userId")\
  .agg(expr("collect_list(movieId) as movies"))

Now we have two DataFrames: one with each user's ranked predictions and another with each user's highly rated movies. We can pass them into the `RankingMetrics` object, which accepts an RDD of (predicted ranking, ground-truth set) pairs, as you can see in the following join and RDD conversion:

In [None]:
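# RankingMetrics expects (predicted ranking, ground-truth set) pairs.
# After the join, row[1] is the truth set and row[2] the ranking; keep the top 10.
perUserActualvPred = perUserActual.join(perUserPredictions, ["userId"]).rdd\
  .map(lambda row: (row[2][:10], row[1]))
ranks = RankingMetrics(perUserActualvPred)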

Now we can see the metrics from that ranking. For instance, we can see how precise our algorithm is with the mean average precision. We can also get the precision at certain ranking points, for instance, to see where the majority of the positive recommendations fall:

In [None]:
ranks.meanAveragePrecision

In [None]:
ranks.precisionAt(5)
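
For intuition about what these two numbers measure, here is a plain-Python sketch of precision at *k* and average precision for a single user, following the usual textbook definitions. Spark's implementation may differ in edge cases, and the toy data and function names are invented for illustration.

```python
from typing import Sequence, Set


def precision_at(predicted: Sequence[int], actual: Set[int], k: int) -> float:
    """Fraction of the first k predictions that are in the truth set."""
    hits = sum(1 for item in predicted[:k] if item in actual)
    return hits / k


def average_precision(predicted: Sequence[int], actual: Set[int]) -> float:
    """Sum of precision at each rank with a hit, divided by the truth-set size."""
    if not actual:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(predicted, start=1):
        if item in actual:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(actual)


# Toy data for one user, invented for illustration.
predicted_movies = [1, 4, 2, 5]
liked_movies = {1, 2, 3}
p_at_2 = precision_at(predicted_movies, liked_movies, k=2)
ap = average_precision(predicted_movies, liked_movies)
```

The mean average precision is then the mean of the per-user average precision over all users.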

In [None]:
spark.stop()