## Title

# ALS

### Description:

### Authors:

#### Hugo Cesar Octavio del Sueldo
#### Jose Lopez Galdon

### Date:
15/01/2021
### Version:
1.0

## Load pySpark

First, we create the SparkContext and the SparkSession that we will use to load the files downloaded from the official website.

In [1]:
# findspark locates the Spark installation on the system (uncomment if needed)
#import findspark
#findspark.init()

# Initialize the Spark context
from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Since we are going to work with Spark SQL, we also create a SparkSession
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.window import Window
from pyspark.sql.functions import *
spark = SparkSession.builder.master("local[*]").getOrCreate()

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# HandySpark
from handyspark import *

## Load data

Next, we store the file paths in variables.

In [2]:
data_movies = "../data/01_raw/movies.csv"
data_ratings = "../data/01_raw/ratings.csv"

### Movies dataset

**`spark.read`**: The entry point needed to load the CSV file.
- format("csv"): Sets the format of the file.

- option("sep", ","): Sets the field separator, in this case ','.

- option("inferSchema", "true"): Tells Spark to infer the column types from the data.

- option("header", "true"): Tells Spark that the first line of the file is a header.

- load(f'{data_movies}'): Path to the file.

Note that this code uses the Python API (PySpark), although Spark itself is written in Scala.
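To make the effect of the `header` and `inferSchema` options more concrete, here is a minimal plain-Python sketch that mimics them on a tiny in-memory sample. It is only an analogy and not how Spark implements the options: the sample rows and the helpers `infer_value` and `read_rows` are hypothetical names introduced for illustration.

```python
import csv
import io

# Hypothetical sample with the same column layout as movies.csv.
SAMPLE = (
    "movieId,title,genres\n"
    "1,Toy Story (1995),Adventure|Animation\n"
    "2,Jumanji (1995),Adventure|Children\n"
)

def infer_value(text):
    """Try int, then float, and fall back to the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text

def read_rows(raw, sep=","):
    """Read CSV text, using the first line as column names."""
    # DictReader consumes the first line as the header (like header=true).
    reader = csv.DictReader(io.StringIO(raw), delimiter=sep)
    # Converting each cell imitates, very roughly, inferSchema=true.
    return [{name: infer_value(value) for name, value in row.items()}
            for row in reader]

rows = read_rows(SAMPLE)
print(rows[0])
```

Without the type-inference step every value would stay a string, which is also what Spark does when `inferSchema` is left at its default.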

In [3]:
raw_movies = spark.read.format("csv") \
                       .option("sep", ",") \
                       .option("inferSchema", "true") \
                       .option("header", "true") \
                       .load(f'{data_movies}')

<class 'pyspark.sql.dataframe.DataFrame'>


### Ratings dataset

**`spark.read`**: The entry point needed to load the CSV file.
- format("csv"): Sets the format of the file.

- option("sep", ","): Sets the field separator, in this case ','.

- option("inferSchema", "true"): Tells Spark to infer the column types from the data.

- option("header", "true"): Tells Spark that the first line of the file is a header.

- load(f'{data_ratings}'): Path to the file.

Note that this code uses the Python API (PySpark), although Spark itself is written in Scala.

In [4]:
raw_ratings = spark.read.format("csv") \
                        .option("sep", ",") \
                        .option("inferSchema", "true") \
                        .option("header", "true") \
                        .load(f'{data_ratings}')

<class 'pyspark.sql.dataframe.DataFrame'>


### Join datasets

In order to continue with the exploration, we will merge both datasets.

In [5]:
    # Join both the data frames to add movie data into ratings
movie_ratings = raw_ratings.join(other=raw_movies, on=["movieId"], how="left")
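The `how="left"` argument keeps every rating, even when its `movieId` has no match in the movies table. The following plain-Python sketch illustrates that behaviour on toy data. It is not the Spark implementation: the rows, the `left_join` helper, and the assumption that each `movieId` appears only once on the right-hand side are all illustrative.

```python
# Toy stand-ins for raw_ratings and raw_movies (hypothetical values).
ratings = [
    {"userId": 1, "movieId": 307, "rating": 3.5},
    {"userId": 1, "movieId": 481, "rating": 3.5},
    {"userId": 2, "movieId": 999, "rating": 4.0},  # no matching movie
]
movies = [
    {"movieId": 307, "title": "Movie A", "genres": "Drama"},
    {"movieId": 481, "title": "Movie B", "genres": "Drama|Thriller"},
]

def left_join(left, right, key):
    """Keep every left row; unmatched right-hand columns become None."""
    # Assumes the key is unique in the right-hand table.
    lookup = {row[key]: row for row in right}
    right_cols = [col for col in right[0] if col != key]
    joined = []
    for row in left:
        match = lookup.get(row[key])
        extra = {col: (match[col] if match else None) for col in right_cols}
        joined.append({**row, **extra})
    return joined

joined = left_join(ratings, movies, "movieId")
for row in joined:
    print(row)
```

The third rating survives the join with `title` and `genres` set to `None`, which corresponds to the null values Spark would produce for an unmatched row.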

In [6]:
    # Show the dataset
movie_ratings.show()

+-------+------+------+----------+--------------------+--------------------+
|movieId|userId|rating| timestamp|               title|              genres|
+-------+------+------+----------+--------------------+--------------------+
|    307|     1|   3.5|1256677221|Three Colors: Blu...|               Drama|
|    481|     1|   3.5|1256677456|   Kalifornia (1993)|      Drama|Thriller|
|   1091|     1|   1.5|1256677471|Weekend at Bernie...|              Comedy|
|   1257|     1|   4.5|1256677460|Better Off Dead.....|      Comedy|Romance|
|   1449|     1|   4.5|1256677264|Waiting for Guffm...|              Comedy|
|   1590|     1|   2.5|1256677236|Event Horizon (1997)|Horror|Sci-Fi|Thr...|
|   1591|     1|   1.5|1256677475|        Spawn (1997)|Action|Adventure|...|
|   2134|     1|   4.5|1256677464|Weird Science (1985)|Comedy|Fantasy|Sc...|
|   2478|     1|   4.0|1256677239|¡Three Amigos! (1...|      Comedy|Western|
|   2840|     1|   3.0|1256677500|     Stigmata (1999)|      Drama|Thriller|

As we can see above, the new dataset is ready for ALS (Alternating Least Squares).

## Sparsity

In [77]:
# Count the total number of ratings in the dataset
numerator = movie_ratings.select("rating").count()

# Count the number of distinct user and movie IDs
num_users = movie_ratings.select("userId").distinct().count()
num_items = movie_ratings.select("movieId").distinct().count()

# Set the denominator equal to the number of users multiplied by the number of items
denominator = num_users * num_items

# Divide the numerator by the denominator
sparsity = (1.0 - (numerator * 1.0)/ denominator) * 100
print("The movie_ratings dataframe is ", "%.2f" % sparsity + "% empty.")

The movie_ratings dataframe is  99.82% empty.


At this point you may be wondering what sparsity is. It is covered in the theory part (`00_THEORY`), but we give a brief summary here.

In a real-world setting, the vast majority of movies receive very few ratings from users, or none at all. A variable with sparse data is one in which a relatively high percentage of the cells do not contain actual data. In our case, 99.82% of the possible user-movie pairs have no rating. Such "empty", or NA, values still take up storage space in the file.
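The same sparsity formula can be checked by hand on a small example. The sketch below uses plain Python and hypothetical toy ratings, and it assumes that each (user, movie) pair appears at most once. The name `sparsity_percent` is introduced only for illustration.

```python
# Toy ratings as (userId, movieId, rating) tuples; values are hypothetical.
ratings = [
    (1, 10, 4.0),
    (1, 20, 3.5),
    (2, 10, 5.0),
    (3, 30, 2.0),
]

def sparsity_percent(rows):
    """Percentage of empty cells in the user x movie matrix."""
    num_ratings = len(rows)
    num_users = len({user for user, _, _ in rows})
    num_items = len({item for _, item, _ in rows})
    # Filled cells divided by all possible cells, then inverted.
    return (1.0 - num_ratings / (num_users * num_items)) * 100

toy_sparsity = sparsity_percent(ratings)
print(f"The toy matrix is {toy_sparsity:.2f}% empty.")
```

With 3 users and 3 movies there are 9 possible cells, of which only 4 are filled, so 5 out of 9 cells (about 55.56%) are empty.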

***
