# Technical test results for HEVA company

This notebook restates each question of the test. Under each task you will find the code and the result it produced.
All the requirements needed to run this notebook are listed in the requirements.md file.

## Configuration

### 1. Importing packages

In [1]:
# Import necessary modules

from pyspark.sql import SparkSession

### 2. Settings

In [7]:
# Definition of necessary parameters
data_path = "../sources/data/movies.sqlite"

### 3. Reading data

In [9]:
def read_data(data_path):
    """Configure the PySpark session with the SQLite JDBC package
       and read both tables of the "movies.sqlite" file.

    Args:
        data_path (str): Path to the SQLite data file

    Returns:
        tuple: The movies and ratings PySpark DataFrames, in that order
    """

    # Creation of the Spark session
    spark = SparkSession.builder\
        .config(
            'spark.jars.packages',
            'org.xerial:sqlite-jdbc:3.34.0')\
        .getOrCreate()

    # Reading the movies table
    df_movies = spark.read.format('jdbc')\
        .options(
            driver='org.sqlite.JDBC',
            dbtable='movies',
            url=f'jdbc:sqlite:{data_path}')\
        .load()

    # Reading the ratings table
    df_ratings = spark.read.format('jdbc')\
        .options(
            driver='org.sqlite.JDBC',
            dbtable='ratings',
            url=f'jdbc:sqlite:{data_path}')\
        .load()

    return df_movies, df_ratings


df_movies, df_ratings = read_data(data_path)
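
Before relying on column names, it can help to confirm the table and column names directly from the SQLite file. The following is a minimal sketch using Python's standard `sqlite3` module, independent of Spark. The helper name `list_columns` is hypothetical, and the in-memory database stands in for `movies.sqlite`; the `movie_id` and `rating` columns of the `ratings` table are assumptions, not taken from the real file.

```python
import sqlite3


def list_columns(conn):
    """Return {table_name: [column names]} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # PRAGMA table_info yields one row per column; index 1 is the column name
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


# Toy in-memory database standing in for movies.sqlite (assumed schema)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movie_id INTEGER, title TEXT, genre TEXT)")
conn.execute(
    "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER)"
)

schema = list_columns(conn)
print(schema)
```

Against the real file, `sqlite3.connect(data_path)` would replace the in-memory connection.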

### 4. Data overview

In [11]:
def preview_data(df_movies, df_ratings):
    """Show the first 20 rows of each table

    Args:
        df_movies (Dataframe): Movies Dataframe
        df_ratings (Dataframe): Ratings Dataframe
    """

    # Overview of movies table data
    print("Movies table")
    df_movies.show()

    # Preview data from the ratings table
    print("Ratings table")
    df_ratings.show()


preview_data(df_movies, df_ratings)

Movies table
+--------+--------------------+--------------------+
|movie_id|               title|               genre|
+--------+--------------------+--------------------+
|       8|Edison Kinetoscop...|   Documentary|Short|
|      10|La sortie des usi...|   Documentary|Short|
|      12|The Arrival of a ...|   Documentary|Short|
|      25|The Oxford and Ca...|                null|
|      91|Le manoir du diab...|        Short|Horror|
|     131|Une nuit terrible...| Short|Comedy|Horror|
|     417|A Trip to the Moo...|Short|Action|Adve...|
|     439|The Great Train R...|Short|Action|Crim...|
|     443|Hiawatha, the Mes...|                null|
|     628|The Adventures of...|        Action|Short|
|     833|The Country Docto...|         Short|Drama|
|    1223| Frankenstein (1910)| Short|Horror|Sci-Fi|
|    1740|The Lonedale Oper...| Short|Drama|Romance|
|    2101|    Cleopatra (1912)|       Drama|History|
|    2130|    L'inferno (1911)|Adventure|Drama|F...|
|    2354|Max et Jane veule...|Sh

## Tasks

### 1. Counts

- 1.1 How many films are in the database?

In [13]:
def activity_1_1(df_movies):
    """Count the number of distinct film titles

    Args:
        df_movies (Dataframe): Movies Dataframe

    Returns:
        int: Number of distinct film titles
    """

    return df_movies\
        .select("title")\
        .distinct()\
        .count()


result_1_1 = activity_1_1(df_movies)
print("There are", result_1_1, "movies in the database")

There are 37947 movies in the database
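
Counting distinct titles treats two films that share a title as a single film, so the figure can differ from the number of distinct `movie_id` values. Whether this actually happens in `movies.sqlite` is not established here. The sketch below uses made-up rows in an in-memory SQLite database purely to illustrate the difference; in PySpark the alternative count would be `df_movies.select("movie_id").distinct().count()`.

```python
import sqlite3

# Toy data (invented for illustration): two films share the same title
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (movie_id INTEGER, title TEXT, genre TEXT)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        (1, "Example Film (1950)", "Drama"),
        (2, "Example Film (1950)", "Comedy"),  # same title, different id
        (3, "Another Film (1960)", None),
    ],
)

# Compare the two ways of counting films in a single query
distinct_titles, distinct_ids = conn.execute(
    "SELECT COUNT(DISTINCT title), COUNT(DISTINCT movie_id) FROM movies"
).fetchone()
print("distinct titles:", distinct_titles)
print("distinct movie ids:", distinct_ids)
```

If both counts agree on the real data, the title-based count above is safe; if they differ, the choice depends on whether same-titled entries are meant to be distinct films.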


- 1.2 How many different users are in the database?

In [15]:
def activity_1_2(df_ratings):
    """Count the number of distinct user IDs

    Args:
        df_ratings (Dataframe): Ratings Dataframe

    Returns:
        int: Number of distinct user IDs
    """

    return df_ratings\
        .select("user_id")\
        .distinct()\
        .count()


result_1_2 = activity_1_2(df_ratings)
print("There are", result_1_2, "distinct users in the database")

There are 71707 distinct users in the database

- 1.3 What is the distribution of the ratings given?
     **Bonus**: create a histogram.
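
One possible approach to this question is to count the rows per rating value and draw one bar per value. The sketch below is hedged: it assumes the `ratings` table has a column named `rating`, which is not visible in the preview above, and it runs on invented rows in an in-memory SQLite database rather than on the real data. The helper names `rating_distribution` and `text_histogram` are hypothetical. In PySpark, the same aggregation would be `df_ratings.groupBy("rating").count().orderBy("rating")` under the same column-name assumption.

```python
import sqlite3


def rating_distribution(conn):
    """Return [(rating, count), ...] sorted by rating value."""
    return conn.execute(
        "SELECT rating, COUNT(*) FROM ratings GROUP BY rating ORDER BY rating"
    ).fetchall()


def text_histogram(distribution, width=40):
    """Render one line per rating, bar length proportional to its count."""
    if not distribution:
        return []
    largest = max(count for _, count in distribution)
    return [
        f"{rating:>4} | {'#' * round(width * count / largest)} {count}"
        for rating, count in distribution
    ]


# Toy data (invented); the "rating" column name is an assumption
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (user_id INTEGER, movie_id INTEGER, rating INTEGER)"
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?)",
    [(1, 8, 5), (1, 10, 7), (2, 8, 7), (3, 12, 7), (3, 8, 10)],
)

distribution = rating_distribution(conn)
for line in text_histogram(distribution):
    print(line)
```

A text histogram keeps the sketch free of plotting dependencies; the `(rating, count)` pairs could equally be passed to a plotting library to draw a bar chart.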

## Saving the notebook for versioning

In [16]:
!jupyter nbconvert result.ipynb --to="python"

[NbConvertApp] Converting notebook result.ipynb to python
[NbConvertApp] Writing 2995 bytes to result.py
