## Using Spark Cluster with shared data in Docker

### Handle RDDs

This is a simple example of using Spark from a Docker container. Run this notebook in code-server rather than in a local notebook: the cells below rely on the `spark-master` host and on a directory shared with the Spark containers.

In [None]:
!pip install pyspark==3.5.3 pandas

#### Download the CSV file to the local directory (shared with Spark)

In [6]:
import urllib.request
import zipfile
from os import remove

url = 'https://www.kaggle.com/api/v1/datasets/download/chaitanyahivlekar/large-movie-dataset'
urllib.request.urlretrieve(url,'movies.zip')

with zipfile.ZipFile('movies.zip', 'r') as zip_ref:
    zip_ref.extractall('./')

remove('movies.zip')
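
The later cells assume that the archive contains a file named `movies_dataset.csv`. As a minimal sketch, the extraction step could also report which CSV files it produced, so the file name can be checked before it is passed to Spark. The helper name `extract_csvs` is hypothetical and not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_csvs(zip_path, dest_dir="."):
    """Extract every member of the archive and return the paths of its CSV files."""
    dest = Path(dest_dir)
    with zipfile.ZipFile(zip_path, "r") as archive:
        archive.extractall(dest)
        names = archive.namelist()
    # Keep only CSV members, sorted so the result is deterministic.
    return sorted(str(dest / name) for name in names if name.lower().endswith(".csv"))
```

Calling `extract_csvs('movies.zip')` in place of the `extractall` block would return the list of extracted CSV paths, which can then be printed or compared against the expected file name.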

#### Connect to Spark Cluster and create Spark Session

In [7]:
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("MyApp").setMaster("spark://spark-master:7077")
sc = SparkContext(conf=conf)

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/16 18:50:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
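
Re-running the cell above in the same kernel raises an error, because only one `SparkContext` can be active at a time. A minimal sketch of a more forgiving setup uses `SparkContext.getOrCreate`, which returns the active context if one exists. The environment variable `SPARK_MASTER_URL` and the helper names are assumptions made for illustration; they are not defined by the original notebook or by Spark.

```python
import os


def resolve_master(default="spark://spark-master:7077"):
    """Return the master URL, allowing an override via a (hypothetical) env variable."""
    return os.environ.get("SPARK_MASTER_URL", default)


def create_context(app_name="MyApp"):
    """Create the SparkContext, or reuse the one already running in this kernel."""
    # Imported lazily so the helper above can be used without pyspark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName(app_name).setMaster(resolve_master())
    return SparkContext.getOrCreate(conf)
```

With this sketch, `sc = create_context()` can be executed repeatedly without restarting the kernel.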


#### Create RDD

To create an RDD from a CSV text file, we can use the `.textFile()` method of the Spark context. Each line of the file becomes one string element of the RDD.

In [8]:
rdd = sc.textFile("movies_dataset.csv")


##### Reading data from the Spark cluster

Let's read the first 10 rows of the RDD using the `.take()` method, which is available on any RDD object. The output is a list of strings, each one representing a line of the file.

In [9]:
print("First few lines of the RDD:")
rdd.take(10)

First few lines of the RDD:

[',User_Id,Movie_Name,Rating,Genre',
 '0,1,Pulp Fiction (1994),5.0,Comedy|Crime|Drama|Thriller',
 '1,1,Three Colors: Red (Trois couleurs: Rouge) (1994),3.5,Drama',
 '2,1,Three Colors: Blue (Trois couleurs: Bleu) (1993),5.0,Drama',
 '3,1,Underground (1995),5.0,Comedy|Drama|War',
 "4,1,Singin' in the Rain (1952),3.5,Comedy|Musical|Romance",
 '5,1,Dirty Dancing (1987),4.0,Drama|Musical|Romance',
 '6,1,Delicatessen (1991),3.5,Comedy|Drama|Romance',
 '7,1,Ran (1985),3.5,Drama|War',
 '8,1,"Seventh Seal, The (Sjunde inseglet, Det) (1957)",5.0,Drama']

Next, we count the number of elements in the RDD using the `.count()` method. Because the RDD still holds the raw file, the result includes the header line.

In [10]:
print("Count total items")
rdd.count()

Count total items

25000096

##### Filter data

Here we retrieve the first element with the `.first()` method. It is the first row of the CSV file, that is, the header.

In [11]:
# header
header = rdd.first()
print(header)

,User_Id,Movie_Name,Rating,Genre


Using the `header` variable and the `.filter()` method, we keep every row that differs from the header.

In [12]:
data_rdd = rdd.filter(lambda row: row != header)
data_rdd.take(5)

['0,1,Pulp Fiction (1994),5.0,Comedy|Crime|Drama|Thriller',
 '1,1,Three Colors: Red (Trois couleurs: Rouge) (1994),3.5,Drama',
 '2,1,Three Colors: Blue (Trois couleurs: Bleu) (1993),5.0,Drama',
 '3,1,Underground (1995),5.0,Comedy|Drama|War',
 "4,1,Singin' in the Rain (1952),3.5,Comedy|Musical|Romance"]

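Comparing every row with the header string works, but it tests all rows and would also drop any data row that happened to equal the header. As an alternative sketch, the header can be skipped by position with `mapPartitionsWithIndex`, assuming the header is the first line of partition 0 (which holds when the RDD is read from a single text file). The function name `drop_header` is hypothetical.

```python
from itertools import islice


def drop_header(partition_index, rows):
    """Skip the first row of partition 0; pass every other partition through unchanged."""
    if partition_index == 0:
        return islice(rows, 1, None)
    return rows


# Intended use with the RDD created earlier (requires a running SparkContext):
# data_rdd = rdd.mapPartitionsWithIndex(drop_header)
```
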
Next, we keep only the rows that contain the string "Comedy".

In [13]:
data_rdd = data_rdd.filter(lambda row: "Comedy" in row)
data_rdd.take(10)

['0,1,Pulp Fiction (1994),5.0,Comedy|Crime|Drama|Thriller',
 '3,1,Underground (1995),5.0,Comedy|Drama|War',
 "4,1,Singin' in the Rain (1952),3.5,Comedy|Musical|Romance",
 '6,1,Delicatessen (1991),3.5,Comedy|Drama|Romance',
 '12,1,Back to the Future Part II (1989),2.5,Adventure|Comedy|Sci-Fi',
 '13,1,Back to the Future Part III (1990),2.5,Adventure|Comedy|Sci-Fi|Western',
 '20,1,"Black Cat, White Cat (Crna macka, beli macor) (1998)",4.5,Comedy|Romance',
 '21,1,"Good Morning, Vietnam (1987)",4.0,Comedy|Drama|War',
 '22,1,"Idiots, The (Idioterne) (1998)",5.0,Comedy|Drama',
 '29,1,"Amelie (Fabuleux destin d\'Amélie Poulain, Le) (2001)",4.5,Comedy|Romance']
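
The substring test above matches "Comedy" anywhere in the row, including the movie title. A stricter sketch checks only the genre field, assuming that the genre list is the last comma-separated field and that genre names contain no commas. The helper name `has_genre` and the second row used in the usage note are made up for illustration.

```python
def has_genre(row, genre):
    """Return True if `genre` is one of the pipe-separated genres in the last field."""
    # The genre list is the last field, so split once from the right.
    genres = row.rsplit(",", 1)[-1].split("|")
    return genre in genres


# Intended use with the RDD from the cells above (requires a running SparkContext):
# comedy_rdd = data_rdd.filter(lambda row: has_genre(row, "Comedy"))
```

For the row `'0,1,Pulp Fiction (1994),5.0,Comedy|Crime|Drama|Thriller'` the helper returns `True`, while an invented row such as `'9,9,A Comedy of Errors,3.0,Drama'` would pass the substring test but is rejected here.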

##### Transform data

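Let's transform the data using the `.map()` method together with a `lambda` function. We split each row string on commas. Note that a plain `split(',')` also splits quoted titles that contain commas, such as `"Good Morning, Vietnam (1987)"`.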

In [14]:
data_split_rdd = data_rdd.map(lambda row: row.split(','))
data_split_rdd.take(5)

[['0', '1', 'Pulp Fiction (1994)', '5.0', 'Comedy|Crime|Drama|Thriller'],
 ['3', '1', 'Underground (1995)', '5.0', 'Comedy|Drama|War'],
 ['4', '1', "Singin' in the Rain (1952)", '3.5', 'Comedy|Musical|Romance'],
 ['6', '1', 'Delicatessen (1991)', '3.5', 'Comedy|Drama|Romance'],
 ['12',
  '1',
  'Back to the Future Part II (1989)',
  '2.5',
  'Adventure|Comedy|Sci-Fi']]
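
Splitting on every comma breaks rows whose title is quoted and contains commas, for example `"Seventh Seal, The (Sjunde inseglet, Det) (1957)"`. One possible sketch parses each line with Python's standard `csv` module, which respects the quoting. The function name `parse_row` is hypothetical.

```python
import csv


def parse_row(row):
    """Parse one CSV line into a list of fields, honouring double-quoted values."""
    # csv.reader expects an iterable of lines, so wrap the single row in a list.
    return next(csv.reader([row]))


# Intended use with the RDD from the cells above (requires a running SparkContext):
# data_split_rdd = data_rdd.map(parse_row)
```

With this helper, a quoted title stays in a single field and the surrounding quotes are removed, so every parsed row has five fields.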

Then we keep only the second and third fields of each list (`User_Id` and `Movie_Name`).

In [15]:
data = data_split_rdd.map(lambda fields: fields[1:3])
data.take(5)

[['1', 'Pulp Fiction (1994)'],
 ['1', 'Underground (1995)'],
 ['1', "Singin' in the Rain (1952)"],
 ['1', 'Delicatessen (1991)'],
 ['1', 'Back to the Future Part II (1989)']]
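
The slice above keeps the values as strings. If typed values are needed later, a minimal sketch could map each field list to a small record instead. It assumes that every row was parsed into exactly five fields in the order shown in the header; the names `Rating` and `to_rating` are hypothetical.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    user_id: int
    movie_name: str
    rating: float


def to_rating(fields):
    """Build a typed record from [index, User_Id, Movie_Name, Rating, Genre]."""
    return Rating(int(fields[1]), fields[2], float(fields[3]))


# Intended use with the RDD from the cells above (requires a running SparkContext):
# ratings_rdd = data_split_rdd.map(to_rating)
```

Rows whose fields were shifted by an unquoted split would raise a `ValueError` in `float()`, which makes parsing problems visible early.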

Finally, let's take the first 100 entries of the transformed RDD and load them into a pandas DataFrame.

In [16]:
import pandas as pd

df = pd.DataFrame(data.take(100), columns=['User_Id', 'Movie_Name'])
df

|     | User_Id | Movie_Name |
| --- | --- | --- |
| 0 | 1 | Pulp Fiction (1994) |
| 1 | 1 | Underground (1995) |
| 2 | 1 | Singin' in the Rain (1952) |
| 3 | 1 | Delicatessen (1991) |
| 4 | 1 | Back to the Future Part II (1989) |
| ... | ... | ... |
| 95 | 3 | "Big Lebowski |
| 96 | 3 | "Lock |
| 97 | 3 | "Whole Nine Yards |
| 98 | 3 | Big Momma's House (2000) |


##### Stop the Spark context

In [18]:
sc.stop()
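
If an earlier cell fails, `sc.stop()` may never run and the application keeps holding resources on the cluster. As a sketch of one way to guard against that, the work can be wrapped so that the context is stopped even when the job raises. The helper name `run_and_stop` is hypothetical; it only relies on the context having a `stop()` method.

```python
def run_and_stop(sc, job):
    """Run job(sc) and always stop the context afterwards, even if the job raises."""
    try:
        return job(sc)
    finally:
        sc.stop()


# Intended use (requires a running SparkContext):
# total = run_and_stop(sc, lambda ctx: ctx.textFile("movies_dataset.csv").count())
```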