# PySpark

Running PySpark on Docker using the [jupyter/pyspark-notebook](https://hub.docker.com/r/jupyter/pyspark-notebook/) container. This notebook shows the test scripts working.

In [14]:
import pyspark
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession

In [5]:
conf = SparkConf().setAppName('Assignment').setMaster('local')
sc = SparkContext(conf=conf)

In [2]:
pwd

'/home/jovyan/work'

In [3]:
ls

[0m[34;42mdata_processing_course-master[0m/  [01;32mPySpark.ipynb[0m*


In [4]:
cd data_processing_course-master

/home/jovyan/work/data_processing_course-master


In [5]:
ls

[0m[34;42massignments[0m/  [01;32mbootstrap.sh[0m*  [34;42minfra[0m/    [01;32mlocal_setup.sh[0m*  [34;42mspark[0m/
[34;42mbeam[0m/         [34;42mdata[0m/          [01;32mLICENSE[0m*  [01;32mREADME.md[0m*       [01;32mVagrantfile[0m*


In [6]:
cd assignments

/home/jovyan/work/data_processing_course-master/assignments


In [7]:
ls

[0m[01;32mbootstrap.sh[0m*     [01;32mpytest.ini[0m*           [01;32mtest_ejercicio_3.py[0m*
[01;32mconftest.py[0m*      [01;32mREADME.md[0m*            [01;32mtest_ejercicio_4.py[0m*
[01;32mcontenedores.py[0m*  [01;32mrequirements.txt[0m*     [01;32mtest_ejercicio_5.py[0m*
[34;42mdata[0m/             [01;32mtest_ejercicio_0.py[0m*  [01;32mtest_ejercicio_6.py[0m*
[01;32mhelpers.py[0m*       [01;32mtest_ejercicio_1.py[0m*  [01;32mVagrantfile[0m*
[34;42m__pycache__[0m/      [01;32mtest_ejercicio_2.py[0m*


In [8]:
%run contenedores.py

In [9]:
cat resultados/resultado_0

0,1,2,3,4,5,6,7,8,9


# Creating Dataframes

### Introducing the `SparkSession`

Before Spark 2.0, you would work with `SparkConf`, `SparkContext`, `SQLContext`, and `HiveContext` separately to handle configuration, the Spark context, the SQL context, and the Hive context respectively. The **`SparkSession`** essentially combines these into a single entry point.
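
As a minimal sketch of that consolidation, the helper below collects the attributes through which a `SparkSession` exposes what the older contexts used to provide. The name `session_entry_points` and the dictionary keys are hypothetical and not part of PySpark; the function assumes it is handed an existing session, such as the `spark` object built later in this notebook.

```python
def session_entry_points(spark):
    """Collect the main handles a SparkSession offers in one place.

    `spark` is assumed to be an existing pyspark.sql.SparkSession.
    The function name and dictionary keys are illustrative only.
    """
    return {
        "rdd_api": spark.sparkContext,  # the underlying SparkContext (parallelize, textFile, ...)
        "runtime_conf": spark.conf,     # get/set runtime configuration options
        "sql": spark.sql,               # callable: run a SQL string, get a DataFrame back
        "catalog": spark.catalog,       # list databases, tables and functions
    }
```

In this notebook it could be called as `session_entry_points(spark)["rdd_api"]` to get the same kind of object as the `sc` variable created earlier.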

### Creating a dataframe from JSON (here, an RDD of JSON strings)

In [15]:
# SparkSession previously imported by: from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

In [22]:
# Three JSON documents, one string per record; the third carries an extra key
stringJSONRDD = sc.parallelize(("""
  { "id": "123",
    "name": "Katie",
    "age": 19,
    "eyeColor": "brown"
  }""",
  """{
    "id": "234",
    "name": "Michael",
    "age": 22,
    "eyeColor": "green"
  }""",
  """{
    "id": "345",
    "name": "Simone",
    "age": 23,
    "eyeColor": "blue",
    "randomkey": "Single value"
  }""")
)

In [23]:
jsondata = spark.read.json(stringJSONRDD)

In [24]:
jsondata.show()

+---+--------+---+-------+------------+
|age|eyeColor| id|   name|   randomkey|
+---+--------+---+-------+------------+
| 19|   brown|123|  Katie|        null|
| 22|   green|234|Michael|        null|
| 23|    blue|345| Simone|Single value|
+---+--------+---+-------+------------+
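
The table shows a `randomkey` column that is `null` for the first two records. The plain-Python sketch below illustrates why: the columns are the union of all keys seen across the records, and a key missing from a record is filled with an empty value. This is only an illustration of the visible result, not Spark's actual schema-inference code, and the variable names are made up for the example.

```python
import json

# The three records from the cell above, written compactly
records = [
    '{"id": "123", "name": "Katie", "age": 19, "eyeColor": "brown"}',
    '{"id": "234", "name": "Michael", "age": 22, "eyeColor": "green"}',
    '{"id": "345", "name": "Simone", "age": 23, "eyeColor": "blue", "randomkey": "Single value"}',
]
parsed = [json.loads(record) for record in records]

# Union of all keys, sorted alphabetically to match the column order in the table
columns = sorted({key for record in parsed for key in record})

# A key absent from a record becomes None, which Spark displays as null
rows = [[record.get(column) for column in columns] for record in parsed]

print(columns)
for row in rows:
    print(row)
```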



### Creating a dataframe from a CSV file (we first need to create an RDD, split it into cells, and finally transform it into a dataframe)
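
The steps named in the heading can be sketched roughly as follows. The helper names `split_cells` and `csv_to_dataframe` are hypothetical, and the sketch assumes a semicolon-separated file whose first line is a header and whose fields contain no quoted semicolons. All columns come out as strings.

```python
def split_cells(line, sep=";"):
    """Split one raw text line into a list of cell strings."""
    return line.split(sep)


def csv_to_dataframe(spark, path, sep=";"):
    """Build a DataFrame from a delimited text file via an RDD.

    `spark` is assumed to be an existing SparkSession and `path` a file
    whose first line holds the column names (an assumption).
    """
    lines = spark.sparkContext.textFile(path)   # step 1: RDD of raw lines
    header = lines.first()
    columns = split_cells(header, sep)
    rows = (
        lines
        .filter(lambda line: line != header)    # drop the header line
        .map(lambda line: tuple(split_cells(line, sep)))  # step 2: split into cells
    )
    return spark.createDataFrame(rows, schema=columns)    # step 3: RDD -> DataFrame
```

With the session from this notebook, a call might look like `csv_to_dataframe(spark, 'containers.csv').show()`.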

In [27]:
csvdata = spark.read.csv('containers.csv')


In [26]:
csvdata.show()

+--------------------+
|                 _c0|
+--------------------+
|ship_imo;ship_nam...|
|AMC1861710;Jayden...|
|POG1615575;Lake E...|
|SQH1155999;Aileen...|
|JCI1797526;Hermin...|
|MBV1836745;Port G...|
|GYR1192020;Emardl...|
|GLV1922612;Eulali...|
|NLH1771681;Port N...|
|FUS1202266;East M...|
|GLV1922612;Eulali...|
|IWE1254579;North ...|
|JET1053895;Jamil;...|
|KSP1096387;Wiley;...|
|GYR1192020;Emardl...|
|GYR1192020;Emardl...|
|JMP1637582;East Z...|
|TCU1641123;New Ma...|
|MBV1836745;Port G...|
|POG1615575;Lake E...|
+--------------------+
only showing top 20 rows
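
Every row lands in a single `_c0` column because the file is separated by semicolons, while `spark.read.csv` splits on commas by default. A possible fix, sketched under the assumption that the first line (`ship_imo;ship_nam...`) is a header, is to pass the separator and header options to the reader. The wrapper name `read_semicolon_csv` is hypothetical.

```python
def read_semicolon_csv(spark, path):
    """Read a semicolon-separated file with a header row into a DataFrame.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.read.csv(
        path,
        sep=";",           # the file uses ';' rather than the default ','
        header=True,       # take column names from the first line (assumption)
        inferSchema=True,  # let Spark guess column types instead of all strings
    )
```

For example, `read_semicolon_csv(spark, 'containers.csv').show()` should then display one column per field instead of a single `_c0` column.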

