# Spark ETL with MongoDB Database

[1. Create Connection with MongoDB Database](#1)

[2. Read Data from MongoDB Database](#2)

[3. Transform Data](#3)

[4. Write Data into MongoDB Server](#4)



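## Spark-Mongo Connector

[Maven Repository](https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector) <br/>
Artifact used in this notebook: 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1' (the `_2.12` suffix is the Scala version and must match the Scala build of your Spark installation).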

### Load libraries

In [1]:
# Load required library

from pyspark.sql import SparkSession

In [2]:
# Start SparkSession; spark.jars.packages pulls the MongoDB connector from Maven at startup

spark = SparkSession.builder.appName("mongoDB") \
        .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.12:3.0.1') \
        .getOrCreate()

# SparkSession's constructor expects a SparkContext, so reuse the existing session for SQL queries
sqlContext = spark
spark.sparkContext.setLogLevel("ERROR")

<a id = 1> </a>

## Create Connection with MongoDB Database

In [3]:
mongo_df = spark.read.format("mongo") \
    .option("uri", "mongodb://localhost:27017/") \
    .option("database", "dataengineering") \
    .option("collection", "employee") \
    .load()
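
The `uri`, `database`, and `collection` options used here are repeated when the result is written back later in the notebook. A minimal sketch of keeping them in one place is shown below. The helper names `mongo_options` and `read_collection` are hypothetical and not part of the connector; the values are the ones from the cell above.

```python
# Hypothetical helpers; these names are illustrative, not part of the connector API.
MONGO_URI = "mongodb://localhost:27017/"   # same URI as in the cell above
MONGO_DATABASE = "dataengineering"         # same database as in the cell above


def mongo_options(collection, uri=MONGO_URI, database=MONGO_DATABASE):
    """Return the connector options for one collection as a plain dict."""
    return {"uri": uri, "database": database, "collection": collection}


def read_collection(spark, collection):
    """Load a MongoDB collection as a DataFrame using the shared options."""
    return spark.read.format("mongo").options(**mongo_options(collection)).load()
```

With these in place, the cell above could be written as `mongo_df = read_collection(spark, "employee")`.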

<a id = 2> </a>
## Read Data from MongoDB Database

In [4]:
mongo_df.printSchema()

root
 |-- _id: struct (nullable = true)
 |    |-- oid: string (nullable = true)
 |-- department_id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- id: integer (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: integer (nullable = true)
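
The schema above was inferred by the connector rather than declared. If the collection's documents vary in shape, it may be safer to state the schema explicitly. The sketch below assumes the field names and types printed above and passes them to `DataFrameReader.schema` as a DDL string. The names `EMPLOYEE_FIELDS`, `ddl_schema`, and `read_with_schema` are hypothetical, and whether the connector honors every declared type should be checked against your connector version.

```python
# Field names and types copied from the printSchema() output above.
EMPLOYEE_FIELDS = [
    ("_id", "STRUCT<oid: STRING>"),
    ("department_id", "INT"),
    ("first_name", "STRING"),
    ("id", "INT"),
    ("last_name", "STRING"),
    ("salary", "INT"),
]


def ddl_schema(fields):
    """Build a Spark DDL schema string, backtick-quoting each column name."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in fields)


def read_with_schema(spark):
    """Read the employee collection with a declared schema (hypothetical helper)."""
    return (
        spark.read.format("mongo")
        .option("uri", "mongodb://localhost:27017/")
        .option("database", "dataengineering")
        .option("collection", "employee")
        .schema(ddl_schema(EMPLOYEE_FIELDS))
        .load()
    )
```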



In [5]:
mongo_df.show(n=5)

+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
+--------------------+-------------+----------+---+---------+------+
only showing top 5 rows



<a id = 3> </a>
## Transform Data

In [6]:
mongo_df.createOrReplaceTempView("temp_Mongo")

In [8]:
sqlContext.sql("SELECT * FROM temp_Mongo").show(n=10)



+--------------------+-------------+----------+---+---------+------+
|                 _id|department_id|first_name| id|last_name|salary|
+--------------------+-------------+----------+---+---------+------+
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|110000|
|{66dd8e06da00b95e...|         1006|      Todd|  1|   Wilson|106119|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|128922|
|{66dd8e06da00b95e...|         1005|    Justin|  2|    Simon|130000|
|{66dd8e06da00b95e...|         1002|     Kelly|  3|  Rosario| 42689|
|{66dd8e06da00b95e...|         1004|  Patricia|  4|   Powell|162825|
|{66dd8e06da00b95e...|         1004|  Patricia|  4|   Powell|170000|
|{66dd8e06da00b95e...|         1002|    Sherry|  5|   Golden| 44101|
|{66dd8e06da00b95e...|         1005|   Natasha|  6|  Swanson| 79632|
|{66dd8e06da00b95e...|         1005|   Natasha|  6|  Swanson| 90000|
+--------------------+-------------+----------+---+---------+------+
only showing top 10 rows



In [10]:
# Keep only the name and salary of employees earning more than 50000
new_df = sqlContext.sql("SELECT first_name, salary FROM temp_Mongo WHERE salary > 50000")

In [11]:
new_df.count()

78
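
The same projection and filter can be expressed with DataFrame methods instead of a temporary view and SQL text. The following is a minimal sketch; `high_earners` is a hypothetical helper name, and the 50000 threshold comes from the query above.

```python
def high_earners(df, min_salary=50000):
    """Select first_name and salary for rows whose salary exceeds min_salary.

    Equivalent to the SQL query above; `where` accepts a SQL expression string.
    """
    return df.select("first_name", "salary").where(f"salary > {min_salary}")
```

Calling `high_earners(mongo_df)` should return the same rows as `new_df`, without registering `temp_Mongo` first.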

<a id = 4> </a>

## Write Data into MongoDB Server

In [12]:
new_df.write.format("mongo") \
    .option("uri", "mongodb://localhost:27017/") \
    .option("database", "dataengineering") \
    .option("collection", "employee1") \
    .mode("append").save()
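
Because the save mode is `append`, running this cell a second time adds the same rows to `employee1` again. A minimal sketch of making the mode an explicit choice follows. `write_collection` and `SAVE_MODES` are hypothetical names; the listed modes are Spark's standard save modes, and how the connector handles each one is an assumption worth verifying for your version.

```python
# Spark's standard save mode names; connector behavior per mode is an assumption to verify.
SAVE_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def write_collection(df, collection, mode="append"):
    """Write df to a MongoDB collection with an explicit, validated save mode."""
    if mode not in SAVE_MODES:
        raise ValueError(f"unknown save mode: {mode!r}")
    (
        df.write.format("mongo")
        .option("uri", "mongodb://localhost:27017/")
        .option("database", "dataengineering")
        .option("collection", collection)
        .mode(mode)
        .save()
    )
```

For a rerunnable notebook, `write_collection(new_df, "employee1", mode="overwrite")` may be the better fit, at the cost of replacing whatever the collection held before.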

In [13]:
spark.stop()