# **In this section we will connect to MongoDB**

## We demonstrate how to:

1. Load the wine dataset into MongoDB and read it back with Spark  
2. Create new columns  
3. Perform filtering  
4. Aggregate statistics  
5. Save new results back to MongoDB  

This shows full bidirectional integration between **Apache Spark** and **MongoDB**.

In [1]:
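import os

# Put the MongoDB Spark connector and driver jars on Spark's classpath.
# This must run before the SparkSession is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = (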
    "--jars "
    "/Users/ninazorawska/Desktop/spark_jars/mongo-spark-connector_2.13-10.3.0.jar,"
    "/Users/ninazorawska/Desktop/spark_jars/mongodb-driver-sync-4.11.0.jar,"
    "/Users/ninazorawska/Desktop/spark_jars/mongodb-driver-core-4.11.0.jar,"
    "/Users/ninazorawska/Desktop/spark_jars/bson-4.11.0.jar "
    "pyspark-shell"
)
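
The absolute paths above are tied to one machine. A minimal sketch of building the same `--jars` argument from a directory follows. The helper name `build_submit_args` and the `SPARK_JARS_DIR` environment variable are assumptions made for illustration; neither is part of PySpark.

```python
import os
from pathlib import Path

# Jar file names taken from the cell above.
JAR_NAMES = [
    "mongo-spark-connector_2.13-10.3.0.jar",
    "mongodb-driver-sync-4.11.0.jar",
    "mongodb-driver-core-4.11.0.jar",
    "bson-4.11.0.jar",
]


def build_submit_args(jar_dir, jar_names=JAR_NAMES):
    """Return a PYSPARK_SUBMIT_ARGS string for the jars in jar_dir.

    Hypothetical helper: it only joins paths and does not check
    that the files exist.
    """
    jars = ",".join(str(Path(jar_dir) / name) for name in jar_names)
    return f"--jars {jars} pyspark-shell"


# SPARK_JARS_DIR is an assumed variable name; fall back to ./spark_jars.
jar_dir = os.environ.get("SPARK_JARS_DIR", "spark_jars")
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jar_dir)
```

As with the original cell, this would need to run before the `SparkSession` is created.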


In [2]:
%pip install pyspark findspark



[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [3]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("WineMongoIntegration") \
    .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017") \
    .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017") \
    .getOrCreate()


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/11/20 18:47:10 WARN Utils: Your hostname, Ninas-macbook.local, resolves to a loopback address: 127.0.0.1; using 10.124.248.129 instead (on interface en0)
25/11/20 18:47:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
25/11/20 18:47:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
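
The connection URI in the session config is hard-coded to a local, unauthenticated server. A minimal sketch of assembling the URI from environment variables instead, assuming the variable names `MONGO_HOST`, `MONGO_PORT`, `MONGO_USER`, and `MONGO_PASSWORD` (chosen here for illustration) and a hypothetical helper `mongo_uri`:

```python
import os
from urllib.parse import quote_plus


def mongo_uri(host="localhost", port=27017, user=None, password=None):
    """Build a mongodb:// URI, percent-encoding credentials when given."""
    if user and password:
        auth = f"{quote_plus(user)}:{quote_plus(password)}@"
    else:
        auth = ""
    return f"mongodb://{auth}{host}:{port}"


# The environment variable names below are assumptions, not a standard.
uri = mongo_uri(
    host=os.environ.get("MONGO_HOST", "localhost"),
    port=int(os.environ.get("MONGO_PORT", "27017")),
    user=os.environ.get("MONGO_USER"),
    password=os.environ.get("MONGO_PASSWORD"),
)
```

With no variables set, `uri` is the same `mongodb://localhost:27017` string used above, and it could be passed to both `spark.mongodb.read.connection.uri` and `spark.mongodb.write.connection.uri`.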


In [4]:
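# List the jars registered with the SparkContext to confirm the connector loaded.
# Note: _jsc is a private PySpark attribute and may change between versions.
jars = spark._jsc.sc().listJars()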
for i in range(jars.size()):
    print(jars.apply(i))

spark://10.124.248.129:55403/jars/mongo-spark-connector_2.13-10.3.0.jar
spark://10.124.248.129:55403/jars/mongodb-driver-sync-4.11.0.jar
spark://10.124.248.129:55403/jars/bson-4.11.0.jar
spark://10.124.248.129:55403/jars/mongodb-driver-core-4.11.0.jar
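
Reading the list by eye works for four jars, but the check can also be automated. The sketch below compares the listed URLs against the expected file names; `missing_jars` is a hypothetical helper, and the URLs are copied from the output above rather than fetched from a live session.

```python
def missing_jars(listed_urls, expected_names):
    """Return the expected jar names that do not appear in the listed jar URLs."""
    listed_names = {url.rsplit("/", 1)[-1] for url in listed_urls}
    return [name for name in expected_names if name not in listed_names]


# URLs copied from the output above. In the notebook they would come from
# [jars.apply(i) for i in range(jars.size())].
listed = [
    "spark://10.124.248.129:55403/jars/mongo-spark-connector_2.13-10.3.0.jar",
    "spark://10.124.248.129:55403/jars/mongodb-driver-sync-4.11.0.jar",
    "spark://10.124.248.129:55403/jars/bson-4.11.0.jar",
    "spark://10.124.248.129:55403/jars/mongodb-driver-core-4.11.0.jar",
]
expected = [
    "mongo-spark-connector_2.13-10.3.0.jar",
    "mongodb-driver-sync-4.11.0.jar",
    "mongodb-driver-core-4.11.0.jar",
    "bson-4.11.0.jar",
]

print(missing_jars(listed, expected))  # prints []
```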


In [5]:
red_df = spark.read.csv("winequality-red.csv", 
                        sep=';', header=True, inferSchema=True)

white_df = spark.read.csv("winequality-white.csv", 
                          sep=';', header=True, inferSchema=True)


In [6]:
from pyspark.sql.functions import lit

red_df = red_df.withColumn("wine_type", lit("red"))
white_df = white_df.withColumn("wine_type", lit("white"))

In [7]:
df = red_df.unionByName(white_df)

In [8]:
df.show(5)
df.printSchema()

+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|fixed acidity|volatile acidity|citric acid|residual sugar|chlorides|free sulfur dioxide|total sulfur dioxide|density|  pH|sulphates|alcohol|quality|wine_type|
+-------------+----------------+-----------+--------------+---------+-------------------+--------------------+-------+----+---------+-------+-------+---------+
|          7.4|             0.7|        0.0|           1.9|    0.076|               11.0|                34.0| 0.9978|3.51|     0.56|    9.4|      5|      red|
|          7.8|            0.88|        0.0|           2.6|    0.098|               25.0|                67.0| 0.9968| 3.2|     0.68|    9.8|      5|      red|
|          7.8|            0.76|       0.04|           2.3|    0.092|               15.0|                54.0|  0.997|3.26|     0.65|    9.8|      5|      red|
|         11.2|            0.28|       0

In [9]:
# "overwrite" replaces any existing contents of wine_db.wines.
df.write \
    .format("mongodb") \
    .mode("overwrite") \
    .option("connection.uri", "mongodb://localhost:27017") \
    .option("database", "wine_db") \
    .option("collection", "wines") \
    .save()

## Now that the data is stored in MongoDB, we can read it back using Spark

In [10]:
df_mongo = spark.read \
    .format("mongodb") \
    .option("database", "wine_db") \
    .option("collection", "wines") \
    .load()

df_mongo.show(5)
df_mongo.printSchema()


+--------------------+-------+---------+-----------+-------+-------------+-------------------+----+-------+--------------+---------+--------------------+----------------+---------+
|                 _id|alcohol|chlorides|citric acid|density|fixed acidity|free sulfur dioxide|  pH|quality|residual sugar|sulphates|total sulfur dioxide|volatile acidity|wine_type|
+--------------------+-------+---------+-----------+-------+-------------+-------------------+----+-------+--------------+---------+--------------------+----------------+---------+
|691f6252c8ece6092...|    9.4|    0.076|        0.0| 0.9978|          7.4|               11.0|3.51|      5|           1.9|     0.56|                34.0|             0.7|      red|
|691f6252c8ece6092...|    9.8|    0.098|        0.0| 0.9968|          7.8|               25.0| 3.2|      5|           2.6|     0.68|                67.0|            0.88|      red|
|691f6252c8ece6092...|    9.8|    0.092|       0.04|  0.997|          7.8|               15.0|3

### We can filter, group (`groupBy`), or sort (`orderBy`):

In [11]:
df_mongo.filter(df_mongo.quality > 7).show()

+--------------------+-------+---------+-----------+-------+-------------+-------------------+----+-------+--------------+---------+--------------------+----------------+---------+
|                 _id|alcohol|chlorides|citric acid|density|fixed acidity|free sulfur dioxide|  pH|quality|residual sugar|sulphates|total sulfur dioxide|volatile acidity|wine_type|
+--------------------+-------+---------+-----------+-------+-------------+-------------------+----+-------+--------------+---------+--------------------+----------------+---------+
|691f6252c8ece6092...|   12.8|    0.078|       0.46| 0.9973|          7.9|               15.0|3.35|      8|           3.6|     0.86|                37.0|            0.35|      red|
|691f6252c8ece6092...|   12.6|    0.073|       0.45| 0.9976|         10.3|                5.0|3.23|      8|           6.4|     0.82|                13.0|            0.32|      red|
|691f6252c8ece6092...|   12.9|    0.045|       0.05| 0.9924|          5.6|               12.0|3

In [12]:
df_mongo.groupBy("quality").avg("alcohol").show()

+-------+------------------+
|quality|      avg(alcohol)|
+-------+------------------+
|      6|10.587552891396365|
|      3|10.214999999999998|
|      5| 9.837782974742739|
|      9|             12.18|
|      4|10.180092592592596|
|      8| 11.67875647668396|
|      7|11.386005560704362|
+-------+------------------+
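
The rows above come back in no particular order, because `groupBy` does not sort its result. A small sketch that adds an explicit `orderBy` so the table reads from lowest to highest quality; the function name `avg_alcohol_by_quality` is an illustration, not part of the notebook.

```python
def avg_alcohol_by_quality(df):
    """Average alcohol per quality score, sorted by quality.

    `df` is expected to be a Spark DataFrame with `quality` and
    `alcohol` columns, such as `df_mongo` in this notebook.
    """
    return df.groupBy("quality").avg("alcohol").orderBy("quality")
```

In the notebook this would be called as `avg_alcohol_by_quality(df_mongo).show()`.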



In [13]:
df_mongo.orderBy(df_mongo.alcohol.desc()).show(10)

+--------------------+-------+---------+-----------+-------+-------------+-------------------+----+-------+--------------+---------+--------------------+----------------+---------+
|                 _id|alcohol|chlorides|citric acid|density|fixed acidity|free sulfur dioxide|  pH|quality|residual sugar|sulphates|total sulfur dioxide|volatile acidity|wine_type|
+--------------------+-------+---------+-----------+-------+-------------+-------------------+----+-------+--------------+---------+--------------------+----------------+---------+
|691f6252c8ece6092...|   14.9|    0.096|       0.65| 0.9976|         15.9|               22.0|2.98|      5|           7.5|     0.84|                71.0|            0.36|      red|
|691f6253c8ece6092...|   14.2|    0.037|       0.28|0.98779|          6.4|               31.0|3.12|      7|           1.6|      0.4|               113.0|            0.35|    white|
|691f6253c8ece6092...|  14.05|    0.041|       0.01| 0.9909|          5.8|               31.0|3

We can also save new, transformed results back to MongoDB.
In the cell below, we first filter the dataset for wines with a quality of 8 or higher, which gives a new DataFrame containing only those wines.
We then write that DataFrame to a separate `high_quality` collection.

In [14]:
df_high = df_mongo.filter("quality >= 8")

df_high.write \
    .format("mongodb") \
    .mode("overwrite") \
    .option("database", "wine_db") \
    .option("collection", "high_quality") \
    .save()


### We can also add new columns!

In [17]:
from pyspark.sql.functions import when, col

df_cleaned = df_mongo.withColumn(
    "quality_label",
    when(col("quality") >= 7, "good")
    .when(col("quality") <= 4, "bad")
    .otherwise("average")
)
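
The `when` chain is evaluated in order, and the first matching branch wins. A plain-Python mirror of the same rule makes the boundaries easy to check without a Spark session. The function below is only an illustration of the logic and is not used by the notebook.

```python
def quality_label(quality):
    """Plain-Python mirror of the when/otherwise rule in the cell above."""
    if quality >= 7:
        return "good"
    if quality <= 4:
        return "bad"
    return "average"


# Quality scores 3 to 9 are the values that appear in the groupBy output earlier.
for q in range(3, 10):
    print(q, quality_label(q))
```

Scores 3 and 4 map to "bad", 5 and 6 to "average", and 7 to 9 to "good".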

In [18]:
df_cleaned.write \
    .format("mongodb") \
    .mode("overwrite") \
    .option("database", "wine_db") \
    .option("collection", "wines_cleaned") \
    .save()

### We will now compute the average volatile acidity, residual sugar, alcohol, and quality per wine type

In [20]:
from pyspark.sql.functions import avg

df_stats = df_mongo.groupBy("wine_type").agg(
    avg("alcohol").alias("avg_alcohol"),
    avg("residual sugar").alias("avg_sugar"),
    avg("quality").alias("avg_quality"),
    avg("volatile acidity").alias("avg_volatile_acidity")
)

df_stats.write \
    .format("mongodb") \
    .mode("overwrite") \
    .option("database", "wine_db") \
    .option("collection", "wine_stats") \
    .save()
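
The same write pattern appears in several cells of this section. One way to reduce the repetition is a small wrapper, sketched below. The names `mongo_options` and `write_to_mongo` are hypothetical, and the defaults (`wine_db`, `overwrite`) simply mirror the values used in this notebook.

```python
def mongo_options(database, collection, uri=None):
    """Collect the connector options used by the write cells in this notebook."""
    opts = {"database": database, "collection": collection}
    if uri is not None:
        # Only needed when the URI is not already set in the session config.
        opts["connection.uri"] = uri
    return opts


def write_to_mongo(df, collection, database="wine_db", mode="overwrite", uri=None):
    """Write a Spark DataFrame to a MongoDB collection.

    `df` is expected to be a Spark DataFrame; with mode="overwrite"
    the target collection's existing contents are replaced.
    """
    (
        df.write
        .format("mongodb")
        .mode(mode)
        .options(**mongo_options(database, collection, uri))
        .save()
    )
```

With this helper, the last write above would become `write_to_mongo(df_stats, "wine_stats")`.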

# Conclusion

## Why Spark + MongoDB Are Excellent for Big Data Analysis

Apache Spark and MongoDB together form a powerful ecosystem for handling modern big data challenges. Each technology addresses a different part of the problem, and together they support the four V's of big data:

## 1. **Volume** — Handling Large Amounts of Data
Spark processes gigabytes or terabytes of data in parallel by distributing work across many machines.
MongoDB stores large datasets as flexible documents and scales horizontally through sharding, without requiring a fixed table schema.
Together:

You can store large datasets in MongoDB and process them in parallel with Spark.

## 2. **Variety** — Handling Unstructured, Semi-structured, and Structured Data
MongoDB stores JSON-like documents, so it can hold data of varying shape (text, logs, nested structures).
Spark can read from CSV files, MongoDB, Parquet, and other sources, and combine them through one DataFrame API.
Together:

They let you mix structured CSV data (like the wine dataset) with more complex real-world datasets (customer logs, reviews, IoT data).

## 3. **Velocity** — Fast Processing and Real-time Updates
Spark keeps intermediate data in memory where possible, which avoids much of the disk I/O of disk-based batch systems such as Hadoop MapReduce.
MongoDB handles high rates of inserts and updates, which suits streaming or live data.
Together:

They allow near-real-time analytics:

→ MongoDB collects the data

→ Spark analyzes it and writes back insights quickly

## 4. **Veracity** — Building Clean, Reliable Data
With Spark, you can clean, standardize, join, filter, and transform data at scale.
MongoDB can keep several versions of the dataset side by side as separate collections, as this section does with `wines` (raw), `wines_cleaned`, and `wine_stats` (aggregated).
Together:

You maintain a reliable data pipeline with reproducible transformations and versioned results.