# I. (30 marks) Below is the theoretical section, which requires you to write your answers in a "Text" cell in the notebook:
1. Could you list out the main challenges of concurrency?
2. Could you briefly describe MapReduce? Please provide an example of MapReduce.
3. Provide a high level comparison of Apache Hadoop and Apache Spark.
4. What are the advantages of Apache Spark?
5. Provide a comparison of RDD and DataFrame in Spark.  

### 1 -  Could you list out the main challenges of concurrency?

Concurrency refers to the ability of a system to handle multiple tasks simultaneously. Some of the main challenges of concurrency include:
  * Race conditions: When multiple threads or processes access shared resources, the order of execution can lead to unexpected results.
  * Deadlocks: Two or more processes are unable to proceed because each is waiting for the other to release a resource.
  * Synchronization: Coordinating access to shared resources to avoid conflicts and maintain data consistency.
  * Performance overhead: Managing and switching between threads or processes can lead to overhead, reducing overall system performance.
  * Resource management: Allocating and managing resources efficiently among concurrent processes.
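
The race-condition and synchronization points above can be illustrated with a minimal sketch using Python's standard `threading` module. The function names `increment_many` and `run_locked` are hypothetical helpers introduced only for this illustration; the lock makes each read-modify-write on the shared counter atomic, at the cost of some overhead.

```python
import threading

def increment_many(counter, lock, n):
    """Increment a shared counter n times, holding the lock for each update."""
    for _ in range(n):
        with lock:  # only one thread at a time may update the counter
            counter["value"] += 1

def run_locked(num_threads=4, n=10_000):
    """Run several threads against one shared counter and return the final value."""
    counter = {"value": 0}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=increment_many, args=(counter, lock, n))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]

total = run_locked()
print(total)
```

Without the lock, the increments from different threads could interleave and some updates might be lost, which is the race condition described above.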



### 2 - Could you briefly describe MapReduce? Please provide an example of MapReduce.

MapReduce is a programming model and processing framework designed for parallel processing of large-scale data. It was popularized by Google and later adopted by Apache Hadoop as a fundamental component of its data processing ecosystem.
  In MapReduce, data processing tasks are divided into two phases:

  * Map phase: The input data is split into smaller chunks, and a map function is applied to each chunk independently, producing intermediate key-value pairs.
  * Reduce phase: The intermediate results are shuffled and sorted based on the keys and then processed by a reduce function, which aggregates the data and produces the final output.

  Example of MapReduce:
  Let's say we have a large collection of documents and want to count the occurrences of each word. In MapReduce, we would:

  * Map phase: For each document, the map function emits (word, 1) key-value pairs, where the word is the key, and the value is set to 1.
  * Reduce phase: The reduce function receives all the (word, 1) pairs and sums up the values to get the total count for each word.
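
The word-count example can be sketched in plain Python to show the three steps (map, shuffle, reduce) on a single machine. This is only an illustration of the programming model, not of Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` and the sample documents are assumptions made for this sketch.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group all intermediate values by key, as the framework would do between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(documents):
    """Run map, shuffle, and reduce over a list of documents."""
    intermediate = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

In a real cluster the map calls run in parallel on different chunks of the input, and the shuffle moves data between machines so that each reducer sees all values for its keys.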


### 3 - Provide a high level comparison of Apache Hadoop and Apache Spark.


| Feature                  | Apache Hadoop                                   | Apache Spark                                       |
|--------------------------|-------------------------------------------------|----------------------------------------------------|
| Processing Model        | Primarily MapReduce-based for batch processing  | Offers batch processing, interactive queries, streaming, and machine learning  |
| Speed                    | Slower due to disk-based processing            | Faster due to in-memory processing                |
| Ease of Use              | Lower-level MapReduce API                       | Higher-level APIs (Java, Scala, Python, SQL)       |
| Memory Management       | Relies on disk storage                          | In-memory caching capabilities                     |
| Fault Tolerance         | Provides fault tolerance through replication and HDFS | Provides fault tolerance through lineage information  |
| Libraries                 | External libraries for machine learning, graph processing, etc. | Built-in libraries for machine learning (MLlib), graph processing (GraphX), and streaming (Spark Streaming)  |
| Versatility               | Primarily used for batch processing             | Supports batch processing, interactive queries, streaming, and machine learning |

### 4 -  What are the advantages of Apache Spark?
Apache Spark offers several advantages over traditional data processing frameworks like Hadoop's MapReduce:
* Speed: Spark's in-memory processing allows it to perform tasks much faster than MapReduce, particularly for iterative algorithms and interactive queries.
* Ease of use: Spark provides higher-level APIs in multiple languages, making it more user-friendly and reducing the amount of boilerplate code required.
* Versatility: Spark supports batch processing, interactive queries, real-time streaming, and machine learning, making it a versatile choice for various data processing tasks.
* Fault tolerance: Spark automatically recovers from failures by recomputing lost partitions from lineage information, giving it fault tolerance comparable to Hadoop's MapReduce.
* Rich ecosystem: Spark comes with libraries for machine learning, graph processing, and stream processing, expanding its capabilities beyond basic data processing.




### 5 -  Provide a comparison of RDD and DataFrame in Spark.



| Feature                    | RDD (Resilient Distributed Dataset)      | DataFrame                                             |
|----------------------------|---------------------------------------|-------------------------------------------------------|
| Data Abstraction           | Low-level distributed collection of objects   | Higher-level distributed collection of structured data |
| API                         | Functional API with transformations and actions  | SQL-like API with DataFrame operations                 |
| Optimization               | No built-in optimization                 | Catalyst query optimizer and Tungsten execution engine |
| Schema                     | No predefined schema                      | Structured data with a defined schema                  |
| Language Support        | Java, Scala, and Python         | Java, Scala, Python, R, and SQL                               |
| Fault Tolerance            | Uses lineage information for fault tolerance   | Uses lineage information for fault tolerance              |
| Performance                | May require manual optimization for performance | Optimized for better performance out-of-the-box          |
| Usage                       | Suitable for unstructured and semi-structured data | Suitable for structured and semi-structured data      |

- Both RDDs and DataFrames are essential components of Apache Spark, and the choice between them depends on the nature of the data and the operations you need to perform. RDDs are more versatile and can handle unstructured data, but require more manual optimizations. On the other hand, DataFrames offer better performance, support structured data with a predefined schema, and provide a more user-friendly API with SQL-like queries.

# II. (30 marks) You are given a file `appl_stock.csv`, please carry out the following tasks:

1. Read this file by PySpark. Print out the schema.
2. Create columns of `day of month`, `hour`, `day of year`, `month` from the column `Date` of the data.
3. Use `groupby` and the `year()` function to compute the average closing price per year.

In [1]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q http://archive.apache.org/dist/spark/spark-3.1.1/spark-3.1.1-bin-hadoop3.2.tgz
!tar xf spark-3.1.1-bin-hadoop3.2.tgz
!pip install -q findspark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.1-bin-hadoop3.2"
import findspark
findspark.init()

In [2]:
!wget 'https://media.githubusercontent.com/media/nguyenvudev20/mse22.BigData/main/Final_exam/appl_stock.csv'

--2023-07-22 07:24:40--  https://media.githubusercontent.com/media/nguyenvudev20/mse22.BigData/main/Final_exam/appl_stock.csv
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 143130 (140K) [text/plain]
Saving to: ‘appl_stock.csv’


2023-07-22 07:24:40 (6.04 MB/s) - ‘appl_stock.csv’ saved [143130/143130]



In [3]:
from pyspark.sql import SparkSession

# Build a local SparkSession; the SparkContext is available from it.
spark = SparkSession.builder \
          .master("local") \
          .appName("Python Spark SQL Final Exam") \
          .getOrCreate()
sc = spark.sparkContext


### 1 - Read this file by PySpark. Print out the schema.


In [4]:
## II 1.Read this file by PySpark. Print out the schema.
ad = spark.read.csv('appl_stock.csv', header=True, inferSchema=True)
ad.show(5)

+----------+----------+----------+------------------+------------------+---------+------------------+
|      Date|      Open|      High|               Low|             Close|   Volume|         Adj Close|
+----------+----------+----------+------------------+------------------+---------+------------------+
|2010-01-04|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|
|2010-01-05|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|
|2010-01-06|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|
|2010-01-07|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265|
|2010-01-08|210.299994|212.000006|209.06000500000002|211.98000499999998|111902700|         27.464034|
+----------+----------+----------+------------------+------------------+---------+------------------+
only showing top 5 rows



In [5]:
ad.printSchema()

root
 |-- Date: string (nullable = true)
 |-- Open: double (nullable = true)
 |-- High: double (nullable = true)
 |-- Low: double (nullable = true)
 |-- Close: double (nullable = true)
 |-- Volume: integer (nullable = true)
 |-- Adj Close: double (nullable = true)





### 2 - Create columns of `day of month`, `hour`, `day of year`, `month` from the column `Date` of the data.


In [6]:
## 2. Create columns of day of month, hour, day of year, month from the column Date of the data.

from pyspark.sql.functions import col, dayofmonth, month, year, hour, dayofyear
ad = ad.withColumn("day of month", dayofmonth(col("Date")))
ad = ad.withColumn("month", month(col("Date")))
# Date holds no time component, so hour is 0 for every row.
ad = ad.withColumn("hour", hour(col("Date")))
ad = ad.withColumn("day of year", dayofyear(col("Date")))
ad.show(5)

+----------+----------+----------+------------------+------------------+---------+------------------+------------+-----+----+-----------+
|      Date|      Open|      High|               Low|             Close|   Volume|         Adj Close|day of month|month|hour|day of year|
+----------+----------+----------+------------------+------------------+---------+------------------+------------+-----+----+-----------+
|2010-01-04|213.429998|214.499996|212.38000099999996|        214.009998|123432400|         27.727039|           4|    1|   0|          4|
|2010-01-05|214.599998|215.589994|        213.249994|        214.379993|150476200|27.774976000000002|           5|    1|   0|          5|
|2010-01-06|214.379993|    215.23|        210.750004|        210.969995|138040000|27.333178000000004|           6|    1|   0|          6|
|2010-01-07|    211.75|212.000006|        209.050005|            210.58|119282800|          27.28265|           7|    1|   0|          7|
|2010-01-08|210.299994|212.
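
As a cross-check on what the date-part functions return, the same fields can be derived with Python's standard `datetime` module. This is a minimal sketch, assuming the `Date` values are plain `YYYY-MM-DD` strings as in the output above; `date_parts` is a hypothetical helper, not part of PySpark.

```python
from datetime import datetime

def date_parts(date_string):
    """Extract the same date fields as the Spark columns from a YYYY-MM-DD string."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return {
        "day of month": parsed.day,
        "month": parsed.month,
        "hour": parsed.hour,  # no time component in the input, so this is 0
        "day of year": parsed.timetuple().tm_yday,
    }

parts = date_parts("2010-01-08")
print(parts)
```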

### 3 - Use `groupby` and the `year()` function to compute the average closing price per year.

In [22]:
## 3. Use groupBy and year() to compute the average closing price per year.
from pyspark.sql.functions import avg, col, year

ad = ad.withColumn("year", year(col("Date")))
# Alias the aggregated column itself; DataFrame.alias() would rename the DataFrame, not the column.
average_closing_per_year = ad.groupBy("year").agg(avg("Close").alias("average_closing")).orderBy("year")
average_closing_per_year.show()

+----+------------------+
|year|   average_closing|
+----+------------------+
|2010| 259.8424600000002|
|2011|364.00432532142867|
|2012| 576.0497195640002|
|2013| 472.6348802857143|
|2014| 295.4023416507935|
|2015|120.03999980555547|
|2016|104.60400786904763|
+----+------------------+
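
To make the aggregation concrete, here is a minimal plain-Python sketch of what grouping by year and averaging `Close` computes. It uses only the five rows shown earlier in the notebook output, so it covers a small part of 2010 and will not reproduce the full-year figure in the table; `average_close_per_year` is a hypothetical helper.

```python
from collections import defaultdict

# (Date, Close) pairs copied from the first five rows shown by ad.show(5).
sample_rows = [
    ("2010-01-04", 214.009998),
    ("2010-01-05", 214.379993),
    ("2010-01-06", 210.969995),
    ("2010-01-07", 210.58),
    ("2010-01-08", 211.98000499999998),
]

def average_close_per_year(rows):
    """Group closing prices by the year prefix of the date and average each group."""
    closes_by_year = defaultdict(list)
    for date_string, close in rows:
        closes_by_year[int(date_string[:4])].append(close)
    return {year: sum(values) / len(values) for year, values in sorted(closes_by_year.items())}

sample_average = average_close_per_year(sample_rows)
print(sample_average)
```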



# III. (40 marks) You are given a dataset `customer_churn.csv`, which describes the churn status of clients of a marketing agency. As a data scientist, you are required to create a machine learning model **in Spark** that will help predict which customers will churn (stop buying their service). A short description of the data is as follows:
```
Name : Name of the latest contact at Company
Age: Customer Age
Total_Purchase: Total Ads Purchased
Account_Manager: Binary 0=No manager, 1= Account manager assigned
Years: Total Years as a customer
Num_sites: Number of websites that use the service.
Onboard_date: Date that the name of the latest contact was onboarded
Location: Client HQ Address
Company: Name of Client Company
```

1. Read the data, print the schema, and inspect the data to get a first look at it.
2. Format the data according to `VectorAssembler`, which is supported in MLlib of PySpark.
3. Split the data into train/test data, and then fit train data to the logistic regression model.
4. Evaluate the results and compute the AUC.

### 1 - Read the data, print the schema, and inspect the data to get a first look at it.

In [9]:
## 1. Read the data, print the schema, and inspect the data to get a first look at it.
!wget 'https://media.githubusercontent.com/media/nguyenvudev20/mse22.BigData/main/Final_exam/customer_churn.csv'
ad2 = spark.read.csv('customer_churn.csv', header=True, inferSchema=True)
ad2.show(5)

--2023-07-22 07:41:20--  https://media.githubusercontent.com/media/nguyenvudev20/mse22.BigData/main/Final_exam/customer_churn.csv
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.109.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 115479 (113K) [text/plain]
Saving to: ‘customer_churn.csv’


2023-07-22 07:41:20 (5.52 MB/s) - ‘customer_churn.csv’ saved [115479/115479]

+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|           Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|
+----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+
|Cameron Williams|42.0|   

In [10]:
ad2.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: string (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



### 2 - Format the data according to `VectorAssembler`, which is supported in MLlib of PySpark.


In [12]:
## 2. Format the data according to VectorAssembler, which is supported in MLlib of PySpark.

from pyspark.ml.feature import VectorAssembler
feature_columns = ["Age", "Total_Purchase", "Account_Manager", "Years", "Num_Sites"]
assembler = VectorAssembler(inputCols=feature_columns, outputCol="features")
ad2 = assembler.transform(ad2)


In [13]:
ad2.show()

+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+
|              Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|            features|
+-------------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+
|   Cameron Williams|42.0|       11066.8|              0| 7.22|      8.0|2013-08-30 07:00:40|10265 Elizabeth M...|          Harvey LLC|    1|[42.0,11066.8,0.0...|
|      Kevin Mueller|41.0|      11916.22|              0|  6.5|     11.0|2013-08-13 00:38:46|6157 Frank Garden...|          Wilson PLC|    1|[41.0,11916.22,0....|
|        Eric Lozano|38.0|      12884.75|              0| 6.67|     12.0|2016-06-29 06:20:07|1331 Keith Court ...|Miller, Johnson a...|    1|[38.0,12884.75,0....|
|      Phillip White|4


### 3 - Split the data into train/test data, and then fit train data to the logistic regression model.


In [15]:
## 3. Split the data into train/test data, and then fit train data to the logistic regression model.
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

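# 80/20 train/test split with a fixed seed for reproducibility.
train_data, test_data = ad2.randomSplit([0.8, 0.2], seed=42)
# Keep the estimator (lr) and the fitted model (lr_model) under separate names.
lr = LogisticRegression(labelCol="Churn", featuresCol="features")
lr_model = lr.fit(train_data)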

In [20]:
predictions = lr_model.transform(test_data)
predictions.show()

+-----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|            Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|       Onboard_date|            Location|             Company|Churn|            features|       rawPrediction|         probability|prediction|
+-----------------+----+--------------+---------------+-----+---------+-------------------+--------------------+--------------------+-----+--------------------+--------------------+--------------------+----------+
|       Aaron West|55.0|      10056.55|              0| 4.98|      8.0|2006-09-01 06:11:47|071 Schmidt Locks...|Cruz, Russell and...|    0|[55.0,10056.55,0....|[3.20676315272588...|[0.96108799487693...|       0.0|
|      Adam Harris|44.0|       9815.03|              1|  4.9|      9.0|2016-05-29 06:00:09|40488 Michael For...|Smith, Oconnor an...|    0|[44.0
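
The relationship between the `rawPrediction` and `probability` columns can be checked by hand. The sketch below assumes, as is usual for binary logistic regression, that the probability is the logistic (sigmoid) function of the raw score, and that the first entry of each vector refers to class 0. The input value is the first `rawPrediction` entry of the first row shown above, which is truncated in the output, so the result only matches to the displayed precision.

```python
import math

def sigmoid(margin):
    """Logistic function mapping a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-margin))

# First rawPrediction entry of the first row above (truncated in the printed output).
raw_class0 = 3.20676315272588
prob_class0 = sigmoid(raw_class0)

# With the default 0.5 threshold, class 0 is predicted when its probability is at least 0.5.
predicted_label = 0.0 if prob_class0 >= 0.5 else 1.0
print(prob_class0, predicted_label)
```

The computed probability agrees with the `probability` column of that row, and the label agrees with its `prediction` of `0.0`.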

### 4 - Evaluate the results and compute the AUC.

In [21]:
## 4. Evaluate the results and compute the AUC.
evaluator = BinaryClassificationEvaluator(labelCol="Churn", rawPredictionCol="rawPrediction", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print(f"AUC: {auc}")

AUC: 0.8798426150121073
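
To give some intuition for the metric, the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it that way in plain Python on a small made-up set of labels and scores; `roc_auc` is a hypothetical helper, and Spark's evaluator derives the value from the ROC curve itself, so its result on large data may differ slightly.

```python
def roc_auc(labels, scores):
    """Compute ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))

# Illustrative values only; these are not taken from customer_churn.csv.
example_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(f"AUC: {example_auc}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the value of about 0.88 reported above indicates that the model separates churners from non-churners reasonably well.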
