# IST 718: Big Data Analytics

- Professor: Willard Williamson <wewillia@syr.edu>
- Faculty Assistant: Palaniappan Muthukkaruppan

## General instructions:

- You are welcome to discuss the problems with your classmates, but __you are not allowed to copy any part of your answers from your classmates. You may use short code snippets from the internet, and any code from the class textbooks or class-provided code.__
- Please do not change the file names. The FAs and the professor use these names to grade your homework.
- Remove or comment out code that contains `raise NotImplementedError`. These placeholders exist mainly to make the `assert` statements fail if nothing is submitted.
- The tests shown in some cells (i.e., `assert` and `np.testing.` statements) are used to grade your answers. **However, the professor and FAs will use __additional__ tests for your answers. Think about cases your code should handle correctly even if it passes all the tests you see.**
- Before submitting your work through Blackboard, remember to save and press `Validate` (or go to `Kernel`$\rightarrow$`Restart and Run All`).

# Part 1: MapReduce with Spark

In [1]:
# Run this code to create the Spark session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('map-reduce').getOrCreate()
sc = spark.sparkContext

In [2]:
# this RDD will be used throughout this part of the homework
gpa_rdd = sc.parallelize([
 ['2015', 'Fall', 'IST101', 1, 'A', 4],
 ['2015', 'Fall', 'IST195', 3, 'A', 4],
 ['2015', 'Fall', 'IST233', 3, 'B+', 3.3],
 ['2015', 'Fall', 'SOC101', 3, 'A-', 3.7],
 ['2015', 'Fall', 'MAT221', 3, 'C', 2],
 ['2016', 'Fall', 'IST346', 3, 'A', 4],
 ['2016', 'Fall', 'CHE111', 4, 'A-', 3.7],
 ['2016', 'Fall', 'PSY120', 3, 'B+', 3.3],
 ['2016', 'Fall', 'IST256', 3, 'A', 4],
 ['2016', 'Fall', 'ENG121', 3, 'B+', 3.3],
 ['2016', 'Spring', 'GEO110', 3, 'B+', 3.3],
 ['2016', 'Spring', 'MAT222', 3, 'A', 4],
 ['2016', 'Spring', 'SOC121', 3, 'C+', 2.3],
 ['2016', 'Spring', 'BIO240', 3, 'B-', 2.7]])

## Question 1.1 Cumulative GPA with MapReduce (25 pts)

Cumulative GPA is calculated as:

$$
\frac{\sum_{i=1}^{N} \text{score}_i \cdot \text{credits}_i}{\sum_{i=1}^{N} \text{credits}_i}
$$

Where:
- credits = the number of credits for a course (for example, 3 credits)
- score = the grade score (for example, an A grade is a score of 4)
- i = the index of a specific course
- N = the total number of courses
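
As a quick check of the formula, here is a minimal plain-Python sketch. The `courses` list is a hypothetical transcript made up for illustration; it is not taken from `gpa_rdd`.

```python
# Hypothetical (score, credits) pairs, invented for this example
courses = [(4.0, 3), (3.0, 3), (2.0, 4)]

# Numerator: sum of score_i * credits_i over all courses
weighted_sum = sum(score * credits for score, credits in courses)

# Denominator: sum of credits_i over all courses
total_credits = sum(credits for _, credits in courses)

cumulative_gpa = weighted_sum / total_credits
print(cumulative_gpa)  # 29 / 10, i.e. about 2.9
```

Note that the 4-credit course pulls the result below the unweighted mean of the three scores (3.0), which is the effect the credit weighting is meant to capture.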

Construct a MapReduce job that takes the `gpa_rdd` RDD and returns the cumulative GPA *per season*. 

Each record in `gpa_rdd` contains:
- The year
- The season
- The course code
- The number of credits
- The letter grade
- The number grade

**Hint:** In class, we discussed the MapReduce job for computing an average. In this case, the key-value pairs will be similar, but instead of counting the number of elements included in the average so far, we need to sum the credits. Clearly, the key needs to be the season.
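
For reference, one common way to write the plain (unweighted) average as a MapReduce job is sketched below in plain Python, so it runs without Spark. This is an assumption about the general pattern, not a copy of the in-class code: `records`, `map_avg`, `reduce_avg`, and `reduce_by_key` are hypothetical names, and `reduce_by_key` is only a single-machine stand-in for what `reduceByKey` does on an RDD.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical (key, value) records, invented for this example
records = [('a', 2.0), ('b', 10.0), ('a', 4.0), ('a', 6.0)]

def map_avg(record):
    key, value = record
    # Emit (key, (running sum, running count))
    return (key, (value, 1))

def reduce_avg(v1, v2):
    # Add the sums and add the counts; this is associative and commutative
    return (v1[0] + v2[0], v1[1] + v2[1])

def reduce_by_key(pairs, func):
    # Single-machine stand-in for RDD.reduceByKey (no partitions, no shuffle)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reduce(func, values) for key, values in groups.items()}

totals = reduce_by_key(map(map_avg, records), reduce_avg)
averages = {key: total / count for key, (total, count) in totals.items()}
print(averages)
```

For the weighted GPA, the count of `1` in the mapped value would be replaced by the course credits, as the hint describes.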

In [3]:
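def map_weighted_gpa(record):
    # record = [year, season, course code, credits, letter grade, number grade]
    # emit [season, [number grade, credits]]
    return [record[1], [record[5], record[3]]]

def reduce_weighted_gpa(value1, value2):
    # each value is [GPA so far, credits so far]; merge the two GPAs weighted by credits
    average = ((value1[0] * value1[1]) + (value2[0] * value2[1]))/(value1[1] + value2[1])
    # return the merged GPA together with the combined credit total
    return [average, value1[1] + value2[1]]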

The map step should emit the season as the key and, as the value, a tuple or list containing the number grade and the number of credits.

For example, the first element of the mapped `gpa_rdd`, obtained with

```python
gpa_rdd.\
    map(map_weighted_gpa).\
    first()
```

should be

```python
['Fall', [4, 1]]
```

In [4]:
gpa_rdd.\
    map(map_weighted_gpa).\
    first()

['Fall', [4, 1]]

In [5]:
##### first result
assert gpa_rdd.\
    map(map_weighted_gpa).\
    first() == ['Fall', [4, 1]]
# the key should be a string
assert gpa_rdd.map(map_weighted_gpa).map(lambda x: type(x[0])).distinct().count() == 1
assert gpa_rdd.map(map_weighted_gpa).map(lambda x: type(x[0])).distinct().first() == str
# all values should be of length 2
assert gpa_rdd.map(map_weighted_gpa).map(lambda x: len(x[1])).distinct().count() == 1
assert gpa_rdd.map(map_weighted_gpa).map(lambda x: len(x[1])).distinct().first() == 2

In [6]:
# the reduced result should have two entries because there are two seasons (Fall and Spring)
gpa_rdd.\
    map(map_weighted_gpa).\
    reduceByKey(reduce_weighted_gpa).count()

2

In [7]:
assert (gpa_rdd.\
    map(map_weighted_gpa).\
    reduceByKey(reduce_weighted_gpa).collect() == \
    [('Spring', [3.0749999999999997, 12]), 
     ('Fall', [3.503448275862069, 29])]) or \
    (gpa_rdd.\
    map(map_weighted_gpa).\
    reduceByKey(reduce_weighted_gpa).collect() == \
    [('Fall', [3.503448275862069, 29]),
     ('Spring', [3.0749999999999997, 12])])
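
The check above compares floating-point GPAs for exact equality. Because `reduceByKey` may combine the values for a key in a different order, the last digits of the result can in principle differ between runs. The sketch below, in plain Python without Spark, shows a tolerance-based comparison using `math.isclose`. It is an illustration rather than part of the grading tests: `reduce_weighted_gpa` is repeated from cell `In [3]`, the `spring` list holds the number grades and credits of the four Spring records from `gpa_rdd`, and `forward`, `backward`, and `same_gpa` are hypothetical names.

```python
import math
from functools import reduce

def reduce_weighted_gpa(value1, value2):
    # Same logic as cell In [3]: credit-weighted merge of two [GPA, credits] pairs
    average = ((value1[0] * value1[1]) + (value2[0] * value2[1])) / (value1[1] + value2[1])
    return [average, value1[1] + value2[1]]

# [number grade, credits] for the four Spring records in gpa_rdd
spring = [[3.3, 3], [4, 3], [2.3, 3], [2.7, 3]]

# Combine the same values in two different orders
forward = reduce(reduce_weighted_gpa, spring)
backward = reduce(reduce_weighted_gpa, reversed(spring))

# Credits are integers, so they can be compared exactly;
# the GPA is a float, so it is compared within a relative tolerance
same_credits = forward[1] == backward[1]
same_gpa = math.isclose(forward[0], backward[0], rel_tol=1e-9)
print(same_credits, same_gpa)
```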