# COMP3002 Big Data and Cloud Project
## Task 1

This notebook describes the work that you must complete for the first task.  You should complete the work in this notebook and commit it regularly to your GitHub Classroom repository.  You can include additional Python .py files if you wish to create helper functions that keep this notebook clean.  Make sure they are committed to the GitHub repository too.

### Scenario

You are provided with a small sample dataset of Amazon Review Data.  This notebook talks you through the process of loading that data into Spark SQL and asks you to analyse it.  On the Block Release day (18th November for the Open Cohort, 20th November for the Ford Cohort) you will have access to a larger dataset hosted in the cloud.  Much of the day will be spent moving your solutions to the cloud and answering additional questions, which will be set on the day.

If you do not finish everything during Block Release, you will have additional time to reflect on the experience and finalise your code before final submission on 26th November.

Amazon Review Data was downloaded from [here](https://jmcauley.ucsd.edu/data/amazon/) but a small sample is provided with this assignment.

### Learning Outcomes

Remember that the primary aim of this task is not to get the "correct" answer, but for you to use the time to become confident with some basic Big Data processing.

* **LO1** Understand the principles that allow the processing of big data sets.

* **LO2** Understand the limitations of big data technologies for distributed processing.

* **LO3** Demonstrate practical skills required to implement big-data solutions using modern large-scale data and compute infrastructures.

### Assessment

Assessment follows a similar approach to that used previously on the programme.  This small task attracts up to a grade C.  The second task to be released later this term will allow you to stretch to higher grades.

<table>
    <tr>
        <th align="left">Grade</th>
        <th align="left"><p>Criteria</p></th> 
    </tr>
    <tr>
        <td>C (50)</td>
        <td align="left">
            <p>In addition to the requirements for D-grade, the work should:</p>
            <ul align="left">
                <li align="left">
                    <p>Demonstrate the ability to implement a solution to the challenge tasks posed during the block release day using the Spark Cluster.</p>
                </li>
            </ul>
            <p>If the solution is not complete, a C-grade may still be awarded if a strong narrative is provided to explain where further work is needed and what the next steps would be.</p>
        </td>
    </tr>
    <tr style="background-color: #FBB36B;">
        <td>D (40)</td>
        <td align="left">
            <p>As this is the passing grade for the project, you must achieve all the learning outcomes.</p>
            <p>The work should meet the following minimum criteria:</p>
            <ul align="left">
                <li align="left">
                    <p>Work should be a Jupyter notebook submitted via GitHub Classroom, with any accompanying helper .py files, that is free from errors and executes successfully.</p>
                </li>
                <li align="left">
                    <p>The notebook demonstrates that the apprentice can:</p>
                    <ol>
                        <li align="left">
                            <p>Connect to a Spark context.</p>
                        </li>
                        <li align="left">
                            <p>Transmit data to Spark.</p>
                        </li>
                        <li align="left">
                            <p>Execute remote transformations and actions on Spark.</p>
                        </li>
                        <li align="left">
                            <p>Retrieve outputs and present them in a suitable manner.</p>
                        </li>
                    </ol>
                </li>
                <li align="left">
                    <p>Provide acceptable answers to questions posed in the task template.</p>
                </li>
            </ul>
            <p>The work may be limited in that:</p>
            <ul align="left">
                <li align="left">
                    <p>It may only run on a single machine via PySpark.</p>
                </li>
                <li align="left">
                    <p>It may not demonstrate an attempt at the challenge tasks posed during the block release day.</p>
                </li>
            </ul>
        </td>
    </tr>
    <tr>
        <td>E (30)</td>
        <td align="left">
            <p>Learning outcomes not met at threshold level, but with additional work a pass could be achieved.  This may mean that code does not run, or that solutions achieve the brief but without successfully using the Spark infrastructure.</p>
        </td>
    </tr>
    <tr style="background-color: #FBB36B;">
        <td>F (0-29)</td>
        <td align="left">
            <p>Learning outcomes not met at threshold level, but with additional work a pass could be achieved.  This may mean that code does not run, or that solutions achieve the brief but without successfully using the Spark infrastructure.</p>
        </td>
    </tr>
</table>

<style>
    tr:nth-child(odd) {
        background-color: orange;
    }
</style>

In addition, for grades of E and above, 10 discretionary marks are available for the presentation quality of the submission (including coding).

First you need to establish a Spark Session in a slightly different way using Spark SQL:

In [30]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, weekofyear, col

In [31]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("Task1") \
    .getOrCreate()

In [32]:
# Load JSON data into DataFrame
df = spark.read.json("data/reviews.json")

Having imported the data, take a look at the schema.  Perhaps try running some SQL queries over it.  I've suggested a first example, but you can come up with more questions.

**Can you plot how many reviews there are for each rating value in the data?**

### Hints

You've loaded your data, and you want to try and process that data remotely as much as possible, only collecting results at the end.

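You can add columns to the remote DataFrame using `withColumn`, where the second argument is a column expression.  It returns a new DataFrame, so assign the result:

df = df.withColumn("myColumnName", columnExpression)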

You can execute SQL-like operations such as group by and order by:

df.orderBy("columnName")
df.groupBy("columnName")

Think about how you would transform the data in the DataFrame, and then collect just the data needed to make the plot.

In [21]:
# Your code goes here

**Can you create a histogram of the number of reviews received in each week of the year?  Are there any patterns present?**

In [22]:
# Your code goes here

**Can you think of your own query?**

In [23]:
# Your code goes here

More tasks will be released on the Block Release day, when you will have time to go deeper and use much larger datasets on a cluster.

In [None]:
# Your code goes here