Commit
Signed-off-by: Greg Watson <g.watson@computer.org>
Showing 4 changed files with 44 additions and 0 deletions.
@@ -0,0 +1,7 @@
---
layout: page
title: Advanced Python for Data Science Assignment 13
exercises: ['BigData 1', 'BigData 2', 'BigData 3']
---

{% include assignment.html %}
@@ -0,0 +1,12 @@
---
layout: exercise
title: BigData 1
---

The `wordcount_spark.py` program we wrote earlier finds the word that is used the most times in the input text. It does this by performing a sum reduction using the `add` operator. Your job is to modify this program, using a different kind of reduction, to count the number of distinct words in the input text. A sketch of one possible approach appears below.

Call your new program `distinct_spark.py` and commit it to the repository you used for Assignment 3.
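For reference, here is a minimal sketch of one way the distinct-word count could look. Since `wordcount_spark.py` itself is not shown in this commit, the input file name and the `SparkContext` setup are assumptions; adapt them to match the original program.

```python
# distinct_spark.py -- a minimal sketch of one possible approach.
# The input path "input.txt" and the SparkContext setup are assumptions.
from pyspark import SparkContext

sc = SparkContext(appName="DistinctWords")

# Split every line of the input into individual words.
words = sc.textFile("input.txt").flatMap(lambda line: line.split())

# Instead of summing counts with `add`, collapse each word's pairs down
# to a single representative, then count how many distinct keys survive.
distinct_count = (words.map(lambda w: (w, None))
                       .reduceByKey(lambda a, b: a)
                       .count())
print("distinct words:", distinct_count)

sc.stop()
```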
@@ -0,0 +1,13 @@
---
layout: exercise
title: BigData 2
---

We saw how to use the `SparkContext.parallelize` method to create a distributed dataset (RDD) containing all the numbers from 0 to 1,000,000. Use this same method to create an RDD containing the numbers from 1 to 1000. The RDD class has a handy method called [fold](https://spark.apache.org/docs/1.1.1/api/python/pyspark.rdd.RDD-class.html#fold) that aggregates all the elements of the dataset using a function supplied as an argument. Use this method to create a program that calculates the product of all the numbers from 1 to 1000 and prints the result. A sketch of one possible approach appears below.

Call your new program `product_spark.py` and commit it to the repository you used for Assignment 3.
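As a guide, a minimal sketch of the `fold`-based product might look like the following. One detail worth noting: `fold` folds its zero value into every partition, so the zero value must be the identity element for the operator, which for multiplication is 1, not 0.

```python
# product_spark.py -- a minimal sketch, not the only way to do it.
from pyspark import SparkContext

sc = SparkContext(appName="Product")

# An RDD holding the numbers 1..1000.
nums = sc.parallelize(range(1, 1001))

# fold(zeroValue, op): zeroValue is applied once per partition, so it
# must be the identity for the operator -- 1 for multiplication.
product = nums.fold(1, lambda x, y: x * y)
print(product)

sc.stop()
```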
@@ -0,0 +1,12 @@
---
layout: exercise
title: BigData 3
---

There is nothing to stop you from combining the `map` operation with the `fold` operation. You can even apply `map` more than once to generate more complex mappings. For *bonus marks*, see if you can work out how to use `map` and `fold` to calculate the average of the square roots of all the numbers from 1 to 1000, i.e., the sum of the square roots of all the numbers divided by 1000. A sketch of one possible approach appears below.

Call your new program `squareroot_spark.py` and commit it to the repository you used for Assignment 3.
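One minimal sketch, assuming the same `parallelize` setup as the previous exercise, applies `map` twice (square root, then divide by the element count) before summing with `fold`:

```python
# squareroot_spark.py -- a minimal sketch of one possible approach.
from math import sqrt
from pyspark import SparkContext

sc = SparkContext(appName="SqrtAverage")

# An RDD holding the numbers 1..1000.
nums = sc.parallelize(range(1, 1001))

# First map takes each square root; second divides by the element count,
# so the fold-sum of the results is exactly the average.
average = (nums.map(sqrt)
               .map(lambda r: r / 1000.0)
               .fold(0, lambda x, y: x + y))
print(average)

sc.stop()
```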