# Managing Big Data for Connected Devices

## 420-N63-NA

## Kawser Wazed Nafi

--------------------------------------------------------------------------------------------------------------------------------------



## Foreach, Accumulator, CombineByKey, and CSV File Operations


### Foreach
PySpark foreach() is an action available on RDDs and DataFrames that iterates over each element and applies a function to it. It is similar to a for loop, except that Spark runs the function on each element and no result is returned.

### Syntax
foreach(lambda function)

### Example 1
For a given RDD, [1,2,3,4,5], we want to add 5 to each of its values.

In [2]:
import pyspark
from pyspark.sql import SparkSession
sparksession = SparkSession.builder.master("local[4]") \
                    .appName('accumulatorAPP') \
                    .getOrCreate()
sc = sparksession.sparkContext

rdd = sc.parallelize([1,2,3,4,5])
rdd.foreach(lambda x: print(x+5))


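You might not see anything printed here. The reason is that the function passed to foreach() runs in the executor processes, not in the notebook's driver process, so the output of print goes to the executors' standard output (for example, the worker logs on a cluster). The values are processed one by one.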

### Accumulator

The PySpark Accumulator is a shared variable that is used with RDDs and DataFrames to perform sum and counter operations, similar to MapReduce counters. These variables are shared by all executors, which update them through associative and commutative operations such as addition.

From the workers' point of view, accumulators are write-only, initialize-once variables: tasks running on workers may only update them, and those updates are propagated automatically to the driver program. Only the driver program can read the accumulator's value, using the value property.

<ul>
<li><strong>sparkContext.accumulator()</strong> is used to define accumulator variables.</li>
<li><strong>add()</strong> function is used to add/update a value in accumulator</li>
<li><strong>value</strong> property on the accumulator variable is used to retrieve the value from the accumulator.</li>
</ul>
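
The way updates flow from tasks to the driver can be pictured with a plain-Python model. This is only a sketch, assuming that each task works on its own local copy starting from zero and that the driver merges the partial results. `LocalAccumulator`, `run_task`, and the two-partition split are hypothetical names and data, not part of PySpark.

```python
class LocalAccumulator:
    """Toy stand-in for an accumulator (not the PySpark class)."""

    def __init__(self, initial):
        self._value = initial

    def add(self, term):
        # Tasks are only supposed to call add(); they never read the value.
        self._value += term

    @property
    def value(self):
        # In this model, only the driver reads the value.
        return self._value


def run_task(partition, zero):
    """Model of one task: start from zero, add every element, return the partial sum."""
    local = LocalAccumulator(zero)
    for x in partition:
        local.add(x)
    return local.value


# Hypothetical split of [1, 2, 3, 4, 5] into two partitions.
partitions = [[1, 2], [3, 4, 5]]

driver_accum = LocalAccumulator(0)
for part in partitions:
    # The driver merges the partial update coming back from each task.
    driver_accum.add(run_task(part, 0))

print(driver_accum.value)  # 15
```

Because addition is associative and commutative, the total in this model does not depend on how the data is split or on the order in which the tasks finish.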

### Example 2

In the following program, an accumulator is used to add together all the values of the RDD.

In [3]:

accum=sc.accumulator(0)
rdd=sc.parallelize([1,2,3,4,5])
rdd.foreach(lambda x:accum.add(x))
print(accum.value) #Accessed by driver


15


### Exercise 1

Can you explain the difference between the PySpark Map() and PySpark Foreach()? Search for it and state over here.

.map() is used to apply a transformation lambda/function on each element in an RDD and returns a new RDD.

.foreach() is used to iterate over each element in an RDD and apply a function for its side effects; it is an action and does not return anything.

### CombineByKey
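Generic function to combine the elements for each key using a custom set of aggregation functions. It turns an RDD of (K, V) pairs into an RDD of (K, C) pairs, where C is a "combined" type that may differ from V. The merge functions should be associative, because partial results from different partitions are merged in no guaranteed order.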

Users provide three functions:
<ul>
<li>createCombiner, which turns a V into a C (e.g., creates a one-element list)</li>

<li>mergeValue, to merge a V into a C (e.g., adds it to the end of a list)</li>

<li>mergeCombiners, to combine two C’s into a single one (e.g., merges the lists)</li>
</ul>
To avoid memory allocation, both mergeValue and mergeCombiners are allowed to modify and return their first argument instead of creating a new C.

In [4]:
x = sc.parallelize([("a", 1), ("b", 1), ("a", 2)])
def createcombiner(a):
    return [a]

def mergerValue(a, b):
    a.append(b)
    return a

def combiners(a, b):
    a.extend(b)
    return a

sorted(x.combineByKey(createcombiner, mergerValue, combiners).collect())

[('a', [1, 2]), ('b', [1])]
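
To see when each of the three functions is called, the following plain-Python sketch models the behavior on explicit partitions. It is not Spark's implementation: `combine_by_key` and the two-partition split of the data are assumptions made for illustration only.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    """Plain-Python model of combineByKey.

    `partitions` is a list of lists of (key, value) pairs.
    """
    # Step 1: combine the values inside each partition independently.
    per_partition = []
    for part in partitions:
        combined = {}
        for key, value in part:
            if key not in combined:
                # First value seen for this key in this partition: V -> C
                combined[key] = create_combiner(value)
            else:
                # Later values for the same key: (C, V) -> C
                combined[key] = merge_value(combined[key], value)
        per_partition.append(combined)

    # Step 2: merge the per-partition combiners for each key: (C, C) -> C
    result = {}
    for combined in per_partition:
        for key, comb in combined.items():
            if key in result:
                result[key] = merge_combiners(result[key], comb)
            else:
                result[key] = comb
    return sorted(result.items())


# Hypothetical split of the example data into two partitions.
partitions = [[("a", 1), ("b", 1)], [("a", 2)]]

# Collect the values of each key into a list, as in the cell above.
as_lists = combine_by_key(
    partitions,
    lambda v: [v],
    lambda c, v: c + [v],
    lambda c1, c2: c1 + c2,
)
print(as_lists)

# Keep a (sum, count) pair per key, which is enough to compute an average later.
sum_count = combine_by_key(
    partitions,
    lambda v: (v, 1),
    lambda c, v: (c[0] + v, c[1] + 1),
    lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]),
)
print(sum_count)
```

The second call shows why the combined type C can differ from the value type V: the values are plain numbers, while the combiner is a (sum, count) tuple.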


### Exercise 2

The following list is given to you: ("a", 1), ("b", 1), ("c", 2), ("a", 3), ("b", 5), ("c", 5), ("c", 4), ("a", 2), ("b", 2). Use combineByKey so that it produces the following result: [('a', [6]), ('b', [8]), ('c', [11])]


In [5]:
l = sc.parallelize([("a", 1), ("b", 1), ("c", 2), ("a", 3), ("b", 5), ("c", 5), ("c", 4), ("a", 2), ("b", 2)])

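def add(x, y):
    # Used as both mergeValue and mergeCombiners: the running sum is a plain number.
    return x + y

# createCombiner keeps the first value of each key as its initial sum.
sorted(l.combineByKey(lambda v: v, add, add).collect())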


[('a', 6), ('b', 8), ('c', 11)]

### CSVfile with RDD

A CSV file is one of the popular ways to store big data in a tabular form. In our last class, we saw how to read a CSV file into an RDD. In this assignment, we will perform operations on a CSV file, filter its data, and analyze the result.

To work in that direction, today's lab uses the following dataset: the MovieLens dataset.

To get the dataset, go to https://grouplens.org/datasets/movielens/ and download ml-latest-small.zip.

You will find it under "recommended for education and development". It has 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.

Unzip the folder and add it to the directory mapped to your container. It will then be visible in the panel on the left. You can also upload all the files using the upload button in the Jupyter notebook.

RDD is unstructured data storage. So whether it is a CSV file separated by a comma or not, we will be just loading it as a text file and we will get each line as an entry in the RDD.

In addition, we are going to use take() on the RDD. take(n) is an action that returns the first n elements of the RDD to the driver as a list; it scans only as many partitions as it needs to collect n elements.

In [6]:
ratingsRDDWithHeader = sc.textFile('ratings.csv')
ratingsRDDWithHeader.take(5)

['userId,movieId,rating,timestamp',
 '1,1,4.0,964982703',
 '1,3,4.0,964981247',
 '1,6,4.0,964982224',
 '1,47,5.0,964983815']

From the data, we can see that the file has a header. We can retrieve it with the `first` function, an action that returns the first item in the RDD. After loading the data into the RDD, we can filter out that row to keep only the rest of the data.

In [7]:
header = ratingsRDDWithHeader.first()
print(header)
ratingsRDD = ratingsRDDWithHeader.filter(lambda x: x != header)
ratingsRDD.take(5)

userId,movieId,rating,timestamp


['1,1,4.0,964982703',
 '1,3,4.0,964981247',
 '1,6,4.0,964982224',
 '1,47,5.0,964983815',
 '1,50,5.0,964982931']

Because RDD is unstructured, we got the full line in the file for each entry. You can simply split the string by the comma to get an array for each entry. 

In [10]:
ratingsRDD = ratingsRDD.map(lambda x: x.split(','))
ratingsRDD.take(5)

[['1', '1', '4.0', '964982703'],
 ['1', '3', '4.0', '964981247'],
 ['1', '6', '4.0', '964982224'],
 ['1', '47', '5.0', '964983815'],
 ['1', '50', '5.0', '964982931']]
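
Splitting on commas is safe for ratings.csv because every field is numeric. For CSV files whose text fields may contain commas inside quotes, a plain split(',') would cut a field in two. The sketch below contrasts it with Python's standard csv module; the sample line is a hypothetical example written in the style of a movies file, not a line quoted from the dataset.

```python
import csv

# Hypothetical line where the quoted title itself contains a comma.
line = '11,"American President, The (1995)",Comedy|Drama|Romance'

# Naive split: the title is broken into two pieces and keeps its quotes.
naive = line.split(',')

# csv.reader understands quoted fields and returns exactly three columns.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

The same call can be used inside an RDD map, for example `rdd.map(lambda x: next(csv.reader([x])))`, since csv.reader accepts any iterable of strings.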

### Exercise 3

Find the ID of the movie with the highest average rating. For each movie calculate the average rating and then find the maximum one. 

Hints: 
    <ul>
        <li> Use the filtered RDD derived above. Use map() to generate the (movieId, rating) tuple for all the movies in the file </li>
        <li> Use combineByKey to find out the ratings along with the movieId </li>
        <li> Use foreach to find out the average ratings based on the movieId</li>
    </ul>

In [18]:
# ratingsRDD already contains split rows with the header removed (see the cells above),
# so the (movieId, rating) pairs can be built directly.
ratings = ratingsRDD.map(lambda x: (int(x[1]), float(x[2])))

# The combiner for each movie is a (sum of ratings, number of ratings) pair.
combined = ratings.combineByKey(lambda v: (v, 1),
                                lambda c, v: (c[0] + v, c[1] + 1),
                                lambda c1, c2: (c1[0] + c2[0], c1[1] + c2[1]))

avrg = combined.map(lambda c: (c[0], c[1][0] / c[1][1]))
maxr = avrg.max(lambda c: c[1])
print(f"{maxr}")

(131724, 5.0)


### Exercise 4

Load the data from the file 'movies.csv' and find the movie with the highest average rating, based on the ID you found in Exercise 3.

Hints: Use filter/sort method to find out the movie. 

In [34]:
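import csv

masterRDDWithHeaders = sc.textFile('movies.csv')
header = masterRDDWithHeaders.first()
# Titles may contain commas inside quotes, so parse each line with the csv module
# instead of a plain split(',').
masterRDD = masterRDDWithHeaders.filter(lambda f: f != header) \
                                .map(lambda x: next(csv.reader([x])))

# Key each row by its movieId and keep the row with the ID found in Exercise 3.
movie = masterRDD.map(lambda x: (x[0], x))
highest = movie.filter(lambda i: i[0] == '131724')
highest.collect()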

[('131724',
  ['131724',
   'The Jinx: The Life and Deaths of Robert Durst (2015)',
   'Documentary'])]

### Exercise 5

Based on the result you got for the highest rated movie and based on your understanding of humans and how diverse they are in their opinion, does the average rating of this movie make sense to you?

Are we done here? Are you satisfied with what you've achieved, and can you happily tell anyone that this is the highest average rated movie?


If you are not, and you shouldn't be:
1. Get evidence from the data to justify why the result you achieved is not a good indicator of the highest average rated movie.
2. Propose a new analysis that you think is reasonable to find the highest average rated movie.
3. Find that movie based on your proposed analysis.

Hints: To answer this question, you should check the content of the other available datasets regarding the listed movies.

Realistically, several movies can share the same average rating of 5 (assuming that 5 is the highest possible rating for a movie), so it does not make sense to return only a single movie. In addition, I did not take other aspects of a movie into consideration when retrieving it: a movie with an average rating of 5 from, say, at least 10300 reviews would be a better contender than one that also has a rating of 5 but was rated by only a couple of users.
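
One possible way to act on this observation is to keep the number of ratings next to each average and to ignore movies with too few ratings. The plain-Python sketch below illustrates the idea on made-up data; the `ratings` list and the `MIN_RATINGS` threshold are assumptions chosen for illustration, not values taken from the dataset.

```python
from collections import defaultdict

# Hypothetical (movieId, rating) pairs, not taken from ratings.csv.
ratings = [
    (1, 5.0),
    (2, 4.5), (2, 5.0), (2, 4.0),
    (3, 4.0), (3, 3.0), (3, 5.0), (3, 4.0),
]

MIN_RATINGS = 3  # assumed threshold; a sensible value depends on the data

# Accumulate [sum of ratings, number of ratings] per movie.
totals = defaultdict(lambda: [0.0, 0])
for movie_id, rating in ratings:
    totals[movie_id][0] += rating
    totals[movie_id][1] += 1

# movieId -> (average rating, number of ratings)
averages = {m: (s / n, n) for m, (s, n) in totals.items()}

# Naive answer: highest average, regardless of how many users rated the movie.
naive_best = max(averages.items(), key=lambda kv: kv[1][0])

# More robust answer: only consider movies with enough ratings.
eligible = {m: v for m, v in averages.items() if v[1] >= MIN_RATINGS}
robust_best = max(eligible.items(), key=lambda kv: kv[1][0])

print(naive_best)
print(robust_best)
```

In this toy data the naive winner is a movie rated by a single user, while the filtered analysis picks a movie whose average is backed by several ratings. The same (sum, count) pairs are what the combineByKey step in Exercise 3 already produces, so the filter could be applied there before taking the maximum.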