# Introduction


This Jupyter notebook covers the following material:
1. <span style="color:red">RDD</span>
- Reading a text file (from local storage or HDFS)
- map() and flatMap()
- reduceByKey(), groupByKey(), sortByKey(), keys(), and values()
- join(), rightOuterJoin(), leftOuterJoin(), cogroup(), subtractByKey()
- With key/value data, use mapValues() and flatMapValues() if your transformation doesn't affect the keys. This is more efficient because it lets Spark keep the original RDD's partitioner, so later key-based operations can avoid a shuffle.
- filter()
* Rule of thumb: ask "am I modifying the keys?" If yes, use map() and flatMap(); if no, use mapValues() and flatMapValues().
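
The difference between the four transformations in the rule of thumb can be sketched without a cluster. The block below is a plain-Python model, not Spark code: the helper names (`map_`, `flat_map`, `map_values`, `flat_map_values`) and the sample records are made up for illustration, and the model says nothing about partitioning.

```python
# Plain-Python model of four RDD transformations (no Spark required).
# All helper names and the sample data are hypothetical.
pairs = [("a", "x y"), ("b", "z")]  # key/value records


def map_(records, f):
    # One output element per input element; f sees the whole record.
    return [f(r) for r in records]


def flat_map(records, f):
    # f returns an iterable per record; the results are flattened.
    return [out for r in records for out in f(r)]


def map_values(records, f):
    # f sees only the value; the key is carried over untouched.
    return [(k, f(v)) for k, v in records]


def flat_map_values(records, f):
    # f returns an iterable per value; each item is paired with the key.
    return [(k, out) for k, v in records for out in f(v)]


upper = map_values(pairs, str.upper)
split = flat_map_values(pairs, str.split)
swapped = map_(pairs, lambda kv: (kv[1], kv[0]))
words = flat_map(pairs, lambda kv: kv[1].split())

print(upper)    # keys unchanged
print(split)    # keys unchanged, one record per word
print(swapped)  # keys changed, so map() is the right tool
print(words)    # keys dropped entirely
```

Only `swapped` and `words` touch the keys, which is why they are written with the whole-record helpers.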



In [1]:
from pyspark.sql import SparkSession
from nltk.corpus import stopwords

# Some hints

Maximum value for each key: reduceByKey(lambda x, y: max(x, y))
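
To see what this hint computes, here is a small plain-Python model of a per-key reduce. It is only a sketch: `reduce_by_key` and the `readings` data are hypothetical, and the real reduceByKey() works on distributed partitions rather than on a local dictionary.

```python
from collections import defaultdict
from functools import reduce


def reduce_by_key(records, func):
    """Local model of a per-key reduce: fold all values of a key with func."""
    grouped = defaultdict(list)
    for key, value in records:
        grouped[key].append(value)
    return {key: reduce(func, values) for key, values in grouped.items()}


readings = [("a", 3), ("b", 7), ("a", 9), ("b", 2)]  # hypothetical data
max_per_key = reduce_by_key(readings, lambda x, y: max(x, y))
print(max_per_key)
```

Because the built-in max already accepts two arguments, the lambda could probably be replaced by max itself.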

# RDD

In [2]:
spark = SparkSession.builder.appName("rdd_practice").getOrCreate()
sc  = spark.sparkContext

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2021-12-28 22:35:21,842 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-12-28 22:35:22,408 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
sc.getConf().getAll()

[('spark.driver.port', '35015'),
 ('spark.sql.warehouse.dir',
  'file:/home/hadoop/lohrasp/analyticsoptim/spark-warehouse'),
 ('spark.rdd.compress', 'True'),
 ('spark.driver.host', 'master'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.master', 'local[*]'),
 ('spark.submit.pyFiles', ''),
 ('spark.executor.id', 'driver'),
 ('spark.app.startTime', '1640730921773'),
 ('spark.submit.deployMode', 'client'),
 ('spark.app.id', 'local-1640730922566'),
 ('spark.app.name', 'rdd_practice'),
 ('spark.ui.showConsoleProgress', 'true')]

## Get list of stopwords to be removed from data

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/hadoop/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

## Movie Rating

In [36]:
# One RDD per MovieLens file; each line is split into its "::"-separated fields.
d = {}

for k in ["movies", "ratings", "users"]:
    d[k] = sc.textFile(f"hdfs:///user/hadoop/moviesdb/ml-1m/{k}.dat").map(lambda li: li.split("::"))



In [None]:
d["ratings"].take(2)

[['1', '1193', '5', '978300760'], ['1', '661', '3', '978302109']]
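
Every field in the output above is still a string. A parsing function along the following lines could convert them once, up front. This is a sketch: the name `parse_rating` is made up, and the field order (user ID, movie ID, rating, timestamp) is assumed from the MovieLens 1M layout rather than stated in this notebook.

```python
def parse_rating(line):
    """Split one '::'-separated ratings line into a tuple of ints.

    Assumed field order: user ID, movie ID, rating, timestamp.
    """
    user_id, movie_id, rating, timestamp = line.split("::")
    return int(user_id), int(movie_id), int(rating), int(timestamp)


# The first row shown in the take(2) output, re-joined into a raw line.
sample = parse_rating("1::1193::5::978300760")
print(sample)
```

In the notebook, such a function would be passed to map() in place of the split lambda.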

In [46]:
ratings = d["ratings"].map(lambda x:x[2]).countByValue()
ratings

                                                                                

defaultdict(int,
            {'5': 226310, '3': 261197, '4': 348971, '2': 107557, '1': 56174})

In [59]:
rating2 = d["ratings"].map(lambda x: (x[2], 1))
rating2syn = rating2.reduceByKey(lambda x, y: x+y)
rating2syn.collect()

                                                                                

[('4', 348971), ('1', 56174), ('5', 226310), ('3', 261197), ('2', 107557)]
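
The two cells above produce the same counts in different ways: countByValue() is an action that returns a dictionary to the driver, while map() followed by reduceByKey() is a transformation that yields another RDD. The equivalence of the two results can be checked on a small scale in plain Python. The `stars` list below is hypothetical sample data, not taken from the ratings file.

```python
from collections import Counter

stars = ["5", "3", "3", "4", "5", "3"]  # hypothetical rating values

# Model of countByValue(): count each distinct element directly.
by_value = Counter(stars)

# Model of map(lambda x: (x, 1)) followed by reduceByKey(lambda x, y: x + y).
pairs = [(s, 1) for s in stars]
by_key = {}
for key, one in pairs:
    by_key[key] = by_key.get(key, 0) + one

print(dict(by_value))
print(by_key)
```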

In [71]:
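# Key = rating (x[2]), value = movie ID (x[1]); both are still strings here.
ratingKV = d["ratings"].map(lambda x: (x[2], x[1]))
# Running (sum, count) per key; the value is cast to int once, before the reduce.
rating3 = ratingKV.mapValues(lambda x: (int(x), 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
# Note: this is the mean movie ID within each rating value, not a mean rating.
averagePerRating = rating3.mapValues(lambda x: x[0] / x[1])
print(ratingKV.take(2), ratingKV.mapValues(lambda x: (x, 1)).take(2))
print(averagePerRating.collect())
rating3.collect()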

[('5', '1193'), ('3', '661')] [('5', ('1193', 1)), ('3', ('661', 1))]



[('4', 1875.5138793767962), ('1', 1972.758838608609), ('5', 1728.2636781406036), ('3', 1918.5037423860151), ('2', 1937.4035348698828)]


                                                                                

[('4', (654499954, 348971)),
 ('1', (110817755, 56174)),
 ('5', (391123353, 226310)),
 ('3', (501107422, 261197)),
 ('2', (208381312, 107557))]
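
The cell above uses the rating as the key and the movie ID as the value, so its averages are mean movie IDs per rating. If the goal is the average rating per movie, the same (sum, count) pattern applies with the roles swapped. The following is a plain-Python sketch of that variant: the first two rows come from the take(2) output earlier in this section, the third row is invented so that one movie has two ratings.

```python
rows = [
    ["1", "1193", "5", "978300760"],  # from the take(2) output
    ["1", "661", "3", "978302109"],   # from the take(2) output
    ["2", "1193", "4", "0"],          # hypothetical extra row
]

# Key = movie ID, value = (rating, 1), mirroring mapValues(lambda x: (x, 1)).
pairs = [(movie_id, (int(rating), 1)) for _, movie_id, rating, _ in rows]

# Model of reduceByKey: add sums and counts component-wise per key.
totals = {}
for key, (rating, count) in pairs:
    running_sum, running_count = totals.get(key, (0, 0))
    totals[key] = (running_sum + rating, running_count + count)

# Model of mapValues(lambda x: x[0] / x[1]).
average = {key: s / c for key, (s, c) in totals.items()}
print(average)
```

Carrying the count alongside the sum is what makes the reduction associative; averaging partial averages directly would weight them incorrectly.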

## Get data from HDFS

The text used here is On the Origin of Species, by Charles Darwin.

In [76]:
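import re
# Compile the splitter and build the stopword set once, instead of calling
# stopwords.words() inside the lambda for every token.
word_split = re.compile(r'\W', re.UNICODE)
stop_words = set(stopwords.words("english"))
rdd1 = sc.textFile(
    "hdfs:///user/hadoop/OntheOriginofSpecies.txt").flatMap(lambda text: word_split.split(text.lower()))
rdd1 = rdd1.filter(lambda x: x not in stop_words)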
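
One detail of the tokenization above is easy to miss: splitting on a single non-word character yields an empty string wherever two separators are adjacent, and those empty strings pass the stopword filter. The plain-Python sketch below shows the effect and one possible remedy. `STOPWORDS` is a tiny hypothetical stand-in for the NLTK list, and `tokenize` is an illustrative helper, not part of the notebook.

```python
import re

WORD_SPLIT = re.compile(r"\W+")        # one or more non-word characters
STOPWORDS = {"the", "of", "on", "by"}  # hypothetical subset of a stopword list


def tokenize(line):
    """Lowercase, split on runs of non-word characters, drop empties and stopwords."""
    tokens = WORD_SPLIT.split(line.lower())
    return [t for t in tokens if t and t not in STOPWORDS]


line = "On the Origin of Species, by Charles Darwin"
naive = re.split(r"\W", line.lower())  # same pattern as the cell above
clean = tokenize(line)

print(naive)  # contains an empty string after "species"
print(clean)
```

A set is used for the stopwords because membership tests on a set do not scan the whole collection, unlike tests on a list.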


In [77]:
# Group the words by their first four characters.
rdd2 = rdd1.groupBy(lambda x: x[:4])
for k, v in rdd2.take(2):
    print(k, list(v)[:3])


proj ['project', 'project', 'project']
gute ['gutenberg', 'gutenberg', 'gutenberg']


                                                                                

In [83]:
def swapTuple(t):
    # (word, count) -> (count, word), so that sortByKey() orders by count
    return (t[1], t[0])
numOccurrence = rdd1.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).map(swapTuple).sortByKey()
numOccurrence.take(2)

                                                                                

[(1, 'title'), (1, '1st')]
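
The result above starts with the rarest words because sortByKey() sorts in ascending order by default; passing ascending=False would put the most frequent words first. The count, swap, and sort steps can be modelled in plain Python as follows. The `tokens` list is hypothetical sample data.

```python
from collections import Counter

tokens = ["species", "natural", "species", "selection", "natural", "species"]

# Model of map(lambda x: (x, 1)).reduceByKey(lambda x, y: x + y).
counts = Counter(tokens)

# Model of map(swapTuple): (word, count) -> (count, word).
swapped = [(n, word) for word, n in counts.items()]

# Model of sortByKey(ascending=False): highest count first.
by_count = sorted(swapped, key=lambda pair: pair[0], reverse=True)
print(by_count)
```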

In [78]:
# Remove duplicate words first, then group by the first four characters.
rdd3 = rdd1.distinct()
rdd4 = rdd3.groupBy(lambda x: x[:4])
for k, v in rdd4.take(2):
    print(k, list(v)[:3])


proj ['project', 'projecting']
gute ['gutenberg']
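
The effect of calling distinct() before groupBy() can be reproduced on a handful of words in plain Python. This is only a local model with hypothetical input; in Spark the order of groups and of the words inside a group is not guaranteed.

```python
from collections import defaultdict

words = ["project", "projecting", "gutenberg", "project", "gutenberg"]

# Model of distinct(): drop duplicates (order kept here for readability).
unique = list(dict.fromkeys(words))

# Model of groupBy(lambda x: x[:4]): bucket words by their four-letter prefix.
groups = defaultdict(list)
for word in unique:
    groups[word[:4]].append(word)

for prefix, members in groups.items():
    print(prefix, members)
```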


                                                                                