# Problem 1 - Working with RDDs (5 points)

This is an interactive PySpark session. Remember that when you open this notebook the `SparkContext` and `SparkSession` are already created, and they are in the `sc` and `spark` variables, respectively. You can run the following two cells to make sure that the Kernel is active.

**Do not insert any cells other than the ones provided.**

In [1]:
sc

In [2]:
spark

In the following cell, make an RDD called `top1m` that contains the contents of the file `top-1m.csv` that you placed into the cluster's HDFS.

In [3]:
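# spark.read.csv returns a DataFrame; its underlying RDD is available as top1m.rdd.
# header=True uses the file's first line as column names, so that line is not a data row.
top1m = spark.read.csv("top-1m.csv", header=True)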
top1m.take(5)

[Row(1=u'2', google.com=u'youtube.com'),
 Row(1=u'3', google.com=u'facebook.com'),
 Row(1=u'4', google.com=u'baidu.com'),
 Row(1=u'5', google.com=u'wikipedia.org'),
 Row(1=u'6', google.com=u'yahoo.com')]

There is one element in the RDD for each line in the file. The `.count()` method will compute how many lines are in the file. In the following cell, type the expression to count the lines in the `top1m` RDD. Run the cell and see the result.

In [4]:
top1m.count()

999999
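
One likely reason the count is 999999 rather than a round million is that `header=True` consumed the first line of the file, which is consistent with the column names `1` and `google.com` in the rows shown earlier. The sketch below illustrates the effect locally with the standard `csv` module on a hypothetical three-line excerpt; it is an illustration, not the Spark reader itself.

```python
import csv
import io

# Hypothetical three-line excerpt in the same "rank,domain" layout.
text = "1,google.com\n2,youtube.com\n3,facebook.com\n"

all_rows = list(csv.reader(io.StringIO(text)))

# Treating the first line as a header removes it from the data rows.
header, data_rows = all_rows[0], all_rows[1:]

print(len(all_rows))   # 3
print(header)          # ['1', 'google.com']
print(len(data_rows))  # 2
```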

## Count the `.com` domains

How many of the websites in this RDD are in the .com domain?

In the following cell, write a code snippet that finds the records with `.com` and counts them. (Hint: use a regular expression.)

In [5]:
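import re

# Count rows whose domain (second column) contains ".com" anywhere in the name,
# so a name such as sina.com.cn is counted too. The filter runs on the cluster
# instead of collecting every row to the driver.
top1m.rdd.filter(lambda row: re.search(r'\.com', row[1])).count()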

537632
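
Note that a pattern matching `.com` anywhere in the name also counts domains such as `sina.com.cn`, where `.com` is not the final label. The following sketch compares an unanchored and an anchored pattern on a small hypothetical sample (the names are taken from the listing in this notebook); which reading of "in the .com domain" is intended is left to the reader.

```python
import re

# Hypothetical sample of domain names.
sample = ["youtube.com", "sina.com.cn", "wikipedia.org", "login.tmall.com", "google.co.in"]

contains_com = re.compile(r"\.com")    # ".com" anywhere in the name
ends_with_com = re.compile(r"\.com$")  # ".com" only as the final label

anywhere = [d for d in sample if contains_com.search(d)]
at_end = [d for d in sample if ends_with_com.search(d)]

print(len(anywhere))  # 3
print(len(at_end))    # 2
```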


## Histogram the Top Level Domains (TLDs)

What is the distribution of TLDs in the top 1 million websites? We can quickly compute this using the RDD function `countByValue()`.

In the following cell, write a function called `tld` (in Python) that takes a domain name string and outputs the top-level domain.

In [16]:
def tld(line):
    # `line` is a Row from top1m; line[1] holds the domain name string.
    # The TLD is the text after the last dot.
    return line[1].strip().split(".")[-1]
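
The prompt asks for a function that takes a domain name string, while `tld` above receives a whole row and reads its second field. A minimal sketch of a string-only variant follows; the name `tld_from_string` is hypothetical and not part of the assignment.

```python
def tld_from_string(domain):
    """Return the text after the last dot of a domain name string."""
    return domain.strip().rsplit(".", 1)[-1]

print(tld_from_string("youtube.com"))     # com
print(tld_from_string("google.co.in"))    # in
print(tld_from_string(" sina.com.cn\n"))  # cn
```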


In the following cell, map the `top1m` RDD using `tld` into a new RDD called `tlds`. 

In [17]:
tlds = top1m.rdd.map(tld)

In the following two cells, evaluate `top1m.first()` and  `tlds.first()` to see if the first line of `top1m` transformed by `tld` is properly represented as the first line of `tlds`. 

In [18]:
top1m.first()

Row(1=u'2', google.com=u'youtube.com')

In [19]:
tlds.first()

u'com'

Look at the first 50 elements of `top1m` by evaluating `top1m.take(50)`.

In [20]:
top1m.take(50)

[Row(1=u'2', google.com=u'youtube.com'),
 Row(1=u'3', google.com=u'facebook.com'),
 Row(1=u'4', google.com=u'baidu.com'),
 Row(1=u'5', google.com=u'wikipedia.org'),
 Row(1=u'6', google.com=u'yahoo.com'),
 Row(1=u'7', google.com=u'qq.com'),
 Row(1=u'8', google.com=u'amazon.com'),
 Row(1=u'9', google.com=u'taobao.com'),
 Row(1=u'10', google.com=u'twitter.com'),
 Row(1=u'11', google.com=u'google.co.in'),
 Row(1=u'12', google.com=u'tmall.com'),
 Row(1=u'13', google.com=u'instagram.com'),
 Row(1=u'14', google.com=u'live.com'),
 Row(1=u'15', google.com=u'vk.com'),
 Row(1=u'16', google.com=u'sohu.com'),
 Row(1=u'17', google.com=u'jd.com'),
 Row(1=u'18', google.com=u'sina.com.cn'),
 Row(1=u'19', google.com=u'reddit.com'),
 Row(1=u'20', google.com=u'weibo.com'),
 Row(1=u'21', google.com=u'google.co.jp'),
 Row(1=u'22', google.com=u'yandex.ru'),
 Row(1=u'23', google.com=u'360.cn'),
 Row(1=u'24', google.com=u'blogspot.com'),
 Row(1=u'25', google.com=u'login.tmall.com'),
 Row(1=u'26', google.com=u'

Try the same thing with the `tlds` RDD to make sure that the first 50 lines were properly transformed.


In [21]:
tlds.take(50)

[u'com',
 u'com',
 u'com',
 u'org',
 u'com',
 u'com',
 u'com',
 u'com',
 u'com',
 u'in',
 u'com',
 u'com',
 u'com',
 u'com',
 u'com',
 u'com',
 u'cn',
 u'com',
 u'com',
 u'jp',
 u'ru',
 u'cn',
 u'com',
 u'com',
 u'com',
 u'com',
 u'ru',
 u'com',
 u'br',
 u'hk',
 u'uk',
 u'com',
 u'jp',
 u'fr',
 u'net',
 u'co',
 u'de',
 u'com',
 u'com',
 u'com',
 u'com',
 u'tv',
 u'com',
 u'com',
 u'com',
 u'com',
 u'ru',
 u'com',
 u'ru',
 u'com']

At this point, `tlds.countByValue()` would give us each TLD and the number of times that it appears in the top1m file. Note that this function returns the results as a `defaultdict` in the Python environment, not as an RDD. But we want it reverse sorted by count. To do this, we can set a variable called `tlds_and_counts` equal to `tlds.countByValue()`, swap each pair so that the count comes first, and then reverse sort and take the top 50. The first two steps look like this:

```
tlds_and_counts = tlds.countByValue()
counts_and_tlds = [(count,domain) for (domain,count) in tlds_and_counts.items()]
```
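
To see what these two lines do without a cluster, here is a local sketch that uses `collections.Counter` as a stand-in for `countByValue()` on a short hypothetical list of TLDs. The stand-in and the `sample_` names are assumptions for illustration; the real call runs on the RDD.

```python
from collections import Counter

# Hypothetical short list of TLDs standing in for the `tlds` RDD.
sample_tlds = ["com", "com", "com", "org", "com", "in", "cn", "jp", "ru", "cn"]

# Counter plays the role of countByValue(): a mapping from value to count.
sample_tlds_and_counts = Counter(sample_tlds)

# Swap each pair so the count comes first; tuples then sort by count.
sample_counts_and_tlds = [(count, domain) for (domain, count) in sample_tlds_and_counts.items()]

top3 = sorted(sample_counts_and_tlds, reverse=True)[:3]
print(top3)  # [(4, 'com'), (2, 'cn'), (1, 'ru')]
```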

In the following cell, run the code above to produce the Python dictionary and the list of `(count, domain)` pairs.

In [22]:
tlds_and_counts = tlds.countByValue()
counts_and_tlds = [(count,domain) for (domain,count) in tlds_and_counts.items()]

In the following cell, reverse sort `counts_and_tlds` and display the first 50.

In [23]:
counts_and_tlds_rdd = sc.parallelize(counts_and_tlds)
counts_and_tlds_rdd.takeOrdered(50, key=lambda pair: -pair[0])

[(484592, u'com'),
 (45610, u'org'),
 (41336, u'net'),
 (40239, u'ru'),
 (34374, u'de'),
 (28186, u'br'),
 (18616, u'uk'),
 (16903, u'pl'),
 (15507, u'ir'),
 (12239, u'it'),
 (12041, u'in'),
 (10346, u'fr'),
 (9411, u'au'),
 (8753, u'jp'),
 (8414, u'info'),
 (8070, u'cz'),
 (6518, u'es'),
 (6340, u'nl'),
 (6262, u'ua'),
 (6086, u'co'),
 (5706, u'cn'),
 (5634, u'ca'),
 (5596, u'io'),
 (5246, u'tw'),
 (5009, u'eu'),
 (4812, u'kr'),
 (4794, u'gr'),
 (4788, u'ch'),
 (4512, u'mx'),
 (3841, u'ro'),
 (3836, u'se'),
 (3631, u'no'),
 (3608, u'at'),
 (3484, u'me'),
 (3469, u'tv'),
 (3392, u'be'),
 (3267, u'za'),
 (3266, u'hu'),
 (3076, u'vn'),
 (3039, u'sk'),
 (3020, u'us'),
 (3013, u'ar'),
 (2798, u'edu'),
 (2769, u'dk'),
 (2553, u'tr'),
 (2439, u'pt'),
 (2300, u'biz'),
 (2256, u'cl'),
 (2228, u'id'),
 (2154, u'fi')]
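
Because `counts_and_tlds` is already an ordinary Python list on the driver, the same ordering can also be produced locally without turning it back into an RDD. A minimal sketch with `heapq.nlargest`, using a few `(count, tld)` pairs copied from the output above and deliberately shuffled; the variable name `counts_and_tlds_sample` is hypothetical.

```python
import heapq

# A few (count, tld) pairs from the output above, in shuffled order.
counts_and_tlds_sample = [
    (41336, "net"),
    (484592, "com"),
    (34374, "de"),
    (45610, "org"),
    (40239, "ru"),
]

# Keep the three pairs with the largest counts, largest first.
top3 = heapq.nlargest(3, counts_and_tlds_sample, key=lambda pair: pair[0])
print(top3)  # [(484592, 'com'), (45610, 'org'), (41336, 'net')]
```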

**Question:** `top1m.collect()[0:50]` and `top1m.take(50)` produce the same result. Which one is more efficient and why? Put your answer in the cell below.

In [None]:
## Answer Here (don't run this cell)
Although top1m.collect()[0:50] and top1m.take(50) produce the same result, top1m.take(50) is
more efficient. It returns almost immediately, while top1m.collect()[0:50] takes around 30 seconds
to execute. collect() copies the entire dataset to the driver (the master node) before the slice
is applied, and it can raise an out-of-memory exception if the data is too large to fit in the
driver's memory. take(50) retrieves only the requested number of elements.
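
The difference can be illustrated with a plain-Python analogy that needs no cluster. The generator below stands in for a data source and records how many elements it had to produce. This is only an analogy under that assumption, not a description of how Spark schedules `take` or `collect` internally.

```python
from itertools import islice

def numbered_lines(n, log):
    """Yield n fake lines, recording each one produced in `log`."""
    for i in range(n):
        log.append(i)
        yield "line-%d" % i

collect_log, take_log = [], []

# collect()-style: materialize everything, then slice.
first_50_collect = list(numbered_lines(1000, collect_log))[0:50]

# take()-style: stop as soon as 50 elements have been produced.
first_50_take = list(islice(numbered_lines(1000, take_log), 50))

print(first_50_collect == first_50_take)  # True
print(len(collect_log), len(take_log))    # 1000 50
```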


When you finish this problem, click File -> 'Save and Checkpoint' in the menu bar to make sure that the latest version of the workbook file is saved. Also, before you close this notebook and move on, make sure you disconnect your SparkContext; otherwise you will not be able to re-allocate resources. Remember, you will commit the .ipynb file to the repository for submission (in the master node terminal).

In [24]:
sc.stop()