# Problem 1 - Working with RDDs

This is an interactive PySpark session. Remember that when you open this notebook the `SparkContext` and `SparkSession` are already created, and they are in the `sc` and `spark` variables, respectively. You can run the following two cells to make sure that the Kernel is active.

**Do not insert any cells other than the ones that are provided.**

In [1]:
import findspark
findspark.init()
from pyspark import SparkContext
from pyspark.sql import SparkSession
sc = SparkContext()
spark = SparkSession.builder.appName("problem1").getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/10/16 03:42:51 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
22/10/16 03:42:58 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!


In [2]:
sc

In [3]:
! hdfs dfs -put -l top-1m.csv top-1m.csv

put: `top-1m.csv': File exists


In the following cell, make an RDD called `top1m` that contains the contents of the file `top-1m.csv` that you placed into the cluster's HDFS.

In [4]:
top1m = sc.textFile("top-1m.csv")

There is one element in the RDD for each line in the file. The `.count()` method will compute how many lines are in the file. In the following cell, type the expression to count the lines in the `top1m` RDD. Run the cell and see the result.

In [5]:
top1m.count()

1000000

## Count the `.com` domains

How many of the websites in this RDD are in the .com domain?

In the following cell, write a code snippet that finds the records with `.com` at the end of the line and counts them. (Hint: use a regular expression.)

In [6]:
import re
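# Raw string avoids the invalid "\." escape warning; "$" anchors the match at the end of the line.
end_w_com = top1m.filter(lambda x: re.search(r"\.com$", x))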

In [7]:
end_w_com.count()

484593
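
To see why the pattern is anchored with `$`, the same test can be tried on a few plain strings without Spark. This is only a minimal sketch: the `sample` list is made up for illustration, and `str.endswith` is shown as one possible equivalent to the regular expression, not as the required approach.

```python
import re

# Hypothetical sample lines in the same "rank,domain" format as top-1m.csv.
sample = ["1,google.com", "2,sina.com.cn", "3,example.community", "4,t.co"]

# Same idea as the RDD filter: the line must end in a literal ".com".
com_pattern = re.compile(r"\.com$")
by_regex = [line for line in sample if com_pattern.search(line)]

# Equivalent check without a regular expression.
by_suffix = [line for line in sample if line.endswith(".com")]

print(by_regex)
print(by_suffix)
```

Lines such as `sina.com.cn` contain `.com` but do not end with it, so only the anchored pattern counts the right records.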

## Histogram the Top Level Domains (TLDs)

What is the distribution of TLDs in the top 1 million websites? We can compute this using the RDD function `countByValue()` in this section.

In the following cell, write a function called `tld` (in Python) that takes a domain name string and outputs the top-level domain by grabbing the final characters after the last period in the line.

In [8]:
top1m.take(10)

['1,google.com',
 '2,youtube.com',
 '3,facebook.com',
 '4,baidu.com',
 '5,wikipedia.org',
 '6,yahoo.com',
 '7,qq.com',
 '8,amazon.com',
 '9,taobao.com',
 '10,twitter.com']

In [9]:
def tld(domain):
    result = re.sub('.*\\.', '', domain)
    return result
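
The function can be checked on a few lines outside Spark. The sketch below repeats `tld` so that it runs on its own, and adds `tld_split`, a hypothetical alternative that is not part of the assignment, to show that splitting once from the right gives the same answer.

```python
import re

def tld(domain):
    # Greedy ".*\." consumes everything up to and including the last period.
    return re.sub(r".*\.", "", domain)

def tld_split(domain):
    # Hypothetical alternative: split once from the right and keep the last piece.
    return domain.rsplit(".", 1)[-1]

# Lines taken from the top1m.take(...) output in this notebook.
for line in ["1,google.com", "11,google.co.in", "18,sina.com.cn"]:
    print(line, "->", tld(line), tld_split(line))
```

Both versions return the input unchanged when it contains no period, which is worth keeping in mind if the file ever holds malformed lines.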

In the following cell, map the `top1m` RDD using `tld` into a new RDD called `tlds`. 

In [10]:
tlds = top1m.map(tld)

In the following two cells, evaluate `top1m.first()` and `tlds.first()` to see if the first line of `top1m` transformed by `tld` is properly represented as the first line of `tlds`.

In [11]:
top1m.first()

'1,google.com'

In [12]:
tlds.first()

'com'

Look at the first 50 elements of `top1m` by evaluating `top1m.take(50)`.

In [13]:
top1m.take(50)

['1,google.com',
 '2,youtube.com',
 '3,facebook.com',
 '4,baidu.com',
 '5,wikipedia.org',
 '6,yahoo.com',
 '7,qq.com',
 '8,amazon.com',
 '9,taobao.com',
 '10,twitter.com',
 '11,google.co.in',
 '12,tmall.com',
 '13,instagram.com',
 '14,live.com',
 '15,vk.com',
 '16,sohu.com',
 '17,jd.com',
 '18,sina.com.cn',
 '19,reddit.com',
 '20,weibo.com',
 '21,google.co.jp',
 '22,yandex.ru',
 '23,360.cn',
 '24,blogspot.com',
 '25,login.tmall.com',
 '26,linkedin.com',
 '27,pornhub.com',
 '28,google.ru',
 '29,netflix.com',
 '30,google.com.br',
 '31,google.com.hk',
 '32,google.co.uk',
 '33,bongacams.com',
 '34,yahoo.co.jp',
 '35,google.fr',
 '36,csdn.net',
 '37,t.co',
 '38,google.de',
 '39,ebay.com',
 '40,microsoft.com',
 '41,alipay.com',
 '42,office.com',
 '43,twitch.tv',
 '44,msn.com',
 '45,bing.com',
 '46,xvideos.com',
 '47,microsoftonline.com',
 '48,mail.ru',
 '49,pages.tmall.com',
 '50,ok.ru']

Try the same thing with the `tlds` RDD to make sure that the first 50 lines were properly transformed.


In [14]:
tlds.take(50)

['com',
 'com',
 'com',
 'com',
 'org',
 'com',
 'com',
 'com',
 'com',
 'com',
 'in',
 'com',
 'com',
 'com',
 'com',
 'com',
 'com',
 'cn',
 'com',
 'com',
 'jp',
 'ru',
 'cn',
 'com',
 'com',
 'com',
 'com',
 'ru',
 'com',
 'br',
 'hk',
 'uk',
 'com',
 'jp',
 'fr',
 'net',
 'co',
 'de',
 'com',
 'com',
 'com',
 'com',
 'tv',
 'com',
 'com',
 'com',
 'com',
 'ru',
 'com',
 'ru']

There is a better way to make these comparisons rather than looking back and forth between the raw and transformed data. Use the `zip()` function with both `take` outputs to print both the raw and extracted data versions on the same line, one line per record.
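
As a rough sketch of the pairing, the snippet below uses two short plain lists as stand-ins for the `take` outputs (the values are a few lines from the outputs above) and prints one record per line. The column width of 20 is an arbitrary choice for illustration.

```python
# Stand-ins for top1m.take(...) and tlds.take(...); a few lines from the outputs above.
raw = ["1,google.com", "2,youtube.com", "5,wikipedia.org"]
new = ["com", "com", "org"]

# zip pairs the i-th raw line with the i-th extracted TLD.
pairs = list(zip(raw, new))

for line, extracted in pairs:
    # Left-align the raw line in a 20-character column, then show the TLD.
    print(f"{line:<20} {extracted}")
```

Because `zip` stops at the shorter input, both `take` calls should request the same number of elements.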

In [15]:
raw = top1m.take(50)
new = tlds.take(50)
zipped = zip(raw,new)
list(zipped)

[('1,google.com', 'com'),
 ('2,youtube.com', 'com'),
 ('3,facebook.com', 'com'),
 ('4,baidu.com', 'com'),
 ('5,wikipedia.org', 'org'),
 ('6,yahoo.com', 'com'),
 ('7,qq.com', 'com'),
 ('8,amazon.com', 'com'),
 ('9,taobao.com', 'com'),
 ('10,twitter.com', 'com'),
 ('11,google.co.in', 'in'),
 ('12,tmall.com', 'com'),
 ('13,instagram.com', 'com'),
 ('14,live.com', 'com'),
 ('15,vk.com', 'com'),
 ('16,sohu.com', 'com'),
 ('17,jd.com', 'com'),
 ('18,sina.com.cn', 'cn'),
 ('19,reddit.com', 'com'),
 ('20,weibo.com', 'com'),
 ('21,google.co.jp', 'jp'),
 ('22,yandex.ru', 'ru'),
 ('23,360.cn', 'cn'),
 ('24,blogspot.com', 'com'),
 ('25,login.tmall.com', 'com'),
 ('26,linkedin.com', 'com'),
 ('27,pornhub.com', 'com'),
 ('28,google.ru', 'ru'),
 ('29,netflix.com', 'com'),
 ('30,google.com.br', 'br'),
 ('31,google.com.hk', 'hk'),
 ('32,google.co.uk', 'uk'),
 ('33,bongacams.com', 'com'),
 ('34,yahoo.co.jp', 'jp'),
 ('35,google.fr', 'fr'),
 ('36,csdn.net', 'net'),
 ('37,t.co', 'co'),
 ('38,google.de', '

At this point, `tlds.countByValue()` would give us each TLD and the number of times that it appears in the `top-1m.csv` file. Note that this function returns the results as a `defaultdict` in the Python environment, not as an RDD. We want the results sorted by count in descending order. To do this, we can set a variable called `tlds_and_counts` equal to `tlds.countByValue()`, swap each pair so that the count comes first, sort in reverse order, and take the top 50, like this:

```
tlds_and_counts = tlds.countByValue()
counts_and_tlds = [(count,domain) for (domain,count) in tlds_and_counts.items()]
counts_and_tlds.sort(reverse=True)
counts_and_tlds[0:50]
```

In the following cell, run the code above to produce a Python list of the top 50 TLDs, sorted by descending count.

In [16]:
tlds_and_counts = tlds.countByValue()
counts_and_tlds = [(count,domain) for (domain,count) in tlds_and_counts.items()]
counts_and_tlds.sort(reverse=True)
counts_and_tlds[0:50]

[(484593, 'com'),
 (45610, 'org'),
 (41336, 'net'),
 (40239, 'ru'),
 (34374, 'de'),
 (28186, 'br'),
 (18616, 'uk'),
 (16903, 'pl'),
 (15507, 'ir'),
 (12239, 'it'),
 (12041, 'in'),
 (10346, 'fr'),
 (9411, 'au'),
 (8753, 'jp'),
 (8414, 'info'),
 (8070, 'cz'),
 (6518, 'es'),
 (6340, 'nl'),
 (6262, 'ua'),
 (6086, 'co'),
 (5706, 'cn'),
 (5634, 'ca'),
 (5596, 'io'),
 (5246, 'tw'),
 (5009, 'eu'),
 (4812, 'kr'),
 (4794, 'gr'),
 (4788, 'ch'),
 (4512, 'mx'),
 (3841, 'ro'),
 (3836, 'se'),
 (3631, 'no'),
 (3608, 'at'),
 (3484, 'me'),
 (3469, 'tv'),
 (3392, 'be'),
 (3267, 'za'),
 (3266, 'hu'),
 (3076, 'vn'),
 (3039, 'sk'),
 (3020, 'us'),
 (3013, 'ar'),
 (2798, 'edu'),
 (2769, 'dk'),
 (2553, 'tr'),
 (2439, 'pt'),
 (2300, 'biz'),
 (2256, 'cl'),
 (2228, 'id'),
 (2154, 'fi')]
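
Swapping each pair to `(count, domain)` is one way to sort by count. A possible alternative, sketched here on a small hypothetical stand-in for `tlds.countByValue()` (four counts copied from the output above), is to keep the `(domain, count)` order and pass a `key` to `sorted`.

```python
from collections import defaultdict

# Hypothetical stand-in for tlds.countByValue(), using counts from the output above.
tlds_and_counts = defaultdict(
    int, {"org": 45610, "com": 484593, "ru": 40239, "net": 41336}
)

# Sort the (tld, count) pairs by count, largest first, without swapping the tuple order.
top = sorted(tlds_and_counts.items(), key=lambda item: item[1], reverse=True)

print(top[:3])
```

Sorting by key leaves ties in their original order, whereas sorting swapped tuples breaks ties by the TLD string, so the two approaches can order equal counts differently.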

**Question:** `top1m.collect()[0:50]` and `top1m.take(50)` produce the same result. Which one is more efficient and why? Put your answer in the cell below.

In my opinion, `top1m.take(50)` is more efficient and needs less running time. `top1m.collect()[0:50]` first retrieves every element of the RDD (all one million lines) from the executors into the driver's memory, and only then slices off the first 50 in Python. `top1m.take(50)` scans one partition first and reads further partitions only if it still needs more elements, so it transfers far less data to the driver and finishes sooner.
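
The difference can be illustrated with a toy model in plain Python. This is not Spark's implementation (Spark scans partitions in batches and schedules jobs on the cluster). The `collect` and `take` functions below are hypothetical helpers that only count how many partitions each approach would have to read.

```python
# Toy model only: each inner list plays the role of one RDD partition.
partitions = [list(range(i * 100, (i + 1) * 100)) for i in range(10)]

def collect(parts):
    """Read every partition and return all elements plus the partitions read."""
    read = 0
    out = []
    for part in parts:
        read += 1
        out.extend(part)
    return out, read

def take(parts, n):
    """Read partitions in order and stop once n elements are gathered."""
    read = 0
    out = []
    for part in parts:
        if len(out) >= n:
            break
        read += 1
        out.extend(part[: n - len(out)])
    return out, read

all_items, read_by_collect = collect(partitions)
first_50, read_by_take = take(partitions, 50)

print(all_items[:50] == first_50)      # both approaches give the same 50 elements
print(read_by_collect, read_by_take)   # partitions touched by each approach
```

In this model the first partition already holds 50 elements, so `take` stops after reading it while `collect` reads all ten.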

### **Run the following cell to export your final ordered TLD counts to a JSON file**

In [17]:
import json
# Use a context manager so the file is flushed and closed after writing.
with open('problem-1-soln.json', 'w') as fp:
    json.dump(counts_and_tlds, fp)
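
One detail to be aware of when reading the file back: JSON has no tuple type, so each `(count, domain)` tuple is written as a two-element array and comes back as a list. The sketch below shows this on a short list in the same shape as `counts_and_tlds`, using an in-memory string instead of a file.

```python
import json

# A short list in the same (count, tld) shape as counts_and_tlds.
counts_and_tlds = [(484593, "com"), (45610, "org")]

# Serialize to a JSON string, then parse it back.
text = json.dumps(counts_and_tlds)
restored = json.loads(text)

print(text)
print(restored)
```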

When you finish this problem, click File -> 'Save and Checkpoint' in the menu bar to make sure that the latest version of the notebook file is saved. Also, before you close this notebook and move on, make sure you stop your SparkContext; otherwise you will not be able to re-allocate resources. Remember, you will commit the .ipynb file to the repository for submission (from the master node terminal).

In [18]:
sc.stop()