# Homework 4 - Spark

In this homework, we are practicing Apache Spark.

You are required to turn in this notebook as BDM\_HW4\_Spark\_**NetId**.ipynb. You will be asked to complete each task using Apache Spark. Output can be printed in the notebook.

## Task 1 (5 points)

You are asked to implement Homework 3 using Spark. The description is provided below for your convenience.

You are asked to implement the Social Triangle example discussed in class. In particular, given the email dataset, please list all "reciprocal" relationships in the company. Recall that:

If A emails B and B emails A, then A and B are *reciprocal*.

If A emails B but B doesn’t email A, then the relationship between A and B is *directed*.
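
The two definitions above can be checked on a small example. The following is a minimal plain-Python sketch, not the Spark solution; the function name `classify_pairs` and the sample edges are hypothetical and only illustrate the rule.

```python
def classify_pairs(edges):
    """Split directed (sender, recipient) edges into reciprocal and directed pairs."""
    edge_set = set(edges)  # repeated emails between the same two people count once
    reciprocal, directed = set(), set()
    for a, b in edge_set:
        if a == b:
            continue  # assumption: self-addressed emails are ignored
        if (b, a) in edge_set:
            # Store each reciprocal pair once, in sorted order.
            reciprocal.add(tuple(sorted((a, b))))
        else:
            directed.add((a, b))
    return reciprocal, directed


# Hypothetical edges: A and B email each other, A emails C only one way.
edges = [("A", "B"), ("B", "A"), ("A", "C"), ("A", "B")]
reciprocal, directed = classify_pairs(edges)
print(sorted(reciprocal))  # [('A', 'B')]
print(sorted(directed))    # [('A', 'C')]
```

Sorting each reciprocal pair before storing it means that A -> B and B -> A collapse into a single entry.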

**Dataset:** We will use a subset of the open [Enron Email Dataset](https://www.cs.cmu.edu/~./enron/ "Enron Email Dataset"), which contains approximately 10,000 simplified email headers from the Enron Corporation. You can download this dataset from NYU Classes as **enron_mails_small.csv**. The file contains 3 columns *Date*, *From*, and *To*. Their description is as follows:

|Column name|Description|
|--|--|
|Date |The date and time of the email, in the format YYYY-MM-DD hh:mm:ss, <br />e.g. "1998-10-30 07:43:00" |
|From |The sender email address, <br />e.g. "mark.taylor@enron.com" |
|To | A list of recipients' email addresses separated by semicolons ';', <br />e.g. "jennifer.fraser@enron.com;jeffrey.hodge@enron.com" |

Note that we only care about users employed by Enron, i.e. only relationships in which both email addresses end with *'@enron.com'*.
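
As a rough illustration of how one row expands into sender/recipient pairs under this rule, here is a plain-Python sketch. The helper name `enron_pairs` and the address `someone@example.com` are made up for the example; the other addresses come from the table above.

```python
def enron_pairs(sender, recipients):
    """Yield (sender, recipient) pairs where both addresses end with '@enron.com'."""
    domain = "@enron.com"
    if not sender.endswith(domain):
        return
    # The To column holds recipients separated by semicolons.
    for recipient in recipients.split(";"):
        recipient = recipient.strip()
        if recipient.endswith(domain):
            yield sender, recipient


# 'someone@example.com' is a hypothetical non-Enron address.
row = ("mark.taylor@enron.com", "jennifer.fraser@enron.com;someone@example.com")
print(list(enron_pairs(*row)))
# [('mark.taylor@enron.com', 'jennifer.fraser@enron.com')]
```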

The expected output is also provided below. For each reciprocal relationship, please output a tuple consisting of two strings. The first is always **'reciprocal'**. The second is a string showing the names of the two people in the following format: **'Jane Doe : John Doe'**. The names should be presented in lexicographic order, i.e. there will not be a 'John Doe : Jane Doe', since 'Jane' is ordered before 'John'.

Though the dataset only contains email addresses, not actual names, we're assuming that the email aliases were created based on their name. For example:

|Email Address|Converted Name|
|--|--|
|mark.taylor@enron.com|Mark Taylor|
|alan.aronowitz@enron.com|Alan Aronowitz|
|marc.r.cutler@enron.com|Marc R Cutler|
|hugh@enron.com|Hugh|
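
The naming convention in the table, together with the ordering rule for the output string, can be sketched in plain Python as follows. The helper names `email_to_name` and `format_pair` are hypothetical and are not part of the assignment.

```python
def email_to_name(address):
    """Turn an alias such as 'marc.r.cutler@enron.com' into 'Marc R Cutler'."""
    alias = address.split("@")[0]
    return " ".join(part.capitalize() for part in alias.split("."))


def format_pair(name_a, name_b):
    """Join two names in lexicographic order, e.g. 'Jane Doe : John Doe'."""
    return " : ".join(sorted((name_a, name_b)))


print(email_to_name("mark.taylor@enron.com"))   # Mark Taylor
print(email_to_name("marc.r.cutler@enron.com")) # Marc R Cutler
print(email_to_name("hugh@enron.com"))          # Hugh
print(format_pair("John Doe", "Jane Doe"))      # Jane Doe : John Doe
```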

Please fill the code block with a series of MapReduce jobs using your own mapper and reducer functions. Be sure to include the naming convention logic in one of your mappers and/or reducers.

In [1]:
EN_FN='enron_mails_small.csv'

In [2]:
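def oneToone(pid, rows):
    # The header line is already consumed by spark.read (header=True), so every
    # row here is data; pid is unused but required by mapPartitionsWithIndex.
    for row in rows:
        sender, receivers = row
        if not sender or not receivers:
            continue  # skip rows with a missing sender or recipient list
        # Emit one (sender, recipient) pair per recipient.
        for receiver in receivers.split(';'):
            yield (sender, receiver)

def aliasToName(address):
    # 'mark.taylor@enron.com' -> 'Mark Taylor'
    alias = address.split('@')[0]
    return ' '.join(part.capitalize() for part in alias.split('.'))

def emailToName(x):
    # Convert both addresses to names and attach a count of 1.
    tx, rx = x
    return ((aliasToName(tx), aliasToName(rx)), 1)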

In [3]:
dfEmail = spark.read.load(EN_FN,format='csv',header = True,inferSchema=True)

In [4]:
dfEmail = dfEmail.select('From','To')
rddEmail = (
    dfEmail.rdd.mapPartitionsWithIndex(oneToone)
    # Keep pairs where both addresses are Enron addresses with a dotted alias.
    .filter(lambda x: '@enron.com' in x[0] and '.' in x[0].split('@')[0]
            and '@enron.com' in x[1] and '.' in x[1].split('@')[0])
    .map(emailToName)
    # Collapse repeated emails so each directed (sender, recipient) pair counts once.
    .reduceByKey(lambda x, y: 1)
    # Key both directions by the same sorted pair, then count the directions seen.
    .map(lambda x: (tuple(sorted(x[0])), 1))
    .reduceByKey(lambda x, y: x + y)
    # A count above 1 means both A -> B and B -> A exist.
    .filter(lambda x: x[1] > 1)
    .sortByKey()
    .map(lambda x: ('reciprocal', x[0][0] + ' : ' + x[0][1]))
)
rddEmail.collect()

[('reciprocal', 'Brenda Whitehead : Elizabeth Sager'),
 ('reciprocal', 'Carol Clair : Debra Perlingiere'),
 ('reciprocal', 'Carol Clair : Mark Taylor'),
 ('reciprocal', 'Carol Clair : Richard Sanders'),
 ('reciprocal', 'Carol Clair : Sara Shackleton'),
 ('reciprocal', 'Carol Clair : Tana Jones'),
 ('reciprocal', 'Debra Perlingiere : Kevin Ruscitti'),
 ('reciprocal', 'Drew Fossum : Susan Scott'),
 ('reciprocal', 'Elizabeth Sager : Janette Elbertson'),
 ('reciprocal', 'Elizabeth Sager : Mark Haedicke'),
 ('reciprocal', 'Elizabeth Sager : Mark Taylor'),
 ('reciprocal', 'Elizabeth Sager : Richard Sanders'),
 ('reciprocal', 'Eric Bass : Susan Scott'),
 ('reciprocal', 'Fletcher Sturm : Greg Whalley'),
 ('reciprocal', 'Fletcher Sturm : Sally Beck'),
 ('reciprocal', 'Gerald Nemec : Susan Scott'),
 ('reciprocal', 'Grant Masson : Vince Kaminski'),
 ('reciprocal', 'Greg Whalley : Richard Sanders'),
 ('reciprocal', 'Janette Elbertson : Mark Taylor'),
 ('reciprocal', 'Janette El

## Task 2 (5 points)

You are asked to implement Task 2 of Lab 5. The description is provided below for your convenience.

We’ll be using two NYC open data sets: the SAT Results and the NYC High School Directory data sets. Both can be downloaded from the links below, or from online class resources.

**Dataset**: *Please note that each school is uniquely identified by a DBN code, which appears in both data sets.*

**SAT_Results.csv**
Source: https://nycopendata.socrata.com/Education/SAT-Results/f9bf-2cp4  
Description: “The most recent school level results for New York City on the SAT. Results are available at the school level for the graduating seniors of 2012.”

**DOE_High_School_Directory_2014-2015.csv**
Source: https://data.cityofnewyork.us/Education/DOE-High-School-Directory-2014-2015/n3p6-zve2  
Description: “Directory of NYC High Schools.”

We would like to know how the Math scores vary across bus lines or subway lines serving the schools. Your task is to compute the average Math scores of all schools along each bus line and subway line. You can find the bus and subway lines serving each school in the High School Directory, in the *bus* and *subway* columns.

The expected results are two lists:
1. A list of key/value pairs with bus lines as keys and average Math scores as values.
2. A list of key/value pairs with subway lines as keys and average Math scores as values.
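
One way to read "average Math score along a line" is an average weighted by the number of test takers at each school, which is the interpretation used by the `ntakers` and `total` columns in this notebook. The plain-Python sketch below illustrates that arithmetic; the function name `average_by_line` and the sample numbers are hypothetical.

```python
from collections import defaultdict


def average_by_line(schools):
    """schools: iterable of (lines, ntakers, avg_score) -> {line: weighted average}."""
    takers = defaultdict(int)
    totals = defaultdict(int)
    for lines, ntakers, score in schools:
        for line in lines:
            takers[line] += ntakers
            # ntakers * score is the sum of Math scores at this school.
            totals[line] += ntakers * score
    return {line: totals[line] / takers[line] for line in takers}


# Made-up data: two schools share the M15 line.
schools = [
    (["M14A", "M15"], 10, 400),
    (["M15"], 30, 500),
]
print(average_by_line(schools))  # {'M14A': 400.0, 'M15': 475.0}
```

For M15 the result is (10 × 400 + 30 × 500) / (10 + 30) = 475, which leans toward the larger school, whereas an unweighted mean of the two school averages would give 450.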

The top ten lines with the highest scores are shown below.

In [5]:
SAT_FN = 'SAT_Results.csv'
HSD_FN = 'DOE_High_School_Directory_2014-2015.csv'

In [6]:
dfScores = spark.read.load(SAT_FN, format='csv', header=True, inferSchema=True)
# Rows whose counts or scores are null after the cast are removed by na.drop().
dfScores = dfScores.select('DBN',
                          dfScores['Num of SAT Test Takers'].cast('int').alias('ntakers'),
                          dfScores['`SAT Math Avg. Score`'].cast('int').alias('score')
                          ).na.drop()
# total = test takers x average score, i.e. the sum of Math scores at each school.
dfScores = dfScores.select('DBN', 'ntakers',
                           (dfScores.ntakers * dfScores.score).alias('total'))
dfScores.show()

+------+-------+------+
|   DBN|ntakers| total|
+------+-------+------+
|02M047|     16|  6400|
|21K410|    475|207575|
|30Q301|     98| 43120|
|17K382|     59| 22066|
|18K637|     35| 13335|
|32K403|     50| 18300|
|09X365|     54| 18306|
|11X270|     56| 22064|
|05M367|     33| 12078|
|14K404|     68| 24276|
|30Q575|    135| 66420|
|13K336|      9|  3366|
|04M635|     48| 17712|
|24Q264|     89| 40406|
|17K408|     57| 19494|
|19K618|     60| 22260|
|27Q309|     36| 13644|
|32K552|     67| 24388|
|13K499|     72| 26208|
|07X600|     76| 30400|
+------+-------+------+
only showing top 20 rows



In [7]:
dfSchools = spark.read.load(HSD_FN,format='csv',header = True,inferSchema=True)
dfSchools = dfSchools.na.drop(subset=['boro'])
dfSchools = dfSchools.select('dbn','bus','subway')
dfSchools.show()

+------+--------------------+--------------------+
|   dbn|                 bus|              subway|
+------+--------------------+--------------------+
|01M292|B39, M14A, M14D, ...|B, D to Grand St ...|
|01M448|M14A, M14D, M15, ...|F to East Broadwa...|
|01M450|M101, M102, M103,...|6 to Astor Place ...|
|01M509|B39, M103, M14A, ...|B, D to Grand St ...|
|01M539|B39, M14A, M14D, ...|F, J, M, Z to Del...|
|01M696|M14A, M14D, M21, ...|                 N/A|
|02M047|M101, M102, M14A,...|4, 5, Q to 14th S...|
|02M135|M10, M104, M11, M...|1, C, E to 50th S...|
|02M139|M103, M15, M22, M...|1 to Chambers St ...|
|02M260|M104, M11, M20, M...|1, 2, 3, A, C, E ...|
|02M280|M103, M15, M22, M...|1 to Chambers St ...|
|02M282|M103, M15, M22, M...|1 to Chambers St ...|
|02M288|M104, M11, M31, M...|     C, E to 50th St|
|02M294|B39, M103, M14A, ...|B, D to Grand St ...|
|02M296|M104, M11, M31, M...|     C, E to 50th St|
|02M298|B39, M103, M14A, ...|6, N, Q, R to Can...|
|02M300|M104, M11, M31, M...|  

In [8]:
dfResults = dfSchools.join(dfScores, dfSchools.dbn==dfScores.DBN,how = 'inner')
dfResults.show()

+------+--------------------+--------------------+------+-------+------+
|   dbn|                 bus|              subway|   DBN|ntakers| total|
+------+--------------------+--------------------+------+-------+------+
|01M292|B39, M14A, M14D, ...|B, D to Grand St ...|01M292|     29| 11716|
|01M448|M14A, M14D, M15, ...|F to East Broadwa...|01M448|     91| 38493|
|01M450|M101, M102, M103,...|6 to Astor Place ...|01M450|     70| 28140|
|01M509|B39, M103, M14A, ...|B, D to Grand St ...|01M509|     44| 19052|
|01M539|B39, M14A, M14D, ...|F, J, M, Z to Del...|01M539|    159| 91266|
|01M696|M14A, M14D, M21, ...|                 N/A|01M696|    130| 78520|
|02M047|M101, M102, M14A,...|4, 5, Q to 14th S...|02M047|     16|  6400|
|02M288|M104, M11, M31, M...|     C, E to 50th St|02M288|     62| 24366|
|02M294|B39, M103, M14A, ...|B, D to Grand St ...|02M294|     53| 20352|
|02M296|M104, M11, M31, M...|     C, E to 50th St|02M296|     58| 21750|
|02M298|B39, M103, M14A, ...|6, N, Q, R to Can...|0

### Bus lines

In [9]:
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
dfResultsByBus = dfResults.drop('subway')
dfResultsByBus = dfResultsByBus.withColumn("Bus", explode(split(dfResultsByBus.bus, "[,]")))


In [10]:
from pyspark.sql.functions import trim
dfResultsByBus = dfResultsByBus.withColumn("Bus", trim(dfResultsByBus.Bus))

In [11]:
dfResultsByBus = dfResultsByBus.groupBy('Bus').sum('ntakers','total').na.drop()
dfResultsByBus = dfResultsByBus.filter(dfResultsByBus['Bus']!='N/A')

In [12]:
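# Weighted average per bus line: sum of scores divided by number of test takers.
dfResultsByBus = dfResultsByBus.withColumn('avg', dfResultsByBus['sum(total)']/dfResultsByBus['sum(ntakers)'])\
              .select('Bus','avg')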

In [13]:
from pyspark.sql.functions import desc
dfResultsByBus = dfResultsByBus.sort(desc('avg'))

In [14]:
listResultsByBus = dfResultsByBus.select('Bus','avg').rdd.map(lambda x: {x[0]:x[1]}).collect()


In [15]:
listResultsByBus[:20]

[{'S1115': 612.2545811518324},
 {'M79': 594.0},
 {'Q42': 582.6455026455027},
 {'M22': 574.1115190454337},
 {'Bx3': 571.8109992254067},
 {'B52': 560.9733201581028},
 {'B63': 557.9150355871886},
 {'B69': 548.8451901565995},
 {'B54': 543.1855184233076},
 {'B25': 541.0064543889845},
 {'M20': 540.254762509525},
 {'M9': 539.259748427673},
 {'M86': 538.8404255319149},
 {'B65': 538.302463891249},
 {'B45': 534.9575638506876},
 {'Bx10': 534.8907249466951},
 {'Bx26': 533.5892566467716},
 {'B103': 531.7565379825654},
 {'Q64': 529.5889724310777},
 {'Bx22': 525.0057273768614}]

### Subway lines

In [16]:
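from pyspark.sql.functions import udf, col
def getSubWayLine(x):
    # Each ';'-separated group names some lines followed by "to <station>";
    # keep only the text before "to", i.e. the line names.
    groups = x.split(';')
    lines = [group.split('to')[0] for group in groups]
    # Rejoin with commas so the result can be split and exploded like the bus column.
    return ",".join(lines)
getSubWayLineStr = udf(getSubWayLine)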
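
To see what the subway parsing is meant to produce, here is a plain-Python sketch that runs without Spark. It assumes the field looks like `'B, D to Grand St ; F to East Broadway'`, a format inferred from the truncated column preview and from the `';'` split in the UDF. The function name `subway_lines` is hypothetical, and the sketch splits on `' to '` with surrounding spaces, a slightly stricter variant of the UDF's `split('to')`.

```python
def subway_lines(field):
    """Extract subway line names from a directory 'subway' field."""
    lines = []
    for group in field.split(";"):
        # Everything before ' to ' is the comma-separated list of lines.
        names = group.split(" to ")[0]
        for name in names.split(","):
            name = name.strip()
            if name and name != "N/A":  # schools without subway access are skipped
                lines.append(name)
    return lines


print(subway_lines("B, D to Grand St ; F to East Broadway"))  # ['B', 'D', 'F']
print(subway_lines("N/A"))                                    # []
```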

In [17]:
dfResultsBySubway = dfResults.drop('bus')\
                             .withColumn("subway", getSubWayLineStr(col('subway')))

In [18]:
dfResultsBySubway = dfResultsBySubway.withColumn("Subway", explode(split(dfResultsBySubway.subway, "[,]")))

In [19]:
dfResultsBySubway = dfResultsBySubway.withColumn("Subway", trim(dfResultsBySubway.Subway))

In [20]:
dfResultsBySubway = dfResultsBySubway.groupBy('Subway').sum('ntakers','total')
dfResultsBySubway = dfResultsBySubway.filter(dfResultsBySubway['Subway']!='N/A')

In [21]:
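# Weighted average per subway line: sum of scores divided by number of test takers.
dfResultsBySubway = dfResultsBySubway.withColumn('avg', dfResultsBySubway['sum(total)']/dfResultsBySubway['sum(ntakers)'])\
              .select('Subway','avg')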

In [22]:
dfResultsBySubway = dfResultsBySubway.sort(desc('avg'))

In [23]:
listResultsBySubway = dfResultsBySubway.select('Subway','avg').rdd.map(lambda x: {x[0]:x[1]}).collect()

In [24]:
listResultsBySubway

[{'3': 513.4009556313994},
 {'C': 510.239433265865},
 {'A': 510.0150229357798},
 {'R': 508.6067355282978},
 {'G': 503.4458706708646},
 {'D': 502.2631520035818},
 {'E': 501.2646720368239},
 {'1': 499.84488281908614},
 {'SIR': 498.87491683300067},
 {'4': 495.29238227146817},
 {'N': 493.5055292259084},
 {'B': 491.95760524225574},
 {'2': 488.0718242975861},
 {'Q': 482.14557840292673},
 {'5': 461.0280319703463},
 {'7': 457.35861778339654},
 {'M': 454.06567963458815},
 {'F': 445.7865661411926},
 {'J': 439.1299656694458},
 {'Z': 438.12698819907644},
 {'6': 432.80367816091956},
 {'S': 427.93296529968455},
 {'L': 426.3222871994802}]