# CSCI 4253 / 5253 - Lab #4 - Patent Problem with Spark RDD
<div>
 <h2> CSCI 4253 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

This [Spark cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf) is useful.

In [18]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator
import gzip

In [19]:

def decode_citations(x):
    # Decode one raw bytes line into a tuple of stripped strings (citing, cited).
    return tuple(y.strip() for y in x.decode('utf-8').split(','))

def decode_patents(x):
    columns = x.decode('utf-8').split(',')
    return columns[0], columns[5]

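def filter_null_citations(x):
    # Keep only records where the key and both joined values are present.
    citing, merged_tuple = x
    cited, state = merged_tuple
    return bool(citing and cited and state)

def change_order(x):
    # Re-key by the cited patent; pack the citing patent and its state into one string.
    citing, merged_tuple = x
    cited, state = merged_tuple
    return cited, citing + '^' + state

def filter_null_values(x):
    # Keep only pairs where both the key and the value are non-empty.
    a, b = x
    return bool(a and b)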

def expand_values_mapper2(x):
    # Unpack the 'citing^state' string produced by change_order.
    cite, merged_tuple = x
    citation_str, state = merged_tuple
    citation, citation_state = citation_str.split('^')
    return citation, citation_state, cite, state

def filter_phase2(x):
    # Keep citations whose two states match and are not the empty quoted string '""'.
    citing, citing_state, cited, cited_state = x
    return citing_state == cited_state and citing_state != '""'

def patent_file_filter(x):
    key,value = x.decode('utf-8').split(',',1)
    return key,value

def mapper3_one(x):
    citing, citing_state,cited,cited_state = x
    return citing

def final_result_print(x):
    key, merged_tuple = x
    long_str, count = merged_tuple
    if not count:
        count = 0
    result_tuple = (int(key),)
    middle_tuple = tuple(k.strip() for k in long_str.split(','))
    result_tuple += middle_tuple
    result_tuple += (count,)
    return result_tuple


The helper functions above are used in the RDD transformations that follow.

In [None]:
conf=SparkConf().setAppName("Lab4-rddd").setMaster("local[*]")
sc = SparkContext(conf=conf)

Using PySpark and RDDs on the https://coding.csel.io machines is very slow: most of the code is executed in Python, which is much less efficient than the Java-based code that runs when you use PySpark DataFrames. Be patient, and try using `.cache()` to cache the output of joins. You may want to start with a reduced set of data before running the full task.

To that end, we've included code to just extract the last 200,000 lines of each file below using the Python "slice" notation. Using that subset of the data your "new patent" table should look like:

![Top partial 10 RDD self-state citations](top-subsample-rdd.png)

When you're ready to run the whole thing, just include all the data and run it again (...and wait...).

These two RDDs are called `rddCitations` and `rddPatents` below; you will probably want to process them further (e.g. convert them to integer types, etc.). If you haven't used Python "bytes" types before, look them up. You can convert a bytes variable `x` into a UTF-8 string using `x.decode('utf-8')`.
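
As a small illustration of the slice and decode steps described above, the sketch below builds a tiny gzip file in memory (the three rows are made-up values, not real patent data), keeps only its last two lines with a negative slice, and decodes the resulting bytes into integer pairs. It uses only the Python standard library, so it runs without Spark or the lab data files.

```python
import gzip
import io

# Build a tiny gzip file in memory so the example runs without the lab data.
# The three rows are made-up "citing,cited" pairs, not real patent data.
buffer = io.BytesIO()
with gzip.open(buffer, 'wb') as f:
    f.write(b'1,10\n2,20\n3,30\n')
buffer.seek(0)

# Reading in binary mode yields one bytes object per line.
with gzip.open(buffer, 'rb') as f:
    last_two = f.readlines()[-2:]  # negative slice: keep only the last 2 lines

# Decode each bytes line to str, split on commas, and convert fields to int.
pairs = [tuple(int(field) for field in line.decode('utf-8').split(','))
         for line in last_two]

print(last_two)
print(pairs)
```

Note that `int()` ignores the trailing newline, so no explicit `strip()` is needed when converting to integers; it is still needed if the fields stay as strings.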

In [28]:
# Load the data into citations rdd
with gzip.open('cite75_99.txt.gz', 'r') as f:
    rddCitations = sc.parallelize( f.readlines()[-800000:] )

In [29]:
# Load the data into patents rdd
with gzip.open('apat63_99.txt.gz', 'r') as f:
    rddPatents = sc.parallelize( f.readlines()[-800000:] )

Load the data into RDDs before performing the transformation operations.

In [30]:
citations = rddCitations.map(decode_citations)
patent_info = rddPatents.map(decode_patents)


The values are UTF-8 encoded bytes, so decode them and assign the results to new RDDs.

In [31]:
## MapReduce phase 1
mapper1 = citations.fullOuterJoin(patent_info)
mapper1_filter = mapper1.filter(filter_null_citations)
mapper1 = mapper1_filter.map(change_order)
reducer1 = mapper1.filter(filter_null_values)

Perform the join on the RDDs, similar to a DataFrame join. 
Use the `filter` operation to drop records with null citations.
Then swap the order so that the cited patent becomes the key, which is needed for the join in the next step.
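
To make the join semantics concrete, here is a plain-Python sketch of what a full outer join followed by a null filter produces. The `full_outer_join` helper is a hypothetical stand-in written for this illustration, not Spark code, and the sample keys and states are made up. It shows that dropping every record with a missing side leaves exactly the rows an inner join would return.

```python
from collections import defaultdict

def full_outer_join(left, right):
    """Plain-Python stand-in for a full outer join on lists of (key, value) pairs."""
    left_by_key, right_by_key = defaultdict(list), defaultdict(list)
    for k, v in left:
        left_by_key[k].append(v)
    for k, v in right:
        right_by_key[k].append(v)
    joined = []
    for k in sorted(set(left_by_key) | set(right_by_key)):
        # A key missing on one side contributes a single None for that side.
        for lv in left_by_key.get(k, [None]):
            for rv in right_by_key.get(k, [None]):
                joined.append((k, (lv, rv)))
    return joined

# Made-up sample data: (citing, cited) pairs and (patent, state) pairs.
citations = [('100', '7'), ('100', '8'), ('200', '9')]
patent_info = [('100', '"CA"'), ('300', '"NY"')]

joined = full_outer_join(citations, patent_info)

# Drop records where either side is missing (same test as filter_null_citations).
complete = [(k, (cited, state)) for k, (cited, state) in joined
            if k and cited and state]

# Re-key by the cited patent, packing citing id and state into one string.
rekeyed = [(cited, citing + '^' + state) for citing, (cited, state) in complete]

print(joined)
print(rekeyed)
```

Because the unmatched rows are discarded immediately, an inner join would give the same result here without creating those rows in the first place.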

In [32]:
## MapReduce phase 2

mapper2 = reducer1.fullOuterJoin(patent_info)
mapper2 = mapper2.filter(filter_null_citations)
mapper2 = mapper2.map(expand_values_mapper2)
reducer2 = mapper2.filter(filter_phase2)

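The operations are similar to those in MapReduce phase 1: we perform the join, filter out records with null values, expand each record into (citing, citing state, cited, cited state), and keep only the citations whose citing and cited states match.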
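
The following plain-Python sketch traces a few made-up records through the expand and filter steps of this phase. The helper names `expand` and `same_known_state` are hypothetical and simply mirror `expand_values_mapper2` and `filter_phase2`; the input list imitates the shape of the joined records, keyed by the cited patent.

```python
def expand(record):
    # record: (cited, ('citing^citing_state', cited_state))
    cited, (packed, cited_state) = record
    citing, citing_state = packed.split('^')
    return citing, citing_state, cited, cited_state

def same_known_state(row):
    # Keep rows where both states match and are not the empty quoted string.
    citing, citing_state, cited, cited_state = row
    return citing_state == cited_state and citing_state != '""'

# Made-up joined records.
joined = [
    ('7', ('100^"CA"', '"CA"')),  # same state: kept
    ('8', ('100^"CA"', '"NY"')),  # different states: dropped
    ('9', ('200^""', '""')),      # states match but are empty: dropped
]

rows = [expand(r) for r in joined]
kept = [r for r in rows if same_known_state(r)]

print(kept)
```

Packing two values into one string with `'^'` only works if neither value can contain that character; carrying a nested tuple instead would avoid the split step.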

In [33]:
## MapReduce phase 3

mapper3 = reducer2.map(mapper3_one)
reducer3 = mapper3.countByValue()
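
The counting step can be pictured with a plain-Python analogue. `countByValue()` returns a dictionary of value counts to the driver program rather than a new RDD, which is similar in spirit to `collections.Counter`. The list of citing patent ids below is made up for illustration.

```python
from collections import Counter

# Made-up citing patent ids, one entry per same-state citation.
citing_ids = ['100', '100', '200', '100']

# Count how many times each citing patent appears.
counts = Counter(citing_ids)

# (patent, count) pairs, the same shape as the items of the counts dictionary.
pairs = sorted(counts.items())

print(pairs)
```

Since the result is an ordinary dictionary on the driver, it has to be turned back into an RDD of pairs before it can be joined with other RDDs.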

In [34]:
citation_count = sc.parallelize(reducer3.items())
patent_rdd = rddPatents.map(patent_file_filter)
resultant_rdd = patent_rdd.fullOuterJoin(citation_count)
resultant_rdd = resultant_rdd.map(final_result_print)
resultant_rdd_descending_order = resultant_rdd.takeOrdered(10, key=lambda x: -x[23])
for element in resultant_rdd_descending_order:
    print(element)

(5959466, '1999', '14515', '1997', '"US"', '"CA"', '5310', '2', '', '326', '4', '46', '159', '0', '1', '', '0.6186', '', '4.8868', '0.0455', '0.044', '', '', 94)
(6008204, '1999', '14606', '1998', '"US"', '"CA"', '749584', '2', '', '514', '3', '31', '121', '0', '1', '', '0.7415', '', '5', '0.0085', '0.0083', '', '', 80)
(5952345, '1999', '14501', '1997', '"US"', '"CA"', '749584', '2', '', '514', '3', '31', '118', '0', '1', '', '0.7442', '', '5.1102', '0', '0', '', '', 78)
(5999972, '1999', '14585', '1996', '"US"', '"CA"', '551495', '2', '', '709', '2', '22', '352', '0', '1', '', '0.8714', '', '4.0398', '0.0117', '0.0114', '', '', 77)
(5998655, '1999', '14585', '1998', '"US"', '"CA"', '', '1', '', '560', '1', '14', '114', '0', '1', '', '0.7387', '', '5.1667', '', '', '', '', 76)
(5958954, '1999', '14515', '1997', '"US"', '"CA"', '749584', '2', '', '514', '3', '31', '116', '0', '1', '', '0.7397', '', '5.181', '0', '0', '', '', 76)
(5987245, '1999', '14564', '1996', '"US"', '"CA"', '55149