# CSCI 4253 / 5253 - Lab #4 - Patent Problem with Spark RDD
<div>
 <h2> CSCI 4253 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

This [Spark cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf) is a useful reference.

In [1]:
from pyspark import SparkContext, SparkConf
import numpy as np
import operator

In [2]:
conf=SparkConf().setAppName("Lab4-rddd").setMaster("local[*]")
sc = SparkContext(conf=conf)

Using PySpark RDDs on the https://coding.csel.io machines is very slow: most of the code is executed in Python, which is much less efficient than the Java-based code behind PySpark DataFrames. Be patient, and try using `.cache()` to cache the output of joins. You may want to start with a reduced set of data before running the full task.

To that end, the code below extracts only the last 200,000 lines of each file using Python slice notation. With that subset of the data, your "new patent" table should look like:

![Top partial 10 RDD self-state citations](top-subsample-rdd.png)

When you're ready to run the whole thing, just include all the data and run it again (...and wait...).
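As a small illustration of the slice notation used to subsample the files, the sketch below applies the same kind of negative slice to a toy list. The list contents are made up for the example; the real files are far larger.

```python
# Toy stand-in for f.readlines(); each element is a bytes object, as gzip returns.
lines = [f"line {i}\n".encode("utf-8") for i in range(10)]

# A negative start index counts from the end, so this keeps the last 3 items.
tail = lines[-3:]

# Slicing does not raise when the list is shorter than the requested tail;
# it simply returns everything that is there.
short = lines[:2]
whole = short[-200000:]

print(tail)
print(whole == short)
```

To run on the full data, dropping the slice (passing `f.readlines()` directly) should be enough.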

These two RDDs, `rddCitations` and `rddPatents`, hold the raw lines of each file, so you will probably want to process them further (e.g. convert fields to integer types). Each line is a Python `bytes` object; if you haven't used that type before, look it up in the Python documentation. You can convert a bytes value `x` into a string using `x.decode('utf-8')`.
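A minimal sketch of that conversion on a single citation line, outside Spark. The sample bytes mirror the first citation pair shown later in this notebook; the variable names are only illustrative.

```python
# One line as it comes out of gzip.open(..., 'r'): bytes, with a trailing newline.
raw = b"5991301,5394398\n"

# Decode to str, then strip the newline before splitting on the comma.
text = raw.decode("utf-8").rstrip()
citing, cited = text.split(",")

# Optional: convert the patent numbers to integers for numeric comparisons.
citing_id, cited_id = int(citing), int(cited)

print(type(raw).__name__, type(text).__name__)
print(citing_id, cited_id)
```

Whether to keep the keys as strings or integers is a design choice; either works for joins as long as both RDDs use the same type.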

In [3]:
import gzip
with gzip.open('cite75_99.txt.gz', 'r') as f:
    rddCitations = sc.parallelize( f.readlines()[-200000:] )

In [4]:
with gzip.open('apat63_99.txt.gz', 'r') as f:
    rddPatents = sc.parallelize( f.readlines()[-200000:] )

# Transform the citations and the patents into (k, v) tuples

#### k => Citing Number (OR) Patent Number
#### v => Cited Number (OR) Patent Information

In [5]:
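# Decode each raw line and turn it into a (citing, cited) tuple
citations = rddCitations.map(lambda x: x.decode('utf-8')) \
                        .map(lambda x: x.rstrip()) \
                        .map(lambda x: x.split(',')) \
                        .map(lambda x: (x[0], x[1]))

# Split off the patent number first, then split the remaining fields:
# (patent, [remaining fields])
patents = rddPatents.map(lambda x: x.decode('utf-8')) \
                    .map(lambda x: x.rstrip()) \
                    .map(lambda x: x.split(',', 1)) \
                    .map(lambda x: (x[0], x[1].split(',')))

# Get an RDD of Patent-State relationships
# Index 4 of the remaining fields is the state; drop patents with an empty state
states = patents.map(lambda x: (x[0], x[1][4])) \
                .filter(lambda x: x[1] != '""')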
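The per-line logic of the patent transformation can be checked without Spark. The sketch below is a plain-Python equivalent; the helper names `parse_patent` and `state_of` are hypothetical, the first sample line is reconstructed from the `patents.take(1)` output in this notebook, and the second line is a made-up record with an empty state field.

```python
def parse_patent(raw: bytes):
    """Turn one raw patent line into (patent_id, [remaining fields])."""
    key, rest = raw.decode("utf-8").rstrip().split(",", 1)
    return key, rest.split(",")


def state_of(record):
    """Return (patent_id, state), or None when the state field is empty."""
    key, fields = record
    state = fields[4]
    return None if state == '""' else (key, state)


# Reconstructed from the patents.take(1) output.
sample = (b'5807364,1998,14137,1995,"US","WA",625795,2,38,604,3,32,104,'
          b'1,0.9423,0,0.8803,1,17.7308,0.1047,0.0865,1,1\n')
# Hypothetical record with no state, to exercise the filter.
no_state = b'1,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,\n'

record = parse_patent(sample)
print(state_of(record))
print(state_of(parse_patent(no_state)))
```

Note that the state values keep their surrounding double quotes, which is why the empty case is compared against `'""'` rather than an empty string.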

In [6]:
citations.take(5)

[('5991301', '5394398'),
 ('5991301', '5414701'),
 ('5991301', '5418783'),
 ('5991301', '5420857'),
 ('5991301', '5420858')]

In [7]:
states.take(5)

[('5807364', '"WA"'),
 ('5807365', '"NJ"'),
 ('5807366', '"GA"'),
 ('5807367', '"WI"'),
 ('5807368', '"MN"')]

In [8]:
patents.take(1)

[('5807364',
  ['1998',
   '14137',
   '1995',
   '"US"',
   '"WA"',
   '625795',
   '2',
   '38',
   '604',
   '3',
   '32',
   '104',
   '1',
   '0.9423',
   '0',
   '0.8803',
   '1',
   '17.7308',
   '0.1047',
   '0.0865',
   '1',
   '1'])]

# Get the number of same-state (co-state) citations per citing patent

In [43]:
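citingState = (
    citations.join(states)                          # (citing, (cited, citing state))
    .map(lambda x: (x[1][0], (x[0], x[1][1])))      # re-key by cited patent
    .join(states)                                   # (cited, ((citing, citing state), cited state))
    .map(lambda x: (x[1][0][0], (x[1][0][1], x[0], x[1][1])))  # re-key by citing patent
    .filter(lambda x: x[1][0] == x[1][2])           # keep pairs whose states match
    .groupByKey()
    .map(lambda x: (x[0], len(x[1])))               # count matches per citing patent
)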


"""
For 2nd map:
('5871279', (('5991652', '"CA"'), '"CA"'))
x[0] = Cited Pat
x[1] = ((Citing Pat, Citing State), Cited State)
~~~~
x[1][0][0] = Citing Pat
x[1][0][1] = Citing State
x[1][1] = Cited State
"""

citingState.take(10)

[('6009541', 9),
 ('5999913', 7),
 ('6003285', 6),
 ('6006836', 5),
 ('5999540', 5),
 ('6003328', 5),
 ('6006835', 5),
 ('6009277', 4),
 ('5996023', 4),
 ('5994920', 4)]
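To make the two-join logic easier to follow, here is a plain-Python sketch of the same counting on a tiny, entirely hypothetical data set (the patent IDs and states below are made up). Dictionary lookups stand in for the RDD joins.

```python
from collections import Counter

# Hypothetical toy data, not taken from the real files.
citations = [("A", "X"), ("A", "Y"), ("A", "Z"), ("B", "X"), ("C", "W")]
states = {"A": "CA", "X": "CA", "Y": "CA", "Z": "NY", "B": "NY"}  # "C", "W" have no state

same_state = Counter()
for citing, cited in citations:
    citing_state = states.get(citing)  # first join: attach the citing patent's state
    cited_state = states.get(cited)    # second join: attach the cited patent's state
    # Like an inner join, skip pairs where either patent has no state.
    if citing_state is None or cited_state is None:
        continue
    if citing_state == cited_state:
        same_state[citing] += 1

print(dict(same_state))
```

Only patent `A` has citations to patents in its own state here, so it is the only key in the result; patents with zero same-state citations do not appear at all, which matches the behaviour of the inner joins.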

# Join back to the original patents

In [81]:
# Sort numerically on the trailing count; sorting it as a string would rank '9' above '10'
finalOutput = patents.join(citingState) \
              .map(lambda x: (",".join([x[0], ",".join(x[1][0]), str(x[1][1])]))) \
              .sortBy(lambda x: int(x.split(',')[-1]), ascending=False)

finalOutput.take(20)

['6009541,1999,14606,1997,"US","CA",722315,2,,714,2,22,155,0,1,,0.8503,,2.6968,0.0132,0.0129,,,9',
 '5999913,1999,14585,1997,"US","GA",395480,2,,705,2,22,16,0,1,,0.7344,,4.4375,0.5333,0.5,,,7',
 '6003285,1999,14599,1998,"US","IL",720823,2,,53,6,68,71,0,0.8873,,0.5135,,10.6761,0.7414,0.6056,,,6',
 '5999540,1999,14585,1998,"US","TX",739062,2,,370,2,21,93,0,1,,0.8024,,5.3011,0.0333,0.0323,,,5',
 '6003328,1999,14599,1999,"US","VA",687394,2,,62,6,69,57,0,0.9298,,0.8494,,11.5965,0.4474,0.2982,,,5',
 '6006835,1999,14606,1998,"US","OK",696774,2,,166,6,64,53,0,0.9811,,0.3972,,13.0189,0.1087,0.0943,,,5',
 '6006836,1999,14606,1998,"US","OK",696774,2,,166,6,64,54,0,0.9815,,0.3909,,13.1667,0.1087,0.0926,,,5',
 '5996515,1999,14585,1999,"US","IA",139275,2,,111,6,61,23,0,0.913,,0.6576,,20.6522,0.3846,0.2174,,,4',
 '5994920,1999,14578,1997,"US","MS",132165,2,,326,4,46,70,0,1,,0.6273,,4.8857,0.1449,0.1429,,,4',
 '6009277,1999,14606,1998,"US","MA",445890,2,,396,5,54,7,0,1,,0.2449,,9.7143,1,0.5714,,,4',
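One detail worth checking when sorting these comma-joined rows is whether the trailing count is compared as a string or as a number. The sketch below uses three short, hypothetical rows to show the difference; with the subsample above every count is a single digit, so both orderings happen to agree, but that may not hold on the full data.

```python
# Hypothetical rows in the same "fields...,count" shape as the output above.
rows = ["111,CA,9", "222,NY,10", "333,TX,4"]

# String comparison is lexicographic: "9" > "4" > "10".
by_string = sorted(rows, key=lambda r: r.split(",")[-1], reverse=True)

# Converting the last field to int gives the intended numeric order.
by_int = sorted(rows, key=lambda r: int(r.split(",")[-1]), reverse=True)

print(by_string)
print(by_int)
```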
 