# CSCI 4253 / 5253 - Lab #4 - Patent Problem with Spark RDD - SOLUTION
<div>
 <h2> CSCI 4253 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

This [Spark cheatsheet](https://s3.amazonaws.com/assets.datacamp.com/blog_assets/PySpark_SQL_Cheat_Sheet_Python.pdf) is a useful reference.

In [1]:
from pyspark import SparkContext, SparkConf
import csv 
import io
import numpy as np
import operator

In [2]:
conf=SparkConf().setAppName("Lab4-rdd").setMaster("local[*]")
sc = SparkContext(conf=conf)

Using PySpark RDDs on the https://coding.csel.io machines is slow: most of the code is executed in Python, which is much less efficient than the Java-based code behind PySpark DataFrames. Be patient, and try using `.cache()` to cache the output of joins. You may want to start with a reduced set of data before running the full task; the `sample()` method extracts just a sample of the data.

The two RDDs loaded below (`rddCitations` and `rddPatents`) hold raw, unparsed lines, because you will probably want to process them further (e.g. convert fields to integer types).

The `textFile` function returns data in strings. This should work fine for this lab.

Other methods you use might return data as `bytes`. If you haven't used the Python `bytes` type before, look it up. You can convert a `bytes` value `x` into a string using `x.decode('utf-8')`. Alternatively, you can use the `open` function of the `gzip` library to read in all the lines as UTF-8 strings like this:
```
import gzip
with gzip.open('cite75_99.txt.gz', 'rt',encoding='utf-8') as f:
    rddCitations = sc.parallelize( f.readlines() )
```
This is less efficient than using `textFile`, because `textFile` uses the underlying HDFS or other file system to read the file across all the worker nodes, while `gzip.open()...readlines()` reads all the data in the frontend and then distributes it to the worker nodes.
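
To make the `bytes` versus `str` distinction concrete, here is a minimal sketch that does not need Spark or the lab files. It assumes a tiny in-memory gzip payload (`raw_bytes`, a hypothetical stand-in for the real `.gz` file) and reads it once in binary mode and once in text mode.

```python
import gzip
import io

# Hypothetical in-memory stand-in for a small .gz file, so the sketch runs anywhere.
raw_bytes = b'"CITING","CITED"\n3858241,956203\n'
compressed = gzip.compress(raw_bytes)

# Binary mode yields bytes; each line has to be decoded explicitly.
with gzip.open(io.BytesIO(compressed), "rb") as f:
    byte_lines = f.readlines()
decoded = [line.decode("utf-8").rstrip("\n") for line in byte_lines]

# Text mode ('rt') decodes while reading and yields str directly.
with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as f:
    text_lines = [line.rstrip("\n") for line in f]

print(type(byte_lines[0]).__name__, type(text_lines[0]).__name__)
print(decoded == text_lines)
```

Both routes end with the same list of strings; text mode simply saves the explicit `decode` call.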

Loading the datasets and viewing some examples

In [3]:
rddCitations = sc.textFile("cite75_99.txt.gz")
rddPatents = sc.textFile("apat63_99.txt.gz")

print("Datasets loaded successfully")
print("Sample citations:", rddCitations.take(3))
print("Sample patents:", rddPatents.take(3))

Datasets loaded successfully
Sample citations: ['"CITING","CITED"', '3858241,956203', '3858241,1324234']
Sample patents: ['"PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS","NCLASS","CAT","SUBCAT","CMADE","CRECEIVE","RATIOCIT","GENERAL","ORIGINAL","FWDAPLAG","BCKGTLAG","SELFCTUB","SELFCTLB","SECDUPBD","SECDLWBD"', '3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,', '3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,']


In other words, each element is a single string of comma-separated values. You will need to convert these to (K,V) pairs, probably converting the keys to `int`, and so on. You'll also need to `filter` out the header string, since there's no easy way to extract all the lines except the first.
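
As a rough illustration of that header filter, the sketch below works on a plain Python list holding the three citation lines printed above. The comments name the RDD calls (`first()` and `filter()`) that play the same role in Spark. It uses a simple `split(",")`, which is only an assumption that holds for the citation data because those fields contain no quoted commas.

```python
# First three elements of the citations RDD, copied from the output above.
lines = ['"CITING","CITED"', '3858241,956203', '3858241,1324234']

# RDD analogue: header = rdd.first()
header = lines[0]

# RDD analogue: rdd.filter(lambda ln: ln != header)
data_lines = [ln for ln in lines if ln != header]

# Convert each remaining line into a (citing, cited) pair of ints.
pairs = [tuple(int(field) for field in ln.split(",")) for ln in data_lines]
print(pairs)
```

The steps below take a slightly different route and recognise the header inside the parsing function itself, which avoids a separate pass to fetch the first line.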

## step 1: parsing citations

### This helper function splits a line of text into fields using Python’s CSV reader.  
- It handles commas inside quotes correctly.  
- If parsing fails, it returns an empty list.  

In [17]:
def split_fields(text_line):
    try:
        return next(csv.reader(io.StringIO(text_line)))
    except Exception:
        return []
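
A quick way to see why the helper uses `csv.reader` instead of `str.split(",")` is to feed both a line with a comma inside quotes. The sketch below repeats `split_fields` so it is self-contained; the `quoted` row is a hypothetical example, not a line from the dataset.

```python
import csv
import io

def split_fields(text_line):
    # Same helper as in the cell above, repeated so this sketch runs on its own.
    try:
        return next(csv.reader(io.StringIO(text_line)))
    except Exception:
        return []

quoted = '1,"Smith, John","US"'  # hypothetical row with a comma inside quotes

naive = quoted.split(",")                    # breaks the quoted field in two
parsed = split_fields(quoted)                # keeps the quoted field intact
header = split_fields('"CITING","CITED"')    # quotes are removed by the reader
empty = split_fields("")                     # no row to read, so the fallback is used

print(naive)
print(parsed)
print(header)
print(empty)
```

Because the reader already removes the surrounding quotes, the later header checks can compare against `"CITING"` and `"PATENT"` directly.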

### This function parses a single line of the citation file.  
- Skips the header and malformed rows.  
- Returns a tuple (citing_id, cited_id) of integers.
- Returns None if the line is the header or malformed.

In [5]:
def parse_citations(line):
    parts = split_fields(line)
    if not parts or parts[0] == "CITING":
        return None
    if len(parts) < 2:
        return None
    try:
        citing_id = int(parts[0])
        cited_id = int(parts[1])
        return (citing_id, cited_id)
    except ValueError:
        return None

### This function parses a line of the patent file
- Extracts (patent_id, state) but only if the patent is from the US and has a state.  
- Skips patents from other countries or with missing state info.
- Returns None otherwise.


In [6]:
def parse_patents(line):
    parts = split_fields(line)
    if not parts or parts[0] == "PATENT":
        return None
    if len(parts) < 6:
        return None
    
    try:
        pid = int(parts[0])
    except ValueError:
        return None
    
    country = parts[4].strip('"') if len(parts) > 4 else ""
    state = parts[5].strip('"') if len(parts) > 5 else ""
    
    if country == "US" and state:
        return (pid, state)
    return None
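
To check the filtering rules on real rows, the sketch below runs a condensed restatement of `parse_patents` over the two data lines printed earlier, plus a header that is shortened here for brevity (the shortening is an assumption of this example; the real header has 23 columns).

```python
import csv
import io

def parse_patents(line):
    # Condensed restatement of the function above, for a self-contained demo.
    try:
        parts = next(csv.reader(io.StringIO(line)))
    except Exception:
        return None
    if len(parts) < 6 or parts[0] == "PATENT":
        return None
    try:
        pid = int(parts[0])
    except ValueError:
        return None
    country, state = parts[4], parts[5]
    return (pid, state) if country == "US" and state else None

sample = [
    '"PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE"',  # header, shortened
    '3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,',    # Belgian patent
    '3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,',     # US patent with state
]

results = [parse_patents(ln) for ln in sample]
print(results)
```

Only the Texas patent survives: the header and the Belgian row both map to `None` and are dropped by the later `filter`.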

### This function extracts the entire row of patent information, keyed by patent ID.  
- Useful later to display detailed info about important patents.
- Ensures rows have at least 23 fields.  

In [7]:
def parse_full_patent_info(line):
    parts = split_fields(line)
    if not parts or parts[0] == "PATENT":
        return None
    if len(parts) < 23:
        return None
    
    try:
        pid = int(parts[0])
        return (pid, parts)
    except ValueError:
        return None

print("Parsing utilities loaded successfully")

Parsing utilities loaded successfully


### We process the citations RDD:  
- Apply parse_citations.  
- Filter out None values.  
- Cache the result for reuse.  
- Show the total count and a small sample.  


In [8]:
citations_parsed = rddCitations.map(parse_citations).filter(lambda x: x).cache()
print(f"Parsed citations count: {citations_parsed.count()}")
print("Sample:", citations_parsed.take(5))

Parsed citations count: 16522438
Sample: [(3858241, 956203), (3858241, 1324234), (3858241, 3398406), (3858241, 3557384), (3858241, 3634889)]


## step 2: parsing patents

### We process the patents RDD:  
- Apply parse_patents.  
- Keep only valid (patent_id, state) pairs.  
- Cache the result.  
- Show how many such patents exist and a small sample.  

In [9]:
patents_parsed = rddPatents.map(parse_patents).filter(lambda x: x).cache()
print(f"Parsed US patents with state count: {patents_parsed.count()}")
print("Sample:", patents_parsed.take(5))

Parsed US patents with state count: 1784989
Sample: [(3070802, 'TX'), (3070803, 'IL'), (3070804, 'OH'), (3070805, 'CA'), (3070806, 'PA')]


## step 3: creating state lookup

### We convert the parsed patents into a Python dictionary using collectAsMap().  
- Key = patent ID.  
- Value = state code.  

This gives fast lookups for finding a patent’s state.  

In [10]:
map_patent_to_state = patents_parsed.collectAsMap()
print(f"Patent to state mapping created with {len(map_patent_to_state)} entries")
print("Sample:", list(map_patent_to_state.items())[:5])

Patent to state mapping created with 1784989 entries
Sample: [(3070802, 'TX'), (3070803, 'IL'), (3070804, 'OH'), (3070805, 'CA'), (3070806, 'PA')]


## step 4: keep only citations where both patents have a known state

### We enrich each citation with state info for both patents:  
- (citing, cited, citing_state, cited_state) 
- Only keep citations where both patents have state information.  

In [11]:
def filter_citations_with_states(citation):
    citing, cited = citation
    if citing in map_patent_to_state and cited in map_patent_to_state:
        return (citing, cited, map_patent_to_state[citing], map_patent_to_state[cited])
    return None

citations_with_states = citations_parsed.map(filter_citations_with_states).filter(lambda x: x).cache()
print(f"Citations with both states count: {citations_with_states.count()}")
print("Sample:", citations_with_states.take(5))

Citations with both states count: 6920796
Sample: [(3858241, 3398406, 'MA', 'FL'), (3858241, 3557384, 'MA', 'MA'), (3858241, 3634889, 'MA', 'OH'), (3858242, 3319261, 'MI', 'OH'), (3858242, 3668705, 'MI', 'WI')]
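
The enrichment logic can be tried without Spark on the first five parsed citations. In this sketch, `patent_state` is a small excerpt of the lookup reconstructed from the sample output above, and `with_states` is a hypothetical variant of `filter_citations_with_states` that takes the lookup as an argument and uses `dict.get`.

```python
# State lookup excerpt, reconstructed from the sample output above.
patent_state = {3858241: "MA", 3398406: "FL", 3557384: "MA", 3634889: "OH"}

# First five parsed citations from step 1.
citations = [(3858241, 956203), (3858241, 1324234), (3858241, 3398406),
             (3858241, 3557384), (3858241, 3634889)]

def with_states(citation, lookup):
    # Return (citing, cited, citing_state, cited_state), or None if a state is missing.
    citing, cited = citation
    citing_state = lookup.get(citing)
    cited_state = lookup.get(cited)
    if citing_state is None or cited_state is None:
        return None
    return (citing, cited, citing_state, cited_state)

enriched = [row for row in (with_states(c, patent_state) for c in citations)
            if row is not None]
print(enriched)
```

The two citations whose cited patent is missing from the lookup are dropped, matching the start of the sample above. On a cluster, wrapping the dictionary with `sc.broadcast(...)` and reading it through `.value` is one option worth considering, so the lookup is shipped to each executor once instead of with every task.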


## step 5: filter for self-state citations

We filter to keep only citations where both patents are from the same state.  
- These are called “self-state citations.”  

In [12]:
def is_self_state_citation(cws):
    citing, cited, citing_state, cited_state = cws
    return citing_state == cited_state

self_state_citations = citations_with_states.filter(is_self_state_citation).cache()
print(f"Self-state citations count: {self_state_citations.count()}")
print("Sample:", self_state_citations.take(5))

Self-state citations count: 1488330
Sample: [(3858241, 3557384, 'MA', 'MA'), (3858245, 3755824, 'NY', 'NY'), (3858247, 3621837, 'CA', 'CA'), (3858247, 3694819, 'CA', 'CA'), (3858249, 3418664, 'TX', 'TX')]


## step 6: count self state citations

### We transform each self-state citation into (citing_id, 1) and then sum counts per patent using reduceByKey.  
- The result is: for each citing patent, how many same-state citations it made.  

In [13]:
def extract_citing_patent(cws):
    citing, cited, citing_state, cited_state = cws
    return (citing, 1)

self_state_counts = self_state_citations.map(extract_citing_patent).reduceByKey(operator.add).cache()
print(f"Patents with self-state citations count: {self_state_counts.count()}")
print("Sample:", self_state_counts.take(5))


Patents with self-state citations count: 571919
Sample: [(3858241, 1), (3858245, 1), (3858247, 2), (3858249, 4), (3858251, 1)]
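
For intuition about what `map` followed by `reduceByKey(operator.add)` computes, here is a plain-Python sketch over the five self-state citations sampled in step 5. It is only a single-machine analogue: Spark performs the same combining per partition and then merges the partial results.

```python
import operator

# The five self-state citations shown in the step 5 sample.
self_state = [
    (3858241, 3557384, 'MA', 'MA'),
    (3858245, 3755824, 'NY', 'NY'),
    (3858247, 3621837, 'CA', 'CA'),
    (3858247, 3694819, 'CA', 'CA'),
    (3858249, 3418664, 'TX', 'TX'),
]

# map step: one (citing_id, 1) pair per citation.
pairs = [(row[0], 1) for row in self_state]

# reduceByKey step: combine values that share a key with operator.add.
totals = {}
for key, value in pairs:
    totals[key] = operator.add(totals[key], value) if key in totals else value

print(sorted(totals.items()))
```

Patent 3858249 shows a count of 1 here only because this excerpt holds one of its citations; over the full dataset its count is 4, as in the output above.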


## step 7: top 10 results

In [14]:
top_10_counts = self_state_counts.takeOrdered(10, key=lambda x: -x[1])
print("Top 10 patents (patent_id, same-state citation count):")
for i, (pid, cnt) in enumerate(top_10_counts, 1):
    print(f"{i}. {pid} → {cnt}")

Top 10 patents (patent_id, same-state citation count):
1. 5959466 → 125
2. 5983822 → 103
3. 6008204 → 100
4. 5952345 → 98
5. 5958954 → 96
6. 5998655 → 96
7. 5936426 → 94
8. 5739256 → 90
9. 5913855 → 90
10. 5925042 → 90
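
The negated key passed to `takeOrdered` is what turns "smallest first" into "largest first". The sketch below shows the same trick with `heapq.nsmallest` on the five counts sampled in step 6; it is a local analogue, not the Spark call itself.

```python
import heapq

# Sample of (patent_id, same-state citation count) pairs from step 6.
counts = [(3858241, 1), (3858245, 1), (3858247, 2), (3858249, 4), (3858251, 1)]

# Smallest by negated count is the same as largest by count.
top_2 = heapq.nsmallest(2, counts, key=lambda kv: -kv[1])

# Equivalent formulation without the negation.
top_2_alt = heapq.nlargest(2, counts, key=lambda kv: kv[1])

print(top_2)
print(top_2_alt)
```

In PySpark, `RDD.top(10, key=lambda x: x[1])` should be an equivalent way to ask for the largest counts without negating the key.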


## step 8: creating a formatted table and printing the top 10 patents' info

In [15]:
# A set gives constant-time membership tests inside the filter below.
top_ids = {pid for pid, _ in top_10_counts}

top_detailed = rddPatents.map(parse_full_patent_info) \
    .filter(lambda x: x and x[0] in top_ids) \
    .collectAsMap()

for pid, cnt in sorted(top_10_counts, key=lambda x: -x[1]):
    if pid in top_detailed:
        row = top_detailed[pid]
        print(f"Patent {row[0]} ({row[5]}) → {cnt} same-state citations")

Patent 5959466 (CA) → 125 same-state citations
Patent 5983822 (TX) → 103 same-state citations
Patent 6008204 (CA) → 100 same-state citations
Patent 5952345 (CA) → 98 same-state citations
Patent 5958954 (CA) → 96 same-state citations
Patent 5998655 (CA) → 96 same-state citations
Patent 5936426 (CA) → 94 same-state citations
Patent 5739256 (CA) → 90 same-state citations
Patent 5913855 (CA) → 90 same-state citations
Patent 5925042 (CA) → 90 same-state citations


In [16]:
columns = [
    ("PATENT", 8), ("GYEAR", 6), ("GDATE", 6), ("APPYEAR", 8),
    ("COUNTRY", 8), ("POSTATE", 8), ("ASSIGNEE", 9), ("ASSCODE", 8),
    ("CLAIMS", 7), ("NCLASS", 7), ("CAT", 4), ("SUBCAT", 7),
    ("CMADE", 6), ("CRECEIVE", 9), ("RATIOCIT", 9), ("GENERAL", 8),
    ("ORIGINAL", 9), ("FWDAPLAG", 9), ("BCKGTLAG", 9), ("SELFCTUB", 9),
    ("SELFCTLB", 9), ("SECDUPBD", 9), ("SECDLWBD", 9), ("SAME_STATE", 11)
]

sep_line = "+" + "+".join("-" * w for _, w in columns) + "+"
header_line = "|" + "|".join(name.ljust(w) for name, w in columns) + "|"

print()
print(sep_line)
print(header_line)
print(sep_line)

patent_to_count = dict(top_10_counts)
sorted_patents = sorted(top_10_counts, key=lambda x: -x[1])

def safe_val(v):
    # Render missing or empty fields as "null"; strip stray quotes otherwise.
    if v is None or str(v).strip() == "":
        return "null"
    return str(v).strip('"')

for pid, count in sorted_patents:
    if pid not in top_detailed:
        continue

    row = top_detailed[pid]

    # The 23 patent fields, followed by the same-state citation count.
    values = [safe_val(v) for v in row[:23]] + [str(count)]

    formatted = "|" + "|".join(val.ljust(w) for val, (_, w) in zip(values, columns)) + "|"
    print(formatted)

print(sep_line)
print("only showing top 10 rows")



+--------+------+------+--------+--------+--------+---------+--------+-------+-------+----+-------+------+---------+---------+--------+---------+---------+---------+---------+---------+---------+---------+-----------+
|PATENT  |GYEAR |GDATE |APPYEAR |COUNTRY |POSTATE |ASSIGNEE |ASSCODE |CLAIMS |NCLASS |CAT |SUBCAT |CMADE |CRECEIVE |RATIOCIT |GENERAL |ORIGINAL |FWDAPLAG |BCKGTLAG |SELFCTUB |SELFCTLB |SECDUPBD |SECDLWBD |SAME_STATE |
+--------+------+------+--------+--------+--------+---------+--------+-------+-------+----+-------+------+---------+---------+--------+---------+---------+---------+---------+---------+---------+---------+-----------+
|5959466 |1999  |14515 |1997    |US      |CA      |5310     |2       |null   |326    |4   |46     |159   |0        |1        |null    |0.6186   |null     |4.8868   |0.0455   |0.044    |null     |null     |125        |
|5983822 |1999  |14564 |1998    |US      |TX      |569900   |2       |null   |114    |5   |55     |200   |0        |0.995    |n