# CSCI 4253 / 5253 - Lab #3 - Patent Problem with SQL - SOLUTION
<div>
 <h2> CSCI 4253 / 5253 
  <IMG SRC="https://www.colorado.edu/cs/profiles/express/themes/cuspirit/logo.png" WIDTH=50 ALIGN="right"/> </h2>
</div>

In this assignment, we're going to solve a problem for which you'll also see a Hadoop solution and later implement a PySpark solution. We have two data sets:

* One contains information about patents
* One contains information about patent citations (one patent citing the work of another)

The problem we're going to solve is augmenting the original patent data to include the number of *co-state citations*. In other words, if patent X was issued to someone in Colorado, patent Y was also issued to someone in Colorado, and X cites Y, then this is a co-state citation.
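
To make the definition concrete, here is a minimal plain-Python sketch on invented toy data. It assumes the count is credited to the *citing* patent and that a citation with a missing or empty state on either end does not count; the names `patent_state` and `co_state_counts` are hypothetical and not part of the lab.

```python
# Toy data, invented for illustration (not taken from the real data set).
patent_state = {1: "CO", 2: "CO", 3: "CA", 4: "CO"}   # patent -> state
citations = [(1, 2), (1, 3), (4, 2), (3, 2)]          # (citing, cited)


def co_state_counts(patent_state, citations):
    """Count, per citing patent, the citations whose cited patent shares its state."""
    counts = {}
    for citing, cited in citations:
        citing_state = patent_state.get(citing)
        cited_state = patent_state.get(cited)
        # Assumption: a citation counts only when both states are known and equal.
        if citing_state and citing_state == cited_state:
            counts[citing] = counts.get(citing, 0) + 1
    return counts


counts = co_state_counts(patent_state, citations)
print(counts)  # {1: 1, 4: 1}
```

Patent 1 cites patents 2 and 3, but only patent 2 is also in Colorado, so patent 1 ends up with one co-state citation.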

The easiest way to solve this is to build an intermediate table from the patent citations table. That table contains `CITING` and `CITED` columns; you would augment it by adding `CITING_STATE` and `CITED_STATE`. It then becomes fairly simple to filter out all the cases where those states don't match. You can then use a SQL `GROUP BY` with `COUNT(*)` to count the co-state citations for each patent and join that with the original patents table, resulting in an augmented table.
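
The steps above can be sketched on a toy database using Python's built-in `sqlite3` module. This is an illustration under stated assumptions, not the required answer: it uses several statements and intermediate tables rather than a single query, the rows are invented, the table names `cite_states` and `co_state` and the column name `SAME_STATE` are hypothetical, and the count is credited to the citing patent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patents (PATENT INTEGER, POSTATE TEXT);
    CREATE TABLE citations (CITING INTEGER, CITED INTEGER);
    INSERT INTO patents VALUES (1, 'CO'), (2, 'CO'), (3, 'CA'), (4, 'CO');
    INSERT INTO citations VALUES (1, 2), (1, 3), (4, 2), (3, 2), (4, 1);

    -- Step 1: augment each citation with the state on both ends.
    CREATE TABLE cite_states AS
    SELECT c.CITING, c.CITED,
           p1.POSTATE AS CITING_STATE,
           p2.POSTATE AS CITED_STATE
    FROM citations AS c
    JOIN patents AS p1 ON c.CITING = p1.PATENT
    JOIN patents AS p2 ON c.CITED = p2.PATENT;

    -- Steps 2 and 3: keep matching states, then count per citing patent.
    CREATE TABLE co_state AS
    SELECT CITING, COUNT(*) AS SAME_STATE
    FROM cite_states
    WHERE CITING_STATE = CITED_STATE
    GROUP BY CITING;
""")

# Step 4: join the counts back onto the patents table.
# LEFT JOIN keeps patents that have no co-state citations (count is NULL).
rows = conn.execute("""
    SELECT p.PATENT, p.POSTATE, s.SAME_STATE
    FROM patents AS p
    LEFT JOIN co_state AS s ON p.PATENT = s.CITING
    ORDER BY s.SAME_STATE DESC, p.PATENT
""").fetchall()
print(rows)  # [(4, 'CO', 2), (1, 'CO', 1), (2, 'CO', None), (3, 'CA', None)]
```

The final `LEFT JOIN` is a design choice in this sketch: an inner join there would drop every patent without a co-state citation from the augmented table.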

The final results for the first 13 rows, sorted in descending order by co-state citations, look like the following:
![this final output](final-output.png)

The challenge is that I want you to do this in *a single SQL query*. We're assuming you've learned some SQL in a previous life; if not, [now is a great time to learn](https://www.sqlitetutorial.net/). Even if you've done basic SQL, you'll probably need to review [using `select` in a where-clause or using multiple joins](https://dba.stackexchange.com/questions/33553/using-select-in-the-where-clause-of-another-select).


## Logistics

We're going to be using the SQLite3 system which runs entirely from a file (no server needed). The `Makefile` contains commands to download the raw data as ZIP files.

We can run shell commands in our notebook using [builtin "magic" commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html). You need to run the cells below at least once before starting the lab to make certain you have the files and have created the `patents.sq3` database file. This will take a few minutes to complete, and you should see that the `patents.sq3` file is about 645 MB in size.

In [None]:
# Temporary fix to CSEL version conflict issue
%pip install --upgrade --user ipython-sql==0.5.0

In [None]:
%%bash
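# Download the raw ZIP files (see the Makefile)
make
# Remove any stale database; -f avoids an error on the first run
rm -f patents.sq3
# Import each CSV file into its own table
zcat < acite75_99.zip | sqlite3 patents.sq3 ".mode csv" ".import /dev/stdin citations"
zcat < apat63_99.zip | sqlite3 patents.sq3 ".mode csv" ".import /dev/stdin patents"
ls -l patents.sq3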

We'll use another "magic" to run SQL queries in notebook cells. The following will load the SQL extension and connect to the `patents.sq3` file.

In [None]:
%load_ext sql
%sql sqlite:///patents.sq3
%config SqlMagic.style = '_DEPRECATED_DEFAULT'


Following this, we can run individual SQL queries and see the results by putting `%%sql` on the first line of a cell. Without it, the cell runs as Python code.
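
For comparison, the same kind of query can also be issued from an ordinary Python cell with the standard-library `sqlite3` module. The sketch below is an assumption-laden illustration: `preview` is a hypothetical helper, and it runs against a small in-memory stand-in table so that it works on its own.

```python
import sqlite3


def preview(conn, table, n=5):
    """Return the column names and the first n rows of a table."""
    # Table names cannot be bound as parameters, so quote the identifier.
    quoted = '"' + table.replace('"', '""') + '"'
    cur = conn.execute(f"SELECT * FROM {quoted} LIMIT ?", (n,))
    columns = [d[0] for d in cur.description]
    return columns, cur.fetchall()


# Stand-in database so the sketch runs on its own; in the notebook you would
# use sqlite3.connect("patents.sq3") instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE citations (CITING, CITED)")
conn.executemany("INSERT INTO citations VALUES (?, ?)", [(1, 2), (1, 3), (4, 2)])

columns, rows = preview(conn, "citations", n=2)
print(columns, rows)
```

The `%%sql` magic is more convenient for this lab because it renders results as a table without any extra code.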

So, for example, we can examine our two raw database tables.

In [None]:
%%sql
select * from patents limit 5;

In [None]:
%%sql
select * from citations limit 5;

If you want to create indexes over various fields, go ahead. It shouldn't affect the correctness of your results but may affect the performance.
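
As a rough sketch of what that could look like, the statements below index the columns used as join keys. The index names are hypothetical, and the tables here are empty in-memory stand-ins; in the notebook, the two `CREATE INDEX` statements could go in a `%%sql` cell instead. `EXPLAIN QUERY PLAN` is SQLite's way of reporting how it intends to run a query, including whether it plans to use an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patents (PATENT, POSTATE);
    CREATE TABLE citations (CITING, CITED);
    -- Index the columns used as join keys (index names are hypothetical).
    CREATE INDEX IF NOT EXISTS idx_patents_patent ON patents (PATENT);
    CREATE INDEX IF NOT EXISTS idx_citations_cited ON citations (CITED);
""")

# List the indexes that now exist in the database.
index_names = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index' ORDER BY name")]
print(index_names)  # ['idx_citations_cited', 'idx_patents_patent']

# Ask SQLite how it plans to execute the join; the last column of each
# row is a human-readable description of that step.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT CITED, POSTATE FROM citations JOIN patents ON CITED = PATENT
""").fetchall()
for step in plan:
    print(step[-1])
```

Building an index on the full tables takes time and disk space, so it is worth comparing query times with and without it.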

## Steps to the full solution

In order to determine when a *cited* patent and a *citing* patent are from the same state,
we're going to need to produce a series of tables that combine information from the citations and the patents tables.

We can use a simple inner join (a plain **JOIN**) to look up the patent information for one column of the citations table at a time. For example, we can determine the state for *cited* patents using this join:

In [None]:
%%sql
SELECT CITED, patents.POSTATE AS CITED_POSTATE, CITING
FROM citations JOIN patents ON CITED = PATENT
LIMIT 5;
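
In SQLite, a plain `JOIN` with a matching condition behaves as an inner join, so citations whose cited patent has no row in the patents table are dropped. The sketch below contrasts that with `LEFT JOIN` on invented toy rows, using Python's `sqlite3` module; the data is made up for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE patents (PATENT INTEGER, POSTATE TEXT);
    CREATE TABLE citations (CITING INTEGER, CITED INTEGER);
    INSERT INTO patents VALUES (2, 'CO'), (3, 'CA');
    -- Patent 9 is cited but has no row in the patents table.
    INSERT INTO citations VALUES (1, 2), (1, 3), (4, 9);
""")

select = "SELECT CITED, patents.POSTATE AS CITED_POSTATE, CITING FROM citations "

# Inner join: only citations with a matching patent row survive.
inner = conn.execute(
    select + "JOIN patents ON CITED = PATENT ORDER BY CITED").fetchall()
# Left join: every citation survives; missing states come back as NULL.
left = conn.execute(
    select + "LEFT JOIN patents ON CITED = PATENT ORDER BY CITED").fetchall()

print(inner)  # [(2, 'CO', 1), (3, 'CA', 1)]
print(left)   # [(2, 'CO', 1), (3, 'CA', 1), (9, None, 4)]
```

For counting co-state citations the difference is harmless, since a citation with an unknown state cannot match on state anyway.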

Then, you'll need to do the same for the `CITING` column as well. As mentioned earlier, you may want to review [using `select` in a where-clause or using multiple joins](https://dba.stackexchange.com/questions/33553/using-select-in-the-where-clause-of-another-select).

## Your solution

Enter your solution as a single SQL query below:

In [None]:
# Your solution should be in the last cell