# Generational Difference
For patents that cite each other, a key question is how the characteristics of the prior generation (the "cited generation") affect the behaviour of the future patents that cite them. The characteristics of interest are:

1. Assignees - Number of Unique Assignees
1. Inventors - Number of Unique Inventors
1. Locations - Number of Unique Locations
1. Patents - Number of Unique Patents
1. NBER Category - Number of NBER Categories
1. Jensen-Shannon Divergence
1. Similarity 
1. Claims 
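
The notebook imports `js_divergence` from `patent_neo4j.analysis`, but its implementation is not shown here. As a rough reference only, the textbook Jensen-Shannon divergence between two category distributions can be sketched as below. The name `js_divergence_sketch` is hypothetical and the project's own function may differ (for example in the log base). With base-2 logarithms the value is 0 for identical distributions and 1 for distributions with no overlap.

```python
import numpy as np

def js_divergence_sketch(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Hypothetical helper for illustration; not the project's js_divergence.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Normalise so that raw category counts can be passed in directly
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability entries of a
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```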

In [None]:
import neo4j 
import pandas as pd
import random
import numpy as np
import datetime
from credentials import uri, user, pwd
from patent_neo4j.connection import Neo4jConnection
from patent_neo4j.analysis import assign_missing_nber, nber_distribution, js_divergence

In [None]:
df = pd.read_csv('Data/Mined Data/sample_patents_stats.csv')

In [None]:
df.head()

Connection for citation tree queries

In [None]:
# "standard" root for looking at how things behave
root = df.loc[4,'id']
# "degenerate" as in only 1 child
degenerate_root = df.loc[643,'id']
conn = Neo4jConnection(uri, user, pwd)
citation_tree = conn.query_citation_tree(root)
degen_tree = conn.query_citation_tree(degenerate_root)

In [None]:
citation_tree.head()

In [None]:
citation_tree.info()

In [None]:
degen_tree.head()

### Obtain Only the "Last Generation" as the forefront of Invention
Patents in the citation tree can cite their grandparents as well as their parents, so a single patent can **be classified in more than one generation**. For example, a patent that cites both a generation 1 and a generation 2 patent belongs to generations 2 and 3. <br>

Defining a generation as the 'forefront' of technology development, we keep only the **latest generation** of each patent, which places that invention at the **frontier**.
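
To make the rule concrete, here is a minimal sketch on toy data (the ids and generations are made up for illustration). Patent `B` is reachable in generations 1 and 2, and `C` in generations 2 and 3; keeping the maximum generation per patent leaves one row each.

```python
import pandas as pd

# Toy citation tree: one row per (patent, generation) in which it appears
toy_generations = pd.DataFrame({
    "id":  ["A", "B", "B", "C", "C"],
    "gen": [1,   1,   2,   2,   3],
})

# Keep only the latest generation of each patent
latest_generation = toy_generations.groupby("id", as_index=False)["gen"].max()
print(latest_generation)
```

The function in the next cell reaches the same result by sorting on `gen` and dropping duplicates with `keep='last'`.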

In [None]:
def get_max_generation(citation_tree):
    '''
    Given the (raw) citation_tree, keep only the latest (maximum) generation
    of each patent, i.e. if a patent is in gen 1 and gen 2, only gen 2 is kept.
    Also, take the direct similarity
    Input:
        citation_tree
    Output:
        citation_tree
    '''
    # Obtain the generation from the length of the similarity lineage
    citation_tree['gen'] = citation_tree['similarity'].apply(len)
    
    # Dropping duplicates due to different inventors
    generation = citation_tree.loc[:,['id','gen']].drop_duplicates()
    
    # Sort values based on generation, keeping the last (the idea of the forefront of inventions)
    generation = generation.sort_values(by=['gen']).drop_duplicates(subset=['id'], keep='last')
    
    # Left join with generation -> this only keeps the max(gen) for each patent
    citation_tree = pd.merge(generation,citation_tree,on=['id','gen'], how='left')
    
    # Take the direct similarity of the max(gen)
    citation_tree['similarity'] = citation_tree['similarity'].apply(lambda x: x[0])
    
    return citation_tree

In [None]:
def fixing_na_nber(citation_tree):
    
    # Get NBER as the first element of the NBER lineage
    citation_tree['nber'] = citation_tree['nber_lineage'].apply(lambda x: x[0])
    
    # Replace the raw NBER column with the one returned by assign_missing_nber
    assigned = assign_missing_nber(citation_tree)
    citation_tree = pd.merge(citation_tree.drop(['nber'],axis=1), assigned, on='id', how='left')
    
    return citation_tree

In [None]:
citation_tree = fixing_na_nber(citation_tree)
degen_tree = fixing_na_nber(degen_tree)

In [None]:
citation_tree = get_max_generation(citation_tree)
degen_tree = get_max_generation(degen_tree)

In [None]:
citation_tree.head()

In [None]:
degen_tree.head()

### Counting the Numbers
For some of the features we count unique values. For now these are:

1. Inventors
1. Assignees
1. Location
1. Patent ID

We are interested in how many unique values of each appear in a given citation tree for each generation. I refer to these features as **countables**.
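
As a small illustration of what "counting by generation" means, the sketch below uses a toy tree (made-up ids and inventor names) and pandas' `nunique`, which counts distinct non-missing values per group. This is an alternative formulation for illustration, not the notebook's own function.

```python
import pandas as pd

# Toy tree: one row per (patent, inventor), so patents repeat across rows
toy_count_tree = pd.DataFrame({
    "gen":      [1,     1,     1,     2,    2],
    "id":       ["P1",  "P1",  "P2",  "P3", "P3"],
    "inventor": ["ann", "bob", "ann", "cy", "cy"],
})

# Number of distinct patents and inventors in each generation
toy_counts = toy_count_tree.groupby("gen")[["id", "inventor"]].nunique().reset_index()
print(toy_counts)
```

Generation 1 has two patents and two inventors, while generation 2 has one of each, even though both generations contain repeated rows.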

In [None]:
def counting_countables(citation_tree, countables=('inventor', 'assignee', 'location', 'id')):
    """
    Takes a citation_tree and an optional parameter countables, and counts the
    number of unique countables (whatever they are) by generation
    Input:
        citation_tree and countables
    Output:
        generation - dataframe with counts of countables by generation
    """
    # Unique (gen, value) pairs, then count the values per generation
    def counter(column):
        return citation_tree.loc[:, ['gen', column]].drop_duplicates().groupby('gen').agg('count').reset_index()
    
    # Generations 1 to 3 are hardcoded; a generation with no rows gets NaN
    generation = pd.DataFrame({'gen':[1,2,3]})
    
    for c in countables:
        generation = pd.merge(generation, counter(c), how='left', on='gen')
        
    return generation

In [None]:
counting_countables(citation_tree).head()

In [None]:
counting_countables(degen_tree).head()

### Averaging the values

Other columns are averaged by generation:

1. Similarity
1. Claims

These are the **averageables**.
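
A minimal sketch of the averaging step on toy data (all values made up). Because the tree has one row per patent and inventor, the sketch first keeps one row per patent and generation, so that a patent with several inventors is not counted twice. De-duplicating on `id` is an assumption of this sketch; the notebook's function below de-duplicates on the values themselves.

```python
import pandas as pd

# Toy tree: patent P1 appears twice because it has two inventors
toy_avg_tree = pd.DataFrame({
    "gen":        [1,    1,    1,    2],
    "id":         ["P1", "P1", "P2", "P3"],
    "similarity": [0.2,  0.2,  0.6,  0.5],
    "claims":     [10,   10,   20,   8],
})

# One row per patent and generation, then the mean of each column by generation
per_patent = toy_avg_tree.drop_duplicates(subset=["id", "gen"])
toy_means = per_patent.groupby("gen")[["similarity", "claims"]].mean().reset_index()
print(toy_means)
```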

In [None]:
def averaging_averageables(citation_tree, averageables=('similarity', 'claims')):
    '''
    Takes a citation_tree and a list of averageable columns, and takes the
    average of each column by generation.
    Inputs:
        citation_tree
        averageables - list of columns that are 'averageable'
    Output:
        generational information
    '''
    columns = ['gen'] + list(averageables)
    
    # Drop duplicate and NA rows, cast everything to float64, then average by generation.
    # Note: duplicates are dropped on (gen, averageables) only, so two distinct
    # patents with identical values in all of these columns collapse into one row.
    averaged = (
        citation_tree.loc[:, columns]
        .drop_duplicates()
        .dropna()
        .astype('float64')
        .groupby('gen')
        .agg('mean')
        .reset_index()
    )
    
    # Generations 1 to 3 are hardcoded; a generation with no rows gets NaN
    generation = pd.DataFrame({'gen':[1,2,3]})
    
    return pd.merge(generation, averaged, how='left', on='gen')

In [None]:
averaging_averageables(citation_tree)

In [None]:
averaging_averageables(degen_tree)

### Putting Things Together
Just boring: putting the countables and the averageables together, tagged with the root so we know which tree they belong to.

In [None]:
citation_generation = pd.merge(counting_countables(citation_tree),averaging_averageables(citation_tree))

In [None]:
citation_generation['root'] = root

In [None]:
citation_generation.head()

In [None]:
degen_generation = pd.merge(counting_countables(degen_tree),averaging_averageables(degen_tree))

In [None]:
degen_generation['root'] = degenerate_root

In [None]:
degen_generation.head()

In [None]:
degen_generation.columns

## Awful and Sad Loops
This is the part where I hate myself, and my laptop would hate me even more because I am abusing it. BUT, whatever, I couldn't care less.
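
For reference, a gentler way to run such a loop is to collect the per-root frames in a list, concatenate once at the end, and record roots that raise an error instead of stopping. The sketch below is only an illustration under assumptions: `collect_generations` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `generational_information` so the example runs without a database.

```python
import pandas as pd

def collect_generations(roots, fetch):
    """Call fetch(root) for every root and stack the results.

    Roots whose call raises are recorded in `failed` and skipped.
    """
    frames, failed = [], []
    for root in roots:
        try:
            frames.append(fetch(root))
        except Exception as exc:
            failed.append((root, repr(exc)))
    # Concatenate once instead of growing a DataFrame inside the loop
    data = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return data, failed

def fake_fetch(root):
    # Hypothetical stand-in for generational_information(root)
    if root == "bad":
        raise RuntimeError("query failed")
    return pd.DataFrame({"gen": [1, 2, 3], "root": root})

demo_data, demo_failed = collect_generations(["a", "bad", "b"], fake_fetch)
print(demo_data.shape, demo_failed)
```

This only helps with roots that fail by raising an exception. A query that hangs or exhausts memory would still need a timeout or batching on top of it.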

In [None]:
sad_loop_to_go_around = list(df['id'])

In [None]:
df = pd.read_csv("./Data/important_patents.csv")
df.head()

In [None]:
def generational_information(root):
    # Query Citation Tree
    citation_tree = conn.query_citation_tree(root)
    
    # Get Max Generation and Clean Data
    citation_tree = get_max_generation(citation_tree)
    
    # Assign NBER
    citation_tree = fixing_na_nber(citation_tree)
    
    # Count and Average Data
    citation_tree = pd.merge(counting_countables(citation_tree),averaging_averageables(citation_tree))
    
    citation_tree['root'] = root
    
    return citation_tree

In [None]:
data = pd.DataFrame(columns = ['gen', 'inventor', 'assignee', 'location', 'id', 'similarity', 'claims','root'])

In [None]:
data.head()

In [None]:
# First batch only; the later cells continue from index 500
for sad in sad_loop_to_go_around[:500]:
    data = pd.concat([data, generational_information(sad)], ignore_index=True)

In [None]:
data.to_csv("generation_pt1.csv", index = False)

In [None]:
for sad in sad_loop_to_go_around[500:557]:
    print(sad)
    data = pd.concat([data, generational_information(sad)], ignore_index=True)

In [None]:
data.to_csv("generation_pt2.csv", index = False)

### This is the DEVIL. Not working on my LAPTOP

In [None]:
for sad in sad_loop_to_go_around[557:558]:
    print(sad)
    data = pd.concat([data, generational_information(sad)], ignore_index=True)

In [None]:
data.to_csv("generation_pt3.csv", index = False)

In [None]:
data = pd.DataFrame(columns = ['gen', 'inventor', 'assignee', 'location', 'id', 'similarity', 'claims','root'])

In [None]:
for sad in sad_loop_to_go_around[558:]:
    print(sad)
    data = pd.concat([data, generational_information(sad)], ignore_index=True)

In [None]:
data.to_csv("generation_pt4.csv", index = False)

In [None]:
df1 = pd.read_csv("generation_pt1.csv")
df2 = pd.read_csv("generation_pt2.csv")
df3 = pd.read_csv("generation_pt4.csv")

In [None]:
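# As the cells are written, `data` is not reset before generation_pt2.csv is saved,
# so pt2 also contains the rows of pt1; drop the repeated rows after concatenating
df = pd.concat([df1,df2,df3], ignore_index=True).drop_duplicates()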

In [None]:
df.shape

In [None]:
df.to_csv("generation.csv", index = False)