# Patent EDA

This is an exploratory analysis of the patent data. 

Some questions:

* What are the variables in the data?
* What are the missing and present values?
* What do the data capture *legally*?
* What will be the most interesting transformations to carry out on this data?
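As a starting point for the second question, a per-column summary of types and missing values can be sketched as follows. This is a minimal sketch on an invented toy table: `toy` and `summary` are hypothetical names, and only the column names are borrowed from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the patent table (values invented for illustration)
toy = pd.DataFrame({
    'appln_id': [1, 1, 2, 3],
    'person_name': ['A', 'B', None, 'C'],
    'docdb_family_id': [10.0, 10.0, np.nan, np.nan],
})

# One row per column: dtype, number of missing values, share of missing values
summary = pd.DataFrame({
    'dtype': toy.dtypes.astype(str),
    'n_missing': toy.isna().sum(),
    'share_missing': toy.isna().mean(),
})

print(summary)
```

Running the same three expressions on `pat` instead of `toy` should give the equivalent table for the real data.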

### 0. Preamble

In [None]:
%run notebook_preamble.ipy

In [None]:
import pandas_profiling as pp

### 1. Load data

In [None]:
pat = pd.read_csv('../data/raw/18_6_2019_patent_apps.csv')

In [None]:
pat.head()

In [None]:
pat.shape

### 2. EDA

In [None]:
print('|name|type|observations|')
print('|----|----|----|')

for c in pat.columns:
    
    print(f'|{c}|{type(pat[c].iloc[0])}|   |')

In [None]:
for c in pat.columns:
    
    print(c)
    print('=====')
    
    var = pat[c].dropna()
    
    print(type(var.iloc[0]))
    print(var.iloc[0])
    
    print('\n')

Some questions to check:

* How many unique ids are there?
* Is it a many patents to one applicant table?
* Or do we have lists of applicants where there is more than one?
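These questions can be approached by comparing row counts with unique-id counts. The sketch below uses an invented toy table (`toy` is a hypothetical name; `appln_id` and `psn_id` are the real column names) and is only one way of checking the table's shape.

```python
import pandas as pd

# Toy stand-in: one row per (application, person) pair, values invented
toy = pd.DataFrame({
    'appln_id': [1, 1, 2, 3, 3, 3],
    'psn_id': [100, 101, 100, 102, 103, 100],
})

n_rows = len(toy)
n_applications = toy['appln_id'].nunique()

# More than one row per application means a long table, not lists of applicants
rows_per_application = toy.groupby('appln_id').size()

# More than one application per person means persons recur across applications
applications_per_person = toy.groupby('psn_id')['appln_id'].nunique()

print(n_rows, n_applications)
print(rows_per_application.max(), applications_per_person.max())
```

If the number of rows exceeds the number of unique application ids, the table holds several rows per application.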

### Some basic analysis

#### How many patents are there in the data?

In [None]:
len(set(pat.appln_id))

There are only around 770k unique application ids.

In [None]:
# Count unique patent families, ignoring missing ids
pat.docdb_family_id.nunique()

Fewer than 300,000 unique patent families.

In [None]:
#what about people / organisations

len(set(pat.psn_id.dropna()))

There are more persons than patents, which suggests that each row represents one person. Let's check.

In [None]:
# pat.dropna(axis=0,
#            subset=['docdb_family_id']).loc[pat['docdb_family_id'].duplicated()][['appln_id','psn_name','docdb_family_id','appln_abstract']]

In [None]:
#focus on patents with ids

pat.sort_values('appln_id')[['appln_id','person_name','docdb_family_id','appln_abstract','person_address',
                             'invt_seq_nr','applt_seq_nr','appln_auth']].head(n=30)

Some observations:

* Each row captures a bit of information - the name and address of an inventor or applicant, the abstract etc...
* Some of the missing inventors / applicants must be based outside of GB

Shall we create a df where every row is a patent application?

We would then have:

* Abstract
* Technical information (technology area, ipc code, nace code etc.)
* Patent family
* Application and publication year
* Whether it has been granted or not
* Authority (can we check repeated patents?)
* Information about the inventor
* Information about the applicant

For each variable, we should group by application id and collect that variable's values into a list.
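To make the idea concrete, here is a minimal sketch of the group-and-collect step on an invented toy table. The names `toy`, `columns` and `grouped` are hypothetical; the column names come from the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-in: application 1 has two persons, application 2 has no abstract
toy = pd.DataFrame({
    'appln_id': [1, 1, 2],
    'person_name': ['A', 'B', 'C'],
    'appln_filing_year': [2010, 2010, 2012],
    'appln_abstract': ['text', 'text', np.nan],
})

columns = [c for c in toy.columns if c != 'appln_id']

# For each column: drop missing values, then collect the rest into a list per application
grouped = pd.concat(
    [toy.dropna(subset=[c]).groupby('appln_id')[c].apply(list) for c in columns],
    axis=1,
)

print(grouped)
```

Note that application-level values repeated on every row (the filing year here) come back as lists with repeats, and applications with no values for a column end up as `NaN` rather than an empty list.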



Organise the patent data more sensibly

In [None]:
%%time
# For each column, group by application id and collect the non-missing values into a list

pat_gr = []

for column in pat.columns:

    # Drop rows where this column is missing (dropna returns a new frame, so no copy is needed)
    p = pat.dropna(axis=0, subset=[column])

    # Collect the values into a list per application. Applications with no values
    # for this column are absent here and become NaN after the concat below.
    # Later on we extract the value when every list has length 1.
    group = p.groupby('appln_id')[column].apply(list)

    pat_gr.append(group)

In [None]:
pat_grouped = pd.concat(pat_gr,axis=1)

In [None]:
pat_grouped.head()

Next steps: 

* Convert non-list variables
* Remove redundant fields
* Do EDA
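The first step can be illustrated on a small invented example. This is a sketch only: `unwrap_singletons`, `grouped` and `flat` are hypothetical names, and the rule assumed here is that a column is unwrapped when every non-missing list in it has exactly one element.

```python
import numpy as np
import pandas as pd

# Toy grouped table: lists per application, NaN where nothing was recorded
grouped = pd.DataFrame(
    {
        'appln_kind': [['A'], ['W'], np.nan],
        'person_name': [['A', 'B'], ['C'], ['D']],
    },
    index=pd.Index([1, 2, 3], name='appln_id'),
)


def unwrap_singletons(df):
    """Replace one-element lists by their element, column by column."""
    out = df.copy()
    for col in df.columns:
        lengths = {len(v) for v in df[col].dropna()}
        if lengths == {1}:
            # Missing entries are NaN floats, not lists, so test the type
            out[col] = df[col].apply(lambda v: v[0] if isinstance(v, list) else np.nan)
    return out


flat = unwrap_singletons(grouped)
print(flat)
```

Here `appln_kind` becomes a plain column, while `person_name` stays as lists because one application has two persons.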


In [None]:
#Extract from list those fields that only have one value:

pat_grouped_2 = pat_grouped.copy()

In [None]:
# For each column
for c in pat_grouped.columns:

    # If every list in a column has exactly one element, extract that element (the column is not really a list)
    lengths = set(len(x) for x in pat_grouped[c].dropna())

    print(len(lengths))

    if lengths == {1}:
        # Missing entries are NaN floats rather than lists, so test the type instead of calling pd.isnull on a list
        pat_grouped_2[c] = pat_grouped[c].apply(lambda x: x[0] if isinstance(x, list) else np.nan)

In [None]:
for c in pat_grouped_2.columns:
    
    print(c)
    print('=====')
    
    var = pat_grouped_2[c].dropna()
    
    try:
        print(type(var.iloc[0]))
        print(var.iloc[0])
    except IndexError:
        print('all missing')
    
    print('\n')
    

In [None]:
# pat_grouped_2.to_csv(f'../data/interim/{today_str}_patent_grouped.csv',compression='zip')

In [None]:
pat.appln_auth.value_counts().head()

Are these applications duplicated across authorities?
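One possible reading of this question is whether the same invention (the same DOCDB family) has been filed with more than one authority. Under that assumption, a check could look like the sketch below, which uses an invented toy table; `toy`, `apps`, `auths_per_family` and `multi_auth_families` are hypothetical names.

```python
import pandas as pd

# Toy stand-in: family 10 is filed in GB and EP, family 20 only in GB
toy = pd.DataFrame({
    'appln_id': [1, 1, 2, 3],
    'docdb_family_id': [10, 10, 10, 20],
    'appln_auth': ['GB', 'GB', 'EP', 'GB'],
})

# Keep one row per application so that persons are not counted repeatedly
apps = toy.drop_duplicates('appln_id')

# Number of distinct authorities in which each family has been filed
auths_per_family = apps.groupby('docdb_family_id')['appln_auth'].nunique()

multi_auth_families = auths_per_family[auths_per_family > 1]
print(multi_auth_families)
```

Families that appear in `multi_auth_families` would be candidates for repeated patents.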

In [None]:
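# reindex (rather than .loc) tolerates years with no rows; note that this counts rows, not unique applications
pat.appln_filing_year.value_counts().reindex(np.arange(2005,2019), fill_value=0).plot.bar(color='blue',title='Patents per year')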

In [None]:
pat.appln_kind.value_counts()

Note that kind `W` denotes an international application filed under the Patent Cooperation Treaty (PCT).

### Profiling

In [None]:
pp.ProfileReport(pat)