# The Idea

> The CPC / IPC "*provides a hierarchical system of language independent symbols for the classification of patents and utility models according to the different areas of technology to which they pertain*".

> To be able to identify each entry in the hierarchy, at least a **classificationn symbol** and "**level**"(CPC)/ "**kind**"(IPC) are specified.

> We will iterate over the levels and create a table in our database containing the entries and the corresponding direct parent to establish  the hierarchy. We will then enrich our database with languages, dates, statistics ... which will be very useful for the visualization later on.

> **Summary**
- Overview: IPC and CPC structure
- IPC and CPC : from XML to database
    - Level formatting 
    - Symbol formatting
- IPC and CPC titles
- IPC and CPC evolution - Technology Tree
- IPC and CPC statistics
- Run

# Overview: IPC and CPC structure

>A complete classification symbol comprises the combined symbols representing the section, class, subclass and main group or subgroup.

![classidication](./static/classification1.PNG)

| Symbol | IPC   | CPC   |
|------|------|------|
|   **Section Symbol**  | one of the capital letters A through H | the addition of section 'Y' is unique to CPC.|
|   **Class Symbol**  | section symbol followed by a two-digit number | - |
|   **Subclass Symbol**  | class symbol followed by a capital letter | - |
|   **Main Group Symbol**  | subclass symbol followed by a one- to three-digit number, the slash ("/") and the number 00 | subclass symbol followed by a one- to four-digit number, the slash ("/") and the number 00 |
|   **Subgroup Symbol**  | subclass symbol followed by the one- to three-digit number of its main group, the slash ("/") and a number of at least two digits other than 00 | subclass symbol followed by the one- to four-digit number of its main group, the slash ("/") and a two- to six-digit numbers other than 00 |

>The hierarchical structure is shown in the following example:


![classidication](./static/classification2.PNG)

> **A new version of the IPC enters into force each year on January 1. The CPC is updated multiple times a years. Therefore the links have to be updated in the config.py!**

# IPC and CPC : from XML to database

In [3]:
# to open the xml files
import requests, zipfile, io, os
import tempfile

# to connect to database
from sqlalchemy import create_engine
from sqlalchemy import inspect
from sqlalchemy.engine.url import URL
from sqlalchemy.types import String
import sqlite3

# to read, write and process the data
import pandas as pd
from lxml import etree as ET
from sklearn import preprocessing
import re
import time
import numpy as np

# initiate configuration class and connect to database
from config import config
config = config()
reqsD = config.connect_to_database()
fileD = config.files()
engine = create_engine(str(URL(**reqsD)) + '/patent_classification')
conn = engine.connect()
inspector = inspect(engine)
# check for schema and table in database
#if scheme in inspector.get_table_names():
#    conn.execute('DROP TABLE patent_classification.' + scheme)

>We iterate over the entries and save the clasification hierarchy i.e. 
- classification symbols as indices
- kinds
- direct parents.

In [17]:
# takes approximately 15 min
def importer(scheme, fileD, schemaL, engine, conn, to_sql):
    with tempfile.TemporaryDirectory() as directory:

        # unzip...
        url = fileD[scheme]
        r = requests.get(url)
        z = zipfile.ZipFile(io.BytesIO(r.content))
        z.extractall(directory)
        filename = os.listdir(directory)[0] #en and fr version available, for the hierarchy struture only need one 

        # ...and parse file
        parser = ET.XMLParser(remove_blank_text=True)
        path = os.path.join(directory, filename)
        tree = ET.parse(path, parser=parser)
        root = tree.getroot()

        # list of parent entries to iterate over
        parentL=[root]

        for parent in parentL:
            # create dataframe for each level and parent
            df = pd.DataFrame(columns=['kind', 'parent'])

            # iterate over children entries
            for entry in parent.iterchildren(tag='{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry'):

                # add children to list of parents  
                parentL.append(entry)

                # save classification symbol
                symbol = entry.attrib['symbol']
                
                # get classification level 
                kind = entry.attrib['kind']
                
                # save kind and symbol of direct parent
                if kind == "s":
                    # for entries of level 1 save "IPC" as parent
                    df.loc[symbol, 'kind']= kind
                    df.loc[symbol, 'parent']= scheme.upper()
                elif kind not in ['t', 'i', 'g', 'n']:
                    # save only relevant entries 
                    if parent.tag == '{http://www.wipo.int/classifications/ipc/masterfiles}ipcEntry':
                        df.loc[symbol, 'kind']= kind
                        df.loc[symbol, 'parent']= parent.attrib['symbol']
                        
            # write to database
            if to_sql == 'yes':
                df.to_sql(scheme, engine, if_exists='append', schema='patent_classification', index=True, index_label="symbol")
                

### Level formatting

> Additionally to the alphanumeric "kinds" of the IPC ("s" for section, "c" for class, "u" for subclass, "m" for main group and 1-9, "A", "B" for sub groups), we will use the numeric equivalent "levels" of the CPC (2-19) to be able to iterate over a well-ordered set.

In [8]:
def add_levels(scheme, engine, to_sql):
    
    # save "levels" by mapping corresponding "kinds"
    df = pd.read_sql_table(scheme, engine, schema='patent_classification', index_col='symbol')
    df['level'] = df['kind']
    kind_to_levelD = {'s': 2, 'c': 3, 'u': 4, 'm' :5, '1': 6, '2': 7, '3': 8, '4': 9, '5': 10,
                      '6': 11, '7': 12, '8': 13, '9': 14, 'A': 15, 'B': 16, 'C':17}
    df.replace({"level": kind_to_levelD}, inplace=True)
    if to_sql == 'yes':
        df.to_sql(scheme, engine, if_exists='replace', schema='patent_classification', index=True)

### Symbol formatting

> The symbols in the xml files have not the format shown in the exemple from the beginin: it's H01F0001053000 and not H01F1/053. So we add the short format for the symbols.

In [9]:
# transforms H01F0001053000 to H01F1/053
def formatting_symbols(x):
    # for section, class, subclass return symbol
    if len(x)<= 4:
        return x
    else:
        # for group remove leading and trailing zeros
        x = x[:4] + str(int(x[4:8])) + '/' + x[8:].rstrip('0')
        
        if x[-1] == '/':
            return x + '00'
        else:
            return x

def add_symbols_short(scheme, engine, to_sql):
    df = pd.read_sql_table(scheme, engine, schema='patent_classification', index_col='symbol')
    df['symbol_short'] = list(map(lambda x: formatting_symbols(x), df.index.tolist()))
    df['parent_short'] = df['parent'].apply(lambda x: formatting_symbols(x))

    if to_sql == 'yes':
        df.to_sql(scheme, engine, if_exists='replace', schema='patent_classification', index=True)

# IPC and CPC titles

> We choose to not extract the titles directly from the xml files (because the syntax is not practical). The txt files containing the IPC/ CPC scheme titles in authentic languages look as follow (except that for the CPC the short symbol format is used). 

In [5]:
titles = pd.read_csv(fileD['ipc_en'][0], sep='\t', header=None)
print(titles.head(10))

                0                                                  1
0               A                                  HUMAN NECESSITIES
1             A01  AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTI...
2            A01B  SOIL WORKING IN AGRICULTURE OR FORESTRY; PARTS...
3  A01B0001000000  Hand tools (edge trimmers for lawns A01G000306...
4  A01B0001020000                                    Spades; Shovels
5  A01B0001040000                                         with teeth
6  A01B0001060000                             Hoes; Hand cultivators
7  A01B0001080000                                with a single blade
8  A01B0001100000                            with two or more blades
9  A01B0001120000                    with blades provided with teeth


>We concatenate the txt files from A to H (A to Y for the CPC) to a dataframe with the symbols as indices and a column of titles, which we then join to the database table. 

In [24]:
def get_titles(scheme, schemeD, index, engine):

    for language in list(schemeD.keys()):

        # concatenate txt files to dataframe of titles 
        titles= []

        # ...for the cpc it's a txt file that we have to unzip first
        if scheme == 'cpc':
            with tempfile.TemporaryDirectory() as directory:
                r = requests.get(schemeD[language])
                z = zipfile.ZipFile(io.BytesIO(r.content))
                z.extractall(directory)
                filenameL = [file for file in os.listdir(directory) if file.endswith('.txt')]
                for filename in filenameL:
                    path = os.path.join(directory, filename)
                    df = pd.read_csv(path,  sep='\t', header=None, names=[index, 'title_' + language])
                    titles.append(df)
        else:
            for filename in schemeD[language]:
                r = requests.get(filename)
                byte = r.content
                s=str(byte,'utf-8')
                data = StringIO(s) 
                df = pd.read_csv(data,  sep='\t', header=None, names=[index, 'title_' + language])
                titles.append(df)
        df_titles = pd.concat(titles, axis=0, ignore_index=True)
        df_titles.drop_duplicates(subset=index, keep='last', inplace=True)

        # append titles to database
        df = pd.read_sql_table(scheme, engine, schema='patent_classification', index_col=index)
        df = df.join(df_titles.set_index(index))
        return df
        #df.to_sql(scheme, engine, if_exists='replace', schema='patent_classification', index=True)

# IPC and CPC evolution - Technology Tree

> To trace the development of the technology fields over time, we save a creation date for every symbol.

In [11]:
def pc_evolution(scheme, index, engine, fileD, to_sql):
    
    # read table from database
    df = pd.read_sql_table(scheme, engine, schema='patent_classification', index_col=index)

    # read inventory of IPC/CPC ever used symbols with creation date
    inventory = fileD[ scheme + '_inventory']
    
    # ...for the cpc it's a txt file that we have to unzip first
    if scheme == 'cpc':
        with tempfile.TemporaryDirectory() as directory:
            r = requests.get(inventory)
            z = zipfile.ZipFile(io.BytesIO(r.content))
            z.extractall(directory)
            filename = [file for file in os.listdir(directory) if file.endswith('.txt')][0]
            path = os.path.join(directory, filename)
            
            pc_inventory = pd.read_csv(path, sep='\t', names= [index, 'creation_date', 'expiration_date'], usecols=[0,1])
    
    # ...for the ipc we can read it directly
    else:
        pc_inventory = pd.read_csv(inventory, sep=';', names= [index, 'creation_date', 'expiration_date'], usecols=[0,1])
    
    # we join the tables...
    df = df.join(pc_inventory.set_index(index))
    
    # ...and fill the creation dates of the section and class symbols
    if scheme == 'cpc':
        creation_data = '2013-01-01'
    else:
        creation_data = 19680901
    df.fillna(value= {'creation_date': creation_data}, inplace=True)
    
    if to_sql == 'yes':
        df.to_sql(scheme, engine, if_exists='replace', schema='patent_classification', index=True, dtype={'creation_date':String()})

# IPC and CPC statistics

> The WIPO offers a summary table with the number of classes, subclasses, main groups, subgroups and groups (i.e. sum of main groups and subgroups) for each section. This statistics are only given at the highest hierachical level. The tree allows us to have statistcs for each entry at each hierarchical level.

> We trace all the possible branches of the tree i.e. the paths of all the symbols from main groups and subgroups. Then we can "count" for each symbol of each level in how many paths it appears and deduce a size for each entry of the tree.

In [12]:
# returns a list with the paths of all symbols from main groups and subgroups as lists of symbols
# takes approximately 30 min
def get_paths(scheme, engine):
    
    df = pd.read_sql_table(scheme, engine, schema='patent_classification')
    df['level'] = pd.to_numeric(df['level'])
    grouped_level = df.groupby('level')
    
    pathsL = []
    
    # iterte df over rows
    for index, row in df.iterrows():
        key = row['symbol']
        level = int(row['level'])
        
        # select only symbols of main groups and subgroups
        if level >4:
            path=[key]
            # trace path by appending successively the corresponding direct parent to a list
            for i in range(level , 2, -1):
                group = grouped_level.get_group(i)
                try:
                    parent = group[group['symbol'] == key]['parent'].iloc[0]
                    key = parent
                    path.append(parent)
                except:
                    print(key)
                    pass
            path = path[::-1]
            pathsL.append(path)
    df = pd.DataFrame(pathsL)
    df.to_csv(scheme + '_statistics.csv')


In [13]:
def pc_statistics(pathsL, scheme, engine, to_sql):
    
    # convert list of paths to dataframe and fill paths with 0 to reach maximal lenght
    paths_df = pd.read_csv('ipc_statistics.csv', index_col=0)
    paths_df.fillna('0', inplace=True)

    stat=[]
    # iterate df over columns in order to group by each level
    for i in range(paths_df.shape[1]):
        grouped = paths_df.groupby([str(j) for j in range(i+1)])
        
        # save symbol and it's size to list
        for key in grouped.groups.keys():
            if '0' not in key:
                stat.append([key[-1], len(grouped.get_group(key))])

    # convert to df with the symbols and for each symbol a value corresponding to the number of groups it contains
    stat_df = pd.DataFrame(stat, columns=['symbol', 'size'])
    
    # add size in % 
    stat_df['size_percent'] = round(stat_df['size'] * 100 / 74503, 3)
    
    # append statistics to database
    df = pd.read_sql_table(scheme, engine, schema='patent_classification')
    #pathsL = get_paths(df, engine)
    df = df.set_index('symbol').join(stat_df.set_index('symbol'))
    
    # add normalised size for each group of children for visualization
    df['size_normalised'] = df['size']
    grouped = df.groupby('parent')
    for key in grouped.groups.keys():
        group = grouped.get_group(key)
        x = group['size_normalised'].values.reshape(-1, 1)
        min_max_scaler = preprocessing.MinMaxScaler(feature_range=(3, 13))
        x_scaled = min_max_scaler.fit_transform(x)
        index = group.index.values
        df.loc[index, 'size_normalised'] = x_scaled
    
    if to_sql == 'yes':
        df.to_sql(scheme, engine, if_exists='replace', schema='patent_classification')

> Sanity checks for the statistics: 

In [10]:
def sanity_check(df):
    grouped = df.groupby('level')
    for key in grouped.groups.keys():
        group = grouped.get_group(key)
        size = group['size'].sum()
        print(key, size)

# From postgresql to sqlite

In [14]:
def postgresql_to_sqlite(scheme, engine):
    df = pd.read_sql_table(scheme, engine, schema='patent_classification')
    conn2 = sqlite3.connect('./patent-classification.db')
    df.to_sql(name=scheme, schema='patent-classification', index=False, con=conn2, if_exists='replace')

In [6]:
conn2 = sqlite3.connect('./patent-classification.db')
import_df = pd.read_sql_query("SELECT * FROM" + scheme, conn2)
print(import_df.head)

<bound method NDFrame.head of                symbol kind          parent  level symbol_short parent_short  \
0                   A    s             IPC      2            A          IPC   
1                 A01    c               A      3          A01            A   
2                A01B    u             A01      4         A01B          A01   
3      A01B0001000000    m            A01B      5     A01B1/00         A01B   
4      A01B0001020000    1  A01B0001000000      6     A01B1/02     A01B1/00   
5      A01B0001040000    2  A01B0001020000      7     A01B1/04     A01B1/02   
6      A01B0001060000    1  A01B0001000000      6     A01B1/06     A01B1/00   
7      A01B0001080000    2  A01B0001060000      7     A01B1/08     A01B1/06   
8      A01B0001100000    2  A01B0001060000      7      A01B1/1     A01B1/06   
9      A01B0001120000    2  A01B0001060000      7     A01B1/12     A01B1/06   
10     A01B0001140000    2  A01B0001060000      7     A01B1/14     A01B1/06   
11     A01B0001160000 

# Run

In [26]:
import time
scheme = input('Choose between cpc or ipc: ').lower()
to_sql = input('Overwrite database?(yes/no): ').lower()


# df with symbols, kinds and direct parents..
# takes approximately 15 min
#importer(scheme, fileD, schemaL, engine, conn, to_sql)

# ..add column for levels
# takes less than 1 min
#add_levels(scheme, engine, to_sql)

# ..add column for short symbols
# takes less than 1 min
#add_symbols_short(scheme, engine, to_sql)

if scheme == 'ipc':
    schemeD = {'en' : fileD['ipc_en'], 'fr': fileD['ipc_fr']}
    index = 'symbol'
elif scheme == 'cpc':
    schemeD = {'en' : fileD['cpc_en']}
    index = 'symbol_short'

# ..add columns for titles
# takes less than 1 min
get_titles(scheme, schemeD, index, engine)

# ..add columns for creation dates
# takes less than 1 min
#pc_evolution(scheme, index, engine, fileD, to_sql)

# ..add columns for sizes i.e. number of groups contained (in % and normalised)
# takes approximately 30 min
#pathsL = get_paths(scheme, engine)
# takes less than 1 min
#pc_statistics(pathsL, scheme, engine, to_sql)

# from postgresql to sqlite
#postgresql_to_sqlite(scheme, engine)

Choose between cpc or ipc: cpc
Overwrite database?(yes/no): yes


ConnectionError: HTTPSConnectionPool(host='www.cooperativepatentclassification.org', port=443): Max retries exceeded with url: /cpc/interleaved/CPCTitleList201908.zip (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x121ac7240>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

### Colors for branches

In [33]:
import numpy as np
creation_date = df['creation_date'].unique().tolist()
creation_year = [date[:4] for date in creation_date]
creation_yearL = sorted(list(map(int, creation_year )))
print(creation_yearL)

[1968, 1968, 1974, 1980, 1985, 1990, 1995, 2000, 2006, 2007, 2007, 2008, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]


In [44]:
for date in creation_date:
    print('--------------------------------', date)
    print(df[df['creation_date'] == date].shape)
    print(df[df['creation_date'] == date].head(10))

-------------------------------- 19680901
(139, 11)
     symbol kind parent  level symbol_short parent_short  \
0         A    s    IPC      2            A          IPC   
1       A01    c      A      3          A01            A   
1680    A21    c      A      3          A21            A   
1846    A22    c      A      3          A22            A   
1919    A23    c      A      3          A23            A   
2717    A24    c      A      3          A24            A   
2972    A41    c      A      3          A41            A   
3235    A42    c      A      3          A42            A   
3282    A43    c      A      3          A43            A   
3697    A44    c      A      3          A44            A   

                                               title_en  \
0                                     HUMAN NECESSITIES   
1     AGRICULTURE; FORESTRY; ANIMAL HUSBANDRY; HUNTI...   
1680  BAKING; EQUIPMENT FOR MAKING OR PROCESSING DOU...   
1846  BUTCHERING; MEAT TREATMENT; PROCESSING POULTR

(291, 11)
              symbol kind          parent  level symbol_short parent_short  \
629   A01D0093000000    m            A01D      5    A01D93/00         A01D   
1639  A01N0065030000    1  A01N0065000000      6    A01N65/03    A01N65/00   
1640  A01N0065040000    1  A01N0065000000      6    A01N65/04    A01N65/00   
1641  A01N0065060000    1  A01N0065000000      6    A01N65/06    A01N65/00   
1642  A01N0065080000    1  A01N0065000000      6    A01N65/08    A01N65/00   
1643  A01N0065100000    2  A01N0065080000      7     A01N65/1    A01N65/08   
1644  A01N0065120000    2  A01N0065080000      7    A01N65/12    A01N65/08   
1645  A01N0065140000    2  A01N0065080000      7    A01N65/14    A01N65/08   
1646  A01N0065160000    2  A01N0065080000      7    A01N65/16    A01N65/08   
1647  A01N0065180000    2  A01N0065080000      7    A01N65/18    A01N65/08   

                                               title_en  \
629   Harvesting apparatus not provided for in other...   
1639         

(615, 11)
              symbol kind          parent  level symbol_short parent_short  \
4635  A47C0001035500    5  A47C0001035000     10   A47C1/0355    A47C1/035   
6313  A61F0002070000    4  A61F0002060000      9     A61F2/07     A61F2/06   
6352  A61F0002844000    2  A61F0002820000      7    A61F2/844     A61F2/82   
6353  A61F0002848000    2  A61F0002820000      7    A61F2/848     A61F2/82   
6354  A61F0002852000    2  A61F0002820000      7    A61F2/852     A61F2/82   
6355  A61F0002856000    2  A61F0002820000      7    A61F2/856     A61F2/82   
6358  A61F0002890000    3  A61F0002860000      8     A61F2/89     A61F2/86   
6360  A61F0002910000    4  A61F0002900000      9     A61F2/91      A61F2/9   
6361  A61F0002915000    5  A61F0002910000     10    A61F2/915     A61F2/91   
6363  A61F0002930000    3  A61F0002920000      8     A61F2/93     A61F2/92   

                                               title_en  \
4635   actuated by linkages, e.g. lazy-tongs mechanisms   
6313         

In [28]:
df = pd.read_sql_table('cpc', engine, schema='patent_classification')
print(df.shape)
print(df.dtypes)
print(df.head(5))

(180118, 8)
symbol_short     object
symbol           object
kind             object
parent           object
level            object
parent_short     object
title_en         object
creation_date    object
dtype: object
  symbol_short symbol kind parent level parent_short  \
0            A      A    s    CPC     2          CPC   
1            B      B    s    CPC     2          CPC   
2            C      C    s    CPC     2          CPC   
3            D      D    s    CPC     2          CPC   
4            E      E    s    CPC     2          CPC   

                              title_en creation_date  
0                    HUMAN NECESSITIES    2013-01-01  
1  PERFORMING OPERATIONS; TRANSPORTING    2013-01-01  
2                CHEMISTRY; METALLURGY    2013-01-01  
3                      TEXTILES; PAPER    2013-01-01  
4                  FIXED CONSTRUCTIONS    2013-01-01  


In [40]:
print(len(df[df.duplicated(subset='symbol')]))

0


In [44]:
df = pd.read_sql_table('cpc', engine, schema='patent_classification')
grouped_level = df.groupby('level')

pathsL = []

# iterte df over rows
for index, row in df.iterrows():
    key = row['symbol']
    level = int(row['level'])

    # select only symbols of main groups and subgroups
    if level >4:
        path=[key]
        # trace path by appending successively the corresponding direct parent to a list
        for i in range(level , 2, -1):
            group = grouped_level.get_group(i)
            try:
                parent = group[group['symbol'] == key]['parent'].iloc[0]
                key = parent
                path.append(parent)
            except:
                print(key, level)
                pass
        path = path[::-1]
        pathsL.append(path)
paths_df = pd.DataFrame(pathsL)

A01D0101000000 5
A01D0101000000 5
A01F0025000000 5
A01F0025000000 5
A47C0020000000 7
A47G0019000000 7
A62D0101000000 5
A62D0101000000 5
A63B0102000000 5
A63B0102000000 5
B01D0024000000 5
B01D0024000000 5
B07C0099000000 5
B07C0099000000 5
B23K0101000000 5
B23K0101000000 5
B23K0103000000 5
B23K0103000000 5
B29C0065000000 5
B29C0065000000 5
B29K0001000000 5
B29K0001000000 5
B29K0007000000 5
B29K0007000000 5
B29K0009000000 5
B29K0009000000 5
B29K0019000000 5
B29K0019000000 5
B29K0021000000 5
B29K0021000000 5
B29K0023000000 5
B29K0023000000 5
B29K0023000000 8
B29K0023000000 8
B29K0023000000 8
B29K0023000000 8
B29K0023000000 8
B29K0023000000 8
B29K0023000000 7
B29K0025000000 5
B29K0025000000 5
B29K0027000000 5
B29K0027000000 5
B29K0029000000 5
B29K0029000000 5
B29K0031000000 5
B29K0031000000 5
B29K0033000000 5
B29K0033000000 5
B29K0035000000 5
B29K0035000000 5
B29K0045000000 5
B29K0045000000 5
B29K0055000000 5
B29K0055000000 5
B29K0059000000 5
B29K0059000000 5
B29K0061000000 5
B29K0061000000

E01D0101000000 5
E01D0101000000 5
E05C0017000000 7
E21B0001000000 5
E21B0001000000 5
E21B0003000000 5
E21B0003000000 5
E21B0015000000 5
E21B0015000000 5
E21B0021000000 5
E21B0021000000 5
F04B0037000000 5
F04B0037000000 5
F16D0121000000 5
F16D0121000000 5
F16D0123000000 5
F16D0123000000 5
F16D0125000000 5
F16D0125000000 5
F16D0127000000 5
F16D0127000000 5
F16D0129000000 5
F16D0129000000 5
F16D0131000000 5
F16D0131000000 5
F16K0011000000 5
F16K0011000000 5
F16L0101000000 5
F16L0101000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21V0001000000 5
F21W0102000000 5
F21W0102000000 5
F21W0103000000 5
F21W0103000000 5
F21W0104000000 5
F21W0104000000 5
F21W0105000000 5
F21W0105000000 5
F21W0106000000 5
F21W0106000000 5
F21W0107000000 5
F21W0107000000 5
F21W0111000000 5
F21W0111000000 5
F21W0121000000 5
F21W0121000000 5
F21W0131000000 5
F21W0131000000 5
F21Y0101000000 5
F21Y0101000000

KeyboardInterrupt: 

In [48]:
df = pd.read_sql_table('ipc', engine, schema='patent_classification')
dateL = list(df['creation_date'].unique())
dateL = sorted([int(str(date)[:4]) for date in dateL])
print(dateL)

[1968, 1974, 1980, 1985, 1990, 1995, 2000, 2006, 2007, 2007, 2008, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019]


In [8]:
df = pd.read_sql_table('ipc', engine, schema='patent_classification')
print(df.shape)

(75287, 12)


In [7]:
grouped = df.groupby('creation_date')
keys = list(grouped.groups.keys())
for key in keys:
    group = grouped.get_group(key)
    print(key, group.shape)

19680901 (39374, 12)
19740701 (6271, 12)
19800101 (5608, 12)
19850101 (4187, 12)
19900101 (7117, 12)
19950101 (2824, 12)
20000101 (1658, 12)
20060101 (1307, 12)
20070101 (58, 12)
20071001 (28, 12)
20080101 (146, 12)
20080401 (25, 12)
20090101 (291, 12)
20100101 (402, 12)
20110101 (493, 12)
20120101 (501, 12)
20130101 (615, 12)
20140101 (626, 12)
20150101 (336, 12)
20160101 (1027, 12)
20170101 (684, 12)
20180101 (1040, 12)
20190101 (669, 12)
