# Data Processing with EP-Fulltext
In this notebook we will use the EPAB client that is built into TIP to analyze the portfolio of granted applications of a given applicant. This lesson assumes that the user is familiar with TIP and the EPAB client.

In [1]:
# We first need to initialize the EPAB client.
from epo.tipdata.epab import EPABClient
epab = EPABClient()

This client instance is currently configured to use a test dataset with reduced number of publications (~10K).
Use EPABClient(env='PROD') to use the complete EPAB dataset (>7M publications).



We will create a query that represents the applicant that we want to search for. In this case we want to find publications that contain the string "Marel Food Systems" in the applicant field. Note that the search is case-insensitive.

In [2]:
query = 'marel food systems'
q = epab.query_applicant_name(query)

The query we get back is an object, not yet a list of results.

In [3]:
print (type(q))

<class 'epo.tipdata.epab.query.Query'>


In order to get the results from the query, we need to use the method `get_results`. We need to specify the fields that we want to display for each publication that results from our search query. For now we are going to look at the applicant field in each publication. We will ask the EPAB object to render the results as a list.

In [4]:
publications = q.get_results(" application.number, applicant", output_type='list')

print (publications)

[{'application': {'number': '06821704.1'}, 'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9', 'city': '210 Gardabær', 'country': 'IS'}]}, {'application': {'number': '08702741.3'}, 'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9', 'city': '210 Gardabær', 'country': 'IS'}]}, {'application': {'number': '02787168.0'}, 'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9', 'city': '210 Gardabær', 'country': 'IS'}]}, {'application': {'number': '07350006.8'}, 'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9', 'city': '210 Gardabær', 'country': 'IS'}]}, {'application': {'number': '07706198.4'}, 'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9', 'city': '210 Gardabær', 'country': 'IS'}]}, {'application': {'number': '08775601.1'}, 'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9', 'city': '210 Gardabær', 'country': 'IS'}]}, {'application':

We can see that the result is a list with one dictionary per publication. In each dictionary there is a key `applicant` that contains a list of dictionaries. This is because each publication can mention more than one applicant.

Each dictionary in that list corresponds to one applicant, and you can see that it contains information about the applicant beyond the name, such as the address, city and country. As long as the query string is present in the name of at least one of the applicants mentioned in a publication, that publication will show up in the search results.
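
The nested structure described above can be flattened with a short comprehension. The following is a minimal sketch that assumes a small sample copied from the output shown earlier; the variable name `applicant_names` is introduced here for illustration only.

```python
# Two sample entries copied from the output above (for illustration only)
publications = [
    {'application': {'number': '06821704.1'},
     'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9',
                    'city': '210 Gardabær', 'country': 'IS'}]},
    {'application': {'number': '08702741.3'},
     'applicant': [{'name': 'Marel Food Systems hf.', 'address': 'Austurhrauni 9',
                    'city': '210 Gardabær', 'country': 'IS'}]},
]

# Map each application number to the names of all applicants it mentions.
# The inner list comprehension handles publications with several applicants.
applicant_names = {
    pub['application']['number']: [a['name'] for a in pub['applicant']]
    for pub in publications
}

for number, names in applicant_names.items():
    print(number, names)
```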

## Publications and applications 

To continue with our analysis, it's crucial to understand that a patent application may go through multiple publications during its lifetime. For a comprehensive overview of the different types of publications associated with a patent application, please refer to [this link](#).

**Key Concepts:**
- A patent application is initially published as either an **A1** or **A2 publication**.
- If the patent is subsequently granted, it is then published as a **B1 publication**.
- Additional publications, denoted as **Bx**, may occur after the patent is granted, e.g. B8.
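
As a rough illustration of the key concepts above, the leading letter of the kind code is enough to tell the two stages apart. This is a minimal sketch; `publication_stage` and its return values are hypothetical names chosen for this example, not part of the EPAB client.

```python
def publication_stage(kind):
    """Classify a kind code by its leading letter (hypothetical helper)."""
    if kind.startswith('A'):
        # A publications: the application as published
        return 'application'
    if kind.startswith('B'):
        # B publications: the granted patent and later B-type publications
        return 'granted'
    return 'other'

for kind in ['A1', 'A2', 'B1', 'B8']:
    print(kind, publication_stage(kind))
```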

We will now group the different publications according to their corresponding application. We will later use the filing date to get an overview of the patent applications filed (and granted) per year. 


In [5]:
## Extract results again 
publications = q.get_results("publication.kind, application.number, application.filing_date", output_type='list')
print (publications)

[{'publication': {'kind': 'A1'}, 'application': {'number': '06821704.1', 'filing_date': '20061219'}}, {'publication': {'kind': 'A1'}, 'application': {'number': '08702741.3', 'filing_date': '20080131'}}, {'publication': {'kind': 'B8'}, 'application': {'number': '02787168.0', 'filing_date': '20020712'}}, {'publication': {'kind': 'A1'}, 'application': {'number': '07706198.4', 'filing_date': '20070115'}}, {'publication': {'kind': 'B1'}, 'application': {'number': '07350006.8', 'filing_date': '20070510'}}, {'publication': {'kind': 'B1'}, 'application': {'number': '08702741.3', 'filing_date': '20080131'}}, {'publication': {'kind': 'B2'}, 'application': {'number': '08702741.3', 'filing_date': '20080131'}}, {'publication': {'kind': 'B1'}, 'application': {'number': '06821704.1', 'filing_date': '20061219'}}, {'publication': {'kind': 'B1'}, 'application': {'number': '08775601.1', 'filing_date': '20080226'}}]


We can see that some application numbers have more than one publication associated with them. Let's reorganize the results, grouping the publication kinds under unique application numbers. We want a list containing one dictionary per application, and in that dictionary a list of all the publications for that specific application.

In [6]:
# Initialize an empty set to store unique application numbers
unique_application_numbers = set()

# Initialize an empty list to store unique applications
unique_applications = []

# Iterate through each item (dictionary) in the publications list
for item in publications:
    # Extract the application number, filing date, and publication kind from the current item
    application_number = item['application']['number']
    filing_date = item['application']['filing_date']
    publication_kind = item['publication']['kind']
    
    # Check if the application number is already in the set of unique application numbers
    if application_number not in unique_application_numbers:
        # If not, add it to the set
        unique_application_numbers.add(application_number)
        # Create a new dictionary for this unique application number
        new_application = {
            'application_number': application_number,
            # instead of the filing date, we extract the first 4 digits to determine the year when the application was filed
            'year': filing_date[:4],
            'publications': [publication_kind]  # Initialize the list with the current publication kind
        }
        # Append the new application dictionary to the list of unique applications
        unique_applications.append(new_application)
    else:
        # If the application number is already in the set, find the corresponding dictionary
        for app in unique_applications:
            if app['application_number'] == application_number:
                # Add the new publication kind to the list inside the existing dictionary
                app['publications'].append(publication_kind)
                break  # Break the loop once the correct dictionary is found

print (unique_applications)


[{'application_number': '06821704.1', 'year': '2006', 'publications': ['A1', 'B1']}, {'application_number': '08702741.3', 'year': '2008', 'publications': ['A1', 'B1', 'B2']}, {'application_number': '02787168.0', 'year': '2002', 'publications': ['B8']}, {'application_number': '07706198.4', 'year': '2007', 'publications': ['A1']}, {'application_number': '07350006.8', 'year': '2007', 'publications': ['B1']}, {'application_number': '08775601.1', 'year': '2008', 'publications': ['B1']}]
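
The same grouping can also be written with a dictionary keyed by application number, which avoids searching the list for every repeated application. This is a sketch of an alternative, not a replacement for the cell above; the sample rows are copied from the earlier output, and `rows` and `applications_by_number` are names introduced here for illustration.

```python
# Sample rows copied from the output above: (kind, application number, filing date)
rows = [
    ('A1', '06821704.1', '20061219'), ('A1', '08702741.3', '20080131'),
    ('B8', '02787168.0', '20020712'), ('A1', '07706198.4', '20070115'),
    ('B1', '07350006.8', '20070510'), ('B1', '08702741.3', '20080131'),
    ('B2', '08702741.3', '20080131'), ('B1', '06821704.1', '20061219'),
    ('B1', '08775601.1', '20080226'),
]
# Rebuild the nested structure that get_results returned
publications = [
    {'publication': {'kind': kind},
     'application': {'number': number, 'filing_date': filing_date}}
    for kind, number, filing_date in rows
]

# Dictionary keyed by application number; dictionaries keep insertion order
applications_by_number = {}
for item in publications:
    number = item['application']['number']
    # Fetch the entry for this application, creating it the first time we see it
    entry = applications_by_number.setdefault(number, {
        'application_number': number,
        'year': item['application']['filing_date'][:4],
        'publications': [],
    })
    entry['publications'].append(item['publication']['kind'])

unique_applications = list(applications_by_number.values())
print(unique_applications)
```

For a handful of publications the difference is negligible, but a dictionary lookup stays fast when the result list grows to thousands of entries.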


## Understanding the results
We now have a much clearer view of the results, with the grouping we have done. Let's take a look at some of the results we have obtained.

- Application 06821704.1 was filed in 2006, published as an A1 with a search report, then granted with a B1.
- Application 08702741.3 was filed in 2008, published as an A1 with a search report, granted with a B1, and then the patent was amended and published as a B2.
- Application 07706198.4 was published as an A1, but never granted.

We also have some curious cases where we see a B type specification but no A type specification. This should not be possible, since we know that every patent application must first be published as an A type publication. What is happening? The answer comes from looking into the history of these applications in the register.

For example, when opening [application 07350006.8](https://register.epo.org/application?number=EP07350006) in the register, we can see that this application was filed by a private French individual listed as applicant, and published as such. After the publication of the A1, the application was transferred to Marel Food Systems, and the subsequent B1 was published listing the new applicant. Our results make sense then!
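
In a larger result set it may be impractical to spot such cases by eye. The sketch below flags applications that have a B type publication but no A type publication in the results, so that they can be checked in the register. It assumes the grouped list produced earlier (copied here as sample data); `granted_without_a` is a name introduced for this example.

```python
# Grouped applications copied from the output above
unique_applications = [
    {'application_number': '06821704.1', 'year': '2006', 'publications': ['A1', 'B1']},
    {'application_number': '08702741.3', 'year': '2008', 'publications': ['A1', 'B1', 'B2']},
    {'application_number': '02787168.0', 'year': '2002', 'publications': ['B8']},
    {'application_number': '07706198.4', 'year': '2007', 'publications': ['A1']},
    {'application_number': '07350006.8', 'year': '2007', 'publications': ['B1']},
    {'application_number': '08775601.1', 'year': '2008', 'publications': ['B1']},
]

# Keep applications with at least one B publication and no A publication
granted_without_a = [
    app['application_number']
    for app in unique_applications
    if any(kind.startswith('B') for kind in app['publications'])
    and not any(kind.startswith('A') for kind in app['publications'])
]

print(granted_without_a)
```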

# Lesson 2
## Analyzing granted patents
We ended the first lesson with a result set that is small enough to analyze quickly, but broader searches may need further analysis. For example, we may want to see only the applications that have been granted. We know that for an application to have been granted, a B type publication must have been issued.

In [7]:
# We create a new list to include only the applications that have been granted. 
granted = []

# We iterate over the list of applications
for app in unique_applications:
    # for each application, the built-in any() checks its list of publications
    # and stops as soon as the first B publication is found
    if any (publication.startswith('B') for publication in app['publications']):
        # at least one B type publication exists, so we record the application as granted
        granted.append({
                'application': app['application_number'],
                'year' : app['year'],
        })
          

print ("Total applications mentioning ", query, "as one of the applicants:", len(unique_applications))
print ("Granted applications:", len(granted))

Total applications mentioning  marel food systems as one of the applicants: 6
Granted applications: 5


We can see that in our results there is one application that was not granted. Let's take a look at the structure of the list we have created. We now have a list of the applications that have been granted as patents and that mention an applicant whose name contains our query. We also see the year of filing, which can be useful to study trends and to get an overview of the patents that are close to lapsing.

In [8]:
# We print now the list of granted applications
print (granted)

[{'application': '06821704.1', 'year': '2006'}, {'application': '08702741.3', 'year': '2008'}, {'application': '02787168.0', 'year': '2002'}, {'application': '07350006.8', 'year': '2007'}, {'application': '08775601.1', 'year': '2008'}]
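
To get a first impression of which patents are close to lapsing, the filing year can be turned into a latest possible expiry year. This is a rough sketch that assumes the 20-year term of a European patent counted from the filing date; a patent can lapse earlier, for example when renewal fees are not paid, so the register remains the authoritative source. The key `latest_expiry_year` is a name chosen for this example.

```python
PATENT_TERM_YEARS = 20  # assumed maximum term, counted from the filing date

# Granted applications copied from the output above
granted = [
    {'application': '06821704.1', 'year': '2006'},
    {'application': '08702741.3', 'year': '2008'},
    {'application': '02787168.0', 'year': '2002'},
    {'application': '07350006.8', 'year': '2007'},
    {'application': '08775601.1', 'year': '2008'},
]

# Copy each dictionary and add the latest possible expiry year
with_expiry = [
    {**app, 'latest_expiry_year': int(app['year']) + PATENT_TERM_YEARS}
    for app in granted
]

# List the applications starting with the ones that expire first
for app in sorted(with_expiry, key=lambda a: a['latest_expiry_year']):
    print(app['application'], app['latest_expiry_year'])
```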


## Linking to the register
When studying a portfolio, it can be very useful to have a link for each application that opens it in the European Patent Register. Luckily for us, the URLs of the European Patent Register contain the application number, so by knowing the application number we can compose its register link.

In [9]:
# We rebuild the list of granted applications, this time adding the register link
granted = []
for app in unique_applications:
    if any (publication.startswith('B') for publication in app['publications']):
        # storing the application number in a variable
        application_number = app['application_number']
        
        # the url of the register does not include the check digit.
        # We remove the last 2 characters of the application number (the dot and the check digit)
        no_check_digit = application_number[:-2]

        # we insert the shortened application number into the register url
        url = f"https://register.epo.org/application?number=EP{no_check_digit}"
        granted.append({
                'application': app['application_number'],
                'year' : app['year'],
                'register link' : url
        })

print (granted)

[{'application': '06821704.1', 'year': '2006', 'register link': 'https://register.epo.org/application?number=EP06821704'}, {'application': '08702741.3', 'year': '2008', 'register link': 'https://register.epo.org/application?number=EP08702741'}, {'application': '02787168.0', 'year': '2002', 'register link': 'https://register.epo.org/application?number=EP02787168'}, {'application': '07350006.8', 'year': '2007', 'register link': 'https://register.epo.org/application?number=EP07350006'}, {'application': '08775601.1', 'year': '2008', 'register link': 'https://register.epo.org/application?number=EP08775601'}]


You can see now that our list contains a link to the register for all the applications in our results that have been granted. 
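
If the link is needed in more than one place, the URL construction can be wrapped in a small function. This is a minimal sketch; `register_url` is a hypothetical helper name, and the URL pattern is the one used in the register link earlier in this notebook.

```python
def register_url(application_number):
    """Build a register link from an application number such as '07350006.8'."""
    # Everything before the dot is the number without its check digit
    base_number = application_number.split('.')[0]
    return f"https://register.epo.org/application?number=EP{base_number}"

print(register_url('07350006.8'))
```

Splitting on the dot instead of slicing a fixed number of characters also works when the application number is passed in without a check digit.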

## Working with dataframes
We have created a list that does not have nested lists or dictionaries inside. We have been working with a list to be able to process data with Python, but printing a list is not a practical way of visualizing data. We are now going to start using Pandas, a very popular library for working with data sets. It has built-in functions for analyzing, cleaning, exploring, and manipulating data. For a good tutorial about Pandas, visit [this link](https://www.w3schools.com/python/pandas/pandas_intro.asp).

In [10]:
# We import the pandas library
import pandas as pd

# creating the dataframe
df = pd.DataFrame(granted)

# displaying the dataframe
df

| | application | year | register link |
|---|---|---|---|
| 0 | 06821704.1 | 2006 | https://register.epo.org/application?number=EP... |
| 1 | 08702741.3 | 2008 | https://register.epo.org/application?number=EP... |
| 2 | 02787168.0 | 2002 | https://register.epo.org/application?number=EP... |
| 3 | 07350006.8 | 2007 | https://register.epo.org/application?number=EP... |
| 4 | 08775601.1 | 2008 | https://register.epo.org/application?number=EP... |


## Displaying the links as clickable
We can see that Jupyter rendered the links as clickable when we displayed the results as a list, but when displaying the dataframe we have lost the formatting. This can be easily solved because Pandas has a method apply() that can run a function over a whole column. We also need to import IPython to be able to display DataFrames as HTML.

In [11]:
from IPython.display import display, HTML


# Function to create clickable links
def make_clickable(link):
    return f'<a target="_blank" href="{link}">open</a>'

# Apply the function to the 'register link' column
df['register link'] = df['register link'].apply(make_clickable)



# Display the DataFrame with clickable links
display(HTML(df.to_html(escape=False)))

| | application | year | register link |
|---|---|---|---|
| 0 | 06821704.1 | 2006 | open |
| 1 | 08702741.3 | 2008 | open |
| 2 | 02787168.0 | 2002 | open |
| 3 | 07350006.8 | 2007 | open |
| 4 | 08775601.1 | 2008 | open |
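
Because `to_html(escape=False)` leaves the cell contents untouched, any special character in a link ends up in the HTML as-is. The register links used here contain no such characters, but a slightly more defensive variant of the function could escape the link first. This is a sketch using the standard library; the `label` parameter is an addition made for this example.

```python
from html import escape

def make_clickable(link, label='open'):
    """Return an HTML anchor for the link, escaping special characters."""
    # quote=True also escapes quotation marks, which matters inside href="..."
    safe_link = escape(link, quote=True)
    return f'<a target="_blank" href="{safe_link}">{escape(label)}</a>'

print(make_clickable('https://register.epo.org/application?number=EP07350006'))
```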


## Further processing results
One useful way of displaying the results is to order the applications by year of filing.

In [12]:
# Sort the DF
df = df.sort_values(by=['year', 'application']).reset_index(drop=True)
# Display the DataFrame with clickable links
display(HTML(df.to_html(escape=False)))

| | application | year | register link |
|---|---|---|---|
| 0 | 02787168.0 | 2002 | open |
| 1 | 06821704.1 | 2006 | open |
| 2 | 07350006.8 | 2007 | open |
| 3 | 08702741.3 | 2008 | open |
| 4 | 08775601.1 | 2008 | open |


### Aggregating the results
In our example we are working with a small set of applications, but depending on the query, the result list could be several orders of magnitude larger. In those cases it can be useful to aggregate the data to visualize trends.

In [13]:
# Group by year and count the number of applications per year
aggregated_df = df.groupby('year').agg({'application': 'count'}).reset_index()


aggregated_df

| | year | application |
|---|---|---|
| 0 | 2002 | 1 |
| 1 | 2006 | 1 |
| 2 | 2007 | 1 |
| 3 | 2008 | 2 |
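
When looking at trends, years without any filing are easy to overlook because they simply do not appear in the aggregated table. The sketch below counts the applications per year and then fills the gaps with zeros. It assumes the granted applications shown earlier (copied here as sample data); `per_year` and `all_years` are names introduced for this example.

```python
import pandas as pd

# Granted applications copied from the output above
granted = [
    {'application': '06821704.1', 'year': '2006'},
    {'application': '08702741.3', 'year': '2008'},
    {'application': '02787168.0', 'year': '2002'},
    {'application': '07350006.8', 'year': '2007'},
    {'application': '08775601.1', 'year': '2008'},
]
df = pd.DataFrame(granted)

# Convert the year strings to integers so that a full range of years can be built
years = df['year'].astype(int)

# Count the applications per filing year, ordered by year
per_year = years.value_counts().sort_index()

# Add the years without any filing, with a count of zero
all_years = list(range(int(years.min()), int(years.max()) + 1))
per_year = per_year.reindex(all_years, fill_value=0)

print(per_year)
```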
