# The EP full-text library - Lesson 4
This notebook expands on lesson 4 to dive into more advanced concepts of EPAB, the implementation in TIP of the EP full-text library. We will introduce working with raw SQL queries. **Important note:** the goal of this lesson is NOT to explain SQL syntax. If you are not familiar with SQL, we recommend that you first follow [this tutorial](https://www.w3schools.com/sql/default.asp).

As we did in the first notebook, we first create an instance of the EPAB library. Remember that by default we are getting access to a test database.

In [1]:
# Importing the EPAB client
from epo.tipdata.epab import EPABClient

# creating an instance of the EPAB client with the TEST database
epab = EPABClient(env='TEST')


## Working with SQL queries
So far we have queried the EPAB library with methods that specify the query parameters, and with separate methods that specify which data we want from the publications returned by the query. The library also lets you access the EPAB database with SQL queries. The data model is very simple, since EPAB consists of a single table containing the publications.

## Working with SQL queries in EPAB 
Let's start by reviewing what data you can retrieve from the EPAB database with the `SELECT` statement. In lesson 1 we saw the fields that the EPAB library provides for each publication. You can retrieve the list of fields with the `fields()` method.

In [12]:
epab.fields()

WidDatabaseFields(header='', input_data={'': [{'name': 'epab_doc_id', 'type': 'STRING', 'mode': 'REQUIRED', 'd…

### The table name
You can use any of the fields above in your `SELECT` statement. The name of the database table is `p-epo-tip-prj-3a1f.p_epo_tip_euwe4_bqd_epab.publications`, which is not the easiest name to memorize. To solve this, EPAB has a property, `epab.full_table_name`, that gives you the name of the table.

In [13]:
# printing the name of the EPAB table, remember that if we are in the test environment the table will have the suffix _test
print(f'the name of the table is {epab.full_table_name}')

the name of the table is p-epo-tip-prj-3a1f.p_epo_tip_euwe4_bqd_epab.publications


### Our first SQL SELECT query
Let's run a simple SQL query. The EPAB library accepts an SQL query as a string, so it is useful to build the query in parts. We want to see the publications for applications filed in 2020.

In [11]:
# Defining the fields to select
selection = 'application.number, application.filing_date, publication.number'

# Defining the condition, publications with an application filed in 2020. 
condition = " WHERE application.filing_date LIKE '2020%'"

# Putting the statement together with string concatenation for better readability
statement = (
    f"SELECT {selection} "
    f"FROM `{epab.full_table_name}` "
    f"{condition};"
)

# Reviewing the statement
print(statement)

# Querying EPAB with the SQL statement
results = epab.sql_query(statement)

# Showing the results as a Python list
results

SELECT application.number, application.filing_date, publication.number FROM `p-epo-tip-prj-3a1f.p_epo_tip_euwe4_bqd_epab.publications`  WHERE application.filing_date LIKE '2020%';


[{'number': '20895356.2', 'filing_date': '20201202', 'number_1': '4289839'},
 {'number': '20963382.5', 'filing_date': '20201230', 'number_1': '4289941'},
 {'number': '23177812.7', 'filing_date': '20200224', 'number_1': '4303475'},
 {'number': '20775582.8', 'filing_date': '20200904', 'number_1': '4208126'},
 {'number': '20767779.0', 'filing_date': '20200903', 'number_1': '4208502'},
 {'number': '20764998.9', 'filing_date': '20200901', 'number_1': '4208492'},
 {'number': '20952336.4', 'filing_date': '20200901', 'number_1': '4208403'},
 {'number': '20951991.7', 'filing_date': '20200904', 'number_1': '4209093'},
 {'number': '20951852.1', 'filing_date': '20200901', 'number_1': '4208529'},
 {'number': '20957108.2', 'filing_date': '20201015', 'number_1': '4209085'},
 {'number': '20820062.6', 'filing_date': '20201130', 'number_1': '4208727'},
 {'number': '20771224.1', 'filing_date': '20200902', 'number_1': '4209037'},
 {'number': '20767761.8', 'filing_date': '20200902', 'number_1': '4209036'},
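
In the results above, `application.number` and `publication.number` both end in `number`, so they come back under the keys `number` and `number_1`. A minimal sketch of one way to get more descriptive keys is to give each field an explicit alias with `AS`. The alias names below are illustrative, and the table name is hard-coded from the output printed earlier only so that the snippet runs on its own; in the notebook you would use `epab.full_table_name` and pass the statement to `epab.sql_query()`.

```python
# Table name as printed earlier in this notebook; inside the notebook,
# use epab.full_table_name instead of hard-coding it.
table_name = "p-epo-tip-prj-3a1f.p_epo_tip_euwe4_bqd_epab.publications"

# Each field is paired with an explicit alias (alias names are illustrative)
fields = {
    "application.number": "application_number",
    "application.filing_date": "filing_date",
    "publication.number": "publication_number",
}

# Building the "field AS alias" list for the SELECT clause
selection = ", ".join(f"{field} AS {alias}" for field, alias in fields.items())

# Putting the statement together, same structure as the query above
statement = (
    f"SELECT {selection} "
    f"FROM `{table_name}` "
    f"WHERE application.filing_date LIKE '2020%';"
)

# Reviewing the statement
print(statement)
```

With aliases in place, each result row should be keyed by `application_number`, `filing_date` and `publication_number`, in the same way that aliased aggregate columns are returned under their alias.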

### Grouping publications by application number
As we have seen several times now, a single application can have multiple publications. Let's group the publications so that we get a single entry per application, using the `COUNT` aggregate function and the `GROUP BY` clause.

In [None]:
# creating an instance of the EPAB client with the PROD database
# This is done to ensure that multiple publications for each application can be retrieved
epab = EPABClient(env='PROD')


# Defining the fields to select
selection = 'application.number, application.filing_date, COUNT(publication.number) AS publication_count'

# Defining the condition, publications with an application filed in 2020
condition = "WHERE application.filing_date LIKE '2020%'"

# Grouping by application number and filing date
group_by = "GROUP BY application.number, application.filing_date"

# Ordering by publication count in descending order
order_by = "ORDER BY publication_count DESC"

# Putting the statement together with string concatenation for better readability
statement = (
    f"SELECT {selection} "
    f"FROM `{epab.full_table_name}` "
    f"{condition} "
    f"{group_by} "
    f"{order_by};"
)

# Reviewing the statement
print(statement)

# Querying EPAB with the SQL statement
results = epab.sql_query(statement)

# Showing the results as a Python list
results
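
If you want to see what `COUNT` and `GROUP BY` do without querying the production database, the same pattern can be tried on a toy table. The sketch below uses Python's built-in `sqlite3` module with an in-memory database. The table, its flattened column names and its four rows are invented for illustration only; they are not EPAB data, and SQLite is not the engine behind EPAB.

```python
import sqlite3

# In-memory toy database; the table, columns and rows are made up
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publications ("
    "application_number TEXT, filing_date TEXT, publication_number TEXT)"
)
toy_rows = [
    ("A1", "20200101", "P1"),  # application A1 has two publications
    ("A1", "20200101", "P2"),
    ("A2", "20200315", "P3"),  # application A2 has one publication
    ("A3", "20190101", "P4"),  # filed in 2019, excluded by the WHERE clause
]
conn.executemany("INSERT INTO publications VALUES (?, ?, ?)", toy_rows)

# Same structure as the EPAB query: filter, group, count, order
statement = (
    "SELECT application_number, filing_date, "
    "COUNT(publication_number) AS publication_count "
    "FROM publications "
    "WHERE filing_date LIKE '2020%' "
    "GROUP BY application_number, filing_date "
    "ORDER BY publication_count DESC;"
)
toy_results = conn.execute(statement).fetchall()
conn.close()

# One row per application, with the number of publications it has
print(toy_results)
```

Each application filed in 2020 appears once in the output, together with the number of rows that were collapsed into it.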


### Grouping applications by number of publications
Let's add an extra level of aggregation. We will use the query we just ran as a subquery, and aggregate the applications according to the number of publications they have. We do not need the filing date for this analysis.

In [17]:
# First step: Calculate the number of publications per application
sub_query = (
    f"SELECT application.number, COUNT(publication.number) AS publication_count "
    f"FROM `{epab.full_table_name}` "
    f"GROUP BY application.number"
)

# Second step: Count the number of applications for each publication count
final_query = (
    f"SELECT publication_count, COUNT(*) AS application_count "
    f"FROM ({sub_query}) AS sub "
    f"GROUP BY publication_count "
    f"ORDER BY publication_count DESC;"
)

# Reviewing the final query
print(final_query)

# Querying EPAB with the final query
results = epab.sql_query(final_query)

# Showing the results as a Python list
results


SELECT publication_count, COUNT(*) AS application_count FROM (SELECT application.number, COUNT(publication.number) AS publication_count FROM `p-epo-tip-prj-3a1f.p_epo_tip_euwe4_bqd_epab.publications` GROUP BY application.number) AS sub GROUP BY publication_count ORDER BY publication_count DESC;


[{'publication_count': 6, 'application_count': 12},
 {'publication_count': 5, 'application_count': 283},
 {'publication_count': 4, 'application_count': 9210},
 {'publication_count': 3, 'application_count': 367505},
 {'publication_count': 2, 'application_count': 2066571},
 {'publication_count': 1, 'application_count': 2124895}]
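
Because the results are a plain Python list of dictionaries, they can be post-processed without any further SQL. As a small sketch, the snippet below copies the output shown above into a list (so that it runs on its own) and computes the total number of applications, the total number of publications, and the share of applications for each publication count. The variable names are illustrative.

```python
# Results copied from the output above; in the notebook, use `results` directly
results = [
    {'publication_count': 6, 'application_count': 12},
    {'publication_count': 5, 'application_count': 283},
    {'publication_count': 4, 'application_count': 9210},
    {'publication_count': 3, 'application_count': 367505},
    {'publication_count': 2, 'application_count': 2066571},
    {'publication_count': 1, 'application_count': 2124895},
]

# Total number of applications across all groups
total_applications = sum(row['application_count'] for row in results)

# Total number of publications: each group contributes count x applications
total_publications = sum(
    row['publication_count'] * row['application_count'] for row in results
)

# Share of applications for each publication count
shares = {
    row['publication_count']: row['application_count'] / total_applications
    for row in results
}

print(f"applications: {total_applications}, publications: {total_publications}")
for publication_count, share in shares.items():
    print(f"{publication_count} publication(s): {share:.2%} of applications")
```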