# Discovering popular terminology within Patents

This tutorial looks at the use of natural language processing to detect popular terminology within patents, and visualises the usage of such terminology over time.

We will learn how to preprocess text data, transform words to numbers, convert the occurrences to a time series and plot that time series.

## What we will do:
* Import the Python modules that will be used in the analysis.
* Read the pre-prepared patent collections
* Examine and discuss the data we have imported
* Identify common terms with TF-IDF
* Improve our results using stop words, frequency filtering and stemming
* Identify popular terms through accumulation of TF-IDF scores
* Convert TF-IDF scores to a time series of term occurrence over time
* Produce graphs of terms to investigate how usage changes over time


## How is this tutorial structured:
For every section, I will highlight its goal and what we will do to achieve it. Then I will explain the methods we use and what alternatives or additional steps we could take, and lastly we will run the code together. Note that some code cells can run for a while, so we will start them first and then explain what they do while they run.

## Download example patent data from PATSTAT

We have already extracted a few sample datasets from the [PATSTAT](https://www.epo.org/searching-for-patents/business/patstat.html#tab-1) patents database.
These are exported as Pandas DataFrames, so we just need to load them in.

First of all, we need to prepare by loading in the support libraries...

In [None]:
%load_ext autoreload
%autoreload 2

# install im_tutorial package
!pip install git+https://github.com/nestauk/im_tutorials.git
    
# We also need S3 data support (to load our sample patents)
!pip install smart_open

# pandas - to manage data frames
!pip install pandas

# scikit-learn for our NLP pipeline
!pip install scikit-learn

# nltk for more NLP support ("Natural Language ToolKit")
!pip install nltk

## Import the data

Download the file from an S3 bucket... 

In [None]:
from im_tutorials.data.ons import patents_10k, patents_100k

df = patents_10k() 
# df = patents_100k() 

df.shape


## What have we acquired?
Quickly check what data we've loaded... what attributes are available?

In [None]:
df.columns


## An example patent?
What does a random entry look like? Let's take a look at row 500...

In [None]:
df.iloc[500]

# Looking for popular terminology

We will use TF-IDF to find statistically popular terminology - where "terminology" is defined as a sequence of words. 

## TF-IDF

TF-IDF stands for "Term Frequency - Inverse Document Frequency": the frequency of a term in a document is scaled down according to the number of documents the term occurs in. This "normalises" a popular term by reducing its weight in line with how widespread it is - if every document uses a term, it isn't very unusual, and is more likely to be a word such as "the" or "and".
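
To make the weighting concrete, here is a minimal sketch that scores three words by hand on a toy corpus. It assumes the smoothed, logarithmic IDF that scikit-learn's `TfidfVectorizer` applies by default, `ln((1 + n) / (1 + df)) + 1`, and it skips the final per-document normalisation, so the numbers will not match the vectorizer's output exactly. The three documents and the helper name `tf_idf` are invented for illustration.

```python
import math

# Toy corpus (hypothetical): each document is a list of lower-case words.
docs = [
    "the solar cell absorbs light".split(),
    "the battery stores charge".split(),
    "the solar panel charges a battery".split(),
]

def tf_idf(term, doc, docs):
    """Raw term count in `doc` times a smoothed inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log((1 + len(docs)) / (1 + df)) + 1
    return tf * idf

# Score three words from the first document.
for term in ("the", "solar", "absorbs"):
    print(term, round(tf_idf(term, docs[0], docs), 3))
```

Each of the three words appears once in the first document, so the differences come purely from how many documents share the word: "the" (in every document) gets the lowest weight and "absorbs" (in one document) the highest.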

We use scikit-learn's implementation of TF-IDF (refer to their [example of topic extraction](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html) which uses TF-IDF).

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

from time import time
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
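# Note: get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
print(f'Number of features (words in our dictionary): {len(tfidf_vectorizer.get_feature_names_out()):,}')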

## Unfiltered results

What words have we discovered? Let's look at the first 10 terms or "feature names":

In [None]:
tfidf_vectorizer.get_feature_names_out()[0:10]

## That's a lot of 0's

Oh dear. Maybe we should remove digits and punctuation? Let's just keep A-Z (assuming we are restricted to English)
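
As a quick check of what such a pattern does, the sketch below applies `[A-Za-z]+` to an invented sentence using Python's `re` module. The sample text is hypothetical, and it is lower-cased first because `TfidfVectorizer` lower-cases its input by default.

```python
import re

token_pattern = re.compile(r'[A-Za-z]+')

# Hypothetical abstract text, made up for illustration.
text = "A 3-phase motor (rated 400V) with 2 co-axial rotors."
tokens = token_pattern.findall(text.lower())
print(tokens)
```

Digits and punctuation disappear, but hyphenated words are split in two and stray fragments such as the lone "v" from "400V" remain, so this is a blunt filter rather than a complete clean-up.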

In [None]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word')

from time import time
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'Number of features (words in our dictionary): {len(tfidf_vectorizer.get_feature_names_out()):,}')
tfidf_vectorizer.get_feature_names_out()[0:10]

## Just single words
Looks better, but isolated words aren't very useful - no context. How about pairs or triplets of words? (bi-grams and tri-grams)
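
To make bi-grams and tri-grams concrete, here is a small sketch that builds them from a list of tokens in plain Python. The helper `word_ngrams` is a hypothetical name used only here; it mirrors the idea behind `ngram_range=(2, 3)` and is not scikit-learn's own code.

```python
def word_ngrams(tokens, n_min, n_max):
    """Return every run of n_min..n_max consecutive tokens, joined by spaces."""
    ngrams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams

tokens = ["solar", "cell", "efficiency", "test"]
bigrams_and_trigrams = word_ngrams(tokens, 2, 3)
print(bigrams_and_trigrams)
```

Four tokens yield three bi-grams and two tri-grams, which hints at why the feature count grows so quickly once n-grams are enabled.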

In [None]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', ngram_range=(2,3))

from time import time
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'Number of features (words in our dictionary): {len(tfidf_vectorizer.get_feature_names_out()):,}')
tfidf_vectorizer.get_feature_names_out()[0:10]

## Bi-grams and tri-grams
Yikes! That didn't help! Mind you "a" isn't a very useful word. Let's add in some "stopwords"...
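
One detail worth knowing before we do: with the word analyzer, scikit-learn drops stop words from the token list first and only then forms n-grams, so words that were separated by a stop word become neighbours. The sketch below imitates that order of operations in plain Python; the stop list is a tiny hand-picked set for illustration, not scikit-learn's English list.

```python
# Tiny hand-picked stop list, for illustration only (not scikit-learn's list).
stop_words = {"a", "an", "the", "of", "is", "on", "at"}

tokens = "a layer of silicon is deposited on the substrate".split()
kept = [t for t in tokens if t not in stop_words]

# Bigrams are formed from the filtered tokens, so words that were separated
# by a stop word in the original text end up adjacent.
bigrams = [" ".join(pair) for pair in zip(kept, kept[1:])]
print(kept)
print(bigrams)
```

This is why the results can contain phrases such as "layer silicon" that never appear verbatim in any abstract.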

In [None]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', 
                                   ngram_range=(2,3), stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'Number of features (words in our dictionary) after English stop words removed: {len(tfidf_vectorizer.get_feature_names_out()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names_out()[0:10]

## Unusual terms still present
Hmmn. What if we skip rare terms, which could just be formatting or spelling errors? How about keeping only terms that occur in at least 5 documents...
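
The idea behind a minimum document frequency can be sketched without any libraries: count how many documents each term appears in, then keep only the terms that reach the threshold. The documents below are hypothetical, and the threshold is lowered to 2 because the toy corpus is so small.

```python
from collections import Counter

# Hypothetical documents, already reduced to sets of bigrams.
docs = [
    {"solar cell", "thin film"},
    {"solar cell", "heat sink"},
    {"solar cell", "thin film", "solra cell"},   # note the misspelling
]

# Document frequency: in how many documents does each term appear?
document_frequency = Counter(term for doc in docs for term in doc)

min_df = 2
vocabulary = sorted(t for t, df in document_frequency.items() if df >= min_df)
print(vocabulary)
```

The misspelt "solra cell" is filtered out, but so is the legitimate "heat sink", which is the trade-off to keep in mind when choosing the threshold.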

In [None]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', 
                                   ngram_range=(2,3), stop_words='english', 
                                   min_df=minimum_document_frequency)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names_out()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names_out()[0:10]

## Meaningful bi- and tri-grams
That's better! That's really reduced the number of n-grams. What else have we got?

In [None]:
tfidf_vectorizer.get_feature_names_out()[10:20]

## Same words, different forms?
Hmmn. That's a lot of variants of 'absorb'. If we had a "stemmer" we could remove common endings to get to the common "stem" (note that this is different to lemmatising - lemmas are the basic form of the word, but require a dictionary - patent words might not all be in the dictionary).
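
To show what "removing common endings" means, here is a deliberately crude suffix stripper. It is only a sketch of the idea and is not the Porter algorithm, which applies several ordered rule sets with extra conditions; the suffix list and the minimum stem length of 3 are arbitrary choices made for this example.

```python
# Deliberately crude: strip the first matching suffix from a fixed list.
SUFFIXES = ("ing", "er", "s")

def crude_stem(word):
    for suffix in SUFFIXES:
        # Only strip if at least 3 characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ["absorbs", "absorbing", "absorber", "absorption"]
stems = [crude_stem(w) for w in words]
print(stems)
```

No dictionary is involved, so the approach works on unfamiliar technical vocabulary, but it can only merge forms that differ by a listed ending.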

First of all, let's load NLTK's library:

In [None]:
import nltk
nltk.download('punkt')

## A "stemming" tokenizer

We need a piece of code that can extract words ("tokens") from a stream of text - and "stem" the words...

In [None]:
from nltk import word_tokenize

class StemTokenizer(object):
    def __init__(self):
        self.ps = nltk.PorterStemmer()

    def __call__(self, doc):
        return [self.ps.stem(t) for t in word_tokenize(doc)]

t = StemTokenizer()
t('absorbs absorbing absorber absorption 123')

## Stemming Tokenizer ready

Looks good, multiple forms of "absorb" are now mapped to a single stem - shame about "absorption" - a lemmatiser could map this to "absorb" if it was in the lemmatiser's dictionary.

Never mind, let's try it with the patent abstracts:

In [None]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', 
                                   ngram_range=(2,3), stop_words='english', 
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizer())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names_out()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names_out()[0:10]

## What went wrong?

Oh dear. When a custom `tokenizer` is supplied, `TfidfVectorizer` ignores `token_pattern`, so our regular expression was never applied. We'll have to combine the two...

In [None]:
import re
class StemTokenizerWithWordFilter(object):
    def __init__(self):
        self.ps = nltk.PorterStemmer()
        self.token_pattern = re.compile(r'[A-Za-z]+')

    def __call__(self, doc):
        return [self.ps.stem(t) for t in self.token_pattern.findall(doc)]

t = StemTokenizerWithWordFilter()
t('absorbs absorbing absorber absorption 123')

## Stemmer revisited

Great - digits are removed, and the "absorb" stemming still works - so let's try again...

In [None]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,3), stop_words='english',
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizerWithWordFilter())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names_out()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names_out()[0:10]

## Errors from scikit-learn?

Ah - yes, we are comparing stemmed words with the original stopword list which isn't stemmed. Whoops. Let's stem the stopwords so they will match the output of the stemmer...

In [None]:
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words_as_string = " ".join(stop_words)
stemmed_stop_words = StemTokenizerWithWordFilter()(stop_words_as_string)
stemmed_stop_words_no_duplicates = list(set(stemmed_stop_words))
stemmed_stop_words_no_duplicates[0:10]

## Stemmed stopwords
Let's analyse the patents again, this time with the stopwords matching the output of our stemmer...

In [None]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,3), 
                                   stop_words=stemmed_stop_words_no_duplicates,
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizerWithWordFilter())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names_out()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names_out()[0:10]

## Repeated words?

Hmmn. Slightly odd - "absorb heat heat" etc.; let's see what else we have... let's look at the following 10 terms...

In [None]:
tfidf_vectorizer.get_feature_names_out()[10:20]


Ok, not so bad after all. Hopefully we've now got a sensible feature set - what features are of interest?

# Features of interest

One approach is to look at the TF-IDF matrix; each row represents a document, each column a feature (i.e. an "n-gram"). A feature is of interest if it appears repeatedly within a document but not in every document - in other words, if it has a high TF-IDF value.

Let's try collapsing the matrix by summing down each column, i.e. adding up all the rows (documents); this gives one total per feature, revealing which features have the highest weights and in turn which n-grams are of interest...
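
On a toy scale the same collapse looks like this. The matrix values and term names below are hypothetical, and plain Python lists stand in for the sparse matrix.

```python
# Hypothetical TF-IDF matrix: 3 documents (rows) x 3 terms (columns).
terms = ["solar cell", "heat sink", "thin film"]
tfidf_rows = [
    [0.8, 0.0, 0.6],
    [0.0, 1.0, 0.0],
    [0.7, 0.0, 0.7],
]

# Sum down each column, i.e. over all documents, to get one total per term.
totals = [sum(column) for column in zip(*tfidf_rows)]

# Pair each total with its term and sort, highest total first.
ranked = sorted(zip(totals, terms), reverse=True)
for total, term in ranked:
    print(f"{term}: {total:.2f}")
```

Note that "heat sink" has the single highest score in any one document, yet it ranks last overall because only one document mentions it; the accumulated total rewards terms that score well across many documents.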

In [None]:
summed_tfidf = tfidf.sum(axis=0)
summed_tfidf.shape

## Which term accumulated what TF-IDF total?
Let's associate the n-grams with their scores...

In [None]:
summed_tfidf_list = summed_tfidf.tolist()[0]
print(len(summed_tfidf_list))

# .tolist() gives a plain Python list, which we can later search with .index()
ngram_list = tfidf_vectorizer.get_feature_names_out().tolist()
print(len(ngram_list))

ngram_scores = list(zip(summed_tfidf_list, ngram_list))
ngram_scores[0:10]

## Which terms have the highest accumulated TF-IDF score?
So if we sort the tuples by TF-IDF accumulated score...

In [None]:
sorted_ngram_scores = sorted(ngram_scores, key=lambda tup: tup[0], reverse=True)
sorted_ngram_scores[0:20]

## We have popular terminology! Is it meaningful?

Now we're getting somewhere! However, there are a number of n-grams that aren't useful:
* util model ("utility model"?)
* least one ("...at least one..."?)
* invent relat ("invention related"?)
* present invent ("present invention"?)
* invent disclos ("invention disclosed"?)

Let's add "invent" (the stem of "invention") and a few similar stems to the stopword list...


In [None]:
stemmed_stop_words_custom = stemmed_stop_words_no_duplicates + ['invent', 'util', 'disclos', 'problem', 'solv', 'becau', 'copyright', 'one']
stemmed_stop_words_custom[-10:]

## Rerun with revised stopwords
Let's try again, with the revised list of words to ignore...

In [None]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,3), 
                                   stop_words=stemmed_stop_words_custom,
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizerWithWordFilter())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names_out()):,} bigrams and trigrams')

summed_tfidf = tfidf.sum(axis=0)
summed_tfidf_list = summed_tfidf.tolist()[0]
print(len(summed_tfidf_list))

# .tolist() gives a plain Python list, which we can later search with .index()
ngram_list = tfidf_vectorizer.get_feature_names_out().tolist()
print(len(ngram_list))

ngram_scores = list(zip(summed_tfidf_list, ngram_list))
sorted_ngram_scores = sorted(ngram_scores, key=lambda tup: tup[0], reverse=True)
sorted_ngram_scores[0:20]

# How terms are used over time
We want to visualise how terms are used over time - let's plot how many times a given term is used per year. We need to map the TFIDF matrix to a count - was the term used in a document? And then sum the counts over a time period (e.g. each year).

The original dataframe has the date information...

In [None]:
min(df.publication_date)

In [None]:
max(df.publication_date)

## Challenge: convert a matrix of term occurrence into a time series of term usage...

The matrix itself has a row per patent; each patent is a row in the original data frame, which contains dates (publication date and application date). Let's take a look at the publication date:

In [None]:
df.publication_date[0:20]

## TF-IDF to time series of term usage

Good news - the way the data was captured meant that the rows are in publication date order. We can write a piece of code to group documents by year. We will score a term with a '1' for each patent that mentions this term, during a particular year. This way we record how many patents use this term per year.

Now the technical part. The matrix is stored in [Compressed Sparse Row](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) form, where each row is stored as a list of non-zero entries. This saves a lot of storage and compute as the majority of our TF-IDF matrix is 0 - and as we tend to navigate by patent, we use row compression so we can easily iterate over rows (patents). The downside is that iterating over columns is slow, as each row has to be decompressed to determine its contribution to a column.
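
To see why rows are cheap and columns are expensive, the sketch below builds the three arrays of a CSR layout by hand for a tiny matrix. The names `data`, `indices` and `indptr` match the attributes that SciPy's `csr_matrix` exposes; everything else (the toy values and the two helper functions) is invented for illustration and is not how SciPy implements its lookups.

```python
# Dense toy matrix: 3 patents (rows) x 4 terms (columns), mostly zeros.
dense = [
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.9],
]

# Build the three CSR arrays by hand.
data, indices, indptr = [], [], [0]
for row in dense:
    for column, value in enumerate(row):
        if value != 0.0:
            data.append(value)       # the non-zero value itself
            indices.append(column)   # which column it sits in
    indptr.append(len(data))         # where the next row starts in `data`

def row_entries(r):
    """Non-zero (column, value) pairs of row r: one contiguous slice."""
    start, end = indptr[r], indptr[r + 1]
    return list(zip(indices[start:end], data[start:end]))

def column_entries(c):
    """Non-zero (row, value) pairs of column c: needs a scan of every row."""
    return [(r, v) for r in range(len(indptr) - 1)
            for col, v in row_entries(r) if col == c]

print(data, indices, indptr)
print(row_entries(2))
print(column_entries(3))
```

Reading a row is a single slice, while reading a column means visiting every row, which is the cost described above.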

To convert the TF-IDF row entries to a 1 or 0 depending on if a term was mentioned, we simply use a condition (`tfidf[current_row_index,:] > 0`) which maps a 0 to `False` and non-zero (i.e. term was mentioned) to `True`. We then add this boolean row to a running total, where `False` adds 0 and `True` adds 1. We accumulate each patent onto a running total, until we find that a new patent refers to a new year - so we then record the total against the current year, and start a new running total from 0.
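
The same bookkeeping can be sketched with plain dictionaries, which may make the logic easier to follow before reading the sparse-matrix version. This is an alternative illustration rather than the notebook's method: the input format (a year plus a `{term index: weight}` mapping per patent) is hypothetical, and unlike the running-total approach it does not depend on the patents being sorted by date.

```python
from collections import Counter, defaultdict

# Hypothetical input: (publication year, {term index: TF-IDF weight}) per patent.
patents = [
    (2001, {0: 0.4, 2: 0.9}),
    (2001, {0: 0.7}),
    (2003, {1: 0.5, 2: 0.5}),
]

counts_per_year = defaultdict(Counter)
for year, weights in patents:
    for term_index, weight in weights.items():
        # bool -> int: True adds 1, False adds 0.
        counts_per_year[year][term_index] += int(weight > 0)

# Include years with no patents so the time series has no gaps.
years = list(range(min(counts_per_year), max(counts_per_year) + 1))
series_for_term_0 = [counts_per_year[y][0] for y in years]
series_for_term_2 = [counts_per_year[y][2] for y in years]
print(years, series_for_term_0, series_for_term_2)
```

Each patent contributes at most 1 per term, however large its TF-IDF weight, so the series counts patents rather than summing scores.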

As an aside, there is also [Compressed Sparse Column](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csc_matrix.html), which would make column iteration easy - i.e. term iteration - as data would be stored in compressed columns.

We also use [TQDM](https://github.com/tqdm/tqdm) which provides a nice "wrapper" over an iterator which provides a completion bar - useful with long running jobs.

In [None]:
import numpy as np
from tqdm import tqdm
from scipy.sparse import csr_matrix, vstack

number_of_rows, number_of_terms = tfidf.shape

term_counts_per_year_csr = None
patent_dates = df.publication_date.tolist()

current_year = patent_dates[0].year
term_counts_current_year_csr = csr_matrix((1, number_of_terms), dtype=np.int32)
number_of_documents_per_year = []
year_dates = []
number_of_documents_this_year = 0

for current_row_index in tqdm(range(number_of_rows), 'Calculating yearly term counts', unit='patent', total=number_of_rows):
    new_year = patent_dates[current_row_index].year

    while new_year > current_year:
        term_counts_per_year_csr = vstack([term_counts_per_year_csr, term_counts_current_year_csr],
                                          format='csr') if term_counts_per_year_csr is not None else term_counts_current_year_csr
        number_of_documents_per_year.append(number_of_documents_this_year)
        year_dates.append(current_year)
        term_counts_current_year_csr = csr_matrix((1, number_of_terms), dtype=np.int32)
        current_year += 1
        number_of_documents_this_year = 0

    current_row_as_counts = tfidf[current_row_index, :] > 0
    term_counts_current_year_csr += current_row_as_counts
    number_of_documents_this_year += 1

term_counts_per_year_csr = vstack([term_counts_per_year_csr, term_counts_current_year_csr],
                                  format='csr') if term_counts_per_year_csr is not None else term_counts_current_year_csr
number_of_documents_per_year.append(number_of_documents_this_year)
year_dates.append(current_year)

## Plotting time series

Each term (n-gram) that we previously captured with TF-IDF now has a related time series; we have to extract the relevant column of the term-counts-per-year matrix (`term_counts_per_year_csr`), and plot this against the recorded years.

To make the code more flexible, we take a required term and find where it is in the list of extracted terms, so you can quickly investigate term usage. To start with, let's look at `solar cell` which was our top listed term with the 10,000 patent sample:

In [None]:
import matplotlib.pyplot as plt

term_of_interest = 'solar cell'
index_of_term_of_interest = ngram_list.index(term_of_interest)

fig = plt.figure(figsize=(6, 1.5), dpi=100)
ax = fig.add_subplot(111)

term_of_interest_time_series = term_counts_per_year_csr.getcol(index_of_term_of_interest).todense()
term_of_interest_time_series = term_of_interest_time_series.flatten().tolist()[0]
ax.plot(year_dates, term_of_interest_time_series, color='b', linestyle='-', marker='x', label='Year')

ax.set_title(f'Patents using term "{term_of_interest}"')
ax.set_ylabel('Number of\npatents\nwith term', fontsize=12)
ax.set_xlabel('Year', fontsize=12)

plt.show()

## What did you find?
Try different terms - what usage did you find? Any patterns?
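
One pattern to be wary of: raw counts rise whenever more patents are published in a year, whether or not the term itself is becoming more popular. The yearly loop also collected `number_of_documents_per_year`, which can be used to express each count as a share of that year's patents. A minimal sketch, with made-up numbers standing in for the notebook's variables:

```python
# Hypothetical values standing in for the notebook's `year_dates`,
# `number_of_documents_per_year` and one term's yearly counts.
year_dates = [2001, 2002, 2003]
number_of_documents_per_year = [200, 0, 500]
term_counts = [10, 0, 20]

# Share of that year's patents using the term; years with no patents get 0.0
# rather than a division by zero.
term_share = [
    count / total if total else 0.0
    for count, total in zip(term_counts, number_of_documents_per_year)
]
print(list(zip(year_dates, term_share)))
```

In this toy example the raw count doubles between 2001 and 2003 while the share of patents using the term actually falls, so it can be worth plotting both.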