# Discovering popular terminology within Patents

This tutorial looks at the use of natural language processing to detect popular terminology within patents, and visualises the usage of such terminology over time.

We will learn how to preprocess text data, transform words to numbers, convert the occurrences to a time series and plot the time series.

## What we will do:
* Import the Python modules that will be used in the analysis.
* Read the pre-prepared patent collections
* Examine and discuss the data we have imported
* Identify common terms with TF-IDF
* Improve our results using stop words, frequency filtering and stemming
* Identify popular terms through accumulation of TF-IDF scores


## How is this tutorial structured:
For every section, I will highlight its goal and what we will do to achieve it. Then I will explain the methods we use and what alternatives or additional things we could do, and lastly we will run the code together. Note that some code cells can "run" for a while, so we will run them first and then explain what they do.

## Download example patent data from PATSTAT

We have already extracted a few sample datasets from the [PATSTAT](https://www.epo.org/searching-for-patents/business/patstat.html#tab-1) patents database.
These are exported as Pandas DataFrames, so we just need to load them in.

First of all, we need to prepare by loading in the support libraries...

In [2]:
%load_ext autoreload
%autoreload 2

# install im_tutorial package
!pip install git+https://github.com/nestauk/im_tutorials.git
    
# We also need S3 data support (to load our sample patents)
!pip install smart_open

# pandas - to manage data frames
!pip install pandas

# scikit-learn for our NLP pipeline
!pip install scikit-learn

# nltk for more NLP support ("Natural Language ToolKit")
!pip install nltk

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
Collecting git+https://github.com/nestauk/im_tutorials.git
  Cloning https://github.com/nestauk/im_tutorials.git to /private/var/folders/5w/8j3_gwlj7rdgx_rpxmfpksf40000gn/T/pip-req-build-zgipq5y_
  Running command git clone -q https://github.com/nestauk/im_tutorials.git /private/var/folders/5w/8j3_gwlj7rdgx_rpxmfpksf40000gn/T/pip-req-build-zgipq5y_
Building wheels for collected packages: im-tutorials
  Building wheel for im-tutorials (setup.py) ... done
  Created wheel for im-tutorials: filename=im_tutorials-0.1.0-cp37-none-any.whl size=12596 sha256=7290478e37e65629ff5f87f41a7599f5388df16e679ebc95dd8994ede30c8652
  Stored in directory: /private/var/folders/5w/8j3_gwlj7rdgx_rpxmfpksf40000gn/T/pip-ephem-wheel-cache-iaoaeogw/wheels/47/a3/cb/bdc5f9ba49bcfd2c6864b166a1566eb2f104113bf0c3500330
Successfully built im-tutorials


## Import the data

Download the file from an S3 bucket... 

In [3]:
from im_tutorials.data.ons import patents_10k, patents_100k

df = patents_10k() 
# df = patents_100k() 

df.shape


(10000, 13)

## What have we acquired?
Quickly check what data we've loaded... what attributes are available?

In [4]:
df.columns


Index(['appln_id', 'abstract', 'appln_auth', 'application_date',
       'application_id', 'publication_date', 'patent_id',
       'applicant_countries', 'applicant_cities', 'inventor_countries',
       'inventor_cities', 'invention_title', 'classifications_cpc'],
      dtype='object')

## An example patent?
What does a random entry look like? Let's take a look at row 500...

In [5]:
df.iloc[500]

appln_id                                                        33191383
abstract               PURPOSE:To miniaturize a hot water storage tan...
appln_auth                                                            JP
application_date                                     1979-12-18 00:00:00
application_id                                                  54165040
publication_date                                     1981-07-16 00:00:00
patent_id                                                       15804685
applicant_countries                                                  NaN
applicant_cities                                                     NaN
inventor_countries                                                   NaN
inventor_cities                                                      NaN
invention_title        HOT WATER STORAGE TYPE WATER HEATER UTILIZING ...
classifications_cpc                                        [Y02E  10/40]
Name: 33191383, dtype: object

# Looking for popular terminology

We will use TF-IDF to find statistically popular terminology - where "terminology" is defined as a sequence of words. 

## TF-IDF

TF-IDF stands for "Term Frequency - Inverse Document Frequency": the frequency of a term within a document is scaled down according to the number of documents the term occurs in (scikit-learn uses a smoothed logarithm of the inverse document frequency rather than a plain division). This "normalises" a popular term - if every document uses the term, it isn't very unusual and is more likely to be a word such as "the" or "and".
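To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for three made-up documents. It assumes the textbook form, term count multiplied by log(N / document frequency); scikit-learn's `TfidfVectorizer` uses a smoothed variant and normalises each document by default, so its numbers will differ. The documents and the `tf_idf` helper are illustrative only.

```python
import math
from collections import Counter

# Three tiny made-up "documents" (illustrative only, not patent data)
docs = [
    "the solar cell absorbs the light",
    "the heat exchanger moves the heat",
    "the solar panel heats the water",
]
tokenised = [doc.split() for doc in docs]

# Document frequency: in how many documents does each term occur?
document_frequency = Counter(term for tokens in tokenised for term in set(tokens))


def tf_idf(term, tokens):
    """Textbook TF-IDF: count of the term in the document x log(N / document frequency)."""
    term_frequency = tokens.count(term)
    inverse_document_frequency = math.log(len(tokenised) / document_frequency[term])
    return term_frequency * inverse_document_frequency


# "the" occurs in every document, so log(3/3) = 0 and its score vanishes
print(tf_idf("the", tokenised[0]))
# "cell" occurs in only one document, so it scores log(3/1)
print(tf_idf("cell", tokenised[0]))
```

Even though "the" appears twice in the first document, its score is zero because it appears in all three documents, whereas "cell" keeps a positive score.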

We use scikit-learn's implementation of TF-IDF (refer to their [example of topic extraction](https://scikit-learn.org/stable/auto_examples/applications/plot_topics_extraction_with_nmf_lda.html), which uses TF-IDF).

In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

from time import time
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'Number of features (words in our dictionary): {len(tfidf_vectorizer.get_feature_names()):,}')

Processed in 1.16s.
Number of features (words in our dictionary): 36,718
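A note on versions: this tutorial was run with a scikit-learn release that provides `get_feature_names()`. As far as we are aware, that method was deprecated in scikit-learn 1.0 and removed in 1.2 in favour of `get_feature_names_out()`. The helper below is a hypothetical convenience function (not part of scikit-learn or `im_tutorials`) that sketches one way to support both.

```python
def feature_names(vectorizer):
    """Return the vocabulary as a list, whichever scikit-learn API is available."""
    if hasattr(vectorizer, "get_feature_names_out"):
        # Newer releases return an array-like, so convert it to a plain list
        return list(vectorizer.get_feature_names_out())
    # Older releases, as used when this tutorial was written
    return vectorizer.get_feature_names()
```

With this in place, `feature_names(tfidf_vectorizer)[0:10]` could stand in for `tfidf_vectorizer.get_feature_names()[0:10]` in the cells that follow.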


## Unfiltered results

What words have we discovered? Let's look at the first 10 terms or "feature names":

In [7]:
tfidf_vectorizer.get_feature_names()[0:10]

['00',
 '000',
 '00000',
 '00001',
 '00002',
 '0001',
 '0005',
 '000angstrom',
 '000deg',
 '000kg']

## That's a lot of 0's

Oh dear. Maybe we should remove digits and punctuation? Let's just keep A-Z (assuming we are restricted to English)

In [8]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word')

from time import time
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'Number of features (words in our dictionary): {len(tfidf_vectorizer.get_feature_names()):,}')
tfidf_vectorizer.get_feature_names()[0:10]

Processed in 1.06s.
Number of features (words in our dictionary): 30,550


['a',
 'aa',
 'ab',
 'aback',
 'abaissement',
 'abaisser',
 'abajo',
 'abamectin',
 'abandon',
 'abandoned']
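To see what the change of pattern does, the sketch below applies both regular expressions to a made-up string using only Python's `re` module. The first pattern is scikit-learn's documented default `token_pattern`; the sample text is an assumption chosen to show digits and an accented letter.

```python
import re

# Made-up sample text; TfidfVectorizer lowercases by default, so we do the same
sample = "Abhängigkeit of 000kg heat-pump at 100deg".lower()

# scikit-learn's default token_pattern: two or more "word" characters,
# which includes digits and accented letters
default_pattern = re.compile(r"(?u)\b\w\w+\b")
# The pattern used above: runs of unaccented A-Z letters only
letters_only = re.compile(r"[A-Za-z]+")

print(default_pattern.findall(sample))
# ['abhängigkeit', 'of', '000kg', 'heat', 'pump', 'at', '100deg']
print(letters_only.findall(sample))
# ['abh', 'ngigkeit', 'of', 'kg', 'heat', 'pump', 'at', 'deg']
```

Note the side effect: the letters-only pattern splits words at accented characters, which is probably where odd pairs such as "abh ngigkeit" in later outputs come from.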

## Just single words
Looks better, but isolated words aren't very useful - no context. How about pairs or triplets of words? (bi-grams and tri-grams)

In [9]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', ngram_range=(2,3))

from time import time
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'Number of features (words in our dictionary): {len(tfidf_vectorizer.get_feature_names()):,}')
tfidf_vectorizer.get_feature_names()[0:10]

Processed in 8.96s.
Number of features (words in our dictionary): 1,208,186


['a a',
 'a a a',
 'a a and',
 'a a are',
 'a a b',
 'a a beigemischt',
 'a a belt',
 'a a bivalent',
 'a a block',
 'a a bottom']
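For reference, this is roughly what `ngram_range=(2,3)` asks for: every run of two or three consecutive tokens. The `word_ngrams` function below is a hypothetical stand-alone sketch, not scikit-learn's own code.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all runs of min_n..max_n consecutive tokens, joined by spaces."""
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[start:start + n]))
    return ngrams


tokens = "hot water storage tank".split()
print(word_ngrams(tokens, 2, 3))
# ['hot water', 'water storage', 'storage tank', 'hot water storage', 'water storage tank']
```

Four tokens already yield five n-grams, and most combinations are unique to one document, which is why the vocabulary jumps from about 30 thousand words to over a million n-grams.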

## Bi-grams and tri-grams
Yikes! That didn't help! Mind you "a" isn't a very useful word. Let's add in some "stopwords"...

In [10]:
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', 
                                   ngram_range=(2,3), stop_words='english')
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'Number of features (words in our dictionary) after English stop words removed: {len(tfidf_vectorizer.get_feature_names()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names()[0:10]

Processed in 7.15s.
Number of features (words in our dictionary) after English stop words removed: 1,147,284 bigrams and trigrams


['aa anode',
 'aa anode bb',
 'aa base',
 'aa base material',
 'aa cc',
 'aa cc reducing',
 'aa comparative',
 'aa comparative example',
 'aa cooling',
 'aa cooling water']

## Unusual terms still present
Hmmn. What if we skip rare terms, that could just be formatting or spelling errors? How about only terms that occur in at least 5 documents...

In [11]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', 
                                   ngram_range=(2,3), stop_words='english', 
                                   min_df=minimum_document_frequency)
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names()[0:10]

Processed in 6.75s.
...after English stop words removed, remove terms occurring in fewer than 5 documents: 15,961 bigrams and trigrams


['abh ngigkeit',
 'abnormality detection',
 'abrasion resistance',
 'absolute value',
 'absorb heat',
 'absorber layer',
 'absorber plate',
 'absorber second',
 'absorber solution',
 'absorbing device']
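The filtering that `min_df` performs can be sketched in plain Python: count the number of documents each term appears in, then keep only the terms that reach the threshold. The n-grams below are made up, and the threshold is lowered to 2 to suit the tiny example.

```python
from collections import Counter

# Made-up n-grams per document (illustrative only)
documents = [
    {"solar cell", "heat pump"},
    {"solar cell", "aa anode"},
    {"solar cell", "heat pump"},
]

# Each document is a set, so a term is counted at most once per document
document_frequency = Counter(term for doc in documents for term in doc)

minimum_document_frequency = 2
vocabulary = sorted(
    term for term, count in document_frequency.items()
    if count >= minimum_document_frequency
)
print(vocabulary)  # ['heat pump', 'solar cell']
```

"aa anode" occurs in a single document, so it is dropped, just as one-off formatting artefacts are dropped from the patent vocabulary.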

## Meaningful bi- and tri-grams
That's better! That's really reduced the number of n-grams. What else have we got?

In [12]:
tfidf_vectorizer.get_feature_names()[10:20]

['absorbing heat',
 'absorbing layer',
 'absorbing material',
 'absorbing solar',
 'absorbing surface',
 'absorbs heat',
 'absorption coating',
 'absorption efficiency',
 'absorption heat',
 'absorption heat pump']

## Same words, different forms?
Hmmn. That's a lot of variants of 'absorb'. If we had a "stemmer" we could remove common endings to get to the common "stem" (note that this is different to lemmatising - lemmas are the basic form of the word, but require a dictionary - patent words might not all be in the dictionary).

First of all, let's import NLTK and download its "punkt" tokenizer data, which word_tokenize relies on:

In [13]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /Users/grimsi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## A "stemming" tokenizer

We need a piece of code that can extract words ("tokens") from a stream of text - and "stem" the words...

In [14]:
from nltk import word_tokenize

class StemTokenizer(object):
    def __init__(self):
        self.ps = nltk.PorterStemmer()

    def __call__(self, doc):
        return [self.ps.stem(t) for t in word_tokenize(doc)]

t = StemTokenizer()
t('absorbs absorbing absorber absorption 123')

['absorb', 'absorb', 'absorb', 'absorpt', '123']

## Stemming Tokenizer ready

Looks good, multiple forms of "absorb" are now mapped to a single stem - shame about "absorption" - a lemmatiser could map this to "absorb" if it was in the lemmatiser's dictionary.

Never mind, let's try it with the patent abstracts:

In [15]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(token_pattern=r'[A-Za-z]+', analyzer='word', 
                                   ngram_range=(2,3), stop_words='english', 
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizer())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names()[0:10]

  'stop_words.' % sorted(inconsistent))


Processed in 45.60s.
...after English stop words removed, remove terms occurring in fewer than 5 documents: 28,334 bigrams and trigrams


['! 2.',
 '! 3.',
 '% (',
 '% )',
 '% ) .',
 '% ,',
 '% -10',
 '% -10 %',
 '% -15',
 '% -15 %']

## What went wrong?

Oh dear. A custom tokenizer overrides the token_pattern regular expression, so digits and punctuation are back. We'll have to combine the two...

In [16]:
import re
class StemTokenizerWithWordFilter(object):
    def __init__(self):
        self.ps = nltk.PorterStemmer()
        self.token_pattern = re.compile(r'[A-Za-z]+')

    def __call__(self, doc):
        return [self.ps.stem(t) for t in self.token_pattern.findall(doc)]

t = StemTokenizerWithWordFilter()
t('absorbs absorbing absorber absorption 123')

['absorb', 'absorb', 'absorb', 'absorpt']

## Stemmer revisited

Great - digits are removed, and the "absorb" stemming still works - so let's try again...

In [17]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,3), stop_words='english',
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizerWithWordFilter())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names()[0:10]

  'stop_words.' % sorted(inconsistent))


Processed in 34.33s.
...after English stop words removed, remove terms occurring in fewer than 5 documents: 21,641 bigrams and trigrams


['abh ngigkeit',
 'abnorm detect',
 'abov atmospher',
 'abov deg',
 'abov deg c',
 'abov describ',
 'abov heat',
 'abov mention',
 'abov predetermin',
 'abov process']

## Errors from scikit-learn?

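Ah - yes, that truncated warning from scikit-learn is telling us that we are comparing stemmed words with the original stopword list, which isn't stemmed. That is why "abov" (the stem of "above") survived in the output. Whoops. Let's stem a stopword list (NLTK's English list this time) so it will match the output of the stemmer...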

In [18]:
nltk.download('stopwords')
stop_words = nltk.corpus.stopwords.words('english')
stop_words_as_string = " ".join(stop_words)
stemmed_stop_words = StemTokenizerWithWordFilter()(stop_words_as_string)
stemmed_stop_words_no_duplicates = list(set(stemmed_stop_words))
stemmed_stop_words_no_duplicates[0:10]

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/grimsi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


['until', 'becaus', 'other', 'thi', 'the', 'my', 'and', 'needn', 'both', 'few']
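The mismatch is easy to reproduce without any libraries. In the sketch below the token list is made up (although "abov" and "mention" do appear in the earlier output), and the two stop lists are cut down to two entries each for illustration.

```python
# Stemmed tokens as they might come out of the stemming tokenizer (made-up example)
stemmed_tokens = ["abov", "mention", "heat", "the", "absorb"]

# An unstemmed stop list never matches the stem "abov"...
plain_stop_list = {"above", "the"}
# ...whereas a stemmed stop list does
stemmed_stop_list = {"abov", "the"}

print([t for t in stemmed_tokens if t not in plain_stop_list])
# ['abov', 'mention', 'heat', 'absorb']
print([t for t in stemmed_tokens if t not in stemmed_stop_list])
# ['mention', 'heat', 'absorb']
```

Stop words whose stem equals the original word, such as "the", are filtered either way; only the ones that the stemmer changes slip through.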

## Stemmed stopwords
Let's analyse the patents again, this time with the stopwords matching the output of our stemmer...

In [19]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,3), 
                                   stop_words=stemmed_stop_words_no_duplicates,
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizerWithWordFilter())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names()):,} bigrams and trigrams')
tfidf_vectorizer.get_feature_names()[0:10]

  'stop_words.' % sorted(inconsistent))


Processed in 34.73s.
...after English stop words removed, remove terms occurring in fewer than 5 documents: 22,860 bigrams and trigrams


['abh ngigkeit',
 'abnorm detect',
 'abras resist',
 'absolut valu',
 'absorb absorb',
 'absorb absorpt',
 'absorb devic',
 'absorb first',
 'absorb heat',
 'absorb heat heat']

## Repeated words?

Hmmn. Slightly odd - "absorb heat heat" etc.; let's see what else we have... let's look at the following 10 terms...

In [20]:
tfidf_vectorizer.get_feature_names()[10:20]


['absorb high',
 'absorb layer',
 'absorb light',
 'absorb liquid',
 'absorb materi',
 'absorb pipe',
 'absorb plate',
 'absorb refriger',
 'absorb second',
 'absorb solar']

Ok, not so bad after all. Hopefully we've now got a sensible feature set - what features are of interest?
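One possible explanation for n-grams such as "absorb heat heat": as far as we understand scikit-learn's word analyzer, stop words are removed before the n-grams are built, so words that were separated only by stop words become neighbours. The sketch below reproduces that effect with made-up, already-stemmed tokens.

```python
# Made-up stemmed tokens, e.g. from "...absorbs heat and the heat is stored..."
tokens = ["absorb", "heat", "and", "the", "heat", "is", "store"]
stop_list = {"and", "the", "is"}

# Remove the stop words first...
kept = [t for t in tokens if t not in stop_list]
# ...then build trigrams from whatever is left
trigrams = [" ".join(kept[i:i + 3]) for i in range(len(kept) - 2)]
print(trigrams)  # ['absorb heat heat', 'heat heat store']
```

The same mechanism would explain bigrams such as "least one", which probably started life as "at least one".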

# Features of interest

One approach is to look at the TFIDF matrix; each row represents a document, each column a feature (i.e. an "n-gram"). A feature is of interest if it appears repeatedly in a document but not in all documents - in other words, it has a high TF-IDF value.

Let's try collapsing the matrix by summing over the rows, giving one total per column; this will reveal which features have the highest accumulated weights and in turn which n-grams are of interest...

In [21]:
summed_tfidf = tfidf.sum(axis=0)
summed_tfidf.shape

(1, 22860)

## Which term accumulated what TF-IDF total?
Let's associate the n-grams with their scores...

In [22]:
summed_tfidf_list = summed_tfidf.tolist()[0]
print(len(summed_tfidf_list))

ngram_list = tfidf_vectorizer.get_feature_names()
print(len(ngram_list))

ngram_scores = list(zip(summed_tfidf_list, tfidf_vectorizer.get_feature_names()))
ngram_scores[0:10]

22860
22860


[(1.964984642007418, 'abh ngigkeit'),
 (1.9833368916355836, 'abnorm detect'),
 (2.1418959799734973, 'abras resist'),
 (1.8430863781922602, 'absolut valu'),
 (0.8510145638607169, 'absorb absorb'),
 (0.7004883503700028, 'absorb absorpt'),
 (0.8188876183620329, 'absorb devic'),
 (0.5389650133241702, 'absorb first'),
 (3.6754122614917653, 'absorb heat'),
 (0.6217575810475298, 'absorb heat heat')]

## Which terms have the highest accumulated TF-IDF score?
So if we sort the tuples by TF-IDF accumulated score...

In [23]:
sorted_ngram_scores = sorted(ngram_scores, key=lambda tup: tup[0], reverse=True)
sorted_ngram_scores[0:20]

[(87.59415244080746, 'util model'),
 (86.23965268576134, 'least one'),
 (84.13185378317797, 'problem solv'),
 (70.4916842323881, 'solar cell'),
 (68.95704829225679, 'heat exchang'),
 (66.66387292986626, 'power gener'),
 (65.03648010606014, 'invent relat'),
 (64.17028054759605, 'present invent'),
 (60.038667850465764, 'solv provid'),
 (60.017751716429366, 'problem solv provid'),
 (58.10918030547546, 'power suppli'),
 (52.437059533307604, 'exhaust ga'),
 (51.715616871843, 'invent disclos'),
 (49.485036218829784, 'raw materi'),
 (45.15404079848826, 'model disclos'),
 (45.15404079848826, 'util model disclos'),
 (44.27394436080138, 'p solut'),
 (44.21824454019031, 'c jpo'),
 (44.21824454019031, 'copyright c'),
 (44.21824454019031, 'copyright c jpo')]
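The accumulate-then-rank steps can be followed on a matrix small enough to check by hand. This is a plain-Python sketch with made-up scores; the names `ngram_names` and `tfidf_rows` are illustrative and stand in for the vectorizer's feature names and the sparse `tfidf` matrix.

```python
# Rows are documents, columns are n-grams (made-up scores, illustrative only)
ngram_names = ["solar cell", "heat exchang", "water tank"]
tfidf_rows = [
    [0.5, 0.0, 0.25],
    [0.5, 0.75, 0.0],
    [0.25, 0.0, 0.0],
]

# zip(*rows) transposes the matrix, so each tuple holds one column;
# summing it gives the accumulated score for that n-gram
column_totals = [sum(column) for column in zip(*tfidf_rows)]

# Pair each total with its n-gram and sort, highest total first
ranked = sorted(zip(column_totals, ngram_names), reverse=True)
print(ranked)
# [(1.25, 'solar cell'), (0.75, 'heat exchang'), (0.25, 'water tank')]
```

Note that a term can rank highly either by scoring well in a few documents or by scoring modestly in many, which is why boilerplate phrases can still reach the top of the list.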

## We have popular terminology! Is it meaningful?

Now we're getting somewhere! However, there are a number of n-grams that aren't useful:
* util model ("utility model"?)
* least one ("...at least one..."?)
* invent relat ("invention related"?)
* present invent ("present invention"?)
* invent disclos ("invention disclosed"?)

Let's add the stems of these words (such as "invent"), plus a few other boilerplate terms, to the stopword list...


In [24]:
stemmed_stop_words_custom = stemmed_stop_words_no_duplicates + ['invent', 'util', 'disclos', 'problem', 'solv', 'becau', 'copyright', 'one']
stemmed_stop_words_custom[-10:]

['are',
 'an',
 'invent',
 'util',
 'disclos',
 'problem',
 'solv',
 'becau',
 'copyright',
 'one']

## Rerun with revised stopwords
Let's try again, with the revised list of words to ignore...

In [25]:
minimum_document_frequency = 5
tfidf_vectorizer = TfidfVectorizer(analyzer='word', 
                                   ngram_range=(2,3), 
                                   stop_words=stemmed_stop_words_custom,
                                   min_df=minimum_document_frequency,
                                   tokenizer=StemTokenizerWithWordFilter())
t0 = time()
tfidf = tfidf_vectorizer.fit_transform(df.abstract)
print(f"Processed in {time() - t0:.2f}s.")
print(f'...after English stop words removed, remove terms occurring in fewer than {minimum_document_frequency} documents:'
      f' {len(tfidf_vectorizer.get_feature_names()):,} bigrams and trigrams')

summed_tfidf = tfidf.sum(axis=0)
summed_tfidf_list = summed_tfidf.tolist()[0]
print(len(summed_tfidf_list))

ngram_list = tfidf_vectorizer.get_feature_names()
print(len(ngram_list))

ngram_scores = list(zip(summed_tfidf_list, tfidf_vectorizer.get_feature_names()))
sorted_ngram_scores = sorted(ngram_scores, key=lambda tup: tup[0], reverse=True)
sorted_ngram_scores[0:20]

  'stop_words.' % sorted(inconsistent))


Processed in 34.49s.
...after English stop words removed, remove terms occurring in fewer than 5 documents: 22,283 bigrams and trigrams
22283
22283


[(71.59628262339267, 'solar cell'),
 (69.58197938536033, 'heat exchang'),
 (67.55062628438021, 'power gener'),
 (58.65742995494544, 'power suppli'),
 (52.89986476139019, 'exhaust ga'),
 (50.31499614602553, 'raw materi'),
 (45.51062668368377, 'p c'),
 (45.50908955529698, 'p solut'),
 (45.45388738177465, 'c jpo'),
 (45.45388738177465, 'p c jpo'),
 (44.30474600746714, 'combust engin'),
 (44.15775185053338, 'deg c'),
 (42.1165349730431, 'intern combust'),
 (41.30885745905931, 'intern combust engin'),
 (37.70586353655121, 'combust chamber'),
 (36.113257106080034, 'e g'),
 (34.746712203393855, 'high temperatur'),
 (34.62581249743111, 'solar energi'),
 (34.14698368013074, 'water tank'),
 (33.827613467761225, 'control system')]

# How terms are used over time
We want to visualise how terms are used over time - let's plot how many times a given term is used per year. We need to map the TFIDF matrix to a count - was the term used in a document? And then sum the counts over a time period (e.g. each year).

The original dataframe has the date information...

In [27]:
min(df.publication_date)

Timestamp('1913-07-24 00:00:00')

In [28]:
max(df.publication_date)

Timestamp('2017-07-27 00:00:00')

In [30]:
number_of_rows, number_of_terms = tfidf.shape

In [32]:
number_of_terms

22283
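The mapping described at the start of this section (was the term used in a document, then sum per year) can be sketched in plain Python. The dates and scores below are made up and cover a single term; in the notebook they would correspond to `df.publication_date` and one column of `tfidf`, assuming the rows of both are in the same order.

```python
from collections import Counter
from datetime import date

# Made-up publication dates and TF-IDF scores for ONE term, one entry per document
publication_dates = [date(2015, 3, 1), date(2015, 9, 12), date(2016, 1, 5), date(2016, 11, 20)]
term_scores = [0.4, 0.0, 0.2, 0.3]

# A non-zero TF-IDF score means the term was used in that document,
# so count one document for the year it was published in
documents_per_year = Counter(
    published.year
    for published, score in zip(publication_dates, term_scores)
    if score > 0
)
print(sorted(documents_per_year.items()))  # [(2015, 1), (2016, 2)]
```

The sorted (year, count) pairs are the time series for that term, ready to be plotted.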