# Topic Modeling Daily Patent Assignments

This dataset contains a record of every [patent assignment](https://www.legalzoom.com/assets/legalforms/Patent%20Assignment.pdf) which took place on October 18, 2016. A patent assignment occurs whenever a patent is either issued or experiences a change in ownership. There are apparently around 2 million active patents in the United States, and this dataset is a window into what a day in their movements looks like.

This notebook has three parts: flattening; general exploration; and, finally, topic modeling. The goal is to outline the basics of how natural language processing, specifically topic modeling, works.

**[Click here to skip directly to the topic modeling](#Topic-Modeling).**

## Flattening

The dataset is provided in an XML format, which is a bit awkward to work with. Before doing anything else with it, I've written a pipe that turns it into a more easily workable CSV file (note, however, that this CSV file's format is not ideal for all possible applications...).

The details of this code are explained in [another notebook](https://www.kaggle.com/residentmario/d/uspto/patent-assignment-daily/flattening-the-patent-assignment-daily-dataset), and we will gloss over them here.
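
To give a flavor of what the flattening does, here is a minimal sketch using the standard library's `xml.etree.ElementTree` in place of `lxml`. The XML fragment is hand-written for illustration. The element names `assignment-record`, `recorded-date`, `patent-assignors`, `patent-properties`, `patent-property`, and `invention-title` come from the code in the cells below, while `patent-assignment`, `patent-assignor`, and the helper `flatten` are assumptions of mine rather than the dataset's documented schema.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment for illustration only; the real file is much larger.
SAMPLE = """
<patent-assignment>
  <assignment-record>
    <recorded-date><date>20010320</date></recorded-date>
  </assignment-record>
  <patent-assignors>
    <patent-assignor><name>WALTERS, EDWARD J. III</name></patent-assignor>
    <patent-assignor><name>ROSENTHAL, PHILIP J.</name></patent-assignor>
  </patent-assignors>
  <patent-properties>
    <patent-property>
      <invention-title>Relevance sorting for database searches</invention-title>
    </patent-property>
  </patent-properties>
</patent-assignment>
"""

def flatten(assn):
    """Collapse one nested assignment element into a flat dict of strings."""
    properties = assn.find("patent-properties").findall("patent-property")
    return {
        "recorded-date": assn.find("assignment-record/recorded-date/date").text,
        # Repeated child elements are joined with "|" so they fit in one CSV cell.
        "patent-assignors": "|".join(a.find("name").text for a in assn.find("patent-assignors")),
        "title": "|".join(p.find("invention-title").text for p in properties),
    }

row = flatten(ET.fromstring(SAMPLE))
print(row)
```

Joining repeated values with `|` keeps one row per assignment, at the cost of having to split those cells again later, which is what the tokenization step further down does.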

In [1]:
from lxml import etree
import pandas as pd
import numpy as np

In [2]:
patents = etree.parse("../input/ad20161018.xml")
root = patents.getroot()
assignments = list(list(root)[2])

def serialize(assn):
    # Flatten one assignment element into a Series; multi-valued fields are joined with "|".
    srs = pd.Series(dtype=object)
    # Metadata
    srs['last-update-date'] = assn.find("assignment-record").find("last-update-date").find("date").text
    srs['recorded-date'] = assn.find("assignment-record").find("recorded-date").find("date").text
    # One name per assignor/assignee child element
    srs['patent-assignors'] = "|".join([party.find("name").text for party in assn.find("patent-assignors")])
    srs['patent-assignees'] = "|".join([party.find("name").text for party in assn.find("patent-assignees")])
    # WIP---below.
    try:
        srs['patent-numbers'] = "|".join(
            ["|".join([d.find("doc-number").text for d in p.findall("document-id")])\
             for p in assn.find("patent-properties").findall("patent-property")]
        )
    except AttributeError:
        pass
    try:
        srs['patent-kinds'] = "|".join(
            ["|".join([d.find("kind").text for d in p.findall("document-id")])\
             for p in assn.find("patent-properties").findall("patent-property")]
        )
    except AttributeError:
        pass
    try:
        srs['patent-dates'] = "|".join(
            ["|".join([d.find("date").text for d in p.findall("document-id")])\
             for p in assn.find("patent-properties").findall("patent-property")]
        )    
    except AttributeError:
        pass
    try:
        srs['patent-countries'] = "|".join(
            ["|".join([d.find("country").text for d in p.findall("document-id")])\
             for p in assn.find("patent-properties").findall("patent-property")]
        )    
    except AttributeError:
        pass

    try:
        srs['title'] = "|".join(
            [p.find("invention-title").text for p in assn.find("patent-properties").findall("patent-property")]
        )        
    except AttributeError:
        pass
    return srs

flattened = pd.concat([serialize(assn) for assn in assignments], axis=1).T

In [3]:
del patents
del root
del assignments

In [4]:
assignments = flattened

In [5]:
assignments.head()

| | last-update-date | patent-assignees | patent-assignors | patent-countries | patent-dates | patent-kinds | patent-numbers | recorded-date | title |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 20161018 | FASTCASE | WALTERS, EDWARD J. III\|ROSENTHAL, PHILIP J. | US\|US | 20001108\|20161018 | X0\|B1 | 09707911\|9471672 | 20010320 | Relevance sorting for database searches |
| 1 | 20161018 | ANABASIS SRL | LAMBIASE, ALESSANDRO | US\|US | 20010726\|20161018 | X0\|B1 | 09890088\|9468665 | 20010720 | METHOD OF TREATING INTRAOCCULAR TISSUE PATHOLO... |
| 2 | 20161018 | QUALCOMM INCORPORATED | WALTON, J. RODNEY\|KETCHUM, JOHN W. | US\|US\|US | 20031201\|20050602\|20161018 | X0\|A1\|B2 | 10725904\|20050120097\|9473269 | 20031201 | METHOD AND APPARATUS FOR PROVIDING AN EFFICIEN... |
| 3 | 20161018 | INTERNATIONAL BUSINESS MACHINES CORPORATION | MORARIU, JANIS A.\|STAPEL, STEVEN W.\|STRAACH, J... | US\|US\|US | 20040622\|20051222\|20161018 | X0\|A1\|B2 | 10873346\|20050282136\|9472114 | 20040903 | COMPUTER-IMPLEMENTED METHOD, SYSTEM AND PROGRA... |
| 4 | 20161018 | INTERNATIONAL BUSINESS MACHINES CORPORATION | LI, XIN\|ROBERTS, GREGORY WAYNE | US\|US\|US | 20041019\|20060420\|20161018 | X0\|A1\|B2 | 10967958\|20060085754\|9471332 | 20050208 | Selecting graphical component types at runtime |


## Topic Modeling

Let's get to building a topic model around our patents. To do this we're going to use `nltk`, the "natural language toolkit", and `gensim`, a popular Python topic modeling and querying framework.

[Topic modeling](https://en.wikipedia.org/wiki/Topic_model) is a set of tasks in the natural language processing field organized around classifying individual documents according to their topics. The idea is that if you have two sets of documents, say one about football and one about tennis, you should be able to use a computer to separate them into two different topical "piles".

In this case, we're going to try classifying our patents based on their titles. Note that this is a bit of a stretch, as topic modeling works best when given full texts (IBM's Watson tools, for example, recommend that you input at least 250 words or so per document), but it nevertheless makes for a good demonstration.

In [6]:
import nltk
import gensim

The first thing we have to do is **tokenize** our words. A naive way to do this would be to split our string based on spaces (e.g. `str.split(" ")`), which is sometimes OK but has many edge cases (alternative punctuation marks like &mdash;, for example) and will fail to work as expected for larger problems.

`nltk` comes with a built-in word tokenizer that we can take advantage of.

In [7]:
titles = assignments['title']
title_tokens = [nltk.word_tokenize(title) for title in\
                    np.concatenate(titles.map(str).map(str.title).map(lambda s: s.split("|")))]

In [8]:
title_tokens = [title for title in title_tokens if len(title) > 0]  # drop titles that tokenized to nothing

In [9]:
len(title_tokens)

370460

In [10]:
title_tokens[:3]

[['Relevance', 'Sorting', 'For', 'Database', 'Searches'],
 ['Method',
  'Of',
  'Treating',
  'Intraoccular',
  'Tissue',
  'Pathologies',
  'With',
  'Nerve',
  'Growth',
  'Factor',
  '.'],
 ['Method',
  'And',
  'Apparatus',
  'For',
  'Providing',
  'An',
  'Efficient',
  'Control',
  'Channel',
  'Structure',
  'In',
  'A',
  'Wireless',
  'Communication',
  'System']]
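
To see why splitting on spaces falls short, here is a small standard-library sketch comparing `str.split(" ")` with a regular-expression tokenizer. The pattern is my own simplification for illustration and is not how `nltk.word_tokenize` works internally; the sample title is made up in the style of the dataset.

```python
import re

title = "Low-Voltage, Low-Skew Differential Transmitter."

# Naive approach: punctuation stays glued to the neighbouring word.
naive_tokens = title.split(" ")

# Regex approach: words (optionally hyphenated) and punctuation marks
# become separate tokens.
TOKEN_PATTERN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")
regex_tokens = TOKEN_PATTERN.findall(title)

print(naive_tokens)
print(regex_tokens)
```

With the naive split, `Transmitter.` and `Transmitter` would be counted as two different words, which is exactly the kind of edge case a proper tokenizer avoids.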

Next, we will **stem** our words. Stemming is a procedure in natural language processing where we chop off everything except for the root of a word. So for example, the words connect, connected, and connecting will all map to the same root: connect.

This is a good thing to do, particularly given the small size of our documents, because it shrinks the vocabulary: different forms of the same word are counted as one term, so short titles end up with more words in common, which tends to help classification.

`nltk` comes with several stemmers installed; we'll use the `PorterStemmer`.

In [11]:
stemmer = nltk.stem.PorterStemmer()
titles_stemmed = [[stemmer.stem(token) for token in tokens] for tokens in title_tokens]

In [12]:
titles_stemmed[:3]

[['relev', 'sort', 'for', 'databas', 'search'],
 ['method',
  'Of',
  'treat',
  'intraoccular',
  'tissu',
  'patholog',
  'with',
  'nerv',
  'growth',
  'factor',
  '.'],
 ['method',
  'and',
  'apparatu',
  'for',
  'provid',
  'An',
  'effici',
  'control',
  'channel',
  'structur',
  'In',
  'A',
  'wireless',
  'commun',
  'system']]
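
To make the idea concrete, here is a deliberately crude suffix-stripping sketch. It is not the Porter algorithm, which applies a much longer sequence of rules; the function `toy_stem` and its suffix list are invented for illustration only.

```python
def toy_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Lowercase a word and strip the first matching suffix, keeping at least 3 letters."""
    word = word.lower()
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["Connect", "Connected", "Connecting", "Connects", "Searches", "Treating"]
stems = [toy_stem(w) for w in words]
print(stems)
print(len(set(words)), "distinct words ->", len(set(stems)), "distinct stems")
```

Even this toy version shows the payoff: several surface forms collapse into a handful of stems, so titles that use different inflections of the same word now share a term.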

If we examine a list of words, however, we see that the most common English-language words dominate:

In [13]:
pd.Series(np.concatenate(titles_stemmed)).value_counts()

and                    170856
for                    145224
method                 137910
A                      101633
Of                      91017
system                  75843
devic                   65015
with                    46433
In                      39461
,                       37235
apparatu                31147
circuit                 29046
memori                  28675
semiconductor           26290
use                     24697
To                      23959
data                    22710
An                      22235
control                 20957
the                     20244
have                    17662
integr                  16994
process                 16233
structur                16052
network                 14185
form                    14012
On                      12411
commun                  12408
power                   11927
same                    10940
                        ...  
serr                        1
faid                        1
de-identif

These words carry little topical meaning and aren't very interesting. They're known as **stopwords** in NLP, and we're going to once again use `nltk` built-ins to remove them from consideration.

In [14]:
from nltk.corpus import stopwords

In [15]:
nltk.download("stopwords")

[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>


False

In [16]:
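# Title-cased to match the title-cased tokens. Note that most stems were lowercased by the
# stemmer, so only tokens it left capitalized (such as 'Of' and 'An') match this set.
english_stopwords = set([word.title() for word in stopwords.words("english")])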

In [17]:
stemmed_title_words = [[word for word in title if word not in english_stopwords] for title in titles_stemmed]

In [18]:
stemmed_title_words[:3]

[['relev', 'sort', 'for', 'databas', 'search'],
 ['method',
  'treat',
  'intraoccular',
  'tissu',
  'patholog',
  'with',
  'nerv',
  'growth',
  'factor',
  '.'],
 ['method',
  'and',
  'apparatu',
  'for',
  'provid',
  'effici',
  'control',
  'channel',
  'structur',
  'wireless',
  'commun',
  'system']]
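
In the output above, `for`, `and`, and `with` survive the filter, most likely because those stems are lowercase while the stopword set is title-cased. A minimal sketch of a case-insensitive filter follows; the tiny `STOPWORDS` set is a hand-picked stand-in for `nltk`'s much longer list, and `remove_stopwords` is a hypothetical helper.

```python
# Hand-picked stand-in for nltk's English stopword list (illustration only).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is a stopword, whatever their original case."""
    return [token for token in tokens if token.lower() not in stopwords]

stems = ["method", "and", "apparatu", "for", "provid", "An", "effici", "control",
         "channel", "structur", "In", "A", "wireless", "commun", "system"]
kept = remove_stopwords(stems)
print(kept)
```

Applying something like this to the real data would change every downstream result in this notebook, so keep in mind that the outputs shown here were produced with the original filter.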

Next, let's consider the opposite problem: words that occur too infrequently to be useful. Words that only ever appear once, for example, don't carry any information. Remember, we're going to split all of our patent titles into some small number of classes; just as in any other dataset, a feature which is only populated once isn't interesting, and can be safely dropped.

In fact, we could probably drop a *lot* of words from consideration, not just ones appearing once but ones appearing tens or even hundreds of times. This would speed up our algorithms and probably wouldn't significantly impact our results.

After a certain point words do start to matter, however; figuring out where that point is is up to you.

In our case we'll just be lazy and cut off at words that appear only once, and leave words appearing twice or more intact.

In [19]:
word_counts = pd.Series(np.concatenate(stemmed_title_words)).value_counts()
singular_words = set(word_counts[word_counts == 1].index)

In [20]:
stemmed_title_common_words = [[word for word in title if word not in singular_words] for title in stemmed_title_words]

In [21]:
stemmed_title_common_words[:3]

[['relev', 'sort', 'for', 'databas', 'search'],
 ['method',
  'treat',
  'intraoccular',
  'tissu',
  'patholog',
  'with',
  'nerv',
  'growth',
  'factor',
  '.'],
 ['method',
  'and',
  'apparatu',
  'for',
  'provid',
  'effici',
  'control',
  'channel',
  'structur',
  'wireless',
  'commun',
  'system']]
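
The frequency cutoff can also be written with the standard library's `collections.Counter`. This is a sketch on made-up stems; the `min_count` parameter and the helper `drop_rare_words` are hypothetical names, included to show how the threshold could be raised above one occurrence.

```python
from collections import Counter

def drop_rare_words(documents, min_count=2):
    """Remove every word that occurs fewer than `min_count` times across all documents."""
    counts = Counter(word for doc in documents for word in doc)
    return [[word for word in doc if counts[word] >= min_count] for doc in documents]

docs = [
    ["memori", "cell", "devic"],
    ["memori", "devic", "hydroxyapatit"],   # 'hydroxyapatit' occurs only once
    ["keyscreen"],                          # a title made of a single unique word
]
filtered = drop_rare_words(docs)
print(filtered)
```

The last title ends up empty after filtering, which is why empty titles have to be screened out before the modeling step.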

In [22]:
non_empty_indices = [i for i in range(len(stemmed_title_common_words)) if len(stemmed_title_common_words[i]) > 0]

In [23]:
non_empty_indices[5000]

5003

Notice that discarding words from our set has resulted in a handful of empty titles. Apparently a few patents have nothing *but* unique words!

In [24]:
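# Plain list comprehensions avoid building a ragged NumPy array, which recent NumPy versions reject.
stemmed_title_common_words_nonnull = [stemmed_title_common_words[i] for i in non_empty_indices]

In [25]:
classifiable_titles = [title_tokens[i] for i in non_empty_indices]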

With our titles adequately processed, now we switch over to `gensim`. The first thing we have to do is build a dictionary of words, which associates each word [stem] with a particular index number:


In [26]:
dictionary = gensim.corpora.Dictionary(stemmed_title_common_words_nonnull)

In [27]:
str(dictionary.token2id)[:1000]

"{'databas': 0, 'for': 1, 'relev': 2, 'search': 3, 'sort': 4, '.': 5, 'factor': 6, 'growth': 7, 'intraoccular': 8, 'method': 9, 'nerv': 10, 'patholog': 11, 'tissu': 12, 'treat': 13, 'with': 14, 'and': 15, 'apparatu': 16, 'channel': 17, 'commun': 18, 'control': 19, 'effici': 20, 'provid': 21, 'structur': 22, 'system': 23, 'wireless': 24, ',': 25, 'computer-impl': 26, 'educ': 27, 'product': 28, 'program': 29, 'compon': 30, 'graphic': 31, 'runtim': 32, 'select': 33, 'type': 34, 'aggreg': 35, 'chromatographi': 36, 'high': 37, 'hydroxyapatit': 38, 'molecular': 39, 'remov': 40, 'use': 41, 'weight': 42, 'aid': 43, 'differ': 44, 'further': 45, 'protocol': 46, 'station': 47, 'the': 48, 'transpond': 49, 'convert': 50, 'input': 51, 'languag': 52, 'output': 53, 'phonet': 54, 'written': 55, 'dataset': 56, 'from': 57, 'link': 58, 'methodolog': 59, 'multi-mod': 60, 'pattern': 61, 'charact': 62, 'digit': 63, 'media': 64, 'person': 65, 'replac': 66, 'airway': 67, 'detect': 68, 'instabl': 69, 'align': 7

Why are we doing this? Because shortly we're going to throw our corpus into a [tf-idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) algorithm. TF-IDF is a weighting scheme from information retrieval which converts each document's word counts into a weighted vector scaled to unit Euclidean length. In other words, it turns a count of the number of each word in our document into a unit vector in N-dimensional space, where N is, believe it or not, the number of individual words that we have in our dictionary (above).

That means that, in this case, we have a "dataset" matrix with one column for every word in the dictionary, which makes it very wide (and very sparse)!

The beauty of TF-IDF is that it scales the words according to how frequent or rare they are. Words that appear a lot in your text but also appear a lot in the rest of the corpus are weighed less heavily than words that appear a lot in your text but more rarely outside of it.

Thus we first use `gensim` to convert our words to word incidence vectors...

In [28]:
corpus = [dictionary.doc2bow(text) for text in stemmed_title_common_words_nonnull]

In [29]:
stemmed_title_common_words_nonnull[0], corpus[0]

(['relev', 'sort', 'for', 'databas', 'search'],
 [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)])
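
The dictionary and bag-of-words steps can be imitated in a few lines of plain Python, which may help demystify the `(id, count)` pairs above. This sketch is not `gensim`'s implementation; `build_token_ids` and `to_bag_of_words` are hypothetical helpers, and the ids they assign need not match `gensim`'s.

```python
from collections import Counter

def build_token_ids(documents):
    """Assign each distinct token an integer id, in order of first appearance."""
    token_ids = {}
    for doc in documents:
        for token in doc:
            token_ids.setdefault(token, len(token_ids))
    return token_ids

def to_bag_of_words(doc, token_ids):
    """Return sorted (token_id, count) pairs for the known tokens of one document."""
    counts = Counter(token_ids[token] for token in doc if token in token_ids)
    return sorted(counts.items())

docs = [
    ["relev", "sort", "for", "databas", "search"],
    ["transport", "system", "and", "method", "for", "such", "system"],
]
token_ids = build_token_ids(docs)
bow = to_bag_of_words(docs[1], token_ids)
print(token_ids)
print(bow)
```

A repeated word such as `system` shows up once with a count of 2, and word order is discarded, hence the name "bag" of words.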

In [30]:
stemmed_title_common_words_nonnull[100], corpus[100]

(['passeng',
  'transport',
  'system',
  'and',
  'method',
  'for',
  'obtain',
  'ticket',
  'such',
  'system'],
 [(1, 1),
  (9, 1),
  (15, 1),
  (23, 2),
  (300, 1),
  (325, 1),
  (326, 1),
  (327, 1),
  (328, 1)])

...then run `TfidfModel` from `gensim` on them to turn them into our word vectors!

In [31]:
from gensim.models import TfidfModel

In [32]:
tfidf = TfidfModel(corpus)

Note that `gensim` doesn't follow the `scikit-learn` access pattern, if you are familiar with it. It instead (1) defers computations on individual entries until necessary and (2) provides access to data using bracket indexing notation (`[]`).

By contrast, `scikit-learn` separates model initialization from computation: nothing is computed until you call `fit()`, which then processes the whole dataset at once, and results are exposed through methods such as `transform()` and through attributes whose names end in an underscore (for example `labels_`).
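
The idea of deferring computation behind bracket access can be sketched with a tiny wrapper class. `LazyTransformed` is a made-up name and this is not how `gensim` is implemented internally; it only illustrates that indexing can trigger the computation for a single entry on demand.

```python
class LazyTransformed:
    """Wrap a list of documents and apply `transform` only when an item is requested."""

    def __init__(self, documents, transform):
        self.documents = documents
        self.transform = transform
        self.calls = 0  # counts how many documents have actually been transformed

    def __getitem__(self, index):
        self.calls += 1
        return self.transform(self.documents[index])

    def __len__(self):
        return len(self.documents)

docs = [["memori", "cell"], ["fiber", "optic", "cabl"], ["power", "convert"]]
lengths = LazyTransformed(docs, transform=len)

print(lengths.calls)   # nothing has been computed yet
print(lengths[1])      # computes the result for document 1 only
print(lengths.calls)
```

The practical upside of this style is that wrapping a very large corpus costs nothing until you actually ask for results.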

In [33]:
stemmed_title_common_words_nonnull[0], corpus[0], tfidf[corpus[0]]

(['relev', 'sort', 'for', 'databas', 'search'],
 [(0, 1), (1, 1), (2, 1), (3, 1), (4, 1)],
 [(0, 0.4377475295381456),
  (1, 0.07562061510033051),
  (2, 0.5469009211535774),
  (3, 0.36992777626242423),
  (4, 0.6055670447985132)])
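
To see roughly where such weights come from, here is a pure-Python sketch of one common tf-idf variant: raw term count times log2(N / document frequency), followed by scaling to unit length. I believe this matches `gensim`'s default settings, but treat that as an assumption; the toy corpus and the `tfidf_vector` helper are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vector(doc, documents):
    """Return {word: weight} for `doc`, L2-normalised, using idf = log2(N / df).

    Assumes `doc` is one of `documents`, so every word has a document frequency >= 1.
    """
    n_docs = len(documents)
    doc_freq = Counter(word for d in documents for word in set(d))
    raw = {word: count * math.log2(n_docs / doc_freq[word])
           for word, count in Counter(doc).items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {word: (w / norm if norm else 0.0) for word, w in raw.items()}

docs = [
    ["method", "for", "databas", "search"],
    ["method", "for", "memori", "cell"],
    ["method", "for", "fiber", "optic"],
    ["databas", "memori", "devic"],
]
weights = tfidf_vector(docs[0], docs)
for word, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

The rare word `search` dominates while the ubiquitous `for` gets a small weight, the same pattern as in the real output above, where `for` scores only about 0.08.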

With our words suitably vectorized, we can now move on to fitting a model. Since our words are now, effectively, a very large numeric dataset, it's possible to use any general-purpose algorithm to fit it. An [earlier notebook on this dataset](https://www.kaggle.com/the1owl/d/uspto/patent-assignment-daily/exploring-daily-patent-assignments-files), for example, uses a `KMeans` clustering algorithm to arrive at its topics (you should go read that after you're done with this one).

We'll instead use a model specifically adapted to natural language processing from the `gensim` built-ins, `LsiModel`.

In [34]:
from gensim.models import LsiModel

Here's how we run it:

In [35]:
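corpus_tfidf = tfidf[corpus]  # lazily applies the tf-idf weights to each document
lsi = LsiModel(corpus_tfidf, id2word=dictionary, num_topics=10)  # fit 10 latent topics
corpus_lsi = lsi[corpus_tfidf]  # per-document scores against each topic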
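
Conceptually, latent semantic indexing is a truncated singular value decomposition of the document-term matrix: each topic is a direction in word space, and a document's score for a topic is its dot product with that direction. The sketch below finds only the first such direction, using power iteration in plain Python on a made-up count matrix. It illustrates the idea and should not be read as `gensim`'s algorithm, which relies on a more scalable solver; `top_topic_direction` is a hypothetical helper.

```python
import math

def top_topic_direction(rows, iterations=100):
    """Power iteration on A^T A: the unit vector v in word space that maximises |A v|."""
    n_terms = len(rows[0])
    v = [1.0 / math.sqrt(n_terms)] * n_terms
    for _ in range(iterations):
        scores = [sum(a * b for a, b in zip(row, v)) for row in rows]   # A v
        w = [sum(row[j] * s for row, s in zip(rows, scores))            # A^T (A v)
             for j in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

terms = ["memori", "cell", "fiber", "optic"]
# Rows are documents, columns are word counts (made-up numbers).
matrix = [
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
direction = top_topic_direction(matrix)
doc_scores = [sum(a * b for a, b in zip(row, direction)) for row in matrix]
print([f"{w:.3f}*{t}" for w, t in zip(direction, terms)])
print([round(s, 3) for s in doc_scores])
```

The dominant direction puts all of its weight on `memori` and `cell`, and the one document about fiber optics scores zero against it; later topics would pick up what this first one leaves unexplained.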

Here's a printout of what words are important to our various topics. Notice that certain extremely common words, like `semiconductor`, appear in different positions in multiple topics, sometimes with opposite signs. Also, note that this display is cut off at a certain number of displayed words; in reality the model considers far more than these (you can specify how many to display, however, using the `num_words` parameter).

In [36]:
lsi.print_topics(10)

[(0,
  '0.381*"devic" + 0.350*"semiconductor" + 0.298*"method" + 0.281*"and" + 0.229*"for" + 0.227*"system" + 0.194*"," + 0.188*"memori" + 0.183*"circuit" + 0.153*"apparatu"'),
 (1,
  '-0.603*"semiconductor" + -0.354*"devic" + 0.328*"system" + 0.197*"apparatu" + 0.171*"for" + 0.166*"data" + 0.153*"and" + 0.152*"," + -0.146*"manufactur" + -0.132*"form"'),
 (2,
  '-0.719*"circuit" + -0.501*"integr" + 0.162*"system" + 0.122*"data" + 0.115*"semiconductor" + -0.112*"packag" + 0.111*"devic" + 0.100*"commun" + -0.099*"voltag" + 0.096*"network"'),
 (3,
  '0.670*"memori" + -0.339*"commun" + 0.282*"cell" + -0.200*"handl" + -0.174*"semiconductor" + -0.167*"inform" + 0.140*"non-volatil" + -0.122*"wireless" + -0.113*"devic" + 0.113*"form"'),
 (4,
  '-0.562*"," + -0.303*"imag" + -0.261*"apparatu" + 0.234*"system" + 0.232*"handl" + 0.211*"manag" + -0.187*"display" + 0.182*"power" + 0.160*"network" + 0.153*"inform"'),
 (5,
  '-0.575*"fiber" + -0.542*"optic" + -0.274*"connector" + -0.274*"cabl" + 0.178

Here are the scoring outputs for the first five documents:

In [37]:
for scores in corpus_lsi[:5]:
    print(scores)

[(0, 0.025505913724665038), (1, 0.027499880322717597), (2, 0.01426593619606057), (3, 0.0009908453639730559), (4, 0.010034513755037638), (5, 0.0031677506319896455), (6, 0.007341945673440692), (7, 0.0010426105087261027), (8, 0.0016127075703606755), (9, 0.00579595984066685)]
[(0, 0.026358616443681843), (1, 0.007150780195027558), (2, -0.0015753877055622715), (3, 0.004530836932058682), (4, 0.003458390186545298), (5, -0.010173937921073961), (6, -0.002085055067646866), (7, -0.006282638143009672), (8, 0.0010517417740392848), (9, 0.016492222014504534)]
[(0, 0.2644780037895393), (1, 0.226591928943143), (2, 0.11033380073002112), (3, -0.15922000707781936), (4, 0.02501138178478588), (5, 0.028925162486698443), (6, 0.20785703694213398), (7, -0.027280712424837966), (8, 0.014817728795491708), (9, 0.061294115583082016)]
[(0, 0.13510394049920565), (1, 0.11694971928301032), (2, 0.03056331831469064), (3, 0.04932630336538833), (4, -0.11597052146596132), (5, 0.04686238922327079), (6, -0.04655090311702373), (

Let's use these scores to fetch best-fit classifications for all of our (classifiable) patents:

In [38]:
classifications = [np.argmax(np.asarray(corpus_lsi[i])[:,1]) for i in range(len(stemmed_title_common_words_nonnull))]

In [39]:
topics = pd.DataFrame({'topic': classifications, 'title': classifiable_titles})

Certain topics that our classifier arrives at are much more common than others.

In [40]:
topics['topic'].value_counts()

0    242086
1     39351
9     31661
3     23469
4     12435
6     11829
8      8925
5       387
7       238
2        41
Name: topic, dtype: int64
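
One caveat about the argmax step: LSI scores are signed, as the negative values in the output further up show, so `np.argmax` picks the largest positive score and ignores topics on which a title loads strongly but negatively. Whether that is what you want is a judgment call. The sketch below, on made-up scores and with hypothetical helper names, compares the two choices using only the standard library.

```python
def best_topic_signed(scores):
    """Topic id with the largest score, mirroring argmax over the raw values."""
    return max(scores, key=lambda pair: pair[1])[0]

def best_topic_absolute(scores):
    """Topic id with the largest score in absolute value."""
    return max(scores, key=lambda pair: abs(pair[1]))[0]

# Made-up (topic_id, score) pairs in the same shape as the LSI output.
scores = [(0, 0.12), (1, -0.60), (2, 0.03)]

signed_choice = best_topic_signed(scores)
absolute_choice = best_topic_absolute(scores)
print(signed_choice, absolute_choice)
```

Under the signed rule this title lands in topic 0 even though topic 1 describes it far more strongly, which may partly explain why topic 0 absorbs most of the titles above.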

Let's see what our classes look like.

In [41]:
from IPython.display import display

In [42]:
for i in range(10):
    print("Topic", i + 1)
    display(topics.query('topic == @i').head(5))

Topic 1

| | topic | title |
|---|---|---|
| 1 | 0 | [Method, Of, Treating, Intraoccular, Tissue, P... |
| 2 | 0 | [Method, And, Apparatus, For, Providing, An, E... |
| 3 | 0 | [Computer-Implemented, Method, ,, System, And,... |
| 4 | 0 | [Selecting, Graphical, Component, Types, At, R... |
| 5 | 0 | [Removal, Of, High, Molecular, Weight, Aggrega... |

Topic 2

| | topic | title |
|---|---|---|
| 0 | 1 | [Relevance, Sorting, For, Database, Searches] |
| 19 | 1 | [Implicit, Searching, For, Mobile, Content] |
| 30 | 1 | [Physical, Navigation, Of, A, Mobile, Search, ... |
| 53 | 1 | [Call, Control, Server] |
| 54 | 1 | [Methods, And, Apparatus, For, Communicating, ... |

Topic 3

| | topic | title |
|---|---|---|
| 8962 | 2 | [Headphones] |
| 34413 | 2 | [Computer, Console, Case] |
| 38916 | 2 | [Computer, Console, Case] |
| 43419 | 2 | [Computer, Console, Case] |
| 49696 | 2 | [Keyscreen] |

Topic 4

| | topic | title |
|---|---|---|
| 178 | 3 | [Driver, For, Non-Linear, Displays, Comprising... |
| 198 | 3 | [Erasable, And, Programmable, Non-Volatile, Cell] |
| 232 | 3 | [Method, And, System, For, Accelerated, Access... |
| 256 | 3 | [Two-Dimensional, Data, Memory] |
| 322 | 3 | [Error, Correction, Scheme, For, Use, In, Flas... |

Topic 5

| | topic | title |
|---|---|---|
| 40 | 4 | [Resource, Consumption, Reduction, Via, Meetin... |
| 157 | 4 | [Antenna, Configuration] |
| 291 | 4 | [Data, Carrier, For, Storing, Information, Rep... |
| 320 | 4 | [Digital, Rights, Management, Unit, For, A, Di... |
| 524 | 4 | [Non-Linear, Distribution, Of, Voltage, Steps,... |

Topic 6

| | topic | title |
|---|---|---|
| 285 | 5 | [Method, Of, Calling, Up, Object-Specific, Inf... |
| 2050 | 5 | [Collecting, Information, Before, A, Call] |
| 2557 | 5 | [Collecting, Information, Before, A, Call] |
| 3357 | 5 | [Collecting, Information, Before, A, Call] |
| 3963 | 5 | [Telephony, Usage, Derived, Presence, Informat... |

Topic 7

| | topic | title |
|---|---|---|
| 6 | 6 | [Communication, Station, For, Communication, W... |
| 18 | 6 | [Mobile, Search, Substring, Query, Completion] |
| 20 | 6 | [Creation, Of, A, Mobile, Search, Suggestion, ... |
| 21 | 6 | [Mobile, Pay-Per-Call, Campaign, Creation] |
| 22 | 6 | [Mobile, Pay-Per-Call, Campaign, Creation] |

Topic 8

| | topic | title |
|---|---|---|
| 4809 | 7 | [Buddy, Lists, For, Information, Vehicles] |
| 5255 | 7 | [Paper, Sheet, Handling, Apparatus] |
| 9900 | 7 | [Setting, User-Preference, Information, On, Th... |
| 22857 | 7 | [Oral, Irrigator, Housing] |
| 22860 | 7 | [Oral, Irrigator, Housing] |

Topic 9

| | topic | title |
|---|---|---|
| 618 | 8 | [Connector, For, Chip-Card] |
| 717 | 8 | [Multitrack, Optical, Disc, Reader] |
| 1007 | 8 | [Low-Voltage, ,, Low-Skew, Differential, Trans... |
| 1017 | 8 | [High-Speed, ,, Low-Power, ,, Low-Skew, ,, Low... |
| 1213 | 8 | [Coil, Construction] |

Topic 10

| | topic | title |
|---|---|---|
| 37 | 9 | [Image-Guided, Laser, Catheter] |
| 69 | 9 | [Sensor, Arrangement] |
| 74 | 9 | [Power, Converter] |
| 75 | 9 | [Acceptance, Filter] |
| 78 | 9 | [Power, Optimized, Collocated, Motion, Estimat... |


## Conclusion

It's unclear whether our classifier found a "good" representation of our title data. Certainly I think that `METHOD OF TREATING INTRAOCCULAR TISSUE PATHOLOGIES` and `Selecting graphical component types at runtime` should find their way to separate classes.

However, given the low volume of information contained in a simple patent title (in some cases these titles are just one or two words long!), this is probably close to the best that we can do.

TF-IDF is not unique to `gensim`; it is also available in `scikit-learn`, among other places. `gensim` is a high-capacity but low-level library, and a classifier for this same data with very similar performance in much less code is available [here](https://www.kaggle.com/the1owl/d/uspto/patent-assignment-daily/exploring-daily-patent-assignments-files).

Nevertheless, with "wider" datasets `gensim` models should be the most performant.

## Further Reading

* http://www.nltk.org/book/ch01.html
* https://radimrehurek.com/gensim/tutorial.html
* https://www.youtube.com/watch?v=oqfKz-PP9FU