# Lab 11: APIs and Text Analysis


## Motivation: Text Analysis on University of Maryland Patents

We are able to use the [Patentsview API](https://www.patentsview.org/api/query-language.html) to pull data from patents awarded to inventors at University of Maryland, including the abstract from each of the patents. Suppose we wanted to know what types of patents were awarded to UMD. We can look at the information from the abstracts and read through them, but this would take a very long time since there are almost 1,300 abstracts. Instead, we can use **text analysis** to help us.

However, as-is, the text from abstracts can be difficult to analyze. We aren't able to use traditional statistical techniques without some heavy data manipulation, because the text is essentially a categorical variable with unique values for each patent. We need to basically break it apart and clean the data before we apply our data analysis techniques. 

In this notebook, we will go through the process of cleaning and processing the text data to prepare it for **topic modeling**, which is the process of automatically assigning topics to individual **documents** (in this case, an individual document is an abstract from a patent). We will use a technique called **Latent Dirichlet Allocation** as our topic modeling technique, and try to determine what sorts of patents were awarded to University of Maryland.

### Python Setup

In [1]:
# interacting with websites and web-APIs
import requests # easy way to interact with web sites and services

# data manipulation
from datascience import *
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn import preprocessing

# Text Analysis Tools

from nltk import download
from nltk.corpus import stopwords
from nltk import SnowballStemmer
import string

# APIs with Python

APIs (application programming interfaces) are hosted on web servers. When you type www.google.com in your browser's address bar, your computer is actually asking the www.google.com server for a webpage, which it then returns to your browser. APIs work much the same way, except instead of your web browser asking for a webpage, your program asks for data. This data is usually returned in JSON format. 

To retrieve data, we make a request to a webserver. The server then replies with our data. In Python, we'll use the `requests` library to do this.



## Retrieving Patent Data about University of Maryland

We will use the `requests` package to retrieve information about the patents that have been granted to inventors at University of Maryland, using the PatentsView API. This notebook goes over using the `requests` package to get the data, as well as putting that data into a form that is usable. 

## How does the requests package work?

We first need to understand what information can be accessed from the API. We use an example of the **PatentsView API** (www.patentsview.org) to make the API call and check the information we get. 

### About PatentsView API

The PatentsView platform is built on data derived from the US Patent and Trademark Office (USPTO) bulk data to link inventors, their organizations, locations, and overall patenting activity. The PatentsView API provides programmatic access to longitudinal data and metadata on patents, inventors, companies, and geographic locations since 1976. 

To access the API, we use the `requests` library. In order to tell Python what to access, we need to specify the URL of the API endpoint.

PatentsView has several API endpoints. An endpoint is a server route that is used to retrieve different data from the API. You can think of the endpoints as just specifying what types of data you want. Examples of PatentsView API endpoints are shown here: http://www.patentsview.org/api/doc.html

Many times, we need to request a key from the data provider in order to access an API. For example, if you wanted to access the Twitter API, then you would need to get a Twitter developer account and access token (see [https://developer.twitter.com/en/docs/basics/authentication/overview/oauth](https://developer.twitter.com/en/docs/basics/authentication/overview/oauth)). Currently no key is necessary to access the PatentsView API. 

### Making a Request
When you ask a website or portal for information, this is called making a request. That is exactly what the `requests` library has been designed to do. However, we need to provide a query URL according to the format defined by PatentsView. The details on how to do that are explained [at this link.](https://www.patentsview.org/api/query-language.html)

Following the directions detailed in the link above, let's build our first query URL.

**Query String Format**

The query string is always a single JSON object: **{`<field>`:`<value>`}**, where `<field>` is the name of a database field and `<value>` is the value the field will be compared to for equality (Each API Endpoint section contains a list of the data fields that can be selected for inclusion in output datasets).

We use the following base URL for the Patents Endpoint:

**Base URL**: `https://api.patentsview.org/patents/query?q={criteria}`
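The query string format above can be sketched with the standard library. This is a minimal illustration, not the PatentsView client: the helper name `build_query_url` and the `per_page` value of 25 are assumptions made for the example, while the `q`, `f`, and `o` parameters are the ones used later in this notebook.

```python
import json as jsonlib  # aliased because the notebook later uses `json` as a variable name
from urllib.parse import urlencode, urlsplit, parse_qs

# Base URL for the Patents Endpoint, as given above (without the query string).
BASE_URL = "https://api.patentsview.org/patents/query"

def build_query_url(criteria, fields=None, options=None):
    """Hypothetical helper: return a query URL with JSON-encoded parameters."""
    params = {"q": jsonlib.dumps(criteria)}
    if fields is not None:
        params["f"] = jsonlib.dumps(fields)
    if options is not None:
        params["o"] = jsonlib.dumps(options)
    # urlencode percent-escapes characters such as quotes, braces, and spaces.
    return BASE_URL + "?" + urlencode(params)

url = build_query_url(
    {"assignee_organization": "University of Maryland, College Park"},
    fields=["patent_title", "patent_year"],
    options={"per_page": 25},  # illustrative value only
)
print(url)

# Decoding the URL recovers the original criteria.
decoded = parse_qs(urlsplit(url).query)
print(jsonlib.loads(decoded["q"][0]))
```

Escaping the parameters this way is optional when `requests` handles the URL for us, but it makes it easier to see that each parameter is a single JSON value.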



## Task example: Pull patents for University of Maryland

In this example, we will only pull patents from one organization: University of Maryland. Let's go to the Patents Endpoint (http://www.patentsview.org/api/patent.html) and find the appropriate field for the organization's name. Looking at the API documentation, we can see that the field we need is called `"assignee_organization"` (organization name, if assignee is organization).

> _Note_: **Assignee**: the name of the entity - company, foundation, partnership, holding company or individual - that owns the patent. In this example we are looking at universities (organization-level).

We will pull from the API using a step-by-step process:
- Build the query
- Get the response
- Check the response code
- Get the content
- Convert to table

By the end, we should have data about patents that we can work with using the tools we've already learned.

### Step 1. Build the URL query 

Let's build our first URL query by combining the base url with one criterion (name of the `assignee_organization`). This is based on the directions detailed [at this link.](https://www.patentsview.org/api/query-language.html)

To build it up, we start with the base url (`http://api.patentsview.org/patents/query?`) and add the various criteria that we want. For example, we use `'q={"assignee_organization":"University of Maryland, College Park"}'` to set the organization to University of Maryland, College Park. We then add `&f=[...]` to choose which fields are returned and `&o={...}` to set options such as the number of results per page.

In [2]:
base_url = 'http://api.patentsview.org/patents/query?'
organization = 'q={"assignee_organization":"University of Maryland, College Park"}'
variables = '&f=["patent_title","patent_year", "patent_type", "patent_abstract"]'
options = '&o={"per_page":1300}'

url = base_url + organization + variables + options
print(url)

http://api.patentsview.org/patents/query?q={"assignee_organization":"University of Maryland, College Park"}&f=["patent_title","patent_year", "patent_type", "patent_abstract"]&o={"per_page":1300}


### Step 2. Get the response

Now let's get the response using the URL defined above, using the `requests` library.

In [3]:
r = requests.get(url)  # Get response from the URL

### Step 3. Check the Response Code

Before doing anything with a response, it's a good idea to check its status code.

The following are the response codes for the PatentsView API:

`200` - the query parameters are all valid; the results will be in the body of the response

`400` - the query parameters are not valid, typically either because they are not in valid JSON format, or a specified field or value is not valid; the “status reason” in the header will contain the error message

`500` -  there is an internal error with the processing of the query; the “status reason” in the header will contain the error message
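The codes listed above can be folded into a small lookup so that a script can react to them. This is only a sketch: the names `STATUS_MEANINGS`, `describe_status`, and `is_success` are made up for this example, and the descriptions are paraphrased from the list above.

```python
# Descriptions paraphrased from the PatentsView response codes listed above.
STATUS_MEANINGS = {
    200: "query parameters are valid; results are in the response body",
    400: "query parameters are not valid; see the status reason in the header",
    500: "internal error while processing the query; see the status reason in the header",
}

def describe_status(status_code):
    """Return a short description of a response status code."""
    return STATUS_MEANINGS.get(status_code, "status code not listed for this API")

def is_success(status_code):
    """True only when the response body can be expected to hold results."""
    return status_code == 200

# 404 is included to show what happens with a code that is not in the list.
for code in (200, 400, 500, 404):
    print(code, "->", describe_status(code))
```

With a real response object `r`, one would pass `r.status_code` to these helpers. The `requests` library also offers `r.raise_for_status()`, which raises an exception for 4xx and 5xx codes.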

Let's check the status of our response 

In [4]:
r.status_code  # Check the status code

200

We are good to go. Now let's get the content.

### Step 4. Get the Content
After a web server returns a response, you can collect the content you need by parsing its JSON body into Python objects (dictionaries and lists).

JSON is a way to encode data structures like lists and dictionaries to strings that ensures that they are easily readable by machines. JSON is the primary format in which data is passed back and forth to APIs, and most API servers will send their responses in JSON format.
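To make the idea concrete, here is a minimal sketch of moving between a JSON string and Python objects with the standard `json` module. The response text is invented for illustration; only the `patents` key and the field names are taken from this lab.

```python
import json as jsonlib  # aliased because the notebook uses `json` as a variable name

# Hypothetical response text, shaped like the fields requested in this lab.
raw_text = (
    '{"patents": ['
    '{"patent_title": "Example title A", "patent_year": "2018"}, '
    '{"patent_title": "Example title B", "patent_year": "2017"}'
    ']}'
)

parsed = jsonlib.loads(raw_text)   # JSON string -> Python dictionary
print(type(parsed).__name__)
print(parsed["patents"][0]["patent_title"])

# Going the other way: Python objects -> JSON string.
round_trip = jsonlib.dumps(parsed, indent=2)
print(round_trip)
```

Calling `r.json()` on a `requests` response performs the first of these steps for us.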

In [11]:
json = r.json()

We want to be able to use this data, but it's a bit hard to in the current JSON format. We want to essentially take the information that is in the `patents` field within the dictionary and create a Table out of it. To do that, we'll use a trick called **list comprehension** that will make our lives much easier.

#### List Comprehension

List comprehension is kind of like a compact `for` loop inside a list. You use it to generate a list of values with certain characteristics. For example, if we wanted to create a list with values from 0 to 9, we could use the following.

In [11]:
[i for i in np.arange(10)]

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
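The same pattern works on a list of dictionaries, which is the shape we get back from the API. The sketch below uses made-up records to compare a plain `for` loop with the equivalent comprehension, and adds a filtering condition.

```python
# Hypothetical records, shaped like the entries in the API response.
records = [
    {"patent_title": "Example title A", "patent_year": "2018"},
    {"patent_title": "Example title B", "patent_year": "2016"},
    {"patent_title": "Example title C", "patent_year": "2018"},
]

# The loop version...
titles_loop = []
for rec in records:
    titles_loop.append(rec["patent_title"])

# ...and the equivalent list comprehension.
titles = [rec["patent_title"] for rec in records]

# A comprehension can also filter with a trailing `if`.
titles_2018 = [rec["patent_title"] for rec in records if rec["patent_year"] == "2018"]

print(titles)
print(titles_2018)
```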

### Step 5. Converting JSON into a Table

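Now let's convert the JSON into a `Table`. Each entry in `json['patents']` is a dictionary with one key per field we requested, so we can use a list comprehension to pull each field out into its own list, and then use those lists as the columns of the table.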

In [13]:
patent_title = [a['patent_title'] for a in json['patents']]
patent_year = [a['patent_year'] for a in json['patents']]
patent_type = [a['patent_type'] for a in json['patents']]
patent_abstract = [a['patent_abstract'] for a in json['patents']]

patents = Table().with_columns('patent_title', patent_title,
                               'patent_year', patent_year,
                              'patent_type', patent_type,
                               'patent_abstract', patent_abstract)
patents.show(5)

patent_title,patent_year,patent_type,patent_abstract
Methods for recovery of leaf proteins,2018,utility,A novel method for processing soluble plant leaf protein ...
"Systems, methods, and devices for health monitoring of a ...",2018,utility,A health monitoring device includes an ultrasound source ...
Sparse decomposition of head related impulse responses w ...,2018,utility,This application describes methods of signal processing ...
Flapping wing aerial vehicles,2018,utility,An autonomous flapping wing aerial vehicle can have a ve ...
Compositions and vaccines comprising vesicles and method ...,2018,utility,"The disclosure relates to compositions, pharmaceutical c ..."


In [14]:
patents.num_rows

398

We can pull out just the abstracts as an array to use for our text analysis.

In [15]:
abstracts = patents.column('patent_abstract')

# Topic Modeling

We are going to apply topic modeling, an unsupervised learning method, to our corpus to find the high-level topics in our corpus. Through this process, we'll discuss how to clean and preprocess our data to get the best results. Topic modeling is a broad subfield of machine learning and natural language processing. We are going to focus on a common modeling approach called Latent Dirichlet Allocation (LDA). 

To use topic modeling, we first have to assume that topics exist in our corpus, and that some small number of these topics can "explain" the corpus. A topic in this context is a list of words from the corpus, ranked by probability. A single document can be explained by multiple topics. For instance, an article on net neutrality would fall under the topic "technology" as well as the topic "politics." The set of topics used by a document is known as the document's allocation. Hence the name Latent Dirichlet Allocation: each document has an allocation of latent topics drawn from a Dirichlet distribution. 
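As a rough illustration of what an "allocation" looks like, the sketch below draws one set of topic shares from a Dirichlet distribution using only the standard library (by normalizing gamma draws). The third topic name, the alpha values, and the helper `sample_dirichlet` are assumptions for this example; this is not how `sklearn` fits the model.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution by normalizing gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
topic_names = ["technology", "politics", "sports"]  # "sports" is a made-up third topic

# Small alpha values tend to concentrate each document on a few topics.
allocation = sample_dirichlet([0.5, 0.5, 0.5], rng)
for name, share in zip(topic_names, allocation):
    print(f"{name}: {share:.2f}")
print("total:", round(sum(allocation), 6))
```

The shares are non-negative and sum to 1, which is what lets us read them as "how much of this document belongs to each topic."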

We will use topic modeling in order to determine what types of inventions have been produced at the University of Maryland. 


## Preparing Text Data for Natural Language Processing (NLP)

The first important step in working with text data is cleaning and processing the data, which includes (but is not limited to):

- forming a corpus of text
- stemming and lemmatization
- tokenization
- removing stop-words
- finding words co-located together (N-grams)



The ultimate goal is to transform our text data into a form an algorithm can work with, because a document or a corpus of text cannot be fed directly into an algorithm. Algorithms expect numerical feature vectors with certain fixed sizes, and can't handle documents, which are basically sequences of symbols with variable length. We will be transforming our text corpus into a *bag of n-grams* to be further analyzed. In this form our text data is represented as a matrix where each row refers to a specific patent abstract (document) and each column holds the number of occurrences of a word (feature).
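A minimal sketch of that matrix, built by hand on a made-up three-document corpus, may help fix the idea before we use `sklearn` to do it for us. The documents and variable names here are illustrative only.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the patent abstracts.
docs = [
    "a method for processing plant protein",
    "a device for monitoring plant health",
    "a method for signal processing",
]

tokenized = [doc.split() for doc in docs]

# The vocabulary (one column per unique word), sorted for a stable column order.
vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document; each entry counts how often that word occurs in it.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[word] for word in vocab])

print(vocab)
for row in matrix:
    print(row)
```

Every row has the same length (the size of the vocabulary) no matter how long the original document was, which is the fixed-size property the algorithms need.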

### Stemming and Lemmatization - Distilling text data

We want to process our text through *stemming and lemmatization*, or replacing words with their root or simplest form. For example "systems," "systematic," and "system" are all different words, but we can replace all these words with "system" without sacrificing much meaning. 

- A **lemma** is the original dictionary form of a word (e.g. the lemma for "lies," "lied," and "lying" is "lie").
- The process of turning a word into its simplest form is **stemming**. There are several well known stemming algorithms -- Porter, Snowball, Lancaster -- that all have their respective strengths and weaknesses.

In this notebook, we'll use the Snowball Stemmer:

In [None]:
# Examples of how a Stemmer works:
stemmer = SnowballStemmer("english")
print(stemmer.stem('lies'))
print(stemmer.stem("lying"))
print(stemmer.stem('systematic'))
print(stemmer.stem("running"))

<font color = 'red'>**Question 2. Using the format of the examples above, try stemming the words `systematically` and `systemic`.**</font>

### Removing Punctuation

For some purposes, we might want to preserve punctuation. For example, if we wanted to be able to detect sentiment of text, we might want to keep exclamation points, because they signify something about the text. For our purposes, however, we will simply strip the punctuation so that it does not affect our analysis. To do this, we use the `string` package, creating a translator that takes any string and "translates" it into a string without any punctuation.

An example using the first abstract in our corpus is shown below.

In [None]:
# Before
abstracts[0]

In [None]:
# Create translator
# This maps all punctuation to an empty space
translator=str.maketrans(string.punctuation, ' ' * len(string.punctuation))

# After
abstracts[0].translate(translator)

### Tokenizing

We want to separate text into individual tokens (generally individual words). To do this, we'll first write a function that takes a string and splits it up into individual words. We'll do the whole process of removing punctuation, stemming, and tokenizing all in one function.

In [None]:
def tokenize(text):
    translator=str.maketrans(string.punctuation, ' '*len(string.punctuation)) # translator that replaces punctuation with empty spaces
    return [stemmer.stem(i) for i in text.translate(translator).split()]

The `tokenize` function actually does several things at the same time. First, it removes any punctuation using the `translate` method. Then, the `split` method breaks it apart into individual words. Then, using `stemmer.stem`, it creates a list of the stemmed versions of each of those individual words. 

Let's take a look at an example of how this works using the first abstract in our corpus.

In [None]:
tokenize(abstracts[0])

What we get out of it is something called a **bag of words**. This is a list of all of the words that are in the abstract, cleaned of all punctuation and stemmed. The paragraph is now represented as a vector of individual words rather than as one whole entity. 

We can apply this to each abstract in our corpus using `CountVectorizer`. This will not only do the tokenizing, but it will also count any duplicates of words and create a matrix that contains the frequency of each word. This will be quite a large matrix (number of columns will be number of unique words), so it outputs the data as a sparse matrix.

Similar to how we fit models using `sklearn`, we will first create the `vectorizer` object (you can think of this like a model object), and then fit it with our abstracts. This should give us back our overall corpus bag of words, as well as a list of features (that is, the unique words in all the abstracts).

In [None]:
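vectorizer = CountVectorizer(analyzer= "word", # unit of features are single words rather than phrases of words 
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,1), # single words only (n-gram sizes start at 1)
                            strip_accents='unicode',
                            min_df = 0.05, # drop words found in fewer than 5% of abstracts
                            max_df = 0.95) # drop words found in more than 95% of abstracts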

In [None]:
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words 
# scikit-learn 1.0 and later; older versions use vectorizer.get_feature_names()
features = vectorizer.get_feature_names_out()

In [None]:
print(bag_of_words[0])

This format looks a bit weird, but it's showing the `(row, column)` and corresponding value associated with it. Everything else is a `0`. Let's take a look at the column names.

In [None]:
features[0:10]

Now that we have our bag of words, we can start using it for models such as Latent Dirichlet Allocation.

### Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a statistical model that generates groups based on similarities. This is an example of an **unsupervised machine learning model**. That is, we don't have any sort of outcome variable -- we're just trying to group the abstracts into rough categories.

Let's try fitting an LDA model. The way we do it is very similar to the models we've fit before from `sklearn`. We first create a `LatentDirichletAllocation` object, then fit it using our corpus bag of words.

In [None]:
lda = LatentDirichletAllocation(learning_method='online') 

doctopic = lda.fit_transform( bag_of_words )

In [None]:
ls_keywords = []
for i,topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:5]
    keywords = ', '.join( features[i] for i in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)

This doesn't look very helpful! There are way too many common words in the corpus, such as 'a', 'of', and so on. We need to remove them, because they don't actually have any interesting information about the documents.

### Removing meaningless text - Stopwords

Stopwords are words that are found commonly throughout a text and carry little semantic meaning. Examples of common stopwords are prepositions ("to", "on", "in"), articles ("the", "an", "a"), conjunctions ("and", "or", "but") and common nouns. For example, the words *the* and *of* are totally ubiquitous, so they won't serve as meaningful features, whether to distinguish documents from each other or to tell what a given document is about. You may also run into words that you want to remove based on where you obtained your corpus of text or what it's about. There are many lists of common stopwords available for you to use, both for general documents and for specific contexts, so you don't have to start from scratch.   
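The filtering step itself is simple. The sketch below uses a tiny hand-written stop list and a hypothetical helper `remove_stopwords` purely for illustration; real stopword lists are much longer.

```python
# A tiny hand-written stop list, for illustration only.
toy_stopwords = {"the", "of", "a", "an", "and", "or", "to", "in", "on"}

def remove_stopwords(tokens, stopword_set):
    """Keep only the tokens that are not in the stop list (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stopword_set]

sentence = "The recovery of protein in the leaves of a plant"
filtered = remove_stopwords(sentence.split(), toy_stopwords)
print(filtered)
```

Storing the stop list as a set keeps each membership check fast, which matters once the corpus grows.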

We can eliminate stopwords by checking all the words in our corpus against a list of commonly occurring stopwords that comes with NLTK.

In [None]:
# Download most current stopwords
download('stopwords')

In [None]:
stop = stopwords.words('english')
stop[0:10]

In [None]:
# Tokenize stop words to match
eng_stopwords = [tokenize(s)[0] for s in stop]

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # unit of features are single words rather than phrases of words 
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,1), # single words only (n-gram sizes start at 1)
                            strip_accents='unicode',
                            stop_words=eng_stopwords,
                            min_df = 0.05,
                            max_df = 0.95)

# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words 
# scikit-learn 1.0 and later; older versions use vectorizer.get_feature_names()
features = vectorizer.get_feature_names_out()

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 
doctopic = lda.fit_transform( bag_of_words )

# Displaying the top keywords in each topic
ls_keywords = []
for i,topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:5]
    keywords = ', '.join( features[i] for i in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)

<font color = 'red'>**Question 3. What do the topics of the abstracts appear to be? Are some categories harder to distinguish than others?**</font>

*Your answer here.*

### N-grams - Adding context by creating N-grams

Obviously, reducing a document to a bag of words means losing much of its meaning - we put words in certain orders, and group words together in phrases and sentences, precisely to give them more meaning. If you follow the processing steps we've gone through so far, splitting your document into individual words and then removing stopwords, you'll completely lose all phrases like "kick the bucket," "commander in chief," or "sleeps with the fishes." 

One way to address this is to break down each document similarly, but rather than treating each word as an individual unit, treat each group of 2 words, or 3 words, or *n* words, as a unit. We call this a "bag of *n*-grams," where *n* is the number of words in each chunk. Then you can analyze which groups of words commonly occur together (in a fixed order). 
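As a quick sketch of what this means, the hypothetical helper below (`ngrams` is a name made up for this example) slides a window of *n* words across a list of tokens.

```python
def ngrams(tokens, n):
    """Return the n-grams (as space-joined strings) in order of appearance."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "commander in chief of the army".split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
print(bigrams)

# A bag of 1- and 2-grams simply pools both lists.
bag = unigrams + bigrams
print(len(bag))
```

A document with *k* tokens yields *k - n + 1* n-grams, so adding bigrams roughly doubles the number of candidate features before any frequency filtering.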

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # build features from words (single words and phrases of words)
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,2), # Allow for single words and bigrams
                            strip_accents='unicode',
                            stop_words=eng_stopwords,
                            min_df = 0.05,
                            max_df = 0.95)

# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words 
# scikit-learn 1.0 and later; older versions use vectorizer.get_feature_names()
features = vectorizer.get_feature_names_out()

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 
doctopic = lda.fit_transform( bag_of_words )

# Displaying the top keywords in each topic
ls_keywords = []
for i,topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:10]
    keywords = ', '.join( features[i] for i in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)

### TF-IDF - Weighting terms based on frequency

A final step in cleaning and processing our text data is **Term Frequency-Inverse Document Frequency (TF-IDF)**. TF-IDF is based on the idea that the words (or terms) that are most related to a certain topic will occur frequently in documents on that topic, and infrequently in unrelated documents.  TF-IDF re-weights words so that we emphasize words that are unique to a document and suppress words that are common throughout the corpus by inversely weighting terms based on their frequency within the document and across the corpus.
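To see the re-weighting at work, here is a hand-computed sketch on made-up counts. It assumes a smoothed inverse document frequency of ln((1 + n) / (1 + df)) + 1 and a sublinear term frequency of 1 + ln(tf), with no normalization; to our understanding these correspond to the `smooth_idf = True`, `sublinear_tf = True`, and `norm = None` options of `TfidfTransformer`, but treat that correspondence as an assumption and check the `sklearn` documentation for the exact definitions.

```python
import math

# Hypothetical term counts: one dictionary per document (term -> count).
doc_counts = [
    {"protein": 3, "plant": 1, "system": 1},
    {"sensor": 2, "system": 1},
    {"signal": 4, "system": 2},
]
n_docs = len(doc_counts)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(1 for counts in doc_counts if term in counts)

def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + document_frequency(term))) + 1

def tfidf(term, counts):
    """Sublinear term frequency (1 + ln(tf)) times smoothed idf; 0 if absent."""
    tf = counts.get(term, 0)
    if tf == 0:
        return 0.0
    return (1 + math.log(tf)) * smoothed_idf(term)

first_doc = doc_counts[0]
for term in ("protein", "system"):
    print(term, round(tfidf(term, first_doc), 3))
```

In this toy corpus "system" appears in every document, so its weight stays low, while "protein" appears in only one document and is emphasized there.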

Let's look at how to use TF-IDF:

In [None]:
stop = stopwords.words('english') + ['invent', 'produce', 'method', 'use', 'first', 'second']
full_stopwords = [tokenize(s)[0] for s in stop]

In [None]:
vectorizer = CountVectorizer(analyzer= "word", # build features from words (single words and phrases of words)
                            tokenizer=tokenize, # function to create tokens
                            ngram_range=(1,2), # Allow for single words and bigrams
                            strip_accents='unicode',
                            stop_words=full_stopwords,
                            min_df = 0.05,
                            max_df = 0.95)
# Creating bag of words
bag_of_words = vectorizer.fit_transform(abstracts) # transform our corpus into a bag of words 
# scikit-learn 1.0 and later; older versions use vectorizer.get_feature_names()
features = vectorizer.get_feature_names_out()

# Use TfidfTransformer to re-weight bag of words 
transformer = TfidfTransformer(norm = None, smooth_idf = True, sublinear_tf = True)
tfidf = transformer.fit_transform(bag_of_words)

# Fitting LDA model
lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 
doctopic = lda.fit_transform(tfidf)

# Displaying the top keywords in each topic
ls_keywords = []
for i,topic in enumerate(lda.components_):
    word_idx = np.argsort(topic)[::-1][:5]
    keywords = ', '.join( features[i] for i in word_idx)
    ls_keywords.append(keywords)
    print(i, keywords)

We can also put this all together in a Table in order to see it more clearly.

In [None]:
topics = Table().with_columns('Patent Title', patents.column('patent_title'),
                             ls_keywords[0],doctopic[:,0],
                             ls_keywords[1],doctopic[:,1],
                             ls_keywords[2],doctopic[:,2],
                             ls_keywords[3],doctopic[:,3],
                             ls_keywords[4],doctopic[:,4],)

In [None]:
topics.show(5)

<font color = 'red'>**Question 4. Try editing the above code to show a different number of topics and keywords for each topic.**</font>

To change the number of topics, you can adjust the following line:

    lda = LatentDirichletAllocation(n_components = 5, learning_method='online') 

and change the `n_components` value to the number of topics that you want.

To change the number of keywords that are displayed for each topic, change the following line:

    word_idx = np.argsort(topic)[::-1][:5]

by adjusting the `5` to the number of keywords you want to show.

# Supervised Learning: Document Classification

Previously, we used topic modeling to infer relationships between the patent abstracts in our data. That is an example of unsupervised learning: we were looking to uncover structure in the form of topics, or groups of patents, but we did not necessarily know the ground truth of how many groups we should find or which patents belonged in which group.  

We can also do supervised learning with text data. In supervised learning, we have a *known* outcome or label (*Y*) that we want to produce given some data (*X*), and in general, we want to be able to produce this *Y* when we *don't* know it, or when we *only* have *X*. 

In order to produce labels, we first need examples our algorithm can learn from: a "training set." In the context of text analysis, developing a training set can be very expensive, as it can require a large amount of human labor or linguistic expertise. **Document classification** is an example of supervised learning in which we want to assign a label (*Y*) to each of our documents based on its contents (*X*). A common example of document classification is spam e-mail detection. Another example of supervised learning in text analysis is *sentiment analysis*, where *X* is our documents and *Y* is the state of the author. This "state" depends on the question you're trying to answer, and can range from the author being happy or unhappy with a product to the author being politically conservative or liberal. Another example is *part-of-speech tagging*, where *X* is the individual words and *Y* is the part of speech of each word. 
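To give a flavor of document classification, here is a minimal sketch of a naive Bayes spam detector written with only the standard library. The four training messages, the labels, and the `predict` function are invented for illustration; a real classifier would need a far larger training set and would normally use a library implementation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training set: (document, label) pairs.
training = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("meeting agenda for the lab", "not_spam"),
    ("notes from the lab meeting", "not_spam"),
]

word_counts = defaultdict(Counter)  # label -> word counts (learned from X)
label_counts = Counter()            # label -> number of documents (learned from Y)
for text, label in training:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def predict(text):
    """Return the label with the highest log-probability under naive Bayes."""
    scores = {}
    for label in label_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(training))  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

print(predict("claim your free prize"))
print(predict("agenda for the meeting"))
```

The training pairs play the role of known *X* and *Y*; `predict` then produces a *Y* for a new document where we only have *X*.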

# Further Resources


A great resource for NLP in Python is 
[Natural Language Processing with Python](https://www.amazon.com/Natural-Language-Processing-Python-Analyzing/dp/0596516495).