In [54]:
from __future__ import print_function
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import urllib.request
import json

api_key = '8d0c5c7411a969ae233ee34d47b1784c'
base_url = "https://api.betterdoctor.com/2016-03-01/doctors?specialty_uid=oncologist&location="
rest_url = ",100&limit=100&user_key=" + api_key

# curated list of lat/long of top-10 cities in USA
nyc = (40.71,-74.00)
sf = (37.77, -122.42)
dc = (38.91, -77.03)
boston = (42.36, -71.06)
seattle = (47.61, -122.33)
chicago = (41.88, -87.63)
austin = (30.27, -97.74)
la = (34.05,-118.24)
denver = (39.74,-104.99)
minn = (44.98, -93.26)

top10cities = [nyc,sf,dc,boston,seattle,chicago,austin,la,denver,minn]
bios = []

for city in top10cities:
    time.sleep(5)
    link = base_url + str(city[0]) + "," + str(city[1]) + rest_url    
    jsonstr = urllib.request.urlopen(link).read().decode("utf8")
    data = json.loads(jsonstr)['data']

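    for doctor in data:
        # Some profiles may lack a bio; skip those rather than raise KeyError.
        bio = doctor.get('profile', {}).get('bio', '')
        # Split each bio into passages at paragraph breaks.
        for s in bio.split(".\n\n"):
            if s.strip():
                bios.append(s)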
            

def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()


# Load the doctor bios, vectorize it. 
data_samples = bios

# Use tf (raw term count) features for LDA.
print("Extracting tf features for LDA...")
tf_vectorizer = CountVectorizer(stop_words='english', max_df=0.95)
tf = tf_vectorizer.fit_transform(data_samples)
lda = LatentDirichletAllocation(n_components=10, max_iter=10,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)
lda.fit(tf)

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names_out()
print_top_words(lda, tf_feature_names, 20) #top 20 words per topic

Extracting tf features for LDA...

Topics in LDA model:
Topic #0:
doctors united states network 20 experience based credentials rated analysis nationwide depth holds distinction spears belzer goldberg chadha slivnick yeh
Topic #1:
oncology medicine patients hematology md licensed internal currently california minnesota medical specializes washington colorado specialist treat practices sees illinois treats
Topic #2:
menomonie osseo bloomer eau ochoa bayona husband claire farm medicina burns nacional nambudiri activity like outdoors rio faculade sul brunstein
Topic #3:
princeton andover red maple wing burnsville wyoming zumbrota ellsworth weisdorf batezini robbinsdale kosmo workman plymouth keralavarma nwaneri fujioka laramie claudio
Topic #4:
medical active history malpractice licenses background check passed screening automated addition license including successfully having holds clear looked elements status
Topic #5:
nebraska subbiah pudunagar thome omaha hoessly gul fairmont prague s

Latent Dirichlet allocation (LDA) is an automatic topic discovery algorithm, first described by David Blei, Andrew Ng, and Michael Jordan in a paper published in the Journal of Machine Learning Research in 2003.
Given a collection of documents, it finds groups of words (topics) that tend to occur together. For example, in a newspaper, the topics might be Entertainment, Sports, Politics, Classifieds...
Each topic would then have a list of its top 20 words.

Suppose we wish to discover what topics describe the world of oncology.

Using the BetterDoctor API, we obtain a set of up to 100 doctors within a 100-mile radius of a given lat/long location. To obtain a geographically spread set of oncologists, we look up the locations of the top-10 cities in the USA. We thus obtain up to 10 × 100 = 1000 doctor profiles. These are extremely rich profiles with insurances, practices, addresses, phone numbers, degrees, etc. We narrow in on the doctor's bio, which describes each doctor succinctly in a paragraph or so. We split each bio into shorter passages (the code splits wherever a period is followed by a blank line). Now we have a large corpus of passages that describe what oncologists do.

We use LDA to extract topics from this corpus.
But how does LDA perform this discovery?

LDA is a bag-of-words model: it ignores word order and looks only at how often each word occurs in a document.
LDA represents documents as mixtures of topics, where each topic emits words with certain probabilities.
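To make "bag-of-words" concrete, here is a minimal sketch using only the Python standard library. The `bag_of_words` helper and the two sample sentences are made up for illustration; the notebook itself relies on `CountVectorizer`, which additionally removes English stop words.

```python
from collections import Counter
import re

def bag_of_words(text):
    """Return word counts for a text, ignoring word order and case."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

# Hypothetical sentences: same words, different order.
a = bag_of_words("Dr. Smith treats cancer patients")
b = bag_of_words("Cancer patients Dr. Smith treats")

print(a == b)        # word order is discarded, so both bags are identical
print(a["cancer"])   # each word maps to the number of times it occurs
```

Because only the counts survive, two texts with the same words in a different order look identical to the model.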
LDA assumes that each document is produced as follows.
Suppose you write a bio. You
a. Decide on the number of words N the bio has, according to a Poisson distribution.
b. Choose a topic mixture for the bio according to a Dirichlet distribution over a fixed set of K topics.
c. Generate each of the N words by first picking a topic from that mixture, then picking a word from that topic's word distribution.
Say the bio has 12 words.

We decide that
1/3 of the bio is about the doctor's credentials,
1/3 about the locations where he practiced, and
1/3 about what kind of diseases he specialises in.

So each topic is picked with probability 1/3 (about 33%) from the multinomial distribution over topics.
Within a topic, if the topic has 4 equally likely words, each word is picked with probability 25%, so any particular word is produced at a given position with probability 1/3 × 1/4 = 1/12.
Thus, each word in the bio is probabilistically generated.
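The generative story above can be sketched in a few lines of standard-library Python. Everything specific here is an assumption for illustration: the three topics and their four words each, the mean length of 12, the symmetric Dirichlet parameter `alpha`, and the helper names. A Dirichlet draw gives a random mixture, so the even 1/3, 1/3, 1/3 split in the text is just one possible outcome.

```python
import math
import random

def sample_poisson(rng, lam):
    """Draw from a Poisson(lam) distribution using Knuth's method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

def sample_dirichlet(rng, alphas):
    """Draw a probability vector from a Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

# Hypothetical topics, each a uniform distribution over 4 words (25% each).
topics = {
    "credentials": ["board", "certified", "licensed", "degree"],
    "locations": ["california", "minnesota", "illinois", "colorado"],
    "diseases": ["leukemia", "lymphoma", "breast", "tumors"],
}

def generate_bio(rng, mean_length=12, alpha=1.0):
    names = list(topics)
    n_words = sample_poisson(rng, mean_length)              # number of words N
    mixture = sample_dirichlet(rng, [alpha] * len(names))   # topic mixture for this bio
    words = []
    for _ in range(n_words):
        topic = rng.choices(names, weights=mixture)[0]      # pick a topic from the mixture
        words.append(rng.choice(topics[topic]))             # pick a word from that topic
    return mixture, words

rng = random.Random(0)
mixture, words = generate_bio(rng)
print([round(p, 2) for p in mixture])
print(" ".join(words))
```

The output is a jumble of words with no grammar, which is exactly what a bag-of-words model expects a document to look like.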

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

Learning

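Now we have a set of bios. We choose a fixed number of topics, say 10, to discover automatically. A common way to learn them is collapsed Gibbs sampling, which estimates the topic mixture of each bio and the word distribution of each of the ten topics. (scikit-learn's LatentDirichletAllocation, used in the code above, fits the model with variational Bayes instead, but the Gibbs procedure is easier to follow and conveys the same idea.)

The sampler goes through each bio and randomly assigns each word in the bio to one of the 10 topics.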

This random assignment already gives you both topic representations of all the bios and word distributions of all the topics (albeit not very good ones).
To improve on them, for each bio b, go through each word w in b, and for each topic t compute two things:
a) p(topic t | bio b) = the proportion of words in b that are currently assigned to topic t,
b) p(word w | topic t) = the proportion of assignments to topic t, over all bios, that come from this word w.

Reassign w to a new topic, choosing topic t with probability proportional to p(topic t | bio b) * p(word w | topic t).

According to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability. We’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated. After repeating the previous step a large number of times, we eventually reach a roughly steady state where all assignments are pretty good. 

So we use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
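The whole procedure (random initial assignment, repeated resampling, final counting) can be sketched as a toy collapsed Gibbs sampler in standard-library Python. This is an illustrative sketch, not what scikit-learn runs: the function name, the four tiny documents, and the smoothing constants `alpha` and `beta` (which keep any probability from being exactly zero) are all assumptions.

```python
import random

def gibbs_lda(docs, n_topics, n_iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA on a list of tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)

    # Count tables: topic counts per document, word counts per topic, totals per topic.
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [{w: 0 for w in vocab} for _ in range(n_topics)]
    topic_total = [0] * n_topics

    # Random initial assignment of every word to a topic.
    assign = []
    for d, doc in enumerate(docs):
        z = [rng.randrange(n_topics) for _ in doc]
        assign.append(z)
        for w, t in zip(doc, z):
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current word's assignment from the counts.
                t = assign[d][i]
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1

                # Weight for topic k is proportional to
                # p(topic k | bio d) * p(word w | topic k), smoothed by alpha and beta.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
                    for k in range(n_topics)
                ]
                t = rng.choices(range(n_topics), weights=weights)[0]

                # Record the new assignment.
                assign[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1

    # Topic mixture per document: proportion of its words assigned to each topic.
    mixtures = [[c / len(doc) for c in counts] for counts, doc in zip(doc_topic, docs)]
    return mixtures, topic_word

# Hypothetical mini-corpus: two "treatment" bios and two "education" bios.
docs = [
    "oncology cancer patients treatment cancer".split(),
    "cancer treatment oncology patients".split(),
    "university degree fellowship university".split(),
    "degree fellowship university school".split(),
]
mixtures, topic_word = gibbs_lda(docs, n_topics=2)
for doc, mix in zip(docs, mixtures):
    print([round(p, 2) for p in mix], " ".join(doc))
```

Each printed row shows a document's estimated topic mixture next to its words. The sketch assumes every document is non-empty; an empty one would need to be filtered out before the final division.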

Let's examine our results:
Topic 1 - ONCOLOGY, STATES
Topic 1:
oncology medicine patients hematology md licensed internal currently california minnesota medical specializes washington colorado specialist treat practices sees illinois treats

Topics 2, 3, 5, 6 - PLACES, PEOPLE
Topic 2:
menomonie osseo bloomer eau ochoa bayona husband claire farm medicina burns nacional nambudiri activity like outdoors rio faculade sul brunstein
Topic 3:
princeton andover red maple wing burnsville wyoming zumbrota ellsworth weisdorf batezini robbinsdale kosmo workman plymouth keralavarma nwaneri fujioka laramie claudio
Topic 5:
nebraska subbiah pudunagar thome omaha hoessly gul fairmont prague stephan papillion regina niguel birbal bhaskar laguna shanmuga monrovia popescu landisville
Topic 6:
jersey new graduate brunswick nevada course combinatorial ryan voorhees chemistry coll med endoscopy asge raj burkhardt cline chippewa plainfield doyle

Topic 0 - CREDENTIALS
doctors united states network 20 experience based credentials rated analysis nationwide depth holds distinction spears belzer goldberg chadha slivnick yeh

Topic 4 - LICENSES
medical active history malpractice licenses background check passed screening automated addition license including successfully having holds clear looked elements status

Topic 7 - SCHOOLS
medical university oncology medicine center hematology degree american cancer school clinical internal received texas college society board completed fellowship graduated

Topic 8 - ANATOMY
surgery surgical iowa mckenzie liver pancreas laparoscopic hepatobiliary surgeon amini filho costa colon invasive minimally ducts robotic bile hagin tumors

Topic 9 - CARE
cancer patients care treatment clinical time research breast trials family children new cancers therapies leukemia malignant development enjoys work patient