# Using natural language processing to analyze the abstracts from a scientific conference

### Imports

In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
import nltk
import random
import re
import os
import codecs
import mpld3
from sklearn import feature_extraction

## Step 1: Extract the text of the abstracts at the conference
I think a better solution is to copy everything into a text file, and then follow a simple chain of logic to parse it.

- Every ID number comes IN ORDER in the file.
- Every ID number comes after a new line and is followed by a space.
- So if we scan line by line and increment the ID we are looking for, it is really unlikely that we will run into any conflicts (e.g. the next ID is 100 and "100" appears in the text of abstract 99, causing it to be split).
- Finally, we want to omit the author line for each abstract. This will work since the author line and the text line are separate.

In [2]:
with open("dros_2019_data/Dros19_Abstracts_Full_v2_hard-Copy.txt", 'r') as f:
    
    abstracts = {}
    idfind = 2
    record = ''
    
    for line in f:
        line = line.rstrip()
        if (str(idfind + 1) + ' ') in line:
            abstracts[idfind] = record
            record = ''
            idfind += 1
        record += line
            
print(len(abstracts), 'abstracts')

844 abstracts
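
The scanning logic listed above can be sketched on a small in-memory sample. This is only an illustration, not the notebook's exact code: the sample lines and the `split_by_id` helper are hypothetical, and it assumes the stricter rule that the ID must start the line (the cell above accepts the ID anywhere in the line). It also stores the final record, which has no following ID to trigger a save.

```python
def split_by_id(lines, first_id):
    """Split lines into records keyed by sequential ID numbers.

    A new record starts when a line begins with the next expected ID
    followed by a space. (Assumption: IDs appear in order, one per record.)
    """
    records = {}
    current_id = None
    expected = first_id
    buffer = []

    for raw in lines:
        line = raw.rstrip()
        if line.startswith(str(expected) + ' '):
            # close the previous record before starting a new one
            if current_id is not None:
                records[current_id] = ' '.join(buffer)
            current_id = expected
            expected += 1
            buffer = [line]
        elif current_id is not None:
            buffer.append(line)

    # store the last record, which has no following ID to trigger a save
    if current_id is not None:
        records[current_id] = ' '.join(buffer)
    return records


# hypothetical sample: the text of record 1 mentions "2 " mid-line
sample = [
    "1 First title",
    "Author A, Some University",
    "We tested 2 alleles in flies.",
    "2 Second title",
    "Author B, Another University",
    "Loss of function was lethal.",
]

records = split_by_id(sample, first_id=1)
print(len(records), 'records')
```

Because the match is anchored to the start of the line, the "2 " inside the first record's text does not cause a split.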


## Step 2: Web Scraping the Dros 2019 abstract assignment data
The data for abstracts at this conference comes in two parts. First, there is the PDF that has all the abstracts in it. However, that PDF doesn't have the abstract assignment (poster or talk). For the assignment, we need the conference web page: http://conferences.genetics-gsa.org/drosophila/2019/assignments  
  
The assignments are in one large table that is, thankfully, continuous over the whole page. The table has the presenter's name, partial abstract title, session assignment, program #, session date, and presentation time. We need the session assignment and the program number. The number will allow us to connect each assignment to the correct abstract in the program.  
  
I've saved the web page in this folder in case it changes or gets removed. At this point it's not going to change anyway.

In [3]:
# load the saved page into BeautifulSoup and tell it which parser to use
soup = BeautifulSoup(open("dros_2019_data/abstract_lookup.html"), "html.parser")

# make a pandas data frame from the table on the website
ab_table = soup.find(id="authassign-All")
abstracts_df = pd.read_html(str(ab_table))[0]

# group the assignments together
assignments = {}

for idx, num in enumerate(abstracts_df['Program #'].values):
    if "P" not in num:
  
        if "Poster" in abstracts_df['Session Assignment'].values[idx]:
            assignments[num] = {
                'type' : 'poster',
                'section' : abstracts_df['Session Assignment'].values[idx].split('Poster')[1].lower()
            }

        if "Plenary" in abstracts_df['Session Assignment'].values[idx]:
            assignments[num] = {
                'type' : 'talk',
                'section' : abstracts_df['Session Assignment'].values[idx][7:].lower()
            }

        if "Platform" in abstracts_df['Session Assignment'].values[idx]:
            assignments[num] = {
                'type' : 'talk',
                'section' : abstracts_df['Session Assignment'].values[idx].split('Platform')[1].lower()
            }

sections = set()
for assign in assignments:
    sections.add(assignments[assign]['section'])
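
The branching above boils down to mapping a session string to a presentation type and a section name. Here is a minimal sketch of that mapping as a standalone function. `parse_assignment` and the example strings are hypothetical (the real table's exact wording and separators are not shown here), and it makes two assumptions that the cell above does not: it stops at the first matching keyword, and it strips whitespace and punctuation from the section.

```python
def parse_assignment(session):
    """Return {'type', 'section'} for a session string, or None if unrecognized."""
    # keyword found in the string -> presentation type
    kinds = [('Poster', 'poster'), ('Plenary', 'talk'), ('Platform', 'talk')]
    for keyword, kind in kinds:
        if keyword in session:
            # keep only the text after the keyword, cleaned up
            section = session.split(keyword, 1)[1]
            return {'type': kind, 'section': section.strip(' :-').lower()}
    return None


# hypothetical example strings
print(parse_assignment('Poster: Evolution'))
print(parse_assignment('Platform: Signal Transduction'))
print(parse_assignment('Plenary Session I'))
print(parse_assignment('Workshop'))
```

Returning `None` for unrecognized strings makes it easy to list any rows that fall outside the three expected kinds.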

There are many sections, and several fall into the same general category of study. So, I've made these group categories to narrow down the general area being studied. Later we can try plotting by sections and by groups.

In [4]:
section_groups = {
    'cell biology: cytoskeleton, organelles and trafficking' : 'cell bio',
    'cell biology: cytoskeleton, organelles, trafficking' : 'cell bio',
    'cell death and cell stress' : 'cell bio',
    'cell division and cell growth' : 'cell bio',
    'cell division and growth control' : 'cell bio',
    'cell stress and cell death' : 'cell bio',
    'chromatin, epigenetics and genomics' : 'gene expression',
    'chromatin, epigenetics and genomics ii' : 'gene expression',
    'education ' : 'education',
    'educational initiatives' : 'education',
    'evolution' : 'evolution',
    'evolution i' : 'evolution',
    'evolution ii' : 'evolution',
    'immunity and the microbiome' : 'models of human disease',
    'models of human disease' : 'models of human disease',
    'models of human disease i' : 'models of human disease',
    'models of human disease ii' : 'models of human disease',
    'neural circuits and behavior' : 'neurobiology',
    'neural circuits and behavior i' : 'neurobiology',
    'neural development and physiology' : 'neurobiology',
    'neural development and physiology ii/neural circuits and behavior ii'  : 'neurobiology',
    'patterning, morphogenesis and organogenesis' : 'developmental biology',
    'patterning, morphogenesis and organogenesis i' : 'developmental biology',
    'patterning, morphogenesis and organogenesis ii' : 'developmental biology',
    'physiology, metabolism and aging' : 'developmental biology',
    'physiology, metabolism and aging i' : 'developmental biology',
    'physiology, metabolism and aging ii' : 'developmental biology',
    'plenary ii' : 'plenary',
    'plenary session i' : 'plenary',
    'regulation of gene expression' : 'gene expression',
    'regulation of gene expression i' : 'gene expression',
    'regulation of gene expression ii/ chromatin, epigenetics and genomics i' : 'gene expression',
    'reproduction and gametogenesis'  : 'developmental biology',
    'signal transduction'  : 'developmental biology',
    'stem cells, regeneration and tissue injury'  : 'developmental biology',
    'techniques & technology' : 'education'    
}

In [5]:
groups = set()
for assign in assignments:
    assignments[assign]['group'] = section_groups[assignments[assign]['section']]
    groups.add(assignments[assign]['group'])
    
print(len(assignments), 'assignments')



844 assignments


It appears that there are 844 abstracts and 844 assignments! 
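
Equal counts do not by themselves show that the two collections describe the same abstracts. One quick check is to compare the key sets directly. The sketch below uses small hypothetical dictionaries in place of `abstracts` (integer keys) and `assignments` (string keys); `compare_ids` is an illustrative helper, not part of the notebook.

```python
def compare_ids(abstracts, assignments):
    """Return (ids only in abstracts, ids only in assignments) as sorted lists."""
    abstract_ids = {str(k) for k in abstracts}   # normalize int keys to str
    assignment_ids = set(assignments)
    only_abstracts = sorted(abstract_ids - assignment_ids)
    only_assignments = sorted(assignment_ids - abstract_ids)
    return only_abstracts, only_assignments


# hypothetical toy data: same size, different IDs
toy_abstracts = {2: 'text a', 3: 'text b', 4: 'text c'}
toy_assignments = {
    '3': {'type': 'talk'},
    '4': {'type': 'poster'},
    '5': {'type': 'poster'},
}

missing_assignment, missing_abstract = compare_ids(toy_abstracts, toy_assignments)
print('no assignment:', missing_assignment)
print('no abstract:', missing_abstract)
```

A check like this seems worth running here, since the classifier step later builds 842 feature sets from the 844 abstracts, which suggests a couple of IDs may not line up.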

## Step 3: Some basic NLP investigation

First, we need to turn the abstracts into objects that NLTK can use.

In [6]:
with open("dros_2019_data/Dros19_Abstracts_Full_v2_hard-Copy.txt", 'r') as f:
    # split on whitespace to get a flat list of tokens
    text = f.read().split()
full = nltk.Text(text)


In [7]:
# check that it works by looking for a word
full.concordance('demonstrates')

Displaying 14 of 14 matches:
del of Huntington’s disease that demonstrates “prion-like” spreading of mutant
ion of mutations that cause bias demonstrates that niche competition is geneti
work for GWAS power analysis and demonstrates the feasibility of the Hybrid Sw
erimental design and methodology demonstrates a novel technique for characteri
a virus infected D. melanogaster demonstrates the presence of Vago, Vir-1, and
wever, our work mutating H3.2K36 demonstrates that this modification is neithe
ced neuromuscular function, this demonstrates that the VHD domain is critical 
 cohabitation. This cohabitation demonstrates the evolution of linguistic vari
tivation of different body parts demonstrates that changing the amount of sens
through utilization of our assay demonstrates that three distinct responses ar
tested in Drosophila. This study demonstrates the value of Drosophila in funct
d lung tumours. Preliminary data demonstrates that synthetic lethal screening 
rotein. Taken together,

In [8]:
# words that appear in contexts similar to 'demonstrates'
full.similar('demonstrates')

suggests indicates and showed in is indicate demonstrated demonstrate
shows of how at for suggest identified with on via neuronal


In [9]:
# remove stopwords
stopwords = nltk.corpus.stopwords.words('english')
frequency_dist = nltk.FreqDist(full)
print(frequency_dist)

# check the 200 most common words and keep all that are not stop words
not_stops = []
for word in frequency_dist.most_common(200):
    if (str.isalpha(word[0])) and (word[0].lower() not in stopwords):
        not_stops.append(word)

# and print the 50 most common words
print(not_stops[:50])

<FreqDist with 32401 samples and 257957 outcomes>
[('Drosophila', 1369), ('cell', 925), ('gene', 666), ('cells', 661), ('expression', 639), ('University', 622), ('genes', 608), ('role', 522), ('Department', 496), ('also', 492), ('protein', 490), ('found', 439), ('flies', 421), ('genetic', 419), ('function', 407), ('results', 371), ('signaling', 369), ('proteins', 364), ('using', 342), ('identified', 332), ('two', 319), ('required', 304), ('model', 302), ('mutant', 292), ('neurons', 292), ('may', 283), ('show', 275), ('different', 273), ('melanogaster', 260), ('data', 258), ('fly', 257), ('loss', 253), ('human', 252), ('mechanisms', 247), ('specific', 245), ('complex', 243), ('development', 242), ('activity', 241), ('known', 235), ('studies', 234), ('changes', 233), ('regulation', 232), ('transcription', 229), ('levels', 229), ('within', 227), ('study', 227), ('identify', 227), ('including', 224), ('associated', 221), ('response', 221)]


This is actually kind of cool. "Drosophila" comes out on top, which kind of makes sense. "University" and "Department" probably come from the author line.  
  
OK, first question. Split the data into two sets - one that got talks, and one that got posters, and see what the top 50 words are.

In [22]:
talks = ''
posters = ''
for x in abstracts:
    if str(x) in assignments:
        if assignments[str(x)]['type'] == 'talk':
            talks += abstracts[x] + ' '
        if assignments[str(x)]['type'] == 'poster':
            posters += abstracts[x] + ' '

In [37]:
def top_fifty(corpus):
    frequency_dist = nltk.FreqDist(corpus)

    # check the 200 most common words and keep all that are not stop words
    not_stops = []
    for word in frequency_dist.most_common(200):
        if (str.isalpha(word[0])) and (word[0].lower() not in stopwords):
            not_stops.append(word)

    # and return the 50 most common words
    return [x[0] for x in not_stops][:50]

talks1 = nltk.Text(talks.split())
posters1 = nltk.Text(posters.split())

print('Talks')
top50_talks = set(top_fifty(talks1))
print(top50_talks)
print('\n')

print('Posters')
top50_posters = set(top_fifty(posters1))
print(top50_posters)
print('\n')

Talks
{'regulation', 'results', 'University', 'mutant', 'DNA', 'also', 'data', 'protein', 'using', 'neurons', 'changes', 'suggest', 'signaling', 'cells', 'activity', 'found', 'specific', 'transcription', 'genetic', 'gene', 'human', 'flies', 'levels', 'pathway', 'factor', 'required', 'Drosophila', 'response', 'function', 'model', 'conserved', 'complex', 'Medical', 'cell', 'show', 'stem', 'identified', 'adult', 'expression', 'fly', 'Department', 'mechanisms', 'proteins', 'early', 'genes', 'growth', 'two', 'loss', 'different', 'role'}


Posters
{'results', 'University', 'mutant', 'also', 'data', 'protein', 'using', 'neurons', 'studies', 'suggest', 'signaling', 'cells', 'activity', 'found', 'specific', 'used', 'within', 'genetic', 'known', 'gene', 'human', 'identify', 'may', 'flies', 'levels', 'required', 'Drosophila', 'response', 'function', 'model', 'development', 'complex', 'cell', 'show', 'associated', 'identified', 'expression', 'fly', 'Department', 'mechanisms', 'including', 'protein

At first glance, they look pretty similar. Let's use some set operations to see what the intersection and differences are.

In [34]:
both = top50_talks & top50_posters
talks_only = top50_talks - top50_posters
posters_only = top50_posters - top50_talks

print('Common to both', both, '\n')
print('More common in talks', talks_only, '\n')
print('More common in posters', posters_only, '\n')

Common to both {'results', 'University', 'mutant', 'also', 'data', 'protein', 'using', 'neurons', 'suggest', 'signaling', 'cells', 'activity', 'found', 'specific', 'genetic', 'gene', 'human', 'flies', 'levels', 'required', 'Drosophila', 'response', 'function', 'model', 'complex', 'cell', 'show', 'identified', 'expression', 'fly', 'Department', 'mechanisms', 'proteins', 'genes', 'two', 'different', 'loss', 'role'} 

More common in talks {'regulation', 'pathway', 'factor', 'early', 'conserved', 'growth', 'DNA', 'transcription', 'Medical', 'stem', 'changes', 'adult'} 

More common in posters {'including', 'well', 'associated', 'development', 'melanogaster', 'used', 'within', 'known', 'study', 'identify', 'may', 'studies'} 
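
Set differences on two top-50 lists only show words that fall just inside one list and just outside the other; they say nothing about how large the gap is. One alternative, sketched here under the assumption that relative frequency is a reasonable measure, is to compare each word's share of the tokens in the two corpora. The function name and the toy token lists are hypothetical.

```python
from collections import Counter


def frequency_ratio(tokens_a, tokens_b, min_count=2):
    """Ratio of relative frequencies (a / b) for words found in both corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = len(tokens_a), len(tokens_b)
    ratios = {}
    for word in counts_a.keys() & counts_b.keys():
        # skip rare words, whose ratios are unstable
        if counts_a[word] + counts_b[word] < min_count:
            continue
        share_a = counts_a[word] / total_a
        share_b = counts_b[word] / total_b
        ratios[word] = share_a / share_b
    return ratios


# hypothetical toy corpora
talk_tokens = 'stem cell growth pathway cell gene'.split()
poster_tokens = 'gene study gene cell known study cell gene'.split()

ratios = frequency_ratio(talk_tokens, poster_tokens)
for word, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(ratio, 2))
```

A ratio above 1 means the word takes up a larger share of the first corpus; words that appear in only one corpus are left out of this sketch.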



## Step 4: Basic classifier
NLTK contains a basic naive Bayes classifier. We can use that to see which words favor talks and which favor posters.

In [53]:
documents = []
talk_num = 0
poster_num = 0
for x in abstracts:
    if str(x) in assignments:
        if assignments[str(x)]['type'] == 'talk':
            documents.append((abstracts[x].split(),'talk'))
            talk_num += 1
        if assignments[str(x)]['type'] == 'poster':
            documents.append((abstracts[x].split(),'poster'))
            poster_num += 1
total = poster_num + talk_num

In [42]:
frequency_dist = nltk.FreqDist(full)

# check the 200 most common words and keep all that are not stop words
filtered_words = []
for word in frequency_dist.most_common(200):
    if (str.isalpha(word[0])) and (word[0].lower() not in stopwords):
        filtered_words.append(word)
        
word_features =  [word_tuple[0] for word_tuple in filtered_words]

In [51]:
def document_features(document):    
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

featuresets = [(document_features(d), c) for (d,c) in documents]
random.shuffle(featuresets)
print(len(featuresets))

842


In [56]:
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(poster_num/total)
print(nltk.classify.accuracy(classifier, test_set))


classifier.show_most_informative_features(20)

0.8111638954869359
0.82
Most Informative Features
  contains(melanogaster) = True           poster : talk   =      2.5 : 1.0
     contains(determine) = True           poster : talk   =      2.0 : 1.0
       contains(nuclear) = True             talk : poster =      1.8 : 1.0
    contains(previously) = True           poster : talk   =      1.8 : 1.0
      contains(neuronal) = True             talk : poster =      1.8 : 1.0
      contains(critical) = True             talk : poster =      1.7 : 1.0
       contains(binding) = True             talk : poster =      1.6 : 1.0
contains(transcriptional) = True             talk : poster =      1.6 : 1.0
        contains(factor) = True             talk : poster =      1.6 : 1.0
       contains(neurons) = True             talk : poster =      1.6 : 1.0
          contains(find) = True             talk : poster =      1.5 : 1.0
           contains(DNA) = True             talk : poster =      1.5 : 1.0
       contains(defects) = True           poster 

It looks like NLTK is doing a good job of predicting talks and posters, but on closer inspection it is barely beating the ratio of posters to total submissions: an accuracy of 0.82 on the 100-abstract test set, against 0.81 for simply guessing "poster" every time. That is a gain of about one percentage point at best, so the classifier's word preferences are not much of a guide to getting a talk.
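
The comparison with the poster ratio can be made explicit. This sketch uses the two numbers printed by the cell above and the test-set size of 100. Judging the gap with a simple binomial standard error is my assumption, not part of the original analysis, and the baseline here is the poster fraction over all documents rather than over the test set alone.

```python
import math

poster_fraction = 0.8111638954869359   # printed above: poster_num / total
accuracy = 0.82                        # printed above: accuracy on the test set
n_test = 100                           # size of test_set

# always guessing the majority class ("poster") is the baseline to beat
baseline = max(poster_fraction, 1 - poster_fraction)
gain = accuracy - baseline

# binomial standard error of an accuracy measured on n_test examples
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)

print('baseline accuracy: {:.3f}'.format(baseline))
print('gain over baseline: {:.3f}'.format(gain))
print('standard error: {:.3f}'.format(std_err))
```

The gain comes out smaller than one standard error, so a test set of this size cannot really distinguish the classifier from the always-poster baseline.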

## Step 5: Clustering the abstracts
I'd like to make an unsupervised clustering of the abstracts to answer some questions:  
1) Do the abstracts cluster based on section or super-section?  
2) Are there clusters that are not represented by the sections?  
3) Are there individuals who maybe submitted their abstract to the wrong section?  
  
I'm going to start by following a tutorial that I found: http://brandonrose.org/clustering

### Prep
We need several lists for this analysis.  
1) A list of the abstract numbers  
2) A list of the abstracts  
3) A list of the categories / super-categories

In [None]:
numbers = []
texts = []
categories = []
for ab in abstracts:
    