# Web Scraping the Dros 2019 Abstract data
The data for abstracts at this conference comes in two parts. The first is the PDF, which contains the full text of every abstract but not its assignment (poster or talk). For the assignment, we need the conference web page: http://conferences.genetics-gsa.org/drosophila/2019/assignments  
  
The assignments are in one large table that, thankfully, runs unbroken over the whole page. The table has the presenter's name, partial abstract title, session assignment, program #, session dates, and presentation time. We need the session assignment and the program number. The program number lets us connect each assignment to the correct abstract in the program.  
  
I've saved a copy of the web page in this folder in case the live page changes or gets removed, although at this point it is unlikely to change.
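
Because the program number is the shared key, linking the two sources comes down to a dictionary lookup. Here is a minimal sketch of that join, assuming both sources end up keyed by program number as strings. The names `abstract_texts` and `assignments_demo` and the placeholder texts are hypothetical stand-ins, not data from the conference.

```python
# Hypothetical stand-ins: `abstract_texts` would be built from the PDF,
# `assignments_demo` from the assignments table. Keys are program numbers.
abstract_texts = {"643": "<abstract text 643>", "155": "<abstract text 155>"}
assignments_demo = {"643": "poster", "155": "talk", "825": "poster"}

# Join on the shared program number.
joined = {
    num: (kind, abstract_texts[num])
    for num, kind in assignments_demo.items()
    if num in abstract_texts
}

# Program numbers that have an assignment but no abstract text.
missing = sorted(set(assignments_demo) - set(abstract_texts))

print(joined)
print(missing)
```

Listing the unmatched numbers separately makes it easy to spot program numbers that were parsed differently in the two sources.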

In [6]:
from bs4 import BeautifulSoup
import pandas as pd

In [2]:
# parse the saved page; the second argument tells BeautifulSoup which parser to use
with open("dros_2019_data/abstract_lookup.html") as f:
    soup = BeautifulSoup(f, "html.parser")
# the output mirrors the contents of the html file
soup

<!DOCTYPE html>

<html lang="en">
<!-- BC_OBNW -->
<head>
<title>ABSTRACTS: Assignments Lookup</title>
<link href="/StyleSheets/ModuleStyleSheets.css" rel="StyleSheet" type="text/css"/>
<script type="text/javascript">var jslang='EN';</script>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0" name="viewport">
<title>ABSTRACTS: Assignments Lookup</title>
<link href="/favicon.png" rel="shortcut icon"/>
<link href="/home_badge_64.png" rel="apple-touch-icon"/>
<link href="/home_badge_64.png" rel="apple-touch-icon-precomposed"/>
<link href="/_System/apps/thr-bootstrap/public/stylesheets.min.css" rel="stylesheet"/>
<!--<link rel="stylesheet" href="/_assets/css/bootstrap.min.css">-->
<link href="/_assets/css/jasny-bootstrap.min.css" rel="stylesheet"/>
<link href="/_assets/css/font-awesome.min.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/css?family=Poppins:300,400,500,600,70

In [7]:
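from io import StringIO
# the assignments table is the element with id "authassign-All"
ab_table = soup.find(id="authassign-All")
# read_html returns a list of DataFrames; StringIO avoids the literal-HTML deprecation in newer pandas
abstracts = pd.read_html(StringIO(str(ab_table)))[0]
abstracts.head()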

Unnamed: 0,Presenter's Name,PartialAbstract Title,Session Assignment,Program #,Session Dates,Presentation Time
0,"Abdal-Rhida, Muna",Transcriptional regulation of robo2 in the ...,PosterNeural development and physiology,643,Thursday - SaturdayMarch 28 - 30,"Thursday, March 283:00 - 4:00 PMFriday, March ..."
1,"Abdalla, Jasmina",Epigenetic Inheritance of Alcohol ...,"PosterPhysiology, metabolism and aging",634,Thursday - SaturdayMarch 28 - 30,"Thursday, March 282:00 - 3:00 PMFriday, March ..."
2,"Abdalla, Jasmina",Epigenetic Inheritance of Alcohol ...,"Poster PreviewPhysiology, Metabolism and Aging II",634P,"Friday, March 29",6:21 PM-6:23 PM
3,"Abdulazeez, Rashidatu",Introducing the fruit fly as a powerful ...,PosterEducational Initiatives,825,Thursday - SaturdayMarch 28 - 30,"Thursday, March 283:00 - 4:00 PMFriday, March ..."
4,"Abrams, John M.",p53 genes and the game of transposons.,PlenaryPlenary II,155,"Sunday, March 31",8:30 AM-9:00 AM


In [33]:
assignments = {}

# 'Session Assignment' fuses the session type and the section name,
# e.g. "PosterNeural development and physiology"
for num, session in zip(abstracts['Program #'], abstracts['Session Assignment']):
    # skip program numbers containing "P" (e.g. "634P", a poster preview)
    if "P" in num:
        continue

    if "Poster" in session:
        assignments[num] = {
            'type' : 'poster',
            'section' : session.split('Poster')[1].lower()
        }

    if "Plenary" in session:
        assignments[num] = {
            'type' : 'talk',
            # drop the leading "Plenary" (7 characters)
            'section' : session[7:].lower()
        }

    if "Platform" in session:
        assignments[num] = {
            'type' : 'talk',
            'section' : session.split('Platform')[1].lower()
        }

# collect the distinct section names
sections = set()
for assign in assignments:
    sections.add(assignments[assign]['section'])
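
The loop above identifies the session type with substring checks. A stricter alternative is to match on the start of the label, sketched below under the assumption that every label begins with one of the prefixes the loop looks for, plus "Poster Preview" as seen in the sample rows. The helper `split_session` and the `PREFIXES` list are hypothetical names introduced for illustration.

```python
# Hypothetical helper: split a fused "Session Assignment" value into
# (type, section) by testing known prefixes in a fixed order.
# "Poster Preview" must come before "Poster" so the longer prefix wins.
PREFIXES = [
    ("Poster Preview", None),   # previews are skipped, as in the loop above
    ("Poster", "poster"),
    ("Plenary", "talk"),
    ("Platform", "talk"),
]

def split_session(session):
    """Return (type, section), or None for previews and unknown labels."""
    for prefix, kind in PREFIXES:
        if session.startswith(prefix):
            if kind is None:
                return None
            return kind, session[len(prefix):].lower()
    return None

# Labels copied from the sample rows shown earlier.
examples = [
    "PosterNeural development and physiology",
    "Poster PreviewPhysiology, Metabolism and Aging II",
    "PlenaryPlenary II",
]
for label in examples:
    print(label, "->", split_session(label))
```

Matching on the prefix means a section name that happens to contain the word "Poster" or "Platform" cannot be misread as the session type.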

There are many sections, and several fall into the same general area of study. So I've grouped them into broader categories that describe the general topic being studied. Later we can try plotting by section and by group.

In [30]:
section_groups = {
    'cell biology: cytoskeleton, organelles and trafficking' : 'cell bio',
    'cell biology: cytoskeleton, organelles, trafficking' : 'cell bio',
    'cell death and cell stress' : 'cell bio',
    'cell division and cell growth' : 'cell bio',
    'cell division and growth control' : 'cell bio',
    'cell stress and cell death' : 'cell bio',
    'chromatin, epigenetics and genomics' : 'gene expression',
    'chromatin, epigenetics and genomics ii' : 'gene expression',
    'education ' : 'education',
    'educational initiatives' : 'education',
    'evolution' : 'evolution',
    'evolution i' : 'evolution',
    'evolution ii' : 'evolution',
    'immunity and the microbiome' : 'models of human disease',
    'models of human disease' : 'models of human disease',
    'models of human disease i' : 'models of human disease',
    'models of human disease ii' : 'models of human disease',
    'neural circuits and behavior' : 'neurobiology',
    'neural circuits and behavior i' : 'neurobiology',
    'neural development and physiology' : 'neurobiology',
    'neural development and physiology ii/neural circuits and behavior ii'  : 'neurobiology',
    'patterning, morphogenesis and organogenesis' : 'developmental biology',
    'patterning, morphogenesis and organogenesis i' : 'developmental biology',
    'patterning, morphogenesis and organogenesis ii' : 'developmental biology',
    'physiology, metabolism and aging' : 'developmental biology',
    'physiology, metabolism and aging i' : 'developmental biology',
    'physiology, metabolism and aging ii' : 'developmental biology',
    'plenary ii' : 'plenary',
    'plenary session i' : 'plenary',
    'regulation of gene expression' : 'gene expression',
    'regulation of gene expression i' : 'gene expression',
    'regulation of gene expression ii/ chromatin, epigenetics and genomics i' : 'gene expression',
    'reproduction and gametogenesis'  : 'developmental biology',
    'signal transduction'  : 'developmental biology',
    'stem cells, regeneration and tissue injury'  : 'developmental biology',
    'techniques & technology' : 'education'    
}
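
A hand-written mapping like this has to match the scraped section names exactly, including stray whitespace, or the later lookup raises a `KeyError`. A quick set difference can flag mismatches beforehand. This is a minimal sketch on hypothetical miniature data; `sections_demo` and `section_groups_demo` stand in for the real `sections` and `section_groups`.

```python
# Hypothetical miniature versions of `sections` and `section_groups`.
sections_demo = {"evolution", "evolution i", "signal transduction", "education "}
section_groups_demo = {
    "evolution": "evolution",
    "evolution i": "evolution",
    "signal transduction": "developmental biology",
    "education": "education",  # no trailing space, so it will not match "education "
}

# Section names with no key in the mapping would raise KeyError on lookup.
unmapped = sorted(sections_demo - set(section_groups_demo))
print(unmapped)
```

Running the same check with the real `sections` and `section_groups` should print an empty list when the mapping is complete.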

In [35]:
groups = set()
for assign in assignments:
    assignments[assign]['group'] = section_groups[assignments[assign]['section']]
    groups.add(assignments[assign]['group'])
    
print(len(assignments), 'assignments')

844 assignments
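
As a first step toward plotting by section and by group, the entries can be tallied with `collections.Counter`. This sketch uses three entries taken from the sample rows shown earlier, with groups from the mapping above; `assignments_demo` is a hypothetical stand-in for the full `assignments` dictionary.

```python
from collections import Counter

# Entries shaped like the values stored in `assignments`, taken from the sample rows.
assignments_demo = {
    "643": {"type": "poster", "section": "neural development and physiology",
            "group": "neurobiology"},
    "634": {"type": "poster", "section": "physiology, metabolism and aging",
            "group": "developmental biology"},
    "155": {"type": "talk", "section": "plenary ii", "group": "plenary"},
}

# Count presentations by type, and by (group, type) pair.
by_type = Counter(info["type"] for info in assignments_demo.values())
by_group = Counter((info["group"], info["type"]) for info in assignments_demo.values())

print(by_type)
print(by_group.most_common())
```

The resulting counts can be passed directly to a bar-chart function, since each `Counter` behaves like a dictionary of label to count.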
