## Expression Analysis

Starting from a list of PubMed IDs that was exported from R as a text file and that represents the abstracts in the 'taste bud' category, the abstracts are pulled from the SQLite database, split into sentences, tokenized, stemmed, and filtered for sentences that contain the word stem 'express'.

In [29]:
import sys, numpy, math, sqlite3, re, nltk
from nltk.stem.porter import PorterStemmer

In [189]:
def readpmidsfromfile(filename):
    with open(filename) as f:
        lines = f.read().splitlines()
    return lines

def tokenize(abstracts):
    # First split each abstract into sentences, then each sentence into
    # lower-cased word tokens: {pmid: [[token, ...], ...]}
    result = {
        pmid: [[word.lower() for word in nltk.word_tokenize(sent)]
               for sent in nltk.sent_tokenize(abstract)]
        for pmid, abstract in abstracts.items()
    }
    return result


## From a list of ids, fetch title and abstract from database taste.db and join them into one string per pmid
def getabstracts(idlist):

    conn = sqlite3.connect('taste.db')
    c = conn.cursor()

    abstracts = dict()

    for pmid in idlist:
        c.execute('''SELECT title,abstract FROM articles WHERE pmid = (?)''', (pmid,))
        result = c.fetchone()
        # Skip ids that are missing from the database and ignore NULL fields
        if result is None:
            continue
        abstracts[pmid] = ".".join(field for field in result if field)
    conn.close()
    return abstracts


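def stem(abstracts):
    # Pair every tokenized sentence with its stemmed tokens:
    # {pmid: [[sentence, stemmed_sentence], ...]}
    stemmer = PorterStemmer()
    result = {
        pmid: [[sentence, [stemmer.stem(token) for token in sentence]]
               for sentence in abstract]
        for pmid, abstract in abstracts.items()
    }
    return result

def filterforterm(abstracts, term):
    # Keep the original (unstemmed) sentences whose stems include `term`
    result = {
        pmid: [sentence[0] for sentence in abstract if term in sentence[1]]
        for pmid, abstract in abstracts.items()
    }
    # Drop abstracts without any matching sentence
    result = {pmid: abstract for pmid, abstract in result.items() if len(abstract) > 0}
    return result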
    

In [190]:
ids = readpmidsfromfile('tastebudpmids.txt')
abstracts = getabstracts(ids)
tokenizedabstracts = tokenize(abstracts)
stemmedabstracts = stem(tokenizedabstracts)
expressionabstracts = filterforterm(stemmedabstracts,'express')

In [192]:
len(expressionabstracts)

1054
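
To make the intermediate data structures concrete, here is a minimal sketch of the filtering step on two toy abstracts. The PMIDs, sentences, and stems are made up and typed by hand (no NLTK call is made), so the example only illustrates the shape `{pmid: [[tokens, stems], ...]}` that `stem` produces and how the filter reduces it. `filterforterm_sketch` is a hypothetical name for a copy of the `filterforterm` logic.

```python
# Toy stand-in for the output of stem(): every sentence is a pair
# [original tokens, stemmed tokens]. PMIDs and text are hypothetical, and
# the stems were typed by hand rather than produced by PorterStemmer.
toy_stemmed = {
    '111': [
        [['trpm5', 'is', 'expressed', 'in', 'taste', 'cells', '.'],
         ['trpm5', 'is', 'express', 'in', 'tast', 'cell', '.']],
        [['mice', 'were', 'tested', '.'],
         ['mice', 'were', 'test', '.']],
    ],
    '222': [
        [['behavior', 'was', 'recorded', '.'],
         ['behavior', 'wa', 'record', '.']],
    ],
}

def filterforterm_sketch(abstracts, term):
    # Keep the original sentences whose stemmed tokens contain `term`
    result = {pmid: [sentence[0] for sentence in abstract if term in sentence[1]]
              for pmid, abstract in abstracts.items()}
    # Drop abstracts that have no matching sentence left
    return {pmid: abstract for pmid, abstract in result.items() if len(abstract) > 0}

toy_filtered = filterforterm_sketch(toy_stemmed, 'express')
print(toy_filtered)
```

Only abstract '111' survives, and only its first sentence is kept, in its original unstemmed form.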

So, 1054 'taste bud' abstracts contain at least one sentence with a variation of the word 'express'.
Next, the mouse gene symbol/synonym data is loaded from the SQLite database into a dictionary that maps each lower-cased synonym to its gene symbol.

In [181]:

conn = sqlite3.connect('taste.db')
c = conn.cursor()
c.execute('SELECT * FROM synonympairs')

data = c.fetchall()
genedict = {key.lower(): value.lower() for key, value in data}

conn.close()

In [182]:
len(genedict.values())


85719

In [183]:
len(set(genedict.values()))


22934

In [184]:
len(set(genedict.keys()))

85719
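
The three counts point to a many-to-one mapping: 85719 distinct keys map to 22934 distinct values, so on average several synonyms share one gene symbol. The sketch below reproduces that pattern with an in-memory SQLite table. The column names `synonym` and `symbol` and all rows are assumptions for illustration; the real `synonympairs` schema is not shown in this notebook.

```python
import sqlite3
from contextlib import closing

# In-memory database with a hypothetical two-column layout (synonym, symbol)
with closing(sqlite3.connect(':memory:')) as toy_conn:
    toy_conn.execute('CREATE TABLE synonympairs (synonym TEXT, symbol TEXT)')
    toy_conn.executemany(
        'INSERT INTO synonympairs VALUES (?, ?)',
        [('T1r3', 'Tas1r3'), ('Sac', 'Tas1r3'), ('Tas1r3', 'Tas1r3'),
         ('Gustducin', 'Gnat3'), ('Gnat3', 'Gnat3')],
    )
    toy_rows = toy_conn.execute('SELECT * FROM synonympairs').fetchall()
# closing() calls toy_conn.close() when the block exits

# Lower-case both columns so lookups match the lower-cased tokens
toy_genedict = {synonym.lower(): symbol.lower() for synonym, symbol in toy_rows}

print(len(toy_genedict))                # number of synonyms (keys)
print(len(set(toy_genedict.values())))  # number of distinct gene symbols
```

Wrapping the connection in `contextlib.closing` guarantees that `close()` is actually called, even if a query raises an exception.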

A function `findgenes` is defined that generates a list of gene symbol/PubMed ID tuples. Tokens that match the synonym dictionary are kept unless they appear in a hand-picked stopword list of words that, in this context, likely don't represent genes. 750 gene/abstract pairs were identified. Consolidating the list reveals a set of 334 unique genes mentioned in these abstracts. The most strongly represented genes are mentioned in up to 39 abstracts.

In [257]:
def findgenes(abstractdict,genedict):
    # Hand-picked tokens that match gene synonyms but likely don't represent genes here
    stopwords = ('minor','fish','to','now','men','a','white','no','eng','peripheral','lobe','aim','tip','rest','in','striated','ii','so','be','was','4','great','light','mice','spatial','ctx','cub','acts','g','as','not','do','via','pole','fat','adipose','e','we','k','olfactory','t','p','pig','ct','salt','olfactory','b','skin','an','out','gut','can','damage','act',)
    expdict = list()
    for pmid in abstractdict:
        # Keep only tokens that are known gene synonyms, sentence by sentence
        expgenes = [[token for token in sentence if token in genedict] for sentence in abstractdict[pmid]]
        expgenes = [[token for token in sent if token not in stopwords] for sent in expgenes]
        # Note: only the first matching sentence of each abstract is used
        expgenes = expgenes[0]
        # Map synonyms to gene symbols; the set removes duplicates within an abstract
        expgenes = set([genedict[gene] for gene in expgenes])
        for gene in expgenes:
            expdict.append((gene,pmid))
    return expdict

In [259]:
geneabstractpairs = findgenes(expressionabstracts,genedict)

In [261]:
len(geneabstractpairs)

750
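
Because `findgenes` keeps only the first matching sentence (`expgenes[0]`), genes that appear only in later 'express' sentences of the same abstract are not counted. If that is not intended, a variant could pool the tokens of all sentences first. The following is a sketch of such a variant, not the code that produced the numbers reported here; the function name `findgenes_allsentences`, the toy inputs, and the shortened stopword set are hypothetical.

```python
def findgenes_allsentences(abstractdict, genedict, stopwords=frozenset()):
    """Return (gene symbol, pmid) pairs using every sentence of an abstract."""
    pairs = []
    for pmid, sentences in abstractdict.items():
        # Pool gene symbols over all sentences; the set removes duplicates
        genes = {
            genedict[token]
            for sentence in sentences
            for token in sentence
            if token in genedict and token not in stopwords
        }
        # sorted() gives a reproducible order within each abstract
        pairs.extend((gene, pmid) for gene in sorted(genes))
    return pairs

# Hypothetical toy data: the second sentence mentions a gene the first does not
toy_abstracts = {
    '111': [['trpm5', 'is', 'expressed', 'in', 'mice'],
            ['t1r3', 'expression', 'was', 'reduced']],
}
toy_genes = {'trpm5': 'trpm5', 't1r3': 'tas1r3', 'mice': 'mice', 'was': 'was'}

toy_pairs = findgenes_allsentences(toy_abstracts, toy_genes, stopwords={'mice', 'was'})
print(toy_pairs)
```

With this variant, the pair and gene counts would generally be higher than the 750 and 334 reported in this notebook.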

In [272]:
from collections import Counter
countedgenes = Counter([pair[0] for pair in geneabstractpairs])

import operator

sortedgenes = sorted(countedgenes.items(), key=operator.itemgetter(1),reverse = True)

In [268]:
len(sortedgenes)

334
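
The count-and-sort step can also be written with `Counter.most_common`, which returns the entries ordered from most to least frequent. A small sketch on made-up gene/PMID pairs (the pairs below are hypothetical, not taken from the data):

```python
from collections import Counter

# Hypothetical (gene symbol, pmid) pairs in the format returned by findgenes
example_pairs = [('tas1r3', '1'), ('gnat3', '1'), ('tas1r3', '2'),
                 ('trpm5', '2'), ('tas1r3', '3'), ('gnat3', '3')]

# Count in how many abstracts each gene occurs
example_counts = Counter(gene for gene, pmid in example_pairs)

# most_common() sorts by count, highest first
example_sorted = example_counts.most_common()
print(example_sorted)
```

The length of the sorted list equals the number of unique genes, which is what `len(sortedgenes)` reports above.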

In [271]:
import pandas as pd
df = pd.DataFrame(sortedgenes)
df.columns = ['gene','count']
df

| | gene | count |
|---|---|---|
| 0 | tas1r3 | 39 |
| 1 | gnat3 | 34 |
| 2 | trpm5 | 27 |
| 3 | bdnf | 21 |
| 4 | tas1r2 | 20 |
| 5 | krt71 | 16 |
| 6 | shh | 13 |
| 7 | tbpl1 | 13 |
| 8 | tas1r1 | 12 |
| 9 | cd36 | 12 |


In [273]:
df.to_csv('tastegenes.csv', sep=',', encoding='utf-8')