# Exploring Proteins in NCBI

The two questions below illustrate how to use the Entrez Programming Utilities (E-utilities) API to retrieve data from NCBI.
1. How many human proteins are larger than 300,000 daltons?
2. What is the longest human protein?

## Install packages

In [114]:
%pip install requests

Note: you may need to restart the kernel to use updated packages.




## Import needed packages

In [115]:
import requests

# QUESTION

How many human proteins in the NCBI protein database are bigger than 300,000 daltons?

In [116]:

# use eSearch URL
base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
extension = 'esearch.fcgi'
db = "protein"

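# create search terms: human proteins with a molecular weight of 300,000 to 4,000,000 daltons
terms = "Homo+sapiens[orgn]+AND+300000:4000000[molwt]"

# parameters (usehistory=y stores the result set on the Entrez History server)
params = "?db=" + db + "&term=" + terms + "&usehistory=y&format=json"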

# send the request
r = requests.get(base+extension+params)

# get the data
json_data = r.json()


json_data = json_data['esearchresult']


# ANSWER

In [117]:
print("There are " + str(json_data['count']) + " human proteins in the NCBI database that are bigger than 300,000 daltons.")

There are 3612 human proteins in the NCBI database that are bigger than 300,000 daltons.
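
The query string above is assembled by hand, with `+` standing in for spaces. As a minimal sketch, the same eSearch URL could be built from a dictionary with the standard library's `urllib.parse.urlencode`, which takes care of the escaping. The helper name `build_esearch_url` is hypothetical and not part of this notebook; the parameter names and values are the ones used in the cell above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def build_esearch_url(term, db="protein"):
    """Return an eSearch URL for `term` (hypothetical helper, not in the notebook)."""
    params = {
        "db": db,
        "term": term,
        "usehistory": "y",
        "format": "json",
    }
    # urlencode escapes spaces, brackets and colons in the search term
    return BASE + "esearch.fcgi?" + urlencode(params)

url = build_esearch_url("Homo sapiens[orgn] AND 300000:4000000[molwt]")

# Decoding the query string recovers the original search term.
decoded = parse_qs(urlsplit(url).query)
print(decoded["term"][0])
```

`requests.get` also accepts such a dictionary directly through its `params` argument, which avoids building the URL string at all.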


# QUESTION

What is the longest human protein in the NCBI database?

In [118]:
# narrow the search to the heaviest proteins, assuming the longest protein is among them
# create search terms
minimum="3800000"
extension = "esearch.fcgi"
terms = "Homo+sapiens[orgn]+AND+"+minimum+":4000000[molwt]"

# parameters
params = "?db=" + db + "&term=" + terms + "&usehistory=y&format=json"

# send the request
r = requests.get(base+extension+params)

# get the data
json_data = r.json()


json_data = json_data['esearchresult']
query_key = json_data['querykey']
webenv = json_data['webenv']
print("Query Key:  " + query_key)
print("Webenv:  " + webenv)


print("There are " + str(json_data['count']) + " human proteins in the NCBI database that are bigger than "+minimum+" daltons.")



Query Key:  1
Webenv:  MCID_62814fd4a958831c143f5449
There are 10 human proteins in the NCBI database that are bigger than 3800000 daltons.
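
The query key and WebEnv returned by eSearch identify the stored result set, and the next request passes them to eSummary. A minimal offline sketch of that hand-off follows. The `sample` dictionary is a hand-made stand-in shaped like the response printed above (it is not a live response), and `esummary_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hand-made stand-in for r.json(); the values are copied from the output above.
sample = {
    "esearchresult": {
        "querykey": "1",
        "webenv": "MCID_62814fd4a958831c143f5449",
    }
}

def esummary_url(esearch_json, db="protein"):
    """Build an eSummary URL that reuses a stored eSearch result set."""
    result = esearch_json["esearchresult"]
    params = {
        "db": db,
        "query_key": result["querykey"],
        "WebEnv": result["webenv"],
        "format": "json",
    }
    return BASE + "esummary.fcgi?" + urlencode(params)

summary_url = esummary_url(sample)
print(summary_url)
```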


In [119]:

# use eSummary to get details of each protein
extension = "esummary.fcgi"
params = "?db=" + db + "&query_key=" + query_key + "&WebEnv=" + webenv + "&format=json"

# send the request
r = requests.get(base+extension+params)

# get the data
json_data = r.json()
proteins = json_data['result']
# drop the 'uids' list so that only the per-protein records remain
del proteins['uids']
#print("Results:  \n" + json.dumps(proteins,indent=4))  # needs "import json"
maxlength = 0
uid = ""

# keep the record with the largest sequence length (slen, in amino acids)
for key in proteins:
    # print(proteins[key]['uid'] + " is " + str(proteins[key]['slen']) + " aa long")
    if proteins[key]['slen'] >= maxlength:
        maxlength = proteins[key]['slen']
        uid = proteins[key]['uid']

longest = proteins[uid]



# ANSWER

In [120]:
print("The longest human protein is: " )
print(longest['title'])
print("Sequence Length:  " + str(longest['slen']))
print("Accession:  " + longest['accessionversion'])

The longest human protein is: 
titin [Homo sapiens]
Sequence Length:  35991
Accession:  KAI2525983.1
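
The selection loop used for this answer can be written more compactly with `max()` and a key function. This is only a sketch: the records below are hand-made and shaped like an eSummary `result` object, with invented uids and lengths rather than NCBI data.

```python
# Hand-made records shaped like an eSummary 'result' object; the uids and
# lengths are invented for illustration and are not NCBI data.
proteins = {
    "uids": ["101", "102", "103"],
    "101": {"uid": "101", "slen": 34350},
    "102": {"uid": "102", "slen": 35991},
    "103": {"uid": "103", "slen": 33423},
}

# Skip the 'uids' index entry and keep only the per-protein records.
records = [rec for key, rec in proteins.items() if key != "uids"]

# max() with a key function returns the record with the largest 'slen'.
longest = max(records, key=lambda rec: rec["slen"])
print(longest["uid"], longest["slen"])
```

One behavioural difference is worth noting: when two records share the maximum length, `max()` returns the first one it sees, whereas the loop with `>=` keeps the last.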
