# Testing the EXTRACT API for basic NER


The EXTRACT tool and its API documentation are available at
https://extract.jensenlab.org/

**GetEntities**

GetEntities (http://tagger.jensenlab.org/GetEntities) returns the unique list of the entities identified in the document. The entities belong to the specified entity_types and the response follows the specified format.

Request:
```
http://tagger.jensenlab.org/GetEntities?document=Both+samples+were+dominated+by+Zetaproteobacteria+Fe+oxidizers.+This+group+was+most+abundant+at+Volcano+1,+where+sediments+were+richer+in+Fe+and+contained+more+crystalline+forms+of+Fe+oxides.&entity_types=-2+-25+-26+-27&format=tsv
```
Response:
```
Zetaproteobacteria	-2	580370
sediments	-27	ENVO:00002007
Volcano	-27	ENVO:00000247
```

Note: HTTPS is also supported (use https://tagger.jensenlab.org/)

**Parameters:**

* document: the plain or HTML-formatted text to be tagged
* format: "tsv" or "xml" (default)
* entity_types: the entity types to fetch (concatenate with "+" to use multiple)
  * -2: NCBI Taxonomy entries
  * -26: Disease Ontology terms
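As a minimal sketch of how these parameters combine into a request URL, the helper below percent-encodes the document text with the standard library. The name `build_getentities_url` is hypothetical (it is not part of the EXTRACT API), and the sketch assumes the server decodes `+` as a space, which is consistent with the example request above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE_URL = "https://tagger.jensenlab.org/GetEntities"

def build_getentities_url(document, entity_types=(-2, -26), fmt="tsv"):
    """Return a GetEntities URL with every parameter safely encoded."""
    params = {
        "document": document,
        # urlencode turns the spaces into "+", giving e.g. "-2+-26"
        "entity_types": " ".join(str(t) for t in entity_types),
        "format": fmt,
    }
    return f"{BASE_URL}?{urlencode(params)}"

url = build_getentities_url("Zetaproteobacteria Fe oxidizers & more #1")
print(url)
# decoding the query string recovers the original values
print(parse_qs(urlsplit(url).query))
```

Encoding matters because characters such as `&` and `#` have a special meaning inside a URL and would otherwise cut the document text short.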

In [1]:
import os
import pandas as pd
import requests
import time
import json
import math
import pickle

In [2]:
## set filepaths
script_path = os.getcwd()
parent_path = os.path.dirname(script_path)
input_path = os.path.join(parent_path,'Pubtator_Check','data')
input_file = os.path.join(input_path,'unnannotated_records.tsv')
output_path = os.path.join(script_path,'data')

In [6]:
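def parse_tsv(ndeid,text_response):
    """Parse a tsv response from GetEntities into a list of dicts (one per entity)."""
    dictlist = []
    for record in text_response.split('\n'):
        ## skip blank lines (e.g. from a trailing newline) instead of failing on them
        if not record.strip():
            continue
        results = record.split('\t')
        dictlist.append({'_id':ndeid,'extracted_text':results[0],'entity_type':results[1],'onto_id':results[2]})
    return dictlist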


## Generate the test data

In [52]:
%%time

## Perform the initial query

#query_url = 'https://api.data.niaid.nih.gov/v1/query?q=_exists_:species&fields=_id,name,species&fetch_all=true'
query_url = 'https://api-staging.data.niaid.nih.gov/v1/query?q=_exists_:species&fields=_id,name,description&fetch_all=true'
r = requests.get(query_url)
cleanr = json.loads(r.text)
hits = cleanr['hits']
#print(len(cleanr['hits']))
df1 = pd.DataFrame(cleanr['hits'])
scroll_id = cleanr['_scroll_id']
total_hits = cleanr['total']
print(total_hits)

201222
CPU times: total: 172 ms
Wall time: 1.74 s


In [55]:
%%time
## Scroll to get all the results

i = 0
#k = 3 
k = math.ceil(total_hits/1000)
while i < k:
    #r2 = requests.get(f'https://api.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
    r2 = requests.get(f'https://api-staging.data.niaid.nih.gov/v1/query?scroll_id={scroll_id}')
    tmp = json.loads(r2.text)
    scroll_id = tmp['_scroll_id']
    tmpdf = pd.DataFrame(tmp['hits'])
    df1 = pd.concat((df1,tmpdf),ignore_index=True)
    #print(len(df1))
    i = i+1
    time.sleep(0.25)

KeyError: '_scroll_id'
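
The `KeyError` above is raised when a scroll response arrives without a `_scroll_id`. One likely reason is that the loop counts `ceil(total_hits/1000)` pages even though the first page was already fetched by the initial query, so it asks for one page more than exists. Below is a minimal sketch of a loop that stops on that condition instead of counting pages. `fetch_page` is a hypothetical callable standing in for the `requests.get(...)` call, and the assumption that the final response lacks `hits` or `_scroll_id` is inferred from the error, not taken from the API documentation.

```python
def scroll_all(first_page, fetch_page):
    """Collect hits from a scrolled query until the server stops returning pages.

    first_page: parsed JSON (dict) of the initial query.
    fetch_page: hypothetical callable taking a scroll_id and returning parsed JSON.
    """
    hits = list(first_page.get("hits", []))
    scroll_id = first_page.get("_scroll_id")
    while scroll_id:
        page = fetch_page(scroll_id)
        new_hits = page.get("hits", [])
        if not new_hits:
            break  # nothing left to fetch
        hits.extend(new_hits)
        scroll_id = page.get("_scroll_id")  # None ends the loop instead of raising
    return hits
```

In this notebook, `fetch_page` could wrap the existing request (parse `r2.text` with `json.loads` as before), and the returned list could be passed to `pd.DataFrame` in one step, which avoids repeated `pd.concat` calls.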

In [58]:
## Inspect and save the results of the search

print(len(df1))
print(df1.head(n=3))
with open(os.path.join(script_path,'data','processed_species_results.pickle'),'wb') as dumpfile:
    pickle.dump(df1,dumpfile)

201222
                    _id  _score  \
0  DDE_0565c31a11705723     1.0   
1  DDE_095ecd25213286dd     1.0   
2  DDE_1058e9acef861126     1.0   

                                         description  \
0  Metabolomics data from cell culture; Treated w...   
1  APMS analysis of SARS-CoV-2 proteins to evalua...   
2  We performed genome-wide CRISPR KO screens in ...   

                                                name _ignored  
0  Primary human microvascular endothelial cells ...      NaN  
1     Protein-protein interaction map for SARS-CoV-2      NaN  
2  genotyping by high throughput sequencing, gene...      NaN  


## Perform one test inquiry for EXTRACT API

Determine how the query needs to be made and how to parse the response.

In [59]:
colnames = ['_id', 'raw_text','source']
unannotated_records = pd.read_csv(input_file,delimiter='\t',names=colnames)
print(unannotated_records.head(n=2))

                  _id                                           raw_text  \
0  OMICSDI_PRJNA13120  Mycoplasma hyopneumoniae 232| The causative ag...   
1  OMICSDI_PRJNA13401  Spiroplasma citri| The causative agent of Citr...   

  source  
0    NDE  
1    NDE  


In [60]:
test_text = unannotated_records.iloc[0]['raw_text'].replace('|','.').replace(' ','+')
print(test_text)

## build the URL only after the text is prepared: an f-string is evaluated as soon as it is defined
base_url = f"http://tagger.jensenlab.org/GetEntities?document={test_text}&entity_types=-2+-26&format=tsv"

Mycoplasma+hyopneumoniae+232.+The+causative+agent+of+swine+mycoplasmosis.


In [62]:
raw_text = test_text
r = requests.get(base_url)

In [63]:
print(r.text.split('\n'))

['swine\t-2\t9823', 'Mycoplasma hyopneumoniae 232\t-2\t295358']


In [64]:
result = parse_tsv('OMICSDI_PRJNA13120',r.text)
print(result)

[{'_id': 'OMICSDI_PRJNA13120', 'extracted_text': 'swine', 'entity_type': '-2', 'onto_id': '9823'}, {'_id': 'OMICSDI_PRJNA13120', 'extracted_text': 'Mycoplasma hyopneumoniae 232', 'entity_type': '-2', 'onto_id': '295358'}]


## Conduct a test extraction

#### Extract entities for records that have species data available

In [67]:
print(df1.head(n=2))
print(len(df1))

                    _id  _score  \
0  DDE_0565c31a11705723     1.0   
1  DDE_095ecd25213286dd     1.0   

                                         description  \
0  Metabolomics data from cell culture; Treated w...   
1  APMS analysis of SARS-CoV-2 proteins to evalua...   

                                                name _ignored  
0  Primary human microvascular endothelial cells ...      NaN  
1     Protein-protein interaction map for SARS-CoV-2      NaN  
201222


In [93]:
df1['raw_text'] = df1['name'].astype(str).str.cat(df1['description'].astype(str),sep='. ').str.replace('\n',' ')
print(df1.head(n=2))

                    _id  _score  \
0  DDE_0565c31a11705723     1.0   
1  DDE_095ecd25213286dd     1.0   

                                         description  \
0  Metabolomics data from cell culture; Treated w...   
1  APMS analysis of SARS-CoV-2 proteins to evalua...   

                                                name _ignored  \
0  Primary human microvascular endothelial cells ...      NaN   
1     Protein-protein interaction map for SARS-CoV-2      NaN   

                                            raw_text  
0  Primary human microvascular endothelial cells ...  
1  Protein-protein interaction map for SARS-CoV-2...  


In [117]:
testdf = df1.sample(100000,replace=False)  ## 100,000 records, matching the length printed below
print(len(testdf))
print(testdf.head(n=2))
with open(os.path.join(script_path,'data','test_10000_data.pickle'),'wb') as writefile:
    pickle.dump(testdf,writefile)

100000
              _id  _score                                        description  \
116846  GSE238109     1.0  To elucidate the role of bta-miR-484 in adipoc...   
146350  GSE185888     1.0  This SuperSeries is composed of the SubSeries ...   

                                                     name       _ignored  \
116846  RNA-Seq analyses in bta-miR-484 transfected ad...  [all.keyword]   
146350  ALKBH5 promotes tumor progression by decreasin...            NaN   

                                                 raw_text  
116846  RNA-Seq analyses in bta-miR-484 transfected ad...  
146350  ALKBH5 promotes tumor progression by decreasin...  


In [122]:
%%time

n = 49300  ## start index into testdf; this run only covers records from position 49300 onward
m = len(testdf)
#m=10
extractlist = []
faillist = []
while n < m:
    raw_text = testdf.iloc[n]['raw_text'].replace(' ','+').strip('\n').replace('\n','+')
    base_url = f"http://tagger.jensenlab.org/GetEntities?document={raw_text}&entity_types=-2&format=tsv"
    r = requests.get(base_url)
    if r.status_code == 200:
        if len(r.text)>0:
            try:
                tmpdf = parse_tsv(testdf.iloc[n]['_id'],r.text)
                extractlist.extend(tmpdf)
            except Exception:  ## a bare except would also swallow KeyboardInterrupt
                faillist.append({"_id":testdf.iloc[n]['_id'],"fail_type":"reponse_parse_fail"})
    else:
        print('failed')
        time.sleep(0.25)
        faillist.append({"_id":testdf.iloc[n]['_id'],"fail_type":"request_fail"})
    n = n+1
    time.sleep(0.25)

#print(extractlist)
testresultdf = pd.DataFrame(extractlist)
cleanresult = testresultdf.loc[(testresultdf['entity_type']==-2)|(testresultdf['entity_type']=='-2')]
cleanresult.to_csv(os.path.join(script_path,'data','test_100000.tsv'),sep='\t',header=0)
with open(os.path.join(script_path,'data','test_100000_fails.pickle'),'wb') as failfile:
    pickle.dump(faillist,failfile)

failed
failed
failed
failed
failed
failed
failed
failed
failed
failed
CPU times: total: 14min 6s
Wall time: 19h 28min 42s
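
Each `failed` line above is a non-200 response that was recorded once and never retried. A minimal sketch of a retry wrapper with exponential backoff follows. The function name, the retry count, and the delays are illustrative assumptions, and `get` is injected (for example `requests.get`) so the logic can be exercised without network access.

```python
import time

def get_with_retry(get, url, retries=3, base_delay=0.5, sleep=time.sleep):
    """Call get(url) until it returns status 200 or the retries run out.

    Returns the last response, so the caller can still inspect status_code.
    """
    response = None
    for attempt in range(retries + 1):
        response = get(url)
        if response.status_code == 200:
            return response
        if attempt < retries:
            sleep(base_delay * 2 ** attempt)  # wait 0.5 s, 1 s, 2 s, ...
    return response
```

In the loop above, `r = get_with_retry(requests.get, base_url)` could replace the direct call. A record would then be added to `faillist` only after all attempts fail.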


In [120]:
testresultdf = pd.DataFrame(extractlist)
print(len(testresultdf['_id'].unique().tolist()))
print(len(faillist))

48630
659


In [121]:
cleanresult = testresultdf.loc[(testresultdf['entity_type']==-2)|(testresultdf['entity_type']=='-2')]
cleanresult.to_csv(os.path.join(script_path,'data','test_50000.tsv'),sep='\t',header=0)
with open(os.path.join(script_path,'data','test_50000_fails.pickle'),'wb') as failfile:
    pickle.dump(faillist,failfile)

In [113]:
print(r.text)

<?xml version="1.0" encoding="UTF-8"?><GetEntitiesResponse xmlns="Reflect" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><items><item><name xsi:type="xsd:string">Eed</name><count xsi:type="xsd:int">1</count><entities><entity><type xsi:type="xsd:int">9606</type><identifier xsi:type="xsd:string">ENSP00000263360</identifier></entity><entity><type xsi:type="xsd:int">10090</type><identifier xsi:type="xsd:string">ENSMUSP00000102853</identifier></entity></entities></item></items></GetEntitiesResponse>


In [115]:
print(cleanresult)

            _id                           extracted_text entity_type onto_id
0     GSE157915                      Xenopus allofraseri          -2  288535
1     GSE216370                                     mice          -2   10090
5       GSE6738                                    human          -2    9606
7      GSE47404                                    Human          -2    9606
10    GSE108656                                    human          -2    9606
...         ...                                      ...         ...     ...
3375   GSE61171                                    human          -2    9606
3377  GSE122058  Salmonella enterica serovar Typhimurium          -2   90371
3378  GSE122058                               Salmonella          -2     590
3379  GSE122058                                    human          -2    9606
3381   GSE59780                                    human          -2    9606

[1721 rows x 4 columns]


In [116]:
print(faillist)

[{'_id': 'GSE135638', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE143561', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE160524', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE137164', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE133658', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE106301', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE74550', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE97411', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE50679', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE10838', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE230628', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE109509', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE125653', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE147326', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE85696', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE147895', 'fail_type': 'reponse_parse_fail'}, {'_id': 'GSE155605', 'fail_type': 'reponse_parse_fail'}]
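
Parse failures like those listed above can occur when the server answers with XML instead of TSV, as in the response printed by cell In [113]. XML is the documented default format, so such a response suggests that `format=tsv` (and `entity_types`) did not reach the server for that record. One plausible cause, not verified against the failing records, is an unescaped `#` in the text: everything after it is treated as a URL fragment and is not sent. The sketch below shows that effect with the standard library and adds a hypothetical `classify_response` helper that labels a response before it reaches `parse_tsv`.

```python
from urllib.parse import urlsplit

# An unescaped "#" starts the URL fragment, so the trailing parameters are lost.
raw_url = "http://tagger.jensenlab.org/GetEntities?document=sample+#1&entity_types=-2&format=tsv"
print(urlsplit(raw_url).query)     # document=sample+
print(urlsplit(raw_url).fragment)  # 1&entity_types=-2&format=tsv

def classify_response(text):
    """Label a GetEntities response body as 'empty', 'xml', 'tsv' or 'malformed'."""
    stripped = text.strip()
    if not stripped:
        return "empty"
    if stripped.startswith("<"):
        return "xml"
    # every non-blank TSV line should have exactly three tab-separated fields
    lines = [line for line in stripped.split("\n") if line.strip()]
    if all(len(line.split("\t")) == 3 for line in lines):
        return "tsv"
    return "malformed"
```

Recording the label as the `fail_type` would distinguish XML fallbacks from genuinely malformed TSV in `faillist`.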


## Evaluate the results

1. Run Text2Term for all the extracted raw text (to be able to compare with Extract)
2. If an extracted raw text term maps to multiple species, select the one with the best Text2Term score. Because EXTRACT does not give scores, for EXTRACT select the mapping that appears more than once in the paragraph
3. Compare the results with those pulled from PubTator

In [5]:
## load the results from Extract

df1 = pd.read_csv(os.path.join(output_path, 'test_49000.tsv'), delimiter='\t',header=None, index_col=0)
df2 = pd.read_csv(os.path.join(output_path, 'test_50000.tsv'), delimiter='\t',header=None, index_col=0)
dfall = pd.concat((df1,df2), ignore_index=True)
dfall.rename(columns={1:'_id',2:'extracted_term',3:'extracted_type',4:'taxid'},inplace=True)
dfall['CURIE'] = ['NCBITAXON:'+str(x) for x in dfall['taxid']]

print(len(dfall))
print(dfall.head(n=2))
dfall.to_csv(os.path.join(output_path, 'test_100000.tsv'), sep='\t', header=True)

210242
         _id extracted_term  extracted_type  taxid           CURIE
0  GSE238109         bovine              -2   9913  NCBITAXON:9913
1  GSE187829          human              -2   9606  NCBITAXON:9606


In [44]:
## deal with multimapped results
## count how many taxon IDs each (_id, term, type) combination maps to
dfgrouped = dfall.groupby(['_id','extracted_term','extracted_type']).size().reset_index(name='counts')
dfgrpunique = dfgrouped.loc[dfgrouped['counts']==1]
dfgrpmulti = dfgrouped.loc[dfgrouped['counts']!=1]

## merge the taxon IDs back in first, so dfunique and dfmulti exist before they are printed
dfunique = dfgrpunique.merge(dfall,on=['_id','extracted_term','extracted_type'],how='left')
dfmulti = dfgrpmulti.merge(dfall,on=['_id','extracted_term','extracted_type'],how='left')
print(len(dfunique))
print(len(dfmulti))

print(len(dfunique))
print(dfunique.head(n=2))
print(len(dfmulti))

78630
131612
78630
                    _id   extracted_term  extracted_type  counts  taxid  \
0  DDE_01a3d57e683d1471  West Nile virus              -2       1  11082   
1  DDE_01a3d57e683d1471          viruses              -2       1  10239   

             CURIE  
0  NCBITAXON:11082  
1  NCBITAXON:10239  
131612


In [50]:
## load the Pubtator Results
speciespubtatordf = pd.read_csv(os.path.join(parent_path,'text2term_test','data','clean_pubtator_results_from_nde.tsv'),delimiter='\t',header=0,index_col=0)
infectpubtatordf = pd.read_csv(os.path.join(parent_path,'text2term_test','data','infectiousAgent_clean_pubtator_results_from_nde.tsv'),delimiter='\t',header=0,index_col=0)
infectpubtatordf.rename(columns={'infectiousAgent':'species'}, inplace=True)
pubtatordf = pd.concat((speciespubtatordf,infectpubtatordf), ignore_index=True)
print(pubtatordf.head(n=2))

                    _id  _score  \
0  DDE_0565c31a11705723     1.0   
1  DDE_095ecd25213286dd     1.0   

                                                name  \
0  Primary human microvascular endothelial cells ...   
1     Protein-protein interaction map for SARS-CoV-2   

                                             species _ignored           CURIE  
0  {'alternateName': ['Human', 'Homo sapiens Linn...      NaN  NCBITAXON:9606  
1  {'alternateName': ['Human', 'Homo sapiens Linn...      NaN  NCBITAXON:9606  


In [54]:
## See how many of the terms that mapped to a single species by EXTRACT mapped to Pubtator species

dfunique_merged = dfunique.merge(pubtatordf,on=['_id','CURIE'],how='left')
print(len(dfunique_merged))
print(dfunique_merged.head(n=2))
dfunique_matched = dfunique_merged.loc[~dfunique_merged['species'].isna()]
dfunique_unmatched = dfunique_merged.loc[dfunique_merged['species'].isna()]
## number of matching mappings
print(len(dfunique_matched))
print(dfunique_matched.head(n=2))

## number of unmatched mappings
print(len(dfunique_unmatched))
## these are EXTRACT-based mappings that did not match because
#### 1. The mapping from EXTRACT is wrong OR
#### 2. EXTRACT pulled out more terms to map than was available from Pubtator via PMID matching

78630
                    _id   extracted_term  extracted_type  counts  taxid  \
0  DDE_01a3d57e683d1471  West Nile virus              -2       1  11082   
1  DDE_01a3d57e683d1471          viruses              -2       1  10239   

             CURIE  _score                                               name  \
0  NCBITAXON:11082     1.0  Mouse popliteal lymph node transcriptome respo...   
1  NCBITAXON:10239     NaN                                                NaN   

                                             species _ignored  
0  {'alternateName': ['WNV'], 'classification': '...      NaN  
1                                                NaN      NaN  
41383
                    _id   extracted_term  extracted_type  counts    taxid  \
0  DDE_01a3d57e683d1471  West Nile virus              -2       1    11082   
2  DDE_095ecd25213286dd       SARS-CoV-2              -2       1  2697049   

               CURIE  _score  \
0    NCBITAXON:11082     1.0   
2  NCBITAXON:2697049     1.0  

In [56]:
## Inspect the case when EXTRACT maps a single term to multiple taxa
print(dfmulti.head(n=4))
#### Since EXTRACT does not do any sort of scoring, better to use Text2Term
#### See Text2Term test notebooks

                    _id extracted_term  extracted_type  counts  taxid  \
0  DDE_01a3d57e683d1471          Mouse              -2       2  10090   
1  DDE_01a3d57e683d1471          Mouse              -2       2  10088   
2  DDE_01a3d57e683d1471          mouse              -2       2  10090   
3  DDE_01a3d57e683d1471          mouse              -2       2  10088   

             CURIE  
0  NCBITAXON:10090  
1  NCBITAXON:10088  
2  NCBITAXON:10090  
3  NCBITAXON:10088  
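
The output above shows the pattern: within one record, `Mouse` and `mouse` each map to both 10090 and 10088. A minimal sketch of separating unambiguous from ambiguous terms in plain Python follows. Folding case so that the two spellings count as one term, and the helper name `split_by_ambiguity`, are assumptions and not part of the notebook's pipeline.

```python
from collections import defaultdict

def split_by_ambiguity(rows):
    """Group (record_id, term, taxid) rows by record and case-folded term.

    Returns two dicts keyed by (record_id, term): terms with exactly one
    taxon ID, and terms with several candidate taxon IDs.
    """
    taxa = defaultdict(set)
    for record_id, term, taxid in rows:
        taxa[(record_id, term.lower())].add(taxid)
    unique = {key: next(iter(ids)) for key, ids in taxa.items() if len(ids) == 1}
    multi = {key: sorted(ids) for key, ids in taxa.items() if len(ids) > 1}
    return unique, multi

rows = [
    ("DDE_01a3d57e683d1471", "Mouse", 10090),
    ("DDE_01a3d57e683d1471", "Mouse", 10088),
    ("DDE_01a3d57e683d1471", "mouse", 10090),
    ("DDE_01a3d57e683d1471", "mouse", 10088),
    ("DDE_01a3d57e683d1471", "West Nile virus", 11082),
]
unique, multi = split_by_ambiguity(rows)
print(unique)
print(multi)
```

Choosing among the candidates in `multi` still needs a scoring step, which is why the Text2Term notebooks are referenced above.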


## Check the results from Dylan's tests

In [5]:
text = "First report of the East African kdr mutation in an Anopheles gambiae mosquito in Côte d’Ivoire| Immature stages of Anopheles gambiae s.l. were collected from breeding sites at the outskirts of Yamoussoukro, Côte d'Ivoire. Emerging 3-5 day old adult female mosquitoes were tested for susceptibility to deltamethrin 0.05%, malathion 5%, bendiocarb 1% and dichlorodiphenyltrichloroethane (DDT) 4% according to WHO standard procedures. A total of 50  An. gambiae s.l. specimens were drawn at random for DNA extraction and identification down to the species level. A subsample of 30 mosquitoes was tested for the East-African kdr mutation using a Taqman assay. (MapVEu VBP0000191)"
test_text = text.replace('|','.').replace(' ','+')
base_url = f"http://tagger.jensenlab.org/GetEntities?document={test_text}&entity_types=-2&format=tsv"

r = requests.get(base_url)
result = parse_tsv('veupathdb_DS_cd1a65bcca',r.text)
print(result)

[{'_id': 'veupathdb_DS_cd1a65bcca', 'extracted_text': 'Anopheles gambiae s', 'entity_type': '-2', 'onto_id': '7165'}, {'_id': 'veupathdb_DS_cd1a65bcca', 'extracted_text': 'Anopheles gambiae', 'entity_type': '-2', 'onto_id': '7165'}]


## Dealing with location terms that map to taxa

A handful of country and U.S. state names have exact matches in NCBI Taxonomy. However, the probability that a dataset is actually about that taxon (rather than mentioning the name as a location) should be relatively low. To address this, we will create an exclusion list: use Wikidata queries to pull the names of sovereign states and U.S. states, then run them through EXTRACT to see which ones map to NCBI Taxonomy entries. This identifies the location terms that EXTRACT is likely to pull out as species.

**additional information**
Wikidata: 
* Sovereign state: Q3624078
* U.S. state: Q35657

Example query 1: "https://query.wikidata.org/#%23Countries%0ASELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ3624078.%20%23%20Must%20be%20a%20sovereign%20state%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%20%23%20Helps%20get%20the%20label%20in%20your%20language%2C%20if%20not%2C%20then%20en%20language%0A%7D"

Example query 2: "https://query.wikidata.org/#%23US%20states%0ASELECT%20%3Fitem%20%3FitemLabel%20%0AWHERE%20%0A%7B%0A%20%20%3Fitem%20wdt%3AP31%20wd%3AQ35657.%20%23%20Must%20be%20a%20US%20state%0A%20%20SERVICE%20wikibase%3Alabel%20%7B%20bd%3AserviceParam%20wikibase%3Alanguage%20%22%5BAUTO_LANGUAGE%5D%2Cen%22.%20%7D%20%23%20Helps%20get%20the%20label%20in%20your%20language%2C%20if%20not%2C%20then%20en%20language%0A%7D"

Rather than trying to parse the overly-complicated Wikidata json files, the query results were downloaded as tsv files via the browser interface and saved to the data folder.
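
For readability, the URL-encoded fragment in Example query 1 decodes to a short SPARQL query. The sketch below rebuilds such a link from readable SPARQL with the standard library. The helper name `wikidata_query_link` is hypothetical, and the query text omits the comments present in the original link.

```python
from urllib.parse import quote, unquote

def wikidata_query_link(sparql):
    """Return a query.wikidata.org link that opens the given SPARQL in the browser UI."""
    return "https://query.wikidata.org/#" + quote(sparql, safe="")

SOVEREIGN_STATES = """SELECT ?item ?itemLabel
WHERE
{
  ?item wdt:P31 wd:Q3624078.
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
}"""

link = wikidata_query_link(SOVEREIGN_STATES)
print(link)
# swapping Q3624078 for Q35657 gives the U.S. state query
print(wikidata_query_link(SOVEREIGN_STATES.replace("Q3624078", "Q35657")))
```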

In [5]:
states = pd.read_csv(os.path.join(output_path,'state_query_results.tsv'), delimiter='\t',header=0)
print(states.head(n=2))
countries = pd.read_csv(os.path.join(output_path,'country_query_results.tsv'), delimiter='\t',header=0)
print(countries.head(n=2))

                                  item   itemLabel
0   http://www.wikidata.org/entity/Q99  California
1  http://www.wikidata.org/entity/Q173     Alabama
                                 item itemLabel
0  http://www.wikidata.org/entity/Q16    Canada
1  http://www.wikidata.org/entity/Q17     Japan


In [26]:
## extract the lists
statelist = states['itemLabel'].unique().tolist()
print(statelist)
raw_text = ""
for eachstate in statelist:
    raw_text = raw_text+" "+(eachstate)

print(raw_text)
base_url = f"http://tagger.jensenlab.org/GetEntities?document={raw_text}&entity_types=-2+-26&format=tsv"
r = requests.get(base_url)
dictlist = parse_tsv('states',r.text)
statedf = pd.DataFrame(dictlist)
print(statedf)


['California', 'Alabama', 'Maine', 'New Hampshire', 'Massachusetts', 'Connecticut', 'Hawaii', 'Alaska', 'Florida', 'Arizona', 'Oregon', 'Utah', 'Michigan', 'Illinois', 'North Dakota', 'South Dakota', 'Montana', 'Wyoming', 'Idaho', 'Washington', 'Nevada', 'Colorado', 'Virginia', 'West Virginia', 'New York', 'Rhode Island', 'Maryland', 'Delaware', 'Ohio', 'Pennsylvania', 'New Jersey', 'Indiana', 'Georgia', 'Texas', 'North Carolina', 'South Carolina', 'Mississippi', 'Tennessee', 'New Mexico', 'Minnesota', 'Wisconsin', 'Iowa', 'Nebraska', 'Kansas', 'Missouri', 'Louisiana', 'Kentucky', 'Arkansas', 'Oklahoma', 'Vermont']
 California Alabama Maine New Hampshire Massachusetts Connecticut Hawaii Alaska Florida Arizona Oregon Utah Michigan Illinois North Dakota South Dakota Montana Wyoming Idaho Washington Nevada Colorado Virginia West Virginia New York Rhode Island Maryland Delaware Ohio Pennsylvania New Jersey Indiana Georgia Texas North Carolina South Carolina Mississippi Tennessee New Mexico

In [30]:
def clean_stopwords(termtext):
    stopwordlist = ["People's Republic of","State of","Kingdom of", "Republic of the", "Republic of","Republic","Democratic"]
    for eachword in stopwordlist:
        termtext = termtext.replace(eachword,"")
    return termtext

print(clean_stopwords("People's Republic of China"))

countrylist = countries['itemLabel'].unique().tolist()
raw_text = ""
for eachcountry in countrylist:
    cleancountry = clean_stopwords(eachcountry)
    raw_text = raw_text+" "+(cleancountry)

print(raw_text)
base_url = f"http://tagger.jensenlab.org/GetEntities?document={raw_text}&entity_types=-2+-26&format=tsv"
r = requests.get(base_url)
dictlist = parse_tsv('country',r.text)
countrydf = pd.DataFrame(dictlist)
print(countrydf)

 China
 Canada Japan Norway  Ireland Hungary Spain United States of America Belgium Luxembourg Finland Sweden Poland Lithuania Italy Switzerland Austria Greece Turkey Portugal Uruguay Egypt Mexico Kenya Ethiopia Ghana France United Kingdom  China Brazil Russia Germany Belarus Iceland Estonia Latvia Ukraine Czech  Slovakia Slovenia Moldova Romania Bulgaria North Macedonia Albania Croatia Bosnia and Herzegovina Azerbaijan Andorra Cyprus Georgia Kazakhstan Malta Monaco Montenegro Vatican City San Marino Cuba Belize Barbados Indonesia South Africa Algeria Uzbekistan Chile Singapore Liechtenstein Bahrain Armenia Serbia Australia Argentina Peru North Korea Cambodia East Timor Chad New Zealand India Tuvalu Tonga Samoa Solomon Islands Vanuatu Papua New Guinea Palau Nauru Federated States of Micronesia Marshall Islands Kiribati Mongolia Fiji Venezuela Suriname Paraguay Guyana Ecuador Colombia Bolivia Trinidad and Tobago Saint Vincent and the Grenadines  Geneva Saint Lucia Saint Kitts and Nevis 

Based on the results above, the only country and state names that EXTRACT will accidentally pull out as species terms are:
`Montana`, `Nevada`, and `Tonga`

It did not pull out the term `China`, even though `China` has an exact match in NCBI Taxonomy.
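
A minimal sketch of applying the resulting exclusion list to parsed EXTRACT results follows. It works on the list of dicts returned by `parse_tsv`. Matching case-insensitively and dropping the term outright (rather than flagging it for review) are assumptions about how the list would be used.

```python
LOCATION_EXCLUSIONS = {"montana", "nevada", "tonga"}

def drop_location_terms(records, exclusions=LOCATION_EXCLUSIONS):
    """Remove species hits whose extracted text is a known location name."""
    return [
        rec for rec in records
        if rec["extracted_text"].strip().lower() not in exclusions
    ]

# hypothetical record ID; the onto_id "0" for the location hit is a placeholder
records = [
    {"_id": "example_1", "extracted_text": "Montana", "entity_type": "-2", "onto_id": "0"},
    {"_id": "example_1", "extracted_text": "human", "entity_type": "-2", "onto_id": "9606"},
]
print(drop_location_terms(records))
```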