### Overview

Tested again on July 31st and August 1st


This notebook contains various scripts to load data into tables on a local DuckDB database. <br>
- proteins are loaded into W2V_PROTEIN
- pfam entries are loaded into W2V_TOKEN
- disorder regions are also loaded into W2V_TOKEN

The tables are created at the time the data is loaded - so see the appropariate cells for the table definition.

Indexes are applied after the data is loaded.


DuckDB is very easy to install on a mac and can load tab-delimited files extremely quickly.
To recreate this environment, you just need to install DuckDB and then set the db_string at the top of this file
to the location where you wish the database file to be stored


## SETUP AND TEST

In [3]:
import duckdb
import time
#
# TODO - SET THIS STRING TO WHERE YOU WANT THE DB TO STORE ITS DATA
#
db_string = "/Users/patrick/dev/ucl/word2vec/COMP_0158_MSC_PROJECT/database/w2v_20240731_test.db"


Test the DB works OK

In [3]:
# CREATE A TABLE
#con = duckdb.connect(database=':memory:')
con = duckdb.connect(database=db_string)  
duckdb.sql("\
    CREATE TABLE TEST (\
        ID VARCHAR,\
    )")
con.close()

In [4]:
# DESCRIBE
con = duckdb.connect(database=db_string)
res = duckdb.sql("DESCRIBE TEST")
print(res)
con.close()

┌─────────────┬─────────────┬─────────┬─────────┬─────────┬─────────┐
│ column_name │ column_type │  null   │   key   │ default │  extra  │
│   varchar   │   varchar   │ varchar │ varchar │ varchar │ varchar │
├─────────────┼─────────────┼─────────┼─────────┼─────────┼─────────┤
│ ID          │ VARCHAR     │ YES     │ NULL    │ NULL    │ NULL    │
└─────────────┴─────────────┴─────────┴─────────┴─────────┴─────────┘



In [17]:
# DROP
con = duckdb.connect(database=db_string)  
#duckdb.sql("DROP TABLE TEST")
con.close()

## DATA PREPARATION - CONVESION TO DAT

### LOAD PROTEIN TrEMBL : INTO W2V_PROTEIN

I initialy used TrEMBL to create the corpus as the UniRef100 extract was too large and kept breaking my Macbook!
I subsequently found that it's possible to download only the eukaryortic UniRef100 proteins from the uniprot website. For this, it's necessary to filter on tax id 2759 and also select 100% completion(?). Note that it takes between 12 and 16 hours for Uniprot to prepare the extract for download, but it's worth it as it contains the taxonomy details as well.

In [None]:
# load protein file into protein table
# 20 July 2024 - This took 12.9s to load uniprotkb-2759_78494531.dat (78M proteins)
# 31 July testing again to check code works
con = duckdb.connect(database=db_string)           
con.execute("CREATE TABLE W2V_PROTEIN AS SELECT * FROM read_csv_auto('/Volumes/My Passport/data/protein/dat/uniprotkb-2759_78494531.dat', columns={'uniprot_id' :'VARCHAR', 'start': 'USMALLINT', 'end': 'USMALLINT'})")
con.close()

In [11]:
# This should output that there are 78,494,529 items
con = duckdb.connect(database=db_string)           
protein_count = con.execute("SELECT COUNT(*) FROM W2V_PROTEIN").fetchall()
print(protein_count)
con.close()

[(78494529,)]


In [9]:
con = duckdb.connect(database=db_string)           
#con.execute("DROP TABLE W2V_PROTEIN")
con.close()

In [12]:
# create an index (after loading the data)
con = duckdb.connect(database=db_string)   
con.execute("CREATE INDEX UNIP_IDX ON W2V_PROTEIN(UNIPROT_ID)")
print('index created')
con.close()

index created


In [10]:
con = duckdb.connect(database=db_string)      

# SELECT FROM LIST OF IDS - REALLY SLOW
#list = ['A0A010R6E0', 'A0A010RP22']
#entries = con.execute("SELECT * FROM PFAM_TOKEN WHERE column0 IN (SELECT UNNEST(?))", [list]).fetchall()

res = con.execute("SELECT * FROM W2V_PROTEIN WHERE UNIPROT_ID = (?)", ['A0A010PZU8']).fetchall()

print(res)
con.close()

[('A0A010PZU8', 1, 1389)]


### LOAD PROTEIN - UNIREF :

UniRef100 - All Eukaryotic Proteins - including Taxonomy Details

In [4]:
# load protein file into protein table
# 05 Aug 2024 - Took 1min 4.5s to load 95,272,305 items
#
# [counter, uniprot_id, len, start, end, n_members, tax_id, tax_name]
#
con = duckdb.connect(database=db_string)           
con.execute("CREATE TABLE W2V_PROTEIN_UREF100_E AS SELECT * FROM read_csv_auto('/Users/patrick/dev/ucl/comp0158_mscproject/data/protein/uniref100only_2759-95272305_20240805.dat', columns={'counter' : UINTEGER, 'uniprot_id' :VARCHAR, 'length': USMALLINT, 'start': USMALLINT, 'end': USMALLINT, 'n_members': USMALLINT, 'tax_id' :UINTEGER, 'tax_name' : VARCHAR})")
con.close()

In [5]:
# This should output that there are 95,272,305 items
con = duckdb.connect(database=db_string)           
count = con.execute("SELECT COUNT(*) FROM W2V_PROTEIN_UREF100_E").fetchall()
print(count)
con.close()

[(95272305,)]


In [6]:
# This should output that there are 95,272,305 items
con = duckdb.connect(database=db_string)           
count = con.execute("DESCRIBE W2V_PROTEIN_UREF100_E").fetchall()
print(count)
con.close()

[('counter', 'UINTEGER', 'YES', None, None, None), ('uniprot_id', 'VARCHAR', 'YES', None, None, None), ('length', 'USMALLINT', 'YES', None, None, None), ('start', 'USMALLINT', 'YES', None, None, None), ('end', 'USMALLINT', 'YES', None, None, None), ('n_members', 'USMALLINT', 'YES', None, None, None), ('tax_id', 'UINTEGER', 'YES', None, None, None), ('tax_name', 'VARCHAR', 'YES', None, None, None)]


In [9]:
# This should output that there are 95,272,305 items
con = duckdb.connect(database=db_string)           
count = con.execute("SELECT COUNT(*) FROM W2V_PROTEIN_UREF100_E").fetchall()
print(count[0])
con.close()

(95272305,)


In [None]:
# create an index (after loading the data) - initially ran out of memory on Macbook with all proteins
# Took 25s with eukaryotic only
#
# Going to create 2 indices as have added a counter column and want an index
#
con = duckdb.connect(database=db_string)   
con.execute("CREATE INDEX UNIREF100_IDX ON W2V_PROTEIN_UREF100_E(UNIPROT_ID)")
con.execute("CREATE INDEX COUNTER_IDX ON W2V_PROTEIN_UREF100_E(COUNTER)")
print('indices created')
con.close()

In [None]:
con = duckdb.connect(database=db_string)   
results = con.execute("SELECT * FROM W2V_PROTEIN_UREF100_E WHERE COUNTER >= 30000000 AND COUNTER <30000100 ").fetchall()
for res in results:
    print (res)
con.close()

### LOAD PFAM TOKENS INTO W2V_TOKEN

In [14]:
# July 20 2024 - Took 1m 55s to load 296,017,815 entries from a directory on a macbook
# July 31 2024 - Restest took 3m 10s from an external drive attached to macbook
con = duckdb.connect(database=db_string)

con.execute("CREATE TABLE W2V_TOKEN AS SELECT * FROM read_csv_auto('/Volumes/My Passport/data/pfam/protein2ipr_pfam_20240715.dat', columns={'uniprot_id' :'VARCHAR', 'type' : 'VARCHAR', 'token' : 'VARCHAR', 'start': 'USMALLINT', 'end': 'USMALLINT'})")
con.close()

In [25]:
# with pfam only this shows 296,017,815 entries
# after loading disorder as well this shows 377,274,915 (81,257,100 disorder entries)
con = duckdb.connect(database=db_string)           
token_count = con.execute("SELECT COUNT(*) FROM W2V_TOKEN WHERE TYPE='DISORDER'").fetchall()
#token_count = con.execute("SELECT COUNT(*) FROM W2V_TOKEN WHERE TYPE='PFAM'").fetchall()
#token_count = con.execute("SELECT COUNT(*) FROM W2V_TOKEN").fetchall()
print(token_count)
con.close()

[(81257100,)]


In [None]:
# create an index (after loading data)
con = duckdb.connect(database=db_string)  
#res = con.execute("CREATE INDEX PF_TKN_IDX ON W2V_TOKEN(UNIPROT_ID)")
con.close()

In [44]:
# test
con = duckdb.connect(database=db_string)
#token = 'PF19782' # Has 0 eukaryotic proteins
#token = 'PF20176' # Has 0 eukaryotic proteins
#token = 'PF20200' # Has 0 eukaryotic proteins
#token = 'PF14033' # Has about 4852 eukaryotic proteins
#token = 'PF03463'
#token = 'PF03465'
token = 'PF00400'

#
# get number of times this token is in W2V_TOKEN - Note that this table has everything fro protein2ipr.dat
#
token_count = con.execute("SELECT COUNT(*) FROM W2V_TOKEN WHERE TOKEN=(?) AND TYPE='PFAM'", [token]).fetchall()
print(f"found {token_count[0][0]} entries for {token}")

#
# get the unique list of proteins
#
results = con.execute("SELECT DISTINCT UNIPROT_ID FROM W2V_TOKEN WHERE TOKEN=(?)", [token]).fetchall()
print(f"Found {len(results)} unique proteins containing {token}")


# need to check if that protein is actually eukaryotic
eukaryotic_count = 0
for protein_res in results:
    #print(res)
    protein_id = protein_res[0]
    
    # check for protein_id
    protein_count = con.execute("SELECT COUNT(*) FROM W2V_PROTEIN_UREF100_E WHERE UNIPROT_ID=(?)", [protein_id]).fetchall()
    
    count = protein_count[0][0]
    
    if(count >0):
        eukaryotic_count +=1
        
print(f"Found {eukaryotic_count} eukaryotic proteins with pfam entry {token}") 
    
con.close()

found 2883106 entries for PF00400
Found 750512 unique proteins containing PF00400
Found 617234 eukaryotic proteins with pfam entry PF00400


In [48]:
con = duckdb.connect(database=db_string)

token = 'PF00400'

results = con.execute("SELECT * FROM W2V_TOKEN WHERE TOKEN=(?) and UNIPROT_ID='A0A010QCY6'", [token]).fetchall()

for res in results:
    print(f"{res}")

('A0A010QCY6', 'PFAM', 'PF00400', 310, 347)
('A0A010QCY6', 'PFAM', 'PF00400', 351, 387)
('A0A010QCY6', 'PFAM', 'PF00400', 431, 464)
('A0A010QCY6', 'PFAM', 'PF00400', 536, 570)


 ### LOAD DISORDER ITEMS INTO W2V_TOKEN

In [18]:
# Load disorder entries
# First run : July 19
# Retest    : August 1st (on Macbook - took 2mn 25s)
con = duckdb.connect(database=db_string) 
con.execute("INSERT INTO W2V_TOKEN SELECT * FROM read_csv_auto('/Volumes/My Passport/data/disorder/dat/disordered_tokens_20240719.dat')")
con.close()

In [4]:
# with pfam only this shows 296,017,815 entries
# after loading disorder as well this shows 377,274,915
con = duckdb.connect(database=db_string)           
token_count = con.execute("SELECT COUNT(*) FROM W2V_TOKEN WHERE TYPE='DISORDER'").fetchall()
print(token_count)
con.close()

[(81257100,)]


In [43]:
# test that W2V_TOKEN has all pfam and disorder entries
con = duckdb.connect(database=db_string)           
tokens = con.execute("SELECT * FROM W2V_TOKEN WHERE UNIPROT_ID=(?)", ['A0A010PZU8']).fetchall()
for token in tokens:
    print(token)
con.close()

('A0A010PZU8', 'PFAM', 'PF00400', 865, 900)
('A0A010PZU8', 'PFAM', 'PF00400', 928, 955)
('A0A010PZU8', 'PFAM', 'PF00400', 960, 998)
('A0A010PZU8', 'PFAM', 'PF00400', 1017, 1040)
('A0A010PZU8', 'PFAM', 'PF00400', 1078, 1108)
('A0A010PZU8', 'PFAM', 'PF00400', 1233, 1260)
('A0A010PZU8', 'PFAM', 'PF05729', 358, 479)
('A0A010PZU8', 'PFAM', 'PF17100', 152, 254)
('A0A010PZU8', 'DISORDER', 'Consensus Disorder Prediction', 1, 30)


### LOAD TAXONOMY INFO

#### Names

In [4]:
# see the data-preparation folder for a shell script that produces the .dat file loaded here
con = duckdb.connect(database=db_string)
con.execute("CREATE TABLE W2V_TAX_NAME AS SELECT * FROM read_csv_auto('/Volumes/My Passport/data/taxonomy/dat/scientific_names_20240802.dat', columns={'tax_id' :'VARCHAR', 'name' : 'VARCHAR'})")
con.close()

In [5]:
# count  - should have 2,588,170 entries
con = duckdb.connect(database=db_string)           
token_count = con.execute("SELECT COUNT(*) FROM W2V_TAX_NAME").fetchall()
print(token_count)
con.close()

[(2588170,)]


In [10]:
# create an index (after loading data)
con = duckdb.connect(database=db_string)  
res = con.execute("CREATE INDEX TAX_NM_IDX ON W2V_TAX_NAME(TAX_ID)")
con.close()

#### Categories

In [7]:
# see the data-preparation folder for a shell script that produces the .dat file loaded here
con = duckdb.connect(database=db_string)
con.execute("CREATE TABLE W2V_TAX_CAT AS SELECT * FROM read_csv_auto('/Volumes/My Passport/data/taxonomy/dat/categories_20240802.dat', columns={'type' : 'VARCHAR', 'parent_id' :'VARCHAR', 'id' : 'VARCHAR'})")
con.close()

In [9]:
# count  - should have 1,567,316 entries
con = duckdb.connect(database=db_string)           
token_count = con.execute("SELECT COUNT(*) FROM W2V_TAX_CAT").fetchall()
print(token_count)
con.close()

[(1567316,)]


In [11]:
# create an index (after loading data)
con = duckdb.connect(database=db_string)  
res = con.execute("CREATE INDEX TAX_CT_IDX ON W2V_TAX_CAT(ID)")
con.close()

In [2]:
# test that W2V_TOKEN has all pfam and disorder entries
# 1445577   : Colletotrichum fioriniae PJ7
# 10116     : Rattus norvegicus
con = duckdb.connect(database=db_string)           
tokens = con.execute("SELECT * FROM W2V_TAX_NAME WHERE TAX_ID=(?)", ['1310608']).fetchall()
print(tokens)
con.close()

[('1310608', 'Acinetobacter sp. 1295259')]


## DATA PREPARATION - GET TOKENS FOR E PROTEINS

In [51]:
con = duckdb.connect(database=db_string)
try:
    results = con.execute(f"SELECT UNIPROT_ID, TOKEN FROM W2V_TOKEN WHERE TOKEN='PF03465' AND TYPE='PFAM' AND UNIPROT_ID IN (SELECT DISTINCT UNIPROT_ID FROM W2V_PROTEIN_UREF100_E) ORDER BY UNIPROT_ID").fetchall()
    #results = con.execute(f"SELECT COUNT(SELECT DISTINCT UNIPROT_ID, TOKEN FROM W2V_TOKEN)").fetchall()
except Exception as e:
    print(f"Error  {e}")
    con.close()
for res in results:
    print(res)


('A0A010PZU0', 'PF03465')
('A0A010RE77', 'PF03465')
('A0A015LT99', 'PF03465')
('A0A015MIZ2', 'PF03465')
('A0A016S0U2', 'PF03465')
('A0A017SMB6', 'PF03465')
('A0A017SSH1', 'PF03465')
('A0A022PVK6', 'PF03465')
('A0A022Q8N7', 'PF03465')
('A0A022QYY8', 'PF03465')
('A0A022RZQ5', 'PF03465')
('A0A022W1R5', 'PF03465')
('A0A022W2B2', 'PF03465')
('A0A022W8Q5', 'PF03465')
('A0A023B7L6', 'PF03465')
('A0A023BAB4', 'PF03465')
('A0A023ES91', 'PF03465')
('A0A023F939', 'PF03465')
('A0A023FIE9', 'PF03465')
('A0A023FJB6', 'PF03465')
('A0A023FZK0', 'PF03465')
('A0A023GJD1', 'PF03465')
('A0A024FZK3', 'PF03465')
('A0A024G9L4', 'PF03465')
('A0A024UI99', 'PF03465')
('A0A024VHM6', 'PF03465')
('A0A026WKW8', 'PF03465')
('A0A026WXX0', 'PF03465')
('A0A034W963', 'PF03465')
('A0A034WAN3', 'PF03465')
('A0A058Z7E7', 'PF03465')
('A0A058Z7H0', 'PF03465')
('A0A058ZBF9', 'PF03465')
('A0A059AFV6', 'PF03465')
('A0A059APN0', 'PF03465')
('A0A059D1D9', 'PF03465')
('A0A059D1T3', 'PF03465')
('A0A059D253', 'PF03465')
('A0A059D2V9

In [46]:
output_file_root = "/Users/patrick/dev/ucl/comp0158_mscproject/data/corpus/tokens/uniref100_e_tokens_20240808_ALL"

# This is a vastly improved method of joining the two tables of proteins and tokens to give a line per token
# This takes about 14s to process 1M proteins, find its tokens in W2V_TOKEN and output meta data for each entry to a file
# The main difference is, by having a counter column on the protein it is a much more efficient way of moving through
# that table in 'chunks' because you can query directly upon the 'COUNTER" column with a > and <= query
# BY contrast, using OFFSET and LIMIT apears to load up to the limit each time (or something) so each subsequent
# query gets slower and slower as thee LIMIT increases.
def extract_eukaryotic_tokens(start_pos, end_pos, iteration):

    # time check
    s = time.time()
    
    # output file
    output_file = output_file_root+ str(iteration) + ".dat"
    
    # create long life/expensive objects
    of  = open(output_file, "w")
    con = duckdb.connect(database=db_string)
    
    print(f"iteration {iteration} querying from {start_pos} to {end_pos}.")
    
    try:
        results = con.execute(f"SELECT W2V_PROTEIN_UREF100_E.UNIPROT_ID, W2V_PROTEIN_UREF100_E.LENGTH, W2V_TOKEN.TYPE, W2V_TOKEN.TOKEN, W2V_TOKEN.START, W2V_TOKEN.END FROM ( SELECT UNIPROT_ID, LENGTH FROM W2V_PROTEIN_UREF100_E WHERE COUNTER >= {start_pos} and COUNTER < {end_pos} ) AS W2V_PROTEIN_UREF100_E INNER JOIN W2V_TOKEN AS W2V_TOKEN ON W2V_PROTEIN_UREF100_E.UNIPROT_ID = W2V_TOKEN.UNIPROT_ID ORDER BY W2V_PROTEIN_UREF100_E.UNIPROT_ID").fetchall()
    except Exception as e:
        print(f"Error on iteration {iteration}, {e}, closing file {output_file}")
        of.close()
        con.close()
        return False
    e1 = time.time()

    #print(f"{len(results)} results returned")
    if (len(results) == 0):
        print(f"iteration {iteration} from {start_pos} to {end_pos}.... no results returned. finished?")
        of.close()
        con.close()
        return False
    
    # write out the results
    for res in results:
        #print(res[0], res[1], res[2], res[3], res[4], res[5])
        buffer = "|".join([res[0], str(res[1]), res[2], res[3], str(res[4]), str(res[5])])
        #print(buffer)
        of.write(buffer +'\n')        
    
    # time check
    e2 = time.time()
    print(f"iteration {iteration} querying from {start_pos} to {end_pos}. query took {round(e1-s,2)}s, overall took {round(e2-s,2)}s")

    of.close()
    con.close()
    return True


# Macbook timings:
# chunk size 1000   : iteration 0 querying from 0 to 1000. query took 5.04s, overall took 5.05s
# chunk size 10000  : iteration 0 querying from 0 to 10000. query took 4.82s, overall took 4.84s
# chunk size 100000 : iteration 0 querying from 0 to 100000. query took 6.06s, overall took 6.2s
# chunk size 500000 : iteration 0 querying from 0 to 500000. query took 8.86s, overall took 9.74s
# chunk size 500000 : iteration 0 querying from 0 to 1000000. query took 13.04s, overall took 14.71s
start_pos       = 0    # start point
chunk_size      = 10    # how many rows to return
end_pos         = chunk_size
keep_iterating  = True
iteration       = 0

# loop through proteins
while keep_iterating and iteration <= 2 :
    keep_iterating = extract_eukaryotic_tokens(start_pos, end_pos, iteration)
    start_pos += chunk_size
    end_pos += chunk_size
    iteration += 1


iteration 0 querying from 0 to 1000000.
iteration 0 querying from 0 to 1000000. query took 10.43s, overall took 12.18s
iteration 1 querying from 1000000 to 2000000.
iteration 1 querying from 1000000 to 2000000. query took 11.78s, overall took 13.87s
iteration 2 querying from 2000000 to 3000000.
iteration 2 querying from 2000000 to 3000000. query took 9.38s, overall took 10.63s
iteration 3 querying from 3000000 to 4000000.
iteration 3 querying from 3000000 to 4000000. query took 11.54s, overall took 13.47s
iteration 4 querying from 4000000 to 5000000.
iteration 4 querying from 4000000 to 5000000. query took 11.61s, overall took 13.36s
iteration 5 querying from 5000000 to 6000000.
iteration 5 querying from 5000000 to 6000000. query took 10.69s, overall took 12.62s
iteration 6 querying from 6000000 to 7000000.
iteration 6 querying from 6000000 to 7000000. query took 10.82s, overall took 12.81s
iteration 7 querying from 7000000 to 8000000.
iteration 7 querying from 7000000 to 8000000. quer

#### SANITY CHECK SEPT 2024

In [76]:
#
# FOund some missing items in original query - query looks good, but checking again here
#
output_file = "/Users/patrick/dev/ucl/word2vec/COMP_0158_MSC_PROJECT/data/corpus_validation_sep/uniref100_e_tokens_20240910.dat"


def extract_eukaryotic_tokens_2(start, end):
    s = time.time()

    # create long life/expensive objects
    of  = open(output_file, "a")
    con = duckdb.connect(database=db_string)
    
    print(f"\nquerying from {start} to {end}.")
    
    try:
        results = con.execute(f"SELECT T1.UNIPROT_ID, T1.TOKEN, T1.TYPE, T1.START, T1.END FROM W2V_TOKEN T1 WHERE UNIPROT_ID IN ( SELECT UNIPROT_ID FROM W2V_PROTEIN_UREF100_E T2 WHERE T2.COUNTER >= {start} and T2.COUNTER <{end} ORDER BY T2.COUNTER)").fetchall()
    except Exception as e:
        print(f"Error {e}, closing file {output_file}")
        of.close()
        con.close()
        return
    e = time.time()

    # write out the results
    for res in results:
        buffer = "|".join( [ res[0], str(res[1]), str(res[2]), str(res[3]), str(res[4]) ] )
        #print(buffer)
        of.write(buffer +'\n')        
    
    # time check
    e2 = time.time()
    print(f"query for entries from {start} to {end} took {round(e-s, 2)}s")

    of.close()
    con.close()

#end_pos = 95272305
end_pos = 95272350
chunk   = 1000000

for i in range (0, end_pos, chunk):
    #print(i,'-', i+chunk)
    extract_eukaryotic_tokens_2(i, i+chunk)



querying from 0 to 1000000.
query for entries from 0 to 1000000 took 9.27s

querying from 1000000 to 2000000.
query for entries from 1000000 to 2000000 took 10.27s

querying from 2000000 to 3000000.
query for entries from 2000000 to 3000000 took 9.07s

querying from 3000000 to 4000000.
query for entries from 3000000 to 4000000 took 9.37s

querying from 4000000 to 5000000.
query for entries from 4000000 to 5000000 took 9.04s

querying from 5000000 to 6000000.
query for entries from 5000000 to 6000000 took 9.48s

querying from 6000000 to 7000000.
query for entries from 6000000 to 7000000 took 10.01s

querying from 7000000 to 8000000.
query for entries from 7000000 to 8000000 took 10.48s

querying from 8000000 to 9000000.
query for entries from 8000000 to 9000000 took 9.36s

querying from 9000000 to 10000000.
query for entries from 9000000 to 10000000 took 9.47s

querying from 10000000 to 11000000.
query for entries from 10000000 to 11000000 took 10.19s

querying from 11000000 to 1200000

In [77]:
#
# FOund some missing items in original query - query looks good, but checking again here
#
output_file = "/Users/patrick/dev/ucl/word2vec/COMP_0158_MSC_PROJECT/data/corpus_validation_sep/uniref100_e_tokens_inner_join_20240910.dat"


def extract_eukaryotic_tokens_inner_join(start, end):
    s = time.time()

    # create long life/expensive objects
    of  = open(output_file, "a")
    con = duckdb.connect(database=db_string)
    
    print(f"\nquerying from {start} to {end}.")
    
    try:
        #results = con.execute(f"SELECT T1.UNIPROT_ID, T1.TOKEN, T1.TYPE, T1.START, T1.END FROM W2V_TOKEN T1 WHERE UNIPROT_ID IN ( SELECT UNIPROT_ID FROM W2V_PROTEIN_UREF100_E T2 WHERE T2.COUNTER >= {start} and T2.COUNTER <{end} ORDER BY T2.COUNTER)").fetchall()
        
        results = con.execute(f"SELECT T2.COUNTER, T2.LEN, T1.UNIPROT_ID, T1.TOKEN, T1.TYPE, T1.START, T1.END FROM W2V_TOKEN T1 INNER JOIN W2V_PROTEIN_UREF100_E T2 ON T1.UNIPROT_ID = T2.UNIPROT_ID WHERE T2.COUNTER >= {start} and T2.COUNTER <{end} ORDER BY T2.COUNTER)").fetchall()
        
    except Exception as e:
        print(f"Error {e}, closing file {output_file}")
        of.close()
        con.close()
        return
    e = time.time()

    # write out the results
    for res in results:
        buffer = "|".join( [ res[0], str(res[1]), str(res[2]), str(res[3]), str(res[4]) ] )
        #print(buffer)
        of.write(buffer +'\n')        
    
    # time check
    e2 = time.time()
    print(f"query for entries from {start} to {end} took {round(e-s, 2)}s")

    of.close()
    con.close()

#end_pos = 95272305
#end_pos = 95272350
end_pos = 43
chunk   = 20

for i in range (0, end_pos, chunk):
    #print(i,'-', i+chunk)
    extract_eukaryotic_tokens_inner_join(i, i+chunk)


querying from 0 to 20.
Error Parser Error: syntax error at or near ")", closing file /Users/patrick/dev/ucl/word2vec/COMP_0158_MSC_PROJECT/data/corpus_validation_sep/uniref100_e_tokens_inner_join_20240910.dat

querying from 20 to 40.
Error Parser Error: syntax error at or near ")", closing file /Users/patrick/dev/ucl/word2vec/COMP_0158_MSC_PROJECT/data/corpus_validation_sep/uniref100_e_tokens_inner_join_20240910.dat

querying from 40 to 60.
Error Parser Error: syntax error at or near ")", closing file /Users/patrick/dev/ucl/word2vec/COMP_0158_MSC_PROJECT/data/corpus_validation_sep/uniref100_e_tokens_inner_join_20240910.dat


## DATA PREPARATION - TOKEN METADATA

In [46]:
#
# Creates a new table with metadata for eukaryotic proteins - it takes 13s to load the data (29min 48s on pandas)
# - Need to first get the metadata from the tokens file - see awk script below
#
# awk 'BEGIN {FS="|"} {print $1}' tokens_combined/uniref100_e_tokens_20240808_ALL_COMBINED.dat > tokens_combined/metadata_uniref100_e_tokens_20240808_ALL_COMBINED_2.dat
#
#
con = duckdb.connect(database=db_string)

# note when yo specify the filename with a parameter within f-strings f"..." it can screw up the import
file = "/Users/patrick/dev/ucl/comp0158_mscproject/data/corpus/tokens_combined/metadata_uniref100_e_tokens_20240808_ALL_COMBINED.dat"

# use filename directly - not as a varialbe
con.execute("CREATE TABLE W2V_SENTENCE_METADATA_E AS SELECT * FROM read_csv('/Users/patrick/dev/ucl/comp0158_mscproject/data/corpus/tokens_combined/metadata_uniref100_e_tokens_20240808_ALL_COMBINED.dat', delim=':', header='false', columns={'UNIPROT_ID': 'VARCHAR', 'LENGTH': 'BIGINT', 'NUM_TOKENS': 'BIGINT', 'NUM_PF_TOKENS': 'BIGINT', 'NUM_DIS_TOKENS': 'BIGINT'})")


# the above isn;t working so just do it manually - but you get crap column names
#con.execute(f"CREATE TABLE W2V_SENTENCE_METADATA_E AS SELECT * FROM read_csv_auto('{file}', delim=':', header='false')")

# this outputs how duckdb will interpret the columns
'''
results = con.execute(f"SELECT Prompt FROM sniff_csv('{file}', delim=':' )").fetchall()
for res in results:
    print(res[0])
'''

con.close()



In [55]:
con = duckdb.connect(database=db_string)
print(con.execute("DESCRIBE W2V_SENTENCE_METADATA_E").fetchall())
print('Num entries:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E").fetchall())
print('Num entries with >1 token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_TOKENS > 1").fetchall())
print('Num entries with >=1 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 1").fetchall())
print('Num entries with >=2 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 2").fetchall())
print('Num entries with >=3 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 3").fetchall())
print('Num entries with >=4 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 4").fetchall())
print('Num entries with >=5 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 5").fetchall())
print('Num entries with >=10 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 10").fetchall())
#print(con.execute("DROP TABLE W2V_SENTENCE_METADATA_E_2").fetchall())
#print(con.execute("DROP TABLE W2V_SENTENCE_METADATA_E").fetchall())
con.close()

[('UNIPROT_ID', 'VARCHAR', 'YES', None, None, None), ('LENGTH', 'BIGINT', 'YES', None, None, None), ('NUM_TOKENS', 'BIGINT', 'YES', None, None, None), ('NUM_PF_TOKENS', 'BIGINT', 'YES', None, None, None), ('NUM_DIS_TOKENS', 'BIGINT', 'YES', None, None, None)]
Num entries: [(50249678,)]
Num entries with >1 token: [(26463970,)]
Num entries with >=1 pfam token: [(45909435,)]
Num entries with >=2 pfam token: [(18645741,)]
Num entries with >=3 pfam token: [(8451525,)]
Num entries with >=4 pfam token: [(4631406,)]
Num entries with >=5 pfam token: [(2915334,)]
Num entries with >=10 pfam token: [(557233,)]


In [57]:
con = duckdb.connect(database=db_string)
print('Num entries with >=1 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 1").fetchall())
print('Num entries with >=5 pfam token:', con.execute("SELECT COUNT(*) FROM W2V_SENTENCE_METADATA_E WHERE NUM_PF_TOKENS >= 5").fetchall())

con.close()

Num entries with >=1 pfam token: [(45909435,)]
Num entries with >=5 pfam token: [(2915334,)]


## UTILITIES

#### Search for PFAM and PROTEIN ENTRIES

In [26]:
# test that W2V_TOKEN has all pfam and disorder entries
# 1445577   : Colletotrichum fioriniae PJ7
# 10116     : Rattus norvegicus
con = duckdb.connect(database=db_string)

# 1. Test - find a protein with pfam entries
#    - Both of these work
#protein_id = "A0A009GYB3" # this is prob not eukaryotic
protein_id = "A0A010PZJ8"

#tokens = con.execute("SELECT * FROM W2V_TOKEN WHERE UNIPROT_ID = 'A0A009GYB3'").fetchall()
#tokens = con.execute("SELECT * FROM W2V_TOKEN WHERE UNIPROT_ID = (?)", [protein_id] ).fetchall()

# 2. Find that same protein in W2V_PROTEIN
# doesn't work - possibly because the pfam entries are from all proteins whereas W2V_PROTEIN only
# has TrEMBL Eukaryotic proteins
#tokens = con.execute("SELECT * FROM W2V_PROTEIN WHERE UNIPROT_ID = 'A0A009GYB3'").fetchall()
tokens = con.execute("SELECT * FROM W2V_PROTEIN WHERE UNIPROT_ID = (?)", [protein_id]).fetchall()
print('W2V_PROTEIN', tokens)

# doesn't work
#tokens = con.execute("SELECT * FROM W2V_TOKEN WHERE UNIPROT_ID = (?)", ['protein_id']).fetchall()

# none of these work - is the protein A0A009GYB3 in UniRef??
# tokens = con.execute("SELECT * FROM W2V_PROTEIN_UNIREF_100_ALL_TAX WHERE UNIPROT_ID = 'A0A009GYB3'").fetchall()
tokens = con.execute("SELECT * FROM W2V_PROTEIN_UREF100_E WHERE UNIPROT_ID = (?)", [protein_id]).fetchall()
# tokens = con.execute("SELECT * FROM W2V_PROTEIN_UNIREF_100_ALL_TAX WHERE UNIPROT_ID = (?)", [protein_id]).fetchall()
# grep "A0A009GYB3" uniref100_tax_20240801.dat > returns nothing

print('W2V_PROTEIN_UREF100_E', tokens)
con.close()

W2V_PROTEIN [('A0A010PZJ8', 1, 494)]
W2V_PROTEIN_UREF100_E [('UniRef100', 'A0A010PZJ8', 493, 1, 494, 1, 1445577, 'Colletotrichum fioriniae PJ7')]


In [40]:
# test that W2V_TOKEN has all pfam and disorder entries
# 1445577   : Colletotrichum fioriniae PJ7
# 10116     : Rattus norvegicus
con = duckdb.connect(database=db_string)           
tokens = con.execute("SELECT * FROM W2V_TAX_CAT WHERE ID=(?)", ['1445577']).fetchall()
print(tokens)
tokens = con.execute("SELECT * FROM W2V_TAX_NAME WHERE TAX_ID=(?)", ['1445577']).fetchall()
print(tokens)
con.close()

[('E', '710243', '1445577')]
[('1445577', 'Colletotrichum fioriniae PJ7')]


#### Drop Table

In [16]:
con = duckdb.connect(database=db_string)           
con.execute("DROP TABLE X")
con.close()

In [9]:
con = duckdb.connect(database=db_string)           
tables = con.execute("SHOW TABLES").fetchall()
for table in tables:
    print(table)
con.close()

('W2V_EVO_PFAM',)
('W2V_PFAM_CLAN',)
('W2V_PFAM_E',)
('W2V_PROTEIN',)
('W2V_PROTEIN_UNIREF_100_ALL_TAX',)
('W2V_PROTEIN_UREF100_E',)
('W2V_SENTENCE_METADATA_E',)
('W2V_TAX_CAT',)
('W2V_TAX_NAME',)
('W2V_TOKEN',)


In [10]:
con = duckdb.connect(database=db_string)           
tables = con.execute("DESCRIBE W2V_TOKEN").fetchall()
print(tables)
con.close()

[('uniprot_id', 'VARCHAR', 'YES', None, None, None), ('type', 'VARCHAR', 'YES', None, None, None), ('token', 'VARCHAR', 'YES', None, None, None), ('start', 'USMALLINT', 'YES', None, None, None), ('end', 'USMALLINT', 'YES', None, None, None)]


#### Unlock database

In [None]:
import duckdb
import os

# this doesn;t seem to work....
'''
def is_locked():
    lock_file = f'{db_string}.lock'
    return os.path.exists(lock_file)
is_locked()
'''

# This works - execute from a command prompt then kill -9 <id if there is one list>
fuser /Users/patrick/dev/ucl/comp0158_mscproject/database/w2v_20240731_test.db
