This program parses a list of .ABI files provided by the user, organizes the files into a hierarchical folder structure, and inserts the corresponding entries into a simple SQL database.
Using the database
- Copy or move the sequence files into the Input folder.
- Click the "Check Quality" button to separate the good-quality files from the bad-quality ones. Then click "View Uncertains" to see the files whose quality is in between good and bad. Judge whether each sequence is good enough to give a meaningful BLASTn result. If yes, move it to the Good folder; if not, move it to the Bad folder.
- Enter the information about the sequence files in the fields in Step 2, and check that the information is correct.
- Click the "Import Files" button. This imports the sequence files into the SQL database, creates a FASTA version of each sequence, and organizes all the files into folders according to their plate number and clone letter. It also produces a log file, which is equivalent to the transaction log shown on the interface.
- If there are any exceptions, click the "View Exceptions" button to view those files. If an exception file is a valid trace file, use the manual input interface to add it to the database. Otherwise, remove the file.
Note: If there is any error during the import process, the error message is usually displayed in the command line window that accompanies the software window.
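To illustrate how the "Import Files" step might lay files out on disk, here is a minimal sketch. The naming pattern is an assumption inferred from the single example row in the Sequence table (`Sequence Files/FASTA/PF002/H/Pf002_H5F_2006-20-01.fasta`), and `build_paths` is a hypothetical helper, not a function from the program itself.

```python
from pathlib import PurePosixPath


def build_paths(plate: int, clone: str, primer: str, run_date: str,
                root: str = "Sequence Files") -> dict:
    """Return assumed FASTA and ABI destinations for one sequence.

    The layout (root/kind/PFnnn/letter/Pfnnn_<clone><primer>_<date>.<ext>)
    is inferred from one example path and may not match the real program.
    """
    plate_dir = f"PF{plate:03d}"      # e.g. 2 -> "PF002"
    clone_letter = clone[0].upper()   # e.g. "H5" -> "H"
    stem = f"Pf{plate:03d}_{clone}{primer}_{run_date}"
    return {
        kind: str(PurePosixPath(root, kind, plate_dir, clone_letter,
                                f"{stem}.{kind.lower()}"))
        for kind in ("FASTA", "ABI")
    }


paths = build_paths(2, "H5", "F", "2006-20-01")
print(paths["FASTA"])
print(paths["ABI"])
```

With the values from the example row, this reproduces the two paths stored in the FASTA and ABI columns.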
The database
The database is an elementary SQLite database containing two tables, along with three views.
The first table is the Sequence table, which contains most of the information about the sequence files. The other table is the Blast table, which contains only the first result of the BLASTn search on the sequence.
The three views are: LowQuality, Rerun, and Pursue.
| ID | Plate | Clone | Primer | Run_date | Student | Instructor | Institution | Quality | FASTA | ABI | Pursue | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2384 | 2 | H5 | F | 2006-20-01 | NLA | Bangera | Bellevue College | 1 | Sequence Files/FASTA/PF002/H/Pf002_H5F_2006-20-01.fasta | Sequence Files/ABI/PF002/H/Pf002_H5F_2006-20-01.abi | 0 | |
- ID
- Unique identifier of the sequence in the database. This information is used to correlate the sequence with its BLASTn search result in the Blast table.
- Plate
- The plate number of the bacterial clone.
- Clone
- The clone number of the bacterial clone.
- Primer
- The primer used for the sequencing reaction. F stands for SL1, R stands for SR2.
- Run_date
- The date on which the sequence was read on the sequencing machine. When the machine is left to run overnight, two samples in the same batch can have different run dates, because the run date extracted from a sequence file is the date of *that* individual file. For instance, if samples A and B are loaded into the sequencing machine at the same time, but sample A is sequenced at 11:30 PM on 2010-10-23 and sample B at 1:05 AM on 2010-10-24, they will have different run dates.
- Student
- The student who performed the procedures to sequence the clone.
- Instructor
- The instructor who oversaw the student.
- Institution
- The place where the sequencing was done.
- Quality
- How good the sequence is. There are two possible values: 0 and 1. "1" means that the sequence is good enough for a BLAST search; "0" means otherwise. A quality of "1" does not mean that the sequence is perfect, only that it might still give meaningful results in BLAST searches.
- FASTA
- Path to the FASTA file of the sequence.
- ABI
- Path to the ABI trace file of the sequence.
- Pursue
- Whether the sequence is selected for further investigation. "1" means yes, and "0" means no.
- Comment
- Any additional comment.
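The column descriptions above can be turned into a table definition. The following is a minimal sketch using Python's built-in `sqlite3` module; the column types and `CHECK` constraints are assumptions derived from the descriptions, not the program's actual schema, and an in-memory database is used so nothing on disk is touched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration

# Assumed schema: types and constraints are guesses based on the field
# descriptions, not copied from the real database.
conn.execute("""
    CREATE TABLE Sequence (
        ID          INTEGER PRIMARY KEY,
        Plate       INTEGER,
        Clone       TEXT,
        Primer      TEXT    CHECK (Primer IN ('F', 'R')),
        Run_date    TEXT,
        Student     TEXT,
        Instructor  TEXT,
        Institution TEXT,
        Quality     INTEGER CHECK (Quality IN (0, 1)),
        FASTA       TEXT,
        ABI         TEXT,
        Pursue      INTEGER CHECK (Pursue IN (0, 1)),
        Comment     TEXT
    )
""")

# The example row from the table above; Comment is left empty (NULL).
example_row = (
    2384, 2, "H5", "F", "2006-20-01", "NLA", "Bangera", "Bellevue College", 1,
    "Sequence Files/FASTA/PF002/H/Pf002_H5F_2006-20-01.fasta",
    "Sequence Files/ABI/PF002/H/Pf002_H5F_2006-20-01.abi",
    0, None,
)
conn.execute("INSERT INTO Sequence VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
             example_row)

# Good-quality sequences that have not yet been selected for follow-up.
candidates = conn.execute(
    "SELECT ID, Plate, Clone, Primer FROM Sequence "
    "WHERE Quality = 1 AND Pursue = 0"
).fetchall()
print(candidates)
```

The final query shows how the Quality and Pursue flags might be combined to find sequences worth a closer look.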
| ID | Genome | Organism | E_value | Query_from | Query_to | Subject_from | Subject_to | Identity | Similarity |
|---|---|---|---|---|---|---|---|---|---|
| 2384 | CP002585.1 | Pseudomonas brassicacearum subsp. brassicacearum NFM421 | 1e-145 | 100 | 745 | 40555456 | 40556101 | 99% | F113 |
- ID
- Unique identifier for the sequence in this database. This value is used to link the BLASTn result with a sequence in the Sequence table.
- Genome
- Genome ID of the matching organism. The value used here is the GenBank ID, not the accession ID.
- Organism
- Name of the organism in which the BLASTn search found this sequence with the highest certainty.
- E_value
- The number of matches of this quality expected to occur in the database by chance alone. Smaller values indicate a more significant match.
- Query_from and Query_to
- Start and end positions of the matching region on the query sequence.
- Subject_from and Subject_to
- Start and end positions of the matching region on the subject sequence.
- Identity
- The percentage of positions at which the query is identical to the match.
- Similarity
- Other organisms in which this sequence is also found.
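Because both tables share the ID column, a sequence can be joined to its BLASTn result. The sketch below assumes simplified table definitions (a reduced Sequence table and guessed column types for Blast) and loads the two example rows shown above; it is an illustration, not the program's own query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for illustration

# Assumed, simplified schemas; only the columns needed for the join are kept
# in Sequence, and all column types are guesses.
conn.executescript("""
    CREATE TABLE Sequence (
        ID INTEGER PRIMARY KEY, Plate INTEGER, Clone TEXT,
        Primer TEXT, Quality INTEGER
    );
    CREATE TABLE Blast (
        ID INTEGER PRIMARY KEY REFERENCES Sequence(ID),
        Genome TEXT, Organism TEXT, E_value REAL,
        Query_from INTEGER, Query_to INTEGER,
        Subject_from INTEGER, Subject_to INTEGER,
        Identity TEXT, Similarity TEXT
    );
""")

conn.execute("INSERT INTO Sequence VALUES (2384, 2, 'H5', 'F', 1)")
conn.execute(
    "INSERT INTO Blast VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    (2384, "CP002585.1",
     "Pseudomonas brassicacearum subsp. brassicacearum NFM421",
     1e-145, 100, 745, 40555456, 40556101, "99%", "F113"),
)

# Link each good-quality sequence to its first BLASTn hit via the shared ID,
# and compute the length of the matching region on the query.
rows = conn.execute("""
    SELECT s.Plate, s.Clone, s.Primer, b.Organism, b.E_value,
           b.Query_to - b.Query_from + 1 AS hit_length
    FROM Sequence AS s
    JOIN Blast AS b ON b.ID = s.ID
    WHERE s.Quality = 1
""").fetchall()

for row in rows:
    print(row)
```

For the example rows, the matching region spans positions 100 to 745 on the query, so the computed hit length is 646 bases.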