Fix casing. Accessions uppercase. Dates lower case, ie 2016-xx-xx.
trvrb committed May 28, 2016
1 parent b096ece commit 5e6ffd5
Showing 3 changed files with 13 additions and 13 deletions.
10 changes: 5 additions & 5 deletions vdb/README.md
@@ -27,7 +27,7 @@ Sequences can be uploaded from a fasta file, genbank file or file of genbank acc
 * `Virus`: Virus type in CamelCase format. Loose term for like viruses (viruses that you'd want to include in a single tree). Examples include `Flu`, `Ebola`, `Zika`.
 * `Subtype`: Virus subtype in lowercase, where available, Null otherwise. `h3n2`, `h1n1pdm`, `vic`, `yam`
 * `Date_Modified`: Last modification date for virus document in `YYYY-MM-DD` format.
-* `Date`: Collection date in `YYYY-MM-DD` format, for example, `2016-02-28`.
+* `Date`: Collection date in `YYYY-MM-DD` format, for example, `2016-02-28` or `2016-02-xx` if day ambiguous.
 * `Region`: Collection region in CamelCase format. See [here](https://github.com/blab/nextflu/blob/master/augur/source-data/geo_regions.tsv) for examples.
 * `Country`: Collection country in CamelCase format. See [here](https://github.com/blab/nextflu/blob/master/augur/source-data/geo_synonyms.tsv) for examples.
 * `Division`: Administrative division in CamelCase format. Where available, Null otherwise.
@@ -53,7 +53,7 @@ Viruses with null values for required attributes will be filtered out of those u
 ### Commands
 Command line arguments to run vdb_upload:
 * -db --database default='vdb', help=database to upload to. Ex 'vdb', 'test'
-* -v --virus help=virus table to interact with. Ex 'Zika', 'Flu'
+* -v --virus help=virus table to interact with. Ex 'zika', 'flu'
 * --fname help=input file name
 * --ftype help=input file type, fasta, genbank or accession
 * --accessions help=comma separated list of accessions numbers to upload
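The flags listed in this hunk can be sketched with `argparse`. This is a minimal stand-in, not the actual `vdb_upload` parser: the `--source` and `--locus` options are assumed from the example commands in the README, and splitting `--accessions` on commas is an assumption about how the list is consumed.

```python
import argparse

# Hypothetical stand-in for the vdb_upload argument parser; only flags that
# appear in the README are declared, and their types are assumed to be strings.
parser = argparse.ArgumentParser(description="sketch of the vdb_upload flags")
parser.add_argument('-db', '--database', default='vdb',
                    help="database to upload to. Ex 'vdb', 'test'")
parser.add_argument('-v', '--virus',
                    help="virus table to interact with. Ex 'zika', 'flu'")
parser.add_argument('--fname', help="input file name")
parser.add_argument('--ftype', help="input file type, fasta, genbank or accession")
parser.add_argument('--accessions',
                    help="comma separated list of accessions numbers to upload")
parser.add_argument('--source', help="assumed from the README examples")
parser.add_argument('--locus', help="assumed from the README examples")

# Parse an explicit argument list so the sketch runs without a real command line.
args = parser.parse_args([
    '--database', 'test', '--virus', 'zika',
    '--source', 'genbank', '--locus', 'genome',
    '--accessions', 'KU501216,KU501217,KU365780,KU365777',
])

# Assumption: the comma separated string is split into individual accessions.
accessions = args.accessions.split(',')
```

Note that the accessions stay uppercase while `--virus`, `--source` and `--locus` are given in lowercase, which matches the casing convention this commit introduces.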
@@ -75,15 +75,15 @@ Upload flu sequences from GISAID:

 Upload Zika sequences from VIPR:

-    python vdb/zika_upload.py --database vdb --virus zika --fname GenomeFastaResults.fasta --source Genbank --locus Genome
+    python vdb/zika_upload.py --database vdb --virus zika --fname GenomeFastaResults.fasta --source genbank --locus genome

 Upload via accession file:

-    python vdb/zika_upload.py --database test --virus zika --fname entrez_test.txt --ftype accession --source Genbank --locus Genome
+    python vdb/zika_upload.py --database test --virus zika --fname entrez_test.txt --ftype accession --source genbank --locus genome

 Upload via accession list:

-    python vdb/zika_upload.py --database test --virus zika --source Genbank --locus Genome --accessions KU501216,KU501217,KU365780,KU365777
+    python vdb/zika_upload.py --database test --virus zika --source genbank --locus genome --accessions KU501216,KU501217,KU365780,KU365777

 ## Downloading
 Sequences can be downloaded from vdb.
2 changes: 1 addition & 1 deletion vdb/parse.py
@@ -70,7 +70,7 @@ def fix_casing(self, v):
         force lower case on fields besides strain, title, authors
         '''
         for field in v:
-            if field in ['strain', 'title', 'authors']:
+            if field in ['strain', 'title', 'authors', 'accession']:
                 pass
             elif v[field] is not None and isinstance(v[field], str):
                 v[field] = v[field].lower().replace(' ', '_')
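The effect of this hunk can be seen in a standalone sketch. It is not the method from `vdb/parse.py`: the function below is a free function that returns a copy instead of mutating in place, and the `keep_case` parameter and the sample record are hypothetical.

```python
def fix_casing(record, keep_case=('strain', 'title', 'authors', 'accession')):
    """Lowercase string fields and replace spaces with underscores,
    leaving the fields named in keep_case untouched."""
    fixed = {}
    for field, value in record.items():
        # None and non-string values (e.g. booleans) pass through unchanged.
        if field not in keep_case and isinstance(value, str):
            value = value.lower().replace(' ', '_')
        fixed[field] = value
    return fixed


# Hypothetical record, for illustration only.
example = fix_casing({
    'strain': 'PRVABC59',
    'accession': 'KU501216',
    'country': 'Puerto Rico',
    'locus': 'Genome',
    'division': None,
})
```

With `accession` in the exclusion list, `KU501216` keeps its uppercase form, while `country` and `locus` become `puerto_rico` and `genome`.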
14 changes: 7 additions & 7 deletions vdb/upload.py
@@ -115,17 +115,17 @@ def format_date(self, virus):
         # ex. 2002_04_25 to 2002-04-25
         virus['date'] = re.sub(r'_', r'-', virus['date'])
         # ex. 2002 (Month and day unknown)
-        if re.match(r'\d\d\d\d-(\d\d|XX)-(\d\d|XX)', virus['date']):
-            pass
+        if re.match(r'\d\d\d\d-(\d\d|XX|xx)-(\d\d|XX|xx)', virus['date']):
+            virus['date'] = virus['date'].lower()
         elif re.match(r'\d\d\d\d\s\(Month\sand\sday\sunknown\)', virus['date']):
-            virus['date'] = virus['date'][0:4] + "-XX-XX"
+            virus['date'] = virus['date'][0:4] + "-xx-xx"
         # ex. 2009-06 (Day unknown)
         elif re.match(r'\d\d\d\d-\d\d\s\(Day\sunknown\)', virus['date']):
-            virus['date'] = virus['date'][0:7] + "-XX"
+            virus['date'] = virus['date'][0:7] + "-xx"
         elif re.match(r'\d\d\d\d-\d\d', virus['date']):
-            virus['date'] = virus['date'][0:7] + "-XX"
+            virus['date'] = virus['date'][0:7] + "-xx"
         elif re.match(r'\d\d\d\d', virus['date']):
-            virus['date'] = virus['date'][0:4] + "-XX-XX"
+            virus['date'] = virus['date'][0:4] + "-xx-xx"
         else:
             print("Couldn't reformat this date: " + virus['date'])
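The new branch order can be exercised outside the class with a small sketch. It mirrors the regular expressions in this hunk, but the function name `normalize_date` is hypothetical, and returning `None` for unparseable input (instead of printing a message) is an assumption made so the result is easy to check.

```python
import re


def normalize_date(raw):
    """Return raw as a lowercase YYYY-MM-DD string with 'xx' for unknown
    parts, or None if no pattern applies (an assumption of this sketch)."""
    # ex. 2002_04_25 to 2002-04-25
    date = raw.replace('_', '-')
    # Already complete or already masked: only the casing needs fixing.
    if re.match(r'\d{4}-(\d\d|XX|xx)-(\d\d|XX|xx)', date):
        return date.lower()
    # ex. 2002 (Month and day unknown)
    if re.match(r'\d{4}\s\(Month\sand\sday\sunknown\)', date):
        return date[0:4] + '-xx-xx'
    # ex. 2009-06 (Day unknown)
    if re.match(r'\d{4}-\d\d\s\(Day\sunknown\)', date):
        return date[0:7] + '-xx'
    # Year and month only.
    if re.match(r'\d{4}-\d\d', date):
        return date[0:7] + '-xx'
    # Year only.
    if re.match(r'\d{4}', date):
        return date[0:4] + '-xx-xx'
    return None
```

The order of the checks matters: the year-only pattern would also match a `YYYY-MM` string, so the more specific patterns have to be tried first, as they are in the diff.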

@@ -177,7 +177,7 @@ def filter(self):
         '''
         print(str(len(self.viruses)) + " viruses before filtering")
         self.rethink_io.check_optional_attributes(self.viruses, self.optional_fields)
-        self.viruses = filter(lambda v: re.match(r'\d\d\d\d-(\d\d|XX)-(\d\d|XX)', v['date']), self.viruses)
+        self.viruses = filter(lambda v: re.match(r'\d\d\d\d-(\d\d|xx)-(\d\d|xx)', v['date']), self.viruses)
         self.viruses = filter(lambda v: isinstance(v['public'], (bool)), self.viruses)
         self.viruses = filter(lambda v: v['region'] is not None, self.viruses)
         self.viruses = filter(lambda v: self.rethink_io.check_required_attributes(v, self.upload_fields, self.index_field), self.viruses)
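The filter only accepts lowercase `xx` after this change, so it relies on dates having been normalized first. A minimal sketch of the first three conditions as a list comprehension follows; `filter_viruses` and the sample records are hypothetical, and the `rethink_io` attribute checks are left out because their behavior is not shown in the diff. A list is returned on purpose: if this code were run under Python 3 (the diff does not say which version it targets), the built-in `filter` would return a lazy iterator that does not support `len()`.

```python
import re

# Same pattern as the new line in the hunk: lowercase 'xx' only.
DATE_PATTERN = re.compile(r'\d\d\d\d-(\d\d|xx)-(\d\d|xx)')


def filter_viruses(viruses):
    """Keep records with a well-formed date, a boolean 'public' flag
    and a non-null region. Assumes every record has a string 'date'."""
    return [
        v for v in viruses
        if DATE_PATTERN.match(v['date'])
        and isinstance(v['public'], bool)
        and v['region'] is not None
    ]


# Hypothetical records, for illustration only.
sample = [
    {'date': '2016-02-xx', 'public': True, 'region': 'south_america'},
    {'date': '2016-02-XX', 'public': True, 'region': 'south_america'},
    {'date': '2016-02-28', 'public': 'yes', 'region': 'oceania'},
    {'date': '2016-xx-xx', 'public': False, 'region': None},
]
kept = filter_viruses(sample)
```

Only the first record survives: the second still has an uppercase `XX` date, the third has a non-boolean `public` value, and the fourth has no region.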