### Downloading reference sequences from GenBank

The first step of the pipeline is to collect all reference sequences from GenBank according to the literature, as described below.

In [1]:
!pwd

/home/working/C_floridanus/1-NCBI_references


In [2]:
!ls

C_floridanus_NCBI_reference.ipynb  Crangonyx_NCBI_2018.gb
c-islandicus_accession.txt	   O.bicornis_NCBI_2018.gb
C.islandicus_NCBI_2018.gb	   pos-taxa_accession.txt
Crangonyx_accession.txt		   synurella_accession.txt


The accession files for each taxon were created following Slothouber Galbreath et al. (2010), Nagabuko et al. (2011), and Mauvisseau et al. (2018). We manually inspected all accessions to make sure that all duplicates were removed.
After an initial manual alignment of the records, we detected 8 sequences that had low similarity values (~82%) with either _Crangonyx floridanus_ or _Crangonyx pseudogracilis_. Because this value is lower than the expected similarity between the two target species (~84%), we removed those sequences from the database on the basis that they might have been misidentified.

Removed records were:

- AJ968905
- AJ968906
- AJ968907
- AJ968908
- AJ968909
- AJ968910
- AJ968911
- EF570296
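
The similarity comparison described above can be illustrated with a minimal sketch. It assumes the two sequences are already aligned to the same length with `-` marking gaps. The function name `percent_identity`, the choice to skip gapped columns, and the toy sequences are illustrative assumptions, not the procedure that produced the values reported here.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of identical bases over aligned columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: columns with a gap in either sequence are ignored
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, not real GenBank records)
reference = "ATGCATGCAT"
candidate = "ATGCTTGC-T"
print(f"{percent_identity(reference, candidate):.1f}% identity")
```

Other definitions of identity (for example, counting gaps as mismatches) give different percentages, so a threshold is only meaningful alongside the definition used.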

In [3]:
%%file Crangonyx_accession.txt
AJ968893
AJ968894
AJ968895
AJ968896
AJ968897
AJ968898
AJ968899
AJ968900
AJ968901
AJ968902
AJ968903
AJ968904
AB513800
AB513801
AB513802
AB513803
AB513804
AB513805
AB513806
AB513807
AB513808
AB513809
AB513810
AB513811
AB513812
AB513813
AB513814
AB513815
AB513816
AB513817
AB513818
AB513819
AB513820
AB513821
AB513822
AB513823
AB513824
AB513825
AB513826
AB513827
AB513828
AB513829
AB513830
AB513831
AB513832
AB513833
AB513834
AB513835
MK036646 
MK036647
MK036648
MK036649
MK036650
MK036651
MK036652
MK036653
MK036654
MK036655
MK036656
MK036657
MK036658
MK036659

Overwriting Crangonyx_accession.txt


In [4]:
!head -n 50 Crangonyx_accession.txt

AJ968893
AJ968894
AJ968895
AJ968896
AJ968897
AJ968898
AJ968899
AJ968900
AJ968901
AJ968902
AJ968903
AJ968904
AB513800
AB513801
AB513802
AB513803
AB513804
AB513805
AB513806
AB513807
AB513808
AB513809
AB513810
AB513811
AB513812
AB513813
AB513814
AB513815
AB513816
AB513817
AB513818
AB513819
AB513820
AB513821
AB513822
AB513823
AB513824
AB513825
AB513826
AB513827
AB513828
AB513829
AB513830
AB513831
AB513832
AB513833
AB513834
AB513835
MK036646 
MK036647
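
Before downloading, the accession list can be checked programmatically for duplicates and for records that should have been excluded. The following is a minimal sketch of such a check. The helper name `check_accessions` and the in-memory sample are hypothetical; the `EXCLUDED` set is copied from the list of removed records above.

```python
from collections import Counter

# Records removed above because of low similarity
EXCLUDED = {
    "AJ968905", "AJ968906", "AJ968907", "AJ968908",
    "AJ968909", "AJ968910", "AJ968911", "EF570296",
}


def check_accessions(lines, excluded=EXCLUDED):
    """Return (accessions, duplicated accessions, excluded accessions still present)."""
    # strip() drops stray whitespace, e.g. a trailing space after an accession
    accessions = [line.strip() for line in lines if line.strip()]
    counts = Counter(accessions)
    duplicates = sorted(acc for acc, n in counts.items() if n > 1)
    leftovers = sorted(set(accessions) & set(excluded))
    return accessions, duplicates, leftovers


# Hypothetical sample: one duplicate, one trailing space, one excluded record
sample = ["AJ968893\n", "MK036646 \n", "AJ968893\n", "AJ968905\n"]
accessions, duplicates, leftovers = check_accessions(sample)
print("duplicates:", duplicates)
print("excluded but present:", leftovers)
```

In the notebook, an open file handle such as `open("Crangonyx_accession.txt")` could be passed as `lines`; two empty lists would indicate a clean accession file.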


After creating the accession files, we fetch the records from NCBI and save them to a file in GenBank (`.gb`) format.

In [5]:
%%bash

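# Fetch every unique accession from NCBI EFetch as a GenBank flat file (rettype=gb)
# and concatenate the records into a single .gb file.
# NCBI redirects this http:// URL to https:// (see the 301 in the log below).
for acc in $(sort -u Crangonyx_accession.txt)
do
    wget -O - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${acc}&rettype=gb"
done > Crangonyx_NCBI_2018.gb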

--2019-07-24 08:44:27--  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AB513800&rettype=gb
Resolving eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)... 130.14.29.110
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AB513800&rettype=gb [following]
--2019-07-24 08:44:28--  https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=AB513800&rettype=gb
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘STDOUT’

     0K ...                                                    25.5M=0s

2019-07-24 08:44:28 (25.5 MB/s) - written to stdout [3076]

--2019-07-24 08:44:28--  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.f
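
The loop above issues one request per accession. EFetch also accepts a comma-separated list of IDs, so the same download could be done in fewer requests. The following sketch only builds the request URLs and performs no network access. The helper names (`unique_sorted`, `batches`, `efetch_url`) and the batch size are illustrative assumptions.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def unique_sorted(lines):
    """Mimic `sort | uniq`: strip whitespace, drop blanks and duplicates, sort."""
    return sorted({line.strip() for line in lines if line.strip()})


def batches(items, size):
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def efetch_url(accessions, db="nucleotide", rettype="gb", retmode="text"):
    """Build one EFetch URL requesting several accessions at once."""
    query = urlencode(
        {"db": db, "id": ",".join(accessions), "rettype": rettype, "retmode": retmode},
        safe=",",  # keep the commas between IDs readable
    )
    return f"{EFETCH}?{query}"


# Accessions taken from the list above, with one deliberate duplicate
accs = unique_sorted(["AJ968893\n", "AB513800\n", "AB513800\n"])
for chunk in batches(accs, 2):  # batch size chosen arbitrarily for the example
    print(efetch_url(chunk))
```

Each URL could then be downloaded, for example with `urllib.request.urlopen`, and the responses appended to the `.gb` file. NCBI asks clients without an API key to stay at or below three requests per second, so a short pause between batches is advisable.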

Next, we also include the outgroup for the phylogenetic analysis and the positive taxa included in the metabarcoding:

_Crangonyx islandicus_ (Kornobis et al. 2010)

- HM015193
- HM015194
- HM015195
- HM015196

_Osmia bicornis_ (Radzeviciute et al. 2016)

- KX957868
- KX374768
- KX957868

Again, we fetch all the sequences from NCBI as above, saving each taxon to its own `.gb` file.

In [9]:
%%file c-islandicus_accession.txt
HM015193
HM015194
HM015195
HM015196

Overwriting c-islandicus_accession.txt


In [10]:
!head c-islandicus_accession.txt

HM015193
HM015194
HM015195
HM015196

In [11]:
%%bash

# Same download loop as above, applied to the outgroup accessions.
for acc in $(sort -u c-islandicus_accession.txt)
do
    wget -O - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${acc}&rettype=gb"
done > C.islandicus_NCBI_2018.gb

--2019-07-24 08:45:08--  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=HM015193&rettype=gb
Resolving eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)... 130.14.29.110
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=HM015193&rettype=gb [following]
--2019-07-24 08:45:08--  https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=HM015193&rettype=gb
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘STDOUT’

     0K ..                                                     22.1M=0s

2019-07-24 08:45:08 (22.1 MB/s) - written to stdout [2954]

--2019-07-24 08:45:08--  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.f

In [12]:
%%file pos-taxa_accession.txt
KX957868
KX374768
KX957868

Overwriting pos-taxa_accession.txt


In [13]:
%%bash

# Same download loop as above, applied to the positive taxa accessions.
for acc in $(sort -u pos-taxa_accession.txt)
do
    wget -O - "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=${acc}&rettype=gb"
done > O.bicornis_NCBI_2018.gb

--2019-07-24 08:45:10--  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=KX374768&rettype=gb
Resolving eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)... 130.14.29.110
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=KX374768&rettype=gb [following]
--2019-07-24 08:45:10--  https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=KX374768&rettype=gb
Connecting to eutils.ncbi.nlm.nih.gov (eutils.ncbi.nlm.nih.gov)|130.14.29.110|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘STDOUT’

     0K ..                                                     23.3M=0s

2019-07-24 08:45:11 (23.3 MB/s) - written to stdout [2932]

--2019-07-24 08:45:11--  http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.f
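
Because the downloads are concatenated into one file per taxon, it can be useful to confirm that every requested accession actually arrived. The following sketch compares a requested list against the `ACCESSION` lines of a GenBank flat file. The helper names and the heavily abbreviated sample text are hypothetical stand-ins, not real file contents.

```python
def accessions_in_genbank(text):
    """Collect the primary accession from each ACCESSION line of a GenBank flat file."""
    found = []
    for line in text.splitlines():
        if line.startswith("ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                found.append(fields[1])  # the first accession listed is the primary one
    return found


def missing_records(requested, text):
    """Return requested accessions that have no matching record in `text`."""
    present = set(accessions_in_genbank(text))
    return sorted(set(requested) - present)


# Abbreviated placeholder for a .gb file holding a single record
sample = "LOCUS       KX374768\nACCESSION   KX374768\n//\n"
print(missing_records(["KX374768", "KX957868"], sample))
```

In the notebook, `text` could be the contents of `O.bicornis_NCBI_2018.gb` and `requested` the stripped lines of `pos-taxa_accession.txt`; a non-empty result would suggest that a download failed or returned an error page.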