### If you're feeling more ambitious...

I want to look at the evolution of Begonia in Papua New Guinea.  How many species already have sequences in NCBI, and how many will I need to sequence?

A list of Begonia sequences in GenBank can be obtained here:  
    https://www.ncbi.nlm.nih.gov/genbank/  
Search for txid3681[Organism:exp]   
Download the complete record in Summary format (button at the bottom right of the page) 

A list of Begonia in Papua New Guinea can be obtained here:  
    https://padme.rbge.org.uk/Begonia/data/checklist  
Make a checklist for Papua New Guinea.  You have the option to download it as a Word document, but this will not read directly into pandas.  To access it we need to install a new module, python-docx.

In [60]:
pip install python-docx

Note: you may need to restart the kernel to use updated packages.


In [63]:
import docx
PNG_doc = docx.Document('../datasets/Checklist of Papua New Guinea.docx')

We can then extract the text

In [64]:
text = [p.text for p in PNG_doc.paragraphs if p.text]  
text

['Begonia L.',
 'Acetosa Mill.',
 'Riessia Klotzsch',
 'Trilomisa Raf.',
 'Sect. Baryandra',
 'Begonia sharpeana F.Muell.',
 'Sect. Diploclinium',
 'Begonia Sect. Trilobaria A.DC.',
 'Diploclinium Lindl.',
 'Begonia acaulis Merr. & L.M.Perry',
 'Begonia bartlettiana Merr. & L.M.Perry',
 'Begonia kaniensis Irmsch.',
 'Begonia minjemensis Irmsch.',
 'Begonia subcyclophylla Irmsch.',
 'Sect. Donaldia',
 'Donaldia Klotzsch',
 'Begonia ulmifolia Willd.',
 'Begonia dasycarpa A.DC.',
 'Begonia gesnerioides L.B.Sm. & B.G.Schub.',
 'Begonia jairii Brade',
 'Donaldia ulmifolia (Willd.) Klotzsch',
 'Sect. Ignota',
 'Begonia archboldiana Merr. & L.M.Perry',
 'Begonia physandra Merr. & L.M.Perry',
 'Sect. Oligandrae',
 'Begonia chambersiae W.N.Takeuchi',
 'Begonia maguniana H.P.Wilson',
 'Begonia oligandra Merr. & L.M.Perry',
 'Begonia pentandra W.N.Takeuchi',
 'Begonia sandsiana W.N.Takeuchi',
 'Sect. Petermannia',
 'Petermannia Klotzsch',
 'Begonia aikrono H.P.Wilson & Jimbo',
 'Begonia augustae 

The text is a Python list, so we can access individual elements by index:

In [65]:
text[13]

'Begonia subcyclophylla Irmsch.'

We need to filter this to just the species names

In [66]:
Species = []
for i in text:
    if i.startswith("Begonia"):
        Species.append(i)

In [16]:
len(Species)

92

In [67]:
Species[10]

'Begonia gesnerioides L.B.Sm. & B.G.Schub.'
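
The loop above keeps every line that starts with "Begonia", which also lets through the genus heading 'Begonia L.' and section names such as 'Begonia Sect. Trilobaria A.DC.'. The following is a minimal sketch of a stricter filter, assuming that a species line always has a lower-case epithet as its second word. The helper name `is_species_line` and the short `sample` list are illustrative and not part of the original notebook.

```python
import re

# Illustrative sample of checklist lines, copied from the output above
sample = [
    "Begonia L.",
    "Sect. Baryandra",
    "Begonia sharpeana F.Muell.",
    "Begonia Sect. Trilobaria A.DC.",
    "Donaldia ulmifolia (Willd.) Klotzsch",
    "Begonia acaulis Merr. & L.M.Perry",
]

# "Begonia", one space, then an epithet of lower-case letters or hyphens
SPECIES_PATTERN = re.compile(r"^Begonia [a-z-]+(\s|$)")


def is_species_line(line):
    """Return True if the line looks like 'Begonia <epithet> ...'."""
    return bool(SPECIES_PATTERN.match(line))


species_only = [line for line in sample if is_species_line(line)]
print(species_only)
```

The rest of this section carries on with the simpler `startswith` filter, so the outputs below still contain the heading and section rows.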

We can make this list into a dataframe:  

    df2 = pd.DataFrame(LIST, columns=["column_name"])

In [99]:
import pandas as pd
png_beg = pd.DataFrame(Species, columns=['Text'])

In [100]:
png_beg.head(3)

Unnamed: 0,Text
0,Begonia L.
1,Begonia sharpeana F.Muell.
2,Begonia Sect. Trilobaria A.DC.


We need to split these strings into columns.  Use:  
    
        df["Column"].str.split("delimiter", expand=True)
    
This will make as many columns as are needed for the string with the most delimiters; shorter rows are padded with empty values.

In [102]:
png_all = png_beg["Text"].str.split(" ", expand=True)
png_all.head(10)

Unnamed: 0,0,1,2,3,4,5,6,7
0,Begonia,L.,,,,,,
1,Begonia,sharpeana,F.Muell.,,,,,
2,Begonia,Sect.,Trilobaria,A.DC.,,,,
3,Begonia,acaulis,Merr.,&,L.M.Perry,,,
4,Begonia,bartlettiana,Merr.,&,L.M.Perry,,,
5,Begonia,kaniensis,Irmsch.,,,,,
6,Begonia,minjemensis,Irmsch.,,,,,
7,Begonia,subcyclophylla,Irmsch.,,,,,
8,Begonia,ulmifolia,Willd.,,,,,
9,Begonia,dasycarpa,A.DC.,,,,,
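
To make the behaviour of `expand=True` concrete, here is a small sketch on a two-row toy dataframe (the `names` dataframe and the `Authority` column name are made up for illustration). It also shows the `n` argument of `str.split`, which caps the number of splits so that a multi-word authority stays in one column; this is an alternative to the eight-column layout used here, not what the notebook itself does.

```python
import pandas as pd

# Toy data: two names taken from the checklist output above
names = pd.DataFrame(
    {"Text": ["Begonia sharpeana F.Muell.", "Begonia acaulis Merr. & L.M.Perry"]}
)

# Split on every space: the widest row decides the number of columns,
# and shorter rows are padded with missing values
parts = names["Text"].str.split(" ", expand=True)
print(parts)

# Split at most twice: genus, species, and everything else in one column
limited = names["Text"].str.split(" ", n=2, expand=True)
limited.columns = ["Genus", "Species", "Authority"]
print(limited)
```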


It will be helpful to rename the columns:  

    df.columns = ['name1', 'name2', 'name3']

In [105]:
png_all.columns =['Genus', 'Species', 'Auth1', 'And', 'Auth2', 'And2', 'Auth3', 'Stuff']
png_all.head(10)

Unnamed: 0,Genus,Species,Auth1,And,Auth2,And2,Auth3,Stuff
0,Begonia,L.,,,,,,
1,Begonia,sharpeana,F.Muell.,,,,,
2,Begonia,Sect.,Trilobaria,A.DC.,,,,
3,Begonia,acaulis,Merr.,&,L.M.Perry,,,
4,Begonia,bartlettiana,Merr.,&,L.M.Perry,,,
5,Begonia,kaniensis,Irmsch.,,,,,
6,Begonia,minjemensis,Irmsch.,,,,,
7,Begonia,subcyclophylla,Irmsch.,,,,,
8,Begonia,ulmifolia,Willd.,,,,,
9,Begonia,dasycarpa,A.DC.,,,,,


Subset the dataframe to just the genus, species and first author name:  

    df2=df[["Column1","Column2"]]

In [109]:
png_species=png_all[["Genus", "Species", "Auth1"]]
png_species.head(10)

Unnamed: 0,Genus,Species,Auth1
0,Begonia,L.,
1,Begonia,sharpeana,F.Muell.
2,Begonia,Sect.,Trilobaria
3,Begonia,acaulis,Merr.
4,Begonia,bartlettiana,Merr.
5,Begonia,kaniensis,Irmsch.
6,Begonia,minjemensis,Irmsch.
7,Begonia,subcyclophylla,Irmsch.
8,Begonia,ulmifolia,Willd.
9,Begonia,dasycarpa,A.DC.


We will join this to the GenBank list to see what sequences exist for these species.

Check the structure of the NCBI file using:  
    
    ! head my_file.txt
    ! tail my_file.txt

In [20]:
#! head nuccore_result.txt


1. Begonia minor voucher L.L. Forrest 161 (E) internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
764 bp linear DNA 
AF485171.1 GI:33320582

2. Begonia cubensis voucher L.L. Forrest 159 (E) internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence
759 bp linear DNA 
AF485169.1 GI:33320580

3. Begonia odorata voucher L.L. Forrest 158 (E) internal transcribed spacer 1, partial sequence; 5.8S ribosomal RNA gene, complete sequence; and internal transcribed spacer 2, partial sequence


In [32]:
#! tail nuccore_result.txt

673 bp linear DNA 
AB302885.1 GI:187761564

7143. Fusarium oxysporum f. sp. batatas strain NRRL 25594 elongation factor EF1 alpha-like mRNA, partial sequence
650 bp linear mRNA 
AY337717.1 GI:43336389

7144. Fusarium oxysporum f. sp. batatas strain NRRL 25594 beta-tubulin-like mRNA, partial sequence
569 bp linear mRNA 
AY337716.1 GI:43336363


We want to read this in with the record number ignored, the first two words put into the columns "Genus" and "Species", and the following lines put into columns of their own.  Start by reading in the file as a table, giving the single column the name "Text".

In [75]:
genbank = pd.read_table('../datasets/nuccore_result.txt', header=None, names=["Text"])

In [76]:
genbank.head(10)

Unnamed: 0,Text
0,1. Begonia minor voucher L.L. Forrest 161 (E) ...
1,764 bp linear DNA
2,AF485171.1 GI:33320582
3,2. Begonia cubensis voucher L.L. Forrest 159 (...
4,759 bp linear DNA
5,AF485169.1 GI:33320580
6,3. Begonia odorata voucher L.L. Forrest 158 (E...
7,741 bp linear DNA
8,AF485168.1 GI:33320579
9,4. Begonia obliqua internal transcribed spacer...


In [70]:
genbank.describe()

Unnamed: 0,0
count,21432
unique,15883
top,606 bp linear DNA
freq,153
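
The row count above (21432) is exactly three times the number of records in the file (7144), which suggests that each record occupies three non-blank lines: description, length, and accession. Under that assumption, the single column could be folded into one row per record, as in the sketch below. The `lines` series is a shortened, illustrative stand-in for `genbank["Text"]`, and the column names are made up for this example.

```python
import pandas as pd

# Shortened stand-in for the "Text" column read from nuccore_result.txt
lines = pd.Series([
    "1. Begonia minor voucher L.L. Forrest 161 (E) internal transcribed spacer 1, partial sequence",
    "764 bp linear DNA",
    "AF485171.1 GI:33320582",
    "2. Begonia cubensis voucher L.L. Forrest 159 (E) internal transcribed spacer 1, partial sequence",
    "759 bp linear DNA",
    "AF485169.1 GI:33320580",
])

# Assumption: every record is exactly three lines long
if len(lines) % 3 != 0:
    raise ValueError("Line count is not a multiple of three")

# Fold the flat series into rows of three
records = pd.DataFrame(
    lines.to_numpy().reshape(-1, 3),
    columns=["Description", "Length", "Accession"],
)
print(records)
```

The notebook takes a different route below and filters the rows by their text instead, which does not depend on the three-line assumption.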


This is not very useful as it stands.  We need to keep just the rows that contain "Begonia" and split those lines into columns.

To pull out the Begonia rows use:  
        
    df_new = df[df["column"].str.contains("matching_text")]
    

In [121]:
beg_gen =  genbank[genbank["Text"].str.contains("Begonia")]

In [122]:
beg_gen.head(4)

Unnamed: 0,Text
0,1. Begonia minor voucher L.L. Forrest 161 (E) ...
3,2. Begonia cubensis voucher L.L. Forrest 159 (...
6,3. Begonia odorata voucher L.L. Forrest 158 (E...
9,4. Begonia obliqua internal transcribed spacer...
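
Note that `str.contains("Begonia")` matches the word anywhere in the line, so it also keeps records for other organisms whose description merely mentions Begonia. A stricter filter can anchor the match to the start of the line, as in this sketch. It assumes every description line starts with a record number, a full stop, and then the genus; the last row of `genbank_demo` is a made-up example of a non-Begonia record.

```python
import pandas as pd

genbank_demo = pd.DataFrame({"Text": [
    "1. Begonia minor voucher L.L. Forrest 161 (E) internal transcribed spacer 1",
    "764 bp linear DNA",
    "AF485171.1 GI:33320582",
    # Made-up record: another organism with "Begonia" in its isolate name
    "6767. Bemisia tabaci isolate collected_on_Begonia mitochondrial",
]})

# Loose filter: "Begonia" anywhere in the line
loose = genbank_demo[genbank_demo["Text"].str.contains("Begonia")]

# Strict filter: record number, full stop, then the genus Begonia
pattern = r"^\d+\.\s+Begonia\s"
begonia_only = genbank_demo[genbank_demo["Text"].str.contains(pattern, regex=True)]

print(len(loose), len(begonia_only))
```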


We need to split these strings into columns.  Use the string splitter as you did earlier:  
    
        df["Column"].str.split("delimiter", expand=True)
    
The delimiter for this file should be ";"

In [158]:
all = beg_gen["Text"].str.split(";", expand=True)

In [159]:
all.head(3)

Unnamed: 0,0,1,2,3
0,1. Begonia minor voucher L.L. Forrest 161 (E) ...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
3,2. Begonia cubensis voucher L.L. Forrest 159 (...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
6,3. Begonia odorata voucher L.L. Forrest 158 (E...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",


In [160]:
all.tail(3)

Unnamed: 0,0,1,2,3
20298,6767. Bemisia tabaci isolate 159_ArujaSP_Begon...,mitochondrial,,
20301,6768. Bemisia tabaci isolate 105_GuarulhosSP_B...,mitochondrial,,
20304,6769. Bemisia tabaci isolate 90_LondrinaPR_Beg...,mitochondrial,,


In [161]:
all.columns =['Sample', 'Locus1', 'Locus2', 'Stuff']
all.head(10)

Unnamed: 0,Sample,Locus1,Locus2,Stuff
0,1. Begonia minor voucher L.L. Forrest 161 (E) ...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
3,2. Begonia cubensis voucher L.L. Forrest 159 (...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
6,3. Begonia odorata voucher L.L. Forrest 158 (E...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
9,4. Begonia obliqua internal transcribed spacer...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
12,5. Begonia silletensis subsp. mengyangensis vo...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
15,6. Begonia cf. breviramosa Forrest 138 interna...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
18,7. Begonia sp. Forrest 190 voucher L.L. Forres...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",
21,"8. Begonia obliqua 26S ribosomal RNA gene, par...",,,
24,9. Begonia dregei subsp. homonyma 18S ribosoma...,"internal transcribed spacer 1, 5.8S ribosomal...","and 28S ribosomal RNA gene, partial sequence",
27,10. Begonia luxurians voucher L.L. Forrest 180...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",


Now you need to split the first column to isolate the species name.  Use str.split again with expand=True and choose the third column for the species (index 2, because Python is 0-indexed); the genus is the second column (index 1).

In [162]:
genus = all["Sample"].str.split(" ", expand=True)[1]
species = all["Sample"].str.split(" ", expand=True)[2]

In [155]:
type(species)

pandas.core.series.Series

Add these new series to the dataframe

In [163]:
all = all.assign(Species = species)
all = all.assign(Genus = genus)
all.head(10)

Unnamed: 0,Sample,Locus1,Locus2,Stuff,Species,Genus
0,1. Begonia minor voucher L.L. Forrest 161 (E) ...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,minor,Begonia
3,2. Begonia cubensis voucher L.L. Forrest 159 (...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,cubensis,Begonia
6,3. Begonia odorata voucher L.L. Forrest 158 (E...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,odorata,Begonia
9,4. Begonia obliqua internal transcribed spacer...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,obliqua,Begonia
12,5. Begonia silletensis subsp. mengyangensis vo...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,silletensis,Begonia
15,6. Begonia cf. breviramosa Forrest 138 interna...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,cf.,Begonia
18,7. Begonia sp. Forrest 190 voucher L.L. Forres...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,sp.,Begonia
21,"8. Begonia obliqua 26S ribosomal RNA gene, par...",,,,obliqua,Begonia
24,9. Begonia dregei subsp. homonyma 18S ribosoma...,"internal transcribed spacer 1, 5.8S ribosomal...","and 28S ribosomal RNA gene, partial sequence",,dregei,Begonia
27,10. Begonia luxurians voucher L.L. Forrest 180...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,luxurians,Begonia
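
As an alternative to splitting the same column twice, `str.extract` with named groups can pull out both words in one pass. This is a sketch only, on shortened, illustrative sample strings, and it assumes the "number. Genus species" layout seen above. As with the split approach, qualifiers such as "cf." or "sp." end up in the Species column and will simply fail to match anything in the checklist.

```python
import pandas as pd

# Shortened, illustrative sample descriptions
samples = pd.DataFrame({"Sample": [
    "1. Begonia minor voucher L.L. Forrest 161 (E) internal transcribed spacer 1, partial sequence",
    "8. Begonia obliqua 26S ribosomal RNA gene",
    "6. Begonia cf. breviramosa Forrest 138",
]})

# Named groups become the column names of the extracted dataframe
name_pattern = r"^\d+\.\s+(?P<Genus>\S+)\s+(?P<Species>\S+)"
extracted = samples["Sample"].str.extract(name_pattern)

# Attach the new columns alongside the original text
samples = samples.join(extracted)
print(samples[["Genus", "Species"]])
```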


Now we just join the dataframes to see which sequences we have for Begonia from Papua New Guinea

In [157]:
png_species.head(10)

Unnamed: 0,Genus,Species,Auth1
0,Begonia,L.,
1,Begonia,sharpeana,F.Muell.
2,Begonia,Sect.,Trilobaria
3,Begonia,acaulis,Merr.
4,Begonia,bartlettiana,Merr.
5,Begonia,kaniensis,Irmsch.
6,Begonia,minjemensis,Irmsch.
7,Begonia,subcyclophylla,Irmsch.
8,Begonia,ulmifolia,Willd.
9,Begonia,dasycarpa,A.DC.


Merge the two dataframes on their shared "Species" column

In [166]:
result = png_species.merge(all, on ='Species', suffixes=("_PNG", "_in_gen_bank"))
result

Unnamed: 0,Genus_PNG,Species,Auth1,Sample,Locus1,Locus2,Stuff,Genus_in_gen_bank
0,Begonia,ulmifolia,Willd.,61. Begonia ulmifolia voucher L.L. Forrest 169...,"5.8S ribosomal RNA gene, complete sequence","and internal transcribed spacer 2, partial se...",,Begonia
1,Begonia,ulmifolia,Willd.,895. Begonia ulmifolia isolate EDNA12-0025425 ...,plastid,,,Begonia
2,Begonia,ulmifolia,Willd.,1020. Begonia ulmifolia isolate EDNA12-0025425...,plastid,,,Begonia
3,Begonia,ulmifolia,Willd.,1105. Begonia ulmifolia isolate EDNA120025425 ...,plastid,,,Begonia
4,Begonia,ulmifolia,Willd.,1271. Begonia ulmifolia trnC-trnD intergenic s...,,,,Begonia
...,...,...,...,...,...,...,...,...
118,Begonia,symsanguinea,L.L.Forrest,4717. Begonia symsanguinea voucher Glasgow Bot...,mitochondrial,,,Begonia
119,Begonia,symsanguinea,L.L.Forrest,4746. Begonia symsanguinea voucher Glasgow Bot...,mitochondrial,,,Begonia
120,Begonia,symsanguinea,L.L.Forrest,4775. Begonia symsanguinea voucher Glasgow Bot...,chloroplast,,,Begonia
121,Begonia,symsanguinea,L.L.Forrest,4804. Begonia symsanguinea voucher Glasgow Bot...,"photosystem II 44 kDa protein (psbC), complet...","and psbC-trnS intergenic spacer, partial sequ...",chloroplast,Begonia


Count the number of GenBank accessions per PNG species

In [169]:
result['Species'].value_counts()

ulmifolia            30
aptera               21
symsanguinea         18
brevirimosa          12
koordersii            8
pseudolateralis       7
bipinnatifida         6
serratipetala         6
gesnerioides          3
augustae              3
weigallii             3
argenteomarginata     3
strigosa              3
Name: Species, dtype: int64
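
The merge above keeps only names that appear in both tables, so it answers "which species have sequences" but not "which still need sequencing". One way to get at the second question is to test each checklist name for membership in the GenBank species column, as sketched below. The two small dataframes are hypothetical miniatures standing in for `png_species` and the GenBank table.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables
png_demo = pd.DataFrame({"Species": ["ulmifolia", "acaulis", "kaniensis"]})
genbank_demo = pd.DataFrame({"Species": ["ulmifolia", "ulmifolia", "minor"]})

# True where a checklist name has at least one GenBank record
sequenced = set(genbank_demo["Species"])
has_sequence = png_demo["Species"].isin(sequenced)

# Names on the checklist with no GenBank record at all
to_sequence = png_demo.loc[~has_sequence, "Species"].tolist()

print(f"{has_sequence.sum()} with sequences, {len(to_sequence)} still to sequence")
print(to_sequence)
```

On the real data the "still to sequence" count would be an overestimate, because the checklist dataframe built earlier still contains the genus heading, section names, and synonyms alongside accepted species.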