# Plasmid_data_analysis

Plasmid data analysis is a test Jupyter notebook for practising several Python 3 modules and data analysis tools:
- urllib
- pandas
- matplotlib
- etc.

First we need to download the file we are going to use from the [NCBI](https://www.ncbi.nlm.nih.gov/) __FTP__ server:
[ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt](ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt).

For that we use urllib to retrieve the file, which lists all plasmids added to the RefSeq database:

In [12]:
import sys

print(sys.version)

#sys.path.append('usr/local/share/jupyter/kernels')

3.6.6 (default, Sep 12 2018, 18:26:19) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]]


In [13]:
import urllib.request  # a bare "import urllib" does not guarantee that urllib.request is loaded
import pandas as pd

In [14]:
#Alternative way to download the file (the content would still need to be stored in a file)
#url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt'
#with urllib.request.urlopen(url) as plasmidurl:
#    plasmids = plasmidurl.read()

In [15]:
#faster way to store the database in a file
url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt'
urllib.request.urlretrieve(url, 'plasmid_ddbb.txt')

('plasmid_ddbb.txt', <email.message.Message at 0x7fbbe3030978>)
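
When re-running the notebook it can be convenient to skip the download if a local copy already exists. A minimal sketch, assuming a hypothetical helper `download_if_missing` (not part of the original notebook) whose `fetch` parameter defaults to `urllib.request.urlretrieve`:

```python
import os
import urllib.request


def download_if_missing(url, dest, fetch=urllib.request.urlretrieve):
    """Download url to dest unless dest already exists.

    Hypothetical helper for illustration only.
    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(dest):
        return False
    fetch(url, dest)
    return True
```

It could then be called as `download_if_missing(url, 'plasmid_ddbb.txt')`. Note that this never refreshes an outdated local copy; delete the file to force a new download.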

The downloaded file is a TSV (tab-separated values) file listing all plasmids added to the RefSeq database. It looks like this:

In [17]:
from itertools import islice

with open("plasmid_ddbb.txt") as plasmid_ddbb:
    #Read only the first 10 lines instead of loading the whole file
    head = list(islice(plasmid_ddbb, 10))
print(head)

['#Organism/Name\tKingdom\tGroup\tSubGroup\tPlasmid Name\tRefSeq\tINSDC\tSize (Kb)\tGC%\tProtein\trRNA\ttRNA\tOther RNA\tGene\tPseudogene\n', 'Acaryochloris marina MBIC11017\tBacteria\tTerrabacteria group\tCyanobacteria/Melainabacteria group\tpREB1\tNC_009926.1\tCP000838\t374.161\t47.3483\t309\t-\t-\t-\t333\t24\n', 'Acaryochloris marina MBIC11017\tBacteria\tTerrabacteria group\tCyanobacteria/Melainabacteria group\tpREB2\tNC_009927.1\tCP000839\t356.087\t45.3367\t336\t-\t-\t-\t360\t24\n', 'Acaryochloris marina MBIC11017\tBacteria\tTerrabacteria group\tCyanobacteria/Melainabacteria group\tpREB3\tNC_009928.1\tCP000840\t273.121\t45.1902\t250\t-\t-\t-\t290\t40\n', 'Acaryochloris marina MBIC11017\tBacteria\tTerrabacteria group\tCyanobacteria/Melainabacteria group\tpREB4\tNC_009929.1\tCP000841\t226.68\t45.877\t209\t-\t-\t-\t225\t16\n', 'Acaryochloris marina MBIC11017\tBacteria\tTerrabacteria group\tCyanobacteria/Melainabacteria group\tpREB5\tNC_009930.1\tCP000842\t177.162\t44.6755\t176\t-\t-\t

This prints the first 10 lines of the file as a raw Python list.

To see the same information in a human-readable format we use pandas.
With the `head` method, we can see the first 10 rows of a pandas DataFrame:

In [18]:
#Load the TSV file into a pandas DataFrame
plasmid_df = pd.read_csv("plasmid_ddbb.txt", sep='\t', header=0)
#Other useful parameters: nrows=5 reads only the first 5 rows

In [9]:
plasmid_df.head(10)

Unnamed: 0,#Organism/Name,Kingdom,Group,SubGroup,Plasmid Name,RefSeq,INSDC,Size (Kb),GC%,Protein,rRNA,tRNA,Other RNA,Gene,Pseudogene
0,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB1,NC_009926.1,CP000838,374.161,47.3483,309,-,-,-,333,24
1,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB2,NC_009927.1,CP000839,356.087,45.3367,336,-,-,-,360,24
2,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB3,NC_009928.1,CP000840,273.121,45.1902,250,-,-,-,290,40
3,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB4,NC_009929.1,CP000841,226.68,45.877,209,-,-,-,225,16
4,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB5,NC_009930.1,CP000842,177.162,44.6755,176,-,-,-,179,3
5,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB6,NC_009931.1,CP000843,172.728,47.1267,152,-,-,-,165,13
6,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB7,NC_009932.1,CP000844,155.11,45.5909,130,-,-,-,136,6
7,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB8,NC_009933.1,CP000845,120.693,45.4185,103,-,-,-,109,6
8,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB9,NC_009934.1,CP000846,2.133,42.5223,2,-,-,-,3,1
9,Acetobacter aceti,Bacteria,Proteobacteria,Alphaproteobacteria,pAC5,NC_001275.1,AF110140,5.123,55.8657,2,-,-,-,2,-


In [19]:
#The shape attribute gives the DataFrame dimensions as (rows, columns)
print('This database has %s plasmids with information of %s features' %  plasmid_df.shape)

This database has 14406 plasmids with information of 15 features


In [38]:
plasmid_df.describe()

| | Size (Kb) | GC% |
|---|---|---|
| count | 14406.0 | 14406.0 |
| mean | 111.212034 | 45.939259 |
| std | 241.435367 | 11.771237 |
| min | 0.537 | 0.0 |
| 25% | 11.975 | 35.87895 |
| 50% | 48.8675 | 46.72905 |
| 75% | 111.69275 | 54.125875 |
| max | 5836.68 | 87.4773 |


In [69]:
plasmid_df['rRNA'].describe()

count     14406
unique       14
top           -
freq      14233
Name: rRNA, dtype: object

In [56]:
plasmid_df[['#Organism/Name','Kingdom','Group','SubGroup']].describe()

| | #Organism/Name | Kingdom | Group | SubGroup |
|---|---|---|---|---|
| count | 14406 | 14406 | 14406 | 14406 |
| unique | 3230 | 4 | 22 | 57 |
| top | Escherichia coli | Bacteria | Proteobacteria | Gammaproteobacteria |
| freq | 1151 | 14094 | 8604 | 6214 |


In [70]:
#locate a specific plasmid
#plasmid_df.iloc[4497]
#plasmid_df['Plasmid Name']

The key fields are completely filled. Fields such as 'rRNA', 'tRNA' or 'Other RNA' have many missing values, encoded as '-'. As an attempt to clean them we can make a copy and replace '-' with 'NaN' to find out whether those missing values alter the descriptive statistics. We will find out later that this is __not__ the case.

In [76]:
plasmid_missing = plasmid_df.copy()
plasmid_na = plasmid_missing.replace('-', 'NaN')
#plasmid_na

Unnamed: 0,#Organism/Name,Kingdom,Group,SubGroup,Plasmid Name,RefSeq,INSDC,Size (Kb),GC%,Protein,rRNA,tRNA,Other RNA,Gene,Pseudogene
0,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB1,NC_009926.1,CP000838,374.161,47.3483,309,,,,333,24
1,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB2,NC_009927.1,CP000839,356.087,45.3367,336,,,,360,24
2,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB3,NC_009928.1,CP000840,273.121,45.1902,250,,,,290,40
3,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB4,NC_009929.1,CP000841,226.680,45.8770,209,,,,225,16
4,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB5,NC_009930.1,CP000842,177.162,44.6755,176,,,,179,3
5,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB6,NC_009931.1,CP000843,172.728,47.1267,152,,,,165,13
6,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB7,NC_009932.1,CP000844,155.110,45.5909,130,,,,136,6
7,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB8,NC_009933.1,CP000845,120.693,45.4185,103,,,,109,6
8,Acaryochloris marina MBIC11017,Bacteria,Terrabacteria group,Cyanobacteria/Melainabacteria group,pREB9,NC_009934.1,CP000846,2.133,42.5223,2,,,,3,1
9,Acetobacter aceti,Bacteria,Proteobacteria,Alphaproteobacteria,pAC5,NC_001275.1,AF110140,5.123,55.8657,2,,,,2,


In [89]:
print(plasmid_df['rRNA'].describe(), plasmid_na['rRNA'].describe())


count     14406
unique       14
top           -
freq      14233
Name: rRNA, dtype: object count     14406
unique       14
top         NaN
freq      14233
Name: rRNA, dtype: object
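
The two summaries differ only in the `top` label because `replace('-', 'NaN')` inserts the string `'NaN'`, not a real missing value, so the column keeps its `object` dtype and `count` still includes every row. A minimal sketch of one alternative, assuming a small hand-made frame (`toy_df`, illustrative only, built from three rows of the table shown earlier) instead of the full file: replace '-' with `numpy.nan` and convert the column with `pd.to_numeric`.

```python
import numpy as np
import pandas as pd

# Three rows copied from the table shown earlier; '-' marks a missing value
toy_df = pd.DataFrame({
    'Plasmid Name': ['pREB1', 'pREB9', 'pAC5'],
    'Pseudogene': ['24', '1', '-'],
})

# Replacing '-' with the string 'NaN' keeps every cell "filled"
as_string = toy_df['Pseudogene'].replace('-', 'NaN')

# Replacing '-' with np.nan and converting gives a numeric column with a real gap
as_number = pd.to_numeric(toy_df['Pseudogene'].replace('-', np.nan))

print(as_string.count(), as_number.count())  # 3 2
print(as_number.mean())                      # 12.5
```

When loading the full file, passing `na_values='-'` to `pd.read_csv` should give the same result in one step, although this has not been checked against the whole database here.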


Plasmids are very promiscuous structures that can be transferred between different species. To find a relationship between species and other features, we first need to reduce each organism name to genus and species only, because the supplied field also includes the strain or variety, which makes each variant count as a different organism.

We can do it easily with string manipulation.
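
One way to do this, sketched here under the assumption that the first two whitespace-separated words of `#Organism/Name` are the genus and the species (which may not hold for every entry in the database): split the name, keep the first two words, and join them again. The frame `names_df` and the column name `Species` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Organism names copied from the table shown earlier
names_df = pd.DataFrame({
    '#Organism/Name': [
        'Acaryochloris marina MBIC11017',
        'Acaryochloris marina MBIC11017',
        'Acetobacter aceti',
    ],
})

# Keep only the first two words (assumed to be genus and species)
names_df['Species'] = (
    names_df['#Organism/Name']
    .str.split()
    .str[:2]
    .str.join(' ')
)

# Number of plasmids per species
print(names_df['Species'].value_counts())
```

The same three chained `.str` calls could be applied to `plasmid_df['#Organism/Name']` to group the full database by species.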

### Author: pedroscampoy@gmail.com
#### TUTORIAL UNDER CONSTRUCTION