# Plasmid_data_analysis

Plasmid data analysis is a test Jupyter notebook for practising several Python 3 commands and data analysis tools:
- urllib
- pandas
- matplotlib
- etc.

First we need to download the file we are going to use from the [NCBI](https://www.ncbi.nlm.nih.gov/) __FTP__ server:
[ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt](ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt).

For that we use urllib to retrieve the file containing all plasmids added to the RefSeq database:

In [None]:
import sys

print(sys.version)

#sys.path.append('usr/local/share/jupyter/kernels')

In [None]:
import urllib.request  # "import urllib" alone does not guarantee that urllib.request is loaded
import matplotlib
import pandas as pd

In [None]:
#Alternative way to download the file (the content would then need to be saved to a file)
#url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt'
#with urllib.request.urlopen(url) as plasmidurl:
#    plasmids = plasmidurl.read()

In [3]:
#faster way to store the database in a file
url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt'
urllib.request.urlretrieve(url, 'plasmid_ddbb.txt')

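
Because the download can take a while, it may be convenient to skip it when the file is already on disk. The following is only a sketch: `download_if_missing` and its `retrieve` parameter are hypothetical names introduced here for illustration, and the only library call assumed is `urllib.request.urlretrieve`, the same one used in the cell above.

```python
import os
import urllib.request


def download_if_missing(url, dest, retrieve=urllib.request.urlretrieve):
    """Download url to dest unless a non-empty dest already exists.

    `retrieve` can be swapped for another callable taking (url, dest),
    which makes the logic easy to try without network access.
    Returns True if a download was performed, False if the file was reused.
    """
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return False
    retrieve(url, dest)
    return True


# Usage (commented out so that running this cell does not hit the network):
# url = 'ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/plasmids.txt'
# download_if_missing(url, 'plasmid_ddbb.txt')
```

Note that a file left behind by an interrupted download would be reused as-is, so deleting it by hand may be necessary in that case.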

The downloaded file is a TSV (tab-separated values) file with all plasmids added to the RefSeq database. It looks like this:

In [None]:
with open("plasmid_ddbb.txt") as plasmid_ddbb:
    head = plasmid_ddbb.readlines()[0:10]
print(head)

This prints the top 10 lines of the file.
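
`readlines()` loads the whole file into memory before slicing. For a large file, a possible alternative is to stop reading after the lines that are needed. This is a minimal sketch using `itertools.islice` from the standard library; the helper name `first_lines` is made up for this example.

```python
from itertools import islice


def first_lines(path, n=10):
    """Return the first n lines of a text file without reading the rest."""
    with open(path) as handle:
        # islice stops iterating over the file after n lines
        return [line.rstrip('\n') for line in islice(handle, n)]


# Usage, assuming the file from the download step exists:
# for line in first_lines('plasmid_ddbb.txt'):
#     print(line)
```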

To see the same information in a human-readable format we use pandas.
With the `head` method we can see the first 10 rows of a pandas DataFrame:

In [None]:
#Load the TSV file into a pandas DataFrame
plasmid_df = pd.read_csv("plasmid_ddbb.txt", sep='\t', header=0)
#another useful parameter: nrows=5 reads only the first 5 rows

In [None]:
plasmid_df.head(10)
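
To try the loading step without the full download, the same `read_csv` call can be pointed at a small in-memory sample. The sketch below is illustrative only: the rows in `sample_tsv` are placeholders typed by hand, not NCBI records, and only a few of the real column names are imitated.

```python
import io

import pandas as pd

# Placeholder sample imitating a tab-separated report; not real NCBI records.
sample_tsv = (
    "#Organism/Name\tPlasmid Name\tSize (Kb)\trRNA\n"
    "Genus1 species1 strain A\tpA\t10.5\t-\n"
    "Genus1 species1 strain B\tpB\t120.0\t2\n"
    "Genus2 species2\tpC\t4.2\t-\n"
)

# nrows limits how many data rows are parsed (the header line is not counted).
preview_df = pd.read_csv(io.StringIO(sample_tsv), sep='\t', header=0, nrows=2)
print(preview_df.shape)
print(preview_df.dtypes)
```

Reading only a few rows this way is a quick check that the separator and header settings are right before parsing the whole file.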

In [None]:
#The shape attribute gives the DataFrame dimensions as (rows, columns)
print('This database has %s plasmids with information on %s features' % plasmid_df.shape)

In [None]:
plasmid_df.describe()

In [None]:
plasmid_df['rRNA'].describe()

In [None]:
plasmid_df[['#Organism/Name','Kingdom','Group','SubGroup']].describe()

In [None]:
#locate a specific plasmid
#plasmid_df.iloc[4497]
#plasmid_df['Plasmid Name']

The key fields are completely filled. Fields such as 'rRNA', 'tRNA' or 'Other RNA' have many missing values. As an attempt to clean them, we can make a copy and replace '-' with 'NaN' to find out whether those missing values alter the descriptive stats. We will find out later that this is __not__ the case.

In [None]:
plasmid_missing = plasmid_df.copy()
plasmid_na = plasmid_missing.replace('-', 'NaN')
#plasmid_na

In [None]:
print(plasmid_df['rRNA'].describe(), plasmid_na['rRNA'].describe())
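
Note that `replace('-', 'NaN')` inserts the text 'NaN', which pandas treats as an ordinary string rather than as a missing value. The sketch below uses a handful of made-up values and assumes numpy is available. It shows how a real missing-value marker (`numpy.nan`) behaves differently, so it may be worth repeating the comparison that way.

```python
import numpy as np
import pandas as pd

# Made-up values; '-' stands for "not reported".
toy = pd.DataFrame({'rRNA': ['-', '2', '-', '5']})

as_text = toy.replace('-', 'NaN')      # the string 'NaN' is still an ordinary value
as_missing = toy.replace('-', np.nan)  # np.nan is recognised as missing

print(as_text['rRNA'].count())     # counts all four entries
print(as_missing['rRNA'].count())  # counts only the two non-missing entries

# Once the placeholders are real NaN, the column can be converted to numbers.
numeric = pd.to_numeric(as_missing['rRNA'])
print(numeric.mean())
```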


Plasmids are very promiscuous structures that can be transferred between different species. To find a relationship between species and other features, we first need to keep only the genus and species, because the supplied field also includes the variety, which would otherwise be interpreted as a different organism.

We can do it easily with string manipulation.
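
One way to do this, sketched here on a few hand-typed example names rather than rows from the file, is to keep the first two whitespace-separated words of the organism name. The new column name `Species` is an assumption made for this example.

```python
import pandas as pd

# Hand-typed example names in the "Genus species variety" layout; not rows from the file.
names = pd.DataFrame({
    '#Organism/Name': [
        'Escherichia coli K-12',
        'Escherichia coli O157:H7 str. Sakai',
        'Bacillus subtilis',
    ]
})

# Split on whitespace, keep the first two words, and join them back together.
names['Species'] = (
    names['#Organism/Name']
    .str.split()
    .str[:2]
    .str.join(' ')
)
print(names['Species'].tolist())
```

This simple rule assumes that every name starts with "Genus species". Names that do not follow that pattern (for example ones with a leading qualifier) would need extra handling.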

In [None]:
plasmid_df['Size (Kb)'].plot.box()
#plasmid_df
#plasmid_size.colum = "Size"
#plasmid_size
#plasmid_size['double size'] = plasmid_df['Size (Kb)']*2


### Author: pedroscampoy@gmail.com
#### TUTORIAL UNDER CONSTRUCTION