# Use Unix and/or Python to view, sort and parse (split or separate) large data sets such as those from genome-wide gene expression studies. <a name='UnixPython'>

### UNIX COMMANDS

In [None]:
!pwd  #lists the name of the current directory 
!ls #displays the contents of the current directory 
!wc #counts the number of lines, words, and bytes in a file 
!wc -l #counts the number of lines in a file 
!cat #prints the contents of a file
!head -n10 # prints the first 10 lines of a file. Note: if your file is large, use the head or tail command rather than the cat
           # command to examine its contents 
!tail -n10 #prints the last 10 lines of a file. Note: if your file is large, use the head or tail command rather than the cat
           # command to examine its contents 
!grep "target" source_file_name  #searches for the string "target" inside of the source file
!cut -f1,2,3  #cuts the specified columns (in this case 1,2,3) from a file
!echo text  #writes "text" to the screen or to a file
!zcat #prints the contents of a gzip-compressed file (a file with a .gz extension)
!sort #sorts a file, use the -k flag to sort by a specific column, the -r flag to sort in reverse order

### UNIX OPERATORS

In [None]:
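!ls > filelist.txt  # > writes the output of a command to a file (filelist.txt is a placeholder name), overwriting the file if it exists
!ls | wc -l  # | sends the output of one command to the input of the next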

To get help with a command, type the command followed by the `--help` flag (some tools also accept `-h`), or read its manual page with `man commandname`. 
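
As a rough illustration of how piped commands build on each other, the pipeline `grep "chr1" myFile.bed | cut -f1,2 | sort -k2,2n | head -n2` can be mimicked step by step in Python. This is only a sketch: the sample rows and the function name `mini_pipeline` are hypothetical and stand in for a real file.

```python
# Hypothetical tab-separated rows standing in for the lines of a BED-like file
lines = [
    "chr1\t500\tgeneA",
    "chr2\t100\tgeneB",
    "chr1\t200\tgeneC",
    "chr1\t900\tgeneD",
]

def mini_pipeline(lines, target, n):
    """Filter, cut, sort and truncate lines, mimicking a small Unix pipeline."""
    # grep "target": keep lines containing the target string
    matched = [line for line in lines if target in line]
    # cut -f1,2: keep the first two tab-separated columns
    cut = [line.split("\t")[:2] for line in matched]
    # sort -k2,2n: sort numerically by the second column
    ordered = sorted(cut, key=lambda fields: int(fields[1]))
    # head -n: keep the first n rows
    return ["\t".join(fields) for fields in ordered[:n]]

result = mini_pipeline(lines, "chr1", 2)
print(result)
```

Each intermediate list plays the role of the text flowing through one `|` in the shell version.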

### PYTHON COMMANDS

In [None]:
#gets the current working directory
import os
os.getcwd()
#changes the current working directory
os.chdir('..')
#lists files in the current directory, a single period (.)  stands for the current directory 
os.listdir('.')
#make a subdirectory in the working directory 
os.mkdir('YOUR DIRECTORY')


#Open a file and create a file object 
#The 'r' opens the file for reading. 'w' would open it for writing, overwriting any existing contents.  
sequence=open('sequence.txt','r')
#open a file removing white space and splitting at line breaks
data=open("bedfile.bed",'r').read().strip().split('\n')
   

#Read the sequence file contents into a string and print them 
seq_string=sequence.read()
print(seq_string)

#calculate the length of a string
len(seq_string)

#convert a number i to a string
i=5
str(i)


#make a for loop containing an if statement (here: count the G's in the string)
g_count=0
for i in range(0,len(seq_string)):
    if seq_string[i]=='G':
        g_count+=1
    
#Creating a dictionary
geneticcode3let={'UUU':'Phe','UUC':'Phe','UUA':'Leu','UUG':'Leu'}

#Defining a function that returns x
def my_function(a,b):
    '''this function adds two numbers '''
    x=a+b
    return(x)

#get help on a function or module 
help (my_function)

 
#write the contents of a notebook cell to a file (%%writefile must be the first line of the cell)
%%writefile ../helpers/central_dogma_helpers_updated.py

#write a string to an output file 
outf=open("outputfilename.txt",'w') #opens the file for writing
mydata="data"
outf.write(mydata)# writes data to the file
outf.close() # closes the file 


#append data to an existing file, without overwriting it
outf=open("outputfilename.txt",'a') #opens the file for appending 
       

#importing .tsv file into Python using pandas
import pandas as pd
df = pd.read_table(
     filepath_or_buffer='filename.tsv', 
     header=0,
     index_col=0
)


#importing .csv file into Python using pandas
df = pd.read_csv(
     filepath_or_buffer='filename.csv', 
     header=0,
     index_col=0
)

#specifying rows or columns in a dataframe
x=df['columnname']
x=df.loc[rowname]
x=df.iloc[rownumber(s),columnnumber(s)]

#sorting pandas dataframe
df_sorted=df.sort_values(by="columnname",ascending=False)
 
#importing packages
import Bio 
print(Bio.__version__)
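
The loop, if statement, dictionary and function snippets above can be combined into one small worked example. The following is a minimal sketch, assuming an RNA string whose codons appear in a four-entry subset of the codon table; the names `codon_to_amino_acid`, `count_base` and `translate` are hypothetical and chosen only for illustration.

```python
# A small subset of the codon table; codons missing from it are reported as '?'
codon_to_amino_acid = {'UUU': 'Phe', 'UUC': 'Phe', 'UUA': 'Leu', 'UUG': 'Leu'}

def count_base(sequence, base):
    """Count how often `base` occurs in `sequence` with a loop and an if statement."""
    count = 0
    for i in range(0, len(sequence)):
        if sequence[i] == base:
            count += 1
    return count

def translate(rna):
    """Translate an RNA string three letters at a time using the dictionary."""
    amino_acids = []
    # Step through the string in jumps of 3; an incomplete trailing codon is ignored
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        amino_acids.append(codon_to_amino_acid.get(codon, '?'))
    return amino_acids

rna = 'UUUUUGUUC'
print(count_base(rna, 'G'))
print(translate(rna))
```

Using `dict.get` with a default avoids a `KeyError` when a codon is not in the table.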

## BIOLOGY TOOLS 

In [None]:
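#The SeqIO module provides functions for reading and writing sequence files 
from Bio import SeqIO
seq1=SeqIO.read('sequence1.txt',"fasta")
seq2=SeqIO.read('sequence2.txt',"fasta") #placeholder second file, needed for the pairwise alignment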


#Bio package contains alignment functions 

#pairwise alignment (recent Biopython releases deprecate pairwise2 in favor of Bio.Align.PairwiseAligner)
from Bio import pairwise2 
alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)

#multiple sequence alignment with MUSCLE algorithm (this wrapper module is deprecated in recent Biopython releases)
from Bio.Align.Applications import MuscleCommandline

#alignment to reference genome with Bowtie2 algorithm 
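
For the pairwise alignment step above: `globalxx` awards one point per matching character and applies no penalty for mismatches or gaps. Under that scoring, the best global alignment score equals the length of the longest common subsequence, which a short dynamic-programming table can compute. The following is a minimal pure-Python sketch of that idea, not Biopython's implementation; the function name `global_xx_score` is hypothetical.

```python
def global_xx_score(seq_a, seq_b):
    """Best global alignment score with match = 1, mismatch = 0, gap = 0."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            match = 1 if seq_a[i - 1] == seq_b[j - 1] else 0
            score[i][j] = max(
                score[i - 1][j - 1] + match,  # align the two characters
                score[i - 1][j],              # gap in seq_b
                score[i][j - 1],              # gap in seq_a
            )
    return score[-1][-1]

print(global_xx_score("ACCGT", "ACG"))
```

The sketch returns only the score; recovering the alignments themselves would require tracing back through the table.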

## TOOLS FOR STATISTICAL ANALYSIS AND MACHINE LEARNING

In [None]:
#PCA 
from sklearn.decomposition import PCA as sklearnPCA
#We decompose the data in dataframe df into 10 principal components 
sklearn_pca = sklearnPCA(n_components=10)
pca_results = sklearn_pca.fit_transform(df)

#KMeans clustering 
from sklearn.cluster import KMeans

#GO Term enrichment analysis can be run with the GOrilla web tool:
#http://cbl-gorilla.cs.technion.ac.il/
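
`KMeans` is imported above without a usage example. As a rough illustration of the idea behind k-means clustering (assign each point to its nearest center, then move each center to the mean of its points, and repeat), here is a pure-Python sketch on one-dimensional values. It is not scikit-learn's implementation; `simple_kmeans_1d`, the starting centers and the sample values are all hypothetical.

```python
def simple_kmeans_1d(values, centers, iterations=10):
    """Cluster numbers around the given starting centers with Lloyd's algorithm."""
    centers = list(centers)
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: label each value with the index of its nearest center
        labels = [
            min(range(len(centers)), key=lambda k: abs(value - centers[k]))
            for value in values
        ]
        # Update step: move each center to the mean of the values assigned to it
        for k in range(len(centers)):
            members = [v for v, label in zip(values, labels) if label == k]
            if members:  # leave a center unchanged if no value was assigned to it
                centers[k] = sum(members) / len(members)
    return labels, centers

# Hypothetical expression values forming two well-separated groups
expression = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
labels, centers = simple_kmeans_1d(expression, centers=[0.0, 5.0])
print(labels, centers)
```

Real expression data has many dimensions and the result depends on the starting centers, which is why library implementations typically try several initializations.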
    


### Tools for investigating if a variant is in a UTR, coding region (CDS), promoter or enhancer

[Bedtools](http://bedtools.readthedocs.io/en/latest/)
* `bedtools sort -i myFile.bed` sorts myFile.bed by chromosome, start coordinate, end coordinate 
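* `bedtools intersect -wa -a fileA -b fileB` -- reports each entry in fileA that overlaps an entry in fileB (the `-wa` flag writes the original fileA entry for every overlap) 
* `bedtools closest -a fileA -b fileB` -- finds the closest entry in fileB to each entry in fileA. Both files should be sorted first, for example with `bedtools sort`. 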
* `bedtools getfasta -fi hg19.genome.fa -bed human_insulin_exon_boundaries.bed -fo human_insulin_exons.fa.out` -- extracts the fasta sequence from the hg19 reference genome corresponding to the positions in the input bed file 'human_insulin_exon_boundaries.bed'   
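
To make the overlap test behind `bedtools intersect` concrete: two BED intervals on the same chromosome overlap when each one starts before the other ends (BED coordinates are 0-based with an exclusive end). The sketch below reports each interval in A once if it overlaps anything in B, which is similar in spirit to adding the `-u` flag. It is a naive illustration, not how bedtools is implemented, and the function names and sample intervals are hypothetical.

```python
def overlaps(a, b):
    """Return True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def intersect_any(intervals_a, intervals_b):
    """Return the intervals in A that overlap at least one interval in B."""
    return [a for a in intervals_a if any(overlaps(a, b) for b in intervals_b)]

# Hypothetical single-base variants and exon intervals
variants = [("chr1", 100, 101), ("chr1", 500, 501), ("chr2", 100, 101)]
exons = [("chr1", 50, 150), ("chr2", 400, 600)]
print(intersect_any(variants, exons))
```

This compares every pair of intervals, which is fine for small inputs; bedtools exists precisely because genome-scale files need something faster.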


[WashU Epigenome Roadmap Browser](http://epigenomegateway.wustl.edu/browser/) with Chromatin State Tracks from the Public Track Roadmap Epigenomics Integrative Analysis Hub. 

# Analyze datasets from cases and controls to identify sites in the genome that are likely to be relevant to a disease. <a name='CasesandControls'>

<a href=#GWAS> See GWAS Review above </a>

* The Biobank Engine https://biobankengine.stanford.edu/ is a database of GWAS associations from the UK Biobank dataset. 


* The GWAS Catalog https://www.ebi.ac.uk/gwas/ is often regarded as a "gold standard" for querying known GWAS hits. 


* Gene Cards http://www.genecards.org/ provides information about the function of a gene as well as known associations with diseases, and known pathogenic variants inside the gene. 


# Query a large data set and visualize the data by making or interpreting a scatter plot, barplot, histogram or heatmap. <a name='Plot'>

### Plotting packages 

Plotly: https://plot.ly/python/ <br>
Matplotlib: https://matplotlib.org/

In [None]:
#load the necessary modules for plotly
import plotly 
plotly.offline.init_notebook_mode()
import plotly.plotly as py #in plotly version 4 and later this module moved to the separate chart_studio package
from plotly.graph_objs import *

In [None]:
#Make a Scatter Plot using plotly
trace=Scatter(
    #selects the x-values for the scatter plot
    x=x,       
    #selects the y-values for the scatter plot 
    y=y, 
    #adds in text that will be displayed on the points
    text=text,
    #defines the mode of the plot, in this case markers (as opposed to lines or text)
    mode="markers")

#Label the axes 
layout=Layout(xaxis=dict(title='x-axis title'),yaxis=dict(title='y-axis title'),showlegend=True)

#Draw the figure 
fig=Figure(data=[trace],layout=layout)
plotly.offline.iplot(fig)    

In [None]:
#Make a Bar plot using plotly
trace=Bar(
    #selects the x-values (categories) for the bar plot
    x=x,   
    #selects the y-values (bar heights) for the bar plot
    y=y      
    )
#Label the axes 
layout=Layout(xaxis=dict(title='x-axis title'),yaxis=dict(title='y-axis title'),showlegend=True)

#Draw the figure 
fig=Figure(data=[trace],layout=layout)
plotly.offline.iplot(fig) 

In [None]:
#Make a histogram using plotly
#Define the values for the histogram
data = [Histogram(x=x)]

#Label the axes
layout=Layout(xaxis=dict(title='x-axis title'),yaxis=dict(title='y-axis title'))

#Draw the figure  
fig=Figure(data=data,layout=layout)
plotly.offline.iplot(fig)

# Using Anaconda to perform computational analysis on your computer

We have been running code on the Google Compute Cloud, but you can also run all of this analysis on your own computer. To do this, you can install the Anaconda Python distribution from the Anaconda website: https://www.anaconda.com/download/

Once you have installed Anaconda following the instructions on the website, you can add Python packages and libraries with the following command: 

**`conda install packagename`**

for example: 
**`conda install pandas`** will install the pandas library.

If you want to check whether a given package is installed, you can use the command: 

**`conda list packagename`** 

for example: 

**`conda list pandas`** 

If you want to upgrade a package to a more recent version: 

**`conda upgrade packagename`** 

And finally, if you want to uninstall a package: 

**`conda uninstall packagename`**
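
From inside a notebook, a rough counterpart to `conda list packagename` is to ask Python whether it can locate a module. The sketch below uses the standard library for this; the helper name `is_installed` is hypothetical. Note that the import name can differ from the conda package name (for example, the `biopython` package is imported as `Bio` and `scikit-learn` as `sklearn`).

```python
import importlib.util

def is_installed(module_name):
    """Return True if Python can locate a top-level module with this name."""
    return importlib.util.find_spec(module_name) is not None

print(is_installed("json"))    # part of the standard library
print(is_installed("pandas"))  # depends on your environment
```

This only checks the Python environment that is running the notebook, which may not be the conda environment you last modified.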


