# Lab 15 : Python modules for manipulating files and directories

## Learning Objectives

* os module
* shutil module
* glob module
* gzip module

## 15.1 The Python os module

In Session 1 we learned to manipulate files and directories on your native operating system using the terminal and Unix commands (Ubuntu and OS X) or the Command Prompt and MS-DOS commands (Windows). The functions that the Python os module provides allow you to interface with the underlying operating system that Python is running on, be that Windows, Mac or Linux.
The os module also provides a range of useful functions for manipulating files and directories. To use this module, import it first and then call its functions.

In [42]:
#!/usr/bin/env python

# Example 15.1
# This is not meant to be run as a program.
# Either use the ipython interpreter or select and run commands in Spyder as in Session 9

import os

# To get the path of your current working directory
os.getcwd()
print(os.getcwd())
#  Windows or Anaconda command prompt = cd
#  Apple OSX or Linux = pwd

# Create a directory "test"
os.mkdir("test")
#  Windows or Anaconda command prompt = mkdir
#  Apple OSX or Linux = mkdir

# Changing into that directory
os.chdir("test")
#  Windows or Anaconda command prompt = chdir
#  Apple OSX or Linux = cd

print(os.getcwd())

# create file
outfile = open("example.txt", "w")
outfile.write('example text in my example file')
outfile.close()

# get a list of directory contents (files and directories)
print(os.listdir('.'))
#  Windows or Anaconda command prompt = dir
#  Apple OSX or Linux = ls

# remove (delete) the file
os.remove('example.txt')
#  Windows or Anaconda command prompt = del
#  Apple OSX or Linux = rm

# To move up one directory
os.chdir("..")
#  Windows or Anaconda command prompt = chdir ..
#  Apple OSX or Linux = cd ..

# Delete/Remove "test" directory. Note the directory must be empty
os.rmdir("test")
#  Windows or Anaconda command prompt = rmdir
#  Apple OSX or Linux = rmdir
# To remove a directory and all of its contents use shutil.rmtree() - remember to import shutil

print(os.getcwd())

os.system("mkdir TEST")

/home/jlb/jlb@umass.edu/GoEcology/Courses/597-EvoGen/2018/labs
/home/jlb/jlb@umass.edu/GoEcology/Courses/597-EvoGen/2018/labs/test
['example.txt']
/home/jlb/jlb@umass.edu/GoEcology/Courses/597-EvoGen/2018/labs


0

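I try to keep this class operating system INDEPENDENT, but if you want to interact directly with the terminal in OS X or Linux or the command prompt in Windows, use the os.system() function. For example, os.system("mkdir TEST") makes a new directory. The 0 at the end of the output above is the value returned by os.system(); a return value of 0 conventionally means the command succeeded.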
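
If all you need is a new directory, the os functions do the same job on every operating system without going through the shell. The sketch below is one way to do it; it assumes a temporary scratch directory (made with tempfile, which is not used elsewhere in this lab) so that nothing is created in your own folders.

```python
import os
import tempfile

# Work in a temporary directory so the example leaves your own folders alone
scratch = tempfile.mkdtemp()

# "TEST" mirrors the directory name used with os.system above
new_dir = os.path.join(scratch, "TEST")

# exist_ok=True means no error is raised if the directory is already there,
# so the call can safely be repeated
os.makedirs(new_dir, exist_ok=True)
os.makedirs(new_dir, exist_ok=True)

created = os.path.isdir(new_dir)
print(created)

# Clean up: both directories are empty, so os.rmdir is enough
os.rmdir(new_dir)
os.rmdir(scratch)
```

Calling os.mkdir("TEST") a second time would raise an error because the directory already exists; os.makedirs with exist_ok=True avoids that.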

As we have seen, a path points to a file system location by following the directory tree hierarchy. It is expressed as a string of characters in which the path components, separated by a delimiting character, represent each directory. The delimiting character is most commonly the slash ("/") in Unix or OS X and the backslash ("\") in Windows.
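
Because the delimiter differs between operating systems, typing it by hand makes a script less portable. Here is a minimal sketch of letting os.path choose the delimiter for you. The directory and file names are only illustrative; nothing is created on disk.

```python
import os

# Build a path from its components; os.path.join inserts the delimiter
# that is correct for the operating system Python is running on.
example_path = os.path.join("main_directory", "sub_directory1",
                            "sub_directory1.file1.txt")
print(example_path)

# os.sep holds the delimiter itself
print(os.sep)

# Split a path back into its directory part and its file name
directory, filename = os.path.split(example_path)
print(directory)
print(filename)

# Separate the extension from the rest of the file name
root, extension = os.path.splitext(filename)
print(root)
print(extension)
```

On Linux or OS X the first line printed uses "/" between the components, while on Windows it uses "\".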

In [7]:
#!/usr/bin/env python

# Example 15.2
#
# A program for making example directories and files
# The output of this program will be a set of directories and files

# Usage: python make_example_directories.py


import os

print(os.getcwd())

# make the main directory
os.mkdir("main_directory")

# move into the main directory
os.chdir("main_directory")

# make the sub directories
os.mkdir("sub_directory1")
os.mkdir("sub_directory2")
os.mkdir("sub_directory3")

# get a list of the subdirectories
list_sub_dir = os.listdir('.')

# make a set of files with different extensions in each sub directory
for sub_dir in list_sub_dir :
    os.chdir(sub_dir)
    outfilename1 = sub_dir + ".file1.txt"
    outfile1 = open(outfilename1, 'w')
    outfile1.write ('text from %s\n' % (outfilename1))
    outfilename2 = sub_dir + ".file2.faa"
    outfile2 = open(outfilename2, 'w')
    outfile2.write ('protein from %s\n' % (outfilename2))
    outfilename3 = sub_dir + ".file3.gbk"
    outfile3 = open(outfilename3, 'w')
    outfile3.write ('Genbank Record from %s\n' % (outfilename3))
    # close the files before leaving this sub directory
    # (closing only after the loop would leave the files from earlier passes open)
    outfile1.close()
    outfile2.close()
    outfile3.close()
    print(os.listdir())
    os.chdir("..")

# print path of working directory and sub directories
print(os.getcwd())
print(os.listdir())

# At the moment main_directory is the current working directory
# Move back to your original working directory
# If you are running this in a Jupyter notebook and do not go back to your original working directory,
# then main_directory will remain the working directory for the cells that follow.

os.chdir("..")
print(os.getcwd())


/home/jlb/jlb@umass.edu/GoEcology/Courses/597-EvoGen/2018/labs
['sub_directory2.file2.faa', 'sub_directory2.file3.gbk', 'sub_directory2.file1.txt']
['sub_directory3.file1.txt', 'sub_directory3.file3.gbk', 'sub_directory3.file2.faa']
['sub_directory1.file1.txt', 'sub_directory1.file2.faa', 'sub_directory1.file3.gbk']
/home/jlb/jlb@umass.edu/GoEcology/Courses/597-EvoGen/2018/labs/main_directory
['sub_directory2', 'sub_directory3', 'sub_directory1']
/home/jlb/jlb@umass.edu/GoEcology/Courses/597-EvoGen/2018/labs


## 15.2 The Python shutil module

The shutil module offers operations for working with files and collections of files. In particular, it provides functions that support copying and removing files. It overlaps in function with parts of the os module, but I use it for moving and copying files.
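
As a small sketch of the copying and removal functions mentioned above, the code below copies a file with shutil.copy and then deletes a whole directory with shutil.rmtree. The file names are illustrative, and the temporary scratch directory (made with tempfile) is an assumption added so the example does not touch your own files.

```python
import os
import shutil
import tempfile

# Work inside a temporary scratch directory so nothing else is touched
scratch = tempfile.mkdtemp()

# Create a small file to copy
original = os.path.join(scratch, "example.txt")
with open(original, "w") as outfile:
    outfile.write("example text in my example file")

# Copy the file; the original stays where it is
backup = os.path.join(scratch, "example_copy.txt")
shutil.copy(original, backup)
contents = sorted(os.listdir(scratch))
print(contents)

# Remove the directory and everything in it.
# Unlike os.rmdir, shutil.rmtree does not need the directory to be empty.
shutil.rmtree(scratch)
still_there = os.path.exists(scratch)
print(still_there)
```

Be careful with shutil.rmtree on real data: it deletes the directory and all of its contents without asking.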


In [48]:
#!/usr/bin/env python

# Example 15.3
#
# A program for moving files

# Usage: python move_file.py

import os
import shutil

outfile = open("example.txt", "w")
outfile.write('example text in my example file')
outfile.close()

os.mkdir('test')

shutil.move('example.txt','test/example.txt' )
# Windows paths are normally written with backslashes, although Python also accepts forward slashes there

'test/example.txt'

## 15.3 The Python glob module

Another useful tool for getting the files in a directory is glob. The glob module finds all the pathnames matching a specified pattern according to the rules used by the Unix shell. No tilde expansion is done, but \*, ?, and character ranges expressed with [ ] will be correctly matched.
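
To see how the three kinds of wildcard behave, here is a minimal sketch that makes a few empty files in a temporary directory and matches them with different patterns. The file names, the temporary directory, and the helper function match are all made up for this illustration.

```python
import glob
import os
import shutil
import tempfile

# Make a scratch directory holding a few empty files
scratch = tempfile.mkdtemp()
for name in ["seq1.txt", "seq2.txt", "seq10.txt", "seqA.faa"]:
    open(os.path.join(scratch, name), "w").close()

def match(pattern):
    """Return the sorted file names in scratch that match a glob pattern."""
    # glob.escape protects any special characters in the directory name itself
    paths = glob.glob(os.path.join(glob.escape(scratch), pattern))
    return sorted(os.path.basename(path) for path in paths)

# * matches any number of characters (including none)
star_matches = match("*.txt")
print(star_matches)

# ? matches exactly one character, so seq10.txt is not matched
question_matches = match("seq?.txt")
print(question_matches)

# [ ] matches one character from the given range
range_matches = match("seq[A-Z].*")
print(range_matches)

# Remove the scratch directory and its files
shutil.rmtree(scratch)
```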

In [5]:
#!/usr/bin/env python

# Example 15.4
# glob

# Usage: python myglob.py

import glob

txt_files = glob.glob('*.txt')

print (txt_files)

['DNA_results.txt', 'GitHub.txt', 'id.txt', 'p53_mRNA_withU.txt', 'all_gbk_files.txt', 'Tree_of_Life_Core_Sequences.txt', 'RDP_exercise_set.txt', 'p53_mRNA.txt', 'id_gut.txt', 'GCF_000018685.1_ASM1868.faa.pfam03319.results.txt', 'lab3_and_lab4_exercise_sequences.txt', 'RDP_example_set.txt', 'unknown_sequence.txt']


These results are the ones in my working directory. Note the output of the above program will depend on what .txt files you have in your working directory. 

Below is a program that traverses the directories and files made by Example 15.2 (make_example_directories.py). Use it to test your knowledge of the above commands and as a reference for your code in the exercises.

In [8]:
#!/usr/bin/env python

# Example 15.5

# This example traverses the directory structure getting the contents of all GenBank record files (.gbk)
# This program assumes the directory structure from make_example_directories.py
# It should be run from the same directory that the main directory is in

# Usage: python traverse_directories.py


import os
import glob

outfile1 = open('all_gbk_files.txt', 'w')

# move into the main directory
os.chdir("main_directory")

# get a list of the subdirectories
list_sub_dir = os.listdir('.')

# go into each sub directory and get the contents of the gbk files
for sub_dir in list_sub_dir :
    os.chdir(sub_dir)
    gbk_files = glob.glob('*.gbk')
    for file in gbk_files :
        ind_gbk = open(file, 'r')
        file_contents = ind_gbk.read()
        ind_gbk.close()
        outfile1.write(file_contents)
        # Also you can print to screen
        print(file_contents)
    os.chdir("..")

# Move back to your working directory
os.chdir("..")
# close the files
outfile1.close()



Genbank Record from sub_directory2.file3.gbk

Genbank Record from sub_directory3.file3.gbk

Genbank Record from sub_directory1.file3.gbk
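
The same kind of traversal can also be sketched without changing directories at all, by putting a wildcard for the sub directory into the glob pattern. This is an alternative to the approach in Example 15.5, not a replacement for it. To stay self-contained, the sketch first builds a small copy of the main_directory layout inside a temporary directory, which is an assumption added for this illustration.

```python
import glob
import os
import shutil
import tempfile

# Build a small copy of the Example 15.2 layout in a scratch directory
scratch = tempfile.mkdtemp()
main_dir = os.path.join(scratch, "main_directory")
for sub_dir in ["sub_directory1", "sub_directory2", "sub_directory3"]:
    os.makedirs(os.path.join(main_dir, sub_dir))
    gbk_name = sub_dir + ".file3.gbk"
    with open(os.path.join(main_dir, sub_dir, gbk_name), "w") as outfile:
        outfile.write("Genbank Record from %s\n" % gbk_name)

# One pattern reaches into every sub directory: main_directory/*/*.gbk
pattern = os.path.join(glob.escape(main_dir), "*", "*.gbk")
gbk_files = sorted(glob.glob(pattern))

# Join the contents of every matching file into one string
all_records = ""
for gbk_file in gbk_files:
    with open(gbk_file, "r") as infile:
        all_records += infile.read()
print(all_records)

# Remove the scratch directory and everything in it
shutil.rmtree(scratch)
```

Because the working directory never changes, there is no need to remember to move back with os.chdir("..") afterwards.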



## 15.4 The Python gzip module


In [41]:
#!/usr/bin/env python

# Example 15.6

# This example reads a gzip-compressed (.gz) file and writes the uncompressed file

# Usage: python unzippit.py

import gzip

infile = gzip.open('GCF_000018685.1_ASM1868v1_protein.faa.gz', 'rb')
outfile = open('GCF_000018685.1_ASM1868v1_protein.faa', 'wb')
outfile.write(infile.read())

infile.close()
outfile.close()

In [39]:
#!/usr/bin/env python

# Example 15.7

# This example reads all gzip-compressed (.gz) files matching a pattern and writes the uncompressed files

# Usage: python unzippit.py


import os
import glob
import gzip

# this pattern only matches names ending in "(copy).faa.gz"
# use '*.faa.gz' to match every compressed .faa file in the directory
gz_files = glob.glob('*(copy).faa.gz')
for file in gz_files :
    # drop only the trailing ".gz" to get the output name
    out_filename = file[:-len('.gz')]
    infile = gzip.open(file, 'rb')
    outfile = open(out_filename, 'wb')
    outfile.write(infile.read())
    infile.close()
    outfile.close()
    # if you want to delete the .gz file
    # os.remove(file)
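
The examples above read the whole compressed file into memory with infile.read(). For large proteome files, one alternative is to stream the data across with shutil.copyfileobj. The sketch below assumes an illustrative file name and a temporary scratch directory, and it first writes a tiny .gz file so that it has something to decompress.

```python
import gzip
import os
import shutil
import tempfile

# Scratch directory and illustrative file names
scratch = tempfile.mkdtemp()
gz_path = os.path.join(scratch, "example_protein.faa.gz")
faa_path = gz_path[:-len(".gz")]

# Write a small compressed file so the example has something to read
with gzip.open(gz_path, "wb") as gz_out:
    gz_out.write(b">protein1\nMKVLAA\n")

# Decompress by streaming: copyfileobj copies the data in chunks,
# and the with statement closes both files automatically
with gzip.open(gz_path, "rb") as infile, open(faa_path, "wb") as outfile:
    shutil.copyfileobj(infile, outfile)

# Read the result back to check it
with open(faa_path, "rb") as check:
    decompressed = check.read()
print(decompressed)

# Remove the scratch directory and its files
shutil.rmtree(scratch)
```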

## Exercises

1. Create a directory (folder) on your computer called NCBI_proteomes. Go to the NCBI ftp site containing files for complete bacterial genomes, ftp://ftp.ncbi.nih.gov/genomes/Bacteria/, and download the _protein.faa.gz files for 3 proteomes into the NCBI_proteomes directory. Write a program that moves into the NCBI_proteomes directory and prints the names of the files in the directory.

2. Write a program that first moves into the NCBI_proteomes directory and then (using glob) decompresses all three files and deletes the original .gz files (as in Example 15.7). As output, print the final contents of the NCBI_proteomes directory (this should be 3 .faa files).

3. Write a program that moves into the NCBI_proteomes directory and creates 3 subdirectories, each with the name of a proteome file without the _protein.faa ending (e.g. for GCF_000018685.1_ASM1868v1_protein.faa the directory should be GCF_000018685.1_ASM1868v1). Then move the corresponding _protein.faa file into that directory. You should now have a directory NCBI_proteomes that contains three subdirectories (each named after a protein file) and, in each of the subdirectories, one protein file. As output, print the final contents of the NCBI_proteomes directory.

* Next - <a href="http://nbviewer.ipython.org/github/jeffreyblanchard/EvoGenV5/blob/master/EvoGenV5_Lab16.ipynb">Lab 16 : Tree Visualization</a>
* Previous - <a href="http://nbviewer.ipython.org/github/jeffreyblanchard/EvoGenV5/blob/master/EvoGenV5_Lab14.ipynb">Lab 14 : Sequence Alignment and Phylogenetic Analysis</a> 