# File Handling

## Reading files

Reading a file is a three-step process in Python.

### 1. **File opening** 

- First we create a file handle using the built-in function `open(filename, mode)`. Until we close the file, this file handle is used to refer to the file. 

- The `open` function takes two parameters: **name of the file** and **opening mode**. 
    - The file name is a string with the file name, in most cases including the system path. 
        - If the full system path is included, then this absolute path is used.
        - If we enter just the file name, then the path is taken relative to the current working directory.
    - Valid parameters for the opening mode:
        - `"r"` to only read, which is also the default value.
        - `"w"` to only write. If the file already exists, this mode overwrites (truncates) it, so the existing data is lost.
        - `"a"` to append data at the end of an existing file. The file is created if it does not exist.
        - `"r+"` to both read and write, starting at the beginning of the file. The existing content is not truncated.


|           Mode           |  r   |  r+  |  w   |  w+  |  a   |  a+  |
| :----------------------: | :--: | :--: | :--: | :--: | :--: | :--: |
|           Read           |  +   |  +   |      |  +   |      |  +   |
|          Write           |      |  +   |  +   |  +   |  +   |  +   |
|   Create if missing      |      |      |  +   |  +   |  +   |  +   |
| Truncate existing file   |      |      |  +   |  +   |      |      |
| Pointer at the beginning |  +   |  +   |  +   |  +   |      |      |
|    Pointer at the end    |      |      |      |      |  +   |  +   |


<img src="FileOpeningSummary.png" alt="Drawing" style="width: 700px;"/>
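
The behaviour of the most common modes can be checked with a small experiment. The sketch below is only an illustration: it works on a throwaway file (`modes_demo.txt`, a made-up name) inside a temporary directory, so no course file is touched.

```python
import os
import tempfile

# Work in a temporary directory; 'modes_demo.txt' is a hypothetical file name.
demo_path = os.path.join(tempfile.mkdtemp(), 'modes_demo.txt')

# 'w' creates the file (or truncates it if it already exists).
handle = open(demo_path, 'w')
handle.write('first\n')
handle.close()

# 'a' keeps the existing content and writes at the end of the file.
handle = open(demo_path, 'a')
handle.write('second\n')
handle.close()

# 'r+' starts at the beginning and overwrites in place, without truncating.
handle = open(demo_path, 'r+')
handle.write('FIRST')
handle.close()

# 'r' only reads.
handle = open(demo_path, 'r')
modes_result = handle.read()
handle.close()

print(repr(modes_result))  # 'FIRST\nsecond\n'
```

Note how `'r+'` replaced only the first five characters, while `'a'` left the earlier content untouched.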

In [None]:
fh = open('protein.fasta', 'r')
fh

### 2. **Reading** from the file

A file can be read once it is opened. We can apply several methods on the file handle to read the file. For example,
 - **read(n)**: With the parameter n it returns at most n characters from the file (n bytes if the file was opened in binary mode); without the parameter it reads the entire file.
 - **readline()**: Returns a single line from the file, including the '\n' end-of-line marker. The read position then advances to the beginning of the next line, so that the next call to `readline()` returns the next line. After reaching the end of the file, it returns an empty string.
 - **readlines()**: Returns a list of strings, one for each remaining line of the file. Each line keeps its trailing newline character '\n'.
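
The three methods share one read position, which is easiest to see when they are called on the same handle. The following is a minimal sketch on a small throwaway file (the file name and its content are made up for the illustration):

```python
import os
import tempfile

# Create a small demo file; name and content are hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), 'reading_demo.txt')
fh_demo = open(demo_path, 'w')
fh_demo.write('>id1\nMKV\n>id2\nGHT\n')
fh_demo.close()

fh_demo = open(demo_path, 'r')
first_three = fh_demo.read(3)      # '>id'   (three characters)
rest_of_line = fh_demo.readline()  # '1\n'   (rest of the first line)
remaining = fh_demo.readlines()    # ['MKV\n', '>id2\n', 'GHT\n']
at_end = fh_demo.readline()        # ''      (end of file reached)
fh_demo.close()

print(repr(first_three), repr(rest_of_line), remaining, repr(at_end))
```

Each call continues where the previous one stopped, which is why a handle that has been read to the end returns nothing more until the file is reopened.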


In [None]:
fh.read()

In [None]:
fh2 = open('protein.fasta', 'r')
fh2.readline()

In [None]:
fh2.readline()

In [None]:
fh3 = open('protein.fasta', 'r')
fh3.readlines()

### 3. **Closing** the file

It is important to close a file once we are done with it. If we don't close the file, Python normally closes it when the file object is garbage-collected or, at the latest, at the end of the session. Nonetheless, to avoid unforeseen consequences (for example, written data that is still held in a buffer and has not yet reached the disk), it is better to close the file as soon as possible.

We close a file with the statement **file_handle.close()**.

In [None]:
fh.close()
fh2.close()
fh3.close()
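
Whether a handle is still open can be checked through its `closed` attribute, and any further read or write on a closed handle raises a `ValueError`. A small sketch, using a throwaway file with a made-up name:

```python
import os
import tempfile

# 'closing_demo.txt' is a hypothetical file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'closing_demo.txt')

fh_demo = open(demo_path, 'w')
fh_demo.write('some text\n')
closed_before = fh_demo.closed  # False: the handle is still open
fh_demo.close()
closed_after = fh_demo.closed   # True: the handle has been closed

# Using a closed handle is an error.
try:
    fh_demo.write('more text\n')
    error_message = None
except ValueError as error:
    error_message = str(error)

print(closed_before, closed_after)
print(error_message)
```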

#### Using a context manager
We can use **with** to ensure that our file gets closed.
This is the preferred (safe!) way to open files as it guarantees the file gets closed even if an exception occurs while it is open.

In [None]:
with open('protein.fasta', 'r') as read_handle:
    # do whatever with the file
    file_content = read_handle.read() 
    # the read method returns the content of the file from the current position till the end of the file.
    # And now we have stored it in the variable 'file_content'
    print("File closed at 1:")
    print(read_handle.closed)
    # the file closes as soon as we leave the with block
print("File closed at 2:")
print(read_handle.closed)

In [None]:
file_content
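
The guarantee that the file is closed even when an exception occurs can be verified directly. In this sketch (the file name and the deliberately failing `int()` conversion are made up for the illustration), an error is raised inside the `with` block and the handle is inspected afterwards:

```python
import os
import tempfile

# 'with_demo.txt' is a hypothetical file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'with_demo.txt')
with open(demo_path, 'w') as write_handle:
    write_handle.write('line 1\n')

try:
    with open(demo_path, 'r') as read_handle_demo:
        first_line = read_handle_demo.readline()
        # This conversion fails on purpose and raises a ValueError.
        int(first_line)
except ValueError:
    print('Conversion failed inside the with block')

# The context manager closed the file before the exception propagated.
closed_after_error = read_handle_demo.closed
print(closed_after_error)  # True
```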

#### **Problem**:
Read the protein.fasta file and store the sequences and their identifiers as dictionary values and keys.

In [None]:
# initiate an empty dictionary
dict_fasta = {}
with open('protein.fasta', 'r') as read_handle:
    # iterate over each line
    for line in read_handle:
        # the line containing the fasta id starts with >
        if line[0] == '>':
            # strip away the newline character at the end of the line and slice out the '>' to get the fasta_id
            fasta_id = line[1:].strip('\n')
        # the line containing the fasta sequence starts with an amino acid letter
        # (the 20 standard amino acids plus the ambiguity codes B, Z and X)
        elif line[0] in 'ARNDBCEQZGHILKMFPSTWYVX':
            # strip away the newline character to get the fasta sequence
            seq = line.strip('\n')
            # populate the dictionary with fasta_id as the key and its corresponding sequence as the value
            dict_fasta[fasta_id] = seq

In [None]:
dict_fasta
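
The solution above assumes that every sequence sits on a single line. FASTA files often wrap long sequences over several lines, so a variant that concatenates all lines belonging to one identifier may be useful. The following is a sketch under that assumption; `parse_fasta` is a hypothetical helper name and the example records are invented:

```python
def parse_fasta(lines):
    """Return a dict mapping each fasta id to its full sequence.

    `lines` can be any iterable of strings, e.g. an open file handle.
    """
    sequences = {}
    fasta_id = None
    for line in lines:
        line = line.strip()
        if not line:
            # skip empty lines
            continue
        if line.startswith('>'):
            # a new record starts: remember its id and start an empty sequence
            fasta_id = line[1:]
            sequences[fasta_id] = ''
        elif fasta_id is not None:
            # a sequence line: append it to the current record
            sequences[fasta_id] += line
    return sequences


# Invented example records; 'seq1' is wrapped over two lines.
example_lines = ['>seq1\n', 'MKV\n', 'LAA\n', '\n', '>seq2\n', 'GHT\n']
parsed = parse_fasta(example_lines)
print(parsed)  # {'seq1': 'MKVLAA', 'seq2': 'GHT'}
```

Because a file handle is iterable line by line, the same function could be called as `parse_fasta(read_handle)` inside a `with open(...)` block.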

## Writing files

Writing to a file has the same three steps as reading a file, except that we open the file in a write-enabled mode (for example `'w'` or `'a'`) and, in the second step, write data to the file by calling the `write()` method.

```python
file_handle.write(string_that_is_to_be_written)
```
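
Two details of `write()` are easy to overlook: it does not add a newline on its own, and it returns the number of characters written. A minimal sketch on a throwaway file (the file name is made up):

```python
import os
import tempfile

# 'write_demo.txt' is a hypothetical file in a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), 'write_demo.txt')

with open(demo_path, 'w') as write_handle:
    # write() returns the number of characters written
    n_written = write_handle.write('>seq1\n')
    # no newline is added automatically: these two calls end up on one line
    write_handle.write('MKV')
    write_handle.write('LAA\n')

with open(demo_path, 'r') as read_handle_demo:
    written_lines = read_handle_demo.readlines()

print(n_written)      # 6
print(written_lines)  # ['>seq1\n', 'MKVLAA\n']
```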

#### **Problem**:
Read the file `protein.fasta` and store each fasta sequence in a separate fasta file.

In [None]:
with open('protein.fasta', 'r') as read_handle:
    # iterate over each line
    for line in read_handle:
        # the line containing the fasta id starts with >
        if line[0] == '>':
            # strip away the newline character at the end of the line and slice out the '>' to get the fasta_id
            fasta_id = line[1:].strip('\n')
        # the line containing the fasta sequence starts with an amino acid letter
        # (the 20 standard amino acids plus the ambiguity codes B, Z and X)
        elif line[0] in 'ARNDBCEQZGHILKMFPSTWYVX':
            # strip away the newline character to get the fasta sequence
            seq = line.strip('\n')
            # name of the new fasta file
            fl_name = '{}.fasta'.format(fasta_id)
            # write the new fasta file
            with open(fl_name, 'w') as write_handle:
                # compose the string that is to be written
                write_string = '>{}\n{}\n'.format(fasta_id, seq)
                # write
                write_handle.write(write_string)


## CSV (Comma-Separated Values): the file format behind the bulk of data analysis

- These are plain text files where data is separated by commas, or by other separators such as tabs (\t), colons (:), whitespace ( ), pipes (|), etc.
- Each line represents a separate record.
- Most spreadsheet programs can save their tables in this format.

In [None]:
!cat course_contact.csv

### Raw file IO:

In [None]:
with open('course_contact.csv') as read_handle:
    # skip header and move to next line
    next(read_handle)
    for line in read_handle:
        # remove the trailing newline first, otherwise it stays attached to the last field
        line = line.strip('\n').split(',')
        print("""
        Contact
        Name:\t{1}
        Room:\t{0}
        Phone:\t{2}
        Email:\t{3}""".format(*line))

### Using the `csv` module

In [None]:
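import csv

# Open the file in a context manager so that it gets closed after reading.
# newline='' lets the csv module handle the line endings itself.
with open('course_contact.csv', newline='') as read_handle:
    # Setup the csv reader.
    lines = csv.reader(read_handle, delimiter=',')

    # Skip the first (header) line.
    next(lines)

    # Loop over all lines.
    for line in lines:
        print("""
        Contact
        Name:\t{1}
        Room:\t{0}
        Phone:\t{2}
        Email:\t{3}""".format(*line))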
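
One reason to prefer the `csv` module over a plain `split(',')` is that it understands quoted fields that themselves contain a comma. The sketch below uses invented data (the column names and the contact are assumptions, not the content of `course_contact.csv`) and also shows `csv.DictReader`, which uses the header line to access fields by name instead of by position:

```python
import csv
import io

# Invented example data; the real file may use different column names.
csv_text = (
    'Room,Name,Phone,Email\n'
    '101,"Doe, Jane",555-0100,jane@example.org\n'
)

# A plain split breaks the quoted name into two pieces.
naive_fields = csv_text.splitlines()[1].split(',')
print(len(naive_fields))  # 5 fields instead of 4

# DictReader reads the header line and returns one dictionary per record.
reader = csv.DictReader(io.StringIO(csv_text))
records = list(reader)
print(records[0]['Name'])  # Doe, Jane
```

`io.StringIO` only stands in for an open file here; with a real file, the handle returned by `open()` would be passed to `csv.DictReader` in the same way.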

## File handling through the os module (Operating System)

The `os` module provides an interface between Python and the operating system.

**getcwd()**: Get the current working directory.

In [None]:
import os
os.getcwd()

**chdir(path)**: Change the current working directory to the given path.

In [None]:
pd = os.getcwd()
os.chdir('..')
os.getcwd()

In [None]:
os.chdir(pd)
os.getcwd()

**listdir(dir)**: List all the entries in the given directory.

In [None]:
#list all files and folders in the current directory
os.listdir('.')

**path.isdir(directory_path)**: Checks whether the given path points to an existing directory.

In [None]:
os.path.isdir('../Day2')

In [None]:
os.path.isdir('../Day6')

**mkdir(path)**: Make a new directory.

In [None]:
os.mkdir('../Day6')

In [None]:
os.path.isdir('../Day6')
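
Calling `mkdir()` on a directory that already exists raises a `FileExistsError`. When that case should simply be accepted, `os.makedirs()` with `exist_ok=True` is one option; it also creates missing parent directories. A sketch in a temporary directory (all directory names are made up):

```python
import os
import tempfile

# Work inside a temporary directory; 'Day6', 'a' and 'b' are hypothetical names.
base_dir = tempfile.mkdtemp()
new_dir = os.path.join(base_dir, 'Day6')

os.mkdir(new_dir)
try:
    # a second mkdir on the same path fails
    os.mkdir(new_dir)
    second_mkdir_failed = False
except FileExistsError:
    second_mkdir_failed = True

# makedirs with exist_ok=True accepts an existing directory ...
os.makedirs(new_dir, exist_ok=True)
# ... and creates intermediate directories as needed
nested_dir = os.path.join(base_dir, 'a', 'b')
os.makedirs(nested_dir, exist_ok=True)

print(second_mkdir_failed, os.path.isdir(nested_dir))  # True True
```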

**path.isfile(file_path)**: Checks whether the given path points to an existing file.

In [None]:
os.path.isfile('course_contact.csv')

In [None]:
os.path.isfile('course_is_correct.csv')

**remove(file_path)**: Remove an existing file.

In [None]:
os.remove('./ENSNLET00000008753.fasta')

In [None]:
os.path.isfile('./ENSNLET00000008753.fasta')

**rmdir(path)**: Remove an existing directory. The directory must be empty.

In [None]:
os.rmdir("../Day6")

In [None]:
os.path.exists("../Day6")
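
Note that `os.rmdir()` only removes empty directories; on a directory that still contains files it raises an `OSError`. The sketch below demonstrates this in a temporary directory (the file name `notes.txt` is made up):

```python
import os
import tempfile

# Create a temporary directory holding one hypothetical file.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, 'notes.txt')
with open(demo_file, 'w') as write_handle:
    write_handle.write('temporary\n')

# Removing a non-empty directory fails.
try:
    os.rmdir(demo_dir)
    removed_while_full = True
except OSError:
    removed_while_full = False

# After deleting its content, the directory can be removed.
os.remove(demo_file)
os.rmdir(demo_dir)
exists_at_end = os.path.exists(demo_dir)

print(removed_while_full, exists_at_end)  # False False
```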


**rename(old_path, new_path)**: Rename an existing file.
- Make sure not to overwrite an existing file: on UNIX systems, a file that already exists at `new_path` is replaced without warning.

In [None]:
os.rename('./ENST00000450565.fasta', './Human.fasta')

In [None]:
os.path.isfile('./ENST00000450565.fasta')

In [None]:
os.path.isfile('./Human.fasta')
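
One simple way to avoid overwriting a file by accident is to check the target path first. The sketch below wraps this in a hypothetical helper, `rename_if_free`, and tries it on throwaway files with made-up names:

```python
import os
import tempfile


def rename_if_free(old_path, new_path):
    """Hypothetical helper: rename only if nothing exists at new_path.

    Returns True if the file was renamed, False otherwise.
    """
    if os.path.exists(new_path):
        return False
    os.rename(old_path, new_path)
    return True


# Set up two files in a temporary directory.
demo_dir = tempfile.mkdtemp()
old_file = os.path.join(demo_dir, 'old.fasta')
taken_file = os.path.join(demo_dir, 'taken.fasta')
free_file = os.path.join(demo_dir, 'free.fasta')
for path in (old_file, taken_file):
    with open(path, 'w') as write_handle:
        write_handle.write('>id\nMKV\n')

renamed_onto_taken = rename_if_free(old_file, taken_file)  # False: target exists
renamed_onto_free = rename_if_free(old_file, free_file)    # True: target was free
print(renamed_onto_taken, renamed_onto_free)
```

The check and the rename are two separate steps, so this is a convenience for interactive work rather than a guarantee when several programs modify the same directory at once.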

The `os.path` module contains many useful functions to handle paths in Python. A few examples:

In [None]:
# Get the absolute path:
os.path.abspath(".")

In [None]:
# Get the relative path with respect to a given starting point:
os.path.relpath(".", "..")

In [None]:
# The parent directory of a given file or dir.
os.path.dirname(os.getcwd()) # In a script, the special variable "__file__" holds the path of **this** file.

In [None]:
# Split a path into parent dir and the rest.
components = os.path.split(os.getcwd())
components

In [None]:
# Join elements into a path
joined = os.path.join(*components)
joined
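
A few more `os.path` functions are handy when file names have to be taken apart. They only manipulate the path string, so the path does not need to exist; the path used here is made up:

```python
import os

# A hypothetical relative path, built in a platform-independent way.
example_path = os.path.join('data', 'protein.fasta')

file_name = os.path.basename(example_path)     # 'protein.fasta'
dir_name = os.path.dirname(example_path)       # 'data'
stem, extension = os.path.splitext(file_name)  # 'protein' and '.fasta'
is_absolute = os.path.isabs(example_path)      # False

print(file_name, dir_name, stem, extension, is_absolute)
```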

#### Exercise:
For each file in `./Day3.2/` under the current working directory, print its absolute path.

In [None]:
for f in os.listdir('Day3.2'):
    print(
        os.path.abspath(
            os.path.join(
                os.getcwd(),
                "Day3.2", f
            )
        )
    )
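
Since `os.listdir()` returns sub-directories as well as files, a stricter reading of the exercise would filter the entries with `os.path.isfile()`. The sketch below shows this on a temporary directory with made-up content instead of `Day3.2`:

```python
import os
import tempfile

# Build a temporary directory with two files and one sub-directory.
demo_dir = tempfile.mkdtemp()
os.mkdir(os.path.join(demo_dir, 'subfolder'))
for name in ('b.txt', 'a.txt'):
    with open(os.path.join(demo_dir, name), 'w') as write_handle:
        write_handle.write('demo\n')

file_paths = []
# listdir gives no guaranteed order, so sort the entries first
for entry in sorted(os.listdir(demo_dir)):
    full_path = os.path.abspath(os.path.join(demo_dir, entry))
    # keep regular files only and skip directories
    if os.path.isfile(full_path):
        file_paths.append(full_path)
        print(full_path)
```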

### Executing UNIX commands
The `os` module also allows us to execute UNIX commands.

**popen(command, mode)**: It opens a pipe to or from the command. The return value is an open file object connected to the pipe, which can be read or written depending on whether mode is 'r' (default) or 'w'.

In [None]:
GetIds = os.popen(r'''grep ">" protein.fasta''').read()

In [None]:
GetIds

```Python
os.popen(r'''grep ">" protein.fasta''').read()
```
- The 'r' before the command string makes it a raw string: Python does not interpret backslashes in it as escape sequences, so the command is passed to the OS exactly as written.

- Do not forget to add **.read()** at the end of the **popen()** call, even if you don't need the output. `popen()` returns as soon as the command has been started, whereas `read()` waits until the command has finished writing its output. Without it, the Python interpreter moves on to the next statement while the command may still be running, and we would not know whether it completed.
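
As a possible alternative, the standard library's `subprocess` module offers more control over external commands. Its `run()` function waits for the command to finish and reports the exit status together with the output. The sketch below is not part of the course material; to stay independent of the operating system it starts a second Python interpreter instead of `grep`:

```python
import subprocess
import sys

# Run a small command and wait for it to finish.
# sys.executable is the path of the running Python interpreter.
completed = subprocess.run(
    [sys.executable, '-c', 'print("hello from a child process")'],
    capture_output=True,  # collect stdout and stderr
    text=True,            # return the output as str instead of bytes
)

print(completed.returncode)  # 0 means the command succeeded
print(completed.stdout)
```

Passing the command as a list of arguments avoids the need for shell quoting.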