# Working with files

If you're processing scientific data, it won't be long before you want to read your data from files. This section will also show you how to write your program's results to files, for example to keep them for further analysis or to open in another program.

We'll assume in this lesson that your data is in a text file (one you could open with a standard text editor), rather than a binary file (which requires you to know how the file is encoded in binary).

## Reading files

There are several ways to open a file.

We start with the `open()` function, which takes two arguments: the filename and the mode (see Table 1 below). It is good practice to close the file when you have finished with it, using the file object's `close()` method.

In [7]:
f = open('datafile.csv', 'r')
f.close()

#### Table 1: modes for `open()`
|Mode|Description|
|:------:|:------|
|r|Open file only for reading|
|w|Open file only for writing (an existing file with the same name is overwritten)|
|a|Open for appending - data written is added to the end of the file|
|r+|Open file for reading and writing|

`open()` only opens the file - it doesn't do anything with the file contents. The file contents can be accessed in a few different ways, which vary depending on what you want to do with the data, and how much data is expected to be in the file.  

Files can be accessed as a sequence of individual lines using **loops**. There are other ways to read a file, but because this approach means you don't need to hold the whole file in memory at one time, it works well for big data files and is the approach used throughout this course.
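To make the comparison concrete, here is a minimal sketch of three common ways to get at a file's contents. The file `example.txt` and its contents are made up for illustration, and the file is written to a temporary directory so that nothing in your working folder is touched.

```python
import os
import tempfile

# Create a small throwaway file to read (hypothetical example data)
path = os.path.join(tempfile.mkdtemp(), "example.txt")
f = open(path, "w")
f.write("line one\nline two\nline three\n")
f.close()

# 1. read() returns the whole file as a single string
f = open(path, "r")
whole_text = f.read()
f.close()

# 2. readlines() returns a list with one string per line
f = open(path, "r")
all_lines = f.readlines()
f.close()

# 3. Looping over the file object yields one line at a time,
#    so only the current line needs to be held in memory
line_count = 0
f = open(path, "r")
for line in f:
    line_count += 1
f.close()

print(repr(whole_text))
print(all_lines)
print(line_count)
```

The first two approaches load everything at once, which is convenient for small files; the loop is usually the safer choice when you don't know how large the file will be.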


#### Loops

You can loop over the file object to get individual lines one at a time to process. This is an extremely common way to process scientific data files as it is very memory efficient and fast, since the whole file doesn't need to be stored in memory at once.

In [9]:
#this code prints the file line by line
f = open('datafile.csv', 'r') #open datafile.csv as read only
for line in f:
    print(line) #print each line one at a time
f.close()

EnsembleID,logFC,logCPM,PValue,FDR,description,external_gene_name

ENSMUSG00000000085,-0.517098261,5.077189037,0.000921513,0.013475949,sex comb on midleg homolog 1 [Source:MGI Symbol;Acc:MGI:1352762],Scmh1

ENSMUSG00000000142,3.163834082,-0.373208103,6.53E-08,5.58E-06,axin2 [Source:MGI Symbol;Acc:MGI:1270862],Axin2

ENSMUSG00000000275,0.76032031,6.626927229,1.00E-05,0.00034474,tripartite motif-containing 25 [Source:MGI Symbol;Acc:MGI:102749],Trim25

ENSMUSG00000000486,0.528756286,5.399726946,0.001804609,0.023120858,septin 1 [Source:MGI Symbol;Acc:MGI:1858916],sept1


#### with open...

There is a slightly better way to open files in Python, commonly used by experienced programmers: combining the `open()` function with a `with` statement. The file is then closed automatically when the block exits, even if the exit is caused by an exception in your code block. This is more robust, and easier because you don't need to remember to call `close()`.

In [10]:
#this code prints the file line by line
with open('datafile.csv', 'r') as f:
    for line in f:
        print(line)

EnsembleID,logFC,logCPM,PValue,FDR,description,external_gene_name

ENSMUSG00000000085,-0.517098261,5.077189037,0.000921513,0.013475949,sex comb on midleg homolog 1 [Source:MGI Symbol;Acc:MGI:1352762],Scmh1

ENSMUSG00000000142,3.163834082,-0.373208103,6.53E-08,5.58E-06,axin2 [Source:MGI Symbol;Acc:MGI:1270862],Axin2

ENSMUSG00000000275,0.76032031,6.626927229,1.00E-05,0.00034474,tripartite motif-containing 25 [Source:MGI Symbol;Acc:MGI:102749],Trim25

ENSMUSG00000000486,0.528756286,5.399726946,0.001804609,0.023120858,septin 1 [Source:MGI Symbol;Acc:MGI:1858916],sept1
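As a rough illustration of what the `with` statement guarantees, the sketch below checks the file object's `closed` attribute after the block has finished, including when an error is raised inside it. The file name `demo.txt` and the deliberately raised `ValueError` are invented for the example.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical example data)
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("first line\nsecond line\n")

# Normal exit: the file is closed as soon as the block ends
with open(path, "r") as f:
    first_line = f.readline()
closed_after_normal_exit = f.closed

# Exit caused by an exception: the file is still closed
try:
    with open(path, "r") as g:
        raise ValueError("something went wrong while processing")
except ValueError:
    pass
closed_after_error = g.closed

print(closed_after_normal_exit)
print(closed_after_error)
```

Both values should be `True`, which is the behaviour you would otherwise have to arrange yourself with an explicit `close()` call.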


### Processing Real Data Files

Real data files come in a variety of formats. For example, csv (comma-separated values) files are a common file type in which the columns of data in each line are separated by commas. For notes on different filetypes, see the section 'Common biological data files and how to parse them'.

Contents of oligos.csv:

```
gct1, ACTGATCCTATGACGGA, chr1, sense
llt1, ACGTAGCACAGTTTCACG, chr17, anti-sense
uta2, GCATCAGGATAGCCAG, chr14, anti-sense
cis1, CTAGGATTGATCACAGT, chr1, sense
```

The csv file above consists of data fields separated by commas. Although it is not usually visible in a text editor, each line ends with a newline character `\n`. You'll usually want to remove this before you do anything else with the line; otherwise the newline stays attached to the last entry in the line and can cause unwanted effects when you process that data further.

To remove the newline character from the end of the line, we can use the `rstrip()` method that we encountered in the strings section of this tutorial. 

In [2]:
#this code prints the file line by line
with open('oligos.csv', 'r') as f:
    for line in f:
        stripped_line = line.rstrip("\n")
        print(stripped_line) #prints the line without the newline character at the end

gct1, ACTGATCCTATGACGGA, chr1, sense
llt1, ACGTAGCACAGTTTCACG, chr17, anti-sense
uta2, GCATCAGGATAGCCAG, chr14, anti-sense
cis1, CTAGGATTGATCACAGT, chr1, sense


Try modifying the above code to `print(line)` and compare to `print(stripped_line)`, noting the effect of the extra newline if it is not removed. Each call to `print()` automatically adds a newline at the end of its output, so if the line's own newline is still present, an empty line appears between rows of data.
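One way to see the otherwise invisible newline is to print the line with the built-in `repr()` function, which shows special characters as escape sequences. The sketch below uses a hard-coded line in place of one read from `oligos.csv`, so it can be run without the file.

```python
# A line as it would be read from a file, with the newline still attached
line = "gct1, ACTGATCCTATGACGGA, chr1, sense\n"
stripped_line = line.rstrip("\n")

# repr() shows the newline as the two characters \n instead of starting a new line
print(repr(line))
print(repr(stripped_line))

# The stripped line is exactly one character shorter
length_difference = len(line) - len(stripped_line)
print(length_difference)
```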

### Splitting up your data

Once you have removed the newline, you might want to work with the individual data fields in each line, if the fields are separated (delimited) by a particular character such as a comma or tab.

In our file of oligo sequences, we need to split each line at the commas. By using the `split()` method, we can separate the individual fields and put them into a list. We can then use or process the individual list elements the same way we learned in the list section of the tutorial, by accessing them using square bracket notation, e.g. `example_list[2]`.

We usually also want to combine this with the `.rstrip()` method to remove the newline character from the end of the line. If we don't, the last element of the list generated by `.split()` will have an invisible newline (`\n`) character at the end of it, which could interfere with processing that data value.

Note that in the example below, we can apply the split method immediately after the rstrip method by writing them sequentially.

In [15]:
#this code calculates the length of each oligo in the file
with open('oligos.csv', 'r') as f:
    for line in f:
        #remove newline then split at commas
        data_list = line.rstrip("\n").split(",")
        #strip() removes the space that follows each comma in oligos.csv
        oligo_length = len(data_list[1].strip())
        print(oligo_length)

17
18
16
17


The code above takes each line of the text file, one by one, removes the newline character at the end of the line, then splits the line at the commas to form a new list called `data_list`. Because each comma in oligos.csv is followed by a space, the second value in the list (accessed as `data_list[1]`) begins with a space, so `strip()` is used to remove it; without this step every length would be one too large. The result is assigned to the variable `oligo_length` and then printed to the screen.
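Python's standard library also includes a `csv` module that can do the splitting for you. This is a different technique from the `split()` approach used in this section, shown here only as a hedged alternative. The sketch recreates the oligos.csv example in a temporary directory so it can be run on its own, and uses the `skipinitialspace=True` option of `csv.reader` to ignore the space that follows each comma.

```python
import csv
import os
import tempfile

# Recreate the oligos.csv example in a temporary directory (same contents as above)
path = os.path.join(tempfile.mkdtemp(), "oligos.csv")
with open(path, "w") as f:
    f.write("gct1, ACTGATCCTATGACGGA, chr1, sense\n")
    f.write("llt1, ACGTAGCACAGTTTCACG, chr17, anti-sense\n")
    f.write("uta2, GCATCAGGATAGCCAG, chr14, anti-sense\n")
    f.write("cis1, CTAGGATTGATCACAGT, chr1, sense\n")

oligo_lengths = []
# newline="" is the setting the csv documentation recommends when opening files
with open(path, "r", newline="") as f:
    # skipinitialspace=True ignores the space that follows each comma
    reader = csv.reader(f, skipinitialspace=True)
    for row in reader:
        # each row is already a list of fields, with no trailing newline
        oligo_lengths.append(len(row[1]))

print(oligo_lengths)
```

Splitting by hand is fine for simple files like this one; the `csv` module becomes more useful when fields can themselves contain commas or quotation marks.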

#### Example of working with the extracted data using list functions

In [3]:
#this code calculates the length of each oligo in the file, appends it to a list, then calculates the total number of bases
with open('oligos.csv', 'r') as f:
    list_of_lengths = [] #create an empty list to add lengths to

    for line in f:
        #remove newline then split at commas
        data_list = line.rstrip("\n").split(",")
        #strip() removes the space that follows each comma in oligos.csv
        oligo_length = len(data_list[1].strip())
        list_of_lengths.append(oligo_length)

print("Contents of list is {}".format(list_of_lengths))

#Use the sum() function to calculate the sum of the list
sum_of_lengths=sum(list_of_lengths)
print("Sum of list values is {}".format(sum_of_lengths))

#Use the len() function to calculate the number of items in the list
number_of_items=len(list_of_lengths)
print("Number of items in list is {}".format(number_of_items))

Contents of list is [17, 18, 16, 17]
Sum of list values is 68
Number of items in list is 4


## Writing files

Writing files is very similar to reading them: we specify the `'w'` mode when opening the file to indicate that we want to write to it. We can then use the `write()` method of our file object. Note that `write()` does not add a newline for you, so include `\n` wherever a line should end.

In [4]:
with open("outputfile.txt", 'w') as outfile:
    outfile.write("This text will be written to the file\n")
    outfile.write("This will be the second line")

If you run the above code in your Jupyter notebook, you will then be able to view and open your newly created file from the file menu (the first page viewed when you log in). 
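The difference between the `'w'` and `'a'` modes from Table 1 is easy to miss, so here is a small sketch. Opening an existing file with `'w'` discards its previous contents, whereas `'a'` adds to the end. The file name `notes.txt` is made up, and the file is created in a temporary directory so that nothing in your working folder is overwritten.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# 'w' creates the file (or empties it if it already exists)
with open(path, "w") as outfile:
    outfile.write("first version\n")

# Opening with 'w' again replaces the earlier contents
with open(path, "w") as outfile:
    outfile.write("second version\n")

# 'a' keeps what is already there and adds to the end
with open(path, "a") as outfile:
    outfile.write("an appended line\n")

# Read the file back to see what survived
with open(path, "r") as infile:
    contents = infile.read()

print(contents)
```

Only the second version and the appended line remain, so it is worth double-checking the filename before opening anything with `'w'`.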

In [26]:
#This code writes the contents of the dictionary to a file in FASTA format
sequence_dict = { "TIGR4":"ATCGATGCTA", "D39":"AGCTAGCCTA", "GC38":"CATGCTAGCT"}
with open("output_seqs.fasta", 'w') as outfile:
    for strain in sequence_dict:
        outfile.write(">{}\n".format(strain)) #writes key (strain) to FASTA header line
        outfile.write("{}\n".format(sequence_dict[strain])) #writes value (sequence) to sequence line
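To check that a file like this was written as intended, one option is to read it back into a dictionary and compare. The sketch below assumes the simple layout produced above, where every header line is followed by exactly one sequence line; real FASTA files often wrap a sequence over several lines, which this does not handle. It writes to a temporary directory instead of your working folder.

```python
import os
import tempfile

sequence_dict = {"TIGR4": "ATCGATGCTA", "D39": "AGCTAGCCTA", "GC38": "CATGCTAGCT"}
path = os.path.join(tempfile.mkdtemp(), "output_seqs.fasta")

# Write the dictionary out in the same way as the example above
with open(path, "w") as outfile:
    for strain in sequence_dict:
        outfile.write(">{}\n".format(strain))
        outfile.write("{}\n".format(sequence_dict[strain]))

# Read the file back: header lines start with '>', other lines are sequences
read_back = {}
current_strain = None
with open(path, "r") as infile:
    for line in infile:
        line = line.rstrip("\n")
        if line.startswith(">"):
            current_strain = line[1:]  # drop the leading '>'
        else:
            read_back[current_strain] = line

print(read_back == sequence_dict)
```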

# Exercise

Your home folder contains a csv file called 'FD_CDR3_10000.csv' that contains peptide sequences in the format "donor, sequence" (example excerpt is shown below). 
		
		FD003, CEFRDANIPWER
		FD004, CGATPQWERWE
		FD004, CATECGTAPWERYT

* Write a program to read the file and add each CDR3 sequence to a list
* Modify your program to calculate the mean peptide sequence length 
