# Day 1: Working with .txt and .csv files
An important part of computational biology is working with various forms of input data. Very often, the data arrives as `.txt` or `.csv` files. 
JSON files are also important, but are not covered in this course. \
\
Some file types, such as `.csv`, have their own parser. A parser translates the structure of the given csv file into data that is accessible from your Python code.

## Reading and writing with .txt files 
A `.txt` document can be opened with the `open()` function. With `mode="r"` we select read mode, which only allows Python to read the document, not to change it in any way. `open()` does not read the content yet; it returns a file object, which we store in a variable: 

In [1]:
f = open("additional_data/simple_txt_file.txt",mode="r")

Now we can read the content of the file into a string with `read()`. Printing the string shows the entire content of the `.txt` file:

In [3]:
s = f.read()
print(s)

Biochemistry is awesome! 
Coding is great!
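
A file object remembers how far it has already been read, which is why the next example opens the file again. The sketch below illustrates this with a small temporary stand-in file (the temporary path is an assumption made so that the example runs on its own): a second `read()` returns an empty string, and `seek(0)` moves the position back to the start.

```python
import os
import tempfile

# Create a small stand-in file so the example is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "simple_txt_file.txt")
f = open(path, mode="w")
f.write("Biochemistry is awesome!\nCoding is great!\n")
f.close()

f = open(path, mode="r")
first_read = f.read()    # the whole content of the file
second_read = f.read()   # empty string: the position is already at the end
f.seek(0)                # jump back to the beginning of the file
third_read = f.read()    # the whole content again
f.close()

print(repr(second_read))
print(third_read == first_read)
```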



We can also read the file line by line with the `readline()` method:

In [1]:
f = open("additional_data/simple_txt_file.txt",mode="r")
line1 = f.readline()
line2 = f.readline()

print("Line 1:", line1)
print(20*"-")
print("Line 2:", line2 )

Line 1: Biochemistry is awesome! 

--------------------
Line 2: Coding is great!
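
The empty lines in the output above appear because `readline()` keeps the newline character at the end of each line, and `print()` then adds one of its own. A minimal sketch of how the trailing newline could be removed with `rstrip()`, again using a temporary stand-in file (an assumption, not the course file):

```python
import os
import tempfile

# Create a small stand-in file so the example is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "simple_txt_file.txt")
f = open(path, mode="w")
f.write("Biochemistry is awesome!\nCoding is great!\n")
f.close()

f = open(path, mode="r")
raw_line = f.readline()              # still ends with "\n"
clean_line = raw_line.rstrip("\n")   # trailing newline removed
f.close()

print(repr(raw_line))
print(repr(clean_line))
```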



Finally, once we are done working with our file, we need to close it. This releases the file handle and the resources attached to it (this step is crucial when working with many or large files!):

In [7]:
f.close()

To avoid bugs or leaked resources because you forgot to close your file, it is recommended to use the `with` statement, which closes the file automatically. \ 
Using `with open('file') as f`, we bind the opened file object to a new variable `f`. The file is closed automatically once the indented code below the statement has finished, so `f` can only be used to access the file inside that block. 

In [21]:
with open("additional_data/simple_txt_file.txt",mode="r") as f: 
    line1 = f.readline()
    line2 = f.readline()
    
    print("Line 1:", line1)
    print("Line 2:", line2)

Line 1: Biochemistry is awesome! 

Line 2: Coding is great!
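
Whether a file is still open can be checked with the `closed` attribute of the file object. The following sketch uses a temporary file (an assumption for illustration) and shows that the file is open inside the `with` block and closed as soon as the block ends:

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.txt")

with open(path, mode="w") as f:
    f.write("Coding is great!\n")
    inside = f.closed    # False: the file is open inside the block

outside = f.closed       # True: the file was closed when the block ended

print(inside, outside)
```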



Now we look at another file, a sequence alignment, in which every line is structured like this: \
`IL2RA_SHEEP    MEPSLLMWRFFVFIVVPGCVTEACHDDPPSLRNA----------MFKVLRYE----VGTM`

For this, we need to write a custom parser which transforms the text file into a dictionary that maps each sequence name to its aligned sequence.

In [20]:
gene_seq = {}
with open("additional_data/protein_alignment.txt",mode="r") as f:
    for line in f:
        values = line.split()  # split the line at whitespace into [name, sequence]
        gene_seq[values[0]] = values[1]
print(gene_seq)

{'IL2RA_SHEEP': 'MEPSLLMWRFFVFIVVPGCVTEACHDDPPSLRNA----------MFKVLRYE----VGTM', 'IL2RA_MOUSE': 'MEPRLLMLGFLSLTIVPSCRAELCLYDPPEVPNA----------TFKALSYK----NGTI', 'IL2RA_FELCA': 'MEPSLLLWGILTFVVVHGHVTELCDENPPDIQHA----------TFKALTYK----TGTM', 'IL2RA_HUMAN': 'MDSYLLMWGLLTFIMVPGCQAELCDDDPPEIPHA----------TFKAMAYK----EGTM', 'IL2RA_MACMU': 'MDPYLLMWGLLTFITVPGCQAELCDDDPPKITHA----------TFKAVAYK----EGTM'}
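
The parser above assumes that every line contains exactly a name and a sequence; an empty line would raise an `IndexError`. A slightly more defensive variant could look like the sketch below. The stand-in file and its truncated sequences are assumptions made so that the example runs on its own.

```python
import os
import tempfile

# Stand-in alignment file with truncated sequences and one blank line
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "protein_alignment.txt")
with open(path, mode="w") as f:
    f.write("IL2RA_SHEEP    MEPSLLMWRFFVFIVV\n")
    f.write("\n")
    f.write("IL2RA_MOUSE    MEPRLLMLGFLSLTIV\n")

gene_seq = {}
with open(path, mode="r") as f:
    for line in f:
        values = line.split()
        if len(values) != 2:   # skip blank or malformed lines
            continue
        name, sequence = values
        gene_seq[name] = sequence

print(gene_seq)
```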


Of course, you can also write `.txt` files in Python: 

In [2]:
with open('additional_data/new_txt_file.txt', mode='w') as t:
    t.write('A new file has been created!')

The `mode` is crucial in this step: if the file already exists, `w` mode will overwrite any existing content. `a` mode will append to the end of the file: 

In [3]:
with open('additional_data/new_txt_file.txt', mode='a') as t:
    t.write('\nWe add some text to our file!')
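
To see the difference between the two modes, the file can be read back after writing. The sketch below repeats the steps above on a temporary file (the path and the text `first version` are assumptions for illustration):

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "new_txt_file.txt")

with open(path, mode="w") as t:
    t.write("first version")
with open(path, mode="w") as t:   # "w" discards "first version"
    t.write("A new file has been created!")
with open(path, mode="a") as t:   # "a" keeps the content and adds to the end
    t.write("\nWe add some text to our file!")

with open(path, mode="r") as t:
    content = t.read()

print(content)
```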

## Reading and writing .csv files

CSV (comma-separated values) files are one of the most important forms of data in biology. Think of an Excel sheet, for example. In essence, a `.csv` is like an `.xlsx` document, without fancy layout and functions. \
Python has a built-in `csv` parser, which helps a lot. \
In the example below, we read a `.csv` file containing protein names, their gene names and the species they come from. 

In [12]:
import csv # Python's built-in csv parser

# newline="" lets the csv module handle line endings itself
with open("additional_data/names_proteins.csv", newline="") as csv_file:
    # the reader object splits each line at the delimiter and returns a list of strings
    csv_reader = csv.reader(csv_file, delimiter=",")
    counter_line = 0
    header = []
    for row in csv_reader: # looping through the csv_reader object; each row is a list of strings
        if counter_line == 0:
            print('The categories in this csv file are:',row[0],row[1],row[2])
            header = row
        else: 
            print(f'{header[0]}:{row[0]}, {header[1]}:{row[1]}, {header[2]}:{row[2]}')
        counter_line += 1 


The categories in this csv file are: protein  gene  organism
protein:DNA-directed RNA polymerase II subunit RPB1,  gene: POLR2A,  organism: Homo sapiens (Human)
protein:Hemoglobin subunit beta-1,  gene:Hbb-b1,  organism:Mus musculus (Mouse)
protein:Mating pheromone Er-23,  gene:MAT23,  organism:Euplotes raikovi 
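
As an alternative to tracking the header by hand, the `csv` module also provides `csv.DictReader`, which uses the first row as keys and returns each following row as a dictionary. A minimal sketch, assuming a small stand-in file that reuses two of the rows shown above:

```python
import csv
import os
import tempfile

# Stand-in csv file so the example is self-contained
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "names_proteins.csv")
with open(path, mode="w", newline="") as csv_file:
    csv_file.write("protein,gene,organism\n")
    csv_file.write("Hemoglobin subunit beta-1,Hbb-b1,Mus musculus (Mouse)\n")
    csv_file.write("Mating pheromone Er-23,MAT23,Euplotes raikovi\n")

with open(path, newline="") as csv_file:
    csv_reader = csv.DictReader(csv_file)   # the first row becomes the keys
    records = list(csv_reader)              # one dictionary per data row

for record in records:
    print(record["gene"], "->", record["organism"])
```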


Writing `.csv` files is also very easy with the `csv` module:  

In [1]:
import csv

with open('additional_data/new_csv_file.csv', mode='w', newline='') as output_file: 
    # the writer object joins a given list into one csv line using the delimiter
    csv_writer = csv.writer(output_file, delimiter=',')
    csv_writer.writerow(['Nucleotide 1','Nucleotide 2','Nucleotide 3','Nucleotide 4']) # write the header row
    for i in range(100):
        csv_writer.writerow(['A','T','G','C'])
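
One advantage of the writer object over joining strings by hand is that it quotes fields which contain the delimiter. The sketch below writes a field with a comma in it and reads the file back; the file name and the example value are made up for illustration. Opening the file with `newline=""` follows the recommendation in the `csv` documentation.

```python
import csv
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "quoted.csv")

rows = [
    ["protein", "organism"],
    ["Hemoglobin subunit beta-1", "Mus musculus, Mouse"],  # field contains a comma
]

with open(path, mode="w", newline="") as output_file:
    csv.writer(output_file).writerows(rows)   # write all rows at once

with open(path, newline="") as input_file:
    raw_text = input_file.read()              # the text as stored on disk
with open(path, newline="") as input_file:
    rows_read = list(csv.reader(input_file))  # parsed back into lists

print(raw_text)
print(rows_read == rows)
```

In the raw text, the field with the comma is wrapped in double quotes, so the reader can restore it as a single element.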

With the command below, you can delete files from within Python code. This is useful for debugging and testing:

In [19]:
import os 
os.remove('additional_data/new_csv_file.csv')
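
Note that `os.remove()` raises a `FileNotFoundError` if the file does not exist, for example when the cell is run twice. A minimal sketch of two ways to handle this, using a temporary file as a stand-in (an assumption for illustration):

```python
import os
import tempfile

tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "new_csv_file.csv")
with open(path, mode="w") as f:
    f.write("A,T,G,C\n")

# Option 1: check that the file exists before removing it
if os.path.exists(path):
    os.remove(path)
exists_after_removal = os.path.exists(path)

# Option 2: try to remove it and catch the error if it is already gone
try:
    os.remove(path)
    second_removal = "removed"
except FileNotFoundError:
    second_removal = "file was already gone"

print(exists_after_removal, second_removal)
```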