# Day 1: Working with .txt and .csv files
An important part of computational biology is working with input data in various formats. Very often, the data arrives as `.txt` or `.csv` files. JSON files are also important, but are not covered in this course. \
Some file types, such as `.csv`, have their own parser. A parser translates the structure of the file (here, the rows and columns of a csv file) into objects that you can access from your Python code.

## Reading and working with .txt files 
A `.txt` document can be opened with the `open()` function. It returns a file object, which we store in a variable; the content of the file is not read yet:

In [1]:
f = open("additional_data/simple_txt_file.txt")

Now we can read the whole content of the file into a string with `read()` and print it:

In [3]:
s = f.read()
print(s)

Biochemistry is awesome! 
Coding is great!



We can also read the file line by line with the method `readline()`. Each call returns the next line of the file:

In [6]:
f = open("additional_data/simple_txt_file.txt")
line1 = f.readline()
line2 = f.readline()

print("Line 1:", line1)
print("Line 2:", line2)

Line 1: Biochemistry is awesome! 

Line 2: Coding is great!
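
The blank lines in the output above appear because `readline()` keeps the newline character at the end of each line, and `print()` adds a second one. The following is a minimal sketch of how the trailing newline could be removed with `rstrip()`. It assumes a small stand-in file created on the fly (its name and content are made up for this illustration and are not part of the course data):

```python
import os
import tempfile

# Create a small stand-in file so the sketch runs anywhere (hypothetical content)
path = os.path.join(tempfile.mkdtemp(), "example.txt")
f = open(path, "w")
f.write("Biochemistry is awesome!\nCoding is great!\n")
f.close()

f = open(path)
raw_line = f.readline()                  # still ends with the newline character
clean_line = f.readline().rstrip("\n")   # trailing newline removed
f.close()

# repr() makes the otherwise invisible newline character visible
print(repr(raw_line))
print(repr(clean_line))
```

Using `repr()` here is only a convenience for inspecting the strings; in normal code you would print or process `clean_line` directly.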



Finally, once we are done working with the file, we need to close it. This releases the resources that the operating system holds for the open file (this step is crucial when working with large files!):

In [7]:
f.close()

To avoid bugs or resource leaks caused by forgetting to close a file, it is recommended to use the `with` statement, which closes the file automatically at the end of its block:

In [None]:
with open("additional_data/simple_txt_file.txt") as f:
    line1 = f.readline()
    line2 = f.readline()
    
    print("Line 1:", line1)
    print("Line 2:", line2)
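
To see that the `with` statement really closes the file, one can inspect the `closed` attribute of the file object inside and after the block. The sketch below also loops over the file object directly, which yields one line at a time and avoids repeated `readline()` calls. The stand-in file and its content are assumptions made only so the example is self-contained:

```python
import os
import tempfile

# Stand-in file so the sketch runs anywhere (hypothetical name and content)
path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(path, "w") as out:
    out.write("Biochemistry is awesome!\nCoding is great!\n")

lines = []
with open(path) as f:
    for line in f:                  # iterating a file object yields one line at a time
        lines.append(line.strip())  # strip() removes the trailing newline
    print("Closed inside the block?", f.closed)

print("Closed after the block?", f.closed)
print(lines)
```

The file is closed even if an error occurs inside the block, which is the main reason to prefer `with` over calling `close()` by hand.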

Now we look at another file, a sequence alignment, in which every line has the following structure: \
`IL2RA_SHEEP    MEPSLLMWRFFVFIVVPGCVTEACHDDPPSLRNA----------MFKVLRYE----VGTM`

For this, we need to write a custom parser that transforms the text file into a dictionary, with the sequence name as key and the aligned sequence as value.

In [15]:
gene_seq = {}  # maps sequence name -> aligned sequence
with open("additional_data/protein_alignment.txt") as f:
    for line in f:
        values = line.split()  # split on whitespace: [name, sequence]
        gene_seq[values[0]] = values[1]
print(gene_seq)

{'IL2RA_SHEEP': 'MEPSLLMWRFFVFIVVPGCVTEACHDDPPSLRNA----------MFKVLRYE----VGTM', 'IL2RA_MOUSE': 'MEPRLLMLGFLSLTIVPSCRAELCLYDPPEVPNA----------TFKALSYK----NGTI', 'IL2RA_FELCA': 'MEPSLLLWGILTFVVVHGHVTELCDENPPDIQHA----------TFKALTYK----TGTM', 'IL2RA_HUMAN': 'MDSYLLMWGLLTFIMVPGCQAELCDDDPPEIPHA----------TFKAMAYK----EGTM', 'IL2RA_MACMU': 'MDPYLLMWGLLTFITVPGCQAELCDDDPPKITHA----------TFKAVAYK----EGTM'}
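
The parser above assumes that every line contains exactly a name and a sequence. Real alignment files often include blank or header lines, for which `values[1]` would raise an `IndexError`. Below is a hedged sketch of a slightly more defensive variant. The helper name `parse_alignment` is hypothetical, and the input is a short list of lines copied from the example above (plus one blank line) instead of the actual file:

```python
def parse_alignment(lines):
    """Map each sequence name to its aligned sequence (hypothetical helper)."""
    gene_seq = {}
    for line in lines:
        values = line.split()
        if len(values) != 2:   # skip blank lines and anything not shaped like "name sequence"
            continue
        name, sequence = values
        gene_seq[name] = sequence
    return gene_seq


# Two lines taken from the example alignment, with a blank line in between
example_lines = [
    "IL2RA_SHEEP    MEPSLLMWRFFVFIVVPGCVTEACHDDPPSLRNA----------MFKVLRYE----VGTM\n",
    "\n",
    "IL2RA_MOUSE    MEPRLLMLGFLSLTIVPSCRAELCLYDPPEVPNA----------TFKALSYK----NGTI\n",
]

alignment = parse_alignment(example_lines)
for name, sequence in alignment.items():
    # report the aligned length and the number of gap characters per sequence
    print(name, len(sequence), sequence.count("-"))
```

Because a file object can be iterated line by line, the same function could be called as `parse_alignment(f)` inside a `with open(...) as f:` block.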


## Reading and working with .csv files

CSV files are one of the most common data formats in biology. Think of an Excel sheet, for example. In essence, a `.csv` file is like an `.xlsx` document without the fancy layout and functions: plain text in which each line is a row and the values are separated by commas. \
Python has a built-in `csv` module with a parser, which helps a lot. \
In the example below we read a `.csv` file that contains protein names, their gene names, and the species they come from.

In [12]:
import csv  # Python's built-in csv module

with open("additional_data/names_proteins.csv", newline="") as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=",")
    counter_line = 0
    header = []
    for row in csv_reader:
        if counter_line == 0:
            # the first row holds the column names
            print('The categories in this csv file are:', row[0], row[1], row[2])
            header = row
        else:
            # every following row is one data record
            print(f'{header[0]}:{row[0]}, {header[1]}:{row[1]}, {header[2]}:{row[2]}')
        counter_line += 1


The categories in this csv file are: protein  gene  organism
protein:DNA-directed RNA polymerase II subunit RPB1,  gene: POLR2A,  organism: Homo sapiens (Human)
protein:Hemoglobin subunit beta-1,  gene:Hbb-b1,  organism:Mus musculus (Mouse)
protein:Mating pheromone Er-23,  gene:MAT23,  organism:Euplotes raikovi 
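
The extra spaces in the output above suggest that some fields in the file start with a space after the comma. As a sketch of an alternative, `csv.DictReader` uses the header row as dictionary keys, and the option `skipinitialspace=True` drops whitespace that directly follows a delimiter. The in-memory text below is only an assumed stand-in for `names_proteins.csv`; the exact content of the real file is not known here:

```python
import csv
import io

# In-memory stand-in for names_proteins.csv (the exact file content is an assumption)
csv_text = (
    "protein, gene, organism\n"
    "Hemoglobin subunit beta-1, Hbb-b1, Mus musculus (Mouse)\n"
    "Mating pheromone Er-23, MAT23, Euplotes raikovi\n"
)

reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
records = list(reader)  # each row becomes a dict keyed by the column names

for record in records:
    print(record["protein"], "|", record["gene"], "|", record["organism"])
```

Accessing columns by name instead of by index keeps the code readable and still works if the column order in the file changes.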
