< [6 Tuples and Dictionaries](6-TuplesDictionaries.ipynb) | [Contents](0-Contents.ipynb) | [8 NumPy](8-NumPy.ipynb) >

# 7. File I/O

#### 7.1 Introduction
In the previous chapters, we either defined our data directly in the code, asked the user for data input, or generated it using functions. However, in many cases, we need to read data from external sources, such as files. 

A file is a collection of data stored on a disk. Files can be text files, which contain human-readable text (encoded as ASCII or Unicode), or binary files, which contain data in a format that is not human-readable.

#### 7.2 File location

When you open a file, you need to specify the file location. The file location can be either an absolute path or a relative path:

* An **absolute path** specifies the location of a file from the root directory.
* A **relative path** specifies the location of a file relative to the current working directory. 

For example, if you have a file called `example.txt` in the current working directory, you can open it using the relative path `'example.txt'`. If you have a file called `example.txt` in a directory called `files` in the current working directory, you can open it using the relative path `'files/example.txt'`.

In what follows, we assume that all files are in the `files` directory. This means that you can open a file called `example.txt` using the relative path `'files/example.txt'`.
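
As a small sketch of the difference between the two kinds of path, the standard `os.path` module can convert a relative path into an absolute one. The path `files/example.txt` is the one assumed in this chapter; the file does not have to exist for the conversion to work.

```python
import os

# Relative path, interpreted from the current working directory
relative_path = os.path.join("files", "example.txt")

# Absolute path, starting at the root directory
absolute_path = os.path.abspath(relative_path)

print(relative_path)
print(absolute_path)
print(os.path.isabs(relative_path))  # False
print(os.path.isabs(absolute_path))  # True
```

The absolute path that is printed depends on where the notebook is stored on your computer.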


#### 7.3 The `os` module

The `os` module gives access to operating-system-dependent functionality, including the file system. Here are some of its most commonly used functions for working with files and directories:

* `os.getcwd()`: returns the current working directory.
* `os.chdir(path)`: changes the current working directory to `path`.
* `os.listdir(path)`: returns a list of the names of all files and directories in the directory specified by `path`. If `path` is omitted, the current working directory is used.
* `os.mkdir(path)`: creates a new directory with the name `path`.

Run the code cells below to see how these functions work.


In [None]:
import os

# Get the current working directory
print(os.getcwd())

In [None]:
# Change the current working directory
os.chdir('files')

# Get the current working directory
print(os.getcwd())

In [None]:
# List all files and directories in the current working directory
print(os.listdir())


Note that since we changed the current working directory to the `files` directory, the content of this directory is listed.

In [None]:
# Create a new directory
os.mkdir('new_directory')

Note that since we changed the current working directory to the `files` directory, the new directory `new_directory` is created in the `files` directory.
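
Running the `os.mkdir()` cell a second time raises a `FileExistsError`, because the directory already exists. One way to avoid this, shown here as a minimal sketch, is `os.makedirs()` with `exist_ok=True`. A temporary directory stands in for the `files` directory (an assumption made so that the sketch does not touch your own files).

```python
import os
import tempfile

# A temporary directory stands in for the `files` directory (assumption)
base_dir = tempfile.mkdtemp()
new_dir = os.path.join(base_dir, "new_directory")

# exist_ok=True: no error if the directory already exists
os.makedirs(new_dir, exist_ok=True)
os.makedirs(new_dir, exist_ok=True)  # the second call is harmless

print(os.path.isdir(new_dir))  # True
print(os.listdir(base_dir))    # ['new_directory']
```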

Execute the code cell below to change the current working directory back to the original directory. This is important because the rest of the code in this notebook assumes that the current working directory is the original directory.


In [None]:
os.chdir("..")

# Get the current working directory
print(os.getcwd())

The `..` between the parentheses is a special path that refers to the parent directory of the current working directory. This is useful when you want **to move up one directory level**. 
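
To see which directory `..` refers to without actually changing the working directory, you can convert it to an absolute path. A minimal sketch using `os.path.abspath()`:

```python
import os

current = os.getcwd()
parent = os.path.abspath("..")  # absolute path of the parent directory

print(current)
print(parent)
```

The second path is the first one with its last directory name removed.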

#### 7.4 Reading text files

Reading data from a text file goes as follows:
* open the file with the `open()` function
* read the data using the `read()`, `readline()`, or `readlines()` method of the file object
* close the file with the `close()` method

You can open a file using the `open()` function. The `open()` function takes two arguments:
* the name of the file you want to open and
* the mode in which you want to open the file. The mode can be 
    * `'r'` for reading (default),
    * `'w'` for writing, or
    * `'a'` for appending. 

The `open()` function returns a file object, which you can use to read data from or write data to the file. A file object has several methods to read data:
* `read()`: read the entire content of the file as a single string
* `readline()`: read one line of the file
* `readlines()`: read all lines of the file and store them in a list

To close the file, call the `close()` method of the file object.
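
The three reading methods can be compared on a small file. The sketch below first creates a temporary three-line file (its name and content are made up for illustration), so it runs without any of the course files. It uses `write()`, which stores a string in a file.

```python
import os
import tempfile

# Create a small hypothetical file to read from
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
f = open(path, "w")
f.write("1.5\n2.5\n3.5\n")
f.close()

f = open(path, "r")
whole = f.read()        # entire content as one string
f.close()

f = open(path, "r")
first = f.readline()    # only the first line
second = f.readline()   # a second call returns the next line
f.close()

f = open(path, "r")
lines = f.readlines()   # list with one string per line
f.close()

print(repr(whole))   # '1.5\n2.5\n3.5\n'
print(repr(first))   # '1.5\n'
print(repr(second))  # '2.5\n'
print(lines)         # ['1.5\n', '2.5\n', '3.5\n']
```

Note that the file is reopened before each method is tried: a file object remembers how far it has read, so reading continues where the previous call stopped.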

#### A first example

Consider the file `data.txt` in the `files` directory. The file contains 5 lines of data. The following code reads the entire content of the file into a list: 

```python
    f = open("files/data.txt", "r") # "r" is optional
    data = f.readlines()
    f.close()
    print(data)
```

Run the code in the cell below:

In [None]:
f = open("files/data.txt", "r")
data = f.readlines()
f.close()
print(data)

**Important notes** 
* Each element of the list corresponds to a line of the file.
* The newline character `\n` is included in each element of the list. To get rid of the newline character, you can use the string method `strip()`, which removes leading and trailing whitespace (including the newline character) from a string.
* The values in the file are read (interpreted) as strings. If you want to convert the values to numbers (to perform calculations), you need to use the `int()` or `float()` functions.
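
The effect of `strip()` and `float()` can be seen on a single string. The value below is made up; it only mimics one element of the list returned by `readlines()`.

```python
line = "3.14\n"            # one line as returned by readlines() (made-up value)

stripped = line.strip()    # newline removed: '3.14'
value = float(stripped)    # converted to a number: 3.14

print(repr(line))          # '3.14\n'
print(repr(stripped))      # '3.14'
print(stripped + stripped) # 3.143.14 (strings are concatenated)
print(value * 2)           # 6.28 (numbers are multiplied)
```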

The following code reads the entire content of the file into a list and converts the values to floating-point numbers:

```python
    f = open("files/data.txt", "r")
    data = f.readlines()
    f.close()
    for i in range(len(data)):
        data[i] = float(data[i].strip())
    print(data)
```

Run the code in the cell below:

In [None]:
f = open("files/data.txt", "r")
data = f.readlines()
f.close()
for i in range(len(data)):
    data[i] = float(data[i].strip())
print(data)
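
A common alternative, not used in the rest of this chapter, is the `with` statement, which closes the file automatically at the end of the indented block. Combined with a list comprehension, the example above can be sketched as follows. A temporary file with made-up numbers stands in for `data.txt`, so the sketch runs on its own.

```python
import os
import tempfile

# Hypothetical stand-in for files/data.txt
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as f:
    f.write("1.0\n2.5\n4.0\n")

# The file is closed automatically when the with-block ends
with open(path, "r") as f:
    data = [float(line.strip()) for line in f.readlines()]

print(data)      # [1.0, 2.5, 4.0]
print(f.closed)  # True
```

The loop version with `open()` and `close()` gives the same result; the `with` form mainly protects against forgetting to close the file.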

#### A second example
The file `names.txt` in the folder `files` contains a list of names, one per line. The following code reads the entire content of the file into a list and removes the newline character:

```python
    f = open("files/names.txt", "r")
    names = f.readlines()
    f.close()
    for i in range(len(names)):
        names[i] = names[i].strip()
    print(names)
```

Run the code in the cell below:

In [None]:
f = open("files/names.txt", "r")
names = f.readlines()
f.close()
for i in range(len(names)):
    names[i] = names[i].strip()
print(names)

#### 7.5 Processing file content

In many cases, you need to process the content of a file:
* You may need to remove the newline character.
* You may need to split the content of a file into words or numbers, or a combination of both.
* You may need to convert (part of) the content of a file to a different format.


The file `atoms.txt` contains a list of 50 atoms and their properties. For each atom the following data are given: the name of the atom, the symbol of the atom, and the atomic weight. The data are separated by a comma followed by a space.

The following code reads the content of the file and stores the data in a nested list. Each element of the outer list corresponds to one line of the file and is itself a list holding the name, the symbol, and the atomic weight. The name and the symbol remain strings; the atomic weight is converted to a floating-point number:

```python
    f = open("files/atoms.txt", "r")
    data = f.readlines()
    f.close()
    for i in range(len(data)):
        data[i] = data[i].strip().split(", ")
        data[i][-1] = float(data[i][-1])
    print(data)
```

Run the code in the cell below:


In [None]:
f = open("files/atoms.txt", "r")
data = f.readlines()
f.close()
for i in range(len(data)):
    data[i] = data[i].strip().split(", ")
    data[i][-1] = float(data[i][-1])
print(data)

Note that the stripping and the splitting of the data are done in one line of code: first the newline character is removed, and then the string is split into a list of strings.
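
To make the chained call easier to follow, the steps can be carried out one at a time on a single line. The sample line below is made up, in the format described above; it is not copied from `atoms.txt`.

```python
line = "Hydrogen, H, 1.008\n"   # made-up sample line in the described format

stripped = line.strip()          # 'Hydrogen, H, 1.008'
fields = stripped.split(", ")    # ['Hydrogen', 'H', '1.008']
fields[-1] = float(fields[-1])   # the last element becomes a float

print(fields)                    # ['Hydrogen', 'H', 1.008]
```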

#### 7.6 Writing to files

Writing data to a file goes as follows:
* open the file
* write the data
* close the file

You can use the `print()` function to write data to a file. The `print()` function takes an additional argument, `file`, which specifies the file object to which you want to write the data. 

#### An example

The following code asks the user for words, writes each word on its own line to a file called `words.txt`, and finally writes the number of words entered:

```python
    n = 0
    f = open("files/words.txt", "w")
    while True:
        word = input("Enter a word (or 'q' to quit): ")
        if word == "q":
            break
        n = n + 1
        print(word, file = f)
    print("Number of words:", n, file = f)
    f.close()
```

Run the code in the cell below:


In [None]:
n = 0
f = open("files/words.txt", "w")
while True:
    word = input("Enter a word (or 'q' to quit): ")
    if word == "q":
        break
    n = n + 1
    print(word, file = f)
print("Number of words:", n, file = f)
f.close()

**Note**: if you forget to specify the file object `file = f`, the data will be written to the standard output (the screen). 
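
The mode in which the file is opened also matters when writing: `"w"` starts with an empty file, even if the file already exists, whereas `"a"` keeps the existing content and adds to the end. A minimal sketch, using a temporary file (the name `log.txt` is only an illustration):

```python
import os
import tempfile

# Hypothetical file in a temporary directory, so no real data is overwritten
path = os.path.join(tempfile.mkdtemp(), "log.txt")

f = open(path, "w")   # "w": start with an empty file
print("first", file=f)
f.close()

f = open(path, "a")   # "a": keep the existing content, add at the end
print("second", file=f)
f.close()

f = open(path, "r")
lines = f.readlines()
f.close()
print(lines)          # ['first\n', 'second\n']

f = open(path, "w")   # "w" again: the previous content is discarded
print("third", file=f)
f.close()

f = open(path, "r")
lines_after = f.readlines()
f.close()
print(lines_after)    # ['third\n']
```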

#### 7.7 Reading from a directory

Sometimes the data you need to process are stored in multiple files in a directory. You can find these files using the `os` module: the function `os.listdir()` returns a list with the names of all files and subdirectories in a directory.

The following code lists all files in the `files` directory:

```python
    import os
    files = os.listdir("files")
    print(files)
```

Run the code in the cell below:

In [None]:
import os
files = os.listdir("files")
print(files)

#### Example

The directory `pH_data` contains several files, each of which contains one pH measurement. The following code reads all files in the directory and stores the pH measurements as floating-point numbers in a list:

```python
    import os
    pH = []
    files = os.listdir("files/pH_data")
    for file in files:
        f = open("files/pH_data/" + file, "r")  # f is the file object, file is the file name
        pH.append(float(f.readline().strip()))
        f.close()
    print(pH)
```

Run the code in the cell below:

In [None]:
import os
pH = []
files = os.listdir("files/pH_data")
for file in files:
    f = open("files/pH_data/" + file, "r")
    pH.append(float(f.readline().strip()))
    f.close()
print(pH)

Note that the command `pH.append(float(f.readline().strip()))`
* reads the first (and only) line of the file, 
* removes the newline character, 
* converts the value to a floating-point number, and
* appends the value to the list `pH`.

The code becomes more readable if we split this command over several lines:

```python
    import os
    pH = []
    files = os.listdir("files/pH_data")
    for file in files:
        f = open("files/pH_data/" + file, "r")
        line = f.readline()
        line = line.strip()
        pH.append(float(line))
        f.close()
    print(pH)
```
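
`os.listdir()` returns the names in arbitrary order, and the list also contains subdirectories. A sketch of a slightly more defensive version sorts the names and skips anything that is not a file. A temporary directory with made-up measurement files (`sample1.txt` and so on) stands in for `pH_data`, so these names and values are assumptions, not the course data.

```python
import os
import tempfile

# Build a hypothetical stand-in for the pH_data directory
folder = tempfile.mkdtemp()
samples = {"sample1.txt": "7.2\n", "sample2.txt": "6.8\n", "sample3.txt": "7.5\n"}
for name, content in samples.items():
    f = open(os.path.join(folder, name), "w")
    f.write(content)
    f.close()
os.mkdir(os.path.join(folder, "old"))    # a subdirectory that must be skipped

pH = []
for name in sorted(os.listdir(folder)):  # sorted: fixed, alphabetical order
    path = os.path.join(folder, name)    # build the path in a portable way
    if not os.path.isfile(path):         # skip subdirectories
        continue
    f = open(path, "r")
    pH.append(float(f.readline().strip()))
    f.close()

print(pH)  # [7.2, 6.8, 7.5]
```

Using `os.path.join()` instead of string concatenation lets Python choose the correct path separator for the operating system.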



#### 7.8 The infoFunWP module

The module `infoFunWP` is part of a package with the same name but is not part of the standard Python library. 

The module contains the following functions to **read** files:
* `listRead()`: read the complete file and return a list of strings (one string per line)
* `listReadValues()`: read the complete file and return a list of floats (one float per line)
* `stringRead()`: read the complete file and return a **single string**

The module contains the following functions to **write** files:
* `listWrite()`: write a list of strings to a text file (one line per string)
* `stringWrite()`: write a string to a text file

In order to use the module, you need to install the package `infoFunWP` first. You can install the package using the following command in a terminal:

```bash
    pip install infoFunWP
```

Next, you can import the module using the following command:

```python
    import infoFunWP as infoFun
```

A few examples of how to use the module are given below. 

The function `listReadValues()` is very useful if the file contains a single column with numerical data. The function reads the data and returns a list of floating-point numbers:

In [None]:
# read the content of the file data.txt in the files directory
import infoFunWP as infoFun
data = infoFun.listReadValues("files/data.txt")
print(data)

Compare this code with the code in the first example of Section 7.4. The code is more concise and easier to read.

If the file contains text data, you can use the function `listRead()` to read the data and return a list of strings:

In [None]:
# read the content of the file names.txt in the files directory
import infoFunWP as infoFun
names = infoFun.listRead("files/names.txt")
print(names)

When the file contains multiple columns of data such as in the file `atoms.txt`, you can use the function `listRead()` to read the data and to get a list of strings. You can then process the data as described in Section 7.5:
* remove the newline character
* split each string into a list of strings
* convert to float
* ...

The following code gives the same result as the code in Section 7.5:

In [None]:
# read the content of the file atoms.txt in the files directory
import infoFunWP as infoFun
data = infoFun.listRead("files/atoms.txt")
for i in range(len(data)):
    data[i] = data[i].strip().split(", ")
    data[i][-1] = float(data[i][-1])
print(data)

### 7.9 Exercises

#### **Exercise 1**

The file `numbers.txt` in the `files` directory contains 1000 throws of a die. Each throw is a number between 1 and 6. 

Write a function `count_freqs()` that reads the content of the file and calculates the frequency of each number. The function should return a dictionary `freq_dict` with the numbers as keys and the frequencies as values. The dictionary should look like this:

`freq_dict = {1: 182, 6: 159, 2: 158, 4: 168, 3: 167, 5: 166}`.



In [None]:
def count_freqs(file):
    freq_dict = {}
    ... # code to read the file

    ... # code to count frequencies

    ... # code to fill the dictionary
    return freq_dict

freq_dict = count_freqs("files/numbers.txt")
print(freq_dict) # {1: 182, 6: 159, 2: 158, 4: 168, 3: 167, 5: 166}

#### **Exercise 2**

The file `codontable.csv` in the `files` directory contains for all amino acids the following data:
* the name of the amino acid
* the one-letter code of the amino acid
* the list of codons that code for the amino acid

The data are separated by a comma.

Write a function `read_data()` that reads the content of the file, stores the data in a list of lists and returns this nested list. Each element of the outer list corresponds to one line of the file and is itself a list of strings. The nested list should look like this:

`[['Alanine', 'A', 'GCT, GCC, GCA, GCG'], ['Arginine', 'R', 'CGT, CGC, CGA, CGG, AGA, AGG'], ...]`.

In [None]:
def read_data(file):
    ... # code to read the file and process the data
    
    return data

data = read_data("files/codontable.csv")
print(data)

#### **Exercise 3**

We continue with the data from the previous exercise. 

The one-letter code of an amino acid is used in bioinformatics to represent amino acids. Using the one-letter code, the sequence of amino acids in a protein can be represented as a string of characters. 

Write a function `translate_dna()` that takes a string of DNA as input and returns the corresponding string of one-letter codes of the amino acids. The function should use the data from the previous exercise.

For the DNA sequence `ATCATCCTCCTC` the function should return the string `IILL` because `ATC` translates to `I` and `CTC` translates to `L`.

In [None]:
def translate_dna(dna):
    slc = "" # string to store the one letter codes
    ... # code to translate the DNA sequence
    return slc

dna = "ATCATCCTCCTC"
slc = translate_dna(dna)
print(slc) # IILL
