Python for Beginners Exercise 4: Input/Output with text

Made by: Julian Liber

Date Created: 03/16/2020

## Hello Everyone!


#### This activity should teach you:
- How to read and write files
- What "parsing" is
- Basic methods for data input

<img src="https://imgs.xkcd.com/comics/exploits_of_a_mom.png" width=100% alt="xkcd comic: Exploits of a Mom"><p style="text-align: right;">From: https://imgs.xkcd.com/comics/exploits_of_a_mom.png</p>

Reading and writing data is a critical task for bioinformatics. There are many methods for this, but I'll introduce some of the basic ones.

#### Read a human-readable text file

The cleanest way to do this in base Python combines the `with` statement, the built-in `open()` function, and the `as` keyword.

The header looks like this:

`with open("my_file.txt", "r") as infile:`

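When the indented block under this header finishes, Python closes the file properly, even if an error occurred inside the block. In `open()`, the second argument is the mode: `"r"` reads the file, while `"w"` writes to it, replacing any existing contents. Other modes exist as well, such as `"a"` for appending.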
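As a minimal sketch of how the modes differ, the cell below writes, appends to, and then reads back a small scratch file. The file name `modes_demo.txt` and its contents are made up for illustration, and the file is placed in a temporary directory so nothing real is overwritten.

```python
import os
import tempfile

# Hypothetical scratch file in a temporary directory (not part of the exercise data)
scratch_path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(scratch_path, "w") as outfile:  # "w" creates the file, or empties an existing one
    outfile.write("first line\n")

with open(scratch_path, "a") as outfile:  # "a" adds to the end instead of overwriting
    outfile.write("second line\n")

with open(scratch_path, "r") as infile:   # "r" opens the file for reading
    contents = infile.read()              # read the whole file into one string

print(contents)
```

Opening the same file with `"w"` a second time would have discarded `first line`, which is why `"a"` is the safer choice when adding to existing results.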

In the body of this structure, `infile` is an _object_ with attributes and methods.
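To make "attributes and methods" concrete, here is a small sketch that inspects a file object. The file `object_demo.txt` and its two lines are invented for this example; `mode` and `closed` are attributes, while `readline()` is a method.

```python
import os
import tempfile

# Create a tiny, made-up FASTA-like file to inspect
demo_path = os.path.join(tempfile.mkdtemp(), "object_demo.txt")
with open(demo_path, "w") as outfile:
    outfile.write(">demo_header\nACGT\n")

with open(demo_path, "r") as infile:
    file_mode = infile.mode          # attribute: the mode the file was opened with
    open_inside = infile.closed      # attribute: False while inside the with block
    first_line = infile.readline()   # method: reads one line, including its "\n"

closed_after = infile.closed         # True once the with block has ended

print(file_mode, open_inside, closed_after)
print(repr(first_line))
```

Attributes are read without parentheses, while methods are called with them.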

We will read in a FASTA file as an example.

In [None]:
my_filename = "3Yad CW 3A_ITS1f_E01.fasta"
with open(my_filename, "r") as infile:
    line = infile.readline()  # Read one line
    while line != "": # Read until end of file is reached
        print(line)
        line = infile.readline() # Read the next line in the file

In [None]:
my_filename = "3Yad CW 3A_ITS1f_E01.fasta"
with open(my_filename, "r") as infile:
    lines = infile.readlines()  # Read all lines into a list
    for line in lines: # Iterate through all lines
        print(line)

Why might we use one of these techniques over the other?

The `while` loop holds only one line in memory at a time, which can help with large files. Realistically, though, you will rarely need to worry about this for human-readable text files.

By comparison, the `for` loop takes fewer lines to write, so it may simplify your code.
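A third option, not used in the cells above, gets both benefits: a file object can be looped over directly, which reads one line at a time and is as short as the `readlines()` version. The sketch below uses a made-up file, `demo.fasta`, so that it runs on its own.

```python
import os
import tempfile

# Hypothetical three-line FASTA file, written only so this cell is self-contained
fasta_path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(fasta_path, "w") as outfile:
    outfile.write(">demo_sequence\nACGTACGT\nTTGACA\n")

line_count = 0
with open(fasta_path, "r") as infile:
    for line in infile:          # the file object yields one line per iteration
        print(line, end="")      # end="" avoids printing a second newline
        line_count += 1

print("Lines read:", line_count)
```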

#### Output of text files

To write files, we use a very similar `with` header, changing the second argument to `open()`:

`with open("my_file.txt", "w") as outfile:`

Then, all we need is `outfile.write()` to write to the file.

In [None]:
my_filename = "example_outfile.txt"
with open(my_filename, "w") as outfile:
    for i in range(10):
        outfile.write(str(i) + "\n") # convert the number to a string, add an end-line character (\n)

If you are writing many lines to a file, it can help to first collect them in a string called a _buffer_, then write that string to the file in one call. This can save time, but requires more memory.

In [None]:
my_filename = "example_outfile.txt"
buffer = "" # empty string to store values
for i in range(10):
    buffer += str(i) + "\n" # convert the number to a string, add an end-line character (\n)
with open(my_filename, "w") as outfile:
    outfile.write(buffer)
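A common variation on the buffer idea, shown here as an alternative rather than as the approach used above, is to collect the lines in a list and combine them with `"".join()` just before writing. The output path below points to a temporary directory and is only for illustration.

```python
import os
import tempfile

# Write to a temporary location so no existing file is overwritten
out_path = os.path.join(tempfile.mkdtemp(), "example_outfile.txt")

lines = []                            # list used as the buffer
for i in range(10):
    lines.append(str(i) + "\n")       # one finished line per list item

with open(out_path, "w") as outfile:
    outfile.write("".join(lines))     # glue all lines together and write once

with open(out_path, "r") as infile:   # read the file back to check the result
    written = infile.read()

print(written)
```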

A useful tool for writing to files is string formatting. In Python 3.6+, formatted string literals (f-strings) can look like this:

`F"My number: {i}\n"`

In [None]:
my_filename = "example_outfile.txt"
buffer = "" # empty string to store values
for i in range(10):
    buffer += F"My number is: {i}\n" # insert the value of i into the string, add an end-line character (\n)
with open(my_filename, "w") as outfile:
    outfile.write(buffer)
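F-strings can also control how a value is displayed by adding a format specification after a colon inside the braces. The values and labels in this short sketch are made up for illustration.

```python
ratio = 2 / 3            # hypothetical value to report
label = "GC content"     # hypothetical label

formatted = f"{label}: {ratio:.2f}\n"   # .2f rounds to two decimal places
padded = f"{7:03d}"                     # 03d pads an integer with zeros to width 3

print(formatted, end="")
print(padded)
```

This is handy when writing tables of numbers to a file, since every value can be given the same width or precision.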

#### Do This:

Read in the FASTA file, and store the name (FASTA header) and sequence as two separate strings.

Each line ends with an end-of-line character (`\n`). You can use the string method `strip()` to remove it. Other string methods such as `split()` and `join()` are often useful.
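Here is a brief sketch of these three methods on a single made-up line. The header text is invented and is not taken from the exercise file.

```python
raw_line = ">seq_001 sample=leaf primer=ITS1f\n"   # hypothetical header line

clean_line = raw_line.strip()      # remove whitespace, including "\n", from both ends
fields = clean_line.split(" ")     # break the string into a list at each space
rejoined = "_".join(fields)        # glue the list items back together with "_"

print(repr(clean_line))
print(fields)
print(rejoined)
```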

Bonus:

If you feel ambitious, try storing multiple FASTA files in a dictionary, with the headers as keys and sequences as values.

### Pandas data handling

<img src="https://media.giphy.com/media/EatwJZRUIv41G/giphy.gif" width=70% alt="Hulahoop"><p style="text-align: right;">From: https://media.giphy.com/media/EatwJZRUIv41G/giphy.gif</p>

Pandas is a package that provides a data type you are likely already familiar with: the dataframe. Dataframes are ubiquitous in R, and are kinda like a spreadsheet, but without the interactivity.

Let's import `pandas`, using `pd` as the conventional name.

In [None]:
import pandas as pd

Typically the data will be a CSV (`.csv`) file, which stands for comma-separated values. However, any file with columns separated by a consistent character can be used (some `.txt` files, `.tsv` files) by passing that character to `read_csv()` as the `sep` argument.
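As a small sketch of reading a file that is not comma-separated, the cell below parses tab-separated text with `sep="\t"`. To keep it self-contained, the data is a made-up string wrapped in `io.StringIO`, which stands in for a file on disk; the sample names and genera are invented.

```python
import io

import pandas as pd

# Made-up tab-separated data; io.StringIO lets pandas treat the string like a file
tsv_text = "Name\tGenus\nsample_1\tMorchella\nsample_2\tTuber\n"

tsv_data = pd.read_csv(io.StringIO(tsv_text), sep="\t")  # sep sets the column separator

print(tsv_data)
```

With a real file, the first argument would simply be the file name, as in the cells that follow.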

I've included an example file, called `Water backups master.csv`.

In [None]:
csv_file = "Water backups master.csv"
data = pd.read_csv(csv_file)

In [None]:
data.head() # Display first 5 rows

Each column can be accessed using bracket notation:

`data["Name"]`

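If the column name is a valid Python name (a single word with no spaces) and does not clash with an existing dataframe attribute or method, dot notation can be used: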

`data.Genus`

While a column is a `pandas.Series`, it generally acts like a NumPy array (`np.array`).
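To illustrate what "acts like an array" means, this sketch builds a small Series by hand and applies arithmetic and a comparison to every element at once. The name `Length` and the values are hypothetical and do not come from the example file.

```python
import pandas as pd

# Hypothetical column of sequence lengths
lengths = pd.Series([250, 310, 275], name="Length")

doubled = lengths * 2                  # arithmetic applies to every element
long_ones = lengths[lengths > 260]     # a comparison gives a True/False mask for filtering

print(doubled.tolist())
print(long_ones.tolist())
```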

A `pandas.Series` has methods that are worth exploring, such as `value_counts()`, as shown below.

The [Pandas Cheat Sheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf) is an excellent resource for this package.


In [None]:
data.Genus.value_counts()

#### Do This:

Find a spreadsheet you might be interested in using in Python. Import the data, and use some method from pandas to find something interesting.

### Thanks for doing Exercise 4!

#### More will follow soon!