[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/How-to-Learn-to-Code/python-class/blob/master/Lesson_4_FileIO/Lesson_4.ipynb)

# Lesson 4 - File IO and String Manipulation

## Learning Objectives: 

Students will be able to load text files into Python objects and learn to manipulate file names and strings.

* [Introduction to IO](#io) 
* [Reading Files](#reading)
* [Closing Files](#closing)
* [Context Management](#context)
* [String Manipulation](#string)
* [Working with Multiple Files](#multiple)
* [In-Class Exercises](#exercises)

### Introduction to IO <a id='io'></a>

Every program has an input and an output. In science, the input is usually your raw data; the output can be anything from processed data and statistical tests to model predictions or figures for a paper or presentation. In any case, loading your input data is often one of the first tasks that you will have to execute in your code.

This lesson will teach you how to deal with text files, but the tools learned here will be applicable to a variety of different file types that you may come across in your research.

#### Motivating Example

Your collaborator just sent you files containing lists of genes that are upregulated in specific cell types. You're interested in using these gene lists to determine how similar cells in your own dataset are to these cell types discovered by your collaborator. You have all the files in a single directory, but how do you go about reading the data into Python?

#### Setup

We will be using data in external files for this lesson, so we will need a way to access these. In Colab, we can do this by cloning the data from GitHub. You can also access your Google Drive from Colab by mounting your Drive every time you open your notebook.

We'll clone the data from GitHub using a shell language called bash, which is used to communicate with UNIX-like operating systems, including macOS and Linux. (Google Colab runs your notebook on a Linux server.) In a notebook, starting a line with an exclamation point (`!`) runs it as a shell command, while starting a line with a percent sign (`%`) runs a notebook "magic" command such as `%cd`, which mimics the shell command of the same name.

In [None]:
!git clone https://github.com/How-to-Learn-to-Code/python-class.git

Now we have access to all the files in the GitHub repository. Next, we want to change the folder where our code will execute. This is called the **working directory**. We can change the working directory using the `cd` bash command. Let's go to the directory where the data is stored.

In [None]:
%cd python-class/Lesson_4_FileIO/data

If we want to double check which directory we're in, we can print our current working directory using the command `pwd`.

In [None]:
%pwd

We can also see all the files in our current directory using the "list" command `ls`. After running the next cell, you should see all the files in the `data` directory.

In [None]:
%ls

### Reading Files <a id='reading'></a>

In order to read files in Python, we need to know the file path. The file path is the location of the file on your computer. Because we're already in the `data` directory, we can access the file by using the file name.

To open a file in Python, we can simply use the `open()` function.

In [None]:
filepath = 'dna.txt'
my_file = open(filepath)

Now the file is open, but we can't see the contents. If we just try to print the file, we'll see a description of the file object rather than the text inside it.

Instead, in order to read the contents of the file, we can use the `read()` method. This will read the entire file and return the contents as a string.

In [None]:
file_contents = my_file.read()
print(file_contents)

Because we've already read the whole file, the file is considered "exhausted": the read position is at the end, so calling `read()` again returns an empty string. To read the contents again, we would have to reopen the file or rewind it with `seek(0)`, so make sure to store the data in a variable!

In [None]:
print(my_file.read())
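The behavior of an exhausted file can be sketched with a small throwaway file. The file name `demo_dna.txt` and its contents are made up for illustration and are not part of the lesson data.

```python
import os
import tempfile

# Hypothetical stand-in for dna.txt so the example runs anywhere.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_dna.txt')
writer = open(demo_path, 'w')
writer.write('ATGCGT\n')
writer.close()

demo_file = open(demo_path)
first_read = demo_file.read()    # reads everything; the position is now at the end
second_read = demo_file.read()   # nothing is left to read, so this is ''
demo_file.seek(0)                # move the read position back to the start
third_read = demo_file.read()    # the full contents again
demo_file.close()

print(repr(first_read), repr(second_read), repr(third_read))
```

Storing the first result in a variable is usually simpler than rewinding the file.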

### Closing Files <a id='closing'></a>

When you're done with a file, it's important to close it. Until you do, the file stays open in the background and keeps holding operating-system resources. If you open too many files without closing them, you can hit the system's limit on open files, and data written to a file may not be saved to disk until the file is closed.

To close a file, we can simply use the `close()` method. This will close the file and release the resources that were tied to it.

In [None]:
my_file.close()
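If you are unsure whether a file is still open, the file object's `closed` attribute can tell you. The sketch below uses a made-up temporary file (`demo.txt`) so it can run on its own.

```python
import os
import tempfile

# Hypothetical throwaway file, created only for this example.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.txt')
writer = open(demo_path, 'w')
writer.write('ATGC\n')
writer.close()

demo_file = open(demo_path)
was_closed_before = demo_file.closed   # False: the file is still open
demo_file.close()
is_closed_after = demo_file.closed     # True: the file has been closed

# Reading from a closed file raises a ValueError.
try:
    demo_file.read()
except ValueError as error:
    message = str(error)

print(was_closed_before, is_closed_after, message)
```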

### Context Management <a id='context'></a>

Remembering to close your file every time you open it is a pain. Fortunately, there is a clean way to deal with this: the Python `with` statement, which works with objects known as "context managers". File objects are context managers.

As shown below, if you open a file using the `with` statement, Python will automatically close the file for you when you're done with it.

This is the preferred way to open files in Python.

In [None]:
# Open the file with a context manager
with open('three_seq.txt') as file_handle:
    file_contents2 = file_handle.readlines()

# File is closed automatically!
try:
    print(file_handle.read())
except ValueError as e:
    print(e)

Also, notice that we used the `readlines()` method instead of `read()`. This method reads the entire file and returns a list of strings, where each string is a line in the file, including its trailing newline character (`'\n'`).

In [None]:
print(file_contents2)
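If the newline characters get in the way, one alternative is to read the whole file and split it with the string method `splitlines()`. This is a minimal sketch using a made-up file (`demo_seqs.txt`), not the lesson's own data.

```python
import os
import tempfile

# Hypothetical three-line file, created only for this example.
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_seqs.txt')
with open(demo_path, 'w') as handle:
    handle.write('ATGC\nGGCA\nTTAG\n')

# readlines() keeps the newline at the end of every line.
with open(demo_path) as handle:
    lines_with_newlines = handle.readlines()

# read() followed by splitlines() drops the newlines.
with open(demo_path) as handle:
    lines_without_newlines = handle.read().splitlines()

print(lines_with_newlines)     # ['ATGC\n', 'GGCA\n', 'TTAG\n']
print(lines_without_newlines)  # ['ATGC', 'GGCA', 'TTAG']
```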

### String Manipulation <a id='string'></a>

Let's investigate the data that we just read into `file_contents` from "dna.txt".

In [None]:
print('The DNA sequence is', file_contents, 'and is', len(file_contents), 'bases long.')

The output looks strange and the length is incorrect because of a hidden newline (`'\n'`) character at the end of the string. The file we read in is actually 2 lines, with the second line being blank. The `strip()` method removes leading (at the beginning) and trailing (at the end) characters. By default, it removes all whitespace characters, including spaces (`' '`), tabs (`'\t'`), and newlines (`'\n'`).

In [None]:
my_dna_strip = file_contents.strip()
print('The DNA sequence is', my_dna_strip, 'and is', len(my_dna_strip), 'bases long.')
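Two related methods, `lstrip()` and `rstrip()`, strip only one side of the string. The example string below is made up; `repr()` is used so the hidden whitespace is visible when printed.

```python
raw = '  ATGC\n'   # made-up string: two leading spaces and a trailing newline

stripped = raw.strip()         # both ends
left_stripped = raw.lstrip()   # beginning only
right_stripped = raw.rstrip()  # end only

print(repr(stripped))          # 'ATGC'
print(repr(left_stripped))     # 'ATGC\n'
print(repr(right_stripped))    # '  ATGC'
print(len(raw), len(stripped)) # 7 4
```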

You can remove other characters from the ends of a string by passing them as an argument to the `strip()` method. Characters in the middle of the string are left alone.

In [None]:
new_dna = my_dna_strip.strip('A')
print(new_dna)
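The argument to `strip()` is treated as a set of characters, not as a single word, and only the ends of the string are affected. A small sketch with a made-up sequence:

```python
seq = 'AATGCATA'   # made-up sequence for illustration

strip_a = seq.strip('A')    # removes A's from both ends, but not the inner A
strip_at = seq.strip('AT')  # removes any run of A's and T's from both ends

print(strip_a)   # TGCAT
print(strip_at)  # GC
```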

There are some more useful things we can do with strings, including finding and replacing text using the `replace()` method. The first argument to `replace()` is the substring we are searching for and the second argument is what we want to replace it with. Every occurrence is replaced, and the result is returned as a new string.

In [None]:
'this is a STRING'.replace('STRING', 'pizza')
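As a biology-flavored sketch, `replace()` can convert a made-up DNA sequence to RNA. An optional third argument limits how many occurrences are replaced, and the original string is never modified.

```python
dna = 'ATGTTT'   # made-up sequence for illustration

rna = dna.replace('T', 'U')            # replaces every T
first_only = dna.replace('T', 'U', 1)  # replaces only the first T

print(rna)         # AUGUUU
print(first_only)  # AUGTTT
print(dna)         # ATGTTT (unchanged)
```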

We can use the `upper()` and `lower()` methods to make all characters in a string upper- or lower-case, respectively. Also, `swapcase()` switches all upper- and lower-case letters.

In [None]:
print('lower2upper'.upper())
print('UPPER2LOWER'.lower())
print('cAmElCaSe'.swapcase())

Strings can be concatenated using the addition operator:

In [None]:
'str' + 'ing'

Variables can be inserted into strings by first converting them into strings and then concatenating them:

In [None]:
my_name = 'John'
my_age = 30
print('My name is ' + my_name + ' and my age is ' + str(my_age) + '.')

As of Python 3.6, one of the easiest ways to format strings is using f-strings. This is done by placing an `f` before the string and using curly braces `{}` to insert variables. The variables will automatically be converted to strings.

In [None]:
print(f'My name is {my_name} and my age is {my_age}')

You can even do some cool formatting tricks with f-strings, like rounding numbers to a certain number of decimal places:

In [None]:
mean_val = 3.14159
print(f'The mean value is {mean_val:.2f}')
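A few more format options are sketched below, using made-up values. The curly braces can hold any expression, `:,` adds thousands separators, and `:.1%` displays a fraction as a percentage with one decimal place.

```python
cell_count = 1234567   # made-up values for illustration
fraction = 0.256
seq = 'ATGC'

with_commas = f'{cell_count:,} cells'
as_percent = f'{fraction:.1%} of cells'
with_expression = f'{seq} has {len(seq)} bases'

print(with_commas)      # 1,234,567 cells
print(as_percent)       # 25.6% of cells
print(with_expression)  # ATGC has 4 bases
```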

Sometimes it is useful to check if a certain pattern is present within a string. Here is one way to do this:

In [None]:
print('hi' in 'this is a test')
print('ahoy' in 'this is a test')
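When a simple yes or no is not enough, a few other string methods may help: `count()` reports how many times a substring occurs, `find()` reports where it first occurs (or `-1` if it is absent), and `startswith()` checks the beginning of the string. The sequence below is made up.

```python
seq = 'ATGCATGC'   # made-up sequence for illustration

n_atg = seq.count('ATG')         # number of non-overlapping matches
first_gc = seq.find('GC')        # index of the first match
missing = seq.find('TTT')        # -1 means "not found"
has_start = seq.startswith('ATG')

print(n_atg, first_gc, missing, has_start)  # 2 2 -1 True
```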

You can also use the `in` keyword to check if there is an item in a list or a key in a dictionary:

In [None]:
my_list = [1, 2, 3, 4, 5]
print(5 in my_list)

my_dict = {'name': 'John', 'age': 30}
print('name' in my_dict)

### Working with Multiple Files <a id='multiple'></a>

To go from reading a single file to reading multiple files, use a `for` loop.

We'll start by getting all the file names in a directory. We can do this using the `os` module and the `listdir` function.

In [None]:
import os

data_dir = 'gene_sets'
files = os.listdir(data_dir)
files

We want to read the hallmark gene sets. However, there is also a 'junk' file that we do not want to read. This often occurs if you use a program that stores temporary files.

We will use a filter to get only the files with the extension we want:

In [None]:
gene_set_files = []
for file in files:
    if '.grp' in file:
        gene_set_files.append(file)

gene_set_files

To do this, we initialized an empty list and used a nested `if` statement within a `for` loop to append to the list only under a certain condition: if a file name included the extension we want to read.

This is a lot of boilerplate, but there is a better way: list comprehensions.

List comprehensions are a concise way to create lists in Python. They usually take fewer lines than the equivalent `for` loop, and they are often somewhat faster as well. We can write a list comprehension for the same task as above:

In [None]:
gene_set_files = [f for f in files if ('.grp' in f)]
gene_set_files
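Note that `'.grp' in f` is true whenever `.grp` appears anywhere in the name. If you only want names that end with the extension, the string method `endswith()` is a stricter check. The file names below are hypothetical and chosen to show the difference.

```python
# Hypothetical file names, not the actual contents of the data directory.
file_names = ['HALLMARK_A.v1.grp', 'notes.grp.bak', 'junk.tmp']

contains_grp = [f for f in file_names if '.grp' in f]
ends_with_grp = [f for f in file_names if f.endswith('.grp')]

print(contains_grp)   # ['HALLMARK_A.v1.grp', 'notes.grp.bak']
print(ends_with_grp)  # ['HALLMARK_A.v1.grp']
```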

List comprehensions are just shorthand for a `for` loop that populates a list. You can use them to shorten some `for` loops, like the one above.

Here are some more examples of list comprehensions:

In [None]:
# List comprehension with string manipulation.
[f'This is a {string}' for string in ['mouse', 'moose', 'pipette']]

In [None]:
# List comprehension with arithmetic.
numbers = [1, 2, 3, 4, 5]
mean = sum(numbers) / len(numbers)
[number - mean for number in numbers]

When in doubt, you can always write a list comprehension as a full `for` loop. 
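For instance, the arithmetic example can be written both ways. This sketch repeats the same numbers and checks that the two versions agree.

```python
numbers = [1, 2, 3, 4, 5]
mean = sum(numbers) / len(numbers)

# List comprehension version.
centered_comprehension = [number - mean for number in numbers]

# Equivalent for loop version.
centered_loop = []
for number in numbers:
    centered_loop.append(number - mean)

print(centered_comprehension)                   # [-2.0, -1.0, 0.0, 1.0, 2.0]
print(centered_loop == centered_comprehension)  # True
```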

Let's read in all the gene sets. But first, let's define a helper function that reads in a file and returns the contents as a list of strings.

In [None]:
def file_reader(filepath):
    with open(filepath) as file_handle:
        return file_handle.readlines()

Now we can use this function and list comprehensions to read in all the gene sets.

In [None]:
gene_sets = [file_reader(data_dir + '/' + file) for file in gene_set_files]
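Joining paths with `+ '/' +` works on Linux and macOS. A more portable option is `os.path.join`, which uses the correct separator for the operating system. The directory and file names in this sketch are hypothetical.

```python
import os

# Hypothetical directory and file names for illustration.
example_dir = 'gene_sets'
example_files = ['HALLMARK_A.v1.grp', 'HALLMARK_B.v1.grp']

# Build one full path per file name.
example_paths = [os.path.join(example_dir, name) for name in example_files]
print(example_paths)

# os.path.basename recovers the file name from a full path.
print(os.path.basename(example_paths[0]))
```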

In [None]:
gene_sets[0][:5]

Each gene set actually contains the gene set name and a URL as its first two elements. The rest of the list consists of the genes in the set. We want to filter the lists to contain only the genes of interest. Let's use list comprehensions to do 3 things: remove the newline characters, extract and store the gene set names, and keep only the third through last elements of each list:

In [None]:
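# Remove the newline character from each gene name.
gene_sets = [[line.strip() for line in gene_set] for gene_set in gene_sets]

# Extract the gene set name.
gene_set_names = [gene_set[0] for gene_set in gene_sets]

# Remove the gene set name and url from the gene list.
gene_sets = [gene_set[2:] for gene_set in gene_sets]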

In [None]:
gene_set_names
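If you would rather look up a gene list by name than by position, one option is to pair names and gene lists in a dictionary using `zip()`. This is only a sketch: the names and genes below are small made-up stand-ins, not the real hallmark sets.

```python
# Made-up stand-ins for gene_set_names and gene_sets.
example_names = ['SET_A', 'SET_B']
example_sets = [['GENE1', 'GENE2'], ['GENE3']]

# zip() pairs each name with the gene list at the same position.
gene_set_lookup = dict(zip(example_names, example_sets))

print(gene_set_lookup['SET_A'])  # ['GENE1', 'GENE2']
print('SET_B' in gene_set_lookup)  # True
```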

### Sneak Peek at Visualization

Now that we've read the gene sets, we can use them to calculate how much a cell is expressing the genes in each set. Then, we can visualize the expression of these genes in different cell populations.

*Note: this section will use some packages and code that you have not encountered. That's okay--you don't have to understand all of it. The goal is just so you can see how you might use these files as part of a larger project.*

In [None]:
import scanpy as sc

data = sc.datasets.pbmc3k_processed()
for gene_set, gene_set_name in zip(gene_sets, gene_set_names):
    sc.tl.score_genes(data, gene_set, score_name = gene_set_name)

sc.tl.pca(data)
sc.pl.pca(data, color = ['louvain'] + gene_set_names, ncols = 2, 
          vmin = 'p1.5', vmax = 'p97.5')

The gene sets are able to help us visualize what kinds of cellular processes are occurring in different immune cell types. For example, we find that NK and dendritic cells have the highest inflammatory responses, while CD4 T cells have the highest expression of G2M checkpoint genes.

### In-Class Exercises <a id='exercises'></a>

1. Read in the DNA sequence from 'dna.txt' and convert all nucleotides to lowercase.

2. Read in the sequences in 'seq_list.txt' and then use f-strings to append a poly-A tail to each sequence.

3. Use string manipulation to get rid of the version numbers and file extensions in `gene_set_files`. The output should be identical to `gene_set_names`.

4. Use string manipulation and list comprehensions to remove the `'HALLMARK_'` prefix on the file names.