In [None]:
## Sanity Check, look away or you will turn into stone
import sys
# Check that the Python version is recent enough (3.6 or newer)
assert sys.version_info.major == 3
assert sys.version_info.minor >= 6

__author__ = "Emanuel Burgos"
__email__ = "eburgos@wisc.edu"

# Hour of Code with Mandel Lab #2

# 2020-08-27: More Python Basics

Textbook: [Python for Biologists](https://pythonforbiologists.com/) by Dr. Martin Jones

- If you are still unsure of why you should learn programming, here is a great [post](https://www.wired.com/2009/03/why-biology-students-should-learn-how-to-program/#:~:text=The%20important%20thing%20is%20learning,an%20increasingly%20data%2Ddriven%20field.)

### Guidelines:

- The notebook is divided into sections by headers. Each section has small exercises that we can practice with as the discussion goes on. Each practice cell is followed by a test cell that you can run to verify your solution. DO NOT MODIFY THE TEST CELL IN ANY WAY. Run it to check your solution, but do not change the code inside it.

Have fun.

## Opening and Writing Files

- As biologists, we are interested in analyzing and manipulating biological sequences (this is literally the sole purpose of bioinformatics...). However, imagine you have a **10Mbp** genome and you want to find a particular DNA sequence in it. Doing that by hand is not practical, so we store the sequence in a file and let Python read it.

- In Python, we can open **any type of file** by using the built-in function `open()`

#### `open(file, mode)` - opens a **file** and returns a `file object`. Mode is a string that represents write, read, or append.

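- `file` is the **PATH** to the location of your file. It is usually written **relative** to the current working directory, but an absolute path also works. Let's look at some code.
- Modes are `'r'` (read), `'w'` (write), and `'a'` (append)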

In [None]:
# First, let's import the os module

# Check current working directory


In [None]:
# What's in the current directory?


In [None]:
# Open rna_sequence.txt in reading mode
# <- OPEN IN READING MODE


> The only thing I want you to remember from this block is that `open()` will return an IO object. IO stands for input and output.
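
Here is a minimal sketch of what "returns an IO object" looks like. The file name `demo_open.txt` is made up for this example, and the file is opened in `'w'` mode only so that the code runs even if the file does not exist yet. Closing the file is covered later in this notebook.

```python
# "demo_open.txt" is a hypothetical file name used only for this sketch
demo_file = open("demo_open.txt", "w")  # 'w' creates the file if it does not exist

print(type(demo_file))  # <class '_io.TextIOWrapper'>
print(demo_file.name)   # demo_open.txt
print(demo_file.mode)   # w

demo_file.close()
```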

### Reading the file into a variable

- You can `read` the contents of the file by using the **methods** of a file object through the **dot-operator**.

#### `file.readlines()` - returns the lines of the file (separated by `\n`) as a list

- `\n` means **newline**. It's equivalent to pressing `enter` while typing, which moves your cursor to a new line.

> Do not worry, we will cover `list objects` next. For now, think about them as a **collection of things**.

In [None]:
# Read the lines in file


> Note: Add another cell and try running rna_file.readlines() again! What happened?

In [None]:
# Run readlines() again


- Python keeps track of its position in a file with a **pointer**. The pointer starts at the **beginning** of the file (line 0; character 0). Once you call `file.readlines()`, the pointer has moved to the **end** of the file, so a second call has nothing left to read and returns an empty list.
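
The pointer behaviour can be sketched as follows. The file name `demo_pointer.txt` and its contents are made up for this example. `file.seek(0)` is a standard file method that is not covered elsewhere in this notebook; it moves the pointer back to the start.

```python
# Create a small hypothetical file so the sketch is self-contained
demo = open("demo_pointer.txt", "w")
demo.write("AUGC\nGGCU\n")
demo.close()

demo = open("demo_pointer.txt", "r")
first_read = demo.readlines()   # pointer moves to the end of the file
second_read = demo.readlines()  # nothing left to read
demo.seek(0)                    # move the pointer back to the start
third_read = demo.readlines()   # the lines can be read again
demo.close()

print(first_read)   # ['AUGC\n', 'GGCU\n']
print(second_read)  # []
print(third_read)   # ['AUGC\n', 'GGCU\n']
```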

### Writing to a file

- Just change the mode to `'w'`! Then you can use the method `file.write(line)` to add content to the file

#### `file.write(line)` - writes the given line to the file object

In [None]:
## Write a sentence to a file called 'my_name.txt'


> If you check "my_name.txt", it should have the sentence written in it.

In [1]:
## What if we want two lines of story?
two_lines_f = open("two_lines_story.txt", 'w')
two_lines_f.write("Once upon a time")
two_lines_f.write("There was a dog named lola.")
# Go check the file...

- Python does not know that you want a **newline** each time you call `file.write(line)`.
- Computers use **special characters** to specify non-text formatting.
- In this case, we should use `"\n"` to add a newline at the end of the line.

> Note: There are TONS of **special characters** in programming. For example, `\t` is NOT written out as a backslash followed by the letter `t`. It tells the program to write a **tab**.
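
A quick sketch of how special characters behave inside ordinary strings. The strings here are made up for illustration.

```python
tabbed = "gene\tlength"
two_lines = "first line\nsecond line"

print(tabbed)           # gene and length separated by a tab
print(two_lines)        # printed on two separate lines
print(len("\t"))        # 1, the backslash and the letter form ONE character
print(repr(two_lines))  # 'first line\nsecond line'
```

`repr()` shows the string the way you would type it in code, which is handy for spotting hidden newlines and tabs.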

In [None]:
### PRACTICE ###
# Now that you know how to add newlines, fix the problem above.
two_lines_fixed_f = open('two_lines_story_fixed.txt', 'w')

In [None]:
## DO NOT TOUCH ## 
with open("two_lines_story_fixed.txt", 'r') as f:
    for line in f.readlines():
        assert line.endswith('\n')

### Closing the file

- Once you are done with a file, you need to **close** it.
- This is because Python keeps the file **open** (holding on to system resources, and possibly to text that has not been written to disk yet) until you close it or shut down the **kernel** (Python engine). If you are working with BIG files or many files, your code should close unused files as it runs to free up those resources.

##### `file.close()` - closes the stream for the file object
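
A small sketch of closing a file, using a made-up file name (`demo_close.txt`). Every file object has a `closed` attribute that tells you whether it is still open.

```python
# "demo_close.txt" is a hypothetical file name for this sketch
demo = open("demo_close.txt", "w")
demo.write("AUGGCC\n")

print(demo.closed)  # False
demo.close()        # any buffered text is written to disk here
print(demo.closed)  # True
```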

In [None]:
# Close every file object

### Another way to open files...

- Another way to open **files** (or any other object that has `__enter__` and `__exit__` methods) is to use the **with** keyword.
- This is the Pythonic way because it's more readable and it automatically **closes** the file after you exit the indented block.
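
As a rough sketch, here is the same write-then-read pattern using `with`. The file name `demo_with.txt` and the variable names are made up for this example.

```python
# Write a hypothetical file; it is closed when the indented block ends
with open("demo_with.txt", "w") as out_file:
    out_file.write("AUGGCC\n")

# Read it back; again the file is closed once we unindent
with open("demo_with.txt", "r") as in_file:
    lines = in_file.readlines()

print(lines)            # ['AUGGCC\n']
print(out_file.closed)  # True
print(in_file.closed)   # True
```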

In [None]:
# Open file with `with`
# The name after `as` keyword is the VARIABLE NAME GIVEN TO FILE OBJECT

# Once you leave the 'block'(unindent), python closes the file object

In [None]:
### PRACTICE ###
## There is a file called "rna_sequence_practice.txt". Open this file and assign the object to `rna_sequences_file_object`
# Read the lines from this file using any method shown above and assign the lines to variable `rna_sequences`.
rna_sequences = None


In [None]:
## DO NOT TOUCH
assert len(rna_sequences) == 6
assert rna_sequences_file_object.closed

## Working with lists and Loops

### Defining a list

- **Lists** are another data type in Python. They are defined by **open-close square brackets** ([ ])
- They are meant to function as a collection that holds other **data types** (ints, floats, strings, etc...)
- Lists are **mutable**: they can be changed after they are created
- Select items by **indexing** (just like strings).

```python
x = [1,2,3,4,5,6]
print(x)
# Prints: [1, 2, 3, 4, 5, 6]
```
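
Indexing works the same way as for strings. A short sketch with the list of numbers from above:

```python
x = [1, 2, 3, 4, 5, 6]

print(len(x))  # 6
print(x[0])    # 1, indexing starts at 0
print(x[-1])   # 6, negative indexes count from the end
print(x[1:3])  # [2, 3], slices include the start but not the stop
```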

In [None]:
# Let's make a random list with strings as the items! ['Mark', 'Hector', 'Denise', 'Emanuel', 'Ruth', 'Jake', 'Natalia']


In [None]:
# How many people? (What function could I use?)


In [None]:
# How can I select items?


### Adding to list (Using methods!)

- Since list is a data type in python, it also has its own methods! It can be accessed by, you guessed it, the **dot operator**.

#### `list.append(item)` - adds the provided item to the END of the list

In [None]:
# Add the lab dog

#### `list.insert(index, item)` - inserts item at index

- We can also `insert` items at a specific location.
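
A minimal sketch of both methods, using a made-up list of DNA bases:

```python
bases = ['A', 'G']

bases.append('T')     # adds to the END: ['A', 'G', 'T']
bases.insert(1, 'C')  # 'C' goes to index 1, everything after it shifts right

print(bases)  # ['A', 'C', 'G', 'T']
```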

In [None]:
## Add Mandel next to Mark


- Other methods for list objects.

#### `list.remove(item)` - removes the item from the list

In [None]:
# Remove the least important person


#### `list.index(item)` - finds the index for the item in the list
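
A short sketch of `remove` and `index` on a made-up list of codons. Note that both methods only act on the **first** matching item.

```python
codons = ['AUG', 'GGC', 'UAA', 'GGC']

codons.remove('GGC')  # removes only the first 'GGC'
print(codons)         # ['AUG', 'UAA', 'GGC']

stop_index = codons.index('UAA')
print(stop_index)     # 1
```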

In [None]:
# Where is the dog?


In [None]:
### PRACTICE ### 
# Add `flour` to beginning of ingredients_for_cake and remove the `salt`.
# THEN, find the index for `butter` and assign it to `butter_index`
ingredients_for_cake = ['butter', 'sugar', 'salt']
butter_index = None



In [None]:
## DO NOT TOUCH
assert 'flour' in ingredients_for_cake
assert not 'salt' in ingredients_for_cake
assert butter_index == 1

### Immutable: what does this mean?

- Like in algebra, we assign a **value** to a **variable**, and whenever we use the variable's name we expect to get that value back (if `x = 4`, then whenever we mention `x` it gives us `4`). In Python, we call these **reference variables** because the name is used to reference the actual data it holds.
- All objects (data types) in Python are either **immutable** (they cannot be changed after being created) or **mutable** (they can be changed after creation, and the change is visible through every variable that references the same object).
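
The difference can be sketched with an int (immutable) and a list (mutable). The variable names are made up for this example, and `list.copy()` is a standard list method that goes slightly beyond what this notebook covers.

```python
# Ints are immutable: reassigning x creates a new value, y is untouched
x = 4
y = x
x = x + 1
print(x, y)  # 5 4

# Lists are mutable: a and b are two names for the SAME list
a = [1, 2, 3]
b = a
a.append(4)
print(b)       # [1, 2, 3, 4]
print(a is b)  # True

# A copy is a separate list, so later changes to a do not affect it
c = a.copy()
a.append(5)
print(c)  # [1, 2, 3, 4]
```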

For example:

In [None]:
# Assign x to a new variable and perform an operation on that


In [None]:
# Let's change x

# What happens to y?


In [None]:
# What happens with lists?


In [None]:
# Let's add to A

# What happens to B


### Looping through items

- In programming, we use **loops** to iterate over (go through each item of) a collection and perform operations on each item.
- A loop uses the **keywords** `for` and `in` to iterate through the collection. Once you **unindent**, you are outside the loop.

```python
for item in collection:
    # Do stuff with item
```

> Keep in mind that indentation is important for python to know that statements belong to the loop!
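
A small sketch of a loop over a made-up list of reads. The indented lines run once per item; the last `print` is unindented, so it runs only once, after the loop is finished.

```python
reads = ['AUGC', 'GG', 'UUACG']
total_length = 0

for read in reads:
    # Each item is called `read` inside the loop
    print(read, len(read))
    total_length = total_length + len(read)

print(total_length)  # 11
```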

In [None]:
## Let's loop through lab people

    # Each item will be named person


- As you may have noticed by now, `file.readlines()` returns a list that contains the file's lines, so you can loop through them too.
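
Putting the two ideas together, here is a sketch that loops over the lines of a file. The file name `demo_loop.txt` and its contents are made up, and `str.rstrip("\n")` is used to drop the newline at the end of each line.

```python
# Create a small hypothetical file so the sketch is self-contained
with open("demo_loop.txt", "w") as out_file:
    out_file.write("AUGC\nGGCU\n")

cleaned = []
with open("demo_loop.txt", "r") as in_file:
    for line in in_file.readlines():
        sequence = line.rstrip("\n")  # remove the trailing newline
        print(sequence, len(sequence))
        cleaned.append(sequence)

print(cleaned)  # ['AUGC', 'GGCU']
```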

### List Comprehension

- Python sometimes provides **shortcuts** for certain programming operations. One of them is **comprehension**!
- Comprehension is used to perform an operation on each item in the collection and return it as a **new** list.

```python
# Its syntax is:
new_list = [x**2 for x in [0,1,2,3]] # new_list = [0, 1, 4, 9]
```
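
To see what the shortcut replaces, here is a sketch that builds the same list twice, once with a regular loop and once with a comprehension. The list of reads is made up for this example.

```python
reads = ['AUGC', 'GG', 'UUACG']

# The long way: start with an empty list and append inside a loop
lengths_loop = []
for read in reads:
    lengths_loop.append(len(read))

# The shortcut: same result in one line
lengths_comp = [len(read) for read in reads]

print(lengths_loop)  # [4, 2, 5]
print(lengths_comp)  # [4, 2, 5]
```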

### Exercises

- There is a fasta file called **"exercise_1.fasta"** which contains **10** DNA reads with adapter sequences that need to be removed.
- The adapter sequence is `AAAGGGAAATTTCCCTTT`.
- Trim the reads and write them to a file called "trimmed_reads.fasta" with the **same header as input**. Each read should be on its own line.
- Also, figure out the **length** of the **sequences** after **trimming** and assign it to `sequence_length`.

> **HINT:** Remember that `len(string)` counts EVERY character in the string.
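
A tiny illustration of the hint, using a made-up line rather than a line from the exercise file:

```python
line = "ACGT\n"

print(len(line))               # 5, the newline counts as a character
print(len(line.rstrip("\n")))  # 4, after removing the trailing newline
```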

In [None]:
### YOUR SOLUTION HERE
adapter_to_remove = "AAAGGGAAATTTCCCTTT"
output_file_name = "trimmed_reads.fasta"

In [None]:
## DONT TOUCH ##
adapter_to_remove = "AAAGGGAAATTTCCCTTT"
with open('trimmed_reads.fasta') as f:
    for line in f:
        if line.startswith('>'):
            assert line == '>trim_these_reads\n'
        else:
            assert not line.startswith(adapter_to_remove)
assert sequence_length == 20

In [None]:
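## SOLUTION ##
# Open input and output together so both are closed automatically
with open('exercise_1.fasta', 'r') as f, open("trimmed_reads.fasta", 'w') as output:
    for line in f:
        if line.startswith('>'):
            output.write(line)  # copy the header unchanged
        else:
            # Strip the newline first so it is not counted in the length
            trimmed = line.rstrip('\n').replace(adapter_to_remove, '')
            output.write(trimmed + '\n')
            sequence_length = len(trimmed)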