## On lists, dictionaries, series and dataframes

### List Comprehension

- Or, how to apply some function to a bunch of elements in a list

- I'm really lazy and try to keep my typing to a minimum (after all, my time is worth a lot more than the computer's)

- Is there a way for me to concisely write the loop above in a clear manner?
  - Hint: yes
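- A minimal sketch of the idea, using a made-up list of codons (the list and variable names here are placeholders, not the ones from class): the loop and the comprehension build the same list.

```python
# Hypothetical example data, just for illustration
codons = ["aaa", "aag", "gct"]

# The long way: start with an empty list and append inside a loop
upper_loop = []
for codon in codons:
    upper_loop.append(codon.upper())

# The list comprehension: same result in one line
upper_comprehension = [codon.upper() for codon in codons]

print(upper_loop)
print(upper_comprehension)
```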

- Don't forget, just like in a loop, we can call our variable (NOT THE LIST!! Just the variable that acts as placeholder) whatever we want. But it's better to be clear about what you're manipulating!

- What if I only want to modify one list element and return every other element unchanged?
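- One way to sketch this is with an `if ... else` expression inside the comprehension. The list below is invented for the example; only the lowercase entry gets changed.

```python
# Hypothetical list where one name was typed in lowercase by mistake
names = ["Alanine", "lysine", "Valine"]

# Change only the matching element, pass everything else through unchanged
fixed_names = [name.capitalize() if name == "lysine" else name for name in names]

print(fixed_names)
```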

- Benefits of list comprehension: 

1. Faster to write (no seriously, that is a benefit sometimes)
2. Always returns a list, so no need to mess around with append/extend etc.
3. Great for quickly filtering elements out of a list, or modifying specific elements
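
- To illustrate benefit 3, here is a small sketch that filters a list with a trailing `if`. The weights are approximate values chosen for the example.

```python
# Approximate weights, used only as example data
weights = [89.1, 146.2, 75.1, 204.2]

# Keep only the elements that pass the condition
heavy_weights = [weight for weight in weights if weight > 100]

print(heavy_weights)
```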

---

### Dictionaries part II

Recall that there are several ways to create a dictionary

1. Init an empty dictionary and fill it with keys/values manually

2. Set key/value pairs when creating the dictionary
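
- As a quick refresher, both approaches might look like this (the dictionary names are placeholders for this sketch):

```python
# 1. Start empty and fill in keys/values one at a time
amino_acids_manual = {}
amino_acids_manual["A"] = "Alanine"
amino_acids_manual["K"] = "Lysine"

# 2. Set the key/value pairs when creating the dictionary
amino_acids_literal = {"A": "Alanine", "K": "Lysine"}

print(amino_acids_manual)
print(amino_acids_literal)
```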

#### A note on importing packages

- The file `amino_acids.py` contains a bunch of defined lists that are `constants`, hence the ALL CAPS convention.

In [None]:
%load_ext autoreload
%autoreload 2
from pprint import pprint
from amino_acids import AMINO_ACID_NAMES, AMINO_ACID_CODONS, AMINO_ACID_CODES, AMINO_ACID_WEIGHTS

- First, let's refresh our memory on how to create a dictionary

- We can also create all of our key/value pairs ahead of time if we know them...

- Question! Note that in the cell below, I add the key/value pairs in a different order than I did originally. Do you expect the comparison to return _True_ or _False_?

- We're going to introduce a built-in Python function called `enumerate` that helps us work with lists and their elements. Previously, we've been iterating through lists using an arbitrary variable name, e.g.
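
```python
# A plain loop over one list; `name` is just a placeholder variable
# (the list here is a short stand-in, not the full AMINO_ACID_NAMES)
names = ["Alanine", "Lysine"]

name_lengths = []
for name in names:
    name_lengths.append(len(name))

print(name_lengths)
```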

- This works great when iterating through one list, but what if we want to iterate through multiple lists at once?

- On each pass through the loop, the `enumerate` function gives us a tuple with the index and the value of the list at that index
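- A small sketch of how that index lets us reach into a second list. The two lists are shortened stand-ins and are assumed to be in the same order.

```python
# Shortened stand-ins for the real lists, assumed to line up by position
names = ["Alanine", "Lysine"]
codes = ["A", "K"]

pairs = []
for index, name in enumerate(names):
    # `index` is the position, `name` is the value at that position
    pairs.append((name, codes[index]))

print(pairs)
```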

- Given lists of amino acid names, their codons, and their one-letter symbols, create a dictionary where the SYMBOL is the key, and the value holds the name and the codons

- Example key/value pair: `amino_acids['K']` should return: ``'Lysine', ['AAA', 'AAG']``

- We can also use the `zip` function to iterate through multiple lists at the same time
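- Here is one possible sketch with `zip`, using two-entry stand-ins for the imported lists. Storing the value as a tuple is an assumption of this example.

```python
# Two-entry stand-ins for the imported constants, assumed to be in the same order
names = ["Alanine", "Lysine"]
codons = [["GCA", "GCC", "GCG", "GCT"], ["AAA", "AAG"]]
codes = ["A", "K"]

amino_acids = {}
# zip hands back one element from each list per iteration
for code, name, codon_list in zip(codes, names, codons):
    amino_acids[code] = (name, codon_list)

print(amino_acids["K"])
```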

# WARNING !!!!

- You may have noticed that we made a _BIG_ assumption when creating our dictionary - all of the lists are ordered the same way, alphabetically by amino acid name (e.g. the first set of codons corresponds to Alanine, and so on)
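- To see why this matters, here is a sketch of what happens when one list gets reordered. `zip` does not complain; it just pairs things up by position and gives us wrong answers.

```python
# Names in alphabetical order, with symbols in the matching order
names = ["Alanine", "Arginine", "Asparagine"]
codes = ["A", "R", "N"]

aligned = dict(zip(codes, names))

# Sorting the symbols on their own breaks the pairing: ["A", "N", "R"]
misaligned = dict(zip(sorted(codes), names))

print(aligned["R"])     # the correct name for R
print(misaligned["R"])  # silently wrong
```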

In [None]:
from amino_acids import AMINO_ACID_WEIGHTS
print(AMINO_ACID_WEIGHTS)

# FYI, the weights are in daltons (1 Da = 1 g/mol), meaning Alanine weighs about 89.1 grams per mole

In [None]:
pprint(amino_acids)

- Question! Using your newly made dictionary, calculate the weight of the protein with the following sequence: 'AAAAAAAAEKTWYV'
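- One possible sketch, not the only answer. It assumes each dictionary value is a tuple with the weight in position 2 (that layout and the `weight_index` parameter are assumptions of this example), and it simply sums the weights without subtracting the water lost when peptide bonds form.

```python
# Tiny stand-in dictionary: symbol -> (name, codons, approximate weight in Da)
amino_acids = {
    "A": ("Alanine", ["GCA", "GCC", "GCG", "GCT"], 89.1),
    "K": ("Lysine", ["AAA", "AAG"], 146.2),
}


def calculate_protein_weight(sequence, amino_acids, weight_index=2):
    """Add up the weight stored for every symbol in the sequence."""
    return sum(amino_acids[symbol][weight_index] for symbol in sequence)


total_weight = calculate_protein_weight("AAK", amino_acids)
print(total_weight)
```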

In [None]:
calculate_protein_weight('AAAAAAAAEKTWYV', amino_acids)

- Wow. That's a complicated dictionary. It's also pretty annoying to dig around it, and have to remember what index corresponds to each value.
  - What if we could make a dictionary (or dictionary like structure) that was easier to work with?
  - Let's create a dictionary that lets us access values like `amino_acids["key"]["Name"]` or `amino_acids["key"]["Codons"]`
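- A sketch of one way to build that nested structure, again with two-entry stand-in lists. The inner key names `"Name"` and `"Codons"` follow the access pattern described above.

```python
# Two-entry stand-ins for the imported constants, assumed to be in the same order
names = ["Alanine", "Lysine"]
codons = [["GCA", "GCC", "GCG", "GCT"], ["AAA", "AAG"]]
codes = ["A", "K"]

awesome_aa = {}
for code, name, codon_list in zip(codes, names, codons):
    # Each value is itself a dictionary with readable keys
    awesome_aa[code] = {"Name": name, "Codons": codon_list}

print(awesome_aa["K"]["Name"])
print(awesome_aa["K"]["Codons"])
```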

- Introducing iterating on keys and values in dictionaries
  - It turns out, it's not that different from using zip or enumerate!
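- For example, `.items()` hands back key/value pairs that we can unpack in the loop, much like `enumerate` hands back index/value pairs. The small dictionary below is a stand-in.

```python
# Small stand-in for the nested dictionary
awesome_aa = {
    "A": {"Name": "Alanine", "Codons": ["GCA", "GCC", "GCG", "GCT"]},
    "K": {"Name": "Lysine", "Codons": ["AAA", "AAG"]},
}

# .keys() and .values() also exist; .items() gives both at once
for symbol, info in awesome_aa.items():
    print(symbol, info["Name"], len(info["Codons"]))

# The same unpacking works inside a comprehension
codon_counts = {symbol: len(info["Codons"]) for symbol, info in awesome_aa.items()}
```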

- That was a lot of work. Can I save the awesome_aa dictionary and use it later?
- Great question Jonathon! You definitely can.
- There are a bunch of different ways, but my personal favorite is to save it as a .txt file to read in later.

In [None]:
with open("awesome_aa.txt", "w") as f:
  f.write(str(awesome_aa))

# Dataframes! Wow we made it

- Let's load in our awesome amino acid dictionary we made, and convert it to a dataframe

In [None]:
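import ast

with open("./awesome_aa.txt", "r") as f:
  # literal_eval only parses Python literals, so it is safer than eval on file contents
  amino_acids = ast.literal_eval(f.read())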

print(amino_acids)


In [None]:
import pandas as pd
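
- One way the conversion could look, assuming the dictionary is nested as symbol -> {"Name": ..., "Codons": ...}. The two-entry dictionary is a stand-in for the real one.

```python
import pandas as pd

# Stand-in for the dictionary loaded from awesome_aa.txt
amino_acids = {
    "A": {"Name": "Alanine", "Codons": ["GCA", "GCC", "GCG", "GCT"]},
    "K": {"Name": "Lysine", "Codons": ["AAA", "AAG"]},
}

# orient="index" turns each outer key into a row and each inner key into a column
aa_df = pd.DataFrame.from_dict(amino_acids, orient="index")
print(aa_df)
```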


- We don't need that summary column, let's drop it
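- A sketch of dropping a column. The dataframe and its `"Summary"` column are made up here; swap in whatever your column is actually called.

```python
import pandas as pd

# Made-up dataframe with an extra column we do not want
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine"],
    "Weight": [89.1, 146.2],
    "Summary": ["placeholder text", "placeholder text"],
})

# drop returns a new dataframe, so assign the result back
aa_df = aa_df.drop(columns=["Summary"])
print(aa_df.columns.tolist())
```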

- A few ways to show what's in a dataframe
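- Some options you might reach for, shown on a small made-up dataframe:

```python
import pandas as pd

# Made-up dataframe with approximate weights
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine", "Glutamic acid"],
    "Weight": [89.1, 146.2, 147.1],
})

print(aa_df.head(2))           # first two rows
print(aa_df.shape)             # (number of rows, number of columns)
print(aa_df.columns.tolist())  # column names
aa_df.info()                   # column types and non-null counts
```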

- Pandas dataframes are a two dimensional data structure that can hold multiple arrays (or series). 
    - A series is just a one dimensional array holding data of any type. You can think of them like fancy lists or tuples
- Note that each column has a data type
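- A short sketch of both points: pulling one column out gives a series, and `dtypes` reports the type of each column. The dataframe is made up for the example.

```python
import pandas as pd

# Made-up dataframe with approximate weights
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine"],
    "Weight": [89.1, 146.2],
})

weights = aa_df["Weight"]  # a single column is a Series
print(type(weights))
print(aa_df.dtypes)        # one data type per column
```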

- Some, but not all, of that stuff we taught you about indexing applies here
  - Note that we return a series when indexing
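- To try the indexing yourself, here is a stand-in series with made-up values. With the default 0, 1, 2, ... index, a single number looks up that label, while slices work by position just like a list.

```python
import pandas as pd

# Stand-in series with invented values and the default integer index
my_basic_series = pd.Series([10, 20, 30, 40, 50])

single_value = my_basic_series[2]  # one element
middle_slice = my_basic_series[1:3]  # a slice is still a Series
last_by_position = my_basic_series.iloc[-1]  # .iloc is always positional

print(single_value, last_by_position)
print(type(middle_slice))
```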

In [None]:
print(my_basic_series[2])
print("----------------")
print(my_basic_series[1])
print("----------------")
print(my_basic_series[1:3])
print("----------------")
print(type(my_basic_series[1:3]))
print("----------------")
print(my_basic_series[-1:])
print("----------------")
print(my_basic_series[:-2])

- You can filter a dataframe by a specific column value
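- A sketch on a made-up dataframe: the comparison makes a True/False series, and indexing with it keeps the True rows.

```python
import pandas as pd

# Made-up dataframe with approximate weights
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine", "Glutamic acid"],
    "Weight": [89.1, 146.2, 147.1],
})

heavy = aa_df[aa_df["Weight"] > 100]
print(heavy)
```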

- You can also filter by multiple columns
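- For example, combine conditions with `&` (and) or `|` (or), wrapping each one in parentheses. The `"Charge"` column is an invented example column.

```python
import pandas as pd

# Made-up dataframe; "Charge" is a hypothetical extra column
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine", "Glutamic acid"],
    "Weight": [89.1, 146.2, 147.1],
    "Charge": ["neutral", "positive", "negative"],
})

heavy_and_positive = aa_df[(aa_df["Weight"] > 100) & (aa_df["Charge"] == "positive")]
print(heavy_and_positive)
```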

- Alternative way to filter on a string value using "isin"
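- A sketch of `isin`, which keeps rows whose value appears in a list you provide (made-up dataframe again):

```python
import pandas as pd

# Made-up dataframe with approximate weights
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine", "Glutamic acid"],
    "Weight": [89.1, 146.2, 147.1],
})

wanted = ["Alanine", "Lysine"]
selected = aa_df[aa_df["Name"].isin(wanted)]
print(selected)
```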

- You don't have to return the whole dataframe, you can also use one column to filter, then report the values of another
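- One way to sketch this is with `.loc`, giving it the row filter first and the column to report second:

```python
import pandas as pd

# Made-up dataframe with approximate weights
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine", "Glutamic acid"],
    "Weight": [89.1, 146.2, 147.1],
})

# Filter on "Name", report "Weight"
lysine_weight = aa_df.loc[aa_df["Name"] == "Lysine", "Weight"]
print(lysine_weight)
```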

- Adding a column to a dataframe is pretty easy. Luckily I have plenty of columns we can add.
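- A minimal sketch: assign a list to a new column name. The list has to be the same length as the dataframe, and the column name here is invented.

```python
import pandas as pd

# Made-up dataframe with approximate weights
aa_df = pd.DataFrame({
    "Name": ["Alanine", "Lysine", "Glutamic acid"],
    "Weight": [89.1, 146.2, 147.1],
})

# One value per row, in the same order as the rows
aa_df["Three_letter"] = ["Ala", "Lys", "Glu"]
print(aa_df)
```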

- Let's merge our metadata with our amino acid dataframe on the name columns
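- A sketch of a merge on a shared `"Name"` column. Both dataframes are invented stand-ins; unlike `zip`, the merge matches rows by value, so the row order does not need to agree.

```python
import pandas as pd

# Invented stand-ins for the amino acid dataframe and the metadata
aa_df = pd.DataFrame({"Name": ["Alanine", "Lysine"], "Code": ["A", "K"]})
metadata = pd.DataFrame({"Name": ["Lysine", "Alanine"], "Weight": [146.2, 89.1]})

# Rows are matched on the values in "Name"
merged = aa_df.merge(metadata, on="Name")
print(merged)
```

- If the name columns are spelled differently in the two dataframes, `left_on=` and `right_on=` can be used in place of `on=`.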

- Finally, let's save our work and come back to it for the homework!
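- One common option is a CSV file. This sketch writes to a temporary folder so it cleans up after itself; for the homework you would pass your own file name (the name below is a placeholder).

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up dataframe with approximate weights
aa_df = pd.DataFrame({"Name": ["Alanine", "Lysine"], "Weight": [89.1, 146.2]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / "amino_acids.csv"  # placeholder file name
    aa_df.to_csv(csv_path, index=False)  # index=False skips the row numbers
    reloaded = pd.read_csv(csv_path)

print(reloaded)
```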
