## On lists, dictionaries, series and dataframes

### List Comprehension

- Or, how to apply some function to a bunch of elements in a list

In [1]:

baseball_teams = ['Royals', 'Twins', 'Tigers', 'White Sox']
baseball_teams_upper = []
for team in baseball_teams: 
  baseball_teams_upper.append(team.upper())

print(baseball_teams_upper)

['ROYALS', 'TWINS', 'TIGERS', 'WHITE SOX']


- I'm really lazy and try to keep my typing to a minimum (after all, my time is worth a lot more than the computer's)
- Is there a way for me to concisely write the loop above in a clear manner?
  - Hint: yes

In [2]:
[x for x in baseball_teams]

['Royals', 'Twins', 'Tigers', 'White Sox']

In [3]:
[x.upper() for x in baseball_teams]

['ROYALS', 'TWINS', 'TIGERS', 'WHITE SOX']

- Don't forget: just like in a loop, we can call the placeholder variable whatever we want (the variable, NOT THE LIST!!). But it's better to pick a name that makes clear what you're manipulating!

In [4]:
[team.upper() for team in baseball_teams]

['ROYALS', 'TWINS', 'TIGERS', 'WHITE SOX']

- What if I only want to modify one list element and return every other element unchanged?

In [5]:
baseball_teams_upper = []
for team in baseball_teams:
  if team == "Royals":
    baseball_teams_upper.append(team.upper())
  else:
    baseball_teams_upper.append(team)
print(baseball_teams_upper)


['ROYALS', 'Twins', 'Tigers', 'White Sox']


- You can put conditional logic (an `if`/`else` expression) inside a list comprehension

In [6]:
[team.upper() if team == "Royals" else team for team in baseball_teams]

['ROYALS', 'Twins', 'Tigers', 'White Sox']

- Benefits of list comprehension: 

1. Faster to write (no seriously, that is a benefit sometimes)
2. Always returns a list, so no need to mess around with append/extend etc.
3. Great for quickly filtering elements out of a list, or modifying specific elements
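
- The examples above modify elements but never filter any out. A minimal sketch of filtering, reusing the same `baseball_teams` list (the variable names `t_teams` and `t_teams_upper` are made up for illustration): a trailing `if` with no `else` decides which elements are kept at all.

```python
baseball_teams = ['Royals', 'Twins', 'Tigers', 'White Sox']

# A trailing `if` filters: only teams whose name starts with 'T' are kept
t_teams = [team for team in baseball_teams if team.startswith('T')]
print(t_teams)  # ['Twins', 'Tigers']

# Filtering and modifying can be combined in one comprehension
t_teams_upper = [team.upper() for team in baseball_teams if team.startswith('T')]
print(t_teams_upper)  # ['TWINS', 'TIGERS']
```

- Note the difference in position: `x if cond else y` goes before the `for` and picks a value, while a bare `if cond` goes after the `for` and drops elements.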

---

### Dictionaries part II

Recall that there are several ways to create a dictionary

1. Initialize an empty dictionary and fill it with key/value pairs one at a time
2. Set the key/value pairs when creating the dictionary

#### A note on importing packages

- The file `amino_acids.py` defines several lists (and one dictionary) that are meant to be constants, hence the ALL CAPS convention.

In [7]:
%load_ext autoreload
%autoreload 2
from pprint import pprint
from amino_acids import AMINO_ACID_NAMES, AMINO_ACID_CODONS, AMINO_ACID_CODES, AMINO_ACID_WEIGHTS

- As with lists, there is some syntactic sugar we can use with dictionaries
- First, let's refresh our memory on how to create a dictionary

In [8]:

amino_acids = {}
amino_acids['Ala'] = 'Alanine'
amino_acids['Arg'] = 'Arginine'
amino_acids['Asn'] = 'Asparagine'

pprint(amino_acids)
print(amino_acids['Ala'])
print(amino_acids['Asn'])
# And so on...

{'Ala': 'Alanine', 'Arg': 'Arginine', 'Asn': 'Asparagine'}
Alanine
Asparagine


- We can also create all of our key/value pairs ahead of time if we know them...

In [9]:
amino_acids_all_at_once = {'Ala': 'Alanine', 'Arg': 'Arginine', 'Asn': 'Asparagine'}
# By the way - you can see if two dictionaries are equal
print(amino_acids == amino_acids_all_at_once)

True


# Question 1: Note that in the cell below, I add the key/value pairs in a different order than I did originally. Do you expect the comparison to return _True_ or _False_?

In [10]:
amino_acids_all_at_once = {'Asn': 'Asparagine', 'Arg': 'Arginine', 'Ala': 'Alanine'}
print(amino_acids == amino_acids_all_at_once)

True
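
- Why `True`? Two dictionaries are equal when they hold the same key/value pairs; the order in which the pairs were added does not matter. A small sketch with made-up variable names (`first`, `second`, `third`) that contrasts this with lists, where order does matter:

```python
first = {'Ala': 'Alanine', 'Arg': 'Arginine'}
second = {'Arg': 'Arginine', 'Ala': 'Alanine'}  # same pairs, different order

# Dictionaries compare equal when they hold the same key/value pairs
print(first == second)  # True

# Lists, by contrast, care about order (list(d) gives the keys in insertion order)
print(list(first) == list(second))  # False

# A different value for the same key breaks equality
third = {'Ala': 'Alanine', 'Arg': 'Arg'}
print(first == third)  # False
```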


- We're going to introduce a built-in Python function called `enumerate` that gives us the index of each list element along with the element itself. Previously, we've been iterating through lists using an arbitrary variable name, e.g.

In [11]:
muh_list = ['say', 'this', 'five', 'times', 'fast']
for floopy_ba_boopy in muh_list:
  print(floopy_ba_boopy)


say
this
five
times
fast


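  - This works great when iterating through one list, but what if we want to iterate through multiple lists at once? With `enumerate` we also get the index, which we can use to look up the matching element in another list.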

In [12]:

for i, k in enumerate(muh_list):
  print(i, k)

0 say
1 this
2 five
3 times
4 fast


- On each pass through the loop, `enumerate` yields a tuple containing the index and the value of the list at that index
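
- To see those tuples directly, here is a minimal sketch that reuses `muh_list` (the names `pairs`, `position` and `word` are just for illustration). It also shows the optional `start` argument, which changes the number the count begins at.

```python
muh_list = ['say', 'this', 'five', 'times', 'fast']

# Materialize the (index, value) tuples to see what enumerate produces
pairs = list(enumerate(muh_list))
print(pairs)
# [(0, 'say'), (1, 'this'), (2, 'five'), (3, 'times'), (4, 'fast')]

# The optional start argument changes the first index
for position, word in enumerate(muh_list, start=1):
    print(position, word)
```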

# Question 2: Given a list of amino acid names, a list of their one-letter codes and a list of their codons, create a dictionary where the one-letter CODE is the key, and the value is a list holding the name and the codons

  - Example key/value pair: `amino_acids['K']` should return: ``['Lysine', ['AAA', 'AAG']]``

In [14]:
print(AMINO_ACID_NAMES)
print(AMINO_ACID_CODES)
print(AMINO_ACID_CODONS)


['Alanine', 'Arginine', 'Asparagine', 'Aspartic acid', 'Cysteine', 'Glutamic acid', 'Glutamine', 'Glycine', 'Histidine', 'Isoleucine', 'Leucine', 'Lysine', 'Methionine', 'Phenylalanine', 'Proline', 'Serine', 'Threonine', 'Tryptophan', 'Tyrosine', 'Valine']
['A', 'R', 'N', 'D', 'C', 'E', 'Q', 'G', 'H', 'I', 'L', 'K', 'M', 'F', 'P', 'S', 'T', 'W', 'Y', 'V']
[['GCT', 'GCC', 'GCA', 'GCG'], ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], ['AAT', 'AAC'], ['GAT', 'GAC'], ['TGT', 'TGC'], ['GAA', 'GAG'], ['CAA', 'CAG'], ['GGT', 'GGC', 'GGA', 'GGG'], ['CAT', 'CAC'], ['ATT', 'ATC', 'ATA'], ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], ['AAA', 'AAG'], ['ATG'], ['TTT', 'TTC'], ['CCT', 'CCC', 'CCA', 'CCG'], ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'], ['ACT', 'ACC', 'ACA', 'ACG'], ['TGG'], ['TAT', 'TAC'], ['GTT', 'GTC', 'GTA', 'GTG']]


In [15]:
for i, k in enumerate(AMINO_ACID_NAMES):
  # This is the same thing as printing k...
  print(AMINO_ACID_NAMES[i])
  # But we can also use i to access elements in a different list
  print(AMINO_ACID_CODES[i])

Alanine
A
Arginine
R
Asparagine
N
Aspartic acid
D
Cysteine
C
Glutamic acid
E
Glutamine
Q
Glycine
G
Histidine
H
Isoleucine
I
Leucine
L
Lysine
K
Methionine
M
Phenylalanine
F
Proline
P
Serine
S
Threonine
T
Tryptophan
W
Tyrosine
Y
Valine
V


- We can also use the `zip` function to iterate through multiple lists at the same time

In [16]:
for name, code, codon in zip(AMINO_ACID_NAMES, AMINO_ACID_CODES, AMINO_ACID_CODONS):
  print(name, code, codon)

Alanine A ['GCT', 'GCC', 'GCA', 'GCG']
Arginine R ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG']
Asparagine N ['AAT', 'AAC']
Aspartic acid D ['GAT', 'GAC']
Cysteine C ['TGT', 'TGC']
Glutamic acid E ['GAA', 'GAG']
Glutamine Q ['CAA', 'CAG']
Glycine G ['GGT', 'GGC', 'GGA', 'GGG']
Histidine H ['CAT', 'CAC']
Isoleucine I ['ATT', 'ATC', 'ATA']
Leucine L ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG']
Lysine K ['AAA', 'AAG']
Methionine M ['ATG']
Phenylalanine F ['TTT', 'TTC']
Proline P ['CCT', 'CCC', 'CCA', 'CCG']
Serine S ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC']
Threonine T ['ACT', 'ACC', 'ACA', 'ACG']
Tryptophan W ['TGG']
Tyrosine Y ['TAT', 'TAC']
Valine V ['GTT', 'GTC', 'GTA', 'GTG']


In [17]:

def create_aa_dict(amino_acid_names: list, amino_acid_codes: list, amino_acid_codons: list):
  amino_acids = {}
  for name, code, codon in zip(amino_acid_names, amino_acid_codes, amino_acid_codons):
    amino_acids[code] = [name, codon]

  return amino_acids
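
- The same dictionary can be built with a dictionary comprehension, the dictionary cousin of the list comprehension from earlier. This is only a sketch: `names`, `codes`, `codons` and `aa_dict` are tiny stand-ins so it runs by itself, not the full constants from `amino_acids.py`.

```python
# Tiny stand-in lists so the sketch runs by itself (not the full constants)
names = ['Alanine', 'Lysine']
codes = ['A', 'K']
codons = [['GCT', 'GCC', 'GCA', 'GCG'], ['AAA', 'AAG']]

# Same idea as a list comprehension, but with key: value inside curly braces
aa_dict = {code: [name, codon] for name, code, codon in zip(names, codes, codons)}
print(aa_dict['K'])  # ['Lysine', ['AAA', 'AAG']]
```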


In [18]:
amino_acids = create_aa_dict(AMINO_ACID_NAMES, AMINO_ACID_CODES, AMINO_ACID_CODONS)
pprint(amino_acids)
print(amino_acids['K'])
print(amino_acids['K'][0])
print(amino_acids['K'][1])
print(amino_acids['K'][1][0])

{'A': ['Alanine', ['GCT', 'GCC', 'GCA', 'GCG']],
 'C': ['Cysteine', ['TGT', 'TGC']],
 'D': ['Aspartic acid', ['GAT', 'GAC']],
 'E': ['Glutamic acid', ['GAA', 'GAG']],
 'F': ['Phenylalanine', ['TTT', 'TTC']],
 'G': ['Glycine', ['GGT', 'GGC', 'GGA', 'GGG']],
 'H': ['Histidine', ['CAT', 'CAC']],
 'I': ['Isoleucine', ['ATT', 'ATC', 'ATA']],
 'K': ['Lysine', ['AAA', 'AAG']],
 'L': ['Leucine', ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG']],
 'M': ['Methionine', ['ATG']],
 'N': ['Asparagine', ['AAT', 'AAC']],
 'P': ['Proline', ['CCT', 'CCC', 'CCA', 'CCG']],
 'Q': ['Glutamine', ['CAA', 'CAG']],
 'R': ['Arginine', ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG']],
 'S': ['Serine', ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC']],
 'T': ['Threonine', ['ACT', 'ACC', 'ACA', 'ACG']],
 'V': ['Valine', ['GTT', 'GTC', 'GTA', 'GTG']],
 'W': ['Tryptophan', ['TGG']],
 'Y': ['Tyrosine', ['TAT', 'TAC']]}
['Lysine', ['AAA', 'AAG']]
Lysine
['AAA', 'AAG']
AAA


# WARNING !!!!

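- You may have noticed that we made a _BIG_ assumption when creating our dictionary - all of the lists are in the same order (e.g. the first name, the first code and the first set of codons all belong to Alanine, and so on). `zip` pairs elements purely by position, so if one list were ordered differently we would silently build a wrong dictionary.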
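
- A related trap: `zip` stops at the shortest list without complaining. A minimal sketch of a sanity check, assuming stand-in lists `names` and `codes`; `check_same_length` is a hypothetical helper written for this example, not something from `amino_acids.py`. It catches missing entries, though not lists that are the same length but in a different order.

```python
names = ['Alanine', 'Arginine', 'Asparagine']
codes = ['A', 'R']  # one code is missing on purpose

# zip stops at the shortest input, so Asparagine is dropped without an error
print(list(zip(names, codes)))  # [('Alanine', 'A'), ('Arginine', 'R')]

def check_same_length(*lists):
    """Hypothetical helper: raise if the lists differ in length."""
    lengths = {len(lst) for lst in lists}
    if len(lengths) > 1:
        raise ValueError(f"Lists have different lengths: {sorted(lengths)}")

try:
    check_same_length(names, codes)
except ValueError as error:
    print(error)  # Lists have different lengths: [2, 3]
```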

In [19]:
from amino_acids import AMINO_ACID_WEIGHTS
print(AMINO_ACID_WEIGHTS)

# FYI, the weights are in daltons, which is numerically the same as grams per mole, so Alanine weighs 89.1 grams per mole

{'A': 89.1, 'R': 174.2, 'N': 132.1, 'D': 133.1, 'C': 121.2, 'E': 147.1, 'Q': 146.2, 'G': 75.1, 'H': 155.2, 'I': 131.2, 'L': 131.2, 'K': 146.2, 'M': 149.2, 'F': 165.2, 'P': 115.1, 'S': 105.1, 'T': 119.1, 'W': 204.2, 'Y': 181.2, 'V': 117.1}


In [20]:
# Run this cell only once: every run appends another copy of the weight
for key in amino_acids:
  # Always good to check for unexpected behavior
  if key in AMINO_ACID_WEIGHTS:
    amino_acids[key].append(AMINO_ACID_WEIGHTS[key])
  else: 
    print(f"What are we even doing here??? No weight found for {key}")

In [21]:
pprint(amino_acids)

{'A': ['Alanine', ['GCT', 'GCC', 'GCA', 'GCG'], 89.1],
 'C': ['Cysteine', ['TGT', 'TGC'], 121.2],
 'D': ['Aspartic acid', ['GAT', 'GAC'], 133.1],
 'E': ['Glutamic acid', ['GAA', 'GAG'], 147.1],
 'F': ['Phenylalanine', ['TTT', 'TTC'], 165.2],
 'G': ['Glycine', ['GGT', 'GGC', 'GGA', 'GGG'], 75.1],
 'H': ['Histidine', ['CAT', 'CAC'], 155.2],
 'I': ['Isoleucine', ['ATT', 'ATC', 'ATA'], 131.2],
 'K': ['Lysine', ['AAA', 'AAG'], 146.2],
 'L': ['Leucine', ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 131.2],
 'M': ['Methionine', ['ATG'], 149.2],
 'N': ['Asparagine', ['AAT', 'AAC'], 132.1],
 'P': ['Proline', ['CCT', 'CCC', 'CCA', 'CCG'], 115.1],
 'Q': ['Glutamine', ['CAA', 'CAG'], 146.2],
 'R': ['Arginine', ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 174.2],
 'S': ['Serine', ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'], 105.1],
 'T': ['Threonine', ['ACT', 'ACC', 'ACA', 'ACG'], 119.1],
 'V': ['Valine', ['GTT', 'GTC', 'GTA', 'GTG'], 117.1],
 'W': ['Tryptophan', ['TGG'], 204.2],
 'Y': ['Tyrosine', ['TAT', 'TAC'], 181.2]}

# Question 3 - Using your newly made dictionary, calculate the weight of the protein with the following sequence: 'AAAAAAAAEKTWYV'

In [22]:
def calculate_protein_weight(protein: str, amino_acid_dict: dict):
  weight = 0
  for aa in protein:
    weight += amino_acid_dict[aa][2]
  return weight

In [23]:
calculate_protein_weight('AAAAAAAAEKTWYV', amino_acids)

1627.7
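
- The accumulator loop can also be written with `sum()` over a generator expression. A minimal sketch, assuming a stand-in dictionary `toy_aa` with the same `[name, codons, weight]` layout; the function name `protein_weight` and the check for unknown letters are additions for illustration.

```python
# Tiny stand-in dictionary with the same [name, codons, weight] layout
toy_aa = {
    'A': ['Alanine', ['GCT', 'GCC', 'GCA', 'GCG'], 89.1],
    'K': ['Lysine', ['AAA', 'AAG'], 146.2],
}

def protein_weight(protein, amino_acid_dict):
    # Collect letters we cannot look up, so the error names them
    unknown = sorted(set(protein) - set(amino_acid_dict))
    if unknown:
        raise KeyError(f"Unknown residues: {unknown}")
    # sum() over a generator replaces the explicit accumulator loop
    return sum(amino_acid_dict[aa][2] for aa in protein)

print(round(protein_weight('AAK', toy_aa), 1))  # 324.4
```

- Rounding before printing hides the tiny floating point error that adding decimals can introduce.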

- Wow. That's a complicated dictionary. It's also pretty annoying to dig around in it and have to remember which index corresponds to which value.
  - What if we could make a dictionary (or dictionary-like structure) that was easier to work with?
  - Let's create a dictionary that lets us access values like `amino_acids["key"]["Name"]` or `amino_acids["key"]["Codons"]`

- To build it, we need to iterate over the keys and values of a dictionary with `.items()`
  - It turns out, it's not that different from using zip or enumerate!

In [24]:

awesome_aa = {}
for key, value in amino_acids.items():
  print(key, value)
  name = value[0]
  codons = value[1]
  weight = value[2]
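  # Heads-up: the curly braces around the f-string make this a one-element set, not a plain string (drop the braces if you want a string)
  summary = {f"{name} is an amino acid with the following codons: {', '.join(codons)} that weighs {weight} grams per mole"}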
  awesome_aa[key] = {'Name': name, 'Codons': codons, 'Weight': weight, 'Summary': summary}


A ['Alanine', ['GCT', 'GCC', 'GCA', 'GCG'], 89.1]
R ['Arginine', ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 174.2]
N ['Asparagine', ['AAT', 'AAC'], 132.1]
D ['Aspartic acid', ['GAT', 'GAC'], 133.1]
C ['Cysteine', ['TGT', 'TGC'], 121.2]
E ['Glutamic acid', ['GAA', 'GAG'], 147.1]
Q ['Glutamine', ['CAA', 'CAG'], 146.2]
G ['Glycine', ['GGT', 'GGC', 'GGA', 'GGG'], 75.1]
H ['Histidine', ['CAT', 'CAC'], 155.2]
I ['Isoleucine', ['ATT', 'ATC', 'ATA'], 131.2]
L ['Leucine', ['TTA', 'TTG', 'CTT', 'CTC', 'CTA', 'CTG'], 131.2]
K ['Lysine', ['AAA', 'AAG'], 146.2]
M ['Methionine', ['ATG'], 149.2]
F ['Phenylalanine', ['TTT', 'TTC'], 165.2]
P ['Proline', ['CCT', 'CCC', 'CCA', 'CCG'], 115.1]
S ['Serine', ['TCT', 'TCC', 'TCA', 'TCG', 'AGT', 'AGC'], 105.1]
T ['Threonine', ['ACT', 'ACC', 'ACA', 'ACG'], 119.1]
W ['Tryptophan', ['TGG'], 204.2]
Y ['Tyrosine', ['TAT', 'TAC'], 181.2]
V ['Valine', ['GTT', 'GTC', 'GTA', 'GTG'], 117.1]


In [25]:

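print(awesome_aa)
# Only the last bare expression in a cell is displayed, so print the lookups explicitly
print(awesome_aa['K']['Name'])
print(awesome_aa['K']['Weight'])
pprint(awesome_aa['K']['Summary'])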


{'A': {'Name': 'Alanine', 'Codons': ['GCT', 'GCC', 'GCA', 'GCG'], 'Weight': 89.1, 'Summary': {'Alanine is an amino acid with the following codons: GCT, GCC, GCA, GCG that weighs 89.1 grams per mole'}}, 'R': {'Name': 'Arginine', 'Codons': ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'Weight': 174.2, 'Summary': {'Arginine is an amino acid with the following codons: CGT, CGC, CGA, CGG, AGA, AGG that weighs 174.2 grams per mole'}}, 'N': {'Name': 'Asparagine', 'Codons': ['AAT', 'AAC'], 'Weight': 132.1, 'Summary': {'Asparagine is an amino acid with the following codons: AAT, AAC that weighs 132.1 grams per mole'}}, 'D': {'Name': 'Aspartic acid', 'Codons': ['GAT', 'GAC'], 'Weight': 133.1, 'Summary': {'Aspartic acid is an amino acid with the following codons: GAT, GAC that weighs 133.1 grams per mole'}}, 'C': {'Name': 'Cysteine', 'Codons': ['TGT', 'TGC'], 'Weight': 121.2, 'Summary': {'Cysteine is an amino acid with the following codons: TGT, TGC that weighs 121.2 grams per mole'}}, 'E': {'Name'

- That was a lot of work. Can I save the awesome_aa dictionary and use it later?
- Great question Jonathon! You definitely can.
- There's a bunch of different ways, but my personal favorite is to save it as a .txt file to read in later.

In [26]:
with open("awesome_aa.txt", "w") as f:
  f.write(str(awesome_aa))
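
- One of those other ways is JSON, via Python's standard `json` module. This is a sketch only: `toy_aa` is a stand-in for a single entry, and the temporary folder and file name `toy_aa.json` are made up so the example does not touch your working directory. JSON cannot store a Python `set`, so the `Summary` values above would have to be plain strings before this would work on `awesome_aa`.

```python
import json
import os
import tempfile

# Stand-in for one entry of the nested dictionary (plain str/list/float values)
toy_aa = {'K': {'Name': 'Lysine', 'Codons': ['AAA', 'AAG'], 'Weight': 146.2}}

# Write to a temporary folder so the sketch does not touch your working directory
path = os.path.join(tempfile.mkdtemp(), "toy_aa.json")
with open(path, "w") as f:
    json.dump(toy_aa, f, indent=2)

# Read it back and confirm nothing changed along the way
with open(path, "r") as f:
    restored = json.load(f)

print(restored == toy_aa)  # True
```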

# Dataframes! Wow we made it

- Let's load in our awesome amino acid dictionary we made, and convert it to a dataframe

In [27]:
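import ast

with open("./awesome_aa.txt", "r") as f:
  # literal_eval only accepts Python literals, so unlike eval it cannot run arbitrary code hiding in the file
  amino_acids = ast.literal_eval(f.read())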

print(amino_acids)


{'A': {'Name': 'Alanine', 'Codons': ['GCT', 'GCC', 'GCA', 'GCG'], 'Weight': 89.1, 'Summary': {'Alanine is an amino acid with the following codons: GCT, GCC, GCA, GCG that weighs 89.1 grams per mole'}}, 'R': {'Name': 'Arginine', 'Codons': ['CGT', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'], 'Weight': 174.2, 'Summary': {'Arginine is an amino acid with the following codons: CGT, CGC, CGA, CGG, AGA, AGG that weighs 174.2 grams per mole'}}, 'N': {'Name': 'Asparagine', 'Codons': ['AAT', 'AAC'], 'Weight': 132.1, 'Summary': {'Asparagine is an amino acid with the following codons: AAT, AAC that weighs 132.1 grams per mole'}}, 'D': {'Name': 'Aspartic acid', 'Codons': ['GAT', 'GAC'], 'Weight': 133.1, 'Summary': {'Aspartic acid is an amino acid with the following codons: GAT, GAC that weighs 133.1 grams per mole'}}, 'C': {'Name': 'Cysteine', 'Codons': ['TGT', 'TGC'], 'Weight': 121.2, 'Summary': {'Cysteine is an amino acid with the following codons: TGT, TGC that weighs 121.2 grams per mole'}}, 'E': {'Name'

In [28]:
import pandas as pd

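# The outer keys ('A', 'R', ...) become columns by default, so transpose (.T) to get one row per amino acid
df = pd.DataFrame(amino_acids).T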
print(df)


            Name                          Codons Weight  \
A        Alanine            [GCT, GCC, GCA, GCG]   89.1   
R       Arginine  [CGT, CGC, CGA, CGG, AGA, AGG]  174.2   
N     Asparagine                      [AAT, AAC]  132.1   
D  Aspartic acid                      [GAT, GAC]  133.1   
C       Cysteine                      [TGT, TGC]  121.2   
E  Glutamic acid                      [GAA, GAG]  147.1   
Q      Glutamine                      [CAA, CAG]  146.2   
G        Glycine            [GGT, GGC, GGA, GGG]   75.1   
H      Histidine                      [CAT, CAC]  155.2   
I     Isoleucine                 [ATT, ATC, ATA]  131.2   
L        Leucine  [TTA, TTG, CTT, CTC, CTA, CTG]  131.2   
K         Lysine                      [AAA, AAG]  146.2   
M     Methionine                           [ATG]  149.2   
F  Phenylalanine                      [TTT, TTC]  165.2   
P        Proline            [CCT, CCC, CCA, CCG]  115.1   
S         Serine  [TCT, TCC, TCA, TCG, AGT, AGC]  105.1 

- We don't need that `Summary` column, so let's drop it

In [29]:

df = df.drop(columns=['Summary'])


- A few ways to show what's in a dataframe

In [30]:

# See the first few rows
print(df.head())

print("---------------------------------------------------------")

# See the last few rows
print(df.tail())

print("---------------------------------------------------------")

# View the columns
print(df.columns)

print("---------------------------------------------------------")

# Print a specific column
print(df["Name"])

# Count how many times each value appears in a column
print(df["Name"].value_counts())

# Show the max, min and mean values of a particular column
print(df["Weight"].max())
print(df["Weight"].min())
print(df["Weight"].mean())

            Name                          Codons Weight
A        Alanine            [GCT, GCC, GCA, GCG]   89.1
R       Arginine  [CGT, CGC, CGA, CGG, AGA, AGG]  174.2
N     Asparagine                      [AAT, AAC]  132.1
D  Aspartic acid                      [GAT, GAC]  133.1
C       Cysteine                      [TGT, TGC]  121.2
---------------------------------------------------------
         Name                          Codons Weight
S      Serine  [TCT, TCC, TCA, TCG, AGT, AGC]  105.1
T   Threonine            [ACT, ACC, ACA, ACG]  119.1
W  Tryptophan                           [TGG]  204.2
Y    Tyrosine                      [TAT, TAC]  181.2
V      Valine            [GTT, GTC, GTA, GTG]  117.1
---------------------------------------------------------
Index(['Name', 'Codons', 'Weight'], dtype='object')
---------------------------------------------------------
A          Alanine
R         Arginine
N       Asparagine
D    Aspartic acid
C         Cysteine
E    Glutamic acid
Q     

- A pandas dataframe is a two-dimensional data structure: a table whose columns are each a series.
    - A series is just a one-dimensional labeled array holding data of any type. You can think of it like a fancy list or tuple
- Note that each column has a data type

In [31]:
print(df.dtypes)

Name      object
Codons    object
Weight    object
dtype: object
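
- Every column shows up as `object` (the catch-all type) because transposing a table that mixes strings, lists and numbers leaves pandas with no single type per column. A minimal sketch of converting the weights to real floats, using a stand-in dataframe `toy_df` rather than the full `df`:

```python
import pandas as pd

# Stand-in for the transposed dictionary: every column ends up as dtype object
toy = {
    'A': {'Name': 'Alanine', 'Weight': 89.1},
    'K': {'Name': 'Lysine', 'Weight': 146.2},
}
toy_df = pd.DataFrame(toy).T
print(toy_df.dtypes)

# Convert the weights to real floats so numeric operations behave as expected
toy_df['Weight'] = toy_df['Weight'].astype(float)
print(toy_df['Weight'].dtype)  # float64
```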


In [32]:
my_basic_series = pd.Series([1, 2, 3, 4, 5])
print(my_basic_series)

0    1
1    2
2    3
3    4
4    5
dtype: int64


- Some, but not all, of that stuff we taught you about indexing applies here
  - Note that a single index returns one value, while a slice returns another series

In [33]:
print(my_basic_series[2])
print("----------------")
print(my_basic_series[1])
print("----------------")
print(my_basic_series[1:3])
print("----------------")
print(type(my_basic_series[1:3]))
print("----------------")
print(my_basic_series[-1:])
print("----------------")
print(my_basic_series[:-2])

3
----------------
2
----------------
1    2
2    3
dtype: int64
----------------
<class 'pandas.core.series.Series'>
----------------
4    5
dtype: int64
----------------
0    1
1    2
2    3
dtype: int64
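
- Where it differs from lists: a single value in square brackets is looked up as a label, not as a position, so a bare `my_basic_series[-1]` does not mean "the last element". The explicit accessors `.loc` (by label) and `.iloc` (by position) avoid the ambiguity. A small sketch with a made-up series `weights` that has letters as labels:

```python
import pandas as pd

# A series whose index labels are letters instead of 0, 1, 2, ...
weights = pd.Series([89.1, 174.2, 132.1], index=['A', 'R', 'N'])

print(weights.loc['R'])      # look up by label -> 174.2
print(weights.iloc[0])       # look up by position -> 89.1
print(weights.iloc[-1])      # negative positions count from the end -> 132.1
print(weights.iloc[0:2])     # position slices exclude the end, like lists
print(weights.loc['A':'R'])  # label slices include the end label
```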


- You can filter a dataframe by the values in a specific column. The first step is a comparison, which gives one True/False per row


In [34]:

# Using a number....
print(df["Weight"] > 100)

# Or a string..
print(df["Name"].str.contains("acid"))


A    False
R     True
N     True
D     True
C     True
E     True
Q     True
G    False
H     True
I     True
L     True
K     True
M     True
F     True
P     True
S     True
T     True
W     True
Y     True
V     True
Name: Weight, dtype: bool
A    False
R    False
N    False
D     True
C    False
E     True
Q    False
G    False
H    False
I    False
L    False
K    False
M    False
F    False
P    False
S    False
T    False
W    False
Y    False
V    False
Name: Name, dtype: bool


- Uh.. that's cool I guess. But now all my values are just booleans. How do I get the actual values back?

In [35]:
df[df["Weight"] > 100]

            Name                          Codons Weight
R       Arginine  [CGT, CGC, CGA, CGG, AGA, AGG]  174.2
N     Asparagine                      [AAT, AAC]  132.1
D  Aspartic acid                      [GAT, GAC]  133.1
C       Cysteine                      [TGT, TGC]  121.2
E  Glutamic acid                      [GAA, GAG]  147.1
Q      Glutamine                      [CAA, CAG]  146.2
H      Histidine                      [CAT, CAC]  155.2
I     Isoleucine                 [ATT, ATC, ATA]  131.2
L        Leucine  [TTA, TTG, CTT, CTC, CTA, CTG]  131.2
K         Lysine                      [AAA, AAG]  146.2


- You can also filter by multiple columns

In [36]:
df[(df["Weight"] > 100) & (df["Name"].str.contains('acid'))]
# Weight > 100 and name contains the string 'acid'
# Note the parentheses around each condition, and the .str accessor, which lets us call string methods on every value in the column


            Name      Codons Weight
D  Aspartic acid  [GAT, GAC]  133.1
E  Glutamic acid  [GAA, GAG]  147.1
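
- Conditions can be combined with `&` (and), `|` (or) and `~` (not), and it often reads better to store each one in a variable first. A minimal sketch on a stand-in dataframe `toy_df`; the names `heavy` and `is_acid` are made up for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'Name': ['Alanine', 'Aspartic acid', 'Glycine', 'Tryptophan'],
     'Weight': [89.1, 133.1, 75.1, 204.2]},
    index=['A', 'D', 'G', 'W'],
)

heavy = toy_df['Weight'] > 100                 # boolean mask, one value per row
is_acid = toy_df['Name'].str.contains('acid')  # another mask, from a string test

print(toy_df[heavy & is_acid])   # AND: only D
print(toy_df[heavy | is_acid])   # OR: D and W
print(toy_df[~heavy])            # NOT: A and G
print(heavy.sum())               # True counts as 1, so this counts matches -> 2
```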


- To keep the rows whose value matches any entry in a list, use `isin`

In [37]:
df[df["Name"].isin(['Alanine', 'Arginine'])]

       Name                          Codons Weight
A   Alanine            [GCT, GCC, GCA, GCG]   89.1
R  Arginine  [CGT, CGC, CGA, CGG, AGA, AGG]  174.2


- You don't have to return every column: you can use one column to filter, then report the values of another

In [38]:
df[df["Weight"] > 100]["Name"]

R         Arginine
N       Asparagine
D    Aspartic acid
C         Cysteine
E    Glutamic acid
Q        Glutamine
H        Histidine
I       Isoleucine
L          Leucine
K           Lysine
M       Methionine
F    Phenylalanine
P          Proline
S           Serine
T        Threonine
W       Tryptophan
Y         Tyrosine
V           Valine
Name: Name, dtype: object
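
- The same selection can be written as a single `.loc` call that takes the row filter and the column label together. A minimal sketch on a stand-in dataframe `toy_df` (the names `heavy_names` and `heavy_rows` are made up):

```python
import pandas as pd

toy_df = pd.DataFrame(
    {'Name': ['Alanine', 'Arginine', 'Glycine'],
     'Weight': [89.1, 174.2, 75.1]},
    index=['A', 'R', 'G'],
)

# Row mask and column label in a single .loc call: returns a series
heavy_names = toy_df.loc[toy_df['Weight'] > 100, 'Name']
print(heavy_names)

# Pass a list of labels to get several columns back as a dataframe
heavy_rows = toy_df.loc[toy_df['Weight'] > 100, ['Name', 'Weight']]
print(heavy_rows)
```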

- Adding a column to a dataframe is pretty easy. Luckily I have plenty of columns we can add.

In [39]:

# Read in our protein metadata table!
metadata = pd.read_csv("./protein_metadata.csv")
print(metadata.head())


            name 3_letter_code 1_letter_code     formula  formula_weight  \
0        Alanine           ALA             A    C3H7N1O2           89.09   
1       Cysteine           CYS             C  C3H7N1O2S1          121.16   
2  Aspartic Acid           ASP             D    C4H7N1O4          133.10   
3  Glutamic Acid           GLU             E    C5H9N1O4          147.13   
4  Phenylalanine           PHE             F   C9H11N1O2          165.19   

   isoelectric_point           type  
0               6.00      aliphatic  
1               5.02  polar neutral  
2               2.77   polar acidic  
3               3.22   polar acidic  
4               5.48       aromatic  


- Let's merge our metadata with our amino acid dataframe on the name columns (`Name` in `df`, `name` in `metadata`)

In [43]:
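# pd.merge defaults to an inner join with exact, case-sensitive matching, so a row is dropped if its name differs between the tables
# (e.g. 'Aspartic acid' in df vs 'Aspartic Acid' in the metadata)
mega_awesome_df = pd.merge(df, metadata, left_on='Name', right_on='name')
print(mega_awesome_df)
# Looks like we got an extra column in 'name', so let's fix that

mega_awesome_df = mega_awesome_df.drop(columns=['name'])

# And one extra thing I like to do: let's make sure all the column names are lower case with no spaces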
new_columns = []
for column_name in mega_awesome_df.columns:
    new_columns.append(column_name.lower().replace(" ", "_"))
mega_awesome_df.columns = new_columns
    
    
# You can also use a list comprehension like below
# mega_awesome_df.columns = [x.lower().replace(" ", "_") for x in mega_awesome_df.columns]
print(mega_awesome_df)


             Name                          Codons Weight           name  \
0         Alanine            [GCT, GCC, GCA, GCG]   89.1        Alanine   
1        Arginine  [CGT, CGC, CGA, CGG, AGA, AGG]  174.2       Arginine   
2      Asparagine                      [AAT, AAC]  132.1     Asparagine   
3        Cysteine                      [TGT, TGC]  121.2       Cysteine   
4       Glutamine                      [CAA, CAG]  146.2      Glutamine   
5         Glycine            [GGT, GGC, GGA, GGG]   75.1        Glycine   
6       Histidine                      [CAT, CAC]  155.2      Histidine   
7      Isoleucine                 [ATT, ATC, ATA]  131.2     Isoleucine   
8         Leucine  [TTA, TTG, CTT, CTC, CTA, CTG]  131.2        Leucine   
9          Lysine                      [AAA, AAG]  146.2         Lysine   
10     Methionine                           [ATG]  149.2     Methionine   
11  Phenylalanine                      [TTT, TTC]  165.2  Phenylalanine   
12        Proline        
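
- Look closely at the merged table: Aspartic acid and Glutamic acid are missing, because the metadata spells them 'Aspartic Acid' and 'Glutamic Acid'. A sketch of how to spot and fix that kind of mismatch, using two stand-in dataframes `left` and `right`; the helper column `key` is made up for this example, and lower-casing is just one possible fix.

```python
import pandas as pd

left = pd.DataFrame({'Name': ['Alanine', 'Aspartic acid'], 'Weight': [89.1, 133.1]})
right = pd.DataFrame({'name': ['Alanine', 'Aspartic Acid'], 'type': ['aliphatic', 'polar acidic']})

# An outer join keeps unmatched rows; indicator=True adds a _merge column
# that says whether each row was found in both tables or only one
check = pd.merge(left, right, left_on='Name', right_on='name', how='outer', indicator=True)
print(check[['Name', 'name', '_merge']])

# One possible fix: merge on a lower-cased copy of the key
left['key'] = left['Name'].str.lower()
right['key'] = right['name'].str.lower()
fixed = pd.merge(left, right, on='key').drop(columns=['key', 'name'])
print(fixed)
```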

- Finally, let's save our work and come back to it for the homework!

In [41]:
mega_awesome_df.to_csv("mega_awesome_df.csv", index=False)
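
- One thing to watch for when you read the file back: a CSV stores text, so a column of lists (like our codons) comes back as strings. A minimal sketch with a stand-in dataframe `toy_df` and an in-memory buffer instead of a real file; passing `ast.literal_eval` as a converter is one way to turn the text back into lists.

```python
import ast
import io
import pandas as pd

toy_df = pd.DataFrame({'name': ['Lysine'], 'codons': [['AAA', 'AAG']]})

# Write to an in-memory buffer instead of a file, just for the demo
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)

# Read straight back: the list comes back as a string
as_text = pd.read_csv(io.StringIO(buffer.getvalue()))
print(type(as_text.loc[0, 'codons']))  # <class 'str'>

# A converter can parse the text back into a list while reading
as_list = pd.read_csv(buffer, converters={'codons': ast.literal_eval})
print(as_list.loc[0, 'codons'])  # ['AAA', 'AAG']
```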