# Collection datatypes

Python's collection (or *sequence*) types let you create a single object that holds multiple values (or objects). Here, we'll go through three of the most commonly used ones:

- **string** (`str`) objects hold only text data.
- **list** objects can hold any objects.
- **tuple** objects can also hold any objects.

## String

`str` (string) objects can be created in several ways, all of which are equivalent:

```python
>>> 'Hello'  # with apostrophes ("single-quotes")
'Hello'

>>> "Hello"  # with quotation marks ("double-quotes")
'Hello'

>>> """Hello,
... my name is
... Mohammad"""   # with triple quotes (used for multi-line text and docstrings)
'Hello,\nmy name is\nMohammad'

>>> str(32)  # using the str() function to change into a string
'32'
```
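As a quick check that these forms produce the same kind of object, here is a minimal sketch; the variable names are illustrative and not part of the original notebook.

```python
# Every quoting style produces an ordinary str object.
single = 'Hello'
double = "Hello"
triple = """Hello"""

print(single == double == triple)  # True
print(type(triple))                # <class 'str'>

# Choosing the other quote style avoids escaping a quote inside the text.
phrase = "It's a sequence"
print(phrase)                      # It's a sequence

# str() converts other objects into their text form.
label = str(32) + "bp"
print(label)                       # 32bp
```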

Nucleotide sequences, for instance, are often represented as strings:

```python
>>> seq = 'GCATTGGCT'
```
<br>

Remember that every object comes with its own methods? A `str` object is no exception. Let's explore some of them:

### Exercise

Modify the DNA sequences below, each in a single line of code, to match what's asked for. The operations, functions, and methods you may need are:

### Operations
  - `'GTC' * 3`
  - `'GTC' + 'GTC'`
  - `'GTC' == 'GTC'`
  - `'GTC' != 'GTC'`

### Functions
  - `len('GTC')`

### Methods
  - `'GTC'.count('A')`
  - `'GtC'.upper()`
  - `'GTC'.lower()`
  - `'GTC'.isdigit()`
  - `'GTC'.index('T')`
  - `'GTC'.replace('G', 'C')`
  - `'GTC-CCA'.split('-')`
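To see what each of these expressions returns, the sketch below runs them on the toy sequence `'GTC'`. The variable names are only for illustration; the values in the comments are what Python produces for these exact inputs.

```python
seq = 'GTC'

# Operations
repeated = seq * 3                # 'GTCGTCGTC'
joined = seq + seq                # 'GTCGTC'
same = seq == 'GTC'               # True
different = seq != 'GTC'          # False

# Functions
length = len(seq)                 # 3

# Methods
n_a = seq.count('A')              # 0, there is no 'A' in 'GTC'
upper = 'GtC'.upper()             # 'GTC'
lower = seq.lower()               # 'gtc'
digits = seq.isdigit()            # False, the text is not made of digits
t_pos = seq.index('T')            # 1, positions are counted from zero
swapped = seq.replace('G', 'C')   # 'CTC'
parts = 'GTC-CCA'.split('-')      # ['GTC', 'CCA']

# String methods return new objects; the original is left untouched.
print(seq)                        # GTC
```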



1. Count the number of "G" nucleotides in the sequence.

In [1]:
seq = "GTGTCAGTCCCCATGAATCGATAG"

In [2]:
seq.count("G")

6

2. Count the number of "AT" occurrences in the sequence.

In [3]:
seq = "GTGTCAGTCCCCATGAATCGATAG"

In [4]:
seq.count("AT")

3

3. Concatenate the following two sequences (i.e. combine them into one sequence).

In [5]:
seq1 = "GTGTCAGT"
seq2 = "TGAATCGATAG"

In [6]:
seq1 + seq2

'GTGTCAGTTGAATCGATAG'

4. How long is the following sequence?

In [7]:
seq = "GTGTCAGTCCCCATGAATCGATAG"

In [8]:
len(seq)

24

5. Repeat the following sequence 13 times.

In [9]:
gc = "GC"

In [10]:
gc * 13

'GCGCGCGCGCGCGCGCGCGCGCGCGC'

6. Replace the incorrect letter with an empty string (i.e. delete the letter).

In [11]:
seq = "GTGXXGTXCCXCCATGXAATCGXATA"

In [12]:
seq.replace("X", "")

'GTGGTCCCCATGAATCGATA'

7. Standardize the formatting of this sequence (i.e. all capital letters).

In [13]:
seq = "GtCGAaaCCgTaGcTAgc"

In [14]:
seq.upper()

'GTCGAAACCGTAGCTAGC'

8. Split the following string at the spaces into a list of sequences.

In [15]:
seqs = "GTTCGAAAG GACCTGATTATAG AACCGATTTA"

In [16]:
seqs.split(" ")

['GTTCGAAAG', 'GACCTGATTATAG', 'AACCGATTTA']

9. What percentage of strong nucleotides (G and C) are there in this sequence?

In [17]:
seq = "GTGTCAGTCCCCATGAATCGATAG"

In [18]:
100 * (seq.count("G") + seq.count("C")) / len(seq)

50.0
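If the same calculation is needed for several sequences, it could be wrapped in a small function. The sketch below is one possible way to do that; `gc_percent` is a hypothetical helper name, not something defined elsewhere in this notebook, and returning `0.0` for an empty sequence is an assumption made here to avoid dividing by zero.

```python
def gc_percent(seq):
    """Return the percentage of G and C nucleotides in seq (hypothetical helper)."""
    seq = seq.upper()          # treat 'g' and 'G' the same way
    if not seq:                # assumption: an empty sequence counts as 0 percent
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


print(gc_percent("GTGTCAGTCCCCATGAATCGATAG"))  # 50.0
print(gc_percent("gcgc"))                      # 100.0
print(gc_percent(""))                          # 0.0
```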

---

## List

`list` objects are created by placing elements inside square brackets `[]`, separated by commas:

```
>>> my_list = [1, 2, 3]
```
<br>

They can have any number of elements and they may be of different types (integer, float, string, etc.):
```
>>> my_list = [1, "two", ["another", "list"]]
```

In [19]:
my_list = [1, 2., "two", "strings"]

In [20]:
len(my_list)

4

In [21]:
type(my_list)

list
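A nested list counts as a single element of the outer list. The short sketch below illustrates this with `len()` and the `in` operator; the name `inner` is introduced here only for the example.

```python
inner = ["another", "list"]
my_list = [1, 2., "two", inner]

print(len(my_list))          # 4, the nested list is one element
print(len(inner))            # 2

# `in` checks membership among the top-level elements only.
print("two" in my_list)      # True
print(inner in my_list)      # True
print("another" in my_list)  # False, it lives inside the nested list
```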

## Tuple

`tuple` objects are very similar to lists but they are created with parentheses `()`:
```
>>> my_tuple = (1, "two", ["another", "list"])
```
<br>

There is another (important) difference between lists and tuples, but we'll get to that during the exercise ;-)

In [22]:
my_tuple = (1, 2, "two", "str")
type(my_tuple)

tuple

### Exercise (quick)

1. Create a list object that contains numbers from 1 to 4.

In [23]:
my_list = [1, 2, 3, 4]
my_list

[1, 2, 3, 4]

2. What methods does a list object have? Display all the methods.

In [24]:
dir(my_list)

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']
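Most of the names above start and end with double underscores; Python uses those internally. One possible way to show only the everyday methods is to filter them out, as in this sketch (the name `public_methods` is illustrative, and it uses a list comprehension, a construct not otherwise covered in this section).

```python
my_list = [1, 2, 3, 4]

# Keep only the names that do not start with an underscore.
public_methods = [name for name in dir(my_list) if not name.startswith("_")]
print(public_methods)
```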

3. Change the list to a tuple and save it as (create a variable called) `my_tuple`.

In [26]:
my_tuple = tuple(my_list)

In [27]:
my_tuple

(1, 2, 3, 4)

4. What is the sum of all the elements in `my_tuple`?

In [28]:
sum(my_tuple)

10

5. What is the mean value of `my_tuple`?

In [29]:
sum(my_tuple) / len(my_tuple)

2.5
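The same built-in functions work on lists and tuples alike. As a cross-check of the manual calculation, the sketch below also uses `statistics.mean` from Python's standard library; the variable names are illustrative.

```python
import statistics

my_tuple = (1, 2, 3, 4)

manual_mean = sum(my_tuple) / len(my_tuple)
library_mean = statistics.mean(my_tuple)

print(manual_mean)    # 2.5
print(library_mean)   # 2.5

# min() and max() are built-in functions as well.
print(min(my_tuple), max(my_tuple))  # 1 4
```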

---

## Extracting Data From a Collection: Indexing and Slicing

**Indexing**: Data can be indexed/queried/extracted from collections using the square brackets: `[ ]`

In sequences, putting a number `n` inside the brackets extracts the element at position `n`, counting from zero.
```python
>>> (1, 2, 3)[1]
2

>>> (1, 2, 3)[0]
1
```
<br>

Negative indices count from the end of the sequence, with the last element at index `-1`:
```
>>> (1, 2, 3)[-1]
3
```
<br>
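Indexing works the same way on strings and lists as it does on tuples. The sketch below reuses example values from earlier in this section and also shows what happens when an index is past the end of the sequence.

```python
seq = "GCATTGGCT"
letters = ['a', 'c', 'd', 'e']

print(seq[0])       # G
print(letters[1])   # c
print(seq[-1])      # T

# A negative index -k refers to the same element as len(seq) - k.
print(seq[-1] == seq[len(seq) - 1])  # True

# An index outside the sequence raises an IndexError.
try:
    letters[10]
except IndexError as err:
    print("IndexError:", err)
```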

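**Slicing**: You can "slice" a sequence (get all elements from a start index up to, but not including, a stop index) using a colon inside the brackets: `[start:stop]`. An optional third number sets the step: `[start:stop:step]`.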
```python
>>> (10, 20, 30, 40, 50, 60)[1:3]
(20, 30)

>>> (10, 20, 30, 40, 50, 60)[:3]
(10, 20, 30)

>>> (10, 20, 30, 40, 50, 60)[3:]
(40, 50, 60)

>>> (10, 20, 30, 40, 50, 60)[1:6:2]  # takes every second element
(20, 40, 60)
```

![Indexing and slicing diagram](./imgs/indexing.png)
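Slicing applies to strings and lists too. The sketch below walks through a few properties that are easy to trip over: the stop index is excluded, omitted bounds default to the ends of the sequence, and a slice that runs past the end does not raise an error. The variable names are illustrative.

```python
seq = "GCATTGGCT"

first_codon = seq[:3]    # 'GCA', indices 0, 1 and 2
rest = seq[3:]           # 'TTGGCT', index 3 to the end

# Because the stop index is excluded, the two pieces join back exactly.
print(first_codon + rest == seq)  # True

middle = seq[2:5]        # 'ATT', which has 5 - 2 = 3 elements
every_third = seq[::3]   # 'GTG', indices 0, 3 and 6
backwards = seq[::-1]    # 'TCGGTTACG', a step of -1 walks from the end
tail = seq[5:100]        # 'GGCT', out-of-range stops are simply clipped

print(middle, every_third, backwards, tail)
```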


### Exercise

1. Display the methods of this list object.

In [30]:
letters = ['a', 'c', 'd', 'e']

In [31]:
dir(letters)

['__add__',
 '__class__',
 '__class_getitem__',
 '__contains__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__imul__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__mul__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__reversed__',
 '__rmul__',
 '__setattr__',
 '__setitem__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'append',
 'clear',
 'copy',
 'count',
 'extend',
 'index',
 'insert',
 'pop',
 'remove',
 'reverse',
 'sort']

2. How many items does the `letters` list (from the previous exercise) have? Use a Python built-in function to display the length of the list.

In [32]:
len(letters)

4

3. Reverse the order of the elements in the `letters` list.

In [37]:
letters[::-1]

['e', 'd', 'c', 'a']
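Note that the slice returns a reversed copy and leaves `letters` itself unchanged. Lists also have a `reverse()` method that reverses the list in place. The sketch below contrasts the two; the names `reversed_copy` and `result` are illustrative.

```python
letters = ['a', 'c', 'd', 'e']

# Slicing builds a new list; the original keeps its order.
reversed_copy = letters[::-1]
print(reversed_copy)   # ['e', 'd', 'c', 'a']
print(letters)         # ['a', 'c', 'd', 'e']

# reverse() changes the list itself and returns None.
result = letters.reverse()
print(result)          # None
print(letters)         # ['e', 'd', 'c', 'a']
```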

4. Add `"f"` at the end of the `letters` list. Hint: list objects have a method for this.

In [39]:
letters = ['a', 'c', 'd', 'e']

In [40]:
letters.append("f")

In [41]:
letters

['a', 'c', 'd', 'e', 'f']

5. Insert `"b"` at the second position (after `"a"`) of the `letters` list.

In [42]:
letters.insert(1, "b")

In [43]:
letters

['a', 'b', 'c', 'd', 'e', 'f']

6. Create a tuple from the letters list and call it `letters_tuple`.

In [44]:
letters_tuple = tuple(letters)
letters_tuple

('a', 'b', 'c', 'd', 'e', 'f')

7. Change `"f"` into `"z"` in both the `letters` list and the `letters_tuple` tuple.

In [45]:
letters[-1] = "z"

In [46]:
letters

['a', 'b', 'c', 'd', 'e', 'z']

In [47]:
letters_tuple[-1] = "z"

TypeError: 'tuple' object does not support item assignment
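This error is the important difference between the two types: a list can be changed after it is created, while a tuple cannot. If a modified tuple is needed, one option is to build a new one from pieces of the old one, as in this sketch (`new_tuple` is an illustrative name).

```python
letters_tuple = ('a', 'b', 'c', 'd', 'e', 'f')

# Item assignment is not allowed on a tuple.
try:
    letters_tuple[-1] = "z"
except TypeError as err:
    print("TypeError:", err)

# Build a new tuple instead: everything but the last element, plus "z".
# The trailing comma makes ("z",) a one-element tuple.
new_tuple = letters_tuple[:-1] + ("z",)

print(new_tuple)       # ('a', 'b', 'c', 'd', 'e', 'z')
print(letters_tuple)   # ('a', 'b', 'c', 'd', 'e', 'f'), the original is unchanged
```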

8. Make a list of only the first codon (the first 3 nucleotides) in each sequence.

In [48]:
seqs = ['GGTATTA', 'GGCCAG', 'CCAGGATTAG']

In [49]:
[seqs[0][:3], seqs[1][:3], seqs[2][:3]]

['GGT', 'GGC', 'CCA']
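Writing out each index by hand only works when the number of sequences is known in advance. As a possible alternative, a list comprehension applies the same slice to every element, however many there are. The sketch below assumes that construct, which is not otherwise covered in this section; `first_codons` is an illustrative name.

```python
seqs = ['GGTATTA', 'GGCCAG', 'CCAGGATTAG']

# Take the first three nucleotides of every sequence in the list.
first_codons = [seq[:3] for seq in seqs]

print(first_codons)  # ['GGT', 'GGC', 'CCA']
```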

### Further Reading

[Here's a nice article](https://towardsdatascience.com/the-basics-of-indexing-and-slicing-python-lists-2d12c90a94cf) that reviews indexing and slicing.
