# Data types and Data Structures

One of the most fundamental characteristics of a programming language is how it handles data. 

Here, we will first look into some basic data types, followed by data structures.

| Key | Data Type | Data Structure |
| --- | --- | --- |
| Definition | A data type describes the nature of the data used in a program: the common properties shared by all values of that type. For example, the integer data type describes every integer that the given machine can handle. | A data structure is a collection that holds data and organizes it so that operations and algorithms can be applied more easily. For example, lists are Python data structures that hold an ordered collection of items of different data types, including other data structures. |
| Storage | A data type does not store a value; it only describes the kind of data that can be stored. | A data structure holds the data itself, which occupies space in main memory. A single data structure can also hold data of different kinds and types. |

## Basic data types

1. Numbers
2. Strings
3. Variables

## Numbers

### Integers

In [None]:
type(5)

### Float (Real numbers)

In [None]:
type(3.5)

### Complex numbers

In [None]:
type(1+3j)

## Strings


In [None]:
type("Neel")

**A string is a sequence of characters delimited by single quotes ('), double quotes ("), triple single quotes ('''), or triple double quotes (""")**.

Therefore, the following strings are equivalent:

In [None]:
print("This is a string in Python")
print('This is a string in Python')
print('''This is a string in Python''')
print("""This is a string in Python""")

The advantage of having both single (') and double (") quote delimiters is that we can insert a single quote in a string delimited by double quotes, and vice versa:

In [None]:
print("A single quote (') inside a double quote")
print('Here we have "double quotes" inside single quotes')

**We must always end a string with the same type of quote that we started with.**

In [None]:
print("Do not mix your quotes')

#### Block strings

Strings delimited by triple quotes can be used to indicate multi-line strings.

In [None]:
print("""
This
is
a
multi-line
string.""")

In [None]:
print('''
This
is
also
a
multi-line
string.''')

We can use the **'\n' (EOL: end-of-line or newline)** character to write the above strings without triple quotes.

In [None]:
print('This\ntoo\nis\na\nmulti-line\nstring.')

**Triple quotes are better for formatting**.

In [None]:
print('''
This
    too
        is
            a
                multi-line
                            string.''')

#### Escape characters
Backslash is an “escape” character that gives the next character a special meaning:

| Construct | Meaning |
| --- | --- |
| \n | Newline |
| \t | Tab |
| \\\ | Backslash |
| \\" | Double quote |

In [None]:
print("A double quote \" inside a double quote")

#### In Python, strings are both a data type and a data structure.

Strings belong to a group of Python data structures called **sequences**, which contain elements arranged in sequential order. **Lists** and **tuples** are other sequence data structures. Sequences share common properties: they can be indexed, sliced, and iterated over. We will discuss other sequences and their properties later.

#### Basic string operators

| Operator | Function |
| --- | --- |
| + | Concatenate strings |
| * | Replicate (copy) strings |
| in | Membership test: True if the first string occurs inside the second string |
| not in | Non-membership test: True if the first string does not occur inside the second string |

In [None]:
'atg' + 'gtacgtccgt'

In [None]:
'atg'*3

In [None]:
'atg' in 'atggccggcgta'

In [None]:
'atg' not in 'atggccggcgta'

#### String Manipulation
Strings are immutable: once created, they cannot be modified.

In [None]:
s = 'agt'

# Attempt to modify the character at the 3rd position (raises TypeError)
s[2] = 'c'

However, a string can be passed to a function or method that returns a modified copy of it.

For example, the string method **lower()** returns a lower-case version of the name stored in `NAME`.

In [None]:
NAME = 'Carsten Fortmann-Grote'
name = NAME.lower()
name

In [None]:
NAME

A **method** is a function associated with a particular object. Here, our original string is the object.

We can also give the new string the same name as the old string.

In [None]:
NAME = 'Carsten Fortmann-Grote'
NAME = NAME.lower()
NAME
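
Because strings cannot be changed in place, an "edit" is always expressed as building a new string. The following is a minimal sketch that uses slicing (covered later in this notebook) and concatenation; the names `seq` and `edited` are only illustrative.

```python
seq = 'agt'

# Build a new string from the first two characters plus the replacement.
edited = seq[:2] + 'c'

print(edited)  # agc
print(seq)     # agt, the original string is unchanged
```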

#### Common Methods Associated with Strings

**replace(old,new\[,n\])**: Allows us to replace a portion of a string (old) with another (new). If the optional argument 'n' is used, only the first n occurrences of old will be replaced:

In [None]:
dna_seq = 'GCTAGTAATGTG'
m_rna_seq = dna_seq.replace('T','U')
m_rna_seq

**count(sub\[, start\[, end\]\])**: Counts how many times the substring sub appears, between the start and end positions (if available). Let’s see how it can be used to calculate the CG content of a sequence:

In [None]:
dna_seq

In [None]:
c = dna_seq.count("C")
g = dna_seq.count("G")
(c+g)/len(dna_seq)*100
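
The same calculation can be wrapped in a small reusable function. This is only a sketch: the function name `gc_content`, the upper-casing of the input, and the choice to return 0.0 for an empty sequence are assumptions made for illustration.

```python
def gc_content(seq):
    """Return the percentage of G and C bases in seq (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()  # accept lower-case input as well
    return (seq.count('G') + seq.count('C')) / len(seq) * 100


# 5 of the 12 bases are G or C, so this prints roughly 41.67
print(gc_content('GCTAGTAATGTG'))
```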

**index(sub\[,start\[,end\]\])**: Returns the position of the substring sub, between the start and end positions (if available). If the substring is not found in the string, this method raises a ValueError:

In [None]:
m_rna_seq

In [None]:
m_rna_seq.index('AUG')

In [None]:
m_rna_seq.index('GGG')
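
When a missing substring is an expected case rather than an error, the related string method **find()** can be used instead: it returns -1 when the substring is absent. The sketch below repeats the mRNA sequence from above so that it runs on its own.

```python
m_rna_seq = 'GCUAGUAAUGUG'

print(m_rna_seq.find('AUG'))  # 7
print(m_rna_seq.find('GGG'))  # -1

# index() signals the same situation with an exception.
try:
    m_rna_seq.index('GGG')
except ValueError as err:
    print('index() raised ValueError:', err)
```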

**split(\[sep \[,maxsplit\]\])**: Separates the “words” of a string and returns them as a list. If a separator (sep) is not specified, the default separator will be a whitespace:

In [None]:
'This string has words separated by single spaces'.split()

In [None]:
"Bilal Haider,5555-5555,billy@boy.thomson".split()

In [None]:
"Bilal Haider,5555-5555,billy@boy.thomson".split(",")

**join(seq)**: The inverse of split; it joins the elements of a sequence using the string as a “glue character”:

In [None]:
':'.join(['Bilal Haider', '5555-5555', 'billy@boy.thomson'])

To join a sequence without any glue character, use empty quotes (""):

In [None]:
''.join(['A','T','A','C'])
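
The optional maxsplit argument of split() and the inverse relationship between split() and join() can be checked with a short example. The variable names (`record`, `fields`, `name`, `rest`) are illustrative only.

```python
record = 'Bilal Haider,5555-5555,billy@boy.thomson'

fields = record.split(',')
# With maxsplit=1 the string is split only at the first comma.
name, rest = record.split(',', 1)

print(fields)  # ['Bilal Haider', '5555-5555', 'billy@boy.thomson']
print(name)    # Bilal Haider
print(rest)    # 5555-5555,billy@boy.thomson

# Joining with the same separator restores the original string.
print(','.join(fields) == record)  # True
```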

## Variables

Variables are storage containers for numbers, strings, etc.

The equal sign (**=**) is used to assign a value to a variable.

In [None]:
Person1 = 'Carsten Fortmann-Grote'

A variable must be defined before it is used; otherwise Python raises a NameError.

In [None]:
Person2

The name associated with the value is called a **variable**, because its value can vary.

Strictly speaking, **Python** does not copy values on assignment; it only **binds names to objects**.

In [None]:
a = 2
a

In [None]:
b = a
b

Here we bind `b` to a new object, which is the sum of the existing `b` and the integer 5.

In [None]:
b = b + 5
b

In [None]:
a = 100
a

In [None]:
a += 1 # a = a + 1
a

In [None]:
b = 200
b = 0 + 100
b

The same binding rules apply to mutable data structures such as lists:

In [None]:
ls_a = [1, 2, 3] # ls_a is a name bound to a list with three elements
ls_b = ls_a # ls_b is bound to the same list
ls_b[1] = 100 # changes the element at position [1]; the change is visible through ls_a too
ls_a

In [None]:
ls_a = [1, 2, 3]
ls_b = ls_a
ls_b = ls_b*2 # ls_b is now bound to a new list that repeats the original list twice
ls_a # the original list does not change

In [None]:
ls_b 

In [None]:
ls_a = [1, 2, 3]
ls_b = ls_a
ls_b = ls_b*1 # ls_b is now bound to a new list with the same contents as the original
ls_b[1] = 100 # ls_b is a separate list, so this change has no effect on the original list
ls_a

In [None]:
ls_b
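
Whether two names are bound to the same object can be checked directly with the `is` operator, while `==` only compares contents. A minimal sketch, with `ls_c` as an illustrative extra name:

```python
ls_a = [1, 2, 3]
ls_b = ls_a          # a second name for the same list
ls_c = ls_a * 1      # a new list with equal contents

print(ls_b is ls_a)  # True: same object
print(ls_c is ls_a)  # False: different objects
print(ls_c == ls_a)  # True: equal contents
```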

Variable names are **case-sensitive**.

In [None]:
Variable1 = 1
variable1 = 11
VARIABLE1 = 111
print(Variable1, variable1, VARIABLE1)

#### Variable names can only consist of letters, digits (not as the first character), and the underscore character.

Valid names: name, My_str, DNA, sequence1

Invalid names: 1string, name#, year@21

Note: **Please give meaningful names to variables**, e.g. 'ProteinSeq' is much better than 'p'.
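
The naming rules can be checked programmatically with the string method `isidentifier()`. The sketch below also excludes reserved words such as `for` by using the standard `keyword` module, which goes slightly beyond the rule stated above; the list `candidates` and the dictionary `results` are illustrative names.

```python
import keyword

candidates = ['name', 'My_str', 'DNA', 'sequence1',
              '1string', 'name#', 'year@21', 'for']

# A usable variable name is a valid identifier that is not a reserved word.
results = {}
for candidate in candidates:
    results[candidate] = (candidate.isidentifier()
                          and not keyword.iskeyword(candidate))

for candidate, usable in results.items():
    print(candidate, usable)
```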

# Data structures

There are two types of Python data structures:

1. Sequential data structures, which include **strings**, **lists**, and **tuples**.
2. Unordered data structures, which include **dictionaries** and **sets**. Their elements are not accessed by position.

## Lists

A list is an ordered collection of objects. 

It is represented by elements separated by commas and enclosed between square brackets.

We have seen a list while applying the split() method.

In [None]:
"Bilal Haider,5555-5555,billy@boy.thomson".split(',')

All three elements of the resulting list, separated by commas, are strings.

The next code shows how to define and name a list:

In [None]:
first_list = [1, 2, 3, 4, 5]
first_list

A list can hold different kinds of elements:

In [None]:
second_list = [1, 'two', 3, 4, 'five']
second_list

In [None]:
nested_list = [1, 'two', first_list, 'five']
nested_list

In [None]:
type(nested_list)

We can also define an empty list with empty brackets:

In [None]:
empty_list = []
empty_list

#### List with repeated items

A list can be initialised with the same item repeated multiple times using the __*__ operator.

In [None]:
List_Uno = [1] * 5
List_Uno

It is also possible to create a list of a given size filled with the placeholder value None.

In [None]:
List_Null = [None] * 5
List_Null

#### __List Comprehension__

A list can be created from another list.

In [None]:
List_Enumerated = [1, 2, 3, 4, 5]
List_Comprehended = [x*2 for x in List_Enumerated]
List_Comprehended

List comprehension means creating a new list by describing the property shared by all its members. 

In the above case, all members of the List_Comprehended are obtained by extracting the members of List_Enumerated and then doubling them.
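
A comprehension can also filter the source list with an `if` clause. The sketch below keeps only the even members and squares them, and compares the result with the equivalent explicit loop; the names `even_squares` and `even_squares_loop` are illustrative.

```python
List_Enumerated = [1, 2, 3, 4, 5]

# Keep only the even members, then square them.
even_squares = [x**2 for x in List_Enumerated if x % 2 == 0]
print(even_squares)  # [4, 16]

# The equivalent explicit loop.
even_squares_loop = []
for x in List_Enumerated:
    if x % 2 == 0:
        even_squares_loop.append(x**2)

print(even_squares_loop == even_squares)  # True
```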

### Modifying Lists
Unlike strings, lists can be modified by adding, removing, or changing their elements:

#### **Adding**

There are three ways to add elements to a list: append, insert, and extend.

**append(element)**: Adds an element at the end of the list.

In [None]:
first_list

In [None]:
first_list.append(99)
first_list

**insert(position,element)**: Inserts the given element at the given position.

In [None]:
first_list.insert(2,50)
first_list

**extend(list)**: Extends a list by adding a list to the end of the original list.

In [None]:
first_list.extend([6,7,8])
first_list

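This produces the same elements as the __+__ symbol, except that __+__ creates a new list while extend() modifies the original list in place.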

In [None]:
[1,2,3]+[4,5]
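
One difference between the two is worth seeing in action: `+` builds a new list and leaves its operands untouched, whereas extend() changes the existing list, which is visible through every name bound to it. The names `base`, `alias`, and `combined` are illustrative.

```python
base = [1, 2, 3]
alias = base               # second name for the same list

combined = base + [4, 5]   # new list; base is untouched
print(base)      # [1, 2, 3]
print(combined)  # [1, 2, 3, 4, 5]

base.extend([4, 5])        # modifies the list in place
print(alias)     # [1, 2, 3, 4, 5]
```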

#### **Removing**
There are three ways to remove elements from a list: remove, pop, and del.

**remove(element)**: Removes the element specified in the parameter but does not return anything. In the case where there is more than one copy of the same object in the list, it removes the first one, counting from the left. 

In [None]:
first_list.remove(99)

In [None]:
first_list

In [None]:
first_list.remove(6)
first_list

**pop(index)**: Removes and returns the element at the specified index in the parameter.

In [None]:
first_list.pop(0)

In [None]:
first_list

**del**: Has the same effect as pop but does not return anything.

In [None]:
del first_list[0]

In [None]:
first_list

#### **Copying**

Copying a list is a tricky business.

In [None]:
a = [1, 2, 3]
b = a
b

In [None]:
b.pop()

In [None]:
a

So "__=__" does not copy values; it copies the reference to the original list.

To copy a list:

1. Use the **copy** function from the **copy** module.

In [None]:
from copy import copy as cp
a = [1, 2, 3]
b = cp(a)
b.pop()

In [None]:
a

2. Use **slicing**

In [None]:
a[:]

In [None]:
b = a[:]
b.pop()

In [None]:
a
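
Both approaches above make a shallow copy: the new list is independent, but nested lists inside it are still shared with the original. For nested data, the `copy` module also provides `deepcopy`. A small sketch with illustrative names:

```python
from copy import copy, deepcopy

nested = [1, [2, 3]]
shallow = copy(nested)      # new outer list, shared inner list
deep = deepcopy(nested)     # inner list is copied as well

nested[1].append(4)         # modify the inner list of the original

print(shallow)  # [1, [2, 3, 4]]
print(deep)     # [1, [2, 3]]
```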

## Tuples

### Tuples Are Immutable Lists

A tuple is a collection of ordered objects with the characteristic that once created, it cannot be modified. That is why they are referred to as "immutable lists".

Tuple elements are enclosed in parentheses.

In [None]:
first_tuple = (1 , 2, 3)
first_tuple

In [None]:
Lonely_tuple = (4,)
Lonely_tuple

The trailing comma is important; without it, the parentheses are ignored and the value is interpreted as just the integer 4.

In [None]:
type(Lonely_tuple)

In [None]:
Lonely_No = (4)
type(Lonely_No)

You cannot add or remove elements from a tuple:

In [None]:
first_tuple.append(4)

#### Tuples generally take less memory than lists and can be faster to create and iterate over.

## Common Properties of the Sequences

The following properties apply to strings, lists, and tuples.

### Indexing

Since the elements of a sequence are ordered, we can gain access to any element through an index that begins at zero.

In [None]:
MyStr = 'Python'
MyList = list(MyStr)
MyTuple = tuple(MyStr)
print(MyStr) 
print(MyList)
print(MyTuple)

**list()** converts other sequences into a list and **tuple()** converts other sequences into a tuple.


Indexing is done by placing the index number in brackets after the variable name.

In [None]:
MyStr[1]

In [None]:
MyList[0]

In [None]:
MyTuple[3]

Indexing can also be done from the right side.

In [None]:
MyStr[-1]

In [None]:
MyList[-6]

In [None]:
MyTuple[-3]

#### **Double indexing**
We can also access elements that are inside a sequence, which is inside another sequence.

In [None]:
nested_list

In [None]:
nested_list[2][-1]

In [None]:
nested_list[-1][2] # third character of the last element, the string 'five'

### Slicing

We can select a portion of a sequence using the slice notation. Slicing consists of using two index coordinates separated by a colon (:). 

The slice starts at the first index and stops just before the second, so the element at the second index is not included.

In [None]:
MyStr[0:2]

By omitting the first index, the start defaults to the first position (0).

In [None]:
MyStr[:2]

In [None]:
MyStr[3:]

In [None]:
MyStr[-6:-4]

In [None]:
MyStr[-2:-1]

The **step argument** is the optional third index, used to skip positions.

In [None]:
MyStr[0:5]

In [None]:
MyStr[0:5:2]

A negative step counts backward. Hence, -1 can be used as the step argument to **reverse a sequence**.

In [None]:
MyStr[::-1]

#### Slicing always returns another sequence.
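
A quick check of this statement: a slice has the same type as the sequence it was taken from, and even a one-element slice is still a sequence, unlike plain indexing. The variables are redefined here so the example runs on its own.

```python
MyStr = 'Python'
MyList = list(MyStr)
MyTuple = tuple(MyStr)

print(MyStr[0:2])    # Py
print(MyList[0:2])   # ['P', 'y']
print(MyTuple[0:2])  # ('P', 'y')

# A one-element slice is still a list; indexing returns the element itself.
print(MyList[0:1])   # ['P']
print(MyList[0])     # P
```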

### Membership Test

We can verify whether an element belongs to a sequence, using the **in** keyword.

In [None]:
point = (23, 56, 11)
11 in point

In [None]:
my_sequence = 'MRVLLVALALLALAASATS'
'X' in my_sequence

### Concatenation
We can concatenate two or more sequences of the same class using the “+” sign.

In [None]:
point2 = (2, 6, 7)
point + point2

In [None]:
dna_seq = 'ATGCTAGACGTCCTCAGATAGCCG'
tata_box = 'TATAAA'

In [None]:
tata_box + dna_seq

In [None]:
tata_box + point

### len, max, and min
**len()** returns the length (the number of items) of a sequence. 

**max()** and **min()** applied over a sequence of numbers return, as expected, the maximum and the minimum value.

In [None]:
len(point)

In [None]:
max(point)

In [None]:
min(point)

max() and min() applied to strings return the character with the highest or lowest Unicode code point (identical to the ASCII code for ASCII characters).
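
The built-in `ord()` function shows the numeric code behind each character, which makes the result of max() and min() on a string easier to interpret. The protein sequence is the one used in the membership test above.

```python
my_sequence = 'MRVLLVALALLALAASATS'

print(max(my_sequence), ord(max(my_sequence)))  # V 86
print(min(my_sequence), ord(min(my_sequence)))  # A 65

# Upper-case letters have lower codes than lower-case letters.
print(max('aZ'))  # a
```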

## Dictionaries

### Mapping: Calling Each Value by a Name
Dictionaries are a special data structure not present in all programming languages.

The main characteristic of a dictionary is that it stores arbitrary values indexed by keys rather than by position.

In [None]:
iupac = {'A':'Ala','C':'Cys','E':'Glu'}
print('C stands for the amino acid {}.'.format(iupac['C']))

iupac is the name of a dictionary with three elements. 

It was defined by enclosing key:value pairs between curly brackets ({}).

Every element consists of a key:value pair. 

The key is the index used to retrieve the value.

In [None]:
iupac['E']

Only immutable (hashable) objects such as strings, numbers, and tuples of immutable objects can be used as keys.
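
The restriction on keys is easy to verify: a tuple of strings is accepted as a key, while a list with the same contents raises a TypeError. The dictionary `codon_table` is a hypothetical example, not part of the notebook's data.

```python
codon_table = {}
codon_table[('A', 'U', 'G')] = 'Met'   # a tuple of strings works as a key

try:
    codon_table[['A', 'U', 'G']] = 'Met'   # a list is mutable and is rejected
except TypeError as err:
    print('TypeError:', err)

print(codon_table)  # {('A', 'U', 'G'): 'Met'}
```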

A dictionary can also be created from a sequence of (key, value) pairs with **dict**.

In [None]:
rgb = [('red','ff0000'), ('green','00ff00'), ('blue','0000ff')]
colors_d = dict(rgb)
colors_d

**dict** also accepts name=value pairs in the keyword argument list.

In [None]:
colors_d = dict(red='ff0000', green='00ff00', blue='0000ff')
colors_d

Another way to initialize a dictionary is to create an empty dictionary and add elements as needed.

In [None]:
rgb = {}
rgb['green'] = '00ff00'
rgb['red'] = 'ff0000'
rgb

## Dictionary Methods

| Methods | Descriptions |
| --- | --- |
| len(d) | Number of elements of d |
| d\[k\] | The element from d that has a k key |
| d\[k\] = v | Set d\[k\] to v |
| del d\[k\] | Remove d\[k\] from d |
| d.clear() | Remove all items from d |
| d.copy() | Return a shallow copy of d |
| k in d | True if d has a key k, else False |
| k not in d | Equivalent to not k in d |
| d.has_key(k) | Python 2 only (removed in Python 3); use k in d instead |
| d.items() | A view of d’s (key, value) pairs |
| d.keys() | A view of d’s keys |
| d.update(\[b\]) | Updates (and overwrites) key/value pairs from b |
| d.fromkeys(seq\[,value\]) | Creates a new dictionary with keys from seq and values set to value |
| d.values() | A view of d’s values |
| d.get(k\[, x\]) | d\[k\] if k in d, else x |
| d.setdefault(k\[, x\]) | d\[k\] if k in d, else x (also setting it) |
| d.pop(k\[, x\]) | d\[k\] if k in d, else x (and remove k) |
| d.popitem() | Remove and return a (key, value) pair (the most recently inserted one in Python 3.7+) |

In [None]:
iupac = {'A':'Ala','C':'Cys','E':'Glu' , 'X':'Xaa'}

In [None]:
iupac.keys()

In [None]:
iupac.values()

Note that these methods do not return a list; they return a special object called a **dictionary view**. A view shows the current keys or values, so if the dictionary changes, the view changes with it.
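
The difference between a live view and a list snapshot can be seen by adding a key after both have been created. The names `keys_view` and `keys_snapshot` are illustrative; the output order relies on dictionaries keeping insertion order (Python 3.7 and later).

```python
iupac = {'A': 'Ala', 'C': 'Cys'}

keys_view = iupac.keys()            # live view of the keys
keys_snapshot = list(iupac.keys())  # independent list made right now

iupac['E'] = 'Glu'                  # change the dictionary afterwards

print(list(keys_view))  # ['A', 'C', 'E']
print(keys_snapshot)    # ['A', 'C']
```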

Another way of accessing the elements of a dictionary is by using **items()**, which returns a dictionary view with a tuple for every key/value pair.

In [None]:
iupac.items()

To query a value from a dictionary without the risk of invoking an exception, we use get(k,x). 

k is the key of the element to extract, while x is the value that will be returned if k is not found as a key of the dictionary.

In [None]:
iupac.get('A','No translation available')

In [None]:
iupac.get('Z','No translation available')
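
A common use of get() with a default value is counting: the default 0 stands in for keys that have not been seen yet. The sketch below counts the bases of a short DNA string; the dictionary name `counts` is illustrative.

```python
dna_seq = 'GCTAGTAATGTG'

counts = {}
for base in dna_seq:
    # get() returns 0 the first time a base is encountered.
    counts[base] = counts.get(base, 0) + 1

print(counts)  # {'G': 4, 'C': 1, 'T': 4, 'A': 3}
```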

To erase elements from a dictionary, use the **del** instruction.

In [None]:
del iupac['A']
iupac

## Sets

This type of data is also not commonly found in other programming languages. 

A set is a structure frequently found in mathematics. 

It is similar to a list, with two outstanding differences:
1. Its elements do not preserve an implied order.
2. Every element is unique.

The most common uses of sets are membership testing, duplicate removal, and the application of mathematical operations: intersections, unions, differences, and symmetric differences.

### Creating a set

In [None]:
first_set = {'CP0140.1','XJ8113.5','EF3616.3'}
first_set

In [None]:
type(first_set)

We can also create an empty set and then add elements to it.

In [None]:
first_set = set()
first_set.add('CP0140.1')
first_set.add('XJ8113.5')
first_set.add('EF3616.3')
first_set

A set does not accept duplicated elements.

In [None]:
unique_set = {2,2,3,4,5,3}
unique_set
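
The same property makes set() a convenient way to remove duplicates from a list. The list `accessions` below is made-up example data; note that the original order of the list is not preserved by the set, so the result is sorted before printing.

```python
accessions = ['CP0140.1', 'XJ8113.5', 'CP0140.1', 'EF3616.3', 'XJ8113.5']

unique_accessions = set(accessions)

print(len(accessions), len(unique_accessions))  # 5 3
print(sorted(unique_accessions))  # ['CP0140.1', 'EF3616.3', 'XJ8113.5']
```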

## Set operations

### Intersection

Intersection gives the common elements from two sets.

<img src="intersection-in-python.jpg" alt="Drawing" style="width: 400px;"/>

In [None]:
first_set

In [None]:
other_set = {'EF3616.3'}
other_set

In [None]:
common = first_set.intersection(other_set)
common

Intersection is equivalent to &.

In [None]:
common = first_set & other_set
common

### Union

<img src="Union-in-python.jpg" alt="Drawing" style="width: 400px;"/>

In [None]:
other_set = {'AB7416.2'}
first_set.union(other_set)

In [None]:
first_set | other_set 

### Difference

<img src="set-difference.jpg" alt="Drawing" style="width: 400px;"/>

In [None]:
first_set.difference(other_set)

In [None]:
first_set - other_set

In [None]:
other_set - first_set
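
The symmetric difference, mentioned earlier among the common set operations, returns the elements that belong to exactly one of the two sets. In this sketch both sets are redefined, and `other_set` is given two elements purely for illustration.

```python
first_set = {'CP0140.1', 'XJ8113.5', 'EF3616.3'}
other_set = {'EF3616.3', 'AB7416.2'}

# Elements found in one set or the other, but not in both.
print(first_set.symmetric_difference(other_set))

# The ^ operator is equivalent.
print(first_set ^ other_set)
```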

### Immutable Set: Frozenset
Frozenset is the immutable version of set. Its contents cannot be changed, so methods like add() and remove() are not available. 

It is created with the frozenset constructor, which takes an iterable as input.

In [None]:
fs = frozenset(['a', 'b'])
fs

In [None]:
fs.remove('a')

Frozensets can be used as dictionary keys.
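
As a sketch of this use, a frozenset key lets a dictionary look up an unordered pair regardless of the order in which its members are given. The dictionary `pair_scores` and its values are hypothetical.

```python
# Hypothetical scores for unordered base pairs.
pair_scores = {
    frozenset(['A', 'T']): 2,
    frozenset(['G', 'C']): 3,
}

# The order of the elements does not matter for the lookup.
print(pair_scores[frozenset(['T', 'A'])])  # 2
print(pair_scores[frozenset(['C', 'G'])])  # 3
```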

## References

* [Python for Bioinformatics](https://www.routledge.com/Python-for-Bioinformatics/Bassi/p/book/9781138035263)