Basic Python and native data structures - 3
==========

A note about objects
-------

- Everything in Python is an object, which can be seen as an advanced version of a variable
- Objects have attributes and methods
- The built-in **dir** function allows the user to discover them, _e.g._ `dir(VARIABLE)`
- To access them, use the following syntax: VARIABLE.ATTRIBUTE or VARIABLE.METHOD()
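
As a small illustration of the points above, here is a sketch that inspects a string and an integer. The variable names `name` and `number` are only chosen for this example.

```python
# A string is an object like any other value
name = "python"

# dir() returns the list of attribute and method names of an object
print(dir(name))

# VARIABLE.METHOD() calls a method of the object
print(name.upper())

# Numbers are objects too: VARIABLE.ATTRIBUTE reads an attribute
number = 3
print(number.real)
print(number.bit_length())
```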

Data structures
------------


There are quite a few data structures available. The built-in data structures are:
- **lists**
- **tuples**
- **dictionaries**
- **strings**
- **sets** 

Lists, strings and tuples are **ordered sequences** of objects. Unlike strings, which contain only characters, lists and tuples can contain objects of any type. Lists and tuples are similar to arrays in other languages. Tuples, like strings, are **immutable**. Lists are mutable, so they can be extended or reduced at will. Sets are mutable, unordered collections of unique elements.

Lists are enclosed in brackets:

```python
l = [1, 2, "a"]
```

Tuples are enclosed in parentheses:

```python
t = (1, 2, "a")
```

Tuples are generally a little faster to create and consume less memory than lists.
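
The memory claim can be checked with `sys.getsizeof`. This is only a rough sketch: the exact numbers depend on the Python version and platform, and the comparison is assumed to hold for the standard CPython interpreter.

```python
import sys

as_list = [1, 2, "a"]
as_tuple = (1, 2, "a")

# getsizeof reports the size in bytes of the container itself
# (not of the objects it refers to)
print(sys.getsizeof(as_list))
print(sys.getsizeof(as_tuple))

# On CPython the tuple is expected to be the smaller of the two
print(sys.getsizeof(as_tuple) < sys.getsizeof(as_list))
```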

Dictionaries are built with curly brackets:

```python
d = {"a":1, "b":2}
```

Sets are made using the **set** built-in function or with curly braces. The table below summarises the data structures:

|                      | immutable  | mutable     |
|----------------------|------------|-------------|
| ordered sequence     | string     |             |
| ordered sequence     | tuple      |  list       |
| unordered collection |            |  set        |
| hash table           |            |  dict       |


**Indexing** starts at 0, like in C

In R however, the first index is 1

This leads to many off-by-one errors

In [None]:
s1 = "Example"
s1[0]

In [None]:
# last index is therefore the length of the string minus 1
s1[len(s1)-1]

In [None]:
s1[6]

In [None]:
# Negative index can be used to start from the end
s1[-2]

In [None]:
# Careful with indexing out of bounds: this raises an IndexError
s1[100]

Strings and slicing
-----

There are 4 ways to represent strings:
- with single quotes
- with double quotes
- with triple single quotes
- with triple double quotes

In [None]:
"Simple string"

In [None]:
'Simple string'

In [None]:
# double quotes allow a single quote inside the string, and vice versa
"John's book"

In [None]:
#we can also use escaping
'John\'s book'

In [None]:
"""This is an example of 
a long string on several lines"""

**String operations**

In [None]:
s1 = "First string"
s2 = "Second string"
# + operator concatenates strings
s1 + " and " + s2

In [None]:
# Strings are immutable
# Try it: the assignment below raises a TypeError
s1[0] = 'e'

In [None]:
# to change an item, you have to create a new string
'N' + s1[1:]

**Slicing sequence syntax**

Applies to strings as well as lists, tuples and any other **sequence** (_i.e._ an object that supports indexing like a list, tuple, etc.)

<pre>
- [start:end:step]   most general slicing (end is excluded)
- [start:end:]      (step=1)
- [start:end]       (step=1)
- [start:]          (step=1, end=length of the sequence)
- [:]               (start=0, end=length, step=1)
- [::2]             (start=0, end=length, step=2)
</pre>
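
The defaults listed above can be checked on a short string. This is a minimal sketch; the variable `s` is only used for this example. Note that an omitted end means "up to and including the last item", whereas writing `-1` explicitly stops just before the last item.

```python
s = "Banana"

# Omitted bounds default to the start and the end of the sequence
print(s[2:])      # 'nana'
print(s[:2])      # 'Ba'
print(s[:])       # 'Banana'
print(s[::2])     # 'Bnn'

# An explicit end of -1 excludes the last item
print(s[0:-1])    # 'Banan'

# A negative step walks backwards through the sequence
print(s[::-1])    # 'ananaB'

# The same syntax works on lists
print([10, 20, 30, 40][1:3])   # [20, 30]
```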

In [None]:
s1 = 'Banana'
s1[1:6:2]

In [None]:
s = 'TEST'
s[-1:-4:-2]

In [None]:
# slicing: a single ":" character means from the first to the last item
s1 = "First string"
s1[:]

In [None]:
s1[::2]

In [None]:
# indexing
s1[0]

**Other string operations**

In [None]:
print(dir(s1))

Well, that's a lot! Here are the most commonly used ones:
- split
- find
- index
- replace
- lower
- upper
- endswith
- startswith
- strip
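
Here is a short sketch of some of the methods in the list above. The file-like name stored in `s` is made up for the example.

```python
s = "  Banana bread.txt  "

# strip removes the whitespace at both ends
clean = s.strip()
print(clean)                              # 'Banana bread.txt'

# find returns the position of a substring, or -1 if it is absent
print(clean.find("bread"))                # 7
print(clean.find("apple"))                # -1

# index does the same but raises a ValueError if the substring is absent
print(clean.index("bread"))               # 7

# replace returns a new string (the original is unchanged)
print(clean.replace("Banana", "Apple"))   # 'Apple bread.txt'

print(clean.lower())                      # 'banana bread.txt'
print(clean.endswith(".txt"))             # True
```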

In [None]:
# split is very useful when parsing files
s = 'first second third'
s.split()

In [None]:
# a different character can be used as well as separator
s = 'first,second,third'
s.split(',')

In [None]:
# Upper is a very easy and handy method
s.upper()

In [None]:
# Methods can be chained as well!
s.upper().lower().split(',')

In [None]:
s = 'Banana'
s.startswith('Ban')

Lists
-----

The syntax to create a list can be the function **list** or square brackets **[]**

In [None]:
# you can put any kind of object in a list. This is not an array!
l = [1, 'a', 3]
l

In [None]:
# slicing and indexing work as they do for strings
l[0]
l[0::2]

In [None]:
l

In [None]:
# lists are mutable sequences:
l[1] = 2
l

**Mathematical operators can be applied to lists as well**

In [None]:
[1, 2] + [3, 4]

In [None]:
[1, 2] * 10

**Adding elements to a list: append vs. extend**

Lists have several methods, amongst which are **append** and **extend**. The former appends a single object (e.g., another list) to the end of the list, while the latter appends each element of an iterable object (e.g., another list) to the end of the list.

For example, we can append an object (here the character 'c') to the end of a simple list as follows:

In [None]:
stack = ['a','b']
stack.append('c')
stack

In [None]:
stack.append(['d', 'e', 'f'])
stack

In [None]:
stack[3]

The object ``['d', 'e', 'f']`` has been appended to the existing list as a single element. However, sometimes what we want is to append the elements of a given list one by one rather than the list itself. You can do that manually of course, but a better solution is to use the ``extend()`` method as follows:


In [None]:
# the manual way
stack = ['a', 'b', 'c']
stack.append('d')
stack.append('e')
stack.append('f')
stack

In [None]:
# semi-manual way, using a "for" loop
stack = ['a', 'b', 'c']
to_add = ['d', 'e', 'f']
for element in to_add:
    stack.append(element)
stack

In [None]:
# the smarter way
stack = ['a', 'b', 'c']
stack.extend(['d', 'e','f'])
stack

Tuples
----

Tuples are sequences similar to lists but **immutable**. Use parentheses to create a tuple.

In [None]:
t = (1, 2, 3)
t

In [None]:
# simple creation:
t = 1, 2, 3
print(t)
t[0] = 3  # raises a TypeError: tuples are immutable

In [None]:
# Would this work?
(1)

In [None]:
# To force the creation of a one-element tuple, add a comma
(1,)

**Same operators as lists**

In [None]:
(1,) * 5

In [None]:
t1 = (1,0)
t1 += (1,)
t1

**Why tuples instead of lists?**

- generally faster than lists
- protects the data (immutable)
- tuples, being immutable, can be used as keys in dictionaries (more on that later)
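
As a quick sketch of the last point, a tuple of coordinates can serve as a dictionary key whereas a list cannot. The dictionary `distances` and its content are made up for the example, and the `try`/`except` construct is used here only to show the error without stopping the program.

```python
# A tuple used as a dictionary key: (x, y) -> distance from the origin
distances = {(0, 0): 0.0, (3, 4): 5.0}
print(distances[(3, 4)])   # 5.0

# A list cannot be used as a key because it is mutable (unhashable)
try:
    distances[[3, 4]] = 5.0
except TypeError as error:
    print("TypeError:", error)
```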

Sets
----

Sets are constructed from a sequence (or some other iterable object). Since sets cannot have duplicates, they are often used to build collections of unique items (e.g., a set of identifiers).

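The syntax to create a set can be the function **set** or curly braces **{}** around the elements. Note that empty curly braces create an empty dictionary, so use **set()** for an empty set.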
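
One detail worth checking: empty curly braces do not create a set. A minimal sketch, with variable names chosen only for this example:

```python
# {} is an empty dictionary, set() is an empty set
empty_dict = {}
empty_set = set()
print(type(empty_dict))   # <class 'dict'>
print(type(empty_set))    # <class 'set'>

# set() accepts any iterable: a string gives the set of its characters
letters = set("banana")
print(letters == {"a", "b", "n"})   # True
```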


In [None]:
a = {'1', '2', 'a', '4'}
a

In [None]:
# a list preserves duplicates
a = [1, 1, 1, 2, 2, 3, 4]
a

In [None]:
# a set ignores duplicates
a = {1, 2, 1, 2, 2, 3, 4}
a

In [None]:
a = []
to_add = [1, 1, 1, 2, 2, 3, 4]
for element in to_add:
    if element in a:
        continue
    else:
        a.append(element)
a

In [None]:
# Sets have the very handy "add" method
a = set()
to_add = [1, 1, 1, 2, 2, 3, 4]
for element in to_add:
    a.add(element)
a

**Sets have very interesting operators**

What operators do we have?
- | for union
- & for intersection
- < for proper subset (<= for subset)
- \- for difference
- ^ for symmetric difference

In [None]:
a = {'a', 'b', 'c'}
b = {'a', 'b', 'd'}
c = {'a', 'e', 'f'}

In [None]:
# intersection
a & b

In [None]:
# union
a | b

In [None]:
# difference
a - b

In [None]:
# symmetric difference
a ^ b

In [None]:
# is my set a proper subset of the other?
a < b
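
The `<` operator tests for a proper subset, meaning the two sets must also differ. A short sketch of the difference with `<=`, where `small` is a set introduced only for this example:

```python
a = {'a', 'b', 'c'}
small = {'a', 'b'}

# proper subset: every element of small is in a, and the sets differ
print(small < a)           # True

# a set is never a proper subset of itself...
print(a < a)               # False

# ...but it is a subset of itself
print(a <= a)              # True

# issubset is the method equivalent of <=
print(small.issubset(a))   # True
```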

In [None]:
# operators can be chained as well
a & b & c

In [None]:
# the same operations can be performed using the corresponding methods
a.intersection(b).intersection(c)

In [None]:
# a more complex operation
a.intersection(b).difference(c)

Dictionaries
------

- A dictionary is a collection of items.
- Each item is a pair made of a **key** and a **value**. Dictionaries are useful to convert a **key** to its corresponding **value** (_e.g._ gene identifier to its common name)
- Since Python 3.7, dictionaries keep their items in insertion order, but items are accessed by key, not by position.
- You can access the keys or the values independently.

In [None]:
d = {} # an empty dictionary

In [None]:
d = {'first':1, 'second':2} # initialise a dictionary

In [None]:
# access a value given its key:
d['first']

In [None]:
# add a new pair of key/value:
d['third'] = 3

In [None]:
# what are the keys ?
d.keys()

In [None]:
# what are the values ?
d.values()

In [None]:
# what are the key/values pairs?
d.items()

In [None]:
# can be used in a for loop as well
for key, value in d.items():
    print(key, value)

In [None]:
# Delete a key (and its value)
del d['third']
d

In [None]:
# naive for loop approach:
for key in d.keys():
    print(key, d[key])

In [None]:
# no need to call the "keys" method explicitly
for key in d:
    print(key, d[key])

In [None]:
# careful not to look up keys that are NOT in the dictionary: this raises a KeyError
d['fourth']

In [None]:
# the "get" method allows a safe retrieval of a value
d.get('fourth')

In [None]:
# the "get" method returns None if the key is not present
# a different default value can be specified for a missing key
d.get('fourth', 4)
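
Besides `get`, two other common ways to deal with a possibly missing key are the `in` operator and catching the `KeyError`. The sketch below recreates the dictionary `d` so that it runs on its own.

```python
d = {'first': 1, 'second': 2}

# "in" tests whether a key is present
print('first' in d)    # True
print('fourth' in d)   # False

# test before accessing the value
if 'fourth' in d:
    print(d['fourth'])
else:
    print('no such key')

# or catch the error raised by a missing key
try:
    d['fourth']
except KeyError as error:
    print('missing key:', error)
```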

**Note on the "None" type**

In [None]:
n = None
n

In [None]:
print(n)

In [None]:
None + 1

In [None]:
# we can explicitly test for a variable being "None"
value = d.get('fourth')
if value is None:
    print('Key not found!')

---


Exercises
---------

Create two string variables containing a sentence each, then compute the number of words shared between the two sentences.

Write a function `filter_long_words()` that takes a list of words and an integer `n` and returns the list of words that are longer than `n`.

Define a function `generate_n_chars()` that takes an integer `n` and a character `c` and returns a string, n characters long, consisting only of the chosen character. For example, `generate_n_chars(5, "x")` should return the string `"xxxxx"`.

Write a function that takes a list of words and returns a dictionary with the words as keys and their lengths as values.

A pangram is a sentence that contains all the letters of the English alphabet at least once, for example: `The quick brown fox jumps over the lazy dog`. Your task here is to write a function to check a sentence to see if it is a pangram or not.