# Python Advanced Data Types

## Python collections module

### Namedtuple
The `collections` module provides `namedtuple`, a tuple type that lets you assign a name to each item. Note that, like any tuple, a namedtuple is immutable (its values cannot be changed after creation).

In [1]:
from collections import namedtuple

In [2]:
Address = namedtuple('Addr', ['street', 'city', 'state', 'zip'])

In [5]:
my_address = Address('23/03', 'Lansing', 'MI', '48823')

In [6]:
my_address

Addr(street='23/03', city='Lansing', state='MI', zip='48823')

In [8]:
for m in my_address: # you can iterate through all items as usual
    print(m)

23/03
Lansing
MI
48823


In [13]:
my_address[3] # or use an index to get a value

'48823'

In [14]:
my_address.zip # with a namedtuple you can also refer to a value by its field name

'48823'

In [16]:
my_address.zip = '48839' # of course, you cannot set a new value

AttributeError: can't set attribute
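
Since fields cannot be reassigned, one common way to "change" a value is to build a new tuple with `_replace()`. The following is a minimal sketch that reuses the `Address` type from above; the variable name `moved` is only for illustration.

```python
from collections import namedtuple

Address = namedtuple('Addr', ['street', 'city', 'state', 'zip'])
my_address = Address('23/03', 'Lansing', 'MI', '48823')

# _replace() returns a new tuple; the original is left untouched
moved = my_address._replace(zip='48839')

print(moved.zip)       # 48839
print(my_address.zip)  # 48823
```

The original tuple stays valid, so any list or dictionary that already holds `my_address` is unaffected.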

In [17]:
addr_list = [] # create an empty list
addr_list.append(Address('101', 'Saginaw', 'MI', '48825')) # add new address tuple to the list
addr_list.append(Address('304', 'Flint', 'MI', '48825'))
addr_list.append(Address('14/3', 'Ann Arbor', 'MI', '48843'))
addr_list.append(Address('34', 'Troy', 'MI', '48833'))
addr_list.append(my_address) # add my_address to the list

In [19]:
for addr in addr_list:
    print(addr.city, addr.state)

Saginaw MI
Flint MI
Ann Arbor MI
Troy MI
Lansing MI


In [21]:
for addr in sorted(addr_list, key=lambda x: x.city): # sorted() returns a new list and leaves the original untouched,
    print(addr.city, addr.state) # so it works well with immutable tuples
                                 # the key parameter specifies the value to sort by (here, the city field)

Ann Arbor MI
Flint MI
Lansing MI
Saginaw MI
Troy MI
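
As an alternative to a `lambda`, the standard library's `operator.attrgetter` can build the sort key, and it accepts several field names for multi-level sorting. This is a small sketch with a shortened address list; the name `by_zip_city` is made up for the example.

```python
from collections import namedtuple
from operator import attrgetter

Address = namedtuple('Addr', ['street', 'city', 'state', 'zip'])
addr_list = [
    Address('101', 'Saginaw', 'MI', '48825'),
    Address('304', 'Flint', 'MI', '48825'),
    Address('14/3', 'Ann Arbor', 'MI', '48843'),
]

# sort by zip first, then by city to break ties
by_zip_city = sorted(addr_list, key=attrgetter('zip', 'city'))
cities = [addr.city for addr in by_zip_city]
print(cities)  # ['Flint', 'Saginaw', 'Ann Arbor']
```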


### Counter

In [23]:
from collections import Counter
cnt = Counter() # create a counter

In [24]:
abstract = '''
As a chronic disorder, insomnia affects approximately 10% of the population at some time during their lives, and its treatment is often challenging. Since the antagonists of the H₁ receptor, a protein prevalent in human central nervous system, have been proven as effective therapeutic agents for treating insomnia, the H₁ receptor is quite possibly a promising target for developing potent anti-insomnia drugs. For the purpose of understanding the structural actors affecting the antagonism potency, presently a theoretical research of molecular interactions between 129 molecules and the H₁ receptor is performed through three-dimensional quantitative structure-activity relationship (3D-QSAR) techniques. The ligand-based comparative molecular similarity indices analysis (CoMSIA) model (Q² = 0.525, R²ncv = 0.891, R²pred = 0.807) has good quality for predicting the bioactivities of new chemicals. The cross-validated result suggests that the developed models have excellent internal and external predictability and consistency. The obtained contour maps were appraised for affinity trends for the investigated compounds, which provides significantly useful information in the rational drug design of novel anti-insomnia agents. Molecular docking was also performed to investigate the mode of interaction between the ligand and the active site of the receptor. Furthermore, as a supplementary tool to study the docking conformation of the antagonists in the H₁ receptor binding pocket, molecular dynamics simulation was also applied, providing insights into the changes in the structure. All of the models and the derived information would, we hope, be of help for developing novel potent histamine H₁ receptor antagonists, as well as exploring the H₁-antihistamines interaction mechanism.
'''

In [27]:
words = abstract.lower().split()

In [28]:
words[:20]

['as',
 'a',
 'chronic',
 'disorder,',
 'insomnia',
 'affects',
 'approximately',
 '10%',
 'of',
 'the',
 'population',
 'at',
 'some',
 'time',
 'during',
 'their',
 'lives,',
 'and',
 'its',
 'treatment']

In [29]:
for word in words: # cnt is a dictionary of a word and its count
    cnt[word] += 1

In [32]:
cnt.most_common(10) # most_common() returns a list of tuples, each containing a word and its number of occurrences

[('the', 27),
 ('of', 11),
 ('for', 7),
 ('and', 6),
 ('h₁', 5),
 ('as', 5),
 ('a', 5),
 ('receptor', 4),
 ('molecular', 4),
 ('in', 4)]

In [35]:
Counter(words).most_common(10) # you can also get the same results this way
                                # note that English stop words occur frequently in the text
                                # you may want to remove them to get more meaningful results

[('the', 27),
 ('of', 11),
 ('for', 7),
 ('and', 6),
 ('h₁', 5),
 ('as', 5),
 ('a', 5),
 ('receptor', 4),
 ('molecular', 4),
 ('in', 4)]
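
One possible way to drop stop words before counting is sketched below. The `stop_words` set is a deliberately tiny, hypothetical list and the sample sentence is invented for the example; a real analysis would likely use a fuller list.

```python
from collections import Counter
import string

text = 'The receptor binds the ligand, and the ligand activates the receptor.'

# hypothetical, deliberately tiny stop-word list for illustration
stop_words = {'the', 'and', 'of', 'for', 'as', 'a', 'in'}

# strip surrounding punctuation so 'ligand,' and 'ligand' count as one word
tokens = [w.strip(string.punctuation) for w in text.lower().split()]
filtered = [w for w in tokens if w and w not in stop_words]

counts = Counter(filtered)
print(counts.most_common(2))
```

Stripping punctuation also addresses tokens such as `'disorder,'` and `'lives,'` seen in the word list above, which would otherwise be counted separately from their bare forms.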

In [36]:
Counter('gallahad') # Counter accepts any iterable as its argument

Counter({'a': 3, 'd': 1, 'g': 1, 'h': 1, 'l': 2})

In [38]:
Counter([1,1,2,2,2,3,4,4,5,3,2,4,5,7,7])

Counter({1: 2, 2: 4, 3: 2, 4: 3, 5: 2, 7: 2})

### Ordered dictionary

In [39]:
from collections import OrderedDict

In [40]:
odict = OrderedDict() # create an ordered dict

In [55]:
odict['John'] = 23000
odict['Gene'] = 34000
odict['Thomas'] = 40000
odict['Michael'] = 25000
odict['Sam'] = 34000

In [52]:
odict

OrderedDict([('John', 23000),
             ('Gene', 34000),
             ('Thomas', 40000),
             ('Michael', 25000),
             ('Sam', 34000)])

In [47]:
for key, value in odict.items():
    print('{} earned {}'.format(key, value))

John earned 23000
Gene earned 34000
Thomas earned 40000
Michael earned 25000
Sam earned 34000


In [48]:
for key, value in reversed(odict.items()):
    print('{} earned {}'.format(key, value))

Sam earned 34000
Michael earned 25000
Thomas earned 40000
Gene earned 34000
John earned 23000


In [56]:
odict.popitem()

('Sam', 34000)

In [57]:
odict

OrderedDict([('John', 23000),
             ('Gene', 34000),
             ('Thomas', 40000),
             ('Michael', 25000)])

In [58]:
odict.pop('Gene')

34000

In [59]:
odict

OrderedDict([('John', 23000), ('Thomas', 40000), ('Michael', 25000)])

In [60]:
odict['Gene'] = 34000

In [61]:
odict

OrderedDict([('John', 23000),
             ('Thomas', 40000),
             ('Michael', 25000),
             ('Gene', 34000)])
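
Beyond preserving insertion order, `OrderedDict` can also reorder existing keys. The sketch below assumes Python 3, where `move_to_end()` is available, and starts from the dictionary state shown above.

```python
from collections import OrderedDict

odict = OrderedDict([('John', 23000), ('Thomas', 40000),
                     ('Michael', 25000), ('Gene', 34000)])

odict.move_to_end('John')              # move an existing key to the end
odict.move_to_end('Gene', last=False)  # or to the front
order = list(odict)
print(order)   # ['Gene', 'Thomas', 'Michael', 'John']

first = odict.popitem(last=False)      # pop from the front instead of the end
print(first)   # ('Gene', 34000)
```

A plain `dict` also preserves insertion order since Python 3.7, but it does not offer `move_to_end()` or `popitem(last=False)`.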

### Default dictionary

In [62]:
from collections import defaultdict

In [63]:
dd = defaultdict(int) # set default value to 0

In [65]:
dd['x']

0

In [66]:
dd

defaultdict(int, {'x': 0})

In [72]:
for c in 'aaggtggcccaaccggtt':
    dd[c] += 1

In [73]:
dd

defaultdict(int,
            {'a': 4, 'c': 5, 'g': 6, 't': 3, 'x': 0})

In [69]:
ddl = defaultdict(list) # set default value to empty list

In [70]:
for c in 'aaggtggcccaaccggtt':
    ddl[c].append(c)

In [71]:
ddl

defaultdict(list,
            {'a': ['a', 'a', 'a', 'a'],
             'c': ['c', 'c', 'c', 'c', 'c'],
             'g': ['g', 'g', 'g', 'g', 'g', 'g'],
             't': ['t', 't', 't']})
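
A typical use of `defaultdict(list)` is grouping items by a key. This sketch groups the cities from the address examples by zip code; the names `pairs` and `cities_by_zip` are made up for the example.

```python
from collections import defaultdict

# (city, zip) pairs taken from the address examples earlier in this section
pairs = [('Saginaw', '48825'), ('Flint', '48825'),
         ('Ann Arbor', '48843'), ('Troy', '48833')]

cities_by_zip = defaultdict(list)
for city, zip_code in pairs:
    cities_by_zip[zip_code].append(city)  # no need to check whether the key exists

print(cities_by_zip['48825'])  # ['Saginaw', 'Flint']
print(len(cities_by_zip))      # 3
```

Keep in mind that merely reading a missing key (as with `dd['x']` above) inserts it with the default value.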

### Array

In [75]:
import numpy as np

In [76]:
data = np.array([[1,2,3], [4,5,6], [7,8,9]])

In [78]:
data

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [83]:
data.ndim, data.shape, data.dtype, data.size, data.nbytes
# supported data types are int, uint, bool, float, complex

(2, (3, 3), dtype('int64'), 9, 72)

In [84]:
data = np.array([[1,2,3], [4,5,6], [7,8,9]], dtype=float) # use the dtype argument to specify the data type (the np.float alias was removed in recent NumPy releases)

In [86]:
data, data.dtype

(array([[ 1.,  2.,  3.],
        [ 4.,  5.,  6.],
        [ 7.,  8.,  9.]]), dtype('float64'))

In [87]:
intdata = np.array(data, dtype=int) # typecasting
intdata.dtype

dtype('int64')

In [89]:
intdata = data.astype(int) # typecasting using astype()
intdata

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])
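
When the floats are not whole numbers, casting to `int` discards the fractional part rather than rounding. A short sketch of the difference, with made-up sample values:

```python
import numpy as np

values = np.array([1.7, -1.7, 2.5, 0.9])

truncated = values.astype(int)         # drops the fractional part (toward zero)
rounded = np.rint(values).astype(int)  # round to the nearest integer first, then cast

print(truncated.tolist())  # [1, -1, 2, 0]
print(rounded.tolist())    # [2, -2, 2, 1]
```

Note that `np.rint` rounds halves to the nearest even integer, which is why 2.5 becomes 2.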

In [94]:
zeros = np.zeros((3,4))
zeros

array([[ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.]])

In [95]:
ones = np.ones((3,4))
ones

array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

In [108]:
arange = np.arange(20)
arange

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [111]:
linspace = np.linspace(0, 20, 40)
linspace

array([  0.        ,   0.51282051,   1.02564103,   1.53846154,
         2.05128205,   2.56410256,   3.07692308,   3.58974359,
         4.1025641 ,   4.61538462,   5.12820513,   5.64102564,
         6.15384615,   6.66666667,   7.17948718,   7.69230769,
         8.20512821,   8.71794872,   9.23076923,   9.74358974,
        10.25641026,  10.76923077,  11.28205128,  11.79487179,
        12.30769231,  12.82051282,  13.33333333,  13.84615385,
        14.35897436,  14.87179487,  15.38461538,  15.8974359 ,
        16.41025641,  16.92307692,  17.43589744,  17.94871795,
        18.46153846,  18.97435897,  19.48717949,  20.        ])
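
The two functions differ in what you specify: `arange` takes a step and excludes the stop value, while `linspace` takes the number of points and includes the stop value by default. That is why the 40 points above are spaced 20/39 apart. A small sketch with rounder numbers:

```python
import numpy as np

# arange: fixed step, stop value excluded
a = np.arange(0, 20, 5)
print(a.tolist())    # [0, 5, 10, 15]

# linspace: fixed number of points, stop value included by default
pts, step = np.linspace(0, 20, 5, retstep=True)
print(pts.tolist())  # [0.0, 5.0, 10.0, 15.0, 20.0]
print(step)          # 5.0
```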

In [118]:
a = np.empty(5) # create an uninitialized array; its contents are arbitrary (they happen to be zeros here)
a

array([ 0.,  0.,  0.,  0.,  0.])

In [123]:
a.fill(5.4)
a

array([ 5.4,  5.4,  5.4,  5.4,  5.4])

In [126]:
a = np.ones(5) # another way to achieve the same result
a = a * 5.4

In [127]:
a

array([ 5.4,  5.4,  5.4,  5.4,  5.4])
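
A third option, sketched here, is `np.full`, which creates the array and fills it in one call:

```python
import numpy as np

a = np.full(5, 5.4)     # five elements, each set to 5.4
b = np.full((2, 3), 7)  # also works for multi-dimensional shapes

print(a.tolist())  # [5.4, 5.4, 5.4, 5.4, 5.4]
print(b.tolist())  # [[7, 7, 7], [7, 7, 7]]
```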

### Slicing

In [128]:
# indexing is pretty much like a list
f = lambda m, n: n + 10 * m
A = np.fromfunction(f, (6, 6), dtype=int)
A

array([[ 0,  1,  2,  3,  4,  5],
       [10, 11, 12, 13, 14, 15],
       [20, 21, 22, 23, 24, 25],
       [30, 31, 32, 33, 34, 35],
       [40, 41, 42, 43, 44, 45],
       [50, 51, 52, 53, 54, 55]])

In [129]:
A[:,1] # second column

array([ 1, 11, 21, 31, 41, 51])

In [130]:
A[1,:] # second row

array([10, 11, 12, 13, 14, 15])

In [132]:
A[3:,3:] # create a submatrix

array([[33, 34, 35],
       [43, 44, 45],
       [53, 54, 55]])

In [133]:
A[:3,:3] # create a submatrix

array([[ 0,  1,  2],
       [10, 11, 12],
       [20, 21, 22]])

In [135]:
B = A[1:5, 1:5]
B

array([[11, 12, 13, 14],
       [21, 22, 23, 24],
       [31, 32, 33, 34],
       [41, 42, 43, 44]])

In [137]:
B.fill(0.) # unlike a list slice (which is a copy), a sliced array is a view of the original array
            # thus, changing values in the view also changes the original array

In [138]:
A

array([[ 0,  1,  2,  3,  4,  5],
       [10,  0,  0,  0,  0, 15],
       [20,  0,  0,  0,  0, 25],
       [30,  0,  0,  0,  0, 35],
       [40,  0,  0,  0,  0, 45],
       [50, 51, 52, 53, 54, 55]])

In [139]:
C = B.copy() # copy() will create a new array, instead of a view
C.fill(1.) # so changing data in C will not affect B
C

array([[1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1],
       [1, 1, 1, 1]])

In [140]:
B # data in B remain the same

array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
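
If it is unclear whether two arrays share data, `np.shares_memory` can check. The sketch below also contrasts this with a list slice, which is always an independent copy; the 6x6 array here is built with `arange` only to keep the example short.

```python
import numpy as np

A = np.arange(36).reshape(6, 6)
B = A[1:5, 1:5]  # basic slicing gives a view
C = B.copy()     # copy() gives independent data

print(np.shares_memory(A, B))  # True
print(np.shares_memory(A, C))  # False

# a list slice, by contrast, is a new list
lst = [1, 2, 3, 4]
part = lst[1:3]
part[0] = 99
print(lst)  # [1, 2, 3, 4]
```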

### Arithmetic operation

In [146]:
from __future__ import division

In [141]:
x = np.array([[1,2], [3,4]])
y = np.array([[5,6], [7,8]])

In [142]:
x + y

array([[ 6,  8],
       [10, 12]])

In [143]:
x * y

array([[ 5, 12],
       [21, 32]])

In [144]:
y - x

array([[4, 4],
       [4, 4]])

In [150]:
y / x # true division (the default in Python 3) gives floating-point output

array([[ 5.        ,  3.        ],
       [ 2.33333333,  2.        ]])

In [151]:
(y / x).dtype

dtype('float64')

In [152]:
x * 2

array([[2, 4],
       [6, 8]])

In [153]:
x ** 2

array([[ 1,  4],
       [ 9, 16]])

In [154]:
y / 2

array([[ 2.5,  3. ],
       [ 3.5,  4. ]])

In [155]:
z = np.array([1,2,3,4])

In [158]:
x + z # adding arrays whose shapes cannot be broadcast together will fail

ValueError: operands could not be broadcast together with shapes (2,2) (4,) 

In [160]:
x + z.reshape(2,2) # reshape() returns the same data with a new shape, here matching x

array([[2, 4],
       [6, 8]])
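
Shapes do not have to be identical, only compatible: NumPy compares dimensions from the last one backwards, and each pair must be equal or contain a 1. A brief sketch with a row and a column added to `x`; the names `row` and `col` are made up for the example.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
row = np.array([10, 20])      # shape (2,): added to every row of x
col = np.array([[10], [20]])  # shape (2, 1): added to every column of x

print((x + row).tolist())  # [[11, 22], [13, 24]]
print((x + col).tolist())  # [[11, 12], [23, 24]]
```

The earlier failure follows from the same rule: the last dimensions of `(2, 2)` and `(4,)` are 2 and 4, which are neither equal nor 1.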