We often need to handle multidimensional data, such as:

- fMRI: Voxels x time
- EEG: Electrodes x time
- Questionnaires: Subjects x responses

Lists are not inherently multidimensional, but we can embed lists within lists, or tuples within lists.

**Reminder:** Tuples are *immutable* (unchangeable), unlike lists.
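
As a quick illustration of the reminder above, here is a minimal sketch (the variable names are made up for illustration). An element of a list can be reassigned, while attempting the same on a tuple raises a `TypeError`.

```python
point_list = [2, 3]      # a list: mutable
point_tuple = (2, 3)     # a tuple: immutable

point_list[0] = 99       # works, the list is modified in place
print(point_list)

try:
    point_tuple[0] = 99  # item assignment is not allowed for tuples
except TypeError as err:
    tuple_error = str(err)
    print("TypeError:", tuple_error)
```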

### Exercise

In [4]:
coords = [(2, 3), (5, 5), (10, 1)]

print(coords)
print(type(coords))
print(type(coords[0]))

[(2, 3), (5, 5), (10, 1)]
<class 'list'>
<class 'tuple'>


**Dictionaries** use assigned keys instead of a numerical index (of location) to retrieve the associated value.

In [6]:
nbaTeams = {
    'LosAngeles':'Lakers',
    'Toronto':'Raptors',
    'Chicago':'Bulls'
}

nbaTeams['Chicago']

'Bulls'

In [7]:
# an alternative way of defining a dictionary
fruits = dict([("red", "apple"), ("green", "pear"), ("yellow", "lemon"), ("blue", "blueberry")])

fruits.get("blue")

'blueberry'
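
The two cells above retrieve values with square brackets and with `.get()`. The sketch below rebuilds the same dictionary under a new name (`nba_teams`, chosen here for illustration) to show how the two behave when a key is missing, and how to loop over key/value pairs.

```python
nba_teams = {"LosAngeles": "Lakers", "Toronto": "Raptors", "Chicago": "Bulls"}

# .get() returns None (or a default you supply) when the key is absent
missing = nba_teams.get("Boston")
fallback = nba_teams.get("Boston", "unknown")
print(missing, fallback)

# square brackets raise a KeyError for an absent key
try:
    nba_teams["Boston"]
except KeyError as err:
    print("KeyError:", err)

# loop over key/value pairs
for city, team in nba_teams.items():
    print(city, "->", team)
```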

## NumPy

In [26]:
import random
import numpy as np

arr = np.array(range(1,6))
print(arr)

print()

arr_2D = np.array([random.sample(range(1,10), 3), random.sample(range(10,20), 3), random.sample(range(20,30), 3)])
print(arr_2D)

arr_2D.ndim

[1 2 3 4 5]

[[ 2  6  1]
 [17 11 15]
 [25 22 24]]


2

In [39]:
for i in range(arr_2D.shape[0]): # one iteration per row (ndim + 1 only matched the row count by coincidence)
    print(arr_2D[i])

# to index a single element of the 2D array
# maximum number of indices that can be specified: array.ndim
# only possible with numpy arrays
arr_2D[0, 0]
print()

for x in arr_2D: # for each element of arr_2D
    for y in x: # for each element in x, which is in turn an element of arr_2D
        print(y) # print current element on its own line

[2 6 1]
[17 11 15]
[25 22 24]

2
6
1
17
11
15
25
22
24


In [46]:
# playing around with lists
list_2D = [random.sample(range(1,10), 3), random.sample(range(10,20), 3), random.sample(range(20,30), 3)]
print(list_2D)
print(list_2D[0][0], list_2D[0][1], list_2D[0][2])

[[6, 3, 5], [12, 16, 14], [23, 20, 24]]
6 3 5
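
To make the difference between nested lists and NumPy arrays concrete, here is a small sketch that reuses the values printed above (the names `list_2d` and `arr_2d` are illustrative). It shows that a comma-separated index only works on arrays, and that selecting a column needs a loop (or comprehension) with plain lists.

```python
import numpy as np

list_2d = [[6, 3, 5], [12, 16, 14], [23, 20, 24]]
arr_2d = np.array(list_2d)

# nested lists need one pair of brackets per level
first_from_list = list_2d[0][0]

# NumPy arrays also accept a comma-separated index
first_from_arr = arr_2d[0, 0]

# a comma-separated index on a plain list raises a TypeError
try:
    list_2d[0, 0]
except TypeError as err:
    print("TypeError:", err)

# selecting the second column: a slice with NumPy, a comprehension with lists
col_from_arr = arr_2d[:, 1]
col_from_list = [row[1] for row in list_2d]
print(col_from_arr, col_from_list)
```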


**Masking**: Finding specific values in your data is one of the most common things you will need to do.

In [54]:
cleanArr = arr[arr > 2]
cleanArr

array([3, 4, 5])

In [57]:
arr2 = np.array(range(10))
highArr2 = arr2[arr2 > 4]
highArr2

array([5, 6, 7, 8, 9])
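
It can help to look at the mask itself. In the sketch below (variable names are illustrative), the comparison produces a boolean array, and indexing with that array keeps only the elements where it is `True`.

```python
import numpy as np

arr = np.array(range(1, 6))

mask = arr > 2                 # boolean array, one True/False per element
print(mask)
print(arr[mask])               # keeps only the elements where mask is True

# combine conditions with & (and) or | (or); the parentheses are required
between = arr[(arr > 1) & (arr < 5)]
print(between)

# summing a mask counts how many elements satisfy the condition
n_high = int(mask.sum())
print(n_high)
```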

**An alternative to masking:**

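``np.where(condition, [x, y, ]/)``
returns elements chosen from ``x`` where the condition is true and from ``y`` elsewhere; when called with the condition alone, it returns the indices at which the condition is true.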
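
A minimal sketch of both ways of calling `np.where`, using a small array like the one in the masking cells (the names `clipped` and `idx` are illustrative):

```python
import numpy as np

arr2 = np.array(range(10))

# three-argument form: take from arr2 where the condition holds, else use 0
clipped = np.where(arr2 > 4, arr2, 0)
print(clipped)

# one-argument form: a tuple of index arrays, one per dimension
idx = np.where(arr2 > 4)
print(idx[0])
```

Unlike masking, the three-argument form keeps the shape of the original array, which is handy when values should be replaced rather than dropped.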

In [58]:
data = np.loadtxt(fname='inflammation-01.csv', delimiter=',')

In [59]:
print(type(data))
print(data.shape)

<class 'numpy.ndarray'>
(60, 40)


In [66]:
# indexing a multi-dimensional nparray
print(data[0:10])
print(data[0:10, 0:10])
print(data[0,:])
print(data[0:1, 0:1])
# data[0:1, 0:1] holds the same value as data[0, 0], but as a 1x1 array rather than a scalar

[[ 0.  0.  1.  3.  1.  2.  4.  7.  8.  3.  3.  3. 10.  5.  7.  4.  7.  7.
  12. 18.  6. 13. 11. 11.  7.  7.  4.  6.  8.  8.  4.  4.  5.  7.  3.  4.
   2.  3.  0.  0.]
 [ 0.  1.  2.  1.  2.  1.  3.  2.  2.  6. 10. 11.  5.  9.  4.  4.  7. 16.
   8.  6. 18.  4. 12.  5. 12.  7. 11.  5. 11.  3.  3.  5.  4.  4.  5.  5.
   1.  1.  0.  1.]
 [ 0.  1.  1.  3.  3.  2.  6.  2.  5.  9.  5.  7.  4.  5.  4. 15.  5. 11.
   9. 10. 19. 14. 12. 17.  7. 12. 11.  7.  4.  2. 10.  5.  4.  2.  2.  3.
   2.  2.  1.  1.]
 [ 0.  0.  2.  0.  4.  2.  2.  1.  6.  7. 10.  7.  9. 13.  8.  8. 15. 10.
  10.  7. 17.  4.  4.  7.  6. 15.  6.  4.  9. 11.  3.  5.  6.  3.  3.  4.
   2.  3.  2.  1.]
 [ 0.  1.  1.  3.  3.  1.  3.  5.  2.  4.  4.  7.  6.  5.  3. 10.  8. 10.
   6. 17.  9. 14.  9.  7. 13.  9. 12.  6.  7.  7.  9.  6.  3.  2.  2.  4.
   2.  0.  1.  1.]
 [ 0.  0.  1.  2.  2.  4.  2.  1.  6.  4.  7.  6.  6.  9.  9. 15.  4. 16.
  18. 12. 12.  5. 18.  9.  5.  3. 10.  3. 12.  7.  8.  4.  7.  3.  5.  4.
   4.  3.  2.  1.

In [67]:
# finding mean on nparrays
np.mean(data[0,:])
# data[0, :] selects the same row as data[0]

np.float64(5.45)

### Exercise 

Obtain the maximum inflammation value for patient 1, and the mean value for day 10

Obtain values for each patient/day using the axis option

In [75]:
patient = 1
day = 10

print("Max inflammation value of patient %i: %f" %(patient, np.amax(data[0])))
print("Mean value for day %i: %f" %(day, np.mean(data[:,9])))
# Oh!!! This is so powerful!!!

Max inflammation value of patient 1: 18.000000
Mean value for day 10: 5.516667
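
For the second part of the exercise, the `axis` argument computes a statistic along one dimension at a time. The sketch below uses a small made-up array (`demo`) as a stand-in for the inflammation data, assuming the same layout of patients in rows and days in columns.

```python
import numpy as np

# stand-in for the inflammation data: 3 patients (rows) x 4 days (columns)
demo = np.array([[0., 1., 2., 3.],
                 [0., 2., 4., 1.],
                 [1., 1., 5., 2.]])

max_per_patient = demo.max(axis=1)   # collapse the columns: one value per row
mean_per_day = demo.mean(axis=0)     # collapse the rows: one value per column

print(max_per_patient)
print(mean_per_day)
```

With the real array, the same calls would be `data.max(axis=1)` and `data.mean(axis=0)`.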


In [73]:
# numpy also has a random module and associated functions
# note: this import shadows the standard-library `random` module imported earlier
from numpy import random

randData = random.randint(5, 10, size = [10, 50])
randData

random.choice(data[0])

np.float64(4.0)

Note for assignment:

Look into distribution functions (e.g. those in `numpy.random`) to simulate behavioural data!
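
One possible starting point for that note is sketched below. Everything about the simulated data is an assumption made for illustration: the log-normal shape for reaction times, the parameter values, the 85% accuracy rate, and the variable names.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the draws are reproducible

n_subjects, n_trials = 10, 50

# hypothetical reaction times in seconds: log-normal, so values are always positive
sim_rts = rng.lognormal(mean=-0.7, sigma=0.3, size=(n_subjects, n_trials))

# hypothetical accuracy: 1 = correct, 0 = error, with an assumed 85% hit rate
sim_correct = rng.binomial(n=1, p=0.85, size=(n_subjects, n_trials))

print(sim_rts.shape)
print(sim_rts.mean(axis=1).round(3))   # mean RT per simulated subject
print(sim_correct.mean())              # overall proportion correct
```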

## Pandas

Pandas data structures are better suited to handling **mixed data types**!
- **Series:** A pandas Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, python objects, etc.). 
- **DataFrame:** The primary data structure in pandas, the DataFrame is a two-dimensional, mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns).
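
The two definitions above can be sketched with a tiny made-up table (all values, labels, and column names here are invented for illustration). Each column of a DataFrame is itself a Series with its own dtype.

```python
import pandas as pd

# a Series: one labelled column of values
ages = pd.Series([23, 31, 27], index=["s1", "s2", "s3"], name="age")

# a DataFrame: several columns, each with its own dtype
df = pd.DataFrame(
    {
        "age": [23, 31, 27],        # integers
        "sex": ["m", "f", "f"],     # strings
        "rt": [0.41, 0.38, 0.52],   # floats
    },
    index=["s1", "s2", "s3"],
)

print(ages["s2"])       # look up a value by its label
print(df.dtypes)        # one dtype per column
print(type(df["rt"]))   # a single column is a Series
```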

In [159]:
import pandas as pd

pd.Series(data=['4 cups','1 cup','2 large','1 can'])

0     4 cups
1      1 cup
2    2 large
3      1 can
dtype: object

In [87]:
s = pd.Series(data = [1,'2',3,4,'5',6,7,8,'99','1000'])
print(s)
s.astype('int')

0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8      99
9    1000
dtype: object


0       1
1       2
2       3
3       4
4       5
5       6
6       7
7       8
8      99
9    1000
dtype: int64

In [88]:
dataNew = pd.Series([1,2,pd.NA,4,5])
dataNew.dropna(inplace=True)
dataNew

0    1
1    2
3    4
4    5
dtype: object

In [92]:
dataNew.astype('int').apply(np.sqrt)
# dataNew

0    1.000000
1    1.414214
3    2.000000
4    2.236068
dtype: float64

In [93]:
# You can also write short anonymous functions with the lambda keyword and pass them to apply

dataNew.apply(lambda x: x + 1)

0    2
1    3
3    5
4    6
dtype: int64

In [123]:
data = pd.read_csv("RTdata.csv", index_col='subjs')
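data.columns
# note: data.head without parentheses returns the method object itself,
# which is why the output below starts with "<bound method ...";
# call data.head() to display only the first five rows
data.head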

<bound method NDFrame.head of        runcode sex       race       RTs         K
subjs                                            
1        23887   m      asian  0.268098  0.254095
2        23888   f  caucasian  0.810172  1.020760
3        23889   m  caucasian  0.625572  2.882098
4        23890   f      asian  0.892729  1.024061
5        23891   m  caucasian  0.495700  2.093723
6        23892   f  caucasian  0.117297  1.012419
7        23893   f  caucasian  0.964358  2.904390
8        23894   m  caucasian  0.131785  1.922144
9        23895   f      asian  0.529800  2.359120
10       23896   f      asian  0.917709  2.088641
11       23897   m      asian  0.245590  1.693146
12       23898   f      asian  0.815274  1.072074
13       23899   m  caucasian  0.350466  3.813035
14       23900   m  caucasian  0.000590  2.791595
15       23901   f      asian  0.960922  0.064458
16       23902   m      asian  0.111476  4.186431
17       23903   f      asian  0.307025  3.112951
18       23904   m  

In [133]:
# indexing by label with .loc (use .iloc for positional, i.e. numerical, indexing)
data.loc[:, 'sex']

subjs
1     m
2     f
3     m
4     f
5     m
6     f
7     f
8     m
9     f
10    f
11    m
12    f
13    m
14    m
15    f
16    m
17    f
18    m
Name: sex, dtype: object
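
To contrast label-based and position-based indexing, here is a minimal sketch on a small stand-in table (`demo`, with made-up values modelled loosely on the columns above). `.loc` takes index and column labels, while `.iloc` takes integer positions.

```python
import pandas as pd

# small stand-in table with a labelled index
demo = pd.DataFrame(
    {"sex": ["m", "f", "m"], "RTs": [0.27, 0.81, 0.63]},
    index=pd.Index([1, 2, 3], name="subjs"),
)

by_label = demo.loc[:, "sex"]      # column chosen by its name
by_position = demo.iloc[:, 0]      # same column chosen by its position

row_label = demo.loc[2, "RTs"]     # index label 2 (the second row here)
row_position = demo.iloc[2, 1]     # position 2 (the third row), second column

print(by_label.equals(by_position))
print(row_label, row_position)
```

Because the index labels start at 1 while positions start at 0, `demo.loc[2]` and `demo.iloc[2]` point at different rows.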

In [172]:
# Find RTs for only males
data[data['sex'] == 'm']['RTs']

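# alternatively, the mean RT per sex:
# data.groupby('sex')['RTs'].mean()
# overall mean RT (not displayed, because it is not the last expression in the cell)
data['RTs'].mean()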

# Find RTs for Caucasian males
data[(data['sex']=='m') & (data['race']=='caucasian')]['RTs']

subjs
3     0.625572
5     0.495700
8     0.131785
13    0.350466
14    0.000590
18    0.641817
Name: RTs, dtype: float64
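
The same kind of filtering can be written in a single `.loc` call, which avoids chaining two sets of square brackets. This is a sketch on a made-up four-row table (`demo`), not the course data.

```python
import pandas as pd

demo = pd.DataFrame(
    {
        "sex": ["m", "f", "m", "f"],
        "race": ["asian", "caucasian", "caucasian", "asian"],
        "RTs": [0.2, 0.8, 0.6, 0.9],
    }
)

# .loc with a boolean mask and a column name, in a single step
male_rts = demo.loc[demo["sex"] == "m", "RTs"]

# two conditions: wrap each in parentheses and join them with &
cauc_male_rts = demo.loc[(demo["sex"] == "m") & (demo["race"] == "caucasian"), "RTs"]

# mean RT per group; selecting the column first avoids averaging text columns
mean_by_sex = demo.groupby("sex")["RTs"].mean()

print(male_rts.tolist(), cauc_male_rts.tolist())
print(mean_by_sex)
```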

Pandas has a number of methods for dealing with textual (string) data under the **str** accessor.

e.g. import the titanic data set, and convert the names of all passengers to lowercase using the lower method:

``titanic["Name"].str.lower()``

In [156]:
titanic = pd.read_csv("titanic.csv")

In [153]:
titanic.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [154]:
titanic['Name']

0                                Braund, Mr. Owen Harris
1      Cumings, Mrs. John Bradley (Florence Briggs Th...
2                                  Heikkinen, Miss Laina
3           Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                               Allen, Mr. William Henry
                             ...                        
886                                Montvila, Rev. Juozas
887                          Graham, Miss Margaret Edith
888              Johnston, Miss Catherine Helen "Carrie"
889                                Behr, Mr. Karl Howell
890                                  Dooley, Mr. Patrick
Name: Name, Length: 891, dtype: object

In [158]:
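# everything after the last comma: title plus given names (strip removes the leading space)
titanic['FirstName'] = titanic['Name'].str.split(',').str.get(-1).str.strip()
# everything before the first comma: the family name
titanic['LastName'] = titanic['Name'].str.split(',').str.get(0)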

titanic['FirstName']

0                                  Mr. Owen Harris
1       Mrs. John Bradley (Florence Briggs Thayer)
2                                       Miss Laina
3               Mrs. Jacques Heath (Lily May Peel)
4                                Mr. William Henry
                          ...                     
886                                    Rev. Juozas
887                            Miss Margaret Edith
888                  Miss Catherine Helen "Carrie"
889                                Mr. Karl Howell
890                                    Mr. Patrick
Name: FirstName, Length: 891, dtype: object
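
The `FirstName` column above still contains the title. One way to separate the pieces further is sketched below on three names copied from the output above. It assumes the title is always the first whitespace-delimited word after the comma, which may not hold for every passenger; the column names `Title` and `GivenNames` are made up for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss Laina",
    "Allen, Mr. William Henry",
])

lower = names.str.lower()

# split once on ", " and spread the two parts over two columns (0 and 1)
split = names.str.split(", ", n=1, expand=True)

# named groups become column names: first word = title, remainder = given names
extracted = split[1].str.extract(r"^(?P<Title>\S+)\s+(?P<GivenNames>.*)$")

tidy = pd.DataFrame({
    "LastName": split[0],
    "Title": extracted["Title"].str.rstrip("."),   # drop a trailing period, if any
    "GivenNames": extracted["GivenNames"],
})

print(lower.tolist())
print(tidy)
```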