# **Python For Neuro Week 6**: More Numpy and Pandas

## Warmup

We're going to start today by loading in ```mat_1.npy``` again. If you don't remember how to do this, use Google to find the right function. Please assign the array to a variable called ```arr```.

In [1]:
import numpy as np

In [8]:
# load in data
arr = np.load('ex_array.npy')

## Summary operations

Summary operations allow you to collapse an array according to a certain summary statistic. For instance, we may want to compute the overall mean firing rate in our experimental data:

In [9]:
arr.mean()

np.float64(0.8717270073789575)
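The same pattern applies to other summary statistics. A minimal sketch, using a small made-up array (`toy` is a stand-in with invented values, not the course data):

```python
import numpy as np

# toy stand-in for firing rates: 2 neurons x 3 timepoints (made-up values)
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

overall_mean = toy.mean()   # collapses every axis to a single number
overall_max = toy.max()     # largest entry in the whole array
overall_std = toy.std()     # standard deviation over all entries

print(overall_mean, overall_max, overall_std)
```

Each of these methods accepts the same `axis` argument as `mean`.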

You can also specify the axis along which to average. For instance, maybe we want to average firing rates across individual trials:

In [10]:
arr.shape

(2, 10, 50, 2000)

In [None]:
arr_across_trials = arr.mean(axis=1)
# average over the 10 trials (axis 1); the 2 conditions, 50 neurons, and 2000 timepoints are kept

In [None]:
arr_across_trials.shape
# the 10 trials have been averaged together, so that axis is gone

(2, 50, 2000)

The `keepdims` argument means that you don't remove the dimensions you're averaging over, but rather set their length to 1:

In [13]:
arr_across_trials = arr.mean(axis=1, keepdims=True)

In [14]:
arr_across_trials.shape

(2, 1, 50, 2000)
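Keeping the length-1 axis matters mostly when you want to combine the summary with the original array. A small sketch with made-up numbers (`toy` and `per_neuron` are names chosen for this example) shows the mean being subtracted row by row:

```python
import numpy as np

# toy array: 2 neurons x 3 timepoints (made-up values)
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 6.0, 8.0]])

# mean over time for each neuron; keepdims leaves the shape as (2, 1)
per_neuron = toy.mean(axis=1, keepdims=True)

# (2, 3) minus (2, 1) broadcasts: each row has its own mean subtracted
centered = toy - per_neuron

print(per_neuron.shape)
print(centered)
```

Without `keepdims=True`, the mean would have shape `(2,)`, which does not line up with `(2, 3)` for broadcasting.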

You can average across multiple axes as well. For instance, maybe you want to average across both trials and time:

In [15]:
arr_across_trials_and_time = arr.mean(axis=(1,3))

In [16]:
arr_across_trials_and_time.shape

(2, 50)

### Question 1

- What is the average firing rate across all neurons, times, and trials for each condition?
- (Advanced.) Subtract the average firing rate per time across all neurons, trials, and conditions from the original array.

In [33]:
avg_condition = arr.mean(axis=(1,2,3)) #average firing rate across all neurons, times, and trials for each condition
avg_condition.shape
avg_condition

array([0.98828779, 0.75516623])

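In [None]:
avg_overall = arr.mean(axis=3, keepdims=True)
# keepdims keeps a length-1 axis so numpy knows how to apply the mean to everything else (broadcasting)
# note: axis=3 averages over time; the average per timepoint that the question asks for would use axis=(0, 1, 2)
avg_overall.shape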

(2, 10, 50, 1)

In [41]:
arr - avg_overall

array([[[[-7.00688590e-01,  4.25891874e-02, -7.41298043e-01, ...,
           4.14718632e-01,  1.04691386e+00, -7.77652685e-01],
         [ 1.00568119e+00,  3.06183827e-01,  4.54450893e-01, ...,
          -3.53166873e-01, -5.93903828e-01, -1.82492763e-01],
         [ 4.59796796e-01,  8.21008307e-01,  9.13964819e-01, ...,
          -5.47633280e-01,  7.07070254e-02, -3.19402053e-01],
         ...,
         [-3.41114426e-01, -3.13104230e-01, -2.86115189e-02, ...,
          -3.51396840e-02,  1.67502135e-01,  7.60185947e-01],
         [ 1.13938081e-01, -6.67570042e-01, -2.37529908e-01, ...,
           2.57306423e-01,  7.50760531e-01, -4.41437690e-01],
         [-5.57690268e-01,  7.59463299e-01,  9.22591078e-02, ...,
           4.54657628e-01,  8.23055532e-01, -3.88631203e-01]],

        [[ 4.15208163e-02,  1.63798119e-01, -5.09885184e-01, ...,
           5.46173041e-01,  5.26961558e-01,  4.68263548e-01],
         [-1.61982306e-01,  6.00047083e-02,  3.64808156e-01, ...,
          -2.33516039e

In [None]:
avg_condition.shape #checks the dimensions

(2,)

## Indexing

Indexing in vectors works just as in lists:

In [43]:
vec_1 = np.array([1,2,3])
vec_1

array([1, 2, 3])

In [44]:
vec_1[0]

np.int64(1)

For matrices and higher-dimensional arrays, a single index selects a single row:

In [47]:
mat_1 = np.array(([1,2,3],[4,5,6]))
mat_1

array([[1, 2, 3],
       [4, 5, 6]])

In [48]:
mat_1[0]

array([1, 2, 3])

In [49]:
mat_1[0][1]

np.int64(2)

Instead of using two brackets, you can also separate the row and column index by a comma:

In [50]:
# The following two lines of code are equivalent
print(mat_1[0][0])
print(mat_1[0,0])

1
1


### Slicing

Slicing is a useful way of extracting more than one element. In particular, `j:k` extracts the elements j,...,k-1:

In [51]:
vec = np.arange(10)
print(vec)

[0 1 2 3 4 5 6 7 8 9]


In [52]:
vec[3:7]

array([3, 4, 5, 6])

We can omit either end of the range, and it will default to the beginning or the end of the array, respectively.

In [None]:
vec[:7]
# returns everything before 7 - starting at 0

array([0, 1, 2, 3, 4, 5, 6])

In [None]:
vec[3:]
# returns everything after 3

array([3, 4, 5, 6, 7, 8, 9])

In [None]:
vec[:] # What do you think this will do?
# returns everything

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

You can therefore also use the colon to select all rows of a matrix and specific columns.

In [56]:
mat_1

array([[1, 2, 3],
       [4, 5, 6]])

In [None]:
mat_1[:,0]
# returns everything in the first column (all rows, column 0)

array([1, 4])
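The same colon notation extends to arrays with more axes. As a rough sketch, assuming an array with the same axis order as `arr` (conditions, trials, neurons, timepoints) but smaller and filled with zeros:

```python
import numpy as np

# hypothetical stand-in with the axis order (conditions, trials, neurons, timepoints)
toy = np.zeros((2, 4, 5, 100))

first_condition = toy[0]            # a single index drops the first axis
early_window = toy[:, :, :, :20]    # first 20 timepoints, everything else kept
one_neuron = toy[:, :, 2, :]        # neuron with index 2, all other axes kept

print(first_condition.shape, early_window.shape, one_neuron.shape)
```

A plain integer index removes its axis, while a slice keeps the axis with a possibly shorter length.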

You can add another colon to specify a step size, similarly to how you would use these three arguments in `range`.

In [None]:
vec[3:7:2]
# takes every other element - moving up in steps of two

array([3, 5])

We can still omit the beginning or the end of the slice:

In [59]:
vec[::2] 
# like slicing from index 0 to the end
# in steps of 2

array([0, 2, 4, 6, 8])

### Question 2
Predict the output of the following commands:

In [None]:
vec[:4]
# returns elements from 0 to 3

array([0, 1, 2, 3])

In [None]:
vec[5:9:2]
# returns elements from index 5 up to (not including) 9, in steps of 2

array([5, 7])

In [62]:
vec[:7:2]
#start 0 until 7 (not inclusive), intervals of 2

array([0, 2, 4, 6])

In [63]:
vec[2::2]
# starts at 2 and returns elements until the end, in steps of 2

array([2, 4, 6, 8])

### Boolean indexing

Do you remember how to create an array that is true if and only if `vec` is less than or equal to 5?

In [64]:
vec = np.arange(10)
vec

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [65]:
selector = vec <= 5
selector

array([ True,  True,  True,  True,  True,  True, False, False, False,
       False])

You can use these boolean arrays to subset the corresponding true values.

In [66]:
vec

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [None]:
vec[selector]
# only returns elements less than or equal to 5
# a way to return specific elements from an array

array([0, 1, 2, 3, 4, 5])

In [68]:
vec[vec<=5]

array([0, 1, 2, 3, 4, 5])

You can do the same with matrices:

In [69]:
mat_1 = np.array([[1, 2],
       [3, 4],
       [5, 6]])

In [70]:
mat_1 >= 3

array([[False, False],
       [ True,  True],
       [ True,  True]])

In [None]:
mat_1[mat_1 >= 3]
# lost structure in the matrix, but there is a numpy arg that will return the organized r

array([3, 4, 5, 6])
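If the shape of the matrix matters, one option is `np.where`, which keeps every position and fills in a replacement where the condition fails. This is only a sketch; the fill value `0` is an arbitrary choice made for the example:

```python
import numpy as np

mat = np.array([[1, 2],
                [3, 4],
                [5, 6]])

# keep the 3x2 shape; entries that fail the condition are replaced by 0
kept = np.where(mat >= 3, mat, 0)

print(kept)
```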

### Question 3
- Consider the example matrix from above and subset all entries with values between 2 and 4. You can try to do this in one line or do it through multiple lines!

In [None]:
mat_1[(mat_1>= 2) & (mat_1< 4)]
# you need parentheses around each condition when you combine multiple conditions
# & means that both need to be true - could also use | (this means or)

array([2, 3])
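The three element-wise logical operators can be compared side by side. A short sketch on a fresh `vec` (the variable names `both`, `either`, and `negated` are just labels for this example):

```python
import numpy as np

vec = np.arange(10)

both = vec[(vec >= 2) & (vec < 4)]     # and: both conditions hold
either = vec[(vec < 2) | (vec > 7)]    # or: at least one condition holds
negated = vec[~(vec < 8)]              # not: flips True and False

print(both, either, negated)
```

The parentheses are needed because `&`, `|`, and `~` bind more tightly than comparisons such as `>=`.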

# Pandas
## Python's package for handling data
### Motivation for pandas
Dictionaries allow us to save multiple attributes of a particular object. For example, we can store some information about a lesson:

In [86]:
lesson_5 = {
    'topic': 'Numpy',
    'teacher': 'Sharon',
    'week': 5
}

Often, we collect multiple observations for which we record the same attributes and we'd like to store them together:

In [87]:
lesson_3 = {
    'topic': 'Basics of Python 2',
    'teacher': 'Sharon',
    'week': 3
}
lesson_1 = {
    'topic': 'Setting up Python',
    'teacher': 'Abhi',
    'week': 1
}

We could go about this by storing them in a list:

In [88]:
lst_lessons = [lesson_5, lesson_3, lesson_1]

In [89]:
lst_lessons

[{'topic': 'Numpy', 'teacher': 'Sharon', 'week': 5},
 {'topic': 'Basics of Python 2', 'teacher': 'Sharon', 'week': 3},
 {'topic': 'Setting up Python', 'teacher': 'Abhi', 'week': 1}]

However, such lists lack a lot of functionality. For example, we may want to print out only those observations where Sharon was the teacher. We'd have to use a for loop (here written as a list comprehension) for this:

In [90]:
sharons_lessons = [
    lesson for lesson in lst_lessons if lesson['teacher'] == 'Sharon'
]
sharons_lessons

[{'topic': 'Numpy', 'teacher': 'Sharon', 'week': 5},
 {'topic': 'Basics of Python 2', 'teacher': 'Sharon', 'week': 3}]

We therefore need a new data structure that can record multiple pieces of information about multiple observations. This is provided by `pandas` (which stands for *panel data*):

In [91]:
#We normally import pandas like this
import pandas as pd

The core object in pandas is a *data frame*, which consists of observations organized along its rows and different pieces of information about its observations organized along its columns.

In [92]:
df_lessons = pd.DataFrame(lst_lessons)
df_lessons

|   | topic | teacher | week |
|---|---|---|---|
| 0 | Numpy | Sharon | 5 |
| 1 | Basics of Python 2 | Sharon | 3 |
| 2 | Setting up Python | Abhi | 1 |


### Finding out basic information

In [93]:
df_lessons.shape

(3, 3)

In [94]:
df_lessons.columns

Index(['topic', 'teacher', 'week'], dtype='object')

### Indexing

Regular brackets return a specific column or a subset of columns:

In [95]:
df_lessons['teacher']

0    Sharon
1    Sharon
2      Abhi
Name: teacher, dtype: object

(*Note:* The object that is returned is called a `pd.Series` and has a few additional features compared to a one-dimensional numpy array. I personally don't use those additional features and think they are counter-productive, but you can look them up if you have to interact with them.)

You can operate on those columns in the same way you would operate on numpy arrays:

In [96]:
df_lessons['teacher'] == 'Sharon'

0     True
1     True
2    False
Name: teacher, dtype: bool

In [97]:
df_lessons[['topic', 'teacher']]

|   | topic | teacher |
|---|---|---|
| 0 | Numpy | Sharon |
| 1 | Basics of Python 2 | Sharon |
| 2 | Setting up Python | Abhi |


This is helpful because you can look up data by column name rather than by position, which is a large part of what makes data frames so useful.

`.loc` allows you to index data frames by row label (here, the default integer index) and column name:

In [98]:
df_lessons.loc[1, 'teacher']

'Sharon'

This also works with slicing:

In [99]:
df_lessons.loc[1:, 'teacher']

1    Sharon
2      Abhi
Name: teacher, dtype: object

In [None]:
df_lessons.loc[:, ['topic', 'teacher']]
# this is returning every row for topic and teacher but not week number

|   | topic | teacher |
|---|---|---|
| 0 | Numpy | Sharon |
| 1 | Basics of Python 2 | Sharon |
| 2 | Setting up Python | Abhi |


`.iloc` works in the same way, but allows you to access columns according to their numerical index rather than their name:

In [None]:
df_lessons.iloc[1, 1]
# iloc treats the data frame like a matrix
# returns the entry in the second row, second column

'Sharon'
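One difference worth knowing when slicing: `.loc` slices by label and includes the end label, while `.iloc` slices by position and stops before the end, like a list. A minimal sketch on a rebuilt copy of the lessons table:

```python
import pandas as pd

df = pd.DataFrame({
    'topic': ['Numpy', 'Basics of Python 2', 'Setting up Python'],
    'teacher': ['Sharon', 'Sharon', 'Abhi'],
})

by_label = df.loc[0:1, 'teacher']    # label slice: includes row 1
by_position = df.iloc[0:1, 1]        # position slice: stops before row 1

print(len(by_label), len(by_position))
```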

Finally, you can also do boolean indexing with square brackets.

In [None]:
selector = df_lessons['teacher'] == 'Sharon'
df_lessons[selector]
# boolean selectors are helpful for picking rows based on a True/False condition
# e.g. looking for all trials where a value is above a threshold

|   | topic | teacher | week |
|---|---|---|---|
| 0 | Numpy | Sharon | 5 |
| 1 | Basics of Python 2 | Sharon | 3 |


(Note that the single `=` assigns the result of the expression on its right to the variable on its left. The double `==`, on the other hand, compares the values in `df_lessons['teacher']` with `'Sharon'` and determines whether they are equal.)

In [None]:
df_lessons[df_lessons['teacher']=='Sharon']
# find rows where teachers are equal to sharon and returns those rows

|   | topic | teacher | week |
|---|---|---|---|
| 0 | Numpy | Sharon | 5 |
| 1 | Basics of Python 2 | Sharon | 3 |
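Conditions on several columns can be combined with `&` and `|`, just as with numpy arrays. A small sketch on a rebuilt copy of the table (`late_sharon` is a name made up for this example):

```python
import pandas as pd

df = pd.DataFrame({
    'topic': ['Numpy', 'Basics of Python 2', 'Setting up Python'],
    'teacher': ['Sharon', 'Sharon', 'Abhi'],
    'week': [5, 3, 1],
})

# rows taught by Sharon AND later than week 3
late_sharon = df[(df['teacher'] == 'Sharon') & (df['week'] > 3)]

print(late_sharon)
```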


Finally, you can add new columns in the same way you would add a new key, value pair to a dictionary:

In [104]:
df_lessons

|   | topic | teacher | week |
|---|---|---|---|
| 0 | Numpy | Sharon | 5 |
| 1 | Basics of Python 2 | Sharon | 3 |
| 2 | Setting up Python | Abhi | 1 |


In [None]:
df_lessons['homework'] = [True, True, False]
# adding a new column
# easy way to add to your data!!

In [None]:
df_lessons

|   | topic | teacher | week | homework |
|---|---|---|---|---|
| 0 | Numpy | Sharon | 5 | True |
| 1 | Basics of Python 2 | Sharon | 3 | True |
| 2 | Setting up Python | Abhi | 1 | False |
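A new column does not have to be typed out by hand; it can also be computed from existing columns. A brief sketch, where `after_week_2` is a hypothetical column name chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    'topic': ['Numpy', 'Basics of Python 2', 'Setting up Python'],
    'week': [5, 3, 1],
})

# 'after_week_2' is a made-up column: True for lessons later than week 2
df['after_week_2'] = df['week'] > 2

print(df)
```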


### Exercises
1. Create a data frame that additionally includes this week (week 6) with the appropriate topic (pandas) and teacher (Sam).
2. Print out the topic for the second row.
3. Subset the data frame to only print out the lessons for week 3 and higher.
4. Create a new data frame that also includes week 7's lesson with teacher Sam. However, you don't know the topic yet. How does `pandas` represent this information? (Hint: Create a dictionary that only contains the keys `week` and `teacher`, but not `topic`. Try adding it to the list we used above and turning it into a dataframe.)
5. You could have alternately also represented this information as a two-dimensional array with observations structured along rows and variables structured along columns. What would the difference be and why might this be a bad idea in this case? Discuss with the other students at your table.

In [None]:
#1
df_lessons.loc[3] = ['Pandas', 'Abhi', 6, False]
df_lessons
# can also use .loc to add a row - assigning to a new row label adds it
# this modifies the original data frame
# week and homework are given as an int and a bool (not strings) so the column types stay consistent

|   | topic | teacher | week | homework |
|---|---|---|---|---|
| 0 | Numpy | Sharon | 5 | True |
| 1 | Basics of Python 2 | Sharon | 3 | True |
| 2 | Setting up Python | Abhi | 1 | False |
| 3 | Pandas | Abhi | 6 | False |


In [None]:
#1 another way - make a dictionary,
# append it to lst_lessons,
# and turn the list into a data frame
# (could also just write out one big list)
wk6_dict = {
    'topic': 'Pandas',
    'teacher': 'Abhi',
    'week': 6
}

In [123]:
lst_lessons.append(wk6_dict)

In [None]:
df = pd.DataFrame(lst_lessons)
df
# this is creating a new data frame

|   | topic | teacher | week |
|---|---|---|---|
| 0 | Numpy | Sharon | 5 |
| 1 | Basics of Python 2 | Sharon | 3 |
| 2 | Setting up Python | Abhi | 1 |
| 3 | Pandas | Abhi | 6 |


In [None]:
#2
df.loc[1, 'topic']
df['topic'][1]
df.iloc[1,0]
# there are many ways to do this! choose what works best for your data

'Basics of Python 2'

In [None]:
#3
df[df['week']>= 3]
# this has not changed the original data frame - it has just selected the appropriate rows

|   | topic | teacher | week |
|---|---|---|---|
| 0 | Numpy | Sharon | 5 |
| 1 | Basics of Python 2 | Sharon | 3 |
| 3 | Pandas | Abhi | 6 |


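In [137]:
#4
wk7_dict = {
    'teacher': 'Sam',
    'week': 7
}
lst_lessons.append(wk7_dict)
df2 = pd.DataFrame(lst_lessons)
# the missing topic is represented as NaN
# note: every run of this cell appends wk7_dict to lst_lessons again, which is why the week 7 row repeats in the output

df2

|   | topic | teacher | week |
|---|---|---|---|
| 0 | Numpy | Sharon | 5 |
| 1 | Basics of Python 2 | Sharon | 3 |
| 2 | Setting up Python | Abhi | 1 |
| 3 | Pandas | Abhi | 6 |
| 4 | NaN | Sam | 7 |
| 5 | NaN | Sam | 7 |
| 6 | NaN | Sam | 7 |
| 7 | NaN | Sam | 7 |
| 8 | NaN | Sam | 7 |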
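Missing entries like this can be detected with `isna()`. A minimal sketch on a two-row version of the list (the names `lessons`, `missing`, and `known` are chosen for this example):

```python
import pandas as pd

lessons = [
    {'topic': 'Numpy', 'teacher': 'Sharon', 'week': 5},
    {'teacher': 'Sam', 'week': 7},   # no 'topic' key
]
df = pd.DataFrame(lessons)

missing = df['topic'].isna()   # True where no topic was given
known = df[~missing]           # only the rows that have a topic

print(missing)
print(known)
```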


### Saving and loading a data frame
You can save data frames in different formats. A popular format is CSV (comma-separated values), which stores each observation on its own row, with the variables separated by commas.

In [138]:
df_lessons.to_csv('df_lessons.csv')

Let's inspect this file.

We'll be using csv files today. Note that they are not always ideal. For example, they do not store the type of each value, which can lead to issues. The HDF5 format is a popular alternative (but a little more complicated to use); the Feather format is lightweight and more reliable, but a little less common.

In [None]:
df_lessons_loaded = pd.read_csv('df_lessons.csv')

In [None]:
df_lessons_loaded
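By default, `to_csv` also writes the row index, and `read_csv` then reads it back as an extra column named `Unnamed: 0`. Passing `index=False` avoids this. The sketch below uses in-memory text buffers instead of files on disk, purely so that it runs without creating any files:

```python
import io
import pandas as pd

df = pd.DataFrame({'topic': ['Numpy', 'Pandas'], 'week': [5, 6]})

# in-memory text buffers stand in for csv files in this sketch
with_index = io.StringIO()
df.to_csv(with_index)                    # default: the row index is written too
without_index = io.StringIO()
df.to_csv(without_index, index=False)    # leave the row index out

with_index.seek(0)
without_index.seek(0)
loaded_a = pd.read_csv(with_index)
loaded_b = pd.read_csv(without_index)

print(list(loaded_a.columns))
print(list(loaded_b.columns))
```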

### Exercises
1. Read in the file `dot_motion.csv` using pandas and assign it to the variable `df_dm`.
2. Try exploring the file and describe the data contained in it.
3. Subset the data frame to only contain the observations with a reaction time above 100.
4. Create a new variable 'accuracy' that is 1 if the motion and the choice are matching and 0 otherwise.

#### Hint for 4:
If the motion and choice are matching, their entries should be equal. Create an array `accuracy` that contains as a boolean whether they are or are not matching. You can turn this boolean array (with True and False values) into a float array (which will assign 1 to True and 0 to False), using `accuracy.astype(float)`.
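The idea in the hint can be tried on made-up data first. In this sketch the column names `motion` and `choice` and all values are assumptions; check `df_dm.columns` for the actual names in `dot_motion.csv`:

```python
import pandas as pd

# made-up data; the column names 'motion' and 'choice' are assumptions
toy = pd.DataFrame({
    'motion': ['left', 'right', 'left'],
    'choice': ['left', 'left', 'left'],
})

matches = toy['motion'] == toy['choice']   # boolean: True where they agree
toy['accuracy'] = matches.astype(float)    # True -> 1.0, False -> 0.0

print(toy)
```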
