In [1]:
%pip install pandas seaborn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import seaborn as sns



# Pandas and DataFrames

Often, we have tables of data: collections of named columns arranged in rows.  The **Pandas** package provides a **DataFrame** class that lets us look up these columns by name, the same way we look up keys in a dict, while keeping the benefits of NumPy arrays, so we can still write vectorized code.
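
As a minimal sketch of that idea, the small table below (made-up values, not taken from the experiment) shows dict-style column access and a vectorized calculation on a whole column. The name `toy` is just an illustrative placeholder.

```python
import pandas as pd

# Made-up data for illustration only; not from MentalRotation.csv
toy = pd.DataFrame({
    "Subject": [1, 1, 2, 2],
    "Time": [1500, 2500, 3000, 1000],  # response times in milliseconds
})

# Columns are looked up by name, like the keys of a dict
times = toy["Time"]

# Arithmetic applies to the whole column at once, with no explicit loop
times_in_seconds = times / 1000

print(times_in_seconds.tolist())
```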

Let's start playing with the analysis now.  We'll examine Pandas in more depth in the coming days.

## Today's Dataset: Mental Rotation Psychology Experiment

![Mental Rotation Task Example](http://mercercognitivepsychology.pbworks.com/f/1353970952/mental-rotation-image.gif)

## Loading the Data

Load the file "MentalRotation.csv" from the URL below with `pd.read_csv()` and use it to answer the following questions about the results of the Mental Rotation psychology experiment. If you reach the end of the exercises, explore the dataset and DataFrames further and see what else you can find out about this experiment!

In [3]:
url = "https://raw.githubusercontent.com/nickdelgrosso/CodeTeachingMaterials/main/datasets/MentalRotation.csv"

In [8]:
df = pd.read_csv(url)

## Examining the Dataset

| With Slicing | With Method | With Function |
| :-- | :-- | :-- |
| `df[:5]` | `df.head()` |   |
| `df[-5:]` | `df.tail()` |  |
|  | `df.sample(5)` |   | 
|  | `df.info()` |   |
|  | `df.describe()` |   |
|  | `df.shape[0]` | `len(df)` |
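
Entries in the same row of the table give the same result. A quick way to convince yourself is to compare them on a small made-up frame (the values here are illustrative only):

```python
import pandas as pd

# Made-up frame with 8 rows, for illustration only
toy = pd.DataFrame({"Trial": range(1, 9), "Angle": [0, 50, 100, 150] * 2})

first_rows = toy[:5]    # same rows as toy.head()
last_rows = toy[-5:]    # same rows as toy.tail()
n_rows = len(toy)       # same number as toy.shape[0]

print(first_rows.equals(toy.head()), last_rows.equals(toy.tail()), n_rows)
```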

Print the first 5 lines of the dataset:

In [9]:
df[:5]

Unnamed: 0,Subject,Trial,Angle,Matching,Response,Time,Correct,Age,Sex
0,49,1,0,0,n,3107,1,32,M
1,49,2,150,0,n,2930,1,32,M
2,49,3,150,1,b,1874,1,32,M
3,49,4,100,1,b,3793,1,32,M
4,49,5,50,1,b,2184,1,32,M


Look at the last 5 lines of the dataset

In [14]:
df[-5:]

Unnamed: 0,Subject,Trial,Angle,Matching,Response,Time,Correct,Age,Sex
5066,33,92,150,1,b,2095,1,20,F
5067,33,93,150,0,n,2125,1,20,F
5068,33,94,50,0,n,1226,1,20,F
5069,33,95,100,1,b,2783,1,20,F
5070,33,96,0,0,n,1017,1,20,F


Check 3 random lines in the dataset.

In [21]:
df.sample(3)

Unnamed: 0,Subject,Trial,Angle,Matching,Response,Time,Correct,Age,Sex
2567,39,29,0,1,b,2325,1,18,M
1795,25,4,100,1,n,1738,0,28,M
1893,30,9,150,0,n,6920,1,24,F


How many total trials (rows) are in the study?

In [24]:
len(df)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5071 entries, 0 to 5070
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Subject   5071 non-null   int64 
 1   Trial     5071 non-null   int64 
 2   Angle     5071 non-null   int64 
 3   Matching  5071 non-null   int64 
 4   Response  5071 non-null   object
 5   Time      5071 non-null   int64 
 6   Correct   5071 non-null   int64 
 7   Age       5071 non-null   int64 
 8   Sex       5071 non-null   object
dtypes: int64(7), object(2)
memory usage: 356.7+ KB


## Calculating Values on Columns

| Method | Example |
| :-- | :-- |
| `.max()` | `df['Height'].max()` |
| `.min()` | `df['Weight'].min()` |
| `.mean()` | `df['Time'].mean()` |
| `.median()` | `df['Speed'].median()` |
| `.value_counts()` | `df['Kind'].value_counts()` |
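
Here is a small sketch of these methods on made-up values (not the experiment's data). Note that the mean of a column coded as 1/0 is the proportion of 1s, which is useful for accuracy rates.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Time": [1200, 1800, 3000, 2000],   # milliseconds
    "Correct": [1, 0, 1, 1],            # 1 = correct, 0 = incorrect
    "Angle": [0, 50, 50, 150],
})

slowest = toy["Time"].max()               # largest value in the column
typical = toy["Time"].median()            # middle value of the column
accuracy = toy["Correct"].mean()          # mean of 1/0 values = proportion of 1s
per_angle = toy["Angle"].value_counts()   # number of rows for each distinct value

print(slowest, typical, accuracy, per_angle.loc[50])
```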


What is the maximum number of trials that one subject performed?

In [29]:
df['Subject'].value_counts()

Subject
49    96
42    96
28    96
51    96
36    96
23    96
14    96
8     96
6     96
7     96
37    96
1     96
39    96
20    96
26    95
15    95
38    95
32    95
34    95
27    95
16    95
5     95
33    95
41    95
52    95
4     95
9     95
17    95
35    95
50    94
48    94
11    94
24    94
25    94
43    94
12    94
3     94
21    94
46    93
40    93
54    93
2     93
31    92
22    92
47    92
45    92
30    91
10    91
44    91
53    91
29    90
18    89
13    88
19    85
Name: count, dtype: int64

What was the median reaction time across all subjects?

In [31]:
df['Time'].median()

np.float64(2402.0)

What was the average accuracy rate (i.e. proportion of correct trials) across all subjects?

In [35]:
# Correct is coded 1/0, so its mean is the proportion of correct trials
df['Correct'].mean()

How many trials were shown at each Angle?

In [36]:
df['Angle'].value_counts()

Angle
50     1280
0      1274
100    1261
150    1256
Name: count, dtype: int64

How many trials were answered correctly and incorrectly, for each angle? (hint: `df[['A', 'B']]`)

In [37]:
df[['Angle','Correct']].value_counts()

Angle  Correct
0      1          1216
50     1          1198
100    1          1121
150    1          1052
       0           204
100    0           140
50     0            82
0      0            58
Name: count, dtype: int64

### Making New Columns

| Syntax | 
| :-- |
| `df['NewCol'] = df['OldCol'] * 10` |
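
A minimal sketch of the same pattern on made-up values: assigning to a column name that does not exist yet creates the column, and the right-hand side is computed for every row at once.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({"Time": [1500, 3200], "Correct": [1, 0]})

# Milliseconds to seconds: divide by 1000
toy["TimeSecs"] = toy["Time"] / 1000

# 1/0 to True/False
toy["IsCorrect"] = toy["Correct"].astype(bool)

print(toy)
```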

Make a "TimeSecs" column by converting the Time column to seconds by dividing it by 1000.

In [38]:
df['TimeSecs'] = df['Time'] / 1000

Make an "IsCorrect" column by converting the "Correct" column to *bool* (True/False) values

In [50]:
df['IsCorrect'] = df['Correct'].astype(bool)

### Logical Indexing

| Syntax |
| :-- |
| `df[df['Time'] > 3]` |
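
Logical indexing works in two steps: a comparison produces one True/False value per row, and indexing with that result keeps only the True rows. The sketch below uses made-up values and also shows how to combine two conditions.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "Subject": [1, 1, 2, 2],
    "Angle": [50, 150, 50, 150],
    "TimeSecs": [1.2, 3.4, 2.8, 4.1],
})

# A comparison on a column gives one True/False value per row
is_slow = toy["TimeSecs"] > 3

# Indexing with that mask keeps only the rows marked True
slow_trials = toy[is_slow]

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
subject_2_at_50 = toy[(toy["Subject"] == 2) & (toy["Angle"] == 50)]

print(len(slow_trials), subject_2_at_50["TimeSecs"].mean())
```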

Example: How many trials used an angle of 150?

In [51]:
len(df[df['Angle'] == 150])

1256

How many trials had response times longer than 3 seconds?

In [52]:
# TimeSecs is in seconds, so the threshold is 3 (the same as Time > 3000 ms)
len(df[df['TimeSecs'] > 3])

What was the accuracy of subject 9?

In [66]:
# Proportion of subject 9's trials that were answered correctly
df[df['Subject'] == 9]['IsCorrect'].mean()

What was the average response time of subject 32?

In [67]:
df[df['Subject'] == 32]['TimeSecs'].mean().round(3)

np.float64(2.608)

What was the average response time for subject 12 on trials with an Angle of 50? (Hint: `(A) & (B)`)

In [79]:
# Select the matching rows first, then average their response times
df[(df['Subject'] == 12) & (df['Angle'] == 50)]['TimeSecs'].mean()

Was there an overall difference in response accuracy between matching and non-matching trials?

Is there a response time difference between matching and non-matching trials?
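
One possible approach with logical indexing, sketched on made-up trials (the real answer needs the full dataset): split the rows into two subsets and compare the column means.

```python
import pandas as pd

# Made-up trials for illustration only
toy = pd.DataFrame({
    "Matching": [1, 1, 0, 0],
    "Correct": [1, 0, 1, 1],
    "Time": [2000, 3000, 1500, 2500],   # milliseconds
})

matching = toy[toy["Matching"] == 1]
nonmatching = toy[toy["Matching"] == 0]

# Mean of a 1/0 column is the proportion of correct trials
accuracy_difference = matching["Correct"].mean() - nonmatching["Correct"].mean()

# Difference in mean response time, in milliseconds
time_difference = matching["Time"].mean() - nonmatching["Time"].mean()

print(accuracy_difference, time_difference)
```

Splitting by hand like this becomes tedious with more than two categories, which is where grouping helps.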

## Group By

| Syntax |
| :-- |
| `df.groupby('Age').Time.mean()` |

Example: What was the response accuracy for matching and non-matching trials?

In [None]:
df.groupby('Matching').IsCorrect.mean()

Matching
0    0.909163
1    0.899961
Name: IsCorrect, dtype: float64

Example: What was the response accuracy for each Angle?

In [None]:
df.groupby('Angle', as_index=False).IsCorrect.mean()

Unnamed: 0,Angle,IsCorrect
0,0,0.954474
1,50,0.935937
2,100,0.888977
3,150,0.83758


What was the response accuracy for each Angle and Matching/Nonmatching value?

What was the average response time for each Angle and Matching/Nonmatching value?

What was the average response time for each Angle and Matching/Nonmatching value, for each subject?
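
To group by more than one column, pass a list of column names to `groupby`. The sketch below uses made-up trials and bracket syntax (`["Time"]`), which selects the same column as the attribute syntax (`.Time`) used in the table above.

```python
import pandas as pd

# Made-up trials for illustration only
toy = pd.DataFrame({
    "Subject": [1, 1, 1, 1, 2, 2, 2, 2],
    "Angle": [0, 0, 150, 150, 0, 0, 150, 150],
    "Matching": [0, 1, 0, 1, 0, 1, 0, 1],
    "Time": [1000, 1200, 2000, 2400, 1400, 1600, 3000, 3200],
    "IsCorrect": [True, True, False, True, True, True, True, False],
})

# One group per (Angle, Matching) combination
accuracy = toy.groupby(["Angle", "Matching"])["IsCorrect"].mean()
mean_time = toy.groupby(["Angle", "Matching"])["Time"].mean()

# Adding Subject to the list gives one group per subject and condition
mean_time_per_subject = toy.groupby(["Subject", "Angle", "Matching"])["Time"].mean()

print(mean_time)
```

The result is indexed by the grouping columns, so a single value can be looked up with a tuple, for example `mean_time.loc[(150, 1)]`.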

### Plotting with Pandas

| Syntax |
| :-- |
| `df['Column'].plot(kind='hist')` |
| `df['Column'].plot.hist()` |
| `df.hist('Column', by='Group')` |
| `df.plot(x='Age', y='Height', kind='scatter')` |
| `df.plot.scatter(x='Age', y='Height')` |
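
A short sketch of two of these calls on made-up trials. The `matplotlib.use("Agg")` line is an assumption that lets the example run outside a notebook without a display; inside Jupyter it can be left out.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen; not needed inside a notebook
import pandas as pd

# Made-up trials for illustration only
toy = pd.DataFrame({
    "Angle": [0, 50, 100, 150, 0, 50, 100, 150],
    "Time": [1100, 1600, 2300, 2900, 1300, 1500, 2600, 3100],
})

# Histogram of a single column; the call returns a matplotlib Axes object
hist_ax = toy["Time"].plot.hist(bins=4)

# Scatter plot of one column against another
scatter_ax = toy.plot.scatter(x="Angle", y="Time")
```

Both calls return the Axes they drew on, so labels and titles can be adjusted afterwards, for example with `scatter_ax.set_title(...)`.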

Plot the response time distribution as a histogram.

Plot the average response time for each stimulus category (matching and non-matching)

Is there a correlation between Angle of mental rotation and response time?  Visualize the relationship using a scatter plot

Is there a relationship between subject age and average response time?   Visualize the relationship using a box plot

Did participants get faster or slower as they did more trials? Visualize the relationship using a scatter plot

Plot the response time distribution, with a separate subplot for each subject.

## Plotting with Seaborn

| Syntax |
| :-- |
| `sns.catplot(data=df, x='Col1', y='Col2', hue='Col3', kind='bar')` |
| `sns.lineplot(data=df, x='Col1', y='Col2', hue='Col3')` |
| `sns.lmplot(data=df, x='Col1', y='Col2', hue='Col3')` |

Is there a difference between average response time for matching and non-matching trials?

Is there a correlation between Angle of mental rotation and response time?  Visualize the relationship

Is there a difference in the relationship between Angle of mental rotation and response time, between stimulus categories?

Does the relationship between Angle of mental rotation and response time differ between participants younger than 22 and participants older than 22, and between stimulus categories?

Did participants get faster or slower as they did more trials? Visualize the relationship using a line plot