# Exploratory Data Analysis

### Pre-readings
- Chapter 5

### Learning objectives
- Explore a data set from a stimulus-response (SR) compatibility experiment
- Learn how to use split-apply-combine to average repeated measures data appropriately
- Learn how to properly visualize repeated measures data with Seaborn
- Understand the difference between within-subjects variability versus between-subjects variability
- Understand how conflating these two sources of variance leads to incorrect statistical inferences  


## What is stimulus response compatibility?
- When the stimulus and the response are compatible, people respond faster and more accurately: reaction times and error rates both decrease. When the stimulus and the response are incompatible, people respond more slowly and make more mistakes.
- The data we are going to look at come from a standard SR-compatibility experiment. Participants performed a task in which they had to press a key in response to a stimulus in two conditions: compatible and incompatible.
- In the compatible condition (left figure), the stimulus and response were on the same side; in the incompatible condition (right figure), they were on opposite sides.
- We are going to explore the response times for these different conditions, as well as the accuracy.

<img src="../images/sr.png" width=600>

---
### Task 1: Getting our data
- Import the necessary libraries
- Load the data set 'sr_compatibility.csv' and examine it
- Determine the structure of the data
- Come up with a plan for exploring this data set

In [None]:
# your answer here


In [None]:
# your answer here


### Take a couple of minutes to think about different ways of exploring this data and write them down below.

Your answer here


---
### Task 2: Taking a look at our data more closely
- First, let's take a look at our reaction time data as a whole. This is an important first step to understand our data set.
- Plot the reaction times for the different conditions to get an idea of the data.
- Use at least 3 different plot types to examine the data


In [None]:
# your answer here


---
### Task 3: Summarize our response times
- Now that we have seen our data, let's compute some descriptive statistics.
- Use the `describe` method to evaluate the characteristics of the data.
- Do this for the whole data set, and also for each participant.

In [None]:
# your answer here (whole data set)


In [None]:
# your answer here (per participant)


---
### Task 4: Plotting each participant
- Now let's take a closer look at each participant's data by plotting their response times.
- Plot the response time for each participant in a separate subplot.

In [None]:
# your answer here


---
# Dealing with repeated measures
- When looking at our data, we have to consider both the within-participant variability and the between-participant variability.
- Each participant contributes many trials, and those trials are not independent observations. Therefore, we must first aggregate the data within each participant before examining the effects.
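
To make the two sources of variability concrete, here is a minimal sketch on synthetic data. It does not use the real file: the numbers are simulated, and the column names `participant` and `condition` are assumptions chosen for illustration (only `ResponseTime` appears in this notebook). It compares a standard error computed as if every trial were independent with one computed from participant averages.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in data: 4 hypothetical participants x 2 conditions x 50 trials.
# "participant" and "condition" are assumed column names, not the real file's headers.
n_participants, n_trials = 4, 50
rows = []
for p in range(n_participants):
    baseline = 400 + 40 * p  # between-participant differences (ms)
    for cond, shift in [("compatible", 0), ("incompatible", 30)]:
        # Trial-to-trial noise within one participant (ms)
        rt = baseline + shift + rng.normal(0, 50, n_trials)
        rows.append(pd.DataFrame({"participant": p, "condition": cond, "ResponseTime": rt}))
trials = pd.concat(rows, ignore_index=True)

# Split by participant and condition, apply the mean, combine into one row per cell.
participant_means = trials.groupby(
    ["participant", "condition"], as_index=False
)["ResponseTime"].mean()

# Standard error treating all 200 trials per condition as independent observations.
naive_sem = trials.groupby("condition")["ResponseTime"].sem()

# Standard error based on one average per participant (n = 4 per condition).
participant_sem = participant_means.groupby("condition")["ResponseTime"].sem()

print(participant_means)
print(naive_sem)
print(participant_sem)
```

In this simulation the trial-level standard error comes out much smaller than the participant-level one, because it counts 200 trials as if they were 200 independent people. That is the kind of overconfidence that conflating the two sources of variance produces.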

### Task 5: Comparing confidence interval calculations
- Create a point plot with confidence intervals to examine our data (as we did above)
- Do this with and without the `units` keyword and compare the results. Did the mean change? Did the confidence interval?

In [None]:
# your answer here



### Task 6: Split-apply-combine
- Use split-apply-combine to aggregate the data yourselves.
- First create a summary table for your aggregated data
- Then create a new dataframe that only contains the participant averages and not every trial


In [None]:
# Your answer here


In [None]:
# your answer here


---
### Task 7: Looking at the interaction
- We have been plotting our data by condition so far, but now we will look at it more carefully. The SR compatibility effect shows up as an interaction between two variables, so now we will plot that interaction.
- An interaction means that the impact of one independent variable (e.g., response side) on the outcome (dependent) variable (i.e., `ResponseTime`) depends on the level of another independent variable (e.g., stimulus side). (Check out [this page](https://statisticsbyjim.com/regression/interaction-effects/) for a description of a very funny but real interaction effect between food and condiment.) 
- Essentially, the reaction times for each response side depend on the side on which the stimulus was presented (and vice versa).
- Produce a figure that demonstrates this interaction effect.
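
As a numerical illustration of what an interaction means, the sketch below uses made-up cell means (not values from the data set) and the assumed column names `stimulus_side` and `response_side`. It computes the effect of response side separately for each stimulus side and then takes the difference between those two effects. A difference far from zero is what an interaction looks like in a 2 x 2 table.

```python
import pandas as pd

# Made-up cell means (ms) for illustration only; not values from the data set.
# "stimulus_side" and "response_side" are assumed column names.
cell_means = pd.DataFrame(
    {
        "stimulus_side": ["left", "left", "right", "right"],
        "response_side": ["left", "right", "left", "right"],
        "ResponseTime": [400.0, 450.0, 455.0, 405.0],
    }
)

# Rows are stimulus sides, columns are response sides.
table = cell_means.pivot(
    index="stimulus_side", columns="response_side", values="ResponseTime"
)

# Effect of response side (right minus left) at each stimulus side.
effect_of_response = table["right"] - table["left"]

# Interaction contrast: how much that effect changes between stimulus sides.
interaction = effect_of_response["left"] - effect_of_response["right"]

print(table)
print(effect_of_response)
print(interaction)
```

With these numbers, responding on the right is slower when the stimulus is on the left and faster when it is on the right, so the effect of response side reverses with stimulus side. In a line plot this appears as two crossing lines.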

In [None]:
# Your answer here


---
# Dealing with outliers

### Tukey's Method
- Outliers are defined as points that lie more than $1.5 \times IQR$ above the third quartile or below the first quartile, where $IQR$ is the interquartile range (third quartile minus first quartile).
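
The rule can be sketched on a small made-up array before applying it to the real data. The values below are invented for illustration, and NumPy's default (linear interpolation) quartile definition is an assumption; other quartile conventions give slightly different thresholds.

```python
import numpy as np

# Made-up response times (ms); 1200 is an intentionally extreme value.
rt = np.array([380, 395, 400, 410, 415, 420, 430, 445, 460, 1200], dtype=float)

# First and third quartiles (NumPy's default linear interpolation).
q1, q3 = np.percentile(rt, [25, 75])
iqr = q3 - q1

# Tukey's thresholds: 1.5 * IQR beyond each quartile.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Boolean mask marking values outside the thresholds.
is_outlier = (rt < lower) | (rt > upper)

print(lower, upper)
print(rt[is_outlier])
```

For repeated measures data, one design choice to consider is whether to compute the thresholds over all trials at once or separately for each participant, since a response time that is extreme for one person may be typical for another.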




**Step 1:**
- Calculate your quartiles and the upper and lower thresholds for an outlier.

In [None]:
# Your answer here


**Step 2:**
- Identify all the data points that are above or below this threshold.

In [None]:
# Your answer here


**Step 3:**
- Create a new dataframe that has the outliers you identified in step 2 removed.

In [None]:
# Your answer here


**Step 4:**
- Regenerate the figure from Task 7 with this new dataframe.

In [None]:
# Your answer here


---
# Looking at accuracy
- Follow the same logic we used for the response time data and explore the accuracy results.
- This data is binary (1 or 0), not continuous like the response time data, so you may have to take a slightly different approach.
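
One way to handle a binary outcome is to note that the mean of a 0/1 column is the proportion of correct trials, so the same split-apply-combine step yields an accuracy rate per participant and condition. The sketch below uses made-up trials, and the column names `participant`, `condition`, and `Accuracy` are assumptions that may not match the real file.

```python
import pandas as pd

# Made-up trials for illustration; column names are assumptions.
trials = pd.DataFrame(
    {
        "participant": [1, 1, 1, 1, 2, 2, 2, 2],
        "condition": ["compatible", "compatible", "incompatible", "incompatible"] * 2,
        "Accuracy": [1, 1, 1, 0, 1, 1, 0, 0],  # 1 = correct, 0 = error
    }
)

# The mean of a 0/1 column is the proportion correct in each group.
prop_correct = trials.groupby(
    ["participant", "condition"], as_index=False
)["Accuracy"].mean()

print(prop_correct)
```

Plots designed for continuous data, such as box plots, are not very informative on raw 0/1 values, which is one reason to aggregate to proportions first.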

In [None]:
# your answer here


In [None]:
# your answer here


### Continue your analyses of accuracy following the same steps we used for response times