# HW 2: Due Monday, Feb 10th by Midnight

**Instructions:**  

This assignment is based on [Takagi et al.'s (2022)](https://www.nature.com/articles/s41598-022-10261-5) study investigating whether cats learn to associate names with faces.  
We've included the PDF in this repository for your reference. Please skim it before starting the assignment. We're only focusing on **Experiment 1**.

You're allowed to work on this HW by yourself OR with other students.  
If you do work together, please list the names of your partner(s) in the next notebook cell.  
You can also nominate yourself or others for cooperative extra credit. [See the course website for details](https://stat-intuitions.com/pages/syllabus.html#grading).

**Grading:**  

Your instructors will grade the **last `git push`** you made by the deadline.  
However, you'll have an opportunity to push additional changes to improve your grade after we provide feedback.  

To receive **full** credit you must be both **accurate** and **explain** your process.  

What's most important to us is your ability to demonstrate:  

- You attempted the assignment in good-faith
- You made an effort to clearly document and explain your thought process, reasoning, code, and where/why you got stuck if you did
- What attempts you made to fix issues you ran into, how you approached debugging, and what you learned from the process
- Why you made a particular choice in your code/analysis, and/or what assumptions you made for a particular statistical inference


**HW Partner(s) (optional):**  

Please list the names of your HW partners in this cell:

1. ...
2. ...
etc

## Resources

You might find it helpful to refer to the [coding cheatsheets](https://stat-intuitions.com/pages/cheatsheets.html) for the various Python libraries we've learned about so far.

The following recent labs/lectures may also help:
- [tidy-data analysis with polars](https://stat-intuitions.com/labs/3/01_polars.html)
- [statistical visualizations with seaborn](https://stat-intuitions.com/labs/3/02_new_eda_seaborn.html)
- [introduction to resampling (permutation)](https://stat-intuitions.com/lectures/wk2/05_simulation.html#permutation-re-sampling-without-replacement)
- [EDA resampling for basic comparisons](https://stat-intuitions.com/labs/3/04_permutation_comparison.html)

Remember to ping your instructors on Github/Slack if you get stuck!

---


## Background

You'll be working with simulated data inspired by Experiment 1 from the paper. You should refer to the paper for additional information and context, but here are the key details of the design and analysis:

Cats were shown pictures of *other* cats' faces paired with an audio recording of *other* cats' names.  
On some trials, the faces and audio were *congruent* - the identity of the picture *matched* the name being played.  
On other trials, the faces and audio were *incongruent* - the picture displayed a *different* cat than the name being played.

There were 2 independent variables with 2 levels each:

- Condition (within): Whether a trial was a **congruent** or **incongruent** name-face pairing
- Setting (between): Whether a cat was raised in a **cafe** or a **home**

And 1 dependent variable:

- Looking time: How many **seconds** a cat looked at the screen on a given trial

They tested 2 hypotheses:

1. Incongruent trials would lead to *longer* looking times overall due to expectancy violation (mis-matched pictures and audio names)
2. The degree of expectancy violation (difference between congruent and incongruent trials) would be *higher* for cats raised in *homes* vs *cafes* (more experience with human-socialization, i.e. name-calling)

## Your Tasks

We've broken up the assignment into 6 parts that will have you perform the following tasks:

- Basic data exploration and pre-processing (e.g. creating columns, filtering, etc.)
- Reproduce Figure 2 from the paper
  - Test for a difference between congruent vs incongruent looking times using a *permutation test*
  - Test for difference between home vs cafe cats using a *permutation test*
- Reproduce Figure 3 from the paper
  - Test whether the "violation index" is *larger* for cats raised in homes vs cafes using a *permutation test*

---

## Meet the data

The file `cat_name_recognition.csv` contains *tidy-format* data with the following columns:

- `cat_id`: Unique identifier for each cat
- `trial`: Trial number (1-4)
- `setting`: Whether the cat lives in a 'cafe' or 'house'
- `condition`: Whether the name-face pairing was 'congruent' or 'incongruent'
- `looking_time`: Time (in seconds) the cat spent looking at the screen
- `n_cohabitants`: Number of other cats living with this cat
- `age`: Age of the cat in years

We'll import some libraries and functions to get you set up, but feel free to import any other libraries and functions you think you'll need as you work through the assignment.

```python
# NumPy - arrays and basic stats
import numpy as np

# Polars - dataframes
import polars as pl
from polars import col

# Statistical plots
import seaborn as sns

# Simple plots
import matplotlib.pyplot as plt

# Permutation test
from scipy.stats import permutation_test

# Set up some default styles so plots look more like the paper
plt.style.use('ggplot')
sns.set_style('white')
```

## Part (1) Basic Data Exploration & Pre-processing

### Exploration

Load the data using Polars and show the first few rows

How many unique cats are there in the data? Does this match the paper?

How many cats are there per setting, i.e. how many *house* cats and how many *cafe* cats?

Are there any rows with missing data? If so, how many?

Create a new dataframe called `cats` that excludes any rows with missing data

Are there any cats in our new `cats` DataFrame with only 1 trial?

*Hint: there are a few different ways to calculate this - check-out `.group_by().len()` in Polars for one option*

### Pre-processing measurements

In the paper, the unit of measurement for their dependent variable **looking time** was *number of frames*. But in our data, the unit of measurement is *seconds*.

Add a column called `looking_time_frames` that converts `looking_time` to frames by multiplying by 30 (*the number of frames per second*).

### Aggregating repeated measurements

In these data, and the original experiment, each cat experienced 2 trials of each condition: 2 congruent and 2 incongruent trials.

The authors analyzed these data using a *linear mixed model* - but we won't cover this topic for several more weeks.

For now, let's *average* these repeated measurements within each condition for each cat into: 1 average congruent trial and 1 average incongruent trial.

Create a new dataframe called `cats_agg` that *averages* `looking_time_frames` for each cat within condition. 

**This dataframe should have exactly 2 rows per cat** - 1 for congruent and 1 for incongruent.

*Hint: you can group-by multiple columns at once by using a list, e.g. `.group_by(['setting', 'condition'])`*

*Hint: you can also pass multiple columns to `col()` when creating an expression, e.g. `col('looking_time_frames', 'age').mean()`*

### Violation Index

Another measure the authors created was the "violation index" for each cat:

$$ \text{VI} = \text{mean}(\text{incongruent looking time}) - \text{mean}(\text{congruent looking time}) $$

Create a new dataframe called `cats_vi` with 3 columns:
- `cat_id`: the cat's ID
- `setting`: whether the cat was raised in a cafe or a house
- `violation_index`: difference between the mean incongruent and mean congruent looking times for this cat

*Hint: check out the [reshaping](https://stat-intuitions.com/labs/3/01_polars.html#reshaping-dataframes) section of the Polars class tutorial, especially `.pivot()`*

## Part (2) Recreate Figure 2 from the paper

Using the `cats_agg` DataFrame you made earlier, recreate Figure 2 from the paper (reproduced here):

![](figs/Fig2.png)

Don't worry if the data you visualize don't match the paper - we're *not* using the study's original data

Instead make sure:  

- X-axis values *and* colors reflect *condition* 

- Y-axis values reflect *looking time in frames*

- The left and right columns in the plot reflect the *setting* a cat came from

- You should also overlay data from individual cats as *points* on the boxplot

*Hint: Check out the [layering plots](https://stat-intuitions.com/labs/3/02_new_eda_seaborn.html#layering-plots-on-facetgrid) section of our lab tutorial for how to combine multiple plots onto a single `FacetGrid`*
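As a rough sketch of layering two plot types on one `FacetGrid`, the example below draws a boxplot and then individual points on the same facets. The data are randomly generated and hypothetical, and they are built in pandas only to keep the sketch self-contained. Coloring by condition is deliberately left out, so this is a starting point rather than a finished figure.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Randomly generated toy data (hypothetical, NOT the homework data)
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    {
        "setting": np.repeat(["cafe", "house"], 20),
        "condition": np.tile(["congruent", "incongruent"], 20),
        "looking_time_frames": rng.gamma(shape=2.0, scale=40.0, size=40),
    }
)

# Fix the category order so both facets share the same x-axis layout
order = ["congruent", "incongruent"]

# One facet (column) per setting
g = sns.FacetGrid(toy, col="setting")

# Layer 1: boxplots (outlier markers hidden, since points are drawn next)
g.map_dataframe(
    sns.boxplot, x="condition", y="looking_time_frames", order=order, showfliers=False
)

# Layer 2: individual observations on top of the boxes
g.map_dataframe(
    sns.stripplot, x="condition", y="looking_time_frames", order=order,
    color="black", alpha=0.6,
)

g.set_axis_labels("Condition", "Looking time (frames)")
```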

```python
# Your code here



# Optional: you can add this line to the end of your code to convert your y-axis
# to a log scale like the original paper

# plt.yscale('log')
```

## Part (3) Main effect of living environment - between cats

Let's try to reproduce the main effect of *setting* - whether cats who live in a house have a *lower* average looking time than cats who live in a cafe.

First update the figure you made in (2), by simplifying it to only display a single boxplot:
- X-axis value should reflect *setting* (cafe or house cats)
- Y-axis values should reflect *looking time in frames*

Using the `cats_agg` DataFrame you previously created, make a *new* DataFrame called `cats_agg_setting` that contains 3 columns:

- `cat_id`: the cat's ID
- `setting`: the environment the cat was raised in (cafe or house)
- `looking_time_frames`: the average looking time for the cat across all trials

*Hint: you're summarizing within cat, so think about a `.group_by` expression*

Filter the dataframe to create 2 new variables that are numpy arrays of:

- `house_cats`: rows of `looking_time_frames` where `setting` is 'house'
- `cafe_cats`: rows of `looking_time_frames` where `setting` is 'cafe'

*Hint: you can use `.to_numpy()` to convert a polars column to a numpy array*

Using these 2 variables and the [permutation_test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.permutation_test.html) function from `scipy.stats` perform an **independent** permutation test to compare the **difference between the means** of each group.

*Hint: check out the [previous lecture tutorial illustrating how to use a permutation test](https://stat-intuitions.com/labs/3/04_permutation_comparison.html)*  
*Hint: look at the `permutation_type` argument in the function's help to control the type of permutation test you're doing* 
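Here is a minimal sketch of an independent-groups permutation test with `scipy.stats.permutation_test`, run on randomly generated arrays rather than the homework data. The names `group_a`, `group_b`, and `difference_of_means`, as well as the choices of `n_resamples` and `random_state`, are illustrative assumptions.

```python
import numpy as np
from scipy.stats import permutation_test

# Randomly generated toy groups (hypothetical, NOT the homework data)
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100.0, scale=15.0, size=20)
group_b = rng.normal(loc=110.0, scale=15.0, size=20)


def difference_of_means(x, y):
    """Statistic: mean of the first group minus mean of the second group."""
    return np.mean(x) - np.mean(y)


# 'independent' shuffles observations between the two groups
result = permutation_test(
    (group_a, group_b),
    difference_of_means,
    permutation_type="independent",
    vectorized=False,
    n_resamples=5000,
    alternative="two-sided",
    random_state=0,
)

print(result.statistic, result.pvalue)
```

The sign of the observed statistic depends on the order in which the two arrays are passed, so keep that order in mind when interpreting the result or choosing a one-sided `alternative`.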

Use the output of the `permutation_test` function and `matplotlib` to create a simple histogram of the null distribution and make sure to add:

- a vertical line of the observed mean difference in black
- put the observed mean difference and p-value in the title

*Hint: check out the help on the output of `permutation_test` to grab the values you need*
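One way to sketch the histogram described above is to use the `null_distribution`, `statistic`, and `pvalue` attributes of the result object. The data below are randomly generated and hypothetical, and the styling choices (number of bins, colors, title format) are only illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import permutation_test

# Randomly generated toy groups (hypothetical, NOT the homework data)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=100.0, scale=15.0, size=20)
group_b = rng.normal(loc=110.0, scale=15.0, size=20)

result = permutation_test(
    (group_a, group_b),
    lambda x, y: np.mean(x) - np.mean(y),
    permutation_type="independent",
    vectorized=False,
    n_resamples=2000,
    random_state=0,
)

fig, ax = plt.subplots()

# Histogram of the statistic computed on each shuffled dataset
ax.hist(result.null_distribution, bins=30, color="lightgray", edgecolor="white")

# Vertical line marking the statistic observed in the unshuffled data
ax.axvline(result.statistic, color="black")

ax.set_title(f"Observed difference = {result.statistic:.2f}, p = {result.pvalue:.3f}")
ax.set_xlabel("Difference of means under the null")
ax.set_ylabel("Count")
```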

Using the results from the permutation test, the histogram of the null distribution, and the simplified boxplot you made - write a few sentences interpreting and explaining the results.

How do they compare to the results from the paper?

*Your response here...*

## Part (4) Main effect of condition - *within* cat

Now let's try to reproduce the main effect of *condition* - whether looking times were *higher* in incongruent trials relative to congruent trials.

First update the figure you made in (2), by simplifying it to only display a single boxplot, but **this time**:
- X-axis value should reflect *condition* (congruent or incongruent trials)
- Y-axis values should reflect *looking time in frames*

Using the `cats_agg` DataFrame you previously created, make a *new* DataFrame called `cats_wide_condition` that contains 3 columns:

- `cat_id`: the cat's ID
- `incongruent`: looking time for incongruent trials
- `congruent`: looking time for congruent trials

*Hint: This time we're not summarizing, but [reshaping](https://stat-intuitions.com/labs/3/01_polars.html#reshaping-dataframes) because we want to pivot our measurement of looking time from rows into 2 new columns - one for each condition.*   

Use the incongruent and congruent columns to create 2 new variables that are numpy arrays of:

- `incongruent_trials`
- `congruent_trials`


Using these 2 variables and the [permutation_test](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.permutation_test.html) function from `scipy.stats` perform a **samples** permutation test to compare the **mean difference** between conditions.

*Hint: the function you provide to `permutation_test` to calculate the statistic needs to be slightly different from what you used before, as does the `permutation_type` argument.  
Think about what the "mean difference" (this question) vs "difference of means" (question 3) is...*
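For contrast with the independent case, here is a hedged sketch of a paired permutation test on randomly generated arrays (not the homework data). The names `first_measure`, `second_measure`, and `mean_difference` are hypothetical. With `permutation_type="samples"`, each observation stays paired with its partner and only the labels within a pair are shuffled.

```python
import numpy as np
from scipy.stats import permutation_test

# Randomly generated paired toy data (hypothetical, NOT the homework data):
# element i of each array comes from the same individual
rng = np.random.default_rng(7)
first_measure = rng.normal(loc=100.0, scale=15.0, size=20)
second_measure = first_measure + rng.normal(loc=8.0, scale=10.0, size=20)


def mean_difference(x, y):
    """Statistic: average of the pair-wise differences."""
    return np.mean(x - y)


# 'samples' swaps values within each pair instead of shuffling across individuals
result = permutation_test(
    (second_measure, first_measure),
    mean_difference,
    permutation_type="samples",
    vectorized=False,
    n_resamples=5000,
    alternative="two-sided",
    random_state=0,
)

print(result.statistic, result.pvalue)
```

For equal-length arrays the observed value matches a difference of means numerically; what changes is how the null distribution is generated.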

Use the output of the `permutation_test` function and `matplotlib` to create a simple histogram of the null distribution and make sure to add:

- a vertical line of the observed mean difference in black
- put the observed mean difference and p-value in the title

Using the results from the permutation test, the histogram of the null distribution, and the simplified boxplot you made - write a few sentences interpreting and explaining the results.

How do they compare to the results from the paper?

*Your response here...*

## Part (5) Recreate Figure 3 from the paper

Using the `cats_vi` DataFrame you made earlier, recreate Figure 3 from the paper (reproduced here):

![](figs/Fig3.png)

Don't worry if the data you visualize don't match the paper - we're *not* using the study's original data

Make sure:  

- X-axis values *and* colors reflect *setting* 

- Y-axis values reflect the *violation index* (the difference in looking time, in frames)

- You should also overlay data from individual cats as *points* on the boxplot

*Hint: Check out the [layering plots](https://stat-intuitions.com/labs/3/02_new_eda_seaborn.html#layering-plots-on-facetgrid) section of our lab tutorial for how to combine multiple plots onto a single `FacetGrid`*

## Part (6) Compare Violation Index

Using the `cats_vi` DataFrame (that you just plotted) - use a similar approach to the steps you performed in Part (3) to perform an **independent** permutation test comparing the **difference between the means** of violation index between cafe vs house cats.

Use the output of the `permutation_test` function and `matplotlib` to create a simple histogram of the null distribution and make sure to add:

- a vertical line of the observed mean difference in black
- put the observed mean difference and p-value in the title

Using the results from the permutation test, the histogram of the null distribution, and the figure you made in Part (5) - write a few sentences interpreting and explaining the results.

How do they compare to the results from the paper?

*Your response here...*

How is the comparison of Violation Index the same as or different from the "interaction between condition and setting" the authors report in the first part of the results?


*Your response here...*