# Data Processing

## Load the Data

Read the data into a Pandas DataFrame. Before analyzing the data, make sure the columns have real names! This can be done in one of three ways:

*Approach A*: Put the column names in the data file, so Pandas detects the names upon loading.

```
Num, Name, Time, ...
1, Nick, 3.1, ...
2, James, 3.5, ...
```

*Approach B*: Supply the names in your code, in the `read_csv` step:
  - `df = pd.read_csv('myfile.csv', names=['Num', 'Name', 'Time', ...])`

*Approach C*: Load the data without names (`header=None`), then rename the columns in your code. Pandas offers several ways to do this; here are two options:
  - `df = df.rename(columns={0: 'Num', 1: 'Name', 2: 'Time', ...})`
  - `df.columns = ['Num', 'Name', 'Time', ...]`
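To compare the three approaches side by side, here is a minimal sketch. The in-memory strings stand in for a data file and are made up for illustration; only the column names come from the examples above.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory so the sketch runs anywhere.
with_header = "Num,Name,Time\n1,Nick,3.1\n2,James,3.5\n"
no_header = "1,Nick,3.1\n2,James,3.5\n"
names = ['Num', 'Name', 'Time']

# Approach A: the first line of the file holds the names.
df_a = pd.read_csv(io.StringIO(with_header))

# Approach B: supply the names while reading.
df_b = pd.read_csv(io.StringIO(no_header), names=names)

# Approach C: read with header=None (columns become 0, 1, 2), then rename.
df_c = pd.read_csv(io.StringIO(no_header), header=None)
df_c.columns = names

for frame in (df_a, df_b, df_c):
    print(list(frame.columns))
```

If a file has a space after each comma, as in the sample under Approach A, passing `skipinitialspace=True` to `read_csv` keeps those spaces out of the column names.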
    

*(5-Min Discussion)*: What are the pros and cons of each option?  When might you go with Approach A?  Approach B?  Approach C?  Which will you go with in this analysis? (Note that later on we'll be trying to merge everyone's data!)

In [None]:
import pandas as pd
df = pd.read_csv('../data/raw/stroop_nick.txt', sep='\t', 
    names=['block', 'word', 'color', 'matches', 'tablerow', 'key', 'status', 'rt']
    )
df.head()

Unnamed: 0,block,word,color,matches,tablerow,key,status,rt
0,training,yellow,blue,0,3,3,1,1038
1,training,blue,yellow,0,13,4,1,878
2,training,blue,green,0,14,2,1,786
3,training,blue,yellow,0,13,4,1,762
4,training,green,green,1,10,2,1,686


In [None]:
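# Alternative (assumes pingouin is imported as pg): pg.corr(df.index, df.rt)
# Spearman correlation matrix of the reaction-time column
df[['rt']].corr(method='spearman')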


Unnamed: 0,rt
rt,1.0


In [None]:
import numpy as np
# Pearson correlation between trial order (the row index) and reaction time
np.corrcoef(df.index, df['rt'])

array([[ 1.        , -0.16411847],
       [-0.16411847,  1.        ]])

## Munge Your Data

Before starting your analysis, make sure the data is in the form your analysis tools handle best, and that any obvious inconsistencies in the data have been repaired.

#### Drop Columns You Won't Analyze

Keep the analysis process focused by getting rid of uninformative / irrelevant columns.

`df = df.drop(columns=['Name', 'Num'])`
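A small sketch of what `drop` does, using a made-up toy table with the column names from the loading examples (not the Stroop data):

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({'Num': [1, 2], 'Name': ['Nick', 'James'], 'Time': [3.1, 3.5]})

# drop() returns a new DataFrame; `toy` itself is left unchanged.
trimmed = toy.drop(columns=['Name', 'Num'])

print(list(toy.columns))
print(list(trimmed.columns))
```

Because `drop` returns a copy, the result has to be assigned back (as in `df = df.drop(...)`) for the change to stick.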


Get rid of the column that just repeats the word "training" over and over (uninformative), and the column that describes the "table row" (irrelevant).

#### Check DTypes

The `type()` of `df` is `pd.DataFrame`, but what about the types of the individual columns?  In Numpy and Pandas, these types are called "**dtypes**".  You can check them with:

  - `df.dtypes`
  - `df.info()`
  - `df['col'].dtype`

And you can change them with:

  - `df2 = df.astype({'Col1': bool, 'Time': float})`
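As a quick illustration of checking and changing dtypes, here is a sketch on a made-up toy table; the column names `Col1` and `Time` are the placeholders from the line above, not columns in the Stroop file.

```python
import pandas as pd

# Toy data: both columns start out as integers.
toy = pd.DataFrame({'Col1': [1, 0, 1], 'Time': [3, 4, 5]})
print(toy.dtypes)

# astype() with a dict converts only the listed columns and returns a new DataFrame.
converted = toy.astype({'Col1': bool, 'Time': float})
print(converted.dtypes)
```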


What are the `dtypes` in your data?

#### Restrict your Dtypes 

Choose a dtype that supports only the values that actually occur in your data. This makes it easier for others to make good assumptions about your data, and helps statistical software analyze it correctly.

Make integer/float columns that can only be `1` or `0` into `bool`.

Take the `int` column related to "status" of the response (it can only be 1, 2, or 3), and make two new `bool` columns out of it:  `IsCorrect` and `TimedOut`.  After you have the columns, drop the status column.
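One possible shape for these two steps, sketched on made-up toy data. The meaning of the status codes here (1 = correct, 2 = wrong key, 3 = timed out) is an assumption for illustration; confirm the coding against your own data before relying on it.

```python
import pandas as pd

# Toy data for illustration only.
# ASSUMED status coding: 1 = correct, 2 = wrong key, 3 = timed out.
toy = pd.DataFrame({'matches': [0, 1, 1, 0], 'status': [1, 2, 3, 1]})

# Guard first: astype(bool) would silently turn any non-zero value into True.
assert toy['matches'].isin([0, 1]).all()
toy['matches'] = toy['matches'].astype(bool)

# Comparing a column against a code yields a bool column.
toy['IsCorrect'] = toy['status'] == 1
toy['TimedOut'] = toy['status'] == 3

# The two new columns replace the original code.
toy = toy.drop(columns=['status'])
print(toy.dtypes)
```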

#### Convert Numeric Codes to Text

Leaving labels as numbers can confuse statistical software and makes it hard to remember what the data represents. Here's a line of code that converts numeric codes to text labels:

  - `df['newcol'] = df['col'].transform(lambda s: {1: 'a', 2: 'b', 3:'c', 4:'d'}[s])`
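An alternative sketch uses `Series.map` with the same dictionary. The labels `'a'` to `'d'` are the placeholders from the line above, and the toy column is made up; the code 5 is included on purpose to show what happens to a value missing from the dictionary.

```python
import pandas as pd

# Placeholder code-to-label dictionary.
codes = {1: 'a', 2: 'b', 3: 'c', 4: 'd'}

# Toy data for illustration; 5 has no entry in `codes`.
toy = pd.DataFrame({'col': [3, 4, 2, 5]})

# map() looks each value up in the dictionary; unlisted values become NaN.
toy['newcol'] = toy['col'].map(codes)
print(toy)
```

The two versions differ on unlisted codes: the dictionary lookup inside the lambda raises a `KeyError`, while `map` leaves a missing value. Which is preferable depends on whether an unexpected code should stop the analysis or be flagged for later.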


Turn the numeric codes in the key column into text labels, showing which letter key was pressed for a given trial.

*Note*: you'll have to do some detective work to figure this out; look through your data and reason through it with your groupmates.
