# Data wrangling in pandas

Import all the tools...

In [2]:
from glob import glob
import pandas as pd
import numpy as np
import re
import seaborn as sns
import matplotlib.pyplot as plt
from patsy import dmatrix
%matplotlib inline

## Read in the data
Let's play with some behavioral data from the Human Connectome Project. We'll look at data from the N-back task for the first 100 subjects. The E-Prime logs are contained in a `data` folder below this notebook. We'll loop over them and read each one into a pandas DataFrame; then we'll concatenate them all into one big DataFrame to rule them all.

In [19]:
# Read all individual run E-Prime outputs and merge into one big DataFrame
files = glob('data/WM*TAB.txt')
dfs = []

for f in files:

    # Read in each tab-delimited file with pandas
    _df = pd.read_csv(f, sep='\t')
    
    # The file contents don't include the subject ID, so we extract the
    # six-digit ID from the filename and add it as a new column.
    # A raw string keeps \d from being treated as a string escape.
    subject = re.search(r'.*(\d{6})', f).group(1)
    _df['Subject'] = subject

    dfs.append(_df)

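# Concatenate all DFs together along the row axis; ignore_index=True
# renumbers the rows so the combined index has no duplicate labels
data = pd.concat(dfs, axis=0, ignore_index=True)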

## Inspecting the data
The first thing we should do with any new dataset is inspect it in various ways to get a better sense of what it contains. Some things we might want to know:
* How big is the dataset?
* What do the first few rows look like?
* How many columns contain meaningful information?
* How are values distributed?
* Are there missing values?
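
As a starting point, here is a minimal sketch of how each of these questions maps onto a pandas call. It runs on a small made-up DataFrame called `toy` so that it is self-contained; the column names (`rt`, `acc`, `empty_col`, `constant_col`) and subject IDs are hypothetical stand-ins, not the actual contents of the E-Prime logs.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concatenated data. All column names and values
# here are hypothetical and will not match the real HCP logs.
toy = pd.DataFrame({
    'Subject': ['000001', '000001', '000002', '000002'],
    'rt': [512.0, np.nan, 640.0, 701.0],
    'acc': [1, 0, 1, 1],
    'empty_col': [np.nan] * 4,
    'constant_col': ['x'] * 4,
})

# How big is the dataset?
n_rows, n_cols = toy.shape

# What do the first few rows look like?
first_rows = toy.head(3)

# Are there missing values? Count them per column.
n_missing = toy.isnull().sum()

# How many columns contain meaningful information? Columns with at most
# one distinct non-missing value cannot tell us anything.
n_unique = toy.nunique()
uninformative = n_unique[n_unique <= 1].index.tolist()

# How are values distributed? Summary statistics for the numeric columns.
summary = toy[['rt', 'acc']].describe()

print(n_rows, n_cols)
print(n_missing)
print(uninformative)
print(summary)
```

To try the same calls on the real data, we would replace `toy` with `data` and pick the columns of interest after looking at `data.columns`.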

In [28]:
# Insert code here!

## Preparing the data
Most datasets require some amount of cleaning/wrangling/munging/reshaping/sanitizing/[insert your favorite verb]ing before they're ready for analysis. Some things we should consider:
* Are missing or obviously erroneous values present, and if so, how should we handle them?
* Are there subsets of the data we might not want to include?
* Are there variables we want to drop or need to transform in some way?
    * Are our variables on reasonable scales?
* Does the format of the data match what our analysis or visualization tools expect?
    * E.g., should the data be in "wide" or "long" format?
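
The sketch below shows one way these cleaning steps might look in pandas. It uses a made-up DataFrame (`toy`) with hypothetical column names (`condition`, `rt`, `unused`), and the rule "a negative RT is an error" is an assumption chosen for illustration, not a property of the HCP data.

```python
import numpy as np
import pandas as pd

# Toy long-format data: one row per subject and condition.
# Column names and values are hypothetical.
toy = pd.DataFrame({
    'Subject': ['000001', '000001', '000002', '000002'],
    'condition': ['0-back', '2-back', '0-back', '2-back'],
    'rt': [512.0, 655.0, -1.0, 701.0],
    'unused': [np.nan] * 4,
})

# Drop columns that contain nothing but missing values
clean = toy.dropna(axis=1, how='all')

# Treat an impossible value (here, a negative RT) as missing
clean = clean.assign(rt=clean['rt'].where(clean['rt'] > 0))

# Put RT on a different scale: milliseconds to seconds
clean['rt_sec'] = clean['rt'] / 1000.0

# Long to wide: one row per subject, one column per condition
wide = clean.pivot(index='Subject', columns='condition', values='rt')

# Wide back to long: one row per subject/condition pair
long = wide.reset_index().melt(
    id_vars='Subject', var_name='condition', value_name='rt'
)

print(clean)
print(wide)
print(long)
```

Whether to set bad values to missing or drop the rows entirely depends on the analysis; setting them to `NaN` keeps the row count intact, which makes it easier to see how much data was affected.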

In [15]:
# Insert code here!

## A deeper look
In practice, data exploration and preparation is an iterative process: we typically clean up the data to address some obvious problems, then take another look, find new problems, and rinse and repeat. This cycle can last for a long time, which is why many data scientists "joke" that actual data analysis is the easy (and smaller) part of their job.

Now that we've ensured basic sanity (hopefully!), we can start to probe the data in ways that might be a bit more scientifically interesting. At this point, the questions start to become increasingly domain-specific.

Things we can ask about the HCP N-back data:
* How does performance vary across different experimental conditions?
* How does it covary across subjects?
* How do the different performance metrics relate--e.g., RT and accuracy?
* Are there outliers--either in terms of subjects or in terms of stimuli?
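
Here is a rough sketch of how these questions could be approached with `groupby` and a simple z-score. The DataFrame `toy`, its column names (`condition`, `rt`, `acc`), and the cutoff of 2 standard deviations are all assumptions made for illustration; the real column names and a sensible outlier rule need to come from the data itself.

```python
import pandas as pd

# Toy trial-level data: 3 subjects x 2 conditions x 2 trials.
# Column names and values are hypothetical.
toy = pd.DataFrame({
    'Subject': ['s1'] * 4 + ['s2'] * 4 + ['s3'] * 4,
    'condition': ['0-back', '0-back', '2-back', '2-back'] * 3,
    'rt': [500.0, 520.0, 700.0, 720.0,
           450.0, 470.0, 650.0, 630.0,
           600.0, 640.0, 900.0, 2500.0],
    'acc': [1, 1, 1, 0,
            1, 1, 1, 1,
            1, 0, 0, 0],
})

# How does performance vary across conditions?
by_condition = toy.groupby('condition')[['rt', 'acc']].mean()

# How does it vary across subjects?
by_subject = toy.groupby('Subject')[['rt', 'acc']].mean()

# How do RT and accuracy relate? Correlate the per-subject means.
rt_acc_corr = by_subject['rt'].corr(by_subject['acc'])

# Are there outlying trials? Flag RTs more than 2 SDs from the mean.
# (The cutoff of 2 is an arbitrary choice for this sketch.)
z = (toy['rt'] - toy['rt'].mean()) / toy['rt'].std()
outlier_trials = toy.index[z.abs() > 2].tolist()

print(by_condition)
print(by_subject)
print(rt_acc_corr)
print(outlier_trials)
```

With only a handful of subjects a correlation like this is not meaningful on its own; on the full dataset, a scatter plot of the per-subject means (for example with seaborn) is usually more informative than the single number.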

In [None]:
# Insert code here!

## Statistical inference
Researchers often rush to get to statistical inference. That's probably a bad idea. Statistical analysis should ideally be conducted only after we feel we have a reasonable handle on some of the basic qualitative patterns present in our data. If we can't answer questions like "is there much variation in performance across conditions" *before* running a regression analysis, we should consider taking a step back. (Note that everything we're talking about in this notebook falls squarely under the heading of exploratory analysis--and should be presented as such in talks, manuscripts, etc. If we want to claim that we're doing hypothesis testing, our hypotheses and analysis plan should all be written down in detail *before* we ever see the data. We can't decide how to clean and analyze our data after looking at it without running a serious risk of overfitting.)
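
To make the "look before you model" point concrete, the sketch below first computes the condition means descriptively and then fits a one-predictor regression, so the two can be compared. It uses made-up data with hypothetical column names, and it builds the design matrix by hand and fits it with `np.linalg.lstsq`; that is a deliberately bare-bones substitute for a proper modeling tool (patsy's `dmatrix`, imported above, can build a design matrix like this one from a formula).

```python
import numpy as np
import pandas as pd

# Toy data: RTs from two conditions. Names and values are hypothetical.
toy = pd.DataFrame({
    'condition': ['0-back'] * 3 + ['2-back'] * 3,
    'rt': [500.0, 520.0, 540.0, 700.0, 720.0, 740.0],
})

# Step 1: a descriptive look. Is there variation across conditions?
condition_means = toy.groupby('condition')['rt'].mean()

# Step 2: the same question as a regression.
# Design matrix: an intercept plus a 0/1 indicator for the 2-back condition.
X = np.column_stack([
    np.ones(len(toy)),
    (toy['condition'] == '2-back').astype(float),
])
y = toy['rt'].to_numpy()

# Ordinary least squares fit
beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
intercept, condition_effect = beta

print(condition_means)
print(intercept, condition_effect)
```

With this coding, the intercept is the mean RT in the 0-back condition and the slope is the difference between the two condition means, so the regression should agree with what the descriptive step already showed.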

In [20]:
# Guess what this box is for? That's right! Insert code here!