Data from study 1 of "The Pen Is Mightier Than the Keyboard: Advantages of Longhand Over Laptop Note Taking".

See <https://journals.sagepub.com/doi/full/10.1177/0956797614524581> for the main paper and <https://journals.sagepub.com/doi/10.1177/0956797618781773> for the corrigendum.

OSF data repository housed at <https://osf.io/crsiz>.

Data file from <https://osf.io/23yad>.

Data codebook at <https://osf.io/j9472>.

In [1]:
import os.path as op
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
df = pd.read_csv('laptop_notes_study1.csv')
df.head()

Unnamed: 0,Variable,onegrams,twograms,threegrams,wordcount,rspanabsolute,rspantotal,sixletters,auxverbs,presentverbs,...,openandobjectiveZ,rawobjective,rawopen,openZ.1,condition.1,whichtalk.1,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39
0,,0.86875,0.42138,0.18987,167.0,18.0,30.0,40.12,2.4,1.2,...,-1.2,4.0,2.0,-1.35,0.0,1.0,6.258065,5.794118,-0.251935,0.259118
1,,0.77437,0.38547,0.17647,325.0,51.0,70.0,29.23,4.62,2.77,...,-0.58,7.0,3.0,-0.86,0.0,1.0,,,,
2,,0.44311,0.10843,0.01212,159.0,27.0,54.0,30.19,2.52,2.52,...,-0.77,6.0,3.0,-0.86,0.0,1.0,,,,
3,,0.79386,0.3348,0.11504,223.0,34.0,57.0,33.18,5.38,2.69,...,-0.58,7.0,3.0,-0.86,0.0,1.0,-4.78,,,
4,,0.66995,0.19802,0.06468,213.0,28.0,60.0,34.74,5.16,6.57,...,-0.4,7.0,3.0,-0.86,1.0,1.0,5.83,,,


There are some duplicate columns:

In [3]:
cond1 = df[['condition', 'condition.1']].dropna()
np.all(cond1['condition'] == cond1['condition.1'])

True
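The same check can be run over every column that has a `.1` twin. Below is a minimal sketch on a made-up data frame; `toy`, `check_duplicates` and `duplicate_report` are hypothetical names for illustration, not part of the original analysis. It uses `Series.equals`, which treats missing values in the same position as equal, so no `dropna` is needed.

```python
import numpy as np
import pandas as pd

# Toy data frame standing in for `df`; the values are made up.
toy = pd.DataFrame({
    'condition': [0.0, 1.0, np.nan],
    'condition.1': [0.0, 1.0, np.nan],
    'openZ': [-1.35, -0.86, np.nan],
    'openZ.1': [-1.35, -0.80, np.nan],
})


def check_duplicates(frame, suffix='.1'):
    """Map each suffixed column name to whether it matches its original."""
    results = {}
    for name in frame.columns:
        original = name[:-len(suffix)]
        if name.endswith(suffix) and original in frame.columns:
            # Series.equals treats NaNs in the same position as equal.
            results[name] = frame[original].equals(frame[name])
    return results


duplicate_report = check_duplicates(toy)
print(duplicate_report)
```

In the toy data, `condition.1` is a true duplicate and `openZ.1` is not, because one value differs.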

In [4]:
df['condition'].value_counts()

1.0    34
0.0    31
Name: condition, dtype: int64

According to the text describing Figure 1 in the [corrigendum](https://journals.sagepub.com/doi/10.1177/0956797618781773), the mean z score on conceptual responses was 0.162 for longhand notes and −0.178 for laptop notes. The group means of `openZ` are close to these values, but do not match them exactly:

In [5]:
df[['condition', 'openZ']].groupby('condition').mean()

condition,openZ
0.0,-0.155806
1.0,0.152941


Let's try the raw scores:

In [6]:
df[['condition', 'rawopen']].groupby('condition').mean()

condition,rawopen
0.0,3.774194
1.0,4.294118
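For reference, here is a minimal sketch of how z scores can be calculated from raw scores, using made-up numbers. We have not checked how the authors standardized their scores; standardizing over the whole sample with the sample standard deviation (`ddof=1`, the pandas default) is an assumption here, and `toy`, `open_z` and `group_means` are hypothetical names.

```python
import pandas as pd

# Toy data: made-up raw scores for two conditions.
toy = pd.DataFrame({
    'condition': [0.0, 0.0, 1.0, 1.0],
    'rawopen': [2.0, 3.0, 4.0, 5.0],
})

raw = toy['rawopen']
# Subtract the overall mean, divide by the sample standard deviation.
# pandas .std() uses ddof=1 by default (assumed here, not confirmed).
toy['open_z'] = (raw - raw.mean()) / raw.std()

# Mean z score within each condition.
group_means = toy.groupby('condition')['open_z'].mean()
print(group_means)
```

Z scores calculated over the whole sample have a mean of zero, so with equal group sizes the two group means are mirror images of each other.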


We want to make a new dataset with fewer columns. First, check the number of participants.

From the paper:

> Participants were 67 students (33 male, 33 female, 1 unknown) from the Princeton University subject pool. Two participants were excluded, 1 because he had seen the lecture serving as the stimulus prior to participation, and 1 because of a data-recording error.

We have:

In [7]:
len(df)

66

One row doesn't have a condition recorded:

In [8]:
bad_condition = df['condition'].isna()
df[bad_condition]

Unnamed: 0,Variable,onegrams,twograms,threegrams,wordcount,rspanabsolute,rspantotal,sixletters,auxverbs,presentverbs,...,openandobjectiveZ,rawobjective,rawopen,openZ.1,condition.1,whichtalk.1,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39
65,,,,,,,,,,,...,,6.015385,4.046154,,,,,,,


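The paper reports 67 participants with two exclusions, which leaves 65, the number of rows that do have a condition. In the columns shown, the extra row has values only for `rawobjective` and `rawopen`, and these equal the means of those columns over the other 65 rows (for example, 263 / 65 ≈ 4.046154 for `rawopen`). So this looks like a row of summary values rather than a subject. Drop it: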

In [9]:
clean_df = df[~bad_condition].copy()
len(clean_df)

65
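A quick way to check whether a row like this holds column means is to compare its values with the means of the remaining rows. The sketch below does this on made-up data; `toy`, `score_cols` and the other names are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: three subjects plus a last row with no condition.
toy = pd.DataFrame({
    'condition': [0.0, 1.0, 1.0, np.nan],
    'rawopen': [2.0, 3.0, 4.0, 3.0],
    'rawobjective': [4.0, 7.0, 7.0, 6.0],
})

missing = toy['condition'].isna()
score_cols = ['rawopen', 'rawobjective']

# Column means over the rows that do have a condition.
means_of_rest = toy.loc[~missing, score_cols].mean()
# The values in the (single) row without a condition.
suspect_row = toy.loc[missing, score_cols].iloc[0]

# Compare with a tolerance, because the stored means may be rounded.
looks_like_means = np.allclose(suspect_row.to_numpy(),
                               means_of_rest.to_numpy())
print(looks_like_means)
```

In the toy data the last row matches the column means of the other rows, so the check passes.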

The authors also had this question:

> betterlorn - In general, do you think it is better for learning purposes to take notes on a laptop or in a notebook? a. Laptop significantly better (1) – Notebook significantly better (7)

Let's include that too.

We will also use `SATcombined` as an index of student prior ability.

Make a clearer column for laptop use (compared to writing by hand):

In [10]:
clean_df['laptop_longhand'] = 'laptop'
clean_df.loc[clean_df['condition'] == 1, 'laptop_longhand'] = 'longhand'
clean_df.head()

Unnamed: 0,Variable,onegrams,twograms,threegrams,wordcount,rspanabsolute,rspantotal,sixletters,auxverbs,presentverbs,...,rawobjective,rawopen,openZ.1,condition.1,whichtalk.1,Unnamed: 36,Unnamed: 37,Unnamed: 38,Unnamed: 39,laptop_longhand
0,,0.86875,0.42138,0.18987,167.0,18.0,30.0,40.12,2.4,1.2,...,4.0,2.0,-1.35,0.0,1.0,6.258065,5.794118,-0.251935,0.259118,laptop
1,,0.77437,0.38547,0.17647,325.0,51.0,70.0,29.23,4.62,2.77,...,7.0,3.0,-0.86,0.0,1.0,,,,,laptop
2,,0.44311,0.10843,0.01212,159.0,27.0,54.0,30.19,2.52,2.52,...,6.0,3.0,-0.86,0.0,1.0,,,,,laptop
3,,0.79386,0.3348,0.11504,223.0,34.0,57.0,33.18,5.38,2.69,...,7.0,3.0,-0.86,0.0,1.0,-4.78,,,,laptop
4,,0.66995,0.19802,0.06468,213.0,28.0,60.0,34.74,5.16,6.57,...,7.0,3.0,-0.86,1.0,1.0,5.83,,,,longhand
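An alternative way to build the same labels is `Series.map` with a dictionary. This is a sketch on made-up data, with `toy` and `labels` as hypothetical names. Unlike the default-then-overwrite approach above, any value missing from the dictionary becomes NaN instead of silently getting the label 'laptop'.

```python
import pandas as pd

# Toy data standing in for the `condition` column.
toy = pd.DataFrame({'condition': [0.0, 0.0, 1.0]})

# Explicit mapping from numeric code to label.
labels = {0.0: 'laptop', 1.0: 'longhand'}
toy['laptop_longhand'] = toy['condition'].map(labels)
print(toy)
```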


In [11]:
# New empty data frame.
smaller_df = pd.DataFrame()
# Rename laptop_longhand column.
smaller_df['condition'] = clean_df['laptop_longhand']
# Change whichtalk to integers.
smaller_df['whichtalk'] = clean_df['whichtalk'].astype(int)
# Copy over, rename open and objective scores.
smaller_df['concept_score'] = clean_df['rawopen']
smaller_df['factual_score'] = clean_df['rawobjective']
# Add betterlorn
smaller_df['better_laptop_or_long'] = clean_df['betterlorn']
# Combined SAT score
smaller_df['SAT_combined'] = clean_df['SATcombined']
smaller_df.head()

Unnamed: 0,condition,whichtalk,concept_score,factual_score,better_laptop_or_long,SAT_combined
0,laptop,1,2.0,4.0,6.0,2180.0
1,laptop,1,3.0,7.0,7.0,2100.0
2,laptop,1,3.0,6.0,5.0,2340.0
3,laptop,1,3.0,7.0,5.0,2310.0
4,longhand,1,3.0,7.0,6.0,2310.0


In [12]:
# Check we still get the same answer as before.
smaller_df.groupby('condition').mean()

condition,whichtalk,concept_score,factual_score,better_laptop_or_long,SAT_combined
laptop,3.032258,3.774194,5.580645,4.096774,2254.642857
longhand,2.823529,4.294118,6.411765,4.823529,2278.935484


In [13]:
# Save to CSV, without the row labels (index).
out_fname = op.join('processed', 'laptop_longhand_study1.csv')
smaller_df.to_csv(out_fname, index=False)

In [14]:
# Check we can load back, get the same thing.
pd.read_csv(out_fname).head()

Unnamed: 0,condition,whichtalk,concept_score,factual_score,better_laptop_or_long,SAT_combined
0,laptop,1,2.0,4.0,6.0,2180.0
1,laptop,1,3.0,7.0,7.0,2100.0
2,laptop,1,3.0,6.0,5.0,2340.0
3,laptop,1,3.0,7.0,5.0,2310.0
4,longhand,1,3.0,7.0,6.0,2310.0
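Looking at `head()` only checks the first few rows. A stricter round-trip check is sketched below on made-up data, written to a temporary directory so that it does not touch the real output file. The names `toy`, `out_dir` and `out_path` are hypothetical. The sketch also creates the output directory first, because `to_csv` raises an error if the directory does not exist.

```python
import os
import os.path as op
import tempfile

import pandas as pd

# Toy data frame standing in for `smaller_df`.
toy = pd.DataFrame({
    'condition': ['laptop', 'longhand'],
    'whichtalk': [1, 1],
    'concept_score': [2.0, 3.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_dir = op.join(tmp_dir, 'processed')
    # Make the output directory if it is not already there.
    os.makedirs(out_dir, exist_ok=True)
    out_path = op.join(out_dir, 'toy.csv')
    # Save without the row labels, then load back.
    toy.to_csv(out_path, index=False)
    loaded = pd.read_csv(out_path)

# Raises an AssertionError if values, columns or dtypes differ.
pd.testing.assert_frame_equal(loaded, toy)
```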
