# Homework 2: Data tables

This homework assignment is designed to get you comfortable loading and working with data tables.

You will need to download the **LexicalData_toclean.csv** file from the *Homework/lexDat* folder in the class GitHub repository. 

This data is a subset of the [English Lexicon Project database](https://elexicon.wustl.edu/). It provides the reaction times (in milliseconds) of many subjects as they are presented with letter strings and asked to decide, as quickly and as accurately as possible, whether the letter string is a word or not.

*Data courtesy of Balota, D.A., Yap, M.J., Cortese, M.J., Hutchison, K.A., Kessler, B., Loftis, B., Neely, J.H., Nelson, D.L., Simpson, G.B., & Treiman, R. (2007). The English Lexicon Project. Behavior Research Methods, 39, 445-459.*

---
## 1. **Loading the Data (1 point)** 

Use the `setwd` and `read.csv` functions to load the data table from the **LexicalData_toclean.csv** file. Use the `head` function to look at the first few rows of the data. 

In [None]:
# Set the working directory to the folder that holds the CSV file.
setwd("~/Downloads")

# Load the data table into a data frame.
UncleanData <- read.csv("LexicalData_toclean.csv")

# Show the first few rows.
head(UncleanData)


# If you are running this on your local computer, set your working directory to 
# the location of the lexDat data on your hard drive. Uncomment the line below
# and change the location to where it is on your computer. 
#setwd("~/Documents/PittCMU/G3/DSPN/DataSciencePsychNeuro/Homeworks/lexDat")

# If you are running this on Colab, then use something like this.
# system("gdown --id 1wSvRPME5NimUDa0t3WqNSGzimLB1uNa7")


The **LexicalData_toclean.csv** file contains the variables `Sub_ID` (Subject ID), `Trial` (the trial number), `D_RT` (reaction time) and `D_Word` (the word they were responding to).

---
## 2. **Data Cleansing (4 points)**

There are three things we want to do to make this data more useable:
* Get rid of the commas in the reaction time values, and make this variable numeric (hint: check out the functions `gsub` and `as.numeric`).
* Get rid of rows where the reaction times are missing (hint: you can use the `filter` function from `tidyverse`, but you'll need to load the library).
* Make sure all of the reaction times are positive. 

Write code that will copy the data to a new variable and make the above changes. 
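
The steps above can be sketched on made-up values. This is a minimal illustration, not the homework data: the `toy` data frame and its contents are hypothetical, and base R subsetting stands in for `filter` so the sketch runs without loading a library.

```r
# Hypothetical toy data: reaction times stored as text, one with a comma,
# one missing, one negative.
toy <- data.frame(Trial = 1:4,
                  D_RT = c("710", "1,024", NA, "-650"),
                  stringsAsFactors = FALSE)

# Work on a copy so the original stays untouched.
toy_clean <- toy

# Remove commas, then convert the text to numbers.
toy_clean$D_RT <- as.numeric(gsub(",", "", toy_clean$D_RT))

# Drop rows whose reaction time is missing.
toy_clean <- toy_clean[!is.na(toy_clean$D_RT), ]

# Count the reaction times that are not positive.
n_not_positive <- sum(toy_clean$D_RT <= 0)
```

Here `toy_clean` keeps three rows, and `n_not_positive` flags the one value that still violates the "must be positive" rule.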

In [None]:
# INSERT CODE HERE

##First Action: Create a new variable that contains the values within D_RT as numeric and 
## without commas.

New_D_RT <- as.numeric(gsub(",","",UncleanData$D_RT))
head(New_D_RT)

##Replace the original D_RT values with the New_D_RT values within UncleanData;
## Now, the column will have numerical values without commas.
UncleanData$D_RT <- New_D_RT

head(UncleanData)

##Second Action: Begin by loading tidyverse.
library(tidyverse)

##Remove rows that have an NA (missing) value within the D_RT column.
## Assign this altered frame to new variable Data_Without_Nans.
Data_Without_Nans <- UncleanData %>%
  filter(!is.na(D_RT))

head(Data_Without_Nans)

##Third Action: Check if any reaction times (D_RT) are negative.  
## Returns the number of values in the D_RT column that are negative.
sum(Data_Without_Nans$D_RT < 0)

##Returns the number of values in the D_RT column that equal zero.
sum(Data_Without_Nans$D_RT == 0)




For each of the three actions above, is it addressing a data anomaly (as described in the Müller reading)? If so, name the *type* of anomaly it's addressing. 

> *Write your response here.*
> * First action: This action addresses a syntactic anomaly (a domain format error): the commas are removed so the values can be stored as numbers.
> * Second action: This action addresses a coverage anomaly by removing rows with missing values.
> * Third action: This action addresses a semantic anomaly by resolving a constraint violation; reaction times must be positive.

---
## 3. **Data Manipulation with Tidyverse (4 points)**

Now let's use `tidyverse` functions to play around with this data a bit. Use the piping operator (`%>%`) in both of these code cells. 

First, let's get some useful summary **statistics** using `summarise`. Output a table that tells us how many observations there are in the data set, as well as the mean and standard deviation of the reaction times.
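
As a small sketch of how `summarise` collapses a table into one row, the example below uses a hypothetical `toy` data frame (not the homework data) and assumes the `dplyr` package is installed. Inside `summarise`, `n()` gives the number of rows.

```r
library(dplyr)

# Hypothetical toy data, not taken from the homework file.
toy <- data.frame(D_RT = c(500, 600, 700))

# n() counts the rows in the current group (here, the whole table).
toy_summary <- toy %>%
  summarise(n_obs = n(), mean_rt = mean(D_RT), sd_rt = sd(D_RT))
```

The result is a one-row table with `n_obs` equal to 3, `mean_rt` equal to 600, and `sd_rt` equal to 100.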

In [None]:

##Output a table that displays the total number of observations,
## the mean response time (D_RT) and sd of the response time (D_RT).
Data_Without_Nans %>%
  summarise(total_observations = n(), mean_response = mean(D_RT), sd_response = sd(D_RT))


Now, we'll use `mutate` to re-number the trials, starting from 0 instead of 1. Make a new variable that is equal to the `Trial` variable minus one. 
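
One detail worth keeping in mind: `mutate` returns a new data frame and leaves its input unchanged, so the result has to be assigned to keep the new column. A minimal sketch, using a hypothetical `toy` data frame and column name `Trial0` (neither is part of the homework data) and assuming `dplyr` is installed:

```r
library(dplyr)

# Hypothetical toy data, not taken from the homework file.
toy <- data.frame(Trial = 1:3)

# mutate() returns a new data frame; the input is left unchanged.
toy_renumbered <- toy %>%
  mutate(Trial0 = Trial - 1)

names(toy)             # still only "Trial"
names(toy_renumbered)  # "Trial" and "Trial0"
```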

In [None]:

##Within Data_Without_Nans, make a new column where each value is 
## the corresponding Trial value minus 1, and save the result back
## to Data_Without_Nans so the new variable new_trial is kept.
Data_Without_Nans <- Data_Without_Nans %>%
  mutate(new_trial = Trial - 1)

head(Data_Without_Nans)


---
## 4. **Plotting Data (1 point)**

Use the `plot()` function to visualize the data, in a way that helps you see if there's a relationship between `D_RT` and your new trial variable.

In [None]:
##Make a quick scatter plot where x=new_trial, y=D_RT, and the 
## data source is Data_Without_Nans.
plot(Data_Without_Nans$new_trial, Data_Without_Nans$D_RT,
     xlab = "new_trial", ylab = "D_RT")

That's all for Homework 2! When you are finished, save the notebook as Homework2.ipynb, push it to your class GitHub repository (the one you made for Homework 1) and send the instructors a link to your notebook via Canvas. You can send messages via Canvas by clicking "Inbox" on the left and then pressing the icon with a pencil inside a square.

**DUE:** 5pm EST, Feb 14, 2022

**IMPORTANT** Did you collaborate with anyone on this assignment? If so, list their names here. 
> *Marc Levesque*