# Python for Psychologists - Session 7

## Handling dataframes

From session 3, we are already familiar with handling data in **pandas** and with creating dataframes from scratch. Most of the time, however, we don't create the dataframe ourselves but receive a table (e.g. an .xls, .csv or .txt file) containing the data. Today we cover some basic steps for reading, cleaning and processing data.

In [None]:
import pandas as pd 

To read an existing data file into a dataframe we can use e.g. ```pd.read_csv("folderstructure/folder/file.csv")```. Keep in mind that pandas offers further ```pd.read_*``` functions for other file types (e.g. ```pd.read_excel```). Depending on your file, you also need to specify the *separator* that delimits the values within each row.

Let's read the **results1.csv** file from your data directory using one of
- ```pd.read_csv("file", sep="\t")``` for tab-separated values
- ```pd.read_csv("file", sep=",")``` for comma-separated values


In [1]:
df = pd.read_csv("/Users/Dominik/Documents/WS19_Python_for_Psychologists/session7/daten_python", sep="\t")
df.head()

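A minimal sketch of why the separator matters. The table below is made up (the column names `vpn_number` and `rt` and all values are assumptions, not the course data), and `io.StringIO` stands in for a file on disk so that the example runs anywhere.

```python
import io
import pandas as pd

# Made-up tab-separated content standing in for a file on disk
raw = "vpn_number\trt\n1\t0.41\n1\t0.52\n2\t0.47\n"

# With the matching separator, pandas finds two columns
df_tab = pd.read_csv(io.StringIO(raw), sep="\t")

# With the wrong separator, each row ends up in one single column
df_comma = pd.read_csv(io.StringIO(raw), sep=",")

print(df_tab.shape)
print(df_comma.shape)
```

If a freshly loaded dataframe has only one column, a wrong separator is a likely cause.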

With ```dataframe.head()``` we get a quick first glance at the first five rows of our dataframe. We can change the number of rows by passing another number in the parentheses.

If you try to display the whole dataframe, you usually see only a section of its rows or columns, as pandas truncates the output by default. With ```pd.options.display.max_rows``` and ```pd.options.display.max_columns``` we can check the current limits and, if needed, adjust them by assigning a new value, e.g. ```pd.options.display.max_rows = int```.

In [None]:
pd.options.display.max_rows
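A minimal sketch of both ideas, using a made-up dataframe with 100 rows (the column name `trial` is an assumption). It shows `head` with a custom number of rows and how the display limit can be read, changed and restored.

```python
import pandas as pd

# Made-up data: 100 rows of trial numbers (illustrative only)
df_demo = pd.DataFrame({"trial": range(100)})

# First three rows instead of the default five
first_three = df_demo.head(3)

# Read the current limit, change it, then restore it
default_rows = pd.options.display.max_rows
pd.options.display.max_rows = 200      # show up to 200 rows
# Equivalent call: pd.set_option("display.max_rows", 200)
new_rows = pd.options.display.max_rows
pd.options.display.max_rows = default_rows

print(first_three)
print(new_rows)
```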

As you can see, our dataframe only contains reaction times from a single participant. Unless we are super interested in single-case studies, this is usually not the type of data we are handling. Most often, especially after conducting an experiment, we (hopefully) have data from more than one subject stored on our disk.



Most of the time your data is stored somewhere on your disk. To interact with the operating system from a Jupyter notebook, we can use:

In [None]:
import os

As already mentioned above, we usually end up with more than one datafile from different subjects. 

To check e.g. for files in a given directory, we can use ```os.listdir("directory")```

In [None]:
# keep only the .csv files; this also skips hidden files such as .DS_Store, which macOS creates
source = sorted(f for f in os.listdir("/Users/Dominik/Documents/daten_python") if f.endswith(".csv"))
source

In our simple example, we have data from 8 different participants. Conveniently, all our files are named in the same way, i.e. results**number**.csv

To load the data file of each subject iteratively, we could do something like this:

In [None]:
logfile = "/Users/Dominik/Documents/daten_python/results{}.csv"  # path template with a {} placeholder for the subject number
logfile

We can use ```"some_string{}".format(value)``` to fill the placeholder {} in a string with a value, either a fixed one or a different one on each iteration of a loop.

Let's see how it works with a single string:

Or with a whole sentence:
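A minimal sketch of `str.format` for a single string, a whole sentence and a loop. The file name pattern follows the one used above; the participant number and the RT value in the sentence are made up.

```python
# Fill a single placeholder with a fixed value
template = "results{}.csv"
single = template.format(1)

# Fill several placeholders inside a whole sentence (made-up values)
sentence = "Participant {} responded after {} seconds.".format(3, 0.46)

# Fill the placeholder with a different value on each iteration
filenames = [template.format(number) for number in range(1, 4)]

print(single)
print(sentence)
print(filenames)
```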

Let's load the data file of each subject and combine them into one single dataframe:

In [None]:
                              # create a subject list


                                 # create an empty list for all dataframes


                                 # use the path template and .format to fill in the subject on each pass of the for loop
                                 # read the file into a dataframe
                                 # append each subject's dataframe to the list
                                    # use pd.concat to create one final dataframe with all subjects
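One possible way to fill in these steps, sketched under assumptions: the example first writes three tiny made-up result files into a temporary folder so that it runs anywhere. The subject numbers 1 to 3 and the columns `vpn_number` and `rt` are illustrative and do not reproduce the course files.

```python
import os
import tempfile

import pandas as pd

with tempfile.TemporaryDirectory() as folder:
    # Path template with a {} placeholder, as in the logfile variable above
    template = os.path.join(folder, "results{}.csv")

    # Write three tiny made-up result files, one per subject
    for subject in [1, 2, 3]:
        fake = pd.DataFrame({"vpn_number": [subject] * 2, "rt": [0.4, 0.5]})
        fake.to_csv(template.format(subject), index=False)

    subjects = [1, 2, 3]   # subject list
    all_dfs = []           # empty list for all dataframes
    for subject in subjects:
        filename = template.format(subject)      # fill in the subject number
        all_dfs.append(pd.read_csv(filename))    # read the file, keep the dataframe

# One final dataframe with all subjects; ignore_index renumbers the rows
df_all = pd.concat(all_dfs, ignore_index=True)
print(df_all.shape)
```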

Or we could do something like this, which might be a bit handier:

In [None]:
source

In [None]:
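path = "/Users/Dominik/Documents/daten_python/"

all_dfs = []
for filename in source:
    # read each file listed in source and collect the dataframes
    all_dfs.append(pd.read_csv(os.path.join(path, filename)))
df = pd.concat(all_dfs, ignore_index=True)  # ignore_index gives the combined rows a fresh index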

To get a first idea of your dataframe, its shape, its column names and some first descriptive values, we can use:

- ```dataframe.shape```: the number of rows and columns
- ```dataframe.columns```: the column names
- ```dataframe.describe()```: some basic descriptive values (e.g. min, max) for numeric columns

```dataframe.describe()``` in particular gives you a first idea of whether there are missing values in your data, since its *count* row lists the number of non-missing values per column. However, it only covers columns with numeric values.


As we can see, ```dataframe.describe()``` only summarizes columns with numeric values. To get a better idea of what the other columns contain, we can look at their unique values using ```dataframe["some_column"].unique()```
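A minimal sketch of these inspection tools on a made-up dataframe. The column names `vpn_number`, `condition` and `rt` and all values are assumptions for illustration.

```python
import pandas as pd

# Made-up trial data (column names and values are assumptions)
df_demo = pd.DataFrame({
    "vpn_number": [1, 1, 2, 2],
    "condition": ["congruent", "incongruent", "congruent", "incongruent"],
    "rt": [0.41, 0.52, 0.47, 0.44],
})

print(df_demo.shape)                    # (number of rows, number of columns)
print(list(df_demo.columns))            # column names
summary = df_demo.describe()            # covers the numeric columns only
print(summary)
conditions = df_demo["condition"].unique()   # distinct values of a text column
print(conditions)
```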

To further check for missing values we can use ```dataframe.isnull()```, which returns a dataframe with a boolean value (i.e. True or False) for each cell. For a more compact overview, we can append ```.sum()``` to get the number of missing values per column.

Luckily we don't have any missing values in our dataframe. However, if you have some in yours, don't worry: there are ways to either fill or drop missing values.

- ```dataframe.fillna(some value, inplace=True)``` fills missing values with "some value"
- ```dataframe.dropna(axis=0 , how="any")``` drops rows with at least one missing value
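A minimal sketch of counting, filling and dropping missing values on a made-up dataframe with one missing RT. The fill value 0 is only a placeholder to show the mechanics, not a recommendation for reaction times.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing reaction time
df_demo = pd.DataFrame({"vpn_number": [1, 1, 2], "rt": [0.41, np.nan, 0.47]})

# Number of missing values per column
missing_per_column = df_demo.isnull().sum()

# Without inplace=True both methods return a new dataframe
filled = df_demo.fillna(0)                      # NaN replaced by the placeholder 0
dropped = df_demo.dropna(axis=0, how="any")     # rows with any NaN removed

print(missing_per_column)
print(len(dropped))
```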

Let's have another look at today's example. We have information about the stimulus material, the correct *answer*, the actual button press and the response/reaction time (RT).

From the ```dataframe.describe()``` output above, we already know that the mean RT was about 0.46 s. Try to recreate this result using the ```.mean()``` function and round it to two decimals using ```round(some_number, number_of_decimals)```

Also try to recreate the *min* / *max* / *standard deviation* from the **describe** table
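A minimal sketch of these descriptive statistics on made-up reaction times (the column name `rt` and the values are assumptions, so the numbers differ from the course data).

```python
import pandas as pd

# Made-up reaction times in seconds
df_demo = pd.DataFrame({"rt": [0.40, 0.50, 0.45, 0.65]})

mean_rt = round(df_demo["rt"].mean(), 2)   # mean, rounded to two decimals
min_rt = df_demo["rt"].min()
max_rt = df_demo["rt"].max()
sd_rt = df_demo["rt"].std()                # sample SD (ddof=1), as reported by describe()

print(mean_rt, min_rt, max_rt, round(sd_rt, 2))
```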

We can see that both the mean RT and the SD are quite small (under 1 s). Let's display every trial with an RT above 1 s as a *potential* outlier, using ```dataframe.loc``` and a conditional (i.e. boolean) statement.
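A minimal sketch of boolean selection with `.loc`, again on made-up data (column names and values are assumptions).

```python
import pandas as pd

# Made-up trials, two of them slower than 1 s
df_demo = pd.DataFrame({
    "vpn_number": [1, 1, 2, 2],
    "rt": [0.41, 1.30, 0.47, 1.05],
})

# The comparison yields one True/False per row; .loc keeps the True rows
slow_trials = df_demo.loc[df_demo["rt"] > 1]
print(slow_trials)
```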

To check visually whether certain values might be outliers, we can use a quick and, to be honest, not very aesthetic visualization: ```dataframe["some_column"].plot(kind=...)``` with e.g. ```kind="box"```, ```"hist"``` or ```"bar"```.
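A minimal sketch of such a quick plot, assuming matplotlib is installed (pandas uses it for plotting). The data are made up, with one deliberately slow trial.

```python
import pandas as pd

# Made-up reaction times with one slow trial
df_demo = pd.DataFrame({"rt": [0.41, 0.52, 0.47, 0.44, 1.30]})

# Box plot of a single column; returns a matplotlib Axes object
ax = df_demo["rt"].plot(kind="box")
ax.set_ylabel("RT (s)")
```

Swapping `kind="box"` for `kind="hist"` shows the distribution of the same column as a histogram.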

After the RT, let's investigate the accuracy of our participants. To do so, we need to use the information given in our dataframe.


Let's create a new column *iscorrect*:

How can we assess the overall mean accuracy in our experiment? 

Why can we assess it in this way?
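One possible approach, sketched on made-up data: compare the correct answer with the actual response to get a boolean column, then take its mean. The column names `answer` and `response` are assumptions and may differ from the ones in your files. Taking the mean works because pandas counts True as 1 and False as 0, so the mean equals the proportion of correct trials.

```python
import pandas as pd

# Made-up trials; the column names "answer" and "response" are assumptions
df_demo = pd.DataFrame({
    "answer":   ["left", "right", "left", "right"],
    "response": ["left", "right", "right", "right"],
})

# True where the button press matches the correct answer
df_demo["iscorrect"] = df_demo["answer"] == df_demo["response"]

# True counts as 1 and False as 0, so the mean is the proportion correct
mean_accuracy = df_demo["iscorrect"].mean()
print(mean_accuracy)
```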

The overall accuracy is quite high. Let's see how each individual participant performed. For this we can use ```dataframe.groupby(["some_column"]).method()``` to split our dataframe into groups, apply a function to each group and combine the results.
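A minimal sketch of split-apply-combine on made-up data. The name `accuracy` matches the variable used in the for loop further down; the values are illustrative.

```python
import pandas as pd

# Made-up trials from two participants
df_demo = pd.DataFrame({
    "vpn_number": [1, 1, 2, 2],
    "iscorrect": [True, False, True, True],
})

# Split by participant, apply mean() to iscorrect, combine into one Series
accuracy = df_demo.groupby(["vpn_number"])["iscorrect"].mean()
print(accuracy)
```

The result is a Series indexed by `vpn_number`, so `accuracy[1]` returns the accuracy of participant 1.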

Now we would like to add the accuracy from our aggregated dataframe back to our "non-aggregated" dataframe. To do so, we need a list that matches the length of our original dataframe.

We can build it with a list comprehension:

In [None]:
acc_list=[]

  

df.head() 
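A minimal sketch of the list comprehension on made-up data, assuming `accuracy` is a Series indexed by `vpn_number` as produced by a groupby. The last line shows `Series.map` as an alternative that gives the same values without building a list.

```python
import pandas as pd

# Made-up trials and the per-participant accuracy derived from them
df_demo = pd.DataFrame({
    "vpn_number": [1, 1, 2, 2],
    "iscorrect": [True, False, True, True],
})
accuracy = df_demo.groupby("vpn_number")["iscorrect"].mean()

# Look up the accuracy of the participant of every single row
acc_list = [accuracy[vpn] for vpn in df_demo["vpn_number"]]
df_demo["accuracy"] = acc_list

# Alternative: map each participant number onto its accuracy
mapped = df_demo["vpn_number"].map(accuracy)
print(df_demo)
```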

Or with a for loop:

In [None]:
acc_list=[]

for vpn in df["vpn_number"]:
    acc_list.append(accuracy[vpn])
    
df["accuracy"] = acc_list    
    
df.head()    

Try to add the error rate as simply as possible

In [None]:

df.head()
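One simple option, sketched on made-up values: the error rate is the complement of the accuracy, so it can be computed for the whole column at once. The column name `error_rate` is an assumption.

```python
import pandas as pd

# Made-up accuracy column
df_demo = pd.DataFrame({"accuracy": [0.5, 0.5, 1.0, 1.0]})

# Arithmetic on a column is applied to every row
df_demo["error_rate"] = 1 - df_demo["accuracy"]
print(df_demo)
```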

Let's see whether the RT differs between correct and incorrect responses:

Since we already know that there is more than one correct way, let's investigate the RT differences between correct and incorrect responses by creating two new dataframes (i.e. df_corr and df_incorr) using ```dataframe.loc```

In [None]:
print()
print()
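A minimal sketch of this split on made-up data, assuming a boolean column `iscorrect`. The `~` operator negates the boolean column and therefore selects the incorrect trials.

```python
import pandas as pd

# Made-up trials with a boolean iscorrect column
df_demo = pd.DataFrame({
    "rt": [0.40, 0.50, 0.60, 0.70],
    "iscorrect": [True, True, False, True],
})

df_corr = df_demo.loc[df_demo["iscorrect"]]      # correct trials
df_incorr = df_demo.loc[~df_demo["iscorrect"]]   # incorrect trials

print(df_corr["rt"].mean())
print(df_incorr["rt"].mean())
```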

We could also combine different statements and e.g. group our dataframe by subject and evaluate the mean RT for correct trials only. We can do this by combining ```df.loc``` and ```df.groupby```
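A minimal sketch of chaining both steps on made-up data (column names and values are assumptions): first select the correct trials, then group the remaining rows by participant.

```python
import pandas as pd

# Made-up trials from two participants
df_demo = pd.DataFrame({
    "vpn_number": [1, 1, 2, 2],
    "rt": [0.40, 0.90, 0.50, 0.60],
    "iscorrect": [True, False, True, True],
})

# Keep correct trials only, then compute the mean RT per participant
mean_rt_correct = (
    df_demo.loc[df_demo["iscorrect"]]
    .groupby("vpn_number")["rt"]
    .mean()
)
print(mean_rt_correct)
```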

Let's visualize the RT for correct and incorrect trials

In [None]:
import seaborn as sns

In [None]:
sns.barplot()

To further specify our plot, we can set hue (i.e. color coding in seaborn) to *condition*

In [None]:
sns.barplot()
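A minimal sketch of how the arguments of `sns.barplot` could be filled in, assuming seaborn and matplotlib are installed. The data and the column names `iscorrect`, `condition` and `rt` are made up.

```python
import pandas as pd
import seaborn as sns

# Made-up trials: two observations per combination of iscorrect and condition
df_demo = pd.DataFrame({
    "iscorrect": [True, True, True, True, False, False, False, False],
    "condition": ["a", "a", "b", "b", "a", "a", "b", "b"],
    "rt": [0.41, 0.45, 0.48, 0.52, 0.44, 0.46, 0.50, 0.47],
})

# Bars show the mean RT per group; hue splits each bar by condition
ax = sns.barplot(data=df_demo, x="iscorrect", y="rt", hue="condition")
```

Leaving out `hue` gives one bar for the correct and one for the incorrect trials.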

The visualization suggests that there is no difference between RT_correct and RT_incorrect. However, for educational purposes only, let's calculate a paired t-test.

Most statistical tests are available in the **stats** module of the **scipy** package

In [None]:
from scipy import stats
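A minimal sketch of a paired t-test with `stats.ttest_rel`, assuming scipy is installed. A paired test needs one value per participant and condition, so the made-up table below holds one mean RT for correct and one for incorrect trials per participant. The column names `rt_correct` and `rt_incorrect` and all values are assumptions.

```python
import pandas as pd
from scipy import stats

# Made-up mean RTs, one row per participant
rt_means = pd.DataFrame({
    "rt_correct":   [0.45, 0.47, 0.44, 0.50, 0.46],
    "rt_incorrect": [0.46, 0.45, 0.47, 0.49, 0.48],
})

# Paired (dependent samples) t-test across participants
result = stats.ttest_rel(rt_means["rt_correct"], rt_means["rt_incorrect"])
print(result.statistic, result.pvalue)
```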

To save your new dataframe, you can use

```dataframe.to_csv("location_on_disk", potential settings such as sep=)```

In [None]:
#df.to_csv("/Users/Dominik/Documents/daten_python/final.csv", sep=",")
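A minimal sketch of saving and reloading a dataframe. It writes to a temporary folder instead of the path above so that it runs anywhere; the data are made up. `index=False` leaves out the row index, which would otherwise come back as an extra unnamed column when the file is read again.

```python
import os
import tempfile

import pandas as pd

# Made-up data to save
df_demo = pd.DataFrame({"vpn_number": [1, 2], "rt": [0.41, 0.47]})

with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "final.csv")
    df_demo.to_csv(target, sep=",", index=False)   # write without the row index
    reloaded = pd.read_csv(target, sep=",")        # read it back to check

print(reloaded)
```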