<img src="./images/title_slide.png" width="1020" align="center" border = "5px solid #555">

<img src="./images/presenters.png" width="1020" align="center" border = "5px solid #555">

<img src="./images/why_python.png" width="1020" align="center" border = "5px solid #555">

<img src="./images/python_better_than_excel.png" width="1020" align="center" border = "5px solid #555">

# The goal is to eventually be like this guy:

<img src="https://www.abc.net.au/cm/rimage/12082194-3x2-xlarge.jpg" width="750" border = "5px solid #555" align="center" >

# But we'll start here:

<img src="./images/snake-charmer-simple-icon-vector-6819909.jpg" width="200" align="center" border = "5px solid #555">

# This is a "Jupyter notebook": a beginner-friendly environment to start learning and playing with Python in a web browser.
# A notebook consists of _cells_ (see what the arrow is pointing to below). Each cell can contain:
## 1. Python code
## 2. Markdown (i.e., a _fancy_ way to present **text**)
## 3. `Raw text`

 <img src="./images/down_arrow.gif" width="100" align="left">

In [None]:
# This is a python code cell



# 
# Each cell below contains Python code. You can run the code in the cell by clicking into the cell, then pressing <font color='#9E54C0'>**CTRL-ENTER**</font>, or by clicking the Play icon on the upper left. The output of the code (if any) shows up right below that cell.
<!-- # <img src="./images/notebook_options_light.png"> -->

# <img src = "./images/binder_header.png">

### Click anywhere inside this next cell and run it by pressing CTRL-ENTER

In [None]:
print("Hello, world!")

# 
# For the remainder of this notebook, pay attention to the text color _above_ cells:
### 1. Most of the text is in black font, which is informative only. 
### 2. The word <span style="color:#9E54C0"> **"Exercise"** </span> indicates when you should try out what you've learned.
# 

# Table of Contents

### 1. Python Basics
### 2. Intro to Pandas (Essential Data Analysis Package)
### 3. Read in existing data
### 4. Basic shape of the data (size, mean, sd, median, etc)
### 5. Transform data
### 6. Examine missing data
### 7. Categorical variables: frequencies
### 8. Filter data
### 9. Compare groups on a measure
### 10. Quickly visualize your data
### 11. How to teach yourself

# 
# **PYTHON BASICS**

# Let's start with some basic information that you'll need to know in order to work through the exercises today.

## 



### In this section, we'll be going over _**variables**_.
### In math we learned about variables (e.g., x = 4), and in Python variables generally work in the same way. 
### A **variable** is a name that stands in for one or more values. 
# 
### In Python, you may use any name you want for a variable **so long as**:
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1. The variable name does **not** start with a number 
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2. There are **no spaces** between characters in the variable name

### Some variables contain just one value. Those can be one of **four types**:
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1. an **integer** (e.g., 5)
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 2. a **float** (e.g., 5.0)
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3. a **string** (e.g., 'abc' or '5')
### &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 4. a **boolean** (True or False)
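
### A quick sketch, in case it helps: Python's built-in `type()` function reports which of the four types a value is. The list name `examples` and its values are made up for illustration.

```python
# Hypothetical example values, one per type
examples = [5, 5.0, 'abc', True]

for value in examples:
    # type(value).__name__ is the short name of the type, e.g. 'int'
    print(value, 'is of type', type(value).__name__)

# Collect the type names in a list so they can be inspected together
type_names = [type(value).__name__ for value in examples]
print(type_names)
```

### Note that `5` and `5.0` look similar but have different types (integer vs. float).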

In [None]:
# Remember, use CTRL-ENTER to run a cell

a = 5

b = 5.0

c = 'abc'

d = False

print(a)
print(b)
print(c)
print(d)

### True is equivalent to 1, False is equivalent to 0

In [None]:
print(True + True)
print(False * 10)

### The "==" symbol checks if two things are equal

In [None]:
print(3 == 5)
print(5 == 5)
print(a == 5)
print(False == 0)

## Python Lists

### A variable can also contain many values at once. The most common example of this kind of variable in Python is a **list**. 
### You know it's a list because it has brackets [ ] with one or more values inside.
### A list can hold any number of values of any kind, which makes it incredibly flexible and useful for Python coders.

In [None]:
my_list = ['a', 5, True, 'DESIGN', 0.9]

print(my_list)
print(type(my_list))
print(len(my_list))

### Generally, we create lists because we want to access the items inside.

### If you want a particular value in the list, use brackets to extract it. The number inside the brackets indicates the position inside the list.

In [None]:
print(my_list[0])

### Why 0? Python is what's called a "zero-indexed" language, where the first thing in an ordered data structure (such as a list) is actually the "zeroth" thing.

### <font color='#9E54C0'> **EXERCISE: Print the last (5th) item in _my_list_.**  </font>

In [11]:
print(____)

### 
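
### Positions can also be counted from the end of a list, and several items can be taken at once with a "slice". The sketch below uses a made-up list called `letters`; it is an illustration rather than part of the workshop data.

```python
letters = ['a', 'b', 'c', 'd', 'e']  # hypothetical list for illustration

first = letters[0]          # position 0 is the first item
last = letters[-1]          # negative positions count backwards from the end
first_three = letters[0:3]  # a slice: positions 0, 1 and 2 (the end position, 3, is excluded)

print(first, last, first_three)
```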
## Errors in Python
### If you try something crazy, Python will print an **error**. 
### At first, errors may be incomprehensible and frustrating. 
### With a little experience, it'll start becoming obvious what you need to fix as soon as you see an error. You'll say "duh!" a lot and it'll feel good.
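
### The most useful part of an error is usually its last line: the error's name plus a short message. The sketch below uses `try` / `except`, which goes a little beyond today's basics, only to capture an error so that its name and message can be printed without stopping the cell.

```python
my_list = ['a', 5, True, 'DESIGN', 0.9]

try:
    print(my_list[10])  # there is no position 10 in a 5-item list
except IndexError as error:
    error_name = type(error).__name__  # the kind of error
    error_message = str(error)         # the explanation Python gives
    print(error_name, '-', error_message)
```

### Searching the web for the error name and message is often the fastest way to find a fix.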

### 
### <font color='#9E54C0'> **EXERCISE: Try running each cell below.**  </font>

In [None]:
print(3

In [None]:
5 = a

# 
# Comments
### Comments are incredibly useful in any coding language. 
### Code is read far more often than it is written, so comments help others (including future versions of yourself) understand what your code is doing.

### Comments are also a useful way to keep code in a particular cell without having to run it.

In [None]:
# This is a comment about the line of code below so I can remember what it does.
a = 5

# I want to skip the line of code below for now, without deleting it, so I commented it out.
# print(a)

### 
# That's it for Python basics! The rest of this workshop is dedicated entirely to the essential data-analysis and visualization packages.
# **If you'd like to learn more about Python, please see the links in our reference section below.**
### 

# 
# **ESSENTIAL DATA SCIENCE PACKAGES**
### <img src="./images/pandas_logo.png" width="400" align="center">

## The vast majority of what you do in Python will involve working within numpy and pandas (mostly pandas), rather than base Python.
### 

# Importing packages to use in your notebook
### **Packages** are collections of useful functions that allow you to do something with your data.

### Luckily for you, we've already installed these packages in your workspace.
### Even though packages may already be installed, you still need to **import** them into your notebook.
### Generally, Python programmers abbreviate package names when they import them, so that they have less to type when referencing the package in the notebook.

In [2]:
import pandas as pd

### From this point on in the notebook, "pd. ..." means it comes from pandas.

### 
# **BACKGROUND: FACIAL MICROEXPRESSIONS DATASET**
### 

### In 2018, our Design Research group ran a standard, ~30-minute, 1:1 in-person formative test of an enterprise software prototype with Oracle employee participants.
### The test contained 3 realistic tasks for each participant to complete while talking out loud. Each task ended with a special **Submit "emobutton"** that let the participant indicate their experience on the page (negative, neutral, or positive) at the same moment as submitting their data entry on the page. 
# 
# 
# 
# 

<img src="./images/emo_button.gif" width="400" align="center">

# 
## Testing this "emobutton" was the main goal of the research project: 
- <font size = 5 > Would participants notice it? </font>
- <font size = 5 > Would it seem like a valid measure of happiness or frustration? </font>

### Another special part of the study was that we recorded participants' faces the whole time. Then we ran the recordings through an emotional expression-detecting software called Noldus Facereader. This gave us a measure of how happy, confused, etc., each participant's face looked every 10th of a second for the entire duration. 
# 
 <img src="./images/facereader example Hillel.png" width="750" align="center">

# 
# 
# <font color='#32925E'> **Our focus for the remainder of the workshop will be to explore some basic, fundamental questions about the data from this project - the <font size = 6 > "shape of the data".**</font> </font>
# 
# 

# **PANDAS**

<img src = "./images/distracted_researcher_meme.jpg" width = "900" align = "center" border = "5px solid #555" >


### Pandas is a package that allows you to efficiently view, edit, transform, and extract insights from 2-dimensional tables of data (called "dataframes").
### It's sort of like Excel, but it can do many things faster and more easily.

### Dataframes are organized in a Rows x Columns format, just like an Excel spreadsheet.

<img src = './images/dataframe_pic.png' width = 400 align = "center" > 
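
### To make the Rows x Columns idea concrete, here is a minimal sketch that builds a tiny dataframe by hand. The name `mini_df` and the values in it are made up; only the column names are borrowed from the workshop dataset.

```python
import pandas as pd

# Hypothetical mini-dataset: each key is a column name, each list holds that column's values
mini_df = pd.DataFrame({
    'participant_id': [1, 1, 2],
    'happiness': [0.10, 0.25, 0.05],
    'gender': ['male', 'male', 'female'],
})

print(mini_df)
print(mini_df.shape)  # (number of rows, number of columns)
```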

### After importing useful packages, the next thing researchers need to do is read in data. 
### Most of the time, the data we need will be either a ".csv" or ".xlsx" file format.
# 
### Read in an **existing** CSV or Excel file and call it "df" (a common abbreviation that stands for "dataframe")

### (this is how you'll start off most of the time in real life)

In [11]:
df = pd.read_csv('face_data_all.csv.gz')

# 
### Beyond the data captured by Noldus Facereader, we also put together some data about participants. 
### We'll need to rely on information from both of those files for our work today.

### <font color='#9E54C0'> **EXERCISE: Read an Excel file named 'participant_data.xlsx' by using the command _pd.read_excel_. Call this dataframe "dfp".**  </font>

In [7]:
dfp = ____('participant_data.xlsx')

# <font color='#32925E'> **Literally, what is the shape of the data?** </font>

In [None]:
df.shape

### The 1st number in this output represents the number of rows in the df dataframe; the 2nd number is columns. 
### Given how large this file is, this is not something you'd want to open in Excel!
# 

# <font color='#32925E'> **What are the column names?** </font>

In [None]:
df.columns

# <font color='#32925E'> **That's fine, but what does the dataset actually _look like_?** </font>

In [None]:
df

### Pandas displays information in a neat and visible way. 
### When the dataframe has a large number of rows, Pandas puts "..." in the middle to show you the first five and the last five rows of the dataframe.



# <font color='#32925E'> **What if I want to see just the first few rows?** </font>

In [None]:
df.head(5)

### 
### You may have noticed the following pattern above: `df.some_attribute` or `df.some_function()`
### This makes Python syntax more user-friendly than some other languages, like Excel syntax. Instead of this:
### `function3(function2(function1(df)))` , you usually write and see this:
### `df.function1().function2().function3()` . Much better, right?
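
### A small sketch of this chaining style, assuming a made-up column of values called `scores`: each step works on the result of the step before it, reading left to right.

```python
import pandas as pd

scores = pd.Series([0.213, 0.847, 0.501, 0.926])  # hypothetical happiness values

# Sort from high to low, keep the top two values, then round to 2 decimals
top_two = scores.sort_values(ascending=False).head(2).round(2)

print(top_two.tolist())
```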
### 

### 
# **DIVING INTO THE FACIAL MICROEXPRESSIONS DATASET: DETAILS**

### **"participant_id"** is self-explanatory.
### Every row in the dataset is one frame (0.1 seconds) of one participant's video. Hence, **"capture_time_secs"** is the cumulative time at frame capture; every participant starts at 0.0.
### **"happiness"** is the degree of happiness in the participant's facial expression at that moment, ranging from 0 (none) to 1 (max); **"confusion"** - same idea. 

### **"primary_emotion"** is whichever of "happiness" or "confusion" is present to a greater degree in the face. Yes, multiple emotions can be expressed and detected in faces to varying degrees at the same time. <img src="./images/smile_emoji.jpg" width="50" align="Center" border = "2px solid #555"> <img src="./images/neutral_emoji.png" width="50" align="Center" border = "2px solid #555"> <img src="./images/frown_emoji.png" width="50" align="Center" border = "2px solid #555">
### **"task"** is which of the three aforementioned tasks the participant was doing at the time of capture.
### **"self_reported_attitude"** corresponds to which part of the Submit EmoButton that participant clicked.
### **"gender"** corresponds to the gender of the participant.

In [13]:
df

Unnamed: 0,participant_id,capture_time_secs,happiness,confusion,primary_emotion,task,self_reported_attitude,gender
0,1,0.0,0.013309,,happiness,no task,neutral,male
1,1,0.1,0.013633,,happiness,no task,neutral,male
2,1,0.2,0.015672,,happiness,no task,neutral,male
3,1,0.3,0.019451,,happiness,no task,neutral,male
4,1,0.4,,,,no task,neutral,male
...,...,...,...,...,...,...,...,...
403927,22,1781.7,0.000031,0.044686,confusion,no task,negative,male
403928,22,1781.8,0.000405,0.071171,confusion,no task,negative,male
403929,22,1781.9,0.000320,0.017704,confusion,no task,negative,male
403930,22,1782.0,0.000050,0.000000,happiness,no task,negative,male


#  <font color='#32925E'> **What are all those NaNs?** </font>
### **np.nan** (as input) or **nan** or **NaN** (as output) are all the same thing: a special value that marks the lack of a real value. It stands for "**N**ot **A** **N**umber".
### CSV or Excel files that have "nothing" in the cell get automatically imported into Pandas as NaN, and you often want to keep it that way.
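
### A minimal sketch of how NaN behaves, using a made-up column called `values`. Two things are worth seeing once: pandas skips NaN when it computes statistics, and NaN is not equal to anything, not even to itself.

```python
import numpy as np
import pandas as pd

values = pd.Series([0.013, np.nan, 0.019])  # hypothetical column with one missing value

print(values.isnull().tolist())  # which entries are missing
print(values.mean())             # the mean of the two real values; NaN is skipped
print(np.nan == np.nan)          # comparing NaN with "==" never gives True
```

### This is why missing values are found with `.isnull()` rather than with `==`.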

# 
# <font color='#32925E'> **How many participants are there?** </font>

In [None]:
df['participant_id'].nunique()

### `df['some_column_name']` is how you can refer to and isolate one data column in the dataframe.
### `df['some_column_name'].nunique()` is how you can count the number of unique values.
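
### A short sketch of the difference between counting rows and counting unique values, on a made-up column called `ids`. The related function `.unique()` lists the distinct values themselves.

```python
import pandas as pd

ids = pd.Series([1, 1, 1, 2, 2, 22])  # hypothetical participant_id column

print(len(ids))       # number of rows, repeats included
print(ids.nunique())  # number of distinct values
print(ids.unique())   # the distinct values themselves
```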

# 
#  <font color='#32925E'> **What is the shape of the data for Happiness?** </font>
## Pandas also has built in features for describing some basic stats about numerical columns

In [None]:
df['happiness'].describe().round(3)

# 
## You can even describe multiple numerical columns at once. 

### The general pandas syntax for selecting multiple columns is `df[[column1, column2, etc.]]`

In [None]:
df[['happiness', 'confusion']].describe().round(3)

# 
# Often, you want to transform the values in a column to make the data easier to understand, improve visualizations, or format for specific analyses.
# <font color='#32925E'> **You want to turn seconds into minutes. Is there a simple way to do that?** </font>

### Pandas makes it very easy to apply a simple operation to an entire column of data:

In [None]:
df['capture_time_secs']

In [None]:
df['capture_time_secs'] / 60

### **Two things to note:** 
### 1. Unlike in Excel, we were able to divide all 400,000+ rows with one line of code
### 2. Even with 400,000+ rows, Python was able to perform this operation within milliseconds
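
### For comparison, here is a sketch of the same division done two ways on a made-up column called `secs`: one value at a time (the base-Python way) and on the whole column at once (the pandas way). Both give the same numbers; the second is shorter and, on large data, typically much faster.

```python
import pandas as pd

secs = pd.Series([0.0, 30.0, 90.0, 120.0])  # hypothetical capture times in seconds

# Base-Python style: visit every value one at a time
mins_loop = [value / 60 for value in secs]

# pandas style: one operation applied to the whole column
mins_column = secs / 60

print(mins_loop)
print(mins_column.tolist())
```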

### 
#  <font color='#32925E'> How do we add this as a new column in the data? </font> 
## Simply set a new column name equal to what we just created!

In [15]:
df['capture_time_mins'] = df['capture_time_secs'] / 60

In [None]:
df.head()

### 
# In the context of this dataset, the emotion-analysis software isn't perfect, and for many frames it simply cannot tell if an emotion is present, or simply cannot detect the boundaries of a face. The question is, did this happen often?

# <font color='#32925E'> **How much "happiness" data is missing?** </font>

In [None]:
df['happiness'].isnull() # tests whether each cell has a NaN in it or not. If it does, then True. If it doesn't, then False.

### Remember how `True == 1` and `False == 0`? How can we use that to answer the question?

In [None]:
n_missing = df['happiness'].isnull().sum()
print(n_missing)

### The sum of "True"s is the same thing as how many times "True" appears in the column, which is what we want!
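
### A small sketch of the same counting trick on five made-up values (the names ending in `_demo` are hypothetical, chosen so they don't overwrite anything in this notebook). Because True counts as 1 and False as 0, the mean of the True/False column is the share of values that are missing.

```python
import numpy as np
import pandas as pd

happiness_demo = pd.Series([0.01, np.nan, 0.02, np.nan, 0.05])  # hypothetical values

n_missing_demo = happiness_demo.isnull().sum()       # counts the NaNs
share_missing_demo = happiness_demo.isnull().mean()  # proportion of NaNs

print(n_missing_demo, share_missing_demo)
```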

### <font color='#9E54C0'> **OPTIONAL EXERCISE: How much "happiness" data is NOT missing?** </font>

In [None]:
# Hint: use "notnull" instead of "isnull"

n_not_missing = df['happiness'].____.sum()
print(n_not_missing)

### <font color='#9E54C0'> **OPTIONAL EXERCISE: The sum of # missing and # not missing should add up to how many total rows are in the data. Does it?** </font>

In [None]:
# Let's double check!

print("Missing and non-missing happiness rows add up to", n_missing + ____)
print("There are", df.shape[0], "rows in the data.")

# 
# <font color='#32925E'> **Which emotion (Happiness or Confusion) was the "primary emotion" more often?** </font>

In [17]:
df.head()

Unnamed: 0,participant_id,capture_time_secs,happiness,confusion,primary_emotion,task,self_reported_attitude,gender,capture_time_mins
0,1,0.0,0.013309,,happiness,no task,neutral,male,0.0
1,1,0.1,0.013633,,happiness,no task,neutral,male,0.001667
2,1,0.2,0.015672,,happiness,no task,neutral,male,0.003333
3,1,0.3,0.019451,,happiness,no task,neutral,male,0.005
4,1,0.4,,,,no task,neutral,male,0.006667


###  Let's display only the "primary_emotion" column of the dataset.

In [None]:
df['primary_emotion']

### Again, pandas makes it very easy to apply a simple operation to an entire column of data.
### What do you think the cell below does?

In [None]:
df['primary_emotion'] == 'happiness' # Note that missing values of primary_emotion also return False; after all, they are not equal to 'happiness'.

In [None]:
(df['primary_emotion'] == 'happiness').sum()

### <font color='#9E54C0'> **OPTIONAL EXERCISE: How often was Confusion the "primary emotion"?** </font>

In [None]:
(df['primary_emotion'] == 'confusion').____

### Participants looked happy far more frequently than they looked confused.
### How could we get this answer in one step?

In [None]:
df['primary_emotion'].value_counts()

### What if we want proportions instead of raw counts?

In [None]:
df['primary_emotion'].value_counts(normalize = True, dropna = False).round(2)
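
### The two options used above deserve a quick look. In this sketch on a made-up column called `emotions`, `normalize = True` turns counts into proportions, and `dropna = False` gives missing values their own row instead of leaving them out.

```python
import numpy as np
import pandas as pd

emotions = pd.Series(['happiness', 'happiness', 'confusion', np.nan])  # hypothetical values

counts = emotions.value_counts()                # raw counts; NaN is left out by default
shares = emotions.value_counts(normalize=True)  # proportions of the non-missing rows only
shares_with_nan = emotions.value_counts(normalize=True, dropna=False)  # NaN gets its own share

print(counts)
print(shares.round(2))
print(shares_with_nan.round(2))
```

### Notice that the proportion for "happiness" changes depending on whether the missing rows are included in the total.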

### <font color='#9E54C0'> **OPTIONAL EXERCISE: What proportion of the rows did participants spend in Tasks 1 vs. 2 vs. 3?** </font>

In [None]:
# Find how often the "task" column took on each of its values

df['task'].____(normalize = True, dropna = False).round(2)

In [18]:
df.head()

Unnamed: 0,participant_id,capture_time_secs,happiness,confusion,primary_emotion,task,self_reported_attitude,gender,capture_time_mins
0,1,0.0,0.013309,,happiness,no task,neutral,male,0.0
1,1,0.1,0.013633,,happiness,no task,neutral,male,0.001667
2,1,0.2,0.015672,,happiness,no task,neutral,male,0.003333
3,1,0.3,0.019451,,happiness,no task,neutral,male,0.005
4,1,0.4,,,,no task,neutral,male,0.006667


### 
# <font color='#32925E'> **Did the "primary emotion" differ by gender?** </font>

### **Filtering**: Let's first filter the dataset to only the female participants.

In [None]:
# This is what the filter looks like

df['gender'] == 'female'

In [None]:
# This is how you use the filter

df_female = df[df['gender'] == 'female']
df_female

### **Two things to note:**
### 1. df_female is a subset of df
### 2. This subset dataframe still keeps track of the original row numbers (i.e., indices)
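
### Both notes can be checked on a tiny made-up dataframe (the names `people`, `is_female` and `females` are hypothetical). If the original row numbers get in the way, `.reset_index(drop = True)` renumbers the rows from 0.

```python
import pandas as pd

# Hypothetical mini-dataset
people = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female'],
    'happiness': [0.1, 0.4, 0.2, 0.6],
})

is_female = people['gender'] == 'female'  # a column of True/False values
females = people[is_female]               # keeps only the rows where the filter is True

print(females.index.tolist())                         # original row numbers are kept
print(females.reset_index(drop=True).index.tolist())  # rows renumbered from 0
```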

### 
### <font color='#9E54C0'> **OPTIONAL EXERCISE: What were the proportions for the primary emotion for the female participants?** </font>

In [None]:
df_female['primary_emotion'].____(normalize = True).round(2)

### What about for the male participants?

In [None]:
df_male = df[df['gender'] == 'male']

df_male['primary_emotion'].value_counts(normalize = True).round(2)

### Looks like the men in this study were slightly more confused than happy, and substantially more confused and less happy than the women.
# 
# <font color='#32925E'> How could we get these answers in one step, and without creating extra smaller datasets? </font>
## Introducing **grouping**:

### Grouping in pandas works using the following syntax: 
### `df.groupby(column name to group on)[column name to aggregate].some_function()`
### Common functions to use while grouping are "value_counts", "mean", "median", "mode", "max", "min"
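
### It can help to try the pattern on a tiny made-up table first. In this sketch (the names `people` and `mean_by_gender` are hypothetical), the rows are split by gender and the mean happiness is computed within each group.

```python
import pandas as pd

# Hypothetical mini-dataset
people = pd.DataFrame({
    'gender': ['male', 'female', 'male', 'female'],
    'happiness': [0.1, 0.4, 0.3, 0.6],
})

# Group on 'gender', aggregate 'happiness', and use mean as the function
mean_by_gender = people.groupby('gender')['happiness'].mean()

print(mean_by_gender)
```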

In [None]:
df.groupby('gender')['primary_emotion'].value_counts(normalize = True).round(2)

### Let's see if participants' faces reflected their emobutton choice.

In [None]:
df['self_reported_attitude'].value_counts(normalize = True).round(2)

In [None]:
df['primary_emotion'].value_counts(normalize = True).round(2)

In [None]:
df.groupby('self_reported_attitude')['primary_emotion'].value_counts(normalize = True).round(2).sort_index(ascending = False)

### The faces of the participants who hit the "unhappy" part of the Submit EmoButton looked substantially different from those of the other two groups of participants.

# Using Python, there are numerous types of visualizations you have at your disposal to examine the shape of your data.
### The remainder of this workshop will go over different visualizations you can produce to generate meaningful insights.

### 
# <font color='#32925E'> **Were people who gave negative feedback more confused? Or less happy? Or both?** </font>

# A picture is worth a thousand words, and often researchers will use visualizations to explore their data and to isolate insights.

### Seaborn is a powerful and easy-to-learn package that can create beautiful visualizations - both quick/dirty and polished.

<img src = './images/seaborn_logo.png' width = 400 align = "center" border = "5px solid #555"> 

### Seaborn visualization gallery: https://seaborn.pydata.org/examples/index.html

In [None]:
import seaborn as sns

In [None]:
cat_df = df.groupby('self_reported_attitude')['primary_emotion'].value_counts(normalize = True).round(2).sort_index(ascending = False)

cat_df = pd.DataFrame(cat_df).rename(columns = {'primary_emotion':'proportion'}).reset_index()

cat_df

In [None]:
sns.catplot(data = cat_df, kind = 'bar', x = 'self_reported_attitude', y = 'proportion', hue = 'primary_emotion') 

### Participants who self-reported a negative attitude had less facial happiness.

# <font color='#32925E'> What were the session lengths? </font>

In [19]:
df.head()

Unnamed: 0,participant_id,capture_time_secs,happiness,confusion,primary_emotion,task,self_reported_attitude,gender,capture_time_mins
0,1,0.0,0.013309,,happiness,no task,neutral,male,0.0
1,1,0.1,0.013633,,happiness,no task,neutral,male,0.001667
2,1,0.2,0.015672,,happiness,no task,neutral,male,0.003333
3,1,0.3,0.019451,,happiness,no task,neutral,male,0.005
4,1,0.4,,,,no task,neutral,male,0.006667


In [None]:
# Hint: .max() finds the maximum of a column of numbers.

df.groupby('participant_id')['capture_time_mins'].max()

### For an even better idea of how long most sessions were, we can create a quick and dirty histogram with just one more function.

In [None]:
session_lengths = df.groupby('participant_id')['capture_time_mins'].max() 

sns.histplot(session_lengths)

# <font color='#32925E'> What is the shape (e.g., average and spread) of time spent per task? </font>

In [None]:
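# .size() counts the rows (0.1-second frames) per task and participant; dividing by 600 turns frames into minutes
task_time = (df.groupby(['task', 'participant_id'])['capture_time_mins'].size() / 600).reset_index()
task_time = task_time[task_time['task'] != 'no task'] # leave out the frames recorded outside of any task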

In [None]:
sns.boxplot(data = task_time, x = 'task', y = 'capture_time_mins')

# 
# <font color='#32925E'> **Are happiness and confusion negatively correlated?** </font>

### It seems obvious that they should be, but let's confirm!

In [None]:
# Each data point is a participant

happiness_by_participant = df.groupby('participant_id')['happiness'].mean()
confusion_by_participant = df.groupby('participant_id')['confusion'].mean()

sns.regplot(x = happiness_by_participant, y = confusion_by_participant, ci = 0)

### Looks like a negative correlation - but let's double-check that directly

In [None]:
df[['happiness', 'confusion']].corr()

### As expected, there's a small negative correlation between facial happiness and facial confusion.
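
### A note on reading the output: `.corr()` computes the Pearson correlation by default, which ranges from -1 (perfectly opposite) to 1 (perfectly aligned). The sketch below uses a made-up dataframe called `demo` to show the full table next to a single pairwise value.

```python
import pandas as pd

# Hypothetical values: confusion goes down as happiness goes up
demo = pd.DataFrame({
    'happiness': [0.1, 0.2, 0.3, 0.4],
    'confusion': [0.8, 0.6, 0.5, 0.1],
})

r_matrix = demo.corr()                         # table of correlations between all numeric columns
r = demo['happiness'].corr(demo['confusion'])  # one number for one pair of columns

print(r_matrix.round(2))
print(round(r, 2))
```

### The diagonal of the table is always 1, because every column is perfectly correlated with itself.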

# Pretty Visualizations

In [None]:
from IPython.display import display, HTML
display(HTML(filename='./images/timelines_variable_y_scale 3 Ps.html'))

# 
# <font color = "red"> That's it for now!</font> Before we move on to next steps, we'd love to hear any questions you may have regarding anything we covered today.
# <img src = './images/any_questions.jpg' width = 500 >


# 
# **Next Steps**

<img src = "./images/Next_Steps_Slide.jpg" width = "900" align = "center" border = "5px solid #555" >

# [Pandas Cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)

### This official pandas cheatsheet covers everything we talked about today plus a lot more!

# 
# Installing Anaconda: the most widely used python software for data and science!

### **STEP 1:** Go to [Anaconda's website](https://www.anaconda.com) and click on "Get Started"

<img src = "./images/Anaconda_Step1.png" width = "900" align = "center" border = "5px solid #555" >

### **STEP 2:** On the next page, click "Install Anaconda Individual Edition"

<img src = "./images/Anaconda_Step2.png" width = "900" align = "center" border = "5px solid #555" >

### **STEP 3:** Then, click "Download"

<img src = "./images/Anaconda_Step3.png" width = "900" align = "center" border = "5px solid #555" >

### **STEP 4:** Finally - paying attention to which operating system you have - click on the appropriate "graphical installer" link

<img src = "./images/Anaconda_Step4.png" width = "900" align = "center" border = "5px solid #555" >

# Once installed, type the following in your command line (or terminal for Mac users):

# <img src = "./images/running_jupyter_lab.png" width = 700>

# 
# **POST-WORKSHOP: TEACHING YOURSELF**

# <font color='#32925E'> **You want to filter a column by two or more values. You've never done this before.** </font>

### _self_reported_attitude_ currently has 3 values: negative, neutral, positive. Earlier, we found that the "negative" participants differ sharply from the "neutral" and "positive" participants. For further analyses, you'd like to lump positive and neutral participants together as one group. The first step is to find all the rows where the _self_reported_attitude_ is _either_ "positive" or "neutral". How would you do that?

### Google "pandas filter column by multiple values".
### Find: https://stackoverflow.com/questions/12096252/use-a-list-of-values-to-select-rows-from-a-pandas-dataframe

In [None]:
# Key piece of info: use the .isin() function!

list_to_find = ['positive', 'neutral'] # list of the values you want to find in the column

df['self_reported_attitude'].isin(list_to_find) # returns True in every row where 'self_reported_attitude' is either 'positive' or 'neutral'
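
### The True/False column that `.isin()` returns can be used as a filter, just like the gender filter earlier. A minimal sketch on a made-up dataframe called `attitudes` (all names here are hypothetical):

```python
import pandas as pd

# Hypothetical mini-dataset
attitudes = pd.DataFrame({
    'participant_id': [1, 2, 3, 4],
    'self_reported_attitude': ['positive', 'negative', 'neutral', 'positive'],
})

keep = attitudes['self_reported_attitude'].isin(['positive', 'neutral'])

not_negative = attitudes[keep]  # rows where the filter is True
the_rest = attitudes[~keep]     # "~" flips True and False, giving the remaining rows

print(not_negative)
print(the_rest)
```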

### 
# <font color='#32925E'> **You want to read only the first 50 rows of a CSV. You've never done this before.** </font>

### Google "pandas read csv".
### Find: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

In [None]:
# Key piece of info: use nrows parameter!

df = pd.read_csv('face_data.csv', nrows = 50)

df

### 
# <font color='#32925E'> **You want to save your CSV. You've never done this before.** </font>

### Google "pandas save csv".

### Find: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

In [None]:
# Key piece of info: use .to_csv() with the path where you want to save the file!

df.to_csv('example/path/to/save/your/file/file_name.csv')
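
### One detail worth knowing: by default, `.to_csv()` also saves the row numbers, and they come back as an extra "Unnamed: 0" column when the file is read again. The sketch below shows this on a made-up dataframe called `demo`; to avoid creating files, it writes to text in memory instead of to disk.

```python
import io
import pandas as pd

demo = pd.DataFrame({'participant_id': [1, 2], 'happiness': [0.1, 0.2]})  # hypothetical data

with_index = demo.to_csv()                # no path given: returns the CSV text instead of saving a file
without_index = demo.to_csv(index=False)  # leaves out the row numbers

# io.StringIO lets pandas read the text as if it were a file
cols_with_index = pd.read_csv(io.StringIO(with_index)).columns.tolist()
cols_without_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

### Passing `index = False` when saving, or `index_col = 0` when reading, are two ways to avoid the extra column.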

# 
# <font color='red'> **ADVANCED SECTION: TRY AT YOUR OWN RISK** </font>

# <font color='#32925E'> **Finding delight: Identifying when participants are "extremely happy"** </font>

### In UX, this facial emotion-analysis tool gives us the unique opportunity to go back and see what made participants happiest during a task, even if they didn't self-report it. But first we need to define the cutoff for "extremely happy".
### An accepted way to "standardize" continuous variables (i.e., put them all on a comparable scale) is to create a **Z-score**: for every value, subtract the mean and divide by the standard deviation. The resulting variable always has Mean == 0 and Standard Deviation == 1. A Z-score >3 or <-3 is commonly considered extreme.

### Let's write a reusable function that will take any one Happiness value and make a Happiness Z-score out of it. Functions in Python are defined like this:
### `def my_function(inputs):`
###     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;do stuff to the inputs to create an output variable
###     &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`return output`

In [None]:
happiness_mean = df['happiness'].mean() # Average face happiness across all participants
happiness_std = df['happiness'].std() # Standard deviation (variability) of face happiness across all participants

In [None]:
# Give your new function a name: z_score
def z_score(happiness_value): 
    
    """ This function takes one happiness value and changes it to a Z-score value. """
    
    # From the happiness value, subtract the mean of all happiness values and divide by the standard deviation of all happiness values
    happiness_z = (happiness_value - happiness_mean) / happiness_std
    
    # Return the changed value
    return happiness_z

### What would be the Z-score of a Happiness value of 0.55? (remember, the Happiness scale ranges from 0 to 1, so 0.55 might not seem that high)

In [None]:
z_score(0.55)

### 0.55 on the Happiness scale is actually an extreme outlier with Z-score > 3.

### Now let's apply our Z-score function to the first real Happiness value in our actual data.

In [None]:
df.head()

### But instead of copying "0.013309" by hand, let's refer to its location in the dataframe: row 0, column "happiness"

In [None]:
print(df.loc[0, 'happiness'])
print(z_score(df.loc[0, 'happiness']))

### Now let's literally "apply" our Z-score function to _every single row_ of the Happiness column in one short line of code!

In [None]:
df['happiness_z_score'] = df['happiness'].apply(z_score)
df['happiness_z_score']

### Your function just ran 400,000+ times in a row in the blink of an eye!

### Finally, let's confirm that the mean of the Happiness Z-score is indeed 0 and the standard deviation is indeed 1.

In [None]:
df['happiness_z_score'].describe().round(3)
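
### As an aside, the same Z-score can be computed without writing a function, by applying the arithmetic to the whole column at once (as we did when converting seconds to minutes). This is a sketch on made-up values; the names `happiness_demo` and `happiness_z_demo` are hypothetical.

```python
import pandas as pd

happiness_demo = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5])  # hypothetical values

# Subtract the mean and divide by the standard deviation, for every value in the column at once
happiness_z_demo = (happiness_demo - happiness_demo.mean()) / happiness_demo.std()

print(happiness_z_demo.round(3).tolist())
print(happiness_z_demo.describe().round(3))
```

### Writing a function and using `.apply()` is still the more general tool, since it works for transformations that can't be expressed as simple column arithmetic.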

# 
# <font color='red'> **In case of emergency: load the face dataset at different stages** </font>

In [None]:
df = pd.read_csv('face_data_all.csv.gz') # initial state of the dataset

In [None]:
dfp = pd.read_excel('participant_data.xlsx') # original small dataset

In [None]:
# df['capture_time_mins'] = df['capture_time_secs'] / 60

df = pd.read_csv('face_data_minutes.csv.gz', index_col = 0) # minutes column added