# Scientific Programming: A Crash Course

## Class 5 – Visualization and Analysis

One of the most important (but often overlooked) skills in science is being able to communicate your work to others. And, in my opinion, the absolute best way to communicate your work is with pictures. As the old saying goes, "a picture is worth a thousand words". There are many specialist packages for certain types of illustration – technical diagrams, brain scans, network structures, etc. – but the most common thing we need to do is create various types of plot to show the results of our data analyses. There are at least a few different options here. [Matplotlib](https://matplotlib.org), which we'll use today, is the most comprehensive and widely-used package, but another good option is [Seaborn](https://seaborn.pydata.org), which is a bit more modern, although more limited.

## Matplotlib

Matplotlib is the most widely-used plotting library in Python and a key component of the scientific stack. If you're coming from the R world, it is roughly equivalent to ggplot2. Matplotlib allows you to create all sorts of plots: the obvious ones, like bar plots, scatter plots, and line plots, but also the less obvious ones like violin plots, heatmaps, and timelines.

Let's jump right in and import Matplotlib (we'll also import NumPy and Pandas, since we'll use them throughout this notebook):

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

Like NumPy and Pandas, there is a conventional way that Matplotlib is imported. Matplotlib includes other modules for dealing with lower-level stuff like shapes and text, but 95% of the time we just need the `pyplot` module, so the convention is to import just that module and call it `plt`.

Before we start plotting some real data, let's look at some more basic examples to get the general idea. We'll start with a simple line plot. This type of plot is so basic, Matplotlib literally just calls the relevant function `plot()`:

In [None]:
x = np.arange(1, 101)
y = np.arange(1, 101)

plt.plot(x, y);

Here I created two sets of values, `x` and `y`, both of which are just the numbers one through one hundred. I then plotted `x` against `y`. To make this fake data look a little more interesting, let's add some random noise to the `y` values:

In [None]:
x = np.arange(1, 101)
y = np.arange(1, 101)
y += np.random.randint(0, 5, 100)

plt.plot(x, y);
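
As a rough illustration of the data generation above, the sketch below builds the same kind of noisy values one step at a time on a much shorter array. The names `demo_x`, `demo_y`, and `demo_noise` are made up for this illustration so that they don't overwrite the notebook's `x` and `y`. Note that `np.random.randint(0, 5, ...)` draws integers from 0 up to, but not including, 5.

```python
import numpy as np

demo_x = np.arange(1, 11)                  # the integers 1 through 10
demo_y = np.arange(1, 11)                  # identical to demo_x to begin with
demo_noise = np.random.randint(0, 5, 10)   # ten random integers, each 0, 1, 2, 3, or 4

demo_y = demo_y + demo_noise               # shift each y-value up by its noise term

print(demo_x)
print(demo_noise)
print(demo_y)
```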

If you're a little unsure what's happening here in terms of the data generation, try taking it step by step (what do `x` and `y` look like, how are we generating and adding the noise?). To make this plot a little more informative, we should add some *x*- and *y*-axis labels:

In [None]:
plt.xlabel('Magical X numbers')
plt.ylabel('Special Y numbers')
plt.plot(x, y);

And... that default shade of blue is pretty boring...

In [None]:
plt.xlabel('Magical X numbers')
plt.ylabel('Special Y numbers')
plt.plot(x, y, color='hotpink');

Another way we might want to plot this data is with a scatter plot. A scatter plot is like a line plot, except the points are not joined together with lines. A scatter plot usually makes sense when the points are not intrinsically ordered.

In [None]:
plt.xlabel('Magical X numbers')
plt.ylabel('Special Y numbers')
plt.scatter(x, y, color='hotpink');

Let's make another scatter plot that looks a bit more realistic – we'll make the noise normally distributed. To do that, we'll use the `np.random.normal()` function:

In [None]:
x = np.linspace(0, 1, 100)
y = x * 2 + np.random.normal(0, 0.2, 100)

plt.xlabel('Magical X numbers')
plt.ylabel('Special Y numbers')
plt.scatter(x, y, color='hotpink');

Finally, let's overlay a simple linear regression line on this plot. To do this, we'll use NumPy's `polyfit()` function to determine the best fitting regression line (represented as an intercept and slope) and then we'll draw that line on the plot:

In [None]:
x = np.linspace(0, 1, 100)
y = x * 2 + np.random.normal(0, 0.2, 100)

β, α = np.polyfit(x, y, 1) # fit regression line - slope and intercept
y_predicted = α + β * x # y-values predicted by the regression line

plt.xlabel('Magical X numbers')
plt.ylabel('Special Y numbers')
plt.scatter(x, y, color='hotpink')
plt.plot(x, y_predicted, color='black', linewidth=5);
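
If you want to convince yourself that `np.polyfit()` really returns the slope first and the intercept second, one quick check is to fit data where you already know the answer. The sketch below uses hypothetical noise-free data on the line y = 1 + 2x (the names `line_x` and `line_y` are invented for this example), so the fit should recover a slope very close to 2 and an intercept very close to 1.

```python
import numpy as np

# Hypothetical noise-free data lying exactly on the line y = 1 + 2x
line_x = np.linspace(0, 1, 100)
line_y = 1 + 2 * line_x

# For a degree-1 fit, polyfit returns the coefficients highest power first:
# the slope, then the intercept
slope, intercept = np.polyfit(line_x, line_y, 1)

print(slope, intercept)
```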

Try playing around with the colors and styles. Need some color inspiration? [Check this page for all the standard color names you can use.](https://www.w3schools.com/cssref/css_colors.asp) Can you change the circular points into squares or triangles? Can you make the regression line dashed instead of solid? Do some googling to find out how it's done.

A great place to start whenever you have some new data is to plot a histogram to get an overall sense of the distribution of the data points. So it's worth taking a quick moment to see how it's done in Matplotlib. As you'll see, it's super easy, so you have no excuses for not plotting histograms of your data! First, I'll generate 1000 random numbers that are normally distributed with a mean of 0 and a standard deviation of 1, and then I'll use `plt.hist()` to make the plot:

In [None]:
values = np.random.normal(0, 1, 1000)

plt.hist(values, color='indianred', bins=30);

The `bins` argument allows you to control how many "bins" the datapoints are categorized into, which can give you a more granular view. Try making it bigger and smaller.
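
To get a feel for what the number of bins does without looking at a picture, here is a small sketch using `np.histogram()`, which bins the data in the same way but returns the counts and bin edges rather than drawing them. The variable name `demo_values` and the particular bin counts are just choices for this illustration.

```python
import numpy as np

demo_values = np.random.normal(0, 1, 1000)

# With few bins each bar holds many values; with many bins each bar holds few
for n_bins in [5, 30, 100]:
    counts, edges = np.histogram(demo_values, bins=n_bins)
    print(n_bins, 'bins -> tallest bar holds', counts.max(), 'values')
```

Whatever the number of bins, the counts always add up to the 1000 values we started with, and there is always one more bin edge than there are bins.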

Okay, now that we've played around with some fake data, I hope you've got the general idea of how things work. Basically, you pass your data into one of the plotting functions – like `plt.plot()`, `plt.scatter()`, or `plt.hist()` – and you set some colors and labels, and hey presto, you have a plot! Of course, there are **a lot** more options for further customization. To see lots more example plots, and the code used to generate them, check the Matplotlib gallery here: https://matplotlib.org/stable/gallery/index

## Dataset 1

Let's go back to the dataset that we briefly worked with in the previous class. First, a quick reminder of what the dataset looked like:

In [None]:
df = pd.read_csv('example_data.csv')
df

There are four columns representing four different variables (I mean the word "variable" in the statistical sense, not the computing sense – i.e. things that are measured or recorded in the experiment), and there are 15,360 rows, each representing an individual experimental trial. The overall structure of the experiment is that 240 subjects each did either a production test or a comprehension test, and each was tested on one of three category systems: size, angle, or both. Finally, the `correct` variable/column records whether the subject was correct (`1`) or incorrect (`0`) on that trial.
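
Counts like these are easy to check for yourself. The sketch below shows the kind of calls you could use, run on a tiny made-up data frame called `toy_df`. It has the same column names as the real dataset, but the values (including the subject identifiers) are invented for illustration.

```python
import pandas as pd

# A tiny made-up stand-in for the real data (same column names, invented values)
toy_df = pd.DataFrame({
    'subject': [1, 1, 2, 2, 3, 3, 4, 4],
    'test_type': ['production'] * 4 + ['comprehension'] * 4,
    'category_system': ['angle', 'angle', 'size', 'size', 'angle', 'angle', 'both', 'both'],
    'correct': [1, 0, 1, 1, 0, 0, 1, 0],
})

print(toy_df.shape)                        # (number of rows, number of columns)
print(toy_df['subject'].nunique())         # number of distinct subjects
print(toy_df['test_type'].unique())        # which test types occur
print(toy_df['category_system'].unique())  # which category systems occur
```

Swapping `toy_df` for the real `df` should let you confirm the numbers quoted above.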



First, I just want to look at the results from the production test type, so let's isolate those trials first into a new data frame which we'll assign to the variable `production_df`:

In [None]:
production_df = df.query('test_type=="production"')
production_df

As you can see, we now just have the 120 subjects who did the production test type. Next, I want to calculate accuracy for each of the category systems; to do this, we'll use the data frame's `.groupby()` method to group the data by category system and then calculate the mean of the `correct` column for each group:

In [None]:
accuracy_by_condition = production_df.groupby('category_system')['correct'].mean()
accuracy_by_condition

Right away we see that accuracy is highest in the `angle` condition and lowest in the `both` condition. Let's make a bar plot:

In [None]:
accuracy_by_condition = accuracy_by_condition.sort_values(ascending=False) # sort in descending order

plt.ylim(0, 1) # make the y-axis go from 0 to 1
plt.xlabel('Category system')
plt.ylabel('Accuracy')
plt.bar(accuracy_by_condition.index, accuracy_by_condition);

Bar plots are terrible! You should almost never use them. [#barbarplots](http://barbarplots.github.io)! Let's do better by making a violin plot, which will show us not only the central tendency of the three conditions, but also their distributions. To make a violin plot, we need to organize the data a little bit so that we have subject-level accuracy scores for each of the category systems. To do that I will further subset the `production_df` into a separate data frame for each category system, and then I'll compute subject-level accuracy using `.groupby()` (we did something similar yesterday for just one subject).

In [None]:
# create a separate dataframe for each category system by
# subsetting the main dataframe
angl_df = production_df.query('category_system=="angle"')
size_df = production_df.query('category_system=="size"')
both_df = production_df.query('category_system=="both"')

# group the dataframes by subject and then calculate the
# accuracy per subject
angl_accuracy_by_subject = angl_df.groupby('subject')['correct'].mean()
size_accuracy_by_subject = size_df.groupby('subject')['correct'].mean()
both_accuracy_by_subject = both_df.groupby('subject')['correct'].mean()

# put all these by-subject accuracy scores together in one
# list
data = [angl_accuracy_by_subject, size_accuracy_by_subject, both_accuracy_by_subject]
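
As an aside, the same subject-level scores can also be computed without splitting the data frame first, by grouping on two columns at once. This is only a sketch of an alternative, not a replacement for the cell above, and it runs on a made-up data frame (`toy_production_df`) with invented subjects and values.

```python
import pandas as pd

# Made-up production data: three subjects, two trials each (invented values)
toy_production_df = pd.DataFrame({
    'subject': ['s1', 's1', 's2', 's2', 's3', 's3'],
    'category_system': ['angle', 'angle', 'size', 'size', 'both', 'both'],
    'correct': [1, 1, 1, 0, 0, 0],
})

# Group by condition AND subject in one go: one accuracy score per subject
toy_accuracy = toy_production_df.groupby(['category_system', 'subject'])['correct'].mean()
print(toy_accuracy)

# Pull out the scores for a single condition using the first index level
toy_angle_scores = toy_accuracy.loc['angle']
print(toy_angle_scores)
```

The result is a single Series indexed by both category system and subject, from which each condition's scores can be selected with `.loc[]`.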

Print out the variables if you're unsure what's happening at each step. Finally, let's make the violin plot:

In [None]:
plt.violinplot(data, showmeans=True, showextrema=False)
plt.xlabel('Category system')
plt.ylabel('Accuracy')
plt.xticks([1, 2, 3], labels=['Angle', 'Size', 'Both'])
plt.ylim(0, 1); # make the y-axis go from 0 to 1

Great! Now we can get a sense of how much variation there is across participants within each condition, which we couldn't get from the bar plot. In this case, subjects who were tested on the angle system are very consistent – pretty much all of them have high accuracy scores; subjects who learned the size system are not only worse (on average) but also more variable – some did well and some did poorly; subjects who learned the both system are even more variable. Looking at the full distribution is always much more informative than just looking at the means (shown here with the dark blue lines) or other summary statistics, so you should get into this habit right away!

The plot seems to show that people who learned the angle category system were more accurate than people who learned the size category system. But could we have gotten these results simply by chance? To answer these kinds of questions, we need statistics, which is beyond the scope of what I want to cover in this course. But to give you a quick example, here's how you could run a t-test to evaluate whether, for example, subjects in the angle condition were significantly better than those in the size condition:

In [None]:
from scipy.stats import ttest_ind

ttest_ind(angl_accuracy_by_subject, size_accuracy_by_subject)

The p-value is less than 0.05, so yes, the difference does seem to be statistically significant. Here I used the `ttest_ind()` function from SciPy (an independent-samples t-test), but you'll typically want to do more advanced analyses than this, for which there are more appropriate packages (e.g. the [Statsmodels](https://www.statsmodels.org/stable/index.html) package) or indeed more appropriate languages, like R.
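
If you want to use the test result in later code rather than just reading it off the screen, the object returned by `ttest_ind()` has `statistic` and `pvalue` attributes. The sketch below shows this on invented accuracy scores (`group_a` and `group_b` are not the real data), and also shows the `equal_var=False` option, which runs Welch's version of the test and does not assume that the two groups have equal variances.

```python
import numpy as np
from scipy.stats import ttest_ind

# Invented accuracy scores for two groups of 40 subjects (NOT the real data)
rng = np.random.default_rng(0)
group_a = rng.normal(0.8, 0.1, 40)
group_b = rng.normal(0.6, 0.2, 40)

# Standard independent-samples t-test, as in the cell above
result = ttest_ind(group_a, group_b)
print(result.statistic, result.pvalue)

# Welch's variant, which does not assume equal variances in the two groups
welch = ttest_ind(group_a, group_b, equal_var=False)
print(welch.statistic, welch.pvalue)
```

Which variant is appropriate depends on your data, so treat this as a starting point rather than a recommendation.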

Using the code above, you should be able to make the equivalent plot for the comprehension data, which we previously ignored by extracting only the production test results. Try building the plot in the cell below:

There are many more options for customizing the plot further. For reference, this is what the final plot looked like in my paper. You'll see that I added illustrations of the three category systems to make the plot easier to understand, and I also added a dashed line at 0.25 to show chance-level performance.

![title](images/prod_comp.png)
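
The dashed chance-level line is one of the simpler customizations. A minimal sketch, assuming invented scores in place of the real subject-level accuracies (`fake_scores` is made up for this example), could use `plt.axhline()` like this:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented subject-level accuracy scores for three conditions (NOT the real data)
fake_scores = [np.random.uniform(0.2, 1.0, 40) for _ in range(3)]

plt.violinplot(fake_scores, showmeans=True, showextrema=False)
plt.xticks([1, 2, 3], labels=['Angle', 'Size', 'Both'])
plt.ylim(0, 1)

# Dashed horizontal line at chance level (0.25)
chance_line = plt.axhline(0.25, color='gray', linestyle='--')
```

The color and line style here are arbitrary choices, not necessarily the ones used in the published figure.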

## Dataset 2

Let's switch now to another dataset to get some more practice. This data is in `example_data2.csv`, so first we'll open the CSV file and print it to get a general sense of the dataset:

In [None]:
df = pd.read_csv('example_data2.csv')
df

Can you answer the following questions? You might have to do a little bit more digging into the data to get the right answers.

1. How many subjects are there?

2. How many conditions are there?

3. How many trials did each subject do?

This dataset is from an eye tracking experiment. Participants have to look at a word and we record which part of the word they look at – the "landing position", which is measured in pixels. Some participants were in the "left" condition where we expect them to look at the left part of the word, and some participants were in the "right" condition where we expect them to look at the right part of the word.

To get a general sense of whether this hypothesis is true, we could calculate the average landing position in each condition. Write some code in the cell below to get these two numbers. For reference, the correct numbers should be 97.545887 in the left condition and 129.512789 in the right condition. Can you reproduce these numbers?

As I mentioned before, just looking at summary statistics, like means, can be misleading, so it's essential to **always plot your data** (and, where possible, always look at the distribution – not just the central tendency). For this dataset, a density plot will work very nicely. To help us create this kind of plot, we'll use the `gaussian_kde()` function from SciPy. Study the code below to see how it works.
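
Before the full plotting cell, it may help to see `gaussian_kde()` on its own. The sketch below uses invented landing positions (`fake_positions` is not the real data): it builds the density estimate from the samples, evaluates it on a grid of x-values, and then checks that the area under the curve is roughly 1, as it should be for a probability density.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Invented "landing positions": 200 draws centred on 100 pixels (NOT the real data)
fake_positions = np.random.normal(100, 20, 200)

kde = gaussian_kde(fake_positions)   # build the density estimate from the samples
grid = np.linspace(0, 252, 1000)     # x-values at which to evaluate the density
density = kde.pdf(grid)              # estimated density at each grid point

# Approximate the area under the curve by summing thin rectangles
step = grid[1] - grid[0]
print((density * step).sum())
```

The `density` values play the same role as `y_left` and `y_right` in the plotting code: they are the heights of the curve at each point on the grid.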

In [None]:
from scipy.stats import gaussian_kde

x = np.linspace(0, 252, 1000)

left_positions = df.query('condition=="left"')['landing_position']
right_positions = df.query('condition=="right"')['landing_position']

y_left = gaussian_kde(left_positions).pdf(x)
y_right = gaussian_kde(right_positions).pdf(x)

plt.plot(x, y_left, color='cadetblue')
plt.plot(x, y_right, color='crimson')

plt.xlabel('Landing position (in pixels)')
plt.xlim(0, 252)
x_ticks = []
for boundary in range(0, 253, 36):
    plt.axvline(boundary, color='lightgray', zorder=0)
    x_ticks.append(boundary)
plt.xticks(x_ticks);
plt.yticks([]);

Was our hypothesis correct? Is there a significant difference between conditions?

It's also really important to look at the data from each individual subject – maybe some subjects are being weird for some reason, or maybe there are clusters of subjects who tend to behave in similar ways. In the cell below, try to develop the above code to produce a density plot for each individual subject, not just the overall data from each condition.

For reference, here's what the final figure looked like in my paper:


![title](images/landing_pos.png)

## Lastly: A Note on Notebooks!

During this course, we have done almost all our coding within a Jupyter notebook. However, this is not the only way to use Python. You can also create Python scripts (`.py` files), which can be run directly, or use the interpreter. What do you think of the notebook format? Have you had any issues with it? Have you used something like this before in another language?

Personally, I don't *love* notebooks, although I understand why many people do. The really nice thing is that you can mix text and code together. This makes notebooks very useful in an educational context – like this class – because we can mix code with explanations. Notebooks can also be useful in a scientific context because they allow you to clearly document your thought processes as you explore some data. For example, you can create a notebook that shows all the steps you took in analyzing some data, and then you can share this document with your PhD supervisor or publish it as supplementary material alongside a journal article. Another really nice feature is that all the plots you generate are kept alongside the code that generated them, so it's easy to remember how you created each plot.

However, in my experience, the notebook style of coding can also sometimes get a bit messy and confusing for everyday programming, mostly because it's difficult to keep track of the current "state" of the underlying interpreter. For example, let's say we run the following code block:

In [None]:
my_special_number = 7

Then, maybe we do some calculations with this variable:

In [None]:
answer = my_special_number ** 2 + 100
print(answer)

Okay, great! The answer is 149. Now I run some more code:

In [None]:
magical_constant = 64
my_special_number = 10
answer = magical_constant * my_special_number
print(answer)

Okay, 640, sure. Now go back to the previous code block and run it again – the one where we got the answer 149. Did you get the same answer? You should get 200 instead – it no longer gives you 149. Why? Because in the subsequent code block you redefined `my_special_number` to `10`, perhaps without even realizing.

It's very easy to get into confusing situations like this because code blocks can be run in any order. If you lose track of what order you ran things in, you can quickly get in a pickle! Notebooks go against the top-to-bottom sequential flow that we normally expect when programming. Instead, the sequential flow – the order in which the blocks of code are run – exists only in your head and remains undocumented. I don't want to totally dissuade you from using notebooks (there are many good reasons to use them), but it's worth thinking about these issues if you plan to use them in the future.
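
As a rough sketch of one general way to make code less sensitive to run order (a common habit, not a fix for every case), you can put a calculation inside a function that takes its inputs as arguments, so that the result depends only on what you pass in and not on whatever a variable happens to hold at that moment. The function name `compute_answer` is made up for this example.

```python
def compute_answer(special_number):
    """Return special_number squared plus 100, using only the value passed in."""
    return special_number ** 2 + 100

print(compute_answer(7))   # → 149
print(compute_answer(10))  # → 200
```

Calling `compute_answer(7)` gives 149 no matter which other cells have been run in the meantime.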

Finally, and more generally, please don't feel that you have to adopt all the suggestions that I've made during this course. There are many, many, many ways to do good quality science. Explore the options for yourself and find a way of working that makes sense for you, your projects, and your collaborators. Good luck!