# Loading and plotting data

In this notebook, we will load a datafile so that we can work with its contents. We'll start by exploring the data and then plot it with _matplotlib_.

## Loading data

First, make sure you have the file `Pearson.txt` downloaded to your computer (you can get it from the _data_ folder in the _psy3035-hel8048_ repository).

There are many types of datafile that you might want to work with in Python. Two of the simplest and most common are `.txt` and `.csv` files.

In [None]:
# We will use a numpy function loadtxt to load in the data
# You will have to change the string, FILENAME, to the path of the file on your computer

import numpy as np
from pathlib import Path

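# Forward slashes work on every operating system and avoid backslash escape problems
FILENAME = Path("path-to-file/Pearson.txt")  # template: replace with your own path
FILENAME = Path("../data/Pearson.txt")  # e.g. a data folder one level up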
data = np.loadtxt(FILENAME)

Running this cell should produce an error. If you get a `FileNotFoundError`, check the path you are using and adjust it until that error goes away.

Once the path is correct, you will get a different error, a `ValueError`. Read the message carefully because it tells you what is going wrong. When a file will not load, it often helps to look at it in another program such as Notepad or Excel. So let's do this. Find the file on your computer and open it in another program.

When you do this, you should see that the file contains two columns of data with the headings "Father" and "Son". The `ValueError` from _numpy_ now makes sense: it was expecting only numeric values and couldn't convert the string `"Father"` to a number.
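
You can also peek at the top of a text file without leaving Python. The sketch below is illustrative only: it writes a tiny stand-in file with made-up numbers (the name `example.txt` and the tab separator are assumptions, not taken from `Pearson.txt`) and prints its first two lines. To inspect your own file, skip the stand-in part and open your own path instead.

```python
from pathlib import Path
import tempfile

# Stand-in file with made-up numbers, NOT the real Pearson data
tmp_dir = Path(tempfile.mkdtemp())
example_file = tmp_dir / "example.txt"
example_file.write_text("Father\tSon\n65.0\t59.8\n63.3\t63.2\n")

# Open the file as plain text and keep only the first two lines
with open(example_file) as f:
    first_lines = [next(f).rstrip("\n") for _ in range(2)]

for line in first_lines:
    print(line)
```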

Maybe the function `np.loadtxt` has a way of dealing with this. We can check by looking at the help for the function. In a Jupyter notebook, you can get help for any function by typing its name followed by a question mark.

In [None]:
np.loadtxt?

You can see that one of the available options is `skiprows`. We can use this to skip the header row and just read in the numbers. Let's try it.

In [12]:
data = np.loadtxt(FILENAME, skiprows=1)

Now, we don't get an error and you can look at the data.

In [None]:
data
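
To see both behaviours side by side without needing the file, here is a small sketch that uses an in-memory text buffer in place of `Pearson.txt`. The numbers and the space separator are made up for illustration; only the header names mirror the real file.

```python
import io
import numpy as np

# Made-up stand-in for the file contents (header row plus two rows of numbers)
text = "Father Son\n65.0 59.8\n63.3 63.2\n"

# Without skiprows, loadtxt tries to convert the header to numbers and fails
try:
    np.loadtxt(io.StringIO(text))
    header_failed = False
except ValueError as err:
    header_failed = True
    print("ValueError:", err)

# With skiprows=1, the header row is ignored and only the numbers are read
example = np.loadtxt(io.StringIO(text), skiprows=1)
print(example)
```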

## Data structures

Because we have used a _numpy_ function to read in the file, our data are in a numpy array (we can check by using `type(data)`).

Numpy arrays have similarities with a `list` but can have two or more dimensions. They also have many built-in attributes and methods that you can use. One of these is the attribute `shape`, which tells us the dimensions of the array. Let's try it.

In [None]:
data.shape
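
As a quick check of how `type` and `shape` behave, here is a sketch on a small stand-in array (the values are made up, not from the file). For a two-dimensional array, `shape` is a tuple of (rows, columns).

```python
import numpy as np

# Made-up stand-in: 3 rows and 2 columns
example = np.array([[65.0, 59.8],
                    [63.3, 63.2],
                    [65.0, 63.3]])

print(type(example))   # the array class, numpy.ndarray
print(example.shape)   # (number of rows, number of columns)
print(example.ndim)    # number of dimensions
print(len(example))    # len() counts rows, i.e. the first dimension
```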

You can select different parts of the array using **indexing**, much as you would with lists. However, because an array can have more than one dimension, the syntax is a bit different.

In [None]:
data[:, 0] # selects all rows, first column

In [None]:
data[:, 1] # selects all rows, second column

In [None]:
data[5,:] # what does this select???

It is also possible to assign a subset of the data to a new variable.

In [None]:
fathers = data[:,0]
fathers.shape
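
One thing worth knowing when you assign a slice to a new variable: basic slicing gives a _view_ of the original array rather than a separate copy, so changing one changes the other. The sketch below shows this on a made-up stand-in array; use `.copy()` when you want an independent array.

```python
import numpy as np

# Made-up stand-in values, not the real data
example = np.array([[65.0, 59.8],
                    [63.3, 63.2],
                    [65.0, 63.3]])

first_col_view = example[:, 0]         # a view onto the first column
first_col_copy = example[:, 0].copy()  # an independent copy

first_col_view[0] = 0.0   # this also changes example[0, 0]
print(example[0, 0])
print(first_col_copy[0])  # the copy keeps the original value
```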

Use the _numpy_ cheat sheet to try out some of the different available functions on the whole dataset or on a subset of it.
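
For instance, here are a few summary functions, sketched on a made-up stand-in array rather than the real data. The `axis=0` argument applies the function down each column separately.

```python
import numpy as np

# Made-up stand-in values: 3 rows, 2 columns
example = np.array([[60.0, 62.0],
                    [64.0, 66.0],
                    [68.0, 70.0]])

overall_mean = example.mean()        # mean of every value in the array
column_means = example.mean(axis=0)  # one mean per column
column_max = example.max(axis=0)     # one maximum per column
column_std = example.std(axis=0)     # one standard deviation per column

print(overall_mean)
print(column_means)
print(column_max)
print(column_std)
```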

## Plotting some data

In [21]:
# The most common package for making graphs and figures in Python is called matplotlib
# We will import the pyplot module from it using the alias plt

import matplotlib.pyplot as plt

In [None]:
# We can initialise a figure and axis using the subplots function
# We'll then use the hist function to plot a histogram of the fathers data

fig, ax = plt.subplots()
ax.hist(fathers)


Try adapting the code above to...
1. Plot the _sons_ instead of the _fathers_
2. Change the number of bins
3. Change the colour
4. Only plot data that are above a certain threshold

Hint: use the help function for `ax.hist` or ask ChatGPT
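
For the threshold task, one possible approach is a boolean mask. The sketch below uses randomly generated stand-in values (not the real data) and an arbitrary cut-off; the variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated stand-in values, NOT the real data
rng = np.random.default_rng(seed=0)
heights = rng.normal(loc=68, scale=3, size=200)

# A boolean mask is True wherever the condition holds;
# indexing with it keeps only those values
threshold = 68  # arbitrary cut-off chosen for illustration
tall = heights[heights > threshold]

fig, ax = plt.subplots()
# hist returns the bar heights, the bin edges, and the drawn bars
counts, bin_edges, patches = ax.hist(tall, bins=20, color="purple")
```

In a notebook the figure appears automatically; in a plain script you would call `plt.show()` afterwards.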

You can also make multipanel figures easily, which is really useful if you want the panels to be aligned and the axes to have the same values.

In [None]:
# when we create this figure, we tell the subplots function that we want two columns
 
fig, [ax1, ax2] = plt.subplots(ncols=2)  # ax1 is the plot on the left and ax2 is the plot on the right

ax1.hist(fathers, color="green")

# try to add a line to plot the histogram of the sons data on ax2

In the code above, continue experimenting by trying the following...
1. Change the colours or the number of bins
2. Use `sharey=True` as an argument for the `subplots` function
3. Add labels and/or titles to the axes
4. Try a different layout (e.g., two rows instead of two columns)
5. Try a figure where the histograms are on the same plot (maybe add the `alpha` parameter when plotting)

Hint: Use the matplotlib cheat sheet on Canvas and/or ask ChatGPT
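
As a starting point for some of these, here is one possible sketch using randomly generated stand-in data (not the real fathers and sons values). The group names, titles, labels, and colours are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Randomly generated stand-in values, NOT the real data
rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=67, scale=3, size=300)
group_b = rng.normal(loc=69, scale=3, size=300)

# Two columns that share a y-axis, so bar heights are directly comparable
fig, (ax_left, ax_right) = plt.subplots(ncols=2, sharey=True)
ax_left.hist(group_a, bins=15, color="green")
ax_left.set_title("Group A")
ax_left.set_xlabel("Height")
ax_left.set_ylabel("Count")
ax_right.hist(group_b, bins=15, color="orange")
ax_right.set_title("Group B")
ax_right.set_xlabel("Height")

# Both histograms on one axis; alpha makes the bars semi-transparent
fig2, ax = plt.subplots()
ax.hist(group_a, bins=15, alpha=0.5, label="Group A")
ax.hist(group_b, bins=15, alpha=0.5, label="Group B")
ax.legend()
```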


Finally, try making a scatter plot with the data and again experiment with different options.

In [24]:
# code for a scatter plot should go here