# Input and files

Very often, we want to work with data that has already been collected or generated elsewhere. This might be data from online surveys or acoustic monitoring sensors, or perhaps the results of intensive fluid-dynamics simulations carried out on huge remote clusters of computers. There are a vast number of possible data formats, and of ways to read them into Python, so we will focus only on the simplest.

## `CSV` files

A *comma-separated values* (csv) file stores data in a simple form: each line of the file contains a number of values, separated by commas. Often the first line consists of *headers* (i.e. short pieces of text describing each column of data).

As an example, the file `data/strathkiness-rainfall.csv` consists of daily rainfall data for Strathkinness (near St Andrews), between August 1st 2020 and August 23rd 2021. This data is published by SEPA under the Open Government Licence v3.0; you can find the most recent rainfall data [here](https://www2.sepa.org.uk/rainfall/data/index/11368). The first few lines of that file look like this:

    Timestamp,Value
    01/08/2020 09:00:00,0.2
    02/08/2020 09:00:00,0.0
    03/08/2020 09:00:00,2.0

This has two columns: `Timestamp`, the day the data is for, and `Value`, the total daily rainfall (in millimetres). 
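To make the structure concrete, here is a minimal sketch that splits the lines shown above by hand. The `sample` string is simply the excerpt copied into the code, and the variable names are illustrative. Splitting on commas like this is only a rough illustration: it would break on values that themselves contain commas, which is one reason to prefer a proper csv parser.

```python
# The first few lines of the file, copied from the excerpt above.
sample = """Timestamp,Value
01/08/2020 09:00:00,0.2
02/08/2020 09:00:00,0.0
03/08/2020 09:00:00,2.0"""

lines = sample.splitlines()

# The first line holds the headers; every other line holds one row of values.
headers = lines[0].split(",")
rows = [line.split(",") for line in lines[1:]]

print(headers)   # ['Timestamp', 'Value']
print(rows[0])   # ['01/08/2020 09:00:00', '0.2']
```

Every value comes out as a string, including the rainfall amounts.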

### Using `csv`

We could read the rainfall data into Python like so:

In [1]:
# we can use the csv module to read csv files
import csv

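# Open the file. Passing newline='' is what the csv module recommends,
# so that it can handle line endings itself.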
with open('data/strathkiness-rainfall.csv', newline='') as csvfile:
    # Create a "reader" object to do the reading.
    # You *can* use things other than commas in csv files, so we have to 
    # tell the reader that the value separator is a comma here.
    reader = csv.reader(csvfile, delimiter=',')
    rain_data = list(reader)  # convert the data to a list

rain_data

[['Timestamp', 'Value'],
 ['01/08/2020 09:00:00', '0.2'],
 ['02/08/2020 09:00:00', '0.0'],
 ['03/08/2020 09:00:00', '2.0'],
 ['04/08/2020 09:00:00', '15.8'],
 ['05/08/2020 09:00:00', '0.4'],
 ['06/08/2020 09:00:00', '0.0'],
 ['07/08/2020 09:00:00', '0.0'],
 ['08/08/2020 09:00:00', '0.0'],
 ['09/08/2020 09:00:00', '0.0'],
 ['10/08/2020 09:00:00', '1.6'],
 ['11/08/2020 09:00:00', '7.6'],
 ['12/08/2020 09:00:00', '0.0'],
 ['13/08/2020 09:00:00', '0.8'],
 ['14/08/2020 09:00:00', '0.0'],
 ['15/08/2020 09:00:00', '0.0'],
 ['16/08/2020 09:00:00', '0.2'],
 ['17/08/2020 09:00:00', '2.2'],
 ['18/08/2020 09:00:00', '0.0'],
 ['19/08/2020 09:00:00', '4.4'],
 ['20/08/2020 09:00:00', '1.4'],
 ['21/08/2020 09:00:00', '5.0'],
 ['22/08/2020 09:00:00', '0.0'],
 ['23/08/2020 09:00:00', '1.0'],
 ['24/08/2020 09:00:00', '3.2'],
 ['25/08/2020 09:00:00', '41.4'],
 ['26/08/2020 09:00:00', '1.8'],
 ['27/08/2020 09:00:00', '15.2'],
 ['28/08/2020 09:00:00', '0.0'],
 ['29/08/2020 09:00:00', '0.0'],
 ['30/08/2020 0

Note that this has read in all of the data as strings, and has included the headers. We could get rid of the headers by slicing the list (`rain_data = rain_data[1:]`), and we could convert the values to floats by using `float`. Here's how we might do both simultaneously using a list comprehension:

In [2]:
rain_data = [[x[0], float(x[1])] for x in rain_data[1:]]

Note that this has left the timestamp as a string. Python does have the ability to work directly with dates and times through so-called `datetime` objects, but we leave this out for simplicity.
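For the curious, here is a minimal sketch of what that conversion could look like with the standard-library `datetime` module. The two rows are copied from the data above, and the format string assumes the day/month/year ordering seen in the file; `TIMESTAMP_FORMAT` and `parsed` are illustrative names only.

```python
from datetime import datetime

# Two rows in the shape produced above: [timestamp string, rainfall in mm].
rows = [["01/08/2020 09:00:00", 0.2], ["02/08/2020 09:00:00", 0.0]]

# Assumed format: day/month/year followed by hours:minutes:seconds.
TIMESTAMP_FORMAT = "%d/%m/%Y %H:%M:%S"

# Replace each timestamp string with a datetime object.
parsed = [[datetime.strptime(stamp, TIMESTAMP_FORMAT), value] for stamp, value in rows]

first = parsed[0][0]
print(first.year, first.month, first.day)   # 2020 8 1
```

Once the timestamps are `datetime` objects, they can be compared, sorted and subtracted from one another.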

### Using `pandas`

If you are planning to do statistical analysis on your data, you are likely to want to import it into a "dataframe" using `pandas` instead:

In [3]:
import pandas as pd

rain_data2 = pd.read_csv("data/strathkiness-rainfall.csv")
rain_data2

                   Timestamp  Value
    0    01/08/2020 09:00:00    0.2
    1    02/08/2020 09:00:00    0.0
    2    03/08/2020 09:00:00    2.0
    3    04/08/2020 09:00:00   15.8
    4    05/08/2020 09:00:00    0.4
    ...                  ...    ...
    383  19/08/2021 09:00:00    0.2
    384  20/08/2021 09:00:00    3.2
    385  21/08/2021 09:00:00    8.0
    386  22/08/2021 09:00:00    0.0


Note that this is stored in a dataframe object, rather than a list. You can extract the values from the dataframe as a `numpy` array like so:

In [4]:
rain_data2.values

array([['01/08/2020 09:00:00', 0.2],
       ['02/08/2020 09:00:00', 0.0],
       ['03/08/2020 09:00:00', 2.0],
       ['04/08/2020 09:00:00', 15.8],
       ['05/08/2020 09:00:00', 0.4],
       ['06/08/2020 09:00:00', 0.0],
       ['07/08/2020 09:00:00', 0.0],
       ['08/08/2020 09:00:00', 0.0],
       ['09/08/2020 09:00:00', 0.0],
       ['10/08/2020 09:00:00', 1.6],
       ['11/08/2020 09:00:00', 7.6],
       ['12/08/2020 09:00:00', 0.0],
       ['13/08/2020 09:00:00', 0.8],
       ['14/08/2020 09:00:00', 0.0],
       ['15/08/2020 09:00:00', 0.0],
       ['16/08/2020 09:00:00', 0.2],
       ['17/08/2020 09:00:00', 2.2],
       ['18/08/2020 09:00:00', 0.0],
       ['19/08/2020 09:00:00', 4.4],
       ['20/08/2020 09:00:00', 1.4],
       ['21/08/2020 09:00:00', 5.0],
       ['22/08/2020 09:00:00', 0.0],
       ['23/08/2020 09:00:00', 1.0],
       ['24/08/2020 09:00:00', 3.2],
       ['25/08/2020 09:00:00', 41.4],
       ['26/08/2020 09:00:00', 1.8],
       ['27/08/2020 09:00:00', 15.2]

Note that `pandas` has automatically converted the second column to `float`. This makes `pandas` a convenient way of reading in `csv` files even if you don't plan to use any other `pandas` features.

After manipulating the data using `pandas`, you might want to write the dataframe back to a `csv` file.

In [5]:
# Uncomment the line below to write the dataframe to a new csv file:
# rain_data2.to_csv("data/processed-rainfall-data.csv")
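One detail worth knowing is that `to_csv` writes the dataframe's row index as an extra first column unless told otherwise. The sketch below illustrates this on a tiny dataframe built by hand from the first two rows of the rainfall data. It writes to in-memory text rather than to a file, and the names `df`, `with_index` and `without_index` are illustrative.

```python
import io

import pandas as pd

# A tiny stand-in for the rainfall dataframe (first two rows only).
df = pd.DataFrame(
    {"Timestamp": ["01/08/2020 09:00:00", "02/08/2020 09:00:00"],
     "Value": [0.2, 0.0]}
)

# With no file name, to_csv returns the csv text as a string.
with_index = df.to_csv()                 # row index written as a first column
without_index = df.to_csv(index=False)   # only the data columns

# Read both versions back in and compare the column names.
print(list(pd.read_csv(io.StringIO(with_index)).columns))
# ['Unnamed: 0', 'Timestamp', 'Value']
print(list(pd.read_csv(io.StringIO(without_index)).columns))
# ['Timestamp', 'Value']
```

Passing `index=False` is usually what you want if the file will be read back in with `pd.read_csv` later.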

## `numpy` input and output

Often the data we want to read from or write to a file comes in the form of a `numpy` array. For example, after computing the values of a function on a large and high-resolution mesh, you might want to avoid recomputing them. The example below doesn't take long to compute, but it will take a lot longer if the number of points in the `linspace` increases significantly.

In [6]:
import numpy as np

X = Y = np.linspace(0, 10, 500)
X, Y = np.meshgrid(X, Y)
Z = 3 * X**2 * Y**3 + 27 * X

# now that we've computed these values, we can save them:
np.savetxt("data/Z.txt", Z)

# If there is a lot of data, you might want to save a compressed file instead.
# Just add ".gz" to the end of the file name
# np.savetxt("data/Z.txt.gz", Z)

Now that we've saved the file, we can easily load it back in at any later point:

In [7]:
X = Y = np.linspace(0, 10, 500)
X, Y = np.meshgrid(X, Y)
Z = np.loadtxt("data/Z.txt")
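As a quick sanity check, the sketch below saves a smaller version of the same array to a compressed file and confirms that the loaded values match the originals. It uses a temporary directory (an assumption made here so that the example does not depend on a `data` folder existing) and a 50-point mesh to keep things fast.

```python
import os
import tempfile

import numpy as np

# A smaller version of the mesh used above.
X = Y = np.linspace(0, 10, 50)
X, Y = np.meshgrid(X, Y)
Z = 3 * X**2 * Y**3 + 27 * X

with tempfile.TemporaryDirectory() as tmpdir:
    # The ".gz" ending makes savetxt compress the file; loadtxt decompresses it.
    path = os.path.join(tmpdir, "Z.txt.gz")
    np.savetxt(path, Z)
    Z_loaded = np.loadtxt(path)

print(Z_loaded.shape)             # (50, 50)
print(np.allclose(Z, Z_loaded))   # True
```

Note that `loadtxt` returns the array with the same two-dimensional shape that was saved.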