# Storing Data in Numpy Arrays

## Overview
**Teaching**: 20 min \
**Exercises**: 25 min

**Questions**
- What is a `numpy` array and when is it useful?
- How can I import and clean data in Python?

**Objectives**
- Read tabular data from a file into a program using `numpy`
- Describe the key features and use-cases of a Numpy n-dimensional array
- Clean data for easier file parsing
- Save data to file
- Create `numpy` arrays
- Create an automatic counter within a `for` loop
- Iterate through 1d and 2d `numpy` arrays

## Use the `numpy` library to work with numerical data in Python

In this tutorial we will use `numpy` to read in experimental data. In order to load this data, we need to access ([import](https://lucydot.github.io/python_novice/reference/#import) in Python terminology) a library called [`numpy`](). In general you should use this library if you want to do fancy things with numbers, especially if you have matrices or arrays. We can import `numpy` using:
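A minimal sketch of the import statement. The `print` line is only a quick check that the library loaded and is not essential.

```python
import numpy

# Printing the version is just a way to confirm that the import worked.
print(numpy.__version__)
```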

## `numpy` or `pandas`
Another common Python package for working with 2-dimensional tabular data is `pandas`. If you have heterogeneous data - columns with different data-types (strings and floats for example) - Pandas might be a good choice. If you are applying mathematical operations to multi-dimensional arrays, `numpy` is a good choice. `numpy` typically consumes less memory than `pandas`, but this only becomes noticeable for very large arrays. For this relatively small 2-dimensional table of floats, either will work well.

## Scientists dislike typing
We will always use the syntax `import numpy` to import `numpy`. However, in order to save typing, it is [often suggested]() to make a shortcut like so: `import numpy as np`. If you ever see Python code online using a `numpy` function with `np` (for example, `np.loadtxt(...)`), it's because they've used this shortcut. When working with other people, it is important to agree on a convention for how common libraries are imported.
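As a small illustration of the shortcut, both names below refer to the same library, so `np.loadtxt` and `numpy.loadtxt` are the same function.

```python
import numpy
import numpy as np

# `np` is simply a shorter name for the same module.
print(np is numpy)  # True
```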

## The `loadtxt` function is used to read in csv data
To load the data file we can use the expression `numpy.loadtxt(...)`. This is a [function call](https://lucydot.github.io/python_novice/reference/#function-call) that asks Python to run the [function](https://lucydot.github.io/python_novice/reference/#function) `loadtxt` which belongs to the `numpy` library.

Since we haven't told the notebook to do anything else with the function's output, the notebook displays it. In this case, that output is the data we just loaded. By default, only a few rows and columns are shown (with `...` to omit elements when displaying big arrays). Also note that, to save space, Python displays numbers as `1.` instead of `1.0` when there's nothing after the decimal point.

Here we pass three [arguments](https://lucydot.github.io/python_novice/reference/#argument) to `numpy.loadtxt`: the name of the file we want to read, the delimiter which separates data values in the file, and the number of rows to skip when reading the file. The first row contains column names, and so we ask `numpy` to skip it.
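A sketch of such a call is shown below. The file name `uvvis_example.csv` and the numbers in it are made up for illustration; they are written to a temporary folder only so that the example runs on its own. Substitute the path to the real data file.

```python
import os
import tempfile

import numpy

# Hypothetical stand-in for the real data file: made-up values, not measurements.
filename = os.path.join(tempfile.gettempdir(), "uvvis_example.csv")
with open(filename, "w") as f:
    f.write("Wavelength,ITO,CdS\n")
    f.write("300.0,10.5,0.2\n")
    f.write("400.0,80.1,5.3\n")
    f.write("500.0,85.7,60.4\n")

# Arguments: the file name, the delimiter between values,
# and the number of rows to skip (here the single header row).
numpy.loadtxt(filename, delimiter=",", skiprows=1)
```

In a notebook the last line displays the array; in a plain script you would need to wrap it in `print(...)` to see anything.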

## What is the meaning of this data?
This data we have read in is transmittance data from a UV-Vis experiment. When light interacts with a medium some of it will be transmitted, reflected or absorbed. Transmittance is the intensity of the transmitted radiation leaving the medium, normalised by the intensity of the radiation entering the medium. As such it is usually expressed as a percentage. The rows contain the data for each wavelength, and the columns contain the data for different materials.

## Built-in Python functions can be used to read file headings
To read in the column headings we can use Python:

Note that in the code below we are _not_ using the `numpy` library - just built-in Python functions. There are some rogue unicode characters that are a relic from the Excel file that originally held this data. To remove these characters we can specify the `encoding` keyword - an internet search of `ufeff` suggested this solution. The internet search bar is a useful friend when programming!
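One way this could look is sketched here. The file name and its contents are hypothetical, and `utf-8-sig` is an assumption about which `encoding` value removes the `\ufeff` marker; the original data file may need a different one.

```python
import os
import tempfile

# Hypothetical stand-in file. Writing with "utf-8-sig" places the \ufeff marker
# at the start of the file, mimicking a file exported from a spreadsheet.
filename = os.path.join(tempfile.gettempdir(), "uvvis_header_example.csv")
with open(filename, "w", encoding="utf-8-sig") as f:
    f.write("Wavelength,ITO,CdS\n")
    f.write("300.0,10.5,0.2\n")

# Reading with the same encoding strips the marker from the first line.
with open(filename, encoding="utf-8-sig") as f:
    header = f.readline()

# repr() shows the newline character explicitly.
print(repr(header))  # 'Wavelength,ITO,CdS\n'
```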

We can now see that the first heading corresponds to Wavelength, the second to ITO transmittance, the third to CdS transmittance and so on. Note that the `\n` denotes a newline character - it is not unicode.

## To save data to memory we can assign it to a variable
Note that our call to `numpy.loadtxt` read our file but didn't save the data in memory. To do that, we need to assign the array to a variable. Just as we can assign a single value to a variable, we can also assign an array of values to a variable using the same syntax. Let's re-run `numpy.loadtxt` and save the returned data:

This statement doesn't produce any output because we've assigned the output to the variable `data`. If we want to check that the data have been loaded, we can `print` the variable's value:
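The two steps together might look like the sketch below. As before, the file name and values are invented so that the example is self-contained.

```python
import os
import tempfile

import numpy

# Hypothetical stand-in for the real data file (made-up values).
filename = os.path.join(tempfile.gettempdir(), "uvvis_assign_example.csv")
with open(filename, "w") as f:
    f.write("Wavelength,ITO,CdS\n")
    f.write("300.0,10.5,0.2\n")
    f.write("400.0,80.1,5.3\n")

# Assigning the result to a variable keeps the array for later use
# and means nothing is displayed at this point.
data = numpy.loadtxt(filename, delimiter=",", skiprows=1)

# Printing the variable confirms that the data were loaded.
print(data)
```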

## An array is a central data structure of the `numpy` library
First, let's ask what [type](https://lucydot.github.io/python_novice/reference/#type) of thing `data` refers to:
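A short sketch of this check. Here `data` is a small hand-written array standing in for the loaded file.

```python
import numpy

# Stand-in for the array loaded from file (made-up values).
data = numpy.array([[300.0, 10.5, 0.2],
                    [400.0, 80.1, 5.3]])

print(type(data))  # <class 'numpy.ndarray'>
```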

The output tells us that `data` currently refers to an n-dimensional array, the functionality for which is provided by the `numpy` library.

An array is a central data structure of the `numpy` library. It is a grid of values that can be indexed in various ways. The elements are all of the same type, referred to as the array `dtype`.

## Arrays vs lists
You may wonder why we care about Numpy arrays when we already have Python lists. A `numpy` array holds only a single type of data, whilst lists can hold elements of different types. This makes `numpy` more efficient in memory usage. It also makes it quicker to iterate through a `numpy` array and manipulate elements in the array. Arrays are more suited to mathematical tasks as the operations are element-by-element. For example, what happens when we multiply a list by an integer? Is this what we would expect to see when working with vectors?
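The difference can be seen in a small comparison. The values are arbitrary and chosen only for illustration.

```python
import numpy

values_list = [1, 2, 3]
values_array = numpy.array([1, 2, 3])

# Multiplying a list repeats its contents.
print(values_list * 2)   # [1, 2, 3, 1, 2, 3]

# Multiplying an array multiplies each element, as we would expect for a vector.
print(values_array * 2)  # [2 4 6]
```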

The `type` function tells us that a variable is a `numpy` array but won't tell you the type of thing inside the array. To find out the type of data contained in the `numpy` array we can `print` the `dtype` attribute.
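A sketch of this check, again using a small stand-in array in place of the loaded data:

```python
import numpy

# Stand-in for the array loaded from file (made-up values).
data = numpy.array([[300.0, 10.5, 0.2],
                    [400.0, 80.1, 5.3]])

# dtype is an attribute, so there are no brackets after it.
print(data.dtype)  # float64
```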

This tells us that the `numpy` array's elements are [floating point numbers](https://lucydot.github.io/python_novice/reference/#floating-point%20number).

## Extra information about an array is stored as attributes.
When we created the variable `data` to store our transmittance data, we didn't just create the array; we also created information about the array, called attributes. This extra information describes `data` in the same way an adjective describes a noun.

For example, `data.shape` is an attribute of `data` which describes the dimensions of `data`.
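For instance, with a small stand-in array of two rows and three columns:

```python
import numpy

# Stand-in for the array loaded from file (made-up values).
data = numpy.array([[300.0, 10.5, 0.2],
                    [400.0, 80.1, 5.3]])

# shape gives (number of rows, number of columns) for a 2d array.
print(data.shape)  # (2, 3)
```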

## The `savetxt` function is used to write data to a file
If we want to save a file with clean header data (without the unicode) we can use the `numpy.savetxt` function: 
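A minimal sketch of writing the data back out. The output file name, header text and values are all hypothetical. The last two lines read the first line back with plain Python to check what was written.

```python
import os
import tempfile

import numpy

# Stand-ins for the loaded data and the cleaned header (made-up values).
data = numpy.array([[300.0, 10.5, 0.2],
                    [400.0, 80.1, 5.3]])
header = "Wavelength,ITO,CdS"

# Hypothetical output file in a temporary folder.
cleaned = os.path.join(tempfile.gettempdir(), "uvvis_cleaned_example.csv")
numpy.savetxt(cleaned, data, header=header)

# Read the first line back using built-in Python only.
with open(cleaned) as f:
    first_line = f.readline()
print(first_line)
```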

Now when we read this in we do not need to specify the encoding, as the rogue unicode characters are no longer there!

Note that Numpy has inserted a `#` at the start of the header line to indicate that it is a comment and should be ignored. `numpy` has also written the file without commas separating each data value. As a result, we can read in this cleaned data file by passing a single argument to `numpy.loadtxt`:
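The round trip can be sketched as follows. The file name and values are invented, and the file is written first so that the example runs on its own.

```python
import os
import tempfile

import numpy

# Made-up values and a hypothetical file name, written with the default settings.
data = numpy.array([[300.0, 10.5, 0.2],
                    [400.0, 80.1, 5.3]])
cleaned = os.path.join(tempfile.gettempdir(), "uvvis_roundtrip_example.csv")
numpy.savetxt(cleaned, data, header="Wavelength,ITO,CdS")

# The file name alone is enough: the header line is treated as a comment
# and whitespace is the default delimiter.
reloaded = numpy.loadtxt(cleaned)
print(reloaded)
```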

## Create an array for storing data yet-to-be-generated using `numpy.zeros`

In the previous example we imported data from a file as a `numpy` array. However, it may be that we want to create a `numpy` array which stores calculation data that is generated within the code itself. For example, we may want to calculate the velocity of a ball at 50 points in time between 0 and 10 seconds inclusive. First we create a Numpy array of zeros with the correct dimensions.
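A sketch of this step. The variable name `velocity` is our own choice.

```python
import numpy

# One slot for each of the 50 time points, all initially 0.0.
velocity = numpy.zeros(50)

print(velocity.shape)  # (50,)
```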

## Use `numpy.linspace` to generate evenly spaced numbers over a given interval.
To specify the times at which we take measurements of the ball speed we can use the `numpy.linspace` function. This will generate an array of 50 evenly spaced values between 0 and 10 (inclusive).
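For example, with `time` as an illustrative variable name:

```python
import numpy

# Arguments: start value, end value (included), number of points.
time = numpy.linspace(0, 10, 50)

# Check the number of points and the two end values.
print(len(time), time[0], time[-1])  # 50 0.0 10.0
```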

## The `enumerate` function allows us to have an automatic counter within a `for` loop
Enumerate is a Python built-in function. It allows us to have an automatic counter within a `for` loop. The best way to understand `enumerate` is to see it in action.
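A small demonstration, using a hypothetical list of column headings:

```python
headings = ["Wavelength", "ITO", "CdS"]

# enumerate gives us a counter (starting at 0) alongside each item.
for index, heading in enumerate(headings):
    print(index, heading)

# Output:
# 0 Wavelength
# 1 ITO
# 2 CdS
```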

We can use `enumerate` to index the velocity array as we iterate through the `for` loop.
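A sketch of such a loop is given below. The physics is an assumption made purely for illustration: we take the ball to be dropped from rest, so that its speed is `g * t` with `g = 9.81` m/s². Any other formula could be used on the right-hand side.

```python
import numpy

g = 9.81  # assumed acceleration due to gravity in m/s^2 (illustrative)

time = numpy.linspace(0, 10, 50)
velocity = numpy.zeros(50)

# The counter i tells us where in the velocity array to store each result.
for i, t in enumerate(time):
    velocity[i] = g * t

print(velocity)
```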

Finally, we can use `numpy.around` to round the calculated velocities to 2 decimal places.
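A sketch of the rounding step, using the same assumed `g * t` formula as an illustrative source of velocities:

```python
import numpy

g = 9.81  # assumed acceleration due to gravity in m/s^2 (illustrative)
time = numpy.linspace(0, 10, 50)
velocity = numpy.zeros(50)
for i, t in enumerate(time):
    velocity[i] = g * t

# The second argument is the number of decimal places to keep.
rounded_velocity = numpy.around(velocity, 2)
print(rounded_velocity)
```

Note that `numpy.around` returns a new array and leaves `velocity` unchanged.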

Although this approach generates the correct velocity values, we can use `numpy` operations to write more readable code in fewer lines. This will be explored further in the following question.

## Key Points
- Use the `numpy` library to work with numerical data in Python
- The `loadtxt` function is used to read in .csv data
- Built-in Python functions can be used to read file headings
- To save the data to memory we can assign it to a variable
- An array is a central data structure of the `numpy` library
- Extra information about an array is stored as attributes
- The `savetxt` function is used to write data to a file
- `numpy.linspace` generates evenly spaced numbers over a given interval
- The `enumerate` function allows us to have an automatic counter within a `for` loop