# Non-Array Data Structures

In the core worksheet for session 1 we restricted ourselves to working with NumPy array data structures. As physicists this was natural; arrays are an n-dimensional structure that can hold integers or floats and allow statistics to be performed on them. Python, however, offers several other types of data structure that each serve a unique and useful purpose when programming.

### Lists

Lists in Python are defined using square brackets surrounding zero or more comma separated literals: 
```python
some_primes = [2,3,5,7,11,13]
names_of_cats = ["Ginger", "Princess", "Zorxo the Clawful"]
```

The elements of a list don't even have to be of the same type:

```python
Mixed_list = [2,"Python",16.5]
```

is a valid list. Moreover, lists don't need to contain any elements; they can be initialised as an empty list:

```python
Empty_list=[]
```
This is particularly useful when you will be adding elements to the list as your code progresses. To add an element, call the list's <span style="color:blue">.append</span> method:

```python
Empty_list.append(2)
```
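
As a minimal sketch of this pattern (the names `squares` and `n` are purely illustrative), a list can be built up one element at a time inside a loop:

```python
# Start with an empty list and grow it one element at a time
squares = []
for n in range(5):
    squares.append(n**2)   # append adds a single element to the end of the list

print(squares)        # [0, 1, 4, 9, 16]
print(len(squares))   # 5
```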


### Tuples
Tuples behave very similarly to lists, but are immutable (i.e. they cannot be changed once created). Tuple literals are created by writing a sequence of items separated by commas, optionally surrounded by parentheses. To get a tuple with only one element, you need to put a comma after the element.<br>
```python
my_tuple = 1,2,3
my_tuple = (1,2,3)        # equivalent
not_a_tuple = 1           # no comma, so this is just the integer 1
a_tuple = 1,
a_tuple = ("first!",)     # here the first and only element of the tuple is "first!".
```

Many aspects of Python use tuples implicitly. For instance, the assignment operator = will happily assign tuples of values to tuples of names:<br>
```python
A,B,C = 1,2,3
```
which is the same as:
```python
(A,B,C) = (1,2,3)
```
which is the same as:
```python
A = 1
B = 2
C = 3
```

This behaviour can be used to swap the values bound to two names:<br>
```python
A,B = 1,2
A,B = B,A
print(A,B)   # prints: 2 1
```
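
Another place where tuples appear implicitly is when a function returns more than one value. The sketch below uses a hypothetical helper, `min_and_max`, written only for illustration:

```python
def min_and_max(values):
    """Return the smallest and largest element of a sequence."""
    return min(values), max(values)   # the comma packs the two results into a tuple

result = min_and_max([3, 1, 4, 1, 5])
print(result)            # (1, 5)
print(type(result))      # <class 'tuple'>

lowest, highest = min_and_max([3, 1, 4, 1, 5])   # unpack straight into two names
print(lowest, highest)   # 1 5
```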

### Dictionaries
The third most common collection type used in Python is the dictionary, or dict, which stores mappings from keys to values. For every key there is a value, which can be almost any Python object. Keys are usually strings, but other immutable objects such as numbers and tuples can also be used as keys. Dictionary literals are written as a comma-separated list of key:value pairs, with a colon separating key from value, surrounded by (curly) braces. Dict items are accessed using the key within square brackets.<br>
```python
student_grades = {"Simon": 60, "Jenny":68, "Laura":112}
student_grades["Laura"] = 100 # Change Laura's grade.
student_grades["Pug"] = 58    # New student!
print(student_grades["Jenny"])   # prints 68
```
<br>
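
A few other dictionary operations come up often. The following is a short sketch using a cut-down version of the grades dictionary; the default value of 0 passed to `.get` is an arbitrary choice for illustration:

```python
student_grades = {"Simon": 60, "Jenny": 68}

# Check whether a key exists before using it
print("Jenny" in student_grades)     # True
print("Pug" in student_grades)       # False

# .get returns a default instead of raising a KeyError for a missing key
print(student_grades.get("Pug", 0))  # 0

# Loop over the key/value pairs
for name, grade in student_grades.items():
    print(name, grade)
```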

### When should you use each data type?

An important question to ask yourself when first thinking about a problem is: which data structure is best to use? There is no universal answer; it really depends on the situation. When to use a dictionary is fairly straightforward: use one when you want to link one set of values to another. A phonebook, which links a person's name to their number, is a good use of a dictionary.

When to use a list or a tuple can be harder to decide. The key difference between lists and tuples is that tuples are immutable, i.e. once created they cannot be altered. A tuple is therefore a good choice if you don't expect (or indeed don't want) the sequence to change. However, if you want to add or remove elements during your code's execution, then a list is the data structure to go with.
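
The difference can be seen directly in a short sketch (the names `fixed` and `changeable` are illustrative only):

```python
fixed = (1, 2, 3)
changeable = [1, 2, 3]

changeable[0] = 10       # fine: lists can be modified in place
changeable.append(4)
print(changeable)        # [10, 2, 3, 4]

try:
    fixed[0] = 10        # tuples do not support item assignment
except TypeError as error:
    print("Cannot modify a tuple:", error)

print(fixed)             # (1, 2, 3)
```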

For most physics applications, we will be dealing with 2D, 3D or even higher-dimensional data that we need to operate on. This is best achieved using the arrays provided by the NumPy module.

### Objects and methods

When looking at lists, I used <span style="color:blue">append</span> to add an element onto a list; this is called a method. What are methods, how do they work, how are they called and what are their implications? To understand this we need to dig a little deeper into how data structures are actually defined in Python.

In Python, the data structures you encounter are stored as objects. Objects are a way of storing definitions, methods, variables and many other things in a package that is modular, i.e. self-contained.

As an example, imagine we have an object square that represents the shape of a square. Within that object, there would need to be a variable that defines the side length of the square. Additionally, there can be methods that calculate its area or perimeter. As another example, consider an object bike, which can have variables that define its colour and number of gears, and methods that calculate the maximum speed that could be achieved.
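
To make the square example concrete, here is a minimal sketch of how such an object could be defined. You are not expected to write code like this in this course, and the class name `Square` and its methods are illustrative choices rather than anything built into Python:

```python
class Square:
    """A square described by the length of its side."""

    def __init__(self, length):
        self.length = length          # variable stored inside the object

    def area(self):
        return self.length**2         # method: uses the stored variable

    def perimeter(self):
        return 4 * self.length


my_square = Square(3)
print(my_square.length)        # 3
print(my_square.area())        # 9
print(my_square.perimeter())   # 12
```

Note that methods are called with a dot and parentheses, exactly as with `Empty_list.append(2)` earlier.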

Coming back to data structures, the data structure objects contain definitions of how data is stored within them, and methods that allow you to manipulate that structure. This course will not consider objects explicitly but will merely make use of them. In next year's Computing course you will start to create your own objects and associated methods.

# Converting data types and data structures

In section 4, it was shown that variables can be converted from one data type to another. The same can be done with data structures, whether to change the data type of the elements or to change the structure itself. The first thing to consider is: what is my data structure's type and what do I want it to be? For physicists the most common change will be between a list and an array. To find the type of a data structure, use the <span style="color:blue">type</span> function.

In [None]:
import numpy as np
Array=np.array([1,2,3,4])
print(type(Array))
List=[1,2,3,4]
print(type(List))

Knowing the data structure is crucial to ensuring you handle your data in the most efficient way. For example, the NumPy library is optimised to run operations on arrays. We can convert a list to an array by calling the <span style="color:blue">np.array</span> function and passing it the list as an argument.

In [None]:
Array_of_list=np.array(List)
print(Array_of_list,type(Array_of_list))

To go the other way, we call the array's <span style="color:blue">.tolist</span> method (which can also be written in full as np.ndarray.tolist):

In [None]:
List_of_array=Array.tolist()
print(List_of_array,type(List_of_array))
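
One reason the distinction matters is that the same operator can behave differently on the two structures. A small sketch, reusing the values from above:

```python
import numpy as np

List = [1, 2, 3, 4]
Array = np.array(List)

print(List * 2)     # repeats the list: [1, 2, 3, 4, 1, 2, 3, 4]
print(Array * 2)    # multiplies each element: [2 4 6 8]

# Converting back recovers the original list
print(Array.tolist() == List)   # True
```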

For arrays, we can also see the type of data stored within them by accessing the array's <span style="color:blue">.dtype</span> attribute.

In [None]:
print(Array.dtype)

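Here the data type is reported as int32 or int64, depending on your platform and NumPy version; these are 32-bit and 64-bit integers respectively (the number of bits refers to how each value is stored in memory). Once the data type of an array is known, it can be converted by applying <span style="color:blue">.astype</span> to the array.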

In [None]:
Array_float=Array.astype(float)
print(Array_float.dtype)
Array_complex=Array.astype(complex)
print(Array_complex.dtype)

<div style="background-color: #00FF00">

**Exercise: the following data structures are either tuples, lists or arrays. Query their types, then convert each one to the other two structures: i.e. convert the array to a list and a tuple, the tuple to an array and a list, and the list to an array and a tuple:**

In [None]:
a=np.array([1,2,3])
b=(4,5,6)
c=[7,8,9]

# Advanced Data Structures - The Pandas DataFrame

The data loaded from files in session 1 had some fundamental drawbacks: it all had to be of the same data type, and there was no information on what the data itself represented. For small and simple data sets np.loadtxt() will normally be sufficient, but when text files contain different data types, or if you want to open, for example, an Excel spreadsheet, the loadtxt function is insufficient. The Python library Pandas allows you to open and manipulate complex data in an intuitive fashion.

## Opening and examining a file with Pandas

The first thing to do is to import the pandas library using:

```python
import pandas as pd
```

To open a file with pandas, the most common function is read_csv, where csv stands for comma-separated values. To open the resistivity data used in session 1, we use:

```python
df=pd.read_csv('Data/Resistivity2.csv')
```

(Note that this is a slightly different .csv file specifically used in this advanced worksheet).
<div style="background-color: #00FF00">
    
**Exercise: open the Resistivity2.csv data using the pd.read_csv command and store the result under the name df. Use the type command to print what data structure has been generated, and then print the data itself. What does this return?**

In some cases, the data you are reading in might not be in a comma-separated format; the values may instead be separated by whitespace or semicolons. To ensure that the data is read in properly you will need to pass the sep keyword to tell the function how the data is separated, for example sep=';' for semicolons or sep='\s+' for whitespace (older code may use the delim_whitespace keyword, which is deprecated in recent versions of Pandas). Look at the read_csv documentation to learn about the other options available.
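
As a hedged sketch of handling other separators, the example below reads two small pieces of made-up text with the `sep` keyword. The column names `T` and `R` and the values are invented for illustration, and `io.StringIO` is used only so that the example does not need a file on disk:

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory so the example is self-contained
semicolon_text = "T;R\n200;1.1\n220;1.2\n"
whitespace_text = "T   R\n200 1.1\n220   1.2\n"

# sep=";" splits on semicolons; sep=r"\s+" splits on any run of whitespace
df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=";")
df_whitespace = pd.read_csv(io.StringIO(whitespace_text), sep=r"\s+")

print(df_semicolon)
print(df_whitespace)
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the file path.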

## The DataFrame

The data structure that has been generated by read_csv is called a DataFrame; this is at the heart of how Pandas stores and manipulates data. A DataFrame is a 2D data structure that is composed of the following components:

1) The data

2) The index

3) The columns

Looking at the DataFrame printed above, the data should be obvious: it is all of the numbers that were contained in the Resistivity2.csv file. The index labels the rows of the DataFrame, each of which corresponds to one instance when data was taken. In this case data was taken every 20 K between 200-360 K. The columns contain the data that was taken at each index, which for this data set was the resistivity of copper and aluminium.

There are two important points to note about the DataFrame. First, the top of the DataFrame contains whatever was in the first line of the file that was loaded, in this instance the labels of the resistivity data. These are known as headers, and they allow you to access your data without needing to use indices. You should check that the file you are loading actually has a header row, or else Pandas will use your first row of data as the headers.

Second, the first column of the DataFrame contains the numbers 0-9. These are the index labels that can be used to access rows of data. They were generated automatically because we did not tell Pandas what to use. Looking at our data, however, we can see a more convenient index to use: the temperature! To set the index of our DataFrame to be the temperature, we can either use the index_col keyword argument when reading in our data or call the set_index method on our DataFrame. Look up both in the documentation to understand how they work.
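
The general shape of the two approaches is sketched below on a small made-up data set (the column names `time (s)` and `height (m)` and the values are hypothetical, and are not the worksheet data):

```python
import io
import pandas as pd

# Hypothetical data, not the worksheet file
text = "time (s),height (m)\n0,10.0\n1,5.1\n2,0.0\n"

# Option 1: choose the index column while reading
df_a = pd.read_csv(io.StringIO(text), index_col="time (s)")

# Option 2: read first, then set the index; set_index returns a new DataFrame
df_b = pd.read_csv(io.StringIO(text)).set_index("time (s)")

print(df_a)
print(df_a.index.name)   # time (s)
```

Both routes should give the same result; which one to use is largely a matter of whether you still need the original, un-indexed DataFrame.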

<div style="background-color: #00FF00">
    
**Exercise: using the set_index function, make a new DataFrame df2 that has the temperature column as the index. Then, create a new DataFrame df3 that sets the index column to temperature during read in. Print these two DataFrames to look at their structure.**

For larger data sets, it can be impractical to print the whole DataFrame to the screen. We can look at only parts of the data using the .head() and .tail() methods of the DataFrame.

<div style="background-color: #00FF00">
    
**Exercise: use the .head() and .tail() methods to look at selected parts of the DataFrame df2. How many rows are printed by default? How can you change that? Look at the documentation to help you.**

Sometimes we don't need to look at the contents of the DataFrame itself; we only want a top-level summary of the data. This is achieved using the DataFrame.info() command.

<div style="background-color: #00FF00">
    
**Exercise: use the DataFrame.info() command to generate a top level summary of the DataFrames df2/df3. What do you see?**

### Creating a DataFrame
Often, we are dealing with data that is not in a format that can immediately be turned into a DataFrame as it may be missing headers or an index. It is then down to the user to create the necessary information to turn the data into a DataFrame compatible format. To do this we use the pd.DataFrame function. For example:

In [None]:
import pandas as pd
import numpy as np

d=[[2,3,'e',5],[4,3,'f',5],[5,3,'g',4]]
Headers=['a','b','c','d']
df=pd.DataFrame(data=d,columns=Headers)
print(df)

Notice that one of the columns contains strings rather than numbers; this is one of the big advantages of using a DataFrame to store your data.
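
To see how Pandas keeps track of this, the `dtypes` attribute lists the data type of each column. A short sketch using the same data as above:

```python
import pandas as pd

d = [[2, 3, 'e', 5], [4, 3, 'f', 5], [5, 3, 'g', 4]]
df = pd.DataFrame(data=d, columns=['a', 'b', 'c', 'd'])

print(df.dtypes)        # one data type per column
print(df['a'].sum())    # numeric columns can still be summed: 11
print(list(df['c']))    # ['e', 'f', 'g']
```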


<div style="background-color: #00FF00">
    
**Exercise: create a DataFrame from the following array:**

```python

Array=[[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4],[1,2,3,4]]

```

**with the following header:**

```python
header=['a','b','c','d']
```


## Plotting with Pandas

Being able to store heterogeneous data in a single data structure is useful, but Pandas also makes it very convenient to plot directly from a DataFrame. This is accomplished by the following line of code:

```python
df['column name'].plot()
```

Alternatively you can use:

```python
df.plot('x column name','y column name')
```
Note how we only reference the name of the column; we don't need to know its position. For the first method we didn't specify an x-axis; with that syntax Pandas uses the DataFrame index as the x-axis. Let's look at these methods of plotting:

In [None]:
import matplotlib.pyplot as plt # this needs to be imported to show the plot

df=pd.read_csv('Data/Resistivity2.csv')
df['Resistivity Cu (ohm/m)'].plot()
df.plot('Temperature (K)','Resistivity Cu (ohm/m)')
plt.show()

In the first plotting style Pandas takes the DataFrame index and uses it as the x-axis. Because df here still has the default integer index, the x-axis is simply the row number; with a DataFrame that has temperature set as its index, such as df2 from the earlier exercise, the x-axis would be temperature. In the second style we named the x column explicitly, so it works with df, which doesn't have temperature set as the index. Plotting data without having to reference column numbers is more intuitive and will make your code easier to understand.

**However, it relies on the programmer choosing sensible column names, so take care!**


<div style="background-color: #00FF00">

**Exercise - Plot the Al resistivity data using both styles of plotting. Be sure to include a title and y label in your plot.**

On a more general note, we can access columns by querying their column name, for example:

In [None]:
print(df['Temperature (K)'])

## Filtering DataFrames

So far we have made use of the whole DataFrame. A powerful feature of the DataFrame is that, when you have a large amount of data, you can select only a small subset of it based on a condition on the data itself. As an example, we can filter the above DataFrame df to only contain temperatures below 300 K:

In [None]:
df3=df[df['Temperature (K)'] < 300]
print(df3)

Notice that compared to the original DataFrame, we have 4 fewer rows. This becomes powerful when we have a large 2D DataFrame containing lots of data that can be grouped by the value in a certain column. Look at the following DataFrame:

In [None]:
car_df=pd.read_csv('Data/Car_Data.csv')
print(car_df.info())
print('')
print(car_df.head())

This DataFrame contains information about various aspects of 392 cars. There is a lot of information here; however, we will focus on only a few columns to illustrate some key features of DataFrames. Let's plot the weight of each car versus its mpg, the miles per gallon it can achieve.

In [None]:
car_df.plot.scatter(x='weight',y='mpg')
plt.show()

Note that we have used the scatter plot function and not just the plot function. This is because the DataFrame is not sorted in ascending weight, so a line plot would zigzag back and forth between points. Looking at the plot, we can see a clear trend that is not surprising: heavier cars get worse mileage. Looking at the summary of the DataFrame, we see that there is an origin column that says where the car was made. To see all the different values contained in the origin column, we use the following command:

In [None]:
print(car_df.origin.unique())

Note that using the command:

```python
print(car_df.mpg.unique())
```

would be a bad idea as there are lots of different values of mpg; we can see this from the graph we plotted! Just as we filtered our resistivity DataFrame based on temperature, we can filter our car data based on region. The syntax for this is a little different from above, and looks like:

```python
new_df=old_df[old_df['column_name'].str.contains('condition')]
```

So to extract the US car data from the DataFrame, we would use:

```python
car_df_US=car_df[car_df['origin'].str.contains('US')]
```
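
Both kinds of filter work the same way underneath: the expression inside the square brackets produces a column of True/False values (a boolean mask), and only the rows marked True are kept. The sketch below illustrates this on a small made-up DataFrame; the origin labels and mpg values are hypothetical stand-ins, not the contents of Car_Data.csv. It also shows that an exact match with `==` and a substring match with `str.contains` are not the same thing:

```python
import pandas as pd

# Hypothetical stand-in for the car data
cars = pd.DataFrame({
    'origin': ['US', 'Europe', 'Asia', 'US'],
    'mpg': [18.0, 30.5, 32.0, 15.0],
})

mask = cars['origin'] == 'US'      # Series of True/False, one per row
print(mask.tolist())               # [True, False, False, True]

cars_us = cars[mask]               # keep only the rows where the mask is True
print(cars_us)

# == needs the whole string to match; str.contains only needs a substring
print((cars['origin'] == 'U').tolist())             # [False, False, False, False]
print(cars['origin'].str.contains('U').tolist())    # [True, False, False, True]
```

When the labels in a column are known exactly, `==` is the stricter choice; `str.contains` is handy when only part of the label is known.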


<div style="background-color: #00FF00">

**Exercise: extract the data for each origin into a separate DataFrame, like the code snippet above. Then plot the weight vs mpg of each car region on the same graph in different colours to answer the question: which region produces cars with the worst mpg?**

**Hint: To get each scatter plot on a common axis, you will need to use the ax keyword argument. Consult Google, the Pandas documentation, and Stack Overflow about how to go about this!**