#### Data Processing with Python

If pandas is installed in your Python environment, it's easy to import:

In [None]:
import pandas as pd

<hr>
##### IN CASE OF PROBLEMS WITH PACKAGES:


In [None]:
# SOLUTION A: select this cell and type Shift-Enter to execute the code below.

%conda install pandas

# Now restart the kernel (Menu -> Kernel -> Restart Kernel)

In [None]:
# SOLUTION B: select this cell and type Shift-Enter to execute the code below.

%pip install pandas

# Now restart the kernel (Menu -> Kernel -> Restart Kernel)

<hr>

# 1. DataFrames

Pandas is built around a fundamental data object called a [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

Here's how you can create one from a Python [dict](https://docs.python.org/3/tutorial/datastructures.html#dictionaries):

In [None]:
planets = pd.DataFrame({ 
    'name' : ["Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune"], 
    'type' : ["Terrestrial", "Terrestrial", "Terrestrial", "Terrestrial", "Gas giant", "Gas giant", "Ice giant", "Ice giant"],
    'mass' : [0.0553, 0.815, 1, 0.107, 317.8, 95.2, 14.5, 17.1],
    'diameter' : [0.383, 0.949, 1, 0.532, 11.21, 9.45, 4.01, 3.88],
    'distance from sun' : [0.387, 0.723, 1, 1.52, 5.20, 9.58, 19.2, 30.05],
    'orbital period' : [0.241, 0.615, 1, 1.88, 11.9, 29.4, 83.7, 163.7],
    'rings' : [False, False, False, False, True, True, True, True]
})

planets

The variable `planets` now points to a DataFrame object containing our data. We can get a quick glimpse of the data using the [`head`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) method, which returns the first five rows by default. Passing a number changes how many rows are returned; here we ask for three:

In [None]:
planets.head(3)

The attribute [`shape`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) holds the dimensions of the DataFrame as (#rows, #columns) :

In [None]:
planets.shape
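
As a quick check of how `head` and `shape` fit together, here is a minimal sketch. It assumes a small stand-in DataFrame called `inner` (a name chosen just for this illustration) holding a few values copied from the planets table:

```python
import pandas as pd

# A small stand-in frame; the values are copied from the planets table.
inner = pd.DataFrame({
    'name': ["Mercury", "Venus", "Earth", "Mars"],
    'mass': [0.0553, 0.815, 1, 0.107],
})

print(inner.shape)           # (4, 2): four rows, two columns
print(inner.head(3).shape)   # (3, 2): head(3) keeps the first three rows
print(inner.head().shape)    # (4, 2): head() asks for five rows, but only four exist
```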

***

## 1.1 Methods and Attributes

To make use of a DataFrame, we need to understand some basic concepts in object-oriented Python.

A *method* is a function that is bound to an object. We show that we want to call the method `head` of the object `planets` using a dot: `planets.head()`.

In a similar way, objects can have associated variables called *attributes*, such as `planets.shape`. Attributes are read without parentheses.

A pandas DataFrame has many other useful [methods](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#computations-descriptive-stats) and [attributes](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html#attributes-and-underlying-data).
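
One way to see the difference is to ask Python whether each name can be called. This is only a sketch, using a small stand-in frame named `inner` (an illustrative name, with values copied from the planets table):

```python
import pandas as pd

inner = pd.DataFrame({
    'name': ["Mercury", "Venus", "Earth", "Mars"],
    'mass': [0.0553, 0.815, 1, 0.107],
})

print(callable(inner.head))    # True: a method, so we call it with ()
print(callable(inner.shape))   # False: an attribute, so we read it without ()
```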

##### *Exercise 1a*

1. What do the following methods do?

[`tail`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html),
[`sample`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html),
[`describe`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html),
[`copy`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.copy.html)



2. To what do the following attributes refer?

[`size`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.size.html),
[`dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html),
[`columns`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.columns.html),
[`values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.values.html)

***

## 1.2 Accessing Data

Pandas provides several different ways to get data out of the DataFrame.



### Accessing single values

A single value can be accessed using [`iat[]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html). 

We can think of it as meaning "the value **at** an **i**nteger position". 

It treats the DataFrame like an array with two *axes*.

The row coordinate is the first axis; the column coordinate is second.


In [None]:
planets.iat[1,2]
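
pandas also has a label-based counterpart to `iat[]`, called `at[]`. The sketch below compares the two on a small stand-in frame (`inner` is an illustrative name; the values come from the planets table). With the default row labels 0, 1, 2, ... the row position and the row label happen to coincide:

```python
import pandas as pd

inner = pd.DataFrame({
    'name': ["Mercury", "Venus", "Earth", "Mars"],
    'type': ["Terrestrial", "Terrestrial", "Terrestrial", "Terrestrial"],
    'mass': [0.0553, 0.815, 1, 0.107],
})

by_position = inner.iat[1, 2]     # row position 1, column position 2
by_label = inner.at[1, 'mass']    # row label 1, column label 'mass'

print(by_position, by_label)      # both refer to the mass of Venus
```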

### Accessing rows and columns

[`iloc[]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) means "**loc**ate data by **i**nteger position". 

It is used to access subsets of rows and columns, using the same coordinate system as `iat[]`.

#### Selecting rows

We can use `iloc[]` with a [slice](https://www.freecodecamp.org/news/slicing-and-indexing-in-python/) to get a subset of rows:

In [None]:
planets.iloc[2:4]

Because *slicing rows* is such a common operation, pandas also provides a shortcut:

In [None]:
planets[2:4]

Alternatively, we can provide `iloc[]` with a list of the indices to select:

In [None]:
planets.iloc[[1,3,5]]

##### *Exercise 1b*

1. Select the last three rows.

2. Select three rows at random.

3. Make a DataFrame containing only the first row.

4. Make a DataFrame containing the first, second and last rows.

#### Selecting columns

We can access columns by integer using the second axis of `iloc[]`:


In [None]:
planets.iloc[:,2]

When given a single integer (e.g. `2` above), `iloc[]` returns the column values as a pandas [Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html) object. 

Here's how to return the same column as a DataFrame:

In [None]:
planets.iloc[:,[2]]

Notice that we still need to provide a placeholder `:` before the comma, to indicate "all of the rows".
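
To make the Series/DataFrame distinction concrete, here is a small sketch that inspects the type and shape of each result. The frame `inner` is a stand-in built from a few values in the planets table:

```python
import pandas as pd

inner = pd.DataFrame({
    'name': ["Mercury", "Venus", "Earth", "Mars"],
    'mass': [0.0553, 0.815, 1, 0.107],
})

as_series = inner.iloc[:, 1]     # a single integer gives a Series
as_frame = inner.iloc[:, [1]]    # a one-element list gives a DataFrame

print(type(as_series).__name__, as_series.shape)   # Series (4,)
print(type(as_frame).__name__, as_frame.shape)     # DataFrame (4, 1)
```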

Using a slice or list after the comma returns a subset of columns:

In [None]:
planets.iloc[:,2:4]

However, accessing columns by position is not usually very convenient. We need to be able to refer to the columns by their *labels*.

### Accessing by label
[`loc[]`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) means "locate by label". Our columns are labelled with strings.


In [None]:
planets.loc[:,'name']

This returns a Series object, which represents the data from a single column. The numbers shown next to the values are the *row labels*.
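
Row labels and row positions are not the same thing, even though they look identical in a freshly created DataFrame. The sketch below reorders the rows of a small stand-in frame (`inner` and `reordered` are illustrative names) to show that `iloc[]` follows position while `loc[]` follows the label:

```python
import pandas as pd

inner = pd.DataFrame({
    'name': ["Mercury", "Venus", "Earth", "Mars"],
    'mass': [0.0553, 0.815, 1, 0.107],
})

# Reorder the rows by position; each row keeps its original label.
reordered = inner.iloc[[3, 0, 2, 1]]

print(reordered.index.tolist())     # [3, 0, 2, 1]
print(reordered.iloc[0]['name'])    # Mars: the row in first position
print(reordered.loc[0, 'name'])     # Mercury: the row labelled 0
```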

As a shortcut, we can also use `[]` with the *column labels* to select specified columns:

In [None]:
planets['name']

You may also encounter the following usage, where the column label is used as a direct attribute of the DataFrame:

In [None]:
planets.name

However, this notation has limitations that mean it cannot be used in all situations. It does not work for column labels that contain spaces, and when a column label has the same name as an existing attribute or method, the attribute or method takes precedence.
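
Here is a small sketch of both pitfalls. It assumes a hypothetical frame whose column label `size` clashes with the built-in DataFrame attribute `size`; that column name is invented for this illustration and is not part of the planets table:

```python
import pandas as pd

# Hypothetical frame: the label 'size' clashes with the DataFrame attribute `size`.
df = pd.DataFrame({
    'name': ["Earth", "Mars"],
    'size': [1.0, 0.532],
    'orbital period': [1.0, 1.88],
})

print(df.size)                # 6: the built-in attribute (rows x columns), not our column
print(df['size'].tolist())    # [1.0, 0.532]: the column itself

# A label containing a space cannot be written as an attribute at all,
# so square brackets are the only option here.
print(df['orbital period'].tolist())   # [1.0, 1.88]
```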

A list placed inside the `[]` shortcut can be used to select multiple columns.

In [None]:
planets[['name','mass']]

##### *Exercise 1c*

1. Select the first three rows, but only the **name** and **diameter** columns.

2. Select the first two columns for rows 4 and 6.

3. Select all columns from **type** to **diameter** inclusive.

***

## 1.3 Querying and sorting data

Of course, we are not just limited to accessing data by position and label.

Here are a couple of useful DataFrame methods for basic data manipulation:

### [`query`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html)
selects rows according to whatever conditions we specify, e.g.:

In [None]:
planets.query('name == "Earth"')

In [None]:
planets.query('diameter > 2')

Note that the query is a Boolean expression, provided as a string `''`. 

Inside the query, column names are unquoted and string values are quoted using `""`.

We can refer to columns containing spaces by enclosing them in backticks ` `` `.

We can also refer to variables in the environment using the `@` prefix.

In [None]:
max_period = 30
planets.query('rings and `orbital period` < @max_period')
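
You may also see the same kind of selection written with a Boolean mask inside `[]`. The sketch below runs both forms on a stand-in frame called `giants` (an illustrative name, with values copied from the planets table) and checks that they agree:

```python
import pandas as pd

giants = pd.DataFrame({
    'name': ["Jupiter", "Saturn", "Uranus", "Neptune"],
    'orbital period': [11.9, 29.4, 83.7, 163.7],
    'rings': [True, True, True, True],
})

max_period = 30

# The query form, as used above.
via_query = giants.query('rings and `orbital period` < @max_period')

# The equivalent Boolean mask: combine conditions with & and wrap comparisons in ().
via_mask = giants[giants['rings'] & (giants['orbital period'] < max_period)]

print(via_query['name'].tolist())   # ['Jupiter', 'Saturn']
print(via_query.equals(via_mask))   # True
```

The mask form is more verbose, but it works with any column label and needs no special quoting.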

### [`sort_values`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html)

returns a copy of the DataFrame, sorted by ascending column value:

In [None]:
planets.sort_values('diameter')

...or by descending value using `ascending=False`:

In [None]:
planets.sort_values('diameter', ascending=False)

The original DataFrame is unchanged:

In [None]:
planets
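
`sort_values` also accepts a list of column labels, with a matching list for `ascending`. A minimal sketch on a stand-in frame (`sample` and `ordered` are illustrative names; the values come from the planets table):

```python
import pandas as pd

sample = pd.DataFrame({
    'name': ["Earth", "Mars", "Jupiter", "Saturn"],
    'type': ["Terrestrial", "Terrestrial", "Gas giant", "Gas giant"],
    'mass': [1, 0.107, 317.8, 95.2],
})

# Sort by type (A to Z), then by mass (largest first) within each type.
ordered = sample.sort_values(['type', 'mass'], ascending=[True, False])

print(ordered['name'].tolist())   # ['Jupiter', 'Saturn', 'Earth', 'Mars']
print(ordered.index.tolist())     # [2, 3, 0, 1]: rows keep their original labels
print(sample['name'].tolist())    # ['Earth', 'Mars', 'Jupiter', 'Saturn']: unchanged
```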

##### *Exercise 1d*

Use manipulations of `planets` to make DataFrames containing the following:

1. the terrestrial planets, ordered by increasing mass.

2. the giant planets, ordered from smallest to largest.

3. the planets that are more massive than Neptune.

***

## 1.4 Making new columns from existing ones

It's easy to add a new column to a DataFrame. We just use `[]=` to assign a Series to a new column label:

In [None]:
df = planets.copy()
df['radius'] = df['diameter'] / 2
df

Note that Series objects combine element by element, with values matched up by row label, in a similar way to NumPy arrays, e.g.:

In [None]:
planets['name'] + " -- " + planets['type']
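
One difference from NumPy arrays is worth seeing: pandas pairs values by label rather than by position. The sketch below uses two made-up Series (`a` and `b`, with invented values) whose labels are in a different order:

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order.
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'x'])

total = a + b   # values are paired by label, not by position

print(total.index.tolist())   # ['x', 'y', 'z']
print(total.tolist())         # [31, 22, 13]
```

Within a single DataFrame all columns share the same row labels, so this matching is invisible and the result looks purely row by row.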

##### *Exercise 1e*

Add a new column to `planets` to show the density of each planet relative to Earth.


***