# Pandas basics

This notebook covers the basic data structures in pandas (Series, DataFrame, Index), the dtype system and indexing within Series and DataFrames.

## What is pandas?

To quote the [project page](https://pandas.pydata.org/), it is:

> a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

For the wider data science community (at least its Python-speaking part), it is the de facto tool for analysing and manipulating data that:
- are in tabular format, i.e. have **columns** (usually describing a certain property) and **rows** (usually representing individual objects having those properties)
- are big/complex enough not to be easily handled by lists and dicts
- are small enough to fit in a computer's memory several times over (rarely more than a few hundred million rows)

Let's start by importing the library using its typical alias:

In [50]:
import pandas as pd  # common alias

In [2]:
# And two functions to support the exercises

from solutions import run_solution    # Gives the expected output
from solutions import show_solution   # Show (one of) the right solution(s)

**Exercise hello_world**: Print "hello world"

In [3]:
run_solution("hello_world")

In [4]:
show_solution("hello_world")

## Reading data

We will start by reading existing data. For this, we have a simple table scraped from IMDB.com that lists the top-rated movies, as retrieved on a certain date in 2022. You can have a look in `data/top_movies.csv`:

```csv
rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Dark Knight,2008,9.0
4,The Godfather Part II,1974,9.0
5,12 Angry Men,1957,8.9
6,Schindler's List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.8
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,"The Good, the Bad and the Ugly",1966,8.8
```

To read external data sources, pandas provides many functions, all named `read_*` (e.g. `read_csv`, `read_excel`, ...), which support most relevant data formats (and even databases, the clipboard, ...). One of the most common (though perhaps not the most suitable) formats is [**comma-separated values**](https://en.wikipedia.org/wiki/Comma-separated_values) (or **CSV**). Our input file uses this format too, so we will use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):

In [5]:
data = pd.read_csv("../data/top_movies.csv")

If there was no error (there shouldn't be), you have received a Python representation of the input data. Let's see.

In [6]:
data

Nice, looks almost like a spreadsheet. The basic view (unless you are running this notebook in a more sophisticated environment) shows the first and last five rows and four named columns (i.e. rank, title, year and rating).

So what is this `data` object exactly?

## Pandas data structures

(see <https://pandas.pydata.org/docs/user_guide/dsintro.html>)

In [7]:
type(data)

It is a **data frame**, which basically means "a table" with somewhat strict requirements about the columns and rows (unlike spreadsheets, this is not a free-form area). You'll find a similar concept in statistical languages like [R](https://www.r-project.org/).

Let's dig in:

In [8]:
# Select just one "column"
titles = data["title"]
titles

In [9]:
type(titles)

### Series

Another fundamental pandas type is [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), a one-dimensional data structure looking a bit like a list. Does it behave like one?

In [111]:
titles[0]    # First item

In [112]:
titles[:5]  # First five items

**Exercise last_10_titles**: Select the last 10 movie titles from the list.

In [115]:
# run_solution("last_10_titles")
# show_solution("last_10_titles")

**Exercise list_like**: Choose a few more operations you would typically do with a list and try to apply them on `titles`. What will happen?

In [116]:
# show_solution("list_like")

It's quite easy to convert a Series to a list and vice versa. The easiest way to create a Series is actually to construct it from a list (passed as the first argument):

In [14]:
directors = pd.Series(["Sofia Coppola", "Miloš Forman", "Lana Wachowski"])
directors

In [15]:
directors.tolist()
# or equivalently
list(directors)

![Series](images/series.png)

A Series has five interesting properties:

#### [.values](https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html)

The data itself, in the form of a numpy array (we are not covering this library in this workshop but feel free to explore it!). Quite close to the list.

In [117]:
titles[:20].values

In [118]:
type(titles.values)

#### [.dtype](https://pandas.pydata.org/docs/reference/api/pandas.Series.dtype.html)

Unlike in a list, all values of a series must be of the same type. Pandas has its own set of types, so-called **dtypes**, that are close to basic Python types (like int, float, str) and can be converted to and from them but are (especially in the case of numbers) more specific in terms of precision and memory used. More on that later.

By default, values are stored in the most general but least efficient dtype, "object" (or "O"). A more specific dtype (especially a numeric one) is selected only if all items fit it.

In [18]:
titles.dtype
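
To make the inference rule concrete, here is a minimal sketch. The sample values are made up for illustration and the exact integer width may depend on your platform and pandas version:

```python
import pandas as pd

# All items are integers, so an integer dtype is selected
all_ints = pd.Series([1, 2, 3])

# Integers mixed with floats still fit a numeric dtype (float64)
ints_and_floats = pd.Series([1, 2.5, 3])

# Numbers mixed with a string fit no specific dtype, so pandas falls back to object
mixed = pd.Series([1, "two", 3.0])

print(all_ints.dtype, ints_and_floats.dtype, mixed.dtype)
```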

#### [.index](https://pandas.pydata.org/docs/reference/api/pandas.Series.index.html)

In a list, you select items by their numerical order; in a dict, you use a key (but not order). Pandas introduces a generalized (potentially multi-level) index that allows accessing items by either order or label, and also allows using ranges in natural scales like time.

By default, a range index is created for you that corresponds exactly to the order of items.

In [19]:
titles.index
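
As a small illustration of "ranges in natural scales like time", the sketch below builds a Series indexed by days and selects one week by label. The `visitors` Series and its values are hypothetical, invented only for this example:

```python
import pandas as pd

# One (made-up) value per day of September 2022
days = pd.date_range("2022-09-01", "2022-09-30")
visitors = pd.Series(range(30), index=days)

# Select a range by date labels; both ends are included
first_week = visitors.loc["2022-09-01":"2022-09-07"]
print(first_week)
```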

#### [.name](https://pandas.pydata.org/docs/reference/api/pandas.Series.name.html)

An optional attribute, used e.g. in the DataFrames. It does not have anything to do with the variable the Series is stored in.

In [20]:
titles.name
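
One place where the name matters: when a Series is turned into a DataFrame, its name becomes the column label. A minimal sketch, with `actor_ages` as an illustrative variable that is not used elsewhere in this notebook:

```python
import pandas as pd

actor_ages = pd.Series(
    [73, 32],
    index=["Meryl Streep", "Emma Watson"],
    name="Actor age",
)

# The Series name is used as the column label
as_frame = actor_ages.to_frame()
print(as_frame.columns.tolist())

# rename() with a scalar returns a copy with a different name
renamed = actor_ages.rename("age")
print(renamed.name, "/", actor_ages.name)
```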

#### [.size](https://pandas.pydata.org/docs/reference/api/pandas.Series.size.html)

A read-only property telling how many items are stored in the series.

In [119]:
titles.size

# or len(titles)

When creating a Series, you may set these attributes (apart from `size`) explicitly using keyword arguments:

In [120]:
ages = pd.Series(
    [73, 32, 83],
    name="Actor age",
    index=["Meryl Streep", "Emma Watson", "Ian McKellen"],
    dtype=float
)
ages

### Index

As stated, `Index` is a generalized set of labels for the series items (or data frame rows or columns) that allows sophisticated access to them.

By default, a RangeIndex starts with 0 and monotonically increases item-by-item (or row-by-row), allowing the Series to behave like a list.

In [121]:
titles.index

However, there are other types of indices too:

In [124]:
# Arbitrary values
ages.index

In [123]:
# Numbers (but not as range)
best_picture = pd.Series(
    ["Parasite", "Nomadland", "Coda"],
    index=[2019, 2020, 2021]
)
best_picture.index

You can also create an index explicitly or using various utility functions:

In [26]:
# All days of September 2022
pd.date_range("2022-09-01", "2022-09-30")

**Exercise own_series**: Create your own series with your favourites in the movie industry, setting a custom index.

In [27]:
pd.Series(
    # ...
)

# No correct answer :-)

### DataFrame

![DataFrame](images/data_frame.png)

Let's get back to DataFrame...

Just as a `Series` is a container of values labeled by an `Index`, a `DataFrame` is a two-dimensional container that wraps an array (`.values`) in two indices: one for rows and one for columns.

It has a similar set of properties:

In [28]:
data.columns

Note that the index is shared by all columns (Series) the DataFrame is composed of:

In [29]:
data.index

In [30]:
data.index is titles.index

In [31]:
data.values[:10]

In [32]:
data.shape    # Size (rows x columns)

In [33]:
data.dtypes   # Note the plural!

If you want to create a new DataFrame (apart from reading it from an external source), you can choose from several ways, most frequently from a list of dicts or a dict of lists (or even a dict of dicts).

In [101]:
characters = pd.DataFrame({
    "movie": ["Trainspotting", "The Matrix", "Forrest Gump"],
    "main character": ["Mark Renton", "Neo", "Forrest Gump"]
})
characters

is equivalent to:

In [35]:
pd.DataFrame([
    {"movie": "Trainspotting",  "main character": "Mark Renton"},
    {"movie": "The Matrix",  "main character": "Neo"},
    {"movie": "Forrest Gump",  "main character": "Forrest Gump"},
])
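
The dict-of-dicts variant mentioned above can be sketched as follows: the outer keys become column names and the inner keys become row labels. The `year` column is added here only for illustration:

```python
import pandas as pd

by_movie = pd.DataFrame({
    "main character": {"Trainspotting": "Mark Renton", "The Matrix": "Neo"},
    "year": {"Trainspotting": 1996, "The Matrix": 1999},
})

print(by_movie)
print(by_movie.loc["The Matrix", "main character"])
```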

**Exercise actors_and_ages:** Similar to the `ages` Series defined above, create a DataFrame that contains three columns "first_name", "surname" and "age" of the listed actors (or your favourite ones).

In [36]:
# run_solution("actors_and_ages")
# show_solution("actors_and_ages")

## Data types (dtypes)

(see <https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes>)

Pandas offers a huge variety of dtypes. It takes the most commonly used ones from [numpy](https://numpy.org/doc/stable/user/basics.types.html) and adds a few of its own (suitable for data analysis).

Note that a dtype is not a type or class in the Python sense, even though the two serve a similar purpose.

In [37]:
data.dtypes

### Integers

There is a huge variety of integer dtypes. This allows advanced users to use as little memory as possible at the expense of a limited range of values that the dtype can represent. We will not list them all here as for most practical purposes the default one, `int64`, is a good choice.

In [53]:
pd.Series(range(1900, 2023))
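
To give a feel for the memory trade-off, this sketch stores the same years as `int64` and as `int16` and compares the size of the values. The variable names are illustrative, and `int16` only works here because all the years fit its range of roughly ±32 thousand:

```python
import pandas as pd

years64 = pd.Series(range(1900, 2023), dtype="int64")
years16 = years64.astype("int16")

# Bytes used by the values alone (index excluded)
bytes64 = years64.memory_usage(index=False)
bytes16 = years16.memory_usage(index=False)
print(bytes64, bytes16)
```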

Note that Python is relatively unusual in including arbitrarily large integers as a core type. Pandas can work with them too but with much lower efficiency (representing them as generic objects). You will rarely need this.

In [126]:
pd.Series({"septillion": 1_000_000_000_000_000_000_000_000})

In recent versions, pandas introduced a [nullable integer dtype](https://pandas.pydata.org/docs/user_guide/integer_na.html) that can represent missing values (NA).

In [54]:
# pd.Series([1902, 1922, 1937, None, 2005, 2006], dtype=int)   # This would cause an error

pd.Series([1902, 1922, 1937, None, 2005, 2006], dtype="Int64")
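
For comparison, the following sketch shows what happens when no dtype is given: the missing value forces the whole Series to float, whereas `"Int64"` keeps the integers intact. The variable names are illustrative:

```python
import pandas as pd

# Without an explicit dtype, None becomes NaN and the integers become floats
inferred = pd.Series([1902, None, 2005])

# The nullable integer dtype keeps integers and marks the gap as <NA>
nullable = pd.Series([1902, None, 2005], dtype="Int64")

print(inferred.dtype, nullable.dtype)
print(nullable.isna().tolist())
```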


### Floats

Floats store non-integer (or **floating-point**) numbers with limited, fixed precision. As with integers, there are several float dtypes but the default, `float64`, is sufficient for regular use:

In [51]:
pd.Series([1, 2.71828, 3.141592, "inf"], dtype=float)

### Booleans

Booleans are exact equivalents of Python bools, so there are no surprises. (Well, they occupy much less memory, but this is not visible at first sight.)

In [45]:
pd.Series([True, False, True, False, False])

As with integers, there are also [nullable booleans](https://pandas.pydata.org/docs/user_guide/boolean.html) that allow logical values to be missing.

In [46]:
# Beware! Strings are not parsed: any non-empty string is truthy, so all four values become True
pd.Series(["True", "False", "Yes", "No"], dtype=bool)
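
Two short sketches related to the cells above: a nullable boolean Series with a missing value, and one possible way to turn such strings into real booleans with an explicit mapping. The `meanings` dict is an assumption about what the strings should mean, not something pandas provides:

```python
import pandas as pd

# Nullable booleans: the missing value is kept as <NA>
answers = pd.Series([True, None, False], dtype="boolean")
print(answers.isna().tolist())

# Translating strings explicitly instead of relying on dtype=bool
raw = pd.Series(["True", "False", "Yes", "No"])
meanings = {"True": True, "False": False, "Yes": True, "No": False}
parsed = raw.map(meanings)
print(parsed.tolist())
```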

### Strings (& objects)

Currently, pandas has two dtypes to handle strings:
- (default) the object type that supports all Python objects in general (such as lists, user-defined classes, ...)
- the (relatively) new "string" dtype, which is more suited to strings (it disallows some assignments) and promises better performance in the future.

In most situations, the default is ok.

In [39]:
actor_names = pd.Series(["John Travolta", "Uma Thurman", "Samuel L. Jackson"])
actor_names

In [40]:
actor_names[0] = 1994
actor_names

In [41]:
actor_names = pd.Series(["John Travolta", "Uma Thurman", "Samuel L. Jackson"], dtype="string")

# actor_names[0] = 1994  # Not allowed

### Categorical

From time to time, there are many items in a series but only a few unique values. Especially in the case of strings, this suggests that the data correspond to a certain categorization.

Pandas supports this via [categorical types](https://pandas.pydata.org/docs/user_guide/categorical.html). These not only capture the semantics of the series better, but also typically use much less memory.

In [42]:
pd.Series(2 * ["apple"] + 8 * ["pear"], dtype="category")
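
The memory claim can be checked with a rough sketch like the one below. The Series is deliberately larger than the one above so the difference is visible, and the exact byte counts will vary between pandas versions:

```python
import pandas as pd

fruits = pd.Series(2000 * ["apple"] + 8000 * ["pear"])
as_category = fruits.astype("category")

# deep=True also counts the memory taken by the strings themselves
plain_bytes = fruits.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(plain_bytes, category_bytes)

# The unique values are stored only once
print(as_category.cat.categories.tolist())
```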

### Datetime

In [47]:
from datetime import datetime

pd.Series([datetime.now(), "2022-09-09", "1 Jan 2023"], dtype="datetime64[ns]")

There are several other dtypes. However, we will not discuss them here; nor will we explain how to create custom, so-called **extension** types (this is [fun]()).

## Indexing

(see <https://pandas.pydata.org/docs/user_guide/indexing.html>)

When accessing parts of the DataFrame, there are many perspectives. We might want to select a particular row, a column, a range of those; we might want to use their labels (index or column names) or the order in which they occur. It is clear that simple brackets can't meet all the expectations.

Therefore pandas offers four different ways of accessing (or **indexing**) within DataFrames / Series:

- `[]` as the default Python access (list-like or dict-like)
- `loc` - to access rows/columns/cells using their labels
- `iloc` - to access rows/columns/cells using their order
- [`at[]`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.at.html) - optimized access to individual cells (not commonly used)
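
Here is a minimal side-by-side sketch of the four approaches on a tiny frame. The `films` DataFrame is built here only for illustration, reusing two rows from the CSV shown earlier:

```python
import pandas as pd

films = pd.DataFrame(
    {"year": [1994, 1972], "rating": [9.2, 9.2]},
    index=["The Shawshank Redemption", "The Godfather"],
)

year_column = films["year"]                  # []: a column by label
by_label = films.loc["The Godfather", "year"]   # loc: row label, column label
by_position = films.iloc[1, 0]               # iloc: row position, column position
single_cell = films.at["The Godfather", "year"]  # at: one cell by labels

print(by_label, by_position, single_cell)
```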

We will start by creating a non-numeric and sorted index for our table with movies (using the [`set_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.set_index.html) and [`sort_index`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.sort_index.html) methods):

In [58]:
movies_by_title = data.set_index("title").sort_index()
movies_by_title

In [59]:
movies_by_title.index

### \[\]

Let's start with the default, which selects:
- in DataFrame, a column by label
- in Series, an item by label (or order)


In [56]:
movies_by_title["year"]

In [89]:
# Multiple values in a list (not tuple!)
movies_by_title[["rank", "rating"]]

In [93]:
ages["Emma Watson"]

In [95]:
# Or by position (we discourage this; recent pandas versions warn about it, prefer ages.iloc[1]):
ages[1]

### [.loc\[\]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.loc.html)

This **indexer** syntactically looks like an attribute of the DataFrame/Series that stores parts of the data structure under different keys. (Technically speaking, it is a [descriptor](https://docs.python.org/3/reference/datamodel.html#implementing-descriptors).)

It uses labels and can select:

In [96]:
# A row in a DataFrame
movies_by_title.loc["The Godfather"]

In [69]:
# A range of rows (inclusive on both ends)
movies_by_title.loc["The Godfather":"The Green Mile"]

In [65]:
# All movies starting with "B" (works only with a sorted index)
movies_by_title.loc["B":"C"]

In [97]:
# Multiple rows
movies_by_title.loc[["Trainspotting", "Pulp Fiction"]]

If you want to get to a particular cell, just use two keys in the brackets instead of one (in the row, column order):

In [100]:
# A column
movies_by_title.loc[:, "rank"]

In [98]:
# Individual cells
movies_by_title.loc["Trainspotting", "rank"]

### [.iloc\[\]](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.iloc.html)

This indexer uses the physical order of the rows/columns/items within the axis of the data structure. It is very similar to `loc` but uses integer positions instead of generic labels. Unlike `loc`, its slices exclude the end position, just like ordinary Python slices.

In [102]:
# The tenth row
movies_by_title.iloc[9]

In [103]:
# Last ten rows
movies_by_title.iloc[-10:]

In [108]:
# A cell using two indices
movies_by_title.iloc[-1, 2]
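
One difference between the two indexers that is easy to miss concerns slices: `loc` includes the end label, while `iloc` excludes the end position. A minimal sketch on a made-up Series:

```python
import pandas as pd

letters = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

by_label = letters.loc["a":"c"]     # "c" is included
by_position = letters.iloc[0:2]     # position 2 is excluded

print(by_label.tolist())
print(by_position.tolist())
```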

There are three helpful methods to select particular rows or items:

In [109]:
# Shortcut for the first few rows/values
movies_by_title.head()

In [110]:
# Shortcut for the last few rows/values
movies_by_title.tail()

In [129]:
# One random row
movies_by_title.sample()

In [128]:
# Multiple random rows
movies_by_title.sample(5)
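
If you need the same "random" rows on every run (e.g. when sharing a notebook), `sample` accepts a `random_state` argument. A short sketch on a made-up Series; the seed value 42 is arbitrary:

```python
import pandas as pd

numbers = pd.Series(range(100))

# The same seed yields the same selection
first_draw = numbers.sample(5, random_state=42)
second_draw = numbers.sample(5, random_state=42)

print(first_draw.equals(second_draw))
```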
