# Pandas basics

## What is pandas?

To cite the [project page](https://pandas.pydata.org/), it is:

> a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

For the wider data science community (at least its Python-speaking part), it is the de facto tool for analysing and manipulating data that:
- are in tabular format, i.e. having **columns** (usually describing a certain property) and **rows** (usually representing individual objects having those properties),
- are big or complex enough not to be easily handled by lists and dicts,
- are small enough to fit within a computer's memory multiple times (rarely more than a few hundred million rows).

Let's start by importing the library using its typical alias:

In [1]:
import pandas as pd  # common alias

In [2]:
# And two functions to support the exercises

from solutions import run_solution    # Gives the expected output
from solutions import show_solution   # Show (one of) the right solution(s)

**Exercise hello_world**: Print "hello world"

In [3]:
run_solution("hello_world")

Hello world!


In [4]:
show_solution("hello_world")

def solution_hello_world():
    print("Hello world!")



## Reading data

We will start by reading existing data. For this, we have a simple table scraped from IMDB.com that lists the highest-rated movies, as retrieved on a certain date in 2022. You can have a look in `data/top_movies.csv`:

```csv
rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Dark Knight,2008,9.0
4,The Godfather Part II,1974,9.0
5,12 Angry Men,1957,8.9
6,Schindler's List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.8
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,"The Good, the Bad and the Ugly",1966,8.8
```

To read external data sources, pandas provides many functions, all named `read_*` (e.g. `read_csv`, `read_excel`, ...), which support most relevant data formats (and even databases, the clipboard, ...). One of the most common (though perhaps not the most suitable) formats is [**comma-separated values**](https://en.wikipedia.org/wiki/Comma-separated_values) (or **CSV**). Our input file is in this format too, so we will use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):

In [5]:
data = pd.read_csv("../data/top_movies.csv")

If there was no error (there shouldn't be), you should have received some Python representation of the input data. Let's see.

In [6]:
data

Unnamed: 0,rank,title,year,rating
0,1,The Shawshank Redemption,1994,9.2
1,2,The Godfather,1972,9.2
2,3,The Dark Knight,2008,9.0
3,4,The Godfather Part II,1974,9.0
4,5,12 Angry Men,1957,8.9
...,...,...,...,...
245,246,Dersu Uzala,1975,8.0
246,247,Aladdin,1992,8.0
247,248,Gandhi,1982,8.0
248,249,The Help,2011,8.0


Nice, looks almost like a spreadsheet. The basic view (unless you are running this notebook in a more sophisticated environment) shows the first and last five rows and four named columns (i.e. rank, title, year and rating).
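
To experiment with `read_csv` without the file on disk, the same call can be tried on an in-memory string. This is only a minimal sketch: the rows are copied from the table shown above, while `io.StringIO` and the variable names `csv_text` and `sample` are illustrative and not part of the workshop materials.

```python
import io

import pandas as pd

# A few rows of the table shown above, inlined so that the example
# does not depend on the data file being present.
csv_text = '''rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Dark Knight,2008,9.0
10,"The Good, the Bad and the Ugly",1966,8.8
'''

# read_csv accepts a path as well as a file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)               # (rows, columns)
print(list(sample.columns))       # column names taken from the header line
print(sample["title"].iloc[-1])   # the quoted title keeps its commas
```

Note that the quoted title in the last row is read as a single value even though it contains commas.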

So what is this `data` object exactly?

## Pandas data structures

(see <https://pandas.pydata.org/docs/user_guide/dsintro.html>)

In [7]:
type(data)

pandas.core.frame.DataFrame

It is a **data frame**, which basically means "a table" with somewhat strict requirements about the columns and rows (unlike spreadsheets, this is not a free-form area). You'll find a similar concept in statistical languages like [R](https://www.r-project.org/).

Let's dig in:

In [8]:
# Select just one "column"
titles = data["title"]
titles

0      The Shawshank Redemption
1                 The Godfather
2               The Dark Knight
3         The Godfather Part II
4                  12 Angry Men
                 ...           
245                 Dersu Uzala
246                     Aladdin
247                      Gandhi
248                    The Help
249              The Iron Giant
Name: title, Length: 250, dtype: object

In [9]:
type(titles)

pandas.core.series.Series

### Series

The second fundamental pandas type is a [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), a one-dimensional data structure looking a bit like a list. Does it behave like one?

In [10]:
titles[0]    # First item

'The Shawshank Redemption'

In [11]:
titles[:5]  # First five items

0    The Shawshank Redemption
1               The Godfather
2             The Dark Knight
3       The Godfather Part II
4                12 Angry Men
Name: title, dtype: object

**Exercise last_10_titles**: Select the last 10 movie titles from the list.

In [12]:
run_solution("last_10_titles")
show_solution("last_10_titles")

def solution_last_10_titles(titles: pd.Series) -> pd.Series:
    """Select the last 10 movie titles from the list."""
    return titles[-10:]



**Exercise list_like**: Choose a few more operations you would typically do with a list and try to apply them on `titles`. What will happen?

In [13]:
show_solution("list_like")

def solution_list_like(titles: pd.Series) -> None:
    """Choose a few more operations you would typically do with
    a list and try to apply them on titles. What will happen?"""
    sorted(titles)  # OK
    reversed(titles)  # OK
    # TODO: Add a few?
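
Not every list habit carries over, though. The following is a small sketch with made-up numbers (not the workshop's reference solution) of two operations that behave differently on a Series than on a list:

```python
import pandas as pd

numbers_list = [1, 2, 3]
numbers = pd.Series(numbers_list)

# `+` concatenates lists but adds Series element by element.
print(numbers_list + numbers_list)     # [1, 2, 3, 1, 2, 3]
print((numbers + numbers).tolist())    # [2, 4, 6]

# `in` looks at the index labels of a Series, not at its values.
print(3 in numbers_list)               # True
print(3 in numbers)                    # False: the labels are 0, 1, 2
print(3 in numbers.values)             # True
```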



It's quite easy to convert a Series to a list and vice versa. The easiest way to
create a Series is actually to construct it from a list (passed as the first argument):

In [14]:
directors = pd.Series(["Sofia Coppola", "Miloš Forman", "Lana Wachowski"])
directors

0     Sofia Coppola
1      Miloš Forman
2    Lana Wachowski
dtype: object

In [15]:
directors.tolist()
# or equivalently
list(directors)

['Sofia Coppola', 'Miloš Forman', 'Lana Wachowski']

![Series](images/series.png)

The Series has 5 interesting properties:

#### [.values](https://pandas.pydata.org/docs/reference/api/pandas.Series.values.html)

The data itself, in the form of a numpy array (we are not covering this library in this workshop, but feel free to explore it!). Quite close to a list.

In [16]:
titles[:20].values

array(['The Shawshank Redemption', 'The Godfather', 'The Dark Knight',
       'The Godfather Part II', '12 Angry Men', "Schindler's List",
       'The Lord of the Rings: The Return of the King', 'Pulp Fiction',
       'The Lord of the Rings: The Fellowship of the Ring',
       'The Good, the Bad and the Ugly', 'Forrest Gump', 'Fight Club',
       'Inception', 'The Lord of the Rings: The Two Towers',
       'Star Wars: Episode V - The Empire Strikes Back', 'The Matrix',
       'Goodfellas', "One Flew Over the Cuckoo's Nest", 'Se7en',
       'Seven Samurai'], dtype=object)

In [17]:
type(titles.values)

numpy.ndarray

#### [.dtype](https://pandas.pydata.org/docs/reference/api/pandas.Series.dtype.html)

Unlike a list, all values of a Series must be of the same type. Pandas has its own set of types, so-called **dtypes**, which are close to the basic Python types (like int, float, str) and can be converted to and from them, but are (especially in the case of numbers) more specific in terms of precision and memory used. More on that later.

By default, strings (like any other Python objects) are stored in the least efficient dtype, "object" (or "O"):

In [18]:
titles.dtype

dtype('O')
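
To see the single-dtype rule in action, here is a small sketch with made-up values. It assumes only the standard type inference of the `pd.Series` constructor and the `astype` method; the variable names are illustrative.

```python
import pandas as pd

# One float is enough to turn the whole Series into floats.
mixed_numbers = pd.Series([1, 2.5, 3])
print(mixed_numbers.dtype)      # float64

# Numbers mixed with text fall back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)              # object

# astype converts explicitly; going from float to int drops the fraction.
print(mixed_numbers.astype("int64").tolist())   # [1, 2, 3]
```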

#### [.index](https://pandas.pydata.org/docs/reference/api/pandas.Series.index.html)

In a list, you select items by their numerical order; in a dict, you use a key (but not order). Pandas introduces a generalized (potentially multi-level) index that allows accessing items by either order or label, and also allows using ranges in natural scales like time.

By default, a range index is created for you that corresponds exactly to the order of items.

In [19]:
titles.index

RangeIndex(start=0, stop=250, step=1)
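
A short sketch of what access "by order or by label" can look like in practice, using a small Series with string labels (the ratings are taken from the table above; the variable name is illustrative). `.iloc` selects by position and `.loc` selects by label:

```python
import pandas as pd

ratings = pd.Series(
    [9.2, 9.0, 8.9],
    index=["The Godfather", "The Dark Knight", "12 Angry Men"],
)

print(ratings.iloc[0])                   # by position -> 9.2
print(ratings.loc["The Dark Knight"])    # by label    -> 9.0

# Label slices include both end points, unlike positional slices.
print(ratings.loc["The Godfather":"The Dark Knight"].tolist())   # [9.2, 9.0]
print(ratings.iloc[0:1].tolist())                                # [9.2]
```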

#### [.name](https://pandas.pydata.org/docs/reference/api/pandas.Series.name.html)

An optional attribute, used for example as the column label when the Series is part of a DataFrame. It does not have anything to do with the variable the Series is stored in.

In [20]:
titles.name

'title'
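
A minimal sketch of how the name relates to a DataFrame column and to the variable name. The values and the names `scores` and `imdb_rating` are made up for illustration:

```python
import pandas as pd

scores = pd.Series([9.2, 9.0], name="rating")

# The name becomes the column label when the Series is turned into a table.
print(list(scores.to_frame().columns))   # ['rating']

# rename returns a new Series; the original and its variable are unchanged.
renamed = scores.rename("imdb_rating")
print(renamed.name, scores.name)         # imdb_rating rating
```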

#### [.size](https://pandas.pydata.org/docs/reference/api/pandas.Series.size.html)

A read-only property telling how many items are stored in the series.

In [21]:
titles.size

# or len(titles)

250

When creating a Series, you may state these attributes (apart from `size`) explicitly using keyword arguments:

In [22]:
ages = pd.Series(
    [73, 32, 83],
    name="Actor age",
    index=["Meryl Streep", "Emma Watson", "Ian MacKellen"],
    dtype=float
)
ages

Meryl Streep     73.0
Emma Watson      32.0
Ian MacKellen    83.0
Name: Actor age, dtype: float64

### Index

As stated, the index is a generalized key/order label for the series items (or data frame rows or columns).

By default, a RangeIndex starts at 0 and increases by 1 with each item (or row).

In [23]:
titles.index

RangeIndex(start=0, stop=250, step=1)

However, there are other types of indices too:

In [24]:
ages.index

Index(['Meryl Streep', 'Emma Watson', 'Ian MacKellen'], dtype='object')

In [25]:
best_picture = pd.Series(
    ["Parasite", "Nomadland", "Coda"],
    index=[2019, 2020, 2021]
)
best_picture.index

Int64Index([2019, 2020, 2021], dtype='int64')

You can also create an index explicitly or by using various utility functions:

In [26]:
pd.date_range("2022-09-01", "2022-09-30")

DatetimeIndex(['2022-09-01', '2022-09-02', '2022-09-03', '2022-09-04',
               '2022-09-05', '2022-09-06', '2022-09-07', '2022-09-08',
               '2022-09-09', '2022-09-10', '2022-09-11', '2022-09-12',
               '2022-09-13', '2022-09-14', '2022-09-15', '2022-09-16',
               '2022-09-17', '2022-09-18', '2022-09-19', '2022-09-20',
               '2022-09-21', '2022-09-22', '2022-09-23', '2022-09-24',
               '2022-09-25', '2022-09-26', '2022-09-27', '2022-09-28',
               '2022-09-29', '2022-09-30'],
              dtype='datetime64[ns]', freq='D')
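
Such a date index makes it possible to select items by ranges in time. A small sketch, assuming hypothetical daily ticket sales for the first week of September 2022 (the numbers and variable names are invented for illustration):

```python
import pandas as pd

days = pd.date_range("2022-09-01", periods=7, freq="D")
tickets_sold = pd.Series([120, 95, 210, 230, 80, 75, 90], index=days)

# Select a range of dates by label; both end points are included.
weekend = tickets_sold.loc["2022-09-03":"2022-09-04"]
print(weekend.tolist())   # [210, 230]
```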

**Exercise own_series**: Create your own series with your favourites in the movie industry, setting a custom index.

In [27]:
pd.Series(
    ...
)

# No correct answer :-)

0    Ellipsis
dtype: object

### DataFrame

![DataFrame](images/data_frame.png)

Let's get back to DataFrame...

Just as a `Series` is a container of values labeled by an `Index`, a `DataFrame` is a two-dimensional container that wraps an array (`.values`) in two indices: one for rows and one for columns.

It has a similar set of properties:

In [28]:
data.columns

Index(['rank', 'title', 'year', 'rating'], dtype='object')

Note that the index is shared by all columns (Series) the DataFrame is composed of:

In [29]:
data.index

RangeIndex(start=0, stop=250, step=1)

In [30]:
data.index is titles.index

True

In [31]:
data.values[:10]

array([[1, 'The Shawshank Redemption', 1994, 9.2],
       [2, 'The Godfather', 1972, 9.2],
       [3, 'The Dark Knight', 2008, 9.0],
       [4, 'The Godfather Part II', 1974, 9.0],
       [5, '12 Angry Men', 1957, 8.9],
       [6, "Schindler's List", 1993, 8.9],
       [7, 'The Lord of the Rings: The Return of the King', 2003, 8.9],
       [8, 'Pulp Fiction', 1994, 8.8],
       [9, 'The Lord of the Rings: The Fellowship of the Ring', 2001,
        8.8],
       [10, 'The Good, the Bad and the Ugly', 1966, 8.8]], dtype=object)

In [32]:
data.shape   # Size (rows x columns)

(250, 4)

In [33]:
data.dtypes

rank        int64
title      object
year        int64
rating    float64
dtype: object

If you want to create a new DataFrame (apart from reading it from an external source), you can select one of several ways, most frequently from a list of dicts or a dict of lists (or even a dict of dicts).

In [34]:
pd.DataFrame({
    "movie": ["Trainspotting", "The Matrix", "Forrest Gump"],
    "main character": ["Mark Renton", "Neo", "Forrest Gump"]
})

Unnamed: 0,movie,main character
0,Trainspotting,Mark Renton
1,The Matrix,Neo
2,Forrest Gump,Forrest Gump


is equivalent to:

In [35]:
pd.DataFrame([
    {"movie": "Trainspotting",  "main character": "Mark Renton"},
    {"movie": "The Matrix",  "main character": "Neo"},
    {"movie": "Forrest Gump",  "main character": "Forrest Gump"},
])

Unnamed: 0,movie,main character
0,Trainspotting,Mark Renton
1,The Matrix,Neo
2,Forrest Gump,Forrest Gump
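
The dict-of-dicts variant mentioned above can be sketched in the same spirit. The row labels `"a"` and `"b"` are arbitrary placeholders chosen for this example:

```python
import pandas as pd

# Dict of dicts: outer keys become columns, inner keys become row labels.
by_column = pd.DataFrame({
    "movie": {"a": "Trainspotting", "b": "The Matrix"},
    "main character": {"a": "Mark Renton", "b": "Neo"},
})
print(list(by_column.index))      # ['a', 'b']
print(list(by_column.columns))    # ['movie', 'main character']

# With a dict of lists, the row labels can be passed via `index`.
labelled = pd.DataFrame(
    {"movie": ["Trainspotting", "The Matrix"]},
    index=["a", "b"],
)
print(labelled.loc["b", "movie"])  # The Matrix
```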


**Exercise actors_and_ages:** Similar to the `ages` Series defined above, create a DataFrame that contains three columns "first_name", "surname" and "age" of the listed actors (or your favourite ones.)

In [36]:
# run_solution("actors_and_ages")
# show_solution("actors_and_ages")

## Data types

(see <https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes>)

In [37]:
data.dtypes

rank        int64
title      object
year        int64
rating    float64
dtype: object

### Integers

Recent versions of pandas provide a [nullable integer dtype](https://pandas.pydata.org/docs/user_guide/integer_na.html) that can represent missing values (NA).

In [38]:
# pd.Series([1, 2, 3, None, 5, 6], dtype=int)   # This would cause an error

pd.Series([1, 2, 3, None, 5, 6], dtype="Int64") 


0       1
1       2
2       3
3    <NA>
4       5
5       6
dtype: Int64
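
For comparison, here is a small sketch of what happens with and without the nullable dtype when no dtype is forced to plain `int`. The values and variable names are illustrative:

```python
import pandas as pd

# Without an explicit dtype, the missing value forces a float Series.
plain = pd.Series([1, 2, None])
print(plain.dtype)                 # float64

# With "Int64" (capital I), the values stay integers and the gap is <NA>.
nullable = pd.Series([1, 2, None], dtype="Int64")
print(nullable.dtype)              # Int64
print(nullable.isna().tolist())    # [False, False, True]
print(nullable.sum())              # 3 (missing values are skipped by default)
```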

### Floats

### Booleans

### Strings (& objects)

Currently, pandas has two dtypes to handle strings:
- (default) the object type that supports all Python objects in general (such as lists, user-defined classes, ...)
- the (relatively) new "string" dtype, which is better suited to strings (it disallows some assignments) and promises better efficiency in the future.

In [39]:
actor_names = pd.Series(["John Travolta", "Uma Thurman", "Samuel L. Jackson"])
actor_names

0        John Travolta
1          Uma Thurman
2    Samuel L. Jackson
dtype: object

In [40]:
actor_names[0] = 1994
actor_names

0                 1994
1          Uma Thurman
2    Samuel L. Jackson
dtype: object

In [41]:
actor_names = pd.Series(["John Travolta", "Uma Thurman", "Samuel L. Jackson"], dtype="string")

# actor_names[0] = 1994  # Not allowed
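
To try the disallowed assignment safely, it can be wrapped in `try`/`except`. This is only a sketch: the exception class is an assumption, as it appears to have differed between pandas versions, so both `TypeError` and `ValueError` are caught and the outcome is simply printed.

```python
import pandas as pd

names = pd.Series(["John Travolta", "Uma Thurman"], dtype="string")
dtype_name = str(names.dtype)
print(dtype_name)   # string

try:
    names[0] = 1994   # an integer is not a string
    outcome = "accepted"
except (TypeError, ValueError) as error:
    # Assumption: the exact exception class depends on the pandas version.
    outcome = f"rejected with {type(error).__name__}"

print(outcome)
```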

### Categorical

In [44]:
pd.Series(2 * ["apple"] + 8 * ["pear"], dtype="category")

0    apple
1    apple
2     pear
3     pear
4     pear
5     pear
6     pear
7     pear
8     pear
9     pear
dtype: category
Categories (2, object): ['apple', 'pear']
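
A brief sketch of what sits behind such a Series, using the `.cat` accessor: each distinct value is stored once as a category, and the items themselves are small integer codes. The variable name `fruit` is illustrative.

```python
import pandas as pd

fruit = pd.Series(2 * ["apple"] + 8 * ["pear"], dtype="category")

# Each distinct value is stored once...
print(list(fruit.cat.categories))   # ['apple', 'pear']

# ...and every item is an integer code pointing at one of the categories.
print(fruit.cat.codes.tolist())     # [0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
```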

TODO: Mention extension types


## Indexing

(see <https://pandas.pydata.org/docs/user_guide/indexing.html>)
