# Pandas basics

## What is pandas?

To cite the [project page](https://pandas.pydata.org/), it is:

> a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

For the Python-speaking part of the wider data science community, it is the de facto tool for analysing and manipulating data that:
- are in tabular format, i.e. have **columns** (usually describing a certain property) and **rows** (usually representing individual objects having those properties),
- are big or complex enough not to be easily handled by lists and dicts,
- are small enough to fit into a computer's memory several times over (rarely more than a few hundred million rows).

Let's start by importing the library using its typical alias:

In [1]:
import pandas as pd  # common alias

In [2]:
# And two functions to support the exercises

from solutions import run_solution    # Gives the expected output
from solutions import show_solution   # Show (one of) the right solution(s)

In [3]:
run_solution("hello_world")

Hello world!


In [4]:
show_solution("hello_world")

def solution_hello_world():
    print("Hello world!")



## Reading data

We will start by reading existing data. For this, we have a simple table scraped from IMDB.com that lists the top-rated movies, as retrieved on a certain date in 2022. You can have a look in `data/top_movies.csv`:

```csv
rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Dark Knight,2008,9.0
4,The Godfather Part II,1974,9.0
5,12 Angry Men,1957,8.9
6,Schindler's List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.8
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,"The Good, the Bad and the Ugly",1966,8.8
```

To read external data sources, pandas provides many functions, all named `read_*` (e.g. `read_csv`, `read_excel`, ...), which support most relevant data formats (and even databases, the clipboard, ...). One of the most common (though perhaps not the most suitable) formats is [**comma-separated values**](https://en.wikipedia.org/wiki/Comma-separated_values) (or **CSV**). Our input file is in this format, so we will use [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html):

In [5]:
data = pd.read_csv("../data/top_movies.csv")

If there was no error (there shouldn't be), you should have received some Python representation of the input data. Let's see.

In [6]:
data

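|     | rank | title                    | year | rating |
|-----|------|--------------------------|------|--------|
| 0   | 1    | The Shawshank Redemption | 1994 | 9.2    |
| 1   | 2    | The Godfather            | 1972 | 9.2    |
| 2   | 3    | The Dark Knight          | 2008 | 9.0    |
| 3   | 4    | The Godfather Part II    | 1974 | 9.0    |
| 4   | 5    | 12 Angry Men             | 1957 | 8.9    |
| ... | ...  | ...                      | ...  | ...    |
| 245 | 246  | Dersu Uzala              | 1975 | 8.0    |
| 246 | 247  | Aladdin                  | 1992 | 8.0    |
| 247 | 248  | Gandhi                   | 1982 | 8.0    |
| 248 | 249  | The Help                 | 2011 | 8.0    |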


Nice, looks almost like a spreadsheet. The basic view (unless you are running this notebook in a more sophisticated environment) shows the first and last five rows and four named columns (i.e. rank, title, year and rating).
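To experiment with `read_csv` without the data file at hand, here is a minimal sketch that parses a few rows copied from the CSV excerpt shown earlier. It relies on `read_csv` accepting a file-like object as well as a path; the names `csv_text` and `sample` are made up for this illustration. Note how the quoted title that contains commas ends up in a single cell.

```python
import io

import pandas as pd

# A few rows copied from the CSV excerpt; the last title is quoted
# because it contains commas itself.
csv_text = """rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
10,"The Good, the Bad and the Ugly",1966,8.8
"""

# read_csv accepts a path or a file-like object, so StringIO works here.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)            # (3, 4)
print(list(sample.columns))    # ['rank', 'title', 'year', 'rating']
print(sample.loc[2, "title"])  # The Good, the Bad and the Ugly
```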

So what is this `data` object exactly?

## Pandas data structures

(see <https://pandas.pydata.org/docs/user_guide/dsintro.html>)

In [7]:
type(data)

pandas.core.frame.DataFrame

It is a **data frame**, which basically means "a table" with somewhat strict requirements about the columns and rows (unlike spreadsheets, this is not a free-form area). You'll find a similar concept in statistical languages like [R](https://www.r-project.org/).

Let's dig in:

In [8]:
# Select just one "column"
titles = data["title"]
titles

0      The Shawshank Redemption
1                 The Godfather
2               The Dark Knight
3         The Godfather Part II
4                  12 Angry Men
                 ...           
245                 Dersu Uzala
246                     Aladdin
247                      Gandhi
248                    The Help
249              The Iron Giant
Name: title, Length: 250, dtype: object

In [9]:
type(titles)

pandas.core.series.Series

### Series

The second fundamental pandas type is a [`Series`](https://pandas.pydata.org/docs/reference/api/pandas.Series.html), a one-dimensional data structure looking a bit like a list. Does it behave like one?

In [10]:
titles[0]    # First item

'The Shawshank Redemption'

In [11]:
titles[:5]  # First five items

0    The Shawshank Redemption
1               The Godfather
2             The Dark Knight
3       The Godfather Part II
4                12 Angry Men
Name: title, dtype: object

**Exercise last_10_titles**: Select the last 10 movie titles from the list.

In [12]:
run_solution("last_10_titles")
show_solution("last_10_titles")

def solution_last_10_titles(titles: pd.Series) -> pd.Series:
    """Select the last 10 movie titles from the list."""
    return titles[-10:]



**Exercise list_like**: Choose a few more operations you would typically do with a list and try to apply them on `titles`. What will happen?

In [13]:
show_solution("list_like")

def solution_list_like(titles: pd.Series) -> None:
    """Choose a few more operations you would typically do with
    a list and try to apply them on titles. What will happen?"""
    sorted(titles)  # OK
    reversed(titles)  # OK
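
Not every list habit carries over, though. The following sketch uses a hypothetical three-item series (`sample_titles`, standing in for `titles`) to show two operations that run without error but mean something different than they do for a list: `in` looks at the index labels rather than the values, and `+` combines items element by element.

```python
import pandas as pd

# Hypothetical three-item series, standing in for `titles`.
sample_titles = pd.Series(["Alien", "Brazil", "Casablanca"])

# These behave as they would on a list.
print(len(sample_titles))        # 3
print(sorted(sample_titles)[0])  # Alien

# `in` tests the index labels (0, 1, 2), not the stored values.
print("Alien" in sample_titles)           # False
print(0 in sample_titles)                 # True
print("Alien" in sample_titles.tolist())  # True

# `+` works element by element instead of concatenating two sequences.
doubled = sample_titles + sample_titles
print(len(doubled))  # 3, not 6
print(doubled[0])    # AlienAlien
```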



It's quite easy to convert a Series to a list and vice versa. The easiest way to
create a Series is to pass a list as the first argument:

In [14]:
directors = pd.Series(["Quentin Tarantino", "Miloš Forman", "Michelangelo Antonioni"])
directors

0         Quentin Tarantino
1              Miloš Forman
2    Michelangelo Antonioni
dtype: object

In [15]:
directors.tolist()
# or equivalently
list(directors)

['Quentin Tarantino', 'Miloš Forman', 'Michelangelo Antonioni']

![Series](images/series.png)

In [16]:
titles[:20].values

array(['The Shawshank Redemption', 'The Godfather', 'The Dark Knight',
       'The Godfather Part II', '12 Angry Men', "Schindler's List",
       'The Lord of the Rings: The Return of the King', 'Pulp Fiction',
       'The Lord of the Rings: The Fellowship of the Ring',
       'The Good, the Bad and the Ugly', 'Forrest Gump', 'Fight Club',
       'Inception', 'The Lord of the Rings: The Two Towers',
       'Star Wars: Episode V - The Empire Strikes Back', 'The Matrix',
       'Goodfellas', "One Flew Over the Cuckoo's Nest", 'Se7en',
       'Seven Samurai'], dtype=object)

In [17]:
type(titles.values)

numpy.ndarray

In [18]:
titles.dtype

dtype('O')

In [19]:
titles.index

RangeIndex(start=0, stop=250, step=1)

In [20]:
titles.name

'title'

In [21]:
titles.size

250
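
The attributes inspected above (`values`, `dtype`, `index`, `name`, `size`) are the building blocks of any Series. As a small sketch, the same attributes can be read from a series built by hand; the `ratings` series below is a made-up example, not part of the movie table.

```python
import pandas as pd

# Hypothetical ratings, created directly from a list.
ratings = pd.Series([9.2, 9.0, 8.9], name="rating")

print(ratings.values)       # the underlying array of data
print(ratings.dtype)        # float64
print(list(ratings.index))  # [0, 1, 2]
print(ratings.name)         # rating
print(ratings.size)         # 3
```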

### Index

In [22]:
titles.index

RangeIndex(start=0, stop=250, step=1)
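
The index does not have to be a range of integers. As a minimal sketch (the `years` series is constructed for this illustration from three rows of the CSV excerpt), an index can hold labels such as titles, and items can then be looked up by label:

```python
import pandas as pd

# Release years from the CSV excerpt, labelled by title instead of position.
years = pd.Series(
    [1994, 1972, 2008],
    index=["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    name="year",
)

print(years["The Godfather"])    # 1972
print(list(years.index))         # the three titles
print("The Godfather" in years)  # True: `in` checks the index labels
```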

### DataFrame

![DataFrame](images/data_frame.png)

In [23]:
data.columns

Index(['rank', 'title', 'year', 'rating'], dtype='object')

In [24]:
data.index

RangeIndex(start=0, stop=250, step=1)

In [25]:
data.values

array([[1, 'The Shawshank Redemption', 1994, 9.2],
       [2, 'The Godfather', 1972, 9.2],
       [3, 'The Dark Knight', 2008, 9.0],
       [4, 'The Godfather Part II', 1974, 9.0],
       [5, '12 Angry Men', 1957, 8.9],
       [6, "Schindler's List", 1993, 8.9],
       [7, 'The Lord of the Rings: The Return of the King', 2003, 8.9],
       [8, 'Pulp Fiction', 1994, 8.8],
       [9, 'The Lord of the Rings: The Fellowship of the Ring', 2001,
        8.8],
       [10, 'The Good, the Bad and the Ugly', 1966, 8.8],
       [11, 'Forrest Gump', 1994, 8.8],
       [12, 'Fight Club', 1999, 8.7],
       [13, 'Inception', 2010, 8.7],
       [14, 'The Lord of the Rings: The Two Towers', 2002, 8.7],
       [15, 'Star Wars: Episode V - The Empire Strikes Back', 1980, 8.7],
       [16, 'The Matrix', 1999, 8.7],
       [17, 'Goodfellas', 1990, 8.7],
       [18, "One Flew Over the Cuckoo's Nest", 1975, 8.6],
       [19, 'Se7en', 1995, 8.6],
       [20, 'Seven Samurai', 1954, 8.6],
       [21, "It's a W

In [26]:
data.shape

(250, 4)

In [27]:
my_df = pd.DataFrame({
    "movie": ["Trainspotting", "The Matrix", "Forrest Gump"],
    "main character": ["Mark Renton", "Neo", "Forrest Gump"]
})
my_df

|   | movie         | main character |
|---|---------------|----------------|
| 0 | Trainspotting | Mark Renton    |
| 1 | The Matrix    | Neo            |
| 2 | Forrest Gump  | Forrest Gump   |
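
To connect the two structures, the sketch below rebuilds the same small frame and checks that selecting one column gives back a Series that shares the frame's index. The variable name `characters` is introduced only for this illustration.

```python
import pandas as pd

my_df = pd.DataFrame({
    "movie": ["Trainspotting", "The Matrix", "Forrest Gump"],
    "main character": ["Mark Renton", "Neo", "Forrest Gump"],
})

print(my_df.shape)          # (3, 2)
print(list(my_df.columns))  # ['movie', 'main character']

# Selecting one column gives a Series that shares the frame's index.
characters = my_df["main character"]
print(type(characters).__name__)             # Series
print(characters.index.equals(my_df.index))  # True
```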


## Data types

(see <https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes>)

In [28]:
data.dtypes

rank        int64
title      object
year        int64
rating    float64
dtype: object
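
Each column has a single dtype, inferred from its values when the frame is created. The following sketch uses a hypothetical frame (the `seen` column is invented for the example) to show the inferred dtypes for integers, floats and booleans, and how `astype` returns a converted copy. String columns are left out here; in the output above they are reported as `object`.

```python
import pandas as pd

# Hypothetical frame with one column per basic kind of value.
frame = pd.DataFrame({
    "rank": [1, 2, 3],            # integers
    "rating": [9.2, 9.0, 8.9],    # floats
    "seen": [True, False, True],  # booleans
})

print(frame.dtypes)

# astype returns a converted copy; the original column is unchanged.
rank_as_float = frame["rank"].astype("float64")
print(rank_as_float.dtype)  # float64
print(frame["rank"].dtype)  # int64
```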

### Integers

### Floats

### Booleans

### Strings (& objects)

### Other types

TODO: Mention extension types


## Indexing

(see <https://pandas.pydata.org/docs/user_guide/indexing.html>)
