# Pandas basics

## What is pandas?

To quote the [project page](https://pandas.pydata.org/), pandas is:

> a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.

For the Python-speaking part of the wider data science community, it is the de facto tool for analysing and manipulating data that:
- are in tabular format, i.e. have columns (each usually describing a certain property) and rows (each usually representing an individual object with those properties),
- are too big or complex to be handled comfortably with lists and dicts,
- are small enough to fit in a computer's memory several times over (rarely more than a few hundred million rows).

Let's start by importing the library using its typical alias:

In [None]:
import pandas as pd  # common alias

## Reading data

We will start by reading existing data. For this, we have a simple table scraped from IMDB.com that lists the top-rated movies, as retrieved on a certain date in 2022. The file is `data/top_movies.csv`, and its first rows look like this:

```csv
rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Dark Knight,2008,9.0
4,The Godfather Part II,1974,9.0
5,12 Angry Men,1957,8.9
6,Schindler's List,1993,8.9
7,The Lord of the Rings: The Return of the King,2003,8.9
8,Pulp Fiction,1994,8.8
9,The Lord of the Rings: The Fellowship of the Ring,2001,8.8
10,"The Good, the Bad and the Ugly",1966,8.8
```


To read external data sources, pandas provides many functions, all named `read_*` (e.g. `read_csv`, `read_excel`, ...), which support most relevant data formats (and even databases, the clipboard, ...). One of the most common (though perhaps not the most suitable) formats is **comma-separated values** (or **CSV**). Our input file is in this format, so we will use `read_csv`:

In [None]:
data = pd.read_csv("../data/top_movies.csv")

If there was no error (and there shouldn't be), `data` now holds a Python representation of the input data. Let's see.

In [None]:
data

,rank,title,year,rating
0,1,The Shawshank Redemption,1994,9.2
1,2,The Godfather,1972,9.2
2,3,The Dark Knight,2008,9.0
3,4,The Godfather Part II,1974,9.0
4,5,12 Angry Men,1957,8.9
...,...,...,...,...
245,246,Dersu Uzala,1975,8.0
246,247,Aladdin,1992,8.0
247,248,Gandhi,1982,8.0
248,249,The Help,2011,8.0


Nice, it looks almost like a spreadsheet. The default view (unless you are running this notebook in a more sophisticated environment) shows only the first and last five rows, with the four columns (`rank`, `title`, `year` and `rating`) and, on the left, the row index that pandas generated automatically.
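If the CSV file is not at hand, the same steps can be tried on an in-memory copy of a few rows. The following is only a sketch: `csv_text` and `sample` are names made up for this illustration, and `io.StringIO` stands in for the file on disk.

```python
import io

import pandas as pd

# A few rows of the table above, kept in a string instead of a file.
# The quoted title contains commas, which the CSV parser handles for us.
csv_text = '''rank,title,year,rating
1,The Shawshank Redemption,1994,9.2
2,The Godfather,1972,9.2
3,The Dark Knight,2008,9.0
10,"The Good, the Bad and the Ugly",1966,8.8
'''

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (4, 4): four rows, four columns
print(sample.head(2))  # the first two rows
print(sample.tail(1))  # the last row
```

`head` and `tail` are handy when you want to control how many rows are shown instead of relying on the default truncated view.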

## Pandas data structures

(see https://pandas.pydata.org/docs/user_guide/dsintro.html)

In [None]:
type(data)

pandas.core.frame.DataFrame

In [None]:
titles = data["title"]
titles

0      The Shawshank Redemption
1                 The Godfather
2               The Dark Knight
3         The Godfather Part II
4                  12 Angry Men
                 ...           
245                 Dersu Uzala
246                     Aladdin
247                      Gandhi
248                    The Help
249              The Iron Giant
Name: title, Length: 250, dtype: object

In [None]:
type(titles)

pandas.core.series.Series

### Series

In [None]:
titles[0]

'The Shawshank Redemption'

In [None]:
titles[-5:]

245       Dersu Uzala
246           Aladdin
247            Gandhi
248          The Help
249    The Iron Giant
Name: title, dtype: object

In [None]:
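titles.tolist()  # Series method that returns a plain Python list
list(titles)     # the built-in works as well; only this last expression is displayed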

['The Shawshank Redemption',
 'The Godfather',
 'The Dark Knight',
 'The Godfather Part II',
 '12 Angry Men',
 "Schindler's List",
 'The Lord of the Rings: The Return of the King',
 'Pulp Fiction',
 'The Lord of the Rings: The Fellowship of the Ring',
 'The Good, the Bad and the Ugly',
 'Forrest Gump',
 'Fight Club',
 'Inception',
 'The Lord of the Rings: The Two Towers',
 'Star Wars: Episode V - The Empire Strikes Back',
 'The Matrix',
 'Goodfellas',
 "One Flew Over the Cuckoo's Nest",
 'Se7en',
 'Seven Samurai',
 "It's a Wonderful Life",
 'The Silence of the Lambs',
 'City of God',
 'Saving Private Ryan',
 'Life Is Beautiful',
 'The Green Mile',
 'Interstellar',
 'Star Wars',
 'Terminator 2: Judgment Day',
 'Back to the Future',
 'Spirited Away',
 'Psycho',
 'The Pianist',
 'Léon: The Professional',
 'Parasite',
 'The Lion King',
 'Gladiator',
 'American History X',
 'The Departed',
 'The Usual Suspects',
 'The Prestige',
 'Casablanca',
 'Whiplash',
 'The Intouchables',
 'Grave of th

In [None]:
titles[:20].values

array(['The Shawshank Redemption', 'The Godfather', 'The Dark Knight',
       'The Godfather Part II', '12 Angry Men', "Schindler's List",
       'The Lord of the Rings: The Return of the King', 'Pulp Fiction',
       'The Lord of the Rings: The Fellowship of the Ring',
       'The Good, the Bad and the Ugly', 'Forrest Gump', 'Fight Club',
       'Inception', 'The Lord of the Rings: The Two Towers',
       'Star Wars: Episode V - The Empire Strikes Back', 'The Matrix',
       'Goodfellas', "One Flew Over the Cuckoo's Nest", 'Se7en',
       'Seven Samurai'], dtype=object)

In [None]:
type(titles.values)

numpy.ndarray

In [None]:
titles.dtype

dtype('O')

In [None]:
titles.index

RangeIndex(start=0, stop=250, step=1)

In [None]:
titles.name

'title'

In [None]:
titles.size

250
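A Series does not have to come from a DataFrame column. As a minimal sketch (the `ratings` name and its three hand-typed values are only an illustration, not part of the dataset loading above), one can be built directly from a list, with an explicit index and name:

```python
import pandas as pd

# Hypothetical mini-series: ratings labelled by movie title.
ratings = pd.Series(
    [9.2, 9.2, 9.0],
    index=["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    name="rating",
)

print(ratings.dtype)              # float64
print(ratings.size)               # 3
print(ratings.name)               # rating
print(ratings.iloc[0])            # access by position
print(ratings["The Godfather"])   # access by label
print(ratings.to_numpy())         # the underlying values as a NumPy array
```

The same attributes inspected on `titles` above (`dtype`, `index`, `name`, `size`) are available here, because both objects are Series.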

### Index

In [None]:
titles.index

RangeIndex(start=0, stop=250, step=1)
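The `RangeIndex` above is what pandas creates when no other index is given. The sketch below, which uses a small made-up `movies` frame rather than the full dataset, shows how a column can take over that role so that rows can be looked up by label:

```python
import pandas as pd

# Hypothetical three-row frame with values taken from the table above.
movies = pd.DataFrame({
    "title": ["The Godfather", "Pulp Fiction", "12 Angry Men"],
    "year": [1972, 1994, 1957],
})

print(movies.index)  # default RangeIndex(start=0, stop=3, step=1)

# set_index returns a new DataFrame whose index holds the titles.
by_title = movies.set_index("title")
print(by_title.index)

# Rows can now be addressed by title instead of by position.
print(by_title.loc["Pulp Fiction", "year"])  # 1994
```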

### DataFrame

In [None]:
data.columns

Index(['rank', 'title', 'year', 'rating'], dtype='object')

In [None]:
data.index

RangeIndex(start=0, stop=250, step=1)

In [None]:
data.values

array([[1, 'The Shawshank Redemption', 1994, 9.2],
       [2, 'The Godfather', 1972, 9.2],
       [3, 'The Dark Knight', 2008, 9.0],
       [4, 'The Godfather Part II', 1974, 9.0],
       [5, '12 Angry Men', 1957, 8.9],
       [6, "Schindler's List", 1993, 8.9],
       [7, 'The Lord of the Rings: The Return of the King', 2003, 8.9],
       [8, 'Pulp Fiction', 1994, 8.8],
       [9, 'The Lord of the Rings: The Fellowship of the Ring', 2001,
        8.8],
       [10, 'The Good, the Bad and the Ugly', 1966, 8.8],
       [11, 'Forrest Gump', 1994, 8.8],
       [12, 'Fight Club', 1999, 8.7],
       [13, 'Inception', 2010, 8.7],
       [14, 'The Lord of the Rings: The Two Towers', 2002, 8.7],
       [15, 'Star Wars: Episode V - The Empire Strikes Back', 1980, 8.7],
       [16, 'The Matrix', 1999, 8.7],
       [17, 'Goodfellas', 1990, 8.7],
       [18, "One Flew Over the Cuckoo's Nest", 1975, 8.6],
       [19, 'Se7en', 1995, 8.6],
       [20, 'Seven Samurai', 1954, 8.6],
       [21, "It's a W

In [None]:
data.shape

(250, 4)

In [None]:
my_df = pd.DataFrame({
    "movie": ["Trainspotting", "The Matrix", "Forrest Gump"],
    "main character": ["Mark Renton", "Neo", "Forrest Gump"]
})
my_df

,movie,main character
0,Trainspotting,Mark Renton
1,The Matrix,Neo
2,Forrest Gump,Forrest Gump
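A dict of columns is only one way to build a DataFrame. As a sketch of an alternative, the same table can be assembled from a list of row dicts (`records` and `from_records` are names chosen for this illustration). It also shows how the selection syntax decides whether you get a Series or a DataFrame back:

```python
import pandas as pd

# One dict per row; the keys become column names.
records = [
    {"movie": "Trainspotting", "main character": "Mark Renton"},
    {"movie": "The Matrix", "main character": "Neo"},
    {"movie": "Forrest Gump", "main character": "Forrest Gump"},
]
from_records = pd.DataFrame(records)

print(from_records.shape)              # (3, 2)
print(type(from_records["movie"]))     # a single label gives a Series
print(type(from_records[["movie"]]))   # a list of labels gives a DataFrame
```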


## Data types

(see https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes)

In [None]:
data.dtypes

rank        int64
title      object
year        int64
rating    float64
dtype: object
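Each column carries its own dtype, and it can be changed with `astype`. The following sketch uses a small hand-built `sample` frame (an assumption made for illustration; the exact default dtypes printed, especially for strings, can differ between pandas versions and platforms):

```python
import pandas as pd

sample = pd.DataFrame({
    "rank": [1, 2, 3],
    "title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "year": [1994, 1972, 2008],
    "rating": [9.2, 9.2, 9.0],
})
print(sample.dtypes)

# astype returns a new DataFrame; `sample` itself is left unchanged.
# Small integers such as ranks and years fit comfortably into 16 bits.
compact = sample.astype({"rank": "int16", "year": "int16"})
print(compact.dtypes)
```

Columns not mentioned in the mapping passed to `astype` keep their original dtype.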

### Integers

### Floats

### Booleans

### Strings (& objects)

### Other types

TODO: Mention extension types
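As one possible illustration of an extension type, here is a hedged sketch of the nullable integer dtype `"Int64"` (note the capital I). The `plain` and `nullable` names and the example values are assumptions made for this sketch only.

```python
import pandas as pd

# With the default NumPy-backed dtype, a missing value turns integers into floats.
plain = pd.Series([1994, None, 2008])
print(plain.dtype)  # float64

# The nullable "Int64" extension dtype keeps the integers and marks the gap as <NA>.
nullable = pd.Series([1994, None, 2008], dtype="Int64")
print(nullable.dtype)            # Int64
print(nullable.isna().tolist())  # [False, True, False]
```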


## Indexing

(see https://pandas.pydata.org/docs/user_guide/indexing.html)
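As a brief sketch of the most common ways to select data (the four-row `sample` frame is hand-typed from the table above and only serves as an illustration), pandas offers position-based `iloc`, label-based `loc` and boolean masks:

```python
import pandas as pd

sample = pd.DataFrame({
    "title": ["The Shawshank Redemption", "The Godfather",
              "The Dark Knight", "12 Angry Men"],
    "year": [1994, 1972, 2008, 1957],
    "rating": [9.2, 9.2, 9.0, 8.9],
})

# iloc selects by integer position: here the first row, as a Series.
first_row = sample.iloc[0]
print(first_row["title"])  # The Shawshank Redemption

# loc selects by label: row label 1, column "title".
print(sample.loc[1, "title"])  # The Godfather

# A boolean mask keeps only the rows where the condition is True.
before_1990 = sample[sample["year"] < 1990]
print(before_1990["title"].tolist())  # ['The Godfather', '12 Angry Men']
```

With the default `RangeIndex`, positions and labels coincide, so `loc` and `iloc` only start to differ once the index is changed or rows are filtered.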
