## Introduction to Pandas

**Pandas is probably the most widely used library for data manipulation and analysis in Python.** You will use it a lot during this course. Let's learn some of its functionalities.

pandas.DataFrame is a 2-dimensional data structure with columns of potentially different types. You can think of it like an Excel spreadsheet. It is generally the most commonly used pandas object and works great for storing and analyzing data.
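To make "columns of potentially different types" concrete, here is a minimal sketch. The table and its column names (`name`, `age`, `height_m`) are made up for illustration and are not part of the course data.

```python
import pandas as pd

# A tiny illustrative table; names and values are invented for this sketch.
example = pd.DataFrame({
    'name': ['Ada', 'Grace'],      # text
    'age': [36, 45],               # integers
    'height_m': [1.65, 1.70],      # floating-point numbers
})

print(example.shape)    # (number of rows, number of columns)
print(example.dtypes)   # one data type per column
```

Each column keeps its own data type, which is what distinguishes a DataFrame from a plain 2-dimensional numeric array.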

<br/><br/>
<center>
<img src="imgs/characters.png" height="250">
</center>
<br/><br/>


Let's imagine we want to create a DataFrame that contains information about characters from the **critically acclaimed 2019 game Disco Elysium**. We will use it as an example to learn how to create a DataFrame from scratch and perform some basic operations on it.

In [1]:
# First, initialize an empty DataFrame

import pandas as pd

df = pd.DataFrame()

# Now, let's create the columns as lists

first_names = ['Harry', 'Kim', 'Lawrence', 'Joyce', 'Cuno', 'Goracy']
last_names =  ['Du Bois', 'Kitsuragi', 'Garte', 'Messier', 'de Ruyter', 'Kubek']
ages = [44, 43, 28, 48, 12, 39]
occupations = ['Cop', 'Cop', 'Bartender', 'Landlady', 'Unemployed', 'Cook']
origins = ['Revachol', 'Revachol', 'Revachol', 'Revachol', 'Oranje', 'Graad']

# Let's add the columns to the DataFrame

df['first_name'] = first_names
df['last_name'] = last_names
df['age'] = ages
df['occupation'] = occupations
df['origin'] = origins

# Let's take a peek at the first 5 rows of the DataFrame. Another method, tail(), would show the last 5 rows.
df.head()

,first_name,last_name,age,occupation,origin
0,Harry,Du Bois,44,Cop,Revachol
1,Kim,Kitsuragi,43,Cop,Revachol
2,Lawrence,Garte,28,Bartender,Revachol
3,Joyce,Messier,48,Landlady,Revachol
4,Cuno,de Ruyter,12,Unemployed,Oranje


In [2]:
# We can also create a DataFrame from a dictionary

data = {'first_name': first_names, # column name: column values
        'last_name': last_names, 
        'age': ages, 
        'occupation': occupations,
        'origin': origins}

df = pd.DataFrame(data)
df.head() 

# It's the same as before!

,first_name,last_name,age,occupation,origin
0,Harry,Du Bois,44,Cop,Revachol
1,Kim,Kitsuragi,43,Cop,Revachol
2,Lawrence,Garte,28,Bartender,Revachol
3,Joyce,Messier,48,Landlady,Revachol
4,Cuno,de Ruyter,12,Unemployed,Oranje


**We can access a column of the DataFrame by its name.** This returns a pandas Series, 
a one-dimensional labeled array that behaves somewhat like a Python dictionary. It is the building block 
of pandas DataFrames: the index labels (by default $0, 1, 2, ..., n-1$ for $n$ rows) play the role of the 
"dictionary keys" and our data are the "dictionary values". 

You won't need to work with pandas Series objects as often as with DataFrames, but it's good to know 
they exist.
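The dictionary analogy can be checked directly. The sketch below is illustrative only; the Series names `ages` and `ages_by_name` are chosen for this example.

```python
import pandas as pd

# A Series built from a list gets the default integer labels 0, 1, 2, ...
ages = pd.Series([44, 43, 28], name='age')

print(ages.index.tolist())   # [0, 1, 2]
print(ages.loc[1])           # look up a value by its label, like a dictionary key
print(ages.to_dict())        # {0: 44, 1: 43, 2: 28}

# Labels do not have to be integers: a dictionary maps straight onto a Series.
ages_by_name = pd.Series({'Harry': 44, 'Kim': 43})
print(ages_by_name['Kim'])
```

Unlike a dictionary, a Series also supports arithmetic on all of its values at once, for example `ages + 1`.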

In [3]:
# Let's see the first names of the characters only.

first_names = df['first_name']
print("What we got is a pandas.Series object:")
print(first_names, '\n')

# pandas.Series can be converted to an ordinary Python list using the tolist() method
names_list = first_names.tolist()
print("Now it's a Python list:")
print(names_list)

What we got is a pandas.Series object:
0       Harry
1         Kim
2    Lawrence
3       Joyce
4        Cuno
5      Goracy
Name: first_name, dtype: object 

Now it's a Python list:
['Harry', 'Kim', 'Lawrence', 'Joyce', 'Cuno', 'Goracy']


### Basic DataFrame operations

Let's review some basic operations that can be performed on a pandas DataFrame.  

You may wonder how to append a new row to an existing DataFrame. Unfortunately, it is not as straightforward as appending a new element to a Python list, as **DataFrames are not designed to grow in size dynamically**, but you can still do it.

<br>
<center>
<img src="imgs/add-character.png" height="215">
</center>
<br>

One approach is to **create a new DataFrame containing only the new row** and then concatenate it with the original DataFrame, that is, stack the two row-wise.

In [4]:
# Let's create a dictionary with the information about a new character. The column names should match the
# column names of the original DataFrame.

new_character = {'first_name': 'Klaasje', 'last_name': 'Amandou', 'age': 28, 'occupation': 'Corporate spy', 'origin': 'Oranje'}

# Create a new DataFrame with the new character
df_new = pd.DataFrame(new_character, index=[0]) # An index is required when every value is a scalar; the label itself does not matter because ignore_index=True below renumbers the rows

# Concatenate the original DataFrame with the new DataFrame using pandas.concat()

df = pd.concat([df, df_new], ignore_index=True) # The ignore_index parameter resets the index of the resulting DataFrame, and it is set to True to avoid having duplicate indices.
df.tail()

,first_name,last_name,age,occupation,origin
2,Lawrence,Garte,28,Bartender,Revachol
3,Joyce,Messier,48,Landlady,Revachol
4,Cuno,de Ruyter,12,Unemployed,Oranje
5,Goracy,Kubek,39,Cook,Graad
6,Klaasje,Amandou,28,Corporate spy,Oranje


In [5]:
# Suppose we only want to see the last names and occupations of the characters. 
# We can select multiple columns by passing a list of column names to the DataFrame, in square brackets.

columns = ['last_name', 'occupation']
name_and_occupation = df[columns] # This will return a DataFrame with only the selected columns
name_and_occupation.head(7) # Now we have only the last names and occupations

,last_name,occupation
0,Du Bois,Cop
1,Kitsuragi,Cop
2,Garte,Bartender
3,Messier,Landlady
4,de Ruyter,Unemployed
5,Kubek,Cook
6,Amandou,Corporate spy


In [6]:
# We can also select rows based on a condition. 
# For example, we can select only the characters who are cops.

cops = (df['occupation'] == 'Cop') # This returns a boolean mask: a pandas Series of True/False values indicating whether the condition is met for each row
cops_only = df[cops]
cops_only.head()

,first_name,last_name,age,occupation,origin
0,Harry,Du Bois,44,Cop,Revachol
1,Kim,Kitsuragi,43,Cop,Revachol
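Boolean masks can also be combined. The following is a minimal sketch on a small stand-in table (`people`, a cut-down copy of the character data), assuming we want cops older than 43 and everyone who is not a cop.

```python
import pandas as pd

# Stand-in data for this sketch: a few rows of the character table.
people = pd.DataFrame({
    'first_name': ['Harry', 'Kim', 'Lawrence', 'Cuno'],
    'age': [44, 43, 28, 12],
    'occupation': ['Cop', 'Cop', 'Bartender', 'Unemployed'],
})

is_cop = people['occupation'] == 'Cop'

# & means "and", | means "or", ~ means "not".
# Each comparison needs its own parentheses when written inline.
older_cops = people[is_cop & (people['age'] > 43)]
not_cops = people[~is_cop]

print(older_cops['first_name'].tolist())
print(not_cops['first_name'].tolist())
```

Python's `and`, `or`, and `not` keywords do not work element-wise on Series, which is why the operators `&`, `|`, and `~` are used instead.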


In [7]:
# If we wanted to see some statistics about the dataframe values, we would use the describe() method

df['age'].describe()    # reports the count, mean, standard deviation, min, max, and quartiles of the 'age' column

count     7.000000
mean     34.571429
std      12.620844
min      12.000000
25%      28.000000
50%      39.000000
75%      43.500000
max      48.000000
Name: age, dtype: float64

In [8]:
# We can also sort the DataFrame by values in a column. Let's sort the characters by age.

df_sorted = df.sort_values(by='age')
df_sorted.head(7)

,first_name,last_name,age,occupation,origin
4,Cuno,de Ruyter,12,Unemployed,Oranje
2,Lawrence,Garte,28,Bartender,Revachol
6,Klaasje,Amandou,28,Corporate spy,Oranje
5,Goracy,Kubek,39,Cook,Graad
1,Kim,Kitsuragi,43,Cop,Revachol
0,Harry,Du Bois,44,Cop,Revachol
3,Joyce,Messier,48,Landlady,Revachol
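Sorting can also be descending or use several columns. This is a small sketch on stand-in data (`people`), assuming we want the oldest characters first and ties broken alphabetically by first name.

```python
import pandas as pd

# Stand-in data for this sketch.
people = pd.DataFrame({
    'first_name': ['Lawrence', 'Klaasje', 'Cuno', 'Harry'],
    'age': [28, 28, 12, 44],
})

# Sort by age (descending), then by first name (ascending) to break ties.
by_age_then_name = people.sort_values(by=['age', 'first_name'],
                                      ascending=[False, True])

# sort_values keeps the original index labels; reset_index(drop=True) renumbers the rows.
by_age_then_name = by_age_then_name.reset_index(drop=True)

print(by_age_then_name)
```

Note that `sort_values` returns a new DataFrame and leaves `people` unchanged.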


In [9]:
# As time passes, people usually get older. This does not really happen in the game, as the story 
# spans only a few days. Nevertheless, let's see what the characters' ages will be in 10 years and 
# add this information to the DataFrame.

df['age_in_10_yrs'] = df['age'] + 10    # We can perform operations on entire columns at once
df.head()

,first_name,last_name,age,occupation,origin,age_in_10_yrs
0,Harry,Du Bois,44,Cop,Revachol,54
1,Kim,Kitsuragi,43,Cop,Revachol,53
2,Lawrence,Garte,28,Bartender,Revachol,38
3,Joyce,Messier,48,Landlady,Revachol,58
4,Cuno,de Ruyter,12,Unemployed,Oranje,22


In [10]:
# If we wanted to perform a more complex operation on the 'age' column, we could use the apply() method.

# You can write a custom function that takes a value from a column, does something with it, and returns 
# the result. Then, you can apply this function to the entire column at once.

def describe_age(x):
    if x < 15:
        return 'Kid'
    elif x < 30:
        return 'Young' # If the age is less than 30, return 'Young'
    elif x < 60:
        return 'Middle-aged' # If the age is less than 60, return 'Middle-aged'
    else:
        return 'Elderly'

df['age_categorical'] = df['age'].apply(describe_age)
df.head(7)

,first_name,last_name,age,occupation,origin,age_in_10_yrs,age_categorical
0,Harry,Du Bois,44,Cop,Revachol,54,Middle-aged
1,Kim,Kitsuragi,43,Cop,Revachol,53,Middle-aged
2,Lawrence,Garte,28,Bartender,Revachol,38,Young
3,Joyce,Messier,48,Landlady,Revachol,58,Middle-aged
4,Cuno,de Ruyter,12,Unemployed,Oranje,22,Kid
5,Goracy,Kubek,39,Cook,Graad,49,Middle-aged
6,Klaasje,Amandou,28,Corporate spy,Oranje,38,Young


### Iterating over rows in a DataFrame (don't do it!)

When working with pandas DataFrames it is advised **not to use 'for' loops to iterate over rows**. In contrast to Python lists, pandas DataFrames are optimized for vectorized operations. This means that you can apply operations to entire columns at once, which is much faster than iterating over rows. **If you find yourself iterating over rows in a DataFrame, you are probably doing something wrong**. There is almost always a better, more efficient way to do it using pandas methods.

Nevertheless, be aware of the iterrows() method. Here is an example of how it could be used:

In [11]:
print('My favorite characters of Disco Elysium:')
for index, row in df.iterrows():
    first_name = row['first_name']
    last_name = row['last_name']
    job = row['occupation']
    print(f'{index}. {first_name} {last_name}, who is a(n) {job}')

My favorite characters of Disco Elysium:
0. Harry Du Bois, who is a(n) Cop
1. Kim Kitsuragi, who is a(n) Cop
2. Lawrence Garte, who is a(n) Bartender
3. Joyce Messier, who is a(n) Landlady
4. Cuno de Ruyter, who is a(n) Unemployed
5. Goracy Kubek, who is a(n) Cook
6. Klaasje Amandou, who is a(n) Corporate spy
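For comparison, the descriptive part of each line can be built without a loop by combining whole columns. A minimal sketch on stand-in data (`people`) follows; it builds the same strings both ways to show that they agree.

```python
import pandas as pd

# Stand-in data for this sketch: two rows of the character table.
people = pd.DataFrame({
    'first_name': ['Harry', 'Kim'],
    'last_name': ['Du Bois', 'Kitsuragi'],
    'occupation': ['Cop', 'Cop'],
})

# Row-by-row version with iterrows(): works, but slow on large tables.
loop_result = []
for _, row in people.iterrows():
    loop_result.append(f"{row['first_name']} {row['last_name']}, who is a(n) {row['occupation']}")

# Vectorized version: string concatenation applied to entire columns at once.
descriptions = (people['first_name'] + ' ' + people['last_name']
                + ', who is a(n) ' + people['occupation'])

print('\n'.join(descriptions))
```

On a table this small the difference is not noticeable, but on large tables the column-wise version is typically much faster.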


### Get familiar with Pandas documentation!

In this notebook we explored only a fraction of pandas.DataFrame methods. You should get familiar with [Pandas documentation](https://pandas.pydata.org/docs/reference/frame.html) and play around with the methods to get a better understanding of what you can do with pandas. Go ahead and try some of the methods on the DataFrame we created in this notebook.
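As a possible starting point, here is a short sketch of three methods not covered above: `value_counts()`, `groupby()`, and `nunique()`. It rebuilds a reduced copy of the character table (named `characters` here) so that it runs on its own.

```python
import pandas as pd

# A reduced copy of the character table from this notebook.
characters = pd.DataFrame({
    'first_name': ['Harry', 'Kim', 'Lawrence', 'Joyce', 'Cuno', 'Goracy', 'Klaasje'],
    'age': [44, 43, 28, 48, 12, 39, 28],
    'origin': ['Revachol', 'Revachol', 'Revachol', 'Revachol', 'Oranje', 'Graad', 'Oranje'],
})

# How many characters come from each place?
origin_counts = characters['origin'].value_counts()
print(origin_counts)

# What is the average age per place of origin?
mean_age_by_origin = characters.groupby('origin')['age'].mean()
print(mean_age_by_origin)

# How many distinct places of origin are there?
print(characters['origin'].nunique())
```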