<a href="https://colab.research.google.com/github/goteguru/kmooc_python/blob/main/notebooks/en/kmooc_08_1_pandas_en.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas

When our data is not purely numerical but also contains text and categories, the de facto standard Python library to reach for is `pandas`. It is built on NumPy, so it keeps NumPy's computational capabilities, and it adds many features on top:

* categorical / text data
* named columns
* grouping
* missing data handling

So pandas provides functionality in Python similar to a spreadsheet with some database-like features.

Combined with Python's very broad general capabilities, it can be a very powerful tool.



In [None]:
# we will need the pandas library,
# so definitely run this.

import pandas as pd


Pandas has two fundamental data structures:

* Series - one-dimensional data (like a column)
* DataFrame - two-dimensional table ("Excel")

## Series

In [None]:
numbers = pd.Series([10, 20, 30, 40, 50])
numbers


In [None]:
letters = pd.Series(['Alfa','Béta','Gamma'])
letters

We can see that unlike NumPy arrays these structures always have an index (which is printed). If we want, we can inspect the index or the values themselves:

In [None]:
letters.index

RangeIndex(start=0, stop=3, step=1)

The index labels here are not stored as raw numbers (that would be wasteful, since they increase regularly). Instead, a `RangeIndex` object stores only where it starts, where it stops, and the step size, and produces the labels on demand.

In [None]:
list(letters.index) # force it to make a list

In [None]:
letters.values # the stored values themselves, without the index
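As a small sketch of the index idea described above: the three numbers kept by a `RangeIndex` can be read directly, and a Series can also carry labels of our own. The labels `"a"`, `"b"`, `"c"` are made up for illustration.

```python
import pandas as pd

letters = pd.Series(["Alfa", "Béta", "Gamma"])

# A RangeIndex keeps only three numbers, not one entry per row.
idx = letters.index
print(idx.start, idx.stop, idx.step)  # 0 3 1

# Hypothetical custom labels: the index does not have to be numeric.
labeled = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(labeled["b"])          # 20
print(list(labeled.index))   # ['a', 'b', 'c']
```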

## DataFrame

A DataFrame is more like an "Excel" sheet or a database table. Its columns have names. If we don't provide names, pandas simply numbers the columns (0, 1, 2, ...).

Pandas is so fundamental that Colab (where you're viewing this) offers a special renderer for it. If you run the code below, it won't just print the values, it will nicely format them and provide tools. You can request chart suggestions or make an "interactive" (sortable, filterable) table. Since Colab provides this, running purely in the interpreter wouldn't give you these features (unless you use additional libraries).

In [None]:
table = pd.DataFrame({
    "name": ["Anna", "Béla", "Cecil", "Gábor"],
    "age": [23, 30, None, 27],
    "city": ["Budapest", "Szeged", "Pécs", None]
})

table

Try out what you can do with such a table!

DataFrames also have attributes:


In [None]:
table.shape # there are four rows and three columns.

In [None]:
table.columns # these are the column names

In [None]:
table.index # and of course there are indexes here too
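Another attribute worth a look is `dtypes`. The sketch below rebuilds the same small table and shows a side effect of the missing age: because `None` is stored as NaN, the whole `age` column is kept as floating-point numbers rather than integers.

```python
import pandas as pd

table = pd.DataFrame({
    "name": ["Anna", "Béla", "Cecil", "Gábor"],
    "age": [23, 30, None, 27],
    "city": ["Budapest", "Szeged", "Pécs", None],
})

# One data type per column
print(table.dtypes)

# The None in "age" became NaN, so the column is stored as floats.
print(table["age"].dtype)  # float64
```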

Usually we don't want to provide data by hand, but import it from some source (e.g., a spreadsheet) and then process it. Fortunately, we don't need to write special code for this: Pandas can easily read data from these files:

In [None]:
# pd.read_csv("data.csv")  # from a CSV file
# pd.read_excel("data.xlsx") # from an Excel file

# or even from a url:
url = "https://gist.githubusercontent.com/goteguru/2e1efde943f9c963dcbb1fcae598b646/raw/0c37ffa0c7fc2ee77a539101ea7986f21a141fa8/sample_people.csv"
data = pd.read_csv(url)
data

A pandas DataFrame not only resembles an Excel sheet: pandas can also read and write xlsx files directly. Try it! Upload an xlsx file (left-side menu) named data.xlsx (or change the filename in the code) and display its contents!

In [None]:
file_name = "data.xlsx"
df = pd.read_excel(file_name)

# we can also write it to a file (locally in Colab):
df.to_excel("output.xlsx", index=False)

# or display it:
df

## Indexing

Indexing is similar to what we saw with NumPy, but now we can also use column names.

In [None]:
data["age"] # like this
data.age # or even like this (if the name is a valid Python identifier and doesn't clash with a DataFrame method)

In [None]:
# if we want multiple columns, provide them in a list
data[["city", "name", "age"]] # we gave a list as an index!

In [None]:
# we can access a specific row (or a single cell) using the loc indexer
data.loc[2, "name"] # the name in the row labeled 2 (the third row of the original data)

In [None]:
data.loc[2:5, ["name","age"]] # rows labeled from 2 to 5 (both ends included), the name and age fields.

`.loc` looks at row labels, not positional indices. This is especially important if we've filtered the data. If you want to use positions (not labels), then:

In [None]:
data[2:5] # instead of label, with the (0-based) index (third, fourth and fifth)

In [None]:
data.iloc[2:5] # but this is how it's usually written (iloc = integer location)
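The difference between labels and positions becomes visible once rows have been filtered out. The following is a minimal sketch on a small made-up table (the `people` DataFrame and its values are illustrative, not the course data):

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Anna", "Béla", "Cecil", "Gábor"],
    "age": [23, 30, 41, 27],
})

older = people[people["age"] > 25]   # keeps the rows labeled 1, 2, 3
print(list(older.index))             # [1, 2, 3]

print(older.iloc[0]["name"])   # first row by position: Béla
print(older.loc[1, "name"])    # row labeled 1: Béla
# older.loc[0] would raise a KeyError, because label 0 was filtered out.

# Label slices include the end point, position slices do not.
print(len(people.loc[1:3]))    # 3 rows (labels 1, 2, 3)
print(len(people.iloc[1:3]))   # 2 rows (positions 1 and 2)
```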

## Conditional indexing

Just like with NumPy, we can use boolean arrays (masks) as indices: a comparison yields one True/False value per row, and only the True rows are kept. This allows us to filter our data as we like:

In [None]:
data[data.age<30]

In [None]:
data[data.city == "Budapest"]

In [None]:
data[(data.age > 30) & (data.city == "Pécs")]
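Conditions can be combined in other ways too. The sketch below, on a small made-up table, shows the mask itself, then "or" (`|`), "not" (`~`) and `isin()`. Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparisons in Python.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Anna", "Béla", "Cecil", "Gábor"],
    "age": [23, 30, 41, 27],
    "city": ["Budapest", "Szeged", "Pécs", "Budapest"],
})

# A comparison produces a boolean Series (a "mask"), one value per row.
mask = people["age"] < 30
print(mask.tolist())  # [True, False, False, True]

# | means "or", ~ means "not".
young_or_pecs = people[(people["age"] < 25) | (people["city"] == "Pécs")]
not_budapest = people[~(people["city"] == "Budapest")]

# isin() tests membership in a list of allowed values.
in_two_cities = people[people["city"].isin(["Szeged", "Pécs"])]

print(young_or_pecs["name"].tolist())  # ['Anna', 'Cecil']
```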

## Statistics

As with databases, spreadsheets, or NumPy, we can also compute aggregates.

In [None]:
data.describe() # statistics for all numeric fields

In [None]:
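# Only the last expression of a cell is displayed; wrap the others in print() to see them.
data["age"].mean() # mean
data["age"].max() # largest value
data["age"].min() # smallest value
data["city"].value_counts() # counts of values (how many of each)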


In [None]:
data["city"].unique() # unique values (cities, each once)

In [None]:
data["age"].agg(['mean', 'min', 'max']) # compute multiple statistics at once
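One detail to keep in mind when reading these numbers: by default pandas aggregates skip missing values. A short sketch with made-up ages:

```python
import pandas as pd

ages = pd.Series([23, 30, None, 27])

print(ages.count())   # 3, count() only counts the non-missing values
print(len(ages))      # 4, len() counts every row
print(ages.mean())    # mean of 23, 30 and 27 only
print(ages.mean(skipna=False))  # nan, because one value is missing
```

So a mean computed on a column with gaps describes only the rows where the value is known.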

## Modification

If we want to add a new column (given directly or computed from others), we simply assign to a new column name, just like adding a key to a Python dict.

In [None]:
from datetime import datetime
current_year = datetime.now().year

data["birth_year"] = current_year - data["age"] # approximate: ignores whether this year's birthday has passed
data

In [None]:
data.drop(columns=["birth_year"]) # a copy without that column

In [None]:
data["name_length"] = data["name"].apply(len) # apply the len function to every name
data
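Note the asymmetry in the cells above: assigning to a new column name changes the DataFrame in place, while `drop` returns a modified copy and leaves the original untouched. A minimal sketch on a made-up table (the column `age_in_months` is only an illustration):

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Anna", "Béla"],
    "age": [23, 30],
})

people["age_in_months"] = people["age"] * 12   # adds a column in place

trimmed = people.drop(columns=["age_in_months"])  # a new DataFrame
print(list(people.columns))   # ['name', 'age', 'age_in_months']
print(list(trimmed.columns))  # ['name', 'age']

# To make the removal permanent, assign the result back:
people = people.drop(columns=["age_in_months"])
```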

## Handling missing values



In [None]:
data.isna() # where is data missing?

In [None]:
data.isna().sum() # how many are there?

Missing data often needs to be handled in some way. For example, we could drop rows with missing values (not ideal) or fill them somehow, e.g., with the mean or the mode.

In [None]:
data.dropna() # drop rows where data is missing

In [None]:
data["age"].fillna(data["age"].mean()) # fill missing with the mean

Modify the `dropna()` cell above so that instead of the whole table it only prints the `.shape` attribute of the result, so you can see how many rows remain!
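Like `drop`, both `dropna` and `fillna` return new objects, so the result has to be assigned back if we want to keep it. The sketch below fills a numeric column with its mean and a text column with its most frequent value (the mode); the table is made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({
    "age": [23, 30, None, 27],
    "city": ["Budapest", None, "Pécs", "Budapest"],
})

# fillna() returns a new Series; assign it back to keep the result.
people["age"] = people["age"].fillna(people["age"].mean())

# mode() returns a Series (there may be ties), so take its first entry.
most_common_city = people["city"].mode()[0]
people["city"] = people["city"].fillna(most_common_city)

print(people.isna().sum().sum())  # 0
```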

## Sorting and grouping

One useful capability of database systems is fast grouping and sorting. A pandas DataFrame can do this too.

In [None]:
data.sort_index() # by row label (the index)
data.sort_values("age") # by age
data.sort_values("age", ascending=False) # in descending order

In [None]:
# What is the average age per city?
data.groupby("city")["age"].mean()

In [None]:
# What is the median height by gender?
data.groupby("gender")["height"].median()

In [None]:
# Compute multiple statistics per city:
data.groupby("city").agg({
    "age": "mean",
    "name": "count"
})
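With the dict form, the result columns keep the original names (`age`, `name`), which can be confusing when `name` actually holds a count. One possible alternative is "named aggregation", where each output column gets its own name. The sketch uses a made-up table, and the names `mean_age` and `residents` are arbitrary choices.

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Anna", "Béla", "Cecil", "Gábor"],
    "age": [23, 30, 41, 27],
    "city": ["Budapest", "Szeged", "Budapest", "Szeged"],
})

# output_column=(input_column, aggregation)
summary = people.groupby("city").agg(
    mean_age=("age", "mean"),
    residents=("name", "count"),
)
print(summary)
```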


## Joining DataFrames

Often our data is not in a single table, but there is a relation between tables and we can match rows based on some key. For example, we might have another table with additional data about cities (e.g., area), so if we know a person's city from the residents table, we can also know the area of that municipality. (If you've worked with databases this will be familiar.) Pandas can also join DataFrames.

In [None]:
cities = pd.DataFrame({
    "city": ["Budapest", "Pécs", "Debrecen", "Szeged", "Győr"],
    "area": [525.14, 162.78, 461.66, 281.00, 174.62]
})

# the two tables are matched on the city column
data.merge(cities, on="city", how="left")

In the example above, `how="left"` may be familiar from SQL. It means we want a LEFT JOIN, so if a person doesn't have a city specified (missing), we still want them to appear in the result (they just won't have an area).

If we request an INNER JOIN (the default for `merge`), only rows whose city appears in both tables are kept.

In [None]:
(
    len(data), # how big (how many rows) was the original?
    len(data.merge(cities, on="city", how="left")), # how many rows with left join?
    len(data.merge(cities, on="city")), # and how many with inner join?
)
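To see not just how many rows survive but which ones found a match, `merge` accepts `indicator=True`, which adds a `_merge` column. A small sketch with made-up tables (one person lives in a city missing from the city table, another has no city at all):

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Anna", "Béla", "Cecil"],
    "city": ["Budapest", "Miskolc", None],
})
cities = pd.DataFrame({
    "city": ["Budapest", "Pécs"],
    "area": [525.14, 162.78],
})

# _merge is "both" for matched rows and "left_only" for unmatched ones.
left = people.merge(cities, on="city", how="left", indicator=True)
print(left[["name", "city", "area", "_merge"]])

inner = people.merge(cities, on="city")  # how="inner" is the default
print(len(left), len(inner))  # 3 1
```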

In [None]:
# If the join columns have different names in the two tables, name them separately
# (here both happen to be called "city"):
data.merge(cities, left_on="city", right_on="city", how="left")
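A sketch of the case where the names really do differ. The `areas` table and its `settlement` column are hypothetical, invented only to show the mechanics:

```python
import pandas as pd

people = pd.DataFrame({
    "name": ["Anna", "Béla"],
    "city": ["Budapest", "Pécs"],
})
# Hypothetical table whose key column is called "settlement" instead of "city"
areas = pd.DataFrame({
    "settlement": ["Budapest", "Pécs"],
    "area": [525.14, 162.78],
})

joined = people.merge(areas, left_on="city", right_on="settlement", how="left")

# Both key columns are kept in the result; drop the duplicate if it is not needed.
joined = joined.drop(columns=["settlement"])
print(joined)
```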

## Persisting data

When the Python program ends, the data in the DataFrame is lost. Of course we can save it to CSV or XLSX, but let's learn a better option. If you plan to use the saved data later with pandas or other modern data tools, Parquet is a good choice: it is a fast, compact binary format that also preserves column types. (Feather is another alternative.) Reading and writing Parquet requires the `pyarrow` or `fastparquet` package.

Saving data is super easy:

In [None]:
# save:
data.to_parquet("data.parquet")

# read back:
restored = pd.read_parquet("data.parquet")

## Visualization

Pandas has built-in visualization (by default it uses Matplotlib) so we can quickly display our data. 



In [None]:
# line plot of the heights (x axis: row index)
data["height"].plot()

In [None]:
# age "distribution" (histogram), split into 20 bins
data['age'].plot.hist(bins=20)

In [None]:
# average height per city, as a bar chart:
data.groupby("city")["height"].mean().plot(kind="bar")

In [None]:
# per gender, how many people have a city specified?
data.groupby("gender")["city"].count().plot(kind="pie")

In [None]:
# number of people per city:
data["city"].value_counts().plot(kind="bar")

Try to compute (and maybe plot) the following:

- How many women (F) are in our data?
- How many rows have no city specified?
- How many people are there per city?
- What is the average height of people per city?
- Average area of residence by gender?
- What is the average of the product of people's age and their city's area? (don't ask what that's useful for :))

If you get stuck, AI can help.


In [None]:
# Your solutions:

### Solutions:

In [None]:
len(data[data["gender"]=="F"]) # how many women (could be done differently)
data["city"].isna().sum() # how many rows have no city specified?
data['city'].value_counts() # how many people per city
data.groupby('city')['height'].mean() # average height per city

# inner join: people whose city is missing or not in the cities table are dropped
data_with_cities = data.merge(cities, on="city")
# average area of residence by gender
data_with_cities.groupby("gender")["area"].mean()

# average of people's age multiplied by their city's area?
(data_with_cities["age"] * data_with_cities["area"]).mean()