# Getting started with pandas - Intro

This chapter introduces you to the pandas package. It also serves as a test for your technical setup.

# Load required packages and set up display of DataFrames

In [1]:
import pandas as pd

# Loading and inspecting data

## Loading and inspecting a CSV file

In [2]:
df = pd.read_csv("../../data/raw/financial_data_intro.csv")
df.head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,False
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,34816.074,3713.506,False
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,37161.97,4226.559,False
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,40276.807,3591.888,False
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,43026.854,4386.107,False


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 10 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   u_company_name_id      824 non-null    int64  
 1   u_year                 824 non-null    int64  
 2   u_company_name         824 non-null    object 
 3   cb_naics               824 non-null    int64  
 4   u_iso3                 824 non-null    object 
 5   u_fye                  824 non-null    object 
 6   cb_cusip               824 non-null    object 
 7   cb_at                  824 non-null    float64
 8   cb_ni                  824 non-null    float64
 9   cb_financial_industry  824 non-null    bool   
dtypes: bool(1), float64(2), int64(3), object(4)
memory usage: 58.9+ KB


In [4]:
df.columns

Index(['u_company_name_id', 'u_year', 'u_company_name', 'cb_naics', 'u_iso3',
       'u_fye', 'cb_cusip', 'cb_at', 'cb_ni', 'cb_financial_industry'],
      dtype='object')

In [5]:
df.describe()

Unnamed: 0,u_company_name_id,u_year,cb_naics,cb_at,cb_ni
count,824.0,824.0,824.0,824.0,824.0
mean,45233.387136,2012.11165,379140.90534,113685.9,2191.241188
std,32869.292595,4.166015,207892.935774,296348.9,5047.784304
min,2172.0,2005.0,325.0,0.0,-50119.0
25%,14651.0,2009.0,324110.0,763.492,-0.31275
50%,34617.0,2012.0,336111.0,16471.87,484.862
75%,77954.0,2016.0,523930.0,84863.73,3470.27075
max,109031.0,2019.0,999977.0,2261780.0,50778.396


In [6]:
df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
u_company_name_id,824.0,45233.387136,32869.292595,2172.0,14651.0,34617.0,77954.0,109031.0
u_year,824.0,2012.11165,4.166015,2005.0,2009.0,2012.0,2016.0,2019.0
cb_naics,824.0,379140.90534,207892.935774,325.0,324110.0,336111.0,523930.0,999977.0
cb_at,824.0,113685.916945,296348.893884,0.0,763.492,16471.8695,84863.72725,2261780.0
cb_ni,824.0,2191.241188,5047.784304,-50119.0,-0.31275,484.862,3470.27075,50778.396


## Checking the file specifications

Every file needs its own import settings. The defaults of `pd.read_csv()` happened to be correct for our example above, but that is not guaranteed for other files. It is generally a good idea to look at the first few lines of a text file, especially when the defaults do not produce the expected columns and data types.

> **Note:** Use the [Checklist for data loading issues](../../checklist-data-loading.md)

In [7]:
# Printing the first few lines of the file:
with open("../../data/raw/financial_data_intro.csv") as f:
    for _ in range(5):
        print(f.readline(), end="")

u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,FALSE
14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,34816.074,3713.506,FALSE
14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,37161.97,4226.559,FALSE
14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,40276.807,3591.888,FALSE


In [8]:
# Pretend we use the wrong separator (";", which is often the correct one for CSV files from non-English locales):
df_wrong = pd.read_csv("../../data/raw/financial_data_intro.csv", sep=";")
df_wrong.head()

Unnamed: 0,"u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry"
0,"14651,2005,British American Tobacco PLC,312230..."
1,"14651,2006,British American Tobacco PLC,312230..."
2,"14651,2007,British American Tobacco PLC,312230..."
3,"14651,2008,British American Tobacco PLC,312230..."
4,"14651,2009,British American Tobacco PLC,312230..."


In [9]:
df_wrong.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 824 entries, 0 to 823
Data columns (total 1 columns):
 #   Column                                                                                                    Non-Null Count  Dtype 
---  ------                                                                                                    --------------  ----- 
 0   u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry  824 non-null    object
dtypes: object(1)
memory usage: 6.6+ KB


In [10]:
# Pretend we assume no header
df_wrong = pd.read_csv("../../data/raw/financial_data_intro.csv", header=None)
df_wrong.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
1,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,FALSE
2,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,34816.074,3713.506,FALSE
3,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,37161.97,4226.559,FALSE
4,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,40276.807,3591.888,FALSE


In [11]:
df_wrong.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 825 entries, 0 to 824
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       825 non-null    object
 1   1       825 non-null    object
 2   2       825 non-null    object
 3   3       825 non-null    object
 4   4       825 non-null    object
 5   5       825 non-null    object
 6   6       825 non-null    object
 7   7       825 non-null    object
 8   8       825 non-null    object
 9   9       825 non-null    object
dtypes: object(10)
memory usage: 64.6+ KB


In [12]:
# Clean up
del df_wrong
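
The opposite case can be sketched as well: a file that really does use `;` as separator and `,` as decimal mark. The snippet below is a minimal illustration, not part of the course data. The text in `raw_text` and the column names are made up, and `io.StringIO` stands in for a real file so that the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical file content: semicolon-separated values with decimal commas
raw_text = (
    "company;year;total_assets\n"
    "Example AG;2005;1234,5\n"
    "Example AG;2006;1300,75\n"
)

# State the separator, the decimal mark, and the header row explicitly
df_demo = pd.read_csv(io.StringIO(raw_text), sep=";", decimal=",", header=0)

# total_assets is parsed as a float column, year as an integer column
print(df_demo.dtypes)
```

Without `decimal=","`, pandas would read `total_assets` as text, which is the kind of problem a quick `df.info()` reveals.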

# Select DataFrame columns

## Selecting a single column

Using `[]` with a string returns one column as a `pd.Series`.

> **Note:** You can also select a column using dot notation, e.g. `df.u_company_name`, but this only works if the column name is a valid Python identifier, and it is not advised for general use.
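
To see why dot notation is fragile, here is a small sketch on a made-up DataFrame (`toy` and its column names are hypothetical). One column name contains a space, the other collides with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical toy data: one label contains a space, one shadows a method name
toy = pd.DataFrame({"net income": [1.0, 2.0], "count": [10, 20]})

# Bracket notation works for any column label
print(toy["net income"].tolist())
print(toy["count"].tolist())

# Dot notation finds the DataFrame.count method first, not the "count" column
print(callable(toy.count))
```

`toy.net income` would not even be valid Python syntax, which is why `[]` is the safer habit.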

In [13]:
# select a Series (single column) of a DataFrame
df["u_company_name"]

0      British American Tobacco PLC
1      British American Tobacco PLC
2      British American Tobacco PLC
3      British American Tobacco PLC
4      British American Tobacco PLC
                   ...             
819             Midatech Pharma PLC
820             Midatech Pharma PLC
821                   Motif Bio plc
822                   Motif Bio plc
823                   Motif Bio plc
Name: u_company_name, Length: 824, dtype: object

In [14]:
# Note that the result is a pandas Series, not a DataFrame!
type(df["u_company_name"])

pandas.core.series.Series

## Selecting multiple columns

Using `[]` with a list of strings returns multiple columns as a `pd.DataFrame`.

In [15]:
# Multiple columns are selected by supplying a list:
df[["u_company_name", "u_year"]]

Unnamed: 0,u_company_name,u_year
0,British American Tobacco PLC,2005
1,British American Tobacco PLC,2006
2,British American Tobacco PLC,2007
3,British American Tobacco PLC,2008
4,British American Tobacco PLC,2009
...,...,...
819,Midatech Pharma PLC,2017
820,Midatech Pharma PLC,2018
821,Motif Bio plc,2016
822,Motif Bio plc,2017


In [16]:
# The result is a DataFrame!
type(df[["u_company_name", "u_year"]])

pandas.core.frame.DataFrame

## Selecting specific types of columns

The `select_dtypes()` method of `pd.DataFrame` allows you to subset the DataFrame by selecting only specific types of columns.


In [17]:
df.select_dtypes(["int", "float"])

Unnamed: 0,u_company_name_id,u_year,cb_naics,cb_at,cb_ni
0,14651,2005,312230,32737.984,2707.110
1,14651,2006,312230,34816.074,3713.506
2,14651,2007,312230,37161.970,4226.559
3,14651,2008,312230,40276.807,3591.888
4,14651,2009,312230,43026.854,4386.107
...,...,...,...,...,...
819,62880,2017,325414,66.595,-21.733
820,62880,2018,325414,26.056,-19.156
821,64528,2016,325414,28.426,-40.324
822,64528,2017,325414,29.188,-44.810


In [18]:
# you can also produce a list of the columns:
numeric_columns = df.select_dtypes(["int", "float"]).columns.to_list()
numeric_columns

['u_company_name_id', 'u_year', 'cb_naics', 'cb_at', 'cb_ni']
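
As a variation on the cells above, `select_dtypes()` also accepts the generic `"number"` selector and an `exclude` argument. The sketch below uses a made-up DataFrame (`toy` and its columns are hypothetical) so that it runs without the course data.

```python
import pandas as pd

# Hypothetical toy data with integer, float, text, and boolean columns
toy = pd.DataFrame(
    {
        "firm_id": [1, 2],
        "assets": [10.5, 20.25],
        "name": ["A", "B"],
        "is_financial": [True, False],
    }
)

# "number" covers all integer and float columns, but not booleans
numeric_cols = toy.select_dtypes(include="number").columns.to_list()

# exclude returns everything that is NOT of the given type
non_numeric_cols = toy.select_dtypes(exclude="number").columns.to_list()

print(numeric_cols)
print(non_numeric_cols)
```

Using `"number"` saves you from listing every numeric type separately when a dataset mixes several of them.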


## Rule-based column selection

Especially for datasets with many columns, we often need to select columns based on specific patterns or rules.

This can be achieved using various methods, including loops, list comprehensions, and pandas string methods.

Most of the time, we use the `df.columns` attribute to get the column names to select from. It is a `pd.Index`, which offers many of the same methods as a `pd.Series`.

In [19]:
# df.columns is a `pd.Index`; it offers many of the same methods as a `pd.Series` (e.g. `.str`)
df.columns

Index(['u_company_name_id', 'u_year', 'u_company_name', 'cb_naics', 'u_iso3',
       'u_fye', 'cb_cusip', 'cb_at', 'cb_ni', 'cb_financial_industry'],
      dtype='object')

In [20]:
# lists created by code are especially useful for column selection

# Example: selecting columns starting with "cb_"
selected_cols = []
for c in df.columns:
    if c.startswith("cb_"):
        selected_cols.append(c)
print(f"selected_cols: {selected_cols}")

df[selected_cols]

selected_cols: ['cb_naics', 'cb_cusip', 'cb_at', 'cb_ni', 'cb_financial_industry']


Unnamed: 0,cb_naics,cb_cusip,cb_at,cb_ni,cb_financial_industry
0,312230,110448107,32737.984,2707.110,False
1,312230,110448107,34816.074,3713.506,False
2,312230,110448107,37161.970,4226.559,False
3,312230,110448107,40276.807,3591.888,False
4,312230,110448107,43026.854,4386.107,False
...,...,...,...,...,...
819,325414,59564R302,66.595,-21.733,False
820,325414,59564R302,26.056,-19.156,False
821,325414,619784101,28.426,-40.324,False
822,325414,619784101,29.188,-44.810,False


In [21]:
# Alternative solutions for generating the same `selected_cols`

# List comprehension:
selected_cols = [c for c in df.columns if c.startswith("cb_")]
print(f"From list comprehension: {selected_cols}")

# Using methods included in pandas:
selected_cols = df.columns[df.columns.str.startswith("cb_")]
print(f"Using pandas methods: {selected_cols}")

From list comprehension: ['cb_naics', 'cb_cusip', 'cb_at', 'cb_ni', 'cb_financial_industry']
Using pandas methods: Index(['cb_naics', 'cb_cusip', 'cb_at', 'cb_ni', 'cb_financial_industry'], dtype='object')
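
One more option, sketched here on a made-up DataFrame (`toy` is hypothetical, its column names merely mimic the ones above), is the `filter()` method of `pd.DataFrame`. It selects columns by substring or regular expression and returns the subset directly.

```python
import pandas as pd

# Hypothetical toy data mimicking the naming scheme of the course dataset
toy = pd.DataFrame(
    {
        "u_year": [2005, 2006],
        "u_iso3": ["GBR", "GBR"],
        "cb_at": [1.0, 2.0],
        "cb_ni": [0.1, 0.2],
    }
)

# Regular expression anchored at the start of the column label
cb_frame = toy.filter(regex=r"^cb_")

# Plain substring match anywhere in the column label
year_frame = toy.filter(like="year")

# The .str accessor of the Index offers endswith as well as startswith
ends_with_ni = toy.columns[toy.columns.str.endswith("_ni")].to_list()

print(list(cb_frame.columns), list(year_frame.columns), ends_with_ni)
```

Which variant to prefer is mostly a matter of taste: `filter()` is compact, while a list of names is easier to reuse and combine with other lists.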


# Select rows from a DataFrame

## Selecting the first or last rows

In [22]:
df.head(10)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,False
1,14651,2006,British American Tobacco PLC,312230,GBR,2006-12-31,110448107,34816.074,3713.506,False
2,14651,2007,British American Tobacco PLC,312230,GBR,2007-12-31,110448107,37161.97,4226.559,False
3,14651,2008,British American Tobacco PLC,312230,GBR,2008-12-31,110448107,40276.807,3591.888,False
4,14651,2009,British American Tobacco PLC,312230,GBR,2009-12-31,110448107,43026.854,4386.107,False
5,14651,2010,British American Tobacco PLC,312230,GBR,2010-12-31,110448107,42882.112,4431.357,False
6,14651,2011,British American Tobacco PLC,312230,GBR,2011-12-31,110448107,42134.79,4808.702,False
7,14651,2012,British American Tobacco PLC,312230,GBR,2012-12-31,110448107,44439.167,6246.234,False
8,14651,2013,British American Tobacco PLC,312230,GBR,2013-12-31,110448107,44552.569,6470.49,False
9,14651,2014,British American Tobacco PLC,312230,GBR,2014-12-31,110448107,40762.953,4852.547,False


In [23]:
df.tail(10)

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
814,24931,2016,DBV Technologies S.A.,325414,FRA,2016-12-31,23306J101,303.37,-120.853,False
815,24931,2017,DBV Technologies S.A.,325414,FRA,2017-12-31,23306J101,213.76,-177.557,False
816,24931,2018,DBV Technologies S.A.,325414,FRA,2018-12-31,23306J101,196.635,-190.14,False
817,62880,2015,Midatech Pharma PLC,325414,GBR,2015-12-31,59564R302,94.435,-14.892,False
818,62880,2016,Midatech Pharma PLC,325414,GBR,2016-12-31,59564R302,69.937,-24.874,False
819,62880,2017,Midatech Pharma PLC,325414,GBR,2017-12-31,59564R302,66.595,-21.733,False
820,62880,2018,Midatech Pharma PLC,325414,GBR,2018-12-31,59564R302,26.056,-19.156,False
821,64528,2016,Motif Bio plc,325414,GBR,2016-12-31,619784101,28.426,-40.324,False
822,64528,2017,Motif Bio plc,325414,GBR,2017-12-31,619784101,29.188,-44.81,False
823,64528,2018,Motif Bio plc,325414,GBR,2018-12-31,619784101,18.724,-13.985,False



## Rule-based selection

Passing a `list` or `pd.Series` of **booleans** to `[]` selects rows (not columns).

>**Note**: The number of booleans passed to `[]` must equal the number of rows!
>
>e.g. `df[[True, False, True]]` raises `ValueError: Item wrong length 3 instead of 824.`

In [24]:
# Therefore, booleans used with `[]` are typically generated by code, e.g.:
df[df["u_year"] == 2005]

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,False
15,13722,2005,Bob Evans Farms Inc.,311612,USA,2006-04-30,096761101,1209.183,54.774,False
27,10161,2005,BP PLC,324110,GBR,2005-12-31,055622104,206914.0,22341.0,False
42,10231,2005,BT Group PLC,517311,GBR,2006-03-31,05577E101,42960.71,2690.697,False
56,39131,2005,Glaxosmithkline PLC,325412,GBR,2005-12-31,37733W105,46747.922,8059.453,False
71,56071,2005,LVMH Moet Hennessy Louis Vuitton SE,3152,FRA,2005-12-31,502441306,33220.363,1705.248,False
86,36437,2005,Freddie Mac,522298,USA,2005-12-31,313400301,806222.0,2130.0,True
101,4304,2005,Allianz SE,524126,DEU,2005-12-31,018820100,1181690.68,5186.796,True
115,29114,2005,EOG Resources Inc.,2111,USA,2005-12-31,26875P101,7753.32,1259.576,False
130,9724,2005,BASF SE,325,DEU,2005-12-31,055262505,42240.532,3560.534,False


In [25]:
# Let's take this apart: first, we generate a selection mask,
mask = df["u_year"] == 2005
# then we use it:
df[mask].head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,False
15,13722,2005,Bob Evans Farms Inc.,311612,USA,2006-04-30,096761101,1209.183,54.774,False
27,10161,2005,BP PLC,324110,GBR,2005-12-31,055622104,206914.0,22341.0,False
42,10231,2005,BT Group PLC,517311,GBR,2006-03-31,05577E101,42960.71,2690.697,False
56,39131,2005,Glaxosmithkline PLC,325412,GBR,2005-12-31,37733W105,46747.922,8059.453,False


In [26]:
# You can also use more complex boolean expressions to select rows that fulfill certain conditions
df[(df["u_year"] == 2005) & (df["u_iso3"] == "DEU")].head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
101,4304,2005,Allianz SE,524126,DEU,2005-12-31,018820100,1181690.68,5186.796,True
130,9724,2005,BASF SE,325,DEU,2005-12-31,055262505,42240.532,3560.534,False
144,25503,2005,Daimler AG,336111,DEU,2005-12-31,D1668R123,238772.614,3376.154,False
169,87672,2005,Siemens AG,999977,DEU,2005-09-30,826197501,103945.989,2710.638,False
347,11568,2005,Bayer AG,325,DEU,2005-12-31,072730302,43486.192,1891.167,False
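
Besides `&` (and), masks can be combined with `|` (or) and negated with `~` (not). The sketch below uses a made-up DataFrame (`toy` is hypothetical) to keep it self-contained. Note the parentheses around each comparison: they are needed because `&`, `|` and `~` bind more tightly than `==` or `<`.

```python
import pandas as pd

# Hypothetical toy data: four firm-years with net income
toy = pd.DataFrame(
    {
        "u_year": [2005, 2006, 2007, 2008],
        "cb_ni": [10.0, -5.0, 3.0, -1.0],
    }
)

# OR: rows from 2005 or earlier, or rows with a loss
early_or_loss = toy[(toy["u_year"] <= 2005) | (toy["cb_ni"] < 0)]

# NOT: rows without a loss
no_loss = toy[~(toy["cb_ni"] < 0)]

print(early_or_loss.index.tolist())
print(no_loss.index.tolist())
```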


# Select rows and columns using `df.loc[]`

`.loc[]` can select either rows, or columns, or both at the same time.

## Selecting rows using `loc`

Just like with `[]`, we can pass a boolean mask to `loc[]` to select rows programmatically:

In [27]:
df.loc[df["u_year"] == 2005].head()

Unnamed: 0,u_company_name_id,u_year,u_company_name,cb_naics,u_iso3,u_fye,cb_cusip,cb_at,cb_ni,cb_financial_industry
0,14651,2005,British American Tobacco PLC,312230,GBR,2005-12-31,110448107,32737.984,2707.11,False
15,13722,2005,Bob Evans Farms Inc.,311612,USA,2006-04-30,096761101,1209.183,54.774,False
27,10161,2005,BP PLC,324110,GBR,2005-12-31,055622104,206914.0,22341.0,False
42,10231,2005,BT Group PLC,517311,GBR,2006-03-31,05577E101,42960.71,2690.697,False
56,39131,2005,Glaxosmithkline PLC,325412,GBR,2005-12-31,37733W105,46747.922,8059.453,False


## Select columns using `loc`

When using `loc[]`, we can select columns by their labels, but we must also specify a row selection, or use the placeholder `:` to keep all rows.

In [28]:
df.loc[:, ["u_year", "u_company_name"]]

Unnamed: 0,u_year,u_company_name
0,2005,British American Tobacco PLC
1,2006,British American Tobacco PLC
2,2007,British American Tobacco PLC
3,2008,British American Tobacco PLC
4,2009,British American Tobacco PLC
...,...,...
819,2017,Midatech Pharma PLC
820,2018,Midatech Pharma PLC
821,2016,Motif Bio plc
822,2017,Motif Bio plc


## Select rows and columns using `loc`

In [29]:
df.loc[df["u_year"] > 2007, ["u_company_name", "u_year"]]

Unnamed: 0,u_company_name,u_year
3,British American Tobacco PLC,2008
4,British American Tobacco PLC,2009
5,British American Tobacco PLC,2010
6,British American Tobacco PLC,2011
7,British American Tobacco PLC,2012
...,...,...
819,Midatech Pharma PLC,2017
820,Midatech Pharma PLC,2018
821,Motif Bio plc,2016
822,Motif Bio plc,2017
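
Because `loc[]` works with labels, it also accepts label slices for rows and columns. The following sketch on a made-up DataFrame (`toy` is hypothetical) illustrates one detail that often surprises newcomers: label slices in `loc[]` include the end label, whereas plain positional slicing with `[]` excludes the end position.

```python
import pandas as pd

# Hypothetical toy data with the default integer row labels 0..3
toy = pd.DataFrame(
    {
        "u_year": [2005, 2006, 2007, 2008],
        "cb_at": [1.0, 2.0, 3.0, 4.0],
        "cb_ni": [0.1, 0.2, 0.3, 0.4],
    }
)

# Label slices in loc[] are inclusive on both ends: rows 0, 1, 2 and two columns
by_label = toy.loc[0:2, "u_year":"cb_at"]

# Plain [] with a slice is positional and excludes the end: rows 0 and 1
by_position = toy[0:2]

print(by_label.shape)
print(by_position.shape)
```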


# Exercise

*Preparations*
* Create a new Jupyter notebook.
* Store it under `notebooks/own_exercises`.

*Tasks*
1. Load the first sheet of the Excel file "wdi_reduced.xlsx" into a pandas DataFrame (see [here](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html) for help with `pandas.read_excel()`). *Hint*: you need to save the notebook in a suitable directory before you can use relative file paths! Also, if you use a different folder from the one specified in *preparations*, you may have to adjust the relative path, e.g. the number of times you use the "../" prefix.
2. What are the data types of the columns?
3. Display the descriptive statistics of the numeric columns.
4. Select the columns *countryname* and *year* from the DataFrame.
5. Select the rows from Germany and France.
6. Bonus: Select the rows from Germany and France in the years 2008 to 2010 and all columns starting with 'country' as well as *year*.
7. Bonus: Select only numeric columns.
8. Bonus: Select the numeric columns except those starting with "NY_".


# Side-Note: using the ydata-profiling package

The company [YData](https://ydata.ai/) provides a package called [ydata-profiling](https://github.com/ydataai/ydata-profiling) (formerly known as pandas-profiling) that can be used to generate a report about a DataFrame. This report includes information about data types, missing values, distributions of values, correlations, and more.

In the past, the package has been prone to peculiar dependencies. Therefore, I recommend that you set up a separate virtual environment.

Open a terminal in the project directory, then run:

**Windows**
```powershell
uv venv --python=3.12 .venv-ydata
Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass
.\.venv-ydata\Scripts\activate
uv pip install setuptools legacy-cgi htmlmin ydata-profiling ipykernel jupyter
```
**Linux / MacOS**
```bash
uv venv --python=3.12 .venv-ydata
source .venv-ydata/bin/activate
uv pip install setuptools legacy-cgi htmlmin ydata-profiling ipykernel jupyter
```
Then, in a Jupyter notebook that uses this environment as its kernel:

In [30]:
import pandas as pd
from ydata_profiling import ProfileReport
df = pd.read_csv("../../data/raw/financial_data_intro.csv")
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile

ModuleNotFoundError: No module named 'ydata_profiling'