# Data Handling with Pandas


While the [**numpy**](http://www.numpy.org/) library provides strong numerical operations for multidimensional arrays, [**pandas**](https://pandas.pydata.org/) focuses on data analysis. It is a collection of powerful tools for importing, rearranging, analyzing, and exporting tabular data sets.

[**pandas documentation**](http://pandas.pydata.org/pandas-docs/stable/): _"pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis/manipulation tool available in any language. It is already well on its way toward this goal."_

Pandas uses the core functionality of numpy to handle its prominent data structure, the [`DataFrame`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) (a concept which originated in [**R**](https://www.r-project.org)). Pandas provides a wide range of functions for data analysis, such as importing and exporting, viewing, selection, indexing, handling missing data, statistical analyses, merging, grouping, reshaping, handling time series and much more. It also enables high-level plotting via the [**matplotlib**](https://matplotlib.org/) library.

## Import of Pandas

It is common to use the abbreviation `pd` for pandas. 

In [None]:
import numpy as np
import pandas as pd

## A look at a dataset: Wine Quality

To get used to the basic functionality of the `pandas` library, we are now going to work with an example data set. It consists of roughly 1600 red wine samples and around 5000 white wine samples ([Wine Quality Data Set](https://archive.ics.uci.edu/ml/datasets/Wine%2BQuality)). The columns are chemical and physical attributes. In addition, a quality score is given for each wine.

In [None]:
WINE_COLOR = "red"  # or white
df = pd.read_csv(f"../.assets/data/winequality/{WINE_COLOR}.csv.zip", sep=";")

In [None]:
# Data set size: ~5000 (~1600) different white (red) wines with 12 different attributes
df.shape

In [None]:
# Showing column names
df.columns

In [None]:
# It is a default integer index
df.index

In [None]:
# Show first 5 rows
df.head(5)

In [None]:
# Show last 5 rows
df.tail(5)

In [None]:
# Show a sample of 4 rows
df.sample(4)

In [None]:
# Select a column
df["pH"]

In [None]:
# Selection of three columns
df[["pH", "citric acid", "quality"]].head()

In [None]:
# Show unique values in a column
df["quality"].unique()

In [None]:
# Sort data set by column `pH`
df.sort_values(by="pH", ascending=False).head()

In [None]:
# Count values
df['quality'].value_counts()

In [None]:
# Get size (rows, columns)
df.shape

In [None]:
# Get more info on dataframe
df.info()

## Data structures: pandas.Series and pandas.DataFrame

The [**pandas.Series**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). A series consists of the `index` (the axis labels) and the respective `values`. `Series` acts very similarly to an `ndarray` and is a valid argument to most `numpy` functions. However, operations such as slicing also slice the index, so the labels stay attached to their values.

The [**pandas.DataFrame**](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html) is the central data structure of pandas. `DataFrame` is a 2-dimensional labeled data structure with columns of potentially different types. Data coming from e.g. Excel files or SQL(-like) tables can be easily converted to a pandas DataFrame.

<img src="graphics/pandas_dataframe_series.png" style="width: 300px;"/>
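
The relationship between the two structures can be tried out on a tiny example. The following is only a minimal sketch: the values and the names `s_demo` and `df_demo` are made up for illustration and are not part of the wine data set.

```python
import numpy as np
import pandas as pd

# Made-up values, only for illustration
s_demo = pd.Series([3.51, 3.20, 3.26, 3.16], index=["a", "b", "c", "d"])

# Slicing by position keeps the labels attached to the values
sliced = s_demo.iloc[1:3]
print(sliced.index.tolist())

# numpy functions accept a Series and return a Series with the same index
rooted = np.sqrt(s_demo)
print(type(rooted))

# Each column of a DataFrame is itself a Series
df_demo = pd.DataFrame({"pH": s_demo, "quality": [5, 5, 6, 6]})
print(type(df_demo["pH"]))
```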

### Import and export data to and from pandas.DataFrame

Many methods for data IO are included in `pandas`, and the standard file types are supported. A full list can be found [here](https://pandas.pydata.org/pandas-docs/stable/io.html). Some supported file types are:
* [CSV](https://en.wikipedia.org/wiki/Comma-separated_values) (`read_csv()`, `to_csv()`)
* [JSON](http://www.json.org/) (`read_json()`, `to_json()`)
* [MS Excel](https://en.wikipedia.org/wiki/Microsoft_Excel) (`read_excel()`, `to_excel()`)
* [SQL](https://en.wikipedia.org/wiki/SQL) (`read_sql()`, `to_sql()`)
* [Python Pickle Format](https://docs.python.org/3/library/pickle.html) (`read_pickle()`, `to_pickle()`)
* ...

The example shows the basic principle of importing and exporting data from/to files. For text-based file types (e.g. csv), one important parameter is the field separator (`sep`). It can conflict with the decimal separator used in the file, so check both before importing. For an Excel file, the sheet to read is selected with `sheet_name`.
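
The interplay of field separator and decimal separator can be sketched without any file on disk. In the example below, the file content in `raw` is made up, and `io.StringIO` only stands in for a real file; `sep` and `decimal` are regular parameters of `pd.read_csv`.

```python
import io

import pandas as pd

# Made-up file content: ';' separates the fields, ',' is the decimal mark
raw = "name;pH;quality\nwine_a;3,51;5\nwine_b;3,20;6\n"

# Without `decimal`, the pH column cannot be parsed as a number and stays text
df_text = pd.read_csv(io.StringIO(raw), sep=";")
print(df_text.dtypes)

# With decimal=",", the pH column is parsed as float
df_num = pd.read_csv(io.StringIO(raw), sep=";", decimal=",")
print(df_num.dtypes)
```

Comparing `dtypes` right after an import is a quick way to spot a column that was read as text by accident.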

#### Import

In [None]:
# Import a csv file with the first column as index (index_col=0) and an Excel file by sheet name
df_csv_import = pd.read_csv("data.csv", sep=",", index_col=0)
df_excel_import = pd.read_excel("data.xls", sheet_name="data")

In [None]:
df_csv_import

In [None]:
df_excel_import

#### Export

In [None]:
# Export the built pandas.DataFrame to csv and excel files
df_csv_import.to_csv("data_new.csv", sep=",")
df_csv_import.to_excel("data_new.xlsx", sheet_name="data")

## More basic handling of dataframe

Let's go back to the bigger wine dataset. There are several ways to select single rows by their index or single columns by their name. Selecting rows with `[]` or `loc[]` is the easiest way.

In [None]:
WINE_COLOR = "red"  # or white
df = pd.read_csv(f"../.assets/data/winequality/{WINE_COLOR}.csv.zip", sep=";")

In [None]:
df.head()

### Adding new column

In [None]:
# New column by value
df["color"] = WINE_COLOR

# New column by combination
df["meaningless_column"] = df["chlorides"] + df["residual sugar"]
df.head()

### Set data type to column

In [None]:
# Show data types
df.dtypes

In [None]:
df['color'] = df['color'].astype('str')

In [None]:
df.dtypes

### Delete column

In [None]:
df = df.drop(["meaningless_column", "color"], axis=1)

In [None]:
df.head()

### Selecting rows and columns

In [None]:
# Selection of a single row by position (returns a one-row DataFrame)
df.iloc[[2]]

In [None]:
# Selection of a single row by position (different layout: returns a Series)
df.iloc[2]

In [None]:
# Selection of a range of rows
df[0:3]
#df[3:]
#df[:15]
#df[-3:-1]

In [None]:
# Select by index label
df.loc[42]

There are two ways to select a column: attribute access (`df.pH`) and bracket access (`df["pH"]`). If a column name contains a space, only bracket access works.

In [None]:
df.pH  # not possible for column names with spaces!

In [None]:
df["pH"]

In [None]:
df[["pH"]]

In [None]:
df[["pH", "quality"]]

Combination for selecting rows and columns.

In [None]:
df.loc[0:500, ["pH", "quality"]]
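
One detail is easy to miss here: `loc` selects by label and includes the end of a slice, while `iloc` selects by position and excludes it. The sketch below uses a small made-up frame (`df_demo`) instead of the wine data.

```python
import pandas as pd

# Made-up values, only for illustration
df_demo = pd.DataFrame(
    {"pH": [3.51, 3.20, 3.26, 3.16, 3.51], "quality": [5, 5, 5, 6, 5]}
)

# Label-based: rows labeled 0, 1 and 2 (end label included)
by_label = df_demo.loc[0:2, ["pH"]]

# Position-based: rows at positions 0 and 1 (end position excluded)
by_position = df_demo.iloc[0:2, [0]]
print(len(by_label), len(by_position))

# After sorting, labels and positions no longer coincide
sorted_demo = df_demo.sort_values(by="pH")
first_by_position = sorted_demo.iloc[0]  # row with the lowest pH
row_with_label_0 = sorted_demo.loc[0]    # row that carried label 0 before sorting
print(first_by_position.name, row_with_label_0["pH"])
```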

## Conditions on dataframe, rows and columns

In [None]:
# Use conditions on dataframe
df > 0

In [None]:
# Use conditions on columns/series
df["pH"] > 3.3

## Boolean indexing / filtering
Returns only the rows for which the condition is true.

In [None]:
df[df["pH"] > 3.3]
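
Several conditions can be combined into one filter. The sketch below uses a made-up frame (`df_demo`); note that pandas expects the element-wise operators `&`, `|` and `~` rather than `and`, `or` and `not`, and that each condition needs its own parentheses.

```python
import pandas as pd

# Made-up values, only for illustration
df_demo = pd.DataFrame(
    {"pH": [3.51, 3.20, 3.26, 3.16, 3.40], "quality": [5, 5, 6, 7, 7]}
)

# Both conditions must hold
low_ph_and_good = df_demo[(df_demo["pH"] < 3.3) & (df_demo["quality"] >= 6)]

# At least one condition must hold
either = df_demo[(df_demo["pH"] > 3.45) | (df_demo["quality"] == 7)]

# Negation of a condition
not_five = df_demo[~(df_demo["quality"] == 5)]

# Membership in a list of allowed values
five_or_seven = df_demo[df_demo["quality"].isin([5, 7])]

print(low_ph_and_good)
```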

## Remark
There are many more operations you can think of (e.g. null-handling, joining, setting single values, groupby, ...). At some point, take your time to go through the [pandas documentation](https://pandas.pydata.org/docs/user_guide/index.html) to get an overview of what is possible. There are plenty of examples! However, do not forget that you can simply ask the community about your problem. For example, search with google: _pandas add row to dataframe_. I am pretty sure that you will end up either in the correct paragraph of the documentation or with an answer on stackoverflow.

## Constructing pandas data structures from scratch

Both `Series` and `DataFrame` objects can be constructed from different kinds of input such as dictionaries, lists and ndarrays.

First, we create a `Series` and a `DataFrame` manually to discuss the basic features. For the following examples of constructing `Series` and `DataFrames` we need the following lists and ndarrays of the same length as a basis:

In [None]:
names = ["Jon", "Tim", "Lisa", "Jan", "Mary"]
ages = np.random.randint(20, 30, 5)
children = np.random.randint(0, 3, 5)
cousins = np.random.randint(0, 3, 5)

Note: Here we use the [numpy.random](https://numpy.org/devdocs/reference/random/index.html) module to generate random numbers.

Once again: in general you start by importing data directly into a DataFrame (e.g. with the csv import discussed above) instead of creating it from scratch by building up `Series`. But you should go through it at least once. At later steps of data analysis you will see that it is good to know the basics. It will save you time if you know the variety of options and do not depend on external data.

### Creating pandas.Series

If `data` is an `ndarray`, `index` must be the same length as `data`. If no index is passed, one will be created having values `[0, ..., len(data) - 1]`.

In [None]:
s = pd.Series(data=ages)
s

In [None]:
s = pd.Series(data=ages, index=names)
s

In [None]:
type(s)

You can split the `pd.Series` into its two parts (`index` and `values`). Let's check their types to see what is [hiding underneath, as mentioned above](https://numpy.org/doc/1.18/reference/generated/numpy.ndarray.html).

In [None]:
# Get the index
print(s.index)
print(type(s.index))

In [None]:
# Get the values
print(s.values)
print(type(s.values))
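
A `Series` can also be built from a dictionary, in which case the keys become the index. The following is a minimal sketch with made-up names and ages (`ages_by_name`, `s_dict` and `s_list` are illustrative names only).

```python
import pandas as pd

# Made-up values, only for illustration
ages_by_name = {"Jon": 25, "Tim": 22, "Lisa": 28}

# Dictionary keys become the index, dictionary values become the values
s_dict = pd.Series(ages_by_name)
print(s_dict.index.tolist())

# Access by label and by position
lisa_age = s_dict["Lisa"]
first_age = s_dict.iloc[0]
print(lisa_age, first_age)

# From a plain list without an index: a default integer index is created
s_list = pd.Series([25, 22, 28])
print(s_list.index.tolist())
```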

### Creating pandas.DataFrame 

In [None]:
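# Construction from lists or arrays

# Combine all lists; mixing strings and numbers in one array stores every value as a string
data = np.array([names, ages, children]).transpose()
# Numeric columns only, so the values stay integers
part_data = np.array([ages, children]).transpose()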

df = pd.DataFrame(data=data)
df

In [None]:
type(df)
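
One side effect of combining names and numbers in a single numpy array is worth checking: a numpy array has one data type for all elements, so the numbers end up as text. The sketch below uses made-up lists (`names_demo`, `ages_demo`) and shows two possible ways around it, converting afterwards or building the frame from a dictionary.

```python
import numpy as np
import pandas as pd

# Made-up values, only for illustration
names_demo = ["Jon", "Tim", "Lisa"]
ages_demo = [25, 22, 28]

# One array, one dtype: the integers are stored as strings
mixed = np.array([names_demo, ages_demo]).transpose()
print(mixed.dtype)

df_mixed = pd.DataFrame(mixed, columns=["names", "ages"])
print(df_mixed.dtypes)

# Option 1: convert the column after construction
ages_int = df_mixed["ages"].astype(int)

# Option 2: build from a dictionary, so each column keeps its own type
df_typed = pd.DataFrame({"names": names_demo, "ages": ages_demo})
print(df_typed.dtypes)
```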

The DataFrame has a table-like structure and is in many ways similar to a relational database table. In the example above no labels were given, so both rows and columns get a default integer index. The labels of the columns and rows (`index`) can be passed during construction or set afterwards.

In [None]:
# Set index and column names after construction
df.columns = ["names", "ages", "children"]
df.set_index("names", inplace=True)
df

In [None]:
# Set index and column names during construction
df = pd.DataFrame(data=part_data, index=names, columns=["ages", "children"])
df

In [None]:
# Construct DataFrame from dictionary
data = {"ages": ages, "children": children, "cousins": cousins}

df = pd.DataFrame(data=data, index=names)
df

## Exercise: Data basics with Pandas

Open exercise - try exploring the data set with pandas operations. Here are some ideas:

- Get familiar with selecting columns and rows and try out different filtering with boolean indexing.
- You can sort by more than one column to set up a hierarchical order. Try it out!
- Check the data type of each column.
- Have a look at the quality column. Can you count how many wines exist for each quality level?
- We have not discussed it so far, but `pandas.DataFrame` comes with [**groupby**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html), a really strong tool if you want to do operations or calculations on subsets of your data. Can you print out the first or last wine for each quality level?
- Can you combine both data sets, red and white, into a single dataframe with an additional column for the wine colour, so that you know which wine it is?
- ...
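
If you need a starting point, the sketch below shows the general shape of these operations on two tiny made-up frames (`red_demo` and `white_demo` are illustrative names, not the real data). Treat it as a hint and try it on the wine data yourself.

```python
import pandas as pd

# Made-up values, only for illustration
red_demo = pd.DataFrame({"pH": [3.5, 3.2, 3.3], "quality": [5, 6, 5]})
white_demo = pd.DataFrame({"pH": [3.0, 3.1], "quality": [6, 7]})

# Mark the origin of each row, then stack both frames on top of each other
red_demo["color"] = "red"
white_demo["color"] = "white"
wines_demo = pd.concat([red_demo, white_demo], ignore_index=True)

# Number of rows per quality level
counts = wines_demo.groupby("quality").size()

# First row of each quality level
first_per_quality = wines_demo.groupby("quality").first()

# Hierarchical sort: quality descending, then pH ascending
ordered = wines_demo.sort_values(by=["quality", "pH"], ascending=[False, True])

print(counts)
print(first_per_quality)
print(ordered)
```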

In [None]:
WINE_COLOR = "white"  # red
df = pd.read_csv(f"../.assets/data/winequality/{WINE_COLOR}.csv.zip", sep=";")

In [None]:
# Your code here





## Get started with your own data set

You have your own data set on your machine (e.g. a csv file)? Feel free to upload the data to your workspace, but please keep it small and do not use sensitive or safety-relevant data!

- Open a new notebook
- Import pandas
- Upload data to your workspace
- Start with your import

You do not have any dataset? Have a look for an external one:
- [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php)
- [awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_