# Basics for Data Processing in Python 

## Module Overview

### Pandas

- creating tabular data, aka "DataFrames"
- loading data from external sources (text files, databases etc.)
- data sorting and selection
- creation of derived data
- time-series functionality
- plausibility checking and imputation

### NumPy

- fast array and matrix manipulation and operations
- linear algebra
- applying mathematical functions

### Matplotlib

- visualization of data and results
- highly customizable plots

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Basic Pandas functionality

Pandas is built around the `DataFrame`, an object that stores data in tabular form (rows and columns).

The following code loads sample data from a CSV (comma-separated values) file and displays the first five rows with the `head` method.

In [None]:
data = pd.read_csv("../../data/example.csv", decimal=".", sep=",", encoding="utf-8")
print(data.head(5))
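
The `sep` and `decimal` arguments matter as soon as a file does not follow the comma/point convention. The sketch below is only an illustration: it reads a small, made-up CSV text from memory (via `io.StringIO`, so no file is needed) in which columns are separated by semicolons and decimals are written with a comma. The text content and the name `csv_text` are assumptions for this example and not part of the sample file.

```python
import io

import pandas as pd

# Hypothetical CSV content: semicolon as separator, comma as decimal mark
csv_text = "id;name;height\n1;John Doe;5,9\n2;Jane Smith;5,7\n"

# StringIO lets read_csv treat the string like an open file
df = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")

print(df)
print(df.dtypes)  # "height" should be parsed as a floating-point column
```

If `decimal=","` were left out, the `height` column would be read as text instead of numbers, which is one of the things a first look at the loaded data can reveal.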

It is good practice to get an overview of the data first, to check that it was loaded correctly.

For this you can use the `DataFrame.info()` and `DataFrame.describe()` methods:

In [None]:
data.info(verbose=True)  # info() prints its report itself and returns None, so no print() is needed
print(data.describe())
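
A few additional quick checks can complement `info()` and `describe()`. The sketch below does not read the sample file; instead it rebuilds a small stand-in DataFrame from the rows printed elsewhere in this notebook, so that it runs on its own. The variable name `missing_per_column` is just chosen for this example.

```python
import pandas as pd

# Stand-in for the example CSV, rebuilt from the rows shown in this notebook
data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"],
    "age": [28, 22, 34, 29, 25],
    "height": [5.9, 5.7, 6.1, 5.5, 6.0],
    "score": [85.3, 92.5, 78.9, 88.2, 91.4],
})

print(data.shape)   # tuple of (number of rows, number of columns)
print(data.dtypes)  # data type of each column

# count the missing values in each column
missing_per_column = data.isna().sum()
print(missing_per_column)
```

A numeric column that shows up with dtype `object`, or an unexpected number of missing values, usually points to a problem with the separator, the decimal mark, or the encoding.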

### Selecting subsets of data

To select a single column, put its name in square brackets: `data[column_name]`. Passing a list of column names returns several columns at once, and passing a slice of integers (for example `data[0:2]`) returns the rows at those positions.

A column can also be accessed like an attribute of the object (`data.age`, for example). This only works if the column name is a valid Python identifier, so it fails for names that contain spaces or other special characters, and for names that clash with an existing DataFrame attribute or method. It can also be harder to read, because it is not obvious whether `age` is a column or a built-in attribute of the DataFrame.

Lastly, there is the `DataFrame.loc` indexer, which is covered in more detail later.

In [None]:
age_data = data["age"]
# or
age_data = data.age
# or
age_data = data.loc[:, "age"]  # loc selects by [row label, column label]; a bare colon selects everything
print(age_data)

# multiple columns
scores = data[["name", "score"]]
print(scores)

# passing a slice selects the rows at those positions (the end is excluded)
subset = data[0:2]
print(subset)
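
The limits of attribute-style access can be seen in a small sketch. The DataFrame `df` and its column names are hypothetical and chosen only to trigger the two problem cases: a name with a space, and a name that clashes with an existing DataFrame method.

```python
import pandas as pd

# Hypothetical frame: "first name" contains a space,
# "count" has the same name as the DataFrame.count method
df = pd.DataFrame({"first name": ["John", "Jane"], "count": [3, 7]})

# bracket access works for any column name
first_names = df["first name"]
counts = df["count"]
print(counts.tolist())     # [3, 7]

# attribute access returns the built-in method, not the "count" column
print(callable(df.count))  # True
```

For the column with a space there is no attribute-style spelling at all, since `df.first name` is not valid Python. Bracket access avoids both problems.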

### Filtering data

Often you need to select only the rows that match certain criteria, for example only the people with a "score" of 90 or above. To do this, put one or more conditions inside the square brackets. Each condition produces a mask (a Series of boolean values) that is `True` only for the rows that meet the criteria.

In [None]:
# get the people with scores >= 90
high_scorers = data[data["score"] >= 90]

# the expression inside the brackets creates a boolean mask, which is used to select only the rows where the mask is True
mask = data["score"] >= 90
print(mask)
print("\nscore >= 90\n", data[mask])

# filtering with multiple conditions: wrap each one in parentheses and combine them with &
height_scorers = data[(data["score"] >= 90) & (data["height"] >= 6.0)]
print("\nscore >= 90 and height >= 6\n", height_scorers)
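
Besides `&` ("and"), masks can be combined with `|` ("or") and inverted with `~` ("not"). The parentheses around each condition are required because `&` and `|` bind more tightly than comparisons such as `>=`. The sketch below uses a stand-in DataFrame rebuilt from the rows printed in this notebook; the variable names `high_or_older` and `not_high` are chosen for this example only.

```python
import pandas as pd

# Stand-in for the example data, rebuilt from the rows shown in this notebook
data = pd.DataFrame({
    "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"],
    "age": [28, 22, 34, 29, 25],
    "score": [85.3, 92.5, 78.9, 88.2, 91.4],
})

# | means "or": a score of at least 90, or an age of at least 30
high_or_older = data[(data["score"] >= 90) | (data["age"] >= 30)]
print(high_or_older)

# ~ inverts a mask: everyone whose score is NOT >= 90
not_high = data[~(data["score"] >= 90)]
print(not_high)
```

Python's own `and`, `or`, and `not` keywords do not work here, because they expect a single truth value rather than one value per row.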

In [6]:
# selecting rows and columns simultaneously with .loc and .iloc
# select data where "id" is either 1, 2, or 3, and the column is "name"
subset = data.loc[data["id"].isin([1, 2, 3]), "name"]
print(subset)

# select the rows at positions 0 to 2 and the columns at positions 1 to 3 (iloc is position-based, the end of a slice is excluded)
subset_2 = data.iloc[0:3, 1:4]
print(subset_2)

0       John Doe
1     Jane Smith
2    Bob Johnson
Name: name, dtype: object
          name  age  height
0     John Doe   28     5.9
1   Jane Smith   22     5.7
2  Bob Johnson   34     6.1
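
With the default index (0, 1, 2, ...), labels and positions coincide, which hides the difference between `.loc` and `.iloc`. The sketch below makes it visible by using the `id` column as the index. It runs on a stand-in DataFrame rebuilt from the rows printed in this notebook; `data_by_id`, `by_label`, and `by_position` are names chosen for this example.

```python
import pandas as pd

# Stand-in for the example data, rebuilt from the rows shown in this notebook
data = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"],
    "age": [28, 22, 34, 29, 25],
})

# use "id" as the index, so the row labels are 1..5 instead of 0..4
data_by_id = data.set_index("id")

# .loc works on labels, and the end of a label slice is included: ids 1, 2, 3
by_label = data_by_id.loc[1:3, "name"]
print(by_label)

# .iloc works on positions, and the end is excluded: second and third row
# column position 0 is "name", because "id" has moved into the index
by_position = data_by_id.iloc[1:3, 0]
print(by_position)
```

The same slice `1:3` therefore returns three rows with `.loc` and two rows with `.iloc`.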


### Adding new data

In [11]:
# Adding new data to an existing DataFrame
# adding a new column from a Series (values are matched to rows by index, not by order)
pet_data = pd.Series(["cat", "dog", "cat", "goldfish", "hamster"])

data["pet"] = pet_data
print(data)

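# adding data from a second DataFrame (built from a dictionary) with merge()
# without an "on" argument, merge() joins on the columns both frames share (here: "name")
# rows without a match get NaN (not a number), which also turns the "group" column into floats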
group_data = pd.DataFrame.from_dict({"name": ["John Doe", "Bob Johnson", "Alice Brown", "Charlie Davis"], "group": [1, 1, 2, 2]})
print(group_data)
merged_data = data.merge(group_data, how="left")
print(merged_data)

   id           name  age  height  score       pet
0   1       John Doe   28     5.9   85.3       cat
1   2     Jane Smith   22     5.7   92.5       dog
2   3    Bob Johnson   34     6.1   78.9       cat
3   4    Alice Brown   29     5.5   88.2  goldfish
4   5  Charlie Davis   25     6.0   91.4   hamster
            name  group
0       John Doe      1
1    Bob Johnson      1
2    Alice Brown      2
3  Charlie Davis      2
   id           name  age  height  score       pet  group
0   1       John Doe   28     5.9   85.3       cat    1.0
1   2     Jane Smith   22     5.7   92.5       dog    NaN
2   3    Bob Johnson   34     6.1   78.9       cat    1.0
3   4    Alice Brown   29     5.5   88.2  goldfish    2.0
4   5  Charlie Davis   25     6.0   91.4   hamster    2.0
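
The `how` argument decides which rows survive a merge. As a rough sketch, the example below compares a left join with an inner join and uses `indicator=True` to mark where each row came from. It uses a reduced stand-in for the data above (names only); the variable names `people`, `left`, and `inner` are assumptions for this example.

```python
import pandas as pd

# Reduced stand-in for the example data: only the "name" column
people = pd.DataFrame({
    "name": ["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"],
})
group_data = pd.DataFrame({
    "name": ["John Doe", "Bob Johnson", "Alice Brown", "Charlie Davis"],
    "group": [1, 1, 2, 2],
})

# left join: keep every row of "people"; indicator=True adds a "_merge" column
left = people.merge(group_data, on="name", how="left", indicator=True)
print(left)

# inner join: keep only the names that appear in both frames
inner = people.merge(group_data, on="name", how="inner")
print(inner)

# names without a group assignment
print(left.loc[left["_merge"] == "left_only", "name"].tolist())
```

In the inner join no value is missing, so `group` stays an integer column. In the left join the missing value forces it to become a float column, which is why the output above shows `1.0` and `2.0`.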


In [18]:
# creating new columns from existing ones

# calculating the score to age ratio
data["score_age_ratio"] = data["score"] / data["age"]
print(data[["name", "score_age_ratio"]])

# applying a function to a column
def get_first_name(full_name):
    return full_name.split(" ")[0]

data["first_name"] = data["name"].apply(get_first_name)
print(data["first_name"])

# you can also pass an anonymous (lambda) function: less code, but sometimes harder to read
data["last_name"] = data["name"].apply(lambda x: x.split(" ")[-1])
print(data["last_name"])

            name  score_age_ratio
0       John Doe         3.046429
1     Jane Smith         4.204545
2    Bob Johnson         2.320588
3    Alice Brown         3.041379
4  Charlie Davis         3.656000
0       John
1       Jane
2        Bob
3      Alice
4    Charlie
Name: first_name, dtype: object
0        Doe
1      Smith
2    Johnson
3      Brown
4      Davis
Name: last_name, dtype: object
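
For common string operations, pandas also offers the `.str` accessor, which applies the operation to every element without an explicit `apply`. The sketch below shows one possible way to get the same first and last names as above. It runs on a stand-in Series with the names from this notebook; `names`, `parts`, `first_name`, and `last_name` are names chosen for this example.

```python
import pandas as pd

# Stand-in for the "name" column shown in this notebook
names = pd.Series(["John Doe", "Jane Smith", "Bob Johnson", "Alice Brown", "Charlie Davis"])

# split every name at the space; each element becomes a list of parts
parts = names.str.split(" ")

# .str[...] indexes into each list: first and last part
first_name = parts.str[0]
last_name = parts.str[-1]

print(first_name)
print(last_name)
```

Both approaches give the same result here. `apply` remains the more general tool when the logic does not map onto an existing `.str` method.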
