# Basics for Data Processing in Python 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Module Overview

### [Pandas](#basic-Pandas-functionality)

- creating tabular data, aka "DataFrames"
- loading data from external sources (text files, databases etc.)
- data sorting and selection
- creation of derived data
- time-series functionality
- plausibility checking and imputation

### [NumPy](#basic-numpy-functionality)

- fast array and matrix manipulation and operations
- linear algebra
- applying mathematical functions

### [Matplotlib](#basic-matplotlib-functionality)

- visualization of data and results
- highly customizable plots

## basic Pandas functionality

Pandas works with objects called `DataFrame`. They store data in tabular form (rows and columns).

This code loads sample data from a CSV (comma-separated values) file and displays the first five rows with the `head` method:

In [None]:
data = pd.read_csv("../../data/example.csv", decimal=".", sep=",", encoding="utf-8")
print(data.head(5))
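
The `sep` and `decimal` arguments matter as soon as a file does not follow the comma/point convention. The following is a minimal sketch, assuming a semicolon-separated file that uses a decimal comma. The file contents are made up and are read from an in-memory string instead of a real path.

```python
import io

import pandas as pd

# Made-up contents standing in for a file on disk:
# semicolon-separated, with a comma as the decimal mark.
raw_text = "run;measurement\n1;0,435\n2;1,2784\n3;3,453\n"

# read_csv accepts file-like objects, so StringIO can replace a file path
german_style = pd.read_csv(io.StringIO(raw_text), sep=";", decimal=",")

print(german_style.head(2))
print(german_style.dtypes)  # "measurement" should be parsed as a float column
```

If `decimal` did not match the file, the measurement column would be read as text rather than as numbers.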

You can also create DataFrames from other data structures, for example from a dictionary whose keys are the column names and whose values are lists of column values.

In [None]:
lab_data = {
    "run": [1, 2, 3, 4, 5],
    "measurement": [0.435, 1.2784, 3.453, 0.988, 5.3482]
}

lab_data_df = pd.DataFrame.from_dict(lab_data)
print(lab_data_df)
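
As a small sketch of two related options: the plain `pd.DataFrame` constructor accepts the same column-oriented dictionary, and a list of dictionaries is read row by row. The shortened `lab_data` and the `records` list are illustrative only.

```python
import pandas as pd

# shortened version of the dictionary used above
lab_data = {
    "run": [1, 2, 3],
    "measurement": [0.435, 1.2784, 3.453],
}

# the constructor interprets the keys as column names
from_columns = pd.DataFrame(lab_data)

# a list of dictionaries is interpreted as one dictionary per row
records = [
    {"run": 1, "measurement": 0.435},
    {"run": 2, "measurement": 1.2784},
    {"run": 3, "measurement": 3.453},
]
from_rows = pd.DataFrame(records)

# both ways should lead to the same table
print(from_columns.equals(from_rows))
```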

It is good practice to get an overview of the data first, to check that it was loaded correctly.

For this you can use the `DataFrame.info()` and `DataFrame.describe()` methods:

In [None]:
# info() prints its summary directly and returns None, so it needs no print()
data.info(verbose=True)
print(data.describe())
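
Besides `info()` and `describe()`, a few attributes give a quick check of shape, column types and missing values. This is a minimal sketch on a hypothetical stand-in table; the values, including the missing age, are made up.

```python
import numpy as np
import pandas as pd

# hypothetical stand-in for a freshly loaded table
sample = pd.DataFrame({
    "name": ["Alice Smith", "Charlie Williams", "David Brown"],
    "age": [25, 35, np.nan],
    "score": [85.5, 78.9, 92.3],
})

print(sample.shape)   # (number of rows, number of columns)
print(sample.dtypes)  # data type of each column

# isna() marks missing cells, sum() counts them per column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```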

### Selecting subsets of data

You can select a single column of a DataFrame by putting its name in square brackets: `data[column_name]`. You can also pass a list of column names to select several columns. If you pass an integer slice such as `0:2`, you get the corresponding rows instead.

A column can also be accessed like an attribute of the DataFrame (`data.age`, for example). This does not work if the column name contains spaces or other characters that are not allowed in Python identifiers, or if it collides with an existing DataFrame attribute or method. It can also be harder to read, since it is not obvious that `age` is a column and not a built-in property of the DataFrame.

Lastly, there is the `DataFrame.loc` property, which we will talk more about later.

In [None]:
age_data = data["age"]
# or
age_data = data.age
# or with loc, which selects data by [index (= row), column];
# a colon : selects everything along that axis
age_data = data.loc[:, "age"]
print(age_data)

# multiple columns
scores = data[["name", "score"]]
print(scores)

# passing a slice selects the matching rows (here: positions 0 and 1)
subset = data[0:2]
print(subset)
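
The kind of key you pass decides what you get back. The sketch below uses a small made-up table to show that a single name returns a `Series`, a list of names returns a `DataFrame`, and a slice returns rows.

```python
import pandas as pd

# made-up table for illustration
people = pd.DataFrame({
    "name": ["Alice Smith", "Charlie Williams", "David Brown"],
    "age": [25, 35, 40],
})

single = people["age"]      # one column name -> Series
as_frame = people[["age"]]  # list with one name -> DataFrame with one column
first_two = people[0:2]     # slice -> rows at positions 0 and 1

print(type(single).__name__, type(as_frame).__name__)
print(first_two)
```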

### Filtering data

Often you need to select data that matches certain criteria. For example, in our data we want only people with a "score" of 90 and above. For that you have to specify one or more conditions inside the square brackets. This way, a mask is created (an array of boolean values) that is only true for the rows that meet the criteria.

In [None]:
# get the people with scores >= 90
high_scorers = data[data["score"] >= 90]

# the expression inside the brackets creates a boolean mask, which is used
# to select only the rows where the mask is "True"
mask = data["score"] >= 90
print(mask)
print("\nscore >= 90\n", data[mask])

# filtering with multiple conditions
height_scorers = data[(data["score"] >= 90) & (data["height"] >= 6.0)]
print("\nscore >= 90 and height >= 6\n", height_scorers)
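
Masks can also be combined with `|` (or) and inverted with `~` (not). Each comparison needs its own parentheses when written inline, because `&` and `|` bind more tightly than `>=` in Python. A minimal sketch with made-up values:

```python
import pandas as pd

# made-up values for illustration
people = pd.DataFrame({
    "name": ["Alice Smith", "Charlie Williams", "David Brown", "Fay Garcia"],
    "score": [85.5, 78.9, 92.3, 95.0],
    "height": [5.5, 6.1, 5.9, 6.2],
})

# storing masks in variables keeps longer conditions readable
high_score = people["score"] >= 90
tall = people["height"] >= 6.0

both = people[high_score & tall]    # score >= 90 AND height >= 6
either = people[high_score | tall]  # score >= 90 OR height >= 6
not_tall = people[~tall]            # height below 6

print(both["name"].tolist())
print(either["name"].tolist())
print(not_tall["name"].tolist())
```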

In [None]:
# selecting rows and columns simultaneously with .loc and .iloc
# select data where "id" is either 1, 2, or 3, and the column is "name"
subset = data.loc[data["id"].isin([1, 2, 3]), "name"]
print(subset)

# select rows at positions 0 to 2 and columns at positions 1 to 3
# (with iloc, the end of a slice is excluded)
subset_2 = data.iloc[0:3, 1:4]
print(subset_2)
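
One difference between `.loc` and `.iloc` is easy to miss: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end. The sketch below makes this visible with a made-up table whose index labels (10, 20, 30, 40) differ from the row positions.

```python
import pandas as pd

# made-up table with index labels that are not 0, 1, 2, ...
people = pd.DataFrame(
    {
        "name": ["Alice Smith", "Charlie Williams", "David Brown", "Fay Garcia"],
        "score": [85.5, 78.9, 92.3, 95.0],
    },
    index=[10, 20, 30, 40],
)

by_label = people.loc[10:30, "name"]  # labels 10, 20 and 30 (end included)
by_position = people.iloc[0:2, 0]     # positions 0 and 1 (end excluded)

print(by_label)
print(by_position)
```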

### Adding new data

In [None]:
# Adding new data to an existing DataFrame
# adding a new column from a Series (values are matched to rows by index label)
pet_data = pd.Series(["cat", "dog", "cat", "goldfish", "hamster", "parrot", 
                      "rabbit", "turtle", "guinea pig", "ferret"])


data["pet"] = pet_data
print(data)

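# adding data from a second DataFrame (built from a dictionary) with merge()
# how="left" keeps all rows of data; rows without a match in group_data
# get NaN (not a number) in the "group" column
group_data = pd.DataFrame.from_dict({
    "name": ["Alice Smith", "Charlie Williams", "David Brown", "Fay Garcia"],
    "group": [1, 1, 2, 2]
})
print(group_data)
# on="name" states the join column explicitly instead of relying on the
# default (all columns the two DataFrames have in common)
merged_data = data.merge(group_data, on="name", how="left")
print(merged_data)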
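
To see which rows found a partner in a left merge, `merge()` accepts `indicator=True`, which adds a `_merge` column. The following is a sketch with made-up tables; "Bob Johnson" is a hypothetical name that has no group assigned.

```python
import pandas as pd

# made-up tables for illustration
people = pd.DataFrame({
    "name": ["Alice Smith", "Bob Johnson", "Charlie Williams"],
})
groups = pd.DataFrame({
    "name": ["Alice Smith", "Charlie Williams"],
    "group": [1, 1],
})

# indicator=True records where each row came from ("both" or "left_only")
merged = people.merge(groups, on="name", how="left", indicator=True)
print(merged)

# rows without a match have NaN in the "group" column
unmatched = merged[merged["group"].isna()]
print(unmatched["name"].tolist())
```

Note that the `group` column becomes a float column here, because NaN cannot be stored in a plain integer column.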

In [None]:
# creating new columns from existing ones

# calculating the score to age ratio
data["score_age_ratio"] = data["score"] / data["age"]
print(data[["name", "score_age_ratio"]])

# applying a function to a column
def get_first_name(full_name):
    return full_name.split(" ")[0]

data["first_name"] = data["name"].apply(get_first_name)
print(data["first_name"])

# you can also pass an anonymous (lambda) function: less code, but sometimes harder to read
data["last_name"] = data["name"].apply(lambda x: x.split(" ")[-1])
print(data["last_name"])
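
For common string operations, the `.str` accessor of a Series is an alternative to `apply`. A minimal sketch, assuming names in the form "first last" as in the examples above:

```python
import pandas as pd

names = pd.Series(["Alice Smith", "Charlie Williams", "David Brown"])

# .str.split() splits every entry into a list of parts,
# .str[i] then picks one element from each list
parts = names.str.split(" ")
first_names = parts.str[0]
last_names = parts.str[-1]

print(first_names.tolist())
print(last_names.tolist())
```

Unlike a plain Python function passed to `apply`, the `.str` methods pass missing values through as NaN instead of raising an error.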

### Exporting data

You can write your data back to a CSV file in much the same way you read it. You can also choose a different format if it fits your use case better (e.g. JSON or NumPy `.npy`).

In [10]:
data.to_csv("../../data/edited_example.csv", index=False)

data.to_json("../../data/edited_example.json", indent=2)
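
The `index=False` argument keeps the row index out of the file. The sketch below shows the difference on a small made-up table. Calling `to_csv()` without a path returns the CSV text as a string, so nothing is written to disk.

```python
import io

import pandas as pd

# small made-up table
table = pd.DataFrame({"run": [1, 2], "measurement": [0.435, 1.2784]})

with_index = table.to_csv()              # first column holds the row index
without_index = table.to_csv(index=False)

print(with_index)
print(without_index)

# reading the text back gives a table with the same columns
restored = pd.read_csv(io.StringIO(without_index))
print(restored)
```

If the index is written and the file is read again with the default settings, it comes back as an extra unnamed column.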

## basic NumPy functionality

NumPy works with "arrays", which are data structures with N dimensions. They are mostly used to represent vectors and matrices.
The most basic example is a one-dimensional array, which behaves a lot like Python's built-in `list`. In fact, you can create a 1D array simply by passing a list:

In [2]:
array_1 = np.array([1, 2, 3, 4, 5])
print(array_1)

[1 2 3 4 5]
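
The similarity to lists has limits. Arithmetic on an array is applied element by element, while the same operator on a list repeats or concatenates it. A short sketch:

```python
import numpy as np

values_list = [1, 2, 3]
values_array = np.array(values_list)

print(values_list * 2)   # list repetition: [1, 2, 3, 1, 2, 3]
print(values_array * 2)  # elementwise multiplication: [2 4 6]

# adding two arrays adds the elements pairwise
summed = values_array + values_array
print(summed)
```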


If you want more than one dimension, you can create an array by passing nested lists:

In [4]:
array_2 = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
print(array_2)
print("Dimensions: ", array_2.shape)

[[1 2 3]
 [4 5 6]
 [7 8 9]]
Dimensions:  (3, 3)
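
Elements of a multi-dimensional array are addressed with one index per dimension, in the order row, column. A minimal sketch using the same 3x3 array:

```python
import numpy as np

array_2 = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

print(array_2.ndim)   # number of dimensions: 2
print(array_2[1, 2])  # row 1, column 2 (counting from 0): 6
print(array_2[0])     # first row: [1 2 3]
print(array_2[:, 0])  # first column: [1 4 7]
```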


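NumPy expects the dimensions to be consistent. If you pass it nested lists of different lengths, it raises a `ValueError` (since NumPy 1.24; older versions only issued a deprecation warning and built an array of Python objects instead):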

In [9]:
array_3 = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9, 10]
])