# Basics for Data Processing in Python 

If you run this notebook in Google Colab, set `colab` to `True`; otherwise set it to `False`.

In [6]:
colab = True

In [3]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

if colab:
    # define download url
    base = 'https://data.goettingen-research-online.de/api/access/datafile/:persistentId?persistentId=doi:10.25625/9AIY3V'
    folder = 'BXERKF'
    # join URL parts with "/" (os.path.join is meant for file system paths)
    download_url = f"{base}/{folder}"

    # define save paths
    save_name_zip = '1_introduction.zip'
    raw_data_folder = 'data/raw_data'
    save_data_folder = 'data/output_data'

    # make data directories
    !mkdir -p $raw_data_folder
    !mkdir -p $save_data_folder

    # download and unzip data
    !wget -O $save_name_zip $download_url
    !unzip $save_name_zip -d $raw_data_folder
    !rm -rf $save_name_zip
    
    home_dir = '/content'
    raw_data_dir = os.path.join(home_dir, 'data/raw_data')
    output_data_dir = os.path.join(home_dir, 'data/output_data')
else:
    home_dir = os.path.expanduser('~')
    raw_data_dir = os.path.join(home_dir, 'repos/DaNuMa2024/data/raw_data')
    output_data_dir = os.path.join(home_dir, 'repos/DaNuMa2024/data/output_data')

## Module Overview

### [Pandas](#basic-Pandas-functionality)

- creating tabular data, aka "DataFrames"
- loading data from external sources (text files, databases etc.)
- data sorting and selection
- creation of derived data
- time-series functionality
- plausibility checking and imputation

### [Numpy](#basic-numpy-functionality)

- fast array and matrix manipulation and operations
- linear algebra
- applying mathematical functions

### [Matplotlib](#basic-matplotlib-functionality)

- visualization of data and results
- highly customizable plots

### [Exercises](#exercises)

## Basic Pandas functionality

Pandas works with objects called `DataFrame`. They store data in tabular form (rows and columns).

This code loads sample data from a CSV (comma-separated values) file and displays the first five rows with the `head` method.

In [None]:
filename = os.path.join(raw_data_dir, "1_introduction/example.csv")

data = pd.read_csv(filename, decimal=".", sep=",", encoding="utf-8")
print(data.head(5))
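
The `sep` and `decimal` arguments matter when a file does not use commas as separators and periods as decimal marks. Here is a minimal sketch, assuming a made-up semicolon-separated text with decimal commas (the contents and the names `raw_text` and `euro_df` are only for illustration). It reads the text from memory instead of from a file:

```python
import io

import pandas as pd

# hypothetical file contents: semicolon-separated, decimal comma
raw_text = "name;height\nAnna;1,72\nBen;1,85\n"

# StringIO lets read_csv treat a string like an open file
euro_df = pd.read_csv(io.StringIO(raw_text), sep=";", decimal=",")
print(euro_df)
print(euro_df.dtypes)  # "height" is parsed as a float column
```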

You can also create DataFrames from other data formats, for example from a dictionary, where the keys are the column names and the values are lists of the column values.

In [None]:
lab_data = {
    "run": [1, 2, 3, 4, 5],
    "measurement": [0.435, 1.2784, 3.453, 0.988, 5.3482]
}

lab_data_df = pd.DataFrame.from_dict(lab_data)
print(lab_data_df)

It is good practice to get an overview of the data first, to see if they are loaded correctly.

For this you can use the `DataFrame.info()` and `DataFrame.describe()` methods:

In [None]:
data.info(verbose=True)  # info() prints the summary itself and returns None, so no print() is needed
print(data.describe())

### Selecting subsets of data

You can select a specific column of your dataframe with the name of the column header in square brackets: `data[column_name]`. You can also pass a list of column names. If you pass a slice of integers (for example `data[0:2]`), you get the corresponding rows.

You can also access a column like an attribute of the DataFrame (`data.age` for example). This does not work if the column name contains spaces or other characters that are not allowed in Python variable names, or if it clashes with an existing DataFrame attribute or method. It can also be harder to read, since it is not immediately clear that `age` is a column and not some default property of a DataFrame.

Lastly, there is the `DataFrame.loc` property, which we will talk more about later.

In [None]:
age_data = data["age"]
# or 
age_data = data.age
# or
# loc selects data by [index (=row), column]; a colon : selects everything
age_data = data.loc[:, "age"]
print(age_data)

# multiple columns
scores = data[["name", "score"]]
print(scores)

# passing a range accesses matching rows
subset = data[0:2]
print(subset)

### Filtering data

Often you need to select data that matches certain criteria. For example, in our data we want only people with a "score" of 90 and above. For that you specify one or more conditions inside the square brackets. This creates a mask (a Series of boolean values) that is `True` only for the rows that meet the criteria.

In [None]:
# get the people with scores >= 90
high_scorers = data[data["score"] >= 90]

# the expression inside the brackets creates a boolean mask, which is used
# to select only the rows where the mask is "True"
mask = data["score"] >= 90
print(mask)
print("\nscore >= 90\n", data[mask])

# filtering with multiple conditions
height_scorers = data[(data["score"] >= 90) & (data["height"] >= 6.0)]
print("\nscore >= 90 and height >= 6\n", height_scorers)
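
Conditions can also be combined with `|` ("or") and negated with `~` ("not"). Each condition needs its own parentheses, because `&` and `|` bind more tightly than comparisons such as `>=`. The sketch below uses a small made-up table (`people` is a hypothetical stand-in, not the example file):

```python
import pandas as pd

# hypothetical stand-in for the example data
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cleo", "Dan"],
    "score": [95, 72, 88, 91],
    "height": [5.4, 6.1, 5.9, 6.2],
})

# | means "or": rows where at least one condition is true
either = people[(people["score"] >= 90) | (people["height"] >= 6.0)]
print(either)

# ~ negates a mask: rows where the score is NOT >= 90
not_high = people[~(people["score"] >= 90)]
print(not_high)
```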

In [None]:
# selecting rows and columns simultaneously with .loc and .iloc
# select data where "id" is either 1, 2, or 3, and the column is "name"
subset = data.loc[data["id"].isin([1, 2, 3]), "name"]
print(subset)

# select rows at positions 0 to 2 and columns at positions 1 to 3 (the end of an iloc slice is excluded)
subset_2 = data.iloc[0:3, 1:4]
print(subset_2)

### Adding new data

In [None]:
# Adding new data to an existing DataFrame
# adding a new column from a series
pet_data = pd.Series(["cat", "dog", "cat", "goldfish", "hamster", "parrot", 
                      "rabbit", "turtle", "guinea pig", "ferret"])


data["pet"] = pet_data
print(data)

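# adding data from a second DataFrame (built from a dictionary) with merge()
# without the `on` argument, merge joins on the columns both DataFrames share (here "name")
# missing values get filled with NaN (not a number)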
group_data = pd.DataFrame.from_dict({"name": ["Alice Smith", "Charlie Williams", "David Brown", "Fay Garcia"], "group": [1, 1, 2, 2]})
print(group_data)
merged_data = data.merge(group_data, how="left")
print(merged_data)
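
The `how` argument decides which rows survive the merge. As a rough sketch with two made-up tables (`left` and `right` are hypothetical names), a left join keeps every row of the first table, while an inner join keeps only rows that have a match in both:

```python
import pandas as pd

# hypothetical tables that share a "name" column
left = pd.DataFrame({"name": ["Ann", "Bob", "Cleo"], "score": [95, 72, 88]})
right = pd.DataFrame({"name": ["Ann", "Cleo", "Eve"], "group": [1, 2, 2]})

# left join: keep every row of `left`; Bob has no group, so he gets NaN
left_join = left.merge(right, on="name", how="left")
print(left_join)

# inner join: keep only names that appear in both tables
inner_join = left.merge(right, on="name", how="inner")
print(inner_join)
```

Note that the `group` column of the left join becomes a float column, because NaN cannot be stored in an integer column by default.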

In [None]:
# creating new columns from existing ones

# calculating the score to age ratio
data["score_age_ratio"] = data["score"] / data["age"]
print(data[["name", "score_age_ratio"]])

# applying a function to a column
def get_first_name(full_name):
    return full_name.split(" ")[0]

data["first_name"] = data["name"].apply(get_first_name)
print(data["first_name"])

# you can also pass an anonymous function. less code, but sometimes harder to read
data["last_name"] = data["name"].apply(lambda x: x.split(" ")[-1])
print(data["last_name"])
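
For simple string operations, the `.str` accessor of a Series is a possible alternative to `apply`. The following sketch uses a few names in a standalone Series (`names` is only an illustration, not the `data` DataFrame from above):

```python
import pandas as pd

names = pd.Series(["Alice Smith", "Charlie Williams", "David Brown"])

# .str applies a string method to every element of the Series
name_parts = names.str.split(" ")

# .str[...] then indexes into each resulting list
first_names = name_parts.str[0]
last_names = name_parts.str[-1]

print(first_names.tolist())  # ['Alice', 'Charlie', 'David']
print(last_names.tolist())   # ['Smith', 'Williams', 'Brown']
```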

### Exporting data

You can write your data back to a CSV file, similar to how you read it. You can also choose a different format if it fits your use case better (e.g. JSON, or a Numpy `.npy` file via `DataFrame.to_numpy()` and `np.save`).

In [12]:
data.to_csv(os.path.join(output_data_dir, "edited_example.csv"), index=False)

data.to_json(os.path.join(output_data_dir, "edited_example.json"), indent=2)
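
The `index=False` argument above keeps the row index out of the file. To illustrate why that is usually what you want, this sketch writes a small made-up table to a temporary directory both ways and reads it back (all names and file names here are hypothetical):

```python
import os
import tempfile

import pandas as pd

# small hypothetical table, only for this illustration
small_df = pd.DataFrame({"run": [1, 2, 3], "label": ["a", "b", "c"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    with_index_path = os.path.join(tmp_dir, "with_index.csv")
    no_index_path = os.path.join(tmp_dir, "no_index.csv")

    small_df.to_csv(with_index_path)             # default: the row index is written too
    small_df.to_csv(no_index_path, index=False)  # only the actual columns are written

    columns_with_index = pd.read_csv(with_index_path).columns.tolist()
    columns_no_index = pd.read_csv(no_index_path).columns.tolist()

print(columns_with_index)  # ['Unnamed: 0', 'run', 'label']
print(columns_no_index)    # ['run', 'label']
```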

## Basic Numpy functionality

Numpy works with "arrays", which are data structures with N dimensions. They are mostly used to represent vectors and matrices.
The most basic example is a one-dimensional array, which behaves a lot like the `list` from Python's standard library. In fact, you can simply create a 1D array by passing a list:

In [None]:
array_1 = np.array([1, 2, 3, 4, 5])
print(array_1)

If you want more than one dimension, you can create an array by passing nested lists:

In [None]:
array_2 = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
print(array_2)
print("Shape: ", array_2.shape)
print("Dimensions: ", array_2.ndim)
print("Size: ", array_2.size)

Numpy expects the dimensions to be consistent, so if you pass it nested lists of different lengths, it raises an error (a `ValueError` in recent versions):

In [None]:
try:
    array_3 = np.array([
        [1, 2, 3],
        [4, 5, 6],
        [7, 8, 9, 10]
    ])
except ValueError as e:
    print(e)

Numpy provides functions to create arrays of zeros or ones, which are useful for initializing weights in machine learning:

In [None]:
zeros_array = np.zeros((2, 3))
print(zeros_array)

ones_array = np.ones((2, 3))
print(ones_array)

If you need arrays with evenly spaced values you can use `arange`, which is similar to `list(range())`:

In [None]:
even_spaced_array = np.arange(0, 12, 2)
print(even_spaced_array)

The `linspace` function behaves similarly, but instead of the step size you give it the number of values, and it calculates the step size for you. Unlike `arange`, it includes the end value by default.

In [None]:
even_spaced_array2 = np.linspace(0, 100, num=12)
print(even_spaced_array2)
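
To make the difference between the two functions concrete, here is a small side-by-side sketch over the same interval (the variable names are only for illustration):

```python
import numpy as np

# arange: fixed step size, the stop value is excluded
by_step = np.arange(0, 1, 0.25)
print(by_step)   # [0.   0.25 0.5  0.75]

# linspace: fixed number of values, the stop value is included by default
by_count = np.linspace(0, 1, num=5)
print(by_count)  # [0.   0.25 0.5  0.75 1.  ]
```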

Reshaping an array into a different shape can be useful. The new shape must contain the same number of elements as the old one. If you want to reshape into one dimension, you can use `flatten()`:

In [None]:
reshaped_array = even_spaced_array.reshape(3, 2)
print(reshaped_array)
flattened_array = reshaped_array.flatten()
print(flattened_array)
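
A few details of reshaping are easy to trip over. The sketch below (with hypothetical variable names) shows that `-1` lets Numpy work out one dimension, that the number of elements must match, and that `flatten()` returns a copy:

```python
import numpy as np

values = np.arange(12)

# -1 lets numpy work out the missing dimension: 12 values / 3 rows = 4 columns
three_rows = values.reshape(3, -1)
print(three_rows.shape)  # (3, 4)

# the total number of elements must stay the same, so 5 x 2 fails
try:
    values.reshape(5, 2)
except ValueError as error:
    print(error)

# flatten() returns a copy, so changing it leaves the original untouched
flat_copy = three_rows.flatten()
flat_copy[0] = 100
print(three_rows[0, 0])  # 0
```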

Numpy supports element-wise operations. For example, you can add, subtract, or multiply arrays of the same shape with `+`, `-`, and `*`.
Matrix multiplication can be done with `np.dot` or the `@` operator.

In [None]:
matrix_a = np.array([[1, 2], [3, 4]])
matrix_b = np.array([[5, 6], [7, 8]])
result_matrix = np.dot(matrix_a, matrix_b)
print(result_matrix)

result_matrix = matrix_a @ matrix_b
print(result_matrix)
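
Note that `*` is element-wise and therefore gives a different result than `@`. A short sketch with two small matrices (the names `left_matrix` and `right_matrix` are only for illustration):

```python
import numpy as np

left_matrix = np.array([[1, 2], [3, 4]])
right_matrix = np.array([[5, 6], [7, 8]])

# element-wise: each entry is combined with the entry at the same position
elementwise_sum = left_matrix + right_matrix
elementwise_product = left_matrix * right_matrix
print(elementwise_sum)      # [[ 6  8] [10 12]]
print(elementwise_product)  # [[ 5 12] [21 32]]

# matrix multiplication: rows of the left matrix times columns of the right one
matrix_product = left_matrix @ right_matrix
print(matrix_product)       # [[19 22] [43 50]]
```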

Numpy also provides simple statistical operations such as mean, sum, min, and max.

In [None]:
mean_value = np.mean(array_1)
print("Mean: ", mean_value)

total_sum = np.sum(array_1)
print("Sum: ", total_sum)

min_value = np.min(array_1)
print("Min: ", min_value)

max_value = np.max(array_1)
print("Max: ", max_value)

When you ask for the statistics of a multidimensional array, the array is treated as if it were flattened, unless you pass the `axis` argument.

In [None]:
mean_value2 = np.mean(array_2)
print("Mean of whole array: ", mean_value2)
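
If you want one value per column or per row instead, the statistics functions accept an `axis` argument. A minimal sketch with a 3x3 array (`grid` is a hypothetical name):

```python
import numpy as np

grid = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])

# axis=0 collapses the rows: one mean per column
column_means = np.mean(grid, axis=0)
print(column_means)  # [4. 5. 6.]

# axis=1 collapses the columns: one mean per row
row_means = np.mean(grid, axis=1)
print(row_means)     # [2. 5. 8.]
```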

### Array indexing
Numpy arrays support indexing, which allows you to access individual elements or slices of the array.

For a one-dimensional array, indexing works just like with Python lists:

In [None]:
array_1d = np.array([10, 20, 30, 40, 50])
print(array_1d[0])  # Access the first element
print(array_1d[-1])  # Access the last element

For two-dimensional arrays (matrices), you can access elements using row and column indices.

In [None]:
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(array_2d[0, 0])  # Access the element in the first row, first column
print(array_2d[1, 2])  # Access the element in the second row, third column

You can also use slicing to access subarrays. For a 2D array, slicing works by selecting ranges of rows and columns. 
Negative indexing can also be used to access elements from the end.

In [None]:
print(array_1d[1:4])
print(array_2d[0:2, 1:3])
print(array_2d[-1, -1])

You can select entire rows or columns:

In [None]:
print(array_2d[1, :])  # Access the entire second row
print(array_2d[:, 2])  # Access the entire third column

Finally, you can modify an array by assigning values to an indexed position or slice.

In [None]:
array_2d[0, 0] = 99
print(array_2d)

array_2d[1, :] = [10, 20, 30]
print(array_2d)
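
One thing to keep in mind when assigning to slices: a slice of a Numpy array is a view on the same data, not a copy. The sketch below (hypothetical names) shows that writing to a slice changes the original array, and that `.copy()` avoids this:

```python
import numpy as np

original = np.array([10, 20, 30, 40, 50])

# a slice is a view: it shares its data with the original array
view = original[1:4]
view[0] = -1
print(original)  # [10 -1 30 40 50]

# .copy() creates independent data
independent = original[1:4].copy()
independent[0] = 999
print(original)  # [10 -1 30 40 50]
```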

### Sorting

Sorting a 1D array works similarly to sorting a list. The `np.sort()` function returns a sorted copy, while `array.sort()` sorts in place, meaning it changes the original array.

In [None]:
unsorted_array = np.array([3, 5, 2, 4, 1, 6])
print(np.sort(unsorted_array))

print(unsorted_array)
unsorted_array.sort()  # sort in place
print(unsorted_array)

For arrays with multiple dimensions you can use the `axis` argument to choose the direction to sort along: `axis=0` sorts each column, `axis=1` sorts each row.

In [None]:
unsorted_2d = np.array([
    [5, 3, 4, 9], 
    [4, 1, 7, 2], 
    [1, 3, 2, 3]
    ])
print(np.sort(unsorted_2d, axis=0))
print(np.sort(unsorted_2d, axis=1))

`np.argsort` returns the indices that would sort the array, instead of the sorted array itself. This is useful, for example, when you want to reorder another array in the same way.

In [None]:
sorted_index = np.argsort(unsorted_2d, axis=1)
print(sorted_index)
sorted_2d = np.take_along_axis(unsorted_2d, sorted_index, axis=1)
print(sorted_2d)
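
A typical use of `np.argsort` is ordering one array by the values of another. Here is a small sketch with made-up names and scores:

```python
import numpy as np

# hypothetical data: names[i] belongs to scores[i]
names = np.array(["Ann", "Bob", "Cleo"])
scores = np.array([88, 95, 72])

# indices that would sort the scores in ascending order
order = np.argsort(scores)
print(order)               # [2 0 1]

# use the same indices to reorder the names
print(names[order])        # ['Cleo' 'Ann' 'Bob']

# reverse the index array for descending order
print(names[order[::-1]])  # ['Bob' 'Ann' 'Cleo']
```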

### Data types

Arrays have fixed data types. By default, numpy infers them from the data you pass in. You can also explicitly specify the data type. 

In [None]:
int_array = np.array([1, 2, 3, 4])
print(int_array.dtype)

float_array = np.array([1.0, 1.5, 2.0])
print(float_array.dtype)

# Specify the data type. You might lose some information if you are not careful.
int_array_2 = np.array([1.0, 1.5, 2.0], dtype=np.int64)
print(int_array_2)
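
An existing array can be converted with `astype`. Converting floats to integers cuts off the decimals instead of rounding, so if you want rounding you have to do it first. A short sketch (variable names are only for illustration):

```python
import numpy as np

measurements = np.array([1.0, 1.5, 2.7])

# astype cuts off the decimal part
truncated = measurements.astype(np.int64)
print(truncated)  # [1 1 2]

# round first if you want the nearest integer
# (np.round rounds values exactly halfway between two integers to the even one)
rounded = np.round(measurements).astype(np.int64)
print(rounded)    # [1 2 3]
```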

You can use the `object` datatype to store any Python object in a numpy array, if you really need to. But you need to be careful since you lose some functionality.

In [None]:
mixed_array = np.array(
    [[1, 2, 3, 4],
     [1.1, 2.2, 3.3, 4.4],
     "John Smith",
     {"hello": "world"},
     np.array([4, 5, 6])],
     dtype=object
)
print(mixed_array)

### Exporting and importing data

Sometimes you want to save your numpy arrays for later, or pass them to another program. `np.save` writes an array to a binary `.npy` file, and `np.load` reads it back.

In [None]:
filename = os.path.join(output_data_dir, "numpy_data.npy")
np.save(filename, sorted_2d)

# read the data 
np_data = np.load(filename)
print(np_data)

## Basic Matplotlib functionality

`matplotlib.pyplot` is a common module used for plotting. You often see it imported as `plt`, which is just a convention to save some space in your code.

Here is an example for a simple line graph:

In [None]:
x = np.arange(0, 10, 1)
y = x ** 2

plt.plot(x, y)
plt.title("Line Graph of y = x^2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()


You can plot as many lines as you want into one figure.

In [None]:
x = np.arange(0, 10, 1)
y = x ** 2
y2 = x ** 3

plt.plot(x, y, label="y = x^2")
plt.plot(x, y2, label="y = x^3", linestyle='--')
plt.title("Multiple Line Plots")
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.show()

Matplotlib supports many kinds of plots. Here is a scatter plot, for example:

In [None]:
x = np.random.rand(50)
y = np.random.rand(50)

plt.scatter(x, y, color='r')
plt.title("Scatter Plot of Random Points")
plt.xlabel("x")
plt.ylabel("y")
plt.show()

You can put multiple plots into one figure using `plt.subplot(rows, columns, index)`. Here is an example with different kinds of plots.

In [None]:
# Scatter Plot data
x_scatter = np.random.rand(50)
y_scatter = np.random.rand(50)

# Bar plot data
categories = ['A', 'B', 'C', 'D']
values = [5, 7, 3, 8]

# Histogram data
data_hist = np.random.randn(1000)


# Create a figure and define subplots
plt.figure(figsize=(12, 4))

# Scatter plot (1st subplot)
plt.subplot(1, 3, 1)
plt.scatter(x_scatter, y_scatter, color='r')
plt.title("Scatter Plot")
plt.xlabel("x")
plt.ylabel("y")

# Bar plot (2nd subplot)
plt.subplot(1, 3, 2)
plt.bar(categories, values, color='b')
plt.title("Bar Plot")
plt.xlabel("Category")
plt.ylabel("Values")

# Histogram (3rd subplot)
plt.subplot(1, 3, 3)
plt.hist(data_hist, bins=30, edgecolor='black', color='g')
plt.title("Histogram")
plt.xlabel("Value")
plt.ylabel("Frequency")

# Adjust layout to avoid overlapping
plt.tight_layout()

# Show the combined plot
plt.show()
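
Another common way to build such a figure is `plt.subplots` (with an "s"), which creates the figure and all axes in one call and returns them as objects. This is only a sketch of that alternative style; the plotted sine and cosine curves are made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

x_values = np.linspace(0, 10, 50)

# subplots returns the figure and an array with one Axes object per plot
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# each Axes object has its own plot and set_* methods
axes[0].plot(x_values, np.sin(x_values))
axes[0].set_title("sin(x)")
axes[0].set_xlabel("x")

axes[1].plot(x_values, np.cos(x_values), linestyle="--")
axes[1].set_title("cos(x)")
axes[1].set_xlabel("x")

fig.tight_layout()
plt.show()
```

Working with the `Axes` objects directly makes it explicit which subplot each command applies to, which can help in longer plotting scripts.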

### Exporting Plots

Often you want to save a plot for later. You can write it to a file with `plt.savefig`:

In [38]:
x = np.random.rand(50)
y = np.random.rand(50)

plt.scatter(x, y, color='r')
plt.title("Scatter Plot of Random Points")
plt.xlabel("x")
plt.ylabel("y")
plt.savefig(os.path.join(output_data_dir, "my_plot.png"))
plt.close()

## Exercises



For exercises 1 to 4 you need the "Titanic" dataset, which contains data about the passengers of the Titanic and their survival:

1. Load the Titanic dataset and display the first 5 rows. Then, use Pandas to display basic information and summary statistics of the dataset.

In [44]:
import ssl
# workaround: disables HTTPS certificate verification for the download below
ssl._create_default_https_context = ssl._create_unverified_context
from sklearn.datasets import fetch_openml

titanic = fetch_openml("titanic", version=1, as_frame=True)
df = titanic.frame
df["survived"] = pd.to_numeric(df['survived'], errors='coerce')  # convert this for calculations

### Your code here:

2. How many passengers survived and how many did not survive? Use Pandas to count the number of survivors and non-survivors (the `survived` column).

In [40]:
### Your code here:

3. What is the average age of the passengers? Use Pandas to calculate the mean age from the `age` column. Ignore missing values in the calculation.

In [None]:
### Your code here:

4. What was the survival rate for male and female passengers? Use group-by functionality to calculate the survival rate by gender (`sex` column).

In [None]:
### Your code here:

5. Create a 4x4 array filled with random values and find the sum of all elements. Use `np.random.rand()` to generate the values.

In [None]:
### Your code here:

6. Generate an array with 15 evenly spaced values between 5 and 50. Then, reshape this array into a 3x5 matrix.

In [None]:
### Your code here:

7. Given two arrays a and b, perform element-wise subtraction of b from a and print the result.

In [41]:
a = np.array([10, 20, 30, 40, 50])
b = np.array([5, 4, 3, 2, 1])

### Your code here:

8. Given the following 3x3 matrix, extract the element in the second row and third column and print it.

In [42]:
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

### Your code here:

9. Plot the following function

$$ y = x^{0.65} \cdot e^{-0.25x} $$

and display it with a dashed red line. Hint: use numpy's `power` and `exp` functions to translate it into Python code, and use `np.arange` or `np.linspace` to generate the values for `x`.

In [43]:
### Your code here:
