# Pandas `Series` Exercises

Now that we have learned quite a lot about Pandas `Series`, `DataFrame`s and plotting, it is time to refresh the content of the past days by working on some exercises.

In [None]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt

mpl.style.use("seaborn-v0_8-colorblind")

f"Pandas version: {pd.__version__ = }, Numpy version: {np.__version__ = }"

## Creation and element access

### Ways to create a `Series`

You are given the following Python containers. Use them to create a `Series` in two different ways.

In [None]:
index, values = [f"a{idx}" for idx in range(len(range(0, 11, 2)))], list(
    range(0, 11, 2)
)
index, values

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

### Type conversion

Using the same Python containers as in the previous exercise, generate a `Series` containing the elements of the `values` container as 32-bit floating point values.

In [None]:
index, values = [f"a{idx}" for idx in range(len(range(0, 11, 2)))], list(
    range(0, 11, 2)
)
index, values

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

### Accessing the index

Given the `Series` below, access the elements with index labels `"a"`, `"f"`, and `"i"` in three different ways.

In [None]:
s = pd.Series(range(10, 110, 10), list("abcdefghij"))
s

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

### Filtering elements

Given the `Series` below, return a `Series` that contains all elements that lie within 0.001% of their integer part (the value with all decimal places dropped).
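A note on the tolerance: 0.001% corresponds to a relative tolerance of `1e-5`. One possible way to express such a check (a sketch, not the only approach) is `np.isclose` with `atol=0`, which tests `|a - b| <= rtol * |b|`. The values below are made up for illustration and are not the exercise data.

```python
import numpy as np

rtol = 0.001 / 100  # 0.001 % expressed as a fraction

# Hypothetical sample values compared against the reference value 100
reference = 100.0
inside = np.isclose(100.0005, reference, rtol=rtol, atol=0.0)  # deviation 0.0005 is below 0.001
outside = np.isclose(100.002, reference, rtol=rtol, atol=0.0)  # deviation 0.002 exceeds 0.001
print(inside, outside)
```

Note that `np.isclose` scales the tolerance by its second argument, so the order of the arguments matters.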

In [None]:
s = pd.Series([1.1, 2.0, 3.001, 10.5, 4.0, 5, 7.00001, 12.0])
s

In [None]:
# Fill the gap!

From the same `Series` *with all its elements converted to integers* return another `Series` that contains all even entries.

In [None]:
# Fill the gap!

### Modifying elements

You are given the `Series` below. Replace the values at index labels `"c"`, `"d"`, `"e"`, and `"f"` with the value -1000 *in one of the copies of `s`*. Come up with three different methods to accomplish these changes. For each approach, a copy of the original `Series` is provided below.

In [None]:
s = pd.Series(range(10, 110, 10), list("abcdefghij"))
s

In [None]:
s_copy1 = s.copy(deep=True)  # execute this cell before making changes in the cell below

In [None]:
# Fill the gap!

In [None]:
s_copy2 = s.copy(deep=True)  # execute this cell before making changes in the cell below

In [None]:
# Fill the gap!

In [None]:
s_copy3 = s.copy(deep=True)  # execute this cell before making changes in the cell below

In [None]:
# Fill the gap!

### `Series` with duplicate entries in the index

The `Series` given below has a duplicated label in its index. Access all elements sharing this index label. Afterward, return a `Series` in which the duplicate entries have been removed.

*Hint*: You can for example use a boolean mask to accomplish the removal. Can you come up with different ways to create such a boolean mask?
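As a small illustration of what such a mask can look like, the sketch below uses `Index.duplicated` on made-up toy data (the name `toy` and its values are hypothetical and not part of the exercise). It only shows how the `keep` argument changes which positions are flagged; other ways of building a mask are possible.

```python
import pandas as pd

# Hypothetical toy data, not the exercise Series
toy = pd.Series([1, 2, 3, 4], index=["x", "y", "x", "z"])

# keep=False marks every occurrence of a repeated label as True
mask_all = toy.index.duplicated(keep=False)
# the default keep="first" leaves the first occurrence unmarked
mask_later = toy.index.duplicated()

print(mask_all.tolist())    # [True, False, True, False]
print(mask_later.tolist())  # [False, False, True, False]
```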

In [None]:
s = pd.Series(range(10), index=["a", "b", "a", "a", "c", "d", "e", "f", "a", "g"])
s

Get all values at duplicate index entries.

In [None]:
# Fill the gap!

Return a `Series` without any duplicate index labels.

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

## Operations between `Series`

### Arithmetic

Given the `Series` below, compute their sum, difference, product, and quotient. Do so by using 1. the usual arithmetic operator, and 2. an appropriate method call.

In [None]:
s1, s2 = pd.Series(range(1, 6)), pd.Series(range(5, 0, -1))
s1, s2

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

### Match by index

The `Series` given below have different sizes and hence do not share the same index. Compute the result of adding both `Series`.

In [None]:
s1 = pd.Series(range(1, 6), index=list("abcde"))
s2 = pd.Series(range(70, 0, -10), index=list("abcdefg")[::-1])
s1, s2

1. Remove all elements for which no result is obtained.

In [None]:
# Fill the gap!

2. Fill all elements with no result with a default value of your choice.

In [None]:
# Fill the gap!

## `Series` method calls

### Statistics

Given the `Series` below, determine the minimum, maximum, mean, and median values. Do so in three different ways.

*Hint*: The `.apply()` method can also reduce.

In [None]:
s = pd.Series(np.random.normal(size=200))

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

### Centering on the mean

Subtract the mean value computed for the `Series` below from all of its elements. Do so by using

1. the `.transform()` method,
2. the `.apply()` method,
3. another method you can think of.

In [None]:
s = pd.Series(range(20, 110, 10))
s

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

### Fix me!

What is the issue with the following line of code? The code is *commented* and will not execute until you remove the `#`.

In [None]:
# pd.Series(range(10)).transform(lambda s: s.mean())

### Computing with NaNs

The `Series` below contains several `NaN` entries.

In [None]:
data = np.array(list(range(5)) + [np.nan] * 4 + list(range(10, 50, 10)))
np.random.shuffle(data)
s = pd.Series(data)
s

1. Compute the mean and the median value. 

In [None]:
# Fill the gap!

2. Recompute the mean and the median values but before doing this, replace the "missing values" (all the `NaN`s) with `np.pi`. Conduct this replacement in two different ways.

*Hint*: One way to achieve this is the `.where()` method.
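As a reminder of how `.where()` behaves, here is a minimal sketch on made-up data (the `toy` `Series` and the replacement value `0` are hypothetical and unrelated to the `NaN` task): entries for which the condition is `True` are kept, all others are replaced by `other`.

```python
import pandas as pd

# Hypothetical toy data, not the exercise Series
toy = pd.Series([-2, 5, -1, 3])

# Entries where the condition is True are kept; all others become `other`
clipped = toy.where(toy >= 0, other=0)
print(clipped.tolist())  # [0, 5, 0, 3]
```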

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!

## Plotting with `Series`

### Basic

Create a histogram plot for the data in the following `Series`.

In [None]:
s = pd.Series(
    np.concatenate(
        (
            np.random.normal(loc=0, scale=0.75, size=500),
            np.random.normal(loc=3, scale=0.5, size=500),
        ),
        axis=0,
    )
)

In [None]:
# Fill the gap!

Now create a boxplot.

In [None]:
# Fill the gap!

Now create a violin plot.

In [None]:
# Fill the gap!

### Subplots

You are given the `Series` below that contains duplicated index entries of three kinds: `"classA"`, `"classB"`, and `"classC"`.

In [None]:
s = pd.Series(
    np.concatenate(
        (
            np.random.normal(loc=2, scale=1, size=200),
            np.random.normal(loc=5, scale=0.5, size=200),
            np.random.normal(loc=3, scale=1.15, size=100),
            np.random.normal(loc=8, scale=0.75, size=200),
        ),
        axis=0,
    ),
    index=["classA"] * 200 + ["classB"] * 200 + ["classC"] * 100 + ["classC"] * 200,
)

Create a histogram plot of the data that allows all three classes to be distinguished; in particular, the peaks of the individual distributions must be visible. Make sure to add a legend so the classes can be told apart.

In [None]:
# Fill the gap!

Now create a 3x1 grid of subplots. Each subplot shall hold a histogram of a particular class and the subplots shall share a common x-axis.

In [None]:
# Fill the gap!

### Polling for data

Suppose you have conducted an anonymous poll in which, amongst other things, you asked participants about their employment status. Annoyingly, the online form used to gather the data contained a free-text field in which people could write arbitrary text (perhaps with a limit on the number of characters) instead of a drop-down menu offering a fixed set of answers to choose from. As a result, some of the answers you receive are not really suitable for your research (e.g. "Working with the Avengers"). You have to apply some of the techniques you have just learned for manipulating `Series` to bring the data into a form suitable for further processing.

To keep things simple, let's assume the answers you were hoping for are `"Employed"` (or `"employed"`; yes, capitalization will also get in your way here ;-)) and `"Unemployed"` (or `"unemployed"`). We consider these 'usable' and everything else 'unusable' (we assume that the poll does not contain further information that would allow us to make them 'usable').

Replace all entries you consider 'unusable' with `"unknown"`. Further, change all 'usable' entries to lowercase.

Generate two plots side-by-side that show *one* of the following (you choose!):

* The counts of each category from the *uncleaned* dataset on the left and the counts of each category in the *cleaned* dataset on the right.
* The relative proportions of the categories from the *uncleaned* dataset on the left and the relative proportions from the *cleaned* dataset on the right.

Depending on which of the two you choose to visualize, pick an appropriate plot type.
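The difference between counts and relative proportions can be sketched with `value_counts` on made-up data (the `toy` `Series` below is hypothetical and not the poll data). This is one possible way to obtain the two quantities, not the only one.

```python
import pandas as pd

# Hypothetical toy data, not the poll data
toy = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = toy.value_counts()                # absolute counts per category
shares = toy.value_counts(normalize=True)  # relative proportions, summing to 1

print(counts["red"], shares["red"])  # 3 0.5
```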

In [None]:
rng = np.random.default_rng(seed=42)

data = np.array(
    ["Employed"] * rng.choice(range(2000, 5000))
    + ["employed"] * rng.choice(range(100, 500))
    + ["Unemployed"] * rng.choice(range(200, 400))
    + ["unemployed"] * rng.choice(range(50, 70))
    + ["Rate mal"] * rng.choice(range(10, 20))
    + ["Having fun all day"] * rng.choice(range(50, 100))
    + ["geht dich nix an"] * rng.choice(range(20, 60))
)
rng.shuffle(data)

poll = pd.Series(data=data)
poll

In [None]:
# Fill the gap!

In [None]:
# Fill the gap!