In [None]:
import pandas as pd
df = pd.read_csv('train.csv')

## Visualization
Now that we have seen how data selection works in pandas, let us investigate the different features visually. For that we use the Python package `matplotlib`, which provides versatile visualization capabilities. Conveniently, pandas wraps calls to matplotlib and has great [documentation](https://pandas.pydata.org/docs/user_guide/visualization.html) on the topic.

Let us start by visualizing the tabular data we loaded above. A handy way to do this is a box plot. We first configure matplotlib so that the plots render nicely in Jupyter, and then draw the boxes.

In [None]:
import matplotlib.pyplot as plt
# Visualizations will be shown in the notebook.
%matplotlib inline
# Change default size of figures:
plt.rcParams["figure.figsize"] = (12,7)

Now let us draw a box plot by selecting a column and calling [`.plot.box()`](https://pandas.pydata.org/docs/user_guide/visualization.html#box-plots) on it.

In [None]:
ax = df["Age"].plot.box()

A box plot is a handy way to see the distribution of a single variable. With the default settings it shows the following:

- The lower whisker ends at the smallest value that lies within 1.5 times the interquartile range (IQR, the height of the box) below the box.
- The lower boundary of the box marks the first quartile, that is, the value at or below which 25% of the data lie.
- The green line marks the median (the 50% quantile), not the average.
- The upper boundary of the box marks the third quartile (75%).
- The upper whisker ends at the largest value that lies within 1.5 times the IQR above the box.
- The dots beyond the whiskers are individual values outside that range. They are often considered outliers.
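
The quantities a box plot draws can also be computed by hand. The following is a minimal sketch, assuming matplotlib's default whisker length of 1.5 times the IQR; the values in `ages` are hypothetical and not taken from `train.csv`.

```python
import pandas as pd

# Hypothetical toy values, not taken from train.csv.
ages = pd.Series([2, 18, 21, 24, 28, 30, 35, 40, 48, 80])

q1 = ages.quantile(0.25)      # lower edge of the box
median = ages.quantile(0.50)  # line inside the box
q3 = ages.quantile(0.75)      # upper edge of the box
iqr = q3 - q1                 # height of the box

# With the default setting, the whiskers end at the most extreme
# data points that still lie within 1.5 * IQR of the box.
lower_whisker = ages[ages >= q1 - 1.5 * iqr].min()
upper_whisker = ages[ages <= q3 + 1.5 * iqr].max()

# Everything beyond the whiskers is drawn as an individual point.
outliers = ages[(ages < lower_whisker) | (ages > upper_whisker)]
print(q1, median, q3, lower_whisker, upper_whisker, outliers.tolist())
```

In this toy series only the value 80 falls outside the whiskers and would be drawn as a dot.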


We, however, want to see how the variables relate to our target: did the person survive or not? For that, alpha-blended histograms or stacked bar charts come in handy. Pandas offers many more plot types, e.g. [histograms](https://pandas.pydata.org/docs/user_guide/visualization.html#histograms).

To try this, first plot a histogram for one of the columns. Then, in a new cell, plot two histograms on top of each other: one filtered for all survivors and one for everyone else.

In [None]:
plt.title("Age")
df.loc[df["Survived"] == True, "Age"].plot.hist(alpha=0.5, label="Survived", bins=20)
df.loc[df["Survived"] == False, "Age"].plot.hist(alpha=0.5, label="Deceased", bins=20)
plt.legend()
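
With an integer such as `bins=20`, each histogram call works out its own bin edges from the values it receives, so the two overlaid histograms may not share exactly the same bins. The following is a minimal sketch of one way to compare the groups bin by bin, assuming a common set of edges built with `numpy.linspace`. The toy table and the choice of four bins are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the "Age" and "Survived" columns.
toy = pd.DataFrame({
    "Age": [4, 22, 25, 31, 38, 45, 52, 60, 71, 80],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0, 0, 1],
})

# One common set of bin edges spanning the whole column (4 bins here).
edges = np.linspace(toy["Age"].min(), toy["Age"].max(), num=5)

# Count both groups against the same edges so the bins line up.
survived_counts, _ = np.histogram(toy.loc[toy["Survived"] == 1, "Age"], bins=edges)
deceased_counts, _ = np.histogram(toy.loc[toy["Survived"] == 0, "Age"], bins=edges)
print(edges, survived_counts, deceased_counts)
```

Matplotlib's `hist` also accepts such a sequence of edges for its `bins` argument.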

If you have categorical values, stacked bar charts might be the right way to go.

For this, we unfortunately have to transform our data a bit, which can be done with grouping. Pandas lets you split a dataframe into multiple smaller ones, each containing the rows that share the same values in one or more columns.

Let us [group the rows by](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#splitting-an-object-into-groups) the column we want to plot (e.g. `"Pclass"`) and the target `"Survived"`.
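
To make the splitting concrete, here is a minimal sketch on a hypothetical six-row table (the names `toy` and `toy_grouped` and all values are made up for illustration and do not come from `train.csv`):

```python
import pandas as pd

# Hypothetical miniature passenger table, not the real train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 2],
    "Survived": [1, 0, 0, 0, 1, 1],
    "Age":      [38, 54, 22, 35, 27, 14],
})

# Each distinct (Pclass, Survived) combination becomes one group.
toy_grouped = toy.groupby(["Pclass", "Survived"])

# size() reports how many rows ended up in each group.
group_sizes = toy_grouped.size()

# get_group returns the sub-frame for one combination.
third_class_deceased = toy_grouped.get_group((3, 0))
print(group_sizes)
print(third_class_deceased)
```

Only combinations that actually occur in the data form a group, so `(2, 0)` is absent here.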

In [None]:
grouped = df.groupby(["Pclass", "Survived"])
grouped.groups

In [None]:
grouped.get_group((1,0))

Now we count the occurrences of `"Pclass"` for each combination:

In [None]:
pclass_counts = grouped["Pclass"].count()
pclass_counts

Finally, we move the `"Survived"` index level into the columns:

In [None]:
pclass_counts = pclass_counts.unstack("Survived")
pclass_counts
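
As a rough illustration of what `unstack` does, the sketch below applies it to a small hand-built series. The counts are hypothetical, and the names `toy_counts` and `toy_table` are introduced only for this example.

```python
import pandas as pd

# Hypothetical counts indexed by (Pclass, Survived), built by hand.
index = pd.MultiIndex.from_tuples(
    [(1, 0), (1, 1), (2, 1), (3, 0), (3, 1)],
    names=["Pclass", "Survived"],
)
toy_counts = pd.Series([1, 1, 1, 2, 1], index=index)

# unstack pivots the "Survived" index level into columns;
# combinations that never occur, here (2, 0), become NaN.
toy_table = toy_counts.unstack("Survived")

# fillna(0) turns the missing combinations into explicit zero counts.
toy_table_filled = toy_table.fillna(0)
print(toy_table_filled)
```

The result has one row per `"Pclass"` value and one column per `"Survived"` value, which is the shape a stacked bar chart needs.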

We can then plot the resulting table:

In [None]:
pclass_counts.plot.bar(stacked=True)

For convenience, let us wrap this into a function:

In [None]:
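def make_stacked_bar_chart(df: pd.DataFrame, x_axis: str, stack_dimension: str):
    """Plot stacked bars of row counts per `x_axis` value, split by `stack_dimension`."""
    # Count rows per combination; combinations that never occur get a count of 0.
    bar_dataframe = df.groupby([x_axis, stack_dimension])[x_axis].count().unstack(stack_dimension).fillna(0)
    bar_dataframe.plot.bar(stacked=True)
    plt.legend()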

In [None]:
make_stacked_bar_chart(df, "Pclass", "Survived")
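
As an aside, `pd.crosstab` is a different route to the same count table and is not what the function above uses. The sketch below compares both on a hypothetical six-row table; the names `toy`, `via_groupby` and `via_crosstab` are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature passenger table, not the real train.csv.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 3, 3, 3, 2],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# Route used in this notebook: group, count, unstack, fill gaps.
via_groupby = (
    toy.groupby(["Pclass", "Survived"])["Pclass"]
    .count()
    .unstack("Survived")
    .fillna(0)
)

# Alternative: pd.crosstab builds the same count table in one call.
via_crosstab = pd.crosstab(toy["Pclass"], toy["Survived"])
print(via_crosstab)
```

Calling `.plot.bar(stacked=True)` on either table should give the same chart.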

Given these two options, investigate the remaining columns (`["Fare", "SibSp", "Parch"]`).

To get a first intuition of how the features correlate with our target, let us also look at the correlation matrix:

In [None]:
numeric_features = ["Age", "SibSp", "Parch", "Fare"]
df[[*numeric_features, "Survived"]].corr()
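
By default, `DataFrame.corr()` computes the Pearson correlation coefficient for every pair of columns. The sketch below recomputes one entry by hand on hypothetical toy values to show what the number means; the names `toy`, `manual_r` and `pandas_r` are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy columns, not taken from train.csv.
toy = pd.DataFrame({
    "Fare":     [7.0, 8.0, 13.0, 26.0, 53.0, 71.0],
    "Survived": [0, 0, 0, 1, 1, 1],
})

x = toy["Fare"]
y = toy["Survived"]

# Pearson's r: covariance divided by the product of the standard deviations.
manual_r = ((x - x.mean()) * (y - y.mean())).sum() / np.sqrt(
    ((x - x.mean()) ** 2).sum() * ((y - y.mean()) ** 2).sum()
)

# DataFrame.corr() uses the Pearson coefficient by default.
pandas_r = toy.corr().loc["Fare", "Survived"]
print(manual_r, pandas_r)
```

Values close to 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship. Nonlinear dependencies can go unnoticed.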

Now that we have an intuition for our data, let us continue by [training a machine learning algorithm](./03-SklearnIntroduction.ipynb) on it.