### Oktoberfest Beer Data

In [1]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt

a) Read the data from *visitors.csv* into a pandas dataframe.

In [2]:
df_visitors = pd.read_csv("visitors.csv")

b) Take a look at the dataframe's column names. Display the first ten rows of the dataframe.

In [None]:
print(df_visitors.columns)
df_visitors.head(10)

c) Select and display only the visitors column. Then select and display only the fifth row of your dataframe.

In [None]:
# Note: To avoid cluttered output, comment out the lines you do not need.

# Visitors column
print(df_visitors["Visitors (million)"])
print(df_visitors.loc[:, "Visitors (million)"])  # Alternative

# Fifth row
# print(df_visitors.iloc[4]) # The fifth row has index 4...
# print(df_visitors.loc[4, :]) # Alternative

d) How many visitors were there in 1995?

In [None]:
x = df_visitors.loc[df_visitors["Year"] == 1995, "Visitors (million)"]
print(x)
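
*Note that the selection above returns a Series (index label plus value), not a bare number. The following is a minimal sketch of how a single value could be extracted, assuming exactly one row matches. The dataframe `toy` and its numbers are invented for illustration and are not taken from visitors.csv.*

```python
import pandas as pd

# Toy data; the numbers are invented for illustration only.
toy = pd.DataFrame(
    {"Year": [1994, 1995, 1996], "Visitors (million)": [6.6, 6.7, 6.9]}
)

# Boolean-mask selection returns a Series, even if only one row matches.
selected = toy.loc[toy["Year"] == 1995, "Visitors (million)"]

# .item() extracts the bare value; it raises a ValueError
# unless the Series contains exactly one element.
visitors_1995 = selected.item()
print(visitors_1995)
```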

e) What is the value range of the attribute "Beer consumption (million liters)"?

In [None]:
v_min = df_visitors["Beer consumption (million liters)"].min()
v_max = df_visitors["Beer consumption (million liters)"].max()
print(v_min, "-", v_max)

f) What was the year with the greatest beer consumption? How much beer was consumed?

In [None]:
row = df_visitors["Beer consumption (million liters)"].argmax()
print(df_visitors.iloc[row])
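
*`argmax()` returns an integer position, which is why it is paired with `iloc` above. `idxmax()` returns the index label instead and belongs with `loc`. The sketch below illustrates the difference on a toy dataframe whose index deliberately does not start at 0. The column name `Beer` and all values are made up.*

```python
import pandas as pd

# Toy data with a non-default index; all values are invented.
toy = pd.DataFrame(
    {"Year": [1990, 1991, 1992], "Beer": [5.0, 6.5, 6.0]},
    index=[10, 11, 12],
)

pos = toy["Beer"].argmax()    # integer position of the maximum -> use with .iloc
label = toy["Beer"].idxmax()  # index label of the maximum -> use with .loc

# Both lookups end up at the same row.
print(toy.iloc[pos])
print(toy.loc[label])
```

*With the default index 0, 1, 2, ... position and label coincide, so both variants give the same result.*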

g) Plot the beer consumption over the years. Turn on the grid of your plot.

In [None]:
df_visitors.plot(x="Year", y="Beer consumption (million liters)")
plt.grid()
# # Alternatively, you can use matplotlib
# plt.plot(df_visitors["Year"], df_visitors["Beer consumption (million liters)"])
# plt.grid()

h) Compute the correlation matrix of the dataframe. Which conclusions can you draw?

In [None]:
df_visitors.corr()

*The correlation between year and beer consumption is high.*   
*However, the correlation between visitors and beer consumption is close to zero, as is the correlation between year and visitors.*

*This indicates that the per-head beer consumption increased over the years: total consumption grew while the number of visitors did not.*

In [None]:
plt.plot(
    df_visitors["Year"],
    df_visitors["Beer consumption (million liters)"]
    / df_visitors["Visitors (million)"],
)
plt.title("Per-head beer consumption")
plt.ylabel("Liters")
plt.xlabel("Year")
plt.grid()

i) Load the data from *beer_price.csv* into a second dataframe. Then merge the two dataframes based on the year with an outer join.  
Observe which values you get for years that are missing from one of the data sets.

In [None]:
df_beer_price = pd.read_csv("beer_price.csv")

df_oktoberfest = pd.merge(df_visitors, df_beer_price, how="outer", on="Year")
df_oktoberfest
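
*In an outer join, rows without a partner in the other dataframe are kept and the missing columns are filled with `NaN`. The sketch below shows this on two tiny, invented dataframes. The optional `indicator=True` argument of `pd.merge` adds a `_merge` column that records where each row came from.*

```python
import pandas as pd

# Two tiny dataframes with partially overlapping years (invented values).
left = pd.DataFrame({"Year": [2000, 2001], "Visitors": [6.9, 5.5]})
right = pd.DataFrame({"Year": [2001, 2002], "Price": [6.5, 6.8]})

# The outer join keeps every year that appears in either dataframe.
# indicator=True adds a "_merge" column: left_only, right_only, or both.
merged = pd.merge(left, right, how="outer", on="Year", indicator=True)
print(merged)
```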

j) Up to and including 2001, the beer price is reported in DEM ("Deutsche Mark"), the former German currency.
Convert these beer prices (min and max) to EUR.
You may assume that 1 EUR = 1.95583 DEM.

In [None]:
# Careful: Do not execute this cell more than once or your data will be incorrect.
df_oktoberfest.loc[df_oktoberfest["Year"] <= 2001, "Min price"] = (
    df_oktoberfest.loc[df_oktoberfest["Year"] <= 2001, "Min price"] / 1.95583
)
df_oktoberfest.loc[df_oktoberfest["Year"] <= 2001, "Max price"] = (
    df_oktoberfest.loc[df_oktoberfest["Year"] <= 2001, "Max price"] / 1.95583
)

# Display the converted dataframe
df_oktoberfest
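
*The in-place conversion is not safe to run twice. One possible way around this is to leave the merged dataframe untouched and return a converted copy. The sketch below assumes the column names used above. The helper `prices_in_eur` and the toy values are hypothetical and not part of the exercise.*

```python
import pandas as pd

DEM_PER_EUR = 1.95583  # conversion rate given in the exercise


def prices_in_eur(df, last_dem_year=2001, cols=("Min price", "Max price")):
    """Hypothetical helper: return a copy of df with DEM prices converted to EUR."""
    out = df.copy()  # the input dataframe is never modified
    is_dem = out["Year"] <= last_dem_year
    out.loc[is_dem, list(cols)] = out.loc[is_dem, list(cols)] / DEM_PER_EUR
    return out


# Toy data (invented): one DEM year and one EUR year.
raw = pd.DataFrame(
    {"Year": [2001, 2002], "Min price": [12.0, 6.8], "Max price": [13.0, 7.1]}
)

converted = prices_in_eur(raw)
converted_again = prices_in_eur(raw)  # re-running gives the same result
print(converted)
```

*Because the function always starts from the unconverted data, executing the cell repeatedly cannot divide the prices more than once.*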

k) Compute the mean and variance of "Min price" during the period from 2000 to 2007.

In [None]:
# First, we construct a boolean mask.
mask = (2000 <= df_oktoberfest["Year"]) & (df_oktoberfest["Year"] <= 2007)
# Then, we apply the mask and compute the mean and variance.
mean = df_oktoberfest[mask]["Min price"].mean()
var = df_oktoberfest[mask]["Min price"].var()
print("mean =", mean)
print("var =", var)
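
*Keep in mind that pandas' `var()` computes the sample variance by default (`ddof=1`, division by n - 1). If the population variance (division by n) is wanted, pass `ddof=0`. A small sketch with invented prices:*

```python
import pandas as pd

prices = pd.Series([6.0, 6.5, 7.0, 7.5])  # invented values

sample_var = prices.var()            # default ddof=1: divides by n - 1
population_var = prices.var(ddof=0)  # divides by n

print("sample variance     =", sample_var)
print("population variance =", population_var)
```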

l) Add a new column to the merged dataframe, describing the relative difference in beer prices
between two consecutive years in percent. Plot this difference against "Year".

In [None]:
# Relative change with respect to the previous year's price, in percent.
df_oktoberfest["Price Increase (%)"] = (
    (df_oktoberfest["Min price"] - df_oktoberfest["Min price"].shift(1))
    / df_oktoberfest["Min price"].shift(1)
    * 100
)
df_oktoberfest.plot(x="Year", y="Price Increase (%)")

*What does `shift(1)` do?*

In [15]:
# compare = pd.concat(
#     [
#         df_oktoberfest["Year"],
#         df_oktoberfest["Min price"],
#         df_oktoberfest["Min price"].shift(1),
#     ],
#     axis=1,
# )

# compare
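
*A smaller illustration of the same idea: `shift(1)` moves every value down by one row, so each row sees the value of the row before it, and the first row becomes `NaN`. The sketch below uses three invented prices. It measures the change relative to the previous year's price, which is one common convention, and compares the result with pandas' built-in `pct_change()`.*

```python
import pandas as pd

price = pd.Series([4.0, 5.0, 6.0], name="Min price")  # invented values

previous = price.shift(1)  # [NaN, 4.0, 5.0]

# Percentage change relative to the previous row.
increase = (price - previous) / previous * 100

# pct_change() computes the same fraction (without the factor 100).
builtin = price.pct_change() * 100

print(pd.concat([price, previous.rename("shifted"), increase.rename("increase")], axis=1))
```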

m) Create a bar plot displaying the minimum and maximum prices over the years.

In [None]:
df_oktoberfest.plot.bar(x="Year", y=["Min price", "Max price"])

# # Alternatively, you can use matplotlib
# plt.bar(x=df_oktoberfest["Year"], height=df_oktoberfest["Max price"], label="Max price")
# plt.bar(x=df_oktoberfest["Year"], height=df_oktoberfest["Min price"], label="Min price")
# plt.xlabel("Year")
# plt.ylabel("Price")
# plt.legend()

n) Compute estimates (lower and upper bound) of the beer revenue and visualize them by bar plot.

In [None]:
df_oktoberfest["Min sales"] = (
    df_oktoberfest["Beer consumption (million liters)"] * df_oktoberfest["Min price"]
)
df_oktoberfest["Max sales"] = (
    df_oktoberfest["Beer consumption (million liters)"] * df_oktoberfest["Max price"]
)

df_oktoberfest.plot.bar(x="Year", y=["Min sales", "Max sales"])
plt.title("Revenue")
plt.ylabel("Million EUR")