# Summarizing Data

By now we know how to load, select, filter and manipulate data.  

Now, let's see what methods are available to us to further understand and describe our data and the patterns within it.  


* We start by importing the `movies` data set as a DataFrame:

In [None]:
import pandas as pd
movies = pd.read_csv('../data/movies.csv')
movies.head(2)

So far we have learned how to manipulate data across one or more variables within each row:

![series-plus-series.png](images/series-plus-series.png)

Note that when doing so **we return the same number of records that we started with**.   

You can think of it as summarizing at the row-level.

We can achieve this with the following code:

```python
DataFrame['A'] + DataFrame['B']
```

We subset the two Series and then add them together using the `+` operator to achieve the sum.
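
As a minimal sketch of this row-level behaviour, the example below uses a small made-up DataFrame (the data and the names `df` and `row_sums` are assumptions for illustration, not part of the course data):

```python
import pandas as pd

# Hypothetical toy data; the column names A and B mirror the snippet above.
df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Element-wise addition: the two Series are aligned on the index
# and one result is produced per row.
row_sums = df["A"] + df["B"]

print(row_sums.tolist())         # [11, 22, 33]
print(len(row_sums) == len(df))  # True: same number of records as before
```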

### Summary Operations

Sometimes we want to aggregate/summarize the values of a variable across rows rather than combine variables within each row.

<img src="images/aggregate-series.png" alt="aggregate-series.png" width="500" height="500">

Note that in this case **we return a single value** per Series representing some aggregation of the elements contained in it. 

This is known as a **summary function**. You can think of it as summarizing across rows.

This is what we are going to talk about next.

## Summarizing a Series

### Summary Methods

The easiest way to summarize a specific series is by using bracket subsetting notation and the built-in Series methods:

In [None]:
movies['num_reviews'].sum()

Note that a *single value* was returned because this is a **summary operation** -- we are summing the `num_reviews` variable across all rows.

A Series offers other summary methods as well:

In [None]:
movies['num_reviews'].mean()

In [None]:
movies['num_reviews'].median()

In [None]:
movies['num_reviews'].mode()

In [None]:
movies['num_reviews'].max()

In [None]:
movies['num_reviews'].idxmax()

All of the above methods work on quantitative variables, but we also have methods for character variables:

In [None]:
movies['director_name'].value_counts()

Here is a list of the most commonly used Series summary methods:

* `s.sum`
* `s.median`
* `s.mean`
* `s.min`
* `s.max`
* `s.idxmin`
* `s.idxmax`
* `s.count` - counts non-missing values
* `s.nunique` - returns the number of unique values

Note: Most summary methods return a **single** value. Exceptions are `mode()` and `value_counts()`, which return a Series.
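
To make the behaviour of these methods concrete, here is a small sketch on a made-up Series with one missing value (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy Series with a labelled index and one missing value.
s = pd.Series([3.0, 1.0, None, 3.0], index=["a", "b", "c", "d"])

total = s.sum()          # 7.0, missing values are skipped by default
n_present = s.count()    # 3, counts non-missing values only
n_unique = s.nunique()   # 2, distinct non-missing values (1.0 and 3.0)
top_label = s.idxmax()   # "a", index label of the first maximum
modes = s.mode()         # a Series, because several values can tie

print(total, n_present, n_unique, top_label, modes.tolist())
```

Note that `idxmax()` returns an index label rather than a value, which is handy for looking up the full row afterwards.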

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn (in groups of 2-3)
</font>

1. Use the `flights.csv` data set for the following exercises.
2. Fill in the blanks in the code template to find the mean departure delay (`dep_delay`):

```python
flights_df['___'].___()
```
3. Find out how many different airlines (`carrier`) are in the data set. (Hint: `drop_duplicates()` removes duplicates from a Series and `count()` counts the elements it contains.)
4. Find out how often each airline appears in the data. (Hint: `value_counts()` counts the occurrences of each distinct value in a Series.) Which airline appears most often?
5. Bonus: Experiment with some other Series methods. (Hint: use tab completion to see what is available.)

In [None]:
import pandas as pd
flights_df = pd.read_csv('../data/flights.csv')
flights_df.head(2)

### Describe Method

There is also a method `describe()` that provides a lot of information in one go -- this can be useful in exploratory data analysis.

In [None]:
flights_df['distance'].describe()

Note that `describe()` returns different results depending on the data type (`dtype`) of the Series:

In [None]:
flights_df['carrier'].describe()
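
The difference can be sketched on two made-up Series, one numeric and one holding strings (the values and variable names below are assumptions, not taken from `flights.csv`):

```python
import pandas as pd

# Hypothetical stand-ins for a numeric and a text column.
distances = pd.Series([100, 200, 300, 400])
carriers = pd.Series(["AA", "UA", "AA", "DL"], dtype="object")

num_summary = distances.describe()
cat_summary = carriers.describe()

# Numeric data: count, mean, std, min, quartiles, max.
print(list(num_summary.index))
# Text data: count, unique, top (most frequent value), freq (its count).
print(list(cat_summary.index))
```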

## Summarizing a whole DataFrame

The `describe()` method is also available on DataFrames and can be helpful during exploratory data analysis. It displays the most common aggregations for all columns at once:

In [None]:
flights_df.describe()

But the volume of output can also be overwhelming. If there is too much noise, it is hard to see the signal. So if you have an intuition of what you are looking for, it is advisable to limit the output to the relevant statistics.

Notice that by default the string variables are missing from `df.describe()`. Only numeric columns are included.

We can make `describe()` compute on other variable types by using the `include` parameter and passing a list of data types to include:

In [None]:
flights_df.describe(include = ['object'])
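
A minimal sketch of the effect of `include`, using a made-up two-column DataFrame (the data is an assumption; the text column is explicitly stored with the `object` dtype so that the `include=['object']` filter applies):

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column.
df = pd.DataFrame({
    "carrier": pd.Series(["AA", "UA", "AA"], dtype="object"),
    "distance": [100, 200, 300],
})

numeric_only = df.describe()                  # default: numeric columns only
text_only = df.describe(include=["object"])   # only object (string) columns
everything = df.describe(include="all")       # all columns in one table

print(list(numeric_only.columns))  # ['distance']
print(list(text_only.columns))     # ['carrier']
print(list(everything.columns))    # ['carrier', 'distance']
```

With `include="all"`, statistics that do not apply to a column (for example `mean` for text) are shown as missing values.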

# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>

# Summarizing Grouped Data

### What we did so far
We summarized across **all** rows of a DataFrame or Series.

<center>
<img src="images/aggregate-series.png" alt="aggregate-series.png" width="400" height="400">
</center>

### Grouping

* We can group DataFrame rows together by the value in a Series/variable

* If we "**group by A**", then rows with the same value in variable A are in the same group

<img src="images/dataframe-groups.png" width="50%" height="50%"/>

* Note that groups do not need to be ordered by their values. The values can appear in any order:

<img src="images/dataframe-groups-unordered.png" width="50%" height="50%"/>

### Summarizing by Groups

* So far, when we've talked about **summary** operations, we've talked about collapsing a Series to a single value

* This is not necessarily the case -- we can also collapse to a _single value_ **per group**

* This is known as a **grouped aggregation**:

![summarizing-by-groups.png](images/summarizing-by-groups.png)

* This is useful when we want to **aggregate by category**, for example:
  * Maximum temperature *by month*
  * Average bedrooms *by property type*
  * Average number of seats *by plane manufacturer*
  * Total sales *by geography*
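
As a rough preview of the first example, the sketch below computes a maximum temperature by month on made-up data. It relies on the `groupby()` method, which is covered in the next section; the data and the names `weather` and `max_temp` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical weather readings; rows of the same month are not adjacent.
weather = pd.DataFrame({
    "month": ["Jan", "Feb", "Jan", "Feb", "Jan"],
    "temp": [2, 5, 4, 7, 1],
})

# One value per group instead of one value for the whole Series.
max_temp = weather.groupby("month")["temp"].max()

print(max_temp)  # expected: Feb -> 7, Jan -> 4
```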

## Summarizing Grouped Data

* When we summarize by groups, we can use the same aggregation methods we have already seen
    * `s.sum()`, `s.mean()`, `s.count()`, etc.

* The only difference is that we need to set the desired grouping prior to aggregating

### Setting the DataFrame Group

* We can set the grouping column by calling the `DataFrame.groupby()` method and passing a variable name:

In [None]:
airbnb = pd.read_csv('../data/airbnb.csv')
airbnb.head(3)

In [None]:
airbnb.groupby('property_type')

* Notice that a grouped DataFrame does not print its contents

* The `groupby()` method only sets up the grouping and returns a different kind of object instead of a DataFrame - you can see this by checking its class:

In [None]:
type(airbnb.groupby('property_type'))
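
A small sketch of what the grouped object offers before any aggregation is applied, using made-up listings (the data and the name `listings` are assumptions; `ngroups` and `size()` are existing attributes of pandas GroupBy objects):

```python
import pandas as pd

# Hypothetical stand-in for the airbnb data.
listings = pd.DataFrame({
    "property_type": ["House", "Apartment", "House"],
    "price": [120, 80, 100],
})

grouped = listings.groupby("property_type")

print(type(grouped).__name__)  # DataFrameGroupBy
print(grouped.ngroups)         # 2 distinct property types
print(grouped.size())          # number of rows in each group
```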

* If we then call an aggregation method, we will see the DataFrame returned with the aggregated results. Here we obtain the average price by property type and also sort the values in ascending order.

In [None]:
airbnb.groupby('property_type').agg(avg_price=('price', 'mean')).sort_values("avg_price").head(10)

The general pattern is always `df.groupby('grouping_column').agg(new_column_name=('aggregation_column','aggregation_function'))`.

* This process always follows this model:

![model-for-grouped-aggs.png](images/model-for-grouped-aggs.png)
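
One way to read this model is as "split, apply, combine". The sketch below spells the three steps out by hand on made-up data and compares the result with a single `groupby().agg()` call (the data and variable names are assumptions for illustration; the manual loop is only there to show the idea, not a recommended way to aggregate):

```python
import pandas as pd

# Hypothetical stand-in for the airbnb data.
listings = pd.DataFrame({
    "property_type": ["House", "Apartment", "House", "Apartment"],
    "price": [120, 80, 100, 60],
})

# Manual version: split by group value, apply the aggregation, combine.
manual = {}
for value in listings["property_type"].unique():
    subset = listings[listings["property_type"] == value]  # split
    manual[value] = subset["price"].mean()                 # apply + combine

# The same result in one call using named aggregation.
result = listings.groupby("property_type").agg(avg_price=("price", "mean"))

print(manual)
print(result)
```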

* **Notice that the grouped variable becomes the Index in our example!**

In [None]:
airbnb.groupby('property_type').agg(avg_price=('price', 'mean')).sort_values("avg_price").head(10)

#### Groups as Indexes

* This is the default behavior of `pandas`, and probably how `pandas` wants to be used

* If you prefer to keep the grouping column as a variable instead, you can add a call to `.reset_index()` at the end

#### Groups as Index VS Groups as Variables

In [None]:
airbnb.groupby('property_type').agg(avg_price=('price', 'mean')).head(10)

In [None]:
airbnb.groupby('property_type').agg(avg_price=('price', 'mean')).reset_index().head(10)
# Added .reset_index() at the end
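
As an alternative sketch, `groupby()` also accepts `as_index=False`, which keeps the grouping column as a regular variable from the start. The example uses made-up data (the data and variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the airbnb data.
listings = pd.DataFrame({
    "property_type": ["House", "Apartment", "House", "Apartment"],
    "price": [120, 80, 100, 60],
})

# Variant 1: aggregate, then move the index back into a column.
via_reset = (
    listings.groupby("property_type")
    .agg(avg_price=("price", "mean"))
    .reset_index()
)

# Variant 2: never use the grouping column as the index.
via_flag = listings.groupby("property_type", as_index=False).agg(
    avg_price=("price", "mean")
)

print(list(via_reset.columns))  # ['property_type', 'avg_price']
print(list(via_flag.columns))   # ['property_type', 'avg_price']
```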

### Including multiple Aggregations at once

* Often we have multiple aggregations we are interested in

* For example, maybe we want to find not only the average price, but also the minimum and maximum price and the average rating for each property type.

* We can pass as many arguments to the `.agg` method as we like, each in the form `new_col_name=('aggregating_column', 'aggregation_function')`:

In [None]:
airbnb.groupby('property_type').agg(avg_price=('price', 'mean'), 
                                    min_price=('price','min'), 
                                    max_price=('price','max'), 
                                    avg_rating=('rating','mean')).head(10)

### Grouping by Multiple Variables

* Sometimes we have multiple categories by which we'd like to group

* To extend our example, assume we want to find the average price of an accommodation by `property_type` AND `bedrooms` available

* We can pass a list of variable names to the `groupby()` method:

In [None]:
airbnb.groupby(['property_type', 'bedrooms']).agg(avg_price=('price', 'mean')).head(15)

This does produce the result we want. But the output is not exactly easy to read. (It uses a so-called MultiIndex.)

We can improve on this by resetting the index:

In [None]:
airbnb.groupby(['property_type', 'bedrooms']).agg(avg_price=('price', 'mean')).reset_index()
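
To show what the MultiIndex looks like in practice, here is a sketch on made-up data (the data and variable names are assumptions for illustration). A single group can be looked up with a tuple of index values, and `reset_index()` turns both index levels back into columns:

```python
import pandas as pd

# Hypothetical stand-in for the airbnb data.
listings = pd.DataFrame({
    "property_type": ["House", "House", "Apartment", "Apartment", "House"],
    "bedrooms": [2, 3, 1, 1, 2],
    "price": [100, 150, 60, 80, 120],
})

by_type_and_rooms = listings.groupby(["property_type", "bedrooms"]).agg(
    avg_price=("price", "mean")
)

# Both grouping columns are now levels of the index.
print(list(by_type_and_rooms.index.names))  # ['property_type', 'bedrooms']

# Look up one group with a tuple: houses with 2 bedrooms.
house_two_rooms = by_type_and_rooms.loc[("House", 2), "avg_price"]
print(house_two_rooms)  # 110.0

# Flatten the result into ordinary columns.
flat = by_type_and_rooms.reset_index()
print(list(flat.columns))  # ['property_type', 'bedrooms', 'avg_price']
```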

But often, especially when the intention is to present the results, a better way to group by two columns is via a pivot table.

### Grouping by Multiple Variables with a Pivot Table

In [None]:
pivot_table = airbnb.pivot_table(index= 'property_type', columns= 'bedrooms', values='price', aggfunc='mean').fillna('-')
pivot_table
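
The shape of a pivot table can be sketched on made-up data (the data and variable names are assumptions for illustration). One grouping variable becomes the row index, the other becomes the columns, and combinations that never occur are left as missing values:

```python
import pandas as pd

# Hypothetical stand-in for the airbnb data.
listings = pd.DataFrame({
    "property_type": ["House", "House", "Apartment", "Apartment", "House"],
    "bedrooms": [2, 3, 1, 1, 2],
    "price": [100, 150, 60, 80, 120],
})

pivot = listings.pivot_table(
    index="property_type",  # one row per property type
    columns="bedrooms",     # one column per bedroom count
    values="price",
    aggfunc="mean",
)

print(pivot)
print(pivot.loc["House", 2])               # 110.0
print(pd.isna(pivot.loc["Apartment", 2]))  # True: no such listing
```

The cell above replaces missing combinations with `'-'` for display. A likely side effect is that the affected columns stop being numeric, so keeping the missing values is the safer choice if further calculations follow.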

## Your Turn

<img src="images/exercise.png" style="width: 1000px;"/>

<font class="your_turn">
    Your Turn (in groups of 2-3)
</font>

1. Load the movies data set (`../data/movies.csv`).
2. Find the average length of all movies per release year.
3. Also find out how many movies were produced in each year.


4. Load the "Employee" data set (`../data/employee.csv`).
5. Group the data by department (`dept`) and find out whether average income differs between departments.
6. Create a pivot table with department (`dept`) as the index and ethnicity (`race`) as the columns. Find out whether compensation differs by department and ethnicity.
7. Create another pivot table and find out how many people fall into each group (`count`).


# Questions

Are there any questions up to this point?

<img src="images/any_questions.png" style="width: 1000px;"/>