# Aggregating DataFrames

In [1]:
from IPython.display import IFrame

# Display PDF with responsive width and height
IFrame("https://projector-video-pdf-converter.datacamp.com/22066/chapter2.pdf", width="100%", height="600px")

## I. Summary Statistics
### (Theory)

(a). Summarizing numerical data
- .mean()
- .median()
- .mode()
- .min() 
- .max()
- .var()  
- .std()
- .sum()
- .quantile()

example:
```python
dogs["height_cm"].mean()
```
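The `dogs` DataFrame itself is not shown in these notes, so the following is a minimal sketch on a small made-up table (the names and numbers are assumptions for illustration only). It shows several of the methods listed above applied to single columns.

```python
import pandas as pd

# Hypothetical toy data; the real `dogs` DataFrame is not shown in these notes.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24, 24, 17, 29],
})

mean_height = dogs["height_cm"].mean()        # arithmetic mean of the column
median_height = dogs["height_cm"].median()    # middle value of the sorted column
max_weight = dogs["weight_kg"].max()          # largest value
weight_q30 = dogs["weight_kg"].quantile(0.3)  # 30th percentile (linear interpolation)

print(mean_height, median_height, max_weight, weight_q30)
```

Each call returns a single number, which is what makes these "summary" statistics.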



(b). Summarizing dates
- Oldest dog:
```python
dogs["date_of_birth"].min()
```

- Youngest dog:
```python
dogs["date_of_birth"].max()
```
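As a sketch of how this could be used in practice, the example below also looks up which row holds the earliest and latest date with `.idxmin()` and `.idxmax()`. The table and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "date_of_birth": pd.to_datetime(["2013-07-01", "2016-09-16", "2014-08-25"]),
})

oldest_dob = dogs["date_of_birth"].min()    # earliest birth date -> oldest dog
youngest_dob = dogs["date_of_birth"].max()  # latest birth date -> youngest dog

# idxmin()/idxmax() return the index label of the row holding the min/max
oldest_name = dogs.loc[dogs["date_of_birth"].idxmin(), "name"]
youngest_name = dogs.loc[dogs["date_of_birth"].idxmax(), "name"]

print(oldest_name, oldest_dob.date())
print(youngest_name, youngest_dob.date())
```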

(c). The .agg() method
- The .agg() method applies one or more summary functions, including custom ones, to a DataFrame or Series.
- Pass a single function to get one summary per column, or a list of functions to get several summaries at once.

example 1 : percentile determination of a certain column
```python
def pct30(column):
    """Return the 30th percentile of a column."""
    return column.quantile(0.3)

dogs["weight_kg"].agg(pct30)  # 22.599999999999998
```

example 2: Summaries on multiple columns
```python
dogs[["weight_kg", "height_cm"]].agg(pct30)
```
output 2: 
```
weight_kg 22.6
height_cm 45.4
dtype: float64
```

example 3: multiple summaries
```python
def pct40(column):
    """Return the 40th percentile of a column."""
    return column.quantile(0.4)

dogs["weight_kg"].agg([pct30, pct40])
```
output 3:
```
pct30 22.6
pct40 24.0
Name: weight_kg, dtype: float64
```
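The three examples above can be combined. The sketch below, on made-up data, passes a list that mixes a custom function with a built-in function name, and then passes a dict to apply a different function to each column. The dict form is not covered in the notes above; it is included as one more standard way to call `.agg()`.

```python
import pandas as pd

# Hypothetical toy data, not the course dataset.
dogs = pd.DataFrame({
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24, 24, 17, 29],
})

def pct30(column):
    """Return the 30th percentile of a column."""
    return column.quantile(0.3)

# A list of functions on several columns gives one row per function
summary = dogs[["weight_kg", "height_cm"]].agg([pct30, "median"])

# A dict maps each column to its own function and returns a Series
per_column = dogs.agg({"weight_kg": "max", "height_cm": "min"})

print(summary)
print(per_column)
```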

(d). Cumulative sum
[![https://imgur.com/SIuU5kQ.png](https://imgur.com/SIuU5kQ.png)](https://imgur.com/SIuU5kQ.png)


(e). Cumulative statistics
- .cummax()
- .cummin()
- .cumprod()

### (Practice)

- Mean and median

Summary statistics are exactly what they sound like - they summarize many numbers in one statistic. For example, mean, median, minimum, maximum, and standard deviation are summary statistics. Calculating summary statistics allows you to get a better sense of your data, even if there's a lot of it.

sales is available and pandas is loaded as pd.

In [2]:
import pandas as pd
sales = pd.read_csv('./data/sales_subset.csv')

# Print the head of the sales DataFrame
print(sales.head())

# Print the info about the sales DataFrame (info() prints directly and returns None)
sales.info()

# Print the mean of weekly_sales
print(sales['weekly_sales'].mean())

# Print the median of weekly_sales
print(sales['weekly_sales'].median())

   Unnamed: 0  store type  department        date  weekly_sales  is_holiday  \
0           0      1    A           1  2010-02-05      24924.50       False   
1           1      1    A           1  2010-03-05      21827.90       False   
2           2      1    A           1  2010-04-02      57258.43       False   
3           3      1    A           1  2010-05-07      17413.94       False   
4           4      1    A           1  2010-06-04      17558.09       False   

   temperature_c  fuel_price_usd_per_l  unemployment  
0       5.727778              0.679451         8.106  
1       8.055556              0.693452         8.106  
2      16.816667              0.718284         7.808  
3      22.527778              0.748928         7.808  
4      27.050000              0.714586         7.808  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10774 entries, 0 to 10773
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------

- Summarizing dates

Summary statistics can also be calculated on date columns whose values have the data type datetime64. Some summary statistics, like the mean, are less meaningful for dates, but others, such as the minimum and maximum, are very useful because they show what time range your data covers.

sales is available and pandas is loaded as pd.

In [3]:
# Print the maximum of the date column
print(sales['date'].max())

# Print the minimum of the date column
print(sales['date'].min())

2012-10-26
2010-02-05
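The dates printed above have no time component, which suggests the column was read from the CSV as plain strings rather than datetime64. ISO-formatted strings still sort correctly, so `.min()` and `.max()` give the right answer here, but date arithmetic needs a real datetime column. A minimal sketch of the conversion, using a small made-up frame (`sales_demo` is a hypothetical name) in place of the real file:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the `date` column as read from CSV.
sales_demo = pd.DataFrame({"date": ["2010-02-05", "2012-10-26", "2011-06-17"]})

# Convert the strings to datetime64 so date arithmetic becomes possible
sales_demo["date"] = pd.to_datetime(sales_demo["date"])

first_date = sales_demo["date"].min()
last_date = sales_demo["date"].max()
covered = last_date - first_date  # a Timedelta, only available for real datetimes

print(first_date.date(), last_date.date(), covered)
```

Passing `parse_dates=["date"]` to `pd.read_csv` is another way to get the same result at load time.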


- Cumulative statistics

Cumulative statistics can also be helpful in tracking summary statistics over time. In this exercise, you'll calculate the cumulative sum and cumulative max of a department's weekly sales, which will allow you to identify what the total sales were so far as well as what the highest weekly sales were so far.

A DataFrame called sales_1_1 has been created for you, which contains the sales data for department 1 of store 1. pandas is loaded as pd.

In [4]:
# Stand-in for the provided sales_1_1: the first five rows, all from store 1, department 1
sales_1_1 = sales.head().copy()

# Sort sales_1_1 by date
sales_1_1 = sales_1_1.sort_values('date', ascending=True)

# Get the cumulative sum of weekly_sales, add as cum_weekly_sales col
sales_1_1['cum_weekly_sales'] = sales_1_1['weekly_sales'].cumsum()

# Get the cumulative max of weekly_sales, add as cum_max_sales col
sales_1_1['cum_max_sales'] = sales_1_1['weekly_sales'].cummax()

# See the columns you calculated
print(sales_1_1[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])

         date  weekly_sales  cum_weekly_sales  cum_max_sales
0  2010-02-05      24924.50          24924.50       24924.50
1  2010-03-05      21827.90          46752.40       24924.50
2  2010-04-02      57258.43         104010.83       57258.43
3  2010-05-07      17413.94         121424.77       57258.43
4  2010-06-04      17558.09         138982.86       57258.43
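The cell above builds `sales_1_1` from `sales.head()`. The exercise text describes it as all rows for department 1 of store 1, which could be selected with a boolean filter instead. The sketch below assumes that reading and uses a small made-up frame (`sales_demo` is a hypothetical name; the last two sales figures are invented) so it runs without the CSV.

```python
import pandas as pd

# Hypothetical toy rows; only the column names follow the sales data above.
sales_demo = pd.DataFrame({
    "store": [1, 1, 1, 2],
    "department": [1, 1, 2, 1],
    "date": ["2010-03-05", "2010-02-05", "2010-02-05", "2010-02-05"],
    "weekly_sales": [21827.90, 24924.50, 50605.27, 35034.06],
})

# Keep only store 1, department 1, then sort chronologically
mask = (sales_demo["store"] == 1) & (sales_demo["department"] == 1)
subset = sales_demo[mask].sort_values("date").copy()

# Running total and running maximum of weekly sales
subset["cum_weekly_sales"] = subset["weekly_sales"].cumsum()
subset["cum_max_sales"] = subset["weekly_sales"].cummax()

print(subset[["date", "weekly_sales", "cum_weekly_sales", "cum_max_sales"]])
```

Sorting before calling `.cumsum()` matters, because cumulative methods follow row order, not date order.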


## II. Counting
### (Theory)

1. Dropping duplicate names
```python
vet_visits.drop_duplicates(subset="name")
```
![{BBF1423B-AAD6-46D4-BC39-B906F995FE8D}.png](./images/{BBF1423B-AAD6-46D4-BC39-B906F995FE8D}.png)

2. Dropping duplicate pairs
```python
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])
print(unique_dogs)
```
![image.png](./images/image.png)
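The difference between the two calls is easiest to see when two dogs share a name. The sketch below uses a made-up `vet_visits` table (all values are assumptions) and then counts breeds with `.value_counts()`, a standard pandas method not covered in the notes above, so that each dog is counted once rather than once per visit.

```python
import pandas as pd

# Hypothetical visits: two different dogs are both called Max.
vet_visits = pd.DataFrame({
    "name": ["Max", "Max", "Max", "Bella", "Bella"],
    "breed": ["Labrador", "Labrador", "Chow Chow", "Poodle", "Poodle"],
    "visit_id": [1, 2, 3, 4, 5],
})

# Deduplicating on name alone merges the two Maxes into one row
by_name = vet_visits.drop_duplicates(subset="name")

# Deduplicating on the (name, breed) pair keeps them apart
unique_dogs = vet_visits.drop_duplicates(subset=["name", "breed"])

# Count dogs per breed using the deduplicated rows
breed_counts = unique_dogs["breed"].value_counts()

print(len(by_name), len(unique_dogs))
print(breed_counts)
```

By default `drop_duplicates` keeps the first occurrence of each duplicate group.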

