## Aggregating Data

Aggregation methods in pandas consolidate the values of a Series into a single scalar. These summary metrics are usually what your manager wants reported. Imagine you work at a clothing store and your manager walks in and asks how the store is performing. It wouldn't be appropriate to respond with, "Sarah bought a dress and a pair of shoes. John purchased a shirt and a hat. Lisa bought...".

Your manager isn't interested in such detailed information. What they really want to know is:

- The total number of customers who visited the store (count)
- The overall quantity of clothing items sold (count)
- The total revenue generated (sum)
- The distribution of customer visits throughout the day (skew)
- The average transaction amount (mean)

Aggregations enable you to condense intricate data into a single, meaningful metric.
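As a minimal sketch, the manager's questions can be answered with a handful of method calls. The `sales` table, its column names, and its values are hypothetical and exist only for illustration. The sketch assumes one row per transaction with an `items` quantity column, so the items sold are obtained with `sum` rather than `count`.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase (illustrative data only)
sales = pd.DataFrame({
    'customer': ['Sarah', 'John', 'Lisa', 'Sarah'],
    'items': [2, 2, 1, 3],
    'amount': [120.0, 45.5, 30.0, 80.5],
})

n_transactions = sales['customer'].count()   # number of recorded purchases
n_customers = sales['customer'].nunique()    # number of distinct customers
items_sold = sales['items'].sum()            # total quantity of items sold
revenue = sales['amount'].sum()              # total revenue
avg_transaction = sales['amount'].mean()     # average transaction amount

print(n_transactions, n_customers, items_sold, revenue, avg_transaction)
```

Each call reduces a whole column to one number, which is the kind of answer the manager is looking for.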

Here's a list of common aggregation methods in pandas along with explanations and examples:

1. **`count`**: Counts the number of non-null values in a column or series.
   ```python
   df['column_name'].count()
   ```

2. **`sum`**: Calculates the sum of values in a column or series.
   ```python
   df['column_name'].sum()
   ```

3. **`mean`**: Computes the average of values in a column or series.
   ```python
   df['column_name'].mean()
   ```

4. **`median`**: Determines the middle value in a sorted column or series.
   ```python
   df['column_name'].median()
   ```

5. **`min`**: Finds the minimum value in a column or series.
   ```python
   df['column_name'].min()
   ```

6. **`max`**: Finds the maximum value in a column or series.
   ```python
   df['column_name'].max()
   ```

7. **`std`**: Calculates the standard deviation of values in a column or series.
   ```python
   df['column_name'].std()
   ```

8. **`var`**: Computes the variance of values in a column or series.
   ```python
   df['column_name'].var()
   ```

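9. **`unique`**: Returns an array of the distinct values in a column or series. Unlike the other methods in this list, it does not reduce the data to a single scalar, but it is often used alongside them.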
   ```python
   df['column_name'].unique()
   ```

10. **`nunique`**: Counts the number of unique values in a column or series.
    ```python
    df['column_name'].nunique()
    ```

These are just a few examples of aggregation methods in pandas. The specific method you use will depend on the type of analysis you want to perform on your data.
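The difference between `unique` and `nunique` is easy to miss, so here is a small sketch on made-up data. The `colors` Series is an assumption for illustration; the point is that `unique` returns an array that includes the missing value, while `nunique` returns a scalar and ignores missing values unless `dropna=False` is passed.

```python
import pandas as pd

# Illustrative data with one repeated value and one missing value
colors = pd.Series(['red', 'blue', 'red', None, 'green'])

distinct = colors.unique()                        # array of distinct values, missing value included
n_distinct = colors.nunique()                     # scalar count, missing values excluded by default
n_with_missing = colors.nunique(dropna=False)     # scalar count, missing value counted as well

print(distinct)
print(n_distinct, n_with_missing)
```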

### `count` method

In pandas, the `count` function can be used both for Series and DataFrame objects. The behavior of `count` is slightly different depending on whether it is applied to a Series or a DataFrame.

1. **Series**:
   When `count` is applied to a Series, it returns the number of non-null values in that Series.

   Here's an example:

   ```python
   import pandas as pd
   
   # Create a sample Series with some missing values
   data = pd.Series([10, 20, None, 30, None])
   
   # Count non-null values in the Series
   count = data.count()
   
   # Display the result
   print(count)
   ```

   Output:
   ```
   3
   ```

   In this example, the Series contains five elements, including two missing values represented as `None`. The `count` function returns the count of non-null values, which is 3 in this case.

2. **DataFrame**:
   When `count` is applied to a DataFrame, it returns a Series that contains the count of non-null values for each column.

   Here's an example:

   ```python
   import pandas as pd
   
   # Create a sample DataFrame with missing values
   data = {'Name': ['John', 'Alice', None, 'Jane', 'Max'],
           'Age': [25, 22, None, 30, None],
           'City': ['New York', 'Paris', 'London', None, None]}
   df = pd.DataFrame(data)
   
   # Count non-null values in each column of the DataFrame
   counts = df.count()
   
   # Display the result
   print(counts)
   ```

   Output:
   ```
   Name    4
   Age     3
   City    3
   dtype: int64
   ```

   In this example, the DataFrame has three columns: 'Name', 'Age', and 'City'. Each column contains some missing values represented as `None`. The `count` function returns a Series where each index label is a column name and each value is the count of non-null values in that column. Here, the 'Name' column has 4 non-null values, the 'Age' column has 3 non-null values, and the 'City' column has 3 non-null values.

   Note that the `count` function excludes missing values from the count. If a Series or DataFrame column contains missing values, such as `None` or `NaN`, they are not considered in the count.
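To make the treatment of missing values concrete, the sketch below compares `count` with `len` and with `isna().sum()`. The data is made up for illustration, and the last lines use `axis=1` to count non-null values per row instead of per column.

```python
import numpy as np
import pandas as pd

# None and np.nan are both treated as missing
values = pd.Series([10, None, np.nan, 40])

total_rows = len(values)         # every element, missing or not
non_null = values.count()        # only the non-missing elements
missing = values.isna().sum()    # number of missing elements

print(total_rows, non_null, missing)

# On a DataFrame, axis=1 counts the non-null values in each row
people = pd.DataFrame({'Name': ['John', None], 'Age': [25, None]})
per_row = people.count(axis=1)
print(per_row)
```

Subtracting `count()` from `len()` is a quick way to see how many values are missing in a column.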

In [7]:
import pandas as pd

data = pd.Series([10, 20, None, 30, None])
count = data.count()
print(count)

3


In [8]:
data = {'Name': ['John', 'Alice', None, 'Jane', 'Max'],
        'Age': [25, 22, None, 30, None],
        'City': ['New York', 'Paris', 'London', None, None]}

df = pd.DataFrame(data)
df.count()

Name    4
Age     3
City    3
dtype: int64

In [9]:
import pandas as pd

df = pd.read_csv('./data/titanic.csv')
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [10]:
df['Survived'].count()

891

In [11]:
df['Name'].count()

891

In [12]:
df.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

### `sum` method

In pandas, the `sum` method is used to calculate the sum of values in a Series or DataFrame along a specified axis. The behavior of the `sum` method differs slightly between Series and DataFrames.

1. **Series**:
- If the Series contains numeric values, `sum` returns the sum of all the values in the Series.
- If the Series holds non-numeric values, `sum` does not convert them to numbers. It combines the elements with `+`, so a Series of strings is concatenated into one string, and a mix of incompatible types (for example integers and strings) raises a `TypeError`.

Here's an example of using `sum` with a Series:

``` python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])
total = data.sum()

print(total)  # Output: 15
```
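The behavior of `sum` on non-numeric data can be checked with a short sketch. The Series below are made-up examples, and `dtype=object` is set explicitly so the result does not depend on how a particular pandas version stores strings.

```python
import pandas as pd

# A Series of strings: sum combines the elements with +, i.e. concatenation
words = pd.Series(['a', 'b', 'c'], dtype=object)
joined = words.sum()
print(joined)

# Mixing integers and strings cannot be added together
mixed = pd.Series([1, 'b'], dtype=object)
try:
    mixed.sum()
    outcome = 'no error'
except TypeError:
    outcome = 'TypeError'
print(outcome)
```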

2. **DataFrame**:
- By default, if you apply `sum` to a DataFrame, it will return the sum of each column as a Series, with the column labels as the index.
- You can set the `axis` parameter to choose the direction: `axis=0` (the default) adds down the rows and returns one sum per column, while `axis=1` adds across the columns and returns one sum per row.

Here's an example of using `sum` with a DataFrame:

``` python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
column_sums = data.sum()

print(column_sums)
```

Output:
```
A     6
B    15
C    24
dtype: int64
```

In the example above, `sum` calculates the sum of each column in the DataFrame, returning a Series with the column labels as the index.

You can also use the axis parameter to calculate the row sums:

``` python
row_sums = data.sum(axis=1)
print(row_sums)
```

Output:
```
0    12
1    15
2    18
dtype: int64
```

In this case, `sum` calculates the sum of each row in the DataFrame, returning a Series with the row indices as the index.
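Real data usually contains missing values, so it helps to know how `sum` treats them. The following sketch uses made-up numbers and the `skipna` and `min_count` parameters of `sum`.

```python
import numpy as np
import pandas as pd

prices = pd.Series([10.0, np.nan, 5.0])

default_total = prices.sum()               # NaN is skipped by default
strict_total = prices.sum(skipna=False)    # NaN propagates into the result

# When every value is missing, the default result is 0.0
all_missing = pd.Series([np.nan, np.nan])
empty_total = all_missing.sum()
guarded_total = all_missing.sum(min_count=1)   # require at least one valid value, else NaN

print(default_total, strict_total, empty_total, guarded_total)
```

`min_count=1` is a reasonable choice when a total of zero would be misleading for a column that has no data at all.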

In [15]:
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])
total = data.sum()

print(total)

15


In [16]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9


In [17]:
df.sum()

A     6
B    15
C    24
dtype: int64

In [18]:
df.sum(axis=1)

0    12
1    15
2    18
dtype: int64

### `mean` method

The `mean` method is used to calculate the average or mean value of a Series or DataFrame along a specified axis. The behavior of the `mean` method differs slightly between Series and DataFrames.

1. **Series**:
- If the Series contains numeric values, `mean` returns the average of all the values in the Series.
- If the Series contains non-numeric values such as strings, `mean` cannot produce a meaningful average, and recent pandas versions raise a `TypeError`. Convert the values to numbers first, for example with `pd.to_numeric`.

Here's an example of using `mean` with a Series:

``` python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])
average = data.mean()

print(average)  # Output: 3.0
```
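Two situations come up often with `mean`: missing values and numbers stored as text. The sketch below uses made-up data to show that missing values are left out of the average, and that `pd.to_numeric` can convert numeric strings before averaging.

```python
import pandas as pd

# The missing value is ignored: the mean is (1 + 3) / 2
scores = pd.Series([1.0, None, 3.0])
scores_mean = scores.mean()
print(scores_mean)

# Numbers stored as strings are converted explicitly before averaging
raw = pd.Series(['1', '2', '3'])
as_numbers = pd.to_numeric(raw)
converted_mean = as_numbers.mean()
print(converted_mean)
```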

2. **DataFrame**:
- By default, if you apply `mean` to a DataFrame, it will return the mean value for each column as a Series, with the column labels as the index.
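- You can set the `axis` parameter to choose the direction: `axis=0` (the default) averages down the rows and returns one mean per column, while `axis=1` averages across the columns and returns one mean per row.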

Here's an example of using `mean` with a DataFrame:

``` python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
column_means = data.mean()

print(column_means)
```

Output:
```
A    2.0
B    5.0
C    8.0
dtype: float64
```

In the example above, `mean` calculates the mean value of each column in the DataFrame, returning a Series with the column labels as the index.

You can also use the axis parameter to calculate the row means:

``` python
row_means = data.mean(axis=1)
print(row_means)
```

Output:
```
0    4.0
1    5.0
2    6.0
dtype: float64
```

In this case, `mean` calculates the mean value of each row in the DataFrame, returning a Series with the row indices as the index.
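DataFrames often mix numeric and text columns. As a hedged sketch on a hypothetical `people` table, passing `numeric_only=True` restricts `mean` to the numeric columns so the text column is left out of the result.

```python
import pandas as pd

# Hypothetical table with one text column and two numeric columns
people = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Cy'],
    'age': [30, 40, 50],
    'height_cm': [160.0, 175.0, 181.0],
})

# Only 'age' and 'height_cm' appear in the result
numeric_means = people.mean(numeric_only=True)
print(numeric_means)
```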

In [23]:
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])
average = data.mean()

print(average)

3.0


In [24]:
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df.mean()

A    2.0
B    5.0
C    8.0
dtype: float64

In [25]:
df.mean(axis=1)

0    4.0
1    5.0
2    6.0
dtype: float64

### `median` method

The median method is used to calculate the median value of a series or a column in a DataFrame. The median is a statistical measure that represents the middle value of a dataset when it is sorted in ascending or descending order. 

1. **Series**:
When applied to a series, the median method returns the median value of the elements in that series. For example:

```python
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])
median_value = data.median()
print(median_value)
```
Output:
```
3.0
```

In this case, the median of the series `[1, 2, 3, 4, 5]` is 3.0.

2. **DataFrame**:
When applied to a DataFrame, the median method calculates the median of each column by default, working down the rows (`axis=0`). For example:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
median_values = data.median()
print(median_values)
```
Output:
```
A    3.0
B    8.0
dtype: float64
```

In this case, the median of column 'A' is 3.0, and the median of column 'B' is 8.0.

You can also calculate the median along a different axis by specifying the `axis` parameter. For example, setting `axis=1` will calculate the median for each row. 

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
median_values = data.median(axis=1)
print(median_values)
```
Output:
```
0     3.5
1     4.5
2     5.5
3     6.5
4     7.5
dtype: float64
```

In this case, the median of the first row is 3.5, the median of the second row is 4.5, and so on.

The median method is useful for calculating the central tendency of a dataset and can be particularly helpful in identifying the typical or middle value in a distribution, especially when dealing with skewed or non-normal datasets.
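The claim about skewed data can be illustrated with a small sketch. The numbers are made up: a single large outlier pulls the mean far from the typical value, while the median stays put. The last line shows that for an even number of values the median is the average of the two middle values.

```python
import pandas as pd

# One large outlier skews the data
incomes = pd.Series([1, 2, 3, 4, 100])

mean_value = incomes.mean()       # pulled upward by the outlier
median_value = incomes.median()   # still the middle value

print(mean_value, median_value)

# Even number of values: the median averages the two middle values (2 and 3)
even_median = pd.Series([1, 2, 3, 4]).median()
print(even_median)
```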

In [26]:
import pandas as pd

data = pd.Series([1, 2, 3, 4, 5])
median_value = data.median()
print(median_value)

3.0


In [27]:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})
df.median()

A    3.0
B    8.0
dtype: float64

In [28]:
df.median(axis=1)

0    3.5
1    4.5
2    5.5
3    6.5
4    7.5
dtype: float64

### `min` and `max` methods

The `min` and `max` methods compute the minimum and maximum values, respectively, of a pandas Series or DataFrame. Their usage is shown below with examples:

1. **Series**:
    - The `min` method returns the minimum value in a Series.
    
    ```python
    import pandas as pd
    
    # Create a Series
    series = pd.Series([5, 2, 8, 1, 10])
    
    # Compute the minimum value
    minimum = series.min()
    
    print(minimum)
    ```
    
    Output:
    ```
    1
    ```

    - The `max` method returns the maximum value in a Series.
    
    ```python
    import pandas as pd
    
    # Create a Series
    series = pd.Series([5, 2, 8, 1, 10])
    
    # Compute the maximum value
    maximum = series.max()
    
    print(maximum)
    ```
    
    Output:
    ```
    10
    ```

2. **DataFrame**:
    - The `min` method can also be used on a DataFrame to compute the minimum value for each column.
    
    ```python
    import pandas as pd
    
    # Create a DataFrame
    data = {'A': [5, 2, 8, 1, 10],
            'B': [4, 7, 3, 9, 6]}
    
    df = pd.DataFrame(data)
    
    # Compute the minimum value for each column
    minimum = df.min()
    
    print(minimum)
    ```
    
    Output:
    ```
    A    1
    B    3
    dtype: int64
    ```
    
    - Similarly, the `max` method can be used on a DataFrame to compute the maximum value for each column.
    
    ```python
    import pandas as pd
    
    # Create a DataFrame
    data = {'A': [5, 2, 8, 1, 10],
            'B': [4, 7, 3, 9, 6]}
    
    df = pd.DataFrame(data)
    
    # Compute the maximum value for each column
    maximum = df.max()
    
    print(maximum)
    ```
    
    Output:
    ```
    A    10
    B     9
    dtype: int64
    ```

    You can also pass `axis=1` to the `min` or `max` method of a DataFrame. The calculation then runs across the columns, returning one minimum or maximum per row.

    Here's an example to illustrate this:
    
    ```python
    import pandas as pd
    
    # Create a DataFrame
    df = pd.DataFrame({'A': [1, 2, 3],
                       'B': [4, 5, 6],
                       'C': [7, 8, 9]})
    
    # Compute the minimum of each row (axis=1)
    min_values = df.min(axis=1)
    print(min_values)
    ```
    
    Output:
    ```
    0    1
    1    2
    2    3
    dtype: int64
    ```
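The same idea works for `max`, and the two results can be combined. The sketch below is one possible illustration: the name `grid` is arbitrary, and the per-row range (maximum minus minimum) is an extra not covered above.

```python
import pandas as pd

grid = pd.DataFrame({'A': [1, 2, 3],
                     'B': [4, 5, 6],
                     'C': [7, 8, 9]})

row_max = grid.max(axis=1)               # largest value in each row
row_range = row_max - grid.min(axis=1)   # spread between largest and smallest per row
overall_max = grid.max().max()           # column maxima first, then the largest of those

print(row_max)
print(row_range)
print(overall_max)
```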

In [1]:
import pandas as pd

series = pd.Series([5, 2, 8, 1, 10])
series.min()

1

In [2]:
series.max()

10

In [10]:
data = {'A': [5, 2, 8, 1, 10],
        'B': [4, 7, 3, 9, 6]}

df = pd.DataFrame(data)
df

    A  B
0   5  4
1   2  7
2   8  3
3   1  9
4  10  6


In [11]:
df.min()

A    1
B    3
dtype: int64

In [4]:
df.max()

A    10
B     9
dtype: int64

In [8]:
df.min(axis=1)

0    4
1    2
2    3
3    1
4    6
dtype: int64

In [9]:
df.max(axis=1)

0     5
1     7
2     8
3     9
4    10
dtype: int64

### `std` and `var` methods

1. **Series**:
   - The `std` method calculates the standard deviation of the values in a Series. The standard deviation is a measure of the amount of variation or dispersion in the data. It quantifies how much the values in the Series deviate from the mean. The lower the standard deviation, the closer the values are to the mean, indicating less variation.
   - The `var` method calculates the variance of the values in a Series. The variance measures how spread out the values are around the mean, and the standard deviation is its square root. By default pandas computes the sample variance: the sum of the squared differences between each value and the mean, divided by `n - 1` (`ddof=1`). Pass `ddof=0` to divide by `n` and get the population variance instead.

2. **DataFrame**:
   - By default, the `std` method calculates the standard deviation of each column in the DataFrame. It returns a Series containing the standard deviation values for each column. The resulting Series has the column names as the index.
   - By default, the `var` method calculates the variance of each column in the DataFrame. It returns a Series containing the variance values for each column. The resulting Series has the column names as the index.
   - You can also pass `axis=1` argument to `std` or `var` methods to calculate standard deviation or variance along rows.

Here are some examples to illustrate the usage of `std` and `var` methods:

1. Working with a Series:
```python
import pandas as pd

# Create a Series
s = pd.Series([1, 2, 3, 4, 5])

# Calculate the standard deviation
std_value = s.std()
print("Standard Deviation:", std_value)

# Calculate the variance
var_value = s.var()
print("Variance:", var_value)
```

Output:
```
Standard Deviation: 1.5811388300841898
Variance: 2.5
```

In this example, the `std` method calculates the standard deviation of the values in the Series, which is approximately 1.58. The `var` method calculates the variance, which is 2.5.
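The value 2.5 may be surprising if you expect the plain average of the squared deviations, which would be 2.0. The sketch below shows where the difference comes from: pandas divides by `n - 1` by default (`ddof=1`), and `ddof=0` switches to dividing by `n`. The manual calculation is included only as a cross-check.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

sample_var = s.var()              # default ddof=1: divide by n - 1
population_var = s.var(ddof=0)    # ddof=0: divide by n

# Manual sample variance as a cross-check
squared_diffs = (s - s.mean()) ** 2
manual_var = squared_diffs.sum() / (len(s) - 1)

# The standard deviation is the square root of the variance
std_squared = s.std() ** 2

print(sample_var, population_var, manual_var, std_squared)
```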

2. Working with a DataFrame:
```python
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 4, 6, 8, 10]})

# Calculate the standard deviation of each column
std_values = df.std()
print("Standard Deviation:")
print(std_values)

# Calculate the variance of each column
var_values = df.var()
print("Variance:")
print(var_values)
```

Output:
```
Standard Deviation:
A    1.581139
B    3.162278
dtype: float64
Variance:
A     2.5
B    10.0
dtype: float64
```

In this example, the `std` method calculates the standard deviation of each column in the DataFrame, resulting in a Series with the column names as the index. The `var` method calculates the variance of each column, also producing a Series with the column names as the index.
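For completeness, here is a sketch of the `axis=1` variant on the same kind of data. Each row has only two values, so the row-wise results are mainly useful for seeing the mechanics rather than for drawing statistical conclusions.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 4, 6, 8, 10]})

# One value per row, computed across columns A and B
row_var = df.var(axis=1)
row_std = df.std(axis=1)

print(row_var)
print(row_std)
```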

In [13]:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5])

In [14]:
s.std()

1.5811388300841898

In [15]:
s.var()

2.5

In [17]:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                   'B': [2, 4, 6, 8, 10]})
df

   A   B
0  1   2
1  2   4
2  3   6
3  4   8
4  5  10


In [18]:
df.std()

A    1.581139
B    3.162278
dtype: float64

In [19]:
df.var()

A     2.5
B    10.0
dtype: float64