<a href="https://colab.research.google.com/github/hewp84/CRT420/blob/main/Pandas_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# PANDAS: Grouping & Reshaping

## Grouping

Grouping is a fundamental operation that allows us to split our data into groups based on specific criteria, and then perform operations on each group separately. 

**Syntax:**

`df.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<no_default>, observed=False, dropna=True)`

**Parameters:**

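* by : Mapping, function, label, or list of labels that determines the groups. Most often a column name or a list of column names.
* axis : Axis to split along. 0 for rows and 1 for columns. Default is 0. (Deprecated in pandas 2.1.)
* level : If the axis is a MultiIndex, group by a particular level or levels.
* as_index : If True, the group labels become the index of the aggregated result; if False, they are kept as regular columns. Default is True.
* sort : Sort the group keys if True. Default is True.
* group_keys : When calling `apply()`, add the group keys to the index of the result. Default is True.
* squeeze : Reduce the dimensionality of the return type if possible. Deprecated in pandas 1.1 and removed in pandas 2.0.
* observed : Only applies to categorical groupers. If True, show only the observed category values; if False, show all categories.
* dropna : If True, rows whose group key is NaN are dropped; if False, NaN is treated as its own group. Default is True.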

**Returns:**

A groupby object that contains information about the groups.
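
The effect of `as_index` and `sort` is easiest to see side by side. The following is a minimal sketch on a small made-up DataFrame (`toy` and its values are assumptions for illustration, not part of the notebook's data):

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the parameters
toy = pd.DataFrame({
    'Category': ['B', 'A', 'B', 'A'],
    'Value': [1, 2, 3, 4],
})

# Default (as_index=True): the group labels become the index of the result
by_index = toy.groupby('Category').sum()

# as_index=False: the group labels stay as a regular column
as_column = toy.groupby('Category', as_index=False).sum()

# sort=False: groups keep the order in which they first appear in the data
unsorted = toy.groupby('Category', sort=False).sum()

print(by_index)
print(as_column)
print(unsorted)
```

`as_index=False` is often convenient when the result will be merged or plotted, since the group labels are then an ordinary column.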

In [None]:
# Import necessary libraries
import pandas as pd

# Create a sample DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35]
}

df = pd.DataFrame(data)

# Display the DataFrame
df


To group data in Pandas, we can use the `groupby()` method, which is often followed by an aggregation operation. Let's start by grouping our sample DataFrame `df` by the 'Category' column.


In [None]:
# Group the DataFrame by 'Category'
grouped = df.groupby(by='Category')

# Display the grouped object
grouped

In [None]:
# Calculate the mean value for each group
grouped_mean = grouped.mean()

# Display the mean values
grouped_mean


In [None]:
# Try other aggregation methods such as sum, min, max, count, etc.


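#### Iterating Over Groups

A groupby object can also be iterated over. Each iteration yields the group name together with the sub-DataFrame for that group, so we can perform custom operations on each group. Let's print the groups and their corresponding data: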

In [None]:
# Iterate through groups and print the group name and data
for name, group in grouped:
    print(f"Group: {name}")
    print(group)
    print()
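
When only one group is needed, looping is not required. The sketch below rebuilds the sample data under new names (`df_demo` and `grouped_demo` are illustrative names chosen so the notebook's own variables are not overwritten) and inspects the groups directly:

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell runs on its own
df_demo = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35],
})
grouped_demo = df_demo.groupby('Category')

# .groups maps each group label to the row labels that belong to it
print(grouped_demo.groups)

# get_group() returns the rows of a single group as a DataFrame
group_a = grouped_demo.get_group('A')
print(group_a)

# size() counts the rows in each group
group_sizes = grouped_demo.size()
print(group_sizes)
```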


#### Applying Multiple Aggregations

Pandas allows us to apply multiple aggregation functions at once using the `agg()` method. Let's calculate both the mean and sum for each group.

In [None]:
# Calculate both mean and sum for each group
grouped_agg = grouped['Value'].agg(['mean', 'sum'])

# Display the aggregated results
grouped_agg
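
If the default column names (`mean`, `sum`) are not descriptive enough, `agg()` also accepts named aggregations of the form `new_name=(column, function)`. A minimal sketch, with `df_demo`, `mean_value` and `total_value` being names chosen here for illustration:

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell runs on its own
df_demo = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35],
})

# Named aggregation: each keyword becomes an output column name
named = df_demo.groupby('Category').agg(
    mean_value=('Value', 'mean'),
    total_value=('Value', 'sum'),
)
print(named)
```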


#### Custom Aggregation Functions

You can define custom aggregation functions to apply to groups. Let's create a custom function to calculate the range for each group.

In [None]:
# Custom aggregation function to calculate the range
def custom_range(x):
    return x.max()-x.min()

# Apply the custom aggregation function
range_result = grouped['Value'].agg(custom_range)

# Display the range for each group
range_result
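
Custom functions can be mixed with built-in aggregations in the same `agg()` call. The sketch below assumes the same sample data; `value_range` is a hypothetical helper equivalent to `custom_range` above:

```python
import pandas as pd

# Same sample data as above, rebuilt so this cell runs on its own
df_demo = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Value': [10, 15, 20, 25, 30, 35],
})

def value_range(x):
    """Difference between the largest and smallest value in the group."""
    return x.max() - x.min()

# Strings name built-in aggregations; functions are applied per group.
# The output column for a function takes the function's name.
summary = df_demo.groupby('Category')['Value'].agg(['min', 'max', value_range])
print(summary)
```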


#### Real life dataset: ATP Tour 2013-2023

In [None]:
atp = pd.read_csv('atp_tennis.csv')
atp

In [None]:
#Filtering only for Hard surface data
hard = atp[atp['Surface'] == 'Hard']
hard

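What is the lowest-ranked player competing in each round of the ATP Tour on hard surfaces? A larger rank number means a lower ranking, so the `max` of each rank column answers the question, while `min` gives the highest-ranked player.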

In [None]:
r_atp = hard.groupby('Round')
# Use a dictionary to specify aggregation functions for each column
agg_dict = {
    'Rank_1': ['max', 'min'],
    'Rank_2': ['max', 'min']
}

# Apply .agg() using the dictionary
atp_agg = r_atp.agg(agg_dict)
atp_agg.rename(columns={'Rank_1': 'Rank 1'}, inplace=True)
# Exercise: rename the remaining columns (e.g. 'Rank_2') in the same way

# Display the aggregated DataFrame
atp_agg
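
Aggregating with a dictionary of lists produces two-level (MultiIndex) column names such as `('Rank_1', 'max')`. One common way to simplify them is to join the two levels into a single name. This is a sketch on made-up data: the `matches` DataFrame and its values are assumptions standing in for `atp_tennis.csv`, and only the column names `Round`, `Rank_1` and `Rank_2` come from the cell above.

```python
import pandas as pd

# Hypothetical stand-in for the ATP data (values are invented)
matches = pd.DataFrame({
    'Round': ['1st Round', '1st Round', 'The Final', 'The Final'],
    'Rank_1': [5, 120, 1, 3],
    'Rank_2': [80, 40, 2, 10],
})

flat = matches.groupby('Round').agg({
    'Rank_1': ['max', 'min'],
    'Rank_2': ['max', 'min'],
})

# Each column label is a tuple like ('Rank_1', 'max'); join it into 'Rank_1_max'
flat.columns = ['_'.join(col) for col in flat.columns]
print(flat)
```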

## Reshaping

### Pandas Pivot and Pivot Table
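Pandas provides the `pivot()` and `pivot_table()` functions to reshape data into a summarized table for analysis. Let's explore how to use these functions with some examples.

#### Pivot
The `pivot()` function reshapes a DataFrame by turning the unique values of one column into the new column headers and the unique values of another column into the row index. It does not aggregate, so each index/column pair must appear only once in the data.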

**Syntax:**

`df.pivot(index=None, columns=None, values=None)`

**Parameters:**

* index - Column to use as the row index.
* columns - Column to use as the column index.
* values - Column(s) whose values fill the cells of the new DataFrame. No aggregation is performed.

**Returns:**

Pivoted dataframe.

In [None]:
import pandas as pd

data = {'Brand': ['Toyota','Honda','Toyota','Ford','Honda','Toyota'], 
        'Model': ['Corolla','Civic','Camry','Focus','Accord','Prius'],
        'Year': [2018,2019,2020,2018,2019,2021],
        'Price': [20000,22000,25000,21000,24000,28000],
        'Kilometers':[30000,27000,20000,26000,25000,19000]}

df = pd.DataFrame(data)
df

Now we can pivot the DataFrame with 'Brand' as the index, 'Model' as the columns, and 'Price' as the values:

In [None]:
df1 = df.pivot(index='Brand', columns='Model', values='Price')
#df1.fillna('', inplace=True)
df1
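
Because `pivot()` does not aggregate, it raises a `ValueError` when the same index/column pair occurs more than once. The sketch below uses a small made-up DataFrame (`dupes` is an assumption for illustration) to show the failure and how `pivot_table()` handles the same data by averaging the duplicates:

```python
import pandas as pd

# Hypothetical data: Toyota appears twice for the year 2018
dupes = pd.DataFrame({
    'Brand': ['Toyota', 'Toyota', 'Honda'],
    'Year': [2018, 2018, 2019],
    'Price': [20000, 25000, 22000],
})

try:
    dupes.pivot(index='Brand', columns='Year', values='Price')
    pivot_failed = False
except ValueError as err:
    # pivot() cannot decide which of the two Toyota/2018 prices to keep
    pivot_failed = True
    print(f"pivot() failed: {err}")

# pivot_table() aggregates the duplicates instead (mean by default)
averaged = dupes.pivot_table(index='Brand', columns='Year', values='Price', aggfunc='mean')
print(averaged)
```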

#### Pivot Table
The pivot_table() function is similar to pivot() but provides more flexibility in calculating and aggregating data in the reshaped table.

**Syntax:**

`df.pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All')`

**Parameters:**


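* values - Column(s) to aggregate.
* index - Column(s) whose values become the row index.
* columns - Column(s) whose values become the column headers.
* aggfunc - Aggregation function(s) such as 'mean', 'sum', 'count'. Default is 'mean'.
* fill_value - Value used to replace missing values in the resulting table.
* margins - If True, add an extra row and column containing the aggregate across all rows/columns.
* dropna - If True, do not include columns whose entries are all NaN. Default is True.
* margins_name - Name of the row/column that holds the margins when `margins=True`. Default is 'All'.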

**Returns:**

Pivoted dataframe.

In [None]:
data = {'Brand': ['Toyota','Honda','Toyota','Ford','Honda','Toyota', 'Tesla', 'Toyota', 'Honda'], 
        'Model': ['Corolla','Civic','Camry','Focus','Accord','Prius', 'Model 3', 'RAV4', 'CR-V'],
        'Year': [2018,2019,2020,2018,2019,2021, 2020, 2019, 2020],
        'Price': [20000,22000,25000,21000,24000,28000, 50000, 28000, 27000],
        'Kilometers':[30000,27000,20000,26000,25000,19000, 10000, 31000, 22000]}

df = pd.DataFrame(data)
df

In [None]:
df.pivot_table(values='Price', index='Brand', columns='Model', aggfunc='mean')
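
Combinations that never occur in the data show up as `NaN` in the pivoted table. The `fill_value` parameter replaces them. A minimal sketch on made-up data (`sales` and its values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: Honda has no 2018 entry and two 2019 entries
sales = pd.DataFrame({
    'Brand': ['Toyota', 'Toyota', 'Honda', 'Honda'],
    'Year': [2018, 2019, 2019, 2019],
    'Price': [20000, 28000, 22000, 24000],
})

# Missing Brand/Year combinations are filled with 0 instead of NaN
filled = sales.pivot_table(values='Price', index='Brand', columns='Year',
                           aggfunc='mean', fill_value=0)
print(filled)
```

Whether 0 is a sensible placeholder depends on the analysis; for prices, leaving `NaN` may be less misleading than a zero.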

In [None]:
#Multiple aggregated values
df.pivot_table(values=['Price', 'Kilometers'], index='Brand', columns='Model',aggfunc=['mean', 'max'])

In [None]:
# margins=True adds an 'All' row and column holding the overall means
df.pivot_table(values='Price', index='Brand', columns='Model', aggfunc='mean', margins=True)
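
The margin row and column can be renamed with `margins_name`. The sketch below, on made-up data (`sales` and the label `'Overall'` are assumptions for illustration), also shows that the margins are computed from the underlying rows with the same `aggfunc`:

```python
import pandas as pd

# Hypothetical data (values are invented)
sales = pd.DataFrame({
    'Brand': ['Toyota', 'Toyota', 'Honda', 'Honda'],
    'Year': [2018, 2020, 2019, 2019],
    'Price': [20000, 25000, 22000, 24000],
})

with_margins = sales.pivot_table(values='Price', index='Brand', columns='Year',
                                 aggfunc='mean', margins=True,
                                 margins_name='Overall')
print(with_margins)

# The bottom-right cell is the mean of every price in the data
overall_mean = with_margins.loc['Overall', 'Overall']
print(overall_mean)
```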

### Web Scraping OCU's Sports Statistics

In [None]:
# URL of the webpage containing the HTML table(s)
url = 'https://gomightyoaks.com/sports/baseball/stats/2022'

# Read the HTML tables from the webpage
baseball = pd.read_html(url)

# pd.read_html() returns a list of DataFrames, one per HTML table on the page
# Access each table with baseball[index], where index is the position of the table you want

# For example, to extract the first table
ocu = baseball[0]
ocu

In [None]:
ocu.pivot_table(values='R', index='Opponent', columns='W/L', aggfunc='mean', margins=True)

### Melt
The `melt()` function in pandas unpivots your data from wide to long format. The columns being melted are collapsed into two columns: one holding the original column names and one holding their values.

**Syntax:**

`df.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)`

**Parameters:**

* id_vars : Column(s) to use as identifier variables.
* value_vars : Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.
* var_name : Name to use for the 'variable' column. Default is 'variable'.
* value_name : Name to use for the 'value' column. Default is 'value'.
* col_level : If columns are a MultiIndex then use this level to melt.

**Returns:**

Unpivoted DataFrame in long format.

In [None]:
cars = pd.DataFrame({'Car Model': ['Prius', 'CX-5', 'Tesla Model 3', 'Camry'],
                  'MPG': [50, 25, 30, 28],
                  'Horsepower': [120, 187, 258, 203], 
                  'Weight': [3000, 3500, 4079, 3300]})
cars

In [None]:
melted_df = cars.melt(id_vars='Car Model', 
                    value_vars=['MPG', 'Horsepower', 'Weight'])

melted_df

In [None]:
#Renaming columns
melted_df = cars.melt(id_vars='Car Model', 
                    value_vars=['MPG', 'Horsepower', 'Weight'],
                    var_name='Measurement',
                    value_name='Value')

melted_df
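
`melt()` and `pivot()` are inverse operations: a melted DataFrame can be pivoted back to its wide form. A minimal sketch using the same car data, rebuilt under new names (`cars_demo`, `long_df` and `wide_df` are illustrative names):

```python
import pandas as pd

# Same car data as above, rebuilt so this cell runs on its own
cars_demo = pd.DataFrame({
    'Car Model': ['Prius', 'CX-5', 'Tesla Model 3', 'Camry'],
    'MPG': [50, 25, 30, 28],
    'Horsepower': [120, 187, 258, 203],
    'Weight': [3000, 3500, 4079, 3300],
})

# Wide -> long: value_vars is omitted, so every non-id column is melted
long_df = cars_demo.melt(id_vars='Car Model',
                         var_name='Measurement',
                         value_name='Value')

# Long -> wide: each Measurement becomes a column again
wide_df = long_df.pivot(index='Car Model', columns='Measurement', values='Value')

print(long_df.shape)
print(wide_df)
```

Note that the round trip sorts the rows and columns alphabetically and moves `Car Model` into the index; `wide_df.reset_index()` turns it back into a regular column.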