# Manipulation Techniques

## Combining Data

There are three main ways to combine pandas DataFrames:

1. Concatenate
2. Merge
3. Join
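
As a minimal sketch of the three approaches, the snippet below uses two small hypothetical frames (`left` and `right`, with made-up column names) and combines them each way. The defaults noted in the comments are the pandas defaults; everything else is illustrative.

```python
import pandas as pd

# Hypothetical example frames; names and values are illustrative only
left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

# 1. Concatenate: stack frames along an axis (rows by default)
stacked = pd.concat([left, right], ignore_index=True)

# 2. Merge: match rows on the values of one or more columns (inner join by default)
merged = pd.merge(left, right, on='key')

# 3. Join: match rows on the index (left join by default)
joined = left.set_index('key').join(right.set_index('key'))

print(stacked)
print(merged)
print(joined)
```

Concatenation keeps every row from both frames and fills the missing columns with `NaN`, while merge and join line rows up by key.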

#### todo: Add graphic on concatenate vs merge vs join here



In [None]:
# Setup default imports here
import pandas as pd
import numpy as np


In [None]:
# Example on concatenation

In [None]:
# Example on Merge

In [None]:
# Example on Join

#### Subtle differences (Merge vs Join)
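
One way to see the difference, sketched with hypothetical `employees` and `departments` frames: `merge` matches on columns and performs an inner join by default, while `join` matches against the index of the other frame and performs a left join by default. When `merge` is given the equivalent arguments, the two should produce the same result.

```python
import pandas as pd

# Hypothetical frames; departments is indexed by the department id
employees = pd.DataFrame({'dept_id': [1, 2, 3],
                          'name': ['John', 'Jane', 'Alice']})
departments = pd.DataFrame({'dept_name': ['Sales', 'HR']}, index=[1, 2])

# merge: spell out which column and which index to match, and the join type
via_merge = employees.merge(departments, left_on='dept_id',
                            right_index=True, how='left')

# join: matches the caller's `on` column against the other frame's index,
# using a left join unless `how` says otherwise
via_join = employees.join(departments, on='dept_id')

print(via_join)
print(via_merge.equals(via_join))
```

In this sketch `join` is shorthand for an index-based left merge, which is why Alice (with no matching department) is kept with a missing `dept_name`.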


In [None]:
# performance test



## Filtering Data
Filtering in Pandas means selecting specific rows or columns from a DataFrame based on conditions. There are two main ways to filter data in Pandas:
1. conditional statements / boolean indexing
2. query expressions

In [None]:
# Create a DataFrame
data = {'Name': ['John', 'Jane', 'Alice'],
        'Age': [25, 30, 28],
        'Salary': [50000, 60000, 55000]}
df = pd.DataFrame(data)

# Filter rows where age is greater than 25

In [None]:
# 1. Conditional statements / Boolean Indexing
filtered_df = df[df['Age'] > 25]

# df['Age'] > 25 builds a boolean Series; indexing df with it keeps only the rows marked True
print(filtered_df)


In [None]:
# 2. Filter rows using a query expression
filtered_df = df.query('Age > 25')

print(filtered_df)


### Comparison
Performance differences between boolean indexing and query expressions depend on the size of the DataFrame and the complexity of the condition.

For large DataFrames, query expressions can be faster. `DataFrame.query()` evaluates the expression string through `pandas.eval()`, which uses the `numexpr` engine when that library is installed and can avoid building a separate full-size intermediate array for each sub-condition.

Boolean indexing, on the other hand, is more flexible: the mask is ordinary Python code, so it can use any pandas or NumPy operation and gives explicit control over the filtering process.

For small DataFrames or simple conditions the difference is usually negligible, and the cost of parsing the query string can make `query()` the slower option. The gap becomes more noticeable as the size of the DataFrame and the complexity of the condition increase.
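
A rough way to check this outside a notebook is sketched below with the standard-library `timeit` module. The frame size, the name `big`, and the repeat count are arbitrary choices for illustration, and the timings depend on the machine and on whether `numexpr` is installed.

```python
import timeit

import numpy as np
import pandas as pd

# Illustrative data: size and column names are assumptions for this sketch
rng = np.random.default_rng(0)
n = 1_000_000
big = pd.DataFrame({'A': rng.integers(0, 1000, n),
                    'B': rng.integers(0, 1000, n)})

mask_result = big[(big['A'] > 500) & (big['B'] < 500)]
query_result = big.query('A > 500 and B < 500')

# Both approaches select exactly the same rows
assert mask_result.equals(query_result)

# Time each approach over a few repetitions
t_mask = timeit.timeit(lambda: big[(big['A'] > 500) & (big['B'] < 500)], number=5)
t_query = timeit.timeit(lambda: big.query('A > 500 and B < 500'), number=5)
print(f'boolean indexing: {t_mask:.3f}s, query: {t_query:.3f}s')
```

Checking that the two results are equal before timing them guards against comparing filters that do not actually express the same condition.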

In [None]:
# Performance example
# Create a large DataFrame with random values
np.random.seed(0)
df = pd.DataFrame({
    'A': np.random.randint(0, 1000, 10 ** 7),
    'B': np.random.randint(0, 1000, 10 ** 7),
    'C': np.random.randint(0, 1000, 10 ** 7),
    'D': np.random.randint(0, 1000, 10 ** 7),
})


In [None]:
# Time boolean indexing and a query expression on the same condition

In [None]:
%%timeit
conditional_filtered = df[(df['A'] > 500) & (df['B'] < 500)]

In [None]:
%%timeit
query_filtered = df.query('A > 500 and B < 500')

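In [None]:
# Variables assigned inside a %%timeit cell are not kept, so recompute the result
conditional_filtered = df[(df['A'] > 500) & (df['B'] < 500)]
conditional_filtered

In [None]:
query_filtered = df.query('A > 500 and B < 500')
query_filtered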

## Grouping and Aggregating Data


In [38]:
# Create a DataFrame
data = {
    'Name': ['John', 'Jane', 'John', 'Alice', 'Jane'],
    'Age': [25, 30, 28, 32, 27],
    'Salary': [50000, 60000, 55000, 70000, 65000]
}
df = pd.DataFrame(data)

In [41]:
# Basic grouping with aggregation
# Group by 'Name' and calculate the average salary
grouped_df = df.groupby('Name')['Salary'].mean()

print(grouped_df)

Name
Alice    70000.0
Jane     62500.0
John     52500.0
Name: Salary, dtype: float64


In [40]:
# Grouping with Multiple Columns and Aggregations:
# Group by 'Name' and 'Age', calculate the average salary and maximum age
grouped_df = df.groupby(['Name', 'Age']).agg({'Salary': 'mean', 'Age': 'max'})
print(grouped_df)

            Salary  Age
Name  Age              
Alice 32   70000.0   32
Jane  27   65000.0   27
      30   60000.0   30
John  25   50000.0   25
      28   55000.0   28


In [51]:
# Grouping with Custom Aggregation Functions:
# Define a custom aggregation function
def salary_range(series):
    return series.max() - series.min()


# Group by 'Name' and calculate the salary range using the custom function
grouped_df = df.groupby('Name').agg({'Salary': salary_range})
print(grouped_df)

       Salary
Name         
Alice       0
Jane     5000
John     5000
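
The dictionary form used above can also be written with named aggregation, which lets each output column get its own name. The sketch below rebuilds the same example data so it runs on its own; the output column names (`avg_salary`, `salary_range`, `oldest`) are chosen here for illustration.

```python
import pandas as pd

# Same example data as above, repeated so the snippet is self-contained
df = pd.DataFrame({
    'Name': ['John', 'Jane', 'John', 'Alice', 'Jane'],
    'Age': [25, 30, 28, 32, 27],
    'Salary': [50000, 60000, 55000, 70000, 65000],
})

# Named aggregation: output_name=(input_column, aggregation)
summary = df.groupby('Name').agg(
    avg_salary=('Salary', 'mean'),
    salary_range=('Salary', lambda s: s.max() - s.min()),
    oldest=('Age', 'max'),
)
print(summary)
```

Each keyword becomes a column in the result, so several aggregations of the same input column can sit side by side without a column MultiIndex.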


## Reshaping Data
Reshaping data in Pandas involves transforming the structure of a DataFrame to make it more suitable for analysis or presentation. There are several functions in Pandas that can be used for reshaping data.

### Long to Wide
pivot(): The pivot() function reshapes a DataFrame by using the unique values of one or more columns as the new column labels. It returns a new DataFrame and is most commonly used to reshape data from long to wide format.

### Wide to Long
melt(): The melt() function is used to unpivot or melt a DataFrame, transforming it from a wide format to a long format.
It gathers multiple columns into key-value pairs, creating a new DataFrame.
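
A minimal round trip between the two formats might look like the sketch below. The salary-by-year data and the names `long_df`, `wide_df`, and `back_to_long` are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format data: one row per (Name, Year) observation
long_df = pd.DataFrame({
    'Name': ['John', 'John', 'Jane', 'Jane'],
    'Year': [2022, 2023, 2022, 2023],
    'Salary': [50000, 55000, 60000, 65000],
})

# Long to wide: one row per Name, one column per Year
wide_df = long_df.pivot(index='Name', columns='Year', values='Salary')

# Wide to long: gather the year columns back into (Year, Salary) pairs
back_to_long = wide_df.reset_index().melt(
    id_vars='Name', var_name='Year', value_name='Salary')

print(wide_df)
print(back_to_long)
```

Note that `pivot()` raises a `ValueError` when an index/column pair appears more than once; `pivot_table()`, which aggregates duplicates, is the usual alternative in that case.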

In [None]:
# Given data in long format, how would you reshape it to wide format?


In [None]:
# Given the same data but in wide format, how would you transform it to long format?


The choice between wide format and long format depends on the specific circumstances and the type of analysis or presentation you want to perform. Here are some considerations to help you decide:

Wide Format:
- Suitable when you have a small number of variables and want to display data in a compact, easily readable format.
- Convenient for data entry or manual editing, as each variable has its own column.
- Useful when performing operations that require calculations across multiple variables.
- Commonly used for exporting data to other software or systems that expect a wide format.

Long Format:
- Suitable when you have a large number of variables or variable categories and want to store and analyze them more efficiently.
- Enables easy expansion of the dataset by adding new categories or variables.
- Ideal for performing aggregations, transformations, and analysis using groupby, pivot, or melt operations.
- Facilitates easier merging and joining of datasets with different variables or categories.
- Often preferred for visualization purposes, as it allows for flexible plotting and faceting.

In summary, the wide format is advantageous when the focus is on a compact representation of data or when working with a small number of variables. On the other hand, the long format is more flexible for analysis, transformation, and visualization, especially when dealing with a large number of variables or when the dataset structure may evolve over time.