# Combining Datasets with Pandas

Pandas provides three ways to combine DataFrames: merging, joining, and concatenating. Bringing related tables together lets you analyze information that would otherwise be spread across separate files.

This section will primarily focus on the `.merge()` method and briefly introduce `.join()` and `pd.concat()`.

## Our Practice Data: The SpongeBob SquarePants Dataset

<center><img src="../images/stock/pexels-spencphoto-29901208.jpg"></center>

To illustrate dataset combination, we'll use synthetic data inspired by SpongeBob SquarePants. Because the datasets are small and fabricated, we can focus on the merging process itself.

**Data Files:**

* **`spongebob1.csv`**:
    * `id`: Unique identifier for each character.
    * `name`: Character's name.
    * `job`: Character's occupation.
    * `species`: Character's biological classification.

* **`spongebob2.csv`**:
    * `id`: Unique identifier for each character.
    * `name`: Character's name.
    * `age`: Character's age.
    * `personality`: General personality traits of the character.

## Getting Started

### Importing the Tools

First, let's import the Pandas library, which is essential for data manipulation in Python:


In [None]:
## Begin Example
import pandas as pd
## End Example

### Loading the Datasets

Next, we'll load the two CSV files into separate Pandas DataFrames.

File Locations:

* `../data/spongebob1.csv`
* `../data/spongebob2.csv`

Suggested DataFrame names:

* `spongebob_df1`
* `spongebob_df2`

In [None]:
## Begin Example

## End Example
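As a minimal sketch of the loading step, the block below reads two small in-memory CSV strings with `pd.read_csv`. The rows are hypothetical stand-ins invented for illustration (the contents of the real files may differ), and `csv_text_1` and `csv_text_2` are names assumed here. Only the column names come from the file descriptions above.

```python
import io

import pandas as pd

# Hypothetical stand-in for ../data/spongebob1.csv; the real rows may differ.
csv_text_1 = """id,name,job,species
1,SpongeBob,Fry Cook,Sponge
2,Patrick,Unemployed,Starfish
3,Squidward,Cashier,Octopus
"""

# Hypothetical stand-in for ../data/spongebob2.csv; the real rows may differ.
csv_text_2 = """id,name,age,personality
1,SpongeBob,20,Optimistic
2,Patrick,22,Carefree
4,Sandy,25,Adventurous
"""

# pd.read_csv accepts a file path or any file-like object.
spongebob_df1 = pd.read_csv(io.StringIO(csv_text_1))
spongebob_df2 = pd.read_csv(io.StringIO(csv_text_2))

# With the real files, the calls would look like:
# spongebob_df1 = pd.read_csv("../data/spongebob1.csv")
# spongebob_df2 = pd.read_csv("../data/spongebob2.csv")
```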



### Inspect the Datasets

Before combining, it's good practice to inspect the structure and contents of each DataFrame. Let's use the `.info()` and `.head()` methods to get a quick overview.

In [None]:
## Begin Example

## End Example
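The inspection step might look like the sketch below. It assumes a small hypothetical DataFrame in place of the real `spongebob1.csv` data, and `first_rows` is a name introduced only for this example.

```python
import pandas as pd

# Hypothetical rows standing in for spongebob1.csv.
spongebob_df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["SpongeBob", "Patrick", "Squidward"],
        "job": ["Fry Cook", "Unemployed", "Cashier"],
        "species": ["Sponge", "Starfish", "Octopus"],
    }
)

# .info() prints column names, non-null counts, and dtypes; it returns None.
spongebob_df1.info()

# .head() returns the first five rows by default; pass n for a different count.
first_rows = spongebob_df1.head(2)
print(first_rows)
```

Checking dtypes and non-null counts before a merge helps catch key columns that were read with different types in the two DataFrames.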


## Combining DataFrames with `.merge()`

<center><img src="../images/illustrations/data merge illustrations.jpg"></center>

The `.merge()` method in Pandas combines rows from two DataFrames based on shared data. It pairs up rows whose values match in the specified columns or indices.

**Basic Structure:**

To perform a merge, you specify two primary DataFrames:

* The **left DataFrame**: The first DataFrame you are merging.
* The **right DataFrame**: The second DataFrame you are merging with.

**Controlling the Merge with Optional Arguments:**

Several optional arguments allow you to fine-tune the merging process:

* **`how`**: Defines the type of merge to be performed:
    * `inner`: Keeps only the rows where the join key(s) exist in *both* DataFrames.
    * `outer`: Includes *all* rows from both DataFrames. Where a key exists in only one DataFrame, missing values (`NaN`) are introduced for the columns of the other DataFrame.
    * `left`: Includes all rows from the *left* DataFrame, and the matching rows from the *right* DataFrame. If a key in the left DataFrame has no match in the right, the corresponding columns from the right DataFrame will have `NaN` values.
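    * `right`: Similar to `left`, but includes all rows from the *right* DataFrame and the matching rows from the *left* DataFrame.
    * `cross`: Pairs every row of the left DataFrame with every row of the right DataFrame (the Cartesian product); no join key is used.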

* **`on`**: Specifies the column name(s) or index level name(s) that will be used as the join key(s).
    * If no `on` argument is provided, `.merge()` will automatically use any columns that have identical names in both DataFrames as the join keys.
    * The column(s) or index level(s) specified with `on` must be present in both DataFrames.

* **`left_on`**: Specifies the column name(s) or index level name(s) in the *left* DataFrame to use as join keys.

* **`right_on`**: Specifies the column name(s) or index level name(s) in the *right* DataFrame to use as join keys. This is useful when the columns you want to join have different names in the two DataFrames.

* **`left_index`**: A boolean value. If `True`, uses the index of the *left* DataFrame as the join key(s).

* **`right_index`**: A boolean value. If `True`, uses the index of the *right* DataFrame as the join key(s).
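To make the key-selection arguments above concrete, here is a small sketch. The DataFrames, the `character_id` column, and the result names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Hypothetical data: the key is called 'id' on the left and 'character_id' on the right.
left = pd.DataFrame({"id": [1, 2, 3], "job": ["Fry Cook", "Unemployed", "Cashier"]})
right = pd.DataFrame({"character_id": [1, 2, 4], "age": [20, 22, 25]})

# The key columns have different names, so name each side explicitly.
by_columns = left.merge(right, how="inner", left_on="id", right_on="character_id")

# Match the left 'id' column against the right DataFrame's index instead.
right_indexed = right.set_index("character_id")
by_index = left.merge(right_indexed, how="inner", left_on="id", right_index=True)

print(by_columns)
print(by_index)
```

Note that `left_on`/`right_on` keeps both key columns in the result, while matching against the index keeps only the left key column.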

### `.merge()` - Inner Join

<center><img src="../images/illustrations/data merge illustrations - inner.jpg"></center>

An **inner join** using `.merge()` returns only the rows where the specified join key(s) have matching values in *both* the left and the right DataFrames. 

It essentially finds the intersection of the two datasets based on the common key(s).

Let's apply `.merge()` with the default inner join to our `spongebob_df1` and `spongebob_df2` DataFrames:

In [None]:
## Begin Example

## End Example

**Note:**

By default, calling `.merge()` without any additional arguments performs an **inner join** on every column name the two DataFrames share, which here means both `id` and `name`.

For clarity, you can get the same result by passing `how='inner'` and naming the join column(s) with the `on` argument.

In [None]:
## Begin Example

## End Example
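The sketch below compares the default call with the explicit one. It uses small hypothetical DataFrames (`df1`, `df2`) that share the `id` and `name` columns, as the real files are described to do; the rows themselves are invented.

```python
import pandas as pd

# Hypothetical stand-ins for spongebob_df1 and spongebob_df2.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["SpongeBob", "Patrick", "Squidward"],
        "job": ["Fry Cook", "Unemployed", "Cashier"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 4],
        "name": ["SpongeBob", "Patrick", "Sandy"],
        "age": [20, 22, 25],
    }
)

# No arguments: inner join on every shared column name ('id' and 'name').
default_merge = df1.merge(df2)

# The same join with the type and keys spelled out.
explicit_merge = df1.merge(df2, how="inner", on=["id", "name"])

# Joining on 'id' alone keeps both 'name' columns, suffixed _x and _y.
id_only = df1.merge(df2, how="inner", on="id")

print(explicit_merge)
print(id_only.columns.tolist())
```

Listing both shared columns in `on` avoids the duplicated `name_x`/`name_y` columns that appear when only `id` is used as the key.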

### `.merge()` - Outer Join

<center><img src="../images/illustrations/data merge illustrations - outer.jpg"></center>

An **outer join** using `.merge()` combines *all* rows from both the left and the right DataFrames. 

Let's perform an outer join on our spongebob DataFrames:

In [None]:
## Begin Example

## End Example

__Note:__

If a join key exists in only one of the DataFrames, the resulting merged DataFrame will have missing values (`NaN`) in the columns originating from the DataFrame that doesn't have that key.
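A minimal sketch of this behavior, using hypothetical DataFrames in which one character appears only on the left and another only on the right. The optional `indicator=True` argument of `.merge()` adds a `_merge` column showing where each row came from.

```python
import pandas as pd

# Hypothetical stand-ins for the two character tables.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["SpongeBob", "Patrick", "Squidward"],
        "job": ["Fry Cook", "Unemployed", "Cashier"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 4],
        "name": ["SpongeBob", "Patrick", "Sandy"],
        "age": [20, 22, 25],
    }
)

# Keep every row from both sides; unmatched cells are filled with NaN.
outer = df1.merge(df2, how="outer", on=["id", "name"], indicator=True)

# The _merge column is 'both', 'left_only', or 'right_only' for each row.
print(outer)
```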

### `.merge()` - Left Join

<center><img src="../images/illustrations/data merge illustrations - left.jpg"></center>

A **left join** using `.merge()` includes all rows from the *left* DataFrame in the result. 

For each row in the left DataFrame, it also includes the matching rows from the *right* DataFrame based on the join key(s).

In [None]:
## Begin Example

## End Example

**Note**:

If a row in the left DataFrame has no matching key in the right DataFrame, the columns from the right DataFrame will have missing values (`NaN`).
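As an illustration, the sketch below runs a left join on hypothetical data where the last left-hand row has no partner on the right.

```python
import pandas as pd

# Hypothetical stand-ins for the two character tables.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["SpongeBob", "Patrick", "Squidward"],
        "job": ["Fry Cook", "Unemployed", "Cashier"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 4],
        "name": ["SpongeBob", "Patrick", "Sandy"],
        "age": [20, 22, 25],
    }
)

# Every row of df1 is kept; id 3 has no match, so its 'age' is NaN.
left_join = df1.merge(df2, how="left", on=["id", "name"])

print(left_join)
```

Because `NaN` is a floating-point value, an integer column such as `age` is converted to a float column when missing values are introduced.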

### `.merge()` - Right Join

<center><img src="../images/illustrations/data merge illustrations - right.jpg"></center>

A **right join** using `.merge()` includes all rows from the *right* DataFrame in the result. 

For each row in the right DataFrame, it also includes the matching rows from the *left* DataFrame based on the join key(s).

In [None]:
## Begin Example


## End Example

**Note**:

If a row in the right DataFrame has no matching key in the left DataFrame, the columns from the left DataFrame will have missing values (`NaN`).
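The sketch below shows a right join on hypothetical data, and also that swapping the two DataFrames and using a left join yields the same rows with the columns in a different order.

```python
import pandas as pd

# Hypothetical stand-ins for the two character tables.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["SpongeBob", "Patrick", "Squidward"],
        "job": ["Fry Cook", "Unemployed", "Cashier"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 4],
        "name": ["SpongeBob", "Patrick", "Sandy"],
        "age": [20, 22, 25],
    }
)

# Every row of df2 is kept; id 4 has no match, so its 'job' is NaN.
right_join = df1.merge(df2, how="right", on=["id", "name"])

# Same rows, obtained by swapping the operands and joining left.
swapped = df2.merge(df1, how="left", on=["id", "name"])

print(right_join)
print(swapped)
```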

### `.merge()` - Cross Join

<center><img src="../images/illustrations/data merge illustrations cross.jpg"></center>

A **cross join** using `.merge()` produces the Cartesian product of the rows from the left and right DataFrames. 

This means that every row from the left DataFrame is combined with every row from the right DataFrame, resulting in a DataFrame where the number of rows is the product of the number of rows in the two original DataFrames.

In [None]:
## Begin Example


## End Example

__Note__:

The number of rows in the result equals the product of the row counts of the two original DataFrames. For example, 3 rows crossed with 4 rows yields 12 rows.

For a more practical example, let's load two additional datasets, each containing menu items from the Krusty Krab.

**File Locations**:
1. `../data/krustykrab1.csv`
2. `../data/krustykrab2.csv`

**Recommended DataFrame Names**:
1. `krusty_krab_df1`
2. `krusty_krab_df2`

In [None]:
## Begin Example


## End Example

Now let's follow up with `.merge()` using a Cross Join. 

In [None]:
## Begin Example


## End Example

The resulting DataFrame should contain every possible meal combination that can be formed from our two DataFrames.
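A cross join can be sketched on two tiny menu tables. The item names and the `entree` and `side` columns below are hypothetical and are not taken from the actual `krustykrab` files. The `how='cross'` option requires pandas 1.2 or later.

```python
import pandas as pd

# Hypothetical menu tables; the real CSV columns may differ.
entrees = pd.DataFrame({"entree": ["Krabby Patty", "Double Krabby Patty", "Coral Bits"]})
sides = pd.DataFrame({"side": ["Kelp Rings", "Seafoam Soda"]})

# No key is given: each entree is paired with each side.
combos = entrees.merge(sides, how="cross")

print(combos)
print(len(combos))  # 3 entrees x 2 sides = 6 rows
```

Row counts multiply, so cross joins on large DataFrames can grow very quickly.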

## Introducing `.join()`

The `.join()` method in Pandas provides another way to combine DataFrames. 

While its functionality overlaps with `.merge()`, it has a key distinction: **`.join()` primarily combines DataFrames based on their indices.**

**Key Differences from `.merge()`:**

* **Default Join Key:** `.join()` defaults to joining on the **index** of the DataFrames. `.merge()`, on the other hand, defaults to joining on columns with the same name.
* **Flexibility with Columns:** While `.join()` can also join on columns using the `on` argument, `.merge()` offers more flexibility in specifying different columns from the left and right DataFrames using `left_on` and `right_on`.


### Basic Usage

You typically call `.join()` on one DataFrame and pass the other DataFrame as an argument:

```python
joined_df = left_df.join(right_df, how='...', lsuffix='...', rsuffix='...')
```

__Key Arguments:__

* __other__: The DataFrame to join with.
* __how__: Specifies the type of join (`left`, `right`, `inner`, `outer`), similar to `.merge()`. It determines which keys are included in the resulting DataFrame. Unlike `.merge()`, the default is `left`.
* __on__: An optional argument to specify a column or a list of columns to join on. These columns must be in the calling DataFrame. The other DataFrame will join on its index based on the values in this column.
* __lsuffix__: A string suffix to apply to overlapping column names in the calling DataFrame.
* __rsuffix__: A string suffix to apply to overlapping column names in the other DataFrame. These suffixes help distinguish columns with the same name after the join.

In essence, use `.join()` when you want to combine DataFrames primarily based on their index values. 

If you need more explicit control over the join columns or want to join on different columns in the two DataFrames, `.merge()` is generally the more versatile option.

Let's perform an inner join with `.join()` and compare the result with that of `.merge()`.

In [None]:
## Begin Example


## End Example
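One way to run this comparison is sketched below on hypothetical data. Because `.join()` aligns on the index, the sketch first moves `id` into the index. The suffix strings `_left` and `_right` are arbitrary choices for this example.

```python
import pandas as pd

# Hypothetical stand-ins for the two character tables.
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "name": ["SpongeBob", "Patrick", "Squidward"],
        "job": ["Fry Cook", "Unemployed", "Cashier"],
    }
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 4],
        "name": ["SpongeBob", "Patrick", "Sandy"],
        "age": [20, 22, 25],
    }
)

# .join() matches on the index, so move the shared key there first.
left_indexed = df1.set_index("id")
right_indexed = df2.set_index("id")

# 'name' exists on both sides; without suffixes pandas raises a ValueError.
joined = left_indexed.join(
    right_indexed, how="inner", lsuffix="_left", rsuffix="_right"
)

# Using `on`: the caller's 'id' column is matched against the other's index.
joined_on = df1.join(
    right_indexed, on="id", how="inner", lsuffix="_left", rsuffix="_right"
)

# Comparable merge on the 'id' column.
merged = df1.merge(df2, how="inner", on="id", suffixes=("_left", "_right"))

print(joined)
print(merged)
```

The joined result carries `id` as its index, while the merged result keeps `id` as an ordinary column.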

## `pd.concat()`

The `pd.concat()` function in pandas is used to combine pandas objects (like DataFrames or Series) along a specified axis. 

It's like "stacking" or "gluing" them together.

### Vertical Stacking

<center><img src="../images/illustrations/data merge illustrations - concat vertical.jpg"></center>

Vertical stacking, also known as appending rows, involves combining DataFrames one on top of the other. 

This is particularly useful when you have data split across multiple files or DataFrames with the same columns and you want to create a single, larger dataset.

Let's use our Krusty Krab menu items. We'll combine the entree items and menu items into one dataset.

In [None]:
## Begin Example



## End Example
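Vertical stacking can be sketched with two small menu tables that share the same columns. The items, prices, and the names `menu_a` and `menu_b` are hypothetical and are not taken from the `krustykrab` files.

```python
import pandas as pd

# Hypothetical menu tables with identical columns.
menu_a = pd.DataFrame({"item": ["Krabby Patty", "Coral Bits"], "price": [1.25, 1.00]})
menu_b = pd.DataFrame({"item": ["Kelp Rings", "Seafoam Soda"], "price": [1.50, 1.00]})

# Default axis=0 stacks rows; each original index is kept (0, 1, 0, 1).
stacked = pd.concat([menu_a, menu_b])

# ignore_index=True builds a fresh 0..n-1 index for the combined rows.
stacked_reset = pd.concat([menu_a, menu_b], ignore_index=True)

print(stacked)
print(stacked_reset)
```

Passing `ignore_index=True` is usually the safer choice when the original row labels carry no meaning, since duplicate index labels can make later lookups ambiguous.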

### Horizontal Stacking

<center><img src="../images/illustrations/data merge illustrations horizontal stack.jpg"></center>

Horizontal stacking, also known as concatenating along columns, involves combining DataFrames side-by-side. 

This is useful when you have different sets of information about the same entities in separate DataFrames and you want to bring them together into a single DataFrame with more columns.

In addition to the `spongebob_df1` DataFrame, let's load the following dataset:

__File:__
* `../data/spongebob3.csv`

__Recommended DataFrame Name:__
* `spongebob_df3`

This dataset contains car racing stats for each character.

Let's perform a horizontal stack to "glue" the DataFrames together.

In [None]:
## Begin Example


## End Example

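`pd.concat()` is a fundamental tool for combining data in pandas, either by adding more rows or more columns.

The `axis` parameter controls this behavior: `axis=0` (the default) stacks rows, while `axis=1` places DataFrames side by side, aligning them on their index.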
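A short sketch of horizontal stacking follows. The `characters` and `racing` DataFrames, their columns, and their values are hypothetical and do not reflect the contents of `spongebob3.csv`. The index labels are deliberately not identical, to show that `axis=1` aligns on index labels rather than on row position.

```python
import pandas as pd

# Hypothetical tables indexed by character id.
characters = pd.DataFrame(
    {"name": ["SpongeBob", "Patrick", "Squidward"], "job": ["Fry Cook", "Unemployed", "Cashier"]},
    index=[1, 2, 3],
)
racing = pd.DataFrame({"races": [5, 2, 7], "wins": [1, 0, 3]}, index=[1, 2, 4])

# axis=1 places the columns side by side; labels found on one side only get NaN.
side_by_side = pd.concat([characters, racing], axis=1)

# join='inner' keeps only the index labels present in every DataFrame.
inner_only = pd.concat([characters, racing], axis=1, join="inner")

print(side_by_side)
print(inner_only)
```

If two DataFrames only line up by row position, resetting or setting a common index before concatenating avoids accidental misalignment.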

## Conclusion

<center><img src="../images/stock/pexels-miami302-18403865.jpg"></center>

Combining data involves either joining based on shared attributes to integrate related information or stacking datasets vertically (for more observations) or horizontally (for more attributes of the same entities). 

Effective combination requires identifying linking attributes or ensuring consistent alignment through indexing. 

The goal is to create a unified dataset for more comprehensive analysis.