# Problem Set 2.6: Indexes and concatenation

[Click here to open this notebook in your browser](https://leifwalsh.github.io/data-analysis-problem-sets/lab/index.html?path=2-pandas-basics/2.6-indexes-and-concatenation/2.6-indexes-and-concatenation.ipynb)

Learn about DataFrame indexes and how to combine DataFrames with `concat`.

But first, we'll start with a little exploration of adding individual columns
to a DataFrame.

## Adding columns to a DataFrame

Previously, we worked with a deck of cards, to build intuition. In this
notebook, we'll start with a very small DataFrame, so we can focus directly on
these operations.

In [None]:
import pandas as pd

df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": ["w", "x", "y", "z"],
})
df

Just like in [Section
2.1](../2.1-numpy-arrays-series-dataframe/2.1-numpy-arrays-series-dataframe.ipynb),
we can create a new Series from these columns:

In [None]:
s1 = df["a"] + df["b"]
s1

We can add this new Series to the DataFrame as a new column, with `assign`.

Note: `assign` returns a new DataFrame with the new column added, rather than
modifying the original DataFrame in place. We'll use `assign` in this section
so we can keep reusing `df` without modifying it.

In [None]:
df1 = df.assign(d=s1)
df1

As promised, `df` is unchanged:

In [None]:
df

More often, you'll see people assign columns to a DataFrame without first
storing the Series in a variable:

In [None]:
df1 = df.assign(d=df["a"] + df["b"])
df1

Notice above how `s1` has the index `0, 1, 2, 3`, and when we assigned it to the
DataFrame, the values `3, 6, 9, 12` were set on the rows that matched their
index values `0, 1, 2, 3`.

If we create a Series with a different index, pandas will reorder it to match
the index of the DataFrame we assign it to. This is called _alignment_.

In [None]:
s2 = pd.Series(data=[8, 9, 10, 11], index=[2, 3, 1, 0])
s2

Notice the index in this case is `2, 3, 1, 0`. Now when we assign it to the
DataFrame, the values are reordered to match the index of the DataFrame:

In [None]:
df2 = df.assign(d=s2)
df2
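One way to build intuition for alignment is to do the reordering by hand with
`reindex`. The sketch below is only an illustration of the idea, not a claim
about how `assign` works internally. The names `small`, `s`, `aligned`, and
`result` are made up for this example.

```python
import pandas as pd

# A small frame with the default 0..3 index (named `small` for this sketch).
small = pd.DataFrame({"a": [1, 2, 3, 4]})

# A Series whose index is in a different order than the frame's index.
s = pd.Series(data=[8, 9, 10, 11], index=[2, 3, 1, 0])

# Reorder the Series so its labels follow the frame's index.
aligned = s.reindex(small.index)

# Assigning the original Series produces the same values as the
# reindexed one: each value lands on the row with the matching label.
result = small.assign(d=s)

print(aligned.tolist())
print(result["d"].tolist())
```

Both printed lists should be identical, because each value follows its index
label rather than its position in the Series.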

If the new Series doesn't have values for all the rows in the DataFrame, pandas
will fill in missing values with a "missing value" marker, most commonly `NaN`
for "not a number".

In [None]:
s3 = pd.Series(data=[5, 6, 7], index=[0, 1, 3])
s3

In [None]:
df3 = df.assign(d=s3)
df3

Side note: the default integer type that pandas uses has no way to represent
this "not a number" value, so pandas automatically converts the integers to
floats to accommodate the `NaN` values. That's why the `d` column has decimal
points in it.

In [None]:
df3.dtypes
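To see the conversion in isolation, here is a minimal sketch that aligns an
integer Series to an index with one extra label. It also shows pandas' nullable
`"Int64"` dtype (note the capital `I`), an alternative that this notebook does
not otherwise use, which can hold missing values without switching to floats.
The variable names are made up for this example.

```python
import pandas as pd

s = pd.Series(data=[5, 6, 7], index=[0, 1, 3])
print(s.dtype)  # an integer dtype

# Aligning to an index with an extra label (2) introduces a missing value,
# so pandas converts the values to floats.
as_float = s.reindex([0, 1, 2, 3])
print(as_float.dtype)

# The nullable "Int64" dtype keeps the integers and marks the gap with <NA>.
as_nullable = s.astype("Int64").reindex([0, 1, 2, 3])
print(as_nullable.dtype)
print(as_nullable)
```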

We'll see more about missing values later.

You can also add a new column with more values than the DataFrame has rows. In
this case, pandas will only use the values that match the index of the
DataFrame.

In [None]:
s4 = pd.Series(data=[15, 16, 17, 18, 19, 20], index=[0, 1, 2, 3, 4, 100])
s4

In [None]:
df4 = df.assign(d=s4)
df4

### Modifying assignment

Above, we used `assign` to create a new DataFrame with the new column added.

We can also update the DataFrame in place, using the `[]` operator to add a new
column.

This is pretty common, so you should know it exists, but it can make notebooks
more confusing, because the result then depends on the order in which you
execute the cells (and on how many times you execute them).

To see this, let's make a copy of the DataFrame:

In [None]:
dataframe_to_modify = df.copy()
dataframe_to_modify

Now, we can update it in place:

In [None]:
dataframe_to_modify["d"] = s4

In [None]:
dataframe_to_modify
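Here is a small sketch of how in-place updates can make a notebook depend on
execution order. It imitates running the same cell twice. The names `frame`,
`original`, and `doubled` are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1, 2, 3, 4]})

# Imagine this line lives in its own notebook cell.
frame["a"] = frame["a"] * 2
print(frame["a"].tolist())  # after running the cell once

# Running the same cell a second time changes the data again.
frame["a"] = frame["a"] * 2
print(frame["a"].tolist())

# With assign, the original frame is left alone, so re-running the same
# line gives the same answer every time.
original = pd.DataFrame({"a": [1, 2, 3, 4]})
doubled = original.assign(a=original["a"] * 2)
doubled = original.assign(a=original["a"] * 2)
print(doubled["a"].tolist())
```

The in-place version ends up doubled twice, while the `assign` version is
doubled only once no matter how often the line runs.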

## Understanding Indexes

So far, we've only worked with indexes that represent row numbers (at least in
some original sense).

But indexes can be more than just row numbers. They can be any unique
identifier for the rows.

In some cases, they can be a combination of columns that uniquely identifies
rows (for example, a combination of first and last name, as long as you don't
have people with identical first and last names).
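As a rough sketch of the first-and-last-name idea, `set_index` also accepts a
list of column names. The table and the names `people`, `people_indexed`, and
`born` are invented for this illustration.

```python
import pandas as pd

# A small illustrative table: neither "first" nor "last" is needed alone,
# but together they identify each row.
people = pd.DataFrame({
    "first": ["Ada", "Ada", "Alan"],
    "last": ["Lovelace", "Yonath", "Turing"],
    "born": [1815, 1939, 1912],
})

# Passing a list of column names uses their combination as the index.
people_indexed = people.set_index(["first", "last"])
print(people_indexed)

# Look up a value by giving one label per index level.
born = people_indexed.loc[("Ada", "Lovelace"), "born"]
print(born)
```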

Technically, pandas won't explicitly complain if they aren't unique, but many
operations will do confusing things if they're not unique! Some pandas experts
even discourage use of indexes at all for this reason. But some pandas
operations will give you DataFrame objects with non-trivial indexes, so you
will need to understand them anyway.

As long as you understand them, they don't have to be scary, and often they can
be useful.
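To make the "confusing things" concrete, here is a minimal sketch of one of
them: the same kind of lookup returns a different kind of result depending on
whether the label is repeated. The name `dup` is made up for this example.

```python
import pandas as pd

# An index with a repeated label: "x" appears twice.
dup = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "x", "y"])

# pandas lets us build this, but we can ask whether the index is unique.
print(dup.index.is_unique)

# A unique label gives back a single value...
print(dup.loc["y", "a"])

# ...but a repeated label gives back a Series with every matching row.
print(dup.loc["x", "a"])
```

Checking `index.is_unique` before relying on label lookups is one simple way
to avoid this kind of surprise.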

### Changing the index

You can set the index of a DataFrame with the `set_index` method:

In [None]:
df_indexed = df.set_index("c")
df_indexed

Note that when a DataFrame like this is displayed, the index's column name is
in the second header row.

The index is no longer a normal column:

In [None]:
df_indexed.columns

Remember that a DataFrame is roughly defined as a collection (`dict`) of Series
that share an index ("are aligned"). So, each of the columns in the DataFrame
now has `c` as its index:

In [None]:
df_indexed["a"]

In [None]:
df_indexed["b"]

We can now use `.loc[]` to select items along the new index with values from
that index:

In [None]:
df_indexed.loc["w":"y", ["a"]]

We can still use `.iloc[]` to access rows by their position:

In [None]:
df_indexed.iloc[1:3]
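One difference between the two is worth a quick sketch: a label slice with
`.loc[]` includes its end label, while a position slice with `.iloc[]`
excludes its end position, like ordinary Python slicing. The names `frame`,
`by_label`, and `by_position` are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame(
    {"a": [1, 2, 3, 4]},
    index=["w", "x", "y", "z"],
)

# Label slices include the end label: rows "w", "x", and "y".
by_label = frame.loc["w":"y"]
print(by_label.index.tolist())

# Position slices exclude the end position: positions 1 and 2 only.
by_position = frame.iloc[1:3]
print(by_position.index.tolist())
```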

When you assign a column onto a DataFrame, remember that it will be aligned to
the index of the DataFrame. This is still true when the index isn't row
numbers.

Let's remind ourselves what that DataFrame looks like:

In [None]:
df_indexed

Let's add a new column to it:

In [None]:
s5 = pd.Series(data=[5, 6, 7, 8], index=["z", "y", "x", "w"])
s5

In [None]:
df5 = df_indexed.assign(d=s5)
df5

Now we're ready to explore `concat` and `join`.

## Concatenating DataFrames

`concat` is a function that takes a list of DataFrames and concatenates them
along either the "index" or "columns" axis.

We'll need two DataFrames:

In [None]:
df_a = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "idx": ["w", "x", "y", "z"],
}).set_index("idx")
df_a

In [None]:
df_b = pd.DataFrame({
    "c": [5, 6, 7, 8],
    "d": [9, 10, 11, 12],
    "idx": ["w", "x", "y", "z"],
}).set_index("idx")
df_b

`concat`, well, concatenates DataFrames along the named axis. In this case, we
concatenate along the columns, so the resulting DataFrame has all the columns
from both `df_a` and `df_b`, and pandas makes sure to align them by their index.

In [None]:
pd.concat([df_a, df_b], axis="columns")

You can also concatenate along the index axis. Let's make two more DataFrames:

In [None]:
df_c = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "idx": ["w", "x", "y", "z"],
}).set_index("idx")
df_c

In [None]:
df_d = pd.DataFrame({
    "a": [5, 6, 7, 8],
    "b": [9, 10, 11, 12],
    "idx": ["a", "b", "c", "d"],
}).set_index("idx")
df_d

In [None]:
pd.concat([df_c, df_d], axis="index")

This time, it made sure the columns were "aligned".

If this makes you wonder whether columns have alignment properties like indexes
do, then yes! We'll see more of that later.
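As a small preview, here is a sketch where the two DataFrames list the same
columns in a different order. When stacking along the index, values are matched
by column name rather than by position. The names `top`, `bottom`, and
`stacked` are made up for this example.

```python
import pandas as pd

top = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["w", "x"])

# Same column names, but listed in the opposite order.
bottom = pd.DataFrame({"b": [30, 40], "a": [10, 20]}, index=["y", "z"])

stacked = pd.concat([top, bottom], axis="index")
print(stacked)

# Each value stays under its own column name.
print(stacked["a"].tolist())
print(stacked["b"].tolist())
```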

When you concatenate DataFrames with overlapping but not identical alignment axes, you get missing values in the non-overlapping regions:

In [None]:
df_e = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [2, 4, 6],
    "idx": ["w", "x", "y"],
}).set_index("idx")
df_e

In [None]:
df_f = pd.DataFrame({
    "c": [6, 7, 8],
    "d": [10, 11, 12],
    "idx": ["x", "y", "z"],
}).set_index("idx")
df_f

In [None]:
pd.concat([df_e, df_f], axis="columns")

With `assign` and `[]=`, pandas considers the DataFrame being added to
the authority on what the index is, so any extra rows in the column being added
get discarded. With `concat`, none of the DataFrames involved is "primary", so
pandas keeps all the rows from all the DataFrames, and fills in missing values
with `NaN`.
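The difference can be sketched side by side. The example below also uses the
`join="inner"` option of `pd.concat`, which this notebook does not otherwise
cover. It keeps only the index labels that appear in every DataFrame. The
names `left`, `right`, `assigned`, `outer`, and `inner` are made up for this
example.

```python
import pandas as pd

left = pd.DataFrame({"a": [1, 2, 3]}, index=["w", "x", "y"])
right = pd.DataFrame({"c": [6, 7, 8]}, index=["x", "y", "z"])

# assign: `left` decides the index, so the "z" row of `right` is dropped
# and the "w" row gets a missing value.
assigned = left.assign(c=right["c"])
print(assigned)

# concat (default behavior): every label from either frame is kept.
outer = pd.concat([left, right], axis="columns")
print(outer)

# concat with join="inner": only labels present in both frames are kept.
inner = pd.concat([left, right], axis="columns", join="inner")
print(inner)
```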

Note that you can also concatenate as many DataFrames as you want:

In [None]:
dfs = [
    pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "idx": ["w", "x", "y"]}).set_index("idx"),
    pd.DataFrame({"c": [7, 8, 9], "d": [10, 11, 12], "idx": ["x", "y", "z"]}).set_index("idx"),
    pd.DataFrame({"e": [13, 14, 15], "f": [16, 17, 18], "idx": ["y", "z", "w"]}).set_index("idx"),
]

In [None]:
dfs[0]

In [None]:
dfs[1]

In [None]:
dfs[2]

In [None]:
pd.concat(dfs, axis="columns")