# Question 0 - Topics in Pandas [25 points]

For this question, please pick a topic - such as a function, class, method, recipe or idiom related to the pandas python library and create a short tutorial or overview of that topic. 

## Duplicate labels

Real-world data is often messy. Because pandas does not require index objects to be unique, a Series or DataFrame can end up with duplicate row or column labels.
In this section, we first show how duplicate labels change the behavior of certain operations. We then use pandas to detect duplicate labels and to deal with them.

- Consequences of duplicate labels
- Duplicate label detection
- Deal with duplicate labels

In [1]:
import pandas as pd
import numpy as np



In [4]:
# Generate a Series with duplicate labels ("B" appears twice)
s1 = pd.Series([0, 4, 6], index=["A", "B", "B"])

### Consequences of duplicate labels
Some pandas methods (`Series.reindex()`, for example) do not work when the index contains duplicate labels. The output cannot be determined, so pandas raises an exception.

In [6]:
s1.reindex(["A", "B", "C"])

ValueError: cannot reindex from a duplicate axis
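As a minimal sketch of how this failure could be handled in practice, the snippet below catches the `ValueError` and then retries after keeping only the first occurrence of each label. Keeping the first occurrence is an assumption made for illustration; which row to keep depends on the data. The exact error message may also differ between pandas versions.

```python
import pandas as pd

s1 = pd.Series([0, 4, 6], index=["A", "B", "B"])

# reindex() cannot decide which "B" to use, so it raises ValueError.
try:
    s1.reindex(["A", "B", "C"])
    raised = False
except ValueError as err:
    raised = True
    print(f"reindex failed: {err}")

# Assumption for illustration: keep the first occurrence of each label.
deduped = s1[~s1.index.duplicated()]

# With unique labels, reindex() works; the missing label "C" becomes NaN.
result = deduped.reindex(["A", "B", "C"])
print(result)
```

Because `"C"` has no value, the reindexed Series is upcast to a float dtype to hold `NaN`.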

Other operations, such as indexing, can give surprising results. Normally, indexing with a scalar reduces dimensionality: selecting from a DataFrame with a scalar returns a Series, and selecting from a Series with a scalar returns a scalar. With duplicate labels, this is no longer the case.

In [14]:
df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])
df1

   A  A  B
0  0  1  2
1  3  4  5


If we select column 'B', we get back a Series.

In [10]:
df1["B"] # This is a series

0    2
1    5
Name: B, dtype: int64

But selecting 'A' returns a DataFrame, since there are two "A" columns.

In [12]:
df1["A"] # This is a dataframe

   A  A
0  0  1
1  3  4


This applies to row labels as well.

In [16]:
df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])
df2

   A
a  0
a  1
b  2


In [17]:
df2.loc["b", "A"]  # This is a scalar.

2

In [18]:
df2.loc["a", "A"]  # This is a Series.

a    0
a    1
Name: A, dtype: int64
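The type change shown above can be checked programmatically. The sketch below is one possible way to do so, and it also shows that indexing with a list of labels (rather than a scalar) returns a Series whether or not the label is duplicated. The variable names are chosen for illustration only.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Scalar label: the result type depends on whether the label is duplicated.
unique_hit = df2.loc["b", "A"]  # unique label, gives a scalar
dup_hit = df2.loc["a", "A"]     # duplicated label, gives a Series

is_scalar_unique = pd.api.types.is_scalar(unique_hit)
is_series_dup = isinstance(dup_hit, pd.Series)

# List of labels: the result is a Series in both cases.
unique_as_series = df2.loc[["b"], "A"]
dup_as_series = df2.loc[["a"], "A"]

print(is_scalar_unique, is_series_dup)
print(len(unique_as_series), len(dup_as_series))
```

Using a list keeps the return type predictable, at the cost of having to unpack a one-element Series when the label turns out to be unique.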

### Duplicate Label Detection

We can check whether an Index (storing the row or column labels) is unique with `Index.is_unique`:

In [19]:
df2.index.is_unique # There are duplicate indexes in df2.

False

In [20]:
df2.columns.is_unique # Column names of df2 are unique.

True

`Index.duplicated()` returns a boolean ndarray that marks each repeated label. By default, the first occurrence of a label is not marked; only later repeats are.

In [21]:
df2.index.duplicated()

array([False,  True, False])
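`Index.duplicated()` also accepts a `keep` parameter that controls which occurrence is left unmarked. The following sketch compares the three options on the same labels as `df2` and then lists the labels that occur more than once. The variable names are illustrative.

```python
import pandas as pd

idx = pd.Index(["a", "a", "b"])

# keep="first" (the default): every occurrence after the first is marked.
first = idx.duplicated(keep="first")

# keep="last": every occurrence before the last is marked.
last = idx.duplicated(keep="last")

# keep=False: all occurrences of a repeated label are marked.
all_dups = idx.duplicated(keep=False)

# The distinct labels that appear more than once.
repeated_labels = idx[all_dups].unique()

print(first, last, all_dups, list(repeated_labels))
```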

### Deal with duplicate labels

- Negating `Index.duplicated()` gives a boolean filter that drops rows with repeated labels, keeping the first occurrence of each label.

In [22]:
df2.loc[~df2.index.duplicated(), :]

   A
a  0
b  2
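If the most recent row should win instead, the same filter can be built with `keep="last"`. This is a small sketch on the same data; whether the first or last row is the right one to keep is an assumption that depends on how the data was collected.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Mark every occurrence except the last one, then drop the marked rows.
keep_last = df2.loc[~df2.index.duplicated(keep="last"), :]

print(keep_last)
```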


- We can use `groupby()` to handle duplicate labels, rather than just dropping the repeats. 

For example, we’ll resolve duplicates by taking the average of all rows with the same label.

In [23]:
df2.groupby(level=0).mean()

     A
a  0.5
b  2.0
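The mean is only one way to combine rows that share a label. As a sketch, the same `groupby(level=0)` pattern can be paired with other aggregations, such as `sum()` or `first()`. Which aggregation is appropriate is an assumption that depends on what the duplicated rows represent.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Group rows by their index label (level 0) and combine each group.
summed = df2.groupby(level=0).sum()    # add up rows with the same label
firsts = df2.groupby(level=0).first()  # keep the first value per label

print(summed)
print(firsts)
```

In each case the result has one row per distinct label, so its index is unique.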
