# 🧹 Cleaning Data, Part 2: Cleaning in Pandas

*Same idea, different style.*

Let's take that same list of amount descriptions, but this time make it a Pandas `Series`:

In [1]:
import pandas as pd

In [2]:
amounts = pd.Series([
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "$10"
])

amounts

0       1,000.31   doLLaRs   
1                  54 cents  
2                    33 CENTS
3                         $10
dtype: object

In Pandas, you can use the same methods we explored in plain Python, but you access them through each Series' `.str` attribute:

|Plain|Pandas|
|:-----|:------|
|`my_string.strip()`|`my_series.str.strip()`|
|`my_string.replace("a", "b")`|`my_series.str.replace("a", "b")`|
|... and so on||
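
To make the table concrete, here is a minimal sketch comparing the two styles side by side. The `labels` Series and its values are made up for illustration (they are not part of the exercise), and the missing value is included only to show that `.str` methods pass missing entries through rather than raising an error:

```python
import pandas as pd

# Hypothetical example data, including one missing value
labels = pd.Series(["  apple ", "banana  ", None])

# Plain Python: call the method on each string yourself,
# skipping anything that isn't a string
plain = [s.strip() if isinstance(s, str) else s for s in labels]

# Pandas: the .str accessor applies the method to every element
vectorized = labels.str.strip()

print(plain)
print(vectorized)
```

The `.str` version returns a new `Series` with the same index, which is what makes it easy to chain further `.str` calls.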

Let's try it out:

In [3]:
amounts.str.strip("$ ")

0    1,000.31   doLLaRs
1              54 cents
2              33 CENTS
3                    10
dtype: object

In [4]:
amounts.str.replace(",", "")

0       1000.31   doLLaRs   
1                 54 cents  
2                   33 CENTS
3                        $10
dtype: object

In [5]:
amounts.str.lower()

0       1,000.31   dollars   
1                  54 cents  
2                    33 cents
3                         $10
dtype: object

In [6]:
amounts.str.split(" ")

0    [, , , 1,000.31, , , doLLaRs, , , ]
1                        [54, cents, , ]
2                            [33, CENTS]
3                                  [$10]
dtype: object

`.str.get(num)` is helpful to combine with `.str.split(...)`. It's equivalent to `my_list[num]` in plain Python:

In [7]:
amounts.str.split(" ")

0    [, , , 1,000.31, , , doLLaRs, , , ]
1                        [54, cents, , ]
2                            [33, CENTS]
3                                  [$10]
dtype: object

In [8]:
amounts.str.split(" ").str.get(0)

0       
1     54
2     33
3    $10
dtype: object

In [9]:
amounts.str.split(" ").str.get(-1)

0         
1         
2    CENTS
3      $10
dtype: object
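
One difference from plain list indexing is worth knowing: where `my_list[num]` raises an `IndexError` for a position that doesn't exist, `.str.get(num)` returns a missing value instead. A small sketch, using a shortened, made-up version of the data:

```python
import pandas as pd

# Two descriptions: one splits into two parts, the other into just one
parts = pd.Series(["54 cents", "$10"]).str.split()

# Position 1 exists only for the first row; the second row gets NaN
second = parts.str.get(1)

print(second)
```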

Now, let's put it all together to get the __quantities__ from the amount descriptions (ignoring, for now, whether they represent the number of dollars or cents):

In [10]:
(
    amounts
    .str.strip("$ ")
    .str.replace(",", "")
    .str.split()
    .str.get(0)
)

0    1000.31
1         54
2         33
3         10
dtype: object
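
Note that the chain above calls `.str.split()` with no argument, rather than `.str.split(" ")` as in the earlier cells. As with plain Python's `str.split`, omitting the separator splits on runs of whitespace and drops the empty strings. A quick sketch of the difference, on two of the original descriptions:

```python
import pandas as pd

messy = pd.Series(["   1,000.31   doLLaRs   ", "54 cents  "])

# Splitting on a single space keeps an empty string for every extra space
by_space = messy.str.split(" ")

# Splitting with no separator collapses runs of whitespace
by_whitespace = messy.str.split()

# The first element is only useful in the second case
first_by_space = by_space.str.get(0)
first_by_whitespace = by_whitespace.str.get(0)

print(by_whitespace)
print(first_by_whitespace)
```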

For handling dollars vs. cents, there are a couple of ways to do it:

- A very Pandas-y way (works, but a bit overly complex)
- Just writing a normal Python function, and passing it to `amounts.apply(...)`

In [11]:
def get_conversion(amt):
    if "$" in amt or "dollar" in amt.lower():
        conversion = 1
    elif "cent" in amt.lower():
        conversion = 0.01
    else:
        raise ValueError(f"Cannot determine unit for {amt}")
    return conversion

In [12]:
amounts.apply(get_conversion)

0    1.00
1    0.01
2    0.01
3    1.00
dtype: float64
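
For comparison, here is one possible sketch of the "Pandas-y" way mentioned above. It is only one of several ways to write it, and the helper names (`lowered`, `is_dollars`, `is_cents`) are made up for illustration. Unlike `get_conversion`, it leaves unrecognized descriptions as `NaN` instead of raising an error:

```python
import pandas as pd

amounts = pd.Series([
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "$10"
])

lowered = amounts.str.lower()

# regex=False treats "$" as a literal character, not a regex anchor
is_dollars = (
    lowered.str.contains("$", regex=False)
    | lowered.str.contains("dollar", regex=False)
)
is_cents = lowered.str.contains("cent", regex=False)

# Start with NaN everywhere, then fill in the matches.
# Dollars are applied last, so they win if both match (like the if/elif).
conversion = (
    pd.Series(float("nan"), index=amounts.index)
    .mask(is_cents, 0.01)
    .mask(is_dollars, 1.0)
)

print(conversion)
```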

Let's tie it all together, creating a `DataFrame` with:

- The raw, original description
- The quantity extracted
- The conversion factor

... which we'll use to convert to the dollars-normalized values, so we can `sum` it all up.

In [13]:
amounts_df = pd.DataFrame({
    "raw": amounts
})

amounts_df

| |raw|
|:--|:--|
|0|1,000.31 doLLaRs|
|1|54 cents|
|2|33 CENTS|
|3|$10|


In [14]:
amounts_df["quantity"] = (
    amounts_df["raw"]
    .str.strip("$ ")
    .str.replace(",", "")
    .str.split()
    .str.get(0)
    .astype(float)
)

amounts_df

| |raw|quantity|
|:--|:--|:--|
|0|1,000.31 doLLaRs|1000.31|
|1|54 cents|54.0|
|2|33 CENTS|33.0|
|3|$10|10.0|
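
The `.astype(float)` step works here because every extracted quantity is a clean number. If some entries might not be, `pd.to_numeric` with `errors="coerce"` is one alternative worth considering: it turns unparseable values into `NaN` instead of raising. A minimal sketch, using made-up values that are not part of the exercise:

```python
import pandas as pd

# Hypothetical extracted quantities, one of which isn't a number
quantities = pd.Series(["1000.31", "54", "n/a"])

# .astype(float) would raise a ValueError on "n/a";
# errors="coerce" converts it to NaN instead
parsed = pd.to_numeric(quantities, errors="coerce")

print(parsed)
```

Whether a silent `NaN` or a loud error is preferable depends on the data, so treat this as an option rather than a replacement.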


In [15]:
amounts_df["conversion"] = amounts_df["raw"].apply(get_conversion)

amounts_df

| |raw|quantity|conversion|
|:--|:--|:--|:--|
|0|1,000.31 doLLaRs|1000.31|1.0|
|1|54 cents|54.0|0.01|
|2|33 CENTS|33.0|0.01|
|3|$10|10.0|1.0|


In [16]:
amounts_df["dollars"] = amounts_df["quantity"] * amounts_df["conversion"]

amounts_df

| |raw|quantity|conversion|dollars|
|:--|:--|:--|:--|:--|
|0|1,000.31 doLLaRs|1000.31|1.0|1000.31|
|1|54 cents|54.0|0.01|0.54|
|2|33 CENTS|33.0|0.01|0.33|
|3|$10|10.0|1.0|10.0|


In [17]:
amounts_df["dollars"].sum()

1011.18

## Interlude: Chaining in Pandas

After using Pandas for (literally 😬) a decade, I've settled on a style that, for me, strikes the best balance of expressiveness, flexibility, and maintainability. It makes heavy use of `DataFrame.assign()`, method-chaining, and `lambda` functions.

For our example exercise, it'd look like this:

In [18]:
(
    pd.DataFrame({ "raw": amounts })
    .assign(
        quantity = lambda df: (
            df["raw"]
            .str.strip("$ ")
            .str.replace(",", "")
            .str.split()
            .str.get(0)
            .astype(float)
        ),
        conversion = lambda df: df["raw"].apply(get_conversion),
        dollars = lambda df: df["quantity"] * df["conversion"],
    )
)

| |raw|quantity|conversion|dollars|
|:--|:--|:--|:--|:--|
|0|1,000.31 doLLaRs|1000.31|1.0|1000.31|
|1|54 cents|54.0|0.01|0.54|
|2|33 CENTS|33.0|0.01|0.33|
|3|$10|10.0|1.0|10.0|
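
The `lambda` functions are what make this work in a single `.assign()` call: each one receives the DataFrame as it stands at that point, so `dollars` can refer to the `quantity` and `conversion` columns defined just above it. As a sketch of how the chain might be extended, the version below continues through to the final total. The variable name `total` is just for illustration, and the setup is repeated so the cell runs on its own:

```python
import pandas as pd

def get_conversion(amt):
    # Same logic as the function defined earlier
    if "$" in amt or "dollar" in amt.lower():
        return 1
    elif "cent" in amt.lower():
        return 0.01
    raise ValueError(f"Cannot determine unit for {amt}")

amounts = pd.Series([
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "$10"
])

total = (
    pd.DataFrame({"raw": amounts})
    .assign(
        quantity=lambda df: (
            df["raw"]
            .str.strip("$ ")
            .str.replace(",", "")
            .str.split()
            .str.get(0)
            .astype(float)
        ),
        conversion=lambda df: df["raw"].apply(get_conversion),
        # Refers to the two columns created just above
        dollars=lambda df: df["quantity"] * df["conversion"],
    )
    ["dollars"]  # select the normalized column...
    .sum()       # ...and add it up
)

print(round(total, 2))
```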


---
