# Cleaning Data Basics

## Some common ways data can be messy

**Problem**: Leading/trailing whitespace (or other characters)

**Examples**:
    
- `police precinct    `
- `    police precinct`
- `   police precinct   `

**Solution**: `.strip(...)`

In [78]:
"police precinct    ".strip()

'police precinct'

In [79]:
"    police precinct".strip()

'police precinct'

In [80]:
"   police precinct   ".strip()

'police precinct'

By default, `.strip()` removes leading and trailing whitespace, but you can also pass it a string of other characters to remove:

In [81]:
"///police precinct///".strip("/")

'police precinct'

However, it removes only the characters you specify, and it stops at the first character on each end that isn't one of them:

In [82]:
"/ / /police precinct/ / /".strip("/")

' / /police precinct/ / '

In [83]:
"/ / /police precinct/ / /".strip("/ ")

'police precinct'
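The last two cells hint at a detail worth spelling out: the argument to `.strip(...)` is treated as a *set* of characters, not as a literal prefix or suffix. Here is a minimal sketch of that behavior, along with the one-sided variants `.lstrip(...)` and `.rstrip(...)`. The sample strings are made up for illustration.

```python
# The argument is a set of characters: any mix of "x" and "y" is removed
# from both ends until a character outside the set is reached.
label = "xyxpolice precinctyyx"
stripped = label.strip("xy")
print(stripped)  # police precinct

# .lstrip() and .rstrip() apply the same rule to one side only.
padded = "///police precinct///"
left_only = padded.lstrip("/")
right_only = padded.rstrip("/")
print(left_only)   # police precinct///
print(right_only)  # ///police precinct
```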

**Problem**: Extra/junk characters elsewhere

**Examples**:
    
- `15,645`
- `total_population_school_district`

**Solution**: `.replace(...)`

In [84]:
int("15,645")

ValueError: invalid literal for int() with base 10: '15,645'

In [85]:
"15,645".replace(",", "")

'15645'

In [86]:
int("15,645".replace(",", ""))

15645

In [87]:
"total_population_school_district".replace("_", " ")

'total population school district'
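Note that `.replace(...)` swaps out *every* occurrence by default. As a small sketch (the sample value is hypothetical), an optional third argument limits how many replacements are made:

```python
raw = "1,234,567"

# By default, every comma is replaced.
no_commas = raw.replace(",", "")
print(no_commas)       # 1234567
print(int(no_commas))  # 1234567

# A third argument caps the number of replacements, counting from the left.
first_only = raw.replace(",", "", 1)
print(first_only)      # 1234,567
```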

**Problem**: Inconsistent upper/lower-casing

**Examples**:

- `PALISADES FIRE`
- `Palisades Fire`
- `Palisades fIRE`

**Solution**: `.upper()`/`.lower()`/`.title()`

In [88]:
"PALISADES FIRE".upper()

'PALISADES FIRE'

In [89]:
"Palisades Fire".lower()

'palisades fire'

In [90]:
"Palisades fIRE".title()

'Palisades Fire'
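One common reason to normalize casing is so that variants of the same name compare as equal. The sketch below shows this with the three examples above, plus one caveat about `.title()` that may matter for names containing apostrophes (the `"eaton's fire"` string is a made-up example):

```python
names = ["PALISADES FIRE", "Palisades Fire", "Palisades fIRE"]

# After lowercasing, all three variants collapse into a single value.
normalized = {name.lower() for name in names}
print(normalized)  # {'palisades fire'}

# Caveat: .title() capitalizes the letter after any non-letter,
# including an apostrophe.
title_cased = "eaton's fire".title()
print(title_cased)  # Eaton'S Fire
```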

**Problem**: One string contains **multiple** chunks of information

**Examples**:

- `10 lbs`
- `Soma, Jonathan`
- `06/16/2025`

**Solution**: `.split(...)`

This method returns a list:

In [91]:
"10 lbs".split(" ")

['10', 'lbs']

You can work with it like any list, including by using square-bracket notation to get any item:

In [92]:
"10 lbs".split(" ")[0]

'10'

Likewise, you can "unpack" each element into its own variable:

In [93]:
"Soma, Jonathan".split(", ")

['Soma', 'Jonathan']

In [94]:
last, first = "Soma, Jonathan".split(", ")
print(f"{first} {last}")

Jonathan Soma


In [95]:
m, d, y = "06/16/2025".split("/")
print("-".join([y, m, d]))

2025-06-16
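Two details of `.split(...)` can matter with messier input. Splitting on `" "` treats every single space as a separator, while calling `.split()` with no argument splits on any run of whitespace. An optional second argument limits the number of splits. A minimal sketch with made-up strings:

```python
messy = "  10   lbs "

# Splitting on a single space leaves empty strings wherever spaces repeat.
by_space = messy.split(" ")
print(by_space)  # ['', '', '10', '', '', 'lbs', '']

# With no argument, runs of whitespace count as one separator
# and leading/trailing whitespace is ignored.
by_whitespace = messy.split()
print(by_whitespace)  # ['10', 'lbs']

# A second argument caps the number of splits, which keeps unpacking
# safe when the separator might appear more than once.
last, rest = "Soma, Jonathan, Jr.".split(", ", 1)
print(last)  # Soma
print(rest)  # Jonathan, Jr.
```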


## Exercise

Write code that calculates the total amount of money expressed in this list:

In [96]:
amounts = [
    "   5,1245.31   doLLaRs   ",
    "456 Dollars  ",
    "156.20 USD",
    "$15"
]

My solution:

In [97]:
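def clean_amount(amt):
    # Remove surrounding spaces and any leading dollar sign
    amt = amt.strip("$ ")
    # Drop commas so float() can parse the number
    amt = amt.replace(",", "")
    # Keep only the first chunk (the number), discarding the unit
    amt = amt.split(" ")[0]
    return float(amt)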

for amt in amounts:
    print("Original: ", amt)
    print("Converted:", clean_amount(amt))
    print("---")

Original:     5,1245.31   doLLaRs   
Converted: 51245.31
---
Original:  456 Dollars  
Converted: 456.0
---
Original:  156.20 USD
Converted: 156.2
---
Original:  $15
Converted: 15.0
---


In [98]:
sum(clean_amount(amt) for amt in amounts)

51872.509999999995
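The trailing `...09999999995` comes from binary floating-point arithmetic, not from the cleaning steps. If exact cents matter, one option is Python's built-in `decimal` module. The sketch below is a variant of the solution above, not a replacement for it, and `clean_amount_decimal` is a hypothetical name introduced only for this illustration:

```python
from decimal import Decimal

amounts = [
    "   5,1245.31   doLLaRs   ",
    "456 Dollars  ",
    "156.20 USD",
    "$15",
]

def clean_amount_decimal(amt):
    # Same cleaning steps as clean_amount, but parse into a Decimal
    # so the digits are kept exactly as written.
    amt = amt.strip("$ ")
    amt = amt.replace(",", "")
    amt = amt.split(" ")[0]
    return Decimal(amt)

total = sum(clean_amount_decimal(amt) for amt in amounts)
print(total)  # 51872.51
```

For display purposes alone, rounding the float total with `round(..., 2)` would also work.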

Let's take that same list of amount descriptions, but this time make it a pandas `Series`:

In [99]:
import pandas as pd

In [100]:
amounts = pd.Series([
    "   5,1245.31   doLLaRs   ",
    "456 Dollars  ",
    "156.20 USD",
    "$15"
])

amounts

0       5,1245.31   doLLaRs   
1                456 Dollars  
2                   156.20 USD
3                          $15
dtype: object

In pandas, you can use the same string methods we explored in plain Python, but you access them through the Series' `.str` accessor:

- **Plain**: `my_string.strip()`
- **Pandas**: `my_series.str.strip()`

- **Plain**: `my_string.replace("a", "b")`
- **Pandas**: `my_series.str.replace("a", "b")`


Let's try it out:

In [101]:
amounts

0       5,1245.31   doLLaRs   
1                456 Dollars  
2                   156.20 USD
3                          $15
dtype: object

In [102]:
amounts.str.strip("$ ")

0    5,1245.31   doLLaRs
1            456 Dollars
2             156.20 USD
3                     15
dtype: object

In [103]:
amounts.str.replace(",", "")

0       51245.31   doLLaRs   
1               456 Dollars  
2                  156.20 USD
3                         $15
dtype: object

In [104]:
amounts.str.lower()

0       5,1245.31   dollars   
1                456 dollars  
2                   156.20 usd
3                          $15
dtype: object

In [105]:
amounts.str.split(" ")

0    [, , , 5,1245.31, , , doLLaRs, , , ]
1                      [456, Dollars, , ]
2                           [156.20, USD]
3                                   [$15]
dtype: object

`.str.get(num)` is equivalent to `my_list[num]` in plain Python, and is especially helpful in combination with `.str.split(...)`:

In [106]:
amounts.str.split(" ")

0    [, , , 5,1245.31, , , doLLaRs, , , ]
1                      [456, Dollars, , ]
2                           [156.20, USD]
3                                   [$15]
dtype: object

In [107]:
amounts.str.split(" ").str.get(0)

0          
1       456
2    156.20
3       $15
dtype: object

In [108]:
amounts.str.split(" ").str.get(-1)

0       
1       
2    USD
3    $15
dtype: object
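The empty strings in the first row appear because `.str.split(" ")` treats every single space as a separator. Calling `.str.split()` with no argument splits on runs of whitespace instead, mirroring plain Python's `.split()`. A short sketch comparing the two on the same data:

```python
import pandas as pd

amounts = pd.Series([
    "   5,1245.31   doLLaRs   ",
    "456 Dollars  ",
    "156.20 USD",
    "$15",
])

# Splitting on a single space: the first row starts with an empty string.
first_by_space = amounts.str.split(" ").str.get(0).tolist()
print(first_by_space)  # ['', '456', '156.20', '$15']

# Splitting on any whitespace: the first element is the first real chunk.
first_by_whitespace = amounts.str.split().str.get(0).tolist()
print(first_by_whitespace)  # ['5,1245.31', '456', '156.20', '$15']
```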

Now, let's put it all together to get the **quantities** from the amount descriptions (ignoring, for now, whether they represent the number of dollars or cents):

In [109]:
(
    amounts
    .str.strip("$ ")
    .str.replace(",", "")
    .str.split()
    .str.get(0)
    .astype(float)
)

0    51245.31
1      456.00
2      156.20
3       15.00
dtype: float64

We're going to want to reuse this approach, so let's **put it in a function**:

In [110]:
def get_quantity(amounts):
    """Return the numeric quantity from each amount description as a float."""
    return (
        amounts
        .str.strip("$ ")
        .str.replace(",", "")
        .str.split()
        .str.get(0)
        .astype(float)
    )

Now we can call that function on our pandas `Series`:

In [111]:
get_quantity(amounts)

0    51245.31
1      456.00
2      156.20
3       15.00
dtype: float64

Or, equivalently, by passing the function to `.pipe(...)`:

In [112]:
amounts.pipe(get_quantity)

0    51245.31
1      456.00
2      156.20
3       15.00
dtype: float64
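One practical advantage of `.pipe(...)` is that the call stays inside a method chain, so further steps can follow it directly. As a sketch, here is the exercise total computed that way. The rounding step is an addition for display, since float sums can carry tiny binary rounding errors:

```python
import pandas as pd

def get_quantity(amounts):
    """Return the numeric quantity from each amount description as a float."""
    return (
        amounts
        .str.strip("$ ")
        .str.replace(",", "")
        .str.split()
        .str.get(0)
        .astype(float)
    )

amounts = pd.Series([
    "   5,1245.31   doLLaRs   ",
    "456 Dollars  ",
    "156.20 USD",
    "$15",
])

# Clean the strings, then sum the resulting floats in one chain.
total = amounts.pipe(get_quantity).sum()
rounded_total = round(float(total), 2)
print(rounded_total)  # 51872.51
```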