# 🧹 Cleaning Data, Part 1: Helpful Python Functions

*Basic methods for common issues.*

As you've probably gathered from your coursework so far, computer programs take things very literally. Differences that look small to humans — such as `"1,000 dollars"` vs. `"1000 dollars"` — can cause problems for computation.

The data we get is rarely as clean as we'd like. So let's explore some options for cleaning.

For starters, let's take a tour of common Python string methods that can help.

## Some common ways data can be messy

__Problem__: Leading/trailing whitespace (or other characters)

__Examples__:
    
- `Fish tacos    `
- `    Fish tacos`
- `   Fish tacos   `

__Solution__: `.strip(...)`

In [1]:
"Fish tacos    ".strip()

'Fish tacos'

In [2]:
"    Fish tacos".strip()

'Fish tacos'

In [3]:
"   Fish tacos   ".strip()

'Fish tacos'

By default, `.strip(...)` removes leading/trailing whitespace, but you can also pass it a string of other characters to remove instead:

In [4]:
"///Fish tacos///".strip("/")

'Fish tacos'

It stops, however, at the first character that isn't in the set you passed. Here, the spaces keep it from reaching the remaining slashes:

In [5]:
"/ / /Fish tacos/ / /".strip("/")

' / /Fish tacos/ / '

Including the space among the characters to strip removes everything:

In [6]:
"/ / /Fish tacos/ / /".strip("/ ")

'Fish tacos'
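
One detail worth seeing in action: the argument to `.strip(...)` is treated as a set of characters, not as a literal prefix or suffix. The sketch below uses made-up strings to show that, along with the one-sided variants `.lstrip()` and `.rstrip()`:

```python
# The argument is a set of characters: any mix of "x" and "y"
# is removed from both ends, in whatever order they appear.
messy = "xyxFish tacosyxx"
print(messy.strip("xy"))      # Fish tacos

# .lstrip() and .rstrip() follow the same rules, on one side only.
padded = "   Fish tacos   "
print(repr(padded.lstrip()))  # 'Fish tacos   '
print(repr(padded.rstrip()))  # '   Fish tacos'
```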

__Problem__: Extra/junk characters elsewhere

__Examples__:
    
- `1,000`
- `I_want_spaces_instead`
- `jsvine AT gmail DOT com`

__Solution__: `.replace(...)`

In [7]:
int("1,000")

ValueError: invalid literal for int() with base 10: '1,000'

In [8]:
"1,000".replace(",", "")

'1000'

In [9]:
int("1,000".replace(",", ""))

1000

In [10]:
"I_want_spaces_instead".replace("_", " ")

'I want spaces instead'

As with most string operations, you can chain them together:

In [11]:
"jsvine AT gmail DOT com".replace(" AT ", "@").replace(" DOT ", ".")

'jsvine@gmail.com'

Chains can be easier to follow if you wrap the expression in parentheses and put each call on its own line:

In [12]:
(
    "jsvine AT gmail DOT com"
    .replace(" AT ", "@")
    .replace(" DOT ", ".")
)

'jsvine@gmail.com'
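
When the list of substitutions grows, a loop can stand in for a long chain. Here's a minimal sketch; `replace_all` and `email_fixes` are hypothetical names made up for this example, not built-ins:

```python
def replace_all(text, replacements):
    """Apply each (old, new) pair to text, in order."""
    for old, new in replacements:
        text = text.replace(old, new)
    return text

# Hypothetical list of substitutions for de-obfuscating an email address
email_fixes = [(" AT ", "@"), (" DOT ", ".")]
print(replace_all("jsvine AT gmail DOT com", email_fixes))  # jsvine@gmail.com
```

The pairs are applied in order, so an earlier replacement can affect what a later one sees.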

__Problem__: Inconsistent upper/lower-casing

__Examples__:

- `Lede Program`
- `LEDE PROGRAM`
- `LeDe PROgram`

__Solution__: `.upper()`/`.lower()`/`.title()`

In [13]:
"LeDe PROgram".upper()

'LEDE PROGRAM'

In [14]:
"LeDe PROgram".lower()

'lede program'

In [15]:
"LeDe PROgram".title()

'Lede Program'
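
A common use for these methods is making differently-cased versions of the same value compare as equal. The sketch below reuses the examples above (the variable names are just for illustration) and also shows one quirk of `.title()`, which capitalizes the letter after an apostrophe:

```python
names = ["Lede Program", "LEDE PROGRAM", "LeDe PROgram"]

# Lowercasing everything collapses the three spellings into one value.
unique = {name.lower() for name in names}
print(unique)  # {'lede program'}

# .title() capitalizes any letter that follows a non-letter,
# which can look odd with apostrophes.
print("they're here".title())  # They'Re Here
```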

__Problem__: One string contains __multiple__ chunks of information

__Examples__:

- `10 kgs`
- `Smith, Jane`
- `05/31/2023`

__Solution__: `.split(...)`

This method returns a list:

In [16]:
"10 kgs".split(" ")

['10', 'kgs']

You can work with it like any list, including by using square-bracket notation to get any item:

In [17]:
"10 kgs".split(" ")[0]

'10'

Likewise, you can "unpack" each element into its own variable:

In [18]:
"Smith, Jane".split(", ")

['Smith', 'Jane']

In [19]:
last, first = "Smith, Jane".split(", ")
print(f"{first} {last}")

Jane Smith


In [20]:
m, d, y = "05/31/2023".split("/")
print("-".join([y, m, d]))

2023-05-31
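
Two variations on `.split(...)` can help when the separators are less tidy than in the examples above. This sketch uses made-up strings: calling `.split()` with no argument splits on any run of whitespace, and a second argument caps the number of splits, which keeps unpacking from failing when the separator appears more often than expected.

```python
messy = "10   kgs"
print(messy.split(" "))  # ['10', '', '', 'kgs']
print(messy.split())     # ['10', 'kgs']

# Limit to one split so there are always exactly two pieces to unpack.
last, rest = "Smith, Jane, Jr.".split(", ", 1)
print(last)  # Smith
print(rest)  # Jane, Jr.
```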


## Exercise

Write code that calculates the total amount of money expressed in this list:

In [21]:
amounts = [
    "   1,000.31   doLLaRs   ",
    "500 Dollars  ",
    "25.03 USD",
    "$10"
]

My solution:

In [22]:
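def clean_amount(amt):
    # Trim surrounding spaces and any leading dollar sign
    amt = amt.strip("$ ")
    # Drop thousands separators so float() can parse the number
    amt = amt.replace(",", "")
    # Keep only the first chunk (the number), discarding the unit
    amt = amt.split(" ")[0]
    return float(amt)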

for amt in amounts:
    print("Original: ", amt)
    print("Converted:", clean_amount(amt))
    print("---")

Original:     1,000.31   doLLaRs   
Converted: 1000.31
---
Original:  500 Dollars  
Converted: 500.0
---
Original:  25.03 USD
Converted: 25.03
---
Original:  $10
Converted: 10.0
---


In [23]:
sum(clean_amount(amt) for amt in amounts)

1535.34
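
To see why each step is there, it can help to trace the intermediate values for the messiest entry. This sketch just repeats the same three calls from `clean_amount`, one at a time (the `step` names are only for illustration):

```python
raw = "   1,000.31   doLLaRs   "

step1 = raw.strip("$ ")          # '1,000.31   doLLaRs'
step2 = step1.replace(",", "")   # '1000.31   doLLaRs'
step3 = step2.split(" ")[0]      # '1000.31'

print(float(step3))              # 1000.31
```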

## Exercise

Now let's do the same, but with a new wrinkle: some of the amounts are expressed in cents.

In [24]:
amounts = [
    "   1,000.31   doLLaRs   ",
    "54 cents  ",
    "33 CENTS",
    "$10"
]

My solution:

In [27]:
def clean_amount(amt):
    # Work out the unit first, so we know how to scale the number
    if "$" in amt or "dollar" in amt.lower():
        conversion = 1
    elif "cent" in amt.lower():
        conversion = 0.01
    else:
        raise ValueError(f"Cannot determine unit for {amt}")

    # Then clean and parse the number, as before
    amt = amt.strip("$ ")
    amt = amt.replace(",", "")
    amt = amt.split(" ")[0]
    return float(amt) * conversion

In [28]:
clean_amount("54 cents  ")

0.54

In [29]:
clean_amount("54 dollars")

54.0

In [31]:
clean_amount("$54.00")

54.0

In [34]:
clean_amount("54 euro")

ValueError: Cannot determine unit for 54 euro

In [33]:
sum(clean_amount(a) for a in amounts)

1011.18
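
If the list of units keeps growing, the `if`/`elif` chain could be swapped for a lookup table. The sketch below is one possible variation, not the solution above: `UNIT_FACTORS` and `unit_factor` are hypothetical names, and the `"usd"` entry is an assumption added so that strings like `"25.03 USD"` from the first exercise would also be recognized.

```python
# Hypothetical lookup table: text marker -> multiplier to get dollars
UNIT_FACTORS = {
    "$": 1,
    "dollar": 1,
    "usd": 1,      # assumption: not handled by the solution above
    "cent": 0.01,
}

def unit_factor(amt):
    """Return the multiplier for the first unit marker found in amt."""
    lowered = amt.lower()
    for marker, factor in UNIT_FACTORS.items():
        if marker in lowered:
            return factor
    raise ValueError(f"Cannot determine unit for {amt}")

print(unit_factor("54 cents  "))  # 0.01
print(unit_factor("25.03 USD"))   # 1
```

Because this checks for substrings, a marker like `"cent"` would also match inside a word such as `"percent"`, so real-world data may need stricter matching.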

---
