# 5. Functions, calculated columns and cleaning data

In this notebook, we'll cover writing custom functions, adding calculated columns and a few data-cleaning strategies.

First, import pandas:

In [None]:
import pandas as pd

### Functions

If you find yourself doing the same thing over and over again in your code, it might be time to write a function.

Functions are blocks of reusable code -- little boxes that (usually) take inputs and (usually) return outputs. In Excel, `=SUM()` is a function. `print()` is one of Python's built-in functions.

You can also _define your own functions_. This can save you some typing, and it will help separate your code into logical, easy-to-read pieces.

#### Syntax

Functions start with the `def` keyword -- short for _define_, because you're defining a function -- then the name of the function, then parentheses (sometimes with the names of any `arguments` your function requires inside the parentheses) and then a colon. The function's code sits inside an indented block immediately below that line. In most cases, a function will `return` a value at the end.

Here is a function that takes a number and returns that number multiplied by 10:

In [None]:
def times_ten(number):
    return number * 10

The `number` argument is just a placeholder for whatever value is handed to the function as an input. We could have called that argument `banana` and things would be just fine (though it would be confusing for people reading your code).

#### Calling a function

By itself, a function doesn't do anything. We have built a tiny machine to multiply a number by 10. But it's just sitting on the workshop bench, waiting for us to use it.

Let's use it!

In [None]:
times_ten(2)
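To check the earlier point that the argument name is arbitrary, here is a small sketch. The function name `times_ten_banana` is made up for this illustration; it behaves the same as `times_ten`.

```python
# same logic as `times_ten`, with a deliberately silly argument name
# (`times_ten_banana` is a hypothetical name used only for this illustration)
def times_ten_banana(banana):
    return banana * 10

# the caller never sees the argument name -- only the value passed in matters
result = times_ten_banana(4)
print(result)
```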

#### Function arguments

Functions can accept _positional_ arguments or _keyword_ arguments.

If your function uses _positional_ arguments, the order in which you pass arguments to the function matters. Here is a function that prints out a message based on its input: a person's name and their hometown.

(This function uses something called an "f-string" to format the result. For more information on text formatting, see [this notebook](String%20formatting.ipynb).)

In [None]:
def greet(name, hometown):
    return f'Hello, {name} from {hometown}!'

Now let's call it.

In [None]:
greet('Cody', 'Pavillion, WY')

If we change the order of the arguments, we get nonsense.

In [None]:
greet('Pavillion, WY', 'Cody')

With _keyword_ arguments, we specify which value belongs to which argument when we call the function. Defining the function with `argument=value` pairs also lets us set default values -- the values the function will use for any argument you don't pass when you call it. We could rewrite our function like this:

In [None]:
def greet(name='Cody', hometown='Pavillion, WY'):
    return f'Hello, {name} from {hometown}!'

And now it doesn't matter what order we pass in the arguments, because we're defining the keyword that they belong to:

In [None]:
greet(hometown='Pittsburgh, PA', name='Jacob')

What happens if we call the `greet()` function without any arguments at all, now? It'll use the default arguments.

In [None]:
greet()
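Two more cases are worth trying: overriding just one of the defaults, and passing values by position to a function that has defaults. The sketch below redefines `greet()` so it runs on its own; the hometown `'Laramie, WY'` is a made-up example value.

```python
def greet(name='Cody', hometown='Pavillion, WY'):
    return f'Hello, {name} from {hometown}!'

# override only one default; `name` falls back to its default value
# ('Laramie, WY' is just an example value for this sketch)
only_hometown = greet(hometown='Laramie, WY')
print(only_hometown)

# arguments with defaults can still be passed by position,
# in which case the order in the function definition applies again
by_position = greet('Jacob', 'Pittsburgh, PA')
print(by_position)
```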

### ✍️ Try it yourself

Use the code blocks below to experiment with functions.

### Adding new or calculated columns

In a spreadsheet program, if you want to add a new column of data -- maybe a copy of an existing column for cleaning -- you could just reference the original column in a formula. If you wanted to calculate a new column of values based on other values in each row, you might write a formula and fill it down. In SQL, you might run an `ALTER TABLE`/`UPDATE`/`SET` routine to handle this process.

In pandas, adding a new column is similar to adding a new record to a Python dictionary. Let's load in the CT overdose data to take a look at how this works.

In [None]:
df_ct = pd.read_excel('../data/CT_Overdoses_2012-2016.xlsx', sheet_name='Accidental_Drug_Related_Deaths_')

In [None]:
df_ct.head()

Let's say we eventually wanted to do some analysis based on the `Death City` column, but maybe first we need to clean it up. You always want to leave your original data intact, so the first step would be to create a copy of the `Death City` column:

In [None]:
df_ct['death_city_clean'] = df_ct['Death City']

In [None]:
df_ct.columns

... and then you could work through some cleaning steps (more on that below).
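The dictionary comparison can be sketched with a toy dataframe. The values here are invented for illustration and are not taken from the CT data; the string methods used on the copy are covered in the cleaning section further down.

```python
import pandas as pd

# a plain dictionary: assigning to a new key adds a new entry
record = {'name': 'Cody'}
record['hometown'] = 'Pavillion, WY'

# a toy dataframe with made-up values (not the real CT data)
df_toy = pd.DataFrame({'Death City': ['HARTFORD', 'New Haven ', 'hartford']})

# a dataframe works the same way: assigning to a new column name adds a column
df_toy['death_city_clean'] = df_toy['Death City']

# changes to the working copy leave the original column untouched
df_toy['death_city_clean'] = df_toy['death_city_clean'].str.strip().str.upper()

print(df_toy)
```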

To create a calculated column, you would first define a function to process a row of data in your dataframe, then _apply_ that function to your dataframe using a pandas method called [`apply()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html).

The values in several columns in our dataframe list whether a particular drug was found by the medical examiner examining the body, with `Y` meaning it was found and the default pandas null value (`NaN`) if not. Let's add a new column, `drugs_involved_total`, that totals up the number of `Y`s in each row for the columns listing individual drugs:

```python
'Heroin',
'Cocaine',
'Fentanyl',
'Oxycodone',
'Oxymorphone',
'EtOH',
'Hydro-codeine',
'Benzodiazepine',
'Methadone',
'Amphet',
'Tramad',
'Morphine (not heroin)',
'Other'
```

Now we can write a function that accepts as its one positional argument a row of data in the dataframe, checks the values in each of our target columns -- keeping track of the `Y`s -- and then returns the total.

In [None]:
# the name of the function is more or less arbitrary
# `row` is also an arbitrary argument name but helps us think about what's happening
def get_total_drugs(row):

    # start a counter for how many drugs were present
    total_drugs = 0
    
    # list the names of the columns to check
    drug_columns = [
        'Heroin',
        'Cocaine',
        'Fentanyl',
        'Oxycodone',
        'Oxymorphone',
        'EtOH',
        'Hydro-codeine',
        'Benzodiazepine',
        'Methadone',
        'Amphet',
        'Tramad',
        'Morphine (not heroin)',
        'Other'        
    ]
    
    # loop over the column list
    for col in drug_columns:
        
        # grab the value for that column in this row
        value = row[col]
        
        # if the value is `Y` ...
        if value == 'Y':
            
            # ... increment the counter 
            # (this is just a shortcut for `total_drugs = total_drugs + 1`)
            total_drugs += 1
    
    # once the loop completes, return the counter
    return total_drugs

Once you have a function defined, you can `apply()` it to the dataframe:

In [None]:
df_ct['drugs_involved_total'] = df_ct.apply(get_total_drugs, axis=1)

In [None]:
df_ct.columns

In [None]:
df_ct.drugs_involved_total.unique()

In [None]:
df_ct.head()

👉 For more information on applying functions to a pandas data frame, [check out this notebook](Using%20the%20apply%20method%20in%20pandas.ipynb).
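As an aside, this particular count can also be done without a row-by-row function, by comparing the drug columns to `'Y'` all at once and summing across each row. This is an alternative technique, not the approach used in this notebook. The sketch below uses a toy dataframe with made-up values and only three of the drug columns, and checks that both approaches agree.

```python
import numpy as np
import pandas as pd

# toy data with made-up values; `Y` means found, NaN means not found
df_toy = pd.DataFrame({
    'Heroin': ['Y', np.nan, 'Y'],
    'Cocaine': [np.nan, np.nan, 'Y'],
    'Fentanyl': ['Y', np.nan, 'Y'],
})

# a shortened column list for this sketch
drug_columns = ['Heroin', 'Cocaine', 'Fentanyl']

# row-by-row approach, as in the notebook
def get_total_drugs(row):
    total_drugs = 0
    for col in drug_columns:
        if row[col] == 'Y':
            total_drugs += 1
    return total_drugs

df_toy['total_apply'] = df_toy.apply(get_total_drugs, axis=1)

# alternative: build a True/False table, then sum the Trues across each row
df_toy['total_vectorized'] = (df_toy[drug_columns] == 'Y').sum(axis=1)

print(df_toy)
```

On a large dataframe the second version is typically faster, but the `apply()` version is easier to extend when the logic for each row gets more complicated.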

### Cleaning data

For cleaning jobs of any size, specialized tools like [OpenRefine](http://openrefine.org/) are still your best bet -- a typical workflow is to clean your data in OpenRefine, export as a CSV, then load into pandas.

But in many cases, you can use some of pandas' built-in tools to whip your data into shape. This is especially useful for data processing tasks that you plan to repeat as the data are updated.

In Excel, running a pivot table (with counts) for each column will show you misspellings, external white space, inconsistent casing and other problems that keep your data from grouping correctly.

In SQL, you might do the same thing with The Golden Query™️:

```sql
SELECT column, COUNT(*)
FROM table
GROUP BY column
ORDER BY 2 DESC
```

To do the equivalent operation in pandas, you can just call the `value_counts()` method on a column. Let's look at some Congressional junkets data as an example:

In [None]:
df_junkets = pd.read_csv('../data/congress_junkets.csv')

In [None]:
df_junkets.head()

Let's run `value_counts()` on the _Destination_ column:

In [None]:
df_junkets['Destination'].value_counts()

The default sort order is by count descending, but it can also be helpful in finding typos to sort by the name -- the "index" of what `value_counts()` returns. To do that, tack on `sort_index()`:

In [None]:
df_junkets['Destination'].value_counts().sort_index()

... and now we start to see some common data problems in our 838 unique destinations -- whitespace, inconsistent values for the same thing ("Accra" and "Accra, Ghana") -- and can start fixing them.
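Here is a small, self-contained sketch of the same two calls on a handful of invented destination values (not the real junkets data), showing how sorting by name puts near-duplicates next to each other.

```python
import pandas as pd

# made-up values that mimic the problems described above
destinations = pd.Series([
    'Accra',
    'Accra, Ghana',
    'Accra',
    ' Accra',      # leading whitespace
    'Ames, IA',
])

# default: sorted by count, descending
counts = destinations.value_counts()
print(counts)

# sorted by the value itself, so similar names land side by side
by_name = destinations.value_counts().sort_index()
print(by_name)
```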

### Fixing whitespace, casing and other "string" problems

If part of our analysis hinged on having a pristine "Destination" column, then we've got some work ahead of us. First thing I'd do: Strip whitespace and upcase the text.

You can do a lot of basic cleanup like this by applying Python's built-in string methods to the `str` attribute of a column.

To start with, let's create a new column, `destination_clean`, with a stripped/uppercase version of the destination data.

**Note**: Outside of pandas, you can use "method chaining" to apply multiple transformations to a string, like this: `'   My String'.upper().strip()`.

When you're chaining string methods on the `str` attribute of a pandas column series, though, it doesn't work like that -- you have to call `str` after each method call. In other words:

```python
# this will throw an error
df_junkets['destination_clean'] = df_junkets['Destination'].str.upper().strip()

# this will work
df_junkets['destination_clean'] = df_junkets['Destination'].str.upper().str.strip()
```

In [None]:
df_junkets['destination_clean'] = df_junkets['Destination'].str.upper().str.strip()

In [None]:
df_junkets.head()

Now let's run `value_counts()` again to see if that helped at all.

In [None]:
df_junkets['destination_clean'].value_counts().sort_index()

That eliminated a handful of problems. Now comes the tedious work of identifying entries to find and replace.
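To see why this step helps, and why `str` has to be repeated in the chain, here is a sketch on a few invented values (not the real junkets data) that compares the number of unique values before and after.

```python
import pandas as pd

# made-up values: three spellings of one place, plus one other place
raw = pd.Series(['Accra, Ghana', 'ACCRA, GHANA ', ' accra, ghana', 'Ames, IA'])

# leaving out the second `.str` fails, because `.str.upper()` returns a Series
# and a Series has no `strip()` method of its own
try:
    raw.str.upper().strip()
    chained_without_str_failed = False
except AttributeError:
    chained_without_str_failed = True

# calling `.str` before each string method works
clean = raw.str.upper().str.strip()

unique_before = raw.nunique()
unique_after = clean.nunique()
print(unique_before, unique_after)
```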

### Bulk-replacing values with other values

If we were at this point in Excel, we'd scroll through the list of unique names and start making notes of what we need to change. Same story here.

Let's loop over a [sorted](https://docs.python.org/3/howto/sorting.html) list of `unique()` destinations and `print()` each one.

In [None]:
for destination in sorted(df_junkets.destination_clean.unique()):
    print(destination)

And here is where we're going to start encoding our editorial choices. "Ames, IA" or "Ames, Iowa"? "Baku, Azerjaijan," or "Baku, Republic of Azerbaijan"? Etc.

There are several ways we could structure this data, but a dictionary makes some sense based on what we need to do, so let's do that. Each key will be a string that we'd like to replace; each value will be the string we'd like to replace it with. To get us started:

In [None]:
typo_fixes = {
    'BAKU, AZERBIJAN': 'BAKU, AZERBAIJAN',
    'BAKU, REPUBLIC OF AZERBAIJAN': 'BAKU, AZERBAIJAN',
    'ADDIS, ETHIOPIA': 'ADDIS ABABA, ETHIOPIA',
    'ANKEY, IA': 'ANKENY, IA'
}

... and so on. (This is tedious work, and -- again -- tools like OpenRefine make this process somewhat less tedious. But if you have a long-term project that involves data that will be updated regularly, and it's worth putting in the time to make sure the data are cleaned the same way each time, you can do it all in pandas.)

Here's how we might _apply_ our bulk find-and-replace dictionary:

In [None]:
def find_replace_destination(row):
    '''Given a row of data, see if the value is a typo to be replaced'''
    
    # get the clean destination value
    dest = row['destination_clean']
    
    # try to look it up in the `typo_fixes` dictionary
    # the `get()` method will return None if it doesn't find a match
    typo = typo_fixes.get(dest)
    
    # then we can test to see if `get()` got an item out of the dictionary (True)
    # or if it returned None (False)
    if typo:
        # if it found an entry in our dictionary,
        # return the value from that key/value pair
        return typo_fixes[dest]
    # otherwise
    else:
        # return the original destination string
        return dest

In [None]:
# apply the function and overwrite our working "clean" column
df_junkets['destination_clean'] = df_junkets.apply(find_replace_destination, axis=1)

In [None]:
df_junkets.head()
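For a straight value-for-value swap like this one, pandas also has a `replace()` method on a column that accepts a dictionary directly. This is an alternative to the function above, not a replacement for it; a function is still the better fit once the logic needs more than a lookup. The sketch uses a toy dataframe with made-up rows and a shortened version of the `typo_fixes` dictionary.

```python
import pandas as pd

# a shortened version of the dictionary above
typo_fixes = {
    'BAKU, AZERBIJAN': 'BAKU, AZERBAIJAN',
    'ANKEY, IA': 'ANKENY, IA',
}

# toy data with made-up rows
df_toy = pd.DataFrame({
    'destination_clean': ['BAKU, AZERBIJAN', 'ANKEY, IA', 'ACCRA, GHANA']
})

# values that match a dictionary key are swapped; everything else is left alone
df_toy['destination_fixed'] = df_toy['destination_clean'].replace(typo_fixes)

print(df_toy)
```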

### Further reading

This just scratches the surface of what you can do in pandas. Here are some other resources to check out:

- [Pythonic Data Cleaning With NumPy and Pandas](https://realpython.com/python-data-cleaning-numpy-pandas/)
- [pandas official list of tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html)
- [Karrie Kehoe's guide to cleaning data in pandas](https://github.com/KarrieK/pandas_data_cleaning)
- [Data cleaning with Python](https://www.dataquest.io/blog/data-cleaning-with-python/)