# Cleaning data in pandas

For cleaning jobs of any size, specialized tools like [OpenRefine](http://openrefine.org/) are still your best bet -- a typical workflow is to clean your data in OpenRefine, export as a CSV, then load into pandas.

But in many cases, you can use some of pandas' built-in tools to whip your data into shape. This is especially useful for data processing tasks that you plan to repeat as the data are updated.

Let's import pandas, then we'll run through some scenarios.

In [1]:
import pandas as pd

### How dirty is your data?

In Excel, running a pivot table (with counts) for each column will show you misspellings, external white space, inconsistent casing and other problems that keep your data from grouping correctly.

In SQL, you might do the same thing with The Golden Query™️:

```sql
SELECT column, COUNT(*)
FROM table
GROUP BY column
ORDER BY 2 DESC
```

To do the equivalent operation in pandas, you can just call the `value_counts()` method on a column. Let's look at some Congressional junkets data as an example:

In [14]:
junkets = pd.read_csv('../data/congress_junkets.csv')

In [15]:
junkets.head()

Unnamed: 0,DocID,FilerName,MemberName,State,District,Year,Destination,FilingType,DepartureDate,ReturnDate,TravelSponsor
0,500005076,Bobby Cornett,"Franks, Trent",AZ,8.0,2011,"Las Vegas, NV",Original,1/7/2011,1/9/2011,Consumer Electronics Association
1,500005077,Michael Strittmatter,"Franks, Trent",AZ,8.0,2011,"Las Vegas, NV",Original,1/7/2011,1/9/2011,CEA Leaders in Technology
2,500005081,Diane Rinaldo,"Rogers, Mike",AL,3.0,2011,"Las Vegas, NV",Original,1/6/2011,1/8/2011,Consumer Electronics Association
3,500005082,Kenneth DeGraff,"Doyle, Michael",PA,14.0,2011,"Las Vegas, NV",Original,1/6/2011,1/9/2011,Consumer Electronics Association
4,500005083,Michael Ryan Clough,"Lofgren, Zoe",CA,19.0,2011,"Las Vegas, NV",Original,1/6/2011,1/8/2011,Consumer Electronics Association


Let's run `value_counts()` on the _Destination_ column:

In [16]:
junkets['Destination'].value_counts()

Baltimore, MD                     827
Hot Springs, VA                   753
Tel Aviv, Israel                  651
New York, NY                      635
Philadelphia, PA                  487
Cambridge, MD                     468
Jerusalem, Israel                 466
Las Vegas, NV                     373
Williamsburg, VA                  371
Istanbul, Turkey                  365
Tiberias, Israel                  240
Ankara, Turkey                    222
Warrenton, VA                     159
Boston, MA                        138
Los Angeles, CA                   138
Middleburg, VA                    137
San Francisco, CA                 136
Miami, FL                         119
Berlin, Germany                   118
Hershey, PA                       108
Chicago, IL                       105
Atlanta, GA                        93
Brussels, Belgium                  89
Tokyo, Japan                       87
New Orleans, LA                    82
Palo Alto, CA                      70
Havana, Cuba

The default sort order is by count descending, but it can also be helpful in finding typos to sort by the name -- the "index" of what `value_counts()` returns. To do that, tack on `sort_index()`:

In [19]:
junkets['Destination'].value_counts().sort_index()

Abidjan, Cote d'Ivoire              3
Abu Dhabi, United Arab Emirate      3
Abuja, Nigeria                      4
Accra                               1
Accra, Ghana                       10
Addis Ababa, Ethiopia              43
Addis, Ethiopia                     3
Adelaide, Australia                 1
Aiken, SC                          15
Akron, OH                           7
Albany, NY                         10
Alberta, Canada                    13
Albuquerque, NM                     8
Algiers, Algeria                   11
Allentown, PA                       1
Amelia Island, FL                   2
Ames, IA                           23
Amman, Jordan                      17
Amsterdam, Netherlands              2
Anaheim, CA                         2
Anatalya, Turkey                    1
Anchorage, AK                       4
Andechs, Germany                    9
Ankara, Israel                      1
Ankara, Turkey                    222
Ankeny, IA                          3
Ankey, IA   

... and now we start to see some common data problems in our 838 unique destinations -- whitespace, inconsistent values for the same thing ("Accra" and "Accra, Ghana") -- and can start fixing them.
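
As a rough sketch of how you might get a count of unique values like the one mentioned above, here is `nunique()` alongside the sorted `value_counts()` on a tiny, made-up series. The values below are hypothetical stand-ins, not the real junkets data.

```python
import pandas as pd

# hypothetical sample values, not the real junkets data
destinations = pd.Series([
    'Accra, Ghana', 'Accra', 'Accra, Ghana',
    'Ankeny, IA', 'Ankey, IA', ' Ankeny, IA',
])

# number of distinct values (nulls are not counted)
unique_count = destinations.nunique()

# counts sorted by label, so near-duplicates land next to each other
counts_by_name = destinations.value_counts().sort_index()

print(unique_count)
print(counts_by_name)
```

Note that the entry with a leading space is counted as its own value, which is exactly the kind of problem sorting by name helps you spot.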

### Fixing whitespace, casing and other "string" problems

If part of our analysis hinged on having a pristine "Destination" column, then we've got some work ahead of us. First thing I'd do: Strip whitespace and upcase the text.

You can do a lot of basic cleanup like this by applying Python's built-in string methods to the `str` attribute of a column.

👉 For more information on Python string methods, [check out this notebook](Python%20data%20types%20and%20basic%20syntax.ipynb#String-methods).

To start with, let's create a new column, `destination_clean`, with a stripped/uppercase version of the destination data.

**Note**: Outside of pandas, you can use "method chaining" to apply multiple transformations to a string, like this: `'   My String'.upper().strip()`.

When you're chaining string methods on the `str` attribute of a pandas column series, though, it doesn't work like that -- you have to call `str` after each method call. In other words:

```python
# this will throw an error
junkets['destination_clean'] = junkets['Destination'].str.upper().strip()

# this will work
junkets['destination_clean'] = junkets['Destination'].str.upper().str.strip()
```
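
If you'd like to see the difference without loading the junkets file, here is a small self-contained sketch using a made-up series (the `messy` values are hypothetical):

```python
import pandas as pd

# hypothetical sample values with stray whitespace and mixed casing
messy = pd.Series(['  Las Vegas, NV', 'las vegas, nv  ', 'LAS VEGAS, NV'])

try:
    # a Series has no strip() method of its own, so this fails
    messy.str.upper().strip()
except AttributeError as error:
    print(error)

# going back through .str for each step works
clean = messy.str.upper().str.strip()
print(clean.nunique())
```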

In [30]:
junkets['destination_clean'] = junkets['Destination'].str.upper().str.strip()

In [31]:
junkets.head()

Unnamed: 0,DocID,FilerName,MemberName,State,District,Year,Destination,FilingType,DepartureDate,ReturnDate,TravelSponsor,destination_clean
0,500005076,Bobby Cornett,"Franks, Trent",AZ,8.0,2011,"Las Vegas, NV",Original,1/7/2011,1/9/2011,Consumer Electronics Association,"LAS VEGAS, NV"
1,500005077,Michael Strittmatter,"Franks, Trent",AZ,8.0,2011,"Las Vegas, NV",Original,1/7/2011,1/9/2011,CEA Leaders in Technology,"LAS VEGAS, NV"
2,500005081,Diane Rinaldo,"Rogers, Mike",AL,3.0,2011,"Las Vegas, NV",Original,1/6/2011,1/8/2011,Consumer Electronics Association,"LAS VEGAS, NV"
3,500005082,Kenneth DeGraff,"Doyle, Michael",PA,14.0,2011,"Las Vegas, NV",Original,1/6/2011,1/9/2011,Consumer Electronics Association,"LAS VEGAS, NV"
4,500005083,Michael Ryan Clough,"Lofgren, Zoe",CA,19.0,2011,"Las Vegas, NV",Original,1/6/2011,1/8/2011,Consumer Electronics Association,"LAS VEGAS, NV"


Now let's run `value_counts()` again to see if that helped at all.

In [32]:
junkets['destination_clean'].value_counts().sort_index()

ABIDJAN, COTE D'IVOIRE              3
ABU DHABI, UNITED ARAB EMIRATE      3
ABUJA, NIGERIA                      4
ACCRA                               1
ACCRA, GHANA                       10
ADDIS ABABA, ETHIOPIA              43
ADDIS, ETHIOPIA                     3
ADELAIDE, AUSTRALIA                 1
AIKEN, SC                          15
AKRON, OH                           7
ALBANY, NY                         10
ALBERTA, CANADA                    13
ALBUQUERQUE, NM                     8
ALGIERS, ALGERIA                   11
ALLENTOWN, PA                       1
AMELIA ISLAND, FL                   2
AMES, IA                           23
AMMAN, JORDAN                      17
AMSTERDAM, NETHERLANDS              2
ANAHEIM, CA                         2
ANATALYA, TURKEY                    1
ANCHORAGE, AK                       4
ANDECHS, GERMANY                    9
ANKARA, ISRAEL                      1
ANKARA, TURKEY                    222
ANKENY, IA                          3
ANKEY, IA   

That eliminated a handful of problems. Now comes the tedious work of identifying entries to find and replace.

### Bulk-replacing values with other values

If we were at this point in Excel, we'd scroll through the list of unique names and start making notes of what we need to change. Same story here.

Let's loop over a [sorted](https://docs.python.org/3/howto/sorting.html) list of `unique()` destinations and `print()` each one.

👉 For a refresher on _for loops_, [see this notebook](Python%20data%20types%20and%20basic%20syntax.ipynb#for-loops).

In [35]:
for destination in sorted(junkets.destination_clean.unique()):
    print(destination)

ABIDJAN, COTE D'IVOIRE
ABU DHABI, UNITED ARAB EMIRATE
ABUJA, NIGERIA
ACCRA
ACCRA, GHANA
ADDIS ABABA, ETHIOPIA
ADDIS, ETHIOPIA
ADELAIDE, AUSTRALIA
AIKEN, SC
AKRON, OH
ALBANY, NY
ALBERTA, CANADA
ALBUQUERQUE, NM
ALGIERS, ALGERIA
ALLENTOWN, PA
AMELIA ISLAND, FL
AMES, IA
AMMAN, JORDAN
AMSTERDAM, NETHERLANDS
ANAHEIM, CA
ANATALYA, TURKEY
ANCHORAGE, AK
ANDECHS, GERMANY
ANKARA, ISRAEL
ANKARA, TURKEY
ANKENY, IA
ANKEY, IA
ANN ARBOR, MI
ANNAPOLIS, MD
ANOMABO, GHANA
ANTALYA, TURKEY
ANTIGUA, GUATEMALA
ARAUCA, COLOMBIA
ARLINGTON, VA
ARUSHA, TANZANIA
ASHEVILLE, NC
ASPEN, CO
ATLANTA, GA
ATLANTA, GEORGIA
ATLANTA,GA
AUGUSTA, GA
AUSTIN, TEXAS
AUSTIN, TX
AVENTURA, FL
AVILA BEACH, CA
AVOCA, IA
AWASSA, ETHIOPIA
BAKU, AZERBAIJAN
BAKU, AZERBIJAN
BAKU, REPUBLIC OF AZERBAIJAN
BALI, INDONESIA
BALTIMORE, DC
BALTIMORE, MD
BALTIMROE, MD
BANFF, CANADA
BANGALORE, INDIA
BANJA LUKA, BOSNIA-HERZEGOVINA
BARCELONA, SPAIN
BARTLESVILLE, OK
BATON ROUGE, LA
BATTLE CREEK, MI
BEARDSTOWN, IL
BEDFORD SPRINGS, PA
BEDFORD, PA
BEIJIN

And here is where we're going to start encoding our editorial choices. "Ames, IA" or "Ames, Iowa"? "Baku, Azerbaijan," or "Baku, Republic of Azerbaijan"? Etc.

There are several ways we could structure this data, but a dictionary sounds like it'd be the most fun, so let's do that. Each key will be a string that we'd like to replace; each value will be the string we'd like to replace it with. To get us started:

In [36]:
typo_fixes = {
    'BAKU, AZERBIJAN': 'BAKU, AZERBAIJAN',
    'BAKU, REPUBLIC OF AZERBAIJAN': 'BAKU, AZERBAIJAN',
    'ADDIS, ETHIOPIA': 'ADDIS ABABA, ETHIOPIA',
    'ANKEY, IA': 'ANKENY, IA'
}

... and so on. (This is tedious work, and -- again -- tools like OpenRefine make this process somewhat less tedious. But if you have a long-term project that involves data that will be updated regularly, and it's worth putting in the time to make sure the data are cleaned the same way each time, you can do it all in pandas.)

👉 For more information on dictionaries, [check out this notebook](Python%20data%20types%20and%20basic%20syntax.ipynb#Dictionaries).

Here's how we might _apply_ our bulk find-and-replace dictionary:

In [42]:
def find_replace_destination(row):
    '''Given a row of data, see if the value is a typo to be replaced'''
    
    # get the clean destination value
    dest = row['destination_clean']
    
    # try to look it up in the `typo_fixes` dictionary
    # the `get()` method will return None if it's not there
    typo = typo_fixes.get(dest)
    
    # then we can test to see if `get()` got an item out of the dictionary (True)
    # or if it returned None (False)
    if typo:
        # if it found an entry in our dictionary,
        # return the value from that key/value pair
        return typo_fixes[dest]
    # otherwise
    else:
        # return the original destination string
        return dest

# apply the function and overwrite our working "clean" column
junkets['destination_clean'] = junkets.apply(find_replace_destination, axis=1)

In [43]:
junkets.head()

Unnamed: 0,DocID,FilerName,MemberName,State,District,Year,Destination,FilingType,DepartureDate,ReturnDate,TravelSponsor,destination_clean
0,500005076,Bobby Cornett,"Franks, Trent",AZ,8.0,2011,"Las Vegas, NV",Original,1/7/2011,1/9/2011,Consumer Electronics Association,"LAS VEGAS, NV"
1,500005077,Michael Strittmatter,"Franks, Trent",AZ,8.0,2011,"Las Vegas, NV",Original,1/7/2011,1/9/2011,CEA Leaders in Technology,"LAS VEGAS, NV"
2,500005081,Diane Rinaldo,"Rogers, Mike",AL,3.0,2011,"Las Vegas, NV",Original,1/6/2011,1/8/2011,Consumer Electronics Association,"LAS VEGAS, NV"
3,500005082,Kenneth DeGraff,"Doyle, Michael",PA,14.0,2011,"Las Vegas, NV",Original,1/6/2011,1/9/2011,Consumer Electronics Association,"LAS VEGAS, NV"
4,500005083,Michael Ryan Clough,"Lofgren, Zoe",CA,19.0,2011,"Las Vegas, NV",Original,1/6/2011,1/8/2011,Consumer Electronics Association,"LAS VEGAS, NV"


👉 For more information on writing your own functions, [check out this notebook](Functions.ipynb).

👉 For more information on applying functions to a pandas data frame, [check out this notebook](Using%20the%20apply%20method%20in%20pandas.ipynb).
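
A different technique that should get the same result for a simple lookup like this is the `replace()` method on a column series, which accepts a dictionary directly. This is a minimal sketch on a made-up data frame (`sample` is a hypothetical stand-in for `junkets`), not a replacement for the function above, which is easier to extend if your rules get more complicated.

```python
import pandas as pd

# hypothetical stand-in for the junkets data frame
sample = pd.DataFrame({
    'destination_clean': ['BAKU, AZERBIJAN', 'ANKEY, IA', 'LAS VEGAS, NV']
})

typo_fixes = {
    'BAKU, AZERBIJAN': 'BAKU, AZERBAIJAN',
    'ANKEY, IA': 'ANKENY, IA',
}

# replace() swaps values that exactly match a dictionary key
# and leaves everything else alone
sample['destination_clean'] = sample['destination_clean'].replace(typo_fixes)
print(sample)
```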

### Nonstandard values to represent null entries

Data creators may express null values in a variety of ways -- `''`, `'n/a'`, `NA`, `.`, etc. But for your purposes, you want pandas to read them all as `NaN`, so you can take advantage of methods like [`isnull()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.isnull.html) in your analysis.

If you've already done some exploratory analysis, you can specify the `na_values` argument when you read in the data -- you can supply a _single_ value that should be interpreted as null, or you can hand off a list `[]` of values.

As an example, let's take a look at some EPA air quality data from Ohio:

In [44]:
air_quality = pd.read_excel('../data/epa.xlsx')

In [45]:
air_quality.head()

Unnamed: 0,CBSA Code,CBSA,CO 2nd Max 1-hr,CO 2nd Max 8-hr,NO2 98th Percentile 1-hr,NO2 Mean 1-hr,Ozone 2nd Max 1-hr,Ozone 4th Max 8-hr,SO2 99th Percentile 1-hr,SO2 2nd Max 24-hr,SO2 Mean 1-hr,PM2.5 98th Percentile 24-hr,PM2.5 Weighted Mean 24-hr,PM10 2nd Max 24-hr,PM10 Mean 24-hr,Lead Max 3-Mo Avg
0,10420,"Akron, OH",17.6,7.8,.,.,0.12,0.099,188,66,19,.,.,.,.,.
1,11780,"Ashtabula, OH",.,.,.,.,0.11,0.092,101,34,10,.,.,.,.,.
2,15940,"Canton-Massillon, OH",8,4.5,.,29,0.1,0.087,.,.,.,.,.,.,.,.
3,17060,"Chillicothe, OH",.,.,.,.,.,.,.,.,.,.,.,.,.,.
4,17140,"Cincinnati, OH-KY-IN",10,6,70,26,0.16,0.11,190,55,14,.,.,.,.,.


After conferring with the source of this data, we learned that the dots `.` represent "no observation" -- a null value. Let's try reading that in again, this time specifying `na_values`:

In [46]:
air_quality = pd.read_excel('../data/epa.xlsx',
                            na_values='.')

In [47]:
air_quality.head()

Unnamed: 0,CBSA Code,CBSA,CO 2nd Max 1-hr,CO 2nd Max 8-hr,NO2 98th Percentile 1-hr,NO2 Mean 1-hr,Ozone 2nd Max 1-hr,Ozone 4th Max 8-hr,SO2 99th Percentile 1-hr,SO2 2nd Max 24-hr,SO2 Mean 1-hr,PM2.5 98th Percentile 24-hr,PM2.5 Weighted Mean 24-hr,PM10 2nd Max 24-hr,PM10 Mean 24-hr,Lead Max 3-Mo Avg
0,10420,"Akron, OH",17.6,7.8,,,0.12,0.099,188.0,66.0,19.0,,,,,
1,11780,"Ashtabula, OH",,,,,0.11,0.092,101.0,34.0,10.0,,,,,
2,15940,"Canton-Massillon, OH",8.0,4.5,,29.0,0.1,0.087,,,,,,,,
3,17060,"Chillicothe, OH",,,,,,,,,,,,,,
4,17140,"Cincinnati, OH-KY-IN",10.0,6.0,70.0,26.0,0.16,0.11,190.0,55.0,14.0,,,,,
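
The `na_values` argument also takes a list, and it works the same way in `read_csv()`. Here is a small sketch that reads made-up CSV text from memory rather than a file; the column names and the `'no data'` marker are assumptions for illustration only.

```python
import io
import pandas as pd

# hypothetical CSV text standing in for a real file
raw = io.StringIO(
    'city,co_max,ozone_max\n'
    'Akron,17.6,.\n'
    'Canton,no data,0.1\n'
    'Dayton,.,0.13\n'
)

# treat both '.' and 'no data' as null
readings = pd.read_csv(raw, na_values=['.', 'no data'])

# count the nulls in each column
print(readings.isnull().sum())
```

Because the placeholder strings are read as nulls, the two measurement columns come in as numbers instead of text.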


### You want to replace null values with 0, or something else

Use the [`fillna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html) method on a data frame or column series to fill null values with some other value.

Let's say our reporting had shown that the dots in the air quality data weren't, in fact, null. Let's say they were actually supposed to be zeroes. Here's how we'd fix that:

In [74]:
air_quality.fillna(0)

Unnamed: 0,CBSA Code,CBSA,CO 2nd Max 1-hr,CO 2nd Max 8-hr,NO2 98th Percentile 1-hr,NO2 Mean 1-hr,Ozone 2nd Max 1-hr,Ozone 4th Max 8-hr,SO2 99th Percentile 1-hr,SO2 2nd Max 24-hr,SO2 Mean 1-hr,PM2.5 98th Percentile 24-hr,PM2.5 Weighted Mean 24-hr,PM10 2nd Max 24-hr,PM10 Mean 24-hr,Lead Max 3-Mo Avg
0,10420,"Akron, OH",17.6,7.8,0.0,0.0,0.12,0.099,188.0,66.0,19.0,0.0,0.0,0.0,0.0,0.0
1,11780,"Ashtabula, OH",0.0,0.0,0.0,0.0,0.11,0.092,101.0,34.0,10.0,0.0,0.0,0.0,0.0,0.0
2,15940,"Canton-Massillon, OH",8.0,4.5,0.0,29.0,0.1,0.087,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,17060,"Chillicothe, OH",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,17140,"Cincinnati, OH-KY-IN",10.0,6.0,70.0,26.0,0.16,0.11,190.0,55.0,14.0,0.0,0.0,0.0,0.0,0.0
5,17460,"Cleveland-Elyria, OH",16.3,11.0,0.0,0.0,0.12,0.097,516.0,125.0,18.0,0.0,0.0,0.0,0.0,0.0
6,18140,"Columbus, OH",20.0,12.1,0.0,19.0,0.13,0.101,173.0,48.0,9.0,0.0,0.0,0.0,0.0,0.0
7,18740,"Coshocton, OH",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8,19380,"Dayton, OH",14.0,7.1,0.0,0.0,0.13,0.102,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9,22300,"Findlay, OH",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
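
Filling every null in the data frame with the same value is not always what you want. As a sketch, here is how you might fill a single column, or pass a dictionary to use a different fill value for each column. The data frame and its column names are made up for illustration. Note that `fillna()` returns a new object, so you need to assign the result somewhere to keep it.

```python
import numpy as np
import pandas as pd

# hypothetical readings; column names are made up for illustration
readings = pd.DataFrame({
    'city': ['Akron', 'Canton', None],
    'co_max': [17.6, np.nan, 14.0],
})

# fill a single column and save the result back to that column
readings['co_max'] = readings['co_max'].fillna(0)

# or pass a dictionary to use a different fill value per column
filled = readings.fillna({'city': 'Unknown'})
print(filled)
```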


### You have duplicate rows

If your data have rows that are incorrectly duplicated, you can use the data frame method [`drop_duplicates()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop_duplicates.html) to delete the duplicates.

(This assumes, of course, that you've done sufficient reporting to feel confident that the duplicated rows aren't in there legitimately.)

Let's look at some fake data to show how this'd work:

In [54]:
fake_data = [
    {'id': 12345, 'name': 'Sally', 'position': 'Editor', 'org': 'Some News Organization'},
    {'id': 54321, 'name': 'George', 'position': 'Reporter', 'org': 'Some Other News Organization'},
    {'id': 12345, 'name': 'Sally', 'position': 'Editor', 'org': 'Some News Organization'},
    {'id': 49382, 'name': 'Sally', 'position': 'Editor', 'org': 'Some News Organization'},
    {'id': 39331, 'name': 'Pat', 'position': 'Producer', 'org': 'Some Other Other News Organization'},
]

fake_df = pd.DataFrame(fake_data)

In [55]:
fake_df.head()

Unnamed: 0,id,name,org,position
0,12345,Sally,Some News Organization,Editor
1,54321,George,Some Other News Organization,Reporter
2,12345,Sally,Some News Organization,Editor
3,49382,Sally,Some News Organization,Editor
4,39331,Pat,Some Other Other News Organization,Producer


Before you drop anything, you'd probably want to check for duplicate rows. You can do that by filtering your data using the `duplicated()` method.

👉 For more details on filtering data in pandas, [see this notebook](Filtering%20columns%20and%20rows%20in%20pandas.ipynb).

In [57]:
fake_df[fake_df.duplicated()]

Unnamed: 0,id,name,org,position
2,12345,Sally,Some News Organization,Editor
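
By default, `duplicated()` flags only the second and later copies of a row. If you'd rather see every copy side by side, one option is to pass `keep=False`. A minimal sketch, using a trimmed-down version of the fake data:

```python
import pandas as pd

# a trimmed-down version of the fake data
fake_df = pd.DataFrame([
    {'id': 12345, 'name': 'Sally'},
    {'id': 54321, 'name': 'George'},
    {'id': 12345, 'name': 'Sally'},
])

# keep=False flags every copy, including the first one
all_copies = fake_df[fake_df.duplicated(keep=False)]
print(all_copies)
```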


This shows us a row where every value in every column exactly matches the values in at least one other row. Let's assume we've done the reporting to confirm that the duplicate needs to go.

The `drop_duplicates()` method gives you control over _how_ this happens:
- You can drop _all_ duplicate rows, or keep just the first instance (this is the default behavior), or the last instance
- You can drop rows where just the values in certain columns are duplicated

Here are a few examples:

In [58]:
# default behavior -- duplicate rows must match exactly
fake_df.drop_duplicates()

Unnamed: 0,id,name,org,position
0,12345,Sally,Some News Organization,Editor
1,54321,George,Some Other News Organization,Reporter
3,49382,Sally,Some News Organization,Editor
4,39331,Pat,Some Other Other News Organization,Producer


In [60]:
# drop rows where the values in name, org and position are identical
fake_df.drop_duplicates(subset=['name', 'org', 'position'])

Unnamed: 0,id,name,org,position
0,12345,Sally,Some News Organization,Editor
1,54321,George,Some Other News Organization,Reporter
4,39331,Pat,Some Other Other News Organization,Producer
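
The examples above keep the first instance of each duplicate. To keep the last instance, or to drop every copy, you can use the `keep` argument. Here is a sketch using the same fake data:

```python
import pandas as pd

fake_df = pd.DataFrame([
    {'id': 12345, 'name': 'Sally', 'position': 'Editor', 'org': 'Some News Organization'},
    {'id': 54321, 'name': 'George', 'position': 'Reporter', 'org': 'Some Other News Organization'},
    {'id': 12345, 'name': 'Sally', 'position': 'Editor', 'org': 'Some News Organization'},
    {'id': 49382, 'name': 'Sally', 'position': 'Editor', 'org': 'Some News Organization'},
    {'id': 39331, 'name': 'Pat', 'position': 'Producer', 'org': 'Some Other Other News Organization'},
])

# keep the last copy of each duplicated row instead of the first
keep_last = fake_df.drop_duplicates(keep='last')

# drop every row that has an exact duplicate, keeping none of the copies
keep_none = fake_df.drop_duplicates(keep=False)

print(keep_last)
print(keep_none)
```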


Our original data frame is unchanged:

In [61]:
fake_df

Unnamed: 0,id,name,org,position
0,12345,Sally,Some News Organization,Editor
1,54321,George,Some Other News Organization,Reporter
2,12345,Sally,Some News Organization,Editor
3,49382,Sally,Some News Organization,Editor
4,39331,Pat,Some Other Other News Organization,Producer


That's because we didn't specify `inplace=True` as an argument.

You can take one of two approaches here. You could alter your original dataframe -- the code would look like this:

```python
# drop rows where the values in name, org and position are identical
fake_df.drop_duplicates(subset=['name', 'org', 'position'], inplace=True)
```

-- or you could "save" the resulting deduplicated data frame as a new variable, like this:

```python
# drop rows where the values in name, org and position are identical
deduped = fake_df.drop_duplicates(subset=['name', 'org', 'position'])
```

I prefer the latter approach because I like to leave the original data as untouched as possible, working up to successively cleaner and more analyze-able data frames as I go. I also find this approach is easier to follow when I come back to it after a few weeks or months of inaction.

### You have empty rows

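To drop rows or columns with null values, use [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html). By default, it drops any row or column that contains at least one null value; pass `how='all'` to drop only the ones whose values are _all_ null.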

Specifying `axis=1` will drop empty _columns_; `axis=0` will drop empty _rows_.

In [68]:
import numpy as np

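# DataFrame.append() was removed in pandas 2.0, so build the empty row and concatenate it
empty_row = pd.DataFrame([{'id': np.nan, 'name': np.nan, 'position': np.nan, 'org': np.nan}])
fake_df = pd.concat([fake_df, empty_row], ignore_index=True)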

In [69]:
fake_df

Unnamed: 0,id,name,org,position
0,12345.0,Sally,Some News Organization,Editor
1,54321.0,George,Some Other News Organization,Reporter
2,12345.0,Sally,Some News Organization,Editor
3,49382.0,Sally,Some News Organization,Editor
4,39331.0,Pat,Some Other Other News Organization,Producer
5,,,,


In [72]:
fake_df.dropna(axis=0, how='all')

Unnamed: 0,id,name,org,position
0,12345.0,Sally,Some News Organization,Editor
1,54321.0,George,Some Other News Organization,Reporter
2,12345.0,Sally,Some News Organization,Editor
3,49382.0,Sally,Some News Organization,Editor
4,39331.0,Pat,Some Other Other News Organization,Producer
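
The difference between the default behavior and `how='all'` matters when some rows are only partly empty. A small sketch with made-up data (the `people` data frame is hypothetical):

```python
import numpy as np
import pandas as pd

# hypothetical data: one complete row, one partly empty row, one empty row
people = pd.DataFrame({
    'id': [12345, np.nan, np.nan],
    'name': ['Sally', 'George', np.nan],
})

# default (how='any'): drop rows with at least one null value
any_null = people.dropna()

# how='all': drop only rows where every value is null
all_null = people.dropna(how='all')

print(any_null)
print(all_null)
```

With the default, George's row is dropped along with the empty one because his `id` is missing; with `how='all'`, only the completely empty row goes.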


### Further reading

This just scratches the surface of what you can do in pandas. Here are some other resources to check out:

- [Pythonic Data Cleaning With NumPy and Pandas](https://realpython.com/python-data-cleaning-numpy-pandas/)
- [pandas official list of tutorials](https://pandas.pydata.org/pandas-docs/stable/tutorials.html)
- [Karrie Kehoe's guide to cleaning data in pandas](https://github.com/KarrieK/pandas_data_cleaning)
- [Data cleaning with Python](https://www.dataquest.io/blog/data-cleaning-with-python/)