# Pythonic Data Cleaning With NumPy and Pandas

## What we are covering

- Dropping Columns in a DataFrame
- Changing the Index of a DataFrame
- Tidying up Fields in the Data
- Combining str Methods with NumPy to Clean Columns
- Cleaning the Entire Dataset Using the applymap Function
- Renaming Columns and Skipping Rows
- Python Data Cleaning: Recap and Resources



Analysts spend a large amount of their time cleaning datasets and getting them into a form they can work with. In fact, many data scientists argue that the initial steps of obtaining and cleaning data constitute 80% of the job. So it is important to be able to deal with messy data, whether that means missing values, inconsistent formatting, malformed records, or nonsensical outliers.

**Here are the datasets that will be used**:
 
- [BL-Flickr-Images-Book.csv](https://github.com/realpython/python-data-cleaning/blob/master/Datasets/BL-Flickr-Images-Book.csv) - A CSV file containing information about books from the British Library.
- [University Towns text file](https://github.com/realpython/python-data-cleaning/blob/master/Datasets/university_towns.txt) - A text file containing names of college towns in every US state.
- [Olympics](https://github.com/realpython/python-data-cleaning/blob/master/Datasets/olympics.csv) - A CSV file summarizing the participation of all countries in the Summer and Winter Olympics.

Let’s import the required modules and get started!

In [None]:
import pandas as pd
import numpy as np
import os

### Dropping Columns in a DataFrame

Often, you’ll find that not all the categories of data in a dataset are useful to you. For example, you might have a dataset containing student information (name, grade, standard, parents’ names, and address) but want to focus on analyzing student grades. 

In this case, the address or parents’ names categories are not important to you. Retaining these unneeded categories will take up unnecessary space and potentially also bog down runtime.

Pandas provides a handy way of removing unwanted columns or rows from a DataFrame with the *drop()* function. Let’s look at a simple example where we drop a number of columns from a DataFrame.

#### BL-Flickr-Images-Book
First, let’s create a DataFrame out of the CSV file ‘BL-Flickr-Images-Book.csv’. In the example below, we pass a relative path to pd.read_csv, so the CSV file is expected to be in our current working directory:

In [None]:
# Read the CSV file from the current working directory
df = pd.read_csv(os.path.join('.', 'BL-Flickr-Images-Book.csv'))

# Show the first five rows (the default for head())
df.head()

# head() also accepts a row count, so it can show more than five records

In [None]:
df.head(10)

When we look at the first entries using the head() method, we can see that a handful of columns provide ancillary information that would be helpful to the library but isn’t very descriptive of the books themselves: Edition Statement, Corporate Author, Corporate Contributors, Former owner, Engraver, Contributors, Issuance type and Shelfmarks.

We can drop these columns in the following way:

In [None]:
# List the names of the columns to drop
to_drop = ['Edition Statement',
           'Corporate Author',
           'Corporate Contributors',
           'Former owner',
           'Engraver',
           'Contributors',
           'Issuance type',
           'Shelfmarks']
# axis=1 targets columns; inplace=True modifies df directly
df.drop(to_drop, inplace=True, axis=1)

Above, we defined a list that contains the names of all the columns we want to drop. Next, we call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object. 

When we inspect the DataFrame again, we’ll see that the unwanted columns have been removed:

In [None]:
df.head()

Alternatively, we could also remove the columns by passing them to the columns parameter directly instead of separately specifying the labels to be removed and the axis where Pandas should look for the labels:

```python
df.drop(columns=to_drop, inplace=True)
```
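
As a quick check that the two spellings are interchangeable, here is a minimal sketch on a made-up DataFrame. The `toy` frame and its values are hypothetical and not taken from the dataset, and the sketch uses the non-in-place form so that the original stays available for comparison.

```python
import pandas as pd

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({
    'Title': ['Book A', 'Book B'],
    'Engraver': [None, None],
    'Shelfmarks': ['X 1', 'X 2'],
})
unwanted = ['Engraver', 'Shelfmarks']

# Labels plus axis=1 ...
by_axis = toy.drop(unwanted, axis=1)
# ... and the columns keyword remove the same columns
by_keyword = toy.drop(columns=unwanted)

print(by_axis.equals(by_keyword))  # True
print(list(by_axis.columns))       # ['Title']
print(list(toy.columns))           # toy is untouched without inplace=True
```

Without inplace=True, drop() returns a new DataFrame and leaves the original as it was, which is convenient when experimenting.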


### Changing the Index of a DataFrame

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

For example, in the dataset used in the previous section, it can be expected that when a librarian searches for a record, they may input the unique identifier (values in the Identifier column) for a book:

In [None]:
df['Identifier'].is_unique

Let’s replace the existing index with this column using `pandas.DataFrame.set_index`:

In [None]:
df = df.set_index('Identifier')
df.head()

**Technical Detail: Unlike primary keys in SQL, a Pandas Index doesn’t make any guarantee of being unique, although many indexing and merging operations will notice a speedup in runtime if it is.**
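
To illustrate the point about uniqueness, the following sketch builds a small, hypothetical DataFrame whose index repeats a label. The data is invented for the example and only shows how pandas behaves in that situation.

```python
import pandas as pd

# Hypothetical toy data with a repeated label in the index
dupes = pd.DataFrame({'Title': ['A', 'B', 'C']}, index=[206, 206, 216])

print(dupes.index.is_unique)  # False

# A repeated label selects every matching row, so the result is a DataFrame
print(dupes.loc[206])

# A label that occurs once yields a single row as a Series
print(type(dupes.loc[216]).__name__)
```

This is one reason to check is_unique before choosing a column as the index: lookups on a repeated label can return more than one row.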

We can access each record in a straightforward way with loc[]. Although loc[] may not have the most intuitive name, it allows us to do label-based indexing, which selects a row or record by its label, regardless of its position:

In [None]:
df.loc[206]

In [None]:
df.loc?

In other words, 206 is the first label of the index. To access it by position, we could use df.iloc[0], which does position-based indexing.
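
The difference between label-based and position-based access can be sketched on a small stand-in DataFrame. The `toy` frame below is hypothetical; only the idea of an Identifier index is borrowed from the dataset.

```python
import pandas as pd

# Hypothetical stand-in with an Identifier-style index
toy = pd.DataFrame(
    {'Title': ['First', 'Second', 'Third']},
    index=pd.Index([206, 216, 218], name='Identifier'),
)

by_label = toy.loc[206]    # row whose index label is 206
by_position = toy.iloc[0]  # row at position 0

print(by_label.equals(by_position))  # True

# toy.loc[0] would raise KeyError, because 0 is not a label in this index
print(0 in toy.index)  # False
```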

**Technical Detail: .loc[] is technically a class instance and has some special syntax that doesn’t conform exactly to most plain-vanilla Python instance methods.**

Previously, our index was a RangeIndex: integers starting from 0, analogous to Python’s built-in range. By passing a column name to set_index, we have changed the index to the values in Identifier.

You may have noticed that we reassigned the variable to the object returned by the method with df = df.set_index(...). This is because, by default, the method returns a modified copy of our object and does not make the changes directly to the object. We can avoid this by setting the inplace parameter:

In [None]:
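df.info()

# In-place alternative to df = df.set_index('Identifier').
# Run it instead of the reassignment, not after it: once Identifier is the
# index it is no longer a column, and calling this again raises a KeyError.
# df.set_index('Identifier', inplace=True)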
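
The difference between the returned copy and the in-place form can be seen in a minimal sketch on hypothetical data (the `toy` frame is made up for this illustration):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'Identifier': [206, 216], 'Title': ['A', 'B']})

# Default behaviour: a new DataFrame is returned and toy is unchanged
indexed = toy.set_index('Identifier')
print(list(toy.columns))   # ['Identifier', 'Title']
print(indexed.index.name)  # Identifier

# In-place behaviour: toy itself is modified and the call returns None
toy.set_index('Identifier', inplace=True)
print(list(toy.columns))   # ['Title']
```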

### Tidying up Fields in the Data

So far, we have removed unnecessary columns and changed the index of our DataFrame to something more sensible. In this section, we will clean specific columns and get them to a uniform format to get a better understanding of the dataset and enforce consistency. In particular, we will be cleaning Date of Publication and Place of Publication.

Upon inspection, all of the data types are currently the object dtype, which is roughly analogous to str in native Python.

It encapsulates any field that can’t be neatly fit as numerical or categorical data. This makes sense since we’re working with data that is initially a bunch of messy strings:

In [None]:
# get_dtype_counts() was removed in pandas 1.0; count the dtypes directly instead
df.dtypes.value_counts()
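
To see why a column of year-like strings is not treated as numeric, here is a small sketch on invented data. The `toy` frame and its Pages column are hypothetical. Depending on the pandas version, text columns may be reported as object or as a dedicated string dtype, so the sketch checks for numeric-ness rather than a specific dtype name.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Hypothetical toy data mixing text and numbers
toy = pd.DataFrame({
    'Date of Publication': ['1879 [1878]', '1868'],
    'Pages': [120, 340],
})

print(toy.dtypes)

# Strings that merely look like years are not numeric ...
print(is_numeric_dtype(toy['Date of Publication']))  # False
# ... while a column of plain integers is
print(is_numeric_dtype(toy['Pages']))  # True
```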

One field where it makes sense to enforce a numeric value is the date of publication so that we can do calculations down the road:

In [None]:
# Select 'Date of Publication' for index labels from 1905 onward and show the first 10 values
df.loc[1905:, 'Date of Publication'].head(10)

A particular book can have only one date of publication. Therefore, we need to do the following:

- Remove the extra dates in square brackets, wherever present: 1879 [1878]

- Convert date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54

- Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?]

- Convert the string nan to NumPy’s NaN value

Synthesizing these patterns, we can actually take advantage of a single regular expression to extract the publication year:

In [None]:
import re
regex = r'^(\d{4})'

The regular expression above is meant to find any four digits at the beginning of a string, which suffices for our case. The above is a raw string (meaning that a backslash is no longer an escape character), which is standard practice with regular expressions.

The \d represents any digit, and {4} repeats this rule four times. The ^ character matches the start of a string, and the parentheses denote a capturing group, which signals to Pandas that we want to extract that part of the regex. (We want ^ to avoid cases where [ starts off the string.)
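
As a sanity check on the pattern, the sketch below applies it to a handful of hand-written strings modeled on the cases listed earlier. The `samples` Series is an assumption made for illustration and is not read from the dataset.

```python
import numpy as np
import pandas as pd

# Hand-written sample values modeled on the patterns described above
samples = pd.Series(['1879 [1878]', '1860-63', '1839, 38-54', '[1897?]', np.nan])

# Keep only four digits that appear at the very start of the string
years = samples.str.extract(r'^(\d{4})', expand=False)

print(years.tolist())
```

The first three values yield '1879', '1860', and '1839'. The bracketed uncertain date and the missing value do not match the pattern, so they come back as missing.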

Let’s see what happens when we run this regex across our dataset:

In [None]:
extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
extr.head(20)

Not familiar with regex? You can [inspect the expression](https://regex101.com/r/3AJ1Pv/1) above at regex101.com and read more at the Python Regular Expressions [HOWTO](https://docs.python.org/3.6/howto/regex.html).

Technically, this column still has object dtype, but we can easily get its numerical version with pd.to_numeric:

In [None]:
df['Date of Publication'].isnull().sum() / len(df)

In [None]:
df.head()

In [None]:
df['Date of Publication']

In [None]:
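# Convert the extracted year strings to numbers; unmatched rows stay NaN
extr = pd.to_numeric(extr)
# Use the nullable integer dtype so years are integers and missing values stay missing
extr = extr.astype('Int64')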

In [None]:
extr
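
How the numeric conversion treats missing values can be sketched on a few hand-made strings. The `extracted` Series below is hypothetical, and the nullable Int64 dtype is shown as one possible way to keep whole-number years, not as the only option.

```python
import numpy as np
import pandas as pd

# Hypothetical extracted years, including one missing value
extracted = pd.Series(['1879', '1860', np.nan])

# to_numeric yields floats here, because NaN cannot live in a plain int column
as_float = pd.to_numeric(extracted)
print(as_float.dtype)  # float64

# The nullable integer dtype keeps whole numbers and still marks the gap
as_int = as_float.astype('Int64')
print(as_int.dtype)         # Int64
print(as_int.isna().sum())  # 1
```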

Great! That’s done!

Next time, we will cover:

# Combining str Methods with NumPy to Clean Columns