# Tips for Cleaning up Raw Data

## Detecting and Filling in Missing Data

The [missingno](https://github.com/ResidentMario/missingno#missingno----) package provides a small toolset of flexible and easy-to-use missing data visualizations and utilities. They allow you to get a quick visual summary of the completeness (or lack thereof) of your dataset.

In [1]:
%matplotlib notebook

import time
from IPython.display import HTML, clear_output
import numpy as np
import pandas as pd
import missingno as msno

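# The nullable 'Int16' dtype lets the year column hold integers and missing values
raw_data = pd.read_csv('../data/classic-rock-songs.csv',
                       sep=',', dtype={'Release Year': 'Int16'})
# Keep only columns 1-3 and give them short names
raw_data = raw_data.iloc[:, 1:4]
raw_data = raw_data.rename(
    columns=dict(zip(raw_data.columns, ['song', 'artist', 'year'])))
# Draw the nullity matrix: one column per field, gaps mark missing values
plot = msno.matrix(raw_data)

# Save the chart and embed it as an image instead of the interactive figure
chart_img = 'img/missingno.png'
plot.figure.savefig(chart_img)
clear_output()
print(raw_data.head(2))
# The timestamp query string keeps the browser from showing a cached image
HTML('<img src="{}?{}"></img>'.format(chart_img, time.time()))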

               song       artist  year
0  Caught Up in You  .38 Special  1982
1      Fantasy Girl  .38 Special   NaN


As you can (literally) see, the `year` values are hit and miss.

Once you have a mental image of which data is missing and how to fill in reasonable substitute values, the primary tool for doing so is the `fillna(value)` method. You can also select the rows with missing values via `loc` and `isna`, and calculate a replacement value from other rows, e.g. the average of some related group.
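Both approaches can be sketched as follows. This is a minimal illustration, not taken from the dataset above: the `df` frame and its values are made up, the column names merely mirror `raw_data`, and `year` is kept as plain floats for simplicity, so you would have to round the group averages before casting back to an integer dtype such as `Int16`.

```python
import pandas as pd

# Hypothetical sample data, made up for illustration only
df = pd.DataFrame({
    'song': ['A', 'B', 'C', 'D', 'E'],
    'artist': ['X', 'X', 'X', 'Y', 'Y'],
    'year': [1980.0, None, 1984.0, 1975.0, None],
})

# Option 1: replace every missing value with one constant.
# fillna() returns a new Series and leaves df untouched.
filled_const = df['year'].fillna(0)

# Option 2: replace missing values with the average of a related group,
# here the mean year of the same artist (NaN values are skipped by mean).
missing = df['year'].isna()
artist_mean = df.groupby('artist')['year'].transform('mean')
df.loc[missing, 'year'] = artist_mean[missing]

print(df)
```

Whether a group average is a sensible substitute depends on the data; for release years it is at best a rough guess, and for a group with no known values at all the result stays missing.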