## Data Visualizations for Messy Data

#### Learn how to work around problems with visualizing messy and missing data.

Data visualization tutorials generally use pre-processed data. But what about datasets in the wild? What do we do about missing data, or about outliers that heavily skew a visualization? What do we do when a scatterplot has too many observations to be interpretable? This article introduces some of the methods we can use to work around these problems.

Let’s say we are new real estate agents who want to use data to better understand the relationship between the price of a home and its number of bedrooms. We will use the USA Housing Listings dataset from [Kaggle](https://www.kaggle.com/austinreese/usa-housing-listings), which we have loaded into a DataFrame called housing.

### Missing data

Incomplete observations (rows with missing data) are generally ignored by plotting functions in commonly used Python libraries, such as matplotlib and seaborn. Because those rows are left out without warning, we may want to remove them or impute the missing values ourselves before plotting. We can check for missing data using **info()**:

In [2]:
import pandas as pd

housing = pd.read_csv('/home/oldoc/largeDataSets/housing.csv')
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384977 entries, 0 to 384976
Data columns (total 22 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   id                       384977 non-null  int64  
 1   url                      384977 non-null  object 
 2   region                   384977 non-null  object 
 3   region_url               384977 non-null  object 
 4   price                    384977 non-null  int64  
 5   type                     384977 non-null  object 
 6   sqfeet                   384977 non-null  int64  
 7   beds                     384977 non-null  int64  
 8   baths                    384977 non-null  float64
 9   cats_allowed             384977 non-null  int64  
 10  dogs_allowed             384977 non-null  int64  
 11  smoking_allowed          384977 non-null  int64  
 12  wheelchair_access        384977 non-null  int64  
 13  electric_vehicle_charge  384977 non-null  int64  
 14  come

Based on this output, we may be concerned about the columns laundry_options and parking_options because they have more missing values than the other columns. We can look at just those two columns:

In [7]:
housing[['laundry_options','parking_options']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 384977 entries, 0 to 384976
Data columns (total 2 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   laundry_options  305951 non-null  object
 1   parking_options  244290 non-null  object
dtypes: object(2)
memory usage: 5.9+ MB
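
From the non-null counts above, roughly 21% of laundry_options values and 37% of parking_options values are missing. The sketch below shows one way to count missing values directly and then either drop or fill them before plotting. It uses a small made-up DataFrame called toy in place of housing so that it runs on its own; the values and the "unknown" label are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the housing DataFrame (values are made up for illustration)
toy = pd.DataFrame({
    "price": [1200, 950, 1800, 1400, 1100],
    "beds": [2, 1, 3, 2, 1],
    "laundry_options": ["w/d in unit", None, "laundry on site", "w/d hookups", None],
    "parking_options": ["carport", None, None, "off-street parking", None],
})

# Count and share of missing values per column, largest first
missing_count = toy.isna().sum().sort_values(ascending=False)
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_count)
print(missing_share)

# Option 1: drop rows that are missing either column of interest
dropped = toy.dropna(subset=["laundry_options", "parking_options"])

# Option 2: keep every row and label missing categories explicitly
filled = toy.fillna({"laundry_options": "unknown", "parking_options": "unknown"})
```

Dropping rows is simpler but shrinks the data, which matters when more than a third of a column is missing. Filling with an explicit label keeps every listing and lets the missing group show up as its own category in a plot.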
