# Visualization using Matplotlib

Data visualization is an essential step in understanding a dataset and the relationships among its features and labels. Although many Python plotting libraries exist, Matplotlib is the foundation that many of them build on, and this notebook covers its basics and common usage.

## Import libraries

In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

We will use the population dataset available on Kaggle to see how visualizations help us draw meaningful conclusions from data.

## Import dataset

We will import the dataset as a pandas DataFrame into the variable `dataset`.
First, however, we must look at the raw data file. Its first `4` rows can be skipped, so we pass `skiprows=4` when reading the data into the variable.

In [9]:
dataset = pd.read_csv('dataset.csv', skiprows=4)
dataset.head(5)

Unnamed: 0,Country Name,Country Code,Indicator Name,Indicator Code,1960,1961,1962,1963,1964,1965,...,2008,2009,2010,2011,2012,2013,2014,2015,2016,Unnamed: 61
0,Aruba,ABW,Population density (people per sq. km of land ...,EN.POP.DNST,,307.972222,312.366667,314.983333,316.827778,318.666667,...,563.011111,563.422222,564.427778,566.311111,568.85,571.783333,574.672222,577.161111,,
1,Andorra,AND,Population density (people per sq. km of land ...,EN.POP.DNST,,30.587234,32.714894,34.914894,37.170213,39.470213,...,182.161702,181.859574,179.614894,175.161702,168.757447,161.493617,154.86383,149.942553,,
2,Afghanistan,AFG,Population density (people per sq. km of land ...,EN.POP.DNST,,14.038148,14.312061,14.599692,14.901579,15.218206,...,40.634655,41.674005,42.830327,44.127634,45.533197,46.997059,48.444546,49.821649,,
3,Angola,AGO,Population density (people per sq. km of land ...,EN.POP.DNST,,4.305195,4.384299,4.464433,4.544558,4.624228,...,15.915819,16.459536,17.020898,17.600302,18.196544,18.808215,19.433323,20.070565,,
4,Albania,ALB,Population density (people per sq. km of land ...,EN.POP.DNST,,60.576642,62.456898,64.329234,66.209307,68.058066,...,107.566204,106.843759,106.314635,106.013869,105.848431,105.717226,105.60781,105.444051,,
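
The effect of `skiprows=4` can be sketched on a small in-memory file. This is only an illustration: the four preamble lines and the two data rows below are made up for the example and are not the actual contents of `dataset.csv`. The sketch also shows one common way a column such as `Unnamed: 61` can arise, namely a trailing comma at the end of every line (whether that is the cause in the real file is an assumption).

```python
import io

import pandas as pd

# Hypothetical file contents: four preamble lines, then the real header.
raw = "\n".join([
    "Data Source,made-up preamble line 1",
    "Last Updated,made-up preamble line 2",
    "Note,made-up preamble line 3",
    "Note,made-up preamble line 4",
    "Country Name,Country Code,1961,1962,",  # trailing comma -> extra column
    "Aruba,ABW,307.97,312.37,",
    "Andorra,AND,30.59,32.71,",
])

# Skip the first four lines so that line five is used as the header.
sample = pd.read_csv(io.StringIO(raw), skiprows=4)

print(sample.columns.tolist())
print(sample["Unnamed: 4"].isna().all())
```

Because the header line ends with a comma, pandas sees an empty fifth column name and labels it `Unnamed: 4`; every value in that column is missing.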


It appears that we can keep either `Country Name` or `Country Code`, since both identify the country. For easy reference, let's keep the country name. The columns `Indicator Name` and `Indicator Code` are not required.
The columns `1960` and `2016` contain `NaN` values (`NaN` stands for `Not a Number` and marks missing data), so we can drop those columns too. Finally, the last column is unnamed and also contains `NaN` values, so we drop it as well.

In [15]:
# Drop the redundant identifier/indicator columns and the empty columns.
dataset.drop(columns=['Country Code', 'Indicator Name', 'Indicator Code', '1960', '2016', 'Unnamed: 61'],
             inplace=True)
dataset.head(5)

Unnamed: 0,Country Name,1961,1962,1963,1964,1965,1966,1967,1968,1969,...,2006,2007,2008,2009,2010,2011,2012,2013,2014,2015
0,Aruba,307.972222,312.366667,314.983333,316.827778,318.666667,320.622222,322.494444,324.361111,326.244444,...,560.166667,562.322222,563.011111,563.422222,564.427778,566.311111,568.85,571.783333,574.672222,577.161111
1,Andorra,30.587234,32.714894,34.914894,37.170213,39.470213,41.8,44.159574,46.570213,49.065957,...,177.389362,180.591489,182.161702,181.859574,179.614894,175.161702,168.757447,161.493617,154.86383,149.942553
2,Afghanistan,14.038148,14.312061,14.599692,14.901579,15.218206,15.545203,15.881812,16.235931,16.618433,...,38.574296,39.637202,40.634655,41.674005,42.830327,44.127634,45.533197,46.997059,48.444546,49.821649
3,Angola,4.305195,4.384299,4.464433,4.544558,4.624228,4.703271,4.782892,4.865721,4.955244,...,14.872437,15.387749,15.915819,16.459536,17.020898,17.600302,18.196544,18.808215,19.433323,20.070565
4,Albania,60.576642,62.456898,64.329234,66.209307,68.058066,69.874927,71.737153,73.805547,75.97427,...,109.217044,108.394781,107.566204,106.843759,106.314635,106.013869,105.848431,105.717226,105.60781,105.444051
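
Rather than naming the empty columns by hand, they can also be detected programmatically. The following is a minimal sketch on a toy DataFrame (the frame `toy` and its values are hypothetical). Note that the notebook only shows the first five rows, so whether `1960`, `2016` and `Unnamed: 61` are empty for every country in the real dataset is an assumption that this kind of check would confirm.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up).
toy = pd.DataFrame({
    "Country Name": ["A", "B"],
    "Country Code": ["AAA", "BBB"],
    "1960": [np.nan, np.nan],
    "1961": [1.0, 2.0],
    "Unnamed: 4": [np.nan, np.nan],
})

# Columns in which every single value is missing.
empty_cols = toy.columns[toy.isna().all()].tolist()

# Drop the redundant identifier plus all fully empty columns.
cleaned = toy.drop(columns=["Country Code"] + empty_cols)

print(empty_cols)
print(cleaned.columns.tolist())
```

Returning a new DataFrame instead of using `inplace=True` also makes the cell safe to re-run, since dropping a column that no longer exists raises a `KeyError`.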


Some countries in the dataset do not have values for every year. We drop those rows so that only countries with a complete series remain, and then use `isnull().sum()` to confirm that no missing values are left.

In [29]:
dataset.dropna(how = 'any', axis = 0, inplace = True)
dataset.isnull().sum()

Country Name    0
dtype: int64
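
The choice of `how = 'any'` matters: it removes a row as soon as one value is missing. As a rough sketch of the alternatives, the toy DataFrame below (hypothetical values) compares `how='any'` with `how='all'`, and shows that `how='all'` only becomes useful here when restricted to the year columns via `subset`, because `Country Name` is never missing.

```python
import numpy as np
import pandas as pd

# Toy data: A is complete, B misses one year, C misses every year.
toy = pd.DataFrame({
    "Country Name": ["A", "B", "C"],
    "1961": [1.0, np.nan, np.nan],
    "1962": [2.0, 3.0, np.nan],
})

# Drop rows with at least one missing value (what the notebook does).
complete_rows = toy.dropna(how="any", axis=0)

# Drop rows only if every value is missing; C survives because its name is set.
all_missing_dropped = toy.dropna(how="all", axis=0)

# Drop rows whose year columns are all missing; only C is removed.
year_cols = ["1961", "1962"]
some_data_rows = toy.dropna(how="all", subset=year_cols)

print(len(complete_rows), len(all_missing_dropped), len(some_data_rows))
print(complete_rows.isnull().sum().sum())
```

Comparing the row count before and after `dropna` on the real dataset is a quick way to see how many countries the stricter `how='any'` setting removes.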