# Data Visualization 

Python is a popular programming language with strong libraries such as _**NumPy**_ and _**Pandas**_ for manipulating data and _**Matplotlib**_ and _**Seaborn**_ for visualizing it. With them we can make many types of visualizations, such as bar graphs, line graphs, boxplots and histograms.

Data visualization is all about loading data, simplifying it, cleaning it, augmenting it (when it is not rich enough) and understanding it on a more intuitive level.

First of all, we need data in a plottable form. NumPy and Pandas can be used for this purpose: with these libraries we can efficiently load, store, manipulate and export data.

Matplotlib and Seaborn are very popular Python plotting libraries. While the Matplotlib API is relatively low-level, the Seaborn API is high-level: it is built on top of Matplotlib and provides ready-made statistical graphics.

### Data Loading

We can load data from a CSV file or an Excel file using the Pandas library. First, we need to import Pandas. Then, for a CSV file, we get our data with Pandas' read_csv function, which takes the file path or URL of the file as its argument (Excel files are read with read_excel instead).

In [None]:
import pandas as pd
file_url = 'https://drive.google.com/uc?id=1_0F4v5dven3QQ9QgmJieoxmTJi6mwjT5' 
dataset = pd.read_csv(file_url)
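
As a minimal offline sketch of the same call, read_csv also accepts a file-like object, so a small CSV can be parsed from an in-memory string. The values below are made up for illustration and are not the contents of the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text; the real dataset's rows are not reproduced here.
csv_text = """Month,Interest Paid,car_type
1,202.93,Toyota Sienna
2,200.31,Toyota Sienna
3,197.68,Toyota Sienna
"""

# read_csv accepts a path, a URL, or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column names come from the first line
```

Parsing from a string like this is handy for trying out options without downloading anything.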

Let's look at a small part of our dataset to figure out what kind of data we have. Each row is an observation and each column is a feature. Rows are labeled by the index, which starts at zero by default, and columns are labeled by the header names.

In [None]:
dataset.head(5)  # show the first 5 rows

In [None]:
dataset.tail(5)  # show the last 5 rows


Each tabular view above is a Pandas _**DataFrame**_, the two-dimensional table into which Pandas loaded the contents of the file.

In [None]:
type(dataset)  # type is Pandas Dataframe


Now we can check whether the data is correctly loaded and valid. We also look at the last rows to see whether they have the same format as the first rows.

For example, the data here shows how much interest you pay over time, or what the total interest payment is by month for a particular car. But it is hard to see this information just by looking at the table. That's why we need visualizations.

We should verify our dataset. The Pandas library has several methods and attributes for this. For example:

In [None]:
dataset.shape  # to see the shape of our dataset

In [None]:
dataset.dtypes  # to check the column data types

In [None]:
dataset.info()  # gives the number of non-null values in each column

The last one is important because null values cause problems in data analysis and visualization tasks.
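
The same checks can also be captured as values rather than read off the printed summary. The sketch below uses a small hypothetical DataFrame (not the tutorial's dataset) with one missing entry, and counts the non-null values per column with the count method.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value; not the tutorial's dataset.
df = pd.DataFrame({
    'Month': [1, 2, 3],
    'Interest Paid': [202.93, np.nan, 197.68],
})

non_null = df.count()  # non-null values per column, as reported by info()
numeric = pd.api.types.is_numeric_dtype(df['Interest Paid'])  # dtype check

print(df.shape)
print(non_null)
print(numeric)
```

A column whose non-null count is lower than the number of rows contains missing values.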

### Slicing

When we have a large dataset (which is often the case), we can work on a smaller subset of it. For this, we can use the following slicing techniques:

In [None]:
dataset[['car_type']].tail()  # select one column; tail() shows 5 rows by default

In [None]:
dataset[['car_type', 'Principal Paid']].tail()  # select multiple columns

In [None]:
dataset['car_type'].head()  # with single brackets, we get a Pandas Series

In [None]:
type(dataset['car_type'].head())  # type is Pandas Series

We cannot select multiple columns using single brackets; passing several names without the inner list raises a KeyError.

With a Pandas Series, we can select rows using slicing:

In [None]:
dataset['car_type'][3:9]   # series[start_index:end_index], end index excluded
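
To make the bracket rules concrete, the sketch below uses a small hypothetical DataFrame. It shows that double brackets return a DataFrame, single brackets return a Series, and a slice excludes the end index.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'car_type': ['A', 'B', 'C', 'D', 'E'],
    'Principal Paid': [10.0, 20.0, 30.0, 40.0, 50.0],
})

as_frame = df[['car_type']]   # list of column names -> DataFrame
as_series = df['car_type']    # single column name -> Series

# With the default integer index, [1:4] selects positions 1, 2 and 3.
subset = as_series[1:4]

print(type(as_frame).__name__, type(as_series).__name__)
print(subset.tolist())
```

The explicit positional form, as_series.iloc[1:4], selects the same rows and avoids any ambiguity when the index is not the default one.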

### Filtering

We can keep only the rows that meet a condition by filtering the data with a boolean mask.

In [None]:
dataset['car_type'].value_counts()  # to see what kind of cars we have

Here we can see a misspelling: Toyota Carolla instead of Toyota Corolla.
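
One possible way to correct such a misspelling is the Series _**replace**_ method. The sketch below runs on a small hypothetical column rather than the real dataset, and this cleanup step is an illustration, not part of the original notebook.

```python
import pandas as pd

# Hypothetical column containing the misspelled label.
cars = pd.Series(['Toyota Sienna', 'Toyota Carolla', 'Toyota Corolla', 'Toyota Carolla'])

# Replace exact matches of the misspelled value with the correct one.
cleaned = cars.replace('Toyota Carolla', 'Toyota Corolla')

counts = cleaned.value_counts()
print(counts)
```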

In [None]:
car_filter = dataset['car_type'] == 'Toyota Sienna'  # produces a Pandas Series of True/False values
car_filter.head(5)

We can get only the observations about the Toyota Sienna using the car_filter above:

In [None]:
dataset[car_filter]  # shows observations related to Toyota Sienna

In [None]:
dataset_sienna = dataset[car_filter]  # keep the filtered rows in a new DataFrame
dataset_sienna['car_type'].value_counts()  # confirms that only Toyota Sienna rows remain
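
The same boolean-mask idea extends to several conditions, combined with & (and) or | (or). A minimal sketch on hypothetical data, where the Month threshold is chosen only for illustration:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({
    'car_type': ['Toyota Sienna', 'Toyota Corolla', 'Toyota Sienna', 'Toyota Sienna'],
    'Month': [1, 1, 2, 3],
})

is_sienna = df['car_type'] == 'Toyota Sienna'  # boolean Series
early = df['Month'] <= 2                       # another boolean Series

# Keep rows where both conditions are True.
filtered = df[is_sienna & early]

print(filtered)
```

When the comparisons are written inline, each one needs its own parentheses, for example df[(df['Month'] <= 2) & (df['car_type'] == 'Toyota Sienna')], because & binds more tightly than the comparison operators.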

### Renaming/Deleting Columns

Sometimes we may want to change the names of the columns. For example, to look at the car_type column, we can use attribute access:

In [None]:
dataset.car_type.head()

However, if we want to see the Principal Paid column, we cannot access it this way, because the space in the name makes the expression invalid Python and raises a SyntaxError.

In [None]:
# dataset.Principal Paid.head()  # SyntaxError: attribute access breaks on the space
dataset['Principal Paid'].head()  # bracket access works for any column name

Therefore, we may want to rename the columns. One approach is dictionary substitution with the rename method:

In [None]:
dataset_rn = dataset.rename(columns={'Starting Balance': 'starting_balance',
                                  'Interest Paid' : 'interest_paid',
                                  'Principal Paid': 'principal_paid',
                                  'New Balance': 'new_balance'})

In [None]:
dataset_rn.columns  # to check whether the names of the columns changed
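
When many columns need the same treatment, the names can also be normalized in one step instead of listing each pair. The sketch below, on a hypothetical DataFrame, lowercases the names and replaces spaces with underscores. This is one possible convention, not the notebook's method.

```python
import pandas as pd

# Hypothetical empty frame that only carries column names.
df = pd.DataFrame(columns=['Starting Balance', 'Interest Paid', 'car_type'])

# rename also accepts a function, which is applied to every column name.
df_snake = df.rename(columns=lambda name: name.lower().replace(' ', '_'))

print(list(df_snake.columns))
```

As with the dictionary form, rename returns a new DataFrame and leaves the original column names unchanged.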

We can delete columns using one of two approaches: the drop method or the del statement.

In [None]:
del_dataset = dataset.drop(columns=['Repayment'])  # pass a list; add more names to drop multiple columns
del_dataset.columns

In [None]:
del del_dataset['Month'] 
del_dataset.columns
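
The two approaches differ in one practical way: drop returns a new DataFrame and leaves the original untouched, while del modifies the DataFrame in place. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({'Month': [1, 2], 'Repayment': [500.0, 500.0], 'car_type': ['A', 'B']})

dropped = df.drop(columns=['Repayment', 'Month'])  # new DataFrame, df is unchanged
print(list(df.columns))       # still has all three columns
print(list(dropped.columns))

del df['Month']               # removes the column from df itself
print(list(df.columns))
```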

### Missing Data

Before we graph data, we need to be sure there are no missing values, which may appear in the raw file as null, N/A or empty cells. In Pandas, missing values are represented as NaN or None.

If a row contains one of them, we should remove or modify that row. The _**isna**_ and _**isnull**_ methods show where values in the DataFrame are missing. They are exactly the same method under two names, and they return True wherever a value is missing.

In [None]:
dataset['Interest Paid'].isnull()  # produces a Pandas Series of True and False values
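
The sketch below, on a hypothetical column with one missing value, checks that _**isna**_ and _**isnull**_ agree and then shows the two options mentioned above: removing the affected rows with dropna or filling the gap with fillna. Filling with the column mean is only an example choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing interest value.
df = pd.DataFrame({'Month': [1, 2, 3], 'Interest Paid': [200.0, np.nan, 100.0]})

mask = df['Interest Paid'].isna()
same = mask.equals(df['Interest Paid'].isnull())  # the two methods are aliases

removed = df.dropna(subset=['Interest Paid'])     # drop rows with a missing value
filled = df.fillna({'Interest Paid': df['Interest Paid'].mean()})  # or fill the gap

print(mask.tolist(), same)
print(len(removed), filled['Interest Paid'].tolist())
```

Both dropna and fillna return new DataFrames, so the original data stays available for comparison.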