# Pandas

While **numpy** works best with homogeneous data ( arrays in which every element has the same type, such as all integers or all floats ), **_Pandas_** is built for heterogeneous data. Think of Pandas as a library for manipulating heterogeneous data grids ( pretty much like Excel ).

## Table of Contents

- ### [Introduction](#Introduction)
 - #### [What is Pandas ](#What-is-Pandas)
 - #### [Why learn Pandas](#Why-learn-Pandas)
 - #### [Our approach to Pandas](#Our-approach-to-Pandas)
- ### [Getting Started](#Getting-Started)
 - #### [Installing Pandas](#Installing-Pandas)
- ### [Dataframes](#Dataframes)
 - #### [Create Dataframe](#Create-Dataframe)
   - ##### [From List or Dictionary](#From-List-or-Dictionary)
   - ##### [From an Empty Dataframe](#From-an-Empty-Dataframe)
   - ##### [From Files](#From-Files)
 - #### [Display Dataframe](#Display-Dataframe)
 - #### [Selecting Data from Dataframe](#Selecting-Data-from-Dataframe)

## Introduction

Most data is heterogeneous and tabular in nature. For example, look at the following data, which shows some stats from the Google Play Store. It mixes text, integers, floats etc.

<img src="./pics/google_play_store_data.png"/>

_numpy_ is not well suited to manipulating this kind of data. For that we need **Pandas**.

### What is Pandas

Pandas is a data manipulation tool ( think data munging, wrangling, preparation etc ) that works on a grid of data ( text, integers, floats etc ). For example, looking at the data grid above, say you want to

_Filter_
- a particular category ( say only ART_AND_DESIGN ) 
- or all rows with Rating > 4.1  
- or all rows where the category is ART_AND_DESIGN and rating > 4.1 


_Collapse or Group by_
- and find how many rows are there in a particular category 
- or find how many rows are there with Rating > 4.1
- or a combination of both

_Read_
- data from different formats ( excel, csv, SQL databases etc )

_Handle_
- missing data ( like NAs, blanks etc )
- erroneous data ( data that does not comply with the data type of the column ) etc

_Manipulate_
- combine data from different sources into one
- or split data into a set of rows or columns or both
- or extract a sub-set of data into another data frame ( say create a new data set only for category ART_AND_DESIGN)
- or slice the dataset based on a variety of parameters
- or insert/delete columns or rows into/from the dataset 
 - like adding a new app category or deleting the rating column

Think of **Pandas** as _Excel_ on Steroids. 
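
As a rough sketch of the _Filter_ and _Group by_ operations listed above, the snippet below builds a tiny table and applies both. The app names and ratings are made up for illustration ( they are not taken from the Google Play data ), and the column names `App`, `Category` and `Rating` are assumptions.

```python
import pandas as pd

# Hypothetical sample rows; not the real Google Play data.
apps = pd.DataFrame({
    "App":      ["Sketch Pad", "Color Book", "Car Finder", "Paint Pro"],
    "Category": ["ART_AND_DESIGN", "ART_AND_DESIGN", "AUTO_AND_VEHICLES", "ART_AND_DESIGN"],
    "Rating":   [4.1, 4.5, 3.9, 4.7],
})

# Filter: rows in one category with a rating above 4.1.
mask = (apps["Category"] == "ART_AND_DESIGN") & (apps["Rating"] > 4.1)
top_art = apps[mask]

# Group by: number of rows in each category.
rows_per_category = apps.groupby("Category").size()

print(top_art)
print(rows_per_category)
```

Each condition is wrapped in parentheses and combined with `&`, because Python's `and` does not work element by element on columns.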

### Why learn Pandas

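In the context of Machine Learning and Python, **Pandas** is the de facto standard for in-memory data management ( reading or manipulating data ). Pandas is written in Python, with its performance-critical parts implemented in Cython and C, so most common operations run at close to the speed of compiled code. Because Pandas holds data in memory, the size it can handle depends on the available RAM; on a machine with enough memory, it is not uncommon for Pandas to handle data sets of several GB comfortably.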

### Our approach to Pandas

We will be using Pandas quite extensively in this Machine Learning course. In this chapter we cover the essential aspects of Pandas and leave the more complicated options to later chapters, where we will encounter situations that lead us to explore them. For now, we will keep it simple and stick to small test datasets.

## Getting Started

### Installing Pandas

- pip

<pre>
    > pip install pandas
</pre>

If you are using the Anaconda distribution, pandas is installed by default; you only need to activate the environment it lives in ( if necessary ). If you are using only the _conda_ package manager for Python, install it with
- conda

<pre>
    > conda install pandas
</pre>

You can verify whether _Pandas_ is already installed in your Python installation using the Python console.
<pre>
    >>> help("pandas")
</pre>
If it is installed, you will get a help message on Pandas.
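
Another quick check, shown here as a minimal sketch, is to import the library and print its version. If Pandas is not installed, the `import` line fails with an error instead.

```python
import pandas as pd

# Prints the version string of the installed Pandas library.
print(pd.__version__)
```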

## Dataframes

A **_Data Frame_** is the main data structure in pandas. Think of a data frame as an Excel grid. You can create, add, delete and filter data very easily in pandas. For starters, let's see how easy it is to create a data set.

### Create Dataframe

#### From List or Dictionary

In [2]:
import pandas as pd

names        = ["India","United States","Canada"]
population_m = [1500,300,36]

d = {"names" :names , "population" : population_m}

df = pd.DataFrame(d)
df

Unnamed: 0,names,population
0,India,1500
1,United States,300
2,Canada,36
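
The example above uses a dictionary of lists ( one list per column ). As a sketch of the "list" variant named in the heading, the same data can also be passed row by row, either as a list of lists with the column names given separately, or as a list of dictionaries. The variable names `rows`, `records`, `df_rows` and `df_records` are only illustrative.

```python
import pandas as pd

# Each inner list is one row; column names are supplied separately.
rows = [
    ["India", 1500],
    ["United States", 300],
    ["Canada", 36],
]
df_rows = pd.DataFrame(rows, columns=["names", "population"])

# Same data as a list of dictionaries (one dictionary per row).
records = [{"names": n, "population": p} for n, p in rows]
df_records = pd.DataFrame(records)

# Both constructions should describe the same table.
print(df_rows.equals(df_records))
```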


#### From an Empty Dataframe

You can create an empty data frame and start adding columns one by one. 

In [11]:
df = pd.DataFrame()
df["names"]      = names
df["population"] = population_m
df

Unnamed: 0,names,population
0,India,1500
1,United States,300
2,Canada,36
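
One thing to keep in mind with this approach: the first column you add fixes the number of rows, and every later column given as a list must have the same length. The sketch below deliberately adds a column that is too short, to show the error Pandas raises.

```python
import pandas as pd

df = pd.DataFrame()
df["names"] = ["India", "United States", "Canada"]   # the frame now has 3 rows

try:
    df["population"] = [1500, 300]                    # only 2 values for 3 rows
except ValueError as err:
    error_message = str(err)
    print("Could not add column:", error_message)
else:
    error_message = None
```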


#### From Files

In [9]:
df = pd.read_csv("../data/apple_df.csv")
df

Unnamed: 0,Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
0,1980-12-12,28.750,28.870,28.7500,28.750,2093900.0,0.0,1.0,0.422706,0.424470,0.422706,0.422706,117258400.0
1,1980-12-15,27.380,27.380,27.2500,27.250,785200.0,0.0,1.0,0.402563,0.402563,0.400652,0.400652,43971200.0
2,1980-12-16,25.370,25.370,25.2500,25.250,472000.0,0.0,1.0,0.373010,0.373010,0.371246,0.371246,26432000.0
3,1980-12-17,25.870,26.000,25.8700,25.870,385900.0,0.0,1.0,0.380362,0.382273,0.380362,0.380362,21610400.0
4,1980-12-18,26.630,26.750,26.6300,26.630,327900.0,0.0,1.0,0.391536,0.393300,0.391536,0.391536,18362400.0
5,1980-12-19,28.250,28.380,28.2500,28.250,217100.0,0.0,1.0,0.415355,0.417266,0.415355,0.415355,12157600.0
6,1980-12-22,29.630,29.750,29.6300,29.630,166800.0,0.0,1.0,0.435644,0.437409,0.435644,0.435644,9340800.0
7,1980-12-23,30.880,31.000,30.8800,30.880,209600.0,0.0,1.0,0.454023,0.455787,0.454023,0.454023,11737600.0
8,1980-12-24,32.500,32.630,32.5000,32.500,214300.0,0.0,1.0,0.477841,0.479753,0.477841,0.477841,12000800.0
9,1980-12-26,35.500,35.620,35.5000,35.500,248100.0,0.0,1.0,0.521950,0.523714,0.521950,0.521950,13893600.0
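
The `Unnamed: 0` column in the output suggests that the file stores the row index as a first column without a header; that is an assumption about `apple_df.csv`, which is not shown here. Under that assumption, the sketch below shows two commonly used `read_csv` options, `index_col` and `parse_dates`. To stay self-contained it reads from a string through `io.StringIO` instead of a file, using the first three rows above trimmed to a few columns.

```python
import io
import pandas as pd

# First three rows of the table above, trimmed to a few columns for brevity.
csv_text = """,Date,Open,High,Low,Close
0,1980-12-12,28.75,28.87,28.75,28.75
1,1980-12-15,27.38,27.38,27.25,27.25
2,1980-12-16,25.37,25.37,25.25,25.25
"""

# io.StringIO lets read_csv treat a string as if it were a file.
df_small = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,            # use the first column as the row index
    parse_dates=["Date"],   # convert the Date column to datetime values
)

print(df_small.dtypes)
```

With a real file, the first argument would simply be the file path, as in the cell above.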


### Display Dataframe

Once you read a dataframe, you typically want to examine it, usually by looking at just the first few or the last few rows. For that, use the **head ( )** or **tail ( )** methods.

In [10]:
df.head() # Shows the first 5 rows by default.

Unnamed: 0,Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
0,1980-12-12,28.75,28.87,28.75,28.75,2093900.0,0.0,1.0,0.422706,0.42447,0.422706,0.422706,117258400.0
1,1980-12-15,27.38,27.38,27.25,27.25,785200.0,0.0,1.0,0.402563,0.402563,0.400652,0.400652,43971200.0
2,1980-12-16,25.37,25.37,25.25,25.25,472000.0,0.0,1.0,0.37301,0.37301,0.371246,0.371246,26432000.0
3,1980-12-17,25.87,26.0,25.87,25.87,385900.0,0.0,1.0,0.380362,0.382273,0.380362,0.380362,21610400.0
4,1980-12-18,26.63,26.75,26.63,26.63,327900.0,0.0,1.0,0.391536,0.3933,0.391536,0.391536,18362400.0


In [12]:
df.tail() # Shows the last 5 rows by default.

Unnamed: 0,Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
9395,2018-03-21,175.04,175.09,171.26,171.27,35247358.0,0.0,1.0,175.04,175.09,171.26,171.27,35247358.0
9396,2018-03-22,170.0,172.68,168.6,168.845,41051076.0,0.0,1.0,170.0,172.68,168.6,168.845,41051076.0
9397,2018-03-23,168.39,169.92,164.94,164.94,40248954.0,0.0,1.0,168.39,169.92,164.94,164.94,40248954.0
9398,2018-03-26,168.07,173.1,166.44,172.77,36272617.0,0.0,1.0,168.07,173.1,166.44,172.77,36272617.0
9399,2018-03-27,173.68,175.15,166.92,168.34,38962839.0,0.0,1.0,173.68,175.15,166.92,168.34,38962839.0


In [15]:
df.head(10)  # You can also ask for a specific number of rows to be displayed.

Unnamed: 0,Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
0,1980-12-12,28.75,28.87,28.75,28.75,2093900.0,0.0,1.0,0.422706,0.42447,0.422706,0.422706,117258400.0
1,1980-12-15,27.38,27.38,27.25,27.25,785200.0,0.0,1.0,0.402563,0.402563,0.400652,0.400652,43971200.0
2,1980-12-16,25.37,25.37,25.25,25.25,472000.0,0.0,1.0,0.37301,0.37301,0.371246,0.371246,26432000.0
3,1980-12-17,25.87,26.0,25.87,25.87,385900.0,0.0,1.0,0.380362,0.382273,0.380362,0.380362,21610400.0
4,1980-12-18,26.63,26.75,26.63,26.63,327900.0,0.0,1.0,0.391536,0.3933,0.391536,0.391536,18362400.0
5,1980-12-19,28.25,28.38,28.25,28.25,217100.0,0.0,1.0,0.415355,0.417266,0.415355,0.415355,12157600.0
6,1980-12-22,29.63,29.75,29.63,29.63,166800.0,0.0,1.0,0.435644,0.437409,0.435644,0.435644,9340800.0
7,1980-12-23,30.88,31.0,30.88,30.88,209600.0,0.0,1.0,0.454023,0.455787,0.454023,0.454023,11737600.0
8,1980-12-24,32.5,32.63,32.5,32.5,214300.0,0.0,1.0,0.477841,0.479753,0.477841,0.477841,12000800.0
9,1980-12-26,35.5,35.62,35.5,35.5,248100.0,0.0,1.0,0.52195,0.523714,0.52195,0.52195,13893600.0
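
To make the row counts concrete, here is a small sketch on a made-up 20-row dataframe ( `df_demo` and its columns are hypothetical ). It also prints `shape` and `dtypes`, two attributes that are handy when examining a dataframe.

```python
import pandas as pd

# Hypothetical 20-row frame used only to show the row counts returned.
df_demo = pd.DataFrame({
    "day":   range(1, 21),
    "close": [100.0 + i for i in range(20)],
})

first_rows = df_demo.head()     # default: first 5 rows
last_rows  = df_demo.tail(3)    # last 3 rows

print(df_demo.shape)            # (number of rows, number of columns)
print(df_demo.dtypes)           # data type of each column
print(len(first_rows), len(last_rows))
```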


### Selecting Data from Dataframe

Selecting data from dataframes is also called **indexing**, because we pick rows and columns using some form of index ( labels or positions ). Let's see that with an example.

In [None]:
import pandas as pd

names        = ["India","United States","Canada"]
population_m = [1500,300,36]

d = {"names" :names , "population" : population_m}

df = pd.DataFrame(d)
df
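
A minimal sketch of a few common ways to index this dataframe follows. It rebuilds the same data so that it runs on its own; the variable names on the left are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "names": ["India", "United States", "Canada"],
    "population": [1500, 300, 36],
})

names_column = df["names"]                 # one column, returned as a Series
first_row    = df.iloc[0]                  # row selected by integer position
canada_pop   = df.loc[2, "population"]     # row label 2, column "population"
large        = df[df["population"] > 100]  # rows matching a condition

print(names_column.tolist())
print(canada_pop)
print(large)
```

Here the row labels happen to be the default integers 0, 1, 2, so `loc` ( label based ) and `iloc` ( position based ) look alike; they differ once the index holds other labels.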