# Introduction to Pandas

**Pandas** is a package for data manipulation and analysis in Python. The name Pandas is derived from **Pan**el **Da**ta.

Pandas adds two data structures to Python:
* Pandas Series
* Pandas DataFrame

These data structures allow us to work with labelled and relational data in an easy and intuitive manner.

## Why use Pandas?
The recent successes of machine learning algorithms are partly due to the huge amounts of data that we have available to train them.
However, when it comes to data, quantity is not the only thing that matters; quality is just as important. Large datasets often don't come ready to be fed into your learning algorithms. They frequently contain incorrect values, missing values, outliers, and so on.
One important step in machine learning is to look at your data first and make sure it's well suited for your training algorithm by doing some basic data analysis.
This is where Pandas comes in.
Pandas Series and Pandas DataFrames are designed for fast data analysis and manipulation. They are also flexible and easy to use.
Below are some features that make Pandas an excellent package for data analysis.
* Allows the use of labels for rows and columns
* Calculates rolling statistics on time series data
* Handles NaN values easily
* Loads data of different formats into DataFrames
* Joins and merges different datasets
* Integrates with NumPy and Matplotlib
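As a rough illustration of two of the features listed above (NaN handling and rolling statistics), here is a minimal sketch. The temperature readings and variable names are made up for this example and are not part of the tutorial's data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily temperature readings with one missing value.
temps = pd.Series(
    [20.0, 21.5, np.nan, 23.0, 22.5],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# NaN handling: count the missing values, then fill the gap with the mean.
n_missing = int(temps.isna().sum())
filled = temps.fillna(temps.mean())

# Rolling statistics: a 2-day moving average over the filled readings.
rolling_mean = filled.rolling(window=2).mean()

print(n_missing)
print(rolling_mean)
```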

## Creating a Pandas Series

A Pandas Series is a one-dimensional array-like object that can hold many datatypes.

One main difference between a Pandas Series and a NumPy ndarray is that you can assign an index label to each element of the Series. In other words, you can name the indices.

Another big difference between the two is that a Pandas Series can contain elements of different data types.

Let's create a Pandas Series.

The command `pd.Series(data, index)` creates a Pandas Series, where `data` holds the values and `index` is the list of index labels.
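As a quick sketch of how the `index` argument behaves: when it is omitted, Pandas falls back to integer labels starting at 0. The variable names and values below are illustrative only.

```python
import pandas as pd

# Without `index`, Pandas assigns default integer labels 0, 1, 2, ...
unlabelled = pd.Series(data=[30, 6, 12])

# With `index`, each element gets the label we choose.
labelled = pd.Series(data=[30, 6, 12], index=["eggs", "apples", "pears"])

print(unlabelled.index.tolist())  # [0, 1, 2]
print(labelled.index.tolist())    # ['eggs', 'apples', 'pears']
```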

We can create a Pandas Series to store a grocery list.

In [3]:
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"], index=["eggs", "apples", "milk", "bread"])

groceries

eggs       30
apples      6
milk      Yes
bread      No
dtype: object

> Pandas Series are displayed with indices in the first column and the data in the second column.

Just like NumPy ndarrays, Pandas Series have attributes that allow us to get information from the series in an easy way.

In [4]:
print("Groceries has shape {}".format(groceries.shape))
print("Groceries has dimension {}".format(groceries.ndim))
print("Groceries has a total of {} elements".format(groceries.size))

Groceries has shape (4,)
Groceries has dimension 1
Groceries has a total of 4 elements
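The comparison with NumPy can be made concrete. In this sketch (variable names are illustrative), the same mixed list is passed to both libraries: NumPy coerces every element to a string, while the Series keeps the original Python objects under the `object` dtype. Both expose the same `shape`, `ndim` and `size` attributes.

```python
import numpy as np
import pandas as pd

values = [30, 6, "Yes", "No"]

arr = np.array(values)   # NumPy converts every element to a string
ser = pd.Series(values)  # Pandas keeps the original Python objects

print(arr.dtype.kind)      # U (a unicode string dtype)
print(ser.dtype)           # object
print(type(ser.iloc[0]))   # <class 'int'>
print(arr.shape == ser.shape, arr.ndim == ser.ndim, arr.size == ser.size)
```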


We can also print the index labels and the data separately. This is helpful if you don't happen to know what the index labels of the Pandas Series are.

In [5]:
print("The data in groceries is {}.".format(groceries.values))
print("The index of groceries is {}.".format(groceries.index))

The data in groceries is [30 6 'Yes' 'No'].
The index of groceries is Index(['eggs', 'apples', 'milk', 'bread'], dtype='object').


If you are dealing with a very large Pandas Series and you're not sure whether an index label exists, you can check using the `in` operator.

In [6]:
print("Is bananas an index label in groceries? {}.".format("bananas" in groceries))
print("Is bread an index label in groceries? {}.".format("bread" in groceries))

Is bananas an index label in groceries? False.
Is bread an index label in groceries? True.
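One detail worth spelling out is that `in` tests the index labels, not the stored values. The sketch below rebuilds the grocery list and shows one way to test the values instead, through the `.values` attribute used earlier.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"], index=["eggs", "apples", "milk", "bread"])

label_found = "bread" in groceries        # checks the index labels
value_as_label = "Yes" in groceries       # "Yes" is a value, not a label
value_found = "Yes" in groceries.values   # checks the stored values

print(label_found, value_as_label, value_found)  # True False True
```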


## Accessing and deleting elements in a Pandas Series

One great advantage of a Pandas Series is that it allows us to access data in many different ways.

Elements can be accessed using index labels or numerical indices. Both positive and negative numerical indices can be used, to access elements from the beginning and from the end of the Series respectively.

Because there are different ways to access elements, Pandas Series provide a way to remove any ambiguity. Two attributes allow us to state explicitly what we mean to do:
* `.loc` - stands for location. It's used to explicitly state that we are using a labeled index.
* `.iloc` - stands for integer location. It's used to explicitly state that we are using a numerical index.

Let's see some examples.

In [10]:
print("How many eggs do we need to buy? {}".format(groceries["eggs"]))
print()
print("Do we need milk and bread? \n {}".format(groceries[["milk", "bread"]]))
print()
# Access by label, stated explicitly with .loc
print("How many eggs and apples do we need to buy? \n {}".format(groceries.loc[["eggs", "apples"]]))
print()
# Access by position, stated explicitly with .iloc
print("How many eggs and apples do we need to buy? \n {}".format(groceries.iloc[[0, 1]]))
print()
print("Do we need bread? {}".format(groceries.iloc[-1]))
print()

How many eggs do we need to buy? 30

Do we need milk and bread? 
 milk     Yes
bread     No
dtype: object

How many eggs and apples do we need to buy? 
 eggs      30
apples     6
dtype: object

How many eggs and apples do we need to buy? 
 eggs      30
apples     6
dtype: object

Do we need bread? No
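The ambiguity that `.loc` and `.iloc` resolve is easiest to see when the labels are themselves integers. In this small sketch (the Series is hypothetical, not part of the grocery example), label `0` and position `0` point at different elements.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions.
letters = pd.Series(data=["a", "b", "c"], index=[2, 1, 0])

by_label = letters.loc[0]      # label 0 belongs to the last element
by_position = letters.iloc[0]  # position 0 is the first element

print(by_label)     # c
print(by_position)  # a
```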



Pandas Series are mutable. We can change elements after the Series has already been created.

In [13]:
print("Original groceries list: \n {}".format(groceries))

groceries["eggs"] = 82
print()
print("Modified groceries list: \n {}".format(groceries))

Original groceries list: 
 eggs       30
apples      6
milk      Yes
bread      No
dtype: object

Modified groceries list: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object


We can delete items from Pandas Series using the `drop()` method. By default, this method drops elements **out of place**, meaning that the original Series will not be altered.

In [14]:
print("Original groceries list: \n {}".format(groceries))

print()
print("remove apples (out of place): \n {}".format(groceries.drop("apples")))
print()
print("groceries list after the drop: \n {}".format(groceries))

Original groceries list: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object

remove apples (out of place): 
 eggs      82
milk     Yes
bread     No
dtype: object

groceries list after the drop: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object


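To delete items in place, we need to use the `inplace` keyword and set its value to `True`. In that case `drop()` modifies the Series directly and returns `None`, which is why `None` is printed in the output below.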

In [15]:
print("Original groceries list: \n {}".format(groceries))

print()
print("remove apples (in place): \n {}".format(groceries.drop("apples", inplace=True)))
print()
print("groceries list after the drop: \n {}".format(groceries))

Original groceries list: 
 eggs       82
apples      6
milk      Yes
bread      No
dtype: object

remove apples (in place): 
 None

groceries list after the drop: 
 eggs      82
milk     Yes
bread     No
dtype: object
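The two styles of deletion can be summarised in a short sketch. It rebuilds the grocery list so that it runs on its own; the variable names `without_apples`, `only_numbers` and `result` are illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[82, 6, "Yes", "No"], index=["eggs", "apples", "milk", "bread"])

# drop() returns a new Series, so keep the result by assigning it to a name.
without_apples = groceries.drop("apples")

# A list of labels removes several items at once.
only_numbers = groceries.drop(["milk", "bread"])

# With inplace=True the Series itself changes and the call returns None.
result = groceries.drop("eggs", inplace=True)

print(list(without_apples.index))  # ['eggs', 'milk', 'bread']
print(list(only_numbers.index))    # ['eggs', 'apples']
print(result)                      # None
print(list(groceries.index))       # ['apples', 'milk', 'bread']
```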


## Arithmetic Operations on Pandas Series

Just like with NumPy ndarrays, we can perform element-wise operations on Pandas Series.

In [16]:
fruits = pd.Series(data=[10, 6, 3], index=["apples", "oranges", "bananas"])

fruits

apples     10
oranges     6
bananas     3
dtype: int64

We can modify the data in fruits by performing basic arithmetic operations. Let's see some examples.

In [17]:
print("Original list of fruits \n {}".format(fruits))
print()
print("fruits + 2: \n {}".format(fruits + 2))
print()
print("fruits - 2: \n {}".format(fruits - 2))
print()
print("fruits * 2: \n {}".format(fruits * 2))
print()
print("fruits / 2: \n {}".format(fruits / 2))

Original list of fruits 
 apples     10
oranges     6
bananas     3
dtype: int64

fruits + 2: 
 apples     12
oranges     8
bananas     5
dtype: int64

fruits - 2: 
 apples     8
oranges    4
bananas    1
dtype: int64

fruits * 2: 
 apples     20
oranges    12
bananas     6
dtype: int64

fruits / 2: 
 apples     5.0
oranges    3.0
bananas    1.5
dtype: float64


Mathematical functions from NumPy, such as `np.sqrt(x)`, can also be applied to a Pandas Series.

In [18]:
import numpy as np

print("Original list of fruits \n {}".format(fruits))
print()
print("exp(fruits): \n {}".format(np.exp(fruits)))
print()
print("sqrt(fruits): \n {}".format(np.sqrt(fruits)))
print()
print("power(fruits, 2): \n {}".format(np.power(fruits, 2)))

Original list of fruits 
 apples     10
oranges     6
bananas     3
dtype: int64

exp(fruits): 
 apples     22026.465795
oranges      403.428793
bananas       20.085537
dtype: float64

sqrt(fruits): 
 apples     3.162278
oranges    2.449490
bananas    1.732051
dtype: float64

power(fruits, 2): 
 apples     100
oranges     36
bananas      9
dtype: int64


Arithmetic operations can also be applied to selected items of the fruits Series only.

In [20]:
print("Original list of fruits \n {}".format(fruits))
print()
print("Amount of bananas + 2 = {}".format(fruits["bananas"] + 2))
print()
print("Amount of apples - 2 = {}".format(fruits.iloc[0] - 2))
print()
print("We double the amount of apples and oranges: \n {}".format(fruits[["apples", "oranges"]] * 2))
print()
print("We halve the amount of apples and oranges: \n {}".format(fruits[["apples", "oranges"]] / 2))
print()

Original list of fruits 
 apples     10
oranges     6
bananas     3
dtype: int64

Amount of bananas + 2 = 5

Amount of apples - 2 = 8

We double the amount of apples and oranges: 
 apples     20
oranges    12
dtype: int64

We halve the amount of apples and oranges: 
 apples     5.0
oranges    3.0
dtype: float64
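Note that these operations return new objects and leave `fruits` untouched. A minimal sketch of how the result could be written back, assuming we want to update only the selected items, is to assign through `.loc`:

```python
import pandas as pd

fruits = pd.Series(data=[10, 6, 3], index=["apples", "oranges", "bananas"])

# The multiplication builds a new Series; fruits itself is unchanged.
doubled = fruits[["apples", "oranges"]] * 2
print(fruits.tolist())  # [10, 6, 3]

# Assign the result back through .loc to update the selected items.
fruits.loc[["apples", "oranges"]] = doubled
print(fruits.tolist())  # [20, 12, 3]
```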



Arithmetic operations can also be applied to Pandas Series of mixed datatypes, provided the operation is defined for every datatype in the Series. Otherwise, you get an error.

Let's see what happens when we multiply the groceries list by 2.

In [21]:
groceries * 2

eggs        164
milk     YesYes
bread      NoNo
dtype: object

Pandas multiplies each element by 2. For the strings, this means the text is repeated twice.

If you apply an operation that is valid for numbers but not for strings, e.g. `/` (division), you will get an error.

When you have mixed datatypes, make sure that arithmetic operations are valid for all datatypes in your Pandas Series.
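The failure case can be sketched as follows. The use of `pd.to_numeric` with `errors="coerce"` is one possible workaround added here for illustration; it is not something the tutorial itself prescribes.

```python
import pandas as pd

groceries = pd.Series(data=[82, "Yes", "No"], index=["eggs", "milk", "bread"])

# Division is not defined for strings, so the whole operation fails.
try:
    groceries / 2
    failed = False
except TypeError:
    failed = True
print(failed)  # True

# Possible workaround: convert to numbers, turning non-numeric entries into NaN.
numeric = pd.to_numeric(groceries, errors="coerce")
halved = numeric / 2
print(halved["eggs"])  # 41.0
```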

## Creating Pandas DataFrames

Pandas DataFrames are two-dimensional data structures with labelled rows and columns that can hold multiple datatypes.

Pandas DataFrames can be thought of as being similar to spreadsheets.

Pandas DataFrames can be created either from dictionaries or from a file.

### Creating Pandas DataFrames from a dictionary of Pandas Series

We will create a dictionary of items purchased by two people: Alice and Bob.

In [23]:
import pandas as pd

items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice": pd.Series(data=[40, 110, 500, 45], index=["book", "glasses", "bike", "pants"]),
}

print(type(items))

<class 'dict'>


Now that we have a dictionary, we can pass it to the `pd.DataFrame()` function.

In [25]:
shopping_carts = pd.DataFrame(items)

shopping_carts

|         | Bob   | Alice |
|---------|-------|-------|
| bike    | 245.0 | 500.0 |
| book    | NaN   | 40.0  |
| glasses | NaN   | 110.0 |
| pants   | 25.0  | 45.0  |
| watch   | 55.0  | NaN   |


There are several things to notice here:

1. DataFrames are displayed in tabular form (like a spreadsheet)
2. The rows in the DataFrame are built from the union of the labels from the two Pandas Series in the dictionary
3. The column labels are taken from the dictionary keys
4. There are some NaN values. This is Pandas' way of indicating that it doesn't have a value for a particular row and column index.
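The points above can be checked programmatically. This sketch rebuilds `shopping_carts` from the same dictionary and inspects its labels and missing values; the name `missing` is introduced here for illustration.

```python
import pandas as pd

items = {
    "Bob": pd.Series(data=[245, 25, 55], index=["bike", "pants", "watch"]),
    "Alice": pd.Series(data=[40, 110, 500, 45], index=["book", "glasses", "bike", "pants"]),
}
shopping_carts = pd.DataFrame(items)

# Row labels are the union of the labels of both Series.
print(sorted(shopping_carts.index))

# Column labels come from the dictionary keys.
print(list(shopping_carts.columns))  # ['Bob', 'Alice']

# Count the missing (NaN) entries in each column.
missing = shopping_carts.isna().sum()
print(missing["Bob"], missing["Alice"])  # 2 1
```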

In the above example, we created a Pandas DataFrame from a dictionary of Pandas Series that have index labels. If we don't provide index labels in the Pandas Series, Pandas will use numerical row indexes when it creates a DataFrame. The numerical row indexes start at 0.

In [26]:
data = {
    "Bob": pd.Series([245, 25, 55]),
    "Alice": pd.Series([40, 110, 500, 45])
}

df = pd.DataFrame(data)

df

|   | Bob   | Alice |
|---|-------|-------|
| 0 | 245.0 | 40    |
| 1 | 25.0  | 110   |
| 2 | 55.0  | 500   |
| 3 | NaN   | 45    |


Just like with Pandas Series, we can extract information from DataFrames using attributes.

In [28]:
print("shopping_carts has shape {}".format(shopping_carts.shape))
print("shopping_carts has dimension {}".format(shopping_carts.ndim))
print("shopping_carts has a total of {} elements".format(shopping_carts.size))
print()
print("The data in shopping_carts is: \n {}".format(shopping_carts.values))
print()
print("The row index in shopping_carts is: \n {}".format(shopping_carts.index))
print()
print("The column index in shopping_carts is: \n {}".format(shopping_carts.columns))

shopping_carts has shape (5, 2)
shopping_carts has dimension 2
shopping_carts has a total of 10 elements

The data in shopping_carts is: 
 [[245. 500.]
 [ nan  40.]
 [ nan 110.]
 [ 25.  45.]
 [ 55.  nan]]

The row index in shopping_carts is: 
 Index(['bike', 'book', 'glasses', 'pants', 'watch'], dtype='object')

The column index in shopping_carts is: 
 Index(['Bob', 'Alice'], dtype='object')


When creating `shopping_carts`, we passed in the entire dictionary to `pd.DataFrame()`. However, there may be cases where you're only interested in a subset of the data.

Pandas allows us to select which data we want to put into our DataFrame by means of the keywords `columns` and `index`.

In [31]:
bob_shopping_cart = pd.DataFrame(items, columns=["Bob"])

bob_shopping_cart

|       | Bob |
|-------|-----|
| bike  | 245 |
| pants | 25  |
| watch | 55  |


In [32]:
x = pd.DataFrame(items, index=["pants", "book"])

x

|       | Bob  | Alice |
|-------|------|-------|
| pants | 25.0 | 45    |
| book  | NaN  | 40    |


### Creating Pandas DataFrames from a dictionary of lists

Pandas DataFrames can also be created from a dictionary of lists. However, all the lists must be of the same length.
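To see the length constraint in action, here is a small sketch with deliberately uneven lists (the data is hypothetical). Wrapping each list in a Pandas Series, as in the previous section, is one way around the restriction, because the shorter column is then padded with NaN.

```python
import pandas as pd

# Hypothetical data where the two lists have different lengths.
uneven = {"integers": [1, 2, 3], "floats": [4.5, 8.2]}

# Plain lists of different lengths are rejected with a ValueError.
try:
    pd.DataFrame(uneven)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# Wrapping the lists in Series lets Pandas pad the shorter column with NaN.
padded = pd.DataFrame({key: pd.Series(values) for key, values in uneven.items()})
print(padded.shape)  # (3, 2)
```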

In [34]:
data = {
    "integers": [1, 2, 3],
    "floats": [4.5, 8.2, 9.6]
}

df = pd.DataFrame(data)

df

|   | integers | floats |
|---|----------|--------|
| 0 | 1        | 4.5    |
| 1 | 2        | 8.2    |
| 2 | 3        | 9.6    |


Notice that we are using numerical row indices since labels were not provided.

Labels can be added by using the `index` keyword.

In [36]:
df = pd.DataFrame(data, index=["label 1", "label 2", "label 3"])

df

|         | integers | floats |
|---------|----------|--------|
| label 1 | 1        | 4.5    |
| label 2 | 2        | 8.2    |
| label 3 | 3        | 9.6    |


### Creating Pandas DataFrames from a list of dictionaries

In [37]:
items2 = [
    {
        "bikes": 20,
        "pants": 30,
        "watches": 35
    },
    {
        "watches": 10,
        "glasses": 50,
        "bikes": 15,
        "pants": 5
    }
]

store_items = pd.DataFrame(items2)

store_items

|   | bikes | glasses | pants | watches |
|---|-------|---------|-------|---------|
| 0 | 20    | NaN     | 30    | 35      |
| 1 | 15    | 50.0    | 5     | 10      |


Again, we can label the rows.

In [38]:
store_items = pd.DataFrame(items2, index=["store 1", "store 2"])

store_items

|         | bikes | glasses | pants | watches |
|---------|-------|---------|-------|---------|
| store 1 | 20    | NaN     | 30    | 35      |
| store 2 | 15    | 50.0    | 5     | 10      |


## Accessing elements in Pandas DataFrames
