# Pandas

## Introduction to Pandas

**Pandas** is a package for data manipulation and analysis in Python. The name Pandas is derived from the econometrics term Panel Data. Pandas adds two data structures to Python, namely **Pandas Series** and **Pandas DataFrame**. These data structures allow us to work with labeled and relational data in an easy and intuitive manner. These lessons are intended as a basic overview of Pandas and introduce some of its most important features.

In these lessons, you will learn:

- How to import Pandas
- How to create Pandas Series and DataFrames using various methods
- How to access and change elements in Series and DataFrames
- How to perform arithmetic operations on Series
- How to load data into a DataFrame
- How to deal with Not a Number (NaN) values

## Why Use Pandas?

The recent success of machine learning algorithms is partly due to the huge amounts of data available to train them on. However, when it comes to data, quantity is not the only thing that matters; the quality of your data is just as important. Large datasets often don't come ready to be fed into your learning algorithms. More often than not, they have missing values, outliers, or incorrect values. Data with a lot of missing or bad values will not allow your machine learning algorithms to perform well. Therefore, one very important step in machine learning is to look at your data first and make sure it is well suited for your training algorithm by doing some basic data analysis. This is where Pandas comes in. Pandas Series and DataFrames are designed for fast data analysis and manipulation, and they are flexible and easy to use. Below are just a few features that make Pandas an excellent package for data analysis:

- Allows the use of labels for rows and columns
- Can calculate rolling statistics on time series data
- Easy handling of NaN values
- Is able to load data of different formats into DataFrames
- Can join and merge different datasets together
- It integrates with NumPy and Matplotlib

For these and other reasons, Pandas DataFrames have become one of the most commonly used Pandas objects for data analysis in Python.



-------------

# Creating Pandas Series
- pd.Series(data, index), where index is a list of index labels
    - series.shape
    - series.size
    - series.ndim
    - series.index
    - series.values

A Pandas Series is a one-dimensional array-like object that can hold many data types, such as numbers or strings. One of the main differences between Pandas Series and NumPy ndarrays is that you can assign an index label to each element in the Pandas Series. In other words, you can name the indices of your Pandas Series anything you want. Another big difference is that a single Pandas Series can hold data of different data types, in which case its dtype is reported as object.

In [1]:
import pandas as pd

Unlike NumPy ndarrays, a Pandas Series can hold data of different types, and its index labels can be strings.

In [2]:
groceries = pd.Series(data=[30, 6, "Yes", "No"], index=["egg", "apple", "milk", "bread"])
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [3]:
print('Groceries has shape:', groceries.shape)
print('Groceries has dimension:', groceries.ndim)
print('Groceries has a total of', groceries.size, 'elements')

print()

print('The data in Groceries is:', groceries.values)
print('The index of Groceries is:', groceries.index)

Groceries has shape: (4,)
Groceries has dimension: 1
Groceries has a total of 4 elements

The data in Groceries is: [30 6 'Yes' 'No']
The index of Groceries is: Index(['egg', 'apple', 'milk', 'bread'], dtype='object')
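
The Series above was built from a list of values and a list of labels. As a minimal sketch of two other ways to create a Series, the example below omits the index and then builds a Series from a dictionary. The names `counts` and `prices` and their values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Without an explicit index, pandas assigns the default labels 0, 1, 2, ...
counts = pd.Series([30, 6, 12])
print(counts.index.tolist())   # [0, 1, 2]

# A dictionary supplies labels and values together: keys become index labels
prices = pd.Series({"egg": 0.25, "apple": 0.5, "bread": 2.0})
print(prices.index.tolist())   # ['egg', 'apple', 'bread']

# When every element shares a type, pandas picks a specific dtype
# instead of the generic object dtype seen for the mixed groceries Series
print(counts.dtype)
print(prices.dtype)            # float64
```

Compare this with `groceries`, whose mix of integers and strings forces the dtype to object.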


To check whether a specific label is among the index labels, use the in operator:

In [4]:
"banana" in groceries.index

False

In [5]:
"apple" in groceries.index

True

In [6]:
# We check whether bananas is a food item (an index) in Groceries
x = 'bananas' in groceries

# We check whether bread is a food item (an index) in Groceries
y = 'bread' in groceries

# We print the results
print('Is bananas an index label in Groceries:', x)
print('Is bread an index label in Groceries:', y)

Is bananas an index label in Groceries: False
Is bread an index label in Groceries: True
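
Note that `in` applied to a Series tests the index labels, not the stored values. The sketch below recreates `groceries` and shows one possible way to test the values instead, using `to_list()` and `isin()`; the variable name `has_yes` is introduced only for illustration.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# `in` on a Series looks at the index labels, not the stored values
print("Yes" in groceries)            # False: "Yes" is a value, not a label
print("milk" in groceries)           # True: "milk" is a label

# To test the values, convert them to a plain Python list first
print("Yes" in groceries.to_list())  # True

# isin() gives an element-wise answer as a boolean Series
has_yes = groceries.isin(["Yes"])
print(has_yes.to_list())             # [False, False, True, False]
```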


-----------------

## Accessing and Deleting Elements in Pandas Series

One great advantage of Pandas Series is that it allows us to access data in many different ways. Elements can be accessed using index labels or numerical indices inside square brackets, [ ], similar to how we access elements in NumPy ndarrays. Numerical indices can be positive or negative integers, which access data from the beginning or from the end of the Series, respectively. Because elements can be accessed in several ways, Pandas Series have two attributes, **.loc** and **.iloc**, that remove any ambiguity about whether we are referring to an index label or a numerical index. The attribute .loc stands for location and explicitly states that we are using a label index. Similarly, the attribute .iloc stands for integer location and explicitly states that we are using a numerical index.
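
The ambiguity is easiest to see when the index labels are themselves integers. The following is a small sketch with a hypothetical Series `s` (not part of the original notebook) whose labels run in reverse order, so a label and a position with the same number point at different elements.

```python
import pandas as pd

# Hypothetical Series whose labels are the integers 2, 1, 0
s = pd.Series(["a", "b", "c"], index=[2, 1, 0])

print(s.loc[0])    # label 0 is the last element   -> c
print(s.iloc[0])   # position 0 is the first one   -> a
print(s.iloc[-1])  # position -1 is the last one   -> c
```

Writing `.loc` or `.iloc` makes the intent explicit, so the reader does not have to know what kind of index the Series has.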

### Accessing Pandas Series by Index Label

In [7]:
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [8]:
# We use a single index label
groceries["egg"]

30

In [9]:
# We use a list of index labels to access multiple elements
groceries[["milk", "bread"]]

milk     Yes
bread     No
dtype: object

### Accessing Pandas Series by Numerical Indices
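Positional access works the same way as in NumPy. Note that recent pandas releases (2.x) deprecate integer keys inside plain square brackets on a Series with non-integer labels, such as groceries[0] or groceries[[2,3]], and warn that a future version will treat such keys as labels. The .iloc attribute is the version-safe way to access elements by position.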

In [25]:
# We use a single numerical index
groceries[0]

30

In [26]:
# We use a negative numerical index
print('Do we need bread:\n', groceries[[-1]])

Do we need bread:
 bread    No
dtype: object


In [11]:
groceries[2:4]

milk     Yes
bread     No
dtype: object

In [12]:
# we use multiple numerical indices
groceries[[2,3]]

milk     Yes
bread     No
dtype: object
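
The positional lookups above can also be written with `.iloc`. This is a minimal sketch that recreates `groceries` with its initial values; the variable names `first`, `last`, `middle` and `picked` are introduced only for illustration.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

first = groceries.iloc[0]        # single position -> the scalar 30
last = groceries.iloc[[-1]]      # list with one position -> one-element Series
middle = groceries.iloc[2:4]     # slice of positions 2 and 3
picked = groceries.iloc[[2, 3]]  # explicit list of positions

print(first)                     # 30
print(last.index.tolist())       # ['bread']
print(middle.to_list())          # ['Yes', 'No']
print(picked.equals(middle))     # True
```

Passing a list, as in `iloc[[-1]]`, returns a Series, while a bare integer returns the element itself.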

## How to differentiate between label index and numerical index?
- **series.loc[]** : label index, label location
- **series.iloc[]**: numerical index, integer location

In [13]:
print(groceries)

egg       30
apple      6
milk     Yes
bread     No
dtype: object


In [14]:
# Label location: we use loc to access multiple index labels
groceries.loc[["egg", "apple"]]

egg      30
apple     6
dtype: object

In [15]:
# Integer location: we use iloc to access multiple numerical indices
groceries.iloc[[0,1]]

egg      30
apple     6
dtype: object
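
One practical difference between the two attributes shows up when slicing: a `.loc` slice includes its end label, while an `.iloc` slice excludes its end position, as in ordinary Python slicing. The sketch below recreates `groceries`; the names `by_label` and `by_position` are illustrative.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# Label slice: both endpoints are included
by_label = groceries.loc["egg":"milk"]
print(by_label.index.tolist())      # ['egg', 'apple', 'milk']

# Position slice: the end position is excluded
by_position = groceries.iloc[0:2]
print(by_position.index.tolist())   # ['egg', 'apple']
```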

## Changing Pandas Series
Pandas Series are mutable, like NumPy ndarrays, so we can change their elements after creation.

In [16]:
groceries

egg       30
apple      6
milk     Yes
bread     No
dtype: object

In [17]:
groceries.iloc[[0]] = 32
groceries

egg       32
apple      6
milk     Yes
bread     No
dtype: object

In [18]:
groceries.loc[["egg"]] = 33
groceries

egg       33
apple      6
milk     Yes
bread     No
dtype: object

In [19]:
groceries["egg"] = 35
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object
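
The cells above change one element at a time. As a hedged sketch of two related patterns, the example below updates several labels in one assignment and adds an element by assigning to a label that does not exist yet. It works on a copy named `updated` (an illustrative name) so the original Series stays untouched, and the `"butter"` entry is made up for the example.

```python
import pandas as pd

groceries = pd.Series(data=[30, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# copy() gives an independent Series, so groceries is not modified below
updated = groceries.copy()

# Assign new values to several labels at once
updated.loc[["milk", "bread"]] = ["No", "Yes"]

# Assigning to a label that is not in the index appends a new element
updated["butter"] = 2

print(updated.to_list())     # [30, 6, 'No', 'Yes', 2]
print(groceries.to_list())   # [30, 6, 'Yes', 'No']
```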

## Deleting elements from Pandas Series
- series.drop(label index, inplace=False) : returns a modified copy and leaves the original Series unchanged (this is the default)
- series.drop(label index, inplace=True) : modifies the original Series in place and returns None

We can delete items from a Pandas Series in place by setting the keyword inplace to True in the .drop() method.

In [20]:
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

In [21]:
new = groceries.drop("bread")
new

egg       35
apple      6
milk     Yes
dtype: object

In [22]:
groceries

egg       35
apple      6
milk     Yes
bread     No
dtype: object

In [23]:
# Now the original Series itself is changed
groceries.drop("bread", inplace=True)
groceries

egg       35
apple      6
milk     Yes
dtype: object
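
The `.drop()` method also accepts a list of labels, and it raises a KeyError when a label is missing unless `errors="ignore"` is passed. The sketch below recreates `groceries` with the values it has at this point; the names `milk_only` and `same` are introduced only for illustration.

```python
import pandas as pd

groceries = pd.Series(data=[35, 6, "Yes", "No"],
                      index=["egg", "apple", "milk", "bread"])

# Drop several labels at once; the original Series is left unchanged
milk_only = groceries.drop(["egg", "apple", "bread"])
print(milk_only.index.tolist())   # ['milk']

# Dropping a label that is not in the index raises KeyError by default
try:
    groceries.drop("banana")
except KeyError as err:
    print("KeyError:", err)

# errors="ignore" silently skips labels that are not present
same = groceries.drop("banana", errors="ignore")
print(same.equals(groceries))     # True
```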