# Pandas

pandas is a software library for the Python programming language that provides tools for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. Its data manipulation capabilities are built on top of the NumPy library, which makes NumPy a dependency of pandas.

In this notebook we'll try out various pandas methods and learn more about pandas along the way.

## Installation

Follow the [installation guide](https://pandas.pydata.org/pandas-docs/stable/install.html), which covers all the necessary steps.

## Importing Pandas

Once pandas is installed, we can import it in our file:

In [2]:
import numpy as np
import pandas as pd

## Series

A Series is similar to a NumPy array. The main difference is that a Series has axis labels, so it can be indexed by label as well as by integer position.

### Creating Series

There are various ways to create Series. Some of them are listed below.

1. **Using Python List**

In [3]:
seriesLabel = ['label1', 'label2', 'label3']
exampleList = [5, 10, 20]

In [4]:
pd.Series(exampleList)

0     5
1    10
2    20
dtype: int64

In [5]:
pd.Series(exampleList, seriesLabel)

label1     5
label2    10
label3    20
dtype: int64

2. **Using Numpy Arrays**

In [6]:
exampleNumpyArray = np.array([6, 12, 18])

In [7]:
pd.Series(exampleNumpyArray)

0     6
1    12
2    18
dtype: int64

In [8]:
pd.Series(exampleNumpyArray, seriesLabel)

label1     6
label2    12
label3    18
dtype: int64

3. **Using Dictionary**

In [9]:
exampleDictionary = { 'label4': 7, 'label5': 14, 'label6': 21 }

In [10]:
# No need to pass the index parameter: the dictionary keys become the labels
pd.Series(exampleDictionary)

label4     7
label5    14
label6    21
dtype: int64

In [11]:
# If none of the labels passed match the dictionary keys, every value is NaN
pd.Series(exampleDictionary, seriesLabel)

label1   NaN
label2   NaN
label3   NaN
dtype: float64
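
The all-NaN result above follows from how a dictionary is matched against the index: values are looked up by key. A minimal sketch of a partial match is shown below; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd

example_dict = {'label4': 7, 'label5': 14, 'label6': 21}

# 'label4' is a key in the dictionary; 'label1' is not.
partial = pd.Series(example_dict, index=['label4', 'label1'])
print(partial)

# Keys that are not requested ('label5', 'label6') are left out, and the
# unmatched label gets NaN, which is why the dtype becomes float64.
```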

### Data and Index Parameter in Series

1. **Data**

A Series can hold a variety of data, including arbitrary Python objects such as functions.

In [12]:
def sampleFunc1():
    pass

def sampleFunc2():
    pass

def sampleFunc3():
    pass

pd.Series(data=[sampleFunc1, sampleFunc2, sampleFunc3])

0    <function sampleFunc1 at 0x11b70d950>
1    <function sampleFunc2 at 0x11b70da60>
2    <function sampleFunc3 at 0x11b70d9d8>
dtype: object

In [13]:
pd.Series(['a', 2, 'hey'])

0      a
1      2
2    hey
dtype: object

2. **Index**

`index` is the second parameter. It supplies the labels for the values in the Series.

In [14]:
pd.Series(data=[sampleFunc1, sampleFunc2, sampleFunc3], index=['a', 'b', 'c'])

a    <function sampleFunc1 at 0x11b70d950>
b    <function sampleFunc2 at 0x11b70da60>
c    <function sampleFunc3 at 0x11b70d9d8>
dtype: object

In [15]:
pd.Series(['a', 2, 'hey'], ['label', 2, 'key'])

label      a
2          2
key      hey
dtype: object
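
Since a Series can be indexed both by label and by position, here is a short sketch of the two lookups side by side. It uses `.loc` for labels and `.iloc` for positions, and the data is made up for illustration.

```python
import pandas as pd

series = pd.Series([5, 10, 20], index=['label1', 'label2', 'label3'])

by_label = series.loc['label2']   # look up by axis label
by_position = series.iloc[1]      # look up by integer position (0-based)

print(by_label, by_position)
print(list(series.index))         # the labels themselves
```

Both lookups return the same element here, because `'label2'` sits at position 1.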

## DataFrames

DataFrames are like spreadsheets or SQL tables: two-dimensional data with labelled rows and columns. They are the data structure pandas users work with most often.

### Creating a DataFrame

pd.DataFrame( *data*, *index*, *columns* )

*data* -> content of the cells<br>
*index* -> labels for rows<br>
*columns* -> labels for columns

Returns a two-dimensional, size-mutable, potentially heterogeneous tabular data structure, i.e. a DataFrame.

In [16]:
pd.DataFrame(data=np.random.randint(1, 51, (4, 3)), index=['row1', 'row2', 'row3', 'row4'], columns=['col1', 'col2', 'col3'])

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 7    | 39   | 27   |
| row2 | 19   | 14   | 40   |
| row3 | 42   | 33   | 32   |
| row4 | 20   | 20   | 28   |
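
The cell above fills the frame with random integers, so its output changes on every run. As a reproducible sketch of how `data`, `index` and `columns` fit together, the example below uses fixed, made-up values.

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones: 2 rows by 3 columns.
values = np.array([[1, 2, 3],
                   [4, 5, 6]])

frame = pd.DataFrame(data=values,
                     index=['row1', 'row2'],
                     columns=['col1', 'col2', 'col3'])

print(frame.shape)            # (number of rows, number of columns)
print(list(frame.index))      # row labels
print(list(frame.columns))    # column labels
```

The shape of `data` has to agree with the number of row and column labels.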


### Selection and Indexing

In [17]:
dataFrame = pd.DataFrame(data=np.random.randint(1, 51, (4, 3)), index=['row1', 'row2', 'row3', 'row4'], columns=['col1', 'col2', 'col3'])
dataFrame

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 48   | 3    |
| row2 | 33   | 33   | 28   |
| row3 | 19   | 20   | 24   |
| row4 | 31   | 43   | 3    |


**Selecting a single column**

In [18]:
dataFrame['col1']

row1     1
row2    33
row3    19
row4    31
Name: col1, dtype: int64

**Selecting multiple columns**

In [19]:
dataFrame[['col1', 'col2']]

|      | col1 | col2 |
|------|------|------|
| row1 | 1    | 48   |
| row2 | 33   | 33   |
| row3 | 19   | 20   |
| row4 | 31   | 43   |


**Creation of new columns using arithmetic operators**

In [20]:
dataFrame['newCol1'] = dataFrame['col3'] - dataFrame['col2']
dataFrame

|      | col1 | col2 | col3 | newCol1 |
|------|------|------|------|---------|
| row1 | 1    | 48   | 3    | -45     |
| row2 | 33   | 33   | 28   | -5      |
| row3 | 19   | 20   | 24   | 4       |
| row4 | 31   | 43   | 3    | -40     |


In [21]:
dataFrame['newCol2'] = dataFrame['col1'] * dataFrame['col3']
dataFrame

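|      | col1 | col2 | col3 | newCol1 | newCol2 |
|------|------|------|------|---------|---------|
| row1 | 1    | 48   | 3    | -45     | 3       |
| row2 | 33   | 33   | 28   | -5      | 924     |
| row3 | 19   | 20   | 24   | 4       | 456     |
| row4 | 31   | 43   | 3    | -40     | 93      |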


**Removal of columns**

In [22]:
# axis -> 0 means that we are targeting the rows
# axis -> 1 means that we are targeting the columns
dataFrame.drop('newCol1', axis=1)

|      | col1 | col2 | col3 | newCol2 |
|------|------|------|------|---------|
| row1 | 1    | 48   | 3    | 3       |
| row2 | 33   | 33   | 28   | 924     |
| row3 | 19   | 20   | 24   | 456     |
| row4 | 31   | 43   | 3    | 93      |


In [23]:
# The column is still there: drop() returned a new DataFrame and left dataFrame unchanged
dataFrame

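|      | col1 | col2 | col3 | newCol1 | newCol2 |
|------|------|------|------|---------|---------|
| row1 | 1    | 48   | 3    | -45     | 3       |
| row2 | 33   | 33   | 28   | -5      | 924     |
| row3 | 19   | 20   | 24   | 4       | 456     |
| row4 | 31   | 43   | 3    | -40     | 93      |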


In [24]:
# By default drop() returns a new DataFrame, which protects us from dropping columns by accident
# To remove the column from dataFrame itself, pass inplace=True
dataFrame.drop('newCol1', axis=1, inplace=True)
dataFrame

|      | col1 | col2 | col3 | newCol2 |
|------|------|------|------|---------|
| row1 | 1    | 48   | 3    | 3       |
| row2 | 33   | 33   | 28   | 924     |
| row3 | 19   | 20   | 24   | 456     |
| row4 | 31   | 43   | 3    | 93      |


In [25]:
dataFrame.drop('newCol2', axis=1, inplace=True)
dataFrame

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 48   | 3    |
| row2 | 33   | 33   | 28   |
| row3 | 19   | 20   | 24   |
| row4 | 31   | 43   | 3    |
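
An alternative to `inplace=True` is to keep the frame that `drop()` returns. The sketch below uses the `columns=` keyword rather than `axis=1`; the frame and the column name `extra` are hypothetical.

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 33], 'col2': [48, 33], 'extra': [0, 0]},
                     index=['row1', 'row2'])

# drop() returns a new DataFrame; binding the result keeps the change.
trimmed = frame.drop(columns=['extra'])

print(list(frame.columns))     # the original still has 'extra'
print(list(trimmed.columns))   # the returned frame does not
```

Keeping the original frame untouched can make it easier to rerun notebook cells out of order.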


**Selecting a single Row**

In [26]:
dataFrame.loc['row1']

col1     1
col2    48
col3     3
Name: row1, dtype: int64

**Selecting multiple rows**

In [27]:
dataFrame.loc[['row1', 'row2']]

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 48   | 3    |
| row2 | 33   | 33   | 28   |


**Selecting rows based on their integer position**

In [28]:
dataFrame.iloc[1]

col1    33
col2    33
col3    28
Name: row2, dtype: int64
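
One difference between `loc` and `iloc` shows up when slicing: a label slice includes its end label, while a position slice excludes its end position, as with Python lists. A minimal sketch with made-up data:

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 33, 19, 31]},
                     index=['row1', 'row2', 'row3', 'row4'])

by_label = frame.loc['row1':'row3']   # label slice: 'row3' is included
by_position = frame.iloc[0:3]         # position slice: position 3 is excluded

print(list(by_label.index))
print(list(by_position.index))
```

Both slices select the first three rows here.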

**Removal of rows**

In [29]:
dataFrame.drop('row1', axis=0)

|      | col1 | col2 | col3 |
|------|------|------|------|
| row2 | 33   | 33   | 28   |
| row3 | 19   | 20   | 24   |
| row4 | 31   | 43   | 3    |


In [30]:
# Again, drop() returned a new DataFrame and the original is unchanged
dataFrame

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 48   | 3    |
| row2 | 33   | 33   | 28   |
| row3 | 19   | 20   | 24   |
| row4 | 31   | 43   | 3    |


In [31]:
# To drop the row from dataFrame itself, pass inplace=True:
# dataFrame.drop('row1', axis=0, inplace=True)
# dataFrame

**Selecting both columns and rows**

In [32]:
dataFrame.loc['row1', 'col2']

48

In [33]:
dataFrame.loc[['row1', 'row2', 'row3'],['col2', 'col3']]

|      | col2 | col3 |
|------|------|------|
| row1 | 48   | 3    |
| row2 | 33   | 28   |
| row3 | 20   | 24   |


In [34]:
dataFrame.iloc[0,1]

48

In [35]:
dataFrame.iloc[[0,1]]

|      | col1 | col2 | col3 |
|------|------|------|------|
| row1 | 1    | 48   | 3    |
| row2 | 33   | 33   | 28   |
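
`iloc` also accepts a list of positions on each axis, mirroring the `loc` call with two label lists shown earlier. The sketch below uses made-up data and additionally shows a bare colon, which selects everything along an axis.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [1, 33, 19], 'col2': [48, 33, 20], 'col3': [3, 28, 24]},
    index=['row1', 'row2', 'row3'],
)

# Rows at positions 0 and 2, columns at positions 1 and 2.
subset = frame.iloc[[0, 2], [1, 2]]
print(subset)

# A bare colon keeps every row; 0 picks the first column.
first_column = frame.iloc[:, 0]
print(first_column.tolist())
```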


### Conditional Selection
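
As a minimal sketch of what conditional selection usually looks like in pandas, a comparison on a column produces a boolean Series, and that Series can be used to filter rows. The frame and thresholds below are made up and are not the notebook's own example.

```python
import pandas as pd

frame = pd.DataFrame(
    {'col1': [1, 33, 19, 31], 'col2': [48, 33, 20, 43]},
    index=['row1', 'row2', 'row3', 'row4'],
)

mask = frame['col1'] > 20     # boolean Series, one entry per row
selected = frame[mask]        # keeps only the rows where mask is True
print(selected)

# Conditions are combined with & (and) or | (or); each condition
# needs its own parentheses.
both = frame[(frame['col1'] > 20) & (frame['col2'] > 40)]
print(both)
```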

**Note**: This notebook is not complete; more content will be added soon.