## Pandas Part-1
- Pandas Dataframe
- Pandas Series
- Pandas Basic Operations

## Pandas
Pandas is a Python library for working with data sets.
It provides functions for analyzing, cleaning, exploring, and manipulating data.

Pandas lets us analyze large data sets and draw conclusions from them using statistical methods.

Pandas can also clean messy data sets, making them readable and relevant.

In [3]:
import pandas as pd
import numpy as np

In [5]:
np.arange(0,20).reshape(5,4)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])

#### What is a data frame?

A **DataFrame is a data structure** that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

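**A DataFrame has no minimum size: it can hold a single row, a single column, or even no data at all. What makes it a DataFrame is that it is two-dimensional, with a row index and column labels.**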
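
To make the "2-dimensional table" idea concrete, here is a minimal sketch that builds a small DataFrame from a dictionary and inspects its dimensions. The name `scores` and the sample values are hypothetical and are used only for illustration.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column label.
scores = pd.DataFrame(
    {"math": [90, 75, 82], "physics": [88, 64, 79]},
    index=["alice", "bob", "carol"],
)

print(scores.shape)          # (number of rows, number of columns)
print(scores.ndim)           # number of dimensions
print(list(scores.index))    # row labels
print(list(scores.columns))  # column labels
```

`shape` and `ndim` are a quick way to confirm that an object is a table of rows and columns.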

In [8]:
## Create Dataframe 
df = pd.DataFrame(data = np.arange(0,20).reshape(5,4), index = ['Row1', 'Row2', 'Row3', 'Row4', 'Row5'], 
             columns=['Column1', 'Column2', 'Column3', 'Column4'])

In [10]:
# By default, head() shows the first 5 rows
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [12]:
# By default, tail() shows the last 5 rows (df has only 5 rows, so the output matches head())
df.tail()

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19
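
Because this DataFrame has only five rows, `head()` and `tail()` return the same output. Both methods accept a row count, which makes the difference visible. A small sketch using the same data as above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

first_two = df.head(2)  # first 2 rows
last_two = df.tail(2)   # last 2 rows

print(first_two)
print(last_two)
```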


In [13]:
type(df)

pandas.core.frame.DataFrame

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5 entries, Row1 to Row5
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   Column1  5 non-null      int32
 1   Column2  5 non-null      int32
 2   Column3  5 non-null      int32
 3   Column4  5 non-null      int32
dtypes: int32(4)
memory usage: 120.0+ bytes


In [15]:
df.describe()

Unnamed: 0,Column1,Column2,Column3,Column4
count,5.0,5.0,5.0,5.0
mean,8.0,9.0,10.0,11.0
std,6.324555,6.324555,6.324555,6.324555
min,0.0,1.0,2.0,3.0
25%,4.0,5.0,6.0,7.0
50%,8.0,9.0,10.0,11.0
75%,12.0,13.0,14.0,15.0
max,16.0,17.0,18.0,19.0
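
The rows of `describe()` can also be computed one at a time, which helps when reading the table. The sketch below recomputes a few of them for `Column1`. Note that `std` is the sample standard deviation (it divides by n - 1), which is why it is about 6.32 here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

col = df["Column1"]
summary = df.describe()

print(col.count())         # number of non-null values
print(col.mean())          # arithmetic mean
print(col.std())           # sample standard deviation (ddof=1)
print(col.quantile(0.25))  # the "25%" row
print(summary.loc["mean", "Column1"])  # same mean, read from the summary table
```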


In [18]:
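## Indexing
## Three ways to select data:
##   df['name']                  select a column by name
##   df.loc[row_label]           select rows by label
##   df.iloc[row_pos, col_pos]   select by integer position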
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [30]:
## By using columnname
df['Column1']


Row1     0
Row2     4
Row3     8
Row4    12
Row5    16
Name: Column1, dtype: int32

In [29]:
## To select more than one column, pass a list of column names (hence the double brackets)
df[['Column1', 'Column1', 'Column1']]


Unnamed: 0,Column1,Column1,Column1
Row1,0,0,0
Row2,4,4,4
Row3,8,8,8
Row4,12,12,12
Row5,16,16,16


In [31]:
print(type(df['Column1']))
print(type(df[['Column1', 'Column1', 'Column1']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


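#### Difference between Series and DataFrame

- A Series is one-dimensional: a single labeled column of values (or a single row extracted from a DataFrame).
- A DataFrame is two-dimensional: a table of rows and columns, where each column is itself a Series.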
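
The distinction can be checked with `ndim` and `shape`. A minimal sketch on the same data: selecting with a single label returns a Series, while selecting with a list returns a DataFrame, even when the list names only one column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

one_col_series = df["Column1"]   # single label -> Series
one_col_frame = df[["Column1"]]  # list of labels -> DataFrame
one_row_series = df.loc["Row3"]  # single row label -> Series

print(type(one_col_series).__name__, one_col_series.ndim, one_col_series.shape)
print(type(one_col_frame).__name__, one_col_frame.ndim, one_col_frame.shape)
print(type(one_row_series).__name__, one_row_series.ndim, one_row_series.shape)
```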

In [33]:
## By using Rowname
df.loc['Row3']

Column1     8
Column2     9
Column3    10
Column4    11
Name: Row3, dtype: int32

In [34]:
## To select more than one row, pass a list of row labels (hence the double brackets)
df.loc[['Row3', 'Row4']]

Unnamed: 0,Column1,Column2,Column3,Column4
Row3,8,9,10,11
Row4,12,13,14,15
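
`.loc` also accepts a column selector as a second argument, so rows and columns can be picked by label in one step. A short sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

# One row label and one column label -> a single value
cell = df.loc["Row3", "Column2"]

# Lists of row and column labels -> a smaller DataFrame
corners = df.loc[["Row1", "Row5"], ["Column1", "Column4"]]

print(cell)
print(corners)
```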


In [35]:
df.head()

Unnamed: 0,Column1,Column2,Column3,Column4
Row1,0,1,2,3
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19


In [38]:
df.iloc[2:4,0:2]

Unnamed: 0,Column1,Column2
Row3,8,9
Row4,12,13


In [42]:
df.iloc[2:,1:]

Unnamed: 0,Column2,Column3,Column4
Row3,9,10,11
Row4,13,14,15
Row5,17,18,19
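
One detail worth checking when slicing: `.iloc` slices by position and excludes the stop position, while `.loc` slices by label and includes the stop label. The sketch below selects the same block both ways.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

# Positions 2 and 3 (stop position 4 is excluded)
by_position = df.iloc[2:4, 0:2]

# Labels Row3 through Row4 (stop label is included)
by_label = df.loc["Row3":"Row4", "Column1":"Column2"]

print(by_position)
print(by_position.equals(by_label))
```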


In [47]:
df[['Column1','Column4']]

Unnamed: 0,Column1,Column4
Row1,0,3
Row2,4,7
Row3,8,11
Row4,12,15
Row5,16,19


In [59]:
df.iloc[0:4:3, 0:4:3]
## Slice syntax is start : stop : step (stop is exclusive)

Unnamed: 0,Column1,Column4
Row1,0,3
Row4,12,15
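
A stepped slice is shorthand for a regular pattern of positions. As a sketch, `0:4:3` picks positions 0 and 3, so passing those positions as explicit lists gives the same result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

stepped = df.iloc[0:4:3, 0:4:3]      # start 0, stop 4 (exclusive), step 3
explicit = df.iloc[[0, 3], [0, 3]]   # the same positions written out
every_other_row = df.iloc[::2]       # omitted start/stop cover the whole axis

print(stepped.equals(explicit))
print(every_other_row)
```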


In [61]:
## Convert a DataFrame into a NumPy array
df.iloc[:,1:].values

array([[ 1,  2,  3],
       [ 5,  6,  7],
       [ 9, 10, 11],
       [13, 14, 15],
       [17, 18, 19]])
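
The pandas documentation recommends `to_numpy()` over the `.values` attribute for this conversion. For a DataFrame like this one, both return the same array, as the sketch below checks.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

arr_values = df.iloc[:, 1:].values
arr_numpy = df.iloc[:, 1:].to_numpy()

print(type(arr_numpy).__name__, arr_numpy.shape)
print(np.array_equal(arr_values, arr_numpy))
```

Row and column labels are dropped in the conversion; only the values remain.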

In [63]:
## Basic Operations
df.isnull().sum()
## Counts the null values in each column

Column1    0
Column2    0
Column3    0
Column4    0
dtype: int64

In [67]:
df1 = pd.DataFrame(data = [[1,np.nan,2],[1,3,4]] , index = ['Row1', 'Row2'], 
             columns=['Column1', 'Column2', 'Column3'])
## np.nan represents a missing (null) value

In [66]:
df1

Unnamed: 0,Column1,Column2,Column3
Row1,1,NaN,2
Row2,1,3.0,4


In [68]:
df1.isnull().sum()

Column1    0
Column2    1
Column3    0
dtype: int64

In [69]:
df1.isnull()

Unnamed: 0,Column1,Column2,Column3
Row1,False,True,False
Row2,False,False,False


In [70]:
df1.isnull().sum() == 0

Column1     True
Column2    False
Column3     True
dtype: bool
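
The boolean table from `isnull()` can be reduced in other ways as well. The sketch below, using the same `df1`, computes the total number of nulls and lists which columns and rows contain at least one. The variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    data=[[1, np.nan, 2], [1, 3, 4]],
    index=["Row1", "Row2"],
    columns=["Column1", "Column2", "Column3"],
)

# Sum per column, then sum those counts -> total nulls in the DataFrame
total_nulls = int(df1.isnull().sum().sum())

# any() is True for a column (or a row, with axis=1) holding at least one null
null_by_column = df1.isnull().any()
null_by_row = df1.isnull().any(axis=1)

cols_with_nulls = null_by_column[null_by_column].index.tolist()
rows_with_nulls = null_by_row[null_by_row].index.tolist()

print(total_nulls)
print(cols_with_nulls)
print(rows_with_nulls)
```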

In [72]:
df1['Column3'].value_counts()

2    1
4    1
Name: Column3, dtype: int64

In [74]:
df1['Column3'].unique()

array([2, 4], dtype=int64)
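
`value_counts()` and `unique()` are easier to read on data with repeated values. A small sketch with a hypothetical Series named `colors`:

```python
import pandas as pd

# Hypothetical sample data with repeated values
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colors.value_counts()  # occurrences per value, most frequent first
distinct = list(colors.unique())  # distinct values, in order of first appearance
n_distinct = colors.nunique()     # number of distinct values

print(counts)
print(distinct)
print(n_distinct)
```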

In [77]:
df[df['Column2']>2]

Unnamed: 0,Column1,Column2,Column3,Column4
Row2,4,5,6,7
Row3,8,9,10,11
Row4,12,13,14,15
Row5,16,17,18,19
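
The filter above works in two steps: the comparison produces a boolean Series, and indexing with that Series keeps only the rows marked `True`. The sketch below separates the steps and shows how to combine two conditions with `&`. Each condition needs its own parentheses.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(0, 20).reshape(5, 4),
    index=["Row1", "Row2", "Row3", "Row4", "Row5"],
    columns=["Column1", "Column2", "Column3", "Column4"],
)

mask = df["Column2"] > 2  # boolean Series, one value per row
filtered = df[mask]       # keeps rows where the mask is True

# Combine conditions with & (and) or | (or), wrapping each in parentheses
both = df[(df["Column2"] > 2) & (df["Column4"] < 19)]

print(mask)
print(filtered)
print(both)
```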
