<img src='./images/Pandas_logo.svg.png'>

``Pandas`` is an open source Python library for data analysis. It gives Python the
ability to work with spreadsheet-like data for fast data loading, manipulating,
aligning, merging, etc. To give Python these enhanced features, Pandas
introduces two new data types to Python: ``Series`` and ``DataFrame``. The
DataFrame will represent your entire spreadsheet or rectangular data, whereas
the Series is a single column of the DataFrame. A Pandas DataFrame can also
be thought of as a dictionary or collection of Series.
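
The relationship between the two types can be sketched with a small made-up table. The column names and values below are illustrative assumptions, not data from this notebook.

```python
import pandas as pd

# Hypothetical two-column table, made up for illustration.
table = pd.DataFrame({"city": ["Chennai", "Agartala"], "rainfall_mm": [1400, 2100]})

# Selecting one column returns a Series that shares the DataFrame's row index.
column = table["rainfall_mm"]
print(type(table).__name__)   # DataFrame
print(type(column).__name__)  # Series
print(column.index.equals(table.index))
```

Each column behaves like a Series on its own, which is why a DataFrame can be read as a collection of Series keyed by column name.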

NumPy and its ndarray object provide efficient storage and manipulation
of dense typed arrays in Python. Pandas is a package built on top of NumPy, and provides an
efficient implementation of a DataFrame. DataFrames are essentially multidimensional
arrays with attached row and column labels, and often with heterogeneous
types and/or missing data. As well as offering a convenient storage interface for
labeled data, Pandas implements a number of powerful data operations familiar to
users of both database frameworks and spreadsheet programs.

NumPy’s ndarray data structure provides essential features for the type of
clean, well-organized data typically seen in numerical computing tasks. While it
serves this purpose very well, its limitations become clear when we need more flexibility
(attaching labels to data, working with missing data, etc.) and when attempting
operations that do not map well to element-wise broadcasting (groupings, pivots,
etc.), each of which is an important piece of analyzing the less structured data available
in many forms in the world around us. Pandas, and in particular its Series and
DataFrame objects, builds on the NumPy array structure and provides efficient access
to these sorts of “data munging” tasks that occupy much of a data scientist’s time.
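
As a rough illustration of the kind of task mentioned above (labels, missing data, and a grouping), here is a minimal sketch. The `readings` table and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Made-up measurements with a label column and one missing value.
readings = pd.DataFrame({
    "sensor": ["a", "b", "a", "b"],
    "value": [1.0, np.nan, 3.0, 5.0],
})

# Group rows by label and average each group; missing values are skipped by default.
means = readings.groupby("sensor")["value"].mean()
print(means)
```

Doing the same with a plain ndarray would need manual bookkeeping for the labels and for the missing entry.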

#### Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.
In this notebook we shall learn about the pandas data structures and the basic operations on them.

In [None]:
! pip install pandas

In [2]:
## Let us start using the pandas library
## To use the library we import pandas as follows
## pandas is the library name; in this jupyter notebook its alias will be pd
import pandas as pd 

In [3]:
## To check the installed version of pandas, read the __version__ attribute
## The pandas version used here is 1.5.3
pd.__version__

'1.5.3'

##### Pandas Data Structures
<b>Series</b> : </br>
A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index.

In [4]:
### Building a Series object - 
s0 = pd.Series([24, 27, -25, 23])

In [5]:
## The index starts at 0 and ends at 3. The values are 24, 27, -25, and 23
## In general, when no index is given, a default one is created consisting of the integers 0 through N - 1 (where N is the length of the data)
s0

0    24
1    27
2   -25
3    23
dtype: int64

In [7]:
### The values are returned as a NumPy array. The datatype of the values is int64.
s0.values

array([ 24,  27, -25,  23], dtype=int64)

In [8]:
### For the series object we can retrieve the index in the following way.
s0.index

RangeIndex(start=0, stop=4, step=1)

In [6]:
#### Alternatively, the series object can be created with index labels other than the default values.
s2 = pd.Series([24, 27, -25, 23], index=['a1', 'a2', 'a3', 'a4'])
s2

a1    24
a2    27
a3   -25
a4    23
dtype: int64

In [7]:
s2.index

Index(['a1', 'a2', 'a3', 'a4'], dtype='object')

In [8]:
### Values can be retrieved based on the index labels
print('a1\'s value is ' + str(s2['a1']))
print('a2\'s value is ' + str(s2['a2']))

a1's value is 24
a2's value is 27


In [9]:
### Reassigning a value based on its index label
s2['a4'] = 28

In [10]:
s2

a1    24
a2    27
a3   -25
a4    28
dtype: int64

In [18]:
### we can use values in the index when selecting single values or a set of values:
s2[['a3', 'a1', 'a4']]

a3   -25
a1    24
a4    28
dtype: int64

In [19]:
s2

a1    24
a2    27
a3   -25
a4    28
dtype: int64

In [20]:
### Filtering with a boolean array, scalar multiplication, or applying math functions will preserve the index-value link
print(s2[s2 > 0])
print(s2 * 2)

a1    24
a2    27
a4    28
dtype: int64
a1    48
a2    54
a3   -50
a4    56
dtype: int64


In [21]:
s2

a1    24
a2    27
a3   -25
a4    28
dtype: int64

In [22]:
import numpy as np
np.exp(s2)

a1    2.648912e+10
a2    5.320482e+11
a3    1.388794e-11
a4    1.446257e+12
dtype: float64
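
The index-value link also governs arithmetic between two Series: values are matched by label, not by position. A minimal sketch, with two made-up Series (`left` and `right` are hypothetical names):

```python
import pandas as pd

# Two made-up Series whose labels overlap only partly and appear in different order.
left = pd.Series([1, 2, 3], index=["x", "y", "z"])
right = pd.Series([10, 20, 30], index=["z", "y", "w"])

# Addition pairs up values that share a label; labels present on only one side give NaN.
total = left + right
print(total)
```

Here "y" and "z" get sums, while "x" and "w" become NaN because each exists in only one of the inputs.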

In [23]:
#### A Series can be seen as a mapping of index values to data values, so `in` checks the index labels
'b' in s2

False

In [24]:
'a2' in s2

True

In [25]:
#### We can also construct a Series from a Python dictionary
sdata = {'TN': 35000, 'TR': 71000, 'AN': 16000, 'UP': 5000}
diSer = pd.Series(sdata)

In [26]:
## Display the data
diSer

TN    35000
TR    71000
AN    16000
UP     5000
dtype: int64

In [27]:
## We pass an explicit index: labels missing from the dict get NaN, and dict keys not in the index are dropped
states = ['TN','AP','AN','KA']
s4 = pd.Series(sdata,index=states)
s4

TN    35000.0
AP        NaN
AN    16000.0
KA        NaN
dtype: float64

In [28]:
## Check which values are not null for each index label
pd.notnull(s4)

TN     True
AP    False
AN     True
KA    False
dtype: bool

In [29]:
## Check which values are null for each index label
pd.isnull(s4)

TN    False
AP     True
AN    False
KA     True
dtype: bool
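
To get an actual count of missing and present values, the boolean result can be summed. The sketch below rebuilds a Series with the same values as `s4` so that it runs on its own. `isna`, `notna` and `count` are the method forms available on a Series.

```python
import numpy as np
import pandas as pd

# Same values as s4 above, rebuilt here so the example is self-contained.
s = pd.Series([35000.0, np.nan, 16000.0, np.nan], index=["TN", "AP", "AN", "KA"])

# True counts as 1 when summed, so this gives the number of missing values.
missing_count = int(s.isna().sum())
# count() returns the number of non-missing values.
present_count = int(s.count())
print(missing_count, present_count)
```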

In [30]:
### The following series object has no nulls
diSer

TN    35000
TR    71000
AN    16000
UP     5000
dtype: int64

In [31]:
### The following series object has null values
s4

TN    35000.0
AP        NaN
AN    16000.0
KA        NaN
dtype: float64

In [32]:
### Both the Series object and its index have a name attribute, which integrates with other key areas of pandas functionality
s4.name = "population"
s4.index.name = "state"
s4

state
TN    35000.0
AP        NaN
AN    16000.0
KA        NaN
Name: population, dtype: float64

In [33]:
### Series index can be altered by assignment
s4.index = ['a', 'b', 'c', 'd']
s4

a    35000.0
b        NaN
c    16000.0
d        NaN
Name: population, dtype: float64

<b>Series</b></br>
<b>DataFrame</b> : </br> A DataFrame represents a tabular, spreadsheet-like data structure containing an
ordered collection of columns, each of which can be a different value type (numeric,
string, boolean, etc.). The DataFrame has both a row and column index; it can be
thought of as a dict of Series (all sharing the same index).
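
The dict-of-Series view can be tried directly by passing a dict of Series to the constructor. In this sketch the two Series are made up and deliberately have different labels, to show that pandas aligns them onto one shared row index.

```python
import pandas as pd

# Two made-up Series with partly different labels.
score = pd.Series([90, 80], index=["one", "two"])
age = pd.Series([18, 20], index=["two", "three"])

# Each dict key becomes a column; rows are the union of the labels.
frame = pd.DataFrame({"score": score, "age": age})
print(frame)
```

Cells with no value in the source Series are filled with NaN, so both columns end up on the same three-row index.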

In [6]:
### A dataframe is constructed using the following.
data={ "name":["Babllo","Tinku","Tina","Janardhan","Abhi","Vanita","Kalia"],      
       "score":[90,80,85,75,95,60,65],      
       "subject":["kusti","dancing","singing","Swimming","Tennis",
               "Karete","Surfing"],      
       "gender":["M","M","M","M","F","F","F"]
     }

In [7]:
df=pd.DataFrame(data)
df

Unnamed: 0,name,score,subject,gender
0,Babllo,90,kusti,M
1,Tinku,80,dancing,M
2,Tina,85,singing,M
3,Janardhan,75,Swimming,M
4,Abhi,95,Tennis,F
5,Vanita,60,Karete,F
6,Kalia,65,Surfing,F


In [8]:
df=pd.DataFrame(data,columns=["name","subject","gender","score"])
df

Unnamed: 0,name,subject,gender,score
0,Babllo,kusti,M,90
1,Tinku,dancing,M,80
2,Tina,singing,M,85
3,Janardhan,Swimming,M,75
4,Abhi,Tennis,F,95
5,Vanita,Karete,F,60
6,Kalia,Surfing,F,65


- Displaying the first rows of the dataframe using head

In [9]:
df.head()

Unnamed: 0,name,subject,gender,score
0,Babllo,kusti,M,90
1,Tinku,dancing,M,80
2,Tina,singing,M,85
3,Janardhan,Swimming,M,75
4,Abhi,Tennis,F,95


- Listing the last rows of the dataframe using tail

In [10]:
df.tail()

Unnamed: 0,name,subject,gender,score
2,Tina,singing,M,85
3,Janardhan,Swimming,M,75
4,Abhi,Tennis,F,95
5,Vanita,Karete,F,60
6,Kalia,Surfing,F,65


In [11]:
df.tail(3)

Unnamed: 0,name,subject,gender,score
4,Abhi,Tennis,F,95
5,Vanita,Karete,F,60
6,Kalia,Surfing,F,65


In [14]:
df.head(n=10)

Unnamed: 0,name,subject,gender,score
0,Babllo,kusti,M,90
1,Tinku,dancing,M,80
2,Tina,singing,M,85
3,Janardhan,Swimming,M,75
4,Abhi,Tennis,F,95
5,Vanita,Karete,F,60
6,Kalia,Surfing,F,65


- Adding columns to the dataframe that are not present in the data

In [45]:
df=pd.DataFrame(data,columns=["name", "subject", "area", "score", "age"]) ## "area" and "age" are not keys in data, so these columns are filled with NaN
df

Unnamed: 0,name,subject,area,score,age
0,Babllo,kusti,,90,
1,Tinku,dancing,,80,
2,Tina,singing,,85,
3,Janardhan,Swimming,,75,
4,Abhi,Tennis,,95,
5,Vanita,Karete,,60,
6,Kalia,Surfing,,65,


- Setting the column names and the row index labels of the dataframe

In [46]:
df=pd.DataFrame(data,columns=["name", "subject", "gender", "score", "age"],
                index=["one","two","three","four","five","six","seven"])
df

Unnamed: 0,name,subject,gender,score,age
one,Babllo,kusti,M,90,
two,Tinku,dancing,M,80,
three,Tina,singing,M,85,
four,Janardhan,Swimming,M,75,
five,Abhi,Tennis,F,95,
six,Vanita,Karete,F,60,
seven,Kalia,Surfing,F,65,


In [47]:
df['subject']

one         kusti
two       dancing
three     singing
four     Swimming
five       Tennis
six        Karete
seven     Surfing
Name: subject, dtype: object

In [48]:
### Select several columns by passing a list of column names
columns=["name","subject"]
df[["name","subject"]]  # the same selection written inline
df[columns]

Unnamed: 0,name,subject
one,Babllo,kusti
two,Tinku,dancing
three,Tina,singing
four,Janardhan,Swimming
five,Abhi,Tennis
six,Vanita,Karete
seven,Kalia,Surfing


In [49]:
df.subject

one         kusti
two       dancing
three     singing
four     Swimming
five       Tennis
six        Karete
seven     Surfing
Name: subject, dtype: object

- **loc** The ``.loc[]`` indexer is primarily label based, but it can also be used with a boolean array, for example one produced by a condition on a column.

In [56]:
### ``.loc[]`` is primarily label based, but may also be used with a boolean array.
df.loc[["one"]]

Unnamed: 0,name,subject,gender,score,age
one,Babllo,kusti,M,90,


In [57]:
### We can access multiple rows in this fashion
df.loc[["one","two"]]

Unnamed: 0,name,subject,gender,score,age
one,Babllo,kusti,M,90,
two,Tinku,dancing,M,80,


In [50]:
### Combine conditions with & (each one in parentheses) to filter rows
df.loc[(df['gender']=='M') & (df['score']==90)].head()

Unnamed: 0,name,subject,gender,score,age
one,Babllo,kusti,M,90,
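
``.loc[]`` also accepts a row selector and a column selector together. The following is a minimal sketch on a three-row subset of the data above, rebuilt here so it runs on its own. `people` is a hypothetical name.

```python
import pandas as pd

# Small subset of the notebook's data, rebuilt for a self-contained example.
people = pd.DataFrame(
    {"name": ["Babllo", "Tinku", "Tina"], "score": [90, 80, 85]},
    index=["one", "two", "three"],
)

# A label slice includes both endpoints, so rows "one" and "two" are returned.
subset = people.loc["one":"two", ["name"]]
# A single row label and a single column label return one scalar value.
single = people.loc["three", "score"]
# A boolean condition selects rows; the second argument picks the column.
high = people.loc[people["score"] > 80, "name"]
print(subset, single, list(high), sep="\n")
```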


In [51]:
df["age"]=18
df

Unnamed: 0,name,subject,gender,score,age
one,Babllo,kusti,M,90,18
two,Tinku,dancing,M,80,18
three,Tina,singing,M,85,18
four,Janardhan,Swimming,M,75,18
five,Abhi,Tennis,F,95,18
six,Vanita,Karete,F,60,18
seven,Kalia,Surfing,F,65,18


In [52]:
### Assigning a list of values to a column (one value per row).
df=pd.DataFrame(data,columns=["name", "subject", "gender", "score", "age"], 
                index=["one","two","three","four","five","six","seven"])
values=[18,19,20,18,17,17,18]
df["age"]=values
df

Unnamed: 0,name,subject,gender,score,age
one,Babllo,kusti,M,90,18
two,Tinku,dancing,M,80,19
three,Tina,singing,M,85,20
four,Janardhan,Swimming,M,75,18
five,Abhi,Tennis,F,95,17
six,Vanita,Karete,F,60,17
seven,Kalia,Surfing,F,65,18


In [53]:
### Create a new boolean column from a comparison on an existing column
df["pass"]=df.score>=70
df

Unnamed: 0,name,subject,gender,score,age,pass
one,Babllo,kusti,M,90,18,True
two,Tinku,dancing,M,80,19,True
three,Tina,singing,M,85,20,True
four,Janardhan,Swimming,M,75,18,True
five,Abhi,Tennis,F,95,17,True
six,Vanita,Karete,F,60,17,False
seven,Kalia,Surfing,F,65,18,False


In [54]:
### Remove a column in place with del
del df["pass"]
df

Unnamed: 0,name,subject,gender,score,age
one,Babllo,kusti,M,90,18
two,Tinku,dancing,M,80,19
three,Tina,singing,M,85,20
four,Janardhan,Swimming,M,75,18
five,Abhi,Tennis,F,95,17
six,Vanita,Karete,F,60,17
seven,Kalia,Surfing,F,65,18
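
``del`` changes the dataframe in place. When the original should stay untouched, ``drop`` returns a new dataframe instead. A minimal sketch with a made-up two-row table (`marks` is a hypothetical name):

```python
import pandas as pd

# Made-up two-row table with a derived boolean column.
marks = pd.DataFrame({"name": ["Abhi", "Vanita"], "score": [95, 60]})
marks["pass"] = marks["score"] >= 70

# drop() returns a copy without the column; marks itself is not modified.
trimmed = marks.drop(columns="pass")
print(list(marks.columns))
print(list(trimmed.columns))
```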


- **iloc** - The ``.iloc[]`` indexer is integer position based, but it can also be used with a boolean array

In [55]:
df.iloc[0,0]

'Babllo'

In [58]:
df.iloc[:2,:3]

Unnamed: 0,name,subject,gender
one,Babllo,kusti,M
two,Tinku,dancing,M
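
One difference worth checking by hand is how slices end: a position slice in ``.iloc[]`` excludes the stop position, while a label slice in ``.loc[]`` includes the stop label. The sketch below uses a rebuilt three-row subset of the data. Converting the condition to a NumPy array before passing it to ``.iloc[]`` is a choice made for this example.

```python
import pandas as pd

# Small subset of the notebook's data, rebuilt for a self-contained example.
frame = pd.DataFrame(
    {"name": ["Babllo", "Tinku", "Tina"], "score": [90, 80, 85]},
    index=["one", "two", "three"],
)

by_position = frame.iloc[0:2]       # positions 0 and 1; the stop position is excluded
by_label = frame.loc["one":"two"]   # labels "one" and "two"; the stop label is included
same = by_position.equals(by_label)

# iloc takes a plain boolean array as a row mask.
mask = (frame["score"] > 80).to_numpy()
picked = frame.iloc[mask]
print(same)
print(picked)
```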


**Exercise :**
   - Build a dataframe where the column names are subject names (Maths, Physics, Chemistry etc)
   - Rows shall contain the names of the students and each cell should have marks

In [63]:
### Build a dataframe
scores={"Math":{"A":85,"B":90,"C":95}, "Physics":{"A":90,"B":80,"C":75}}
scores_df=pd.DataFrame(scores)
scores_df

Unnamed: 0,Math,Physics
A,85,90
B,90,80
C,95,75


In [64]:
### Transpose the dataframe (swap rows and columns)
scores_df.T

Unnamed: 0,A,B,C
Math,85,90,95
Physics,90,80,75


In [65]:
### Give a name to the row index and to the column index
scores_df.index.name="name"
scores_df.columns.name="lesson"

In [67]:
scores_df

lesson,Math,Physics
name,Unnamed: 1_level_1,Unnamed: 2_level_1
A,85,90
B,90,80
C,95,75


In [68]:
scores_df.values

array([[85, 90],
       [90, 80],
       [95, 75]], dtype=int64)
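
The same 2D array is also available through ``to_numpy()``. The sketch below compares an all-numeric table with one that mixes strings and numbers. Both tables are made up for the example.

```python
import pandas as pd

# All-integer columns: the resulting array keeps an integer dtype.
numeric = pd.DataFrame({"Math": [85, 90], "Physics": [90, 80]})
# Mixed string and integer columns: the array falls back to the generic object dtype.
mixed = pd.DataFrame({"name": ["A", "B"], "Math": [85, 90]})

numeric_array = numeric.to_numpy()
mixed_array = mixed.to_numpy()
print(numeric_array.dtype)
print(mixed_array.dtype)
```

When the columns do not share one type, the array has to hold general Python objects, which gives up most of the speed advantage of a typed ndarray.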