# A Brief Intro to Pandas

### What does pandas do?

- Loads data from files into a common working environment. It can read formats such as .csv, .xlsx, HDF5 (.h5), SQL databases, etc.
- Provides data structures that have many built-in methods for interacting with data.

-----
##### It helps in
- Analysing the data
- Manipulating the data
- Performing basic visualisations
- Performing feature engineering (which is nothing but manipulating data based on our analysis).

#### Data Types in Pandas
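- Series      (1D)
- DataFrame   (2D)
- Panel       (multi-dimensional; deprecated in pandas 0.20 and removed in 0.25)

----

- We work mostly with Series and DataFrames in the real world. Panel is no longer available in current pandas releases, where a DataFrame with a MultiIndex is the usual way to hold higher-dimensional data.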

#### Some common points to note
- Strings are traditionally stored with the object dtype in pandas (recent versions also provide a dedicated string dtype).
- float is a more commonly used numerical type than integer (for example, an integer column with missing values is stored as float by default).
- There are several kinds of data, namely **_Categorical, Nominal, Ordinal_**, which are very important to deal with.
- There are **_Outliers_**, which affect how well conclusions drawn from the data generalise, so they should be handled properly.
- It is important to know which operations are **inplace** and which are not (_remind me if I forget to explain this_).
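
The point about **inplace** operations can be seen in a small sketch. It assumes a throwaway Series called `marks` (a made-up name, not used anywhere else in this notebook) and uses `sort_values`, which accepts an `inplace` flag.

```python
import pandas as pd

# hypothetical example data, not part of this notebook
marks = pd.Series([30, 10, 20], index=['a', 'b', 'c'])

# not inplace: a new sorted Series is returned, `marks` keeps its order
sorted_copy = marks.sort_values()
order_after_copy = list(marks.index)

# inplace: `marks` itself is reordered
marks.sort_values(inplace=True)
order_after_inplace = list(marks.index)

print(order_after_copy)
print(order_after_inplace)
```

Most pandas methods behave like the first call and return a new object, so the result has to be assigned to a name if we want to keep it.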

In [1]:
# importing necessary libraries with commonly used aliases

import pandas as pd

# don't worry about the following imports; we'll discuss them later.
import numpy as np
from datetime import datetime

## Creation of a Pandas Series Object

In [2]:
# Creating a Series from a Python list
arr = list(range(10,21))
ser_arr = pd.Series(arr)
print(ser_arr)

0     10
1     11
2     12
3     13
4     14
5     15
6     16
7     17
8     18
9     19
10    20
dtype: int64


In [3]:
ser_arr.dtype

dtype('int64')

In [4]:
ser_arr.shape

(11,)

In [5]:
# Creating a Series using numpy array
np_arr = np.random.randn(10)
ser_np_arr = pd.Series(np_arr)
print(ser_np_arr)

0   -0.178318
1   -0.448127
2   -0.782835
3   -1.132131
4   -1.162446
5   -0.322369
6   -0.575492
7    0.776008
8   -0.223818
9    2.472755
dtype: float64


In [6]:
# creating a Series using dictionary
dic = dict()
for i in range(10):
    dic[chr(ord('a')+i)] = i
print(dic)

ser_dic = pd.Series(dic)

print(ser_dic)

{'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4, 'f': 5, 'g': 6, 'h': 7, 'i': 8, 'j': 9}
a    0
b    1
c    2
d    3
e    4
f    5
g    6
h    7
i    8
j    9
dtype: int64


In [7]:
# changing the index of Series
ser_dic.index = [chr(ord('k')+i) for i in range(10)]
# if the number of index labels doesn't match the number of elements in the Series, a ValueError is raised.

print(ser_dic)

k    0
l    1
m    2
n    3
o    4
p    5
q    6
r    7
s    8
t    9
dtype: int64
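
The length rule mentioned in the comment above can be checked with a short sketch. The Series `letters` is a made-up example; the error is caught so that the cell still runs to the end.

```python
import pandas as pd

# hypothetical example data
letters = pd.Series([0, 1, 2], index=['a', 'b', 'c'])

try:
    # three values but only two labels
    letters.index = ['x', 'y']
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

print(type(mismatch_error).__name__)
print(mismatch_error)
```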


In [8]:
# creating a Series object with heterogeneous data
dic = {'Name':'Pardhu','Age':21,'Dept':'CSE','Sem':'VI'}
pardhu = pd.Series(dic)
print(pardhu)

Name    Pardhu
Age         21
Dept       CSE
Sem         VI
dtype: object


In [9]:
# creating the same Series as above in a different way
pardhu_way2 = pd.Series(['Pardhu',21,'CSE','VI'], index=['Name','Age','Dept','Sem'])
print(pardhu_way2)

Name    Pardhu
Age         21
Dept       CSE
Sem         VI
dtype: object
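
The `pd.Series` constructor takes a few more optional arguments than the cells above use. The following is a small sketch with made-up names (`prices`, `flags`) showing `index`, `dtype` and `name` together, and a scalar being repeated over an index.

```python
import pandas as pd

# explicit index, dtype and name in one call
prices = pd.Series([10, 20, 30], index=['x', 'y', 'z'], dtype='float64', name='price')

# a scalar is repeated for every index label
flags = pd.Series(0, index=['x', 'y', 'z'])

print(prices)
print(flags)
```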


## Accessing elements from a Series Object

In [10]:
# the primary way is to access it like a dictionary in Python
print(pardhu['Name'])
print(ser_dic['m'])
print(ser_arr[6])

print('----------------')

# using the loc indexer --> label-based location
print(pardhu.loc['Name'])
print(ser_dic.loc['m'])
print(ser_arr.loc[6])

Pardhu
2
16
----------------
Pardhu
2
16


In [11]:
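# every Series also supports 0-based positional access, whatever its explicit index is.
# note: plain [] with an integer on a non-integer index is deprecated in recent pandas versions; prefer .iloc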
print(pardhu[1])   # our index label 'Age'
print(ser_dic[4])  # our index label 'o'

print('----------------')

# using the iloc indexer --> integer (positional) location
print(pardhu.iloc[1])
print(ser_dic.iloc[4])

21
4
----------------
21
4
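
The difference between `loc` and `iloc` is easiest to see when the index labels are integers that are not in 0, 1, 2 order. A minimal sketch, using a made-up Series `scores`:

```python
import pandas as pd

# integer labels that do not match the positions
scores = pd.Series([10, 20, 30], index=[2, 0, 1])

by_label = scores.loc[0]      # the value whose label is 0
by_position = scores.iloc[0]  # the value in the first position

print(by_label, by_position)
```

Being explicit with `loc` or `iloc` avoids having to remember which of the two a plain `[]` will pick.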


In [12]:
# Accessing multiple elements
print(pardhu[[1,2,3]])
print(pardhu.iloc[[1,2,3]])
print('---------------')
print(ser_dic[[0,2,4,6,8]])
print(ser_dic.iloc[[0,2,4,6,8]])
print('---------------')
print(ser_arr[[1,3,5,7,9]])
print(ser_arr.iloc[[1,3,5,7,9]])

Age      21
Dept    CSE
Sem      VI
dtype: object
Age      21
Dept    CSE
Sem      VI
dtype: object
---------------
k    0
m    2
o    4
q    6
s    8
dtype: int64
k    0
m    2
o    4
q    6
s    8
dtype: int64
---------------
1    11
3    13
5    15
7    17
9    19
dtype: int64
1    11
3    13
5    15
7    17
9    19
dtype: int64


In [13]:
# Accessing multiple elements using slicing (same as slicing a list in Python)
print(pardhu[1:3])
print('---------------')
print(ser_dic[0:9:2])
print('---------------')
print(ser_arr[1:10:2])

Age      21
Dept    CSE
dtype: object
---------------
k    0
m    2
o    4
q    6
s    8
dtype: int64
---------------
1    11
3    13
5    15
7    17
9    19
dtype: int64
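
One thing to watch with slicing: positional slices leave out the end position (like a list), while label slices through `loc` include the end label. A small sketch with a made-up Series `letters`:

```python
import pandas as pd

letters = pd.Series([0, 1, 2, 3, 4], index=list('abcde'))

positional = letters.iloc[1:3]      # positions 1 and 2 only
label_based = letters.loc['b':'d']  # labels b, c and d (end label included)

print(positional)
print(label_based)
```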


In [14]:
# first five elements of a series
print(ser_dic.head())

print('------------')

# if we pass an integer n to head, it would return first n rows
ser_dic.head(3)

k    0
l    1
m    2
n    3
o    4
dtype: int64
------------


k    0
l    1
m    2
dtype: int64

In [15]:
# last five elements of a series
print(ser_dic.tail())

print('-------')

# if we pass an integer n to tail, it would return last n rows
print(ser_dic.tail(4))

p    5
q    6
r    7
s    8
t    9
dtype: int64
-------
q    6
r    7
s    8
t    9
dtype: int64
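
`head` and `tail` also accept a negative n, which means "everything except n rows". A short sketch on a made-up Series `numbers`:

```python
import pandas as pd

numbers = pd.Series(list(range(10)))

all_but_last_two = numbers.head(-2)   # drops the last 2 rows
all_but_first_two = numbers.tail(-2)  # drops the first 2 rows

print(all_but_last_two.tolist())
print(all_but_first_two.tolist())
```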


## Modifying elements of a series

In [16]:
ser_dic

k    0
l    1
m    2
n    3
o    4
p    5
q    6
r    7
s    8
t    9
dtype: int64

In [17]:
ser_dic['m'] = 11
ser_dic.head()

k     0
l     1
m    11
n     3
o     4
dtype: int64

In [18]:
pardhu['Sem'] = 'VIII'
pardhu

Name    Pardhu
Age         21
Dept       CSE
Sem       VIII
dtype: object

In [19]:
ser_dic[['m','n','o','p']] = [11,12,13,14]
ser_dic.head(6)

k     0
l     1
m    11
n    12
o    13
p    14
dtype: int64

In [20]:
ser_dic.iloc[2] = 17
ser_dic.head()

k     0
l     1
m    17
n    12
o    13
dtype: int64

In [21]:
ser_dic.loc['k'] = 18
ser_dic.head()

k    18
l     1
m    17
n    12
o    13
dtype: int64
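
Two more ways of modifying a Series that follow the same pattern as the cells above: assigning through a boolean mask, and assigning to a label that does not exist yet. This is only a sketch on a made-up Series `stock`.

```python
import pandas as pd

stock = pd.Series([5, 0, 12, 7], index=['pen', 'ink', 'pad', 'cap'])

# boolean mask on the left-hand side: every value above 6 becomes 6
stock.loc[stock > 6] = 6

# assigning to a new label appends a new element
stock.loc['nib'] = 3

print(stock)
```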

## Mathematical operations

In [22]:
ser_dic

k    18
l     1
m    17
n    12
o    13
p    14
q     6
r     7
s     8
t     9
dtype: int64

In [23]:
print('Sum:',ser_dic.sum())
print('Mean:',ser_dic.mean())
print('Standard Deviation:',ser_dic.std())

Sum: 105
Mean: 10.5
Standard Deviation: 5.275730597114805
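
A note on the standard deviation above: `std` in pandas uses the sample formula (divide by n - 1, i.e. `ddof=1`) by default. The sketch below rebuilds the same ten values in a new Series called `values` (a made-up name) and compares it with the population version.

```python
import pandas as pd

# same values as ser_dic at this point in the notebook
values = pd.Series([18, 1, 17, 12, 13, 14, 6, 7, 8, 9])

sample_std = values.std()            # divides by n - 1 (ddof=1, the default)
population_std = values.std(ddof=0)  # divides by n

print(sample_std)
print(population_std)
```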


In [24]:
print(ser_dic.max())
print(ser_dic.min())

18
1


In [25]:
print(ser_dic.idxmax())
print(ser_dic.idxmin())

k
l


In [26]:
all_falses = pd.Series([0]*10)
print(all_falses)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64


In [27]:
# checking if all the elements are non-zero
print(all_falses.all())

# checking if any of the elements is non-zero
print(all_falses.any())

False
False


In [28]:
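# let's try changing a value in all_falses
# note: writing a string into an int64 Series upcasts it to object;
# recent pandas versions warn about (or may reject) this implicit upcast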
all_falses[4] = 'Pardhu'
all_falses[6] = 7
all_falses[2] = True
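
If we really want a Series that holds mixed types, one option is to convert it to the object dtype explicitly first, so we do not rely on pandas upcasting for us. This is a sketch of that idea, using new names `zeros` and `mixed` so that `all_falses` is left alone.

```python
import pandas as pd

zeros = pd.Series([0] * 10)

# convert explicitly so that values of any type can be stored
mixed = zeros.astype(object)
mixed.loc[4] = 'Pardhu'
mixed.loc[6] = 7
mixed.loc[2] = True

print(mixed.dtype)
print(mixed.head(7))
```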

## Miscellaneous

In [29]:
# np.nan is used here because the np.NaN alias was removed in NumPy 2.0
ser = pd.Series([1,2,3,np.nan,4,3,2,1,np.nan,5,6,4,3,np.nan,2,1])

In [30]:
ser.isnull()

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7     False
8      True
9     False
10    False
11    False
12    False
13     True
14    False
15    False
dtype: bool

In [31]:
ser.isnull().any()

True

In [32]:
ser.isnull().all()

False

In [33]:
# gives all the elements in the Series only once.
ser.unique()

array([ 1.,  2.,  3., nan,  4.,  5.,  6.])

In [34]:
# number of distinct elements (NaN is not counted by default)
ser.nunique()

6

In [35]:
# gives each element and the number of occurrences of that element (NaN is left out by default)
ser.value_counts()

3.0    3
2.0    3
1.0    3
4.0    2
6.0    1
5.0    1
dtype: int64
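
Notice that `unique()` listed `nan` but `nunique()` and `value_counts()` skipped it. Both of them take a `dropna` argument to change that. The sketch below rebuilds the same data under the same name `ser` so that it runs on its own.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, np.nan, 4, 3, 2, 1, np.nan, 5, 6, 4, 3, np.nan, 2, 1])

missing_count = ser.isnull().sum()             # each True counts as 1
distinct_without_nan = ser.nunique()
distinct_with_nan = ser.nunique(dropna=False)
counts_with_nan = ser.value_counts(dropna=False)

print(missing_count, distinct_without_nan, distinct_with_nan)
print(counts_with_nan)
```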

In [36]:
ser==1

0      True
1     False
2     False
3     False
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15     True
dtype: bool

In [37]:
ser[ser<6]
# indices 3, 8 and 13 are missing because they hold NaN (NaN < 6 is False);
# index 10 is missing because its value 6 is not < 6

0     1.0
1     2.0
2     3.0
4     4.0
5     3.0
6     2.0
7     1.0
9     5.0
11    4.0
12    3.0
14    2.0
15    1.0
dtype: float64
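
Boolean conditions like the ones above can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses. A small sketch on the same data, rebuilt here so the block runs on its own:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 2, 3, np.nan, 4, 3, 2, 1, np.nan, 5, 6, 4, 3, np.nan, 2, 1])

in_range = ser[(ser > 1) & (ser < 5)]         # strictly between 1 and 5
ones_or_sixes = ser[(ser == 1) | (ser == 6)]  # equal to 1 or equal to 6
not_missing = ser[ser.notnull()]              # drop the NaN entries

print(in_range.index.tolist())
print(ones_or_sixes.index.tolist())
print(len(not_missing))
```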

In [38]:
ser_dic

k    18
l     1
m    17
n    12
o    13
p    14
q     6
r     7
s     8
t     9
dtype: int64

In [39]:
# default is ascending
ser_dic = ser_dic.sort_values()
ser_dic

l     1
q     6
r     7
s     8
t     9
n    12
o    13
p    14
m    17
k    18
dtype: int64

In [40]:
ser_dic = ser_dic.sort_index()
ser_dic

k    18
l     1
m    17
n    12
o    13
p    14
q     6
r     7
s     8
t     9
dtype: int64

In [41]:
neg_pos = pd.Series(list(range(-5,5)))
neg_pos

0   -5
1   -4
2   -3
3   -2
4   -1
5    0
6    1
7    2
8    3
9    4
dtype: int64

In [42]:
neg_pos.abs()

0    5
1    4
2    3
3    2
4    1
5    0
6    1
7    2
8    3
9    4
dtype: int64

In [43]:
neg_pos.add_prefix('X')

X0   -5
X1   -4
X2   -3
X3   -2
X4   -1
X5    0
X6    1
X7    2
X8    3
X9    4
dtype: int64

In [44]:
neg_pos.add_suffix('Y')

0Y   -5
1Y   -4
2Y   -3
3Y   -2
4Y   -1
5Y    0
6Y    1
7Y    2
8Y    3
9Y    4
dtype: int64

In [45]:
neg_pos.apply(lambda x: x**2 if x>0 else x**3)

0   -125
1    -64
2    -27
3     -8
4     -1
5      0
6      1
7      4
8      9
9     16
dtype: int64
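
`apply` calls the lambda once per element. For a simple rule like this one, the same result can be computed on the whole Series at once with `np.where`. This is just an alternative sketch, not a replacement for the cell above.

```python
import numpy as np
import pandas as pd

neg_pos = pd.Series(list(range(-5, 5)))

# element by element, as in the cell above
with_apply = neg_pos.apply(lambda x: x**2 if x > 0 else x**3)

# same rule evaluated on whole arrays at once
vectorised = pd.Series(
    np.where(neg_pos > 0, neg_pos**2, neg_pos**3),
    index=neg_pos.index,
)

print(vectorised.tolist())
```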

In [46]:
a = pd.Series(list(range(5)))
b = pd.Series(list(range(5,11)))
# Series.append was removed in pandas 2.0; pd.concat does the same job
print(pd.concat([a, b])) # doesn't ignore index, keeps each series' own index
print(pd.concat([a, b], ignore_index=True)) # creates a new index

0     0
1     1
2     2
3     3
4     4
0     5
1     6
2     7
3     8
4     9
5    10
dtype: int64
0      0
1      1
2      2
3      3
4      4
5      5
6      6
7      7
8      8
9      9
10    10
dtype: int64


In [47]:
a.astype('float')

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64

In [48]:
a.between(1,10)

0    False
1     True
2     True
3     True
4     True
dtype: bool
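
By default `between` includes both end points, so it matches a pair of `>=` and `<=` comparisons. A quick sketch to confirm this on the same Series `a`, rebuilt here so the block runs on its own:

```python
import pandas as pd

a = pd.Series(list(range(5)))

with_between = a.between(1, 10)
with_comparisons = (a >= 1) & (a <= 10)

print(with_between.equals(with_comparisons))
```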

In [49]:
c = pd.Series(list(range(2,10)))

In [50]:
# all the elements that are less than lower are changed to lower
# all the elements that are greater than upper are changed to upper
# all the elements that are in between lower and upper are left as they are.
c.clip(lower=4, upper=7)

0    4
1    4
2    4
3    5
4    6
5    7
6    7
7    7
dtype: int64
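
`clip` also works with only one of the two bounds, which is one simple way of capping an outlier. A sketch on made-up data (`readings` and the bound 20 are assumptions chosen for illustration):

```python
import pandas as pd

# hypothetical readings with one obvious outlier
readings = pd.Series([10, 12, 11, 13, 500])

capped = readings.clip(upper=20)   # only an upper bound
floored = readings.clip(lower=11)  # only a lower bound

print(capped.tolist())
print(floored.tolist())
```

Like most Series methods, `clip` returns a new Series here and leaves `readings` unchanged.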

In [51]:
# Gives the cumulative max up to that index
c.cummax()

0    2
1    3
2    4
3    5
4    6
5    7
6    8
7    9
dtype: int64

In [52]:
# Gives the cumulative min up to that index
c.cummin()

0    2
1    2
2    2
3    2
4    2
5    2
6    2
7    2
dtype: int64
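
Since `c` is already sorted, `cummax` just returns `c` itself and `cummin` returns its first value everywhere. The effect is clearer on unsorted data, as in this sketch with a made-up Series `d`:

```python
import pandas as pd

d = pd.Series([3, 1, 4, 1, 5, 9, 2, 6])

running_max = d.cummax()  # largest value seen so far
running_min = d.cummin()  # smallest value seen so far

print(running_max.tolist())
print(running_min.tolist())
```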

# Ok let's move to Data Frames now... Bye :)