# Pandas

* As stated on the Pandas site:
    
    ***Pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language.***

* Pandas provides a simple interface and a set of functions for manipulating 1-dimensional data, called a Series, and 2-dimensional data, called a DataFrame.
* Multi-dimensional data can also be represented by using a hierarchical (multi-level) index.
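
The multi-level index idea can be sketched as follows. This is a minimal illustration, not part of the original notebook: the `(year, subject)` labels and the marks are made-up values, and `pd.MultiIndex.from_tuples` is just one of several ways to build such an index.

```python
import pandas as pd

# Hypothetical data: marks indexed by (year, subject) pairs.
index = pd.MultiIndex.from_tuples(
    [(2020, "math"), (2020, "science"), (2021, "math"), (2021, "science")],
    names=["year", "subject"],
)
marks = pd.Series([80.0, 65.0, 90.0, 82.0], index=index)

# Select every entry for one value of the outer level.
marks_2021 = marks.loc[2021]

# unstack() pivots the inner level into columns, giving a 2-D table.
table = marks.unstack()
print(marks_2021)
print(table)
```

Here a 1-D Series carries two dimensions of labels, and `unstack()` shows the same data as a year-by-subject table.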

In [1]:
separator = "\n###############\n"

In [2]:
# importing Pandas
import pandas as pd

* Pandas provides 2 powerful data structures:
    1. Series - comparable to a list
    2. DataFrame - comparable to a table

* You can think of a Pandas DataFrame as a collection of Pandas Series, where each column is a Series.
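
A small sketch of that relationship, using illustrative names and values that are not from the original notebook: two Series are placed side by side to form a DataFrame, and selecting a column returns a Series again.

```python
import pandas as pd

# Two Series sharing the same default index; names and values are illustrative.
age = pd.Series([6, 5, 6], name="age")
marks = pd.Series([80.0, 65.0, 90.0], name="marks")

# concat along axis=1 places each Series side by side as a column.
df = pd.concat([age, marks], axis=1)

# Selecting one column gives back a Series.
column = df["marks"]
print(type(column).__name__)
```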

## Pandas Series

In [5]:
pds = pd.Series([1,2,3,4,5])
pds

# * Here [1,2,3,4,5] is the actual data.
# * [0 - 4] is the index (row label). As with other Python
#   data structures, you can use this index to access a particular row.
# * dtype is the data type. In this case it is a 64-bit integer.

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [6]:
pds = pd.Series([1,2,3,4,5, 'a'])
pds

0    1
1    2
2    3
3    4
4    5
5    a
dtype: object
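
The switch from `int64` to `object` above happens because the values no longer share one numeric type. The sketch below is an illustration added for clarity. It inspects the element types and shows one possible way to recover a numeric Series with `pd.to_numeric`, which is an optional step the original notebook does not use.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed = pd.Series([1, 2, 3, "a"])

# Each element of the object Series keeps its original Python type.
element_types = [type(v).__name__ for v in mixed]

# pd.to_numeric with errors="coerce" turns non-numeric entries into NaN.
numeric = pd.to_numeric(mixed, errors="coerce")
print(ints.dtype, mixed.dtype, numeric.dtype)
```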

#### Creating a Series

In [19]:
# From a scalar value
print(pd.Series(5))
print(separator)
print(pd.Series(5, index = range(5)))

# The line below raises a ValueError: a one-element list
# does not match the length of a five-label index
# print(pd.Series([5], index = range(5)))

0    5
dtype: int64

###############

0    5
1    5
2    5
3    5
4    5
dtype: int64
0    5
1    2
2    4
3    3
4    1
dtype: int64
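
The difference between a scalar and a one-element list can be checked directly. This is a minimal sketch: the `try`/`except` is only there so the example runs to completion, and `raised` is a helper name introduced for illustration.

```python
import pandas as pd

# A scalar is repeated to fill every index label.
broadcast = pd.Series(5, index=range(5))

# A one-element list is not broadcast: its length must equal the index length.
try:
    pd.Series([5], index=range(5))
    raised = False
except ValueError:
    raised = True
print(len(broadcast), raised)
```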


In [21]:
# Creating a series from an iterator
print(pd.Series(range(3, 15, 3)))

print(pd.Series(tuple(map(float, range(3, 8))), index = range(11,16)))

0     3
1     6
2     9
3    12
dtype: int64
11    3.0
12    4.0
13    5.0
14    6.0
15    7.0
dtype: float64


In [87]:
# Creating a series from a dictionary
state_capital = {'West Bengal': 'Kolkata', 'Delhi': 'Delhi', 'Maharashtra': 'Mumbai', 'Karnataka': 'Bangalore'}
print(pd.Series(state_capital))

## Try these
print(separator)
print(pd.Series(state_capital, index=range(4)))
print(separator)


print(pd.Series(state_capital, index = list(state_capital.keys()) + ['ABC']))
print(separator)
print(pd.Series(state_capital, index = sorted(list(state_capital.keys()) + ['ABC'])))

West Bengal      Kolkata
Delhi              Delhi
Maharashtra       Mumbai
Karnataka      Bangalore
dtype: object

###############

0    NaN
1    NaN
2    NaN
3    NaN
dtype: object

###############

West Bengal      Kolkata
Delhi              Delhi
Maharashtra       Mumbai
Karnataka      Bangalore
ABC                  NaN
dtype: object

###############

ABC                  NaN
Delhi              Delhi
Karnataka      Bangalore
Maharashtra       Mumbai
West Bengal      Kolkata
dtype: object
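
The outputs above suggest that when both a dictionary and an `index` are given, values are looked up by label and any label missing from the dictionary becomes `NaN`. The sketch below repeats this on a shortened dictionary. The `fillna("unknown")` step is an optional follow-up added here for illustration, not something the notebook does.

```python
import pandas as pd

state_capital = {"West Bengal": "Kolkata", "Delhi": "Delhi"}

# Only labels that are also dictionary keys receive a value.
s = pd.Series(state_capital, index=["Delhi", "ABC"])

# fillna replaces the missing entries afterwards.
filled = s.fillna("unknown")
print(filled)
```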


#### It is possible to assign:
    1. a specific name to Series
    2. a specific datatype - any NumPy datatype is valid
    3. custom index.

In [70]:
pds2 = pd.Series([1,2,3,4,5], name='list_data', index=list('ABCDE'), dtype='f8')
pds2

A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
Name: list_data, dtype: float64

#### Accessing individual elements of a pandas Series

In [37]:
# You can do it in multiple ways
print(pds2)
print(pds)


# print("1. using index value directly:")
# print("pds2  : ", pds2['B'])
# print("pds  : ", pds[1])

print(separator)
# print("2. using loc (location) function of the series:")
# print("pds2  : ", pds2.loc['B'])
# print("pds : ", pds.loc[1])

# Erroneous indexing
# print("pds2  : ", pds2.loc[1])  # Datatype of index doesn't match
# print("pds : ", pds.loc['B'])   # Key error, index 'B' is not available.

# print(separator)
# print("# 3. using iloc (integer or index location) function of the Series:")
# pds2.iloc[1]
# print("pds2  : ", pds2.iloc[1])
# print("pds  : ", pds.iloc[1])

# print(separator)
# Behavior of loc and iloc with a shuffled index.
# df = pd.Series([0,2,1,3,4])
# df.sort_values(inplace=True)
# print(df)
# print(df.loc[2], df.iloc[2])


print(separator)
print("# . using get function of the Series:")
pds2.iloc[1]
print("pds2  : ", pds2.get(1)) # or 'B'
print("pds  : ", pds.get(1))

### get does not raise an error if the key is missing from the Series; it returns the default value (None unless specified)
print("get F: ", pds2.get('F'))


A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
Name: list_data, dtype: float64
0    1
1    2
2    3
3    4
4    5
5    a
dtype: object

###############


###############

# . using get function of the Series:
pds2  :  2.0
pds  :  2
get F:  None
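
The commented-out `loc`/`iloc` comparison in the cell above can be run on its own. This is a minimal sketch: the Series is named `s` here instead of `df`, and the `get` call with an explicit `default` is an added illustration.

```python
import pandas as pd

s = pd.Series([0, 2, 1, 3, 4]).sort_values()
# After sorting, the index labels appear in the order 0, 2, 1, 3, 4.

by_label = s.loc[2]      # the row whose label is 2 -> value 1
by_position = s.iloc[2]  # the third row in the current order -> value 2

# get() returns a default instead of raising when the label is missing.
missing = s.get(99, default=-1)
print(by_label, by_position, missing)
```

`loc` follows the label wherever sorting moved it, while `iloc` counts rows from the top, so the two disagree once the index is no longer in its original order.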


In [44]:
import numpy as np

#### Appending more data to series

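* `Series.append` accepts only another Series (or a list of Series). It returns a new Series and leaves the original unchanged.
* `Series.append` was removed in pandas 2.0, so the cells below require an older version.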

In [49]:
print(pds2)
pds2.append(pd.Series(6))#, index=['F']))
pds2 = pds2.append(pd.Series([6], index=['F']))

A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
Name: list_data, dtype: float64


In [54]:
pds2.append(pd.Series([7], ['G']))

A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
F    6.0
G    7.0
dtype: float64

In [55]:
pds2

A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
F    6.0
dtype: float64

In [63]:
ps = pds.append(pd.Series([6,7,8]))#, ignore_index=True)
ps

0    1
1    2
2    3
3    4
4    5
5    a
0    6
1    7
2    8
dtype: object

In [64]:
print(ps.loc[1])
print(ps.iloc[1])

1    2
1    7
dtype: object
2
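
On pandas 2.0 and later, where `Series.append` is no longer available, the same results can be obtained with `pd.concat`. The sketch below uses small stand-in Series (`base`, `extra`) instead of the notebook's `pds` and `pds2`.

```python
import pandas as pd

base = pd.Series([1.0, 2.0, 3.0], index=list("ABC"))
extra = pd.Series([4.0], index=["D"])

# concat returns a new Series; the inputs are left unchanged.
combined = pd.concat([base, extra])

# ignore_index=True discards the old labels and renumbers from 0,
# which avoids the duplicate labels seen in the output above.
renumbered = pd.concat([pd.Series([1, 2]), pd.Series([6, 7])], ignore_index=True)
print(combined)
print(renumbered)
```

Without `ignore_index=True`, duplicate labels are kept, which is why `ps.loc[1]` above returns two rows while `ps.iloc[1]` returns one value.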


#### Converting Series to other formats.

In [65]:
print(pds.to_list())
print(pds2.to_list())

[1, 2, 3, 4, 5, 'a']
[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


In [66]:
print(pds.to_dict())
print(pds2.to_dict())

{0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 'a'}
{'A': 1.0, 'B': 2.0, 'C': 3.0, 'D': 4.0, 'E': 5.0, 'F': 6.0}


In [67]:
pds.to_frame()

Unnamed: 0,0
0,1
1,2
2,3
3,4
4,5
5,a


In [68]:
pds.to_frame(name='values')

Unnamed: 0,values
0,1
1,2
2,3
3,4
4,5
5,a


In [71]:
pds2

A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
Name: list_data, dtype: float64

In [72]:
pds2.to_frame()

Unnamed: 0,list_data
A,1.0
B,2.0
C,3.0
D,4.0
E,5.0


In [78]:
pds2.to_json(orient='values') # index (default), records, 'table', 'values'

'[1.0,2.0,3.0,4.0,5.0]'

In [80]:
pds2.to_numpy()

array([1., 2., 3., 4., 5.])

#### Other useful functions

In [None]:
# sc = pd.Series(state_capital, index = list(state_capital.keys()) + ['ABC', 'DEF'])

In [81]:
pds2.ndim   # returns the number of dimensions; always 1 for a Series

1

In [82]:
pds2.shape # shape of the Series

(5,)

In [None]:
pds2.size # number of elements in the Series

In [84]:
pds2

A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
Name: list_data, dtype: float64

In [85]:
print(pds2.hasnans) # if series contains any NaN (not a number) value

print(pd.Series(state_capital, index = list(state_capital.keys()) + ['ABC', 'DEF']))
pd.Series(state_capital, index = list(state_capital.keys()) + ['ABC', 'DEF']).hasnans

False
West Bengal      Kolkata
Delhi              Delhi
Maharashtra       Mumbai
Karnataka      Bangalore
0                unknown
ABC                  NaN
DEF                  NaN
dtype: object


True

In [89]:
print(pds2.empty)  # if series is empty

print(pd.Series(state_capital, index=range(4)))
print(pd.Series(state_capital, index=range(4)).empty) # NaNs are not equivalent to empty

print(pd.Series())
print(pd.Series().empty)

False
0    NaN
1    NaN
2    NaN
3    NaN
dtype: object
False
Series([], dtype: float64)
True


  
  import sys


In [90]:
print(pds2.keys()) # returns the index of the Series (alias for .index)
print(pds.keys())

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')
RangeIndex(start=0, stop=6, step=1)


In [91]:
print(pds2)
print(pds2.pop('E')) # removes given item (based on index) from series
print(pds2)

A    1.0
B    2.0
C    3.0
D    4.0
E    5.0
Name: list_data, dtype: float64
5.0
A    1.0
B    2.0
C    3.0
D    4.0
Name: list_data, dtype: float64


In [93]:
a = pd.Series([1, 1, 1, 1], index=['a', 'b', 'c', 'd'])
b = pd.Series([1, 2, 3, 4], index=['a', 'b', 'd', 'e'])
print(a + b)
print(a.add(b))

## Beware: + and add() yield NaN for any label present in only one Series.
## Pass the fill_value parameter to add() to substitute a value for the missing side.
a.add(b, fill_value=0)

a    2.0
b    3.0
c    NaN
d    4.0
e    NaN
dtype: float64
a    2.0
b    3.0
c    NaN
d    4.0
e    NaN
dtype: float64


a    2.0
b    3.0
c    1.0
d    4.0
e    4.0
dtype: float64

In [None]:
## apply: applies the given function to every value of the Series
# and returns a new Series.
# pds2 is used here because pds contains the string 'a', so x+10 would raise a TypeError.
print(pds2.apply(lambda x: x+10))

print(separator)
# You can define your own custom functions with extra arguments as well.
# The 1st argument is always the Series value
def custom_function(x, y, z):
    return x + y + z

print(pds2.apply(custom_function, args=(10,20,)))
print(pds2.apply(custom_function, y= 10, z=20))

In [None]:
# Agg or Aggregate: apply an aggregation function.
print(pds2.aggregate('min'))
print(pds2.agg(['min', 'max', 'sum']))

      
print(separator)
# You can define your own custom aggregation functions as well.
# The 1st argument is always the Series' values
def squared_sum(x, y=1):
    tot = 0
    for i in x:
        tot += i**2
    return tot + y

print(pds2.agg(squared_sum, y = 10))
print(separator)
print(pds2.agg(['min', 'max', 'sum', squared_sum]))


### Plotting Data

In [None]:
pds2.plot(kind='line')

# Available kinds include: line, bar, barh, hist, box, kde, area, pie

In [None]:
pds2.plot.line() # you can put other kinds as well. eg: pds2.plot.bar()

## Pandas DataFrame

* It provides a tabular (matrix-like) structure of rows and columns.

In [None]:
name = ['Chintu', 'Mintu', 'Pinky', 'Minty', 'Golu', 'Molu']
gender = ['M', 'M', 'F', 'F', 'M', 'M']
age = [6, 5, 6, 5, 6, 5]
marks = [80.0, 65.0, 90.0, 82.0, 75.0, 82.0]

In [None]:
pd.DataFrame([name, gender, age, marks])

In [None]:
pd.DataFrame(list(zip(name, gender, age, marks))) #, columns = ['name', 'gender', 'age', 'marks'])
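
The two constructions above produce differently shaped tables, which the sketch below makes explicit on a shortened version of the lists. It also applies the `columns` argument that is commented out in the cell above. The variable names `by_rows` and `by_records` are introduced only for this illustration.

```python
import pandas as pd

name = ["Chintu", "Mintu", "Pinky"]
age = [6, 5, 6]

# A list of lists is read row by row: 2 rows, 3 columns.
by_rows = pd.DataFrame([name, age])

# zip pairs the values up per student: 3 rows, 2 named columns.
by_records = pd.DataFrame(list(zip(name, age)), columns=["name", "age"])
print(by_rows.shape, by_records.shape)
```

The zipped form is usually the one wanted for records like these, because each student becomes a row and each attribute a column.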

In [None]:
d = {'name': name, 'gender': gender, 'age': age, 'marks': marks}
d

In [None]:
pd.DataFrame(d)

In [None]:
d = [{'name': 'Chintu', 'gender': 'M', 'age': 6, 'marks': 80.0},
 {'name': 'Mintu', 'gender': 'M', 'age': 5, 'marks': 65.0},
 {'name': 'Pinky', 'gender': 'F', 'age': 6, 'marks': 90.0},
 {'name': 'Minty', 'gender': 'F', 'age': 5, 'marks': 82.0},
 {'name': 'Golu', 'gender': 'M', 'age': 6, 'marks': 75.0},
 {'name': 'Molu', 'gender': 'M', 'age': 5, 'marks': 82.0}]

students_df = pd.DataFrame(d)
students_df

In [None]:
pima_df = pd.read_csv('pima_indian_diabetes.csv')
pima_df

In [None]:
url = "https://raw.githubusercontent.com/ikhurana/code-asylums/master/CA-SS1/Decision%20Tree%20and%20Random%20Forest/pima_indian_diabetes.csv"
pd.read_csv(url)

In [None]:
students_df.info()  # Gives a summary of the type of data in data frame.

In [None]:
students_df.describe() # Gives the statistical summary of the numeric columns of the dataframe

In [None]:
students_df.describe(include='all')

In [None]:
pima_df.describe() 

In [None]:
pima_df.info() 

In [None]:
pima_df.shape  # Returns a tuple: (number of rows, number of columns)

In [None]:
pima_df.head() # lists the first n rows of the dataframe, default value is 5

In [None]:
pima_df.tail(5) # lists the last n rows of the dataframe, default value is 5

In [None]:
pima_df.columns # list all the columns of the Dataframe

In [None]:
pima_df.values  # returns all the values as a NumPy array

In [None]:
type(pima_df['Pregnancies'])