# **PANDAS** Basics

pandas makes data cleaning and analysis fast and convenient by providing:
* data structures
* data manipulation

Used along with 
* Numpy, SciPy (scientific computing)
* statsmodels and scikit-learn (analytics)
* matplotlib (data visualization)

Difference from NumPy
* Pandas - designed for working with tabular or heterogeneous data.
* NumPy - best suited to homogeneously typed numerical array data.

In [313]:
import pandas as pd

--------------------

## pandas **Data Structures**

In [314]:
from pandas import Series, DataFrame

### **pd.Series** 

**Series** - One-dimensional array-like object containing a sequence of values (like a NumPy array) of the same type and an associated array of labels, called its index
- **OR** - a fixed-length, ordered dictionary, as it is a mapping of index values to data values
- It can be used in many places where a dictionary would be used

In [315]:
obj = pd.Series([2,3,5,8,3])
obj

0    2
1    3
2    5
3    8
4    3
dtype: int64

- When no index is specified, a default integer index from `0` to `N-1` is used

In [316]:
obj.array # to show the object in array form

<NumpyExtensionArray>
[np.int64(2), np.int64(3), np.int64(5), np.int64(8), np.int64(3)]
Length: 5, dtype: int64

In [317]:
obj.index # to show index

RangeIndex(start=0, stop=5, step=1)

In [318]:
# index array should be of same size as the data array 
obj2 = pd.Series([4,7,-4,21], index=["w","s","d","f"])  
obj2

w     4
s     7
d    -4
f    21
dtype: int64

In [319]:
obj2.index

Index(['w', 's', 'd', 'f'], dtype='object')

In [320]:
# Using index to select single or a range of values
obj2['w'] 

np.int64(4)

In [321]:
obj2[['s', 'd', 'f']]

s     7
d    -4
f    21
dtype: int64

- NumPy like operations

In [322]:
# filtering

obj2[obj2>5]

s     7
f    21
dtype: int64

In [323]:
# scalar multiplication

obj2 * 2

w     8
s    14
d    -8
f    42
dtype: int64

In [324]:
# applying a function
import numpy as np

np.exp(obj2) 

w    5.459815e+01
s    1.096633e+03
d    1.831564e-02
f    1.318816e+09
dtype: float64

- Conversion from **Python dictionary** to **pd.Series**

In [325]:
obj3 = pd.Series({"Ohio": 23435, "Texas": 5393, "Oregon": 80530, "Utah": 9358})

- Conversion from **pd.Series** to **Python dictionary**

In [326]:
obj3.to_dict()

{'Ohio': 23435, 'Texas': 5393, 'Oregon': 80530, 'Utah': 9358}

- When a Series is built from a dictionary, the key order is respected - as per the dictionary's `keys` method, i.e., insertion order
- this can be overridden when creating the Series from a dict by passing an index with the dictionary keys in the preferred order

In [327]:
sdata = {"Ohio": 23435, "Texas": 5393, "Oregon": 80530, "Utah": 9358}
obj4 = pd.Series(sdata, ["California", "Texas", "Oregon"]) 
# Non existing values will have NaN values
# keys not mentioned won't be included
obj4

California        NaN
Texas          5393.0
Oregon        80530.0
dtype: float64

- Check null entries
   - `pd.isna`
   - `pd.notna`
   - also available as instance methods `obj.isna()` / `obj.notna()`

In [328]:
pd.isna(obj4)

California     True
Texas         False
Oregon        False
dtype: bool

In [329]:
pd.notna(obj4)

California    False
Texas          True
Oregon         True
dtype: bool

- For arithmetic operations the data automatically aligns by index label

In [330]:
obj3 + obj4

California         NaN
Ohio               NaN
Oregon        161060.0
Texas          10786.0
Utah               NaN
dtype: float64

- `name` - attribute for both the Series object and index object 

In [331]:
obj4.name = "State-wise Population"
obj4.index.name = "States"

In [332]:
obj4

States
California        NaN
Texas          5393.0
Oregon        80530.0
Name: State-wise Population, dtype: float64

- A Series's index can be altered in place by assignment

In [333]:
obj

0    2
1    3
2    5
3    8
4    3
dtype: int64

In [334]:
obj.index = ['hi', 'hello', 'yes', 'no', 'here']
obj

hi       2
hello    3
yes      5
no       8
here     3
dtype: int64

----------

### **pd.DataFrame**

**DataFrame** - a rectangular table of data containing an ordered, named collection of columns, each of which can have a different type (num, str, bool etc)

- has both a row and a column index
- can be thought of as a dictionary of `Series` all sharing the same index
- it is physically 2-D
   - higher dimensions can be represented in a tabular format using hierarchical indexing

In [335]:
# most common way to construct a DataFrame
# is from a dictionary of equal-length lists or NumPy arrays

data = {
    'state' : ["Gujarat", "Rajasthan", "Uttarakhand", "Tamil Nadu", "Maipur", "Jharkhand", "Bihar"],
    'year' : [2006, 2001, 2006, 2019, 2011, 2001, 2019],
    'pop': [1.3, 1.5, 1.8, 2.9, 3.4, 3.2, 1.2]
}

frame = pd.DataFrame(data)
frame

Unnamed: 0,state,year,pop
0,Gujarat,2006,1.3
1,Rajasthan,2001,1.5
2,Uttarakhand,2006,1.8
3,Tamil Nadu,2019,2.9
4,Maipur,2011,3.4
5,Jharkhand,2001,3.2
6,Bihar,2019,1.2


In [336]:
# first five rows
frame.head()

Unnamed: 0,state,year,pop
0,Gujarat,2006,1.3
1,Rajasthan,2001,1.5
2,Uttarakhand,2006,1.8
3,Tamil Nadu,2019,2.9
4,Maipur,2011,3.4


In [337]:
# last five rows
frame.tail()

Unnamed: 0,state,year,pop
2,Uttarakhand,2006,1.8
3,Tamil Nadu,2019,2.9
4,Maipur,2011,3.4
5,Jharkhand,2001,3.2
6,Bihar,2019,1.2


- the order of a DF's columns can be set with the `columns` argument of the constructor

In [338]:
pd.DataFrame(data, columns=["year", "state", "pop"])

Unnamed: 0,year,state,pop
0,2006,Gujarat,1.3
1,2001,Rajasthan,1.5
2,2006,Uttarakhand,1.8
3,2019,Tamil Nadu,2.9
4,2011,Maipur,3.4
5,2001,Jharkhand,3.2
6,2019,Bihar,1.2


In [339]:
# a name listed in columns that is not present in data becomes a column of NaN (missing) values

frame2 = pd.DataFrame(data, columns=["year", "state", "pop", "debt"]) 
frame2

Unnamed: 0,year,state,pop,debt
0,2006,Gujarat,1.3,
1,2001,Rajasthan,1.5,
2,2006,Uttarakhand,1.8,
3,2019,Tamil Nadu,2.9,
4,2011,Maipur,3.4,
5,2001,Jharkhand,3.2,
6,2019,Bihar,1.2,


In [340]:
frame2.columns # retrieve all columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

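- Retrieve a column as a Series [possibly as a view] by
   - dictionary-like notation
   - dot attribute notation (works only when the column name is a valid Python identifier that does not clash with a DataFrame method)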

In [341]:
frame2["state"] # dictionary like notation

0        Gujarat
1      Rajasthan
2    Uttarakhand
3     Tamil Nadu
4         Maipur
5      Jharkhand
6          Bihar
Name: state, dtype: object

In [342]:
frame2.state # dot attribute notation

0        Gujarat
1      Rajasthan
2    Uttarakhand
3     Tamil Nadu
4         Maipur
5      Jharkhand
6          Bihar
Name: state, dtype: object

- Retrieve rows by
   - position - `.iloc[position]`
   - label - `.loc[label]` - uses the index labels (here the default integer index, so labels and positions coincide)

In [343]:
frame.iloc[1]

state    Rajasthan
year          2001
pop            1.5
Name: 1, dtype: object

In [344]:
frame.loc[2]

state    Uttarakhand
year            2006
pop              1.8
Name: 2, dtype: object
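Because `frame` uses the default integer index, `loc` and `iloc` look interchangeable in the cells above. The following minimal sketch uses a small hypothetical frame with string row labels (not part of the notes' data) to show the two operators selecting the same row in different ways.

```python
import pandas as pd

# Hypothetical frame with string row labels, so labels and positions differ
df = pd.DataFrame(
    {"year": [2006, 2001, 2019], "pop": [1.3, 1.5, 2.9]},
    index=["guj", "raj", "tn"],
)

by_label = df.loc["raj"]     # select the row whose index label is "raj"
by_position = df.iloc[1]     # select the second row by integer position

print(by_label.equals(by_position))
```

Here both selections return the same row as a Series; `df.loc[1]` would raise a `KeyError` because `1` is not a label in this index.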

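- Columns can be modified by assignment of
   - an array of values (its length must match the DataFrame's length)
   - a single value, broadcast to every row
   - a `pd.Series` - its index is realigned exactly to the DataFrame's index; NaN is inserted for any index labels it lacks

- Assigning to a column that does not exist creates a new column
   - only through the dictionary-like notation; `frame2.new_col = ...` does not create a column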

In [345]:
frame2["debt"] = 12.5
frame2

Unnamed: 0,year,state,pop,debt
0,2006,Gujarat,1.3,12.5
1,2001,Rajasthan,1.5,12.5
2,2006,Uttarakhand,1.8,12.5
3,2019,Tamil Nadu,2.9,12.5
4,2011,Maipur,3.4,12.5
5,2001,Jharkhand,3.2,12.5
6,2019,Bihar,1.2,12.5


In [346]:
frame2["debt"] = np.arange(7.) # the value length must match the length of the DataFrame
frame2

Unnamed: 0,year,state,pop,debt
0,2006,Gujarat,1.3,0.0
1,2001,Rajasthan,1.5,1.0
2,2006,Uttarakhand,1.8,2.0
3,2019,Tamil Nadu,2.9,3.0
4,2011,Maipur,3.4,4.0
5,2001,Jharkhand,3.2,5.0
6,2019,Bihar,1.2,6.0


In [347]:
val = pd.Series([1.4,4.5,2.4], index=['two',1 ,4])
frame2["debt"] = val
frame2

Unnamed: 0,year,state,pop,debt
0,2006,Gujarat,1.3,
1,2001,Rajasthan,1.5,4.5
2,2006,Uttarakhand,1.8,
3,2019,Tamil Nadu,2.9,
4,2011,Maipur,3.4,2.4
5,2001,Jharkhand,3.2,
6,2019,Bihar,1.2,


In [348]:
frame2["eastern"] = [x[1] == "a" for x in frame2["state"]]  # True where the second letter of the state name is "a"

frame2

Unnamed: 0,year,state,pop,debt,eastern
0,2006,Gujarat,1.3,,False
1,2001,Rajasthan,1.5,4.5,True
2,2006,Uttarakhand,1.8,,False
3,2019,Tamil Nadu,2.9,,True
4,2011,Maipur,3.4,2.4,True
5,2001,Jharkhand,3.2,,False
6,2019,Bihar,1.2,,False


- `del` keyword will delete columns 

In [349]:
del frame2["eastern"]
frame2

Unnamed: 0,year,state,pop,debt
0,2006,Gujarat,1.3,
1,2001,Rajasthan,1.5,4.5
2,2006,Uttarakhand,1.8,
3,2019,Tamil Nadu,2.9,
4,2011,Maipur,3.4,2.4
5,2001,Jharkhand,3.2,
6,2019,Bihar,1.2,


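- a column retrieved by indexing may be a view of the original data, so in-place changes to it can be reflected in the DataFrame (newer pandas versions with copy-on-write behave differently); use the Series's `copy` method for an independent copy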
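Since view-versus-copy behaviour depends on the pandas version, a defensive pattern is to copy explicitly when an independent column is wanted and to write through `.loc` when the frame itself should change. This is a minimal sketch with a hypothetical two-row frame.

```python
import pandas as pd

df = pd.DataFrame({"state": ["Bihar", "Gujarat"], "debt": [1.0, 2.0]})

debt_copy = df["debt"].copy()   # independent copy of the column
debt_copy.loc[0] = 99.0         # changes only the copy

df.loc[0, "debt"] = 5.0         # explicit write to the frame itself

print(df)
print(debt_copy)
```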

- FOR a nested dictionary of dictionaries / dictionary of `Series`
   - pandas interprets the outer dictionary keys as the columns
   - and the inner keys as the row indices - each row combines the values stored under that inner key across all the outer keys

In [350]:
populations = {
    "Jharkhand": {2001: 4.35, 2011: 5.76, 2021: 7.54},
    "Uttarakhand": {2001: 5.43, 2021: 2.44}
}

frame3 = pd.DataFrame(populations)
frame3

Unnamed: 0,Jharkhand,Uttarakhand
2001,4.35,5.43
2011,5.76,
2021,7.54,2.44


- Transpose the DF with the `.T` attribute [ AS IN NUMPY ]
   - Transposing and transposing back may lose dtype information: when the columns do not all share one dtype, `.T` does not preserve the per-column dtypes

In [351]:
frame3.T

Unnamed: 0,2001,2011,2021
Jharkhand,4.35,5.76,7.54
Uttarakhand,5.43,,2.44


- **Data Inputs** of pd.DataFrame

In [352]:
# 2D ndArray

In [353]:
# Dictionary of arrays, lists or tuples 

In [354]:
# NumPy structured/ record array

In [355]:
# Dictionary of Series

In [356]:
# Dictionary of Dictionaries

In [357]:
# Lists of dictionaries or Series

In [358]:
# List of lists or tuples

In [359]:
# Another Dataframe

In [360]:
# NumPy MaskedArray 
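The empty cells above only name the accepted inputs. As a hedged illustration, the sketch below builds small DataFrames from four of them; all variable names and values are made up for the example.

```python
import numpy as np
import pandas as pd

# 2D ndarray, with optional column labels
from_array = pd.DataFrame(np.arange(6).reshape(2, 3), columns=["a", "b", "c"])

# Dictionary of Series: the indexes are unioned, gaps become NaN
from_series = pd.DataFrame({
    "x": pd.Series([1, 2], index=["r1", "r2"]),
    "y": pd.Series([3], index=["r2"]),
})

# List of dictionaries: each dict is a row, the keys become columns
from_records = pd.DataFrame([{"a": 1, "b": 2}, {"a": 3}])

# List of tuples: each tuple is a row
from_tuples = pd.DataFrame([(1, "p"), (2, "q")], columns=["num", "char"])

print(from_series)
print(from_records)
```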

- `name` attribute - for the DataFrame's index and columns
   - unlike a Series, the DataFrame itself does not have a name attribute

In [361]:
frame3.index.name = "year"
frame3

Unnamed: 0_level_0,Jharkhand,Uttarakhand
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,4.35,5.43
2011,5.76,
2021,7.54,2.44


In [362]:
frame3.columns.name = "state"
frame3

state,Jharkhand,Uttarakhand
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,4.35,5.43
2011,5.76,
2021,7.54,2.44


- the `.to_numpy` method returns the data contained in the DataFrame as a 2D ndarray
   - if the DF's columns have different data types, the dtype of the returned array is chosen to accommodate all the columns

In [363]:
frame3.to_numpy()

array([[4.35, 5.43],
       [5.76,  nan],
       [7.54, 2.44]])

In [364]:
frame2.to_numpy()

array([[2006, 'Gujarat', 1.3, nan],
       [2001, 'Rajasthan', 1.5, 4.5],
       [2006, 'Uttarakhand', 1.8, nan],
       [2019, 'Tamil Nadu', 2.9, nan],
       [2011, 'Maipur', 3.4, 2.4],
       [2001, 'Jharkhand', 3.2, nan],
       [2019, 'Bihar', 1.2, nan]], dtype=object)

---------------

### **pd.Index** Objects

**Index objects** - for holding
   - axis labels [ including Dataframe's column names]
   - other metadata [ like axis name or names]

Any array or other sequence of labels used while constructing a Series or a DF is internally converted to an Index

In [365]:
obj.index

Index(['hi', 'hello', 'yes', 'no', 'here'], dtype='object')

In [366]:
obj.index[2:] # slicing index object

Index(['yes', 'no', 'here'], dtype='object')

- Index objects are immutable
   - thus safer to share among data structures
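A short sketch of what immutability means in practice, using a hypothetical three-label Index: item assignment on the Index fails, while replacing the whole index of a Series is allowed because it installs a new Index object.

```python
import pandas as pd

idx = pd.Index(["a", "b", "c"])

try:
    idx[1] = "z"            # item assignment is not supported on an Index
    mutated = True
except TypeError:
    mutated = False

# Replacing the whole index of an object is allowed
s = pd.Series([1, 2, 3], index=idx)
s.index = ["x", "y", "z"]

print(mutated, list(idx), list(s.index))
```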

In [367]:
labels = pd.Index(np.arange(4))
labels

Index([0, 1, 2, 3], dtype='int64')

In [368]:
obj2

w     4
s     7
d    -4
f    21
dtype: int64

In [369]:
obj2.index = labels
obj2

0     4
1     7
2    -4
3    21
dtype: int64

- Checking if an object has given index or not

In [370]:
obj2.index is labels

True

- Index behaves like 
   - an array
   - a fixed size set

In [371]:
frame3

state,Jharkhand,Uttarakhand
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2001,4.35,5.43
2011,5.76,
2021,7.54,2.44


In [372]:
frame3.columns

Index(['Jharkhand', 'Uttarakhand'], dtype='object', name='state')

In [373]:
'Jharkhand' in frame3.columns

True

In [374]:
2001 in frame.index

False

- a pandas Index can contain duplicate labels
   - selection with a duplicated label selects all occurrences of that label

In [375]:
pd.Index(['foo', 'foo', 'bar', 'bar'])

Index(['foo', 'foo', 'bar', 'bar'], dtype='object')

- Index has a number of methods and properties for set logic, which answer common questions about the data it contains

In [376]:
# append()

In [377]:
# difference()

In [378]:
# intersection()

In [379]:
# union()

In [380]:
# isin()

In [381]:
# delete()

In [382]:
# drop()

In [383]:
# insert()

In [384]:
# is_monotonic_increasing / is_monotonic_decreasing (properties; the older is_monotonic was removed in pandas 2.0)

In [385]:
# is_unique (a property, not a method)

In [386]:
# unique()
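The cells above list the names only. The sketch below exercises most of them on two small hypothetical Index objects, `left` and `right`, to show what each returns.

```python
import pandas as pd

left = pd.Index(["a", "b", "c"])
right = pd.Index(["b", "c", "d"])

both = left.union(right)             # labels present in either index
common = left.intersection(right)    # labels present in both
only_left = left.difference(right)   # labels in left but not in right
mask = left.isin(["a", "c"])         # boolean array, one entry per label

joined = left.append(right)          # concatenation, duplicates are kept
inserted = left.insert(1, "z")       # new Index with "z" at position 1
dropped = left.drop(["a"])           # new Index without the label "a"
deleted = left.delete(0)             # new Index without position 0

print(joined.is_unique, joined.unique(), left.is_monotonic_increasing)
```

Every call returns a new object; `left` and `right` themselves are never modified.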

----------------

## Essential Functionality

Mechanics of interacting with data in a Series or DataFrame

### Reindexing

Method on pandas objects
- creates a new object with the values rearranged to align with the new index, matching values through the old index labels
- Labels in the new index that have no value in the old one are assigned NaN

In [387]:
obj.index = ['b', 'c', 'a', 'f', 'd']
obj

b    2
c    3
a    5
f    8
d    3
dtype: int64

In [388]:
obj.reindex(["a", "b", "c", "d", "e", "f"])

a    5.0
b    2.0
c    3.0
d    3.0
e    NaN
f    8.0
dtype: float64

- Ordered data like time series may require filling values when reindexing
   - the `method` option allows for this
      - `ffill` - forward fill carries the last preceding value forward into index labels that have no value
         - for this the index must be *monotonically increasing or decreasing*

In [389]:
obj.index = [1, 3, 5, 7, 9]
obj.reindex(np.arange(7), method="ffill")

0    NaN
1    2.0
2    2.0
3    3.0
4    3.0
5    5.0
6    5.0
dtype: float64

- For a DataFrame, `reindex` can alter the rows, the columns, or both
- using the `index` keyword - or when given only a sequence - it reindexes the rows

In [390]:
frame = pd.DataFrame(np.arange(9).reshape((3,3)), index=['a', 'c', 'e'], columns=['Ohio', 'Texas', 'California'])
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
e,6,7,8


In [391]:
frame2 = frame.reindex(index = ["a", "b", "c", "d", "e"])
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,,,
e,6.0,7.0,8.0


In [392]:
# Giving values to newly formed row
frame2.loc["b"] = [2, 3, 4]
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,2.0,3.0,4.0
c,3.0,4.0,5.0
d,,,
e,6.0,7.0,8.0


   - Columns can be reindexed using the `columns` keyword

In [393]:
frame2.reindex(columns=["Texas", "California"]) # Ohio is not in the new column list, so its data is dropped

Unnamed: 0,Texas,California
a,1.0,2.0
b,3.0,4.0
c,4.0,5.0
d,,
e,7.0,8.0


- Another way to reindex a particular axis is to pass the new labels positionally and name the axis (columns/rows) with the `axis` keyword

In [394]:
frame2.reindex(["Texas", "California"], axis="columns")

Unnamed: 0,Texas,California
a,1.0,2.0
b,3.0,4.0
c,4.0,5.0
d,,
e,7.0,8.0


- Common arguments of the reindex functions

In [395]:
# labels

In [396]:
# index

In [397]:
# columns

In [398]:
# axis

In [399]:
# method

In [400]:
# fill_value

In [401]:
# limit

In [402]:
# tolerance

In [403]:
# level

In [404]:
# copy
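Two of the arguments listed above, `fill_value` and `limit`, are easy to confuse with `method`. This minimal sketch, on a hypothetical two-element Series, shows a constant fill next to a forward fill that is capped at two consecutive labels.

```python
import pandas as pd

s = pd.Series([10.0, 20.0], index=[0, 4])

# fill_value: constant used for labels that are new in the target index
filled = s.reindex(range(6), fill_value=0.0)

# method + limit: forward fill, but at most 2 consecutive new labels
limited = s.reindex(range(6), method="ffill", limit=2)

print(filled)
print(limited)
```

With `limit=2`, labels 1 and 2 take the value from label 0, while label 3 is left as NaN because it is the third consecutive gap.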

- reindexing can also be performed using the `.loc` operator
   - this works only if all the new index labels already exist in the DF
   - whereas `.reindex` inserts the missing labels (with NaN data)
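The difference between the two approaches can be sketched as follows, on a hypothetical single-column frame: `.loc` with a list of labels reorders existing rows but raises when a label is missing, while `reindex` adds the missing row.

```python
import pandas as pd

df = pd.DataFrame({"v": [1, 2, 3]}, index=["a", "c", "e"])

reordered = df.loc[["e", "a"]]        # fine: both labels exist

try:
    df.loc[["a", "b"]]                # "b" is not in the index
    raised = False
except KeyError:
    raised = True

expanded = df.reindex(["a", "b"])     # "b" is added with NaN data

print(raised)
print(expanded)
```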

----------

### Dropping Entries from an Axis

Methods
- Using an index array or list without the entries - through `reindex` or `.loc` based indexing
- `drop` method - returns a new object with the given values deleted from an axis

In [405]:
# FOR SERIES
obj = pd.Series(np.arange(6.), index=['a','b','c','d','e','f'])
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [406]:
obj.drop('b')

a    0.0
c    2.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [407]:
obj.drop(['f','c'])

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [408]:
# FOR DATAFRAME
data = pd.DataFrame(np.arange(16).reshape((4,4)), index=['Ohio', 'Colorado', 'Utah', 'New York'], columns=['one', 'two', 'three', 'four'])
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [409]:
data.drop(index=['Colorado', 'Utah']) # to drop row values - use INDEX keyword

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
New York,12,13,14,15


In [410]:
data.drop(columns=['three']) # to drop column values - use COLUMNS keyword

Unnamed: 0,one,two,four
Ohio,0,1,3
Colorado,4,5,7
Utah,8,9,11
New York,12,13,15


In [411]:
data.drop("two", axis=1) # "columns" as value to axis keyword will work too

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [412]:
data.drop("Ohio", axis=0) # "rows" as value to axis keyword will work too

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


-----------

### Indexing, Selection and Filtering

`obj[...]` notation works for indexing a Series - with integer positions as well as index labels

In [413]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [414]:
obj["f"]

np.float64(5.0)

In [415]:
obj[2] # DEPRECATED: integer key used as a position on a non-integer index; prefer obj.iloc[2]

np.float64(2.0)

In [416]:
obj[2:4] # for range of data

c    2.0
d    3.0
dtype: float64

In [417]:
obj[["a","c","f"]] # selecting data by labels

a    0.0
c    2.0
f    5.0
dtype: float64

In [418]:
obj[[1,4]] # also deprecated positional access; prefer obj.iloc[[1, 4]]

b    1.0
e    4.0
dtype: float64

In [419]:
obj[obj<2]

a    0.0
b    1.0
dtype: float64

- For selecting data by label, the preferred way is to use the `loc` operator
   - because `[]` treats integers differently depending on the index
      - if the index contains integers, `[number]` is treated as a label

- Treating integer keys as positions (as in `obj[2]` above) is deprecated, so use `iloc` for position-based selection

- `.loc` VS `.iloc` - indexing by labels vs indexing by integer position

- slicing with *labels* using `.loc[]` - HERE the endpoint is inclusive

In [420]:
obj.loc["b":"d"]

b    1.0
c    2.0
d    3.0
dtype: float64

In [421]:
obj.loc["b":"d"] = 2
obj

a    0.0
b    2.0
c    2.0
d    2.0
e    4.0
f    5.0
dtype: float64

- `loc` and `iloc` are indexed with square brackets rather than called like functions
- Square bracket notation is used to
   - enable slice operations
   - allow indexing on multiple axes with DFs


In [422]:
# For DATAFRAMES
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [423]:
# data[column]
data["two"]

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [424]:
# data[list of cols]
data[["two","one"]]

Unnamed: 0,two,one
Ohio,1,0
Colorado,5,4
Utah,9,8
New York,13,12


Special cases of DF indexing using square bracket
- Slicing


In [425]:
# data[ from row : till row ] slicing
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


- Boolean Array

In [426]:
# data[condition EG -> data[column] > 5]
data[data["one"]> 5]

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


- the row selection syntax `data[:2]` is a special case; otherwise `data[col]` selects columns

- assignment through indexing with a scalar comparison
   - sets every value that satisfies the condition to the assigned value

In [427]:
data[data <= 5] = 0
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,0,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [428]:
# df[cols]

- Using `iloc` and `loc` with dataframes 

In [429]:
# df.loc[rows]

In [430]:
# df.loc[:, cols]

In [431]:
# df.loc[rows, cols]

In [432]:
# df.iloc[rows]

In [433]:
# df.iloc[:, cols]

In [434]:
# df.iloc[rows, cols]
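The patterns listed in the cells above can be sketched on a frame shaped like the earlier `data` example. The frame is rebuilt here under the hypothetical name `df` so that the block runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(16).reshape((4, 4)),
                  index=["Ohio", "Colorado", "Utah", "New York"],
                  columns=["one", "two", "three", "four"])

row = df.loc["Colorado"]                              # one row, as a Series
cols = df.loc[:, ["one", "three"]]                    # all rows, two columns
block = df.loc[["Ohio", "Utah"], ["two", "three"]]    # rows and columns by label
label_slice = df.loc["Ohio":"Utah", "two"]            # label slice, endpoint included

pos_block = df.iloc[[0, 2], [1, 2]]                   # same block, by position
pos_slice = df.iloc[:2, :3]                           # position slice, endpoint excluded

print(block.equals(pos_block))
```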

- `at`, `iat` and `reindex` methods

In [435]:
# df.at[row, col]

In [436]:
# df.iat[row, col]

In [437]:
# reindex method
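`at` and `iat` access a single scalar, by label and by integer position respectively. A minimal sketch on a hypothetical 2x2 frame:

```python
import pandas as pd

df = pd.DataFrame({"one": [0, 8], "two": [1, 9]}, index=["Ohio", "Utah"])

by_label = df.at["Utah", "two"]     # scalar by row label and column label
by_position = df.iat[1, 1]          # scalar by row position and column position

df.at["Ohio", "one"] = 100          # both accessors also support assignment

print(by_label, by_position)
print(df)
```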

Pitfalls of different indexing methods

- INTEGER INDEXING

   - different from built-in Python data structures
      - eg: `ser[-1]` raises an error on an integer-labelled index, because -1 is looked up as a label and may not be one
   - for an axis index containing integers
      - data selection with `[]` is label-oriented
   - slicing with integers is always position-oriented

   - SO `iloc` and `loc` should be preferred

- CHAINED INDEXING

   - Avoid chained indexing when assigning
      - the assignment often lands on a temporary object instead of the original data
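Both pitfalls can be sketched briefly. The Series and frame below are hypothetical; the chained form is shown only in a comment because its effect depends on the pandas version.

```python
import pandas as pd

# Integer-labelled index: [] looks up labels, so -1 is not "the last element"
ser = pd.Series([10.0, 20.0, 30.0], index=[5, 6, 7])
try:
    ser[-1]
    key_error = False
except KeyError:
    key_error = True

last = ser.iloc[-1]        # unambiguous: position
first = ser.loc[5]         # unambiguous: label

# Chained indexing when assigning, e.g. df[df["a"] > 1]["b"] = 9,
# may write to a temporary object. A single .loc call avoids that:
df = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0]})
df.loc[df["a"] > 1, "b"] = 9

print(key_error, last, first)
print(df)
```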

------------------

### Arithmetic and Data Alignment

- Working with objects that have different indexes
   - where the index pairs are not the same - the index of the result is the union of the index pairs

In [438]:
# For Series, alignment is performed on the index
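The Series cell above was left empty; a minimal sketch with two hypothetical Series whose indexes overlap only on `"y"` could look like this.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0], index=["y", "w"])

total = a + b   # aligned on the union of both indexes

print(total)
```

Only `"y"` is present in both operands, so it is the only label with a non-NaN result.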

In [439]:
# For DFs, alignment is performed on both the rows and the columns
s1 = pd.DataFrame(np.arange(16.).reshape((4,4)), index=['a','b','c','d'])
s2 = pd.DataFrame(np.arange(25.).reshape((5,5)), index=['a','e','c','f','d'])

In [440]:
s1

Unnamed: 0,0,1,2,3
a,0.0,1.0,2.0,3.0
b,4.0,5.0,6.0,7.0
c,8.0,9.0,10.0,11.0
d,12.0,13.0,14.0,15.0


In [441]:
s2

Unnamed: 0,0,1,2,3,4
a,0.0,1.0,2.0,3.0,4.0
e,5.0,6.0,7.0,8.0,9.0
c,10.0,11.0,12.0,13.0,14.0
f,15.0,16.0,17.0,18.0,19.0
d,20.0,21.0,22.0,23.0,24.0


In [442]:
s1 + s2

Unnamed: 0,0,1,2,3,4
a,0.0,2.0,4.0,6.0,
b,,,,,
c,18.0,20.0,22.0,24.0,
d,32.0,34.0,36.0,38.0,
e,,,,,
f,,,,,


- Missing values are introduced where the indexes do not overlap
- They propagate to further computations
- For DFs both the row and the column label must be present in both operands to yield a value

##### Arithmetic Methods With Fill Values

To fill missing values with a special value
 - the arithmetic methods are used with the `fill_value` keyword
      - The result may still contain NaN: `fill_value` substitutes for a value missing from only one operand, so a position missing from both stays NaN

In [465]:
# to add (radd computes s2 + s1)
# b/4 stays NaN: row b exists only in s1 and column 4 only in s2, so neither operand has a value there to fill
s1.radd(s2, fill_value=0.0)

Unnamed: 0,0,1,2,3,4
a,0.0,2.0,4.0,6.0,4.0
b,4.0,5.0,6.0,7.0,
c,18.0,20.0,22.0,24.0,14.0
d,32.0,34.0,36.0,38.0,24.0
e,5.0,6.0,7.0,8.0,9.0
f,15.0,16.0,17.0,18.0,19.0


- Methods for each operations

In [466]:
# add()/radd()

In [467]:
# sub()/rsub()

In [468]:
# div()/rdiv()

In [469]:
# floordiv()/rfloordiv()

In [470]:
# mul()/rmul()

In [471]:
# pow()/rpow()
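Each method in the list has an `r`-prefixed counterpart with the arguments flipped. A short sketch on a hypothetical two-column frame:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0], "b": [4.0, 8.0]})

recip_operator = 1 / df        # operator form
recip_method = df.rdiv(1)      # same computation: 1 / df

diff = df.sub(1)               # df - 1
rdiff = df.rsub(1)             # 1 - df
floored = df.floordiv(3)       # df // 3

print(recip_operator.equals(recip_method))
print(rdiff)
```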

------------------

##### Operations between DataFrame and Series

By default, the index of the Series is matched against the columns of the DataFrame
 - the operation is broadcast down the rows
 - if a label is missing from either the Series index or the DataFrame columns
    - the objects are reindexed to the union of the labels, and the unmatched positions are NaN

In [485]:
series = pd.Series(np.arange(6.), index=['a','b','c','d','e','f'])
dataframe = pd.DataFrame(np.arange(16.).reshape((4,4)), columns=['a','l','c','m'])

In [484]:
series

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [488]:
dataframe

Unnamed: 0,a,l,c,m
0,0.0,1.0,2.0,3.0
1,4.0,5.0,6.0,7.0
2,8.0,9.0,10.0,11.0
3,12.0,13.0,14.0,15.0


In [487]:
dataframe + series

Unnamed: 0,a,b,c,d,e,f,l,m
0,0.0,,4.0,,,,,
1,4.0,,8.0,,,,,
2,8.0,,12.0,,,,,
3,12.0,,16.0,,,,,


- To broadcast over the columns instead, matching on the rows -
   - use an arithmetic method with the keyword `axis="index"`

In [493]:
series0 = dataframe['l']
dataframe.sub(series0, axis="index")

Unnamed: 0,a,l,c,m
0,-1.0,0.0,1.0,2.0
1,-1.0,0.0,1.0,2.0
2,-1.0,0.0,1.0,2.0
3,-1.0,0.0,1.0,2.0


---------------

### Function Application and Mapping

- NumPy `ufuncs` (element-wise array methods) work with DFs and Series

 - to apply a function that operates on one-dimensional arrays to each row or column
    - the `apply()` method is used

In [494]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
e,6,7,8


In [521]:
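frame = np.cos(frame) # ufunc applied element-wise; this cell reassigns frame, and the output below equals cos(cos(x)), i.e. the cell had been run twice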
frame

Unnamed: 0,Ohio,Texas,California
a,0.540302,0.857553,0.914653
c,0.548696,0.793873,0.960037
e,0.57338,0.729023,0.989434


In [522]:
frame.apply(lambda x:  (x.max() - x.min())) # here the function is invoked once on each column in frame

Ohio          0.033078
Texas         0.128530
California    0.074780
dtype: float64

In [523]:
frame.apply(lambda x:  x.max() - x.min(), axis="columns") # here the function is invoked once on each row

a    0.374351
c    0.411341
e    0.416053
dtype: float64

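- Common statistical operations such as `sum` and `mean` are already DataFrame methods, so `apply()` is often unnecessary

- the function passed to `apply()` can also return a Series

- functions can be applied element-wise too
   - using `map()` for DFs and Series (`DataFrame.map` was called `applymap` in older pandas versions)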

In [524]:
frame.map(lambda x: f"{x:.2f}")

Unnamed: 0,Ohio,Texas,California
a,0.54,0.86,0.91
c,0.55,0.79,0.96
e,0.57,0.73,0.99


-------------------------

### Sorting and Ranking

To sort lexicographically by row or column label
   - `sort_index()`

In [530]:
s1.index = ['c','x','a','r']
s1

Unnamed: 0,0,1,2,3
c,0.0,1.0,2.0,3.0
x,4.0,5.0,6.0,7.0
a,8.0,9.0,10.0,11.0
r,12.0,13.0,14.0,15.0


In [533]:
s1.sort_index() # for sorting columns use axis="columns"

Unnamed: 0,0,1,2,3
a,8.0,9.0,10.0,11.0
c,0.0,1.0,2.0,3.0
r,12.0,13.0,14.0,15.0
x,4.0,5.0,6.0,7.0


- default ordering is using ascending order 
   - can be changed using `ascending=False` keyword

To sort by Value 
   - `sort_values()`

In [536]:
# Series
series

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
f    5.0
dtype: float64

In [535]:
series.sort_values(ascending=False)

f    5.0
e    4.0
d    3.0
c    2.0
b    1.0
a    0.0
dtype: float64

 - NaN values are placed last by default and can be brought first with the `na_position="first"` keyword
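A minimal sketch of the NaN placement, on a hypothetical three-element Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0])

nan_last = s.sort_values()                       # default: NaN at the end
nan_first = s.sort_values(na_position="first")   # NaN moved to the front

print(nan_last)
print(nan_first)
```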

In [537]:
# DataFrame
frame

Unnamed: 0,Ohio,Texas,California
a,0.540302,0.857553,0.914653
c,0.548696,0.793873,0.960037
e,0.57338,0.729023,0.989434


In [545]:
frame.sort_values(["Ohio", "California"]) # sort by multiple columns; a single column name also works

Unnamed: 0,Ohio,Texas,California
a,0.540302,0.857553,0.914653
c,0.548696,0.793873,0.960037
e,0.57338,0.729023,0.989434


**Ranking** - Assigns ranks from one to the number of valid data points in an array
   - starts from the lowest value
   - for DFs ranks can be computed over rows or over columns

In [554]:
series = pd.Series([7,-4,1,5,-2,5,-6,0])
series.rank() # by default, tied values share the average of their ranks

0    8.0
1    2.0
2    5.0
3    6.5
4    3.0
5    6.5
6    1.0
7    4.0
dtype: float64

In [557]:
series.rank(method="first") # ties are ranked in the order they appear in the data

0    8.0
1    2.0
2    5.0
3    6.0
4    3.0
5    7.0
6    1.0
7    4.0
dtype: float64

In [558]:
frame

Unnamed: 0,Ohio,Texas,California
a,0.540302,0.857553,0.914653
c,0.548696,0.793873,0.960037
e,0.57338,0.729023,0.989434


In [563]:
frame.rank(axis="rows")

Unnamed: 0,Ohio,Texas,California
a,1.0,3.0,1.0
c,2.0,2.0,2.0
e,3.0,1.0,3.0


- methods for breaking ties when ranking
   - `"average"` - DEFAULT - assigns the average rank to each entry in a group of equal values
   - `"min"` - Use the minimum rank for the whole group
   - `"max"` - Use the maximum rank for the whole group
   - `"first"` - Assign ranks in the order of occurrence in the data
   - `"dense"` - like "min" but the rank increases by 1 between groups rather than by the number of equal elements in a group
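To compare the tie-breaking methods side by side, the sketch below ranks the same values used in the cells above with each method and collects the results in one table (the name `table` is only for illustration).

```python
import pandas as pd

s = pd.Series([7, -4, 1, 5, -2, 5, -6, 0])

methods = ["average", "min", "max", "first", "dense"]
table = pd.DataFrame({m: s.rank(method=m) for m in methods})

print(table)
```

The two tied values (positions 3 and 5) are where the columns differ, and with `"dense"` the largest value gets rank 7 rather than 8.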

-----

### Axis Indexes with Duplicate Labels

- `reindex()` requires unique labels
- for Series and DFs that may have duplicate labels
   - the `.index.is_unique` property tells whether the index labels are unique

In [566]:
series.index = ['a','b','c','d','b','c','d','s']

In [567]:
series

a    7
b   -4
c    1
d    5
b   -2
c    5
d   -6
s    0
dtype: int64

In [569]:
series.index.is_unique

False

In [571]:
series["b"] 
# Returns a series rather than a value
# makes code more complicated

b   -4
b   -2
dtype: int64

-----------

## Summarizing and Computing Descriptive Statistics

pandas objects include common mathematical and statistical methods
 - Mostly for *reductions or summary statistics*
    - these extract a single value from a Series, or a Series of values from the rows or columns of a DF
    - they handle missing data

In [575]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,0,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [573]:
# sum() method returns a Series of column sums
data.sum()

one      20
two      22
three    30
four     33
dtype: int64

In [576]:
data.sum(axis='columns') # Row sum

Ohio         0
Colorado    13
Utah        38
New York    54
dtype: int64

In [586]:
(s1+s2).sum(axis=1, skipna=True)

a    44.0
c    52.0
d     0.0
e     0.0
f     0.0
r     0.0
x     0.0
dtype: float64

- NA values are skipped by default, so an all-NA row or column sums to zero
- this can be disabled with the `skipna=False` option
- `mean()` requires at least one non-NA value to yield a result
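The effect of `skipna` can be sketched on a small hypothetical frame that contains one all-NaN row. The `min_count` argument of `sum` is included as an aside: it makes an all-NaN row sum to NaN instead of zero.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [np.nan, np.nan]},
                  index=["r1", "r2"])

row_sums = df.sum(axis="columns")                    # NaN skipped; all-NaN row gives 0.0
strict = df.sum(axis="columns", skipna=False)        # any NaN makes the result NaN
row_means = df.mean(axis="columns")                  # all-NaN row gives NaN
guarded = df.sum(axis="columns", min_count=1)        # require at least one valid value

print(row_sums, strict, row_means, guarded, sep="\n")
```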

In [589]:
(s1+s2).mean()

0     9.0
1    11.0
2    13.0
3    15.0
4     NaN
dtype: float64

- `level` keyword - reduces grouped by level if the axis is hierarchically indexed (MultiIndex); recent pandas versions removed it in favour of `groupby(level=...)`

- `df.idxmax()` and `df.idxmin()` return indirect statistics - the index label where the maximum and minimum values occur

- accumulation methods - `cumsum()` - cumulative sum

In [590]:
data.cumsum()

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,0,6,7
Utah,8,9,16,18
New York,20,22,30,33


- neither reduction nor accumulation - `describe` - produces multiple summary statistics
   - for non-numeric data it produces alternative summary stats

In [595]:
data.describe() # NUMERIC

Unnamed: 0,one,two,three,four
count,4.0,4.0,4.0,4.0
mean,5.0,5.5,7.5,8.25
std,6.0,6.557439,5.972158,6.396614
min,0.0,0.0,0.0,0.0
25%,0.0,0.0,4.5,5.25
50%,4.0,4.5,8.0,9.0
75%,9.0,10.0,11.0,12.0
max,12.0,13.0,14.0,15.0


In [594]:
pd.Series(["a","b","c","d","e","f","a","b","c","d"]).describe() # NON NUMERIC

count     10
unique     6
top        a
freq       2
dtype: object

- Some methods 

In [596]:
# count()

In [597]:
# argmin, argmax - for Series, the integer position of the min / max value

In [598]:
# quantile - compute sample quantile, ranging from 0 to 1

In [599]:
# median

In [601]:
# mad - Mean Absolute Deviation from mean value (removed in pandas 2.0)

In [602]:
# prod - product of all values

In [603]:
# var - Variance

In [604]:
# std - standard deviation

In [605]:
# skew - Sample skewness (third moment of values)

In [606]:
# kurt - Sample kurtosis (fourth moment of values)

In [607]:
# cumsum - Cumulative sum

In [608]:
# cumprod - Cumulative product

In [609]:
# diff - first arithmetic difference - For TIME SERIES

In [610]:
# pct_change - Compute percent change
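A few of the listed methods, sketched on hypothetical data (two columns shaped like the earlier `data` frame and a short price-like Series):

```python
import pandas as pd

df = pd.DataFrame({"one": [0, 0, 8, 12], "two": [0, 0, 9, 13]},
                  index=["Ohio", "Colorado", "Utah", "New York"])

max_labels = df.idxmax()             # index label of the maximum, per column
max_position = df["one"].argmax()    # integer position of the maximum
median_one = df["one"].quantile(0.5) # 0.5 quantile, same as the median
steps = df["one"].diff()             # difference from the previous row

prices = pd.Series([100.0, 110.0, 99.0])
changes = prices.pct_change()        # fractional change from the previous value
running = pd.Series([1, 2, 3]).cumprod()

print(max_labels, steps, changes, running, sep="\n")
```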

---------------

#### Correlation and Covariance

Example Summary

- read binary Python pickle files
   - `pd.read_pickle("/file/location/")`

- `.corr()` & `.cov()` methods of Series
   
   - on one series to another
      - `dataframe["Column1"].corr(dataframe["Column2"])`
      - gives single value

   - on DataFrame
      - gives correlation or covariance matrix as a DF

- the `.corrwith()` method performs pair-wise correlations between a DF's columns or rows and another Series or DataFrame
   - for a Series as input - returns the corr value for each column
   - for a DF as input - returns the corr of matching column names
      - passing `axis="columns"` computes corr row by row - data is aligned by label first
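The summary above can be illustrated without the pickle file. This sketch uses a small hypothetical frame whose columns are exact linear functions of `x`, so the expected correlations are known in advance.

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
df = pd.DataFrame({"x": x, "y": 2 * x + 1, "z": -x})

pair_corr = df["x"].corr(df["y"])    # single value for one pair of Series
pair_cov = df["x"].cov(df["y"])      # single value; here 2 * var(x)

corr_matrix = df.corr()              # full correlation matrix as a DataFrame
with_x = df.corrwith(df["x"])        # correlation of every column with x

print(pair_corr, pair_cov)
print(corr_matrix)
print(with_x)
```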

-------------

#### Unique Values, Value Counts, and Membership

Extracting information from a 1-D Series:

- `unique` - array of the unique values in a Series
    - not necessarily sorted; `uniques.sort()` sorts the returned array in place

- `value_counts` - computes a Series containing value frequencies
    - sorted by count in descending order
    - also available as a top-level pandas function (`pd.value_counts`, deprecated in recent versions) - usable with NumPy arrays or Python sequences

- `isin` - vectorized set membership check - to filter a dataset down to a subset of values

- `Index(unique_vals).get_indexer(to_match)` method - gives an index array from an array of possibly non-distinct values into another array of distinct values

- `data.apply(pd.value_counts).fillna(0)` - to compute value counts for all columns

- `DataFrame.value_counts()` - gives the number of occurrences of each distinct row, treating each row as a tuple
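The methods above can be sketched together on hypothetical data. The per-column count uses a lambda around the Series method instead of the top-level `pd.value_counts`, which is an assumption made here to avoid the deprecated function.

```python
import pandas as pd

s = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

uniques = sorted(s.unique())          # unique values, sorted explicitly
counts = s.value_counts()             # frequency of each value
subset = s[s.isin(["b", "c"])]        # keep only rows whose value is b or c

# Position of each value of to_match inside the distinct values
to_match = pd.Series(["c", "a", "b", "b", "c", "a"])
unique_vals = pd.Index(["c", "b", "a"])
positions = unique_vals.get_indexer(to_match)

df = pd.DataFrame({"a": [1, 1, 2], "b": [0, 0, 1]})
per_column = df.apply(lambda col: col.value_counts()).fillna(0)
row_counts = df.value_counts()        # counts of distinct rows

print(counts, positions, per_column, row_counts, sep="\n")
```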

----------------------