<a href="https://colab.research.google.com/github/gundaminpde/2023-spring/blob/main/01_3_pandas_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is a near duplicate of a tutorial by Dr. Tirthajyoti Sarkar, Fremont, CA, 94536. The original is available at https://github.com/tirthajyoti/Machine-Learning-with-Python/blob/master/Pandas%20and%20Numpy/Pandas_Operations.ipynb

# **Series**

# Loading packages and initializations

In [1]:
import numpy as np
import pandas as pd

labels = ['a','b','c']
my_data = [10,20,30]

arr = np.array(my_data)

d = {'a':10,'b':20,'c':30}

print ("Labels:", labels)
print("My data:", my_data)
print("Dictionary:", d)

Labels: ['a', 'b', 'c']
My data: [10, 20, 30]
Dictionary: {'a': 10, 'b': 20, 'c': 30}


# Creating a Series (Pandas class)

In [2]:
pd.Series(data=my_data) # Output looks very similar to a NumPy array

0    10
1    20
2    30
dtype: int64

In [3]:
pd.Series(data=my_data, index=labels) # Note the extra information about index

a    10
b    20
c    30
dtype: int64

In [4]:
# Inputs are in order of the expected parameters (not explicitly named), NumPy array is used for data
pd.Series(arr, labels)

a    10
b    20
c    30
dtype: int64

In [5]:
pd.Series(d) # Using a pre-defined Dictionary object


a    10
b    20
c    30
dtype: int64
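
When a dictionary is combined with an explicit `index`, pandas uses the index to select and order the dictionary's keys, and a label that is not a key comes out as NaN. The short sketch below illustrates this; the name `subset` and the label `'z'` are made up for the illustration and do not appear in the original tutorial.

```python
import pandas as pd

d = {'a': 10, 'b': 20, 'c': 30}

# The index picks keys from the dictionary and fixes their order.
# 'z' is a hypothetical label that is not a key of d, so its value is NaN.
subset = pd.Series(d, index=['c', 'a', 'z'])
print(subset)

# NaN is a float, so the whole Series is upcast to a float dtype.
print(subset.dtype)
```

This is one reason an integer Series can silently become a float Series after a reindexing step.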

# What type of values can a Pandas Series hold?

In [6]:
print ("\nHolding numerical data\n",'-'*25, sep='')
print(pd.Series(arr))

print ("\nHolding text labels\n",'-'*20, sep='')
print(pd.Series(labels))

print ("\nHolding functions\n",'-'*20, sep='')
print(pd.Series(data=[sum,print,len]))

print ("\nHolding objects from a dictionary\n",'-'*40, sep='')
print(pd.Series(data=[d.keys, d.items, d.values]))


Holding numerical data
-------------------------
0    10
1    20
2    30
dtype: int64

Holding text labels
--------------------
0    a
1    b
2    c
dtype: object

Holding functions
--------------------
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object

Holding objects from a dictionary
----------------------------------------
0    <built-in method keys of dict object at 0x7f1d...
1    <built-in method items of dict object at 0x7f1...
2    <built-in method values of dict object at 0x7f...
dtype: object
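
The `dtype` line in each output above reflects how pandas infers a common type for the values. The sketch below, using made-up toy values, shows three typical cases: all integers, integers mixed with a float, and unrelated Python objects.

```python
import pandas as pd

ints = pd.Series([1, 2, 3])            # all integers -> an integer dtype
mixed_num = pd.Series([1, 2.5, 3])     # integers and a float -> float64
mixed_obj = pd.Series([1, 'a', len])   # unrelated Python objects -> object

for name, s in [('ints', ints), ('mixed_num', mixed_num), ('mixed_obj', mixed_obj)]:
    print(name, '->', s.dtype)
```

An `object` Series can hold anything, but vectorized numerical operations generally need one of the numeric dtypes.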


# Indexing and slicing

In [9]:

ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1,2,5,4],['CA', 'OR', 'NV', 'AZ'])

print ("\nIndexing by name of the item/object (string identifier)\n",'-'*56, sep='')
print("Value for CA in ser1:", ser1['CA'])
print("Value for AZ in ser1:", ser1['AZ'])
print("Value for NV in ser2:", ser2['NV'])

print ("\nIndexing by number (positional value in the series)\n",'-'*52, sep='')
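# .iloc selects by position; plain ser1[0] on a string-labelled Series is deprecated in recent pandas
print("Value for CA in ser1:", ser1.iloc[0])
print("Value for AZ in ser1:", ser1.iloc[3])
print("Value for NV in ser2:", ser2.iloc[2])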

print ("\nIndexing by a range\n",'-'*25, sep='')
print ("Value for OR, CO, and AZ in ser1:\n", ser1[1:4], sep='')


print ("\nHead of ser1\n",'-'*25, sep='')
print ("Head of ser1:\n", ser1.head(), sep='')


Indexing by name of the item/object (string identifier)
--------------------------------------------------------
Value for CA in ser1: 1
Value for AZ in ser1: 4
Value for NV in ser2: 5

Indexing by number (positional value in the series)
----------------------------------------------------
Value for CA in ser1: 1
Value for AZ in ser1: 4
Value for NV in ser2: 5

Indexing by a range
-------------------------
Value for OR, CO, and AZ in ser1:
OR    2
CO    3
AZ    4
dtype: int64

Head of ser1
-------------------------
Head of ser1:
CA    1
OR    2
CO    3
AZ    4
dtype: int64
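
Range indexing behaves differently depending on whether the range is given as positions or as labels. The sketch below reuses `ser1` from the cell above; the names `by_position` and `by_label` are introduced only for this illustration.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['CA', 'OR', 'CO', 'AZ'])

# Position slice: positions 1 and 2; the stop position 3 is excluded.
by_position = ser1.iloc[1:3]

# Label slice: labels 'OR' through 'CO'; the stop label is included.
by_label = ser1.loc['OR':'CO']

print(by_position)
print(by_label)
```

Both slices select the same two items here, even though one stops "before 3" and the other stops "at CO".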


# Adding/Merging two series with common indices

In [10]:
ser1 = pd.Series([1,2,3,4],['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1,2,5,4],['CA', 'OR', 'NV', 'AZ'])
ser3 = ser1+ser2

print ("\nAfter adding the two series, the result looks like this...\n",'-'*59, sep='')
print(ser3)

print("\npandas adds values where both Series share an index label, and puts NaN where a label is missing from either one\n")

print ("\nThe idea works even for multiplication...\n",'-'*43, sep='')
print (ser1*ser2)

print ("\nOr even for combination of mathematical operations!\n",'-'*53, sep='')
print (np.exp(ser1)+np.log10(ser2))


After adding the two series, the result looks like this...
-----------------------------------------------------------
AZ    8.0
CA    2.0
CO    NaN
NV    NaN
OR    4.0
dtype: float64

pandas adds values where both Series share an index label, and puts NaN where a label is missing from either one


The idea works even for multiplication...
-------------------------------------------
AZ    16.0
CA     1.0
CO     NaN
NV     NaN
OR     4.0
dtype: float64

Or even for combination of mathematical operations!
-----------------------------------------------------
AZ    55.200210
CA     2.718282
CO          NaN
NV          NaN
OR     7.690086
dtype: float64
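
If the NaN entries are not wanted, one option is the `Series.add` method with `fill_value`, which substitutes a default for a label that is missing from one of the two Series. This is a minimal sketch of that alternative, not something the original tutorial uses; the name `total` is introduced here.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], ['CA', 'OR', 'CO', 'AZ'])
ser2 = pd.Series([1, 2, 5, 4], ['CA', 'OR', 'NV', 'AZ'])

# Treat a label that is missing from one Series as 0 instead of producing NaN.
total = ser1.add(ser2, fill_value=0)
print(total)
```

With this call, CO keeps its value from `ser1` and NV keeps its value from `ser2`. Whether 0 is a sensible default depends on what the data means.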


# **DataFrame** 

In [12]:
from numpy.random import randn as rn


# Creating and accessing DataFrame

*   Indexing
*   Adding and deleting rows and columns
*   Subsetting DataFrame



In [13]:
np.random.seed(101)

matrix_data = rn(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']

df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)
print("\nThe data frame looks like\n",'-'*45, sep='')
print(df)


The data frame looks like
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509


In [19]:
df.head(n=3)

          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001


In [20]:
df.tail(n=3)

          W         X         Y         Z
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509
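
A DataFrame does not have to start from a NumPy matrix. As a small sketch with hypothetical toy data, it can also be built from a dictionary in which each key becomes a column; the names `data` and `small_df` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical toy data: each key becomes a column, each list holds its values.
data = {'W': [1, 2, 3], 'X': [4.0, 5.0, 6.0]}
small_df = pd.DataFrame(data, index=['A', 'B', 'C'])

print(small_df)
print(small_df.shape)    # (number of rows, number of columns)
print(small_df.dtypes)   # each column keeps its own dtype
```

Unlike a NumPy matrix, the columns of a DataFrame may have different dtypes.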


# Indexing and slicing (columns)

In [14]:
print("\nThe 'X' column\n",'-'*25, sep='')
print(df['X'])
print("\nType of the column: ", type(df['X']), sep='')
print("\nThe 'X' and 'Z' columns indexed by passing a list\n",'-'*55, sep='')
print(df[['X','Z']])
print("\nType of the pair of columns: ", type(df[['X','Z']]), sep='')
print ("\nSo, for more than one column, the object turns into a DataFrame")



The 'X' column
-------------------------
A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64

Type of the column: <class 'pandas.core.series.Series'>

The 'X' and 'Z' columns indexed by passing a list
-------------------------------------------------------
          X         Z
A  0.628133  0.503826
B -0.319318  0.605965
C  0.740122 -0.589001
D -0.758872  0.955057
E  1.978757  0.683509

Type of the pair of columns: <class 'pandas.core.frame.DataFrame'>

So, for more than one column, the object turns into a DataFrame


In [15]:
print("\nThe 'X' column accessed by DOT method (NOT recommended)\n",'-'*55, sep='')
print(df.X)


The 'X' column accessed by DOT method (NOT recommended)
-------------------------------------------------------
A    0.628133
B   -0.319318
C    0.740122
D   -0.758872
E    1.978757
Name: X, dtype: float64
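
One reason the dot form is not recommended is that it fails quietly when a column name clashes with an existing DataFrame attribute, and it cannot express names that are not valid Python identifiers. The sketch below uses a hypothetical frame `df_demo` whose column names were chosen to trigger both problems.

```python
import pandas as pd

# Hypothetical column names chosen to trigger the problem.
df_demo = pd.DataFrame({'count': [3, 5], 'unit price': [1.5, 2.0]})

# Bracket access always returns the column as a Series.
print(df_demo['count'])

# Dot access finds the DataFrame.count method instead of the column.
print(callable(df_demo.count))

# A name containing a space can only be reached with brackets.
print(df_demo['unit price'])
```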


# Creating and deleting a (new) column (or row)

In [16]:
print("\nA column is created by assigning it in relation to an existing column\n",'-'*75, sep='')

df['New'] = df['X']+df['Z']
df['New (Sum of X and Z)'] = df['X']+df['Z']

print(df)


print("\nA column is dropped by using df.drop() method\n",'-'*55, sep='')
df = df.drop('New', axis=1) # axis=1 targets columns; the default, axis=0, targets rows
print(df)
df1 = df.drop('A') # axis=0 by default, so 'A' is looked up among the row labels
print("\nA row (index) is dropped by using df.drop() method and axis=0\n",'-'*65, sep='')
print(df1)
print("\nAn in-place change can be done by making inplace=True in the drop method\n",'-'*75, sep='')
df.drop('New (Sum of X and Z)', axis=1, inplace=True)
print(df)


A column is created by assigning it in relation to an existing column
---------------------------------------------------------------------------
          W         X         Y         Z       New  New (Sum of X and Z)
A  2.706850  0.628133  0.907969  0.503826  1.131958              1.131958
B  0.651118 -0.319318 -0.848077  0.605965  0.286647              0.286647
C -2.018168  0.740122  0.528813 -0.589001  0.151122              0.151122
D  0.188695 -0.758872 -0.933237  0.955057  0.196184              0.196184
E  0.190794  1.978757  2.605967  0.683509  2.662266              2.662266

A column is dropped by using df.drop() method
-------------------------------------------------------
          W         X         Y         Z  New (Sum of X and Z)
A  2.706850  0.628133  0.907969  0.503826              1.131958
B  0.651118 -0.319318 -0.848077  0.605965              0.286647
C -2.018168  0.740122  0.528813 -0.589001              0.151122
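D  0.188695 -0.758872 -0.933237  0.955057              0.196184
E  0.190794  1.978757  2.605967  0.683509              2.662266

A row (index) is dropped by using df.drop() method and axis=0
-----------------------------------------------------------------
          W         X         Y         Z  New (Sum of X and Z)
B  0.651118 -0.319318 -0.848077  0.605965              0.286647
C -2.018168  0.740122  0.528813 -0.589001              0.151122
D  0.188695 -0.758872 -0.933237  0.955057              0.196184
E  0.190794  1.978757  2.605967  0.683509              2.662266

An in-place change can be done by making inplace=True in the drop method
---------------------------------------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509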
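
As an alternative to remembering which `axis` value means what, `drop` also accepts the `columns=` and `index=` keywords, and `assign` creates a column without modifying the original frame. The following is a minimal sketch on a hypothetical two-row frame; `df_demo`, `with_sum`, `without_sum` and `without_row_a` are names invented for this example.

```python
import pandas as pd

df_demo = pd.DataFrame({'X': [1.0, 2.0], 'Z': [10.0, 20.0]}, index=['A', 'B'])

# assign() returns a new DataFrame with the extra column; df_demo is unchanged.
with_sum = df_demo.assign(New=df_demo['X'] + df_demo['Z'])

# columns= and index= name the axis directly, so no axis argument is needed.
without_sum = with_sum.drop(columns='New')
without_row_a = with_sum.drop(index='A')

print(with_sum)
print(without_sum)
print(without_row_a)
```

Because each call returns a new object, none of these steps needs `inplace=True`.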

# Selecting/indexing Rows

* Label-based selection with the 'loc' indexer
* Integer-position-based selection with the 'iloc' indexer

In [21]:
print("\nLabel-based 'loc' method can be used for selecting row(s)\n",'-'*60, sep='')
print("\nSingle row\n")
print(df.loc['C'])
print("\nMultiple rows\n")
print(df.loc[['B','C']])
print("\nIndex position based 'iloc' method can be used for selecting row(s)\n",'-'*70, sep='')
print("\nSingle row\n")
print(df.iloc[2])
print("\nMultiple rows\n")
print(df.iloc[[1,2]])


Label-based 'loc' method can be used for selecting row(s)
------------------------------------------------------------

Single row

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Multiple rows

          W         X         Y         Z
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001

Index position based 'iloc' method can be used for selecting row(s)
----------------------------------------------------------------------

Single row

W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64

Multiple rows

          W         X         Y         Z
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
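
With string row labels, `loc` and `iloc` are easy to tell apart. The difference matters most when the row labels are themselves integers that do not match the row positions. The sketch below uses a hypothetical frame `df_int` to show the two indexers returning different rows for the same number.

```python
import pandas as pd

# Hypothetical frame whose integer row labels do not match the row positions.
df_int = pd.DataFrame({'value': ['p', 'q', 'r']}, index=[2, 0, 1])

# loc looks up the row whose label is 0, which is the second row here.
print(df_int.loc[0, 'value'])

# iloc looks up the row at position 0, which is the first row here.
print(df_int.iloc[0, 0])
```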


# Subsetting DataFrame

In [22]:
np.random.seed(101)
matrix_data = rn(5,4)
row_labels = ['A','B','C','D','E']
column_headings = ['W','X','Y','Z']
df = pd.DataFrame(data=matrix_data, index=row_labels, columns=column_headings)

print("\nThe DataFrame\n",'-'*45, sep='')
print(df)
print("\nElement at row 'B' and column 'Y' is\n")
print(df.loc['B','Y'])
print("\nSubset comprising rows B and D, and columns W and Y, is\n")
df.loc[['B','D'],['W','Y']]


The DataFrame
---------------------------------------------
          W         X         Y         Z
A  2.706850  0.628133  0.907969  0.503826
B  0.651118 -0.319318 -0.848077  0.605965
C -2.018168  0.740122  0.528813 -0.589001
D  0.188695 -0.758872 -0.933237  0.955057
E  0.190794  1.978757  2.605967  0.683509

Element at row 'B' and column 'Y' is

-0.8480769834036315

Subset comprising rows B and D, and columns W and Y, is

          W         Y
B  0.651118 -0.848077
D  0.188695 -0.933237
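
Besides lists of labels, `loc` and `iloc` both accept slices for rows and columns, which is convenient for contiguous blocks. The sketch below rebuilds the same frame and compares a label slice with the equivalent position slice; `by_label` and `by_position` are names introduced for this illustration, and `at` is shown as an assumed alternative for single-element lookups.

```python
import numpy as np
import pandas as pd

np.random.seed(101)
df = pd.DataFrame(np.random.randn(5, 4), index=list('ABCDE'), columns=list('WXYZ'))

by_label = df.loc['B':'D', 'W':'Y']   # label slices include both endpoints
by_position = df.iloc[1:4, 0:3]       # position slices exclude the stop

print(by_label)
print(by_label.equals(by_position))   # both select rows B-D and columns W-Y

# at is a scalar-only accessor; it returns the same element as df.loc['B', 'Y'].
print(df.at['B', 'Y'])
```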
