## DataFrames

- DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a collection of Series objects that share the same index. Let's use pandas to explore this topic.
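- To make the "Series sharing one index" picture concrete, here is a minimal sketch. The names `w`, `x`, and `frame` and the values are made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels
w = pd.Series([1.0, 2.0, 3.0], index=['A', 'B', 'C'], name='W')
x = pd.Series([10.0, 20.0, 30.0], index=['A', 'B', 'C'], name='X')

# Combine them column-wise; the shared index becomes the row labels
frame = pd.DataFrame({'W': w, 'X': x})
print(frame)

# Pulling a column back out gives a Series again
print(type(frame['W']))
```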

In [1]:
## DataFrames are in the form of tables, rows, columns

import pandas as pd
import numpy as np


In [4]:
from numpy.random import randn
np.random.seed(101)




- np.random.seed(101) is a command in the NumPy library that sets the seed for the random number generator. Setting the seed ensures that the sequence of random numbers generated is reproducible. This means that every time you run your code with the same seed, you will get the same sequence of random numbers. This is particularly useful for debugging and sharing your work, as it ensures consistency.

- Here’s a brief explanation and example:

- Explanation
- np.random is the random number module in the NumPy library.
- seed(101) sets the seed to the value 101. You can choose any integer as the seed value.


## Example: setting a seed with np.random.seed(101)

In [7]:
import numpy as np

# Set the seed for reproducibility
np.random.seed(101)

# Generate some random numbers
random_numbers = np.random.rand(5)
print(random_numbers)

# Set the seed again to the same value
np.random.seed(101)

# Generate the same random numbers again
random_numbers_reproducible = np.random.rand(5)
print(random_numbers_reproducible)


[0.51639863 0.57066759 0.02847423 0.17152166 0.68527698]
[0.51639863 0.57066759 0.02847423 0.17152166 0.68527698]


- As you can see, the two arrays of random numbers are identical because the same seed was used before generating them. This demonstrates how setting the seed ensures reproducibility of the random numbers generated by NumPy.
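- As a side note, NumPy also offers a newer Generator interface through `np.random.default_rng`. The sketch below is only an illustration of the same reproducibility idea with that interface; the names `rng_a`, `rng_b`, `first`, and `second` are made up, and the numbers it produces will generally not match those from `np.random.seed(101)` because the underlying generator is different.

```python
import numpy as np

# Two independent generators created with the same seed
rng_a = np.random.default_rng(101)
rng_b = np.random.default_rng(101)

first = rng_a.standard_normal(5)
second = rng_b.standard_normal(5)

# Same seed, same sequence
print(np.array_equal(first, second))
```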








## End of the seed example

In [8]:
df = pd.DataFrame(randn(5, 4), index = 'A B C D E'.split(), columns = 'W X Y Z'.split())

In [9]:
df

Unnamed: 0,W,X,Y,Z
A,-0.510021,0.8822,0.312745,0.555649
B,0.234393,-1.340208,-1.079692,-0.501881
C,-0.168145,-0.108067,1.511444,0.683461
D,0.939919,-1.134972,-0.635792,0.02516
E,-0.27307,-0.3159,0.008583,0.205307


- Init signature of pd.DataFrame:

    pd.DataFrame(
        data=None,
        index: 'Axes | None' = None,
        columns: 'Axes | None' = None,
        dtype: 'Dtype | None' = None,
        copy: 'bool | None' = None,
    )

In [11]:
df = pd.DataFrame(randn(5, 4), index = 'A B C D E'.split(), columns = 'W X Y Z'.split())

In [12]:
df

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


# The call below is wrong

In [14]:
# df1 = pd.DataFrame(randn(5, 4), index = 'A B C D E', columns = 'W X Y Z')

- The split() method turns a string of space-separated labels into a list of strings. This is needed because the 'index' and 'columns' parameters of the 'pd.DataFrame' constructor expect a list-like collection of labels, not a single string.

- Explanation:
- 'A B C D E'.split() converts the string 'A B C D E' into a list of strings: ['A', 'B', 'C', 'D', 'E'].
- 'W X Y Z'.split() converts the string 'W X Y Z' into a list of strings: ['W', 'X', 'Y', 'Z'].

# What happens if we don't use the split() method?

-  df1 = pd.DataFrame(randn(5, 4), index='A B C D E', columns='W X Y Z')
-  Here, index='A B C D E' is a single string, not a list.
-  Similarly, columns='W X Y Z' is a single string, not a list.
-  The DataFrame constructor does not split these strings into labels for us, so the call fails with an error instead of building the DataFrame.
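- A minimal sketch of the failure, assuming a recent pandas version (where we would expect a TypeError). The names `data`, `raised`, and `ok` are made up for illustration, and zeros stand in for the random values because only the labels matter here.

```python
import numpy as np
import pandas as pd

data = np.zeros((5, 4))  # placeholder values; only the labels matter here

try:
    # Plain strings instead of lists of labels
    pd.DataFrame(data, index='A B C D E', columns='W X Y Z')
    raised = False
except (TypeError, ValueError) as err:
    raised = True
    print(type(err).__name__, err)

# The split() version builds the list of labels pandas expects
ok = pd.DataFrame(data, index='A B C D E'.split(), columns='W X Y Z'.split())
print(ok.shape)
```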



## Correct usage with lists directly:


- If we don't want to use split(), we can pass the lists directly:

In [18]:
df1 = pd.DataFrame(randn(6, 4), index = ['A', 'B', 'C', 'D', 'E', 'F'], columns=['W', 'X', 'Y', 'Z'])

In [19]:
df1

Unnamed: 0,W,X,Y,Z
A,-1.14822,1.607435,-1.22687,1.405532
B,-1.137201,-0.535478,2.142717,1.691452
C,0.275225,-0.852057,0.298659,-0.56537
D,0.358325,0.699676,0.417366,-0.238049
E,-1.850038,1.049774,-0.43787,0.608334
F,-0.342021,0.58902,0.827388,0.163044


In [20]:
print(df1)

          W         X         Y         Z
A -1.148220  1.607435 -1.226870  1.405532
B -1.137201 -0.535478  2.142717  1.691452
C  0.275225 -0.852057  0.298659 -0.565370
D  0.358325  0.699676  0.417366 -0.238049
E -1.850038  1.049774 -0.437870  0.608334
F -0.342021  0.589020  0.827388  0.163044


## End of the split() discussion for pd.DataFrame()

## Selection and Indexing

- Let's learn the various ways to grab data from a DataFrame.

In [21]:
df['W']

A    1.589176
B   -0.397462
C    0.882140
D    0.307881
E    0.508945
Name: W, dtype: float64

In [23]:
# Pass a list of column names to select several columns

# df['W', 'Z'] is wrong: pandas treats 'W', 'Z' as one tuple key and raises a KeyError



In [24]:
df[['W', 'Z']]

Unnamed: 0,W,Z
A,1.589176,0.087076
B,-0.397462,-0.136614
C,0.88214,1.269122
D,0.307881,-1.196118
E,0.508945,-0.13761
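- The difference between `df['W', 'Z']` and `df[['W', 'Z']]` can be checked with a small sketch. The frame `demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(3, 2), columns=['W', 'Z'])

try:
    demo['W', 'Z']          # interpreted as the single key ('W', 'Z')
    raised = False
except KeyError as err:
    raised = True
    print('KeyError:', err)

subset = demo[['W', 'Z']]   # a list of names returns a DataFrame
print(type(subset).__name__)
```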


In [25]:
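# or attribute (dot) syntax, NOT RECOMMENDED: it breaks when a column name clashes with a DataFrame method or is not a valid Python identifier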

df.W

A    1.589176
B   -0.397462
C    0.882140
D    0.307881
E    0.508945
Name: W, dtype: float64
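- One reason to prefer brackets over dot access: a column whose name matches a DataFrame method is hidden behind that method. A small sketch, with a made-up frame `demo` that has a column called 'count':

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'count': [10, 20]})

# Dot access works when the name is a valid identifier with no clash
print(demo.W.tolist())

# 'count' is also a DataFrame method, so the attribute is the method, not the column
print(callable(demo.count))

# Bracket access always returns the column
print(demo['count'].tolist())
```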

In [26]:
# DataFrame columns are just Series

type(df['W'])

pandas.core.series.Series

## Creating a new column

In [27]:
df['new'] = df['W'] + df['Y']

In [28]:
df

Unnamed: 0,W,X,Y,Z,new
A,1.589176,0.533221,0.857349,0.087076,2.446525
B,-0.397462,1.261688,0.433331,-0.136614,0.03587
C,0.88214,-0.377811,0.146639,1.269122,1.028779
D,0.307881,-0.606806,0.919318,-1.196118,1.2272
E,0.508945,-0.84844,2.010793,-0.13761,2.519739


In [29]:
df.drop('new', axis = 1)

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


In [31]:
# df.drop('new')

# This alone raises an error: drop() defaults to axis = 0 (rows), so we must pass axis = 1 to drop a column
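- A minimal sketch of the behaviour of drop(), using a made-up frame `demo`. It also shows the `columns=` keyword, which is another way to say "drop this column" without passing an axis.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'new': [3, 4]}, index=['A', 'B'])

try:
    demo.drop('new')            # axis=0 by default, so pandas looks for a row labelled 'new'
    raised = False
except KeyError as err:
    raised = True
    print('KeyError:', err)

by_axis = demo.drop('new', axis=1)      # drop a column via axis
by_name = demo.drop(columns='new')      # same result with the columns keyword
print(by_axis.equals(by_name))
print(list(demo.columns))               # the original frame is unchanged
```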

In [32]:
df

Unnamed: 0,W,X,Y,Z,new
A,1.589176,0.533221,0.857349,0.087076,2.446525
B,-0.397462,1.261688,0.433331,-0.136614,0.03587
C,0.88214,-0.377811,0.146639,1.269122,1.028779
D,0.307881,-0.606806,0.919318,-1.196118,1.2272
E,0.508945,-0.84844,2.010793,-0.13761,2.519739


In [33]:
# drop() does not modify df in place unless inplace = True is specified

In [34]:
df.drop('new', axis = 1, inplace = True)

In [35]:
df

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


In [36]:
df.drop('E', axis = 0)

# axis = 0 for row

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118


In [37]:
df.loc['A']

W    1.589176
X    0.533221
Y    0.857349
Z    0.087076
Name: A, dtype: float64

In [38]:
df

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


In [39]:
df.loc['A']

W    1.589176
X    0.533221
Y    0.857349
Z    0.087076
Name: A, dtype: float64

In [40]:
df.iloc[2]

# Select by integer position instead of label

W    0.882140
X   -0.377811
Y    0.146639
Z    1.269122
Name: C, dtype: float64
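- The label/position distinction is easiest to see when the labels are integers that do not start at 0. A small sketch with a made-up frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.0, 2.0, 3.0]}, index=[10, 20, 30])

by_label = demo.loc[10]       # the row whose label is 10
by_position = demo.iloc[0]    # the first row, whatever its label is

print(by_label['W'], by_position['W'])

try:
    demo.iloc[10]             # there is no row at position 10
    out_of_bounds = False
except IndexError:
    out_of_bounds = True
print(out_of_bounds)
```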

In [41]:
## Selecting subset of rows and columns

In [42]:
df.loc['B', 'Y']

0.4333313170148615

In [43]:
df

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


In [44]:
df.loc[['A', 'B'], ['W', 'Y']]

Unnamed: 0,W,Y
A,1.589176,0.857349
B,-0.397462,0.433331
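- Subsets can also be taken with slices. The sketch below uses a made-up frame `demo` to show that label slices with loc include both endpoints, while position slices with iloc exclude the stop value.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(20).reshape(5, 4),
                    index=list('ABCDE'), columns=list('WXYZ'))

# Label slices include both endpoints
by_label = demo.loc['A':'B', 'W':'Y']

# Position slices exclude the stop, as in plain Python
by_position = demo.iloc[0:2, 0:3]

print(by_label.equals(by_position))
```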


## Conditional Selection

- An important feature of pandas is conditional selection using bracket notation, very similar to NumPy.

In [46]:
df

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


In [47]:
df > 0

Unnamed: 0,W,X,Y,Z
A,True,True,True,True
B,False,True,True,False
C,True,False,True,True
D,True,False,True,False
E,True,False,True,False


In [48]:
df[df > 0]

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,,1.261688,0.433331,
C,0.88214,,0.146639,1.269122
D,0.307881,,0.919318,
E,0.508945,,2.010793,


In [49]:
df[df['W'] > 0]

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


## Know the difference: df > 0 returns a Boolean mask, while df[df > 0] returns the values where the mask is True and NaN elsewhere

In [52]:
df[df['W'] > 0]['Y']

A    0.857349
C    0.146639
D    0.919318
E    2.010793
Name: Y, dtype: float64

In [54]:
df[df['W'] > 0][['Y', 'X']]

Unnamed: 0,Y,X
A,0.857349,0.533221
C,0.146639,-0.377811
D,0.919318,-0.606806
E,2.010793,-0.84844
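- Conditions can be combined as well. A minimal sketch with a made-up frame `demo`: use `&` and `|` with parentheses around each condition (Python's `and` / `or` do not work on a whole Series), and use loc to filter rows and pick columns in one step.

```python
import pandas as pd

demo = pd.DataFrame({'W': [1.5, -0.4, 0.9], 'Y': [0.8, 0.4, -0.1]},
                    index=['A', 'B', 'C'])

# Combine conditions with & (and) or | (or); each condition needs parentheses
both = demo[(demo['W'] > 0) & (demo['Y'] > 0)]
either = demo[(demo['W'] > 0) | (demo['Y'] > 0)]

# Row filter and column selection in one step
picked = demo.loc[demo['W'] > 0, ['Y']]

print(list(both.index), list(either.index), list(picked.index))
```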


## More Index Details

Let's discuss some more features of indexing, including resetting the index or setting it to something else.

In [55]:
df

Unnamed: 0,W,X,Y,Z
A,1.589176,0.533221,0.857349,0.087076
B,-0.397462,1.261688,0.433331,-0.136614
C,0.88214,-0.377811,0.146639,1.269122
D,0.307881,-0.606806,0.919318,-1.196118
E,0.508945,-0.84844,2.010793,-0.13761


In [56]:
# Reset to the default 0, 1, ..., n-1 index

In [57]:
df.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,1.589176,0.533221,0.857349,0.087076
1,B,-0.397462,1.261688,0.433331,-0.136614
2,C,0.88214,-0.377811,0.146639,1.269122
3,D,0.307881,-0.606806,0.919318,-1.196118
4,E,0.508945,-0.84844,2.010793,-0.13761
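- reset_index() returns a new DataFrame and moves the old labels into a column. If the old labels are not needed, `drop=True` discards them. A small sketch with a made-up frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2, 3]}, index=['A', 'B', 'C'])

kept = demo.reset_index()              # old labels move into an 'index' column
dropped = demo.reset_index(drop=True)  # old labels are discarded

print(list(kept.columns))
print(list(dropped.columns))
print(list(demo.index))                # the original frame is unchanged
```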


In [58]:
newIndex = 'CA NA WY OR CO'.split()

In [59]:
df['States'] = newIndex

In [60]:
df

Unnamed: 0,W,X,Y,Z,States
A,1.589176,0.533221,0.857349,0.087076,CA
B,-0.397462,1.261688,0.433331,-0.136614,NA
C,0.88214,-0.377811,0.146639,1.269122,WY
D,0.307881,-0.606806,0.919318,-1.196118,OR
E,0.508945,-0.84844,2.010793,-0.13761,CO


In [61]:
df['Monisha'] = newIndex

In [62]:
df

Unnamed: 0,W,X,Y,Z,States,Monisha
A,1.589176,0.533221,0.857349,0.087076,CA,CA
B,-0.397462,1.261688,0.433331,-0.136614,NA,NA
C,0.88214,-0.377811,0.146639,1.269122,WY,WY
D,0.307881,-0.606806,0.919318,-1.196118,OR,OR
E,0.508945,-0.84844,2.010793,-0.13761,CO,CO


In [63]:
df

Unnamed: 0,W,X,Y,Z,States,Monisha
A,1.589176,0.533221,0.857349,0.087076,CA,CA
B,-0.397462,1.261688,0.433331,-0.136614,NA,NA
C,0.88214,-0.377811,0.146639,1.269122,WY,WY
D,0.307881,-0.606806,0.919318,-1.196118,OR,OR
E,0.508945,-0.84844,2.010793,-0.13761,CO,CO


In [64]:
df.set_index('States')

States,W,X,Y,Z,Monisha
CA,1.589176,0.533221,0.857349,0.087076,CA
NA,-0.397462,1.261688,0.433331,-0.136614,NA
WY,0.88214,-0.377811,0.146639,1.269122,WY
OR,0.307881,-0.606806,0.919318,-1.196118,OR
CO,0.508945,-0.84844,2.010793,-0.13761,CO


In [65]:
df.set_index('States', inplace = True)


In [66]:
df

States,W,X,Y,Z,Monisha
CA,1.589176,0.533221,0.857349,0.087076,CA
NA,-0.397462,1.261688,0.433331,-0.136614,NA
WY,0.88214,-0.377811,0.146639,1.269122,WY
OR,0.307881,-0.606806,0.919318,-1.196118,OR
CO,0.508945,-0.84844,2.010793,-0.13761,CO
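- Note that set_index() replaces the existing index, so the old labels are lost unless we keep them first. A minimal sketch with a made-up frame `demo`:

```python
import pandas as pd

demo = pd.DataFrame({'W': [1, 2], 'States': ['CA', 'WY']}, index=['A', 'B'])

by_state = demo.set_index('States')   # returns a new DataFrame
print(by_state.loc['WY', 'W'])        # rows can now be looked up by state

# The old labels A, B are gone from by_state; reset_index() first keeps them as a column
kept = demo.reset_index().set_index('States')
print(list(kept.columns))
```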


## Multi-Index and Index Hierarchy

Let us go over how to work with a Multi-Index. First we will create a quick example of what a Multi-Indexed DataFrame looks like:

In [67]:
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

In [69]:
hier_index

MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
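- from_tuples is not the only constructor. As a sketch, the same index can be built with `pd.MultiIndex.from_arrays` or `pd.MultiIndex.from_product`; the variable names below are made up for illustration.

```python
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]

# Pairs of (outer, inner) labels, as in the cell above
from_tuples = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

# One array per level
from_arrays = pd.MultiIndex.from_arrays([outside, inside])

# Every combination of the outer and inner labels
from_product = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])

print(from_tuples.equals(from_arrays), from_tuples.equals(from_product))
```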

In [70]:
df = pd.DataFrame(np.random.randn(6, 2), index = hier_index, columns = ['A', 'B'])

In [71]:
df

Unnamed: 0,Unnamed: 1,A,B
G1,1,0.031363,0.783105
G1,2,0.06956,0.660136
G1,3,0.811349,-1.299794
G2,1,2.195249,-0.620243
G2,2,-1.531769,0.061996
G2,3,0.823122,0.644121


In [72]:
df.loc['G1']

Unnamed: 0,A,B
1,0.031363,0.783105
2,0.06956,0.660136
3,0.811349,-1.299794


In [73]:
df.loc['G1'].loc[1]

A    0.031363
B    0.783105
Name: 1, dtype: float64
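- The chained lookup above can also be written as a single loc call with a tuple key. A small sketch with a made-up frame `demo` (integer values instead of random ones, so the result is easy to check):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

chained = demo.loc['G1'].loc[1]       # two lookups, one per level
direct = demo.loc[('G1', 1)]          # one lookup with a tuple key

print(chained.tolist() == direct.tolist())
print(demo.loc[('G2', 3), 'B'])       # a single cell
```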

In [74]:
df.index.names

FrozenList([None, None])

In [75]:
# FrozenList is an immutable list that pandas uses to hold the index level names; None means a level has no name yet

In [76]:
df.index.names = ['Group', 'Num']

In [77]:
df

Group,Num,A,B
G1,1,0.031363,0.783105
G1,2,0.06956,0.660136
G1,3,0.811349,-1.299794
G2,1,2.195249,-0.620243
G2,2,-1.531769,0.061996
G2,3,0.823122,0.644121
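- Level names can also be given when the index is created, or changed later with set_names, which returns a new index. A minimal sketch; the names `idx` and `renamed` and the alternative labels 'Grp' and 'N' are made up for illustration.

```python
import pandas as pd

# Name the levels at construction time
idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])
print(list(idx.names))

# set_names returns a new index and leaves the original untouched
renamed = idx.set_names(['Grp', 'N'])
print(list(renamed.names))
```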


In [78]:

df.xs('G1')

# xs() returns a cross-section: here, every row under the outer label 'G1'

Num,A,B
1,0.031363,0.783105
2,0.06956,0.660136
3,0.811349,-1.299794


In [79]:
df

Group,Num,A,B
G1,1,0.031363,0.783105
G1,2,0.06956,0.660136
G1,3,0.811349,-1.299794
G2,1,2.195249,-0.620243
G2,2,-1.531769,0.061996
G2,3,0.823122,0.644121


In [80]:
df.xs(('G1', 1))

# Pass the key as a tuple; a list key such as ['G1', 1] is deprecated in pandas


A    0.031363
B    0.783105
Name: (G1, 1), dtype: float64

In [81]:
df.xs(1, level = 'Num')

Group,A,B
G1,0.031363,0.783105
G2,2.195249,-0.620243
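- By default xs() drops the level it selects on. The sketch below, on a made-up frame `demo`, shows `drop_level=False` to keep both levels, and a roughly equivalent selection through loc with `slice(None)` standing for "all groups".

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])
demo = pd.DataFrame(np.arange(12).reshape(6, 2), index=idx, columns=['A', 'B'])

cross = demo.xs(1, level='Num')                    # the 'Num' level is dropped
kept = demo.xs(1, level='Num', drop_level=False)   # both levels are kept
via_loc = demo.loc[(slice(None), 1), :]            # same rows selected through loc

print(list(cross.index))
print(list(kept.index) == list(via_loc.index))
```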
