### Pandas

Think of Pandas as an extremely powerful version of Excel, with many more features.

* Pandas is an open-source Python library. It provides ready-to-use, high-performance data structures and data analysis tools.

* Pandas is built on top of NumPy and is widely used for data science and data analytics.

* NumPy is a lower-level library that provides multi-dimensional arrays and a wide range of mathematical array operations.

* Pandas has a higher-level interface. It also provides streamlined alignment of tabular data and powerful time series functionality.

* DataFrame is the key data structure in Pandas. It allows us to store and manipulate tabular data as a 2-D data structure.

In [None]:
# Installation

# pip install pandas

In [2]:
# Import libraries

import pandas as pd
import numpy as np

#### Data Structures in Pandas module
Pandas provides two main data structures:

* <b>Series: </b> A 1-D, size-immutable, array-like structure holding homogeneous data.
* <b>DataFrame: </b> A 2-D, size-mutable tabular structure with heterogeneously typed columns.

Older versions of Pandas also provided a third structure, <b>Panel</b>, a 3-D size-mutable array. It was deprecated and has been removed from current versions of Pandas.

### Series

A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.


Syntax:
    <b>pandas.Series(data=None, index=None, dtype=None, name=None, copy=False)</b>
    
Parameters:

* data: Contains data stored in Series.
* index: Values must be hashable and have the same length as data. Non-unique index values are allowed.
* dtype: Data type for the output Series.
* name: The name to give to the Series.
* copy: Copy input data. Only affects Series or 1d ndarray input.
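
The parameters above can be combined in a single call. The following is a minimal sketch; the values, labels and the name ``marks`` are made up for illustration.

```python
import pandas as pd

# Build a named Series with explicit labels and a forced dtype.
# The data, the labels and the name "marks" are illustrative only.
marks = pd.Series(
    data=[10, 20, 30],
    index=["a", "b", "c"],
    dtype="float64",
    name="marks",
)

print(marks.dtype)   # float64, because dtype overrides the inferred int type
print(marks.name)    # marks
print(marks["b"])    # 20.0
```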

In [3]:
pd.Series(dtype='float64')

Series([], dtype: float64)

In [4]:
la = ['a','b','c']
m= [10,20,30]
a = np.array([10,20,30])
d = {'a':10,'b':20,'c':30}


In [5]:
# using list

pd.Series(m)

0    10
1    20
2    30
dtype: int64

In [6]:
# series with list and labels

pd.Series(data=m,index=la)

a    10
b    20
c    30
dtype: int64

In [7]:
# using dictionary

pd.Series(d)

a    10
b    20
c    30
dtype: int64

In [8]:
#series from numpy array

pd.Series(a)

0    10
1    20
2    30
dtype: int32

In [9]:
# Series with numpy array and labels

pd.Series(a,la)

a    10
b    20
c    30
dtype: int32

#### Data in a Series

A pandas Series can hold a variety of object types:

In [10]:
pd.Series(la)

0    a
1    b
2    c
dtype: object

#### Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).

In [11]:
ser1 = pd.Series([1,2,3,4],index = ['Ptk', 'Kullu','Shimla', 'Mohali'])              

ser1

Ptk       1
Kullu     2
Shimla    3
Mohali    4
dtype: int64

In [12]:
ser2 = pd.Series([1,3,5,4],index = ['Ptk', 'Kullu','Una', 'Mohali']) 

ser2

Ptk       1
Kullu     3
Una       5
Mohali    4
dtype: int64

In [13]:
ser1['Ptk']

1

In [14]:
ser2['Una']

5
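
Because the index behaves much like dictionary keys, a few dictionary-style operations are also available. The sketch below rebuilds ``ser1`` so that it runs on its own and shows membership tests, lookups with a default value, and the label-based and position-based accessors.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["Ptk", "Kullu", "Shimla", "Mohali"])

# Membership tests look at the index labels, not at the values.
has_una = "Una" in ser1

# .get returns a default instead of raising KeyError for a missing label.
una_value = ser1.get("Una", 0)

# The same element can be reached by label (.loc) or by position (.iloc).
by_label = ser1.loc["Shimla"]
by_position = ser1.iloc[2]

print(has_una, una_value, by_label, by_position)
```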

### Operations

In [15]:
ser1 - ser2

Kullu    -1.0
Mohali    0.0
Ptk       0.0
Shimla    NaN
Una       NaN
dtype: float64

In [16]:
ser1 + ser2

Kullu     5.0
Mohali    8.0
Ptk       2.0
Shimla    NaN
Una       NaN
dtype: float64

In [17]:
ser1 * ser2

Kullu      6.0
Mohali    16.0
Ptk        1.0
Shimla     NaN
Una        NaN
dtype: float64

In [18]:
ser1 / ser2

Kullu     0.666667
Mohali    1.000000
Ptk       1.000000
Shimla         NaN
Una            NaN
dtype: float64

In [19]:
np.sin(ser1)

Ptk       0.841471
Kullu     0.909297
Shimla    0.141120
Mohali   -0.756802
dtype: float64

In [20]:
np.log(ser2)

Ptk       0.000000
Kullu     1.098612
Una       1.609438
Mohali    1.386294
dtype: float64
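
The ``NaN`` entries in the arithmetic results above appear because Pandas aligns the two Series on their index labels before operating, and a label present in only one Series has no partner value. If a missing label should be treated as a fixed value instead, the method form of the operator accepts ``fill_value``. A minimal sketch, rebuilding ``ser1`` and ``ser2`` so that it is self-contained:

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=["Ptk", "Kullu", "Shimla", "Mohali"])
ser2 = pd.Series([1, 3, 5, 4], index=["Ptk", "Kullu", "Una", "Mohali"])

# Plain + leaves NaN for labels that exist in only one Series.
plain = ser1 + ser2

# add() with fill_value=0 treats a label missing from one side as 0.
total = ser1.add(ser2, fill_value=0)

print(plain.isna().sum())   # number of unmatched labels
print(total)
```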

### Usage of copy = False or True

In [2]:
g = [7, 8]
ser = pd.Series(g, copy=False)
ser.iloc[0] = 20
g

[7, 8]

In [3]:
ser

0    20
1     8
dtype: int64

Because the input is a Python list, the Series holds a copy of the original data even though ``copy=False``, so the list ``g`` is unchanged.

In [6]:
r = np.array([7, 8])
ser = pd.Series(r, copy=True)
ser.iloc[0] = 10
r

array([7, 8])

In [7]:
ser

0    10
1     8
dtype: int32

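Because ``copy=True`` was passed, the Series holds its own copy of the NumPy array, so modifying the Series leaves ``r`` unchanged. With ``copy=False``, a Series built from a NumPy array may instead be a view on the original data, in which case changes to the Series can also appear in the array.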
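
One way to check whether a Series shares memory with its source array is ``np.shares_memory``. The sketch below covers only the ``copy=True`` case, since the behaviour with ``copy=False`` can differ between Pandas versions.

```python
import numpy as np
import pandas as pd

r = np.array([7, 8])
ser = pd.Series(r, copy=True)   # copy=True: the Series gets its own buffer

# No memory is shared between the array and the Series values.
shared = np.shares_memory(r, ser.to_numpy())

ser.iloc[0] = 10                # modify the Series only

print(shared)   # False
print(r)        # the source array still holds 7 and 8
```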

### DataFrame

The DataFrame is the most important and widely used data structure in Pandas and is a standard way to store data. A DataFrame has data aligned in rows and columns, like a SQL table or a spreadsheet. We can either hard-code data into a DataFrame or import it from a CSV file, TSV file, Excel file, SQL table, etc.

We can think of a DataFrame as a bunch of Series objects put together to share the same index.

Syntax: 
    <b>pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None)</b>

Parameters:

* data - the input data used to create the DataFrame. It can be a list, dict, Series, NumPy ndarray, or another DataFrame.
* index - the row labels
* columns - the column labels
* dtype - optional; a single data type to force on the data. If omitted, the type of each column is inferred
* copy - whether to copy the input data
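
As a small illustration of these parameters, the sketch below builds a DataFrame from a list of dicts. The names, scores and row labels are invented for the example.

```python
import pandas as pd

# Each dict is one row; the names and scores are invented for illustration.
rows = [
    {"name": "Asha", "score": 81},
    {"name": "Ravi", "score": 67},
]

# index supplies the row labels; columns selects and orders the columns.
scores = pd.DataFrame(rows, index=["r1", "r2"], columns=["score", "name"])

print(scores.shape)              # (2, 2)
print(list(scores.columns))      # ['score', 'name']
print(scores.loc["r2", "name"])  # Ravi
```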

### Create an empty DataFrame

In [8]:
pd.DataFrame()


In [9]:
df = pd.DataFrame(np.random.randn(5,4),index='A B C D E'.split(),
                  columns='W X Y Z'.split())
df

Unnamed: 0,W,X,Y,Z
A,-0.355316,-0.063108,-0.448544,-1.217602
B,-1.057282,0.192831,-0.036189,-1.040613
C,-0.072865,0.251521,-0.714947,-2.385446
D,-0.396575,0.746462,-0.642134,0.193443
E,-1.679646,-0.550614,1.649683,0.770565


In [27]:
d = {
    "Name": ['Rohit', 'Aryan', 'Ayush', 'Yash', 'Neha','Suraj'],
    "Subject": ['Python', 'PHP', 'JAVA', 'C++', 'Ruby','Javascript'],
    "Marks": [70, 77, 75, 67,85,90],
    "Highest Marks": [43, 30, 25, 40, 32, 45 ]
    }

In [28]:
df = pd.DataFrame(d)
df

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


### Inspecting data in DataFrame

In [12]:
# Get the first two rows of the DataFrame

df.head(2)

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30


In [15]:
# Get the last row of the DataFrame

df.tail(1)

Unnamed: 0,Name,Subject,Marks,Highest Marks
5,Suraj,Javascript,90,45


In [18]:
# Get three randomly chosen rows (the result differs between runs)

df.sample(3)

Unnamed: 0,Name,Subject,Marks,Highest Marks
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
4,Neha,Ruby,85,32


In [19]:
# Get the data types of the object

df.dtypes

Name             object
Subject          object
Marks             int64
Highest Marks     int64
dtype: object

In [20]:
# Get the index values

df.index

RangeIndex(start=0, stop=6, step=1)

In [21]:
# Get the columns of the DataFrame

df.columns

Index(['Name', 'Subject', 'Marks', 'Highest Marks'], dtype='object')

### Statistical summary of records

We can get a statistical summary (count, mean, standard deviation, min, max, etc.) of the data using the ``df.describe()`` function.

In [22]:
df.describe()

Unnamed: 0,Marks,Highest Marks
count,6.0,6.0
mean,77.333333,35.833333
std,8.778762,7.985403
min,67.0,25.0
25%,71.25,30.5
50%,76.0,36.0
75%,83.0,42.25
max,90.0,45.0


In [23]:
# Describe a single column, selected with bracket notation

df['Marks'].describe()

count     6.000000
mean     77.333333
std       8.778762
min      67.000000
25%      71.250000
50%      76.000000
75%      83.000000
max      90.000000
Name: Marks, dtype: float64

In [24]:
# Describe all columns of a DataFrame regardless of data type.

df.describe(include='all')

Unnamed: 0,Name,Subject,Marks,Highest Marks
count,6,6,6.0,6.0
unique,6,6,,
top,Rohit,Python,,
freq,1,1,,
mean,,,77.333333,35.833333
std,,,8.778762,7.985403
min,,,67.0,25.0
25%,,,71.25,30.5
50%,,,76.0,36.0
75%,,,83.0,42.25
max,,,90.0,45.0


In [25]:
# Get only numeric columns in a DataFrame description.

df.describe(include=[np.number])

Unnamed: 0,Marks,Highest Marks
count,6.0,6.0
mean,77.333333,35.833333
std,8.778762,7.985403
min,67.0,25.0
25%,71.25,30.5
50%,76.0,36.0
75%,83.0,42.25
max,90.0,45.0


In [26]:
# Get only string columns in a DataFrame description.

df.describe(include=[object]) 

Unnamed: 0,Name,Subject
count,6,6
unique,6,6
top,Rohit,Python
freq,1,1


In [27]:
# Get only categorical columns from a DataFrame description.
# df has no categorical columns, so this call raises an error.

df.describe(include=['category'])

ValueError: No objects to concatenate

In [28]:
d = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
                   'numeric': [1, 2, 3],
                   'object': ['a', 'b', 'c']
                  })

d.describe(include=['category'])

Unnamed: 0,categorical
count,3
unique,3
top,d
freq,1


In [29]:
# Skip numeric columns from a DataFrame description.

df.describe(exclude=[np.number]) 

Unnamed: 0,Name,Subject
count,6,6
unique,6,6
top,Rohit,Python
freq,1,1


In [30]:
# Skip object columns from a DataFrame description.

df.describe(exclude=[object]) 

Unnamed: 0,Marks,Highest Marks
count,6.0,6.0
mean,77.333333,35.833333
std,8.778762,7.985403
min,67.0,25.0
25%,71.25,30.5
50%,76.0,36.0
75%,83.0,42.25
max,90.0,45.0


### Sorting records

We can sort records by any column using the ``df.sort_values()`` function.

In [31]:
df.sort_values('Marks', ascending=False)

Unnamed: 0,Name,Subject,Marks,Highest Marks
5,Suraj,Javascript,90,45
4,Neha,Ruby,85,32
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
0,Rohit,Python,70,43
3,Yash,C++,67,40
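
``sort_values()`` also accepts a list of columns, with one ``ascending`` flag per column. The sketch below uses a smaller, invented variant of the table (two subjects repeated) so that the second sort key has a visible effect.

```python
import pandas as pd

# Invented data: subjects repeat so that ties on the first key exist.
students = pd.DataFrame({
    "Name": ["Rohit", "Aryan", "Ayush", "Yash"],
    "Subject": ["Python", "PHP", "Python", "PHP"],
    "Marks": [70, 77, 75, 67],
})

# Sort by Subject (A to Z), then by Marks (highest first) within each subject.
ordered = students.sort_values(["Subject", "Marks"], ascending=[True, False])

print(ordered)
```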


### Slicing records

It is possible to extract data of a particular column, by using the column name. 

<b>Dataframe[Column_name] or Dataframe.Column_name</b>

In [32]:
df['Subject']

0        Python
1           PHP
2          JAVA
3           C++
4          Ruby
5    Javascript
Name: Subject, dtype: object

In [33]:
df.Subject

0        Python
1           PHP
2          JAVA
3           C++
4          Ruby
5    Javascript
Name: Subject, dtype: object

It is also possible to select multiple columns. This is done by passing a list of column names, separated by commas, inside the indexing brackets (hence the double square brackets).

In [34]:
df[['Name', 'Subject']]

Unnamed: 0,Name,Subject
0,Rohit,Python
1,Aryan,PHP
2,Ayush,JAVA
3,Yash,C++
4,Neha,Ruby
5,Suraj,Javascript


DataFrame columns are just Series.

In [35]:
# check type of column of dataframe

type(df['Subject'])

pandas.core.series.Series

In [36]:
# check type of dataframe

type(df)

pandas.core.frame.DataFrame

It is also possible to slice rows. Multiple rows can be selected using the “:” operator.

In [37]:
df[0:3]

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25


Pandas also lets us select data by integer position using the ``iloc[]`` indexer, for example ``iloc[0]`` for the first row, and by row and column labels using the ``loc[]`` indexer, for example ``loc['index_one']``. This is useful when we need only a few rows or columns to analyze the data.

For example, to select the third row, use ``df.iloc[2,:]``. To get the second element of the second column, use ``df.iloc[1,1]``.

In [38]:
df.iloc[2]

Name             Ayush
Subject           JAVA
Marks               75
Highest Marks       25
Name: 2, dtype: object

In [39]:
df.loc[2,'Subject']

'JAVA'

In [40]:
df.iloc[2,1]

'JAVA'

#### Difference between loc[] and iloc[]

* ``loc[]`` selects rows and columns by name/label.
* ``iloc[]`` selects rows and columns by integer position, counted from zero.
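
The difference is easiest to see when the labels are integers that do not match the row positions. The following sketch uses an invented Series and a small DataFrame; it also shows that label slices include the end label while position slices exclude the end position.

```python
import pandas as pd

# Integer labels that deliberately differ from the positions.
s = pd.Series(["x", "y", "z"], index=[2, 0, 1])

by_label = s.loc[0]      # the element labelled 0
by_position = s.iloc[0]  # the element in the first position

# With the default 0..n-1 index, slices show another difference.
marks = pd.DataFrame({"Marks": [70, 77, 75, 67]})
label_slice = marks.loc[0:2]      # labels 0, 1 and 2 (end label included)
position_slice = marks.iloc[0:2]  # positions 0 and 1 (end position excluded)

print(by_label, by_position)                  # y x
print(len(label_slice), len(position_slice))  # 3 2
```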

#### Creating a new column

In [41]:
df['new_col'] = df['Marks'] + df['Highest Marks']

In [42]:
df

Unnamed: 0,Name,Subject,Marks,Highest Marks,new_col
0,Rohit,Python,70,43,113
1,Aryan,PHP,77,30,107
2,Ayush,JAVA,75,25,100
3,Yash,C++,67,40,107
4,Neha,Ruby,85,32,117
5,Suraj,Javascript,90,45,135


#### Removing Columns

In [43]:
df.drop('new_col',axis=1)

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


In [44]:
# new_col is not removed from df until you pass inplace=True (or reassign the result).

df 

Unnamed: 0,Name,Subject,Marks,Highest Marks,new_col
0,Rohit,Python,70,43,113
1,Aryan,PHP,77,30,107
2,Ayush,JAVA,75,25,100
3,Yash,C++,67,40,107
4,Neha,Ruby,85,32,117
5,Suraj,Javascript,90,45,135


In [45]:
df.drop('new_col',axis=1,inplace=True)

In [46]:
df

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


In [47]:
# Can also drop rows this way

df.drop(4,axis=0)

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
5,Suraj,Javascript,90,45


In [48]:
df

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


### Conditional Selection or Filtering

It is also possible to filter on column values using comparison operators.

In [49]:
# Comparing a column with a value returns a boolean Series (True where Marks > 70)

df['Marks']>70

0    False
1     True
2     True
3    False
4     True
5     True
Name: Marks, dtype: bool

In [50]:
# Get the dataframe having marks more than 70

df[df['Marks']>70]

Unnamed: 0,Name,Subject,Marks,Highest Marks
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


Another way to filter data is to use the ``isin()`` method.

In [51]:
# Check Suraj and Neha in dataframe

df[df['Name'].isin(['Suraj', 'Neha'])]

Unnamed: 0,Name,Subject,Marks,Highest Marks
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


In [52]:
# Get Name of students having marks more than 70

df[df['Marks']>70]['Name']

1    Aryan
2    Ayush
4     Neha
5    Suraj
Name: Name, dtype: object

In [None]:
#Get names and subject having highest marks more than 35

df[df['Highest Marks']>35][['Name','Subject']]

To combine two conditions, use ``|`` (or) and ``&`` (and), with each condition wrapped in parentheses.

In [53]:
# Get Dataframe having marks more than 70 or highest marks more than 35

df[(df['Marks']>70) | (df['Highest Marks'] > 35)]

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


In [54]:
# Get Dataframe having marks more than 70 and highest marks more than 30

df[(df['Marks']>70) & (df['Highest Marks'] > 30)]

Unnamed: 0,Name,Subject,Marks,Highest Marks
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45
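
Boolean masks can also be negated with ``~`` and passed to ``loc[]`` together with a column name, which selects rows and a column in one step. A minimal sketch on a reduced copy of the table:

```python
import pandas as pd

students = pd.DataFrame({
    "Name": ["Rohit", "Aryan", "Ayush", "Yash", "Neha", "Suraj"],
    "Marks": [70, 77, 75, 67, 85, 90],
})

high = students["Marks"] > 70

# ~ negates the mask: names of students whose Marks are NOT above 70.
not_high = students.loc[~high, "Name"].tolist()

# between() is inclusive on both ends by default.
mid_range = students.loc[students["Marks"].between(70, 77), "Name"].tolist()

print(not_high)    # ['Rohit', 'Yash']
print(mid_range)   # ['Rohit', 'Aryan', 'Ayush']
```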


### Rename column

The ``df.rename()`` function is used to rename columns. Its ``columns`` argument takes a dictionary that maps each old column name to its new name.

In [55]:
# The argument `inplace=True` applies the change to df itself instead of returning a new DataFrame.

df.rename(columns = {'Highest Marks':'Avg Marks'}, inplace=True)

In [56]:
df

Unnamed: 0,Name,Subject,Marks,Avg Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45
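
Without ``inplace=True``, ``rename()`` returns a new DataFrame and leaves the original untouched, so the result has to be assigned to a name. A minimal sketch with a one-row example table:

```python
import pandas as pd

original = pd.DataFrame({"Name": ["Rohit"], "Highest Marks": [43]})

# rename returns a new DataFrame; `original` keeps its column names.
renamed = original.rename(columns={"Highest Marks": "Avg Marks"})

print(list(original.columns))   # ['Name', 'Highest Marks']
print(list(renamed.columns))    # ['Name', 'Avg Marks']
```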


### Data Wrangling

Data science involves processing data so that algorithms can be applied to it.

Data wrangling is the process of transforming data through operations such as merging, grouping and concatenating. The Pandas library provides functions like ``merge()``, ``groupby()``, ``join()`` and ``concat()`` for data wrangling.

Let's create two DataFrames to illustrate these operations.

In [3]:
d = {  
    'Student_id': ['1', '2', '3', '4', '5'],
    'Student_name': ['Kartik', 'Shruti', 'Aryan', 'Sehaj', 'Yamini'],
    'Marks':[60,55,70,80,75]
}
df1 = pd.DataFrame(d, columns=['Student_id', 'Student_name','Marks'])  

In [4]:
d = {  
    'Student_id': ['4', '5', '6', '7', '8'],
    'Student_name': ['Shruti', 'Sehaj', 'Rajat', 'Puneet', 'Rajat'],
    'Marks':[70,65,60,85,95]
}
df2= pd.DataFrame(d, columns=['Student_id', 'Student_name','Marks']) 

In [5]:
df1

Unnamed: 0,Student_id,Student_name,Marks
0,1,Kartik,60
1,2,Shruti,55
2,3,Aryan,70
3,4,Sehaj,80
4,5,Yamini,75


In [6]:
df2

Unnamed: 0,Student_id,Student_name,Marks
0,4,Shruti,70
1,5,Sehaj,65
2,6,Rajat,60
3,7,Puneet,85
4,8,Rajat,95


### Merging

We can merge the two DataFrames we created on the values of ‘Student_id’ using the ``merge()`` function. By default, only the rows whose key appears in both DataFrames are kept.

In [9]:
pd.merge(df1, df2, on='Student_id')

Unnamed: 0,Student_id,Student_name_x,Marks_x,Student_name_y,Marks_y
0,4,Sehaj,80,Shruti,70
1,5,Yamini,75,Sehaj,65
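
The merge above is an inner join. To keep the unmatched rows as well, ``merge()`` accepts ``how='outer'``; ``suffixes`` controls the names given to overlapping columns and ``indicator=True`` adds a ``_merge`` column recording where each row came from. A sketch on reduced copies of ``df1`` and ``df2`` (the suffix names are arbitrary choices):

```python
import pandas as pd

df1 = pd.DataFrame({"Student_id": ["1", "2", "3", "4", "5"],
                    "Marks": [60, 55, 70, 80, 75]})
df2 = pd.DataFrame({"Student_id": ["4", "5", "6", "7", "8"],
                    "Marks": [70, 65, 60, 85, 95]})

# Outer join: keep every Student_id from either DataFrame.
merged = pd.merge(df1, df2, on="Student_id", how="outer",
                  suffixes=("_df1", "_df2"), indicator=True)

print(merged)
print(merged["_merge"].value_counts())
```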


### Grouping

Grouping is the process of collecting data into different categories.
The ``groupby()`` function allows you to group rows of data together and call aggregate functions on each group.

In [10]:
group = df2.groupby('Student_name')

In [11]:
g = group.get_group('Rajat')
g

Unnamed: 0,Student_id,Student_name,Marks
2,6,Rajat,60
4,8,Rajat,95


In [12]:
g.min()

Student_id          6
Student_name    Rajat
Marks              60
dtype: object

In [13]:
g.max()

Student_id          8
Student_name    Rajat
Marks              95
dtype: object

In [14]:
g.count()

Student_id      2
Student_name    2
Marks           2
dtype: int64

In [15]:
group.describe()

,Marks,Marks,Marks,Marks,Marks,Marks,Marks,Marks
Student_name,count,mean,std,min,25%,50%,75%,max
Puneet,1.0,85.0,,85.0,85.0,85.0,85.0,85.0
Rajat,2.0,77.5,24.748737,60.0,68.75,77.5,86.25,95.0
Sehaj,1.0,65.0,,65.0,65.0,65.0,65.0,65.0
Shruti,1.0,70.0,,70.0,70.0,70.0,70.0,70.0
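
Instead of pulling out one group at a time, aggregate functions can be applied to all groups at once with ``agg()``. A minimal sketch on the name and marks columns of ``df2``:

```python
import pandas as pd

df2 = pd.DataFrame({
    "Student_name": ["Shruti", "Sehaj", "Rajat", "Puneet", "Rajat"],
    "Marks": [70, 65, 60, 85, 95],
})

# One row per student name, one column per aggregate.
summary = df2.groupby("Student_name")["Marks"].agg(["count", "mean", "max"])

print(summary)
```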


### Concatenating

Concatenating means appending one set of data to another. Pandas provides the ``concat()`` function to concatenate DataFrames.

In [16]:
pd.concat([df1, df2])

Unnamed: 0,Student_id,Student_name,Marks
0,1,Kartik,60
1,2,Shruti,55
2,3,Aryan,70
3,4,Sehaj,80
4,5,Yamini,75
0,4,Shruti,70
1,5,Sehaj,65
2,6,Rajat,60
3,7,Puneet,85
4,8,Rajat,95
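
Note that the result above keeps the original row labels, so the labels 0 to 4 appear twice. If a fresh 0..n-1 index is preferred, ``concat()`` accepts ``ignore_index=True``. A sketch with two shortened, illustrative tables:

```python
import pandas as pd

first = pd.DataFrame({"Student_id": ["1", "2"], "Marks": [60, 55]})
second = pd.DataFrame({"Student_id": ["4", "5"], "Marks": [70, 65]})

kept = pd.concat([first, second])                      # labels repeat
fresh = pd.concat([first, second], ignore_index=True)  # labels renumbered

print(kept.index.tolist())    # [0, 1, 0, 1]
print(fresh.index.tolist())   # [0, 1, 2, 3]
```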


### Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

In [17]:
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])



In [18]:
left

Unnamed: 0,A,B
K0,A0,B0
K1,A1,B1
K2,A2,B2


In [19]:
right

Unnamed: 0,C,D
K0,C0,D0
K2,C2,D2
K3,C3,D3


In [20]:
left.join(right)

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [21]:
right.join(left)

Unnamed: 0,C,D,A,B
K0,C0,D0,A0,B0
K2,C2,D2,A2,B2
K3,C3,D3,,


In [22]:
left.join(right, how='outer')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3


In [23]:
left.join(right, how='inner')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2


In [24]:
left.join(right, how='left')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [25]:
left.join(right, how='right')

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2
K3,,,C3,D3


### Reshaping the dataframe

In [29]:
df

Unnamed: 0,Name,Subject,Marks,Highest Marks
0,Rohit,Python,70,43
1,Aryan,PHP,77,30
2,Ayush,JAVA,75,25
3,Yash,C++,67,40
4,Neha,Ruby,85,32
5,Suraj,Javascript,90,45


In [30]:
df.melt()

Unnamed: 0,variable,value
0,Name,Rohit
1,Name,Aryan
2,Name,Ayush
3,Name,Yash
4,Name,Neha
5,Name,Suraj
6,Subject,Python
7,Subject,PHP
8,Subject,JAVA
9,Subject,C++
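
``melt()`` unpivots a DataFrame from wide to long format. Called with no arguments, as above, it stacks every column into ``variable``/``value`` pairs. More often, one or more identifier columns are kept with ``id_vars`` and only the remaining columns are unpivoted. A sketch on a two-row subset of the table; the names ``Measure`` and ``Score`` are arbitrary choices for the output columns.

```python
import pandas as pd

wide = pd.DataFrame({
    "Name": ["Rohit", "Aryan"],
    "Marks": [70, 77],
    "Highest Marks": [43, 30],
})

# Keep Name as an identifier; unpivot the two numeric columns.
long_df = wide.melt(id_vars="Name", var_name="Measure", value_name="Score")

print(long_df)
```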


### More Index Details
Let's discuss some more features of indexing, including resetting the index or setting it to something else.

In [31]:
frame = pd.DataFrame(np.random.randn(4,4),index='A B C D'.split(),
                     columns='W X Y Z'.split())
frame

Unnamed: 0,W,X,Y,Z
A,0.715933,0.472646,0.876887,-0.785034
B,-0.744715,-2.22825,1.027896,1.242633
C,1.602421,-0.928471,0.309558,1.147111
D,0.604446,1.308025,0.590933,2.133005


In [32]:
# Reset to default 0,1...n index

frame.reset_index()

Unnamed: 0,index,W,X,Y,Z
0,A,0.715933,0.472646,0.876887,-0.785034
1,B,-0.744715,-2.22825,1.027896,1.242633
2,C,1.602421,-0.928471,0.309558,1.147111
3,D,0.604446,1.308025,0.590933,2.133005


In [33]:
newindex = 'P Q R S'.split()

In [34]:
frame['States'] = newindex

In [35]:
frame

Unnamed: 0,W,X,Y,Z,States
A,0.715933,0.472646,0.876887,-0.785034,P
B,-0.744715,-2.22825,1.027896,1.242633,Q
C,1.602421,-0.928471,0.309558,1.147111,R
D,0.604446,1.308025,0.590933,2.133005,S


In [36]:
# Use the 'States' column as the index (returns a new DataFrame; frame itself is unchanged)

frame.set_index('States')

States,W,X,Y,Z
P,0.715933,0.472646,0.876887,-0.785034
Q,-0.744715,-2.22825,1.027896,1.242633
R,1.602421,-0.928471,0.309558,1.147111
S,0.604446,1.308025,0.590933,2.133005


In [38]:
frame

Unnamed: 0,W,X,Y,Z,States
A,0.715933,0.472646,0.876887,-0.785034,P
B,-0.744715,-2.22825,1.027896,1.242633,Q
C,1.602421,-0.928471,0.309558,1.147111,R
D,0.604446,1.308025,0.590933,2.133005,S


In [39]:
frame.set_index('States',inplace=True)

In [40]:
frame

States,W,X,Y,Z
P,0.715933,0.472646,0.876887,-0.785034
Q,-0.744715,-2.22825,1.027896,1.242633
R,1.602421,-0.928471,0.309558,1.147111
S,0.604446,1.308025,0.590933,2.133005


### Multi-Index and Index Hierarchy

In [42]:
# Index Levels
outside = ['A1','A1','A1','A2','A2','A2']
inside = [1,2,3,1,2,3]
index = list(zip(outside,inside))
index = pd.MultiIndex.from_tuples(index)

In [43]:
index

MultiIndex([('A1', 1),
            ('A1', 2),
            ('A1', 3),
            ('A2', 1),
            ('A2', 2),
            ('A2', 3)],
           )

In [44]:
f = pd.DataFrame(np.random.randn(6,2),index=index,columns=['B','C'])
f

Unnamed: 0,Unnamed: 1,B,C
A1,1,1.79485,-0.463469
A1,2,-1.245202,-0.181134
A1,3,-0.667159,-0.936902
A2,1,0.443477,-0.870769
A2,2,-0.410578,0.836448
A2,3,0.851038,-0.365968


#### Calling index hierarchy

In [45]:
f.loc['A2']

Unnamed: 0,B,C
1,0.443477,-0.870769
2,-0.410578,0.836448
3,0.851038,-0.365968


In [46]:
f.loc['A1'].loc[3]

B   -0.667159
C   -0.936902
Name: 3, dtype: float64

In [47]:
f.index.names

FrozenList([None, None])

In [48]:
f.index.names = ['Group','Num']

In [49]:
f

Group,Num,B,C
A1,1,1.79485,-0.463469
A1,2,-1.245202,-0.181134
A1,3,-0.667159,-0.936902
A2,1,0.443477,-0.870769
A2,2,-0.410578,0.836448
A2,3,0.851038,-0.365968


In [50]:
# xs() returns a cross-section (subset) of the DataFrame for the given index label

f.xs('A1')

Num,B,C
1,1.79485,-0.463469
2,-1.245202,-0.181134
3,-0.667159,-0.936902


In [51]:
# Pass a tuple to select on several index levels (passing a list is deprecated)
f.xs(('A1',1))


B    1.794850
C   -0.463469
Name: (A1, 1), dtype: float64

In [52]:
f.xs(1,level='Num')

Group,B,C
A1,1.79485,-0.463469
A2,0.443477,-0.870769
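
The same selections can be written with ``loc[]`` by passing a tuple with one entry per index level. The sketch below uses a smaller frame filled with consecutive integers (instead of random numbers) so that the results are predictable.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("A1", 1), ("A1", 2), ("A2", 1), ("A2", 2)], names=["Group", "Num"]
)
# Values 0..7 laid out row by row: (A1,1) -> 0,1; (A1,2) -> 2,3; and so on.
small = pd.DataFrame(np.arange(8).reshape(4, 2), index=index, columns=["B", "C"])

row = small.loc[("A1", 2), :]       # one row, addressed by both levels
same = small.xs(("A1", 2))          # the equivalent cross-section
by_num = small.xs(2, level="Num")   # all groups where Num == 2

print(row)
print(by_num)
```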
