In [2]:
import pandas as pd

### Reading from a CSV file

In [6]:
df=pd.read_csv('demo.csv')

In [7]:
df.head()

Unnamed: 0,ID,Name,Age,Gender,Height,Weight,Income
0,1,A,37,M,172,72,38960
1,2,B,29,F,172,79,74863
2,3,C,41,M,166,67,77256
3,4,D,29,F,158,61,84022
4,5,E,34,M,177,70,41907


# Subsetting and slicing

How to subset a DataFrame with a condition

The syntax for subsetting a DataFrame based on a condition is: df[condition]

In [9]:
df[df.Age>=30]

Unnamed: 0,ID,Name,Age,Gender,Height,Weight,Income
0,1,A,37,M,172,72,38960
2,3,C,41,M,166,67,77256
4,5,E,34,M,177,70,41907
5,6,F,34,F,158,79,86198
7,8,H,41,F,165,75,48630
9,10,J,45,F,178,55,19021


The condition evaluates to a boolean Series with one True/False value per row, so df[condition] returns only the rows of the DataFrame where the condition is True.

In [10]:
df.Age>=30

0     True
1    False
2     True
3    False
4     True
5     True
6    False
7     True
8    False
9     True
Name: Age, dtype: bool
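
To make the mask idea concrete, here is a minimal sketch. It assumes a small hand-typed frame named `sample` (a name chosen for illustration) holding three columns of the first five rows shown by df.head(), so it runs without demo.csv.

```python
import pandas as pd

# First five rows of demo.csv as printed by df.head(), typed in by hand
# so the example is self-contained ("sample" is an illustrative name).
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [37, 29, 41, 29, 34],
    "Height": [172, 172, 166, 158, 177],
})

mask = sample["Age"] >= 30    # boolean Series, one value per row
older = sample[mask]          # keeps only the rows where mask is True
n_matching = int(mask.sum())  # True counts as 1, so this counts matching rows

print(mask)
print(older)
print(n_matching)
```

Storing the mask in a variable is optional, but it makes it easy to inspect or count the matches before filtering.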

### We can combine multiple conditions while subsetting, joining them with '&' or '|'.
##### Here '&' is element-wise logical AND and '|' is element-wise logical OR

In [14]:
df[(df.Age>30) & (df.Height>170)]

Unnamed: 0,ID,Name,Age,Gender,Height,Weight,Income
0,1,A,37,M,172,72,38960
4,5,E,34,M,177,70,41907
9,10,J,45,F,178,55,19021


### Enclose each condition in parentheses; otherwise the expression raises an error, because '&' binds more tightly than comparison operators such as '>'

In [15]:
df[df.Age>30 & df.Height>170]

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
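
The error arises because Python evaluates `30 & df.Height` first and then tries to treat the result as a chained comparison, which needs a single True/False value from a whole Series. The sketch below shows the parenthesized forms for AND, OR and NOT; it assumes a hand-typed frame named `sample` (an illustrative name) with the first five rows from df.head().

```python
import pandas as pd

# Hand-typed copy of the first five rows shown by df.head()
# ("sample" is an illustrative name, not part of the original notebook).
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [37, 29, 41, 29, 34],
    "Height": [172, 172, 166, 158, 177],
})

over_30 = sample["Age"] > 30
tall = sample["Height"] > 170

both = sample[(sample["Age"] > 30) & (sample["Height"] > 170)]    # AND
either = sample[(sample["Age"] > 30) | (sample["Height"] > 170)]  # OR
neither = sample[~(over_30 | tall)]                               # NOT, using named masks

print(both)
print(either)
print(neither)
```

Naming each mask first, as with `over_30` and `tall`, is one way to avoid the precedence problem altogether.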

### To select only a few columns instead of the entire DataFrame, pass their names as a list inside '[ ]'

In [16]:
df[['Age','Height','Weight']]

Unnamed: 0,Age,Height,Weight
0,37,172,72
1,29,172,79
2,41,166,67
3,29,158,61
4,34,177,70
5,34,158,79
6,21,175,69
7,41,165,75
8,25,158,90
9,45,178,55


### To reference a single column, we can also use df.column_name

In [17]:
df.Age

0    37
1    29
2    41
3    29
4    34
5    34
6    21
7    41
8    25
9    45
Name: Age, dtype: int64

### To select a few columns while subsetting rows, we use .loc

.loc[ ] is primarily label based, but may also be used with a boolean array.

In [18]:
df.loc[df.Age>30,['Name','Age','Height']]

Unnamed: 0,Name,Age,Height
0,A,37,172
2,C,41,166
4,E,34,177
5,F,34,158
7,H,41,165
9,J,45,178


#### Without .loc, it raises an error

In [19]:
df[df.Age>30,['Name','Age','Height']]

TypeError: 'Series' objects are mutable, thus they cannot be hashed
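
As a minimal sketch of the two .loc forms mentioned above (a boolean mask and plain labels), the example below assumes a hand-typed frame named `sample` (an illustrative name) built from the first five rows of df.head(). Note that label slices in .loc include both endpoints.

```python
import pandas as pd

# Hand-typed copy of the first five rows shown by df.head()
# ("sample" is an illustrative name, not part of the original notebook).
sample = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [37, 29, 41, 29, 34],
    "Height": [172, 172, 166, 158, 177],
})

# Row filter (boolean mask) and column list in a single .loc call
subset = sample.loc[sample["Age"] > 30, ["Name", "Age", "Height"]]

# Label-based slicing: both the start and the end label are included
by_label = sample.loc[1:3, "Name":"Age"]

print(subset)
print(by_label)
```

The filtered result keeps the original row labels (0, 2, 4 here), which is why the index in the output above is not consecutive.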

#### .iloc is used when we want to select rows/columns by integer position

In [21]:
df.iloc[2,:]

ID            3
Name          C
Age          41
Gender        M
Height      166
Weight       67
Income    77256
Name: 2, dtype: object

In [22]:
df.iloc[:,2]

0    37
1    29
2    41
3    29
4    34
5    34
6    21
7    41
8    25
9    45
Name: Age, dtype: int64

In [23]:
df.iloc[1,2]

29
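
Beyond single positions, .iloc also accepts slices and lists of positions. The sketch below assumes a hand-typed frame named `sample` (an illustrative name) with four columns of the first five rows from df.head(). Unlike .loc, an .iloc slice excludes its end position, as ordinary Python slicing does.

```python
import pandas as pd

# Hand-typed copy of the first five rows shown by df.head()
# ("sample" is an illustrative name, not part of the original notebook).
sample = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "Name": ["A", "B", "C", "D", "E"],
    "Age": [37, 29, 41, 29, 34],
    "Height": [172, 172, 166, 158, 177],
})

first_two = sample.iloc[0:2, 0:2]      # rows 0-1 and columns 0-1; end position excluded
last_row = sample.iloc[-1]             # negative positions count from the end
picked = sample.iloc[[0, 2], [1, 2]]   # explicit lists of row and column positions
single = sample.iloc[1, 2]             # one cell: second row, third column

print(first_two)
print(last_row)
print(picked)
print(single)
```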

### To check the data types of all columns, we use dtypes

In [24]:
df.dtypes

ID         int64
Name      object
Age        int64
Gender    object
Height     int64
Weight     int64
Income     int64
dtype: object

### To change the data type of a particular column, we use a pandas conversion function such as pd.to_numeric or pd.to_datetime, or the column's astype method
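
A minimal sketch of such a conversion follows. It assumes a hypothetical frame named `raw` whose columns were read in as strings; the "n/a" entry is invented for illustration and does not come from demo.csv.

```python
import pandas as pd

# Hypothetical frame whose numeric columns arrived as strings.
# The "n/a" entry is made up to show how bad values are handled.
raw = pd.DataFrame({
    "Age": ["37", "29", "n/a"],
    "Income": ["38960", "74863", "77256"],
})

# astype works when every value converts cleanly
raw["Income"] = raw["Income"].astype("int64")

# pd.to_numeric with errors="coerce" turns unparseable values into NaN,
# so the resulting column is float64 rather than int64
raw["Age"] = pd.to_numeric(raw["Age"], errors="coerce")

print(raw.dtypes)
```

astype raises an error on a value it cannot convert, while errors="coerce" trades that strictness for missing values, so the choice depends on how clean the column is expected to be.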