# pandas Dataframes - Slicing and Filtering

## lesson_2_2_2

## We will use the same dataframe as last lesson.
### Import packages

In [1]:
import pandas as pd

### Creating a Basic Dataframe From Records

In [2]:
# define the data as a list of tuples, one tuple per pet
data = [
    ("Dexter","Johnsons","dog","shiba inu","red sesame",1.5,35,"m",False,"both",True),
    ("Alfred","Johnsons","cat","mix","tuxedo",4,12,"m",True,"indoor",True),
    ("Petra","Smith","cat","ragdoll","calico",None,10,"f",False,"both",True),
    ("Ava","Smith","dog","mix","blk/wht",12,32,"f",True,"both",False),
    ("Schroder","Brown","cat","mix","orange",13,15,"m",False,"indoor",True),
    ("Blackbeard","Brown","bird","parrot","multi",5,3,"f",False,"indoor",None),  # vaccination status unknown
]

# define the labels
labels = ["name","owner","type","breed","color","age","weight","gender","health issues","indoor/outboor","vaccinated"]

# create dataframe
vet_records = pd.DataFrame.from_records(data, columns=labels)

### A Note of Caution

Changes and updates to a dataframe are only permanent if they are saved back to the dataframe. For example, we might write `vet_records = ...` to permanently change the dataframe `vet_records`. In many cases it is good practice to keep a reference dataframe: write `vet_records_dogs = vet_records[vet_records.type=="dog"]` instead of `vet_records = vet_records[vet_records.type=="dog"]`. This leaves you with a reference dataframe that contains the unadulterated data.
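
As a minimal sketch of the point above, the following rebuilds a small subset of the lesson's columns (so it runs on its own) and shows that filtering into a new variable leaves the original untouched. Bracket notation (`vet_records["type"]`) is used here instead of `vet_records.type`; both select the same column.

```python
import pandas as pd

# Subset of the lesson's data, rebuilt so this example is self-contained
vet_records = pd.DataFrame({
    "name": ["Dexter", "Alfred", "Petra", "Ava", "Schroder", "Blackbeard"],
    "type": ["dog", "cat", "cat", "dog", "cat", "bird"],
})

# Filtering returns a new dataframe; vet_records itself is not modified
vet_records_dogs = vet_records[vet_records["type"] == "dog"]

print(len(vet_records))       # the reference dataframe still has every row
print(len(vet_records_dogs))  # only the dog rows
```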

### Grouping and Counting Data

Counting and grouping can help you get a better grasp of the data.

In [3]:
# How many pets have a type recorded? count() returns the number of non-null values, not the number of distinct types
vet_records.type.count()

6

In [4]:
vet_records.groupby('type').count()

type,name,owner,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
bird,1,1,1,1,1,1,1,1,1,0
cat,3,3,3,3,2,3,3,3,3,3
dog,2,2,2,2,2,2,2,2,2,2


In [5]:
vet_records.type.value_counts()

cat     3
dog     2
bird    1
Name: type, dtype: int64
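
The three counting calls answer different questions. A small sketch, using only the `type` column from the lesson's data, of how `count`, `nunique`, and `value_counts` differ:

```python
import pandas as pd

# The "type" column from the lesson, rebuilt as a standalone series
pet_types = pd.Series(["dog", "cat", "cat", "dog", "cat", "bird"], name="type")

n_recorded = pet_types.count()       # number of non-null entries
n_distinct = pet_types.nunique()     # number of different pet types
per_type = pet_types.value_counts()  # how many pets of each type

print(n_recorded, n_distinct)
print(per_type)
```

If the question is "how many types of pets do we have?", `nunique` is the call that answers it.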

### Slicing (Filtering) Data

Slicing data, that is, picking the parts of the data you want to use for a specific purpose, is easy with pandas once you have the concepts down.


#### Here we slice the data to get only the weight column.

In [6]:
# Create a pandas series from the dataframe
weight = vet_records['weight']

In [7]:
weight

0    35
1    12
2    10
3    32
4    15
5     3
Name: weight, dtype: int64

Notice that `vet_records` was not changed.

In [8]:
vet_records.head()

Unnamed: 0,name,owner,type,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
0,Dexter,Johnsons,dog,shiba inu,red sesame,1.5,35,m,False,both,True
1,Alfred,Johnsons,cat,mix,tuxedo,4.0,12,m,True,indoor,True
2,Petra,Smith,cat,ragdoll,calico,,10,f,False,both,True
3,Ava,Smith,dog,mix,blk/wht,12.0,32,f,True,both,False
4,Schroder,Brown,cat,mix,orange,13.0,15,m,False,indoor,True


While `weight` does show us the weights of all the animals in the dataframe, it is not very useful on its own unless we only need the raw weight values for some calculation. A list of numbers without context is usually hard to interpret.

So, instead let's get all the dog weights.

In [9]:
# Collect the dog weights only using a boolean filter
dog_weight = vet_records.weight[vet_records.type=='dog']

In [10]:
dog_weight

0    35
3    32
Name: weight, dtype: int64

While this is still only a list of values, the variable name at least tells us these are the weights of all the dogs in the sample.

A better way might be to just slice all the dog data.

In [11]:
dogs = vet_records[vet_records.type=='dog']

In [12]:
dogs

Unnamed: 0,name,owner,type,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
0,Dexter,Johnsons,dog,shiba inu,red sesame,1.5,35,m,False,both,True
3,Ava,Smith,dog,mix,blk/wht,12.0,32,f,True,both,False
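
Once the dog rows are sliced out, they can be narrowed to the columns of interest or used in a calculation. The sketch below rebuilds a subset of the lesson's columns; `dog_summary` and `mean_dog_weight` are names chosen only for this illustration.

```python
import pandas as pd

# Subset of the lesson's data, rebuilt so this example is self-contained
vet_records = pd.DataFrame({
    "name": ["Dexter", "Alfred", "Petra", "Ava", "Schroder", "Blackbeard"],
    "type": ["dog", "cat", "cat", "dog", "cat", "bird"],
    "weight": [35, 12, 10, 32, 15, 3],
})

dogs = vet_records[vet_records["type"] == "dog"]

# Keep the name next to the weight so the numbers have context
dog_summary = dogs[["name", "weight"]]

# Use the weight values in a calculation
mean_dog_weight = dogs["weight"].mean()

print(dog_summary)
print(mean_dog_weight)
```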


#### Using `loc` and `iloc`

- `loc` selects by label: row index labels and column names. Example: `.loc[row_label, column_name]`.
- `iloc` selects by integer position. Example: `.iloc[row, column]`. Remember: Python indexing starts at 0.

In [13]:
# get the pet name and owner for the 2nd record in the dataframe
vet_records.loc[1,["name", "owner"]]

name       Alfred
owner    Johnsons
Name: 1, dtype: object

In [14]:
# get the pet name and owner for all pets in the dataframe
vet_records.loc[:,["name", "owner"]]

Unnamed: 0,name,owner
0,Dexter,Johnsons
1,Alfred,Johnsons
2,Petra,Smith
3,Ava,Smith
4,Schroder,Brown
5,Blackbeard,Brown


In [15]:
# get all the names of the pets using iloc
vet_records.iloc[:,0]

0        Dexter
1        Alfred
2         Petra
3           Ava
4      Schroder
5    Blackbeard
Name: name, dtype: object

In [16]:
# get the name Petra
vet_records.iloc[2,0]

'Petra'

In [17]:
# get the color and age of the 3rd and 4th pets; lists of positions are used here, so the rows and columns need not be contiguous
vet_records.iloc[[2,3],[4,5]]

Unnamed: 0,color,age
2,calico,
3,blk/wht,12.0
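
On the full dataframe, labels and positions happen to coincide, which can hide the difference between `loc` and `iloc`. The following sketch, on a two-column subset of the lesson's data, shows where they diverge: after filtering, and when slicing.

```python
import pandas as pd

# Subset of the lesson's data, rebuilt so this example is self-contained
vet_records = pd.DataFrame({
    "name": ["Dexter", "Alfred", "Petra", "Ava", "Schroder", "Blackbeard"],
    "type": ["dog", "cat", "cat", "dog", "cat", "bird"],
})

# Filtering keeps the original row labels: 1, 2 and 4
cats = vet_records[vet_records["type"] == "cat"]

first_cat_by_position = cats.iloc[0]["name"]  # position 0 in the filtered frame
first_cat_by_label = cats.loc[1, "name"]      # row label 1
# cats.loc[0] would raise a KeyError, because label 0 was filtered out

# Slices behave differently too
label_slice = vet_records.loc[1:3, "name"]    # labels 1, 2, 3 (end label included)
position_slice = vet_records.iloc[1:3, 0]     # positions 1, 2 (end position excluded)

print(first_cat_by_position, first_cat_by_label)
print(list(label_slice), list(position_slice))
```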


#### `.isin` can be used to gather data about a list of items

Collect the data for Dexter and Blackbeard

In [18]:

vet_records[vet_records.name.isin(['Dexter','Blackbeard'])]

Unnamed: 0,name,owner,type,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
0,Dexter,Johnsons,dog,shiba inu,red sesame,1.5,35,m,False,both,True
5,Blackbeard,Brown,bird,parrot,multi,5.0,3,f,False,indoor,


#### `~` can be used as a *not* logical operator.

Here we ask for all pets **_not_** named Dexter or Blackbeard

In [19]:
vet_records[~vet_records.name.isin(['Dexter','Blackbeard'])]

Unnamed: 0,name,owner,type,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
1,Alfred,Johnsons,cat,mix,tuxedo,4.0,12,m,True,indoor,True
2,Petra,Smith,cat,ragdoll,calico,,10,f,False,both,True
3,Ava,Smith,dog,mix,blk/wht,12.0,32,f,True,both,False
4,Schroder,Brown,cat,mix,orange,13.0,15,m,False,indoor,True
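
The same pattern works on any column, not just `name`. As a short sketch on a subset of the lesson's data, here `.isin` and `~` are applied to the `type` column:

```python
import pandas as pd

# Subset of the lesson's data, rebuilt so this example is self-contained
vet_records = pd.DataFrame({
    "name": ["Dexter", "Alfred", "Petra", "Ava", "Schroder", "Blackbeard"],
    "type": ["dog", "cat", "cat", "dog", "cat", "bird"],
})

# Rows whose type is in the list
cats_and_dogs = vet_records[vet_records["type"].isin(["cat", "dog"])]

# Rows whose type is NOT in the list
other_pets = vet_records[~vet_records["type"].isin(["cat", "dog"])]

print(cats_and_dogs)
print(other_pets)
```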


#### Boolean Masks

There are times when a boolean mask will be useful to you. Masks are similar to filtering by booleans, but the boolean series is first stored in a variable. `mask` is simply the name I chose; the variable can be named anything you like.

Create a mask for male pets.

In [20]:
mask = vet_records.gender=='m'

Notice this is a series of `True` and `False` values: `True` wherever the gender column is "m", and `False` otherwise.

In [21]:
mask

0     True
1     True
2    False
3    False
4     True
5    False
Name: gender, dtype: bool

Applying this series as a mask results in only returning the male pets.  You can also use `~` to get the female pets.

In [22]:
vet_records[mask]

Unnamed: 0,name,owner,type,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
0,Dexter,Johnsons,dog,shiba inu,red sesame,1.5,35,m,False,both,True
1,Alfred,Johnsons,cat,mix,tuxedo,4.0,12,m,True,indoor,True
4,Schroder,Brown,cat,mix,orange,13.0,15,m,False,indoor,True
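
One advantage of storing masks in variables is that they can be combined. A minimal sketch, assuming descriptive mask names (`is_male`, `is_cat`) that are not part of the lesson: `&` means "and", `|` means "or", and `~` inverts a mask.

```python
import pandas as pd

# Subset of the lesson's data, rebuilt so this example is self-contained
vet_records = pd.DataFrame({
    "name": ["Dexter", "Alfred", "Petra", "Ava", "Schroder", "Blackbeard"],
    "type": ["dog", "cat", "cat", "dog", "cat", "bird"],
    "gender": ["m", "m", "f", "f", "m", "f"],
})

is_male = vet_records["gender"] == "m"
is_cat = vet_records["type"] == "cat"

male_cats = vet_records[is_male & is_cat]  # both conditions must hold
female_pets = vet_records[~is_male]        # inverted mask
n_male = is_male.sum()                     # True counts as 1, so this counts the male pets

print(male_cats)
print(female_pets)
print(n_male)
```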


Finally, check that `vet_records` was not altered.

In [23]:
vet_records

Unnamed: 0,name,owner,type,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
0,Dexter,Johnsons,dog,shiba inu,red sesame,1.5,35,m,False,both,True
1,Alfred,Johnsons,cat,mix,tuxedo,4.0,12,m,True,indoor,True
2,Petra,Smith,cat,ragdoll,calico,,10,f,False,both,True
3,Ava,Smith,dog,mix,blk/wht,12.0,32,f,True,both,False
4,Schroder,Brown,cat,mix,orange,13.0,15,m,False,indoor,True
5,Blackbeard,Brown,bird,parrot,multi,5.0,3,f,False,indoor,


### None and NaN
#### `.isna` will create a boolean dataframe that is `True` where the value is `NaN` or `None`.
**It is advisable to deal with NaN and None values before doing any calculations. By default, most pandas calculations (such as `sum` and `mean`) skip NaN and None cells, which can silently change the result.**

In [24]:
vet_records.isna()

Unnamed: 0,name,owner,type,breed,color,age,weight,gender,health issues,indoor/outboor,vaccinated
0,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,True,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False
5,False,False,False,False,False,False,False,False,False,False,True
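
The boolean dataframe is easier to read when summarized. The sketch below, on a subset of the lesson's columns, counts the missing values per column and shows how a missing age affects `mean` with and without the default `skipna=True`.

```python
import pandas as pd

# Subset of the lesson's data, rebuilt so this example is self-contained
vet_records = pd.DataFrame({
    "name": ["Dexter", "Alfred", "Petra", "Ava", "Schroder", "Blackbeard"],
    "age": [1.5, 4, None, 12, 13, 5],
})

# Number of missing values in each column
missing_per_column = vet_records.isna().sum()

# By default the missing age is skipped, so the mean covers 5 pets, not 6
mean_age_default = vet_records["age"].mean()

# With skipna=False a single missing value makes the result NaN
mean_age_strict = vet_records["age"].mean(skipna=False)

print(missing_per_column)
print(mean_age_default, mean_age_strict)
```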


In [25]:
# fill every missing value with 0 and keep the result in a new dataframe
vet_records_example = vet_records.fillna(0)

In [None]:
vet_records_example

#### Use `fillna` With a Values Dictionary

In [None]:
values = {"age": 12, "vaccinated": False}

In [None]:
vet_records.fillna(value=values)

Notice that `vet_records` was not changed. The result would need to be assigned to another variable, or back to `vet_records` itself, to save the changes.

In [None]:
vet_records

In [None]:
vet_records_na = vet_records.fillna(value=values)

In [None]:
vet_records_na
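
Since the cells above have no recorded output, here is a self-contained sketch of the same idea on a subset of the lesson's columns. It uses the same `values` dictionary and checks that only the returned dataframe has the gaps filled.

```python
import pandas as pd

# Subset of the lesson's data, rebuilt so this example is self-contained
vet_records = pd.DataFrame({
    "name": ["Dexter", "Alfred", "Petra", "Ava", "Schroder", "Blackbeard"],
    "age": [1.5, 4, None, 12, 13, 5],
    "vaccinated": [True, True, True, False, True, None],
})

# One replacement value per column; columns not listed are left alone
values = {"age": 12, "vaccinated": False}

# fillna returns a new dataframe; vet_records keeps its missing values
vet_records_na = vet_records.fillna(value=values)

print(vet_records_na)
print(vet_records.isna().sum())
```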