
 <img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" alt="Panda Logo" width="500">

`Pandas` is a `Python` module for data manipulation and analysis, widely used around the world in both universities and companies. We will show how easy it is to work with data in notebooks using a few lines of `pandas` code.

https://pandas.pydata.org/

As usual, let's load a dataset to work with.



In [None]:
from vega_datasets import data
dfr = data.la_riots()
dfr

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
0,Cesar A.,Aguilar,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281
1,George,Alvarez,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098
59,Louis A.,Watson,18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244
60,Elbert O.,Wilkins,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767
61,John H.,Willers,37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184


# Selecting columns

Sometimes we want to select just a few columns to work with.

In [None]:
dfr.columns

Index(['first_name', 'last_name', 'age', 'gender', 'race', 'death_date',
       'address', 'neighborhood', 'type', 'longitude', 'latitude'],
      dtype='object')

We can use the name of the columns we want. Notice the syntax!

In [None]:
dfr['age']

0     18.0
1     42.0
2     40.0
3     30.0
4     87.0
      ... 
58    20.0
59    18.0
60    33.0
61    37.0
62    29.0
Name: age, Length: 63, dtype: float64

In [None]:
dfr[['first_name','last_name']]

Unnamed: 0,first_name,last_name
0,Cesar A.,Aguilar
1,George,Alvarez
2,Wilson,Alvarez
3,Brian E.,Andrew
4,Vivian,Austin
...,...,...
58,Fredrick,Ward
59,Louis A.,Watson
60,Elbert O.,Wilkins
61,John H.,Willers
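
The difference between single and double brackets can be checked on a toy dataframe. This is a minimal sketch: the frame `df` and its values are made up for illustration and only stand in for `dfr`.

```python
import pandas as pd

# Toy frame standing in for dfr (hypothetical values).
df = pd.DataFrame({"first_name": ["Ana", "Luis"],
                   "last_name": ["Pi", "Roca"],
                   "age": [31.0, 45.0]})

one_column = df["age"]                          # a single label returns a Series
two_columns = df[["first_name", "last_name"]]   # a list of labels returns a DataFrame
still_a_frame = df[["age"]]                     # a one-element list still returns a DataFrame

print(type(one_column).__name__)     # Series
print(type(still_a_frame).__name__)  # DataFrame
```

So the inner brackets are just a Python list of column names.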


Conversely, we can name the columns we are not interested in:

In [None]:
dfr.drop(columns=['first_name', 'last_name'])

Unnamed: 0,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
0,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281
1,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690
2,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
3,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457
4,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
...,...,...,...,...,...,...,...,...,...
58,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098
59,18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244
60,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767
61,37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184


Neither selecting nor dropping columns alters the original dataframe.

In [None]:
dfr

The result of the previous operations can be assigned to a new dataframe variable.

In [None]:
dfr_selected = dfr.drop(columns=['first_name', 'last_name'])

In [None]:
dfr_selected

Unnamed: 0,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
0,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281
1,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690
2,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
3,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457
4,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
...,...,...,...,...,...,...,...,...,...
58,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098
59,18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244
60,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767
61,37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184
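
The same behaviour can be verified on a small example. The sketch below uses a made-up frame `df`; it only illustrates that `drop` returns a new object and leaves the original untouched unless we reassign it.

```python
import pandas as pd

# Hypothetical toy data, not part of the riots dataset.
df = pd.DataFrame({"name": ["a", "b"], "age": [20, 30], "city": ["X", "Y"]})

without_name = df.drop(columns=["name"])   # new dataframe, df is unchanged

print(list(df.columns))            # ['name', 'age', 'city']
print(list(without_name.columns))  # ['age', 'city']

# To "modify" df we reassign the result to the same variable.
df = df.drop(columns=["city"])
print(list(df.columns))            # ['name', 'age']
```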


In [None]:
dfr_selected.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   age           62 non-null     float64       
 1   gender        63 non-null     object        
 2   race          63 non-null     object        
 3   death_date    63 non-null     datetime64[ns]
 4   address       63 non-null     object        
 5   neighborhood  63 non-null     object        
 6   type          63 non-null     object        
 7   longitude     63 non-null     float64       
 8   latitude      63 non-null     float64       
dtypes: datetime64[ns](1), float64(3), object(5)
memory usage: 4.6+ KB


Another option `pandas` offers is to select the columns of a given type:

In [None]:
dfr.select_dtypes('object')


Unnamed: 0,first_name,last_name,gender,race,address,neighborhood,type
0,Cesar A.,Aguilar,Male,Latino,2009 W. 6th St.,Westlake,Officer-involved shooting
1,George,Alvarez,Male,Latino,Main & College streets,Chinatown,Not riot-related
2,Wilson,Alvarez,Male,Latino,3100 Rosecrans Ave.,Hawthorne,Homicide
3,Brian E.,Andrew,Male,Black,Rosecrans & Chester avenues,Compton,Officer-involved shooting
4,Vivian,Austin,Female,Black,1600 W. 60th St.,Harvard Park,Death
...,...,...,...,...,...,...,...
58,Fredrick,Ward,Male,Black,11932 Cometa Ave.,Pacoima,Homicide
59,Louis A.,Watson,Male,Black,4365 S. Vermont Ave.,Vermont Square,Homicide
60,Elbert O.,Wilkins,Male,Black,Western Avenue & 92nd Street,Gramercy Park,Homicide
61,John H.,Willers,Male,White,10621 Sepulveda Blvd.,Mission Hills,Homicide


## **Exercise**

In the `dfr` dataframe, select the columns that do not have a numerical type.  

In [None]:
dfr.select_dtypes(exclude='number')

Unnamed: 0,first_name,last_name,gender,race,death_date,address,neighborhood,type
0,Cesar A.,Aguilar,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting
1,George,Alvarez,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related
2,Wilson,Alvarez,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide
3,Brian E.,Andrew,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting
4,Vivian,Austin,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death
...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide
59,Louis A.,Watson,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide
60,Elbert O.,Wilkins,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide
61,John H.,Willers,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide
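
When the dataframe also has integer columns, the dtype string matters. The following sketch uses a made-up frame with one column of each kind to compare `'number'` (every numeric dtype) with `'float'` (floating point only).

```python
import pandas as pd

# Hypothetical frame with a string, an integer, a float and a date column.
df = pd.DataFrame({
    "name": ["a", "b"],
    "count": [1, 2],                                        # int64
    "score": [0.5, 1.5],                                    # float64
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),   # datetime64[ns]
})

numeric = df.select_dtypes(include="number")      # count, score
non_numeric = df.select_dtypes(exclude="number")  # name, when
non_float = df.select_dtypes(exclude="float")     # name, count, when

print(list(numeric.columns))
print(list(non_numeric.columns))
print(list(non_float.columns))
```

Excluding only `'float'` would keep the integer column `count`, which is still numerical.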


## **Exercise**

Select the columns in the dataframe `dfr` that satisfy a condition. For instance, the columns whose name contains a 'g', or the columns whose name contains the word 'name'. The solution below keeps the columns whose name is shorter than five characters.

In [None]:
dfr[[name for name in dfr.columns if len(name)<5]]

Unnamed: 0,age,race,type
0,18.0,Latino,Officer-involved shooting
1,42.0,Latino,Not riot-related
2,40.0,Latino,Homicide
3,30.0,Black,Officer-involved shooting
4,87.0,Black,Death
...,...,...,...
58,20.0,Black,Homicide
59,18.0,Black,Homicide
60,33.0,Black,Homicide
61,37.0,White,Homicide
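
The two conditions mentioned in the exercise can be written in a similar way. As a sketch on a one-row toy frame (the column names mimic `dfr`, the values are made up), we can use the `filter` method or a boolean mask built from `df.columns`:

```python
import pandas as pd

# Toy frame with column names similar to dfr (hypothetical values).
df = pd.DataFrame([["Ana", "Pi", 31.0, "Female", -118.2]],
                  columns=["first_name", "last_name", "age", "gender", "longitude"])

has_name = df.filter(like="name")                 # labels containing the text 'name'
has_g = df.loc[:, df.columns.str.contains("g")]   # labels containing a 'g'
ends_with_e = df.filter(regex="e$")               # labels matching a regular expression

print(list(has_name.columns))     # ['first_name', 'last_name']
print(list(has_g.columns))        # ['age', 'gender', 'longitude']
print(list(ends_with_e.columns))  # ['first_name', 'last_name', 'age', 'longitude']
```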


# Selecting rows

## Sample

Selecting rows from a dataframe is a very common operation, for instance to draw a random sample.

In `pandas` we can use the `sample()` method:

In [None]:
dfr.sample(5)

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
26,Betty,Jackson,56.0,Female,Black,1992-05-01,Main & 51st streets,South Park,Death,-118.273931,33.996522
49,Imad,Sharaf,31.0,Male,Black,1992-05-03,San Diego Freeway & San Fernando Mission Boule...,Mission Hills,Not riot-related,-118.471745,34.271856
58,Fredrick,Ward,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098
44,Hugo G.,Ramirez,23.0,Male,Latino,1992-05-03,12732 Bess St.,Baldwin Park,Not riot-related,-117.997106,34.070238
11,John,Doe #80,,Male,White,1992-05-02,5800 block of South Vermont Avenue,Vermont-Slauson,Homicide,-118.291495,33.989399


In [None]:
dfr.sample(frac=0.1)

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
36,Arturo C.,Miranda,23.0,Male,Latino,1992-04-29,120th Street & Central Avenue,Green Meadows,Homicide,-118.254343,33.92373
57,Eduardo C.,Vela,33.0,Male,Latino,1992-04-29,5100 block of West Slauson Avenue,Ladera Heights,Homicide,-118.36872,33.987604
38,Nissar,Mustafa,20.0,Male,White,1992-08-12,1601 S. Western Ave.,Harvard Heights,Homicide,-118.309034,34.043554
32,Lucie R.,Maronian,51.0,Female,White,1992-05-01,1800 block of East New York Drive,Altadena,Homicide,-118.113436,34.178505
27,Dennis Ray,Jackson,38.0,Male,Black,1992-04-30,11322 Alvaro St.,Watts,Officer-involved shooting,-118.253757,33.932292
17,Jose L.,Garcia,15.0,Male,Latino,1992-04-30,1005 S. Fresno St.,Boyle Heights,Not riot-related,-118.207042,34.027553


In [None]:
dfr.sample?
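
Each call to `sample()` returns different rows. If we need the same sample every time the notebook runs, we can fix the seed with the `random_state` parameter. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical frame with 100 rows.
df = pd.DataFrame({"x": range(100)})

first = df.sample(n=5, random_state=42)
second = df.sample(n=5, random_state=42)   # same seed, same rows

print(first.equals(second))  # True

tenth = df.sample(frac=0.1, random_state=0)  # 10% of the rows
print(len(tenth))            # 10
```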

## Index

Sometimes we do not want random rows; on the contrary, we want to select very specific rows by their index label.

In [None]:
dfr.loc[16]

first_name                                  Kevin A.
last_name                                   Evanahen
age                                             24.0
gender                                          Male
race                                           White
death_date                       1992-05-01 00:00:00
address         Braddock Drive & Inglewood Boulevard
neighborhood                                 Del Rey
type                                        Homicide
longitude                                -118.414453
latitude                                   33.992672
Name: 16, dtype: object

In [None]:
dfr.loc[16:20]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
16,Kevin A.,Evanahen,24.0,Male,White,1992-05-01,Braddock Drive & Inglewood Boulevard,Del Rey,Homicide,-118.414453,33.992672
17,Jose L.,Garcia,15.0,Male,Latino,1992-04-30,1005 S. Fresno St.,Boyle Heights,Not riot-related,-118.207042,34.027553
18,Mark,Garcia,15.0,Male,Latino,1992-04-30,10700 block of Burin Avenue,Lennox,Officer-involved shooting,-118.353766,33.938954
19,Elias,Garcia Rivera,32.0,Male,Latino,1992-12-16,12834 Vanowen St.,Valley Glen,Homicide,-118.413791,34.193934
20,Andreas,Garnica,36.0,Male,Latino,1992-04-30,2034 W. Pico Blvd.,Pico-Union,Not riot-related,-118.281879,34.046844


In [None]:
dfr.loc[16:25:3]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
16,Kevin A.,Evanahen,24.0,Male,White,1992-05-01,Braddock Drive & Inglewood Boulevard,Del Rey,Homicide,-118.414453,33.992672
19,Elias,Garcia Rivera,32.0,Male,Latino,1992-12-16,12834 Vanowen St.,Valley Glen,Homicide,-118.413791,34.193934
22,Matthew D.,Haines,32.0,Male,White,1992-04-30,Lemon Avenue & Pacific Coast Highway,Long Beach,Homicide,-118.178511,33.789857
25,Paul D.,Horace,38.0,Male,Black,1992-05-01,1439 E. Walnut,Central-Alameda,Homicide,-118.247414,34.021735


In [None]:
list(range(10,13))+[15,16]

[10, 11, 12, 15, 16]

In [None]:
dfr.loc[list(range(10,13))+[15,16]]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
10,Gregory,Davis Jr.,15.0,Male,Black,1992-04-30,Vermont Avenue & 43rd Street,Vermont Square,Homicide,-118.291549,34.005485
11,John,Doe #80,,Male,White,1992-05-02,5800 block of South Vermont Avenue,Vermont-Slauson,Homicide,-118.291495,33.989399
12,Harry,Doller,56.0,Male,White,1992-05-01,3500 block of Winslow Drive,Silver Lake,Not riot-related,-118.278763,34.087788
15,Juana,Espinosa,65.0,Female,Latino,1992-05-02,7608 S. Compton Ave.,Compton,Homicide,-118.246188,33.919821
16,Kevin A.,Evanahen,24.0,Male,White,1992-05-01,Braddock Drive & Inglewood Boulevard,Del Rey,Homicide,-118.414453,33.992672
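
Notice that `loc` works with index *labels* and that its slices include the end label, unlike ordinary Python slices. The sketch below, on a made-up frame whose index does not start at 0, compares it with `iloc`, which works with positions:

```python
import pandas as pd

# Hypothetical frame whose index labels differ from the row positions.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]},
                  index=[100, 101, 102, 103, 104])

by_label = df.loc[101:103]    # labels 101, 102 and 103: the end label is included
by_position = df.iloc[1:3]    # positions 1 and 2: the end position is excluded

print(list(by_label.index))     # [101, 102, 103]
print(list(by_position.index))  # [101, 102]
```

In `dfr` labels and positions coincide because the index is the default `RangeIndex`, but this is no longer true after filtering or sampling.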


Often we do not know or care about the index of the rows; instead, we want to select all the rows that satisfy some property. Let's see how to select rows using different criteria.

## Numerical criteria

In the dataframe `dfr`, imagine we want to select all rows whose `age` value satisfies some property, for instance:

In [None]:
dfr[dfr.age>=56]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
12,Harry,Doller,56.0,Male,White,1992-05-01,3500 block of Winslow Drive,Silver Lake,Not riot-related,-118.278763,34.087788
15,Juana,Espinosa,65.0,Female,Latino,1992-05-02,7608 S. Compton Ave.,Compton,Homicide,-118.246188,33.919821
26,Betty,Jackson,56.0,Female,Black,1992-05-01,Main & 51st streets,South Park,Death,-118.273931,33.996522
45,Aaron,Ratinoff,68.0,Male,White,1992-05-01,11690 Gateway Blvd.,Sawtelle,Homicide,-118.4431,34.028655
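
What happens inside the brackets? The comparison produces a boolean `Series` (a mask) with one value per row, and the dataframe keeps the rows where the mask is `True`. A small sketch with made-up ages, including a missing one:

```python
import numpy as np
import pandas as pd

# Hypothetical ages; the third one is missing.
df = pd.DataFrame({"age": [18.0, 60.0, np.nan, 56.0]})

mask = df["age"] >= 56
print(mask.tolist())   # [False, True, False, True]

selected = df[mask]    # rows 1 and 3
print(list(selected.index))
```

A comparison with `nan` is `False`, so rows with a missing `age` are never selected by this kind of condition.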


As already mentioned, if we want to keep the resulting dataframe for further processing, we have to assign it to a new variable.




In [None]:
dfr_age50 = dfr[dfr.age>50]
dfr_age50.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, 4 to 55
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   first_name    7 non-null      object        
 1   last_name     7 non-null      object        
 2   age           7 non-null      float64       
 3   gender        7 non-null      object        
 4   race          7 non-null      object        
 5   death_date    7 non-null      datetime64[ns]
 6   address       7 non-null      object        
 7   neighborhood  7 non-null      object        
 8   type          7 non-null      object        
 9   longitude     7 non-null      float64       
 10  latitude      7 non-null      float64       
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 672.0+ bytes


We can also select an age range (both endpoints are included):


In [None]:
dfr[dfr.age.between(30,40)]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.21539,33.903457
7,Patrick,Bettan,30.0,Male,White,1992-04-30,2740 W. Olympic Blvd.,Koreatown,Homicide,-118.293181,34.052068
13,Kevin J.,Edwards,35.0,Male,Black,1992-04-30,614 S. Locust Ave.,Compton,Not riot-related,-118.200252,33.892421
19,Elias,Garcia Rivera,32.0,Male,Latino,1992-12-16,12834 Vanowen St.,Valley Glen,Homicide,-118.413791,34.193934
20,Andreas,Garnica,36.0,Male,Latino,1992-04-30,2034 W. Pico Blvd.,Pico-Union,Not riot-related,-118.281879,34.046844
21,Meeker,Gibson,35.0,Male,Black,1992-05-01,Holt & Loranne avenues,Pomona,Homicide,-117.730647,34.062846
22,Matthew D.,Haines,32.0,Male,White,1992-04-30,Lemon Avenue & Pacific Coast Highway,Long Beach,Homicide,-118.178511,33.789857
23,Jimmie,Harris,38.0,Male,Black,1992-04-29,Avalon Boulevard & Slauson Avenue,South Park,Death,-118.265199,33.989245
25,Paul D.,Horace,38.0,Male,Black,1992-05-01,1439 E. Walnut,Central-Alameda,Homicide,-118.247414,34.021735
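
As a sketch with made-up ages, `between` is equivalent to combining two comparisons. The `inclusive` parameter accepts strings such as `"left"` in recent versions of `pandas` (1.3 or later is assumed here):

```python
import pandas as pd

ages = pd.Series([29, 30, 35, 40, 41])   # hypothetical values

inside = ages.between(30, 40)                        # 30 <= age <= 40
same = (ages >= 30) & (ages <= 40)                   # the explicit version
half_open = ages.between(30, 40, inclusive="left")   # 30 <= age < 40

print(inside.tolist())     # [False, True, True, True, False]
print(half_open.tolist())  # [False, True, True, False, False]
```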


### **Exercise**

Select the rows in `dfr` whose `age` value is in the fourth quartile.

In [None]:
dfr[dfr.age>=dfr.age.quantile(0.75)]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
1,George,Alvarez,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.06269
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
6,Carol,Benson,42.0,Female,Black,1992-05-02,Harbor Freeway near Slauson Avenue,South Park,Death,-118.280504,33.989168
8,Hector,Castro,49.0,Male,Latino,1992-04-30,Vermont & Leeward avenues,Koreatown,Homicide,-118.291654,34.058702
12,Harry,Doller,56.0,Male,White,1992-05-01,3500 block of Winslow Drive,Silver Lake,Not riot-related,-118.278763,34.087788
13,Kevin J.,Edwards,35.0,Male,Black,1992-04-30,614 S. Locust Ave.,Compton,Not riot-related,-118.200252,33.892421
14,Howard,Epstein,45.0,Male,White,1992-04-30,Slauson & 7th avenues,Hyde Park,Homicide,-118.324742,33.989049
15,Juana,Espinosa,65.0,Female,Latino,1992-05-02,7608 S. Compton Ave.,Compton,Homicide,-118.246188,33.919821
19,Elias,Garcia Rivera,32.0,Male,Latino,1992-12-16,12834 Vanowen St.,Valley Glen,Homicide,-118.413791,34.193934


## Date criteria

If we want to select rows considering a property that depends on date values, the idea is the same: we have to add conditions to that column.

In [None]:
dfr[dfr['death_date']>('1992-05-03')]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
19,Elias,Garcia Rivera,32.0,Male,Latino,1992-12-16,12834 Vanowen St.,Valley Glen,Homicide,-118.413791,34.193934
38,Nissar,Mustafa,20.0,Male,White,1992-08-12,1601 S. Western Ave.,Harvard Heights,Homicide,-118.309034,34.043554
48,Juan V.,Salgado,20.0,Male,Latino,1992-05-20,3100 block of South Main Street,Historic South-Central,Homicide,-118.271379,34.020622
55,Wallace,Tope,54.0,Male,White,1993-11-24,5510 W. Sunset Blvd.,Hollywood,Homicide,-118.309822,34.098082
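
When a datetime column is compared with a string, `pandas` parses the string as a date. The sketch below, on a few made-up dates, shows the same comparison with an explicit `pd.Timestamp`, plus two other common date conditions:

```python
import pandas as pd

# Hypothetical dates.
df = pd.DataFrame({"when": pd.to_datetime(
    ["1992-04-30", "1992-05-03", "1992-05-20", "1993-11-24"])})

after = df[df["when"] > "1992-05-03"]                  # string parsed as a date
explicit = df[df["when"] > pd.Timestamp("1992-05-03")] # same result, explicit type
in_1992 = df[df["when"].dt.year == 1992]               # condition on a date component
in_may = df[df["when"].between("1992-05-01", "1992-05-31")]  # a date range

print(len(after), len(in_1992), len(in_may))   # 2 3 2
```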


### **Exercise**

Select the rows for the deaths that occurred in the month with the largest number of deaths.

In [None]:
dfr.death_date.dt.month.value_counts().index[0]

dfr[dfr.death_date.dt.month==dfr.death_date.dt.month.value_counts().index[0]]


Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
0,Cesar A.,Aguilar,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.21539,33.903457
5,Franklin,Benavidez,27.0,Male,Latino,1992-04-30,4404 S. Western Ave.,Vermont Square,Officer-involved shooting,-118.308821,34.003473
7,Patrick,Bettan,30.0,Male,White,1992-04-30,2740 W. Olympic Blvd.,Koreatown,Homicide,-118.293181,34.052068
8,Hector,Castro,49.0,Male,Latino,1992-04-30,Vermont & Leeward avenues,Koreatown,Homicide,-118.291654,34.058702
9,Jerel L.,Channell,26.0,Male,Black,1992-04-30,Santa Monica Boulevard & Seward Street,Hollywood,Death,-118.332378,34.091298
10,Gregory,Davis Jr.,15.0,Male,Black,1992-04-30,Vermont Avenue & 43rd Street,Vermont Square,Homicide,-118.291549,34.005485
13,Kevin J.,Edwards,35.0,Male,Black,1992-04-30,614 S. Locust Ave.,Compton,Not riot-related,-118.200252,33.892421
14,Howard,Epstein,45.0,Male,White,1992-04-30,Slauson & 7th avenues,Hyde Park,Homicide,-118.324742,33.989049
17,Jose L.,Garcia,15.0,Male,Latino,1992-04-30,1005 S. Fresno St.,Boyle Heights,Not riot-related,-118.207042,34.027553
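
Keep in mind that `dt.month` returns only the month number, so the same month of different years is counted together. If that is not what we want, one possible alternative (a sketch on made-up dates) is to group by year and month with `dt.to_period('M')`:

```python
import pandas as pd

# Hypothetical dates: three in April 1992, one in May 1992, one in April 1993.
dates = pd.Series(pd.to_datetime(
    ["1992-04-29", "1992-04-30", "1992-04-30", "1992-05-01", "1993-04-02"]))

by_month_number = dates.dt.month.value_counts()          # April of both years together
by_year_month = dates.dt.to_period("M").value_counts()   # one count per year-month

busiest = by_year_month.idxmax()                         # the year-month with most rows
in_busiest = dates[dates.dt.to_period("M") == busiest]

print(by_month_number.loc[4])  # 4
print(busiest)                 # 1992-04
print(len(in_busiest))         # 3
```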


## String criteria

Similarly, if the row-selection criterion depends on string values, it is natural to use the methods we saw for working with string columns.

For instance, consider this method: `.str.contains()`


In [None]:
dfr[dfr.neighborhood.str.contains('Hollywood')]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
9,Jerel L.,Channell,26.0,Male,Black,1992-04-30,Santa Monica Boulevard & Seward Street,Hollywood,Death,-118.332378,34.091298
31,Darnell R.,Mallory,18.0,Male,Black,1992-04-30,Santa Monica Boulevard & Seward Street,Hollywood,Death,-118.334138,34.090972
42,Juanita,Pettaway,37.0,Female,Black,1992-04-30,Santa Monica Boulevard & Seward Street,Hollywood,Death,-118.333177,34.090696
50,Jose,Solorzano,25.0,Male,Latino,1992-05-01,Vermont Avenue & Santa Monica Boulevard,East Hollywood,Homicide,-118.291755,34.090868
54,James L.,Taylor,26.0,Male,Black,1992-04-30,5213 Sunset Blvd.,East Hollywood,Homicide,-118.303756,34.098248
55,Wallace,Tope,54.0,Male,White,1993-11-24,5510 W. Sunset Blvd.,Hollywood,Homicide,-118.309822,34.098082


Or, this other method...

In [None]:
dfr[dfr.last_name.str.startswith('C')]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
8,Hector,Castro,49.0,Male,Latino,1992-04-30,Vermont & Leeward avenues,Koreatown,Homicide,-118.291654,34.058702
9,Jerel L.,Channell,26.0,Male,Black,1992-04-30,Santa Monica Boulevard & Seward Street,Hollywood,Death,-118.332378,34.091298
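
`str.contains` has a few parameters that are often useful. In this sketch, on made-up place names, `case=False` ignores upper and lower case, `na=False` treats missing values as "no match", and `regex=False` searches for the literal text instead of a regular expression:

```python
import pandas as pd

# Hypothetical values, including a missing one.
places = pd.Series(["Hollywood", "East Hollywood", "hollywood hills",
                    None, "South Park (east)"])

exact_case = places.str.contains("Hollywood", na=False)
any_case = places.str.contains("hollywood", case=False, na=False)
literal = places.str.contains("(east)", regex=False, na=False)  # parentheses taken literally

print(exact_case.tolist())  # [True, True, False, False, False]
print(any_case.tolist())    # [True, True, True, False, False]
print(literal.tolist())     # [False, False, False, False, True]
```

Without `na=False`, a missing value produces a missing result, and the mask can no longer be used directly to select rows.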


### **Exercise**

Select the rows in `dfr` for deaths that occurred at a crossroads.

In [None]:
dfr[dfr.address.str.contains('&')]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
1,George,Alvarez,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.06269
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.21539,33.903457
8,Hector,Castro,49.0,Male,Latino,1992-04-30,Vermont & Leeward avenues,Koreatown,Homicide,-118.291654,34.058702
9,Jerel L.,Channell,26.0,Male,Black,1992-04-30,Santa Monica Boulevard & Seward Street,Hollywood,Death,-118.332378,34.091298
10,Gregory,Davis Jr.,15.0,Male,Black,1992-04-30,Vermont Avenue & 43rd Street,Vermont Square,Homicide,-118.291549,34.005485
14,Howard,Epstein,45.0,Male,White,1992-04-30,Slauson & 7th avenues,Hyde Park,Homicide,-118.324742,33.989049
16,Kevin A.,Evanahen,24.0,Male,White,1992-05-01,Braddock Drive & Inglewood Boulevard,Del Rey,Homicide,-118.414453,33.992672
21,Meeker,Gibson,35.0,Male,Black,1992-05-01,Holt & Loranne avenues,Pomona,Homicide,-117.730647,34.062846
22,Matthew D.,Haines,32.0,Male,White,1992-04-30,Lemon Avenue & Pacific Coast Highway,Long Beach,Homicide,-118.178511,33.789857
23,Jimmie,Harris,38.0,Male,Black,1992-04-29,Avalon Boulevard & Slauson Avenue,South Park,Death,-118.265199,33.989245


## All together

The different conditions we have discussed so far can be combined.
Let's suppose that we want the rows whose `age` is at least 20, whose `neighborhood` contains `Park` and whose `gender` is `Female`:

In [None]:
# inside the square brackets the expression can continue on the next line
dfr[(dfr['age']>=20)
    & (dfr['neighborhood'].str.contains('Park'))
    & (dfr['gender']=='Female')]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
6,Carol,Benson,42.0,Female,Black,1992-05-02,Harbor Freeway near Slauson Avenue,South Park,Death,-118.280504,33.989168
26,Betty,Jackson,56.0,Female,Black,1992-05-01,Main & 51st streets,South Park,Death,-118.273931,33.996522


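Be careful with the syntax: do not forget the parentheses around each condition, because `&` binds more tightly than comparison operators such as `>=` and `==`.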
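
Besides `&` (and), conditions can be combined with `|` (or) and negated with `~` (not). A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"age": [18, 42, 56],
                   "gender": ["Male", "Female", "Female"]})

both = df[(df["age"] >= 20) & (df["gender"] == "Female")]    # and
either = df[(df["age"] >= 50) | (df["gender"] == "Male")]    # or
negated = df[~(df["gender"] == "Female")]                    # not

print(list(both.index))     # [1, 2]
print(list(either.index))   # [0, 2]
print(list(negated.index))  # [0]

# Without parentheses, df["age"] >= 20 & df["gender"] == "Female"
# is read as df["age"] >= (20 & df["gender"]) == "Female", which fails.
```

The Python keywords `and`, `or` and `not` do not work here, because they expect a single truth value instead of one value per row.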


### **Exercise**

Select the rows where `race` is not "White" and the death occurred at a crossroads involving "Western" avenue.

In [None]:
dfr[(dfr['race']!='White')
    & (dfr['address'].str.contains('&'))
    & (dfr['address'].str.contains('Western'))]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
39,Ernest,Neal Jr.,27.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.309006,33.952709
60,Elbert O.,Wilkins,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767
62,Willie Bernard,Williams,29.0,Male,Black,1992-04-29,Gage & Western avenues,Chesterfield Square,Death,-118.308952,33.982363


# Selecting undefined values

A special case we will often want to select when working with datasets is the rows with null values, also called `nan` values.

In [None]:
dfr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   first_name    63 non-null     object        
 1   last_name     63 non-null     object        
 2   age           62 non-null     float64       
 3   gender        63 non-null     object        
 4   race          63 non-null     object        
 5   death_date    63 non-null     datetime64[ns]
 6   address       63 non-null     object        
 7   neighborhood  63 non-null     object        
 8   type          63 non-null     object        
 9   longitude     63 non-null     float64       
 10  latitude      63 non-null     float64       
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 5.5+ KB


The `isna()` method makes it easy to select them:

In [None]:
dfr[dfr.age.isna()]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
11,John,Doe #80,,Male,White,1992-05-02,5800 block of South Vermont Avenue,Vermont-Slauson,Homicide,-118.291495,33.989399
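
Instead of reading the output of `info()`, we can also count the missing values directly. The sketch below uses a made-up frame; `isna().sum()` gives the number of `nan` per column and `notna()` is the opposite of `isna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing values.
df = pd.DataFrame({"name": ["a", "b", "c"],
                   "age": [20.0, np.nan, 31.0],
                   "hp": [np.nan, np.nan, 90.0]})

missing_per_column = df.isna().sum()
print(missing_per_column.to_dict())   # {'name': 0, 'age': 1, 'hp': 2}

missing_rows = df[df["age"].isna()]   # rows without an age
present_rows = df[df["age"].notna()]  # rows with an age
print(list(missing_rows.index), list(present_rows.index))  # [1] [0, 2]
```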


We can have more complex situations, where `nan` values are spread over different columns in the dataframe. Let's consider another dataframe to illustrate this situation.

In [None]:
dfc = data.cars()
dfc

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA
...,...,...,...,...,...,...,...,...,...
401,ford mustang gl,27.0,4,140.0,86.0,2790,15.6,1982-01-01,USA
402,vw pickup,44.0,4,97.0,52.0,2130,24.6,1982-01-01,Europe
403,dodge rampage,32.0,4,135.0,84.0,2295,11.6,1982-01-01,USA
404,ford ranger,28.0,4,120.0,79.0,2625,18.6,1982-01-01,USA


In [None]:
dfc.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 406 entries, 0 to 405
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   Name              406 non-null    object        
 1   Miles_per_Gallon  398 non-null    float64       
 2   Cylinders         406 non-null    int64         
 3   Displacement      406 non-null    float64       
 4   Horsepower        400 non-null    float64       
 5   Weight_in_lbs     406 non-null    int64         
 6   Acceleration      406 non-null    float64       
 7   Year              406 non-null    datetime64[ns]
 8   Origin            406 non-null    object        
dtypes: datetime64[ns](1), float64(4), int64(2), object(2)
memory usage: 28.7+ KB


In [None]:
dfc[dfc.Horsepower.isna()]

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
38,ford pinto,25.0,4,98.0,,2046,19.0,1971-01-01,USA
133,ford maverick,21.0,6,200.0,,2875,17.0,1974-01-01,USA
337,renault lecar deluxe,40.9,4,85.0,,1835,17.3,1980-01-01,Europe
343,ford mustang cobra,23.6,4,140.0,,2905,14.3,1980-01-01,USA
361,renault 18i,34.5,4,100.0,,2320,15.8,1982-01-01,Europe
382,amc concord dl,23.0,4,151.0,,3035,20.5,1982-01-01,USA


To show all the rows with a `nan` value in either of these two columns we can write:

In [None]:
dfc[dfc.Miles_per_Gallon.isna() | dfc.Horsepower.isna()]

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
10,citroen ds-21 pallas,,4,133.0,115.0,3090,17.5,1970-01-01,Europe
11,chevrolet chevelle concours (sw),,8,350.0,165.0,4142,11.5,1970-01-01,USA
12,ford torino (sw),,8,351.0,153.0,4034,11.0,1970-01-01,USA
13,plymouth satellite (sw),,8,383.0,175.0,4166,10.5,1970-01-01,USA
14,amc rebel sst (sw),,8,360.0,175.0,3850,11.0,1970-01-01,USA
17,ford mustang boss 302,,8,302.0,140.0,3353,8.0,1970-01-01,USA
38,ford pinto,25.0,4,98.0,,2046,19.0,1971-01-01,USA
39,volkswagen super beetle 117,,4,97.0,48.0,1978,20.0,1971-01-01,Europe
133,ford maverick,21.0,6,200.0,,2875,17.0,1974-01-01,USA
337,renault lecar deluxe,40.9,4,85.0,,1835,17.3,1980-01-01,Europe


Once the `nan` values are identified, several operations are possible. Let's look at two simple ones:

*   Remove the rows with `nan`
*   Fill the `nan` with a value




In [None]:
dfr.dropna()

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
0,Cesar A.,Aguilar,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281
1,George,Alvarez,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098
59,Louis A.,Watson,18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244
60,Elbert O.,Wilkins,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767
61,John H.,Willers,37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184
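
By default `dropna()` removes every row with at least one `nan`. The `how` and `subset` parameters give finer control, as this sketch on a made-up frame shows:

```python
import numpy as np
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan],
                   "b": [np.nan, np.nan, 6.0, 7.0]})

any_missing = df.dropna()               # drop rows with at least one nan
all_missing = df.dropna(how="all")      # drop rows where every value is nan
only_a = df.dropna(subset=["a"])        # look only at column 'a'

print(list(any_missing.index))  # [2]
print(list(all_missing.index))  # [0, 2, 3]
print(list(only_a.index))       # [0, 2]
```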


And `fillna()` is the method to replace `nan` with a value. Recall the row with a missing `age`:

In [None]:
dfr[dfr.age.isna()]

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
11,John,Doe #80,,Male,White,1992-05-02,5800 block of South Vermont Avenue,Vermont-Slauson,Homicide,-118.291495,33.989399


In [None]:
dfr.fillna(value=80)

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
0,Cesar A.,Aguilar,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281
1,George,Alvarez,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098
59,Louis A.,Watson,18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244
60,Elbert O.,Wilkins,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767
61,John H.,Willers,37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184
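
A single value passed to `fillna()` is used for every column. When each column needs its own replacement, we can pass a dictionary that maps column names to values. A minimal sketch on a made-up frame:

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing number and a missing text.
df = pd.DataFrame({"age": [18.0, np.nan],
                   "city": ["X", None]})

# One replacement value per column.
filled = df.fillna({"age": df["age"].median(), "city": "unknown"})

print(filled["age"].tolist())   # [18.0, 18.0]
print(filled["city"].tolist())  # ['X', 'unknown']
```

As with `drop`, the original `df` is not modified; the result is a new dataframe.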


## **Exercise**

In the dataframe `dfc`, drop the rows with a missing `Horsepower` value and assign the median value of the column to the missing values in `Miles_per_Gallon`.

In [None]:
# .copy() gives an independent dataframe, so the assignment below raises no warning
dfc_cleaned = dfc.dropna(subset=['Horsepower']).copy()

median_mpg = dfc_cleaned['Miles_per_Gallon'].median()
dfc_cleaned['Miles_per_Gallon'] = dfc_cleaned['Miles_per_Gallon'].fillna(median_mpg)

dfc_cleaned


Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA
...,...,...,...,...,...,...,...,...,...
401,ford mustang gl,27.0,4,140.0,86.0,2790,15.6,1982-01-01,USA
402,vw pickup,44.0,4,97.0,52.0,2130,24.6,1982-01-01,Europe
403,dodge rampage,32.0,4,135.0,84.0,2295,11.6,1982-01-01,USA
404,ford ranger,28.0,4,120.0,79.0,2625,18.6,1982-01-01,USA


## Exercise

Write the code to view all the rows of a dataframe that contain a `nan` value. It should work for any dataframe.

In [None]:
# dfr[dfr.isna()] masks cell by cell and keeps every row; build a row-wise mask instead
dfr[dfr.isna().any(axis=1)]
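
Indexing a dataframe with a boolean *dataframe*, as in `dfr[dfr.isna()]`, keeps every row and blanks out the cells where the mask is `False`, which is why the result is full of `nan`. To filter rows we need one boolean per row. A possible sketch, using made-up data and a hypothetical helper named `rows_with_nan`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with missing values in two different columns
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", None]})

def rows_with_nan(frame):
    """Return the rows of `frame` that have at least one missing value."""
    # isna() gives one boolean per cell; any(axis=1) reduces it to one per row
    return frame[frame.isna().any(axis=1)]

result = rows_with_nan(df)
result
```

Because the mask is built from the dataframe itself, the same function should work for any dataframe, whatever its columns are.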


# Query

The `query` method is a useful and powerful way to select the rows of a DataFrame that meet certain criteria. It lets us write complex Boolean expressions as a string, which is evaluated against each row of the DataFrame.

Let's consider the car dataframe `dfc`.

In [None]:
dfc = data.cars()
dfc

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA
...,...,...,...,...,...,...,...,...,...
401,ford mustang gl,27.0,4,140.0,86.0,2790,15.6,1982-01-01,USA
402,vw pickup,44.0,4,97.0,52.0,2130,24.6,1982-01-01,Europe
403,dodge rampage,32.0,4,135.0,84.0,2295,11.6,1982-01-01,USA
404,ford ranger,28.0,4,120.0,79.0,2625,18.6,1982-01-01,USA


We can select rows with a condition on a single column:

In [None]:
dfc.query('Cylinders == 4')

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
10,citroen ds-21 pallas,,4,133.0,115.0,3090,17.5,1970-01-01,Europe
20,toyota corona mark ii,24.0,4,113.0,95.0,2372,15.0,1970-01-01,Japan
24,datsun pl510,27.0,4,97.0,88.0,2130,14.5,1970-01-01,Japan
25,volkswagen 1131 deluxe sedan,26.0,4,97.0,46.0,1835,20.5,1970-01-01,Europe
26,peugeot 504,25.0,4,110.0,87.0,2672,17.5,1970-01-01,Europe
...,...,...,...,...,...,...,...,...,...
401,ford mustang gl,27.0,4,140.0,86.0,2790,15.6,1982-01-01,USA
402,vw pickup,44.0,4,97.0,52.0,2130,24.6,1982-01-01,Europe
403,dodge rampage,32.0,4,135.0,84.0,2295,11.6,1982-01-01,USA
404,ford ranger,28.0,4,120.0,79.0,2625,18.6,1982-01-01,USA


Or with a condition involving several columns:

In [None]:
dfc.query('Cylinders==4 & Miles_per_Gallon < 20')

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
83,volvo 145e (sw),18.0,4,121.0,112.0,2933,14.5,1972-01-01,Europe
119,ford pinto,19.0,4,122.0,85.0,2310,18.5,1973-01-01,USA
127,volvo 144ea,19.0,4,121.0,112.0,2868,15.5,1973-01-01,Europe
216,peugeot 504,19.0,4,120.0,88.0,3270,21.9,1976-01-01,Europe


In [None]:
dfc.query('(Cylinders==4 | Cylinders==6) & Miles_per_Gallon < 20')

But the most interesting feature is that we can compare values in different columns:

In [None]:
dfc.query('Displacement > Horsepower')

And use expressions!

In [None]:
dfc.query('Displacement > Horsepower * 3')

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
161,mercury monarch,15.0,6,250.0,72.0,3432,21.0,1975-01-01,USA
162,ford maverick,15.0,6,250.0,72.0,3158,19.5,1975-01-01,USA
207,ford granada ghia,18.0,6,250.0,78.0,3574,21.0,1976-01-01,USA
372,oldsmobile cutlass ls,26.6,8,350.0,105.0,3725,19.0,1982-01-01,USA
395,oldsmobile cutlass ciera (diesel),38.0,6,262.0,85.0,3015,17.0,1982-01-01,USA
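
To see what `query` is doing, it can help to compare it with plain boolean indexing. The sketch below assumes a tiny made-up dataframe (`cars` and its values are invented for illustration) and writes two of the selections above in both styles:

```python
import pandas as pd

# Hypothetical toy data that mimics a few columns of the cars dataset
cars = pd.DataFrame({
    "Cylinders": [4, 4, 6, 8],
    "Miles_per_Gallon": [19.0, 30.0, 18.0, 14.0],
    "Displacement": [120.0, 90.0, 250.0, 350.0],
    "Horsepower": [88.0, 70.0, 78.0, 165.0],
})

# Condition on several columns: query string vs. boolean mask
by_query = cars.query("Cylinders == 4 and Miles_per_Gallon < 20")
by_mask = cars[(cars["Cylinders"] == 4) & (cars["Miles_per_Gallon"] < 20)]

# Expression comparing two columns: query string vs. boolean mask
expr_query = cars.query("Displacement > Horsepower * 3")
expr_mask = cars[cars["Displacement"] > cars["Horsepower"] * 3]
```

Both styles select the same rows; the `query` version is usually shorter because the column names do not need the `cars[...]` prefix.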


## Exercise

In the dataframe `dfr`, select the rows where `age` is over 30, `neighborhood` contains `Park` and `gender` is `Female`:

In [None]:
dfr.query("age > 30 and neighborhood.str.contains('Park') and gender == 'Female'")

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
6,Carol,Benson,42.0,Female,Black,1992-05-02,Harbor Freeway near Slauson Avenue,South Park,Death,-118.280504,33.989168
26,Betty,Jackson,56.0,Female,Black,1992-05-01,Main & 51st streets,South Park,Death,-118.273931,33.996522


## Exercise

In the dataframe `dfc` select the Volkswagen cars with a weight of more than 20 kilograms per horsepower.

In [None]:
# Names are stored in lowercase; 1 kg is about 2.2046 lb
dfc.query("Name.str.contains('volkswagen') and (Weight_in_lbs / 2.2046) / Horsepower > 20")

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
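
Two details are easy to get wrong in this exercise: `str.contains` is case-sensitive by default, and the weight is given in pounds (one kilogram is roughly 2.2046 lb). A hedged sketch with invented rows (the names, weights and the constant `LB_PER_KG` are assumptions for illustration, not values taken from `dfc`):

```python
import pandas as pd

LB_PER_KG = 2.20462  # approximate number of pounds in one kilogram

# Hypothetical toy data with mixed-case names
cars = pd.DataFrame({
    "Name": ["volkswagen rabbit", "Volkswagen Dasher", "ford pinto"],
    "Weight_in_lbs": [1985, 2335, 2310],
    "Horsepower": [48.0, 48.0, 85.0],
})

# case=False makes the match independent of upper/lower case
is_vw = cars["Name"].str.contains("volkswagen", case=False)

# Convert pounds to kilograms before dividing by the horsepower
kg_per_hp = cars["Weight_in_lbs"] / LB_PER_KG / cars["Horsepower"]

heavy_vw = cars[is_vw & (kg_per_hp > 20)]
heavy_vw
```

The mask is built outside `query` here only to keep each step visible; the same conditions can be written inside a query string.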


<hr>
<hr>
Carlos Gregorio Rodríguez

Universidad Complutense de Madrid

<img src="https://static0.makeuseofimages.com/wordpress/wp-content/uploads/2019/11/CC-BY-NC-License.png" alt="cc by nc" width="200"/>


https://creativecommons.org/licenses/by-nc/4.0/