
 <img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" alt="Panda Logo" width="500">

`Pandas` is a `Python` module for data manipulation and analysis, widely used around the world in both universities and companies. We will show how easy it is to work with data in notebooks using a few lines of `Pandas` code.

https://pandas.pydata.org/

In [1]:
%reload_ext google.colab.data_table
import pandas as pd
from vega_datasets import data

# New columns

There are occasions when we want to combine the information in a dataframe in order to generate new columns.

Let's consider the dataframe `dfr`:

In [2]:
dfr = data.la_riots()
dfr.sample(20)


Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
19,Elias,Garcia Rivera,32.0,Male,Latino,1992-12-16,12834 Vanowen St.,Valley Glen,Homicide,-118.413791,34.193934
14,Howard,Epstein,45.0,Male,White,1992-04-30,Slauson & 7th avenues,Hyde Park,Homicide,-118.324742,33.989049
20,Andreas,Garnica,36.0,Male,Latino,1992-04-30,2034 W. Pico Blvd.,Pico-Union,Not riot-related,-118.281879,34.046844
12,Harry,Doller,56.0,Male,White,1992-05-01,3500 block of Winslow Drive,Silver Lake,Not riot-related,-118.278763,34.087788
45,Aaron,Ratinoff,68.0,Male,White,1992-05-01,11690 Gateway Blvd.,Sawtelle,Homicide,-118.4431,34.028655
49,Imad,Sharaf,31.0,Male,Black,1992-05-03,San Diego Freeway & San Fernando Mission Boule...,Mission Hills,Not riot-related,-118.471745,34.271856
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.21539,33.903457
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
26,Betty,Jackson,56.0,Female,Black,1992-05-01,Main & 51st streets,South Park,Death,-118.273931,33.996522
44,Hugo G.,Ramirez,23.0,Male,Latino,1992-05-03,12732 Bess St.,Baldwin Park,Not riot-related,-117.997106,34.070238


## Column operators


Let's merge, for instance, the `first_name` and `last_name` columns. As is usual, we add a comma ',' to separate the last name from the first name. The combined data will be placed in a new column that we will simply call `name`.

In [3]:
dfr['name'] = dfr.last_name+', '+dfr.first_name
dfr

Unnamed: 0,first_name,last_name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude,name
0,Cesar A.,Aguilar,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281,"Aguilar, Cesar A."
1,George,Alvarez,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690,"Alvarez, George"
2,Wilson,Alvarez,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662,"Alvarez, Wilson"
3,Brian E.,Andrew,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457,"Andrew, Brian E."
4,Vivian,Austin,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667,"Austin, Vivian"
...,...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098,"Ward, Fredrick"
59,Louis A.,Watson,18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244,"Watson, Louis A."
60,Elbert O.,Wilkins,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767,"Wilkins, Elbert O."
61,John H.,Willers,37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184,"Willers, John H."
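
The same idea can be tried on a tiny dataframe. The following is a minimal sketch with made-up names (the `people` dataframe is hypothetical and not part of the `la_riots` dataset); it only illustrates that `+` works element by element on string columns.

```python
import pandas as pd

# Toy data, invented for illustration only
people = pd.DataFrame({
    "first_name": ["Ada", "Alan"],
    "last_name": ["Lovelace", "Turing"],
})

# The + operator concatenates the strings row by row;
# the plain string ", " is repeated for every row
people["name"] = people["last_name"] + ", " + people["first_name"]

print(people["name"].tolist())
```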


If we want to get rid of the original columns, we can use the `drop` method.


In [4]:
dfr.drop(columns=['first_name', 'last_name'], inplace=True)
dfr

Unnamed: 0,age,gender,race,death_date,address,neighborhood,type,longitude,latitude,name
0,18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281,"Aguilar, Cesar A."
1,42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690,"Alvarez, George"
2,40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662,"Alvarez, Wilson"
3,30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457,"Andrew, Brian E."
4,87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667,"Austin, Vivian"
...,...,...,...,...,...,...,...,...,...,...
58,20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098,"Ward, Fredrick"
59,18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244,"Watson, Louis A."
60,33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767,"Wilkins, Elbert O."
61,37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184,"Willers, John H."
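
With `inplace=True` the dataframe itself is modified. As a small sketch on hypothetical toy data, `drop` can also be called without that argument, in which case it returns a new dataframe and leaves the original one untouched:

```python
import pandas as pd

# Toy data, invented for illustration only
people = pd.DataFrame({
    "first_name": ["Ada", "Alan"],
    "last_name": ["Lovelace", "Turing"],
    "name": ["Lovelace, Ada", "Turing, Alan"],
})

# Without inplace=True, drop returns a new dataframe
trimmed = people.drop(columns=["first_name", "last_name"])

print(list(trimmed.columns))  # columns of the new dataframe
print(list(people.columns))   # the original still has all its columns
```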



The position at which the new column is added can be chosen with the `insert` method.
Notice that this operation transforms the original dataframe.

In [5]:
dfr = data.la_riots()
dfr.insert(loc=2, column='name', value=dfr.last_name+', '+dfr.first_name)
dfr

Unnamed: 0,first_name,last_name,name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude
0,Cesar A.,Aguilar,"Aguilar, Cesar A.",18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281
1,George,Alvarez,"Alvarez, George",42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690
2,Wilson,Alvarez,"Alvarez, Wilson",40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662
3,Brian E.,Andrew,"Andrew, Brian E.",30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457
4,Vivian,Austin,"Austin, Vivian",87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667
...,...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,"Ward, Fredrick",20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098
59,Louis A.,Watson,"Watson, Louis A.",18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244
60,Elbert O.,Wilkins,"Wilkins, Elbert O.",33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767
61,John H.,Willers,"Willers, John H.",37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184
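
The sketch below, on hypothetical toy data, illustrates two consequences of `insert` working in place: the call returns `None`, and inserting a column name that already exists raises a `ValueError` by default. This is probably why the cell above reloads the dataset before calling `insert`.

```python
import pandas as pd

# Toy data, invented for illustration only
people = pd.DataFrame({
    "first_name": ["Ada"],
    "last_name": ["Lovelace"],
    "age": [36],
})

# insert modifies the dataframe itself and returns None
result = people.insert(
    loc=2,
    column="name",
    value=people["last_name"] + ", " + people["first_name"],
)
print(result)
print(list(people.columns))

# A second insert with the same column name is rejected by default
try:
    people.insert(loc=0, column="name", value="duplicate")
    inserted_twice = True
except ValueError:
    inserted_twice = False
print(inserted_twice)
```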


### **Exercise**

Add a new column with a boolean value indicating if the address is a crossroads.

In [13]:
dfr['cross'] = [1 if '&' in a else 0 for a in dfr['address']]

dfr

Unnamed: 0,first_name,last_name,name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude,cross
0,Cesar A.,Aguilar,"Aguilar, Cesar A.",18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281,0
1,George,Alvarez,"Alvarez, George",42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690,1
2,Wilson,Alvarez,"Alvarez, Wilson",40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662,0
3,Brian E.,Andrew,"Andrew, Brian E.",30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457,1
4,Vivian,Austin,"Austin, Vivian",87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,"Ward, Fredrick",20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098,0
59,Louis A.,Watson,"Watson, Louis A.",18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244,0
60,Elbert O.,Wilkins,"Wilkins, Elbert O.",33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767,1
61,John H.,Willers,"Willers, John H.",37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184,0
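
The solution above stores `1` and `0`. If genuine boolean values are preferred, one possible alternative is the `str.contains` method, sketched here on a two-row dataframe (`places` is a hypothetical name; the two addresses are copied from the dataset):

```python
import pandas as pd

places = pd.DataFrame({
    "address": ["Main & College streets", "3100 Rosecrans Ave."],
})

# regex=False makes '&' a plain character rather than a regular expression
places["cross"] = places["address"].str.contains("&", regex=False)

print(places["cross"].tolist())
```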


## Python code

But the combination of data can be more complex.
Imagine we want to create a new column with compound values from other columns.

For instance, we would like to combine the information in `latitude` and `longitude` into a single column. Naive expressions, such as the following one, do not work!


In [14]:
dfr['position'] = (dfr.longitude, dfr.latitude)

ValueError: Length of values (2) does not match length of index (63)

We need to combine the values pairwise and then generate the new column. Python has a built-in function that is useful for that: `zip`.

In [15]:
zip([1,2,3],['a','b','c'])

<zip at 0x7b1ddceb6b00>

As we have seen before (for instance with `range`), `zip` is an example of a lazy function: it does not build its results until they are requested.

In [16]:
list(zip([1,2,3],['a','b','c']))

[(1, 'a'), (2, 'b'), (3, 'c')]
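
Two practical consequences of this laziness are worth keeping in mind. The short sketch below uses only built-in Python and shows that a `zip` object can be consumed only once, and that `zip` stops at the shortest of its inputs:

```python
pairs = zip([1, 2, 3], ["a", "b", "c"])

# The first pass consumes the iterator...
first_pass = list(pairs)
# ...so a second pass finds nothing left
second_pass = list(pairs)

# zip stops as soon as the shortest input is exhausted
short = list(zip([1, 2, 3], ["a", "b"]))

print(first_pass)
print(second_pass)
print(short)
```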

We now have the tool to solve the question:

In [17]:
dfr['position'] = list(zip(dfr.longitude, dfr.latitude))
dfr

Unnamed: 0,first_name,last_name,name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude,cross,position
0,Cesar A.,Aguilar,"Aguilar, Cesar A.",18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281,0,"(-118.2739756, 34.0592814)"
1,George,Alvarez,"Alvarez, George",42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690,1,"(-118.2340982, 34.0626901)"
2,Wilson,Alvarez,"Alvarez, Wilson",40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662,0,"(-118.326816, 33.901662)"
3,Brian E.,Andrew,"Andrew, Brian E.",30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457,1,"(-118.2153903, 33.9034569)"
4,Vivian,Austin,"Austin, Vivian",87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667,0,"(-118.304741, 33.985667)"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,"Ward, Fredrick",20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098,0,"(-118.412778, 34.287098)"
59,Louis A.,Watson,"Watson, Louis A.",18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244,0,"(-118.2915566, 34.00524354)"
60,Elbert O.,Wilkins,"Wilkins, Elbert O.",33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767,1,"(-118.3100043, 33.95276731)"
61,John H.,Willers,"Willers, John H.",37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184,0,"(-118.46777, 34.263184)"
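
To see the difference between the two assignments side by side, here is a minimal sketch on a three-row dataframe (`points` is a hypothetical name; the coordinates are copied from the table above). A tuple of two columns has length 2, which does not match the number of rows, whereas `zip` produces one tuple per row.

```python
import pandas as pd

points = pd.DataFrame({
    "longitude": [-118.326816, -118.215390, -118.304741],
    "latitude": [33.901662, 33.903457, 33.985667],
})

# A tuple of two columns has length 2, not one value per row
try:
    points["position"] = (points["longitude"], points["latitude"])
    tuple_assignment_worked = True
except ValueError as err:
    tuple_assignment_worked = False
    print(err)

# zip pairs the values row by row; list() materialises the pairs
points["position"] = list(zip(points["longitude"], points["latitude"]))
print(points["position"].tolist())
```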


Finally, the most complex situation would be to generate values for a new column from some of the data in the row, subject to conditions or further processing. In this case, we can use a function to define the code to be executed.

Let's consider the following code to create a new column. Observe how we can use any value in the row and arbitrarily complex `python` code to generate the result.

In [18]:
def my_f(any_row):
  if '&' in any_row['address']:
    return (any_row['address'].split('&'))
  else:
    return any_row['address']

# create the new column 'new_address' using the function above
dfr['new_address'] = dfr.apply(my_f, axis=1)
dfr

Unnamed: 0,first_name,last_name,name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude,cross,position,new_address
0,Cesar A.,Aguilar,"Aguilar, Cesar A.",18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281,0,"(-118.2739756, 34.0592814)",2009 W. 6th St.
1,George,Alvarez,"Alvarez, George",42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690,1,"(-118.2340982, 34.0626901)","[Main , College streets]"
2,Wilson,Alvarez,"Alvarez, Wilson",40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662,0,"(-118.326816, 33.901662)",3100 Rosecrans Ave.
3,Brian E.,Andrew,"Andrew, Brian E.",30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457,1,"(-118.2153903, 33.9034569)","[Rosecrans , Chester avenues]"
4,Vivian,Austin,"Austin, Vivian",87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667,0,"(-118.304741, 33.985667)",1600 W. 60th St.
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,"Ward, Fredrick",20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098,0,"(-118.412778, 34.287098)",11932 Cometa Ave.
59,Louis A.,Watson,"Watson, Louis A.",18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244,0,"(-118.2915566, 34.00524354)",4365 S. Vermont Ave.
60,Elbert O.,Wilkins,"Wilkins, Elbert O.",33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767,1,"(-118.3100043, 33.95276731)","[Western Avenue , 92nd Street]"
61,John H.,Willers,"Willers, John H.",37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184,0,"(-118.46777, 34.263184)",10621 Sepulveda Blvd.
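
With `axis=1`, the function receives each row as a `Series`, so several columns can be read by name inside it. The following sketch is only an illustration: the `places` dataframe and the `describe` function are hypothetical, with values copied from the dataset.

```python
import pandas as pd

places = pd.DataFrame({
    "address": ["Main & College streets", "3100 Rosecrans Ave."],
    "neighborhood": ["Chinatown", "Hawthorne"],
})

def describe(row):
    # row is a Series holding the values of one row, indexed by column name
    kind = "crossroads" if "&" in row["address"] else "street address"
    return f"{row['neighborhood']}: {kind}"

# axis=1 applies the function to every row (axis=0 would go column by column)
places["summary"] = places.apply(describe, axis=1)

print(places["summary"].tolist())
```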


### Exercise

The previous code was intended to illustrate the possibility of using a general `python` function, but this particular problem can be solved without an auxiliary function. Find a solution with pure `pandas`.

In [25]:
import pandas as pd

dfr['new_address'] = dfr['address'].str.split('&')

dfr


Unnamed: 0,first_name,last_name,name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude,cross,position,new_address,old
0,Cesar A.,Aguilar,"Aguilar, Cesar A.",18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281,0,"(-118.2739756, 34.0592814)",[2009 W. 6th St.],18.0
1,George,Alvarez,"Alvarez, George",42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690,1,"(-118.2340982, 34.0626901)","[Main , College streets]",47.0
2,Wilson,Alvarez,"Alvarez, Wilson",40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662,0,"(-118.326816, 33.901662)",[3100 Rosecrans Ave.],45.0
3,Brian E.,Andrew,"Andrew, Brian E.",30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457,1,"(-118.2153903, 33.9034569)","[Rosecrans , Chester avenues]",30.0
4,Vivian,Austin,"Austin, Vivian",87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667,0,"(-118.304741, 33.985667)",[1600 W. 60th St.],92.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,"Ward, Fredrick",20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098,0,"(-118.412778, 34.287098)",[11932 Cometa Ave.],20.0
59,Louis A.,Watson,"Watson, Louis A.",18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244,0,"(-118.2915566, 34.00524354)",[4365 S. Vermont Ave.],18.0
60,Elbert O.,Wilkins,"Wilkins, Elbert O.",33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767,1,"(-118.3100043, 33.95276731)","[Western Avenue , 92nd Street]",33.0
61,John H.,Willers,"Willers, John H.",37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184,0,"(-118.46777, 34.263184)",[10621 Sepulveda Blvd.],42.0
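
Note that this solution is not identical to the one based on `my_f`: addresses without `&` become one-element lists instead of remaining plain strings. The sketch below reproduces that behaviour on two addresses from the dataset and, as an optional variation, splits on a regular expression so that the spaces around `&` are removed as well. The `regex=True` argument is assumed to be available (it requires a reasonably recent `pandas`).

```python
import pandas as pd

places = pd.DataFrame({
    "address": ["Main & College streets", "3100 Rosecrans Ave."],
})

# Plain split: every value becomes a list, even when there is no '&'
parts = places["address"].str.split("&")

# Variation: split on '&' together with any surrounding whitespace
clean_parts = places["address"].str.split(r"\s*&\s*", regex=True)

print(parts.tolist())
print(clean_parts.tolist())
```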


### Exercise

Use the `apply` method to generate a new column using the information in each row.

In [24]:
dfr['old'] = dfr['age'].apply(lambda x: x+5 if x>35 else x)

dfr



Unnamed: 0,first_name,last_name,name,age,gender,race,death_date,address,neighborhood,type,longitude,latitude,cross,position,new_address,old
0,Cesar A.,Aguilar,"Aguilar, Cesar A.",18.0,Male,Latino,1992-04-30,2009 W. 6th St.,Westlake,Officer-involved shooting,-118.273976,34.059281,0,"(-118.2739756, 34.0592814)",2009 W. 6th St.,18.0
1,George,Alvarez,"Alvarez, George",42.0,Male,Latino,1992-05-01,Main & College streets,Chinatown,Not riot-related,-118.234098,34.062690,1,"(-118.2340982, 34.0626901)","[Main , College streets]",47.0
2,Wilson,Alvarez,"Alvarez, Wilson",40.0,Male,Latino,1992-05-23,3100 Rosecrans Ave.,Hawthorne,Homicide,-118.326816,33.901662,0,"(-118.326816, 33.901662)",3100 Rosecrans Ave.,45.0
3,Brian E.,Andrew,"Andrew, Brian E.",30.0,Male,Black,1992-04-30,Rosecrans & Chester avenues,Compton,Officer-involved shooting,-118.215390,33.903457,1,"(-118.2153903, 33.9034569)","[Rosecrans , Chester avenues]",30.0
4,Vivian,Austin,"Austin, Vivian",87.0,Female,Black,1992-05-03,1600 W. 60th St.,Harvard Park,Death,-118.304741,33.985667,0,"(-118.304741, 33.985667)",1600 W. 60th St.,92.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58,Fredrick,Ward,"Ward, Fredrick",20.0,Male,Black,1992-05-02,11932 Cometa Ave.,Pacoima,Homicide,-118.412778,34.287098,0,"(-118.412778, 34.287098)",11932 Cometa Ave.,20.0
59,Louis A.,Watson,"Watson, Louis A.",18.0,Male,Black,1992-04-29,4365 S. Vermont Ave.,Vermont Square,Homicide,-118.291557,34.005244,0,"(-118.2915566, 34.00524354)",4365 S. Vermont Ave.,18.0
60,Elbert O.,Wilkins,"Wilkins, Elbert O.",33.0,Male,Black,1992-04-30,Western Avenue & 92nd Street,Gramercy Park,Homicide,-118.310004,33.952767,1,"(-118.3100043, 33.95276731)","[Western Avenue , 92nd Street]",33.0
61,John H.,Willers,"Willers, John H.",37.0,Male,White,1992-04-29,10621 Sepulveda Blvd.,Mission Hills,Homicide,-118.467770,34.263184,0,"(-118.46777, 34.263184)",10621 Sepulveda Blvd.,42.0
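
When the rule depends on a single column, as in the solution above, `apply` can also be called directly on that column, and a vectorised alternative such as `mask` gives the same values. A small sketch, using the first four ages from the table (the `ages` series is a hypothetical stand-in for `dfr['age']`):

```python
import pandas as pd

ages = pd.Series([18.0, 42.0, 40.0, 30.0])

# Element by element, with a Python function
with_apply = ages.apply(lambda x: x + 5 if x > 35 else x)

# Vectorised: replace the values where the condition holds
with_mask = ages.mask(ages > 35, ages + 5)

print(with_apply.tolist())
print(with_mask.tolist())
```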


<hr>
Carlos Gregorio Rodríguez

Universidad Complutense de Madrid

<img src="https://static0.makeuseofimages.com/wordpress/wp-content/uploads/2019/11/CC-BY-NC-License.png" alt="cc by nc" width="200"/>

https://creativecommons.org/licenses/by-nc/4.0/