
 <img src="https://upload.wikimedia.org/wikipedia/commons/e/ed/Pandas_logo.svg" alt="Panda Logo" width="500">

`Pandas` is a `Python` module for data manipulation and analysis, widely used around the world in both universities and companies. We will show how easy it is to work with data in notebooks using a few lines of `pandas` code.

https://pandas.pydata.org/

<hr>


When working with data, we often find it in tabular form, that is, arranged in rows and columns as in a spreadsheet. In `pandas`, this two-dimensional data structure is called a `DataFrame`.
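
As a minimal sketch of this idea, a small `DataFrame` can be built by hand from a Python dictionary in which each key is a column header and each list holds the values of that column. The three airports are copied from the airports dataset used later in this notebook; the variable name `small` is only illustrative.

```python
import pandas as pd

# Each key becomes a column header; each list holds that column's values.
small = pd.DataFrame({
    "iata": ["00M", "00R", "00V"],
    "city": ["Bay Springs", "Livingston", "Colorado Springs"],
    "latitude": [31.953765, 30.685861, 38.945749],
})

print(small)         # three rows, three columns
print(small.shape)   # (number of rows, number of columns)
```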


# Elements in a dataframe

A Dataframe is a data structure in which we have _rows_ and _columns_.

<img src="https://pandas.pydata.org/pandas-docs/stable/_images/01_table_dataframe.svg" width="300"/>

A **column** usually has a _header_ that identifies the information that it contains (age, address, identity number...)

A **row** holds the information of one instance or record:
it gathers the values recorded for the categories defined in the columns.

The next code cell reads a `DataFrame` and shows part of its content.

**Note:** We will start working with datasets that are directly available in `Colab`, but quite soon you will learn to use your own datasets.

In [None]:
from vega_datasets import data
dfa = data.airports()
print(type(dfa))
dfa

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,iata,name,city,state,country,latitude,longitude
0,00M,Thigpen,Bay Springs,MS,USA,31.953765,-89.234505
1,00R,Livingston Municipal,Livingston,TX,USA,30.685861,-95.017928
2,00V,Meadow Lake,Colorado Springs,CO,USA,38.945749,-104.569893
3,01G,Perry-Warsaw,Perry,NY,USA,42.741347,-78.052081
4,01J,Hilliard Airpark,Hilliard,FL,USA,30.688012,-81.905944
...,...,...,...,...,...,...,...
3371,ZEF,Elkin Municipal,Elkin,NC,USA,36.280024,-80.786069
3372,ZER,Schuylkill Cty/Joe Zerbey,Pottsville,PA,USA,40.706449,-76.373147
3373,ZPH,Zephyrhills Municipal,Zephyrhills,FL,USA,28.228065,-82.155916
3374,ZUN,Black Rock,Zuni,NM,USA,35.083227,-108.791777


The output of the previous code cell shows a representation of the dataframe.

You can see data related to airports. Every column has a header (`iata, name, city, state, country, latitude, longitude`) and every row holds the data for one particular airport.

There is also some other useful information. The dataframe has 3376 rows, numbered from 0 to 3375. This row number works as an index, so we can talk about row 3 or row 3373.
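
The row numbers can be used to look up a record. The sketch below uses a hand-made frame (`tiny`, a hypothetical name, with values copied from the first rows of the output above) instead of the full dataset; `.loc[3]` returns the row whose index label is 3.

```python
import pandas as pd

tiny = pd.DataFrame({
    "iata": ["00M", "00R", "00V", "01G"],
    "state": ["MS", "TX", "CO", "NY"],
})

row = tiny.loc[3]          # the row labelled 3 in the index
print(row)
print(list(tiny.index))    # the default index counts from 0
```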


If we use the magic wand on the top right, we see a slightly different representation but the important elements are just the same.

If you run the next code cell you will see a different dataset.

In [None]:
dfr = data.la_riots()
dfr


In this case, the dataframe gathers data on the people who died in the 1992 Los Angeles riots.

What are the names of the columns? How many rows does the dataframe have?

You should have no trouble answering these questions.

# Overview of a dataframe

We will start by learning the `pandas` code that shows general information about a dataframe.

Throughout this notebook, we will use the two dataframes loaded above:
* `dfa`, as a short name for _dataframe airports_ and
* `dfr`, as a short name for _dataframe riots_.

The first function `pandas` offers for getting to know a dataframe is `info()`.

In [None]:
dfr.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63 entries, 0 to 62
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   first_name    63 non-null     object        
 1   last_name     63 non-null     object        
 2   age           62 non-null     float64       
 3   gender        63 non-null     object        
 4   race          63 non-null     object        
 5   death_date    63 non-null     datetime64[ns]
 6   address       63 non-null     object        
 7   neighborhood  63 non-null     object        
 8   type          63 non-null     object        
 9   longitude     63 non-null     float64       
 10  latitude      63 non-null     float64       
dtypes: datetime64[ns](1), float64(3), object(7)
memory usage: 5.5+ KB


As you can read in the output of the code cell, `dfr` has 63 rows (`63 entries`) and 11 columns.

There is also some information about each column: its name, its number of non-null values (`Non-Null Count`) and the type of its values (`Dtype`).

In this example, notice that the column `age` has 62 non-null values. Since there are 63 rows, this means that there is one row for which the `age` value is null.
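
Null values can also be counted directly. The following is a minimal sketch on a made-up frame (the names `people` and `missing_per_column` and the values are purely illustrative): `isna()` marks the missing cells and `sum()` counts them per column.

```python
import pandas as pd

people = pd.DataFrame({
    "first_name": ["Ana", "Luis", "Eva"],
    "age": [18.0, None, 40.0],     # None is stored as a missing value (NaN)
})

missing_per_column = people.isna().sum()   # number of nulls in each column
print(missing_per_column)
print(people[people["age"].isna()])        # the rows where age is missing
```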

As for the types, they describe the kind of values stored in each column. `age`, `longitude` and `latitude` hold numerical values (`float64`), `death_date` holds dates (`datetime64`), and all the remaining columns hold strings of characters (`object`). The type of a column determines which operations are possible on its data.

Each piece of information summarized by `info()` can also be obtained separately with the following code:

**Note:** don't forget to run the code cells and check the results.

In [None]:
dfr.columns

Index(['first_name', 'last_name', 'age', 'gender', 'race', 'death_date',
       'address', 'neighborhood', 'type', 'longitude', 'latitude'],
      dtype='object')

In [None]:
dfr.dtypes

first_name              object
last_name               object
age                    float64
gender                  object
race                    object
death_date      datetime64[ns]
address                 object
neighborhood            object
type                    object
longitude              float64
latitude               float64
dtype: object

In [None]:
dfr.shape

(63, 11)

A very condensed way to express 63 rows and 11 columns.

In [None]:
dfr.count()


first_name      63
last_name       63
age             62
gender          63
race            63
death_date      63
address         63
neighborhood    63
type            63
longitude       63
latitude        63
dtype: int64

For the numerical columns, the function `describe()` computes a few basic statistical properties.

In [None]:
dfr.describe()

Unnamed: 0,age,death_date,longitude,latitude
count,62.0,63,63.0,63.0
mean,32.370968,1992-05-15 12:34:17.142857088,-118.27991,34.026713
min,15.0,1992-04-29 00:00:00,-118.471745,33.789857
25%,21.25,1992-04-30 00:00:00,-118.309913,33.97148
50%,30.5,1992-04-30 00:00:00,-118.291495,34.005485
75%,38.0,1992-05-01 00:00:00,-118.253477,34.069612
max,87.0,1993-11-24 00:00:00,-117.730647,34.287098
std,14.253253,,0.105198,0.098471


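Notice that, by default, the function leaves out the columns of strings (`object`): here it summarizes the numerical columns and the date column `death_date`.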
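
If a summary of the text columns is also wanted, `describe()` accepts an `include` argument. The sketch below uses a small made-up frame (`sample` and its values are illustrative only) and passes `include="all"`, which adds the number of distinct values (`unique`), the most frequent value (`top`) and its frequency (`freq`).

```python
import pandas as pd

sample = pd.DataFrame({
    "race": ["Black", "Latino", "Black", "White"],
    "age": [30.0, 18.0, 42.0, 37.0],
})

numeric_only = sample.describe()              # default: the text column is left out
everything = sample.describe(include="all")   # also summarizes the text column

print(numeric_only)
print(everything)
```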

## **Exercise**
Add a few code cells and try the code above on the dataframe `dfa` to get information about it.  

What are the types of the columns? Does it have null values?...


In [None]:
dfa.info()
dfa.describe()
dfa.count()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3376 entries, 0 to 3375
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   iata       3376 non-null   object 
 1   name       3376 non-null   object 
 2   city       3364 non-null   object 
 3   state      3364 non-null   object 
 4   country    3376 non-null   object 
 5   latitude   3376 non-null   float64
 6   longitude  3376 non-null   float64
dtypes: float64(2), object(5)
memory usage: 184.8+ KB


# Working with columns

When working with a dataframe, we often want to concentrate on one particular column.

There are different ways to obtain data in one column. One possibility is this:

In [None]:
dfr.age

0     18.0
1     42.0
2     40.0
3     30.0
4     87.0
      ... 
58    20.0
59    18.0
60    33.0
61    37.0
62    29.0
Name: age, Length: 63, dtype: float64

After the dataframe name we add a dot and the name of the column. Be careful: this only works when the column name contains no whitespace!

There is an alternative that works for any column name, with or without whitespace.

In [None]:
dfr['age']

0     18.0
1     42.0
2     40.0
3     30.0
4     87.0
      ... 
58    20.0
59    18.0
60    33.0
61    37.0
62    29.0
Name: age, Length: 63, dtype: float64

Look carefully at the syntax. Code is powerful but requires precision!

In the following examples we are going to use both methods for you to get used to both of them.

If we focus on the results of the codes above, we see the type of the elements in the column, in this case `float64`. Type is important because it determines the possible operations with the data.
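
To make the difference between the two notations concrete, here is a minimal sketch on a hand-made frame whose first header contains a space (the frame `df` and the header `first name` are hypothetical). The bracket notation works with any header, and a list of headers selects several columns at once.

```python
import pandas as pd

df = pd.DataFrame({
    "first name": ["George", "Vivian"],   # header with a space
    "age": [42.0, 87.0],
})

names = df["first name"]                  # df.first name would be a syntax error
ages_equal = df["age"].equals(df.age)     # both notations return the same column
subset = df[["first name", "age"]]        # a list of headers selects several columns

print(names)
print(ages_equal)
print(subset)
```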

## **Exercise**
Add a few code cells and try to select some columns in the dataframe `dfr`.

In [None]:
# Select every column whose name has fewer than 5 characters
dfr[[name for name in dfr.columns if len(name) < 5]]



## Numerical values

Columns with numerical values usually have type `float64` or `int64`. The former indicates values with a decimal component (the separator is a dot "."), the latter integer numbers with no decimal part.

It is very natural, for instance, to find the maximum and minimum values in a numerical column.

In [None]:
dfr.age.max(), dfr.age.min()

(87.0, 15.0)

And also some basic statistics:



In [None]:
dfr.age.mean(), dfr.age.median(), dfr.age.mode()

(32.37096774193548,
 30.5,
 0    20.0
 Name: age, dtype: float64)

The `describe()` function shows much of this information:

In [None]:
dfr.age.describe()

count    62.000000
mean     32.370968
std      14.253253
min      15.000000
25%      21.250000
50%      30.500000
75%      38.000000
max      87.000000
Name: age, dtype: float64

Talking about age with decimal numbers is a bit odd. We could easily convert data to integers:

In [None]:
dfr.age.convert_dtypes()

0     18
1     42
2     40
3     30
4     87
      ..
58    20
59    18
60    33
61    37
62    29
Name: age, Length: 63, dtype: Int64
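
Note the capital `I` in `Int64`: this is the nullable integer type of `pandas`, which can hold a missing value. A minimal sketch on a made-up series (`ages` is illustrative) suggests why a plain `int64` conversion would not work for a column with a null value:

```python
import pandas as pd

ages = pd.Series([18.0, 42.0, None], name="age")   # float64 because of the missing value

converted = ages.convert_dtypes()   # picks the nullable integer type Int64
print(converted)

try:
    ages.astype("int64")            # plain integers cannot represent a missing value
except ValueError as err:
    print("astype failed:", err)
```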

We could also select the largest or smallest numbers in the column. In my code I select 5, but you can choose the number.

In [None]:
dfr.age.nlargest(5)

4     87.0
45    68.0
15    65.0
12    56.0
26    56.0
Name: age, dtype: float64

In [None]:
dfr.age.nsmallest(5)

10    15.0
17    15.0
18    15.0
56    15.0
24    17.0
Name: age, dtype: float64

Finally, some global operations considering all the elements in the column are also possible, for instance to add all of them.

In [None]:
dfr.age.sum()

2007.0
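
These global operations skip null values by default, which is why the missing `age` does not spoil the result. A small sketch with made-up numbers (the series `ages` is illustrative):

```python
import pandas as pd

ages = pd.Series([18.0, 42.0, None])

total = ages.sum()       # missing values are skipped by default
average = ages.mean()    # mean of the two known values
known = ages.count()     # number of non-null values

print(total, average, known)
```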

## Strings of characters

Another frequent column type is `object`. The values in these columns are strings of characters, or strings for short.  

Let's see, for instance, the values of the column `race`.


In [None]:
dfr['race'].convert_dtypes()

0     Latino
1     Latino
2     Latino
3      Black
4      Black
       ...  
58     Black
59     Black
60     Black
61     White
62     Black
Name: race, Length: 63, dtype: string

In [None]:
dfr['race']


An interesting operation is to see the number of repetitions of every label.

In [None]:
dfr['race'].value_counts()

race
Black     28
Latino    19
White     14
Asian      2
Name: count, dtype: int64
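
Sometimes proportions are more informative than raw counts. As a sketch on a made-up series (`labels` and its values are illustrative), `value_counts()` accepts `normalize=True` to return relative frequencies:

```python
import pandas as pd

labels = pd.Series(["Black", "Latino", "Black", "White"], name="race")

counts = labels.value_counts()                 # absolute frequencies
shares = labels.value_counts(normalize=True)   # relative frequencies, summing to 1

print(counts)
print(shares)
```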

Strings can be compared using lexicographical order:

In [None]:
dfr.first_name

0           Cesar A.
1             George
2             Wilson
3           Brian E.
4             Vivian
           ...      
58          Fredrick
59          Louis A.
60         Elbert O.
61           John H.
62    Willie Bernard
Name: first_name, Length: 63, dtype: object

In [None]:
dfr.first_name.min(), dfr.first_name.max()

('Aaron', 'Wilson')

And we can order the data:

In [None]:
dfr.first_name.sort_values()

45             Aaron
35         Alfred V.
20           Andreas
40           Anthony
52        Anthony J.
           ...      
4             Vivian
55           Wallace
47           William
62    Willie Bernard
2             Wilson
Name: first_name, Length: 63, dtype: object

### **Exercise**
Can you guess how to sort in reverse order? That is, from max values to min values.

Add a new code cell and try it yourself!

In [None]:
dfr.first_name.sort_values(ascending=False)


To finish this section, let's see a couple of examples of operations that only make sense with strings.

These are the original data:

In [None]:
dfa['city']

And we perform the following transformation:

In [None]:
dfa['city'].str.upper()

0            BAY SPRINGS
1             LIVINGSTON
2       COLORADO SPRINGS
3                  PERRY
4               HILLIARD
              ...       
3371               ELKIN
3372          POTTSVILLE
3373         ZEPHYRHILLS
3374                ZUNI
3375          ZANESVILLE
Name: city, Length: 3376, dtype: object

In [None]:
dfa['city'].str.lower()  # another string method; typing a dot after .str lists the available ones

After the column `dfa['city']` we add `.str` to indicate that we want string operations, and then we choose `upper()`. Try other alternatives: `lower()`, `capitalize()`...

Also interesting are other functions whose result is not a string. For instance:

In [None]:
dfa['city'].str.len()

This computes the length of each string:
_Bay Springs_ has 11 characters, _Livingston_ 10, _Colorado Springs_ 16 and so on.
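
Recall that the `city` column of `dfa` has some null values. The sketch below, on a hand-made series (`cities` is a hypothetical name; the first three values come from the output above), suggests how string operations treat them: the missing entry stays missing, and `str.contains()` takes an `na` argument to decide what to report for it.

```python
import pandas as pd

cities = pd.Series(["Bay Springs", "Livingston", "Colorado Springs", None])

upper = cities.str.upper()                               # text transformation
lengths = cities.str.len()                               # number of characters per value
has_springs = cities.str.contains("Springs", na=False)   # missing values count as False

print(upper)
print(lengths)
print(has_springs)
```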


## Dates

Dates are a very interesting data type.

In [None]:
dfr.death_date

0    1992-04-30
1    1992-05-01
2    1992-05-23
3    1992-04-30
4    1992-05-03
        ...    
58   1992-05-02
59   1992-04-29
60   1992-04-30
61   1992-04-29
62   1992-04-29
Name: death_date, Length: 63, dtype: datetime64[ns]

As you might expect, there is a whole family of functions to work with dates.

In [None]:
dfr.death_date.dt.day

0     30
1      1
2     23
3     30
4      3
      ..
58     2
59    29
60    30
61    29
62    29
Name: death_date, Length: 63, dtype: int32

After the column `dfr.death_date` we add `.dt` to use date-specific operations; in the cell above, `.day` recovers the day of the month. Similarly, `.month` recovers the month:

In [None]:
dfr.death_date.dt.month
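
The `.dt` operations require a column of type `datetime64`. When dates arrive as plain text, they can first be converted with `pd.to_datetime`. A minimal sketch, using three dates typed by hand (the variable names are illustrative):

```python
import pandas as pd

# Dates written as text are converted to datetime64 first.
dates = pd.to_datetime(pd.Series(["1992-04-29", "1992-04-30", "1992-05-01"]))

days = dates.dt.day              # day of the month
months = dates.dt.month          # month as a number
weekdays = dates.dt.day_name()   # weekday as text

print(days.tolist(), months.tolist(), weekdays.tolist())
```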

### **Exercise**
Add code cells and try other operations: `dt.year`, `dt.month_name()`, `dt.day_name()`, `dt.is_leap_year`...


In [None]:
dfr.death_date.dt.day_name()             # weekday name of each date
dfr.death_date.dt.month.value_counts()   # deaths per month (only this last result is displayed)

<hr>
<hr>
Carlos Gregorio Rodríguez

Universidad Complutense de Madrid

<img src="https://static0.makeuseofimages.com/wordpress/wp-content/uploads/2019/11/CC-BY-NC-License.png" alt="cc by nc" width="200"/>

https://creativecommons.org/licenses/by-nc/4.0/