# Handling tabular data

The output of the last part of the course was a set of tables containing information about nuclei segmented from images, specifically their size (area) and intensities. One of the great strengths of Python is that after the image processing part, we can remain in the same environment to perform the "data science" part of such a project, which includes data crunching, visualization and modeling. In this chapter we will learn the basics of handling tabular data using Pandas DataFrames.

## What is a DataFrame

In the last chapter we saw that we could easily export a DataFrame using the ```to_csv``` method. Now we can do the reverse and read a table back in with the ```pd.read_csv``` function:

In [1]:
import pandas as pd

In [2]:
my_df = pd.read_csv('../exports/19838_1252_F8_1_in.csv')

In [3]:
my_df

Unnamed: 0,label,area,mean_intensity
0,1,5629,28.21407
1,2,9904,44.429826
2,4,15070,53.126078
3,5,20884,49.792856
4,6,12972,42.911116
5,7,16068,54.610904
6,8,27912,52.343007
7,9,26131,60.766178
8,10,28071,58.83043
9,11,16176,54.782517


What we see above is just the standard Jupyter output for DataFrames. Note in particular that it is not interactive. A DataFrame is just a table of items, and in that sense very comparable to an Excel sheet. Another property that DataFrames have in common with Excel sheets is that they can contain diverse types of data: one column can be text, another numbers, etc., unlike a Numpy array, which is homogeneous.
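
To make the difference with Numpy concrete, here is a minimal sketch using a small made-up table (the `example_df` name and its values are assumptions for illustration, not the course data). Each column keeps its own data type, while converting the whole table to an array forces a single common type:

```python
import pandas as pd

# Hypothetical table: one text column and two numeric columns
example_df = pd.DataFrame({
    'name': ['nucleus_a', 'nucleus_b', 'nucleus_c'],
    'area': [5629, 9904, 15070],
    'mean_intensity': [28.2, 44.4, 53.1],
})

# Each column has its own data type
print(example_df.dtypes)

# Converting to a Numpy array forces one common dtype for all elements
example_array = example_df.to_numpy()
print(example_array.dtype)
```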

Just like arrays, DataFrames are objects with properties and methods attached, some very similar to Numpy. For example we can ask what the dimensions of the table are:

In [4]:
my_df.shape

(11, 3)

Or we can just display the first few rows using the ```head``` method:

In [5]:
my_df.head(5)

Unnamed: 0,label,area,mean_intensity
0,1,5629,28.21407
1,2,9904,44.429826
2,4,15070,53.126078
3,5,20884,49.792856
4,6,12972,42.911116
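
As a small sketch of these properties and methods, the example below uses a made-up six-row table (`example_df` is an assumption, not the course file). It unpacks ```shape``` into rows and columns and uses ```tail```, the counterpart of ```head```, to look at the end of the table:

```python
import pandas as pd

# Hypothetical table with 6 rows and 2 columns
example_df = pd.DataFrame({
    'label': [1, 2, 4, 5, 6, 7],
    'area': [5629, 9904, 15070, 20884, 12972, 16068],
})

# shape is a (rows, columns) tuple, as in Numpy
n_rows, n_cols = example_df.shape
print(n_rows, n_cols)

first_rows = example_df.head(3)  # first 3 rows
last_rows = example_df.tail(2)   # last 2 rows
print(first_rows)
print(last_rows)
```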


## Columns and indices

Just like in an Excel sheet, items can be located via a row name and a column name, which are not just numerical as in Numpy's row/column system. We see above that the first column contains bold numbers: those are the *indices* of the table. We also see bold names at the top, which are the *column* names.

We can in fact access both these indices and columns:

In [6]:
# No column is named 'new A', so this rename leaves my_df unchanged
my_df = my_df.rename({'new A': 're-name A'}, axis='columns')

In [7]:
my_df.index

RangeIndex(start=0, stop=11, step=1)

In [8]:
my_df.columns

Index(['label', 'area', 'mean_intensity'], dtype='object')

We see that they are both specific Pandas objects. The index here is just a range of numbers, while the column names are something closer to a list of items.
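
Since both objects behave much like sequences, they can be turned into plain Python lists when that is more convenient. A minimal sketch, using a made-up `example_df` rather than the course data:

```python
import pandas as pd

# Hypothetical table for illustration
example_df = pd.DataFrame({'label': [1, 2, 4], 'area': [5629, 9904, 15070]})

column_names = list(example_df.columns)  # plain Python list of column names
row_labels = list(example_df.index)      # plain Python list of row labels

print(column_names)
print(row_labels)

# Membership tests work directly on the columns object
print('area' in example_df.columns)
```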

### Renaming

We can also modify the indices and columns. For example we might want to rename a column. For this we define a dictionary of key-value pairs where the key is the existing name and the value is the new name. We also need to specify whether we want to act on ```columns``` or on the ```index```:

In [9]:
my_df = my_df.rename({'mean_intensity': 'intensity_in'}, axis='columns')

In [10]:
my_df

Unnamed: 0,label,area,intensity_in
0,1,5629,28.21407
1,2,9904,44.429826
2,4,15070,53.126078
3,5,20884,49.792856
4,6,12972,42.911116
5,7,16068,54.610904
6,8,27912,52.343007
7,9,26131,60.766178
8,10,28071,58.83043
9,11,16176,54.782517
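
The sketch below shows a few variations of ```rename``` on a made-up `example_df` (an assumption for illustration). It uses the ```columns``` keyword as an alternative to ```axis='columns'```, renames row labels via the index, and shows that keys matching no existing name are ignored by default:

```python
import pandas as pd

# Hypothetical table for illustration
example_df = pd.DataFrame({'label': [1, 2], 'mean_intensity': [28.2, 44.4]})

# Equivalent to rename({...}, axis='columns')
renamed_cols = example_df.rename(columns={'mean_intensity': 'intensity_in'})

# Renaming row labels works the same way, acting on the index instead
renamed_rows = example_df.rename({0: 'first', 1: 'second'}, axis='index')

# Keys that match no existing name are silently ignored by default
unchanged = example_df.rename({'no_such_column': 'x'}, axis='columns')

print(renamed_cols.columns)
print(renamed_rows.index)
print(unchanged.columns)
```

Note that ```example_df``` itself keeps its original names in all three cases, because ```rename``` returns a new DataFrame.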


### Adding columns

It is also very easy to add a new column to an existing DataFrame. For that you just assign a list or a single value to a column name that does not exist yet:

In [11]:
my_df['new column'] = 1
my_df

Unnamed: 0,label,area,intensity_in,new column
0,1,5629,28.21407,1
1,2,9904,44.429826,1
2,4,15070,53.126078,1
3,5,20884,49.792856,1
4,6,12972,42.911116,1
5,7,16068,54.610904,1
6,8,27912,52.343007,1
7,9,26131,60.766178,1
8,10,28071,58.83043,1
9,11,16176,54.782517,1


In [12]:
my_df['text column'] = 'my file name'
my_df

Unnamed: 0,label,area,intensity_in,new column,text column
0,1,5629,28.21407,1,my file name
1,2,9904,44.429826,1,my file name
2,4,15070,53.126078,1,my file name
3,5,20884,49.792856,1,my file name
4,6,12972,42.911116,1,my file name
5,7,16068,54.610904,1,my file name
6,8,27912,52.343007,1,my file name
7,9,26131,60.766178,1,my file name
8,10,28071,58.83043,1,my file name
9,11,16176,54.782517,1,my file name
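
The examples above assign a single value. As a minimal sketch of the other options, the code below assigns a list (which needs one element per row) and a column computed from an existing one. The `example_df` table, the `channel` values and the `pixel_size` factor are all made up for illustration:

```python
import pandas as pd

# Hypothetical table for illustration
example_df = pd.DataFrame({'label': [1, 2, 4], 'area': [5629, 9904, 15070]})

# A list must have exactly one element per row
example_df['channel'] = ['in', 'in', 'out']

# A column can also be computed element-wise from existing columns
pixel_size = 0.5  # hypothetical pixel size, not from the course data
example_df['area_scaled'] = example_df['area'] * pixel_size ** 2

print(example_df)
```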


### Dropping items

Conversely, we can remove columns using the ```drop()``` method. Here too we have to specify with ```axis``` whether we want to affect ```'rows'``` or ```'columns'```:

In [13]:
my_df.drop(labels='area', axis='columns')

Unnamed: 0,label,intensity_in,new column,text column
0,1,28.21407,1,my file name
1,2,44.429826,1,my file name
2,4,53.126078,1,my file name
3,5,49.792856,1,my file name
4,6,42.911116,1,my file name
5,7,54.610904,1,my file name
6,8,52.343007,1,my file name
7,9,60.766178,1,my file name
8,10,58.83043,1,my file name
9,11,54.782517,1,my file name


Note that many functions that affect the content of a table do not act **in place** by default. This means that they do not modify the DataFrame directly but return a modified **copy** of it. For example, if we check ```my_df```, we see that the ```area``` column is still there even though we dropped it:

In [14]:
my_df

Unnamed: 0,label,area,intensity_in,new column,text column
0,1,5629,28.21407,1,my file name
1,2,9904,44.429826,1,my file name
2,4,15070,53.126078,1,my file name
3,5,20884,49.792856,1,my file name
4,6,12972,42.911116,1,my file name
5,7,16068,54.610904,1,my file name
6,8,27912,52.343007,1,my file name
7,9,26131,60.766178,1,my file name
8,10,28071,58.83043,1,my file name
9,11,16176,54.782517,1,my file name


There are two solutions: either we reassign the output to the same variable and basically overwrite it:

In [15]:
my_df = my_df.drop(labels='area', axis='columns')

my_df

Unnamed: 0,label,intensity_in,new column,text column
0,1,28.21407,1,my file name
1,2,44.429826,1,my file name
2,4,53.126078,1,my file name
3,5,49.792856,1,my file name
4,6,42.911116,1,my file name
5,7,54.610904,1,my file name
6,8,52.343007,1,my file name
7,9,60.766178,1,my file name
8,10,58.83043,1,my file name
9,11,54.782517,1,my file name


Or we use the ```inplace``` option that is available for many functions:

In [16]:
my_df.drop(labels='new column', axis='columns', inplace=True)
my_df

Unnamed: 0,label,intensity_in,text column
0,1,28.21407,my file name
1,2,44.429826,my file name
2,4,53.126078,my file name
3,5,49.792856,my file name
4,6,42.911116,my file name
5,7,54.610904,my file name
6,8,52.343007,my file name
7,9,60.766178,my file name
8,10,58.83043,my file name
9,11,54.782517,my file name
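
The two behaviours can be compared side by side in a minimal sketch on a made-up `example_df` (an assumption, not the course data). It also shows that a call with ```inplace=True``` returns ```None```, so its result should not be assigned back to the variable, and that rows are dropped by their index label:

```python
import pandas as pd

# Hypothetical table for illustration
example_df = pd.DataFrame({
    'label': [1, 2, 4],
    'area': [5629, 9904, 15070],
    'intensity_in': [28.2, 44.4, 53.1],
})

# Without inplace, drop returns a new DataFrame and leaves the original intact
without_area = example_df.drop(labels='area', axis='columns')
print(list(example_df.columns))  # 'area' is still present

# With inplace=True the original is modified and the call returns None
result = example_df.drop(labels='intensity_in', axis='columns', inplace=True)
print(result)
print(list(example_df.columns))

# Rows are dropped by their index label
without_first_row = example_df.drop(labels=0, axis='index')
print(without_first_row)
```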


## Exercise

1. Import the penguin dataset (the ```read_csv``` function also works with URLs): https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
2. Rename the ```bill_length_mm``` and ```bill_depth_mm``` columns to ```length``` and ```depth```.
3. Check that the column names of the DataFrame have changed. If not, do you understand why?
4. Add a column named ```my_column``` and fill it with the default value ```'test'```.
5. Remove the ```body_mass_g``` column.