# Operating on DataFrames

Now that we know how to import DataFrames, format them to our needs, and access specific parts of them, we want to be able to do maths with them. We will see that many of the things we learned with Numpy arrays can be "recycled" here.

In [1]:
import pandas as pd
import numpy as np

Here we use a table of nuclei measurements:

In [2]:
nuclei = pd.read_csv('../exports/19838_1252_F8_1_in.csv')

In [3]:
nuclei

Unnamed: 0,label,area,mean_intensity
0,1,5629,28.21407
1,2,9904,44.429826
2,4,15070,53.126078
3,5,20884,49.792856
4,6,12972,42.911116
5,7,16068,54.610904
6,8,27912,52.343007
7,9,26131,60.766178
8,10,28071,58.83043
9,11,16176,54.782517


Just like with Numpy arrays, we can apply various mathematical operations to DataFrame columns or to entire DataFrames.

## Arithmetic on single columns
Just like we could do an operation like ```my_array = my_array + 1```, we can do the same with Pandas columns. With such operations, we either replace an existing column or create a new one, as we do here. For example, we can estimate the radius of a disk whose area equals that of each nucleus. We first check by hand that a radius of about 42.329 gives the area of the first nucleus:

In [4]:
np.pi*42.329**2

5628.93054463742
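
The check above and the column computation that follows both rely on the formula for the area $A$ of a disk of radius $r$, solved for the radius:

$$A = \pi r^2 \quad\Longrightarrow\quad r = \sqrt{\frac{A}{\pi}}$$

For the first nucleus, $A = 5629$ gives $r \approx 42.33$, which matches the value we used in the check.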

In [5]:
nuclei['radius'] = np.sqrt(nuclei['area']/np.pi)
nuclei

Unnamed: 0,label,area,mean_intensity,radius
0,1,5629,28.21407,42.329261
1,2,9904,44.429826,56.147494
2,4,15070,53.126078,69.259873
3,5,20884,49.792856,81.532715
4,6,12972,42.911116,64.258197
5,7,16068,54.610904,71.516454
6,8,27912,52.343007,94.258504
7,9,26131,60.766178,91.20173
8,10,28071,58.83043,94.526593
9,11,16176,54.782517,71.756398
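
To make the difference between creating and replacing a column explicit, here is a minimal sketch on a small made-up table. The name `cells` and all of its values are invented for illustration and are not part of our nuclei data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are made up for illustration
cells = pd.DataFrame({'label': [1, 2, 3], 'area': [100.0, 400.0, 900.0]})

# Create a NEW column computed from an existing one
cells['radius'] = np.sqrt(cells['area'] / np.pi)

# REPLACE an existing column: the old 'area' values are overwritten
cells['area'] = cells['area'] / 100

print(cells)
```

Note that the order matters here: `radius` is computed before `area` is overwritten, so it is based on the original areas.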


## Arithmetic on multiple columns

We can also do maths with multiple columns, as we previously did with arrays. Just like with arrays, **operations are performed element-wise**. For example, we can divide the mean intensity by the area:

In [6]:
nuclei['mean_intensity'] / nuclei['area']

0     0.005012
1     0.004486
2     0.003525
3     0.002384
4     0.003308
5     0.003399
6     0.001875
7     0.002325
8     0.002096
9     0.003387
10    0.003232
dtype: float64

Note that the output here is a Series with the same length as our table, so we can add it directly to the original table as a new column!

In [7]:
nuclei['intensity_area'] = nuclei['mean_intensity'] / nuclei['area']
nuclei

Unnamed: 0,label,area,mean_intensity,radius,intensity_area
0,1,5629,28.21407,42.329261,0.005012
1,2,9904,44.429826,56.147494,0.004486
2,4,15070,53.126078,69.259873,0.003525
3,5,20884,49.792856,81.532715,0.002384
4,6,12972,42.911116,64.258197,0.003308
5,7,16068,54.610904,71.516454,0.003399
6,8,27912,52.343007,94.258504,0.001875
7,9,26131,60.766178,91.20173,0.002325
8,10,28071,58.83043,94.526593,0.002096
9,11,16176,54.782517,71.756398,0.003387
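
The same pattern can be sketched on a small made-up table, where the element-wise result is easy to verify by hand. The table `measurements` and its column names are assumptions chosen for this example.

```python
import pandas as pd

# Hypothetical toy table; the values are made up for illustration
measurements = pd.DataFrame({
    'total_intensity': [200.0, 900.0, 1600.0],
    'area': [100.0, 300.0, 400.0],
})

# Element-wise division: row 0 with row 0, row 1 with row 1, and so on
ratio = measurements['total_intensity'] / measurements['area']

# The result is a Series with one value per row, so it fits as a new column
measurements['intensity_per_area'] = ratio

print(measurements)
```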


## Applying functions

Again, just like we could previously apply a function to an array and recover an array of the same size where the function had been applied **element-wise**, we can do the same with a DataFrame column and recover a column. And we can simply use Numpy functions!

In [8]:
np.log10(nuclei['mean_intensity'])

0     1.450466
1     1.647675
2     1.725308
3     1.697167
4     1.632570
5     1.737279
6     1.718859
7     1.783662
8     1.769602
9     1.738642
10    1.784913
Name: mean_intensity, dtype: float64
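
As a small sketch of what comes back from such a call, the example below applies `np.log10` both to a single column and to a whole table. The table `table` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are made up for illustration
table = pd.DataFrame({'a': [1.0, 10.0, 100.0], 'b': [1000.0, 10.0, 1.0]})

# Applied to one column, the Numpy function returns a Series
log_a = np.log10(table['a'])

# Applied to the whole table, it returns a DataFrame of the same shape
log_all = np.log10(table)

print(type(log_a).__name__, type(log_all).__name__)
```

In both cases the index and the column names are preserved, so the result can be combined with the original table.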

## Summarizing functions
Finally, we can summarize columns by using either Numpy functions or the methods attached to a DataFrame or a column. For example, we can take the mean:

In [9]:
np.mean(nuclei['area'])

17970.0

In [10]:
nuclei['area'].mean()

17970.0

Such summary functions can also be applied directly to the entire DataFrame, which returns one value per column:

In [11]:
nuclei.mean()

label                 6.818182
area              17970.000000
mean_intensity       50.977129
radius               74.023088
intensity_area        0.003185
dtype: float64

And we can even get complete summaries:

In [12]:
nuclei.describe()

Unnamed: 0,label,area,mean_intensity,radius,intensity_area
count,11.0,11.0,11.0,11.0,11.0
mean,6.818182,17970.0,50.977129,74.023088,0.003185
std,3.600505,7309.810203,9.583545,16.268751,0.000976
min,1.0,5629.0,28.21407,42.329261,0.001875
25%,4.5,14021.0,47.111341,66.759035,0.002355
50%,7.0,16176.0,53.126078,71.756398,0.003308
75%,9.5,23507.5,56.806474,86.367223,0.003462
max,12.0,28071.0,60.941442,94.526593,0.005012
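
The three levels of summary shown above (a single column, all columns, and a full description) can be sketched together on a small made-up table whose means are easy to check by hand. The table `table` and its values are invented for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table; the values are made up for illustration
table = pd.DataFrame({'area': [10.0, 20.0, 30.0], 'intensity': [1.0, 2.0, 6.0]})

# Mean of one column: Numpy function and column method
mean_np = np.mean(table['area'])
mean_pd = table['area'].mean()

# Mean of every column at once: returns a Series indexed by column name
column_means = table.mean()

# Full summary: a DataFrame with rows such as 'count', 'mean', 'min', 'max'
summary = table.describe()

print(mean_np, mean_pd)
print(column_means)
```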


## Exercise

1. Import the penguin dataset https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv
2. Create a new column called ```body_mass_kg``` by converting the ```body_mass_g``` column into kilograms.
3. Create a new column containing the ```bill_length_mm``` column divided by the ```bill_depth_mm``` column.
4. Calculate the median value of the ```body_mass_g``` column both by using a Numpy function and by using the method attached to the column. What do you observe? Can you infer what the problem is?