# Spheres

We'll try to do all this with [`pandas`](https://pandas.pydata.org/docs/index.html), using a CSV file on disk named `spheres.csv` that looks like this:

```
pid,mb,rad,more
123,m,1,more
124,b,2,more
125,m,3,more
126,b,4,more
127,m,5,more
128,m,,more
```

## Imports

By [convention](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#what-kind-of-data-does-pandas-handle), `pandas` is imported as `pd`. We'll also need the constant `pi` from the `math` module.

In [1]:
import pandas as pd
from math import pi

## Helper Functions

Let's write functions to calculate the diameter, surface area, and volume of each sphere.

In [2]:
def diameter(rad):
    return rad*2

In [3]:
def surface(rad):
    return 4*pi*pow(rad,2)

In [4]:
def volume(rad):
    return (4/3)*(pi*pow(rad,3))
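
A quick sanity check can catch formula mistakes before the helpers touch real data. The sketch below repeats the three functions so it runs on its own, then tests them at a radius of 3, a value picked only for illustration because the surface area and the volume both work out to 36π there.

```python
from math import isclose, pi


def diameter(rad):
    return rad * 2


def surface(rad):
    return 4 * pi * pow(rad, 2)


def volume(rad):
    return (4 / 3) * (pi * pow(rad, 3))


# Illustrative radius: at r = 3 both surface area and volume equal 36 * pi.
r = 3
print(diameter(r))
print(isclose(surface(r), 36 * pi))
print(isclose(volume(r), 36 * pi))
```

`isclose()` is used instead of `==` because the two results are floats reached by different arithmetic.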

## Read CSV

Read data from file:

In [5]:
raw_spheres = pd.read_csv('spheres.csv')

Let's see what that [`dataframe`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html?highlight=dataframe#pandas-dataframe) looks like:

In [6]:
print(raw_spheres)

   pid mb  rad  more
0  123  m  1.0  more
1  124  b  2.0  more
2  125  m  3.0  more
3  126  b  4.0  more
4  127  m  5.0  more
5  128  m  NaN  more


That first column is the index that pandas generates for each row; it is not actually in the CSV itself.
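
To see that the index is generated rather than read, here is a minimal sketch that parses a shortened copy of the data from an in-memory string, so it needs no file on disk. The three sample rows and the name `csv_text` are assumptions made for the example; the headers match the printed output above.

```python
import io

import pandas as pd

# A shortened stand-in for spheres.csv, held in memory for this sketch.
csv_text = """pid,mb,rad,more
123,m,1,more
124,b,2,more
128,m,,more
"""

df = pd.read_csv(io.StringIO(csv_text))

print(list(df.columns))         # only the four columns from the CSV header
print(list(df.index))           # row labels generated by pandas
print(df["rad"].isna().sum())   # the empty radius field was read as NaN
```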

## Filtering Columns

The source CSV may have other data that you don't care about. I only care about the first three columns.

In [7]:
# .copy() gives us an independent dataframe that is safe to modify later
spheres = raw_spheres.iloc[:, :3].copy()
print(spheres)

   pid mb  rad
0  123  m  1.0
1  124  b  2.0
2  125  m  3.0
3  126  b  4.0
4  127  m  5.0
5  128  m  NaN
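
Selecting by position breaks quietly if the column order in the CSV ever changes. As an alternative sketch, the same three columns can be selected by name; the two-row dataframe here is made up for the example.

```python
import pandas as pd

# Small stand-in for raw_spheres.
raw = pd.DataFrame({
    "pid": [123, 124],
    "mb": ["m", "b"],
    "rad": [1.0, 2.0],
    "more": ["more", "more"],
})

by_position = raw.iloc[:, :3]            # first three columns, whatever they are
by_label = raw[["pid", "mb", "rad"]]     # these three columns, wherever they are

print(by_label.equals(by_position))
```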


## Changing Headers

Those headers could use some help... Let's rename them.

In [8]:
new_headers = ["id", "type", "radius"]
spheres.columns = new_headers
print(spheres)

    id type  radius
0  123    m     1.0
1  124    b     2.0
2  125    m     3.0
3  126    b     4.0
4  127    m     5.0
5  128    m     NaN
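
Assigning a list to `columns` depends on getting the order right. A sketch of another option is `DataFrame.rename()` with a mapping from old name to new name, which returns a renamed copy and leaves the original untouched. The one-row dataframe is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({"pid": [123], "mb": ["m"], "rad": [1.0]})

# Map each old header to its replacement; order does not matter here.
renamed = df.rename(columns={"rad": "radius", "pid": "id", "mb": "type"})

print(list(df.columns))
print(list(renamed.columns))
```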


## Missing Data

In our case, I want to drop any rows that are missing `radius` so that they don't skew our calculations later. `dataframe.dropna()` handles that for us, and passing `inplace=True` updates our dataframe in place instead of returning a new one.

In [9]:
spheres.dropna(subset=["radius"], inplace=True)
print(spheres)

    id type  radius
0  123    m     1.0
1  124    b     2.0
2  125    m     3.0
3  126    b     4.0
4  127    m     5.0
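
For comparison, here is a minimal sketch of `dropna()` without `inplace=True`, which returns a new dataframe and leaves the original as it was. The two-row dataframe and the name `cleaned` are assumptions for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [127, 128],
    "type": ["m", "m"],
    "radius": [5.0, None],   # None is stored as NaN in a float column
})

# Only rows with a missing radius are dropped; other columns are not checked.
cleaned = df.dropna(subset=["radius"])

print(len(df), len(cleaned))
```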


## Sorting

Say we want to sort our dataframe by type. `dataframe.sort_values()` sorts in ascending order by default, which for a column of strings means a lexical sort, so this is trivial.

In [10]:
spheres.sort_values(by=["type"], inplace=True)
print(spheres)

    id type  radius
1  124    b     2.0
3  126    b     4.0
0  123    m     1.0
2  125    m     3.0
4  127    m     5.0


> Note that each row kept its index.
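
If the old row labels are not wanted after sorting, one possible approach is `reset_index(drop=True)`, which renumbers the rows from zero. This is only a sketch on a made-up three-row dataframe; the walkthrough itself keeps the original index.

```python
import pandas as pd

df = pd.DataFrame({"id": [123, 124, 125], "type": ["m", "b", "m"]})

sorted_df = df.sort_values(by=["type"])
print(list(sorted_df.index))    # original labels, now out of order

# drop=True discards the old labels instead of keeping them as a column.
renumbered = sorted_df.reset_index(drop=True)
print(list(renumbered.index))
```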

## New Data Series

Let's add a calculated [series](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#each-column-in-a-dataframe-is-a-series) (column) for diameter, surface area, and volume. This is done by naming the new series on the left side of the assignment (`spheres["diameter"]`) and giving the calculation on the right side (`diameter(spheres["radius"])`). The helper functions use only arithmetic, so they apply to every value in the series at once.

### Diameter

In [11]:
spheres["diameter"] = diameter(spheres["radius"])
print(spheres[["id","diameter"]])

    id  diameter
1  124       4.0
3  126       8.0
0  123       2.0
2  125       6.0
4  127      10.0


### Surface Area

In [12]:
spheres["surface area"] = surface(spheres["radius"])
print(spheres[["id","surface area"]])

    id  surface area
1  124     50.265482
3  126    201.061930
0  123     12.566371
2  125    113.097336
4  127    314.159265


### Volume

In [13]:
spheres["volume"] = volume(spheres["radius"])
print(spheres[["id","volume"]])

    id      volume
1  124   33.510322
3  126  268.082573
0  123    4.188790
2  125  113.097336
4  127  523.598776


### Checkpoint

This is what the fully calculated dataframe looks like so far:

In [14]:
print(spheres)

    id type  radius  diameter  surface area      volume
1  124    b     2.0       4.0     50.265482   33.510322
3  126    b     4.0       8.0    201.061930  268.082573
0  123    m     1.0       2.0     12.566371    4.188790
2  125    m     3.0       6.0    113.097336  113.097336
4  127    m     5.0      10.0    314.159265  523.598776
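
The three assignments above could also be written as one step with `DataFrame.assign()`, which returns a new dataframe with the extra columns. In this sketch the formulas are written inline rather than through the helper functions, and the column is named `surface_area` with an underscore because keyword arguments cannot contain spaces.

```python
from math import pi

import pandas as pd

df = pd.DataFrame({"id": [123, 124], "radius": [1.0, 2.0]})

# Each keyword becomes a new column; df itself is not modified.
with_columns = df.assign(
    diameter=df["radius"] * 2,
    surface_area=4 * pi * df["radius"] ** 2,
    volume=(4 / 3) * pi * df["radius"] ** 3,
)

print(with_columns)
```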


## Averages

We can now calculate the average diameter, surface area, and volume for each type (`m` and `b`). Select a list of columns, say how to group that selection, and choose the calculation to apply to each group.

> Note: Because we select columns before grouping, the selection **must** include the column we group by. In other words, if we do not _select_ `type` in the list below, then we will not be able to _groupby()_ it.

In [15]:
averages = (
    spheres[["type","diameter","surface area","volume"]]
    .groupby(["type"])
    .mean()
)
print(averages)

      diameter  surface area      volume
type                                    
b          6.0    125.663706  150.796447
m          6.0    146.607657  213.628300
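
The order of the two steps can also be swapped: group the whole dataframe first, then select the columns to average. The sketch below compares both orders on a made-up four-row dataframe; in the second form `type` does not need to appear in the column list.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [123, 124, 125, 126],
    "type": ["m", "b", "m", "b"],
    "diameter": [2.0, 4.0, 6.0, 8.0],
})

# Select first, then group: "type" has to be part of the selection.
select_then_group = df[["type", "diameter"]].groupby(["type"]).mean()

# Group first, then select: only the columns to average are listed.
group_then_select = df.groupby("type")[["diameter"]].mean()

print(select_then_group)
print(group_then_select)
```

Grouping first also keeps `id` out of the averages without having to think about it.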


## Counting

We need to know how many spheres there are of each `type`. You could pick any of the fields here to count, since none of the remaining rows has missing values.

In [16]:
counts = spheres.groupby(by=["type"])["diameter"].count()
print(counts)

type
b    2
m    3
Name: diameter, dtype: int64
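
The choice of field only stops mattering once missing values are gone, because `count()` skips them. A small sketch of the difference, using a made-up dataframe that still has one missing radius, compares `count()` with `size()`, which counts rows regardless of their contents.

```python
import pandas as pd

df = pd.DataFrame({
    "type": ["m", "b", "m"],
    "radius": [1.0, 2.0, None],
})

counts = df.groupby("type")["radius"].count()   # non-missing values per group
sizes = df.groupby("type").size()               # rows per group

print(counts)
print(sizes)
```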


## Joining Data

Say we would like the averages per type along with the total count of each type. `pd.merge()` has us covered.

In [17]:
merged = pd.merge(counts, averages, on=["type"])
print(merged)

      diameter_x  diameter_y  surface area      volume
type                                                  
b              2         6.0    125.663706  150.796447
m              3         6.0    146.607657  213.628300


### Correct those headers

We had a `diameter` field in both sets of data, so pandas added the `_x` and `_y` suffixes to tell them apart. This is likely to happen with large sets of data, and sometimes you need to rename the columns afterwards. Also, everything after `count` is an average.

In [18]:
merged.columns = ["count", "avg diameter", "avg surface area", "avg volume"]
print(merged)

      count  avg diameter  avg surface area  avg volume
type                                                   
b         2           6.0        125.663706  150.796447
m         3           6.0        146.607657  213.628300
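
The name clash can also be avoided before it happens by renaming the counts series ahead of the merge. This is a sketch under assumptions, not the approach used above: the two inputs are rebuilt by hand with a single average column, and the merge is done on the index with `left_index=True` and `right_index=True`.

```python
import pandas as pd

type_index = pd.Index(["b", "m"], name="type")

# Hand-built stand-ins for the counts series and the averages dataframe.
counts = pd.Series([2, 3], index=type_index, name="diameter")
averages = pd.DataFrame({"diameter": [6.0, 6.0]}, index=type_index)

# Renaming the series first means no _x / _y suffixes are needed.
merged = pd.merge(
    counts.rename("count"),
    averages,
    left_index=True,
    right_index=True,
)

print(merged)
```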


## File Output

Let's save our calculated data to new files. `to_csv()` writes the index by default. For `spheres` the index is just leftover row numbers, so we set `index=False` to leave it out. For `merged` the index holds the `type` labels, so we keep it.

In [19]:
spheres.to_csv('calculated_spheres.csv', index=False)
merged.to_csv('averages_spheres.csv')
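
To show what the `index` argument changes, this sketch writes a small hand-built stand-in for `merged` to strings instead of files, since `to_csv()` returns the CSV text when no path is given. It then reads the indexed version back with `index_col="type"`.

```python
import io

import pandas as pd

# Hand-built stand-in for the merged dataframe.
merged = pd.DataFrame(
    {"count": [2, 3], "avg diameter": [6.0, 6.0]},
    index=pd.Index(["b", "m"], name="type"),
)

with_index = merged.to_csv()                # "type" is written as the first column
without_index = merged.to_csv(index=False)  # the type labels are lost

print(with_index)
print(without_index)

# Reading back: tell pandas which column should become the index again.
restored = pd.read_csv(io.StringIO(with_index), index_col="type")
print(restored)
```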