## 5. DataFrame Analysis

---

### Questions

* "What are some common attributes for Pandas `DataFrame`s?"
* "What are some common methods for Pandas `DataFrame`s?"
* "How can you do arithmetic between two Pandas columns?"

### Objectives
* "Learn how to access `DataFrame` attributes"
* "Learn how to get statistics on a loaded `DataFrame`"
* "Learn how to sum two Pandas `DataFrame` columns together"

# DataFrame Attributes & Arithmetic

Once you have loaded in one or more `DataFrames` you may want to investigate various aspects of the data. This could be by looking at the shape of the `DataFrame` or the mean of a single column. This could also be through arithmetic between different `DataFrame` columns (i.e. `Series`). The following lesson will focus on these two concepts and will help you better understand how you can analyze the data you have loaded into Pandas.

## DataFrame Attributes

It is often useful to quickly explore some of the descriptive attributes and statistics of the dataset that you are working with, for instance the shape and data types of the `DataFrame`, or the range, mean, and standard deviation of its rows or columns. You may find interesting patterns or catch errors in your dataset this way. As we will see, accessing these attributes and computing the descriptive statistics is easy with pandas.

DataFrames have a number of attributes associated with them. With respect to exploring your dataset, perhaps the 4 most useful attributes are summarized in the table below:

| Attribute | Description|
|:----------|-----------|
| `shape`| Returns a tuple representing the dimensionality of the `DataFrame`. |
| `size` | Returns an int representing the number of elements in this object.  |
| `dtypes` | Returns the data types in the `DataFrame`. |
| `columns` | Returns an `Index` holding the column (header) names of the `DataFrame`. |

A list of all the DataFrame attributes can be found on the pandas website ([Link to `DataFrame` Docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)).
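As a minimal sketch of the four attributes in the table, the snippet below builds a small made-up `DataFrame` (the name `demo` and its values are assumptions for illustration, not part of the lesson data) and reads each attribute. Note that attributes are accessed without parentheses.

```python
import pandas as pd

# Small made-up DataFrame used only for illustration
demo = pd.DataFrame({
    "sample": ["a", "b", "c"],
    "depth": [10, 20, 30],
    "temp": [18.5, 16.1, 12.9],
})

print(demo.shape)    # (rows, columns) -> (3, 3)
print(demo.size)     # rows * columns -> 9
print(demo.dtypes)   # one data type per column
print(demo.columns)  # Index holding the column names
```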

### Inspecting Data Types

`DataFrame` types are important since they determine which methods can be used. For example, you can't compute the mean of an `object` column that contains strings (i.e. words).

One attribute that we have already used is `columns`, which returns the name of each column header.

Load in a `DataFrame` and check the data types of its columns:

In [4]:
import pandas as pd

In [8]:
df = pd.read_csv("data/types_dataframe.csv")
df

Unnamed: 0,Sample ID,date mmddyy,press dbar,temp ITS-90,csal PSS-78,coxy umol/kg,ph
0,Sample-1,40610,239.8,18.9625,35.0636,,7.951
1,Sample-2,40610,280.7,16.1095,34.6103,192.3,
2,Sample-3,40610,320.1,12.9729,34.2475,190.8,
3,Sample-4,40610,341.3,11.9665,34.1884,191.3,7.78
4,Sample-5,40610,360.1,11.3636,34.1709,203.5,
5,Sample-6,40610,385.0,10.4636,34.1083,193.7,
6,Sample-7,40610,443.7,8.5897,34.0567,156.5,
7,Sample-8,40610,497.8,7.1464,34.0424,110.7,7.496


In [6]:
df.columns

Index(['Sample ID', 'date mmddyy', 'press dbar', 'temp ITS-90', 'csal PSS-78',
       'coxy umol/kg', 'ph'],
      dtype='object')

However, what if we wanted to see the data type associated with each column header? Luckily, there is a quick and easy way to do this: the `dtypes` attribute. `dtypes` is a `Series` maintained by each `DataFrame` that contains the data type of every column. As an example, we can access the `dtypes` attribute of the `DataFrame` called `df` (seen below).

![image.png](https://change-hi.github.io/morea/data-wrangling/fig/E5_1_types_dataframe.png)

In [7]:
df.dtypes

Sample ID        object
date mmddyy       int64
press dbar      float64
temp ITS-90     float64
csal PSS-78     float64
coxy umol/kg    float64
ph              float64
dtype: object

## Data Types

Pandas has a number of different data types:

| Python Type       | Equivalent Pandas Type | Description                                                                                                       |
| :---------------- | :--------------------- | :---------------------------------------------------------------------------------------------------------------- |
| `str` or mixed    | `object`               | Columns made up partially or completely of strings                                                                |
| `int`             | `int64`                | Columns with numeric (integer) values. The 64 here refers <br/>to the size of the memory space allocated to this type |
| `float`           | `float64`              | Columns with floating point numbers (numbers with decimal points)                                                 |
| `bool`            | `bool`                 | True/False values                                                                                                 |
| `datetime`        | `datetime64[ns]`       | Date and/or time values                                                                                           |

While Pandas is usually pretty good at getting the type of a column right, sometimes you might need to help it, either by providing the type when the data is loaded in or by converting the column to a more suitable type afterwards.
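The two options mentioned above can be sketched as follows. This is an illustrative example on made-up CSV text (the column names `station` and `depth` are assumptions), using the `dtype` parameter of `read_csv` to set a type at load time and `astype` to convert a column afterwards.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = "station,depth\n001,10\n002,20\n"

# Default inference reads 'station' as integers and drops the leading zeros
inferred = pd.read_csv(io.StringIO(csv_text))

# Providing the type at load time keeps 'station' as text
typed = pd.read_csv(io.StringIO(csv_text), dtype={"station": str})

# Converting after loading: turn the integer depths into floats
typed["depth"] = typed["depth"].astype("float64")

print(inferred.dtypes)
print(typed.dtypes)
```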

As an example we are going to use the column 'date mmddyy' to create a new column just called 'date' that has the type `datetime`.

To start, we can convert the information stored in 'date mmddyy' into a new `Series` with the `datetime` type. To do this we call the `to_datetime` function and provide the `Series` we want to convert as a parameter. We also need to specify the format our dates are in. In our case we have month, day, and then year, each denoted by two numbers and with no separators. To tell `to_datetime` that our data is formatted in this way we pass '%m%d%y' to the `format` parameter. The format codes are the same ones Python uses for its native string to `datetime` conversion; more information can be found in the Python docs ([Link to string to `datetime` conversion docs](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior)).
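To illustrate how the format codes relate to plain Python, here is a small sketch that parses made-up six-digit date strings (not taken from the lesson file) with both `datetime.strptime` and `pd.to_datetime`, using the same '%m%d%y' format.

```python
from datetime import datetime

import pandas as pd

# Plain Python: parse one made-up string with the month-day-year format
single = datetime.strptime("040610", "%m%d%y")

# Pandas: parse a whole Series of made-up strings with the same format codes
parsed = pd.to_datetime(pd.Series(["040610", "123109"]), format="%m%d%y")

print(single)
print(parsed)
```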

### For more information

More information on the `to_datetime` function can be found on the Pandas website ([Link to `to_datetime` docs](https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html)).

In [8]:
pd.to_datetime(df['date mmddyy'], format='%m%d%y')

0   2010-04-06
1   2010-04-06
2   2010-04-06
3   2010-04-06
4   2010-04-06
5   2010-04-06
6   2010-04-06
7   2010-04-06
Name: date mmddyy, dtype: datetime64[ns]

Now that we have the correct output format, we can store the converted data in a new column by assigning it to a new column name. We will also drop the previously used 'date mmddyy' column to prevent confusion. Lastly, we will display the types of the columns to check that everything went the way we wanted it to.

In [9]:
df["date"] = pd.to_datetime(df['date mmddyy'], format='%m%d%y')
df = df.drop(columns=["date mmddyy"])
df.dtypes

Sample ID               object
press dbar             float64
temp ITS-90            float64
csal PSS-78            float64
coxy umol/kg           float64
ph                     float64
date            datetime64[ns]
dtype: object

For reference this is what the final `DataFrame` looks like. **Note that the date column is at the right side of the `DataFrame` since it was added last.**

In [10]:
df

Unnamed: 0,Sample ID,press dbar,temp ITS-90,csal PSS-78,coxy umol/kg,ph,date
0,Sample-1,239.8,18.9625,35.0636,,7.951,2010-04-06
1,Sample-2,280.7,16.1095,34.6103,192.3,,2010-04-06
2,Sample-3,320.1,12.9729,34.2475,190.8,,2010-04-06
3,Sample-4,341.3,11.9665,34.1884,191.3,7.78,2010-04-06
4,Sample-5,360.1,11.3636,34.1709,203.5,,2010-04-06
5,Sample-6,385.0,10.4636,34.1083,193.7,,2010-04-06
6,Sample-7,443.7,8.5897,34.0567,156.5,,2010-04-06
7,Sample-8,497.8,7.1464,34.0424,110.7,7.496,2010-04-06


---

### `DataFrame` Methods

There are a variety of built-in methods for working with a `DataFrame`. These are accessible using e.g. `df.method_name()`, where `df` is a `DataFrame` variable and `method_name` is some method. A list of some useful methods is provided below:

| Method|Description|
|:----------|-----------|
| `head()`| Return the first `n=5` rows by default. The value of `n` can be changed. |
| `tail()` | Return the last `n=5` rows by default. The value of `n` can be changed. |
| `min()`, `max()` | Computes the numeric (for numeric values) or alphanumeric (for object values) minimum or maximum of a `Series`, or of each column of a `DataFrame` by default. |
| `sum()`, `mean()`, `std()`, `var()`   | Computes the sum, mean, standard deviation, and variance of a `Series`, or of each column of a `DataFrame` by default. |
| `nlargest()` | Returns the `n` largest values of a `Series`, or the first `n` rows of a `DataFrame` ordered by the specified columns in descending order. |
| `count()` | Returns the number of non-NaN values in a `Series`, or in each column of a `DataFrame`. |
| `value_counts()` | Returns the frequency of each value in the `Series`. |
| `describe()` | Computes summary statistics for each column. |

### For more information

A full list of methods for `DataFrames` can be found in the Pandas docs ([Link to `DataFrame` Docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html)).
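A short sketch of several of the methods from the table, run on a made-up `DataFrame` (the name `demo` and its columns `site` and `temp` are assumptions for illustration):

```python
import pandas as pd

# Made-up values for illustration only; None becomes NaN in the float column
demo = pd.DataFrame({
    "site": ["A", "B", "A", "C"],
    "temp": [18.9, 16.1, None, 11.9],
})

print(demo.head(2))                 # first two rows
print(demo.tail(1))                 # last row
print(demo["temp"].min())           # smallest temperature, NaN is skipped
print(demo["temp"].count())         # number of non-NaN values
print(demo["site"].value_counts())  # how often each site appears
print(demo.nlargest(2, "temp"))     # the two rows with the highest temp
```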

#### `mean()` Method

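The `mean()` method calculates the mean along an axis: `axis=0` (the default) averages down the rows and returns one value per column, while `axis=1` averages across the columns and returns one value per row. As an example, let's return to our previous `DataFrame` `df`.

If we want to find the mean of all of our numeric columns, we would give the following command: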

In [11]:
df.mean(numeric_only=True)

press dbar      358.562500
temp ITS-90      12.196838
csal PSS-78      34.311013
coxy umol/kg    176.971429
ph                7.742333
dtype: float64

**Note: only the columns with numeric data types had their means calculated.**
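To make the `axis` argument concrete, here is a minimal sketch on a made-up, fully numeric `DataFrame` (the name `demo` and its values are assumptions for illustration):

```python
import pandas as pd

# Made-up numeric data for illustration only
demo = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})

col_means = demo.mean(axis=0)  # default: one mean per column
row_means = demo.mean(axis=1)  # one mean per row

print(col_means)
print(row_means)
```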

## Single Column (`Series`) Methods

If we only want the mean of a single column, we instead call `mean()` on that column (i.e. a `Series`). For the temperature column in the example above, `df['temp ITS-90'].mean()` returns a single value, approximately 12.196838, which matches the mean shown for that column above.


Other methods like `max()`, `var()`, and `count()` function in much the same way.
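As an illustrative sketch, the same pattern applied to a single column of a made-up `DataFrame` (the name `demo` and its values are assumptions):

```python
import pandas as pd

# Made-up values for illustration only; None becomes NaN
demo = pd.DataFrame({"temp": [18.0, 16.0, None, 14.0]})
temps = demo["temp"]  # selecting one column gives a Series

print(temps.mean())   # 16.0, NaN is skipped
print(temps.max())    # 18.0
print(temps.var())    # sample variance (ddof=1) -> 4.0
print(temps.count())  # 3 non-NaN values
```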

#### `describe()` Method

A method that is a bit trickier to understand is `describe()`. It provides a range of statistics about the `DataFrame`, depending on its contents. As an example, we will run `describe()` on the previously mentioned `DataFrame` called `df`.

By default the `describe()` method will only use numeric columns. To tell it to use all columns, regardless of whether they are numeric or not, we have to set `include='all'`.

In [12]:
# Note: datetime_is_numeric was removed in pandas 2.0; on newer versions drop that argument
df.describe(include='all', datetime_is_numeric=True)

Unnamed: 0,Sample ID,press dbar,temp ITS-90,csal PSS-78,coxy umol/kg,ph,date
count,8,8.0,8.0,8.0,7.0,3.0,8
unique,8,,,,,,
top,Sample-1,,,,,,
freq,1,,,,,,
mean,,358.5625,12.196838,34.311013,176.971429,7.742333,2010-04-06 00:00:00
min,,239.8,7.1464,34.0424,110.7,7.496,2010-04-06 00:00:00
25%,,310.25,9.995125,34.0954,173.65,7.638,2010-04-06 00:00:00
50%,,350.7,11.66505,34.17965,191.3,7.78,2010-04-06 00:00:00
75%,,399.675,13.75705,34.3382,193.0,7.8655,2010-04-06 00:00:00
max,,497.8,18.9625,35.0636,203.5,7.951,2010-04-06 00:00:00


Here we get statistics such as the mean of each column, how many non-NaN values each column contains, and the minimum and maximum values. The percent values are the percentiles of each column, e.g. the 25th percentile. Cells are left empty (NaN) where a statistic cannot be computed, e.g. the `mean()` of an `object` type column.
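The difference between the default behaviour and `include='all'` can be sketched on a small made-up `DataFrame` (the name `demo` and its columns are assumptions for illustration):

```python
import pandas as pd

# Made-up data: one text column and one numeric column
demo = pd.DataFrame({
    "site": ["A", "B", "A"],
    "temp": [10.0, 20.0, 30.0],
})

numeric_only = demo.describe()             # only the numeric column 'temp'
everything = demo.describe(include="all")  # also summarizes the text column 'site'

print(numeric_only)
print(everything)
```

In the second result, rows such as `unique`, `top`, and `freq` are filled in for `site`, while rows such as `mean` are NaN for that column.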

### For more information

More information about the `describe()` method can be found on the Pandas website ([Link to `describe()` method docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)).

In [6]:
temp_df = pd.DataFrame({"A": [1,2,3], "B": [5,6,7]})
temp_df

Unnamed: 0,A,B
0,1,5
1,2,6
2,3,7
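A minimal sketch of column arithmetic with `temp_df`: adding a constant to a column and adding two columns together. The new column names `A_plus_10` and `A_plus_B` are assumptions chosen for illustration.

```python
import pandas as pd

temp_df = pd.DataFrame({"A": [1, 2, 3], "B": [5, 6, 7]})

# Adding a constant is applied to every element of the column
temp_df["A_plus_10"] = temp_df["A"] + 10

# Adding two columns is element-wise, matched on the row index
temp_df["A_plus_B"] = temp_df["A"] + temp_df["B"]

print(temp_df)
```

Other operators such as `-`, `*`, and `/` work in the same element-wise way.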


### Key points
* Use `.dtypes` to get the types of each column in a `DataFrame`.
* To get general statistics on the DataFrame you can use the `describe` method.
* You can add a constant to a numeric column by using `column + constant`.

### Exercise 1

Find the mean temperature (`"temp ITS-90"`) of the 5 observations with the highest temperature. To achieve this, you can use the method `nlargest`, which takes two parameters: `n`, the number of rows to return, and `columns`, the list of columns by which to sort the data.


### Solutions


In [None]:
### Exercise 1
# df.nlargest(n=5, columns=["temp ITS-90"])["temp ITS-90"].mean()