# `DataFrame` Attributes and Arithmetic 


---

It is crucial to have a deep understanding of your data in order to draw meaningful insights from it. In this chapter we will see how to use the built-in functionality of `pandas` to begin exploring and transforming our data. This will help us identify patterns or potential flaws in our dataset and hopefully inspire or even answer some interesting questions.

The structure of this chapter is as follows: we will start by preparing our environment and reviewing the data set that we will be working with to illustrate new concepts and skills. Once our preparation is complete, we dive into the attributes and methods that can be used to summarize your data. After that, we continue with arithmetic and data alignment between `Series` and `DataFrames` by covering vectorization and broadcasting.

# Preparing our Environment

---

In this chapter we will only be needing the pandas toolkit. We will be importing the library under the standard alias `pd`.

```python
import pandas as pd
```

In [1]:
import pandas as pd

# About the Data

---

To help us grasp the new content we will be covering in this chapter, we will again be using the data stored in the `'Data/spending_10k.csv'` file. This data is a subset of the complete dataset that is publicly available on the Centers for Medicare & Medicaid Services website ([`CMS` website](https://www.cms.gov/OpenPayments/Explore-the-Data/Dataset-Downloads.html)).  Here is a brief review of the data:

| Column |Description|
|:----------|-----------|
| `unique_id`| A unique identifier for a Medicare claim to CMS |
| `doctor_id` | The Unique Identifier of the doctor who <br/> prescribed the medicine  |
| `specialty` | The specialty of the doctor who prescribed the medicine |
| `medication` | The medication prescribed |
| `nb_beneficiaries` | The number of beneficiaries the <br/> medicine was prescribed to  |
| `spending` | The total cost of the medicine prescribed <br/>for the CMS |

This file has a header row containing the column labels, so we will leave the `header` parameter of `pd.read_csv()` at its default setting when we read the file. Let us also use the `unique_id` column as the index rather than a range of integers, so we will pass the name of the column, `unique_id`, to the `index_col` parameter of the `pd.read_csv()` function.

In [2]:
spending_df = pd.read_csv('Data/spending_10k.csv', index_col='unique_id')

## Exercise 3.0: Importing the Honolulu Flights Data Set

For some of the exercises in this chapter we will be working with a data set containing information about all the arriving and departing flights in and out of the Honolulu airport, HNL, on the island of Oahu in December 2015.

This data set contains the following columns:

| Column |Description|
|:----------|-----------|
| `YEAR` | The year of the flight  |
| `MONTH` |  The month of the flight |
| `DAY` |  The day of the flight |
| `DAY_OF_WEEK` |  The day of the week of the flight |
| `FLIGHT_NUMBER` |  The flight number of the flight |
| `ORIGIN_AIRPORT` |  The origin airport of the flight  |
| `DESTINATION_AIRPORT` |  The destination airport of the flight |
| `DEPARTURE_DELAY` |  The departure delay of the flight  |
| `DISTANCE` |  The distance of the flight in miles |
| `AIR_TIME` |  The flight time without taxiing in minutes |
| `ARRIVAL_DELAY` |  The arrival delay of the flight  |

Please run the following code cell, which will parse the `'honolulu_flights.csv'` file and build the `HNL_flights_df` `DataFrame`, before trying the exercises in this chapter related to this data set.

In [0]:
HNL_flights_df = pd.read_csv('Data/honolulu_flights.csv')

# Summarizing Your Data

---

It is often useful to quickly explore some of the descriptive attributes and statistics of the dataset that you are working with. For instance, you may want to know the shape and data types of the `DataFrame`, as well as the range, mean, standard deviation, etc. of the rows or columns. You may find interesting patterns or possibly catch errors in your dataset this way. As we will see, accessing these attributes and computing the descriptive statistics is easy with `pandas`.

## Attributes
As was mentioned in Chapter 1: *Introduction to pandas Data Structures*, `DataFrames` have a number of attributes associated with them. With respect to exploring your dataset, the three attributes that are perhaps most useful are summarized in the table below:

| Attribute |Description|
|:----------|-----------|
| `shape`| Return a tuple representing the dimensionality of the DataFrame. |
| `size` | Return an int representing the number of elements in this object.  |
| `dtypes` | Return the dtypes in the DataFrame. |

In `Python`, you can access an object’s attribute using the syntax `ObjectName.attributeName`. For instance, if our `DataFrame` is named `df` and our attribute is `shape`, `df.shape` will return the shape attribute of our `DataFrame`.

A list of all the `DataFrame` attributes can be found at the [`pandas` Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).
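As a minimal sketch of attribute access, the block below builds a small, made-up `DataFrame` and reads the three attributes from the table. The name `toy_df` and its values are assumptions chosen for illustration; they are not part of the chapter's data. Note that attributes are accessed without parentheses.

```python
import pandas as pd

# A small, hypothetical DataFrame used only for illustration
toy_df = pd.DataFrame({'name': ['a', 'b', 'c'],
                       'value': [1.5, 2.5, 3.5]})

n_rows, n_cols = toy_df.shape   # shape is a tuple: (rows, columns)
n_entries = toy_df.size         # size is rows * columns
column_types = toy_df.dtypes    # dtypes is a Series indexed by column label

print(toy_df.shape)   # (3, 2)
print(toy_df.size)    # 6
print(column_types)
```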

### Inspecting Data Types

We have seen and have been using both `shape` and `size` and their relevance is clear, but it is not so obvious why we should be concerned with the data types of the `DataFrame`. One reason is that some methods can only work on specific data types.  For example, it would be unreasonable to compute the mean of the column `specialty` since the data type of `specialty` does not lend itself to being averaged.

`DataFrames` maintain an additional `Series`, named `dtypes`, which holds the data types in the `DataFrame` by column. The `dtypes` `Series` of a `DataFrame` allows `pandas` to instantly evaluate which methods can be applied to which columns. 

Let us look at the `dtypes` attribute of the `spending_df` `DataFrame`.

```python
>>> spending_df.dtypes
doctor_id             int64
specialty            object
medication           object
nb_beneficiaries      int64
spending            float64
dtype: object
```

We see that the `dtypes` attribute of the `spending_df` `DataFrame` is a `Series` with an `index` equivalent to the `columns` attribute of the `spending_df` `DataFrame`, and values which specify the data types of the entries in the corresponding column of the `DataFrame`.

![](images/dtypes.png) 

In [3]:
spending_df.dtypes

doctor_id             int64
specialty            object
medication           object
nb_beneficiaries      int64
spending            float64
dtype: object

## Exercise 3.1: Attributes

Which of the following lines of code will give the number of rows and columns, i.e. the shape attribute, of the `HNL_flights_df` `DataFrame`? Please note that the output should be: (7975, 11).

A:
```python
HNL_flights_df.size
```

B: 
```python
HNL_flights_df.shape
```

C: 
```python
HNL_flights_df.shape()
```

D: 
```python
pd.shape(HNL_flights_df)
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

In [0]:
# Exercise 3.1 Scratch Code Cell

## Methods

Please recall that in `Python` a method is a function that is accessible via an object instance, such as a `DataFrame`. The syntax for using a method is similar to accessing an attribute: `ObjectName.methodName()`. `DataFrames` have many built-in methods to summarize our data. The methods we will be most interested in to explore our data set are:

| Method|Description|
|:----------|-----------|
| `head()`| Return the first n rows. |
| `tail()` | Return the last n rows. |
| `min()`, `max()` | Computes the numeric (for numeric values) or alphanumeric (for object values) minimum and maximum of a `Series`, or of each column of a `DataFrame` by default. |
| `sum()`, `mean()`, `std()`, `var()`   | Computes the sum, mean, standard deviation and variance of a `Series`, or of each column of a `DataFrame` by default. |
|`nlargest()`|	Return the first n rows of the `Series` or `DataFrame`, ordered by the specified columns in descending order. |
| `count()` |  Returns the number of non-NaN values in a `Series` or `DataFrame`. |
| `value_counts()` |  Returns the frequency for each value in the `Series`. |
| `describe()` | Computes summary statistics for each column. |

We will be covering when and how to use each of the methods described in the table above.

If interested, a list and description of all the `DataFrame` methods may be found at the [pandas Documentation](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html).

### Viewing Pieces of the Data

When working with large `DataFrames` we may want to obtain a small sample of the `DataFrame` in order to get a quick overview of its organization (header, index, entries, etc.) without having to display it entirely. We had similar motivations in Chapter 2: *Loading and Storing data* when we introduced the `nrows` parameter of the `read_table()` and `read_csv()` `pandas` functions.

Using the `head()` method we can make a new `DataFrame` with the same `columns` but with only the ***first*** $n$ rows rather than the entire `DataFrame`, where $n$ is an optional parameter that is by default 5. For example, if we want to make a new `DataFrame` made of only the first 2 rows of `spending_df`, then we would type:

```python
>>> spending_df.head(n=2)
              doctor_id   specialty       medication  nb_beneficiaries  spending
unique_id                                                                     
NX531425   1255626040  FAMILY PRACTICE  METFORMIN HCL                30   135.24 
QG879256   1699761833  FAMILY PRACTICE    ALLOPURINOL                30   715.76
```

We see that the `DataFrame` method `head()` with the optional parameter `n` set to 2 returns a new `DataFrame` with only 2 rows but the same number of columns as the `spending_df` `DataFrame`.

Similarly, with the `tail()` method, we can make a new `DataFrame` with the same `columns` but with only the ***last*** $n$ rows rather than the entire `DataFrame`, where $n$, again, is an optional parameter that is by default 5.
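Since `tail()` is not demonstrated on the spending data, here is a minimal sketch on a made-up five-row `DataFrame` (the name `toy_df` and its values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical data: five rows labeled by the default integer index
toy_df = pd.DataFrame({'x': [10, 20, 30, 40, 50]})

last_two = toy_df.tail(n=2)    # the last 2 rows
first_two = toy_df.head(n=2)   # the first 2 rows, for comparison

print(last_two)
```

Note that the returned rows keep their original index labels (3 and 4 here), so it is easy to see where they came from.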

In [0]:
spending_df.head(n=2)

### DataFrame Axes

We will begin to notice a concept of `axis` that is recurrent throughout the `pandas` package. Many of the methods and functions we will see have an optional `axis` parameter that may be set when the method or function is called. For instance, the methods `sum()`, `min()`, `max()`, etc. can all be applied row- or column-wise. It helps to think about the operation as being carried across the axis.

![](images/axis_example.png)

The example seen in the image above is a visualization of how the `sum()` `DataFrame` method is carried out. The `sum()` method will calculate the sum of all the entries across either the rows or columns of the `DataFrame`, depending on how the optional `axis` parameter is set. Since the `axis` parameter of the `sum()` method is *optional*, it has a default setting, and in this case the default will be `axis=0` or equivalently `axis='rows'`. However, this default may vary from method to method, so it is important to verify before use.

When the `DataFrame` `sum()` method is called and the optional `axis` parameter is set to 0, or 'rows',  then the method will add up all the entries in the same column across all the rows, as is shown in the example above. 

We will see more examples of the `axis` parameter of `DataFrame` methods, and it will become more comfortable with practice. 
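As a first such example, the sketch below applies `sum()` along each axis of a small, made-up `DataFrame` (the name `axis_df` is an assumption used only here):

```python
import pandas as pd

axis_df = pd.DataFrame({'A': [0, 1, 2], 'B': [5, 6, 7]})

# axis=0: add down the rows, giving one total per column
column_totals = axis_df.sum(axis=0)

# axis=1: add across the columns, giving one total per row
row_totals = axis_df.sum(axis=1)

print(column_totals)
print(row_totals)
```

The first result is indexed by the column labels `A` and `B`, while the second is indexed by the row labels 0, 1 and 2.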

### Descriptive Statistics

`pandas` has many methods to compute the most common descriptive statistics such as the minimum, maximum, mean, variance, etc. The collection of `DataFrame` methods which will calculate the descriptive statistics can operate on either axis (column or row). 

Let us see some examples of computing descriptive statistics. We will start by creating a very simple `DataFrame`, call it `df`, which will help us illustrate what is happening.

```python
>>> df = pd.DataFrame({'A':[0,1,2], 'B':[5,6,7]})
>>> df
   A  B
0  0  5
1  1  6
2  2  7
```

`df` is simply a $3 \times 2$ `DataFrame` with numeric entries. If we wanted to calculate the mean of all the row entries in each of the columns labeled `A` and `B`, then we could use the `mean()` `DataFrame` method with its default parameters, i.e. `axis='rows'`.

``` python
>>> df.mean()
A    1.0
B    6.0
dtype: float64
```

The resulting `Series` contains the mean of the columns `A` and `B`. Please note that the operation was carried across the "rows".

Alternatively, if we wanted the mean of the column entries for each row, then we would set the optional `axis` parameter of the `mean()` method to 1 or 'columns':

```python
>>> df.mean(axis=1)
0    2.5
1    3.5
2    4.5
dtype: float64
```

This time, we can see the mean of the rows. Also note that the operation is carried across the “columns”.

Descriptive statistic methods can also be called by `Series` objects, i.e. individual rows and columns of the `DataFrame`. For example, if we only wanted to find the mean of all the entries in the column labeled by `A`, then we could first access the column `A` in `df` using the syntax: `df['A']`, and then call the `Series` `mean()` method on the returned `Series`. 

```python
>>> df['A'].mean()
1.0
```

We see that the mean of all the entries in the column `A` is $1.0$. Note that there is no need to set the `axis` parameter when calling the `Series` `mean()` method: a `Series` has only one axis, so it would not make sense to calculate the mean across the 'columns' of a `Series`.
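The other methods from the table follow the same pattern. The sketch below uses a made-up `DataFrame` named `stats_df` (an assumption for illustration) to show `min()`, `max()`, `std()` and `nlargest()`:

```python
import pandas as pd

stats_df = pd.DataFrame({'A': [0, 1, 2], 'B': [7, 5, 6]})

col_min = stats_df.min()   # smallest entry in each column
col_max = stats_df.max()   # largest entry in each column
col_std = stats_df.std()   # sample standard deviation of each column

# The 2 rows with the largest values in column 'B', in descending order
top_two = stats_df.nlargest(2, 'B')

print(top_two)
```

Here `nlargest()` returns whole rows, so the result is a `DataFrame` containing the rows labeled 0 and 2, in that order.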

In [0]:
df = pd.DataFrame({'A':[0,1,2], 'B':[5,6,7]})
df

   A  B
0  0  5
1  1  6
2  2  7


In [0]:
print('Column wise mean:')
print(df.mean())
print('---------------')
print('Row wise mean:')
print(df.mean(axis=1))
print('---------------')
print('Series mean:')
print(df['A'].mean())

Column wise mean:
A    1.0
B    6.0
dtype: float64
---------------
Row wise mean:
0    2.5
1    3.5
2    4.5
---------------
Series mean:
1.0


#### Descriptive Statistics on the Medical Spending Data Set

Let us try to calculate some statistics for our Medical Spending Dataset. We will start with the mean:

```python
>>> spending_df.mean()
doctor_id           1.503766e+09
nb_beneficiaries    5.091830e+01
spending            4.333839e+03
dtype: float64
```

Notice that the mean was calculated across the rows for all of the columns whose data type was numeric (int or float), and *not* for those whose data type was `object`. Note that this default behavior depends on the `pandas` version: in `pandas` 2.0 and later, the non-numeric columns are no longer skipped automatically, and the same result requires `spending_df.mean(numeric_only=True)`.

These results seem reasonable at first glance, but, if we look closely, it doesn't quite make sense to calculate the mean of the `doctor_id` column. We as humans can recognize that the `doctor_id` column is essentially just a label for each doctor that may have been assigned arbitrarily, so it would make more sense if the doctor IDs were instead stored as the `pandas` `object` type, which is the type `pandas` uses for columns of native `Python` strings (`str`).

This is the first flaw that we should take note of in this dataset; we will learn how to clean this up in the following chapter: *Data Cleaning*.

In [0]:
spending_df.mean()

### Count

Earlier we introduced and used the `shape` and `size` attributes of `DataFrames` and `Series`. Recall that the `shape` and `size` attributes tell us how many rows and columns there are and how many entries there are in a `DataFrame` or `Series` respectively, but these attributes include missing values in their counts. To illustrate the point, let us build a `DataFrame` with all missing values and check its shape and size:

```python
>>> na_df = pd.DataFrame({1: [None, None], 2: [None, None]})
>>> na_df.shape
(2, 2)
>>> na_df.size
4
```

If we were none the wiser then we would say that there are 2 rows and 2 columns with a total of 4 values in the `DataFrame`, but there is actually no real data in this `DataFrame`; all the entries are missing.

Instead of using `shape` and `size` to understand how many data points we have, we can use the built-in `count()` method for `DataFrames` and `Series`, which will exclude missing values. In other words, `count()` tells us the number of non-missing values in a `DataFrame` or `Series`.
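As a quick check of this, the sketch below calls `count()` on the all-missing `DataFrame` built above and on a second, made-up `DataFrame` with a single missing entry (the name `partial_df` and its values are assumptions for illustration):

```python
import pandas as pd

# Same construction as the all-missing DataFrame above
na_df = pd.DataFrame({1: [None, None], 2: [None, None]})

# A hypothetical DataFrame with one missing entry in column 'b'
partial_df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [4.0, None, 6.0]})

print(na_df.count())        # 0 non-missing values in each column
print(partial_df.count())   # a: 3, b: 2
print(partial_df.size)      # 6, because size includes the missing entry
```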

Let us take a look at a practical example. First let us read the data in the file 'spending_missing_values.csv'. This file is specifically constructed to have some missing values to demonstrate this concept on an interesting, lifelike data set.

```python
>>> spending_missing_values_df = pd.read_csv('Data/spending_missing_values.csv', 
                                               index_col='unique_id', 
                                               na_values=['Null'])
```

Now let us call the `count()` method of the `spending_missing_values_df` `DataFrame` to see how many non-missing values there are in each column. The method has the optional `axis` parameter with a default of `axis=0`, or equivalently `axis='rows'`.

```python
>>> spending_missing_values_df.count()
doctor_id           54
specialty           48
medication          53
nb_beneficiaries    50
spending            50
dtype: int64
```

The returned `Series` tells us the number of actual values in each `column` of the calling `DataFrame`. This `Series` shows us that there are many missing values in this `DataFrame`, and some columns have more missing values than others. 

If we wanted to know the total number of non-missing values in the `DataFrame` `spending_missing_values_df`, then we could call the `Series` `sum()` method on the `Series` returned by the `count()` call:

```python
>>> spending_missing_values_df.count().sum()
255
```

The `sum()` `Series` method added up all the values in the `Series` returned by the `count()` call, i.e. $54 + 48 + 53 + 50 + 50 = 255$.

In [0]:
spending_missing_values_df = pd.read_csv('Data/spending_missing_values.csv', 
                                         index_col='unique_id', 
                                         na_values=['Null'])
print("-------------Counts of Non-Missing Values per Column-------------------")
print(spending_missing_values_df.count())
print("---------------Total Count of Non-Missing Values-----------------------")
print(spending_missing_values_df.count().sum())

### Value Counts 

Not only will we be interested in counting how many non-missing values there are in a dataset, but we may also be interested in counting the number of occurrences of each unique value. The `Series` method `value_counts()` does exactly that; it tells us the frequency of each unique value in the `Series`.

For example, suppose we wanted to count the frequency of occurrence of each value in the `specialty` column of the `spending_df` `DataFrame`. To do this we would first access the `specialty` column of the `spending_df` `DataFrame`, then we would call the `value_counts()` method on the retrieved `Series`:

```python
>>> spending_df['specialty'].value_counts()
INTERNAL MEDICINE                                                 3060
FAMILY PRACTICE                                                   2606
NURSE PRACTITIONER                                                 822
.
.
.
Name: specialty, Length: 75, dtype: int64
```

The values in the resulting `Series` correspond to the frequency of occurrence of each `specialty`. For instance, we see that the value 'INTERNAL MEDICINE' shows up in the `spending_df` `DataFrame` `specialty` column $3060$ times. Also notice that the length of the `Series` can be interpreted as the number of unique values in the `spending_df` `DataFrame` `specialty` column.
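The same behavior can be seen on a tiny `Series` where the counts are easy to verify by eye. The labels below are made up for illustration and do not come from the spending data:

```python
import pandas as pd

# Hypothetical specialty labels, for illustration only
labels = pd.Series(['CARDIOLOGY', 'FAMILY PRACTICE', 'CARDIOLOGY',
                    'NEUROLOGY', 'CARDIOLOGY', 'FAMILY PRACTICE'])

frequencies = labels.value_counts()   # sorted from most to least frequent

print(frequencies)
print(len(frequencies))   # number of distinct values: 3
```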

In [0]:
spending_df['specialty'].value_counts().head()

### The Describe Method

If you just want a brief summary of your dataset, including the main descriptive statistics and the counts, then you can use the `describe()` method. By default, the `describe()` method will only describe the numeric columns, but we can set the `include` parameter to 'all' to see all of the columns described.

First, let us look at the default description we are given when we call the `describe()` method for the `spending_df` `DataFrame`.

```python
>>> spending_df.describe()
          doctor_id  nb_beneficiaries       spending
count  1.000000e+04      10000.000000   10000.000000
mean   1.503766e+09         50.918300    4333.838595
std    2.874269e+08         86.493443   21915.925814
min    1.003010e+09         11.000000      15.020000
25%    1.255580e+09         15.000000     253.237500
50%    1.508818e+09         24.000000     677.970000
75%    1.750467e+09         50.000000    2442.955000
max    1.992999e+09       1987.000000  892027.000000
```

We see that the returned object is a `DataFrame` with column labels equivalent to the numeric columns of the calling `DataFrame` (`spending_df` in this case) and index labels of different descriptive statistics. For instance, the entry in the row labeled 'count' at the column labeled 'spending' tells us the number of non-missing values in the column `spending` of the `spending_df` `DataFrame`, and the 25%, 50% and 75% rows are the percentile values. This information is all very useful when you are trying to understand your data set.

Now let us see what happens when we set the optional `include` parameter of the `describe()` `DataFrame` method to 'all', i.e. `include='all'`.

```python
>>> spending_df.describe(include='all')
           doctor_id          specialty            medication  nb_beneficiaries       spending
count   1.000000e+04              10000                 10000      10000.000000   10000.000000
unique           NaN                 75                   617               NaN            NaN
top              NaN  INTERNAL MEDICINE  LEVOTHYROXINE SODIUM               NaN            NaN
freq             NaN               3060                   150               NaN            NaN
mean    1.503766e+09                NaN                   NaN         50.918300    4333.838595
std     2.874269e+08                NaN                   NaN         86.493443   21915.925814
min     1.003010e+09                NaN                   NaN         11.000000      15.020000
25%     1.255580e+09                NaN                   NaN         15.000000     253.237500
50%     1.508818e+09                NaN                   NaN         24.000000     677.970000
75%     1.750467e+09                NaN                   NaN         50.000000    2442.955000
max     1.992999e+09                NaN                   NaN       1987.000000  892027.000000
```

We now see that the returned `DataFrame` from the `describe()` call includes all of the columns of the `spending_df` `DataFrame`. Also notice that there are additional rows labeled 'unique', 'top' and 'freq'. These tell us the number of distinct entries in the column, the most common entry in the column, and the frequency of that most common entry, respectively.

The 'unique', 'top' and 'freq' row entries are NaN, i.e. missing, for the numeric columns. This is because it doesn't quite make sense to calculate those statistics for the numeric columns. Similarly the mean, std, min, etc. row entries are NaN in the object type columns, since we cannot calculate these statistics for the object columns. The 'count' row is the exception: it is filled in for every column.
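Because the object returned by `describe()` is itself a `DataFrame`, individual statistics can be looked up by row and column label. The following sketch assumes a small made-up `DataFrame` named `mixed_df` with one text column and one numeric column:

```python
import pandas as pd

mixed_df = pd.DataFrame({'group': ['x', 'y', 'x', 'x'],
                         'amount': [1.0, 2.0, 3.0, 4.0]})

numeric_summary = mixed_df.describe()             # numeric columns only
full_summary = mixed_df.describe(include='all')   # every column

# Individual statistics are looked up by row label and column label
print(numeric_summary.loc['mean', 'amount'])   # 2.5
print(full_summary.loc['top', 'group'])        # x
print(full_summary.loc['freq', 'group'])       # 3
```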

In [0]:
spending_df.describe()

In [0]:
spending_df.describe(include='all')

## Exercise 3.2: `DataFrame` Methods

Which of the following lines of code will output the average arrival delay time for the flights described in `HNL_flights_df`? Note that the output should be: -2.2572254335260116

A:
```python
HNL_flights_df['ARRIVAL_DELAY'].mean()
```

B:
```python
HNL_flights_df.mean()
```

C:
```python
pd.mean(HNL_flights_df.ARRIVAL_DELAY)
```

D:
```python
HNL_flights_df.describe().ARRIVAL_DELAY
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

In [0]:
# Exercise 3.2 Scratch Code Cell

# Arithmetic and Data Alignment

---

## Vectorization

***Vectorization*** means applying an operation to every element of an array-like data structure at once, index by index, rather than one element at a time. Vectorization is important since it is a much more efficient alternative to `for` loops in `Python`.
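To make the comparison with loops concrete, the sketch below computes the same element-by-element product twice, once with an explicit loop and once with a single vectorized expression. The `Series` names and values are assumptions chosen for illustration:

```python
import pandas as pd

prices = pd.Series([2.0, 4.0, 6.0])
quantities = pd.Series([1, 2, 3])

# Loop version: multiply one pair of entries at a time
loop_totals = pd.Series([p * q for p, q in zip(prices, quantities)])

# Vectorized version: one expression applied to every matching index label
vectorized_totals = prices * quantities

print(vectorized_totals.tolist())   # [2.0, 8.0, 18.0]
```

Both versions give the same values; the vectorized form is shorter and lets `pandas` do the work in optimized code rather than in a `Python` loop.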

`pandas` seamlessly supports vectorization. A key feature of `pandas` `Series` and `DataFrames` is that when executing an arithmetic operation, the `Series` or `DataFrames` will first be aligned by their matching indices, and the operation is then applied in a pairwise fashion. A new index is created from the union of the indices of both `Series` or `DataFrames`, and values for indices present in only one of the `Series` or `DataFrames` are filled with missing values (NaN).

In the following cells we will be seeing examples of how to perform vectorized operations on `Series` and `DataFrames`. To demonstrate the process without getting lost in the complexities of a lifelike dataset we will be working with two very basic `DataFrames`, call them `df_1` and `df_2`, that we will construct like so:

```python
df_1 = pd.DataFrame({'AA':{'A':79, 'C':2, 'T':12, 'X':21},
                     'BB':{'A':11, 'C':2, 'T':2, 'X':9}})
df_2 = pd.DataFrame({'AA':{'A':21,'D':14,'T':5},
                     'CC':{'A':12,'D':28,'T':121}})
```


In [0]:
df_1 = pd.DataFrame({'AA':{'A':79, 'C':2, 'T':12, 'X':21},
                     'BB':{'A':11, 'C':2, 'T':2, 'X':9}})
df_2 = pd.DataFrame({'AA':{'A':21,'D':14,'T':5},
                     'CC':{'A':12,'D':28,'T':121}})

### Vectorized Arithmetic Between `Series` (`DataFrame` columns)

Let us first look at an example of performing arithmetic between `pandas` `Series`. If we wanted to add the column labeled 'AA' in `df_1` to the column labeled "AA" in `df_2` we would type:

```python
df_1["AA"] + df_2["AA"]
```

![](images/alignment_arithmetic_col.png)

The image above breaks down the process of adding the `Series`. First we see that the two columns, both labeled 'AA', are accessed from the `DataFrames` `df_1` and `df_2` by typing `df_1['AA']` and `df_2['AA']` respectively. Second, the two column `Series` are aligned by their individual indices. Notice in the image that the `Series` `df_1['AA']` was extended to include the new entry labeled 'D' and the new row entry was populated by `NaN`. Similarly the `Series` `df_2['AA']` was extended to include two new entries with labels 'C' and 'X' and they too were populated with `NaNs`. Lastly, the two extended `Series` are added together to make a new `Series` with an index equivalent to the union of the indices of both of the two `Series` `df_1['AA']` and `df_2['AA']`. Each entry of the new `Series` is the sum of the entries with the same labels in `df_1['AA']` and `df_2['AA']`, but notice that if either one of the entries in the `Series` was `NaN`, then the result was also `NaN`.
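The steps described above can be checked in code. The sketch below rebuilds `df_1` and `df_2` so that it runs on its own; the variable name `aa_sum` is an assumption introduced here:

```python
import pandas as pd

df_1 = pd.DataFrame({'AA': {'A': 79, 'C': 2, 'T': 12, 'X': 21},
                     'BB': {'A': 11, 'C': 2, 'T': 2, 'X': 9}})
df_2 = pd.DataFrame({'AA': {'A': 21, 'D': 14, 'T': 5},
                     'CC': {'A': 12, 'D': 28, 'T': 121}})

aa_sum = df_1['AA'] + df_2['AA']

print(aa_sum)
# Labels 'A' and 'T' appear in both Series, so only they have numeric sums
print(aa_sum.isna())
```

The entries for 'A' and 'T' are $79 + 21 = 100$ and $12 + 5 = 17$, while 'C', 'D' and 'X' are `NaN` because each is missing from one of the two `Series`.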

In [0]:
df_1["AA"] + df_2["AA"]

### Vectorized Arithmetic Between `Series` (`DataFrame` Row)

The process behind adding row `Series` is very similar to the process of adding column `Series`; the only difference is that we initially access the rows of the two `DataFrames`. For instance, to add the row `Series` `df_1.loc["A", :]` to the row `Series` `df_2.loc["D", :]` we would type:

```python
df_1.loc["A"] + df_2.loc["D"]
```

![](images/alignment_arithmetic_row.png)


The image above again breaks down the process of what is happening when we run this small command.
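As a sketch of what to expect, the block below rebuilds the two `DataFrames` and adds the rows. Row `Series` are indexed by the column labels, so here the alignment happens on 'AA', 'BB' and 'CC'. The name `row_sum` is an assumption introduced for illustration:

```python
import pandas as pd

df_1 = pd.DataFrame({'AA': {'A': 79, 'C': 2, 'T': 12, 'X': 21},
                     'BB': {'A': 11, 'C': 2, 'T': 2, 'X': 9}})
df_2 = pd.DataFrame({'AA': {'A': 21, 'D': 14, 'T': 5},
                     'CC': {'A': 12, 'D': 28, 'T': 121}})

# Row 'A' of df_1 has labels AA, BB; row 'D' of df_2 has labels AA, CC
row_sum = df_1.loc['A'] + df_2.loc['D']

print(row_sum)
```

Only 'AA' is present in both rows, giving $79 + 14 = 93$; 'BB' and 'CC' are `NaN`.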

In [0]:
df_1.loc["A"] + df_2.loc["D"]

### Vectorized Arithmetic Between DataFrames

Our final example is arithmetic between two entire `DataFrames`. The concept behind what is happening here is just an extension of both adding row and column `Series`; when we add two `DataFrames` in a vectorized way, the alignment is with both the row and column index. For example, let us add the two `DataFrames` `df_1` and `df_2`.

```python
df_1 + df_2
```

![](images/alignment_arithmetic_df.png)

The image above breaks down the two-step process of adding the `DataFrames`. First the `DataFrames` are extended so that they have matching column labels and indices. Then the `DataFrames` are added entry by entry, meaning the entries with the same index and column label are added, to create a new `DataFrame`. Remember that if an entry that is missing is added to another entry, then the missing value, `NaN`, will carry through to the new `DataFrame`.

In [0]:
df_1 + df_2

      AA  BB  CC
A  100.0 NaN NaN
C    NaN NaN NaN
D    NaN NaN NaN
T   17.0 NaN NaN
X    NaN NaN NaN


## Vectorization Example with the Medical Spending Data Set

We have seen vectorization with basic `DataFrames`; now let us look at a more concrete example. What if we wanted to calculate the average spending per beneficiary of the `spending_df` `DataFrame`?

To do this we could first make a new `Series` that is the spending per beneficiary by dividing the `spending` column by the `nb_beneficiaries` column. Then the average of this new `Series` can be calculated by calling the `mean()` method. This can all be done in one line by chaining the operations together.

```python
>>> (spending_df['spending'] / spending_df['nb_beneficiaries']).mean()
131.92616419345254
```
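It is worth noting what this chained expression measures: it is the mean of the per-row ratios, which is generally not the same as total spending divided by total beneficiaries. The sketch below illustrates the difference on made-up numbers (the name `toy_spending` and its values are assumptions, not taken from the real data):

```python
import pandas as pd

# Hypothetical values standing in for the real spending data
toy_spending = pd.DataFrame({'nb_beneficiaries': [10, 20, 40],
                             'spending': [100.0, 100.0, 100.0]})

# Vectorized division: one ratio per row
per_beneficiary = toy_spending['spending'] / toy_spending['nb_beneficiaries']

mean_of_ratios = per_beneficiary.mean()
ratio_of_totals = (toy_spending['spending'].sum()
                   / toy_spending['nb_beneficiaries'].sum())

print(per_beneficiary.tolist())   # [10.0, 5.0, 2.5]
print(mean_of_ratios, ratio_of_totals)
```

Which of the two quantities is appropriate depends on the question being asked; the chapter's one-liner computes the first.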

In [0]:
(spending_df['spending'] / spending_df['nb_beneficiaries']).mean()

## Exercise 3.3: Vectorization

Which of the following lines of code will result in a `Series` that contains the average speed of the plane in miles per minute for each flight described in `HNL_flights_df`? Please note that this can be calculated by dividing the distance by the air time of the flight.

A:
```python
HNL_flights_df.loc[:, 'DISTANCE' / 'AIR_TIME']
```

B:
```python
HNL_flights_df.loc['DISTANCE', :] / HNL_flights_df.loc['AIR_TIME', :]
```

C:
```python
HNL_flights_df.loc[:, 'DISTANCE'] / HNL_flights_df.loc[:, 'AIR_TIME']
```

D:
```python
HNL_flights_df.loc[:, ['DISTANCE', 'AIR_TIME']].divide(HNL_flights_df.loc[:, 'AIR_TIME'], axis='rows')
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

In [0]:
# Exercise 3.3 Scratch Code Cell

## Broadcasting

Arithmetic operations between `Series` or `DataFrames` and a scalar require expanding the scalar into a `Series` or `DataFrame` of the same dimension. This process of expanding a single value into a `Series` or `DataFrame` is called **broadcasting**. 

For example, suppose we wanted to add the value 1.2 to `Series` `df_1['AA']`, where `df_1` is the same `DataFrame` we were using in the section on *Vectorization*.  To do this, all we would need to type is the following:

```python
df_1['AA'] + 1.2
```

![](images/alignment.png)


The image above shows what is happening when we type this command. We see that a new `Series` is created with an index matching the index of `df_1['AA']` and with entries of all the same value, 1.2. This new `Series` is aligned with `df_1['AA']` and added entry by entry, just like in the *Vectorization* examples. 
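The description above can be checked by building the expanded `Series` by hand and comparing the two results. This is only a sketch of the idea: the names `aa`, `expanded`, `broadcast_sum` and `explicit_sum` are assumptions, and `pandas` does not necessarily materialize such a `Series` internally.

```python
import pandas as pd

# Same labels and values as df_1['AA']
aa = pd.Series({'A': 79, 'C': 2, 'T': 12, 'X': 21})

# Broadcasting: the scalar is applied to every entry
broadcast_sum = aa + 1.2

# The equivalent explicit construction: a Series of 1.2 on the same index
expanded = pd.Series(1.2, index=aa.index)
explicit_sum = aa + expanded

print(broadcast_sum)
print(broadcast_sum.equals(explicit_sum))   # True
```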

In [0]:
df_1['AA'] + 1.2

## Exercise 3.4: Broadcasting

Currently the `AIR_TIME` column, which gives the total flight time without taxiing, is in units of minutes. Which of the following lines of code will result in a `Series` that contains the total flight time without taxiing in units of hours? Please note that this can be found by dividing each of the entries in the `AIR_TIME` column by sixty.

A:
```python
HNL_flights_df.loc[:, 'AIR_TIME'] / pd.DataFrame([60])
```

B:
```python
HNL_flights_df.loc[:, 'AIR_TIME'] / pd.Series([60])
```

C:
```python
HNL_flights_df.loc['AIR_TIME', :] / 60
```

D:
```python
HNL_flights_df.loc[:, 'AIR_TIME'] / 60
```

Hint: Feel free to use the code cell below to try these commands out. For the incorrect options, make note of what is going wrong and/or what errors are being thrown.

In [0]:
# Exercise 3.4 Scratch Code Cell

# Summary

---

**Summarizing Your Data**

* Important Attributes:

| Attribute |Description|
|:----------|-----------|
| `shape`| Return a tuple representing the dimensionality of the DataFrame. |
| `size` | Return an int representing the number of elements in this object.  |
| `dtypes` | Return the dtypes in the DataFrame. |

* Important Methods:

| Method|Description|
|:----------|-----------|
| `head()`| Return the first n rows. |
| `tail()` | Return the last n rows. |
| `min()`, `max()` | Computes the numeric (for numeric values) or alphanumeric (for object values) minimum and maximum of a `Series`, or of each column of a `DataFrame` by default. |
| `sum()`, `mean()`, `std()`, `var()`   |  Computes the sum, mean, standard deviation and variance of a `Series`, or of each column of a `DataFrame` by default. |
|`nlargest()`|	Return the first n rows ordered by the specified columns in descending order. |
| `count()` |  Returns the number of non-NaN values in a `Series` or `DataFrame`. |
| `value_counts()` |  Returns the frequency for each value in the `Series`. |
| `describe()` | Computes summary statistics for each column. |


**Arithmetic and Data Alignment**

* When executing an arithmetic operation between `Series` or `DataFrames`, the objects will first be extended and aligned by their indices, and then the arithmetic will be applied in a pairwise fashion. This is called *vectorized arithmetic between `Series` or `DataFrames`*.

![](images/alignment_arithmetic_col.png)

* When executing an arithmetic operation between a constant and a `Series` or `DataFrame`, the constant will be extended to a new `Series` or `DataFrame` to align with the first `Series` or `DataFrame`, and then the arithmetic will be applied in a pairwise fashion. This is referred to as *broadcasting*.

![](images/alignment.png)

