# More DataFrame Methods

In this chapter, we cover several less common but still useful DataFrame methods that you need to know in order to be fully capable of analyzing data with pandas.

* `agg` - Compute multiple aggregations at once
* `idxmax` and `idxmin` - Return the index of the max/min
* `diff` and `pct_change` - Find the difference/percent change from one value to the next
* `sample` - Randomly sample rows/columns
* `nsmallest`/`nlargest` - Return the bottom/top `n` values
* `corr` - Compute the correlation between each pair of numeric columns
* `replace` - Replace one or more values in a variety of ways

Let's read in the movie dataset with the title in the index and select just the numeric columns.

In [1]:
import pandas as pd
movie = pd.read_csv('../data/movie.csv', index_col='title').select_dtypes('number')
movie.head()

title,year,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,gross,num_reviews,num_voted_users,budget,imdb_score
Avatar,2009.0,178.0,0.0,1000.0,936.0,855.0,760505847.0,723.0,886204,237000000.0,7.9
Pirates of the Caribbean: At World's End,2007.0,169.0,563.0,40000.0,5000.0,1000.0,309404152.0,302.0,471220,300000000.0,7.1
Spectre,2015.0,148.0,0.0,11000.0,393.0,161.0,200074175.0,602.0,275868,245000000.0,6.8
The Dark Knight Rises,2012.0,164.0,22000.0,27000.0,23000.0,23000.0,448130642.0,813.0,1144337,250000000.0,8.5
Star Wars: Episode VII - The Force Awakens,,,131.0,131.0,12.0,,,,8,,7.1


## The `agg` method

The `agg` method allows us to calculate several aggregations at once by providing it a list of the aggregation methods as strings. Here, we find the min, max, and number of unique values for each column.

In [2]:
aggs = movie.agg(['min', 'max', 'nunique'])
aggs

Unnamed: 0,year,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,gross,num_reviews,num_voted_users,budget,imdb_score
min,1916.0,7.0,0.0,0.0,0.0,0.0,162.0,1.0,5,218.0,1.6
max,2016.0,511.0,23000.0,640000.0,137000.0,23000.0,760505847.0,813.0,1689764,4200000000.0,9.5
nunique,91.0,191.0,435.0,877.0,917.0,906.0,4033.0,528.0,4750,438.0,78.0


This returned data might be easier to read when transposed. Let's transpose the results with the `T` attribute.

In [3]:
aggs.T

Unnamed: 0,min,max,nunique
year,1916.0,2016.0,91.0
duration,7.0,511.0,191.0
director_fb,0.0,23000.0,435.0
actor1_fb,0.0,640000.0,877.0
actor2_fb,0.0,137000.0,917.0
actor3_fb,0.0,23000.0,906.0
gross,162.0,760505800.0,4033.0
num_reviews,1.0,813.0,528.0
num_voted_users,5.0,1689764.0,4750.0
budget,218.0,4200000000.0,438.0
imdb_score,1.6,9.5,78.0
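As a further sketch of `agg`, the example below uses a small, made-up DataFrame (the `toy` data and its column names are assumptions, not part of the movie dataset). It shows the list form used above and, in addition, the dictionary form, which limits each column to its own aggregations.

```python
import pandas as pd

# Hypothetical toy data standing in for the movie dataset
toy = pd.DataFrame({'duration': [90, 120, 120], 'score': [6.5, 8.0, 7.2]},
                   index=['A', 'B', 'C'])

# A list of strings computes every aggregation for every column
summary = toy.agg(['min', 'max', 'nunique'])

# A dictionary maps each column to its own list of aggregations;
# combinations that were not requested are filled with missing values
per_column = toy.agg({'duration': ['min', 'max'], 'score': ['max']})

print(summary.T)
print(per_column)
```

The dictionary form is handy when an aggregation only makes sense for some of the columns.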


## The index of the minimum and maximum

The `idxmin` and `idxmax` methods return the index label where the minimum or maximum value occurs for each column. When we call the `idxmax` method on our DataFrame, we learn that the movie with the longest duration is 'Trapped', the movie with the highest gross is 'Avatar', the one with the highest IMDB score is 'Towering Inferno', and so on. These methods do NOT work for string columns and will raise an error if used with them.

In [4]:
movie.idxmax()

year                  Batman v Superman: Dawn of Justice
duration                                         Trapped
director_fb                                      Don Jon
actor1_fb          Anchorman: The Legend of Ron Burgundy
actor2_fb                          The Final Destination
actor3_fb                          The Dark Knight Rises
gross                                             Avatar
num_reviews                        The Dark Knight Rises
num_voted_users                 The Shawshank Redemption
budget                                    Lady Vengeance
imdb_score                              Towering Inferno
dtype: object

In [5]:
movie.idxmin()

year               Intolerance: Love's Struggle Throughout the Ages
duration                                            Shaun the Sheep
director_fb                                                  Avatar
actor1_fb                                   Yu-Gi-Oh! Duel Monsters
actor2_fb                                                Red Planet
actor3_fb                                                 Daredevil
gross                                                    Skin Trade
num_reviews                                     Godzilla Resurgence
num_voted_users                        The Hadza: Last of the First
budget                                                    Tarnation
imdb_score                           Justin Bieber: Never Say Never
dtype: object
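Because `idxmax` and `idxmin` return index labels, their results can be passed straight to `loc` to retrieve the full row. The following is a minimal sketch on made-up data; the `toy` DataFrame and its film names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with titles in the index
toy = pd.DataFrame({'gross': [100, 300, 200], 'score': [7.1, 6.0, 8.4]},
                   index=['Film A', 'Film B', 'Film C'])

top_labels = toy.idxmax()   # index label of each column's maximum
low_labels = toy.idxmin()   # index label of each column's minimum

# Use the returned label to look up the entire row
best_gross_row = toy.loc[top_labels['gross']]

print(top_labels)
print(best_gross_row)
```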

## Differencing methods `diff` and `pct_change`

The `diff` and `pct_change` methods work just as they do on a Series. Let's read in the `stocks10` dataset, which contains the closing stock price for ten stocks beginning in October 1999.

In [6]:
stocks = pd.read_csv('../data/stocks/stocks10.csv', index_col='date', 
                     parse_dates=['date'])
stocks.head()

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,
1999-10-28,29.01,2.43,16.59,71.0,,21.19,38.85,19.79,,
1999-10-29,29.88,2.5,17.21,70.62,,21.47,39.25,20.0,,


The `diff` method takes the difference between the current value and the nth value preceding it. Below, we get the change in price from two trading days prior.

In [7]:
stocks.diff(2).head()

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,,,,,,,,,,
1999-10-26,,,,,,,,,,
1999-10-27,-0.51,0.06,-0.5,-6.81,,-0.65,-2.05,1.49,,
1999-10-28,-0.81,0.09,-0.06,-10.25,,0.3,1.74,2.51,,
1999-10-29,0.55,0.12,0.69,-5.32,,0.67,2.31,1.73,,


The `pct_change` method returns the percentage change as a fraction. Here, we round the number and multiply by 100 so the results show actual percentages.

In [8]:
stocks.pct_change(2).round(3).head() * 100

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,,,,,,,,,,
1999-10-26,,,,,,,,,,
1999-10-27,-1.7,2.6,-2.9,-8.2,,-3.0,-5.3,8.9,,
1999-10-28,-2.7,3.8,-0.4,-12.6,,1.4,4.7,14.5,,
1999-10-29,1.9,5.0,4.2,-7.0,,3.2,6.3,9.5,,
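To make the arithmetic behind both methods concrete, here is a small sketch that rebuilds them with `shift` on made-up prices (the `prices` DataFrame is hypothetical). It assumes the data has no missing values, so no filling behavior comes into play.

```python
import pandas as pd

# Hypothetical closing prices for a single ticker
prices = pd.DataFrame({'X': [10.0, 11.0, 12.1, 11.0]})
n = 2

# diff(n): current value minus the value n rows earlier
manual_diff = prices - prices.shift(n)

# pct_change(n): current value divided by the value n rows earlier, minus 1
manual_pct = prices / prices.shift(n) - 1

# Both manual versions should agree with the built-in methods
pd.testing.assert_frame_equal(manual_diff, prices.diff(n))
pd.testing.assert_frame_equal(manual_pct, prices.pct_change(n))
print(manual_pct)
```

The first `n` rows are missing in both results because there is no earlier value to compare against.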


## The `sample` method

The `sample` method randomly samples rows or columns from the DataFrame. Here, we select three random rows. By default, sampling is done without replacement, so these will be three unique rows.

In [9]:
movie.sample(3)

title,year,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,gross,num_reviews,num_voted_users,budget,imdb_score
The Trials of Darryl Hunt,2006.0,106.0,15.0,2.0,0.0,0.0,1111.0,11.0,771,200000.0,7.7
Roadside,2013.0,81.0,15.0,847.0,94.0,93.0,,15.0,268,,4.1
Royal Kill,2009.0,90.0,0.0,502.0,119.0,32.0,,8.0,476,350000.0,3.2


It's possible to randomly sample columns by setting the `axis` parameter to 'columns' or 1.

In [12]:
movie.sample(5, axis='columns').head()

title,budget,year,director_fb,imdb_score,num_reviews
Avatar,237000000.0,2009.0,0.0,7.9,723.0
Pirates of the Caribbean: At World's End,300000000.0,2007.0,563.0,7.1,302.0
Spectre,245000000.0,2015.0,0.0,6.8,602.0
The Dark Knight Rises,250000000.0,2012.0,22000.0,8.5,813.0
Star Wars: Episode VII - The Force Awakens,,,131.0,7.1,


Use the `frac` parameter to select a random fraction of the rows and set `replace` equal to `True` to sample with replacement. Here, we select a random 25% of the rows with replacement.

In [24]:
movie.sample(frac=0.25, replace=True).shape

(1229, 11)
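Each call to `sample` normally returns different rows. If you need the same sample every time, one option is the `random_state` parameter, sketched below on a made-up DataFrame (the `toy` data is hypothetical, and the seed value 0 is arbitrary).

```python
import pandas as pd

# Hypothetical toy data with ten rows
toy = pd.DataFrame({'a': range(10), 'b': range(10, 20)})

# The same random_state produces the same rows on every call
first = toy.sample(3, random_state=0)
second = toy.sample(3, random_state=0)

# frac=0.5 of ten rows gives five rows; with replacement, rows may repeat
half = toy.sample(frac=0.5, replace=True, random_state=0)

print(first.equals(second))
print(half.shape)
```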

## The `nsmallest` and `nlargest` methods

The `nsmallest` and `nlargest` methods offer an alternative to `sort_values`. Pass them the number of rows to return as an integer and the name of the column to use for the ordering as a string. The following returns all the rows for movies with the three highest values of the column gross.

In [25]:
movie.nlargest(3, 'gross')

title,year,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,gross,num_reviews,num_voted_users,budget,imdb_score
Avatar,2009.0,178.0,0.0,1000.0,936.0,855.0,760505847.0,723.0,886204,237000000.0,7.9
Titanic,1997.0,194.0,0.0,29000.0,14000.0,794.0,658672302.0,315.0,793059,200000000.0,7.7
Jurassic World,2015.0,124.0,365.0,3000.0,2000.0,1000.0,652177271.0,644.0,418214,150000000.0,7.0


It is possible to duplicate this with `sort_values` together with the `head` method.

In [26]:
movie.sort_values('gross', ascending=False).head(3)

title,year,duration,director_fb,actor1_fb,actor2_fb,actor3_fb,gross,num_reviews,num_voted_users,budget,imdb_score
Avatar,2009.0,178.0,0.0,1000.0,936.0,855.0,760505847.0,723.0,886204,237000000.0,7.9
Titanic,1997.0,194.0,0.0,29000.0,14000.0,794.0,658672302.0,315.0,793059,200000000.0,7.7
Jurassic World,2015.0,124.0,365.0,3000.0,2000.0,1000.0,652177271.0,644.0,418214,150000000.0,7.0


### Why use `nsmallest/nlargest`?

While `nsmallest/nlargest` can be duplicated with `sort_values`, in theory, `nsmallest/nlargest` should perform better as they use a [selection algorithm][1] rather than a full sort. The `nsmallest/nlargest` methods can also keep every row that ties for the last place among the top n by setting the `keep` parameter to `'all'`. The other options, `'first'` (the default) and `'last'`, break ties by position.

[1]: https://en.wikipedia.org/wiki/Selection_algorithm
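The effect of the `keep` parameter is easiest to see on data with a tie. This is a minimal sketch using a made-up DataFrame (the `toy` data is hypothetical).

```python
import pandas as pd

# Hypothetical scores where rows 'b' and 'c' tie for second place
toy = pd.DataFrame({'score': [9, 8, 8, 7]}, index=['a', 'b', 'c', 'd'])

# keep='first' is the default: exactly n rows, ties broken by position
first_only = toy.nlargest(2, 'score')

# keep='all' keeps every row tied with the last one, so more than n rows can return
with_ties = toy.nlargest(2, 'score', keep='all')

print(first_only)
print(with_ties)
```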

## The `corr` method

The `corr` method computes the correlation between every pair of numeric columns in the DataFrame. By default, it computes Pearson's correlation coefficient which is a metric that determines how well the two variables are linearly related, returning a score ranging between -1 and 1. When an increase in one variable always corresponds with the same relative increase in the other variable, a perfect positive linear relationship exists and yields a correlation of 1.

For example, the relationship between Celsius and Fahrenheit is a perfect positive relationship. An increase of one degree Celsius always corresponds to an increase of 1.8 degrees Fahrenheit. A perfect negative linear relationship does the opposite and yields a correlation of -1: an increase in one variable always corresponds with the same relative decrease in the other.

The result of the `corr` method is a square DataFrame (it has the same number of rows as columns) where the row labels are the same as the original column names. Let's call the `corr` method now to compute the correlation between each pair of stocks.

In [28]:
stocks

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.80,36.94,18.27,,
1999-10-28,29.01,2.43,16.59,71.00,,21.19,38.85,19.79,,
1999-10-29,29.88,2.50,17.21,70.62,,21.47,39.25,20.00,,
...,...,...,...,...,...,...,...,...,...,...
2019-10-18,137.41,236.41,32.31,1757.51,256.95,67.61,119.14,38.47,185.85,175.71
2019-10-21,138.43,240.51,33.59,1785.66,253.50,68.74,119.74,38.23,189.76,176.43
2019-10-22,136.37,239.96,34.82,1765.73,255.58,69.09,119.58,38.17,182.34,170.86
2019-10-23,137.24,243.18,35.33,1762.17,254.68,69.75,119.35,37.74,186.15,171.32


In [27]:
stocks.corr().round(2)

Unnamed: 0,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
MSFT,1.0,0.92,0.24,0.98,0.72,0.57,0.91,0.76,0.9,0.99
AAPL,0.92,1.0,0.5,0.94,0.8,0.78,0.95,0.88,0.91,0.97
SLB,0.24,0.5,1.0,0.29,0.06,0.88,0.42,0.65,-0.43,-0.12
AMZN,0.98,0.94,0.29,1.0,0.72,0.62,0.91,0.77,0.9,0.97
TSLA,0.72,0.8,0.06,0.72,1.0,0.64,0.73,0.8,0.82,0.79
XOM,0.57,0.78,0.88,0.62,0.64,1.0,0.73,0.83,0.19,0.63
WMT,0.91,0.95,0.42,0.91,0.73,0.73,1.0,0.84,0.74,0.94
T,0.76,0.88,0.65,0.77,0.8,0.83,0.84,1.0,0.78,0.82
FB,0.9,0.91,-0.43,0.9,0.82,0.19,0.74,0.78,1.0,0.92
V,0.99,0.97,-0.12,0.97,0.79,0.63,0.94,0.82,0.92,1.0


Take a look at the first column of data. This is the pairwise correlation between MSFT and all other stocks. For example, the correlation between MSFT and TSLA is 0.72. This means that there is a tendency for the stocks MSFT and TSLA to move in the same direction. One should not read too much into correlation. By itself, correlation does not imply a causal relationship between the variables. It is just one metric to provide some information about the linear relationship between two variables.

The above DataFrame is also **symmetric**. All values along the diagonal are 1, as each stock has a perfect correlation with itself. The values below the diagonal mirror the values above it, as the correlation between two stocks is the same regardless of their order.

Notice that the technology stocks, MSFT, AAPL, AMZN, and FB are all highly correlated with one another. The energy stocks, XOM and SLB, are also highly correlated with one another, but less correlated with the technology stocks.

### Series correlation method

Series also have a `corr` method. You must pass it a Series to find its correlation. Below, we get the correlation between MSFT and AAPL, which is the same value found in the DataFrame above.

In [29]:
stocks['MSFT'].corr(stocks['AAPL'])

np.float64(0.9221687315401949)
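The Series `corr` method also gives us a quick way to check the Celsius and Fahrenheit example from earlier. The temperatures below are made up for illustration; any set of values related by the same linear formula should behave the same way.

```python
import pandas as pd

# Hypothetical temperatures in Celsius
celsius = pd.Series([-10.0, 0.0, 15.0, 30.0, 100.0])

# Fahrenheit is an exact linear function of Celsius
fahrenheit = celsius * 1.8 + 32

r_pos = celsius.corr(fahrenheit)    # perfect positive linear relationship
r_neg = celsius.corr(-fahrenheit)   # flipping the sign makes it perfectly negative

print(round(r_pos, 6), round(r_neg, 6))
```

The results are rounded before printing because floating-point arithmetic can leave them a hair away from exactly 1 and -1.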

## The `replace` method

The `replace` method can be used to replace values in your DataFrame. It is very powerful and flexible. It is also quite complex as there are many different combinations of parameters to handle a variety of different kinds of replacement. Let's read in the first 5 rows of the San Francisco employee compensation dataset dropping the year column. Each numeric column is rounded to the nearest ten-thousand.

In [30]:
sf_emp_head = pd.read_csv('../data/sf_employee_compensation.csv', nrows=5)
sf_emp_head = sf_emp_head.drop(columns='year').round(-4)
sf_emp_head

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,Public Protection,Personnel Technician,70000.0,0.0,0.0,10000.0,10000.0,10000.0
1,General Administration & Finance,Planner 2,70000.0,0.0,0.0,10000.0,10000.0,10000.0
2,Public Protection,Firefighter,120000.0,60000.0,20000.0,20000.0,20000.0,0.0
3,Community Health,IT Operations Support Admn III,30000.0,0.0,0.0,10000.0,10000.0,0.0
4,Community Health,Special Nurse,30000.0,0.0,10000.0,0.0,0.0,10000.0


The `replace` method has two main parameters, `to_replace` and `value`. The simplest application is to set each one to a single value. Below, we replace all of the values equal to 10,000 with 9,999. All values in the entire DataFrame are searched to be replaced.

In [31]:
sf_emp_head.replace(to_replace=10000, value=9999)

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,Public Protection,Personnel Technician,70000.0,0.0,0.0,9999.0,9999.0,9999.0
1,General Administration & Finance,Planner 2,70000.0,0.0,0.0,9999.0,9999.0,9999.0
2,Public Protection,Firefighter,120000.0,60000.0,20000.0,20000.0,20000.0,0.0
3,Community Health,IT Operations Support Admn III,30000.0,0.0,0.0,9999.0,9999.0,0.0
4,Community Health,Special Nurse,30000.0,0.0,9999.0,0.0,0.0,9999.0


The `replace` method can also replace exact strings. Here, we replace 'Public Protection' with 'PP'.

In [32]:
sf_emp_head.replace(to_replace='Public Protection', value='PP')

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,PP,Personnel Technician,70000.0,0.0,0.0,10000.0,10000.0,10000.0
1,General Administration & Finance,Planner 2,70000.0,0.0,0.0,10000.0,10000.0,10000.0
2,PP,Firefighter,120000.0,60000.0,20000.0,20000.0,20000.0,0.0
3,Community Health,IT Operations Support Admn III,30000.0,0.0,0.0,10000.0,10000.0,0.0
4,Community Health,Special Nurse,30000.0,0.0,10000.0,0.0,0.0,10000.0


Instead of using two parameters, you can set `to_replace` to a dictionary to map the old values to the new values. When using a dictionary, you do not use the parameter `value`. Below, we replace 'Community Health' with 'Health'.

In [33]:
sf_emp_head.replace(to_replace={'Community Health': 'Health'})

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,Public Protection,Personnel Technician,70000.0,0.0,0.0,10000.0,10000.0,10000.0
1,General Administration & Finance,Planner 2,70000.0,0.0,0.0,10000.0,10000.0,10000.0
2,Public Protection,Firefighter,120000.0,60000.0,20000.0,20000.0,20000.0,0.0
3,Health,IT Operations Support Admn III,30000.0,0.0,0.0,10000.0,10000.0,0.0
4,Health,Special Nurse,30000.0,0.0,10000.0,0.0,0.0,10000.0


You can replace as many values as you'd like with a dictionary. The first parameter is `to_replace`, so we can call this method without explicitly providing the parameter name. We import `numpy` to help replace all zeros with missing values.

In [34]:
import numpy as np
sf_emp_head.replace({'Community Health':'Health', 0: np.nan})

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,Public Protection,Personnel Technician,70000.0,,,10000.0,10000.0,10000.0
1,General Administration & Finance,Planner 2,70000.0,,,10000.0,10000.0,10000.0
2,Public Protection,Firefighter,120000.0,60000.0,20000.0,20000.0,20000.0,
3,Health,IT Operations Support Admn III,30000.0,,,10000.0,10000.0,
4,Health,Special Nurse,30000.0,,10000.0,,,10000.0


### Specifying which columns to search for replacement

Calling `replace` as we did above replaces all values in all columns that match the value to replace. Instead, we might be interested in only replacing values in a particular column, or replacing the same value with different values depending on the column.

We can specify which values to replace in which columns by using a dictionary of dictionaries, where the keys of the outer dictionary are column names and the values are dictionaries mapping original values to their replacements. Take a look at the following dictionary. When passed to the `replace` method, it instructs it to replace 0 with NaN and 60,000 with 99,999 in just the overtime column. The retirement column will have 0 replaced with -999.

```python
{'overtime':{0: np.nan, 
             60000: 99999}, 
 'retirement': {0: -999}}
```

Let's use this dictionary to make the specific replacement.

In [35]:
sf_emp_head.replace({'overtime':{0: np.nan, 60000:99999}, 
                     'retirement': {0:-999}})

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,Public Protection,Personnel Technician,70000.0,,0.0,10000.0,10000.0,10000.0
1,General Administration & Finance,Planner 2,70000.0,,0.0,10000.0,10000.0,10000.0
2,Public Protection,Firefighter,120000.0,99999.0,20000.0,20000.0,20000.0,0.0
3,Community Health,IT Operations Support Admn III,30000.0,,0.0,10000.0,10000.0,0.0
4,Community Health,Special Nurse,30000.0,,10000.0,-999.0,0.0,10000.0


### Replacing Substrings

By default, the `replace` method searches for exact strings. Attempting to replace 'Public' with 'Pub.' will do nothing in our DataFrame as there is no exact value 'Public'.

In [36]:
sf_emp_head.replace({'Public':'Pub.'})

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,Public Protection,Personnel Technician,70000.0,0.0,0.0,10000.0,10000.0,10000.0
1,General Administration & Finance,Planner 2,70000.0,0.0,0.0,10000.0,10000.0,10000.0
2,Public Protection,Firefighter,120000.0,60000.0,20000.0,20000.0,20000.0,0.0
3,Community Health,IT Operations Support Admn III,30000.0,0.0,0.0,10000.0,10000.0,0.0
4,Community Health,Special Nurse,30000.0,0.0,10000.0,0.0,0.0,10000.0


In order to replace a substring, you must set the `regex` parameter to `True`.

In [37]:
sf_emp_head.replace({'Public':'Pub.'}, regex=True)

Unnamed: 0,organization group,job,salaries,overtime,other salaries,retirement,health and dental,other benefits
0,Pub. Protection,Personnel Technician,70000.0,0.0,0.0,10000.0,10000.0,10000.0
1,General Administration & Finance,Planner 2,70000.0,0.0,0.0,10000.0,10000.0,10000.0
2,Pub. Protection,Firefighter,120000.0,60000.0,20000.0,20000.0,20000.0,0.0
3,Community Health,IT Operations Support Admn III,30000.0,0.0,0.0,10000.0,10000.0,0.0
4,Community Health,Special Nurse,30000.0,0.0,10000.0,0.0,0.0,10000.0
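With `regex=True`, each key is treated as a regular expression rather than a plain substring, so pattern syntax such as the `^` anchor is available. The sketch below uses a made-up DataFrame (the `toy` data and the 'Semi-Public Works' value are hypothetical) to replace 'Public' only when it begins the string.

```python
import pandas as pd

# Hypothetical toy data; the last group contains 'Public' in the middle
toy = pd.DataFrame({'group': ['Public Protection', 'Community Health',
                              'Semi-Public Works'],
                    'n': [1, 2, 3]})

# '^Public' matches 'Public' only at the start of a string
anchored = toy.replace({r'^Public': 'Pub.'}, regex=True)

# Without the anchor, every occurrence of 'Public' is replaced
unanchored = toy.replace({'Public': 'Pub.'}, regex=True)

print(anchored)
print(unanchored)
```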


## Methods available only to Series and not DataFrames

There are more than a few methods that are available only to Series objects, but the following are the most important.

### No `str` or `dt` accessor or `unique` method

DataFrames have no special methods just for strings or datetimes. There is no `str` or `dt` accessor. They can only be used on Series objects. Also, the `unique` method is only available to Series.
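In practice, this means selecting a single column as a Series before using these tools. Here is a minimal sketch on a made-up DataFrame (the `df` data is hypothetical); it also shows `nunique`, which is available on DataFrames as well.

```python
import pandas as pd

# Hypothetical toy data with a string column and a datetime column
df = pd.DataFrame({
    'dept': ['Police', 'Fire', 'Police'],
    'hire_date': pd.to_datetime(['2001-12-03', '2010-11-15', '2006-01-09']),
})

upper = df['dept'].str.upper()      # str accessor on a Series
years = df['hire_date'].dt.year     # dt accessor on a Series
depts = df['dept'].unique()         # unique is a Series method

# nunique works on the whole DataFrame, one count per column
n_unique = df.nunique()

print(upper.tolist(), years.tolist(), list(depts))
print(n_unique)
```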

## Exercises

Execute the following cell to read in the City of Houston dataset and use it to answer the exercises that follow.

In [38]:
emp = pd.read_csv('../data/employee.csv')
emp.head(3)

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.0,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.0,Male,Black


In [52]:
emp.size

145848

### Exercise 1

<span style="color:green; font-size:16px">Find the relative frequency of departments for all employees and then find the relative frequency of departments for the top 100 salaries. Compare the differences.</span>

In [59]:
emp

Unnamed: 0,dept,title,hire_date,salary,sex,race
0,Police,POLICE SERGEANT,2001-12-03,87545.38,Male,White
1,Other,ASSISTANT CITY ATTORNEY II,2010-11-15,82182.00,Male,Hispanic
2,Houston Public Works,SENIOR SLUDGE PROCESSOR,2006-01-09,49275.00,Male,Black
3,Police,SENIOR POLICE OFFICER,1997-05-27,75942.10,Male,Hispanic
4,Police,SENIOR POLICE OFFICER,2006-01-23,69355.26,Male,White
...,...,...,...,...,...,...
24303,Police,SENIOR POLICE OFFICER,2001-12-03,75942.10,Male,Black
24304,Other,SENIOR PROCUREMENT SPECIALIST,2016-03-28,76175.00,Female,Black
24305,Houston Public Works,WATER SERVICE INSPECTOR I,2015-09-14,35173.00,Male,Black
24306,Health & Human Services,HUMAN SERVICE PROGRAM MANAGER,2008-05-19,67198.00,Female,Black


In [42]:
emp['dept'].value_counts(normalize=True)

dept
Police                     0.311544
Fire                       0.180023
Houston Public Works       0.172371
Other                      0.138761
Health & Human Services    0.055661
Houston Airport System     0.050025
Parks & Recreation         0.047392
Library                    0.023161
Solid Waste Management     0.021063
Name: proportion, dtype: float64

In [47]:
emp.nlargest(100,'salary',keep='all')['dept'].value_counts(normalize=True)

dept
Other                      0.36
Fire                       0.22
Police                     0.15
Houston Airport System     0.09
Houston Public Works       0.09
Health & Human Services    0.07
Solid Waste Management     0.01
Library                    0.01
Name: proportion, dtype: float64

### Exercise 2

<span style="color:green; font-size:16px">Sample 100 rows of data with replacement using a random state value of 999. Then find the count of each unique department as a Series.</span>

In [51]:
emp.sample(100,replace=True, random_state=999)['dept'].value_counts(dropna=False)

dept
Other                      26
Police                     25
Fire                       23
Houston Public Works        9
Health & Human Services     6
Parks & Recreation          4
Solid Waste Management      3
Houston Airport System      2
Library                     2
Name: count, dtype: int64

### Stocks dataset

Use the following stocks dataset for the remaining exercises.

In [53]:
stocks = pd.read_csv('../data/stocks/stocks10.csv', index_col='date', parse_dates=['date'])
stocks.head(3)

date,MSFT,AAPL,SLB,AMZN,TSLA,XOM,WMT,T,FB,V
1999-10-25,29.84,2.32,17.02,82.75,,21.45,38.99,16.78,,
1999-10-26,29.82,2.34,16.65,81.25,,20.89,37.11,17.28,,
1999-10-27,29.33,2.38,16.52,75.94,,20.8,36.94,18.27,,


### Exercise 3

<span style="color:green; font-size:16px">Find the day that each stock had its largest percentage one-day drop in price.</span>

In [56]:
stocks.pct_change().idxmin()

MSFT   2000-04-24
AAPL   2000-09-29
SLB    2008-10-15
AMZN   2001-07-24
TSLA   2012-01-13
XOM    2008-10-15
WMT    2018-02-20
T      2000-12-19
FB     2018-07-26
V      2008-10-15
dtype: datetime64[ns]

### Exercise 4

<span style="color:green; font-size:16px">Find the min, max, and date of the min and max for each stock. Return a DataFrame with the stock ticker symbols in the index and the aggregations as column names.</span>

In [58]:
stocks.agg(['min','max','idxmin','idxmax']).T

Unnamed: 0,min,max,idxmin,idxmax
MSFT,11.77,141.57,2009-03-09 00:00:00,2019-10-15 00:00:00
AAPL,0.82,243.18,2003-04-11 00:00:00,2019-10-23 00:00:00
SLB,11.86,99.66,2002-10-09 00:00:00,2014-06-30 00:00:00
AMZN,5.97,2039.51,2001-09-28 00:00:00,2018-09-04 00:00:00
TSLA,15.8,385.0,2010-07-07 00:00:00,2017-09-18 00:00:00
XOM,18.84,85.86,2002-07-22 00:00:00,2014-06-23 00:00:00
WMT,30.27,120.24,2000-10-27 00:00:00,2019-10-11 00:00:00
T,8.01,38.47,2003-03-10 00:00:00,2019-10-18 00:00:00
FB,17.73,217.5,2012-09-04 00:00:00,2018-07-25 00:00:00
V,9.8,185.74,2009-01-20 00:00:00,2019-09-06 00:00:00
