In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
print("pandas", pd.__version__)
print("numpy", np.__version__)

# Computational tools

## Statistical functions


<a id='computation-pct-change'></a>

### Percent change

`Series` and `DataFrame` have a `pct_change()` method that computes the
percent change over a given number of periods. The `fill_method` parameter
controls how NA/null values are filled *before* the percent change is
computed; note that this keyword is deprecated as of pandas 2.1.

In [None]:
ser = pd.Series(np.random.randn(8))
ser.pct_change()

In [None]:
df = pd.DataFrame(np.random.randn(10, 4))
df.pct_change(periods=3)
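To make the definition concrete, here is a minimal sketch that compares `pct_change()` with the equivalent arithmetic on a shifted copy. The `prices` series and its values are made up for illustration, and the data contains no missing values, so `fill_method` plays no role.

```python
import pandas as pd

# Hypothetical data, chosen so the changes are easy to check by hand.
prices = pd.Series([100.0, 110.0, 99.0, 121.0])

# Built-in: change relative to the previous element.
built_in = prices.pct_change()

# Manual equivalent: divide by the value one step earlier, then subtract 1.
manual = prices / prices.shift(1) - 1

# The same idea over a two-period gap.
built_in_2 = prices.pct_change(periods=2)
manual_2 = prices / prices.shift(2) - 1

print(built_in)
print(built_in_2)
```

The first `periods` entries are `NaN` because there is no earlier value to compare against.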


<a id='computation-covariance'></a>

### Covariance

`Series.cov()` can be used to compute covariance between series
(excluding missing values).

In [None]:
s1 = pd.Series(np.random.randn(1000))
s2 = pd.Series(np.random.randn(1000))
s1.cov(s2)

Analogously, `DataFrame.cov()` computes pairwise covariances among the
series in the DataFrame, also excluding NA/null values.


<a id='computation-covariance-caveats'></a>
>**Note**
>
>Assuming the missing data are missing at random, this results in an
>unbiased estimate of the covariance matrix. However, for many applications
>this estimate may not be acceptable because the estimated covariance matrix
>is not guaranteed to be positive semi-definite. This could lead to
>estimated correlations with absolute values greater than one
>and/or a non-invertible covariance matrix. See [Estimation of covariance
>matrices](https://en.wikipedia.org/w/index.php?title=Estimation_of_covariance_matrices)
>for more details.

In [None]:
frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
frame.cov()
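The caveat in the note above can be reproduced with a small, deliberately contrived frame. This is only an illustrative sketch: the `df_gaps` data is invented so that each pair of columns overlaps on a different set of rows, which is an extreme case rather than a typical one.

```python
import numpy as np
import pandas as pd

nan = np.nan
# Contrived data: every column pair is jointly observed on different rows.
df_gaps = pd.DataFrame(
    {
        "a": [1.0, 2.0, 3.0, nan, nan, nan, 1.0, 2.0, 3.0],
        "b": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, nan, nan, nan],
        "c": [nan, nan, nan, 1.0, 2.0, 3.0, -1.0, -2.0, -3.0],
    }
)

# Each entry is estimated from the rows where both columns are present.
cov = df_gaps.cov()

# Smallest eigenvalue of the symmetric estimate; a negative value means
# the matrix is not positive semi-definite.
min_eig = np.linalg.eigvalsh(cov.to_numpy()).min()

# Correlation implied by the covariance entries for the pair (a, b).
implied_corr_ab = cov.loc["a", "b"] / np.sqrt(cov.loc["a", "a"] * cov.loc["b", "b"])

print(cov)
print("smallest eigenvalue:", min_eig)
print("implied corr(a, b):", implied_corr_ab)
```

Here the smallest eigenvalue is negative and the correlation implied by the covariance entries exceeds one, because the variances and the covariance are estimated from different subsets of rows.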

`DataFrame.cov` also supports an optional `min_periods` keyword that
specifies the required minimum number of observations for each column pair
in order to have a valid result.

In [None]:
frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
frame.loc[frame.index[:5], "a"] = np.nan
frame.loc[frame.index[5:10], "b"] = np.nan
frame.cov()

In [None]:
frame.cov(min_periods=12)
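To see which entries `min_periods` affects, it can help to count the jointly observed rows for each column pair. The sketch below rebuilds a frame like the one above with a seeded generator (the seed and the names `small`, `pair_counts`, and `cov_min12` are illustrative choices, not part of the original example).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
small = pd.DataFrame(rng.standard_normal((20, 3)), columns=["a", "b", "c"])
small.loc[small.index[:5], "a"] = np.nan
small.loc[small.index[5:10], "b"] = np.nan

# Number of rows in which both columns of each pair are non-missing.
present = small.notna().astype(int)
pair_counts = present.T @ present

# Pairs with fewer than 12 joint observations come back as NaN.
cov_min12 = small.cov(min_periods=12)

print(pair_counts)
print(cov_min12)
```

Only the pair `("a", "b")` has fewer than 12 joint observations (10 rows), so it is the only entry that becomes `NaN`.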


<a id='computation-correlation'></a>

### Correlation

Correlation may be computed using the `corr()` method.
Using the `method` parameter, several methods for computing correlations are
provided:

|Method name|Description|
|:------------------|:------------------------------------------------------------------------------|
|pearson (default)|Standard correlation coefficient|
|kendall|Kendall Tau correlation coefficient|
|spearman|Spearman rank correlation coefficient|


All of these are currently computed using pairwise complete observations.
Wikipedia has articles covering the above correlation coefficients:

- [Pearson correlation coefficient](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)  
- [Kendall rank correlation coefficient](https://en.wikipedia.org/wiki/Kendall_rank_correlation_coefficient)  
- [Spearman’s rank correlation coefficient](https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient)  
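As a rough illustration of what "pairwise complete observations" means, the sketch below drops every row in which either series is missing and then computes the Pearson coefficient with NumPy. The series `x` and `y` are made-up data.

```python
import numpy as np
import pandas as pd

# Hypothetical series with missing values in different positions.
x = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
y = pd.Series([2.0, np.nan, 3.0, 3.0, 7.0, 5.0])

# Keep only the rows where both series have a value.
both = pd.concat({"x": x, "y": y}, axis=1).dropna()

# Pearson correlation on the complete pairs, computed with NumPy.
manual = np.corrcoef(both["x"], both["y"])[0, 1]

# pandas applies the same row selection internally.
built_in = x.corr(y)

print(len(both), built_in, manual)
```

Both values agree up to floating-point rounding, and only four of the six rows contribute to the result.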


>**Note**
>
>Please see the [caveats](#computation-covariance-caveats) associated
with this method of calculating correlation matrices in the
[covariance section](#computation-covariance).

In [None]:
frame = pd.DataFrame(np.random.randn(1000, 5), columns=["a", "b", "c", "d", "e"])
frame.iloc[::2] = np.nan

# Series with Series
frame["a"].corr(frame["b"])

In [None]:
frame["a"].corr(frame["b"], method="spearman")

In [None]:
# Pairwise correlation of DataFrame columns
frame.corr()

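Note that the handling of non-numeric columns depends on the `numeric_only`
parameter. Older pandas releases excluded them automatically from the
correlation calculation; since pandas 2.0 the default is `numeric_only=False`,
so pass `numeric_only=True` to exclude them.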

Like `cov`, `corr` also supports the optional `min_periods` keyword:

In [None]:
frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
frame.loc[frame.index[:5], "a"] = np.nan
frame.loc[frame.index[5:10], "b"] = np.nan
frame.corr()

In [None]:
frame.corr(min_periods=12)

The `method` argument can also be a callable for a generic correlation
calculation. In this case, it should be a single function
that produces a single value from two ndarray inputs. Suppose we wanted to
compute the correlation based on histogram intersection:

In [None]:
# histogram intersection
def histogram_intersection(a, b):
    return np.minimum(np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()


frame.corr(method=histogram_intersection)

A related method `corrwith()` is implemented on DataFrame to
compute the correlation between like-labeled Series contained in different
DataFrame objects.

In [None]:
index = ["a", "b", "c", "d", "e"]
columns = ["one", "two", "three", "four"]
df1 = pd.DataFrame(np.random.randn(5, 4), index=index, columns=columns)
df2 = pd.DataFrame(np.random.randn(4, 4), index=index[:4], columns=columns)
df1.corrwith(df2)

In [None]:
df2.corrwith(df1, axis=1)
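One way to read `corrwith()` is as a loop over like-labeled columns. The following sketch checks that interpretation against `Series.corr`, using seeded random data; the names `left`, `right`, `by_column`, and `manual` are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
index = ["a", "b", "c", "d", "e"]
columns = ["one", "two", "three", "four"]
left = pd.DataFrame(rng.standard_normal((5, 4)), index=index, columns=columns)
right = pd.DataFrame(rng.standard_normal((4, 4)), index=index[:4], columns=columns)

# Each column of `left` against the same-named column of `right`.
by_column = left.corrwith(right)

# The same numbers, one column at a time. Series.corr aligns on the index,
# so row "e", which is missing from `right`, does not contribute.
manual = pd.Series({col: left[col].corr(right[col]) for col in columns})

print(by_column)
print(manual)
```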


<a id='computation-ranking'></a>

### Data ranking

The `rank()` method produces a data ranking with ties being
assigned the mean of the ranks (by default) for the group:

In [None]:
s = pd.Series(np.random.randn(5), index=list("abcde"))
s["d"] = s["b"]  # so there's a tie
s.rank()

`rank()` is also a DataFrame method and can rank along the rows
(`axis=0`, ranking the values within each column) or along the columns
(`axis=1`, ranking the values within each row). `NaN` values are excluded
from the ranking.

In [None]:
df = pd.DataFrame(np.random.randn(10, 6))
df[4] = df[2][:5]  # some ties
df

In [None]:
# Rank the values within each row
df.rank(axis=1)

`rank` optionally takes an `ascending` parameter, which defaults to `True`;
when `False`, data is reverse-ranked, with larger values assigned a smaller rank.

`rank` supports different tie-breaking methods, specified with the `method`
parameter:

- `average` : average rank of tied group  
- `min` : lowest rank in the group  
- `max` : highest rank in the group  
- `first` : ranks assigned in the order they appear in the array  
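The tie-breaking methods listed above are easiest to compare side by side. This is a small sketch on a made-up `scores` series with a single tie; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: "b" and "c" are tied.
scores = pd.Series([10, 20, 20, 30], index=list("abcd"))

# One column of ranks per tie-breaking method.
ranks = pd.DataFrame(
    {method: scores.rank(method=method) for method in ["average", "min", "max", "first"]}
)
print(ranks)

# Reverse ranking: the largest value gets rank 1.
descending = scores.rank(ascending=False)
print(descending)
```

The tied values occupy ranks 2 and 3, so `average` assigns 2.5 to both, `min` assigns 2, `max` assigns 3, and `first` assigns 2 to `b` and 3 to `c`.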




<a id='computation-windowing'></a>

### Windowing functions

See [the window operations user guide](37_window.ipynb#window-overview) for an overview of windowing functions.