# <span style="color:#130654; font-family: Helvetica; font-size: 200%; font-weight:700"> Pandas | <span style="font-size: 50%; font-weight:300">Statistical Functions</span></span>

Statistical methods help you understand and analyze the behavior of data.

To use pandas in Python, import it first with the following command:

In [6]:
# import pandas
import pandas as pd

<br>

### <span style="color:#130654">Create DataFrame</span>

Create a dataset from a dictionary:

In [10]:
data = {
    'fruit':['Apple', 'Banana', 'Dates', 'Grapes', 'Mango', 'Orange', 'Papaya'],
    'price':[80, 40, 180, 65, 60, 35, 55],
    'serving(grams)':[182, 125, 7.1, 151, 336, 131, 500],
    'calories':[95, 111, 20, 104, 202, 62, 215]
}

In [11]:
df = pd.DataFrame(data)
df

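|       | fruit  | price | serving(grams) | calories |
| :---: | :----- | ----: | -------------: | -------: |
|   0   | Apple  |    80 |          182.0 |       95 |
|   1   | Banana |    40 |          125.0 |      111 |
|   2   | Dates  |   180 |            7.1 |       20 |
|   3   | Grapes |    65 |          151.0 |      104 |
|   4   | Mango  |    60 |          336.0 |      202 |
|   5   | Orange |    35 |          131.0 |       62 |
|   6   | Papaya |    55 |          500.0 |      215 |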


<br>

### <span style="color:#130654">Data Rank</span>

`rank()` - computes the rank of each element in a Series, or of each element within every column of a DataFrame. In case of ties, it assigns the mean rank by default (`method='average'`).
- `rank()` optionally takes an `ascending` parameter, which defaults to `True`; when it is `False`, the data is reverse-ranked, with larger values assigned a smaller rank.

*Syntax:*
```python
DataFrame.rank(self, axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
```


|       Name       | Description                                                  | Type                              | Required |
| :--------------: | :----------------------------------------------------------- | :---------------------------------------------- | :-----------------: |
|     **axis**     | Axis along which to rank.                                    | {0 or ‘index’, 1 or ‘columns’}                  |      Optional       |
|    **method**    | How to rank the group of records that have the same value (i.e. ties):<br />1. average: average rank of the group<br />2. min: lowest rank in the group<br />3. max: highest rank in the group<br />4. first: ranks assigned in order they appear in the array<br />5. dense: like ‘min’, but rank always increases by 1 between groups | {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}     |      Optional       |
| **numeric_only** | For DataFrame objects, rank only numeric columns if set to True. | bool                                            |      Optional       |
|  **na_option**   | How to rank NaN values:<br />1. keep: assign NaN rank to NaN values<br />2. top: assign smallest rank to NaN values if ascending<br />3. bottom: assign highest rank to NaN values if ascending | {‘keep’, ‘top’, ‘bottom’} |      Optional       |
|  **ascending**   | Whether or not the elements should be ranked in ascending order. | bool                                            |      Optional       |
|     **pct**      | Whether or not to display the returned rankings in percentile form. | bool                                            |      Optional       |

In [17]:
df['serving(grams)'].rank(method='first', ascending=True)

0    5.0
1    2.0
2    1.0
3    4.0
4    6.0
5    3.0
6    7.0
Name: serving(grams), dtype: float64
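
The fruit data has no repeated values, so the tie-breaking rules never come into play there. As a minimal sketch, the following uses a small hypothetical Series (`scores`, not part of the dataset above) with one tied pair to compare the `method` options, and also shows `ascending=False` and `pct=True`:

```python
import pandas as pd

# Hypothetical values with one tie (the two 20s), chosen for illustration only
scores = pd.Series([10, 20, 20, 30])

# Rank the same values with each tie-breaking method, one column per method
ranks = pd.DataFrame({
    method: scores.rank(method=method)
    for method in ['average', 'min', 'max', 'first', 'dense']
})
print(ranks)

# ascending=False assigns rank 1 to the largest value
descending = scores.rank(ascending=False)
print(descending.tolist())   # [4.0, 2.5, 2.5, 1.0]

# pct=True divides each rank by the number of valid values
pct_ranks = scores.rank(pct=True)
print(pct_ranks.tolist())    # [0.25, 0.625, 0.625, 1.0]
```

With `method='average'` the tied values share rank 2.5, while `'min'` and `'max'` give both of them 2 or 3 respectively, `'first'` breaks the tie by position, and `'dense'` leaves no gap after the tie.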

<br>

### <span style="color:#130654">Percentage Change</span>

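`pct_change()` - compares every element with its prior element and computes the fractional change, (current - previous) / previous. Multiply the result by 100 to express it as a percentage. The first element has no prior element, so its result is `NaN`.

By default, `pct_change()` operates down each column; to apply it across each row instead, pass the `axis=1` argument.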

*Syntax:*
```python
Series.pct_change(self, periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
```

|      Name       | Description                                                  | Type                                          | Required |
| :-------------: | :----------------------------------------------------------- | :-------------------------------------------- | :------- |
|   **periods**   | Periods to shift for forming percent change.                 | int                                           | Optional |
| **fill_method** | How to handle NAs before computing percent changes.          | str                                           | Optional |
|    **limit**    | The number of consecutive NAs to fill before stopping.       | int                                           | Optional |
|    **freq**     | Increment to use from time series API (e.g. ‘M’ or BDay()).  | DateOffset, timedelta, or offset alias string | Optional |
|   `**kwargs`    | Additional keyword arguments are passed into DataFrame.shift or Series.shift. |                                               | Optional |

In [18]:
df['serving(grams)'].pct_change()

0          NaN
1    -0.313187
2    -0.943200
3    20.267606
4     1.225166
5    -0.610119
6     2.816794
Name: serving(grams), dtype: float64
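
To make the computation explicit, here is a minimal sketch that rebuilds the same result by hand with `shift()` and also shows the effect of `periods=2`. The variable names (`builtin`, `manual`, `two_back`) are illustrative only:

```python
import pandas as pd

serving = pd.Series([182, 125, 7.1, 151, 336, 131, 500], name='serving(grams)')

# Built-in fractional change relative to the previous element
builtin = serving.pct_change()

# The same computation written out: (current - previous) / previous
previous = serving.shift(1)
manual = (serving - previous) / previous

# periods=2 compares each element with the one two positions earlier,
# so the first two results are NaN
two_back = serving.pct_change(periods=2)

print(pd.DataFrame({'builtin': builtin, 'manual': manual, 'two_back': two_back}))
```

The `builtin` and `manual` columns agree up to floating-point rounding, which confirms that the values are fractions rather than percentages: `-0.313187` means a drop of about 31.3%.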

<br>

### <span style="color:#130654">Correlation</span>

`corr()` - computes the pairwise correlation of columns.
- The default Pearson correlation measures the strength of the linear relationship between two arrays of values (Series).
- Several methods are available: pearson (default), spearman, and kendall.
- Older pandas versions excluded non-numeric columns automatically. In pandas 2.0 and later, pass `numeric_only=True` to `DataFrame.corr()` or select the numeric columns first; otherwise a non-numeric column raises an error.

*Syntax:*
```python
DataFrame.corr(self, method='pearson', min_periods=1)
```

|      Name       | Description                                                  | Type                             | Required |
| :-------------: | :----------------------------------------------------------- | :--------------------------------------------- | :------------------ |
|   **method**    | 1. pearson : standard correlation coefficient<br />2. kendall : Kendall Tau correlation coefficient<br />3. spearman : Spearman rank correlation <br /><br />Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior | {'pearson', 'kendall', 'spearman'} or callable | Optional            |
| **min_periods** | Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation. | int                                            | Optional            |

Correlation between the `serving(grams)` and `calories` columns:

In [19]:
df['serving(grams)'].corr(df['calories'])

0.9430266848211656
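
The call above compares two Series. As a sketch of the DataFrame form, the following builds the full correlation matrix and then reconstructs two of the coefficients by hand. Selecting the numeric columns with `select_dtypes` is one way (an assumption of this example, not the only option) to keep the `fruit` column out of the computation on any pandas version:

```python
import pandas as pd

data = {
    'fruit': ['Apple', 'Banana', 'Dates', 'Grapes', 'Mango', 'Orange', 'Papaya'],
    'price': [80, 40, 180, 65, 60, 35, 55],
    'serving(grams)': [182, 125, 7.1, 151, 336, 131, 500],
    'calories': [95, 111, 20, 104, 202, 62, 215],
}
df = pd.DataFrame(data)

# Keep only numeric columns, then compute the pairwise Pearson matrix
numeric = df.select_dtypes(include='number')
corr_matrix = numeric.corr()
print(corr_matrix)

serving, calories = df['serving(grams)'], df['calories']

# Pearson: covariance divided by the product of the standard deviations
pearson_manual = serving.cov(calories) / (serving.std() * calories.std())

# Spearman: the Pearson correlation of the ranks
spearman_manual = serving.rank().corr(calories.rank())

print(pearson_manual, spearman_manual)
```

The matrix is symmetric with 1.0 on the diagonal, and its `serving(grams)`/`calories` entry matches the single value returned by `Series.corr()`.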

<br>

### <span style="color:#130654">Covariance</span>

`cov()` - computes the pairwise covariance of columns.
- Covariance measures how two Series vary together.
- `DataFrame.cov()` returns the covariance matrix of the columns, while `Series.cov(other)` returns the covariance between two Series as a single number.
- NA values are excluded automatically.

*Syntax:*
```python
DataFrame.cov(self, min_periods=None)
```

| Name            | Description                                                  | Type | Required |
| :-------------- | :----------------------------------------------------------- | :--- | :------: |
| **min_periods** | Minimum number of observations required per pair of columns to have a valid result. | int  | Optional |

Covariance between the `serving(grams)` and `calories` columns:

In [20]:
df['serving(grams)'].cov(df['calories'])

10832.52619047619
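
As a minimal sketch of what this number represents, the following recomputes the sample covariance from its definition (sum of the products of deviations from the mean, divided by n - 1) and builds the covariance matrix with `DataFrame.cov()`. The short column names `serving` and `calories` are chosen for this example only:

```python
import pandas as pd

serving = pd.Series([182, 125, 7.1, 151, 336, 131, 500])
calories = pd.Series([95, 111, 20, 104, 202, 62, 215])

# Sample covariance from its definition, normalized by n - 1
n = len(serving)
manual_cov = ((serving - serving.mean()) * (calories - calories.mean())).sum() / (n - 1)

# Built-in equivalent
builtin_cov = serving.cov(calories)
print(manual_cov, builtin_cov)

# Covariance matrix: the diagonal holds each column's variance,
# the off-diagonal entries hold the pairwise covariance
cov_matrix = pd.DataFrame({'serving': serving, 'calories': calories}).cov()
print(cov_matrix)
```

Unlike correlation, covariance depends on the units of both columns, so its magnitude is not directly comparable across column pairs; dividing it by the product of the two standard deviations gives the Pearson correlation.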