<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#DataFrame" data-toc-modified-id="DataFrame-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>DataFrame</a></span><ul class="toc-item"><li><span><a href="#Create" data-toc-modified-id="Create-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Create</a></span></li><li><span><a href="#Read" data-toc-modified-id="Read-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Read</a></span><ul class="toc-item"><li><span><a href="#Overview" data-toc-modified-id="Overview-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Overview</a></span></li><li><span><a href="#Dict-Like-Access" data-toc-modified-id="Dict-Like-Access-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Dict-Like Access</a></span></li><li><span><a href="#List-Like-Access" data-toc-modified-id="List-Like-Access-1.2.3"><span class="toc-item-num">1.2.3&nbsp;&nbsp;</span>List-Like Access</a></span></li><li><span><a href="#Boolean-Indexing" data-toc-modified-id="Boolean-Indexing-1.2.4"><span class="toc-item-num">1.2.4&nbsp;&nbsp;</span>Boolean Indexing</a></span></li><li><span><a href="#Transpose" data-toc-modified-id="Transpose-1.2.5"><span class="toc-item-num">1.2.5&nbsp;&nbsp;</span>Transpose</a></span></li></ul></li><li><span><a href="#Update" data-toc-modified-id="Update-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Update</a></span><ul class="toc-item"><li><span><a href="#BMI" data-toc-modified-id="BMI-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>BMI</a></span></li><li><span><a href="#BMR" data-toc-modified-id="BMR-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>BMR</a></span></li></ul></li><li><span><a href="#Delete" data-toc-modified-id="Delete-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Delete</a></span></li></ul></li><li><span><a href="#NumPy" data-toc-modified-id="NumPy-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>NumPy</a></span></li><li><span><a href="#Dig-More" 
data-toc-modified-id="Dig-More-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Dig More</a></span></li></ul></div>

In [1]:
%matplotlib inline
import numpy as np
import scipy as sp
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import IPython as ip
mpl.style.use('ggplot')
mpl.rc('font', family='Noto Sans CJK TC')
ip.display.set_matplotlib_formats('svg')

In [2]:
health_lists = [
    [152, 48, 63, 1],
    [157, 53, 41, 1],
    [140, 37, 63, 0],
    [137, 32, 65, 0],
]

## DataFrame

- Data frame: like a sheet in Excel.
    - Series: each column is a series.
    - Index: the values that identify the rows or the columns.
        - Labels: the values in an index.
    - The column index: set with `columns=`.
    - The row index: set with `index=`.

### Create

In [3]:
# pd.DataFrame?

In [4]:
health_df = pd.DataFrame(
    health_lists,
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],  # with many rows, we usually omit index= and let pandas generate it
)
health_df

|   | height_cm | weight_kg | age | male_yn |
|---|-----------|-----------|-----|---------|
| A | 152 | 48 | 63 | 1 |
| B | 157 | 53 | 41 | 1 |
| C | 140 | 37 | 63 | 0 |
| D | 137 | 32 | 65 | 0 |
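
When `index=` is omitted, pandas generates a default integer index. A minimal sketch with the same data; the name `auto_df` is hypothetical and not part of the original notebook:

```python
import pandas as pd

health_lists = [
    [152, 48, 63, 1],
    [157, 53, 41, 1],
    [140, 37, 63, 0],
    [137, 32, 65, 0],
]

# no index= argument: pandas falls back to a RangeIndex (0, 1, 2, ...)
auto_df = pd.DataFrame(
    health_lists,
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
)
print(auto_df.index)
```

The rows are then addressed by position-like integer labels instead of `'A'` to `'D'`.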


### Read

#### Overview

In [5]:
display(
    health_df.columns,  # column index
    health_df.index,  # row index
)

Index(['height_cm', 'weight_kg', 'age', 'male_yn'], dtype='object')

Index(['A', 'B', 'C', 'D'], dtype='object')

In [6]:
display(
    health_df.shape,
    health_df.dtypes,  # data types
)

(4, 4)

height_cm    int64
weight_kg    int64
age          int64
male_yn      int64
dtype: object

In [7]:
# the underlying data as a NumPy ndarray (n-dimensional array)
health_df.values

array([[152,  48,  63,   1],
       [157,  53,  41,   1],
       [140,  37,  63,   0],
       [137,  32,  65,   0]])
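
The pandas documentation recommends `DataFrame.to_numpy()` over `.values` for new code. A small sketch, assuming a pandas version that provides `to_numpy()` (0.24 or later); the name `health_a` is hypothetical:

```python
import numpy as np
import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)

# .to_numpy() returns the data as an ndarray; the row and column labels are dropped
health_a = health_df.to_numpy()
print(type(health_a), health_a.shape)
```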

#### Dict-Like Access

In [8]:
health_df['height_cm']  # -> the column as a Series

A    152
B    157
C    140
D    137
Name: height_cm, dtype: int64

In [9]:
health_df.height_cm  # a shortcut; works only when the name is a valid identifier and does not clash with a DataFrame attribute

A    152
B    157
C    140
D    137
Name: height_cm, dtype: int64

In [10]:
health_df['height_cm']['A']  # -> a single value

152

In [11]:
health_df[['height_cm', 'weight_kg']]  # -> the columns as a DataFrame

|   | height_cm | weight_kg |
|---|-----------|-----------|
| A | 152 | 48 |
| B | 157 | 53 |
| C | 140 | 37 |
| D | 137 | 32 |
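
A single value can also be read in one step with `.loc` or `.at`, which take the row label first and the column label second. A minimal sketch; the names `value_loc` and `value_at` are hypothetical:

```python
import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)

# row label first, then column label
value_loc = health_df.loc['A', 'height_cm']
# .at is specialised for reading or writing exactly one cell
value_at = health_df.at['A', 'height_cm']
print(value_loc, value_at)
```

Both return the same value as `health_df['height_cm']['A']`, without building the intermediate Series.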


#### List-Like Access

In [12]:
health_df[:2]  # -> the first two rows as a DataFrame

|   | height_cm | weight_kg | age | male_yn |
|---|-----------|-----------|-----|---------|
| A | 152 | 48 | 63 | 1 |
| B | 157 | 53 | 41 | 1 |


In [13]:
# all rows of the two columns
health_df.loc[:, ['height_cm', 'weight_kg']]

|   | height_cm | weight_kg |
|---|-----------|-----------|
| A | 152 | 48 |
| B | 157 | 53 |
| C | 140 | 37 |
| D | 137 | 32 |


In [14]:
# all rows, columns up to and including 'weight_kg' (.loc label slices include the end)
health_df.loc[:, :'weight_kg']

|   | height_cm | weight_kg |
|---|-----------|-----------|
| A | 152 | 48 |
| B | 157 | 53 |
| C | 140 | 37 |
| D | 137 | 32 |


In [15]:
# rows up to and including 'B', columns up to and including 'weight_kg'
health_df.loc[:'B', :'weight_kg']

|   | height_cm | weight_kg |
|---|-----------|-----------|
| A | 152 | 48 |
| B | 157 | 53 |


In [16]:
# the first two rows, the first two columns
health_df.iloc[:2, :2]

|   | height_cm | weight_kg |
|---|-----------|-----------|
| A | 152 | 48 |
| B | 157 | 53 |
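
The two accessors treat the end of a slice differently: `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end position, like a Python list. A small sketch of the difference; the names `by_label` and `by_position` are hypothetical:

```python
import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)

# label slice: 'B' and 'weight_kg' are included
by_label = health_df.loc[:'B', :'weight_kg']
# position slice: positions 0 and 1 only, position 2 is excluded
by_position = health_df.iloc[:2, :2]
print(by_label.equals(by_position))

# the same stop value selects a different number of rows
print(health_df.loc[:'B'].index.tolist())
print(health_df.iloc[:1].index.tolist())
```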


#### Boolean Indexing

In [17]:
health_df[health_df.male_yn == 1]

|   | height_cm | weight_kg | age | male_yn |
|---|-----------|-----------|-----|---------|
| A | 152 | 48 | 63 | 1 |
| B | 157 | 53 | 41 | 1 |


In [18]:
health_df[(health_df.male_yn == 1) & (health_df.age > 60)]

|   | height_cm | weight_kg | age | male_yn |
|---|-----------|-----------|-----|---------|
| A | 152 | 48 | 63 | 1 |
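
The expression inside the brackets is itself a boolean Series, one `True` or `False` per row, and masks combine element-wise with `&` (and), `|` (or), and `~` (not). A minimal sketch that names the masks first; the mask names are hypothetical:

```python
import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)

# each comparison returns a boolean Series aligned with the row index
is_male = health_df.male_yn == 1
is_over_60 = health_df.age > 60
print(is_male)

# rows where both masks are True
male_and_over_60 = health_df[is_male & is_over_60]
# rows where at least one mask is False
female_or_60_and_under = health_df[~is_male | ~is_over_60]
print(male_and_over_60.index.tolist())
print(female_or_60_and_under.index.tolist())
```

When the comparisons are written inline, each one needs its own parentheses, because `&` and `|` bind more tightly than `==` and `>`.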


#### Transpose

In [19]:
health_df.T

|   | A | B | C | D |
|---|---|---|---|---|
| height_cm | 152 | 157 | 140 | 137 |
| weight_kg | 48 | 53 | 37 | 32 |
| age | 63 | 41 | 63 | 65 |
| male_yn | 1 | 1 | 0 | 0 |


### Update

#### BMI

$ \mathrm{BMI} = \dfrac{\mathrm{weight}}{\mathrm{height}^{2}} $

Where:

- $ \mathrm{weight} $: weight in kg
- $ \mathrm{height} $: height in m

In [20]:
bmi_s = health_df.weight_kg / (health_df.height_cm/100)**2
# or
#bmi_s = health_df.weight_kg / (health_df.height_cm/100).pow(2)
# or
#bmi_s = health_df.weight_kg / np.power(health_df.height_cm/100, 2)
bmi_s

A    20.775623
B    21.501886
C    18.877551
D    17.049390
dtype: float64
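
The arithmetic on Series is element-wise: every row gets the same formula applied to its own weight and height. A quick check for row A with plain Python numbers, as a sketch; the name `bmi_a` is hypothetical:

```python
import math

import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)

# vectorised: one BMI per row, height converted from cm to m
bmi_s = health_df.weight_kg / (health_df.height_cm / 100) ** 2

# the same formula for row A only: 48 kg and 152 cm
bmi_a = 48 / (152 / 100) ** 2
print(bmi_s['A'], bmi_a)
print(math.isclose(bmi_s['A'], bmi_a))
```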

#### BMR

$ P = 10m + 6.25h - 5a + s $

Where:

- $ P $: BMR, kcal / day
- $ m $: weight in kg
- $ h $: height in cm
- $ a $: age in years
- $ s $: +5 for males, -161 for females

$ \equiv $

$ P = 10m + 6.25h - 5a + 5g - 161(1-g) $

Where:

- $ g $: gender as an integer, 0 for female and 1 for male.


$ \equiv $

$
    \begin{bmatrix}
        P
    \end{bmatrix}
    = 
    \begin{bmatrix}
        m & h & a & g & 1-g
    \end{bmatrix}
    \begin{bmatrix}
        10 \\ 6.25 \\ -5 \\ 5 \\ -161
    \end{bmatrix}
$
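
The three forms above give the same number. A sketch that checks this for person A of the example data (48 kg, 152 cm, 63 years, male) in plain Python; the variable names `p_1`, `p_2`, `p_3`, `row`, and `coef` are hypothetical:

```python
# person A: weight in kg, height in cm, age in years, g = 1 for male
m, h, a, g = 48, 152, 63, 1

# form 1: choose s by sex
s = 5 if g == 1 else -161
p_1 = 10 * m + 6.25 * h - 5 * a + s

# form 2: encode s with g and (1 - g)
p_2 = 10 * m + 6.25 * h - 5 * a + 5 * g - 161 * (1 - g)

# form 3: row vector times column vector, written as a sum of products
row = [m, h, a, g, 1 - g]
coef = [10, 6.25, -5, 5, -161]
p_3 = sum(x * c for x, c in zip(row, coef))

print(p_1, p_2, p_3)
```

Each term is 480 + 950 - 315 + 5, so all three forms evaluate to 1120 kcal / day.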

In [21]:
# work on a copy so that the other cells are not affected
tmp_df = health_df.copy()

In [22]:
tmp_df['female_yn'] = 1 - health_df.male_yn
tmp_df

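|   | height_cm | weight_kg | age | male_yn | female_yn |
|---|-----------|-----------|-----|---------|-----------|
| A | 152 | 48 | 63 | 1 | 0 |
| B | 157 | 53 | 41 | 1 | 0 |
| C | 140 | 37 | 63 | 0 | 1 |
| D | 137 | 32 | 65 | 0 | 1 |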


In [23]:
tmp_df * [6.25, 10, -5, 5, -161]

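|   | height_cm | weight_kg | age | male_yn | female_yn |
|---|-----------|-----------|-----|---------|-----------|
| A | 950.0 | 480.0 | -315.0 | 5.0 | -0.0 |
| B | 981.25 | 530.0 | -205.0 | 5.0 | -0.0 |
| C | 875.0 | 370.0 | -315.0 | 0.0 | -161.0 |
| D | 856.25 | 320.0 | -325.0 | 0.0 | -161.0 |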


In [24]:
# sum the weighted columns of each row to get the BMR
# .sum(): defaults to axis=0 for a DataFrame
# .sum(axis=0): sum down the rows, giving one total per column (variable)
# .sum(axis=1): sum across the columns, giving one total per row (sample)
bmr_s = (tmp_df * [6.25, 10, -5, 5, -161]).sum(axis=1)
bmr_s

A    1120.00
B    1311.25
C     769.00
D     690.25
dtype: float64
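
The `axis` argument is easier to see on a tiny frame. A minimal sketch with made-up numbers; `small_df`, `col_totals`, and `row_totals` are hypothetical names:

```python
import pandas as pd

small_df = pd.DataFrame(
    [[1, 2, 3], [10, 20, 30]],
    columns=['x', 'y', 'z'],
    index=['r0', 'r1'],
)

# axis=0: collapse the rows, one total per column
col_totals = small_df.sum(axis=0)
# axis=1: collapse the columns, one total per row
row_totals = small_df.sum(axis=1)

print(col_totals)
print(row_totals)
```

The result of `axis=0` is indexed by the column labels, and the result of `axis=1` by the row labels, which is why the BMR above comes back with the labels A to D.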

In [25]:
# @: matrix multiplication
bmr_s = tmp_df @ [6.25, 10, -5, 5, -161]
bmr_s

A    1120.00
B    1311.25
C     769.00
D     690.25
dtype: float64
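
With a plain list, the coefficients must be in the same order as the columns. One way to avoid depending on that order is to pass a Series keyed by column name, since `DataFrame.dot` (and therefore `@`) aligns the Series index with the column labels before multiplying. A sketch under that assumption; `coef_s` is a hypothetical name:

```python
import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)
tmp_df = health_df.copy()
tmp_df['female_yn'] = 1 - tmp_df.male_yn

# coefficients keyed by column name, deliberately not in column order
coef_s = pd.Series({
    'weight_kg': 10,
    'height_cm': 6.25,
    'age': -5,
    'male_yn': 5,
    'female_yn': -161,
})

# equivalent to tmp_df.dot(coef_s); labels are matched, not positions
bmr_s = tmp_df @ coef_s
print(bmr_s)
```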

In [26]:
tmp_df = health_df.copy()
tmp_df['bmi'] = bmi_s
tmp_df['bmr'] = bmr_s
tmp_df

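|   | height_cm | weight_kg | age | male_yn | bmi | bmr |
|---|-----------|-----------|-----|---------|-----|-----|
| A | 152 | 48 | 63 | 1 | 20.775623 | 1120.0 |
| B | 157 | 53 | 41 | 1 | 21.501886 | 1311.25 |
| C | 140 | 37 | 63 | 0 | 18.877551 | 769.0 |
| D | 137 | 32 | 65 | 0 | 17.04939 | 690.25 |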
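
Assigning with `tmp_df['bmi'] = ...` changes the frame in place. As an alternative sketch, `DataFrame.assign` returns a new frame with the extra column and leaves the original untouched; the name `with_bmi_df` is hypothetical:

```python
import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)

# the lambda receives the frame being assigned to
with_bmi_df = health_df.assign(
    bmi=lambda df: df.weight_kg / (df.height_cm / 100) ** 2,
)

print(with_bmi_df.columns.tolist())
print(health_df.columns.tolist())  # the original has no bmi column
```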


### Delete

In [27]:
tmp_df = health_df.copy()
tmp_df

|   | height_cm | weight_kg | age | male_yn |
|---|-----------|-----------|-----|---------|
| A | 152 | 48 | 63 | 1 |
| B | 157 | 53 | 41 | 1 |
| C | 140 | 37 | 63 | 0 |
| D | 137 | 32 | 65 | 0 |


In [28]:
del tmp_df['male_yn']
tmp_df

|   | height_cm | weight_kg | age |
|---|-----------|-----------|-----|
| A | 152 | 48 | 63 |
| B | 157 | 53 | 41 |
| C | 140 | 37 | 63 |
| D | 137 | 32 | 65 |
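
`del` removes the column in place. A sketch of the non-mutating alternative, `DataFrame.drop`, which returns a new frame; the name `without_sex_df` is hypothetical:

```python
import pandas as pd

health_df = pd.DataFrame(
    [[152, 48, 63, 1], [157, 53, 41, 1], [140, 37, 63, 0], [137, 32, 65, 0]],
    columns=['height_cm', 'weight_kg', 'age', 'male_yn'],
    index=['A', 'B', 'C', 'D'],
)

# drop returns a new frame; health_df keeps all four columns
without_sex_df = health_df.drop(columns=['male_yn'])

print(without_sex_df.columns.tolist())
print(health_df.columns.tolist())
```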


## NumPy

Let's borrow the [mathematical notation of linear regression](https://en.wikipedia.org/wiki/Linear_regression#Introduction) to calculate the BMR.

$ {\displaystyle \{y_{i},\,x_{i1},\ldots ,x_{ip}\}_{i=1}^{n}} $

* $ y_i $: the dependent variable of the $ i $-th statistical unit.
* $ x_{ip} $: the $ p $-th independent variable of the $ i $-th statistical unit.
* $ n $: the number of statistical units.

$ {\displaystyle y_{i}=\beta _{0}1+\beta _{1}x_{i1}+\cdots +\beta _{p}x_{ip}+\varepsilon _{i}=\mathbf {x} _{i}^{\top }{\boldsymbol {\beta }}+\varepsilon _{i},\qquad i=1,\ldots ,n,} $

* $ \beta $: the parameters (the coefficients).
* $ \varepsilon $: the error term or noise, an unobserved random variable.
* $ \mathbf {x}_{i} $, $ {\boldsymbol {\beta }} $: vectors.

Stack these $ n $ equations:

$ {\displaystyle \mathbf {y} =X{\boldsymbol {\beta }}+{\boldsymbol {\varepsilon }},\,} $

Where:

$ \mathbf {y} ={\begin{pmatrix}y_{1}\\y_{2}\\\vdots \\y_{n}\end{pmatrix}},\quad $

$ {\displaystyle X={\begin{pmatrix}\mathbf {x} _{1}^{\top }\\\mathbf {x} _{2}^{\top }\\\vdots \\\mathbf {x} _{n}^{\top }\end{pmatrix}}={\begin{pmatrix}1&x_{11}&\cdots &x_{1p}\\1&x_{21}&\cdots &x_{2p}\\\vdots &\vdots &\ddots &\vdots \\1&x_{n1}&\cdots &x_{np}\end{pmatrix}},} $

$ {\displaystyle {\boldsymbol {\beta }}={\begin{pmatrix}\beta _{0}\\\beta _{1}\\\beta _{2}\\\vdots \\\beta _{p}\end{pmatrix}},\quad {\boldsymbol {\varepsilon }}={\begin{pmatrix}\varepsilon _{1}\\\varepsilon _{2}\\\vdots \\\varepsilon _{n}\end{pmatrix}}.} $

In [29]:
# the _m suffix hints at a matrix (2-D array)
health_m = np.array(health_lists)

In [30]:
health_m[:, -1:]

array([[1],
       [1],
       [0],
       [0]])

In [31]:
X = np.hstack((health_m, 1-health_m[:, -1:]))
X

array([[152,  48,  63,   1,   0],
       [157,  53,  41,   1,   0],
       [140,  37,  63,   0,   1],
       [137,  32,  65,   0,   1]])
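
The slice `-1:` rather than the integer `-1` is what keeps the last column two-dimensional, and `np.hstack` needs that to append it as a new column. A small sketch of the difference; `last_1d`, `last_2d`, and `raised` are hypothetical names:

```python
import numpy as np

health_m = np.array([
    [152, 48, 63, 1],
    [157, 53, 41, 1],
    [140, 37, 63, 0],
    [137, 32, 65, 0],
])

last_1d = health_m[:, -1]   # integer index: the column axis is dropped
last_2d = health_m[:, -1:]  # slice: the column axis is kept
print(last_1d.shape, last_2d.shape)

# a 2-D block with the same number of rows stacks as an extra column
X = np.hstack((health_m, 1 - last_2d))
print(X.shape)

# a 1-D array cannot be stacked next to a 2-D array
raised = False
try:
    np.hstack((health_m, 1 - last_1d))
except ValueError as err:
    raised = True
    print('1-D column rejected:', err)
```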

In [32]:
[6.25, 10, -5, 5, -161]

[6.25, 10, -5, 5, -161]

In [33]:
np.array([6.25, 10, -5, 5, -161])

array([   6.25,   10.  ,   -5.  ,    5.  , -161.  ])

In [34]:
np.array([6.25, 10, -5, 5, -161])[:, None]

array([[   6.25],
       [  10.  ],
       [  -5.  ],
       [   5.  ],
       [-161.  ]])
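
Indexing with `None` inserts a new axis of length 1, which turns the 1-D array into a column. A sketch of three equivalent spellings; `coef`, `col_a`, `col_b`, and `col_c` are hypothetical names:

```python
import numpy as np

coef = np.array([6.25, 10, -5, 5, -161])

col_a = coef[:, None]        # None adds an axis after the existing one
col_b = coef[:, np.newaxis]  # np.newaxis is an alias for None
col_c = coef.reshape(-1, 1)  # -1 lets NumPy infer the number of rows

print(coef.shape, col_a.shape)
print(np.array_equal(col_a, col_b), np.array_equal(col_a, col_c))
```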

In [35]:
beta = np.array([6.25, 10, -5, 5, -161])[:, None]
# or
# beta = [6.25, 10, -5, 5, -161]
# a plain list (or 1-D array) also works; X @ beta then returns a 1-D array
beta

array([[   6.25],
       [  10.  ],
       [  -5.  ],
       [   5.  ],
       [-161.  ]])

In [36]:
X @ beta

array([[1120.  ],
       [1311.25],
       [ 769.  ],
       [ 690.25]])
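
In the notation above, `X` is the 4 x 5 data matrix, `beta` is the coefficient vector, and their product is $ \mathbf{y} $; there is no error term here because the BMR formula is deterministic, and no column of ones because the formula has no separate intercept. The shape of the result follows the shape of `beta`, as this sketch shows; the names `beta_1d`, `beta_2d`, `y_1d`, and `y_2d` are hypothetical:

```python
import numpy as np

X = np.array([
    [152, 48, 63, 1, 0],
    [157, 53, 41, 1, 0],
    [140, 37, 63, 0, 1],
    [137, 32, 65, 0, 1],
])

beta_1d = np.array([6.25, 10, -5, 5, -161])
beta_2d = beta_1d[:, None]

y_1d = X @ beta_1d  # (4, 5) @ (5,)   -> (4,)
y_2d = X @ beta_2d  # (4, 5) @ (5, 1) -> (4, 1)

print(y_1d.shape, y_2d.shape)
print(np.array_equal(y_1d, y_2d.ravel()))
```

The column form matches the textbook equation, while the 1-D form is closer to the Series that pandas returned earlier.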

## Dig More

- [10 Minutes to pandas – Pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
- [Cookbook – Pandas](https://pandas.pydata.org/pandas-docs/stable/cookbook.html)
- [Broadcasting – NumPy](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html)
- [Array manipulation routines – NumPy](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.array-manipulation.html)
- [Mathematical functions – NumPy](https://docs.scipy.org/doc/numpy-1.13.0/reference/routines.math.html)