# Convexity of Log-Loss/Logistic Objective Function
Consider a logistic regression model with parameters $\theta$ and labeled training data $D = \{(x_i, y_i)\}$, $1 \leq i \leq N$, where $y_i \in \{0, 1\}$ are class labels and $x_i$ are $d$-dimensional feature vectors (you can assume that the first component of $x_i$ is set to the constant 1). Throughout, $x_i$ is treated as a row vector, so $x_i \theta$ is a scalar. The log-loss objective function (i.e., the negative of the log-likelihood for a 2-class problem) is convex as a function of the parameters $\theta$. A twice-differentiable function is convex if and only if its $p \times p$ Hessian matrix $H$ of second partial derivatives (with respect to the $p$ parameters; here $p = d$) is positive semi-definite everywhere, i.e., $c^T H c \geq 0$ for any real-valued column vector $c$ of dimension $p$.


It is therefore sufficient to show that the Hessian of the loss function is positive semi-definite (PSD). The log loss, or cross-entropy error function, is

\begin{equation}
    \mathcal{L} = \sum_{i=1}^{N} -y_i \log \sigma(x_i \theta) - (1 - y_i) \log (1-\sigma(x_i \theta))
\end{equation}
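The loss above can be written in a few lines of NumPy. This is only a minimal sketch: the names `sigmoid` and `log_loss` and the toy data are assumptions made for illustration, and no numerical safeguards (such as clipping probabilities away from 0 and 1) are included.

```python
import numpy as np

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    """Sum over examples of -y*log(p) - (1-y)*log(1-p), with p = sigmoid(X @ theta)."""
    p = sigmoid(X @ theta)
    return float(np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p)))

# Hypothetical toy data: each row is one x_i, first column is the constant 1.
X = np.array([[1.0, 0.5],
              [1.0, -1.2],
              [1.0, 2.0]])
y = np.array([1.0, 0.0, 1.0])

# With theta = 0 every predicted probability is 0.5,
# so the loss reduces to N * log(2).
theta = np.zeros(2)
loss_at_zero = log_loss(theta, X, y)
```

Evaluating at $\theta = 0$ is a convenient sanity check because the result, $N \log 2$, is independent of the data.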

where $\sigma(z) = 1 / (1 + e^{-z})$ is the logistic function. The first partial derivative is (full derivation [here](https://github.com/jordanott/DeepLearning/tree/master/Miscellaneous/Logistic%20Regression.ipynb)):
\begin{equation}
    \frac{\partial \mathcal{L}}{\partial \theta_j} = \sum_{i=1}^{N} [\sigma(x_i \theta) - y_i] x_{i,j}
\end{equation}
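One way to gain confidence in the gradient formula is to compare it with central finite differences of the loss. The sketch below does this on randomly generated data; the function names, the random seed, and the step size `eps` are all illustrative assumptions rather than part of the original derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y):
    p = sigmoid(X @ theta)
    return np.sum(-y * np.log(p) - (1.0 - y) * np.log(1.0 - p))

def gradient(theta, X, y):
    """Analytic gradient: sum_i (sigmoid(x_i theta) - y_i) x_i, i.e. X^T (p - y)."""
    return X.T @ (sigmoid(X @ theta) - y)

def numeric_gradient(theta, X, y, eps=1e-6):
    """Central finite differences, one coordinate of theta at a time."""
    g = np.zeros_like(theta)
    for j in range(theta.size):
        step = np.zeros_like(theta)
        step[j] = eps
        g[j] = (log_loss(theta + step, X, y) - log_loss(theta - step, X, y)) / (2 * eps)
    return g

# Hypothetical random problem: 20 examples, intercept column plus 2 features.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((20, 1)), rng.normal(size=(20, 2))])
y = rng.integers(0, 2, size=20).astype(float)
theta = rng.normal(size=3)

# Largest absolute difference between analytic and numeric gradients.
max_gap = np.max(np.abs(gradient(theta, X, y) - numeric_gradient(theta, X, y)))
```

If the formula is right, `max_gap` should be tiny compared with the gradient entries themselves.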

The second partial derivative is
\begin{equation}
    \frac{\partial^2 \mathcal{L}}{\partial \theta_j \partial \theta_k} = \frac{\partial}{\partial \theta_k} \sum_{i=1}^{N} [\sigma(x_i \theta) - y_i] x_{i,j}
\end{equation}

\begin{equation}
     = \sum_{i=1}^{N} \frac{\partial}{\partial \theta_k} \sigma(x_i \theta)x_{i,j}
\end{equation}

\begin{equation}
     = \sum_{i=1}^{N} \sigma(x_i \theta)(1 - \sigma(x_i \theta))x_{i,j}x_{i,k}
\end{equation}
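The last step uses the identity $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ together with the chain rule, which contributes the factor $x_{i,k}$. As a rough illustration, the sketch below builds the full matrix of second derivatives in NumPy, compares one entry against the explicit sum over $i$, and inspects the smallest eigenvalue. The helper name `hessian`, the random data, and the chosen indices are assumptions for the example only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hessian(theta, X):
    """H[j, k] = sum_i s_i (1 - s_i) x_{i,j} x_{i,k}, with s_i = sigmoid(x_i theta)."""
    s = sigmoid(X @ theta)
    w = s * (1.0 - s)              # per-example weights, at most 0.25
    return X.T @ (w[:, None] * X)  # same as X^T diag(w) X

# Hypothetical random problem: 30 examples, intercept column plus 3 features.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 3))])
theta = rng.normal(size=4)

H = hessian(theta, X)

# Entry-by-entry version of the same sum for one (j, k) pair.
j, k = 1, 2
s = sigmoid(X @ theta)
entry = sum(s[i] * (1.0 - s[i]) * X[i, j] * X[i, k] for i in range(X.shape[0]))

# Smallest eigenvalue of the symmetric matrix H.
min_eig = np.linalg.eigvalsh(H).min()
```

Note that the labels $y_i$ do not appear anywhere in this computation: they drop out when the gradient is differentiated a second time.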

For the diagonal terms, $j = k$:
\begin{equation}
    \frac{\partial^2 \mathcal{L}}{\partial \theta_j^2} = \sum_{i=1}^{N} \sigma(x_i \theta)(1 - \sigma(x_i \theta)) x_{i,j}^2
\end{equation}

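Collecting these entries, the Hessian is $H = \sum_{i=1}^{N} \sigma(x_i \theta)(1 - \sigma(x_i \theta)) x_i^T x_i$. A matrix $A$ is PSD if $c^T A c \geq 0$ for all $c \in \mathbb{R}^d$. For the Hessian,
\begin{equation}
c^T H c = c^T \left[ \sum_{i=1}^{N} \sigma(x_i \theta)(1 - \sigma(x_i \theta)) x_{i}^T x_i \right] c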
\end{equation}

\begin{equation}
= \sum_{i=1}^{N} \sigma(x_i \theta)(1 - \sigma(x_i \theta)) c^T x_i^T x_i c
\end{equation}

\begin{equation}
= \sum_{i=1}^{N} \sigma(x_i \theta)(1 - \sigma(x_i \theta)) (x_i c)^T (x_i c)
\end{equation}

\begin{equation}
= \sum_{i=1}^{N} \sigma(x_i \theta)(1 - \sigma(x_i \theta)) (x_i c)^2
\end{equation}



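Since $\sigma(x_i\theta) \in (0, 1)$, each weight $\sigma(x_i\theta)(1 - \sigma(x_i\theta))$ is positive, and $(x_i c)^2 \geq 0$ because $x_i c$ is a scalar. Every term in the sum is therefore nonnegative, so the quadratic form is nonnegative for every $c$, which implies the Hessian is PSD and the log-loss is convex in $\theta$.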
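
The argument can also be spot-checked numerically. The following sketch, on hypothetical random data, evaluates the quadratic form both as a matrix product and as the weighted sum of squares $\sum_i \sigma(x_i\theta)(1 - \sigma(x_i\theta))(x_i c)^2$ for many random vectors $c$. It is an illustration of the identity, not a substitute for the proof.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical random problem: 25 examples, intercept column plus 2 features.
rng = np.random.default_rng(2)
X = np.hstack([np.ones((25, 1)), rng.normal(size=(25, 2))])
theta = rng.normal(size=3)

s = sigmoid(X @ theta)
w = s * (1.0 - s)                # weights sigma * (1 - sigma)
H = X.T @ (w[:, None] * X)       # Hessian as X^T diag(w) X

quad_forms = []   # c^T H c computed as a matrix product
sum_forms = []    # the same quantity as a sum of weighted squares
for _ in range(100):
    c = rng.normal(size=3)
    quad_forms.append(c @ H @ c)
    sum_forms.append(np.sum(w * (X @ c) ** 2))

quad_forms = np.array(quad_forms)
sum_forms = np.array(sum_forms)
```

The two arrays should agree up to floating-point rounding, and neither should contain a negative value.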