A differentiable function \( f: \mathbb{R}^n \to \mathbb{R} \) is said to be **\( L \)-smooth** if its gradient \( \nabla f \) is **Lipschitz continuous** with constant \( L \geq 0 \). This means that for all \( x, y \in \mathbb{R}^n \), the following inequality holds:

\[
\| \nabla f(x) - \nabla f(y) \| \leq L \| x - y \|
\]

Here, \( \| \cdot \| \) denotes the Euclidean norm. Intuitively, \( L \)-smoothness implies that the gradient of \( f \) does not change too rapidly, and the function has a bounded curvature.

### Loss Functions That Are \( L \)-Smooth
Many common loss functions used in machine learning and optimization are \( L \)-smooth. Below are some examples:

---

#### 1. **Quadratic Loss (Mean Squared Error)**:
   - **Function**: \( f(x) = \frac{1}{2} \| Ax - b \|^2 \)
   - **Gradient**: \( \nabla f(x) = A^T (Ax - b) \)
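   - **Smoothness**: \( f \) is \( L \)-smooth with \( L = \| A^T A \| = \sigma_{\max}(A)^2 \), where \( \| \cdot \| \) is the spectral norm (for the symmetric positive semidefinite matrix \( A^T A \), this is its largest eigenvalue) and \( \sigma_{\max}(A) \) is the largest singular value of \( A \).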
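
As a quick numerical sanity check, the sketch below compares the constant \( L = \| A^T A \| \) with the largest observed ratio \( \| \nabla f(x) - \nabla f(y) \| / \| x - y \| \) over random pairs of points. The matrix `A`, the vector `b`, and their sizes are made-up data chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 10))   # hypothetical design matrix
b = rng.standard_normal(50)         # hypothetical targets

def grad(x):
    """Gradient of f(x) = 0.5 * ||Ax - b||^2."""
    return A.T @ (A @ x - b)

# Spectral norm of A^T A; it equals the squared largest singular value of A.
L = np.linalg.norm(A.T @ A, 2)
sigma_max = np.linalg.norm(A, 2)

# Largest observed ratio ||grad(x) - grad(y)|| / ||x - y|| over random pairs.
max_ratio = 0.0
for _ in range(1000):
    x, y = rng.standard_normal(10), rng.standard_normal(10)
    ratio = np.linalg.norm(grad(x) - grad(y)) / np.linalg.norm(x - y)
    max_ratio = max(max_ratio, ratio)

print(f"L = {L:.4f}, sigma_max^2 = {sigma_max**2:.4f}, max ratio = {max_ratio:.4f}")
```

The observed ratio never exceeds \( L \); random pairs usually stay well below it, because the bound is only attained along the top eigenvector of \( A^T A \).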

---

#### 2. **Logistic Loss**:
   - **Function**: \( f(x) = \sum_{i=1}^n \log(1 + \exp(-y_i (x^T z_i))) \), where \( y_i \in \{-1, 1\} \) and \( z_i \) are data points.
   - **Gradient**: \( \nabla f(x) = \sum_{i=1}^n \frac{-y_i z_i}{1 + \exp(y_i (x^T z_i))} \)
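   - **Smoothness**: The logistic loss is \( L \)-smooth with \( L = \frac{1}{4} \| Z^T Z \| \), where \( Z \) is the data matrix whose rows are the \( z_i^T \) and \( \| \cdot \| \) is the spectral norm. The factor \( \frac{1}{4} \) is the maximum value of \( \sigma(t)(1 - \sigma(t)) \) for the sigmoid \( \sigma(t) = 1 / (1 + e^{-t}) \).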
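
The constant \( \frac{1}{4} \| Z^T Z \| \) can be checked numerically in the same way as for the quadratic loss. This is a minimal sketch on randomly generated data; `Z`, `y`, and the sample sizes are assumptions made for the example, with the rows of `Z` playing the role of the \( z_i \).

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((40, 5))        # rows are the data points z_i (hypothetical)
y = rng.choice([-1.0, 1.0], size=40)    # labels in {-1, +1}

def logistic_grad(x):
    """Gradient of sum_i log(1 + exp(-y_i * x^T z_i))."""
    margins = y * (Z @ x)
    weights = -y / (1.0 + np.exp(margins))
    return Z.T @ weights

# Smoothness constant from the text: one quarter of the spectral norm of Z^T Z.
L = 0.25 * np.linalg.norm(Z.T @ Z, 2)

# Largest observed gradient difference quotient over random pairs of points.
max_ratio = 0.0
for _ in range(1000):
    x, w = rng.standard_normal(5), rng.standard_normal(5)
    ratio = np.linalg.norm(logistic_grad(x) - logistic_grad(w)) / np.linalg.norm(x - w)
    max_ratio = max(max_ratio, ratio)

print(f"L = {L:.4f}, max observed ratio = {max_ratio:.4f}")
```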

---

#### 3. **Hinge Loss (with Regularization)**:
   - **Function**: \( f(x) = \sum_{i=1}^n \max(0, 1 - y_i (x^T z_i)) + \frac{\lambda}{2} \| x \|^2 \), where \( \lambda > 0 \) is a regularization parameter.
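   - **Subgradient**: \( g(x) = \sum_{i=1}^n -y_i z_i \cdot \mathbb{I}(y_i (x^T z_i) < 1) + \lambda x \) is a subgradient of \( f \) at \( x \); the gradient does not exist wherever some \( y_i (x^T z_i) = 1 \).
   - **Smoothness**: The hinge loss is not \( L \)-smooth, because it is not differentiable at \( y_i (x^T z_i) = 1 \), and adding the regularization term \( \frac{\lambda}{2} \| x \|^2 \) does not change that: the sum of a non-differentiable function and a smooth one is still non-differentiable. What the regularizer does add is strong convexity with parameter \( \lambda \). Smooth alternatives with a similar shape include the squared hinge loss \( \max(0, 1 - y_i (x^T z_i))^2 \), whose gradient is Lipschitz continuous.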

---

#### 4. **Softmax Loss**:
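   - **Function**: \( f(x) = -\sum_{i=1}^n \sum_{c=1}^k y_{ic} \log p_{ic} \) with \( p_{ic} = \frac{\exp(x_c^T z_i)}{\sum_{j=1}^k \exp(x_j^T z_i)} \), where \( x = (x_1, \dots, x_k) \) stacks one weight vector per class, \( z_i \) are data points, and \( y_i = (y_{i1}, \dots, y_{ik}) \) is a one-hot encoded label.
   - **Gradient**: \( \nabla_{x_c} f(x) = \sum_{i=1}^n (p_{ic} - y_{ic}) z_i \) for each class \( c \).
   - **Smoothness**: The softmax loss is \( L \)-smooth with \( L = \| Z^T Z \| \), where \( Z \) is the data matrix whose rows are the \( z_i^T \). This constant is not tight: the Hessian of the log-sum-exp function has spectral norm at most \( \frac{1}{2} \), so \( L = \frac{1}{2} \| Z^T Z \| \) also works.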

---

#### 5. **Regularized Linear Regression**:
   - **Function**: \( f(x) = \frac{1}{2} \| Ax - b \|^2 + \frac{\lambda}{2} \| x \|^2 \), where \( \lambda > 0 \) is a regularization parameter.
   - **Gradient**: \( \nabla f(x) = A^T (Ax - b) + \lambda x \)
   - **Smoothness**: The function is \( L \)-smooth with \( L = \| A^T A \| + \lambda \).

---

#### 6. **Strongly Convex Functions**:
   - If a function \( f \) is strongly convex with parameter \( \mu \) and \( L \)-smooth, then \( \mu \leq L \). Many regularized loss functions (e.g., ridge regression) are both strongly convex and \( L \)-smooth.
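
For ridge regression both constants can be read off the eigenvalues of the Hessian \( A^T A + \lambda I \): the smallest eigenvalue is \( \mu \) and the largest is \( L \). The sketch below computes them for a made-up matrix `A` and a made-up value `lam` (both are assumptions for illustration) and reports the condition number \( L / \mu \).

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((30, 8))   # hypothetical design matrix
lam = 0.5                          # hypothetical regularization parameter

# Hessian of 0.5 * ||Ax - b||^2 + 0.5 * lam * ||x||^2 (it does not depend on x).
H = A.T @ A + lam * np.eye(8)

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigs = np.linalg.eigvalsh(H)
mu, L = eigs[0], eigs[-1]
kappa = L / mu   # condition number, which governs the linear rate of gradient descent

print(f"mu = {mu:.4f}, L = {L:.4f}, condition number = {kappa:.2f}")
```

Because \( A^T A \) is positive semidefinite, \( \mu \geq \lambda > 0 \), so the regularizer guarantees strong convexity even when \( A \) is rank deficient.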

---

### Key Properties of \( L \)-Smooth Functions
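1. **Lipschitz Gradient**: If \( f \) is \( L \)-smooth, the gradient \( \nabla f \) changes by at most \( L \| x - y \| \) between any two points \( x \) and \( y \). The gradient itself need not be bounded (for the quadratic loss it grows linearly in \( x \)); only its rate of change is. This control is what convergence analyses of gradient-based optimization algorithms rely on.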
2. **Quadratic Upper Bound**: For an \( L \)-smooth function, the following inequality holds:
   \[
   f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2} \| y - x \|^2
   \]
   This property is often used to prove convergence of gradient descent.
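3. **Convexity**: If \( f \) is convex and \( L \)-smooth and attains its minimum at some \( x^* \), then gradient descent with step size \( 1/L \) satisfies \( f(x_k) - f(x^*) \leq \frac{L \| x_0 - x^* \|^2}{2k} \). That is, the function values converge to the global minimum at a rate of \( O(1/k) \), where \( k \) is the number of iterations.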
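
These properties can be observed on a small least-squares problem. The sketch below runs gradient descent with step size \( 1/L \) and compares the optimality gap with the standard bound \( \frac{L \| x_0 - x^* \|^2}{2k} \) for convex \( L \)-smooth functions. The data, the starting point, and the iteration count are arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((60, 12))   # hypothetical data
b = rng.standard_normal(60)

def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad(x):
    return A.T @ (A @ x - b)

L = np.linalg.norm(A.T @ A, 2)                  # smoothness constant
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # exact minimizer, for reference
f_star = f(x_star)

x0 = np.zeros(12)
x = x0.copy()
values = [f(x)]
for k in range(1, 201):
    x = x - grad(x) / L          # gradient step with step size 1/L
    values.append(f(x))

# Optimality gap after k steps and the O(1/k) bound L * ||x0 - x*||^2 / (2k).
gaps = [v - f_star for v in values[1:]]
bounds = [L * np.linalg.norm(x0 - x_star) ** 2 / (2 * k) for k in range(1, 201)]

print(f"gap after 200 steps: {gaps[-1]:.3e}, bound: {bounds[-1]:.3e}")
```

On this problem the gap shrinks much faster than the bound, because the least-squares objective here is also strongly convex; the \( O(1/k) \) rate is a worst-case guarantee.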

---

### Verifying \( L \)-Smoothness
To verify that a twice-differentiable function \( f \) is \( L \)-smooth, you can:
1. Compute the Hessian \( \nabla^2 f(x) \).
2. Check that the spectral norm of the Hessian is bounded by \( L \), i.e., \( \| \nabla^2 f(x) \| \leq L \) for all \( x \). For twice continuously differentiable \( f \), this condition is equivalent to \( L \)-smoothness.

For example, for the quadratic loss \( f(x) = \frac{1}{2} \| Ax - b \|^2 \), the Hessian is \( \nabla^2 f(x) = A^T A \), and its spectral norm is \( \| A^T A \| \), so \( f \) is \( L \)-smooth with \( L = \| A^T A \| \).
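
The same test can be worked through for the logistic loss from the list above. In this sketch, \( \sigma(t) = 1 / (1 + e^{-t}) \) is the sigmoid, \( m_i = y_i (x^T z_i) \) is the margin of sample \( i \), and \( Z \) is assumed to have the \( z_i^T \) as rows; the symbols \( \sigma \), \( m_i \), and \( D(x) \) are introduced only for this derivation. Since \( y_i^2 = 1 \),

\[
\nabla^2 f(x) = \sum_{i=1}^n \sigma(m_i) \left( 1 - \sigma(m_i) \right) z_i z_i^T = Z^T D(x) Z, \qquad D(x) = \operatorname{diag}\left( \sigma(m_i) (1 - \sigma(m_i)) \right).
\]

Each diagonal entry satisfies \( 0 < \sigma(m_i)(1 - \sigma(m_i)) \leq \frac{1}{4} \), so

\[
\| \nabla^2 f(x) \| = \| Z^T D(x) Z \| \leq \frac{1}{4} \| Z^T Z \| \quad \text{for all } x,
\]

which gives \( L = \frac{1}{4} \| Z^T Z \| \).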

---

In summary, many common loss functions in machine learning are \( L \)-smooth, which makes them amenable to optimization using gradient-based methods like gradient descent. The value of \( L \) depends on the specific function and its parameters (e.g., data matrix \( A \), regularization parameter \( \lambda \)).

Gradient boosting with regression trees fits each tree to the negative gradient of the loss with respect to the current predictions, so it is easiest to analyze when the loss is differentiable and has a Lipschitz continuous gradient (L-smooth). Here are some common loss functions used in this context, most of which are L-smooth, along with relevant citations:

1. **Squared Error Loss (L2 Loss)**:
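   - This is the most common loss function used for regression tasks. It is defined as the squared difference between the predicted and actual values, \( (y - \hat{y})^2 \), and is L-smooth in \( \hat{y} \) with \( L = 2 \) (or \( L = 1 \) with the conventional factor of \( \frac{1}{2} \)).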
   - Citation: Friedman, J. H. (2001). "Greedy function approximation: A gradient boosting machine." *Annals of Statistics*, 29(5), 1189-1232.

2. **Huber Loss**:
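   - Huber loss is robust to outliers and combines the properties of L1 and L2 losses. It is quadratic for errors smaller than a threshold \( \delta \) and linear for larger errors. In its standard form the gradient is continuous and 1-Lipschitz, so the loss is L-smooth with \( L = 1 \), although it is not twice differentiable at \( |y - \hat{y}| = \delta \).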
   - Citation: Huber, P. J. (1964). "Robust Estimation of a Location Parameter." *Annals of Mathematical Statistics*, 35(1), 73-101.

3. **Quantile Loss**:
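   - This loss function is used for quantile regression, which estimates the conditional quantiles of the response variable. It is useful for modeling the median or other quantiles. Unlike the other losses in this list, it is not L-smooth: it is piecewise linear, and its gradient jumps at zero residual. Gradient boosting can still use it through subgradients, but results that assume L-smoothness do not apply directly.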
   - Citation: Koenker, R., & Bassett, G. (1978). "Regression Quantiles." *Econometrica*, 46(1), 33-50.

4. **Log-Cosh Loss**:
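   - Log-cosh loss, \( \log \cosh(y - \hat{y}) \), is another smooth alternative to the squared error loss. It behaves like \( \frac{1}{2} (y - \hat{y})^2 \) for small errors and like \( |y - \hat{y}| \) for large ones, so it is less sensitive to outliers than L2 loss. Its second derivative is \( \operatorname{sech}^2(y - \hat{y}) \leq 1 \), so it is L-smooth with \( L = 1 \).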
   - While there isn't a specific seminal paper for log-cosh loss, it is a well-known function in the machine learning community.

5. **Pseudo-Huber Loss**:
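   - This is a smooth approximation of the Huber loss, \( \delta^2 \left( \sqrt{1 + ((y - \hat{y}) / \delta)^2} - 1 \right) \), which is twice differentiable everywhere. Its second derivative is at most 1, so it is L-smooth with \( L = 1 \).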
   - Citation: Barron, A. R. (1989). "Statistical properties of artificial neural networks." *Proceedings of the IEEE Conference on Decision and Control*, 280-285.

Apart from the quantile loss, which has a kink at zero residual, these loss functions are differentiable with Lipschitz continuous gradients, which facilitates the optimization process in gradient boosting with regression trees. When selecting a loss function, consider the specific characteristics of your data and the robustness required for your application.
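
To make the smoothness claims concrete, the sketch below writes out the derivatives of the Huber, log-cosh, and pseudo-Huber losses with respect to the residual `r` and estimates each Lipschitz constant from finite differences on a grid. The convention `r = y_hat - y`, the threshold `delta = 1.0`, and the helper `max_slope` are assumptions made for this example; they are not taken from any particular boosting library.

```python
import numpy as np

def huber_grad(r, delta=1.0):
    """Derivative of the Huber loss: r for |r| <= delta, delta * sign(r) otherwise."""
    return np.clip(r, -delta, delta)

def log_cosh_grad(r):
    """Derivative of log(cosh(r))."""
    return np.tanh(r)

def pseudo_huber_grad(r, delta=1.0):
    """Derivative of delta^2 * (sqrt(1 + (r / delta)^2) - 1)."""
    return r / np.sqrt(1.0 + (r / delta) ** 2)

def max_slope(g, lo=-10.0, hi=10.0, n=200001):
    """Largest finite-difference slope of g on a grid: an empirical Lipschitz constant."""
    r = np.linspace(lo, hi, n)
    return np.max(np.abs(np.diff(g(r)) / np.diff(r)))

for name, g in [("huber", huber_grad),
                ("log-cosh", log_cosh_grad),
                ("pseudo-huber", pseudo_huber_grad)]:
    print(f"{name}: empirical L = {max_slope(g):.4f}")
```

All three derivatives are bounded in magnitude (by `delta` or by 1), which is the source of the robustness to outliers, and all three have slope at most 1, which is the smoothness constant.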

Estimating the upper bound of data, such as a high quantile, can be effectively handled using quantile regression techniques. The standard quantile loss is not L-smooth because of its kink at zero residual, but smoothed versions of it and the expectile loss are, and all of them can be used for gradient boosting with regression trees. Here are some approaches and relevant citations:

1. **Quantile Loss (Pinball Loss)**:
   - Quantile loss, also known as pinball loss, is used to estimate quantiles. It is defined as:
     \[
     L(y, \hat{y}) = \max(q(y - \hat{y}), (q - 1)(y - \hat{y}))
     \]
     where \( q \) is the quantile level (e.g., 0.95 for the 95th percentile).
   - This loss function is convex and piecewise linear, but it is not differentiable at \( y = \hat{y} \), so it is not L-smooth. It can still be optimized using subgradients.
   - Citation: Koenker, R., & Bassett, G. (1978). "Regression Quantiles." *Econometrica*, 46(1), 33-50.

2. **Smooth Approximations of Quantile Loss**:
   - To ensure smoothness, you can use smooth approximations of the quantile loss. These approximations are differentiable everywhere and can be more amenable to optimization algorithms that assume an L-smooth objective.
   - One approach is to replace the kink of the pinball loss with a smooth transition whose width is controlled by a small smoothing parameter. A smaller parameter gives a closer approximation of the pinball loss but a larger smoothness constant \( L \).

3. **Expectile Loss**:
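   - Expectile regression is an alternative to quantile regression and can be used to estimate upper bounds. Expectiles are similar to quantiles but are based on minimizing an asymmetric squared loss, \( |\tau - \mathbb{I}(y < \hat{y})| \, (y - \hat{y})^2 \), for an expectile level \( \tau \in (0, 1) \). This loss is L-smooth with \( L = 2 \max(\tau, 1 - \tau) \).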
   - Citation: Newey, W. K., & Powell, J. L. (1987). "Asymmetric Least Squares Estimation and Testing." *Econometrica*, 55(4), 819-847.
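
The three losses in this list can be sketched as follows, on the residual `r = y - y_hat`. The softplus-based smoothing `q * r + alpha * log(1 + exp(-r / alpha))` is only one possible smooth approximation, chosen here for illustration; the smoothing parameter `alpha`, its default value, and all function names are assumptions of this example. With this choice the smoothed loss stays within `alpha * log(2)` of the pinball loss, and its gradient is Lipschitz with constant `1 / (4 * alpha)`.

```python
import numpy as np

def pinball_loss(r, q):
    """Pinball loss on residual r = y - y_hat: max(q * r, (q - 1) * r)."""
    return np.maximum(q * r, (q - 1.0) * r)

def smooth_pinball_loss(r, q, alpha=0.1):
    """Softplus smoothing of the pinball loss (one possible choice, see lead-in)."""
    return q * r + alpha * np.logaddexp(0.0, -r / alpha)

def smooth_pinball_grad(r, q, alpha=0.1):
    """Derivative of the smoothed loss in r: q - sigmoid(-r / alpha)."""
    return q - 1.0 / (1.0 + np.exp(r / alpha))

def expectile_loss(r, tau):
    """Asymmetric squared loss: |tau - 1[r < 0]| * r^2."""
    return np.where(r < 0, 1.0 - tau, tau) * r ** 2

q, alpha = 0.95, 0.1
r = np.linspace(-5.0, 5.0, 10001)

# How far the smoothed loss is from the pinball loss (largest at r = 0).
gap = smooth_pinball_loss(r, q, alpha) - pinball_loss(r, q)

# Empirical Lipschitz constant of the smoothed gradient from finite differences.
slopes = np.abs(np.diff(smooth_pinball_grad(r, q, alpha)) / np.diff(r))

print(f"max gap = {gap.max():.4f} (alpha * log 2 = {alpha * np.log(2):.4f})")
print(f"empirical L = {slopes.max():.4f} (1 / (4 * alpha) = {1 / (4 * alpha):.4f})")
```

Shrinking `alpha` tightens the approximation but increases the smoothness constant in proportion to `1 / alpha`, which is the trade-off to keep in mind when picking the smoothing parameter.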

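Smooth approximations of the quantile loss and the expectile loss are L-smooth choices for estimating upper bounds or high quantiles in data using gradient boosting with regression trees; the plain quantile loss can also be used, through subgradients. When implementing these methods, it's important to choose the appropriate quantile level or expectile level based on the specific upper bound you wish to estimate, keeping in mind that the expectile at a given level is generally not the same value as the quantile at that level.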