search.json

[
  {
    "objectID": "project/proposals_and_projects.html",
    "href": "project/proposals_and_projects.html",
    "title": "Proposal",
    "section": "",
    "text": "This may be done as a group project or individually. The size of the group must not exceed two. You need to formulate a problem for which a Gaussian process model is a reasonable modelling choice. As such your problem can take the form of regression, classification, design of experiment, or even unsupervised learning. You are permitted to use any open-source python platform for your chosen problem, however, you may also choose to code up your implementation from scratch.\nThe data that underscores your problem should be open-source; you are not permitted to work on problems of a proprietary nature. It is completely acceptable to have a problem that is aligned with your research interests, although your work will be judged by what you do during this project and not prior research.\nA standard Gaussian process workflow assumes a prior based on a 1D RBF or Matern or periodic kernel. It also assumes that both prior and likelihood are Gaussian. You must articulate how your problem and the solution you are proposing requires either (i) a richer likelihood function, or (ii) a more complex covariance function.\n\n\nThe first part of this assignment is a proposal. The proposal must address the following questions:\n\nWhat is input / output sought? [5]\nWhat type of kernels will be considered? [5]\nWhat are 1-2 relevant papers? [5]\nWhat is particularly novel about the implementation (prior, likelihood, kernel, mean function, etc.)? [3]\nWhat is the data source being used? [2]\n\nKindly ensure that your proposal has a list of references. The main body of the proposal (without references) should not exceed one page. Once submitted, you will receive feedback on your proposal which will help with your project.\n\n\n\n\n\n\nBelow you will find a non-exhaustive list of ideas that would satisfy the points above:\n\nDesign a Gaussian process model for a 2D/3D velocity field based on sparse velocity measurements. 
Describe and code how you would make the covariance functions (a) divergence-free and (b) curl-free. For training and testing data you may consider analytical equations such as the Taylor-Green vortex system.\nDesign a Gaussian process model for sensor data that exhibits some symmetry that is captured in the construction of the covariance function. If you plan to work on the JetBot project, then your proposal should focus on velocity magnitude measurements taken at several locations within a rectangular area. All measurements are at the same height and the objective is to use as few measurements as possible to estimate the velocity magnitude field.\nDesign a Gaussian process model that is constructed across a sphere, or more generally, over a space that is not a hypercube. If using polar or spherical coordinates, ensure the relevant [0, 2π] periodicity. One example would be modelling temperature or pressure across the globe using sparse measurements from ECMWF or NOAA.\nLinear operators acting on a Gaussian process yield another Gaussian process. Build a Gaussian process that leverages this for both derivative and integral operations, i.e., you observe both integral (e.g., location) and derivative (e.g., acceleration) data for a Gaussian process model (e.g., velocity).\nAny Gaussian process model with multiple output quantities, where one needs to account for correlations between the outputs.\nA Gaussian process model that forecasts the latitude, longitude and altitude of flights departing a given airport. A project like this may leverage historical flight trajectory data found on websites such as Flightradar24.\nGaussian process models where one also wants to reduce the dimensionality associated with the inputs; this can take the form of a dimension-reducing subspace or even a deep neural network kernel."
  },
  {
    "objectID": "project/proposals_and_projects.html#instructions",
    "href": "project/proposals_and_projects.html#instructions",
    "title": "Proposal",
    "section": "",
    "text": "This may be done as a group project or individually. The size of the group must not exceed two. You need to formulate a problem for which a Gaussian process model is a reasonable modelling choice. As such your problem can take the form of regression, classification, design of experiment, or even unsupervised learning. You are permitted to use any open-source python platform for your chosen problem, however, you may also choose to code up your implementation from scratch.\nThe data that underscores your problem should be open-source; you are not permitted to work on problems of a proprietary nature. It is completely acceptable to have a problem that is aligned with your research interests, although your work will be judged by what you do during this project and not prior research.\nA standard Gaussian process workflow assumes a prior based on a 1D RBF or Matern or periodic kernel. It also assumes that both prior and likelihood are Gaussian. You must articulate how your problem and the solution you are proposing requires either (i) a richer likelihood function, or (ii) a more complex covariance function.\n\n\nThe first part of this assignment is a proposal. The proposal must address the following questions:\n\nWhat is input / output sought? [5]\nWhat type of kernels will be considered? [5]\nWhat are 1-2 relevant papers? [5]\nWhat is particularly novel about the implementation (prior, likelihood, kernel, mean function, etc.)? [3]\nWhat is the data source being used? [2]\n\nKindly ensure that your proposal has a list of references. The main body of the proposal (without references) should not exceed one page. Once submitted, you will receive feedback on your proposal which will help with your project.\n\n\n\n\n\n\nBelow you will find a non-exhaustive list of ideas that would satisfy the points above:\n\nDesign a Gaussian process model for a 2D/3D velocity field based on sparse velocity measurements. 
Describe and code how you would make the covariance functions (a) divergence-free and (b) curl-free. For training and testing data you may consider analytical equations such as the Taylor-Green vortex system.\nDesign a Gaussian process model for sensor data that exhibits some symmetry that is captured in the construction of the covariance function. If you plan to work on the JetBot project, then your proposal should focus on velocity magnitude measurements taken at several locations within a rectangular area. All measurements are at the same height and the objective is to use as few measurements as possible to estimate the velocity magnitude field.\nDesign a Gaussian process model that is constructed across a sphere, or more generally, over a space that is not a hypercube. If using polar or spherical coordinates, ensure the relevant [0, 2π] periodicity. One example would be modelling temperature or pressure across the globe using sparse measurements from ECMWF or NOAA.\nLinear operators acting on a Gaussian process yield another Gaussian process. Build a Gaussian process that leverages this for both derivative and integral operations, i.e., you observe both integral (e.g., location) and derivative (e.g., acceleration) data for a Gaussian process model (e.g., velocity).\nAny Gaussian process model with multiple output quantities, where one needs to account for correlations between the outputs.\nA Gaussian process model that forecasts the latitude, longitude and altitude of flights departing a given airport. A project like this may leverage historical flight trajectory data found on websites such as Flightradar24.\nGaussian process models where one also wants to reduce the dimensionality associated with the inputs; this can take the form of a dimension-reducing subspace or even a deep neural network kernel."
  },
  {
    "objectID": "coding-1/coding-1.html",
    "href": "coding-1/coding-1.html",
    "title": "Build your own GP",
    "section": "",
    "text": "Assignment 2\nThis assignment requires you to fit a Gaussian process model to the Mauna Loa dataset. It is a univariate dataset that comprises the monthly average carbon dioxide concentration, measured in parts per million.\nYou will find the data at this site. You will have to download the monthly_in_situ_co2_mlo.csv file directly. Details about the data can be found in reference 1.\nThe plot below shows the data: year vs. CO2 (ppm). Some rows of the table have -99.99 values; these may be ignored. Note that as there are 12 measurements per year (1 for each month), utilizing just the year as the covariate is not appropriate, and that is why the “Date” or third column must be used.\nYour training data must be limited to all years before 2014, i.e., you may only use CO2 concentrations in the years 1958 to 2013. It is entirely your decision whether you wish to use all this data, or select a subset.\nA plot of all the data is shown below.\n\n\nCode\nimport pandas as pd\nimport matplotlib.pyplot as plt\nimport numpy as np\n\ndf = pd.read_csv('data.csv')\ndf2014 = df[df['Date'] < 2014]\ndfnew = df[df['Date'] >= 2014]\n\nfig = plt.figure(figsize=(10,4))\nplt.plot(df2014['Date'].values, df2014['CO2'].values, 'o', ms=1, color='crimson', label='Pre 2014 (training)')\nplt.plot(dfnew['Date'].values, dfnew['CO2'].values, 'o', ms=1, color='dodgerblue', label='2014 and later')\nplt.legend()\nplt.axvline(x=2014, color=\"grey\")\nplt.xlabel('Year')\nplt.ylabel(r'$CO_2$ concentration (ppm)')\nplt.show()\n\n\n\n\n\nDespite the fact that this is a univariate dataset, it is challenging as it requires multiple kernel functions. Ten minutes on your favorite search engine will give you some clues. Your grade will be determined via the following criteria.\n\nAppropriate importing of the data and filtering of non-relevant rows. I will run your code on the “.csv” file as provided on the Scripps website. 
You cannot submit your amended version of the data.\nUse of multiple kernel functions, justifying what exactly each kernel is doing.\nA well-documented Jupyter notebook with equations for all the relevant formulas and code. If your code does not run, or produces an error upon running, you will lose a lot of marks.\n\nOne approach for hyperparameter inference (e.g., maximum likelihood, cross-validation, Markov chain Monte Carlo, etc.). Please note that the signal noise need not be optimized over (but can be if you wish).\nYou will have to analytically calculate any gradients for hyperparameter inference. To clarify, code that does not use gradients, or code where the gradients are incorrect, will not receive full marks. To check your gradients you can always use finite differences.\n\nYou are restricted to the following libraries: numpy, seaborn, matplotlib, scipy, pandas. Thus, you will have to build a lot of the codebase yourself.\nThe last plot in your submission should have the same data as the plot above (both pre- and post-2014), along with predictive posterior mean and standard deviation contours.\n\n\nDue date: 15th March 2024 | 21:00 on Canvas.\nGrading rubric [marks in brackets]:\n\nData importing [5]\nGP model architecture (i.e., kernels) [5]\nHyperparameter inference [10]\nClarity of documentation [5]\n\n\n\n\nReference\n\nC. D. Keeling, S. C. Piper, R. B. Bacastow, M. Wahlen, T. P. Whorf, M. Heimann, and H. A. Meijer, Exchanges of atmospheric CO2 and 13CO2 with the terrestrial biosphere and oceans from 1978 to 2000. I. Global aspects, SIO Reference Series, No. 01-06, Scripps Institution of Oceanography, San Diego, 88 pages, 2001."
  },
  {
    "objectID": "useful_codes/curl.html",
    "href": "useful_codes/curl.html",
    "title": "Curl free",
    "section": "",
    "text": "Code\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport matplotlib\nfrom scipy import stats\nfrom copy import deepcopy\nimport pymc as pm\nimport pytensor\nimport pytensor.tensor as tt\nfrom pymc.gp.cov import Covariance\nfrom functools import partial\nfrom pytensor.tensor.linalg import cholesky, eigh, solve_triangular\nfrom scipy.stats import multivariate_normal\nsolve_lower = partial(solve_triangular, lower=True)\nsolve_upper = partial(solve_triangular, lower=False)\n\n\nWARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\nCode\nm = 40\nn = 40\ns = np.linspace(0, 1, m) * 4\nt = np.linspace(0, 1, n) * 4\n[S, T] = np.meshgrid(s, t)\nSS = S.flatten()\nTT = T.flatten()\nN = SS.shape[0]\nXpred = np.hstack([TT.reshape(N,1), SS.reshape(N,1)])\nCode\nnp.random.seed(seed=10)\nnum_random_points = 7\nX_init = np.array([[3, 4], \n                   [2, 1],\n                   [0, 3.5], \n                   [2, 2], \n                   [3, 0], \n                   [1, 1],\n                   [1.5, 3]]).reshape(num_random_points,2)\nvel_x = -np.sin(X_init[:,0]) * X_init[:,1]\nvel_y =  np.cos(X_init[:,0])\nsigma_noise = 1e-6\nCode\nvel_x_truth = -np.sin(Xpred[:,0]) * Xpred[:,1]\nvel_y_truth = np.cos(Xpred[:,0])\nvel_mag_truth = np.sqrt(vel_x_truth**2 + vel_y_truth**2)\nCode\nplt.scatter(X_init[:,0], X_init[:,1], c='w', s=40, lw=1, edgecolor='k')\nplt.quiver(X_init[:,0], X_init[:,1], vel_x, vel_y)\nplt.show()\nCode\nX_init = np.vstack([X_init, X_init])\nprint(X_init)\n\n\n[[3.  4. ]\n [2.  1. ]\n [0.  3.5]\n [2.  2. ]\n [3.  0. ]\n [1.  1. ]\n [1.5 3. ]\n [3.  4. ]\n [2.  1. ]\n [0.  3.5]\n [2.  2. ]\n [3.  0. ]\n [1.  1. ]\n [1.5 3. ]]\nCode\ny_init = np.vstack([vel_x.reshape(num_random_points,1), vel_y.reshape(num_random_points,1)]).flatten()\nCode\ny_init\n\n\narray([-0.56448003, -0.90929743, -0.        , -1.81859485, -0.        ,\n       -0.84147098, -2.99248496, -0.9899925 , -0.41614684,  1.        
,\n       -0.41614684, -0.9899925 ,  0.54030231,  0.0707372 ])"
  },
  {
    "objectID": "useful_codes/curl.html#standard-approach-independent-gps-for-each-velocity-component.",
    "href": "useful_codes/curl.html#standard-approach-independent-gps-for-each-velocity-component.",
    "title": "Curl free",
    "section": "Standard approach – Independent GPs for each velocity component.",
    "text": "Standard approach – Independent GPs for each velocity component.\n\n\nCode\nwith pm.Model() as model2:\n    \n    sigma_f = pm.HalfNormal(\"sigma_f\", sigma=1)\n    l = pm.HalfNormal(\"l\", sigma=1.0)\n    cov = SquaredExp(2, sigma_f, l)\n    gp = pm.gp.Marginal(cov_func=cov)\n    y_ = gp.marginal_likelihood(\"y_\", X=X_init[0:num_random_points,:].reshape(num_random_points, 2), \\\n                                      y=y_init[0:num_random_points] - np.mean(y_init[0:num_random_points]), \\\n                                      sigma=1e-4)\n\n\n\n\nCode\nwith model2:\n    mp2 = pm.find_MAP()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCode\nwith model2:\n    post_mean2, post_covar2 = gp.predict(Xpred, point=mp2, diag=False)\n\n\n\n\nCode\nwith pm.Model() as model3:\n    \n    sigma_f = pm.HalfNormal(\"sigma_f\", sigma=1)\n    l = pm.HalfNormal(\"l\", sigma=1.0)\n    cov = SquaredExp(2, sigma_f, l)\n    gp = pm.gp.Marginal(cov_func=cov)\n    y_ = gp.marginal_likelihood(\"y_\", X=X_init[0:num_random_points,:].reshape(num_random_points, 2), \\\n                                      y=y_init[num_random_points:] - np.mean(y_init[num_random_points:]), \\\n                                      sigma=1e-4)\n\n\n\n\nCode\nwith model3:\n    mp3 = pm.find_MAP()\n    \n        \nwith model3:\n    post_mean3, post_covar3 = gp.predict(Xpred, point=mp3, 
diag=False)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCode\nvelocity_x_mean_gp = post_mean2\nvelocity_y_mean_gp = post_mean3\nvelocity_mag_mean_gp = np.sqrt(velocity_x_mean_gp**2 + velocity_y_mean_gp**2 )\nvelocity_x_std_gp = np.sqrt(np.diag(post_covar2))\nvelocity_y_std_gp = np.sqrt(np.diag(post_covar3))\n\n\n\n\nCode\nnorm = matplotlib.colors.Normalize(vmin=np.min(velocity_mag_mean_gp),\\\n                                    vmax=np.max(velocity_mag_mean_gp))\n\nfig = plt.figure(figsize=(15,4))\nax1 = plt.subplot(131)\nc = ax1.contourf(T, S, velocity_mag_mean_gp.reshape(n, m), 50, cmap=plt.cm.turbo, norm=norm)\nplt.quiver(Xpred[:,0], Xpred[:,1], velocity_x_mean_gp, velocity_y_mean_gp, headwidth=5, scale=10)\nplt.scatter(X_init[0:num_random_points,0], X_init[0:num_random_points,1], c='w', s=70, lw=1, edgecolor='k')\ncbar = plt.colorbar(c, pad=0.05, shrink=0.6)\ncbar.ax.tick_params(labelsize=13)\nax1.set_yticklabels([])\nax1.set_xticklabels([])\nplt.xlabel(r'$x_1$')\nplt.ylabel(r'$x_2$')\nax1.set_title('Posterior velocity mag. 
with vectors', fontsize=13)\n\n\nnorm = matplotlib.colors.Normalize(vmin=np.min(velocity_x_std_gp),\\\n                                    vmax=np.max(velocity_x_std_gp))\n\nax2 = plt.subplot(132)\nc = ax2.contourf(T, S, velocity_x_std_gp.reshape(n, m), 50, cmap=plt.cm.turbo, norm=norm)\nplt.scatter(X_init[0:num_random_points,0], X_init[0:num_random_points,1], c='w', s=70, lw=1, edgecolor='k')\ncbar = plt.colorbar(c, pad=0.05, shrink=0.6)\ncbar.ax.tick_params(labelsize=13)\nax2.set_yticklabels([])\nax2.set_xticklabels([])\nplt.xlabel(r'$x_1$')\nplt.ylabel(r'$x_2$')\nax2.set_title('Posterior velocity-x std dev.', fontsize=13)\n\nnorm = matplotlib.colors.Normalize(vmin=np.min(velocity_y_std_gp),\\\n                                    vmax=np.max(velocity_y_std_gp))\n\nax3 = plt.subplot(133)\nc = ax3.contourf(T, S, velocity_y_std_gp.reshape(n, m), 50, cmap=plt.cm.turbo, norm=norm)\nplt.scatter(X_init[0:num_random_points,0], X_init[0:num_random_points,1], c='w', s=70, lw=1, edgecolor='k')\ncbar = plt.colorbar(c, pad=0.05, shrink=0.6)\ncbar.ax.tick_params(labelsize=13)\nax3.set_yticklabels([])\nax3.set_xticklabels([])\nplt.xlabel(r'$x_1$')\nplt.ylabel(r'$x_2$')\nax3.set_title('Posterior velocity-y std dev.', fontsize=13)\nplt.savefig('velocity_gp.png', dpi=170, bbox_inches='tight', transparent=True)\nplt.show()"
  },
  {
    "objectID": "useful_codes/curl.html#debugging-below",
    "href": "useful_codes/curl.html#debugging-below",
    "title": "Curl free",
    "section": "Debugging below",
    "text": "Debugging below\n\n\nCode\nsigma_f2 = 0.7767**2\nl2 = 1.6295**2\n\n# Standalone (non-method) version of the covariance, so there is no `self` argument\ndef full(X, Xs=None):\n    if Xs is None:\n        Xs = X\n    m = int(X.shape[0]/2)\n    n = int(Xs.shape[0]/2)\n\n    X = X[0:m,:]\n    Xs = Xs[0:n,:]\n\n    m2 = int( m * 2 )\n    n2 = int( n * 2 )\n\n    del_X = get_X_minus_X(X, Xs, 0)\n    del_Y = get_X_minus_X(X, Xs, 1)\n    del_X2 = del_X**2\n    del_Y2 = del_Y**2\n    \n\n    K = np.zeros((m2,n2))\n\n    K[0:m, 0:n] = dK_yy(del_X2,del_Y2)\n    K[0:m,n:n2] = dK_yx(del_X2,del_Y2, del_X, del_Y)\n\n    K[m:m2,0:n] = dK_yx(del_X2,del_Y2, del_X, del_Y)\n    K[m:m2, n:n2] = dK_xx(del_X2,del_Y2)\n    \n    return K, get_X_minus_X(X, Xs, 0), get_X_minus_X(X, Xs, 1)\n\ndef get_X_minus_X(X, Xs, input_dim):\n    k, v = X.shape[0], Xs.shape[0]\n    M = X[:,input_dim].reshape(k,1)\n    Ms = Xs[:,input_dim].reshape(v,1)\n    return np.reshape(M, (-1,1)) - np.reshape(Ms, (1,-1))\n\ndef dK_xx(del_X2, del_Y2):\n    # (sigma_f^2*exp(-(x^2 + y^2)/(2*l^2))*(l^2 - x^2))/l^4\n    return (sigma_f2 * \\\n            np.exp(-(del_X2 + del_Y2)/(2*l2) ) * \\\n            (l2  - del_X2))/l2**2\n\ndef dK_yy(del_X2, del_Y2):\n    # (sigma_f^2*exp(-(x^2 + y^2)/(2*l^2))*(l^2 - y^2))/l^4\n    return (sigma_f2 * \n            np.exp(-(del_X2 + del_Y2)/(2*l2) ) * \\\n            (l2 - del_Y2))/l2**2\n\n#def dK_xy(self, del_X2, del_Y2):\n#    # - l^4*sigma_f^2*x*y*exp(-(l^2*(x^2 + y^2))/2)\n#    return tt.square(self.l2) * self.sigma_f2 * tt.sqrt(del_X2) * tt.sqrt(del_Y2) * \\\n#            tt.exp(-0.5 * self.l2 * (del_X2 + del_Y2))\n\ndef dK_yx(del_X2, del_Y2, del_X, del_Y):\n    # (sigma_f^2*x*y*exp(-(x^2 + y^2)/(2*l^2)))/l^4\n    return  (sigma_f2 * del_X * del_Y * \\\n            np.exp(-(del_X2 + del_Y2)/(2*l2)))/l2**2\n\n\n\n\nCode\nK, delX, delY = full(X_init, X_init)\n\n\n\n\nCode\nc = plt.imshow(K)\nplt.colorbar(c)\n\n\n\n\nCode\nfrom scipy.linalg import cholesky\n\n\n\n\nCode\ncholesky(K + 1e-12 * 
np.eye(K.shape[0]))\n\n\n\n\nCode\nnp.savetxt('value.txt', np.around(K, 4))\n\n\n\n\nCode\nnp.around(K, 4)\n\n\n\n\nCode\nnp.savetxt(\"foo.csv\", K, delimiter=\",\")"
  },
  {
    "objectID": "useful_codes/vi.html",
    "href": "useful_codes/vi.html",
    "title": "Variational inference",
    "section": "",
    "text": "Bayes’ rule\nLet us begin with Bayes’ rule\n\\[\np \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) = \\frac{p \\left(\\mathbf{t} | \\mathbf{f} \\right)p \\left( \\mathbf{f} | \\mathbf{X} \\right) }{p \\left( \\mathbf{t} | \\mathbf{X} \\right) }\n\\]\nwhere, assuming a Gaussian likelihood (i.e., a Gaussian noise model), we have\n\\[\n\\textrm{Likelihood}: p \\left(\\mathbf{t} | \\mathbf{f} \\right) = \\mathcal{N} \\left( \\mathbf{f} , \\sigma^2 \\mathbf{I} \\right)\n\\]\n\\[\n\\textrm{Prior}: p \\left( \\mathbf{f} | \\mathbf{X} \\right) = \\mathcal{N} \\left( \\mathbf{0}, \\mathbf{K} \\right)\n\\]\n\\[\n\\textrm{Posterior}: p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) = \\mathcal{N} \\left( \\boldsymbol{\\mu}, \\boldsymbol{\\Sigma} \\right)\n\\]\n\\[\n\\textrm{Marginal likelihood (or evidence)}: p \\left( \\mathbf{t} | \\mathbf{X} \\right)\n\\]\nwhere\n\\[\n\\boldsymbol{\\mu} = \\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X}' \\right) \\left[ \\mathbf{K}\\left( \\mathbf{X} , \\mathbf{X}' \\right)  + \\sigma^2 \\mathbf{I} \\right]^{-1} \\mathbf{t}\n\\]\nand\n\\[\n\\boldsymbol{\\Sigma} = \\mathbf{K}\\left( \\mathbf{X} , \\mathbf{X}' \\right) - \\mathbf{K}\\left( \\mathbf{X} , \\mathbf{X}' \\right) \\left[ \\mathbf{K}\\left( \\mathbf{X} , \\mathbf{X}' \\right)  + \\sigma^2 \\mathbf{I} \\right]^{-1} \\mathbf{K}^{T}\\left( \\mathbf{X} , \\mathbf{X}' \\right)\n\\]\nFor non-Gaussian likelihoods, one cannot express the posterior in terms of the mean and covariance terms above. Thus, we require a strategy to approximate the posterior without having to resort to Markov chain Monte Carlo. 
For Gaussian likelihoods, as we have already established, the closed-form posterior above requires inverting a matrix of size \\(N \\times N\\) where \\(N\\) corresponds to the number of training data points.\n\n\nThe ELBO\nThe objective of variational inference is to approximate the exact posterior by introducing a variational distribution, \\(q \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)\\). One seeks to minimize the Kullback-Leibler divergence between the exact and variational distribution. This is given by\n\\[\nKL \\left[ \\underbrace{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)}_{\\textrm{approximate posterior}} || \\underbrace{p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)}_{\\textrm{true posterior}} \\right]\n\\]\nUsing the definition of the KL divergence, this may be re-written as\n\\[\n\\begin{aligned}\nKL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) \\right] & = \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; \\frac{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)}{p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\right]\n\\end{aligned}\n\\]\nNow using Bayes’ rule, we have\n\\[\n\\begin{aligned}\nKL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) \\right] & = \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; \\frac{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)}{\\frac{p \\left(\\mathbf{t} | \\mathbf{f} \\right)p \\left( \\mathbf{f} | \\mathbf{X} \\right) }{p \\left( \\mathbf{t} | \\mathbf{X} \\right) }} \\right]\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\nKL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) 
\\right] & = \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)\\right] - \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right)p \\left( \\mathbf{f} | \\mathbf{X} \\right)  \\right] + \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) \\right]\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\nKL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) \\right] & = \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)\\right] - \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right) \\right] - \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log p \\left( \\mathbf{f} | \\mathbf{X} \\right)  \\right] + \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) \\right]\n\\end{aligned}\n\\]\n\\[\n\\begin{aligned}\nKL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) \\right] & = \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; \\frac{ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)}{ p \\left( \\mathbf{f} | \\mathbf{X} \\right)} \\right] - \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right) \\right] + \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} 
\\right)} \\left[ log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) \\right]\n\\end{aligned}\n\\]\nSo, we can express the expectation of the marginal likelihood as\n\\[\n\\begin{aligned}\nKL \\left[  q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) ||  p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) \\right] & = KL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X}\\right) \\right]  - \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right) \\right] + \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) \\right] \\\\\nKL \\left[  q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) ||  p \\left( \\mathbf{f} | \\mathbf{X} , \\mathbf{t}\\right) \\right] & = KL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X} \\right) \\right]  - \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right) \\right]  + log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) \\\\\nKL \\left[  q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) ||  p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) \\right] &  =  log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) + \\underbrace{KL \\left[ q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) || p \\left( \\mathbf{f} | \\mathbf{X}\\right) \\right]  - \\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right) \\right]}_{-ELBO} \\\\\nKL \\left[  q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) ||  p \\left( \\mathbf{f} | \\mathbf{X} , \\mathbf{t}\\right) \\right] &  = log \\; p \\left( \\mathbf{t} | 
\\mathbf{X} \\right) - ELBO \\left( \\phi \\right)\\\\\n\\end{aligned}\n\\]\nNote that the expectation of the marginal likelihood is the expectation of a constant – i.e., it does not change when any variational parameters change – resulting in the \\(log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right)\\) term above sans the expectation. The ELBO acronym above is short for evidence lower bound. It provides a lower bound on the log marginal likelihood, i.e.,\n\\[\n\\begin{aligned}\nELBO \\left( \\phi \\right) & = log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right)  - KL \\left[  q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) ||  p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) \\right]\n\\end{aligned}\n\\]\nAs the KL divergence is non-negative, we can write\n\\[\n\\begin{aligned}\nELBO \\left( \\phi \\right) & \\leq log \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right)\n\\end{aligned}\n\\]\nSince the log marginal likelihood does not depend on \\(\\phi\\), maximizing the ELBO is equivalent to minimizing the KL divergence between the approximate posterior and the true posterior. To evaluate the ELBO, we need to compute:\n\nThe KL divergence between the approximate posterior and the prior, \\[KL \\left[  q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) ||  p \\left( \\mathbf{f} | \\mathbf{X} \\right) \\right];\\]\nThe expected log-likelihood under the approximate posterior, \\[\\mathbb{E}_{q_{\\phi} \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)} \\left[ log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right) \\right].\\]\n\nFor the latter, the likelihood must factorize across data points to avoid an \\(N\\)-dimensional integral.\n\n\nAn alternate derivation\nThis alternate derivation is based on a blog post by Ryan Adams. 
It starts off by modifying Bayes’ rule to arrive at an expression for the marginal likelihood with respect to the posterior\n\\[\np \\left( \\mathbf{t} | \\mathbf{X} \\right)  = \\frac{p \\left(\\mathbf{t} | \\mathbf{f} \\right)p \\left( \\mathbf{f} | \\mathbf{X} \\right) }{p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right) }.\n\\]\nTaking the log on both sides then yields\n\\[\nlog \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) = log \\; p \\left(\\mathbf{t} | \\mathbf{f} \\right) + log \\; p \\left( \\mathbf{f} | \\mathbf{X} \\right) - log \\; p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)\n\\]\nNow, we can combine the first two terms on the right hand side\n\\[\nlog \\; p \\left( \\mathbf{t} | \\mathbf{X} \\right) = log \\; p \\left(\\mathbf{t} , \\mathbf{f} | \\mathbf{X} \\right) - log \\; p \\left( \\mathbf{f} | \\mathbf{X}, \\mathbf{t} \\right)\n\\]\n\n\nGP demonstration\n\n\nCode\nimport numpy as np\nimport pymc as pm\nimport matplotlib.pyplot as plt\nimport arviz as az\nimport pandas as pd\nimport seaborn as sns\n\n\n\n\nCode\n# Training data\nn = 80 \nX = np.linspace(0, 10, n)[:, None]  \n\n# Define the true covariance function and its parameters\nell_true = 1.0\neta_true = 3.0\ncov_func = eta_true**2 * pm.gp.cov.Matern52(1, ell_true)\nmean_func = pm.gp.mean.Zero()\nf_true = np.random.multivariate_normal(\n    mean_func(X).eval(), cov_func(X).eval() + 1e-8 * np.eye(n), 1\n).flatten()\nsigma_true = 2.0\n\n# True signal is corrupted by random noise\ny = f_true + sigma_true * np.random.randn(n)\n\n## Plot the data and the unobserved latent function\nfig = plt.figure(figsize=(8, 5))\nax = fig.gca()\nax.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nax.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Data\")\nax.set_xlabel(\"X\")\nax.set_ylabel(\"The true f(x)\")\nplt.legend();\n\n\n\n\n\n\n\nCode\nwith pm.Model() as model:\n    ell = pm.Gamma(\"ell\", alpha=2, beta=1)\n    eta = pm.HalfCauchy(\"eta\", beta=5)\n\n    cov = eta**2 * 
pm.gp.cov.Matern52(1, ell)\n    gp = pm.gp.Marginal(cov_func=cov)\n\n    sigma = pm.HalfCauchy(\"sigma\", beta=5)\n    y_ = gp.marginal_likelihood(\"y\", X=X, y=y, sigma=sigma)\n\n\n\n\nCode\nwith model:\n    vi = pm.fit(method='advi')\n\n\n\n\n\n\n\n    \n      \n      100.00% [10000/10000 00:12&lt;00:00 Average Loss = 179.49]\n    \n    \n\n\nFinished [100%]: Average Loss = 179.46\n\n\n\n\nCode\nvi_elbo = pd.DataFrame(\n    {'log-ELBO': -np.log(vi.hist),\n     'n': np.arange(vi.hist.shape[0])})\n\n_ = sns.lineplot(y='log-ELBO', x='n', data=vi_elbo)\n\n\n\n\n\n\n\nCode\n# Test values\nX_new = np.linspace(0, 20, 600)[:, None]\n\nadvi_trace = vi.sample(10000)\n\n# add the GP conditional to the model, given the new X values\nwith model:\n    f_pred = gp.conditional(\"f_pred\", X_new)\n\nwith model:\n    pred_samples = pm.sample_posterior_predictive(\n        advi_trace.sel(draw=slice(0, 50)), var_names=[\"f_pred\"] # using 50 samples from the chain\n    )\n\n\nSampling: [f_pred]\n\n\n\n\n\n\n\n    \n      \n      100.00% [51/51 00:37&lt;00:00]\n    \n    \n\n\n\n\nCode\n# plot the results\nfig = plt.figure(figsize=(12, 5))\nax = fig.gca()\n\n# plot the samples from the gp posterior with samples and shading\nfrom pymc.gp.util import plot_gp_dist\n\nf_pred_samples = az.extract(pred_samples, group=\"posterior_predictive\", var_names=[\"f_pred\"])\nplot_gp_dist(ax, samples=f_pred_samples.T, x=X_new)\n\n# plot the data and the true latent function\nplt.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nplt.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Observed data\")\n\n# axis labels and title\nplt.xlabel(\"X\")\nplt.ylim([-5, 8])\nplt.title(\"Variational Inference Result\")\nplt.legend();"
  },
  {
    "objectID": "useful_codes/deep_gp.html",
    "href": "useful_codes/deep_gp.html",
    "title": "",
    "section": "",
    "text": "CodeShow All CodeHide All Code\n\n\n\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nfrom scipy.linalg import cholesky, solve_triangular\nimport seaborn as sns\n\n\n\n\nCode\ndef kernel(xa, xb, amp, ll):\n    Xa, Xb = get_tiled(xa, xb)\n    return amp**2 * np.exp(-0.5 * 1./ll**2 * (Xa - Xb)**2 )\n\ndef get_tiled(xa, xb):\n    m, n = len(xa), len(xb)\n    xa, xb = xa.reshape(m,1) , xb.reshape(n,1)\n    Xa = np.tile(xa, (1, n))\n    Xb = np.tile(xb.T, (m, 1))\n    return Xa, Xb\n\ndef get_posterior(amp, ll, x, x_data, y_data, noise):\n    u = y_data.shape[0]\n    mu_y = np.mean(y_data)\n    y = (y_data - mu_y).reshape(u,1)\n    Sigma = noise * np.eye(u)\n    \n    Kxx = kernel(x_data, x_data, amp, ll)\n    Kxpx = kernel(x, x_data, amp, ll)\n    Kxpxp = kernel(x, x, amp, ll)\n    \n    # Inverse\n    jitter = np.eye(u) * 1e-12\n    L = cholesky(Kxx + Sigma + jitter)\n    S1 = solve_triangular(L.T, y, lower=True)\n    S2 = solve_triangular(L.T, Kxpx.T, lower=True).T\n    \n    mu = S2 @ S1  + mu_y\n    cov = Kxpxp - S2 @ S2.T\n    return mu, cov\n\n\n\n\nCode\nX = np.linspace(0, 1, 100)\n\ndef get_prior(X):\n    mu = np.zeros_like(X)\n    cov = kernel(X, X, amp=1.0, ll=0.25)\n    prior = multivariate_normal(mu, cov, allow_singular=True)\n    return prior\n\n\n\n\nCode\ndef random_sample3():\n    zj_1 = get_prior(X)\n    us = []\n    for j in range(0, 6):\n        uj = zj_1.rvs(1)\n        zj = get_prior(zj_1.rvs(1))\n        zj_1 = zj\n        us.append(uj)\n    return us\n\n\nG = 10\nU1c = np.zeros((G, 100))\nU2c = np.zeros((G, 100))\nU3c = np.zeros((G, 100))\nU4c = np.zeros((G, 100))\nfor j in range(0, G):\n    us = random_sample3()\n    U1c[j,:] = us[-4]\n    U2c[j,:] = us[-3]\n    U3c[j,:] = us[-2]\n    U4c[j,:] = us[-1]\n\n\n\n\nCode\nfig = plt.figure(layout='constrained', figsize=(8, 6))\nplt.subplot(221)\nplt.title('Layer 3')\nplt.plot(X, U1c.T, 
alpha=0.5, lw=2)\nplt.xlabel('x')\nplt.subplot(222)\nplt.title('Layer 4')\nplt.plot(X, U2c.T, alpha=0.5, lw=2)\nplt.subplot(223)\nplt.title('Layer 5')\nplt.plot(X, U3c.T, alpha=0.5, lw=2)\nplt.subplot(224)\nplt.title('Layer 6')\nplt.plot(X, U4c.T, alpha=0.5, lw=2)\nplt.savefig('layers1.png', dpi=170, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\n\n\nCode\ndef random_sample():\n    z1 = get_prior(X)\n    u1 = z1.rvs(1)\n    z2 = get_prior(z1.rvs(1))\n    u2 = z2.rvs(1)\n    z3 = get_prior(z2.rvs(1))\n    u3 = z3.rvs(1)\n    z4 = get_prior(z3.rvs(1))\n    u4 = z4.rvs(1)\n    return u1, u2, u3, u4\n\nG = 15\nU1 = np.zeros((G, 100))\nU2 = np.zeros((G, 100))\nU3 = np.zeros((G, 100))\nU4 = np.zeros((G, 100))\nfor j in range(0, G):\n    u1, u2, u3, u4 = random_sample()\n    U1[j,:] = u1\n    U2[j,:] = u2\n    U3[j,:] = u3\n    U4[j,:] = u4\n\n\n\n\nCode\nfig = plt.figure(layout='constrained', figsize=(8, 6))\nplt.subplot(221)\nplt.title('Layer 1')\nplt.plot(X, U1.T, alpha=0.5, lw=2)\nplt.xlabel('x')\nplt.subplot(222)\nplt.title('Layer 2')\nplt.plot(X, U2.T, alpha=0.5, lw=2)\nplt.subplot(223)\nplt.title('Layer 
3')\nplt.plot(X, U3.T, alpha=0.5, lw=2)\nplt.subplot(224)\nplt.title('Layer 4')\nplt.plot(X, U4.T, alpha=0.5, lw=2)\nplt.savefig('layers.png', dpi=170, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\nCode\ndef random_sample2():\n    z1 = get_prior(X)\n    u1 = z1.rvs(1)\n    z2 = get_prior(z1.rvs(1))\n    u2 = z2.rvs(1)\n    z3 = get_prior(z2.rvs(1))\n    u3 = z3.rvs(1)\n    z4 = get_prior(z3.rvs(1))\n    u4 = z4.rvs(1)\n    \n    z5 = get_prior(z4.rvs(1))\n    u5 = z5.rvs(1)\n    \n    z6 = get_prior(z5.rvs(1))\n    u6 = z6.rvs(1)\n    \n    z7 = get_prior(z6.rvs(1))\n    u7 = z7.rvs(1)\n    \n    z8 = get_prior(z7.rvs(1))\n    u8 = z8.rvs(1)\n    \n    z9 = get_prior(z8.rvs(1))\n    u9 = z9.rvs(1)\n    \n    return u6, u7, u8, u9\n\nG = 15\nU1b = np.zeros((G, 100))\nU2b = np.zeros((G, 100))\nU3b = np.zeros((G, 100))\nU4b = np.zeros((G, 100))\nfor j in range(0, G):\n    u1, u2, u3, u4 = random_sample2()\n    U1b[j,:] = u1\n    U2b[j,:] = u2\n    U3b[j,:] = u3\n    U4b[j,:] = u4\n\n\n\n\nCode\nfig = plt.figure(layout='constrained', figsize=(8, 6))\nplt.subplot(221)\nplt.title('Layer 6')\nplt.plot(X, U1b.T, alpha=0.5, lw=2)\nplt.xlabel('x')\nplt.subplot(222)\nplt.title('Layer 7')\nplt.plot(X, U2b.T, alpha=0.5, lw=2)\nplt.subplot(223)\nplt.title('Layer 8')\nplt.plot(X, U3b.T, alpha=0.5, lw=2)\nplt.subplot(224)\nplt.title('Layer 9')\nplt.plot(X, U4b.T, alpha=0.5, lw=2)\nplt.savefig('layers2.png', dpi=170, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\nCode\nfig = plt.figure(layout='constrained', figsize=(8, 6))\nplt.subplot(221)\nplt.title('Layer 17')\nplt.plot(X, U1b.T, alpha=0.5, lw=2)\nplt.xlabel('x')\nplt.subplot(222)\nplt.title('Layer 18')\nplt.plot(X, U2b.T, alpha=0.5, lw=2)\nplt.subplot(223)\nplt.title('Layer 19')\nplt.plot(X, U3b.T, alpha=0.5, lw=2)\nplt.subplot(224)\nplt.title('Layer 20')\nplt.plot(X, U4b.T, alpha=0.5, lw=2)\nplt.savefig('layers3.png', dpi=170, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\nCode\nx = 
np.random.rand(80)*2 - 1\ny = np.sign(x) + np.random.randn(80)*0.05\n\n\n\n\nCode\nplt.plot(x, y, '.')\nplt.show()\n\n\n\n\nCode\nimport GPy\n# GPy expects 2D arrays of shape (N, 1) for both inputs and targets\nm_full = GPy.models.GPRegression(x[:, None], y[:, None])\n_ = m_full.optimize() # Optimize parameters of covariance function"
  },
  {
    "objectID": "useful_codes/gp101.html",
    "href": "useful_codes/gp101.html",
    "title": "GP 101",
    "section": "",
    "text": "This notebook provides a quick start guide to building a Gaussian process model using only numpy, scipy, pandas, and matplotlib.\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nfrom scipy.linalg import cholesky, solve_triangular\nimport seaborn as sns\n\n\n\n\nIn this tutorial, we will use the Olympic gold dataset that we have used quite a few times in lectures. First, we shall use pandas to retrieve the data.\n\n\nCode\ndf = pd.read_csv('data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\n\nThe code block below defines the kernel, a utility function for tiling, and the posterior calculation. To clarify, this is the predictive posterior distribution evaluated at some test locations, \\(\\mathbf{X}_{\\ast}\\). The expressions for the predictive posterior mean and covariance are given by:\n\\[\n\\begin{aligned}\n\\mathbb{E} \\left[ \\mathbf{y}_{\\ast} | \\mathbf{X}_{\\ast} \\right] & = \\mathbf{K}\\left( \\mathbf{X}_{\\ast}, \\mathbf{X} \\right) \\left[\\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X} \\right) + \\sigma_{n}^2 \\mathbf{I} \\right]^{-1} \\mathbf{y} \\\\\nCovar\\left[ \\mathbf{y}_{\\ast} | \\mathbf{X}_{\\ast} \\right] & = \\mathbf{K}\\left( \\mathbf{X}_{\\ast}, \\mathbf{X}_{\\ast} \\right) - \\mathbf{K}\\left( \\mathbf{X}_{\\ast}, \\mathbf{X} \\right) \\left[\\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X} \\right) + \\sigma_{n}^2 \\mathbf{I} \\right]^{-1} \\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X}_{\\ast} \\right)\n\\end{aligned}\n\\]\n\n\nCode\ndef kernel(xa, xb, amp, ll):\n    Xa, Xb = get_tiled(xa, xb)\n    return amp**2 * np.exp(-0.5 * 1./ll**2 * (Xa - Xb)**2 )\n\ndef get_tiled(xa, xb):\n    m, n = len(xa), len(xb)\n    xa, xb = xa.reshape(m,1) , xb.reshape(n,1)\n    Xa = np.tile(xa, (1, n))\n    Xb = np.tile(xb.T, (m, 1))\n    return Xa, Xb\n\ndef get_posterior(amp, ll, x, x_data, y_data, noise):\n    u = 
y_data.shape[0]\n    mu_y = np.mean(y_data)\n    y = (y_data - mu_y).reshape(u,1)\n    Sigma = noise * np.eye(u)\n    \n    Kxx = kernel(x_data, x_data, amp, ll)\n    Kxpx = kernel(x, x_data, amp, ll)\n    Kxpxp = kernel(x, x, amp, ll)\n    \n    # Inverse\n    jitter = np.eye(u) * 1e-12\n    L = cholesky(Kxx + Sigma + jitter)\n    S1 = solve_triangular(L.T, y, lower=True)\n    S2 = solve_triangular(L.T, Kxpx.T, lower=True).T\n    \n    mu = S2 @ S1  + mu_y\n    cov = Kxpxp - S2 @ S2.T\n    return mu, cov\n\n\n\n\nCode\nXt = np.linspace(1890, 2022, 200) # test data locations (years)\n\n# Hyperparameters (note these are not optimized!)\nlength_scale = 7.0\namplitude = 0.8\n\n\nnoise_variance = 0.1\nmu, cov = get_posterior(amplitude, length_scale, Xt, df['Year'].values, df['Time'].values, noise_variance)\n\n\n\n\nCode\nXt = Xt.flatten()\nmu = mu.flatten() \nstd = np.sqrt(np.diag(cov)).flatten()\n\nfig = plt.figure(figsize=(8, 5))\nplt.plot(Xt, mu, '-', label=r'$\\mu$', color='navy')\nplt.fill_between(Xt, mu+std, mu-std, color='blue', alpha=0.2, label=r'$\\sigma$')\nplt.plot(df['Year'].values, df['Time'].values, 'go', label='Data', ms=8)\nplt.xlabel('Years')\nplt.ylabel('Winning times')\nplt.legend()\nplt.show()"
  },
  {
    "objectID": "useful_codes/gp101.html#overview",
    "href": "useful_codes/gp101.html#overview",
    "title": "GP 101",
    "section": "",
    "text": "This notebook provides a quick start guide to building a Gaussian process model using only numpy, scipy, pandas, and matplotlib.\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nfrom scipy.linalg import cholesky, solve_triangular\nimport seaborn as sns\n\n\n\n\nIn this tutorial, we will use the Olympic gold dataset that we have used quite a few times in lectures. First, we shall use pandas to retrieve the data.\n\n\nCode\ndf = pd.read_csv('data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\n\nThe code block below defines the kernel, a utility function for tiling, and the posterior calculation. To clarify, this is the predictive posterior distribution evaluated at some test locations, \\(\\mathbf{X}_{\\ast}\\). The expressions for the predictive posterior mean and covariance are given by:\n\\[\n\\begin{aligned}\n\\mathbb{E} \\left[ \\mathbf{y}_{\\ast} | \\mathbf{X}_{\\ast} \\right] & = \\mathbf{K}\\left( \\mathbf{X}_{\\ast}, \\mathbf{X} \\right) \\left[\\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X} \\right) + \\sigma_{n}^2 \\mathbf{I} \\right]^{-1} \\mathbf{y} \\\\\nCovar\\left[ \\mathbf{y}_{\\ast} | \\mathbf{X}_{\\ast} \\right] & = \\mathbf{K}\\left( \\mathbf{X}_{\\ast}, \\mathbf{X}_{\\ast} \\right) - \\mathbf{K}\\left( \\mathbf{X}_{\\ast}, \\mathbf{X} \\right) \\left[\\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X} \\right) + \\sigma_{n}^2 \\mathbf{I} \\right]^{-1} \\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X}_{\\ast} \\right)\n\\end{aligned}\n\\]\n\n\nCode\ndef kernel(xa, xb, amp, ll):\n    Xa, Xb = get_tiled(xa, xb)\n    return amp**2 * np.exp(-0.5 * 1./ll**2 * (Xa - Xb)**2 )\n\ndef get_tiled(xa, xb):\n    m, n = len(xa), len(xb)\n    xa, xb = xa.reshape(m,1) , xb.reshape(n,1)\n    Xa = np.tile(xa, (1, n))\n    Xb = np.tile(xb.T, (m, 1))\n    return Xa, Xb\n\ndef get_posterior(amp, ll, x, x_data, y_data, noise):\n    u = 
y_data.shape[0]\n    mu_y = np.mean(y_data)\n    y = (y_data - mu_y).reshape(u,1)\n    Sigma = noise * np.eye(u)\n    \n    Kxx = kernel(x_data, x_data, amp, ll)\n    Kxpx = kernel(x, x_data, amp, ll)\n    Kxpxp = kernel(x, x, amp, ll)\n    \n    # Inverse\n    jitter = np.eye(u) * 1e-12\n    L = cholesky(Kxx + Sigma + jitter)\n    S1 = solve_triangular(L.T, y, lower=True)\n    S2 = solve_triangular(L.T, Kxpx.T, lower=True).T\n    \n    mu = S2 @ S1  + mu_y\n    cov = Kxpxp - S2 @ S2.T\n    return mu, cov\n\n\n\n\nCode\nXt = np.linspace(1890, 2022, 200) # test data locations (years)\n\n# Hyperparameters (note these are not optimized!)\nlength_scale = 7.0\namplitude = 0.8\n\n\nnoise_variance = 0.1\nmu, cov = get_posterior(amplitude, length_scale, Xt, df['Year'].values, df['Time'].values, noise_variance)\n\n\n\n\nCode\nXt = Xt.flatten()\nmu = mu.flatten() \nstd = np.sqrt(np.diag(cov)).flatten()\n\nfig = plt.figure(figsize=(8, 5))\nplt.plot(Xt, mu, '-', label=r'$\\mu$', color='navy')\nplt.fill_between(Xt, mu+std, mu-std, color='blue', alpha=0.2, label=r'$\\sigma$')\nplt.plot(df['Year'].values, df['Time'].values, 'go', label='Data', ms=8)\nplt.xlabel('Years')\nplt.ylabel('Winning times')\nplt.legend()\nplt.show()"
  },
  {
    "objectID": "useful_codes/eigen.html",
    "href": "useful_codes/eigen.html",
    "title": "Eigenfunction analysis of kernels",
    "section": "",
    "text": "This attempts to describe kernels. The hope is that, after going through this, the reader appreciates just how powerful kernels are, and the central role they play in Gaussian process models.\n\n\nCode\n### Data \nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom IPython.display import display, HTML\n\n\n\n\nNow, we shall be interested in mapping from a kernel to a feature map. This leads us to Mercer’s theorem, which states that: A symmetric function \\(k \\left( \\mathbf{x}, \\mathbf{x}' \\right)\\) can be expressed as the inner product\n\\[\nk \\left( \\mathbf{x}, \\mathbf{x}' \\right) = \\phi^{T} \\left( \\mathbf{x} \\right) \\phi \\left( \\mathbf{x}'\\right) = \\left\\langle \\phi \\left( \\mathbf{x} \\right), \\phi\\left( \\mathbf{x}' \\right) \\right\\rangle\n\\]\nfor some feature map \\(\\phi\\) if and only if \\(k \\left( \\mathbf{x}, \\mathbf{x}' \\right)\\) is positive semidefinite, i.e.,\n\\[\n\\int k \\left( \\mathbf{x}, \\mathbf{x}' \\right) g \\left(  \\mathbf{x} \\right)  g \\left(  \\mathbf{x}' \\right) d \\mathbf{x} d \\mathbf{x}' \\geq 0\n\\]\nfor all real \\(g\\).\nOne possible set of features corresponds to eigenfunctions. A function \\(\\nu\\left( \\mathbf{x} \\right)\\) that satisfies the integral equation\n\\[\n\\int k \\left( \\mathbf{x}, \\mathbf{x}' \\right) \\nu \\left( \\mathbf{x} \\right) d  \\mathbf{x}  = \\lambda  \\nu \\left( \\mathbf{x}' \\right)\n\\]\nis termed an eigenfunction of the kernel \\(k\\). In the expression above, \\(\\lambda\\) is the corresponding eigenvalue. While the integral above is taken with respect to \\(\\mathbf{x}\\), more formally, it can be taken with respect to either a density \\(\\rho \\left( \\mathbf{x} \\right)\\), or the Lebesgue measure over a compact subset of \\(\\mathbb{R}^{D}\\), which reduces to \\(d \\mathbf{x}\\). 
The eigenfunctions form an orthonormal basis and thus\n\\[\n\\int \\nu_{i} \\left( \\mathbf{x} \\right) \\nu_{j} \\left( \\mathbf{x} \\right) d \\mathbf{x} = \\delta_{ij}\n\\]\nwhere \\(\\delta_{ij}\\) is the Kronecker delta. When \\(i=j\\), its value is \\(1\\); otherwise it is zero. Thus, one can define a kernel using its eigenfunctions\n\\[\nk \\left(  \\mathbf{x}, \\mathbf{x}' \\right) = \\sum_{i=1}^{\\infty} \\lambda_i \\nu_{i} \\left( \\mathbf{x} \\right) \\nu_{i} \\left( \\mathbf{x}' \\right).\n\\]"
  },
  {
    "objectID": "useful_codes/eigen.html#overview",
    "href": "useful_codes/eigen.html#overview",
    "title": "Eigenfunction analysis of kernels",
    "section": "",
    "text": "This attempts to describe kernels. The hope is that, after going through this, the reader appreciates just how powerful kernels are, and the central role they play in Gaussian process models.\n\n\nCode\n### Data \nimport matplotlib.pyplot as plt\nimport numpy as np\nfrom IPython.display import display, HTML\n\n\n\n\nNow, we shall be interested in mapping from a kernel to a feature map. This leads us to Mercer’s theorem, which states that: A symmetric function \\(k \\left( \\mathbf{x}, \\mathbf{x}' \\right)\\) can be expressed as the inner product\n\\[\nk \\left( \\mathbf{x}, \\mathbf{x}' \\right) = \\phi^{T} \\left( \\mathbf{x} \\right) \\phi \\left( \\mathbf{x}'\\right) = \\left\\langle \\phi \\left( \\mathbf{x} \\right), \\phi\\left( \\mathbf{x}' \\right) \\right\\rangle\n\\]\nfor some feature map \\(\\phi\\) if and only if \\(k \\left( \\mathbf{x}, \\mathbf{x}' \\right)\\) is positive semidefinite, i.e.,\n\\[\n\\int k \\left( \\mathbf{x}, \\mathbf{x}' \\right) g \\left(  \\mathbf{x} \\right)  g \\left(  \\mathbf{x}' \\right) d \\mathbf{x} d \\mathbf{x}' \\geq 0\n\\]\nfor all real \\(g\\).\nOne possible set of features corresponds to eigenfunctions. A function \\(\\nu\\left( \\mathbf{x} \\right)\\) that satisfies the integral equation\n\\[\n\\int k \\left( \\mathbf{x}, \\mathbf{x}' \\right) \\nu \\left( \\mathbf{x} \\right) d  \\mathbf{x}  = \\lambda  \\nu \\left( \\mathbf{x}' \\right)\n\\]\nis termed an eigenfunction of the kernel \\(k\\). In the expression above, \\(\\lambda\\) is the corresponding eigenvalue. While the integral above is taken with respect to \\(\\mathbf{x}\\), more formally, it can be taken with respect to either a density \\(\\rho \\left( \\mathbf{x} \\right)\\), or the Lebesgue measure over a compact subset of \\(\\mathbb{R}^{D}\\), which reduces to \\(d \\mathbf{x}\\). 
The eigenfunctions are orthogonal and can be chosen to be orthonormal, so that\n\\[\n\\int \\nu_{i} \\left( \\mathbf{x} \\right) \\nu_{j} \\left( \\mathbf{x} \\right) d \\mathbf{x} = \\delta_{ij}\n\\]\nwhere \\(\\delta_{ij}\\) is the Kronecker delta. When \\(i=j\\), its value is \\(1\\); otherwise it is zero. Thus, one can express a kernel through its eigenfunctions\n\\[\nk \\left(  \\mathbf{x}, \\mathbf{x}' \\right) = \\sum_{i=1}^{\\infty} \\lambda_i \\nu_{i} \\left( \\mathbf{x} \\right) \\nu_{i} \\left( \\mathbf{x}' \\right).\n\\]"
  },
  {
    "objectID": "useful_codes/eigen.html#numerical-solution",
    "href": "useful_codes/eigen.html#numerical-solution",
    "title": "Eigenfunction analysis of kernels",
    "section": "Numerical solution",
    "text": "Numerical solution\nIf the covariance matrix is already available, one can write its eigendecomposition\n\\[\n\\mathbf{K} = \\mathbf{V} \\boldsymbol{\\Lambda} \\mathbf{V}^{T}\n\\]\nwhere \\(\\mathbf{V}\\) is a matrix formed by the eigenvectors of \\(\\mathbf{K}\\) and \\(\\boldsymbol{\\Lambda}\\) is a diagonal matrix of its eigenvalues, i.e.,\n\\[\n\\mathbf{V} = \\left[\\begin{array}{cccc}\n| & | &  & |\\\\\n\\mathbf{v}_{1} & \\mathbf{v}_{2} & \\ldots & \\mathbf{v}_{N}\\\\\n| & | &  & |\n\\end{array}\\right], \\; \\; \\; \\; \\textrm{and} \\; \\; \\; \\; \\boldsymbol{\\Lambda}=\\left[\\begin{array}{cccc}\n\\lambda_{1}\\\\\n& \\lambda_{2}\\\\\n&  & \\ddots\\\\\n&  &  & \\lambda_{N}\n\\end{array}\\right],\n\\]\nwhere \\(\\lambda_1 \\geq \\lambda_2 \\geq \\ldots \\geq \\lambda_{N} \\geq 0\\). This expansion permits one to express \\(\\mathbf{K}\\) as a sum of rank-one terms,\n\\[\n\\mathbf{K} = \\sum_{i=1}^{N} \\left( \\sqrt{\\lambda_{i}} \\mathbf{v}_{i} \\right) \\left(  \\sqrt{\\lambda_{i}} \\mathbf{v}_{i}\\right)^{T}.\n\\]\nBeyond numerical solutions, for many kernels there exist analytical solutions for the eigenvalues and eigenfunctions. For further details please see page 97 in RW. 
For now, we simply consider numerical solutions as shown below.\n\n\nCode\nN = 30\nx = np.linspace(-2, 2, N).reshape(N,1)\nR = (np.tile(x, [1, N]) - np.tile(x.T, [N, 1]))**2\nl = 0.5\nK = np.exp(-0.5 * R * 1/l**2)\n\nfig = plt.figure()\nd = plt.imshow(K)\nplt.colorbar(d)\nplt.title('Squared exponential')\nplt.show()\n\n\n\n\n\n\n\nCode\nLambda, V = np.linalg.eigh(K)\nidx = Lambda.argsort()[::-1]\nlambdas = Lambda[idx]\nV = V[:, idx]\n\n\n\n\nCode\nfig = plt.figure()\nplt.semilogy(lambdas, 'o-')\nplt.ylabel('Eigenvalues of covariance matrix (log)')\nplt.xlabel('Number of data points')\nplt.show()\n\n\n\n\n\n\n\nCode\nT = 5 # truncated basis\nK_approx = np.zeros((N, N))\nfor i in range(0, T):\n    feature = (np.sqrt(lambdas[i]) * V[:,i]).reshape(N,1)\n    K_approx += feature @ feature.T\n\n\n\n\nCode\nfig = plt.figure()\nplt.subplot(121)\nd = plt.imshow(K_approx, vmin=0, vmax=1)\nplt.colorbar(d, shrink=0.3)\nplt.title('Truncated approximation (5 terms)')\n\nplt.subplot(122)\ne = plt.imshow(K, vmin=0, vmax=1)\nplt.colorbar(e, shrink=0.3)\nplt.title('Squared exponential')\nplt.show()"
  },
  {
    "objectID": "useful_codes/gaussians.html",
    "href": "useful_codes/gaussians.html",
    "title": "Gaussian marginals and conditionals",
    "section": "",
    "text": "This notebook covers a few basic ideas with regard to Gaussian marginals and conditionals.\n\n\nConsider a random vector \\(\\mathbf{u} = \\left[u_1, u_2, u_3, u_4 \\right]^{T}\\) following a multivariate Gaussian distribution with a mean vector \\(\\boldsymbol{\\mu}\\) and a covariance matrix \\(\\mathbf{K}\\). These are given by\n\\[\n\\boldsymbol{\\mu}=\\left[\\begin{array}{c}\n\\mu_{1}\\\\\n\\mu_{2}\\\\\n\\mu_{3}\\\\\n\\mu_{4}\n\\end{array}\\right], \\; \\; \\; \\; \\mathbf{K}=\\left[\\begin{array}{cccc}\nk_{11} & k_{12} & k_{13} & k_{14}\\\\\nk_{21} & k_{22} & k_{23} & k_{24}\\\\\nk_{31} & k_{32} & k_{33} & k_{34}\\\\\nk_{41} & k_{42} & k_{43} & k_{44}\n\\end{array}\\right]\n\\]\nThe marginal distribution of any subset of these four variables is obtained by integrating over the remaining ones. This can be done simply by extracting the relevant elements of \\(\\boldsymbol{\\mu}\\) and \\(\\mathbf{K}\\). For instance, the joint distribution \\(p\\left(u_2, u_3 \\right)\\) is a Gaussian\n\\[\np\\left( u_2, u_3 \\right) = \\mathcal{N} \\left( \\boldsymbol{\\mu}_{\\left(2,3\\right)}, \\boldsymbol{\\Sigma}_{\\left( 2,3 \\right)} \\right)\n\\]\nwhere\n\\[\n\\boldsymbol{\\mu}_{\\left(2,3\\right)}=\\left[\\begin{array}{c}\n\\mu_{2}\\\\\n\\mu_{3}\n\\end{array}\\right], \\; \\; \\; \\; \\boldsymbol{\\Sigma}_{\\left(2,3\\right)} = \\left[\\begin{array}{cc}\nk_{22} & k_{23} \\\\\nk_{32} & k_{33}\n\\end{array}\\right].\n\\]\nSimilarly, the marginal distribution \\(p\\left( u_1 \\right)\\) is a Gaussian with a mean of \\(\\mu_1\\) and a variance of \\(k_{11}\\).\n\n\n\nConsider a random vector \\(\\mathbf{u}\\), composed of two sets \\(\\mathbf{u}_{1}\\) and \\(\\mathbf{u}_{2}\\). 
Assume, as before, that \\(\\mathbf{u} \\sim \\mathcal{N} \\left( \\boldsymbol{\\mu}, \\boldsymbol{\\Sigma} \\right)\\), where\n\\[\n\\boldsymbol{\\mu} =\\left[\\begin{array}{c}\n\\boldsymbol{\\mu}_{1}\\\\\n\\boldsymbol{\\mu}_{2}\n\\end{array}\\right], \\; \\; \\; \\; \\boldsymbol{\\Sigma} = \\left[\\begin{array}{cc}\n\\boldsymbol{\\Sigma}_{11} & \\boldsymbol{\\Sigma}_{12} \\\\\n\\boldsymbol{\\Sigma}_{21} & \\boldsymbol{\\Sigma}_{22}\n\\end{array}\\right].\n\\]\nIf we observe one of these sets, say \\(\\mathbf{u}_{1}\\), then the conditional density of the other set \\(\\mathbf{u}_{2}\\) is a Gaussian of the form\n\\[\np \\left(\\mathbf{u}_{2} | \\mathbf{u}_{1} \\right) = \\mathcal{N} \\left( \\mathbf{d}, \\mathbf{D} \\right)\n\\]\nwhere\n\\[\n\\mathbf{d} = \\boldsymbol{\\mu}_{2} + \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\left( \\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right)\n\\]\nand\n\\[\n\\mathbf{D} = \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}\n\\]\n\n\nThe proof for the result above begins with the definition of the conditional distribution, i.e.,\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{p\\left( \\mathbf{u}_2 , \\mathbf{u}_1 \\right) }{ p \\left( \\mathbf{u}_1 \\right) }\n\\]\nPlugging in the definition of a multivariate normal distribution, we have:\n\\[\np\\left( \\mathbf{u}_2 , \\mathbf{u}_1 \\right) = p \\left( \\mathbf{u} \\right) = \\frac{1}{\\sqrt{\\left( 2 \\pi \\right)^{N} |\\boldsymbol{\\Sigma} | }} exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right)^{T} \\boldsymbol{\\Sigma}^{-1}  \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right) \\right]\n\\]\nand\n\\[\np\\left( \\mathbf{u}_1 \\right) =  \\frac{1}{\\sqrt{\\left( 2 \\pi \\right)^{N_1} |\\boldsymbol{\\Sigma}_{11} | }} exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Sigma}_{11}^{-1}  \\left(\\mathbf{u}_{1} - 
\\boldsymbol{\\mu}_{1} \\right) \\right]\n\\]\nThus,\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) =  \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }} exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right)^{T} \\boldsymbol{\\Sigma}^{-1}  \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right) + \\frac{1}{2} \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Sigma}_{11}^{-1}  \\left(\\mathbf{u}_{1} - \\boldsymbol{\\mu}_{1} \\right) \\right]\n\\]\nExpanding the above, we arrive at\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }}exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Gamma}_{11}  \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right) -\n\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Gamma}_{12}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right) -\n\\frac{1}{2}\\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right)^{T} \\boldsymbol{\\Gamma}_{22}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right)\n+\\frac{1}{2}  \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Sigma}_{11}^{-1}  \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)\n\\right]\n\\]\nwhere\n\\[\n\\boldsymbol{\\Sigma}^{-1} = \\left[\\begin{array}{cc}\n\\boldsymbol{\\Gamma}_{11} & \\boldsymbol{\\Gamma}_{12} \\\\\n\\boldsymbol{\\Gamma}_{21} & \\boldsymbol{\\Gamma}_{22}\n\\end{array}\\right].\n\\]\nUsing the block matrix inverse and Woodbury matrix identity, we have\n\\[\n\\boldsymbol{\\Gamma}_{11} = \\boldsymbol{\\Sigma}_{11}^{-1} + \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}\\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1} \\boldsymbol{\\Sigma}_{21} 
\\boldsymbol{\\Sigma}_{11}^{-1}\n\\]\n\\[\n\\boldsymbol{\\Gamma}_{12} = - \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1}\n\\]\n\\[\n\\boldsymbol{\\Gamma}_{22} = \\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1}\n\\]\nCollecting the terms of the expansion above under a common factor of \\(-\\frac{1}{2}\\) then yields\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }}exp \\left[ -\\frac{1}{2} \\left[\n\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\left( \\boldsymbol{\\Gamma}_{11} -  \\boldsymbol{\\Sigma}_{11}^{-1} \\right) \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right) +  \n2\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Gamma}_{12}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right) +\n\\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right)^{T} \\boldsymbol{\\Gamma}_{22}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right) \\right] \\right]\n\\]\nSubstituting the expressions for \\(\\boldsymbol{\\Gamma}_{11}\\), \\(\\boldsymbol{\\Gamma}_{12}\\) and \\(\\boldsymbol{\\Gamma}_{22}\\) and completing the square leads to\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }}exp \\left[ -\\frac{1}{2} \\left[ \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 - \\boldsymbol{\\Sigma}_{21}\\boldsymbol{\\Sigma}_{11}^{-1}\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right) \\right)^{T} \\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1} \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 - \\boldsymbol{\\Sigma}_{21}\\boldsymbol{\\Sigma}_{11}^{-1}\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right) \\right) \\right] \\right]\n\\]\nFrom this it is readily apparent that the mean and covariance of the new density are 
given by\n\\[\n\\mathbb{E}\\left[ \\mathbf{u}_2 | \\mathbf{u}_1 \\right ] =    \\boldsymbol{\\mu}_2 + \\boldsymbol{\\Sigma}_{21}\\boldsymbol{\\Sigma}_{11}^{-1}\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right)\n\\]\n\\[\n\\mathrm{Cov} \\left[ \\mathbf{u}_2 | \\mathbf{u}_1 \\right ] =  \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}\n\\]\nThe scaling constant in the preceding density can be simplified using the block-matrix determinant identity, \\(|\\boldsymbol{\\Sigma}| = |\\boldsymbol{\\Sigma}_{11}| \\, |\\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}|\\)."
  },
  {
    "objectID": "useful_codes/gaussians.html#overview",
    "href": "useful_codes/gaussians.html#overview",
    "title": "Gaussian marginals and conditionals",
    "section": "",
    "text": "This notebook covers a few basic ideas with regard to Gaussian marginals and conditionals.\n\n\nConsider a random vector \\(\\mathbf{u} = \\left[u_1, u_2, u_3, u_4 \\right]^{T}\\) following a multivariate Gaussian distribution with a mean vector \\(\\boldsymbol{\\mu}\\) and a covariance matrix \\(\\mathbf{K}\\). These are given by\n\\[\n\\boldsymbol{\\mu}=\\left[\\begin{array}{c}\n\\mu_{1}\\\\\n\\mu_{2}\\\\\n\\mu_{3}\\\\\n\\mu_{4}\n\\end{array}\\right], \\; \\; \\; \\; \\mathbf{K}=\\left[\\begin{array}{cccc}\nk_{11} & k_{12} & k_{13} & k_{14}\\\\\nk_{21} & k_{22} & k_{23} & k_{24}\\\\\nk_{31} & k_{32} & k_{33} & k_{34}\\\\\nk_{41} & k_{42} & k_{43} & k_{44}\n\\end{array}\\right]\n\\]\nThe marginal distribution of any subset of these four variables is obtained by integrating over the remaining ones. This can be done simply by extracting the relevant elements of \\(\\boldsymbol{\\mu}\\) and \\(\\mathbf{K}\\). For instance, the joint distribution \\(p\\left(u_2, u_3 \\right)\\) is a Gaussian\n\\[\np\\left( u_2, u_3 \\right) = \\mathcal{N} \\left( \\boldsymbol{\\mu}_{\\left(2,3\\right)}, \\boldsymbol{\\Sigma}_{\\left( 2,3 \\right)} \\right)\n\\]\nwhere\n\\[\n\\boldsymbol{\\mu}_{\\left(2,3\\right)}=\\left[\\begin{array}{c}\n\\mu_{2}\\\\\n\\mu_{3}\n\\end{array}\\right], \\; \\; \\; \\; \\boldsymbol{\\Sigma}_{\\left(2,3\\right)} = \\left[\\begin{array}{cc}\nk_{22} & k_{23} \\\\\nk_{32} & k_{33}\n\\end{array}\\right].\n\\]\nSimilarly, the marginal distribution \\(p\\left( u_1 \\right)\\) is a Gaussian with a mean of \\(\\mu_1\\) and a variance of \\(k_{11}\\).\n\n\n\nConsider a random vector \\(\\mathbf{u}\\), composed of two sets \\(\\mathbf{u}_{1}\\) and \\(\\mathbf{u}_{2}\\). 
Assume, as before, that \\(\\mathbf{u} \\sim \\mathcal{N} \\left( \\boldsymbol{\\mu}, \\boldsymbol{\\Sigma} \\right)\\), where\n\\[\n\\boldsymbol{\\mu} =\\left[\\begin{array}{c}\n\\boldsymbol{\\mu}_{1}\\\\\n\\boldsymbol{\\mu}_{2}\n\\end{array}\\right], \\; \\; \\; \\; \\boldsymbol{\\Sigma} = \\left[\\begin{array}{cc}\n\\boldsymbol{\\Sigma}_{11} & \\boldsymbol{\\Sigma}_{12} \\\\\n\\boldsymbol{\\Sigma}_{21} & \\boldsymbol{\\Sigma}_{22}\n\\end{array}\\right].\n\\]\nIf we observe one of these sets, say \\(\\mathbf{u}_{1}\\), then the conditional density of the other set \\(\\mathbf{u}_{2}\\) is a Gaussian of the form\n\\[\np \\left(\\mathbf{u}_{2} | \\mathbf{u}_{1} \\right) = \\mathcal{N} \\left( \\mathbf{d}, \\mathbf{D} \\right)\n\\]\nwhere\n\\[\n\\mathbf{d} = \\boldsymbol{\\mu}_{2} + \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\left( \\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right)\n\\]\nand\n\\[\n\\mathbf{D} = \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}\n\\]\n\n\nThe proof for the result above begins with the definition of the conditional distribution, i.e.,\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{p\\left( \\mathbf{u}_2 , \\mathbf{u}_1 \\right) }{ p \\left( \\mathbf{u}_1 \\right) }\n\\]\nPlugging in the definition of a multivariate normal distribution, we have:\n\\[\np\\left( \\mathbf{u}_2 , \\mathbf{u}_1 \\right) = p \\left( \\mathbf{u} \\right) = \\frac{1}{\\sqrt{\\left( 2 \\pi \\right)^{N} |\\boldsymbol{\\Sigma} | }} exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right)^{T} \\boldsymbol{\\Sigma}^{-1}  \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right) \\right]\n\\]\nand\n\\[\np\\left( \\mathbf{u}_1 \\right) =  \\frac{1}{\\sqrt{\\left( 2 \\pi \\right)^{N_1} |\\boldsymbol{\\Sigma}_{11} | }} exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Sigma}_{11}^{-1}  \\left(\\mathbf{u}_{1} - 
\\boldsymbol{\\mu}_{1} \\right) \\right]\n\\]\nThus,\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) =  \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }} exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right)^{T} \\boldsymbol{\\Sigma}^{-1}  \\left(\\mathbf{u} - \\boldsymbol{\\mu} \\right) + \\frac{1}{2} \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Sigma}_{11}^{-1}  \\left(\\mathbf{u}_{1} - \\boldsymbol{\\mu}_{1} \\right) \\right]\n\\]\nExpanding the above, we arrive at\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }}exp \\left[ -\\frac{1}{2} \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Gamma}_{11}  \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right) -\n\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Gamma}_{12}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right) -\n\\frac{1}{2}\\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right)^{T} \\boldsymbol{\\Gamma}_{22}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right)\n+\\frac{1}{2}  \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Sigma}_{11}^{-1}  \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)\n\\right]\n\\]\nwhere\n\\[\n\\boldsymbol{\\Sigma}^{-1} = \\left[\\begin{array}{cc}\n\\boldsymbol{\\Gamma}_{11} & \\boldsymbol{\\Gamma}_{12} \\\\\n\\boldsymbol{\\Gamma}_{21} & \\boldsymbol{\\Gamma}_{22}\n\\end{array}\\right].\n\\]\nUsing the block matrix inverse and Woodbury matrix identity, we have\n\\[\n\\boldsymbol{\\Gamma}_{11} = \\boldsymbol{\\Sigma}_{11}^{-1} + \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}\\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1} \\boldsymbol{\\Sigma}_{21} 
\\boldsymbol{\\Sigma}_{11}^{-1}\n\\]\n\\[\n\\boldsymbol{\\Gamma}_{12} = - \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1}\n\\]\n\\[\n\\boldsymbol{\\Gamma}_{22} = \\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1}\n\\]\nCollecting the terms of the expansion above under a common factor of \\(-\\frac{1}{2}\\) then yields\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }}exp \\left[ -\\frac{1}{2} \\left[\n\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\left( \\boldsymbol{\\Gamma}_{11} -  \\boldsymbol{\\Sigma}_{11}^{-1} \\right) \\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right) +  \n2\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_1 \\right)^{T} \\boldsymbol{\\Gamma}_{12}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right) +\n\\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right)^{T} \\boldsymbol{\\Gamma}_{22}  \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 \\right) \\right] \\right]\n\\]\nSubstituting the expressions for \\(\\boldsymbol{\\Gamma}_{11}\\), \\(\\boldsymbol{\\Gamma}_{12}\\) and \\(\\boldsymbol{\\Gamma}_{22}\\) and completing the square leads to\n\\[\np \\left( \\mathbf{u}_2 | \\mathbf{u}_1 \\right) = \\frac{\\sqrt{|\\boldsymbol{\\Sigma}_{11} |}}{\\sqrt{\\left( 2 \\pi \\right)^{N - N_1} |\\boldsymbol{\\Sigma} | }}exp \\left[ -\\frac{1}{2} \\left[ \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 - \\boldsymbol{\\Sigma}_{21}\\boldsymbol{\\Sigma}_{11}^{-1}\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right) \\right)^{T} \\left( \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12} \\right)^{-1} \\left(\\mathbf{u}_2 - \\boldsymbol{\\mu}_2 - \\boldsymbol{\\Sigma}_{21}\\boldsymbol{\\Sigma}_{11}^{-1}\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right) \\right) \\right] \\right]\n\\]\nFrom this it is readily apparent that the mean and covariance of the new density are 
given by\n\\[\n\\mathbb{E}\\left[ \\mathbf{u}_2 | \\mathbf{u}_1 \\right ] =    \\boldsymbol{\\mu}_2 + \\boldsymbol{\\Sigma}_{21}\\boldsymbol{\\Sigma}_{11}^{-1}\\left(\\mathbf{u}_1 - \\boldsymbol{\\mu}_{1} \\right)\n\\]\n\\[\n\\mathrm{Cov} \\left[ \\mathbf{u}_2 | \\mathbf{u}_1 \\right ] =  \\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}\n\\]\nThe scaling constant in the preceding density can be simplified using the block-matrix determinant identity, \\(|\\boldsymbol{\\Sigma}| = |\\boldsymbol{\\Sigma}_{11}| \\, |\\boldsymbol{\\Sigma}_{22} - \\boldsymbol{\\Sigma}_{21} \\boldsymbol{\\Sigma}_{11}^{-1} \\boldsymbol{\\Sigma}_{12}|\\)."
  },
  {
    "objectID": "sample_problems/lecture_4.html",
    "href": "sample_problems/lecture_4.html",
    "title": "L4 examples",
    "section": "",
    "text": "Code\nimport numpy as np \nfrom scipy.stats import bernoulli, binom, expon\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.special import comb\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')"
  },
  {
    "objectID": "sample_problems/lecture_4.html#problem-1",
    "href": "sample_problems/lecture_4.html#problem-1",
    "title": "L4 examples",
    "section": "Problem 1",
    "text": "Problem 1\nFind the probability density function of \\(Y = h \\left( X \\right) = X^2\\), for any \\(y &gt; 0\\), where \\(X\\) is a continuous random variable with a known probability density function.\n\n\nSolution\n\nFor \\(y &gt; 0\\) we have\n\\[\nF_Y \\left( y \\right) = p \\left( Y \\leq y \\right) = p \\left( X^2 \\leq y \\right) = p \\left( - \\sqrt{y} \\leq X \\leq \\sqrt{y} \\right)\n\\]\n\\[\n\\Rightarrow F_Y \\left( y \\right) = F_{X} \\left( \\sqrt{y} \\right) - F_{X} \\left( - \\sqrt{y} \\right)\n\\]\nThus, by differentiating and applying the chain rule we have\n\\[\nf_{Y} \\left( y \\right) = \\frac{1}{2\\sqrt{y}} f_{X} \\left( \\sqrt{y} \\right) + \\frac{1}{2 \\sqrt{y}} f_{X} \\left( - \\sqrt{y} \\right), \\; \\; \\; \\; y &gt; 0\n\\]"
  },
  {
    "objectID": "sample_problems/lecture_4.html#problem-2",
    "href": "sample_problems/lecture_4.html#problem-2",
    "title": "L4 examples",
    "section": "Problem 2",
    "text": "Problem 2\nFind the probability density function of \\(Y = exp \\left( X^2 \\right)\\) if \\(X\\) is a non-negative random variable.\n\n\nSolution\n\nNote that \\(F_Y \\left( y \\right) = 0\\) for \\(y &lt; 1\\). For \\(y \\geq 1\\), we have\n\\[\nF_{Y} \\left( y \\right) = p \\left(exp \\left(X^2 \\right) \\leq y \\right) = p \\left(X^2 \\leq log \\left( y \\right) \\right)\n\\]\n\\[\n\\Rightarrow F_{Y} \\left( y \\right) = p \\left( X \\leq \\sqrt{log \\left( y \\right) } \\right) = F_{X} \\left( \\sqrt{log \\left( y \\right) } \\right).\n\\]\nBy differentiating and using the chain rule, we obtain\n\\[\nf_{Y} \\left( y \\right) = f_{X} \\left( \\sqrt{log \\left( y \\right) } \\right) \\frac{1}{2y \\sqrt{log \\left( y \\right) } }, \\; \\; \\; y &gt; 1.\n\\]\n\n\n\nCode\nlam = 1.\nX = expon.rvs(size=15000, scale=1/lam)\nY = np.sin(X)\n\nfig = plt.figure(figsize=(10,3))\nplt.subplot(121)\nplt.hist(X,40, density=True, color='crimson')\nplt.ylabel(r'$f_{X}(x)$')\nplt.xlabel('x')\nplt.subplot(122)\nplt.hist(Y,40, density=True, color='dodgerblue')\nplt.ylabel(r'$f_{Y}(y)$')\nplt.xlabel('y')\nplt.show()"
  },
  {
    "objectID": "sample_problems/lecture_4.html#problem-3",
    "href": "sample_problems/lecture_4.html#problem-3",
    "title": "L4 examples",
    "section": "Problem 3",
    "text": "Problem 3\nFollowing the bit of code above, let \\(X\\) be an exponential random variable with parameter \\(\\lambda\\), i.e., \\(f_{X} \\left( x \\right) = \\lambda exp \\left( -\\lambda x \\right)\\) and \\(F_{X} \\left( x \\right) = 1 - exp \\left( -\\lambda x \\right)\\). Let \\(Y= sin\\left( X \\right)\\). Determine \\(F_Y\\left( y \\right)\\) and \\(f_{Y} \\left( y \\right)\\).\n\n\nSolution\n\nFrom the event \\(\\left\\{ Y \\leq y \\right\\}\\), we can conclude that for \\(x = sin^{-1} \\left( y \\right)\\) we have\n\\[\nF_{Y} \\left( y \\right) = p \\left( Y \\leq y \\right)\n\\]\n\\[\nF_{Y} \\left( y \\right) = p \\left( X \\leq x \\right) + \\sum_{k=1}^{\\infty} \\left[ F_{X} \\left( 2 k \\pi + x \\right) - F_{X} \\left( \\left(2k - 1 \\right) \\pi - x\\right) \\right]\n\\]\n\\[\nF_{Y} \\left( y \\right) = p \\left( X \\leq x \\right) + \\sum_{k=1}^{\\infty} \\left[  1 - exp\\left( -\\lambda x - 2 \\lambda k \\pi \\right) - 1 + exp \\left( \\lambda x + \\lambda \\pi - 2 \\lambda k \\pi \\right) \\right]\n\\]\n\\[\n= p \\left( X \\leq x \\right) + \\left[ exp\\left( \\lambda x \\right) exp \\left( \\lambda \\pi \\right) - exp \\left( - \\lambda x \\right) \\right] \\sum_{k=1}^{\\infty}  exp \\left( - 2 \\lambda k \\pi \\right)\n\\]\n\\[\n= p \\left( X \\leq sin^{-1} \\left( y \\right) \\right) + \\left[ exp \\left( \\lambda sin^{-1} \\left( y \\right) + \\lambda \\pi \\right) - exp \\left( -\\lambda sin^{-1} \\left( y \\right) \\right) \\right] \\frac{exp \\left( -2 \\lambda \\pi \\right) }{1 - exp \\left( -2 \\lambda \\pi \\right)}\n\\]\nThis expansion uses the sum of a geometric sequence formula. The first term above is zero for negative \\(y \\in [-1, 0)\\) and\n\\[\np \\left( X \\leq sin^{-1} \\left( y \\right) \\right) = F_{X} \\left( sin^{-1} \\left( y \\right) \\right) = 1 - exp(- \\lambda sin^{-1}\\left( y \\right) )\n\\]\nfor non-negative \\(y \\in [0, 1]\\). 
Since \\(F_{X} \\left(0\\right) = 0\\), the cumulative probability \\(F_Y\\left( y \\right)\\) will remain continuous at \\(y=0\\). However, its derivative is discontinuous and we will be unable to derive an expression for \\(f_{Y} \\left( 0 \\right)\\). Hence, for negative \\(y \\in (-1, 0)\\) we have\n\\[\nf_{Y} \\left( y \\right) = \\frac{d F_{Y}}{dx} \\frac{dx}{dy} = \\frac{\\lambda}{\\sqrt{1 - y^2}} \\frac{exp \\left( \\lambda \\left(sin^{-1} \\left( y \\right) + \\pi \\right) \\right) + exp \\left( -\\lambda sin^{-1} \\left( y \\right) \\right) }{exp \\left( 2 \\lambda \\pi \\right) - 1 }.\n\\]\nFor positive \\(y \\in (0, 1)\\), we have\n\\[\nf_{Y} \\left( y \\right) = \\frac{d F_{Y}}{dx} \\frac{dx}{dy} = \\frac{\\lambda}{\\sqrt{1 - y^2}} \\left[ \\frac{exp \\left( \\lambda \\left(sin^{-1} \\left( y \\right) + \\pi \\right) \\right) + exp \\left( -\\lambda sin^{-1} \\left( y \\right) \\right) }{exp \\left( 2 \\lambda \\pi \\right) - 1 } + exp \\left( -\\lambda sin^{-1} \\left( y \\right) \\right) \\right].\n\\]\n\n\n\nCode\nx = np.linspace(-2.5, 2.5, 500)\ny = np.sin(x)\n\ndef f_y(y):\n    f_y = np.zeros((y.shape[0]))\n    for i in range(0, f_y.shape[0]):\n        if y[i] &gt; 0:\n            f_y[i] = lam/np.sqrt(1 - y[i]**2) * (np.exp(-lam * np.arcsin(y[i])) \\\n                        + (np.exp(lam * np.arcsin(y[i]) + lam * np.pi) + \\\n                          np.exp(-lam * np.arcsin(y[i])))/(np.exp(2 * lam * np.pi) - 1))\n        else:\n            f_y[i] = lam/np.sqrt(1 - y[i]**2) * ((np.exp(lam * np.arcsin(y[i]) + lam * np.pi) + \\\n                          np.exp(-lam * np.arcsin(y[i])))/(np.exp(2 * lam * np.pi) - 1))\n    return f_y\n\nfig = plt.figure(figsize=(6,4))\nplt.plot(y, f_y(y), color='navy', lw=3)\nplt.hist(Y,40, density=True, color='dodgerblue')\nplt.ylabel(r'$f_{Y}(y)$')\nplt.xlabel('y')\nplt.ylim([0, 2.7])\nplt.show()"
  },
  {
    "objectID": "sample_problems/lecture_4.html#problem-4",
    "href": "sample_problems/lecture_4.html#problem-4",
    "title": "L4 examples",
    "section": "Problem 4",
    "text": "Problem 4\nLet \\(X\\) and \\(Y\\) be independent and uniform between \\(0\\) and \\(1\\). Compute \\(X + Y\\). To set the stage for the problem, consider the code and plot below.\n\n\nCode\nX = np.random.rand(9000)\nY = np.random.rand(9000)\nS = X + Y \n\nfig = plt.figure(figsize=(6,3))\nplt.hist(X+Y,40, density=True, color='orangered')\nplt.ylabel(r'$f_{S}(s)$')\nplt.xlabel('s')\nplt.show()\n\n\n\n\n\nIt appears we have a triangular distribution. In what follows we shall aim to derive this analytically.\n\n\nSolution\n\nFrom the Lecture notes, we have:\n\\[\nf_{S} \\left( s \\right) = \\int_{0}^{1} f_{X} \\left( x \\right) f_{Y} \\left( s - x \\right) dx = \\int_{0}^{1} f_{Y} \\left( s - x \\right) dx\n\\]\n\\[\n\\Rightarrow f_{S} \\left( s \\right) = \\begin{cases}\n\\begin{array}{c}\n\\int_{0}^{s}1dx=s\\\\\n\\int_{s-1}^{1}1dx=2-s\n\\end{array} & \\begin{array}{c}\n\\textrm{for} \\; \\; s \\in [0, 1] \\\\\n\\textrm{for} \\; \\; s \\in [1, 2]\n\\end{array}\\end{cases}\n\\]"
  },
  {
    "objectID": "sample_problems/lecture_2.html",
    "href": "sample_problems/lecture_2.html",
    "title": "L2 examples",
    "section": "",
    "text": "Commercial airline pilots need to pass four out of five separate tests for certification. Assume that the tests are equally difficult, and that the performance on separate tests are independent.\n\nIf the probability of failing each separate test is \\(p=0.2\\), then what is the probability of failing certification?\nTo improve safety, more stringent regulations require that pilots pass all five tests. To be able to meet the demand, the individual tests are made easier. What should the new individual failure rate be if the overall certification probability is to remain unchanged?\n\n\n\nSolution\n\n\nGiven that each test is independent, the combined probabilities follow a Binomial distribution. A pilot will fail certification if they fail two or more tests, and they will pass if they fail zero or one of the individual tests. Thus, the probability of passing certification is\n\n\\[\n\\large\np_{pass} = \\left(\\begin{array}{c}\n5\\\\\n0\n\\end{array}\\right) p^{0} \\left( 1 - p \\right)^{5} + \\left(\\begin{array}{c}\n5\\\\\n1\n\\end{array}\\right)p^{1} \\left( 1 - p \\right)^{4}\n\\]\nFrom the code snippet below this is roughly 0.737. Thus the combined failure rate is \\(1 - 0.7373 = 0.2627\\).\n\nUnder the new certification protocol, as there is no possibility of failing a test, we have\n\n\\[\n\\large\n\\left(1 - p_{fail, new} \\right)^{5} = 1 - 0.2627 \\Rightarrow p_{fail, new} = 0.06\n\\]\n\n\n\nCode\nfrom scipy.special import comb\nimport numpy as np\n\n# part a.\np = 0.2\np_pass = comb(5, 0) * p**0 * (1 - p)**5 + comb(5, 1) * p**1 * (1 - p)**4\np_fail = 1 - p_pass\nprint(p_fail)\n\n# part b.\np_fail_new = 1 - (1 - p_fail)**(1/5)\nprint(p_fail_new)\n\n\n0.26271999999999984\n0.059136781980261066"
  },
  {
    "objectID": "sample_problems/lecture_2.html#problem-1",
    "href": "sample_problems/lecture_2.html#problem-1",
    "title": "L2 examples",
    "section": "",
    "text": "Commercial airline pilots need to pass four out of five separate tests for certification. Assume that the tests are equally difficult, and that the performance on separate tests are independent.\n\nIf the probability of failing each separate test is \\(p=0.2\\), then what is the probability of failing certification?\nTo improve safety, more stringent regulations require that pilots pass all five tests. To be able to meet the demand, the individual tests are made easier. What should the new individual failure rate be if the overall certification probability is to remain unchanged?\n\n\n\nSolution\n\n\nGiven that each test is independent, the combined probabilities follow a Binomial distribution. A pilot will fail certification if they fail two or more tests, and they will pass if they fail zero or one of the individual tests. Thus, the probability of passing certification is\n\n\\[\n\\large\np_{pass} = \\left(\\begin{array}{c}\n5\\\\\n0\n\\end{array}\\right) p^{0} \\left( 1 - p \\right)^{5} + \\left(\\begin{array}{c}\n5\\\\\n1\n\\end{array}\\right)p^{1} \\left( 1 - p \\right)^{4}\n\\]\nFrom the code snippet below this is roughly 0.737. Thus the combined failure rate is \\(1 - 0.7373 = 0.2627\\).\n\nUnder the new certification protocol, as there is no possibility of failing a test, we have\n\n\\[\n\\large\n\\left(1 - p_{fail, new} \\right)^{5} = 1 - 0.2627 \\Rightarrow p_{fail, new} = 0.06\n\\]\n\n\n\nCode\nfrom scipy.special import comb\nimport numpy as np\n\n# part a.\np = 0.2\np_pass = comb(5, 0) * p**0 * (1 - p)**5 + comb(5, 1) * p**1 * (1 - p)**4\np_fail = 1 - p_pass\nprint(p_fail)\n\n# part b.\np_fail_new = 1 - (1 - p_fail)**(1/5)\nprint(p_fail_new)\n\n\n0.26271999999999984\n0.059136781980261066"
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nA few remarks\n\nGaussian process (GP) models assume that the vector of targets (e.g., \\mathbf{t} ) comes from a Gaussian distribution.\nRather than opting for a parametric form of the regression function, in GPs a mean vector and a covariance matrix are selected for this Gaussian.\nFollowing Chapter 2 of Rasmussen and Williams, we shall begin with a noise-free case, followed by the noisy case.\nNotionally, with GPs, we assume that our function is an infinite dimensional Gaussian! However, any finite subset of this infinite dimensional Gaussian is by definition also Gaussian!"
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-1",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-1",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nGP prior\n\nLet \\boldsymbol{x} \\in \\mathbb{R}^{d} denote a point in d-dimensional space, i.e., \\boldsymbol{x} \\in \\mathcal{X}\nAs with any Gaussian, a GP is typically specified through its mean \\mu\\left( \\mathbf{x} \\right) and a two-point covariance function k \\left( \\boldsymbol{x}, \\boldsymbol{x}' \\right).\nA popular choice for the mean function is \\mu\\left( \\boldsymbol{x} \\right) = 0 for all x \\in \\mathcal{X}.\nCovariance functions are typically parameterized, and a popular choice is the radial basis function (also known as the squared exponential) \nk \\left( \\boldsymbol{x}, \\boldsymbol{x}' \\right) = \\alpha \\; exp \\left(- \\frac{1}{2l^2}  \\left\\Vert  \\boldsymbol{x} - \\boldsymbol{x}' \\right\\Vert_{2}^{2}  \\right)\n where l is the length scale and \\alpha is the amplitude.\nThe function k is referred to as the kernel function."
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-2",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-2",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nVisualizing GP priors\n\nPlotCode\n\n\nDefining a grid of points \\mathcal{X} \\equiv \\left[-2, 2 \\right], and choosing values for \\alpha and l, we can sample vectors \\mathbf{t} from the GP prior \\mathcal{N}\\left(\\mathbf{0}, \\mathbf{C} \\right), where \\mathbf{C}_{ij} = k \\left( \\boldsymbol{x}_{i}, \\boldsymbol{x}_{j} \\right).\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nfrom scipy.linalg import cholesky, solve_triangular\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\nplt.style.use('dark_background')\n\n\ndef kernel(xa, xb, amp, ll):\n    Xa, Xb = get_tiled(xa, xb)\n    return amp**2 * np.exp(-0.5 * 1./ll**2 * (Xa - Xb)**2 )\n\n\ndef get_tiled(xa, xb):\n    m, n = len(xa), len(xb)\n    xa, xb = xa.reshape(m,1) , xb.reshape(n,1)\n    Xa = np.tile(xa, (1, n))\n    Xb = np.tile(xb.T, (m, 1))\n    return Xa, Xb\n\nX = np.linspace(-2, 2, 150)\ncov_1 = kernel(X, X, 1, 0.1) \nmu_1 = np.zeros((150,))\nprior_1 = multivariate_normal(mu_1, cov_1, allow_singular=True)\n\ncov_2 = kernel(X, X, 0.5, 1)\nmu_2 = np.zeros((150,))\nprior_2 = multivariate_normal(mu_2, cov_2, allow_singular=True)\n\nrandom_samples = 50\n\nfig, ax = plt.subplots(2, figsize=(12,4))\nfig.patch.set_facecolor('#6C757D')\nax[0].set_fc('#6C757D')\nplt.subplot(121)\nplt.plot(X, prior_1.rvs(random_samples).T, alpha=0.5)\nplt.title(r'Samples from GP prior with $\\alpha=1$, $l=0.1$')\nplt.xlabel(r'$x$')\nplt.ylabel(r'$f$')\n#plt.ylabel(r'$\\mathbf{w}_1$')\nfig.patch.set_facecolor('#6C757D')\n\nplt.subplot(122)\nplt.rcParams['axes.facecolor']='#6C757D'\nax[1].set_facecolor('#6C757D')\nplt.plot(X, prior_2.rvs(random_samples).T, alpha=0.5)\nplt.title(r'Samples from GP prior with $\\alpha=0.5$, 
$l=1$')\nplt.xlabel(r'$x$')\nplt.ylabel(r'$f$')\nplt.savefig('prior.png', dpi=150, bbox_inches='tight', facecolor=\"#6C757D\")\nplt.close()"
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-3",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-3",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nVisualizing GP priors\n\nThe prior covariance function depended only on the difference between pairs of points, i.e., \\left\\Vert \\boldsymbol{x} - \\boldsymbol{x}' \\right\\Vert_{2}^{2}.\nSuch kernels are said to be stationary; an RBF kernel can be written as\n\n\nk \\left( \\boldsymbol{r} \\right) =  \\alpha \\; exp \\left(- \\frac{1}{2l^2}  \\boldsymbol{r}^2 \\right).\n\n\nWe will encounter many kernel functions later on, some of them are stationary, whilst others are not."
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-4",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-4",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nGaussian marginals and conditionals\nClick here."
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-5",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-5",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nNoise-free regression\n\nWe will consider the Olympic winning times dataset again, but this time assume there is no observational sensor noise.\nThe markers denote the training \\left(x_i, f_i \\right) pairs, while the dashed lines denote the locations are which we would like to make predictions. we will use the superscript \\ast to denote points at which we would like to infer predictions \n\\underbrace{\\mathbf{x}=\\left[\\begin{array}{c}\nx_1 \\\\\nx_2 \\\\\n\\vdots \\\\\nx_N \\\\\n\\end{array}\\right], \\; \\;\\; \\mathbf{f}=\\left[\\begin{array}{c}\nf_1\\\\\nf_2\\\\\n\\vdots \\\\\nf_N \\\\\n\\end{array}\\right]}_{\\text{training}}, \\; \\; \\; \\; \\; \\;  \\underbrace{\\mathbf{x}^{\\ast}=\\left[\\begin{array}{c}\nx_1^{\\ast} \\\\\nx_2^{\\ast} \\\\\n\\vdots \\\\\nx_L^{\\ast} \\\\\n\\end{array}\\right], \\; \\;\\; \\mathbf{f}^{\\ast}=\\left[\\begin{array}{c}\nf_1^{\\ast}\\\\\nf_2^{\\ast}\\\\\n\\vdots \\\\\nf_L^{\\ast} \\\\\n\\end{array}\\right]}_{\\text{prediction}}\n where N denotes the number of points in the training set, and L denotes the number of points in the prediction set.\nIt will be useful to combine the winning times (both training and prediction) in the same vector, i.e., \\mathbf{\\hat{f}} = \\left[ \\mathbf{f}, \\mathbf{f}^{\\ast} \\right]^{T}."
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-6",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-6",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nNoise-free regression\nIt will be useful to define three covariance matrices:\n\n\n\nA covariance matrix, \\mathbf{C} \\in \\mathbb{R}^{N\\times N}, on the training data: \n\\mathbf{C}=\\left[\\begin{array}{ccc}\nk\\left(x_{1},x_{1}\\right) & \\ldots & k\\left(x_{1},x_{N}\\right)\\\\\n\\vdots & \\ddots & \\vdots\\\\\nk\\left(x_{N},x_{1}\\right) & \\cdots & k\\left(x_{N},x_{N}\\right)\n\\end{array}\\right]\n\nA covariance matrix, \\mathbf{C}^{\\ast} \\in \\mathbb{R}^{L\\times L}, on the prediction (or test) data: \n\\mathbf{C}^{\\ast}=\\left[\\begin{array}{ccc}\nk\\left(x_{1}^{\\ast},x_{1}^{\\ast}\\right) & \\ldots & k\\left(x_{1}^{\\ast},x_{L}^{\\ast}\\right)\\\\\n\\vdots & \\ddots & \\vdots\\\\\nk\\left(x_{L}^{\\ast},x_{1}^{\\ast}\\right) & \\cdots & k\\left(x_{L}^{\\ast},x_{L}^{\\ast}\\right)\n\\end{array}\\right]\n\n\n\n\nA cross-covariance matrix, \\mathbf{R} \\in \\mathbb{R}^{N \\times L} on the training and testing data: \n\\mathbf{R} = \\left[\\begin{array}{ccc}\nk\\left(x_{1},x_{1}^{\\ast}\\right) & \\ldots & k\\left(x_{1},x_{L}^{\\ast}\\right)\\\\\n\\vdots & \\ddots & \\vdots\\\\\nk\\left(x_{N},x_{1}^{\\ast}\\right) & \\cdots & k\\left(x_{N},x_{L}^{\\ast}\\right)\n\\end{array}\\right]"
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-7",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-7",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nNoise-free regression\n\nAssuming a zero mean, the GP prior over \\hat{\\mathbf{t}} is given by \np \\left( \\mathbf{\\hat{f}} \\right) = \\mathcal{N} \\left( \\mathbf{0}, \\left[\\begin{array}{cc}\n\\mathbf{C} & \\mathbf{R} \\\\\n\\mathbf{R}^{T} & \\mathbf{C}^{\\ast}\n\\end{array}\\right]  \\right)\n\nThis distribution is the complete definition of our model. It tells us how the function values at the training and prediction points co-vary.\nMaking predictions amounts to manipulating this distribution to give a distribution over the function values at the prediction points conditioned on the observed training data, i.e., p \\left( \\mathbf{t}^{\\ast} | \\mathbf{t} \\right)\nFrom our prior foray into Gaussian conditionals, we recognize this to be \n\\begin{aligned}\np \\left( \\mathbf{f}^{\\ast} | \\mathbf{f} \\right) & = \\mathcal{N}\\left( \\boldsymbol{\\mu}^{\\ast}, \\boldsymbol{\\Sigma}^{\\ast} \\right), \\; \\; \\; \\; \\text{where} \\\\\n\\boldsymbol{\\mu}^{\\ast} = \\mathbf{R}^{T} \\mathbf{C}^{-1} \\mathbf{f}, \\; \\; & \\; \\; \\boldsymbol{\\Sigma}^{\\ast} = \\mathbf{C}^{\\ast} - \\mathbf{R}^{T} \\mathbf{C}^{-1} \\mathbf{R}\n\\end{aligned}"
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-8",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-8",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nVisualizing GP posteriors\n\nPlotCode\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nfrom scipy.linalg import cholesky, solve_triangular\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\nplt.style.use('dark_background')\n\n\ndef kernel(xa, xb, amp, ll):\n    Xa, Xb = get_tiled(xa, xb)\n    return amp**2 * np.exp(-0.5 * 1./ll**2 * (Xa - Xb)**2 )\n\ndef get_tiled(xa, xb):\n    m, n = len(xa), len(xb)\n    xa, xb = xa.reshape(m,1) , xb.reshape(n,1)\n    Xa = np.tile(xa, (1, n))\n    Xb = np.tile(xb.T, (m, 1))\n    return Xa, Xb\n\ndef get_posterior(amp, ll, x, x_data, y_data):\n    u = y_data.shape[0]\n    mu_y = np.mean(y_data)\n    y = (y_data - mu_y).reshape(u,1)\n    \n    Kxx = kernel(x_data, x_data, amp, ll)\n    Kxpx = kernel(x, x_data, amp, ll)\n    Kxpxp = kernel(x, x, amp, ll)\n    \n    # Inverse\n    jitter = np.eye(u) * 1e-8\n    L = cholesky(Kxx + jitter)\n    S1 = solve_triangular(L.T, y, lower=True)\n    S2 = solve_triangular(L.T, Kxpx.T, lower=True).T\n    \n    mu = S2 @ S1  + mu_y\n    cov = Kxpxp - S2 @ S2.T\n    return mu, cov\n\nx_data = np.random.rand(10)*4 - 2.\ny_data = np.cos(5*x_data) + x_data**2 + 2*x_data\nX = np.linspace(-2, 2, 150)\nrandom_samples = 50\n\nfig, ax = plt.subplots(2, figsize=(12,4))\nfig.patch.set_facecolor('#6C757D')\nax[0].set_fc('#6C757D')\nplt.subplot(121)\nmu, cov = get_posterior(1, 0.1, X, x_data, y_data)\nposterior = multivariate_normal(mu.flatten(), cov, allow_singular=True)\n#mu = mu.flatten()\n#std = np.sqrt(np.diag(cov)).flatten()\nplt.plot(x_data, y_data, 'o', ms=12, color='dodgerblue', lw=1, markeredgecolor='w', zorder=3)\nplt.plot(X, posterior.rvs(random_samples).T, alpha=0.5, zorder=2)\nplt.title(r'Samples from GP posterior with $\\alpha=1$, 
$l=0.1$')\nplt.xlabel(r'$x$')\nplt.ylabel(r'$t$')\n#plt.ylabel(r'$\\mathbf{w}_1$')\nfig.patch.set_facecolor('#6C757D')\n\nplt.subplot(122)\nplt.rcParams['axes.facecolor']='#6C757D'\nax[1].set_facecolor('#6C757D')\nmu2, cov2 = get_posterior(0.5, 1, X, x_data, y_data)\nposterior2 = multivariate_normal(mu2.flatten(), cov2, allow_singular=True)\nplt.plot(x_data, y_data, 'o', ms=12, color='dodgerblue', lw=1, markeredgecolor='w', zorder=3)\nplt.plot(X, posterior2.rvs(random_samples).T, alpha=0.5, zorder=2)\nplt.title(r'Samples from GP posterior with $\\alpha=0.5$, $l=1$')\nplt.xlabel(r'$x$')\nplt.ylabel(r'$t$')\nplt.savefig('posterior.png', dpi=150, bbox_inches='tight', facecolor=\"#6C757D\")\nplt.close()"
  },
  {
    "objectID": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-9",
    "href": "slides/lecture-9/index.html#an-introduction-to-gaussian-processes-9",
    "title": "Lecture 9",
    "section": "An introduction to Gaussian processes",
    "text": "An introduction to Gaussian processes\nKernel functions\nThe RBF covariance function presented in the prior section is not the only candidate for a kernel function. Consider some of the other ones.\n\nLinearPolynomialCosine\n\n\n\nk \\left( \\boldsymbol{x}, \\boldsymbol{x}' \\right) = \\alpha \\; \\boldsymbol{x}^{T} \\boldsymbol{x}'\n\n\n\n\n\nk \\left( \\boldsymbol{x}, \\boldsymbol{x}' \\right) = \\alpha \\; \\left( 1 + \\boldsymbol{x}^{T} \\boldsymbol{x}' \\right)^{\\gamma}\n\n\n\n\n\nk \\left( \\boldsymbol{x}, \\boldsymbol{x}' \\right) = \\alpha^2 \\; cos\\left( \\frac{2 \\pi}{l^2}   \\left\\Vert  \\boldsymbol{x} - \\boldsymbol{x}' \\right\\Vert_{2}^{2} \\right)\n\n\n\n\n\n\n\nAE8803 | Gaussian Processes for Machine Learning"
  },
  {
    "objectID": "slides/lecture-7/index.html#gaussian-noise-model-3",
    "href": "slides/lecture-7/index.html#gaussian-noise-model-3",
    "title": "Lecture 7",
    "section": "Gaussian noise model",
    "text": "Gaussian noise model\nThis yields\n\n\\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)} \\left[ \\hat{\\mathbf{w}} \\right] = \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{X w} = \\mathbf{w}.\n\nSo, the expected value of our approximation \\mathbf{\\hat{w}} is the true parameter value \\mathbf{w}. This means that our estimator is unbiased.\nAny potential variability in \\mathbf{\\hat{w}} is captured by its covariance.\n\n\\begin{aligned}\nCov \\left[ \\mathbf{\\hat{w}} \\right] & = \\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)} \\left[\\mathbf{\\hat{w}} \\mathbf{\\hat{w}}^{T} \\right] - \\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)} \\left[ \\mathbf{\\hat{w}}  \\right] \\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)} \\left[ \\mathbf{\\hat{w}}  \\right]^{T} \\\\\n& = \\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)} \\left[\\mathbf{\\hat{w}} \\mathbf{\\hat{w}}^{T} \\right] - \\mathbf{ww}^{T}\n\\end{aligned}\n\\tag{1}\nFocusing solely on the first term, we have \n\\begin{aligned}\n\\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)} \\left[\\mathbf{\\hat{w}} \\mathbf{\\hat{w}}^{T} \\right] & = \\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)} \\left[  \\left( \\left(\\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}  \\right) \\left( \\left(\\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}  \\right)^{T} \\right] \\\\\n& =  \\left(\\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T}  \\underbrace{\\mathbb{E}_{p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right)}\\left[ \\mathbf{tt}^{T} \\right]}_{\\text{need to determine}} \\mathbf{X} \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1}\n\\end{aligned}\n\\tag{2}"
  },
  {
    "objectID": "slides/lecture-5/index.html#multivariate-gaussians",
    "href": "slides/lecture-5/index.html#multivariate-gaussians",
    "title": "Lecture 5",
    "section": "Multivariate Gaussians",
    "text": "Multivariate Gaussians\nAlthough we have introduced joint probabilities and learnt how to manipulate them in the lectures prior, we have thus far only stuied univariate densities.\nIn this lecture, we will focus on one multivariate density that sets the stage for our journey into machine learning: the Gaussian distribution!"
  },
  {
    "objectID": "slides/lecture-5/index.html#multivariate-gaussians-1",
    "href": "slides/lecture-5/index.html#multivariate-gaussians-1",
    "title": "Lecture 5",
    "section": "Multivariate Gaussians",
    "text": "Multivariate Gaussians\nThe random vector, \\mathbf{X} = \\left(X_1, X_2, \\ldots, X_n \\right) is a multivariate Gaussian \\mathbf{X} \\sim \\mathcal{N} \\left( \\boldsymbol{\\mu}, \\boldsymbol{\\Sigma} \\right) if\n\nf_{\\mathbf{X}} \\left( \\mathbf{x} \\right) = \\frac{1}{\\left( 2 \\pi \\right)^{n/2}} \\left|\\boldsymbol{\\Sigma} \\right|^{-\\frac{1}{2}} exp \\left(-\\frac{1}{2}\\left( \\mathbf{x}-\\boldsymbol{\\mu}\\right)^{T} \\boldsymbol{\\Sigma}^{-1} \\left( \\mathbf{x}-\\boldsymbol{\\mu}\\right) \\right)\n\nwhere\n\n\\boldsymbol{\\Sigma} is a n \\times n covariance matrix\n\\boldsymbol{\\mu} = \\left( \\mu_1, \\mu_2, \\ldots, \\mu_{n} \\right)^{T} is a n \\times 1 mean vector."
  },
  {
    "objectID": "slides/lecture-5/index.html#multivariate-gaussians-2",
    "href": "slides/lecture-5/index.html#multivariate-gaussians-2",
    "title": "Lecture 5",
    "section": "Multivariate Gaussians",
    "text": "Multivariate Gaussians\n\nPlotCode\n\n\n\n\n\n\n\n\n\n\nHere\n\n\\boldsymbol{\\mu}=\\left(\\begin{array}{c}\n-2\\\\\n1\n\\end{array}\\right),\\Sigma=\\left(\\begin{array}{cc}\n3 & 0\\\\\n0 & 6\n\\end{array}\\right)\n\nTry adding non-zero entries into the off-diagonal. Note the matrix must be symmetric!\n\n\n\n\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nfrom mpl_toolkits.mplot3d import Axes3D\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n\n#Parameters to set\nmu_x = -2\nvariance_x = 3\n\nmu_y = 1\nvariance_y = 6\n\n#Create grid and multivariate normal\nx = np.linspace(-10,10,500)\ny = np.linspace(-10,10,500)\nX, Y = np.meshgrid(x,y)\npos = np.empty(X.shape + (2,))\npos[:, :, 0] = X; pos[:, :, 1] = Y\nrv = multivariate_normal([mu_x, mu_y], [[variance_x, 0], [0, variance_y]])\n\n#Make a 3D plot\nfig, ax = plt.subplots(subplot_kw=dict(projection='3d'), figsize=(7,8))\nax.plot_surface(X, Y, rv.pdf(pos),cmap='viridis',linewidth=0)\nax.set_xlabel('$x_1$')\nax.set_ylabel('$x_2$')\nax.set_title(r'Probability density, $f_{\\mathbf{X}} (\\mathbf{x})$')\nplt.close()"
  },
  {
    "objectID": "slides/lecture-5/index.html#multivariate-gaussians-3",
    "href": "slides/lecture-5/index.html#multivariate-gaussians-3",
    "title": "Lecture 5",
    "section": "Multivariate Gaussians",
    "text": "Multivariate Gaussians\n\nRemember that \\mathbf{X} is a random vector and its possible values \\mathbf{x} are also vectors.\nThe density, f_{\\mathbf{X}} \\left( \\mathbf{x} \\right), is a scalar-valued function.\nThe coefficient \n\\frac{1}{\\left( 2 \\pi \\right)^{n/2}} \\left|\\boldsymbol{\\Sigma} \\right|^{-\\frac{1}{2}}\n acts as a normalizing constant."
  },
  {
    "objectID": "slides/lecture-5/index.html#covariance-matrix",
    "href": "slides/lecture-5/index.html#covariance-matrix",
    "title": "Lecture 5",
    "section": "Covariance matrix",
    "text": "Covariance matrix\n\nElements of the covariance matrix have the following form:\n\n\n\\left[ \\boldsymbol{\\Sigma} \\right]_{ij} = \\mathbb{E} \\left[ \\left( X_i - \\mu_{i} \\right) \\left( X_j - \\mu_{j} \\right) \\right] = \\mathbb{E} \\left[ X_i X_j \\right] - \\mu_{i}\\mu_{j}.\n\\tag{1}\n\nFollowing Equation 1 it is clear that the diagonal elements are simply the individual variances:\n\n\n\\left[ \\boldsymbol{\\Sigma} \\right]_{ii}  = \\mathbb{E} \\left[ X^2_i \\right] - \\mu^2_{i} = Var \\left(X_i \\right).\n\n\nThe matrix is symmetric with off-diagonal terms being zero when two components X_i and X_j are independent, i.e., \\left[ \\boldsymbol{\\Sigma} \\right]_{ij} = \\mathbb{E} \\left[ X_i \\right] \\mathbb{E} \\left[ X_j \\right] - \\mu_{i} \\mu_{j} = 0."
  },
  {
    "objectID": "slides/lecture-5/index.html#covariance-matrix-1",
    "href": "slides/lecture-5/index.html#covariance-matrix-1",
    "title": "Lecture 5",
    "section": "Covariance matrix",
    "text": "Covariance matrix\n\nWhen the off-diagonal elements are not zero, i.e., when two components X_i and X_j are related, we can introduce a measure called the correlation coefficient\n\n\n\\rho_{ij} = \\frac{\\left[ \\boldsymbol{\\Sigma} \\right]_{ij} }{\\left( Var\\left(X_i \\right)Var\\left(X_j \\right) \\right)^{1/2} }.\n\n\nThis values satisfies -1 \\leq \\rho_{ij} \\leq 1, and depending upon the sign it is said to be either negatively correlated or positively correlated.\nWhen \\rho_{ij}=0, i.e., when there is no correlation, \\left[ \\boldsymbol{\\Sigma} \\right]_{ij}= 0."
  },
  {
    "objectID": "slides/lecture-5/index.html#marginal-distribution",
    "href": "slides/lecture-5/index.html#marginal-distribution",
    "title": "Lecture 5",
    "section": "Marginal distribution",
    "text": "Marginal distribution\n\nIt can be shown that the marginal density of any component \\left(X_1, \\ldots, X_n \\right) of a multivariate Gaussian is a univariate Gaussian.\nTo see this, consider that\n\n\nf_{X_k}\\left( x \\right) = \\int_{-\\infty}^{\\infty} \\ldots \\int_{-\\infty}^{\\infty} f_{\\mathbf{X}} \\left(\\mathbf{x} \\right) dx_1 dx_2 \\ldots dx_{k-1} dx_{k+1} \\ldots dx_{n}\n\n\n= \\frac{1}{\\sqrt{2 \\pi \\left[ \\boldsymbol{\\Sigma} \\right]_{kk} } } exp \\left( \\frac{\\left( x - \\mu_{k} \\right)^2}{2 \\left[ \\boldsymbol{\\Sigma} \\right]_{kk} } \\right)\n\n\nIn practice any partial marginalization of a multivariate Gaussian will yield another multivariate Gaussian (but with reduced dimensions)."
  },
  {
    "objectID": "slides/lecture-5/index.html#marginal-and-conditional-distribution",
    "href": "slides/lecture-5/index.html#marginal-and-conditional-distribution",
    "title": "Lecture 5",
    "section": "Marginal and conditional distribution",
    "text": "Marginal and conditional distribution\n\nLet \\mathbf{X} and \\mathbf{Y} be jointly Gaussian random vectors with marginals \n\\mathbf{X} \\sim \\mathcal{N}\\left(\\boldsymbol{\\mu}_{x}, \\mathbf{A} \\right), \\; \\; \\; \\text{and} \\; \\; \\; \\mathbf{Y} \\sim \\mathcal{N}\\left(\\boldsymbol{\\mu}_{y}, \\mathbf{B} \\right).\n\nWe can write the joint distribution as shown below\n\n\n\\left[\\begin{array}{c}\n\\mathbf{X}\\\\\n\\mathbf{Y}\n\\end{array}\\right]\\sim\\mathcal{N}\\left( \\underbrace{\\left[\\begin{array}{c}\n\\boldsymbol{\\mu}_{x}\\\\\n\\boldsymbol{\\mu}_{y}\n\\end{array}\\right]}_{\\boldsymbol{\\mu}}, \\underbrace{\\left[\\begin{array}{cc}\n\\mathbf{A} & \\mathbf{C}\\\\\n\\mathbf{C}^{T} & \\mathbf{B}\n\\end{array}\\right]}_{\\boldsymbol{\\Sigma}}\\right)\n\n\nThe conditional distribution of \\mathbf{X} given \\mathbf{Y} is\n\n\nf_{\\mathbf{X} | \\mathbf{Y}} \\left( \\mathbf{x}, \\mathbf{y} \\right) = \\mathcal{N} \\left( \\boldsymbol{\\mu}_{x} + \\mathbf{CB}^{-1} \\left(\\mathbf{y} - \\boldsymbol{\\mu}_{y} \\right), \\mathbf{A} - \\mathbf{CB}^{-1} \\mathbf{C}^{T} \\right)\n\n\nAlgebraically, this uses the Schur complement. To explore this further, consider the following schematic."
  },
  {
    "objectID": "slides/lecture-5/index.html#marginal-and-conditional-distribution-1",
    "href": "slides/lecture-5/index.html#marginal-and-conditional-distribution-1",
    "title": "Lecture 5",
    "section": "Marginal and conditional distribution",
    "text": "Marginal and conditional distribution\n\nPlotCode\n\n\n\n\n\n\n\n\n\n\nThe joint multivariate Gaussian distribution to the left has mean and covariance:\n\n\\boldsymbol{\\mu}=\\left(\\begin{array}{c}\n0.5\\\\\n0.2\n\\end{array}\\right),\\Sigma=\\left(\\begin{array}{cc}\n1.5 & -1.27\\\\\n-1.27 & 3\n\\end{array}\\right)\n\nAs an example, we wish to work out what f_{X| Y} \\left( x, y=3.7 \\right) is (see code).\nThe conditional is Gaussian!\n\n\n\n\n\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal, norm\nimport pandas as pd\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n\nvar_1 = 1.5\nvar_2 = 3.0\nrho = -0.6\noff_diag = np.sqrt(var_1 * var_2) * rho\n\n\nmu = np.array([0.5, 0.2])\ncov = np.array([[var_1, off_diag], \\\n       [off_diag, var_2]])\n\nrv = multivariate_normal(mu, cov)\n\n# Generate random samples from this multivariate normal (largely for plotting!)\ndata = rv.rvs(8500)\ndf = pd.DataFrame({'$x$': data[:,0].flatten(), '$y$': data[:,1].flatten()})\n\n# Now, to plot the conditional distribution of $X_1$ at $X_2=5.0$, we would have\ndef calculate_conditional(mu, cov, yy):\n    new_mu = mu[0] + cov[0,1] * (cov[1,1])**(-1) * (yy - mu[1])\n    new_var =  cov[0,0] - cov[0,1] * (cov[1,1])**(-1) * cov[0,1]\n    return new_mu, new_var\n\ny_new = 3.7\ncond_mu, cond_var = calculate_conditional(mu, cov, y_new)\n\n# Now, to plot the conditional distribution of $X_1$ at $X_2=5.0$, we would have\ndef calculate_conditional(mu, cov, yy):\n    new_mu = mu[0] + cov[0,1] * (cov[1,1])**(-1) * (yy - mu[1])\n    new_var =  cov[0,0] - cov[0,1] * (cov[1,1])**(-1) * cov[0,1]\n    return new_mu, new_var\n\ny_new = 3.7\ncond_mu, cond_var = calculate_conditional(mu, cov, y_new)\n\nX_samples = np.tile( np.linspace(-10, 10, 200).reshape(200,1) , (1, 2))\nX_samples[:,1] = X_samples[:,1]* 0 + y_new\n\nf_X = rv.pdf(X_samples)\nrv2 = 
multivariate_normal(cond_mu, cond_var)\nf_X1 = rv2.pdf(X_samples[:,0])\n\n# Plot!\ng = sns.JointGrid(data=df, x=\"$x$\", y=\"$y$\", space=0)\ng.plot_joint(sns.kdeplot, fill=True,  cmap=\"turbo\", thresh=0, levels=100)\ng.plot_marginals(sns.kdeplot, color=\"grey\", gridsize=100)\nplt.close()\n\nfig = plt.figure(figsize=(8,3))\nplt.plot(X_samples[:,0], f_X1, 'r-')\nplt.xlabel('$x$')\nplt.title('Conditional distribution of $x$ at $y=3.7$')\nplt.close()"
  },
  {
    "objectID": "slides/lecture-5/index.html#generating-samples",
    "href": "slides/lecture-5/index.html#generating-samples",
    "title": "Lecture 5",
    "section": "Generating samples",
    "text": "Generating samples\nIt will be useful to generate samples from a multivariate Gaussian. To understand how to do this, consider the following setup.\n\nLet \\mathbf{X} \\sim \\mathcal{N} \\left(\\mathbf{0}, \\mathbf{I}\\right). Thus, \\mathbb{E} \\left[ \\mathbf{X} \\right] = \\mathbf{0}, and Cov\\left[ \\mathbf{X} \\right] = \\mathbf{I}.\nNow consider the map given by \\tilde{\\mathbf{x}} = \\mathbf{S} \\mathbf{x} + \\mathbf{b}, where \\mathbf{x} is a particular value from the random variable \\mathbf{X}, where\n\n\\mathbf{S} \\in \\mathbb{R}^{n \\times n} is a matrix;\n\\mathbf{b} \\in \\mathbb{R}^{n} is a vector.\n\nBy linearity of the expectation we can show that\n\n\n\\mathbb{E} \\left[ \\tilde{\\mathbf{X}} \\right] = \\mathbf{b}, \\; \\; \\; \\text{and} \\; \\; \\; Cov   \\left[ \\tilde{\\mathbf{X}} \\right] = \\mathbf{SS}^{T}.\n\n\nThe distribution \\mathcal{N}\\left( \\mathbf{b}, \\mathbf{SS}^{T} \\right) is valid (i.e., it is Gaussian), only if \\mathbf{S} is non-singular, i.e., \\mathbf{SS}^{T} is positive definite.\nIn practice, if we need to generate samples from \\mathcal{N}\\left( \\mathbf{b}, \\mathbf{B} \\right), we would compute the Cholesky decomposition of \\mathbf{B}= \\mathbf{LL}^{T}, and then use \\tilde{\\mathbf{x}} = \\mathbf{b} + \\mathbf{L} \\mathbf{x}.\n\n\n\nAE8803 | Gaussian Processes for Machine Learning"
  },
  {
    "objectID": "index.html",
    "href": "index.html",
    "title": "Overview",
    "section": "",
    "text": "This graduate-level course offers a practical approach to probabilistic learning with Gaussian processes (GPs). GPs represent a powerful set of methods for modeling and predicting a wide variety of spatio-temporal phenomena. Today, they are used for problems that span both regression and classification, with theoretical foundations in Bayesian inference, reproducing kernel Hilbert spaces, eigenvalue problems, and numerical integration. Rather than focus solely on these theoretical foundations, this course balances theory with practical probabilistic programming, using a variety of python-based packages. Moreover, practical engineering problems will also be discussed that see GP models that cut across other areas of machine learning including transfer learning, convolutional networks, and normalizing flows."
  },
  {
    "objectID": "index.html#grading",
    "href": "index.html#grading",
    "title": "Overview",
    "section": "Grading",
    "text": "Grading\nThis course has four assignments; the grades are given below:\n\n\n\n\n\n\n\nAssignment\nGrade percentage (%)\n\n\n\n\nAssignment 1: Take-home mid-term (covering fundamentals) \n20\n\n\nAssignment 2: Build your own GP from scratch for a given dataset \n20\n\n\nAssignment 3: Proposal\n20\n\n\nAssignment 4: Final project (presentation and notebook)\n40\n\n\n\n\nPre-requisites:\n\nCS1371, MATH2551, MATH2552 (or equivalent)\nWorking knowledge of python including familiarity with numpy and matplotlib libraries.\nWorking local version of python and Jupyter."
  },
  {
    "objectID": "index.html#lectures",
    "href": "index.html#lectures",
    "title": "Overview",
    "section": "Lectures",
    "text": "Lectures\nBelow you will find a list of the lectures that form the backbone of this course. Sub-topics for each lecture will be updated in due course.\n01.08: L1. Introduction & probability fundamentals | Slides | Examples\n\n\n\nContents\n\n\nCourse overview.\nProbability fundamentals (and Bayes’ theorem).\nRandom variables.\n\n\n01.10: L2. Discrete probability distributions | Slides | Examples | Notebook\n\n\nContents\n\n\nExpectation and variance.\nIndependence.\nBernoulli and Binomial distributions.\n\n\n01.15: No Class (Institute Holiday)\n01.17: L3. Continuous distributions | Slides | Examples\n\n\n\nContents\n\n\nFundamentals of continuous random variables.\nProbability density function.\nGaussian and Beta distributions.\n\n\n01.22: L4. Manipulating and combining distributions | Slides | Examples\n\n\nContents\n\n\nFunctions of random variables.\nSums of random variables.\n\n\n01.24: No Class\n01.29: L5. Multivariate Gaussian distributions | Slides\n\n\nContents\n\n\nMarginal distributions.\nConditional distributions.\nJoint distribution and Schur complement.\n\n\n01.31: L6. Linear modelling | Slides\n\n\nContents\n\n\nLeast squares.\nRegularization.\nGaussian noise model.\n\n\n\n02.05: L7. Towards Bayesian Inference | Slides\n\n\nContents\n\n\nPosterior mean and covariance for a linear model.\nFisher information matrix.\nBayesian model introduction.\nPosterior definition.\n\n\n02.07: L8. Bayesian inference in action | Slides\n\n\nContents\n\n\nAnalytical calculation of the posterior\nConjugacy in Bayesian inference\nA function-space perspective\n\n\n02.12:  Fundamentals Mid-term (take-home)\n02.12: L9. An introduction to Gaussian Processes | Slides | Notebook\n\n\nContents\n\n\nGaussian process prior\nNoise-free regression\nKernel functions\nMidterm overview\n\n\n02.14: L10. 
More on Gaussian Processes and Kernels | Slides | Notebook\n\n\nContents\n\n\nNoisy regression\nMore about kernels\nKernel trick\n\n\n02.19: No class\n02.21: No class\n02.26: L11. More about Kernels | Notebook 1 | Notebook 2 | Notebook 3\n\n\nContents\n\n\nMinimum norm problems\nThe case of infinitely many feature vectors\nEigenfunction analysis\nFourier analysis\n\n\n02.28: Coding assignment issued\n02.28: L12. Hyperparameter inference | Slides\n\n\nContents\n\n\nMAP\nMarginal likelihood\nIntroduction to gpytorch and pymc\n\n\n03.04: L13. Markov chain Monte Carlo | Notebook\n\n\nContents\n\n\nMAP vs MCMC\nMetropolis\nMetropolis-Hastings\nHMC and NUTS\n\n\n03.06: L14. Approximate inference | Slides | Notebook\n\n\nContents\n\n\nReview of approximate inference methods.\nKL divergence\nVariational inference\n\n\n03.08: L15. A Gaussian Process Case Study | Slides\n03.13: Withdrawal Deadline\n03.18-03.22: Spring Break\n03.25: L16. Scaling Gaussian Processes (Linear Algebra Perspective) | Slides\n\n\nContents\n\n\nNyström approximation.\nKronecker product structure.\nToeplitz structure.\n\n\n03.27: L17. Scaling Gaussian Processes II (Probabilistic Perspective) | Slides | Notebook\n\n\nContents\n\n\nBayesian inference review.\nDeterministic training conditional (DTC).\nFully independent training conditional (FITC).\n\n\n04.01: L18. Gaussian process classification | Slides\n\n\nContents\n\n\nClassification likelihood.\nMAP via Newton-Raphson.\n\n\n04.03: L19. Live coding session | Code coming up shortly!\n\n\nContents\n\n\nNewton-Raphson classification example.\nA simple multi-task model\n\n\n04.08: L20. Multi-task and Physics Constraints in Kernels | Slides | Notebook 1 | Notebook 2\n\n\nContents\n\n\nModel of coregionalization.\nDivergence-free kernel.\nCurl-free kernel.\n\n\n04.10: L21. Time series Gaussian processes | Slides\n\n\nContents\n\n\nKalman filtering.\nSpatio-temporal Gaussian processes.\nEquivalence.\n\n\n04.08: L23. Guest Lecture\n04.22: L24. 
Project presentations"
  },
  {
    "objectID": "index.html#office-hours",
    "href": "index.html#office-hours",
    "title": "Overview",
    "section": "Office hours",
    "text": "Office hours\nProfessor Seshadri’s office hours:\n\n\n\nLocation\nTime\n\n\n\n\nMK 421\nFridays 14:30 to 15:30"
  },
  {
    "objectID": "index.html#textbooks",
    "href": "index.html#textbooks",
    "title": "Overview",
    "section": "Textbooks",
    "text": "Textbooks\nThis course will make heavy use of the following texts:\n\nRasmussen, C. E., Williams, C. K. Gaussian Processes for Machine Learning, The MIT Press, 2006.\nMurphy, K. P., Probabilistic Machine Learning: Advanced Topics, The MIT Press, 2023.\n\nBoth these texts have been made freely available by the authors."
  },
  {
    "objectID": "index.html#important-papers",
    "href": "index.html#important-papers",
    "title": "Overview",
    "section": "Important papers",
    "text": "Important papers\nStudents are encouraged to read through the following papers:\n\nRoberts, S., Osborne, M., Ebden, M., Reece, S., Gibson, N., Aigrain, S., (2013) Gaussian processes for time-series modelling, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.\nDunlop, M., Girolami, M., Stuart, A., Teckentrup, A., (2018) How Deep Are Deep Gaussian Processes?, Journal of Machine Learning Research 19, 1-46\nAlvarez, M., Lawrence, N., (2011) Computationally Efficient Convolved Multiple Output Gaussian Processes, Journal of Machine Learning Research 12, 1459-1500\nVan der Wilk, M., Rasmussen, C., Hensman, J., (2017) Convolutional Gaussian Processes, 31st Conference on Neural Information Processing Systems"
  },
  {
    "objectID": "index.html#references",
    "href": "index.html#references",
    "title": "Overview",
    "section": "References",
    "text": "References\nMaterial used in this course has been adapted from\n\nCUED Part IB probability course notes\nAalto University’s module on Gaussian Processes\nSlides from the Gaussian Process Summer Schools"
  },
  {
    "objectID": "slides/lecture-6/index.html#the-three-model-levels",
    "href": "slides/lecture-6/index.html#the-three-model-levels",
    "title": "Lecture 6",
    "section": "The three model levels",
    "text": "The three model levels\nIn this lecture, we will explore three distinct but related flavors of modelling.\n\nLinear least squares model\nGaussian noise model (introducing the idea of likelihood)\nFull Bayesian treatment (next time!)\n\n\nMuch of the exposition shown here is based on the first three chapters of Rogers and Girolami’s A First Course in Machine Learning."
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares",
    "href": "slides/lecture-6/index.html#linear-least-squares",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\n\nConsider the data shown in the plot below. It shows the winning times for the men’s 100 meter race at the Summer Olympics for many years.\n\n\n\n\n\n\nOur goal will be to fit a model to this data. To begin, we will consider a linear model, i.e., \nt = f \\left( x; {\\color{blue}{w_0}}, {\\color{blue}{w_1}} \\right) = {\\color{blue}{w_0}} + {\\color{blue}{w_1}} x\n where x is the year and t is the winning time.\n{\\color{blue}{w_0}} and {\\color{blue}{w_1}} are unknown model parameters that we need to ascertain.\nGood sense would suggest that the best line passes as closely as possible through all the data points on the left."
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares-1",
    "href": "slides/lecture-6/index.html#linear-least-squares-1",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\nDefining a good model\n\nOne common strategy for defining a good model is based on the squared distance between the truth and the model. Thus, for a given year x_i with winning time t_i, this is written as: \n\\mathcal{L}_{i} = \\left( t_i - f \\left( x_i; {\\color{blue}{w_0}}, {\\color{blue}{w_1}} \\right) \\right)^2.\n\nHowever, as we want a model that fits well across all the data, we may consider the average across the entire data set, i.e., all N data points. This is given by: \n\\mathcal{L} = \\frac{1}{N} \\sum_{i=1}^{N} \\mathcal{L}_{i} = \\frac{1}{N} \\sum_{i=1}^{N} \\left( t_i - f \\left( x_i; {\\color{blue}{w_0}}, {\\color{blue}{w_1}} \\right) \\right)^2\n\nNote that this loss function is always non-negative, and the lower it is the better! Finding optimal values for {\\color{blue}{w_0}}, {\\color{blue}{w_1}} can be expressed as \n\\underset{{\\color{blue}{w_0}}, {\\color{blue}{w_1}}}{argmin} \\; \\; \\frac{1}{N} \\sum_{i=1}^{N} \\left( t_i - f \\left( x_i; {\\color{blue}{w_0}}, {\\color{blue}{w_1}} \\right) \\right)^2\n\n\n\nNote that other loss functions can be considered. A common example is the absolute loss, i.e., \\mathcal{L}_{i} = | t_i - f \\left( x_i; {\\color{blue}{w_0}}, {\\color{blue}{w_1}} \\right)|"
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares-2",
    "href": "slides/lecture-6/index.html#linear-least-squares-2",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\nMatrix-vector notation\n\nIt will be very useful to work with vectors and matrices. For convenience, we define: \n\\mathbf{X}=\\left[\\begin{array}{cc}\n1 & x_{1}\\\\\n1 & x_{2}\\\\\n\\vdots & \\vdots\\\\\n1 & x_{N}\n\\end{array}\\right] = \\left[\\begin{array}{c}\n\\mathbf{x}_{1}^{T}\\\\\n\\mathbf{x}_{2}^{T}\\\\\n\\vdots \\\\\n\\mathbf{x}_{N}^{T}\n\\end{array}\\right], \\; \\; \\; \\; \\; \\mathbf{t} =\\left[\\begin{array}{c}\nt_{1}\\\\\nt_{2}\\\\\n\\vdots\\\\\nt_{N}\n\\end{array}\\right], \\; \\; \\; \\; \\; \\mathbf{{\\color{blue}{w}}} = \\left[\\begin{array}{c}\n{\\color{blue}{w_0}}\\\\\n{\\color{blue}{w_1}}\n\\end{array}\\right]\n\nThe loss function from the prior slide is equivalent to writing \n\\mathcal{L} = \\frac{1}{N} \\left( \\mathbf{t} - \\mathbf{X} \\mathbf{{\\color{blue}{w}}} \\right)^{T} \\left( \\mathbf{t} - \\mathbf{X} \\mathbf{{\\color{blue}{w}}} \\right).\n\nThis can be expanded to yield \n\\mathcal{L} = \\frac{1}{N} \\left( \\mathbf{t}^{T} - \\left( \\mathbf{X} \\mathbf{{\\color{blue}{w}}} \\right)^T  \\right)\\left( \\mathbf{t} - \\mathbf{X} \\mathbf{{\\color{blue}{w}}} \\right) = \\frac{1}{N} \\left[ \\mathbf{t}^{T} \\mathbf{t} - 2 \\mathbf{t}^{T} \\mathbf{X}  \\mathbf{{\\color{blue}{w}}} +   \\mathbf{{\\color{blue}{w}}}^T \\mathbf{X}^{T} \\mathbf{X} \\mathbf{{\\color{blue}{w}}} \\right]"
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares-3",
    "href": "slides/lecture-6/index.html#linear-least-squares-3",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\nMinimizing the loss\n\nAs our objective is to minimize the loss, the obvious idea is to find out for which \\mathbf{{\\color{blue}{w}}}, the derivative of the loss function, \\partial \\mathcal{L} / \\partial \\mathbf{{\\color{blue}{w}}}, goes to zero.\nNote that in practice, we refer to these points as stationary points as they may equally correspond to maxima, minima, or saddle points. A positive second derivative is a sure sign of a minimum.\nPrior to working out the derivatives, it will be useful to take note of the following identities on the left below (the last one assumes that \\mathbf{C} is symmetric).\n\n\n\n\n\n\n\n\n\n\ng \\left( \\mathbf{{\\color{red}{v}}} \\right)\n\\partial g / \\partial \\mathbf{{\\color{red}{v}}}\n\n\n\n\n\\mathbf{{\\color{red}{v}}}^{T}\\mathbf{x}\n\\mathbf{x}\n\n\n\\mathbf{x}^{T} \\mathbf{{\\color{red}{v}}}\n\\mathbf{x}\n\n\n\\mathbf{{\\color{red}{v}}}^{T} \\mathbf{{\\color{red}{v}}}\n2\\mathbf{{\\color{red}{v}}}\n\n\n\\mathbf{{\\color{red}{v}}}^{T} \\mathbf{C} \\mathbf{{\\color{red}{v}}}\n2\\mathbf{C} \\mathbf{{\\color{red}{v}}}\n\n\n\n\n\nThe derivative of the loss function is given by \n\\frac{\\partial \\mathcal{L}}{\\partial \\mathbf{{\\color{blue}{w}}}} = - \\frac{2}{N} \\mathbf{X}^{T} \\mathbf{t} + \\frac{2}{N} \\mathbf{X}^{T} \\mathbf{X} \\mathbf{{\\color{blue}{w}}}\n\nSetting the derivative to zero, we have \n\\mathbf{X}^{T} \\mathbf{X} \\mathbf{{\\color{blue}{w}}} = \\mathbf{X}^{T} \\mathbf{t} \\; \\; \\; \\Rightarrow \\; \\; \\; \\hat{\\mathbf{{\\color{blue}{w}}}} = \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}\n where \\hat{\\mathbf{{\\color{blue}{w}}}} represents the value of \\mathbf{{\\color{blue}{w}}} that minimizes the loss."
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares-4",
    "href": "slides/lecture-6/index.html#linear-least-squares-4",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\n\n\n\nPlotCode\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n\ndf = pd.read_csv('notebooks/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\n\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), u])\nX = X_func(x)\nw_hat = np.linalg.inv(X.T @ X) @ X.T @ t\nloss_func = 1./N * (t - X @ w_hat).T @ (t - X @ w_hat)\n\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\ntime_grid = Xg @ w_hat\n\nfig = plt.figure(figsize=(6,4))\nplt.plot(df['Year'].values, df['Time'].values, 'o', color='crimson', label='Data')\nplt.plot(xgrid*(max_year - min_year) + min_year, time_grid, '-', color='dodgerblue', label='Model')\nplt.xlabel('Year')\nplt.ylabel('Time (seconds)')\nloss_title = r'Loss function, $\\mathcal{L}=$'+str(np.around(float(loss_func), 5))+'; \\t norm of $\\hat{\\mathbf{w}}$='+str(np.around(float(np.linalg.norm(w_hat,2)), 3))\nplt.title(loss_title)\nplt.legend()\nplt.savefig('olympics_0.png', dpi=150, bbox_inches='tight')\nplt.close()\n\n\n\n\n\n\nFor this result we set \n\\mathbf{X}=\\left[\\begin{array}{cc}\n1 & x_{1}\\\\\n1 & x_{2}\\\\\n\\vdots & \\vdots\\\\\n1 & x_{N}\n\\end{array}\\right]\n and solve for \\hat{\\mathbf{{\\color{blue}{w}}}} = \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}.\n\n\n\nOnce these weights are obtained, we can extrapolate (blue line) over the years.\n\n\n\nNote the graph title shows the loss function value and the L_2 norm, \\left\\Vert\\hat{\\mathbf{{\\color{blue}{w}}}}\\right\\Vert_{2}."
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares-5",
    "href": "slides/lecture-6/index.html#linear-least-squares-5",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\n\n\n\nPlotCode\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n\ndf = pd.read_csv('notebooks/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\n\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), u , u**2, u**3])\nX = X_func(x)\nw_hat = np.linalg.inv(X.T @ X) @ X.T @ t\nloss_func = 1./N * (t - X @ w_hat).T @ (t - X @ w_hat)\n\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\ntime_grid = Xg @ w_hat\n\nfig = plt.figure(figsize=(6,4))\nplt.plot(df['Year'].values, df['Time'].values, 'o', color='crimson', label='Data')\nplt.plot(xgrid*(max_year - min_year) + min_year, time_grid, '-', color='dodgerblue', label='Model')\nplt.xlabel('Year')\nplt.ylabel('Time (seconds)')\nloss_title = r'Loss function, $\\mathcal{L}=$'+str(np.around(float(loss_func), 5))+'; \\t norm of $\\hat{\\mathbf{w}}$='+str(np.around(float(np.linalg.norm(w_hat,2)), 3))\nplt.title(loss_title)\nplt.legend()\nplt.savefig('olympics_3.png', dpi=150, bbox_inches='tight')\nplt.close()\n\n\n\n\n\n\nFor this result we set \n\\mathbf{X}=\\left[\\begin{array}{cccc}\n1 & x_{1} & x_{1}^2 & x_{1}^3\\\\\n1 & x_{2} & x_{2}^2 & x_{2}^3\\\\\n\\vdots & \\vdots & \\vdots & \\vdots\\\\\n1 & x_{N} & x_{N}^2 & x_{N}^3\n\\end{array}\\right]\n and solve for \\hat{\\mathbf{{\\color{blue}{w}}}} = \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}.\n\n\n\nOnce these weights are obtained, we can extrapolate (blue line) over the years.\n\n\n\nNote the graph title shows the loss function value and the L_2 norm, \\left\\Vert\\hat{\\mathbf{{\\color{blue}{w}}}}\\right\\Vert_{2}."
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares-6",
    "href": "slides/lecture-6/index.html#linear-least-squares-6",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\n\n\n\nPlotCode\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n\ndf = pd.read_csv('notebooks/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\n\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([u**k for k in range(9)])  # degree-8 polynomial features: 1, u, ..., u**8\nX = X_func(x)\nw_hat = np.linalg.inv(X.T @ X) @ X.T @ t\nloss_func = 1./N * (t - X @ w_hat).T @ (t - X @ w_hat)\n\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\ntime_grid = Xg @ w_hat\n\nfig = plt.figure(figsize=(6,4))\nplt.plot(df['Year'].values, df['Time'].values, 'o', color='crimson', label='Data')\nplt.plot(xgrid*(max_year - min_year) + min_year, time_grid, '-', color='dodgerblue', label='Model')\nplt.xlabel('Year')\nplt.ylabel('Time (seconds)')\nloss_title = r'Loss function, $\\mathcal{L}=$'+str(np.around(float(loss_func), 5))+'; \\t norm of $\\hat{\\mathbf{w}}$='+str(np.around(float(np.linalg.norm(w_hat,2)), 3))\nplt.title(loss_title)\nplt.legend()\nplt.savefig('olympics_8.png', dpi=150, bbox_inches='tight')\nplt.close()\n\n\n\n\n\n\nFor this result we set \n\\mathbf{X}=\\left[\\begin{array}{cccccc}\n1 & x_{1} & x_{1}^2 & x_{1}^3 & \\ldots & x_{1}^{8} \\\\\n1 & x_{2} & x_{2}^2 & x_{2}^3 & \\ldots & x_{2}^{8} \\\\\n\\vdots & \\vdots & \\vdots & \\vdots & \\ldots & \\vdots \\\\\n1 & x_{N} & x_{N}^2 & x_{N}^3 & \\ldots & x_{N}^{8} \\\\\n\\end{array}\\right]\n and solve for \\hat{\\mathbf{{\\color{blue}{w}}}} = \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}.\n\n\n\nOnce these weights are obtained, we can extrapolate (blue line) over the years.\n\n\n\nNote the graph title shows the loss function 
value and the L_2 norm, \\left\\Vert\\hat{\\mathbf{{\\color{blue}{w}}}}\\right\\Vert_{2}."
  },
  {
    "objectID": "slides/lecture-6/index.html#linear-least-squares-7",
    "href": "slides/lecture-6/index.html#linear-least-squares-7",
    "title": "Lecture 6",
    "section": "Linear least squares",
    "text": "Linear least squares\nWith regularization\n\nThere is clearly a trade-off between:\n\nthe complexity of the model in terms of the number of weights, and\nthe value of the loss function.\n\nThere is also the risk of over-fitting to the data. For instance, if we had only 9 data points, then the last model would have interpolated each point, at the risk of not being generalizable.\nAs we do not want our model to be too complex, there are two relatively simple recipes:\n\nSplit the data into test and train (your homework!)\nAdd a regularization term to obtain a new loss, i.e., \n\\mathcal{L}' = \\mathcal{L} + \\lambda \\mathbf{{\\color{blue}{w}}}^{T} \\mathbf{{\\color{blue}{w}}}\n where \\lambda is a non-negative constant."
  },
  {
    "objectID": "slides/lecture-6/index.html#maximum-likelihood",
    "href": "slides/lecture-6/index.html#maximum-likelihood",
    "title": "Lecture 6",
    "section": "Maximum likelihood",
    "text": "Maximum likelihood\n\nThe linear model from before is unable to capture each data point, and there are errors between the true data and the model.\nNow we will consider a paradigm where these errors are explicitly modelled.\nWe consider a model of the form \nt_j = f \\left( \\mathbf{x}_{j}; \\mathbf{{\\color{blue}{w}}} \\right) + \\epsilon_{j} \\; \\; \\; \\epsilon_{j} \\sim \\mathcal{N}\\left(0, \\sigma^2 \\right), \\; \\; \\; \\; \\text{where} \\; j \\in \\left[1, N \\right]\n\\tag{1}\nRecall, we had previously learnt that adding a constant to a Gaussian random variable alters its mean. Thus, the random variable t_j has a probability density function \np \\left( t_j | \\mathbf{x}_{j}, \\mathbf{{\\color{blue}{w}}}, \\sigma^2 \\right) = \\mathcal{N} \\left( \\mathbf{{\\color{blue}{w}}}^{T} \\mathbf{x}_{j} , \\sigma^2\\right)\n\nCarefully note the conditioning: the probability density function for t_j depends on particular values of \\mathbf{x}_{j} and \\mathbf{{\\color{blue}{w}}}."
  },
  {
    "objectID": "slides/lecture-6/index.html#maximum-likelihood-1",
    "href": "slides/lecture-6/index.html#maximum-likelihood-1",
    "title": "Lecture 6",
    "section": "Maximum likelihood",
    "text": "Maximum likelihood\nDefining the likelihood\nIf we evaluate the linear model from before, and assume that in Equation 1 \\sigma^2 = 0.05, we would find that \np \\left( t_j | \\mathbf{x}_{j} = \\left[ 1, 1980 \\right]^{T} ,\\mathbf{{\\color{blue}{w}}} = \\left[10.964, -1.31 \\right]^{T}, \\sigma^2 = 0.05 \\right) = \\mathcal{N} \\left( 10.03, 0.05 \\right)  \n This quantity is known as the likelihood of the j-th data point.\n\nPlotCode\n\n\n\n\n\n\n\n\n\n\n\nNote that for a continuous random variable t, p\\left( t \\right) cannot be interpreted as a probability.\nThe height of the curve to the left tells us how likely it is that we observe a particular t for x=1980.\nThe most likely is B, followed by C and then A. Note the actual winning time is t_{j}=10.25.\nWhile we obviously cannot change the actual winning time, we can change \\mathbf{{\\color{blue}{w}}} and \\sigma^2 to move the density to make it as high as possible at t=10.25.\n\n\n\n\n\n\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport pandas as pd\nfrom scipy.stats import multivariate_normal\n\n# Get the data \ndf = pd.read_csv('notebooks/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), u])\nX = X_func(x)\nw_hat = np.linalg.inv(X.T @ X) @ X.T @ t\n\n# Specific year!\nyear_j = 1980\nX_j = X_func(np.array( [ (year_j - min_year) / (max_year - min_year) ] ).reshape(1,1) )\ntime_j = float(X_j @ w_hat)\n\nT_1980 = multivariate_normal(time_j, 0.05)\nti = np.linspace(9, 11, 100)\npt_x = T_1980.pdf(ti)\n\nfig = plt.figure(figsize=(7,3))\nplt.plot(ti, pt_x, '-', color='orangered', lw=3, label='From linear model')\nplt.axvline(9.53, linestyle='-.', color='dodgerblue', label='A')\nplt.axvline(10.08, linestyle='-', color='green', 
label='B')\nplt.axvline(10.40, linestyle='--', color='navy', label='C')\nplt.xlabel('Time (seconds)')\nplt.ylabel(r'$p \\left( t | x \\right)$')\nplt.title(r'For $x=1980$')\nplt.legend()\nplt.close()"
  },
  {
    "objectID": "slides/lecture-6/index.html#maximum-likelihood-2",
    "href": "slides/lecture-6/index.html#maximum-likelihood-2",
    "title": "Lecture 6",
    "section": "Maximum likelihood",
    "text": "Maximum likelihood\nDefining the likelihood\n\nThis idea of finding parameters that can maximize the likelihood is very important in machine learning.\nHowever, in general, we are seldom interested in the likelihood of an isolated data point – we are interested in the likelihood across all the data.\nThis leads to the conditional distribution across all N data points \np \\left( t_1, \\ldots, t_N | \\mathbf{x}_1, \\ldots, \\mathbf{x}_{N}, \\mathbf{{\\color{blue}{w}}}, \\sigma^2 \\right)\n\nIf we assume the noise at each data point is independent, this conditional density can be factorized into N separate terms \n\\mathcal{L} = p  \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{{\\color{blue}{w}}}, \\sigma^2 \\right) = \\prod_{j=1}^{N} p \\left( t_j | \\mathbf{x}_{j}, \\mathbf{{\\color{blue}{w}}}, \\sigma^2 \\right) = \\prod_{j=1}^{N} \\mathcal{N} \\left(\\mathbf{{\\color{blue}{w}}}^{T} \\mathbf{x}_{j}, \\sigma^2 \\right).\n\\tag{2}\nNote that the t_j values are not completely independent: times have clearly decreased over the years! They are conditionally independent. For a given value of \\mathbf{{\\color{blue}{w}}} the t_j are independent; otherwise they are not.\nWe will now maximize the likelihood (see Equation 2)."
  },
  {
    "objectID": "slides/lecture-6/index.html#maximum-likelihood-3",
    "href": "slides/lecture-6/index.html#maximum-likelihood-3",
    "title": "Lecture 6",
    "section": "Maximum likelihood",
    "text": "Maximum likelihood\nMaximizing the logarithm of the likelihood\n\nPlugging in the definition of a Gaussian probability density function into Equation 2 we arrive at \n\\mathcal{L} = \\prod_{j=1}^{N} \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} exp \\left( -\\frac{1}{2 \\sigma^2} \\left(t_j - f \\left( \\mathbf{x}_{j}; \\mathbf{{\\color{blue}{w}}} \\right) \\right)^2  \\right)\n\nTaking the logarithm on both sides and simplifying: \nlog \\left( \\mathcal{L} \\right)  = \\sum_{j=1}^{N} log \\left( \\frac{1}{\\sqrt{2 \\pi \\sigma^2}} exp \\left( -\\frac{1}{2 \\sigma^2} \\left(t_j - f \\left( \\mathbf{x}_{j}; \\mathbf{{\\color{blue}{w}}} \\right) \\right)^2  \\right) \\right)\n \n= \\sum_{j=1}^{N}  \\left( -\\frac{1}{2} log \\left( 2 \\pi \\right) - log \\left(\\sigma \\right) - \\frac{1}{2\\sigma^2} \\left( t_j - f \\left(\\mathbf{x}_{j}; \\mathbf{{\\color{blue}{w}}} \\right) \\right)^2 \\right)\n \n= -\\frac{N}{2} log \\left( 2 \\pi \\right) - N \\; log \\left( \\sigma \\right) - \\frac{1}{2 \\sigma^2} \\sum_{j=1}^{N} \\left(t_j - f \\left(\\mathbf{x}_{j}; \\mathbf{{\\color{blue}{w}}} \\right) \\right)^2"
  },
  {
    "objectID": "slides/lecture-6/index.html#maximum-likelihood-4",
    "href": "slides/lecture-6/index.html#maximum-likelihood-4",
    "title": "Lecture 6",
    "section": "Maximum likelihood",
    "text": "Maximum likelihood\nMaximizing the logarithm of the likelihood\n\nJust as we did earlier, with the least squares solution, we can set the derivative of the logarithm of the likelihood to zero. \n\\frac{\\partial \\; log \\left( \\mathcal{L} \\right) }{\\partial \\mathbf{{\\color{blue}{w}}}} = \\frac{1}{\\sigma^2} \\sum_{j=1}^{N} \\mathbf{x}_{j} \\left( t_{j} - \\mathbf{x}_{j}^{T} \\mathbf{{\\color{blue}{w}}} \\right) = \\frac{1}{\\sigma^2} \\sum_{j=1}^{N} \\left( \\mathbf{x}_{j} t_j - \\mathbf{x}_{j} \\mathbf{x}_{j}^{T} \\mathbf{{\\color{blue}{w}}} \\right) \\equiv 0\n\nJust as we did before, we can use matrix vector notation to write this out as\n\n\n\\frac{\\partial \\; log \\left( \\mathcal{L} \\right) }{\\partial \\mathbf{{\\color{blue}{w}}} } = \\frac{1}{\\sigma^2} \\left( \\mathbf{X}^{T} \\mathbf{t} - \\mathbf{X}^{T} \\mathbf{X} \\mathbf{{\\color{blue}{w}}}\\right) = 0\n\n\nSolving this expression leads to\n\n\\mathbf{X}^{T} \\mathbf{t} - \\mathbf{X}^{T} \\mathbf{X} \\mathbf{{\\color{blue}{w}}} = 0 \\Rightarrow \\hat{\\mathbf{{\\color{blue}{w}}}} = \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}.\n\n Thus, the maximum likelihood solution for \\mathbf{{\\color{blue}{w}}} is exactly the solution for the least squares problem!  Minimizing the squared loss is equivalent to the maximum likelihood solution if the noise is assumed Gaussian."
  },
  {
    "objectID": "slides/lecture-6/index.html#maximum-likelihood-5",
    "href": "slides/lecture-6/index.html#maximum-likelihood-5",
    "title": "Lecture 6",
    "section": "Maximum likelihood",
    "text": "Maximum likelihood\nMaximizing the logarithm of the likelihood\n\nWhat remains now is to compute the maximum likelihood estimate of the noise, \\sigma. Setting \\mathbf{{\\color{blue}{w}}} = \\hat{\\mathbf{{\\color{blue}{w}}}} we can write \n\\frac{\\partial \\; log \\left( \\mathcal{L} \\right) }{\\partial \\sigma }  = - \\frac{N}{\\sigma} + \\frac{1}{\\sigma^3} \\sum_{j=1}^{N} \\left( t_j - \\mathbf{x}_{j}^{T} \\hat{\\mathbf{{\\color{blue}{w}}} } \\right)^2 \\equiv 0.\n\nRearranging, this yields \\hat{\\sigma^2} = \\frac{1}{N} \\sum_{j=1}^{N} \\left( t_j - \\mathbf{x}_{j}^{T} \\hat{\\mathbf{{\\color{blue}{w}}} }\\right)^2.\nThis expression states that the variance is the averaged squared error, which intuitively makes sense. Re-writing this using matrix notation, we have \n\\hat{\\sigma^2} = \\frac{1}{N} \\left( \\mathbf{t} - \\mathbf{X} \\hat{\\mathbf{{\\color{blue}{w}}} } \\right)^{T}  \\left( \\mathbf{t} - \\mathbf{X} \\hat{\\mathbf{{\\color{blue}{w}}} } \\right) = \\frac{1}{N} \\left(  \\mathbf{t}^{T}  \\mathbf{t} - 2  \\mathbf{t}^{T} \\mathbf{X} \\hat{\\mathbf{{\\color{blue}{w}}} } + \\hat{\\mathbf{{\\color{blue}{w}}} }^{T} \\mathbf{X}^{T} \\mathbf{X} \\hat{\\mathbf{{\\color{blue}{w}}} } \\right)\n\n\nPlugging in \\hat{\\mathbf{{\\color{blue}{w}}}} = \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}, we arrive at \n\\hat{\\sigma^2} = \\frac{1}{N} \\left(   \\mathbf{t}^{T}   \\mathbf{t} -  \\mathbf{t}^{T}  \\mathbf{X} \\left( \\mathbf{X}^{T} \\mathbf{X} \\right)^{-1} \\mathbf{X}^{T} \\mathbf{t}\\right)"
  },
  {
    "objectID": "slides/lecture-6/index.html#maximum-likelihood-6",
    "href": "slides/lecture-6/index.html#maximum-likelihood-6",
    "title": "Lecture 6",
    "section": "Maximum likelihood",
    "text": "Maximum likelihood\nVisualizing the Gaussian noise model\n\n\n\nPlotCode\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n\ndf = pd.read_csv('notebooks/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\n\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), u ])\nX = X_func(x)\nw_hat = np.linalg.inv(X.T @ X) @ X.T @ t\nloss_func = 1./N * (t - X @ w_hat).T @ (t - X @ w_hat)\n\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\ntime_grid = Xg @ w_hat\n\n\nxi = xgrid*(max_year - min_year) + min_year\nxi = xi.flatten()\nsigma_hat_squared = float( (1. / N) * (t.T @ t - t.T @ X @ w_hat) )\nsigma_hat = np.sqrt(sigma_hat_squared)\nyi = xi* 0 + sigma_hat\n\nloss_func =  -N/2 * np.log(np.pi * 2)  - N * np.log(sigma_hat) - \\\n                   1./(2 * sigma_hat_squared) * np.sum((X @ w_hat - t)**2) \n\nfig = plt.figure(figsize=(6,4))\na, = plt.plot(df['Year'].values, df['Time'].values, 'o', color='crimson', label='Data')\nplt.plot(xi, time_grid, '-', color='r')\nc = plt.fill_between(xi, time_grid.flatten()-yi, time_grid.flatten()+yi, color='red', alpha=0.2, label='Model')\nplt.xlabel('Year')\nplt.ylabel('Time (seconds)')\nloss_title = r'Logarithm of loss function, $log \\left( \\mathcal{L} \\right)=$'+str(np.around(float(loss_func), 5))\nplt.title(loss_title)\nplt.legend([a,c], ['Data', 'Model'])\nplt.savefig('olympics_last.png', dpi=150, bbox_inches='tight')\nplt.close()\n\n\n\n\n\nThe graph on the left is the final model.\n\n\n\n\nAE8803 | Gaussian Processes for Machine Learning"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model",
    "href": "slides/lecture-8/index.html#bayesian-model",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nDeriving the posterior\n\nRecall, from Lecture 8\n\n\np \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) \\propto p \\left( \\mathbf{t} | \\mathbf{w}, \\mathbf{X}, \\sigma^2 \\right) p \\left( \\mathbf{w}   \\right)\n \n\\begin{aligned}\n= \\frac{1}{\\left( 2 \\pi \\right)^{N/2} | \\sigma^2 \\mathbf{I} |^{1/2} }  exp \\left( - \\frac{1}{2} \\left( \\mathbf{t} - \\mathbf{Xw} \\right)^{T} \\left( \\sigma^2 \\mathbf{I} \\right)^{-1} \\left( \\mathbf{t} - \\mathbf{Xw} \\right)  \\right)   \\\\\n\\times \\frac{1}{\\left( 2 \\pi \\right)^{D/2} | \\boldsymbol{\\Sigma}_{0} |^{1/2} } exp \\left( - \\frac{1}{2} \\left( \\mathbf{w} - \\boldsymbol{\\mu}_{0} \\right)^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\left( \\mathbf{w} - \\boldsymbol{\\mu}_{0} \\right)  \\right)\n\\end{aligned}\n \n\\propto exp \\left( - \\frac{1}{2} \\left( \\mathbf{t} - \\mathbf{Xw} \\right)^{T} \\left( \\sigma^2 \\mathbf{I} \\right)^{-1} \\left( \\mathbf{t} - \\mathbf{Xw} \\right)  \\right)  \\times exp \\left( - \\frac{1}{2} \\left( \\mathbf{w} - \\boldsymbol{\\mu}_{0} \\right)^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\left( \\mathbf{w} - \\boldsymbol{\\mu}_{0} \\right)  \\right)\n\n\nHere D denotes the number of entries in \\mathbf{w}.\nWe shall now continue this derivation."
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-1",
    "href": "slides/lecture-8/index.html#bayesian-model-1",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nDeriving the posterior\n\np \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) \\propto exp \\left\\{ - \\frac{1}{2} \\left( \\frac{1}{\\sigma^2} \\left( \\mathbf{t} - \\mathbf{Xw} \\right)^{T}   \\left( \\mathbf{t} - \\mathbf{Xw} \\right) + \\left(   \\mathbf{w} - \\boldsymbol{\\mu}_{0}  \\right)^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\left(   \\mathbf{w} - \\boldsymbol{\\mu}_{0}  \\right)  \\right) \\right\\}\n\nMultiplying the terms in the bracket out, and removing any terms that do not involve \\mathbf{w} yields\n\n\\begin{aligned}\np \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) & \\propto \\\\\n& exp \\left\\{ - \\frac{1}{2} \\left(  - \\frac{2}{\\sigma^2} \\mathbf{t}^{T} \\mathbf{Xw} + \\frac{1}{\\sigma^2} \\mathbf{w}^{T} \\mathbf{X}^{T}\\mathbf{X} \\mathbf{w} + \\mathbf{w}^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\mathbf{w} - 2 \\boldsymbol{\\mu}_{0}^{T} \\boldsymbol{\\Sigma}_{0}^{-1}\\mathbf{w}   \\right)   \\right\\}\n\\end{aligned}\n\\tag{1}\nThe trick is to recognize that since our posterior is Gaussian, it must have the form\n\n\\begin{aligned}\np \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) = \\mathcal{N}\\left(\\boldsymbol{\\mu}_{\\mathbf{w}}, \\boldsymbol{\\Sigma}_{\\mathbf{w}} \\right)\n\\end{aligned}\n\\tag{2}"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-2",
    "href": "slides/lecture-8/index.html#bayesian-model-2",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nDeriving the posterior\nIf we expand Equation 2, we arrive at\n\n\\begin{aligned}\np \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) & \\propto exp \\left\\{ - \\frac{1}{2} \\left( \\mathbf{w} - \\boldsymbol{\\mu}_{\\mathbf{w}} \\right)^{T} \\boldsymbol{\\Sigma}_{\\mathbf{w}}^{-1} \\left( \\mathbf{w} - \\boldsymbol{\\mu}_{\\mathbf{w}} \\right)  \\right\\} \\\\\n& \\propto exp \\left\\{ - \\frac{1}{2} \\left( \\mathbf{w}^{T} \\boldsymbol{\\Sigma}^{-1}_{\\mathbf{w}} \\mathbf{w} - 2 \\boldsymbol{\\mu}_{\\mathbf{w}}^{T} \\boldsymbol{\\Sigma}_{\\mathbf{w}}^{-1} \\mathbf{w} \\right)  \\right\\}\n\\end{aligned}\n\\tag{3}\nThe quadratic terms in \\mathbf{w} in Equation 3 must match those we had in Equation 1. Thus we can write \n\\begin{aligned}\n\\mathbf{w}^{T} \\boldsymbol{\\Sigma}_{\\mathbf{w}}^{-1} \\mathbf{w} & = \\frac{1}{\\sigma^2} \\mathbf{w}^{T} \\mathbf{X}^{T} \\mathbf{X} \\mathbf{w} + \\mathbf{w}^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\mathbf{w} \\\\\n& = \\mathbf{w}^{T} \\left( \\frac{1}{\\sigma^2} \\mathbf{X}^{T} \\mathbf{X} + \\boldsymbol{\\Sigma}_{0}^{-1} \\right) \\mathbf{w}\n\\end{aligned}\n\nAs for the expectation, we equate the linear terms in Equation 1 with those in Equation 3 \n\\begin{aligned}\n-2 \\boldsymbol{\\mu}_{\\mathbf{w}}^{T} \\boldsymbol{\\Sigma}_{\\mathbf{w}}^{-1}\\mathbf{w} = - \\frac{2}{\\sigma^2} \\mathbf{t}^{T} \\mathbf{Xw} - 2 \\boldsymbol{\\mu}_{0}^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\mathbf{w}\n\\end{aligned}"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-3",
    "href": "slides/lecture-8/index.html#bayesian-model-3",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nDeriving the posterior\nContinuing this expansion\n\n\\begin{aligned}\n\\boldsymbol{\\mu}_{\\mathbf{w}}^{T} \\boldsymbol{\\Sigma}_{\\mathbf{w}}^{-1} & = \\frac{1}{\\sigma^2} \\mathbf{t}^{T} \\mathbf{X} + \\boldsymbol{\\mu}_{0}^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\\\\n\\boldsymbol{\\mu}_{\\mathbf{w}}^{T} & = \\left( \\frac{1}{\\sigma^2} \\mathbf{t}^{T} \\mathbf{X} + \\boldsymbol{\\mu}_{0}^{T} \\boldsymbol{\\Sigma}_{0}^{-1} \\right) \\boldsymbol{\\Sigma}_{\\mathbf{w}} \\\\\n\\Rightarrow \\boldsymbol{\\mu}_{\\mathbf{w}} & = \\mathbf{\\Sigma}_{\\mathbf{w}} \\left( \\frac{1}{\\sigma^2} \\mathbf{X}^{T} \\mathbf{t} + \\boldsymbol{\\Sigma}^{-1}_{0} \\boldsymbol{\\mu}_{0} \\right)\n\\end{aligned}\n\nwhere the last statement is due to \\boldsymbol{\\Sigma}_{\\mathbf{w}} = \\boldsymbol{\\Sigma}_{\\mathbf{w}}^{T}, i.e., the covariance matrix is symmetric.\nTo summarize, we have now worked out our posterior\n\n\np \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) = \\mathcal{N}\\left( \\boldsymbol{\\mu}_{\\mathbf{w}}, \\boldsymbol{\\Sigma}_{\\mathbf{w}} \\right)\n where \n\\boldsymbol{\\mu}_{\\mathbf{w}} = \\mathbf{\\Sigma}_{\\mathbf{w}} \\left( \\frac{1}{\\sigma^2} \\mathbf{X}^{T} \\mathbf{t} + \\boldsymbol{\\Sigma}^{-1}_{0} \\boldsymbol{\\mu}_{0} \\right) \\; \\; \\text{and} \\; \\; \\boldsymbol{\\Sigma}_{\\mathbf{w}} = \\left( \\frac{1}{\\sigma^2} \\mathbf{X}^{T} \\mathbf{X} + \\boldsymbol{\\Sigma}_{0}^{-1} \\right)^{-1}"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-4",
    "href": "slides/lecture-8/index.html#bayesian-model-4",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nMaximum a posteriori estimate\n\nIf we set the prior mean to be zero, i.e., \\boldsymbol{\\mu}_{0} = \\left[ 0, 0, \\ldots, 0 \\right]^{T}, the resulting value of \\boldsymbol{\\mu}_{\\mathbf{w}} looks very similar to the maximum likelihood solution from before.\nNote, that because the posterior p \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) is Gaussian, the most likely value of \\mathbf{w} is the mean of the posterior, \\boldsymbol{\\mu}_{\\mathbf{w}}. .\nThis is known as the maximum a posteriori (MAP) estimate of \\mathbf{w} and can also be thought of as the maximum value of the joint density p \\left( \\mathbf{w}, \\mathbf{t} | \\mathbf{X} \\sigma^2, \\boldsymbol{\\mu}_{0}, \\boldsymbol{\\Sigma}_{0} \\right) (which is the likelihood \\times prior)."
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-5",
    "href": "slides/lecture-8/index.html#bayesian-model-5",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nPredictive density\nJust as we did last time, it will be useful to make predictions. To do this, we once again use \\mathbf{X}_{new} \\in \\mathbb{R}^{S \\times 2} where S is the number of new x locations (in this case time) at which our model needs to be evaluated, i.e., \n\\mathbf{X}_{new}=\\left[\\begin{array}{cc}\n1 & x^{new}_{1}\\\\\n1 & x^{new}_{2}\\\\\n\\vdots & \\vdots\\\\\n1 & x^{new}_{S}\n\\end{array}\\right]\n\nWe are interested in the predictive density\n\np \\left( \\mathbf{t}_{new} | \\mathbf{X}_{new}, \\mathbf{X},  \\mathbf{t}, \\sigma^2 \\right)\n\nNotice, that this density is not conditioned on \\mathbf{w}. We are going to integrate out \\mathbf{w} by taking an expectation with respect to the posterior p \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right), i.e.,\n\n\\begin{aligned}\np \\left( \\mathbf{t}_{new} | \\mathbf{X}_{new}, \\mathbf{X},  \\mathbf{t}, \\sigma^2 \\right) & = \\mathbb{E}_{p \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right)} \\left[  p \\left( \\mathbf{t}_{new} | \\mathbf{w},  \\mathbf{X}_{new}, \\sigma^2 \\right)\\right] \\\\\n& = \\int  p \\left( \\mathbf{t}_{new} | \\mathbf{w}, \\mathbf{X}_{new}, \\sigma^2 \\right) p \\left( \\mathbf{w} | \\mathbf{t}, \\mathbf{X}, \\sigma^2 \\right) d\\mathbf{w}\n\\end{aligned}"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-6",
    "href": "slides/lecture-8/index.html#bayesian-model-6",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nPredictive density\n\nThis leads to the predictive density form given by \np \\left( \\mathbf{t}_{new} | \\mathbf{X}_{new}, \\mathbf{X},  \\mathbf{t}, \\sigma^2 \\right)  = \\mathcal{N} \\left( \\mathbf{X}_{new}  \\boldsymbol{\\mu}_{\\mathbf{w}} , \\sigma^2 \\mathbf{I} + \\mathbf{X}_{new}^{T} \\boldsymbol{\\Sigma}_{\\mathbf{w}} \\mathbf{X}_{new} \\right)\n\\tag{4}\nTry proving this yourself! You may find section 2.3.2 of Bishop useful\nThe inclusion of the \\sigma^2 \\mathbf{I} term is optional, i.e., if we wish to replicate the true signal (without noise) this can be negated. However, if we wish to model the realistic data generation process then the noise should be incorporated."
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-7",
    "href": "slides/lecture-8/index.html#bayesian-model-7",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nVisualizing the prior\n\nPlotCode\n\n\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\nplt.style.use('dark_background')\ndf = pd.read_csv('notebook/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\n# Data & basis\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), u ])\nX = X_func(x)\n\n# For prediction / plotting\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\nxi = xgrid*(max_year - min_year) + min_year\nxi = xi.flatten()\nsigma_hat = 0.2\nsigma_w = 2.0\n\n# Prior\nmu_0 = np.array([[7],\n                 [0]])\nSigma_0 = np.array([[5.0, -0.8],\n                    [-0.8, 0.5]])\ninv_Sigma_0 = np.eye(X.shape[1]) * 1/(sigma_w**2)\nprior = multivariate_normal(mu_0.flatten(), Sigma_0)\n\n# Posterior\nSigma_w = np.linalg.inv(1./(sigma_hat**2) * (X.T @ X) + inv_Sigma_0)\nmu_w = Sigma_w @ (1./(sigma_hat**2) * (X.T @ t) + (inv_Sigma_0 @ mu_0) )\nposterior = multivariate_normal(mu_w.flatten(), Sigma_w)\n\n# Plotting support\nww1, ww2 = np.mgrid[2:12:.05, -2:1.5:.05]\npos = np.dstack((ww1, ww2))\n\nfig, ax = plt.subplots(2, figsize=(12,4))\nfig.patch.set_facecolor('#6C757D')\nax[0].set_fc('#6C757D')\nplt.subplot(121)\nplt.contourf(ww1, ww2, prior.pdf(pos), 30, cmap=plt.cm.Oranges)\nplt.title(r'Prior $\\mathcal{N}(\\mu_0, \\Sigma_0)$')\nplt.xlabel(r'$\\mathbf{w}_0$')\nplt.ylabel(r'$\\mathbf{w}_1$')\nfig.patch.set_facecolor('#6C757D')\n\nplt.subplot(122)\nplt.rcParams['axes.facecolor']='#6C757D'\nax[1].set_facecolor('#6C757D')\nrandom_samples = 200\nplt.plot(xi, Xg @ 
prior.rvs(random_samples).T, alpha=0.7)\nplt.xlabel('Year')\nplt.title('model using prior samples')\nplt.ylabel('Time (seconds)')\nplt.savefig('prior.png', dpi=150, bbox_inches='tight', facecolor=\"#6C757D\")\nplt.close()"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-8",
    "href": "slides/lecture-8/index.html#bayesian-model-8",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nVisualizing the posterior\n\nPlotCode\n\n\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\nplt.style.use('dark_background')\ndf = pd.read_csv('notebook/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\n# Data & basis\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), u ])\nX = X_func(x)\n\n# For prediction / plotting\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\nxi = xgrid*(max_year - min_year) + min_year\nxi = xi.flatten()\nsigma_hat = 0.2\nsigma_w = 2.0\n\n# Prior\nmu_0 = np.array([[7],\n                 [0]])\nSigma_0 = np.array([[5.0, -0.8],\n                    [-0.8, 0.5]])\ninv_Sigma_0 = np.eye(X.shape[1]) * 1/(sigma_w**2)\nprior = multivariate_normal(mu_0.flatten(), Sigma_0)\n\n# Posterior\nSigma_w = np.linalg.inv(1./(sigma_hat**2) * (X.T @ X) + inv_Sigma_0)\nmu_w = Sigma_w @ (1./(sigma_hat**2) * (X.T @ t) + (inv_Sigma_0 @ mu_0) )\nposterior = multivariate_normal(mu_w.flatten(), Sigma_w)\n\n# Plotting support\nww1, ww2 = np.mgrid[2:12:.05, -2:1.5:.05]\npos = np.dstack((ww1, ww2))\n\nfig, ax = plt.subplots(2, figsize=(12,4))\nfig.patch.set_facecolor('#6C757D')\nax[0].set_fc('#6C757D')\nplt.subplot(121)\nplt.contourf(ww1, ww2, posterior.pdf(pos), 30, cmap=plt.cm.Oranges)\nplt.title(r'Posterior $\\mathcal{N}(\\mu_w, \\Sigma_w)$')\nplt.xlabel(r'$\\mathbf{w}_0$')\nplt.ylabel(r'$\\mathbf{w}_1$')\nfig.patch.set_facecolor('#6C757D')\n\nplt.subplot(122)\nplt.rcParams['axes.facecolor']='#6C757D'\nax[1].set_facecolor('#6C757D')\nplt.plot(xi, Xg @ posterior.rvs(random_samples).T, 
zorder=-1, alpha=0.8)\na, = plt.plot(df['Year'].values, df['Time'].values, 'o', color='dodgerblue', \\\n              label='Data', markeredgecolor='k', lw=1, ms=10, zorder=1)\nplt.xlabel('Year')\nplt.title('model using posterior samples')\nplt.legend([a ], ['Data'], framealpha=0.2)\nplt.ylabel('Time (seconds)')\nplt.savefig('posterior.png', dpi=150, bbox_inches='tight', facecolor=\"#6C757D\")\nplt.show()\nplt.close()"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-9",
    "href": "slides/lecture-8/index.html#bayesian-model-9",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nConjugacy\n\nA likelihood-prior pair is said to be conjugate if they result in a posterior which is of the same form as the prior.\nThis enables us to compute the posterior density analytically (as we have done) without having to worry about the marginal likelihood (the denominator in Bayes’ rule).\nSome commonly used prior and likelihood distributions are given below.\n\n\n\n\nPrior\nLikelihood\n\n\n\n\nGaussian\nGaussian\n\n\nBeta\nBinomial\n\n\nGamma\nGaussian\n\n\nDirichlet\nMultinomial"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-10",
    "href": "slides/lecture-8/index.html#bayesian-model-10",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nMarginal likelihood re-visited\n\nThe marginal likelihood for a Bayesian model is given by \n\\text{marginal likelihood} = \\int \\text{likelihood} \\times \\text{prior} \\; d \\mathbf{w}\n\nThe marginal likelihood for our Gaussian prior and Gaussian likelihood model is given by \n\\begin{aligned}\np \\left( \\mathbf{t} | \\mathbf{X}, \\boldsymbol{\\mu}_{0}, \\boldsymbol{\\Sigma}_{0} \\right) & = \\int p \\left( \\mathbf{t} | \\mathbf{X}, \\mathbf{w}, \\sigma^2 \\right) p \\left( \\mathbf{w} | \\boldsymbol{\\mu}_{0} , \\boldsymbol{\\Sigma}_{0} \\right) d \\mathbf{w} \\\\\n& = \\mathcal{N} \\left( \\mathbf{X} \\boldsymbol{\\mu}_{0}, \\sigma^2 \\mathbf{I}_{N} + \\mathbf{X} \\boldsymbol{\\Sigma}_{0} \\mathbf{X}^{T} \\right)\n\\end{aligned}\n\nNote that it has the same form as Equation 4, and it is evaluated at \\mathbf{t}, i.e., the observed winning times."
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-11",
    "href": "slides/lecture-8/index.html#bayesian-model-11",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nMarginal likelihood re-visited\n\nPlotCode\n\n\n\n\n\nThe plot on the right shows the logarithm of the marginal likelihood.\nNote that where the contours peak corresponds to values of \\mathbf{w} that better explain the data \\mathbf{t}, given its uncertainty \\sigma^2, and the model as encoded into \\mathbf{X}.\n\n\n\n\n\n\n\n\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\nplt.style.use('dark_background')\ndf = pd.read_csv('notebook/data/data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\n# Data & basis\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), u ])\nX = X_func(x)\n\n# For prediction / plotting\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\nxi = xgrid*(max_year - min_year) + min_year\nxi = xi.flatten()\nsigma_hat = 0.2\nsigma_w = 2.0\n\n# Prior\nmu_0 = np.array([[7],\n                 [0]])\nSigma_0 = np.array([[5.0, -0.8],\n                    [-0.8, 0.5]])\ninv_Sigma_0 = np.eye(X.shape[1]) * 1/(sigma_w**2)\nprior = multivariate_normal(mu_0.flatten(), Sigma_0)\n\n# Posterior\nSigma_w = np.linalg.inv(1./(sigma_hat**2) * (X.T @ X) + inv_Sigma_0)\nmu_w = Sigma_w @ (1./(sigma_hat**2) * (X.T @ t) + (inv_Sigma_0 @ mu_0) )\nposterior = multivariate_normal(mu_w.flatten(), Sigma_w)\n\n# Plotting support\nww1, ww2 = np.mgrid[2:12:.05, -2:1.5:.05]\npos = np.dstack((ww1, ww2))\n\nmarginal_like_val = np.ones((ww1.shape[0], ww1.shape[1]))\n\nfor i in range(0, ww1.shape[0]):\n    for j in range(0, ww1.shape[1]):\n        mu_0 = np.array([ ww1[i,j], ww2[i,j] ]).reshape(2, 1)\n        
marginal_likelihood_dist = multivariate_normal( (X @ mu_0).flatten(), \\\n                                               sigma_hat**2 * np.eye(t.shape[0]) + X @ Sigma_0 @ X.T)\n        marginal_like_val[i, j] = marginal_likelihood_dist.pdf(t.flatten())\n\nfig, ax = plt.subplots(figsize=(6,4))\nfig.patch.set_facecolor('#6C757D')\nax.set_fc('#6C757D')\nc = plt.contourf(ww1, ww2, np.log10(marginal_like_val), 60, cmap=plt.cm.Oranges)\nplt.colorbar(c)\nplt.title(r'$\\log_{10}$ of marginal likelihood evaluated at $\\mathbf{t}$')\nplt.xlabel(r'$\\mathbf{w}_0$')\nplt.ylabel(r'$\\mathbf{w}_1$')\nplt.savefig('marginal_log.png', dpi=150, bbox_inches='tight', facecolor=\"#6C757D\")\nplt.close()"
  },
  {
    "objectID": "slides/lecture-8/index.html#bayesian-model-12",
    "href": "slides/lecture-8/index.html#bayesian-model-12",
    "title": "Lecture 8",
    "section": "Bayesian model",
    "text": "Bayesian model\nA function space perspective\n\nSampling \\mathbf{t} conditioned on \\mathbf{w} vs. sampling \\mathbf{t} directly from the marginal likelihood are statistically equivalent.\nRemoving \\mathbf{w} from our model (effectively making it non-parametric) opens the door to thinking of priors in the space of functions, and not just in the space of model parameters.\nThis is arguably one of the main ideas in Gaussian processes, that I will introduce next time we meet.\n\n\n\nAE8803 | Gaussian Processes for Machine Learning"
  },
  {
    "objectID": "sample_problems/lecture_1.html",
    "href": "sample_problems/lecture_1.html",
    "title": "L1 examples",
    "section": "",
    "text": "The probability that a scheduled flight departs on time is 0.83 and the probability that it arrives on time is 0.92. The probability that it both departs and arrives on time is 0.78. Find the probability that\n\nthe plane arrives on time given that it departed on time;\nthe plane did not depart on time given that it did not arrive on time.\n\n\n\nSolution\n\nIt will be useful to consider the Venn diagram shown below.\n\nLet \\(\\require{color}{\\color[rgb]{0.000066,0.001801,0.998229}A}\\) denote the event that the plane arrives on time, while \\({\\color[rgb]{0.986252,0.007236,0.027423}D}\\) denotes th event that the plane departs on time. To construct the Venn diagram above note that \\(P\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = 0.78\\). From the sum rule of probabilities, we have:\n\\[\n\\require{color}\n\\large\nP\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\right) = P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap \\bar{{\\color[rgb]{0.986252,0.007236,0.027423}D}} \\right) + P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right)\n\\]\n\\[\n\\require{color}\n\\large\n0.92= 0.78+ P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right)\n\\]\nwhich implies that \\(\\require{color} P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = 0.92 - 0.78 = 0.14\\). 
Similarly, we have:\n\\[\n\\require{color}\n\\large\nP \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) + P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right)\n\\]\n\\[\n\\require{color}\n\\large\n\\Rightarrow 0.83 = P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) + 0.78\n\\]\nwhich implies that \\(P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) = 0.05\\). With these probabilities, we can now answer the questions.\n\nThe plane arrives on time conditioned that it departed on time:\n\n\\[\n\\large\n\\require{color}\nP \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} | {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = \\frac{P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) }{P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) } = \\frac{0.78}{0.83} = 0.94\n\\]\n\nThe plane did not depart on time conditioned on it having not arrived on time:\n\n\\[\n\\large\n\\require{color}\nP \\left( \\bar{{\\color[rgb]{0.986252,0.007236,0.027423}D}} | \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) = \\frac{P \\left( \\bar{{\\color[rgb]{0.986252,0.007236,0.027423}D } }\\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) }{P \\left( \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) }\n\\]\n\\[\n\\large\n\\require{color}\n= \\frac{1 - P\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\right) - P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) + P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) }{1 - P\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} 
\\right) } = \\frac{1 - 0.92 - 0.83 + 0.78}{1 - 0.92} = \\frac{0.03}{0.08} = 0.375\n\\]"
  },
  {
    "objectID": "sample_problems/lecture_1.html#problem-1",
    "href": "sample_problems/lecture_1.html#problem-1",
    "title": "L1 examples",
    "section": "",
    "text": "The probability that a scheduled flight departs on time is 0.83 and the probability that it arrives on time is 0.92. The probability that it both departs and arrives on time is 0.78. Find the probability that\n\nthe plane arrives on time given that it departed on time;\nthe plane did not depart on time given that it did not arrive on time.\n\n\n\nSolution\n\nIt will be useful to consider the Venn diagram shown below.\n\nLet \\(\\require{color}{\\color[rgb]{0.000066,0.001801,0.998229}A}\\) denote the event that the plane arrives on time, while \\({\\color[rgb]{0.986252,0.007236,0.027423}D}\\) denotes th event that the plane departs on time. To construct the Venn diagram above note that \\(P\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = 0.78\\). From the sum rule of probabilities, we have:\n\\[\n\\require{color}\n\\large\nP\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\right) = P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap \\bar{{\\color[rgb]{0.986252,0.007236,0.027423}D}} \\right) + P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right)\n\\]\n\\[\n\\require{color}\n\\large\n0.92= 0.78+ P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right)\n\\]\nwhich implies that \\(\\require{color} P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = 0.92 - 0.78 = 0.14\\). 
Similarly, we have:\n\\[\n\\require{color}\n\\large\nP \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) + P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right)\n\\]\n\\[\n\\require{color}\n\\large\n\\Rightarrow 0.83 = P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) + 0.78\n\\]\nwhich implies that \\(P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) = 0.05\\). With these probabilities, we can now answer the questions.\n\nThe plane arrives on time conditioned that it departed on time:\n\n\\[\n\\large\n\\require{color}\nP \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} | {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) = \\frac{P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) }{P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) } = \\frac{0.78}{0.83} = 0.94\n\\]\n\nThe plane did not depart on time conditioned on it having not arrived on time:\n\n\\[\n\\large\n\\require{color}\nP \\left( \\bar{{\\color[rgb]{0.986252,0.007236,0.027423}D}} | \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) = \\frac{P \\left( \\bar{{\\color[rgb]{0.986252,0.007236,0.027423}D } }\\cap \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) }{P \\left( \\bar{{\\color[rgb]{0.000066,0.001801,0.998229}A}} \\right) }\n\\]\n\\[\n\\large\n\\require{color}\n= \\frac{1 - P\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\right) - P \\left( {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) + P \\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}D} \\right) }{1 - P\\left( {\\color[rgb]{0.000066,0.001801,0.998229}A} 
\\right) } = \\frac{1 - 0.92 - 0.83 + 0.78}{1 - 0.92} = \\frac{0.03}{0.08} = 0.375\n\\]"
  },
  {
    "objectID": "sample_problems/lecture_1.html#problem-2",
    "href": "sample_problems/lecture_1.html#problem-2",
    "title": "L1 examples",
    "section": "Problem 2",
    "text": "Problem 2\nToss a coin three times, what is the probability of at least two heads?\n\n\nSolution\n\nThere are 8 possible outcomes which, if the coin is unbiased, should all be equally likely:\n\nHHH\nHHT\nHTH\nHTT\nTHH\nTHT\nTTH\nTTT\n\nTwo or more heads result from 4 outcomes. The probability of two or more heads is therefore \\(4/8=1/2\\)."
  },
  {
    "objectID": "sample_problems/lecture_1.html#problem-3",
    "href": "sample_problems/lecture_1.html#problem-3",
    "title": "L1 examples",
    "section": "Problem 3",
    "text": "Problem 3\nThis problem introduces the idea that whilst it may be tempting to add probabilities, the context is very important.\nAround 0.9% of the population are blue-green color blind and roughly 1 in 5 is left-handed. Assuming these characteristics are inherited independently, calculate the probability that a person, chosen at random will:\n\nbe both color-blind and left-handed\nbe color-blind and not left-handed\nbe color-blind or left-handed\nbe neither color-blind nor left-handed\n\n\n\nSolution\n\nConsider the diagram shown below; given that the characteristics are inherited independently, each sub-branch of the population can be divided into color-blind and non-color-blind groups.\n\n\nthe probability of being both color-blind and left-handed is: \\(0.009 \\times 0.2 = 0.0018\\) or \\(0.18 \\%\\).\nthe probability of being color-blind and right-handed is: \\(0.009 \\times 0.8 = 0.0072\\).\nthis is the sum of all probabilities within the first branch and the probability calculated in the prior step, i.e., \\(0.20 + 0.0072 = 0.2072\\).\nthis is given by the last group, i.e., \\(0.991 \\times 0.8 = 0.7928\\)."
  },
  {
    "objectID": "sample_problems/lecture_1.html#problem-4",
    "href": "sample_problems/lecture_1.html#problem-4",
    "title": "L1 examples",
    "section": "Problem 4",
    "text": "Problem 4\nThis problem has two parts.\n\nDerive Bayes’ rule.\nThe chance of an honest citizen lying is 1 in 1000. Assume that such a citizen is tested with a lie detector which correctly identifies both truth and false statements 95 times out of 100.\n\n\nWhat is the probability that the lie detector indicates falsehood?\nIn this case, what is the probability that the person is actually lying?\n\n\n\nSolution\n\n\nTo derive Bayes’ rule, we will use the definition of the conditional probability, and the fact that \\(p \\left({\\color[rgb]{0.986252,0.007236,0.027423}A} \\cap {\\color[rgb]{0.131302,0.999697,0.023594}B} \\right) = p \\left( {\\color[rgb]{0.131302,0.999697,0.023594}B} \\cap {\\color[rgb]{0.986252,0.007236,0.027423}A} \\right)\\), which leads to\n\n\\[\n\\large\n\\require{color}\np \\left( {\\color[rgb]{0.986252,0.007236,0.027423}A} | {\\color[rgb]{0.131302,0.999697,0.023594}B} \\right) p \\left( {\\color[rgb]{0.131302,0.999697,0.023594}B} \\right) = p \\left( {\\color[rgb]{0.131302,0.999697,0.023594}B}| {\\color[rgb]{0.986252,0.007236,0.027423}A} \\right) p \\left({\\color[rgb]{0.986252,0.007236,0.027423}A} \\right)\n\\]\nFrom this one can write\n\\[\n\\large\n\\require{color}\np \\left( {\\color[rgb]{0.986252,0.007236,0.027423}A} | {\\color[rgb]{0.131302,0.999697,0.023594}B} \\right) = \\frac{p \\left( {\\color[rgb]{0.131302,0.999697,0.023594}B} | {\\color[rgb]{0.986252,0.007236,0.027423}A} \\right) p \\left( {\\color[rgb]{0.986252,0.007236,0.027423}A} \\right) }{p \\left( {\\color[rgb]{0.131302,0.999697,0.023594}B} \\right) }\n\\]\n\nThe probability that the lie detector indicates a falsehood is based on (i) the citizen is lying, and (ii) the citizen is being honest, but the detector makes an error. Let \\(F\\) be the probability that the lie detector indicates a falsehood. 
Thus\n\n\\[\n\\large\np \\left( F \\right) = \\frac{1}{1000} \\times 0.95 + \\frac{999}{1000} \\times 0.05 = 0.0509.\n\\]\nLet \\(L\\) be the event that the person is actually lying. Thus, what we want is\n\\[\n\\large\np \\left( L | F \\right) = \\frac{p \\left( F | L \\right) p \\left( L \\right) }{p \\left( F \\right) } = \\frac{0.95 \\times 0.001}{0.0509} \\approx 0.01866.\n\\]"
  },
  {
    "objectID": "sample_problems/lecture_3.html",
    "href": "sample_problems/lecture_3.html",
    "title": "L3 examples",
    "section": "",
    "text": "In going through some historical records, you find that scientists from a lost civilization tried to measure the distance from the ground to some clouds. Based on the data you assume that the distance is a Gaussian random variable with a mean of 1830m and a standard deviation of 460m. What is the probability that the clouds would be at a height above 2750m?\n\n\nSolution\n\nLet \\(X\\) be this Gaussian random variable. This problem essentially requires us to work out \\(p \\left( X &gt; 2750 \\right)\\). This can be expressed as\n\\[\n\\large\np \\left(X &gt; 2750 \\right) = 1 - p \\left( X \\leq 2750 \\right) = 1 - \\Phi \\left( z \\right)\n\\]\nwhere \\(z = (2750 - 1830)/460 = 2\\). Thus we have\n\\[\n\\large\n1 - \\Phi \\left( 2 \\right) = 1 - 0.9772 = 0.0228\n\\]"
  },
  {
    "objectID": "sample_problems/lecture_3.html#problem-1",
    "href": "sample_problems/lecture_3.html#problem-1",
    "title": "L3 examples",
    "section": "",
    "text": "In going through some historical records, you find that scientists from a lost civilization tried to measure the distance from the ground to some clouds. Based on the data you assume that the distance is a Gaussian random variable with a mean of 1830m and a standard deviation of 460m. What is the probability that the clouds would be at a height above 2750m?\n\n\nSolution\n\nLet \\(X\\) be this Gaussian random variable. This problem essentially requires us to work out \\(p \\left( X &gt; 2750 \\right)\\). This can be expressed as\n\\[\n\\large\np \\left(X &gt; 2750 \\right) = 1 - p \\left( X \\leq 2750 \\right) = 1 - \\Phi \\left( z \\right)\n\\]\nwhere \\(z = (2750 - 1830)/460 = 2\\). Thus we have\n\\[\n\\large\n1 - \\Phi \\left( 2 \\right) = 1 - 0.9772 = 0.0228\n\\]"
  },
  {
    "objectID": "useful_codes/discrete.html",
    "href": "useful_codes/discrete.html",
    "title": "Discrete distributions",
    "section": "",
    "text": "This notebook is has useful boiler plate code for generating distributions and visualizing them.\n\n\nCode\nimport numpy as np \nfrom scipy.stats import bernoulli, binom\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.special import comb\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n#plt.style.use('dark_background') # cosmetic!\n\n\n\n\nThe probability mass function for a Bernoulli distribution is given by\n\\[\np \\left( x \\right) = \\begin{cases}\n\\begin{array}{c}\n1 - p  \\; \\; \\; \\textrm{if} \\; x = 0 \\\\\np \\; \\; \\; \\textrm{if} \\; x = 1\n\\end{array}\\end{cases}\n\\]\nfor \\(x \\in \\left\\{0, 1 \\right\\}\\) and where \\(0 \\leq p \\leq 1\\).\n\n\nCode\np = 0.4 # Bernoulli parameter\nx = np.linspace(0, 1, 2)\nprobabilities = bernoulli.pmf(x, p)\n\nfig = plt.figure(figsize=(8,4))\n\nplt.plot(x, probabilities, 'o', ms=8, color='orangered')\nplt.vlines(x, 0, probabilities, colors='orangered', lw=5, alpha=0.5)\nplt.xlabel('x')\nplt.ylabel('Probability')\nplt.savefig('pdf.png', dpi=150, bbox_inches='tight', transparent=True)\n\nplt.show()\n\n\n\n\n\nOne can generate random values from this distribution, i.e.,\n\n\nCode\nX = bernoulli.rvs(p, size=500)\nprint(X)\n\n\n[0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 1 0 0 0 0\n 1 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 1 1 0 1 0 0 0 1 1 1 0 1\n 1 0 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0\n 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1\n 0 1 0 0 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0\n 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 1 0\n 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1\n 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 0\n 0 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0\n 0 0 1 0 0 1 1 0 
1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0\n 0 0 0 1 0 1 0 0 1 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1\n 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0\n 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 0 0 1 1 0 0 1\n 0 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 1]\n\n\nThus, random values from a Bernoulli distribution are inherently binary, and the number of 0s vs 1s will vary depending on the choice of the parameter, \\(p\\). We will see later on (in another notebook) how this relatively simple idea can be used to train a Naive Bayes Classifier. For now, we will plot the expected value of the Bernoulli random variable with increasing number of samples.\n\n\nCode\nnumbers = [10, 50, 100, 200, 300, 500, 1000, 2000, 5000, 10000]\nmeans = []\nstds = []\nfor j in numbers:\n    X_val = []\n    for q in range(0, 10):\n        X = bernoulli.rvs(p, size=j)\n        X_val.append(np.mean(X))\n    means.append(np.mean(X_val))\n    stds.append(np.std(X_val))\n\nmeans = np.array(means)\nstds = np.array(stds)\nnumbers = np.array(numbers)\n\nfig = plt.figure(figsize=(8,4))\nplt.plot(numbers, means, 'ro-', lw=2)\nplt.fill_between(numbers, means + stds, means - stds, color='crimson', alpha=0.3)\nplt.xlabel('Number of random samples')\nplt.ylabel('Expectation')\nplt.savefig('convergence.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\n\n\n\nNext, we consider the Binomial distribution. 
It has a probability mass function\n\\[\np \\left( x \\right) = \\left(\\begin{array}{c}\nn\\\\\nx\n\\end{array}\\right)p^{x}\\left(1-p\\right)^{n-x}\n\\]\nfor \\(x \\in \\left\\{0, 1, \\ldots, n \\right\\}\\) and where \\(0 \\leq p \\leq 1\\).\n\n\nCode\np = 0.3 # Bernoulli parameter\nn = 7\nx = np.arange(0, n+1)\nprobabilities = binom(n, p)\n\nfig = plt.figure(figsize=(8,4))\nplt.plot(x, probabilities.pmf(x), 'o', ms=8, color='deeppink')\nplt.vlines(x, 0, probabilities.pmf(x), colors='deeppink', lw=5 )\nplt.xlabel('x')\nplt.ylabel('Probability')\nplt.savefig('pdf_2.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\nTo work out the probability at \\(x=3\\), we can compute:\n\n\nCode\nprob = comb(N=n, k=3) * p**3 * (1 - p)**(n - 3)\nprint(prob)\n\n\n0.22689449999999992"
  },
  {
    "objectID": "useful_codes/discrete.html#scope",
    "href": "useful_codes/discrete.html#scope",
    "title": "Discrete distributions",
    "section": "",
    "text": "This notebook has useful boilerplate code for generating distributions and visualizing them.\n\n\nCode\nimport numpy as np \nfrom scipy.stats import bernoulli, binom\nimport matplotlib.pyplot as plt\nimport seaborn as sns\nfrom scipy.special import comb\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n#plt.style.use('dark_background') # cosmetic!\n\n\n\n\nThe probability mass function for a Bernoulli distribution is given by\n\\[\np \\left( x \\right) = \\begin{cases}\n\\begin{array}{c}\n1 - p  \\; \\; \\; \\textrm{if} \\; x = 0 \\\\\np \\; \\; \\; \\textrm{if} \\; x = 1\n\\end{array}\\end{cases}\n\\]\nfor \\(x \\in \\left\\{0, 1 \\right\\}\\) and where \\(0 \\leq p \\leq 1\\).\n\n\nCode\np = 0.4 # Bernoulli parameter\nx = np.linspace(0, 1, 2)\nprobabilities = bernoulli.pmf(x, p)\n\nfig = plt.figure(figsize=(8,4))\n\nplt.plot(x, probabilities, 'o', ms=8, color='orangered')\nplt.vlines(x, 0, probabilities, colors='orangered', lw=5, alpha=0.5)\nplt.xlabel('x')\nplt.ylabel('Probability')\nplt.savefig('pdf.png', dpi=150, bbox_inches='tight', transparent=True)\n\nplt.show()\n\n\n\n\n\nOne can generate random values from this distribution, i.e.,\n\n\nCode\nX = bernoulli.rvs(p, size=500)\nprint(X)\n\n\n[0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 1 0 0 0 0\n 1 0 1 0 0 1 0 0 0 1 1 1 0 1 0 0 0 0 1 1 0 1 0 0 1 1 1 0 1 0 0 0 1 1 1 0 1\n 1 0 1 1 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0\n 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 1 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1\n 0 1 0 0 1 1 0 1 1 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0\n 1 1 0 0 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 1 0 0 1 1 1 0 1 0 0 0 1 0 1 1 1 0\n 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 1 1 0 1\n 0 1 0 0 0 1 1 0 0 1 1 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 1 1 0\n 0 0 0 1 1 1 0 1 0 1 0 1 0 1 1 0 1 1 0 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 0 0\n 0 0 1 0 0 1 1 0 
1 1 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 1 1 0 1 0 0 0 0 0\n 0 0 0 1 0 1 0 0 1 1 0 1 1 0 1 1 1 0 0 0 0 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 1\n 0 1 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 1 1 0 0\n 1 0 0 1 0 1 0 1 1 1 1 1 0 0 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1 0 0 0 1 1 0 0 1\n 0 0 0 1 1 0 1 0 0 1 1 0 0 1 0 0 0 1 1]\n\n\nThus, random values from a Bernoulli distribution are inherently binary, and the number of 0s vs 1s will vary depending on the choice of the parameter, \\(p\\). We will see later on (in another notebook) how this relatively simple idea can be used to train a Naive Bayes Classifier. For now, we will plot the expected value of the Bernoulli random variable with increasing number of samples.\n\n\nCode\nnumbers = [10, 50, 100, 200, 300, 500, 1000, 2000, 5000, 10000]\nmeans = []\nstds = []\nfor j in numbers:\n    X_val = []\n    for q in range(0, 10):\n        X = bernoulli.rvs(p, size=j)\n        X_val.append(np.mean(X))\n    means.append(np.mean(X_val))\n    stds.append(np.std(X_val))\n\nmeans = np.array(means)\nstds = np.array(stds)\nnumbers = np.array(numbers)\n\nfig = plt.figure(figsize=(8,4))\nplt.plot(numbers, means, 'ro-', lw=2)\nplt.fill_between(numbers, means + stds, means - stds, color='crimson', alpha=0.3)\nplt.xlabel('Number of random samples')\nplt.ylabel('Expectation')\nplt.savefig('convergence.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\n\n\n\nNext, we consider the Binomial distribution. 
It has a probability mass function\n\\[\np \\left( x \\right) = \\left(\\begin{array}{c}\nn\\\\\nx\n\\end{array}\\right)p^{x}\\left(1-p\\right)^{n-x}\n\\]\nfor \\(x \\in \\left\\{0, 1, \\ldots, n \\right\\}\\) and where \\(0 \\leq p \\leq 1\\).\n\n\nCode\np = 0.3 # Bernoulli parameter\nn = 7\nx = np.arange(0, n+1)\nprobabilities = binom(n, p)\n\nfig = plt.figure(figsize=(8,4))\nplt.plot(x, probabilities.pmf(x), 'o', ms=8, color='deeppink')\nplt.vlines(x, 0, probabilities.pmf(x), colors='deeppink', lw=5 )\nplt.xlabel('x')\nplt.ylabel('Probability')\nplt.savefig('pdf_2.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\nTo work out the probability at \\(x=3\\), we can compute:\n\n\nCode\nprob = comb(N=n, k=3) * p**3 * (1 - p)**(n - 3)\nprint(prob)\n\n\n0.22689449999999992"
  },
  {
    "objectID": "useful_codes/kernels.html",
    "href": "useful_codes/kernels.html",
    "title": "Kernel trick and lifting",
    "section": "",
    "text": "This note attempts to describe kernels. The hope is that, after going through it, the reader appreciates just how powerful kernels are and the role they play in Gaussian process models.\n\n\nCode\n### Data \nimport plotly.graph_objects as go\nimport plotly.figure_factory as ff\nimport plotly.express as px\nimport numpy as np\nimport pandas as pd\nimport plotly.io as pio\npio.renderers.default = 'iframe'\nfrom IPython.display import display, HTML\n\n\n\n\nOne way to motivate the study of kernels is to consider a linear regression problem where one has more unknowns than observational data. Let \\(\\mathbf{X} = \\left[\\mathbf{x}_{1}^{T}, \\mathbf{x}_{2}^{T}, \\ldots, \\mathbf{x}_{N}^{T}\\right]\\) be the \\(N \\times d\\) data corresponding to \\(N\\) observations of \\(d\\)-dimensional data. These input observations are accompanied by an output observational vector, \\(\\mathbf{y} \\in \\mathbb{R}^{N}\\). Let \\(\\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right) \\in \\mathbb{R}^{N \\times M}\\) be a parameterized matrix comprising \\(M\\) basis functions, i.e.,\n\\[\n\\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) = \\left[\\begin{array}{cccc}\n\\phi_{1}\\left(\\mathbf{X} \\right), & \\phi_{2}\\left(\\mathbf{X} \\right), & \\ldots, & \\phi_{M}\\left(\\mathbf{X} \\right)\\end{array}\\right]\n\\]\nIf we are interested in approximating \\(f \\left( \\mathbf{X} \\right) \\approx \\hat{f} \\left( \\mathbf{X} \\right) = \\mathbf{y} = \\mathbf{\\Phi} \\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha}\\), we can determine the unknown coefficients via least squares. This leads to the solution via the normal equations\n\\[\n\\boldsymbol{\\alpha} = \\left( \\mathbf{\\Phi}^{T} \\mathbf{\\Phi}\\right)^{-1} \\boldsymbol{\\Phi}^{T} \\mathbf{y}\n\\]\nNow, strictly speaking, one cannot use the normal equations to solve a problem where there are more unknowns than observations (\\(M > N\\)) because the \\(M \\times M\\) matrix \\(\\left( \\mathbf{\\Phi}^{T} \\mathbf{\\Phi}\\right)\\) is not full rank. 
Recognizing that in such a situation there may be numerous solutions to \\(\\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha} = \\mathbf{y}\\), we want the solution with the lowest \\(L_2\\) norm. This can be more conveniently formulated as a minimum norm problem, written as\n\\[\n\\begin{aligned}\n\\underset{\\boldsymbol{\\alpha}}{\\textrm{minimize}} & \\; \\boldsymbol{\\alpha}^{T} \\boldsymbol{\\alpha} \\\\\n\\textrm{subject to} \\; \\; &  \\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha} = \\mathbf{y}.\n\\end{aligned}\n\\]\nThe easiest way to solve this is via the method of Lagrange multipliers, i.e., we define the objective function\n\\[\nL \\left( \\boldsymbol{\\alpha}, \\lambda \\right) = \\boldsymbol{\\alpha}^{T} \\boldsymbol{\\alpha}  + \\lambda^{T} \\left( \\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha} - \\mathbf{y}\\right),\n\\]\nwhere \\(\\lambda\\) comprises the Lagrange multipliers. The optimality conditions for this objective are given by\n\\[\n\\begin{aligned}\n\\nabla_{\\boldsymbol{\\alpha}} L & = 2 \\boldsymbol{\\alpha} + \\boldsymbol{\\Phi}^{T} \\lambda = 0, \\\\\n\\nabla_{\\lambda} L & = \\boldsymbol{\\Phi} \\boldsymbol{\\alpha} - \\mathbf{y} = 0.\n\\end{aligned}\n\\]\nThe first condition gives \\(\\boldsymbol{\\alpha} = - \\boldsymbol{\\Phi}^{T} \\lambda / 2\\). Substituting this into the second condition yields \\(\\lambda = -2 \\left(\\boldsymbol{\\Phi} \\boldsymbol{\\Phi}^{T} \\right)^{-1} \\mathbf{y}\\). This leads to the minimum norm solution\n\\[\n\\boldsymbol{\\alpha} = \\boldsymbol{\\Phi}^{T}  \\left( \\boldsymbol{\\Phi} \\boldsymbol{\\Phi}^{T}  \\right)^{-1} \\mathbf{y}.\n\\]\nNote that unlike \\(\\left( \\boldsymbol{\\Phi}^{T} \\boldsymbol{\\Phi} \\right)\\), the \\(N \\times N\\) matrix \\(\\left( \\boldsymbol{\\Phi} \\boldsymbol{\\Phi}^{T} \\right)\\) does have full rank, provided the \\(N\\) rows of \\(\\boldsymbol{\\Phi}\\) are linearly independent. Each entry of the latter is an inner product between two feature vectors. 
To see this, define the two-point kernel function\n\\[\nk \\left( \\mathbf{x}, \\mathbf{x}' \\right) = \\boldsymbol{\\Phi} \\left( \\mathbf{x} \\right) \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{x}' \\right),\n\\]\nand the associated covariance matrix, defined elementwise via\n\\[\n\\left[ \\mathbf{K} \\left(\\mathbf{X}, \\mathbf{X}' \\right)\\right]_{ij} = k \\left( \\mathbf{x}_{i}, \\mathbf{x}'_{j} \\right).\n\\]\n\n\n\nFrom the coefficients \\(\\boldsymbol{\\alpha}\\) computed via the minimum norm solution, it should be clear that approximate values of the true function at new locations \\(\\mathbf{X}_{\\ast}\\) can be given via\n\\[\n\\begin{aligned}\n\\hat{f} \\left( \\mathbf{X}_{\\ast} \\right) & = \\boldsymbol{\\Phi} \\left( \\mathbf{X}_{\\ast} \\right) \\boldsymbol{\\alpha} \\\\\n& = \\boldsymbol{\\Phi} \\left( \\mathbf{X}_{\\ast} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)  \\left( \\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)   \\right)^{-1} \\mathbf{y} \\\\\n& = \\left( \\boldsymbol{\\Phi} \\left( \\mathbf{X}_{\\ast} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)  \\right)  \\left( \\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)  \\right)^{-1} \\mathbf{y} \\\\\n& = \\mathbf{K} \\left( \\mathbf{X}_{\\ast}, \\mathbf{X} \\right) \\mathbf{K}^{-1} \\left( \\mathbf{X}, \\mathbf{X} \\right) \\mathbf{y}\n\\end{aligned}\n\\]\nThere are two points to note here:\n\nThe form of the expression above is exactly that of the posterior predictive mean of a noise-free Gaussian process model.\nOne need not compute the full \\(N \\times M\\) feature matrix \\(\\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right)\\) explicitly to work out the \\(N \\times N\\) matrix \\(\\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X} \\right)\\).\n\nThis latter point is why this is called the kernel trick, i.e., for a very large number of features \\(M \\gg N\\) 
(possibly infinite), it is more computationally efficient to work out \\(\\mathbf{K}\\) directly from the kernel function.\nAnother way to interpret the kernel trick is to consider the example that was discussed in lecture with regard to the data in the plot below.\n\n\nConsider a quadratic kernel in \\(\\mathbb{R}^{2}\\), where \\(\\mathbf{x} = \\left(x_1, x_2 \\right)^{T}\\) and \\(\\mathbf{v} = \\left(v_1, v_2 \\right)^{T}\\). We can express this kernel as \\[\n\\begin{aligned}\nk \\left( \\mathbf{x}, \\mathbf{v} \\right) =  \\left( \\mathbf{x}^{T}  \\mathbf{v} \\right)^2 & = \\left( \\left[\\begin{array}{cc}\nx_{1} & x_{2}\\end{array}\\right]\\left[\\begin{array}{c}\nv_{1}\\\\\nv_{2}\n\\end{array}\\right] \\right)^2  \\\\\n& = \\left( x_1^2 v_1^2 + 2 x_1 x_2 v_1 v_2 + x_2^2 v_2^2\\right) \\\\\n& = \\left[\\begin{array}{ccc}\nx^2_{1} & \\sqrt{2} x_1 x_2 & x_2^2 \\end{array}\\right]\\left[\\begin{array}{c}\nv_{1}^2\\\\\n\\sqrt{2}v_1 v_2 \\\\\nv_{2}^2\n\\end{array}\\right] \\\\\n& = \\phi \\left( \\mathbf{x} \\right)^{T} \\phi \\left( \\mathbf{v}  \\right).\n\\end{aligned}\n\\] where \\(\\phi \\left( \\cdot \\right) \\in \\mathbb{R}^{3}\\).\nNow let us tabulate the number of operations required depending on which route one takes. Computing \\(\\left( \\mathbf{x}^{T} \\mathbf{v} \\right)^2\\) requires two multiplications (i.e., \\(x_1 \\times v_1\\) and \\(x_2 \\times v_2\\)), one sum (i.e., \\(s = x_1 v_1 + x_2 v_2\\)), and one further product (i.e., \\(s^2\\)). This leads to a total of four operations.\nNow consider the number of operations required for computing \\(\\phi \\left( \\mathbf{x} \\right)^{T} \\phi \\left( \\mathbf{v} \\right)\\). Assembling \\(\\phi \\left( \\mathbf{x} \\right)\\) requires at least three products, as does assembling \\(\\phi \\left( \\mathbf{v} \\right)\\); taking their inner product incurs another three products and two sums, leading to a total of at least 11 operations (nine multiplications and two sums). Thus, computationally, it is cheaper to use the original form for calculating the product.\nThere is however another perspective to this. 
Data that is not linearly separable in \\(\\mathbb{R}^{2}\\) can be lifted up to \\(\\mathbb{R}^{3}\\) where a separation may be more easily inferred. In this particular case, \\(\\phi \\left( \\mathbf{x} \\right)\\) is the feature map associated with a polynomial (quadratic) kernel.\nTo visualize this, consider a red set and a green set of random points. Points that lie at a relatively greater radius are shown in red, whilst points that are closer to the center are shown in green.\n\n\nCode\nt = np.random.rand(40,1)* 2 * np.pi\nr = np.random.rand(40,1)*0.2 + 2\nu = r * np.cos(t)\nv = r * np.sin(t)\n\nxy = np.vstack([np.random.rand(20,2)*2 - 1])\nxy2 = np.hstack([u, v])\n\n\n\n\nCode\nfig = go.Figure()\nfig.add_scatter(x=xy2[:,0], y=xy2[:,1],  name='Red', mode='markers', marker=dict(\n        size=15, color='red', opacity=0.8, line=dict(color='black', width=1) ))\nfig.add_scatter(x=xy[:,0], y=xy[:,1],  name='Green', mode='markers', marker=dict(\n        size=15, color='green', opacity=0.8, line=dict(color='black', width=1) ))\nfig.update_layout(legend=dict(yanchor=\"top\", y=0.99, xanchor=\"left\", x=0.01),\n                  xaxis_title=r'$\\mathbf{x}$',yaxis_title=r'$\\mathbf{v}$')\n\nfig.show()\n\n\n\n\n\nIt is not possible to separate these two sets using a line (or more generally a hyperplane). 
However, when the same data is lifted to \\(\\mathbb{R}^{3}\\), the two sets are linearly separable.\n\n\nCode\ndef mapup(xy):\n    phi_1 = xy[:,0]**2\n    phi_2 = np.sqrt(2) * xy[:,0] * xy[:,1]\n    phi_3 = xy[:,1]**2\n    return phi_1, phi_2, phi_3\n\nz1, z2, z3 = mapup(xy2)\nw1, w2, w3 = mapup(xy)\n\n\n\n\nCode\nfig = go.Figure()\nfig.add_scatter3d(x=z1, y=z2, z=z3, name='Red', mode='markers', marker=dict(\n        size=10, color='red', opacity=0.8, line=dict(color='black', width=2) ))\nfig.add_scatter3d(x=w1, y=w2, z=w3, name='Green', mode='markers', marker=dict(\n        size=10, color='green', opacity=0.8, line=dict(color='black', width=2) ))\nfig.update_layout(legend=dict(yanchor=\"top\", y=0.99, xanchor=\"left\", x=0.01),\n                  scene = dict(\n                    xaxis_title=r'$\\phi_{1}$',\n                    yaxis_title=r'$\\phi_2$',\n                    zaxis_title=r'$\\phi_3$'),\n                    width=700,\n                    margin=dict(r=20, b=10, l=10, t=10))\n\nfig.show()\n\n\n\n\n\n\n\n\nWe shall now briefly consider the case of regression with infinitely many basis functions. Consider a feature vector of the form\n\\[\n\\boldsymbol{\\Phi}\\left( \\mathbf{x} \\right) = \\left[\\begin{array}{cccc}\n\\phi_{1}\\left(\\mathbf{x} \\right), & \\phi_{2}\\left(\\mathbf{x} \\right), & \\ldots, & \\phi_{\\infty}\\left(\\mathbf{x} \\right)\\end{array}\\right]\n\\]\nwhere\n\\[\n\\phi_{j}\\left( \\mathbf{x} \\right) = \\exp \\left( - \\frac{\\left( \\mathbf{x} - c_j \\right)^2 }{2l^2} \\right)\n\\]\nand \\(c_j\\) represents the center of the bell-shaped basis function; we assume that there are infinitely many centers across the domain of interest and thus there exist infinitely many basis terms. 
To visualize this, see the code below.\n\n\nCode\nx = np.linspace(-5, 5, 150)\ninfty_subtitute = 20\nc_js = np.linspace(-5, 5, infty_subtitute)\nl = 0.5\n\nfig = go.Figure()\nfor j in range(0, infty_subtitute):\n    leg = 'c_j = '+str(np.around(c_js[j], 2))\n    psi_j = np.exp(- (x - c_js[j])**2 * 1./(2*l**2))\n    fig.add_scatter(x=x, y=psi_j, mode='lines', name=leg)\n    fig.update_layout(legend=dict(yanchor=\"top\", y=0.99, xanchor=\"left\", x=-0.6),\n                  xaxis_title=r'$x$',yaxis_title=r'$\\phi\\left( x \\right)$')\nfig.show()\n\n\n\n\n\nThe two-point covariance matrix can be written as the sum of outer products of the feature vectors evaluated at all points \\(\\mathbf{X}\\) (which are individually rank-one matrices):\n\\[\n\\begin{aligned}\n\\mathbf{K} & = \\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right)\\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right) \\\\\n& = \\sum_{j=1}^{\\infty} \\phi_{j}\\left( \\mathbf{X} \\right) \\phi_{j}^{T}\\left( \\mathbf{X} \\right)\n\\end{aligned}\n\\]\nor if one is considering each kernel entry, one can write\n\\[\n\\begin{aligned}\n\\mathbf{K}\\left[i, j \\right] = \\mathbf{K}_{ij}  & = k \\left( \\mathbf{x}_i, \\mathbf{x}_j \\right)\\\\\n& = \\boldsymbol{\\Phi}\\left( \\mathbf{x}_i \\right)\\boldsymbol{\\Phi}^{T} \\left( \\mathbf{x}_j \\right) \\\\\n& = \\sum_{p=1}^{\\infty} \\phi_{p}\\left( \\mathbf{x}_i \\right) \\phi_{p}\\left( \\mathbf{x}_j \\right)\n\\end{aligned}\n\\]\nThis last expression can be conveniently replaced with an integral over the centers (see RW page 84),\n\\[\nk \\left( \\mathbf{x}_i, \\mathbf{x}_j \\right)  = \\int_{\\mathcal{X}} \\exp \\left( - \\frac{\\left( \\mathbf{x}_i - c \\right)^2 }{2l^2} \\right) \\exp \\left( - \\frac{\\left( \\mathbf{x}_j - c \\right)^2 }{2l^2} \\right)dc\n\\]\nwhere we will assume that \\(\\mathcal{X} = \\left( -\\infty, \\infty \\right)\\). 
This leads to\n\\[\n\\begin{aligned}\nk \\left( \\mathbf{x}_i, \\mathbf{x}_j \\right) & = \\int_{-\\infty}^{\\infty} \\exp \\left( - \\frac{\\left( \\mathbf{x}_i - c \\right)^2 }{2l^2} \\right) \\exp \\left( - \\frac{\\left( \\mathbf{x}_j - c \\right)^2 }{2l^2} \\right)dc \\\\\n& = \\sqrt{\\pi}l \\; \\exp \\left( - \\frac{\\left(\\mathbf{x}_i - \\mathbf{x}_j\\right)^2 }{2 \\left( \\sqrt{2} l \\right)^2 } \\right).\n\\end{aligned}\n\\]\nThe last expression is easily recognizable as an RBF kernel with an amplitude of \\(\\sqrt{\\pi}l\\) and a slightly amended length scale of \\(\\sqrt{2}l\\). It is straightforward to adapt this to multivariate \\(\\mathbf{x}\\).\nNote the utility of this representation: we essentially have infinitely many basis terms, but the size of our covariance matrix is driven by the number of data points."
  },
  {
    "objectID": "useful_codes/kernels.html#overview",
    "href": "useful_codes/kernels.html#overview",
    "title": "Kernel trick and lifting",
    "section": "",
    "text": "This note attempts to describe kernels. The hope is that, after going through it, the reader appreciates just how powerful kernels are and the role they play in Gaussian process models.\n\n\nCode\n### Data \nimport plotly.graph_objects as go\nimport plotly.figure_factory as ff\nimport plotly.express as px\nimport numpy as np\nimport pandas as pd\nimport plotly.io as pio\npio.renderers.default = 'iframe'\nfrom IPython.display import display, HTML\n\n\n\n\nOne way to motivate the study of kernels is to consider a linear regression problem where one has more unknowns than observational data. Let \\(\\mathbf{X} = \\left[\\mathbf{x}_{1}^{T}, \\mathbf{x}_{2}^{T}, \\ldots, \\mathbf{x}_{N}^{T}\\right]\\) be the \\(N \\times d\\) data corresponding to \\(N\\) observations of \\(d\\)-dimensional data. These input observations are accompanied by an output observational vector, \\(\\mathbf{y} \\in \\mathbb{R}^{N}\\). Let \\(\\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right) \\in \\mathbb{R}^{N \\times M}\\) be a parameterized matrix comprising \\(M\\) basis functions, i.e.,\n\\[\n\\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) = \\left[\\begin{array}{cccc}\n\\phi_{1}\\left(\\mathbf{X} \\right), & \\phi_{2}\\left(\\mathbf{X} \\right), & \\ldots, & \\phi_{M}\\left(\\mathbf{X} \\right)\\end{array}\\right]\n\\]\nIf we are interested in approximating \\(f \\left( \\mathbf{X} \\right) \\approx \\hat{f} \\left( \\mathbf{X} \\right) = \\mathbf{y} = \\mathbf{\\Phi} \\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha}\\), we can determine the unknown coefficients via least squares. This leads to the solution via the normal equations\n\\[\n\\boldsymbol{\\alpha} = \\left( \\mathbf{\\Phi}^{T} \\mathbf{\\Phi}\\right)^{-1} \\boldsymbol{\\Phi}^{T} \\mathbf{y}\n\\]\nNow, strictly speaking, one cannot use the normal equations to solve a problem where there are more unknowns than observations (\\(M > N\\)) because the \\(M \\times M\\) matrix \\(\\left( \\mathbf{\\Phi}^{T} \\mathbf{\\Phi}\\right)\\) is not full rank. 
Recognizing that in such a situation there may be numerous solutions to \\(\\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha} = \\mathbf{y}\\), we want the solution with the lowest \\(L_2\\) norm. This can be more conveniently formulated as a minimum norm problem, written as\n\\[\n\\begin{aligned}\n\\underset{\\boldsymbol{\\alpha}}{\\textrm{minimize}} & \\; \\boldsymbol{\\alpha}^{T} \\boldsymbol{\\alpha} \\\\\n\\textrm{subject to} \\; \\; &  \\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha} = \\mathbf{y}.\n\\end{aligned}\n\\]\nThe easiest way to solve this is via the method of Lagrange multipliers, i.e., we define the objective function\n\\[\nL \\left( \\boldsymbol{\\alpha}, \\lambda \\right) = \\boldsymbol{\\alpha}^{T} \\boldsymbol{\\alpha}  + \\lambda^{T} \\left( \\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right) \\boldsymbol{\\alpha} - \\mathbf{y}\\right),\n\\]\nwhere \\(\\lambda\\) comprises the Lagrange multipliers. The optimality conditions for this objective are given by\n\\[\n\\begin{aligned}\n\\nabla_{\\boldsymbol{\\alpha}} L & = 2 \\boldsymbol{\\alpha} + \\boldsymbol{\\Phi}^{T} \\lambda = 0, \\\\\n\\nabla_{\\lambda} L & = \\boldsymbol{\\Phi} \\boldsymbol{\\alpha} - \\mathbf{y} = 0.\n\\end{aligned}\n\\]\nThe first condition gives \\(\\boldsymbol{\\alpha} = - \\boldsymbol{\\Phi}^{T} \\lambda / 2\\). Substituting this into the second condition yields \\(\\lambda = -2 \\left(\\boldsymbol{\\Phi} \\boldsymbol{\\Phi}^{T} \\right)^{-1} \\mathbf{y}\\). This leads to the minimum norm solution\n\\[\n\\boldsymbol{\\alpha} = \\boldsymbol{\\Phi}^{T}  \\left( \\boldsymbol{\\Phi} \\boldsymbol{\\Phi}^{T}  \\right)^{-1} \\mathbf{y}.\n\\]\nNote that unlike \\(\\left( \\boldsymbol{\\Phi}^{T} \\boldsymbol{\\Phi} \\right)\\), the \\(N \\times N\\) matrix \\(\\left( \\boldsymbol{\\Phi} \\boldsymbol{\\Phi}^{T} \\right)\\) does have full rank, provided the \\(N\\) rows of \\(\\boldsymbol{\\Phi}\\) are linearly independent. Each entry of the latter is an inner product between two feature vectors. 
To see this, define the two-point kernel function\n\\[\nk \\left( \\mathbf{x}, \\mathbf{x}' \\right) = \\boldsymbol{\\Phi} \\left( \\mathbf{x} \\right) \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{x}' \\right),\n\\]\nand the associated covariance matrix, defined elementwise via\n\\[\n\\left[ \\mathbf{K} \\left(\\mathbf{X}, \\mathbf{X}' \\right)\\right]_{ij} = k \\left( \\mathbf{x}_{i}, \\mathbf{x}'_{j} \\right).\n\\]\n\n\n\nFrom the coefficients \\(\\boldsymbol{\\alpha}\\) computed via the minimum norm solution, it should be clear that approximate values of the true function at new locations \\(\\mathbf{X}_{\\ast}\\) can be given via\n\\[\n\\begin{aligned}\n\\hat{f} \\left( \\mathbf{X}_{\\ast} \\right) & = \\boldsymbol{\\Phi} \\left( \\mathbf{X}_{\\ast} \\right) \\boldsymbol{\\alpha} \\\\\n& = \\boldsymbol{\\Phi} \\left( \\mathbf{X}_{\\ast} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)  \\left( \\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)   \\right)^{-1} \\mathbf{y} \\\\\n& = \\left( \\boldsymbol{\\Phi} \\left( \\mathbf{X}_{\\ast} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)  \\right)  \\left( \\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right)  \\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right)  \\right)^{-1} \\mathbf{y} \\\\\n& = \\mathbf{K} \\left( \\mathbf{X}_{\\ast}, \\mathbf{X} \\right) \\mathbf{K}^{-1} \\left( \\mathbf{X}, \\mathbf{X} \\right) \\mathbf{y}\n\\end{aligned}\n\\]\nThere are two points to note here:\n\nThe form of the expression above is exactly that of the posterior predictive mean of a noise-free Gaussian process model.\nOne need not compute the full \\(N \\times M\\) feature matrix \\(\\boldsymbol{\\Phi} \\left( \\mathbf{X} \\right)\\) explicitly to work out the \\(N \\times N\\) matrix \\(\\mathbf{K}\\left( \\mathbf{X}, \\mathbf{X} \\right)\\).\n\nThis latter point is why this is called the kernel trick, i.e., for a very large number of features \\(M \\gg N\\) 
(possibly infinite), it is more computationally efficient to work out \\(\\mathbf{K}\\) directly from the kernel function.\nAnother way to interpret the kernel trick is to consider the example that was discussed in lecture with regard to the data in the plot below.\n\n\nConsider a quadratic kernel in \\(\\mathbb{R}^{2}\\), where \\(\\mathbf{x} = \\left(x_1, x_2 \\right)^{T}\\) and \\(\\mathbf{v} = \\left(v_1, v_2 \\right)^{T}\\). We can express this kernel as \\[\n\\begin{aligned}\nk \\left( \\mathbf{x}, \\mathbf{v} \\right) =  \\left( \\mathbf{x}^{T}  \\mathbf{v} \\right)^2 & = \\left( \\left[\\begin{array}{cc}\nx_{1} & x_{2}\\end{array}\\right]\\left[\\begin{array}{c}\nv_{1}\\\\\nv_{2}\n\\end{array}\\right] \\right)^2  \\\\\n& = \\left( x_1^2 v_1^2 + 2 x_1 x_2 v_1 v_2 + x_2^2 v_2^2\\right) \\\\\n& = \\left[\\begin{array}{ccc}\nx^2_{1} & \\sqrt{2} x_1 x_2 & x_2^2 \\end{array}\\right]\\left[\\begin{array}{c}\nv_{1}^2\\\\\n\\sqrt{2}v_1 v_2 \\\\\nv_{2}^2\n\\end{array}\\right] \\\\\n& = \\phi \\left( \\mathbf{x} \\right)^{T} \\phi \\left( \\mathbf{v}  \\right).\n\\end{aligned}\n\\] where \\(\\phi \\left( \\cdot \\right) \\in \\mathbb{R}^{3}\\).\nNow let us tabulate the number of operations required depending on which route one takes. Computing \\(\\left( \\mathbf{x}^{T} \\mathbf{v} \\right)^2\\) requires two multiplications (i.e., \\(x_1 \\times v_1\\) and \\(x_2 \\times v_2\\)), one sum (i.e., \\(s = x_1 v_1 + x_2 v_2\\)), and one further product (i.e., \\(s^2\\)). This leads to a total of four operations.\nNow consider the number of operations required for computing \\(\\phi \\left( \\mathbf{x} \\right)^{T} \\phi \\left( \\mathbf{v} \\right)\\). Assembling \\(\\phi \\left( \\mathbf{x} \\right)\\) requires at least three products, as does assembling \\(\\phi \\left( \\mathbf{v} \\right)\\); taking their inner product incurs another three products and two sums, leading to a total of at least 11 operations (nine multiplications and two sums). Thus, computationally, it is cheaper to use the original form for calculating the product.\nThere is however another perspective to this. 
Data that is not linearly separable in \\(\\mathbb{R}^{2}\\) can be lifted up to \\(\\mathbb{R}^{3}\\) where a separation may be more easily inferred. In this particular case, \\(\\phi \\left( \\mathbf{x} \\right)\\) is the feature map associated with a polynomial (quadratic) kernel.\nTo visualize this, consider a red set and a green set of random points. Points that lie at a relatively greater radius are shown in red, whilst points that are closer to the center are shown in green.\n\n\nCode\nt = np.random.rand(40,1)* 2 * np.pi\nr = np.random.rand(40,1)*0.2 + 2\nu = r * np.cos(t)\nv = r * np.sin(t)\n\nxy = np.vstack([np.random.rand(20,2)*2 - 1])\nxy2 = np.hstack([u, v])\n\n\n\n\nCode\nfig = go.Figure()\nfig.add_scatter(x=xy2[:,0], y=xy2[:,1],  name='Red', mode='markers', marker=dict(\n        size=15, color='red', opacity=0.8, line=dict(color='black', width=1) ))\nfig.add_scatter(x=xy[:,0], y=xy[:,1],  name='Green', mode='markers', marker=dict(\n        size=15, color='green', opacity=0.8, line=dict(color='black', width=1) ))\nfig.update_layout(legend=dict(yanchor=\"top\", y=0.99, xanchor=\"left\", x=0.01),\n                  xaxis_title=r'$\\mathbf{x}$',yaxis_title=r'$\\mathbf{v}$')\n\nfig.show()\n\n\n\n\n\nIt is not possible to separate these two sets using a line (or more generally a hyperplane). 
However, when the same data is lifted to \\(\\mathbb{R}^{3}\\), the two sets are linearly separable.\n\n\nCode\ndef mapup(xy):\n    phi_1 = xy[:,0]**2\n    phi_2 = np.sqrt(2) * xy[:,0] * xy[:,1]\n    phi_3 = xy[:,1]**2\n    return phi_1, phi_2, phi_3\n\nz1, z2, z3 = mapup(xy2)\nw1, w2, w3 = mapup(xy)\n\n\n\n\nCode\nfig = go.Figure()\nfig.add_scatter3d(x=z1, y=z2, z=z3, name='Red', mode='markers', marker=dict(\n        size=10, color='red', opacity=0.8, line=dict(color='black', width=2) ))\nfig.add_scatter3d(x=w1, y=w2, z=w3, name='Green', mode='markers', marker=dict(\n        size=10, color='green', opacity=0.8, line=dict(color='black', width=2) ))\nfig.update_layout(legend=dict(yanchor=\"top\", y=0.99, xanchor=\"left\", x=0.01),\n                  scene = dict(\n                    xaxis_title=r'$\\phi_{1}$',\n                    yaxis_title=r'$\\phi_2$',\n                    zaxis_title=r'$\\phi_3$'),\n                    width=700,\n                    margin=dict(r=20, b=10, l=10, t=10))\n\nfig.show()\n\n\n\n\n\n\n\n\nWe shall now briefly consider the case of regression with infinitely many basis functions. Consider a feature vector of the form\n\\[\n\\boldsymbol{\\Phi}\\left( \\mathbf{x} \\right) = \\left[\\begin{array}{cccc}\n\\phi_{1}\\left(\\mathbf{x} \\right), & \\phi_{2}\\left(\\mathbf{x} \\right), & \\ldots, & \\phi_{\\infty}\\left(\\mathbf{x} \\right)\\end{array}\\right]\n\\]\nwhere\n\\[\n\\phi_{j}\\left( \\mathbf{x} \\right) = \\exp \\left( - \\frac{\\left( \\mathbf{x} - c_j \\right)^2 }{2l^2} \\right)\n\\]\nand \\(c_j\\) represents the center of the bell-shaped basis function; we assume that there are infinitely many centers across the domain of interest and thus there exist infinitely many basis terms. 
To visualize this, see the code below.\n\n\nCode\nx = np.linspace(-5, 5, 150)\ninfty_subtitute = 20\nc_js = np.linspace(-5, 5, infty_subtitute)\nl = 0.5\n\nfig = go.Figure()\nfor j in range(0, infty_subtitute):\n    leg = 'c_j = '+str(np.around(c_js[j], 2))\n    psi_j = np.exp(- (x - c_js[j])**2 * 1./(2*l**2))\n    fig.add_scatter(x=x, y=psi_j, mode='lines', name=leg)\n    fig.update_layout(legend=dict(yanchor=\"top\", y=0.99, xanchor=\"left\", x=-0.6),\n                  xaxis_title=r'$x$',yaxis_title=r'$\\phi\\left( x \\right)$')\nfig.show()\n\n\n\n\n\nThe two-point covariance matrix can be written as the sum of outer products of the feature vectors evaluated at all points \\(\\mathbf{X}\\) (which are individually rank-one matrices):\n\\[\n\\begin{aligned}\n\\mathbf{K} & = \\boldsymbol{\\Phi}\\left( \\mathbf{X} \\right)\\boldsymbol{\\Phi}^{T} \\left( \\mathbf{X} \\right) \\\\\n& = \\sum_{j=1}^{\\infty} \\phi_{j}\\left( \\mathbf{X} \\right) \\phi_{j}^{T}\\left( \\mathbf{X} \\right)\n\\end{aligned}\n\\]\nor if one is considering each kernel entry, one can write\n\\[\n\\begin{aligned}\n\\mathbf{K}\\left[i, j \\right] = \\mathbf{K}_{ij}  & = k \\left( \\mathbf{x}_i, \\mathbf{x}_j \\right)\\\\\n& = \\boldsymbol{\\Phi}\\left( \\mathbf{x}_i \\right)\\boldsymbol{\\Phi}^{T} \\left( \\mathbf{x}_j \\right) \\\\\n& = \\sum_{p=1}^{\\infty} \\phi_{p}\\left( \\mathbf{x}_i \\right) \\phi_{p}\\left( \\mathbf{x}_j \\right)\n\\end{aligned}\n\\]\nThis last expression can be conveniently replaced with an integral over the centers (see RW page 84),\n\\[\nk \\left( \\mathbf{x}_i, \\mathbf{x}_j \\right)  = \\int_{\\mathcal{X}} \\exp \\left( - \\frac{\\left( \\mathbf{x}_i - c \\right)^2 }{2l^2} \\right) \\exp \\left( - \\frac{\\left( \\mathbf{x}_j - c \\right)^2 }{2l^2} \\right)dc\n\\]\nwhere we will assume that \\(\\mathcal{X} = \\left( -\\infty, \\infty \\right)\\). 
This leads to\n\\[\n\\begin{aligned}\nk \\left( \\mathbf{x}_i, \\mathbf{x}_j \\right) & = \\int_{-\\infty}^{\\infty} \\exp \\left( - \\frac{\\left( \\mathbf{x}_i - c \\right)^2 }{2l^2} \\right) \\exp \\left( - \\frac{\\left( \\mathbf{x}_j - c \\right)^2 }{2l^2} \\right)dc \\\\\n& = \\sqrt{\\pi}l \\; \\exp \\left( - \\frac{\\left(\\mathbf{x}_i - \\mathbf{x}_j\\right)^2 }{2 \\left( \\sqrt{2} l \\right)^2 } \\right).\n\\end{aligned}\n\\]\nThe last expression is easily recognizable as an RBF kernel with an amplitude of \\(\\sqrt{\\pi}l\\) and a slightly amended length scale of \\(\\sqrt{2}l\\). It is straightforward to adapt this to multivariate \\(\\mathbf{x}\\).\nNote the utility of this representation: we essentially have infinitely many basis terms, but the size of our covariance matrix is driven by the number of data points."
  },
  {
    "objectID": "useful_codes/fourier.html",
    "href": "useful_codes/fourier.html",
    "title": "Fourier analysis of kernels",
    "section": "",
    "text": "This note concerns the Fourier dual of a kernel, i.e., one can think of a kernel has having an associated set of frequencies and amplitudes, much like studying the spectral density or power spectrum of a signal.\n\n\nCode\nimport plotly.graph_objects as go\nimport plotly.figure_factory as ff\nimport plotly.express as px\nimport pandas as pd\nimport plotly.io as pio\nimport numpy as np\nfrom scipy.stats import multivariate_normal\nimport matplotlib.pyplot as plt\npio.renderers.default = 'iframe'\nfrom IPython.display import display, HTML\n\n\n\n\nA stationary function \\(k\\left( \\mathbf{x}, \\mathbf{x}' \\right)=k \\left( \\mathbf{x} - \\mathbf{x}' \\right) = k \\left( \\boldsymbol{\\tau} \\right)\\) can be represented as the Fourier transform of a positive finite measure. The formal statement is given by\nA complex-valued function \\(k\\) on \\(\\mathcal{X}\\) is the covariance function of a weakly stationary mean square continuous complex-valued random process on \\(\\mathcal{X}\\) if and only if it can be represented as\n\\[\nk \\left( \\boldsymbol{\\tau} \\right) = \\int_{\\mathcal{X}} exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\mu \\left( \\boldsymbol{\\omega} \\right)\n\\]\nwhere \\(\\mu\\) is a positive finite measure, and \\(\\boldsymbol{\\omega}\\) are the frequencies. If \\(\\mu\\) has a density \\(S \\left( \\boldsymbol{\\omega} \\right)\\), then \\(S\\) is the spectral density or power spectrum associated with the kernel \\(k\\).\n\n\n\nA direct consequence of Bochner’s theoreom is the Wiener-Khintchine theorem. If the spectral density \\(S \\left( \\boldsymbol{\\omega} \\right)\\) exists, the spectral density and the covariance function are said to be Fourier duals. 
This leads to the following statement:\n\\[\nk \\left( \\boldsymbol{\\tau} \\right) = \\int S \\left( \\boldsymbol{\\omega} \\right) exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\boldsymbol{\\omega}, \\; \\; \\; \\; S \\left( \\boldsymbol{\\omega}\\right) = \\int k \\left( \\boldsymbol{\\tau} \\right) exp\\left(- 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\boldsymbol{\\tau}\n\\]\nAs noted in RW, \\(S \\left( \\boldsymbol{\\omega} \\right)\\) is essentially the amount of power assigned to the eigenfunction \\(exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right)\\) with frequency \\(\\boldsymbol{\\omega}\\). The amplitude as a function of frequency \\(S\\left( \\boldsymbol{\\omega} \\right)\\) must decay sufficiently fast so that the terms above are integrable.\nThere are some important points to note:\n\nIf we have a stationary kernel, we can resolve what frequencies underlie the model by working out its Fourier transform.\nOn the other hand, if we have a certain spectral density of interest, then its inverse Fourier transform is a kernel.\n\nTo work this out analytically, it may be useful to go through an example (courtesy of Markus Heinonen). The derivation below will require three pieces: - We shall assume a symmetric frequency distribution, i.e., \\(S\\left( \\boldsymbol{\\omega} \\right) = S \\left( -\\boldsymbol{\\omega} \\right)\\). 
- From Euler’s formula we have \\(cos\\left(x\\right) \\pm i sin\\left(x \\right) = exp \\left(\\pm ix \\right)\\) - The negative sine identity, i.e., \\(sin \\left( -x \\right) = - sin \\left( x \\right)\\)\nStarting with the expression above, we have\n\\[\n\\begin{aligned}\nk \\left( \\boldsymbol{\\tau} \\right) & = \\int_{-\\infty}^{\\infty} S \\left( \\boldsymbol{\\omega} \\right) exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\boldsymbol{\\omega} \\\\\n& =   \\int_{-\\infty}^{\\infty} S \\left(\\boldsymbol{\\omega} \\right) cos \\left( 2 \\pi\\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{-\\infty}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} \\\\\n& = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]  + \\int_{-\\infty}^{0} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{0}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} \\\\\n& = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]  + \\int_{0}^{\\infty} iS \\left(-\\boldsymbol{\\omega} \\right) sin \\left( -2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{0}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} \\\\\n& = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]  + \\int_{0}^{\\infty} -iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{0}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega}\n\\end{aligned}\n\\]\nHere \\(\\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ \\cdot \\right]\\) is shorthand for the \\(S\\left( \\boldsymbol{\\omega} \\right)\\)-weighted integral over all frequencies, which is a true expectation when \\(S\\) is normalised to integrate to one. The two sine integrals cancel, which leads to\n\\[\n\\begin{aligned}\nk \\left( \\boldsymbol{\\tau} \\right) & = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]\n\\end{aligned}\n\\]\nThis demonstrates that all real-valued stationary kernels are \\(S\\left( \\boldsymbol{\\omega} \\right)\\)-weighted combinations of cosine terms. \n\n\n\nOur new general stationary kernel definition is thus:\n\\[\nk \\left( \\boldsymbol{\\tau} \\right)  = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]\n\\]\nwhere each frequency \\(\\boldsymbol{\\omega}\\) is the inverse of a period, \\(1/\\boldsymbol{\\omega}\\). Bracewell provides the following expressions for the Wiener-Khintchine result, by integrating out the angular variables (see page 83 of RW):\n\\[\n\\begin{aligned}\nk \\left( \\boldsymbol{\\tau} \\right) & = \\frac{2 \\pi}{\\boldsymbol{\\tau}^{-1/2}} \\int_{0}^{\\infty} S \\left( \\boldsymbol{\\omega} \\right) J_{-1/2} \\left(2 \\pi  \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\boldsymbol{\\omega}^{1/2} d \\boldsymbol{\\omega} \\\\\nS \\left(   \\boldsymbol{\\omega} \\right) & = \\frac{2 \\pi}{\\boldsymbol{\\omega}^{-1/2}} \\int_{0}^{\\infty} k \\left( \\boldsymbol{\\tau} \\right) J_{-1/2} \\left(2 \\pi  \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\boldsymbol{\\tau}^{1/2} d \\boldsymbol{\\tau}\n\\end{aligned}\n\\]\nNote that in RW, the authors use \\(D\\) to denote the dimensionality, which we have assumed to be 1. The function \\(J_{-1/2}\\) is the Bessel function of order \\(-1/2\\). 
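For \\(D = 1\\) the Bessel form above collapses to the cosine representation, because \\(J_{-1/2}(z) = \\sqrt{2 / (\\pi z)} \\, cos(z)\\), so that \\(k(\\tau) = 2 \\int_{0}^{\\infty} S(\\omega) cos(2 \\pi \\tau \\omega) d\\omega\\). The sketch below checks both statements numerically. It assumes the squared exponential spectral density that Sympy returns further down; the helper names se_spectral_density and kernel_from_density are illustrative only and are not part of the original notebook.

```python
import numpy as np
from scipy.special import jv

# Check the closed form J_{-1/2}(z) = sqrt(2 / (pi z)) cos(z) on a grid of z > 0.
z = np.linspace(0.1, 20.0, 200)
bessel_gap = np.max(np.abs(jv(-0.5, z) - np.sqrt(2.0 / (np.pi * z)) * np.cos(z)))

def se_spectral_density(omega, ell):
    # Assumed spectral density of the squared exponential kernel in 1D.
    return np.sqrt(2.0 * np.pi) * ell * np.exp(-2.0 * np.pi**2 * ell**2 * omega**2)

def kernel_from_density(tau, ell, omega_max=10.0, n_grid=20001):
    # Trapezoidal approximation of 2 * int_0^omega_max S(omega) cos(2 pi tau omega) d omega.
    omega = np.linspace(0.0, omega_max, n_grid)
    d_omega = omega[1] - omega[0]
    tau = np.atleast_1d(tau)[:, None]
    f = se_spectral_density(omega, ell)[None, :] * np.cos(2.0 * np.pi * tau * omega[None, :])
    integral = d_omega * (f.sum(axis=1) - 0.5 * (f[:, 0] + f[:, -1]))
    return 2.0 * integral

taus = np.linspace(0.0, 2.0, 9)
ell = 0.5
k_numeric = kernel_from_density(taus, ell)
k_exact = np.exp(-taus**2 / (2.0 * ell**2))
print(bessel_gap, np.max(np.abs(k_numeric - k_exact)))
```

Both printed gaps should be close to machine precision, which is a useful sanity check before trusting the symbolic result.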
While the expressions above may seem unwieldy, we can work out what these are using a bit of Sympy. Consider the case of a squared exponential kernel of the form\n\\[\nk \\left(\\boldsymbol{\\tau} \\right) = exp \\left(- \\frac{\\boldsymbol{\\tau}^2}{2l^2} \\right).\n\\]\n\n\nCode\nfrom sympy import * \n\nomega = Symbol(\"omega\")\nell = Symbol(\"l\")\ntau = Symbol(\"tau\")\n\nkernel = exp(- tau**2 / (2 * ell**2))\nintegrate(2*pi*omega**(1/2) * kernel * besselj(-1/2, 2*pi*tau*omega)*tau**(1/2), (tau, 0, oo))\n\n\n\\(\\displaystyle \\begin{cases} 1.4142135623731 \\pi^{0.5} l^{1.0} e^{- 2 \\pi^{2} l^{2} \\omega^{2}} & \\text{for}\\: \\left(\\left|{\\arg{\\left(\\omega \\right)}}\\right| = 0 \\wedge \\left|{\\arg{\\left(l \\right)}}\\right| &lt; \\frac{\\pi}{4}\\right) \\vee \\left|{\\arg{\\left(l \\right)}}\\right| &lt; \\frac{\\pi}{4} \\\\\\int\\limits_{0}^{\\infty} 2 \\pi \\omega^{0.5} \\tau^{0.5} e^{- \\frac{\\tau^{2}}{2 l^{2}}} J_{-0.5}\\left(2 \\pi \\omega \\tau\\right)\\, d\\tau & \\text{otherwise} \\end{cases}\\)\n\n\nThe first expression above is the Fourier amplitude of the squared exponential kernel, i.e.,\n\\[\nS \\left(\\boldsymbol{\\omega} \\right) = \\left( 2 \\pi l^2\\right)^{1/2} exp \\left( - 2 \\pi^2 l^2 \\boldsymbol{\\omega}^2 \\right)\n\\]\n\n\nCode\nomega = np.linspace(0, np.pi/4, 50)\nl = 0.5\nS_omega = (2 * np.pi * l**2)**(1/2) * \\\n            np.exp(- 2 * np.pi**2 * l**2 * omega**2)\ntau = np.linspace(0, 10, 200)\n\n\nfig = go.Figure()\nfig.add_scatter(x=omega, y=S_omega, mode='lines')\nfig.update_layout(title='Spectral density', \\\n                  xaxis_title=r'Frequency, $\\omega$',\\\n                  yaxis_title=r'Spectral density, $S\\left( \\omega \\right) $')\nfig.show()\n\n\n\n\n\n\n\nCode\nkernel = tau * 0.\ntrue_kernel = np.exp(-tau**2 / (2 * l**2))\nd_omega = omega[1] - omega[0]\n\nfig = go.Figure()\nfor counter, omega_j in enumerate(omega, start=1):\n    label = str(counter) + ' terms'\n    S_omega_j = (2 * np.pi * l**2)**(1/2) * \\\n            np.exp(- 2 * np.pi**2 * l**2 * omega_j**2)\n    cos_term = np.cos(2 * np.pi * tau * omega_j)\n    # quadrature weight for 2 * int_0^inf S(omega) cos(2 pi tau omega) d omega;\n    # the omega = 0 term gets half weight (trapezoidal rule)\n    weight = d_omega if counter == 1 else 2 * d_omega\n    kernel += weight * S_omega_j * cos_term\n    fig.add_scatter(x=tau, y=kernel.copy(), name=label, mode='lines')\nfig.add_scatter(x=tau, y=true_kernel, name='Kernel', mode='lines', \\\n                line=dict(width=4, color='black'))\nfig.update_layout(title='Sq. exp kernel Fourier representation', \\\n                  xaxis_title=r'Distance, $\\tau$',\\\n                  yaxis_title=r'$k ( \\tau )$')\nfig.show()\n\n\n\n\n\nNotice that the more terms we incorporate, the closer we get to the true kernel; the small residual comes from truncating the frequency grid at \\(\\pi/4\\).\n\n\n\nRather than work with a kernel approximation that has a great many terms, it is more instructive to resort to a few terms. Such is the idea behind Random Fourier Features, where the kernel is approximated using a modest number of randomly sampled frequencies. For more details, please see the paper by Rahimi and Recht.\n\n\nCode\nR = 500 # random features\nD = 50 # number of data pts.\nx = np.linspace(-2*np.pi, 2*np.pi, D).reshape(D,1) # grid\nX = np.tile(x, [1, D]) - np.tile(x.T, [D, 1])\nW    = np.random.normal(loc=0, scale=0.1, size=(R, D))\nb    = np.random.uniform(0, 2*np.pi, size=R)\nB    = np.repeat(b[:, np.newaxis], D, axis=1)\nnorm = 1./ np.sqrt(R)\nZ    = norm * np.sqrt(2) * np.cos(W @ X.T + B)\nZZ   = Z.T @ Z\n\n\n\n\nCode\nfig = plt.figure(figsize=(14,5))\nplt.subplot(121)\nd = plt.imshow(ZZ)\nplt.colorbar(d, shrink=0.3)\nplt.title('Random Fourier Features')\nnormal = multivariate_normal(np.zeros((D)), ZZ, allow_singular=True)\nplt.subplot(122)\nplt.plot(x, normal.rvs(10).T )\nplt.title('Random samples from prior')\nplt.xlabel('x')\nplt.show()"
  },
  {
    "objectID": "useful_codes/fourier.html#overview",
    "href": "useful_codes/fourier.html#overview",
    "title": "Fourier analysis of kernels",
    "section": "",
    "text": "This note concerns the Fourier dual of a kernel, i.e., one can think of a kernel has having an associated set of frequencies and amplitudes, much like studying the spectral density or power spectrum of a signal.\n\n\nCode\nimport plotly.graph_objects as go\nimport plotly.figure_factory as ff\nimport plotly.express as px\nimport pandas as pd\nimport plotly.io as pio\nimport numpy as np\nfrom scipy.stats import multivariate_normal\nimport matplotlib.pyplot as plt\npio.renderers.default = 'iframe'\nfrom IPython.display import display, HTML\n\n\n\n\nA stationary function \\(k\\left( \\mathbf{x}, \\mathbf{x}' \\right)=k \\left( \\mathbf{x} - \\mathbf{x}' \\right) = k \\left( \\boldsymbol{\\tau} \\right)\\) can be represented as the Fourier transform of a positive finite measure. The formal statement is given by\nA complex-valued function \\(k\\) on \\(\\mathcal{X}\\) is the covariance function of a weakly stationary mean square continuous complex-valued random process on \\(\\mathcal{X}\\) if and only if it can be represented as\n\\[\nk \\left( \\boldsymbol{\\tau} \\right) = \\int_{\\mathcal{X}} exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\mu \\left( \\boldsymbol{\\omega} \\right)\n\\]\nwhere \\(\\mu\\) is a positive finite measure, and \\(\\boldsymbol{\\omega}\\) are the frequencies. If \\(\\mu\\) has a density \\(S \\left( \\boldsymbol{\\omega} \\right)\\), then \\(S\\) is the spectral density or power spectrum associated with the kernel \\(k\\).\n\n\n\nA direct consequence of Bochner’s theoreom is the Wiener-Khintchine theorem. If the spectral density \\(S \\left( \\boldsymbol{\\omega} \\right)\\) exists, the spectral density and the covariance function are said to be Fourier duals. 
This leads to the following statement:\n\\[\nk \\left( \\boldsymbol{\\tau} \\right) = \\int S \\left( \\boldsymbol{\\omega} \\right) exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\boldsymbol{\\omega}, \\; \\; \\; \\; S \\left( \\boldsymbol{\\omega}\\right) = \\int k \\left( \\boldsymbol{\\tau} \\right) exp\\left(- 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\boldsymbol{\\tau}\n\\]\nAs noted in RW, \\(S \\left( \\boldsymbol{\\omega} \\right)\\) is essentially the amount of power assigned to the eigenfunction \\(exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right)\\) with frequency \\(\\boldsymbol{\\omega}\\). The amplitude as a function of frequency \\(S\\left( \\boldsymbol{\\omega} \\right)\\) must decay sufficiently fast so that the terms above are integrable.\nThere are some important points to note:\n\nIf we have a stationary kernel, we can resolve what frequencies underlie the model by working out its Fourier transform.\nOn the other hand, if we have a certain spectral density of interest, then its inverse Fourier transform is a kernel.\n\nTo work this out analytically, it may be useful to go through an example (courtesy of Markus Heinonen). The derivation below will require three pieces: - We shall assume a symmetric frequency distribution, i.e., \\(S\\left( \\boldsymbol{\\omega} \\right) = S \\left( -\\boldsymbol{\\omega} \\right)\\). 
- From Euler’s formula we have \\(cos\\left(x\\right) \\pm i sin\\left(x \\right) = exp \\left(\\pm ix \\right)\\) - The negative sine identity, i.e., \\(sin \\left( -x \\right) = - sin \\left( x \\right)\\)\nStarting with the expression above, we have\n\\[\n\\begin{aligned}\nk \\left( \\boldsymbol{\\tau} \\right) & = \\int_{-\\infty}^{\\infty} S \\left( \\boldsymbol{\\omega} \\right) exp \\left( 2 \\pi i \\boldsymbol{\\omega} \\cdot \\boldsymbol{\\tau} \\right) d \\boldsymbol{\\omega} \\\\\n& =   \\int_{-\\infty}^{\\infty} S \\left(\\boldsymbol{\\omega} \\right) cos \\left( 2 \\pi\\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{-\\infty}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} \\\\\n& = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]  + \\int_{-\\infty}^{0} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{0}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} \\\\\n& = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]  + \\int_{0}^{\\infty} iS \\left(-\\boldsymbol{\\omega} \\right) sin \\left( -2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{0}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} \\\\\n& = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]  + \\int_{0}^{\\infty} -iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega} + \\int_{0}^{\\infty} iS \\left(\\boldsymbol{\\omega} \\right) sin \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) d \\boldsymbol{\\omega}\n\\end{aligned}\n\\]\nHere \\(\\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ \\cdot \\right]\\) is shorthand for the \\(S\\left( \\boldsymbol{\\omega} \\right)\\)-weighted integral over all frequencies, which is a true expectation when \\(S\\) is normalised to integrate to one. The two sine integrals cancel, which leads to\n\\[\n\\begin{aligned}\nk \\left( \\boldsymbol{\\tau} \\right) & = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]\n\\end{aligned}\n\\]\nThis demonstrates that all real-valued stationary kernels are \\(S\\left( \\boldsymbol{\\omega} \\right)\\)-weighted combinations of cosine terms. \n\n\n\nOur new general stationary kernel definition is thus:\n\\[\nk \\left( \\boldsymbol{\\tau} \\right)  = \\mathbb{E}_{S \\left(\\boldsymbol{\\omega} \\right)} \\left[ cos \\left( 2 \\pi \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\right]\n\\]\nwhere each frequency \\(\\boldsymbol{\\omega}\\) is the inverse of a period, \\(1/\\boldsymbol{\\omega}\\). Bracewell provides the following expressions for the Wiener-Khintchine result, by integrating out the angular variables (see page 83 of RW):\n\\[\n\\begin{aligned}\nk \\left( \\boldsymbol{\\tau} \\right) & = \\frac{2 \\pi}{\\boldsymbol{\\tau}^{-1/2}} \\int_{0}^{\\infty} S \\left( \\boldsymbol{\\omega} \\right) J_{-1/2} \\left(2 \\pi  \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\boldsymbol{\\omega}^{1/2} d \\boldsymbol{\\omega} \\\\\nS \\left(   \\boldsymbol{\\omega} \\right) & = \\frac{2 \\pi}{\\boldsymbol{\\omega}^{-1/2}} \\int_{0}^{\\infty} k \\left( \\boldsymbol{\\tau} \\right) J_{-1/2} \\left(2 \\pi  \\boldsymbol{\\tau} \\cdot \\boldsymbol{\\omega} \\right) \\boldsymbol{\\tau}^{1/2} d \\boldsymbol{\\tau}\n\\end{aligned}\n\\]\nNote that in RW, the authors use \\(D\\) to denote the dimensionality, which we have assumed to be 1. The function \\(J_{-1/2}\\) is the Bessel function of order \\(-1/2\\). 
While the expressions above may seem unwieldy, we can work out what these are using a bit of Sympy. Consider the case of a squared exponential kernel of the form\n\\[\nk \\left(\\boldsymbol{\\tau} \\right) = exp \\left(- \\frac{\\boldsymbol{\\tau}^2}{2l^2} \\right).\n\\]\n\n\nCode\nfrom sympy import * \n\nomega = Symbol(\"omega\")\nell = Symbol(\"l\")\ntau = Symbol(\"tau\")\n\nkernel = exp(- tau**2 / (2 * ell**2))\nintegrate(2*pi*omega**(1/2) * kernel * besselj(-1/2, 2*pi*tau*omega)*tau**(1/2), (tau, 0, oo))\n\n\n\\(\\displaystyle \\begin{cases} 1.4142135623731 \\pi^{0.5} l^{1.0} e^{- 2 \\pi^{2} l^{2} \\omega^{2}} & \\text{for}\\: \\left(\\left|{\\arg{\\left(\\omega \\right)}}\\right| = 0 \\wedge \\left|{\\arg{\\left(l \\right)}}\\right| &lt; \\frac{\\pi}{4}\\right) \\vee \\left|{\\arg{\\left(l \\right)}}\\right| &lt; \\frac{\\pi}{4} \\\\\\int\\limits_{0}^{\\infty} 2 \\pi \\omega^{0.5} \\tau^{0.5} e^{- \\frac{\\tau^{2}}{2 l^{2}}} J_{-0.5}\\left(2 \\pi \\omega \\tau\\right)\\, d\\tau & \\text{otherwise} \\end{cases}\\)\n\n\nThe first expression above is the Fourier amplitude of the squared exponential kernel, i.e.,\n\\[\nS \\left(\\boldsymbol{\\omega} \\right) = \\left( 2 \\pi l^2\\right)^{1/2} exp \\left( - 2 \\pi^2 l^2 \\boldsymbol{\\omega}^2 \\right)\n\\]\n\n\nCode\nomega = np.linspace(0, np.pi/4, 50)\nl = 0.5\nS_omega = (2 * np.pi * l**2)**(1/2) * \\\n            np.exp(- 2 * np.pi**2 * l**2 * omega**2)\ntau = np.linspace(0, 10, 200)\n\n\nfig = go.Figure()\nfig.add_scatter(x=omega, y=S_omega, mode='lines')\nfig.update_layout(title='Spectral density', \\\n                  xaxis_title=r'Frequency, $\\omega$',\\\n                  yaxis_title=r'Spectral density, $S\\left( \\omega \\right) $')\nfig.show()\n\n\n\n\n\n\n\nCode\nkernel = tau * 0.\ntrue_kernel = np.exp(-tau**2 / (2 * l**2))\nd_omega = omega[1] - omega[0]\n\nfig = go.Figure()\nfor counter, omega_j in enumerate(omega, start=1):\n    label = str(counter) + ' terms'\n    S_omega_j = (2 * np.pi * l**2)**(1/2) * \\\n            np.exp(- 2 * np.pi**2 * l**2 * omega_j**2)\n    cos_term = np.cos(2 * np.pi * tau * omega_j)\n    # quadrature weight for 2 * int_0^inf S(omega) cos(2 pi tau omega) d omega;\n    # the omega = 0 term gets half weight (trapezoidal rule)\n    weight = d_omega if counter == 1 else 2 * d_omega\n    kernel += weight * S_omega_j * cos_term\n    fig.add_scatter(x=tau, y=kernel.copy(), name=label, mode='lines')\nfig.add_scatter(x=tau, y=true_kernel, name='Kernel', mode='lines', \\\n                line=dict(width=4, color='black'))\nfig.update_layout(title='Sq. exp kernel Fourier representation', \\\n                  xaxis_title=r'Distance, $\\tau$',\\\n                  yaxis_title=r'$k ( \\tau )$')\nfig.show()\n\n\n\n\n\nNotice that the more terms we incorporate, the closer we get to the true kernel; the small residual comes from truncating the frequency grid at \\(\\pi/4\\).\n\n\n\nRather than work with a kernel approximation that has a great many terms, it is more instructive to resort to a few terms. Such is the idea behind Random Fourier Features, where the kernel is approximated using a modest number of randomly sampled frequencies. For more details, please see the paper by Rahimi and Recht.\n\n\nCode\nR = 500 # random features\nD = 50 # number of data pts.\nx = np.linspace(-2*np.pi, 2*np.pi, D).reshape(D,1) # grid\nX = np.tile(x, [1, D]) - np.tile(x.T, [D, 1])\nW    = np.random.normal(loc=0, scale=0.1, size=(R, D))\nb    = np.random.uniform(0, 2*np.pi, size=R)\nB    = np.repeat(b[:, np.newaxis], D, axis=1)\nnorm = 1./ np.sqrt(R)\nZ    = norm * np.sqrt(2) * np.cos(W @ X.T + B)\nZZ   = Z.T @ Z\n\n\n\n\nCode\nfig = plt.figure(figsize=(14,5))\nplt.subplot(121)\nd = plt.imshow(ZZ)\nplt.colorbar(d, shrink=0.3)\nplt.title('Random Fourier Features')\nnormal = multivariate_normal(np.zeros((D)), ZZ, allow_singular=True)\nplt.subplot(122)\nplt.plot(x, normal.rvs(10).T )\nplt.title('Random samples from prior')\nplt.xlabel('x')\nplt.show()"
  },
  {
    "objectID": "useful_codes/gp_classification.html",
    "href": "useful_codes/gp_classification.html",
    "title": "Overview",
    "section": "",
    "text": "Code\n---\ntitle: \"Gaussian Process Classification\"\nformat:\n    html:\n        code-fold: true\njupyter: python3\nfontsize: 1.2em\nlinestretch: 1.5\ntoc: true\nnotebook-view: true\n---\n\n\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nfrom scipy.linalg import cholesky, solve_triangular, cho_factor, cho_solve\nimport seaborn as sns\nfrom sklearn.datasets import make_moons\nfrom sklearn.metrics import pairwise_distances\n\n\n\n\nCode\nX, t = make_moons(n_samples=100, noise=0.1, random_state=0)\nN = X.shape[0]\n\n\n\n\nCode\nfig = plt.figure(figsize=(5,4))\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=25, alpha=0.8)\nplt.xlabel(r'$x_1$')\nplt.ylabel(r'$x_2$')\nplt.savefig('fig.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\n\n\nCode\ndef kernel(xa, xb, amp=2.0, ll=0.5):\n    D = pairwise_distances(xa, xb)\n    return amp**2 * np.exp(-0.5 * 1./ll**2 * D**2 )\n\ndef sigmoid(f):\n    return 1.0 / (1.0 + np.exp(-f)) + 1e-8\n\n\n\n\nCode\nM = 20\nxx = np.linspace(-1.3, 2.3, M)\nyy = np.linspace(-1.3, 1.3, M)\nXa, Xb = np.meshgrid(xx, yy)\npts = np.hstack([Xa.reshape(M*M,1), Xb.reshape(M*M,1)])\n\n\n\n\nCode\nprior = multivariate_normal(np.zeros((M*M)), kernel(pts,pts), allow_singular=True)\\\nnorm = matplotlib.cm.colors.Normalize(vmax=1, vmin=0)\n\n\n\n\nCode\nfig = plt.figure(figsize=(10,8))\nplt.subplot(221)\nc = plt.contourf(Xa, Xb, sigmoid(prior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=25, edgecolor='w', alpha=0.8, norm=norm)\ncbar = plt.colorbar(c)\n\nplt.subplot(222)\nc = plt.contourf(Xa, Xb, sigmoid(prior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=25, edgecolor='w', alpha=0.8, norm=norm)\ncbar = 
plt.colorbar(c)\n\nplt.subplot(223)\nc = plt.contourf(Xa, Xb, sigmoid(prior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=25, edgecolor='w', alpha=0.8, norm=norm)\ncbar = plt.colorbar(c)\n\nplt.subplot(224)\nc = plt.contourf(Xa, Xb, sigmoid(prior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=25, edgecolor='w', alpha=0.8, norm=norm)\ncbar = plt.colorbar(c)\n\nfig.suptitle('Four samples from the GP prior', fontsize=13)\nplt.savefig('fig2.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\n\n\nCode\nf = np.random.rand(N,1)\nK = kernel(X,X)\nK_inv = np.linalg.inv(K + 1e-5 * np.eye(N))\n\n\n\n\nCode\nc = plt.imshow(K, cmap=matplotlib.cm.RdYlBu)\nplt.colorbar(c)\n\n\n\n\n\n\n\nCode\n# Newton iterations for the mode of the latent function f\nfor j in range(0, 10):\n    pp = sigmoid(f).flatten()\n    # unnormalised log posterior: Bernoulli log likelihood plus GP log prior\n    g_f = np.sum(t * np.log(pp) + (1 - t) * np.log(1. - pp + 1e-8)) - 0.5 * f.T @ K_inv @ f\n    print(g_f.item())\n    q = (t - pp).reshape(N,1)\n    grad = q - K_inv @ f\n    P = np.diag(pp * (1. - pp))\n    hess = -P - K_inv\n    f_prime = f - np.linalg.inv(hess + 1e-12 * np.eye(N)) @ grad\n    f = f_prime\n\n\n-172370.77223793537\n-19.282854330238436\n-16.478057199339062\n-16.338477505385345\n-16.33785884403903\n-16.337858825403753\n-16.33785882444115\n-16.337858825450375\n-16.337858825363156\n-16.33785882566754\n\n\n\n\nCode\nKxpx = kernel(pts, X)\nKxpxp = kernel(pts, pts)\nmean = Kxpx @ K_inv @ f\ncov = Kxpxp - Kxpx @ K_inv @ Kxpx.T\n\nposterior = multivariate_normal(mean.flatten(), cov, allow_singular=True)\n\n\n\n\nCode\nfig = plt.figure(figsize=(10,8))\nplt.subplot(221)\nc = plt.contourf(Xa, Xb, sigmoid(posterior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=20, edgecolor='w', alpha=0.8, norm=norm)\ncbar = plt.colorbar(c)\n\nplt.subplot(222)\nc = plt.contourf(Xa, Xb, sigmoid(posterior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=25, edgecolor='w', alpha=0.8, norm=norm)\ncbar = plt.colorbar(c)\n\nplt.subplot(223)\nc = plt.contourf(Xa, Xb, sigmoid(posterior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu, s=25, edgecolor='w', alpha=0.8, norm=norm)\ncbar = plt.colorbar(c)\n\nplt.subplot(224)\nc = plt.contourf(Xa, Xb, sigmoid(posterior.rvs(1)).reshape(M,M), 50, vmin=0, vmax=1, cmap=matplotlib.cm.RdYlBu, norm=norm)\nplt.scatter(X[:,0], X[:,1], c=t, cmap=matplotlib.cm.RdYlBu,s=25, edgecolor='w', alpha=0.8,norm=norm)\ncbar = plt.colorbar(c)\n\nfig.suptitle('Four samples from the GP posterior', fontsize=13)\nplt.savefig('fig3.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()"
  },
  {
    "objectID": "useful_codes/mcmc.html",
    "href": "useful_codes/mcmc.html",
    "title": "Inference with MAP & MCMC",
    "section": "",
    "text": "This lecture primarily will focus on Markov chain Monte Carlo as another method for hyperparameter inference. Much of the material below is based on Chapter 11 in Gelman et al [1]."
  },
  {
    "objectID": "useful_codes/mcmc.html#overview",
    "href": "useful_codes/mcmc.html#overview",
    "title": "Inference with MAP & MCMC",
    "section": "Overview",
    "text": "Overview\nMCMC is a general purpose method based on drawing samples \\(\\boldsymbol{\\theta}\\) from its prior \\(p \\left( \\boldsymbol{\\theta} \\right)\\), and correcting those draws to better approximate a target distribution \\(p \\left( \\boldsymbol{\\theta} | \\mathbf{t} \\right)\\). MCMC sampling is typically carried out when it is impossible or computationally intractable to directly sample from the posterior distribution of the hyperparameters \\(p \\left( \\boldsymbol{\\theta} | \\mathbf{t} \\right)\\).\nNote that this sampling is done sequentially, although one can have parallel chains running. It is called a Markov chain, as each sample depends only on the sample drawn immediately before. The method is remarkably successful, as target distributions are improved at each simulation step, enabling it to converge to a target distribution.\nThe central objective is to create a Markov process whose stationary distribution is the required hyperparameter posterior \\(p \\left( \\boldsymbol{\\theta} | \\mathbf{t} \\right)\\), and run the chains for sufficient length so as to converge to this distribution. A given set of independent sequences \\(\\boldsymbol{\\theta}_1, \\boldsymbol{\\theta}_2, \\ldots \\boldsymbol{\\theta}_{N}\\) is produced by sampling randomly from a prior, and for each \\(t=1, \\ldots, N\\), drawing \\(\\boldsymbol{\\theta}_{t}\\) from a transition distribution that only depends on the prior draw \\(\\boldsymbol{\\theta}_{t-1}\\).\nIntuitively, Metropolis converges to the target distribution if:\n\nThe Markov chain is aperiodic, not transient, and can reach any state from any other state (it is irreducible).\nThe stationary distribution is the target distribution.\n\nThe Metropolis algorithm is a general term for a family of Markov chain simulation methods that are useful for sampling from posterior distributions. 
The main steps are captured below:\n\nFrom the prior \\(p \\left( \\boldsymbol{\\theta} \\right)\\) draw a random sample \\(\\boldsymbol{\\theta}_{0}\\). Ensure that \\(p \\left( \\boldsymbol{\\theta}_{0} | \\mathbf{t} \\right) &gt; 0\\).\nFor each iterate \\(t = 1, 2, \\ldots\\) of the chain\n\nSample \\(\\boldsymbol{\\theta}_{\\ast}\\) from a proposal (or jumping) distribution, i.e., \\[\n  \\boldsymbol{\\theta}_{\\ast} \\sim J_{t} \\left(   \\boldsymbol{\\theta}_{\\ast} | \\boldsymbol{\\theta}_{t-1}  \\right)\n  \\]\nCalculate the ratio of the densities: \\[\n  r = \\frac{p \\left( \\boldsymbol{\\theta}_{\\ast} | \\mathbf{t} \\right) }{p \\left( \\boldsymbol{\\theta}_{t-1} | \\mathbf{t} \\right)}\n  \\]\nSet \\[\n\\boldsymbol{\\theta}_{t}=\\begin{cases}\n\\boldsymbol{\\theta}_{\\ast} & \\text{with probability} \\; min \\left[ r, 1 \\right] \\\\\n\\boldsymbol{\\theta}_{t-1} & \\text{otherwise}\n\\end{cases}\n\\]\n\n\nOne can think of the transition distribution here as being a mixture between a point mass \\(\\boldsymbol{\\theta}_{t} = \\boldsymbol{\\theta}_{t-1}\\) and a weighted analogue of the proposal distribution.\n\nBut intuitively, why does this work?\nConsider starting at time \\(t-1\\). Starting with a draw from the target distribution, \\(\\boldsymbol{\\theta}_{t-1} \\sim p \\left(\\boldsymbol{\\theta} | \\mathbf{t} \\right)\\), let us consider two possible points, \\(\\boldsymbol{\\theta}_{p}\\) and \\(\\boldsymbol{\\theta}_{q}\\). Let us assume that \\(p \\left( \\boldsymbol{\\theta}_{q} | \\mathbf{t} \\right) \\geq p \\left( \\boldsymbol{\\theta}_{p} | \\mathbf{t} \\right)\\). 
Additionally, assume a symmetric jump distribution, \\(J_{t} \\left( \\boldsymbol{\\theta}_q | \\boldsymbol{\\theta}_p \\right) = J_{t} \\left( \\boldsymbol{\\theta}_p | \\boldsymbol{\\theta}_q \\right)\\). A move from \\(\\boldsymbol{\\theta}_{p}\\) to \\(\\boldsymbol{\\theta}_{q}\\) is always accepted, so the joint density of this transition is\n\\[\np \\left( \\boldsymbol{\\theta}_{t-1} = \\boldsymbol{\\theta}_{p}, \\boldsymbol{\\theta}_{t} = \\boldsymbol{\\theta}_{q} \\right) = p \\left( \\boldsymbol{\\theta}_{p} | \\mathbf{t} \\right) J_{t} \\left( \\boldsymbol{\\theta}_q | \\boldsymbol{\\theta}_p \\right).\n\\]\nThe reverse move from \\(\\boldsymbol{\\theta}_{q}\\) to \\(\\boldsymbol{\\theta}_{p}\\) is accepted with probability \\(p \\left( \\boldsymbol{\\theta}_{p} | \\mathbf{t} \\right) / p \\left( \\boldsymbol{\\theta}_{q} | \\mathbf{t} \\right)\\), so\n\\[\np \\left( \\boldsymbol{\\theta}_{t-1} = \\boldsymbol{\\theta}_{q}, \\boldsymbol{\\theta}_{t} = \\boldsymbol{\\theta}_{p} \\right) = p \\left( \\boldsymbol{\\theta}_{q} | \\mathbf{t} \\right) J_{t} \\left( \\boldsymbol{\\theta}_p | \\boldsymbol{\\theta}_q \\right) \\frac{p \\left( \\boldsymbol{\\theta}_{p} | \\mathbf{t} \\right)}{p \\left( \\boldsymbol{\\theta}_{q} | \\mathbf{t} \\right)} = p \\left( \\boldsymbol{\\theta}_{p} | \\mathbf{t} \\right) J_{t} \\left( \\boldsymbol{\\theta}_p | \\boldsymbol{\\theta}_q \\right),\n\\]\nwhich is the same expression because the jump distribution is symmetric. As the joint distribution is symmetric, \\(\\boldsymbol{\\theta}_{t}\\) and \\(\\boldsymbol{\\theta}_{t-1}\\) have the same marginal distribution; as a result, \\(p\\left( \\boldsymbol{\\theta} | \\mathbf{t} \\right)\\) is the stationary distribution of the Markov chain.\n\n\nDemonstration\nWe shall now visualize what this looks like for the case where our target (posterior) density is a multivariate normal of the form\n\\[\n\\mathcal{N}\\left( \\left[\\begin{array}{c}\n2\\\\\n3\n\\end{array}\\right],\\left[\\begin{array}{cc}\n1.5 & 0.8 \\sqrt{1.5 \\times 2.2}\\\\\n0.8 \\sqrt{1.5 \\times 2.2} & 2.2\n\\end{array}\\right] \\right)\n\\]\nand the proposal distributions are univariate normals centred at the current state with a standard deviation of \\(0.2\\).\n\n\nCode\nimport numpy as np\nfrom scipy.stats import multivariate_normal\nimport matplotlib.pyplot as plt\n\n# Target density (correlated bivariate normal)\ndef target_density(x, y):\n    rho = 0.8\n    var1 = 1.5\n    var2 = 2.2\n    off_diag = rho * np.sqrt(var1 * var2)\n    cov = np.array([[var1, off_diag],\n                    [off_diag, var2]])\n    return multivariate_normal.pdf([x, y], mean=[2, 3], cov=cov)\n\n# Metropolis sampling\ndef metropolis_sampler(n_iterations):\n    # deliberately start far from the mode of the target\n    x_current = -5.0\n    y_current = -5.0\n    samples = []\n\n    for _ in range(n_iterations):\n        # symmetric random-walk proposal centred at the current state\n        x_new = np.random.normal(x_current, 0.2)\n        y_new = np.random.normal(y_current, 0.2)\n\n        alpha = min(1, (target_density(x_new, y_new) ) / (target_density(x_current, y_current)))  \n\n        u = np.random.uniform()\n        if u &lt;= alpha:\n            x_current = x_new\n            y_current = y_new\n\n        samples.append((x_current, y_current))\n\n    return samples\n\n\nsamples = metropolis_sampler(5000) \nsamples = np.array(samples)\n\n\n\n\nCode\nfig = plt.figure(figsize=(10,3))\nplt.subplot(121)\nplt.plot(samples[:,0], '-', color='orangered')\nplt.xlabel('Number of iterates')\nplt.ylabel('$x$')\nplt.subplot(122)\nplt.plot(samples[:,1], '-', color='orangered')\nplt.xlabel('Number of iterates')\nplt.ylabel('$y$')\nplt.show()\n\n\n\n\n\n\n\nCode\nfig = plt.figure(figsize=(6,5))\nplt.plot(samples[:,0], samples[:,1], '.-', alpha=0.3, color='orangered')\nplt.xlabel('$x$')\nplt.ylabel('$y$')\nplt.title('Metropolis sampler')\nplt.show()\n\n\n\n\n\nIn a Metropolis sampler, the proposal distribution is symmetric. Thus, in the acceptance ratio, \\(r\\), the proposal density does not make an appearance as it cancels out from both the numerator and denominator. In Metropolis-Hastings, the proposal distribution need not be symmetric, i.e., in general \\(J_{t} \\left( \\boldsymbol{\\theta}_{\\ast}  | \\boldsymbol{\\theta}_{t-1} \\right) \\neq J_{t} \\left( \\boldsymbol{\\theta}_{t-1} | \\boldsymbol{\\theta}_{\\ast} \\right)\\). Therefore, we need to explicitly calculate the probability of moving from the current state to the proposed state and the reverse probability.\n\\[\nr = \\frac{p \\left( \\boldsymbol{\\theta}_{\\ast} | \\mathbf{t} \\right) / J_{t} \\left( \\boldsymbol{\\theta}_{\\ast}  | \\boldsymbol{\\theta}_{t-1} \\right) }{p \\left( \\boldsymbol{\\theta}_{t-1} | \\mathbf{t} \\right) / J_{t} \\left(   \\boldsymbol{\\theta}_{t-1} | \\boldsymbol{\\theta}_{\\ast}  \\right)  }\n\\]\nIn code, this amounts to including terms such as proposal_density(x_current, y_current, x_new, y_new) and proposal_density(x_new, y_new, x_current, y_current) in the acceptance ratio.\nAllowing asymmetric jumping rules can be helpful in increasing the speed of the sampler. We shall now demonstrate the utility of Metropolis-Hastings for determining the posterior distribution of hyperparameters in a Gaussian process model. 
For this demonstration, we will be making use of pymc; the model shown below has been adapted from this pymc tutorial.\n\n\nA Gaussian Process example\n\n\nCode\nimport numpy as np\nimport pymc as pm\nimport matplotlib.pyplot as plt\nimport arviz as az\n\n\n\n\nCode\n# Training data\nn = 80 \nX = np.linspace(0, 10, n)[:, None]  \n\n# Define the true covariance function and its parameters\nell_true = 1.0\neta_true = 3.0\ncov_func = eta_true**2 * pm.gp.cov.Matern52(1, ell_true)\nmean_func = pm.gp.mean.Zero()\nf_true = np.random.multivariate_normal(\n    mean_func(X).eval(), cov_func(X).eval() + 1e-8 * np.eye(n), 1\n).flatten()\nsigma_true = 2.0\n\n# True signal is corrupted by random noise\ny = f_true + sigma_true * np.random.randn(n)\n\n## Plot the data and the unobserved latent function\nfig = plt.figure(figsize=(8, 5))\nax = fig.gca()\nax.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nax.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Data\")\nax.set_xlabel(\"X\")\nax.set_ylabel(\"The true f(x)\")\nplt.legend();\n\n\n\n\n\nWe shall use a Matern52 kernel that is parameterized by\n\\[\nk(x, x') =  \\eta^2 \\left(1 + \\frac{\\sqrt{5(x - x')^2}}{\\ell} +\n                   \\frac{5(x-x')^2}{3\\ell^2}\\right)\n                   \\mathrm{exp}\\left[ - \\frac{\\sqrt{5(x - x')^2}}{\\ell} \\right]\n\\]\nwhere the hyperparameters are \\(\\eta\\) and \\(\\ell\\). Additionally, we will assume that the data noise is given by a Half Cauchy distribution with shape parameter \\(\\sigma\\). 
For hyperparameter inference, we shall utilize both MCMC (via Metropolis) and MAP estimation.\n\n\nCode\nwith pm.Model() as model:\n    ell = pm.Gamma(\"ell\", alpha=2, beta=1)\n    eta = pm.HalfCauchy(\"eta\", beta=5)\n\n    cov = eta**2 * pm.gp.cov.Matern52(1, ell)\n    gp = pm.gp.Marginal(cov_func=cov)\n\n    sigma = pm.HalfCauchy(\"sigma\", beta=5)\n    y_ = gp.marginal_likelihood(\"y\", X=X, y=y, sigma=sigma)\n\n    # MCMC with a Metropolis step (uses a Normal proposal by default)\n    marginal_post = pm.sample(draws=5000, step=pm.Metropolis(), chains=1)\n\n    # MAP estimate of the same hyperparameters\n    map_post = pm.find_MAP()\n\n\nSequential sampling (1 chains in 1 job)\nCompoundStep\n&gt;Metropolis: [ell]\n&gt;Metropolis: [eta]\n&gt;Metropolis: [sigma]\nSampling 1 chain for 1_000 tune and 5_000 draw iterations (1_000 + 5_000 draws total) took 11 seconds.\nOnly one chain was sampled, this makes it impossible to run some convergence checks\n\n\nWe can now contrast the optimized MAP values with those from the MCMC chains.\n\n\nCode\ndef plot_(param):\n    # Flatten the (chain, draw) samples of this hyperparameter into one trace\n    traces = marginal_post.posterior[param].values.flatten()\n    fig = plt.figure(figsize=(10,3))\n    plt.subplot(121)\n    plt.plot(traces, color='orangered', label='MCMC')\n    plt.axhline(map_post[param], label='MAP', lw=2)\n    plt.xlabel('MCMC trace')\n    plt.title('MCMC samples')\n    plt.ylabel(param)\n    plt.legend()\n    plt.subplot(122)\n    plt.hist(traces, 50, color='orangered', edgecolor='w', label='MCMC', density=True)\n    plt.axvline(map_post[param], label='MAP', lw=2)\n    plt.xlabel(param)\n    plt.title('Posterior density from MCMC')\n    plt.legend()\n    plt.show()\n    \nplot_('ell')\nplot_('eta')\nplot_('sigma')\n\n\n\n\n\n\n\n\n\n\n\nCare should be taken when plotting the posterior distribution from MCMC 
chains. Plotting the posterior distribution by averaging across the iterates is incorrect, and will almost surely yield a result with an underestimated uncertainty. The correct approach is to sample across the posterior, and average those samples for plotting. This is clarified in the code below.\n\n\nCode\n# Test values\nX_new = np.linspace(0, 20, 600)[:, None]\n\n# add the GP conditional to the model, given the new X values\nwith model:\n    f_pred = gp.conditional(\"f_pred\", X_new)\n\nwith model:\n    pred_samples = pm.sample_posterior_predictive(\n        marginal_post.sel(draw=slice(0, 50)), var_names=[\"f_pred\"] # the first 51 draws of the chain (label-based slices are inclusive)\n    )\n\n\nSampling: [f_pred]\n\n\nFirst, we shall plot the posterior obtained from MCMC.\n\n\nCode\n# plot the results\nfig = plt.figure(figsize=(12, 5))\nax = fig.gca()\n\n# plot the samples from the gp posterior with samples and shading\nfrom pymc.gp.util import plot_gp_dist\n\nf_pred_samples = az.extract(pred_samples, group=\"posterior_predictive\", var_names=[\"f_pred\"])\nplot_gp_dist(ax, samples=f_pred_samples.T, x=X_new)\n\n# plot the data and the true latent function\nplt.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nplt.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Observed data\")\n\n# axis labels and title\nplt.xlabel(\"X\")\nplt.ylim([-5, 8])\nplt.title(\"MCMC Result\")\nplt.legend();\n\n\n\n\n\nAnd now for the MAP value:\n\n\nCode\nwith model:\n    mu, covar = gp.predict(X_new, point=map_post, diag=False)\n\n\n\n\nCode\nfrom scipy.stats import multivariate_normal\n\n# draw 50 functions from the predictive distribution at the MAP hyperparameters\npost_map_dist = multivariate_normal(mu, covar)\nmap_samples = post_map_dist.rvs(50)\n\nfig = plt.figure(figsize=(12, 5))\nax = fig.gca()\nplt.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nplt.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Observed data\")\nplot_gp_dist(ax, samples=map_samples, x=X_new)\nplt.legend()\nplt.xlabel(\"X\")\nplt.ylim([-5, 8])\nplt.title(\"MAP result\")\nplt.show()\n\n\n\n\n\nIt is clear that the MAP result quotes a smaller uncertainty than the MCMC result.\n\n\nNotes on this iterative process\n\nIf the number of iterations is insufficient, then the chains may be unrepresentative of the target distribution.\nFor the same number of draws, simulations that originate from correlated draws are less precise than independent ones.\nAs can be observed even above, early iterations should be discarded as the chains are warming up.\nIn practice, once the simulation has converged, one need not store the entire chain; it suffices to store every \\(k\\)-th iterate, with \\(k\\) chosen such that no more than a couple thousand iterates are kept. This is called thinning.\nIt is common practice to run multiple simulations and ensure that the variance across sequences is comparable to (and not much larger than) the variance within a sequence.\n\nTo identify convergence one needs to check for stationarity and mixing. The simplest recipe to check both is to split the chains into two halves after discarding the warm-up samples.\n\nThe R-hat value\nFor each hyperparameter, one can compute \\(\\beta\\) and \\(\\omega\\): the between-sequence and within-sequence variances of the MCMC chains. Consider \\(m\\) chains and \\(n\\) iterations in each chain. Define an iterate to be \\(\\theta_{ij}\\) where \\(i=1, \\ldots, n\\) and \\(j=1, \\ldots, m\\). 
Define\n\\[\n\\bar{\\theta}_{j} = \\frac{1}{n} \\sum_{i=1}^{n} \\theta_{ij}\n\\]\nand\n\\[\n\\hat{\\theta} = \\frac{1}{m} \\sum_{j=1}^{m} \\bar{\\theta}_{j}\n\\]\nthen \\[\n\\begin{aligned}\n\\beta & = \\frac{n}{m-1} \\sum_{j=1}^{m} \\left[ \\bar{\\theta}_{j} - \\hat{\\theta} \\right]^2.\n\\end{aligned}\n\\]\nFor \\(\\omega\\), the within-sequence variation, we have\n\\[\n\\omega = \\frac{1}{m} \\sum_{j=1}^{m} \\kappa_j^2, \\; \\; \\; \\textrm{where} \\; \\; \\; \\kappa_{j}^2 = \\frac{1}{n-1} \\sum_{i=1}^{n} \\left( \\theta_{ij} - \\bar{\\theta}_{j} \\right)^2\n\\]\nThe R-hat value is given by\n\\[\n\\hat{R} = \\left( \\frac{1}{\\omega}  \\left( \\frac{n-1}{n} \\omega + \\frac{1}{n} \\beta \\right) \\right)^{1/2}\n\\]\nwhich should reduce to \\(1\\) as \\(n \\rightarrow \\infty\\). This value is also called the Gelman-Rubin statistic.\n\n\n\nBeyond Metropolis\nIt should be clear that the MCMC implementation above (i.e., Metropolis and Metropolis-Hastings) represents one strategy. There are numerous other sampling strategies\n\nGibbs sampler\nHamiltonian Monte Carlo (Duane et al. 1987)\nNo-U-Turn sampler (Hoffman and Gelman 2014)\nRiemannian updating (Girolami and Calderhead 2011)"
  },
  {
    "objectID": "useful_codes/sparse.html",
    "href": "useful_codes/sparse.html",
    "title": "Sparse Gaussian Processes",
    "section": "",
    "text": "This rather brief note outlines the differences between two important sparse Gaussian process approximation methods. Following the lecture slides, we will focus on the Deterministic Training Conditional (DTC) method and the Fully Independent Training Conditional (FITC) method. These are both examples of inducing point methods. Following our discussion of Candela & Rasmussen in Lecture, it is worth emphasizing that both these methods can be expressed in terms of amendments to the joint (training and predictive) Gaussian process prior.\n\nDTC\nDTC assumes that the posterior distribution of the Gaussian Process, when conditioned on the inducing points, is independent of the training data, i.e., there is conditional independence between them. It uses this to derive a closed-form (only when the likelihood is Gaussian too) expression for the approximate posterior. DTC is relatively simple to implement and is computationally efficient. Consider the pseudo-code below that contrasts the standard Gaussian process formalism with the DTC approach for building the conditional distribution.\n# Standard formalism\n\nKxx = kern(X,X)\nKxs = kern(X, Xnew)\nKss = kern(Xnew, Xnew)\n\nSigma = sigma**2 * eye(N)\nL = cholesky(Kxx + Sigma)\nA = solve_lower(L, Kxs)\nv = solve_lower(L, y)\nmu = A.T @ v\ncov = Kss - A.T @ A    \n# DTC approach\n\nKuu = kern(Xu, Xu)\nKuf = kern(Xu, X)\nKss = kern(Xnew, Xnew)\n\nLuu = cholesky(Kuu)\nA = solve_lower(Luu, Kuf)\nQffd = sum(A**2, 0)\nLambda = sigma**2 * eye(Qffd.shape[0])\nA_l = A / Lambda\nAs = solve_lower(Luu, Kus)\nL_B = cholesky(eye(Xu.shape[0]) + A_l @ A.T)\nmu = As.T @ solve_upper(L_B.T, c)\ncov = Kss - As.T @ As + C.T @ C\nOne weakness associated with DTC is that its independence assumption can lead to overconfident predictions, particularly when away from the incuding points.\n\n\nFITC\nLike DTC, FITC also assumes conditional independence between the GP values at the inducing points and the remaining training data. 
However, it doesn’t completely neglect the correlations; it introduces a diagonal correction term to approximate the covariance matrix. In pseudocode, the amendment to DTC is given below:\nKffd = kern(X, X, diag=True) # only the diagonal terms\nLambda = Kffd - Qffd + sigma**2\n\nIt should be clear that even with FITC, there will be accuracy limitations when compared with a standard Gaussian process model.\n\n\nExample\nIn the code blocks below, we demonstrate using DTC and FITC with pymc. To swap between the two, simply change the approx argument (to 'FITC') in this line of code:\ngp = pm.gp.MarginalApprox(cov_func=cov, approx='DTC')\n\n\nCode\nimport pymc as pm\nimport matplotlib.pyplot as plt\nimport arviz as az\nimport numpy as np\nfrom scipy.stats import multivariate_normal\nfrom pymc.gp.util import plot_gp_dist\n\n\n\n\nCode\n# Training data\nn = 2000 \nX = np.linspace(0, 10, n)[:, None]  \n\nnp.random.seed(100)\n\n# Define the true covariance function and its parameters\nell_true = 1.0\neta_true = 3.0\ncov_func = eta_true**2 * pm.gp.cov.Matern52(1, ell_true)\nmean_func = pm.gp.mean.Zero()\nf_true = np.random.multivariate_normal(\n    mean_func(X).eval(), cov_func(X).eval() + 1e-8 * np.eye(n), 1\n).flatten()\nsigma_true = 2.0\n\n# True signal is corrupted by random noise\ny = f_true + sigma_true * np.random.randn(n)\n\n## Plot the data and the unobserved latent function\nfig = plt.figure(figsize=(8, 5))\nax = fig.gca()\nax.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nax.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Data\")\nax.set_xlabel(\"X\")\nax.set_ylabel(\"The true f(x)\")\nplt.legend();\n\n\n\n\n\nWe shall use a Matern52 kernel that is parameterized by\n\\[\nk(x, x') =  \\eta^2 \\left(1 + \\frac{\\sqrt{5(x - x')^2}}{\\ell} +\n                   \\frac{5(x-x')^2}{3\\ell^2}\\right)\n                   \\mathrm{exp}\\left[ - \\frac{\\sqrt{5(x - x')^2}}{\\ell} \\right]\n\\]\nwhere the hyperparameters are \\(\\eta\\) and \\(\\ell\\). 
Additionally, we place a Half-Cauchy prior on the standard deviation of the data noise, \\(\\sigma\\). For hyperparameter inference, we shall utilize MAP.\n\n\nCode\nwith pm.Model() as model:\n    ell = pm.Gamma(\"ell\", alpha=2, beta=1)\n    eta = pm.HalfCauchy(\"eta\", beta=5)\n\n    cov = eta**2 * pm.gp.cov.Matern52(1, ell)\n    gp = pm.gp.MarginalApprox(cov_func=cov, approx='DTC')\n    \n    # Fixed inducing points!\n    Xu = np.linspace(0, 10, 5).reshape(5,1)\n\n    sigma = pm.HalfCauchy(\"sigma\", beta=5)\n    y_ = gp.marginal_likelihood(\"y\", X=X, y=y, Xu=Xu , sigma=sigma)\n\n    # MAP estimate of the hyperparameters\n    map_post = pm.find_MAP()\n\n\nNow let’s plot the posterior predictive distribution!\n\n\nCode\n# Test values\nX_new = np.linspace(0, 20, 600)[:, None]\n\nwith model:\n    mu, covar = gp.predict(X_new, point=map_post, diag=False)\n\n\n\n\nCode\npost_map_dist = multivariate_normal(mu, covar)\nmap_samples = post_map_dist.rvs(50)\nXu_final = Xu\nfig = plt.figure(figsize=(12, 5))\nax = fig.gca()\nplt.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nplt.plot(Xu_final, Xu_final*0 + 10, 'x', ms=12, color='limegreen', label='Inducing points')\nplt.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Observed data\")\nplot_gp_dist(ax, samples=map_samples, x=X_new )\nplt.legend(loc='lower right')\nplt.xlabel(\"X\")\nplt.xlim([-1, 20])\nplt.ylim([-8, 12])\nplt.savefig('fixed.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\nWe shall repeat the same exercise, but now assign a flat prior to the inducing point locations and let those be optimized as well.\n\n\nCode\nwith pm.Model() as model:\n    ell = pm.Gamma(\"ell\", alpha=2, beta=1)\n    eta = pm.HalfCauchy(\"eta\", beta=5)\n\n    cov = eta**2 * pm.gp.cov.Matern52(1, ell)\n    gp = pm.gp.MarginalApprox(cov_func=cov, approx='DTC')\n    \n    # Fixed inducing 
points!\n    Xu_init = np.linspace(0, 10, 5).reshape(5,1)\n    \n    # Varying inducing points!\n    Xu = pm.Flat(\"Xu_i\", initval=Xu_init, shape=(5,1))\n\n    sigma = pm.HalfCauchy(\"sigma\", beta=5)\n    y_ = gp.marginal_likelihood(\"y\", X=X, y=y, Xu=Xu , sigma=sigma)\n\n    # MAP estimate of the hyperparameters and the inducing point locations\n    map_post = pm.find_MAP()\n\nwith model:\n    mu, covar = gp.predict(X_new, point=map_post, diag=False)\n\n\n\n\nCode\npost_map_dist = multivariate_normal(mu, covar)\nmap_samples = post_map_dist.rvs(50)\nXu_final = map_post['Xu_i']\n\nfig = plt.figure(figsize=(12, 5))\nax = fig.gca()\nplt.plot(X, f_true, \"dodgerblue\", lw=3, label=\"True f\")\nplt.plot(Xu_final, Xu_final*0 + 10, 'x', ms=12, color='limegreen', label='Inducing points')\nplt.plot(X, y, \"ok\", ms=3, alpha=0.5, label=\"Observed data\")\nplot_gp_dist(ax, samples=map_samples, x=X_new )\nplt.legend(loc='lower right')\nplt.xlabel(\"X\")\nplt.xlim([-1, 20])\nplt.ylim([-8, 12])\nplt.savefig('opt.png', dpi=150, bbox_inches='tight', transparent=True)\nplt.show()\n\n\n\n\n\n\n\nLocation of the inducing points\nIn the example above, we note a considerable degree of clustering in the inducing point locations. This is not unexpected. Inducing points are meant to act as a summary of the dataset, and the optimization process aims to place them in regions where either there is less data or where there is significant variation in the function values. Note that in the case above, it happens to be where the function is rapidly changing."
  },
  {
    "objectID": "useful_codes/div.html",
    "href": "useful_codes/div.html",
    "title": "Divergence free",
    "section": "",
    "text": "Code\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport matplotlib\nfrom scipy import stats\nfrom copy import deepcopy\nimport pymc as pm\nimport pytensor\nimport pytensor.tensor as tt\nfrom pymc.gp.cov import Covariance\nfrom functools import partial\nfrom pytensor.tensor.linalg import cholesky, eigh, solve_triangular\nfrom scipy.stats import multivariate_normal\nsolve_lower = partial(solve_triangular, lower=True)\nsolve_upper = partial(solve_triangular, lower=False)\n\n\nWARNING (pytensor.tensor.blas): Using NumPy C-API based implementation for BLAS functions.\nCode\nm = 40\nn = 40\ns = np.linspace(0, 1, m) * 4\nt = np.linspace(0, 1, n) * 4\n[S, T] = np.meshgrid(s, t)\nSS = S.flatten()\nTT = T.flatten()\nN = SS.shape[0]\nXpred = np.hstack([TT.reshape(N,1), SS.reshape(N,1)])\nCode\nnp.random.seed(seed=10)\nnum_random_points = 7\n#X_init = np.random.rand(num_random_points, 2)\n#X_init[:,0] *= 4\n#X_init[:,1] *= 4\nX_init = np.array([[3, 4], \n                   [2, 1],\n                   [0, 3.5], \n                   [2, 2], \n                   [3, 0], \n                   [1, 1],\n                   [1.5, 3]]).reshape(num_random_points,2)\nvel_x = -np.cos(X_init[:,0]) * np.sin(X_init[:,1]) \nvel_y =  np.sin(X_init[:,0]) * np.cos(X_init[:,1])\nsigma_noise = 1e-6\nCode\nvel_x_truth = -np.cos(Xpred[:,0]) * np.sin(Xpred[:,1]) \nvel_y_truth = np.sin(Xpred[:,0]) * np.cos(Xpred[:,1])\nvel_mag_truth = np.sqrt(vel_x_truth**2 + vel_y_truth**2)\nCode\nplt.scatter(X_init[:,0], X_init[:,1], c='w', s=40, lw=1, edgecolor='k')\nplt.quiver(X_init[:,0], X_init[:,1], vel_x, vel_y)\nplt.show()\nCode\nX_init = np.vstack([X_init, X_init])\nprint(X_init)\n\n\n[[3.  4. ]\n [2.  1. ]\n [0.  3.5]\n [2.  2. ]\n [3.  0. ]\n [1.  1. ]\n [1.5 3. ]\n [3.  4. ]\n [2.  1. ]\n [0.  3.5]\n [2.  2. ]\n [3.  0. ]\n [1.  1. ]\n [1.5 3. 
]]\nCode\ny_init = np.vstack([vel_x.reshape(num_random_points,1), vel_y.reshape(num_random_points,1)]).flatten()\nCode\ny_init\n\n\narray([-0.74922879,  0.35017549,  0.35078323,  0.37840125,  0.        ,\n       -0.45464871, -0.00998243, -0.09224219,  0.4912955 , -0.        ,\n       -0.37840125,  0.14112001,  0.45464871, -0.98751255])"
  },
  {
    "objectID": "useful_codes/div.html#standard-approach-independent-gps-for-each-velocity-component.",
    "href": "useful_codes/div.html#standard-approach-independent-gps-for-each-velocity-component.",
    "title": "Divergence free",
    "section": "Standard approach – Independent GPs for each velocity component.",
    "text": "Standard approach – Independent GPs for each velocity component.\n\n\nCode\nwith pm.Model() as model2:\n    \n    sigma_f = pm.HalfNormal(\"sigma_f\", sigma=1)\n    l = pm.HalfNormal(\"l\", sigma=1.0)\n    cov = SquaredExp(2, sigma_f, l)\n    gp = pm.gp.Marginal(cov_func=cov)\n    y_ = gp.marginal_likelihood(\"y_\", X=X_init[0:num_random_points,:].reshape(num_random_points, 2), \\\n                                      y=y_init[0:num_random_points] - np.mean(y_init[0:num_random_points]), \\\n                                      noise=1e-4)\n\n\n/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pymc/gp/gp.py:56: FutureWarning: The 'noise' parameter has been been changed to 'sigma' in order to standardize the GP API and will be deprecated in future releases.\n  warnings.warn(_noise_deprecation_warning, FutureWarning)\n\n\n\n\nCode\nwith model2:\n    mp2 = pm.find_MAP()\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nCode\nwith model2:\n    post_mean2, post_covar2 = gp.predict(Xpred, point=mp2, diag=False)\n\n\n\n\nCode\nwith pm.Model() as model3:\n    \n    sigma_f = pm.HalfNormal(\"sigma_f\", sigma=1)\n    l = pm.HalfNormal(\"l\", sigma=1.0)\n    cov = SquaredExp(2, sigma_f, l)\n    gp = pm.gp.Marginal(cov_func=cov)\n    y_ = gp.marginal_likelihood(\"y_\", X=X_init[0:num_random_points,:].reshape(num_random_points, 2), \\\n                                      y=y_init[num_random_points:] - np.mean(y_init[num_random_points:]), \\\n                                      noise=1e-4)\n\n\n/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/site-packages/pymc/gp/gp.py:56: FutureWarning: The 'noise' parameter has been been changed to 'sigma' in order to standardize the GP API and will be deprecated in future releases.\n  warnings.warn(_noise_deprecation_warning, FutureWarning)\n\n\n\n\nCode\nwith model3:\n    mp3 = pm.find_MAP()\n    \nwith model3:\n    post_mean3, post_covar3 = gp.predict(Xpred, point=mp3, 
diag=False)\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nmu3, cov3 = gp.predictt(Xpred) post_mean3, post_covar3 = draw_values([mu3, cov3], point=mp3) post_mean3 += np.mean(y_init[num_random_points:])\n\n\nCode\nvelocity_x_mean_gp = post_mean2\nvelocity_y_mean_gp = post_mean3\nvelocity_mag_mean_gp = np.sqrt(velocity_x_mean_gp**2 + velocity_y_mean_gp**2 )\nvelocity_x_std_gp = np.sqrt(np.diag(post_covar2))\nvelocity_y_std_gp = np.sqrt(np.diag(post_covar3))\n\n\n\n\nCode\nnorm = matplotlib.colors.Normalize(vmin=np.min(velocity_mag_mean_gp),\\\n                                    vmax=np.max(velocity_mag_mean_gp))\n\nfig = plt.figure(figsize=(15,4))\nax1 = plt.subplot(131)\nc = ax1.contourf(T, S, velocity_mag_mean_gp.reshape(n, m), 50, cmap=plt.cm.turbo, norm=norm)\nplt.quiver(Xpred[:,0], Xpred[:,1], velocity_x_mean_gp, velocity_y_mean_gp, headwidth=5, scale=10)\nplt.scatter(X_init[0:num_random_points,0], X_init[0:num_random_points,1], c='w', s=70, lw=1, edgecolor='k')\ncbar = plt.colorbar(c, pad=0.05, shrink=0.6)\ncbar.ax.tick_params(labelsize=13)\nax1.set_yticklabels([])\nax1.set_xticklabels([])\nplt.xlabel(r'$x_1$')\nplt.ylabel(r'$x_2$')\n#ax1.set_xlabel('(e)')\nax1.set_title('Posterior velocity mag. 
with vectors', fontsize=13)\n\n\nnorm = matplotlib.colors.Normalize(vmin=np.min(velocity_x_std_gp),\\\n                                    vmax=np.max(velocity_x_std_gp))\n\nax2 = plt.subplot(132)\nc = ax2.contourf(T, S, velocity_x_std_gp.reshape(n, m), 50, cmap=plt.cm.turbo, norm=norm)\nplt.scatter(X_init[0:num_random_points,0], X_init[0:num_random_points,1], c='w', s=70, lw=1, edgecolor='k')\ncbar = plt.colorbar(c, pad=0.05, shrink=0.6)\ncbar.ax.tick_params(labelsize=13)\nax2.set_yticklabels([])\nax2.set_xticklabels([])\nplt.xlabel(r'$x_1$')\nplt.ylabel(r'$x_2$')\n#ax2.set_xlabel('(f)')\nax2.set_title('Velocity-x std dev.', fontsize=13)\n\nnorm = matplotlib.colors.Normalize(vmin=np.min(velocity_y_std_gp),\\\n                                    vmax=np.max(velocity_y_std_gp))\n\nax3 = plt.subplot(133)\nc = ax3.contourf(T, S, velocity_y_std_gp.reshape(n, m), 50, cmap=plt.cm.turbo, norm=norm)\nplt.scatter(X_init[0:num_random_points,0], X_init[0:num_random_points,1], c='w', s=70, lw=1, edgecolor='k')\ncbar = plt.colorbar(c, pad=0.05, shrink=0.6)\ncbar.ax.tick_params(labelsize=13)\nax3.set_yticklabels([])\nplt.xlabel(r'$x_1$')\nplt.ylabel(r'$x_2$')\nax3.set_xticklabels([])\n#ax3.set_xlabel('(g)')\nax3.set_title('Velocity-y std dev.', fontsize=13)\nplt.savefig('velocity_gp.png', dpi=170, bbox_inches='tight', transparent=True)\nplt.show()"
  },
  {
    "objectID": "midterm/midterm_2024_solutions.html",
    "href": "midterm/midterm_2024_solutions.html",
    "title": "Midterm",
    "section": "",
    "text": "Please do not attempt to copy from, or discuss the contents of this paper with, any of your classmates during the course of this midterm. Any such attempt, will be viewed as a violation of Georgia Tech’s Honor Code, and this will result in a zero-point grade. Also note that you can, on your own accord, do poorly on this midterm and still get an A in the course. However, if you violate the Honor code, it is likely that the repercussions to your overall grade will be more severe.\nYou are not permitted to post any questions related to this midterm on the Ed Discussions page, Stackoverflow, any large language model (e.g., Chat GPT, Bard, etc.), social media, or any other discussion forum. Should there be any questions, please email me directly. The work you submit must be your own entirely.\nIf you have utilized any resource (e.g., textbook, Wikipedia, lecture notes from another source) you must clearly state the source is and where you have used it. You must also do the same for the AE8803 lecture notes, e.g., when using the definition of the Binomial distribution, please state which lecture / slide number / and where you used it.\nFor the questions below, the integer value in the brackets indicates the number of points assigned to a particular question."
  },
  {
    "objectID": "midterm/midterm_2024_solutions.html#instructions",
    "href": "midterm/midterm_2024_solutions.html#instructions",
    "title": "Midterm",
    "section": "",
    "text": "Please do not attempt to copy from, or discuss the contents of this paper with, any of your classmates during the course of this midterm. Any such attempt, will be viewed as a violation of Georgia Tech’s Honor Code, and this will result in a zero-point grade. Also note that you can, on your own accord, do poorly on this midterm and still get an A in the course. However, if you violate the Honor code, it is likely that the repercussions to your overall grade will be more severe.\nYou are not permitted to post any questions related to this midterm on the Ed Discussions page, Stackoverflow, any large language model (e.g., Chat GPT, Bard, etc.), social media, or any other discussion forum. Should there be any questions, please email me directly. The work you submit must be your own entirely.\nIf you have utilized any resource (e.g., textbook, Wikipedia, lecture notes from another source) you must clearly state the source is and where you have used it. You must also do the same for the AE8803 lecture notes, e.g., when using the definition of the Binomial distribution, please state which lecture / slide number / and where you used it.\nFor the questions below, the integer value in the brackets indicates the number of points assigned to a particular question."
  },
  {
    "objectID": "midterm/midterm_2024_solutions.html#problem-1",
    "href": "midterm/midterm_2024_solutions.html#problem-1",
    "title": "Midterm",
    "section": "Problem 1",
    "text": "Problem 1\nA drawer contains red socks and black socks. When two socks are drawn at random, the probability that both are red is \\(0.5\\).\n\nHow small can the number of socks in the drawer be?\nHow small if the number of black socks is even?\n\nTotal points: [6]\n This is a problem from Frederick Mosteller’s Fifty Challenging Problems in Probability. There are many ways to solve this, including using numbers and proceeding arithmetically. Here, we pursue and algebraic approach.\nLet there be \\(r\\) red socks and \\(b\\) black socks. The probability of the first sock’s being red is \\(r/(r + b)\\). Note that if the first sock is red, then the probability of the second sock being red is now \\((r-1)/(r+b-1)\\), i.e., since a red sock has been removed.\nAs we require the probability of both socks to be red to be \\(0.5\\), we can write\n\\[\n\\frac{r}{r + b} \\times \\frac{r-1}{r+b-1} = 0.5\n\\]\nNote that for \\(b &gt; 0\\)\n\\[\n\\frac{r}{r+b} &gt; \\frac{r-1}{r+b-1} .\n\\]\nThus, we can create the inequalities\n\\[\n\\left( \\frac{r}{r+b} \\right)^2 &gt; 0.5 &gt; \\left( \\frac{r-1}{r+b-1} \\right)^2\n\\]\nTaking square roots, for \\(r&gt;1\\) we have\n\\[\n\\frac{r}{r+b} &gt; \\frac{1}{\\sqrt{2}} &gt; \\frac{r-1}{r+b-1}\n\\]\nFrom the first inequality, we obtain\n\\[\n\\begin{aligned}\nr & &gt; \\frac{1}{\\sqrt{2}} \\left( r + b \\right) \\\\\n& &gt; \\frac{1}{\\sqrt{2} - 1} b \\\\\n& = \\left( \\sqrt{2} + 1 \\right)b \\\\\n\\end{aligned}\n\\]\nand from the second\n\\[\n\\left( \\sqrt{2} + 1\\right)b &gt; r - 1\n\\]\nwhich leads to\n\\[\n\\left( \\sqrt{2} + 1 \\right) b + 1 &gt; r &gt; \\left( \\sqrt{2} + 1 \\right) b.\n\\]\nFor \\(b=1\\), \\(r\\) must be greater than 2.41 and less than 3.41, which means that \\(r=3\\), and for \\(r=3\\), \\(b=1\\). To confirm, see that\n\\[\np \\left( 2 \\; \\textrm{red socks} \\right) = \\frac{3}{4} \\times \\frac{2}{3} = \\frac{1}{2}.\n\\]\nSo the smallest number of socks is \\(4\\). 
Investigating even values of \\(b\\) leads to\n\n\\(b = 2\\): \\(r\\) is between 4.8 and 5.8, so the valid \\(r\\) is 5, and \\(p \\left( 2 \\; \\textrm{red socks} \\right) = \\frac{5(4)}{7(6)}\\neq \\frac{1}{2}\\)\n\\(b = 4\\): \\(r\\) is between 9.7 and 10.7, so the valid \\(r\\) is 10, and \\(p \\left( 2 \\; \\textrm{red socks} \\right) = \\frac{10(9)}{14(13)}\\neq \\frac{1}{2}\\)\n\\(b = 6\\): \\(r\\) is between 14.5 and 15.5, so the valid \\(r\\) is 15, and \\(p \\left( 2 \\; \\textrm{red socks} \\right) = \\frac{15(14)}{21(20)} = \\frac{1}{2}\\)\n\nThus, \\(21\\) socks is the smallest number when \\(b\\) is even."
  },
  {
    "objectID": "midterm/midterm_2024_solutions.html#problem-2",
    "href": "midterm/midterm_2024_solutions.html#problem-2",
    "title": "Midterm",
    "section": "Problem 2",
    "text": "Problem 2\nLet \\(X\\) and \\(Y\\) be independent random variables. Also let \\(Z=X + Y\\).\n\nUsing probability generating functions, show that if \\(X \\sim Binomial \\left(n, p \\right)\\) and \\(Y \\sim Binomial \\left(m, p\\right)\\), then \\(Z \\sim Binomial \\left(n+m, p \\right)\\). [4]\nAssume that \\(X\\) and \\(Y\\) are continuous and their probability density functions are \\(f_{X}\\left( x\\right)\\) and \\(f_{Y}\\left( y \\right)\\) respectively. By considering the cumulative density function of \\(Z\\) given \\(X\\), or otherwise, show that the following holds [4]:\n\n\\[\nf_{Z|X} \\left( z | x \\right) = f_{Y} \\left( z - x \\right)\n\\]\n\nDerive the conditional probability density function of \\(X\\), given that \\(Z=z\\), when \\(X \\sim \\mathcal{N} \\left(0, 1\\right)\\) and \\(Y \\sim \\mathcal{N} \\left(0, 1 \\right)\\). [4]\nCompute the expectation of \\(1/\\left(Z + 1 \\right)\\), when \\(X \\sim Poisson \\left( \\lambda \\right)\\) and \\(Y \\sim Poisson \\left( \\lambda \\right)\\) [4]. You may find this Wikipedia page useful. [4]\n\nTotal points: [16]\n\n\nNote that for solving this question, students would likely need to look-up what a probability generating function is. In lieu of not covering this in lecture, the grading for this particular question is lenient. However, as I did mention in class that a visit to Wikipedia’s entry on certain continuous and discrete distributions might be necessary, this question is within the curricula. 
For a Binomial distribution, you will find the right-hand pane on this Wikipedia website useful (replicated below).\n\n\n\n\nThe probability generating function of \\(X\\) is given by\n\\[\nq_{X} \\left( z \\right) = \\left( 1 - p + pz \\right)^{n}\n\\]\nand for \\(Y\\) we have\n\\[\nq_{Y} \\left( z \\right) = \\left( 1 - p + pz \\right)^{m}\n\\]\nSince \\(X\\) and \\(Y\\) are independent, the probability generating function of their sum is the product of their probability generating functions. Thus, for \\(Z\\) we have\n\\[\nq_{Z} \\left( z \\right) = q_{X+Y}\\left( z \\right) = q_{X} \\left( z \\right) q_{Y} \\left( z \\right) = \\left( 1 - p + pz \\right)^{n+m}\n\\]\nThis is the probability generating function of a Binomial distribution with parameters \\(\\left(n+m, p \\right)\\).\n\nUsing the cumulative distribution function, we have\n\n\\[\n\\begin{aligned}\nF_{Z | X}  \\left( z | x \\right) & = p \\left( Z \\leq z | X = x \\right) \\\\\n& = p \\left( X + Y \\leq z | X = x \\right) \\\\\n& = p \\left( Y \\leq z - x | X = x \\right) \\\\\n& = p \\left( Y \\leq z - x \\right) \\\\\n& = F_{Y} \\left(z - x \\right)\n\\end{aligned}\n\\]\nwhere the fourth line uses the independence of \\(X\\) and \\(Y\\). Taking the derivative with respect to \\(z\\) we now have\n\\[\n\\frac{\\partial F_{Z | X} \\left( z | x \\right) }{\\partial z} = f_{Z | X} \\left( z | x \\right)\n\\]\nand\n\\[\n\\frac{\\partial F_{Y} \\left( z - x \\right)}{\\partial z} = f_{Y} \\left( z - x \\right)\n\\]\nHence \\(f_{Z | X} \\left( z | x \\right) = f_{Y} \\left( z - x \\right)\\). 
\n\n\nFor the conditional probability distributions, we have\n\n\\[\n\\begin{aligned}\nf_{X|Z} \\left(x | z \\right) & = \\frac{f_{Z|X} \\left( z | x \\right) f_{X} \\left( x \\right) }{f_{Z} \\left( z \\right)}\\\\\nf_{X|Z}(x|z) &= \\frac{f(x,z)}{f(z)} \\\\ &= \\frac{f_X(x) f_Y(z-x)}{f_Z(z)} \\\\\n&= \\frac{\\frac{1}{\\sqrt{2 \\pi}} \\exp {\\bigg(\\frac{-x^2}{2}\\bigg)} \\cdot \\frac{1}{\\sqrt{2 \\pi}} \\exp {\\bigg(\\frac{-(z-x)^2}{2}\\bigg)}}{\\frac{1}{\\sqrt{4\\pi}}\\exp{( -\\frac{z^2}{4})}} \\ \\ \\ \\ \\ \\ \\ (\\because Z \\sim \\mathcal{N}(0, 2))\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg(  -\\frac{x^2}{2} - \\frac{(z-x)^2}{2} + \\frac{z^2}{4}\\bigg)}\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg(  -\\frac{x^2}{2} - \\frac{z^2 - 2xz + x^2}{2} + \\frac{z^2}{4}\\bigg)}\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg(  \\frac{-2x^2 + 2xz - z^2}{2} + \\frac{z^2}{4}\\bigg)}\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg( \\frac{1}{2} (-2x^2 + 2xz - z^2 + \\frac{z^2}{2})\\bigg)}\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg( \\frac{1}{2} (-2x^2 + 2xz - \\frac{z^2}{2})\\bigg)}\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg( -x^2 + xz - \\frac{z^2}{4}\\bigg)}\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg( -(x^2 - xz + \\frac{z^2}{4})\\bigg)}\\\\\n&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\Bigg( -\\bigg(x^2 - 2x\\frac{z}{2} + \\big(\\frac{z}{2}\\big)^2\\bigg)\\Bigg)}\\\\\n\\therefore f_{X|Z}(x|z)&= \\frac{1}{\\sqrt{\\pi}} \\exp{\\bigg\\{- \\big(x - \\frac{z}{2}\\big)^2 \\bigg\\}}\n\\end{aligned}\n\\]\nThis leads to a Gaussian distribution, \\(\\mathcal{N} \\left(\\mu = \\frac{z}{2}, \\sigma^2 = \\frac{1}{2} \\right)\\).\n\n (iv) As before, we will make use of probability generating functions. 
From Wikipedia, we note that the probability generating functions for \\(X\\) and \\(Y\\) are given by \\(q_{X} \\left( z \\right) = exp \\left( \\lambda \\left( z - 1 \\right) \\right)\\), and \\(q_{Y} \\left( z \\right) = exp \\left( \\lambda \\left( z - 1 \\right) \\right)\\).\nSince \\(Z = X + Y\\) and \\(X\\) and \\(Y\\) are independent, the probability generating function of \\(Z\\) is the product\n\\[\nq_{Z} \\left( z \\right) = q_{X} \\left( z \\right) q_{Y} \\left( z \\right) = exp \\left(2 \\lambda \\left( z - 1 \\right) \\right)\n\\]\nHence, \\(Z \\sim \\textrm{Poisson} \\left( 2 \\lambda \\right)\\). The expectation can be worked out by recognizing that the probability mass function of \\(Y \\sim \\textrm{Poisson} \\left( \\lambda \\right)\\) is given by\n\\[\nf_{Y} \\left( k \\right) = \\frac{\\lambda^{k} exp\\left(-\\lambda \\right)}{k!}, \\; \\; \\; \\; for \\; \\; k=0, 1, 2, 3, \\ldots\n\\]\nNow, for a discrete random variable, \\(Y\\), the expectation of the function \\(g\\left(y \\right)\\) is given by\n\\[\n\\mathbb{E} \\left[ g \\left( Y \\right) \\right] = \\sum_{y} g \\left( y \\right) f_{Y} \\left( y \\right).\n\\]\nNote that for \\(Y \\sim \\textrm{Poisson} \\left( \\lambda \\right)\\), we have\n\\[\n\\begin{aligned}\n\\mathbb{E} \\left[ Y \\right] & = \\sum_{k=0}^{\\infty} k \\frac{\\lambda^{k}}{k!} exp \\left( - \\lambda \\right)\n\\end{aligned}\n\\]\n\n Thus, for \\(Z\\) we have \\[\n\\begin{aligned}\n\\mathbb{E} \\left[ \\frac{1}{Z+1} \\right] &  =   \\sum_{k=0}^{\\infty} \\frac{1}{k+1} exp\\left( - 2\\lambda \\right)  \\frac{\\left( 2 \\lambda \\right)^{k}}{k!}\\\\\n& = exp\\left( - 2\\lambda \\right) \\sum_{k=0}^{\\infty}   \\frac{\\left( 2 \\lambda \\right)^{k}}{\\left(k+1\\right)!}\\\\\n& = exp\\left( - 2\\lambda \\right) \\frac{1}{2 \\lambda}\\sum_{k=0}^{\\infty}   \\frac{\\left( 2 \\lambda \\right)^{k+1}}{\\left(k+1\\right)!}\\\\\n& = \\frac{1}{2 \\lambda} \\left( 1 - exp \\left( - 2 \\lambda \\right) \\right)\n\\end{aligned}\n\\]\nwhere in the last step we have used the Taylor series expansion for 
\\(exp\\)."
  },
  {
    "objectID": "midterm/midterm_2024_solutions.html#problem-3",
    "href": "midterm/midterm_2024_solutions.html#problem-3",
    "title": "Midterm",
    "section": "Problem 3",
    "text": "Problem 3\nA new test for a disease has been produced by a company. Trials revealed that if a test user has the disease, the test is positive with \\(95\\%\\) probability. Similarly, if the user does not have the disease, the test is negative with \\(92 \\%\\) probability. Approximately \\(1\\%\\) of the opulation is suspected to have this disease. Compute the probabilities below:\n\nThe probability that a randomly chosen individual will test positive. [4]\nThe probability that a randomly chosen individual who tests positive will have the disease. [4]\n\nTotal points: [8]\n\n\nLet \\(H\\) denote the outcome of a test, and let \\(D\\) denote the diagnosis. To clarify\n\n\n\n\nTrue / False\nG\nD\n\n\n\n\nT\nTest is positive\nHas disease\n\n\nF\nTest is negative\nDoesn’t have disease\n\n\n\nThus, from the problem statement we can furnish the following probabilities\n\\[\n\\begin{aligned}\np \\left( H = F | D = F \\right) & = 0.92 \\\\\np \\left( H = T | D = T \\right) & = 0.95 \\\\\np \\left( D = T \\right) & = 0.01\n\\end{aligned}\n\\] \n\n\nCode\np_HF_DF = 0.92\np_HT_DT = 0.95\np_DT = 0.01\n\n\n The probability that a randomly chosen individual will test positive is given by\n\\[\n\\begin{aligned}\np \\left( H = T \\right) & = p \\left( H = T, D = F \\right) + P \\left( H = T, D = T \\right) \\\\\n& = p \\left( H = T | D = F \\right) p \\left( D = F \\right) + p \\left(H = T | D = T \\right) p \\left(D = T \\right)\\\\\n& = (1 - 0.92)(1 - 0.01) + 0.95 \\times 0.01 \\\\\n\\end{aligned}\n\\] \n\n\nCode\nprob_HT = (1 - p_HF_DF) * (1 - p_DT) + p_HT_DT * p_DT\nprint(prob_HT)\n\n\n0.08869999999999996\n\n\n ii) The probability that a randomly chosen individual who tests positive will have the disease is given by\n\\[\n\\begin{aligned}\np \\left( D = T | H = T\\right) & = \\frac{p \\left( D = T, H = T \\right) }{p \\left( H = T \\right)} \\\\\n& = \\frac{p \\left( H = T | D = T \\right) p \\left( D = T \\right) }{p \\left( H = T \\right)}\n\\end{aligned}\n\\] 
\n\n\nCode\np_DT_HT = (p_HT_DT * p_DT) / prob_HT\nprint(p_DT_HT)\n\n\n0.1071025930101466"
  },
  {
    "objectID": "midterm/midterm_2024_solutions.html#problem-4",
    "href": "midterm/midterm_2024_solutions.html#problem-4",
    "title": "Midterm",
    "section": "Problem 4",
    "text": "Problem 4\nIn Lecture 8, we had studied a Bayesian model applied to the Olympic winning time dataset. A key modelling decision was what basis terms to incorporate within \\(\\mathbf{X}\\) and relatedly, the number of unknown model parameters (determined by the number of columns in \\(\\mathbf{X}\\)). Now, you will re-visit this dataset, but with a few changes:\n\nSelect a non-polynomial basis of your choice (e.g., harmonic, exponential, logarithmic, or even a combination thereof) to input in a tall \\(\\mathbf{X}\\) to best match the data. [6]\nCarefully select a multivariate normal prior \\(p \\left(\\mathbf{w} | \\boldsymbol{\\mu}_{0}, \\boldsymbol{\\Sigma}_{0} \\right)\\) for your new model parameters \\(\\mathbf{w}\\). You are free to vary the above mean and covariance as you see best. [6]\nRepeat (i) and (ii) for the case where the number of model parameters is greater than the number of data observations. You will have to alter your prior and \\(\\mathbf{X}\\) across these two cases. You may also choose to subsample from the original dataset if that helps. [12]\n\nPlease ensure that you provide executable code below, along with appropriate comments, to address the statements above. For your experiments, pay particular attention to the condition number of the matrix, and the assumed observation noise.\nTotal points: [24]\n (i) and (ii) The code below is amended from Lecture 8; below we select a harmonic basis. 
\n\n\nCode\nimport pandas as pd\nimport numpy as np\nimport matplotlib\nimport matplotlib.pyplot as plt\nfrom scipy.stats import multivariate_normal\nimport seaborn as sns\nsns.set(font_scale=1.0)\nsns.set_style(\"white\")\nsns.set_style(\"ticks\")\npalette = sns.color_palette('deep')\n\n\ndf = pd.read_csv('data100m.csv')\ndf.columns=['Year', 'Time']\nN = df.shape[0]\n\n# Data & basis (years rescaled to [0, 1])\nmax_year, min_year = df['Year'].values.max() , df['Year'].values.min()\nx = (df['Year'].values.reshape(N,1) - min_year)/(max_year - min_year)\nt = df['Time'].values.reshape(N,1)\n\n\n# Harmonic basis: constant term plus sines and cosines at two frequencies\nX_func = lambda u : np.hstack([np.ones((u.shape[0],1)), np.sin(u), np.cos(u), np.sin(6*u), np.cos(6*u) ])\nX = X_func(x)\nM = X.shape[1]\n\n# For prediction / plotting\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\nxi = xgrid*(max_year - min_year) + min_year\nxi = xi.flatten()\nsigma_hat = 0.2  # assumed observation noise (standard deviation)\nsigma_w = 2.0  # prior standard deviation of each weight\n\n# Prior: zero mean, covariance sigma_w^2 I (consistent with inv_Sigma_0 below)\nrandom_samples = 300\nmu_0 = np.zeros((M,1))\nSigma_0 = np.eye(M) * sigma_w**2\n\ninv_Sigma_0 = np.eye(M) * 1/(sigma_w**2)\nprior = multivariate_normal(mu_0.flatten(), Sigma_0)\n\n# Posterior\nSigma_w = np.linalg.inv(1./(sigma_hat**2) * (X.T @ X) + inv_Sigma_0)\nmu_w = Sigma_w @ (1./(sigma_hat**2) * (X.T @ t) + (inv_Sigma_0 @ mu_0) )\nposterior = multivariate_normal(mu_w.flatten(), Sigma_w)\n\n\nfig, ax = plt.subplots( figsize=(6,4))\nplt.plot(xi, Xg @ posterior.rvs(random_samples).T, zorder=-1, alpha=0.2)\na, = plt.plot(df['Year'].values, df['Time'].values, 'o', color='dodgerblue', \\\n              label='Data', markeredgecolor='k', lw=1, ms=10, zorder=1)\nplt.xlabel('Year')\nplt.title('model using posterior samples')\nplt.legend([a ], ['Data'], framealpha=0.2)\nplt.ylabel('Time (seconds)')\nplt.savefig('posterior.png', dpi=150, bbox_inches='tight', facecolor=\"#6C757D\")\nplt.show()\n\n\n\n\n\n(iii). 
Rather than select a subset of the data, we can add more basis terms.\n\n\nCode\ndef X_func(u):\n    # Constant column plus 2*modes harmonic columns (29 columns in total)\n    modes = 14\n    M = np.ones((u.shape[0], 2*modes+1))\n    for j in range(1, 2*modes+1):\n        if j % 2 == 0:\n            M[:,j] = np.sin(j * u).flatten()\n        else:\n            M[:,j] = np.cos(j * u).flatten()\n    return M\n\nX = X_func(x)\nM = X.shape[1]\n\n# For prediction / plotting\nxgrid = np.linspace(0, 1, 100).reshape(100,1)\nXg = X_func(xgrid)\nxi = xgrid*(max_year - min_year) + min_year\nxi = xi.flatten()\nsigma_hat = 0.2\nsigma_w = 2.0\n\n# Prior: zero mean, covariance sigma_w^2 I (consistent with inv_Sigma_0 below)\nrandom_samples = 300\nmu_0 = np.zeros((M,1))\nSigma_0 = np.eye(M) * sigma_w**2\n\ninv_Sigma_0 = np.eye(M) * 1/(sigma_w**2)\nprior = multivariate_normal(mu_0.flatten(), Sigma_0)\n\n# Posterior\nSigma_w = np.linalg.inv(1./(sigma_hat**2) * (X.T @ X) + inv_Sigma_0)\nmu_w = Sigma_w @ (1./(sigma_hat**2) * (X.T @ t) + (inv_Sigma_0 @ mu_0) )\nposterior = multivariate_normal(mu_w.flatten(), Sigma_w)\n\n\nfig, ax = plt.subplots( figsize=(6,4))\nplt.plot(xi, Xg @ posterior.rvs(random_samples).T, zorder=-1, alpha=0.2)\na, = plt.plot(df['Year'].values, df['Time'].values, 'o', color='dodgerblue', \\\n              label='Data', markeredgecolor='k', lw=1, ms=10, zorder=1)\nplt.xlabel('Year')\nplt.title('model using posterior samples')\nplt.legend([a ], ['Data'], framealpha=0.2)\nplt.ylabel('Time (seconds)')\nplt.show()\n\n\n\n\n\n Whilst I have not included any summarizing statements here, I expect students to study the condition number and vary the prior. Marks were deducted for models that did not bear any resemblance to the data."
  }
]