Public material for Harvard's CS109 class
Branch: master
Clone or download
Pull request Compare This branch is 47 commits ahead of cs109:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
Labs
Lectures
data
homework
images
to read
.gitignore
Bootstrapping.ipynb
README.md
city2surf.ipynb
linear_regression_with_gradient_descent.ipynb
monty_hall.ipynb
movielens.ipynb
twitter.ipynb

README.md

All right, this git repo is me working through Harvard's 2015 CS109 class, one lab at a time. This class was recommened at Quora and a few other places as being a good resource for practical data science, so here goes. My notes are here in the readme file as I work through the course syllabus and the labs and homeworks.

The stuff to watch and work through:

Note: download the videos using this script, and merge pull request 11 to get the 2015 videos.

Study Suggestions before starting:

  • Use a internet blocker app like SelfControl to stop procrastinating and a pomodoro app to break up study into chunks.
  • Use a paper notebook for notes as you watch the videos and do the labs.
  • Get a second monitor so you can watch videos/have lab notebooks open and work through at the same time. (I got a 1440p 27inch monitor from ebay and it made things so much easier from just using my laptop).
  • Don't look at the lab and hw answers - try to do them first on your own. Discuss with others before looking at solutions.

CS109 notes, by the class schedule:

Week 1: What is Data Science

Lecture 1 introduces data science. The basic stuff covered in every blog post.

Week 2: Intro Data Analysis and Viz

The lecture 2 notebook goes through getting data and putting it into a pandas dataframe.

Lab 1 has three very introductory notebooks: pythonpandas, followed by babypython, and finally git. However, since the course dates back to 2015, some of the python is a bit dated and uses 2.x code.

After doing the three intro notebooks, hw0 runs you through installing anaconda, git, and setting up a github and aws account.

Hw0 has one interesting section, where you solve the montyhall problem one step at a time. I didn't really like their steps, so made a simpler monty hall implementation.

Moving on to the Lecture 2 & its quiz notebook, this goes through some more pandas and data scraping web pages and parsing them.

I made a couple of notebooks expand on some of the stuff covered:

Lecture 3 (slides, video):

  • ask a q, get data to answer it, explore & check data, then model it and finally communicate and visualize the results.
  • keep viz simple and think of the kind of chart needed.

Week 3 : Databases, SQL and more Pandas

Lab 2 introduces web scraping with requests and then parsing html with beautiful soup 4.

Lecture 4 (video, slides) (covers some more Pandas and SQL.

Lecture 5 (slides, video) on stats is a bit sparse. Some supplementary material:

Week 4: Probablity, regression and some stats

Lab 3 has three notebooks:

  • Lab3-Probability covers basic probability. Uses a lot of numpy methods, so its a good idea to brush up on numpy.
    • scipy.stats - very handy, has most stats stuff needed for DS built in.
  • Lab3-Frequentism, Samples and the Bootstrap
    • use seaborn for plotting, very handy. a good guide to sns factorplot and facetgrids
    • PDF tells us the probability of where a continuus random variable will be in set of possible values that random variable can be (the sample space).
    • PMF tells us the probability that a discrete random variable will be ecactly equal to some value
    • CDF function tells us the probability that a random discrete or continous variable X will take a value less than or equal to X. Video

Lecture 6: Story Telling and Effective Communication (slides, video)

Good insights on how to tell a story with data. Infer, model, use an algorithim and draw conclusions (and check!).

  • Start with two fundamental questions:
    • Whats the goal? think first of that rather than going first to all the many ways you can slice and dice data.
    • Who cares? Know your audience and tell them a story. Have a clear sense of direction and logic.
  • Read some howto's on scientific writing
  • have some memorable examples or small stories

Tell a story:

  • know your audience and why/what they care about this data - what do they want?
  • Don't make them think - clearly tell them what you want them to know in a way they can follow. highlight key facts and insights.
  • unexpectedness - show something the audience didn't expect. I liked the story which highlighted that bigfoot sightings are dropping sharply
  • What tools can we give the audience? For example, a web app for them to further explore the data, or a takeaway presentation with key points.
  • be careful of your point of view and don't distort the data, but depending on the audience you can frame your story - for example presenting war deaths in red rather than a regular plot color.
  • important to put the message up front - what does my graph show? Show it in stages if a lot of data, highlighting what to look at. Design matters.

More resources:

Lecture 7: Bias and Regression (slides, video)

  • think about bias, missing data, etc
  • combine independent, unbiased estimators for a parameter into one:
    • fisher weighting
    • nate silver weighting method
  • Bonferroni
  • good segment on weighting variables
  • regression towards the mean
  • think of regression in terms of population or a projection of the column space of x - i.e what combination of the variables of x gets us closest to the value of y?
  • linear regression means we're taking linear combination of predictors, the actual regression equation can be nonlinear
  • what function of x gives the best predictor of y?
  • Gauss-Markov Theorem
  • the residuals are the diff b/w the actual value of y and the predicted value - plot residuals vs fitted values and vs each predictor variable - good way to eyeball quality of linear regression model
  • variance R^2 measures goodness of fit, but doesn't mean model is good.
  • Best way to check a model is prediction.

Week 5: Scikit learn & regression

Lab 4 - Regression in Python (video, notebook)

  • Wikipeda article
  • We have data X, which is related to Y in some way.
  • Linear regression uses X to predict Y, and also tells us the predictive power of each variable of X
  • Linear regression assumes the distribution of each Y is normal (which isn't always so)
  • there are many ways to fit a linear regression model, most common is the least squares method
  • be careful that the features (variables in X) aren't too similar
  • explore your data, plot variables, scatterplots etc. Use seaborn to plot regression.
  • use sklearn to split dataset into a train/test
  • use cross-validation
  • overfitting happens when the model 'learns' the train data so performs better on that than the test dataset
  • there are many types of regressions so think about which one to use
  • for high d data, closed form vs gradient decent:
    • closed form - use math to solve. this becomes computationally intensive very quickly, is ordered n cubed
    • gradient descent is O(n), so for large high d data it's gradient descent all the way
  • Logistic regression - used where outcome is binary, for example a chance of success/failure. read:Adit's explanation.

Lecture 8: More Regression (video, slides)

  • collinerarity - when some variables are highly correlated with each other - this is bad
  • Logistic Regression
  • Odds Ratio: ratio of two different people's odds of an outcome.
  • Crude Odds Ratio Estimate - quick estimate but flawed as doesn't control for anything.
  • Confounding Factors - i.e is one group pre-disposed to our outcome for some reason?
  • Curse of dimensionality - in high d settings, vast majority of data is near boundaries, not center. But, high d can also be a blessing.
  • dealing with high dimensionality: ridge regression, shrinkage estimation
  • Stein's Paradox wikipedia, good article
  • LASSO and Ridge help with high D data by reducing features
    • Lasso does L1 regularization, reduces number of features
    • Ridge does L2 regularization, doesn't necessarily reduce features but reduces the impace of features on the model by reducing coefficient value
  • Elasticnet does both L1 and L2 regularization

Lecture 9: Classification (video, slides)

  • we take data and assign labels
  • 1 nearest neighbour - simple classification method for low d data
    • slow, has to check all points to find nearest neighbour
  • k nearest neighbours - use k nearest points to find decision boundary
  • cross validation - for 5 fold cross validation, the data is split into 6 folds - 4 for training, one for validation and the sixth for testing, which is only used at the end.
  • CIFAR-10 for 60K images - is split into 50K training and 10K test
    • pics are 32x32x3
  • L1 distance is the absolute diff b/w two vectors
  • L2 is the Euclidean distance i.e "ordinary" straight-line distance
  • for images, l1 and l2 are pretty bad, so there are a lot more methods
  • more features are good for classification, but too many features means the data gets sparse - the curse of dimensionality strikes
  • so often we want to reduce dimensionality
  • Principal Component Analysis - project a dataset from many variables into fewer less correlated ones, called the principal components.
  • Singular Value Decomposition (SVD) - computational method to calculate pricipal components of a dataset. It transforms a large matrix of data into three smallter matrixes: A (m*n) = U(m*r) x E(r*r) x V(r*n). The values in the middle matrix r*r are the singular values and we can discard bits of them to reduce the amount of data to a more manageable number.
  • good pca and svd explanation
  • Watch Statistics for Hackers

HW2 Q1 (notebook)

  • Uses svd and pca to analyze gene data
  • a pandas excercise in downloading csv files into a data frame, usijng pd.datetime and visualising samples vs time

Week 6: SVM, trees and forests

Now the course finally gets interesting. Before starting this weeks work, think about project ideas and see Hans Rosling videos to see how to present data. Pitch this project idea (to study group or the internet at large).

There are quite a few companies automating the entire datascience chain, so the key is being able to present your findings well.

HW 2 Questions 2,3 & 4 (notebook)

H2 depends wholly on week 5, so good idea to get it done first. Used seaborn for all the viz questions makes some of them trivial.

  • q2 looks at polling data and bias
  • q3 is more of the same but with seaborn
  • q4 nothing much to see here besides using list comprehensions to make a list of all the .csv files (I'm trying to use python to do all the work instead of my stone age past life of copying and pasting links)
url_str = "http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014-senate"
election_urls = [election['url'] + '.csv' for election in requests.get(url_str).json()]

Lab 5: Machine Learning

Learning Models (notebook, video)

  • we often have a small sample of a much dataset, and we want to predict the larger data from our sample.
  • this isn't just statistical analysis, as we make models which involve domain knowledge and choices.
  • need to think about whether our sample is in some way representative of the population
  • Stochastic noise, i.e randomness
  • systematic error, i.e where the sampling isn't representative, like polling ppl using landlines
  • overfitting: models can 'memrize the the data points in the training set, becoming useless or inaccurate at predicting real world data. With many data sets a more and more complex dataset will keep getting better while getting worse on test/validation data. The best model minimizes test set error, and not training set error.
  • great illustration of variance at 24:30 and 35min in the video
  • use from sklearn.cross_validation import train_test_split for splitting datasets into a train test split. See sklearn
  • sklearn has 3 main features:
    • build and fit models
    • predict
    • transform data.
  • sklearn expects days in a 2d array or a matrix of size [n_samples, n_features], so reshape 1d data using np.reshape(-1,1)
  • Validation - keep a chunk of data seperate to check the model after training on the test/train data.
  • Cross Validation: randomly split data into K diff train/test splits - so you traion on K-1 partitions and test on 1, so there are a total of K combinations, leading to K risks. This leads to better results then just doing one test/train split.
  • regularization helps with overfitting

Classification (notebook, video)

  • sort data into classes, i.e what kind of fruit
  • most datasets can be projected onto a lower dimensial space, for e.g using PCA
  • read sklearn's PCA docs
  • kNN:
  • Logistic Regression - use sklearn, main paramater is C which defines how much to regularize the data. Read this explanation
  • Use sklearns GridSearchCV to find hyperparameters
  • one way to classify: use PCA to reduce the feature space, then use logistic regression to classify
  • many datatypes, like images, have tons of features, so important to reduce dimensionality.
  • sklearn PCA returns the principal components in order of how much they explain the data:
from sklearn.decomposition import PCA
pca = PCA(n_components=60) # PCA with no. of components to return
X = pca.fit_transform(data)
print(pca.explained_variance_ratio_) # how much of variance is explained
  • sklearn uses the same interface for all its classifiers, so makes it easy to put a wrapper around gridsearchCV and pass in diff classifiers to compare.
  • discriminative classifier - finds the decision boundary b/w classes
  • maximum-margin classifier - for many classifincation problems, multiplie diff lines can seperate classes. Choose that line where the margin b/w classes is the largest, which makes this model senstive to boundaries b/w classes, rather than to point samples deep inside a class.
  • SVM is a discrimant classier which finds the widest possible margin b/w classes, including some points touching the boundary which are called the support vectors. (since they support or establish the margins.)

Lecture 10: SVM, Evaluation (video, slides)

  • KNN - training is fast, prediction slow since we need to check all the data points to find the nearest neighbours
  • but if we know the decision boundary (the seperating hyperplane) we don't need all the data points
    • w: weight vector defines the orientation of the hyperplane, and bias b.
    • so a new point x is classified by w(transpose)*x + b
    • this is the mathematical model of a neuron, invented 1957 by Rosenblatt
  • step function vs sigmoid activation
  • Support Vector Machines (SVM) are widely used, some consider it best of the shelf classifier. They add a new dimension to help seperate classes and also use maximum margin classification. SVM is called svm becuase of the support vectors defining the max margin lines for the classification boundary.
  • large data is good for training svm as the points on the boundary are rare and svm cares about establishing the boundary
  • since outliers can change the svm boundaries, there is a concept of slack variables - it allows the SVM to missclassify outliers to make a neat decision boundary. sklearn uses the parameter C to define the slack. the lower the number the more the slack.
  • kernel tricks for svm - go to aribitarily mary dimensions with little computational cost. need to think about what kernel to use. Read What are kernels in machine learning and SVM and why do we need them?.
  • read Andrew Ng's cs229 svm notes
  • todo: tale sklearns 'faces' dataset and use svm to predict
  • svm tips:
    • normalize data to 0,1 or -1,1 interval. (check whether the library already normalizes)
    • RBF kernel is a good default
    • Read Chich-Wei Hsu practical guide to SVM
    • tune paramaters - which kernel, what parameters for it and what C?
  • ROC curve: plot true positive rate vs false positive rate
    • true +ve is tp/(tp+fn)
    • false +ve is fp/(fp+tn)
    • one useful summary stat is area under the curve
  • Precision Recall Curve sklearn, quora
    • Precision: tp/(tp+fp) - how much are we getting right, or the probability a a random +ve sample is actually +ve (since some +ve samples are false positives).
    • Recall tp/(tp+fn) - same as in the ROC, how much of the data are we finding. for a random +ve sameple, the probability that it's making a correct prediction. (consider the false negatives.)
    • ideally we want both to be one.
  • good way to compare classifiers
  • How do you classify multple classes with a SVM, e.g 10 types of fruit
    • One vs all - pick one class, and train it against all the other classes one by one. So you train n classifers for n classes.
    • One vs One - Train n(n-1)/2 classifiers, take majority vote
    • use a confusion matrix of predicted label vs true labels to see classification results

Lecture 11: Decision Trees and Random Forests (video, slides)

  • Books: Read Elements of Statiscal Learning and Pattern Recognition and Machine Learning
  • Decision Tree - fast training and prediction, easy to understand and interpret. DT basically paritions a feature space into cells using a series of decision rules on one feature at a time
  • which feature to query and what thresholds to use?
  • node purity: do splits so cells are pure, i.e have only one class in them
    • Gini Impurity gives us the expected error of predicted a class from a random sample
    • Gini Index
    • Node purity gain: compare gini impurity of parent vs child nodes. Lets us see whether a split has improved classification better than a simple missclassification number.
  • Optimal trees - diff ways to get there
  • Tree pruning - easy to overfit, so first we make the tree, then go backwards and remove 'bad' decisions, or merge cells etc.
  • DT disadvatages: sensitive to small changes in data, overfit, only axis aligned splits
  • DT vs SVM:
  • Netflix prize winners used an ensemble of over 800 models. somewhat disapointing as they didn't come up with a new method
  • DT doesn't perform well, but what if we use many of them?
  • Bootstrap is one way to do this. It's a resampling method from statistics, useful to get error bags on estimates.
    • Bootstrap lets us generate more sample sets from one dataset, each one slightly different.
    • Take N data points and draw N times with replacement, then get an estimate from each bootstrapped samples.
    • bagging: bootstrap aggregrating, where you learn a classifer from each bootstrap sample and average the . (normally uses just one type of classifier)
    • Bias-Variance trade off: train models with a high variance, then the average might get close
  • Random Forest builds on bagging, builds a tree from each bootstrap sample with node splits selected from random feature subsets. See 1, or rather 2

Week 7: Machine Learning best practices

HW 3 Q1 due: a lot of pandas manipulation on baseball data

Start the project

Towards the end of the course you will work on a month-long data science project. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results. You will work closely with other classmates in a 3-4 person project team.

Lab 6: Machine Learning 2

Classification, Probabilities, ROC curves, and Cost (video,notebook)

Comparing Models (video, notebook)

  • learn Bayes classifiers in sklean
  • use ROC curves - often accuracy is the the relevant, the true +ve and false -ve rate is more important. Since False negatives can be costly, you often want to change the threshold probability from the default 0.5. So write your own prediction function as the sklearn one uses 0.5, or with bayes classifiers adjust the prior probability.
  • sklearn has a roc function, tutorial

Lecture 12: Ensemble Methods ((video, slides))

  • philisophical point: who do ensenble methods work so well? in real life wisdom of the crowds is overrated, but it does a lot better in computerland. Some averaging methods pick up useful stuff in the data, while others cancel out each others errors.
  • Decision trees are easy but have poor predictive accuracy, tend to ovefit
  • Ensemble learning combines many learners using methods like weighted avg, boosting, etc. See sklearn's ensemble page
  • random forests is an extension of bagging of decision trees
    • random forests using sqrt(features) of random predicters to make trees give the best results
  • Boosting is another ensemble method like bagging, but better for most applications.
    • tune by num of trees, splits in each tree, weights
    • often using very simple trees, which you adjust the weights as u make more trees to classify better the things u got wrong. At the end u make a weighted avg.
    • most popular & succesful boosting algorithim is AdaBoost

Note: read this series on machine learning

Lecture 13: Best Practices (video, slides)

  • story telling is important for data scientists - explain whats going on, even in your own notebooks. good presentation is very important.
  • Diff b/w bagging and random forest: rf has bagging idea + random feature subsets
  • didn't really find this video that useful, for example:
    • knn and trees can be used for regression too, but why would we?
    • SVM's can be used for regression
  • take care to normalize your data in a sane manner

Week 8: EC2 and Spark

HW3 q2 uses the iris data set & q3 uses sklearn's digits dataset and gridsearchCV to find best parameters for a KNN classifier for two simple datasets.

Lab 7: Decision Trees, Random Forests, Ensemble Methods (video, notebook)

  • decision trees, but they use their own function on top of sklearn so its a bit annoying
  • random forests:
    • take a random subsample of the data
    • select a subset of all possible variables (at each node) and build the tree on the best split
    • repeat, then finally take a majority vote
  • Ensemble learning - put together a lot of classifiers to build a better one
  • AdaBoost Classifier uses weights on the training data.
  • Gradient Boost Classifier is similar to AdaBoost but uses tree as its base classifier, can do regression or classification.

Lecture 14: Best Practices, Recommendations and MapReduce (video, slides)

  • should have started final project by now, if not, start now!
  • Nesting: use 5 fold cross validation, take each fold and further apply 5 fold cv to find the right hyperparameters, then use those params in the original fold to train the classifier.
  • think about how and why to normalize data, e.g data already in a range from 0-1 might not need to be normalized, think about normazlization as hyperparameters to your model.
    • get the mean estimates from your training data and use that to normalize training, validation and testing data. these values need to be stored to normalize future data.
  • know your application domain
  • many problems are to do with imbalanced data. Fixes to get balanced data for training:
    • oversample: take bigger samples of the sparse clasess
    • subsampling: take smaller samples of the more frequently occuring classes
    • or weight classes so classifier pays more attention to the sparese classes, see sklearn
  • Missing data, where some of the features are missing for certain samples:
    • delete, but can reduce data too much
    • use mean
    • use regression to estimate missing values
  • Collobrative filtering is a common technique for recommendations or filling in missing values. Can be user based or item based.
    • KNN is widely used for collobrative filtering.
    • SVD is widely used, famously in the netflix contest. watch this svd video for a refresher if needed.

Moving on from Machine Learning...

Map reduce is a way to deal with very large data sets by distributing work over many machines. Developed by Google, and Apache Hadoop is a open source implementation.

  • data is in key value pairs and is distributed to however many servers
  • we write a map function which does something to a key value pair.
  • the mapreduce implementation reaggregrates the key value pairs incoming from all the servers (sorted by keys)
  • we write a reduce which does something else
  • simple mapreduce example:
    • map takes in sentences, sends them off to diff machines to process,
    • those machines send back key:value pairs, say we are just counting words so we get back from machine 1: "the":1, "the":1, "a":4 and machine 2 sends back "the":4, "a"10,
    • there can be a combine step here, which takes the output from one mapper and combines it, so the two the words from machine 1 become one "the":2. this reduces network traffic and makes the work upstream easier. the combine output has to be of the same type as the mapper, since combine is just an optional optimizatizion step not a DO WORK thing.
    • the mapreduce thingamajig aggregrates all the key:value pairs and sends them to the
    • reduce function, which counts the words and finally we end up with "the":6 and "a":14
    • Udacity has a course on mapreduce with hadoop

Lecture 15: MapReduce Combiners and Spark (video, slides)

  • refresher on mapreduce: it takes away the headache of dealing with machines, clusters, etc
    • be careful on how you use reducers, they shouldn't effect the algorithims effectiveness
    • the reducer should work whether or not the combiner runs
    • you could have a in-mapper combining, but here we are gettting into complex territory
  • Why did we switch to map reduce from good old write a function and run it computing to the horror that is map reduce?
    • large large data which looms like a t-rex, scaring our current teeny tiny computers cowering away in fear. luckily the computers gathered in data centers where mapreduce and other such algos can harness them in parallel.
    • many problems are embarrisingly parallel
  • Apache Hadoop is an open source implementation of map reduce, built way back in 2006. It provides us a way to store data on clusters of commodity machines and process it using mapreduce, without having to worry about the rocketscience of managing the clusters and splitting data and processes across all the machines.
  • Though these days you probably want to use Hadoop with Apache Spark running on Amazon's EMR or Google Cloud.
  • python has tools like dask and ipyparallel for parallel computing. for many projects python is enough
  • Functional programming: where your functions take an input, process it and return an output, without messing around with other functions or data stored elsewhere. this makes so much sense that it shouldn't be a big deal, it sounds like a toyota corolla of the computing world, but its a big deal because ppl manage to write functions which fuck around with your programming world enough that after a few rounds of their unfunctional running around no one has a good idea of whats the actual state of things.
  • Getting back to mapreduce, when you are programming functionally, then it makes it possible to distribute the data and computing across different machines as the functions aren't trying to mess with each other. When a machine dies, its no big deal as we know what data and function we sent it to process, we just resend it to some other machine.
    • aside: this is why pandas always outputs a new dataframe whenever we do something, trying to emulate this ideal of not fucking up existing things. Spark also does similar stuff.
  • Apache Spark makes all this easy for us. Runs on top of Hadoop and provides nice interface and methods for all kind of stuff. So nicer, faster, shineir than Hadoop.
    • Spark stores data on a Resilient distributed dataset (RDD) - a fault tolerant collection of stuff which can be operated on in parallel.
    • the basics of using sparl: write a function, some kind of mapreduce job, spark runs it on the RDD and makes a new processed RDD.
    • a lot of the pandas style commands work on spark
    • spark is lazy - it makes a graph of the stuff we want to do and then runs it from start to end every time we execute. so caching is important so we don't keep computing same stuff over and over again
    • explore spark…. todo

Week 9: Bayes!

Lab 8: Vagrant and VirtualBox, AWS, and Spark (video, notebook)

  • Moving on from sklearn...

Lecture 16: Bayes Theorem and Bayesian Methods (video, slides)

  • book recommendations from the lecture: (the one I liked best in bold)

  • Bayes rule: P(A|B) = P(B|A)P(A) / P(B)

  • bayes rules tells us how to update our beliefs (the prior) as we get new data, which gives us a new posterior (which is just a fancy word for our new, updated belief). A more wordy description:

    The theorem itself can be stated simply. Beginning with a provisional hypothesis about the world (there are, of course, no other kinds), we assign to it an initial probability called the prior probability or simply the prior. After actively collecting or happening upon some potentially relevant evidence, we use Bayes’s theorem to recalculate the probability of the hypothesis in light of the new evidence. This revised probability is called the posterior probability or simply the posterior.

    • bayes is controversial becuase traditional stats doesn't like giving numbers to unknown thinks, for example bayes essentially makes up the prior. the prior is often our subject belief about something.
      • however, even when starting out with different priors, given that they aren't ridiculously dogmatic, with a large sample size the different priors will converge
    • discriminative model: focus on predicting y given x, generative model: we simulate the entire model, i.e we can generate x and y
    • naive bayes: assumes probablities are conditionally independent of each other, greatly simplifies the calculations. sometimes unrealistic but works well for many scenarios. sklearn has bayes, of course.
    • Conjugate prior says that if u start with a family of distributions, like beta, you stay in the same distribution. simplifies computations
    • one way to think about bayesian

More Reading:

Lecture 17: Bayesian Methods Continued (video, slides)

  • you can estimate your prior from the data, though some bayesians would say you're tainting your priors and the data by doing that, but this is an accepted way to get an acceptable prior.
  • the lecture works through examples from a blog, which has collected its bayes posts into this book: Introduction to Empirical Bayes: Examples from Baseball Statistics. the explanations in the book look great.

Note: Bayes is simple to do yet hard to understand. So read a number of guides/blogs/posts/youtubes till it makes sense. Some talks to see:

Week 10: Text

hw5

Week 11: Clustering!

Week 12: Deep LEarning

Week 13: Final Project & Wrapup

Additional Resources

Stuff I found useful to understand the class material better.