diff --git a/.nojekyll b/.nojekyll index 1d678d4..67c026b 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -17606cf0 \ No newline at end of file +e3d3fd73 \ No newline at end of file diff --git a/bibliography.html b/bibliography.html index 8793d50..2d9103f 100644 --- a/bibliography.html +++ b/bibliography.html @@ -140,8 +140,8 @@ Hypothesis testing
  • - - + + Regression
  • diff --git a/index.html b/index.html index ec7e8ad..dff6be3 100644 --- a/index.html +++ b/index.html @@ -140,8 +140,8 @@ Hypothesis testing
  • - - + + Regression
  • diff --git a/info.html b/info.html index bb0f69d..add8c70 100644 --- a/info.html +++ b/info.html @@ -140,8 +140,8 @@ Hypothesis testing
  • - - + + Regression
  • diff --git a/schedule.html b/schedule.html index c36ff08..b74b6bf 100644 --- a/schedule.html +++ b/schedule.html @@ -142,8 +142,8 @@ Hypothesis testing
  • - - + + Regression
• diff --git a/search.json b/search.json index 37abaed..89f1a7d 100644 --- a/search.json +++ b/search.json @@ -76,6 +76,48 @@ "section": "Measuring shape", "text": "Measuring shape\nBase R doesn’t have a lot of functions for summarising shape, so we will need to install and use the moments package:\n\n# install.packages(\"moments\")\nlibrary(\"moments\")\n\nSkew describes whether or not a variable has a symmetrical distribution. A distribution ‘leaning’ towards the left on a graph is negatively skewed; to the right positively skewed.\nWe can measure skew using the skewness() function from moments:\n\nskewness(x)\n\nKurtosis is the degree to which the distribution of a variable is ‘stretched out’. A variable with positive kurtosis might be described as having a ‘high peak’ and a ‘long tail’; with negative kurtosis, a ‘flat top’ or hill-shape.\nThere are two functions for measuring kurtosis in the moments package:\n\nkurtosis(x) # Pearson's measure of kurtosis\ngeary(x) # Geary's measure of kurtosis\n\nA final important shape characteristic is multimodality. So far, we’ve only worked with unimodal variables – distributed around a single central tendency. But it is not uncommon to see bimodal variables, with two distinct peaks, or multimodal variables, with three or more. At this point most of the statistics we calculated above will not be meaningful, and you have to investigate various techniques for decomposing the data into multiple, unimodal variables.\n\nExercises\n\nCalculate and interpret the skew of lithic flakes from sites on Islay\nCalculate and interpret the kurtosis of lithic flakes from sites on Islay\nWhat are some examples of archaeological datasets that we would expect to be multimodal? Why are these challenging from a statistical point of view?" }, + { + "objectID": "tutorials/regression.html", + "href": "tutorials/regression.html", + "title": "Regression", + "section": "", + "text": "Earlier in the course we saw how visualising relationships between variables can be used as part of an exploratory data analysis. In this tutorial we will look at how to measure the strength of these relationships (correlation) and at regression, an extremely versatile family of methods for formally modelling them." + }, + { + "objectID": "tutorials/regression.html#objectives", + "href": "tutorials/regression.html#objectives", + "title": "Regression", + "section": "Objectives", + "text": "Objectives\nBy the end of this tutorial, you should be able to:\n\nCalculate correlation coefficients in R\nFit a linear model to data\nAssess the quality of fit of a linear model visually and using residual analysis" + }, + { + "objectID": "tutorials/regression.html#prerequisites", + "href": "tutorials/regression.html#prerequisites", + "title": "Regression", + "section": "Prerequisites", + "text": "Prerequisites\n\nChapter 8 (“Relationships between Two Numeric Variables: Correlation and Regression”), in Stephen Shennan (1998), Quantifying Archaeology, 2nd edition. https://doi.org/10.1515/9781474472555-009\nChapter 9 (“When the Regression Doesn’t Fit”), in Stephen Shennan (1998), Quantifying Archaeology, 2nd edition. 
https://doi.org/10.1515/9781474472555-010" + }, + { + "objectID": "tutorials/regression.html#recap-visualisating-relationships", + "href": "tutorials/regression.html#recap-visualisating-relationships", + "title": "Regression", + "section": "Recap: visualising relationships", + "text": "Recap: visualising relationships\nWe have already learned how to use a scatter plot (or scattergram) to visualise the relationship between two numeric variables. For example, we could look at the relationship between the area of sites on Islay and the total number of lithics found there:\n\nlibrary(ggplot2)\nlibrary(islay)\n\nggplot(islay_lithics, aes(area, total)) +\n  geom_point()\n\nWe have also used geom_smooth() to add a “smoothed conditional mean” (in ggplot2’s words) to plots like this:\n\nggplot(islay_lithics, aes(area, total)) +\n  geom_point() +\n  geom_smooth()\n\nThe message generated by ggplot gives us some clues as to what it is doing under the hood:\n`geom_smooth()` using method = 'loess' and formula = 'y ~ x'\nIt has automatically fitted a model to the data using LOESS (method = 'loess'), a method of nonlinear regression. geom_smooth() always picks a nonlinear method by default: either LOESS or GAM (generalised additive models), depending on the number of observations. Both methods combine multiple regression models fitted to subsets of the data – producing the characteristically ‘wiggly’ output. These are useful for exploratory data analysis because they can fit relationships with many different shapes without further tuning.\nYou can also specify the method used by geom_smooth manually. Here, for example, it looks like a linear model might be more useful. Use method = \"glm\" to fit a generalised linear (regression) model (GLM) to the data:\n\nggplot(islay_lithics, aes(area, total)) +\n  geom_point() +\n  geom_smooth(method = \"glm\")\n\nLater we will learn how we can measure the strength of correlation and goodness of fit quantitatively, and further manipulate the linear model. For now, we can inspect it visually. The blue line represents the estimated mean trend or ‘line of best fit’. If our model is successful, it should roughly match the central trend of the surrounding points.\nEqually important is the grey area around the line, representing the confidence interval around the mean. By default, it is the 95% confidence interval. You can control this with the level argument:\n\nggplot(islay_lithics, aes(area, total)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", level = 0.68) # one sigma\nggplot(islay_lithics, aes(area, total)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", level = 0.997) # three sigma\n\nAs always with confidence intervals, it is important to remember that the level chosen is arbitrary.\nAnother useful argument to geom_smooth() is fullrange. We can set this to TRUE to extend the model beyond the actual range of the data, which can be useful for prediction:\n\nggplot(islay_lithics, aes(area, total)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", fullrange = TRUE) +\n  scale_x_continuous(limits = c(0, 300000))\n\nNote the use of scale_x_continuous() to extend the range of the x axis. Prediction needs to be approached with common sense and a knowledge of the data. 
For example, we can readily ask the model to ‘predict’ the number of lithics found at sites with an area of less than zero:\n\nggplot(islay_lithics, aes(area, total)) +\n  geom_point() +\n  geom_smooth(method = \"glm\", fullrange = TRUE) +\n  scale_x_continuous(limits = c(-100000, 200000))\n\nOne final way we can try to improve the model fit, without actually changing the model itself, is to consider the scale of the variables. In this case, both the area of sites and the total number of lithics appear to follow a log-normal distribution:\n\nggplot(islay_lithics, aes(area)) + geom_density()\nggplot(islay_lithics, aes(total)) + geom_density()\n\nAs we’ve already seen, these kinds of distributions are easy to normalise using a log transform. We could create new log variables, as we did last week, but since at this point we’re only concerned with visualisation, we can instead do the transformation directly on the plot itself with scale_*_log10():\n\nggplot(islay_lithics, aes(area, total)) +\n  geom_point() +\n  geom_smooth(method = \"glm\") +\n  scale_x_log10() +\n  scale_y_log10()\n\nThis is called a log–log scale. If there is a linear correlation on a log-log scale, as there is here, it indicates a power law correlation on a normal scale.\n\nExercises\n\nWhy does it make sense to log transform the variables total and area specifically?\nChoose two lithic types (flakes, etc.) and fit an appropriate model to them using geom_smooth().\nIs there a correlation? Why or why not?" + }, + { + "objectID": "tutorials/regression.html#measuring-correlation", + "href": "tutorials/regression.html#measuring-correlation", + "title": "Regression", + "section": "Measuring correlation", + "text": "Measuring correlation\nA correlation coefficient measures the strength of correlation between two variables. We can calculate them with the cor.test() function in R.\nThe most common, and default for cor.test(), is the product–moment correlation coefficient (method = \"pearson\"), which works well for linear relationships with the standard assumptions (cf. Shennan):\n\ncor.test(islay_lithics$area, islay_lithics$total)\n\nThe output should be familiar from other hypothesis tests. The most useful parts are the p-value—which we can interpret in the usual way—and cor, which is the correlation coefficient r. The latter measures the strength of the correlation, and is a value between -1 (a perfect negative correlation) and 1 (a perfect positive correlation), with 0 indicating no correlation. Another useful statistic is R2, which measures how much of the variance in the response variable can be predicted by the independent variable. To calculate this, we need to extract the actual value of r from the object returned by cor.test():\n\ncorrelation <- cor.test(islay_lithics$area, islay_lithics$total)\ncorrelation$estimate^2\n\nThis isn’t very promising: the p-value is relatively low, but so are the correlation coefficient and R2. However, we’ve already seen above that the correlation looks stronger on a log-log scale, i.e. it is probably not linear. Rather than the product-moment coefficient, we should therefore use a nonparametric coefficient like Spearman’s ρ or Kendall’s τ. To do this we simply need to change the method argument of cor.test():\n\ncor.test(islay_lithics$area, islay_lithics$total, method = \"spearman\")\n\nAs expected, the correlation appears to be stronger using this method.\n\nExercises\n\nCalculate Kendall’s τ for total vs. 
area (this is usually preferred to Spearman’s).\nDescribe the correlation (or lack thereof) you see, based on Kendall’s τ.\nChoose two lithic types (flakes, etc.) and calculate an appropriate correlation coefficient (hint: you will first need to determine whether the correlation appears to be linear or not)." + }, + { + "objectID": "tutorials/regression.html#linear-regression", + "href": "tutorials/regression.html#linear-regression", + "title": "Regression", + "section": "Linear regression", + "text": "Linear regression\nLinear models might seem underwhelming compared to LOESS or GAM, but unlike these ‘black boxes’ we can easily build our own linear models, and in doing so learn a lot about the underlying properties of the data. We should also remember that regression fitting algorithms will always fit the model you give them; with non-linear models, this can easily produce over-fitted models, leading to spurious interpretations. Restricting ourselves to linear relationships encourages us to keep our interpretations simple, and makes us more likely to spot when the model simply doesn’t fit.\nTo build our own linear models, we will use the function lm() directly. The following replicates the linear model we fitted with geom_smooth above:\n\nlm(total~area, data = islay_lithics)\n\nLike many modelling functions in R, it uses a special formula syntax, where the response variable(s) (left) are separated from the independent variable(s) (right) with ~. By giving our data frame as the data argument, we can use column names directly in the formula.\nThe result is a model object which includes the fitted values (used e.g. to reproduce a graph) as well as information on the model and a variety of measures of its performance – see ?lm for a full list. The format of this object is a little inconvenient. The broom package allows us to extract information from it as a data frame, which is a bit easier to work with. tidy() gives us the parameters of the model itself:\n\nlibrary(broom)\nmodel <- lm(total~area, data = islay_lithics)\ntidy(model)\n\nglance() summarises measures of its performance:\n\nglance(model)\n\nAnd augment() gives us the original data alongside the fitted model predictions and residuals:\n\naugment(model, data = islay_lithics)\n\nWe can use augment to reproduce a geom_smooth()-style plot from our model created with lm():\n\nfitted <- augment(model, data = islay_lithics, interval = \"confidence\")\nggplot(fitted, aes(area, total)) +\n  geom_point() +\n  geom_line(aes(y = .fitted)) + \n  geom_ribbon(aes(ymin = .lower, ymax = .upper), col = NA, alpha = 0.3)\n\nBut crucially, we can go beyond what geom_smooth() can do and analyse the fit of the model in more detail. For example, the result of augment() includes the residuals, which can be very useful for further exploratory data analysis of the regression model (cf. Shennan, ch. 9).\nLet’s see if we can make a better model. Our correlation coefficients and log-log plots have indicated that there probably is a correlation in this data, but that it is nonlinear – following a power law. We can model this by including log() in the formula specification of our model:\n\nlm(total~log(area), data = islay_lithics) |>\n  augment(data = islay_lithics, interval = \"confidence\") |>\n  ggplot(aes(area, total)) +\n  geom_point() +\n  geom_line(aes(y = .fitted)) +\n  geom_ribbon(aes(ymin = .lower, ymax = .upper), col = NA, alpha = 0.3)\n\n\nExercises\n\nChoose two lithic types (flakes, etc.) 
and fit an appropriate linear model to them.\nProduce a density plot of the residuals from the area–total regression. What can we learn from this?" + }, { "objectID": "tutorials/visualising_distributions.html", "href": "tutorials/visualising_distributions.html", @@ -158,7 +200,7 @@ "href": "tutorials/eda_with_r.html#the-r-environment", "title": "Exploring data with R", "section": "The R environment", - "text": "The R environment\nR’s engine is its console, the command-line interface through which you interact with the R environment. When you first start the console, you will be presented with a welcome message like this:\nR version 4.2.2 (2022-10-31) -- \"Innocent and Trusting\"\nCopyright (C) 2022 The R Foundation for Statistical Computing\nPlatform: x86_64-pc-linux-gnu (64-bit)\n\nR is free software and comes with ABSOLUTELY NO WARRANTY.\nYou are welcome to redistribute it under certain conditions.\nType 'license()' or 'licence()' for distribution details.\n\n Natural language support but running in an English locale\n\nR is a collaborative project with many contributors.\nType 'contributors()' for more information and\n'citation()' on how to cite R or R packages in publications.\n\nType 'demo()' for some demos, 'help()' for on-line help, or\n'help.start()' for an HTML browser interface to help.\nType 'q()' to quit R.\n\n> \nThe > character at the bottom is R’s command prompt. It indicates that R is waiting for you to tell it to do something! Try typing in a simple arithmetic expression such as 2 + 2:\n\n2 + 2\n\nR responds with the answer: 4 (we will get back to what the [1] means later). The numbers in this expression represent the first building block of the R environment: values. These simply represent what they are, e.g. the value 2 is the number 2.\n\nValues\nIf we enter a value into the prompt on its own, R simply prints it back for us:\n\n2\n\nValues don’t have to be numbers. Text can be entered as a string or character value, which is represented by text surrounded by \"double quotes\":\n\n\"Hello world!\"\n\nAgain, presented with a value alone, R simply prints it. We can’t do arithmetic with string values (that wouldn’t make sense), but we can manipulate them in other ways:\n\ntoupper(\"Hello world!\")\n\nIn total, there are six basic data types that can be represented as values in R:\n\n\n\n\n\n\n\n\nType\nExamples\nDescription\n\n\n\n\ndouble\n1.2\nA decimal number, known for historical reasons as a double precision value.\n\n\ninteger\n1L\nA whole number; an integer.\n\n\ncharacter\n\"abc\"\nText, also known as a string. Character values can also be entered with single quotes, e.g. 'abc', but double quotes is the norm.\n\n\nlogical\nTRUE, FALSE\nA binary or boolean value, representing ‘yes’ or ‘no’, ‘true’ or ‘false’, etc. Logical values can also be entered with the abbreviations T and F, but this should be avoided because it is possible to do T <- FALSE and wreak havoc.\n\n\ncomplex\n1i\nA complex number\n\n\nraw\nas.raw(255)\nThe raw bytes underlying other values\n\n\n\nIn most cases R converts freely between doubles and integers, so they are often grouped together and called numerics. Complex and raw values are rarely encountered in practice and are included here only for completeness.\nLogical values give us the possibility of using R to evaluate logical statements. For example, we can test if two values are equal to one another using the == operator:\n\n2 == 2\n1 == 2\n\"Hello\" == \"HELLO\"\n\nThe result is a logical value: TRUE or FALSE. 
We will look at logical operations in more detail in the next section. For now, be aware that R sometimes try to be clever and interpret one data type as another:\n\n3L == 3\n3 == \"3\"\nTRUE == 1\n\nBut it is not that clever:\n\n3 == \"three\"\n\"false\" == FALSE\n\n\nVectors\nR doesn’t limit us to single values. We can combine them together into a vector by entering a list of values separated with a comma and enclosed in c( ). For example, a sequence of numbers:\n\nc(1, 2, 3, 4, 5)\n\nOr a series of strings:\n\nc(\"red\", \"orange\", \"yellow\", \"green\", \"blue\")\n\nThere is also a shortcut syntax for creating vectors with a sequence of consecutive numbers:\n\n1:10\n\nVectors enable vector operations. For example, if we do arithmetic with two vectors, R will repeat the operation on each pair of values (note that this is not what a mathematician would do if asked to multiply vectors):\n\nc(1, 2, 3) * c(3, 2, 1)\nc(\"A\", \"B\", \"C\") == c(\"A\", \"b\", \"c\")\n\nIn fact, R considers all values to be vectors: a value like 2 is simply a vector of one number. This is why when we print even a single value it is proceeded by a [1] (indicating the initial position in a vector). Vector operations are fundamental to R and by default anything that you can do with a single value, you can also do with a vector:\n\nc(1, 2, 3) * 2\ntoupper(c(\"red\", \"orange\", \"yellow\", \"green\", \"blue\"))\n\nNote that vectors can only contain one data type. If you try to combine more than one data type, R will try to find the ‘common denominator’:\n\nc(1, 2, \"3\")\n\n\n\nNA and null values\nIn addition to the six basic data types, R recognises a number of special values that represent the absence of a value.\nThe most common of these is NA, which in R means “missing data” or “unknown”. When importing tables of data into R, empty cells will be read as NA. NA values can occur in any of the six basic data types, for example:\n\nc(1, NA, 3)\nc(\"red\", NA, \"yellow\")\n\nImportantly, NA is not equivalent to 0 or FALSE; those are known values. Generally speaking, introducing NA into an expression ‘propagates’ the unknown value:\n\nNA == 5\nNA * 5\n\nThis is because R doesn’t know if NA is 5. It could be; it could not. It also doesn’t know what the result of multiplying an unknown value by 5 is.\nThe strictness with which R treats NAs can be frustrating at times, but it is mathematically well-founded and generally protects you from doing silly things (as we saw last week!)\n\n\nExercises\n\nUse R to calculate the mean of these ten random numbers: 14, 95, 54, 100, 82, 65, 9, 61, 42, 20\nWhat is the result of c(FALSE, TRUE, 2, 3) == 0:3? Why?\nWhat is the result of NA == NA? Why?\n\n\n\n\nObjects\nApart from values, R interprets everything you type into the console as the name of an object. This is why we have to surround character values with quotes, to distinguish them from objects:\n\n\"version\"\nversion\n\nWhy do these two commands produce such different output? \"version\" is a value – the word ‘version’. version is an object, storing information on the version of R you have installed.\n\nErrors\nMost words aren’t the name of an object, so if you forget to enclose a character value with quotes (which is easily done), you are likely to encounter an error like this:\n\nhello\n\n“Error: object ‘hello’ not found” tells us fairly straightforwardly what the problem is: there is no object called hello.\nThis is the first of many errors you are going to see on this course! 
What if, for example, we try to multiply two strings, an operation that doesn’t make sense?\n\n\"Hello world!\" * \"Goodbye, world...\"\n\nThis time the message (“non-numeric argument to binary operator”) is pretty cryptic, which is unfortunately quite common with R. Essentially what it means is that R does not know how to add (a “binary operation”) something that is not a number (a “non-numeric argument”). But however cryptic they are, it’s important that you pay attention to errors when they occur. R never raises errors without reason and it is unlikely that you will be able to continue with your work until you have resolved them.\n\n\n\n\n\n\nDeciphering R error messages\n\n\n\nAs you get used to the jargon used in R’s error and warning messages, you will find that they do actually explain what the problem is. In the meantime, apart from asking your instructor, you can always try just pasting the error into a search engine. Unless you’re doing something truly exotic, it is very likely that someone else has encountered it before and has explained what it means in ordinary language.\nWhen you run more than one line of R code at a time, an error in one line is likely to produce errors in subsequent code. In these circumstances, it’s important to start investigating the problem with the first error encountered. If you encountered an error when trying to create an object, for example, you might subsequently get the “object not found” error when trying to use it – but this doesn’t tell you the root of the problem.\n\n\n\n\nAssignment\nThe R environment is initialised with a certain number of in-built objects, like version. Much more useful, however, is creating objects yourself. For this, we use the assignment operator, <-:\n\nx <- 2\nx\n\nWhat is happening here? In the first line, we assign the value 2 to the object x. In the second line, we ask R to print the object x, which is 2.\nAssignment is how we can start to build up routines that are more complex than those we could do on a calculator. For example, we can break down the exercise of calculating the mean into several steps:\n\ntotal <- 10 + 20 + 30 + 40 + 50\nn <- 5\ntotal / n\n\nAs you populate the R environment with objects, it can be easy to get lost. For example, what if I forget that I’ve already used the name n and reassign it to something else? Then if I try to calculate the mean again:\n\nn <- \"N\"\ntotal / n\n\nEntering ls() lists the objects currently in the R environment, so we can keep track:\n\nls()\n\nIf you need to clean things up, you can remove objects with rm():\n\nrm(n)\nls()\n\nOr to start completely afresh, simply restart R: objects are not saved between sessions.\n\n\n\n\n\n\nNaming objects\n\n\n\nAs the work you do in R becomes more complicated, choosing good names for objects is crucial for keeping track. There are some hard rules about object names:\n\nObject names can’t contain spaces\nObject names can’t start with a number\nObject names can only contain letters, numbers, _ and .\nObject names are case sensitive (e.g. X is different from x)\n\nAnd some good tips:\n\nPick a convention for separating words (e.g. snake_case or camelCase) and stick to it\nIt is better to use a long and descriptive name (e.g. switzerland_sites) than a short, abstract one (e.g. 
data)\nDon’t reassign object names unless you have a good reason to\n\n\n\n\n\nExercises\n\nCreate objects containing a) a vector of numbers, b) a single character value and c) the sum of a set of numbers\nWhy does 3sites <- c(\"Stonehenge\", \"Giza\", \"Knossos\") produce an error?\nWhat would you call an object containing the mean site size of Neolithic sites in Germany?\n\n\n\n\nFunctions\nFunctions are a special type of object that do something with other objects or values (called arguments). You have already met several functions in this tutorial: arithmetic operators like +, logical operators like ==, and regular functions like ls().\nThese show two different function syntaxes. The arithmetic and logical operators use infix syntax, with two arguments and the function name (the operator) in the middle:\n\n2 + 2\n\nMost functions, and any that can have more than two arguments, use regular function syntax. This looks something like this:\nfunction(argument1, argument2, argument3)\nls() is a function of this style, which takes no arguments (hence the empty brackets). An example of a function that does take arguments is sum():\n\nsum(10, 20, 30, 40, 50)\n\nFunction arguments can also be objects:\n\nx <- 1:100\nsum(x)\n\n\nExercises\n\nCalculate the mean of these ten random numbers using the sum() function: 44, 97, 34, 62, 82, 8, 72, 28, 95, 82\nWhy is it a bad idea to assign the mean of a variable to a new object called mean?" + "text": "The R environment\nR’s engine is its console, the command-line interface through which you interact with the R environment. When you first start the console, you will be presented with a welcome message like this:\nR version 4.2.2 (2022-10-31) -- \"Innocent and Trusting\"\nCopyright (C) 2022 The R Foundation for Statistical Computing\nPlatform: x86_64-pc-linux-gnu (64-bit)\n\nR is free software and comes with ABSOLUTELY NO WARRANTY.\nYou are welcome to redistribute it under certain conditions.\nType 'license()' or 'licence()' for distribution details.\n\n Natural language support but running in an English locale\n\nR is a collaborative project with many contributors.\nType 'contributors()' for more information and\n'citation()' on how to cite R or R packages in publications.\n\nType 'demo()' for some demos, 'help()' for on-line help, or\n'help.start()' for an HTML browser interface to help.\nType 'q()' to quit R.\n\n> \nThe > character at the bottom is R’s command prompt. It indicates that R is waiting for you to tell it to do something! Try typing in a simple arithmetic expression such as 2 + 2:\n\n2 + 2\n\nR responds with the answer: 4 (we will get back to what the [1] means later). The numbers in this expression represent the first building block of the R environment: values. These simply represent what they are, e.g. the value 2 is the number 2.\n\nValues\nIf we enter a value into the prompt on its own, R simply prints it back for us:\n\n2\n\nValues don’t have to be numbers. Text can be entered as a string or character value, which is represented by text surrounded by \"double quotes\":\n\n\"Hello world!\"\n\nAgain, presented with a value alone, R simply prints it. 
We can’t do arithmetic with string values (that wouldn’t make sense), but we can manipulate them in other ways:\n\ntoupper(\"Hello world!\")\n\nIn total, there are six basic data types that can be represented as values in R:\n\n\n\n\n\n\n\n\nType\nExamples\nDescription\n\n\n\n\ndouble\n1.2\nA decimal number, known for historical reasons as a double precision value.\n\n\ninteger\n1L\nA whole number; an integer.\n\n\ncharacter\n\"abc\"\nText, also known as a string. Character values can also be entered with single quotes, e.g. 'abc', but double quotes is the norm.\n\n\nlogical\nTRUE, FALSE\nA binary or boolean value, representing ‘yes’ or ‘no’, ‘true’ or ‘false’, etc. Logical values can also be entered with the abbreviations T and F, but this should be avoided because it is possible to do T <- FALSE and wreak havoc.\n\n\ncomplex\n1i\nA complex number\n\n\nraw\nas.raw(255)\nThe raw bytes underlying other values\n\n\n\nIn most cases R converts freely between doubles and integers, so they are often grouped together and called numerics. Complex and raw values are rarely encountered in practice and are included here only for completeness.\nLogical values give us the possibility of using R to evaluate logical statements. For example, we can test if two values are equal to one another using the == operator:\n\n2 == 2\n1 == 2\n\"Hello\" == \"HELLO\"\n\nThe result is a logical value: TRUE or FALSE. We will look at logical operations in more detail in the next section. For now, be aware that R sometimes tries to be clever and interpret one data type as another:\n\n3L == 3\n3 == \"3\"\nTRUE == 1\n\nBut it is not that clever:\n\n3 == \"three\"\n\"false\" == FALSE\n\n\nVectors\nR doesn’t limit us to single values. We can combine them together into a vector by entering a list of values separated with a comma and enclosed in c( ). For example, a sequence of numbers:\n\nc(1, 2, 3, 4, 5)\n\nOr a series of strings:\n\nc(\"red\", \"orange\", \"yellow\", \"green\", \"blue\")\n\nThere is also a shortcut syntax for creating vectors with a sequence of consecutive numbers:\n\n1:10\n\nVectors enable vector operations. For example, if we do arithmetic with two vectors, R will repeat the operation on each pair of values (note that this is not what a mathematician would do if asked to multiply vectors):\n\nc(1, 2, 3) * c(3, 2, 1)\nc(\"A\", \"B\", \"C\") == c(\"A\", \"b\", \"c\")\n\nIn fact, R considers all values to be vectors: a value like 2 is simply a vector of one number. This is why when we print even a single value it is preceded by a [1] (indicating the initial position in a vector). Vector operations are fundamental to R and by default anything that you can do with a single value, you can also do with a vector:\n\nc(1, 2, 3) * 2\ntoupper(c(\"red\", \"orange\", \"yellow\", \"green\", \"blue\"))\n\nNote that vectors can only contain one data type. If you try to combine more than one data type, R will try to find the ‘common denominator’:\n\nc(1, 2, \"3\")\n\n\n\nNA and null values\nIn addition to the six basic data types, R recognises a number of special values that represent the absence of a value.\nThe most common of these is NA, which in R means “missing data” or “unknown”. When importing tables of data into R, empty cells will be read as NA. NA values can occur in any of the six basic data types, for example:\n\nc(1, NA, 3)\nc(\"red\", NA, \"yellow\")\n\nImportantly, NA is not equivalent to 0 or FALSE; those are known values. 
Generally speaking, introducing NA into an expression ‘propagates’ the unknown value:\n\nNA == 5\nNA * 5\n\nThis is because R doesn’t know if NA is 5. It could be; it could not. It also doesn’t know what the result of multiplying an unknown value by 5 is.\nThe strictness with which R treats NAs can be frustrating at times, but it is mathematically well-founded and generally protects you from doing silly things (as we saw last week!)\n\nExercises\n\nUse R to calculate the mean of these ten random numbers: 65, 99, 75, 57, 20, 23, 31, 66, 45, 6\nWhat is the result of c(FALSE, TRUE, 2, 3) == 0:3? Why?\nWhat is the result of NA == NA? Why?\n\n\n\n\nObjects\nApart from values, R interprets everything you type into the console as the name of an object. This is why we have to surround character values with quotes, to distinguish them from objects:\n\n\"version\"\nversion\n\nWhy do these two commands produce such different output? \"version\" is a value – the word ‘version’. version is an object, storing information on the version of R you have installed.\n\nErrors\nMost words aren’t the name of an object, so if you forget to enclose a character value with quotes (which is easily done), you are likely to encounter an error like this:\n\nhello\n\n“Error: object ‘hello’ not found” tells us fairly straightforwardly what the problem is: there is no object called hello.\nThis is the first of many errors you are going to see on this course! What if, for example, we try to multiply two strings, an operation that doesn’t make sense?\n\n\"Hello world!\" * \"Goodbye, world...\"\n\nThis time the message (“non-numeric argument to binary operator”) is pretty cryptic, which is unfortunately quite common with R. Essentially what it means is that R does not know how to multiply (a “binary operation”) something that is not a number (a “non-numeric argument”). But however cryptic they are, it’s important that you pay attention to errors when they occur. R never raises errors without reason and it is unlikely that you will be able to continue with your work until you have resolved them.\n\n\n\n\n\n\nDeciphering R error messages\n\n\n\nAs you get used to the jargon used in R’s error and warning messages, you will find that they do actually explain what the problem is. In the meantime, apart from asking your instructor, you can always try just pasting the error into a search engine. Unless you’re doing something truly exotic, it is very likely that someone else has encountered it before and has explained what it means in ordinary language.\nWhen you run more than one line of R code at a time, an error in one line is likely to produce errors in subsequent code. In these circumstances, it’s important to start investigating the problem with the first error encountered. If you encountered an error when trying to create an object, for example, you might subsequently get the “object not found” error when trying to use it – but this doesn’t tell you the root of the problem.\n\n\n\n\nAssignment\nThe R environment is initialised with a certain number of in-built objects, like version. Much more useful, however, is creating objects yourself. For this, we use the assignment operator, <-:\n\nx <- 2\nx\n\nWhat is happening here? In the first line, we assign the value 2 to the object x. In the second line, we ask R to print the object x, which is 2.\nAssignment is how we can start to build up routines that are more complex than those we could do on a calculator. 
For example, we can break down the exercise of calculating the mean into several steps:\n\ntotal <- 10 + 20 + 30 + 40 + 50\nn <- 5\ntotal / n\n\nAs you populate the R environment with objects, it can be easy to get lost. For example, what if I forget that I’ve already used the name n and reassign it to something else? Then if I try to calculate the mean again:\n\nn <- \"N\"\ntotal / n\n\nEntering ls() lists the objects currently in the R environment, so we can keep track:\n\nls()\n\nIf you need to clean things up, you can remove objects with rm():\n\nrm(n)\nls()\n\nOr to start completely afresh, simply restart R: objects are not saved between sessions.\n\n\n\n\n\n\nNaming objects\n\n\n\nAs the work you do in R becomes more complicated, choosing good names for objects is crucial for keeping track. There are some hard rules about object names:\n\nObject names can’t contain spaces\nObject names can’t start with a number\nObject names can only contain letters, numbers, _ and .\nObject names are case sensitive (e.g. X is different from x)\n\nAnd some good tips:\n\nPick a convention for separating words (e.g. snake_case or camelCase) and stick to it\nIt is better to use a long and descriptive name (e.g. switzerland_sites) than a short, abstract one (e.g. data)\nDon’t reassign object names unless you have a good reason to\n\n\n\n\n\nExercises\n\nCreate objects containing a) a vector of numbers, b) a single character value and c) the sum of a set of numbers\nWhy does 3sites <- c(\"Stonehenge\", \"Giza\", \"Knossos\") produce an error?\nWhat would you call an object containing the mean site size of Neolithic sites in Germany?\n\n\n\n\nFunctions\nFunctions are a special type of object that do something with other objects or values (called arguments). You have already met several functions in this tutorial: arithmetic operators like +, logical operators like ==, and regular functions like ls().\nThese show two different function syntaxes. The arithmetic and logical operators use infix syntax, with two arguments and the function name (the operator) in the middle:\n\n2 + 2\n\nMost functions, and any that can have more than two arguments, use regular function syntax. This looks something like this:\nfunction(argument1, argument2, argument3)\nls() is a function of this style, which takes no arguments (hence the empty brackets). An example of a function that does take arguments is sum():\n\nsum(10, 20, 30, 40, 50)\n\nFunction arguments can also be objects:\n\nx <- 1:100\nsum(x)\n\n\nExercises\n\nCalculate the mean of these ten random numbers using the sum() function: 58, 49, 19, 21, 38, 36, 23, 25, 35, 3\nWhy is it a bad idea to assign the mean of a variable to a new object called mean?" 
}, { "objectID": "tutorials/eda_with_r.html#exploratory-data-analysis-with-r", @@ -326,7 +368,7 @@ "href": "tutorials/visualising_relationships.html#lithic-assemblages-from-islay", "title": "Visualising relationships", "section": "Lithic assemblages from Islay", - "text": "Lithic assemblages from Islay\nLoad the islay_lithics dataset from islay:\n\nlibrary(islay)\ndata(islay_lithics)\n\nWe can use the head() function to get a quick preview of the data frame:\n\nhead(islay_lithics)\n\n site_code region period area flakes blades\n1 LGM1 Loch Gorm South Mesolithic & Later Prehistoric 102450 159 15\n2 LGM2 Loch Gorm South Mesolithic & Later Prehistoric 62497 125 6\n3 LMG4 Loch Gorm South 37480 12 0\n4 LGM5 Loch Gorm South Mesolithic 52473 128 18\n5 LGM6 Loch Gorm South Later Prehistoric 54971 56 4\n6 LGM8 Loch Gorm South 49974 29 1\n chunks cores pebbles retouched\n1 16 24 0 15\n2 11 20 4 16\n3 1 1 6 3\n4 17 27 7 5\n5 8 18 12 10\n6 20 3 0 5\n\n\nBecause this is an in-built dataset of the package, you can also enter ?islay_lithics to open the help page for the dataset, which contains more information on what it describes.\nAs with the last dataset, it will be useful to turn the period column into a factor now, so that it will automatically be ordered in our subsequent plots:\n\nperiods <- c(\"Mesolithic\", \"Mesolithic & Later Prehistoric\", \"Later Prehistoric\")\nislay_lithics$period <- factor(islay_lithics$period, periods)\n\n\nExercises\n\nGenerate a plot showing the relationship between period and the number of retouched pieces. Is there a correlation? What could explain this?\nTry with two other types of lithics. Does it change your answer?\nGenerate a plot showing the relationship between the number of two types of lithics.\nAdd an aesthetic showing a categorical variable.\nExport the plot." + "text": "Lithic assemblages from Islay\nLoad the islay_lithics dataset from islay:\n\nlibrary(islay)\ndata(islay_lithics)\n\nWe can use the head() function to get a quick preview of the data frame:\n\nhead(islay_lithics)\n\n site_code region period area flakes blades\n1 LGM1 Loch Gorm South Mesolithic & Later Prehistoric 102450 159 15\n2 LGM2 Loch Gorm South Mesolithic & Later Prehistoric 62497 125 6\n3 LMG4 Loch Gorm South 37480 12 0\n4 LGM5 Loch Gorm South Mesolithic 52473 128 18\n5 LGM6 Loch Gorm South Later Prehistoric 54971 56 4\n6 LGM8 Loch Gorm South 49974 29 1\n chunks cores pebbles retouched total\n1 16 24 0 15 229\n2 11 20 4 16 182\n3 1 1 6 3 23\n4 17 27 7 5 202\n5 8 18 12 10 108\n6 20 3 0 5 58\n\n\nBecause this is an in-built dataset of the package, you can also enter ?islay_lithics to open the help page for the dataset, which contains more information on what it describes.\nAs with the last dataset, it will be useful to turn the period column into a factor now, so that it will automatically be ordered in our subsequent plots:\n\nperiods <- c(\"Mesolithic\", \"Mesolithic & Later Prehistoric\", \"Later Prehistoric\")\nislay_lithics$period <- factor(islay_lithics$period, periods)\n\n\nExercises\n\nGenerate a plot showing the relationship between period and the number of retouched pieces. Is there a correlation? What could explain this?\nTry with two other types of lithics. Does it change your answer?\nGenerate a plot showing the relationship between the number of two types of lithics.\nAdd an aesthetic showing a categorical variable.\nExport the plot." 
}, { "objectID": "tutorials/tidy_data.html", diff --git a/sitemap.xml b/sitemap.xml index dfd0529..eb38559 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,54 +2,58 @@ https://smada.joeroe.io/info.html - 2023-05-15T14:12:51.462Z + 2023-05-23T08:42:59.334Z https://smada.joeroe.io/tutorials/summary_statistics.html - 2023-05-15T14:12:55.579Z + 2023-05-23T08:43:03.277Z + + + https://smada.joeroe.io/tutorials/regression.html + 2023-05-23T08:43:12.424Z https://smada.joeroe.io/tutorials/visualising_distributions.html - 2023-05-15T14:13:00.609Z + 2023-05-23T08:43:17.954Z https://smada.joeroe.io/tutorials/eda_with_r.html - 2023-05-15T14:13:02.733Z + 2023-05-23T08:43:20.188Z https://smada.joeroe.io/tutorials/replication.html - 2023-05-15T14:13:03.146Z + 2023-05-23T08:43:20.708Z https://smada.joeroe.io/tutorials/testing.html - 2023-05-15T14:13:07.033Z + 2023-05-23T08:43:25.318Z https://smada.joeroe.io/tutorials/introduction.html - 2023-05-15T14:13:07.419Z + 2023-05-23T08:43:25.705Z https://smada.joeroe.io/tutorials/workflows.html - 2023-05-15T14:13:07.783Z + 2023-05-23T08:43:26.208Z https://smada.joeroe.io/tutorials/visualising_relationships.html - 2023-05-15T14:13:09.699Z + 2023-05-23T08:43:28.218Z https://smada.joeroe.io/tutorials/tidy_data.html - 2023-05-15T14:13:15.556Z + 2023-05-23T08:43:34.252Z https://smada.joeroe.io/bibliography.html - 2023-05-15T14:13:15.973Z + 2023-05-23T08:43:34.628Z https://smada.joeroe.io/index.html - 2023-05-15T14:13:16.333Z + 2023-05-23T08:43:35.155Z https://smada.joeroe.io/schedule.html - 2023-05-15T14:13:16.766Z + 2023-05-23T08:43:35.678Z diff --git a/tutorials/eda_with_r.html b/tutorials/eda_with_r.html index d2549f4..67081b7 100644 --- a/tutorials/eda_with_r.html +++ b/tutorials/eda_with_r.html @@ -205,8 +205,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -314,6 +314,11 @@

    Exploring data with R

    +
  • + @@ -563,7 +568,7 @@

    NA and nul

    Exercises

      -
    1. Use R to calculate the mean of these ten random numbers: 14, 95, 54, 100, 82, 65, 9, 61, 42, 20
    2. +
    3. Use R to calculate the mean of these ten random numbers: 65, 99, 75, 57, 20, 23, 31, 66, 45, 6
    4. What is the result of c(FALSE, TRUE, 2, 3) == 0:3? Why?
    5. What is the result of NA == NA? Why?
    @@ -689,7 +694,7 @@

    Functions

    Exercises

      -
    1. Calculate the mean of these ten random numbers using the sum() function: 44, 97, 34, 62, 82, 8, 72, 28, 95, 82
    2. +
    3. Calculate the mean of these ten random numbers using the sum() function: 58, 49, 19, 21, 38, 36, 23, 25, 35, 3
    4. Why is it a bad idea to assign the mean of a variable to a new object called mean?
    diff --git a/tutorials/introduction.html b/tutorials/introduction.html index c0107f2..72ac623 100644 --- a/tutorials/introduction.html +++ b/tutorials/introduction.html @@ -141,8 +141,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -250,6 +250,11 @@

    Introduction

    +
  • + diff --git a/tutorials/regression.html b/tutorials/regression.html index c1046d7..ebae238 100644 --- a/tutorials/regression.html +++ b/tutorials/regression.html @@ -2,12 +2,12 @@ - + -Statistical Methods for Archaeological Data Analysis [FS2023] - Regression +SMADA - Regression @@ -30,6 +93,7 @@ + @@ -59,6 +123,7 @@ "search-submit-button-title": "Submit" } } + @@ -72,7 +137,7 @@ - - - @@ -268,11 +331,20 @@

    Regression

    On this page

    @@ -296,23 +368,171 @@

    Regression

    -

    In this tutorial we will look at regression, an extremely versatile family of methods for modelling the relationship (correlation) between variables.

    +

    Earlier in the course we saw how visualising relationships between variables can be used as part of an exploratory data analysis. In this tutorial we will look at how to measure the strength of these relationships (correlation) and at regression, an extremely versatile family of methods for formally modelling them.

    +
    +

Objectives

    +

    By the end of this tutorial, you should be able to:

    +
      +
    1. Calculate correlation coefficients in R
    2. +
    3. Fit a linear model to data
    4. +
    5. Assess the quality of fit of a linear model visually and using residual analysis
    6. +
    +

Prerequisites

    +
    -
    -

    Prerequisites

    +
    +

Recap: visualising relationships

    +

We have already learned how to use a scatter plot (or scattergram) to visualise the relationship between two numeric variables. For example, we could look at the relationship between the area of sites on Islay and the total number of lithics found there:

    +
    +
    library(ggplot2)
    +library(islay)
    +
    +ggplot(islay_lithics, aes(area, total)) +
    +    geom_point()
    +
    +

    We have also used geom_smooth() to add a “smoothed conditional mean” (in ggplot2’s words) to plots like this:

    +
    +
    ggplot(islay_lithics, aes(area, total)) +
    +    geom_point() +
    +    geom_smooth()
    +
    +

    The message generated by ggplot gives us some clues as to what it is doing under the hood:

    +
    `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
    +

It has automatically fitted a model to the data using LOESS (method = 'loess'), a method of nonlinear regression. geom_smooth() always picks a nonlinear method by default: either LOESS or GAM (generalised additive models), depending on the number of observations. Both methods combine multiple regression models fitted to subsets of the data – producing the characteristically ‘wiggly’ output. These are useful for exploratory data analysis because they can fit relationships with many different shapes without further tuning.

    +
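For example, we can spell out the default for a small dataset like this one and control the amount of smoothing with the span argument (a minimal sketch; smaller values of span produce a wigglier curve):

ggplot(islay_lithics, aes(area, total)) +
    geom_point() +
    geom_smooth(method = "loess", span = 0.5)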

    You can also specify the method used by geom_smooth manually. Here, for example, it looks like a linear model might be more useful. Use method = "glm" to fit a generalised linear (regression) model (GLM) to the data:

    +
    +
    ggplot(islay_lithics, aes(area, total)) +
    +    geom_point() +
    +    geom_smooth(method = "glm")
    +
    +

Later we will learn how we can measure the strength of correlation and goodness of fit quantitatively, and further manipulate the linear model. For now, we can inspect it visually. The blue line represents the estimated mean trend or ‘line of best fit’. If our model is successful, it should roughly match the central trend of the surrounding points.

    +

    Equally important is the grey area around the line, representing the confidence interval around the mean. By default, it is the 95% confidence interval. You can control this with the level argument:

    +
    +
    ggplot(islay_lithics, aes(area, total)) +
    +    geom_point() +
    +    geom_smooth(method = "glm", level = 0.68) # one sigma
    +
    ggplot(islay_lithics, aes(area, total)) +
    +    geom_point() +
    +    geom_smooth(method = "glm", level = 0.997) # three sigma
    +
    +

    As always with confidence intervals, it is important to remember that the level chosen is arbitrary.

    +

    Another useful argument to geom_smooth() is fullrange. We can set this to TRUE to extend the model beyond the actual range of the data, which can be useful for prediction:

    +
    +
    ggplot(islay_lithics, aes(area, total)) +
    +    geom_point() +
    +    geom_smooth(method = "glm", fullrange = TRUE) +
    +    scale_x_continuous(limits = c(0, 300000))
    +
    +

    Note the use of scale_x_continuous() to extend the range of the x axis. Prediction needs to be approached with common sense and a knowledge of the data. For example, we can readily ask the model to ‘predict’ the number of lithics found at sites with an area of less than zero:

    +
    +
    ggplot(islay_lithics, aes(area, total)) +
    +    geom_point() +
    +    geom_smooth(method = "glm", fullrange = TRUE) +
    +    scale_x_continuous(limits = c(-100000, 200000))
    +
    +

    One final way we can try to improve the model fit, without actually changing the model itself, is to consider the scale of the variables. In this case, both the area of sites and the total number of lithics appear to follow a log-normal distribution:

    +
    +
    ggplot(islay_lithics, aes(area)) + geom_density()
    +
    ggplot(islay_lithics, aes(total)) + geom_density()
    +
    +

As we’ve already seen, these kinds of distributions are easy to normalise using a log transform. We could create new log variables, as we did last week, but since at this point we’re only concerned with visualisation, we can instead do the transformation directly on the plot itself with scale_*_log10():

    +
    +
    ggplot(islay_lithics, aes(area, total)) +
    +    geom_point() +
    +    geom_smooth(method = "glm") +
    +    scale_x_log10() +
    +    scale_y_log10()
    +
    +

    This is called a log–log scale. If there is a linear correlation on a log-log scale, as there is here, it indicates a power law correlation on a normal scale.

    +
    +
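We can convince ourselves of this with a quick sketch using simulated data (the numbers here are arbitrary): if y follows a power law such as y = 2 * x^0.5, the points fall on a straight line once both axes are log-scaled:

power_law <- data.frame(x = 1:100)
power_law$y <- 2 * power_law$x^0.5
ggplot(power_law, aes(x, y)) +
    geom_point() +
    scale_x_log10() +
    scale_y_log10()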

    Exercises

    +
      +
    1. Why does it make sense to log transform the variables total and area specifically?
    2. +
    3. Choose two lithic types (flakes, etc.) and fit an appropriate model to them using geom_smooth().
    4. +
    5. Is there a correlation? Why or why not?
    6. +
    -
    -

    Key concepts

    -
    -

    Foundation

    +
    +

    Measuring correlation

    +

A correlation coefficient measures the strength of correlation between two variables. We can calculate them with the cor.test() function in R.

    +

    The most common, and default for cor.test(), is the product–moment correlation coefficient (method = "pearson"), which works well for linear relationships with the standard assumptions (cf. Shennan):

    +
    +
    cor.test(islay_lithics$area, islay_lithics$total)
    +
    +

The output should be familiar from other hypothesis tests. The most useful parts are the p-value—which we can interpret in the usual way—and cor, which is the correlation coefficient r. The latter measures the strength of the correlation, and is a value between -1 (a perfect negative correlation) and 1 (a perfect positive correlation), with 0 indicating no correlation. Another useful statistic is R2, which measures how much of the variance in the response variable can be predicted by the independent variable. To calculate this, we need to extract the actual value of r from the object returned by cor.test():

    +
    +
    correlation <- cor.test(islay_lithics$area, islay_lithics$total)
    +correlation$estimate^2
    +
    +

This isn’t very promising: the p-value is relatively low, but so are the correlation coefficient and R2. However, we’ve already seen above that the correlation looks stronger on a log-log scale, i.e. it is probably not linear. Rather than the product-moment coefficient, we should therefore use a nonparametric coefficient like Spearman’s ρ or Kendall’s τ. To do this we simply need to change the method argument of cor.test():

    +
    +
    cor.test(islay_lithics$area, islay_lithics$total, method = "spearman")
    +
    +

    As expected, the correlation appears to be stronger using this method.

    +
    +

    Exercises

    +
      +
    1. Calculate Kendall’s τ for total vs. area (this is usually preferred to Spearman’s).
    2. +
    3. Describe the correlation (or lack thereof) you see, based on Kendall’s τ.
    4. +
    5. Choose two lithic types (flakes, etc.) and calculate an appropriate correlation coefficient (hint: you will first need to determine whether the correlation appears to be linear or not).
    6. +
    +
    -
    -

    Extensions

    +
    +

    Linear regression

    +

Linear models might seem underwhelming compared to LOESS or GAM, but unlike these ‘black boxes’ we can easily build our own linear models, and in doing so learn a lot about the underlying properties of the data. We should also remember that regression fitting algorithms will always fit the model you give them; with non-linear models, this can easily produce over-fitted models, leading to spurious interpretations. Restricting ourselves to linear relationships encourages us to keep our interpretations simple, and makes us more likely to spot when the model simply doesn’t fit.

    +

    To build our own linear models, we will use the function lm() directly. The following replicates the linear model we fitted with geom_smooth above:

    +
    +
    lm(total~area, data = islay_lithics)
    +
    +

    Like many modelling functions in R, it uses a special formula syntax, where the response variable(s) (left) are separated from the independent variable(s) (right) with ~. By giving our data frame as the data argument, we can use column names directly in the formula.

    +
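The formula syntax can also express more elaborate models. A couple of hypothetical variations, for illustration only (neither is necessarily a sensible model of the Islay data):

lm(total ~ area + blades, data = islay_lithics) # two independent variables
lm(total ~ area - 1, data = islay_lithics)      # suppress the intercept term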

    The result is a model object which includes the fitted values (used e.g. to reproduce a graph) as well as information on the model and a variety of measures of its performance – see ?lm for a full list. The format of this object is a little inconvenient. The broom package allows us to extract information from it as a data frame, which is a bit easier to work with. tidy() gives us the parameters of the model itself:

    +
    +
    library(broom)
    +model <- lm(total~area, data = islay_lithics)
    +tidy(model)
    +
    +

    glance() summarises measures of its performance:

    +
    +
    glance(model)
    +
    +

    And augment() gives us the original data alongside the fitted model predictions and residuals:

    +
    +
    augment(model, data = islay_lithics)
    +
    +

    We can use augment to reproduce a geom_smooth()-style plot from our model created with lm():

    +
    +
    fitted <- augment(model, data = islay_lithics, interval = "confidence")
    +ggplot(fitted, aes(area, total)) +
    +    geom_point() +
    +    geom_line(aes(y = .fitted)) + 
    +    geom_ribbon(aes(ymin = .lower, ymax = .upper), col = NA, alpha = 0.3)
    +
    +

    But crucially, we can go beyond what geom_smooth() can do and analyse the fit of the model in more detail. For example, the result of augment() includes the residuals, which can be very useful for further exploratory data analysis of the regression model (cf. Shennan, ch. 9).

    +
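One common diagnostic, sketched here with the fitted data frame from above, is to plot the residuals against the fitted values: if the model fits well, the points should scatter evenly around zero with no obvious pattern:

ggplot(fitted, aes(.fitted, .resid)) +
    geom_point() +
    geom_hline(yintercept = 0, linetype = "dashed")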

    Let’s see if we can make a better model. Our correlation coefficients and log-log plots have indicated that there probably is a correlation in this data, but that it is nonlinear – following a power law. We can model this by including log() in the formula specification of our model:

    +
    +
    lm(total~log(area), data = islay_lithics) |>
    +    augment(data = islay_lithics, interval = "confidence") |>
    +    ggplot(aes(area, total)) +
    +    geom_point() +
    +    geom_line(aes(y = .fitted)) +
    +    geom_ribbon(aes(ymin = .lower, ymax = .upper), col = NA, alpha = 0.3)
    +
    +
    +

    Exercises

    +
      +
    1. Choose two lithic types (flakes, etc.) and fit an appropriate linear model to them.
    2. +
    3. Produce a density plot of the residuals from the area–total regression. What can we learn from this?
    4. +
    +
    @@ -456,6 +676,9 @@

    Extensions

    diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-1-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-1-1.png new file mode 100644 index 0000000..2e82008 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-1-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-16-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 0000000..13865c1 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-16-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-17-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-17-1.png new file mode 100644 index 0000000..3968864 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-17-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-2-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-2-1.png new file mode 100644 index 0000000..9df1b94 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-2-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-3-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 0000000..a0c25d8 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-3-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-4-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 0000000..0b72eec Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-4-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-4-2.png b/tutorials/regression_files/figure-html/unnamed-chunk-4-2.png new file mode 100644 index 0000000..e999f9f Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-4-2.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-5-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 0000000..843bc61 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-5-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-6-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-6-1.png new file mode 100644 index 0000000..9ab27fd Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-6-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-7-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 0000000..15889cc Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-7-2.png b/tutorials/regression_files/figure-html/unnamed-chunk-7-2.png new file mode 100644 index 0000000..81157c1 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-7-2.png differ diff --git a/tutorials/regression_files/figure-html/unnamed-chunk-8-1.png b/tutorials/regression_files/figure-html/unnamed-chunk-8-1.png new file mode 100644 index 0000000..84e4605 Binary files /dev/null and b/tutorials/regression_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/tutorials/replication.html b/tutorials/replication.html index 6165086..62fca2c 100644 --- a/tutorials/replication.html +++ b/tutorials/replication.html @@ -142,8 +142,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -251,6 +251,11 @@

    Replication Workshop

    +
  • + diff --git a/tutorials/summary_statistics.html b/tutorials/summary_statistics.html index e54e78d..d7c6423 100644 --- a/tutorials/summary_statistics.html +++ b/tutorials/summary_statistics.html @@ -205,8 +205,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -314,6 +314,11 @@

    Summary statistics

    +
  • + diff --git a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-1.png b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-1.png index d400a71..e0ba376 100644 Binary files a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-1.png and b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-1.png differ diff --git a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-2.png b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-2.png index ceee849..1d0053f 100644 Binary files a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-2.png and b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-2.png differ diff --git a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-3.png b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-3.png index 700ffe7..c02d2d6 100644 Binary files a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-3.png and b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-1-3.png differ diff --git a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-7-1.png b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-7-1.png index 32022e1..ad6870d 100644 Binary files a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-7-1.png and b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-7-1.png differ diff --git a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-8-1.png b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-8-1.png index 5e1688a..7f6d5c2 100644 Binary files a/tutorials/summary_statistics_files/figure-html/unnamed-chunk-8-1.png and b/tutorials/summary_statistics_files/figure-html/unnamed-chunk-8-1.png differ diff --git a/tutorials/testing.html b/tutorials/testing.html index 8e3a3c1..8bf659d 100644 --- a/tutorials/testing.html +++ b/tutorials/testing.html @@ -93,7 +93,7 @@ - + @@ -205,8 +205,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -314,6 +314,11 @@

    Hypothesis testing

    +
  • + @@ -625,8 +630,8 @@

    Exercises

    diff --git a/tutorials/tidy_data.html b/tutorials/tidy_data.html index b28e58b..47deafc 100644 --- a/tutorials/tidy_data.html +++ b/tutorials/tidy_data.html @@ -205,8 +205,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -314,6 +314,11 @@

    Importing and tidying data

    +
  • + diff --git a/tutorials/visualising_distributions.html b/tutorials/visualising_distributions.html index 7f581ba..c1140c5 100644 --- a/tutorials/visualising_distributions.html +++ b/tutorials/visualising_distributions.html @@ -205,8 +205,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -314,6 +314,11 @@

    Visualising distributions

    +
  • + diff --git a/tutorials/visualising_relationships.html b/tutorials/visualising_relationships.html index fbcc80d..9543592 100644 --- a/tutorials/visualising_relationships.html +++ b/tutorials/visualising_relationships.html @@ -205,8 +205,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -314,6 +314,11 @@

    Visualising relationships

    +
  • + @@ -426,13 +431,13 @@

    Lithic assem 4 LGM5 Loch Gorm South Mesolithic 52473 128 18 5 LGM6 Loch Gorm South Later Prehistoric 54971 56 4 6 LGM8 Loch Gorm South <NA> 49974 29 1 - chunks cores pebbles retouched -1 16 24 0 15 -2 11 20 4 16 -3 1 1 6 3 -4 17 27 7 5 -5 8 18 12 10 -6 20 3 0 5 + chunks cores pebbles retouched total +1 16 24 0 15 229 +2 11 20 4 16 182 +3 1 1 6 3 23 +4 17 27 7 5 202 +5 8 18 12 10 108 +6 20 3 0 5 58

    Because this is an in-built dataset of the package, you can also enter ?islay_lithics to open the help page for the dataset, which contains more information on what it describes.

    diff --git a/tutorials/workflows.html b/tutorials/workflows.html index a82a4de..0b96c57 100644 --- a/tutorials/workflows.html +++ b/tutorials/workflows.html @@ -142,8 +142,8 @@ Hypothesis testing
  • - - + + Regression
  • @@ -251,6 +251,11 @@

    Data analysis workflows

    +
  • +