diff --git a/2-Regression/4-Logistic/README.md b/2-Regression/4-Logistic/README.md index d283db45f..34e49aaba 100644 --- a/2-Regression/4-Logistic/README.md +++ b/2-Regression/4-Logistic/README.md @@ -41,7 +41,7 @@ Logistic regression differs from linear regression, which you learned about prev ### Binary classification -Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("orange or not orange") whereas the latter is capable of predicting continual values, for example given the origin of a pumpkin and the time of harvest, _how much its price will rise_. +Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("white or not white") whereas the latter is capable of predicting continual values, for example given the origin of a pumpkin and the time of harvest, _how much its price will rise_. ![Pumpkin classification Model](./images/pumpkin-classifier.png) > Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded) diff --git a/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb b/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb index 98534c2bc..c0b39d271 100644 --- a/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb +++ b/2-Regression/4-Logistic/solution/R/lesson_4-R.ipynb @@ -1,751 +1,675 @@ { - "nbformat": 4, - "nbformat_minor": 2, - "metadata": { - "colab": { - "name": "Untitled10.ipynb", - "provenance": [], - "collapsed_sections": [] - }, - "kernelspec": { - "name": "ir", - "display_name": "R" - }, - "language_info": { - "name": "R" - } - }, - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# Build a regression model: logistic regression\n", - "
\n" - ], - "metadata": { - "id": "fVfEucLYkV9T" - } - }, - { - "cell_type": "markdown", - "source": [ - "## Build a logistic regression model - Lesson 4\n", - "\n", - "

\n", - " \n", - "

Infographic by Dasani Madipalli
\n", - "\n", - "" - ], - "metadata": { - "id": "QizKKpzakfx2" - } - }, - { - "cell_type": "markdown", - "source": [ - "#### ** [Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n", - "\n", - "#### Introduction\n", - "\n", - "In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict `binary` `categories`. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n", - "\n", - "In this lesson, you will learn:\n", - "\n", - "- Techniques for logistic regression\n", - "\n", - "✅ Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-77952-leestott)\n", - "\n", - "#### **Prerequisite**\n", - "\n", - "Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.\n", - "\n", - "Let's build a logistic regression model to predict that, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n", - "\n", - "> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. 
Learn about other ways to classify data in the next lesson group.\n", - "\n", - "For this lesson, we'll require the following packages:\n", - "\n", - "- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n", - "\n", - "- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n", - "\n", - "- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n", - "\n", - "- `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods to create beeswarm-style plots using ggplot2.\n", - "\n", - "You can have them installed as:\n", - "\n", - "`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n", - "\n", - "Alternately, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing." - ], - "metadata": { - "id": "KPmut75XkmXY" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "suppressWarnings(if (!require(\"pacman\")) install.packages(\"pacman\"))\n", - "\n", - "pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)" - ], - "outputs": [], - "metadata": { - "id": "dnIGNNttkx_O" - } - }, - { - "cell_type": "markdown", - "source": [ - "## ** Define the question**\n", - "\n", - "For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.\n", - "\n", - "> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. 
They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking!\n", - "\n", - "## **About logistic regression**\n", - "\n", - "Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n", - "\n", - "#### **Binary classification**\n", - "\n", - "Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"orange or not orange\") whereas the latter is capable of predicting `continual values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n", - "\n", - "

\n", - " \n", - "

Infographic by Dasani Madipalli
\n", - "\n", - "" - ], - "metadata": { - "id": "ws-hP_SXk2O6" - } - }, - { - "cell_type": "markdown", - "source": [ - "#### **Other classifications**\n", - "\n", - "There are other types of logistic regression, including multinomial and ordinal:\n", - "\n", - "- **Multinomial**, which involves having more than one category - \"Orange, White, and Striped\".\n", - "\n", - "- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl).\n", - "\n", - "

\n", - " \n", - "

Infographic by Dasani Madipalli
\n", - "\n", - "" - ], - "metadata": { - "id": "LkLN-ZgDlBEc" - } - }, - { - "cell_type": "markdown", - "source": [ - "**It's still linear**\n", - "\n", - "Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.\n", - "\n", - "#### **Variables DO NOT have to correlate**\n", - "\n", - "Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. That works for this data which has somewhat weak correlations.\n", - "\n", - "#### **You need a lot of clean data**\n", - "\n", - "Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.\n", - "\n", - "✅ Think about the types of data that would lend themselves well to logistic regression\n" - ], - "metadata": { - "id": "D8_JoVZtlHUt" - } - }, - { - "cell_type": "markdown", - "source": [ - "## 1. Tidy the data\n", - "\n", - "Now, the fun begins! 
Let's start by importing the data, cleaning the data a bit, dropping rows containing missing values and selecting only some of the columns:" - ], - "metadata": { - "id": "LPj8Ib1AlIua" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Load the core tidyverse packages\n", - "library(tidyverse)\n", - "\n", - "# Import the data and clean column names\n", - "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n", - " clean_names()\n", - "\n", - "# Select desired columns\n", - "pumpkins_select <- pumpkins %>% \n", - " select(c(city_name, package, variety, origin, item_size, color)) \n", - "\n", - "# Drop rows containing missing values and encode color as factor (category)\n", - "pumpkins_select <- pumpkins_select %>% \n", - " drop_na() %>% \n", - " mutate(color = factor(color))\n", - "\n", - "# View the first few rows\n", - "pumpkins_select %>% \n", - " slice_head(n = 5)\n" - ], - "outputs": [], - "metadata": { - "id": "Q8oKJ8PAlLM0" - } - }, - { - "cell_type": "markdown", - "source": [ - "Sometimes, we may want some little more information on our data. We can have a look at the `data`, `its structure` and the `data type` of its features by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below:" - ], - "metadata": { - "id": "tKY5eN8alPNn" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "pumpkins_select %>% \n", - " glimpse()" - ], - "outputs": [], - "metadata": { - "id": "wDpatL1WlShu" - } - }, - { - "cell_type": "markdown", - "source": [ - "Wow! 
Seems that all our columns are all of type *character*, further alluding that they are all categorical.\n", - "\n", - "Let's confirm that we will actually be doing a binary classification problem:" - ], - "metadata": { - "id": "QbdC2b0JlU2G" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Subset distinct observations in outcome column\n", - "pumpkins_select %>% \n", - " distinct(color)" - ], - "outputs": [], - "metadata": { - "id": "Gys-Q18rlZpE" - } - }, - { - "cell_type": "markdown", - "source": [ - "🥳🥳 That went down well!\n", - "\n", - "## 2. Explore the data\n", - "\n", - "The goal of data exploration is to try to understand the `relationships` between its attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. One way of doing this is by using data visualization.\n", - "\n", - "Given our the data types of our columns, we can `encode` them and be on our way to making some visualizations. This simply involves `translating` a column with `categorical values` for example our columns of type *char*, into one or more `numeric columns` that take the place of the original. - Something we did in our [last lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/3-Linear/solution/lesson_3-R.ipynb).\n", - "\n", - "Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/)- a package for preprocessing data. We'll define a `recipe` that specifies that all predictor columns should be encoded into a set of integers , `prep` it to estimates the required quantities and statistics needed by any operations and finally `bake` to apply the computations to new data.\n", - "\n", - "> Normally, recipes is usually used as a preprocessor for modelling where it defines what steps should be applied to a data set in order to get it ready for modelling. 
In that case it is **highly recommend** that you use a `workflow()` instead of manually estimating a recipe using prep and bake. We'll see all this in just a moment.\n", - ">\n", - "> However for now, we are using recipes + prep + bake to specify what steps should be applied to a data set in order to get it ready for data analysis and then extract the preprocessed data with the steps applied." - ], - "metadata": { - "id": "kn_20wSPldVH" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Preprocess and extract data to allow some data analysis\n", - "baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>% \n", - " # Encode all columns to a set of integers\n", - " step_integer(all_predictors(), zero_based = T) %>% \n", - " prep() %>% \n", - " bake(new_data = NULL)\n", - "\n", - "\n", - "# Display the first few rows of preprocessed data\n", - "baked_pumpkins %>% \n", - " slice_head(n = 5)" - ], - "outputs": [], - "metadata": { - "id": "syaCgFQ_lijg" - } - }, - { - "cell_type": "markdown", - "source": [ - "Now let's compare the feature distributions for each label value using box plots. We'll begin by formatting the data to a *long* format to make it somewhat easier to make multiple `facets`." - ], - "metadata": { - "id": "RlkOZ_C5lldq" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Pivot data to long format\n", - "baked_pumpkins_long <- baked_pumpkins %>% \n", - " pivot_longer(!color, names_to = \"features\", values_to = \"values\")\n", - "\n", - "\n", - "# Print out restructured data\n", - "baked_pumpkins_long %>% \n", - " slice_head(n = 10)\n" - ], - "outputs": [], - "metadata": { - "id": "putq8DagltUQ" - } - }, - { - "cell_type": "markdown", - "source": [ - "Now, let's make some boxplots showing the distribution of the predictors with respect to the outcome color." 
- ], - "metadata": { - "id": "-RHm-12zlt-B" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "theme_set(theme_light())\n", - "#Make a box plot for each predictor feature\n", - "baked_pumpkins_long %>% \n", - " mutate(color = factor(color)) %>% \n", - " ggplot(mapping = aes(x = color, y = values, fill = features)) +\n", - " geom_boxplot() + \n", - " facet_wrap(~ features, scales = \"free\", ncol = 3) +\n", - " scale_color_viridis_d(option = \"cividis\", end = .8) +\n", - " theme(legend.position = \"none\")" - ], - "outputs": [], - "metadata": { - "id": "3Py4i1p1l3hP" - } - }, - { - "cell_type": "markdown", - "source": [ - "Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.\n", - "\n", - "#### **Use a swarm plot**\n", - "\n", - "Color is a binary category (Orange or Not), it's called `categorical data`. There are other various ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).\n", - "\n", - "Try a `swarm plot` to show the distribution of color with respect to the item_size.\n", - "\n", - "We'll use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) which provides methods to create beeswarm-style plots using ggplot2. Beeswarm plots are a way of plotting points that would ordinarily overlap so that they fall next to each other instead." 
- ], - "metadata": { - "id": "2LSj6_LCl68V" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Create beeswarm plots of color and item_size\n", - "baked_pumpkins %>% \n", - " mutate(color = factor(color)) %>% \n", - " ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n", - " geom_quasirandom() +\n", - " scale_color_brewer(palette = \"Dark2\", direction = -1) +\n", - " theme(legend.position = \"none\")" - ], - "outputs": [], - "metadata": { - "id": "hGKeRgUemMTb" - } - }, - { - "cell_type": "markdown", - "source": [ - "#### **Violin plot**\n", - "\n", - "A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. [`Violin plots`](https://en.wikipedia.org/wiki/Violin_plot) are similar to box plots, except that they also show the probability density of the data at different values. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'." - ], - "metadata": { - "id": "_9wdZJH5mOvN" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Create a violin plot of color and item_size\n", - "baked_pumpkins %>%\n", - " mutate(color = factor(color)) %>% \n", - " ggplot(mapping = aes(x = color, y = item_size, fill = color)) +\n", - " geom_violin() +\n", - " geom_boxplot(color = \"black\", fill = \"white\", width = 0.02) +\n", - " scale_fill_brewer(palette = \"Dark2\", direction = -1) +\n", - " theme(legend.position = \"none\")" - ], - "outputs": [], - "metadata": { - "id": "LFFFymujmTAZ" - } - }, - { - "cell_type": "markdown", - "source": [ - "Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n", - "\n", - "## 3. Build your logistic regression model\n", - "\n", - "

\n", - " \n", - "

Infographic by Dasani Madipalli
\n", - "\n", - "> **🧮 Show Me The Math**\n", - ">\n", - "> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:\n", - ">\n", - "> \n", - "

\n", - " \n", - "\n", - "\n", - "> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.\n", - "\n", - "Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.\n", - "\n", - "It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:" - ], - "metadata": { - "id": "RA_bnMS9mVo8" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Split data into 80% for training and 20% for testing\n", - "set.seed(2056)\n", - "pumpkins_split <- pumpkins_select %>% \n", - " initial_split(prop = 0.8)\n", - "\n", - "# Extract the data in each split\n", - "pumpkins_train <- training(pumpkins_split)\n", - "pumpkins_test <- testing(pumpkins_split)\n", - "\n", - "# Print out the first 5 rows of the training set\n", - "pumpkins_train %>% \n", - " slice_head(n = 5)" - ], - "outputs": [], - "metadata": { - "id": "PQdpEYYPmdGW" - } - }, - { - "cell_type": "markdown", - "source": [ - "🙌 We are now ready to train a model by fitting the training features to the training label (color).\n", - "\n", - "We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling i.e: encoding categorical variables into a set of integers.\n", - "\n", - "There are quite a number of ways to specify a logistic regression model in Tidymodels. 
See `?logistic_reg()` For now, we'll specify a logistic regression model via the default `stats::glm()` engine." - ], - "metadata": { - "id": "MX9LipSimhn0" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Create a recipe that specifies preprocessing steps for modelling\n", - "pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n", - " step_integer(all_predictors(), zero_based = TRUE)\n", - "\n", - "\n", - "# Create a logistic model specification\n", - "log_reg <- logistic_reg() %>% \n", - " set_engine(\"glm\") %>% \n", - " set_mode(\"classification\")\n" - ], - "outputs": [], - "metadata": { - "id": "0Eo5-SbSmm2-" - } - }, - { - "cell_type": "markdown", - "source": [ - "Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities.\n", - "\n", - "In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components." - ], - "metadata": { - "id": "G599GKhXmqWf" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Bundle modelling components in a workflow\n", - "log_reg_wf <- workflow() %>% \n", - " add_recipe(pumpkins_recipe) %>% \n", - " add_model(log_reg)\n", - "\n", - "# Print out the workflow\n", - "log_reg_wf\n" - ], - "outputs": [], - "metadata": { - "id": "cRoU0tpbmu1T" - } - }, - { - "cell_type": "markdown", - "source": [ - "After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate a recipe and preprocess the data before training, so we won't have to manually do that using prep and bake." 
- ], - "metadata": { - "id": "JnRXKmREnEpd" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Train the model\n", - "wf_fit <- log_reg_wf %>% \n", - " fit(data = pumpkins_train)\n", - "\n", - "# Print the trained workflow\n", - "wf_fit" - ], - "outputs": [], - "metadata": { - "id": "ehFwfkjWnNCb" - } - }, - { - "cell_type": "markdown", - "source": [ - "The model print out shows the coefficients learned during training.\n", - "\n", - "Now we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the probability is more than 0.5, the predict class is `ORANGE` else `WHITE`." - ], - "metadata": { - "id": "w01dGNZjnOJQ" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Make predictions for color and corresponding probabilities\n", - "results <- pumpkins_test %>% select(color) %>% \n", - " bind_cols(wf_fit %>% \n", - " predict(new_data = pumpkins_test)) %>%\n", - " bind_cols(wf_fit %>%\n", - " predict(new_data = pumpkins_test, type = \"prob\"))\n", - "\n", - "# Compare predictions\n", - "results %>% \n", - " slice_head(n = 10)" - ], - "outputs": [], - "metadata": { - "id": "K8PNjPfTnak2" - } - }, - { - "cell_type": "markdown", - "source": [ - "Very nice! This provides some more insights into how logistic regression works.\n", - "\n", - "Comparing each prediction with its corresponding \"ground truth\" actual value isn't a very efficient way to determine how well the model is predicting. 
Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics.\n", - "\n", - "One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs. A confusion matrix tabulates how many examples in each class were correctly classified by a model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; the confusion matrix also shows you how many were classified into the **wrong** categories.\n", - "\n", - "The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes." - ], - "metadata": { - "id": "N3J-yW0wngKo" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Confusion matrix for prediction results\n", - "conf_mat(data = results, truth = color, estimate = .pred_class)" - ], - "outputs": [], - "metadata": { - "id": "0RD77Dq1nl2j" - } - }, - { - "cell_type": "markdown", - "source": [ - "Let's interpret the confusion matrix. 
Our model is asked to classify pumpkins between two binary categories, category `orange` and category `not-orange`\n", - "\n", - "- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality we call it a `true positive`, shown by the top left number.\n", - "\n", - "- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality we call it a `false negative`, shown by the bottom left number.\n", - "\n", - "- If your model predicts a pumpkin as orange and it belongs to category 'not-orange' in reality we call it a `false positive`, shown by the top right number.\n", - "\n", - "- If your model predicts a pumpkin as not orange and it belongs to category 'not-orange' in reality we call it a `true negative`, shown by the bottom right number.\n", - "\n", - "\n", - "| **Truth** |\n", - "|:-----:|\n", - "\n", - "\n", - "| | | |\n", - "|---------------|--------|-------|\n", - "| **Predicted** | ORANGE | WHITE |\n", - "| ORANGE | TP | FP |\n", - "| WHITE | FN | TN |" - ], - "metadata": { - "id": "H61sFwdOnoiO" - } - }, - { - "cell_type": "markdown", - "source": [ - "As you might have guessed it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.\n", - "\n", - "The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:\n", - "\n", - "🎓 Precision: `TP/(TP + FP)` defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\")\n", - "\n", - "🎓 Recall: `TP/(TP + FN)` defined as the proportion of positive results out of the number of samples which were actually positive. 
Also known as `sensitivity`.\n", - "\n", - "🎓 Specificity: `TN/(TN + FP)` defined as the proportion of negative results out of the number of samples which were actually negative.\n", - "\n", - "🎓 Accuracy: `TP + TN/(TP + TN + FP + FN)` The percentage of labels predicted accurately for a sample.\n", - "\n", - "🎓 F Measure: A weighted average of the precision and recall, with best being 1 and worst being 0.\n", - "\n", - "Let's calculate these metrics!" - ], - "metadata": { - "id": "Yc6QUie2oQUr" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Combine metric functions and calculate them all at once\n", - "eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n", - "eval_metrics(data = results, truth = color, estimate = .pred_class)" - ], - "outputs": [], - "metadata": { - "id": "p6rXx_T3oVxX" - } - }, - { - "cell_type": "markdown", - "source": [ - "#### **Visualize the ROC curve of this model**\n", - "\n", - "For a start, this is not a bad model; its precision, recall, F measure and accuracy are in the 80% range so ideally you could use it to predict the color of a pumpkin given a set of variables. It also seems that our model was not really able to identify the white pumpkins 🧐. Could you guess why? 
One reason could be because of the high prevalence of ORANGE pumpkins in our training set making our model more inclined to predict the majority class.\n", - "\n", - "Let's do one more visualization to see the so-called [`ROC score`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):" - ], - "metadata": { - "id": "JcenzZo1oaKR" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Make a roc_curve\n", - "results %>% \n", - " roc_curve(color, .pred_ORANGE) %>% \n", - " autoplot()" - ], - "outputs": [], - "metadata": { - "id": "BcmkHHHwogRB" - } - }, - { - "cell_type": "markdown", - "source": [ - "ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature `True Positive Rate`/Sensitivity on the Y axis, and `False Positive Rate`/1-Specificity on the X axis. Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly.\n", - "\n", - "Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example." - ], - "metadata": { - "id": "P_an3vc1oqjI" - } - }, - { - "cell_type": "code", - "execution_count": null, - "source": [ - "# Calculate area under curve\n", - "results %>% \n", - " roc_auc(color, .pred_ORANGE)" - ], - "outputs": [], - "metadata": { - "id": "SZyy5BT8ovew" - } - }, - { - "cell_type": "markdown", - "source": [ - "The result is around `0.67053`. 
Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\n", - "\n", - "In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\n", - "\n", - "But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!\n", - "\n", - "You R awesome!\n", - "\n", - "

\n", - " \n", - "

Artwork by @allison_horst
\n", - "\n", - "\n" - ], - "metadata": { - "id": "5jtVKLTVoy6u" - } - } - ] -} \ No newline at end of file + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Build a logistic regression model - Lesson 4\n", + "\n", + "![Logistic vs. linear regression infographic](../../images/linear-vs-logistic.png)\n", + "\n", + "#### **[Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)**\n", + "\n", + "#### Introduction\n", + "\n", + "In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?\n", + "\n", + "In this lesson, you will learn:\n", + "\n", + "- Techniques for logistic regression\n", + "\n", + "✅ Deepen your understanding of working with this type of regression in this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott)\n", + "\n", + "## Prerequisite\n", + "\n", + "Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.\n", + "\n", + "Let's build a logistic regression model to predict that, given some variables, *what color a given pumpkin is likely to be* (orange 🎃 or white 👻).\n", + "\n", + "> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. 
Learn about other ways to classify data in the next lesson group.\n", + "\n", + "For this lesson, we'll require the following packages:\n", + "\n", + "- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n", + "\n", + "- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n", + "\n", + "- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.\n", + "\n", + "- `ggbeeswarm`: The [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) provides methods to create beeswarm-style plots using ggplot2.\n", + "\n", + "You can have them installed as:\n", + "\n", + "`install.packages(c(\"tidyverse\", \"tidymodels\", \"janitor\", \"ggbeeswarm\"))`\n", + "\n", + "Alternately, the script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "suppressWarnings(if (!require(\"pacman\"))install.packages(\"pacman\"))\n", + "\n", + "pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## **Define the question**\n", + "\n", + "For our purposes, we will express this as a binary: 'White' or 'Not White'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.\n", + "\n", + "> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. 
They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking! So we could also reformulate our question as: 'Ghost' or 'Not Ghost'. 👻\n", + "\n", + "## **About logistic regression**\n", + "\n", + "Logistic regression differs from linear regression, which you learned about previously, in a few important ways.\n", + "\n", + "#### **Binary classification**\n", + "\n", + "Logistic regression does not offer the same features as linear regression. The former offers a prediction about a `binary category` (\"white or not white\") whereas the latter is capable of predicting `continuous values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*.\n", + "\n", + "![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png)\n", + "\n", + "### Other classifications\n", + "\n", + "There are other types of logistic regression, including multinomial and ordinal:\n", + "\n", + "- **Multinomial**, which involves having more than two categories - \"Orange, White, and Striped\".\n", + "\n", + "- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).\n", + "\n", + "![Multinomial vs ordinal regression](../../images/multinomial-vs-ordinal.png)\n", + "\n", + "#### **Variables DO NOT have to correlate**\n", + "\n", + "Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite - the variables don't have to align. 
That works for this data which has somewhat weak correlations.\n", + "\n", + "#### **You need a lot of clean data**\n", + "\n", + "Logistic regression will give more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.\n", + "\n", + "โœ… Think about the types of data that would lend themselves well to logistic regression\n", + "\n", + "## Exercise - tidy the data\n", + "\n", + "First, clean the data a bit, dropping null values and selecting only some of the columns:\n", + "\n", + "1. Add the following code:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Load the core tidyverse packages\n", + "library(tidyverse)\n", + "\n", + "# Import the data and clean column names\n", + "pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\") %>% \n", + " clean_names()\n", + "\n", + "# Select desired columns\n", + "pumpkins_select <- pumpkins %>% \n", + " select(c(city_name, package, variety, origin, item_size, color)) \n", + "\n", + "# Drop rows containing missing values and encode color as factor (category)\n", + "pumpkins_select <- pumpkins_select %>% \n", + " drop_na() %>% \n", + " mutate(color = factor(color))\n", + "\n", + "# View the first few rows\n", + "pumpkins_select %>% \n", + " slice_head(n = 5)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can always take a peek at your new dataframe, by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "pumpkins_select %>% \n", + " glimpse()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's confirm that we will 
actually be doing a binary classification problem:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Subset distinct observations in outcome column\n", + "pumpkins_select %>% \n", + " distinct(color)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Visualization - categorical plot\n", + "By now you have loaded up the pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let's visualize the dataframe in the notebook using the ggplot2 library.\n", + "\n", + "The ggplot2 library offers some neat ways to visualize your data. For example, you can compare distributions of the data for each Variety and Color in a categorical plot.\n", + "\n", + "1. Create such a plot by using the `geom_bar` function, using our pumpkin data, and specifying a color mapping for each pumpkin category (orange or white):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Specify colors for each value of the fill variable\n", + "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n", + "\n", + "# Create the bar plot\n", + "ggplot(pumpkins_select, aes(y = variety, fill = color)) +\n", + " geom_bar(position = \"dodge\") +\n", + " scale_fill_manual(values = palette) +\n", + " labs(y = \"Variety\", fill = \"Color\") +\n", + " theme_minimal()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By observing the data, you can see how the Color data relates to Variety.\n", + "\n", + "✅ Given this categorical plot, what are some interesting explorations you can envision?" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data pre-processing: feature encoding\n", + "\n", + "Our pumpkins dataset contains string values for all its columns. 
Working with categorical data is intuitive for humans but not for machines. Machine learning algorithms work well with numbers. That's why encoding is a very important step in the data pre-processing phase, since it enables us to turn categorical data into numerical data, without losing any information. Good encoding leads to building a good model.\n", + "\n", + "For feature encoding there are two main types of encoders:\n", + "\n", + "1. Ordinal encoder: it is well suited to ordinal variables, which are categorical variables whose values follow a logical ordering, like the `item_size` column in our dataset. It creates a mapping such that each category is represented by a number, which is the order of the category in the column.\n", + "\n", + "2. Categorical encoder: it is well suited to nominal variables, which are categorical variables whose values do not follow a logical ordering, like all the features other than `item_size` in our dataset. It is a one-hot encoding, which means that each category is represented by a binary column: the encoded variable is equal to 1 if the pumpkin belongs to that Variety and 0 otherwise.\n", + "\n", + "Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/), a package for preprocessing data. We'll define a `recipe` that specifies how the predictor columns should be encoded, `prep` it to estimate the quantities and statistics required by any operations, and finally `bake` it to apply the computations to new data.\n", + "\n", + "> Normally, recipes is used as a preprocessor for modelling, where it defines what steps should be applied to a data set in order to get it ready for modelling. In that case it is **highly recommended** that you use a `workflow()` instead of manually estimating a recipe using prep and bake. 
We'll see all this in just a moment.\n", + ">\n", + "> However, for now, we are using recipes + prep + bake to specify what steps should be applied to a data set in order to get it ready for data analysis and then extract the preprocessed data with the steps applied.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Preprocess and extract data to allow some data analysis\n", + "baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%\n", + " # Define ordering for item_size column\n", + " step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n", + " # Convert factors to numbers using the order defined above (Ordinal encoding)\n", + " step_integer(item_size, zero_based = FALSE) %>%\n", + " # Encode all other predictors using one hot encoding\n", + " step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%\n", + " # Estimate the recipe on the data it was defined with (pumpkins_select)\n", + " prep() %>%\n", + " bake(new_data = NULL)\n", + "\n", + "# Display the first few rows of preprocessed data\n", + "baked_pumpkins %>% \n", + " slice_head(n = 5)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "✅ What are the advantages of using an ordinal encoder for the Item Size column?\n", + "\n", + "### Analyse relationships between variables\n", + "\n", + "Now that we have pre-processed our data, we can analyse the relationships between the features and the label to get an idea of how well the model will be able to predict the label given the features. The best way to perform this kind of analysis is plotting the data. \n", + "We'll again use ggplot2, this time with the `geom_boxplot` function, to visualize the relationships between Item Size, Variety and Color in a categorical plot. 
To better plot the data we'll be using the encoded Item Size column and the unencoded Variety column.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Define the color palette\n", + "palette <- c(ORANGE = \"orange\", WHITE = \"wheat\")\n", + "\n", + "# We need the encoded Item Size column to use it as the x-axis values in the plot\n", + "pumpkins_select_plot<-pumpkins_select\n", + "pumpkins_select_plot$item_size <- baked_pumpkins$item_size\n", + "\n", + "# Create the grouped box plot\n", + "ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +\n", + " geom_boxplot() +\n", + " facet_grid(variety ~ ., scales = \"free_x\") +\n", + " scale_fill_manual(values = palette) +\n", + " labs(x = \"Item Size\", y = \"\") +\n", + " theme_minimal() +\n", + " theme(strip.text = element_text(size = 12)) +\n", + " theme(axis.text.x = element_text(size = 10)) +\n", + " theme(axis.title.x = element_text(size = 12)) +\n", + " theme(axis.title.y = element_blank()) +\n", + " theme(legend.position = \"bottom\") +\n", + " guides(fill = guide_legend(title = \"Color\")) +\n", + " theme(panel.spacing = unit(0.5, \"lines\"))+\n", + " theme(strip.text.y = element_text(size = 4, hjust = 0)) \n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Use a swarm plot\n", + "\n", + "Since Color is a binary category (White or Not), it needs 'a [specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf) to visualization'.\n", + "\n", + "Try a `swarm plot` to show the distribution of color with respect to the item_size.\n", + "\n", + "We'll use the [ggbeeswarm package](https://github.com/eclarke/ggbeeswarm) which provides methods to create beeswarm-style plots using ggplot2. 
Beeswarm plots are a way of plotting points that would ordinarily overlap so that they fall next to each other instead.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Create beeswarm plots of color and item_size\n", + "baked_pumpkins %>% \n", + " mutate(color = factor(color)) %>% \n", + " ggplot(mapping = aes(x = color, y = item_size, color = color)) +\n", + " geom_quasirandom() +\n", + " scale_color_brewer(palette = \"Dark2\", direction = -1) +\n", + " theme(legend.position = \"none\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.\n", + "\n", + "## Build your model\n", + "\n", + "Select the variables you want to use in your classification model and split the data into training and test sets. 
[rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Split data into 80% for training and 20% for testing\n", + "set.seed(2056)\n", + "pumpkins_split <- pumpkins_select %>% \n", + " initial_split(prop = 0.8)\n", + "\n", + "# Extract the data in each split\n", + "pumpkins_train <- training(pumpkins_split)\n", + "pumpkins_test <- testing(pumpkins_split)\n", + "\n", + "# Print out the first 5 rows of the training set\n", + "pumpkins_train %>% \n", + " slice_head(n = 5)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "๐Ÿ™Œ We are now ready to train a model by fitting the training features to the training label (color).\n", + "\n", + "We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling i.e: encoding categorical variables into a set of integers. Just like `baked_pumpkins`, we create a `pumpkins_recipe` but do not `prep` and `bake` since it would be bundled into a workflow, which you will see in just a few steps from now. \n", + "\n", + "There are quite a number of ways to specify a logistic regression model in Tidymodels. 
See `?logistic_reg()` For now, we'll specify a logistic regression model via the default `stats::glm()` engine.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Create a recipe that specifies preprocessing steps for modelling\n", + "pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>% \n", + " step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%\n", + " step_integer(item_size, zero_based = F) %>% \n", + " step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)\n", + "\n", + "# Create a logistic model specification\n", + "log_reg <- logistic_reg() %>% \n", + " set_engine(\"glm\") %>% \n", + " set_mode(\"classification\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities.\n", + "\n", + "In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Bundle modelling components in a workflow\n", + "log_reg_wf <- workflow() %>% \n", + " add_recipe(pumpkins_recipe) %>% \n", + " add_model(log_reg)\n", + "\n", + "# Print out the workflow\n", + "log_reg_wf\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. 
The workflow will estimate a recipe and preprocess the data before training, so we won't have to manually do that using prep and bake.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Train the model\n", + "wf_fit <- log_reg_wf %>% \n", + " fit(data = pumpkins_train)\n", + "\n", + "# Print the trained workflow\n", + "wf_fit\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The model print out shows the coefficients learned during training.\n", + "\n", + "Now that we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the predicted probability of `WHITE` is more than 0.5, the predicted class is `WHITE`; otherwise it is `ORANGE`.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Make predictions for color and corresponding probabilities\n", + "results <- pumpkins_test %>% select(color) %>% \n", + " bind_cols(wf_fit %>% \n", + " predict(new_data = pumpkins_test)) %>%\n", + " bind_cols(wf_fit %>%\n", + " predict(new_data = pumpkins_test, type = \"prob\"))\n", + "\n", + "# Compare predictions\n", + "results %>% \n", + " slice_head(n = 10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Very nice! This provides some more insights into how logistic regression works.\n", + "\n", + "### Better comprehension via a confusion matrix\n", + "\n", + "Comparing each prediction with its corresponding \"ground truth\" actual value isn't a very efficient way to determine how well the model is predicting. 
Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics.\n", + "\n", + "One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs. A confusion matrix tabulates how many examples in each class were correctly classified by a model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; the confusion matrix also shows you how many were classified into the **wrong** categories.\n", + "\n", + "The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat.html) function from yardstick calculates this cross-tabulation of observed and predicted classes.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Confusion matrix for prediction results\n", + "conf_mat(data = results, truth = color, estimate = .pred_class)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's interpret the confusion matrix. 
Our model is asked to classify pumpkins between two binary categories, category `white` and category `not-white`\n", + "\n", + "- If your model predicts a pumpkin as white and it belongs to category 'white' in reality we call it a `true positive`, shown by the top left number.\n", + "\n", + "- If your model predicts a pumpkin as not white and it belongs to category 'white' in reality we call it a `false negative`, shown by the bottom left number.\n", + "\n", + "- If your model predicts a pumpkin as white and it belongs to category 'not-white' in reality we call it a `false positive`, shown by the top right number.\n", + "\n", + "- If your model predicts a pumpkin as not white and it belongs to category 'not-white' in reality we call it a `true negative`, shown by the bottom right number.\n", + "\n", + "| Truth |\n", + "|:-----:|\n", + "\n", + "\n", + "| | | |\n", + "|---------------|--------|-------|\n", + "| **Predicted** | WHITE | ORANGE |\n", + "| WHITE | TP | FP |\n", + "| ORANGE | FN | TN |\n", + "\n", + "As you might have guessed it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.\n", + "\n", + "The confusion matrix is helpful since it gives rise to other metrics that can help us better evaluate the performance of a classification model. Let's go through some of them:\n", + "\n", + "๐ŸŽ“ Precision: `TP/(TP + FP)` defined as the proportion of predicted positives that are actually positive. Also called [positive predictive value](https://en.wikipedia.org/wiki/Positive_predictive_value \"Positive predictive value\")\n", + "\n", + "๐ŸŽ“ Recall: `TP/(TP + FN)` defined as the proportion of positive results out of the number of samples which were actually positive. 
Also known as `sensitivity`.\n", + "\n", + "🎓 Specificity: `TN/(TN + FP)` defined as the proportion of negative results out of the number of samples which were actually negative.\n", + "\n", + "🎓 Accuracy: `(TP + TN)/(TP + TN + FP + FN)` The proportion of all labels that were predicted correctly.\n", + "\n", + "🎓 F Measure: A weighted harmonic mean of the precision and recall, with best being 1 and worst being 0.\n", + "\n", + "Let's calculate these metrics!\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Combine metric functions and calculate them all at once\n", + "eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)\n", + "eval_metrics(data = results, truth = color, estimate = .pred_class)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Visualize the ROC curve of this model\n", + "\n", + "Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Make a roc_curve\n", + "results %>% \n", + " roc_curve(color, .pred_ORANGE) %>% \n", + " autoplot()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature `True Positive Rate`/Sensitivity on the Y axis, and `False Positive Rate`/1-Specificity on the X axis. Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. 
In our case, there are false positives to start with, and then the line heads up and over properly.\n", + "\n", + "Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the Curve. One way of interpreting AUC is as the probability that the model ranks a random positive example more highly than a random negative example.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "vscode": { + "languageId": "r" + } + }, + "outputs": [], + "source": [ + "# Calculate area under curve\n", + "results %>% \n", + " roc_auc(color, .pred_ORANGE)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result is around `0.975`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.\n", + "\n", + "In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).\n", + "\n", + "## ๐Ÿš€Challenge\n", + "\n", + "There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.\n", + "\n", + "## Review & Self Study\n", + "\n", + "Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about tasks that are better suited for one or the other type of regression tasks that we have studied up to this point. 
What would work best?\n" + ] + } + ], + "metadata": { + "anaconda-cloud": "", + "kernelspec": { + "display_name": "R", + "language": "R", + "name": "ir" + }, + "language_info": { + "codemirror_mode": "r", + "file_extension": ".r", + "mimetype": "text/x-r-source", + "name": "R", + "pygments_lexer": "r", + "version": "3.4.1" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/2-Regression/4-Logistic/solution/R/lesson_4.Rmd b/2-Regression/4-Logistic/solution/R/lesson_4.Rmd index 6bf95bf01..2199fb34f 100644 --- a/2-Regression/4-Logistic/solution/R/lesson_4.Rmd +++ b/2-Regression/4-Logistic/solution/R/lesson_4.Rmd @@ -12,21 +12,21 @@ output: ## Build a logistic regression model - Lesson 4 -![Infographic by Dasani Madipalli](../../images/logistic-linear.png){width="600"} +![Logistic vs. linear regression infographic](../../images/linear-vs-logistic.png) #### **[Pre-lecture quiz](https://gray-sand-07a10f403.1.azurestaticapps.net/quiz/15/)** #### Introduction -In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict `binary` `categories`. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not? +In this final lesson on Regression, one of the basic *classic* ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not? 
In this lesson, you will learn: - Techniques for logistic regression -โœ… Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-77952-leestott) +โœ… Deepen your understanding of working with this type of regression in this [Learn module](https://learn.microsoft.com/training/modules/introduction-classification-models/?WT.mc_id=academic-77952-leestott) -#### **Prerequisite** +## Prerequisite Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`. @@ -58,9 +58,9 @@ pacman::p_load(tidyverse, tidymodels, janitor, ggbeeswarm) ## **Define the question** -For our purposes, we will express this as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway. +For our purposes, we will express this as a binary: 'White' or 'Not White'. There is also a 'striped' category in our dataset but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway. -> ๐ŸŽƒ Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking! +> ๐ŸŽƒ Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones but they are cool looking! So we could also reformulate our question as: 'Ghost' or 'Not Ghost'. ๐Ÿ‘ป ## **About logistic regression** @@ -70,22 +70,17 @@ Logistic regression differs from linear regression, which you learned about prev Logistic regression does not offer the same features as linear regression. 
The former offers a prediction about a `binary category` ("orange or not orange") whereas the latter is capable of predicting `continual values`, for example given the origin of a pumpkin and the time of harvest, *how much its price will rise*. -![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png){width="600"} +![Infographic by Dasani Madipalli](../../images/pumpkin-classifier.png) -#### **Other classifications** +### Other classifications There are other types of logistic regression, including multinomial and ordinal: -- **Multinomial**, which involves having more than one category - "Orange, White, and Striped". +- **Multinomial**, which involves having more than one category - "Orange, White, and Striped". -- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl). +- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini,sm,med,lg,xl,xxl). -![Infographic by Dasani Madipalli](../../images/multinomial-ordinal.png){width="600"} - -\ -**It's still linear** - -Even though this type of Regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not. +![Multinomial vs ordinal regression](../../images/multinomial-vs-ordinal.png) #### **Variables DO NOT have to correlate** @@ -97,9 +92,11 @@ Logistic regression will give more accurate results if you use more data; our sm โœ… Think about the types of data that would lend themselves well to logistic regression -## 1. Tidy the data +## Exercise - tidy the data -Now, the fun begins! 
Let's start by importing the data, cleaning the data a bit, dropping rows containing missing values and selecting only some of the columns: +First, clean the data a bit, dropping null values and selecting only some of the columns: + +1. Add the following code: ```{r, tidyr, message=F, warning=F} # Load the core tidyverse packages @@ -121,34 +118,56 @@ pumpkins_select <- pumpkins_select %>% # View the first few rows pumpkins_select %>% slice_head(n = 5) - ``` -Sometimes, we may want some little more information on our data. We can have a look at the `data`, `its structure` and the `data type` of its features by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below: +You can always take a peek at your new dataframe, by using the [*glimpse()*](https://pillar.r-lib.org/reference/glimpse.html) function as below: + ```{r glimpse} pumpkins_select %>% glimpse() ``` -Wow! Seems that all our columns are all of type *character*, further alluding that they are all categorical. - Let's confirm that we will actually be doing a binary classification problem: ```{r distinct color} # Subset distinct observations in outcome column pumpkins_select %>% distinct(color) +``` +### Visualization - categorical plot +By now you have loaded up the pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including Color. Let's visualize the dataframe in the notebook using ggplot library. + +The ggplot library offers some neat ways to visualize your data. For example, you can compare distributions of the data for each Variety and Color in a categorical plot. + +1. 
Create such a plot by using the `geom_bar` function, using our pumpkin data, and specifying a color mapping for each pumpkin category (orange or white): + +```{r} +# Specify colors for each value of the fill variable +palette <- c(ORANGE = "orange", WHITE = "wheat") + +# Create the bar plot +ggplot(pumpkins_select, aes(y = variety, fill = color)) + + geom_bar(position = "dodge") + + scale_fill_manual(values = palette) + + labs(y = "Variety", fill = "Color") + + theme_minimal() ``` -🥳🥳 That went down well! +By observing the data, you can see how the Color data relates to Variety. + +✅ Given this categorical plot, what are some interesting explorations you can envision? + +### Data pre-processing: feature encoding + +Our pumpkins dataset contains string values for all its columns. Working with categorical data is intuitive for humans but not for machines. Machine learning algorithms work well with numbers. That's why encoding is a very important step in the data pre-processing phase, since it enables us to turn categorical data into numerical data, without losing any information. Good encoding leads to building a good model. -## 2. Explore the data +For feature encoding there are two main types of encoders: -The goal of data exploration is to try to understand the `relationships` between its attributes; in particular, any apparent correlation between the *features* and the *label* your model will try to predict. One way of doing this is by using data visualization. +1. Ordinal encoder: it is well suited to ordinal variables, which are categorical variables whose values follow a logical ordering, like the `item_size` column in our dataset. It creates a mapping such that each category is represented by a number, which is the order of the category in the column. -Given our the data types of our columns, we can `encode` them and be on our way to making some visualizations. 
This simply involves `translating` a column with `categorical values` for example our columns of type *char*, into one or more `numeric columns` that take the place of the original. - Something we did in our [last lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/3-Linear/solution/lesson_3.html). +2. Categorical encoder: it suits well for nominal variables, which are categorical variables where their data does not follow a logical ordering, like all the features different from `item_size` in our dataset. It is a one-hot encoding, which means that each category is represented by a binary column: the encoded variable is equal to 1 if the pumpkin belongs to that Variety and 0 otherwise. Tidymodels provides yet another neat package: [recipes](https://recipes.tidymodels.org/)- a package for preprocessing data. We'll define a `recipe` that specifies that all predictor columns should be encoded into a set of integers , `prep` it to estimates the required quantities and statistics needed by any operations and finally `bake` to apply the computations to new data. 
@@ -158,53 +177,56 @@ Tidymodels provides yet another neat package: [recipes](https://recipes.tidymode
 ```{r recipe_prep_bake}
 # Preprocess and extract data to allow some data analysis
-baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%
-  # Encode all columns to a set of integers
-  step_integer(all_predictors(), zero_based = T) %>%
-  prep() %>%
+baked_pumpkins <- recipe(color ~ ., data = pumpkins_select) %>%
+  # Define ordering for item_size column
+  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%
+  # Convert factors to numbers using the order defined above (Ordinal encoding)
+  step_integer(item_size, zero_based = F) %>%
+  # Encode all other predictors using one hot encoding
+  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
+  prep(training = pumpkins_select) %>%
   bake(new_data = NULL)
-
 # Display the first few rows of preprocessed data
 baked_pumpkins %>%
   slice_head(n = 5)
-
-```
-
-Now let's compare the feature distributions for each label value using box plots. We'll begin by formatting the data to a *long* format to make it somewhat easier to make multiple `facets`.
-
-```{r pivot}
-# Pivot data to long format
-baked_pumpkins_long <- baked_pumpkins %>%
-  pivot_longer(!color, names_to = "features", values_to = "values")
-
-
-# Print out restructured data
-baked_pumpkins_long %>%
-  slice_head(n = 10)
-
 ```
-
-Now, let's make some boxplots showing the distribution of the predictors with respect to the outcome color!
-
-```{r boxplots}
-theme_set(theme_light())
-#Make a box plot for each predictor feature
-baked_pumpkins_long %>%
-  mutate(color = factor(color)) %>%
-  ggplot(mapping = aes(x = color, y = values, fill = features)) +
-  geom_boxplot() +
-  facet_wrap(~ features, scales = "free", ncol = 3) +
-  scale_color_viridis_d(option = "cividis", end = .8) +
-  theme(legend.position = "none")
+✅ What are the advantages of using an ordinal encoder for the Item Size column?
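The two encoder types described above can be illustrated without recipes. The following is a minimal base R sketch, assuming a small hypothetical `toy` data frame (its rows are made up for illustration and are not taken from the pumpkin data); it is not the lesson's own pipeline, only a hand-rolled equivalent of the idea behind `step_integer()` and `step_dummy()`.

```r
# Hypothetical toy rows, for illustration only
toy <- data.frame(
  item_size = c("sml", "lge", "med", "jbo"),
  variety   = c("PIE TYPE", "MINIATURE", "PIE TYPE", "HOWDEN TYPE"),
  stringsAsFactors = FALSE
)

# Ordinal encoding: declare the logical order, then take the 1-based integer codes
size_levels <- c("sml", "med", "med-lge", "lge", "xlge", "jbo", "exjbo")
toy$item_size_encoded <- as.integer(
  factor(toy$item_size, levels = size_levels, ordered = TRUE)
)

# One-hot encoding: one 0/1 indicator column per category
# (removing the intercept with "- 1" keeps a column for every level)
variety_onehot <- model.matrix(~ variety - 1, data = toy)

print(toy$item_size_encoded)
print(colnames(variety_onehot))
```

The ordinal codes preserve the size ordering in a single numeric column, while the one-hot matrix has exactly one 1 per row, so no artificial ordering is imposed on the varieties.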
+
+### Analyse relationships between variables
+
+Now that we have pre-processed our data, we can analyse the relationships between the features and the label to grasp an idea of how well the model will be able to predict the label given the features. The best way to perform this kind of analysis is plotting the data.
+We'll again use ggplot2's `geom_boxplot()` function to visualize the relationships between Item Size, Variety and Color in a categorical plot. To better plot the data we'll be using the encoded Item Size column and the unencoded Variety column.
+
+```{r boxplot}
+# Define the color palette
+palette <- c(ORANGE = "orange", WHITE = "wheat")
+
+# We need the encoded Item Size column to use it as the x-axis values in the plot
+pumpkins_select_plot<-pumpkins_select
+pumpkins_select_plot$item_size <- baked_pumpkins$item_size
+
+# Create the grouped box plot
+ggplot(pumpkins_select_plot, aes(x = `item_size`, y = color, fill = color)) +
+  geom_boxplot() +
+  facet_grid(variety ~ ., scales = "free_x") +
+  scale_fill_manual(values = palette) +
+  labs(x = "Item Size", y = "") +
+  theme_minimal() +
+  theme(strip.text = element_text(size = 12)) +
+  theme(axis.text.x = element_text(size = 10)) +
+  theme(axis.title.x = element_text(size = 12)) +
+  theme(axis.title.y = element_blank()) +
+  theme(legend.position = "bottom") +
+  guides(fill = guide_legend(title = "Color")) +
+  theme(panel.spacing = unit(0.5, "lines"))+
+  theme(strip.text.y = element_text(size = 4, hjust = 0))
 ```
-Amazing🤩! For some of the features, there's a noticeable difference in the distribution for each color label. For instance, it seems the white pumpkins can be found in smaller packages and in some particular varieties of pumpkins. The *item_size* category also seems to make a difference in the color distribution. These features may help predict the color of a pumpkin.
-
-#### **Use a swarm plot**
+#### Use a swarm plot
-Color is a binary category (Orange or Not), it's called `categorical data`. There are other various ways of [visualizing categorical data](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar).
+Since Color is a binary category (White or Not), it needs 'a [specialized approach](https://github.com/rstudio/cheatsheets/blob/main/data-visualization.pdf) to visualization'.
 Try a `swarm plot` to show the distribution of color with respect to the item_size.
@@ -220,37 +242,11 @@ baked_pumpkins %>%
   theme(legend.position = "none")
 ```
-#### **Violin plot**
-
-A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. [`Violin plots`](https://en.wikipedia.org/wiki/Violin_plot) are similar to box plots, except that they also show the probability density of the data at different values. Violin plots don't work so well with smaller datasets as the distribution is displayed more 'smoothly'.
-
-```{r violin_plot}
-# Create a violin plot of color and item_size
-baked_pumpkins %>%
-  mutate(color = factor(color)) %>%
-  ggplot(mapping = aes(x = color, y = item_size, fill = color)) +
-  geom_violin() +
-  geom_boxplot(color = "black", fill = "white", width = 0.02) +
-  scale_fill_brewer(palette = "Dark2", direction = -1) +
-  theme(legend.position = "none")
-
-```
-
 Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
-## 3. Build your model
+## Build your model
-> **🧮 Show Me The Math**
->
-> Remember how `linear regression` often used `ordinary least squares` to arrive at a value? `Logistic regression` relies on the concept of 'maximum likelihood' using [`sigmoid functions`](https://wikipedia.org/wiki/Sigmoid_function). A Sigmoid Function on a plot looks like an `S shape`. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
->
-> ![](../../images/sigmoid.png)
->
-> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class 1 of the binary choice. If not, it will be classified as 0.
-
-Let's begin by splitting the data into `training` and `test` sets. The training set is used to train a classifier so that it finds a statistical relationship between the features and the label value.
-
-It is best practice to hold out some of your data for **testing** in order to get a better estimate of how your models will perform on new data by comparing the predicted labels with the already known labels in the test set. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
+Select the variables you want to use in your classification model and split the data into training and test sets. [rsample](https://rsample.tidymodels.org/), a package in Tidymodels, provides infrastructure for efficient data splitting and resampling:
 ```{r split_data}
 # Split data into 80% for training and 20% for testing
@@ -265,28 +261,25 @@ pumpkins_test <- testing(pumpkins_split)
 # Print out the first 5 rows of the training set
 pumpkins_train %>%
   slice_head(n = 5)
-
-
 ```
 🙌 We are now ready to train a model by fitting the training features to the training label (color).
-We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling i.e: encoding categorical variables into a set of integers.
+We'll begin by creating a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling, i.e., encoding categorical variables into a set of integers. Just like `baked_pumpkins`, we create a `pumpkins_recipe` but do not `prep` and `bake` it, since it will be bundled into a workflow, which you will see in just a few steps from now.
 There are quite a number of ways to specify a logistic regression model in Tidymodels. See `?logistic_reg()` For now, we'll specify a logistic regression model via the default `stats::glm()` engine.
 ```{r log_reg}
 # Create a recipe that specifies preprocessing steps for modelling
 pumpkins_recipe <- recipe(color ~ ., data = pumpkins_train) %>%
-  step_integer(all_predictors(), zero_based = TRUE)
-
+  step_mutate(item_size = ordered(item_size, levels = c('sml', 'med', 'med-lge', 'lge', 'xlge', 'jbo', 'exjbo'))) %>%
+  step_integer(item_size, zero_based = F) %>%
+  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)
 # Create a logistic model specification
 log_reg <- logistic_reg() %>%
   set_engine("glm") %>%
   set_mode("classification")
-
-
 ```
 Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities.
@@ -301,12 +294,11 @@ log_reg_wf <- workflow() %>%
 # Print out the workflow
 log_reg_wf
-
-
 ```
 After a workflow has been *specified*, a model can be `trained` using the [`fit()`](https://tidymodels.github.io/parsnip/reference/fit.html) function. The workflow will estimate a recipe and preprocess the data before training, so we won't have to manually do that using prep and bake.
+
 ```{r train}
 # Train the model
 wf_fit <- log_reg_wf %>%
@@ -314,12 +306,11 @@ wf_fit <- log_reg_wf %>%
 # Print the trained workflow
 wf_fit
-
 ```
 The model print out shows the coefficients learned during training.
-Now we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the probability is more than 0.5, the predict class is `ORANGE` else `WHITE`.
+Now we've trained the model using the training data, we can make predictions on the test data using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). Let's start by using the model to predict labels for our test set and the probabilities for each label. When the probability is more than 0.5, the predicted class is `WHITE`; otherwise it is `ORANGE`.
 ```{r test_pred}
 # Make predictions for color and corresponding probabilities
@@ -332,11 +323,12 @@ results <- pumpkins_test %>% select(color) %>%
 # Compare predictions
 results %>%
   slice_head(n = 10)
-
 ```
 Very nice! This provides some more insights into how logistic regression works.
+### Better comprehension via a confusion matrix
+
 Comparing each prediction with its corresponding "ground truth" actual value isn't a very efficient way to determine how well the model is predicting. Fortunately, Tidymodels has a few more tricks up its sleeve: [`yardstick`](https://yardstick.tidymodels.org/) - a package used to measure the effectiveness of models using performance metrics. One performance metric associated with classification problems is the [`confusion matrix`](https://wikipedia.org/wiki/Confusion_matrix). A confusion matrix describes how well a classification model performs. A confusion matrix tabulates how many examples in each class were correctly classified by a model. In our case, it will show you how many orange pumpkins were classified as orange and how many white pumpkins were classified as white; the confusion matrix also shows you how many were classified into the **wrong** categories.
@@ -346,19 +338,17 @@ The [**`conf_mat()`**](https://tidymodels.github.io/yardstick/reference/conf_mat
 ```{r conf_mat}
 # Confusion matrix for prediction results
 conf_mat(data = results, truth = color, estimate = .pred_class)
-
-
 ```
-Let's interpret the confusion matrix. Our model is asked to classify pumpkins between two binary categories, category `orange` and category `not-orange`
+Let's interpret the confusion matrix. Our model is asked to classify pumpkins between two binary categories, category `white` and category `not-white`
-- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality we call it a `true positive`, shown by the top left number.
+- If your model predicts a pumpkin as white and it belongs to category 'white' in reality we call it a `true positive`, shown by the top left number.
-- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality we call it a `false negative`, shown by the bottom left number.
+- If your model predicts a pumpkin as not white and it belongs to category 'white' in reality we call it a `false negative`, shown by the bottom left number.
-- If your model predicts a pumpkin as orange and it belongs to category 'not-orange' in reality we call it a `false positive`, shown by the top right number.
+- If your model predicts a pumpkin as white and it belongs to category 'not-white' in reality we call it a `false positive`, shown by the top right number.
-- If your model predicts a pumpkin as not orange and it belongs to category 'not-orange' in reality we call it a `true negative`, shown by the bottom right number.
+- If your model predicts a pumpkin as not white and it belongs to category 'not-white' in reality we call it a `true negative`, shown by the bottom right number.
 | Truth |
 |:-----:|
@@ -366,9 +356,9 @@ Let's interpret the confusion matrix. Our model is asked to classify pumpkins be
 | | | |
 |---------------|--------|-------|
-| **Predicted** | ORANGE | WHITE |
-| ORANGE | TP | FP |
-| WHITE | FN | TN |
+| **Predicted** | WHITE | ORANGE |
+| WHITE | TP | FP |
+| ORANGE | FN | TN |
 As you might have guessed it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
@@ -392,18 +382,15 @@ eval_metrics <- metric_set(ppv, recall, spec, f_meas, accuracy)
 eval_metrics(data = results, truth = color, estimate = .pred_class)
 ```
-#### **Visualize the ROC curve of this model**
+## Visualize the ROC curve of this model
-For a start, this is not a bad model; its precision, recall, F measure and accuracy are in the 80% range so ideally you could use it to predict the color of a pumpkin given a set of variables. It also seems that our model was not really able to identify the white pumpkins 🧐. Could you guess why? One reason could be because of the high prevalence of ORANGE pumpkins in our training set making our model more inclined to predict the majority class.
-
-Let's do one more visualization to see the so-called [`ROC score`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):
+Let's do one more visualization to see the so-called [`ROC curve`](https://en.wikipedia.org/wiki/Receiver_operating_characteristic):
 ```{r roc_curve}
 # Make a roc_curve
 results %>%
   roc_curve(color, .pred_ORANGE) %>%
   autoplot()
-
 ```
 ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature `True Positive Rate`/Sensitivity on the Y axis, and `False Positive Rate`/1-Specificity on the X axis. Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly.
@@ -414,17 +401,17 @@ Finally, let's use `yardstick::roc_auc()` to calculate the actual Area Under the
 # Calculate area under curve
 results %>%
   roc_auc(color, .pred_ORANGE)
-
 ```
-The result is around `0.67053`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.
+The result is around `0.975`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is *pretty good*.
 In future lessons on classifications, you will learn how to improve your model's scores (such as dealing with imbalanced data in this case).
-But for now, congratulations 🎉🎉🎉! You've completed these regression lessons!
+## 🚀Challenge
-You R awesome!
+There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.
-![Artwork by \@allison_horst](../../images/r_learners_sm.jpeg)
+## Review & Self Study
+Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about tasks that are better suited for one or the other type of regression tasks that we have studied up to this point. What would work best?
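To make the confusion-matrix cells and the AUC discussed in this lesson more concrete, here is a minimal base R sketch that computes them by hand. The labels and probabilities are hypothetical values invented for illustration, and the sketch assumes ORANGE is treated as the positive class, matching the `.pred_ORANGE` column passed to `roc_curve()` and `roc_auc()`; it does not use yardstick and is not the lesson's own evaluation code.

```r
# Hypothetical truth labels and predicted probabilities of the ORANGE class
truth <- factor(
  c("ORANGE", "ORANGE", "ORANGE", "WHITE", "WHITE", "ORANGE", "WHITE", "ORANGE"),
  levels = c("ORANGE", "WHITE")
)
p_orange <- c(0.9, 0.8, 0.4, 0.3, 0.6, 0.7, 0.2, 0.95)

# Predict ORANGE when its probability exceeds 0.5, WHITE otherwise
pred <- factor(ifelse(p_orange > 0.5, "ORANGE", "WHITE"), levels = levels(truth))

# Confusion matrix: rows are predictions, columns are the truth
cm <- table(Predicted = pred, Truth = truth)

# Cells, with ORANGE assumed to be the positive class
tp <- cm["ORANGE", "ORANGE"]
fp <- cm["ORANGE", "WHITE"]
fn <- cm["WHITE", "ORANGE"]
tn <- cm["WHITE", "WHITE"]

precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
accuracy    <- (tp + tn) / sum(cm)

# AUC as the share of (positive, negative) pairs ranked correctly, ties counting half
pos <- p_orange[truth == "ORANGE"]
neg <- p_orange[truth == "WHITE"]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(cm)
print(c(precision = precision, recall = recall,
        specificity = specificity, accuracy = accuracy, auc = auc))
```

Reading the AUC as "the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one" explains why it does not depend on the 0.5 threshold, whereas precision, recall, specificity and accuracy all do.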
diff --git a/2-Regression/4-Logistic/solution/R/lesson_4.html b/2-Regression/4-Logistic/solution/R/lesson_4.html index f81e8cfa3..c44a56d2b 100644 --- a/2-Regression/4-Logistic/solution/R/lesson_4.html +++ b/2-Regression/4-Logistic/solution/R/lesson_4.html @@ -83,18 +83,12 @@ h6 {font-size: 12px;} code {color: inherit; background-color: rgba(0, 0, 0, 0.04);} pre:not([class]) { background-color: white } - +!function(t){"use strict";"function"==typeof define&&define.amd?define(["jquery"],t):t(jQuery)}(function(V){"use strict";V.ui=V.ui||{};V.ui.version="1.13.2";var n,i=0,a=Array.prototype.hasOwnProperty,r=Array.prototype.slice;V.cleanData=(n=V.cleanData,function(t){for(var e,i,s=0;null!=(i=t[s]);s++)(e=V._data(i,"events"))&&e.remove&&V(i).triggerHandler("remove");n(t)}),V.widget=function(t,i,e){var s,n,o,a={},r=t.split(".")[0],l=r+"-"+(t=t.split(".")[1]);return e||(e=i,i=V.Widget),Array.isArray(e)&&(e=V.extend.apply(null,[{}].concat(e))),V.expr.pseudos[l.toLowerCase()]=function(t){return!!V.data(t,l)},V[r]=V[r]||{},s=V[r][t],n=V[r][t]=function(t,e){if(!this||!this._createWidget)return new n(t,e);arguments.length&&this._createWidget(t,e)},V.extend(n,s,{version:e.version,_proto:V.extend({},e),_childConstructors:[]}),(o=new i).options=V.widget.extend({},o.options),V.each(e,function(e,s){function n(){return i.prototype[e].apply(this,arguments)}function o(t){return i.prototype[e].apply(this,t)}a[e]="function"==typeof s?function(){var t,e=this._super,i=this._superApply;return this._super=n,this._superApply=o,t=s.apply(this,arguments),this._super=e,this._superApply=i,t}:s}),n.prototype=V.widget.extend(o,{widgetEventPrefix:s&&o.widgetEventPrefix||t},a,{constructor:n,namespace:r,widgetName:t,widgetFullName:l}),s?(V.each(s._childConstructors,function(t,e){var i=e.prototype;V.widget(i.namespace+"."+i.widgetName,n,e._proto)}),delete s._childConstructors):i._childConstructors.push(n),V.widget.bridge(t,n),n},V.widget.extend=function(t){for(var 
e,i,s=r.call(arguments,1),n=0,o=s.length;n",options:{classes:{},disabled:!1,create:null},_createWidget:function(t,e){e=V(e||this.defaultElement||this)[0],this.element=V(e),this.uuid=i++,this.eventNamespace="."+this.widgetName+this.uuid,this.bindings=V(),this.hoverable=V(),this.focusable=V(),this.classesElementLookup={},e!==this&&(V.data(e,this.widgetFullName,this),this._on(!0,this.element,{remove:function(t){t.target===e&&this.destroy()}}),this.document=V(e.style?e.ownerDocument:e.document||e),this.window=V(this.document[0].defaultView||this.document[0].parentWindow)),this.options=V.widget.extend({},this.options,this._getCreateOptions(),t),this._create(),this.options.disabled&&this._setOptionDisabled(this.options.disabled),this._trigger("create",null,this._getCreateEventData()),this._init()},_getCreateOptions:function(){return{}},_getCreateEventData:V.noop,_create:V.noop,_init:V.noop,destroy:function(){var i=this;this._destroy(),V.each(this.classesElementLookup,function(t,e){i._removeClass(e,t)}),this.element.off(this.eventNamespace).removeData(this.widgetFullName),this.widget().off(this.eventNamespace).removeAttr("aria-disabled"),this.bindings.off(this.eventNamespace)},_destroy:V.noop,widget:function(){return this.element},option:function(t,e){var i,s,n,o=t;if(0===arguments.length)return V.widget.extend({},this.options);if("string"==typeof t)if(o={},t=(i=t.split(".")).shift(),i.length){for(s=o[t]=V.widget.extend({},this.options[t]),n=0;n
"),i=e.children()[0];return V("body").append(e),t=i.offsetWidth,e.css("overflow","scroll"),t===(i=i.offsetWidth)&&(i=e[0].clientWidth),e.remove(),s=t-i},getScrollInfo:function(t){var e=t.isWindow||t.isDocument?"":t.element.css("overflow-x"),i=t.isWindow||t.isDocument?"":t.element.css("overflow-y"),e="scroll"===e||"auto"===e&&t.widthx(k(s),k(n))?o.important="horizontal":o.important="vertical",u.using.call(this,t,o)}),a.offset(V.extend(h,{using:t}))})},V.ui.position={fit:{left:function(t,e){var i=e.within,s=i.isWindow?i.scrollLeft:i.offset.left,n=i.width,o=t.left-e.collisionPosition.marginLeft,a=s-o,r=o+e.collisionWidth-n-s;e.collisionWidth>n?0n?0")[0],w=d.each;function P(t){return null==t?t+"":"object"==typeof t?p[e.call(t)]||"object":typeof t}function M(t,e,i){var s=v[e.type]||{};return null==t?i||!e.def?null:e.def:(t=s.floor?~~t:parseFloat(t),isNaN(t)?e.def:s.mod?(t+s.mod)%s.mod:Math.min(s.max,Math.max(0,t)))}function S(s){var n=m(),o=n._rgba=[];return s=s.toLowerCase(),w(g,function(t,e){var i=e.re.exec(s),i=i&&e.parse(i),e=e.space||"rgba";if(i)return i=n[e](i),n[_[e].cache]=i[_[e].cache],o=n._rgba=i._rgba,!1}),o.length?("0,0,0,0"===o.join()&&d.extend(o,B.transparent),n):B[s]}function H(t,e,i){return 6*(i=(i+1)%1)<1?t+(e-t)*i*6:2*i<1?e:3*i<2?t+(e-t)*(2/3-i)*6:t}y.style.cssText="background-color:rgba(1,1,1,.5)",b.rgba=-1o.mod/2?s+=o.mod:s-n>o.mod/2&&(s-=o.mod)),l[i]=M((n-s)*a+s,e)))}),this[e](l)},blend:function(t){if(1===this._rgba[3])return this;var e=this._rgba.slice(),i=e.pop(),s=m(t)._rgba;return m(d.map(e,function(t,e){return(1-i)*s[e]+i*t}))},toRgbaString:function(){var t="rgba(",e=d.map(this._rgba,function(t,e){return null!=t?t:2").addClass("ui-effects-wrapper").css({fontSize:"100%",background:"transparent",border:"none",margin:0,padding:0}),e={width:i.width(),height:i.height()},n=document.activeElement;try{n.id}catch(t){n=document.body}return 
i.wrap(t),i[0]!==n&&!V.contains(i[0],n)||V(n).trigger("focus"),t=i.parent(),"static"===i.css("position")?(t.css({position:"relative"}),i.css({position:"relative"})):(V.extend(s,{position:i.css("position"),zIndex:i.css("z-index")}),V.each(["top","left","bottom","right"],function(t,e){s[e]=i.css(e),isNaN(parseInt(s[e],10))&&(s[e]="auto")}),i.css({position:"relative",top:0,left:0,right:"auto",bottom:"auto"})),i.css(e),t.css(s).show()},removeWrapper:function(t){var e=document.activeElement;return t.parent().is(".ui-effects-wrapper")&&(t.parent().replaceWith(t),t[0]!==e&&!V.contains(t[0],e)||V(e).trigger("focus")),t}}),V.extend(V.effects,{version:"1.13.2",define:function(t,e,i){return i||(i=e,e="effect"),V.effects.effect[t]=i,V.effects.effect[t].mode=e,i},scaledDimensions:function(t,e,i){if(0===e)return{height:0,width:0,outerHeight:0,outerWidth:0};var s="horizontal"!==i?(e||100)/100:1,e="vertical"!==i?(e||100)/100:1;return{height:t.height()*e,width:t.width()*s,outerHeight:t.outerHeight()*e,outerWidth:t.outerWidth()*s}},clipToBox:function(t){return{width:t.clip.right-t.clip.left,height:t.clip.bottom-t.clip.top,left:t.clip.left,top:t.clip.top}},unshift:function(t,e,i){var s=t.queue();1").insertAfter(t).css({display:/^(inline|ruby)/.test(t.css("display"))?"inline-block":"block",visibility:"hidden",marginTop:t.css("marginTop"),marginBottom:t.css("marginBottom"),marginLeft:t.css("marginLeft"),marginRight:t.css("marginRight"),float:t.css("float")}).outerWidth(t.outerWidth()).outerHeight(t.outerHeight()).addClass("ui-effects-placeholder"),t.data(j+"placeholder",e)),t.css({position:i,left:s.left,top:s.top}),e},removePlaceholder:function(t){var e=j+"placeholder",i=t.data(e);i&&(i.remove(),t.removeData(e))},cleanUp:function(t){V.effects.restoreStyle(t),V.effects.removePlaceholder(t)},setTransition:function(s,t,n,o){return o=o||{},V.each(t,function(t,e){var 
i=s.cssUnit(e);0");l.appendTo("body").addClass(t.className).css({top:s.top-a,left:s.left-r,height:i.innerHeight(),width:i.innerWidth(),position:n?"fixed":"absolute"}).animate(o,t.duration,t.easing,function(){l.remove(),"function"==typeof e&&e()})}}),V.fx.step.clip=function(t){t.clipInit||(t.start=V(t.elem).cssClip(),"string"==typeof t.end&&(t.end=G(t.end,t.elem)),t.clipInit=!0),V(t.elem).cssClip({top:t.pos*(t.end.top-t.start.top)+t.start.top,right:t.pos*(t.end.right-t.start.right)+t.start.right,bottom:t.pos*(t.end.bottom-t.start.bottom)+t.start.bottom,left:t.pos*(t.end.left-t.start.left)+t.start.left})},Y={},V.each(["Quad","Cubic","Quart","Quint","Expo"],function(e,t){Y[t]=function(t){return Math.pow(t,e+2)}}),V.extend(Y,{Sine:function(t){return 1-Math.cos(t*Math.PI/2)},Circ:function(t){return 1-Math.sqrt(1-t*t)},Elastic:function(t){return 0===t||1===t?t:-Math.pow(2,8*(t-1))*Math.sin((80*(t-1)-7.5)*Math.PI/15)},Back:function(t){return t*t*(3*t-2)},Bounce:function(t){for(var e,i=4;t<((e=Math.pow(2,--i))-1)/11;);return 1/Math.pow(4,3-i)-7.5625*Math.pow((3*e-2)/22-t,2)}}),V.each(Y,function(t,e){V.easing["easeIn"+t]=e,V.easing["easeOut"+t]=function(t){return 1-e(1-t)},V.easing["easeInOut"+t]=function(t){return t<.5?e(2*t)/2:1-e(-2*t+2)/2}});y=V.effects,V.effects.define("blind","hide",function(t,e){var i={up:["bottom","top"],vertical:["bottom","top"],down:["top","bottom"],left:["right","left"],horizontal:["right","left"],right:["left","right"]},s=V(this),n=t.direction||"up",o=s.cssClip(),a={clip:V.extend({},o)},r=V.effects.createPlaceholder(s);a.clip[i[n][0]]=a.clip[i[n][1]],"show"===t.mode&&(s.cssClip(a.clip),r&&r.css(V.effects.clipToBox(a)),a.clip=o),r&&r.animate(V.effects.clipToBox(a),t.duration,t.easing),s.animate(a,{queue:!1,duration:t.duration,easing:t.easing,complete:e})}),V.effects.define("bounce",function(t,e){var 
i,s,n=V(this),o=t.mode,a="hide"===o,r="show"===o,l=t.direction||"up",h=t.distance,c=t.times||5,o=2*c+(r||a?1:0),u=t.duration/o,d=t.easing,p="up"===l||"down"===l?"top":"left",f="up"===l||"left"===l,g=0,t=n.queue().length;for(V.effects.createPlaceholder(n),l=n.css(p),h=h||n["top"==p?"outerHeight":"outerWidth"]()/3,r&&((s={opacity:1})[p]=l,n.css("opacity",0).css(p,f?2*-h:2*h).animate(s,u,d)),a&&(h/=Math.pow(2,c-1)),(s={})[p]=l;g").css({position:"absolute",visibility:"visible",left:-s*p,top:-i*f}).parent().addClass("ui-effects-explode").css({position:"absolute",overflow:"hidden",width:p,height:f,left:n+(u?a*p:0),top:o+(u?r*f:0),opacity:u?0:1}).animate({left:n+(u?0:a*p),top:o+(u?0:r*f),opacity:u?1:0},t.duration||500,t.easing,m)}),V.effects.define("fade","toggle",function(t,e){var i="show"===t.mode;V(this).css("opacity",i?0:1).animate({opacity:i?1:0},{queue:!1,duration:t.duration,easing:t.easing,complete:e})}),V.effects.define("fold","hide",function(e,t){var i=V(this),s=e.mode,n="show"===s,o="hide"===s,a=e.size||15,r=/([0-9]+)%/.exec(a),l=!!e.horizFirst?["right","bottom"]:["bottom","right"],h=e.duration/2,c=V.effects.createPlaceholder(i),u=i.cssClip(),d={clip:V.extend({},u)},p={clip:V.extend({},u)},f=[u[l[0]],u[l[1]]],s=i.queue().length;r&&(a=parseInt(r[1],10)/100*f[o?0:1]),d.clip[l[0]]=a,p.clip[l[0]]=a,p.clip[l[1]]=0,n&&(i.cssClip(p.clip),c&&c.css(V.effects.clipToBox(p)),p.clip=u),i.queue(function(t){c&&c.animate(V.effects.clipToBox(d),h,e.easing).animate(V.effects.clipToBox(p),h,e.easing),t()}).animate(d,h,e.easing).animate(p,h,e.easing).queue(t),V.effects.unshift(i,s,4)}),V.effects.define("highlight","show",function(t,e){var i=V(this),s={backgroundColor:i.css("backgroundColor")};"hide"===t.mode&&(s.opacity=0),V.effects.saveStyle(i),i.css({backgroundImage:"none",backgroundColor:t.color||"#ffff99"}).animate(s,{queue:!1,duration:t.duration,easing:t.easing,complete:e})}),V.effects.define("size",function(s,e){var 
n,i=V(this),t=["fontSize"],o=["borderTopWidth","borderBottomWidth","paddingTop","paddingBottom"],a=["borderLeftWidth","borderRightWidth","paddingLeft","paddingRight"],r=s.mode,l="effect"!==r,h=s.scale||"both",c=s.origin||["middle","center"],u=i.css("position"),d=i.position(),p=V.effects.scaledDimensions(i),f=s.from||p,g=s.to||V.effects.scaledDimensions(i,0);V.effects.createPlaceholder(i),"show"===r&&(r=f,f=g,g=r),n={from:{y:f.height/p.height,x:f.width/p.width},to:{y:g.height/p.height,x:g.width/p.width}},"box"!==h&&"both"!==h||(n.from.y!==n.to.y&&(f=V.effects.setTransition(i,o,n.from.y,f),g=V.effects.setTransition(i,o,n.to.y,g)),n.from.x!==n.to.x&&(f=V.effects.setTransition(i,a,n.from.x,f),g=V.effects.setTransition(i,a,n.to.x,g))),"content"!==h&&"both"!==h||n.from.y!==n.to.y&&(f=V.effects.setTransition(i,t,n.from.y,f),g=V.effects.setTransition(i,t,n.to.y,g)),c&&(c=V.effects.getBaseline(c,p),f.top=(p.outerHeight-f.outerHeight)*c.y+d.top,f.left=(p.outerWidth-f.outerWidth)*c.x+d.left,g.top=(p.outerHeight-g.outerHeight)*c.y+d.top,g.left=(p.outerWidth-g.outerWidth)*c.x+d.left),delete f.outerHeight,delete f.outerWidth,i.css(f),"content"!==h&&"both"!==h||(o=o.concat(["marginTop","marginBottom"]).concat(t),a=a.concat(["marginLeft","marginRight"]),i.find("*[width]").each(function(){var t=V(this),e=V.effects.scaledDimensions(t),i={height:e.height*n.from.y,width:e.width*n.from.x,outerHeight:e.outerHeight*n.from.y,outerWidth:e.outerWidth*n.from.x},e={height:e.height*n.to.y,width:e.width*n.to.x,outerHeight:e.height*n.to.y,outerWidth:e.width*n.to.x};n.from.y!==n.to.y&&(i=V.effects.setTransition(t,o,n.from.y,i),e=V.effects.setTransition(t,o,n.to.y,e)),n.from.x!==n.to.x&&(i=V.effects.setTransition(t,a,n.from.x,i),e=V.effects.setTransition(t,a,n.to.x,e)),l&&V.effects.saveStyle(t),t.css(i),t.animate(e,s.duration,s.easing,function(){l&&V.effects.restoreStyle(t)})})),i.animate(g,{queue:!1,duration:s.duration,easing:s.easing,complete:function(){var 
t=i.offset();0===g.opacity&&i.css("opacity",f.opacity),l||(i.css("position","static"===u?"relative":u).offset(t),V.effects.saveStyle(i)),e()}})}),V.effects.define("scale",function(t,e){var i=V(this),s=t.mode,s=parseInt(t.percent,10)||(0===parseInt(t.percent,10)||"effect"!==s?0:100),s=V.extend(!0,{from:V.effects.scaledDimensions(i),to:V.effects.scaledDimensions(i,s,t.direction||"both"),origin:t.origin||["middle","center"]},t);t.fade&&(s.from.opacity=1,s.to.opacity=0),V.effects.effect.size.call(this,s,e)}),V.effects.define("puff","hide",function(t,e){t=V.extend(!0,{},t,{fade:!0,percent:parseInt(t.percent,10)||150});V.effects.effect.scale.call(this,t,e)}),V.effects.define("pulsate","show",function(t,e){var i=V(this),s=t.mode,n="show"===s,o=2*(t.times||5)+(n||"hide"===s?1:0),a=t.duration/o,r=0,l=1,s=i.queue().length;for(!n&&i.is(":visible")||(i.css("opacity",0).show(),r=1);l li > :first-child").add(t.find("> :not(li)").even())},heightStyle:"auto",icons:{activeHeader:"ui-icon-triangle-1-s",header:"ui-icon-triangle-1-e"},activate:null,beforeActivate:null},hideProps:{borderTopWidth:"hide",borderBottomWidth:"hide",paddingTop:"hide",paddingBottom:"hide",height:"hide"},showProps:{borderTopWidth:"show",borderBottomWidth:"show",paddingTop:"show",paddingBottom:"show",height:"show"},_create:function(){var t=this.options;this.prevShow=this.prevHide=V(),this._addClass("ui-accordion","ui-widget ui-helper-reset"),this.element.attr("role","tablist"),t.collapsible||!1!==t.active&&null!=t.active||(t.active=0),this._processPanels(),t.active<0&&(t.active+=this.headers.length),this._refresh()},_getCreateEventData:function(){return{header:this.active,panel:this.active.length?this.active.next():V()}},_createIcons:function(){var t,e=this.options.icons;e&&(t=V(""),this._addClass(t,"ui-accordion-header-icon","ui-icon 
"+e.header),t.prependTo(this.headers),t=this.active.children(".ui-accordion-header-icon"),this._removeClass(t,e.header)._addClass(t,null,e.activeHeader)._addClass(this.headers,"ui-accordion-icons"))},_destroyIcons:function(){this._removeClass(this.headers,"ui-accordion-icons"),this.headers.children(".ui-accordion-header-icon").remove()},_destroy:function(){var t;this.element.removeAttr("role"),this.headers.removeAttr("role aria-expanded aria-selected aria-controls tabIndex").removeUniqueId(),this._destroyIcons(),t=this.headers.next().css("display","").removeAttr("role aria-hidden aria-labelledby").removeUniqueId(),"content"!==this.options.heightStyle&&t.css("height","")},_setOption:function(t,e){"active"!==t?("event"===t&&(this.options.event&&this._off(this.headers,this.options.event),this._setupEvents(e)),this._super(t,e),"collapsible"!==t||e||!1!==this.options.active||this._activate(0),"icons"===t&&(this._destroyIcons(),e&&this._createIcons())):this._activate(e)},_setOptionDisabled:function(t){this._super(t),this.element.attr("aria-disabled",t),this._toggleClass(null,"ui-state-disabled",!!t),this._toggleClass(this.headers.add(this.headers.next()),null,"ui-state-disabled",!!t)},_keydown:function(t){if(!t.altKey&&!t.ctrlKey){var e=V.ui.keyCode,i=this.headers.length,s=this.headers.index(t.target),n=!1;switch(t.keyCode){case e.RIGHT:case e.DOWN:n=this.headers[(s+1)%i];break;case e.LEFT:case e.UP:n=this.headers[(s-1+i)%i];break;case e.SPACE:case e.ENTER:this._eventHandler(t);break;case e.HOME:n=this.headers[0];break;case e.END:n=this.headers[i-1]}n&&(V(t.target).attr("tabIndex",-1),V(n).attr("tabIndex",0),V(n).trigger("focus"),t.preventDefault())}},_panelKeyDown:function(t){t.keyCode===V.ui.keyCode.UP&&t.ctrlKey&&V(t.currentTarget).prev().trigger("focus")},refresh:function(){var 
t=this.options;this._processPanels(),!1===t.active&&!0===t.collapsible||!this.headers.length?(t.active=!1,this.active=V()):!1===t.active?this._activate(0):this.active.length&&!V.contains(this.element[0],this.active[0])?this.headers.length===this.headers.find(".ui-state-disabled").length?(t.active=!1,this.active=V()):this._activate(Math.max(0,t.active-1)):t.active=this.headers.index(this.active),this._destroyIcons(),this._refresh()},_processPanels:function(){var t=this.headers,e=this.panels;"function"==typeof this.options.header?this.headers=this.options.header(this.element):this.headers=this.element.find(this.options.header),this._addClass(this.headers,"ui-accordion-header ui-accordion-header-collapsed","ui-state-default"),this.panels=this.headers.next().filter(":not(.ui-accordion-content-active)").hide(),this._addClass(this.panels,"ui-accordion-content","ui-helper-reset ui-widget-content"),e&&(this._off(t.not(this.headers)),this._off(e.not(this.panels)))},_refresh:function(){var i,t=this.options,e=t.heightStyle,s=this.element.parent();this.active=this._findActive(t.active),this._addClass(this.active,"ui-accordion-header-active","ui-state-active")._removeClass(this.active,"ui-accordion-header-collapsed"),this._addClass(this.active.next(),"ui-accordion-content-active"),this.active.next().show(),this.headers.attr("role","tab").each(function(){var t=V(this),e=t.uniqueId().attr("id"),i=t.next(),s=i.uniqueId().attr("id");t.attr("aria-controls",s),i.attr("aria-labelledby",e)}).next().attr("role","tabpanel"),this.headers.not(this.active).attr({"aria-selected":"false","aria-expanded":"false",tabIndex:-1}).next().attr({"aria-hidden":"true"}).hide(),this.active.length?this.active.attr({"aria-selected":"true","aria-expanded":"true",tabIndex:0}).next().attr({"aria-hidden":"false"}):this.headers.eq(0).attr("tabIndex",0),this._createIcons(),this._setupEvents(t.event),"fill"===e?(i=s.height(),this.element.siblings(":visible").each(function(){var 
t=V(this),e=t.css("position");"absolute"!==e&&"fixed"!==e&&(i-=t.outerHeight(!0))}),this.headers.each(function(){i-=V(this).outerHeight(!0)}),this.headers.next().each(function(){V(this).height(Math.max(0,i-V(this).innerHeight()+V(this).height()))}).css("overflow","auto")):"auto"===e&&(i=0,this.headers.next().each(function(){var t=V(this).is(":visible");t||V(this).show(),i=Math.max(i,V(this).css("height","").height()),t||V(this).hide()}).height(i))},_activate:function(t){t=this._findActive(t)[0];t!==this.active[0]&&(t=t||this.active[0],this._eventHandler({target:t,currentTarget:t,preventDefault:V.noop}))},_findActive:function(t){return"number"==typeof t?this.headers.eq(t):V()},_setupEvents:function(t){var i={keydown:"_keydown"};t&&V.each(t.split(" "),function(t,e){i[e]="_eventHandler"}),this._off(this.headers.add(this.headers.next())),this._on(this.headers,i),this._on(this.headers.next(),{keydown:"_panelKeyDown"}),this._hoverable(this.headers),this._focusable(this.headers)},_eventHandler:function(t){var e=this.options,i=this.active,s=V(t.currentTarget),n=s[0]===i[0],o=n&&e.collapsible,a=o?V():s.next(),r=i.next(),a={oldHeader:i,oldPanel:r,newHeader:o?V():s,newPanel:a};t.preventDefault(),n&&!e.collapsible||!1===this._trigger("beforeActivate",t,a)||(e.active=!o&&this.headers.index(s),this.active=n?V():s,this._toggle(a),this._removeClass(i,"ui-accordion-header-active","ui-state-active"),e.icons&&(i=i.children(".ui-accordion-header-icon"),this._removeClass(i,null,e.icons.activeHeader)._addClass(i,null,e.icons.header)),n||(this._removeClass(s,"ui-accordion-header-collapsed")._addClass(s,"ui-accordion-header-active","ui-state-active"),e.icons&&(n=s.children(".ui-accordion-header-icon"),this._removeClass(n,null,e.icons.header)._addClass(n,null,e.icons.activeHeader)),this._addClass(s.next(),"ui-accordion-content-active")))},_toggle:function(t){var 
e=t.newPanel,i=this.prevShow.length?this.prevShow:t.oldPanel;this.prevShow.add(this.prevHide).stop(!0,!0),this.prevShow=e,this.prevHide=i,this.options.animate?this._animate(e,i,t):(i.hide(),e.show(),this._toggleComplete(t)),i.attr({"aria-hidden":"true"}),i.prev().attr({"aria-selected":"false","aria-expanded":"false"}),e.length&&i.length?i.prev().attr({tabIndex:-1,"aria-expanded":"false"}):e.length&&this.headers.filter(function(){return 0===parseInt(V(this).attr("tabIndex"),10)}).attr("tabIndex",-1),e.attr("aria-hidden","false").prev().attr({"aria-selected":"true","aria-expanded":"true",tabIndex:0})},_animate:function(t,i,e){var s,n,o,a=this,r=0,l=t.css("box-sizing"),h=t.length&&(!i.length||t.index()",delay:300,options:{icons:{submenu:"ui-icon-caret-1-e"},items:"> *",menus:"ul",position:{my:"left top",at:"right top"},role:"menu",blur:null,focus:null,select:null},_create:function(){this.activeMenu=this.element,this.mouseHandled=!1,this.lastMousePosition={x:null,y:null},this.element.uniqueId().attr({role:this.options.role,tabIndex:0}),this._addClass("ui-menu","ui-widget ui-widget-content"),this._on({"mousedown .ui-menu-item":function(t){t.preventDefault(),this._activateItem(t)},"click .ui-menu-item":function(t){var e=V(t.target),i=V(V.ui.safeActiveElement(this.document[0]));!this.mouseHandled&&e.not(".ui-state-disabled").length&&(this.select(t),t.isPropagationStopped()||(this.mouseHandled=!0),e.has(".ui-menu").length?this.expand(t):!this.element.is(":focus")&&i.closest(".ui-menu").length&&(this.element.trigger("focus",[!0]),this.active&&1===this.active.parents(".ui-menu").length&&clearTimeout(this.timer)))},"mouseenter .ui-menu-item":"_activateItem","mousemove .ui-menu-item":"_activateItem",mouseleave:"collapseAll","mouseleave .ui-menu":"collapseAll",focus:function(t,e){var 
i=this.active||this._menuItems().first();e||this.focus(t,i)},blur:function(t){this._delay(function(){V.contains(this.element[0],V.ui.safeActiveElement(this.document[0]))||this.collapseAll(t)})},keydown:"_keydown"}),this.refresh(),this._on(this.document,{click:function(t){this._closeOnDocumentClick(t)&&this.collapseAll(t,!0),this.mouseHandled=!1}})},_activateItem:function(t){var e,i;this.previousFilter||t.clientX===this.lastMousePosition.x&&t.clientY===this.lastMousePosition.y||(this.lastMousePosition={x:t.clientX,y:t.clientY},e=V(t.target).closest(".ui-menu-item"),i=V(t.currentTarget),e[0]===i[0]&&(i.is(".ui-state-active")||(this._removeClass(i.siblings().children(".ui-state-active"),null,"ui-state-active"),this.focus(t,i))))},_destroy:function(){var t=this.element.find(".ui-menu-item").removeAttr("role aria-disabled").children(".ui-menu-item-wrapper").removeUniqueId().removeAttr("tabIndex role aria-haspopup");this.element.removeAttr("aria-activedescendant").find(".ui-menu").addBack().removeAttr("role aria-labelledby aria-expanded aria-hidden aria-disabled tabIndex").removeUniqueId().show(),t.children().each(function(){var t=V(this);t.data("ui-menu-submenu-caret")&&t.remove()})},_keydown:function(t){var e,i,s,n=!0;switch(t.keyCode){case V.ui.keyCode.PAGE_UP:this.previousPage(t);break;case V.ui.keyCode.PAGE_DOWN:this.nextPage(t);break;case V.ui.keyCode.HOME:this._move("first","first",t);break;case V.ui.keyCode.END:this._move("last","last",t);break;case V.ui.keyCode.UP:this.previous(t);break;case V.ui.keyCode.DOWN:this.next(t);break;case V.ui.keyCode.LEFT:this.collapse(t);break;case V.ui.keyCode.RIGHT:this.active&&!this.active.is(".ui-state-disabled")&&this.expand(t);break;case V.ui.keyCode.ENTER:case V.ui.keyCode.SPACE:this._activate(t);break;case 
V.ui.keyCode.ESCAPE:this.collapse(t);break;default:e=this.previousFilter||"",s=n=!1,i=96<=t.keyCode&&t.keyCode<=105?(t.keyCode-96).toString():String.fromCharCode(t.keyCode),clearTimeout(this.filterTimer),i===e?s=!0:i=e+i,e=this._filterMenuItems(i),(e=s&&-1!==e.index(this.active.next())?this.active.nextAll(".ui-menu-item"):e).length||(i=String.fromCharCode(t.keyCode),e=this._filterMenuItems(i)),e.length?(this.focus(t,e),this.previousFilter=i,this.filterTimer=this._delay(function(){delete this.previousFilter},1e3)):delete this.previousFilter}n&&t.preventDefault()},_activate:function(t){this.active&&!this.active.is(".ui-state-disabled")&&(this.active.children("[aria-haspopup='true']").length?this.expand(t):this.select(t))},refresh:function(){var t,e,s=this,n=this.options.icons.submenu,i=this.element.find(this.options.menus);this._toggleClass("ui-menu-icons",null,!!this.element.find(".ui-icon").length),e=i.filter(":not(.ui-menu)").hide().attr({role:this.options.role,"aria-hidden":"true","aria-expanded":"false"}).each(function(){var t=V(this),e=t.prev(),i=V("").data("ui-menu-submenu-caret",!0);s._addClass(i,"ui-menu-icon","ui-icon "+n),e.attr("aria-haspopup","true").prepend(i),t.attr("aria-labelledby",e.attr("id"))}),this._addClass(e,"ui-menu","ui-widget ui-widget-content ui-front"),(t=i.add(this.element).find(this.options.items)).not(".ui-menu-item").each(function(){var t=V(this);s._isDivider(t)&&s._addClass(t,"ui-menu-divider","ui-widget-content")}),i=(e=t.not(".ui-menu-item, .ui-menu-divider")).children().not(".ui-menu").uniqueId().attr({tabIndex:-1,role:this._itemRole()}),this._addClass(e,"ui-menu-item")._addClass(i,"ui-menu-item-wrapper"),t.filter(".ui-state-disabled").attr("aria-disabled","true"),this.active&&!V.contains(this.element[0],this.active[0])&&this.blur()},_itemRole:function(){return{menu:"menuitem",listbox:"option"}[this.options.role]},_setOption:function(t,e){var 
i;"icons"===t&&(i=this.element.find(".ui-menu-icon"),this._removeClass(i,null,this.options.icons.submenu)._addClass(i,null,e.submenu)),this._super(t,e)},_setOptionDisabled:function(t){this._super(t),this.element.attr("aria-disabled",String(t)),this._toggleClass(null,"ui-state-disabled",!!t)},focus:function(t,e){var i;this.blur(t,t&&"focus"===t.type),this._scrollIntoView(e),this.active=e.first(),i=this.active.children(".ui-menu-item-wrapper"),this._addClass(i,null,"ui-state-active"),this.options.role&&this.element.attr("aria-activedescendant",i.attr("id")),i=this.active.parent().closest(".ui-menu-item").children(".ui-menu-item-wrapper"),this._addClass(i,null,"ui-state-active"),t&&"keydown"===t.type?this._close():this.timer=this._delay(function(){this._close()},this.delay),(i=e.children(".ui-menu")).length&&t&&/^mouse/.test(t.type)&&this._startOpening(i),this.activeMenu=e.parent(),this._trigger("focus",t,{item:e})},_scrollIntoView:function(t){var e,i,s;this._hasScroll()&&(i=parseFloat(V.css(this.activeMenu[0],"borderTopWidth"))||0,s=parseFloat(V.css(this.activeMenu[0],"paddingTop"))||0,e=t.offset().top-this.activeMenu.offset().top-i-s,i=this.activeMenu.scrollTop(),s=this.activeMenu.height(),t=t.outerHeight(),e<0?this.activeMenu.scrollTop(i+e):s",options:{appendTo:null,autoFocus:!1,delay:300,minLength:1,position:{my:"left top",at:"left bottom",collision:"none"},source:null,change:null,close:null,focus:null,open:null,response:null,search:null,select:null},requestIndex:0,pending:0,liveRegionTimer:null,_create:function(){var i,s,n,t=this.element[0].nodeName.toLowerCase(),e="textarea"===t,t="input"===t;this.isMultiLine=e||!t&&this._isContentEditable(this.element),this.valueMethod=this.element[e||t?"val":"text"],this.isNewMenu=!0,this._addClass("ui-autocomplete-input"),this.element.attr("autocomplete","off"),this._on(this.element,{keydown:function(t){if(this.element.prop("readOnly"))s=n=i=!0;else{s=n=i=!1;var e=V.ui.keyCode;switch(t.keyCode){case 
e.PAGE_UP:i=!0,this._move("previousPage",t);break;case e.PAGE_DOWN:i=!0,this._move("nextPage",t);break;case e.UP:i=!0,this._keyEvent("previous",t);break;case e.DOWN:i=!0,this._keyEvent("next",t);break;case e.ENTER:this.menu.active&&(i=!0,t.preventDefault(),this.menu.select(t));break;case e.TAB:this.menu.active&&this.menu.select(t);break;case e.ESCAPE:this.menu.element.is(":visible")&&(this.isMultiLine||this._value(this.term),this.close(t),t.preventDefault());break;default:s=!0,this._searchTimeout(t)}}},keypress:function(t){if(i)return i=!1,void(this.isMultiLine&&!this.menu.element.is(":visible")||t.preventDefault());if(!s){var e=V.ui.keyCode;switch(t.keyCode){case e.PAGE_UP:this._move("previousPage",t);break;case e.PAGE_DOWN:this._move("nextPage",t);break;case e.UP:this._keyEvent("previous",t);break;case e.DOWN:this._keyEvent("next",t)}}},input:function(t){if(n)return n=!1,void t.preventDefault();this._searchTimeout(t)},focus:function(){this.selectedItem=null,this.previous=this._value()},blur:function(t){clearTimeout(this.searching),this.close(t),this._change(t)}}),this._initSource(),this.menu=V("