 # Classification

 This notebook introduces the classification of two groups using classification trees. The data are real survey responses collected from a group of young people during a Statistics class at FSEV UK, and were retrieved from [Kaggle](https://www.kaggle.com/miroslavsabo/young-people-survey).

In [None]:
# install packages for Google Colab
install.packages(c("ggformula", "rpart.plot"))

# libraries
library(tidyverse)
library(ggformula)
library(rpart)
library(rpart.plot)

# Set visualization theme
theme_set(theme_bw(base_size = 16))

# Read in Data
survey_data <- read_csv("https://raw.githubusercontent.com/lebebr01/jshs-stat/main/data/survey_responses.csv")

# Process data names for easier access
names(survey_data) <- gsub("-|,|/", "", names(survey_data))
names(survey_data) <- gsub("\\s+", "_", names(survey_data))

# View the first few rows of the data
head(survey_data)
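
 The two `gsub()` calls above first strip hyphens, commas, and slashes, then replace each run of whitespace with a single underscore. Here is a minimal sketch of the effect, using a few column names that are assumed for illustration (they are not read from the file):

```r
# Illustrative column names (assumed for this sketch, not read from the data file)
example_names <- c("Village - town", "Fantasy/Fairy tales", "Sci-fi")

# Step 1: remove hyphens, commas, and slashes
step_one <- gsub("-|,|/", "", example_names)

# Step 2: replace each run of whitespace with one underscore
clean_names <- gsub("\\s+", "_", step_one)

clean_names
```

 Names cleaned this way can be used directly in a model formula without backticks.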


 ## Can we classify whether someone is from a city or rural location?

 In this example, we try to classify whether an individual grew up in a city or in a more rural location (labeled as a village in the data) based solely on some of their movie preferences. This can be framed as the following research questions:

 1. Are there differences in movie preferences between those who grew up in a city or rural location?
 2. Do different movie preferences accurately predict whether someone grew up in a city or rural location?


 Before we explicitly answer these research questions, let's first visualize the data to understand how many individuals in the data grew up in a city or rural location.

 ### How many grew up in a city or rural location?

In [None]:
gf_bar(~ Village_town, data = survey_data)


 ### Fit model

In [None]:
# Fit a classification tree predicting location from movie preference ratings
classify_model <- rpart(Village_town ~ Movies + Horror + Thriller + Comedy + Romantic + Scifi + 
  War + FantasyFairy_tales + Animated + Documentary + Western + Action, 
  data = survey_data, method = "class")

# Plot the fitted tree
rpart.plot(classify_model, roundint = FALSE, type = 3, branch = .3)


 ### Accuracy

 What could we use to evaluate how well the model does? Since the data record where each person actually grew up, we can compare the model's predicted values with the actual values.

In [None]:
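# predict() without newdata returns predictions for the rows used to fit the model.
# rpart drops rows with a missing outcome, so drop_na() keeps the rows aligned
# (assuming no row is missing every predictor).
survey_data_predict <- survey_data %>%
  drop_na(Village_town) %>%
  mutate(village_town_predict = predict(classify_model, type = 'class')) %>%
  cbind(predict(classify_model, type = 'prob'))

# Table of results
survey_data_predict %>%
  count(Village_town, village_town_predict)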
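
 The count above is a confusion matrix in long form. As a self-contained sketch of the same idea, the block below uses the `kyphosis` data that ships with `rpart` (a stand-in chosen for this example, not the survey data) and cross-tabulates actual against predicted classes with base R's `table()`:

```r
library(rpart)

# Stand-in example: kyphosis is a small data set bundled with rpart
toy_model <- rpart(Kyphosis ~ Age + Number + Start,
                   data = kyphosis, method = "class")

# Predicted class for each row used to fit the model
toy_predicted <- predict(toy_model, type = "class")

# Rows are the actual classes, columns are the predicted classes
confusion <- table(actual = kyphosis$Kyphosis, predicted = toy_predicted)
confusion

# Correct classifications sit on the diagonal
toy_accuracy <- sum(diag(confusion)) / sum(confusion)
toy_accuracy
```

 Reading across a row shows how the members of one actual group were split between the predicted classes.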


 #### Easier to visualize

In [None]:
gf_bar(~ Village_town, fill = ~village_town_predict, data = survey_data_predict, position = 'fill')


 #### Generate a Percentage

 We can generate an overall classification accuracy as a percentage too. Given what you know about the figure above, is there anything problematic about a single percentage?

In [None]:
survey_data_predict %>%
  mutate(same_class = ifelse(Village_town == village_town_predict, 1, 0)) %>%
  df_stats(~ same_class, mean, sum)


 #### Percentage by real groups

 The overall percentage is likely misleading for two main reasons. First, the two groups contain different numbers of people. This makes the overall classification accuracy look higher in this example, because the group that is classified more accurately also has more data. Second, the model performs differently for the two groups. Fortunately, we can generate a percentage for each group: one classification accuracy for those who grew up in the city and another for those who grew up in a rural location.

In [None]:
survey_data_predict %>%
  mutate(same_class = ifelse(Village_town == village_town_predict, 1, 0)) %>%
  df_stats(same_class ~ Village_town, mean, sum)
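
 To see why group sizes matter, consider a small sketch with made-up counts (hypothetical numbers, not results from the survey data): 700 city and 300 village respondents, with the model correct for 650 and 60 of them respectively.

```r
# Hypothetical counts chosen for illustration only
group_size    <- c(city = 700, village = 300)
correct_count <- c(city = 650, village = 60)

# Accuracy within each group
group_accuracy <- correct_count / group_size
group_accuracy

# Overall accuracy is a size-weighted average of the group accuracies
overall_accuracy <- sum(correct_count) / sum(group_size)
overall_accuracy

# Accuracy from always predicting the larger group
baseline_accuracy <- max(group_size) / sum(group_size)
baseline_accuracy
```

 In this sketch the overall accuracy (0.71) barely exceeds the 0.70 obtained by always predicting the larger group, even though only 20% of the village group is classified correctly.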


