# Project Title

In [15]:
library(tidyverse)
library(tidymodels) # provides initial_split() and training(), used below
library(repr)
options(repr.matrix.max.rows = 7)

## Introduction
While not the most dangerous drug, alcohol is one of the most popular drugs in the world, if not the most popular. Its consumption is not only popularized but normalized in modern society, regardless of any negative effects it may have. Many psychological factors, such as extraversion and neuroticism, can affect how often one consumes alcohol. Furthermore, minimum legal ages fail to stop children and young teens from consuming alcohol. The question we aim to answer in this project is **"How can we predict the frequency of alcohol consumption based on certain aspects of an individual’s personality?"** The dataset we are using to answer this question is a quantified drug consumption dataset. It contains information about 1885 unique individuals. Each entry contains 12 attributes, such as personality measurements, education level, age, gender, country of residence, and ethnicity, all of which are quantified and standardized. In addition, the dataset contains information regarding these individuals’ use of 18 legal and illegal drugs, ranging from chocolate to heroin, and one fictitious drug (semeron) included to filter out liars. Each value in the drug columns reflects the recency of an individual’s use of said drug, ranging from “Never Used” to “Used in Last Day.”

## Preliminary Data Exploration
### Reading the data
The file does not contain a header row, so we read it with `col_names = FALSE` and then assign a descriptive name to each column.

In [16]:
drugs_data <- read_csv("data_sets_project/drug_consumption.csv", col_names = FALSE)
# The file has 32 columns: ID, 12 attributes, and 19 drug columns (18 drugs plus the fictitious Semer).
colnames(drugs_data) <- c("ID", "Age", "Gender", "Education", "Country", "Ethnicity", "Neuroticism", "Extraversion",
                          "Openness_to_Experience", "Agreeableness", "Conscientiousness", "Impulsiveness",
                          "Sensation_Seeking", "Alcohol", "Amphet", "Amyl", "Benzos", "Caff", "Cannabis", "Choc", "Coke",
                          "Crack", "Ecstasy", "Heroin", "Ketamine", "LegalH", "LSD", "Meth", "Mushrooms", "Nicotine", "Semer", "VSA")
drugs_data

Rows: 1885 Columns: 32
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): X14, X15, X16, X17, X18, X19, X20, X21, X22, X23, X24, X25, X26, X...
dbl (13): X1, X2, X3, X4, X5, X6, X7, X8, X9, X10, X11, X12, X13

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


ID,Age,Gender,Education,Country,Ethnicity,Neuroticism,Extraversion,Openness_to_Experience,Agreeableness,⋯,Ecstasy,Heroin,Ketamine,LegalH,LSD,Meth,Mushrooms,Nicotine,Semer,VSA
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>,<chr>
1,0.49788,0.48246,-0.05921,0.96082,0.12600,0.31287,-0.57545,-0.58331,-0.91699,⋯,CL0,CL0,CL0,CL0,CL0,CL0,CL0,CL2,CL0,CL0
2,-0.07854,-0.48246,1.98437,0.96082,-0.31685,-0.67825,1.93886,1.43533,0.76096,⋯,CL4,CL0,CL2,CL0,CL2,CL3,CL0,CL4,CL0,CL0
3,0.49788,-0.48246,-0.05921,0.96082,-0.31685,-0.46725,0.80523,-0.84732,-1.62090,⋯,CL0,CL0,CL0,CL0,CL0,CL0,CL1,CL0,CL0,CL0
4,-0.95197,0.48246,1.16365,0.96082,-0.31685,-0.14882,-0.80615,-0.01928,0.59042,⋯,CL0,CL0,CL2,CL0,CL0,CL0,CL0,CL2,CL0,CL0
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋱,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
1886,-0.07854,0.48246,0.45468,-0.57009,-0.31685,1.13281,-1.37639,-1.27553,-1.77200,⋯,CL4,CL0,CL2,CL0,CL2,CL0,CL2,CL6,CL0,CL0
1887,-0.95197,0.48246,-0.61113,-0.57009,-0.31685,0.91093,-1.92173,0.29338,-1.62090,⋯,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL4,CL0,CL0
1888,-0.95197,-0.48246,-0.61113,0.21128,-0.31685,-0.46725,2.12700,1.65653,1.11406,⋯,CL3,CL0,CL0,CL3,CL3,CL0,CL3,CL6,CL0,CL2


Each row represents an individual and each column represents an attribute of that individual. Thus, the data is in tidy format: each row corresponds to a single observation, each column to a single variable, and each cell holds a single value.

### Selecting the variables
Since we are trying to predict the frequency of alcohol consumption based on each individual's personality traits, we only need the columns 'Neuroticism', 'Extraversion', 'Openness_to_Experience', 'Agreeableness', 'Impulsiveness', 'Conscientiousness' and, of course, 'Alcohol'. We would also like to see how consumption levels vary across age categories, so we also include the column 'Age'.

In [34]:
alcohol_data <- select(drugs_data, Age, Neuroticism,
                       Extraversion, Openness_to_Experience,
                       Agreeableness, Impulsiveness, Conscientiousness, Alcohol)
alcohol_data

Age,Neuroticism,Extraversion,Openness_to_Experience,Agreeableness,Impulsiveness,Conscientiousness,Alcohol
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.49788,0.31287,-0.57545,-0.58331,-0.91699,-0.21712,-0.00665,CL5
-0.07854,-0.67825,1.93886,1.43533,0.76096,-0.71126,-0.14277,CL5
0.49788,-0.46725,0.80523,-0.84732,-1.62090,-1.37983,-1.01450,CL6
-0.95197,-0.14882,-0.80615,-0.01928,0.59042,-1.37983,0.58489,CL4
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
-0.07854,1.13281,-1.37639,-1.27553,-1.77200,0.52975,-1.38502,CL4
-0.95197,0.91093,-1.92173,0.29338,-1.62090,1.29221,-2.57309,CL5
-0.95197,-0.46725,2.12700,1.65653,1.11406,0.88113,0.41594,CL4


### Summarizing data
We split the data into a training set (75% of the rows) and a testing set, stratified by 'Alcohol'. From this point on, we use only the training data for our exploration.

In [35]:
alcohol_split <- initial_split(alcohol_data, prop = 0.75, strata = Alcohol)
alcohol_train <- training(alcohol_split)
alcohol_train

Age,Neuroticism,Extraversion,Openness_to_Experience,Agreeableness,Impulsiveness,Conscientiousness,Alcohol
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>
0.49788,-1.19430,-0.80615,0.14143,-0.60633,-1.37983,-0.00665,CL3
2.59171,-0.24649,-0.80615,-2.63199,-0.30172,0.52975,-0.78155,CL3
1.82213,0.04257,-0.69509,-1.11902,-0.45321,-1.37983,-0.40581,CL3
2.59171,-1.19430,-1.76250,-2.85950,-1.47955,0.52975,0.25953,CL3
⋮,⋮,⋮,⋮,⋮,⋮,⋮,⋮
-0.95197,1.02119,-0.43999,1.43533,-1.07533,-0.71126,0.12331,CL6
-0.07854,-0.14882,-0.57545,1.43533,-0.91699,0.52975,-0.78155,CL6
-0.95197,1.49158,-1.92173,-0.58331,-1.77200,-0.21712,0.58489,CL6
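
Because `initial_split()` samples rows at random, the training set differs from run to run unless a seed is set first. The following is a minimal sketch of a reproducible stratified split, not the split used for the output above: the seed value and the `toy` data frame are made up for illustration.

```r
library(rsample)

# Arbitrary seed, chosen only for illustration.
set.seed(2023)

# Made-up data: 200 rows, 60% in class "CL5" and 40% in class "CL6".
toy <- data.frame(
  score = rnorm(200),
  class = rep(c("CL5", "CL6"), times = c(120, 80))
)

# Same call pattern as above: 75% training, stratified by the class column.
toy_split <- initial_split(toy, prop = 0.75, strata = class)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Stratification keeps the class proportions in the training set
# close to those of the full data.
prop.table(table(toy$class))
prop.table(table(toy_train$class))
```

Rerunning the sketch with the same seed yields the same rows in `toy_train`.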


First, we will see how many observations there are for each class of alcohol consumption.

In [36]:
alcohol_count <- group_by(alcohol_train, Alcohol) |>
                 summarize(n = n())
alcohol_count

Alcohol,n
<chr>,<int>
CL0,22
CL1,24
CL2,50
CL3,149
CL4,215
CL5,571
CL6,381
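
The counts above can also be expressed as percentages, which makes the class imbalance easier to see. This is a small sketch that re-enters the counts from the table by hand; the name `alcohol_prop` is introduced here for illustration only.

```r
library(dplyr)

# Counts copied from the table above.
alcohol_count <- tibble(
  Alcohol = paste0("CL", 0:6),
  n = c(22L, 24L, 50L, 149L, 215L, 571L, 381L)
)

# Share of the training set in each class, in percent.
alcohol_prop <- alcohol_count |>
  mutate(percent = round(100 * n / sum(n), 1)) |>
  arrange(desc(n))

alcohol_prop
```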


As you can see, the largest class is 'CL5', followed by 'CL6'. To make the classes easier to interpret, we will attach a descriptive label to each one.

In [37]:
meaning <- c("Never Used", "Used over a Decade Ago", "Used in Last Decade",
             "Used in Last Year", "Used in Last Month", "Used in Last Week", "Used in Last Day")
alcohol_count_meaning <- bind_cols(meaning, alcohol_count)
colnames(alcohol_count_meaning) <- c("Meaning", "Class", "No. of Individuals")
alcohol_count_meaning

[1m[22mNew names:
[36m•[39m `` -> `...1`


Meaning,Class,No. of Individuals
<chr>,<chr>,<int>
Never Used,CL0,22
Used over a Decade Ago,CL1,24
Used in Last Decade,CL2,50
Used in Last Year,CL3,149
Used in Last Month,CL4,215
Used in Last Week,CL5,571
Used in Last Day,CL6,381
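
`bind_cols()` pairs rows by position, so the labelling above is correct only when all seven classes are present and sorted. A position-independent alternative is sketched below using a named lookup vector; `meaning_lookup` and `alcohol_labelled` are hypothetical names, and the counts are re-entered by hand from the table above.

```r
library(dplyr)

# Named lookup vector: class code -> meaning (labels as in the cell above).
meaning_lookup <- c(
  CL0 = "Never Used",
  CL1 = "Used over a Decade Ago",
  CL2 = "Used in Last Decade",
  CL3 = "Used in Last Year",
  CL4 = "Used in Last Month",
  CL5 = "Used in Last Week",
  CL6 = "Used in Last Day"
)

# Counts copied from the table above, deliberately shuffled to show
# that the lookup does not depend on row order.
alcohol_count <- tibble(
  Alcohol = c("CL6", "CL0", "CL5", "CL1", "CL4", "CL2", "CL3"),
  n = c(381L, 22L, 571L, 24L, 215L, 50L, 149L)
)

# Look up each label by class code rather than by row position.
alcohol_labelled <- alcohol_count |>
  mutate(Meaning = unname(meaning_lookup[Alcohol]), .before = Alcohol) |>
  arrange(Alcohol)

alcohol_labelled
```

A class missing from the data would simply be absent from the result instead of shifting every label by one row.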


Thus, the largest group of individuals (571 of the 1412 in the training set) last consumed alcohol within the past week. Next, we will find the mean of each personality attribute. The values in the dataset are not raw scores, probably because they have been standardized, which is consistent with the means being close to zero. Next to each mean we therefore list the closest raw ("true") score and the range of raw scores given on the dataset information page. The exception is 'Impulsiveness', for which that page provides only standardized values.

In [45]:
attributes_mean <- alcohol_train |>
                   select(Neuroticism:Conscientiousness) |>
                   map_dfc(mean) |>
                   pivot_longer(cols = Neuroticism:Conscientiousness,
                               names_to = "Personality Trait",
                               values_to = "Mean") |>
                   bind_cols(c("36", "41", "47", "43", "NA", "42"),
                             c("12-60", "16-59", "24-60", "12-60", "NA", "17-59"))
colnames(attributes_mean) <- c("Personality Trait", "Mean", "Closest True Value", "Range of True Values")
attributes_mean

[1m[22mNew names:
[36m•[39m `` -> `...3`
[36m•[39m `` -> `...4`


Personality Trait,Mean,Closest True Value,Range of True Values
<chr>,<dbl>,<chr>,<chr>
Neuroticism,-0.0099988102,36.0,12-60
Extraversion,-0.0005180241,41.0,16-59
Openness_to_Experience,0.0105236615,47.0,24-60
Agreeableness,-0.0058394901,43.0,12-60
Impulsiveness,0.0144916289,,
Conscientiousness,0.0023146246,42.0,17-59
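
The near-zero means are what one would expect from standardized columns. The dataset's actual quantification procedure is not reproduced here; the sketch below only illustrates the idea with a plain z-score transform, which is an assumption, applied to made-up raw scores.

```r
# Hypothetical raw scores, for illustration only (not taken from the dataset).
raw_scores <- c(24, 30, 36, 41, 47)

# Plain z-score standardization: subtract the mean, divide by the standard deviation.
z_scores <- (raw_scores - mean(raw_scores)) / sd(raw_scores)

# The standardized values average to zero (up to floating-point error).
mean(z_scores)

# Reversing the transform recovers the raw scores.
recovered <- z_scores * sd(raw_scores) + mean(raw_scores)
recovered
```

The means in the table are close to zero rather than exactly zero because they are computed on the training subset rather than on all 1885 individuals.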
