# Deliverable 2: Project Proposal

### Group 3 - Aden Chan, Kashie Ugoji, Linda Han, Sungha Choi
--------------


### Inferential question: Does the average final grade of high school students who do extra-curricular activities differ from that of students who do not?

Dataset used: https://archive.ics.uci.edu/dataset/320/student+performance


## Introduction

Begin by providing some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.

Clearly state the question you will try to answer with your project. Your question should involve one or more random variables of interest, spread across two or more categories that are interesting to compare. For example, you could consider the annual maxima river flow at two different locations along a river, or perhaps gender diversity at different universities. Of the response variable, identify one location parameter (mean, median, quantile, etc.) and one scale parameter (standard deviation, inter-quartile range, etc.) that would be useful in answering your question. Justify your choices.

UPDATE (Mar 1, 2022): If it doesn’t make sense to infer a scale parameter, you can choose another parameter, or choose a second variable altogether. Ultimately, we’re looking for a comprehensive inference analysis on one parameter spread across 2+ groups (with at least one hypothesis test), plus a bit more (such as an investigation on the variance, a quantile, or a different variable). In total, you should use both bootstrapping and asymptotics somewhere in your report at least once each. Also, your hypothesis test(s) need not be significant: it is perfectly fine to write a report claiming no significant findings (i.e. your p-value is large).


Also, be sure to frame your question/objectives in terms of what is already known in the literature. Be sure to include at least two scientific publications that can help frame your study (you will need to include these in the References section). We have no specific citation style requirements, but be consistent.



## Preliminary Results

### 1. Importing libraries

Before starting the exploratory data analysis, we load the required libraries.

In [6]:
library(tidyverse)

### 2. Importing dataset

The dataset we are interested in is "student-mat.csv". It contains demographic, family, personal, and academic performance information about **students in math classes from two Portuguese high schools**.

Below, we download the zip archive from its URL, extract the dataset from it, and load the result into an object called `students`.

In [7]:
# Downloads the zipfile containing the dataset if it doesn't already exist 
# and saves it to the current working directory as "dataset-zip"
url <- "https://archive.ics.uci.edu/static/public/320/student+performance.zip"
destfile <- "./dataset-zip" 

if (!file.exists(destfile)) {
    download.file(url, destfile)
    
    # Unzips the zipfile and extracts the dataset to current working directory
    unzip(destfile, files = "student.zip") %>% unzip(files = "student-mat.csv")
}



# Read the dataset and name it `students`
students <- read_delim("./student-mat.csv", delim=";")
head(students)

Rows: 395 Columns: 33
── Column specification ────────────────────────────────────────────────────────
Delimiter: ";"
chr (17): school, sex, address, famsize, Pstatus, Mjob, Fjob, reason, guardi...
dbl (16): age, Medu, Fedu, traveltime, studytime, failures, famrel, freetime...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


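| school | sex | age | address | famsize | Pstatus | Medu | Fedu | Mjob | Fjob | ⋯ | famrel | freetime | goout | Dalc | Walc | health | absences | G1 | G2 | G3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<chr>` | `<dbl>` | `<dbl>` | `<chr>` | `<chr>` | ⋯ | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| GP | F | 18 | U | GT3 | A | 4 | 4 | at_home | teacher | ⋯ | 4 | 3 | 4 | 1 | 1 | 3 | 6 | 5 | 6 | 6 |
| GP | F | 17 | U | GT3 | T | 1 | 1 | at_home | other | ⋯ | 5 | 3 | 3 | 1 | 1 | 3 | 4 | 5 | 5 | 6 |
| GP | F | 15 | U | LE3 | T | 1 | 1 | at_home | other | ⋯ | 4 | 3 | 2 | 2 | 3 | 3 | 10 | 7 | 8 | 10 |
| GP | F | 15 | U | GT3 | T | 4 | 2 | health | services | ⋯ | 3 | 2 | 2 | 1 | 1 | 5 | 2 | 15 | 14 | 15 |
| GP | F | 16 | U | GT3 | T | 3 | 3 | other | other | ⋯ | 4 | 3 | 2 | 1 | 2 | 5 | 4 | 6 | 10 | 10 |
| GP | M | 16 | U | LE3 | T | 4 | 3 | services | other | ⋯ | 5 | 4 | 2 | 1 | 2 | 5 | 10 | 15 | 15 | 15 |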


### 3. Wrangling the dataset

Now that the dataset has been successfully loaded, we will wrangle it to the desired format for our project.

According to the description provided in the documentation, the column `G3` in the dataset represents the students' final grade:

> G3 - final grade (numeric: from 0 to 20, output target)

Note: Grades in Portugal are given on a scale of 0-20, with 18-20 considered excellent by international standards (letter grade A). (https://www.upt.pt/en/home/internationals/portuguese-grading-system-2/)
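Before relying on the documented 0-20 range, it can be worth checking it directly. The block below is a minimal sketch of such a check. It runs on `students_demo`, a hypothetical six-row stand-in that reuses the column names `G3` and `activities`, and the names `students_demo` and `checks` are introduced here purely for illustration.

```r
library(dplyr)

# Hypothetical stand-in for `students` (NOT the full dataset): six rows with
# the two columns this check needs.
students_demo <- tibble(
    activities = c("no", "no", "no", "yes", "no", "yes"),
    G3 = c(6, 6, 10, 15, 10, 15)
)

# Count rows that would violate the documented format
checks <- students_demo %>%
    summarize(
        n_missing = sum(is.na(G3) | is.na(activities)),
        n_out_of_range = sum(G3 < 0 | G3 > 20, na.rm = TRUE),
        n_bad_label = sum(!activities %in% c("yes", "no"))
    )

checks
```

If all three counts are zero, the columns match the documented format; the same pipeline could be pointed at `students` instead of the stand-in.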

To make `G3` more readable, we copy it into a new column named `final_grade`.

We then select the two columns relevant to our inferential question: `activities` and the newly created column `final_grade`.

In [8]:
students_sample <- students %>%
    mutate(final_grade = G3) %>%
    select(activities, final_grade)

head(students_sample)

| activities | final_grade |
|---|---|
| `<chr>` | `<dbl>` |
| no | 6 |
| no | 6 |
| no | 10 |
| yes | 15 |
| no | 10 |
| yes | 15 |
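As a side note, `select()` can rename a column while selecting it, so the two wrangling steps in the cell above could be collapsed into one. The sketch below compares both approaches on `students_demo`, a hypothetical stand-in built from the first six rows of the dataset; `via_mutate` and `via_select` are illustrative names, not objects used elsewhere in this proposal.

```r
library(dplyr)

# Hypothetical stand-in for `students`: first six rows, three columns only
students_demo <- tibble(
    activities = c("no", "no", "no", "yes", "no", "yes"),
    G1 = c(5, 5, 7, 15, 6, 15),
    G3 = c(6, 6, 10, 15, 10, 15)
)

# Approach used in the cell above: copy G3, then keep two columns
via_mutate <- students_demo %>%
    mutate(final_grade = G3) %>%
    select(activities, final_grade)

# Alternative: rename G3 while selecting it
via_select <- students_demo %>%
    select(activities, final_grade = G3)

# Compare the two results
all.equal(via_mutate, via_select)
```

Either form works for this analysis; the single `select()` call simply avoids creating a temporary duplicate of `G3`.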


### 4. Gaining insight about the dataset

Recall that we want to compare the **average grade of students who do extra-curricular activities** with the **average grade of those who don't**.

First, we gain some insight into the dataset by counting the total number of students and the number in each category (extra-curricular and no extra-curricular).

The dataset contains 395 students in total. Of these, about 49% (194 students) don't do extra-curricular activities and about 51% (201 students) do.

In [13]:
# count the total number of students
nrow(students)

In [14]:
# count the number and proportion of students in each category
count(students, activities) %>%
    mutate(prop = n / nrow(students))

| activities | n | prop |
|---|---|---|
| `<chr>` | `<int>` | `<dbl>` |
| no | 194 | 0.4911392 |
| yes | 201 | 0.5088608 |


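Finally, we compute the mean final grade within each group. The sample means are close: about 10.34 for students without extra-curricular activities and about 10.49 for students with them.

In [12]: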
students_sample %>%
    group_by(activities) %>%
    summarize(mean_final_grade = mean(final_grade))

| activities | mean_final_grade |
|---|---|
| `<chr>` | `<dbl>` |
| no | 10.34021 |
| yes | 10.48756 |
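A location estimate alone says little about variability, so a natural extension is to report a scale parameter next to each mean and to compute the observed difference in means. The block below is a rough sketch of that idea. It runs on `students_demo`, a hypothetical six-row stand-in taken from the first six rows shown earlier, so its numbers are illustrative only; `group_summary` and `obs_diff_in_means` are names introduced here.

```r
library(dplyr)

# Hypothetical stand-in for `students_sample` (first six rows only)
students_demo <- tibble(
    activities = c("no", "no", "no", "yes", "no", "yes"),
    final_grade = c(6, 6, 10, 15, 10, 15)
)

# Group size, location (mean), and scale (standard deviation) per group
group_summary <- students_demo %>%
    group_by(activities) %>%
    summarize(
        n = n(),
        mean_final_grade = mean(final_grade),
        sd_final_grade = sd(final_grade)
    )

# Observed difference in means: "yes" group minus "no" group
obs_diff_in_means <- group_summary %>%
    summarize(
        diff = mean_final_grade[activities == "yes"] -
            mean_final_grade[activities == "no"]
    ) %>%
    pull(diff)

group_summary
obs_diff_in_means
```

Applied to the full-data means in the table above, the same subtraction gives 10.48756 - 10.34021 ≈ 0.147 grade points in favour of students who do extra-curricular activities.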


## Methods: Plan

The previous sections will carry over to your final report (you’ll be allowed to improve them based on feedback you get). Begin this Methods section with a brief description of “the good things” about this report – specifically, in what ways is this report trustworthy?

Continue by explaining why the plot(s) and estimates that you produced are not enough to give to a stakeholder, and what you should provide in addition to address this gap. Make sure your plans include at least one hypothesis test and one confidence interval. If possible, compare both the bootstrapping and asymptotics methods.
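One possible way to compare bootstrapping with asymptotics for a difference in means is sketched below: a percentile bootstrap confidence interval next to Welch's two-sample t-test, both for the null hypothesis that the two group means are equal. This is only an illustration under stated assumptions. The data are simulated (group sizes 194 and 201 match the counts reported earlier, while the normal shape, the means 10.3 and 10.5, and the standard deviation 4.5 are arbitrary choices), and `sim_grades`, `yes`, `no`, `boot_diffs`, `boot_ci`, and `welch` are hypothetical names. It does not represent results from the real dataset.

```r
set.seed(201)  # arbitrary seed so the sketch is reproducible

# Simulated grades (NOT the real data): rounded normal draws clamped to 0-20.
# The mean and sd arguments are assumptions made for illustration.
sim_grades <- function(n, mean, sd = 4.5) {
    pmin(pmax(round(rnorm(n, mean, sd)), 0), 20)
}
no <- sim_grades(194, 10.3)   # students without extra-curricular activities
yes <- sim_grades(201, 10.5)  # students with extra-curricular activities

# Bootstrap: resample each group with replacement and recompute the
# difference in means (yes minus no) many times
boot_diffs <- replicate(
    2000,
    mean(sample(yes, replace = TRUE)) - mean(sample(no, replace = TRUE))
)

# 95% percentile bootstrap confidence interval
boot_ci <- quantile(boot_diffs, c(0.025, 0.975))

# Asymptotic counterpart: Welch's t-test (t.test() uses var.equal = FALSE by
# default), with a 95% confidence interval for mean(yes) - mean(no)
welch <- t.test(yes, no)

boot_ci
welch$conf.int
welch$p.value
```

With group sizes near 200, the two intervals would be expected to be similar; a large gap between them would suggest checking the sampling distribution more closely before trusting the asymptotic result.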

Finish this section by reflecting on how your final report might play out:

What do you expect to find?
What impact could such findings have?
What future questions could this lead to?

## References
At least two citations of literature relevant to the project. The citation format is your choice – just be consistent. Make sure to cite the source of your data as well.
