# Group 36 Project Proposal

### Group Members: Bhavtej Bhasin, Peter Chen, Theresa Choi, Sky Langille
### DSCI 100 004

In [16]:
#loading libraries
library(repr)
library(tidyverse)
library(tidymodels)
options(repr.matrix.max.rows = 10)

**Proposed Title:** K-Nearest Neighbor Regression Model for University Ranking Based on the Ratio of Female Students.

**Introduction:**
- Background Information:
    - University rankings are among the first things that students, their families, and employers look at when evaluating a prospective school or job candidate. Many people, including the universities themselves, value these rankings, and examining the data behind them can reveal flaws, biases, or interesting variables. By looking more closely at how the female student ratio at universities around the world relates to a school's ranking, prospective students can make more informed choices about a post-secondary school and form clearer expectations of the environment they will enter and the education they will receive.
- Predictive Question:
    - Can global university ranking be predicted using the ratio of female students within the student body?
- Data Set:
    - The data set we have selected for this study is the World University Rankings 2023 data set, published on Kaggle by Syed Ali Taqi in collaboration with Abdullah Sajid and Muhammad Jawad Awan. It covers 1,799 individual universities across 104 countries. Original variables include:
        - Name of university
        - Location
        - Number of students
        - Number of students per staff
        - International student percentage
        - Female to male ratio (out of 100 count)
    - As well as scoring variables:
        - Overall score
        - Teaching score
        - Research score
        - Citation score
        - Industry income score
        - International outlook score

**Preliminary Data Analysis:**
- demonstrate data set can be read into R
- clean and wrangle data into tidy format
- separate into training and test data; summarize training data into a table and visualize data with at least one plot

In [33]:
#load dataset
university_data <- read_csv("World University Rankings 2023.csv")

Rows: 2341 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): University Rank, Name of University, Location, International Stude...
dbl  (1): No of student per staff
num  (1): No of student

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [53]:
#initial data wrangling
university_rename <- university_data |> #rename relevant columns
    rename(
       "university_rank" = "University Rank",
        "name_of_university" = "Name of University",
        "number_of_students" = "No of student",
        "number_students_per_staff" = "No of student per staff",
        "international_students" = "International Student",
        "femalemale_ratio" = "Female:Male Ratio"
    )

university_student_stats <- university_rename |> #create tibble with only relevant columns
    select(university_rank, name_of_university, number_of_students, number_students_per_staff, international_students, femalemale_ratio) |>
    filter(university_rank != "Reporter") |> #filter out unranked universities
    filter(complete.cases(femalemale_ratio)) |> #filter out universities that did not report a female:male student ratio
    mutate(university_rank = as.numeric(university_rank)) |> #change ranking from character to numeric
    drop_na(university_rank) #filter for only universities with a non-range ranking (eg. no universities ranked 200-250)

university_separate <- university_student_stats |> #separate male and female student ratio into two columns
    separate(
        col = femalemale_ratio,
        into = c("female_student_ratio", "male_student_ratio"),
        sep = ":")

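university_female <- university_separate |> #separate tibble with only university rank, name and ratio of female students
    select(university_rank, name_of_university, female_student_ratio) |>
    mutate(female_student_ratio = as.numeric(female_student_ratio)) #separate() returns character columns; convert to numeric for plotting and regression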
university_female

In [53]:
#split data into training and testing sets; summarize training set in one table
set.seed(2023) #set seed DO NOT CHANGE!

university_split <- initial_split(university_female, prop = 0.70, strata = university_rank) #split data into training and testing sets
university_training <- training(university_split)
university_testing <- testing(university_split)

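training_table <- university_training |> #assumed summary: number of universities, mean rank, mean female student ratio
    summarize(
        number_of_universities = n(),
        mean_university_rank = mean(university_rank),
        mean_female_student_ratio = mean(as.numeric(female_student_ratio), na.rm = TRUE))
training_table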

In [43]:
#generate scatterplot of university ranking vs. female student ratio
options(repr.plot.height = 8, repr.plot.width = 15) #set visualization dimensions

uni_ratio_plot <- university_training |> 
    ggplot(aes(x = female_student_ratio, y = university_rank)) +
    geom_point() +
    xlab("Ratio of Female University Students") +
    ylab("Ranking of University") +
    ggtitle("Ratio of Female Students VS. University Ranking in Training Set") +
    theme(text = element_text(size=20))
uni_ratio_plot

**Methods:**
- explain how we will conduct analysis and what variable or columns will be used
- describe at least one way we will visualize our results

- Use K-nearest neighbor regression to predict university ranking from the ratio of female students. This makes the project slightly more complex, because the female:male ratio column has to be separated into two columns rather than simply dropped. Teaching score (as a measure of the assumed quality of teaching) is a possible second predictor.
    - We should avoid using more than one predictor, because visualizing the regression would then require a 3D plot, which we have not been taught. This could change if it is covered later or if we ask the teaching team, but it would likely complicate the project.
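
The single-predictor K-nearest neighbor regression described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `toy_training` is made-up stand-in data (not the Kaggle data set), and the 5-fold cross-validation, the grid of K values, and all object names other than `female_student_ratio` and `university_rank` are hypothetical choices rather than decisions the group has made. It also assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)

set.seed(2023)

# Hypothetical stand-in for the training set: made-up values, NOT real rankings
toy_training <- tibble(
    female_student_ratio = runif(100, min = 20, max = 70),
    university_rank = round(runif(100, min = 1, max = 200))
)

# Recipe: predict rank from the female student ratio, standardizing the predictor
knn_recipe <- recipe(university_rank ~ female_student_ratio, data = toy_training) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# Model specification with the number of neighbors left to be tuned
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("regression")

# Assumed tuning setup: 5-fold cross-validation over a small grid of K values
folds <- vfold_cv(toy_training, v = 5)
k_grid <- tibble(neighbors = seq(from = 1, to = 41, by = 5))

knn_results <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(resamples = folds, grid = k_grid) |>
    collect_metrics() |>
    filter(.metric == "rmse")

# Choose the K with the lowest cross-validated RMSE
best_k <- knn_results |>
    slice_min(mean, n = 1, with_ties = FALSE) |>
    pull(neighbors)

# Refit on the full training data with the chosen K
final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = best_k) |>
    set_engine("kknn") |>
    set_mode("regression")

final_fit <- workflow() |>
    add_recipe(knn_recipe) |>
    add_model(final_spec) |>
    fit(data = toy_training)

# Predict over a grid of ratios so the fitted relationship can be drawn as a line
prediction_grid <- tibble(female_student_ratio = seq(from = 20, to = 70, by = 1))
rank_predictions <- predict(final_fit, prediction_grid) |>
    bind_cols(prediction_grid)

# One possible results visualization: training points with the prediction line on top
results_plot <- ggplot(toy_training, aes(x = female_student_ratio, y = university_rank)) +
    geom_point(alpha = 0.4) +
    geom_line(data = rank_predictions, aes(y = .pred), colour = "blue") +
    xlab("Ratio of Female University Students") +
    ylab("Ranking of University")
```

With a single predictor the result stays a 2D scatterplot with an overlaid prediction line, which is the main reason for avoiding a second predictor. On the real data, the final model would be evaluated on `university_testing` rather than on the training set.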

**Expected Outcomes and Significance:**
- what do we expect to happen?
- why is this important?
- what future questions can this lead to?