# Final Report 
Group members: Loay Al-Abri, 


(2) Methods and Results

In this section, you will include:


b) “Methods: Plan”

    - Describe in written English the methods you used to perform your analysis from beginning to end, and narrate the code that does the analysis.
    - If included, describe the “Feature Selection” process and how and why you choose the covariates of your final model.
    - Make sure to interpret/explain the results you obtain. It’s not enough to just say, “I fitted a linear model with these covariates, and my R-square is 0.87”.
        - If inference is the aim of your project, a detailed interpretation of your fitted model is required, as well as a discussion of relevant quantities (e.g., are the coefficients significant? How does the model fit the data)?
        - A careful model assessment must be conducted.
        - If prediction is the project's aim, describe the test data used or how it was created.
    - Ensure your tables and/or figures are labelled with a figure/table number.


Questions:
- How is the number of stars of a repository related to its forks, issues, size, and discussions setting?

In this project, we analyze the number of Stars a GitHub repository has, using the variables that matter most for the number of Stars a repository receives. To start, we fit two baseline models to compare our final model's performance against. The first baseline always predicts the mean number of Stars. We expect this model to have an $R^2$ close to 0 and to perform poorly. The second baseline is a linear regression model that includes all the variables in the dataset as explanatory variables (the full, or maximal, model). Our final model should perform better than the full model, because the exploratory data analysis showed that the variables we chose are the ones most essential in determining the number of Stars a repository has. Finally, we will diagnose our model to examine whether any of the linear regression assumptions are violated.

In [1]:
library(tidyverse)
library(tidymodels)
library(Metrics)
library(ggplot2)
library(broom)
library(GGally)
library(gridExtra)
library(cowplot)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
── Attaching packages ────────────────────────────────────── tidymodels 1.2.0 ──

✔ broom        1.0.5     ✔ rsample

In [35]:
# Load the data and drop the columns that are not used in the analysis
git_rep <- read.csv("data/repositories.csv") %>%
    select(-Name, -Homepage, -URL, -Topics, -License, -Description, -`Created.At`, -`Updated.At`,
           -Language, -Watchers, -Is.Fork) %>%
    # Collapse the default branch name into three levels: "master", "main", "other"
    mutate(
        Default_branch = case_when(
            `Default.Branch` == "master" ~ "master",
            `Default.Branch` == "main" ~ "main",
            TRUE ~ "other"
        )
    ) %>%
    select(-`Default.Branch`) %>%
    # Treat the True/False repository settings as factors
    mutate(Has.Issues = as.factor(Has.Issues),
           Has.Projects = as.factor(Has.Projects),
           Has.Downloads = as.factor(Has.Downloads),
           Has.Wiki = as.factor(Has.Wiki),
           Has.Pages = as.factor(Has.Pages),
           Has.Discussions = as.factor(Has.Discussions),
           Is.Archived = as.factor(Is.Archived),
           Is.Template = as.factor(Is.Template)
           )

# 70/30 train/test split, stratified on Stars
git_split <- initial_split(git_rep, prop = 0.7, strata = Stars)
git_tr <- training(git_split)
git_te <- testing(git_split)
head(git_tr)

,Size,Stars,Forks,Issues,Has.Issues,Has.Projects,Has.Downloads,Has.Wiki,Has.Pages,Has.Discussions,Is.Archived,Is.Template,Default_branch
,<int>,<int>,<int>,<int>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<fct>,<chr>
1,843,237,23,5,True,True,True,True,False,False,False,False,main
2,2411,237,32,36,True,True,True,False,False,False,False,False,master
3,58257,237,71,2,True,True,True,True,True,False,False,False,master
4,685,237,28,9,True,True,True,True,False,True,False,False,master
5,306952,237,74,57,True,True,True,True,False,True,False,False,main
6,718,237,44,0,True,True,True,True,False,False,False,False,master


In [36]:
compute_r_sqr <- function(y_actual, y_predicted) {
    SS_residuals <- sum((y_actual - y_predicted)^2)
    SS_total <- sum((y_actual - mean(y_actual))^2)
    R2 <- 1 - (SS_residuals / SS_total)
    return(R2)
}
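
For reference, the two metrics used in the following cells can be written out explicitly. The notation here is ours, introduced only for illustration: $y_i$ is the observed number of Stars, $\hat{y}_i$ the model's prediction, $\bar{y}$ the mean of the observed values in the data set being scored, and $n$ the number of rows in that data set. `compute_r_sqr` implements the first formula, and the second is the standard definition of the root mean squared error.

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
$$

Because $\bar{y}$ is taken from the data set being scored, $R^2$ on the test set can be slightly negative for a model that predicts worse than the test-set mean.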

In [42]:
# Baseline 1: intercept-only model, which always predicts the training mean of Stars
mean_model <- lm(Stars ~ 1, data = git_tr)
mean_model_summary <- summary(mean_model)
preds <- predict(mean_model, newdata = git_te, type = "response")
# Metrics::rmse() takes its arguments as (actual, predicted)
results <- data.frame(Model = c("mean model", "mean model"),
                    data = c("train", "test"),
                    r_sqr = c(round(mean_model_summary$r.squared, 2), round(compute_r_sqr(git_te$Stars, preds), 2)),
                    rmse = c(round(rmse(git_tr$Stars, mean_model$fitted.values), 2), 
                            round(rmse(git_te$Stars, preds), 2)))
results

Model,data,r_sqr,rmse
<chr>,<chr>,<dbl>,<dbl>
mean model,train,0,4026.17
mean model,test,0,3912.36


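The mean model performs poorly, as expected: it always predicts the mean number of Stars in the training data, so its $R^2$ is exactly 0 on the training set and approximately 0 on the test set. Its RMSE (4026.17 on the training set, 3912.36 on the test set) serves as the baseline that the other models must beat.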
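
To see why the mean model's test $R^2$ comes out at about 0, here is a short derivation. The symbols are our own notation for illustration: $\bar{y}_{tr}$ and $\bar{y}_{te}$ are the mean number of Stars in the training and test sets, and $n$ is the number of test rows. Since every prediction equals $\bar{y}_{tr}$, the residual sum of squares on the test set splits into two parts:

$$
\sum_{i=1}^{n} (y_i - \bar{y}_{tr})^2
= \sum_{i=1}^{n} (y_i - \bar{y}_{te})^2 + n\,(\bar{y}_{te} - \bar{y}_{tr})^2
$$

Substituting this into $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{tot} = \sum_{i=1}^{n} (y_i - \bar{y}_{te})^2$, gives

$$
R^2_{test} = -\,\frac{n\,(\bar{y}_{te} - \bar{y}_{tr})^2}{\sum_{i=1}^{n} (y_i - \bar{y}_{te})^2} \le 0
$$

So the test $R^2$ of the mean model can never be positive, and it is close to 0 whenever the training and test means are close, which is what a stratified split is designed to encourage.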

In [43]:
full_model <- lm(Stars ~ ., data = git_tr)
full_model_summary <- summary(full_model)
preds <- predict(full_model, newdata = git_te, type = "response")
full_model_results <- data.frame(
    Model = c("full model", "full model"),
    data = c("train", "test"), 
    r_sqr = c(round(full_model_summary$r.squared, 2),
              round(compute_r_sqr(git_te$Stars, preds), 2)),
    rmse = c(round(rmse(git_tr$Stars, full_model$fitted.values), 2),
             round(rmse(git_te$Stars, preds), 2))
)
results <- rbind(results, full_model_results)
results

Model,data,r_sqr,rmse
<chr>,<chr>,<dbl>,<dbl>
mean model,train,0.0,4026.17
mean model,test,0.0,3912.36
full model,train,0.37,3207.63
full model,test,0.35,3153.77
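
The metric computation is repeated for each model, so it could be wrapped in a small helper. The sketch below is one possible way to do this and is not part of the original analysis: `evaluate_model` and the simulated `toy` data are hypothetical names introduced for illustration, and the example uses base R only so that it runs without `data/repositories.csv`.

```r
# Hypothetical helper (not in the original notebook): R^2 and RMSE of a
# fitted lm on an arbitrary data set, returned as one row of a results table.
evaluate_model <- function(model, data, response, model_name, data_name) {
  y <- data[[response]]
  y_hat <- predict(model, newdata = data)
  ss_res <- sum((y - y_hat)^2)
  ss_tot <- sum((y - mean(y))^2)   # uses the mean of the data set being scored
  data.frame(
    Model = model_name,
    data  = data_name,
    r_sqr = round(1 - ss_res / ss_tot, 2),
    rmse  = round(sqrt(mean((y - y_hat)^2)), 2)
  )
}

# Simulated stand-in for git_tr / git_te (NOT the real repository data)
set.seed(1)
toy <- data.frame(Forks = rpois(200, 50))
toy$Stars <- 2 * toy$Forks + rnorm(200, sd = 10)
toy_tr <- toy[1:140, , drop = FALSE]
toy_te <- toy[141:200, , drop = FALSE]

toy_fit <- lm(Stars ~ Forks, data = toy_tr)
toy_results <- rbind(
  evaluate_model(toy_fit, toy_tr, "Stars", "toy model", "train"),
  evaluate_model(toy_fit, toy_te, "Stars", "toy model", "test")
)
toy_results
```

With the real data, the same helper would be called as `evaluate_model(full_model, git_te, "Stars", "full model", "test")`, assuming `full_model` and `git_te` exist as defined in the cells above.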


In [47]:
full_model_summary %>% tidy() %>% mutate(across(where(is.numeric), ~ round(.x, 2)))

term,estimate,std.error,statistic,p.value
<chr>,<dbl>,<dbl>,<dbl>,<dbl>
(Intercept),999.69,97.92,10.21,0.0
Size,0.0,0.0,0.09,0.93
Forks,1.68,0.01,250.74,0.0
Issues,2.94,0.04,66.54,0.0
Has.IssuesTrue,49.66,48.52,1.02,0.31
Has.ProjectsTrue,-271.55,30.43,-8.92,0.0
Has.DownloadsTrue,-157.83,87.87,-1.8,0.07
Has.WikiTrue,-196.93,26.02,-7.57,0.0
Has.PagesTrue,198.24,21.77,9.11,0.0
Has.DiscussionsTrue,754.0,26.05,28.95,0.0


The full model substantially outperforms the mean model, achieving an $R^2$ of 0.35 on the test data with a root mean squared error (RMSE) of 3153.77, compared with the mean model's test RMSE of 3912.36. The model does not appear to overfit, as the $R^2$ and RMSE values for the training and test sets are similar (0.37 vs. 0.35 and 3207.63 vs. 3153.77). The RMSE is slightly lower on the test set than on the training set, but this should not be read as better generalization: the mean model's RMSE is also lower on the test set, which suggests that Stars simply vary less in this particular test split, and the test $R^2$ is in fact slightly lower than the training $R^2$.
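
As a rough consistency check on the reported numbers, the RMSE of a model can be related to its $R^2$ and to the spread of the response. The notation is ours, for illustration: $SS_{tot}$ is the total sum of squares of Stars around its mean in the data set being scored, and $n$ is the number of rows. Combining $\mathrm{RMSE} = \sqrt{SS_{res}/n}$ with $R^2 = 1 - SS_{res}/SS_{tot}$ gives

$$
\mathrm{RMSE} = \sqrt{\frac{SS_{tot}}{n}} \cdot \sqrt{1 - R^2}
$$

On the training set, $\sqrt{SS_{tot}/n}$ is exactly the mean model's RMSE. On the test set it is only approximately so, because the mean model predicts the training mean. Plugging in the values from the results table, with $R^2$ rounded to two decimals:

$$
\text{train: } 4026.17 \times \sqrt{1 - 0.37} \approx 3196,
\qquad
\text{test: } 3912.36 \times \sqrt{1 - 0.35} \approx 3154
$$

Both are close to the reported 3207.63 and 3153.77; the remaining gap is within what the rounding of $R^2$ can explain. In other words, the difference in RMSE between the two sets is driven mainly by the difference in the spread of Stars, not by how well the model fits.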