## Implementation of a proposed model

In [None]:
install.packages("leaps")
install.packages("Metrics")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done




In [67]:
library(tidyverse)
library(lubridate)

library(ggplot2)
library(broom)
library(tidyr)
library(dplyr)
library(Metrics)
library(modelr)   # loaded after Metrics, so rmse() below resolves to modelr::rmse(model, data)

library(leaps)
library(caret)

- In the previous assignment, I explored the data with visuals that showed the relationships between the input variables and the response, and among the input variables themselves. A heatmap also showed that some input variables were highly correlated (multicollinearity).
- In this assignment, I will first convert the 'Created At' and 'Updated At' date variables to numerical values: the number of days since each repository was created and last updated, respectively.
- Then, I will clean the data and split it into training and testing sets with an 80/20 ratio.
- Afterward, I will run the Forward Selection algorithm on the training set to find the best-fit model.
- Then, I will use the variables from the Forward Selection results to fit an additive linear regression model.
- Then, using the predict function, I will predict the number of stars in the testing set with the fitted model.
- Finally, I will calculate the RMSE to evaluate the accuracy of the prediction model.


In [68]:
df <- read_csv('repositories.csv')

[1mRows: [22m[34m215029[39m [1mColumns: [22m[34m24[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (8): Name, Description, URL, Homepage, Language, License, Topics, Defau...
[32mdbl[39m  (5): Size, Stars, Forks, Issues, Watchers
[33mlgl[39m  (9): Has Issues, Has Projects, Has Downloads, Has Wiki, Has Pages, Has ...
[34mdttm[39m (2): Created At, Updated At

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [69]:
df_subset <- df %>% sample_n(size = 10000)

head(df_subset, n = 3)

Name,Description,URL,Created At,Updated At,Homepage,Size,Stars,Forks,Issues,⋯,Has Issues,Has Projects,Has Downloads,Has Wiki,Has Pages,Has Discussions,Is Fork,Is Archived,Is Template,Default Branch
<chr>,<chr>,<chr>,<dttm>,<dttm>,<chr>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<lgl>,<chr>
kvs,💿 KVS: NVMe Key-Value Store,https://github.com/synrc/kvs,2013-05-29 11:10:53,2023-08-31 09:06:02,https://kvs.n2o.dev,917,174,50,0,⋯,True,True,True,True,True,False,False,False,False,master
GPU-GEMS-3D-Fluid-Simulation,3D fluid simulation on the in Unity,https://github.com/Scrawk/GPU-GEMS-3D-Fluid-Simulation,2017-06-21 00:49:51,2023-09-14 08:04:33,,77,261,39,0,⋯,True,True,True,True,False,False,False,False,False,master
legacy-dotfiles,,https://github.com/rbnis/legacy-dotfiles,2018-06-02 11:40:21,2023-09-13 13:30:09,,5774,175,44,5,⋯,True,False,True,True,False,False,False,True,False,master


- I am choosing only these variables because running the Forward Selection algorithm on all of the given variables was not feasible. Also, the other columns (e.g. the remaining logical variables) contain highly redundant data.  
Note: I explain this issue in more detail in the cell just above the Forward Selection cell below.

In [70]:
features <- c("Size", "Stars", "Forks",
              "Issues", 'Watchers', 'Is Archived', 'Has Issues', 'Has Pages',
              'Created At', 'Updated At')
df_subset_1 <- df_subset %>% select(all_of(features))

head(df_subset_1, n = 3)
nrow(df_subset_1)
sum(is.na(df_subset_1))

Size,Stars,Forks,Issues,Watchers,Is Archived,Has Issues,Has Pages,Created At,Updated At
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>,<dttm>,<dttm>
917,174,50,0,174,False,True,True,2013-05-29 11:10:53,2023-08-31 09:06:02
77,261,39,0,261,False,True,False,2017-06-21 00:49:51,2023-09-14 08:04:33
5774,175,44,5,175,True,True,False,2018-06-02 11:40:21,2023-09-13 13:30:09


In [71]:
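# Note: these two lines modify df_subset, not df_subset_1 (created above), so they do not
# affect the data used downstream; the next cell converts the columns in df_subset_1.
df_subset$'Created At' <- as.Date(strptime(df_subset$'Created At', format = "%Y-%m-%d %H:%M:%S"))
df_subset$'Updated At' <- as.Date(strptime(df_subset$'Updated At', format = "%Y-%m-%d %H:%M:%S"))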

In [72]:
df_subset_2 <- df_subset_1 %>%
  mutate(
    `Created At` = as.Date(`Created At`),
    `Updated At` = as.Date(`Updated At`),
    # Both day counts are measured from the run date, so they change from run to run
    repo_age_days = as.numeric(Sys.Date() - `Created At`),
    days_since_update = as.numeric(Sys.Date() - `Updated At`)
  )

head(df_subset_2, n = 3)

Size,Stars,Forks,Issues,Watchers,Is Archived,Has Issues,Has Pages,Created At,Updated At,repo_age_days,days_since_update
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>,<date>,<date>,<dbl>,<dbl>
917,174,50,0,174,False,True,True,2013-05-29,2023-08-31,4202,456
77,261,39,0,261,False,True,False,2017-06-21,2023-09-14,2718,442
5774,175,44,5,175,True,True,False,2018-06-02,2023-09-13,2372,443
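
Because the cell above measures both day counts from `Sys.Date()`, the values depend on the day the notebook is run. A minimal sketch of a reproducible variant, using a fixed reference date instead, is below. The variable `reference_date` and its value are assumptions: 2024-11-29 is simply the date that reproduces the three rows shown above, not a date stated anywhere in the data.

```r
# Sketch only: fixed reference date (assumed value, see the note above)
reference_date <- as.Date("2024-11-29")

# The first three 'Created At' / 'Updated At' dates from the output above
created <- as.Date(c("2013-05-29", "2017-06-21", "2018-06-02"))
updated <- as.Date(c("2023-08-31", "2023-09-14", "2023-09-13"))

# Subtracting Date objects yields a difftime in days; as.numeric() drops the unit
repo_age_days <- as.numeric(reference_date - created)
days_since_update <- as.numeric(reference_date - updated)

print(data.frame(repo_age_days, days_since_update))
```

In the notebook itself, the same idea would amount to replacing `Sys.Date()` with a fixed date inside `mutate()`.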


In [73]:
df_subset_3 <- df_subset_2 %>% select(-c('Created At', 'Updated At'))

head(df_subset_3, n = 3)

Size,Stars,Forks,Issues,Watchers,Is Archived,Has Issues,Has Pages,repo_age_days,days_since_update
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>,<dbl>,<dbl>
917,174,50,0,174,False,True,True,4202,456
77,261,39,0,261,False,True,False,2718,442
5774,175,44,5,175,True,True,False,2372,443


In [74]:
non_na_repo_age <- df_subset_3 %>% filter(!is.na(repo_age_days))
non_na_days_since_update <- df_subset_3 %>% filter(!is.na(days_since_update))

# n = 3 belongs to head(); inside select() it renamed the third column (Forks) to "n"
head(df_subset_3 %>% select(repo_age_days, days_since_update), n = 3)

repo_age_days,days_since_update
<dbl>,<dbl>
4202,456
2718,442
2372,443


In [75]:
df_cleaned <- df_subset_3 %>% filter(!is.na(Stars) & !is.na(Forks) & !is.na(repo_age_days))  

head(df_cleaned, n = 3)

Size,Stars,Forks,Issues,Watchers,Is Archived,Has Issues,Has Pages,repo_age_days,days_since_update
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>,<dbl>,<dbl>
917,174,50,0,174,False,True,True,4202,456
77,261,39,0,261,False,True,False,2718,442
5774,175,44,5,175,True,True,False,2372,443


In [76]:
summary_stats <- df_cleaned %>%
  summarize(
    count = n(),
    avg_stars = mean(Stars, na.rm = TRUE),
    avg_forks = mean(Forks, na.rm = TRUE),
    avg_issues = mean(Issues, na.rm = TRUE),
    avg_repo_age = mean(repo_age_days, na.rm = TRUE),
    avg_days_since_update = mean(days_since_update, na.rm = TRUE)
  )

print(summary_stats)
head(df_cleaned, n = 3)
nrow(df_cleaned)
sum(is.na(df_cleaned))

# A tibble: 1 × 6
  count avg_stars avg_forks avg_issues avg_repo_age avg_days_since_update
  <int>     <dbl>     <dbl>      <dbl>        <dbl>                 <dbl>
1 10000     1092.      230.       37.8        2765.                  454.


Size,Stars,Forks,Issues,Watchers,Is Archived,Has Issues,Has Pages,repo_age_days,days_since_update
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>,<dbl>,<dbl>
917,174,50,0,174,False,True,True,4202,456
77,261,39,0,261,False,True,False,2718,442
5774,175,44,5,175,True,True,False,2372,443


In the next cell, I split the data into training and testing sets (80/20 ratio). I will use them to fit the linear regression models, predict 'Stars' values, and evaluate the models.

In [77]:
set.seed(123) # DO NOT CHANGE!

GitHub_sample <-
    df_cleaned %>%
    mutate(id = row_number())

training_GitHub <- 
    GitHub_sample %>%
    slice_sample(prop = 0.80, replace = FALSE)

testing_GitHub <- 
    GitHub_sample %>% 
    anti_join(training_GitHub, by = "id") %>%
    select(-id)

training_GitHub <- 
    training_GitHub %>% 
    select(-id)

head(training_GitHub)
nrow(training_GitHub)

nrow(testing_GitHub)

Size,Stars,Forks,Issues,Watchers,Is Archived,Has Issues,Has Pages,repo_age_days,days_since_update
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<lgl>,<lgl>,<lgl>,<dbl>,<dbl>
1806,291,24,0,291,False,True,False,927,441
176,181,17,0,181,False,True,False,5450,560
250118,428,222,24,428,False,True,False,2401,437
2205,515,86,36,515,False,True,False,4527,461
15445,250,34,2,250,False,True,False,3451,440
50,183,39,4,183,False,True,False,1830,432
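
As a sanity check on an 80/20 split like the one above, the two sets should share no rows and together should cover the whole sample. A minimal base R sketch on a hypothetical 10-row data frame follows; `toy`, `train_idx`, `toy_train`, and `toy_test` are illustrative names, not objects from this notebook, and the sketch uses `sample()` rather than `slice_sample()`.

```r
set.seed(123)

# Hypothetical toy data with an explicit row id
toy <- data.frame(id = 1:10, Stars = seq(100, 1000, by = 100))

# Draw 80% of the row indices without replacement for training
train_idx <- sample(nrow(toy), size = floor(0.8 * nrow(toy)))
toy_train <- toy[train_idx, ]
toy_test <- toy[-train_idx, ]

# The sets must be disjoint and must add up to the full sample
n_overlap <- length(intersect(toy_train$id, toy_test$id))
n_total <- nrow(toy_train) + nrow(toy_test)
print(c(train = nrow(toy_train), test = nrow(toy_test), overlap = n_overlap))
```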


- One limitation I faced when running the Forward Selection algorithm on the full data was that the kernel would crash and restart.
- To solve this problem, I dropped the categorical variables that have many levels (>2) and that, based on what I learned in this course, I believe have little to no utility in predicting the number of stars.

In [78]:
GitHub_forward_sel <- regsubsets(
  Stars ~ ., data = training_GitHub,
  method = "forward",
  nvmax = ncol(training_GitHub) - 1
)
GitHub_forward_sel

GitHub_fwd_summary <- summary(GitHub_forward_sel)

GitHub_fwd_summary <- tibble(
  n_input_variables = seq_along(GitHub_fwd_summary$rss),
  RSS = GitHub_fwd_summary$rss,
  BIC = GitHub_fwd_summary$bic,
  Cp = GitHub_fwd_summary$cp
)

Subset selection object
Call: regsubsets.formula(Stars ~ ., data = training_GitHub, method = "forward", 
    nvmax = ncol(training_GitHub) - 1)
9 Variables  (and intercept)
                  Forced in Forced out
Size                  FALSE      FALSE
Forks                 FALSE      FALSE
Issues                FALSE      FALSE
Watchers              FALSE      FALSE
`Is Archived`TRUE     FALSE      FALSE
`Has Issues`TRUE      FALSE      FALSE
`Has Pages`TRUE       FALSE      FALSE
repo_age_days         FALSE      FALSE
days_since_update     FALSE      FALSE
1 subsets of each size up to 9
Selection Algorithm: forward
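
To make the greedy idea behind forward selection concrete, here is a minimal base R sketch on simulated data. It illustrates the general technique only and is not the `leaps` implementation; the function `forward_order` and the toy variables `x1`, `x2`, `x3`, `y` are hypothetical names introduced for this example.

```r
set.seed(1)
n <- 200

# Simulated data: y depends strongly on x1, weakly on x2, and not at all on x3
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 1 * toy$x2 + rnorm(n, sd = 0.5)

# At each step, add the remaining variable that gives the lowest RSS
forward_order <- function(data, response) {
  remaining <- setdiff(names(data), response)
  selected <- character(0)
  while (length(remaining) > 0) {
    rss <- sapply(remaining, function(v) {
      f <- reformulate(c(selected, v), response = response)
      sum(residuals(lm(f, data = data))^2)
    })
    best <- names(rss)[which.min(rss)]
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}

selection_order <- forward_order(toy, "y")
print(selection_order)
```

The order in which variables enter corresponds to the rows of the selection table that `summary()` prints for a forward `regsubsets` fit: one model per size, each nested in the next.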

In [79]:
GitHub_fwd_summary
summary(GitHub_forward_sel)

n_input_variables,RSS,BIC,Cp
<int>,<dbl>,<dbl>,<dbl>
1,4.357731e-19,-538133.9,628.845164
2,4.066121e-19,-538679.0,53.68893
3,4.037798e-19,-538725.9,-0.368246
4,4.037477e-19,-538717.6,0.997205
5,4.037134e-19,-538709.2,2.316943
6,4.037072e-19,-538700.4,4.194667
7,4.037007e-19,-538691.5,6.066094
8,4.036974e-19,-538682.6,8.000564
9,4.036973e-19,-538673.6,10.0


Subset selection object
Call: regsubsets.formula(Stars ~ ., data = training_GitHub, method = "forward", 
    nvmax = ncol(training_GitHub) - 1)
9 Variables  (and intercept)
                  Forced in Forced out
Size                  FALSE      FALSE
Forks                 FALSE      FALSE
Issues                FALSE      FALSE
Watchers              FALSE      FALSE
`Is Archived`TRUE     FALSE      FALSE
`Has Issues`TRUE      FALSE      FALSE
`Has Pages`TRUE       FALSE      FALSE
repo_age_days         FALSE      FALSE
days_since_update     FALSE      FALSE
1 subsets of each size up to 9
Selection Algorithm: forward
         Size Forks Issues Watchers `Is Archived`TRUE `Has Issues`TRUE
1  ( 1 ) " "  " "   " "    "*"      " "               " "             
2  ( 1 ) " "  "*"   " "    "*"      " "               " "             
3  ( 1 ) " "  "*"   "*"    "*"      " "               " "             
4  ( 1 ) " "  "*"   "*"    "*"      "*"               " "             
5  ( 1 ) " "  "*"   "*

- Based on the forward selection output, the 4-variable model contains Forks, Issues, Watchers, and `Is Archived`. I will fit an additive linear regression model on the training data using these variables; the fit below also includes Size.
- I selected model (4) because it is among the models with the lowest Cp, BIC, and RSS. Strictly, model (3) has the lowest BIC and Cp in the table above, and model (4) is a close second with a marginally lower RSS.
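
The criteria can also be ranked programmatically rather than by eye. The sketch below re-enters the BIC and Cp columns from the table above into a small data frame (`fwd` is a hypothetical name) and reports which model size minimizes each. It only ranks the sizes; which model to carry forward remains a judgment call.

```r
# BIC and Cp values copied from the forward selection summary table above
fwd <- data.frame(
  n_input_variables = 1:9,
  BIC = c(-538133.9, -538679.0, -538725.9, -538717.6, -538709.2,
          -538700.4, -538691.5, -538682.6, -538673.6),
  Cp = c(628.845164, 53.68893, -0.368246, 0.997205, 2.316943,
         4.194667, 6.066094, 8.000564, 10.0)
)

# Model size with the smallest value of each criterion
best_by_bic <- fwd$n_input_variables[which.min(fwd$BIC)]
best_by_cp <- fwd$n_input_variables[which.min(fwd$Cp)]
print(c(BIC = best_by_bic, Cp = best_by_cp))
```

Both criteria point to the 3-variable model, with the 4-variable model next on each. Since the RSS values are on the order of 1e-19, the criteria here mostly reflect numerical noise around a perfect fit.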

In [99]:
GitHub_full <- lm(Stars ~ Size + Forks + Issues + Watchers + `Is Archived`, data = training_GitHub)

GitHub_full


Call:
lm(formula = Stars ~ Size + Forks + Issues + Watchers + `Is Archived`, 
    data = training_GitHub)

Coefficients:
      (Intercept)               Size              Forks             Issues  
       -6.610e-14         -6.019e-21         -3.214e-16         -5.368e-16  
         Watchers  `Is Archived`TRUE  
        1.000e+00          4.625e-16  


In [100]:
GitHub_test_pred_full <- predict(GitHub_full, newdata = testing_GitHub)

head(GitHub_test_pred_full)

- Including 'Watchers' in the model leads to perfect prediction, because in this dataset 'Watchers' equals 'Stars' for every repository.
- Therefore, the resulting RMSE will be 0 (or approximately 0).
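
A quick way to confirm this kind of leakage is to compare the two columns directly. The sketch below uses the 'Stars' and 'Watchers' values from the rows printed earlier in this notebook; `toy`, `identical_cols`, and `max_abs_resid` are hypothetical names for this example.

```r
# Stars and Watchers values taken from the sample rows shown earlier
toy <- data.frame(
  Stars = c(174, 261, 175, 291, 181, 428, 515, 250, 183),
  Watchers = c(174, 261, 175, 291, 181, 428, 515, 250, 183)
)

# TRUE when the predictor is an exact copy of the response
identical_cols <- all(toy$Stars == toy$Watchers)

# Regressing Stars on Watchers then fits perfectly: residuals are numerically zero
fit <- lm(Stars ~ Watchers, data = toy)
max_abs_resid <- max(abs(residuals(fit)))
print(identical_cols)
```

On the full data, the analogous check would be `all(df_cleaned$Stars == df_cleaned$Watchers)`.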

- To check for multicollinearity, I used the vif function and got the following results:
- Size: 1.001, Forks: 1.615, Issues: 1.441, Watchers: 2.042, `Has Discussions`: 1.051, repo_age_days: 1.087, days_since_update: 1.071.
- I observed that 'Watchers' is strongly correlated with both 'Forks' and 'Issues' (0.62 and 0.55, respectively).
- Now let's fit a linear regression model using the best-fit model from Forward Selection, without 'Watchers'.
- Note: I could have removed 'Watchers' before running the Forward Selection algorithm; however, I wanted to see what the algorithm would select as the best-fit model.
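
For reference, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The base R sketch below computes it by hand on simulated data, so it does not depend on any particular package's vif function; `vif_manual` and the toy predictors are hypothetical names.

```r
set.seed(42)
n <- 500

# Simulated predictors: x2 is built from x1, x3 is independent of both
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 3))
```

Values near 1, like most of those listed above, indicate little linear dependence on the other predictors; the correlated pair in this toy example comes out noticeably higher than the independent one.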

In [101]:
GitHub_No_Watchers <- lm(Stars ~ Forks + Issues + `Is Archived` , data = training_GitHub)

GitHub_No_Watchers


Call:
lm(formula = Stars ~ Forks + Issues + `Is Archived`, data = training_GitHub)

Coefficients:
      (Intercept)              Forks             Issues  `Is Archived`TRUE  
          543.753              1.893              3.094           -224.554  


In [102]:
GitHub_pred_No_Watchers <- predict(GitHub_No_Watchers, newdata = testing_GitHub)

head(GitHub_pred_No_Watchers)

- In the previous assignment's visualizations, 'Watchers' was perfectly linearly correlated with 'Stars' and also strongly correlated with 'Issues' and 'Forks'. Now let's fit two simple linear regression models: one with only 'Forks' as the input variable, and one with only 'Issues'.

In [107]:
GitHub_Simple_Forks <- lm(Stars ~ Forks, data = training_GitHub)


GitHub_Simple_Issues <- lm(Stars ~ Issues, data = training_GitHub)

GitHub_Simple_Forks
GitHub_Simple_Issues


Call:
lm(formula = Stars ~ Forks, data = training_GitHub)

Coefficients:
(Intercept)        Forks  
    609.813        2.053  



Call:
lm(formula = Stars ~ Issues, data = training_GitHub)

Coefficients:
(Intercept)       Issues  
    845.588        6.472  


In [109]:
GitHub_test_pred_Simple_Forks <- predict(GitHub_Simple_Forks, newdata = testing_GitHub)

GitHub_test_pred_Simple_Issues <- predict(GitHub_Simple_Issues, newdata = testing_GitHub)


head(GitHub_test_pred_Simple_Forks)
head(GitHub_test_pred_Simple_Issues)

In [110]:
GitHub_RMSE_table <- bind_rows(
  tibble(
    Model = "Full Regression",
    RMSE = rmse(model = GitHub_full, data = testing_GitHub)
  ),
  tibble(
    Model = "Full Regression No Watchers",
    RMSE = rmse(model = GitHub_No_Watchers, data = testing_GitHub)
  ),
  tibble(
    Model = "Simple Regression (Forks)",
    RMSE = rmse(model = GitHub_Simple_Forks, data = testing_GitHub)
  ),
  tibble(
    Model = "Simple Regression (Issues)",
    RMSE = rmse(model = GitHub_Simple_Issues, data = testing_GitHub)
  )
)

GitHub_RMSE_table

Model,RMSE
<chr>,<dbl>
Full Regression,5.015018e-13
Full Regression No Watchers,1980.421
Simple Regression (Forks),2041.78
Simple Regression (Issues),2435.912


- Setting aside the model that includes 'Watchers', the RMSE values show that the best predictions come from the Forward Selection model without 'Watchers' ("Full Regression No Watchers"), which has the lowest RMSE (1980.421).
- These large errors were expected because the visualizations in the previous assignment showed that these input variables have only a weak linear correlation with the number of stars a repository has.
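
For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal base R sketch of that formula on hypothetical numbers follows; `rmse_manual`, `actual`, and `predicted` are illustrative names, and this is the underlying formula rather than the function used in the table above.

```r
# RMSE = sqrt(mean((actual - predicted)^2))
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical observed and predicted values
actual <- c(3, 5, 7)
predicted <- c(2, 5, 9)

# Errors are 1, 0, -2, so the mean squared error is 5/3
toy_rmse <- rmse_manual(actual, predicted)
print(toy_rmse)
```

Because RMSE is in the units of the response, a value near 2000 means typical prediction errors on the order of 2000 stars, which is large relative to the sample's average of about 1092 stars.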