## Introduction:

- ...
- ...
- ...
- ...
- ...

In [None]:
install.packages("leaps")
install.packages("Metrics")

In [None]:
library(tidyverse)
library(lubridate)

library(ggplot2)
library(broom)
library(tidyr)
library(dplyr)
library(Metrics)
library(modelr)
library(GGally)

library(leaps)
library(caret)
library(car)
library(reshape2)

## Question: 
* What factors affect the number of stars a GitHub repository has?
* What variables are most correlated with the success of a GitHub repository?
* Can we predict the number of stars a GitHub repository will have from other input variables (e.g. number of issues, number of forks, number of days since it was created)?
* How accurate are the predictions, and what is the best-fit model based on the given input data? 

## EDA Methods: 
- I explore the data with visuals that show the relationships between each input variable and the response, and among the input variables themselves. A correlation heat map is used to check for highly correlated input variables (multicollinearity).
- First, I convert the 'Created At' and 'Updated At' date variables to numeric values giving the number of days since each repository was created and last updated, respectively.
- Then I clean the data and split it into training and testing sets with an 80/20 ratio.
- Next, I run the Forward Selection algorithm on the training set to find the best-fit model.
- I then use the variables chosen by Forward Selection to fit an additive linear regression model.
- Using the predict function, I predict the number of stars in the testing set with the fitted model.
- Finally, I calculate the RMSE to evaluate the accuracy of the predictions.

This dataset lists over 215,000 of the most-starred GitHub projects, each with more than 167 stars. It was collected using the GitHub search API, which returns only the first 1,000 results for a query. The collection therefore loops over pairs of low and high star counts, chosen so that each query of the form Star:{Low}...{High} returns fewer than a thousand repositories. Each record includes the name, description, URL, date of creation, date of last update, homepage, size, and stars, among other attributes, for a total of 24 attributes.

- Name: chr
- Description: chr
- URL: chr
- Created At: dttm
- Updated At: dttm
- Homepage: chr
- Size: dbl
- Stars: dbl
- Forks: dbl
- Issues: dbl
- Watchers: dbl
- Language: chr
- License: chr
- Topics: chr
- Has Issues: lgl
- Has Projects: lgl
- Has Downloads: lgl
- Has Wiki: lgl
- Has Pages: lgl
- Has Discussions: lgl
- Is Fork: lgl
- Is Archived: lgl
- Is Template: lgl
- Default Branch: chr
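
The star-range looping described above can be sketched as follows. This is only an illustration, not the collection script used for the dataset: it assumes GitHub's `stars:LOW..HIGH` search qualifier, and both the helper `build_star_queries` and the bucket boundaries are hypothetical. It only builds the query strings and makes no network requests.

```r
# Hypothetical helper: build one search query per star bucket.
# `breaks` are illustrative bucket boundaries, not the ones used for the dataset.
build_star_queries <- function(breaks) {
  low  <- head(breaks, -1)
  high <- tail(breaks, -1) - 1   # subtract 1 so adjacent buckets do not overlap
  paste0("stars:", low, "..", high)
}

queries <- build_star_queries(c(167, 200, 250, 300))
print(queries)
```

In practice each bucket would need to be narrow enough that its query matches fewer than 1,000 repositories.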

In [None]:
df <- read_csv('repositories.csv')

In [None]:
df_subset <- df %>% sample_n(size = 10000)

head(df_subset, n = 3)

In [None]:
nrow(df_subset)

glimpse(df_subset) 

colnames(df_subset) 

- I am choosing only these variables because of the limitation on running the Forward Selection algorithm on all of the given variables. Also, the other columns (e.g. the remaining logical variables) contain largely redundant data.  
Note: I explain more about this issue in the cell just above the Forward Selection cell below.

In [None]:
features <- c("Size", "Stars", "Forks",
              "Issues", 'Watchers', 'Is Archived', 'Has Issues', 'Has Pages',
              'Created At', 'Updated At')
df_subset_1 <- df_subset %>% select(all_of(features))

head(df_subset_1, n = 3)
nrow(df_subset_1)
sum(is.na(df_subset_1))

In [None]:
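# Note: this converts the date columns of df_subset only. df_subset_1 was created
# before this cell, so its date columns are converted in the next cell.
df_subset$'Created At' <- as.Date(strptime(df_subset$'Created At', format = "%Y-%m-%d %H:%M:%S"))
df_subset$'Updated At' <- as.Date(strptime(df_subset$'Updated At', format = "%Y-%m-%d %H:%M:%S"))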

In [None]:
df_subset_2 <- df_subset_1 %>%
  mutate(
    `Created At` = as.Date(`Created At`),        
    `Updated At` = as.Date(`Updated At`),        
    repo_age_days = as.numeric(Sys.Date() - `Created At`),
    days_since_update = as.numeric(Sys.Date() - `Updated At`)
  )

head(df_subset_2, n = 3)
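
The conversion above measures both durations relative to `Sys.Date()`, so the values change whenever the notebook is re-run on a different day. A minimal sketch of an alternative, assuming a fixed snapshot date is acceptable: the timestamps and `reference_date` below are made-up placeholders, not values from the dataset.

```r
# Hypothetical timestamps standing in for the `Created At` column.
created_at <- as.POSIXct(c("2020-01-01 12:00:00", "2023-06-15 08:30:00"), tz = "UTC")

# Assumed snapshot date; replace with the date the data was actually collected.
reference_date <- as.Date("2024-01-01")

# Difference between two Date objects is in days; as.numeric drops the units.
repo_age_days <- as.numeric(reference_date - as.Date(created_at))
print(repo_age_days)
```

Using a fixed reference date keeps `repo_age_days` and `days_since_update` reproducible across runs.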

In [None]:
df_subset_3 <- df_subset_2 %>% select(-c('Created At', 'Updated At'))

head(df_subset_3, n = 3)

In [None]:
non_na_repo_age <- df_subset_3 %>% filter(!is.na(repo_age_days))
non_na_days_since_update <- df_subset_3 %>% filter(!is.na(days_since_update))

head(df_subset_3 %>% select(repo_age_days, days_since_update), n = 3)

In [None]:
df_cleaned <- df_subset_3 %>% filter(!is.na(Stars) & !is.na(Forks) & !is.na(repo_age_days))  

head(df_cleaned, n = 3)

In [None]:
summary_stats <- df_cleaned %>%
  summarize(
    count = n(),
    avg_stars = mean(Stars, na.rm = TRUE),
    avg_forks = mean(Forks, na.rm = TRUE),
    avg_issues = mean(Issues, na.rm = TRUE),
    avg_repo_age = mean(repo_age_days, na.rm = TRUE),
    avg_days_since_update = mean(days_since_update, na.rm = TRUE)
  )

print(summary_stats)
head(df_cleaned, n = 3)
nrow(df_cleaned)
sum(is.na(df_cleaned))

*Now that our data is clean, I am going to visualize it. I will start by plotting a simple linear model fit for each numerical input variable against our desired response variable (Stars).*

In [None]:
continuous_vars <- c("Size", "Forks", "Issues")

model <- lm(Stars ~ ., data = df_cleaned %>% select(Stars, all_of(continuous_vars)))

print(summary(model))

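# Plot Stars against each continuous input with a fitted regression line.
# aes_string() is deprecated, so the column is looked up with the .data pronoun.
for (var in continuous_vars) {
  p <- ggplot(df_cleaned, aes(x = .data[[var]], y = Stars)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
    labs(title = paste("Linear Model Fit for Stars vs", var), x = var) +
    theme_minimal()

  print(p)
}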

*To investigate the presence of multicollinearity, I am going to create a heat map to visualize the correlation between input variables.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 12) 

df_cleaned_pair_plots <- 
    df_cleaned %>% 
    select(Size, Forks, Issues, Watchers, repo_age_days, days_since_update)%>% 
    ggpairs(progress = FALSE) +
    theme(text = element_text(size = 15))

df_cleaned_pair_plots

In [None]:
corr_matrix_df_cleaned <- 
    df_cleaned %>%  
    select(Size, Forks, Issues, Watchers, repo_age_days, days_since_update) %>%
    cor() %>%
    as_tibble(rownames = 'var1') %>%
    pivot_longer(-var1, names_to = "var2", values_to = "corr")

head(corr_matrix_df_cleaned)
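
A long-format correlation table like the one above can be drawn as a heat map with `geom_tile()`. The sketch below is self-contained and uses the built-in `mtcars` data as a stand-in for the repository features; the column names `var1`, `var2`, and `corr` are chosen to mirror the table above, so the same plotting call should carry over to it.

```r
library(ggplot2)

# Built-in mtcars columns stand in for the repository features.
corr_long <- as.data.frame(as.table(cor(mtcars[, c("mpg", "hp", "wt", "qsec")])))
names(corr_long) <- c("var1", "var2", "corr")

# One tile per variable pair, coloured by the correlation coefficient.
heatmap_plot <- ggplot(corr_long, aes(x = var1, y = var2, fill = corr)) +
  geom_tile() +
  geom_text(aes(label = round(corr, 2))) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red",
                       midpoint = 0, limits = c(-1, 1)) +
  labs(title = "Pairwise correlations", x = NULL, y = NULL) +
  theme_minimal()

print(heatmap_plot)
```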

In [None]:
MLR_df_cleaned <- lm(formula = Stars ~ ., data = df_cleaned)

MLR_df_cleaned_results <- 
    tidy(MLR_df_cleaned) %>%
    mutate_if(is.numeric, round, 2)

head(MLR_df_cleaned_results)

In [None]:
VIF_MLR_df_cleaned <- vif(MLR_df_cleaned)

round(VIF_MLR_df_cleaned, 3)
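
For numeric predictors, the variance inflation factor of variable j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that variable on all the other inputs. A minimal sketch of that calculation in base R, using `mtcars` as a stand-in for the repository data (the variable names are illustrative only):

```r
# mtcars stands in for the repository data.
predictors <- c("hp", "wt", "qsec")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression: predictor p explained by the remaining predictors
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

print(round(manual_vif, 3))
```

Values close to 1 indicate that a predictor is barely explained by the others; larger values indicate stronger multicollinearity.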

*We can see from the heat map that some of the input variables have a moderate linear correlation, e.g. Forks and Issues (0.667). Their correlation coefficients are the ones computed in the correlation matrix above.*

In the next cell, I split the data into training and testing sets (80/20 ratio). I use them to fit the linear regression models, predict 'Stars' values, and evaluate the models.

In [None]:
set.seed(123) # DO NOT CHANGE!

GitHub_sample <-
    df_cleaned %>%
    mutate(id = row_number())

training_GitHub <- 
    GitHub_sample %>%
    slice_sample(prop = 0.80, replace = FALSE)

testing_GitHub <- 
    GitHub_sample %>% 
    anti_join(training_GitHub, by = "id") %>%
    select(-id)

training_GitHub <- 
    training_GitHub %>% 
    select(-id)

head(training_GitHub)
nrow(training_GitHub)

nrow(testing_GitHub)

- One limitation I faced when running the Forward Selection algorithm on the full data was that the kernel would crash and restart.
- To work around this, I dropped the categorical variables with many levels (more than 2), which, based on what I learned in this course, I believe have little to no utility in predicting the number of stars.

In [None]:
GitHub_forward_sel <- regsubsets(
  Stars ~ ., data = training_GitHub,
  method = "forward",
  nvmax = ncol(training_GitHub) - 1
)
GitHub_forward_sel

GitHub_fwd_summary <- summary(GitHub_forward_sel)

GitHub_fwd_summary <- tibble(
  n_input_variables = 1:length(GitHub_fwd_summary$rss),  
  RSS = GitHub_fwd_summary$rss,
  BIC = GitHub_fwd_summary$bic,
  Cp = GitHub_fwd_summary$cp
)

In [None]:
GitHub_fwd_summary
summary(GitHub_forward_sel)
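
To make the selection procedure concrete, here is a minimal base-R sketch of forward selection: at each step it adds the variable that lowers the RSS the most and records the BIC of the resulting model. It is a simplified illustration, not the `regsubsets` implementation; `forward_select` is a hypothetical helper, and `mtcars` (with `mpg` as the response) stands in for the training set.

```r
# Minimal forward selection by RSS, with BIC recorded at each step.
forward_select <- function(data, response) {
  remaining <- setdiff(names(data), response)
  selected <- character(0)
  steps <- data.frame()
  while (length(remaining) > 0) {
    # RSS of each candidate model that adds one more variable
    rss <- sapply(remaining, function(v) {
      fit <- lm(reformulate(c(selected, v), response = response), data = data)
      sum(residuals(fit)^2)
    })
    best <- names(which.min(rss))
    selected <- c(selected, best)
    remaining <- setdiff(remaining, best)
    fit <- lm(reformulate(selected, response = response), data = data)
    steps <- rbind(steps, data.frame(n_input_variables = length(selected),
                                     added = best,
                                     RSS = sum(residuals(fit)^2),
                                     BIC = BIC(fit)))
  }
  steps
}

fwd_steps <- forward_select(mtcars[, c("mpg", "hp", "wt", "qsec", "drat")], "mpg")
print(fwd_steps)

# Model size with the lowest BIC
best_size <- fwd_steps$n_input_variables[which.min(fwd_steps$BIC)]
print(best_size)
```

Because the models are nested, RSS can only decrease as variables are added, so it cannot be used alone to pick a model size; penalized criteria such as BIC or Cp are what distinguish the candidates.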

- Based on the forward selection, the variables for the best-fit model are Size, Forks, Issues, and Watchers. Therefore, I fit an additive linear regression model on the training data using the selected variables (model 4). Note that the model fitted in the next cell also includes `Is Archived`.
- I selected model 4 because it has the best combination of low Cp and BIC together with one of the lowest RSS values.

In [None]:
GitHub_full <- lm(Stars ~ Size + Forks + Issues + Watchers + `Is Archived`, data = training_GitHub)

GitHub_full

In [None]:
GitHub_test_pred_full <- predict(GitHub_full, newdata = testing_GitHub)

head(GitHub_test_pred_full)

- Including 'Watchers' in the model leads to perfect prediction, because 'Watchers' = 'Stars'.
- Therefore, the resulting RMSE will be 0 (or approximately 0).

- Using the vif function to check for multicollinearity, I got the following results:
- Size: 1.001, Forks: 1.615, Issues: 1.441, Watchers: 2.042, `Has Discussions`: 1.051, repo_age_days: 1.087, days_since_update: 1.071.
- I observed that 'Watchers' has a strong correlation with both 'Forks' and 'Issues' (0.62 and 0.55, respectively).
- Now let's fit a linear regression model using the best-fit result from Forward Selection, without 'Watchers'.
- Note: I could have removed 'Watchers' before running the Forward Selection algorithm; however, I wanted to see what the algorithm would select as the best-fit model.

In [None]:
GitHub_No_Watchers <- lm(Stars ~ Forks + Issues + `Is Archived` , data = training_GitHub)

GitHub_No_Watchers

In [None]:
GitHub_pred_No_Watchers <- predict(GitHub_No_Watchers, newdata = testing_GitHub)

head(GitHub_pred_No_Watchers)

- Since we saw in the visualization from the previous assignment that 'Watchers' was perfectly linearly correlated with 'Stars' and also strongly correlated with 'Issues' and 'Forks', let's now fit two simple linear regression models: one with only 'Forks' as the input variable, and one with only 'Issues'.

In [None]:
GitHub_Simple_Forks <- lm(Stars ~ Forks, data = training_GitHub)


GitHub_Simple_Issues <- lm(Stars ~ Issues, data = training_GitHub)

GitHub_Simple_Forks
GitHub_Simple_Issues

In [None]:
GitHub_test_pred_Simple_Forks <- predict(GitHub_Simple_Forks, newdata = testing_GitHub)

GitHub_test_pred_Simple_Issues <- predict(GitHub_Simple_Issues, newdata = testing_GitHub)


head(GitHub_test_pred_Simple_Forks)
head(GitHub_test_pred_Simple_Issues)

In [None]:
# Both Metrics and modelr export rmse(); the (model, data) signature used here
# is modelr's, so it is called explicitly to avoid depending on load order.
GitHub_RMSE_table <- bind_rows(
  tibble(
    Model = "Full Regression",
    RMSE = modelr::rmse(model = GitHub_full, data = testing_GitHub)
  ),
  tibble(
    Model = "Full Regression No Watchers",
    RMSE = modelr::rmse(model = GitHub_No_Watchers, data = testing_GitHub)
  ),
  tibble(
    Model = "Simple Regression (Forks)",
    RMSE = modelr::rmse(model = GitHub_Simple_Forks, data = testing_GitHub)
  ),
  tibble(
    Model = "Simple Regression (Issues)",
    RMSE = modelr::rmse(model = GitHub_Simple_Issues, data = testing_GitHub)
  )
)

GitHub_RMSE_table
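
The RMSE reported in the table is the square root of the mean squared difference between observed and predicted values on the test set. A minimal sketch of that calculation by hand, using `mtcars` as a stand-in for the repository data (the row split and the predictors are illustrative only):

```r
# mtcars stands in for the data: first 24 rows as "training", the rest as "testing".
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

fit <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows
rmse_manual <- sqrt(mean((test$mpg - pred)^2))
print(rmse_manual)
```

RMSE is expressed in the units of the response, so in this notebook it can be read as a typical prediction error in number of stars.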

- Comparing the RMSE values of all the models except the one with 'Watchers', the linear regression model with the best predictions (lowest RMSE) is the one built from the Forward Selection variables.
- This was expected because the visualizations of the relationship between each input variable and 'Stars' (in the previous assignment) showed that, taken individually, those variables had only a weak linear correlation with the number of stars a repository has.

## Discussion: 


- The analysis of GitHub repository data revealed that certain features, such as Size, Forks, and Issues, show significant correlations with Stars.
  - Implications: These findings suggest that the GitHub community highly values active contributions and project size.
  - For developers looking to gain visibility, regular updates and long-term project management might be critical strategies. Furthermore, it highlights the importance of continuous engagement with the community to maintain and grow a project's popularity.

- Were the Results Expected?
  - Yes, the results align with the expected findings. We anticipated that the size of the repository and the number of forks would correlate with higher star counts, as these factors are typically associated with increased engagement and project quality.

- Model Improvement:
  - Add more features: Introducing additional features like the number of contributors, issue resolution time, or repository license type could enhance the model's predictive power by accounting for more aspects of repository activity.
  - Model selection: Experimenting with more advanced regression techniques, such as random forest regression, could capture more complex relationships and interactions between features.
  - Outlier handling: Identifying and managing outliers, such as repositories with very high stars despite low activity, could lead to more robust model results.
  - Cross-validation: Implementing cross-validation techniques would help assess the model's performance more thoroughly and prevent overfitting.

- Future Questions/Research:
  - Impact of contributor diversity: Future research could explore how the diversity of contributors (e.g., the number of unique contributors or contributions from well-known developers) influences the star count.
  - Project lifecycle and stars: Investigating how the timing of major updates, milestones, or releases correlates with star growth over time could provide valuable insights for project planning.
  - External factors: It would be interesting to study the influence of external factors such as social media activity or media coverage on project popularity, potentially incorporating data from Twitter or Reddit.
  - Causality vs. correlation: Future work could focus on determining whether frequent updates directly drive star growth or if other underlying factors are at play (correlation does not imply causation).
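
The cross-validation idea listed under Model Improvement could be sketched as follows. This is a minimal k-fold example in base R under stated assumptions: `mtcars` stands in for the cleaned repository data, `mpg` plays the role of Stars, and the choice of k = 5 and of predictors is illustrative.

```r
set.seed(123)

# Assign each row to one of k folds at random.
k <- 5
fold_id <- sample(rep(1:k, length.out = nrow(mtcars)))

# Fit on k - 1 folds and compute RMSE on the held-out fold, for each fold in turn.
cv_rmse <- sapply(1:k, function(i) {
  train <- mtcars[fold_id != i, ]
  held_out <- mtcars[fold_id == i, ]
  fit <- lm(mpg ~ wt + hp, data = train)
  pred <- predict(fit, newdata = held_out)
  sqrt(mean((held_out$mpg - pred)^2))
})

print(cv_rmse)
print(mean(cv_rmse))
```

Averaging the RMSE over folds gives a performance estimate that depends less on one particular 80/20 split.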

## References:

- GitHub. (n.d.). Saving repositories with stars. GitHub Docs. https://docs.github.com/en/get-started/exploring-projects-on-github/saving-repositories-with-stars
- Barbos, D. (n.d.). GitHub repositories. Kaggle. https://www.kaggle.com/datasets/donbarbos/github-repos?resource=download
- Mojica Ruiz, I. J., Nagappan, M., Adams, B., Berger, T., Dienst, S., & Hassan, A. E. (2014). Impact of ad libraries on ratings of Android mobile apps. IEEE Software, 31(6), 86–92.
- Aggarwal, K., Hindle, A., & Stroulia, E. (2014). Co-evolution of project documentation and popularity within GitHub. In 11th Working Conference on Mining Software Repositories (MSR) (pp. 360–363).
