In [None]:
install.packages("leaps")
install.packages("ggcorrplot")
install.packages("tidyr")
install.packages("reshape2")

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



In [None]:
library(tidyverse)
library(lubridate)

library(ggplot2)
library(reshape2)
library(broom)
library(tidyr)
library(dplyr)
library(GGally)

library(knitr)

*This dataset lists more than 215,000 of the most-starred GitHub projects, each with more than 167 stars. It was collected with the GitHub search API, which returns only the first 1,000 results for a query. To work around that limit, the collection looped over low-to-high star ranges (query = Star:{Low}...{High}) chosen so that each query returned fewer than 1,000 repositories. Each repository record includes the name, description, URL, date of creation, date of last update, homepage, size, and stars, among other attributes, for a total of 24 attributes.*

- Name:               chr 
- Description:        chr
- URL:                chr 
- Created At:         dttm 
- Updated At:         dttm
- Homepage:           chr
- Size:               dbl
- Stars:              dbl
- Forks:              dbl 
- Issues:             dbl
- Watchers:           dbl
- Language:           chr 
- License:            chr 
- Topics:             chr 
- Has Issues:        lgl
- Has Projects:      lgl 
- Has Downloads:     lgl 
- Has Wiki:          lgl 
- Has Pages:         lgl
- Has Discussions:   lgl 
- Is Fork:           lgl
- Is Archived:       lgl 
- Is Template:       lgl 
- Default Branch:    chr

In [None]:
df <- read_csv('repositories.csv')

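# Fix the seed so the 10,000-row random sample is reproducible (the seed value is arbitrary).
set.seed(123)
df_subset <- df %>% sample_n(size = 10000)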

head(df_subset, n = 3)

* What factors affect the number of stars a GitHub repository has?
* What variables are most correlated with the success of a GitHub repository?
* Can we predict the number of stars a GitHub repository will have based on the number of other input variables (e.g. # of Issues, # of Forks, # of days since it was created, ...etc)?
* How accurate are the predictions, and what is the best-fit model based on the given input data? 
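
*One common way to approach the accuracy question is to hold out part of the data and measure prediction error on it. The sketch below is only an illustration under stated assumptions: `toy` is a synthetic data frame with made-up values, not the repository data, and the 80/20 split and the use of RMSE are choices made here, not something the dataset prescribes.*

```r
# Synthetic stand-in for the repository data (values are made up for illustration).
set.seed(42)
n <- 500
toy <- data.frame(
  Forks  = rpois(n, lambda = 50),
  Issues = rpois(n, lambda = 20)
)
toy$Stars <- 200 + 4 * toy$Forks + 2 * toy$Issues + rnorm(n, sd = 30)

# Hold out 20% of the rows for testing.
test_idx <- sample(seq_len(n), size = 0.2 * n)
train <- toy[-test_idx, ]
test  <- toy[test_idx, ]

# Fit on the training rows only, then predict the held-out rows.
fit  <- lm(Stars ~ Forks + Issues, data = train)
pred <- predict(fit, newdata = test)

# Root mean squared error on the held-out rows, in units of stars.
rmse <- sqrt(mean((test$Stars - pred)^2))
rmse
```

*A lower held-out RMSE means better predictions on unseen repositories, which makes it a reasonable yardstick for comparing candidate models.*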

*The two cells below provide additional information about the data such as the number of observations, variable names and types, and samples of the data that is included in each column.* 

In [None]:
nrow(df_subset)

glimpse(df_subset) 

colnames(df_subset) 

In [None]:
str(df)

*Having inspected the variables and their types, I now want to create a subset that includes 
only the variables that are relevant to the question I am trying to answer with this data.*

In [None]:
features <- c('Created At', 'Updated At', "Size", "Stars", "Forks",
              "Issues", 'Watchers', 'Has Discussions')
df_subset <- df_subset %>% select(all_of(features))

head(df_subset, n = 3)

In [None]:
sum(is.na(df_subset))

*The variables 'Created At' and 'Updated At' are in DateTime <dttm> format. To make them easier to work with, I will convert them to numeric values representing the number of days that have elapsed since the repository was created and since it was last updated.*

In [None]:
head(df_subset$'Created At') 

In [None]:
df_subset$'Created At' <- as.Date(strptime(df_subset$'Created At', format = "%Y-%m-%d %H:%M:%S"))
df_subset$'Updated At' <- as.Date(strptime(df_subset$'Updated At', format = "%Y-%m-%d %H:%M:%S"))

In [None]:
df_subset <- df_subset %>%
  mutate(repo_age_days = as.numeric(Sys.Date() - `Created At`),
         days_since_update = as.numeric(Sys.Date() - `Updated At`))
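
*Because `Sys.Date()` changes every day, the computed ages shift each time the notebook is rerun. A minimal sketch of the same arithmetic with a fixed reference date is shown below; `reference_date` and the two creation dates are hypothetical values chosen only to illustrate the calculation.*

```r
# Hypothetical fixed reference date; replace with the date the data was collected.
reference_date <- as.Date("2024-01-01")

# Two made-up creation dates, used only to show the arithmetic.
created <- as.Date(c("2023-12-25", "2020-01-01"))

# Subtracting Date objects yields a difftime in days; as.numeric() drops the class.
age_days <- as.numeric(reference_date - created)
age_days  # 7 1461
```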

In [None]:
head(df_subset, n = 3)

In [None]:
df_subset <- df_subset %>% select(-c('Created At', 'Updated At'))

head(df_subset, n = 3)

*Now, the data needs to be cleaned before we can visualize it.*

In [None]:
non_na_repo_age <- df_subset %>% filter(!is.na(repo_age_days))
non_na_days_since_update <- df_subset %>% filter(!is.na(days_since_update))

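# n = 3 belongs to head(), not select(); inside select() it would pick and rename the third column.
head(df_subset %>% select(repo_age_days, days_since_update), n = 3)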

In [None]:
df_cleaned <- df_subset %>% filter(!is.na(Stars) & !is.na(Forks) & !is.na(repo_age_days))  

head(df_cleaned, n = 3)

In [None]:
df_summary <- summary(df_cleaned) %>%
              kable()

df_summary

In [None]:
summary_stats <- df_cleaned %>%
  summarize(
    count = n(),
    avg_stars = mean(Stars, na.rm = TRUE),
    avg_forks = mean(Forks, na.rm = TRUE),
    avg_issues = mean(Issues, na.rm = TRUE),
    avg_repo_age = mean(repo_age_days, na.rm = TRUE),
    avg_days_since_update = mean(days_since_update, na.rm = TRUE)
  )

print(summary_stats)

In [None]:
head(df, n = 3)
nrow(df_cleaned)
sum(is.na(df_cleaned))

In [None]:
colSums(is.na(df_cleaned))

*Now that our data is clean, I am going to visualize it. I will start by plotting a simple linear fit of each numerical input variable against our response variable (Stars).*

In [None]:
colnames(df_cleaned)

In [None]:
continuous_vars <- c("Size", "Forks", "Issues", "Watchers", "repo_age_days", "days_since_update")

model <- lm(Stars ~ ., data = df_cleaned %>% select(Stars, all_of(continuous_vars)))

print(summary(model))

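# One scatter plot per predictor, with a fitted simple regression line.
# aes_string() is deprecated, so the column is looked up through the .data pronoun.
for (var in continuous_vars) {
  p <- ggplot(df_cleaned, aes(x = .data[[var]], y = Stars)) +
    geom_point(alpha = 0.5) +
    geom_smooth(method = "lm", formula = y ~ x, color = "red", se = FALSE) +
    labs(title = paste("Linear Model Fit for Stars vs", var), x = var) +
    theme_minimal()

  print(p)
}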
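
*The red lines in these plots are simple (one-predictor) regressions. To put numbers on them, one option is to fit each simple regression separately and tabulate its slope and R-squared. The sketch below does this on a synthetic data frame `toy` with made-up values; the names `toy`, `predictors`, and `simple_fits` are introduced here for illustration only.*

```r
# Synthetic stand-in for df_cleaned (values are made up for illustration).
set.seed(1)
n <- 300
toy <- data.frame(
  Forks  = rpois(n, lambda = 40),
  Issues = rpois(n, lambda = 15)
)
toy$Stars <- 150 + 5 * toy$Forks + rnorm(n, sd = 25)

predictors <- c("Forks", "Issues")

# Fit Stars ~ <predictor> separately for each predictor and keep slope and R-squared.
simple_fits <- do.call(rbind, lapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "Stars"), data = toy)
  data.frame(
    predictor = v,
    slope     = unname(coef(fit)[v]),
    r_squared = summary(fit)$r.squared
  )
}))

simple_fits
```

*In this toy data only Forks drives Stars, so its row should show a much higher R-squared than the Issues row.*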

*To investigate the presence of multicollinearity, I am going to create pair plots and a heat map to visualize
the correlation between input variables.*

In [None]:
options(repr.plot.width = 15, repr.plot.height = 12) 

df_cleaned_pair_plots <- 
    df_cleaned %>% 
    select(Size, Forks, Issues, Watchers, repo_age_days, days_since_update)%>% 
    ggpairs(progress = FALSE) +
    theme(text = element_text(size = 15))

df_cleaned_pair_plots

In [None]:
corr_matrix_df_cleaned <- 
    df_cleaned %>%  
    select(Size, Forks, Issues, Watchers, repo_age_days, days_since_update) %>%
    cor() %>%
    as_tibble(rownames = 'var1') %>%
    pivot_longer(-var1, names_to = "var2", values_to = "corr")

head(corr_matrix_df_cleaned)

In [None]:
options(repr.plot.width = 8, repr.plot.height = 5) 

plot_corr_matrix_df_cleaned <- 
    corr_matrix_df_cleaned %>%
    ggplot(aes(var1, var2)) +
    geom_tile(aes(fill = corr), color = "white") +
    scale_fill_distiller("Correlation Coefficient \n",
                         palette = "YlOrRd",
                         direction = 1, 
                         limits = c(-1, 1)
    ) +
    labs(x = "", y = "") +
    theme_minimal() +
    theme(
        axis.text.x = element_text(angle = 45, vjust = 1, size = 14, hjust = 1),
        axis.text.y = element_text(vjust = 1, size = 14, hjust = 1),
        legend.title = element_text(size = 18),
        legend.text = element_text(size = 12),
        legend.key.size = unit(1.5, "cm")
    ) +
    coord_fixed() +
    geom_text(aes(var2, var1, label = round(corr, 2)), color = "black", size = 4.5)

plot_corr_matrix_df_cleaned

*We can see a strong positive correlation between 'Watchers' and 'Forks', which suggests that multicollinearity may be present.*

In [None]:
MLR_df_cleaned <- lm(formula = Stars ~ ., data = df_cleaned)

# Round every numeric column of the tidy coefficient table to 2 decimals.
MLR_df_cleaned_results <- 
    tidy(MLR_df_cleaned) %>%
    mutate(across(where(is.numeric), ~ round(.x, 2)))

head(MLR_df_cleaned_results)

In [None]:
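# vif() comes from the car package, which is not attached above; this assumes car is installed.
VIF_MLR_df_cleaned <- car::vif(MLR_df_cleaned)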

round(VIF_MLR_df_cleaned, 3)
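
*For intuition, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on synthetic predictors; `x1`, `x2`, `x3`, and `manual_vif` are made-up names and values used only to show the calculation.*

```r
# Synthetic predictors (made up): x2 is built to track x1 closely, x3 is independent.
set.seed(7)
n  <- 400
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.3)
x3 <- rnorm(n)
toy <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j
# on all the other predictors.
manual_vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})

round(manual_vif, 2)
```

*Here `x1` and `x2` should show large values while `x3` stays close to 1, which is the pattern to look for among the repository predictors.*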

*We can see from the heat map that some of the input variables have a moderate linear correlation, 
 e.g. Forks and Issues (0.667). In the next cell, we fit an additive model and look at its estimated coefficients and their confidence intervals.*

In [None]:
additive_model <- lm(Stars ~ Size + Forks + Issues + Watchers + `Has Discussions`, data = df_cleaned)

additive_model_results <- tidy(additive_model, conf.int = TRUE)

additive_model
additive_model_results

In [None]:
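# `*` adds all main effects plus their two-way and three-way interactions,
# so this is an interaction model rather than an additive one.
interaction_model <- lm(Stars ~ Size * Watchers * `Has Discussions`, data = df_cleaned)

interaction_model_results <- tidy(interaction_model, conf.int = TRUE)

interaction_model

interaction_model_results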
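
*When one model is nested inside another, as an additive model is inside the corresponding interaction model, a nested F-test via `anova()` is one way to check whether the extra interaction terms are worth keeping. The sketch below illustrates this on synthetic data; `toy`, `has_discussions`, `fit_add`, and `fit_int` are hypothetical names with made-up values, not the notebook's fitted models.*

```r
# Synthetic stand-in (made up): the effect of Watchers depends on has_discussions.
set.seed(3)
n <- 400
toy <- data.frame(
  Watchers        = rpois(n, lambda = 30),
  has_discussions = rbinom(n, size = 1, prob = 0.4) == 1
)
toy$Stars <- 100 + 3 * toy$Watchers +
  2 * toy$Watchers * toy$has_discussions + rnorm(n, sd = 20)

# Additive model: each predictor shifts Stars independently.
fit_add <- lm(Stars ~ Watchers + has_discussions, data = toy)

# Interaction model: the Watchers slope may differ by has_discussions.
fit_int <- lm(Stars ~ Watchers * has_discussions, data = toy)

# Nested-model F-test: does the interaction term reduce residual variance?
comparison <- anova(fit_add, fit_int)
comparison
```

*A small p-value in the second row favors the interaction model; a large one suggests the simpler additive model is adequate.*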