## Planning Stage: Data Description & Exploratory Data Analysis and Visualization

(1) Data Description:

Provide a full description of the dataset chosen. Note that the selected dataset will probably contain more variables than you need. In fact, exploring how the different variables in the dataset affect your model may be a crucial part of the project. Regardless of which variables you plan to use, provide a full descriptive summary of the dataset, including information such as the number of observations, number of variables, name and type of variables, etc. You may want to use a table or bullet points to describe the variables in the dataset.

Include a brief description of the dataset indicating how the data has been collected or where it comes from.

This dataset contains a comprehensive list of the top GitHub projects by number of stars (over 167 stars). To collect it, the creator used the GitHub Search API, which returns at most 1,000 results per query, and looped over successive star ranges (low/high star pairs) so that each query matched fewer than 1,000 repositories. The dataset contains 215,029 repositories, each described by 24 attributes:
- name: The name of the GitHub repository
- description: A brief textual description that summarizes the purpose or focus of the repository
- URL: The web address of the repository on GitHub
- date of creation (date): The date and time when the repository was initially created on GitHub, in ISO 8601 format
- date of last update (date): The date and time of the most recent update or modification to the repository, in ISO 8601 format
- homepage: The URL of the project's homepage, if one is set
- size (number): The size of the repository, indicating the total storage space used by the repository's files and data (the GitHub API reports this value in kilobytes)
- stars (number): The number of stars or likes that the repository has received from other GitHub users, indicating its popularity or interest
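- forks (number): The number of times the repository has been forked by other GitHub users
- number of issues (number): The number of open issues in the repository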
- number of watchers (number): The number of GitHub users who are "watching" or monitoring the repository for updates and changes
- language: The primary programming language
- license: The license under which the repository is released, if any
- topics: A list of topics or tags associated with the repository, helping users discover related projects and topics of interest
- has issues (True/False): A boolean value indicating whether the repository has an issue tracker enabled
- has projects (True/False): A boolean value indicating whether the repository uses GitHub Projects to manage and organize tasks and work items
- has downloads (True/False): A boolean value indicating whether the repository offers downloadable files or assets to users
- has wiki (True/False): A boolean value indicating whether the repository has an associated wiki with additional documentation and information
- has pages (True/False): A boolean value indicating whether the repository has GitHub Pages enabled, allowing the creation of a website associated with the repository
- has discussions (True/False): A boolean value indicating whether the repository has GitHub Discussions enabled, allowing community discussions and collaboration
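- is a fork (True/False): A boolean value indicating whether the repository is a fork of another repository
- is archived (True/False): A boolean value indicating whether the repository has been archived (made read-only)
- is template (True/False): A boolean value indicating whether the repository is marked as a template repository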
- default branch: The name of the default branch
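
The descriptive summary asked for above (number of observations, number of variables, and the type of each variable) can be produced mechanically. The following is a minimal sketch in Python using only the standard library; the project itself will load the data in R. The two sample rows are made up for illustration, and the column names are assumptions that may not match the header of the actual CSV file.

```python
import csv
import io

# Two made-up rows standing in for the real file. The column names are
# assumptions and may differ from the actual CSV header.
SAMPLE_CSV = """name,stars,forks,language,has_wiki,created_at
example-repo-a,1200,85,Python,True,2019-03-14T09:26:53Z
example-repo-b,430,12,,False,2021-11-02T17:05:10Z
"""


def infer_type(values):
    """Guess a coarse type for a column from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "empty"
    if all(v in ("True", "False") for v in non_empty):
        return "boolean"
    try:
        for v in non_empty:
            float(v)
        return "number"
    except ValueError:
        return "text"


def summarise(csv_text):
    """Return (number of observations, {column: (type, missing count)})."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    columns = list(rows[0].keys()) if rows else []
    summary = {}
    for col in columns:
        values = [row[col] for row in rows]
        summary[col] = (infer_type(values), sum(v == "" for v in values))
    return len(rows), summary


n_obs, summary = summarise(SAMPLE_CSV)
print(f"{n_obs} observations, {len(summary)} variables")
for col, (kind, n_missing) in summary.items():
    print(f"{col}: {kind}, {n_missing} missing")
```

Counting empty cells per column at this stage also flags attributes such as language or homepage that may be missing for many repositories.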

(2) Question:
Clearly state the question you will try to answer using the selected dataset. Your question should involve one random variable of interest (the response) and one or more explanatory variables. Describe clearly how the data will help you address the question of interest. Explain whether your question is focused on prediction, inference, or both.

It is fine to have the same question as other group members. However, you don’t need to agree on a unique common question for the group project. In fact, usually many questions can be answered with the same dataset. Regardless of how many questions are proposed within each group, each team member must state and justify at least one question of interest.

Question:
What factors are most strongly associated with star popularity in GitHub repositories?

This question designates the 'stars' variable as the response variable and treats repository attributes that relate to exposure to GitHub users, user accessibility, and maintenance (created at, size, forks, issues, watchers, language, topics, has issues, has projects, has downloads, has wiki, has pages, has discussions, is fork, is archived) as potential explanatory variables. Because the goal is to use the fitted model to learn about the unknown relationships between these variables and the star count, rather than to predict star counts for new repositories, this is an inference question.
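
Some of these candidate variables need converting before they can enter a model: the creation date is an ISO 8601 string, and the has/is attributes are True/False flags. The sketch below, written in Python with the standard library only, shows one way to derive a repository age in days and to parse the flags. It assumes timestamps end in `Z` (UTC) and flags are stored as the strings `True` and `False`; the function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_iso8601(timestamp):
    """Parse a timestamp such as '2019-03-14T09:26:53Z' into an aware datetime."""
    # Older Python versions reject a trailing 'Z' in fromisoformat, so it is
    # replaced with an explicit UTC offset first.
    return datetime.fromisoformat(timestamp.replace("Z", "+00:00"))


def age_in_days(created_at, reference):
    """Whole days elapsed between creation and a chosen reference date."""
    return (reference - parse_iso8601(created_at)).days


def parse_flag(value):
    """Convert the strings 'True' / 'False' into Python booleans."""
    if value == "True":
        return True
    if value == "False":
        return False
    raise ValueError(f"unexpected flag value: {value!r}")


# Illustrative reference date; in practice this would be the date the
# dataset was collected.
reference_date = datetime(2020, 1, 11, tzinfo=timezone.utc)
example_age = age_in_days("2020-01-01T00:00:00Z", reference_date)
example_flag = parse_flag("False")
print(example_age, example_flag)
```

Using a fixed reference date, rather than the current date, keeps the derived ages reproducible across runs.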

(3) Exploratory Data Analysis and Visualization

In this assignment, you will:

- Demonstrate that the dataset can be loaded into R.
- Clean and wrangle your data into a tidy format.
- Propose a visualization that you consider relevant to address your question or to explore the data.
  - Propose a high-quality plot or set of plots of the same kind (e.g., histograms of different variables).
  - Explain why you consider this plot relevant to address your question or to explore the data.

Note: this visualization does not have to illustrate the results of a methodology. Instead, you are exploring which variables are relevant, potential problems that you anticipate encountering, groups in the observations, etc.

Proposed visualization: 
We could create a pair plot to explore the numerical metrics that may have a direct linear relationship with the popularity of a repository: size, forks, issues, and watchers. This would let us visually gauge the distribution of each explanatory variable as well as its relationship with stars. In addition to this initial visual analysis, the pair-plot function reports the Pearson correlation coefficient between each pair of variables (a value between -1 and 1, where -1 signifies a perfect negative linear correlation, 0 signifies no linear correlation, and +1 signifies a perfect positive linear correlation). However, we anticipate that additional visualizations will be necessary to evaluate the non-numerical explanatory variables that could offer insight into star popularity.
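
For reference, the sample Pearson correlation coefficient between an explanatory variable $x$ (for example, forks) and the response $y$ (stars), computed over $n$ repositories, is given below. The symbols $x_i$, $y_i$, $\bar{x}$, and $\bar{y}$ are notation introduced here for illustration: the values for repository $i$ and the sample means.

$$
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Because $r_{xy}$ measures only linear association, a value near 0 does not rule out a non-linear relationship, which is one reason to inspect the scatterplots in the pair plot alongside the coefficients.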

We could use a multiple linear regression model, with stars as the response variable and attributes such as repository size, time since the last update (the dataset records only the most recent update, not update frequency), age, and homepage presence as explanatory variables.
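
To make the regression idea concrete, here is a minimal sketch of ordinary least squares in Python with the standard library only. It solves the normal equations by Gauss-Jordan elimination, which is a simple illustrative choice and not necessarily how a statistics package fits the model. The numbers are made up so that the response equals 2 + 3·x1 − 1·x2 exactly; they are not GitHub data, and `fit_ols` is a hypothetical helper name.

```python
def fit_ols(predictor_rows, response):
    """Least-squares coefficients [intercept, b1, ..., bp] via normal equations."""
    # Design matrix with a leading column of ones for the intercept.
    X = [[1.0] + [float(v) for v in row] for row in predictor_rows]
    y = [float(v) for v in response]
    p = len(X[0])

    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(p)]
        + [sum(x[i] * yi for x, yi in zip(X, y))]
        for i in range(p)
    ]

    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear; X^T X is singular")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]

    # After elimination the last column holds the solution.
    return [A[i][p] for i in range(p)]


# Made-up predictors (x1, x2) and a response built as 2 + 3*x1 - 1*x2.
toy_predictors = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
toy_response = [3, 7, 6, 11, 9]
coefficients = fit_ols(toy_predictors, toy_response)
print([round(c, 6) for c in coefficients])
```

Each slope is the expected change in the response for a one-unit change in that predictor with the other predictors held fixed, which is the quantity an inference question about star popularity would interpret.
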
We could use exploratory data analysis (EDA) to examine, through visualizations, the relationship between the number of stars and other variables (e.g., repository size, creation date).
We could also use correlation analysis: computing the correlation between each numerical attribute and the number of stars helps identify the attributes most strongly associated with popularity.
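
The correlation analysis can be sketched as follows, in Python with the standard library only: compute the Pearson coefficient between each numerical attribute and stars, then rank the attributes by the absolute value of the coefficient. The values are made up for illustration and are not taken from the dataset; `pearson` and `rank_by_correlation` are hypothetical helper names.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(columns, response):
    """Sort attributes by absolute correlation with the response, strongest first."""
    scores = {name: pearson(values, response) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up values for five hypothetical repositories.
toy_stars = [10, 20, 30, 40, 50]
toy_columns = {
    "size": [5, 3, 4, 1, 2],
    "forks": [1, 2, 3, 4, 5],
    "issues": [3, 1, 2, 5, 4],
}
ranking = rank_by_correlation(toy_columns, toy_stars)
for name, r in ranking:
    print(f"{name}: r = {r:+.2f}")
```

Ranking by absolute value keeps strongly negative associations near the top; the sign is still reported so the direction of each relationship stays visible.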