# Introduction to Data Science – Homework 6
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

Due: Friday, February 28, 2025, 11:59pm.

In Part 1 of this homework you will scrape GitHub repositories and organize the information in a Pandas DataFrame. In Part 2, you will use linear regression to gain meaningful insights.

## Your Data
First Name: Kim
<br>
Last Name: Lanaghen
<br>
E-mail: kim.lanaghen@utah.edu
<br>
UID: u1210825
<br>

In [6]:
# imports and setup 
from bs4 import BeautifulSoup

import pandas as pd
import scipy as sc
import numpy as np
import os

import statsmodels.formula.api as sm

import matplotlib.pyplot as plt 
plt.style.use('ggplot')
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6) 
# where the data is stored
DATA_PATH = "/Users/kimlanaghen/Downloads/2026-datascience-homework-main/HW6/snapshots"

In [7]:
print(len(os.listdir(DATA_PATH)), "files in DATA_PATH")


102 files in DATA_PATH


### 1. Scrape GitHub Repository List using BeautifulSoup
In this part you will explore GitHub repositories, specifically the 100 most-starred repositories. You are going to scrape data from a snapshot of [this repository list](https://github.com/search?o=desc&q=stars%3A%3E1&s=stars&type=Repositories).

### 1.1. Check whether you are permitted to scrape the data
Before you start to scrape any website you should go through the terms of service and policy documents of the website. Almost all websites post conditions to use their data. Check the terms of [https://github.com/](https://github.com/) (see the tiny "terms" link at the bottom of the page) to see whether the site permits you to scrape their data or not. Are you sure you are allowed to scrape?

**Your solution:**

You are allowed to scrape provided that the scraping is not excessive and is not used to cause harm, to send spam, or to gain access to somebody's personal information.

Reference solution: The [terms of service](https://help.github.com/articles/github-terms-of-service/) do not mention scraping, but the [help pages on the site policy](https://help.github.com/en/github/site-policy/github-acceptable-use-policies#5-scraping-and-api-usage-restrictionsyou) allow scraping. You can scrape GitHub under the following conditions:

- Researchers may scrape public, non-personal information from GitHub for research purposes, only if any publications resulting from that research are open access.
- Archivists may scrape GitHub for public data for archival purposes.
- You may not scrape GitHub for spamming purposes, including for the purposes of selling GitHub users' personal information, such as to recruiters, headhunters, and job boards.

The [robots.txt](https://github.com/robots.txt) is a little less explicit about what is and is not allowed, but overall, since we are scraping GitHub pages for education/research purposes and not publishing the results, it is reasonable to assume that this is ok to do.

### Task 1.2 Load the Data

To avoid any problems with GitHub blocking us from downloading the data many times, we have downloaded and saved a snapshot of the html files for you in the [snapshots](snapshots) folder. Note that the snapshots folder is not completely consistent with what you see on the web: we've made a few patches to the data that make your task here easier, and this data represents a snapshot in time. You will be treating the data folder as your website to be scraped. The path to the data folder is stored in the `DATA_PATH` variable.

In the data folder you will find the first 10 pages of highly starred repositories saved as `search_page_1.html`, `search_page_2.html`, `search_page_3.html` ... `search_page_10.html`.

Check out page 5 if you want to see what happens if you scrape too quickly 😉. **Tip**: you should skip page 5.

Now read these html files in Python and create a soup object. This is a two-step process:
 * Read the text in the html files
 * Create the soup from the files that you've read. 

In [8]:
pages = [1, 2, 3, 4, 6, 7, 8, 9, 10]

html_pages = []
soups = []
  
for page in pages:
    filename = f"search_page_{page}.html"
    file_path = os.path.join(DATA_PATH, filename)

    with open(file_path, "r", encoding="utf-8") as f:
        html = f.read()
        html_pages.append(html)

        soup = BeautifulSoup(html, "html.parser")
        soups.append(soup)



In [9]:
page1_path = os.path.join(DATA_PATH, "search_page_1.html")

with open(page1_path, "r", encoding="utf-8") as f:
    
    html = f.read()

soup = BeautifulSoup(html, "html.parser")


### Extracting Data

Extract the following data for each repository, and create a Pandas DataFrame with a row for each repository and a column for each of these fields.

+ The name of the repository
+ The primary language (a repository may list several languages or none; if there are several, use the first one; if there are none, use "none")
+ The number of users watching
+ The number of stars
+ The number of forks
+ The number of issues
+ The number of commits
+ The number of pull requests

Here's an example for one repository, `freeCodeCamp/freeCodeCamp`, in our dataset:
```python
{'name': 'freeCodeCamp',
 'language': 'TypeScript',
 'watching': '8500',
 'stars': '410251',
 'forks': '39007',
 'issues': 168,
 'commits': 37591,
 'pull_requests': 66
}
```

### Task 1.3 Extract repository URLs

If you look at the results of the 100 most-starred repositories [(this list)](https://github.com/search?o=desc&q=stars%3A%3E1&s=stars&type=Repositories), you will notice that not all the information we want to extract for each repository is in that list. The rest of this information is on the repository's individual web page, for example [996icu](https://github.com/996icu/996.ICU).

Therefore, you will first have to extract the link to each repository from the soup you created earlier. When you extract the link for the repository, it will be a path to the stored HTML page for the repository. You will use this path to read the file and extract the above information.

Refer to the scraping lecture for details on how to do this. We recommend you use the web inspector to identify the relevant structures.

Example of a link that you need to extract: `996icu/996.ICU.html`. This means that in the next task you need to access the local file `snapshots/996icu/996.ICU.html`. Similarly, for `521xueweihan/HelloGitHub.html` you should access `snapshots/521xueweihan/HelloGitHub.html`.

You may need to do string operations to get the desired format for the link. For example, if you get `raw_link = https://github.com/996icu/996.ICU`, you can do
`link = raw_link.replace("https://github.com/", "") + ".html"` so you get `996icu/996.ICU.html`.
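The string operation above can be wrapped in a small helper. The following is only a sketch: it assumes that links may appear either as absolute URLs or as site-relative paths, and the helper name `to_snapshot_path` and the list `raw_links` are hypothetical names introduced for illustration.

```python
GITHUB_PREFIX = "https://github.com/"

def to_snapshot_path(raw_link):
    """Return 'owner/repo.html' for a repository link, or None for other links."""
    if raw_link.startswith(GITHUB_PREFIX):
        path = raw_link[len(GITHUB_PREFIX):]   # absolute URL
    elif raw_link.startswith("/"):
        path = raw_link[1:]                    # site-relative path
    else:
        return None
    path = path.strip("/")
    # A repository path has exactly two segments: owner and repository name.
    if path.count("/") != 1:
        return None
    return path + ".html"

# Illustrative inputs only; the real links come from the scraped soup.
raw_links = [
    "https://github.com/996icu/996.ICU",
    "/521xueweihan/HelloGitHub",
    "https://github.com/search?q=stars",
]
converted = [to_snapshot_path(link) for link in raw_links]
print([link for link in converted if link is not None])
```

Returning `None` for anything that is not an owner/repository pair keeps the filtering in one place, so the calling loop only has to drop the `None` entries.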

Please name your output `repo_list`, and print this list once you have created it.

In [12]:
repo_list = []

for a in soup.find_all("a"):
    href = a.get("href")
    if href is None:
        continue

    # keep only repo-ish links like "/owner/repo"
    if href.startswith("/") and href.count("/") == 2:
        path = href[1:]  # remove leading "/"
        owner = path.split("/")[0]

        # filter out obvious non-repo sections
        if owner in ["search", "topics", "settings", "login", "signup", "orgs", "features"]:
            continue

        file_path = path + ".html"
        if file_path not in repo_list:
            repo_list.append(file_path)

print(repo_list[:10])
print("repo_list count:", len(repo_list))




[]
repo_list count: 0


### Task 1.4 Extracting required information

Once you have extracted links for each repository, you can start parsing those HTML pages using BeautifulSoup and extract all the required information.

**Note**: A few repositories do not contain an 'issues' field (such as 996icu/996.ICU.html). Write your code so that it handles this case as well.

**Save the dataframe you created to a new file project_info.csv and include this in your submission.** This separate file will also be graded and is required to earn points.

You also need to make sure that you reformat all numerical columns to be integer data. You can do that either as you parse, or when you have a dataframe with strings.

Some repositories (~30) are missing from the collection. We have provided code to skip these cases and, in the following cell, to exclude the resulting `None` values from the stored list.

**Tips**: The exact values of stars and forks can be found in the top right corner by hovering the mouse over the value. E.g., hovering over 410k shows 410,246. For *watching*, the data is abbreviated and you need to convert it to a number yourself. For example, 8.5k should be converted to 8500.
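One way to handle both number formats mentioned in the tips is a single conversion helper. This is a minimal sketch; the name `parse_count` is hypothetical, and the `m` suffix is an assumption added for completeness rather than something the snapshot is known to contain.

```python
def parse_count(text):
    """Convert strings such as '8.5k', '410,246', or '168' to an int."""
    cleaned = text.strip().replace(",", "").lower()
    multipliers = {"k": 1_000, "m": 1_000_000}  # 'm' is an assumed extra suffix
    if cleaned and cleaned[-1] in multipliers:
        # round() guards against floating-point error, e.g. 1.2 * 1_000_000
        return int(round(float(cleaned[:-1]) * multipliers[cleaned[-1]]))
    return int(cleaned)

print(parse_count("8.5k"), parse_count("410,246"), parse_count("168"))
```

Applying the helper while parsing means every numeric column is already an integer by the time the DataFrame is built.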

In [None]:
from pathlib import Path

def extract_repository_details(url):
    file_path = Path("snapshots") / url
    # Skip repositories whose snapshot is missing; the calling loop filters out None.
    if not file_path.exists():
        return None

    with file_path.open('r', encoding="utf8") as f:
        file = f.read()

    ## Your code goes here

    data = {"name": repo_name,
            "language": language,
            "watching": watching,
            "stars": stars,
            "forks": forks,
            "issues": issues,
            "commits": commits,
            "pull_requests": pull_requests
    }

    return data

In [None]:
## complete extract_repository_details() before running this snippet
repo_info_list = []
for repo in repo_list:
    item = extract_repository_details(repo)
    if item is not None:  
        repo_info_list.append(item)

project_info = pd.DataFrame(repo_info_list)
project_info.to_csv('project_info.csv', index=False)

### 2. Analyzing the repository data

In this part, you will analyze the data collected in Part 1 using regression tools. The goal is to identify properties that make a repository popular. 

First, load the `project_info.csv` file in again. **We need you to do this so that we can run your code below without having to run your scraping code, which can be slow.**

In [None]:
project_info = pd.read_csv('project_info.csv')
project_info.head()

### Task 2.1.1 Describe the data

+ Get an overview of the data using the describe function.
+ Compute the correlation matrix, visualize it with a labeled heat map.
+ Interpret what you see, and discuss why some variables may or may not be correlated with others.

You can re-use code from your previous homework here.
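The three steps above could look roughly like the sketch below. It runs on a small synthetic table (the name `demo` and all its values are made up for illustration), so the numbers say nothing about the real `project_info` data; the heat map is drawn with plain matplotlib as one possible choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for project_info; real values come from project_info.csv.
rng = np.random.default_rng(0)
forks = rng.integers(1_000, 40_000, size=50)
demo = pd.DataFrame({
    "forks": forks,
    "stars": forks * 5 + rng.integers(0, 20_000, size=50),
    "commits": rng.integers(100, 40_000, size=50),
})

print(demo.describe())

# Restrict to numeric columns so text columns such as 'name' are ignored.
corr = demo.select_dtypes("number").corr()

fig, ax = plt.subplots()
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.index)))
ax.set_yticklabels(corr.index)
# Write each coefficient into its cell so the map is readable on its own.
for i in range(corr.shape[0]):
    for j in range(corr.shape[1]):
        ax.text(j, i, f"{corr.iat[i, j]:.2f}", ha="center", va="center")
fig.colorbar(image, ax=ax, label="Pearson correlation")
ax.set_title("Correlation matrix")
ax.grid(False)
```

Fixing the colour scale to the range from -1 to 1 keeps the colours comparable between data sets.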

In [None]:
# your code goes here - describe

In [None]:
# your code goes here - correlation matrix

In [None]:
# your code goes here - heat map

**Your Interpretation:** TODO

### Task 2.1.2 Scatterplot
+ Visualize the correlations by making a scatterplot matrix.
+ Interpret what you see. Compare this to the correlation matrix. Does either provide you with insight that the other does not?
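A scatterplot matrix can be drawn with `pandas.plotting.scatter_matrix`. The sketch below again uses a synthetic table (`demo` is a hypothetical stand-in for `project_info`), so only the call pattern carries over to the real data.

```python
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical stand-in for project_info.
rng = np.random.default_rng(1)
forks = rng.integers(1_000, 40_000, size=50)
demo = pd.DataFrame({
    "forks": forks,
    "stars": forks * 5 + rng.integers(0, 20_000, size=50),
    "commits": rng.integers(100, 40_000, size=50),
})

# One panel per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(demo, figsize=(10, 10), diagonal="hist", alpha=0.7)
```

Unlike a single correlation coefficient, each panel shows outliers and non-linear patterns directly.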

In [None]:
# your code goes here

**Your Interpretation:** TODO

### Task 2.2 Train/Test
+ Randomly partition the dataset into two groups, train and test, with an 80/20 split. Store these datasets, and use them for the remainder of the assignment. When you train a model, do so on the train set. When you evaluate a model, do so on the test set.
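A random 80/20 partition can be made with pandas alone. In this sketch the ten-row table `demo` and the seed `42` are arbitrary illustrative choices; fixing the seed makes the split reproducible between runs.

```python
import pandas as pd

# Hypothetical stand-in for project_info.
demo = pd.DataFrame({"stars": range(100, 110), "forks": range(10, 20)})

train = demo.sample(frac=0.8, random_state=42)  # random 80% of the rows
test = demo.drop(train.index)                   # the remaining 20%

print(len(train), len(test))
```

Dropping the sampled index from the full table guarantees that the two sets do not overlap and together cover every row.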




In [None]:
# your code goes here

### 2.3.1 Linear regression

+ Use linear regression to try to predict the number of Stars based on Forks, Pull Requests, and Commits. Discuss the R-squared, F-statistic p-value, MSE, and coefficient p-values for the train set and, separately, the R-squared and MSE for the test set.
+ Interpret your results. 
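The quantities listed above can be read from a fitted statsmodels model, with the test-set metrics computed by hand. The sketch below uses synthetic data (`demo` and its coefficients are invented for illustration) and the same `sm` alias as the setup cell; `mse` and `r_squared` are hypothetical helper names.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Hypothetical synthetic data standing in for project_info.
rng = np.random.default_rng(0)
n = 80
demo = pd.DataFrame({
    "forks": rng.uniform(1_000, 40_000, n),
    "pull_requests": rng.uniform(0, 500, n),
    "commits": rng.uniform(100, 40_000, n),
})
demo["stars"] = 20_000 + 5 * demo["forks"] + rng.normal(0, 10_000, n)

train = demo.sample(frac=0.8, random_state=42)
test = demo.drop(train.index)

# Fit on the train set only.
model = sm.ols("stars ~ forks + pull_requests + commits", data=train).fit()

def mse(y_true, y_pred):
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def r_squared(y_true, y_pred):
    y_true = np.asarray(y_true)
    ss_res = np.sum((y_true - np.asarray(y_pred)) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1 - ss_res / ss_tot)

train_pred = model.predict(train)
test_pred = model.predict(test)

print("train R^2:", model.rsquared)
print("F-statistic p-value:", model.f_pvalue)
print("coefficient p-values:")
print(model.pvalues)
print("train MSE:", mse(train["stars"], train_pred))
print("test R^2:", r_squared(test["stars"], test_pred))
print("test MSE:", mse(test["stars"], test_pred))
```

`model.summary()` prints the train-set statistics in one table, but the test-set R-squared and MSE always have to be computed from predictions, because the model has never seen those rows.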


In [None]:
# your code goes here

**Your Interpretation:** TODO

### 2.3.2 Linear Regression Exploration
+ Develop a model which is simpler AND a model which is more complex than in 2.3.1, with the aim of finding a model which performs better on the test set. Hint: refer to the correlation matrix.
+ Explain why your chosen model is better than the model in 2.3.1, explain your decision-making process for generating the models, and interpret your results. 

In [None]:
# your code goes here

**Your Interpretation:** TODO

### 2.4.1 Ridge Regression
Refer to section 6.2.1 of [ISL 2015](https://hastie.su.domains/ISLR2/ISLRv2_corrected_June_2023.pdf.download.html) for a description of ridge regression.
+ Implement ridge regression using both the variables from 2.3.1 and those from your best solution in 2.3.2.
+ Plot the MSE (evaluated on the test set) against $\lambda$ (with each model trained on the train set) in order to find an approximately optimal value.
+ Explain your selection for $\lambda$.
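The sweep over $\lambda$ might be sketched as follows. This is not the required implementation: it uses the closed-form ridge solution in NumPy on synthetic data, with standardised predictors and an unpenalised intercept, and every name in it (`ridge_fit`, `lambdas`, the data itself) is an assumption made for illustration.

```python
import numpy as np

# Hypothetical synthetic data standing in for the train/test sets from Task 2.2.
rng = np.random.default_rng(0)
n = 100
forks = rng.uniform(1_000, 40_000, n)
pull_requests = 0.01 * forks + rng.normal(0, 100, n)  # correlated with forks
commits = rng.uniform(100, 40_000, n)
X = np.column_stack([forks, pull_requests, commits])
y = 20_000 + 5 * forks + rng.normal(0, 10_000, n)

# Rows are already in random order, so slicing gives a random 80/20 split.
X_train, X_test, y_train, y_test = X[:80], X[80:], y[:80], y[80:]

# Standardise with train-set statistics so the penalty treats all predictors alike.
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
Z_train, Z_test = (X_train - mean) / std, (X_test - mean) / std

def ridge_fit(Z, y, lam):
    """Minimise ||y - b0 - Z b||^2 + lam * ||b||^2 for centred Z (b0 unpenalised)."""
    intercept = y.mean()
    coef = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ (y - intercept))
    return intercept, coef

lambdas = np.logspace(-2, 4, 25)
test_mse, coef_norms = [], []
for lam in lambdas:
    intercept, coef = ridge_fit(Z_train, y_train, lam)
    pred = intercept + Z_test @ coef
    test_mse.append(float(np.mean((y_test - pred) ** 2)))
    coef_norms.append(float(np.linalg.norm(coef)))

best_lambda = lambdas[int(np.argmin(test_mse))]
print("approximately optimal lambda:", best_lambda)
```

Because the grid is logarithmic, a call such as `plt.semilogx(lambdas, test_mse)` gives a readable version of the requested plot.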




In [None]:
# your code goes here

**Your Interpretation:** TODO

### 2.4.2 Lasso Regression
Refer to section 6.2.2 of [ISL 2015](https://hastie.su.domains/ISLR2/ISLRv2_corrected_June_2023.pdf.download.html) for a description of lasso regression.
+ Implement lasso regression using both the variables from 2.3.1 and those from your best solution in 2.3.2.
+ Plot the MSE (evaluated on the test set) against $\lambda$ (with each model trained on the train set) in order to find an approximately optimal value.
+ Explain your selection for $\lambda$.
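The lasso has no closed-form solution, so the sketch below swaps in a simple coordinate-descent solver with soft-thresholding, written in NumPy. It minimises $\frac{1}{2n}\lVert y - b_0 - Zb\rVert^2 + \lambda\lVert b\rVert_1$ on standardised synthetic data; the scaling of $\lambda$ therefore differs from the ridge sketch, and all names (`soft_threshold`, `lasso_fit`, the data) are assumptions made for illustration. A library solver would normally be the practical choice.

```python
import numpy as np

# Hypothetical synthetic data standing in for the train/test sets from Task 2.2.
rng = np.random.default_rng(0)
n = 100
forks = rng.uniform(1_000, 40_000, n)
pull_requests = 0.01 * forks + rng.normal(0, 100, n)  # correlated with forks
commits = rng.uniform(100, 40_000, n)
X = np.column_stack([forks, pull_requests, commits])
y = 20_000 + 5 * forks + rng.normal(0, 10_000, n)
X_train, X_test, y_train, y_test = X[:80], X[80:], y[:80], y[80:]

# Standardise with train-set statistics; centred columns make the intercept mean(y).
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
Z_train, Z_test = (X_train - mean) / std, (X_test - mean) / std

def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_fit(Z, y, lam, n_iter=200):
    """Coordinate descent for (1/(2n)) * ||y - b0 - Z b||^2 + lam * ||b||_1."""
    n_samples, n_features = Z.shape
    intercept = y.mean()
    coef = np.zeros(n_features)
    residual = y - intercept                  # residual for coef == 0
    col_scale = (Z ** 2).sum(axis=0) / n_samples
    for _ in range(n_iter):
        for j in range(n_features):
            residual += Z[:, j] * coef[j]     # remove feature j's contribution
            rho = Z[:, j] @ residual / n_samples
            coef[j] = soft_threshold(rho, lam) / col_scale[j]
            residual -= Z[:, j] * coef[j]     # add the updated contribution back
    return intercept, coef

lambdas = np.logspace(0, 5, 25)
test_mse, n_nonzero = [], []
for lam in lambdas:
    intercept, coef = lasso_fit(Z_train, y_train, lam)
    pred = intercept + Z_test @ coef
    test_mse.append(float(np.mean((y_test - pred) ** 2)))
    n_nonzero.append(int(np.count_nonzero(coef)))

best_lambda = lambdas[int(np.argmin(test_mse))]
print("approximately optimal lambda:", best_lambda)
print("non-zero coefficients per lambda:", n_nonzero)
```

Tracking the number of non-zero coefficients alongside the test MSE shows the behaviour that distinguishes the lasso from ridge regression: a large enough penalty sets coefficients exactly to zero.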




In [None]:
# your code goes here

**Your Interpretation:** TODO

### 2.5 Regression Methods Analysis
Compare the results of each regression method for this use case. Which one performed the best, and why?

**Your Interpretation:** TODO

### 2.6 Regression Methods Study
Based on your reading of the textbook and the prior exercises, explain the differences between linear, lasso, and ridge regression, and when you would want to use each.

**Your Interpretation:** TODO