# Introduction to Data Science – Homework 6
*COMP 5360 / MATH 4100, University of Utah, http://datasciencecourse.net/*

Due: Friday, March 01 2024, 11:59pm.

In Part 1 of this homework you will scrape GitHub repositories and organize the information in a pandas DataFrame. In Part 2, you will use linear regression to gain meaningful insights.

## Your Data
First Name: Logan
<br>
Last Name: Correa
<br>
E-mail: u1094034@umail.utah.edu
<br>
UID: u1094034
<br>

In [1]:
# imports and setup 
from bs4 import BeautifulSoup
# you can use either of these libraries to get html from a website
import time
import os

import pandas as pd
import scipy as sc
import numpy as np

import statsmodels.formula.api as sm

import matplotlib.pyplot as plt 
plt.style.use('ggplot')
%matplotlib inline  
plt.rcParams['figure.figsize'] = (10, 6) 
# where the data is stored
DATA_PATH = "data"

### 1. Scrape GitHub Repository List using BeautifulSoup
In this part you will explore Github repositories, specifically the 100 most-starred repositories. You are going to scrape data from a snapshot of [this repository list](https://github.com/search?o=desc&q=stars%3A%3E1&s=stars&type=Repositories).

### 1.1. Check whether you are permitted to scrape the data
Before you start to scrape any website you should go through the terms of service and policy documents of the website. Almost all websites post conditions to use their data. Check the terms of [https://github.com/](https://github.com/) (see the tiny "terms" link at the bottom of the page) to see whether the site permits you to scrape their data or not. Are you sure you are allowed to scrape?

**Your solution:**

Scraping is allowed for researchers as long as published works are open access and the information is not used for spamming purposes.

### Task 1.2 Load the Data

To avoid problems with GitHub blocking repeated downloads, we have saved a snapshot of the HTML files for you in the [data](data) folder. Note that the data folder is not completely consistent with what you see on the web: we've made a few patches to the data that make your task easier, and the data represents a snapshot in time. You will treat the data folder as the website to be scraped. The path to the data folder is stored in the `DATA_PATH` variable.

In the data folder you will find the first 10 pages of highly starred repositories saved as `searchPage1.html`, `searchPage2.html`, `searchPage3.html` ... `searchPage10.html`.

Check out page 10 if you want to see what happens if you scrape too quickly 😉. 

Now read these HTML files in Python and create a soup object for each. This is a two-step process:
 * Read the text in the HTML files.
 * Create the soup from the text that you've read.

In [7]:
"""
html_pages = []
for root, dirs, files in os.walk(DATA_PATH):
    for file in files:
        if file.endswith(".html"):
            full_path = os.path.join(root, file)
            with open(full_path, 'r') as f:
                html_pages.append(f.read())

soup_objects = []
for page in html_pages:
    soup_objects.append(BeautifulSoup(page, 'html.parser'))
"""

In [46]:
# Read the HTML files and create one soup object per page
html_pages = []
for file_name in os.listdir(DATA_PATH):
    if file_name.endswith(".html"):
        full_path = os.path.join(DATA_PATH, file_name)
        # Explicit encoding avoids platform-dependent decoding errors
        with open(full_path, 'r', encoding='utf-8') as f:
            html_pages.append(f.read())

SearchPage_soup = []
for page in html_pages:
    SearchPage_soup.append(BeautifulSoup(page, 'html.parser'))

len(SearchPage_soup)

10

In [54]:
# Collect every 'div' with class "mt-n1" from each search page;
# each such div wraps one repository entry.
repositories = []

for soup in SearchPage_soup:
    mtn1 = soup.find_all('div', class_="mt-n1")
    repositories.extend(mtn1)

# repositories now holds the "mt-n1" divs from all pages in SearchPage_soup
print(len(repositories))
print(repositories[0])

90
<div class="mt-n1">
<div class="f4 text-normal">
<a class="v-align-middle" data-hydro-click='{"event_type":"search_result.click","payload":{"page_number":9,"per_page":10,"query":"stars:&gt;1","result_position":1,"click_id":10744183,"result":{"id":10744183,"global_relay_id":"MDEwOlJlcG9zaXRvcnkxMDc0NDE4Mw==","model_name":"Repository","url":"https://github.com/netdata/netdata"},"originating_url":"https://github.com/search?o=desc&amp;p=9&amp;q=stars%3A%3E1&amp;s=stars&amp;type=Repositories","user_id":null}}' data-hydro-click-hmac="ea9a3c2f929c99f58815139b251d60f8bba3b48cedab4b12ea9cfefe48e826cb" href="netdata/netdata.html">
            netdata/netdata
           </a>
</div>
<p class="mb-1">
           Real-time performance monitoring, done right!
           <a href="https://my-netdata.io/" rel="nofollow">
            https://my-netdata.io/
           </a>
</p>
<div>
<div>
<a class="topic-tag topic-tag-link f6 px-2 mx-0" data-ga-click="Topic, search results" data-octo-click="topic_click

In [41]:
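# Repository names: print one list per search page and keep them all in repo_names
repo_names = []
for soup in SearchPage_soup:
    elements = soup.find_all("a", class_="v-align-middle")
    repo_name = [element.get_text().strip() for element in elements]
    repo_names.extend(repo_name)
    print(repo_name)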

['netdata/netdata', 'tonsky/FiraCode', 'denoland/deno', 'h5bp/html5-boilerplate', 'ElemeFE/element', 'adam-p/markdown-here', 'h5bp/Front-end-Developer-Interview-Questions', 'resume/resume.github.com', 'josephmisiti/awesome-machine-learning', 'lodash/lodash']
['angular/angular.js', 'puppeteer/puppeteer', 'mrdoob/three.js', 'microsoft/TypeScript', 'angular/angular', 'microsoft/terminal', 'laravel/laravel', 'moby/moby', 'ant-design/ant-design', 'iluwatar/java-design-patterns']
['ossu/computer-science', '30-seconds/30-seconds-of-code', 'mui-org/material-ui', 'jquery/jquery', 'webpack/webpack', 'reduxjs/redux', 'nvbn/thefuck', 'vuejs/awesome-vue', 'avelino/awesome-go', 'atom/atom']
['apple/swift', 'hakimel/reveal.js', 'MisterBooo/LeetCodeAnimation', 'PanJiaChen/vue-element-admin', 'pallets/flask', 'socketio/socket.io', 'expressjs/express', 'Semantic-Org/Semantic-UI', 'shadowsocks/shadowsocks-windows', 'chartjs/Chart.js']
['jwasham/coding-interview-university', 'kamranahmedse/developer-roadm

In [45]:
# Initialize an empty list to hold all languages
all_langs = []

# Loop through each BeautifulSoup object in SearchPage_soup
for i in range(0, len(SearchPage_soup)):
    elements = SearchPage_soup[i].find_all("span", itemprop="programmingLanguage")
    # Extend the all_langs list with the cleaned texts
    all_langs.extend([element.get_text().strip() for element in elements])

# Now all_langs contains all programming languages from all pages/sections
print(all_langs)

['C', 'Clojure', 'TypeScript', 'JavaScript', 'Vue', 'JavaScript', 'HTML', 'JavaScript', 'Python', 'JavaScript', 'JavaScript', 'JavaScript', 'JavaScript', 'TypeScript', 'TypeScript', 'C++', 'PHP', 'Go', 'TypeScript', 'Java', 'JavaScript', 'JavaScript', 'JavaScript', 'JavaScript', 'TypeScript', 'Python', 'Go', 'JavaScript', 'C++', 'JavaScript', 'Java', 'Vue', 'Python', 'JavaScript', 'JavaScript', 'JavaScript', 'C#', 'JavaScript', 'JavaScript', 'TypeScript', 'Java', 'JavaScript', 'Dart', 'C', 'JavaScript', 'JavaScript', 'Python', 'CSS', 'Go', 'JavaScript', 'JavaScript', 'Python', 'Python', 'JavaScript', 'Rust', 'JavaScript', 'JavaScript', 'C++', 'JavaScript', 'Shell', 'Python', 'C++', 'Python', 'Jupyter Notebook', 'JavaScript', 'Python', 'JavaScript', 'Go', 'Java', 'Python', 'Java', 'Python', 'Python', 'TypeScript', 'JavaScript', 'Assembly', 'Java', 'Ruby', 'JavaScript']


### Extracting Data

Extract the following data for each repository, and create a pandas DataFrame with a row for each repository and a column for each of these fields.

+ The name of the repository
+ The primary language (a repository may list several languages or none; if several, use the first one; if none, use "none")
+ The number of watches
+ The number of stars
+ The number of forks
+ The number of issues
+ Number of commits
+ Number of contributors
+ Number of pull requests, and
+ Number of top level folders in the file list.

Here's an example for one repository, `jackfrued/Python-100-Days`, in our dataset:
```python
{'name': 'Python-100-Days',
'language': 'Jupyter Notebook',
'watches': '4822',
'stars': '78068',
'forks': '30979',
'issues': 224,
'commits': 296,
'contributors': 12,
'pull_requests': 85,
'folders': 14
}
```

### Task 1.3 Extract repository URLs

If you look at the results of the 100 most-starred repositories [(this list)](https://github.com/search?o=desc&q=stars%3A%3E1&s=stars&type=Repositories), you will notice that not all the information we want to extract for each repository is in that list. The rest is on the repository’s individual web page, for example [996icu](https://github.com/996icu/996.ICU).

Therefore, you will first have to extract the link to each repository from the soup you created earlier. Each link you extract is a path to the stored HTML page for that repository. You will use this path to read the file and extract the information listed above.

Refer to the scraping lecture for details on how to do this. We recommend you use the web inspector to identify the relevant structures.

Example of a link that you need to extract - 996icu/996.ICU.html
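
One possible way to approach the link extraction is sketched below. It assumes that each repository title link carries the class `v-align-middle` and a relative `href`, as in the search-result snippet printed earlier. The `sample_html` string and the helper name `extract_repo_links` are hypothetical stand-ins for illustration; in the homework you would pass each soup in `SearchPage_soup` instead.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for one search page; the real pages live in DATA_PATH.
sample_html = """
<div class="mt-n1">
  <div class="f4 text-normal">
    <a class="v-align-middle" href="netdata/netdata.html">netdata/netdata</a>
  </div>
</div>
<div class="mt-n1">
  <div class="f4 text-normal">
    <a class="v-align-middle" href="996icu/996.ICU.html">996icu/996.ICU</a>
  </div>
</div>
"""


def extract_repo_links(soup):
    """Return the href of every repository title link found in a soup."""
    anchors = soup.find_all("a", class_="v-align-middle")
    # Skip anchors without an href rather than raising a KeyError
    return [a["href"] for a in anchors if a.has_attr("href")]


sample_soup = BeautifulSoup(sample_html, "html.parser")
repo_links = extract_repo_links(sample_soup)
print(repo_links)
```

Each returned path can then be joined with `DATA_PATH` via `os.path.join` to locate the stored repository page.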

In [None]:
## Your code goes here

### Task 1.4 Extracting required information

Once you have extracted links for each repository, you can start parsing those HTML pages using BeautifulSoup and extract all the required information.

**Note**: A few repositories do not contain an 'issues' field (for example, 996icu/996.ICU.html). Write your code so that it handles this case as well.

**Save the dataframe you created to a new file project_info.csv and include this in your submission.** This separate file will also be graded and is required to earn points.

You also need to make sure that you reformat all numerical columns to be integer data. You can do that either as you parse, or when you have a dataframe with strings.

Note that one repository (the Linux kernel) is flagged as having infinite contributors. We'll assume that it actually has 15600 contributors (an estimate based on a Google search at the time of download).
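
The notes above (missing fields, integer conversion, and the infinite contributor count) could be handled by one small helper, sketched here. The function name `parse_count`, the choice of `0` for a missing field, and the `"∞"` marker are all assumptions made for illustration; check the stored pages to see how the infinite count actually appears and decide which missing value suits your analysis.

```python
def parse_count(text, missing=0, infinite_value=15600):
    """Convert a scraped count such as '4,822' to an int.

    text           -- the scraped string, or None when the field was not found
    missing        -- value returned for absent or empty fields (assumed: 0)
    infinite_value -- value used for an infinity marker (15600 per the note above)
    """
    if text is None:
        return missing
    cleaned = text.strip().replace(",", "")
    if cleaned == "":
        return missing
    if cleaned == "∞":  # assumed marker; verify against the stored HTML
        return infinite_value
    return int(cleaned)


print(parse_count("4,822"))
print(parse_count(None))
print(parse_count(" ∞ "))
```

Because `find` in BeautifulSoup returns `None` when an element is absent, passing `None` through to a helper like this avoids special-casing repositories without an 'issues' field at every call site.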

In [None]:
project_info.to_csv('project_info.csv', index=False)

### 2. Analyzing the repository data

In this part, you will analyze the data collected in Part 1 using regression tools. The goal is to identify properties that make a repository popular. 

First, load the `project_info.csv` file in again. **We need you to do this so that we can run your code below without having to run your scraping code, which can be slow.**

In [None]:
project_info = pd.read_csv('project_info.csv')
project_info.head()

### Task 2.1 Describe the data

+ Get an overview of the data using the describe function.
+ Compute the correlation matrix and visualize it with a heat map.
+ Visualize the correlations by making a scatterplot matrix.
+ Interpret what you see.

You can re-use code from your previous homework here.
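
As a rough sketch of these steps, the block below runs `describe`, `corr`, a heat map, and a scatterplot matrix on a small synthetic table. The `demo` DataFrame is made-up placeholder data, not the real scrape, and its column names are only assumed to match `project_info.csv`; substitute `project_info` once it is loaded.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic placeholder data (NOT the real scrape), only to make the sketch runnable.
rng = np.random.default_rng(0)
forks = rng.integers(1_000, 30_000, size=50)
demo = pd.DataFrame({
    "stars": forks * 3 + rng.integers(0, 5_000, size=50),
    "forks": forks,
    "watches": forks // 10 + rng.integers(0, 500, size=50),
    "folders": rng.integers(1, 30, size=50),
})

# Overview statistics and pairwise Pearson correlations
summary = demo.describe()
corr = demo.corr()
print(summary)
print(corr)

# Heat map of the correlation matrix, with a fixed [-1, 1] color scale
fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)

# Scatterplot matrix of all numeric columns
axes = pd.plotting.scatter_matrix(demo, figsize=(10, 10))
```

Fixing the color scale to the full correlation range keeps heat maps comparable across different subsets of columns.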

In [None]:
# your code goes here

**Your Interpretation:** TODO

### 2.2. Linear regression

1. Use linear regression to try to predict the number of Stars based on Forks, Pull Requests, and Number of Folders. Explain why this is not a very good model by discussing the R-squared, the F-statistic p-value, and the coefficient p-values.
2. Develop another model which is better. Explain why it is better and interpret your results. Hint: try using other variables such as Watches and/or Contributors.
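
A minimal sketch of fitting such a model with the formula interface imported above (`statsmodels.formula.api as sm`) and reading off the three quantities named in the task. The `demo` DataFrame is synthetic placeholder data, not the real scrape, and the column names `stars`, `forks`, `pull_requests`, and `folders` are assumed to match those in `project_info.csv`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# Synthetic placeholder data (NOT the real scrape), only to make the sketch runnable.
rng = np.random.default_rng(1)
n = 80
demo = pd.DataFrame({
    "forks": rng.integers(1_000, 30_000, size=n),
    "pull_requests": rng.integers(0, 500, size=n),
    "folders": rng.integers(1, 30, size=n),
})
demo["stars"] = 2.5 * demo["forks"] + rng.normal(0, 5_000, size=n)

# Ordinary least squares: stars explained by three predictors
model = sm.ols(formula="stars ~ forks + pull_requests + folders", data=demo).fit()

print(model.rsquared)   # share of variance in stars explained by the model
print(model.f_pvalue)   # p-value of the overall F-test
print(model.pvalues)    # per-coefficient p-values, including the intercept
```

Calling `model.summary()` prints the same quantities in one table, which is convenient when comparing the first model against an alternative with different predictors.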

In [None]:
# your code goes here

**Your interpretation:** TODO