# Scraping READMEs from GitHub

## Approach
- Two approaches discussed:
    - MVP: accept a list of project repo URLs and extract the README from each.
    - AAB: start with the user's GitHub profile and use the pinned repositories.

In [1]:
import requests, bs4

# MVP

In [2]:
# For MVP approach
repos = ['https://github.com/jirvingphd/how-to-make-successful-movies',
         'https://github.com/jirvingphd/how-to-spot-a-russian-troll-tweet-mod-4-project'
        ]

In [4]:
# select test repo
repo = repos[0]
repo

'https://github.com/jirvingphd/how-to-make-successful-movies'

In [5]:
# get response
response = requests.get(repo)
response

<Response [200]>

In [8]:
# make a beautiful soup
soup = bs4.BeautifulSoup(response.content)
len(soup)

3

In [18]:
# Find the readme
tag = soup.find_all(id='readme')
readme = tag[0]
print(readme.text)

How to Make a Successful Movie
Business Problem
Specifications/Constraints
Part 1- Initial IMDB Data Processing.ipynb
IMDB Movie Metadata
Part 2 - Extracting TMDB Data.ipynb
Supplement Data from The Movie Database  (TMDB)'s
EDA Summary of Extracted Data
Years Extracted (thus far)
MPAA Rating Counts
MPAA Rating Revenue Comparison
MPAA Rating - Average Budget Comparison
MPAA Rating - Average ROI Comparison
Part 3 - MySQL Database Construction
ERD
Part 4 - Hypothesis Testing - WIP
Q1: Do some MPAA Ratings make more revenue than others?
Hypothesis
Selecting the Right Test
ANOVA Assumptions
Outliers Removed
Normality Assumption
Final Conclusion
Future Work: Planned Hypotheses to Test
Part 5 - Regression Model-Based Insights - WIP
Best Model
Regression Model Coefiicents and Importances
OLS Coefficients
Random Forest - Built-in Feature Importance
Permutation Importance
Part 6 - Classification Model-Based Insights - WIP
Random Forest Classifier - BuiltIn Feature  Importances
Ran

- This result is too broad: the `id='readme'` element also includes the table of contents (TOC).

In [20]:
articles = soup.find_all('article')
len(articles)

1

In [23]:
readme = articles[0]
print(readme.text)

How to Make a Successful Movie

James M. Irving


Business Problem
I have been hired to process and analyze IMDB's extensive publicly-available dataset, supplement it with financial data from TMDB's API, convert the raw data into a MySQL database, and then use that database for extracting insights and recommendations on how to make a successful movie.
I will use a combination of machine-learning-model-based insights and hypothesis testing to extract insights for our stakeholders.
Specifications/Constraints

The stakeholder wants to focus on attributes of the movies themselves vs. the actors and directors connected to those movies.
They only want to include information related to movies released in the United States.
They also did not want to include movies released before the year 2000.
The stakeholder is particularly interested in how the MPAA rating, genre(s), runtime, budget, and production companies influence movie revenue and user ratings.

Part 1- Initial IMDB Data Processing.ipy
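
The two lookups tried above can be folded into one helper. This is a minimal sketch, assuming the rendered README sits in the first `<article>` element as observed here; the function name `extract_readme_text` and the sample HTML are hypothetical, and the parser is set to `html.parser` explicitly rather than left for BeautifulSoup to pick.

```python
import bs4


def extract_readme_text(html):
    """Return the README text from a repo page's HTML, or None if absent.

    Prefers the first <article> element and falls back to the broader
    id='readme' container, which may also include the table of contents.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    node = soup.find("article")
    if node is None:
        node = soup.find(id="readme")
    if node is None:
        return None
    # separator/strip remove the indentation and blank lines left by the markup
    return node.get_text(separator="\n", strip=True)


# Hypothetical, hand-written HTML standing in for a real repository page
sample_html = """
<div id="readme">
  <nav>Table of contents</nav>
  <article><h1>My Project</h1><p>Business Problem</p></article>
</div>
"""
print(extract_readme_text(sample_html))
```

Returning `None` instead of raising lets the caller decide how to report a repo without a README.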

# AAB

In [26]:
# FOR AAB approach
profile = "https://github.com/jirvingphd"
class_="js-pinned-items-reorder-container"

In [24]:
response = requests.get(profile)
soup = bs4.BeautifulSoup(response.content)

In [27]:
pins  = soup.find_all(class_=class_)
len(pins)

1

In [39]:
pinned_repos = pins[0]
print(pinned_repos.text[:1000])



Pinned

how-to-make-successful-movies
Public
Exploration of successful movies using IMDB, TMDB API, MySQL Database Construction, Hypothesis Testing, And Regression Modeling
Jupyter Notebook
1

iowa-prisoner-recidivism-mod-3-project
Public
Forked from learn-co-students/dsc-3-final-project-online-ds-ft-021119
Classifying which released prisoners in Iowa will return to a life of crime using Next-Gen Gradient Boosted Trees
Jupyter Notebook
1
1

how-to-spot-a-russian-troll-tweet-mod-4-project
Public
Forked from learn-co-students/dsc-4-final-project-online-ds-ft-021119
How to Spot a (Russian) Troll - Classifying Troll Tweets vs Authentic Tweets
Jupyter Notebook
1

predicting-the-SP500-using-trumps-tweets_capstone-project
Public

In [40]:
links = pinned_repos.find_all('a', href=True)
len(links)

19

In [58]:
links[0]

<a class="mr-1 text-bold wb-break-word" data-hydro-click='{"event_type":"user_profile.click","payload":{"profile_user_id":44880739,"target":"PINNED_REPO","user_id":null,"originating_url":"https://github.com/jirvingphd"}}' data-hydro-click-hmac="e294bdc3208f2eab7fd3ec5ff1600083f0fc9ff2cc544f74a29ddfd4f678545d" href="/jirvingphd/how-to-make-successful-movies">
<span class="repo" title="how-to-make-successful-movies">how-to-make-successful-movies</span></a>

In [42]:
link = links[0]
link

<a class="mr-1 text-bold wb-break-word" data-hydro-click='{"event_type":"user_profile.click","payload":{"profile_user_id":44880739,"target":"PINNED_REPO","user_id":null,"originating_url":"https://github.com/jirvingphd"}}' data-hydro-click-hmac="e294bdc3208f2eab7fd3ec5ff1600083f0fc9ff2cc544f74a29ddfd4f678545d" href="/jirvingphd/how-to-make-successful-movies">
<span class="repo" title="how-to-make-successful-movies">how-to-make-successful-movies</span></a>

In [64]:
link.attrs

{'class': ['mr-1', 'text-bold', 'wb-break-word'],
 'data-hydro-click': '{"event_type":"user_profile.click","payload":{"profile_user_id":44880739,"target":"PINNED_REPO","user_id":null,"originating_url":"https://github.com/jirvingphd"}}',
 'data-hydro-click-hmac': 'e294bdc3208f2eab7fd3ec5ff1600083f0fc9ff2cc544f74a29ddfd4f678545d',
 'href': '/jirvingphd/how-to-make-successful-movies'}

In [50]:
# relative link
rel_link = link['href']
rel_link

'/jirvingphd/how-to-make-successful-movies'

In [51]:
from urllib.parse import urljoin

## set base url and test join
base_url = "https://www.github.com"
# abs_link = base_url + link['href']  # plain concatenation would also work here
abs_link = urljoin(base_url, rel_link)
abs_link

'https://www.github.com/jirvingphd/how-to-make-successful-movies'
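
A short sketch of why `urljoin` is a safer choice than string concatenation for these links. The `hrefs` list is illustrative: the first two follow the patterns seen in the pinned-repo links, and the third is a hypothetical absolute URL added to show that it passes through unchanged.

```python
from urllib.parse import urljoin

base_url = "https://github.com"

# Illustrative hrefs; the last one is a hypothetical absolute link
hrefs = [
    "/jirvingphd/how-to-make-successful-movies",             # root-relative
    "/jirvingphd/how-to-make-successful-movies/stargazers",  # deeper path
    "https://example.com/elsewhere",                         # already absolute
]

abs_links = [urljoin(base_url, href) for href in hrefs]
for link in abs_links:
    print(link)

# A trailing slash on the base does not produce a double slash
print(urljoin("https://github.com/", "/a/b"))
```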

# Putting it all together (AAB)

## Generate List of Repo Links from Profile Pins

In [117]:
profile = "https://www.github.com/Caellwyn"


In [118]:
# Class for pins
class_="js-pinned-items-reorder-container"

In [119]:
response = requests.get(profile)
soup = bs4.BeautifulSoup(response.content)

In [120]:
pins  = soup.find_all(class_=class_)
len(pins)

1

In [121]:
pinned_repos = pins[0]
len(pinned_repos)

5

In [122]:
links = pinned_repos.find_all('a', href=True)
len(links)

14

In [123]:
from urllib.parse import urljoin
base_url = "https://www.github.com"
repo_links = []

# saving list of absolute links

for link in links:
    # relative link
    rel_link = link['href']
    abs_link = urljoin(base_url, rel_link)

    # skip the star and fork counter links; keep only the repo links
    if abs_link.endswith(('stargazers', 'forks')):
        continue
    repo_links.append(abs_link)

In [124]:
repo_links

['https://www.github.com/Caellwyn/ou_student_predictions',
 'https://www.github.com/Caellwyn/product-flexible-twitter-sentiment-analysis',
 'https://www.github.com/Caellwyn/Seattle-Home-Sales',
 'https://www.github.com/Caellwyn/chat-with-a-philosopher',
 'https://www.github.com/Caellwyn/pet-predictor',
 'https://www.github.com/learn-co-curriculum/streamlit-image-classifier-demo',
 'https://www.github.com/Violet-Spiral/covid-xprize']
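
An alternative to matching on suffixes is to check the shape of the URL path. The sketch below assumes a repository URL always has exactly two path segments (owner and repo), so any deeper link is dropped; `is_repo_link` is a hypothetical helper, and the `/stargazers` and `/forks` URLs are assumed examples of the links being filtered out.

```python
from urllib.parse import urlparse


def is_repo_link(url):
    """True when the URL path has exactly two segments: owner and repo."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments) == 2


candidates = [
    "https://www.github.com/Caellwyn/pet-predictor",
    "https://www.github.com/Caellwyn/pet-predictor/stargazers",  # assumed pattern
    "https://www.github.com/Caellwyn/pet-predictor/forks",       # assumed pattern
    "https://www.github.com/Violet-Spiral/covid-xprize",
]
repo_only = [url for url in candidates if is_repo_link(url)]
print(repo_only)
```

This also rejects the bare profile URL, which has only one path segment.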

## Scrape Repo READMEs

In [125]:
import time
import numpy as np

In [126]:
profile

'https://www.github.com/Caellwyn'

In [127]:
profile in repo_links[0]

True

In [129]:
repo_links[0].replace(profile,'').replace('/','')

'ou_student_predictions'

In [137]:
READMES = {}

for repo in repo_links:

    # get response
    response = requests.get(repo)
    # make a beautiful soup
    soup = bs4.BeautifulSoup(response.content)

    # the rendered README is the first <article> on the repo page
    articles = soup.find_all('article')
    if articles:
        # key is 'owner/repo', e.g. 'Caellwyn/pet-predictor'
        key = repo.replace(base_url + '/', '')
        READMES[key] = articles[0].text
    else:
        print(f"No README found for {repo}")

    # pause a random 0.9 to 1.9 seconds between requests
    sec_sleep = np.random.choice([1.9, 1.2, 1.34, 1.1, 0.9])
    time.sleep(sec_sleep)
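
The loop above mixes three concerns: fetching, error handling, and pacing. One possible way to separate them is sketched below using only the standard library. The names `scrape_all` and `fake_fetch` are hypothetical, the delay bounds mirror the 0.9 to 1.9 second range used above, and the fetcher is a stand-in so the sketch runs offline; a real one might wrap `requests.get`.

```python
import random
import time


def scrape_all(urls, fetch, min_delay=0.9, max_delay=1.9, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval between calls.

    Failures are recorded rather than raised so one bad URL does not stop the run.
    """
    results, errors = {}, {}
    for i, url in enumerate(urls):
        try:
            results[url] = fetch(url)
        except Exception as exc:  # broad on purpose: keep going on any failure
            errors[url] = exc
        if i < len(urls) - 1:  # no need to wait after the last request
            sleep(random.uniform(min_delay, max_delay))
    return results, errors


# Stand-in fetcher so the sketch runs without network access
def fake_fetch(url):
    if "missing" in url:
        raise ValueError("no README found")
    return f"README for {url}"


results, errors = scrape_all(
    ["owner/repo-a", "owner/missing", "owner/repo-b"],
    fake_fetch,
    sleep=lambda seconds: None,  # skip real waiting in this demo
)
print(sorted(results))
print(list(errors))
```

Passing `sleep` as a parameter keeps the pacing testable, and `random.uniform` removes the need for NumPy just to pick a delay.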

In [138]:
READMES.keys()

dict_keys(['Caellwyn/ou_student_predictions', 'Caellwyn/product-flexible-twitter-sentiment-analysis', 'Caellwyn/Seattle-Home-Sales', 'Caellwyn/chat-with-a-philosopher', 'Caellwyn/pet-predictor', 'learn-co-curriculum/streamlit-image-classifier-demo', 'Violet-Spiral/covid-xprize'])
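
The `owner/repo` keys shown above come from stripping `base_url` off each link, which only works while every link shares that exact prefix. A sketch of a prefix-independent version follows; `repo_key` is a hypothetical helper, and it assumes the URL path has at least two segments.

```python
from urllib.parse import urlparse


def repo_key(url):
    """Return 'owner/repo' from a repository URL, whatever its scheme or host."""
    owner, repo = urlparse(url).path.strip("/").split("/")[:2]
    return f"{owner}/{repo}"


print(repo_key("https://www.github.com/Caellwyn/ou_student_predictions"))
print(repo_key("http://github.com/Violet-Spiral/covid-xprize/"))
```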

In [140]:
print(READMES['Caellwyn/ou_student_predictions'])

Predicting Student Success Using Virtual Learning Environment Interaction Statistics

Image by Goran Ivos, courtesy of Unsplash
Introduction:
The Problem
Online learning has been a growing and maturing industry for years now, and in a way, the internet has always been about learning.  Khan Academy was, to me, a quintessential step in self-driven learning and massive open online courses (MOOCs) like Coursera create an even more structured approaches to online, self-driven learning.  I took my first online course on coding fundamentals from edX in 2013 and started my data science journey in 2019 on Coursera.
While these are an amazing resources, they also have high dropout and failure rates.  Many students, accustomed to the accountability features in traditional face-to-face learning, struggle with the openness of these systems and the lack accountability.  In self-driven learning, it can be easy to get lost or give up.
The Solution
I set out to find ways to use data to help students su
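
Text pulled out of nested HTML with `.text` usually carries the indentation and blank lines of the markup around it. A minimal cleanup sketch, assuming one item per line is the desired result; `collapse_whitespace` and the `raw` sample are hypothetical.

```python
def collapse_whitespace(text):
    """Strip each line and drop empty ones, keeping one item per line."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


# Hypothetical sample shaped like the raw pinned-repo text
raw = "\n\n      Pinned\n    \n\n\nhow-to-make-successful-movies\nPublic\n\n\n"
print(collapse_whitespace(raw))
```

Note that this also removes the blank lines between README paragraphs, so it suits short snippets better than full README bodies.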