# PLAN

- [ ] Acquisition
    - [ ] Select which repos to scrape.
    - [ ] Request the pages from the site.
    - [ ] Save responses to csv.
- [ ] Preparation
    - [ ] Prepare the data for analysis.
- [ ] Exploration
    - [ ] Answer the following prompts:
        - [ ] What are the most common words in READMEs?
        - [ ] What does the distribution of IDFs look like for the most common words?
        - [ ] Does the length of the README vary by language?
        - [ ] Do different languages use a different number of unique words?
- [ ] Modeling
    - [ ] Transform the data for machine learning; use language as the target to predict.
    - [ ] Fit several models using different text representations.
    - [ ] Build a function that takes in the text of a README file and predicts its language.
- [ ] Delivery
    - [ ] GitHub repo
        - [x] This notebook.
        - [ ] Documentation within the notebook.
        - [ ] README file in the repo.
        - [ ] Python scripts if applicable.
    - [ ] Google Slides
        - [ ] 1-2 slides only summarizing analysis.
        - [ ] Visualizations are labeled.
        - [ ] Geared toward a general audience.
        - [ ] Share the link in the README file and/or the classroom.

# ENVIRONMENT

In [42]:
# disable warnings
import warnings
warnings.filterwarnings("ignore")

import unicodedata
import re
from requests import get
import json
# import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import pandas as pd

# PAGE = 1
# BASEURL = 
# get_url_list() below passes HEADERS with every search request, so it must be defined
HEADERS = {'User-Agent': 'Sentient Attack Helicoptor'}

# ACQUIRE

The first step is to collect the links to the most-starred GitHub repositories.

In [None]:
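# construct a url list
from urllib.parse import urljoin

def get_url_list(url):
    # NOTE: `url` is kept in the signature but is not used; the search
    # pages are built from the fixed template below.
    urls = []
    max_page = 11
    for i in range(1, max_page + 1):
        page_url = 'https://github.com/search?p=' + str(i) + '&q=stars%3A%3E0&s=stars&type=Repositories'
        print(f'traversing url: {page_url}')
        response = get(page_url, headers=HEADERS)
        soup = BeautifulSoup(response.text, 'html.parser')
        # each search result is expected to sit in an <h3> that wraps the repo link
        for h in soup.find_all('h3'):
            a = h.find('a')
            if a is None or not a.get('href'):
                continue  # skip headings that carry no link
            # hrefs may be relative (/owner/repo); urljoin leaves absolute ones unchanged
            urls.append(urljoin('https://github.com', a['href']))
    print(len(urls))
    return urls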

In [66]:
def grab_readmes_and_languages(urls):
    readmes = []
    languages = []
    for url in urls:
        response = get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        print('Retrieving README')
        readme_div = soup.find('div', class_='Box-body')
        if readme_div is None:
            print('Skipping because of no README')
            continue
        single_readme = readme_div.text
        print('Got README')
        print('Retrieving language')
        lang_span = soup.find('span', class_='lang')
        if lang_span is None:
            print('Skipping because of no language')
            continue
        repo_language = lang_span.text
        print('Got language')
        # data-pjax is not a valid Python keyword argument, so pass it through attrs
        title_link = soup.find('a', attrs={'data-pjax': '#js-repo-pjax-container'})
        if title_link is not None:
            print(title_link.text)
        # append only after both lookups succeed so the two lists stay aligned
        languages.append(repo_language)
        readmes.append(single_readme)
    df = pd.DataFrame({'readme': readmes, 'language': languages})
    return df


In [67]:
urls = ['https://github.com/freeCodeCamp/freeCodeCamp', 'https://github.com/996icu/996.ICU',
        'https://github.com/vuejs/vue', 'https://github.com/twbs/bootstrap', 'https://github.com/facebook/react', 
        'https://github.com/tensorflow/tensorflow', 'https://github.com/EbookFoundation/free-programming-books', 
        'https://github.com/sindresorhus/awesome', 'https://github.com/getify/You-Dont-Know-JS', 
        'https://github.com/airbnb/javascript']
grab_readmes_and_languages(urls)

Retrieving README
Got README
Retrieving language
Got language
Retrieving README
Got README
Retrieving language
Got language
Retrieving README
Got README
Retrieving language
Got language
Retrieving README
Got README
Retrieving language
Got language
Retrieving README
Got README
Retrieving language
Got language
Retrieving README
Got README
Retrieving language
Got language
Retrieving README
Got README
Retrieving language
Skipping because of no language
Retrieving README
Got README
Retrieving language
Skipping because of no language
Retrieving README
Got README
Retrieving language
Skipping because of no language
Retrieving README
Got README
Retrieving language
Got language


| | readme | language |
|---|---|---|
| 0 | \n\n\n\n\n\nWelcome to freeCodeCamp.org's open... | JavaScript |
| 1 | \n996.ICU\nPlease note that there exists NO ot... | Rust |
| 2 | \n\n\n\n\n\n\n\n\n\n\n\nSupporting Vue.js\nVue... | JavaScript |
| 3 | \n\n\n\n\n\nBootstrap\n\n Sleek, intuitive, a... | JavaScript |
| 4 | \nReact · \nReact is a JavaScript library f... | JavaScript |
| 5 | \n\n\n\n\n\n\n\nDocumentation\n\n\n\n\n\n\n\n\... | C++ |
| 6 | \nAirbnb JavaScript Style Guide() {\nA mostly ... | JavaScript |
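
The plan lists "Save responses to csv", but nothing above writes the scraped rows to disk yet. Below is a minimal sketch of that step using only the standard library; `save_records`, `load_records`, the file name, and the sample rows are hypothetical and not part of the notebook. With the DataFrame returned above, `df.to_csv(path, index=False)` would be the pandas equivalent, and `index=False` avoids an extra `Unnamed: 0` column when the file is read back.

```python
import csv
import os
import tempfile


def save_records(records, path):
    """Write a list of {'readme': ..., 'language': ...} dicts to a CSV file."""
    # newline="" lets the csv module handle line endings, which matters
    # because README text contains embedded newlines.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["readme", "language"])
        writer.writeheader()
        writer.writerows(records)


def load_records(path):
    """Read the CSV back into a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return [dict(row) for row in csv.DictReader(f)]


# Hypothetical rows standing in for scraped results.
sample = [
    {"readme": "\nReact\nA JavaScript library for building UIs", "language": "JavaScript"},
    {"readme": "\nDocumentation, with a comma", "language": "C++"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "readmes_example.csv")
    save_records(sample, csv_path)
    restored = load_records(csv_path)
```

Embedded newlines and commas survive the round trip because the `csv` module quotes those fields.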


In [39]:
urls = ['https://github.com/EbookFoundation/free-programming-books', 'https://github.com/996icu/996.ICU',
        'https://github.com/vuejs/vue', 'https://github.com/twbs/bootstrap']
readmes = []
languages = []
for url in urls:
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    print('Retrieving README')
    single_readme = soup.find('div', class_='Box-body').text
    readmes.append(single_readme)
    print('Got README')
    print('Retrieving language')
    lang_span = soup.find('span', class_='lang')
    if lang_span is None:
        print('No language found')
        # keep a placeholder so readmes and languages stay the same length
        repo_language = 'None'
    else:
        repo_language = lang_span.text
        print('Got language')
    languages.append(repo_language)
    print(readmes)
    print(languages)

Retrieving README
Got README
Retrieving language
No language found
["\nThis page is available as an easy-to-read website at https://ebookfoundation.github.io/.\nList of Free Learning Resources \nView the English list\nIntro\nThis list was originally a clone of stackoverflow - List of Freely Available Programming Books with contributions from Karan Bhangui and George Stocker.\nThe list was moved to GitHub by Victor Felder for collaborative updating and maintenance. It has grown to become one of the most popular repositories on Github, with over 100,000 stars, over 4500 commits, over 950 contributors, and over 25,000 forks.\nThe repo is now administered by the Free Ebook Foundation, a not-for-profit organization devoted to promoting the creation, distribution, archiving and sustainability of free ebooks. Donations to the Free Ebook Foundation are tax-deductible in the US.\nHow To Contribute\nPlease read CONTRIBUTING. If you're new to Github, welcome!\nHow to Share\n\nShare on Twitter\nSh

Retrieving README
Got README
Retrieving language
Got language
['None', 'Rust', 'JavaScript']


Retrieving README
Got README
Retrieving language
Got language
['None', 'Rust', 'JavaScript', 'JavaScript']


In [40]:
urls = ['https://github.com/freeCodeCamp/freeCodeCamp', 'https://github.com/996icu/996.ICU',
        'https://github.com/vuejs/vue', 'https://github.com/twbs/bootstrap']
grab_readmes(urls)

Retrieving README for https://github.com/freeCodeCamp/freeCodeCamp
Retrieving README for https://github.com/996icu/996.ICU
Retrieving README for https://github.com/vuejs/vue
Retrieving README for https://github.com/twbs/bootstrap


["\n\n\n\n\n\nWelcome to freeCodeCamp.org's open source codebase and curriculum!\nfreeCodeCamp.org is a friendly community where you can learn to code for free. It is run by a donor-supported 501(c)(3) nonprofit with the goal of helping millions of busy adults transition into tech. Our community has already helped more than 10,000 people get their first developer job.\nOur full-stack web development curriculum is completely free and self-paced. We have thousands of interactive coding challenges to help you expand your skills.\nTable of Contents\n\nCertifications\nThe Learning Platform\nFound a Bug\nFound a Security Issue\nContributing\nLicense\n\nCertifications\nfreeCodeCamp.org offers several free developer certifications. Each of these certifications involves building 5 required web app projects, along with hundreds of optional coding challenges to help you prepare for those projects. We estimate that each certification will take a beginning programmer around 300 hours to earn.\nEach

# PREPARE
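
The plan item "Prepare the data for analysis" and the `unicodedata`, `re`, and stopword imports in the environment cell suggest a normalize, tokenize, and filter pipeline. The following is only a sketch of what that might look like, under assumptions: `basic_clean`, `prepare`, and the tiny `EXAMPLE_STOPWORDS` set are hypothetical names, and the stopword set stands in for a fuller list such as the one in `nltk.corpus.stopwords`.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stopword list used only for illustration.
EXAMPLE_STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "for", "in"}


def basic_clean(text):
    """Lowercase, strip non-ASCII characters, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    # Drop anything that cannot be represented in ASCII (accents, symbols).
    text = text.encode("ascii", "ignore").decode("ascii")
    # Replace punctuation with spaces, keeping letters, digits, and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def prepare(text, stopwords=EXAMPLE_STOPWORDS):
    """Return the cleaned text as a list of tokens without stopwords."""
    return [word for word in basic_clean(text).split() if word not in stopwords]


sample = "\nReact \u00b7 \nReact is a JavaScript library for building user interfaces."
tokens = prepare(sample)
print(tokens)
```

Splitting on whitespace keeps the sketch dependency-free; the `ToktokTokenizer` imported earlier could replace `str.split` if finer tokenization is needed.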

# EXPLORE
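
Two of the exploration prompts in the plan ask for the most common words and for the distribution of their IDFs. Here is a hedged sketch of both computations on already-tokenized documents. It assumes the unsmoothed definition `idf = log(N / df)`, which is one common convention among several, and the function names and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count how many documents each word appears in."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so a word counts once per document
    return df


def idf(docs):
    """Unsmoothed inverse document frequency: log(N / df) for each word."""
    n_docs = len(docs)
    return {word: math.log(n_docs / count)
            for word, count in document_frequencies(docs).items()}


# Hypothetical tokenized READMEs.
docs = [
    ["react", "javascript", "library"],
    ["vue", "javascript", "framework"],
    ["tensorflow", "machine", "learning", "library"],
]

word_counts = Counter(word for tokens in docs for word in tokens)
scores = idf(docs)

# Print each word with its raw count and IDF, most frequent first.
for word, count in word_counts.most_common():
    print(f"{word:12s} count={count} idf={scores[word]:.3f}")
```

Words that occur in many READMEs get a low IDF, so plotting `scores` for the top entries of `word_counts` would give the distribution the prompt asks about.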

# MODEL
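
The plan calls for a function that takes README text and predicts the language. As a rough illustration only, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing from scratch. This is a swapped-in technique chosen for being self-contained, not the notebook's model; the names `train_naive_bayes` and `predict_language` and the toy training data are hypothetical, and a library classifier would be the more practical choice on real data.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect per-language word counts, document counts, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter(labels)
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def predict_language(tokens, model):
    """Return the label with the highest log-probability for the tokens."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokens:
            if token not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing avoids log(0) for unseen pairs
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical tokenized READMEs with their languages.
train_docs = [
    ["react", "javascript", "library"],
    ["vue", "javascript", "framework"],
    ["tensorflow", "c", "library", "machine", "learning"],
]
train_labels = ["JavaScript", "JavaScript", "C++"]

model = train_naive_bayes(train_docs, train_labels)
prediction = predict_language(["javascript", "framework"], model)
print(prediction)
```

In practice the input tokens would come from the same preparation step used on the training READMEs, so that training and prediction see text in the same form.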