# PLAN

- [ ] Acquisition
    - [ ] Select which list of repos to scrape.
    - [ ] Request the pages from the site.
    - [ ] Save the responses to CSV.
- [ ] Preparation
    - [ ] Prepare the data for analysis.
- [ ] Exploration
    - [ ] Answer the following prompts:
        - [ ] What are the most common words in READMEs?
        - [ ] What does the distribution of IDFs look like for the most common words?
        - [ ] Does the length of the README vary by language?
        - [ ] Do different languages use a different number of unique words?
- [ ] Modeling
    - [ ] Transform the data for machine learning; use language as the target to predict.
    - [ ] Fit several models using different text representations.
    - [ ] Build a function that takes in the text of a README file and predicts its language.
- [ ] Delivery
    - [ ] GitHub repo
        - [x] This notebook.
        - [ ] Documentation within the notebook.
        - [ ] README file in the repo.
        - [ ] Python scripts if applicable.
    - [ ] Google Slides
        - [ ] Only 1-2 slides summarizing the analysis.
        - [ ] Visualizations are labeled.
        - [ ] Geared toward a general audience.
        - [ ] Share the link in the README file and/or the classroom.

# ENVIRONMENT

In [26]:
# disable warnings
import warnings
warnings.filterwarnings("ignore")

import unicodedata
import re
from requests import get
import json
# import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import pandas as pd
import time
import csv

BASEURL = 'https://github.com/search?p=1&q=stars%3A%3E0&s=stars&type=Repositories'
HEADERS = {'User-Agent': 'Not Sentient Attack Helicoptor'}

# ACQUIRE

The first step is to collect the links to the most-starred GitHub repositories.

In [23]:
def get_url_list(page):
    """Scrape repo URLs from the first `page` pages of GitHub's most-starred search results."""
    urls = []
    for i in range(1, page + 1):
        # Same query as BASEURL, with the page number substituted in.
        url = 'https://github.com/search?p=' + str(i) + '&q=stars%3A%3E0&s=stars&type=Repositories'
        response = get(url, headers=HEADERS)
        soup = BeautifulSoup(response.text, 'html.parser')
        list_of_repos = soup.find('ul', class_='repo-list')
        if list_of_repos is None:
            # No result list on this page (e.g. blocked request or changed markup); skip it.
            time.sleep(3)
            continue
        repositories = list_of_repos.find_all('li', class_='repo-list-item')
        for item in repositories:
            # Keep only repos that report a primary language.
            if item.find(attrs={'itemprop': 'programmingLanguage'}):
                a = item.find('a')
                urls.append(a.attrs['href'])
        time.sleep(3)  # pause between pages to avoid hammering the site
    print(f'Scraped a total of {len(urls)} github urls.')
    urls = ['https://github.com' + url for url in urls]
    # All URLs are written as a single CSV row.
    with open('github_urls.csv', 'w', newline='') as f:
        ghub_urls = csv.writer(f, delimiter=',')
        ghub_urls.writerow(urls)
    return urls
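
Because `get_url_list` writes every URL into a single row of `github_urls.csv`, a later session could reload that list instead of scraping again. The following is a minimal sketch under that assumption; `load_url_list` is a hypothetical helper and not part of the original notebook.

```python
import csv
from pathlib import Path


def load_url_list(path='github_urls.csv'):
    """Read back URLs that were saved as one CSV row.

    Returns an empty list if the file is missing or empty.
    """
    csv_path = Path(path)
    if not csv_path.exists():
        return []
    with csv_path.open(newline='') as f:
        rows = list(csv.reader(f))
    # The writer stored every URL in the first (and only) row.
    return rows[0] if rows else []
```

A possible use would be `urls = load_url_list() or get_url_list(1)`, which only scrapes when no saved list is found.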


##### Function that grabs the readme text and the main language of the repo


In [24]:
def grab_readmes_and_languages(urls):
    """Return a DataFrame of README text and primary language for each repo URL."""
    readmes = []
    languages = []
    for url in urls:
        response = get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        readme_box = soup.find('div', class_='Box-body')
        language_tag = soup.find('span', class_='lang')
        # Skip repos that are missing either a README or a language.
        if readme_box is None or language_tag is None:
            continue
        readmes.append(readme_box.text)
        languages.append(language_tag.text)
    df = pd.DataFrame({'readme': readmes, 'language': languages})
    return df


In [27]:
urls = get_url_list(1)

Scraped a total of 7 github urls.


In [29]:
df = grab_readmes_and_languages(urls)
df.head(10)

|   | readme | language |
|---|--------|----------|
| 0 | `\n\n\n\n\n\nWelcome to freeCodeCamp.org's open...` | JavaScript |
| 1 | `\n996.ICU\nPlease note that there exists NO ot...` | Rust |
| 2 | `\n\n\n\n\n\n\n\n\n\n\n\nSupporting Vue.js\nVue...` | JavaScript |
| 3 | `\n\n\n\n\n\nBootstrap\n\n Sleek, intuitive, a...` | JavaScript |
| 4 | `\nReact · \nReact is a JavaScript library f...` | JavaScript |
| 5 | `\n\n\n\n\n\n\n\nDocumentation\n\n\n\n\n\n\n\n\...` | C++ |
| 6 | `\n\n\n\nOh My Zsh is an open source, community...` | Shell |


In [17]:
for line in df.readme:
    print(line[:50])
    print(line[-50:])
    print('\n')







Welcome to freeCodeCamp.org's open source co
on are licensed under the CC-BY-SA-4.0 license.






996.ICU
Please note that there exists NO other of

Contact
You can reach me by E-mail if you need.
















Supporting Vue.js
Vue.js is an MIT-lic

MIT
Copyright (c) 2013-present, Yuxi (Evan) You










Bootstrap

  Sleek, intuitive, and powerful 
T License. Docs released under Creative Commons.





React ·    
React is a JavaScript library for bui
e to get started.
License
React is MIT licensed.












Documentation








TensorFlow is an ope
 ways to participate.
License
Apache License 2.0








Oh My Zsh is an open source, community-driven 
gency. Check out our other open source projects.





D3: Data-Driven Documents

D3 (or D3.js) is a Jav
equire("d3-geo"), require("d3-geo-projection"));







    React Native
  


Learn once, write anywher
ons licensed, as found in the LICENSE-docs file.





Visual Studio Code - Open Source




VS Code is a
rights reserv

# PREPARE
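
The scraped READMEs shown above carry long runs of newlines, mixed case, and non-ASCII symbols such as `·`. One possible cleaning step is sketched below using only `unicodedata` and `re`, both already imported in the environment cell. The names `basic_clean` and `tokenize` and the exact character whitelist are assumptions for illustration, not the notebook's actual preparation code.

```python
import re
import unicodedata


def basic_clean(text):
    """Lowercase, drop accents and non-ASCII symbols, and collapse whitespace."""
    text = unicodedata.normalize('NFKD', text.lower())
    # Characters with no ASCII equivalent (e.g. the middle dot) are dropped.
    text = text.encode('ascii', 'ignore').decode('utf-8')
    # Keep letters, digits, apostrophes, and whitespace; replace the rest with spaces.
    text = re.sub(r"[^a-z0-9'\s]", ' ', text)
    return re.sub(r'\s+', ' ', text).strip()


def tokenize(text, stopwords=frozenset()):
    """Split cleaned text on whitespace and drop any supplied stopwords."""
    return [word for word in basic_clean(text).split() if word not in stopwords]
```

With a DataFrame like the one above, this could be applied as `df['clean'] = df.readme.apply(basic_clean)`. A stopword set such as `set(stopwords.words('english'))` from NLTK could be passed to `tokenize`, assuming the NLTK stopwords corpus has been downloaded.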

# EXPLORE
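
Two of the planned prompts, the most common words and their IDF distribution, can be approximated with the standard library alone. The sketch below assumes each README has already been cleaned into a whitespace-separated string. The function names are hypothetical, and the unsmoothed formula `log(N / df)` is just one common IDF definition (libraries such as scikit-learn use a smoothed variant).

```python
import math
from collections import Counter


def word_counts(documents):
    """Count how often each word appears across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.split())
    return counts


def idf(documents):
    """Compute log(N / df) per word, where df is the number of documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # A set ensures each word is counted at most once per document.
        doc_freq.update(set(doc.split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}
```

For example, `word_counts(docs).most_common(10)` gives the ten most frequent words, and looking those words up in `idf(docs)` shows how evenly they are spread across READMEs: a word that appears in every document gets an IDF of 0.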

# MODEL
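
The plan calls for a function that takes README text and predicts the language. As a rough illustration of the idea, and not the model this project fits, the sketch below implements a small multinomial naive Bayes classifier over a bag-of-words representation using only the standard library. The class name and its methods are hypothetical; in practice a library such as scikit-learn would likely be a better fit.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesLanguageModel:
    """Tiny multinomial naive Bayes over whitespace tokens (illustrative only)."""

    def fit(self, readmes, languages):
        self.word_counts = defaultdict(Counter)  # language -> word -> count
        self.doc_counts = Counter(languages)     # language -> number of READMEs
        self.vocab = set()
        for text, lang in zip(readmes, languages):
            words = text.lower().split()
            self.word_counts[lang].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang, n_docs in self.doc_counts.items():
            counts = self.word_counts[lang]
            # Add-one (Laplace) smoothing so unseen words do not zero out a class.
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in words:
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang
```

Assuming the DataFrame built in the acquisition step, usage might look like `model = NaiveBayesLanguageModel().fit(df.readme, df.language)` followed by `model.predict(some_readme_text)`.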