# PLAN

- [x] Acquisition
    - [x] Select what list of repos to scrape.
    - [x] Send GET requests to the site.
    - [x] Save responses to csv.
- [ ] Preparation
    - [x] Prepare the data for analysis.
- [ ] Exploration
    - [ ] Answer the following prompts:
        - [ ] What are the most common words in READMEs?
        - [ ] What does the distribution of IDFs look like for the most common words?
        - [ ] Does the length of the README vary by language?
        - [ ] Do different languages use a different number of unique words?
- [ ] Modeling
    - [ ] Transform the data for machine learning; use language as the target to predict.
    - [ ] Fit several models using different text representations.
    - [ ] Build a function that takes the text of a README file and predicts its language.
- [ ] Delivery
    - [ ] Github repo
        - [x] This notebook.
        - [ ] Documentation within the notebook.
        - [ ] README file in the repo.
        - [ ] Python scripts if applicable.
    - [ ] Google Slides
        - [ ] 1-2 slides only summarizing analysis.
        - [ ] Visualizations are labeled.
        - [ ] Geared for the general audience.
        - [ ] Share the link in the README file and/or classroom.

# ENVIRONMENT

In [2]:
# disable warnings
import warnings
warnings.filterwarnings("ignore")

import unicodedata
import re
from requests import get
import json
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
from bs4 import BeautifulSoup
import pandas as pd
import time
import csv
from functools import reduce
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

#Global variables holding all of our language names and additional stopwords
LANGUAGES = ['JavaScript', 'Rust', 'C++', 'Python', 'Dart', 'Java', 'Go', 'CSS',
             'PHP', 'TypeScript', 'Ruby', 'HTML', 'C', 'Vue', 'C#', 'Shell',
             'Clojure', 'Objective-C', 'Swift', 'Jupyter Notebook','Vim script',
             'Assembly', 'Kotlin', 'Dockerfile', 'TeX', 'javascript', 'rust',
             'c++', 'python', 'dart', 'java', 'go', 'css', 'php', 'typescript',
             'ruby', 'html', 'c', 'vue', 'c#', 'shell', 'clojure', 'objective-c',
             'swift', 'jupyter notebook', 'vim script', 'assembly', 'kotlin',
             'dockerfile', 'tex', 'yes', 'one', 'also', 'two', 'etc', 'please']

BASEURL = 'https://github.com/search?p=1&q=stars%3A%3E0&s=stars&type=Repositories'
HEADERS = {'User-Agent': 'Assault Potato Gun'}

# ACQUIRE

The first step is to collect the links to the most-starred GitHub repositories.  The most complicated part here was identifying the section of the page that holds the actual urls.

The `get_url_list` function does the following:
* get a response from the BASEURL
* set number of pages to scrape and loop through all of them
* find the list of all the repos on the page
* from that list find the individual list item repos
* do a check to see if there is a language associated with the repo
* * if no language, skip
* loop through individual repo sections and grab the url
* print the total number of valid urls scraped
* save the resulting list of urls to a csv
* return the urls

In [3]:
def get_url_list(page):
    """Scrape `page` pages of search results and return the repo urls."""
    urls = []
    # Page 1 of this loop is the same address as BASEURL, so no separate
    # request to BASEURL is needed before the loop.
    for i in range(1, page + 1):
        url = 'https://github.com/search?p=' + str(i) + '&q=stars%3A%3E0&s=stars&type=Repositories'
        print(f'traversing url: {url}')
        response = get(url, headers=HEADERS)
        soup = BeautifulSoup(response.text, 'html.parser')
        list_of_repos = soup.find('ul', class_='repo-list')
        if list_of_repos is None:
            # No repo list on this page (for example, an error page); skip it.
            time.sleep(3)
            continue
        for repo in list_of_repos.find_all('li', class_='repo-list-item'):
            # Only keep repos that list a programming language.
            if repo.find(attrs={'itemprop': 'programmingLanguage'}):
                urls.append(repo.find('a').attrs['href'])
        time.sleep(3)  # pause between pages
    print(f'Scraped a total of {len(urls)} github urls.')
    urls = ['https://github.com' + url for url in urls]
    # newline='' is what the csv module expects for files it writes
    with open('github_urls.csv', 'w', newline='') as f:
        ghub_urls = csv.writer(f, delimiter=',')
        ghub_urls.writerow(urls)
    return urls
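
As a small aside, the paged search address can also be built with the standard library instead of string concatenation. This is only a sketch: `SEARCH_ENDPOINT` and `build_search_url` are hypothetical names that do not appear elsewhere in this notebook, and the query parameters are copied from `BASEURL`.

```python
from urllib.parse import urlencode

# Hypothetical constant: the host and path portion of BASEURL.
SEARCH_ENDPOINT = 'https://github.com/search'


def build_search_url(page):
    """Return the star-sorted repository search url for one results page."""
    params = {
        'p': page,              # results page, starting at 1
        'q': 'stars:>0',        # urlencode turns this into stars%3A%3E0
        's': 'stars',           # sort by star count
        'type': 'Repositories',
    }
    return f'{SEARCH_ENDPOINT}?{urlencode(params)}'


page_urls = [build_search_url(i) for i in range(1, 4)]
print(page_urls[0])
# https://github.com/search?p=1&q=stars%3A%3E0&s=stars&type=Repositories
```

Letting `urlencode` handle the escaping keeps the readable query (`stars:>0`) in the code and produces the same address as `BASEURL` for page 1.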


Now that we have the list of urls from `get_url_list`, we need to do the following:
* visit each of the urls
* find the main body of the `README.md`
* * if there is no body in the `README.md` then skip it
* grab the readme info
* find the prominent language and grab that as well
* do this for all the urls in the list from the previous function

In [4]:
def grab_readmes_and_languages(urls):
    """Return a dataframe with the README text and main language of each repo."""
    readmes = []
    languages = []
    for url in urls:
        response = get(url)
        soup = BeautifulSoup(response.content, 'html.parser')
        readme_body = soup.find('div', class_='Box-body')
        language_tag = soup.find('span', class_='lang')
        # Skip repos that are missing either the README body or the language.
        if readme_body is None or language_tag is None:
            continue
        readmes.append(readme_body.text)
        languages.append(language_tag.text)
    df = pd.DataFrame({'readme': readmes, 'language': languages})
    return df

Now that we have the README text and languages in a dataframe, we need to do some cleanup.  These functions will do the following:
* lowercase all the text
* normalize the text to ASCII, dropping accents and other non-ASCII characters
* keep only letters, numbers, and whitespace
* strip any whitespace at the start or the end of the text
* collapse newlines and repeated whitespace into a single space
* tokenize the words
* either stem or lemmatize the words
* remove all standard stopwords as well as the languages and any additional stopwords that were found during exploration
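
To make the first few steps concrete, here is a minimal worked example that uses only the standard library. `clean_text_sketch`, `SAMPLE_STOPWORDS`, and the sample string are made up for illustration; the real stopword list comes from nltk, and stemming or lemmatizing is left out.

```python
import re
import unicodedata

# Hypothetical, tiny stopword set used only for this illustration.
SAMPLE_STOPWORDS = {'the', 'a', 'and', 'to'}


def clean_text_sketch(text):
    """Lowercase, reduce to ASCII, keep letters/digits/whitespace, tidy spaces."""
    text = text.lower()
    # NFKD splits accented characters into base letter + accent mark;
    # encoding to ASCII with 'ignore' then drops the accent marks.
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return re.sub(r'\s+', ' ', text).strip()


sample = "  The Café-Style README:\n\tInstall v2 and RUN!  "
cleaned = clean_text_sketch(sample)
tokens = [word for word in cleaned.split() if word not in SAMPLE_STOPWORDS]

print(cleaned)  # the cafestyle readme install v2 and run
print(tokens)   # ['cafestyle', 'readme', 'install', 'v2', 'run']
```

Note that removing punctuation outright joins hyphenated words (`café-style` becomes `cafestyle`), which is worth keeping in mind when reading word counts later.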

In [6]:
def basic_clean(string):
    """Lowercase, normalize to ASCII, and remove anything that isn't a letter,
    number, or whitespace, then return the cleaned string."""
    clean_string = string.lower()
    clean_string = unicodedata.normalize('NFKD', clean_string).\
                    encode('ascii', 'ignore').\
                    decode('utf-8', 'ignore')
    clean_string = re.sub(r'[^a-z0-9\s]', '', clean_string)
    clean_string = clean_string.strip()
    clean_string = re.sub(r'\s+', ' ', clean_string)
    return clean_string

def tokenize(string, string_or_list='string'):
    """Tokenize with nltk's ToktokTokenizer; return a string or a list of tokens."""
    tokenizer = ToktokTokenizer()
    if string_or_list == 'string':
        return tokenizer.tokenize(string, return_str=True)
    if string_or_list == 'list':
        return tokenizer.tokenize(string)
    
def stem(string, string_or_list='string'):
    """Returns the stems."""
    ps = nltk.porter.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    stemmed_string = ' '.join(stems)
    if string_or_list == 'list':
        return stems
    if string_or_list == 'string':
        return stemmed_string
    
def lemmatize(string, string_or_list='string'):
    """Returns the lemmatized text."""
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    if string_or_list == 'string':
        # Drop single-character tokens (including single digits). Filtering the
        # token list avoids gluing neighbouring words together, which happens
        # when a regex deletes the token along with the spaces around it.
        return ' '.join(lemma for lemma in lemmas if len(lemma) > 1)
    if string_or_list == 'list':
        return lemmas
    
def remove_stopwords(string, string_or_list='string', extra_words=None, exclude_words=None):
    """Removes the stopwords from the text then returns it. Able to add or remove stopwords."""
    stopword_list = stopwords.words('english') #+ LANGUAGES
    if extra_words is not None:
        stopword_list.extend(extra_words)
    if exclude_words is not None:
        # Filter instead of list.remove so a word missing from the list doesn't raise.
        stopword_list = [word for word in stopword_list if word not in exclude_words]
    filtered_words = [word for word in string.split() if word not in stopword_list]
    filtered_string = ' '.join(filtered_words)
    if string_or_list == 'string':
        return filtered_string
    if string_or_list == 'list':
        return filtered_words
    
def remove_languages(string, string_or_list='string', extra_words=None, exclude_words=None):
    """Removes the standard stopwords plus the language names and extra words in
    LANGUAGES from the text then returns it. Able to add or remove words."""
    # Matching is per token, so multi-word entries such as 'jupyter notebook'
    # are not removed as a unit.
    stopword_list = stopwords.words('english') + LANGUAGES
    if extra_words is not None:
        stopword_list.extend(extra_words)
    if exclude_words is not None:
        stopword_list = [word for word in stopword_list if word not in exclude_words]
    filtered_words = [word for word in string.split() if word not in stopword_list]
    filtered_string = ' '.join(filtered_words)
    if string_or_list == 'string':
        return filtered_string
    if string_or_list == 'list':
        return filtered_words

# fancy pipe function
def pipe(v, *fns):
    return reduce(lambda x, f: f(x), fns, v)

def readme_lem(text):
    return pipe(text, basic_clean, tokenize, remove_stopwords, lemmatize)

def readme_stem(text):
    return pipe(text, basic_clean, tokenize, remove_stopwords, stem)
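
For readers who haven't seen this pattern, here is a small standalone illustration of how `pipe` threads a value through functions from left to right. `pipe` is restated so the snippet runs on its own, and `collapse_spaces` is a hypothetical helper standing in for the nltk-based steps.

```python
from functools import reduce


def pipe(value, *fns):
    """Apply each function in fns to value, left to right."""
    return reduce(lambda acc, fn: fn(acc), fns, value)


def collapse_spaces(text):
    """Hypothetical stand-in step: squeeze runs of whitespace to one space."""
    return ' '.join(text.split())


text = '  Hello   README  World '
result = pipe(text, str.lower, collapse_spaces, str.split)
print(result)  # ['hello', 'readme', 'world']

# The same thing written as nested calls reads inside-out:
nested = str.split(collapse_spaces(str.lower(text)))
print(result == nested)  # True
```

The order of the arguments is the order of execution, which is why `readme_lem` and `readme_stem` list the cleaning step first and the lemmatizing or stemming step last.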

If we already have the urls saved, we don't need to re-scrape them.

In [23]:
with open('github_urls1.csv', newline='') as f:
    # csv.reader parses the single row and drops the trailing line ending
    urls = next(csv.reader(f))
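
One detail to watch when reloading the file: `csv.writer` ends each row with a line terminator, so splitting the raw line on commas leaves that terminator attached to the last url, while `csv.reader` strips it. The sketch below shows the difference with an in-memory buffer; the two urls are placeholders, not repos from our list.

```python
import csv
import io

# Placeholder urls for illustration only.
urls = ['https://github.com/example/one', 'https://github.com/example/two']

# Write one csv row into an in-memory buffer, as get_url_list does to a file.
buffer = io.StringIO(newline='')
csv.writer(buffer, delimiter=',').writerow(urls)
raw = buffer.getvalue()

# Splitting the raw text keeps the line ending on the final item.
naive = raw.split(',')
print(repr(naive[-1]))  # 'https://github.com/example/two\r\n'

# csv.reader returns the row exactly as it was written.
buffer.seek(0)
parsed = next(csv.reader(buffer))
print(parsed == urls)   # True
```

A url with a stray line ending can still be requested in some cases, but it is an easy source of confusing failures, so reading the row back with `csv.reader` is the safer choice.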

Checking to make sure the size of the list of urls makes sense.

In [24]:
len(urls)

218

Scraping the information we need from GitHub with our function and storing it in a dataframe.

In [None]:
df = grab_readmes_and_languages(urls)
df.head(10)

In [25]:
!ls

Book2.csv                github_data.csv          github_urls5.csv
DF_tree                  github_data.html         github_urls6.csv
DF_tree.pdf              github_urls.csv          mz_working_nb.ipynb
README.md                github_urls1.csv         nlp_github_project.ipynb
full_719.csv             github_urls3.csv         nlp_project_orion.ipynb
git_hub.csv              github_urls4.csv         personal_nb_mz.ipynb
