# Predicting the Programming Languages of the Most-Starred GitHub Repos

## Goal
Build a model that predicts a repository's primary programming language from the text of its README file.

In [2]:
import pandas as pd
import re

# scraping modules
from requests import get
from bs4 import BeautifulSoup

import unicodedata
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import os
import acquire

## Acquire and Prep data

**Acquire data from local cache** → Normalize and Tokenize → Stem/Lemmatize → Remove stopwords and extraneous words

In [2]:
#acquire.scrape_github_data()

In [3]:
df = pd.read_json('data.json')
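
The "acquire from local cache" step can be sketched as follows. This is a minimal illustration, not the contents of the `acquire` module: `load_or_fetch` and `fake_scrape` are hypothetical names, the record is made-up sample data, and the sketch assumes the scraper returns a JSON-serializable list of records.

```python
import json
import os
import tempfile


def load_or_fetch(path, fetch):
    """Return cached records from `path`, calling `fetch()` only on a cache miss."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    records = fetch()
    with open(path, 'w') as f:
        json.dump(records, f)
    return records


# Hypothetical stand-in for a scraper; it records how often it is called.
calls = []


def fake_scrape():
    calls.append(1)
    # Made-up record, for illustration only.
    return [{'repo': 'example/repo', 'language': 'Python', 'readme_contents': 'hello'}]


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, 'data.json')
    first = load_or_fetch(cache_path, fake_scrape)   # cache miss: runs fake_scrape
    second = load_or_fetch(cache_path, fake_scrape)  # cache hit: reads the file

print(len(calls))       # how many times the scraper actually ran
print(first == second)  # both calls return the same records
```

With a helper like this, the slow scrape runs once and later notebook sessions read the cached file instead.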

In [4]:
df.head()

| | language | readme_contents | repo |
|---|---|---|---|
| 0 | JavaScript | `![freeCodeCamp.org Social Banner](https://s3.a...` | freeCodeCamp/freeCodeCamp |
| 1 | Rust | `[996.ICU](https://996.icu/#/en_US)\n=======\n*...` | 996icu/996.ICU |
| 2 | JavaScript | `<p align="center"><a href="https://vuejs.org" ...` | vuejs/vue |
| 3 | JavaScript | `# [React](https://reactjs.org/) &middot; [![Gi...` | facebook/react |
| 4 | C++ | `<div align="center">\n <img src="https://www....` | tensorflow/tensorflow |


In [14]:
def basic_clean(string):
    """
    Convert to all lowercase
    Normalize the unicode chars
    Replace any non-alpha characters with spaces
    Remove any alpha strings with 2 characters or fewer
    Collapse runs of whitespace into a single space
    """
    string = string.lower()
    string = unicodedata.normalize('NFKD', string).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    
    # keep only alpha chars (everything else, including newlines and tabs, becomes a space)
    string = re.sub(r'[^a-z]', ' ', string)
    
    # remove words of 2 chars or fewer
    string = re.sub(r'\b[a-z]{1,2}\b', '', string)
    
    # collapse runs of whitespace into a single space
    string = re.sub(r'\s+', ' ', string)
    
    # strip leading and trailing whitespace
    string = string.strip()
    
    return string
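
To make the cleaning steps concrete, here is a step-by-step trace on a made-up sample string. It is a sketch of the same kind of cleaning rather than a replacement for `basic_clean`; the sample text and the variable names are illustrative only, and the final whitespace collapse is there so the result is easy to read.

```python
import re
import unicodedata

# Made-up sample with a newline, punctuation, and non-ASCII characters.
sample = "Use rbenv to pick a Ruby version.\nJust Works™ in app’s café!"

# 1. Lowercase, decompose unicode (NFKD), and drop anything that is not ASCII.
#    NFKD turns "é" into "e" plus a combining accent (dropped) and "™" into "TM".
text = sample.lower()
text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
ascii_text = text

# 2. Replace every character outside a-z (punctuation, newlines, the uppercase
#    "TM" produced in step 1) with a space.
text = re.sub(r'[^a-z]', ' ', text)

# 3. Drop words of one or two letters.
text = re.sub(r'\b[a-z]{1,2}\b', '', text)

# 4. Collapse the leftover runs of spaces.
cleaned = ' '.join(text.split())

print(repr(ascii_text))
print(cleaned)  # use rbenv pick ruby version just works apps cafe
```

Note that the curly apostrophe in "app’s" has no ASCII equivalent, so it is dropped and the word becomes "apps".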

In [15]:
def stem(string):
    ps = nltk.stem.PorterStemmer()
    stems = [ps.stem(word) for word in string.split()]
    string_of_stems = ' '.join(stems)
    return string_of_stems

In [16]:
def lemmatize(string):
    wnl = nltk.stem.WordNetLemmatizer()
    lemmas = [wnl.lemmatize(word) for word in string.split()]
    string_of_lemmas = ' '.join(lemmas)
    return string_of_lemmas

In [17]:
def tokenize(string):
    tokenizer = nltk.tokenize.ToktokTokenizer()
    return tokenizer.tokenize(string, return_str=True)

Which words do we want to exclude, beyond the standard English stopwords? Mostly badge, image, and URL boilerplate:
- http
- banner
- request
- img
- badge
- svg
- www
- com
- png
- welcome
- pr
- style
- flat
- makeapullrequest
- gitpod
- logo
- blue
- green
- brightgreen

In [18]:
def remove_stopwords(tokenized_string, extra_words=(), exclude_words=()):
    words = tokenized_string.split()
    stopword_list = stopwords.words('english')

    # remove the excluded words from the stopword list
    stopword_list = set(stopword_list) - set(exclude_words)

    # add in the user specified extra words
    stopword_list = stopword_list.union(set(extra_words))

    filtered_words = [w for w in words if w not in stopword_list]
    final_string = " ".join(filtered_words)
    return final_string
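
The exclusion list above is the kind of input `extra_words` is designed for. The sketch below shows the effect on a fragment of the first lemmatized README. To stay runnable without the NLTK stopword corpus, it uses a tiny stand-in stopword set (`BASE_STOPWORDS`) and a hypothetical helper, `filter_words`, that mirrors the set logic of `remove_stopwords`; both are assumptions for illustration.

```python
# Words from the exclusion list (badge, image, and URL boilerplate).
EXTRA_WORDS = [
    'http', 'banner', 'request', 'img', 'badge', 'svg', 'www', 'com', 'png',
    'welcome', 'pr', 'style', 'flat', 'makeapullrequest', 'gitpod', 'logo',
    'blue', 'green', 'brightgreen',
]

# Assumption: a tiny stand-in for stopwords.words('english').
BASE_STOPWORDS = {'the', 'and', 'to', 'for', 'you', 'can', 'where'}


def filter_words(text, extra_words=(), exclude_words=()):
    """Drop stopwords from a space-separated string (hypothetical helper)."""
    # Start from the base list, keep any words the caller wants to protect,
    # then add the caller's extra words.
    stopword_set = (set(BASE_STOPWORDS) - set(exclude_words)) | set(extra_words)
    return ' '.join(w for w in text.split() if w not in stopword_set)


sample = ('freecodecamp org social banner http amazonaws com freecodecamp wide '
          'social banner png pull request welcome http img shield badge pr '
          'welcome brightgreen svg style flat')

filtered = filter_words(sample, extra_words=EXTRA_WORDS)
print(filtered)  # freecodecamp org social amazonaws freecodecamp wide social pull shield
```

With the real function, the same list could be passed through pandas, for example `df.lemmatized.apply(remove_stopwords, extra_words=EXTRA_WORDS)`, since `Series.apply` forwards keyword arguments to the function.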

In [20]:
df = df[['language','readme_contents']]
df = df.assign(original = df.readme_contents.apply(basic_clean))

In [21]:
df.head()

| | language | readme_contents | original |
|---|---|---|---|
| 0 | JavaScript | `![freeCodeCamp.org Social Banner](https://s3.a...` | freecodecamp org social banner https amaz... |
| 1 | Rust | `[996.ICU](https://996.icu/#/en_US)\n=======\n*...` | icu https icu please not... |
| 2 | JavaScript | `<p align="center"><a href="https://vuejs.org" ...` | align center href https vuejs org targ... |
| 3 | JavaScript | `# [React](https://reactjs.org/) &middot; [![Gi...` | react https reactjs org middot githu... |
| 4 | C++ | `<div align="center">\n <img src="https://www....` | div align center img src https www te... |


In [22]:
# df = df.assign(normalized = df.original.apply(normalize))

df = df.assign(stemmed = df.original.apply(tokenize).apply(stem))
df = df.assign(lemmatized = df.original.apply(tokenize).apply(lemmatize))


df = df.assign(cleaned = df.lemmatized.apply(remove_stopwords))

df.head()

| | language | readme_contents | original | stemmed | lemmatized | cleaned |
|---|---|---|---|---|---|---|
| 0 | JavaScript | `![freeCodeCamp.org Social Banner](https://s3.a...` | freecodecamp org social banner https amaz... | freecodecamp org social banner http amazonaw c... | freecodecamp org social banner http amazonaws ... | freecodecamp org social banner http amazonaws ... |
| 1 | Rust | `[996.ICU](https://996.icu/#/en_US)\n=======\n*...` | icu https icu please not... | icu http icu pleas note that there exist other... | icu http icu please note that there exists oth... | icu http icu please note exists official accou... |
| 2 | JavaScript | `<p align="center"><a href="https://vuejs.org" ...` | align center href https vuejs org targ... | align center href http vuej org target blank r... | align center href http vuejs org target blank ... | align center href http vuejs org target blank ... |
| 3 | JavaScript | `# [React](https://reactjs.org/) &middot; [![Gi...` | react https reactjs org middot githu... | react http reactj org middot github licens htt... | react http reactjs org middot github license h... | react http reactjs org middot github license h... |
| 4 | C++ | `<div align="center">\n <img src="https://www....` | div align center img src https www te... | div align center img src http www tensorflow o... | div align center img src http www tensorflow o... | div align center img src http www tensorflow o... |


In [23]:
df.cleaned[0]

'freecodecamp org social banner http amazonaws com freecodecamp wide social banner png pull request welcome http img shield badge pr welcome brightgreen svg style flat http makeapullrequest com first timer friendly http img shield badge first timer friendly blue svg http www firsttimersonly com open source helper http www codetriage com freecodecamp freecodecamp badge user svg http www codetriage com freecodecamp freecodecamp setup automated http img shield badge setup automated blue logo gitpod http gitpod referrer freecodecamp org open source codebase curriculum freecodecamp org http www freecodecamp org friendly community learn code free run donor supported nonprofit http donate freecodecamp org help million busy adult transition tech community ha already helped people get first developer job full stack web development curriculum completely free self paced thousand interactive coding challenge help expand skill table content certification certification learning platform learning pla

In [25]:
df.lemmatized[0]

'freecodecamp org social banner http amazonaws com freecodecamp wide social banner png pull request welcome http img shield badge pr welcome brightgreen svg style flat http makeapullrequest com first timer only friendly http img shield badge first timer only friendly blue svg http www firsttimersonly com open source helper http www codetriage com freecodecamp freecodecamp badge user svg http www codetriage com freecodecamp freecodecamp setup automated http img shield badge setup automated blue logo gitpod http gitpod from referrer freecodecamp org open source codebase and curriculum freecodecamp org http www freecodecamp org friendly community where you can learn code for free run donor supported nonprofit http donate freecodecamp org help million busy adult transition into tech our community ha already helped more than people get their first developer job our full stack web development curriculum completely free and self paced have thousand interactive coding challenge help you expand

In [18]:
df.readme_contents[0]

"freecodecamp org social banner https s3 amazonaws com freecodecamp wide social banner png pull requests welcome https img shields io badge prs welcome brightgreen svg style flat http makeapullrequest com first timers only friendly https img shields io badge first timers only friendly blue svg http www firsttimersonly com open source helpers https www codetriage com freecodecamp freecodecamp badges users svg https www codetriage com freecodecamp freecodecamp setup automated https img shields io badge setup automated blue logo gitpod https gitpod io from referrer freecodecamp org ' s open source codebase and curriculum freecodecamp org https www freecodecamp org is a friendly community where you can learn to code for free it is run by a donor supported 501 c 3 nonprofit https donate freecodecamp org to help millions of busy adults transition into tech our community has already helped more than 10 000 people get their first developer job our full stack web development curriculum is compl

In [21]:
df = df.rename(columns={'readme_contents': 'original'})

In [22]:
df.head()

| | language | original | repo |
|---|---|---|---|
| 0 | JavaScript | freecodecamp org social banner https s3 amazon... | freeCodeCamp/freeCodeCamp |
| 1 | Rust | 996 icu https 996 icu en us please note that t... | 996icu/996.ICU |
| 2 | JavaScript | p align center a href https vuejs org target b... | vuejs/vue |
| 3 | JavaScript | react https reactjs org middot github license ... | facebook/react |
| 4 | C++ | div align center img src https www tensorflow ... | tensorflow/tensorflow |


In [9]:
#soup = BeautifulSoup(response.content)

In [10]:
#soup.title

<title>GitHub - rbenv/rbenv: Groom your app’s Ruby environment</title>

In [30]:
# get body of README
# body = soup.find('article', class_='markdown-body').get_text()
# body

'Groom your app’s Ruby environment with rbenv.\nUse rbenv to pick a Ruby version for your application and guarantee\nthat your development environment matches production. Put rbenv to work\nwith Bundler for painless Ruby upgrades and\nbulletproof deployments.\nPowerful in development. Specify your app\'s Ruby version once,\nin a single file. Keep all your teammates on the same page. No\nheadaches running apps on different versions of Ruby. Just Works™\nfrom the command line and with app servers like Pow.\nOverride the Ruby version anytime: just set an environment variable.\nRock-solid in production. Your application\'s executables are its\ninterface with ops. With rbenv and Bundler\nbinstubs\nyou\'ll never again need to cd in a cron job or Chef recipe to\nensure you\'ve selected the right runtime. The Ruby version\ndependency lives in one place—your app—so upgrades and rollbacks are\natomic, even when you switch versions.\nOne thing well. rbenv is concerned solely with switching Ruby\n

In [4]:
# find README language

# language = soup.find('span', class_= 'lang').get_text()
# language