# Predicting the Programming Languages of the Most-Starred GitHub Repos

## Goal
Build a model that predicts the primary programming language of a repository from the text of its README file.

In [7]:
import pandas as pd
import re

# scraping modules
from requests import get
from bs4 import BeautifulSoup

import unicodedata
import json

import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords

import os
import acquire
import prepare

## I. Acquire

The `scrape_github_data` function in the acquire.py module fetches the data. The call is commented out below, and the data is read from the local cache file `data.json` instead.
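
The cache-first pattern described here can be sketched as follows. This is a minimal illustration, not the code in acquire.py: `load_or_fetch` and its `fetch` argument are hypothetical names, and the sketch assumes the scraper returns a JSON-serializable list of dictionaries.

```python
import json
import os
from typing import Callable, Dict, List


def load_or_fetch(path: str, fetch: Callable[[], List[Dict[str, str]]]) -> List[Dict[str, str]]:
    """Return records cached at `path`; call `fetch` only when no cache file exists."""
    if os.path.exists(path):
        # Cache hit: read the stored records and skip the (slow) fetch.
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    # Cache miss: fetch once, then store the result for later runs.
    records = fetch()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)
    return records
```

If `acquire.scrape_github_data` does return such a list (an assumption), a call like `load_or_fetch('data.json', acquire.scrape_github_data)` would scrape only on the first run.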

In [8]:
#acquire.scrape_github_data()

In [9]:
df = pd.read_json('data.json')

In [10]:
df.head()

Unnamed: 0,language,readme_contents,repo
0,JavaScript,![freeCodeCamp.org Social Banner](https://s3.a...,freeCodeCamp/freeCodeCamp
1,Rust,[996.ICU](https://996.icu/#/en_US)\n=======\n*...,996icu/996.ICU
2,JavaScript,"<p align=""center""><a href=""https://vuejs.org"" ...",vuejs/vue
3,JavaScript,# [React](https://reactjs.org/) &middot; [![Gi...,facebook/react
4,C++,"<div align=""center"">\n <img src=""https://www....",tensorflow/tensorflow


## II. Prep

The `prep_articles` function from the prepare.py module performs the following:
 - normalize the text by removing non-ASCII characters, special characters, numbers, extra whitespace, and so on
 - tokenize words
 - stem and lemmatize
 - remove stopwords and extraneous words

**Normalize and Tokenize** → Stem/Lemmatize → Remove stopwords and extraneous words
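
The normalize, tokenize, and stopword-removal steps can be sketched with the standard library alone. This is an illustration under stated assumptions, not the contents of prepare.py: the function names and the tiny `STOPWORDS` set are placeholders, and stemming and lemmatizing are left out because they depend on NLTK and its downloaded corpora.

```python
import re
import unicodedata

# Placeholder stopword set (assumption); the notebook imports NLTK's stopword corpus.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "this", "to"}


def normalize(text: str) -> str:
    """Lowercase, drop non-ASCII characters, and turn every non-letter run into one space."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z]+", " ", text)
    return text.strip()


def tokenize(text: str) -> list:
    """Split normalized text on whitespace."""
    return text.split()


def remove_stopwords(tokens: list, stopwords: set = STOPWORDS) -> list:
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t not in stopwords]


def clean(text: str) -> str:
    """Run normalize -> tokenize -> remove stopwords and rejoin the tokens."""
    return " ".join(remove_stopwords(tokenize(normalize(text))))
```

For example, `normalize("# [React](https://reactjs.org/) &middot;")` yields `"react https reactjs org middot"`, which is close to the `normalized` column shown in the output, although the real function may differ in details such as dropping single letters.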

In [11]:
df = prepare.prep_articles(df)

In [12]:
df

Unnamed: 0,language,original,normalized,stemmed,lemmatized,cleaned
0,JavaScript,![freeCodeCamp.org Social Banner](https://s3.a...,freecodecamp org social banner https amaz...,freecodecamp org social banner http amazonaw c...,freecodecamp org social banner http amazonaws ...,freecodecamp org social amazonaws freecodecamp...
1,Rust,[996.ICU](https://996.icu/#/en_US)\n=======\n*...,icu https icu please not...,icu http icu pleas note that there exist other...,icu http icu please note that there exists oth...,icu icu please note exists official account ap...
2,JavaScript,"<p align=""center""><a href=""https://vuejs.org"" ...",align center href https vuejs org targ...,align center href http vuej org target blank r...,align center href http vuejs org target blank ...,align center href vuejs org target blank rel n...
3,JavaScript,# [React](https://reactjs.org/) &middot; [![Gi...,react https reactjs org middot githu...,react http reactj org middot github licens htt...,react http reactjs org middot github license h...,react reactjs org middot github license shield...
4,C++,"<div align=""center"">\n <img src=""https://www....",div align center img src https www te...,div align center img src http www tensorflow o...,div align center img src http www tensorflow o...,div align center src tensorflow org image logo...
5,JavaScript,"<p align=""center"">\n <a href=""https://getboot...",align center href https getbootstrap...,align center href http getbootstrap com img sr...,align center href http getbootstrap com img sr...,align center href getbootstrap src getbootstra...
6,,This page is available as an easy-to-read webs...,this page available easy read website ht...,thi page avail easi read websit http ebookfoun...,this page available easy read website http ebo...,page available easy read website ebookfoundati...
7,,"<div align=""center"">\n\t<img width=""500"" heigh...",div align center img width height ...,div align center img width height src media lo...,div align center img width height src medium l...,div align center width height src medium logo ...
8,,# You Don't Know JS Yet (book series) - 2nd Ed...,you don know yet book series edition ...,you don know yet book seri edit thi seri book ...,you don know yet book series edition this seri...,know yet book series edition series book divin...
9,Shell,"<p align=""center"">\n <img src=""https://s3.ama...",align center img src https amazonaw...,align center img src http amazonaw com ohmyzsh...,align center img src http amazonaws com ohmyzs...,align center src amazonaws ohmyzsh zsh logo al...


Which extraneous words do we want to exclude?
- http
- banner
- request
- img
- badge
- svg
- www
- com
- png
- welcome
- pr
- style
- flat
- makeapullrequest
- gitpod
- logo
- blue
- green
- brightgreen
- div
- align
- center
- width
- src
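
A minimal sketch of how the words listed above could be filtered out of a README string. The names `EXTRA_STOPWORDS` and `drop_words` are hypothetical; the actual filtering in prepare.py may work differently.

```python
# Extraneous words taken from the list above (mostly URL, HTML, and badge residue).
EXTRA_STOPWORDS = {
    "http", "banner", "request", "img", "badge", "svg", "www", "com",
    "png", "welcome", "pr", "style", "flat", "makeapullrequest", "gitpod",
    "logo", "blue", "green", "brightgreen", "div", "align", "center",
    "width", "src",
}


def drop_words(text: str, exclude: set = EXTRA_STOPWORDS) -> str:
    """Remove whitespace-separated tokens that exactly match a word in `exclude`."""
    return " ".join(word for word in text.split() if word not in exclude)
```

Matching is on exact tokens, so `https` would survive unless it is first stemmed or lemmatized to `http` or added to the set. Assuming the column names shown in the output, the function could be applied with something like `df['lemmatized'].apply(drop_words)`.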

In [22]:
df.head()

Unnamed: 0,language,original,repo
0,JavaScript,freecodecamp org social banner https s3 amazon...,freeCodeCamp/freeCodeCamp
1,Rust,996 icu https 996 icu en us please note that t...,996icu/996.ICU
2,JavaScript,p align center a href https vuejs org target b...,vuejs/vue
3,JavaScript,react https reactjs org middot github license ...,facebook/react
4,C++,div align center img src https www tensorflow ...,tensorflow/tensorflow
