# NLP Project: Predicting a Repository's Programming Language Based on Its README.md
"We don't know, what we don't know"  

By: Cody Watson and Eric Escalante  
May 13, 2019  

In this Jupyter Notebook, we will be scraping data from GitHub repository README files. The goal is to build a model that can predict which programming language a repository is using, given the text of the README file.

## Imports
**Import the necessary packages and their use cases for this project:**
> **pandas:** data frames and data manipulation  
> **numpy:** summary statistics  
> **matplotlib:** used for visualizations  
> **seaborn:** fancy visualizations  
> **datetime:** turn the dates into datetime objects / get day of week  
> **warnings:** used to ignore Python warnings  
> **requests:** to obtain the HTML from the page  
> **unicodedata:** character encoding  
> **BeautifulSoup:** to parse the HTML and obtain the text/data that we want  
> **nltk:** tokenization and the English stopword list  
> **WordCloud:** word-frequency visualizations

In [71]:
import unicodedata
import re
import json
import spacy
from spacy.lang.en import English
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import pandas as pd
import NLP_acquire
from typing import List, Dict

## Table of contents
1. [Project Planning](#project-planning)
1. [Preparation](#preparation)
1. [Exploration](#exploration)
1. [Modeling](#modeling)
1. [Summary](#summary)

## Project Planning <a name="project-planning"></a>

### Goals  
> Goals for the Project are:  
1. Accurately predict a repository's programming language, across multiple languages, from GitHub README files
2. Create WordClouds showing the most commonly used words for each programming language
3. Build multiple classification machine learning models to accurately predict which language the repository is written in
4. Be sure that we are documenting our thoughts throughout the process

### Deliverables
> 1. A well-documented Jupyter notebook that contains your analysis  
> 2. One or two Google Slides suitable for a general audience that summarize your findings. Include a well-labelled visualization in your slides.

### Hypotheses
> "C++ programmers are more elitist"

### Thoughts & Questions
> **Thoughts:**  
- Figure out how to apply multiple classification methods to predict the programming language from the repo's README.md
- Figure out how to apply multiple sentiment analysis methods to the data
- Compare and contrast different corpora using TF-IDF
- We want to learn how to set up Word2Vec for word embeddings
- Apply a 'GitHub' image mask to a WordCloud (within the negative/positive space; label each space with different colors)

> **Questions:**  
- What am I?
- Does the sentiment in a given language veer more towards positive or negative?
- How many graphs can we make this weekend??

## Prepare the Environment <a name="preparation"></a>

**Bring in the data from the prepare file**

In [175]:
df = pd.read_json('data.json')

**Going to see what the top portion looks like**

In [176]:
df.head()

| Unnamed: 0 | language | readme_contents | repo |
|---|---|---|---|
| 0 | JavaScript | `<p align="center">\n <a href="https://getboot...` | twbs/bootstrap |
| 1 | JavaScript | `# [React](https://reactjs.org/) &middot; [![Gi...` | facebook/react |
| 2 |  | `This page is available as an easy-to-read webs...` | EbookFoundation/free-programming-books |
| 3 |  | `<div align="center">\n\t<img width="500" heigh...` | sindresorhus/awesome |
| 4 |  | `![Web Developer Roadmap - 2019](https://i.imgu...` | kamranahmedse/developer-roadmap |


**Functions to clean our README files**

In [177]:
def basic_clean(text):
    '''
    Lowercase the text, drop non-ASCII characters via unicodedata normalization,
    and remove everything except letters, digits, apostrophes, and whitespace
    '''
    text = unicodedata.normalize('NFKD', text.lower())\
        .encode('ascii', 'ignore')\
        .decode('utf-8', 'ignore')
    return re.sub(r"[^a-z0-9'\s]", '', text)

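# Load the spaCy model once; reloading it for every README is slow.
# NOTE: the 'en' shortcut and the '-PRON-' lemma are spaCy 2.x conventions.
nlp = spacy.load('en', parse=True, tag=True, entity=True)

def lemmatize(text):
    '''
    Function that lemmatizes the string
    '''
    doc = nlp(text) # process the text with spacy
    lemmas = [word.lemma_ for word in doc]
    text_lemmatized = ' '.join(lemmas)
    # drop pronoun placeholders, possessive 's, and stray apostrophes
    return re.sub(r"\s*(-PRON-|\'s|\')", '', text_lemmatized)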

def remove_stopwords(text):
    '''
    Function to remove stopwords from the string
    '''
    tokenizer = ToktokTokenizer()
    stopword_list = stopwords.words('english')
    stopword_list.remove('no')
    stopword_list.remove('not')
    tokens = tokenizer.tokenize(text)
    filtered_tokens = [t for t in tokens if t not in stopword_list]
    return ' '.join(filtered_tokens)

def clean_readme(string):
    return remove_stopwords(lemmatize(basic_clean(string)))
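
To see what `basic_clean` keeps and discards, here is a minimal standalone sketch. The sample strings are made up for illustration and are not taken from the scraped data; the function body is repeated from the cell above only so that the block runs on its own.

```python
import re
import unicodedata


def basic_clean(text):
    """Same logic as the notebook's basic_clean, repeated so this runs alone."""
    text = (
        unicodedata.normalize("NFKD", text.lower())
        .encode("ascii", "ignore")
        .decode("utf-8", "ignore")
    )
    return re.sub(r"[^a-z0-9'\s]", "", text)


# Hypothetical inputs: accented text, language names with symbols, inline HTML.
samples = ["Café Node.js!", "C++ and C#", "Don't <b>panic</b>"]
cleaned = [basic_clean(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(f"{raw!r} -> {out!r}")
# cleaned == ['cafe nodejs', 'c and c', "don't bpanicb"]
```

Two side effects are worth keeping in mind: `C++` and `C#` both collapse to `c`, and stripping `<`, `>` and `/` leaves HTML tag names fused to the neighbouring words. Both may matter when the thing being predicted is the programming language itself.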

**Apply our functions**

In [178]:
df['readme_clean'] = df.readme_contents.apply(clean_readme)

**Thoughts:**
> 1. We no longer need the original README contents, so we are going to drop them
> 2. We now need to group by language
> 3. We are going to start using WordCloud to visualize the frequency of each word

In [179]:
df.drop(columns='readme_contents', inplace=True)
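
The grouping step from the thoughts above could look roughly like the sketch below. It is an assumption about how the step might be done, not the notebook's actual code: `toy` is a hypothetical stand-in for `df`, and its rows are invented. Repositories without a language (like rows 2 to 4 in the `head()` output) are dropped first, since they cannot serve as labeled examples.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the notebook's df after cleaning.
toy = pd.DataFrame(
    {
        "language": ["JavaScript", "JavaScript", None, "Python"],
        "readme_clean": [
            "install npm package",
            "npm run build",
            "free book list",
            "pip install package",
        ],
        "repo": ["a/one", "b/two", "c/three", "d/four"],
    }
)

# Keep only rows that have a language label.
labeled = toy.dropna(subset=["language"])

# One Counter of word frequencies per language.
word_counts = {
    language: Counter(" ".join(group["readme_clean"]).split())
    for language, group in labeled.groupby("language")
}

for language, counts in word_counts.items():
    print(language, counts.most_common(3))
```

Each `Counter` could then be handed to a word cloud or used to list the most and least frequent words per language.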

## Exploration  <a name="exploration"></a>

**Thoughts:**
> 1. Look at the top/bottom 10 words in each README, grouped by programming language
> 2. Sentiment analysis on each language
> 3. Figure out how to apply latent Dirichlet allocation
> 4. Learn how to set up Word2Vec
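
As a starting point for comparing the per-language corpora with TF-IDF, the scores can be computed by hand. This is a minimal sketch under stated assumptions: `corpus` is a hypothetical toy example with one cleaned string per language, term frequency is the count divided by document length, and inverse document frequency is the unsmoothed `log(N / df)`.

```python
import math
from collections import Counter

# Hypothetical toy corpus: one cleaned, concatenated README string per language.
corpus = {
    "JavaScript": "npm install package npm run build",
    "Python": "pip install package pip run script",
    "Java": "maven build package java run java",
}


def tf_idf(corpus):
    """Return {doc: {term: tf * idf}} with tf = count / len(doc), idf = log(N / df)."""
    tokenized = {name: text.split() for name, text in corpus.items()}
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))
    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(corpus)
# Highest-scoring term per language.
top_terms = {name: max(s, key=s.get) for name, s in scores.items()}
print(top_terms)
```

Words that appear in every document, such as `package` and `run` here, score zero, which is what makes TF-IDF useful for finding language-specific vocabulary. scikit-learn's `TfidfVectorizer` applies a smoothed IDF by default, so its numbers would differ from this sketch.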

## Modeling <a name="modeling"></a>

**Bring in multiple classification models**
> _ToDo_
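
One possible first model is a multinomial Naive Bayes classifier over the cleaned README words. The notebook does not say which classifiers will be used, so this choice is an assumption, and the sketch below is a small pure-Python version meant to show the idea rather than a production implementation. The training strings are invented stand-ins for `df.readme_clean` and `df.language`.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace tokens, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in text.split():
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data standing in for df.readme_clean / df.language.
texts = [
    "npm install react component",
    "pip install numpy array",
    "npm run webpack build",
    "python pip virtualenv script",
]
labels = ["JavaScript", "Python", "JavaScript", "Python"]

model = NaiveBayes().fit(texts, labels)
prediction = model.predict("npm build react app")
print(prediction)
```

In practice a library implementation such as scikit-learn's `MultinomialNB`, fed with TF-IDF or count features, would likely replace this class.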

### Train-Test Split
> _ToDo_
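
Because some languages will have far fewer repositories than others, a stratified split is a reasonable default, so that every language shows up in both sets. The helper below is a hypothetical pure-Python sketch; `stratified_split`, `test_size`, and `seed` are illustrative names, and the label list is invented.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Return (train_idx, test_idx) with each label represented proportionally."""
    by_label = defaultdict(list)
    for i, label in enumerate(labels):
        by_label[label].append(i)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # At least one test example per label; a label with a single
        # repo therefore ends up only in the test set.
        n_test = max(1, round(len(indices) * test_size))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels standing in for df.language.
labels = ["JavaScript"] * 8 + ["Python"] * 4
train_idx, test_idx = stratified_split(labels)

test_label_counts = Counter(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), test_label_counts)
```

scikit-learn's `train_test_split` offers the same behaviour through its `stratify` argument and would be the more usual choice in a notebook.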

## Summarize Conclusions <a name="summary"></a>
> _ToDo_

### Find different ways to improve the model
> _ToDo_