# Predicting Programming Languages using NLP

### Executive Summary

- Our goal is to create a classification model that predicts a repository's programming language from the content of its GitHub README.
    - This can help users find relevant content based on their programming language criteria.
    

### Key Takeaways

- The most common programming language in our dataset is JavaScript, followed by Python.
- Some of the most common words in the READMEs were 'file', 'end', 'class', 'use' and 'object'.
- README length varies by programming language.
- Our best model used  to predict programming languages with % accuracy. This model outperformed my baseline score of % accuracy, so it has value.

### Project Overview

- A Trello board was used to track the tasks for this project. You can find the board <a href="https://trello.com/b/PddXdOTJ/nlp-project">here</a>.
- Python scripts were used to acquire, prepare and explore the data.
- Statistical analyses tested the following hypotheses:
    1. 
 
### Data Dictionary

The data dictionary detailing all variables used in this analysis can be found <a href="https://github.com/mariam-and-cindy/predicting-programming-languages/blob/main/README.md">here</a>.

In [4]:
# required imports

from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from wordcloud import WordCloud


# import prepare as pr

# Acquire Data

Our data was scraped from 400 GitHub repositories. For our dataset, we used GitHub's list of most-forked repositories, found <a href="https://github.com/search?o=desc&p={i}&q=stars%3A%3E1&s=forks&type=Repositories">here</a>.
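
The `{i}` in the search URL above is a page-number placeholder. As a minimal sketch, such a template could be expanded into one URL per results page as follows; the helper name `search_page_urls` and the page count are illustrative assumptions, not taken from our acquire script.

```python
# URL template copied from the link above; {i} is the results-page number.
SEARCH_URL = (
    "https://github.com/search?o=desc&p={i}&q=stars%3A%3E1"
    "&s=forks&type=Repositories"
)


def search_page_urls(n_pages):
    """Return the search URL for pages 1..n_pages (hypothetical helper)."""
    return [SEARCH_URL.format(i=page) for page in range(1, n_pages + 1)]


# n_pages=3 is an arbitrary example value.
urls = search_page_urls(3)
for url in urls:
    print(url)
```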

This list of repositories was cached as a CSV after acquisition. Using our acquire script, we then pulled the username and title, language, and README contents of every repository into a JSON file, which we read in as a pandas DataFrame.
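
The cache-then-load idea described above can be sketched roughly as follows. This is a simplified illustration, not our actual acquire script: `load_or_scrape` and `fake_scrape` are hypothetical names, and the stand-in scraper returns a single hard-coded record so the example runs offline.

```python
import json
import tempfile
from pathlib import Path


def load_or_scrape(cache_path, scrape_fn):
    """Return cached records if the JSON file exists, else scrape and cache.

    `scrape_fn` is a hypothetical zero-argument callable that returns a list
    of dicts with the keys 'repo', 'language' and 'readme_contents'.
    """
    path = Path(cache_path)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    records = scrape_fn()
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return records


def fake_scrape():
    # Stand-in for the real scraper so the sketch runs without network access.
    return [
        {
            "repo": "octocat/Spoon-Knife",
            "language": "HTML",
            "readme_contents": "### Well hello there!",
        }
    ]


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "repos.json"
    first = load_or_scrape(cache, fake_scrape)   # scrapes and writes the cache
    second = load_or_scrape(cache, fake_scrape)  # reads the cached file
    print(first == second)
```

Caching this way avoids re-scraping GitHub each time the notebook is rerun.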


In [5]:
# read in the json file as df
repo_json_file = 'data2.json'
df = pd.read_json(repo_json_file)

In [6]:
# quick look at df
df.head()

| | repo | language | readme_contents |
|---|---|---|---|
| 0 | jtleek/datasharing | | `How to share data with a statistician\n=======...` |
| 1 | rdpeng/ProgrammingAssignment2 | R | `### Introduction\n\nThis second programming as...` |
| 2 | octocat/Spoon-Knife | HTML | `### Well hello there!\n\nThis repository is me...` |
| 3 | tensorflow/tensorflow | C++ | `<div align="center">\n <img src="https://www....` |
| 4 | SmartThingsCommunity/SmartThingsPublic | Groovy | `# SmartThings Public GitHub Repo\n\nAn officia...` |


In [7]:
# check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   repo             400 non-null    object
 1   language         344 non-null    object
 2   readme_contents  400 non-null    object
dtypes: object(3)
memory usage: 9.5+ KB


## Takeaways

- The dataset has 400 scraped repositories.
- 56 of the 400 repositories are missing a language (344 non-null values).
    - This could be because no primary programming language was obvious.
    - We will drop these rows during data preparation.
- All three columns have the object dtype.
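
The missing-language check can also be done directly rather than read off `df.info()`. Here is a minimal sketch on a three-row stand-in frame (the rows mirror `df.head()`, with shortened README text); on the full dataset the same calls should report 400 - 344 = 56 missing languages.

```python
import pandas as pd

# Small stand-in for the scraped data; values mirror the first rows of df.
sample = pd.DataFrame(
    {
        "repo": [
            "jtleek/datasharing",
            "rdpeng/ProgrammingAssignment2",
            "octocat/Spoon-Knife",
        ],
        "language": [None, "R", "HTML"],
        "readme_contents": [
            "How to share data with a statistician",
            "### Introduction",
            "### Well hello there!",
        ],
    }
)

# Count rows without a language, then drop them and renumber the index.
n_missing = sample["language"].isna().sum()
labeled = sample.dropna(subset=["language"]).reset_index(drop=True)
print(n_missing, len(labeled))
```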

# Prepare Data

During this stage of the pipeline, we will work on cleaning and preparing the data for exploration and modeling. 

The following steps will be performed to prepare the data for the best-performing model:
- cleaning content to remove any special characters
- removing stop words
- lemmatizing content
- stemming content
- removing repos that have non-English content
- dropping rows with missing values
- creating new columns
    - cleaned
    - stemmed
    - lemmatized
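
A few of the steps above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions, not our prepare script: the function names, the tiny stop-word set and the ASCII-ratio heuristic for spotting non-English text are all hypothetical. Stemming and lemmatizing are left out because they would typically rely on a library such as NLTK.

```python
import re
import unicodedata

# Tiny illustrative stop-word set (an assumption); a real run would use a
# fuller list, for example the one shipped with NLTK.
STOP_WORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "for", "this"}


def basic_clean(text):
    """Lowercase, strip accents and non-ASCII, and drop special characters."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop stop words from whitespace-tokenized text."""
    return " ".join(word for word in text.split() if word not in stop_words)


def ascii_ratio(text):
    """Share of ASCII characters; a crude proxy for English-language text."""
    if not text:
        return 0.0
    return sum(ch.isascii() for ch in text) / len(text)


# Made-up README text for illustration only.
readme = "# Demo Repo\n\nThis is an example of the README for a café app!"
cleaned = remove_stop_words(basic_clean(readme))
print(cleaned)
print(ascii_ratio(readme))
```

A repository could then be flagged as likely non-English when `ascii_ratio` falls below some cutoff; any such cutoff would need to be tuned against the actual data.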