# Predicting Programming Languages using NLP

### Executive Summary

- Our goal is to create a classification model that predicts a repository's programming language from the README content of GitHub repositories.
    - This can help users find relevant content based on their programming language criteria.
    

### Key Takeaways

- The most common programming language in our dataset is JavaScript, followed by Python.
- Some of the most common words in the READMEs were 'file', 'end', 'class', 'use' and 'object'.
- The length of the READMEs varies by programming language.
- Our best model used  to predict programming languages with % accuracy. This model outperformed our baseline score of % accuracy, so it has value.

### Project Overview

- A Trello board was used to track the tasks for this project. You can find the board <a href="https://trello.com/b/PddXdOTJ/nlp-project">here</a>.
- Python scripts were used to acquire, prepare and explore the data.
- Statistical analyses tested the following hypotheses:
    1. 
 
### Data Dictionary

The data dictionary detailing all variables used in this analysis can be found <a href="https://github.com/mariam-and-cindy/predicting-programming-languages/blob/main/README.md">here</a>.

In [1]:
# required imports

from requests import get
from bs4 import BeautifulSoup
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from textblob import TextBlob

import prepare as pr
import unicodedata
import re
import json

# Acquire Data

Our data was scraped from 400 GitHub repositories. We decided to use the list of most forked repos on GitHub <a href="https://github.com/search?o=desc&p={i}&q=stars%3A%3E1&s=forks&type=Repositories">here</a> for our dataset. 

The list of repositories was cached as a CSV after acquisition. Using our acquire script, we pulled the repository name (username and title), language and README contents of every repository into a JSON file. We then converted the JSON file to a CSV, which is read into a pandas DataFrame.
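
The `{i}` in the search link above is a page-number placeholder. As a minimal sketch of how the page URLs could be generated, assuming 10 results per search page (so 40 pages for 400 repositories); the helper name `search_page_urls` and the page size are assumptions for illustration and are not taken from our acquire script:

```python
# URL template from the search link above; {i} is the page number
SEARCH_URL = ('https://github.com/search?o=desc&p={i}&q=stars%3A%3E1'
              '&s=forks&type=Repositories')


def search_page_urls(n_repos=400, per_page=10):
    '''
    hypothetical helper: returns one search URL per results page
    '''
    n_pages = -(-n_repos // per_page)  # ceiling division
    return [SEARCH_URL.format(i=i) for i in range(1, n_pages + 1)]


urls = search_page_urls()
print(len(urls))
print(urls[0])
```

Each URL can then be requested and parsed in turn to collect the repository names.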


In [29]:
# read in the json file as df
repo_json_file = 'data2.json'
df = pd.read_json(repo_json_file)

In [43]:
# convert to csv
df.to_csv('git_data.csv')

In [4]:
df = pd.read_csv('git_data.csv', index_col=0)

In [5]:
# quick look at df
df.head()

| | repo | language | readme_contents |
|---|---|---|---|
| 0 | jtleek/datasharing | NaN | `How to share data with a statistician\n=======...` |
| 1 | rdpeng/ProgrammingAssignment2 | R | `### Introduction\n\nThis second programming as...` |
| 2 | octocat/Spoon-Knife | HTML | `### Well hello there!\n\nThis repository is me...` |
| 3 | tensorflow/tensorflow | C++ | `<div align="center">\n <img src="https://www....` |
| 4 | SmartThingsCommunity/SmartThingsPublic | Groovy | `# SmartThings Public GitHub Repo\n\nAn officia...` |


In [6]:
# check for missing values
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 400 entries, 0 to 399
Data columns (total 3 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   repo             400 non-null    object
 1   language         344 non-null    object
 2   readme_contents  400 non-null    object
dtypes: object(3)
memory usage: 12.5+ KB


## Takeaways

- The dataset has 400 scraped repositories.
- 56 repositories are missing the language (only 344 of the 400 values are non-null).
    - This could be because no primary programming language was obvious.
    - We will drop these rows during data preparation.
- All variables are object dtypes.
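
The missing-value count and the planned drop can be sketched on a toy DataFrame. The three rows below are made up for illustration and only mirror the column names of our data; this is a sketch of the idea, not the code in the prepare script:

```python
import pandas as pd

# toy stand-in for the scraped data (the real df has 400 rows)
toy = pd.DataFrame({
    'repo': ['a/one', 'b/two', 'c/three'],
    'language': [None, 'R', 'HTML'],
    'readme_contents': ['text', 'text', 'text'],
})

# count the rows without a language, then drop them
n_missing = toy.language.isna().sum()
toy_clean = toy.dropna(subset=['language']).reset_index(drop=True)

print(n_missing, len(toy_clean))
```

Resetting the index after the drop keeps the row labels contiguous, which matters for any later code that looks rows up by position.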

# Prepare Data

During this stage of the pipeline, we will work on cleaning and preparing the data for exploration and modeling. The preparation functions are part of the prepare script, which will be imported.

The following steps will be performed to create the best performing model:
- cleaning content to remove any special characters and certain words
- removing stop words
- lemmatizing content
- stemming content
- removing repos that have non-English content
- dropping rows with missing values
- creating new columns
    - cleaned
    - stemmed
    - lemmatized 
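
As an illustration of the stop word step listed above, here is a minimal sketch that uses only the standard library. The helper name `remove_stopwords`, its parameters and the tiny stop word set are assumptions for this example; the functions in our prepare script may differ, and a real run would use a full stop word list:

```python
# tiny illustrative stop word list (assumption); a real run would use a full list
STOPWORDS = {'a', 'an', 'and', 'the', 'is', 'to', 'of', 'in', 'this'}


def remove_stopwords(string, extra_words=(), exclude_words=()):
    '''
    hypothetical helper: drops stop words from an already cleaned string.
    extra_words are added to the stop word list, exclude_words are removed from it
    '''
    stopwords = (STOPWORDS | set(extra_words)) - set(exclude_words)
    words = [word for word in string.split() if word not in stopwords]
    return ' '.join(words)


cleaned = 'this is the readme of a python project'
print(remove_stopwords(cleaned))  # readme python project
```

The `extra_words` argument leaves room for dropping domain-specific words that turn out to be uninformative during exploration.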

In [41]:
text = '互联网 Java 工程师进阶知识完全扫盲'
lang = TextBlob(text)

In [31]:
def basic_clean(string):
    '''
    takes in a string, lowercases everything, normalizes unicode characters, removes anything that is not a letter,
    number or whitespace, and drops words containing 'http', 'github' or 'html'.
    returns a clean string
    '''
    
    string = string.lower()
    # NFKC keeps accented letters composed, so the ASCII encode below drops them
    string = unicodedata.normalize('NFKC', string)\
    .encode('ascii', 'ignore')\
    .decode('utf-8')
    # punctuation is stripped first, so a URL collapses into one token that the next patterns remove
    string = re.sub(r"[^a-z0-9\s]", '', string)
    string = re.sub(r'\w*http\w*', '', string)
    string = re.sub(r'\w*github\w*', '', string)
    string = re.sub(r'\w*html\w*', '', string)
    return string
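
The choice of normalization form changes what survives the ASCII encoding step. The sketch below compares `NFKC` (used in `basic_clean`) with `NFKD` on a made-up sample string; the helper `to_ascii` is hypothetical and exists only for this comparison:

```python
import unicodedata


def to_ascii(text, form):
    '''
    hypothetical helper: normalize with the given form, then drop non-ASCII characters
    '''
    return unicodedata.normalize(form, text)\
        .encode('ascii', 'ignore')\
        .decode('utf-8')


sample = 'Caf\u00e9 r\u00e9sum\u00e9'  # precomposed accented letters

print(to_ascii(sample, 'NFKC'))  # Caf rsum    (accented letters dropped)
print(to_ascii(sample, 'NFKD'))  # Cafe resume (accents stripped, base letters kept)
```

With `NFKD` the accent is split from its base letter before encoding, so only the accent is discarded. Which behavior is preferable depends on how much accented text the READMEs contain.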

In [38]:
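def remove_nonenglish(df):
    '''
    takes in a df with a readme_contents column and checks whether each readme is in English;
    rows whose readme is not English are removed
    '''
    # detect on the raw text: basic_clean would strip the non-ASCII characters the detection relies on
    # note: detect_language() calls an online service and was removed from newer TextBlob releases
    is_english = df.readme_contents.apply(
        lambda text: TextBlob(text).detect_language() == 'en'
    )
    return df[is_english].reset_index(drop=True)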
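
Because `detect_language()` depends on an online service, an offline check can be useful as a rough fallback. The sketch below is a swapped-in heuristic, not the method used in this project: it only measures the share of ASCII letters, so it can flag READMEs written in non-Latin scripts but cannot tell English from other Latin-script languages. The function name and the 0.5 threshold are assumptions:

```python
def mostly_ascii_letters(text, threshold=0.5):
    '''
    hypothetical heuristic: True when at least `threshold` of the alphabetic
    characters in text are ASCII letters
    '''
    letters = [char for char in text if char.isalpha()]
    if not letters:
        return False
    ascii_letters = [char for char in letters if char.isascii()]
    return len(ascii_letters) / len(letters) >= threshold


print(mostly_ascii_letters('This repository is meant for practice'))
print(mostly_ascii_letters('互联网 Java 工程师进阶知识完全扫盲'))
```

A check like this could be applied to `df.readme_contents` to build a boolean mask, in the same way `remove_nonenglish` does.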

In [37]:
# print every row index, flagging the readmes that are not detected as English
for n in range(0, len(df)):
    text = df.readme_contents[n]
    lang = TextBlob(text)
    print(n)
    if lang.detect_language() != 'en':
        print('not English', n)

(Output condensed: the loop printed row indices 0 through 194; only the rows flagged as not English are listed.)

not English 16
not English 19
not English 37
not English 43
not English 76
not English 90
not English 108
not English 115
not English 124
not English 131
not English 135
not English 138
not English 143
not English 151
not English 152
not English 156
not English 166
not English 171
not English 174
not English 185
not English 193



| | repo | language | readme_contents |
|---|---|---|---|
| 220 | PHPMailer/PHPMailer | PHP | `![PHPMailer](https://raw.github.com/PHPMailer/...` |
| 253 | education/GitHubGraduation-2021 | JavaScript | `## Updates\n\n### May 27, 2021\nAnd that’s a w...` |
| 106 | MarlinFirmware/Marlin | C++ | `# Marlin 3D Printer Firmware\n\n![GitHub](http...` |
| 39 | PanJiaChen/vue-element-admin | Vue | `<p align="center">\n <img width="320" src="ht...` |
| 196 | Homebrew/homebrew-core | Ruby | `# Homebrew Core\n\nCore formulae for the Homeb...` |
| ... | ... | ... | ... |
| 287 | pytorch/examples | Python | `# PyTorch Examples\n![Run Examples](https://gi...` |
| 101 | doocs/advanced-java | Java | `# 互联网 Java 工程师进阶知识完全扫盲\n\n[![stars](https://im...` |
| 390 | vnpy/vnpy | C++ | `# By Traders, For Traders.\n\n<p align="center...` |
| 192 | springframeworkguru/spring5webapp | Java | `# Spring Framework 5: Beginner to Guru\n\nThis...` |


# Explore Data

We used our train split to explore the data.