# Git Shoes

## Project Goals

- Determine the programming language used in a given GitHub Repository by examining the words used in the readme file.

- We homed in specifically on repositories with the word 'shoes' in the name.

## Imports Used

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import nltk
import re
import os
import json
import requests
import unicodedata

from bs4 import BeautifulSoup

from pprint import pprint
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, plot_confusion_matrix, recall_score

from time import strftime

from sklearn.model_selection import train_test_split

from wordcloud import WordCloud

from env import get_connection
import acquire as ac
import prepare as prep
import explore_functions as ex
import modeling as md
from acquire import scrape_github_data

import warnings
warnings.filterwarnings("ignore")

seed = 42

# Acquisition

<div class="alert alert-block alert-info">

- We obtained our data through the GitHub API, using the functions in the acquire.py file.

- We started with 198 rows in our data, with each row being an individual repository.
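
As a rough illustration of the kind of request involved, the sketch below fetches one repository's readme through GitHub's REST API. It is not the code in acquire.py: the helper names (`readme_url`, `decode_readme`, `fetch_readme`) and the optional `token` parameter are assumptions made for this example, and it uses the standard library rather than `requests` to stay self-contained.

```python
import base64
import json
import urllib.error
import urllib.request

API_ROOT = "https://api.github.com"


def readme_url(repo):
    """Build the REST endpoint for a repository's readme, e.g. 'shoes/shoes4'."""
    return f"{API_ROOT}/repos/{repo}/readme"


def decode_readme(payload):
    """Decode the base64-encoded `content` field of a readme response."""
    raw = base64.b64decode(payload["content"])
    return raw.decode("utf-8", errors="replace")


def fetch_readme(repo, token=None):
    """Return the readme text for `repo`, or None if the repository has no readme."""
    headers = {
        "Accept": "application/vnd.github+json",
        "User-Agent": "git-shoes-example",  # hypothetical client name
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(readme_url(repo), headers=headers)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return decode_readme(json.load(response))
    except urllib.error.HTTPError as error:
        if error.code == 404:  # GitHub answers 404 when there is no readme
            return None
        raise
```

Calling `fetch_readme('shoes/shoes4')` needs network access, and unauthenticated requests are rate limited, so passing a token is worthwhile when pulling a few hundred repositories.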

In [7]:
# pulling the data from the json file
df = pd.read_json('repos.json')

df.head()

| | repo | language | readme_contents |
|---|---|---|---|
| 0 | justinmk/vim-sneak | Vim Script | `sneak.vim 👟\n================\n\nJump to any l...` |
| 1 | google-research-datasets/Objectron | Jupyter Notebook | `\n<div align="center">\n\n# Objectron Dataset\...` |
| 2 | shoes/shoes4 | Ruby | `# Shoes 4 [![Linux Build Status](https://secur...` |
| 3 | shoes/shoes-deprecated | C | `# THIS REPO IS NO LONGER ACTIVE!\n\n**Looking ...` |
| 4 | filamentgroup/shoestring | JavaScript | `:warning: This project is archived and the rep...` |


# Preparation

<div class="alert alert-block alert-info">

In order to make sense of our data, we had to clean it up quite a bit:

- Two repositories had nothing to do with our chosen topic, shoes, so we dropped them.

- Another 13 rows had no readme file, so those were also dropped.

- Next, we got a value count of the languages we had, and consolidated any language that appeared fewer than 10 times into one category, 'other'.

- After that, we created a series of words for each unique language and got a value count for each.

- After creating those variables for later exploration, we cleaned up our readme entries by normalizing the text to remove accents and other unhelpful symbols, removing stopwords, and splitting the text into individual tokens to make it easier to work with later.

- Lastly, because we were working with a manageable amount of data, we both stemmed and lemmatized so that we could determine which works better in our models.
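
The text-cleaning steps above can be sketched roughly as follows. This is not the code in prepare.py: `clean_readme` and its helpers are hypothetical names, and the tiny `STOPWORDS` set only stands in for a full stopword list such as NLTK's. Stemming and lemmatizing are left out to keep the sketch free of extra dependencies.

```python
import re
import unicodedata

# Illustrative stand-in for a real stopword list (assumption, not the project's list)
STOPWORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "for", "this"}


def basic_clean(text):
    """Lowercase, strip accents, and replace anything that is not a letter, digit, or space."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("utf-8")
    return re.sub(r"[^a-z0-9\s]", " ", text)


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens that appear in the stopword set."""
    return [token for token in tokens if token not in stopwords]


def clean_readme(text):
    """Normalize the text, split it on whitespace, and remove stopwords."""
    return remove_stopwords(basic_clean(text).split())


tokens = clean_readme("Café shoes: the BEST in Zürich!")
# tokens == ['cafe', 'shoes', 'best', 'zurich']
```

Splitting on whitespace is the simplest possible tokenizer; a dedicated tokenizer would treat punctuation and contractions more carefully.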

In [None]:
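def clean_df(df):

    # drop the two repositories unrelated to shoes (index labels 29 and 139)
    df = df.drop([29, 139])

    # drop rows with missing values, such as repositories with no readme
    df = df.dropna()

    # languages that appeared fewer than 10 times are consolidated into 'Other'
    others = ['TypeScript', 'Jupyter Notebook', 'Java', 'C#', 'Swift', 'CSS', 'C', 'C++', 'Kotlin',
         'VimL', 'Handlebars', 'Vue', 'Go', 'SCSS', 'Emacs Lisp', 'Vim Script', 'Lua', 'TeX',
         'Rust', 'Shell', 'PHP', 'Vim script', 'CoffeeScript']

    # replace only within the language column so repo names and readme text are untouched
    df['language'] = df['language'].replace(to_replace=others, value="Other")

    # lowercase the labels; later code must compare against 'other', 'javascript', etc.
    df['language'] = df['language'].str.lower()

    return df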

In [9]:
df = prep.clean_df(df)
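
As an alternative to a hard-coded list, the rare languages could be derived from the value counts themselves. The sketch below assumes a `language` column as in the frame above; the function name `consolidate_rare_languages` and its `min_count` and `other_label` parameters are hypothetical. Lowercasing before counting also merges spellings such as 'Vim Script' and 'Vim script'.

```python
import pandas as pd


def consolidate_rare_languages(df, min_count=10, other_label="other"):
    """Lowercase language labels, then relabel languages seen fewer than min_count times."""
    out = df.copy()
    out["language"] = out["language"].str.lower()
    counts = out["language"].value_counts()
    rare = counts[counts < min_count].index
    # keep the label where the language is common, otherwise use other_label
    out["language"] = out["language"].where(~out["language"].isin(rare), other_label)
    return out


example = pd.DataFrame(
    {"language": ["Ruby", "Ruby", "Ruby", "Vim Script", "Vim script", "Go"]}
)
consolidated = consolidate_rare_languages(example, min_count=2)
# 'go' appears once, so it becomes 'other'; the two Vim spellings are merged and kept
```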

In [None]:
def word_counts(df):

    # clean_df lowercases the language column, so the labels compared here must be
    # lowercase too; capitalized labels such as 'Other' or 'JavaScript' match no rows
    # and leave every per-language count at 0
    other_words = prep.clean_text(' '.join(df[df['language'] == 'other']['readme_contents']))
    javascript_words = prep.clean_text(' '.join(df[df['language'] == 'javascript']['readme_contents']))
    html_words = prep.clean_text(' '.join(df[df['language'] == 'html']['readme_contents']))
    dart_words = prep.clean_text(' '.join(df[df['language'] == 'dart']['readme_contents']))
    ruby_words = prep.clean_text(' '.join(df[df['language'] == 'ruby']['readme_contents']))
    python_words = prep.clean_text(' '.join(df[df['language'] == 'python']['readme_contents']))
    all_words = prep.clean_text(' '.join(df['readme_contents']))

    # count how often each word appears per language and overall
    other_counts = pd.Series(other_words).value_counts()
    javascript_counts = pd.Series(javascript_words).value_counts()
    html_counts = pd.Series(html_words).value_counts()
    dart_counts = pd.Series(dart_words).value_counts()
    ruby_counts = pd.Series(ruby_words).value_counts()
    python_counts = pd.Series(python_words).value_counts()
    all_counts = pd.Series(all_words).value_counts()

    # combine the counts into one frame with a column per language; naming the columns
    # through `keys` avoids relying on the default column labels
    word_freq = pd.concat([other_counts, javascript_counts, html_counts, dart_counts,
                       ruby_counts, python_counts, all_counts], axis=1,
                       keys=['other', 'javascript', 'html', 'dart', 'ruby', 'python', 'all_counts'])

    # words missing from a language get a count of 0
    word_freq = word_freq.fillna(0).astype('int')

    return word_freq

In [30]:
prep.word_counts(df).head()

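| | other | javascript | html | dart | ruby | python | all_counts |
|---|---|---|---|---|---|---|---|
| product | 0 | 0 | 0 | 0 | 0 | 0 | 279 |
| user | 0 | 0 | 0 | 0 | 0 | 0 | 260 |
| file | 0 | 0 | 0 | 0 | 0 | 0 | 224 |
| page | 0 | 0 | 0 | 0 | 0 | 0 | 209 |
| install | 0 | 0 | 0 | 0 | 0 | 0 | 207 |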


In [31]:
prep.prep_article_data(df, 'readme_contents')

NameError: name 'prep_article_data' is not defined

# Exploration

# Modeling

In [None]:
train, validate, test = split_train_test(df)

train.shape, validate.shape, test.shape
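
For reference, a three-way split like the one above could look roughly like this. The function name `split_three_ways`, the 60/20/20 proportions, and stratifying on the target are all assumptions for illustration; the project's own split function may differ.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, target, seed=42):
    """Split df into train (60%), validate (20%), and test (20%), stratified on target."""
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # 25% of the remaining 80% is 20% of the original data
    train, validate = train_test_split(
        train_validate, test_size=0.25, random_state=seed, stratify=train_validate[target]
    )
    return train, validate, test


example = pd.DataFrame(
    {
        "language": ["javascript"] * 50 + ["ruby"] * 50,
        "readme_contents": [f"readme {i}" for i in range(100)],
    }
)
train, validate, test = split_three_ways(example, "language")
```

Stratifying keeps the language proportions similar across the three sets, which matters when some classes are small. It does require every class to have enough rows to appear in each split.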

In [None]:
X_train, y_train, X_validate, y_validate, X_test, y_test = xy_train(train, validate, test, 'language')

In [None]:
X_train.head(1)

In [None]:
# creating our baseline accuracy: always predict the most common language
df['baseline'] = df['language'].value_counts().idxmax()

print((df['language'] == df['baseline']).mean())
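
A variant worth considering is to pick the most common language from the training labels only and score it on a held-out set, so the baseline is measured the same way as the models. The helper name `baseline_accuracy` below is hypothetical, and the labels are made up for the example.

```python
import pandas as pd


def baseline_accuracy(y_train, y_eval):
    """Return the most common training label and its accuracy on y_eval."""
    most_common = y_train.value_counts().idxmax()
    accuracy = float((y_eval == most_common).mean())
    return most_common, accuracy


y_train_example = pd.Series(["javascript", "javascript", "javascript", "ruby"])
y_validate_example = pd.Series(["javascript", "ruby", "ruby", "javascript"])
label, accuracy = baseline_accuracy(y_train_example, y_validate_example)
# label == 'javascript', accuracy == 0.5
```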

In [None]:
md.random_forest(X_train['stemmed'], y_train, X_validate['stemmed'], y_validate)

In [None]:
md.random_forest(X_train['lemmatized'], y_train, X_validate['lemmatized'], y_validate)

In [None]:
md.dec_tree(X_train['stemmed'], y_train, X_validate['stemmed'], y_validate)

In [None]:
md.dec_tree(X_train['lemmatized'], y_train, X_validate['lemmatized'], y_validate)

In [None]:
md.knn(X_train['stemmed'], y_train, X_validate['stemmed'], y_validate)

In [None]:
md.knn(X_train['lemmatized'], y_train, X_validate['lemmatized'], y_validate)
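
The model calls above all take raw text columns, so each one presumably vectorizes the text before fitting. A minimal sketch of that pattern follows, assuming TF-IDF features; `fit_and_score` is a hypothetical helper and is not the code in modeling.py. The key point is that the vectorizer is fit on the training text only and then reused to transform the validation text.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score


def fit_and_score(model, X_train_text, y_train, X_validate_text, y_validate):
    """Vectorize the text with TF-IDF, fit the model, and return train/validate accuracy."""
    vectorizer = TfidfVectorizer()
    X_train_tfidf = vectorizer.fit_transform(X_train_text)   # learn vocabulary on train only
    X_validate_tfidf = vectorizer.transform(X_validate_text)  # reuse the train vocabulary
    model.fit(X_train_tfidf, y_train)
    return {
        "train_accuracy": accuracy_score(y_train, model.predict(X_train_tfidf)),
        "validate_accuracy": accuracy_score(y_validate, model.predict(X_validate_tfidf)),
    }


# Made-up readme snippets, for illustration only
train_text = ["gem install ruby rails", "ruby gem bundle", "npm install node react", "node npm package"]
train_labels = ["ruby", "ruby", "javascript", "javascript"]
validate_text = ["bundle gem ruby", "npm react node"]
validate_labels = ["ruby", "javascript"]

scores = fit_and_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    train_text, train_labels, validate_text, validate_labels,
)
```

Any scikit-learn classifier can be passed as `model`, so the same helper would cover the decision tree and KNN runs. Comparing train and validate accuracy side by side makes overfitting easier to spot.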

In [None]:
test_accuracy(X_train['stemmed'], y_train, X_validate['stemmed'], y_validate, X_test['stemmed'], y_test)

## Modeling Conclusions

## Summary 

## Recommendations

## Next Steps