<img src="https://cdn.vox-cdn.com/thumbor/Z9YA9yAEq8x3-NL660dkqQxNPAM=/0x0:1980x1320/1200x800/filters:focal(832x502:1148x818)/cdn.vox-cdn.com/uploads/chorus_image/image/59943837/microsoftgithublove.0.jpg" alt="mcrsft" class="bg-primary mb-1" width="500px" align="right">

# NLP Modeling - Team Project
### Using data from Microsoft's GitHub Page

By:
- Jason Tellez
- Jeff Akins
- Veronica Reyes
- Jacob Paxton

## Project Goal
The goal of this project was to build a model that predicts the primary programming language of a GitHub repository from the text of its README file. To do this, we first had to decide which repos to acquire and how many. Microsoft hosts a large number of repos on GitHub in a wide variety of programming languages, so we pulled the README and primary language from its repos. This required a variety of web scraping and Natural Language Processing tools as well as GitHub's API. In the end, we acquired 1,500 READMEs along with their associated primary programming language from Microsoft's GitHub page. The results are contained in this notebook as well as in a presentation slide deck.


## Executive Summary
After acquiring and exploring the collected READMEs, we determined that the most common programming language in Microsoft's repos was TypeScript. We therefore used classification modeling to predict whether or not a repo used TypeScript, based on features from its README. We were able to predict with 84% accuracy whether a repo used TypeScript, based on the types of words and word length of its README.
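
For readers less familiar with the metric, the sketch below shows how accuracy on a binary target such as `is_TypeScript` is computed, and how it compares with always guessing the most common class. The labels are made up for illustration and are not project data; the function names are hypothetical.

```python
from typing import Sequence


def accuracy(actual: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Share of rows where the prediction matches the actual label."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def majority_baseline(actual: Sequence[bool]) -> float:
    """Accuracy obtained by predicting the most common class for every row."""
    labels = list(actual)
    most_common = max(set(labels), key=labels.count)
    return accuracy(labels, [most_common] * len(labels))


# Illustrative labels only (True means the repo's language is TypeScript)
actual = [True, False, False, False, True, False, False, False, False, True]
predicted = [True, False, False, False, False, False, False, True, False, True]

model_score = accuracy(actual, predicted)
baseline_score = majority_baseline(actual)
print(f"model: {model_score:.0%}, baseline: {baseline_score:.0%}")
```

A model is only useful if its accuracy clearly exceeds the majority-class figure.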

### How to Recreate:
There are two methods that you can use to recreate this project.
1. **Quick Method:** Utilize our final .json file with the cleaned README files. This is the simplest method and will produce the same results that we were able to achieve.
2. **Long Method:** Start from scratch using the same functions that we used. This will pull in the most recent repos from Microsoft's GitHub page and will therefore produce slightly different results from ours. This method also takes longer; for example, it took us nearly 30 minutes to download the data from the 1,500 repos.

##### Imports

In [2]:
# For web scraping and NLP
import requests
from bs4 import BeautifulSoup
from typing import Dict, List, Optional, Union, cast

# For Timestamps
import time
from time import strftime

import pandas as pd

import json
import wrangle as wr

# Follow the instructions in the acquire file to create your env file if needed
from env import github_token, github_username

## Acquire

##### Using the Quick Method:
1. Download this [file](https://drive.google.com/file/d/1aec5UqivmWouJ0DqFM-3Nn3yE1Bd-E7f/view)
2. Save the file to the same folder as this notebook. 
3. Run the next cell.

In [5]:
df = pd.read_json('cleaned_readmes.json')

##### Using the Long Method
1. Create an env.py file containing a GitHub personal access token.
    - Go here and generate a personal access [token](https://github.com/settings/tokens)
    - You do _not_ need to select any scopes, i.e. leave all the checkboxes unchecked
    - Save it in your env.py file under the variable `github_token`
    - Add your GitHub username to your env.py file under the variable `github_username`
2. Uncomment and run the functions, following the instructions in the cell below:

In [6]:
# long method:
# df = wr.get_repo_links()
# df.to_csv('microsoft_repo_list.csv')
# -- Run the acquire.py file in your terminal using: python acquire.py
# -- Once it is finished, run the next function to clean the data:
# df = wr.wrangle()
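
As a rough illustration of what the acquisition step involves, the sketch below shows one way to request a repo's primary language and README from GitHub's REST API using the token and username from env.py. It is not the project's acquire.py: the helper names are hypothetical, and decoding the base64 `content` field of the README endpoint is an assumption about how the text could be retrieved.

```python
import base64
from typing import Dict

API_ROOT = "https://api.github.com"


def build_headers(token: str, username: str) -> Dict[str, str]:
    """Headers for authenticated requests; GitHub requires a User-Agent."""
    return {"Authorization": f"token {token}", "User-Agent": username}


def repo_urls(repo: str) -> Dict[str, str]:
    """API endpoints for a repo given as 'owner/name'."""
    return {
        "info": f"{API_ROOT}/repos/{repo}",
        "readme": f"{API_ROOT}/repos/{repo}/readme",
    }


def decode_readme(payload: dict) -> str:
    """The README endpoint returns its text base64-encoded under 'content'."""
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


def fetch_repo(repo: str, token: str, username: str) -> Dict[str, str]:
    """Fetch primary language and README text for one repo (needs network)."""
    import requests  # imported here so the helpers above work without it

    headers = build_headers(token, username)
    urls = repo_urls(repo)
    info = requests.get(urls["info"], headers=headers, timeout=30)
    info.raise_for_status()
    readme = requests.get(urls["readme"], headers=headers, timeout=30)
    readme.raise_for_status()
    return {
        "repo": repo,
        "language": info.json().get("language"),
        "readme_contents": decode_readme(readme.json()),
    }
```

With a valid token, `fetch_repo("microsoft/fast", github_token, github_username)` would return a dictionary with the repo name, its primary language, and the README text.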

## Preparation
The following steps were taken to clean our data:
1. Dropped rows with null values in the 'language' column
2. Reset the index
3. Created a series consisting of normalized string values and combined the series with the dataframe
4. Created normalized, lemmatized strings with no stopwords from the 'clean' column
5. Dropped rows that had null values in the 'content' column and reset the index
6. Created a word count and a character count column 
7. Created a target column that shows whether repo language is TypeScript or not
8. Dropped the original readme contents column

This created the following results:

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1370 entries, 0 to 1369
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   repo                1370 non-null   object
 1   language            1370 non-null   object
 2   clean               1370 non-null   object
 3   lemma_no_stopwords  1370 non-null   object
 4   clean_word_count    1370 non-null   int64 
 5   readme_char_count   1370 non-null   int64 
 6   is_TypeScript       1370 non-null   bool  
dtypes: bool(1), int64(2), object(4)
memory usage: 76.3+ KB


In [8]:
df.head()

| | repo | language | clean | lemma_no_stopwords | clean_word_count | readme_char_count | is_TypeScript |
|---|---|---|---|---|---|---|---|
| 0 | microsoft/react-native-windows | C++ | react native for windows build native windows ... | react native window native window apps react h... | 536 | 4288 | False |
| 1 | microsoft/fast | TypeScript | fastbannergithub914pnghttpsstaticfastdesignass... | fastbannergithub914pnghttpsstaticfastdesignass... | 981 | 8539 | True |
| 2 | microsoft/Application-Insights-Workbooks | JSON | azure monitor workbook templates build statush... | azure monitor workbook template statushttpsgit... | 385 | 3411 | False |
| 3 | microsoft/gctoolkit | Java | microsoft gctoolkit gctoolkit is a set of libr... | microsoft gctoolkit gctoolkit set library anal... | 349 | 2815 | False |
| 4 | microsoft/winget-cli-restsource | C# | welcome to the wingetclirestsource repository ... | welcome wingetclirestsource repository buildin... | 780 | 5557 | False |
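
The preparation steps listed above can be sketched roughly as follows, using only the standard library. This is a simplified illustration, not the code in wrangle.py: the stopword list is a tiny hypothetical stand-in, lemmatization is left out, the `readme_contents` key is an assumed name for the raw README column, and counting characters on the raw text is also an assumption.

```python
import re
import unicodedata
from typing import Dict, Set

# Tiny illustrative stopword list (hypothetical; a real one is much longer)
STOPWORDS: Set[str] = {"a", "an", "the", "is", "of", "to", "for", "and", "in"}


def basic_clean(text: str) -> str:
    """Lowercase, drop non-ASCII characters and punctuation, squeeze spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = text.encode("ascii", "ignore").decode("utf-8")
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())


def remove_stopwords(text: str, stopwords: Set[str] = STOPWORDS) -> str:
    """Keep only the words that are not in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)


def add_features(record: Dict[str, str]) -> Dict[str, object]:
    """Build the cleaned text, count columns, and the boolean target."""
    raw = record["readme_contents"]
    clean = basic_clean(raw)
    return {
        "repo": record["repo"],
        "language": record["language"],
        "clean": clean,
        "no_stopwords": remove_stopwords(clean),
        "clean_word_count": len(clean.split()),
        "readme_char_count": len(raw),
        "is_TypeScript": record["language"] == "TypeScript",
    }


example = add_features(
    {
        "repo": "example/repo",  # made-up record for illustration
        "language": "TypeScript",
        "readme_contents": "A Toolkit for the Web!",
    }
)
print(example)
```

Removing punctuation without inserting a space is why URLs in the `clean` column above collapse into long single tokens.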


## Exploration and Pre-processing

## Modeling