
## PREDICTING PROGRAMMING LANGUAGE WITHIN LINUX README REPOS

by Andrew Rachuig, Stephen Fitzsimon and Jennifer Eyring

_______________________________

<b>Introduction Notes:</b> This NLP project uses web scraping to collect GitHub README text and build a predictive logistic regression model.
<br>
<br>
Our team scraped the 3,300 most-forked Linux GitHub repositories (as of July 20, 2022) to determine which programming languages are used most in these projects.
<br>
<br>
After identifying the language used by each GitHub repo, we took the text of each repo's README to see whether we could build a model that predicts the programming language based solely on the README content.
<br>
<br>
### <b>Audience Notes about the Data:</b> 
Our data came from web scraping the most-forked Linux GitHub repositories. We searched for Linux and then also pulled the content for three common Linux flavors: Arch, Debian and Ubuntu. Step-by-step instructions for replicating this project are in the repo's README.


__________________

## Initial Questions when starting this project:

> - How many & what are the unique words to each specific programming language?<br><br>
> - Are there any bigrams/trigrams that are specific to certain programming languages?<br><br>
> - Are there differences in words/phrases between Linux flavors, specifically Ubuntu, Debian and Archlinux?<br><br>
> - Do certain programming languages have larger README sections than others? And if so, which ones?<br><br>
> - Among the Linux flavors (Debian, Arch and Ubuntu), are there differences in README lengths? (i.e., does one flavor seem to need more detail or explanation than the others?)

__________________________

# Project Goals:

> - Utilize Codeup's webscraping function and apply it to our project's parameters of obtaining the top-forked Linux repositories.<br><br>
> - Determine any commonalities/differences between programming languages and the README sections of the repositories.<br><br>
> - Create a classification model that can predict what programming language is used, solely based on the README content/words.

# Executive Summary:

___________________________________________

<h1>ABOUT THE DATA:</h1>

Our team collected the 3,300 most-forked GitHub repos across three Linux flavors: Archlinux, Ubuntu, and Debian. We pulled 1,100 repos for each flavor and then combined the scraped repositories into one large dataframe.

### Key notes:
- 3,300 repositories were collected.
- 2,805 were used in this model after cleaning and normalizing the data.
- <b>Top 3 programming languages used in all the repos:</b>
    - 1) Shell
    - 2) Python
    - 3) C


<div class="alert alert-info">
Size of data:<br>
<b>Pre-Clean/Normalize: 3,300 rows | 3 columns</b>

<b>After cleaning & normalizing: 2,805 rows | 6 columns</b>
</div>

## Wrangle Process:

#### Measures taken to clean and normalize the data:

> 1) We dropped all nulls, since these corresponded to repos with no language defined.<br><br>
> 2) Using NLTK tools, we replaced any abnormal symbols and https-related phrases with single spaces in the `readme_contents` column.<br><br>
> 3) We tokenized the dataset on this same column.<br><br>
> 4) After cleaning/normalizing, we compared the proportions of the most common words across every programming language to decide which words to remove because they added little predictive information to the corpus.<br><br>
> 5) We lowercased all text within the README content.<br><br>
> 6) Finally, we stemmed all words to keep related key words/phrases in a common form.<br><br>
> 7) We added the following columns:
- `disto` : labels which repos were Arch, Ubuntu or Debian
- `clean_readme` : the cleaned text, kept for comparison with the originally collected text in `readme_contents`
- `length_of_readme`: counts how many unique words are in each repo.
- `readme_string`: the `clean_readme` tokens joined into a single string rather than a list.

______________________

## Exploring the main dataset:

#### Calling in the data:

In [1]:
#imports:

#tools for web scraping:
from requests import get
from bs4 import BeautifulSoup
import os
import pandas as pd
import numpy as np

#group imports
import env
import acquire
import constants_prepare as c

import json
from typing import Dict, List, Optional, Union, cast
import requests
import nltk

#visualizations:
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
#calling in master_df:
master_df = pd.read_csv('master_list.csv')

#handling nulls of rows that do not have languages mentioned:
master_df = c.drop_nulls(master_df)

#applying our clean/normalization function
master_df = c.adding_columns(master_df)
master_df.head()

### Average README Length by Language:

In [None]:
plt.figure(figsize=(18, 12))
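#averaging only the numeric length column (a bare .mean() fails on text columns in recent pandas versions):
avg_length = master_df.groupby('language')['length_of_readme'].mean().reset_index().sort_values('length_of_readme', ascending=False)
sns.barplot(data=avg_length, x='length_of_readme', y='language')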
plt.title('Average Readme length by Language')
plt.show()

_____________________________________

### TOP 10 COMMON LANGUAGES USED IN FULL CORPUS:

In [None]:
master_language_count = pd.concat([master_df.language.value_counts(), master_df.language.value_counts(normalize=True)], axis=1).head(10)
master_language_count

In [None]:
m_count = master_df.language.value_counts(normalize=True).head(10)

In [None]:
#creating a df of the percentages to prep for charts:
temp = pd.DataFrame({'language' : m_count.index, 'percentage': m_count.values})

In [None]:
#plotting out the percentages of the Top 10 languages used across all Linux repos:
plt.figure(figsize=(10,8))
sns.barplot(data=temp, x = 'language', y = 'percentage')
plt.title('Percentages of the Top 10 Linux Repo languages')

## Number of Unique Words in Corpus:

In [None]:
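#finding the count of unique words in the full corpus
#(split into words first; calling nunique on the README strings would count unique READMEs instead):
linux_corpus_series = pd.Series(' '.join(master_df.readme_string).split())
linux_corpus_series.nunique()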

In [None]:
linux_corpus_series

In [None]:
#joining every README into one string to find unique word counts:
linux_corpus = ' '.join(master_df['readme_string'])
linux_corpus[:100]
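
Counting unique words requires splitting the joined corpus into tokens first; a uniqueness count taken directly on a column of README strings measures distinct documents instead. The sketch below shows the difference on toy data using only the standard library. The `readme_strings` list is a made-up stand-in for the `readme_string` column, not real project data.

```python
from collections import Counter

# Toy stand-ins for the readme_string column (one cleaned string per repo).
readme_strings = [
    "instal packag ubuntu",
    "instal kernel modul",
    "instal packag ubuntu",   # a repeated README
]

# Distinct documents: what a uniqueness count on the raw strings measures.
unique_documents = len(set(readme_strings))

# Distinct words: join every README, split on whitespace, then count.
corpus_words = " ".join(readme_strings).split()
word_counts = Counter(corpus_words)
unique_words = len(word_counts)

print(unique_documents, unique_words)
```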

### Top Words & Bigrams found in Shell:

In [None]:
shell_words = master_df[master_df.language == 'Shell'].clean_readme
#one row per word, then count (assumes clean_readme holds a list of tokens per README):
shell_freq = shell_words.explode().value_counts()

In [None]:
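#flattening every Shell README into one series of words (assumes clean_readme holds a list of tokens per README):
shell_list = shell_words.explode().dropna().reset_index(drop=True)
shell_list.value_counts().nlargest(10)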

In [None]:
#taking a look at Shell words:
shell_list.describe()

#### Top Bigrams found in Shell:

In [None]:
#top bigrams of Shell language in Linux repos:
top_20_shell_bigrams = (pd.Series(nltk.ngrams(shell_list, 2))
                      .value_counts()
                      .head(20))

In [None]:
top_20_shell_bigrams

In [None]:
top_20_shell_bigrams.sort_values().plot.barh(color='#1E8FBF', width=.9, figsize=(10, 6))

plt.title('20 Most frequently occurring Shell bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

# make the labels pretty
ticks, _ = plt.yticks()
labels = top_20_shell_bigrams.reset_index().sort_index(ascending=False)['index'].apply(lambda t: t[0] + ' ' + t[1])
_ = plt.yticks(ticks, labels)
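
One thing to watch when building bigrams from a flattened word series is that the last word of one README gets paired with the first word of the next. A possible alternative, sketched here with the standard library on toy token lists, counts bigrams within each README separately. The `bigrams` helper and the sample tokens are hypothetical and only illustrate the idea.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent word pairs from one list of tokens."""
    return list(zip(tokens, tokens[1:]))


# Toy token lists standing in for the cleaned README of each repo.
readme_tokens = [
    ["sudo", "apt", "instal", "git"],
    ["sudo", "apt", "updat"],
    ["git", "clone", "repo"],
]

bigram_counts = Counter()
for tokens in readme_tokens:
    bigram_counts.update(bigrams(tokens))   # pairs never span two READMEs

print(bigram_counts.most_common(1))
```

With this approach a pair such as `('git', 'sudo')`, which would only arise from the boundary between the first and second README, never appears in the counts.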


### Top Words & Bigrams found in Python:

In [None]:
master_df.tail()

In [None]:
python_words = master_df[master_df.language == 'Python'].clean_readme
#one row per word, then count (assumes clean_readme holds a list of tokens per README):
python_freq = python_words.explode().value_counts()

In [None]:
#looking at where index starts?
python_words

In [None]:
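#flattening every Python README into one series of words (assumes clean_readme holds a list of tokens per README):
python_list = python_words.explode().dropna().reset_index(drop=True)
python_list.value_counts().nlargest(10)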

### Top Words and Bigrams found in C:

In [None]:
#taking all 'C' language repos from the clean_readme
C_words = master_df[master_df.language == 'C'].clean_readme
#finding the frequency of C words (assumes clean_readme holds a list of tokens per README)
C_freq = C_words.explode().value_counts()

In [None]:
#flattening every C README into one series of words (assumes clean_readme holds a list of tokens per README)
C_list = C_words.explode().dropna().reset_index(drop=True)
C_list.value_counts().nlargest(10)

In [None]:
#taking a look at C_words
C_list.describe()

#### Looking at Top 20 Bigrams of C language in repos:

In [None]:
#top bigrams of C language in Linux repos:
top_20_c_bigrams = (pd.Series(nltk.ngrams(C_list, 2))
                      .value_counts()
                      .head(20))

In [None]:
top_20_c_bigrams

In [None]:
top_20_c_bigrams.sort_values().plot.barh(color='#143995', width=.9, figsize=(10, 6))

plt.title('20 Most frequently occurring C bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')

# make the labels pretty
ticks, _ = plt.yticks()
labels = top_20_c_bigrams.reset_index().sort_index(ascending=False)['index'].apply(lambda t: t[0] + ' ' + t[1])
_ = plt.yticks(ticks, labels)


### Overall Takeaways:

_______________________________________

## Finding word frequencies per language:

In [None]:
shell_words

In [None]:
#combining frequencies into one dataframe (set_axis no longer accepts inplace in recent pandas versions):
word_counts = (pd.concat([shell_freq, python_freq, C_freq], axis=1, sort=True)
              .set_axis(['shell', 'python', 'C'], axis=1)
              .fillna(0)
              .astype(int))

In [None]:
word_counts
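
A table like this can help with the first initial question: which words are unique to a language? The sketch below shows one way to do it with the standard library on toy counts. The numbers in `language_counts` are invented for illustration and are not project results.

```python
from collections import Counter

# Toy per-language word counts (hypothetical values, not project results).
language_counts = {
    "shell": Counter({"bash": 5, "instal": 4, "script": 3}),
    "python": Counter({"pip": 6, "instal": 5, "script": 1}),
    "C": Counter({"make": 7, "kernel": 4, "instal": 2}),
}

# Every word seen in any language.
vocabulary = sorted(set().union(*language_counts.values()))

# One row per word with a count per language; a missing word counts as 0.
word_table = {
    word: {lang: counts[word] for lang, counts in language_counts.items()}
    for word in vocabulary
}

# A word is unique to a language when that is the only nonzero count in its row.
unique_to = {lang: [] for lang in language_counts}
for word, row in word_table.items():
    present = [lang for lang, count in row.items() if count > 0]
    if len(present) == 1:
        unique_to[present[0]].append(word)

print(unique_to)
```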

## What words are not associated with a language?

In [None]:
master_df.clean_readme.describe()

In [None]:
#creating an 'other' list for languages to help find any words not associated with a language and the count of this:
keep_languages = master_df.language.value_counts().nlargest(10).index.tolist()
master_df.loc[~(master_df.language.isin(keep_languages)), 'language'] = 'other'

master_df.language.value_counts()
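
The same "keep the top N labels, bucket the rest" idea can be shown without pandas. This is only a toy sketch: the `languages` list and `top_n = 2` are made-up values, whereas the notebook keeps the top 10.

```python
from collections import Counter

# Toy language labels, one per repo (hypothetical data).
languages = ["Shell", "Python", "Shell", "C", "Go", "Python", "Shell", "Rust"]
top_n = 2

# Keep the most frequent labels and relabel everything else as 'other'.
keep = {lang for lang, _ in Counter(languages).most_common(top_n)}
collapsed = [lang if lang in keep else "other" for lang in languages]

print(Counter(collapsed))
```

Collapsing rare languages this way leaves fewer, better-populated classes, which tends to make per-language word comparisons and any later classifier less sparse.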

In [None]:
languages = master_df.language.unique().tolist()
languages = ['all'] + languages

In [None]:
print(languages)

In [None]:
#I don't think that the clean_data is working? (see below)

In [None]:
corpora = []
#using the joined string so the 'all' corpus has the same type as the per-language corpora:
corpora.append({'language':'all', 'corpus': linux_corpus})
for lang in languages[1:]:
    corpora.append({'language':lang, 'corpus': ' '.join(master_df[master_df.language == lang].readme_string)})

In [None]:
linux_corpus_list = c.clean_data(linux_corpus)
#top 20 words in the full corpus (counting individual words rather than whole README strings):
pd.Series(linux_corpus.split()).value_counts().nlargest(20)

In [None]:
#creating corpora of these languages:
df_corpora = pd.DataFrame(corpora)
df_corpora

In [None]:
all_words = df_corpora[df_corpora.language == 'all'].corpus

## Next steps:

- Explore by flavors of:
- Hypothesis/questions
- Results/takeaways
- Modeling
- Results/takeaways

