[Use of Special Features &lt;](Magic.ipynb)| [&gt; Data Sceince Workflow](Time.ipynb)

# Notebook Standards

In this notebook, I am looking at if notebooks follow the intention of the creators of Project Jupyter. That is, whether the notebook:
- is a computational narrative
- is collaborative
- is reproducible

## Results Summary:
- 6.82% of notebooks attempt to access a file with a full path.
- 1.8% of notebooks are still untitled.
- There is no apperant association between the contents of a notebook and the attention its repostiory recieves.

------------

# Import Packages & Load Data

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pickle
import math
import load_data
import datetime
import re
import scipy.stats as st
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Load dataframes

In [2]:
notebooks = load_data.load_notebooks()
repos = load_data.load_repos()


Notebooks loaded in 0:00:23.740910
Repos loaded in 0:00:04.058064


Load aggregations

In [3]:
cell_types_df = load_data.load_cell_types()
cell_order_df = load_data.load_cell_order()
statuses_df = load_data.load_statuses()
errors_df = load_data.load_errors()
edu_statuses_df = load_data.load_edu_status()
collab_status_df = load_data.load_collab_status()
code_df = load_data.load_code()
nb_imports = load_data.load_nb_imports()
cell_stats_df = load_data.load_cell_stats()
comments_df = load_data.load_comments()

Cell types loaded in 0:00:04.968306
Cell order loaded in 0:00:39.087439
Statuses loaded in 0:00:01.013576
Errors loaded in 0:00:11.210184
Educational status loaded in 0:00:00.002591
Collaboration statuses loaded in 0:00:00.028552
Code loaded in 0:01:05.415815
Notebook imports loaded in 0:01:06.929515
Cell stats loaded in 0:00:01.834526
Comments loaded in 0:00:01.207842


---
# Manipulate Data

### Reproducibility

Variables we already have:
- cells run in_order
- function definitions in order
- imports in order
- variable definitions in order

Does a notebook attempt to access a local file

In [4]:
def find_paths(code):
    return (list(re.findall('/Users/[a-zA-Z_/./]+', str(code))) +      # Mac
           list(re.findall('C:/[a-zA-Z_/./]+', str(code))) +           # Windows
           list(re.findall('~/[a-zA-Z_/./]+', str(code))) +            # Linux/Mac
           list(re.findall('/home/[a-zA-Z_/./]+', str(code))))         # Linux

In [5]:
local_paths = pd.Series([
    find_paths('\n'.join([i for i in c if type(i) == str])) 
    for c in code_df.code
])

In [6]:
code_df['full_path'] = [len(l) > 0 for l in local_paths]
if 'full_path' not in notebooks.columns:
    notebooks = notebooks.merge(code_df[['file','full_path']], on = 'file')

Add num errors

In [7]:
errors_df['num_errors'] = [len(e) for e in errors_df.error_names]

### Computational Narrative

Variables we already have:
- number of comments

Is a notebook named well? If an owner kept the default 'Untitled123.ipynb' name, it's not very descriptive of the notebook's content.

In [8]:
notebooks['good_name'] = [
    not str(n).lower().startswith('untitled') 
    for n in notebooks.name
]

Add whether the repository has a description

In [9]:
repos['has_desc'] = [1 if d == True else 0 for d in  ~repos.repo_description.isna().values]

Add words markdown to lines of code

In [10]:
cell_stats_df['ratio_wl'] = cell_stats_df['num_words']/cell_stats_df['lines_of_code']
cell_stats_df = cell_stats_df[cell_stats_df.ratio_wl < math.inf]

### Notebook Content

Variables we already have:
- number of cells

Add imports

In [11]:
nb_imports['num_imports'] = [len(im) for im in nb_imports.imports]

In [12]:
for p in ['pandas', 'numpy', 'sklearn', 'tensorflow', 'keras', 'matplotlib', 'seaborn']:
    nb_imports[p] = [p in [i[0] for i in im] for im in nb_imports.imports]

## Combine all "Standards" Variables

In [13]:
attention_features = ['subscribers_count','watchers_count','forks_count','open_issues_count']

standard_df = notebooks\
    .merge(collab_status_df.rename(columns={'status':'collab'}), on = 'repo_id')\
    .merge(errors_df, on = 'file')\
    .merge(cell_types_df, on = 'file')\
    .merge(cell_order_df[['file','in_order']], on = 'file')\
    .merge(statuses_df, on = 'file')\
    .merge(repos[['repo_id','has_desc'] + attention_features], on = 'repo_id')\
    .merge(nb_imports[['file','num_imports', 'pandas', 'numpy', 
                       'sklearn', 'tensorflow', 'keras', 
                       'matplotlib', 'seaborn']], on = 'file')\
    .merge(cell_stats_df[['file','ratio_wl']], on = 'file')\
    .merge(comments_df, on = 'file')[[
        'file','repo_id','collab','num_errors','good_name','ratio_wl', 'num_comments',
        'in_order','function','import','variable','syntax', 'full_path',
        'has_desc','num_cells','num_imports', 'pandas', 'numpy', 'sklearn', 
        'tensorflow', 'keras', 'matplotlib', 'seaborn'
    ] + attention_features]

In [14]:
standard_df = standard_df[standard_df.num_cells > 0]

-----

# Visualizations & Statistics

Comments

In [15]:
print("On average, there are {0} comments per notebook ({1} comments per cell).".format(
    round(standard_df['num_comments'].mean(), 2),
    round((standard_df['num_comments']/standard_df['num_cells']).mean(), 2)
))

On average, there are 18.47 comments per notebook (1.0 comments per cell).


Markdown to Code

In [16]:
print("On average, there are {0} words of markdown for each line of code.".format(
    round(standard_df['ratio_wl'].mean(), 2)
))

On average, there are 6.36 words of markdown for each line of code.


Notebook names

In [17]:
print("{0}% of notebooks were not renamed from their default 'Untitled' name.".format(
    round(100*sum(standard_df.good_name == False)/len(standard_df), 2)
))

1.99% of notebooks were not renamed from their default 'Untitled' name.


Local paths

In [18]:
print("{0}% of notebooks attempt to access a file with a full local path.".format(
    round(100*len(local_paths[[i!=[] for i in local_paths]])/len(local_paths), 2)
))

6.82% of notebooks attempt to access a file with a full local path.


## Can a notebook be run top to bottom?

Variables related to reproducibility are:
- function and variable definition order
- package import order
- cell execution order
- whether the notebook attempts to access a file with a full path
- errors

Hypothetically, if everything is in order, there are no errors, and no full path is used, a notebook should be able to run top to bottom without having to edit or rearrange any code. What proportion of notebooks are reproducible, given this definition.

In [19]:
cant_run_through = (
    standard_df['function'] +            # if functions defined in order, function = 0
    standard_df['import'] +              # if packages imported in order, import = 0
    standard_df['variable'] +            # if vairables defined in order, variable = 0
    standard_df['full_path'] +           # if full path is not used, full_path = 0
    standard_df['num_errors'] +          # we want num_errors = 0
    (standard_df['in_order'] == False)   # if cells run in order, in_order = True
)
print("{0}% of notebooks could hypothetically be run through top to bottom.".format(
    round(100 - 100*sum(cant_run_through)/len(cant_run_through), 2)
))

30.85% of notebooks could hypothetically be run through top to bottom.


## Can the attention of a repository be predicted by the contents of a notebook?


### Alter standards dataframe
Adjust variables by length of notebook

In [20]:
standard_df['num_errors_per_cell'] = standard_df['num_errors'] / standard_df['num_cells']
standard_df['num_imports_per_cell'] = standard_df['num_imports'] / standard_df['num_cells']
standard_df['num_comments_per_cell'] = standard_df['num_comments'] / standard_df['num_cells']

Encode indicator variables

In [21]:
indicators = ['good_name','in_order','full_path','has_desc', 
              'pandas', 'numpy', 'sklearn', 'tensorflow', 
              'keras', 'matplotlib', 'seaborn']
for var in indicators:
    standard_df[var] = [1 if s else 0 for s in standard_df[var]]

In [22]:
for level in standard_df['collab'].unique():
    standard_df['collab_'+level] = [1 if c==level else 0 for c in standard_df['collab']]

Drop notebooks with no lines of code

In [23]:
standard_df = standard_df[standard_df.ratio_wl < math.inf]

Standardize numerical features

In [24]:
# Standardize all numericalcolumns with Z score
num_features = [
    'num_errors_per_cell',
    'ratio_wl','num_comments_per_cell',
    'num_imports_per_cell'
]
for col in num_features:
    standard_df[col] = (standard_df[col] - standard_df[col].mean()) / standard_df[col].std()

### Attention metric 1: product of four related measures

In [25]:
attention = [1]*len(repos)
for f in attention_features:
    repos[f] = (repos[f] - repos[f].mean()) / repos[f].std()
    attention *= repos[f]
repos['attention'] = attention

In [26]:
if 'attention' not in standard_df.columns:
    standard_df = standard_df.merge(
        repos[['repo_id','attention']], on = 'repo_id'
    )

### Attention metric 2: first principal component of four related measures

In [27]:
pca = PCA(n_components = 1)
attention_pca = pca.fit_transform(repos[attention_features])
print("{0}% of the variance in attention is explained by the first principal component.".format(
    round(100*pca.explained_variance_ratio_[0], 2)
))

repos['attention_pca'] = attention_pca

72.52% of the variance in attention is explained by the first principal component.


In [28]:
for i in range(4):
    print(attention_features[i],':',pca.components_[0][i])

subscribers_count : 0.5384214608254829
watchers_count : 0.560261328120955
forks_count : 0.5494918902409348
open_issues_count : 0.3070313294993551


The most important factor for describing attention is watchers count. Watchers, subscribers, and forks have similar importances (coefficient around 0.55), but issues is much less important with a coefficient at 0.3.

In [29]:
if 'attention_pca' not in standard_df.columns:
    standard_df = standard_df.merge(
        repos[['repo_id','attention_pca']], on = 'repo_id'
    )

### Attention metric 3: sum of four related measures
approximate measure of the number of views

In [30]:
attention_sum = [0]*len(repos)
for f in attention_features:
    attention_sum += repos[f]
repos['attention_sum'] = attention

In [31]:
if 'attention_sum' not in standard_df.columns:
    standard_df = standard_df.merge(
        repos[['repo_id','attention_sum']], on = 'repo_id'
    )

In [32]:
features = [
    'num_errors_per_cell','good_name','ratio_wl', 'num_comments_per_cell',
    'in_order','function','import','variable', 'full_path',
    'num_cells', 'pandas', 'numpy', 'sklearn', 
    'tensorflow', 'keras', 'matplotlib', 'seaborn'
]

### Are attention features associated with narrative, reproducibility, and content
- $H_o$: all features are dependent on eachother
- $H_a$: attention features are independent of narrative, reproducibility, and content features. 

Under alternative hypothesis
$\Sigma$ = $\Sigma_a$ = 



[$\Sigma_{attention}$, 0]

[0, $\Sigma_{other\_features}$]

In [34]:
np.linalg.det(standard_df[features + attention_features].corr()) / (
    np.linalg.det(standard_df[features].corr()) *
    np.linalg.det(standard_df[attention_features].corr())
)

0.998571210589533

Under the null alternative hypothesis, $\Sigma$ = $\Sigma_a$ so we expect the U statistic, the ratio of their determinants, to equal 1. The U statistic here of 0.999 is very close to 1, so provides strong evidence that the attention features are independent of narrative, reproducibility, and content features. This means it's unlikely we will be able to predict attention.

This was a surprising finding because in [Collaboration.ipynb](Collaboration.ipynb) I found that collaborative notebooks tend to have a higher markdown to code ratio and are more likely to have descriptions. However, this was differentiating between Collaborative (at least one fork or issue), Watched (at least one stargazer or watcher), and Isolated (no views). The lack of association here might be because of how spread out the data is. Some collaborative repositories have only one fork, while others have thousands. In [Collaboration.ipynb](Collaboration.ipynb), I lumped all of those repositories together, but here I *want* to distinguish between a little bit of attention and a lot of attention (e.g. one versus thousands of forks).

### Linear Regression

In [35]:
x_1 = standard_df[features]
x_2 = standard_df[features]
x_3 = standard_df[features]

y_1 = standard_df['attention']
y_2 = standard_df['attention_pca']
y_3 = standard_df['attention_sum']

X1_train, X1_test, y1_train, y1_test = train_test_split(x_1, y_1, test_size=0.2, random_state=123)
X2_train, X2_test, y2_train, y2_test = train_test_split(x_2, y_2, test_size=0.2, random_state=123)
X3_train, X3_test, y3_train, y3_test = train_test_split(x_3, y_3, test_size=0.2, random_state=123)

Attention measure 1

In [36]:
reg1 = LinearRegression().fit(X1_train, y1_train)
print('Train R^2:',reg1.score(X1_train, y1_train))
print('Test R^2:',reg1.score(X1_test, y1_test))

Train R^2: 2.0933491652330716e-06
Test R^2: 1.589981419858333e-07


Attention measure 2

In [37]:
re2 = LinearRegression().fit(X2_train, y2_train)
print('Train R^2:',reg1.score(X2_train, y2_train))
print('Test R^2:',reg1.score(X2_test, y2_test))

Train R^2: -531115747.19483113
Test R^2: -314387799.64059305


Attention measure 3

In [38]:
re3 = LinearRegression().fit(X3_train, y3_train)
print('Train R^2:',reg1.score(X3_train, y3_train))
print('Test R^2:',reg1.score(X3_test, y3_test))

Train R^2: 2.0933491652330716e-06
Test R^2: 1.589981419858333e-07


The results of the linear regression attempts supports the conclusion that attention is independent of the content of the notebook. All $R^2$ values are extremely low, even on the training sets.

[Use of Special Features &lt;](Magic.ipynb)| [&gt; Data Sceince Workflow](Time.ipynb)