# Modifying existing python libraries

## Why would you need to do this?

1.	**For a novel application or to avoid starting from scratch:** As a scientist you may face a daunting project that seems to require building everything from scratch, or your use case may be wildly different from the intended use of the distribution. Best practice is to avoid repeating code unnecessarily and to take advantage of the hard work that others have done before you. As Newton said: "If I have seen further it is by standing on the shoulders of Giants". You'll need to take care to avoid plagiarism or violating licenses, but for one-time academic uses there are almost always ways to do this appropriately. 
1.	**Extend the usability:** Sometimes the distribution just isn't intended for your use case. Maybe it requires real-valued data but you want to use complex numbers. Maybe your matrices are too large, or it just doesn't use the right algorithm at a key point, or you need to substitute one function for another. In these cases don't forget to submit a [feature request](https://guides.github.com/features/issues/). 
1.	**Fix a "bug":** If the distribution has a bug, but you need to use it, can't find solutions in forums, and can't wait for a fix (after opening an issue), it makes sense to modify the files where the bug lives. Most people in this situation will also submit their problem and solution to the distribution's [bug tracker](https://guides.github.com/features/issues/). 
1.	**Develop a deeper understanding of the algorithm and learn best coding practices:** As a scientist in 2019 you probably have limited training in algorithms and in how to code professionally. Few of our elders needed these skills. If you find yourself frequently using particular functions or libraries, it would be wise to understand them well enough to be creative and avoid hidden pitfalls. Furthermore, we scientists write awful code. Take advantage of the opportunity to see the habits and tricks the Python (or other) community uses. 

## What are the problems?

1.	**Breaking things:** If you do this incorrectly, your modifications may be lost at the next update; a new update of this or another distribution may become incompatible with your modified version; other people may not be able to use your work; and if your computing resources are shared, you may destroy functionality for your colleagues. 
2.	**Losing track of what’s been modified:** If you do this wrong you may forget you made a change and get unexpected behavior in the future. You may also forget which modifications you made and be unable to restore the previous behavior. 
3.	**Violating licenses:** We cannot and do not offer legal advice, only speculation and hearsay from the internet. It is often (usually?) the case that you cannot distribute modified versions of distributions. This means you can't put it in a public GitHub repository or link to it on a website. When sharing *parts* of a distribution you should always explicitly credit the original distribution, or better yet the authors, and mark your changes so that they are obvious. Read [the license](https://en.wikipedia.org/wiki/Free_software_license) and be aware of what you need to do. As a scientist doing non-commercial research you are probably permitted to use it and to share it with colleagues via personal communication. This is not legal advice.


## Contents

>### 1. Simplest: Copy and move the specific function you want to modify
>### 2. Most recommended: Install an editable version and use alongside the old
>### 3. Also recommended: Do it in a virtual environment and separate from the old module
>### 4. Lazy and Unwise: Modify in place




# Simplest: Copy and move the specific function you want to modify

This may be the best option for a research scientist, although it is not frequently recommended in online forums. When a scientist wants to modify a library it's usually because we want to accommodate a very unusual situation. Thus we will heavily modify it, perhaps even using it as the skeleton for a new analysis pipeline. 


The gist is to copy and paste code snippets into a new file, or find a file and move it to a new folder, then make changes there. This is sometimes troublesome because the snippets or files were *inside* a much larger module. The code may assume that other functions, variables, or modules are available. Usually it is not too tedious to move those requirements alongside your targeted snippet or file, to add new lines that import the modules needed, or to change the references so that they work in the new context. 
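
One way to see in advance which requirements a copied snippet will be missing is to parse it and list the names it reads but never defines. The sketch below does this with the standard `ast` module. The snippet text and the helper name `unresolved_names` are made up for illustration, and the check is deliberately rough (it ignores scoping), so treat the result as a to-do list rather than a guarantee.

```python
import ast
import builtins


def unresolved_names(source):
    """Return names that the snippet reads but never defines or imports."""
    defined, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defined.add(node.name)
        elif isinstance(node, ast.arg):
            defined.add(node.arg)  # function parameters
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                defined.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defined.add(node.id)  # assigned inside the snippet
            else:
                used.add(node.id)
    return sorted(used - defined - set(dir(builtins)))


# Hypothetical snippet, as if pasted out of a larger module.
snippet = '''
def compute_loss(y, activations):
    loss = LOSS_FUNCTIONS["log_loss"](y, activations[-1])
    return np.clip(loss, 0.0, None)
'''

missing = unresolved_names(snippet)
print(missing)
```

Each name in the list is something you would need to import, copy alongside the snippet, or replace.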

## Let's tweak Scikit-Learn's Multi-Layer Perceptron
We are going to use a different objective function so that the MLP optimizes the mutual information between the prediction and the true value rather than the cross entropy.

For our illustration we will be running a simple example from [https://scikit-learn.org/stable/modules/neural_networks_supervised.html#multi-layer-perceptron](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#multi-layer-perceptron)

>Our main goal is to modify two lines:    
>The line: `loss = LOSS_FUNCTIONS[loss_func_name](y, activations[-1])`    
>Is commented out and we add the line: `loss = -1.0*mutual_info_score(y.flatten(),activations[-1].flatten())`
>
>We also add a print statement so we never forget we modified something: `print('this modified version of sklearn MLP classifier uses mutual information')`
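
To make the new objective less abstract, here is a small standard-library sketch of the textbook mutual information between two sequences of discrete labels. It is an independent illustration, not scikit-learn's `mutual_info_score` implementation, and the function name `mutual_information` is our own. The modified line above negates the score, so a higher mutual information means a lower loss.

```python
from collections import Counter
from math import log


def mutual_information(labels_a, labels_b):
    """Mutual information (in nats) between two equal-length label sequences."""
    if len(labels_a) != len(labels_b):
        raise ValueError("sequences must have the same length")
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    total = 0.0
    for (a, b), n_ab in count_ab.items():
        # p(a, b) * log(p(a, b) / (p(a) * p(b))), written with raw counts
        total += (n_ab / n) * log(n_ab * n / (count_a[a] * count_b[b]))
    return total


# Labels that agree perfectly share the most information...
identical = mutual_information([0, 0, 1, 1], [0, 0, 1, 1])
# ...while labels that vary independently share none.
unrelated = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(identical, unrelated)
```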

In [0]:
# MAKE SURE REQUIREMENTS ARE MET

# make sure sklearn is installed
#!pip install scikit-learn

# copy the imports from the example
from sklearn.neural_network import MLPClassifier

## Make the changes needed

### Find the files to modify
If using Python 3.6, pip installs packages to `/usr/local/lib/python3.6/dist-packages/`, and likewise for other Python 3 versions. This will be different on Windows or if you are using Anaconda.
>
See the import line above: "`from sklearn.neural_network import MLPClassifier`"? 

That means there is a folder "`sklearn/neural_network`" and either a file, folder, class, or function called "`MLPClassifier`".  All this is located within `/usr/local/lib/python3.6/dist-packages/`
>
Navigate to `/usr/local/lib/python3.6/dist-packages/sklearn/neural_network`
Notice the file "`multilayer_perceptron.py`". Open it and inside there is a class: "`MLPClassifier`". We won't have you copy it into a cell and modify it, the file has too many lines. We already made the changes and will show you.
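
If you would rather not guess the install path, Python can report where an object is defined. The sketch below uses the standard `inspect` module; the helper name `where_is` is ours, and a standard-library class stands in for `MLPClassifier` so that the example runs without scikit-learn installed. Passing `MLPClassifier` instead should point at the file described above.

```python
import inspect
import json


def where_is(obj):
    """Return (file path, first line number) of the source that defines obj."""
    path = inspect.getsourcefile(obj)
    _, first_line = inspect.getsourcelines(obj)
    return path, first_line


# json.JSONDecoder is only a stand-in for the class you want to modify.
path, first_line = where_is(json.JSONDecoder)
print(path, first_line)
```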
>
>
>
### Note that we also needed to change the imports because they are now broken.

**example of import from sklearn trunk (two dots)**

*old file*
`from ..base import BaseEstimator, ClassifierMixin, RegressorMixin`

*new file*
`from sklearn.base import BaseEstimator, ClassifierMixin, RegressorMixin`

**example of import from the folder containing multilayer_perceptron.py (one dot)**

*old file*
`from ._base import ACTIVATIONS, DERIVATIVES, LOSS_FUNCTIONS`

*new file*
`from sklearn.neural_network._base import ACTIVATIONS, DERIVATIVES, LOSS_FUNCTIONS`
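
When a copied file has many relative imports, the rewrite can be automated. The following is a minimal sketch, assuming the copied file originally lived in the package given as `package`; the helper name `absolutize_imports` is hypothetical and only the `from ... import` form shown above is handled.

```python
import re

# Matches "from <dots><optional module> import" at the start of a line.
_RELATIVE_IMPORT = re.compile(
    r"^([ \t]*from[ \t]+)(\.+)([\w.]*)([ \t]+import\b)", re.MULTILINE
)


def absolutize_imports(source, package):
    """Rewrite relative imports as absolute ones rooted at `package`."""
    parts = package.split(".")

    def _rewrite(match):
        prefix, dots, rest, suffix = match.groups()
        levels_up = len(dots) - 1  # a single dot means "this package"
        if levels_up >= len(parts):
            raise ValueError("relative import climbs above the top-level package")
        target = parts[: len(parts) - levels_up]
        if rest:
            target.append(rest)
        return prefix + ".".join(target) + suffix

    return _RELATIVE_IMPORT.sub(_rewrite, source)


old = (
    "from ..base import BaseEstimator, ClassifierMixin, RegressorMixin\n"
    "from ._base import ACTIVATIONS, DERIVATIVES, LOSS_FUNCTIONS\n"
)
new = absolutize_imports(old, package="sklearn.neural_network")
print(new)
```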



In [0]:
# this cell downloads the modified file and puts it in the current working directory
# first we delete any old versions you might have from a previous run
!rm -f MLP_jkj_mod.py
# now we download a fresh file using wget
!wget https://raw.githubusercontent.com/jojker/PML_Workshops/master/Summer%202019/Day%201%20-%20Process%20and%20Design%20for%20Rapid%20Progress/Ex%204%20-%20Modifying%20python%20libraries/MLP_jkj_mod.py

--2019-07-11 21:13:29--  https://raw.githubusercontent.com/jojker/PML_Workshops/master/Summer%202019/Day%201%20-%20Process%20and%20Design%20for%20Rapid%20Progress/Ex%204%20-%20Modifying%20python%20libraries/MLP_jkj_mod.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.0.133, 151.101.64.133, 151.101.128.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.0.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53224 (52K) [text/plain]
Saving to: ‘MLP_jkj_mod.py’


2019-07-11 21:13:29 (1.45 MB/s) - ‘MLP_jkj_mod.py’ saved [53224/53224]



In [0]:
# import the modified version
# we have to tell python where to look
import sys
import os
sys.path.append(os.getcwd())
# here is the actual import
from MLP_jkj_mod import MLPClassifier_jkjMod  

In [0]:
# initialize test data and compare versions
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = MLPClassifier(solver='sgd', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
clf_mod = MLPClassifier_jkjMod(solver='sgd', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)

this modified version of sklearn MLP classifier uses mutual information


In [0]:
# train the old classifier
clf.fit(X, y)
# test the old classifier
clf.predict([[2., 2.], [-1., -2.]])



array([1, 0])

In [0]:
# train the new classifier
clf_mod.fit(X, y)
# test the new classifier
clf_mod.predict([[2., 2.], [-1., -2.]])

array([1, 0])

# Most recommended: Install an editable version and use alongside the old

From reading forums and blog posts one gets the impression that software developers use the biggest, bluntest tool available. In this case that means "forking" the entire distribution to a new folder on your computer and modifying it in place there. This solution mostly avoids the broken requirements of the previous method. However, it poses a new challenge: keeping track of the changes you made. Version control software, such as git, is designed to help with this and is a requirement for this approach. That adds a new skill for research scientists to learn, and it isn't as user-friendly as isolating all your changes in a specific and separate folder. 

This is a very good option nonetheless, and if you get in the habit of marking your changes with a comment such as `#TRACK-CHANGE 06/29/2019 changed xxx to xxx and added xxx` you will be able to find them easily with the text search feature in your file explorer, without having to read through the output of a "diff" command in git (we'll show you that anyway). 
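
The same search can be done from Python. This is a minimal sketch that walks a folder and reports every line carrying the `#TRACK-CHANGE` marker; the function name `find_tracked_changes` and the throwaway demo files are assumptions made for illustration.

```python
import tempfile
from pathlib import Path

MARKER = "#TRACK-CHANGE"


def find_tracked_changes(root, marker=MARKER):
    """Yield (path, line number, line text) for every marked line under root."""
    for path in sorted(Path(root).rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if marker in line:
                yield path, number, line.strip()


# Demo on a throwaway folder so the sketch runs anywhere.
demo_root = Path(tempfile.mkdtemp())
(demo_root / "edited.py").write_text(
    "x = 1\n"
    "y = 2  #TRACK-CHANGE 06/29/2019 changed 1 to 2\n"
)
(demo_root / "untouched.py").write_text("z = 3\n")

hits = list(find_tracked_changes(demo_root))
for path, number, line in hits:
    print(f"{path.name}:{number}: {line}")
```

Pointing `find_tracked_changes` at the folder of your editable install would list every change you marked.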

In [0]:
# install with editable mode
!pip install -e git+https://github.com/scikit-learn/scikit-learn#egg=skl_jkjMod

Obtaining skl_jkjMod from git+https://github.com/scikit-learn/scikit-learn#egg=skl_jkjMod
  Cloning https://github.com/scikit-learn/scikit-learn to ./src/skl-jkjmod
  Running command git clone -q https://github.com/scikit-learn/scikit-learn /content/src/skl-jkjmod
Installing collected packages: scikit-learn
  Found existing installation: scikit-learn 0.21.2
    Uninstalling scikit-learn-0.21.2:
      Successfully uninstalled scikit-learn-0.21.2
  Running setup.py develop for scikit-learn
Successfully installed scikit-learn


## Make the changes needed

### Find the files to modify
Now pip clones the source into a `src/` folder inside your current working directory, here `/content/src/`. It names the folder after your egg name, but in lower case and with hyphens: the install log above shows `skl-jkjmod` for the egg name `skl_jkjMod`.
>
That means that now there is a folder "`/content/src/skl-jkjmod/sklearn/neural_network`" and either a file, folder, class, or function called "`MLPClassifier`".
>
Navigate to that folder and notice the file "`multilayer_perceptron.py`". Open it and inside there is a class: "`MLPClassifier`". We are going to replace one line and add a print statement. These are the same core changes as in the previous section, but we no longer need to change the imports. 
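
With two copies of a library on one machine, it helps to confirm which one Python will actually load. The sketch below asks the import system without importing the module. The helper name `module_location` is ours, and `json` stands in for `sklearn` so the example runs anywhere; after an editable install one would expect `module_location("sklearn")` to point into the `src/` folder, though that is an assumption about your setup.

```python
import importlib.util


def module_location(name):
    """Return the file Python would load for a top-level module, or None."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


print(module_location("json"))
print(module_location("no_such_module_xyz"))
```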

In [0]:
# run before making modifications
# tell python to look in current folder
import sys
import os
sys.path.append(os.getcwd())
# note the import uses the egg name
from skl_jkjMod.sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = MLPClassifier(solver='sgd', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
# train the old classifier
clf.fit(X, y)
# test the old classifier
clf.predict([[2., 2.], [-1., -2.]])

AttributeError: ignored

# DO THIS
### Download the file, make changes, delete the server copy and upload your modified file. Finally restart the runtime so the changes take effect. Run this next cell, skipping the previous cells.

In [0]:
# run after making modifications
# tell python to look in current folder
import sys
import os
sys.path.append(os.getcwd())
# re-import
from skl_jkjMod.sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = MLPClassifier(solver='sgd', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
# train the new classifier
clf.fit(X, y)
# test the new classifier
clf.predict([[2., 2.], [-1., -2.]])
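
Restarting the runtime is the safest way to pick up edits. For small pure-Python modules, `importlib.reload` can sometimes do the job without a restart. The sketch below demonstrates this on a throwaway module (the name `toy_module_for_reload` is made up); for a large package with compiled parts, such as scikit-learn, reloading is unreliable and a restart remains the better choice.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Skip .pyc files so this quick demo cannot pick up stale bytecode.
sys.dont_write_bytecode = True

workdir = Path(tempfile.mkdtemp())
module_file = workdir / "toy_module_for_reload.py"
module_file.write_text("def loss_name():\n    return 'cross entropy'\n")

# Tell Python where to look, then import the original version.
sys.path.insert(0, str(workdir))
toy = importlib.import_module("toy_module_for_reload")
before = toy.loss_name()

# Edit the file on disk, then ask Python to re-read it.
module_file.write_text("def loss_name():\n    return 'mutual information'\n")
importlib.invalidate_caches()
toy = importlib.reload(toy)
after = toy.loss_name()

print(before, "->", after)
```

Objects created before the reload keep their old behavior, which is one more reason to prefer a restart when in doubt.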

In [0]:
# run the original version which is still installed in a different location

# conventional import
from sklearn.neural_network import MLPClassifier

X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = MLPClassifier(solver='sgd', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)
# train the original classifier
clf.fit(X, y)
# test the original classifier
clf.predict([[2., 2.], [-1., -2.]])

### Uninstalling an editable version is tedious
Normally you can uninstall with `pip uninstall PACKAGE_NAME`, but with editable packages you have to delete the files yourself.

In [0]:
# now we laboriously uninstall the modifiable sklearn
!rm -r $(find /usr/local/lib/python3.6/dist-packages -name 'scikit*.egg*')
!rm -r $(find /usr/local/lib/python3.6/dist-packages/ -name 'skl_jkjMod*.egg*')
!rm -r $(find /usr/local/lib/python3.6/dist-packages -name 'sklearn*.egg*')
!rm -r ./src/skl-jkjmod

## Advanced topic: Distributions that make it hard to install in editable mode

TensorFlow is a big and complicated distribution. Intentionally or not, the developers have put obstacles in the way of installing it in editable mode. Nonetheless, you can do it with some simple modifications to the steps above.

In [0]:
# will give error message because tensorflow devs are meanies
!pip install -e git+https://github.com/tensorflow/tensorflow#egg=tFforScience

In [0]:
# we have to copy the pip setup file to the correct (non-meany) location
!cp /content/src/tfforscience/tensorflow/tools/pip_package/setup.py /content/src/tfforscience/setup.py

In [0]:
# now we can try again, EXCEPT that we already have the files; we just need to tell pip where to find them
!pip install -e /content/src/tfforscience/

In [0]:
# tell python to look in current folder
import sys
import os
sys.path.append(os.getcwd())
import tfforscience as tffs
import tensorflow as tf

In [0]:
# now we laboriously uninstall the modifiable tensorflow
!rm -r $(find /usr/local/lib/python3.6/dist-packages -name 'tensorflow*.egg*')
!rm -r $(find /usr/local/lib/python3.6/dist-packages/ -name 'tfforscience*.egg*')
!rm -r ./src/tfforscience

# Also recommended: Do it in a virtual environment and separate from the old module

When you modify a distribution in place, even an editable version, future updates may break the requirements of your modified code. A solution is to create a "virtual environment" that has all the requirements you need to run your code, and only those requirements. This is also useful if you are modifying specific functions: you can freeze the requirements so that the code will always run. 

If you "freeze" a lot of distributions the hard-disk requirements can add up. You can also forget which virtual environment does what, or simply forget you are **in** a virtual environment and accidentally break it. It also requires some setup and maintenance, which adds a step every time you use your code. 

### create a virtual environment directory

In [0]:
# install a python package that sets up virtual environments
!pip install virtualenv

In [0]:
# this creates a virtual environment
!virtualenv /content/modLibEnv
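
Python 3 also ships a standard-library module, `venv`, that does a similar job and can be driven from Python itself. The sketch below is a swap for the `virtualenv` package used above, not the same tool. It builds the environment in a temporary folder so that it is safe to run anywhere, and it skips installing pip to stay fast and offline.

```python
import tempfile
import venv
from pathlib import Path

# A temporary location stands in for /content/modLibEnv.
env_dir = Path(tempfile.mkdtemp()) / "modLibEnv"

# with_pip=False keeps the demo quick; use with_pip=True for real work.
venv.create(env_dir, with_pip=False)

# Every environment created this way carries a small config file.
config = env_dir / "pyvenv.cfg"
print(config.read_text())
```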

In [0]:
# the new environment has nothing installed
# get a list of everything installed on your default system
!pip freeze  > /content/modLibEnv/requirements.txt
# edit the requirements text file to eliminate packages that are not necessary for your project
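
Trimming the requirements file by hand gets tedious when `pip freeze` lists hundreds of packages. A minimal sketch of doing it in Python follows. The helper name `filter_requirements` is hypothetical, the sample text is invented for illustration (only the scikit-learn version comes from the install log earlier), and editable `-e ...` lines are simply dropped by this version.

```python
import re


def filter_requirements(freeze_text, keep):
    """Keep only the `pip freeze` lines whose project name is in `keep`."""
    wanted = {name.lower().replace("_", "-") for name in keep}
    kept = []
    for line in freeze_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The project name is everything before the first version or URL separator.
        name = re.split(r"[=<>!~ @\[;]", line, maxsplit=1)[0]
        if name.lower().replace("_", "-") in wanted:
            kept.append(line)
    return "\n".join(kept) + "\n"


# Invented sample of what `pip freeze` output looks like.
frozen = (
    "numpy==1.16.4\n"
    "scikit-learn==0.21.2\n"
    "some-unneeded-package==1.0\n"
)
trimmed = filter_requirements(frozen, keep=["numpy", "scikit_learn"])
print(trimmed)
```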

### finish configuring your virtual environment 
These steps won't work in Colab but will work on Anaconda or local systems. In a notebook each `!` command runs in its own temporary shell, so the activation is forgotten as soon as that line finishes; the first command has to be run *before* opening a Python notebook.

In [0]:
# "activate" the virtual environment 
!source /content/modLibEnv/bin/activate

In [0]:
# double check that your environment is pointing to the right place
!echo $PATH
# the first entry should be /content/modLibEnv/bin
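
It is easy to forget whether you are inside a virtual environment, and Python itself can answer the question. This is a small sketch with a helper name of our own choosing; it relies on `sys.base_prefix`, which `venv` and recent `virtualenv` releases set, and on `sys.real_prefix`, which older `virtualenv` releases set.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    base = getattr(sys, "real_prefix", None) or getattr(sys, "base_prefix", sys.prefix)
    return base != sys.prefix


print(sys.prefix, in_virtual_environment())
```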

In [0]:
# install packages into your new environment
!pip install -r /content/modLibEnv/requirements.txt

In [0]:
# If you want your python to find scripts you wrote you need to add their 
# containing folders to the PYTHONPATH
!export PYTHONPATH="${PYTHONPATH}:/my/other/path/"

In [0]:
# double check your changes
!echo $PYTHONPATH
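
What `PYTHONPATH` actually changes is the `sys.path` list of any Python process started afterwards. The sketch below shows this by launching a child Python process with an extra folder on `PYTHONPATH` and reading back its `sys.path`. A temporary folder stands in for `/my/other/path/`.

```python
import json
import os
import subprocess
import sys
import tempfile

# A temporary folder stands in for /my/other/path/.
extra_dir = tempfile.mkdtemp()

# Copy the current environment and append the folder to PYTHONPATH.
env = dict(os.environ)
existing = env.get("PYTHONPATH")
env["PYTHONPATH"] = existing + os.pathsep + extra_dir if existing else extra_dir

# Start a fresh Python and ask it for its module search path.
result = subprocess.run(
    [sys.executable, "-c", "import json, sys; print(json.dumps(sys.path))"],
    env=env, capture_output=True, text=True, check=True,
)
child_sys_path = json.loads(result.stdout)

found = os.path.realpath(extra_dir) in {os.path.realpath(p) for p in child_sys_path}
print(found)
```

The parent process is unaffected, which is why scripts in this notebook use `sys.path.append` instead.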

### Colab compatible check
By putting several commands on one line it will let us see what to expect

In [0]:
# check the path 
!source /content/modLibEnv/bin/activate && echo $PATH

In [0]:
# check the python path before the modification
!source /content/modLibEnv/bin/activate && echo $PYTHONPATH

In [0]:
# check the python path after the modification
!source /content/modLibEnv/bin/activate && export PYTHONPATH="${PYTHONPATH}:/content/sample_data/" && echo $PYTHONPATH
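
The reason for chaining commands with `&&` can be reproduced from plain Python. In the sketch below each call to the hypothetical helper `sh` starts a fresh shell, much like a single `!` line, so a variable exported in one call is gone in the next unless both commands share one line. It assumes a POSIX shell, and the variable name is made up.

```python
import subprocess


def sh(command):
    """Run one command line in a fresh shell and return what it printed."""
    done = subprocess.run(command, shell=True, capture_output=True, text=True)
    return done.stdout.strip()


# Two separate calls: the variable set by the first is gone in the second.
sh("export MODLIB_DEMO_VAR=hello")
separate = sh('echo "${MODLIB_DEMO_VAR:-unset}"')

# One call chained with &&: both commands run in the same shell.
chained = sh('export MODLIB_DEMO_VAR=hello && echo "${MODLIB_DEMO_VAR:-unset}"')

print(separate, chained)
```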

### How to delete the virtual environment

In [0]:
# delete a virtual environment
!deactivate
!rm -r /content/modLibEnv

# Lazy and Unwise: Modify in place

Since we showed you where the Python library's files are, you could always just go and tweak those. People usually do this because they encountered an unsolved bug or unmet need and didn't feel they had the time to google the proper method, or because they couldn't think of a useful search phrase and gave up looking. This is asking for trouble. If you do this, always keep track of what you did and come back later to do it correctly. 

A better version of this hack is to do it in a virtual environment. You will still break updates within that virtual environment, and it may be a headache to use your code outside the environment (e.g. when sharing it). However, some distributions make it hard to install an editable version, so making your tweaks in a virtual environment lets you take the time to figure out what changes are required while limiting the broken software to that environment. Then you can go back and copy out just the portions you figured out you need to change. Remember to delete all traces of the virtual environment when you are done using it (or lock it in an archive for posterity). 
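
If you do edit a file in place, one low-effort way to keep track is to save a pristine copy first and diff against it later. The sketch below uses only the standard library. The helper names and the `.orig` suffix are our own convention, and the demo runs on a throwaway file, not on a real library.

```python
import difflib
import shutil
import tempfile
from pathlib import Path


def backup_once(path):
    """Copy path to '<name>.orig' unless that backup already exists."""
    path = Path(path)
    backup = path.with_name(path.name + ".orig")
    if not backup.exists():
        shutil.copy2(path, backup)
    return backup


def changes_since_backup(path):
    """Return unified-diff lines between the backup and the current file."""
    path = Path(path)
    backup = path.with_name(path.name + ".orig")
    return list(difflib.unified_diff(
        backup.read_text().splitlines(),
        path.read_text().splitlines(),
        fromfile=backup.name, tofile=path.name, lineterm="",
    ))


# Demo: back up a throwaway file, edit it, then list what changed.
target = Path(tempfile.mkdtemp()) / "library_file.py"
target.write_text("loss = cross_entropy(y, p)\n")
backup = backup_once(target)
target.write_text("loss = -1.0 * mutual_info(y, p)\n")

diff = changes_since_backup(target)
print("\n".join(diff))
```

Restoring the original is then a matter of copying the `.orig` file back over the edited one.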