# Deployment 2

### Git - Version Control

Git is a Version Control System (VCS) used to track changes in a project.

Git is a Distributed Version Control System.

It coordinates work and changes between multiple developers.

You have a local repository on your system. You make changes to the files in the local repository and push them to a remote repository such as GitHub or Bitbucket.

Git takes snapshots of the files with the **'commit'** command.

You can return to any snapshot at any time.

You can put your files in the staging area before committing with the **'add'** command.

Once you push your commits to the remote repository, other developers can **'pull'** those changes onto their machines.

You can also create branches.

**$ git init** This command initializes a local Git repository. Go to a folder and execute this command. It creates a hidden .git directory in that folder.

**$ git add < file_name >** After initializing, use this command to add files in that folder to the staging area.

**$ git status** shows the state of the working directory and the staging area, including which files are staged.

**$ git commit** takes everything in the staging area and records it in the local repository.

**$ git push** uploads the commits in your local repository to the remote repository.

**$ git pull** fetches the latest changes from the remote repository and merges them into your local branch.

**$ git clone** copies a remote repository into a local folder.

You can use Git from either the Command Prompt or Git Bash (recommended).

**$ git rm --cached < file_name >** removes the file from the staging area but keeps it in the working directory.

**$ git add \*.html** adds all files ending with .html (there is no space between the asterisk and .html).

**$ git commit** opens an editor; type the commit message, then save and exit (:wq in Vim).

**$ git commit -m 'comment'** supplies the commit message on the command line, so no editor opens.

**.gitignore** Create a file named .gitignore and list in it the files that you don't want Git to track. You can then add all the files.

When you add all the files to the staging area, git status shows every file except those listed in .gitignore.

**$ git branch < branch name >** creates a branch.

**$ git checkout < branch name >** switches to that branch.

**NOTE:** Files that you create and commit on a branch don't show up in master unless you merge the branch into master.

**$ git remote** lists the configured remote repositories.
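
The local part of this workflow (init, add, status, commit) can also be driven from Python. The sketch below is only an illustration and assumes the git executable is on your PATH; the helper names `run_git` and `demo_local_workflow`, the file name, and the user details are made up for the example.

```python
import pathlib
import subprocess
import tempfile


def run_git(repo_dir, *args):
    """Run one git command inside repo_dir and return its standard output."""
    completed = subprocess.run(
        ["git", *args], cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return completed.stdout


def demo_local_workflow():
    """Initialise a throwaway repository, stage one file, and commit it."""
    with tempfile.TemporaryDirectory() as repo_dir:
        run_git(repo_dir, "init")
        pathlib.Path(repo_dir, "notes.txt").write_text("first draft\n")

        run_git(repo_dir, "add", "notes.txt")          # working directory -> staging area
        staged = run_git(repo_dir, "status", "--short")

        # Hypothetical identity passed with -c so the commit works without global config.
        run_git(
            repo_dir,
            "-c", "user.name=Example User",
            "-c", "user.email=user@example.com",
            "commit", "-m", "Add notes",                # staging area -> local repository
        )
        history = run_git(repo_dir, "log", "--oneline")
    return staged, history
```

Calling `demo_local_workflow()` returns the short status taken while the file was staged and the one-line log after the commit.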

### Virtual Environment

You can create virtual environments with Anaconda, or with venv, the module in Python's standard library for creating virtual environments.

If you don't use a virtual environment, installing a package like numpy installs it globally. When two applications need different versions of the same package, there will be a conflict.

Instead, you can create a virtual environment for each application and run each application in its own virtual environment.

The downside of the venv module is that it cannot create virtual environments for Python versions other than the one running it.
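
As a minimal sketch, an environment can be created from Python itself with the standard library `venv` module. The directory name `app_one_env` and the helper `create_environment` are placeholders chosen for this example; in practice you would more often run `python -m venv <folder>` from a terminal.

```python
import pathlib
import tempfile
import venv


def create_environment(env_dir):
    """Create a virtual environment at env_dir using the running interpreter."""
    # with_pip=False keeps the sketch fast; pass True to bootstrap pip as well.
    venv.create(env_dir, with_pip=False)
    return pathlib.Path(env_dir) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as base_dir:
    config_file = create_environment(pathlib.Path(base_dir) / "app_one_env")
    # pyvenv.cfg records which host Python the environment was built from.
    config_exists = config_file.exists()
    config_text = config_file.read_text()
```

The `pyvenv.cfg` file points back at the host interpreter, which is why the environment always uses the same Python version that created it.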

### Different approaches to putting a Machine Learning model into production

**Train** One-off, batch, and real-time/online training.

**Serve** Batch or real-time (database trigger, Pub/Sub, web service, in-app).

**One Off** Models don't necessarily need to be trained continuously. They can be trained once, pushed to production, and retrained only when their performance deteriorates.
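
One simple way to decide when a one-off model needs retraining is to compare its current score against the score it had when it was deployed. The function below is a hypothetical sketch: the name `needs_retraining` and the default tolerance of 0.05 are assumptions for illustration, not a recommended threshold.

```python
def needs_retraining(current_score, baseline_score, tolerance=0.05):
    """Return True when the score has dropped more than `tolerance` below the baseline."""
    return (baseline_score - current_score) > tolerance


# Example: the model scored 0.91 at deployment time.
small_drop = needs_retraining(current_score=0.90, baseline_score=0.91)
large_drop = needs_retraining(current_score=0.80, baseline_score=0.91)
```

Here `small_drop` is `False` and `large_drop` is `True`, so only the second case would trigger retraining.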

**Batch Training** Batch training gives you a constantly refreshed version of your model, retrained on the latest data.

Batch training can benefit a lot from AutoML frameworks. AutoML lets you automate activities such as feature processing, feature selection, model selection, and parameter optimization.

### Pickling in python

Python's pickle module is used to serialize objects so that they can be saved to a file and loaded into a program again later.

**What is pickling**

Pickling is used for serializing and deserializing Python object structures. Serialization is the process of converting an object in memory to a byte stream that can be stored on disk or sent over a network.

Later on, this byte stream can be retrieved and deserialized back into a Python object.

Pickling should not be confused with compression. Pickling converts an object in memory into a byte stream, while compression encodes data in fewer bytes to save disk space.
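
The difference between the two steps can be seen in a short sketch. The `record` dictionary is made-up sample data, and `zlib` is used here only as one example of a compressor.

```python
import pickle
import zlib

# Made-up sample data with a lot of repetition.
record = {"label": "setosa", "measurements": [5.1, 3.5, 1.4, 0.2] * 50}

serialized = pickle.dumps(record)        # object in memory -> byte stream
compressed = zlib.compress(serialized)   # byte stream -> fewer bytes

# Reverse both steps: decompress first, then deserialize.
restored = pickle.loads(zlib.decompress(compressed))
```

Pickling produces bytes that describe the object; compression is a separate, optional step applied to those bytes.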

**What can you do with pickling**

Send data over a TCP or socket connection.

Save Machine Learning models and use them again later to make predictions without having to train the model again.
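
As a rough sketch of the first use case, a pickled object can be sent through a socket with a length prefix so the receiver knows how many bytes to read. The helper names and the 4-byte prefix are choices made for this example, and `socket.socketpair()` stands in for a real TCP connection.

```python
import pickle
import socket


def send_object(sock, obj):
    """Pickle obj and send it with a 4-byte length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(len(payload).to_bytes(4, "big") + payload)


def _recv_exactly(sock, size):
    """Read exactly size bytes, since recv may return fewer than requested."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed before the full message arrived")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def recv_object(sock):
    """Receive one length-prefixed pickle and rebuild the object."""
    size = int.from_bytes(_recv_exactly(sock, 4), "big")
    # Only unpickle data from a trusted peer: pickle.loads can run arbitrary code.
    return pickle.loads(_recv_exactly(sock, size))


# A connected pair of sockets in one process, used here instead of a real network.
sender, receiver = socket.socketpair()
with sender, receiver:
    send_object(sender, {"model": "iris", "scores": [0.91, 0.88]})
    received = recv_object(receiver)
```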

**When not to use pickle**

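Pickle is not recommended when you work across different programming languages, because its protocol is specific to Python.

Pickle files are not guaranteed to be portable across Python versions. A file written with a newer pickle protocol cannot be read by an older Python that does not support that protocol, and objects from third-party libraries may fail to load if the library version has changed.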

**What can be pickled**

Booleans, integers, floats, complex numbers, strings, tuples, lists, sets, and dictionaries can be pickled, as can classes and functions defined at the top level of a module.

**What cannot be pickled**

Generators, lambda functions, classes and functions defined inside another function, and defaultdicts whose default factory is a lambda cannot be pickled.
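
A quick way to check whether a particular object can be pickled is simply to try it. The helper `can_pickle` below is a small illustrative function, not part of the pickle module.

```python
import pickle


def can_pickle(obj):
    """Return True if pickle.dumps succeeds for obj, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


checks = {
    "dict": can_pickle({1: "one", 2: "two"}),
    "built-in function": can_pickle(len),
    "lambda": can_pickle(lambda x: x + 1),
    "generator": can_pickle(n for n in range(3)),
}
```

The dictionary and the built-in function pickle without problems, while the lambda and the generator are rejected.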

#### Pickling a Dictionary

```python
import pickle

example_dict = {1: 'one', 2: 'two', 3: 'three', 4: 'four'}

# Open dict.pickle in binary write mode ('wb' = write bytes) and dump the
# dictionary into it; the with block closes the file automatically.
with open('dict.pickle', 'wb') as pickle_out:
    pickle.dump(example_dict, pickle_out)
```

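```python
import pickle

# Open the pickle file in binary read mode ('rb') and rebuild the dictionary.
with open('dict.pickle', 'rb') as pickle_in:
    pkl_example_dict = pickle.load(pickle_in)

pkl_example_dict
```

```
{1: 'one', 2: 'two', 3: 'three', 4: 'four'}
```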

#### Pickling a Model

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load the iris data and split it into training and test sets
data = load_iris()
Xtrain, Xtest, Ytrain, Ytest = train_test_split(data.data, data.target, test_size=0.3, random_state=4)

# Create the model.
# Note: n_jobs has no effect when solver='liblinear'; scikit-learn warns about it.
model = LogisticRegression(C=0.1,
                           max_iter=20,
                           fit_intercept=True,
                           n_jobs=3,
                           solver='liblinear')

# Fit the model on the training set
model.fit(Xtrain, Ytrain)
```

```
LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=20, multi_class='warn', n_jobs=3,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
```

```python
import pickle

# Save the trained model to disk
with open('logistic_regression.pickle', 'wb') as f:
    pickle.dump(model, f)
```

```python
# The trained model was saved with pickle; some time later, load it and test it.

# Read the model from disk
with open('logistic_regression.pickle', 'rb') as pickle_in:
    clf = pickle.load(pickle_in)

score = clf.score(Xtest, Ytest)
print(f"Test Score is {100*score}")

Ypredict = clf.predict(Xtest)
```

```
Test Score is 91.11111111111111
```


In a production environment, pickling saves a lot of training time. Train the model once, store the pickle file, and retrain only when necessary, for example when the model's performance deteriorates.

You can also scale up your EC2 instances for training: train the model on more powerful EC2 instances, save the pickle files to an S3 bucket, and terminate the instances once training is done.

### Joblib - Alternative to pickle for handling large NumPy arrays

The Joblib library is intended as a replacement for pickle for objects containing large data.

Joblib offers a simpler workflow than pickle: joblib.dump and joblib.load take a file name directly, so you do not need to open the file yourself.

Joblib also supports different compression methods, such as 'zlib', 'gzip', and 'bz2', and different levels of compression.

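```python
# sklearn.externals.joblib was removed from newer scikit-learn releases;
# import the standalone joblib package instead.
import joblib

joblib_file = "joblib_model.pkl"
joblib.dump(model, joblib_file)
```

```
['joblib_model.pkl']
```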

```python
# Some time later...

# Load the model from the file
joblib_model = joblib.load(joblib_file)

# Use the loaded model to score the test set
score = joblib_model.score(Xtest, Ytest)
print(f"Test Score is {100*score}")
```

```
Test Score is 91.11111111111111
```
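
The compression options mentioned earlier in this section can be tried with a short sketch. The array `weights` is a stand-in for a large model attribute, and the file names and the choice of 'gzip' at level 3 are arbitrary examples.

```python
import os
import tempfile

import joblib
import numpy as np

weights = np.zeros((500, 500))  # stand-in for a large model attribute

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "weights.joblib")
    gzip_path = os.path.join(tmp_dir, "weights.joblib.gz")

    # Default: no compression.
    joblib.dump(weights, plain_path)
    # compress=(method, level): the level ranges from 0 to 9.
    joblib.dump(weights, gzip_path, compress=("gzip", 3))

    plain_size = os.path.getsize(plain_path)
    gzip_size = os.path.getsize(gzip_path)

    # joblib.load detects the compression on its own.
    restored = joblib.load(gzip_path)
```

Higher levels usually give smaller files at the cost of slower saving, so the right setting depends on how often the model is written and read.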


### Disadvantages of both Joblib and Pickle

**Python Version Compatibility** A file that is pickled in one version of Python is not guaranteed to load in a different version, and the same applies when the version of the library that defines the model changes.

**Model compatibility** One of the most frequent mistakes is saving your model with Pickle or Joblib, then changing the model's code before trying to restore it from the file. The internal structure of the model needs to stay unchanged between save and reload.
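
One way to make version problems easier to spot is to store the Python version next to the object and compare it on load. This is only a sketch of that idea: the helper names, the bundle layout, and the file name are assumptions made for the example, and library versions could be recorded in the same way.

```python
import os
import pickle
import platform
import tempfile
import warnings


def save_with_metadata(obj, path):
    """Pickle obj together with the Python version that wrote it."""
    bundle = {"python_version": platform.python_version(), "payload": obj}
    with open(path, "wb") as f:
        pickle.dump(bundle, f)


def load_with_check(path):
    """Load the bundle and warn if the major.minor Python version differs."""
    with open(path, "rb") as f:
        bundle = pickle.load(f)
    saved = bundle["python_version"].split(".")[:2]
    current = platform.python_version().split(".")[:2]
    if saved != current:
        warnings.warn(
            f"File was written by Python {'.'.join(saved)}, "
            f"now running Python {'.'.join(current)}"
        )
    return bundle["payload"]


with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "model_bundle.pickle")
    save_with_metadata({"coef": [0.1, 0.2]}, bundle_path)
    loaded = load_with_check(bundle_path)
```

The check cannot repair an incompatible file, but it turns a confusing load error into an explicit message about what changed.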

### Serving model via REST API

https://medium.com/value-stream-design/architecting-a-scalable-real-time-learning-system-95623d27dd15