----------------------------
#### Model persistence
--------------------------

#### Let us understand the `pickle` module of Python

- `pickle` is part of the standard library, so it ships with every Python installation.
- `Pickling` is `serializing` a Python object into a byte stream.
- `Unpickling` is the opposite: `de-serializing` a byte stream back into a Python object.

- Pickling is used to `store python objects`, such as lists, dictionaries, class instances, and more.

- Examples
    - `data analysis`, where you perform routine tasks on the data, such as pre-processing.
        - Save intermediate results (such as dictionaries) for later use.

    - saving a `trained machine learning` model.

        - We train the algorithm once, store the fitted model in a variable (an object), and pickle it so it can be reloaded later without retraining.

In [1]:
import pickle

In [2]:
example_dict = {1:"6",
                2:"2",
                3:"f"}

# "wb" opens the file for writing in binary mode; the with-block closes it automatically
with open("dict.pickle", "wb") as pickle_out:
    pickle.dump(example_dict, pickle_out)

In [3]:
# "rb" opens the file for reading in binary mode; the with-block closes it automatically
with open("dict.pickle", "rb") as pickle_in:
    example_dict = pickle.load(pickle_in)

In [4]:
example_dict

{1: '6', 2: '2', 3: 'f'}

- Inside the Python pickle module
    - The main interface of the `pickle` module consists of four functions:

    - `pickle.dump(obj, file, protocol=None, *, fix_imports=True, buffer_callback=None)`
    - `pickle.dumps(obj, protocol=None, *, fix_imports=True, buffer_callback=None)`
    - `pickle.load(file, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)`
    - `pickle.loads(bytes_object, *, fix_imports=True, encoding="ASCII", errors="strict", buffers=None)`
    
The first two functions are used during the pickling process, and the other two are used during unpickling. 

The only `difference between dump() and dumps()` is that the first writes the serialization result to an open binary file, whereas the second returns it as a `bytes` object.

To differentiate dumps() from dump(), it’s helpful to remember that the `s` at the end of the function name stands for `string`. The name dates from Python 2, where the result was a byte string; in Python 3 it is a `bytes` object.

The same concept also applies to load() and loads(): the first one reads from an open binary file to start the unpickling process, and the second one operates on a `bytes` object.
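
The file-based and in-memory variants can be compared side by side. The following is a minimal sketch, not taken from the original notebook: the `record` dictionary is made-up data, and an `io.BytesIO` buffer stands in for a file opened in binary mode.

```python
import io
import pickle

record = {"name": "sample", "values": [1, 2, 3]}  # illustrative data only

# dumps() returns the serialized form directly as a bytes object
as_bytes = pickle.dumps(record)

# dump() writes the serialized form into a binary file-like object
buffer = io.BytesIO()
pickle.dump(record, buffer)

# loads() reads from a bytes object
from_bytes = pickle.loads(as_bytes)

# load() reads from a binary file-like object (rewind it first)
buffer.seek(0)
from_file = pickle.load(buffer)

print(type(as_bytes).__name__)
print(from_bytes == record, from_file == record)
```

Both routes should give back an object equal to `record`; which one to use depends only on whether the bytes are headed for a file or for something else, such as a database field or a network message.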

One more example, this time pickling an instance of a custom class:

In [5]:
class example_class:
    a_number = 35
    a_string = "hey"
    a_list   = [1, 2, 3]
    a_dict   = {"first": "a", "second": 2, "third": [1, 2, 3]}
    a_tuple  = (22, 23)

my_object = example_class()

my_pickled_object = pickle.dumps(my_object)  # Pickling the object
print(f"This is my pickled object:\n{my_pickled_object}\n")

my_object.a_dict = None

my_unpickled_object = pickle.loads(my_pickled_object)  # Unpickling the object
print(f"This is a_dict of the unpickled object:\n{my_unpickled_object.a_dict}\n")

This is my pickled object:
b'\x80\x04\x95!\x00\x00\x00\x00\x00\x00\x00\x8c\x08__main__\x94\x8c\rexample_class\x94\x93\x94)\x81\x94.'

This is a_dict of the unpickled object:
{'first': 'a', 'second': 2, 'third': [1, 2, 3]}
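
The pickled bytes above contain only the module and class name, not the attribute values. That is because `a_number`, `a_dict` and the others are class attributes: pickle stores a reference to the class plus the instance's own state, and the unpickled object reads `a_dict` from the class again. The sketch below illustrates the difference with a hypothetical `Settings` class (not part of the original notebook) that has one class attribute and one instance attribute.

```python
import pickle


class Settings:  # hypothetical class, for illustration only
    shared = "class-level"  # class attribute: stored on the class itself

    def __init__(self):
        self.owned = "instance-level"  # instance attribute: stored in self.__dict__


original = Settings()
payload = pickle.dumps(original)

# The instance's own state is serialized; class-level values are not
owned_in_payload = b"instance-level" in payload
shared_in_payload = b"class-level" in payload
print(owned_in_payload, shared_in_payload)

# A change to the class made after pickling is visible on the restored object,
# because the restored object looks the attribute up on the current class
Settings.shared = "changed later"
restored = pickle.loads(payload)
print(restored.owned)
print(restored.shared)
```

In practice this means the class definition must be importable wherever the pickle is loaded, and values meant to travel with the object belong in instance attributes.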




#### Example ML using KNN
-------------------------

In [6]:
import pandas as pd
import numpy as np

from sklearn.neighbors import KNeighborsClassifier

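#### Load data

The text below describes the original Wisconsin Breast Cancer database (699 instances, 9 features plus an id number). Note that `load_breast_cancer()`, used in the code further down, loads the related Wisconsin Diagnostic dataset (569 instances, 30 features), so these counts do not describe the data the model is actually trained on.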

    4. Relevant Information:
       Samples arrive periodically as Dr. Wolberg reports his clinical cases.
       The database therefore reflects this chronological grouping of the data.
       This grouping information appears immediately below, having been removed
       from the data itself:
         Group 1: 367 instances (January 1989)
         Group 2:  70 instances (October 1989)
         Group 3:  31 instances (February 1990)
         Group 4:  17 instances (April 1990)
         Group 5:  48 instances (August 1990)
         Group 6:  49 instances (Updated January 1991)
         Group 7:  31 instances (June 1991)
         Group 8:  86 instances (November 1991)
         -----------------------------------------
         Total:   699 points (as of the donated database on 15 July 1992)

       Note that the results summarized above in Past Usage refer to a dataset
       of size 369, while Group 1 has only 367 instances.  This is because it
       originally contained 369 instances; 2 were removed.  The following
       statements summarize changes to the original Group 1's set of data:

       #####  Group 1 : 367 points: 200B 167M (January 1989)
       #####  Revised Jan 10, 1991: Replaced zero bare nuclei in 1080185 & 1187805
       #####  Revised Nov 22,1991: Removed 765878,4,5,9,7,10,10,10,3,8,1 no record
       #####                  : Removed 484201,2,7,8,8,4,3,10,3,4,1 zero epithelial
       #####                  : Changed 0 to 1 in field 6 of sample 1219406
       #####                  : Changed 0 to 1 in field 8 of following sample:
       #####                  : 1182404,2,3,1,1,1,2,0,1,1,1

    5. Number of Instances: 699 (as of 15 July 1992)
    6. Number of Attributes: 10 plus the class attribute
    7. Attribute Information: (class attribute has been moved to last column)
       #  Attribute                     Domain
       -- -----------------------------------------
       1. Sample code number            id number
       2. Clump Thickness               1 - 10
       3. Uniformity of Cell Size       1 - 10
       4. Uniformity of Cell Shape      1 - 10
       5. Marginal Adhesion             1 - 10
       6. Single Epithelial Cell Size   1 - 10
       7. Bare Nuclei                   1 - 10
       8. Bland Chromatin               1 - 10
       9. Normal Nucleoli               1 - 10
      10. Mitoses                       1 - 10
      11. Class:                        (2 for benign, 4 for malignant)

    8. Missing attribute values: 16
       There are 16 instances in Groups 1 to 6 that contain a single missing 
       (i.e., unavailable) attribute value, now denoted by "?".  
    9. Class distribution:

       Benign: 458 (65.5%)
       Malignant: 241 (34.5%)

In [30]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
import pickle

In [28]:
# Load the Breast Cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

In [29]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [31]:
# Train a k-nearest neighbors classifier (default n_neighbors=5)
model = KNeighborsClassifier()
model.fit(X_train, y_train)

KNeighborsClassifier()

In [32]:
# Save the trained model to a file using pickle
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

In [33]:
# Load the model back from the pickle file
with open('model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)

In [34]:
# Make predictions on the test set using the loaded model
predictions = loaded_model.predict(X_test)


In [35]:
# Evaluate the model
accuracy = (predictions == y_test).mean()
print("Accuracy:", accuracy)

Accuracy: 0.956140350877193
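
A quick way to gain confidence in the persistence step is to check that the restored model predicts exactly what the original does. The sketch below is one possible check, not part of the original notebook: it retrains the same classifier, round-trips it through `pickle.dumps`/`pickle.loads` (in-memory bytes stand in for `model.pkl`), and compares predictions with `np.array_equal`. The names `restored` and `same_predictions` are illustrative.

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Same data and split as in the cells above
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = KNeighborsClassifier().fit(X_train, y_train)

# Serialize to bytes and restore, instead of going through a file on disk
restored = pickle.loads(pickle.dumps(model))

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(model.predict(X_test), restored.predict(X_test))
print("Predictions identical after round trip:", same_predictions)
```

This check only covers a round trip within one environment. The scikit-learn documentation advises loading a pickled model with the same library version that was used to save it, so a model file kept for the long term is best stored together with a record of that version.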
