# Python for Data Analysis III




**Agenda:**

* cProfile
* Cython
* sklearn

Writing programs is fun, but making them fast can be a pain. Python programs are no exception, but the basic profiling toolchain is not complicated to use. Here, I would like to show you how to quickly profile and analyze your Python code to find which parts of it are worth optimizing.

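You can do profiling manually by timing pieces of code yourself, or you can use dedicated tools such as `cProfile` and `line_profiler`.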
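A minimal sketch of manual timing, using only `time.perf_counter` from the standard library. The function `sum_of_squares` is a hypothetical workload invented for this illustration; it is not part of the notebook's code.

```python
import time


def sum_of_squares(n):
    """Hypothetical workload, used only to have something to time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Record a timestamp before and after the call; the difference is the wall-clock time.
start = time.perf_counter()
result = sum_of_squares(200_000)
elapsed = time.perf_counter() - start

print(f"sum_of_squares took {elapsed:.4f} s")
```

This approach works for a quick check, but it only measures what you explicitly wrap, which is why a profiler is usually more convenient for finding hot spots.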

In [None]:
!pip3 install line_profiler


In [None]:
import numpy as np

In [None]:
%%writefile speedup.py

import random

class Matrix(list):
    @classmethod
    def zeros(cls, shape):
        n_rows, n_cols = shape
        return cls([[0] * n_cols for i in range(n_rows)])

    @classmethod
    def random(cls, shape):
        # Build a matrix of random integers in [-255, 255], one row at a time.
        M, (n_rows, n_cols) = cls(), shape
        for i in range(n_rows):
            M.append([random.randint(-255, 255) for j in range(n_cols)])
        return M

    @property
    def shape(self):
        return ((0, 0) if not self else (len(self), len(self[0])))
    
    
def dot_product(X, Y):
    # Naive triple-loop matrix product: Z[i][k] = sum over j of X[i][j] * Y[j][k].
    n_xrows, n_xcols = X.shape
    n_yrows, n_ycols = Y.shape
    Z = Matrix.zeros((n_xrows, n_ycols))
    for i in range(n_xrows):
        for j in range(n_xcols):
            for k in range(n_ycols):
                Z[i][k] += X[i][j] * Y[j][k]
    return Z

def bench(shape=(64, 64), n_iter=16):
    # Multiply two random matrices n_iter times so the profiler has enough work to measure.
    X = Matrix.random(shape)
    Y = Matrix.random(shape)
    for _ in range(n_iter):
        dot_product(X, Y)

if __name__ == "__main__":
    bench()

In [None]:
%%timeit
a1 = np.random.rand(3,2)
a2 = np.random.rand(2,3)
a1.dot(a2)

The `cProfile` module lets you profile Python code at the granularity of individual function and method calls:

In [None]:
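import cProfile

# Read the script and profile it; sort the report by time spent inside each function itself.
with open("speedup.py") as f:
    source = f.read()
cProfile.run(source, sort="tottime")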
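As an alternative to profiling a whole script, `cProfile` can also be driven programmatically around a single call. The sketch below assumes a hypothetical function `slow_sum` (not part of `speedup.py`) and collects the report into a string with `pstats`, sorted by `tottime` as above.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload, used only to give the profiler something to record."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Collect statistics only while the code of interest is running.
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Write the report into a string buffer, sorted by time spent inside each function.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("tottime")
stats.print_stats(5)  # limit the report to the first five entries
report = buffer.getvalue()
print(report)
```

Each row of the report lists the number of calls, the time spent in the function itself (`tottime`), and the cumulative time including the functions it calls (`cumtime`).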

In [None]:
%load_ext line_profiler

In [None]:
from speedup import dot_product, bench
%lprun -f dot_product bench()

## Scikit-Learn

`Scikit-Learn` is a library that implements a large number of machine learning algorithms.

We can separate learning problems into a few large categories:

1. supervised learning, in which the data comes with additional attributes that we want to predict. This problem can be either:

    - classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data.
    - regression: if the desired output consists of one or more continuous variables, then the task is called regression.

2. unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.



In general, a learning problem considers a set of `n` samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

This idea of first learning from known samples and then predicting properties of new samples is implemented in scikit-learn with two basic methods: `fit` and `predict`.
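To make the `fit`/`predict` convention concrete, here is a minimal sketch in plain Python. The class `NearestMeanClassifier` is a hypothetical toy written for this illustration; it is not a scikit-learn class, and the data is made up. It only mirrors the shape of the interface: `fit` learns from labeled samples, `predict` labels new ones.

```python
class NearestMeanClassifier:
    """Hypothetical toy estimator that follows the fit/predict convention."""

    def fit(self, X, y):
        # Learn one mean vector per class from the labeled samples.
        sums, counts = {}, {}
        for features, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(features))
            for idx, value in enumerate(features):
                acc[idx] += value
            counts[label] = counts.get(label, 0) + 1
        self.means_ = {
            label: [s / counts[label] for s in acc] for label, acc in sums.items()
        }
        return self

    def predict(self, X):
        # Assign each new sample to the class whose mean is closest.
        def sq_dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [
            min(self.means_, key=lambda label: sq_dist(features, self.means_[label]))
            for features in X
        ]


# Made-up training data: two features per sample, two classes.
X_train = [[1.0, 1.0], [1.5, 2.0], [8.0, 9.0], [9.0, 8.5]]
y_train = [0, 0, 1, 1]

model = NearestMeanClassifier().fit(X_train, y_train)
predictions = model.predict([[1.2, 1.4], [8.5, 9.2]])
print(predictions)
```

The scikit-learn estimators used later in this notebook are called in the same two steps, only with far more capable models behind them.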

In [None]:
import pandas as pd
df = pd.read_csv('Churn-Modelling.csv')
df.dropna(inplace=True)
df = df[['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Exited']]

In [None]:
df.head()

Most gradient-based methods are sensitive to the scale of the data. Therefore, before running such algorithms, the data is usually either normalized or standardized. Normalization rescales the numeric features so that each of them lies in the range from 0 to 1. Standardization rescales the features so that each of them has a mean of 0 and a variance of 1.

In [None]:
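from sklearn import preprocessing

# scale only the input features, not the 'Exited' target
features = df[['CreditScore', 'Age', 'Balance', 'EstimatedSalary']]
# normalize the data attributes to the range [0, 1]
# (preprocessing.normalize would instead rescale each row to unit norm)
normalized_df = preprocessing.minmax_scale(features)
# standardize the data attributes
standardized_df = preprocessing.scale(features) # Standardization isn't required for logistic regression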
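What the two transformations compute can be sketched by hand. The block below is a minimal illustration on a made-up list of ages (the values are not taken from `Churn-Modelling.csv`), using only the standard library. It assumes min-max scaling for normalization and z-scores with the population standard deviation for standardization.

```python
from statistics import mean, pstdev

# Made-up values of a single feature, for illustration only.
ages = [22, 35, 47, 58, 63]

# Normalization (min-max scaling): map the smallest value to 0 and the largest to 1.
lo, hi = min(ages), max(ages)
normalized = [(a - lo) / (hi - lo) for a in ages]

# Standardization (z-score): subtract the mean, divide by the standard deviation.
mu, sigma = mean(ages), pstdev(ages)
standardized = [(a - mu) / sigma for a in ages]

print([round(v, 3) for v in normalized])
print([round(v, 3) for v in standardized])
```

After normalization the values span exactly the interval from 0 to 1, and after standardization they have a mean of 0 and a variance of 1.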

#### Training set and testing set

Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice when evaluating an algorithm is to split the data at hand into two sets: the training set, on which we learn data properties, and the testing set, on which we test these properties.

In [None]:
from sklearn.model_selection import train_test_split
# By default, 25% of the rows are held out for testing.
train, test = train_test_split(df)

In [None]:
len(train), len(test)
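Conceptually, such a split is a shuffle followed by a cut. The sketch below shows the idea in plain Python; `split_train_test` is a hypothetical helper written for this illustration, not the scikit-learn implementation, and the 25% test fraction and the seed are assumptions chosen here.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Hypothetical helper: shuffle a copy of the rows, then cut it in two."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(20))  # stand-in for 20 data rows
train_rows, test_rows = split_train_test(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters: if the file happens to be sorted, taking the last rows as the test set would give a test set that does not resemble the training data.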

In [None]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(train[['CreditScore', 'Age', 'Balance', 'EstimatedSalary']], train['Exited'])
print(model)
# make predictions
expected = test['Exited']
predicted = model.predict(test[['CreditScore', 'Age', 'Balance', 'EstimatedSalary']])

In [None]:
# summarize the fit of the model
print(metrics.classification_report(expected, predicted))
print(metrics.confusion_matrix(expected, predicted))

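The `confusion_matrix()` function computes a confusion matrix and returns it as an array in which rows correspond to the expected classes and columns to the predicted classes.
For this run, the diagonal holds 1927 + 29 = 1956 correct predictions, and the off-diagonal entries hold 492 + 47 = 539 incorrect ones.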
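To show how such a matrix is built and read, here is a minimal sketch for the two-class case in plain Python. The helper `confusion_matrix_2x2` and the short label lists are hypothetical and made up for illustration; only the four counts in the last step come from the result quoted above.

```python
from collections import Counter


def confusion_matrix_2x2(expected, predicted):
    """Hypothetical helper: rows are expected classes, columns are predicted classes."""
    counts = Counter(zip(expected, predicted))
    return [
        [counts[(0, 0)], counts[(0, 1)]],
        [counts[(1, 0)], counts[(1, 1)]],
    ]


# Made-up labels, for illustration only.
expected = [0, 0, 0, 1, 1, 0, 1, 0]
predicted = [0, 0, 1, 0, 1, 0, 0, 0]

matrix = confusion_matrix_2x2(expected, predicted)
correct = matrix[0][0] + matrix[1][1]  # diagonal: predictions that match
accuracy = correct / len(expected)
print(matrix, accuracy)

# The same arithmetic applied to the counts quoted in the text.
run_correct = 1927 + 29
run_incorrect = 492 + 47
run_accuracy = run_correct / (run_correct + run_incorrect)
print(round(run_accuracy, 3))
```

Accuracy alone can hide poor performance on the rarer class, which is why the classification report printed above also lists per-class precision and recall.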

http://scikit-learn.org/stable/documentation.html