# Python for Data Analysis III




**Agenda:**

* cProfile
* Cython
* scikit-learn

Writing programs is fun, but making them fast can be a pain. Python programs are no exception, but the basic profiling toolchain is not complicated to use. Here, I would like to show you how to quickly profile and analyze your Python code to find which parts of it you should optimize.

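You can do profiling manually, by timing pieces of code yourself, or with dedicated tools such as `cProfile` and `line_profiler`, both of which are used in this section.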
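As a minimal sketch of what manual profiling can look like, the snippet below times a function with `time.perf_counter` and keeps the best of several runs. The workload `slow_sum_of_squares` and the helper `time_call` are hypothetical names introduced only for this illustration; they are not part of the notebook.

```python
import time


def slow_sum_of_squares(n):
    """Hypothetical workload, used only to have something to time."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def time_call(func, *args, repeat=5):
    """Run func(*args) `repeat` times; return (best elapsed seconds, last result)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = func(*args)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    return best, result


best, result = time_call(slow_sum_of_squares, 100_000)
print(f"best of 5 runs: {best:.6f} s")
```

Taking the minimum rather than the mean is a common choice here because it is less affected by unrelated activity on the machine. This approach tells you how long a call takes, but not where inside the call the time goes, which is what the profilers below are for.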

In [1]:
!pip3 install line_profiler


In [2]:
import numpy as np

In [3]:
%%writefile speedup.py

import random

class Matrix(list):
    @classmethod
    def zeros(cls, shape):
        n_rows, n_cols = shape
        return cls([[0] * n_cols for i in range(n_rows)])

    @classmethod
    def random(cls, shape):
        M, (n_rows, n_cols) = cls(), shape
        for i in range(n_rows):
            M.append([random.randint(-255, 255) for j in range(n_cols)])
        return M

    @property
    def shape(self):
        return ((0, 0) if not self else (len(self), len(self[0])))
    
    
def dot_product(X, Y):
    n_xrows, n_xcols = X.shape
    n_yrows, n_ycols = Y.shape
    Z = Matrix.zeros((n_xrows, n_ycols))
    for i in range(n_xrows):
        for j in range(n_xcols):
            for k in range(n_ycols):
                Z[i][k] += X[i][j] * Y[j][k]
    return Z

def bench(shape=(64, 64), n_iter=16):
    X = Matrix.random(shape)
    Y = Matrix.random(shape)
    for _ in range(n_iter):  # "_" avoids shadowing the built-in iter()
        dot_product(X, Y)

if __name__ == "__main__":
    bench()

Overwriting speedup.py


In [4]:
%%timeit
a1 = np.random.rand(3,2)
a2 = np.random.rand(2,3)
a1.dot(a2)

4.86 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


The `cProfile` module profiles Python code at the level of individual function and method calls:

In [5]:
import cProfile

# Read the script as text; cProfile.run executes it in __main__, so bench() runs
with open("speedup.py") as f:
    source = f.read()
cProfile.run(source, sort="tottime")

         41377 function calls in 2.356 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       16    2.330    0.146    2.331    0.146 <string>:22(dot_product)
     8192    0.008    0.000    0.017    0.000 random.py:170(randrange)
     8192    0.007    0.000    0.009    0.000 random.py:220(_randbelow)
     8192    0.004    0.000    0.020    0.000 random.py:214(randint)
      128    0.003    0.000    0.023    0.000 <string>:14(<listcomp>)
     8201    0.002    0.000    0.002    0.000 {method 'getrandbits' of '_random.Random' objects}
        1    0.001    0.001    2.356    2.356 {built-in method builtins.exec}
        1    0.001    0.001    2.355    2.355 <string>:32(bench)
     8192    0.001    0.000    0.001    0.000 {method 'bit_length' of 'int' objects}
       16    0.000    0.000    0.000    0.000 <string>:8(<listcomp>)
        2    0.000    0.000    0.023    0.012 <string>:10(random)
       16    0.000    0.000    0.000 
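The same kind of report can also be produced for a single call instead of a whole script. The sketch below uses `cProfile.Profile` together with `pstats.Stats` from the standard library; the functions `inner` and `outer` are made-up stand-ins for real work, not code from `speedup.py`.

```python
import cProfile
import io
import pstats


def inner(n):
    """Made-up workload: sum of squares below n."""
    return sum(i * i for i in range(n))


def outer():
    """Calls inner() repeatedly so that it shows up with ncalls > 1."""
    return [inner(1000) for _ in range(50)]


# Collect statistics only for the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

# Format the statistics into a string, sorted by internal time (tottime).
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("tottime").print_stats()
report = buffer.getvalue()
print(report)
```

In the report, `tottime` is the time spent in the function itself, while `cumtime` also includes the time spent in everything it calls. In the output above, that is why `bench` has a tiny `tottime` but a `cumtime` close to the total run time.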

In [6]:
%load_ext line_profiler

In [7]:
from speedup import dot_product, bench
%lprun -f dot_product bench()
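Once profiling points at the triple loop in `dot_product` as the hot spot, one possible remedy is to hand the multiplication to NumPy. The sketch below only checks that both approaches give the same result on small random integer matrices; `dot_product_loops` is a simplified copy of the loop written for this example, and it makes no claim about the size of the speedup.

```python
import numpy as np


def dot_product_loops(X, Y):
    """Pure-Python triple loop with the same loop order as dot_product."""
    n_xrows, n_xcols = len(X), len(X[0])
    n_ycols = len(Y[0])
    Z = [[0] * n_ycols for _ in range(n_xrows)]
    for i in range(n_xrows):
        for j in range(n_xcols):
            for k in range(n_ycols):
                Z[i][k] += X[i][j] * Y[j][k]
    return Z


# Small integer matrices in the same value range as Matrix.random.
rng = np.random.default_rng(0)
X = rng.integers(-255, 256, size=(8, 8))
Y = rng.integers(-255, 256, size=(8, 8))

Z_loops = dot_product_loops(X.tolist(), Y.tolist())
Z_numpy = X @ Y  # matrix product computed by NumPy

results_match = np.array_equal(np.array(Z_loops), Z_numpy)
print(results_match)
```

To compare the running times, each version can be measured separately with `%timeit` on matrices of the size used in `bench`.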

## Scikit-Learn

`Scikit-Learn` is a library that implements a large number of machine learning algorithms.

We can separate learning problems in a few large categories:

1. supervised learning, in which the data comes with additional attributes that we want to predict. This problem can be either:

    - classification: samples belong to two or more classes and we want to learn from already labeled data how to predict the class of unlabeled data.
    - regression: if the desired output consists of one or more continuous variables, then the task is called regression.

2. unsupervised learning, in which the training data consists of a set of input vectors x without any corresponding target values. The goal in such problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.



In general, a learning problem considers a set of `n` samples of data and then tries to predict properties of unknown data. If each sample is more than a single number and, for instance, a multi-dimensional entry (aka multivariate data), it is said to have several attributes or features.

This idea of first learning from known samples and then predicting new samples is implemented in scikit-learn with two basic methods: `fit` and `predict`.
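A minimal sketch of the `fit` / `predict` pattern on made-up data is given here. The toy arrays `X_known`, `y_known` and `X_new` are assumptions for illustration only: a single feature, where small values belong to class 0 and large values to class 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Known (labeled) samples: one feature per sample, one class label per sample.
X_known = np.array([[0.0], [1.0], [2.0], [3.0], [7.0], [8.0], [9.0], [10.0]])
y_known = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# fit: learn from the labeled samples.
model = LogisticRegression()
model.fit(X_known, y_known)

# predict: apply what was learned to samples the model has not seen.
X_new = np.array([[0.5], [9.5]])
predicted = model.predict(X_new)
print(predicted)
```

Most scikit-learn estimators follow this same two-step interface, so swapping `LogisticRegression` for another classifier usually leaves the rest of the code unchanged.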

In [8]:
import pandas as pd
df = pd.read_csv('Churn-Modelling.csv')
df.dropna(inplace=True)
df = df[['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Exited']]

In [9]:
df.head()

|   | CreditScore | Age | Balance | EstimatedSalary | Exited |
|---|---|---|---|---|---|
| 0 | 619 | 42 | 0.0 | 101348.88 | 1 |
| 1 | 608 | 41 | 83807.86 | 112542.58 | 0 |
| 3 | 699 | 39 | 0.0 | 93826.63 | 0 |
| 4 | 850 | 43 | 125510.82 | 79084.1 | 0 |
| 6 | 822 | 50 | 0.0 | 10062.8 | 0 |


Most gradient-based methods are sensitive to the scale of the data. Therefore, before running such algorithms, the data is usually either normalized or standardized. Normalization rescales the numerical attributes so that each of them lies in the range from 0 to 1. Standardization rescales each attribute so that it has a mean of 0 and a variance of 1.

In [10]:
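from sklearn import preprocessing
# normalize the data attributes: rescale each column to the range [0, 1]
# (preprocessing.normalize would instead rescale each row to unit norm)
normalized_df = preprocessing.minmax_scale(df)
# standardize the data attributes: zero mean and unit variance per column
standardized_df = preprocessing.scale(df) # Standardization isn't strictly required for logistic regression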
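The two transformations described above can also be written out by hand, which makes the formulas explicit. This is a sketch with NumPy on a small made-up array (the values are not taken from the churn data set): min-max normalization computes `(x - min) / (max - min)` per column, and standardization computes `(x - mean) / std` per column.

```python
import numpy as np

# Made-up values for two attributes on very different scales.
X = np.array([[600.0, 20.0],
              [700.0, 40.0],
              [850.0, 60.0]])

# Min-max normalization: each column is mapped onto the range [0, 1].
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_minmax = (X - col_min) / (col_max - col_min)

# Standardization: each column gets a mean of 0 and a variance of 1.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_std)
```

Both formulas break down for a constant column, where `max - min` and `std` are zero, so such columns need separate handling in hand-written code.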

#### Training set and testing set

Machine learning is about learning some properties of a data set and applying them to new data. This is why a common practice when evaluating an algorithm is to split the data at hand into two sets: the training set, on which we learn the data properties, and the testing set, on which we test them.

In [11]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df)  # holds out 25% of the rows by default

In [12]:
len(train), len(test)

(7483, 2495)
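The split above relies on the defaults of `train_test_split`, which hold out a quarter of the rows and shuffle them differently on every run. The sketch below, on made-up arrays, shows three parameters that are often set explicitly: `test_size` for the held-out fraction, `random_state` for a reproducible shuffle, and `stratify` to keep the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples with 2 features, 80 of class 0 and 20 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 80 + [1] * 20)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.25,   # fraction of samples held out for testing
    random_state=0,   # fixed seed so the split is reproducible
    stratify=y,       # keep the 80/20 class ratio in both parts
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())
```

Stratifying can matter for data like the churn set, where one class is much rarer than the other and an unlucky split could leave the test set with very few positive examples.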

In [13]:
train.head()

|   | CreditScore | Age | Balance | EstimatedSalary | Exited |
|---|---|---|---|---|---|
| 7965 | 625 | 51 | 124620.01 | 92243.94 | 1 |
| 1018 | 850 | 45 | 103909.86 | 60083.11 | 1 |
| 7182 | 692 | 49 | 110540.43 | 107472.99 | 0 |
| 8234 | 766 | 47 | 129289.98 | 169935.46 | 1 |
| 2281 | 848 | 40 | 148495.64 | 158853.98 | 0 |


In [14]:
from sklearn import metrics
from sklearn.linear_model import LogisticRegression # <-- our model

model = LogisticRegression(random_state=22)
model.fit(train[['CreditScore', 'Age', 'Balance', 'EstimatedSalary']], train['Exited'])
print(model)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=22, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)


In [15]:
# make predictions
expected = test['Exited']
predicted = model.predict(test[['CreditScore', 'Age', 'Balance', 'EstimatedSalary']])

In [16]:
len(predicted), len(test)

(2495, 2495)

In [17]:
# summarize the fit of the model
print(metrics.confusion_matrix(expected, predicted))

[[2005    4]
 [ 486    0]]


The `confusion_matrix()` function will calculate a confusion matrix and return the result as an array.
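The rows correspond to the true classes and the columns to the predicted classes. The result tells us that we have 2005 + 0 = 2005 correct predictions and 4 + 486 = 490 incorrect predictions: the model predicts class 0 for almost every customer and never correctly identifies a customer who exited.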
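To make the numbers in the matrix easier to interpret, the sketch below derives a few summary figures from it in plain Python. The matrix is copied from the output above; the variable names and the "always predict 0" baseline are introduced here only for comparison.

```python
# Confusion matrix from the output above:
# rows are the true classes, columns are the predicted classes.
cm = [[2005, 4],
      [486, 0]]

tn, fp = cm[0]  # true class 0: predicted 0 / predicted 1
fn, tp = cm[1]  # true class 1: predicted 0 / predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total          # share of correct predictions
recall_exited = tp / (tp + fn)        # share of exited customers that were found
baseline_accuracy = (tn + fp) / total # accuracy of always predicting class 0

print(f"accuracy: {accuracy:.3f}")
print(f"recall for Exited=1: {recall_exited:.3f}")
print(f"always-0 baseline: {baseline_accuracy:.3f}")
```

An accuracy of about 80% looks acceptable in isolation, but the recall of 0 for the `Exited` class and the slightly higher accuracy of the trivial baseline show that this model has not learned anything useful about churn.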

There is no need to describe the scikit-learn library in more detail: it is a collection of algorithms for solving machine learning problems. It does not solve your problems magically. As the example above shows, we cannot just put the data into the library and get an amazing result; you need to do some work before using it.

If you already know the algorithm or model for your particular problem, you can go to the scikit-learn documentation and find out how to use it:

http://scikit-learn.org/stable/documentation.html