# Profiling

Profiling helps to find redundant calculations. The cell below defines two small functions that we will profile:

In [2]:
import pandas as pd
import numpy as np
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('./ml-1m/users.dat', sep='::', header=None, names=unames, engine='python')
rnames = ['user_id', 'movie_id', 'rating', 'timestamp']
ratings = pd.read_table('./ml-1m/ratings.dat', sep='::', header=None, names=rnames, engine='python')
mnames = ['movie_id', 'title', 'genres']
movies = pd.read_table('./ml-1m/movies.dat', sep='::', header=None, names=mnames, engine='python')
data = pd.merge(pd.merge(ratings, users), movies)

def top_movies(dataFrame, usr):
    user = dataFrame[dataFrame.user_id == usr]
    max_i = user.rating.max()
    return user[user.rating == max_i].title

def compareTopMovies(data, usr1, usr2):
    movi1 = top_movies(data, usr1).values
    movi2 = top_movies(data, usr2).values
    hits = np.intersect1d(movi1, movi2)
    return hits

# Top movies for user 1
print(top_movies(data, 1))
# Top movies shared by two users
print(compareTopMovies(data, 1, 2))

0        One Flew Over the Cuckoo's Nest (1975)
4201                       Bug's Life, A (1998)
8222                             Ben-Hur (1959)
8926                  Christmas Story, A (1983)
12759               Beauty and the Beast (1991)
15859                Sound of Music, The (1965)
19503                         Awakenings (1990)
23270                 Back to the Future (1985)
25853                   Schindler's List (1993)
28501                         Pocahontas (1995)
37204            Last Days of Disco, The (1998)
37339                         Cinderella (1950)
40375                          Apollo 13 (1995)
41626                          Toy Story (1995)
43703                           Rain Man (1988)
49748                       Mary Poppins (1964)
50759                              Dumbo (1941)
52255                Saving Private Ryan (1998)
Name: title, dtype: object
["One Flew Over the Cuckoo's Nest (1975)"]


In [3]:
# Profile the comparison of user 1 with each of the first 200 users
%prun -D compare.prof {x:compareTopMovies(data,1,x) for x in users.user_id[:200] if x!=1}

 
*** Profile stats marshalled to file 'compare.prof'. 


In [5]:
import pstats
stats = pstats.Stats('compare.prof')
stats.sort_stats('cumtime').print_stats(50) #50 rows

Mon Oct 16 17:57:41 2017    compare.prof

         660884 function calls (647352 primitive calls) in 2.714 seconds

   Ordered by: cumulative time
   List reduced from 246 to 50 due to restriction <50>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    2.714    2.714 {built-in method builtins.exec}
        1    0.000    0.000    2.713    2.713 <string>:1(<module>)
        1    0.003    0.003    2.713    2.713 <string>:1(<dictcomp>)
      199    0.011    0.000    2.710    0.014 <ipython-input-2-a74443488279>:16(compareTopMovies)
      398    0.023    0.000    2.664    0.007 <ipython-input-2-a74443488279>:11(top_movies)
     2389    0.056    0.000    1.379    0.001 /usr/local/lib/python3.5/site-packages/pandas/core/frame.py:1940(__getitem__)
      796    0.013    0.000    1.110    0.001 /usr/local/lib/python3.5/site-packages/pandas/core/frame.py:1983(_getitem_array)
      796    0.031    0.000    1.082    0.001 /usr/local/lib/python3.5

<pstats.Stats at 0x1159b08d0>
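Outside IPython, the same dump-and-inspect workflow can be sketched with the standard library's `cProfile` and `pstats`. The functions `top_items` and `compare_top`, the `scores` dictionary, and the file name `toy_compare.prof` are hypothetical stand-ins for illustration, not part of the notebook above.

```python
import cProfile
import os
import pstats
import tempfile


def top_items(scores, user):
    """Return the items that `user` rated highest (toy stand-in for top_movies)."""
    best = max(scores[user].values())
    return {item for item, score in scores[user].items() if score == best}


def compare_top(scores, user_a, user_b):
    """Return the top items shared by two users (toy stand-in for compareTopMovies)."""
    return top_items(scores, user_a) & top_items(scores, user_b)


# Hypothetical ratings for 20 users; the values are made up for the sketch.
scores = {u: {"A": (u % 3) + 3, "B": 5, "C": (u % 2) + 4} for u in range(1, 21)}


def run_comparisons():
    """Compare user 1 with every other user, like the %prun cell does."""
    return {u: compare_top(scores, 1, u) for u in scores if u != 1}


# Profile the call and write the raw statistics to a file.
profiler = cProfile.Profile()
hits = profiler.runcall(run_comparisons)
prof_path = os.path.join(tempfile.mkdtemp(), "toy_compare.prof")
profiler.dump_stats(prof_path)

# Load the file again and print the five most expensive entries by cumulative time.
stats = pstats.Stats(prof_path)
stats.sort_stats("cumtime").print_stats(5)

# Stats.stats maps (filename, line number, function name) to
# (primitive calls, total calls, tottime, cumtime, callers).
calls = {key[2]: value[1] for key, value in stats.stats.items()}
print(calls["compare_top"], calls["top_items"])
```

In this toy run, `top_items` is called twice for every `compare_top` call, which is the same pattern that the `ncalls` column shows for `top_movies` and `compareTopMovies`.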

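The profile shows that `top_movies` is called twice for every call to `compareTopMovies` (398 calls for 199 comparisons). One of those two calls always recomputes the top movies of user 1, so that work is repeated needlessly.
Let's look at these functions line by line.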

### Line Profiler

`pip install line_profiler`

`%lprun` shows how long each line of a function took to run. Every function to be profiled this way must be passed by name with `-f`.

In [7]:
%load_ext line_profiler

In [8]:
%lprun?
%lprun -f top_movies top_movies(data,1) 

In [9]:
%lprun -f compareTopMovies compareTopMovies(data,1,2)
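The `%lprun` magic wraps the `LineProfiler` class, which can also be used from a plain script. The following is a minimal sketch, assuming `line_profiler` is installed; the function `sum_of_squares` is a hypothetical example, and the code falls back to an unprofiled call when the package is missing.

```python
try:
    # Third-party package: pip install line_profiler
    from line_profiler import LineProfiler
except ImportError:
    LineProfiler = None


def sum_of_squares(values):
    """Toy function with a loop, so that per-line timings differ."""
    total = 0
    for v in values:
        total += v * v
    return total


if LineProfiler is not None:
    profiler = LineProfiler()
    # Calling the profiler on a function returns a wrapped, line-timed version.
    profiled = profiler(sum_of_squares)
    result = profiled(range(1000))
    # Print a per-line report (hits, time, and share of time for each line).
    profiler.print_stats()
else:
    # Assumption: without line_profiler we just run the function unprofiled.
    result = sum_of_squares(range(1000))

print(result)
```

Registering the function explicitly plays the same role as passing it with `-f` to `%lprun`.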

### Memory profiler


Now let's take a look at memory profiling.

        pip install psutil
        pip install memory-profiler

In [12]:
##pip install psutil
##pip install memory-profiler
%load_ext memory_profiler

`%mprun` shows how much memory a function uses, line by line. Let's take a look at the `compareTopMovies` function that we profiled with `%prun`, except this time we're interested in incremental memory usage rather than execution time. NOTE: `%mprun` can only be used on functions defined in physical files, not on functions defined in the IPython session.

In [14]:
%mprun?
#clear all variables
#%reset 
import pandasExample
%mprun -f pandasExample.test pandasExample.test()

UsageError: Could not find function 'pandasExample.test'.
AttributeError: module 'pandasExample' has no attribute 'test'
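To illustrate the note about physical files, here is a minimal sketch that writes a function to a module on disk and imports it, so that a profiler can locate its source. The module name `mprun_target` and the function `build_squares` are hypothetical and are not the `pandasExample` module used in the notebook.

```python
import importlib
import inspect
import sys
import tempfile
from pathlib import Path

# Source code of a hypothetical module that will live in a real file.
MODULE_SOURCE = '''
def build_squares(n):
    """Allocate a list so that a memory profiler has something to measure."""
    squares = [i * i for i in range(n)]
    return sum(squares)
'''

# Write the module into a temporary directory and make it importable.
module_dir = Path(tempfile.mkdtemp())
(module_dir / "mprun_target.py").write_text(MODULE_SOURCE)
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
mprun_target = importlib.import_module("mprun_target")

# The function now has a source file on disk, which line-based profilers need.
source_file = inspect.getsourcefile(mprun_target.build_squares)
result = mprun_target.build_squares(1000)
print(source_file, result)

# In IPython, with memory_profiler loaded, the call would then look like:
#   %mprun -f mprun_target.build_squares mprun_target.build_squares(1000)
```

The function passed with `-f` has to be an attribute that actually exists on the imported module, which is what the `AttributeError` above is complaining about.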

`%memit` shows how much memory a statement uses overall. It works a lot like `%timeit`, except that the number of iterations is set with `-r` instead of `-n`.

In [6]:
%memit -r 3  pandasExample.test()

0        One Flew Over the Cuckoo's Nest (1975)
4201                       Bug's Life, A (1998)
8222                             Ben-Hur (1959)
8926                  Christmas Story, A (1983)
12759               Beauty and the Beast (1991)
15859                Sound of Music, The (1965)
19503                         Awakenings (1990)
23270                 Back to the Future (1985)
25853                   Schindler's List (1993)
28501                         Pocahontas (1995)
37204            Last Days of Disco, The (1998)
37339                         Cinderella (1950)
40375                          Apollo 13 (1995)
41626                          Toy Story (1995)
43703                           Rain Man (1988)
49748                       Mary Poppins (1964)
50759                              Dumbo (1941)
52255                Saving Private Ryan (1998)
Name: title, dtype: object
["One Flew Over the Cuckoo's Nest (1975)"]
0        One Flew Over the Cuckoo's Nest (1975)
4201              
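Where `memory_profiler` is not available, the standard library's `tracemalloc` can give a rough overall figure in the spirit of `%memit`. This is a swapped-in technique, not what `%memit` does internally: `tracemalloc` tracks allocations made through Python's memory allocator, while `memory_profiler` measures the memory of the whole process. The function `build_table` is a hypothetical workload.

```python
import tracemalloc


def build_table(n_rows):
    """Hypothetical workload: build a list of small rows."""
    return [[row, row * 2, str(row)] for row in range(n_rows)]


# Start tracing, run the workload, then read current and peak traced memory.
tracemalloc.start()
table = build_table(50_000)
current_bytes, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current: {current_bytes / 1024:.1f} KiB, peak: {peak_bytes / 1024:.1f} KiB")
```

The absolute numbers will differ from `%memit` output, so this is best used to compare two versions of the same function rather than to report total memory use.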

## Challenges

1. Change the Sieve of Eratosthenes implementation so that it performs better. Hint: use NumPy arrays and boolean filters.

2. Change the function `compareTopMovies` to get better performance by removing redundant work. Hint: reuse results instead of recalculating them.