# Part 6: Optimizing Code for Large Datasets

Prepare the datasets:

In [2]:
!tar -xJf data/data_del_06.tar.xz -C ./data/

In [3]:
import pandas as pd
import numpy as np

Sources:
- [A Beginner’s Guide to Optimizing Pandas Code for Speed](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)
- [How to Optimize your Pandas Code](https://kanoki.org/2019/01/09/how-to-optimize-your-pandas-code/)
- [Optimization tricks](http://ehneilsen.net/notebook/pandasExamples/pandas_examples.html#orgheadline36)
- [Enhancing performance](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html)
- [Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects](https://realpython.com/fast-flexible-pandas/)
- [Optimizing Code Performance On Large Datasets](https://app.dataquest.io/course/improving-code-performance)
- [4 Unique Methods to Optimize your Python Code for Data Science](https://www.analyticsvidhya.com/blog/2019/09/4-methods-optimize-python-code-data-science/)
- [High-Performance Pandas: eval() and query()](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html#pandas.eval()-for-Efficient-Operations)

**Code optimization, in simple terms, means reducing the number of operations to execute any task while producing the correct results.**

## CPU Bound Programs

### Bounds vs Limitations

In [1]:
# Bounds are soft limits (processor, network); limitations are hard limits (memory, disk)

<img alt="I/O bounds" src="images/CPU+and+I_O+bounds.png">

### Primer optimizacije

In [4]:
import numpy as np

# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles
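
As a quick sanity check of the formula, here is a minimal sketch. The test coordinates are illustrative assumptions, not rows from the hotel dataset. With the 3959-mile Earth radius used above, one degree of latitude along a meridian should come out to roughly 69.1 miles, and the distance from a point to itself should be zero.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in degrees."""
    MILES = 3959  # Earth radius in miles, as in the notebook
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Distance from the reference point to itself.
same_point = haversine(40.671, -73.985, 40.671, -73.985)

# One degree due north (hypothetical coordinates chosen for easy arithmetic).
one_degree_north = haversine(40.0, -74.0, 41.0, -74.0)

# Arc length of one degree on a sphere of radius 3959 miles.
expected = 3959 * np.pi / 180
print(same_point, one_degree_north, expected)
```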

In [5]:
df = pd.read_csv('data/new_york_hotels.csv')

In [6]:
df.head()

Unnamed: 0,ean_hotel_id,name,address1,city,state_province,postal_code,latitude,longitude,star_rating,high_rate,low_rate
0,269955,Hilton Garden Inn Albany/SUNY Area,1389 Washington Ave,Albany,NY,12206,42.68751,-73.81643,3.0,154.0272,124.0216
1,113431,Courtyard by Marriott Albany Thruway,1455 Washington Avenue,Albany,NY,12206,42.68971,-73.82021,3.0,179.01,134.0
2,108151,Radisson Hotel Albany,205 Wolf Rd,Albany,NY,12205,42.7241,-73.79822,3.0,134.17,84.16
3,254756,Hilton Garden Inn Albany Medical Center,62 New Scotland Ave,Albany,NY,12208,42.65157,-73.77638,3.0,308.2807,228.4597
4,198232,CrestHill Suites SUNY University Albany,1415 Washington Avenue,Albany,NY,12206,42.68873,-73.81854,3.0,169.39,89.39


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1631 entries, 0 to 1630
Data columns (total 11 columns):
ean_hotel_id      1631 non-null int64
name              1631 non-null object
address1          1631 non-null object
city              1631 non-null object
state_province    1631 non-null object
postal_code       1631 non-null object
latitude          1631 non-null float64
longitude         1631 non-null float64
star_rating       1630 non-null float64
high_rate         1631 non-null float64
low_rate          1631 non-null float64
dtypes: float64(5), int64(1), object(5)
memory usage: 140.2+ KB


#### Crude looping over DataFrame rows using indices

In [9]:
# We go from the worst (but most intuitive) approach to the best

# Define a function to manually loop over all rows and return a series of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        d = haversine(40.671, -73.985, df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list

In [10]:
%%timeit
# Run the haversine looping function
df['distance'] = haversine_looping(df)

# Not exactly fast. Python-level for loops are slow: a single loop over n rows is O(n) and a nested loop is O(n²),
# and every iteration here also pays for df.iloc[i], which builds a new Series for the row.

1.18 s ± 106 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Looping with iterrows()

In [11]:
%%timeit
# Haversine applied on rows via iteration
haversine_series = []

for index, row in df.iterrows():
    haversine_series.append(haversine(40.671, -73.985, row['latitude'], row['longitude']))

df['distance'] = haversine_series

327 ms ± 6.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Looping with apply()

In [13]:
%%timeit
# apply is generally the most efficient of the row-wise looping options shown here (if we really have to loop)
# Timing apply on the Haversine function
df['distance'] = df.apply(lambda row: haversine(40.671, -73.985, row['latitude'], row['longitude']), axis=1)


133 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### Vectorization with Pandas series

In [16]:
%%timeit 
# Vectorization means the operation runs on whole arrays at once, so the processor can handle several elements in parallel

# We pass entire columns; NumPy takes care of the individual rows

# Vectorized implementation of Haversine applied on Pandas series
df['distance'] = haversine(40.671, -73.985, df['latitude'], df['longitude'])



4.11 ms ± 894 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### Vectorization with NumPy arrays

In [17]:
# Working directly with NumPy arrays is the fastest option here; within Python it is hard to go much faster than this
type(df['latitude'].values)

numpy.ndarray

In [18]:
%%timeit
# Vectorized implementation of Haversine applied on NumPy arrays

df['distance'] = haversine(40.671, -73.985, df['latitude'].values, df['longitude'].values)




484 µs ± 3.35 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


This brings us to a few basic conclusions on optimizing Pandas code:
1. Avoid loops; they’re slow and, in most common use cases, unnecessary.
2. If you must loop, use apply(), not iteration functions.
3. Vectorization is usually better than scalar operations. Most common operations in Pandas can be vectorized.
4. Vector operations on NumPy arrays are more efficient than on native Pandas series.
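
To illustrate the last point, here is a minimal sketch of what changes when a column is handed to NumPy as a plain array. It uses `Series.to_numpy()`, which does the same job as the `.values` attribute used above. The three coordinates are taken from the `df.head()` output; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# First three latitudes from the hotel data shown earlier.
df = pd.DataFrame({"latitude": [42.68751, 42.68971, 42.7241]})

lat_series = df["latitude"]            # pandas Series: values plus an index
lat_array = df["latitude"].to_numpy()  # plain NumPy array: values only

# The same ufunc works on both, but the Series result carries the index along,
# which is the extra bookkeeping the NumPy version avoids.
on_series = np.deg2rad(lat_series)
on_array = np.deg2rad(lat_array)

print(type(on_series).__name__, type(on_array).__name__)
```

Both results hold the same numbers; only the container differs.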

## I/O Bound Programs

### I/O Bounds

<img src="./images/report_assembly.png">

<img src="./images/report_assembly_bidir.png">

I/O bound tasks are tasks where:
- Our program is reading from an input (like a CSV file).
- Our program is writing to an output (like a text file).
- Our program is waiting for another program to execute something (like a SQL query).
- Our program is waiting for another server to execute something (like an API request).
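
One way to recognize an I/O bound task is to compare wall-clock time with CPU time. The sketch below is a simulation under stated assumptions: `simulated_io_task` is a hypothetical stand-in that only sleeps, mimicking a program that waits on a query or an API response.

```python
import time

def simulated_io_task(seconds):
    """Hypothetical stand-in for a blocking call (SQL query, API request)."""
    time.sleep(seconds)

wall_start = time.perf_counter()  # wall-clock timer
cpu_start = time.process_time()   # CPU time used by this process

simulated_io_task(0.2)

wall_elapsed = time.perf_counter() - wall_start
cpu_elapsed = time.process_time() - cpu_start

# While waiting, the process uses almost no CPU, so cpu_elapsed stays far below wall_elapsed.
print(f"wall: {wall_elapsed:.3f} s, cpu: {cpu_elapsed:.3f} s")
```

A CPU bound task would show the opposite pattern: CPU time close to wall-clock time.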



### Profiling an I/O bound task

In [24]:
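# Python is not the best language for multithreading, but it can still be done
# One thread can, for example, wait for data from a database while another runs; in standard CPython, however,
# two threads still cannot execute Python code at the same time because of the global interpreter lock (GIL)

# Here we first look at how to do this from scratch; in practice we use higher-level libraries for multithreading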

In [27]:
# Read the teams that are currently active

query = '''
SELECT DISTINCT teamID 
FROM Teams 
INNER JOIN TeamsFranchises ON Teams.franchID == TeamsFranchises.franchID 
WHERE TeamsFranchises.active = 'Y';
'''

In [28]:
import cProfile
import sqlite3

conn = sqlite3.connect("data/lahman2015.sqlite")

cur = conn.cursor()
teams = [row[0] for row in cur.execute(query).fetchall()]

In [29]:
print(teams)

['BSN', 'CHN', 'CN2', 'PT1', 'SL4', 'NY1', 'PHI', 'BR3', 'PIT', 'BRO', 'CIN', 'SLN', 'BLA', 'BOS', 'CHA', 'CLE', 'DET', 'MLA', 'PHA', 'WS1', 'SLA', 'NYA', 'ML1', 'BAL', 'KC1', 'LAN', 'SFN', 'LAA', 'MIN', 'WS2', 'HOU', 'NYN', 'CAL', 'ATL', 'OAK', 'KCA', 'SE1', 'MON', 'SDN', 'ML4', 'TEX', 'SEA', 'TOR', 'COL', 'FLO', 'ANA', 'TBA', 'ARI', 'MIL', 'WAS', 'MIA']


In [33]:
# Team by team, we compute the SUM of the HR column
# cProfile shows how much time each part of the program takes
import cProfile
import sqlite3

query = "SELECT SUM(HR) FROM Batting WHERE teamId=?"
conn = sqlite3.connect("data/lahman2015.sqlite")
cur = conn.cursor()

def calculate_runs(teams):
    home_runs = []
    for team in teams:
        runs = cur.execute(query, [team]).fetchall()
        runs = runs[0][0]
        home_runs.append(runs)
    return home_runs

In [32]:
%%timeit
home_runs = calculate_runs(teams)

117 ms ± 6.71 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [34]:
profile_string = "home_runs = calculate_runs(teams)"

In [36]:
# Almost all of the run time went into reading from the database (the `execute` calls)
cProfile.run(profile_string)

         157 function calls in 0.136 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.136    0.136 <ipython-input-33-c218e1ee9338>:10(calculate_runs)
        1    0.000    0.000    0.136    0.136 <string>:1(<module>)
        1    0.000    0.000    0.136    0.136 {built-in method builtins.exec}
       51    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       51    0.135    0.003    0.135    0.003 {method 'execute' of 'sqlite3.Cursor' objects}
       51    0.001    0.000    0.001    0.000 {method 'fetchall' of 'sqlite3.Cursor' objects}
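
The report above is ordered by "standard name". For longer programs it is often easier to sort by cumulative time. A minimal sketch using `cProfile.Profile` together with `pstats` follows; the in-memory table and team IDs are hypothetical stand-ins, not the Lahman data.

```python
import cProfile
import io
import pstats
import sqlite3

# Hypothetical miniature Batting table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Batting (teamID TEXT, HR INTEGER)")
conn.executemany("INSERT INTO Batting VALUES (?, ?)",
                 [("AAA", 1), ("BBB", 2), ("AAA", 3)])

def total_home_runs(team_ids):
    cur = conn.cursor()
    totals = []
    for team in team_ids:
        rows = cur.execute("SELECT SUM(HR) FROM Batting WHERE teamID=?", [team]).fetchall()
        totals.append(rows[0][0])
    return totals

# Profile only the call we care about.
profiler = cProfile.Profile()
profiler.enable()
totals = total_home_runs(["AAA", "BBB"])
profiler.disable()

# Write the report to a string, sorted by cumulative time, top 5 entries only.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```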




### Blocking Tasks

```python
51    0.120    0.002    0.120    0.002 {method 'execute' of 'sqlite3.Cursor' objects}
```

```python
conn = sqlite3.connect(':memory:')
```

In [37]:
# This is for demonstration only; it is not how things are usually done
import sqlite3

# Create an in memory database.
memory = sqlite3.connect(':memory:')

# Connect to our disk database.
disk = sqlite3.connect('data/lahman2015.sqlite')

# Create a query that will read the contents of the disk database 
# into another database.
dump = ''.join(line for line in disk.iterdump())

# Run the query to copy the database from disk into memory.
memory.executescript(dump)

cur = memory.cursor()
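
As an alternative to building one large SQL script with `iterdump()`, the standard library's `sqlite3.Connection.backup()` (Python 3.7+) copies a database directly into another connection. The sketch below is an assumption-laden illustration: the source is a tiny in-memory database with a hypothetical `Batting` table standing in for `data/lahman2015.sqlite`.

```python
import sqlite3

# Hypothetical source database (stand-in for the on-disk file).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE Batting (teamID TEXT, HR INTEGER)")
source.executemany("INSERT INTO Batting VALUES (?, ?)",
                   [("BOS", 10), ("NYA", 12), ("BOS", 7)])
source.commit()

# Copy the whole database into a fresh in-memory connection.
memory = sqlite3.connect(":memory:")
source.backup(memory)

# Queries now run against the in-memory copy.
cur = memory.cursor()
boston_total = cur.execute(
    "SELECT SUM(HR) FROM Batting WHERE teamID=?", ["BOS"]
).fetchall()[0][0]
print(boston_total)
```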

In [40]:
profile_string = "home_runs = calculate_runs(teams)"
cProfile.run(profile_string)

         157 function calls in 0.067 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.067    0.067 <ipython-input-39-0ca12a0b9230>:6(calculate_runs)
        1    0.000    0.000    0.067    0.067 <string>:1(<module>)
        1    0.000    0.000    0.067    0.067 {built-in method builtins.exec}
       51    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       51    0.066    0.001    0.066    0.001 {method 'execute' of 'sqlite3.Cursor' objects}
       51    0.000    0.000    0.000    0.000 {method 'fetchall' of 'sqlite3.Cursor' objects}




In [38]:
dump[:1000]

'BEGIN TRANSACTION;CREATE TABLE AllstarFull (\nplayerID TEXT,\nyearID INTEGER,\ngameNum INTEGER,\ngameID TEXT,\nteamID TEXT,\nlgID TEXT,\nGP INTEGER,\nstartingPos INTEGER\n);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1955,0,\'NLS195507120\',\'ML1\',\'NL\',1,NULL);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1956,0,\'ALS195607100\',\'ML1\',\'NL\',1,NULL);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1957,0,\'NLS195707090\',\'ML1\',\'NL\',1,9);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1958,0,\'ALS195807080\',\'ML1\',\'NL\',1,9);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1959,1,\'NLS195907070\',\'ML1\',\'NL\',1,9);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1959,2,\'NLS195908030\',\'ML1\',\'NL\',1,9);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1960,1,\'ALS196007110\',\'ML1\',\'NL\',1,9);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1960,2,\'ALS196007130\',\'ML1\',\'NL\',1,9);INSERT INTO "AllstarFull" VALUES(\'aaronha01\',1961,1,\'NLS196107110\',\'ML1\',\'NL\',1,NULL

In [41]:
import cProfile
import sqlite3

query = "SELECT SUM(HR) FROM Batting WHERE teamId=?"

def calculate_runs(teams):
    home_runs = []
    for team in teams:
        runs = cur.execute(query, [team]).fetchall()
        runs = runs[0][0]
        home_runs.append(runs)
    return home_runs

In [42]:
profile_string = "home_runs = calculate_runs(teams)"
cProfile.run(profile_string)

         157 function calls in 0.060 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.060    0.060 <ipython-input-41-0ca12a0b9230>:6(calculate_runs)
        1    0.000    0.000    0.060    0.060 <string>:1(<module>)
        1    0.000    0.000    0.060    0.060 {built-in method builtins.exec}
       51    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
       51    0.059    0.001    0.059    0.001 {method 'execute' of 'sqlite3.Cursor' objects}
       51    0.000    0.000    0.000    0.000 {method 'fetchall' of 'sqlite3.Cursor' objects}




### Parallel Execution

<img src="./images/single_threaded.png">

What if, instead, we could run several queries at once? It might look like this:

<img src="./images/multi_threaded.png">

We can use the Python 3 [threading library](https://docs.python.org/3/library/threading.html) to implement threading in our programs.

In [43]:
def task(team):
    print(team)

In [46]:
import threading

thread = threading.Thread(target=task,args=(teams,))
thread.start()


['BSN', 'CHN', 'CN2', 'PT1', 'SL4', 'NY1', 'PHI', 'BR3', 'PIT', 'BRO', 'CIN', 'SLN', 'BLA', 'BOS', 'CHA', 'CLE', 'DET', 'MLA', 'PHA', 'WS1', 'SLA', 'NYA', 'ML1', 'BAL', 'KC1', 'LAN', 'SFN', 'LAA', 'MIN', 'WS2', 'HOU', 'NYN', 'CAL', 'ATL', 'OAK', 'KCA', 'SE1', 'MON', 'SDN', 'ML4', 'TEX', 'SEA', 'TOR', 'COL', 'FLO', 'ANA', 'TBA', 'ARI', 'MIL', 'WAS', 'MIA']


In [48]:
def task(team):
    print(team)
    

for n, team in enumerate(teams):
    thread = threading.Thread(target=task, args=(team,))
    thread.start()
    print('Started task', n)
    

# Threads do not wait for one another to finish, so their output is sometimes interleaved

BSNStarted task 0

CHN
Started task 1
CN2
Started task 2
PT1Started task 3
SL4
Started task 4
NY1
Started task 5

PHI
Started task 6
BR3
Started task 7
PIT
Started task 8
BRO
Started task 9
CINStarted task 10
SLN
Started task 11

BLA
Started task 12
BOSStarted task 13
CHA
Started task 14

CLE
Started task 15
DETStarted task 16
MLA
Started task 17
PHA
Started task 18

WS1
Started task 19
SLAStarted task 20

NYA
Started task 21
ML1Started task 22

BAL
Started task 23
KC1Started task 24
LAN
Started task 25
SFN
Started task 26
LAA
Started task 27
MIN
Started task 28
WS2
Started task 29
HOU
Started task 30

NYN
Started task 31
CALStarted task 32

ATL
Started task 33
OAKStarted task 34

KCA
Started task 35
SE1Started task 36

MON
Started task 37
SDN
Started task 38
ML4Started task 39

TEX
Started task 40
SEA
Started task 41
TOR
Started task 42
COLStarted task 43
FLO
Started task 44
ANA
Started task 45

TBA
Started task 46
ARIStarted task 47
MIL
Started task 48
WAS
Started task 49

MIA
Starte

### Thread Blocking

<img src="./images/three_threads.svg">

In [52]:
# We start each thread, and it sleeps for 3 seconds before printing. That is why every "Started task"
# line is printed first, and only then the results of task().

# The main loop starts a thread, does not wait for its 3 s sleep, and moves on to start the next task.

# In the end the whole run took 3 s plus a few ms, instead of 51 x 3 s (one sleep per team).

import threading
import time

def task(team):
    time.sleep(3)
    print(team)
    
for n, team in enumerate(teams):
    thread = threading.Thread(target=task, args=(team,))
    thread.start()
    print('Started task', n)

Started task 0
Started task 1
Started task 2
Started task 3
Started task 4
Started task 5
Started task 6
Started task 7
Started task 8
Started task 9
Started task 10
Started task 11
Started task 12
Started task 13
Started task 14
Started task 15
Started task 16
Started task 17
Started task 18
Started task 19
Started task 20
Started task 21
Started task 22
Started task 23
Started task 24
Started task 25
Started task 26
Started task 27
Started task 28
Started task 29
Started task 30
Started task 31
Started task 32
Started task 33
Started task 34
Started task 35
Started task 36
Started task 37
Started task 38
Started task 39
Started task 40
Started task 41
Started task 42
Started task 43
Started task 44
Started task 45
Started task 46
Started task 47
Started task 48
Started task 49
Started task 50
BSN
CHN
NY1
CN2
SL4
PT1
PHI
BR3
PIT
BRO
CIN
SE1
OAK
CAL
SFN
KC1
ML1
PHA
SLN
ML4ATL
WS1
BOS
NYA
CHA
MLA
HOU
TEX
COL
CLE
TBA
BLA
MON
WS2
FLO
MIL
LAA
KCA
BAL
DET
SEA
SLA
MIN
NYN
TOR
WAS
LAN
SDN
ARI

### Joining Threads

```python
t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))
t3 = threading.Thread(target=task, args=(team,))

# Start the first three threads
t1.start()
t2.start()
t3.start()

t1.join() # Wait until t1 finishes.
t2.join() # Wait until t2 finishes.  If it already finished, then keep going.
t3.join() # Wait until t3 finishes.  If it already finished, then keep going.
```
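
A runnable version of the same idea, as a minimal sketch: `task` here just sleeps (an assumed stand-in for a blocking call), and timing the run shows that joined threads overlap their waiting instead of adding it up.

```python
import threading
import time

def task(seconds):
    time.sleep(seconds)  # stand-in for a blocking I/O call

start = time.perf_counter()

threads = [threading.Thread(target=task, args=(0.2,)) for _ in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()  # block until this thread has finished

elapsed = time.perf_counter() - start
all_finished = not any(thread.is_alive() for thread in threads)

# Five 0.2 s sleeps run concurrently, so this is close to 0.2 s rather than 1 s.
print(f"elapsed: {elapsed:.2f} s, all finished: {all_finished}")
```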

<img src="./images/Screenshot from 2019-06-29 16-40-25.png">

In [57]:
# Batches of 5 teams at a time
# The output is still not clean. We also need a lock: while one thread holds it,
# the other threads must wait before they can enter the locked section.
# This matters whenever threads work with shared resources.


def task(team):
    print(team)

for i in range(11):
    team_names = teams[i*5: (i+1) * 5]
    threads = []
    for team in team_names:
        thread = threading.Thread(target=task, args=(team,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    print("Finished batch {}".format(i)) 

BSN
CHN
CN2
PT1
SL4
Finished batch 0
NY1
PHIBR3
PIT
BRO

Finished batch 1
CIN
SLN
BLA
BOS
CHA
Finished batch 2
CLEDET
MLA
PHA

WS1
Finished batch 3
SLA
NYA
ML1
BAL
KC1
Finished batch 4
LAN
SFN
LAA
MIN
WS2
Finished batch 5
HOU
NYN
CAL
ATL
OAK
Finished batch 6
KCA
SE1
MON
SDN
ML4
Finished batch 7
TEX
SEA
TOR
COL
FLO
Finished batch 8
ANA
TBA
ARIMIL

WAS
Finished batch 9
MIA
Finished batch 10


### Locking

It's important to be aware of accessing shared resources when you're working with threads. Some examples of shared resources are:
- The system stdout.
- SQL databases.
- APIs.
- Objects in memory.

```python

lock = threading.Lock()

def task(team):
    lock.acquire()
    # This code cannot be executed until a thread acquires the lock.
    print(team)
    lock.release()

t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))

t1.start()
t2.start()

```
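
A lock can also be used as a context manager, which releases it even if the protected code raises an exception. The sketch below applies this to a shared in-memory object instead of stdout; the `totals` dictionary and the `increment` function are illustrative names, not part of the original notebook.

```python
import threading

lock = threading.Lock()
totals = {"count": 0}  # shared object in memory

def increment(times):
    for _ in range(times):
        with lock:  # acquired here, released automatically on leaving the block
            totals["count"] += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(totals["count"])
```

Because every read-modify-write of the counter happens while holding the lock, no update is lost.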

In [56]:
# We also call sys.stdout.flush() to empty the buffer so that everything is printed

import threading
import time
import sys

lock = threading.Lock()

def task(team):
    lock.acquire()
    print(team)
    sys.stdout.flush()
    lock.release()
    
for i in range(11):
    team_names = teams[i*5: (i+1) * 5]
    threads = []
    for team in team_names:
        thread = threading.Thread(target=task, args=(team,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    print("Finished batch {}".format(i))   

BSN
CHN
CN2
PT1
SL4
Finished batch 0
NY1
PHI
BR3
PIT
BRO
Finished batch 1
CIN
SLN
BLA
BOS
CHA
Finished batch 2
CLE
DET
MLA
PHA
WS1
Finished batch 3
SLA
NYA
ML1
BAL
KC1
Finished batch 4
LAN
SFN
LAA
MIN
WS2
Finished batch 5
HOU
NYN
CAL
ATL
OAK
Finished batch 6
KCA
SE1
MON
SDN
ML4
Finished batch 7
TEX
SEA
TOR
COL
FLO
Finished batch 8
ANA
TBA
ARI
MIL
WAS
Finished batch 9
MIA
Finished batch 10


### Thread Safety

In general, these operations are not thread safe:
- Modifying data in memory.
- Writing to a file.
- Adding data to a database.
- Modifying data via API.


In [58]:
import cProfile
import sqlite3
import threading
import sys

In [59]:
query = "SELECT DISTINCT teamID from Teams inner join TeamsFranchises on Teams.franchID == TeamsFranchises.franchID where TeamsFranchises.active = 'Y';"

In [60]:
conn = sqlite3.connect("data/lahman2015.sqlite", check_same_thread=False)

In [61]:
cur = conn.cursor()
teams = [row[0] for row in cur.execute(query).fetchall()]

query = "SELECT SUM(HR) FROM Batting WHERE teamId=?"
lock = threading.Lock()

In [63]:
# Print the home runs per team
# The threads overlap the time spent waiting on database reads

def calculate_runs(team):
    cur = conn.cursor()
    runs = cur.execute(query, [team]).fetchall()
    runs = runs[0][0]
    lock.acquire()
    print(team, ':', runs)
    sys.stdout.flush()
    lock.release()
    return runs


threads = []

for team in teams:
    thread = threading.Thread(target=calculate_runs, args=(team,))
    thread.start()
    threads.append(thread)
    
for thread in threads:
    thread.join()

BSN : 3424
CHN : 13530
CN2 : 267
PT1 : 54
SL4 : 305
NY1 : 5777
PHI : 12503
BR3 : 143
PIT : 10878
BRO : 4336
SLN : 11157
BLA : 57
CIN : 12383
BOS : 12883
CHA : 10792
DET : 13160
MLA : 26
PHA : 3502
CLE : 12333
WS1 : 2786
SLA : 3014
ML1 : 2230
NYA : 15218
BAL : 9592
KC1 : 1480
LAN : 7601
SFN : 8348
MIN : 7393
WS2 : 1387
LAA : 2276
HOU : 6536
CAL : 3912
NYN : 6817
ATL : 7535
OAK : 7438
KCA : 5613
SE1 : 125
MON : 4381
SDN : 5648
ML4 : 3664
TEX : 7055
SEA : 5976
TOR : 6415
COL : 4120
FLO : 2816
ANA : 1324
TBA : 2823
ARI : 2987
MIL : 3160
MIA : 474
WAS : 2002
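
Note that `threading.Thread` discards the return value of its target, so the `return runs` above is never seen by the caller. In practice a library such as the standard `concurrent.futures` module is a common way to collect results. The sketch below is a simulation: `fetch_total` and the `FAKE_TOTALS` lookup are hypothetical stand-ins for the per-team SQL query (the three numbers are copied from the output above).

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical lookup standing in for the database.
FAKE_TOTALS = {"BOS": 12883, "NYA": 15218, "DET": 13160}

def fetch_total(team):
    time.sleep(0.05)  # simulate waiting on the database
    return team, FAKE_TOTALS[team]

# The pool starts the threads, joins them, and hands back the return values.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(fetch_total, FAKE_TOTALS))

for team, runs in results:
    print(team, ":", runs)
```

`pool.map` returns results in input order, so no lock is needed around the printing, which happens in the main thread.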


## Optimizing Python Code with pandas

### Basic Looping

### Select columns and rows efficiently


In [64]:
data = pd.read_csv('data/school.csv')
data.head(3)

Unnamed: 0,School ID,School Name,Building Code,Street Address,City,State,Zip Code
0,02M260,Clinton School Writers and Artists,M933,425 West 33rd Street,Manhattan,NY,10001
1,06M211,Inwood Early College for Health and Informatio...,M052,650 Academy Street,Manhattan,NY,10002
2,01M539,"New Explorations into Science, Technology and ...",M022,111 Columbia Street,Manhattan,NY,10002


In [65]:
data['City'].value_counts().head(10)

Brooklyn               121
Bronx                  118
Manhattan              106
Jamaica                 13
Long Island City        12
Staten Island           10
Flushing                 8
Astoria                  6
Elmhurst                 5
Springfield Gardens      4
Name: City, dtype: int64

In [66]:
# save the top cities in a list
top_cities = ['Brooklyn','Bronx','Manhattan','Jamaica','Long Island City']

In [68]:
def rename(city, topCity):
    # Replace any city other than topCity with 'other' and return the result
    if city != topCity:
        city = 'other'
    return city

In [72]:
%%timeit
# Gregor's solution: keep the top cities, replace everything else
data['City'] = data['City'].where(data['City'].isin(top_cities), 'Others')



1.15 ms ± 101 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [74]:
%%timeit

# Leon's approach: boolean mask with .loc

data.loc[~data['City'].isin(top_cities), 'City'] = 'Others'



3.76 ms ± 44 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [76]:
%%timeit
# Bad approach: chained indexing, which triggers the SettingWithCopyWarning below and is much slower

data['City'][(data['City'].isin(top_cities)==False)] = 'Others'


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


60.7 ms ± 7.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [77]:
data.City.value_counts()

Brooklyn            121
Bronx               118
Manhattan           106
Others               65
Jamaica              13
Long Island City     12
Name: City, dtype: int64
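
The two fast variants above produce the same result. A minimal sketch on a tiny Series (the five city names are illustrative, picked from the value counts above):

```python
import pandas as pd

cities = pd.Series(["Brooklyn", "Bronx", "Astoria", "Manhattan", "Flushing"])
top_cities = ["Brooklyn", "Bronx", "Manhattan"]

is_top = cities.isin(top_cities)  # boolean mask, computed once

# Series.where keeps values where the mask is True and replaces the rest.
via_where = cities.where(is_top, "Others")

# Boolean-mask assignment through .loc (on a copy) does the same thing in place.
via_loc = cities.copy()
via_loc.loc[~is_top] = "Others"

print(via_where.tolist())
```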

In [78]:
data = pd.read_csv('data/school.csv')

In [None]:
%%timeit
# Bad practice


In [None]:
data.City.value_counts()

### Using built-in functions

### Joining on indexes is faster than joining on columns

Construct some sample data:

In [79]:
n = 100000

i1 = np.arange(n)
np.random.shuffle(i1)
df1 = pd.DataFrame({'i': i1,
                    'j': np.random.randint(1,1000,n),
                    'k': np.random.randint(1,1000,n)})

i2 = np.arange(n)
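# NOTE: this call shuffles i1 again, not i2, so df2's keys stay sorted (see df2.head() below);
# use np.random.shuffle(i2) to shuffle them as well
np.random.shuffle(i1)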
df2 = pd.DataFrame({'i': i2,
                    'm': np.random.randint(1,1000,n),
                    'n': np.random.randint(1,1000,n)})

In [80]:
df1.head()

Unnamed: 0,i,j,k
0,53326,297,864
1,61538,161,263
2,17899,119,996
3,20886,244,374
4,3765,638,172


In [81]:
df2.head()

Unnamed: 0,i,m,n
0,0,256,590
1,1,981,491
2,2,237,58
3,3,141,846
4,4,195,452


In [82]:
%%timeit
# We want to merge the two datasets
df1.merge(df2, on='i')




18.6 ms ± 537 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [83]:
df1.head(10)

Unnamed: 0,i,j,k
0,53326,297,864
1,61538,161,263
2,17899,119,996
3,20886,244,374
4,3765,638,172
5,59961,156,709
6,54548,743,795
7,16800,549,272
8,10987,19,358
9,70925,85,603


In [85]:
# Sometimes, when both frames share the same keys, it is better to merge on the index
df1 = df1.set_index('i')
df2 = df2.set_index('i')

In [87]:
%%timeit
df1.merge(df2, left_index = True, right_index = True)




9.32 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
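
Before relying on the faster variant, it is worth confirming that both merges return the same rows. A minimal sketch on smaller random frames (the sizes, seed, and frame names are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
left = pd.DataFrame({"i": rng.permutation(n), "j": rng.integers(1, 1000, n)})
right = pd.DataFrame({"i": rng.permutation(n), "m": rng.integers(1, 1000, n)})

# Merge on a column.
on_column = left.merge(right, on="i")

# Merge on the index after moving the key into it.
on_index = left.set_index("i").merge(
    right.set_index("i"), left_index=True, right_index=True
)

# Bring both results to the same layout before comparing.
a = on_column.sort_values("i").reset_index(drop=True)
b = (on_index.rename_axis("i")  # make sure the index is named 'i'
             .reset_index()
             .sort_values("i")
             .reset_index(drop=True))

print(a.head())
```

Keep in mind that `set_index` itself takes time, so the index-based merge pays off mainly when the index is set once and reused.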



## EXAMPLE: Speeding up pandas code

Source:
- [Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects](https://realpython.com/fast-flexible-pandas/)

### Task

### Data preparation

In [88]:
# We will compute electricity costs.
# The dataset holds electricity consumption per hour.
# Every hour has its own price. We use three tariffs: peak, shoulder, and off-peak.
# We process the data so that each hour is charged at the correct rate.

import pandas as pd

In [89]:
df = pd.read_csv('data/demand_profile.csv')

In [90]:
df.head()

Unnamed: 0,date_time,energy_kwh
0,1/1/13 0:00,0.586
1,1/1/13 1:00,0.58
2,1/1/13 2:00,0.572
3,1/1/13 3:00,0.596
4,1/1/13 4:00,0.592


In [91]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
date_time     8760 non-null object
energy_kwh    8760 non-null float64
dtypes: float64(1), object(1)
memory usage: 137.0+ KB


In [96]:
# date_time is read as object; we have to convert it to datetime
df.dtypes

date_time     datetime64[ns]
energy_kwh           float64
dtype: object

In [97]:
type(df.iloc[0, 0])

pandas._libs.tslibs.timestamps.Timestamp

In [98]:
df['date_time'] = pd.to_datetime(df['date_time'])
df['date_time'].dtype

dtype('<M8[ns]')

In [99]:
df.head()

Unnamed: 0,date_time,energy_kwh
0,2013-01-01 00:00:00,0.586
1,2013-01-01 01:00:00,0.58
2,2013-01-01 02:00:00,0.572
3,2013-01-01 03:00:00,0.596
4,2013-01-01 04:00:00,0.592


In [100]:
def convert(df, column_name):
    return pd.to_datetime(df[column_name])

df = pd.read_csv('data/demand_profile.csv')
df_coverted = df.copy()

In [101]:
%%timeit -r 3 -n 10
df_coverted['date_time'] = convert(df, 'date_time')

1.44 s ± 22.2 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [102]:
def convert_with_format(df, column_name):
    return pd.to_datetime(df[column_name], format='%d/%m/%y %H:%M')

In [103]:
%%timeit -r 3 -n 10
df_coverted['date_time'] = convert_with_format(df, 'date_time')

60.5 ms ± 1.16 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


In [104]:
1440/60.5


23.801652892561982
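
Besides the speedup, an explicit `format` also removes ambiguity about day and month order. A minimal sketch with two made-up timestamps in the same layout as the dataset:

```python
import pandas as pd

# Illustrative strings (not rows from demand_profile.csv).
raw = pd.Series(["1/2/13 0:00", "13/1/13 5:00"])

# With an explicit format pandas does not have to guess the layout of each string:
# %d/%m/%y means day first, then month, then two-digit year.
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")

print(parsed)
```

With this format `"1/2/13"` is read as 1 February 2013, not 2 January.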

In [105]:
df_coverted.head()

Unnamed: 0,date_time,energy_kwh
0,2013-01-01 00:00:00,0.586
1,2013-01-01 01:00:00,0.58
2,2013-01-01 02:00:00,0.572
3,2013-01-01 03:00:00,0.596
4,2013-01-01 04:00:00,0.592


In [106]:
df_coverted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
date_time     8760 non-null datetime64[ns]
energy_kwh    8760 non-null float64
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB


### 1) Simple Looping Over Pandas Data

<table class="table table-hover">
<thead>
<tr>
<th>Tariff Type</th>
<th>Cents per kWh</th>
<th>Time Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>Peak</td>
<td>28</td>
<td>17:00 to 24:00</td>
</tr>
<tr>
<td>Shoulder</td>
<td>20</td>
<td>7:00 to 17:00</td>
</tr>
<tr>
<td>Off-Peak</td>
<td>12</td>
<td>0:00 to 7:00</td>
</tr>
</tbody>
</table>

In [108]:
# If the price were 28 cents in every hour
df_test = df_coverted.copy()
df_test['cost_cents'] = df['energy_kwh'] * 28

In [109]:
df_test.head()

Unnamed: 0,date_time,energy_kwh,cost_cents
0,2013-01-01 00:00:00,0.586,16.408
1,2013-01-01 01:00:00,0.58,16.24
2,2013-01-01 02:00:00,0.572,16.016
3,2013-01-01 03:00:00,0.596,16.688
4,2013-01-01 04:00:00,0.592,16.576


In [110]:
sum(df_test['cost_cents'])/100

1603.150920000002

In [111]:
# Because the price is not constant, we have to assign a rate to every hour

def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""    
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

In [115]:
# NOTE: Don't do this!
def apply_tariff_loop(df):
    """Calculate costs in loop.  Modifies `df` inplace."""
    energy_cost_list = []
    for i in range(len(df)):
        # Get electricity used and hour of day
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost_list.append(apply_tariff(energy_used, hour))
    df['cost_cents'] = energy_cost_list

In [116]:
%%timeit -r 3 -n 10
apply_tariff_loop(df_coverted)

4.73 s ± 6.05 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


### 2) Looping with .itertuples() and .iterrows()

In [117]:
for index, row in df[:5].iterrows():
    print(index)
    print(row)
    print('energy_kwh' ,row['energy_kwh'])
    print('-------')

0
date_time     1/1/13 0:00
energy_kwh          0.586
Name: 0, dtype: object
energy_kwh 0.586
-------
1
date_time     1/1/13 1:00
energy_kwh           0.58
Name: 1, dtype: object
energy_kwh 0.58
-------
2
date_time     1/1/13 2:00
energy_kwh          0.572
Name: 2, dtype: object
energy_kwh 0.5720000000000001
-------
3
date_time     1/1/13 3:00
energy_kwh          0.596
Name: 3, dtype: object
energy_kwh 0.596
-------
4
date_time     1/1/13 4:00
energy_kwh          0.592
Name: 4, dtype: object
energy_kwh 0.5920000000000001
-------


In [120]:

def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        energy_cost_list.append(apply_tariff(row['energy_kwh'], row['date_time'].hour))
    df['cost_cents'] = energy_cost_list

In [121]:
%%timeit
apply_tariff_iterrows(df_coverted)

1.39 s ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
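
The heading also mentions `.itertuples()`, which yields lightweight namedtuples instead of building a Series per row. A minimal sketch on three made-up rows follows; `rate_for_hour` is a compact, hypothetical restatement of the tariff table, not the notebook's `apply_tariff`.

```python
import pandas as pd

df = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2013-01-01 00:00", "2013-01-01 08:00", "2013-01-01 18:00"]
    ),
    "energy_kwh": [0.586, 0.580, 0.572],
})

def rate_for_hour(hour):
    """Cents per kWh: off-peak before 7:00, shoulder before 17:00, peak after."""
    if hour < 7:
        return 12
    if hour < 17:
        return 20
    return 28

# Each row is a namedtuple, so columns are read as attributes.
costs = [
    row.energy_kwh * rate_for_hour(row.date_time.hour)
    for row in df.itertuples(index=False)
]
print(costs)
```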


### 3) Pandas’ .apply()

In [126]:
def apply_tariff_withapply(df):

    df['cost_cents'] = df.apply(lambda row: apply_tariff(row['energy_kwh'], row['date_time'].hour), axis=1)

In [127]:
%%timeit
apply_tariff_withapply(df_coverted)

388 ms ± 2.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### 4) Selecting Data With .isin()

In [128]:
df_coverted = df.copy()
df_coverted['date_time'] = convert_with_format(df, 'date_time')
df_coverted.set_index('date_time', inplace=True)

In [129]:
df_coverted.head()

Unnamed: 0_level_0,energy_kwh
date_time,Unnamed: 1_level_1
2013-01-01 00:00:00,0.586
2013-01-01 01:00:00,0.58
2013-01-01 02:00:00,0.572
2013-01-01 03:00:00,0.596
2013-01-01 04:00:00,0.592


In [130]:
def apply_tariff_isin(df):
    peak_hours = df.index.hour.isin(range(17,24))
    shoulder_hours = df.index.hour.isin(range(7,17))
    off_peak_hours = df.index.hour.isin(range(0,7))
    
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours, 'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours, 'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12





In [131]:
%%timeit -r 3 -n 10
apply_tariff_isin(df_coverted)

8.86 ms ± 1.13 ms per loop (mean ± std. dev. of 3 runs, 10 loops each)


### 5) Pandas’ pd.cut() function

In [133]:
# pd.cut splits values into discrete intervals
# bins are the hour boundaries of the tariffs; labels are the price for each interval


df_coverted = df.copy()
df_coverted['date_time'] = convert_with_format(df, 'date_time')
df_coverted.set_index('date_time', inplace=True)

> **[pandas.cut](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html)**
- `pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')`
- Bin values into discrete intervals.
- Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

In [135]:
# what pd.cut does with the given bins
pd.cut(x = df_coverted.index.hour,
      bins = [0,7,17,24],
      right = False,  # left-closed bins [0,7), [7,17), [17,24), so hours 7 and 17 start the next band
      include_lowest = True,
      labels = [12,20,28])

[12, 12, 12, 12, 12, ..., 28, 28, 28, 28, 28]
Length: 8760
Categories (3, int64): [12 < 20 < 28]
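How `pd.cut` treats the bin edges matters for the boundary hours 7 and 17. The sketch below uses a handful of hand-picked hours (an illustrative assumption, not the dataset) to compare the default right-closed bins with left-closed bins (`right=False`).

```python
import pandas as pd

hours = [0, 6, 7, 16, 17, 23]   # illustrative boundary hours
bins = [0, 7, 17, 24]
rates = [12, 20, 28]

# Default right=True: intervals are [0, 7], (7, 17], (17, 24]
right_closed = pd.cut(hours, bins=bins, include_lowest=True,
                      labels=rates).astype(int).tolist()

# right=False: intervals are [0, 7), [7, 17), [17, 24)
left_closed = pd.cut(hours, bins=bins, right=False,
                     labels=rates).astype(int).tolist()

print(right_closed)  # [12, 12, 12, 20, 20, 28]
print(left_closed)   # [12, 12, 20, 20, 28, 28]
```

Only the left-closed variant assigns hour 7 to the shoulder rate and hour 17 to the peak rate, which is what `range(7,17)` and `range(17,24)` express in the `.isin()` version.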

In [136]:
def apply_tariff_cut(df):
    
    cents_per_kwh = pd.cut(x = df.index.hour,  # use the argument, not the global df_coverted
      bins = [0,7,17,24],
      right = False,  # left-closed bins, matching the hour ranges of the .isin() version
      include_lowest = True,
      labels = [12,20,28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']
    
    

In [137]:
%%timeit -r 3 -n 10
apply_tariff_cut(df_coverted)

3.81 ms ± 268 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)


### 6) Using NumPy

In [138]:
import numpy as np

In [139]:
df_coverted = df.copy()
df_coverted['date_time'] = convert_with_format(df, 'date_time')
df_coverted.set_index('date_time', inplace=True)

> **[numpy.digitize](https://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html)**
- `numpy.digitize(x, bins, right=False)`
- Return the indices of the bins to which each value in input array belongs.

In [141]:
# np.digitize is NumPy's counterpart to pd.cut

np.digitize(df_coverted.index.hour.values, bins=[7,17,24], right=False)



array([0, 0, 0, ..., 2, 2, 2])

In [142]:
def apply_tariff_digitize(df):
    prices = np.array([12,20,28])
    bins = np.digitize(df.index.hour.values, bins = [7,17,24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values

In [143]:
%%timeit -r 3 -n 10
apply_tariff_digitize(df_coverted)

1.16 ms ± 133 µs per loop (mean ± std. dev. of 3 runs, 10 loops each)
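The digitize-then-index trick is easier to follow on a few values. A small sketch, with hand-picked hours as an illustrative assumption:

```python
import numpy as np

hours = np.array([0, 6, 7, 16, 17, 23])   # illustrative boundary hours
prices = np.array([12, 20, 28])

# With right=False (the default) index i means bins[i-1] <= x < bins[i];
# values below bins[0] get index 0
bin_idx = np.digitize(hours, bins=[7, 17, 24])
print(bin_idx.tolist())   # [0, 0, 1, 1, 2, 2]

# Fancy indexing turns each bin index into its price in one step
rates = prices[bin_idx]
print(rates.tolist())     # [12, 12, 20, 20, 28, 28]
```

Since the bin index is used directly as a position in `prices`, no labels or categorical conversion are needed, which is part of why this variant is the fastest of the three.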


### Prevent Reprocessing with HDFStore

[What is HDF5](https://portal.hdfgroup.org/display/knowledge/What+is+HDF5)

In [145]:
# usually, when we write a dataset to CSV, we lose the column dtypes and similar metadata
# to keep them (and to read the data back faster) we can use HDFStore, which stores the data in a format that preserves them
# the dataset is saved exactly as it is
# when we read it back, the data is the same as it was in the DataFrame

df_coverted.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2013-01-01 00:00:00 to 2013-12-31 23:00:00
Data columns (total 2 columns):
energy_kwh    8760 non-null float64
cost_cents    8760 non-null float64
dtypes: float64(2)
memory usage: 205.3 KB


In [146]:
# Create a storage object backed by the file `data/OUT_processed_data.h5`
data_store = pd.HDFStore('data/OUT_processed_data.h5')

# Put DataFrame into the object setting the key as 'preprocessed_df'
data_store['preprocessed_df'] = df_coverted
data_store.close()

In [147]:
data_store = pd.HDFStore('data/OUT_processed_data.h5')

# Retrieve data using key
preprocessed_df = data_store['preprocessed_df']
data_store.close()

In [148]:
preprocessed_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2013-01-01 00:00:00 to 2013-12-31 23:00:00
Data columns (total 2 columns):
energy_kwh    8760 non-null float64
cost_cents    8760 non-null float64
dtypes: float64(2)
memory usage: 205.3 KB
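For contrast, a small sketch of what a CSV round trip does to the same kind of frame. It works in memory via `io.StringIO`, and the three-row frame is made up for illustration. Note that `pd.HDFStore` itself needs the PyTables package installed; the sketch below does not.

```python
import io
import pandas as pd

# Hypothetical miniature of the frame above: a DatetimeIndex and one float column
idx = pd.to_datetime(['2013-01-01 00:00', '2013-01-01 01:00', '2013-01-01 02:00'])
original = pd.DataFrame({'energy_kwh': [0.586, 0.58, 0.572]}, index=idx)
original.index.name = 'date_time'

# Round trip through CSV text
csv_text = original.to_csv()
reloaded = pd.read_csv(io.StringIO(csv_text), index_col='date_time')

# The index comes back as plain strings, not as a DatetimeIndex
print(type(original.index).__name__)
print(type(reloaded.index).__name__)

# The datetime parsing has to be requested (and paid for) again on every load
reparsed = pd.read_csv(io.StringIO(csv_text), index_col='date_time', parse_dates=True)
print(type(reparsed.index).__name__)
```

With HDFStore the index is restored as a `DatetimeIndex` without any re-parsing, as the two `info()` outputs above show.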


### Summary

<ul>
<li>
<p>Try to use <a href="https://realpython.com/numpy-array-programming/#what-is-vectorization">vectorized operations</a> where possible rather than approaching problems with the <code>for x in df...</code> mentality. If your code is home to a lot of for-loops, it might be better suited to working with native Python data structures, because Pandas otherwise comes with a lot of overhead.</p>
</li>
<li>
<p>If you have more complex operations where vectorization is simply impossible or too difficult to work out efficiently, use the <code>.apply()</code> method.</p>
</li>
<li>
<p>If you do have to loop over your array (which does happen), use <code>.iterrows()</code> or <code>.itertuples()</code> to improve speed and syntax.</p>
</li>
<li>
<p>Pandas has a lot of optionality, and there are almost always several ways to get from A to B. Be mindful of this, compare how different routes perform, and choose the one that works best in the context of your project.</p>
</li>
<li>
<p>Once you’ve got a data cleaning script built, avoid reprocessing by storing your intermediate results with HDFStore.</p>
</li>
<li>
<p>Integrating NumPy into Pandas operations can often improve speed and simplify syntax.</p>
</li>
</ul>

## Other tips

### [Numba](https://numba.pydata.org/)

Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.
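A minimal sketch of how Numba is typically applied: an explicit Python loop over NumPy arrays is decorated with `njit` and compiled on first call. The function `total_cost` and the sample arrays are hypothetical, reusing the tariff rates from earlier. The `try`/`except` is only there so the sketch also runs (uncompiled) where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch only: fall back to a no-op decorator
    def njit(func):
        return func

@njit
def total_cost(hours, energy_kwh):
    """Sum the tariff cost with an explicit loop, the style Numba compiles well."""
    total = 0.0
    for i in range(hours.shape[0]):
        if hours[i] >= 17:
            rate = 28.0      # peak
        elif hours[i] >= 7:
            rate = 20.0      # shoulder
        else:
            rate = 12.0      # off-peak
        total += rate * energy_kwh[i]
    return total

hours = np.array([3, 12, 20])
energy = np.array([1.0, 2.0, 3.0])
result = total_cost(hours, energy)
print(result)  # 136.0
```

The first call includes compilation time, so any timing comparison should measure a second call.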

### pandas.eval() for Efficient Operations

[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html)

[High-Performance Pandas: eval() and query()](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html#pandas.eval()-for-Efficient-Operations)

As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the eval() and query() functions, which rely on the Numexpr package.
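A short sketch of both functions, using the `DataFrame.eval()` and `DataFrame.query()` methods on a random frame (the frame and column names are illustrative). If Numexpr is not installed, pandas falls back to its plain Python engine, so the results are the same but the speed benefit is not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((1000, 3)), columns=['a', 'b', 'c'])

# eval(): the whole arithmetic expression is passed as one string
via_eval = demo.eval('a + b * c')
plain = demo['a'] + demo['b'] * demo['c']

# query(): the same expression syntax, used to filter rows
via_query = demo.query('a < 0.5 and b > 0.5')
via_mask = demo[(demo['a'] < 0.5) & (demo['b'] > 0.5)]

print(np.allclose(via_eval, plain))  # True
print(via_query.equals(via_mask))    # True
```

The gain tends to show up only on large frames, where avoiding the intermediate arrays of the plain expression outweighs the cost of parsing the string.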