# Part 6: Code optimization for large datasets

Prepare the datasets:

In [None]:
!tar -xJf data/data_del_06.tar.xz -C ./data/

In [None]:
import pandas as pd
import numpy as np

Sources:
- [A Beginner’s Guide to Optimizing Pandas Code for Speed](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6)
- [How to Optimize your Pandas Code](https://kanoki.org/2019/01/09/how-to-optimize-your-pandas-code/)
- [Optimization tricks](http://ehneilsen.net/notebook/pandasExamples/pandas_examples.html#orgheadline36)
- [Enhancing performance](https://pandas.pydata.org/pandas-docs/stable/user_guide/enhancingperf.html)
- [Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects](https://realpython.com/fast-flexible-pandas/)
- [Optimizing Code Performance On Large Datasets](https://app.dataquest.io/course/improving-code-performance)
- [4 Unique Methods to Optimize your Python Code for Data Science](https://www.analyticsvidhya.com/blog/2019/09/4-methods-optimize-python-code-data-science/)
- [High-Performance Pandas: eval() and query()](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html#pandas.eval()-for-Efficient-Operations)

**Code optimization, in simple terms, means reducing the number of operations needed to execute a task while still producing correct results.**

## CPU Bound Programs

### Bounds vs Limitations

<img alt="I/O bounds" src="images/CPU+and+I_O+bounds.png">

### Optimization example

In [None]:
import numpy as np

# Define a basic Haversine distance formula
def haversine(lat1, lon1, lat2, lon2):
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1 
    dlon = lon2 - lon1 
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a)) 
    total_miles = MILES * c
    return total_miles

In [None]:
df = pd.read_csv('data/new_york_hotels.csv')

#### Crude looping over DataFrame rows using indices

In [None]:
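# Define a function to manually loop over all rows and return a list of distances
def haversine_looping(df):
    distance_list = []
    for i in range(0, len(df)):
        # Assumption: the CSV has 'latitude' and 'longitude' columns.
        d = haversine(40.671, -73.985,
                      df.iloc[i]['latitude'], df.iloc[i]['longitude'])
        distance_list.append(d)
    return distance_list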

In [None]:
%%timeit
# Run the haversine looping function
df['distance'] = haversine_looping(df)

#### Looping with iterrows()

In [None]:
%%timeit
# Haversine applied on rows via iteration
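haversine_series = []
# Assumption: the CSV has 'latitude' and 'longitude' columns.
for index, row in df.iterrows():
    haversine_series.append(
        haversine(40.671, -73.985, row['latitude'], row['longitude']))
df['distance'] = haversine_series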

#### Looping with apply()

In [None]:
%%timeit

# Timing apply on the Haversine function


#### Vectorization with Pandas series

In [None]:
%%timeit 
# Vectorized implementation of Haversine applied on Pandas series


#### Vectorization with NumPy arrays

In [None]:
%%timeit
# Vectorized implementation of Haversine applied on NumPy arrays


This brings us to a few basic conclusions on optimizing Pandas code:
1. Avoid loops; they’re slow and, in most common use cases, unnecessary.
2. If you must loop, use apply(), not iteration functions.
3. Vectorization is usually better than scalar operations. Most common operations in Pandas can be vectorized.
4. Vector operations on NumPy arrays are more efficient than on native Pandas series.
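
As a rough, self-contained illustration of these conclusions, the sketch below runs the same Haversine formula three ways: row-wise `apply()`, vectorized on pandas Series, and vectorized on NumPy arrays. The `points` frame, its `latitude`/`longitude` column names, and the random coordinates are assumptions made for the example, not the hotel dataset.

```python
import numpy as np
import pandas as pd

def haversine(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles; accepts scalars, Series, or arrays."""
    MILES = 3959
    lat1, lon1, lat2, lon2 = map(np.deg2rad, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return MILES * 2 * np.arcsin(np.sqrt(a))

# Synthetic coordinates near New York (hypothetical data, not the CSV).
rng = np.random.default_rng(0)
points = pd.DataFrame({
    "latitude": rng.uniform(40.5, 40.9, 1_000),
    "longitude": rng.uniform(-74.2, -73.7, 1_000),
})

# Row-wise apply: one Python-level function call per row.
by_apply = points.apply(
    lambda row: haversine(40.671, -73.985, row["latitude"], row["longitude"]),
    axis=1,
)

# Vectorized on pandas Series: one call for the whole column.
by_series = haversine(40.671, -73.985, points["latitude"], points["longitude"])

# Vectorized on NumPy arrays: the same call without pandas index handling.
by_numpy = haversine(
    40.671, -73.985,
    points["latitude"].to_numpy(), points["longitude"].to_numpy(),
)
```

All three produce the same distances; wrapping each variant in `%%timeit` on the real dataset is what shows the difference in speed.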

## I/O Bound Programs

### I/O Bounds

<img src="./images/report_assembly.png">

<img src="./images/report_assembly_bidir.png">

I/O bound tasks are tasks where:
- Our program is reading from an input (like a CSV file).
- Our program is writing to an output (like a text file).
- Our program is waiting for another program to execute something (like a SQL query).
- Our program is waiting for another server to execute something (like an API request).
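
One way to recognise an I/O bound task is to compare wall-clock time with CPU time: while the program waits, the clock keeps running but the CPU does almost nothing. The sketch below uses `time.sleep` as a hypothetical stand-in for a real wait such as a file read or a query.

```python
import time

def fake_io_task(seconds):
    """Stand-in for an I/O wait (hypothetical; a real task would read a file or query a server)."""
    time.sleep(seconds)

wall_start = time.perf_counter()   # wall-clock time
cpu_start = time.process_time()    # CPU time used by this process

fake_io_task(0.3)

wall_seconds = time.perf_counter() - wall_start
cpu_seconds = time.process_time() - cpu_start

print(f"wall: {wall_seconds:.3f}s, cpu: {cpu_seconds:.3f}s")
```

A large gap between the two numbers suggests that the program spends most of its time waiting rather than computing.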



### Profiling an I/O bound task

In [None]:
query = '''
SELECT DISTINCT teamID 
FROM Teams 
INNER JOIN TeamsFranchises ON Teams.franchID == TeamsFranchises.franchID 
WHERE TeamsFranchises.active = 'Y';
'''

In [None]:
import cProfile
import sqlite3

conn = sqlite3.connect("data/lahman2015.sqlite")

cur = conn.cursor()
teams = [row[0] for row in cur.execute(query).fetchall()]

In [None]:
print(teams)

In [None]:
import cProfile
import sqlite3

query = "SELECT SUM(HR) FROM Batting WHERE teamId=?"
conn = sqlite3.connect("data/lahman2015.sqlite")
cur = conn.cursor()

def calculate_runs(teams):
    home_runs = []
    for team in teams:
        runs = cur.execute(query, [team]).fetchall()
        runs = runs[0][0]
        home_runs.append(runs)
    return home_runs

In [None]:
%%timeit
home_runs = calculate_runs(teams)

In [None]:
profile_string = "home_runs = calculate_runs(teams)"

In [None]:
cProfile.run(profile_string)
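
To focus on the slowest calls, the profile can also be collected with `cProfile.Profile` and sorted by cumulative time with `pstats`. The sketch below is self-contained: the in-memory `Batting` table and its three rows are made-up stand-ins for the Lahman database.

```python
import cProfile
import io
import pstats
import sqlite3

# Tiny in-memory stand-in for the Batting table (hypothetical rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Batting (teamID TEXT, HR INTEGER)")
conn.executemany(
    "INSERT INTO Batting VALUES (?, ?)",
    [("AAA", 3), ("AAA", 4), ("BBB", 5)],
)

def total_home_runs(team_ids):
    cur = conn.cursor()
    totals = []
    for team in team_ids:
        row = cur.execute("SELECT SUM(HR) FROM Batting WHERE teamID=?", [team]).fetchone()
        totals.append(row[0])
    return totals

profiler = cProfile.Profile()
profiler.enable()
totals = total_home_runs(["AAA", "BBB"])
profiler.disable()

# Write the five entries with the highest cumulative time into a string.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```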

### Blocking Tasks

```python
51    0.120    0.002    0.120    0.002 {method 'execute' of 'sqlite3.Cursor' objects}
```

```python
conn = sqlite3.connect(':memory:')
```

In [None]:
import sqlite3

# Create an in memory database.
memory = sqlite3.connect(':memory:')

# Connect to our disk database.
disk = sqlite3.connect('data/lahman2015.sqlite')

# Create a query that will read the contents of the disk database 
# into another database.
dump = ''.join(line for line in disk.iterdump())

# Run the query to copy the database from disk into memory.
memory.executescript(dump)

cur = memory.cursor()

In [None]:
dump[:1000]
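
As an alternative to building a dump string, Python 3.7 and later also provide `Connection.backup()`, which copies one database into another directly. The sketch below uses a second in-memory database with a made-up `Batting` table in place of the file on disk.

```python
import sqlite3

# Stand-in for the on-disk database (hypothetical table and rows).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE Batting (teamID TEXT, HR INTEGER)")
source.executemany("INSERT INTO Batting VALUES (?, ?)", [("AAA", 3), ("BBB", 5)])
source.commit()

# Copy the whole source database into a fresh in-memory database.
memory = sqlite3.connect(":memory:")
source.backup(memory)

rows = memory.execute("SELECT teamID, HR FROM Batting ORDER BY teamID").fetchall()
print(rows)
```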

In [None]:
import cProfile
import sqlite3

query = "SELECT SUM(HR) FROM Batting WHERE teamId=?"

def calculate_runs(teams):
    home_runs = []
    for team in teams:
        runs = cur.execute(query, [team]).fetchall()
        runs = runs[0][0]
        home_runs.append(runs)
    return home_runs

In [None]:
profile_string = "home_runs = calculate_runs(teams)"
cProfile.run(profile_string)

### Parallel Execution

<img src="./images/single_threaded.png">

What if, instead, we could run several queries at once? It might look like this:

<img src="./images/multi_threaded.png">

We can use the Python 3 [threading library](https://docs.python.org/3/library/threading.html) to implement threading in our programs.

In [None]:
def task(team):
    print(team)

In [None]:
import threading




In [None]:
def task(team):
    print(team)
    

for n, team in enumerate(teams):
    thread = threading.Thread(target=task, args=(team,))
    thread.start()
    print('Started task', n)

### Thread Blocking

<img src="./images/three_threads.svg">

In [None]:
import threading
import time

def task(team):
    time.sleep(3)
    print(team)
    
for n, team in enumerate(teams):
    thread = threading.Thread(target=task, args=(team,))
    thread.start()
    print('Started task', n)
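
The benefit of threads for blocked tasks can be measured directly. In the sketch below, five hypothetical tasks each wait 0.2 seconds; because the waits overlap, the total is far less than one second. It uses `join()`, covered in the next section, to wait for all threads before stopping the clock.

```python
import threading
import time

def blocked_task(seconds):
    time.sleep(seconds)  # the thread is blocked here and does no work

start = time.perf_counter()
threads = [threading.Thread(target=blocked_task, args=(0.2,)) for _ in range(5)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
elapsed = time.perf_counter() - start

print(f"5 tasks of 0.2s each finished in {elapsed:.2f}s")
```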

### Joining Threads

```python
t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))
t3 = threading.Thread(target=task, args=(team,))

# Start the first three threads
t1.start()
t2.start()
t3.start()

t1.join() # Wait until t1 finishes.
t2.join() # Wait until t2 finishes.  If it already finished, then keep going.
t3.join() # Wait until t3 finishes.  If it already finished, then keep going.
```

<img src="./images/Screenshot from 2019-06-29 16-40-25.png">

In [None]:
def task(team):
    print(team)

for i in range(11):
    team_names = teams[i*5: (i+1) * 5]
    threads = []
    for team in team_names:
        thread = threading.Thread(target=task, args=(team,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    print("Finished batch {}".format(i)) 

### Locking

It's important to be aware of accessing shared resources when you're working with threads. Some examples of shared resources are:
- The system stdout.
- SQL databases.
- APIs.
- Objects in memory.

```python

lock = threading.Lock()

def task(team):
    lock.acquire()
    # This code cannot be executed until a thread acquires the lock.
    print(team)
    lock.release()

t1 = threading.Thread(target=task, args=(team,))
t2 = threading.Thread(target=task, args=(team,))

t1.start()
t2.start()

```
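
A lock can also be used as a context manager, which releases it even if the protected code raises an exception. This is a minimal sketch with made-up team names; the `printed` list exists only so the result can be inspected afterwards.

```python
import threading

lock = threading.Lock()
printed = []

def task(team):
    # `with` acquires the lock and releases it on exit, even on an exception.
    with lock:
        print(team)
        printed.append(team)

threads = [
    threading.Thread(target=task, args=(name,))
    for name in ["AAA", "BBB", "CCC"]
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```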

In [None]:
import threading
import time
import sys

lock = threading.Lock()

def task(team):
    lock.acquire()
    print(team)
    sys.stdout.flush()
    lock.release()
    
for i in range(11):
    team_names = teams[i*5: (i+1) * 5]
    threads = []
    for team in team_names:
        thread = threading.Thread(target=task, args=(team,))
        thread.start()
        threads.append(thread)
    for thread in threads:
        thread.join()
    print("Finished batch {}".format(i))   

### Thread Safety

In general, these operations are not thread safe:
- Modifying data in memory.
- Writing to a file.
- Adding data to a database.
- Modifying data via API.
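
As a small illustration of the first item, the sketch below has four threads increment a shared counter. The read-modify-write step `+= 1` is guarded by a lock so that no update is lost; the names `totals` and `add_many` are invented for the example.

```python
import threading

totals = {"count": 0}            # shared object in memory
totals_lock = threading.Lock()

def add_many(times):
    for _ in range(times):
        # Without the lock, two threads could read the same old value
        # and one of the increments would be lost.
        with totals_lock:
            totals["count"] += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()

print(totals["count"])
```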


In [None]:
import cProfile
import sqlite3
import threading
import sys

In [None]:
query = "SELECT DISTINCT teamID from Teams inner join TeamsFranchises on Teams.franchID == TeamsFranchises.franchID where TeamsFranchises.active = 'Y';"

In [None]:
conn = sqlite3.connect("data/lahman2015.sqlite", check_same_thread=False)

In [None]:
cur = conn.cursor()
teams = [row[0] for row in cur.execute(query).fetchall()]

query = "SELECT SUM(HR) FROM Batting WHERE teamId=?"
lock = threading.Lock()

In [None]:
def calculate_runs(team):
    cur = conn.cursor()
    runs = cur.execute(query, [team]).fetchall()
    runs = runs[0][0]
    lock.acquire()
    print(team, ':', runs)
    sys.stdout.flush()
    lock.release()
    return runs


threads = []

for team in teams:
    thread = threading.Thread(target=calculate_runs, args=(team,))
    thread.start()
    threads.append(thread)
    
for thread in threads:
    thread.join()
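
Threads started with `threading.Thread` discard the value returned by their target function. One possible way to collect results is `concurrent.futures.ThreadPoolExecutor` from the standard library. In this sketch, `fetch_total` and the numbers in `FAKE_TOTALS` are hypothetical stand-ins for the per-team query.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Made-up totals standing in for query results.
FAKE_TOTALS = {"AAA": 10, "BBB": 20, "CCC": 30}

def fetch_total(team):
    """Hypothetical stand-in for a per-team database query."""
    time.sleep(0.05)  # simulate waiting on I/O
    return FAKE_TOTALS[team]

team_ids = list(FAKE_TOTALS)
with ThreadPoolExecutor(max_workers=3) as pool:
    # map() returns results in the same order as the inputs.
    results = dict(zip(team_ids, pool.map(fetch_total, team_ids)))

print(results)
```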

## Optimizing Python Code with pandas

### Basic Looping

### Select columns and rows efficiently


In [None]:
data = pd.read_csv('data/school.csv')
data.head(3)

In [None]:
data['City'].value_counts().head(10)

In [None]:
# save the top cities in a list
top_cities = ['Brooklyn','Bronx','Manhattan','Jamaica','Long Island City']

In [None]:
%%timeit


In [None]:
data.City.value_counts()

In [None]:
data = pd.read_csv('data/school.csv')

In [None]:
%%timeit
# bad practice


In [None]:
data.City.value_counts()
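
The timing cells in this section are left blank. One comparison that fits the `top_cities` setup is replacing every city outside the list with a placeholder, once row by row and once with a boolean mask. The tiny `schools` frame and the `'Other'` label are assumptions for this sketch, not taken from `school.csv`.

```python
import pandas as pd

schools = pd.DataFrame(
    {"City": ["Brooklyn", "Bronx", "Albany", "Jamaica", "Yonkers"]}
)
top_cities = ["Brooklyn", "Bronx", "Manhattan", "Jamaica", "Long Island City"]

# Row-by-row version: a Python-level loop with one .loc write per row.
looped = schools.copy()
for index, row in looped.iterrows():
    if row["City"] not in top_cities:
        looped.loc[index, "City"] = "Other"

# Vectorized version: build one boolean mask with isin() and assign once.
vectorized = schools.copy()
vectorized.loc[~vectorized["City"].isin(top_cities), "City"] = "Other"

print(vectorized["City"].value_counts())
```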

### Using built-in functions

### Joining on indexes is faster than joining on columns

Construct some sample data:

In [None]:
n = 100000

i1 = np.arange(n)
np.random.shuffle(i1)
df1 = pd.DataFrame({'i': i1,
                    'j': np.random.randint(1,1000,n),
                    'k': np.random.randint(1,1000,n)})

i2 = np.arange(n)
np.random.shuffle(i2)
df2 = pd.DataFrame({'i': i2,
                    'm': np.random.randint(1,1000,n),
                    'n': np.random.randint(1,1000,n)})

In [None]:
df1.head()

In [None]:
df2.head()

In [None]:
%%timeit


In [None]:
df1 = df1.set_index('i')
df2 = df2.set_index('i')
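
The two timing cells are left blank; a minimal sketch of the two joins being compared could look like the following. The frames `left` and `right` are smaller synthetic stand-ins for `df1` and `df2`, and the check at the end only confirms that both joins return the same rows, not which one is faster.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
left = pd.DataFrame({"i": rng.permutation(n), "j": rng.integers(1, 1000, n)})
right = pd.DataFrame({"i": rng.permutation(n), "m": rng.integers(1, 1000, n)})

# Join on a column: pandas matches the key values at merge time.
merged_on_column = left.merge(right, on="i")

# Join on the index: set the key as the index once, then join.
left_indexed = left.set_index("i")
right_indexed = right.set_index("i")
joined_on_index = left_indexed.join(right_indexed)

same_rows = (
    merged_on_column.set_index("i").sort_index()
    .equals(joined_on_index.sort_index())
)
print(same_rows)
```

Timing `left.merge(right, on="i")` against `left_indexed.join(right_indexed)` with `%%timeit` gives the comparison this section is about.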

In [None]:
%%timeit


## EXAMPLE: Speeding up pandas code

Source: 
- [Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects](https://realpython.com/fast-flexible-pandas/)

### Task

### Data preparation

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv('data/demand_profile.csv')

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.dtypes

In [None]:
type(df.iloc[0, 0])

In [None]:
df['date_time'] = pd.to_datetime(df['date_time'])
df['date_time'].dtype

In [None]:
df.head()

In [None]:
def convert(df, column_name):
    return pd.to_datetime(df[column_name])

df = pd.read_csv('data/demand_profile.csv')
df_coverted = df.copy()

In [None]:
%%timeit -r 3 -n 10
df_coverted['date_time'] = convert(df, 'date_time')

In [None]:
def convert_with_format(df, column_name):
    return pd.to_datetime(df[column_name], format='%d/%m/%y %H:%M')

In [None]:
%%timeit -r 3 -n 10
df_coverted['date_time'] = convert_with_format(df, 'date_time')
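
With an explicit `format`, pandas does not have to work out the layout of the strings itself, and the day-first order is unambiguous. A minimal sketch, assuming strings shaped like the format string used in `convert_with_format` (the three sample values are made up):

```python
import pandas as pd

# Day/month/two-digit year, then hour:minute (hypothetical sample values).
raw = pd.Series(["1/1/13 0:00", "1/1/13 1:00", "13/1/13 5:00"])

parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")
print(parsed)
```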

In [None]:
859/18

In [None]:
df_coverted.head()

In [None]:
df_coverted.info()

### 1) Simple Looping Over Pandas Data

<table class="table table-hover">
<thead>
<tr>
<th>Tariff Type</th>
<th>Cents per kWh</th>
<th>Time Range</th>
</tr>
</thead>
<tbody>
<tr>
<td>Peak</td>
<td>28</td>
<td>17:00 to 24:00</td>
</tr>
<tr>
<td>Shoulder</td>
<td>20</td>
<td>7:00 to 17:00</td>
</tr>
<tr>
<td>Off-Peak</td>
<td>12</td>
<td>0:00 to 7:00</td>
</tr>
</tbody>
</table>

In [None]:
df_test = df_coverted.copy()
df_test['cost_cents'] = df['energy_kwh'] * 28

In [None]:
df_test.head()

In [None]:
def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""    
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh

In [None]:
# NOTE: Don't do this!
def apply_tariff_loop(df):
    """Calculate costs in loop.  Modifies `df` inplace."""
    energy_cost_list = []
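    for i in range(len(df)):
        # Get electricity used and hour of day
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost_list.append(apply_tariff(energy_used, hour))
    df['cost_cents'] = energy_cost_list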

In [None]:
%%timeit -r 3 -n 10
apply_tariff_loop(df_coverted)

### 2) Looping with .itertuples() and .iterrows()

In [None]:
for index, row in df[:5].iterrows():
    print(index)
    print(row)
    print('energy_kwh' ,row['energy_kwh'])
    print('-------')

In [None]:
def apply_tariff_iterrows(df):


In [None]:
%%timeit -r 3 -n 10
apply_tariff_iterrows(df_coverted)
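
The body of `apply_tariff_iterrows` is left blank above. One possible version, together with an `.itertuples()` variant, is sketched here on a synthetic frame. The `usage` frame holds one made-up day of hourly readings, and `apply_tariff_itertuples` is a name introduced for this example.

```python
import pandas as pd

def apply_tariff(kwh, hour):
    """Cost in cents for `kwh` used during `hour`, following the tariff table."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh

# One synthetic day of hourly readings (hypothetical values).
usage = pd.DataFrame({
    "date_time": pd.date_range("2013-01-01", periods=24, freq="h"),
    "energy_kwh": [1.0] * 24,
})

def apply_tariff_iterrows(df):
    """Add `cost_cents` using iterrows(); each row arrives as a Series."""
    costs = []
    for _, row in df.iterrows():
        costs.append(apply_tariff(row["energy_kwh"], row["date_time"].hour))
    df["cost_cents"] = costs

def apply_tariff_itertuples(df):
    """Add `cost_cents` using itertuples(); each row arrives as a namedtuple."""
    df["cost_cents"] = [
        apply_tariff(row.energy_kwh, row.date_time.hour)
        for row in df.itertuples(index=False)
    ]

via_iterrows = usage.copy()
apply_tariff_iterrows(via_iterrows)

via_itertuples = usage.copy()
apply_tariff_itertuples(via_itertuples)
```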

### 3) Pandas’ .apply()

In [None]:
def apply_tariff_withapply(df):


In [None]:
%%timeit -r 3 -n 10
apply_tariff_withapply(df_coverted)
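
The body of `apply_tariff_withapply` is left blank above. A minimal sketch of one way to write it follows, using `DataFrame.apply` with `axis=1` so the function receives one row at a time. The `usage` frame is a made-up day of hourly readings.

```python
import pandas as pd

def apply_tariff(kwh, hour):
    """Cost in cents for `kwh` used during `hour`, following the tariff table."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh

# One synthetic day of hourly readings (hypothetical values).
usage = pd.DataFrame({
    "date_time": pd.date_range("2013-01-01", periods=24, freq="h"),
    "energy_kwh": [1.0] * 24,
})

def apply_tariff_withapply(df):
    """Add `cost_cents` by applying apply_tariff to every row."""
    df["cost_cents"] = df.apply(
        lambda row: apply_tariff(kwh=row["energy_kwh"], hour=row["date_time"].hour),
        axis=1,
    )

apply_tariff_withapply(usage)
print(usage["cost_cents"].sum())
```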

### 4) Selecting Data With .isin()

In [None]:
df_coverted = df.copy()
df_coverted['date_time'] = convert_with_format(df, 'date_time')
df_coverted.set_index('date_time', inplace=True)

In [None]:
df_coverted.head()

In [None]:
def apply_tariff_isin(df):
   

In [None]:
%%timeit -r 3 -n 10
apply_tariff_isin(df_coverted)
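
The body of `apply_tariff_isin` is left blank above. One possible version selects each tariff band with a boolean mask built from the hour of the datetime index and multiplies the whole band at once. The `usage` frame and the `bands` dictionary are assumptions made for this sketch.

```python
import pandas as pd

# One synthetic day of hourly readings, indexed by timestamp (hypothetical values).
usage = pd.DataFrame(
    {"energy_kwh": [1.0] * 24},
    index=pd.date_range("2013-01-01", periods=24, freq="h", name="date_time"),
)

def apply_tariff_isin(df):
    """Add `cost_cents` by selecting each tariff band with a boolean mask."""
    hours = df.index.hour
    # Rate in cents per kWh mapped to the hours it covers.
    bands = {
        12: list(range(0, 7)),
        20: list(range(7, 17)),
        28: list(range(17, 24)),
    }
    for rate, band_hours in bands.items():
        mask = hours.isin(band_hours)
        df.loc[mask, "cost_cents"] = df.loc[mask, "energy_kwh"] * rate

apply_tariff_isin(usage)
```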

### 5) Pandas’ pd.cut() function

In [None]:
df_coverted = df.copy()
df_coverted['date_time'] = convert_with_format(df, 'date_time')
df_coverted.set_index('date_time', inplace=True)

> **[pandas.cut](https://pandas.pydata.org/pandas-docs/version/0.23.4/generated/pandas.cut.html)**
- `pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')`
- Bin values into discrete intervals.
- Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

In [None]:
def apply_tariff_cut(df):


In [None]:
%%timeit -r 3 -n 10
apply_tariff_cut(df_coverted)
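
The body of `apply_tariff_cut` is left blank above. A minimal sketch follows, assuming the bands should match `apply_tariff` exactly. It passes `right=False` so that the bins are `[0, 7)`, `[7, 17)`, and `[17, 24)`; with the default `right=True`, hour 7 would fall into the Off-Peak bin. The `usage` frame is made up for the example.

```python
import pandas as pd

# One synthetic day of hourly readings, indexed by timestamp (hypothetical values).
usage = pd.DataFrame(
    {"energy_kwh": [1.0] * 24},
    index=pd.date_range("2013-01-01", periods=24, freq="h", name="date_time"),
)

def apply_tariff_cut(df):
    """Add `cost_cents` by binning the hour of day into tariff bands."""
    hours = pd.Series(df.index.hour.to_numpy(), index=df.index)
    # right=False gives [0, 7), [7, 17), [17, 24), so hour 7 is Shoulder.
    rates = pd.cut(hours, bins=[0, 7, 17, 24], right=False, labels=[12, 20, 28])
    df["cost_cents"] = rates.astype(int) * df["energy_kwh"]

apply_tariff_cut(usage)
```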

### 6) Using NumPy

In [None]:
import numpy as np

In [None]:
df_coverted = df.copy()
df_coverted['date_time'] = convert_with_format(df, 'date_time')
df_coverted.set_index('date_time', inplace=True)

> **[numpy.digitize](https://docs.scipy.org/doc/numpy/reference/generated/numpy.digitize.html)**
- `numpy.digitize(x, bins, right=False)`
- Return the indices of the bins to which each value in input array belongs.

In [None]:
def apply_tariff_digitize(df):


In [None]:
%%timeit -r 3 -n 10
apply_tariff_digitize(df_coverted)
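
The body of `apply_tariff_digitize` is left blank above. One possible version uses `np.digitize` to turn each hour into a band number and then looks the rate up in a small array. The `usage` frame is a made-up day of hourly readings.

```python
import numpy as np
import pandas as pd

# One synthetic day of hourly readings, indexed by timestamp (hypothetical values).
usage = pd.DataFrame(
    {"energy_kwh": [1.0] * 24},
    index=pd.date_range("2013-01-01", periods=24, freq="h", name="date_time"),
)

def apply_tariff_digitize(df):
    """Add `cost_cents` using NumPy binning and array indexing."""
    rates = np.array([12, 20, 28])
    # digitize returns 0 for hours below 7, 1 for 7 to 16, and 2 for 17 to 23.
    band = np.digitize(df.index.hour.to_numpy(), bins=[7, 17, 24])
    df["cost_cents"] = rates[band] * df["energy_kwh"].to_numpy()

apply_tariff_digitize(usage)
```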

### Prevent Reprocessing with HDFStore

[What is HDF5](https://portal.hdfgroup.org/display/knowledge/What+is+HDF5)

In [None]:
df_coverted.info()

In [None]:
# Create storage object with filename `data/OUT_processed_data.h5`
data_store = pd.HDFStore('data/OUT_processed_data.h5')

# Put DataFrame into the object setting the key as 'preprocessed_df'
data_store['preprocessed_df'] = df_coverted
data_store.close()

In [None]:
data_store = pd.HDFStore('data/OUT_processed_data.h5')

# Retrieve data using key
preprocessed_df = data_store['preprocessed_df']
data_store.close()

In [None]:
preprocessed_df.info()

### Summary

<ul>
<li>
<p>Try to use <a href="https://realpython.com/numpy-array-programming/#what-is-vectorization">vectorized operations</a> where possible rather than approaching problems with the <code>for x in df...</code> mentality. If your code is home to a lot of for-loops, it might be better suited to working with native Python data structures, because Pandas otherwise comes with a lot of overhead.</p>
</li>
<li>
<p>If you have more complex operations where vectorization is simply impossible or too difficult to work out efficiently, use the <code>.apply()</code> method.</p>
</li>
<li>
<p>If you do have to loop over your array (which does happen), use <code>.iterrows()</code> or <code>.itertuples()</code> to improve speed and syntax.</p>
</li>
<li>
<p>Pandas has a lot of optionality, and there are almost always several ways to get from A to B. Be mindful of this, compare how different routes perform, and choose the one that works best in the context of your project.</p>
</li>
<li>
<p>Once you’ve got a data cleaning script built, avoid reprocessing by storing your intermediate results with HDFStore.</p>
</li>
<li>
<p>Integrating NumPy into Pandas operations can often improve speed and simplify syntax.</p>
</li>
</ul>

## Other tips

### [Numba](https://numba.pydata.org/)

Numba translates Python functions to optimized machine code at runtime using the industry-standard LLVM compiler library. Numba-compiled numerical algorithms in Python can approach the speeds of C or FORTRAN.

### pandas.eval() for Efficient Operations

[Documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.eval.html)

[High-Performance Pandas: eval() and query()](https://jakevdp.github.io/PythonDataScienceHandbook/03.12-performance-eval-and-query.html#pandas.eval()-for-Efficient-Operations)

As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays. These are the eval() and query() functions, which rely on the Numexpr package.
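
A minimal sketch of both functions on a synthetic frame follows. The column names and random values are assumptions for the example; `local_dict` is passed explicitly so the names in the expression do not depend on the surrounding scope. If Numexpr is not installed, pandas falls back to its plain Python engine, so the results are the same but the speed benefit is lost.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
frame = pd.DataFrame(rng.random((1_000, 3)), columns=["a", "b", "c"])

# pd.eval evaluates the whole arithmetic expression from a string.
total_eval = pd.eval(
    "a + b + c",
    local_dict={"a": frame["a"], "b": frame["b"], "c": frame["c"]},
)
total_plain = frame["a"] + frame["b"] + frame["c"]

# DataFrame.query filters rows with the same kind of string expression.
subset_query = frame.query("a < 0.5 and b < 0.5")
subset_plain = frame[(frame["a"] < 0.5) & (frame["b"] < 0.5)]
```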