<IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=1195 HEIGHT=200>

# Pandas Memory Usage
We use the Socrata API to access a large catalog of data.
Socrata documentation: https://dev.socrata.com/

The production API endpoints for the public version of this API are at https://api.us.socrata.com/api/catalog/v1 for domains in North America
and https://api.eu.socrata.com/api/catalog/v1 for all other domains.

See: 
- <a href="https://youtu.be/D46A9r3bfjM" target="_blank">019-Finding Data: Socrata Catalog</a>
- <a href="https://youtu.be/4C9ShcU--ek" target="_blank">020-Socrata Datasets</a>
- <a href="https://github.com/jacquesroy/byte-size-data-science/blob/master/Notebooks/W005-FindingData.ipynb" target="_blank">W005-FindingData.ipynb</a>

In [None]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/t5Swm-7fAAw?rel=0&amp;controls=0&amp;showinfo=0", width=560, height=315)

## Import the appropriate libraries and set up needed connections

In [None]:
import pandas as pd
import numpy as np

In [None]:
# Library used to read datasets
# https://github.com/xmunoz/sodapy
# Redirect stdout to the file first, then stderr to the same place
!pip install sodapy >sodapip.txt 2>&1
from sodapy import Socrata

## Read the data
We use traffic crash data from the City of Chicago.

In [None]:
# Unauthenticated client only works with public data sets. Note 'None'
# in place of application token, and no username or password:
client = Socrata("data.cityofchicago.org", None)

## Read some of the most recent records

In [None]:
from datetime import date
from dateutil.relativedelta import relativedelta

# 'YYYY-MM' strings for 6, 3 and 1 months ago; only three_months is used in the filter below
six_months = (date.today() - relativedelta(months=+6)).strftime('%Y-%m')
three_months = (date.today() - relativedelta(months=+3)).strftime('%Y-%m')
one_month = (date.today() - relativedelta(months=+1)).strftime('%Y-%m')

where = "crash_date > '{}'".format(three_months)
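
To make the filter concrete, here is a minimal sketch that builds the same kind of `where` string from a fixed date instead of `date.today()`. The date 2021-05-31 is an arbitrary assumption chosen for illustration. It also shows that `relativedelta` clamps the day to the end of a shorter month.

```python
from datetime import date
from dateutil.relativedelta import relativedelta

# Hypothetical fixed "today" so the result is reproducible
today = date(2021, 5, 31)

# Subtracting months clamps the day to the last valid day of the target month
three_months_ago = today - relativedelta(months=3)
three_months = three_months_ago.strftime('%Y-%m')

where = "crash_date > '{}'".format(three_months)
print(three_months_ago)  # 2021-02-28
print(where)             # crash_date > '2021-02'
```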

### What we get from reading from Socrata are lists
Each `client.get` call returns a list of records. We can read a maximum of 10,000 records per call, so we loop, advancing the offset each time, until a call returns an empty list.

In [None]:
# https://data.cityofchicago.org/Transportation/Traffic-Crashes-Crashes/85ca-t3if
# Page through the results and collect one dataframe per page
page_size = 10000
offset = 0
pages = []
result = client.get("85ca-t3if", where=where, offset=offset, limit=page_size)
while len(result) > 0:
    pages.append(pd.DataFrame(result))
    offset += page_size
    result = client.get("85ca-t3if", where=where, offset=offset, limit=page_size)

# DataFrame.append was removed in pandas 2.0; pd.concat replaces it.
# ignore_index=True avoids repeating the row labels 0..9999 for every page.
crashes_df = pd.concat(pages, sort=True, ignore_index=True)

print("Number of records: {}, number of columns: {}".format(crashes_df.shape[0], crashes_df.shape[1]))
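
The offset-based paging pattern above can be tried without a network connection. The following is a minimal sketch, assuming a hypothetical `fetch_page` function that stands in for `client.get` and serves slices of a small in-memory list. Neither `fetch_page` nor `fetch_all` is part of sodapy.

```python
# Hypothetical stand-in for client.get: serves slices of a 25-record "dataset"
DATASET = [{"id": i} for i in range(25)]

def fetch_page(offset, limit):
    """Return up to `limit` records starting at `offset` (empty list past the end)."""
    return DATASET[offset:offset + limit]

def fetch_all(page_size):
    """Collect every record by requesting pages until one comes back empty."""
    records = []
    offset = 0
    page = fetch_page(offset, page_size)
    while len(page) > 0:
        records.extend(page)
        offset += page_size
        page = fetch_page(offset, page_size)
    return records

all_records = fetch_all(page_size=10)
print(len(all_records))  # 25
```

The loop always makes one extra request that returns nothing. That empty page is the only signal that the data is exhausted.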

### File size
Write the dataframe to a CSV file and check its size on disk, for comparison with its size in memory.

In [None]:
crashes_df.to_csv('crashes_df.csv', index=False)

In [None]:
!ls -l crashes*

### Dataframe memory usage

In [None]:
# The result is a series with a value for each column, so we add them up
crashes_df_mem = crashes_df.memory_usage(deep=True).sum()
print("crashes_df memory usage: {0:,} bytes".format(crashes_df_mem) )
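
The `deep=True` argument matters for `object` columns. As a small sketch on made-up data (the values and the single-column frame are assumptions for illustration), compare the two settings:

```python
import pandas as pd

# Small illustrative frame; dtype=object mimics string data as read from the API
df = pd.DataFrame({"crash_hour": pd.Series(["7", "13", "22"], dtype=object)})

# deep=False counts only the references held in the column
shallow = int(df.memory_usage(deep=False, index=False).sum())
# deep=True also measures the Python string objects those references point to
deep = int(df.memory_usage(deep=True, index=False).sum())

print("shallow: {:,} bytes, deep: {:,} bytes".format(shallow, deep))
```

Without `deep=True` the reported figure for string data would be far too low, which is why the notebook uses it throughout.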

### Adjust the data types of multiple columns
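The values arrive as strings, so pandas stores the columns with the `object` type. It turns out that the `object` type has quite a bit of overhead, because every value is a separate Python object.

We can convert a few columns to numerical types.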

In [None]:
# errors='ignore' leaves a column unchanged if it cannot be converted
crashes2_df = crashes_df.astype({'beat_of_occurrence': 'int64', 'crash_day_of_week': 'int64',
                                 'crash_hour': 'int64', 'crash_month': 'int64',
                                 'latitude': 'float64', 'longitude': 'float64',
                                 'num_units': 'int64', 'posted_speed_limit': 'int64'}, errors='ignore')

print("Number of records: {}, number of columns: {}".format(crashes2_df.shape[0], crashes2_df.shape[1]))
print("Number of attributes converted to numeric: {0}".format(crashes2_df.dtypes[crashes2_df.dtypes != 'object'].size))

crashes2_df_mem = crashes2_df.memory_usage(deep=True).sum()
print("Memory usage: {0:,} bytes".format(crashes2_df_mem) )
print("Memory savings: {0:,} bytes".format(crashes_df_mem - crashes2_df_mem))

In [None]:
crashes2_df.dtypes[crashes2_df.dtypes != 'object']
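
Not every requested column is necessarily converted, which is why the cell above counts the non-`object` columns afterwards. Here is a minimal sketch on made-up data, assuming one column has a missing value. The values are illustrative and not taken from the Chicago dataset.

```python
import pandas as pd

# Illustrative data: 'posted_speed_limit' is complete, 'beat_of_occurrence' has a gap
df = pd.DataFrame({
    "posted_speed_limit": pd.Series(["30", "25", "35"], dtype=object),
    "beat_of_occurrence": pd.Series(["1235", None, "2011"], dtype=object),
})

converted = df.astype({"posted_speed_limit": "int64",
                       "beat_of_occurrence": "int64"}, errors="ignore")

# The complete column becomes int64; the one with a missing value stays object
print(converted.dtypes)
```

With `errors='ignore'` a failed conversion is silent, so checking `dtypes` afterwards is the only way to see which columns actually changed.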

## Compare space usage for object vs. int64 and float64

In [None]:
crash_dow_mem_object = crashes_df['crash_day_of_week'].memory_usage(deep=True,index=False)
crash_dow_mem_int64 = crashes2_df['crash_day_of_week'].memory_usage(deep=True,index=False)

print("Object memory usage: {:,} bytes total, {:5.2f} bytes per value".format(
              crash_dow_mem_object, crash_dow_mem_object / crashes_df.shape[0] ))
print("int64 memory usage : {:9,d} bytes total, {:5.2f} bytes per value".format(
              crash_dow_mem_int64, crash_dow_mem_int64 / crashes_df.shape[0] ))

In [None]:
latitude_mem_object = crashes_df['latitude'].memory_usage(deep=True,index=False)
latitude_mem_float64 = crashes2_df['latitude'].memory_usage(deep=True,index=False)

print("Object memory usage : {:,} bytes total, {:5.2f} bytes per value".format(
              latitude_mem_object, latitude_mem_object / crashes_df.shape[0] ))
print("float64 memory usage: {:9,d} bytes total, {:5.2f} bytes per value".format(
              latitude_mem_float64, latitude_mem_float64 / crashes_df.shape[0] ))

In [None]:
# The values are still strings in crashes_df, so len() gives the number of characters
print("crash_day_of_week: '{}', length: {}".format(crashes_df['crash_day_of_week'].iloc[0], len(crashes_df['crash_day_of_week'].iloc[0])))
print("latitude         : '{}', length: {}".format(crashes_df['latitude'].iloc[0], len(crashes_df['latitude'].iloc[0])))

crashes_df['latitude'].head()
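
The per-value figures above can be approximated by hand. As a rough sketch, assuming CPython and an illustrative latitude string, an `object` column pays for a reference plus the string object itself, while a `float64` column stores the number directly:

```python
import sys
import numpy as np

value = "41.878113"  # illustrative latitude string

# An object column stores a reference per row plus the string object itself
pointer_size = np.dtype(object).itemsize
string_size = sys.getsizeof(value)
# A float64 column stores the number directly
float_size = np.dtype("float64").itemsize

print("object : {} + {} bytes per value".format(pointer_size, string_size))
print("float64: {} bytes per value".format(float_size))
```

The exact string size depends on the Python version and on the length of the string, which is why longer values such as latitudes cost more per value than a one-character day of the week.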

## Column Elimination
In our case we are using the location information. Still, let's eliminate the columns that are likely useless: here, columns that have a value in fewer than `minpercent` of the rows.
We may still want some of these columns later.

In [None]:
# Minimum fraction of non-null values a column needs to be kept (0.5 = 50%)
minpercent = .5

In [None]:
total = crashes_df.shape[0]
# Fraction of non-null values in each column
result = (crashes_df.count() / total)
# Keep only the columns populated in at least minpercent of the rows
colnames = result.index[result >= minpercent].tolist()
print("We went from         : {} to {} columns".format(crashes2_df.shape[1],len(colnames)))
crashes_df2 = crashes2_df[colnames]
print("crashes2_df rows     : {:,}, columns: {}".format(crashes2_df.shape[0],crashes2_df.shape[1]))
print("crashes_df2 rows     : {:,}, columns: {}".format(crashes_df2.shape[0],crashes_df2.shape[1]))
crashes_df2_mem = crashes_df2.memory_usage(deep=True).sum()
print("Memory usage         : {0:,} bytes".format(crashes_df2_mem) )
print("Memory savings       : {0:,} bytes".format(crashes2_df_mem - crashes_df2_mem) )
print("Total memory savings : {0:,} bytes".format(crashes_df_mem - crashes_df2_mem) )
print("Percent total savings: {:5.2f}%".format(100.0 * (1.0 - (crashes_df2_mem / crashes_df_mem))) )
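
The selection rule is easier to follow on a tiny frame. The sketch below uses made-up data and a hypothetical `minfraction` variable that plays the same role as `minpercent`:

```python
import numpy as np
import pandas as pd

# Illustrative frame: 'work_zone_type' is mostly empty
df = pd.DataFrame({
    "crash_hour": [7, 13, 22, 5],
    "latitude": [41.88, np.nan, 41.79, 41.90],
    "work_zone_type": [np.nan, np.nan, np.nan, "CONSTRUCTION"],
})

minfraction = 0.5  # hypothetical threshold, same role as minpercent

# count() ignores missing values, so this is the populated fraction per column
populated = df.count() / len(df)
keep = populated.index[populated >= minfraction].tolist()

print(populated.to_dict())
print(keep)  # ['crash_hour', 'latitude']
```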

## Free up memory we don't need
We no longer need the original and intermediate dataframes, so we drop our references to them. That frees up quite a few MBs.

In [None]:
import gc

# Drop the references so the storage can be reclaimed
result = None
crashes_df = None
crashes2_df = None

ret = gc.collect()
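
To see that rebinding a name really releases the object, a weak reference can act as a probe. This is a minimal sketch, assuming CPython and using a hypothetical `Payload` class in place of a dataframe:

```python
import gc
import weakref

class Payload:
    """Hypothetical stand-in for a large object such as a dataframe."""
    def __init__(self):
        self.data = list(range(1000))

big = Payload()
probe = weakref.ref(big)   # lets us observe the object without keeping it alive

big = None                 # drop the only strong reference
gc.collect()               # also sweeps up any objects caught in reference cycles

print(probe() is None)     # True once the object has been reclaimed
```

The memory is only released when no other name still refers to the object, which is why the cell above clears `result`, `crashes_df` and `crashes2_df` together.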