If we read a large CSV file over and over for analysis,  
we should save it once to a binary format such as HDF and load the HDF file on subsequent runs.

In [3]:
import pandas as pd
pd.options.display.width = 1000

# read CarsData.csv (~5 MB) into a dataframe and print time taken
import time
start = time.time()
df = pd.read_csv('CarsData.csv')
end = time.time()
print('Time taken to read CarsData.csv:', end - start, 'seconds')

# save the dataframe above to an HDF file (needs the optional PyTables package) and print time taken
start = time.time()
df.to_hdf('CarsData.h5', key='df', mode='w')
end = time.time()
print('Time taken to save dataframe to hdf:', end - start, 'seconds')

# re-read CarsData.h5 into a dataframe and print time taken
start = time.time()
df = pd.read_hdf('CarsData.h5', key='df')
end = time.time()
print('Time taken to read CarsData.h5:', end - start, 'seconds')

Time taken to read CarsData.csv: 0.16954684257507324 seconds
Time taken to save dataframe to hdf: 0.10571789741516113 seconds
Time taken to read CarsData.h5: 0.058846235275268555 seconds
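
The pattern above can be wrapped in a small helper that parses the CSV only when no binary copy exists yet. This is a minimal sketch, not part of the original notebook: the name `load_cached` and the sample file names are hypothetical, and it uses pandas' pickle format in place of HDF only so that it runs without the optional PyTables package that `to_hdf` needs.

```python
import os
import tempfile

import pandas as pd


def load_cached(csv_path, cache_path):
    """Return the data from cache_path if it exists, else parse csv_path and cache it.

    Hypothetical helper for illustration; it does not detect a CSV that
    changed after the cache was written.
    """
    if os.path.exists(cache_path):
        return pd.read_pickle(cache_path)
    df = pd.read_csv(csv_path)
    df.to_pickle(cache_path)
    return df


# Small stand-in for CarsData.csv so the sketch is self-contained.
workdir = tempfile.mkdtemp()
csv_path = os.path.join(workdir, 'cars_sample.csv')
cache_path = os.path.join(workdir, 'cars_sample.pkl')
pd.DataFrame({'model': ['I10', 'Polo'], 'price': [7495, 10989]}).to_csv(csv_path, index=False)

first = load_cached(csv_path, cache_path)   # parses the CSV and writes the cache
second = load_cached(csv_path, cache_path)  # served from the binary cache
print(first.equals(second))
```

The same shape should work with `to_hdf` / `read_hdf` by swapping the two pandas calls and passing a `key`, as in the cell above.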


If we know that a column's values come from a limited set,  
we should convert it to the category type to save memory.  
Examples of such columns: country list, brand name ...


In [17]:
print(df.head(5))

# print the number of unique values in each column
print("number of unique values in each column:")
print(df.nunique())

# the Manufacturer column has only 9 unique values, so it is a good candidate for a category column

print("size of dataframe in memory (MB):", df.memory_usage(deep=True).sum() / (1024**2))

# convert Manufacturer to category
df['Manufacturer'] = df['Manufacturer'].astype('category')
print("size of dataframe in memory after converting Manufacturer to category (MB):", df.memory_usage(deep=True).sum() / (1024**2))

# convert model to category
df['model'] = df['model'].astype('category')
print("size of dataframe in memory after converting Model to category (MB):", df.memory_usage(deep=True).sum() / (1024**2))

print(df.head(2))

           model  year  price transmission  mileage fuelType  tax   mpg  engineSize Manufacturer
0            I10  2017   7495       Manual    11630   Petrol  145  60.1         1.0       hyundi
1           Polo  2017  10989       Manual     9200   Petrol  145  58.9         1.0   volkswagen
2       2 Series  2019  27990    Semi-Auto     1614   Diesel  145  49.6         2.0          BMW
3   Yeti Outdoor  2017  12495       Manual    30960   Diesel  150  62.8         2.0        skoda
4         Fiesta  2017   7999       Manual    19353   Petrol  125  54.3         1.2         ford
number of unique values in each column:
model             196
year               27
price           13236
transmission        4
mileage         42214
fuelType            5
tax                48
mpg               208
engineSize         40
Manufacturer        9
dtype: int64
size of dataframe in memory (MB): 17.377426147460938
size of dataframe in memory after converting Manufacturer to category (MB): 17.3774261474609
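
To make the effect of the conversion easier to see in isolation, here is a small sketch on synthetic data. The brand list, the row count, and the variable names are illustrative assumptions, not values taken from CarsData.csv.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a low-cardinality text column (values are illustrative).
rng = np.random.default_rng(0)
brands = ['BMW', 'ford', 'skoda', 'volkswagen']
s = pd.Series(rng.choice(brands, size=100_000))

as_text = s.memory_usage(deep=True)
cat = s.astype('category')
as_category = cat.memory_usage(deep=True)

print("bytes as text:", as_text)
print("bytes as category:", as_category)

# A category column stores each distinct label once,
# plus one small integer code per row.
print(cat.cat.categories.tolist())
print(cat.cat.codes.dtype)
```

The saving grows with the number of rows and shrinks as the number of distinct values approaches the number of rows, so a column such as `mileage` with tens of thousands of unique values is a poor candidate.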

Speed up Column Operations
* Iteration by iloc (access by position: 0, 1, 2, and so on).
* Iteration by .iterrows().
* Iteration by .itertuples().
* apply() function.
* Vectorized operations, as in NumPy.

In [1]:
import numpy as np
import timeit
import pandas as pd
import math

d = np.random.randint(1, 10, size=(100000, 2))
df = pd.DataFrame(d)
df.columns = ["a", "b"]

# Iterate the rows by iterrows
tmpList = np.zeros(100000)
start = timeit.default_timer()
for i, row in df.iterrows():
    tmpList[i] = math.sqrt(row['a'])
stop = timeit.default_timer()
df['sqrt_a'] = tmpList
print("Running time of iterrows is {} seconds.".format(stop-start))

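# Iterate the rows by itertuples
tmpList = np.zeros(100000)
start = timeit.default_timer()
for row in df.itertuples():
    # row.Index is the row label; using the stale `i` from the previous loop would write every result to one slot
    tmpList[row.Index] = math.sqrt(row.a)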
stop = timeit.default_timer()
df['sqrt_a'] = tmpList
print("Running time of itertuples is {} seconds.".format(stop-start))

# Iterate the rows by iloc operation
tmpList = np.zeros(100000)
start = timeit.default_timer()
for i in range(100000):
    tmpList[i] = math.sqrt(df.iloc[i, 0])
stop = timeit.default_timer()
df['sqrt_a'] = tmpList
print("Running time of iloc is {} seconds.".format(stop-start))

# Apply a function to each element of the column with apply().
start = timeit.default_timer()
df['sqrt_a'] = df["a"].apply(lambda s: math.sqrt(s))
stop = timeit.default_timer()
print("Running time of apply() is {} seconds.".format(stop-start))

# Vectorized NumPy operation on the whole column (no Python-level loop).
start = timeit.default_timer()
df['sqrt_a'] = np.sqrt(df["a"])
stop = timeit.default_timer()
print("Running time of Vectorize is {} seconds.".format(stop-start))

Running time of iterrows is 9.3508115 seconds.
Running time of itertuples is 0.13143509999999914 seconds.
Running time of iloc is 1.1085849000000003 seconds.
Running time of apply() is 0.041646199999998856 seconds.
Running time of Vectorize is 0.00230830000000104 seconds.
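
As a quick sanity check that the fast path computes the same thing as the slow one, the sketch below compares a row loop with the vectorized call on a tiny hand-made frame. The data and the `larger` column are assumptions for illustration; the `np.where` line is an extra, showing one way an if/else inside a loop might be vectorized as well.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 4, 9, 16], 'b': [2, 3, 5, 7]})

# Row-by-row loop: one Python-level call to math.sqrt per row.
looped = np.zeros(len(df))
for row in df.itertuples():
    looped[row.Index] = math.sqrt(row.a)

# Vectorized: a single NumPy call over the whole column.
vectorized = np.sqrt(df['a'])

print(np.allclose(looped, vectorized))

# Conditions vectorize too: np.where replaces an if/else inside a loop.
df['larger'] = np.where(df['a'] > df['b'], df['a'], df['b'])
print(df['larger'].tolist())
```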
