# Building Fast Queries on a CSV

In this project, I will be answering business questions for a hypothetical online laptop store. I will use the `laptops.csv` file as my inventory, which was adapted from the [Laptop Prices dataset](https://www.kaggle.com/ionaskel/laptop-prices) on Kaggle. Here is a brief description of the columns:

* **Id**: A unique identifier for the laptop.
* **Company**: The name of the company that produces the laptop.
* **Product**: The name of the laptop.
* **TypeName**: The type of laptop.
* **Inches**: The size of the screen in inches.
* **ScreenResolution**: The resolution of the screen.
* **Cpu**: The laptop CPU.
* **Ram**: The amount of RAM in the laptop.
* **Memory**: The size of the hard drive.
* **Gpu**: The graphics card name.
* **OpSys**: The name of the operating system.
* **Weight**: The laptop weight.
* **Price**: The price of the laptop.

I will first read the file with the `csv` module and store the header separately from the data rows. To preview the data in a more readable layout, I will also load the first five rows with pandas and print them.

In [1]:
import csv
with open('laptops.csv') as file:
    read = csv.reader(file)
    rows = list(read)
    header = rows[0]
    rows = rows[1:]

import pandas as pd
laptops = pd.read_csv('laptops.csv', nrows=5)
print(laptops)

        Id Company      Product   TypeName  Inches  \
0  6571244   Apple  MacBook Pro  Ultrabook    13.3   
1  7287764   Apple  Macbook Air  Ultrabook    13.3   
2  3362737      HP       250 G6   Notebook    15.6   
3  9722156   Apple  MacBook Pro  Ultrabook    15.4   
4  8550527   Apple  MacBook Pro  Ultrabook    13.3   

                     ScreenResolution                         Cpu   Ram  \
0  IPS Panel Retina Display 2560x1600        Intel Core i5 2.3GHz   8GB   
1                            1440x900        Intel Core i5 1.8GHz   8GB   
2                   Full HD 1920x1080  Intel Core i5 7200U 2.5GHz   8GB   
3  IPS Panel Retina Display 2880x1800        Intel Core i7 2.7GHz  16GB   
4  IPS Panel Retina Display 2560x1600        Intel Core i5 3.1GHz   8GB   

                Memory                           Gpu  OpSys  Weight  Price  
0            128GB SSD  Intel Iris Plus Graphics 640  macOS  1.37kg   1339  
1  128GB Flash Storage        Intel HD Graphics 6000  macOS  1.34kg   

## Creating a class

The goal of this project is to create a class that represents my inventory. The methods in this class will implement the queries that I want to answer about my inventory. I will also preprocess the data to make those queries run faster. 

I will start by implementing the constructor, which will take in the name of the CSV file as an argument and then read the rows contained in the file. 

In [2]:
class Inventory():
    
    def __init__(self, csv_filename):
        import csv
        with open(csv_filename) as file:
            read = csv.reader(file)
            listed = list(read)
        self.header = listed[0]
        self.rows = listed[1:]
        for row in self.rows:
            row[-1] = int(row[-1])
        
laptops = Inventory('laptops.csv')
print(laptops.header)
print(laptops.rows[:5])
print('\n')
print('Number of Rows:', len(laptops.rows))

['Id', 'Company', 'Product', 'TypeName', 'Inches', 'ScreenResolution', 'Cpu', 'Ram', 'Memory', 'Gpu', 'OpSys', 'Weight', 'Price']
[['6571244', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 2.3GHz', '8GB', '128GB SSD', 'Intel Iris Plus Graphics 640', 'macOS', '1.37kg', 1339], ['7287764', 'Apple', 'Macbook Air', 'Ultrabook', '13.3', '1440x900', 'Intel Core i5 1.8GHz', '8GB', '128GB Flash Storage', 'Intel HD Graphics 6000', 'macOS', '1.34kg', 898], ['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575], ['9722156', 'Apple', 'MacBook Pro', 'Ultrabook', '15.4', 'IPS Panel Retina Display 2880x1800', 'Intel Core i7 2.7GHz', '16GB', '512GB SSD', 'AMD Radeon Pro 455', 'macOS', '1.83kg', 2537], ['8550527', 'Apple', 'MacBook Pro', 'Ultrabook', '13.3', 'IPS Panel Retina Display 2560x1600', 'Intel Core i5 3.1GHz', '8GB', '256GB SSD', 'I

## Adding a method to that class

I will be modifying the `Inventory` class throughout the project. Firstly, I will add a way to look up a laptop from a given identifier. This way, when a customer comes to the store with a purchase slip, I can quickly identify to which laptop it corresponds.

In [3]:
class Inventory():
    
    def __init__(self, csv_filename):
        import csv
        with open(csv_filename) as file:
            read = csv.reader(file)
            listed = list(read)
        self.header = listed[0]
        self.rows = listed[1:]
        for row in self.rows:
            row[-1] = int(row[-1])
    
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None
        
laptops = Inventory('laptops.csv')
result1 = laptops.get_laptop_from_id('3362737')
result2 = laptops.get_laptop_from_id('3362736')
print(result1)
print('\n')
print(result2)

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]


None


## Optimizing that method

The previous algorithm scans the rows one by one until it finds a match, so in the worst case it has to look at every single row. This gives it *O(R)* time complexity, where *R* is the number of rows. To speed this up, I will preprocess the data in the constructor by building a dictionary that maps each laptop ID to its row.

In [4]:
class Inventory():
    
    def __init__(self, csv_filename):
        import csv
        with open(csv_filename) as file:
            read = csv.reader(file)
            listed = list(read)
        self.header = listed[0]
        self.rows = listed[1:]
        self.id_to_row = {}
        for row in self.rows:
            row[-1] = int(row[-1])
            id = row[0]
            self.id_to_row[id] = row
    
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None
        
laptops = Inventory('laptops.csv')
result1 = laptops.get_laptop_from_id_fast('3362737')
result2 = laptops.get_laptop_from_id_fast('3362736')
print(result1)
print('\n')
print(result2)

['3362737', 'HP', '250 G6', 'Notebook', '15.6', 'Full HD 1920x1080', 'Intel Core i5 7200U 2.5GHz', '8GB', '256GB SSD', 'Intel HD Graphics 620', 'No OS', '1.86kg', 575]


None


This new implementation has an average-case time complexity of *O(1)*, because a dictionary lookup does not depend on the number of rows, as opposed to the *O(R)* of the earlier method. However, it achieves this by using more memory to store the `self.id_to_row` dictionary.
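The trade-off can be seen on a handful of rows. The following is a minimal sketch, assuming a shortened row layout of `[Id, Company, Product, Price]` with values taken from the first rows printed earlier; the `lookup` helper is a hypothetical name used only for illustration.

```python
# Shortened sample rows (assumption: only Id, Company, Product and Price are kept).
rows = [
    ['6571244', 'Apple', 'MacBook Pro', 1339],
    ['7287764', 'Apple', 'Macbook Air', 898],
    ['3362737', 'HP', '250 G6', 575],
]

# Preprocessing: one pass over the rows, O(R) time and O(R) extra memory.
id_to_row = {row[0]: row for row in rows}

def lookup(laptop_id):
    # dict.get returns None when the key is missing, like the method above.
    return id_to_row.get(laptop_id)

print(lookup('3362737'))  # ['3362737', 'HP', '250 G6', 575]
print(lookup('3362736'))  # None

# The dictionary stores references to the existing row lists, not copies,
# so the extra memory is the keys plus one reference per row.
print(id_to_row['3362737'] is rows[2])  # True
```

Because the dictionary only references the existing rows, the additional memory grows with the number of rows but does not duplicate the row contents.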

In [5]:
import time
import random
ids = [random.randint(1000000, 9999999) for _ in range(10000)]

laptops = Inventory('laptops.csv')
total_time_no_dict = 0
total_time_dict = 0

for id in ids:
    id = str(id)
    start = time.time()
    laptops.get_laptop_from_id(id)
    end = time.time()
    total_time_no_dict += (end - start)
    
for id in ids:
    id = str(id)
    start = time.time()
    laptops.get_laptop_from_id_fast(id)
    end = time.time()
    total_time_dict += (end - start)

print(total_time_no_dict)
print(total_time_dict)

1.3079805374145508
0.00622868537902832


The totals above show that the linear search took about 1.31 seconds for the 10,000 lookups, while the dictionary-based lookup took about 0.006 seconds, which is roughly 200 times faster.
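Timing individual calls with `time.time()` works for a rough comparison, but each measurement is very short and includes the overhead of the timing calls themselves. One possible alternative is the standard library's `timeit` module, which repeats a call many times using a high-resolution timer. The sketch below assumes a small synthetic inventory instead of `laptops.csv`, and the names `lookup_linear`, `lookup_dict` and `missing_id` are hypothetical.

```python
import timeit

# Synthetic inventory (an assumption for illustration): 2,000 rows of [id, price].
rows = [[str(i), i] for i in range(2000)]
id_to_row = {row[0]: row for row in rows}

def lookup_linear(laptop_id):
    # Scans the rows in order; a missing ID forces a full pass over the list.
    for row in rows:
        if row[0] == laptop_id:
            return row
    return None

def lookup_dict(laptop_id):
    # Single hash lookup; on average independent of the number of rows.
    return id_to_row.get(laptop_id)

missing_id = 'not-in-inventory'  # worst case for the linear scan

# timeit.timeit calls the function `number` times and returns the total seconds.
time_linear = timeit.timeit(lambda: lookup_linear(missing_id), number=200)
time_dict = timeit.timeit(lambda: lookup_dict(missing_id), number=200)

print('linear scan:', time_linear)
print('dictionary: ', time_dict)
```

The absolute numbers depend on the machine, so no output is shown here; the point is only that both functions are measured under the same conditions.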

## Laptop promotion

I will now create a method that checks whether a given dollar amount can be spent exactly on either one laptop or two laptops from the inventory (the same laptop may be counted twice). This will help with a promotion that my company is running.

In [6]:
class Inventory():
    
    def __init__(self, csv_filename):
        import csv
        with open(csv_filename) as file:
            read = csv.reader(file)
            listed = list(read)
        self.header = listed[0]
        self.rows = listed[1:]
        self.id_to_row = {}
        for row in self.rows:
            row[-1] = int(row[-1])
            id = row[0]
            self.id_to_row[id] = row
    
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None
        
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
            for other in self.rows:
                if row[-1] + other[-1] == dollars:
                    return True
        return False
        
laptops = Inventory('laptops.csv')
result1 = laptops.check_promotion_dollars(1000)
result2 = laptops.check_promotion_dollars(442)
print(result1)
print(result2)

True
False


## Preprocessing data

Now I can preprocess the data to make this query run faster. I will store all laptop prices in a set, which lets me check in constant time whether there is a laptop with a given price.

In [7]:
class Inventory():
    
    def __init__(self, csv_filename):
        import csv
        with open(csv_filename) as file:
            read = csv.reader(file)
            listed = list(read)
        self.header = listed[0]
        self.rows = listed[1:]
        self.id_to_row = {}
        self.prices = set()
        for row in self.rows:
            row[-1] = int(row[-1])
            id = row[0]
            self.id_to_row[id] = row
            price = row[-1]
            self.prices.add(price)
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None
        
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
            for other in self.rows:
                if row[-1] + other[-1] == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price1 in self.prices:
            price2 = dollars - price1
            if price2 in self.prices:
                return True
        return False
        
laptops = Inventory('laptops.csv')
result1 = laptops.check_promotion_dollars_fast(1000)
result2 = laptops.check_promotion_dollars_fast(442)
print(result1)
print(result2)

True
False


I will now compare the performance of the last two functions. 

In [8]:
import time

laptops = Inventory('laptops.csv')
total_time_no_set = 0
total_time_set = 0

for price in laptops.prices:
    start = time.time()
    laptops.check_promotion_dollars(price)
    end = time.time()
    total_time_no_set += (end - start)
    
for price in laptops.prices:
    start = time.time()
    laptops.check_promotion_dollars_fast(price)
    end = time.time()
    total_time_set += (end - start)

print(total_time_no_set)
print(total_time_set)

11.454347133636475
0.00035953521728515625


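I can see a huge difference between the two versions: the nested-loop method took about 11.45 seconds in total, while the set-based method took about 0.00036 seconds. Note that this benchmark only queries prices that exist in the inventory, so the set-based method always returns on its first membership check. Even so, the result shows that preprocessing the prices into a set speeds up this query considerably.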
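The reason the two versions agree can be checked by hand on a few prices. The following is a minimal sketch, assuming four prices taken from the rows printed earlier; the function names `can_spend_exactly_slow` and `can_spend_exactly_fast` are hypothetical stand-ins for the two methods.

```python
# Four sample prices from the rows shown earlier (assumption: a tiny inventory).
price_list = [1339, 898, 575, 2537]
price_set = set(price_list)

def can_spend_exactly_slow(dollars):
    # Tries every price and every pair of prices: O(R^2) comparisons.
    for first in price_list:
        if first == dollars:
            return True
        for second in price_list:
            if first + second == dollars:
                return True
    return False

def can_spend_exactly_fast(dollars):
    # One membership test per distinct price: O(P) set lookups,
    # where P is the number of distinct prices.
    if dollars in price_set:
        return True
    for first in price_set:
        if dollars - first in price_set:
            return True
    return False

print(can_spend_exactly_fast(898))   # True: a single laptop costs 898
print(can_spend_exactly_fast(1473))  # True: 898 + 575
print(can_spend_exactly_fast(1150))  # True: 575 + 575, same price used twice
print(can_spend_exactly_fast(442))   # False: cheaper than every laptop
```

For each candidate first price, the set answers "is there a laptop costing the remainder?" in constant time on average, which replaces the inner loop of the slow version.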

## Finding laptops within a budget

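Now I will use a binary search algorithm to help a customer find all laptops that fall within their budget. To do this, I will sort the laptops by price and then search for the index of the first laptop that is more expensive than the budget. Every laptop before that index is affordable, and the method returns -1 when no laptop is more expensive than the target price.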

In [9]:
def row_price(row):
    return row[-1]

class Inventory():
    
    def __init__(self, csv_filename):
        import csv
        with open(csv_filename) as file:
            read = csv.reader(file)
            listed = list(read)
        self.header = listed[0]
        self.rows = listed[1:]
        self.id_to_row = {}
        self.prices = set()
        for row in self.rows:
            row[-1] = int(row[-1])
            id = row[0]
            self.id_to_row[id] = row
            price = row[-1]
            self.prices.add(price)
        self.rows_by_price = sorted(self.rows, key=row_price)
            
    def get_laptop_from_id(self, laptop_id):
        for row in self.rows:
            if laptop_id == row[0]:
                return row
        return None
    
    def get_laptop_from_id_fast(self, laptop_id):
        if laptop_id in self.id_to_row:
            return self.id_to_row[laptop_id]
        else:
            return None
        
    def check_promotion_dollars(self, dollars):
        for row in self.rows:
            if row[-1] == dollars:
                return True
            for other in self.rows:
                if row[-1] + other[-1] == dollars:
                    return True
        return False
    
    def check_promotion_dollars_fast(self, dollars):
        if dollars in self.prices:
            return True
        for price1 in self.prices:
            price2 = dollars - price1
            if price2 in self.prices:
                return True
        return False
    
    def find_first_laptop_more_expensive(self, target_price):
        range_start = 0
        range_end = len(self.rows_by_price) - 1
        while range_start < range_end:
            range_middle = (range_end + range_start) // 2
            price = self.rows_by_price[range_middle][-1]
            if price > target_price:
                range_end = range_middle
            else:
                range_start = range_middle + 1
        if self.rows_by_price[range_start][-1] <= target_price:
            return -1
        return range_start
          
laptops = Inventory('laptops.csv')
result1 = laptops.find_first_laptop_more_expensive(1000)
result2 = laptops.find_first_laptop_more_expensive(10000)
print(result1)
print(result2)

683
-1
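The index found by the binary search can then be used to slice the sorted rows and list the affordable laptops. The sketch below is one possible way to do this, assuming a shortened row layout of `[Id, Product, Price]` and using the standard library's `bisect` module in place of the hand-written search; `laptops_within_budget` is a hypothetical helper. Unlike the method above, `bisect.bisect_right` returns the length of the list (not -1) when no price exceeds the budget, which makes the slice work without a special case.

```python
import bisect

# Shortened sample rows (assumption: only Id, Product and Price are kept).
rows = [
    ['6571244', 'MacBook Pro', 1339],
    ['7287764', 'Macbook Air', 898],
    ['3362737', '250 G6', 575],
    ['9722156', 'MacBook Pro', 2537],
]

# Sort once during preprocessing, then keep the prices in a parallel list.
rows_by_price = sorted(rows, key=lambda row: row[-1])
sorted_prices = [row[-1] for row in rows_by_price]

def laptops_within_budget(budget):
    # bisect_right gives the index of the first price strictly greater
    # than budget, found with a binary search in O(log R) steps.
    first_more_expensive = bisect.bisect_right(sorted_prices, budget)
    return rows_by_price[:first_more_expensive]

print(laptops_within_budget(1000))
# [['3362737', '250 G6', 575], ['7287764', 'Macbook Air', 898]]
print(len(laptops_within_budget(10000)))  # 4: every laptop is affordable
print(laptops_within_budget(100))         # []: nothing is affordable
```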
