# ID 5059 Coursework 1
John Belcher-Heath (jbh6)

# Introduction

The task is to predict the price of a car from a subset of attributes from the Kaggle dataset.

I will complete the task by following the ML checklist in the book Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, which is:

1. Frame the problem
2. Get the data
3. Explore the data
4. Prepare the data
5. Explore models
6. Fine-tune models
7. Present solution
8. Launch/maintain

# 1. Frame the problem

We want to predict the price of a car (a continuous value) using a small selection of the attributes available to us. This makes the problem a regression problem.

Since this is a regression problem the standard performance measure of Root Mean Square Error (referred to as RMSE from now on) will be used:

$$
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^n(y_i - \hat{y}_i)^2}
$$

For this measure a lower RMSE is better: it means the residuals are small overall and the model fits the data well.
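
As a small worked example of the formula above, the sketch below computes the RMSE with NumPy. The helper name `rmse` and the three example prices are assumptions made purely for illustration; they are not taken from the dataset.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean square error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Square the residuals, average them, then take the square root
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Hypothetical prices: the residuals are -1000, 1000 and -2000
example_rmse = rmse([10_000, 20_000, 30_000], [11_000, 19_000, 32_000])
print(round(example_rmse, 2))
```

Because the residuals are squared before averaging, the largest error (2000) influences the result more than the two smaller ones.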

# 2. Get the data

In this section a random selection of entries from one of the large datasets will be obtained and read into a `pandas.DataFrame` to explore. Only a random sample of the large dataset is explored, since at this stage all we are doing is getting to know the data. Exploring a very large amount of data would be time-consuming, but too small (or non-random) a sample would mean our observations may not be valid. A random sample of a large dataset should give a reasonably good representation of the overall dataset, whilst minimising the amount of data that needs to be manipulated.

Note that when it comes to applying the model, I will include a check of the data to make sure our observations on the smaller dataset still hold.

In [None]:
import sys
!{sys.executable} -m pip install numpy pandas matplotlib scikit-learn | grep -v 'already satisfied'

# Import libraries
import pandas as pd
import numpy as np
import dask.dataframe as dd
import sklearn
import os
import glob
from pathlib import Path
import math

In [None]:
# folder_path: str = "/cs/studres/ID5059/Coursework/Coursework-1/data/2_medium" # uni
folder_path : str = r"/home/johnbh/personal_git/ID5059_coursework_1/data/3_large" # Desktop

if not os.path.exists(folder_path):
    raise FileNotFoundError
os.chdir(folder_path)

file_names : list = [i for i in glob.glob("*.{}".format('csv'))]

    
def read_car_data(filepath : str) -> pd.DataFrame:
    """
    Reads a filepath and returns the dataframe
    :param filepath: The location of the file to read
    :return: returns the pandas dataframe
    """
    return pd.read_csv(filepath, index_col = "vin")

frac: float = 0.6 # fraction of data to use to explore
original_df: pd.DataFrame = read_car_data(file_names[0])
df: pd.DataFrame = original_df.sample(frac = frac)
sample_size: int = len(df)

# Clear the maximum number of columns to be displayed, so that all will be visible.
pd.set_option('display.max_columns', None)
# check data looks roughly okay
df.head(5)

# 3. Explore the data

The data will now be inspected using the `info` output to explore which attributes are available. Attributes with a large proportion of NAs can also start to be identified.

In [None]:
df: pd.DataFrame = df.reset_index(drop=True) # Reindex to make elements easier to quickly access
df.info()

In [None]:
# Explore attributes
df.head(5)

Initial observations from head:

- A lot of measurements contain the units, making them non-numerical
- Descriptions contain lots of irrelevant information
- A few columns seem to represent the same information
- Some attributes appear to have lots of NaNs
- Multiple ID attributes which can all be dropped
- `major_options` is a list which will need parsing somehow
- `power` contains all the info of `horsepower`
- Lots of irrelevant metadata to drop

In [None]:
from sklearn.model_selection import train_test_split

split_train: float = 0.6  # fraction of the data to use for the training set

train_set, test_set = train_test_split(df, train_size=split_train, random_state=314)

### Start to inspect
Firstly, let's drop all attributes from above which have less than 50% non-null values, since including these may negatively affect our model if the majority of entries do not have this attribute. Using them would make the model less general.

In [None]:
# Drop all attributes with less than 50% non-null values
df = df.drop(columns=df.keys()[df.count() / sample_size < 0.5])
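
The same rule can also be sketched with `DataFrame.dropna` and its `thresh` argument, which keeps only the columns holding at least a given number of non-null values. The toy frame and its column names below are made up for illustration.

```python
import math

import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, 3.0, np.nan],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
    "half_present": [1.0, np.nan, 2.0, np.nan],
})

# Keep a column only if at least 50% of its values are non-null
min_non_null = math.ceil(0.5 * len(toy))
kept = toy.dropna(axis="columns", thresh=min_non_null)
print(kept.columns.tolist())
```

A column sitting exactly on the 50% boundary is kept, which matches the "less than 50%" wording of the rule.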

### Data types correction
Some of the attributes appear to have been imported with an unexpected data type, for example `dealer_zip` as `object` rather than `int64`. pandas falls back to `object` when a column contains values it cannot parse as numbers (an integer column that merely contains `NaNs` is read as `float64` instead), so these attributes most likely hold non-numeric strings.
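
To make the type inference concrete, here is a minimal sketch using a made-up three-column CSV (the column names and values are assumptions, not the real data). It shows how `pd.read_csv` treats an integer column with a gap compared with a column containing a non-numeric string.

```python
import io

import pandas as pd

# Hypothetical CSV: zip_a has a missing value, zip_c has a non-numeric entry
csv_text = (
    "zip_a,zip_b,zip_c\n"
    "12345,12345,12345\n"
    ",67890,67890-1234\n"
    "54321,54321,54321\n"
)
toy = pd.read_csv(io.StringIO(csv_text))

# Map each column name to the dtype pandas inferred for it
inferred = toy.dtypes.astype(str).to_dict()
print(inferred)
```

Here the column with the gap is read as a float column, the complete column stays an integer column, and only the column with the hyphenated entry ends up as a non-numeric (string-like) type.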

To further inspect this, all `object` data types are shown below.

In [None]:
df.select_dtypes(include=object).info()

From manual inspection there are some attributes that need further inspection to check they have been given the correct type. The first 5 entries are shown below to help.

In [None]:
pd.set_option('display.max_columns', None)
df.select_dtypes(include=object).head(5)

The only attribute that can be directly converted to an integer is `dealer_zip`. This is unlikely to provide any information that `latitude` and `longitude` do not already give, so there is no need to convert it. It is dropped from our dataset below.

This inspection has shown that a lot of the measurements have had units included, so these attributes will need to be converted to numerical.

In [None]:
# Drop dealer_zip
try:
    df = df.drop(columns='dealer_zip')
except KeyError:
    print("Column already dropped")

In [None]:
def convert_measurement(s: str) -> float:
    """
    Converts a measurement with units to a numerical value
    :param s: string measurement
    :type s: str
    :return: the actual numerical value
    """
    if isinstance(s, str):
        s_split: list = s.split(" ")
        try:
            return float(s_split[0])
        # If the value cannot be converted, e.g. a placeholder for NA, then return NaN
        except ValueError:
            return float('NaN')
    # If already converted to the correct format, i.e. if the function is accidentally run twice
    else:
        return s

cols_to_convert: list = ["back_legroom", "front_legroom", "fuel_tank_volume", "height", "length", 
                         "maximum_seating", "wheelbase", "width"]

In [None]:
# Apply the function to get numerical data from the string measurements
df[cols_to_convert] = df[cols_to_convert].applymap(convert_measurement)
df[cols_to_convert] = df[cols_to_convert].astype(np.float64)
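
For comparison, a vectorised version of the same idea can be sketched with the `.str` accessor and `pd.to_numeric`. This is only an alternative under the assumption that each measurement looks like `"<number> <unit>"`; the example strings are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical measurements: a value with units, a placeholder, a missing value
toy = pd.Series(["35.1 in", "--", np.nan, "13.2 gal"])

# Take the text before the first space, then coerce anything unparseable to NaN
numeric = pd.to_numeric(toy.str.split(" ").str[0], errors="coerce")
print(numeric.tolist())
```

Using `errors="coerce"` plays the same role as the `ValueError` branch in `convert_measurement`: placeholders that are not numbers become `NaN` rather than raising.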

It is important to note that the attributes `power` and `torque` also contain numerical data, but these cannot simply be converted at this point, so they are saved for later.

Next, let's drop all the irrelevant metadata which won't help our model and will instead just increase its complexity, which could lead to overfitting. For example `description`, `interior_color`, `exterior_color`, etc.

In [None]:
df = df.drop(columns=['description', 'interior_color', 'exterior_color', 
                      'main_picture_url', 'model_name', 'sp_name', 'transmission_display',
                      'trim_name', 'trimId'])

### Fixing duplicates part 1

It is easy to see that `engine_cylinders` and `engine_type` appear to be duplicates. Similarly, so do `wheel_system` and `wheel_system_display`, as well as `make_name` and `franchise_make`.

Before dropping one of each of these, the data will be further inspected to make sure that there's no discrepancy between the two in the wider data set (i.e. not just in the head).

In [None]:
df_engine = df[['engine_cylinders', 'engine_type']]
df_engine[np.logical_xor(df_engine.engine_cylinders.isna(), df_engine.engine_type.isna())].count()

The count above tells us that the two attributes are missing (NA) for exactly the same entries, so dropping one of them means no information is lost.
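
As a hedged illustration of this kind of check, the sketch below applies the same XOR idea to a made-up four-row frame. Note that the XOR only compares where values are missing; the extra equality check on complete rows is an addition for illustration, not part of the analysis above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the last row has engine_type but no engine_cylinders
toy = pd.DataFrame({
    "engine_cylinders": ["I4", np.nan, "V6", np.nan],
    "engine_type": ["I4", np.nan, "V6", "V8"],
})

# True where exactly one of the two attributes is missing
mismatch = toy["engine_cylinders"].isna() ^ toy["engine_type"].isna()
n_mismatch = int(mismatch.sum())

# Optional extra check: do the values agree wherever both are present?
complete = toy.dropna()
values_agree = bool((complete["engine_cylinders"] == complete["engine_type"]).all())
print(n_mismatch, values_agree)
```

A mismatch count of zero on the real data would mean the two columns are missing together; the value comparison would additionally confirm that they carry the same labels.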

In [None]:
df = df.drop(columns='engine_cylinders')

For the `wheel_system` and `wheel_system_display`:

In [None]:
df_wheel = df[['wheel_system', 'wheel_system_display']]
df_wheel[np.logical_xor(df_wheel.wheel_system.isna(), df_wheel.wheel_system_display.isna())].count()

The above implies that both attributes provide the same information for the cars, so the choice of which to drop does not matter. I will drop `wheel_system_display`, since `wheel_system` uses a nice short abbreviation.

In [None]:
df = df.drop(columns='wheel_system_display')

Finally for make.

In [None]:
df_make = df[['make_name', 'franchise_make']]
df_make[np.logical_xor(df_make.make_name.isna(), df_make.franchise_make.isna())].count()

From this we can see that the `make_name` has more information than the `franchise_make`, hence the `franchise_make` is dropped.

In [None]:
df = df.drop(columns='franchise_make')

### Fixing duplicates part 2

For part 2, the duplicated data may need to be extracted and compared before simply dropping attributes.

Let's inspect the engine data:

In [None]:
df[np.logical_xor(df.engine_displacement.isna(), df.horsepower.isna())]

So, luckily `horsepower` and `power` do give the same information so one can be dropped arbitrarily. As horsepower is already numerical, `power` will be dropped.

In [None]:
#### could add RPM attribute

Another potentially useful attribute is RPM, which could help to distinguish between performance cars and 4x4s with similarly large horsepower. However, there may be too many NAs to use this metric. Let's see.

In [None]:
# Fraction of entries with no horsepower value
df.horsepower.isna().sum() / sample_size

So from above we can see that only around 5% have no `horsepower` attribute. For these entries we will consider how many are also missing the `engine_type` attribute.

In [None]:
len(df[(df['horsepower'].isna() & df['engine_type'].isna())]) / sample_size

Now there is only a small number of cars with none of the `horsepower`, `power` or `engine_type` attributes. These entries will simply take the overall average for `horsepower`.

The `horsepower` for all cars will be assigned using the following:

- if the car has `horsepower` assigned, keep it
- elif the car has `engine_type` assign average for that type
- else assign the overall average for `horsepower`
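
The three rules above can be sketched on a toy frame as follows. The values are made up, and the final fallback here uses the mean of the originally observed values, which is one possible choice rather than the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical data covering all three cases
toy = pd.DataFrame({
    "engine_type": ["I4", "I4", "V6", "V6", np.nan, np.nan],
    "horsepower": [150.0, np.nan, 300.0, np.nan, 200.0, np.nan],
})

# Mean horsepower of each engine_type, aligned row by row (NaN where the type is missing)
group_mean = toy.groupby("engine_type")["horsepower"].transform("mean")

# Rules 1 and 2: keep existing values, otherwise use the mean for the engine type
filled = toy["horsepower"].fillna(group_mean)

# Rule 3: anything still missing takes the overall mean of the observed values
filled = filled.fillna(toy["horsepower"].mean())
print(filled.round(2).tolist())
```

Using `fillna` with the group means leaves every existing `horsepower` value untouched, including rows where `engine_type` itself is missing.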

Let's do the first two steps:

In [None]:
# Fill missing horsepower with the mean for its engine_type, leaving existing values untouched
df['horsepower'] = df['horsepower'].fillna(df.groupby('engine_type')['horsepower'].transform('mean'))

Let's examine the improvements

In [None]:
df.horsepower.count() / sample_size

Now for the final step of assigning the remaining NAs the average of all the horsepowers:

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df[['horsepower']])

df[['horsepower']] = imputer.transform(df[['horsepower']])

Let's see the results

In [None]:
df.horsepower.count() / sample_size, df.horsepower.hist()

Everything looks all good!

In [None]:
#######################################

In [None]:
def get_power_data(s: str):
    """
    Returns the hp and RPM from the power string
    attribute of a vehicle
    """
    if not pd.isna(s):
        try:
            string_split: list = s.split(" ")
            return string_split[0], string_split[3].replace(",", "")
        except AttributeError:
            pass
    return np.nan, np.nan

# Example of usage
#zip(*map(get_power_data, df.engine_size))

### Object type attributes
Now we have removed some of the duplicates and corrected some of the data type issues the `object` type attributes will be properly explored now.

In [None]:
df.select_dtypes(include=object).head(5)

First let's see if any of the attributes have any glaring issues with NAs.

In [None]:
df.select_dtypes(include=object).count() / sample_size

Clearly some of the attributes are not suitable to use since they have a low number of entries. Any object attributes with less than 80% of entries present are removed.

In [None]:
object_cols = df.select_dtypes(include=object)
# Drop object attributes where less than 80% of the entries are present
df = df.drop(columns=object_cols.columns[object_cols.count() / sample_size < 0.8])

This leaves:

In [None]:
df.select_dtypes(include=object).head(5)

Since we have the `daysonmarket` attribute, `listed_date` can be dropped. Additionally, `city` can be assumed to have minimal effect, since most cities can be assumed to have a diverse range of individuals with varying wealth and cars.

In [None]:
df = df.drop(columns=['city', 'listed_date'])

`torque` could be useful, but there are too few entries for it (see below) and it is not recorded elsewhere (unlike `horsepower`, which is also recorded in `power` and `engine_size`). Hence I will not use this attribute for my model.

In [None]:
df.torque.count() / sample_size

In [None]:
df = df.drop(columns='torque')

For `major_options`, since visual inspection shows so much variability in the naming of options, the number of major options will be used instead. The actual usefulness of this will be explored later.

In [None]:
# Count the comma-separated options; entries that are not strings (i.e. NA) stay as NaN
df.major_options = df.major_options.apply(lambda x: len(x.split(",")) if isinstance(x, str) else np.nan).astype(np.float64)

For the remaining attributes, these will be used as categorical attributes in the model.

### Exploring the numerical attributes

Now the qualitative attributes have been dealt with, it's time for the quantitative attributes.

Let's explore all the numerical attributes with an actual numerical meaning (the index or `listing_id` have no meaning numerically). Attributes with no numerical meaning are dropped below.

In [None]:
# Quick inspection to see which numerical but non-relevant attributes need to be dropped
df.select_dtypes(include=[np.int64, np.float64])

In [None]:
import matplotlib.pyplot as mpl
%matplotlib inline

df_numerical = df.select_dtypes(include=[np.int64, np.float64]).drop(columns=['listing_id', 'sp_id'])
df_numerical.hist(figsize=(16,20), bins=30)
mpl.show()

Observations:
- Both fuel economy attributes appear to be normally distributed with a slight skew
- The majority of cars do not stay on the market for long, mostly less than a couple of months. A few stay far longer, so these may need to be removed so they do not skew the data.
- Engine displacement doesn't appear to follow any obvious standard distribution
- Horsepower appears to be normally distributed around 200hp with a standard deviation of around 50hp
- Latitude is, as expected, grouped together around 39 to 44
- Longitude is split into two peaks, most likely corresponding to central US and Alaska
- Mileage of most cars is grouped around 0, with fewer cars at higher mileage, as would be expected
- Owner count has a mode of 1, again as expected
- Most car prices are grouped around the same order of magnitude. However, some extremes are seen. A logarithmic transformation may need to be considered later.
- Seller ratings appear to be negatively skewed, with most ratings towards the higher end
- The majority of cars are from the last 15 years
- The modal maximum seating is 5

It is also clear that some of the bins are very sparse, so coarser bins with labels will be needed for our model later, to make sure our training set and test set have similar distributions.

In [None]:
df.select_dtypes(include=[np.float64, np.int64]).count() / sample_size

Firstly, it is clear that there is no issue with NAs in the attributes `daysonmarket`, `latitude`, `longitude`, `price`, `savings_amount` and `year` (as well as `horsepower` after the fix above). Using contextual knowledge, all these attributes (excluding `price`, as this is the target) will likely be useful in predicting `price`, so will be used.

Looking at the list of other attributes, which have a lower number of non-nulls, the additional attributes I believe may affect the `price` and want to explore more are:

- `city_fuel_economy` and `highway_fuel_economy` - useful metric of car performance, more powerful and expensive cars likely to have lower fuel efficiency
- `fuel_tank_volume` - bigger more expensive cars likely to have a large fuel tank, hence useful metric
- `engine_displacement` and `horsepower` (and `power` which will be used to get na values) - all similar/the same metrics for how powerful a car is
- `major_options` - more expensive cars tend to have more options
- `mileage` - more miles done the less it is valued generally
- `seller_rating` - If a seller has a better rating people may pay more than if they were to go to a seller with a poor rating.
- `length` and `width` - A measure of the size of the car. Large cars tend to be more expensive. E.g. sports cars are very wide generally.

I have chosen not to include `owner_count` since there are too few entries for this attribute.

To explore these options, some transformation is required to reduce the skew caused by extreme values, and also to reduce the complexity of the model.

### Attribute transformation
From the graphs above some attributes we have chosen to explore further need transforming so that the distribution of the training set and test set are similar. To do this the function below will be used.

In [None]:
### NEED TO TWEAK
def transform_bins(pds: pd.Series, bins, min_val = None, max_val = None) -> pd.Series:
    """
    Function to transform a continuous series with sparse data to a categorical attribute with full bins.
    The absolute max is always 0 and inf to make sure all data is captured.
    :param pds: original cts data
    :param bins: number of bins in resultant series (note this is how many will be attempted to be created)
    :min_val: starting value for main section of the bins
    :max_val: ending value for main section of the bins
    :return: transformed series
    """
    bins -=1
    if min_val is not None and max_val is not None: 
        cuts: list = np.append(np.linspace(min_val, max_val, bins), np.array([np.inf])).tolist()
        cuts.insert(0,0)
    else:
        cuts: list = np.append(np.linspace(pds.quantile(0.025), pds.quantile(0.975), bins), np.array([np.inf])).tolist()
        cuts.insert(0, 0)
        
    # Drop any duplicates, ie if 0 included twice
    cuts = list(dict.fromkeys(cuts))
    labels: list = [str(i) for i in range(len(cuts)-1)]
    # include_lowest needed to make sure if values are 0 they're still given a label
    return pd.cut(pds, bins=cuts, labels=labels, include_lowest=True).astype(np.float64)
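
To show how bin edges of this shape behave, here is a minimal standalone sketch using `pd.cut` directly. The values and the inner range of 10 to 30 are made up; the edges follow the same pattern of 0, evenly spaced inner edges, then infinity.

```python
import numpy as np
import pandas as pd

# Hypothetical continuous values, including 0 and one extreme value
values = pd.Series([0, 5, 12, 18, 25, 40, 1000])

inner_edges = np.linspace(10, 30, 3)            # 10, 20, 30
edges = [0.0, *inner_edges.tolist(), np.inf]    # 0, 10, 20, 30, inf

# labels=False returns the integer position of the bin each value falls into
codes = pd.cut(values, bins=edges, labels=False, include_lowest=True)
print(codes.tolist())
```

The open-ended last bin absorbs the extreme value, and `include_lowest=True` is what allows the value 0 to receive a label at all.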

The attributes needing to be transformed are:

In [None]:
transform_attributes: list = ["city_fuel_economy", "highway_fuel_economy","daysonmarket", "fuel_tank_volume", 
                              "mileage", "savings_amount", "year"]

The function will first be applied uniformly with 30 bins for each attribute. These will then be inspected to see if a more detailed transformation is required.

In [None]:
transformed_attr: pd.DataFrame = df[transform_attributes].apply(lambda x: transform_bins(x, bins=30))
transformed_attr.hist(figsize=(16,16))
mpl.show()

In [None]:
transformed_attr.count() / sample_size

These distributions look much better than before. However, there may be a slight issue with `savings_amount` and `city_fuel_economy`. For these, a different min, max and number of bins need to be used. Using contextual knowledge, the following conversions are used.

In [None]:
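# Store the custom binning in transformed_attr so it is not overwritten when copied back into df
transformed_attr[['city_fuel_economy']] = df[['city_fuel_economy']].apply(lambda x: transform_bins(x, bins=5, min_val=18, max_val=28))

In [None]:
transformed_attr[['savings_amount']] = df[['savings_amount']].apply(lambda x: transform_bins(x, bins=5, min_val=100, max_val=3000))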

In [None]:
df[transform_attributes] = transformed_attr[transform_attributes]

In [None]:
transformed_attr.hist(figsize=(16,16))

These look much better than before.

After all the exploratory analysis, a list of attributes which are hopefully correlated with the `price` attribute has been identified. But before preparation, let's take a look at the correlation between the numerical attributes and the `price`, to possibly eliminate some attributes and so reduce the complexity.

### Explore correlations

In [None]:
chosen_numerical_attributes : list = ['daysonmarket', 'latitude', 'longitude', 'price', 'savings_amount', 'year', 'horsepower', 'city_fuel_economy', 
'highway_fuel_economy', 'fuel_tank_volume', 'engine_displacement', 'major_options', 'mileage', 'seller_rating',
'length', 'width']

In [None]:
df_numerical = df.select_dtypes(include=[np.float64, np.int64])[['daysonmarket', 'latitude', 'longitude', 'price', 'savings_amount', 'year', 'horsepower', 'city_fuel_economy', 
'highway_fuel_economy', 'fuel_tank_volume', 'engine_displacement', 'major_options', 'mileage', 'seller_rating',
'length', 'width', 'wheelbase']]
# abs taken as we don't care whether the effect is positive or negative
corr_series = abs(df_numerical.drop("price", axis=1).apply(lambda x: x.corr(df_numerical.price)))
corr_series.sort_values()

Clearly some of the attributes left don't have much of a correlation with `price`.
Now let's choose all attributes with an absolute correlation of more than 0.25 and use some of our contextual knowledge to inspect them.
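
The thresholding step can be sketched on synthetic data as follows. The column names, the random seed and the relationship between `horsepower` and `price` are all assumptions chosen for illustration; only the 0.25 cut-off comes from the text above.

```python
import numpy as np
import pandas as pd

# Synthetic data: price depends on horsepower, not on the unrelated column
rng = np.random.default_rng(0)
n = 500
horsepower = rng.normal(200, 50, n)
toy = pd.DataFrame({
    "horsepower": horsepower,
    "unrelated": rng.normal(0, 1, n),
    "price": 100 * horsepower + rng.normal(0, 2000, n),
})

# Absolute Pearson correlation of every attribute with price
abs_corr = toy.drop(columns="price").corrwith(toy["price"]).abs()

# Keep only the attributes above the threshold
selected = abs_corr[abs_corr > 0.25].index.tolist()
print(selected)
```

Pearson correlation only measures linear association, so an attribute with a strongly non-linear relationship to `price` could fall below the threshold even though it is informative.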

In [None]:
corr_series[corr_series > 0.25]

All these attributes seem to make logical sense. One attribute that could be removed is one of `wheelbase` or `length`, since they are different ways of measuring the length of a car. Since `wheelbase` has the higher correlation, `length` will be dropped. Let's inspect the above attributes in more detail. Any attributes we had initially chosen but which are not included above will be dropped.

In [None]:
chosen_numerical_attributes = corr_series[corr_series > 0.25].keys().tolist()
chosen_numerical_attributes.remove('length')
chosen_numerical_attributes.append('price')

In [None]:
df[chosen_numerical_attributes].info()

In [None]:
pd.plotting.scatter_matrix(df[chosen_numerical_attributes], figsize=(15,15))
mpl.show()

Inspecting the `price` row (or column), `horsepower` and `mileage` have the strongest correlation, as expected. 
`wheelbase` and `width` appear to have a similar correlation to price, which is to be expected since both are measurements of size.

# 4. Prepare the data
### Dealing with NAs

### Test distributions
Let's compare the training set and test set distributions as a check. All NAs will be dropped for this, since the splitter will not work otherwise. Note this is valid since we can assume that both sets should have a similar proportion of NAs.

In [None]:
df['price']

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit
shuffled_data = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=314)

attr_dist_change: list = ['city_fuel_economy', 'highway_fuel_economy', 'year']

results: dict = {}
    
for attr in attr_dist_change:
    # Drop NAs in the stratification attribute, since the splitter cannot handle them
    df_attr = df.dropna(subset=[attr])
    # Sample for all these new distributions created to see any problems
    [(train_index, test_index)] = shuffled_data.split(df_attr, df_attr[attr])
    # split() returns positional indices, so use iloc rather than loc
    stratified_train_set = df_attr.iloc[train_index]
    stratified_test_set = df_attr.iloc[test_index]

    train_make_up = stratified_train_set[attr].value_counts() / len(stratified_train_set)
    test_make_up = stratified_test_set[attr].value_counts() / len(stratified_test_set)
    results[attr] = train_make_up - test_make_up
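
As a hedged illustration of what this comparison should show, the sketch below runs the same kind of stratified split on a made-up binned attribute (`year_bin` is a hypothetical name) and measures the largest gap between the training and test proportions.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical binned attribute: 50% in bin 0, 30% in bin 1, 20% in bin 2
toy = pd.DataFrame({"year_bin": [0] * 50 + [1] * 30 + [2] * 20})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=314)
# split() yields positional indices, hence iloc below
train_idx, test_idx = next(splitter.split(toy, toy["year_bin"]))

train_share = toy.iloc[train_idx]["year_bin"].value_counts(normalize=True)
test_share = toy.iloc[test_idx]["year_bin"].value_counts(normalize=True)

# Largest difference in proportion for any bin between the two sets
max_gap = float((train_share - test_share).abs().max())
print(max_gap)
```

A gap close to zero for every bin indicates that stratifying on that attribute preserves its distribution in both sets.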

Now all the data which was really numeric has been converted, so only the categorical data is left. This can be seen below.

In [None]:
df.select_dtypes(include=object).head(5)

In [None]:
df.select_dtypes(include=object).count() / sample_size

Clearly there are severe issues with `owner_count`. This could be an important metric; however, the `mileage` attribute will likely convey similar information implicitly, and in more detail, since it is a continuous rather than a discrete attribute. Hence `owner_count` will be omitted.

In [None]:
df = df.drop(columns='owner_count')

In [None]:
###########################################################

In [None]:
df.loc[:, (df.count() / len(df.index)) < 0.95]

As a large number of the attributes are categorical or boolean, changing the remaining NAs to the average of the column would not make much sense. Furthermore, since the dataset is very large, removing entries with NAs is unlikely to heavily impact the model. However, to make sure that no one type of car or manufacturer is being discriminated against, the entries with NAs will be inspected before they are removed.

In [None]:
has_na = df.isnull().any(axis=1)
car_makes_with_NA = df[has_na].make_name.unique()
df_prepared = df[~has_na]
car_makes_no_NA = df_prepared.make_name.unique()

# True if every make that appears in a row with NAs also appears in the rows that are kept
set(car_makes_with_NA).issubset(car_makes_no_NA)

This means all car makes are being represented still even when NA rows are removed.
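
When comparing sets of makes like this, it is worth noting that Python's `and` returns one of its operands rather than computing an intersection, so `issubset` (or set difference) is the safer tool. A minimal sketch with hypothetical makes:

```python
# Hypothetical makes: BMW only ever appears in rows that contain NAs
makes_with_na = {"Ford", "BMW"}
makes_without_na = {"Ford", "Toyota"}

# `and` returns its second operand when the first is non-empty,
# so this comparison is True regardless of the contents
and_result = (makes_without_na and makes_with_na) == makes_with_na

# issubset actually checks that no make is lost when NA rows are removed
subset_result = makes_with_na.issubset(makes_without_na)

# The set difference lists exactly which makes would disappear
lost_makes = makes_with_na - makes_without_na
print(and_result, subset_result, lost_makes)
```

The set difference is useful on its own, since it names the makes that would be lost rather than only reporting whether any are.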

Now the data has been filtered, the exploration of relationships can begin. Note that only 4% of the entries have been removed.

In [None]:
# Fraction of the sampled entries that remain after removing rows with NAs
round(len(df_prepared.index) / sample_size, 2)

In [None]:
used_cars_prices = df_prepared["price"].copy()
used_cars_prices.head()

In [None]:
# os.chdir(folder_path)
# file_names : list = [i for i in glob.glob("*.{}".format('csv'))]
# df = pd.concat(map(read_car_data, file_names))


## Choosing attributes

Inspecting this list, and using our contextual knowledge of cars as well as the info available on the [Kaggle page](https://www.kaggle.com/datasets/ananaymital/us-used-cars-dataset), certain attributes can be removed immediately, leaving ones that are believed to influence the price. Any attributes left will be further inspected before any models are used.

Note: data types are now defined to make sure any further exploration is done correctly.

In [None]:
# Attributes believed to influence price
desired_attributes : list = ["body_type", "city", "daysonmarket", "dealer_zip", "engine_cylinders", "engine_displacement",
                             "engine_type", "fleet", "frame_damaged", "franchise_dealer", "fuel_tank_volume", "has_accidents", "horsepower",
                            "is_new", "listed_date", "make_name", "owner_count", "power", "price",
                             "savings_amount", "seller_rating", "year", "torque"
                            ]
    
# TODO: remove any irrelevant ones
# Define datatypes of attributes to make sure any exploration is done correctly.
data_types = {'vin' : str, 'back_legroom' : str, 'bed' : str, 'bed_height' : str,
              'bed_length' : str, 'body_type' : str, 'cabin' : str, 'city' : str,
              'city_fuel_economy' : np.float64, 'combine_fuel_economy' : np.float64,
              'daysonmarket' : np.int32, 'dealer_zip' : np.int32, 'description' : str, 
              'engine_cylinders' : str, 'engine_displacement' : np.float64,
              'engine_type' : str, 'exterior_color' : str, 'fleet' : bool, 'frame_damaged' : bool,
              'franchise_dealer' : bool, 'franchise_make' : str, 'front_legroom' : str,
              'fuel_tank_volume' : str, 'fuel_type' : str, 'has_accidents' : bool, 'height' : str,
            'highway_fuel_economy' : np.float64, 'horsepower' : np.float64, 'interior_color' : str, 'isCab' : bool,
            'is_certified' : bool, 'is_cpo' : bool, 'is_new' : bool, 'is_oemcpo' : bool, 'latitude' : np.float64, 'length' : str,
            'listed_date' : str, 'listing_color' : str, 'listing_id' : np.int32, 'longitude' : np.float64,
            'main_picture_url' : str, 'major_options' : str, 'make_name' : str, 'maximum_seating' : np.int32,
            'mileage' : np.int32, 'model_name' : str, 'owner_count' : np.int32, 'power' : str, 'price' : np.float64, 'salvage' : bool,
            'savings_amount' : np.int32 , 'seller_rating' : np.float64, 'sp_id' : np.int32, 'sp_name' : str, 'theft_title' : bool,
            'torque' : str, 'transmission' : str, 'transmission_display' : str, 'trimId' : np.int32, 'trim_name' : str,
            'vehicle_damage_category' : str, 'wheel_system' : str, 'wheel_system_display' : str,
            'wheelbase' : str, 'width' : str, 'year' : np.int32}

    
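# convert_dtypes() infers nullable dtypes itself and does not accept a dtype mapping;
# applying data_types explicitly would need astype() or read_csv(dtype=...) instead.
df = df[df.columns.intersection(desired_attributes)].convert_dtypes().copy()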
df = df.reset_index(drop=True) # Let's also reset the index to stop using vin