# Car EDA Web App

## Introduction

I have been tasked with taking US vehicle data, cleaning it, and preparing it for use in a Streamlit app that will be deployed online through render.com. The app will be fully interactive: users will be able to select the values they care about and see them visualized.

## Importing & Initial Reading of Data

In [1]:
import pandas as pd
import streamlit as st
import plotly_express as px

In [2]:
car_data = pd.read_csv('vehicles_us.csv')
# separate instance of the car data that will be modified.
comp_df = pd.read_csv('vehicles_us.csv')

In [3]:
display(car_data.head(10))

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed
0,9400,2011.0,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19
1,25500,,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50
2,5500,2013.0,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79
3,1500,2003.0,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9
4,14900,2017.0,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28
5,14990,2014.0,chrysler 300,excellent,6.0,gas,57954.0,automatic,sedan,black,1.0,2018-06-20,15
6,12990,2015.0,toyota camry,excellent,4.0,gas,79212.0,automatic,sedan,white,,2018-12-27,73
7,15990,2013.0,honda pilot,excellent,6.0,gas,109473.0,automatic,SUV,black,1.0,2019-01-07,68
8,11500,2012.0,kia sorento,excellent,4.0,gas,104174.0,automatic,SUV,,1.0,2018-07-16,19
9,9200,2008.0,honda pilot,excellent,,gas,147191.0,automatic,SUV,blue,1.0,2019-02-15,17


I am **not** removing the columns that I will not be using for visualization in the web app, even though they have no use there. **This is per the request of my code reviewer.**

In [4]:
# comp_df = car_data.drop(['cylinders', 'fuel', 'transmission', 'paint_color', 'is_4wd', 'date_posted', 'days_listed' ], axis=1)

## Cleaning/Enriching cylinders, fuel, transmission, is_4wd, date_posted, days listed

### price

In [5]:
display(car_data.info())
display(car_data['price'].describe())
display(car_data['price'].value_counts())

comp_df = comp_df.rename(columns={'price': 'price_$'})

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


None

count     51525.000000
mean      12132.464920
std       10040.803015
min           1.000000
25%        5000.000000
50%        9000.000000
75%       16839.000000
max      375000.000000
Name: price, dtype: float64

1        798
6995     719
5995     655
4995     624
3500     620
        ... 
58500      1
3993       1
32987      1
3744       1
7455       1
Name: price, Length: 3443, dtype: int64

With no null values and a seemingly reasonable range of prices, I don't see much need to modify or enrich this data beyond adding $ to the column name to help contextualize the prices. I find the almost 800 listings priced at 1 dollar acceptable; they might be for parts or best-offer posts. If I were going to use the mean price, I would consider dropping these.
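
To make the point about the mean concrete, here is a minimal sketch on made-up numbers (not the real dataset). The `PLACEHOLDER_MAX` threshold and the toy prices are assumptions for illustration only; it simply compares the mean with and without the 1 dollar listings.

```python
import pandas as pd

# Hypothetical toy data standing in for the 'price' column.
prices = pd.Series([1, 1, 4995, 6995, 9000, 16839], name="price")

# Assumption: anything at or below this value is a placeholder, not a real price.
PLACEHOLDER_MAX = 1

is_placeholder = prices <= PLACEHOLDER_MAX
mean_all = prices.mean()
mean_real = prices[~is_placeholder].mean()

print(f"placeholder listings: {is_placeholder.sum()}")
print(f"mean with placeholders: {mean_all:.2f}")
print(f"mean without placeholders: {mean_real:.2f}")
```

The median is far less sensitive to these listings, which is part of why leaving them in is tolerable when only medians are used.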

### model_year

My code reviewer suggested I fill missing model years with the median by model. I don't believe this is a good enrichment of the data. While it won't skew the median for each model, I do think it could have a questionable effect on any conclusions drawn from model_year. If I were planning to use model years in my web app, I would also add a column marking whether the year was filled in by the following function. But since I won't be using this, I'll simply fill the null values with the median model year of that model.

In [6]:
def list_of_null(df, clmn):
    
    """ Creates a list of indexes with a null value in an entry from 
    a column in a dataframe 
    """
    
    null_list = df[df[clmn].isna()]
    null_list = null_list.index.tolist()
    return null_list

def list_of_value (df, clmn, value):
    
    """ Creates a list of indexes from a column in a dataframe with 
    a specified value
    """
    
    value_list = df[df[clmn] == value]
    value_list = value_list.index.tolist() 
    return value_list


In [7]:
def model_year_update(df, null_list):
    
    """ Changes the model year column entry from null to the median
    model year of other entries of the same model that have a
    non-null model year value
    """
    
    for i in null_list:
        # retrieve the list entry's model name 
        model = df.loc[i, 'model']
        
        # check if there are other entries with the same name & a non-null model_year entry
        other_entries = df[(df['model']==model) & (df['model_year'].notnull())]
        if not other_entries.empty:
            # set the null list's entry of model year to the median of all 
            # other entries
            median = round(other_entries['model_year'].median())
            df.loc[i, 'model_year'] = median
        else:
            df.loc[i, 'model_year'] = 0
            
    return df

In [8]:
null_model_year = list_of_null(car_data, 'model_year')
comp_df = model_year_update(car_data, null_model_year)
comp_df['model_year'] = comp_df['model_year'].astype(int)
display('The number of entries with the year 0 is: ',\
        (comp_df['model_year']==0).sum())
display(comp_df['model_year'].value_counts(dropna=False))

'The number of entries with the year 0 is: '

0

2011    4017
2013    3994
2012    3982
2014    3540
2008    3462
        ... 
1948       1
1961       1
1936       1
1949       1
1929       1
Name: model_year, Length: 68, dtype: int64

We have now updated the model years on all null entries. We also changed the year to an int as there is no good reason for it to continue to be a float.
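
The loop above filters the whole DataFrame once per missing row. As a rough alternative, the same per-model median fill can be sketched with `groupby(...).transform`, together with the marker column mentioned earlier. This is only a sketch on toy data: the `toy` frame and the `model_year_imputed` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with the same two columns the notebook uses.
toy = pd.DataFrame({
    "model": ["ford f-150", "ford f-150", "ford f-150", "bmw x5", "bmw x5"],
    "model_year": [2010.0, 2014.0, None, 2011.0, None],
})

# Record which rows are about to be filled, before touching the values.
toy["model_year_imputed"] = toy["model_year"].isna()

# Median model year per model, aligned row by row with the original frame.
per_model_median = toy.groupby("model")["model_year"].transform("median").round()

# Fill nulls with the per-model median; fall back to 0 when a model has
# no known year at all, mirroring the else branch of the loop above.
toy["model_year"] = (
    toy["model_year"].fillna(per_model_median).fillna(0).astype(int)
)

print(toy)
```

Keeping the flag column means any later analysis can exclude or separately inspect the filled rows.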

### model

In [9]:
display(car_data.info())
# make a new column 'make' from the first word of the model column
comp_df['make'] = comp_df['model'].str.split().str[0]
display(comp_df)
display(comp_df['make'].info())
display(comp_df['make'].value_counts())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    51525 non-null  int64  
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(3), int64(3), object(7)
memory usage: 5.1+ MB


None

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make
0,9400,2011,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,1.0,2018-06-23,19,bmw
1,25500,2011,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,1.0,2018-10-19,50,ford
2,5500,2013,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,,2019-02-07,79,hyundai
3,1500,2003,ford f-150,fair,8.0,gas,,automatic,pickup,,,2019-03-22,9,ford
4,14900,2017,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,,2019-04-02,28,chrysler
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51520,9249,2013,nissan maxima,like new,6.0,gas,88136.0,automatic,sedan,black,,2018-10-03,37,nissan
51521,2700,2002,honda civic,salvage,4.0,gas,181500.0,automatic,sedan,white,,2018-11-14,22,honda
51522,3950,2009,hyundai sonata,excellent,4.0,gas,128000.0,automatic,sedan,blue,,2018-11-15,32,hyundai
51523,7455,2013,toyota corolla,good,4.0,gas,139573.0,automatic,sedan,black,,2018-07-02,71,toyota


<class 'pandas.core.series.Series'>
RangeIndex: 51525 entries, 0 to 51524
Series name: make
Non-Null Count  Dtype 
--------------  ----- 
51525 non-null  object
dtypes: object(1)
memory usage: 402.7+ KB


None

ford             12672
chevrolet        10611
toyota            5445
honda             3485
ram               3316
jeep              3281
nissan            3208
gmc               2378
subaru            1272
dodge             1255
hyundai           1173
volkswagen         869
chrysler           838
kia                585
cadillac           322
buick              271
bmw                267
acura              236
mercedes-benz       41
Name: make, dtype: int64

Since there were no null values, there's no reason to clean the model column. I did see a good reason to take the first word of the model and create a new column, 'make', for use in my web app. I also checked that there were no missing values in this new column and that all of the make entries made sense.

### condition

In [10]:
display(car_data['condition'].value_counts(dropna=False))

excellent    24773
good         20145
like new      4742
fair          1607
new            143
salvage        115
Name: condition, dtype: int64

All of the entries in the condition column are non-null and consist of 6 different strings, so the object type continues to make sense here. There is no further enriching or cleaning needed for my purposes.

### cylinders

In [11]:
display(car_data['cylinders'].value_counts(dropna = False))
display(car_data[car_data['cylinders'].isnull()])

8.0     15844
6.0     15700
4.0     13864
NaN      5260
10.0      549
5.0       272
3.0        34
12.0        2
Name: cylinders, dtype: int64

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make
9,9200,2008,honda pilot,excellent,,gas,147191.0,automatic,SUV,blue,1.0,2019-02-15,17,honda
36,10499,2013,chrysler 300,good,,gas,88042.0,automatic,sedan,,,2018-05-05,22,chrysler
37,7500,2005,toyota tacoma,good,,gas,160000.0,automatic,pickup,,,2018-07-22,44,toyota
59,5200,2006,toyota highlander,good,,gas,186000.0,automatic,SUV,green,,2018-12-20,2,toyota
63,30000,1966,ford mustang,excellent,,gas,51000.0,manual,convertible,red,,2019-01-23,17,ford
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51429,3250,2004,toyota camry,good,,gas,179412.0,automatic,sedan,,,2018-07-02,25,toyota
51442,28990,2018,ford f150,excellent,,gas,10152.0,automatic,truck,white,1.0,2018-06-13,47,ford
51460,5995,2007,ford fusion,excellent,,gas,88977.0,manual,sedan,silver,,2019-03-27,66,ford
51477,6499,2007,acura tl,good,,gas,112119.0,automatic,sedan,white,,2018-06-22,28,acura


As there are over 5000 missing entries and the model types seem to be normal, I'll run a function that compares each entry with a missing value to entries with the exact same model name and uses a non-null value from there to enrich our data.

In [12]:
def update_cylinders (df, null_list):
    
    """ Changes the cylinders column entry from null to the first matching 
    model entry with a non-null cylinders value
    """
    
    for i in null_list:
        # retrieve the list entry's model name 
        name = df.loc[i, 'model']
        
        #check if there are other entries with the same name & a non-null cylinder entry
        other_entries = df[(df['model']==name) & (df['cylinders'].notnull())]
        if not other_entries.empty:
            # set the null list's entry's cylinder to the first cylinder entry 
            # in other entries
            non_null_cylinder = other_entries.iloc[0]['cylinders']
            df.loc[i, 'cylinders'] = non_null_cylinder
        else:
            df.loc[i, 'cylinders'] = 0
            
    return df

In [13]:
cylinders_null = list_of_null(car_data, 'cylinders')
comp_df = update_cylinders(car_data, cylinders_null)

In [27]:
display(comp_df['cylinders'].value_counts(dropna=False))
comp_df['cylinders'] = comp_df['cylinders'].astype(int)
display(comp_df.info())

6.0     17736
8.0     17459
4.0     15417
10.0      549
5.0       328
3.0        34
12.0        2
Name: cylinders, dtype: int64

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   price         51525 non-null  int64         
 1   model_year    51525 non-null  int64         
 2   model         51525 non-null  object        
 3   condition     51525 non-null  object        
 4   cylinders     51525 non-null  int64         
 5   fuel          51525 non-null  object        
 6   odometer      51525 non-null  float64       
 7   transmission  51525 non-null  object        
 8   type          51525 non-null  object        
 9   paint_color   42258 non-null  object        
 10  is_4wd        51525 non-null  bool          
 11  date_posted   51525 non-null  datetime64[ns]
 12  days_listed   51525 non-null  int64         
 13  make          51525 non-null  object        
dtypes: bool(1), datetime64[ns](1), float64(1), int64(4), object(7)
memory usage: 5.2+ MB


None

The function has updated all entries to now have values. Only the 10-, 3-, and 12-cylinder counts did not gain any entries from this code. I am not worried about this because there are also no 0-cylinder entries, meaning every entry found a match. I also chose to change the type to integer, as there were no signs of a partial cylinder option.
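
One thing to keep in mind is that "first matching entry" depends on row order when a model is sold with more than one engine. A possible variation, shown here only as a sketch on toy data, is to fill with the most common cylinder count for that model instead. The `toy` frame and the `most_common` helper are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Hypothetical toy data: the first 'honda pilot' row has 4 cylinders,
# but 6 is the more common value for that model.
toy = pd.DataFrame({
    "model": ["honda pilot"] * 4 + ["toyota camry"] * 3,
    "cylinders": [4.0, 6.0, 6.0, None, 4.0, 4.0, None],
})

def most_common(values: pd.Series) -> float:
    """Return the most frequent non-null value, or 0 if there is none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else 0

# Broadcast each model's most common cylinder count back onto its rows.
per_model_mode = toy.groupby("model")["cylinders"].transform(most_common)
toy["cylinders"] = toy["cylinders"].fillna(per_model_mode).astype(int)

print(toy)
```

With the first-match approach the missing honda pilot row in this toy frame would have received 4; with the mode it receives 6.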

### fuel

In [15]:
display(car_data['fuel'].value_counts(dropna=False))

gas         47288
diesel       3714
hybrid        409
other         108
electric        6
Name: fuel, dtype: int64

The fuel column seems to be in fine order. The datatype is an object which fits for a string. The vast majority of cars are gas. In a professional setting I would try to ask what "other" even could be as a fuel source as that could lead to some interesting conclusions. But as far as this goes we are good.

### odometer

In [16]:
def odometer_update(df, null_list):
    
    """ Changes the odometer column entry from null to the median
    odometer reading of other entries with the same model year and
    a non-null odometer value
    """
    
    for i in null_list:
        # retrieve the list entry's model_year 
        year = df.loc[i, 'model_year']
        
        # check if there are other entries with the same year & a non-null mileage entry
        other_entries = df[(df['model_year']==year) & (df['odometer'].notnull())]
        if not other_entries.empty:
            # set the null list's entry of mileage to the median of all 
            # other entries of that year.
            median = round(other_entries['odometer'].median())
            df.loc[i, 'odometer'] = median
        else:
            df.loc[i, 'odometer'] = 0
            
    return df

In [17]:
null_milage = list_of_null(car_data, 'odometer')
comp_df = odometer_update(car_data, null_milage)
display(comp_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    51525 non-null  int64  
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     51525 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      51525 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
 13  make          51525 non-null  object 
dtypes: float64(3), int64(3), object(8)
memory usage: 5.5+ MB


None

We now have no missing values in the odometer readings. As with model_year, I question how reliable the data is after adding about 8,000 medians by year. While this is much better than using the median of the whole dataset, I'm not sure how it would be viewed in a real-life situation. Articles, white papers, or any professional advice on this would be very appreciated.  
I didn't change the odometer from a float. Most vehicles track mileage down to a tenth of a unit (be it km or miles). I didn't look through all instances, so I don't want to risk corrupting the data.
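
Whether any readings actually carry a fractional part can be checked directly rather than by eye. The following is a small sketch on made-up values; `count_fractional` is a hypothetical helper name, and the two toy Series are assumptions, not real rows.

```python
import pandas as pd

def count_fractional(values: pd.Series) -> int:
    """Count non-null values that have a non-zero fractional part."""
    present = values.dropna()
    return int((present % 1 != 0).sum())

# Hypothetical toy readings: one Series of whole numbers (with a null),
# and one that contains a reading with a tenth-of-a-unit style fraction.
whole = pd.Series([145000.0, 88705.0, None], name="odometer")
mixed = pd.Series([110000.5, 80903.0], name="odometer")

print(count_fractional(whole))
print(count_fractional(mixed))

# If the count is zero, a nullable integer dtype keeps any remaining nulls.
if count_fractional(whole) == 0:
    whole_int = whole.astype("Int64")
    print(whole_int.dtype)
```

If the count on the real column came back as zero, converting to an integer type would lose nothing.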

### transmission

In [18]:
display(car_data['transmission'].value_counts(dropna=False))

automatic    46902
manual        2829
other         1794
Name: transmission, dtype: int64

Again with the transmission column I would consider this to be in good working order for us to use in an analysis. I would ask what "other" might mean for conclusions in the real world, but for our uses today that is unnecessary.

### type

In [19]:
display(car_data['type'].value_counts(dropna=False))

SUV            12405
truck          12353
sedan          12154
pickup          6988
coupe           2303
wagon           1541
mini-van        1161
hatchback       1047
van              633
convertible      446
other            256
offroad          214
bus               24
Name: type, dtype: int64

There are no null values, all instances seem to be in place. The object type is appropriate for the strings. There is no further enriching or cleaning necessary here.

### paint_color

In [20]:
display(car_data['paint_color'].value_counts(dropna=False))
display(comp_df.groupby('make')['paint_color'].value_counts())

white     10029
NaN        9267
black      7692
silver     6244
grey       5037
blue       4475
red        4421
green      1396
brown      1223
custom     1153
yellow      255
orange      231
purple      102
Name: paint_color, dtype: int64

make        paint_color
acura       grey           55
            black          44
            silver         36
            white          26
            blue            9
                           ..
volkswagen  blue           93
            red            63
            brown           9
            green           7
            custom          6
Name: paint_color, Length: 202, dtype: int64

So I attempted a number of looks into the relation between 'paint_color' and other columns' values, but I didn't find any sound logical relation. I don't want to fill in the modal value by make, so I'm going to push back on my code reviewer and decline to enrich the paint_color column. In the real world I would first ask whether this data was important to the scope of the analysis. If it was, I'd talk to other people to see if there are logical connections. Maybe I could write a function that fills the null paint colors for each make based on the distribution of colors present in the data for that make. But with 9,200+ entries without a color, and the fact that I'm not using color in my web app, I choose not to do anything to this data, as nothing I can come up with is grounded in inductive or deductive reasoning.
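
For reference, the distribution-based idea mentioned above could look roughly like the sketch below. It is not applied anywhere in this notebook. The toy frame, the fixed seed, and the `paint_color_filled` column name are all assumptions made for illustration, and the filled values are random draws, not recovered facts about the cars.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: some rows have no paint color.
toy = pd.DataFrame({
    "make": ["ford", "ford", "ford", "ford", "bmw", "bmw", "bmw"],
    "paint_color": ["white", "white", "black", None, "grey", None, None],
})

# Fixed seed so the sketch is repeatable.
rng = np.random.default_rng(seed=0)

filled = toy["paint_color"].copy()
for make, colors in toy.groupby("make")["paint_color"]:
    observed = colors.dropna()
    missing_idx = colors.index[colors.isna()]
    if observed.empty or len(missing_idx) == 0:
        continue  # nothing to sample from, or nothing to fill
    # Relative frequency of each observed color within this make.
    freqs = observed.value_counts(normalize=True)
    # Draw one color per missing row, weighted by those frequencies.
    draws = rng.choice(
        freqs.index.to_numpy(), size=len(missing_idx), p=freqs.to_numpy()
    )
    filled.loc[missing_idx] = draws

# Keep the original column untouched so filled rows stay identifiable.
toy["paint_color_filled"] = filled
print(toy)
```

This keeps each make's color proportions roughly intact, but it still invents individual values, which is the same objection raised above.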

### is_4wd

In [21]:
display(car_data['is_4wd'].value_counts(dropna=False))

NaN    25953
1.0    25572
Name: is_4wd, dtype: int64

I find it safe to assume that since there are only 1.0 and null values, 1.0 means True and all other entries are intended to be 0.0 (False).

In the following cell I will convert the null values to 0.0, then convert the column to boolean values.

In [22]:
# assign the result back instead of calling fillna(inplace=True) on a column selection
comp_df['is_4wd'] = comp_df['is_4wd'].fillna(0.0)
comp_df['is_4wd'] = comp_df['is_4wd'].astype(bool)
display(comp_df.head(10))
display(comp_df['is_4wd'].value_counts())

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make
0,9400,2011,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,True,2018-06-23,19,bmw
1,25500,2011,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,True,2018-10-19,50,ford
2,5500,2013,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,False,2019-02-07,79,hyundai
3,1500,2003,ford f-150,fair,8.0,gas,161397.0,automatic,pickup,,False,2019-03-22,9,ford
4,14900,2017,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,False,2019-04-02,28,chrysler
5,14990,2014,chrysler 300,excellent,6.0,gas,57954.0,automatic,sedan,black,True,2018-06-20,15,chrysler
6,12990,2015,toyota camry,excellent,4.0,gas,79212.0,automatic,sedan,white,False,2018-12-27,73,toyota
7,15990,2013,honda pilot,excellent,6.0,gas,109473.0,automatic,SUV,black,True,2019-01-07,68,honda
8,11500,2012,kia sorento,excellent,4.0,gas,104174.0,automatic,SUV,,True,2018-07-16,19,kia
9,9200,2008,honda pilot,excellent,6.0,gas,147191.0,automatic,SUV,blue,True,2019-02-15,17,honda


False    25953
True     25572
Name: is_4wd, dtype: int64

Now all values in the column are True or False, which makes the column easier to use later and matches what the column title communicates.

### date_posted

In [24]:
comp_df['date_posted'] = pd.to_datetime(comp_df['date_posted'],\
                                        format="%Y-%m-%d")
display(comp_df.info())
display(comp_df.head(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   price         51525 non-null  int64         
 1   model_year    51525 non-null  int64         
 2   model         51525 non-null  object        
 3   condition     51525 non-null  object        
 4   cylinders     51525 non-null  float64       
 5   fuel          51525 non-null  object        
 6   odometer      51525 non-null  float64       
 7   transmission  51525 non-null  object        
 8   type          51525 non-null  object        
 9   paint_color   42258 non-null  object        
 10  is_4wd        51525 non-null  bool          
 11  date_posted   51525 non-null  datetime64[ns]
 12  days_listed   51525 non-null  int64         
 13  make          51525 non-null  object        
dtypes: bool(1), datetime64[ns](1), float64(2), int64(3), object(7)
memory usage: 5.2+ MB


None

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make
0,9400,2011,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,True,2018-06-23,19,bmw
1,25500,2011,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,True,2018-10-19,50,ford
2,5500,2013,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,False,2019-02-07,79,hyundai
3,1500,2003,ford f-150,fair,8.0,gas,161397.0,automatic,pickup,,False,2019-03-22,9,ford
4,14900,2017,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,False,2019-04-02,28,chrysler


We have changed date_posted to a datetime type. This could be helpful if we wanted to use the dates directly or derive other features from them. Otherwise, with no missing values, I find this column to be in good working order.
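
As an example of what the datetime type makes possible, here is a short sketch on toy rows. The derived column names `posted_month` and `age_at_posting` are hypothetical and are not used anywhere else in this notebook.

```python
import pandas as pd

# Hypothetical toy rows with the same two columns the notebook has.
toy = pd.DataFrame({
    "model_year": [2011, 2013, 2003],
    "date_posted": ["2018-06-23", "2019-02-07", "2019-03-22"],
})
toy["date_posted"] = pd.to_datetime(toy["date_posted"], format="%Y-%m-%d")

# The .dt accessor only works once the column is a datetime type.
toy["posted_month"] = toy["date_posted"].dt.month
toy["age_at_posting"] = toy["date_posted"].dt.year - toy["model_year"]

print(toy)
```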

### days_listed

In [25]:
display(car_data['days_listed'].value_counts(dropna=False))
display(car_data['days_listed'].isnull().sum())

18     959
24     950
22     945
19     941
20     934
      ... 
240      1
209      1
188      1
192      1
186      1
Name: days_listed, Length: 227, dtype: int64

0

This column doesn't have any null values and the day counts all look sensible. I don't think there's any cleaning or enriching to do at the moment for the purposes of my web app.

In [26]:
display(comp_df.info())
display(comp_df.head(5))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   price         51525 non-null  int64         
 1   model_year    51525 non-null  int64         
 2   model         51525 non-null  object        
 3   condition     51525 non-null  object        
 4   cylinders     51525 non-null  float64       
 5   fuel          51525 non-null  object        
 6   odometer      51525 non-null  float64       
 7   transmission  51525 non-null  object        
 8   type          51525 non-null  object        
 9   paint_color   42258 non-null  object        
 10  is_4wd        51525 non-null  bool          
 11  date_posted   51525 non-null  datetime64[ns]
 12  days_listed   51525 non-null  int64         
 13  make          51525 non-null  object        
dtypes: bool(1), datetime64[ns](1), float64(2), int64(3), object(7)
memory usage: 5.2+ MB


None

Unnamed: 0,price,model_year,model,condition,cylinders,fuel,odometer,transmission,type,paint_color,is_4wd,date_posted,days_listed,make
0,9400,2011,bmw x5,good,6.0,gas,145000.0,automatic,SUV,,True,2018-06-23,19,bmw
1,25500,2011,ford f-150,good,6.0,gas,88705.0,automatic,pickup,white,True,2018-10-19,50,ford
2,5500,2013,hyundai sonata,like new,4.0,gas,110000.0,automatic,sedan,red,False,2019-02-07,79,hyundai
3,1500,2003,ford f-150,fair,8.0,gas,161397.0,automatic,pickup,,False,2019-03-22,9,ford
4,14900,2017,chrysler 200,excellent,4.0,gas,80903.0,automatic,sedan,black,False,2019-04-02,28,chrysler


## Conclusion of cleaning
I have now removed all null values wherever there was a logical basis for filling in the missing data. I didn't update paint_color for the reasons explained above.

## Final Prep before bringing this over to app.py

In [28]:
comp_df = comp_df.drop(['model','cylinders', 'fuel', 'transmission',\
                       'paint_color', 'is_4wd', 'date_posted',\
                        'days_listed'], axis=1)

In [29]:
display(comp_df)

Unnamed: 0,price,model_year,condition,odometer,type,make
0,9400,2011,good,145000.0,SUV,bmw
1,25500,2011,good,88705.0,pickup,ford
2,5500,2013,like new,110000.0,sedan,hyundai
3,1500,2003,fair,161397.0,pickup,ford
4,14900,2017,excellent,80903.0,sedan,chrysler
...,...,...,...,...,...,...
51520,9249,2013,like new,88136.0,sedan,nissan
51521,2700,2002,salvage,181500.0,sedan,honda
51522,3950,2009,excellent,128000.0,sedan,hyundai
51523,7455,2013,good,139573.0,sedan,toyota


**Here we have all the columns I will want to have access to in my web app visualization. I have also removed all other columns to help conserve memory and improve speed.**
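
If memory turns out to matter for the deployed app, a further option (not applied in this notebook) is to store the repetitive text columns as pandas categoricals. The sketch below uses a made-up frame with the same three text column names; the values and row count are assumptions chosen only to show the comparison.

```python
import pandas as pd

# Hypothetical toy frame: few distinct strings repeated many times,
# similar in shape to the condition/type/make columns.
toy = pd.DataFrame({
    "condition": ["good", "excellent", "like new", "fair"] * 250,
    "type": ["SUV", "sedan", "pickup", "truck"] * 250,
    "make": ["ford", "toyota", "honda", "bmw"] * 250,
})

text_columns = ["condition", "type", "make"]

# deep=True counts the actual string storage, not just the pointers.
before = toy.memory_usage(deep=True).sum()
slim = toy.astype({col: "category" for col in text_columns})
after = slim.memory_usage(deep=True).sum()

print(f"bytes before: {before}")
print(f"bytes after:  {after}")
```

Categoricals store each distinct string once plus a small integer code per row, so the saving grows with the number of repeated values.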