## EDA Project

**What sales insights can we gather from advertisement data?** We've been provided with data about used car advertisements over the course of approximately a year: information about the cars, their prices, when they were listed, and how quickly they sold.

For this project, I want to look at a small subset of the data, pickup trucks, and compare sale prices and days listed for popular makes and models.

In [93]:
import pandas as pd

vehicles = pd.read_csv('../vehicles_us.csv')

vehicles.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51525 entries, 0 to 51524
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   price         51525 non-null  int64  
 1   model_year    47906 non-null  float64
 2   model         51525 non-null  object 
 3   condition     51525 non-null  object 
 4   cylinders     46265 non-null  float64
 5   fuel          51525 non-null  object 
 6   odometer      43633 non-null  float64
 7   transmission  51525 non-null  object 
 8   type          51525 non-null  object 
 9   paint_color   42258 non-null  object 
 10  is_4wd        25572 non-null  float64
 11  date_posted   51525 non-null  object 
 12  days_listed   51525 non-null  int64  
dtypes: float64(4), int64(2), object(7)
memory usage: 5.1+ MB


### Cleaning the data

A preliminary `info()` call shows that we have some missing values in the following fields: `model_year`, `cylinders`, `odometer`, `paint_color`, and `is_4wd`. Some of these factors affect a vehicle's value, but for now, I'm mostly focused on make/model.
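As a minimal sketch, the same missing-value check can be done directly rather than by reading the `info()` table. The `sample` dataframe below is a toy stand-in with invented rows, not the real dataset; on the actual data the same two lines would be applied to `vehicles`.

```python
import pandas as pd

# Toy stand-in for the real data; these rows are invented for illustration.
sample = pd.DataFrame({
    'model': ['ford f-150', 'toyota tacoma', 'ram 1500'],
    'model_year': [2011.0, None, 2015.0],
    'is_4wd': [1.0, None, None],
})

# Count missing values per column, then keep only columns that have any.
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)
```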

I want to look only at certain models, but there are some duplicates and variations in the `model` field that I should clean up first. To make it easier to find duplicates, I'll take a quick look at each automaker that offers a pickup truck. First, what makes are in the dataset?

In [94]:
def getmake(makemodel):
    '''Get the first word from a string combining make and model.'''
    make = makemodel.split(' ',1)[0]
    return make

# Go through the data getting makes. Return unique results.
print(vehicles['model'].apply(getmake).unique())

['bmw' 'ford' 'hyundai' 'chrysler' 'toyota' 'honda' 'kia' 'chevrolet'
 'ram' 'gmc' 'jeep' 'nissan' 'subaru' 'dodge' 'mercedes-benz' 'acura'
 'cadillac' 'volkswagen' 'buick']


Most of these automakers have made a pickup truck at least once, but it's likely that only a few of them will be relevant to an analysis of pickup trucks in our dataset. To check, I filtered `model` by make and used `value_counts()` to review the short list of models for each one.
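The per-make review could look something like the sketch below. The helper name `models_for_make` and the `sample` rows are hypothetical, made up for illustration; the notebook itself does not show the exact calls that were used.

```python
import pandas as pd

# Toy stand-in for vehicles['model']; these rows are invented for illustration.
sample = pd.DataFrame({'model': [
    'ford f-150', 'ford f150', 'ford focus',
    'toyota tacoma', 'toyota camry', 'toyota tacoma',
]})

def models_for_make(df, make):
    '''Hypothetical helper: count the models whose name starts with the given make.'''
    # The trailing space avoids matching a make that is a prefix of another word.
    is_make = df['model'].str.startswith(make + ' ')
    return df.loc[is_make, 'model'].value_counts()

toyota_counts = models_for_make(sample, 'toyota')
print(toyota_counts)
```

Running this once per make gives a short enough list to scan by eye for pickup models and for spelling variants such as 'f150' versus 'f-150'.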

There are no pickups in our dataset for BMW, Hyundai, Chrysler, Honda, Jeep, Subaru, Mercedes-Benz, Acura, Cadillac, Volkswagen, or Buick.

I am concerned with the following models:
- Ford: F-150, F-250, F-350, Ranger
- Toyota: Tacoma, Tundra
- Chevrolet: Silverado, Colorado
- Ram: 1500, 2500, 3500
- GMC: Sierra
- Nissan: Frontier
- Dodge: Dakota

The format of the `model` field is consistent enough that I should be able to select pickups from the dataset by checking the `model` field for lowercase make/model combinations that correspond to pickups. The exception is Ford, where some trucks have been entered as 'f150' and others as 'f-150'.

In [95]:
def hyphenate_trucks(model):
    '''If input string contains unhyphenated Ford model name, replace with hyphenated version'''
    model = model.replace('f150','f-150')
    model = model.replace('f250','f-250')
    model = model.replace('f350','f-350')
    return model

vehicles['model'] = vehicles['model'].apply(hyphenate_trucks)
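The same cleanup can also be written as a single vectorized call. This is an alternative sketch, not the approach used above: it assumes the only unhyphenated variants are 'f150', 'f250', and 'f350', and the `models` series is a toy example with invented values.

```python
import pandas as pd

# Toy stand-in for vehicles['model']; these values are invented for illustration.
models = pd.Series([
    'ford f150', 'ford f-150', 'ford f250 super duty',
    'ford f350 sd', 'ford focus',
])

# Insert a hyphen after a standalone 'f' followed by 150, 250, or 350.
# \b anchors the match at word boundaries, so 'ford' and 'focus' are untouched.
hyphenated = models.str.replace(r'\bf([123]50)\b', r'f-\1', regex=True)
print(hyphenated.tolist())
```

Unlike the chained `str.replace` calls in `hyphenate_trucks`, the word-boundary anchors mean the pattern only matches when the token stands alone, which is slightly stricter behavior.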

To better limit the analysis to pickup trucks, I'll filter the `vehicles` dataframe into a `pickups`-only dataframe.

In [105]:
pickup_models = ['ford f-150', 'ford f-250', 'ford f-350', 'ford ranger', 'toyota tacoma', 'toyota tundra',
                 'chevrolet colorado', 'chevrolet silverado', 'ram 1500', 'ram 2500', 'ram 3500', 'gmc sierra',
                 'nissan frontier', 'dodge dakota']

# https://stackoverflow.com/questions/61158898/filter-pandas-where-some-columns-contain-any-of-the-words-in-a-list
# select only rows where 'model' field contains string from list of known pickup_models
pickups = vehicles[vehicles['model'].str.contains('|'.join(pickup_models))]

pickups['model'].value_counts()

model
ford f-150                       3326
chevrolet silverado 1500         2171
ram 1500                         1750
chevrolet silverado              1271
ram 2500                         1091
chevrolet silverado 2500hd        915
gmc sierra 1500                   906
toyota tacoma                     827
ford f-250                        761
ford f-250 super duty             611
toyota tundra                     603
ram 3500                          475
gmc sierra 2500hd                 438
ford f-250 sd                     426
ford ranger                       423
gmc sierra                        388
nissan frontier crew cab sv       345
ford f-150 supercrew cab xlt      327
chevrolet silverado 1500 crew     303
ford f-350 sd                     295
chevrolet colorado                286
nissan frontier                   281
ford f-350                        250
ford f-350 super duty             246
chevrolet silverado 3500hd        243
dodge dakota                      242
Name: 