# <h1><center>KBB Used vs New Car Cost Analysis</center></h1>

## Introduction

As the state of the economy and car market changes, the price of automobiles varies with lots of uncertainty. As someone who is curious how much my current car is worth and how much I can expect to pay for a car at this moment, this project reveals current prices and trends of the new and used car market. A cost analysis is accomplished by web scraping data from Kelly Blue Book, a service that posts new and used vehicles listed by owner and dealer for sale.

## Table of Contents:
* [Extracting Data](#first-bullet1)
* [Cleaning Data](#second-bullet1)
* [Visualization and Analysis](#third-bullet1)

## Extracting Data <a class="anchor" id="first-bullet1"></a>

In [19]:
# Import relevant libraries

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
s = Service(ChromeDriverManager().install())

from bs4 import BeautifulSoup as bs
import requests

[WDM] - Downloading: 100%|██████████| 6.46M/6.46M [00:00<00:00, 18.9MB/s]


In [20]:
# Running the Chrome webdriver via Selenium
driver = webdriver.Chrome(service=s)

# Accessing Kelly Blue Book's most recent posts in the Austin, TX area
driver.get('https://www.kbb.com/cars-for-sale/austin-tx-73301?dma=&listingTypes=NEW%2CUSED&searchRadius=75&location=&marketExtension=include&isNewSearch=false&showAccelerateBanner=false&sortBy=datelistedDESC&numRecords=25')

time.sleep (5)

# From the webpage, pulling the html and saving it
html = driver.page_source
driver.quit()

In [21]:
# Using Beautiful Soup, the html is parsed
soup = bs(html)

# Pulling the title of each post
title = soup.find_all("h2", {"class": "text-bold text-size-400 text-size-sm-500 link-unstyled"})
title_clean = []
for a in title:
    title_clean.append(a.get_text())
print(len(title))

# Pulling the price of each car
price = soup.find_all("span", {"class": "first-price"})
price_clean = []
for a in price:
    price_clean.append(a.get_text())
print(len(price))

# Pulling the mileage of each car. For new cars, the mileage is not posted because it is zero. This is why the length of the mileage list is less than the title and price lists.
mileage = soup.find_all("ul", {"class": "list list-inline display-inline margin-bottom-0 pipe-delimited text-gray text-size-300"})
mileage_clean = []
for a in mileage:
    mileage_clean.append(a.get_text())
print(len(mileage))

29
29
10


## Cleaning Data  <a class="anchor" id="second-bullet1"></a>

In [22]:
# putting into pandas df
import pandas as pd
df = pd.DataFrame(list(zip(title_clean,price_clean)),columns=['title','price'])

In [30]:
import sqlalchemy
from sqlalchemy import create_engine, MetaData, Table, select
import pyodbc

server_name = "DESKTOP-71F3NUV\SQLEXPRESS"
database = "Used vs New Cars"

engine = create_engine('mssql+pyodbc://' + server_name + '/' + database)

metadata = MetaData(conn)

df.to_sql(database,engine)


#df.to_sql(TableName, engine, chunksize=<yourParameterLimit>, method='multi')

#conn = pyodbc.connect(
#    Trusted_Connection = "Yes",
#    Driver = "{ODBC Driver 17 for SQL Server}",
#    Server = server_name,
#    Database = database)

#cursor = conn.cursor()

#engine = sqlalchemy.create_engine("mssql+pyodbc://")

  engine = create_engine('mssql+pyodbc://' + server_name + '/' + database)


InterfaceError: (pyodbc.InterfaceError) ('IM002', '[IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified (0) (SQLDriverConnect)')
(Background on this error at: https://sqlalche.me/e/14/rvf5)

In [None]:
# write the DataFrame to a table in the sql database
df.to_sql("table_name", engine)

In [25]:
print(pd.__version__)

1.4.2


In [None]:
# Cleaning the data
new_used_or_certified = []
year = []
for a in df['title']:
    new_used_or_certified.append(a.split()[0]) # Taking out the year from each title
    year.append(a.split()[1]) # Taking out whether the car is new or used from each title
for i, a in enumerate(df['price']):
    if a[-4:] == 'MSRP':
        df['price'].iloc[i] = a[:-4] # Removing 'MSPR' from the prices that contain the letters
df['new_used_or_certified'] = new_used_or_certified
df['year'] = year

In [11]:
# Adding a column for vehicle mileage. Only used vehicles have mileage

# The webscraper pulls in a shortened list of vehicle mileage because only a fraction of the listed vehicles are used.
mileage_extra_rows = [0]*(len(df.title)-len(mileage_clean)) # Extending the length of the mileage list to match the rest of the dataframe.
for a in mileage_extra_rows:
    mileage_clean.append(0)

mileage_all = []

# Looping through the list of new and used vehicles and assigning the mileage.
for a,b in zip(df['new_used_or_certified'],mileage_clean): 
    if a == 'Used':
        mileage_all.append(b)
    else:
        mileage_all.append(0)

df['mileage'] = mileage_all

df.head() # Preview the dataframe

Unnamed: 0,title,price,new_used_or_certified,year,mileage
0,Used 2008 Jeep Grand Cherokee Laredo,7335,Used,2008,"8,101 miles"
1,Used 2022 Toyota RAV4 XLE Premium,37900,Used,2022,"5,307 miles"
2,Certified 2022 INFINITI QX50 Sensory,46992,Certified,2022,0
3,Used 2022 Audi RS e-tron GT,149982,Used,2022,"26,820 miles"
4,New 2023 BMW M8 Coupe,147285,New,2023,0


In [12]:
# Importing a list of automobile manufacturers from Wikipedia
url = "https://en.wikipedia.org/wiki/List_of_current_automobile_manufacturers_by_country"
data = requests.get(url).text
soup = bs(data, 'html.parser')
makes = soup.find_all("a")
car_makes = []
for a in makes:
    if (len(str(a.text))<2):
        pass
    else:
        car_makes.append(a.text)

# Assigning manufacturer names based on posttitle
testindex = []
test = []

for i,a in enumerate(df['title']):
    for b in a.split():
        for c in car_makes:
            for d in c.split():
                if d == 'New':
                    pass
                elif d == b:
                    testindex.append(i)
                    test.append(d)
                else:
                    pass
# Creating a dataframe for the car manufacturer names and                 
d = {'index':testindex,'car_make':test}
dfcm = pd.DataFrame(data=d)
dfcm['car_make'] = dfcm['car_make'].replace('Abarth','Fiat')
dfcm['car_make'] = dfcm['car_make'].replace('Land','Land Rover')
dfcm = dfcm.drop_duplicates()
dfcm = dfcm.set_index('index') # Changing the index of the manufacturer df so the join is by index
df = df.join(dfcm)
df.head()

Unnamed: 0,title,price,new_used_or_certified,year,mileage,car_make
0,Used 2008 Jeep Grand Cherokee Laredo,7335,Used,2008,"8,101 miles",Jeep
1,Used 2022 Toyota RAV4 XLE Premium,37900,Used,2022,"5,307 miles",Toyota
2,Certified 2022 INFINITI QX50 Sensory,46992,Certified,2022,0,
3,Used 2022 Audi RS e-tron GT,149982,Used,2022,"26,820 miles",Audi
4,New 2023 BMW M8 Coupe,147285,New,2023,0,BMW


In [13]:
# Exporting to a csv
import datetime
date_today = [datetime.date.today()]*len(df)
df['date_post'] = date_today
    
df.to_csv('KBB Web Scraping Data.csv', mode='a', index=False, header=False)
df = pd.read_csv(r'C:\Users\ngret\DataPortfolio\Data-Analytics-Portfolio\Craigslist Project\KBB Web Scraping Data.csv') # Bring full historical csv back in as df

In [14]:
df_new = df[df['new_used_or_certified']=='New']
df_used = df[df['new_used_or_certified']=='Used']

df_min_cost = df[['car_make','cost','year']].groupby('car_make').agg('min').sort_values(by='cost',ascending=False)
df_max_cost = df[['car_make','cost','year']].groupby('car_make').agg('max').sort_values(by='cost',ascending=False)

In [15]:
df_min_cost

Unnamed: 0_level_0,cost,year
car_make,Unnamed: 1_level_1,Unnamed: 2_level_1
Lincoln,85675,2023
Chevrolet,81900,2020
Jeep,7335,2008
Romeo,58500,2015
Alfa,58500,2015
Lexus,51888,2020
Nissan,23838,2015
Toyota,23025,2020
Bentley,155358,2018
Kia,15999,2018


## Visualization and Analysis  <a class="anchor" id="third-bullet1"></a>

In [16]:
import matplotlib.pyplot as plt 
plt.plot(df_min_cost['car_make'],df_min_cost['cost'])
plt.plot(df_max_cost['car_make'],df_max_cost['cost'])

KeyError: 'car_make'