### Overall Notebook Breakdown:
#### This notebook will scrape and clean historical market price data from carguru.com using Beautiful Soup. This will result in two datasets that contain average price, 30 day price change, 90 day price change, and year-over-year price change. The first dataset will be aggreated by make (ex: Ford, Honda, etc) and the second dataset will be aggregated by model (F-150, Civic, etc). Additionally, since the data was in web format, some cleaning is performed to make the data usable.

In [3]:

# Imports:
from bs4 import BeautifulSoup
import requests
import pandas as pd
import math
import datetime
import numpy as np


#### In this cell I perform the scraping of the first dataset. First I started by outlining the data I wanted to collect into arrays. Make is the actual manufacture, make_link is only collected to scrap the next page, avg_price is in dollar format, 30/90/YoY are precent changes, and data_label is calculated based on the type.

#### The next step in the process is to get the html from the website and filter out everything else besides table rows. Once the rows are stored, I use a for loop to iterate through each of them. There are certain edge cases that I need to address such as differentiating between a link and label for the "Make". Once everything is stored in the arrays, I assemble the dataframe.

In [4]:

# Create empty arrays over everything that will be collected:
make = []
make_link = []
avg_price = []
last_thirty = []
last_ninety = []
year_over_year = []
data_label = []

# Variable of page URL:
website_page = 'https://www.cargurus.com/Cars/price-trends/'
# Check for a successful responose:
response_page = requests.get(website_page)
# Save HTML from the response:
soup_page = BeautifulSoup(response_page.content, 'html.parser')

# Find all the rows in the table with the class "odd" or "even":
rows = soup_page.findAll('tr', class_ = ["odd","even"])

# Itterate through each row and save the variables to the correct array:
for row in rows:
    # Make - In this try, if a link is found in the row then it retreives the text and stores to the make array. 
    # Also, it will retreive the next href so I can iterate in the next code cell. If it fails it stores 'n/a'.
    # If a link is never found in the row, then it grabs the label instead. The first cell is not relevant, 
    # so I check the value and append to the data_label accordingly. If it fails it stores 'n/a'.
    try: 
        make.append(row.find('a').get_text().strip())
        data_label.append('Make')
        try:
            make_link.append(row.find('a').get('href'))
        except: make_link.append('n/a')
    except: 
        try: 
            make.append(row.find('label').get_text().strip())
            if odd.find('label').get_text().strip() == 'CarGurus Index': data_label.append('Overall')
            else: data_label.append('Type')
        except: data_label.append('n/a')
    # Avg_Price - In this try, if a cell is found with the class 'qPrice' then it append the avg_price array.
    # If nothing is found then it appends 'n/a'.
    try: avg_price.append(row.find('td', {'class':'qPrice'}).get_text())
    except: avg_price.append('n/a')
    # Last_30_Days - In this try, I grab the 2nd indexed cell value in the row and appends it to the last_thirty array.
    # If nothing is found then it appends 'n/a'.
    try: last_thirty.append(row.findAll('td')[2].get_text().strip())
    except: last_thirty.append('n/a')
    # Last_90_Days - In this try, I grab the 3rd indexed cell value in the row and appends it to the last_ninety array.
    # If nothing is found then it appends 'n/a'.
    try: last_ninety.append(row.findAll('td')[3].get_text().strip())
    except: last_ninety.append('n/a')
    #Year_Over_Year - In this try, I grab the 3rd indexed cell value in the row and appends it to the year_over_year array.
    # If nothing is found then it appends 'n/a'.
    try: year_over_year.append(row.findAll('td')[4].get_text().strip())
    except: year_over_year.append('n/a')

# Create dataframe from arrays:
cars = pd.DataFrame({'Make':make, 'Data_Label':data_label, 'Avg_Price':avg_price, 'Last_Thirty':last_thirty, 'Last_Ninety':last_ninety, 'Year_Over_Year':year_over_year})

# Print top 5 rows:
cars.head(20)


Unnamed: 0,Make,Data_Label,Avg_Price,Last_Thirty,Last_Ninety,Year_Over_Year
0,CarGurus Index,,"$30,801",+0.10%,+1.02%,+12.78%
1,Pickup Truck,,"$38,213",+0.42%,+1.16%,+1.93%
2,SUV,,"$39,466",-0.41%,-0.28%,+11.89%
3,Crossover,,"$27,485",-0.22%,-0.33%,+13.93%
4,Minivan,,"$23,865",-0.88%,-1.03%,+21.11%
5,Van,,"$34,948",+1.44%,+4.33%,+42.32%
6,Convertible,,"$42,368",+2.28%,+1.43%,+13.93%
7,Sedan,,"$23,471",+0.28%,+1.59%,+18.17%
8,Hatchback,,"$17,531",+0.28%,+1.12%,+19.20%
9,Coupe,,"$41,278",+1.82%,+4.67%,+14.16%


#### In this cell I perform the scraping of the second dataset. First I started by outlining the data I wanted to collect into arrays. Make is the actual manufacture, model is model of the car, make_link is only collected to scrap the next page, avg_price is in dollar format, 30/90/YoY are precent changes. I also use the make_link that I saved in the previous cell so that I can crawl to the next page. 

#### The next step in the process is to get the html from the website and filter out everything else besides table rows. The first 2 rows are not needed, so I used a count variable to skip them for each page. Once the rows are stored, I use a for loop to iterate through each of them. Once everything is stored in the arrays, I assemble the dataframe.

In [5]:
# Create empty arrays over everything that will be collected:
# Uses make_link array from before:
maker = []
model = []
model_average_price = []
last_30_days_percent = []
last_90_days_percent = []
year_over_year_percent = []

# Counter used to itterate through each make.
counter = 0

# Itterate through each make link:
for make in make_link:
    # Append the make link to the original website each time:
    website_page = 'https://www.cargurus.com/Cars/price-trends/' + make_link[counter]
    # Check for a successful response:
    response_page = requests.get(website_page)
    # Retreive the HTML from that makes specific webpage:
    soup_page = BeautifulSoup(response_page.content, 'html.parser')

    # Store all the rows with the class "odd" or "even"
    rows = soup_page.findAll('tr', class_ = ["odd","even"])
    
    # Retreives the make displayed on the current page:
    current_maker = rows[1].find('a').get_text().strip()
    
    # Counter used to skip first two rows:
    row_counter = 0
    
    # Itterate through each row and save the variables to the array:
    for row in rows:
        # Filter out the firs two rows:
        if row_counter > 1:
            # Make - In this try, we add the maker to the make array. If it fails it stores 'n/a'.
            try: maker.append(current_maker)
            except: maker.append('n/a')
            # Make - In this try, we add the model to the make array. If it fails it stores 'n/a'.
            try: model.append(row.find('a').get_text().strip())
            except: model.append('n/a')
            # Make - In this try, we add the avg_price to the make array from a td with the class qPrice. If it fails it stores 'n/a'.
            try: model_average_price.append(row.find('td', {'class':'qPrice'}).get_text())
            except: model_average_price.append('n/a')
            # Last_30_Days_Percent - In this try, we add the 30 day change to the make array. If it fails it stores 'n/a'.
            try: last_30_days_percent.append(row.findAll('td')[2].get_text().strip())
            except: last_30_days_percent.append('n/a')
            #Last_90_Days_Percent - In this try, we add the 90 day change to the make array. If it fails it stores 'n/a'.
            try: last_90_days_percent.append(row.findAll('td')[3].get_text().strip())
            except: last_90_days_percent.append('n/a')
            #Year_Over_Year_Percent - In this try, we add the YoY change to the make array. If it fails it stores 'n/a'.
            try: year_over_year_percent.append(row.findAll('td')[4].get_text().strip())
            except: year_over_year_percent.append('n/a')
        
        # Itterate row counter:
        row_counter = row_counter + 1
    
    # Itterate make counter:
    counter = counter + 1
    
# Create dataframe:
cars2 = pd.DataFrame({'Make':maker, 'Model':model, 'Model_Avg_Price':model_average_price, 'Last_30_Days':last_30_days_percent, 'Last_90_Days':last_90_days_percent, 'Year_Over_Year':year_over_year_percent})

# Print the head:
cars2.head(30)


Unnamed: 0,Make,Model,Model_Avg_Price,Last_30_Days,Last_90_Days,Year_Over_Year
0,Acura,Acura ILX,"$23,780",+0.63%,+1.21%,+14.65%
1,Acura,Acura Integra,"$13,218",+30.30%,+35.56%,+97.48%
2,Acura,Acura MDX,"$30,713",-1.71%,-2.86%,+9.98%
3,Acura,Acura RDX,"$31,814",-1.50%,-2.07%,+12.84%
4,Acura,Acura TL,"$12,323",+0.33%,+3.96%,+14.60%
5,Acura,Acura TLX,"$31,132",+0.42%,+2.67%,+12.21%
6,Acura,Acura TSX,"$11,984",-0.61%,+4.34%,+12.81%
7,Alfa Romeo,Alfa Romeo Giulia,"$34,505",+0.39%,-0.23%,+8.25%
8,Alfa Romeo,Alfa Romeo Stelvio,"$37,175",-0.19%,-0.72%,+8.57%
9,Aston Martin,Aston Martin DB11,"$175,433",+0.31%,-0.61%,+6.27%


In [6]:

# First Dataset Cleaning:

# Remove the commas and dollar sign from Avg_Price:
cars['Avg_Price'] = cars['Avg_Price'].str.replace('$', '').str.replace(',','')
# Remove the percents and + sign from Last_Thirty:
cars['Last_Thirty'] = cars['Last_Thirty'].str.replace('%', '').str.replace('+','')
# Remove the percents and + sign from Last_Ninety:
cars['Last_Ninety'] = cars['Last_Ninety'].str.replace('%', '').str.replace('+','')
# Remove the percents and + sign from Year_Over_Year:
cars['Year_Over_Year'] = cars['Year_Over_Year'].str.replace('%', '').str.replace('+','')

# Change Avg_Price to integer from String Obj:
cars["Avg_Price"] = cars["Avg_Price"].astype("int64")
# Change Last_Thirty to float from String Obj:
cars['Last_Thirty'] = cars['Last_Thirty'].apply('float64')
# Change Last_Ninety to float from String Obj:
cars['Last_Ninety'] = cars['Last_Ninety'].apply('float64')
# Change Year_Over_Year to float from String Obj:
cars['Year_Over_Year'] = cars['Year_Over_Year'].apply('float64')

# Check types:
cars.dtypes


Make               object
Data_Label         object
Avg_Price           int64
Last_Thirty       float64
Last_Ninety       float64
Year_Over_Year    float64
dtype: object

In [7]:

# Second Dataset Cleaning:

# Remove the commas and dollar sign from Model_Avg_Price:
cars2['Model_Avg_Price'] = cars2['Model_Avg_Price'].str.replace('$', '').str.replace(',','')
# Remove the percent and plus sign from Last_30_Days:
cars2['Last_30_Days'] = cars2['Last_30_Days'].str.replace('%', '').str.replace('+','')
# Remove the percent and plus sign from Last_90_Days:
cars2['Last_90_Days'] = cars2['Last_90_Days'].str.replace('%', '').str.replace('+','')
# Remove the percent and plus sign from Year_Over_Year:
cars2['Year_Over_Year'] = cars2['Year_Over_Year'].str.replace('%', '').str.replace('+','')

# Change Model_Avg_Price to integer from String Obj:
cars2["Model_Avg_Price"] = cars2["Model_Avg_Price"].astype("int64")
# Change Last_30_Days to float from String Obj:
cars2['Last_30_Days'] = cars2['Last_30_Days'].apply('float64')
# Change Last_90_Days to float from String Obj:
cars2['Last_90_Days'] = cars2['Last_90_Days'].apply('float64')
# Change Year_Over_Year to float from String Obj:
cars2['Year_Over_Year'] = cars2['Year_Over_Year'].apply('float64')

# In this step, I basically removed any reference of the make from the model field (Honda Civic -> Civic).
cars2_make = cars2['Make']
cars2_model = cars2['Model']
new_value = []

# Iterate through both make and model:
for x,y in zip(cars2_model,cars2_make):
    length = len(y) + 1 # Add space to end of the make
    y = y.ljust(length) # Get the new length
    value = x.replace(y,'') # Find make in model and remvoes it
    new_value.append(value) # ADd new values to the array

# Update model column:
cars2['Model'] = new_value

# Check data types:
cars2.dtypes


Make                object
Model               object
Model_Avg_Price      int64
Last_30_Days       float64
Last_90_Days       float64
Year_Over_Year     float64
dtype: object

In [8]:
# Save to the shared folder
cars.to_pickle('/dsa/groups/casestudy2022su/team05/carguru_make_v01.pkl')

In [9]:
# Save to the shared folder
cars2.to_pickle('/dsa/groups/casestudy2022su/team05/carguru_model_v01.pkl')