#### Notebook Goal:

In this notebook we intend to perform some additional data carpentry (the other notebooks will include visualizations and other code). Specifically, we want to aggregate our live data to provide descriptive statistics and historical context for the dataset. We will then join that aggregated data back to the original dataset so that it can be used by our future machine learning model.

In [1]:
# Imports
import pandas as pd
import numpy as np

In [2]:
# This is our cleaned live cars dataframe from Sprint 2:
df = pd.read_pickle('/dsa/groups/casestudy2022su/team05/carscom_v02.pkl')

#### Aggregation:

In this step, we create the two aggregates that we want: price and mileage. These aggregates are grouped on make, model, and year to provide us with some historical context on how well each make and model holds its value over time. Additionally, we wanted to see how many vehicles fell into each group so that we can filter out groups with few samples.

In [3]:
# The two aggregated datasets are created here:
grouped_Price_Agg = df.groupby(['Make','Model','Year']).agg({'Price': ['mean', 'min', 'max']})
grouped_Mileage_Agg = df.groupby(['Make','Model','Year']).agg({'Mileage': ['mean', 'min', 'max']})

In [4]:
# Combines the two datasets and joins them on make, model and year:
Metrics_df = pd.concat([grouped_Price_Agg , grouped_Mileage_Agg ], axis=1)
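
As an alternative, the same price and mileage summaries can be produced in a single `groupby` using pandas named aggregation (available in pandas 0.25 and newer), which avoids the separate `concat` and yields flat column names. The sketch below runs on a small made-up frame: `toy` and its values are hypothetical, and only the column names `Make`, `Model`, `Year`, `Price`, and `Mileage` come from the notebook. Note that `size` counts rows per group, which is an assumption about what the count should mean.

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned listings
toy = pd.DataFrame({
    'Make': ['Acura', 'Acura', 'Acura', 'Volvo'],
    'Model': ['ILX', 'ILX', 'ILX', 'XC90'],
    'Year': [2019, 2019, 2018, 2016],
    'Price': [27000, 29000, 24000, 32000],
    'Mileage': [10000, 30000, 26000, 80000],
})

# One groupby producing six named summary columns plus a group size
metrics = toy.groupby(['Make', 'Model', 'Year']).agg(
    Avg_Price=('Price', 'mean'),
    Min_Price=('Price', 'min'),
    Max_Price=('Price', 'max'),
    Avg_Mileage=('Mileage', 'mean'),
    Min_Mileage=('Mileage', 'min'),
    Max_Mileage=('Mileage', 'max'),
    Count=('Price', 'size'),
).reset_index()  # Make/Model/Year become regular columns again

print(metrics)
```

Because the output columns are named directly, no later renaming step is needed with this approach.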

In [5]:
# Print the dataset:
Metrics_df2 = Metrics_df.sort_values(by=['Make','Model','Year'],ascending=[True,True,False])
Metrics_df2.head()


,,,Price,Price,Price,Mileage,Mileage,Mileage
,,,mean,min,max,mean,min,max
Make,Model,Year,,,,,,
Acura,ILX,2021,31950.0,31000,32900,16285.5,9034,23537
Acura,ILX,2020,27400.0,27400,27400,13469.0,13469,13469
Acura,ILX,2019,28507.5,27260,29998,28269.5,10387,37200
Acura,ILX,2018,23806.666667,23590,24990,26139.833333,23968,34774
Acura,ILX,2017,22925.916667,20189,23998,46739.0,38495,57556


In [6]:
# We also wanted a count of how many vehicles fell into each group:
grouped_count = df.groupby(['Make','Model','Year'])['index'].count()

In [7]:
# Join all the datasets together:
result = pd.concat([Metrics_df, grouped_count], axis=1)

In [8]:
# Print the combined dataset:
result = result.sort_values(by=['Make','Model','Year'],ascending=[True,True,False])
result

,,,"(Price, mean)","(Price, min)","(Price, max)","(Mileage, mean)","(Mileage, min)","(Mileage, max)",index
Make,Model,Year,,,,,,,
Acura,ILX,2021,31950.000000,31000,32900,16285.500000,9034,23537,2
Acura,ILX,2020,27400.000000,27400,27400,13469.000000,13469,13469,1
Acura,ILX,2019,28507.500000,27260,29998,28269.500000,10387,37200,16
Acura,ILX,2018,23806.666667,23590,24990,26139.833333,23968,34774,12
Acura,ILX,2017,22925.916667,20189,23998,46739.000000,38495,57556,12
...,...,...,...,...,...,...,...,...,...
smart,ForTwo Pure,2015,13092.714286,10500,16590,36237.714286,17818,70990,7
smart,ForTwo Pure,2014,10997.000000,7999,13995,62267.500000,42562,81973,2
smart,ForTwo Pure,2013,9870.250000,7499,12990,41551.750000,22895,64666,4
smart,ForTwo Pure,2009,10990.000000,10990,10990,5652.000000,5652,5652,19


In [9]:
# Write to CSV so we can flatten the grouped dataframe:
result.to_csv('/dsa/groups/casestudy2022su/team05/temp_ungrouped2.csv')

In [10]:
# Read the CSV back in, which turns the grouping keys into regular columns:
result = pd.read_csv('/dsa/groups/casestudy2022su/team05/temp_ungrouped2.csv')

In [11]:
# Renamed the columns:
result.columns = ['Make','Model','Year','Avg_Price','Min_Price',
                  'Max_Price','Avg_Mileage','Min_Mileage','Max_Mileage','Count']
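
The CSV round trip above works, but the grouping keys can also be turned back into ordinary columns in memory with `reset_index()`. Here is a minimal sketch on made-up data (the `toy` frame and the `Price_mean` style names are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for the cleaned listings
toy = pd.DataFrame({
    'Make': ['Acura', 'Acura', 'Volvo'],
    'Model': ['ILX', 'ILX', 'XC90'],
    'Year': [2019, 2019, 2016],
    'Price': [27000, 29000, 32000],
})

grouped = toy.groupby(['Make', 'Model', 'Year']).agg({'Price': ['mean', 'min', 'max']})

# Flatten the two-level column header, e.g. ('Price', 'mean') -> 'Price_mean'
grouped.columns = ['_'.join(col) for col in grouped.columns]

# Move Make/Model/Year from the index back into regular columns
flat = grouped.reset_index()
print(flat.columns.tolist())
```

This avoids writing a temporary file to the shared folder, at the cost of handling the column flattening explicitly.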

#### Filtering:

In this step, we wanted to remove outliers and vehicles with a low sample size. The most common outliers we noticed were extremely old cars (we had some from the 70s and 80s), so we decided to keep only model years 2010 and newer. Additionally, we kept only groups with a count of 5 or greater to remove extremely uncommon vehicles.

In [12]:
# Remove vehicles with a count less than 5
result = result[result['Count'] >= 5]

In [13]:
# Remove vehicles older than 2010
result = result[result['Year'] >= 2010]
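
The two filters can also be combined into one boolean mask, which makes it easy to report how many groups each rule affects. The sketch below uses a made-up frame (`toy`, `MIN_COUNT`, and `MIN_YEAR` are hypothetical names); the thresholds of 5 and 2010 are the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: one row per make/model/year group
toy = pd.DataFrame({
    'Model': ['A', 'B', 'C', 'D'],
    'Year': [2008, 2012, 2015, 2020],
    'Count': [12, 3, 5, 40],
})

MIN_COUNT = 5    # same threshold as the notebook
MIN_YEAR = 2010  # same cutoff as the notebook

# Keep rows that satisfy both conditions
mask = (toy['Count'] >= MIN_COUNT) & (toy['Year'] >= MIN_YEAR)
filtered = toy[mask]

# How many groups each rule would drop on its own
dropped_low_count = int((toy['Count'] < MIN_COUNT).sum())
dropped_old = int((toy['Year'] < MIN_YEAR).sum())
print(filtered['Model'].tolist(), dropped_low_count, dropped_old)
```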

In [14]:
result

,Make,Model,Year,Avg_Price,Min_Price,Max_Price,Avg_Mileage,Min_Mileage,Max_Mileage,Count
2,Acura,ILX,2019,28507.500000,27260,29998,28269.500000,10387,37200,16
3,Acura,ILX,2018,23806.666667,23590,24990,26139.833333,23968,34774,12
4,Acura,ILX,2017,22925.916667,20189,23998,46739.000000,38495,57556,12
5,Acura,ILX,2016,20463.083333,14593,23990,71307.000000,31060,124681,12
6,Acura,ILX 2.0L,2015,18827.000000,16980,21590,79976.357143,41775,110534,14
...,...,...,...,...,...,...,...,...,...,...
14933,Volvo,XC90 T6 R-Design,2016,32038.666667,27250,40990,79134.111111,41848,115113,9
14935,smart,ForTwo Electric Drive passion,2014,13061.428571,12990,13990,25501.785714,20519,34018,14
14939,smart,ForTwo Passion,2015,14591.166667,11797,15990,38537.833333,22255,49161,6
14941,smart,ForTwo Passion,2013,10865.000000,7995,11990,68047.750000,45701,91286,8


#### Ordering:

In this step, we reordered the dataframe to prepare it for calculations. This included ordering by make, model, and year, as well as resetting the index.

In [15]:
# Changing frame to sort ascending on year for our interval calcs:
df2 = result.sort_values(by=['Make', 'Model','Year'],ascending=[True,True,True])
df2.head(10)

,Make,Model,Year,Avg_Price,Min_Price,Max_Price,Avg_Mileage,Min_Mileage,Max_Mileage,Count
5,Acura,ILX,2016,20463.083333,14593,23990,71307.0,31060,124681,12
4,Acura,ILX,2017,22925.916667,20189,23998,46739.0,38495,57556,12
3,Acura,ILX,2018,23806.666667,23590,24990,26139.833333,23968,34774,12
2,Acura,ILX,2019,28507.5,27260,29998,28269.5,10387,37200,16
8,Acura,ILX 2.0L,2013,16593.428571,10988,20998,91893.0,57342,144604,7
7,Acura,ILX 2.0L,2014,17635.105263,14495,20000,88992.894737,55436,112325,19
6,Acura,ILX 2.0L,2015,18827.0,16980,21590,79976.357143,41775,110534,14
11,Acura,ILX 2.0L w/Premium Package,2013,17107.2,14544,19906,85569.1,56647,168133,10
12,Acura,ILX 2.4L,2016,19991.210526,12995,25667,72371.684211,17520,158172,19
17,Acura,ILX Base,2018,21198.333333,16995,24488,62450.833333,21091,141684,6


In [16]:
# Resetting index and cleaning column
df2 = df2.reset_index(drop=True)
#df2 = df2.drop(columns = ["level_0"])
df2.head()

,Make,Model,Year,Avg_Price,Min_Price,Max_Price,Avg_Mileage,Min_Mileage,Max_Mileage,Count
0,Acura,ILX,2016,20463.083333,14593,23990,71307.0,31060,124681,12
1,Acura,ILX,2017,22925.916667,20189,23998,46739.0,38495,57556,12
2,Acura,ILX,2018,23806.666667,23590,24990,26139.833333,23968,34774,12
3,Acura,ILX,2019,28507.5,27260,29998,28269.5,10387,37200,16
4,Acura,ILX 2.0L,2013,16593.428571,10988,20998,91893.0,57342,144604,7


In [17]:
# Staging new columns for iterative loop that will be performed.
df2['price_diff']=0 #price difference from previous year
df2['mileage_diff']=0 # mileage difference from previous year
df2['YoY_price_pct_change']=0 #year over year price percent change
df2['YoY_mileage_pct_change']=0 #year over year mileage percent change
df2.head(10)

,Make,Model,Year,Avg_Price,Min_Price,Max_Price,Avg_Mileage,Min_Mileage,Max_Mileage,Count,price_diff,mileage_diff,YoY_price_pct_change,YoY_mileage_pct_change
0,Acura,ILX,2016,20463.083333,14593,23990,71307.0,31060,124681,12,0,0,0,0
1,Acura,ILX,2017,22925.916667,20189,23998,46739.0,38495,57556,12,0,0,0,0
2,Acura,ILX,2018,23806.666667,23590,24990,26139.833333,23968,34774,12,0,0,0,0
3,Acura,ILX,2019,28507.5,27260,29998,28269.5,10387,37200,16,0,0,0,0
4,Acura,ILX 2.0L,2013,16593.428571,10988,20998,91893.0,57342,144604,7,0,0,0,0
5,Acura,ILX 2.0L,2014,17635.105263,14495,20000,88992.894737,55436,112325,19,0,0,0,0
6,Acura,ILX 2.0L,2015,18827.0,16980,21590,79976.357143,41775,110534,14,0,0,0,0
7,Acura,ILX 2.0L w/Premium Package,2013,17107.2,14544,19906,85569.1,56647,168133,10,0,0,0,0
8,Acura,ILX 2.4L,2016,19991.210526,12995,25667,72371.684211,17520,158172,19,0,0,0,0
9,Acura,ILX Base,2018,21198.333333,16995,24488,62450.833333,21091,141684,6,0,0,0,0


#### Calculations and Descriptive Stats:

In this step, we used a for loop to iterate over each row and calculate the YoY price change for vehicles of the same make and model but a different year. Each row is compared with the next row, which after sorting is the following model year in the data for that make and model. This should provide us with statistics about how well the vehicle holds its value, or at what point the value drops off significantly.

In [18]:
# Iterate over each row:
for index, row in df2.iterrows(): # Loop start
    if index<df2.shape[0]-1: # Check for end of dataframe
        # Only want same model and make evaluated for change calcs
        if df2.Make[index]==df2.Make[index+1] and df2.Model[index]==df2.Model[index+1]:
            # Write through .loc so the values land in df2 itself; chained assignment such as
            # df2.price_diff[index]=... triggers SettingWithCopyWarning. int() keeps the staged
            # integer columns as whole numbers (fractions are truncated).
            df2.loc[index,'price_diff']=int(df2.Avg_Price[index]-df2.Avg_Price[index+1]) # price difference
            df2.loc[index,'mileage_diff']=int(df2.Avg_Mileage[index]-df2.Avg_Mileage[index+1]) # mileage difference
            df2.loc[index,'YoY_price_pct_change']=int(100*(df2.Avg_Price[index]-df2.Avg_Price[index+1])/df2.Avg_Price[index+1]) #pct diff price
            df2.loc[index,'YoY_mileage_pct_change']=int(100*(df2.Avg_Mileage[index]-df2.Avg_Mileage[index+1])/df2.Avg_Mileage[index+1]) #pct diff mileage
        else:
            # Newest year for this make and model: no following row to compare against, so set 0
            df2.loc[index,'price_diff']=0
            df2.loc[index,'mileage_diff']=0
            df2.loc[index,'YoY_price_pct_change']=0
            df2.loc[index,'YoY_mileage_pct_change']=0
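
The same row-to-next-row comparison can be sketched without an explicit loop by shifting each make/model group up by one row. This is an alternative under stated assumptions, not the method used in this notebook: `toy` is made-up data, and unlike the whole-number values shown in the notebook output, this version keeps the fractional part.

```python
import pandas as pd

# Hypothetical toy data, one row per make/model/year group
toy = pd.DataFrame({
    'Make': ['Acura', 'Acura', 'Acura', 'Volvo'],
    'Model': ['ILX', 'ILX', 'ILX', 'XC90'],
    'Year': [2016, 2017, 2018, 2016],
    'Avg_Price': [20000.0, 22000.0, 25000.0, 32000.0],
})
toy = toy.sort_values(['Make', 'Model', 'Year']).reset_index(drop=True)

# Average price of the next listed model year within the same make and model;
# the last row of each group has no next year and becomes NaN
next_price = toy.groupby(['Make', 'Model'])['Avg_Price'].shift(-1)

# Difference and percent change relative to that next year; NaN rows are set to 0
toy['price_diff'] = (toy['Avg_Price'] - next_price).fillna(0)
toy['YoY_price_pct_change'] = (100 * (toy['Avg_Price'] - next_price) / next_price).fillna(0)
print(toy)
```

Mileage columns would follow the same pattern with `Avg_Mileage` in place of `Avg_Price`.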

In [19]:
# Print new statistics:
df2.head(10)

,Make,Model,Year,Avg_Price,Min_Price,Max_Price,Avg_Mileage,Min_Mileage,Max_Mileage,Count,price_diff,mileage_diff,YoY_price_pct_change,YoY_mileage_pct_change
0,Acura,ILX,2016,20463.083333,14593,23990,71307.0,31060,124681,12,-2462,24568,-10,52
1,Acura,ILX,2017,22925.916667,20189,23998,46739.0,38495,57556,12,-880,20599,-3,78
2,Acura,ILX,2018,23806.666667,23590,24990,26139.833333,23968,34774,12,-4700,-2129,-16,-7
3,Acura,ILX,2019,28507.5,27260,29998,28269.5,10387,37200,16,0,0,0,0
4,Acura,ILX 2.0L,2013,16593.428571,10988,20998,91893.0,57342,144604,7,-1041,2900,-5,3
5,Acura,ILX 2.0L,2014,17635.105263,14495,20000,88992.894737,55436,112325,19,-1191,9016,-6,11
6,Acura,ILX 2.0L,2015,18827.0,16980,21590,79976.357143,41775,110534,14,0,0,0,0
7,Acura,ILX 2.0L w/Premium Package,2013,17107.2,14544,19906,85569.1,56647,168133,10,0,0,0,0
8,Acura,ILX 2.4L,2016,19991.210526,12995,25667,72371.684211,17520,158172,19,0,0,0,0
9,Acura,ILX Base,2018,21198.333333,16995,24488,62450.833333,21091,141684,6,-5983,39644,-22,173


In [20]:
#Saving all grouped metrics for possible future use
df2.to_pickle('/dsa/groups/casestudy2022su/team05/carscom_groupedmetrics_v01.pkl')

In [21]:
# Read in the original dataframe for joining with new descriptive statistics dataframe:
original = pd.read_pickle('/dsa/groups/casestudy2022su/team05/carscom_v02.pkl')
original.head() 

,index,Year,Make,Model,Dealer_Name,Distance Radius,Zip,State,City,Mileage,Price,Rate,Under_Value($),miles,electronic_dealer
0,0,2020,Jeep,Grand Cherokee Laredo,Carl Burger's Dodge Chrysler Jeep RAM,50,92132,CA,San Diego,30134,37990,Fair,0,0,0
1,1,2016,Dodge,Challenger SRT Hellcat,TRED Private Seller (San Diego),50,92132,CA,San Diego,29635,54099,Good,0,2,0
2,2,2010,Lexus,ES 350,TRED Private Seller (San Diego),50,92132,CA,San Diego,159000,8909,Great,1111,2,0
3,3,2020,Buick,Encore Essence,Hertz Car Sales San Diego,50,92132,CA,San Diego,57751,21353,Good,0,2,0
4,4,2015,Lexus,IS 350 Base,Shift San Diego,50,92132,CA,San Diego,55800,28950,Good,0,0,1


In [22]:
# Making year column integer
original['Year']=original['Year'].astype(int)

In [23]:
# Merge the original cars.com set with the new dataframe of difference calcs. Join on
# year, make, and model; the inner join drops anything that doesn't match up
new = pd.merge(original, df2, on=["Year","Make", "Model"], how='inner')
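
Since each listing should pick up exactly one row of grouped metrics, it can be worth guarding the merge with the `validate` argument of `pd.merge` and comparing row counts before and after. A small sketch on made-up frames (`listings` and `metrics` are hypothetical stand-ins for the two dataframes being merged):

```python
import pandas as pd

# Hypothetical listings: two match the metrics table, one does not
listings = pd.DataFrame({
    'Year': [2019, 2019, 2005],
    'Make': ['Acura', 'Acura', 'Ford'],
    'Model': ['ILX', 'ILX', 'Ranger'],
    'Price': [27000, 29000, 6000],
})
# Hypothetical grouped metrics: one row per (Year, Make, Model)
metrics = pd.DataFrame({
    'Year': [2019],
    'Make': ['Acura'],
    'Model': ['ILX'],
    'Avg_Price': [28000.0],
})

# validate raises MergeError if metrics has duplicate (Year, Make, Model) keys
merged = pd.merge(listings, metrics, on=['Year', 'Make', 'Model'],
                  how='inner', validate='many_to_one')

# The inner join drops listings without a matching metrics row
print(len(listings), '->', len(merged))
```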

#### Geolocation Addition:

In this step, we wanted to add longitude and latitude to each listing so that we can use maps as visualizations. The steps below go through the process of pulling in that data and joining it.

In [24]:
# Pull in geo location data:
city = pd.read_csv('/dsa/groups/casestudy2022su/team05/city.csv')
city = city.drop(columns = 'Unnamed: 0')
city.head(30)

,City,Lat,Lon
0,San Diego,32.716,-117.161
1,Los Angeles,34.052,-118.244
2,San Francisco,37.775,-122.419
3,Portland,45.515,-122.678
4,Seatle,47.606,-122.332
5,Ketchum/Boise,43.681,-114.364
6,Billings Bozeman,45.68,-111.039
7,Las Vegas,36.172,-115.139
8,Phoenix,33.448,-112.074
9,Salt Lake City,40.761,-111.891


In [25]:
# Join the geolocation data on our dataframe by city.
data = pd.merge(new, city, how='inner', on = 'City')
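
An inner join on `City` silently drops any listing whose city name does not match the lookup table exactly, for example because of a spelling difference. One way to check for this is a left join with `indicator=True`. The frames below are made up (`listings` and `coords` are hypothetical names), so this only illustrates the check:

```python
import pandas as pd

# Hypothetical listings and coordinate lookup; 'Seattle' has no match in coords
listings = pd.DataFrame({'City': ['San Diego', 'Seattle', 'Phoenix'],
                         'Price': [37990, 21353, 8909]})
coords = pd.DataFrame({'City': ['San Diego', 'Phoenix'],
                       'Lat': [32.716, 33.448],
                       'Lon': [-117.161, -112.074]})

# A left join with indicator=True adds a '_merge' column marking unmatched rows
check = pd.merge(listings, coords, how='left', on='City', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'City'].unique().tolist()
print(unmatched)
```

An empty list would mean every listing found coordinates and the inner join loses nothing.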

In [26]:
# Set the miles and distance to integers
data['miles'] = data['miles'].astype(int)
data['Distance Radius'] = data['Distance Radius'].astype(int)

In [27]:
# Drop unneeded index column:
data = data.drop(columns = ["index"])
data.head(5)

,Year,Make,Model,Dealer_Name,Distance Radius,Zip,State,City,Mileage,Price,...,Avg_Mileage,Min_Mileage,Max_Mileage,Count,price_diff,mileage_diff,YoY_price_pct_change,YoY_mileage_pct_change,Lat,Lon
0,2020,Jeep,Grand Cherokee Laredo,Carl Burger's Dodge Chrysler Jeep RAM,50,92132,CA,San Diego,30134,37990,...,29486.888889,4661,68184,90,-4506,14484,-11,96,32.716,-117.161
1,2020,Jeep,Grand Cherokee Laredo,Carl Burger's Dodge Chrysler Jeep RAM,50,92132,CA,San Diego,30134,37990,...,29486.888889,4661,68184,90,-4506,14484,-11,96,32.716,-117.161
2,2020,Jeep,Grand Cherokee Laredo,Carl Burger's Dodge Chrysler Jeep RAM,50,92132,CA,San Diego,4661,38990,...,29486.888889,4661,68184,90,-4506,14484,-11,96,32.716,-117.161
3,2020,Jeep,Grand Cherokee Laredo,Carl Burger's Dodge Chrysler Jeep RAM,50,92132,CA,San Diego,4661,38990,...,29486.888889,4661,68184,90,-4506,14484,-11,96,32.716,-117.161
4,2020,Jeep,Grand Cherokee Laredo,San Diego Chrysler Dodge Jeep RAM,50,92132,CA,San Diego,21909,31388,...,29486.888889,4661,68184,90,-4506,14484,-11,96,32.716,-117.161


In [28]:
# Pickle the final and mature dataframe to the shared data folder:
data.to_pickle('/dsa/groups/casestudy2022su/team05/carscom_v03.pkl')