# Holman Data Science Outlier Investigation
Why do the prices vary so much?

## IMPORT LIBRARIES

In [1]:
import pandas as pd

## LOAD AND PREPARE DATA

In [2]:
df = pd.read_parquet('data/f150-tire-repairs.parquet')

### Create outlier dataframe

In [4]:
from datetime import datetime

# The most recent date was actually 4/30, so start the window on 4/30/20 to get a full three years
start_date = datetime.strptime('4/30/20 9:59:23', '%m/%d/%y %H:%M:%S')

# Keep only the records on or after the start of the three-year window
three_yr_df = df.loc[df['date'] >= start_date]
three_yr_df.sort_values(by=['date'])

Unnamed: 0,part_id,cost,date,id,repair_description,make,model,year
16726183,17001001,245.19,2020-04-30 10:55:17,83473092,"TIRE, RADIAL LUG TREAD",FORD,F-150,2011
8833068,17001001,158.38,2020-04-30 11:04:14,83473390,"TIRE, RADIAL LUG TREAD",FORD,F-150,2017
7349197,17001001,195.00,2020-04-30 11:05:12,83473413,"TIRE, RADIAL LUG TREAD",FORD,F-150,2018
12374356,17001001,195.00,2020-04-30 11:58:43,83474902,"TIRE, RADIAL LUG TREAD",FORD,F-150,2010
25172864,17001001,137.95,2020-04-30 12:50:02,83476126,"TIRE, RADIAL LUG TREAD",FORD,F-150,2016
...,...,...,...,...,...,...,...,...
20625359,17001001,208.99,2023-04-28 15:48:38,99701298,"TIRE, RADIAL LUG TREAD",FORD,F-150,2021
9098727,17001001,304.67,2023-04-28 17:43:41,99703604,"TIRE, RADIAL LUG TREAD",FORD,F-150,2010
7372935,17001001,200.90,2023-04-28 18:23:19,99704138,"TIRE, RADIAL LUG TREAD",FORD,F-150,2019
14149801,17001001,176.47,2023-04-29 10:57:35,99709483,"TIRE, RADIAL LUG TREAD",FORD,F-150,2014
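
As an alternative to hardcoding the cutoff, the start of the window could be derived from the latest timestamp in the data. This is a minimal sketch on a toy frame, not the notebook's method: `toy`, `latest`, `window_start`, and `recent` are illustrative names, and the derived cutoff will not match the hand-picked `4/30/20 9:59:23` exactly.

```python
import pandas as pd

# Toy stand-in for df; the two later timestamps are copied from the output above,
# the 2019 row is made up so that something falls outside the window.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2019-12-31 08:00:00",
        "2020-04-30 10:55:17",
        "2023-04-29 10:57:35",
    ]),
    "cost": [100.0, 245.19, 176.47],
})

# Anchor the window on the most recent record and step back three calendar years.
latest = toy["date"].max()
window_start = latest - pd.DateOffset(years=3)

# Same filter-and-sort as the cell above, with the derived cutoff.
recent = toy.loc[toy["date"] >= window_start].sort_values(by="date")
print(window_start, len(recent))
```

Deriving the cutoff keeps the window at three years if the data is refreshed, at the cost of a start time that shifts with the latest record.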


#### The Z-score outliers are a subset of the IQR outliers, so just use the IQR results

In [14]:
# Quartiles and interquartile range of cost
q1 = three_yr_df['cost'].quantile(0.25)
q3 = three_yr_df['cost'].quantile(0.75)
IQR = q3 - q1

# Outlier limits: 1.5 * IQR beyond the quartiles
iqr_upper_limit = q3 + (IQR * 1.5)
iqr_lower_limit = q1 - (IQR * 1.5)
iqr_outlier_df = three_yr_df.loc[(three_yr_df['cost'] <= iqr_lower_limit) | (three_yr_df['cost'] >= iqr_upper_limit)]
iqr_outlier_df.sort_values(by=['cost'])

Unnamed: 0,part_id,cost,date,id,repair_description,make,model,year
13443980,17001001,1.00,2022-03-24 16:42:21,93807125,"TIRE, RADIAL LUG TREAD",FORD,F-150,2013
20923996,17001001,1.00,2022-03-25 13:46:18,93820847,"TIRE, RADIAL LUG TREAD",FORD,F-150,2018
24783259,17001001,2.53,2022-11-04 11:15:40,97111914,"TIRE, RADIAL LUG TREAD",FORD,F-150,2015
17842113,17001001,3.00,2021-06-24 14:22:04,89823438,"TIRE, RADIAL LUG TREAD",FORD,F-150,2015
22715321,17001001,4.59,2021-03-26 12:46:53,88446294,"TIRE, RADIAL LUG TREAD",FORD,F-150,2017
...,...,...,...,...,...,...,...,...
20573543,17001001,575.00,2022-11-18 11:06:02,97309917,"TIRE, RADIAL LUG TREAD",FORD,F-150,2021
5125386,17001001,610.82,2023-01-17 10:10:32,98081346,"TIRE, RADIAL LUG TREAD",FORD,F-150,2021
10451158,17001001,686.84,2022-02-28 17:12:59,93424908,"TIRE, RADIAL LUG TREAD",FORD,F-150,2016
12915899,17001001,825.00,2022-12-07 12:29:31,97550160,"TIRE, RADIAL LUG TREAD",FORD,F-150,2018
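
The subset claim in the heading above can be checked directly by comparing the two outlier masks. The sketch below uses synthetic costs, not the real data, and assumes a Z-score threshold of 3, since the notebook's Z-score cell is not shown. The subset relation is an empirical property of a given dataset, not a general guarantee.

```python
import pandas as pd

# Synthetic costs: 48 typical prices plus one very low and one very high value.
cost = pd.Series([float(180 + i) for i in range(48)] + [1.0, 825.0])

# IQR rule, same limits and comparisons as the cell above.
q1, q3 = cost.quantile(0.25), cost.quantile(0.75)
iqr = q3 - q1
iqr_mask = (cost <= q1 - 1.5 * iqr) | (cost >= q3 + 1.5 * iqr)

# Z-score rule; the threshold of 3 is an assumption.
z = (cost - cost.mean()) / cost.std()
z_mask = z.abs() > 3

# The Z-score set is a subset if no row is flagged by Z-score but missed by IQR.
z_is_subset = not (z_mask & ~iqr_mask).any()
print(int(iqr_mask.sum()), int(z_mask.sum()), z_is_subset)
```

In this toy series the extreme high value inflates the standard deviation enough that the low value is flagged only by the IQR rule, which is one way the Z-score set ends up smaller.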


# Hypotheses on price variance
- Any number of factors could account for the outliers seen above.
- I'll touch briefly on a few possibilities below.

## Regional Differences
- Should the price of a Ford F-150 tire repair part be the same in every locality?
- Could the cost of shipping a part to, say, Hawaii cause an outlier?
- Could the local cost of living be a factor, for example a rural town versus a shop near Times Square?
- If we had an address for each record, I could dig deeper, but the source dataframe contains no address or shop name.
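
If a location field were available, the regional question could be explored by comparing typical cost per region. The sketch below is entirely hypothetical: the source data has no `state` column, and both the states and their pairing with costs are invented for illustration.

```python
import pandas as pd

# Hypothetical data: the 'state' column does not exist in the source dataframe,
# and the state/cost pairings are made up.
toy = pd.DataFrame({
    "state": ["NJ", "NJ", "NJ", "HI", "HI", "NY", "NY"],
    "cost": [195.00, 200.90, 208.99, 304.67, 310.00, 245.19, 250.00],
})

# Record count and median cost per region, most expensive first.
# The median is used because it is less sensitive to outliers than the mean.
by_state = (
    toy.groupby("state")["cost"]
    .agg(["count", "median"])
    .sort_values("median", ascending=False)
)
print(by_state)
```

A region whose median sits far from the others would support the regional hypothesis, provided each group has enough records to be meaningful.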


## Sales and Promotions
- Did any shop run a sale or a promotion?

In [18]:
pd.set_option('display.max_rows', 300)
day_month_df = iqr_outlier_df.copy()
day_month_df['day'] = iqr_outlier_df['date'].dt.day
day_month_df['month'] = iqr_outlier_df['date'].dt.month
day_month_df.sort_values(by=['month', 'day'])

Unnamed: 0,part_id,cost,date,id,repair_description,make,model,year,day,month
12438569,17001001,440.5,2023-01-03 13:39:30,97870175,"TIRE, RADIAL LUG TREAD",FORD,F-150,2020,3,1
15652252,17001001,60.0,2023-01-03 16:57:06,97875592,"TIRE, RADIAL LUG TREAD",FORD,F-150,2021,3,1
6457079,17001001,380.51,2022-01-05 13:14:54,92571111,"TIRE, RADIAL LUG TREAD",FORD,F-150,2019,5,1
6556974,17001001,445.63,2023-01-05 17:07:48,97919663,"TIRE, RADIAL LUG TREAD",FORD,F-150,2011,5,1
18023595,17001001,389.99,2023-01-06 16:21:11,97936847,"TIRE, RADIAL LUG TREAD",FORD,F-150,2022,6,1
20698115,17001001,391.0,2022-01-06 17:35:40,92606362,"TIRE, RADIAL LUG TREAD",FORD,F-150,2014,6,1
10794145,17001001,430.0,2023-01-09 10:57:49,97951004,"TIRE, RADIAL LUG TREAD",FORD,F-150,2021,9,1
18847459,17001001,422.25,2023-01-10 13:12:47,97976784,"TIRE, RADIAL LUG TREAD",FORD,F-150,2019,10,1
7807023,17001001,383.0,2023-01-11 14:52:32,97999535,"TIRE, RADIAL LUG TREAD",FORD,F-150,2018,11,1
8199364,17001001,380.51,2022-01-11 11:38:30,92662902,"TIRE, RADIAL LUG TREAD",FORD,F-150,2018,11,1


#### Manual inspection doesn't reveal any discernible pattern among the outliers below the mean price, which is where a sale would show up
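
Counting the low-cost outliers per calendar month could complement the manual inspection. The sketch below uses a handful of rows copied from the outputs above, not the full outlier set. It also assumes that 229.97, the baseline in the inflation cell, stands in for the typical price. `sample`, `reference_price`, and `low_by_month` are illustrative names.

```python
import pandas as pd

# A few (cost, date) pairs copied from the outlier outputs above; not the full data.
sample = pd.DataFrame({
    "cost": [1.00, 1.00, 2.53, 3.00, 4.59, 60.0, 440.5, 825.00],
    "date": pd.to_datetime([
        "2022-03-24 16:42:21", "2022-03-25 13:46:18", "2022-11-04 11:15:40",
        "2021-06-24 14:22:04", "2021-03-26 12:46:53", "2023-01-03 16:57:06",
        "2023-01-03 13:39:30", "2022-12-07 12:29:31",
    ]),
})

# Assumption: 229.97 is used as the reference price separating low from high outliers.
reference_price = 229.97
low = sample.loc[sample["cost"] < reference_price].copy()

# Count low-cost outliers per calendar month; a recurring promotion would appear as a spike.
low["month"] = low["date"].dt.month
low_by_month = low.groupby("month").size().sort_values(ascending=False)
print(low_by_month)
```

With only a few rows the counts carry no weight; the same grouping on the full `iqr_outlier_df` would be needed before reading anything into a spike.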

## Inflationary Pressures
- The upper bound for outliers was 374.5. I don't think inflation would cause a 62.8% increase over 229.97.

In [19]:
percent_increase = ((374.5 - 229.97) / 229.97) * 100
percent_increase

62.84732791233639
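
To put the figure in inflation terms, the total increase can be converted to the annual rate it would imply if compounded evenly. This is a rough sketch that assumes the three-year data window is the relevant period and that 229.97 is the baseline price.

```python
# Values taken from the cell above.
upper_limit = 374.5
baseline = 229.97
years = 3  # assumption: the three-year window used for three_yr_df

# Total growth factor, then the constant annual rate that compounds to it.
growth_factor = upper_limit / baseline
annualized_rate = growth_factor ** (1 / years) - 1
print(f"{annualized_rate:.1%} per year")
```

That works out to somewhere between 17% and 18% per year, sustained for three years, just to reach the outlier threshold.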

## Crime
- I don't understand the source material well enough to form a theory.