
In [1]:
!pip install --upgrade numpy pandas matplotlib scikit-learn



In [2]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

car_data_raw = pd.read_csv("https://cdn.c18l.org/vehicles_lab.csv")

In [3]:
car_data_raw.head()

Unnamed: 0,region,price,year,manufacturer,model,condition,cylinders,fuel,odometer,title_status,transmission,drive,size,type,paint_color,description,state,lat,long,posting_date
0,auburn,15000,2013.0,ford,f-150 xlt,excellent,6 cylinders,gas,128000.0,clean,automatic,rwd,full-size,truck,black,2013 F-150 XLT V6 4 Door. Good condition. Leve...,al,32.592,-85.5189,2021-05-03T14:02:03-0500
1,auburn,27990,2012.0,gmc,sierra 2500 hd extended cab,good,8 cylinders,gas,68696.0,clean,other,4wd,,pickup,black,Carvana is the safer way to buy a car During t...,al,32.59,-85.48,2021-05-03T13:41:25-0500
2,auburn,34590,2016.0,chevrolet,silverado 1500 double,good,6 cylinders,gas,29499.0,clean,other,4wd,,pickup,silver,Carvana is the safer way to buy a car During t...,al,32.59,-85.48,2021-05-03T12:41:33-0500
3,auburn,35000,2019.0,toyota,tacoma,excellent,6 cylinders,gas,43000.0,clean,automatic,4wd,,truck,grey,Selling my 2019 Toyota Tacoma TRD Off Road Dou...,al,32.6013,-85.443974,2021-05-03T12:12:59-0500
4,auburn,29990,2016.0,chevrolet,colorado extended cab,good,6 cylinders,gas,17302.0,clean,other,4wd,,pickup,red,Carvana is the safer way to buy a car During t...,al,32.59,-85.48,2021-05-03T11:31:14-0500


## Part 1: Feature Selection
In addition to price, I kept year, odometer, condition, type, and posting date for the predictive model. Price is the target. Year and odometer are numerical features that directly affect price, condition and type are categorical features that influence value, and posting date captures variation over time.

In [3]:
data = car_data_raw.loc[:, ['price', 'year', 'odometer', 'condition', 'type', 'posting_date']]
print(data.head())

   price    year  odometer  condition    type              posting_date
0  15000  2013.0  128000.0  excellent   truck  2021-05-03T14:02:03-0500
1  27990  2012.0   68696.0       good  pickup  2021-05-03T13:41:25-0500
2  34590  2016.0   29499.0       good  pickup  2021-05-03T12:41:33-0500
3  35000  2019.0   43000.0  excellent   truck  2021-05-03T12:12:59-0500
4  29990  2016.0   17302.0       good  pickup  2021-05-03T11:31:14-0500


## Part 2: Data Cleaning
The first step in my data cleaning process was dummy encoding the condition and type features. Dummy encoding gives each category its own 0/1 column, so each category gets its own interpretable model coefficient. To further clean the data, I filtered the numerical features. I removed records where price is less than or equal to 0, which excludes invalid or missing price data, and records where price is greater than $25,550, to exclude outliers. I then filtered out vehicles from years before 1975 to focus on more recent vehicles. Lastly, I removed records where the odometer reading is less than or equal to 0 or greater than 200,000. The upper limit eliminates high-mileage outliers.

In [4]:
temp1 = pd.get_dummies(data['condition'], dummy_na = True).astype(int)
temp2 = pd.get_dummies(data['type'], dummy_na = True).astype(int)
new_data = pd.concat([data, temp1, temp2], axis = 1)
new_data.head()

Unnamed: 0,price,year,odometer,condition,type,posting_date,excellent,fair,good,like new,...,hatchback,mini-van,offroad,other,pickup,sedan,truck,van,wagon,NaN
0,15000,2013.0,128000.0,excellent,truck,2021-05-03T14:02:03-0500,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,27990,2012.0,68696.0,good,pickup,2021-05-03T13:41:25-0500,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
2,34590,2016.0,29499.0,good,pickup,2021-05-03T12:41:33-0500,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
3,35000,2019.0,43000.0,excellent,truck,2021-05-03T12:12:59-0500,1,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,29990,2016.0,17302.0,good,pickup,2021-05-03T11:31:14-0500,0,0,1,0,...,0,0,0,0,1,0,0,0,0,0
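Because both encoded columns can contain missing values, each call with `dummy_na = True` produces its own `NaN` indicator column, and the two end up sharing a name after concatenation. A minimal sketch of one way to keep them apart, assuming a small made-up frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the `condition` and `type` columns (values are illustrative).
toy = pd.DataFrame({
    "condition": ["excellent", "good", None, "fair"],
    "type": ["truck", None, "sedan", "pickup"],
})

# prefix= puts the source column name in front of every indicator column,
# so the two missing-value indicators no longer collide.
dummies = pd.get_dummies(
    toy,
    columns=["condition", "type"],
    prefix=["condition", "type"],
    dummy_na=True,
    dtype=int,
)

print(dummies.columns.tolist())
print("unique column names:", dummies.columns.is_unique)
```

The prefixed names are longer than the bare category names used in the rest of this notebook, so the feature lists further down would need to be adjusted if this variant were adopted.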


In [5]:
new_data = new_data.loc[new_data['price'] > 0]
new_data = new_data.loc[new_data['price'] <= 25550.0]
new_data = new_data.loc[new_data['year'] >= 1975]
new_data = new_data.loc[new_data['odometer'] > 0]
new_data = new_data.loc[new_data['odometer'] <= 200000]
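The same thresholds can also be combined into a single boolean mask, which makes it easy to report how many rows survive the cleaning. This is only a sketch on a made-up frame (`toy` is hypothetical), not a replacement for the cell above:

```python
import pandas as pd

# Made-up rows, each violating at most one rule (illustrative values only).
toy = pd.DataFrame({
    "price": [0, 5000, 30000, 12000, 9000],
    "year": [2010.0, 1970.0, 2015.0, 2012.0, 2008.0],
    "odometer": [50000.0, 80000.0, 20000.0, 250000.0, 120000.0],
})

# inclusive="right" means lower < value <= upper, matching the `> 0` and `<=` filters.
keep = (
    toy["price"].between(0, 25550, inclusive="right")
    & (toy["year"] >= 1975)
    & toy["odometer"].between(0, 200000, inclusive="right")
)
cleaned = toy.loc[keep]

print(f"kept {keep.sum()} of {len(toy)} rows")
```

Only the last toy row passes every rule, so the mask keeps one of five rows.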

In [6]:
new_data.describe()

Unnamed: 0,price,year,odometer,excellent,fair,good,like new,new,salvage,NaN,...,hatchback,mini-van,offroad,other,pickup,sedan,truck,van,wagon,NaN.1
count,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,...,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0,119878.0
mean,10986.748928,2009.983733,106255.893041,0.384624,0.023132,0.250271,0.080398,0.002569,0.001969,0.257036,...,0.044587,0.022873,0.002936,0.015975,0.059235,0.339779,0.087998,0.030848,0.035803,0.0
std,6672.882466,6.772545,48125.756647,0.486508,0.150323,0.433171,0.27191,0.050623,0.044326,0.437001,...,0.206396,0.1495,0.054108,0.125378,0.236065,0.473636,0.283293,0.172907,0.1858,0.0
min,1.0,1975.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,5895.0,2007.0,71772.25,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,9975.0,2011.0,107273.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,15995.0,2015.0,143000.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
max,25550.0,2022.0,200000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0


## Part 3: Feature Engineering
The first column I created, `car_age`, is the age of each vehicle when it was posted to Craigslist (posting year minus model year). The second, `average_price_per_year`, divides the price by the number of years between the model year and 2023. This gives a rough view of how the price of a vehicle depreciates over time.

In [7]:
# Capture the leading four-digit year; \d accepts any digit, so years such as 2019 also match
new_data['posting_year'] = new_data['posting_date'].str.extract(r'^(\d{4})', expand=False)
new_data['car_age'] = new_data['posting_year'].astype(int) - new_data['year']
new_data = new_data.loc[new_data['car_age'] > 0]
new_data.head()

Unnamed: 0,price,year,odometer,condition,type,posting_date,excellent,fair,good,like new,...,offroad,other,pickup,sedan,truck,van,wagon,NaN,posting_year,car_age
0,15000,2013.0,128000.0,excellent,truck,2021-05-03T14:02:03-0500,1,0,0,0,...,0,0,0,0,1,0,0,0,2021,8.0
9,19900,2004.0,88000.0,good,pickup,2021-04-29T17:19:18-0500,0,0,1,0,...,0,0,1,0,0,0,0,0,2021,17.0
10,14000,2012.0,95000.0,excellent,mini-van,2021-04-27T12:20:01-0500,1,0,0,0,...,0,0,0,0,0,0,0,0,2021,9.0
12,22500,2001.0,144700.0,good,truck,2021-04-26T11:15:36-0500,0,0,1,0,...,0,0,0,0,1,0,0,0,2021,20.0
16,15000,2017.0,90000.0,excellent,sedan,2021-04-24T18:39:59-0500,1,0,0,0,...,0,0,0,1,0,0,0,0,2021,4.0
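A year-extraction pattern has to accept any digit in each position. A character class such as `[0-4]` happens to match 2021 but would return a missing value for a year like 2019. A minimal sketch, where the second date is hypothetical and only there to show the difference:

```python
import pandas as pd

posting_date = pd.Series([
    "2021-05-03T14:02:03-0500",  # format as in the dataset
    "2019-11-28T09:15:00-0600",  # hypothetical date, for illustration
])

# \d{4} matches any four leading digits.
posting_year = posting_date.str.extract(r"^(\d{4})", expand=False).astype(int)

# A class restricted to the digits 0-4 silently fails on "2019".
narrow = posting_date.str.extract(r"^([0-4][0-4][0-4][0-4])", expand=False)

print(posting_year.tolist())
print(narrow.isna().tolist())
```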


In [8]:
new_data['average_price_per_year'] = new_data['price'] / (2023 - new_data['year'])
new_data.head()

Unnamed: 0,price,year,odometer,condition,type,posting_date,excellent,fair,good,like new,...,other,pickup,sedan,truck,van,wagon,NaN,posting_year,car_age,average_price_per_year
0,15000,2013.0,128000.0,excellent,truck,2021-05-03T14:02:03-0500,1,0,0,0,...,0,0,0,1,0,0,0,2021,8.0,1500.0
9,19900,2004.0,88000.0,good,pickup,2021-04-29T17:19:18-0500,0,0,1,0,...,0,1,0,0,0,0,0,2021,17.0,1047.368421
10,14000,2012.0,95000.0,excellent,mini-van,2021-04-27T12:20:01-0500,1,0,0,0,...,0,0,0,0,0,0,0,2021,9.0,1272.727273
12,22500,2001.0,144700.0,good,truck,2021-04-26T11:15:36-0500,0,0,1,0,...,0,0,0,1,0,0,0,2021,20.0,1022.727273
16,15000,2017.0,90000.0,excellent,sedan,2021-04-24T18:39:59-0500,1,0,0,0,...,0,0,1,0,0,0,0,2021,4.0,2500.0
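As a worked check of the formula, the first two rows above can be recomputed by hand. The sketch also shows an alternative denominator, the age at posting time, which is an assumption of mine and not what the cell above uses:

```python
import pandas as pd

# First two rows of the output above.
toy = pd.DataFrame({
    "price": [15000, 19900],
    "year": [2013.0, 2004.0],
    "posting_year": [2021, 2021],
})

# As in the notebook: years between the model year and a fixed reference year, 2023.
toy["per_year_fixed"] = toy["price"] / (2023 - toy["year"])

# Hypothetical alternative: years between the model year and the posting year.
toy["per_year_at_posting"] = toy["price"] / (toy["posting_year"] - toy["year"])

print(toy)
```

With the fixed reference year, the first row gives 15000 / 10 = 1500, matching the table. Either way the feature is computed from `price`, which matters when it is used to predict a price-derived label.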



## Part 4: Multinomial classification


In [16]:
# Three price bands: (0, 10000], (10000, 20000], (20000, 30000]
cut_labels = ['1', '2', '3']
cut_bins = [0, 10000, 20000, 30000]
new_data['price_bin'] = pd.cut(new_data['price'], bins = cut_bins, labels = cut_labels)
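`pd.cut` uses right-closed intervals by default, so a price of exactly 10,000 falls in bin `'1'`, and any price outside the outermost edges becomes missing. A small sketch with made-up prices:

```python
import pandas as pd

# Made-up prices chosen to sit on and around the bin edges.
prices = pd.Series([1, 10000, 10001, 25550, 35000])

bins = pd.cut(prices, bins=[0, 10000, 20000, 30000], labels=["1", "2", "3"])

print(bins.tolist())
```

The last value, 35,000, lies above the top edge and comes back as `NaN`. In this notebook prices are already capped at 25,550, so that case should not occur for the cleaned data.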

In [17]:
# Drop every row with a missing value in any column (including missing condition or type)
new_data = new_data.dropna()

In [18]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

train_data, test_data = train_test_split(
    new_data,
    train_size=0.8,
    random_state=42
)

model = LogisticRegression(multi_class='multinomial').fit(
    X=train_data.loc[:, [
        'odometer', 'excellent', 'fair', 'good', 'like new', 'new', 'salvage', 'SUV', 'bus', 'convertible', 'coupe', 'hatchback', 'mini-van', 'offroad',
        'other', 'pickup', 'sedan', 'truck', 'van', 'wagon', 'car_age', 'average_price_per_year', 'price']],
    y=train_data['price_bin']
)


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [19]:
model

In [26]:
train_data.loc[:, [
        'odometer', 'excellent', 'fair', 'good', 'like new', 'new', 'salvage', 'SUV', 'bus', 'convertible', 'coupe', 'hatchback', 'mini-van', 'offroad',
        'other', 'pickup', 'sedan', 'truck', 'van', 'wagon', 'car_age', 'average_price_per_year', 'price']].corr()

Unnamed: 0,odometer,excellent,fair,good,like new,new,salvage,SUV,bus,convertible,...,offroad,other,pickup,sedan,truck,van,wagon,car_age,average_price_per_year,price
odometer,1.0,-0.058709,0.113039,0.120886,-0.147351,-0.03336,0.006611,0.073817,-0.012224,-0.078928,...,0.001417,-0.04245,0.086057,-0.092447,0.102556,0.001164,-0.001011,0.272153,-0.471545,-0.38982
excellent,-0.058709,1.0,-0.185659,-0.738289,-0.361204,-0.061064,-0.053644,0.087414,-0.005345,0.002344,...,-0.009214,-0.050946,-0.01532,-0.011338,-0.037324,-0.046909,0.007189,-0.135405,0.045946,0.093498
fair,0.113039,-0.185659,1.0,-0.127572,-0.062414,-0.010551,-0.009269,-0.027362,0.002488,0.005552,...,0.01584,0.005058,0.042863,-0.026943,0.032582,0.007895,-0.010313,0.212717,-0.140169,-0.190137
good,0.120886,-0.738289,-0.127572,1.0,-0.248193,-0.041959,-0.03686,-0.083653,0.000123,-0.012393,...,0.005419,0.060763,0.02494,-0.00482,0.05118,0.054821,0.006822,0.13194,-0.074461,-0.07349
like new,-0.147351,-0.361204,-0.062414,-0.248193,1.0,-0.020528,-0.018034,0.003891,-0.008961,0.012185,...,-0.003914,-0.014219,-0.036325,0.03991,-0.037992,-0.010991,-0.013847,-0.103309,0.120039,0.076488
new,-0.03336,-0.061064,-0.010551,-0.041959,-0.020528,1.0,-0.003049,-0.003559,0.086652,0.000289,...,-0.00371,0.00376,-0.010679,0.005986,0.002654,-0.008379,-0.009418,-0.020783,0.013558,-0.00429
salvage,0.006611,-0.053644,-0.009269,-0.03686,-0.018034,-0.003049,1.0,-0.007789,-0.002363,-0.00158,...,0.01411,0.001237,0.006704,-0.002596,0.008925,0.001392,-0.003304,0.033306,-0.0299,-0.048643
SUV,0.073817,0.087414,-0.027362,-0.083653,0.003891,-0.003559,-0.007789,1.0,-0.027527,-0.109005,...,-0.037965,-0.069386,-0.150899,-0.427093,-0.196967,-0.10081,-0.110223,-0.096492,0.073575,0.057762
bus,-0.012224,-0.005345,0.002488,0.000123,-0.008961,0.086652,-0.002363,-0.027527,1.0,-0.008256,...,-0.002876,-0.005256,-0.01143,-0.03235,-0.014919,-0.007636,-0.008349,0.01616,-0.006433,0.012599
convertible,-0.078928,0.002344,0.005552,-0.012393,0.012185,0.000289,-0.00158,-0.109005,-0.008256,1.0,...,-0.011387,-0.020812,-0.045261,-0.128103,-0.059079,-0.030237,-0.033061,0.194324,-0.06596,0.017745


In [25]:
model.score(
    X = train_data.loc[:, [
        'odometer', 'excellent', 'fair', 'good', 'like new', 'new', 'salvage', 'SUV', 'bus', 'convertible', 'coupe', 'hatchback', 'mini-van', 'offroad',
        'other', 'pickup', 'sedan', 'truck', 'van', 'wagon', 'car_age', 'average_price_per_year', 'price']],
    y = train_data['price_bin']
)

0.8383686437699456
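This score is computed on the training rows, so it does not show how the model does on unseen listings. Also, `price_bin` is derived from `price`, so predictors such as `price` and `average_price_per_year` let the model read the label from its inputs. A hedged sketch of a held-out evaluation without price-derived predictors, on synthetic data (the frame, the price formula, and the majority-class baseline are my assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n = 1000

# Synthetic predictors that do not depend on price (illustrative only).
X = pd.DataFrame({
    "odometer": rng.uniform(1, 200000, n),
    "car_age": rng.integers(1, 40, n).astype(float),
})
price = (26000 - 0.06 * X["odometer"] - 300 * X["car_age"] + rng.normal(0, 1500, n)).clip(1, 25550)
y = pd.cut(price, bins=[0, 10000, 20000, 30000], labels=["1", "2", "3"]).astype(str)

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_train, y_train)
# Baseline that always predicts the most common training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
baseline_acc = baseline.score(X_test, y_test)

print(f"train accuracy:    {train_acc:.3f}")
print(f"test accuracy:     {test_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

In the notebook itself, the equivalent step would be calling `model.score` on `test_data`, which the split above already sets aside.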