# Exercise 4

## Section 1: Predicting Used Car Prices

We’ll be using the cars.csv data set for this section of the exercise. The data set covers the characteristics and prices for used cars sold in India. We are interested in predicting the price of a car given some characteristics. We will attempt to build a linear regression model of Price. We are going to work on filling in the missing data that we previously dropped. 

**Don't forget to add statsmodels to your packages**

In [None]:
import pandas as pd
import numpy as np
import streamlit as st
import altair as alt
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.imputation as smi


In [None]:
cars = pd.read_csv('cars.csv')
cars.head()

In [None]:
# Get just the numbers from various fields that have units.
# Note: str.rstrip() removes any trailing characters from the given set,
# not a literal suffix, so this only works while the number itself
# does not end in one of those characters.
cars["Mileage"] = cars["Mileage"].str.rstrip(" kmpl")
cars["Mileage"] = cars["Mileage"].str.rstrip(" km/g")
cars["Engine"] = cars["Engine"].str.rstrip(" CC")
cars["Power"] = cars["Power"].str.rstrip(" bhp")

# Replace the word "null" with NaN
cars["Power"] = cars["Power"].replace(regex="null", value=np.nan)

# Make sure numeric fields are numeric
cars["Mileage"] = cars["Mileage"].astype("float")
cars["Power"] = cars["Power"].astype("float")
cars["Engine"] = cars["Engine"].astype("float")

# Make categorical fields a category data type
cars["Fuel_Type"] = cars["Fuel_Type"].astype("category")
cars["Transmission"] = cars["Transmission"].astype("category")
cars["Owner_Type"] = cars["Owner_Type"].astype("category")

# Extract "Company" as the first part of the "Name"
cars["Company"] = cars["Name"].str.split(" ").str[0]

# Keep the second and third parts as the "Model" (joined without a space)
cars["Model"] = cars["Name"].str.split(" ").str[1] + cars["Name"].str.split(" ").str[2]

cars.head()
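
The unit-stripping step above relies on `str.rstrip`, which strips a set of trailing characters rather than a literal suffix. As a hedged alternative, the sketch below pulls out the leading number with a regular expression. The sample strings are made up in the style of the Mileage and Power columns and are not rows from cars.csv.

```python
import pandas as pd

# Hypothetical sample values (assumed formats, not taken from cars.csv)
raw = pd.Series(["19.67 kmpl", "26.6 km/kg", "null bhp", None])

# Capture the leading number; entries without one (e.g. "null bhp") become NaN
numbers = raw.str.extract(r"^\s*(\d+(?:\.\d+)?)", expand=False).astype("float")

print(numbers.tolist())
```

With this approach the separate "null" replacement would no longer be needed, since non-numeric entries never match the pattern.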

### 1. Transform Price so that it looks more normal, and produce histograms of the variable before and after the transformation

In [None]:
p1 = alt.Chart(cars).mark_bar().encode(
    alt.X("Price:Q", bin=True),
    alt.Y('count()', title="Count"),
)

p1

In [None]:
# When data looks like an exponentially decreasing curve, 
# we can take the log to see if it's more normal and 
# work with logValue instead of Value
cars["logPrice"] = np.log(cars["Price"])

p2 = alt.Chart(cars).mark_bar().encode(
    alt.X("logPrice:Q", bin=True),
    alt.Y('count()', title="Count"),
)

alt.hconcat(p1,p2)
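
Besides eyeballing the histograms, one way to check whether the log transform helped is to compare sample skewness before and after. This is a minimal sketch on synthetic log-normal data standing in for Price (an assumption; the real column is not used here).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices": log-normal by construction
price = pd.Series(rng.lognormal(mean=1.5, sigma=0.8, size=5000))
log_price = np.log(price)

# Skewness near 0 indicates a roughly symmetric distribution
print(f"skew before: {price.skew():.2f}, after: {log_price.skew():.2f}")
```

On the notebook's data the same comparison would be `cars["Price"].skew()` against `cars["logPrice"].skew()`.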

### 2. How many values are missing for Power and Engine?

In [None]:
# We can do a simple count of nulls
cars.isnull().sum()
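
Since the question asks only about Power and Engine, it can help to restrict the count to those columns and also report the share of rows affected. The frame below is a toy stand-in with invented values, not the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for cars; the values are made up
toy = pd.DataFrame({
    "Power": [58.16, np.nan, 88.7, np.nan],
    "Engine": [998.0, 1582.0, np.nan, 1248.0],
    "Year": [2010, 2015, 2011, 2012],
})

missing = toy[["Power", "Engine"]].isna().sum()        # count per column
share = toy[["Power", "Engine"]].isna().mean() * 100   # percent per column

print(missing.to_dict(), share.to_dict())
```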

### 3. Which column has the most missing values and what should we do about it?

In [None]:
# New_Price has the most missing values.
# So many of them are missing that we simply drop the column.
cars.drop(columns=['New_Price'], inplace=True)

### 4. Build a model of transformed price based on Power, Mileage, Kilometers Driven, and Year. How much variance is explained?

In [None]:
# We'll use statsmodels OLS to fit a model,
# but first, let's drop any records with missing data.
clean_cars = cars.dropna()

X = clean_cars[['Power','Kilometers_Driven', 'Year', 'Mileage']]
Y = clean_cars['logPrice']
X = sm.add_constant(X)

model = sm.OLS(Y,X)
results = model.fit()

results.summary()
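
The "variance explained" asked about here is the R-squared entry in the summary table, also available as `results.rsquared`. As a sketch of what that number means, the following computes R² = 1 - SS_res / SS_tot with plain NumPy. The data and coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=(n, 2))                    # two synthetic predictors
y = 1.0 + 0.5 * x[:, 0] - 0.3 * x[:, 1] + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x])           # add the intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares fit

ss_res = np.sum((y - X @ beta) ** 2)           # variation left unexplained
ss_tot = np.sum((y - y.mean()) ** 2)           # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(round(r_squared, 3))
```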

In [None]:
# We can also use "variance_inflation_factor" to check each variable for multicollinearity
# https://www.statsmodels.org/stable/generated/statsmodels.stats.outliers_influence.variance_inflation_factor.html
#
# A VIF above 5 suggests the variable is highly collinear with the others,
# so its parameter estimate will probably have a large standard error

vif = pd.DataFrame()
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['variable'] = X.columns
vif
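
For intuition, the VIF of a variable is 1 / (1 - R²), where R² comes from regressing that variable on the other predictors. Below is a minimal NumPy sketch on synthetic data. The helper `vif` is a hypothetical name, and unlike the cell above it adds the intercept itself rather than expecting a `const` column in `X`.

```python
import numpy as np

def vif(X: np.ndarray, i: int) -> float:
    """VIF of column i: 1 / (1 - R^2) from regressing it on the other columns plus an intercept."""
    target = X[:, i]
    others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
    beta, *_ = np.linalg.lstsq(others, target, rcond=None)
    resid = target - others @ beta
    r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
    return 1 / (1 - r2)

rng = np.random.default_rng(2)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a, so highly collinear
c = rng.normal(size=500)                  # unrelated to a and b
X_demo = np.column_stack([a, b, c])

print([round(vif(X_demo, i), 1) for i in range(X_demo.shape[1])])
```

The two near-duplicate columns should come out well above the threshold of 5, while the independent column stays close to 1.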

### 5. How many rows were used to train the model?

In [None]:
print(f'{len(clean_cars)} rows out of {len(cars)} total')

### 6. Fill the missing values in Power and Mileage with their respective means and rebuild the model. 

Now how much variance is explained?

In [None]:
mean_power = cars['Power'].mean()
mean_mileage = cars['Mileage'].mean()

cars['meanPower'] = cars['Power']
cars['meanMileage'] = cars['Mileage']
cars.fillna(value={'meanPower':mean_power, 'meanMileage':mean_mileage}, inplace=True)

cars.isnull().sum()

In [None]:
# We'll use statsmodels OLS to fit a new model
clean_cars = cars[['logPrice','meanPower','Kilometers_Driven', 'Year', 'meanMileage']].dropna()
X = clean_cars[['meanPower','Kilometers_Driven', 'Year', 'meanMileage']]
Y = clean_cars['logPrice']
X = sm.add_constant(X)

model = sm.OLS(Y,X)
results = model.fit()

results.summary()

Wow, the new model has a worse R². It looks like filling in the means was not an effective imputation technique here.
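
One possible reason: a row whose predictor was filled with the mean tells the model nothing about the slope, but it still contributes its full share of variation in the response. The sketch below illustrates this with a single predictor on synthetic data, with values removed completely at random (an assumption that may not hold for cars.csv).

```python
import numpy as np

def r2_simple(x, y):
    """R^2 of a one-predictor least-squares fit (the squared correlation)."""
    return np.corrcoef(x, y)[0, 1] ** 2

rng = np.random.default_rng(3)
n = 1000
x = rng.normal(size=n)
y = 2.0 * x + rng.normal(size=n)

missing = rng.random(n) < 0.3                  # hide about 30% of x at random
x_obs = np.where(missing, np.nan, x)

# Fit on complete cases only
r2_complete = r2_simple(x_obs[~missing], y[~missing])

# Fill the gaps with the observed mean, then fit on all rows
x_filled = np.where(missing, np.nanmean(x_obs), x_obs)
r2_mean_imputed = r2_simple(x_filled, y)

print(f"complete cases: {r2_complete:.3f}, mean-imputed: {r2_mean_imputed:.3f}")
```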

### 7. How many rows were used to train the model?

In [None]:
print(f'{len(clean_cars)} rows out of {len(cars)} total')

### 8. Impute the missing data using MICE and rebuild the model

https://www.statsmodels.org/dev/generated/statsmodels.imputation.mice.MICE.html

In [None]:
print(f'before MICE, {mean_power}, {mean_mileage}')

mice_df = smi.mice.MICEData(cars[['logPrice','Power','Mileage','Kilometers_Driven','Year']])
model = 'logPrice ~ Power + Mileage + Kilometers_Driven + Year'

m = smi.mice.MICE(model, sm.OLS, mice_df)
# 50 burn-in cycles, then fit the model on a single imputed data set
results = m.fit(n_burnin=50, n_imputations=1)

mice_power = mice_df.data['Power']
mice_mileage = mice_df.data['Mileage']

print(f'after MICE {mice_power.mean()}, {mice_mileage.mean()}') 
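
To see why a model-based imputation can do better than the mean, note that MICE fills each gap using a model of that variable conditional on the other variables. The following is a deliberately simplified, single-pass regression imputation in NumPy on synthetic data. It is not the statsmodels algorithm, which cycles repeatedly and adds randomness to the imputed values.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
z = rng.normal(size=n)                         # fully observed variable
x = 0.8 * z + rng.normal(scale=0.6, size=n)    # variable that will have gaps
missing = rng.random(n) < 0.3
x_obs = np.where(missing, np.nan, x)

# Fit x ~ z on the observed rows, then predict x where it is missing
Z = np.column_stack([np.ones(n), z])
beta, *_ = np.linalg.lstsq(Z[~missing], x_obs[~missing], rcond=None)
x_reg = np.where(missing, Z @ beta, x_obs)

# Mean imputation for comparison
x_mean = np.where(missing, np.nanmean(x_obs), x_obs)

# Compare how closely each method recovers the hidden true values
err_reg = np.sqrt(np.mean((x_reg[missing] - x[missing]) ** 2))
err_mean = np.sqrt(np.mean((x_mean[missing] - x[missing]) ** 2))

print(f"RMSE regression: {err_reg:.2f}, RMSE mean: {err_mean:.2f}")
```

Because the filled-in values follow the relationship with `z`, they land closer to the truth than a constant mean does whenever the variables are correlated.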

In [None]:
cars['micePower'] = mice_df.data['Power']
cars['miceMileage'] = mice_df.data['Mileage']

clean_cars = cars[['logPrice','micePower','Kilometers_Driven', 'Year', 'miceMileage']].dropna()
X = clean_cars[['micePower','Kilometers_Driven', 'Year', 'miceMileage']]
Y = clean_cars['logPrice']
X = sm.add_constant(X)

model = sm.OLS(Y,X)
results = model.fit()
print(results.summary())

### How have parameter estimates changed?

```
miceMileage (-0.0084) has a substantially smaller magnitude than Mileage originally did (-0.0125).
```

In [None]:
c1 = alt.Chart(cars).mark_bar().encode(
    alt.X('micePower', bin=True),
    alt.Y('count()', title='Count')
)

c2 = alt.Chart(cars).mark_bar().encode(
    alt.X('Power', bin=True),
    alt.Y('count()', title='Count')
)

alt.hconcat(c1,c2)

In [None]:
c1 = alt.Chart(cars).mark_bar().encode(
    alt.X('miceMileage', bin=True),
    alt.Y('count()', title='Count')
)

c2 = alt.Chart(cars).mark_bar().encode(
    alt.X('Mileage', bin=True),
    alt.Y('count()', title='Count')
)

alt.hconcat(c1,c2)