# IS 675 Lab 2: Data and model understanding

---

This data set contains information on cars purchased at auction.
<br>
We will use this file to predict the quality of buying decisions and visualize decision processes.
<br>
<br>
VARIABLE DESCRIPTIONS:<br>
Auction: Auction provider at which the vehicle was purchased<br>
Color: Vehicle Color<br>
IsBadBuy: Identifies if the kicked vehicle was an avoidable purchase<br>
MMRCurrentAuctionAveragePrice: Acquisition price for this vehicle in average condition as of current day<br>
Size: The size category of the vehicle (Compact, SUV, etc.)<br>
TopThreeAmericanName: Identifies if the manufacturer is one of the top three American manufacturers<br>
VehBCost: Acquisition cost paid for the vehicle at time of purchase<br>
VehicleAge: Years elapsed since the vehicle's year of manufacture<br>
VehOdo: The vehicle's odometer reading<br>
WarrantyCost: Warranty price (term = 36 months and mileage = 36K)<br>
WheelType: The vehicle wheel type description (Alloy, Covers)<br>
<br>
Target variable: **IsBadBuy**

### 1. Upload, understand, and clean data

In [None]:
# Mounting Google Drive
from google.colab import drive
drive.mount('/content/drive')

In [209]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Read data
car_kick = pd.read_csv("/content/drive/MyDrive/IS675_data/car_kick.csv")
car_kick

In [None]:
car_kick.keys()

In [213]:
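# Select the desired columns only; .copy() makes carAuction an independent DataFrame,
# so the column assignments below do not trigger a SettingWithCopyWarning
desired_columns = ['Auction', 'Color', 'IsBadBuy', 'MMRCurrentAuctionAveragePrice', 'Size', 'TopThreeAmericanName',
                   'VehBCost', 'VehicleAge', 'VehOdo', 'WarrantyCost', 'WheelType']
carAuction = car_kick[desired_columns].copy()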

In [None]:
# Show the head rows of a data frame
carAuction.head()

In [None]:
# Examine missing values
carAuction.isnull().sum()

In [None]:
# Examine variable type
carAuction.dtypes

In [None]:
# Replacing 1 with Yes and 0 with No in the target column IsBadBuy
carAuction['IsBadBuy'] = carAuction['IsBadBuy'].replace({1:'Yes', 0:'No'})

In [None]:
# Change categorical variables to "category"
carAuction['Auction'] = carAuction['Auction'].astype('category')
carAuction['Color'] = carAuction['Color'].astype('category')
carAuction['IsBadBuy'] = carAuction['IsBadBuy'].astype('category')
carAuction['Size'] = carAuction['Size'].astype('category')
carAuction['TopThreeAmericanName'] = carAuction['TopThreeAmericanName'].astype('category')
carAuction['WheelType'] = carAuction['WheelType'].astype('category')

In [None]:
# Examine variable type
carAuction.dtypes

In [None]:
# Display all numeric variables
carAuction.select_dtypes(include=['number'])

In [None]:
# Display all categorical variables
carAuction.select_dtypes(include=['category'])

### 2. Understanding a single variable: numeric variables

In [None]:
# Show the statistics of VehOdo
carAuction['VehOdo'].describe()

In [None]:
# Obtain the variance, standard deviation, and range (min, max) of WarrantyCost
warranty = carAuction['WarrantyCost']
print("variance:", warranty.var())
print("standard deviation:", warranty.std())
print("range:", warranty.min(), "to", warranty.max())

In [None]:
# Display the IQR of WarrantyCost
IQR = carAuction['WarrantyCost'].quantile(0.75) - carAuction['WarrantyCost'].quantile(0.25)
print("IQR:", IQR)
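
The IQR is also what a boxplot uses to decide which points to draw as outliers. The sketch below applies the common 1.5 × IQR rule (the default whisker length in seaborn's `boxplot`) to a small series of made-up warranty costs. The values are hypothetical and are not taken from `car_kick.csv`.

```python
import pandas as pd

# Hypothetical warranty costs; the last value is deliberately extreme
costs = pd.Series([500, 700, 800, 900, 1000, 1100, 1200, 1300, 1500, 6000])

q1 = costs.quantile(0.25)
q3 = costs.quantile(0.75)
iqr = q3 - q1

# 1.5 * IQR rule: values beyond these fences are flagged as potential outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = costs[(costs < lower_fence) | (costs > upper_fence)]

print("IQR:", iqr)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers.tolist())
```

The same three steps (quartiles, fences, filter) could be applied to `carAuction['WarrantyCost']` to list the points that would appear beyond the whiskers.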

In [None]:
# Boxplot of a numeric variable: VehBCost
snsplot = sns.boxplot(x='VehBCost', data = carAuction)
snsplot.set_title("Boxplot of VehBCost in the carAuction data set")

In [None]:
# Boxplot of a numeric variable: VehicleAge
snsplot = sns.boxplot(x='VehicleAge', data = carAuction)
snsplot.set_title("Boxplot of VehicleAge in the carAuction data set")

In [None]:
# Histogram of a numeric variable: VehOdo
snsplot = sns.histplot(x='VehOdo', data = carAuction)
snsplot.set_title("Histogram of VehOdo in the carAuction data set")

### 3. Understanding a single variable: categorical variables

In [None]:
# Display the number of cars for each WheelType
carAuction['WheelType'].value_counts()

In [None]:
# Display the proportion of cars for each WheelType
carAuction['WheelType'].value_counts(normalize=True)

In [None]:
# Plot a categorical variable: WheelType
snsplot = sns.countplot(x='WheelType', data=carAuction)
snsplot.set_title("Countplot of WheelType in the carAuction data set")

### 4. Understand relationships of multiple variables

In [None]:
# scatter plot two numeric variables: VehBCost and MMRCurrentAuctionAveragePrice
snsplot = sns.scatterplot(x='VehBCost', y= 'MMRCurrentAuctionAveragePrice', data=carAuction)
snsplot.set_title("Scatterplot of VehBCost and MMRCurrentAuctionAveragePrice")

In [None]:
# Generate correlation coefficients of two numeric variables in a 2x2 matrix: VehBCost and MMRCurrentAuctionAveragePrice
carAuction[['VehBCost','MMRCurrentAuctionAveragePrice']].corr()
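
To make the numbers in the 2x2 matrix less of a black box, the sketch below computes the Pearson correlation coefficient by hand and compares it with the value pandas reports. The five price pairs are hypothetical and only reuse the column names from this lab.

```python
import pandas as pd

# Hypothetical prices for five vehicles (not taken from car_kick.csv)
toy = pd.DataFrame({
    "VehBCost": [5000.0, 6200.0, 7100.0, 8300.0, 9000.0],
    "MMRCurrentAuctionAveragePrice": [5200.0, 6000.0, 7500.0, 8100.0, 9400.0],
})

x = toy["VehBCost"]
y = toy["MMRCurrentAuctionAveragePrice"]

# Pearson r = sample covariance of x and y divided by the product of their
# sample standard deviations (both use n - 1 in the denominator)
covariance = ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)
r_manual = covariance / (x.std() * y.std())

# pandas reports the same quantity in the off-diagonal cells of .corr()
r_pandas = toy.corr().loc["VehBCost", "MMRCurrentAuctionAveragePrice"]

print(round(r_manual, 4), round(r_pandas, 4))
```

The diagonal of the matrix is always 1 because every variable is perfectly correlated with itself; only the off-diagonal cells carry information.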

In [None]:
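# Generate the correlation matrix of all numeric variables
# (numeric_only=True skips the category columns, which pandas 2.x would otherwise reject)
carAuction.corr(numeric_only=True)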

In [None]:
# Examine relationships between numeric and categorical variables: boxplot VehBCost based on IsBadBuy
snsplot = sns.boxplot(x='VehBCost', y= 'IsBadBuy', data = carAuction)
snsplot.set_title("Boxplot of VehBCost based on IsBadBuy")

### 5. Partition the data set for Decision Tree model

In [None]:
# Create dummy variables (0.5 pts)
carAuction = pd.get_dummies(carAuction, columns=['Auction','Color','Size','TopThreeAmericanName','WheelType'], drop_first=True)
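
As a small illustration of what `drop_first=True` does, the sketch below encodes a toy frame with a three-level `WheelType` column. The rows are made up, and the level `Special` is an assumption added for the example. With `drop_first=True`, the first level in sorted order (`Alloy`) becomes the implicit baseline and gets no column of its own.

```python
import pandas as pd

# Hypothetical rows; "Special" is an assumed third wheel type for illustration
toy = pd.DataFrame({
    "WheelType": ["Alloy", "Covers", "Special", "Alloy"],
    "VehOdo": [60000, 72000, 81000, 55000],
})

# One dummy column per level except the first one ("Alloy")
encoded = pd.get_dummies(toy, columns=["WheelType"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

A row with zeros (or `False`) in both dummy columns therefore represents an `Alloy` vehicle.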
carAuction

In [None]:
# Examine the proportion of target variable for data set
target = carAuction['IsBadBuy']
print(target.value_counts(normalize=True))

In [None]:
# Partition the data (0.5 pts)
predictors = carAuction.drop(['IsBadBuy'],axis=1)
predictors_train, predictors_test, target_train, target_test = train_test_split(predictors, target, test_size=0.3, random_state=0)
print(predictors_train.shape, predictors_test.shape, target_train.shape, target_test.shape)

In [238]:
# Taking steps to balance the train data
# Combine predictors_train and target_train into a single DataFrame
combined_train_df = pd.concat([predictors_train, target_train], axis=1)

# Separate majority and minority classes
majority_df = combined_train_df[combined_train_df['IsBadBuy'] == 'No']
minority_df = combined_train_df[combined_train_df['IsBadBuy'] == 'Yes']

# Undersample the majority class randomly
undersampled_majority = majority_df.sample(n=len(minority_df), random_state=62)

# Combine the undersampled majority class and the minority class
undersampled_data = pd.concat([undersampled_majority, minority_df])

# Shuffle the combined DataFrame to ensure randomness
balanced_data = undersampled_data.sample(frac=1, random_state=62)

# Split the balanced_data into predictors_train and target_train
predictors_train = balanced_data.drop(columns=['IsBadBuy'])
target_train = balanced_data['IsBadBuy']
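
The balancing steps above can also be wrapped in a small reusable function. The sketch below is one possible way to do this, not part of the original lab: `undersample_majority` is a hypothetical helper name, and the toy frame is made up. As in the cell above, only the training data should be passed in, so that the testing set keeps its original class proportions.

```python
import pandas as pd

def undersample_majority(df, target_col, random_state=0):
    """Downsample every class to the size of the smallest class, then shuffle."""
    groups = [df[df[target_col] == label] for label in df[target_col].unique()]
    smallest = min(len(group) for group in groups)
    parts = [group.sample(n=smallest, random_state=random_state) for group in groups]
    # sample(frac=1) shuffles the rows so the classes are not stored in blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)

# Hypothetical imbalanced training data: 8 "No" rows and 2 "Yes" rows
toy_train = pd.DataFrame({
    "VehOdo": range(10),
    "IsBadBuy": ["No"] * 8 + ["Yes"] * 2,
})

balanced = undersample_majority(toy_train, "IsBadBuy")
print(balanced["IsBadBuy"].value_counts())
```

Undersampling discards majority-class rows, so it trades training set size for balance.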

In [None]:
# Examine the proportion of target variable for train set
print(target_train.value_counts(normalize=True), target_train.shape)

In [None]:
# Examine the proportion of target variable for testing data set (0.5 pts)
print(target_test.value_counts(normalize=True))

### 6. Decision Tree model prediction

In [None]:
# Build a decision tree model on training data with max_depth = 3 (0.5 pts)
model = DecisionTreeClassifier(criterion = "entropy", random_state = 1, max_depth = 3)
model.fit(predictors_train, target_train)
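
With `criterion = "entropy"`, the tree scores each candidate split by how much it reduces entropy (the information gain) and keeps the best one. The sketch below works through that calculation for one made-up split. The label lists are hypothetical, and `entropy` and `information_gain` are illustrative helpers, not scikit-learn functions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted_child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_child_entropy

# Hypothetical node with a balanced target, split into two purer children
parent = ["Yes"] * 5 + ["No"] * 5
left = ["Yes"] * 4 + ["No"] * 1
right = ["Yes"] * 1 + ["No"] * 4

print("parent entropy:", entropy(parent))
print("information gain:", round(information_gain(parent, left, right), 4))
```

A balanced two-class node has the maximum entropy of 1 bit, and a split that leaves both children as mixed as the parent has zero gain.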

In [None]:
# Plot the tree (0.5 pts)
fig = plt.figure(figsize=(30,20))
tree.plot_tree(model,
               feature_names=list(predictors_train.columns),
               class_names=['No','Yes'],
               filled=True)
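
Node counts can also be read from a fitted tree instead of being counted off the plot. The sketch below fits a tree on synthetic data from `make_classification` (an assumption for illustration, not the car auction data) and separates decision nodes from leaf nodes.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data, used only so the example runs on its own
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

toy_model = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=1)
toy_model.fit(X, y)

n_nodes = toy_model.tree_.node_count   # all nodes, decision and leaf
n_leaves = toy_model.get_n_leaves()    # terminal nodes only
n_decision = n_nodes - n_leaves        # nodes that perform a split

print("decision nodes:", n_decision)
print("leaf nodes:", n_leaves)
print("depth:", toy_model.get_depth())
```

Because every split produces exactly two children, a binary tree always has one more leaf than it has decision nodes.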

In [None]:
# Text version of decision tree
print(tree.export_text(model, feature_names=list(predictors_train.columns)))

Q1. How many decision nodes and how many leaf nodes are in the tree?  (0.5 pts)<br>


Q2. Compared to a decision tree with 7 decision nodes and 8 leaf nodes, is it more or less complex? Give reasons for your answer. (1 pt)<br>


Q3. Which predictor is used for the first split in the tree? How does the decision tree select the first predictor to split on? (1 pt)<br>


Q4. Find one path in the tree to a leaf node that is classified as IsBadBuy = 'Yes'. What is the misclassification error rate of this path/rule? (1 pt)<br>

In [246]:
# Make predictions on testing data
prediction_on_test = model.predict(predictors_test)

In [None]:
# Examine the evaluation results on testing data: confusion matrix

# Compute the confusion matrix with an explicit label order (rows = true, columns = predicted)
cm = confusion_matrix(target_test, prediction_on_test, labels=model.classes_)

# Plot the confusion matrix using Seaborn, labeling both axes with the class names
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=True,
            xticklabels=model.classes_, yticklabels=model.classes_)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Q5. On the testing set, how many bad-buy cars are predicted as not bad buys? (0.5 pts)<br>


In [None]:
# Examine the evaluation results on testing data: accuracy, precision, recall, and f1-score (0.5 pts)
print(classification_report(target_test, prediction_on_test))
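
The precision and recall values in the report come directly from the confusion matrix cells. The sketch below recomputes them by hand for the 'Yes' class on ten made-up predictions and checks the result against scikit-learn. The label lists are hypothetical and unrelated to the model above.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true and predicted labels for ten vehicles
y_true = ["No", "No", "No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"]
y_pred = ["No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]

# With labels=["No", "Yes"], rows are true classes and columns are predictions,
# so ravel() returns the cells in the order tn, fp, fn, tp for the "Yes" class
cm = confusion_matrix(y_true, y_pred, labels=["No", "Yes"])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

precision = tp / (tp + fp)   # of the cars predicted "Yes", the share that are truly "Yes"
recall = tp / (tp + fn)      # of the truly "Yes" cars, the share the model found

print("precision:", precision, "recall:", recall)
print("sklearn:", precision_score(y_true, y_pred, pos_label="Yes"),
      recall_score(y_true, y_pred, pos_label="Yes"))
```

Repeating the calculation with 'No' as the positive class gives the other row of the report.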

Q6. Does the decision tree model perform better on the majority class (IsBadBuy = 'No') or the minority class (IsBadBuy = 'Yes')? Why? (1 pt)<br>

Q7. How do you evaluate the model? Is it good or bad? Why? Can we improve it? How? (2 pts)


In [None]:
!jupyter nbconvert --to html "/content/drive/MyDrive/IS470_lab/IS675_lab02.ipynb"