In [36]:
import pandas as pd
import numpy as np

car_data = pd.read_csv("auto-mpg.tsv", sep="\t")
car_data = car_data.dropna()
car_data.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,-1.0,8.0,304.0,193.0,4732.0,18.5,70.0,1.0,hi 1200d
1,-1.0,8.0,307.0,200.0,4376.0,15.0,70.0,1.0,chevy c20
2,-1.0,8.0,360.0,215.0,4615.0,14.0,70.0,1.0,ford f250
3,-1.0,8.0,318.0,210.0,4382.0,13.5,70.0,1.0,dodge d200
4,-1.0,8.0,350.0,180.0,3664.0,11.0,73.0,1.0,oldsmobile omega


> 1. For each feature from the following:  [cylinders, displacement, horsepower, weight, acceleration, model_year, origin] indicate how you can represent it so as to make classification easier and get good generalization on unseen data, by choosing one of: 'drop' - leave the feature out, 'raw' - use values as they are, 'standard' - standardize values by subtracting out the average value and dividing by standard deviation, 'one-hot' - use a one-hot encoding.  There could be multiple answers that make sense for each feature; please mention the tradeoffs between each answer. Write down your choices.


1. Cylinders
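- **One-hot**: ✅ cylinder counts are discrete, so one-hot encoding lets the model capture non-linear relationships between cylinder count and fuel efficiency.
- **Raw**: 🤔 reasonable if the relationship between cylinders and efficiency is close to linear; it also preserves the natural ordering.
- **Standard**: ❌ the feature takes only a handful of discrete values, so standardizing adds little.
- **Drop**: ❌ cylinder count is related to fuel efficiency.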
2. Displacement
- **Standard**: ✅ it's a continuous variable; standardization aids in linear relationship cases.
- **Raw**: 🤔 possible, but standardization often enhances model convergence and performance.
- **Drop**: ❌ related to fuel efficiency
- **One-hot**: ❌ doesn't apply
3. Horsepower
- **Standard**: ✅ standardizing horsepower helps if it's linearly related to efficiency.
- **Raw**: 🤔 standardization usually yields superior results.
- **Drop**: ❌ related to fuel efficiency
- **One-hot**: ❌ doesn't apply
4. Weight
- **Standard**: ✅ weight is continuous and probably linearly related to efficiency.
- **Raw**: 🤔 possible, but standardization often enhances model convergence and performance.
- **Drop**: ❌ related to fuel efficiency
- **One-hot**: ❌ doesn't apply
5. Acceleration
- **Standard**: ✅ continuous, like weight and horsepower; standardization benefits scale-sensitive models.
- **Raw**: 🤔 possible, but standardization often enhances model convergence and performance.
- **Drop**: ❌ related to fuel efficiency
- **One-hot**: ❌ doesn't apply
6. Model Year
- **One-hot**:  ✅ as model years are discrete, allowing model to find non-linear connections between model year and fuel efficiency.
- **Raw**:  🤔 possible, but one-hot encoding is likely superior.
- **Drop**: ❌ new car tech makes it relevant.
- **Standard**: ❌ doesn't apply
7. Origin
- **One-hot**: 🤔 origin is a nominal code with no inherent order, so one-hot encoding avoids implying a ranking between origins.
- **Raw**: 🤔 possible, but one-hot encoding is likely superior.
- **Standard**: 🤔 possible, but one-hot encoding is likely superior.
- **Drop**: 🤔 only if it has very low correlation with efficiency.


Key:
- ✅ Yes
- 🤔 Maybe
- ❌ No
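
To make the three non-trivial options concrete, here is a minimal sketch of what 'raw', 'standard', and 'one-hot' produce for a single column. The cylinder values are made up for illustration, and the helper names `standardize` and `one_hot` are assumptions of this sketch, not part of the notebook.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std; pandas .std() defaults to the sample std
    return [(v - mu) / sigma for v in values]

def one_hot(values):
    """Return (categories, rows) with one 0/1 indicator per distinct value."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

# Toy cylinder counts, made up for illustration (not taken from the dataset).
cylinders = [8, 4, 6, 4]

raw = cylinders
standard = standardize(cylinders)
categories, encoded = one_hot(cylinders)

print(raw)         # values used as they are
print(standard)    # zero mean, unit variance
print(categories)  # [4, 6, 8]
print(encoded)     # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

One-hot encoding trades a single ordered column for one indicator column per distinct value, which is why it suits features with few categories.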


In [37]:
discrete_features = ['cylinders', 'origin', 'model_year']
continuous_features = ['displacement', 'horsepower', 'weight', 'acceleration']

# Preprocess numerical features (standardization)
for feature in continuous_features:
    mean_value = car_data[feature].mean()
    std_dev = car_data[feature].std()
    car_data[feature] = (car_data[feature] - mean_value) / std_dev

# Preprocess categorical features (one-hot encoding)
for feature in discrete_features:
    unique_vals = car_data[feature].unique()
    for val in unique_vals:
        feat_name = f"{feature}_{val}"
        car_data[feat_name] = (car_data[feature] == val).astype(int)
car_data["car_name"].unique()


array(['hi 1200d', 'chevy c20', 'ford f250', 'dodge d200',
       'oldsmobile omega', 'chevrolet impala', 'mercury marquis',
       'oldsmobile delta 88 royale', 'oldsmobile vista cruiser',
       'dodge monaco (sw)', 'ford country', 'mercury marquis brougham',
       'buick electra 225 custom', 'ford mustang ii', 'ford f108',
       'ford gran torino (sw)', 'chevrolet chevelle concours (sw)',
       'dodge d100', 'plymouth volare premier v8', 'chevrolet malibu',
       'chevy c10', 'buick century luxus (sw)', 'buick lesabre custom',
       'buick century 350', 'ford ltd', 'plymouth custom suburb',
       'amc ambassador brougham', 'chevrolet caprice classic',
       'ford country squire (sw)', 'pontiac safari (sw)',
       'chrysler newport royal', 'chrysler new yorker brougham',
       'ford gran torino', 'amc matador', 'amc matador (sw)',
       'plymouth satellite custom (sw)', 'plymouth fury iii',
       'plymouth fury gran sedan', 'dodge coronet custom (sw)',
       "plymouth 'cu

> 2. How can car name, a textual feature, be transformed into a feature which can be used by the logistic regression algorithm?

### Car Name Transformation
For logistic regression and decision tree algorithms to use this feature effectively, it must be converted into numerical values.

#### Logistic Regression
- **Label Encoding**: Assigns each unique car name an integer but assumes an ordinal relationship, which may not fit nominal data like car names.
- **Text Embeddings**: Utilizes complex text processing (e.g., word embeddings) if car names contain structured elements like brand and model. Could be excessive for simple categorical text.
- **One-hot Encoding**: Feasible if the count of unique car names is manageable. However, numerous unique names may cause high-dimensional feature sets (the curse of dimensionality).
- **Feature Hashing (Hashing Trick)**: Maps categories into a fixed-size numerical vector, which suits a large number of categories. More space-efficient than one-hot encoding, but distinct names can collide in the same slot.


In [38]:
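# Example of hashing. Python's built-in hash() is salted per process for strings
# (see PYTHONHASHSEED), so these values can differ between runs, and the result is
# a single integer column rather than a fixed-size hashed vector.
car_data['car_name_hashed'] = car_data['car_name'].apply(hash)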
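
The hashing trick described above maps each name into one of a fixed number of slots. A minimal sketch follows, assuming a reproducible digest from `hashlib` in place of the built-in `hash`. The helper names `stable_bucket` and `hash_encode` and the bucket count of 8 are illustrative choices, not part of the notebook.

```python
import hashlib

def stable_bucket(text, n_buckets):
    """Map a string to a bucket index in [0, n_buckets) reproducibly."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(text, n_buckets=8):
    """Return a fixed-length 0/1 vector with a 1 in the hashed bucket."""
    vector = [0] * n_buckets
    vector[stable_bucket(text, n_buckets)] = 1
    return vector

# n_buckets=8 is an arbitrary illustrative size; fewer buckets mean more collisions.
for name in ["chevy c20", "ford f250", "dodge d200"]:
    print(name, hash_encode(name))
```

The vector length stays fixed no matter how many distinct car names appear, at the cost of occasional collisions. scikit-learn's `FeatureHasher` offers a library implementation of the same idea.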

> 3.  How can car name, a textual feature, be transformed into a feature which can be used by the decision tree algorithm?

### Textual to Numerical Conversion
To proceed as described in the previous section, we'll convert textual data into numerical format.

#### Decision Tree Algorithm
- **Label Encoding**: Workable for decision trees and their ensembles (e.g., random forests), because a tree can isolate groups of labels through repeated threshold splits. The arbitrary order imposed by label encoding is therefore less of a concern than for linear models.
- **One-hot Encoding**: Applicable with a moderate count of unique car names, but many sparse columns might complicate the tree's structure.
- **Binary Encoding**: A middle ground between label and one-hot encoding that creates binary columns with fewer dimensions. Effective for moderately high-cardinality categorical variables.
- **Feature Hashing**: Useful for handling numerous categories while keeping dimensionality fixed.


In [39]:
# Label Encoding Example
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()
car_data['car_name_encoded'] = label_encoder.fit_transform(car_data['car_name'])
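
The binary encoding option listed above can be sketched in a few lines: each category gets an integer label, and the label's bits become the feature columns. The function name `binary_encode` and the alphabetical label assignment are assumptions of this sketch.

```python
def binary_encode(values):
    """Encode each category as the bits of its integer label.

    Returns (mapping, rows): mapping is category -> integer label, rows is a
    list of 0/1 lists, most significant bit first.
    """
    categories = sorted(set(values))
    mapping = {cat: idx for idx, cat in enumerate(categories)}
    n_bits = max(1, (len(categories) - 1).bit_length())
    rows = [
        [(mapping[v] >> bit) & 1 for bit in reversed(range(n_bits))]
        for v in values
    ]
    return mapping, rows

names = ["ford f250", "chevy c20", "dodge d200", "ford f250", "hi 1200d"]
mapping, rows = binary_encode(names)
print(mapping)  # {'chevy c20': 0, 'dodge d200': 1, 'ford f250': 2, 'hi 1200d': 3}
print(rows)     # [[1, 0], [0, 0], [0, 1], [1, 0], [1, 1]]
```

Four distinct names need only two columns here, where one-hot encoding would need four.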

> 4. For this dataset is car name informative, useful?



- **Association**: "Car name" typically encompasses make and model details, indirectly linked to fuel efficiency due to design and engine variations among models.
- **Dataset Diversity**: The feature's utility relies heavily on dataset specifics. A limited dataset regarding car names might restrict its predictive value.
- **Granularity Impact**: Specific features like "car name" can induce overfitting, especially in smaller datasets. Models might learn overly specific patterns, reducing generalization ability.
- **Feature Engineering Potential**: Extracting manufacturer information or other details from "car name" could enhance its utility when combined with other features, offering insights into consumption patterns related to manufacturers.

So, given these points and the low correlation in the data, car name isn't informative for our prediction, and we can drop it.
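
If the feature were kept instead, the manufacturer extraction mentioned under feature engineering could look roughly like this. It assumes the make is the first word of the name. The `ALIASES` table is hypothetical and would need checking against the actual data.

```python
# Hypothetical alias table: spellings that plausibly refer to the same maker.
# It is an assumption for illustration and would need checking against the data.
ALIASES = {"chevy": "chevrolet"}

def extract_make(car_name):
    """Return the first token of a car name as a rough manufacturer label."""
    tokens = car_name.strip().lower().split()
    if not tokens:
        return "unknown"
    return ALIASES.get(tokens[0], tokens[0])

names = ["chevy c20", "chevrolet impala", "ford f250", "oldsmobile omega"]
makes = [extract_make(n) for n in names]
print(makes)  # ['chevrolet', 'chevrolet', 'ford', 'oldsmobile']
```

A make column has far fewer distinct values than the full car name, so it is less prone to the overfitting described above.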


> 5. Based on choices you made in (1) above make a feature matrix that you can use as an input to sklearn.tree.DecisionTreeClassifier 

In [40]:

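# Note: only 'mpg' and the raw 'car_name' are dropped here. The matrix still contains
# car_name_hashed, car_name_encoded, and the raw discrete columns next to their one-hot copies.
X_data = car_data.drop(['mpg','car_name'], axis=1)[:]
y_data = car_data['mpg'][:]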

np.random.seed(17)
shuffle = np.random.permutation(X_data.index)

X_data = X_data.loc[shuffle].reset_index(drop=True)
y_data = y_data.loc[shuffle].reset_index(drop=True)

X_data


Unnamed: 0,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,cylinders_8.0,cylinders_6.0,cylinders_3.0,...,model_year_76.0,model_year_74.0,model_year_77.0,model_year_79.0,model_year_78.0,model_year_81.0,model_year_80.0,model_year_82.0,car_name_hashed,car_name_encoded
0,4.0,1.956315,0.271330,-0.957831,2.377309,76.0,1.0,0,0,0,...,1,0,0,0,0,0,0,0,6664437763177975671,65
1,8.0,-0.278900,-1.075659,0.817534,-1.464852,73.0,1.0,1,0,0,...,0,0,0,0,0,0,0,0,6247184583113605697,9
2,4.0,1.328074,0.480861,-1.194468,-0.014980,74.0,2.0,0,0,0,...,0,1,0,0,0,0,0,0,3684250617797406736,280
3,8.0,-0.278900,-1.075659,0.536160,-1.283618,70.0,1.0,1,0,0,...,0,0,0,0,0,0,0,0,6853782529698776536,13
4,8.0,-0.126800,-1.030759,0.842258,-1.464852,70.0,1.0,1,0,0,...,0,0,0,0,0,0,0,0,-8205199783614854769,35
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
387,4.0,-0.821171,1.139389,-0.262048,0.093761,77.0,1.0,0,0,0,...,0,0,1,0,0,0,0,0,-1083050989054830603,153
388,4.0,-0.913754,1.019657,-0.473962,0.238748,82.0,1.0,0,0,0,...,0,0,0,0,0,0,0,1,-4907723336078447246,244
389,6.0,-0.540116,1.169322,0.474941,1.144918,80.0,1.0,0,1,0,...,0,0,0,0,0,0,1,0,-6437112557822341566,97
390,4.0,-0.768266,1.229188,-0.420983,-0.413694,81.0,1.0,0,0,0,...,0,0,0,0,0,1,0,0,-4661506389800659704,96
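
The matrix shown above keeps the raw discrete columns and both car-name encodings. A matrix that follows the choices in (1) and the conclusion in (4) more strictly would leave those out. The sketch below shows one way to pick the columns. The helper `select_feature_columns` is hypothetical, and the column list is shortened for illustration.

```python
def select_feature_columns(columns, discrete, excluded):
    """Keep one-hot and continuous columns; drop raw discrete and excluded ones."""
    drop = set(discrete) | set(excluded)
    return [c for c in columns if c not in drop]

# Column names follow the pattern used in the notebook; the list is shortened here.
columns = [
    "cylinders", "displacement", "horsepower", "weight", "acceleration",
    "model_year", "origin", "cylinders_8.0", "model_year_70.0", "origin_1.0",
    "car_name_hashed", "car_name_encoded",
]
discrete = ["cylinders", "origin", "model_year"]
excluded = ["car_name_hashed", "car_name_encoded"]

kept = select_feature_columns(columns, discrete, excluded)
print(kept)
```

With the real frame, the result could be used as `X_data[kept]`. The accuracies reported in this notebook were computed on the unfiltered matrix, so they would likely change.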


>6. PROGRAM (not use sklearn or other libraries) 10 fold cross validation method to train and evaluate a decision tree classifier on this data. Note you can use sklearn.tree.DecisionTreeClassifier to build the tree for each fold. Report final results for all three criterion used in sklearn.tree.DecisionTreeClassifier. 

In [41]:
from sklearn.tree import DecisionTreeClassifier

num_folds = 10
# Integer division: any remainder rows (len(X_data) % num_folds) fall outside every fold.
fold_size = len(X_data) // num_folds

# Contiguous folds; the data was already shuffled in the previous cell.
X_folds = [X_data[i * fold_size: (i + 1) * fold_size] for i in range(num_folds)]
y_folds = [y_data[i * fold_size: (i + 1) * fold_size] for i in range(num_folds)]

results = []

for i in range(num_folds):
    # Train on the nine other folds, validate on fold i.
    X_train = np.concatenate([fold for j, fold in enumerate(X_folds) if j != i])
    y_train = np.concatenate([fold for j, fold in enumerate(y_folds) if j != i])
    X_val, y_val = X_folds[i], y_folds[i]

    # The default criterion is 'gini'.
    dct = DecisionTreeClassifier()
    dct.fit(X_train, y_train)
    y_pred = dct.predict(X_val)

    # Fraction of validation labels predicted correctly.
    accuracy = (np.sum(y_val == y_pred) / len(y_val))
    results.append(accuracy)

mean = np.mean(results)
print("Mean Accuracy", mean)


Mean Accuracy 0.8846153846153847
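
Because the fold size above comes from integer division, rows beyond `num_folds * fold_size` never appear in any fold. A minimal sketch of an index-based split that spreads the remainder over the first folds is given below. The name `kfold_indices` is an assumption, and 392 is an illustrative row count rather than a value read from the notebook.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) lists covering every sample exactly once."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)  # first `extra` folds get one more
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 392 is an illustrative row count, not read from the notebook.
folds = list(kfold_indices(392, 10))
print([len(val) for _, val in folds])  # [40, 40, 39, 39, 39, 39, 39, 39, 39, 39]
```

The index lists can be passed to `X_data.iloc[...]` and `y_data.iloc[...]` to build each fold's training and validation sets.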




In [42]:
# Same 10-fold procedure as above, now with a depth- and leaf-size-limited tree.
extra_params = {
    'max_depth': 10,
    'min_samples_split': 10,
    'min_samples_leaf': 8,
    'max_features': 'sqrt',  
    'criterion': 'gini' }

num_folds = 10
fold_size = len(X_data) // num_folds

X_folds = [X_data[i * fold_size: (i + 1) * fold_size] for i in range(num_folds)]
y_folds = [y_data[i * fold_size: (i + 1) * fold_size] for i in range(num_folds)]

results = []

for i in range(num_folds):
    X_train = np.concatenate([fold for j, fold in enumerate(X_folds) if j != i])
    y_train = np.concatenate([fold for j, fold in enumerate(y_folds) if j != i])
    X_val, y_val = X_folds[i], y_folds[i]

    dct = DecisionTreeClassifier(**extra_params)
    dct.fit(X_train, y_train)
    y_pred = dct.predict(X_val)
    
    accuracy = (np.sum(y_val == y_pred) / len(y_val))
    results.append(accuracy)

mean = np.mean(results)
print("Mean Accuracy", mean)


Mean Accuracy 0.9
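
The question asks for results under all three criteria, while the runs above use only 'gini'. The sketch below shows how the same manual cross-validation could loop over criteria. To stay self-contained it uses a stand-in `MajorityClassifier` and toy data, both assumptions of this sketch. With scikit-learn available, the factory could instead return `DecisionTreeClassifier(criterion=criterion)`, where 'log_loss' needs scikit-learn 1.1 or later.

```python
from collections import Counter

class MajorityClassifier:
    """Stand-in model (an assumption for this sketch): predicts the most common label."""
    def __init__(self, criterion="gini"):
        self.criterion = criterion  # accepted only to mirror the tree's keyword
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self
    def predict(self, X):
        return [self.label_ for _ in X]

def cross_val_accuracy(make_model, X, y, n_folds=10):
    """Mean validation accuracy over interleaved folds (sample i goes to fold i % n_folds)."""
    scores = []
    for fold in range(n_folds):
        val = [i for i in range(len(X)) if i % n_folds == fold]
        train = [i for i in range(len(X)) if i % n_folds != fold]
        model = make_model().fit([X[i] for i in train], [y[i] for i in train])
        preds = model.predict([X[i] for i in val])
        hits = sum(p == y[i] for p, i in zip(preds, val))
        scores.append(hits / len(val))
    return sum(scores) / len(scores)

# Toy data, made up for illustration: 30 samples, labels -1 and 1.
X = [[i] for i in range(30)]
y = [1 if i % 3 == 0 else -1 for i in range(30)]

for criterion in ("gini", "entropy", "log_loss"):
    score = cross_val_accuracy(lambda: MajorityClassifier(criterion=criterion), X, y)
    print(criterion, round(score, 3))
```

The stand-in ignores the criterion, so all three lines print the same score here. With a real tree the three numbers would generally differ, and they are not reported here because they were not computed in the notebook.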


