# Extra Practice for Machine Learning
For this practice, you will work with the `cars` dataset from `vega_datasets`. You may have seen it before; if not, it is described below.

In [5]:
import pandas as pd

cars = pd.read_csv('cars.csv')

The `cars` dataset lists many car models along with several statistics about each one, including horsepower, acceleration, the year of release, and the country of origin. Here's what it looks like:

In [6]:
cars.head()

Unnamed: 0,Name,Miles_per_Gallon,Cylinders,Displacement,Horsepower,Weight_in_lbs,Acceleration,Year,Origin
0,chevrolet chevelle malibu,18.0,8,307.0,130.0,3504,12.0,1970-01-01,USA
1,buick skylark 320,15.0,8,350.0,165.0,3693,11.5,1970-01-01,USA
2,plymouth satellite,18.0,8,318.0,150.0,3436,11.0,1970-01-01,USA
3,amc rebel sst,16.0,8,304.0,150.0,3433,12.0,1970-01-01,USA
4,ford torino,17.0,8,302.0,140.0,3449,10.5,1970-01-01,USA


## ML with Quantitative Data

Create and train a model that, given the `cars` dataset, will predict the Horsepower of a car. Think about the type of data you are trying to predict - what model (of the ones we have already seen) should you use to predict quantitative data? Make sure to split training and testing data, and check the mean squared error of your model.

In [11]:
from sklearn.tree import DecisionTreeRegressor  # which model should you import?
from sklearn.metrics import mean_squared_error  # How do you measure it?
from sklearn.model_selection import train_test_split

# Enter the rest of your solution here!
def predict_horsepower(cars: pd.DataFrame) -> float:
    # Step 1: Data cleaning
    # Remove rows where Horsepower is missing
    df = cars.dropna(subset=["Horsepower"])

    # Step 2: Features and labels
    # Exclude Horsepower from features
    X = df.drop(columns=["Horsepower"])
    y = df["Horsepower"]
    
    # One-hot encode categorical columns
    X = pd.get_dummies(X)
    
    # Step 3: Train/test split (70/30)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Step 4: Train the model
    model = DecisionTreeRegressor(random_state=42)
    model.fit(X_train, y_train)
    
    # Step 5: Predict on test set
    y_pred = model.predict(X_test)
    
    # Step 6: Calculate mean squared error
    mse = mean_squared_error(y_test, y_pred)
    
    return mse


Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see. 

In [10]:
predict_horsepower(cars)

238.46666666666667
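One way to approach the bonus on splits is to loop over several `test_size` values and record the test-set mean squared error for each. The sketch below is a minimal illustration, not the reference solution: the helper name `mse_by_test_size` is hypothetical, and the `demo` DataFrame is synthetic stand-in data with made-up values so the block runs on its own. With the real data you would pass `cars.dropna(subset=["Horsepower"])` instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def mse_by_test_size(df, target, test_sizes=(0.1, 0.2, 0.3, 0.5)):
    """Return {test_size: test-set MSE} for a decision tree fit on each split."""
    # One-hot encode categorical columns, as in predict_horsepower.
    X = pd.get_dummies(df.drop(columns=[target]), dtype=float)
    y = df[target]
    results = {}
    for size in test_sizes:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=size, random_state=42
        )
        model = DecisionTreeRegressor(random_state=42)
        model.fit(X_train, y_train)
        results[size] = mean_squared_error(y_test, model.predict(X_test))
    return results


# Synthetic stand-in for the cleaned cars data (all values are made up).
rng = np.random.default_rng(0)
weight = rng.uniform(1600, 5000, size=200)
demo = pd.DataFrame({
    "Weight_in_lbs": weight,
    "Cylinders": rng.choice([4, 6, 8], size=200),
    "Origin": rng.choice(["USA", "Europe", "Japan"], size=200),
    "Horsepower": 0.04 * weight + rng.normal(0, 10, size=200),
})

for size, mse in mse_by_test_size(demo, "Horsepower").items():
    print(f"test_size={size:.1f}  MSE={mse:.1f}")
```

Keeping `random_state` fixed makes each run repeatable, but the error for any single split still depends on which rows happen to land in the test set, so small differences between splits should be read with caution.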

More Bonus: Testing hyperparameters. What maximum depth gives the best performance (lowest mean squared error) on our testing set? If we want to choose it without making decisions based on our testing dataset, we need to split our data into three categories: train, test, and development. Then we can compare how each change affects the development dataset rather than the testing dataset, and save the testing dataset for a single final evaluation of our final model.
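The three-way split described above can be sketched by calling `train_test_split` twice: once to hold out the test set, and once more to carve a development set out of what remains. The helper name `tune_max_depth`, the 60/20/20 proportions, and the synthetic `X` and `y` arrays are all assumptions made for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def tune_max_depth(X, y, depths=range(1, 11)):
    """Pick max_depth on a development set, then score once on the test set."""
    # Hold out 20% for testing, then split the rest 75/25 into
    # train and development (60/20/20 overall).
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=42
    )

    # Compare candidate depths using only the development set.
    dev_mse = {}
    for depth in depths:
        model = DecisionTreeRegressor(max_depth=depth, random_state=42)
        model.fit(X_train, y_train)
        dev_mse[depth] = mean_squared_error(y_dev, model.predict(X_dev))

    # The test set is touched exactly once, with the chosen depth.
    best_depth = min(dev_mse, key=dev_mse.get)
    final = DecisionTreeRegressor(max_depth=best_depth, random_state=42)
    final.fit(X_train, y_train)
    test_mse = mean_squared_error(y_test, final.predict(X_test))
    return best_depth, dev_mse, test_mse


# Synthetic features and target (made-up values, not the cars data).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 5 * X[:, 0] + rng.normal(0, 2, size=300)

best_depth, dev_mse, test_mse = tune_max_depth(X, y)
print("best max_depth:", best_depth)
print("test MSE:", round(test_mse, 2))
```

Because the depth is chosen from development-set scores only, the final test MSE remains an honest estimate of how the chosen model performs on data it has never influenced.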

## ML with Categorical Data

Create, train, and test a model that will predict the country of origin for the `cars` dataset. Remember, this is categorical data, so you will need to use a different type of model (of the ones we have already seen) than you did for the `Horsepower` model.

In [9]:
from sklearn.tree import DecisionTreeClassifier  # which model should you import?
from sklearn.metrics import accuracy_score  # How do you measure it?
from sklearn.model_selection import train_test_split

# Enter the rest of your solution here!
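A possible shape for the classification solution, mirroring `predict_horsepower`, is sketched below. It is one reasonable attempt rather than the expected answer: the function name `predict_origin`, the choice to use only the numeric columns as features, and dropping incomplete rows are assumptions. The `demo` DataFrame is synthetic, with randomly assigned `Origin` labels, and exists only so the block runs without `cars.csv`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: use only the numeric columns shown in cars.head() as features.
FEATURES = ["Miles_per_Gallon", "Cylinders", "Displacement",
            "Horsepower", "Weight_in_lbs", "Acceleration"]


def predict_origin(cars: pd.DataFrame) -> float:
    """Train a decision tree to predict Origin; return test-set accuracy."""
    # Drop rows with a missing feature or label.
    df = cars.dropna(subset=FEATURES + ["Origin"])
    X = df[FEATURES]
    y = df["Origin"]

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42
    )
    model = DecisionTreeClassifier(random_state=42)
    model.fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


# Synthetic stand-in with the same column names (all values are made up).
rng = np.random.default_rng(0)
n = 150
demo = pd.DataFrame({
    "Miles_per_Gallon": rng.uniform(10, 45, n),
    "Cylinders": rng.choice([4, 6, 8], n),
    "Displacement": rng.uniform(70, 450, n),
    "Horsepower": rng.uniform(45, 230, n),
    "Weight_in_lbs": rng.uniform(1600, 5000, n),
    "Acceleration": rng.uniform(8, 25, n),
    "Origin": rng.choice(["USA", "Europe", "Japan"], n),
})

print(predict_origin(demo))
```

Since the labels in `demo` are random, the printed accuracy says nothing about the real dataset; call `predict_origin(cars)` to see how the model does on actual cars.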

Bonus: How does accuracy or mean squared error change with the split between training and testing data? Do at least three different splits with this data to see. 

More Bonus: Testing hyperparameters. What maximum depth (or other hyperparameter) gives the greatest accuracy on our testing set? If we want to choose it without making decisions based on our testing dataset, we need to split our data into three categories: train, test, and development. Then we can compare how each change affects the development dataset rather than the testing dataset, and save the testing dataset for a single final evaluation of our final model.
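For the classifier, the same train/development/test idea can be written so that any single hyperparameter is compared by development-set accuracy. In this sketch the helper `tune_tree_classifier` is hypothetical, `min_samples_leaf` is just one example of an "other hyperparameter", and the data is synthetic.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def tune_tree_classifier(X, y, param_name, values):
    """Choose one hyperparameter value by development-set accuracy."""
    # 60% train, 20% development, 20% test.
    X_rest, X_test, y_rest, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    X_train, X_dev, y_train, y_dev = train_test_split(
        X_rest, y_rest, test_size=0.25, random_state=42
    )

    dev_acc = {}
    for value in values:
        model = DecisionTreeClassifier(random_state=42, **{param_name: value})
        model.fit(X_train, y_train)
        dev_acc[value] = accuracy_score(y_dev, model.predict(X_dev))

    # Evaluate on the test set only once, with the selected value.
    best = max(dev_acc, key=dev_acc.get)
    final = DecisionTreeClassifier(random_state=42, **{param_name: best})
    final.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, final.predict(X_test))
    return best, dev_acc, test_acc


# Synthetic two-class problem (made-up data; labels are just example strings).
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.where(X[:, 0] + X[:, 1] > 0, "USA", "Japan")

best, dev_acc, test_acc = tune_tree_classifier(
    X, y, "min_samples_leaf", [1, 2, 5, 10, 20]
)
print("best min_samples_leaf:", best)
print("test accuracy:", round(test_acc, 3))
```

Passing `"max_depth"` with a list of depths as `values` answers the maximum-depth version of the question using the same function.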