# HW17

### Author: Joseph Wong

## Import the Packages

In [2]:
# Basic package imports
import os
import numpy as np
import pandas as pd

# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

# Scikit-learn: Core utilities for model building and evaluation
from sklearn.model_selection import train_test_split    # Train/test data splitting
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler  # Feature transformations and scaling
from sklearn.metrics import (                            # Model evaluation metrics
    mean_squared_error, r2_score, accuracy_score, 
    precision_score, recall_score, confusion_matrix, 
    classification_report
)

# Scikit-learn: Linear and polynomial models
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor       # For KNN

# Scikit-learn: Synthetic dataset generators
from sklearn.datasets import make_classification, make_regression

# Scikit-learn: Naive Bayes models
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Scikit-learn: Decision Trees
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

# Text Processing Packages and Code
from sklearn.feature_extraction.text import TfidfVectorizer
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer

# Scikit-Learn: Datasets
from sklearn.datasets import fetch_california_housing, load_iris

## Part 1: Iris Data

### Import the Data

In [3]:
iris = load_iris()
X_cls, y_cls = iris.data, iris.target
feature_names_cls = iris.feature_names
class_names_cls = iris.target_names

df_iris = pd.DataFrame(X_cls,columns=feature_names_cls)
df_iris['Class Num'] = y_cls
print(class_names_cls)
df_iris

['setosa' 'versicolor' 'virginica']


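| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Class Num |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |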


1. Using the Iris data (the same data we saw at the end of lecture):
   
- Fit a DecisionTreeClassifier to the training data.
- Evaluate accuracy on the testing data.
- Explore the effect of tree depth.
- Visualize the tree (limit depth if needed).
- Investigate feature importance:
    - Which features are most important for classification?
    - Does pruning or limiting depth change feature importance?
      
Hint:
*cls_tree.feature_importances_ gives the importance of each feature after fitting.*
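
As a starting point for the tasks above, the following is a minimal sketch of fitting a classification tree, scoring it on held-out data, and reading off `feature_importances_`. The 70/30 stratified split, `random_state=42`, and the helper name `fit_and_report` are assumptions made for illustration, not requirements of the assignment.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Assumed split: 70% train / 30% test, stratified so each class is represented.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)


def fit_and_report(max_depth=None):
    """Hypothetical helper: fit a tree and return it with test accuracy and importances."""
    cls_tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    cls_tree.fit(X_train, y_train)
    test_acc = accuracy_score(y_test, cls_tree.predict(X_test))
    importances = dict(zip(iris.feature_names, cls_tree.feature_importances_))
    return cls_tree, test_acc, importances


# Compare an unrestricted tree with a depth-limited one.
reports = {}
for depth in (None, 2):
    _, acc, importances = fit_and_report(depth)
    reports[depth] = (acc, importances)
    print(f"max_depth={depth}: test accuracy = {acc:.3f}")
    for name, value in sorted(importances.items(), key=lambda kv: -kv[1]):
        print(f"    {name:<20s} {value:.3f}")
```

Comparing the two printed importance tables is one way to see whether limiting depth shifts which features the tree relies on.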

Some things to consider:
- How does pruning or limiting tree depth affect train/test error?
- How does feature importance change with pruning or depth constraints?
- What are the trade-offs between overfitting and underfitting?
- Visualization of tree structure for insight.
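
One possible way to explore the questions above is to sweep `max_depth`, record train and test accuracy at each setting, and print a shallow tree as text. This is a sketch under assumptions: the depth range 1 to 7, the split settings, and the use of `export_text` (a text alternative to `plot_tree`) are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Sweep the depth limit and record (max_depth, actual depth, train acc, test acc).
results = []
for max_depth in range(1, 8):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=42)
    tree.fit(X_train, y_train)
    results.append(
        (
            max_depth,
            tree.get_depth(),
            tree.score(X_train, y_train),
            tree.score(X_test, y_test),
        )
    )

print("max_depth  depth  train_acc  test_acc")
for max_depth, depth, train_acc, test_acc in results:
    print(f"{max_depth:>9d}  {depth:>5d}  {train_acc:>9.3f}  {test_acc:>8.3f}")

# Text rendering of a depth-2 tree; plot_tree(shallow_tree) would draw the same structure.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_train, y_train)
tree_text = export_text(shallow_tree, feature_names=list(iris.feature_names))
print(tree_text)
```

A widening gap between the train and test columns as depth grows is the usual sign of overfitting, while low values in both columns at small depths suggest underfitting.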

**Optional:**
- For the classification tree, reduce to two features and plot decision boundaries.
- For the regression tree, plot predictions for a single feature to see the behavior.

Please write up your conclusions.

## Part 2: California Housing Data

### Import the Data

In [None]:
housing = fetch_california_housing()
X_reg, y_reg = housing.data, housing.target
feature_names_reg = housing.feature_names

df_housing = pd.DataFrame(X_reg,columns=feature_names_reg)
df_housing['Price'] = y_reg
df_housing

2. Using the California Housing data:

- Fit a DecisionTreeRegressor to the training data.
- Evaluate train and test MSE.
- Explore pruning using cost-complexity pruning.
- Visualize the tree (limit depth if needed).
- Investigate feature importance:
    - Which features are most important for splitting?
    - How does pruning affect feature importance?

Hint:
*reg_tree.feature_importances_ gives the importance of each feature after fitting.*
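
The regression tasks above can be sketched as follows. To keep the snippet runnable without downloading anything, it uses synthetic data from `make_regression` as a stand-in; this is an assumption, and in the notebook `X`, `y`, and `feature_names` would be replaced by `X_reg`, `y_reg`, and `feature_names_reg`. The split size and `random_state` are also illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (assumption): swap in X_reg, y_reg, feature_names_reg in the notebook.
X, y = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=10.0, random_state=0
)
feature_names = [f"x{i}" for i in range(X.shape[1])]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# An unrestricted tree grows until every leaf is pure, so it tends to overfit.
reg_tree = DecisionTreeRegressor(random_state=0)
reg_tree.fit(X_train, y_train)

train_mse = mean_squared_error(y_train, reg_tree.predict(X_train))
test_mse = mean_squared_error(y_test, reg_tree.predict(X_test))
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")

# Importance of each feature, largest first.
importances = sorted(
    zip(feature_names, reg_tree.feature_importances_), key=lambda kv: -kv[1]
)
for name, value in importances:
    print(f"{name:<6s} {value:.3f}")
```

A train MSE near zero alongside a much larger test MSE is the pattern that motivates pruning.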

Some things to consider:
- How does pruning or limiting tree depth affect train/test error?
- How does feature importance change with pruning or depth constraints?
- What are the trade-offs between overfitting and underfitting?
- Visualization of tree structure for insight.
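
For the cost-complexity pruning step, one approach is to ask scikit-learn for the pruning path with `cost_complexity_pruning_path`, then refit a tree at a handful of `ccp_alpha` values and compare errors. The sketch below again uses synthetic stand-in data (an assumption, so it runs offline); the choice of ten quantile-spaced alphas is arbitrary and only meant to keep the loop short.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (assumption): replace with X_reg, y_reg in the notebook.
X, y = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=10.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Effective alphas at which subtrees of the fully grown tree get pruned away.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(
    X_train, y_train
)

# Pick a few alphas across the path; clip guards against tiny negative
# values caused by floating-point error, which ccp_alpha would reject.
alphas = np.unique(np.quantile(path.ccp_alphas, np.linspace(0.0, 1.0, 10)))
alphas = np.clip(alphas, 0.0, None)

# Record (alpha, number of leaves, train MSE, test MSE) for each pruned tree.
results = []
for alpha in alphas:
    tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    tree.fit(X_train, y_train)
    results.append(
        (
            alpha,
            tree.get_n_leaves(),
            mean_squared_error(y_train, tree.predict(X_train)),
            mean_squared_error(y_test, tree.predict(X_test)),
        )
    )

print("     alpha  leaves   train_mse    test_mse")
for alpha, n_leaves, train_mse, test_mse in results:
    print(f"{alpha:>10.3f}  {n_leaves:>6d}  {train_mse:>10.2f}  {test_mse:>10.2f}")
```

Larger alphas give smaller trees; the alpha with the lowest test (or cross-validated) MSE is a reasonable candidate for the pruned model, and its `feature_importances_` can be compared with those of the unpruned tree.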

**Optional:**
- For the classification tree, reduce to two features and plot decision boundaries.
- For the regression tree, plot predictions for a single feature to see the behavior.

Please write up your conclusions.

In [4]:
# from sklearn.model_selection import GridSearchCV
# hyperparameter tuning
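
A minimal sketch of what the hyperparameter tuning step could look like with `GridSearchCV`, shown here on the Iris data. The parameter grid, 5-fold cross-validation, and split settings are assumptions chosen for illustration; the same pattern would apply to `DecisionTreeRegressor` with a regression scoring metric.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Illustrative grid (assumption): depth limits, leaf sizes, and pruning strengths.
param_grid = {
    "max_depth": [1, 2, 3, 4, 5, None],
    "min_samples_leaf": [1, 2, 5],
    "ccp_alpha": [0.0, 0.01, 0.05],
}

# 5-fold cross-validation on the training set only; the test set stays held out.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

test_acc = search.best_estimator_.score(X_test, y_test)
print("best params:", search.best_params_)
print(f"best CV accuracy = {search.best_score_:.3f}")
print(f"test accuracy    = {test_acc:.3f}")
```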