## Intermediate Data Science

#### University of Redlands - DATA 201
#### Prof: Joanna Bieri [joanna_bieri@redlands.edu](mailto:joanna_bieri@redlands.edu)
#### [Class Website: data201.joannabieri.com](https://joannabieri.com/data201_intermediate.html)

In [4]:
# NOTE - This list of package imports is getting long.
# In a professional setting you would import
#      only what you need!
# I had ChatGPT break the packages into groups here.

# ============================================================
# Basic packages
# ============================================================
import os                             # For file and directory operations
import numpy as np                    # For numerical computing and arrays
import pandas as pd                   # For data manipulation and analysis

# ============================================================
# Visualization packages
# ============================================================
import matplotlib.pyplot as plt        # Static 2D plotting
import seaborn as sns                  # Statistical data visualization built on matplotlib

# Interactive visualization with Plotly
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'colab'        # Set renderer for interactive output in Colab or notebooks

# ============================================================
# Scikit-learn: Core utilities for model building and evaluation
# ============================================================
from sklearn.model_selection import train_test_split    # Train/test data splitting
from sklearn.preprocessing import PolynomialFeatures, MinMaxScaler, StandardScaler  # Feature transformations and scaling
from sklearn.metrics import (                            # Model evaluation metrics
    mean_squared_error, r2_score, accuracy_score, 
    precision_score, recall_score, confusion_matrix, 
    classification_report
)

# ============================================================
# Scikit-learn: Linear and polynomial models
# ============================================================
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor       # For KNN

# ============================================================
# Scikit-learn: Synthetic dataset generators
# ============================================================
from sklearn.datasets import make_classification, make_regression

# ============================================================
# Scikit-learn: Naive Bayes models
# ============================================================
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# ============================================================
# Scikit-learn: Decision Trees
# ============================================================
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

# ============================================================
# Text Processing Packages and Code
# ============================================================
from sklearn.feature_extraction.text import TfidfVectorizer
import string
from nltk.corpus import stopwords
from nltk import PorterStemmer as Stemmer

-------------
### Homework Day 17

This homework has two parts. You are going to do a regression analysis and a classification analysis using decision trees.

1. Using the Iris data - same data as we saw at the end of lecture:
   
- Fit a DecisionTreeClassifier to the training data.
- Evaluate accuracy on the testing data.
- Explore the effect of tree depth.
- Visualize the tree (limit depth if needed).
- Investigate feature importance:
    - Which features are most important for classification?
    - Does pruning or limiting depth change feature importance?
      
Hint:
*`cls_tree.feature_importances_` gives the importance of each feature after fitting.*
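
As a starting point for the steps above, here is a minimal sketch rather than a full solution. The 70/30 stratified split, the `random_state` values, the list of depths, and the `results` dictionary are all assumptions chosen for illustration; only `cls_tree` and `feature_importances_` come from the hint.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Assumed split: 70% train / 30% test, stratified so each class is represented
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42, stratify=iris.target
)

# Fit one tree per depth limit (None means no limit) and record the results
results = {}
for depth in [1, 2, 3, None]:
    cls_tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    cls_tree.fit(X_train, y_train)

    test_acc = accuracy_score(y_test, cls_tree.predict(X_test))
    importances = dict(zip(iris.feature_names, cls_tree.feature_importances_))
    results[depth] = (test_acc, importances)

    print(f"max_depth={depth}: test accuracy = {test_acc:.3f}")
    for name, value in importances.items():
        print(f"    {name}: {value:.3f}")
```

Comparing the printed importances across depths is one way to approach the question of whether limiting depth changes which features matter. A fitted tree can be drawn with `plot_tree(cls_tree, feature_names=iris.feature_names, class_names=iris.target_names)`.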

2. Using the California Housing data:

- Fit a DecisionTreeRegressor to the training data.
- Evaluate train and test MSE.
- Explore pruning using cost-complexity pruning.
- Visualize the tree (limit depth if needed).
- Investigate feature importance:
    - Which features are most important for splitting?
    - How does pruning affect feature importance?

Hint:
*`reg_tree.feature_importances_` gives the importance of each feature after fitting.*
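
One possible way to set up cost-complexity pruning is sketched below. To keep the example runnable without downloading anything, it uses a synthetic dataset from `make_regression` as a stand-in for the California Housing data; that substitution, the split, the choice to sample roughly ten alphas, and the `summary` list are assumptions for illustration. For the homework you would swap in `X_reg` and `y_reg`.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Stand-in data (assumption): replace with X_reg, y_reg from the housing data
X, y = make_regression(
    n_samples=500, n_features=8, n_informative=3, noise=20.0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Grow a full tree and ask it for the sequence of effective alphas
full_tree = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Refit on a coarse subset of alphas to keep the loop short
step = max(1, len(ccp_alphas) // 10)
summary = []
for alpha in ccp_alphas[::step]:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    reg_tree = DecisionTreeRegressor(random_state=0, ccp_alpha=alpha)
    reg_tree.fit(X_train, y_train)

    train_mse = mean_squared_error(y_train, reg_tree.predict(X_train))
    test_mse = mean_squared_error(y_test, reg_tree.predict(X_test))
    summary.append((alpha, reg_tree.get_n_leaves(), train_mse, test_mse))
    print(f"alpha={alpha:10.3f}  leaves={reg_tree.get_n_leaves():4d}  "
          f"train MSE={train_mse:10.2f}  test MSE={test_mse:10.2f}")
```

Larger values of `ccp_alpha` prune more aggressively, so the number of leaves shrinks as alpha grows. Reading `reg_tree.feature_importances_` inside the loop would show how the importances shift as the tree gets smaller.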


Some things to consider:
- How does pruning or limiting tree depth affect train/test error?
- How does feature importance change with pruning or depth constraints?
- What are the trade-offs between overfitting and underfitting?
- What insight does visualizing the tree structure give you?


**Optional:**
- For the classification tree, reduce to two features and plot decision boundaries.
- For the regression tree, plot predictions for a single feature to see the behavior.
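
For the single-feature regression idea, the sketch below shows the behavior to look for: a regression tree predicts one constant value per leaf, so its predictions form a staircase. The noisy sine data, the depth of 3, and the names `grid`, `pred`, and `n_levels` are assumptions for illustration; with the housing data you might use a single column such as `MedInc` instead.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy single-feature data (assumption): a noisy sine curve
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

# scikit-learn expects a 2D feature array, hence the reshape
reg_tree = DecisionTreeRegressor(max_depth=3, random_state=0)
reg_tree.fit(x.reshape(-1, 1), y)

# Predict on a dense grid so the piecewise-constant shape is visible
grid = np.linspace(0, 10, 500).reshape(-1, 1)
pred = reg_tree.predict(grid)

# Each leaf contributes one constant prediction level
n_levels = len(np.unique(pred))
print(f"leaves: {reg_tree.get_n_leaves()}, distinct predicted values: {n_levels}")
```

Plotting `pred` against `grid.ravel()` on top of a scatter of the data (for example with `plt.plot` and `plt.scatter`) makes the steps visible, and increasing `max_depth` adds more of them.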


Please write up your conclusions.

**Your final notebooks should:**

- [ ] Be a completely new notebook.
- [ ] **Contain the answers to the questions above along with code that supports your conclusions.**
- [ ] Be reproducible with junk code removed.
- [ ] Have lots of language describing what you are doing, especially for questions you are asking or things that you find interesting about the data. Use complete sentences, nice headings, and good markdown formatting: https://www.markdownguide.org/cheat-sheet/
- [ ] Run without errors from start to finish.

In [1]:
from sklearn.datasets import fetch_california_housing, load_iris

In [9]:
# Here is the housing data for regression
housing = fetch_california_housing()
X_reg, y_reg = housing.data, housing.target
feature_names_reg = housing.feature_names

df_housing = pd.DataFrame(X_reg,columns=feature_names_reg)
df_housing['Price'] = y_reg
df_housing

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | Price |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.023810 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.971880 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.802260 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | 1.5603 | 25.0 | 5.045455 | 1.133333 | 845.0 | 2.560606 | 39.48 | -121.09 | 0.781 |
| 20636 | 2.5568 | 18.0 | 6.114035 | 1.315789 | 356.0 | 3.122807 | 39.49 | -121.21 | 0.771 |
| 20637 | 1.7000 | 17.0 | 5.205543 | 1.120092 | 1007.0 | 2.325635 | 39.43 | -121.22 | 0.923 |
| 20638 | 1.8672 | 18.0 | 5.329513 | 1.171920 | 741.0 | 2.123209 | 39.43 | -121.32 | 0.847 |


In [11]:
# Here is the iris data for classification
iris = load_iris()
X_cls, y_cls = iris.data, iris.target
feature_names_cls = iris.feature_names
class_names_cls = iris.target_names

df_iris = pd.DataFrame(X_cls,columns=feature_names_cls)
df_iris['Class Num'] = y_cls
print(class_names_cls)
df_iris

['setosa' 'versicolor' 'virginica']


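| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Class Num |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |
| ... | ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 | 2 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 | 2 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 | 2 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 | 2 |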
