<a href="https://colab.research.google.com/github/naterattner/data71200/blob/master/project_3/data71200_summer24_project3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Python notebook for project 3: https://bbhosted.cuny.edu/ultra/courses/_2383576_1/cl/outline

The goal for this assignment is to apply different types of unsupervised learning techniques to the dataset created in Project 1.

I'll be using this dataset containing estimations of obesity levels based on eating habits and physical condition: https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition

In [2]:
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

!pip install -U scikit-learn==1.4
!pip install mglearn
import mglearn



## Step 1: Load data, including testing/training split from Project 1

We will perform four steps in this section:
- Load the dataset from UCI
- Perform one-hot encoding on categorical features
- Split the data into a testing and training set
- Scale the data using StandardScaler

#### Load data from UCI

In [3]:
!pip install ucimlrepo
from ucimlrepo import fetch_ucirepo

# fetch dataset
estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition = fetch_ucirepo(id=544)

# data (as pandas dataframes)
features = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.features
targets = estimation_of_obesity_levels_based_on_eating_habits_and_physical_condition.data.targets



#### Perform one-hot encoding
Because the dataset contains categorical variables, we first perform one-hot encoding. We apply it to a DataFrame containing all of the data so that categorical values are represented by the same columns in both the training and testing sets.

Once one-hot encoding is done, we split the data into testing and training sets.
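
To illustrate why the encoding is applied before the split, here is a minimal sketch on a made-up DataFrame. The `toy` frame, its `transport` column, and its values are hypothetical and are not taken from the obesity dataset; the only point is that a category missing from a subset produces no dummy column for that subset.

```python
import pandas as pd

# Hypothetical toy data (illustrative only, not the UCI dataset)
toy = pd.DataFrame({
    "age": [21, 35, 42, 29],
    "transport": ["Walking", "Bike", "Automobile", "Walking"],
})

# Encoding the full frame: every category present in the data gets a column
encoded_all = pd.get_dummies(toy, dtype=int)
print(list(encoded_all.columns))

# Encoding only the first two rows: "Automobile" never appears,
# so the subset ends up with one fewer column than the full frame
encoded_subset = pd.get_dummies(toy.iloc[:2], dtype=int)
print(list(encoded_subset.columns))
```

If the training and testing sets were encoded separately, a mismatch like this could leave the two sets with different feature columns.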

In [4]:
# Encode the categorical features
data_dummies = pd.get_dummies(features, dtype=int)

# Encode the target variable using LabelEncoder
# This encodes labels that were in the target data with values
# from 0 through n_classes-1, so 0 through 6 in this case

from sklearn.preprocessing import LabelEncoder

# 'NObeyesdad' is the target column
label_encoder = LabelEncoder()
targets_encoded = label_encoder.fit_transform(targets['NObeyesdad'])
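
As a small aside, the sketch below shows how `LabelEncoder` assigns integer codes and how they can be mapped back to the original strings. The label strings and the `toy_` names are illustrative assumptions chosen for this example, not output from the cell above.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative label strings (assumed for this example)
toy_labels = ["Normal_Weight", "Obesity_Type_I", "Insufficient_Weight", "Normal_Weight"]

toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(toy_labels)

# classes_ holds the sorted unique labels; a label's position is its integer code
print(toy_encoder.classes_)
print(toy_codes)

# inverse_transform recovers the original strings from the codes
print(toy_encoder.inverse_transform(toy_codes))
```

Keeping the fitted encoder around makes it possible to translate cluster or class indices back into readable labels later on.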

#### Split into testing and training sets

In [5]:
X = data_dummies
y = targets_encoded
print("X.shape: {} y.shape: {}".format(X.shape, y.shape))

X.shape: (2111, 31) y.shape: (2111,)


In [6]:
# From project 1 we know the data is fairly evenly distributed,
# but we can still use stratified sampling on the target to avoid sampling bias

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42, test_size=0.2)

In [7]:
# Check that the stratified sampling worked -- the distribution of targets should be the same in each dataset

def getArrayValueCounts(array):
  unique, counts = np.unique(array, return_counts=True)
  total_count = counts.sum()
  shares = counts / total_count

  print("Unique values:", unique)
  print("Counts:", counts)
  print("Shares:", shares)

print('Test')
getArrayValueCounts(y_test)
print("")
print('Train')
getArrayValueCounts(y_train)

Test
Unique values: [0 1 2 3 4 5 6]
Counts: [54 58 70 60 65 58 58]
Shares: [0.12765957 0.13711584 0.16548463 0.14184397 0.1536643  0.13711584
 0.13711584]

Train
Unique values: [0 1 2 3 4 5 6]
Counts: [218 229 281 237 259 232 232]
Shares: [0.12914692 0.13566351 0.16646919 0.14040284 0.15343602 0.13744076
 0.13744076]
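
The effect of `stratify` is easier to see on deliberately imbalanced data. The following is a minimal sketch using synthetic labels (80 of one class, 20 of another); the `X_toy` and `y_toy` arrays are assumptions made for illustration and are unrelated to the obesity dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced labels: 80 of class 0, 20 of class 1
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# Stratifying on the labels keeps the class shares the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=42)

print("Train shares:", np.bincount(y_tr) / len(y_tr))
print("Test shares:", np.bincount(y_te) / len(y_te))
```

Without `stratify`, a small test set drawn from imbalanced data could over- or under-represent the minority class by chance.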


#### Scale the data

In [10]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
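
The scaler is fit on the training set only and then reused on the test set. A minimal sketch of what that means, using randomly generated arrays (the `toy_train` and `toy_test` data are assumptions for illustration, not the project data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a train/test split
rng = np.random.default_rng(0)
toy_train = rng.normal(loc=50, scale=10, size=(200, 3))
toy_test = rng.normal(loc=50, scale=10, size=(50, 3))

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from train
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the train statistics

# The training features end up with mean ~0 and standard deviation ~1
print(toy_train_scaled.mean(axis=0).round(6), toy_train_scaled.std(axis=0).round(6))

# The test features are only approximately standardized,
# because they were shifted and scaled with the training statistics
print(toy_test_scaled.mean(axis=0).round(3), toy_test_scaled.std(axis=0).round(3))
```

Fitting on the training data alone keeps information about the test set from leaking into the preprocessing step.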

## Step 2: PCA for feature selection