In [1]:
import pandas as pd
import polars as pl
import duckdb as db
import numpy as np

from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score, KFold, LeaveOneOut, StratifiedKFold
from sklearn.metrics import silhouette_score, adjusted_rand_score, normalized_mutual_info_score, accuracy_score, confusion_matrix, classification_report

from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

import scipy.stats as st

import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

import time
import random

from ucimlrepo import fetch_ucirepo 

import matplotlib.pyplot as plt

from urllib.request import urlopen
import xmltodict

# HW2

Overall rules:

- Do not split your answers into separate files. All answers must be in a single jupyter notebook. 
- Refrain from downloading and loading data from a local file unless explicitly instructed to do so. Obtain all required remote data using the appropriate API.
- Refrain from cleaning data by hand on a spreadsheet. All cleaning must be done programmatically, with each step explained. This is so that I can replicate the procedure deterministically.
- Refrain from using code comments to explain what has been done. Document your steps by writing appropriate markdown cells in your notebook.
- Avoid duplicating code by copying and pasting it from one cell to another. If copying and pasting is necessary, develop a suitable function for the task at hand and call that function.
- When providing parameters to a function, never use global variables. Instead, always pass parameters explicitly and always make use of local variables.
- Document your use of LLM models (ChatGPT, Claude, GitHub Copilot, etc.). Either take screenshots of your steps and include them with this notebook, or give me a full log (both questions and answers) in a markdown file named HW2-LLM-LOG.md.

Failure to adhere to these guidelines will result in a 15-point deduction for each infraction.

## Q1

For this question we are going to use [RT-IoT2022](https://archive.ics.uci.edu/dataset/942/rt-iot2022) dataset from [UCI](https://archive.ics.uci.edu/). We have seen several clustering algorithms, and also some measures of quality for clusters in the lectures. Use all you have learned and compare the clustering algorithms we have learnt using the quality measures we have seen in the class.  You must write a detailed comparison analysis to evaluate which model behaves better based on the results you will obtain. One part of your analysis should also include how much time it takes to train and run the models.

The UCI data repository was down at the time of writing my answers, so I am using a local copy.

In [2]:
#rt_iot2022 = fetch_ucirepo(id=942) 

rt_iot2022 = pd.read_csv('../data/RT_IOT2022.csv')
processed = rt_iot2022.sample(frac=1, random_state=42).reset_index(drop=True)
X = rt_iot2022.iloc[:,1:84]
y = np.array(rt_iot2022.iloc[:,84])

Let us first process the data. We must convert the categorical data into numerical data. 

In [3]:
categorical_columns = X.select_dtypes(include=['object', 'category']).columns
numerical_columns = X.select_dtypes(include=['number']).columns

# Preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_columns),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_columns)
    ]
)

X_processed = preprocessor.fit_transform(X)
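
To make the effect of the preprocessing step concrete, here is a minimal sketch on a toy frame. The column names `duration` and `proto` and their values are hypothetical and are not taken from RT-IoT2022; the point is only to show that numeric columns are standardized and each category becomes its own indicator column.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "duration": [1.0, 2.0, 3.0, 4.0],
    "proto": ["tcp", "udp", "tcp", "icmp"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["duration"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["proto"]),
    ]
)

toy_processed = toy_preprocessor.fit_transform(toy)

# The transformer may return a sparse matrix; densify it for inspection.
if hasattr(toy_processed, "toarray"):
    toy_processed = toy_processed.toarray()

# One scaled numeric column plus one indicator column per category.
print(toy_processed.shape)
```

The first output column has zero mean after scaling, and the three indicator columns sum to one in every row.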

This is the main engine of our experiments. It takes the data, splits into pieces, clusters each piece, and then evaluates the quality of the resulting clusters.

In [8]:
def error_bound(data, confidence=0.95):
    n = len(data)
    mean = np.mean(data)
    std_err = np.std(data, ddof=1) / np.sqrt(n)
    margin = std_err * st.t.ppf((1 + confidence) / 2, n - 1)
    return mean, margin

def clustering_experiment(model, X, y, pieces=5):
    N = X.shape[0]
    rng = np.random.default_rng()
    permuted_indices = rng.permutation(N)
    indices = [int(x) for x in np.linspace(0, 1, pieces+1)*N]
    results = {'Runtime': [],
               'Silhouette': [],
               'Rand': [],
               'Info': []}
    for i,j in zip(indices,indices[1:]):
        X_part = X[permuted_indices[i:j],:]
        y_part = y[permuted_indices[i:j]]
        start_time = time.time()
        labels = model.fit_predict(X_part)
        end_time = time.time()
        results['Runtime'].append(end_time - start_time)
        results['Silhouette'].append(silhouette_score(X_part, labels))
        results['Rand'].append(adjusted_rand_score(y_part, labels))
        results['Info'].append(normalized_mutual_info_score(y_part, labels))

    for x in results.keys():
        mean, margin = error_bound(results[x])
        results[x].extend([mean, margin])

    rows = [f'Part {i}' for i in range(pieces)] + ['Mean', 'Error']
    return pd.DataFrame(results,index=rows)
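
The `Mean` and `Error` rows in the tables below come from `error_bound`, which computes the half-width of a $t$-based confidence interval, $\bar{x} \pm t_{(1+c)/2,\,n-1}\, s/\sqrt{n}$. The sketch below restates the same formula under the hypothetical name `t_margin` so that it runs on its own, applies it to made-up scores, and cross-checks the result against `scipy.stats.t.interval`.

```python
import numpy as np
import scipy.stats as st

def t_margin(data, confidence=0.95):
    # Same formula as error_bound, restated so this cell is self-contained.
    n = len(data)
    std_err = np.std(data, ddof=1) / np.sqrt(n)
    return np.mean(data), std_err * st.t.ppf((1 + confidence) / 2, n - 1)

# Hypothetical scores from five pieces (not taken from the experiments).
scores = [0.90, 0.92, 0.94, 0.96, 0.98]
mean, margin = t_margin(scores)

# Cross-check: SciPy's interval helper should give the same half-width.
low, high = st.t.interval(0.95, len(scores) - 1, loc=mean, scale=st.sem(scores))

print(round(mean, 3), round(margin, 3), round((high - low) / 2, 3))
```

With only five pieces there are four degrees of freedom, so the multiplier is about 2.78 rather than the familiar 1.96 of the normal approximation.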

If you have difficulty in following what the code above does, feed it to your favorite LLM and ask it to explain what it does. It is a good exercise.

Now, let us run the experiments one by one.

In [43]:
clustering_experiment(DBSCAN(eps=0.5, min_samples=5),X_processed,y)

| | Runtime | Silhouette | Rand | Info |
|---|---|---|---|---|
| Part 0 | 1.257878 | 0.357051 | 0.311543 | 0.547402 |
| Part 1 | 1.253352 | 0.335575 | 0.308457 | 0.549607 |
| Part 2 | 1.244913 | 0.357509 | 0.318953 | 0.552906 |
| Part 3 | 1.220655 | 0.353765 | 0.313115 | 0.548119 |
| Part 4 | 1.224577 | 0.346631 | 0.308373 | 0.548239 |
| Mean | 1.240275 | 0.350106 | 0.312088 | 0.549254 |
| Error | 0.020904 | 0.011442 | 0.005393 | 0.002721 |


In [44]:
clustering_experiment(KMeans(n_clusters=3, random_state=42),X_processed,y)

| | Runtime | Silhouette | Rand | Info |
|---|---|---|---|---|
| Part 0 | 0.161439 | 0.673169 | 0.422215 | 0.328069 |
| Part 1 | 0.052845 | 0.65276 | 0.436527 | 0.340487 |
| Part 2 | 0.078764 | 0.39309 | 0.32197 | 0.235202 |
| Part 3 | 0.140359 | 0.669268 | 0.421881 | 0.323449 |
| Part 4 | 0.057804 | 0.098161 | -0.130793 | 0.105089 |
| Mean | 0.098242 | 0.49729 | 0.29436 | 0.266459 |
| Error | 0.061595 | 0.313408 | 0.300535 | 0.123452 |


In [45]:
clustering_experiment(AgglomerativeClustering(n_clusters=3),X_processed,y)

| | Runtime | Silhouette | Rand | Info |
|---|---|---|---|---|
| Part 0 | 43.94603 | 0.663971 | 0.418418 | 0.326891 |
| Part 1 | 43.31094 | 0.725628 | 0.276466 | 0.223801 |
| Part 2 | 42.081197 | 0.662243 | 0.4378 | 0.335856 |
| Part 3 | 42.315601 | 0.66421 | 0.416138 | 0.325693 |
| Part 4 | 42.312122 | 0.663503 | 0.432 | 0.331712 |
| Mean | 42.793178 | 0.675911 | 0.396164 | 0.30879 |
| Error | 0.994021 | 0.034522 | 0.083846 | 0.059205 |


The results indicate that KMeans is the least stable of the three algorithms: its error margins are the largest for every quality measure (for example, 0.31 on a mean silhouette score of 0.50), and therefore it appears unsuitable for the dataset we have. On the other hand, hierarchical clustering gives us better clusters than DBSCAN in terms of silhouette score (0.68 vs. 0.35), but DBSCAN performs better if we take mutual information as our base measure (0.55 vs. 0.31), though not if we take the adjusted Rand index (0.31 vs. 0.40). Recall that the silhouette score measures internal similarity, while mutual information and the Rand index measure similarity with respect to a specific reference clustering, here the original labels.

The results indicate that the clusters coming from the original labels are not very uniform. If we form our clusters based on distance and density, we tend to lose the information given by the original labels. So, if we want to retain as much of the information in the original labels as possible, we go for DBSCAN. If we want more uniform clusters, we go for hierarchical clustering.

Notice also that hierarchical clustering takes far longer to run (approx. 42 seconds for each piece) than DBSCAN (approx. 1.2 seconds for each piece). So, if running time is a factor in choosing a model, we must opt for DBSCAN.
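
The difference between an internal measure (silhouette) and an external one (adjusted Rand index) can be illustrated with a small synthetic sketch. Everything here is hypothetical: two well-separated blobs from `make_blobs`, plus a set of reference labels drawn at random so that they carry no geometric information.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Two well-separated blobs: the geometry has an obvious two-cluster structure.
points, blob_ids = make_blobs(
    n_samples=400, centers=[[-5, 0], [5, 0]], cluster_std=1.0, random_state=0
)

# Hypothetical reference labels that ignore the geometry entirely.
rng = np.random.default_rng(0)
unrelated_labels = rng.integers(0, 2, size=points.shape[0])

found = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Internal quality: depends only on the points and the found clusters.
sil = silhouette_score(points, found)

# External quality: depends on which reference labelling we compare against.
ari_geometry = adjusted_rand_score(blob_ids, found)
ari_unrelated = adjusted_rand_score(unrelated_labels, found)

print(round(sil, 2), round(ari_geometry, 2), round(ari_unrelated, 2))
```

The same clustering scores well on silhouette and on the geometric reference, yet close to zero against the unrelated labels, which is the situation described above when the original labels do not form uniform clusters.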

## Q2

For this question we are going to use MNIST digits dataset. We have analyzed 3 classification algorithms so far. These are

1. K-NN
2. SVM
3. Logistic Regression

Use these algorithms to obtain 3 classification models on the dataset. Then use an appropriate cross-validation scheme to test the quality of the models and compare. You must write a detailed comparison analysis to evaluate which model behaves better based on the results you will obtain. One part of your analysis should also include how much time it takes to train and run the models.

In [46]:
digits = load_digits()
digits_X = digits['data']
digits_y = digits['target']

In [49]:
def experiment(model,X,y,splits=3):
    t0 = time.time()
    k_fold = KFold(n_splits=splits)
    scores = cross_val_score(model, X, y, cv=k_fold)
    t1 = time.time()
    mean, error = error_bound(scores)
    return t1-t0, mean, error

In [54]:
models = [('KNN', KNeighborsClassifier(n_neighbors=3)), 
          ('SVM', SVC(kernel='rbf')),
          ('LR', LogisticRegression(max_iter=5000))]

results = {'Model': [],
           'Time': [],
           'Mean Accuracy': [],
           'Error': []}

for name, model in models:
    runtime, a, b = experiment(model,digits_X,digits_y,5)
    results['Model'].append(name)
    results['Time'].append(runtime)
    results['Mean Accuracy'].append(a)
    results['Error'].append(b)

pd.DataFrame(results)

| | Model | Time | Mean Accuracy | Error |
|---|---|---|---|---|
| 0 | KNN | 0.031231 | 0.96662 | 0.014416 |
| 1 | SVM | 0.271851 | 0.966063 | 0.026236 |
| 2 | LR | 4.621699 | 0.918771 | 0.034476 |


In the experiment above, I used 5-fold cross-validation on the 3 models we were asked to work with. The 95% confidence intervals of KNN (0.967 ± 0.014) and SVM (0.966 ± 0.026) overlap almost entirely, so there is no statistically significant difference between the two. Logistic regression has a lower mean accuracy (0.919 ± 0.034), but its interval still touches the lower ends of the other two: its upper bound is 0.953, while the lower bound of KNN is 0.952. So the evidence that LR performs worse is suggestive rather than conclusive at this confidence level. LR also takes significantly more time to run (about 4.6 seconds, against 0.27 seconds for SVM and 0.03 seconds for KNN). KNN is therefore the best choice on both counts: it is the fastest model, and it has the highest mean accuracy with the narrowest interval. SVM matches its accuracy but is roughly 9 times slower.
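
The overlap statements can be checked directly from the table. The sketch below hard-codes the reported means and error margins; the helper names `interval` and `overlaps` are introduced here for illustration only.

```python
# Mean accuracy and error margin as reported in the table above.
reported = {
    "KNN": (0.96662, 0.014416),
    "SVM": (0.966063, 0.026236),
    "LR": (0.918771, 0.034476),
}

def interval(mean, margin):
    # Lower and upper end of the confidence interval.
    return mean - margin, mean + margin

def overlaps(a, b):
    # Two closed intervals overlap when each starts before the other ends.
    return a[0] <= b[1] and b[0] <= a[1]

intervals = {name: interval(*values) for name, values in reported.items()}

for first, second in [("KNN", "SVM"), ("KNN", "LR"), ("SVM", "LR")]:
    print(first, second, overlaps(intervals[first], intervals[second]))
```

Overlapping intervals are only a rough screen for significance; a paired test on the per-fold scores would be a stricter comparison.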

## Q3

For this question, we are going to use the [data warehouse](https://clerk.house.gov/Votes/) for the [US House of Representatives](https://www.house.gov/). The data server has data on each vote going back to 1990. The voting information is in XML format. For example, the code below pulls the data for the 71st roll call of 2024.

In [10]:
with urlopen('https://clerk.house.gov/evs/2024/roll071.xml') as url:
    raw = xmltodict.parse(url.read())
raw

{'rollcall-vote': {'vote-metadata': {'majority': 'R',
   'congress': '118',
   'session': '2nd',
   'committee': 'U.S. House of Representatives',
   'rollcall-num': '71',
   'legis-num': 'H R 2799',
   'vote-question': 'On Agreeing to the Amendment',
   'amendment-num': '4',
   'amendment-author': 'Wagner of Missouri Part B Amendment No. 4',
   'vote-type': 'RECORDED VOTE',
   'vote-result': 'Agreed to',
   'action-date': '7-Mar-2024',
   'action-time': {'@time-etz': '14:28', '#text': '2:28 PM'},
   'vote-desc': None,
   'vote-totals': {'totals-by-party-header': {'party-header': 'Party',
     'yea-header': 'Ayes',
     'nay-header': 'Noes',
     'present-header': 'Answered “Present”',
     'not-voting-header': 'Not Voting'},
    'totals-by-party': [{'party': 'Republican',
      'yea-total': '213',
      'nay-total': '0',
      'present-total': '0',
      'not-voting-total': '8'},
     {'party': 'Democratic',
      'yea-total': '57',
      'nay-total': '154',
      'present-total': '0',

1. Obtain all of the data from 2010 to 2024 (inclusive, all years, all calls). You may save a local copy, and push the local copy onto your github repo.
2. Construct a similarity measure function for a pair of legislators. The function should return the number of times both legislators voted the same way and the number of sessions both attended.
3. Find the legislators whose voting records are the most similar to: (i) Matt Gaetz and (ii) Alexandria Ocasio-Cortez.
4. Using this similarity measure, do a hierarchical clustering for legislators, and analyze the result.
5. Construct a function that takes a year as an input and returns a 2x2 table counting the number of times Democrats voted with other Democrats, the number of times Democrats voted with Republicans, the number of times Republicans voted with Democrats, and the number of times Republicans voted with other Republicans in that year.
6. Analyze the table for each year since 2010 using the $\chi^2$-statistic, and evaluate the polarization in the House of Representatives. In this context, explain what a lower or higher $\chi^2$-metric means. Explain whether the polarization in the House of Representatives increased or decreased over the years.
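
One possible shape for the similarity function in item 2 is sketched below. It is not a full answer: the data layout (a dict from a roll-call id to a vote string per legislator), the id format, and the set of vote strings are all assumptions made for illustration, and "sessions both attended" is read here as roll calls in which both legislators cast a vote.

```python
# Assumption: these strings denote a cast vote; anything else counts as absent.
CAST_VOTES = {"Yea": "Yea", "Aye": "Yea", "Nay": "Nay", "No": "Nay", "Present": "Present"}

def similarity(votes_a, votes_b):
    """Return (times voted the same way, roll calls where both cast a vote).

    votes_a, votes_b: dicts mapping a roll-call id to a vote string
    (hypothetical layout, one dict per legislator).
    """
    agreements = 0
    shared = 0
    for roll_id in votes_a.keys() & votes_b.keys():
        vote_a = CAST_VOTES.get(votes_a[roll_id])
        vote_b = CAST_VOTES.get(votes_b[roll_id])
        if vote_a is None or vote_b is None:
            # At least one of the two did not vote on this roll call.
            continue
        shared += 1
        if vote_a == vote_b:
            agreements += 1
    return agreements, shared

# Hypothetical voting records for two legislators.
legislator_a = {"2024-071": "Aye", "2024-072": "Nay", "2024-073": "Not Voting", "2024-074": "Yea"}
legislator_b = {"2024-071": "Aye", "2024-072": "Yea", "2024-073": "Nay", "2024-075": "Yea"}

print(similarity(legislator_a, legislator_b))
```

Returning both counts, rather than a single ratio, keeps the measure honest for pairs that rarely voted on the same roll calls.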

## Q4

For this question, we are going to use the following dataset.

In [2]:
credit = pd.read_csv('https://storage.googleapis.com/download.tensorflow.org/data/creditcard.csv')
credit.head(10)

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0
5,2.0,-0.425966,0.960523,1.141109,-0.168252,0.420987,-0.029728,0.476201,0.260314,-0.568671,...,-0.208254,-0.559825,-0.026398,-0.371427,-0.232794,0.105915,0.253844,0.08108,3.67,0
6,4.0,1.229658,0.141004,0.045371,1.202613,0.191881,0.272708,-0.005159,0.081213,0.46496,...,-0.167716,-0.27071,-0.154104,-0.780055,0.750137,-0.257237,0.034507,0.005168,4.99,0
7,7.0,-0.644269,1.417964,1.07438,-0.492199,0.948934,0.428118,1.120631,-3.807864,0.615375,...,1.943465,-1.015455,0.057504,-0.649709,-0.415267,-0.051634,-1.206921,-1.085339,40.8,0
8,7.0,-0.894286,0.286157,-0.113192,-0.271526,2.669599,3.721818,0.370145,0.851084,-0.392048,...,-0.073425,-0.268092,-0.204233,1.011592,0.373205,-0.384157,0.011747,0.142404,93.2,0
9,9.0,-0.338262,1.119593,1.044367,-0.222187,0.499361,-0.246761,0.651583,0.069539,-0.736727,...,-0.246914,-0.633753,-0.120794,-0.38505,-0.069733,0.094199,0.246219,0.083076,3.68,0


1. Write a regression model that predicts the Amount from the other variables. Assess the quality of the model using an appropriate measure.
2. Write a logistic regression model that predicts the Class from other variables. Assess the quality of the model using an appropriate measure.
3. The quality of the model seems very good, but in reality it doesn't work. Why? What is the problem? Offer a solution, implement and test it. Did it really solve our problem? Explain.
4. Try other supervised classification models we have learned. Are they susceptible to the same problem as before? Explain. Offer a solution, implement and test it. Did it really solve our problem? Explain.

In [57]:
model = smf.ols('Amount ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 \
                          + V11 + V12 + V13 + V14 + V15 + V16 + V17 + V18 + V19 + V20 \
                          + V21 + V22 + V23 + V24 + V25 + V26 + V27 + V28', data=credit)
results = model.fit()
results.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | Amount | R-squared: | 0.917 |
| Model: | OLS | Adj. R-squared: | 0.917 |
| Method: | Least Squares | F-statistic: | 113100.0 |
| Date: | Sat, 12 Apr 2025 | Prob (F-statistic): | 0.0 |
| Time: | 14:46:31 | Log-Likelihood: | -1621600.0 |
| No. Observations: | 284807 | AIC: | 3243000.0 |
| Df Residuals: | 284778 | BIC: | 3244000.0 |
| Df Model: | 28 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 88.3496 | 0.135 | 656.123 | 0.000 | 88.086 | 88.614 |
| V1 | -29.0778 | 0.069 | -422.968 | 0.000 | -29.213 | -28.943 |
| V2 | -80.4914 | 0.082 | -987.091 | 0.000 | -80.651 | -80.332 |
| V3 | -34.7867 | 0.089 | -391.710 | 0.000 | -34.961 | -34.613 |
| V4 | 17.4414 | 0.095 | 183.394 | 0.000 | 17.255 | 17.628 |
| V5 | -70.0132 | 0.098 | -717.656 | 0.000 | -70.204 | -69.822 |
| V6 | 40.5482 | 0.101 | 401.185 | 0.000 | 40.350 | 40.746 |
| V7 | 80.3298 | 0.109 | 738.005 | 0.000 | 80.117 | 80.543 |
| V8 | -21.5867 | 0.113 | -191.469 | 0.000 | -21.808 | -21.366 |

| | | | |
|---|---|---|---|
| Omnibus: | 475780.388 | Durbin-Watson: | 1.953 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 14550861272.517 |
| Skew: | -10.0 | Prob(JB): | 0.0 |
| Kurtosis: | 1110.143 | Cond. No. | 5.93 |


In [58]:
anova_lm(results)

| | df | sum_sq | mean_sq | F | PR(>F) |
|---|---|---|---|---|---|
| V1 | 1.0 | 923858400.0 | 923858400.0 | 178902.196726 | 0.0 |
| V2 | 1.0 | 5031576000.0 | 5031576000.0 | 974348.49052 | 0.0 |
| V3 | 1.0 | 792353700.0 | 792353700.0 | 153436.735008 | 0.0 |
| V4 | 1.0 | 173683800.0 | 173683800.0 | 33633.304449 | 0.0 |
| V5 | 1.0 | 2659636000.0 | 2659636000.0 | 515029.97171 | 0.0 |
| V6 | 1.0 | 831147600.0 | 831147600.0 | 160949.052786 | 0.0 |
| V7 | 1.0 | 2812601000.0 | 2812601000.0 | 544651.10805 | 0.0 |
| V8 | 1.0 | 189316100.0 | 189316100.0 | 36660.451082 | 0.0 |
| V9 | 1.0 | 34880810.0 | 34880810.0 | 6754.556227 | 0.0 |
| V10 | 1.0 | 183567900.0 | 183567900.0 | 35547.331907 | 0.0 |


The results indicate that the model fit measure $R^2$ is 91.7%, which is a very good fit. The ANOVA results indicate that V2, V5, V7, and V20 are more important than the other variables. The remaining 8.3% of the variance in Amount is left unexplained by the model.

Now, let us do a logistic regression model:

In [3]:
X, y = credit.iloc[:,1:30], credit.iloc[:,30]

In [86]:
experiment(LogisticRegression(max_iter=1500),X,y,3)

(60.484336853027344,
 np.float64(0.999174879954165),
 np.float64(0.00024731195326837586))

The accuracy appears to be 99.9% which is very good, but let us look closer:

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

model = LogisticRegression(max_iter=5000)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test,y_pred))

confusion_matrix(y_test,y_pred)

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     28429
           1       0.85      0.67      0.75        52

    accuracy                           1.00     28481
   macro avg       0.93      0.84      0.88     28481
weighted avg       1.00      1.00      1.00     28481



array([[28423,     6],
       [   17,    35]])
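
The numbers in the report can be recomputed from the confusion matrix, which also makes clear why accuracy alone is misleading here. The sketch below hard-codes the matrix printed above and compares the model with a trivial baseline that always predicts class 0; the variable names are chosen for illustration.

```python
import numpy as np

# Confusion matrix from above: rows are true classes, columns are predictions.
cm = np.array([[28423, 6],
               [17, 35]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()

# A model that always answers "class 0" is right on every true negative row.
baseline_accuracy = (tn + fp) / cm.sum()

# Class-1 metrics: how reliable the alarms are, and how many frauds are caught.
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(round(accuracy, 4), round(baseline_accuracy, 4))
print(round(precision, 2), round(recall, 2))
```

The model beats the always-negative baseline by roughly one tenth of a percentage point in accuracy, while still missing about a third of the class-1 samples.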

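There is a severe imbalance in the data: only 52 of the 28,481 test samples belong to class 1, and the model misses 17 of them (recall 0.67). There are a few things we can do:

1. We can tell the logistic regression model that the data is imbalanced.
2. We can repeat the train-test split using a stratified sampling scheme, so that both parts keep the original class ratio.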

In [68]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.1)

model = LogisticRegression(max_iter=5000, class_weight='balanced')
model.fit(X_train,y_train)
y_pred = model.predict(X_test)

print(classification_report(y_test,y_pred))

confusion_matrix(y_test,y_pred)

              precision    recall  f1-score   support

           0       1.00      0.98      0.99     28432
           1       0.06      0.88      0.12        49

    accuracy                           0.98     28481
   macro avg       0.53      0.93      0.55     28481
weighted avg       1.00      0.98      0.99     28481



array([[27803,   629],
       [    6,    43]])

As you can see, the recall for class 1 improved (from 0.67 to 0.88), but the precision dropped significantly (from 0.85 to 0.06), while there was a small drop in accuracy. Let us see if these numbers are statistically significant.

In [5]:
def stratified_experiment(model,X,y,scorer='accuracy',splits=3):
    t0 = time.time()
    k_fold = StratifiedKFold(n_splits=splits,shuffle=True)
    scores = cross_val_score(model, X, y, cv=k_fold, scoring=scorer)
    t1 = time.time()
    mean, err = error_bound(scores)
    return t1-t0, mean, err
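
The reason for switching from `KFold` to `StratifiedKFold` can be seen on a toy example. The labels below are hypothetical: 6 positives among 300 samples, sorted so that all positives sit at the end, which is the worst case for an unshuffled plain split.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Hypothetical imbalanced labels: 294 negatives followed by 6 positives.
y_toy = np.array([0] * 294 + [1] * 6)
X_toy = np.zeros((300, 1))

def positives_per_fold(splitter, X, y):
    # Count how many positive samples land in each test fold.
    return [int(y[test_idx].sum()) for _, test_idx in splitter.split(X, y)]

plain = positives_per_fold(KFold(n_splits=3), X_toy, y_toy)
stratified = positives_per_fold(
    StratifiedKFold(n_splits=3, shuffle=True, random_state=0), X_toy, y_toy
)

print(plain)
print(stratified)
```

With the plain split, two of the three test folds contain no positive sample at all, so precision and recall for class 1 cannot be estimated on them; the stratified split gives every fold its share.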

In [70]:
stratified_experiment(LogisticRegression(max_iter=5000, class_weight='balanced'), X, y)

(114.77970671653748,
 np.float64(0.9767632191966311),
 np.float64(0.010984381680846074))

In [71]:
stratified_experiment(LogisticRegression(max_iter=5000, class_weight='balanced'), X, y, scorer='precision_macro')

(100.40494632720947,
 np.float64(0.5309661152028654),
 np.float64(0.0057245541647413176))

In [72]:
stratified_experiment(LogisticRegression(max_iter=5000, class_weight='balanced'), X, y, scorer='recall_macro')

(106.10843086242676,
 np.float64(0.9426491356884868),
 np.float64(0.020621208580060085))

The new experiment indicates that, to increase recall on the class labeled '1', we must also accept a significant increase in false positives, and hence a much lower precision, if we use logistic regression as our primary model. However, there are three other models we saw in class:

1. KNN
2. SVM
3. Decision Trees


In [9]:
stratified_experiment(KNeighborsClassifier(n_neighbors=3),X,y,scorer='precision_macro',splits=5)

(87.50482654571533,
 np.float64(0.9684287909778126),
 np.float64(0.030109486283569055))

In [11]:
stratified_experiment(DecisionTreeClassifier(),X,y,scorer='precision_macro')

(41.133445739746094,
 np.float64(0.8782600979775522),
 np.float64(0.022830748593683125))

In [12]:
stratified_experiment(SVC(kernel='rbf'),X,y,scorer='precision_macro')

(179.73794317245483,
 np.float64(0.904734647558627),
 np.float64(0.03885303103390294))

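The experiments show that KNN has the best precision score (0.97), although its interval overlaps slightly with that of SVM. In terms of time, KNN (about 88 seconds, with 5 splits) is more expensive than the decision tree (about 41 seconds, with 3 splits) but much cheaper than SVM (about 180 seconds, with 3 splits). SVM is the most expensive model, and its precision score is similar to that of the decision tree. Comparing all parameters, KNN appears to be the model that strikes a balance between high precision and acceptable running time.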

Let us look at it closely:

In [14]:
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.25)
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train,y_train)

y_pred = model.predict(X_test)

confusion_matrix(y_test,y_pred)

array([[71076,     3],
       [   38,    85]])

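This model performs better than the logistic regression model: for class 1 it reaches a precision of about 0.97 (85 of 88 predicted positives) and a recall of about 0.69 (85 of 123 actual positives), against 0.85 and 0.67 for the unweighted logistic regression.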