# Introduction
A very important aspect of supervised and semi-supervised machine learning is the quality of the labels produced by human labelers. Unfortunately, humans are not perfect and in some cases may even maliciously label things incorrectly. In this assignment, you will evaluate the impact of incorrect labels on a number of different classifiers.

We have provided a number of code snippets you can use during this assignment. Feel free to modify them or replace them.


## Dataset
The dataset you will be using is the [Adult Income dataset](https://archive.ics.uci.edu/ml/datasets/Adult). This dataset was created by Ronny Kohavi and Barry Becker and was used to predict whether a person's income is more/less than 50k USD based on census data.

### Data preprocessing
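Start by loading and preprocessing the data. Remove NaN values, convert string columns to categorical variables, and encode the target variable (the strings `<=50K` and `>50K` in column index 14). Note that some rows write the label with a trailing period (`<=50K.`), so match on the comparison sign rather than on the full string.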

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

In [2]:
# This can be used to load the dataset
data = pd.read_csv("adult.csv", na_values='?')
data

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K.
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K.
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K.
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K.


In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

df = data.dropna().copy()

df['salary'] = df['salary'].astype(str).apply(lambda x: 1 if '>' in x else 0)

categorical_cols = [
    'marital-status', 'workclass', 'education', 'occupation',
    'relationship', 'race', 'sex', 'native-country'
]

categorical_transformer = OneHotEncoder(handle_unknown="ignore")
continuous_cols = [x for x in df.columns if x not in categorical_cols and x != 'salary']
scaler = StandardScaler()
# Column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("categorical", categorical_transformer, categorical_cols),
        ("continuous", scaler, continuous_cols)
    ],
    remainder="passthrough",
    verbose_feature_names_out=False
)

# Full pipeline: only the preprocessing step for now; a model step can be appended later
model = Pipeline(steps=[
    ("preprocessor", preprocessor)
])

# Fit the preprocessor and get the transformed data
df_preprocessed = preprocessor.fit_transform(df)

# Get feature names
feature_names = preprocessor.get_feature_names_out()

# Convert to dense DataFrame
df_preprocessed = pd.DataFrame(
    df_preprocessed.toarray(),  # make it dense
    columns=feature_names
)
df_preprocessed.head()

# Fit pipeline
# model.fit(X, y)
#
# # Predict
# preds = model.predict(X)


Unnamed: 0,marital-status_Divorced,marital-status_Married-AF-spouse,marital-status_Married-civ-spouse,marital-status_Married-spouse-absent,marital-status_Never-married,marital-status_Separated,marital-status_Widowed,workclass_Federal-gov,workclass_Local-gov,workclass_Private,...,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week,salary
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.034201,-1.062295,1.128753,0.142888,-0.21878,-0.07812,0.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.866417,-1.007438,1.128753,-0.146733,-0.21878,-2.326738,0.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,-0.041455,0.245284,-0.438122,-0.146733,-0.21878,-0.07812,0.0
3,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,1.0,0.0,0.0,1.093385,0.425853,-1.221559,-0.146733,-0.21878,-0.07812,0.0
4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,-0.798015,1.407393,1.128753,-0.146733,-0.21878,-0.07812,0.0


### Data classification
Choose at least 4 different classifiers and evaluate their performance in predicting the target variable. 

#### Preprocessing
Think about how you are going to encode the categorical variables, whether to normalize, whether to use all of the features, whether to apply dimensionality reduction, etc. Justify your choices.

A good method to apply preprocessing steps is using a Pipeline. Read more about this [here](https://machinelearningmastery.com/columntransformer-for-numerical-and-categorical-data/) and [here](https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf). 

<!-- #### Data visualization
Calculate the correlation between different features, including the target variable. Visualize the correlations in a heatmap. A good example of how to do this can be found [here](https://towardsdatascience.com/better-heatmaps-and-correlation-matrix-plots-in-python-41445d0f2bec). 

Select a features you think will be an important predictor of the target variable and one which is not important. Explain your answers. -->

#### Evaluation
Use a validation technique from the previous lecture to evaluate the performance of the model. Explain and justify which metrics you used to compare the different models. 
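Accuracy alone can be misleading when one class is much more common than the other, as is the case for the income label. The following is a minimal sketch, assuming synthetic data from `make_classification` as a stand-in for the real features, of how several metrics could be collected in one cross-validation run with `cross_validate`. The variable names and the choice of metrics are illustrative assumptions, not a prescribed solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for the real data (roughly 75% / 25%).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
metrics = ["accuracy", "balanced_accuracy", "f1", "roc_auc"]

cv_scores = cross_validate(
    LogisticRegression(max_iter=1000), X_demo, y_demo, cv=cv, scoring=metrics
)

# cross_validate returns one array per metric under the key "test_<metric>".
metric_summary = {m: cv_scores[f"test_{m}"].mean() for m in metrics}
for name, value in metric_summary.items():
    print(f"{name}: {value:.3f}")
```

Comparing `accuracy` with `balanced_accuracy` and `f1` gives a rough idea of how much of the score comes from simply predicting the majority class.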

In [4]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Define your preprocessing steps here
scaler = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown="ignore")
steps = [
        ("categorical", categorical_transformer, categorical_cols),
        ("continuous", scaler, continuous_cols)
    ]

# Combine steps into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=steps,
    remainder="passthrough",
    verbose_feature_names_out=False
)
target_col = 'salary'
data = df.copy()
dataX = data.drop(columns=[target_col])
dataY = data[target_col]
# show the correlation between different features including target variable
def visualize(data, preprocessor):
    X = data.drop(columns=[target_col])
    y = data[target_col]

    # Fit + transform features
    X_trans = preprocessor.fit_transform(X)

    # Get feature names
    feature_names = preprocessor.get_feature_names_out()

    # Convert to DataFrame (handle sparse matrices)
    if hasattr(X_trans, "toarray"):
        X_df = pd.DataFrame(X_trans.toarray(), columns=feature_names)
    else:
        X_df = pd.DataFrame(X_trans, columns=feature_names)

    # Add target variable
    X_df[target_col] = y.values

    # Compute correlation matrix
    corr = X_df.corr(numeric_only=True)

    # Plot heatmap
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, cmap="coolwarm", center=0)
    plt.title("Correlation Matrix Including Target Variable")
    plt.show()

    return corr

# Apply your model to feature array X and labels y
def apply_model(model, X, y):    
    # Wrap the model and steps into a Pipeline
    pipeline = Pipeline(steps=[('t', preprocessor), ('m', model)])
    
    # Evaluate the model and store results
    return evaluate_model(X, y, pipeline)

# Apply your validation techniques and calculate metrics
def evaluate_model(X, y, pipeline):
    # 5-fold cross-validation gives a more reliable estimate than a single split
    scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    results = {
            "Accuracy": scores.mean(),
        }
    print("Accuracy: ")
    print(results)
    print()
    return results


In [7]:
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
models = [LogisticRegression(), LinearSVC(), SVC(), GradientBoostingClassifier(random_state=42)]
dataY = dataY.astype(int)
for m in models:
    print(f'Model {m.__class__.__name__}:')
    apply_model(m, dataX, dataY)

Model LogisticRegression:
Accuracy: 
{'Accuracy': np.float64(0.8485693552973252)}

Model LinearSVC:
Accuracy: 
{'Accuracy': np.float64(0.8488347050437527)}

Model SVC:
Accuracy: 
{'Accuracy': np.float64(0.8533457729779526)}

Model GradientBoostingClassifier:
Accuracy: 
{'Accuracy': np.float64(0.8629208373582479)}



### Label perturbation
To evaluate the impact of faulty labels in a dataset, we will introduce some errors in the labels of our data.


#### Preparation
Start by creating a method that alters a dataset by randomly selecting a percentage of the rows and swapping their labels (0 -> 1 and 1 -> 0).


In [5]:
def pertubate(y: np.ndarray, fraction: float) -> np.ndarray:
    """Given a label vector, create a new copy where a random fraction of the labels have been flipped."""
    y_copy = y.copy()
    n_flip = int(len(y) * fraction)

    # Randomly select indices to flip
    flip_indices = np.random.choice(len(y), size=n_flip, replace=False)

    # Flip the labels (binary case)
    y_copy[flip_indices] = 1 ^ y_copy[flip_indices]

    return y_copy
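A quick sanity check helps confirm that a flipping routine changes exactly the requested share of labels and leaves the input untouched. The sketch below uses a hypothetical helper `flip_labels` with a seedable generator (`np.random.default_rng`) so that runs are reproducible; it is an illustrative variant, not the function defined above.

```python
import numpy as np

def flip_labels(y, fraction, seed=None):
    """Return a copy of binary 0/1 labels with `fraction` of the entries flipped."""
    rng = np.random.default_rng(seed)
    y_flipped = np.asarray(y).copy()
    n_flip = int(len(y_flipped) * fraction)
    # Sampling without replacement guarantees that no index is flipped twice.
    idx = rng.choice(len(y_flipped), size=n_flip, replace=False)
    y_flipped[idx] = 1 - y_flipped[idx]
    return y_flipped

y_demo = np.array([0, 1] * 50)              # 100 toy labels
y_noisy = flip_labels(y_demo, 0.2, seed=0)

# Exactly n_flip labels differ from the original vector.
print((y_noisy != y_demo).sum())            # 20
```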

#### Analysis
Create a number of new datasets with perturbed labels, for fractions ranging from `0` to `0.5` in increments of `0.1`.

Repeat the earlier experiment comparing the performance of different models, this time on the new datasets. Run the experiment at least 5 times for each model and perturbation level, and calculate the mean and variance of the scores. Visualize the change in score across perturbation levels for all of the models in a single plot.

State your observations. Is there a change in the performance of the models? Are there some classifiers which are impacted more/less than other classifiers and why is this the case?
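When interpreting the scores, it matters which labels the predictions are compared against. If both training and test folds carry flipped labels, part of the drop in accuracy comes from the test labels being wrong rather than from the model learning worse. The sketch below, on synthetic data and with an assumed flip rate of 0.3, trains on noisy labels and scores each fold against both the noisy and the clean labels so the two effects can be separated.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X_demo, y_clean = make_classification(n_samples=600, n_features=8, random_state=0)

# Flip 30% of the labels (assumed rate, for illustration only).
rng = np.random.default_rng(0)
y_noisy = y_clean.copy()
flip = rng.choice(len(y_clean), size=int(0.3 * len(y_clean)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

acc_vs_noisy, acc_vs_clean = [], []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in cv.split(X_demo, y_noisy):
    # The model only ever sees noisy labels during training.
    clf = LogisticRegression(max_iter=1000).fit(X_demo[train_idx], y_noisy[train_idx])
    pred = clf.predict(X_demo[test_idx])
    acc_vs_noisy.append(np.mean(pred == y_noisy[test_idx]))
    acc_vs_clean.append(np.mean(pred == y_clean[test_idx]))

print(f"scored against noisy labels: {np.mean(acc_vs_noisy):.3f}")
print(f"scored against clean labels: {np.mean(acc_vs_clean):.3f}")
```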

In [8]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
fractions = np.arange(0, 0.6, 0.1)  # 0,0.1,...,0.5
n_repeats = 5
X = data.drop(columns=[target_col])
y = data[target_col]
pert_models = {
    "Logistic Regression": LogisticRegression(),
    "Linear SVC": LinearSVC(),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Gradient Boosting Classifier": GradientBoostingClassifier(random_state=42)
}

results = {model_name: [] for model_name in pert_models.keys()}

# -------------------------------
# Run experiments
# -------------------------------
for frac in fractions:
    frac_results = {model_name: [] for model_name in pert_models.keys()}
    print(f"Fraction: {frac}")
    for repeat in range(n_repeats):
        y_perturbed = pertubate(dataY.values, frac)
        for model_name, model in pert_models.items():
            score = apply_model(model, dataX, y_perturbed)  # uses your pipeline + cross-val
            frac_results[model_name].append(score["Accuracy"])
    # Compute mean and variance per model
    for model_name in pert_models.keys():
        mean_score = np.mean(frac_results[model_name])
        var_score = np.var(frac_results[model_name])
        results[model_name].append((mean_score, var_score))



Fraction: 0.0
Accuracy: 
{'Accuracy': np.float64(0.8485693552973252)}

Accuracy: 
{'Accuracy': np.float64(0.8488347050437527)}

Accuracy: 
{'Accuracy': np.float64(0.8445226469744744)}

Accuracy: 
{'Accuracy': np.float64(0.8629208373582479)}

Accuracy: 
{'Accuracy': np.float64(0.8485693552973252)}

Accuracy: 
{'Accuracy': np.float64(0.8488347050437527)}

Accuracy: 
{'Accuracy': np.float64(0.8445226469744744)}

Accuracy: 
{'Accuracy': np.float64(0.8629208373582479)}

Accuracy: 
{'Accuracy': np.float64(0.8485693552973252)}

Accuracy: 
{'Accuracy': np.float64(0.8488347050437527)}

Accuracy: 
{'Accuracy': np.float64(0.8445226469744744)}

Accuracy: 
{'Accuracy': np.float64(0.8629208373582479)}

Accuracy: 
{'Accuracy': np.float64(0.8485693552973252)}

Accuracy: 
{'Accuracy': np.float64(0.8488347050437527)}

Accuracy: 
{'Accuracy': np.float64(0.8445226469744744)}

Accuracy: 
{'Accuracy': np.float64(0.8629208373582479)}

Accuracy: 
{'Accuracy': np.float64(0.8485693552973252)}

Accuracy: 
{'Accu

In [None]:
results
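The `results` dictionary built above maps each model name to a list of `(mean, variance)` tuples, one per perturbation level. A minimal sketch of how that structure could be drawn in a single plot follows; `plot_perturbation_results` is a hypothetical helper, and the numbers passed to it here are placeholders that only show the expected shape, not measured scores.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np

def plot_perturbation_results(results, fractions):
    """Plot the mean score per model against the flip fraction, with +/- 1 std error bars."""
    fig, ax = plt.subplots(figsize=(8, 5))
    for model_name, stats in results.items():
        means = np.array([m for m, _ in stats])
        stds = np.sqrt(np.array([v for _, v in stats]))  # variance -> standard deviation
        ax.errorbar(fractions, means, yerr=stds, marker="o", capsize=3, label=model_name)
    ax.set_xlabel("Fraction of flipped labels")
    ax.set_ylabel("Mean cross-validated accuracy")
    ax.legend()
    return ax

# Placeholder numbers only, to illustrate the input format.
demo_fractions = [0.0, 0.1]
demo_results = {
    "model A": [(0.80, 0.0001), (0.70, 0.0004)],
    "model B": [(0.85, 0.0002), (0.72, 0.0003)],
}
ax = plot_perturbation_results(demo_results, demo_fractions)
```

In the notebook, the call would be `plot_perturbation_results(results, fractions)` followed by `plt.show()`.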

Observations + explanations: max. 400 words

#### Discussion

1)  Discuss how you could reduce the impact of wrongly labeled data or correct wrong labels. <br />
    max. 400 words
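One family of approaches flags rows whose given label a model finds very unlikely, so they can be reviewed, relabeled, or down-weighted. The sketch below is a simplified illustration on synthetic data: it computes out-of-fold probabilities with `cross_val_predict` and flags rows where the probability of the given label falls below a hypothetical threshold of 0.2. The threshold, the classifier, and the data are all assumptions; this is not a complete label-cleaning method.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X_demo, y_clean = make_classification(
    n_samples=500, n_features=8, class_sep=2.0, random_state=0
)

# Corrupt 50 labels so there is something to find.
rng = np.random.default_rng(0)
flipped = rng.choice(len(y_clean), size=50, replace=False)
y_noisy = y_clean.copy()
y_noisy[flipped] = 1 - y_noisy[flipped]

# Out-of-fold probabilities: each row is scored by a model that never saw its label.
proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_demo, y_noisy, cv=5, method="predict_proba"
)

# Probability the model assigns to the label each row was actually given.
p_given = proba[np.arange(len(y_noisy)), y_noisy]

threshold = 0.2  # hypothetical cut-off; would need tuning on real data
suspects = np.where(p_given < threshold)[0]
hits = np.intersect1d(suspects, flipped)
print(f"flagged {len(suspects)} rows, {len(hits)} of them were actually flipped")
```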



Authors: Youri Arkesteijn, Tim van der Horst and Kevin Chong.


## Machine Learning Workflow

In Part 1, you will have gone through the entire machine learning workflow, which consists of the following steps:

1) Data Loading
2) Data Pre-processing
3) Machine Learning Model Training
4) Machine Learning Model Testing

These tasks are strictly sequential and must be carried out in order. As a result, a small perturbation in any one step can have a detrimental knock-on effect on the steps that follow.

In the final section of Part 1, you will have seen how perturbing the data used for model training affects the results of model testing.

## Part 2 Data Discovery

You will be given a set of datasets and are tasked with performing data discovery on them.

<b>The datasets are provided in the group lockers on Brightspace. Let me know if you are having trouble accessing the datasets.</b>

The goal is to find datasets that are related to each other and to identify the relationships between them.

The relationships that we are primarily working with are Join and Union relationships.

Please implement two methods that allow us to find those pesky Join and Union relationships.

Try to do this with the datasets as they are, without any preprocessing.
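As a rough intuition, two tables are join candidates when a column in one shares many values with a column in the other, and union candidates when they have matching schemas. The sketch below illustrates both ideas on toy tables; `value_containment` and `same_schema` are hypothetical helper names, and the criteria are deliberately simple assumptions rather than a full discovery method.

```python
import pandas as pd

def value_containment(left: pd.Series, right: pd.Series) -> float:
    """Fraction of distinct non-null values in `left` that also occur in `right`."""
    left_vals = set(left.dropna().astype(str))
    right_vals = set(right.dropna().astype(str))
    if not left_vals:
        return 0.0
    return len(left_vals & right_vals) / len(left_vals)

def same_schema(df_a: pd.DataFrame, df_b: pd.DataFrame) -> bool:
    """True if both tables have the same column names (case-insensitive, any order)."""
    cols_a = sorted(str(c).lower() for c in df_a.columns)
    cols_b = sorted(str(c).lower() for c in df_b.columns)
    return cols_a == cols_b

# Toy tables for illustration only.
orders = pd.DataFrame({"customer_id": [1, 2, 2, 3], "amount": [10.0, 5.5, 7.0, 3.2]})
customers = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["Ann", "Bob", "Cy", "Di"]})
orders_2024 = pd.DataFrame({"Customer_ID": [5], "Amount": [9.9]})

print(value_containment(orders["customer_id"], customers["id"]))  # 1.0 -> join candidate
print(same_schema(orders, orders_2024))                           # True -> union candidate
print(same_schema(orders, customers))                             # False
```

Containment is asymmetric on purpose: every order refers to a known customer, but not every customer has an order, which is typical for a foreign-key style join.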



In [19]:
import difflib
import os
import pandas as pd
def load_datasets(path, n=20):
    datasets = {}
    for i in range(n):
        file_path = os.path.join(path, f"table_{i}.csv")
        print(i)
        datasets[i] = pd.read_csv(file_path, on_bad_lines="skip")
    return datasets


# --- Union candidates ---
def find_union_candidates(datasets):
    unions = []
    for name1, df1 in datasets.items():
        for name2, df2 in datasets.items():
            if name1 >= name2:
                continue
            if len(df1.columns) == len(df2.columns):
                # Compare datatypes column by column
                col_match = all(
                    df1.dtypes.values[i] == df2.dtypes.values[i]
                    for i in range(len(df1.columns))
                )
                if col_match:
                    unions.append((name1, name2))
    return unions


# --- Join candidates ---
def find_join_candidates(datasets, sample_size=100):
    joins = []
    for name1, df1 in datasets.items():
        for name2, df2 in datasets.items():
            if name1 >= name2:
                continue
            for col1 in df1.columns:
                for col2 in df2.columns:
                    # Check column name similarity
                    if col1.lower() == col2.lower() or \
                       difflib.SequenceMatcher(None, col1.lower(), col2.lower()).ratio() > 0.8:
                        joins.append((name1, col1, name2, col2))
                    else:
                        # Check value overlap (sampled for speed)
                        try:
                            s1 = df1[col1].dropna().astype(str)
                            s2 = df2[col2].dropna().astype(str)
                            # Sample at most sample_size of the non-null values;
                            # using len(df1) here would fail when NaNs were dropped
                            vals1 = set(s1.sample(min(sample_size, len(s1))))
                            vals2 = set(s2.sample(min(sample_size, len(s2))))
                            if len(vals1 & vals2) > 0:
                                joins.append((name1, col1, name2, col2))
                        except Exception:
                            continue
    return joins
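unionss = []
joinss = []
def discovery_algorithm():
    """Function should be able to perform data discovery to find related datasets
    Possible Input: List of datasets
    Output: List of pairs of related datasets
    """
    # Without `global`, the assignments below would only create local variables
    # and the module-level lists would stay empty
    global unionss, joinss
    datasets = load_datasets('lake49/')  # load once and reuse for both searches
    unionss = find_union_candidates(datasets)
    joinss = find_join_candidates(datasets)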


In [None]:
discovery_algorithm()



0
1
2
3
4
5
6
7
8
9


  datasets[i] = pd.read_csv(file_path, on_bad_lines="skip")


10
11
12
13
14
15
16
17
18
19
0
1
2
3
4
5
6
7
8
9


  datasets[i] = pd.read_csv(file_path, on_bad_lines="skip")


10
11
12
13
14
15
16
17
18
19


In [None]:
unionss

In [None]:
joinss

You will have noticed that the data has some quality issues, which may have made the discovery step troublesome.

Please try to clean the data.

After cleaning, check whether the results of the data discovery have changed.

Please explain this in your report, and try to match each error with the observation it caused.

In [None]:
## Cleaning data, scrubbing, washing, mopping

def cleaningData(data):
    """Function should be able to clean the data
    Possible Input: List of datasets
    Output: List of cleaned datasets
    """

    pass
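Which cleaning steps are appropriate depends on the errors actually present in the tables. As a starting point, the sketch below shows a few generic, conservative steps for a single table: normalizing header spelling, trimming whitespace, mapping common missing-value markers to NaN, converting fully numeric text columns, and dropping duplicate rows. `clean_table` and the list of markers are hypothetical; the real data may call for different repairs.

```python
import pandas as pd

MISSING_MARKERS = ["", "?", "N/A", "null"]  # assumed markers; adjust to the data

def clean_table(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few generic cleaning steps to one table and return a new DataFrame."""
    out = df.copy()
    # Normalize header spelling so that ' Name ' and 'name' compare equal.
    out.columns = [str(c).strip().lower().replace(" ", "_") for c in out.columns]
    for col in out.columns:
        is_text = pd.api.types.is_object_dtype(out[col]) or pd.api.types.is_string_dtype(out[col])
        if not is_text:
            continue
        # Trim stray whitespace around string values.
        s = out[col].astype(object).map(lambda v: v.strip() if isinstance(v, str) else v)
        # Replace missing-value markers with NaN.
        s = s.mask(s.isin(MISSING_MARKERS))
        # Convert to numeric only if every non-missing value parses as a number.
        as_num = pd.to_numeric(s, errors="coerce")
        out[col] = as_num if as_num.notna().sum() == s.notna().sum() else s
    return out.drop_duplicates().reset_index(drop=True)

raw = pd.DataFrame({
    " Name ": [" Ann", "Bob ", "Bob ", "?"],
    "Age": ["31", " 42", " 42", "N/A"],
})
cleaned = clean_table(raw)
print(cleaned)
```

Applied to every table before re-running the discovery step, a function like this makes it possible to compare the Join and Union candidates before and after cleaning.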

## Discussions

1)  Different aspects of the data can affect the data discovery process. Write a short report on your findings, for example: which data quality issues had the largest effect on data discovery, which data quality problems were repairable, and how you chose to repair them.

<!-- For the set of considerations that you have outlined for the choice of data discovery methods, choose one and identify under this new constraint, how would you identify and resolve this problem? -->

Max 400 words