# Task 1

**Introduction**  
In Task 1, we tried eight classifiers (including the baseline classifier). In our tests, the **DecisionTreeClassifier** achieved the highest accuracy on the test set. We then tuned one hyperparameter, the maximum depth (`max_depth`), to optimize it.

DecisionTreeClassifier is a supervised learning algorithm that splits data into branches based on feature conditions, forming a tree-like structure.
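To make the idea of "splitting on feature conditions" concrete, here is a minimal sketch on a tiny made-up dataset (the values, the feature name `x`, and the labels are hypothetical and unrelated to the CTG data). It fits a tree and prints the learned rules with `export_text`.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical, not the CTG dataset): one feature, two classes.
X_toy = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y_toy = ['low', 'low', 'low', 'high', 'high', 'high']

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# Print the learned rules as indented text; each level is one feature condition.
rules = export_text(tree, feature_names=['x'])
print(rules)

# Classify two new points by following the rules from the root to a leaf.
toy_predictions = tree.predict([[2.5], [10.5]])
print(toy_predictions)
```

Because the two classes are cleanly separated, a single split is enough here; on real data the tree keeps splitting until the leaves are pure or a limit such as `max_depth` is reached.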

| Classifier                 | Test accuracy      |
|----------------------------|--------------------|
| DummyClassifier            | 0.7699530516431925 |
| DecisionTreeClassifier     | 0.9061032863849765 |
| RandomForestClassifier     | 0.8474178403755869 |
| GradientBoostingClassifier | 0.8896713615023474 |
| Perceptron                 | 0.8145539906103286 |
| LogisticRegression         | 0.8732394366197183 |
| LinearSVC                  | 0.8849765258215962 |
| MLPClassifier              | 0.8356807511737089 |
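A comparison like the one in the table can also be produced in a single loop. The following is only a sketch under stated assumptions: it uses synthetic data from `make_classification` instead of the CTG file, only three of the eight classifiers, and hypothetical names (`candidates`, `results`), so the numbers it prints are not our results.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=0)

candidates = {
    'DummyClassifier': DummyClassifier(strategy='most_frequent'),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'LogisticRegression': LogisticRegression(max_iter=1000),
}

results = {}
for name, model in candidates.items():
    # Mean 5-fold cross-validation accuracy on the training set.
    cv_mean = cross_val_score(model, X_tr, y_tr).mean()
    # Accuracy on the held-out test set after fitting on all training data.
    test_acc = model.fit(X_tr, y_tr).score(X_te, y_te)
    results[name] = (cv_mean, test_acc)
    print(f'{name}: CV mean={cv_mean:.3f}, test={test_acc:.3f}')
```

Keeping every model in one dictionary makes it harder to accidentally evaluate two classifiers under different splits.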


**Preparation**  
In this step, we read and shuffle the dataset. Then, we partition the data into training and test sets.

In [72]:
import pandas as pd
from sklearn.model_selection import train_test_split

LOCATION_OF_THE_FILE = '/Users/zhuangzhuanggong/Downloads/CTG.csv'

# Read the CSV file.
data = pd.read_csv(LOCATION_OF_THE_FILE, skiprows=1)

# Select the relevant numerical columns.
selected_cols = ['LB', 'AC', 'FM', 'UC', 'DL', 'DS', 'DP', 'ASTV', 'MSTV', 'ALTV',
                 'MLTV', 'Width', 'Min', 'Max', 'Nmax', 'Nzeros', 'Mode', 'Mean',
                 'Median', 'Variance', 'Tendency', 'NSP']
data = data[selected_cols].dropna()

# Shuffle the dataset.
data_shuffled = data.sample(frac=1.0, random_state=0)

# Split into input part X and output part Y.
X = data_shuffled.drop('NSP', axis=1)

# Map the diagnosis code to a human-readable label.
def to_label(y):
    return [None, 'normal', 'suspect', 'pathologic'][(int(y))]

Y = data_shuffled['NSP'].apply(to_label)

# Partition the data into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)
X.head()

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,...,Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,Tendency
658,130.0,1.0,0.0,3.0,0.0,0.0,0.0,24.0,1.2,12.0,...,35.0,120.0,155.0,1.0,0.0,134.0,133.0,135.0,1.0,0.0
1734,134.0,9.0,1.0,8.0,5.0,0.0,0.0,59.0,1.2,0.0,...,109.0,80.0,189.0,6.0,0.0,150.0,146.0,150.0,33.0,0.0
1226,125.0,1.0,0.0,4.0,0.0,0.0,0.0,43.0,0.7,31.0,...,21.0,120.0,141.0,0.0,0.0,131.0,130.0,132.0,1.0,0.0
1808,143.0,0.0,0.0,1.0,0.0,0.0,0.0,69.0,0.3,6.0,...,27.0,132.0,159.0,1.0,0.0,145.0,144.0,146.0,1.0,0.0
825,152.0,0.0,0.0,4.0,0.0,0.0,0.0,62.0,0.4,59.0,...,25.0,136.0,161.0,0.0,0.0,159.0,156.0,158.0,1.0,1.0


**Training the baseline classifier**

In [74]:
from sklearn.model_selection import cross_val_score
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
clf = DummyClassifier(strategy='most_frequent')
print(cross_val_score(clf, Xtrain, Ytrain))

clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

[0.78235294 0.78235294 0.77941176 0.77941176 0.77941176]
0.7699530516431925
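With `strategy='most_frequent'`, the baseline always predicts the most frequent training label, so the test accuracy printed above is simply the share of test samples that carry that label. A minimal sketch with made-up labels (hypothetical counts, not the CTG class distribution) shows the relationship:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical labels: 8 of 10 samples belong to the majority class.
y_demo = np.array(['normal'] * 8 + ['suspect'] * 1 + ['pathologic'] * 1)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the baseline

baseline = DummyClassifier(strategy='most_frequent').fit(X_demo, y_demo)
baseline_accuracy = accuracy_score(y_demo, baseline.predict(X_demo))

# The accuracy equals the fraction of samples in the majority class.
majority_share = np.mean(y_demo == 'normal')
print(baseline_accuracy, majority_share)
```

Any classifier worth keeping should score clearly above this value.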


**Training DecisionTreeClassifier**  
In our tests, the DecisionTreeClassifier performed best. A decision tree that scores well can still overfit, especially when the tree is deep, so we tuned its maximum depth. Because we did not know in advance which depth would work best, we searched over a list of candidate depths. The mean cross-validation accuracy rose slightly as a result, from about 0.924 to about 0.928.  
We therefore chose this classifier, with the tuned maximum depth, as our final classifier for evaluation.  

First, we tried the classifier without tuning hyperparameters.

In [77]:
#DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
cross_val_score(clf, Xtrain, Ytrain)

array([0.91470588, 0.94117647, 0.91176471, 0.90588235, 0.94705882])

Since this classifier outperformed the others we tried, we then tuned `max_depth` by searching over a list of candidate values.

In [79]:
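#DecisionTreeClassifier with optimization of max_depth
from sklearn.model_selection import GridSearchCV

# Candidate depths; None lets the tree grow until its leaves are pure.
param_grid = {
    'max_depth': [3, 5, 10, None],
}

# Inner 5-fold cross-validation selects the best depth.
grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
grid_search.fit(Xtrain, Ytrain)
# Outer cross-validation: the whole grid search is re-run on each fold.
cross_val_score(grid_search, Xtrain, Ytrain)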

array([0.92647059, 0.94411765, 0.91176471, 0.91764706, 0.94117647])
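The scores above do not show which depth the search selected. A minimal sketch of how this could be inspected, using synthetic data from `make_classification` as a stand-in for the CTG data (an assumption, so the printed numbers do not correspond to our results):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [3, 5, 10, None]},
                      cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

# The depth with the highest mean cross-validation accuracy, and that accuracy.
print(search.best_params_)
print(search.best_score_)

# Mean cross-validation accuracy for every candidate depth.
for depth, score in zip(search.cv_results_['param_max_depth'],
                        search.cv_results_['mean_test_score']):
    print(depth, round(score, 3))
```

After fitting, `search.predict` uses the tree refitted on the full training data with the selected depth, which is what the final evaluation relies on.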

Finally, we trained it on the whole training set and evaluated it on the held-out test set.

In [81]:
#DecisionTreeClassifier with tuned max_depth: final evaluation
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 10, None],
}

grid_search = GridSearchCV(clf, param_grid, cv=5, scoring='accuracy')
# cross_val_score clones the estimator, so no separate fit is needed here.
print(cross_val_score(grid_search, Xtrain, Ytrain))

# Fit the grid search on the whole training set, then evaluate on the test set.
grid_search.fit(Xtrain, Ytrain)
Yguess = grid_search.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

[0.92647059 0.94411765 0.91176471 0.91764706 0.94117647]
0.9061032863849765


**Training RandomForestClassifier**

In [83]:
#RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

# max_depth=2 limits every tree in the forest to two levels of splits.
clf = RandomForestClassifier(max_depth=2, random_state=0)
print(cross_val_score(clf, Xtrain, Ytrain))

# Fit on the whole training set and evaluate on the held-out test set.
clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

[0.87352941 0.87941176 0.86176471 0.86176471 0.88529412]
0.8474178403755869


**Training GradientBoostingClassifier**

In [85]:
#GradientBoostingClassifier
from sklearn.ensemble import GradientBoostingClassifier

# 100 boosting stages of depth-1 trees (decision stumps).
clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,
    max_depth=1, random_state=0)
print(cross_val_score(clf, Xtrain, Ytrain))

# Fit on the whole training set and evaluate on the held-out test set.
clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

[0.90294118 0.94117647 0.9        0.91470588 0.92647059]
0.8896713615023474


**Training Perceptron**

In [87]:
#Perceptron
from sklearn.linear_model import Perceptron
clf = Perceptron(tol=1e-3, random_state=0)
print(cross_val_score(clf, Xtrain, Ytrain))

# Fit on the whole training set and evaluate on the held-out test set.
clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

[0.84705882 0.88529412 0.82941176 0.85294118 0.71176471]
0.8145539906103286


**Training LogisticRegression**  
We first tried the default solver for this classifier, but the model did not converge even after we increased the number of iterations. We therefore switched to the `saga` solver, which, to our understanding, handles multi-class classification problems more effectively.

In [89]:
#LogisticRegression
from sklearn.linear_model import LogisticRegression

# saga solver with a generous iteration limit so that the optimizer can converge.
clf = LogisticRegression(random_state=0, solver='saga', max_iter=10000)
print(cross_val_score(clf, Xtrain, Ytrain))

# Fit on the whole training set and evaluate on the held-out test set.
clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))


[0.88529412 0.89411765 0.87058824 0.88529412 0.88235294]
0.8732394366197183
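A common cause of convergence problems in logistic regression is features on very different scales. We did not verify that this was the cause here, so the following is only a sketch of an alternative we could try: standardizing the features in a pipeline before the default solver. It uses synthetic, deliberately rescaled data (an assumption), and the name `scaled_logreg` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class stand-in for the real data (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)
# Put the columns on very different scales to mimic unscaled measurements.
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000, 1, 10])

# The scaler is fitted inside each cross-validation fold, so no information
# from the validation part leaks into the scaling.
scaled_logreg = make_pipeline(StandardScaler(),
                              LogisticRegression(random_state=0, max_iter=1000))
scores = cross_val_score(scaled_logreg, X_demo, y_demo)
print(scores)
```

The same pipeline pattern is already used for LinearSVC in this notebook, so the two linear models would then be compared on equally preprocessed inputs.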


**Training LinearSVC**

In [91]:
#LinearSVC
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize the features before fitting the linear SVM.
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
print(cross_val_score(clf, Xtrain, Ytrain))

# Fit on the whole training set and evaluate on the held-out test set.
clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

[0.88529412 0.90882353 0.86176471 0.89705882 0.90294118]
0.8849765258215962


**Training MLPClassifier**

In [93]:
#MLPClassifier
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(random_state=1, max_iter=300)
print(cross_val_score(clf, Xtrain, Ytrain))

# Fit on the whole training set and evaluate on the held-out test set.
clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))

[0.84117647 0.82058824 0.83529412 0.90588235 0.87647059]
0.8356807511737089


# Task 2