Prep

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
  
# Read the CSV file.
data = pd.read_csv('data.csv', skiprows=1)

# Select the relevant numerical columns.
selected_cols = ['LB', 'AC', 'FM', 'UC', 'DL', 'DS', 'DP', 'ASTV', 'MSTV', 'ALTV',
                 'MLTV', 'Width', 'Min', 'Max', 'Nmax', 'Nzeros', 'Mode', 'Mean',
                 'Median', 'Variance', 'Tendency', 'NSP']
data = data[selected_cols].dropna()

# Shuffle the dataset.
data_shuffled = data.sample(frac=1.0, random_state=0)

# Split into input part X and output part Y.
X = data_shuffled.drop('NSP', axis=1)

# Map the diagnosis code to a human-readable label.
def to_label(y):
    return [None, 'normal', 'suspect', 'pathologic'][(int(y))]

Y = data_shuffled['NSP'].apply(to_label)

# Partition the data into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)

In [2]:
X.head()

Unnamed: 0,LB,AC,FM,UC,DL,DS,DP,ASTV,MSTV,ALTV,...,Width,Min,Max,Nmax,Nzeros,Mode,Mean,Median,Variance,Tendency
658,130.0,1.0,0.0,3.0,0.0,0.0,0.0,24.0,1.2,12.0,...,35.0,120.0,155.0,1.0,0.0,134.0,133.0,135.0,1.0,0.0
1734,134.0,9.0,1.0,8.0,5.0,0.0,0.0,59.0,1.2,0.0,...,109.0,80.0,189.0,6.0,0.0,150.0,146.0,150.0,33.0,0.0
1226,125.0,1.0,0.0,4.0,0.0,0.0,0.0,43.0,0.7,31.0,...,21.0,120.0,141.0,0.0,0.0,131.0,130.0,132.0,1.0,0.0
1808,143.0,0.0,0.0,1.0,0.0,0.0,0.0,69.0,0.3,6.0,...,27.0,132.0,159.0,1.0,0.0,145.0,144.0,146.0,1.0,0.0
825,152.0,0.0,0.0,4.0,0.0,0.0,0.0,62.0,0.4,59.0,...,25.0,136.0,161.0,0.0,0.0,159.0,156.0,158.0,1.0,1.0


Step 2. Training the baseline classifier

We can now start to investigate different classifiers.

The DummyClassifier is a simple classifier that does not make use of the features: it just returns the most common label in the training set, which here is one of normal, suspect, or pathologic. The purpose of such a trivial classifier is to serve as a baseline: a simple classifier that we can try before we move on to more complex classifiers.

In [None]:
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(strategy='most_frequent')

To get an idea of how well our simple classifier works, we carry out a cross-validation over the training set and compute the classification accuracy on each fold.

In [6]:
from sklearn.model_selection import cross_val_score

cross_val_score(clf, Xtrain, Ytrain)

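As a minimal sketch of what the cross-validation reports for a most-frequent baseline, the snippet below uses small synthetic data instead of the assignment's dataset. The names X_demo, y_demo and baseline, as well as the 70/20/10 label mix, are assumptions made for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption, not the assignment's dataset):
# 100 rows, 3 random features, and labels that are 70% 'normal'.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array(['normal'] * 70 + ['suspect'] * 20 + ['pathologic'] * 10)

baseline = DummyClassifier(strategy='most_frequent')

# By default, cross_val_score uses 5 folds (stratified for classifiers)
# and scores each fold with the estimator's own score method, i.e. accuracy.
fold_scores = cross_val_score(baseline, X_demo, y_demo)

print(fold_scores)         # one accuracy value per fold
print(fold_scores.mean())  # average accuracy over the folds
```

Because the baseline ignores the features, its accuracy is simply the share of the majority label in each test fold. Any real classifier should beat this number.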

Step 3. Trying out some different classifiers

Replace the DummyClassifier with some more meaningful classifier and run the cross-validation again. Try out a few classifiers and see how much you can improve the cross-validation accuracy. Remember, the accuracy is defined as the proportion of correctly classified instances, and we want this value to be high.

Here are some possible options:

Tree-based classifiers:

    sklearn.tree.DecisionTreeClassifier
    sklearn.ensemble.RandomForestClassifier
    sklearn.ensemble.GradientBoostingClassifier

Linear classifiers:

    sklearn.linear_model.Perceptron
    sklearn.linear_model.LogisticRegression
    sklearn.svm.LinearSVC

Neural network classifier (takes longer to train):

    sklearn.neural_network.MLPClassifier
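One possible way to compare several of these classifiers is to loop over them and cross-validate each one. The sketch below runs on synthetic three-class data from make_classification, not on the assignment's data. The names X_demo, y_demo, candidates and mean_accuracy are illustrative, and the chosen models and settings are only examples.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for Xtrain/Ytrain (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

# A few candidate classifiers, including the baseline for reference.
candidates = {
    'dummy': DummyClassifier(strategy='most_frequent'),
    'decision tree': DecisionTreeClassifier(random_state=0),
    'random forest': RandomForestClassifier(n_estimators=50, random_state=0),
    'logistic regression': LogisticRegression(max_iter=1000),
}

# Cross-validate each candidate and record its mean accuracy.
mean_accuracy = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_accuracy[name] = scores.mean()
    print(f'{name}: {mean_accuracy[name]:.3f}')
```

With the real data, X_demo and y_demo would be replaced by Xtrain and Ytrain. The test set stays untouched until Step 4.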

You may also try to tune the hyperparameters of the various classifiers to improve the performance. For instance, the decision tree classifier has a parameter that sets the maximum depth, and in the neural network classifier you can control the number of layers and the number of neurons in each layer.
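A simple way to tune a single hyperparameter is to cross-validate the model once per candidate value and keep the best one. The following is a sketch under assumptions: the data is synthetic, the candidate depths are arbitrary, and the names depth_scores and best_depth are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for Xtrain/Ytrain (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

# Mean cross-validation accuracy for each candidate maximum depth.
depth_scores = {}
for depth in [1, 2, 4, 8, 16]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    depth_scores[depth] = cross_val_score(tree, X_demo, y_demo, cv=5).mean()

# Pick the depth with the highest mean accuracy.
best_depth = max(depth_scores, key=depth_scores.get)
print(depth_scores)
print('best max_depth:', best_depth)
```

For MLPClassifier, the corresponding parameter is hidden_layer_sizes, where for example (64, 32) means two hidden layers with 64 and 32 neurons.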

Step 4. Final evaluation

When you have found a classifier that gives a high accuracy in the cross-validation evaluation, train it on the whole training set and evaluate it on the held-out test set.

In [7]:
from sklearn.metrics import accuracy_score
  
clf.fit(Xtrain, Ytrain)
Yguess = clf.predict(Xtest)
print(accuracy_score(Ytest, Yguess))


For the report. In your submitted report, include a list of three classifiers you tried in Step 3 and their accuracies, describe the classifier you selected in Step 4, and report its accuracy. (At this point, we are not asking you to describe the internal workings of the machine learning models, which we will cover in detail later in the course, but you are free to read about them if you're interested.)

-----------

Task 2: Decision trees for classification

Download the code that was shown during the lecture and use the defined class TreeClassifier as your classifier in an experiment similar to those in Task 1, using the same dataset. (Alternatively, you can create a new notebook and copy the code from this page.) Tune the hyperparameter max_depth to get the best cross-validation performance, and then evaluate the classifier on the test set.

For the report. In your submitted report, please mention what value of max_depth you selected and what accuracy you got.

For illustration, let's also draw a tree. Set max_depth to a reasonably small value (not necessarily the one you selected above) and then call draw_tree to visualize the learned decision tree. Include this tree in your report.
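The lecture's TreeClassifier and its draw_tree method are not reproduced here. As a stand-in, the sketch below fits scikit-learn's DecisionTreeClassifier with a small max_depth and prints the learned tree with export_text, which gives a rough idea of what a shallow tree looks like. The data is synthetic and the feature names f0 to f3 are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic three-class data with four features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names

# A small max_depth keeps the tree readable.
shallow_tree = DecisionTreeClassifier(max_depth=2, random_state=0)
shallow_tree.fit(X_demo, y_demo)

# Text rendering of the tree: one line per split or leaf.
tree_text = export_text(shallow_tree, feature_names=feature_names)
print(tree_text)
```

Each indented line is a threshold test on one feature, and each leaf line shows the predicted class.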

-------------

Task 3: A regression example: predicting apartment prices

Here is another dataset. This dataset was created by Sberbank and contains some statistics from the Russian real estate market. Here is the Kaggle page where you can find the original data.

Since we can only handle numerical features and not symbolic ones, we'll work with a simplified version of the dataset, keeping just a small number of the columns. The goal is to predict the price of an apartment, given numerical information such as the number of rooms, the size of the apartment in square meters, the floor, etc. Our approach will be similar to what we did in the classification example: load the data, find a suitable model using cross-validation over the training set, and finally evaluate on the held-out test data.

The following code snippet will carry out the basic reading and preprocessing of the data.

In [None]:
import numpy as np

# Read the CSV file using Pandas.
alldata = pd.read_csv('sberbank.csv')

# Convert the timestamp string to an integer representing the year.
def get_year(timestamp):
    return int(timestamp[:4])
alldata['year'] = alldata.timestamp.apply(get_year)

# Select the 7 input columns and the output column (price_doc).
selected_columns = ['price_doc', 'year', 'full_sq', 'life_sq', 'floor', 'num_room', 'kitch_sq', 'full_all']
alldata = alldata[selected_columns]
alldata = alldata.dropna()

# Shuffle.
alldata_shuffled = alldata.sample(frac=1.0, random_state=0)

# Separate the input and output columns.
X = alldata_shuffled.drop('price_doc', axis=1)
# For the output, we'll use the log of the sales price.
Y = alldata_shuffled['price_doc'].apply(np.log)

# Split into training and test sets.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, test_size=0.2, random_state=0)

We train a baseline dummy regressor (which always predicts the same value) and evaluate it in a cross-validation setting.

This example looks quite similar to the classification example above. The main differences are (a) that we are predicting numerical values, not symbolic values, and (b) that we are evaluating using the mean squared error metric, not the accuracy metric that we used to evaluate the classifiers.
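To make the difference between the two metrics concrete, here is a small worked example with made-up numbers (all values below are illustrative, not taken from either dataset). Accuracy counts exact label matches, while the mean squared error averages the squared differences between predicted and true values.

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: accuracy is the fraction of exactly matching labels.
labels_true = ['normal', 'suspect', 'normal', 'pathologic']
labels_pred = ['normal', 'normal', 'normal', 'pathologic']
acc = accuracy_score(labels_true, labels_pred)  # 3 of 4 labels match

# Regression: mean squared error averages the squared residuals.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])
mse_manual = np.mean((y_true - y_pred) ** 2)  # (0.25 + 0 + 4 + 1) / 4
mse_sklearn = mean_squared_error(y_true, y_pred)

print(acc)
print(mse_manual, mse_sklearn)
```

A higher accuracy is better, whereas a lower mean squared error is better.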

In [9]:
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import cross_validate
m1 = DummyRegressor()
cross_validate(m1, Xtrain, Ytrain, scoring='neg_mean_squared_error')



Replace the dummy regressor with something more meaningful and iterate until you cannot improve the performance. Please note that with scoring='neg_mean_squared_error', the cross_validate function reports the negative mean squared error, so scores closer to zero are better.
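The sign convention can be seen in a small sketch on synthetic regression data. The data and the names X_demo, y_demo, dummy_mse and linear_mse are assumptions for illustration. The per-fold scores live under the 'test_score' key of the dictionary that cross_validate returns, and negating them gives the usual mean squared error.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate

# Synthetic data: the target depends linearly on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

def mean_cv_mse(model):
    """Return the mean squared error averaged over the cross-validation folds."""
    results = cross_validate(model, X_demo, y_demo,
                             scoring='neg_mean_squared_error')
    neg_mse = results['test_score']  # one non-positive value per fold
    return -neg_mse.mean()           # flip the sign to get the MSE

dummy_mse = mean_cv_mse(DummyRegressor())
linear_mse = mean_cv_mse(LinearRegression())

print('dummy MSE: ', dummy_mse)
print('linear MSE:', linear_mse)
```

scikit-learn negates error metrics so that a higher score is always better, which is why the raw values are negative.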

Some possible regression models that you can try:

    sklearn.linear_model.LinearRegression
    sklearn.linear_model.Ridge
    sklearn.linear_model.Lasso
    sklearn.tree.DecisionTreeRegressor
    sklearn.ensemble.RandomForestRegressor
    sklearn.ensemble.GradientBoostingRegressor
    sklearn.neural_network.MLPRegressor

Finally, train on the full training set and evaluate on the held-out test set:

In [10]:
from sklearn.metrics import mean_squared_error
  
# regr is the regression model you selected above.
regr.fit(Xtrain, Ytrain)
mean_squared_error(Ytest, regr.predict(Xtest))


For the report. In your submitted report, list all the regression models you tried, state which one you selected for the final evaluation, and report its evaluation score.