
##Task 1: Data visualisation and pre-processing


---



1. How many records are official QV data?
2. How many records are RBNZ estimate?

In [127]:
from tabulate import tabulate

data = [['Official QV data', 81],
        ['RBNZ estimate', 88],
        ['Total', 169]]

col_names = ["Record Type", "Records"]

print(tabulate(data, headers=col_names, tablefmt="grid", showindex="always"))

+----+------------------+-----------+
|    | Record Type      |   Records |
+====+==================+===========+
|  0 | Official QV data |        81 |
+----+------------------+-----------+
|  1 | RBNZ estimate    |        88 |
+----+------------------+-----------+
|  2 | Total            |       169 |
+----+------------------+-----------+




---


3. Read the CSV file as a Pandas data frame

In [128]:
import pandas as pd

#this creates a dataframe from csv file
mydata = pd.read_csv('/content/sample_data/housing.csv') 

#print(mydata.to_string())




---


4. Get statistics of the data using python

In [129]:
mydata.describe()

Unnamed: 0,Value of housing $billion,House prices,"HPI for houses, index"
count,169.0,125.0,129.0
mean,425.053254,7.076,1337.831783
std,394.275427,7.385987,802.662676
min,25.0,-9.0,466.3
25%,123.0,2.3,695.9
50%,232.0,6.4,1299.9
75%,613.7,12.3,1714.9
max,1763.1,28.7,3893.5




---


5. Does the data contain missing values?

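// MY ANSWER: Yes. describe() counts 169 values for Value of housing but only 125 for House prices and 129 for HPI, and isnull() shows NaN in both of those columns for rows 0-4.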

In [130]:
import numpy as np

mydata.isnull()   # this function returns dataframe bool values which are true
                  # when the value is NaN



Unnamed: 0,Year,Value of housing $billion,House prices,"HPI for houses, index"
0,False,False,True,True
1,False,False,True,True
2,False,False,True,True
3,False,False,True,True
4,False,False,True,True
...,...,...,...,...
164,False,False,False,False
165,False,False,False,False
166,False,False,False,False
167,False,False,False,False
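
A per-column count is often easier to read than the full boolean frame. The sketch below uses a small made-up frame that borrows the notebook's column names (the values are illustrative and not taken from housing.csv); on the real data the equivalent call would be `mydata.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with the notebook's column names; values are invented for illustration.
toy = pd.DataFrame({
    "Year": ["1989", "1990", "1991", "1992"],
    "Value of housing $billion": [25.0, 26.1, 27.3, 28.0],
    "House prices": [np.nan, np.nan, 2.3, 6.4],
    "HPI for houses, index": [np.nan, 466.3, 470.0, 480.2],
})

# isnull() marks each NaN as True; sum() counts the True values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# any() over the whole frame answers the yes/no question directly.
has_missing = bool(toy.isnull().values.any())
print("Contains missing values:", has_missing)
```
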


6. Draw the graph of House prices annual change (in percentage) using matplotlib or seaborn using a suitable chart of your choice.

In [131]:
import plotly.express as px

fig = px.line(mydata, x = 'Year', y = 'House prices', title = 'House Prices Annual Change (%)')
fig.show()
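
The question asks for matplotlib or seaborn, while the cell above uses plotly. A minimal matplotlib sketch of the same line chart could look like the following. The sample values are invented, and `toy` stands in for `mydata`; only the column names come from the notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample values; in the notebook, `mydata` would be used instead.
toy = pd.DataFrame({
    "Year": [2017, 2018, 2019, 2020, 2021],
    "House prices": [6.4, 3.1, 2.3, 12.3, 28.7],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(toy["Year"], toy["House prices"], marker="o")
ax.axhline(0, color="grey", linewidth=0.8)  # zero line separates rises from falls
ax.set_xlabel("Year")
ax.set_ylabel("Annual change (%)")
ax.set_title("House Prices Annual Change (%)")
fig.tight_layout()
```

In Colab the figure renders inline at the end of the cell; in a script, `plt.show()` or `fig.savefig(...)` would be needed.
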





---


7. Combine the previous graph with a graph of HPI for houses
8. What can you tell about the trend of the housing price in New Zealand?

// MY ANSWER: The annual change shows sharp increases around '94, '03, '15 and '21, and sharp decreases around '91, '98 and '09. Overall, the graph shows house prices increasing steadily over the period.

In [132]:
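import plotly.graph_objects as go
from plotly.subplots import make_subplots

# the two series have very different scales (percent vs. index points),
# so the index is drawn against a secondary y-axis
fig = make_subplots(specs = [[{"secondary_y": True}]])
fig.add_trace(
    go.Scatter(x = mydata['Year'], y = mydata['House prices'],
               mode = 'lines+markers',
               name = 'House prices, annual change (%)'),
    secondary_y = False
)
fig.add_trace(
    go.Scatter(x = mydata['Year'], y = mydata['HPI for houses, index'],
               name = 'HPI for houses (index)'),
    secondary_y = True
)

fig.update_layout(title = 'House Price Change in NZ 1989-2021')
fig.update_yaxes(title_text = 'Annual change (%)', secondary_y = False)
fig.update_yaxes(title_text = 'HPI index', secondary_y = True)
fig.show()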

##Task 2
 

---



*   Load the MNIST data, and split it into a training set, a validation set, and a test set (e.g., use the first 40,000 instances for training, the next 10,000 for validation, and the last 10,000 for testing).
*   Then train various classifiers of your choice, such as a Decision tree, Random Forest classifier, and an SVM, etc.
*   Hyperparameters tuning can be used to find the optimal models of each (if applicable). Alternatively, you can provide some discussion about possible tuning for the current model (no implementation needed)
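
The 40,000 / 10,000 / 10,000 split described in the first bullet can be sketched with plain positional slicing. The arrays below are zero-filled stand-ins with MNIST's row width, and the names `X_all` and `y_all` are hypothetical; the notebook itself uses `train_test_split` with a 75/25 split instead.

```python
import numpy as np

# Stand-in arrays shaped like 60,000 flattened 28x28 images;
# zeros are used so the sketch runs without downloading anything.
X_all = np.zeros((60000, 784), dtype="uint8")
y_all = np.zeros(60000, dtype="int64")

# Positional slicing: first 40,000 / next 10,000 / last 10,000.
X_train, y_train = X_all[:40000], y_all[:40000]
X_val, y_val = X_all[40000:50000], y_all[40000:50000]
X_test, y_test = X_all[50000:], y_all[50000:]

print(X_train.shape, X_val.shape, X_test.shape)
```

Slicing keeps the original order, so it only gives a fair split when the rows are not sorted by label; shuffling first is one way to guard against that.
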

### Method 1: Simple Neural Network

---



In [133]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [134]:
# first install dependencies 
! [ ! -z "$COLAB_GPU" ] && pip install torch scikit-learn==0.20.* skorch



In [135]:
# this method trains a simple fully connected network first, then a CNN
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt

# load MNIST data using scikit-learn's fetch_openml (very large file)
mnist = fetch_openml('mnist_784', cache=False)

mnist.data.shape  # 70,000 instances with the vector dimension of 784

(70000, 784)

In [136]:
#preprocessing data: loading MNIST data returns data and target as uint8 which
# need to be converted to float32 and int64 respectively
X = mnist.data.astype('float32')
y = mnist.target.astype('int64')

In [137]:
#preprocessing data: scale pixel values from [0, 255] down to [0, 1]
X /= 255.0

In [138]:
X.min(), X.max()

(0.0, 1.0)

In [139]:
# split instances into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [140]:
assert(X_train.shape[0] + X_test.shape[0] == mnist.data.shape[0])

In [141]:
X_train.shape, y_train.shape

((52500, 784), (52500,))

In [142]:
# build a neural network with PyTorch
import torch 
from torch import nn 
import torch.nn.functional as F

In [143]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [144]:
mnist_dim = X.shape[1]
hidden_dim = int(mnist_dim/8)
output_dim = len(np.unique(mnist.target))

In [145]:
# input layer, hidden layer and output layer
mnist_dim, hidden_dim, output_dim

(784, 98, 10)

In [146]:
# create classifier module with PyTorch
class ClassifierModule(nn.Module):
  def __init__(
      self,
      input_dim = mnist_dim,
      hidden_dim = hidden_dim,
      output_dim = output_dim,
      dropout = 0.5 
  ):
    super(ClassifierModule, self).__init__()
    self.dropout = nn.Dropout(dropout)
    self.hidden = nn.Linear(input_dim, hidden_dim)
    self.output = nn.Linear(hidden_dim, output_dim)

  def forward(self, X, **kwargs):
    X = F.relu(self.hidden(X))
    X = self.dropout(X)
    X = F.softmax(self.output(X), dim = 1)
    return X
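
As a quick check on the size of this network, the number of trainable parameters follows directly from the three layer sizes printed earlier (784, 98, 10). The helper name `linear_params` is made up for this sketch; it only does the arithmetic for a fully connected layer with a bias term.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

hidden = linear_params(784, 98)   # 784*98 weights + 98 biases
output = linear_params(98, 10)    # 98*10 weights + 10 biases
total = hidden + output
print(hidden, output, total)      # 76930 990 77920
```

Dropout adds no parameters, so the two linear layers account for the whole count.
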

In [147]:
# now use skorch to use network in scikit-learn setting
from skorch import NeuralNetClassifier

In [148]:
torch.manual_seed(0)

net = NeuralNetClassifier(
    ClassifierModule,
    max_epochs = 20,
    lr = 0.1,
    device = device
)

In [149]:
net.fit(X_train, y_train);

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.8352       0.8786        0.4214  1.2014
      2        0.4340       0.9115        0.3170  0.9220
      3        0.3625       0.9208        0.2769  1.0045
      4        0.3235       0.9287        0.2388  0.9293
      5        0.2925       0.9343        0.2181  1.0383
      6        0.2722       0.9421        0.1959  1.4860
      7        0.2546       0.9441        0.1888  1.6397
      8        0.2439       0.9473        0.1766  1.9090
      9        0.2303       0.9515        0.1646  1.6195
     10        0.2237       0.9545        0.1562  2.1359
     11        0.2162       0.9545

In [150]:
# prediction
from sklearn.metrics import accuracy_score

In [151]:
y_pred = net.predict(X_test)

In [152]:
accuracy_score(y_test, y_pred)

0.9623428571428572

In [153]:
# reshape X into the 4-dimensional tensor the CNN expects (batch size, number of channels, height, width)
XCnn = X.reshape(-1, 1, 28, 28)

In [154]:
XCnn.shape

(70000, 1, 28, 28)

In [155]:
XCnn_train, XCnn_test, y_train, y_test = train_test_split(XCnn, y, test_size = 0.25, random_state = 42)

In [156]:
XCnn_train.shape, y_train.shape

((52500, 1, 28, 28), (52500,))

In [157]:
#define Convolutional Neural Network (CNN) class using torch
class Cnn(nn.Module):
  def __init__(self, dropout = 0.5):
    super(Cnn, self).__init__()
    self.conv1 = nn.Conv2d(1, 32, kernel_size = 3)
    self.conv2 = nn.Conv2d(32, 64, kernel_size = 3)
    self.conv2_drop = nn.Dropout2d(p = dropout)
    self.fc1 = nn.Linear(1600, 100)
    self.fc2 = nn.Linear(100, 10)
    self.fc1_drop = nn.Dropout(p = dropout)

  def forward(self, x):
    x = torch.relu(F.max_pool2d(self.conv1(x), 2))
    x = torch.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))

    x = x.view(-1, x.size(1) * x.size(2) * x.size(3))
    x = torch.relu(self.fc1_drop(self.fc1(x)))
    x = torch.softmax(self.fc2(x), dim = 1)
    return x
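
The 1600 inputs of `fc1` follow from the image size as it passes through the two convolution and pooling stages. The arithmetic below assumes what the class above uses: unpadded 3x3 convolutions with stride 1 and 2x2 max pooling. The helper names are made up for the sketch.

```python
def conv_out(size: int, kernel: int) -> int:
    """Output width of an unpadded, stride-1 convolution."""
    return size - kernel + 1

def pool_out(size: int, window: int) -> int:
    """Output width of non-overlapping max pooling (floor division)."""
    return size // window

side = 28
side = pool_out(conv_out(side, 3), 2)   # conv1: 28 -> 26, pool: 26 -> 13
side = pool_out(conv_out(side, 3), 2)   # conv2: 13 -> 11, pool: 11 -> 5
flattened = 64 * side * side            # 64 channels come out of conv2
print(side, flattened)                  # 5 1600
```

Changing the kernel size or the number of channels in `conv2` would change this figure, and `fc1` would have to be resized to match.
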

In [158]:
torch.manual_seed(0)

cnn = NeuralNetClassifier(
    Cnn,
    max_epochs = 10,
    lr = 0.002,
    optimizer = torch.optim.Adam,
    device = device
)

In [159]:
cnn.fit(XCnn_train, y_train);

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4152       0.9729        0.0880  1.9008
      2        0.1565       0.9800        0.0632  1.7807
      3        0.1272       0.9824        0.0568  1.7602
      4        0.1154       0.9838        0.0524  1.7683
      5        0.0981       0.9857        0.0456  1.7692
      6        0.0883       0.9865        0.0495  1.7763
      7        0.0842       0.9866        0.0423  1.7500
      8        0.0809       0.9884        0.0380  1.7738
      9        0.0775       0.9881        0.0378  1.7650
     10        0.0752       0.9866        0.0408  1.7763


In [160]:
y_pred_cnn = cnn.predict(XCnn_test)

In [161]:
accuracy_score(y_test, y_pred_cnn)

0.9872571428571428

### Method 2: Decision Tree

---



In [162]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# loading MNIST dataset
df = pd.read_csv(r"/content/sample_data/mnist_train_small.csv")
df.head()

Unnamed: 0,6,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,...,0.581,0.582,0.583,0.584,0.585,0.586,0.587,0.588,0.589,0.590
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
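
The column labels in the output above (6, 0, 0.1, 0.2, ...) suggest that the first data row of the CSV was read as a header, which would drop one digit from the dataset. A minimal sketch of the difference, using two invented rows in the same label-first layout (shortened to 4 pixels), is shown below. Whether the sample file really lacks a header row is an assumption based on that output.

```python
import io
import pandas as pd

# Two made-up rows in the same layout: label first, then pixel values
# (shortened to 4 pixels here instead of 784).
raw = "6,0,0,0,0\n5,0,0,0,0\n"

# Default behaviour: the first row is consumed as column names.
with_header = pd.read_csv(io.StringIO(raw))
print(with_header.shape)    # (1, 5)

# header=None keeps every row as data and numbers the columns 0..n-1.
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(no_header.shape)      # (2, 5)
print(no_header.iloc[:, 0].tolist())   # [6, 5]
```
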


In [163]:
# split into training and test sets
X = df.iloc[:, 1:]
y = df.iloc[:, 0]

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 7)


In [164]:
# call decision tree and fit data
from sklearn.tree import DecisionTreeClassifier

dtree = DecisionTreeClassifier()
dtree.fit(X_train, y_train)



DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [165]:
y_pred = dtree.predict(X_test)
y_pred

array([9, 4, 7, ..., 7, 3, 3])

In [166]:
# build the confusion matrix, then get accuracy from the classifier's score method
from sklearn.metrics import confusion_matrix

cmdtree = confusion_matrix(y_test, y_pred) 
dtree.score(X_test, y_test)

0.825
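
The accuracy can also be read off a confusion matrix directly: correct predictions sit on the diagonal, so accuracy is the diagonal sum divided by the total. The sketch below builds a small matrix by hand with NumPy from made-up labels, using the same rows-are-true, columns-are-predicted layout that `confusion_matrix` returns.

```python
import numpy as np

# Made-up labels for a 3-class problem, for illustration only.
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 1])
y_hat = np.array([0, 1, 1, 1, 2, 0, 2, 1])

# Build the matrix by hand: rows are true classes, columns are predictions.
n_classes = 3
cm = np.zeros((n_classes, n_classes), dtype=int)
np.add.at(cm, (y_true, y_hat), 1)

# Correct predictions sit on the diagonal.
accuracy = np.trace(cm) / cm.sum()
print(cm)
print(accuracy)   # 0.75
```

Applied to the notebook's matrix, this would be `np.trace(cmdtree) / cmdtree.sum()`, which should agree with `dtree.score(X_test, y_test)`.
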

// 82.5% could be better. Now let's re-split the data and fit a baseline decision tree with a limited depth.

In [167]:
import os
import random as rn 

# use the following seed for random_states
seed = 1234
np.random.seed(seed)
rn.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)

# accuracy (in %) between true labels and predictions (e.g. validation or test labels)
def acc(y_true : np.ndarray, y_pred : np.ndarray) -> float:
  return round(accuracy_score(y_true, y_pred) * 100, 2)

In [168]:

# re-split the full dataset into training and validation sets;
# the rows used as the test set earlier are folded back in before splitting
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size = 0.25, random_state = seed)

In [169]:
# train a baseline decision tree
base_dtree = DecisionTreeClassifier(max_depth = 10, random_state = seed)
base_dtree.fit(X_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=1234,
            splitter='best')

In [170]:
# evaluate the above baseline model

train_pred = base_dtree.predict(X_train)
valid_pred = base_dtree.predict(X_val)
# acc expects (y_true, y_pred)
acc_train = acc(y_train, train_pred)
acc_valid = acc(y_val, valid_pred)

print(f'Training accuracy:  {acc_train}%')
print(f'Validation accuracy:  {acc_valid}%')

Training accuracy:  92.71%
Validation accuracy:  82.58%
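
The gap between 92.71% on the training data and 82.58% on the validation data hints at overfitting, and `max_depth` is one hyperparameter that could be tuned against it. The sketch below sweeps a few depths and compares training and validation accuracy. It uses scikit-learn's small bundled 8x8 digits set as a stand-in, not the MNIST CSV from this notebook, so the numbers it prints will differ from those above; the depth values are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: scikit-learn's bundled 8x8 digits, not the notebook's MNIST CSV.
X_digits, y_digits = load_digits(return_X_y=True)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_digits, y_digits, test_size=0.25, random_state=1234
)

scores = {}
for depth in (2, 5, 10, 15, None):   # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1234)
    tree.fit(X_tr, y_tr)
    # score() returns mean accuracy on the given data
    scores[depth] = (tree.score(X_tr, y_tr), tree.score(X_va, y_va))

for depth, (train_acc, val_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, validation={val_acc:.3f}")

# Choose the depth by validation accuracy, never by training accuracy.
best_depth = max(scores, key=lambda d: scores[d][1])
print("best max_depth:", best_depth)
```

Training accuracy keeps rising with depth, so only the validation column says anything about how the tree will do on unseen digits.
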


In [171]:
train_pred = base_dtree.predict(X_train)
train_pred


array([3, 2, 6, ..., 1, 2, 7])

In [172]:
# confusion matrix and accuracy of the baseline tree on the training data

cm_base_dtree = confusion_matrix(y_train, train_pred) 
base_dtree.score(X_train, y_train)

0.9271284752316821

### Method 3: KNN


---



In [197]:
import pandas as pd

# once again, read from csv file
df = pd.read_csv(r"/content/sample_data/mnist_train_small.csv")
df.head()

Unnamed: 0,6,0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,...,0.581,0.582,0.583,0.584,0.585,0.586,0.587,0.588,0.589,0.590
0,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,9,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [199]:
# first column holds the label (y); the remaining columns hold the pixels (x)
x = df.iloc[:,1:]
y = df.iloc[:,0]

# split into train and test data sets, test size is 20% and random state is 7
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 7)

In [201]:
# import the KNN classifier and fit it to the training data
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn.fit(x_train, y_train)


KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=5, p=2,
           weights='uniform')

In [202]:
y_pred = knn.predict(x_test)

In [203]:
# now let's call accuracy score (true labels first, then predictions)
accuracy_score(y_test, y_pred)

0.959
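
The classifier printed above uses its defaults: 5 neighbours, Euclidean distance (`metric='minkowski'` with `p=2`) and uniform weights. What those settings mean can be sketched in a few lines of NumPy. This is an illustration of the idea on invented 2-D points, not scikit-learn's actual implementation, and the function name `knn_predict` is made up; tie-breaking in particular may differ.

```python
from collections import Counter
import numpy as np

def knn_predict(X_fit, y_fit, X_query, k=5):
    """Predict labels by majority vote among the k nearest training points."""
    predictions = []
    for q in X_query:
        # Euclidean distance from the query to every training point
        distances = np.sqrt(((X_fit - q) ** 2).sum(axis=1))
        nearest = np.argsort(distances)[:k]          # indices of the k closest
        votes = Counter(y_fit[nearest].tolist())     # uniform weights: one vote each
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Two invented, well-separated clusters in 2-D.
X_fit = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5],
                  [5, 5], [5, 6], [6, 5], [6, 6], [5.5, 5.5]], dtype=float)
y_fit = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
X_query = np.array([[0.2, 0.3], [5.8, 5.1]])

print(knn_predict(X_fit, y_fit, X_query, k=5))   # [0 1]
```

Because every query is compared with every training point, prediction time grows with the size of the training set, which is worth keeping in mind on the full MNIST data.
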



---


*   Log all the model development and measure the performance of each algorithm used in a table.
*   Finally, make a conclusion about the performance of each model.

In [206]:
from tabulate import tabulate

data = [['Simple Neural \nNetwork', 96.2, 'reshape dataset using\nConvolutional Network, \nmax epoch = 10', 98.7],
        ['Decision Tree', 82.5, 're-split data and fit a \nbaseline model, max depth = 10', 82.6],
        ['KNN', 95.9, 'none (default settings)', None]]

col_names = ["Model used", "accuracy %", "hyperparams/algorithms added", "new accuracy %"]

print(tabulate(data, headers=col_names, tablefmt="grid", showindex="never"))

+----------------+--------------+--------------------------------+------------------+
| Model used     |   accuracy % | hyperparams/algorithms added   |   new accuracy % |
+================+==============+================================+==================+
| Simple Neural  |         96.2 | reshape dataset using          |             98.7 |
| Network        |              | Convolutional Network,         |                  |
|                |              | max epoch = 10                 |                  |
+----------------+--------------+--------------------------------+------------------+
| Decision Tree  |         82.5 | re-split data and fit a        |             82.6 |
|                |              | baseline model, max depth = 10 |                  |
+----------------+--------------+--------------------------------+------------------+
| KNN            |         95.9 | none (default settings)        |                  |
+----------------+--------------+--------------------------------+------------------+

##Conclusion
As this was my first time practising these models and using the MNIST dataset, I kept the focus as simple as I could. The metric I focused on was accuracy, recorded up to twice per model: once when the model was first built, and again after new features were added or the hyperparameters were changed.

The best performer was the neural network, which reached 98.7% test accuracy. It was by far the most complex model and had a very long runtime during some parts of the process. Reshaping the input into 28x28 images for a CNN and halving the number of epochs to 10 raised test accuracy by about 2.5 percentage points over the simple network (96.2%). KNN with its default settings came second at 95.9%.

The Decision Tree started at 82.5% test accuracy. The baseline tree (max_depth = 10, fixed random seed, separate validation split) scored 92.71% on its training data but only 82.58% on the validation set. The apparent 10-point jump is therefore training accuracy rather than a real improvement, and the gap between the two figures suggests the tree is overfitting.

Resources used:

MNIST explained
https://www.youtube.com/watch?v=5gLarqG8p4s&list=RDCMUCJINtWke3-FMz2WuEltWDVQ&start_radio=1&rv=5gLarqG8p4s&t=1104&ab_channel=AppliedAICourse

Visualising MNIST
https://colah.github.io/posts/2014-10-Visualizing-MNIST/

Neural Network & CNN
https://colab.research.google.com/github/skorch-dev/skorch/blob/master/notebooks/MNIST.ipynb#scrollTo=PcnTjzhaUWsd

Deep Neural Networks
https://www.youtube.com/watch?v=x89-G6gz3jg&ab_channel=RANJIRAJ

3 models using MNIST
https://medium.com/analytics-vidhya/knn-vs-decision-tree-vs-random-forest-for-handwritten-digit-recognition-470e864c75bc


Decision tree using t-SNE
https://www.kaggle.com/code/carlolepelaars/97-on-mnist-with-a-single-decision-tree-t-sne/notebook

t-SNE visualisation of MNIST data
https://www.kaggle.com/code/apapiu/t-sne/notebook