1\. [20 pts] Define each of the following machine learning terms:

* *dataset* - a collection of instances. When working with machine learning methods we typically need several datasets for different purposes (training, validation, and testing).
  * *instance* - a single row of data; one observation from the domain.

* *training* - a dataset that we feed into our machine learning algorithm to train our model

* *testing* - a dataset that we use to estimate the accuracy of the final model and that is never used to train it. It is sometimes loosely called the validation dataset, although strictly the two serve different purposes.
* *validation dataset* - a dataset used to tune the model's hyperparameters.

* *ground truth* - the value measured for your target variable for the training and testing examples. Nearly all the time you can safely treat this the same as the label. If you augment your dataset, there is a subtle difference between the ground truth (your actual measurements) and the labels you have assigned to the augmented examples.

* *label* - a human- or machine-generated description attached to one or more data samples.

* *pre-processing* - data transforming or encoding to bring data to a state such that a machine can easily parse it.

* *feature* - an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and regression.

* *numerical* - numerical (quantitative) data takes numeric values that can be measured or counted.

* *nominal* - nominal data is categorical data without a natural rank, whereas ordinal data has a predetermined or natural order.

* *decision surface* - a hypersurface in a multidimensional state space that partitions the space into regions. Data lying on one side of a decision surface are defined as belonging to a different class from those lying on the other. Decision surfaces may be created or modified as a result of a learning process, and they are frequently used in machine learning, pattern recognition, and classification systems.

* *model validation* - the process where a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. The main purpose of using the testing data set is to test the generalization ability of a trained model.

* *accuracy* - the fraction of predictions that are correct, i.e. (TP + TN) / (TP + TN + FP + FN). Equivalently, it is a weighted arithmetic mean of precision and inverse precision (weighted by bias) as well as a weighted arithmetic mean of recall and inverse recall (weighted by prevalence). Inverse precision and inverse recall are simply the precision and recall of the inverse problem, where positive and negative labels are exchanged (for both real classes and prediction labels). Recall (the true positive rate) and one minus inverse recall (the false positive rate) are frequently plotted against each other as ROC curves, which provide a principled mechanism to explore operating-point trade-offs.

* *cross-validation* - a technique for evaluating ML models by training several models on subsets of the available input data and evaluating each on the complementary subset. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
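
To make the *accuracy* definition above concrete, here is a minimal sketch that counts the confusion-matrix cells by hand and derives accuracy, precision, and recall from them. The labels are hypothetical toy data, and `confusion_counts` is an illustrative helper, not part of any library.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count true/false positives and negatives for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

# Hypothetical toy labels: 'p' = poisonous, 'e' = edible.
y_true = ['p', 'p', 'p', 'e', 'e', 'e', 'e', 'e']
y_pred = ['p', 'p', 'e', 'e', 'e', 'e', 'p', 'e']

tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive='p')
accuracy = (tp + tn) / (tp + fp + tn + fn)   # fraction of correct predictions
precision = tp / (tp + fp)                   # how many predicted positives are real
recall = tp / (tp + fn)                      # how many real positives were found
print(f'accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}')
```

In practice `sklearn.metrics.accuracy_score`, `precision_score`, and `recall_score` compute the same quantities.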

2\. [20 pts] Pick **two** of the [Scikit-learn datasets] which are already included in the library (i.e. the ones loaded with `datasets.load_*`) and find out the following:
* the number of data points
* the number of features and their types
* the number and name of categories (i.e. the target field)
* the mean (or mode if nominal) of the first two features

[Scikit-learn datasets]: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets

In [None]:
from sklearn import datasets
from termcolor import colored

In [None]:
data = datasets.load_wine(as_frame=True)
print(colored('number of data points:', 'green', attrs=['bold']))
print(data.frame.shape[0])  # row count; count() would print one count per column
print(colored('number of features:', 'green', attrs=['bold']))
print(data.data.shape[1])  # data.data holds the features only, without the target column
print(colored('type of feature:', 'green', attrs=['bold']))
print(data.data.dtypes)
print(colored('number of categories:', 'green', attrs=['bold']))
print(data.target_names.size)
print(colored('name of categories: ', 'green', attrs=['bold']))
print(data.target_names)
print(colored('mean of alcohol: ', 'green', attrs=['bold']))
print(data.frame['alcohol'].mean())  # call the method; a bare .mean prints the bound method
print(colored('mean of malic_acid: ', 'green', attrs=['bold']))
print(data.frame['malic_acid'].mean())

In [None]:
# Note: fetch_* datasets are downloaded on first use, unlike the bundled load_* datasets
data = datasets.fetch_california_housing(as_frame=True)
print(colored('number of data points:', 'green', attrs=['bold']))
print(data.frame.shape[0])
print(colored('number of features:', 'green', attrs=['bold']))
print(data.data.shape[1])  # features only, without the target column
print(colored('type of feature:', 'green', attrs=['bold']))
print(data.data.dtypes)
# This is a regression dataset: the target is continuous, so it has no categories
print(colored('type of target:', 'green', attrs=['bold']))
print(data.target.dtype)
print(colored('name of target: ', 'green', attrs=['bold']))
print(data.target_names)
print(colored('mean of MedInc: ', 'green', attrs=['bold']))
print(data.frame['MedInc'].mean())
print(colored('mean of HouseAge: ', 'green', attrs=['bold']))
print(data.frame['HouseAge'].mean())
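
The two cells above repeat the same steps, so they could be folded into one helper. The sketch below is one possible way to do that, shown on `load_iris` as an example of a bundled classification dataset; `summarize` is a hypothetical name, and it assumes the dataset was loaded with `as_frame=True` and exposes `target_names`.

```python
from sklearn import datasets

def summarize(bunch):
    """Collect the quantities asked for in question 2 from a Bunch loaded with as_frame=True."""
    features = bunch.data                      # DataFrame of features only (no target column)
    first_two = features.columns[:2]
    return {
        'n_samples': features.shape[0],
        'n_features': features.shape[1],
        'dtypes': features.dtypes.astype(str).to_dict(),
        'categories': list(bunch.target_names),
        'means': {name: features[name].mean() for name in first_two},
    }

summary = summarize(datasets.load_iris(as_frame=True))
for key, value in summary.items():
    print(f'{key}: {value}')
```

For a nominal feature the mean is not meaningful; `features[name].mode()` would be the counterpart there.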

3\. [40 pts] Implement a correlation program from scratch to look at the correlations between the features of the Admission_Predict.csv dataset file (not provided; you have to download it yourself by following the instructions in the module Jupyter notebook). Display the correlation matrix where each row and column is a feature, which should be an 8 by 8 matrix (should we use 'Serial no'?). You can use pandas DataFrame.corr() to verify the correctness of your implementation.

Observe that the diagonal of this matrix should have all 1's and explain why. Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables? Which variable should be the most important for prediction of 'Chance of Admit'?

In [None]:
import pandas as pd
df = pd.read_csv('datasets/Admission_Predict_Ver1.1.csv')
print(df.head())
print(df.shape)

In [None]:
for col in df.columns: 
    print(col) 

In [None]:
import math

def correlationCoefficient(X, Y, n):
    """Pearson correlation coefficient of the first n elements of X and Y."""
    sum_X = 0.0
    sum_Y = 0.0
    sum_XY = 0.0
    squareSum_X = 0.0
    squareSum_Y = 0.0
    for i in range(n):
        # Convert to float to guard against integer overflow with NumPy integer arrays
        x, y = float(X[i]), float(Y[i])
        sum_X += x             # sum of elements of X
        sum_Y += y             # sum of elements of Y
        sum_XY += x * y        # sum of X[i] * Y[i]
        squareSum_X += x * x   # sum of squares of X
        squareSum_Y += y * y   # sum of squares of Y
    # Pearson correlation coefficient (computational form)
    numerator = n * sum_XY - sum_X * sum_Y
    denominator = math.sqrt((n * squareSum_X - sum_X * sum_X) *
                            (n * squareSum_Y - sum_Y * sum_Y))
    return numerator / denominator
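
For reference, the function above evaluates the computational form of the Pearson correlation coefficient, with all sums running over the $n$ samples:

$$
r_{XY} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}
{\sqrt{\left(n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right)\left(n\sum_i y_i^2 - \left(\sum_i y_i\right)^2\right)}}
$$

Setting $Y = X$ makes the numerator equal to $n\sum_i x_i^2 - \left(\sum_i x_i\right)^2$ and the expression under the square root equal to its square, so $r_{XX} = 1$ for any non-constant feature.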

In [None]:
print('{0:.6f}'.format(correlationCoefficient(df['GRE Score'].to_numpy(),
                             df['Chance of Admit '].to_numpy(),
                             df['GRE Score'].size)))

In [None]:
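# Skip the identifier column (assumed to be the only one whose name starts with 'Serial')
# so that the result is the 8 by 8 feature matrix the question asks for
features = [c for c in df.columns if not c.startswith('Serial')]
for X in features:
    for Y in features:
        print('{0:.6f}'.format(correlationCoefficient(df[X].to_numpy(), df[Y].to_numpy(), df[X].size)), end=' ')
    print()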

In [None]:
df.corr()
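
As a cross-check on the loop-based version, the whole matrix can also be computed in one vectorized step. The sketch below runs on synthetic data rather than the admissions file, and it compares against `np.corrcoef` instead of `DataFrame.corr()`; `correlation_matrix` is an illustrative name.

```python
import numpy as np

def correlation_matrix(data):
    """Pearson correlation matrix of the columns of a 2-D array."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)     # subtract each column mean
    cov = centered.T @ centered             # unnormalized covariance matrix
    norms = np.sqrt(np.diag(cov))           # per-column root sum of squares
    return cov / np.outer(norms, norms)

# Synthetic stand-in for the admissions table (assumption: 3 numeric columns).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
toy = np.column_stack([x, 2 * x + rng.normal(size=200), rng.normal(size=200)])

corr = correlation_matrix(toy)
print(np.round(corr, 3))
print(np.allclose(corr, np.corrcoef(toy, rowvar=False)))
```

Centering the columns first is algebraically the same as the sum-based formula but tends to be better behaved numerically when the values are large.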

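The diagonal is all 1's because each feature is perfectly correlated with itself. CGPA has the highest correlation with Chance of Admit, so it should be the most important variable for prediction.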

4\. [20 pts] Classification of mushrooms, edible or poisonous. Download the *assignment01_mushroom_dataset.csv* dataset file from the module content. Load the data set in your model development framework, examine the features to see they are all nominal features. The first column is the class which represents whether the mushroom is poisonous or not. Apply necessary pre-processing such as nominal to numerical conversions. Make sure to sanity check the pipeline and perhaps run your favorite baseline classifier first.

Report the performance of your classifier.

In [None]:
df = pd.read_csv('datasets/assignment01_mushroom_dataset.csv')
pd.set_option('display.max_columns', None)
df

In [None]:
l = df.columns.tolist()
l.remove('class')
df_o = pd.get_dummies(df, columns=l)
df_o
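
To see what the nominal-to-numerical conversion in the cell above produces, here is a small sketch on a hypothetical toy frame with the same layout (a `class` column followed by nominal features). The column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: a 'class' target column followed by nominal feature columns.
toy = pd.DataFrame({
    'class': ['p', 'e', 'e', 'p'],
    'cap-shape': ['x', 'b', 'x', 'f'],
    'odor': ['p', 'a', 'l', 'p'],
})

feature_cols = [c for c in toy.columns if c != 'class']
# Each nominal column becomes one indicator column per distinct value.
encoded = pd.get_dummies(toy, columns=feature_cols)
print(encoded.columns.tolist())
print(encoded.shape)
```

One-hot encoding avoids imposing an artificial order on nominal values, at the cost of a wider feature matrix.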

In [None]:
display(df_o['class'].value_counts())

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, StratifiedKFold, train_test_split

# We will reuse the classifier function below
def rf_train_test(_X_tr, _X_ts, _y_tr, _y_ts):
    # Create a new random forest classifier that runs 4 jobs in parallel
    rf = RandomForestClassifier(n_estimators=200, max_depth=5, random_state=None, n_jobs=4)
    # Train on training data
    rf.fit(_X_tr, _y_tr)
    # Predict on the held-out test data
    y_pred = rf.predict(_X_ts)
    # Return accuracy
    return accuracy_score(_y_ts, y_pred)

In [None]:
# Prepare the input X matrix and target y vector
X = df_o.loc[:, df_o.columns != 'class'].values
y = df_o.loc[:, df_o.columns == 'class'].values.ravel()

In [None]:
# Sanity check
print(y[:10])

In [None]:
# 80% split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=None)
rf_train_test(X_train, X_test, y_train, y_test)

In [None]:
import numpy as np

In [None]:
%%time
# Run 100 times and collect statistics
accuracies = []
for i in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=None)
    accuracies.append(rf_train_test(X_train, X_test, y_train, y_test))

print(f'80% train-test split accuracy is {np.mean(accuracies):.3f} ± {np.std(accuracies):.4f}')
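
The imports above include `StratifiedKFold`, which is not used in the notebook. As an alternative to repeated random splits, the sketch below shows stratified 5-fold cross-validation, where every sample is tested exactly once and each fold keeps the class proportions. It runs on synthetic data from `make_classification` as a stand-in for the encoded mushroom matrix, and the hyperparameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the encoded mushroom matrix (assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_scores = []
for train_idx, test_idx in skf.split(X_demo, y_demo):
    # Fit a fresh model on the training folds, score it on the held-out fold
    clf = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_scores.append(accuracy_score(y_demo[test_idx], clf.predict(X_demo[test_idx])))

print(f'5-fold accuracy: {np.mean(fold_scores):.3f} ± {np.std(fold_scores):.3f}')
```

Fixing `random_state` makes the folds reproducible; the notebook's own cells leave it as `None`, so their numbers vary from run to run.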