1\. [20 pts] Define each of the following machine learning terms:

* *dataset* - a collection of instances. Machine learning work typically needs several datasets for different purposes (training, validation, and testing).
  * *instance* - a single row of data; one observation from the domain.

* *training* - a dataset that we feed into our machine learning algorithm to train our model

* *testing* - a dataset used to estimate the accuracy of the final model; it is never used to train the model. It is sometimes loosely called the validation dataset, although the two serve different purposes.
* *validation dataset* - a held-out dataset used to tune the model's hyperparameters.

* *ground truth* - the value measured for your target variable for the training and testing examples. Nearly all the time you can safely treat this as the same thing as the label. If you augment your data set, there is a subtle difference between the ground truth (your actual measurements) and the labels you assign to the augmented examples.

* *label* - a human- or machine-generated description attached to one or many data samples.

* *pre-processing* - transforming or encoding data so that a machine can easily parse it.

* *feature* - an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and regression.

* *numerical* - quantitative data: the values are numbers that can be measured or counted.

* *nominal* - categorical data whose categories have no natural rank (for example, colors), whereas ordinal data has a predetermined or natural order.

* *decision surface* - a hypersurface in a multidimensional state space that partitions the space into regions. Data lying on one side of a decision surface are defined as belonging to a different class from those lying on the other. Decision surfaces may be created or modified as a result of a learning process, and they are frequently used in machine learning, pattern recognition, and classification systems.

* *model validation* - the process where a trained model is evaluated with a testing data set. The testing data set is a separate portion of the same data set from which the training set is derived. The main purpose of using the testing data set is to test the generalization ability of a trained model.

* *accuracy* - the fraction of predictions that are correct, (TP + TN) / (TP + TN + FP + FN). Equivalently, it is a weighted arithmetic mean of precision and inverse precision (weighted by bias) as well as a weighted arithmetic mean of recall and inverse recall (weighted by prevalence). Inverse precision and inverse recall are simply the precision and recall of the inverse problem where positive and negative labels are exchanged (for both real classes and prediction labels). Recall (the true positive rate) is frequently plotted against the false positive rate, which equals 1 minus inverse recall, as a ROC curve, which provides a principled mechanism to explore operating point trade-offs.

* *cross-validation* - a technique for evaluating ML models by training several models on subsets of the available input data and evaluating each one on the complementary subset. Use cross-validation to detect overfitting, i.e., failing to generalize a pattern.
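
The accuracy and cross-validation definitions above can be made concrete with a small from-scratch sketch. The helper names `accuracy` and `k_fold_indices` and the toy labels are assumptions made for illustration only; in practice a library routine would normally be used instead.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Hypothetical labels: 4 of the 5 predictions are correct.
y_true = ["e", "p", "e", "e", "p"]
y_pred = ["e", "p", "p", "e", "p"]
print(accuracy(y_true, y_pred))  # 0.8

# Each fold serves once as the held-out (complementary) subset.
for train_idx, test_idx in k_fold_indices(n_samples=6, k=3):
    print(train_idx, test_idx)
```

Averaging the per-fold scores gives the cross-validated estimate; a large gap between training accuracy and that average is the usual sign of overfitting.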

2\. [20 pts] Pick **two** of the [Scikit-learn datasets] which are already included in the library (i.e. the ones loaded with `datasets.load_*`) and find out the following:
* the number of data points
* the number of features and their types
* the number and name of categories (i.e. the target field)
* the mean (or mode if nominal) of the first two features

[Scikit-learn datasets]: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets
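
The "mean (or mode if nominal)" rule in the last bullet can be sketched with the standard library alone. The helper name `summarize_feature` and the toy values are hypothetical; the real columns come from the loaded dataset.

```python
from statistics import mean, mode


def summarize_feature(values):
    """Return ('mean', value) for numeric data and ('mode', value) otherwise."""
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    if is_numeric:
        return "mean", mean(values)
    # Nominal data has no meaningful average, so report the most common value.
    return "mode", mode(values)


print(summarize_feature([12.0, 13.0, 14.0]))  # ('mean', 13.0)
print(summarize_feature(["x", "b", "x"]))     # ('mode', 'x')
```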

In [1]:
from sklearn import datasets
from termcolor import colored

In [2]:
data = datasets.load_wine(as_frame=True)
print(colored('number of data points:', 'green', attrs=['bold']))
print(data.frame.count())
print(colored('number of features:', 'green', attrs=['bold']))
print(data.frame.dtypes.size)  # counts all 14 frame columns: 13 features plus 'target'
print(colored('type of feature:', 'green', attrs=['bold']))
print(data.frame.dtypes)
print(colored('number of categories:', 'green', attrs=['bold']))
print(data.target_names.size)
print(colored('name of categories: ', 'green', attrs=['bold']))
print(data.target_names)
print(colored('mean of alcohol: ', 'green', attrs=['bold']))
print('{0:.6f}'.format(data.frame['alcohol'].mean()))  # call mean(); a bare .mean only prints the bound method
print(colored('mean of malic_acid: ', 'green', attrs=['bold']))
print('{0:.6f}'.format(data.frame['malic_acid'].mean()))

[1m[32mnumber of data points:[0m
alcohol                         178
malic_acid                      178
ash                             178
alcalinity_of_ash               178
magnesium                       178
total_phenols                   178
flavanoids                      178
nonflavanoid_phenols            178
proanthocyanins                 178
color_intensity                 178
hue                             178
od280/od315_of_diluted_wines    178
proline                         178
target                          178
dtype: int64
[1m[32mnumber of features:[0m
14
[1m[32mtype of feature:[0m
alcohol                         float64
malic_acid                      float64
ash                             float64
alcalinity_of_ash               float64
magnesium                       float64
total_phenols                   float64
flavanoids                      float64
nonflavanoid_phenols            float64
proanthocyanins                 float64
color_intensity      

In [3]:
data = datasets.fetch_california_housing(as_frame=True)
print(colored('number of data points:', 'green', attrs=['bold']))
print(data.frame.count())
print(colored('number of features:', 'green', attrs=['bold']))
print(data.frame.dtypes.size)  # counts all 9 frame columns: 8 features plus the MedHouseVal target
print(colored('type of feature:', 'green', attrs=['bold']))
print(data.frame.dtypes)
print(colored('number of categories:', 'green', attrs=['bold']))
print(data.target.size)  # regression target: this counts target values, not classes
print(colored('name of categories: ', 'green', attrs=['bold']))
print(data.target_names)
print(colored('mean of MedInc: ', 'green', attrs=['bold']))
print('{0:.6f}'.format(data.frame['MedInc'].mean()))  # call mean(); a bare .mean only prints the bound method
print(colored('mean of HouseAge: ', 'green', attrs=['bold']))
print('{0:.6f}'.format(data.frame['HouseAge'].mean()))

number of data points:
MedInc         20640
HouseAge       20640
AveRooms       20640
AveBedrms      20640
Population     20640
AveOccup       20640
Latitude       20640
Longitude      20640
MedHouseVal    20640
dtype: int64
number of features:
9
type of feature:
MedInc         float64
HouseAge       float64
AveRooms       float64
AveBedrms      float64
Population     float64
AveOccup       float64
Latitude       float64
Longitude      float64
MedHouseVal    float64
dtype: object
number of categories:
20640
name of categories: 
['MedHouseVal']
mean of MedInc: 
3.870671
mean of HouseAge: 
28.639486

3\. [40 pts] Implement a correlation program from scratch to look at the correlations between the features of the Admission_Predict.csv dataset file (not provided; download it yourself by following the instructions in the module Jupyter notebook). Display the correlation matrix where each row and column is a feature, which should be an 8 by 8 matrix (should we use 'Serial no'?). You can use pandas DataFrame.corr() to verify the correctness of your result.

Observe that the diagonal of this matrix should have all 1's and explain why. Since the last column can be used as the target (dependent) variable, what do you think about the correlations between all the variables? Which variable should be the most important for prediction of 'Chance of Admit'?
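
For reference, the coefficient asked for here is presumably Pearson's $r$, which is also the default method of `DataFrame.corr()`. Assuming $n$ paired observations $(x_i, y_i)$, its sum form is:

$$
r_{XY} = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}
{\sqrt{\left(n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right)\left(n\sum_i y_i^2 - \left(\sum_i y_i\right)^2\right)}}
$$

Setting $Y = X$ turns the numerator into $n\sum_i x_i^2 - \left(\sum_i x_i\right)^2$ and the denominator into the square root of that same quantity squared, so $r_{XX} = 1$ for any non-constant feature. That is why the diagonal of the matrix should contain only 1's.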

In [4]:
import pandas as pd
df = pd.read_csv('datasets/Admission_Predict_Ver1.1.csv')
print(df.head())
print(df.shape)

   Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR   CGPA  \
0           1        337          118                  4  4.5   4.5  9.65   
1           2        324          107                  4  4.0   4.5  8.87   
2           3        316          104                  3  3.0   3.5  8.00   
3           4        322          110                  3  3.5   2.5  8.67   
4           5        314          103                  2  2.0   3.0  8.21   

   Research  Chance of Admit   
0         1              0.92  
1         1              0.76  
2         1              0.72  
3         1              0.80  
4         0              0.65  
(500, 9)


In [5]:
for col in df.columns: 
    print(col) 

Serial No.
GRE Score
TOEFL Score
University Rating
SOP
LOR 
CGPA
Research
Chance of Admit 


In [6]:
import math

def correlationCoefficient(X, Y, n):
    # Accumulate in Python floats. With NumPy int64 inputs, the product under
    # the square root can overflow for integer columns such as 'Serial No.'.
    sum_X = 0.0
    sum_Y = 0.0
    sum_XY = 0.0
    squareSum_X = 0.0
    squareSum_Y = 0.0
    for i in range(n):
        x = float(X[i])
        y = float(Y[i])
        sum_X += x            # sum of elements of array X
        sum_Y += y            # sum of elements of array Y
        sum_XY += x * y       # sum of X[i] * Y[i]
        squareSum_X += x * x  # sum of squares of array elements
        squareSum_Y += y * y
    # Pearson correlation coefficient.
    corr = (n * sum_XY - sum_X * sum_Y) / math.sqrt(
        (n * squareSum_X - sum_X * sum_X) * (n * squareSum_Y - sum_Y * sum_Y))
    return corr

In [7]:
print('{0:.6f}'.format(correlationCoefficient(df['GRE Score'].to_numpy(),
                             df['Chance of Admit '].to_numpy(),
                             df['GRE Score'].size)))

0.810351


In [8]:
for X in df.columns:
    for Y in df.columns:
        print('{0:.6f}'.format(correlationCoefficient(df[X].to_numpy(),df[Y].to_numpy(),df[X].size)),end=' ')
    print('\n')

1.000000 -0.103839 -0.141696 -0.067641 -0.137352 -0.003694 -0.074289 -0.005332 0.008505 

-0.103839 1.000000 0.827200 0.635376 0.613498 0.524679 0.825878 0.563398 0.810351 

-0.141696 0.827200 1.000000 0.649799 0.644410 0.541563 0.810574 0.467012 0.792228 

-0.067641 0.635376 0.649799 1.000000 0.728024 0.608651 0.705254 0.427047 0.690132 

-0.137352 0.613498 0.644410 0.728024 1.000000 0.663707 0.712154 0.408116 0.684137 

-0.003694 0.524679 0.541563 0.608651 0.663707 1.000000 0.637469 0.372526 0.645365 

-0.074289 0.825878 0.810574 0.705254 0.712154 0.637469 1.000000 0.501311 0.882413 

-0.005332 0.563398 0.467012 0.427047 0.408116 0.372526 0.501311 1.000000 0.545871 

0.008505 0.810351 0.792228 0.690132 0.684137 0.645365 0.882413 0.545871 1.000000 
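
As a cross-check on the matrix above, here is a minimal sketch of the same coefficient in its mean-centered form, which keeps the intermediate sums small. The helper names `pearson` and `correlation_matrix` are illustrative, and the three columns contain only the five rows printed by `df.head()`, so the numbers will not match the full 500-row matrix.

```python
import math


def pearson(x, y):
    """Pearson r from mean-centered sums (two passes over the data)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """columns maps a name to a list of numbers; returns a nested dict of r values."""
    return {
        a: {b: pearson(xa, xb) for b, xb in columns.items()}
        for a, xa in columns.items()
    }


# Only the first five rows of the file, taken from the df.head() output.
columns = {
    "Serial No.": [1, 2, 3, 4, 5],
    "GRE Score": [337, 324, 316, 322, 314],
    "CGPA": [9.65, 8.87, 8.00, 8.67, 8.21],
}

matrix = correlation_matrix(columns)
for name, row in matrix.items():
    print(name, " ".join("{0:.6f}".format(r) for r in row.values()))
```

The result is symmetric with 1's on the diagonal, which is a quick sanity check for any from-scratch implementation.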





In [9]:
df.corr()

| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| Serial No. | 1.0 | -0.103839 | -0.141696 | -0.067641 | -0.137352 | -0.003694 | -0.074289 | -0.005332 | 0.008505 |
| GRE Score | -0.103839 | 1.0 | 0.8272 | 0.635376 | 0.613498 | 0.524679 | 0.825878 | 0.563398 | 0.810351 |
| TOEFL Score | -0.141696 | 0.8272 | 1.0 | 0.649799 | 0.64441 | 0.541563 | 0.810574 | 0.467012 | 0.792228 |
| University Rating | -0.067641 | 0.635376 | 0.649799 | 1.0 | 0.728024 | 0.608651 | 0.705254 | 0.427047 | 0.690132 |
| SOP | -0.137352 | 0.613498 | 0.64441 | 0.728024 | 1.0 | 0.663707 | 0.712154 | 0.408116 | 0.684137 |
| LOR | -0.003694 | 0.524679 | 0.541563 | 0.608651 | 0.663707 | 1.0 | 0.637469 | 0.372526 | 0.645365 |
| CGPA | -0.074289 | 0.825878 | 0.810574 | 0.705254 | 0.712154 | 0.637469 | 1.0 | 0.501311 | 0.882413 |
| Research | -0.005332 | 0.563398 | 0.467012 | 0.427047 | 0.408116 | 0.372526 | 0.501311 | 1.0 | 0.545871 |
| Chance of Admit | 0.008505 | 0.810351 | 0.792228 | 0.690132 | 0.684137 | 0.645365 | 0.882413 | 0.545871 | 1.0 |


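CGPA has the highest correlation with Chance of Admit (0.882413), so it should be the most important variable for prediction, followed by GRE Score (0.810351) and TOEFL Score (0.792228). All the predictors are positively correlated with one another. Serial No. is only a row identifier: its correlations with every other column are close to zero, so it can be dropped, which leaves the 8 by 8 matrix the question asks for.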

4\. [20 pts] Classification of mushrooms, edible or poisonous. Download the *assignment01_mushroom_dataset.csv* dataset file from the module content. Load the data set in your model development framework and examine the features to confirm that they are all nominal. The first column is the class, which represents whether the mushroom is poisonous or not. Apply necessary pre-processing such as nominal to numerical conversions. Make sure to sanity check the pipeline and perhaps run your favorite baseline classifier first.

Report the performance of your classifier.

In [11]:
df = pd.read_csv('datasets/assignment01_mushroom_dataset.csv')
print(df.shape)
df.head()

(8124, 23)


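| | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | p | x | s | n | t | p | f | c | n | k | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | e | x | s | y | t | a | f | c | b | k | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | e | b | s | w | t | l | f | c | b | n | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | p | x | y | w | t | p | f | c | n | n | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | e | x | s | g | f | n | f | w | b | k | ... | s | w | w | p | w | o | e | n | a | g |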
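
Before reaching for a library encoder, the nominal to numerical step and a sanity-check baseline can be sketched from scratch. This is only an illustration under stated assumptions: `one_hot_encode` and `majority_baseline` are hypothetical helper names, and the data is limited to the class, cap-shape, and cap-surface values of the five rows shown above, not the full 8124-row file.

```python
from collections import Counter


def one_hot_encode(rows):
    """Give every (column, category) pair its own 0/1 feature."""
    n_cols = len(rows[0])
    # Sorted categories per column give a reproducible feature order.
    categories = [sorted({row[j] for row in rows}) for j in range(n_cols)]
    names = ["col{}={}".format(j, cat) for j in range(n_cols) for cat in categories[j]]
    encoded = [
        [1 if row[j] == cat else 0 for j in range(n_cols) for cat in categories[j]]
        for row in rows
    ]
    return encoded, names


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most common training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda rows: [most_common for _ in rows]


# class, then (cap-shape, cap-surface), for the five rows shown above.
labels = ["p", "e", "e", "p", "e"]
features = [["x", "s"], ["x", "s"], ["b", "s"], ["x", "y"], ["x", "s"]]

encoded, names = one_hot_encode(features)
print(names)       # ['col0=b', 'col0=x', 'col1=s', 'col1=y']
print(encoded[0])  # [0, 1, 1, 0]

predict = majority_baseline(labels)
predictions = predict(encoded)
correct = sum(p == t for p, t in zip(predictions, labels))
print(correct / len(labels))  # 3 of 5 rows are 'e' -> 0.6
```

Any real classifier should beat the majority-class score on held-out data; if it does not, the pipeline most likely has a bug.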
