# MCIS6273 Fall 2018 (Prof. Maull) Data Mining Final Review

* 30 questions on the final
* 8 questions are calculations or code -- formulae will be given to you unless noted below
* the final will be around 15% of your total grade, so do your best!

## Study tips
* pace yourself and take breaks
* get plenty of rest and remember to eat and drink frequently
* on the night before the test, get a good night's rest

## Data and Patterns

* know the vector/matrix model of data 
    * $(f_1, f_2, \ldots, f_n)$, where $f_i$ are the features of the vector
* know the difference between binary, categorical and numeric data
* know how to binarize a vector of categorical data
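
As a rough illustration of binarizing a categorical vector, here is a minimal sketch in plain Python. The `binarize` helper and the `colors` data are hypothetical, made up for this review. `pandas.get_dummies` performs the same kind of encoding on DataFrame columns.

```python
def binarize(values):
    """One-hot encode a list of categorical values.

    Returns the sorted category names and one 0/1 row per input value.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


colors = ["red", "blue", "red", "green"]  # hypothetical data
categories, rows = binarize(colors)

print(categories)  # ['blue', 'green', 'red']
for value, row in zip(colors, rows):
    print(value, row)  # e.g. red [0, 0, 1]
```

Each category becomes its own binary feature, so a single categorical feature with three values turns into three 0/1 features.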

* know the basic distance metrics
     * Euclidean
     * Jaccard (know the formula _and_ how to use it on **binary** data!)
         * consider $v_1 = (1, 0, 1)$ and $v_2 = (0, 0, 1)$
         * Jaccard similarity is $$\frac{|v_1 \cap v_2|}{|v_1 \cup v_2|}$$ so here we have $\frac{1}{2}$ (one position where both vectors are 1, two positions where at least one is 1).
     * Minkowski (how can Minkowski be converted to other common distance metrics like Euclidean?)
     * cosine (useful with comparing documents)
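
The metrics listed above can be sketched in a few lines of plain Python. The function names are hypothetical helpers written for this review, not library calls. Returning 1.0 for two all-zero vectors in `jaccard_similarity` is an assumed convention. Note how Minkowski with $p=2$ reproduces Euclidean distance and $p=1$ gives Manhattan distance.

```python
import math


def minkowski(a, b, p):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def jaccard_similarity(a, b):
    """Jaccard similarity for binary vectors: |a AND b| / |a OR b|."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    # Assumption: two all-zero vectors are treated as identical.
    return both / either if either else 1.0


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


v1 = (1, 0, 1)
v2 = (0, 0, 1)

print(jaccard_similarity(v1, v2))  # 0.5
print(euclidean(v1, v2))           # 1.0
print(minkowski(v1, v2, 2))        # 1.0, same as Euclidean
print(minkowski(v1, v2, 1))        # 1.0, Manhattan distance
print(cosine_similarity(v1, v2))   # about 0.707
```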

* know how to interpret a scatterplot in 2 dimensions and how the pattern in a plot relates to correlation
* know what scaling and data normalizing do
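
To make scaling and normalizing concrete, here is a minimal sketch of two common transforms written by hand. The helper names `min_max` and `z_score` are hypothetical, and the z-score version assumes the population standard deviation. Min-max scaling maps values into $[0, 1]$, while z-scoring centers them at 0 with unit spread.

```python
import math


def min_max(values):
    """Rescale values linearly so the minimum is 0 and the maximum is 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Center values at 0 and divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]


D = [0, 5, 10, 15, 20]

print(min_max(D))                         # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in z_score(D)])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

Scaling matters for distance-based methods: without it, a feature measured in large units dominates the distance.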

* know how to compute simple conditional probabilities
	* for example, given $\Pr(color=\mathrm{"Blue"}) = 0.25$ and $\Pr(shape=\mathrm{"Round"})=0.25$, and $\Pr(color=\mathrm{"Blue"} \wedge shape=\mathrm{"Square"}) = 0.75$, what is $\Pr(color=\mathrm{"Blue"} \wedge shape=\mathrm{"Round"})$? Assume there are only two colors (Blue and Red) and two shapes (Square and Round).
        * $\Pr(color=\mathrm{"Red"}) = 1 - \Pr(color=\mathrm{"Blue"}) = 0.75$
        * $\Pr(shape=\mathrm{"Square"}) = 1 - \Pr(shape=\mathrm{"Round"}) = 0.75$
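
The mechanics of marginal, complement and conditional probabilities can be sketched with a small joint table. The joint probabilities below are hypothetical numbers chosen for illustration (they are not the values from the exercise above), and the helper names are made up. The sketch uses $\Pr(A \mid B) = \Pr(A \wedge B) / \Pr(B)$.

```python
from fractions import Fraction

# Hypothetical joint distribution over (color, shape); cells must sum to 1.
joint = {
    ("Blue", "Round"): Fraction(1, 20),
    ("Blue", "Square"): Fraction(4, 20),
    ("Red", "Round"): Fraction(4, 20),
    ("Red", "Square"): Fraction(11, 20),
}
assert sum(joint.values()) == 1


def pr_color(color):
    """Marginal probability of a color: sum the joint over all shapes."""
    return sum(p for (c, _), p in joint.items() if c == color)


def pr_shape(shape):
    """Marginal probability of a shape: sum the joint over all colors."""
    return sum(p for (_, s), p in joint.items() if s == shape)


def pr_color_given_shape(color, shape):
    """Conditional probability Pr(color | shape) = Pr(color AND shape) / Pr(shape)."""
    return joint[(color, shape)] / pr_shape(shape)


print(pr_color("Blue"))                       # 1/4
print(1 - pr_color("Blue"))                   # 3/4, i.e. Pr(Red)
print(pr_shape("Round"))                      # 1/4
print(pr_color_given_shape("Blue", "Round"))  # 1/5
```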

    
* know the various visualizations (bar charts, scatter plots, bubble plots)

![alt text](./infographics-design-flowchart.jpg "Visualization Guideline Chart")

## Unsupervised Learning

* know the basic concepts behind the various clustering algorithms
    * $k$-means, $k$-medoids and Expectation Maximization are the only ones you need to review
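
As a reminder of how $k$-means alternates between assigning points and updating centers, here is a minimal sketch in plain Python. The `k_means` function, the sample `points`, and the choice to initialize the centroids with the first $k$ points are all assumptions made for illustration. Real implementations usually pick random or smarter starting points.

```python
import math


def k_means(points, k, iterations=10):
    """A bare-bones k-means loop for a list of coordinate tuples."""
    # Assumption: use the first k points as the starting centroids.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        for j, members in enumerate(clusters):
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, clusters


points = [(1, 1), (1, 2), (8, 8), (9, 8), (2, 1), (8, 9)]  # hypothetical data
centroids, clusters = k_means(points, k=2)

print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
print(centroids)
```

For comparison, $k$-medoids restricts each center to be an actual data point, and Expectation Maximization gives every point a soft (probabilistic) membership in each cluster instead of a hard assignment.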

## Supervised Learning

* know what linear and logistic regression are (how are they different?)
    * remember that logistic regression predicts a binary class (by way of a probability between 0 and 1), while linear regression predicts a numeric value
* know how to interpret $R^2$ values from a regression model
    * $R^2 \approx 0 \rightarrow$ **poor** model (it explains almost none of the variance in the target)
    * $R^2 > 0.70 \rightarrow$ **strong** model (it explains most of the variance)
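
To see where an $R^2$ value comes from, here is a minimal sketch that fits a least-squares line by hand and computes $R^2 = 1 - SS_{res}/SS_{tot}$. The data and the helper names `fit_line` and `r_squared` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


xs = [1, 2, 3, 4, 5]  # hypothetical data
ys = [2, 4, 5, 4, 5]
slope, intercept = fit_line(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)

print(round(slope, 2), round(intercept, 2))  # 0.6 2.2
print(round(r2, 2))                          # 0.6
```

Here the line explains about 60% of the variance in `ys`, which falls between the two rules of thumb above.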

* know what a decision tree classifier is and how it is different from other classifiers
    * know that splitting criteria are used to build trees and that the split with the highest _gain_ is chosen at each step
* understand how to compute class membership with a Naive Bayes classifier by hand given the class probabilities (recall the example in the slides)
    * you will NOT have to memorize Bayes' Theorem, but you **will** have to use it if given to you!
* know what a support vector machine is and why it is useful for high-dimensional data
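
The by-hand Naive Bayes calculation can be sketched as follows. All priors, likelihoods, class names and feature names below are hypothetical and are not the example from the slides. For each class, multiply the prior by the likelihood of every observed feature, then normalize so the scores sum to 1.

```python
# Hypothetical class priors Pr(class).
priors = {"spam": 0.4, "ham": 0.6}

# Hypothetical likelihoods Pr(feature | class).
likelihoods = {
    "spam": {"free": 0.5, "meeting": 0.1},
    "ham": {"free": 0.1, "meeting": 0.4},
}


def posterior(observed):
    """Return Pr(class | observed features) under the naive independence assumption."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[cls][feature]
        scores[cls] = score
    total = sum(scores.values())  # the evidence term in Bayes' Theorem
    return {cls: score / total for cls, score in scores.items()}


post = posterior(["free"])
predicted = max(post, key=post.get)

print({cls: round(p, 3) for cls, p in post.items()})  # {'spam': 0.769, 'ham': 0.231}
print(predicted)                                      # spam
```

The predicted class is simply the one with the largest posterior; the normalization step does not change which class wins.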

* know what the true positive rate (sensitivity) and true negative rate (specificity) are in terms of their formal definition
* know what a confusion matrix is and how it relates to determining the performance of a classifier
    * TP $\rightarrow$ true positive; predict positive class when actual is positive
    * FP $\rightarrow$ false positive; predict positive class when actual is negative
    * TN $\rightarrow$ true negative; predict negative class when actual is negative
    * FN $\rightarrow$ false negative; predict negative class when actual is positive
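
The four confusion-matrix counts lead directly to sensitivity, $TP/(TP+FN)$, and specificity, $TN/(TN+FP)$. Here is a minimal sketch with hypothetical labels. The `confusion_counts` helper is made up for this review and is not a library function.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, TN and FN for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a != positive:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]     # hypothetical true labels
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]  # hypothetical predictions

tp, fp, tn, fn = confusion_counts(actual, predicted)
sensitivity = tp / (tp + fn)  # true positive rate
specificity = tn / (tn + fp)  # true negative rate
accuracy = (tp + tn) / len(actual)

print(tp, fp, tn, fn)         # 3 2 4 1
print(sensitivity)            # 0.75
print(round(specificity, 3))  # 0.667
print(accuracy)               # 0.7
```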

## SciKit Learn and Code

* know what scaling tools you used in the homework
    * [minmax_scale()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.minmax_scale.html)
    

```python
from sklearn.preprocessing import minmax_scale
D = [0, 5, 10, 15, 20]
minmax_scale(D)
```

```
array([0.  , 0.25, 0.5 , 0.75, 1.  ])
```

* understand test/training set creation in sklearn

```python
import pandas as pd
df = pd.read_csv('https://data.ny.gov/api/views/kpws-qgfw/rows.csv?accessType=DOWNLOAD&bom=true&query=select+*')
```

```python
df.head()
```

|   | Year | DEC Region | County | Town | Waterbody | Date | Number | Species Name | Size (inches ) |
|---|------|------------|--------|------|-----------|------|--------|--------------|----------------|
| 0 | 2018 | 3 | Sullivan | Rockland | Beaver Kill | Mid April | 692 | Brown Trout | 12 -15 inches |
| 1 | 2018 | 5 | Warren | Johnsburg | Botheration Pond | Spring | 260 | Brown Trout | 8 - 9 inches |
| 2 | 2018 | 5 | Essex | Ticonderoga | Putnam Creek | May | 260 | Brown Trout | 8 - 9 inches |
| 3 | 2018 | 7 | Cortland | Homer | Casterline Pond | Mid April | 780 | Rainbow Trout | 8 - 9 inches |
| 4 | 2018 | 9 | Chautauqua | Gerry | Mill Creek | Mid April | 800 | Brook Trout | 8 - 9 inches |


```python
df.shape
```

```
(1471, 9)
```

```python
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.40)
```

```python
train_set.shape
```

```
(882, 9)
```

```python
test_set.shape  # 589 / 1471 = 0.40
```

```
(589, 9)
```

* know how to interpret DataFrames and boolean indexing (e.g. what does `df[df.attribute<100]` return?)

```python
df[df.County == 'Genesee']
```

|      | Year | DEC Region | County | Town | Waterbody | Date | Number | Species Name | Size (inches ) |
|------|------|------------|--------|------|-----------|------|--------|--------------|----------------|
| 172  | 2018 | 8 | Genesee | Le Roy | Oatka Creek | March - April | 700 | Brown Trout | 12 -15 inches |
| 179  | 2018 | 8 | Genesee | Le Roy | Oatka Creek | March - April | 2870 | Brown Trout | 8 - 9 inches |
| 280  | 2018 | 8 | Genesee | Le Roy | Oatka Creek | May - June | 2520 | Brown Trout | 8 - 9 inches |
| 288  | 2018 | 8 | Genesee | Byron | Spring Brook | March - April | 150 | Brown Trout | 12 -15 inches |
| 745  | 2018 | 8 | Genesee | Le Roy | Oatka Creek | April - May | 1480 | Brown Trout | 8 - 9 inches |
| 781  | 2018 | 8 | Genesee | Byron | Spring Brook | March - April | 260 | Brown Trout | 8 - 9 inches |
| 786  | 2018 | 8 | Genesee | Le Roy | Oatka Creek | April - May | 700 | Brown Trout | 12 -15 inches |
| 1332 | 2018 | 8 | Genesee | Batavia | Dewitt Pond | March - April | 500 | Brown Trout | 12 -15 inches |


* know what `df.describe()` does
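
As a quick reminder, `df.describe()` returns summary statistics (count, mean, std, min, quartiles, max) for the numeric columns by default. The sketch below uses a tiny hypothetical DataFrame named `df_small`, so it runs without downloading the full dataset, and also repeats the boolean-indexing pattern.

```python
import pandas as pd

# Hypothetical miniature table with one text column and one numeric column.
df_small = pd.DataFrame({
    "County": ["Genesee", "Genesee", "Warren", "Essex"],
    "Number": [700, 2870, 260, 260],
})

# Summary statistics for the numeric column only ("County" is skipped).
summary = df_small.describe()
print(summary)

# Boolean indexing: keep only the rows where the condition is True.
small_stockings = df_small[df_small.Number < 500]
print(small_stockings)
```

The comparison `df_small.Number < 500` produces a Series of True/False values, and indexing with it returns a new DataFrame containing only the matching rows.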