A Bernoulli trial is a single experiment that ends in either success (with probability p) or failure; a Bernoulli process is a sequence of independent trials of this kind. If you want to know:
 * How many successes in n trials? -> Binomial
 * Will this single trial be a success? -> Bernoulli
 * How many trials until the 1st success? -> Geometric
 * How many trials until the r-th success? -> Negative binomial
 * Given M successes out of N trials, how many of the first n are successes? -> Hypergeometric
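
A quick way to see how the first few of these fit together is to simulate one long run of trials and read each quantity off it. This is only a sketch using the standard library; the success probability `p = 0.3`, the seed, and the helper name `bernoulli_trials` are my own choices for illustration.

```python
import random

def bernoulli_trials(n, p, rng):
    """Return n independent 0/1 outcomes, each equal to 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)   # fixed seed so the run is repeatable
p = 0.3                  # illustrative success probability (assumption)
trials = bernoulli_trials(10_000, p, rng)

# Bernoulli: outcome of one single trial
first_trial = trials[0]

# Binomial: number of successes in the first n trials
n = 20
successes_in_n = sum(trials[:n])

# Geometric: number of trials up to and including the 1st success
trials_to_first = trials.index(1) + 1

# Negative binomial: number of trials up to and including the r-th success
r = 3
success_positions = [i + 1 for i, t in enumerate(trials) if t == 1]
trials_to_rth = success_positions[r - 1]

print(first_trial, successes_in_n, trials_to_first, trials_to_rth)
```

Repeating the run with different seeds and collecting each quantity gives samples from the corresponding distribution.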

I think of the Poisson process as the continuous version of the Bernoulli. Time is discrete in Bernoulli trials vs continuous for Poisson. If you want to know:
 * How many events occur in an interval of length (t)? -> Poisson
 * How long until your first event? -> Exponential
 * How long until the r-th event? -> Gamma
 * What fraction of the time until the (A+B)-th event has elapsed when the A-th event occurs? -> Beta(A, B)
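
The same trick works in continuous time: build a Poisson process by adding up exponential gaps, then read the count and the waiting times off the list of event times. A rough sketch with the standard library; the rate of 2 events per unit time, the horizon, and the seed are assumptions made for the example.

```python
import random

rng = random.Random(1)   # fixed seed so the run is repeatable
rate = 2.0               # illustrative events per unit time (assumption)
horizon = 1000.0         # total simulated time (assumption)

# Event times are running sums of independent exponential gaps
event_times = []
t = rng.expovariate(rate)
while t < horizon:
    event_times.append(t)
    t += rng.expovariate(rate)

# Poisson: number of events in an interval of length 1
count_in_first_unit = sum(1 for s in event_times if s < 1.0)

# Exponential: waiting time until the first event
time_to_first = event_times[0]

# Gamma: waiting time until the r-th event (sum of r exponential gaps)
r = 3
time_to_rth = event_times[r - 1]

print(count_in_first_unit, time_to_first, time_to_rth)
print("events per unit time:", len(event_times) / horizon)
```

The last line should land close to `rate`, which is a handy sanity check on the simulation.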

Distribution Help
http://www.math.wm.edu/~leemis/chart/UDR/UDR.html

http://www.cs.elte.hu/~mesti/valszam/kepletek.pdf

In [1]:
mat = [[1,2,3],[4,5,6],[7,8,9]]
diag = [ mat[i][i] for i in range(len(mat)) ]

In [2]:
diag

[1, 5, 9]

Note: np.random.choice(a, size=None, replace=True, p=None) takes probability weights through p, which is handy for simulating draws from a discrete distribution.

## Using Linear Regression With Python

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score, train_test_split

# X (feature matrix) and y (target vector) are assumed to be defined already

# Hold out 30% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Fit your model using the training set
linear = LinearRegression()
model = linear.fit(X_train, y_train)
# sklearn's LinearRegression has no summary(); inspect the fit directly
# (statsmodels' OLS(...).fit().summary() gives the full table)
print(model.intercept_, model.coef_, model.score(X_test, y_test))

# Call predict to get the predicted values for training and test set
train_predicted = linear.predict(X_train)
test_predicted = linear.predict(X_test)

# k-fold cross validation: one R^2 score per fold
scores = cross_val_score(LinearRegression(), X, y, cv=5)

# Repeated random splits, each holding out 200 rows for testing
cv_shuffle = ShuffleSplit(n_splits=10, test_size=200, random_state=0)
shuffle_scores = cross_val_score(LinearRegression(), X, y, cv=cv_shuffle)

### Linear Regression Assumptions
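Linear relationship between features and target. Multivariate normality. No or little multicollinearity among the features. The usual list also includes no or little autocorrelation in the residuals, and homoscedasticity (constant residual variance).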

In [1]:
import numpy as np

a = np.array([1,2,3])

In [5]:
a[:, np.newaxis]  # np.newaxis is an index object, not a function; the result has shape (3, 1)

In [None]:
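# simulating: every scipy.stats distribution has an rvs() method for random draws
import scipy.stats as scs
scs.norm(loc=0, scale=1).rvs(size=10)  # e.g. 10 draws from a standard normal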

## Logistic Regression

In [None]:

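import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_features=2, n_redundant=0, n_informative=2,
                           n_clusters_per_class=2, n_samples=1000)
X_train, X_test, y_train, y_test = train_test_split(X, y)

model = LogisticRegression()
model.fit(X_train, y_train)
probabilities = model.predict_proba(X_test)[:, 1]

# sklearn's roc_curve takes (y_true, y_score) and returns fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, probabilities)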

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate (1 - Specificity)")
plt.ylabel("True Positive Rate (Sensitivity, Recall)")
plt.title("ROC plot of fake data")
plt.show()

In [None]:
from statsmodels.discrete.discrete_model import Logit
from statsmodels.tools import add_constant

X = df[['gre', 'gpa', 'rank']].values
X_const = add_constant(X, prepend=True)
y = df['admit'].values

logit_model = Logit(y, X_const).fit()

logit_model.summary()

### Use KFold cross validation to calculate average accuracy, precision, recall

In [None]:
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

kfold = KFold(n_splits=3)  # three folds, the default in the old cross_validation API

accuracies = []
precisions = []
recalls = []

for train_index, test_index in kfold.split(X):
    model = LogisticRegression()
    model.fit(X[train_index], y[train_index])
    y_predict = model.predict(X[test_index])
    y_true = y[test_index]
    accuracies.append(accuracy_score(y_true, y_predict))
    precisions.append(precision_score(y_true, y_predict))
    recalls.append(recall_score(y_true, y_predict))

print("accuracy:", np.average(accuracies))
print("precision:", np.average(precisions))
print("recall:", np.average(recalls))

## Bias-Variance Tradeoff

Bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting). It shows up as high error on both the training set and the test set.

Variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data rather than the intended outputs. It shows up as low training error but much higher test error.
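
To make the tradeoff concrete, here is a small sketch that uses a one-dimensional nearest-neighbour average as the model and compares training and test error for different k. Everything here is made up for illustration: the sine-plus-noise data, the sample sizes, the seed, and the helper names `sample`, `knn_predict`, and `mse`.

```python
import math
import random

def sample(n, rng, noise=0.3):
    """Draw n points (x, y) with y = sin(2*pi*x) plus Gaussian noise."""
    points = []
    for _ in range(n):
        x = rng.random()
        points.append((x, math.sin(2 * math.pi * x) + rng.gauss(0, noise)))
    return points

def knn_predict(train, x, k):
    """Average the targets of the k training points whose x is closest."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, points, k):
    """Mean squared error of the k-neighbour predictor on the given points."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in points) / len(points)

rng = random.Random(0)
train = sample(60, rng)
test = sample(200, rng)

results = {}
for k in (1, 7, len(train)):
    results[k] = (mse(train, train, k), mse(train, test, k))
    print("k =", k, "train MSE:", results[k][0], "test MSE:", results[k][1])
```

With k = 1 the model memorises the training set (zero training error, noticeably worse test error: high variance). With k equal to the training set size it predicts the overall mean everywhere (high error on both sets: high bias). A middle value of k usually does best on the test set.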

## Beta Distribution

The beta distribution can be used in Bayesian analysis to describe initial knowledge concerning probability of success such as the probability that a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the random behavior of percentages and proportions.

![alt text](http://www.gaussianwaves.com/gaussianwaves/wp-content/uploads/2013/10/Bayes_theorem_1.png)
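
As a rough sketch of the Bayesian use: with a Beta(alpha, beta) prior on the success probability and binomial data, the posterior is again a Beta, with the observed successes added to alpha and the failures added to beta. The numbers below (a uniform prior and 9 successful missions out of 10) are invented for the example, and `beta_update` and `beta_mean` are my own helper names.

```python
def beta_update(alpha, beta, successes, failures):
    """Conjugate update: Beta prior plus binomial data gives a Beta posterior."""
    return alpha + successes, beta + failures

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)

prior = (1, 1)  # Beta(1, 1) is the uniform prior on [0, 1]
posterior = beta_update(*prior, successes=9, failures=1)

print("posterior parameters:", posterior)
print("posterior mean:", beta_mean(*posterior))
```

The posterior mean sits between the prior mean (0.5) and the observed success rate (0.9), and moves toward the data as more trials come in.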

In [6]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score


### Assessment 3b

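Info Gain = Impurity(parent) - sum over children( P(child) * Impurity(child) ), where P(child) is the fraction of the parent's data points that land in that child

Gini Impurity: probability of guessing the class of a randomly chosen data point incorrectly when the guess is drawn from the node's own class distribution

Impurity for a single node = 1 - sum over all classes( p_class^2 )
= sum over all classes( p_class * (1 - p_class) )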
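
The two formulas above can be sketched in a few lines. The labels and the split below are made up for the example, `gini` and `information_gain` are my own helper names, and the code assumes every node is non-empty.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a node: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

# Hypothetical node with 5 "yes" and 5 "no", split into two children
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

print("parent impurity:", gini(parent))
print("child impurity:", gini(left), gini(right))
print("information gain:", information_gain(parent, [left, right]))
```

Here the parent impurity is 0.5, each child has impurity 0.32, so the gain is about 0.18.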

### Logistic Regression
* Maximize the log likelihood
* Accuracy
    * How often is our classifier correct?
    * (Tp + Tn) / (P + N), where P and N are the counts of actual positives and actual negatives
* True Positive Rate (Sensitivity, Recall)
    * When the answer is yes, how often do we predict yes?
    * Tp / P
* True Negative Rate (Specificity)
    * When the answer is no, how often do we predict no?
    * Tn / N
* False Positive Rate (1 - Specificity)
    * When the answer is no, how often do we predict yes?
    * Fp / N
* False Negative Rate (1 - Sensitivity)
    * When the answer is yes, how often do we predict no?
    * Fn / P
* Precision: Tp / (Tp + Fp)

* False Positive: Type 1 error
* False Negative: Type 2 error
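
All of these rates come straight from the four confusion-matrix counts, with P = Tp + Fn and N = Tn + Fp. A minimal sketch; the counts passed in at the bottom are invented, and `classification_rates` is a hypothetical helper name.

```python
def classification_rates(tp, fp, tn, fn):
    """Compute the standard rates from confusion-matrix counts."""
    p = tp + fn   # actual positives
    n = tn + fp   # actual negatives
    return {
        "accuracy": (tp + tn) / (p + n),
        "tpr": tp / p,              # sensitivity, recall
        "tnr": tn / n,              # specificity
        "fpr": fp / n,              # 1 - specificity
        "fnr": fn / p,              # 1 - sensitivity
        "precision": tp / (tp + fp),
    }

rates = classification_rates(tp=40, fp=10, tn=45, fn=5)
for name, value in rates.items():
    print(name, round(value, 3))
```

Note that tpr + fnr and tnr + fpr each add up to 1, which is a quick way to check that the counts were not mixed up.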

### Gradient Descent


cost function
![alt text](https://github.com/zipfian/DSI_Lectures/raw/master/gradient-descent/giovanna_thron/images/log_likelihood_gradient2.png)


![alt text](https://github.com/zipfian/DSI_Lectures/raw/master/gradient-descent/giovanna_thron/images/logit.png)
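
Going by the image file names, I am assuming the pictures show the logistic log likelihood and its gradient. Under that assumption, a bare-bones gradient ascent (equivalently, gradient descent on the negative log likelihood) might look like the sketch below. The toy data, the step size `alpha`, and the iteration count are all made up; the first column of `X` is a constant 1 for the intercept.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(X, y, w):
    """Sum over rows of y*log(p) + (1 - y)*log(1 - p), with p = sigmoid(w . x)."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def gradient(X, y, w):
    """Gradient of the log likelihood: sum over rows of (y - p) * x."""
    grad = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        for j, xj in enumerate(xi):
            grad[j] += err * xj
    return grad

# Toy data (not linearly separable, so the maximum is finite)
X = [[1, 0.5], [1, 1.0], [1, 1.5], [1, 2.0], [1, 2.5], [1, 3.0]]
y = [0, 0, 1, 0, 1, 1]

w = [0.0, 0.0]
alpha = 0.05   # step size (assumption; too large a value can overshoot)
history = [log_likelihood(X, y, w)]
for _ in range(2000):
    g = gradient(X, y, w)
    w = [wj + alpha * gj for wj, gj in zip(w, g)]   # step uphill
    history.append(log_likelihood(X, y, w))

print("weights:", w)
print("log likelihood: start", history[0], "end", history[-1])
```

Tracking the log likelihood at each step is a cheap convergence check: it should climb and then flatten out.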

## kNN
* Classify a point by majority vote among its k nearest neighbors, as measured by a distance metric
* Pros
    * Easy to train (save the data)
    * Works with any number of classes
    * Easy to add new training datapoints
* Cons
    * Very slow to predict, have to calculate distance to every point in data set
    * Curse of dimensionality
        * As dimensionality increases, kNN performance decreases; nearest neighbors no longer nearby
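
A minimal kNN classifier fits in a few lines, which also makes the pros and cons above easy to see: "training" is just keeping the data, while every prediction scans the whole training set. This sketch uses Euclidean distance and needs Python 3.8+ for `math.dist`; the points, labels, and the name `knn_classify` are invented for the example.

```python
import math
from collections import Counter

def knn_classify(train_X, train_y, query, k=3):
    """Return the majority label among the k training points closest to query."""
    # Distance from the query to every training point (the slow part)
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]

# Two made-up clusters
train_X = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(train_X, train_y, (1.2, 1.4)))
print(knn_classify(train_X, train_y, (6.4, 6.4)))
```

Adding a new training point is just an append to `train_X` and `train_y`.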

## Decision Trees
* Calculate information gain to determine best split
* Classification (categorical target)
    * Use Gini impurity or entropy
    * Predict the majority class at each leaf
* Regression (continuous target)
    * Use the reduction in variance (RSS)
    * Predict the average target value at each leaf
![alt text](http://image.slidesharecdn.com/untitled-150504104846-conversion-gate01/95/a-taste-of-random-decision-forests-on-apache-spark-21-638.jpg?cb=1430736669)

* Decision trees are high variance and prone to overfitting; pruning eases this
    * Control by
        * Leaf size: stop when there are few data points at a node
        * Depth: stop when the tree gets too deep
        * Class mix: stop when some percentage of the data points at a node are the same class
        * Error reduction: stop when the information gain from a split is too small
* Pros
    * Interpretable, captures feature interactions, cheap to predict, can model complex/non-linear relationships, can handle missing values and mixed data, extensible
* Cons
    * Greedy splitting (local optima), expensive to train, very high variance

#### Regression Decision Trees
* Responses are real values, so cross-entropy and the Gini index don't apply
* Choose the best split by minimizing the RSS around the mean of each leaf
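
For a single numeric feature, the RSS rule can be sketched as: try a threshold between each pair of neighbouring x values, and keep the one with the smallest total RSS around the two leaf means. The data and the helper names `rss` and `best_split` are made up for illustration.

```python
def rss(values):
    """Residual sum of squares around the mean (0 for an empty leaf)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, total RSS) for the best binary split on one feature."""
    pairs = sorted(zip(xs, ys))
    best = (None, float("inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = rss(left) + rss(right)
        if total < best[1]:
            best = (threshold, total)
    return best

# Two obvious groups of responses
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]

threshold, total = best_split(xs, ys)
print("split at x <", threshold, "with total RSS", total)
```

A full regression tree would apply the same search recursively to each leaf and across every feature.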

## Random Forest
### Bias Variance Tradeoff
#### Bias
* Can our model represent the true best predictor?
* Caused by choosing a function that is too simple or a regularization parameter that is too strong
#### Variance
* Random noise due to the specifics of our training data
* Gets better as we get more data


## Profit Curve
* 
