## Sidenotes (definitions, code snippets, resources, etc.)
- Note on data structure: list
    - empty list has a truth value of false
- [Feature Selection with scikit-learn for intro_to_ml](http://napitupulu-jon.appspot.com/posts/feature-selection-ud120.html)
    - Looks very helpful for copying notes, course materials
    - Investigate meaning of `# %%writefile new_enron_feature.py` inserted at top of edited studentMain.py module
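
A quick check of the note on empty lists above. This is only a minimal sketch; the `describe` helper is a made-up name for illustration.

```python
# An empty list is falsy; any non-empty list is truthy,
# even when the elements themselves are falsy (e.g. [0]).
empty = []
non_empty = [0]


def describe(items):
    """Return a label based on the truth value of the list."""
    if items:  # idiomatic: no need for len(items) > 0
        return "has items"
    return "empty"


print(bool(empty), describe(empty))
print(bool(non_empty), describe(non_empty))
```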

### ML Algorithms
- A classic way to overfit an algorithm is by using lots of features and not a lot of training data.
- _Decision Trees_ are easy to overfit.
- The classic use of regression is when the output/labels consist of continuous data (e.g. determining the price of a house from its features)
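
To see the overfitting point in action, here is a rough sketch on made-up data (many noise features, few training points, random labels). The sizes and seeds are arbitrary assumptions, not values from the course.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(42)

# Hypothetical data: 30 training points, 50 pure-noise features, random labels.
X_train = rng.normal(size=(30, 50))
y_train = rng.randint(0, 2, size=30)
X_test = rng.normal(size=(200, 50))
y_test = rng.randint(0, 2, size=200)

# No depth limit, so the tree is free to memorize the training set.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```

The tree fits the training labels perfectly even though there is no signal to learn, so the gap between the two scores is the overfitting.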

### Useful git code snippets
- `git reset --soft HEAD~`
    - Undoes the last commit but leaves the index and working tree untouched, so the committed changes remain staged

# Principal Component Analysis
`sklearn.decomposition.PCA` [Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) and [User Guide](http://scikit-learn.org/stable/modules/decomposition.html#pca)
- `n_components`: number of components to keep; when left unset, all are kept, i.e. `n_components == min(n_samples, n_features)`
- `explained_variance_ratio_`: fraction of the total variance explained by each principal component (adds up to 1 when all components are kept); the eigenvalues themselves are in `explained_variance_`
- `components_`: list of principal components, providing directional information for each component
- note that a visualization might not seem to show orthogonal lines, but this is because of how the axes are scaled (could cut off based on lower limit)
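
A small sketch of the attributes listed above, run on synthetic 2D data (the data and seed are assumptions for illustration only):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)

# Synthetic 2D data: y is roughly 2x plus a little noise.
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.3, size=200)])

pca = PCA()  # n_components left unset, so both components are kept
pca.fit(X)

print("n_components_:", pca.n_components_)
print("explained_variance_ratio_:", pca.explained_variance_ratio_)
print("explained_variance_ (eigenvalues):", pca.explained_variance_)
print("components_ (one row per principal component):")
print(pca.components_)
```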

__definition:__ principal component analysis (PCA)
- PCA returns straight-line _axes of variation_ as vectors, as well as an importance value for each one
    - These two axes define a _coordinate system_ centered around the data.
    - the x-prime vector (like the x-axis) is aligned with the _principal axis of variation_ (similar to a regression line; it has the higher importance value of the two)
    - the y-prime vector is orthogonal to x-prime (their dot product equals 0)
- Part of its beauty is that it can be useful with data that is not perfectly 1D, i.e. not well fit by a regression line.
- since PCA uses vectors for its axes, it is more versatile than regression y = f(x), which cannot represent x = c cases without swapping axes
- Importance value
    - calculated with an _eigenvalue decomposition_ implemented by PCA (math, will learn later/as needed)
    - If x-axis _dominates_ y-axis, that means it has a much higher importance value
    - If no axis dominates, PCA output not useful
- ![PCA data set examples](lesson_12_images/pca_datasets.png)
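
The definition above can be sketched with plain NumPy: an eigenvalue decomposition of the covariance matrix yields the axes (eigenvectors) and their importance values (eigenvalues). This is one common way to compute PCA and not necessarily what sklearn does internally; the data is synthetic.

```python
import numpy as np

rng = np.random.RandomState(1)

# Synthetic 2D data spread mostly along one diagonal direction.
x = rng.normal(size=300)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.2, size=300)])

# Center the data so the new coordinate system sits on the data's mean.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
importance = eigenvalues[order]
axes = eigenvectors[:, order].T  # row 0 is x-prime, row 1 is y-prime

x_prime, y_prime = axes
print("x-prime:", x_prime, "importance:", importance[0])
print("y-prime:", y_prime, "importance:", importance[1])
print("dot product:", np.dot(x_prime, y_prime))
```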

## Dimensionality in PCA
- Examples of one-dimensional data that exist in two-dimensional space, as defined in PCA:
    - y = c and x = c (even with noise)
    - straight diagonal lines
        - applies even when there are small deviations (noise)
        - can manipulate (_by rotation and translation only_) with x-prime and y-prime notation for new axes
- Curved lines of data that can be manipulated into 1D representations (like for regressions) are _not_ considered 1D when using PCA.
     - ![Example of dimensionality for PCA](lesson_12_images/pca_dimensionality.png)
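
One way to check the 1D claim numerically, as a sketch on synthetic data: a noisy diagonal line has nearly all of its variance on the first component, while a curve (here a parabola, chosen arbitrarily) does not.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
t = np.linspace(-1, 1, 200)

# A straight diagonal line with small noise: 1D in the PCA sense.
line = np.column_stack([t, t + rng.normal(scale=0.02, size=t.size)])

# A parabola: could be fit by a regression, but not 1D for PCA.
curve = np.column_stack([t, t ** 2])

line_ratio = PCA(n_components=2).fit(line).explained_variance_ratio_
curve_ratio = PCA(n_components=2).fit(curve).explained_variance_ratio_

print("noisy line:", line_ratio)
print("parabola:  ", curve_ratio)
```

Rotation and translation alone cannot flatten the parabola, so a noticeable share of its variance stays on the second component.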

## Simple Examples:
- PCA outputs vectors that are normalized to length 1
- For data along a 45-degree diagonal, the orthogonal vectors have components of magnitude 1 / (root of two)
    - those are x-prime and y-prime, each consisting of a delta-x and delta-y
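
A sketch of the 1 / (root of two) case, assuming points that lie exactly on the line y = x (the signs of the vectors may differ, so only magnitudes are compared):

```python
import numpy as np
from sklearn.decomposition import PCA

# Ten points exactly on the diagonal y = x.
x = np.arange(10, dtype=float)
X = np.column_stack([x, x])

pca = PCA(n_components=2).fit(X)
first, second = pca.components_

print("x-prime (delta-x, delta-y):", first)
print("y-prime (delta-x, delta-y):", second)
print("length of x-prime:", np.linalg.norm(first))
print("1 / sqrt(2) =", 1 / np.sqrt(2))
```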

Example 1:
![PCA Example 1](lesson_12_images/pca_example_1.png)

Example 2:
- ++ in image below indicates that the x-prime axis of variation will have a much higher importance than the other.
![PCA Example 2](lesson_12_images/pca_example_2.png)

## Measurable vs. Latent Features
- folds measurable features into a single latent feature (an underlying factor we can determine from intuition)

    - e.g. no. of rooms, square-footage of house -> size of house
    - e.g. safety of neighborhood, schools nearby -> neighborhood
- Can use SelectKBest (or maybe SelectPercentile) to preserve data, but fold into latent aspects

## Composite Features
- Can make a composite feature (or principal component from PCA!) to measure/represent latent feature
    - part of dimensionality reduction and unsupervised learning (covered later in course)
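
A rough sketch of folding two measurable features into one composite "size" feature. The house numbers are invented, and standardizing first with `StandardScaler` is my own assumption (so square footage does not dominate purely because of its units).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical houses: [number of rooms, square footage]
houses = np.array([
    [2, 800],
    [3, 1100],
    [3, 1250],
    [4, 1600],
    [5, 2100],
    [6, 2500],
], dtype=float)

# Put both features on a comparable scale before PCA.
scaled = StandardScaler().fit_transform(houses)

# Keep only the first principal component as the composite feature.
pca = PCA(n_components=1)
size = pca.fit_transform(scaled)

print("composite 'size' feature:", size.ravel())
print("variance kept:", pca.explained_variance_ratio_[0])
```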
    
    
## Determining a Principal Component
### Maximal Variance
__definition:__ direction of maximal variance
- seeks to minimize information loss when translating into 1D
- information lost is proportional to distance from point to line
- direction of maximal variance is mathematically defined as line that has least information loss (in aggregate, for all data points)
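
The link between maximal variance and least information loss can be checked by brute force. This sketch tries candidate directions one degree apart on synthetic data; measuring the loss as the sum of squared point-to-line distances is an assumption on my part.

```python
import numpy as np

rng = np.random.RandomState(3)

# Synthetic, centered 2D data spread roughly along the diagonal.
x = rng.normal(size=200)
points = np.column_stack([x, x + rng.normal(scale=0.3, size=200)])
points = points - points.mean(axis=0)

angles = np.deg2rad(np.arange(0, 180))
variances = []
losses = []
for angle in angles:
    direction = np.array([np.cos(angle), np.sin(angle)])  # unit vector
    projected = points @ direction                         # 1D coordinates on the line
    residual = points - np.outer(projected, direction)     # what the projection discards
    variances.append(np.sum(projected ** 2))
    losses.append(np.sum(residual ** 2))

best_by_variance = int(np.argmax(variances))
best_by_loss = int(np.argmin(losses))
print("angle with maximal variance (degrees):", best_by_variance)
print("angle with least information loss (degrees):", best_by_loss)
```

For every direction the projected variance and the squared loss add up to the same total, which is why maximizing one minimizes the other.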

From wiki (confusing that the principal component is described as having the _higher_ variance of the two components; key: it is the _direction_ of maximal variance, not the line with the most variance):

> This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

### Maximal Variance and Information Loss



### PCA as a Generalized Algorithm for Feature Transformation
- necessary at scale, when there are too many features to combine into composite features by hand
- PCA algorithm will run through all combinations and provide first principal component, second, etc. ranked by importance value
- powerful unsupervised learning technique

### When to Use PCA
1. 



## PCA Mini-Project!
- Eigenfaces code mostly taken from [this example](http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html) from sklearn's documentation.

We mentioned that PCA will order the principal components, with the first PC giving the direction of maximal variance, second PC has second-largest variance, and so on. How much of the variance is explained by the first principal component? The second?

- _Answer:_ 0.17561573, 0.15863393 (not accepting it, different people got different answers depending on OS, sklearn version, etc.)

Visual Output:

![eigenfaces visual output](lesson_12_images/eigenfaces_visual_output.png)


In [1]:
from eigenfaces import *
# index given slightly differently from list
print
print "Q1:", pca.explained_variance_ratio_[:2]


Faces recognition example using eigenfaces and SVMs

The dataset used in this example is a preprocessed excerpt of the
"Labeled Faces in the Wild", aka LFW_:

  http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB)

  .. _LFW: http://vis-www.cs.umass.edu/lfw/

  original source: http://scikit-learn.org/stable/auto_examples/applications/face_recognition.html


Total dataset size:
n_samples: 1217
n_features: 1850
n_classes: 6
Extracting the top 15 eigenfaces from 912 faces
done in 0.060s
Projecting the input data on the eigenfaces orthonormal basis
done in 0.011s
Fitting the classifier to the training set
done in 11.164s
Best estimator found by grid search:
SVC(C=1000.0, cache_size=200, class_weight='balanced', coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.1, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)
Predicting the people names on the testing set
done in 0.008s
                   precision    recall  

Now you'll experiment with keeping different numbers of principal components. In a multiclass classification problem like this one (more than 2 labels to apply), accuracy is a less-intuitive metric than in the 2-class case. Instead, a popular metric is the F1 score.

We’ll learn about the F1 score properly in the lesson on evaluation metrics, but you’ll figure out for yourself whether a good classifier is characterized by a high or low F1 score. You’ll do this by varying the number of principal components and watching how the F1 score changes in response.

As you add more principal components as features for training your classifier, do you expect it to get better or worse performance?

- _Answer:_ Better. Ideally, we hope that adding more components will give us more signal information to improve the classifier performance.

Change n_components to the following values: [10, 15, 25, 50, 100, 250]. For each number of principal components, note the F1 score for Ariel Sharon. (For 10 PCs, the plotting functions in the code will break, but you should be able to see the F1 scores.) If you see a higher F1 score, does it mean the classifier is doing better, or worse?

- _Answer:_ Better. Higher F1 means better performance of classifier.

Do you see any evidence of overfitting when using a large number of PCs? Does the dimensionality reduction of PCA seem to be helping your performance here?

- _Answer:_ Yes, the F1 score starts to drop when there are too many PCs.

## Selecting a Number of Principal Components
- Best way of determining this is by testing different no. of components
    - like when determining which features to include, from ranked importance/relevance
- Note: Do PCA _before_ feature selection (otherwise proceed with caution)
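
Testing different numbers of components can be sketched as a simple loop. This is not the mini-project code: it assumes sklearn's bundled digits dataset as a stand-in for the faces data, and the candidate values and classifier settings are arbitrary choices.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Stand-in dataset (assumption): 8x8 digit images bundled with sklearn.
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=42
)

scores = {}
for n in [2, 5, 10, 20, 40]:
    # PCA is fit on the training data only, then feeds the classifier.
    model = make_pipeline(
        PCA(n_components=n, random_state=42),
        SVC(kernel="rbf", gamma="scale"),
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[n] = f1_score(y_test, predictions, average="macro")

for n in sorted(scores):
    print("n_components =", n, "macro F1 =", round(scores[n], 3))

best_n = max(scores, key=scores.get)
print("best n_components among those tried:", best_n)
```

Scoring on a held-out test split keeps the comparison honest; with a larger search, cross-validation would be the safer choice.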