# I. Statistics Review
It's been a while since my Stats 101 and CFA Level 1 days, so here's a refresher.

#### A) Correlation
- Represented as *"R"*
- Shows the strength and direction of the linear relationship between 2 variables, not causality

#### B) Variance
- Represented as *"Sigma-squared"*
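- Quantifies how spread out a set of values is around its mean. Applied to a model's errors, it describes the disparity between predictions and actuals.
- This is related to a "cost function" such as mean squared error (MSE), which averages the squared prediction errors, though the two aren't the same thing.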
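
To see how variance and a cost function differ, here is a minimal sketch using NumPy. The `actuals` and `predictions` arrays are made-up numbers for illustration, and MSE is just one common choice of cost function, assumed here for the comparison.

```python
import numpy as np

# Hypothetical values, made up for illustration only.
actuals = np.array([3.0, 5.0, 7.0, 9.0])
predictions = np.array([2.5, 5.5, 6.5, 9.5])

# Variance: average squared distance of the values from their own mean.
variance = np.var(actuals)

# Mean squared error: average squared distance between predictions and actuals.
mse = np.mean((predictions - actuals) ** 2)

print(f"variance of actuals: {variance:.2f}")
print(f"mean squared error:  {mse:.2f}")
```

Both are averages of squared differences. Variance compares values to their mean, while MSE compares predictions to actuals.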

#### C) Coefficient of Determination
- Represented as *"R-squared"*
- At first glance, this seems redundant with (A) Correlation, and in a simple linear regression with one input it is: R-squared is literally R squared (R = 0.5 gives R-squared = 0.25).
- It shows the share of the output's (B) Variance that the model's inputs explain, which becomes more informative in a more complex model with multiple variables.
- It's possible for an input variable to have a clean, consistent relationship with the output but still play a minor role in predicting it, because squaring shrinks moderate correlations quickly (R = 0.3 explains only 9% of the variance). Clean and understandable relationships can be overshadowed by stronger factors or confounding variables. For example:
  - __PREDICTING HOW ATTRACTIVE A GUY IS TO GIRLS__
    - In a vacuum, women generally find a guy less attractive the more video games he plays.
    - However, a girl's repulsion often immediately goes out the window if the guy:
      - is tall
      - has a good income
      - is a colorful storyteller
    - Therefore, (lack of) video game activity is an example of something with a <u>clear, consistent correlation</u>, but a <u>low __coefficient of determination__</u> in predicting how attractive a guy is to girls
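
As a rough check on how R and R-squared relate, here is a minimal sketch on synthetic data. The 0.3 slope and the noise term (a stand-in for all the other factors) are assumptions chosen for illustration, not measurements of anything. For a single input fitted with ordinary least squares, the model's R-squared should match the square of R.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: y depends weakly on x and mostly on noise (other factors).
x = rng.normal(size=500)
y = 0.3 * x + rng.normal(size=500)

# Pearson correlation between x and y.
r = np.corrcoef(x, y)[0, 1]

# R-squared of a one-variable linear regression fitted on the same data.
model = LinearRegression().fit(x.reshape(-1, 1), y)
r_squared = model.score(x.reshape(-1, 1), y)

print(f"R = {r:.3f}, R squared by hand = {r**2:.3f}, model R-squared = {r_squared:.3f}")
```

The relationship has a consistent direction, yet the input explains only a small share of the variance because the noise dominates.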

#### D) Covariance
- Represented as *"Cov(X,Y)"* when discussing the relationship between X and Y
- Shows directionality:
  - a positive covariance means factors increase together (housing prices grow with square footage)
  - a negative covariance means one factor shrinks while the other grows (a guy's level of attractiveness shrinks as his video game consumption goes up)
- Expressed in the combined units of X and Y rather than on a standard scale, so its magnitude is hard to interpret. That's where __correlation__ comes in: it is covariance divided by both standard deviations, which scales it from -1 to 1.
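
The difference in scale can be sketched with NumPy. The housing numbers below are hypothetical, made up only to show that changing units changes covariance but leaves correlation alone.

```python
import numpy as np

# Hypothetical housing data, made up for illustration.
square_feet = np.array([800, 1200, 1500, 2000, 2600])
price_usd = np.array([150_000, 210_000, 260_000, 330_000, 420_000])

# np.cov returns a 2x2 covariance matrix; the off-diagonal entry is Cov(X, Y).
cov_xy = np.cov(square_feet, price_usd)[0, 1]

# np.corrcoef rescales by both standard deviations, giving a value in [-1, 1].
corr_xy = np.corrcoef(square_feet, price_usd)[0, 1]

# Switching units (dollars to thousands of dollars) changes covariance
# by the same factor of 1000 but leaves correlation untouched.
cov_k = np.cov(square_feet, price_usd / 1000)[0, 1]
corr_k = np.corrcoef(square_feet, price_usd / 1000)[0, 1]

print(f"covariance:  {cov_xy:,.0f} vs {cov_k:,.0f}")
print(f"correlation: {corr_xy:.4f} vs {corr_k:.4f}")
```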

# II. Logistic Regression
Refer back to the Handwritten Digit Classification repo inspired by Chapter 3 of Géron's ML textbook.

This is a probability model to figure out *"how likely is this a hot dog vs. not a hot dog?"* 

Logistic regression __outputs a probability that is turned into a binary (yes/no) answer__, unlike linear regression, which predicts a continuous value (e.g. housing price estimations).

![image.png](attachment:8e619822-d2ae-4139-8c45-fd73ce58ae49.png)

### Why The Curve?
The image above is a __Logistic Function__, whose S-shaped graph is known as a "Sigmoid Curve". For the sake of this simplified example, assume the left half represents low confidence that something is classified as true and the right half represents high confidence. The slope is steepest at the tipping point in the middle, because that is where each small change in the input is most likely to be a difference maker.
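
The shape of the curve can be sketched in a few lines. This uses the standard logistic function, 1 / (1 + e^(-t)); the helper names `sigmoid` and `sigmoid_slope` are my own, chosen for illustration.

```python
import numpy as np

def sigmoid(t):
    """Logistic function: maps any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_slope(t):
    """Derivative of the logistic function, which equals s * (1 - s)."""
    s = sigmoid(t)
    return s * (1.0 - s)

# Compare the output and the slope far from, and at, the tipping point (t = 0).
for t in [-6, -2, 0, 2, 6]:
    print(f"t={t:+d}  sigmoid={sigmoid(t):.3f}  slope={sigmoid_slope(t):.3f}")
```

The slope peaks at t = 0 and flattens toward both ends, which matches the "tipping point" reading of the curve.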

### Modern-Day Example
The U.S. House of Representatives has 435 voting members, and nearly all of them belong to either the Republican or the Democratic Party. Suppose a controversial bill is proposed (increased gun control, for example) and we are tasked with building a model to predict whether it will pass. We know from history and political science theory that Republicans will typically vote against gun control and Democrats will typically vote in favor of it, but it's not guaranteed. So we build a model in which party affiliation is a major input for predicting pass vs. rejection.

Consider 3 scenarios:

- __200 Republicans and 235 Democrats__: A likely pass
- __235 Republicans and 200 Democrats__: A likely rejection
- __217 Republicans and 217 Democrats and 1 Independent__: Teetering on the will of an independent and/or a few House members willing to stand up to their party, which is where you see the dramatic shift between positive/negative or yes/no.
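
The three scenarios can be turned into a toy calculation. This is only a sketch: the function `pass_probability` and its `steepness` value are hypothetical, picked by hand rather than fitted to any voting data, and the seat margin is assumed to be the only input.

```python
import math

def pass_probability(democrats, republicans, steepness=0.1):
    """Toy model: squash the seat margin through a logistic function.

    `steepness` is a made-up parameter, not estimated from real votes.
    """
    margin = democrats - republicans
    return 1.0 / (1.0 + math.exp(-steepness * margin))

# Seat splits from the three scenarios (Democrats, Republicans).
scenarios = {
    "likely pass": (235, 200),
    "likely rejection": (200, 235),
    "toss-up": (217, 217),
}

for label, (dems, reps) in scenarios.items():
    print(f"{label}: {pass_probability(dems, reps):.3f}")
```

Near an even split, a swing of a few seats moves the probability far more than the same swing would in a lopsided chamber.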

# III. Demonstration: Classifying Flowers into 3 Species
### 1. Sourcing the Data Set of Flower Measurements
Since this is a classic dataset, we can take a shortcut and load it from scikit-learn like we did with the handwritten digits. Let's verify the download by checking the dataset's keys.

In [9]:
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

In [10]:
iris = datasets.load_iris()
list(iris.keys())

['data',
 'target',
 'frame',
 'target_names',
 'DESCR',
 'feature_names',
 'filename',
 'data_module']

In [11]:
X = iris["data"][:, 3:]  # petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0
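
A natural next step is to fit the classifier on these two arrays. The following is a sketch of one way to do it with scikit-learn's default `LogisticRegression` settings. The variable names `log_reg`, `X_new`, `y_proba`, and `decision_boundary` are my own, and the 0 to 3 cm range for petal width is an assumption chosen to cover the data.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, 3:]  # petal width (cm)
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

# Fit a binary classifier: Virginica vs. not Virginica.
log_reg = LogisticRegression()
log_reg.fit(X, y)

# Estimated probabilities for petal widths between 0 and 3 cm.
X_new = np.linspace(0, 3, 1000).reshape(-1, 1)
y_proba = log_reg.predict_proba(X_new)  # column 0: not Virginica, column 1: Virginica

# Smallest petal width in the grid where the model favors Virginica.
decision_boundary = X_new[y_proba[:, 1] >= 0.5][0, 0]
print(f"decision boundary near {decision_boundary:.2f} cm")
```

Plotting `y_proba[:, 1]` against `X_new` should trace out the same sigmoid shape discussed in Section II.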
