# Phase 4 Notes

# Distance Metrics

## KNN
KNN is an effective classification and regression algorithm that uses nearby points in order to generate a prediction.

1. Choose a point 
2. Find the K-nearest points
    1. K is a predefined user constant such as 1, 3, 5, or 11 
3. Predict a label for the current point:
    1. Classification - Take the most common class of the k neighbors
    2. Regression - Take the average target metric of the k neighbors
    3. Both classification or regression can also be modified to use weighted averages based on the distance of the neighbors 
4. KNN doesn't technically train or fit; it stores the training data and does the work at prediction time
5. Efficient on small to mid-size data, but not a good fit for large data
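
The steps above can be sketched in plain Python. This is a minimal illustration rather than how a library implements KNN; the function name `knn_predict` and the toy data are assumptions made up for the example, and it uses Euclidean distance with an unweighted majority vote.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's distance to the query with its label
    distances = sorted(
        (dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # Keep the labels of the k closest points and take the most common one
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]


# Hypothetical toy data: two small clusters
train_points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(train_points, train_labels, query=(2, 2), k=3)
```

For regression, the last line of the function would average the neighbors' target values instead of counting labels.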

### Assumptions of Distance Based Classifiers
Distance helps us quantify similarity: points that are close together in feature space are assumed to be similar.

### Manhattan distance
<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/manhattan_fs.png' width="300">

$$ \large d(x,y) = \sum_{i=1}^{n}|x_i - y_i | $$  


In [5]:
# Locations of two points A and B
A = (1, 7, 12)
B = (-1, 0, -5)

manhattan_distance = 0

# Use a for loop to iterate over each element
for i in range(3):
    # Calculate the absolute difference and add it
    manhattan_distance += abs(A[i] - B[i])

manhattan_distance

26

### Euclidean distance
<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/euclidean_fs.png' width = "200">

$a^2 + b^2 = c^2$, or the **Pythagorean theorem**!

$$ \large d(x,y) = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} $$  

In [6]:
from math import sqrt

# Locations of two points A and B
A = (1, 7, 12)
B = (-1, 0, -5)

euclidean_distance = 0

# Use a for loop to iterate over each element
for i in range(3):
    # Calculate the difference, square, and add it
    euclidean_distance += (A[i] - B[i]) ** 2

# Square root of the final result
euclidean_distance = sqrt(euclidean_distance)

euclidean_distance

18.49324200890693

### Minkowski distance

A normed vector space is a vector space equipped with a norm: a function that assigns a length to every vector. Among the criteria a norm has to meet: 
1. the zero vector (just a vector filled with zeros) has a length of 0, and 
2. every other vector has a positive length 

Both the Manhattan and Euclidean distances are actually _special cases of Minkowski distance_. Take a look: 

$$\large d(x, y) = \left(\sum_{i=1}^{n}|x_i - y_i|^c\right)^\frac{1}{c}$$  
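
Setting $c = 1$ in the formula gives the Manhattan distance and $c = 2$ gives the Euclidean distance. A short sketch, reusing points A and B from the cells above (the function name `minkowski_distance` is just for illustration):

```python
def minkowski_distance(x, y, c):
    """Minkowski distance of order c between two equal-length sequences."""
    return sum(abs(a - b) ** c for a, b in zip(x, y)) ** (1 / c)


# Locations of two points A and B (same as the earlier examples)
A = (1, 7, 12)
B = (-1, 0, -5)

manhattan = minkowski_distance(A, B, c=1)  # matches the Manhattan result, 26
euclidean = minkowski_distance(A, B, c=2)  # matches the Euclidean result, ~18.4932
```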


### Hamming Distance
Hamming distance counts the positions at which two equal-length sequences differ, so it can even be used to compare strings.
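
A minimal sketch of Hamming distance on strings. The function name and the example words are illustrative; the only assumption is that both inputs have the same length.

```python
def hamming_distance(s, t):
    """Count the positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


# The two words differ in three positions (r/t, o/h, l/r)
distance = hamming_distance("karolin", "kathrin")
```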

### How adjusting K works
<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/fit_fs.png" width = "700">


<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/best_k_fs.png' width = "550">

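### Big O for KNN
Note that KNN isn't the best choice for extremely large datasets and/or models with high dimensionality. Fitting is cheap, but a brute-force prediction compares the query against every stored training point, so the time complexity (what computer scientists call "Big O", which you saw briefly earlier) of each prediction grows with both the number of rows and the number of features. On top of that, the amount of data needed to keep neighbors meaningfully close grows exponentially with the number of dimensions (the curse of dimensionality).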

### Best value for K 
The best value for K is found empirically: try different values on the dataset (for example with cross-validation) and keep the one that scores best.
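
One way to try different values is leave-one-out validation: hold out each point in turn, predict it from the rest, and compare accuracy across candidate values of K. The sketch below is a plain-Python illustration with a made-up dataset that contains one mislabeled-looking point per cluster; in practice a tool such as `GridSearchCV` does this search for you.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Majority vote among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda row: dist(row[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def leave_one_out_accuracy(data, k):
    """Hold out each point in turn and predict it from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)


# Hypothetical toy data: (features, label) pairs, one noisy point per cluster
data = [
    ((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.5), "a"), ((7.5, 8.0), "a"),
    ((8.0, 8.0), "b"), ((8.5, 9.0), "b"), ((9.0, 8.5), "b"), ((1.5, 1.5), "b"),
]

scores = {k: leave_one_out_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

On this toy data K=1 chases the noisy points, K=5 reaches into the other cluster, and K=3 scores best, which mirrors the overfit/underfit trade-off.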

## Lecture on KNN
* Pick K for low bias low variance
* Fitting doesn't train, it just stores the locations in the feature space.  What's the distance, get the closest distance.
* Hyper tuning the number of neighbors we have
* Low K = overfit, High K = underfit
* Must scale the features!
* Kfolds, GridSearchCV etc standardize after splitting
* next(fold_index) will show the iteration of indexes in cross validation
* cross validation finding the best score
* lower k that predicts better is usually better
* weighted averages: multiply support by
* hidden dimensions, latent space
* predicting about generalizing well
* KNN is a lazy algorithm it works well with smaller data sets
    * over 100K it starts to be too big
    * columns matter too
* Alternative to OHE? Encode one column with all the values
* More features = more dimensions = more sparsity
    * makes it harder to train or predict and can overfit
    * volume scales exponentially
    * affects all algorithms
    * more columns can capture variance but you can over do it
* Feature spaces
    * cosine used for recommendations
    * hamming: NLP, distance between words


<center><img src = "images/nonnormal.png" /></center>
<center>Unscaled</center>

<center><img src = "images/normalized.png" /></center>
<center>Scaled</center>

In [None]:
from sklearn.preprocessing import LabelEncoder
target_transform = LabelEncoder()
iris_df['Species'] = target_transform.fit_transform(iris_df['Species'])

Label Encoder - takes categorical data like dog, cat, fish etc and turns them into numerical values like 0, 1, 2

In [None]:
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = iris_df[['SepalWidthCm', 'PetalWidthCm']]
y = iris_df['Species']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, test_size = 0.15, random_state = 42)
fold_index = KFold(n_splits = 5).split(X_train)

next(fold_index)
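
To make it clearer what `next(fold_index)` returns, here is a plain-Python sketch of a generator that yields train and validation indices for contiguous, unshuffled folds. The name `k_fold_indices` is hypothetical; it is meant to illustrate the idea, not to reproduce scikit-learn's implementation.

```python
def k_fold_indices(n_samples, n_splits):
    """Yield (train_indices, validation_indices) for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [
        n_samples // n_splits + (i < n_samples % n_splits)
        for i in range(n_splits)
    ]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = k_fold_indices(n_samples=10, n_splits=5)
first_train, first_validation = next(folds)  # the first fold's index lists
```

Each call to `next` advances to the following fold, so every row lands in the validation set exactly once across the five folds.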

## Lecture 1: K-Nearest Neighbors (KNN)

### Introduction to Distance-Based Algorithms

#### K-Nearest Neighbors (KNN)
- KNN is a distance-based supervised learning algorithm.
- It should not be confused with K-means clustering, which is an unsupervised technique.

#### Key Concepts
- **Distance Metrics:** Different types of distance metrics are used in KNN, similar to those in L1 and L2 regularizations.
  - Euclidean Distance (L2)
  - Manhattan Distance (L1)
  - Minkowski Distance (a generalization that includes both L1 and L2)
- **Hyperparameter Tuning:** Tuning the value of K is essential for achieving low bias and low variance.
- **Labels:** KNN uses labeled data for its predictions.

#### KNN in Practice
1. **Choosing the Value of K:**
   - K represents the number of nearest neighbors.
   - It is a hyperparameter that needs to be tuned for optimal performance.
   - Common values are odd numbers such as 1, 3, 5, 7, which avoid tied votes in binary classification.

2. **Distance Metrics:**
   - Euclidean Distance: Suitable for continuous features.
   - Manhattan Distance: Suitable for categorical features.
   - Minkowski Distance: Generalization that includes both Euclidean and Manhattan distances.

3. **Algorithm Steps:**
   - Take a test data point.
   - Calculate the distance to all training data points.
   - Select the K-nearest neighbors based on the chosen distance metric.
   - For classification, use majority voting among the neighbors.
   - For regression, take the average of the neighbors' target values.

4. **No Training Phase:**
   - KNN does not have a traditional training phase involving model fitting.
   - Instead, it stores the training data and calculates distances during prediction.

#### Tuning K and Avoiding Overfitting/Underfitting
- **Low K:** Leads to high variance and overfitting.
- **High K:** Leads to high bias and underfitting.
- **Optimal K:** Achieved through cross-validation and testing on the validation set.

#### Example Scenario
- Predict the class of a new data point based on its K-nearest neighbors.
- Different values of K can lead to different classifications.
- Evaluate the model using accuracy or other appropriate metrics.

#### Important Considerations
- **Scaling Features:** Features must be standardized to ensure fair distance calculations.
- **Odd vs. Even K:** Odd values are preferred to avoid ties in majority voting (a guarantee only when there are two classes).
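
A small sketch of why scaling matters for distance calculations. The income and age figures are invented for illustration, and `standardize` is a hand-rolled z-score (the same idea as a standard scaler), not a library call.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return [
        tuple((value - m) / s for value, m, s in zip(row, means, stds))
        for row in rows
    ]


def nearest_to_first(points):
    """Index of the point closest to points[0], ignoring points[0] itself."""
    return min(range(1, len(points)), key=lambda i: dist(points[0], points[i]))


# Hypothetical data: (income in dollars, age in years)
rows = [(50_000, 25), (50_500, 60), (52_000, 26), (90_000, 45)]

nearest_raw = nearest_to_first(rows)                  # income dominates the distance
nearest_scaled = nearest_to_first(standardize(rows))  # both features count
```

Before scaling, the 35-year age gap is invisible next to a $500 income gap, so the 60-year-old looks like the nearest neighbor of the 25-year-old. After scaling, the 26-year-old with a similar income is nearest.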

#### Computational Complexity
- Brute-force KNN compares each query against every training point, so prediction cost grows with both dataset size and dimensionality, making it less suitable for very large datasets or high-dimensional data.
- Dimensionality reduction techniques like PCA can help mitigate these issues.

#### Practical Applications
- KNN is straightforward and interpretable, making it useful for small to mid-sized datasets.
- It is not ideal for datasets with over 100,000 samples or very high-dimensional data.

### Summary
- KNN is a versatile and easy-to-understand algorithm.
- Proper selection of K and distance metrics is crucial.
- Scaling features and dimensionality reduction can enhance performance.
- KNN works well for both classification and regression tasks.

---

#### Additional Concepts from the Lecture
- **Lazy Algorithm:** KNN is considered a lazy algorithm because it does not involve explicit training.
- **Public Opinion Analogy:** The concept of KNN can be compared to public opinion, where the majority vote influences the outcome.
- **Cosine Similarity:** Often used in recommendation systems to find the similarity between vectors.
- **Hamming Distance:** Used in natural language processing to measure the distance between words.

# Level Up: Distance Metrics
> The "closeness" of data points → proxy for similarity

![](./images/distance.png)

# Lloyd's vs Fair Lloyd's

Lloyd's algorithm is the standard k-means clustering procedure. Fair Lloyd's tries to make the clustering cost fair across groups: you define the demographic groups whose costs should be compared, and the clustering is adjusted to balance those costs, at a small increase in CPU time.

# GridSearchCV
Cross-validation and hyperparameter tuning all in one.

The search is exhaustive, so how good it is depends on the parameters you feed it; it can waste a lot of time for no gain if the grid isn't chosen thoughtfully.


# Pickle
Serializes the state of a Python object so it can be written to a file and read back later.
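
A minimal round trip with the standard-library `pickle` module. The dictionary stands in for whatever state you want to keep (a fitted model, for example); the file name and contents are made up for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical object standing in for a fitted model or any other Python state
state = {"model_name": "knn", "n_neighbors": 5, "scores": [0.91, 0.94]}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "state.pkl"

    # Serialize the object to bytes and write it to disk
    with open(path, "wb") as f:
        pickle.dump(state, f)

    # Read the bytes back and rebuild an equal object
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.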

# Machine Learning Pipelines
Pipelines help avoid data leakage and let you define a repeatable workflow.

In [None]:
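from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier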

# Create the pipeline
pipe = Pipeline([('mms', MinMaxScaler()),
                 ('tree', DecisionTreeClassifier(random_state=123))])

# Create the grid parameter
grid = [{'tree__max_depth': [None, 2, 6, 10], 
         'tree__min_samples_split': [5, 10]}]


# Create the grid, with "pipe" as the estimator
gridsearch = GridSearchCV(estimator=pipe, 
                          param_grid=grid, 
                          scoring='accuracy', 
                          cv=5)

# Fit using grid search
gridsearch.fit(X_train, y_train)

# Calculate the test score
gridsearch.score(X_test, y_test)

## Used with other libraries
`cross_validate` accepts a pipeline as its estimator, as do other scikit-learn utilities such as `GridSearchCV`, so pipelines are well integrated with the rest of the library.

# Lecture on Pipelines and GridSearchCV

* Hyperparameters exist for both parametric and non parametric models
* Pipeline solves
    * K Fold cross validation takes loops and can get unwieldy
    * crossval for each fold
    * streamline this preprocessing
    * do things in parallel 
* Pipeline takes
    * constructor takes in a list of tuples as steps
        * user label and transformer/estimator
    * pipeline.fit
    * pipeline.transform
* GridSearchCV
    * pipelinename__hyperparameter
    * .best_estimator_
    * refit on entire train after for better predictions
    * ending in _ means it was filled after the fitting
* (add the rest of the lecture)

# Lecture 2: Pipelines

## Questions to Consider
1. **Describe KNN. How does it make predictions for regression or classification tasks?**
   - KNN makes predictions by storing the locations of data points. For classification, it assigns the majority class of the K nearest neighbors to the new data point. For regression, it predicts the average of the target values of the K nearest neighbors.

2. **How does KNN make predictions?**
   - KNN predicts based on the K nearest neighbors using a distance metric (e.g., Euclidean, Manhattan). It doesn't involve training in the traditional sense but stores the data points for distance computation during prediction.

3. **Impact of scaling on KNN:**
   - Scaling normalizes the distances between data points, ensuring fair comparisons. It standardizes the feature space so that no single feature disproportionately influences the distance calculation.

## Important Concepts
- **Nonparametric and Lazy Model:** KNN doesn't derive coefficients or optimization functions. It simply stores data points and computes distances during prediction.
- **Choosing K:** Small K values can overfit (high variance), while large K values can underfit (high bias). Optimal K balances this trade-off.
- **Distance Metrics:** Euclidean for continuous features, Manhattan for categorical features, and Minkowski as a generalization.

## Scaling Techniques
- **Standard Scaler:** Centers the data around the mean with unit variance.
- **Normalizer:** Scales individual samples to unit norm. More useful for clustering and classification when the direction is more important than the magnitude.

## Pipelines
- **Purpose:** Streamlines preprocessing and model fitting into a single workflow.
- **Steps:** 
  - Define preprocessing steps (e.g., imputation, scaling).
  - Define the model.
  - Fit and transform data within the pipeline.
  - Predict using the same pipeline to ensure consistent preprocessing.

### Pipeline Components
1. **Imputation:** Filling missing values.
2. **Scaling:** Standardizing or normalizing features.
3. **Model Fitting:** Training the model with the preprocessed data.

### Example: Basic Pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('classifier', DecisionTreeClassifier(random_state=42))
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

### Column Transformer
**Purpose**: Apply different preprocessing steps to different subsets of features.

In [None]:
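from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier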

numeric_features = ['age', 'hours_per_week']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['occupation', 'sex']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

model = Pipeline(steps=[('preprocessor', preprocessor),
                        ('classifier', DecisionTreeClassifier())])

model.fit(X_train, y_train)
predictions = model.predict(X_test)


## Hyperparameter Tuning
**GridSearchCV**: Exhaustively searches through a specified parameter grid to find the best combination.

**RandomizedSearchCV**: Randomly samples a specified number of parameter settings from the grid, useful for large datasets or high-dimensional parameter spaces.


In [None]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__max_depth': [5, 10, None]
}

grid_search = GridSearchCV(model, param_grid, cv=5)
grid_search.fit(X_train, y_train)

best_params = grid_search.best_params_
best_model = grid_search.best_estimator_

## SMOTE and Pipeline
**Purpose**: Handle class imbalance by oversampling the minority class using synthetic data.

In [None]:
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline

smote = SMOTE(random_state=42)
pipeline = ImbPipeline([
    ('smote', smote),
    ('classifier', DecisionTreeClassifier())
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

## Feature Union
**Purpose**: Apply multiple transformations to the same features and combine the results.

In [None]:
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, PolynomialFeatures

def sin_transform(x):
    return np.sin(x)

def cos_transform(x):
    return np.cos(x)

feature_union = FeatureUnion([
    ('poly', PolynomialFeatures(degree=2)),
    ('sin', FunctionTransformer(sin_transform)),
    ('cos', FunctionTransformer(cos_transform))
])

pipeline = Pipeline([
    ('features', feature_union),
    ('regressor', Ridge())  # Ridge is a regressor, so the step is named accordingly
])

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

# Summary

- **Pipelines** simplify the process of applying multiple transformations and fitting models, ensuring consistency and reducing code duplication.
- **ColumnTransformer** and **FeatureUnion** offer flexibility in preprocessing different types of features.
- **GridSearchCV** and **RandomizedSearchCV** help in tuning hyperparameters efficiently.
- **SMOTE** and **ImbPipeline** address class imbalance issues effectively.

# Ensemble
A model that uses more than one model to make a prediction, usually by aggregating their results. Ensembles are mostly used in supervised learning.

They are resilient to variance: think of a group of specialists all weighing in to arrive at the wisdom of the crowd.

Overestimates and underestimates cancel out, which is called smoothing.



# Bootstrap Aggregation

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-ensemble-methods/master/images/new_bagging.png' alt="flowchart of input sample being split into several bootstrap samples, then building several decision trees, then aggregation">

**_Bagging_**, which is short for **_Bootstrap Aggregation_**, combines two ideas: bootstrap resampling and aggregation.

**Bootstrap resampling** is a statistical method used to estimate the distribution of a statistic (e.g., mean, variance) by sampling with replacement from the original dataset.

**Sampling with replacement** means that when selecting elements from a dataset, each element is returned to the dataset after being selected. This allows the same element to be chosen multiple times in the sampling process.

**Aggregation** is combining. In this case it is combining the predictions of the models trained on the bootstrap samples.

The process for training an ensemble through bootstrap aggregation is as follows:

1. Grab a sizable sample from your dataset, with replacement 
2. Train a classifier on this sample  
3. Repeat until all classifiers have been trained on their own sample from the dataset  
4. When making a prediction, have each classifier in the ensemble make a prediction 
5. Aggregate all predictions from all classifiers into a single prediction, using the method of your choice  

Decision trees are often used as the base classifier because they are high-variance models, which bagging helps to stabilize, but any model can be used.
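
The five steps above can be sketched with a deliberately simple base model. Here the "classifier" is a one-split stump on 1-D data rather than a full decision tree; the function names and the data are assumptions made up for the example.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Weak classifier on 1-D data: predict 1 when x is above a threshold.

    The threshold is the sample value that makes the fewest errors on `sample`.
    """
    def errors(threshold):
        return sum((x > threshold) != bool(y) for x, y in sample)

    return min((x for x, _ in sample), key=errors)


def bagging_fit(data, n_estimators, rng):
    """Steps 1-3: train one stump per bootstrap sample (drawn with replacement)."""
    return [
        fit_stump(rng.choices(data, k=len(data)))
        for _ in range(n_estimators)
    ]


def bagging_predict(thresholds, x):
    """Steps 4-5: every stump votes, and the majority wins."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical 1-D data: label is 1 for larger x, with one noisy point at x=4
data = [(1, 0), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1), (7, 1), (8, 1), (9, 1)]

rng = random.Random(42)  # seeded so the sketch is repeatable
thresholds = bagging_fit(data, n_estimators=25, rng=rng)
predictions = [bagging_predict(thresholds, x) for x in (2, 8)]
```

Because each stump sees a different bootstrap sample, the learned thresholds vary, and the vote averages that variation out.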

# Lecture 3: Random Forests and Bagging and Boosting Techniques

## Topics Covered
- Bagging and boosting techniques
- Random forest
- Averaging models
- Weighted averages of models
- Boosting algorithms such as gradient boosting and AdaBoost

## Questions

### 1. Explain the difference between Feature Union and Column Transformer. When would you use Feature Union and when would you use Column Transformer?

**Answers:**
- Column Transformer allows you to apply different transformations to different columns in parallel.
- Feature Union applies transformations to all columns and concatenates the results.

## Bagging

- Bagging is short for bootstrap aggregation.
- Bagging classifier/regressor can be used with different models.
- Random Forest involves two levels of randomness: bootstrap sampling and random feature selection.
- Extra Trees involves three levels of randomness.

## Random Forest

- Handles big data efficiently.
- Adds randomization by sampling with replacement and selecting random subsets of features at each node.
- Helps in reducing overfitting and increasing model robustness.

## Extra Trees

- Adds an extra layer of randomization by choosing split thresholds at random inside each node instead of searching for the best one.
- Useful when Random Forest still shows overfitting.

## Combining Models

### 1. Averaging and Weighted Averaging

- Combine multiple models by averaging their predictions.
- Weighted averaging assigns different weights to each model's predictions based on their performance.
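
A minimal sketch of simple versus weighted averaging for regression predictions. The model predictions and weights are invented for illustration; in practice the weights might come from each model's validation score.

```python
def weighted_average(predictions, weights):
    """Combine per-model predictions into one, weighting each model."""
    total = sum(weights)
    return [
        sum(w * p for w, p in zip(weights, row)) / total
        for row in zip(*predictions)  # one row per sample, one value per model
    ]


# Hypothetical predictions from three regressors for the same two samples
model_predictions = [
    [10.0, 20.0],  # model A
    [12.0, 18.0],  # model B
    [14.0, 25.0],  # model C
]

simple = weighted_average(model_predictions, weights=[1, 1, 1])
weighted = weighted_average(model_predictions, weights=[2, 1, 1])  # trust A more
```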

### 2. Stacking

- Uses the output of individual estimators as input for a final estimator.
- Can combine different types of models (e.g., logistic regression, KNN, decision tree) to improve overall performance.

## Bagging Classifier Example:

In [None]:
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Define the base model
base_estimator = DecisionTreeClassifier(max_depth=4)

# Define the bagging classifier
# (scikit-learn 1.2+ names this argument `estimator`; older releases used `base_estimator`)
bagging_clf = BaggingClassifier(estimator=base_estimator, n_estimators=150, random_state=42)

# Fit the model
bagging_clf.fit(X_train, y_train)

# Make predictions
y_pred = bagging_clf.predict(X_test)

## Random Forest Example

In [None]:
from sklearn.ensemble import RandomForestClassifier

# Define the random forest classifier
rf_clf = RandomForestClassifier(n_estimators=100, max_features='sqrt', random_state=42)

# Fit the model
rf_clf.fit(X_train, y_train)

# Make predictions
y_pred = rf_clf.predict(X_test)

## ExtraTrees Example

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

# Define the extra trees classifier
et_clf = ExtraTreesClassifier(n_estimators=100, max_features='sqrt', bootstrap=True, random_state=42)

# Fit the model
et_clf.fit(X_train, y_train)

# Make predictions
y_pred = et_clf.predict(X_test)


## Stacking Example

In [None]:
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Define base models
estimators = [
    ('lr', LogisticRegression()),
    ('knn', KNeighborsClassifier()),
    ('dt', DecisionTreeClassifier())
]

# Define the stacking classifier
stack_clf = StackingClassifier(estimators=estimators, final_estimator=LogisticRegression())

# Fit the model
stack_clf.fit(X_train, y_train)

# Make predictions
y_pred = stack_clf.predict(X_test)


## Summary
- Pipelines simplify the process of applying multiple transformations and fitting models.
- ColumnTransformer and FeatureUnion offer flexibility in preprocessing different types of features.
- GridSearchCV and RandomizedSearchCV help in tuning hyperparameters efficiently.
- SMOTE and ImbPipeline address class imbalance issues effectively.
- Bagging, random forests, extra trees, averaging, and stacking are powerful techniques for building robust models.
- Hyperparameter tuning and model stacking can significantly improve model performance.


# Random Forest
A random forest is an ensemble of decision trees. Decision trees use a greedy algorithm that maximizes information gain at each step, so trees trained on the same data would all come out the same. We need each tree to be different, and **bagging** and **subspace sampling** give the trees that variability.


For each tree in the forest:

1. Bag 2/3 of the overall data -- in our example, 2000 rows 
2. Randomly select a set number of features to use for training each node within this -- in this example, 6 features  
3. Train the tree on the modified dataset, which is now a DataFrame consisting of 2000 rows and 6 columns  
4. Drop the unused columns from step 3 from the out-of-bag rows that weren't bagged in step 1, and then use this as an internal testing set to calculate the out-of-bag error for this particular tree 

* Great for large complex datasets
* Not prone to overfitting
* Data doesn't need to be standardized 
* Uses bootstrap sampling to randomly select different samples
* Uses random feature selections
* Can fail to capture linear relationships
* Smooths out classes so it's not as subject to influence by single points

## Bagging for Random Forest
1. obtain a portion of the data with replacement
2. use this data to build a tree
3. remaining data is **Out-of-Bag Data** or **OOB**.  
4. OOB is used as test set to calculate the **Out-Of-Bag Error** to estimate performance.
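
A sketch of how in-bag and out-of-bag rows can be separated for a single tree. It assumes the bootstrap sample is as large as the dataset (3000 rows here, chosen so that roughly two thirds of the rows end up in the bag); the function name is hypothetical.

```python
import random


def bootstrap_split(n_rows, rng):
    """Return (in-bag row indices, out-of-bag row indices) for one tree."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]  # with replacement
    out_of_bag = sorted(set(range(n_rows)) - set(in_bag))    # never drawn
    return in_bag, out_of_bag


rng = random.Random(0)  # seeded so the sketch is repeatable
in_bag, out_of_bag = bootstrap_split(n_rows=3000, rng=rng)

# With a full-size bootstrap sample, roughly 1/e (about 37%) of the rows
# are expected to be left out of the bag
oob_fraction = len(out_of_bag) / 3000
```

The tree is trained on the `in_bag` rows and scored on the `out_of_bag` rows, which gives a performance estimate without a separate holdout set.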

## Subspace Sampling for Random Forest
Further increases variability between trees by using a subset of features for each tree.

## Random Forest Visual of Algorithm
<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_rf-diagram.png' width="750">

## Resilient to overfitting
Due to the number of trees and the variance between them, a random forest is resilient to overfitting. It finds the signal in the noise.

Each tree "votes" on the overall outcome.


## Benefits
**Strong Performance** - it is an ensemble method, so it tends to outperform many models.

**Interpretability** - it is called a **glass box model** because it is transparent and easy to see how it arrived at a solution.  

## Drawbacks

**Computational Cost** - It can be slow to train on large data sets.

**Memory Footprint** - It has to store every tree in the forest, which can end up being hundreds of MBs. Logistic regression only needs to store the coefficients.

## Random Forest Paper and Website

- [Random forests paper](https://www.stat.berkeley.edu/~breiman/randomforest2001.pdf)

- [Random forests website](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm)

# Lecture 3.2: Bagging and Boosting Techniques

### Introduction
- **Bagging and Boosting Techniques**
  - Random Forest
  - Averaging Models
  - Weighted Averages of Models
  - Boosting Algorithms (Gradient Boost, AdaBoost)

### Random Forests
- **Advantages**
  - Great for complex data sets.
  - Not prone to overfitting compared to decision trees.
  - Can handle large data sets.
  - Capable of modeling complex relationships.
  - Does not require data standardization.
- **Disadvantages**
  - Less interpretable as complexity increases.
  - Requires careful tuning of hyperparameters.

### K Nearest Neighbors (KNN)
- **Advantages**
  - Simple to understand and implement.
  - No training phase; just stores the training data.
  - Handles multi-class classification.
- **Disadvantages**
  - Computationally expensive during prediction.
  - Requires large memory.
  - Sensitive to redundant features and the curse of dimensionality.

### Adaptive Boosting (AdaBoost)
- Uses weak learners (e.g., decision tree stumps).
- Attaches weights to data points, increasing weights for misclassified points.
- Sequentially trains models, focusing more on difficult-to-classify data points.
- **Implementation Example**
  - Instantiate with `AdaBoostClassifier` or `AdaBoostRegressor`.
  - Fit the model and predict.

### Gradient Boosting
- Similar to AdaBoost but uses residuals instead of weights.
- Sequentially fits models to correct the residuals of previous models.
- **Implementation Example**
  - Instantiate with `GradientBoostingClassifier` or `GradientBoostingRegressor`.
  - Fit the model and predict.

### XGBoost
- An advanced implementation of gradient boosting.
- Handles missing values.
- Allows custom loss functions.
- **Implementation Example**
  - Instantiate with `XGBClassifier` or `XGBRegressor`.
  - Fit the model and predict.

### Model Stacking
- Combines different models' predictions by feeding them into a meta-model.
- Useful for leveraging the strengths of various models.

### Hyperparameter Tuning
- Essential for improving model performance.
- Common tools: `GridSearchCV`, `RandomizedSearchCV`.
- **Example Parameters for Tuning**
  - Number of estimators.
  - Learning rate.
  - Max depth.

### Conclusion
- Understanding and applying ensemble techniques like bagging, boosting, and stacking can significantly enhance model performance.
- Proper hyperparameter tuning is crucial for maximizing the potential of these models.

# Gradient Boosting and Weak Learners

## Weak Learners
A weak learner is a model that predicts only slightly better than random chance. Boosting builds an ensemble out of weak learners:

1. Train a single weak learner  
2. Figure out which examples the weak learner got wrong  
3. Build another weak learner that focuses on the areas the first weak learner got wrong  
4. Continue this process until a predetermined stopping condition is met, such as until a set number of weak learners have been created, or the model's performance has plateaued  

## Boosting vs Random Forest
Boosting is very similar to random forests: both are ensembles of high variance models that aggregate to make a prediction. Both often use tree models, though boosting can use other models.

|Boosting|Random Forest|
|--------|-------------|
|Iterate|Parallel|
|Corrects on Prior Trees|Trees don't know of each other|
|Ensemble of Weak Learners|Ensemble of Strong Learners|
|Very Resistant To Overfitting|Resistant to Overfitting|
|Weighted Votes|Simple Votes|
|Weight on Trees That Solve Harder Problems|All Even Weights|
|Aggregate Solves Easy Problems|No Interaction Like this|

# AdaBoost
* One of the first boosting algorithms
* Puts weights on the samples and increases the weights of samples the learner gets wrong; a higher weight makes a sample more likely to end up in the next learner's bag
* The ensemble already gets the easy problems right, so those samples are given less weight
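
One round of re-weighting can be sketched as follows. This assumes the classic discrete AdaBoost update for two classes (learner weight `alpha = 0.5 * ln((1 - error) / error)`), which the notes above do not spell out; the function name and the five-sample example are made up.

```python
from math import exp, log


def adaboost_round(weights, is_correct):
    """One AdaBoost round: score the learner, then re-weight the samples."""
    # Weighted error rate of the current weak learner (must be between 0 and 1)
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    # Say of this learner in the final vote: larger when the error is smaller
    alpha = 0.5 * log((1 - error) / error)
    # Shrink weights of correct samples, grow weights of misclassified ones
    updated = [
        w * exp(-alpha if ok else alpha)
        for w, ok in zip(weights, is_correct)
    ]
    total = sum(updated)
    return alpha, [w / total for w in updated]  # normalize to sum to 1


weights = [0.2] * 5                           # start with equal weights
is_correct = [True, True, True, True, False]  # the learner misses the last sample
alpha, new_weights = adaboost_round(weights, is_correct)
```

After this round the single misclassified sample carries half of the total weight, so the next weak learner is pushed to get it right.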


# Gradient Boosted Trees
* Makes use of Gradient Descent
* Uses weak learners
* This is where it diverges from AdaBoost: it calculates the residuals next to see how far off it is
* Residuals are combined with a loss function
* Loss function is differentiable
* Loss function is inflated more where the model is more wrong, thus it will be pushed towards making a model focusing on these harder problems

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_gradient-boosting.png'>

$\rightarrow$ How does gradient boosting work for a classification problem? How do we even make sense of the notion of a gradient in that context? The short answer is that we appeal to the probabilities associated with the predictions for the various classes. See more on this topic [here](https://sefiks.com/2018/10/29/a-step-by-step-gradient-boosting-example-for-classification/). <br/> $\rightarrow$ Why is this called "_gradient_ boosting"? Because using a model's residuals to build a new model is using information about the derivative of that model's loss function. See more on this topic [here](https://www.ritchievink.com/blog/2018/11/19/algorithm-breakdown-why-do-we-call-it-gradient-boosting/).

### Learning Rate
$\gamma$ -- this is the Greek letter **_gamma_**, which stands for the learning rate

Remember that a learning rate that is too high trains quickly but can bounce around and miss the best setting.

A small learning rate takes longer to train and can get stuck in local minima more easily, but it tends to settle on a better value.
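
The effect of the learning rate is easiest to see on plain gradient descent. The sketch below minimizes the toy function $f(x) = x^2$ (an assumption chosen for illustration, not the boosting loss itself) with three different rates.

```python
def gradient_descent(learning_rate, start=10.0, steps=25):
    """Minimize f(x) = x**2, whose gradient is 2x, from a fixed start."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # step against the gradient
    return x


small = gradient_descent(learning_rate=0.01)  # slow: still far from 0
medium = gradient_descent(learning_rate=0.3)  # converges close to 0
large = gradient_descent(learning_rate=1.1)   # overshoots and bounces away
```

With the same number of steps, the small rate has barely moved, the medium rate has essentially reached the minimum, and the large rate overshoots further on every step.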

### Algorithm

Use mean squared error (MSE) as the loss and minimize it with gradient descent.

Use the residuals (the pattern in the residuals) to create an even better model

1. Fit a model to the data, $F_1(x) \approx y$
2. Fit a model to the residuals, $h_1(x) \approx y - F_1(x)$
3. Create a new model, $F_2(x) = F_1(x) + h_1(x)$
4. Repeat
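
The loop above can be sketched for a 1-D regression problem. The assumptions are all illustrative: the weak learner is a one-split regression stump, $F_1$ is a constant (the mean of $y$), and each $h$ is scaled by a learning rate $\gamma$ before being added, which goes slightly beyond step 3 as written.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Regression stump: one split on x, predicting the mean on each side."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = sum((t - left_mean) ** 2 for t in left) + sum(
            (t - right_mean) ** 2 for t in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds, learning_rate):
    """Return the training MSE after each boosting round."""
    predictions = [mean(ys)] * len(xs)  # F_1: start from a constant model
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]  # y - F_m(x)
        h = fit_stump(xs, residuals)                          # h_m fits the residuals
        predictions = [p + learning_rate * h(x) for p, x in zip(predictions, xs)]
        history.append(mean((y - p) ** 2 for y, p in zip(ys, predictions)))
    return history


# Hypothetical 1-D regression data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.5, 1.2, 3.8, 4.1, 4.0, 7.9, 8.2]

mse_per_round = gradient_boost(xs, ys, n_rounds=10, learning_rate=0.5)
```

Each round fits the part of the signal the previous rounds missed, so the training MSE in `mse_per_round` shrinks as rounds are added.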

#### Example of Iterative Steps

https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

> Parts adapted from https://github.com/ageron/handson-ml/blob/master/07_ensemble_learning_and_random_forests.ipynb

# XGBoost - Extreme Gradient Boosting
* Handles missing values for you
* Runs on multiple cpu cores in parallel
* Distributes training across computer clusters
* Go-to competition Algorithm
* Always use multiple algorithms but it's a top dog right now

# Lecture 4: Recommendation Systems

### Introduction
- **Boosting Techniques Overview**
  - Adaptive Boosting (AdaBoost)
  - Gradient Boosting (GB)
  - Difference: AdaBoost uses weights on misclassified data points; GB uses residuals to improve model accuracy.

### Key Concepts in Recommendation Systems
- **Cold Start Problem**: Difficulty in recommending items to new users without historical data.
- **Implicit vs. Explicit Data**
  - **Implicit Data**: User behaviors (e.g., click history, time spent on page).
  - **Explicit Data**: Direct feedback (e.g., ratings).

### Types of Filtering
- **Content-Based Filtering**
  - Recommends items similar to those the user liked in the past.
  - Based on item attributes and user preferences.
  - **Example**: Pandora recommends music with similar properties.

- **Collaborative Filtering**
  - Recommends items based on the preferences of similar users.
  - **User-User Collaborative Filtering**: Finds users similar to the target user and recommends items they liked.
  - **Item-Item Collaborative Filtering**: Recommends items that are similar to items the user liked.
  - **Example**: Netflix recommends shows watched by users with similar viewing histories.

### Memory-Based vs. Model-Based Systems
- **Memory-Based Systems**
  - Use the entire user-item dataset.
  - Compute similarity scores between users/items.
  - Metrics: Cosine Similarity, Pearson Correlation.

- **Model-Based Systems**
  - Use machine learning models to make recommendations.
  - Example Techniques: Matrix Factorization, Alternating Least Squares (ALS).

### Evaluating Recommendation Systems
- **Metrics**
  - Root Mean Square Error (RMSE)
  - Mean Absolute Error (MAE)
  - Precision, Recall, and F1-Score for binary recommendations.
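
As a quick worked example of the two error metrics, here is a sketch on four made-up ratings (the numbers are assumptions chosen for easy arithmetic):

```python
import numpy as np

actual = np.array([4.0, 3.0, 5.0, 1.0])      # true ratings (made up)
predicted = np.array([3.5, 3.0, 4.0, 2.0])   # model's predicted ratings (made up)

errors = predicted - actual
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes large misses more heavily
mae = np.mean(np.abs(errors))          # average size of a miss

print(rmse, mae)  # 0.75 0.625
```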

### Alternating Least Squares (ALS)
- **Concept**
  - Factorizes the user-item matrix into two lower-dimensional matrices.
  - Iteratively minimizes the error between predicted and actual ratings.
  - Uses pseudo-inverse to handle non-square matrices.
  - Helps fill in missing values in sparse datasets.
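
A minimal NumPy sketch of the idea, under a few assumptions: missing ratings are stored as `NaN`, each step solves a small regularized least-squares problem (normal equations with a ridge term `reg`) rather than calling a pseudo-inverse, and the function name `als` and its default values are hypothetical choices for illustration.

```python
import numpy as np

def als(R, n_factors=2, reg=0.1, n_iters=50, seed=0):
    """Factorize R (NaN = missing) into U @ V.T with alternating least squares."""
    observed = ~np.isnan(R)
    n_users, n_items = R.shape
    rng = np.random.default_rng(seed)
    U = rng.normal(scale=0.1, size=(n_users, n_factors))
    V = rng.normal(scale=0.1, size=(n_items, n_factors))
    ridge = reg * np.eye(n_factors)

    for _ in range(n_iters):
        # Hold V fixed: each user's factors solve a small ridge regression
        for u in range(n_users):
            m = observed[u]
            U[u] = np.linalg.solve(V[m].T @ V[m] + ridge, V[m].T @ R[u, m])
        # Hold U fixed: same for each item
        for i in range(n_items):
            m = observed[:, i]
            V[i] = np.linalg.solve(U[m].T @ U[m] + ridge, U[m].T @ R[m, i])
    return U, V

nan = np.nan
R = np.array([
    [nan, 2,   nan, 5  ],
    [2,   nan, 4,   nan],
    [nan, 5,   3,   2  ],
    [5,   nan, 1,   nan],
    [1,   5,   nan, 2  ],
])

U, V = als(R)
filled = U @ V.T                      # predictions for every user-item pair
observed = ~np.isnan(R)
rmse = np.sqrt(np.mean((filled[observed] - R[observed]) ** 2))
print(np.round(filled, 2), rmse)
```

The entries of `filled` at the `NaN` positions are the model's guesses for ratings that were never given.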

### Implementation with Surprise Library
- **Surprise Library**
  - Specialized for building and analyzing recommender systems.
  - Handles datasets with user, item, and rating columns.
  - Provides various algorithms: KNNBasic, SVD, NMF.

- **Example Workflow**
  - Load dataset with user, item, and rating.
  - Split data using Surprise’s train-test split.
  - Instantiate and train models (KNNBasic, SVD, NMF).
  - Evaluate model performance using RMSE and MAE.

### Practical Considerations
- **Bias and Overfitting**
  - Beware of biases from users who rate disproportionately.
  - Ensure model generalizes well to new, unseen data.

- **Combining Techniques**
  - Hybrid models can combine content-based and collaborative filtering.
  - Use ensemble methods to improve recommendation accuracy.

### Summary
- Recommendation systems leverage user behavior and item attributes to suggest items.
- Balancing content-based and collaborative filtering improves recommendations.
- Matrix factorization techniques like ALS are powerful for handling sparse datasets.
- Evaluating recommendation accuracy is crucial for effective recommendations.


# Recommendation Systems
* Predict a user's future preference list

## Matrix Factorization
* Singular Value Decomposition (SVD) and Alternating Least Squares (ALS)

## Surprise Library
* Used to create recommendation systems; it is well optimized

## Goal: Expose People to What They Like
* Predicts a user's future preference for a set of items
* Taps into the "long tail": a few very common items are bought by everyone, while many specific items, like a certain genre of music or a special toy, sit in the long tail

<img src="https://raw.githubusercontent.com/learn-co-curriculum/dsc-recommendation-system-introduction/master/images/LongTailConcept.png" alt="graph showing products on the x-axis and popularity on the y-axis. a few products are very popular, labeled Head. many other products are not very popular, labeled Long Tail" width="500">

## Formal Definition
***Recommendation Systems are software agents that elicit the interests and preferences of individual consumers […] and make recommendations accordingly. They have the potential to support and improve the quality of the decisions consumers make while searching for and selecting products online.***



## Applications of Recommendation Systems
* Suggest items to a customer
* Estimate profit & loss of many competing items and make recommendations to the customer (e.g. buying and selling stocks)
* Recommend a product or service based on experience of the customer
* Show offers appealing to a customer

## Types of Recommendation Systems
* Unpersonalized and Personalized

### Unpersonalized
* EX: Youtube recommending the most viewed videos.

### Personalized
__Given__: 
The profile of the "active" user and possibly some situational context, i.e. user browsing a product or making a purchase etc. 

__Required__:
Creating a set of items, and a score for each recommendable item in that set

__Profile__:

User profile may contain past purchases, ratings in either implicit or explicit form, demographics and interest scores for item features 

> There are two ways to gather such data. The first method is to ask for explicit ratings from a user, typically on a concrete rating scale (such as rating a movie from one to five stars). The second is to gather data implicitly as the user is in the domain of the system - that is, to log the actions of a user on the site.

Each of these techniques makes use of different similarity metrics to determine how "similar" items are to one another. 
* [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance)
* [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
* [Pearson correlation](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)
* [Jaccard index (useful with binary data)](https://en.wikipedia.org/wiki/Jaccard_index)
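
A small sketch of the four metrics on made-up data. The two rating vectors and the two sets of liked movies are assumptions for illustration only:

```python
import numpy as np

# Ratings two users gave to the same four items (made up)
a = np.array([5.0, 3.0, 0.0, 1.0])
b = np.array([4.0, 2.0, 1.0, 1.0])

euclidean = np.linalg.norm(a - b)                           # smaller = more similar
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))    # 1 = same direction
pearson = np.corrcoef(a, b)[0, 1]                           # 1 = perfectly correlated

# Jaccard works on binary data, e.g. sets of liked items
liked_a = {"Toy Story", "Cinderella", "Lion King"}
liked_b = {"Cinderella", "Lion King", "Little Mermaid"}
jaccard = len(liked_a & liked_b) / len(liked_a | liked_b)   # shared / total

print(euclidean, cosine, pearson, jaccard)
```

Note that Euclidean distance is a distance (lower is more similar) while the other three are similarities (higher is more similar).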

### Content-Based Recommenders 

> __Main Idea__: If you like an item, you will also like "similar" items.

<img src="https://raw.githubusercontent.com/learn-co-curriculum/dsc-recommendation-system-introduction/master/images/content_based.png" alt="content based filtering. user watches movies, then similar movies are recommended to the user" width="500">

* These systems are based on the characteristics of the items themselves. "Try other items like this"
* Gives the user a bit more information on why they are seeing the recommendation
* Require manual or semi-manual tagging of products
* advanced systems can average all items a user liked

### Collaborative Filtering Systems


> __Main Idea__: If user A likes items 5, 6, 7, and 8 and user B likes items 5, 6, and 7, then it is highly likely that user B will also like item 8.

<img src="https://raw.githubusercontent.com/learn-co-curriculum/dsc-recommendation-system-introduction/master/images/collaborative_filtering.png" alt="collaborative filtering: movies watched by both users indicate that the users are similar, then movies are recommended by one user to another user" width="450">

__The key idea behind collaborative filtering is that similar users share similar interests and that users tend to like items that are similar to one another.__

* Often based off user reviews
* Have a cold start problem on how to recommend things to new users that have no activity yet.
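
The main idea above (users A and B and items 5 through 8) can be sketched with plain Python sets. The similarity measure here, the count of shared liked items, is a deliberately simple assumption, and user `C` is a hypothetical extra user added so there is someone dissimilar to compare against:

```python
likes = {
    "A": {5, 6, 7, 8},
    "B": {5, 6, 7},
    "C": {1, 2},          # hypothetical user with different tastes
}

def recommend(target, likes):
    """Suggest items liked by the most similar other user (by shared likes)."""
    others = [user for user in likes if user != target]
    most_similar = max(others, key=lambda user: len(likes[user] & likes[target]))
    # Recommend what the neighbor liked that the target has not liked yet
    return most_similar, likes[most_similar] - likes[target]

neighbor, suggestions = recommend("B", likes)
print(neighbor, suggestions)  # A {8}
```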

## Utility Matrix
A utility matrix represents the opinion that each user holds about each item.

|        | Toy Story | Cinderella | Little Mermaid | Lion King |
|--------|-----------|------------|----------------|-----------|
| Matt   |           | 2          |                | 5         |
| Lore   | 2         |            | 4              |           |
| Mike   |           | 5          | 3              | 2         |
| Forest | 5         |            | 1              |           |
| Taylor | 1         | 5          |                | 2         |

$r_{\text{Mike},\text{Little Mermaid}} = 3$.

A recommendation system tries to fill in the blanks. In practice, most entries of a utility matrix are empty.
The matrix above holds what are known as explicit ratings: each person has rated what they've seen. However, we can also infer ratings or use judgment to decide how to use other data for a recommendation system.

|        | Toy Story | Cinderella | Little Mermaid | Lion King |
|--------|-----------|------------|----------------|-----------|
| Matt   |           |  1         |                | 1         |
| Lore   | 1         |            | 1              |           |
| Mike   |           | 1          | 1              | 1         |
| Forest | 1         |            | 1              |           |
| Taylor | 1         | 1          |                | 1         |

These are __implicit__ ratings because we are assuming that because a person has bought something, they would like to buy other items like it. Of course, this is not necessarily true, but it's better than nothing!
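
The two tables can be reproduced in a few lines of NumPy. This sketch assumes blanks are stored as `NaN`; the variable names are illustrative:

```python
import numpy as np

users = ["Matt", "Lore", "Mike", "Forest", "Taylor"]
movies = ["Toy Story", "Cinderella", "Little Mermaid", "Lion King"]
nan = np.nan

# Explicit utility matrix from the first table (NaN = not rated)
ratings = np.array([
    [nan, 2,   nan, 5  ],
    [2,   nan, 4,   nan],
    [nan, 5,   3,   2  ],
    [5,   nan, 1,   nan],
    [1,   5,   nan, 2  ],
])

# Look up a single rating: r_{Mike, Little Mermaid}
r_mike_mermaid = ratings[users.index("Mike"), movies.index("Little Mermaid")]

# Implicit version: 1 wherever the user interacted with the movie
seen = ~np.isnan(ratings)
implicit = np.where(seen, 1.0, nan)

# Share of the matrix that is empty
sparsity = 1 - seen.mean()
print(r_mike_mermaid, sparsity)  # 3.0 and roughly 0.4
```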

# Clustering
Create clusters with high similarity among the data points within a cluster and minimal similarity between clusters

## K-Means Clustering
K determines the number of clusters and the algorithm optimizes around that

## Hierarchical Agglomerative Clustering
You start with $n$ clusters, where $n$ equals the number of data points, and at each step you join two clusters.  You stop joining when a certain criterion is reached.

## Semi-Supervised Learning
Combine both concepts of supervised and unsupervised learning.  Increasingly popular.

## Market Segmentation with Clustering
Common and useful, we'll practice with a market segmentation dataset.

# K-means Clustering
K-means is the most popular and widely used clustering algorithm, and clustering is one of the most popular families of unsupervised machine learning methods.

## Goal
> **Intra**-class similarity is high

> **Inter**-class similarity is low

Similarity is determined by distance.  Closer is more similar.
* **Agglomerative hierarchical** algorithm starts with n clusters
* **Non-hierarchical** chooses k initial clusters

Clustering is unsupervised, and you often do not know in advance how many clusters you are looking for

## Non-Hierarchical Clustering with K-Means Clustering

### Process
1. Select $k$ initial seeds 
2. Assign each observation to the cluster to which it is "closest"
3. Loop
    - Cluster center is the mean of all points in the cluster, recalculated each iteration.
    - Each iteration reassign points to be part of the closest cluster center.
    - Stop if there is no reallocation 
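
The process above can be sketched from scratch in NumPy. This is a minimal illustration under stated assumptions: seeds are `k` randomly chosen observations, distance is Euclidean, and the function name `k_means` and the two synthetic blobs are made up for the example. It is not how scikit-learn implements the algorithm.

```python
import numpy as np

def k_means(X, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Select k initial seeds (here: k distinct random observations)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(n_iters):
        # 2. Assign each observation to the closest cluster center
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # 3. Stop if there is no reallocation
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each center as the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):        # keep the old center if a cluster is empty
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Two well-separated synthetic blobs of 50 points each
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[10, 10], scale=1.0, size=(50, 2)),
])

labels, centers = k_means(X, k=2)
print(centers)
```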

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-k-means-clustering/master/images/good-centroid-start.gif' alt="k-means clustering animation" >

### Scikit-learn


In [None]:
from sklearn.cluster import KMeans

# Set number of clusters at initialization time
k_means = KMeans(n_clusters=3) 

# Run the clustering algorithm
k_means.fit(some_df) 

# Generate cluster index values for each row
cluster_assignments = k_means.predict(some_df) 

# Cluster predictions for each point are also stored in k_means.labels_

### Evaluation with Variance Ratio
* Accepted metric in wide use is **_Variance Ratio_** aka [**_Calinski Harabasz Score_**](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.calinski_harabasz_score.html)
    * The ratio of the variance between clusters to the variance of the points within each cluster; a higher score is better.
    * We want intra-cluster variance to be low suggesting the clusters are tightly knit.
    * We want inter-cluster variance to be high suggesting that there is little to no ambiguity about which cluster a point belongs to.

#### Calculating Variance Ratio


In [None]:
# This code builds on the previous example
from sklearn.metrics import calinski_harabasz_score

# Note that we could also pass in k_means.labels_ instead of cluster_assignments
print(calinski_harabasz_score(some_df, cluster_assignments))

### Other Metrics
* [Silhouette Score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html#sklearn.metrics.silhouette_score)
* No metric is best; each has different strengths and weaknesses depending on the goal.

### Optimal K Value
1. Fit a different K-means clustering object for every $k$ we want to try, then compare the variance ratio scores of each.
2. Visualize results with an **_Elbow Plot_**, a plot where we can easily see where we hit a point of diminishing returns.  Elbow plots are used with more than just variance ratios; one example is distortion, another clustering metric.
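
A short sketch of step 1 with scikit-learn. The data is synthetic (three blobs from `make_blobs`), and the range of $k$ values, `n_init=10`, and `random_state=0` are assumptions picked for a reproducible example:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Synthetic data with three clearly separated groups
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=1.0, random_state=0)

scores, inertias = {}, {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = calinski_harabasz_score(X, model.labels_)   # variance ratio
    inertias[k] = model.inertia_                            # sum of squared distances

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

Plotting `scores` (or `inertias`) against $k$ gives the elbow plot described in step 2.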

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-k-means-clustering/master/images/new_elbow-method.png' alt="Calinski Harabaz scores for different values of k" width='500'>

#### Understanding the Elbow

A note on elbow plots: higher scores aren't always better. Higher values of $k$ mean introducing more overall complexity -- we will sometimes see elbow plots that look like this:


<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-k-means-clustering/master/images/new_dim_returns.png' alt="plot with the number of clusters on the x-axis and the sum of squared distances to cluster center on the y-axis" width="500">

$k$ = 20 technically has the better score, but $k$ = 4 is the better choice because it balances model complexity against score

## Lecture 5.1: Agglomerative Clustering

### Introduction

- **Pop Quiz: Quick Review**
  - **Elbow Method:** Used to find the optimal number of clusters in K-means clustering by identifying the "elbow point" where adding another cluster does not significantly decrease the variance.
  - **K-means Clustering:** An iterative algorithm that partitions data into K clusters by minimizing the distance between data points and the centroid of their assigned cluster.
  - **Hyperparameters in K-means:** Number of clusters (K), distance metric.
  - **When Elbow Method Fails:** Smooth curve without a clear elbow point, clusters of different shapes, sizes, and densities.

### Agglomerative Clustering Overview

- **Definition:** A type of hierarchical clustering that builds nested clusters by merging or splitting them successively.
- **Approach:** Bottom-up method starting with each point as its own cluster and merging the closest pairs of clusters step by step.
- **Comparison with K-means:**
  - K-means is flat clustering; agglomerative clustering is hierarchical.
  - K-means requires the number of clusters to be specified; agglomerative clustering does not.

### Steps in Agglomerative Clustering

1. **Calculate Pairwise Distance:** Determine the distance between each pair of data points.
2. **Linkage Criteria:** Methods to determine the distance between clusters:
   - Single Linkage: Minimum distance between points in two clusters.
   - Complete Linkage: Maximum distance between points in two clusters.
   - Average Linkage: Average distance between points in two clusters.
3. **Merge Closest Clusters:** Form a new cluster by merging the closest pair of clusters.
4. **Repeat:** Continue merging until all points are in one cluster or a stopping criterion is met.
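
The three linkage criteria in step 2 differ only in how they summarize the pairwise distances between two clusters. A small sketch on two made-up clusters of two points each:

```python
import numpy as np

cluster_a = np.array([[0.0, 0.0], [0.0, 1.0]])
cluster_b = np.array([[3.0, 0.0], [4.0, 0.0]])

# All pairwise Euclidean distances between a point in A and a point in B
pairwise = np.linalg.norm(cluster_a[:, None, :] - cluster_b[None, :, :], axis=2)

single = pairwise.min()      # closest pair of points
complete = pairwise.max()    # farthest pair of points
average = pairwise.mean()    # mean over all pairs

print(single, complete, average)
```

At each merge step the algorithm computes the chosen linkage value for every pair of clusters and merges the pair with the smallest value.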

### Dendrogram

- **Definition:** A tree-like diagram that records the sequences of merges or splits.
- **Usage:** Helps in determining the number of clusters by visualizing the hierarchical relationship between data points.
- **Threshold:** The horizontal cut-off line in the dendrogram to decide the number of clusters.

### Implementing Agglomerative Clustering

#### Using Scipy


In [None]:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from scipy.spatial.distance import pdist

# Calculate the pairwise distance matrix
distance_matrix = pdist(data, metric='euclidean')

# Perform hierarchical/agglomerative clustering
Z = linkage(distance_matrix, method='ward')

# Create a dendrogram
dendrogram(Z)
plt.show()

# Form flat clusters
clusters = fcluster(Z, t=50, criterion='distance')

#### Using SKLearn

In [None]:
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

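# Define the model (ward linkage always uses Euclidean distance, so no metric argument is needed;
# the older `affinity` parameter was renamed `metric` in newer scikit-learn releases)
model = AgglomerativeClustering(n_clusters=7, linkage='ward')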

# Fit the model
model.fit(data)

# Assign labels to data points
labels = model.labels_

# Visualize the clusters
plt.scatter(data[:, 0], data[:, 1], c=labels, cmap='rainbow')
plt.show()

### Evaluating Clusters

- **Silhouette Score:** Measures how similar a point is to its own cluster compared to other clusters.
- **Cophenetic Correlation Coefficient:** Measures how faithfully the dendrogram represents the dissimilarities among observations.

### Advantages and Disadvantages

- **Advantages:**
  - Can find clusters with arbitrary shapes.
  - No need to specify the number of clusters upfront.
- **Disadvantages:**
  - Computationally intensive, especially with large datasets.
  - Sensitive to noise and outliers.

### Applications

- Market segmentation
- Social network analysis
- Image segmentation
- Anomaly detection

### Conclusion

Agglomerative clustering is a powerful tool for hierarchical clustering that helps in identifying nested clusters within data. It offers flexibility in terms of linkage criteria and does not require the number of clusters to be predefined. However, it is computationally intensive and requires careful interpretation of dendrograms and threshold settings.


# Hierarchical Agglomerative Clustering
* K-means uses an Expectation-Maximization style procedure once we tell it to give us $k$ clusters; however, it cannot produce subgroups within subgroups
* Agglomerative clustering to the rescue!  It can have subgroups within subgroups
* It starts with $n$ clusters, where $n$ is the number of data points, then merges until some stopping criterion is met

## Linking Clusters Together

* **ward** - merges the two clusters whose merger gives the smallest increase in within-cluster variance.  Leads to more equally sized clusters
* **average** - merges the two clusters that have the smallest average distance between all of their points
* **complete** - merges the two clusters that have the smallest maximum distance between their points

The choice of linkage can affect performance; which to use depends on the data and goals.

The following diagram demonstrates the clusters created at each step for a dataset of 16 points. Take a look at the diagram and see if you can figure out what the algorithm is doing at each step as it merges clusters together:
<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-hierarchical-agglomerative-clustering/master/images/hac_iterative.png' alt="initialization through step 14 of HAC algorithm">



As you can see, it takes the closest clusters and merges them into a single cluster.  In the animation below, as the dots disappear the visualization replaces them with the newly calculated center.

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-hierarchical-agglomerative-clustering/master/images/dendrogram_gif.gif' alt="animation of clusters shown in x-y space on the left and a dendrogram on the right, showing which clusters correspond to which parts of the dendrogram">

### Dendrograms and Clustergrams
* Easily visualize the results at any given step
* The image on the right in the gif above is a dendrogram
    * Shows the hierarchical relationship between the various clusters that are computed at each step.
* The image below is a clustergram
    * Visualizes the same information by drawing lines representing each cluster at each step

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-hierarchical-agglomerative-clustering/master/images/new_clustergram.png' alt="another view of clusters on the left and dendrogram on the right" width='600'>

### Use Cases
* market segmentation
* gain a deeper understanding of a dataset through cluster analysis
* photo sorting on smartphones

# Common Problems with Clustering Algorithms
* No way of verifying the results are correct or not
    * Never treat results of a cluster as ground-truth
## Advantages and Disadvantages of K-Means Clustering
### Advantages
* Easy to implement
* Usually faster than HAC with reasonably small $k$ and many features
* Objects can shift clusters
* Tighter clusters than HAC

### Disadvantages
* Need the right value for $k$
* Scaling completely changes the results
* Starting points have a strong impact on final results, as seen below.  A bad initialization is less likely than a good one, and you can run the algorithm multiple times to guard against it.

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/bad-centroid-start.gif'>

## Advantages & Disadvantages of HAC

### Advantages
* ordered relationship between clusters, which can be useful when visualized
* smaller clusters which allows more granular understanding

### Disadvantages
* Results depend on distance metric used
* Objects can be grouped badly early on and no way to move them
* Visuals can't be checked in more than 3 dimensions, so it's hard to know whether the algorithm grouped things correctly
* See the clustergram below

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_bad-hac.png' width='600'>

# Semi-Supervised Learning and Look-Alike Models
* Combining supervised and unsupervised learning to solve real-world problems

## Case 1: Look-Alike Models
* Find a similar audience

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_look-alike-model.png'>

* Identify more customers/market segments that we can plausibly assume are equally valuable due to their similarity with the valuable customers or market segments we have already identified.
* Divide customers into two groups: the ones we know are valuable and everyone else
* Use a distance metric of choice to rate the similarity of the unknown customers to the ones we have identified
* Once we know they are somewhat similar to the valuable group we can spend resources to capture them
* We will likely see customers that are only somewhat similar to our valuable group
* We will also see customers that look nothing like our known valuable customer segment
* It is a lot like clustering
* Also referred to as prospecting.
* Choose resources to market to the customers that look like our valuable customers to increase our top-of-funnel, meaning the number of potential customers that haven't shown interest in our product or company yet but are likely to.
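
A minimal sketch of the look-alike idea with NumPy. Everything here is an assumption for illustration: the two features (visits per month, monthly spend), the customer values, the use of Euclidean distance, and comparing each unknown customer to the average profile of the valuable group.

```python
import numpy as np

# Columns: visits per month, monthly spend (hypothetical features)
valuable = np.array([[5.0, 200.0], [6.0, 220.0], [5.5, 210.0]])
unknown = np.array([[5.2, 205.0], [1.0, 20.0], [4.0, 150.0]])

# Standardize so that spend does not dominate the distance
all_customers = np.vstack([valuable, unknown])
mean, std = all_customers.mean(axis=0), all_customers.std(axis=0)
valuable_z = (valuable - mean) / std
unknown_z = (unknown - mean) / std

# Distance from each unknown customer to the average valuable customer
profile = valuable_z.mean(axis=0)
distances = np.linalg.norm(unknown_z - profile, axis=1)

# Most look-alike customers first
ranking = np.argsort(distances)
print(ranking, distances)
```

The ranking separates customers who look like the valuable group from those who are only somewhat similar or not similar at all; where to draw the cutoff is a business decision.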


## Use Case 2: Semi-Supervised Learning
* Known as weakly supervised learning too
* Generate Pseudo-labels that are possibly correct.
    * doesn't use clustering, it uses supervised learning algorithms in an unsupervised way.

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_semi-supervised.png'>

### Steps
1. **Train your model on your labeled training data**
2. **Use your trained model to generate pseudo-labels for unlabeled data**
3. **Combine the pseudo-labels with your actual data**
4. **Retrain your model on the new data set**
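
The four steps can be sketched with scikit-learn as follows. The synthetic two-class data, the choice of `LogisticRegression`, and the split sizes (20 labeled, 280 unlabeled, 100 holdout) are all assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Synthetic data: pretend only the first 20 rows have labels
X, y = make_blobs(n_samples=400, centers=[[-3, 0], [3, 0]],
                  cluster_std=1.0, random_state=0)
X_labeled, y_labeled = X[:20], y[:20]
X_unlabeled = X[20:300]
X_holdout, y_holdout = X[300:], y[300:]      # ground-truth labels, kept for testing

# 1. Train on the labeled data
model = LogisticRegression().fit(X_labeled, y_labeled)

# 2. Generate pseudo-labels for the unlabeled data
pseudo_labels = model.predict(X_unlabeled)

# 3. Combine the pseudo-labeled data with the labeled data
X_combined = np.vstack([X_labeled, X_unlabeled])
y_combined = np.concatenate([y_labeled, pseudo_labels])

# 4. Retrain on the combined data set
retrained = LogisticRegression().fit(X_combined, y_combined)

holdout_accuracy = retrained.score(X_holdout, y_holdout)
print(holdout_accuracy)
```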

#### Benefits
* When done well it can increase overall model performance by opening up access to much more data
* Saves a ton of money on labeling costs!

#### Downsides
* It is risky
* When data is really noisy, incorrect labels will skew the model
* Feedback loops: errors in the pseudo-labels get reinforced when the model is retrained on them
* Tends to work less well on more complicated problems

#### Use a Holdout Set to Test
* As usual, but even more important in this case, make sure to test with ground-truth labels, not pseudo-labels.

# Silhouette Coefficient

The Silhouette Coefficient is a measure used to evaluate the quality of clusters created by a clustering algorithm. It takes into account both the cohesion within clusters and the separation between clusters.

## Definition

For a given data point $i$, the Silhouette Coefficient $s(i)$ is defined as:

$$ s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))} $$

where:
- $a(i)$ is the mean distance between $i$ and all other points in the same cluster.
- $b(i)$ is the mean distance between $i$ and all points in the nearest cluster (the cluster with the smallest mean distance to $i$).

## Interpretation
**Higher is better**

- $s(i)$ ranges from -1 to 1.
  - $s(i) \approx 1$: The data point is well-matched to its own cluster and poorly matched to neighboring clusters.
  - $s(i) \approx 0$: The data point is on or very close to the decision boundary between two neighboring clusters.
  - $s(i) \approx -1$: The data point is poorly matched to its own cluster and well-matched to a neighboring cluster.

## Overall Silhouette Score

The overall Silhouette Score for a clustering is the mean Silhouette Coefficient of all data points:

$$ S = \frac{1}{N} \sum_{i=1}^{N} s(i) $$

where $N$ is the total number of data points.
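
To make the formulas concrete, here is a from-scratch NumPy sketch of $s(i)$ and $S$. The helper name `silhouette_values` and the four-point data set are made up for illustration, and points that are alone in their cluster are given $s(i) = 0$, which is a common convention rather than something stated above.

```python
import numpy as np

def silhouette_values(X, labels):
    labels = np.asarray(labels)
    distances = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself from a(i)
        if not same.any():
            continue                         # singleton cluster: leave s(i) = 0
        a = distances[i, same].mean()        # mean distance within own cluster
        b = min(distances[i, labels == c].mean()   # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])

good = silhouette_values(X, [0, 0, 1, 1]).mean()   # sensible grouping
bad = silhouette_values(X, [0, 1, 0, 1]).mean()    # mixes the two groups
print(good, bad)
```

The sensible grouping scores close to 1, while the mixed grouping scores below 0.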

## Usage

The Silhouette Coefficient can be used to:
- Determine the optimal number of clusters by comparing the average silhouette scores for different numbers of clusters.
- Evaluate the quality of clustering algorithms, with higher scores indicating better-defined clusters.

## Example

To compute the Silhouette Coefficient in Python, you can use the `silhouette_score` function from the `sklearn.metrics` module:


In [None]:
from sklearn.metrics import silhouette_score

# X is your data and labels are the cluster labels
score = silhouette_score(X, labels)
print(f'Silhouette Score: {score}')

# PCA: Principal Component Analysis in scikit-learn
* Reduces dimensions while trying to capture as much info from the dataset as possible


In [None]:

from sklearn.decomposition import PCA

pca = PCA()
transformed = pca.fit_transform(X)

Transforms the dataset along its principal axes.  The first axis tries to capture the maximum variance within the data.  From here additional axes are constructed which are orthogonal to the previous axes and continue to account for as much of the remaining variance as possible.

Transforms this:

<img src="images/pca-data1.png">

Into this:

<img src="images/pca-data2.png">

<img src="images/inhouse_pca.png">

In [None]:
pca.explained_variance_ratio_

array([9.99760273e-01, 2.39727247e-04])

The ratios can be summed cumulatively to see how much variance the first few components explain together

In [None]:
np.cumsum(pca.explained_variance_ratio_)

array([0.99976027, 1.        ])

Below we visualize the first PCA component

<img src="images/pca-data3.png">

## Steps for Performing PCA

The theory behind PCA rests upon many foundational concepts of linear algebra. After all, PCA is re-encoding a dataset into an alternative basis (the axes). Here are the exact steps:

1. Recenter each feature of the dataset by subtracting that feature's mean from the feature vector
2. Calculate the covariance matrix for your centered dataset
3. Calculate the eigenvectors of the covariance matrix
    1. You'll further investigate the concept of eigenvectors in the upcoming lesson
4. Project the dataset into the new feature space: Multiply the eigenvectors by the mean-centered features
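
The four steps can be followed by hand in NumPy. This is a sketch on synthetic two-feature data (an assumption), using `np.linalg.eigh` because a covariance matrix is symmetric:

```python
import numpy as np

# Synthetic data: second feature is roughly twice the first
rng = np.random.default_rng(0)
x = rng.normal(size=200)
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

# 1. Recenter each feature by subtracting its mean
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the centered data (features are columns)
cov = np.cov(X_centered, rowvar=False)

# 3. Eigenvectors of the covariance matrix, sorted by decreasing eigenvalue
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Project the centered data onto the eigenvectors
transformed = X_centered @ eigenvectors

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
```

Up to the sign of each axis, this should match what `PCA().fit_transform(X)` returns, and `explained_variance_ratio` plays the same role as `pca.explained_variance_ratio_`.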

# Market Segmentation with Clustering
* one of the most popular use cases for clustering

# What is Market Segmentation?
* **Cluster Analysis** to segment a customer base into different _market segments_ using the clustering techniques we've learned
* Ex: decide marketing budget allocation in order to attract more customers
    * Create personalized regression models for each group
* Know who your customer is.  Identify segments in the customer data so we can look for trends
    * Ex: decide which station to run commercials on
* Find the segments with clustering
    * find them based on behavior

## Targeting
Segmentation is just the first step

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_marketing-strategy.png' width='700'>

* Build individualized strategies
    * which market segment is most valuable to us? Use research and data analysis
    * how do we allocate the advertising budget?  determine where to spend money best to reach the group
* Figure out how to position our product to make it both desirable and stand out from competitors


# Natural Language Processing

## Natural Language Tool Kit (NLTK)
Popular NLP library in Python

## Regular Expressions
regex

## Feature Engineering for Text Data
Text data has a lot of ambiguity and feature engineering for NLP is specific.  
* How to remove stop words
* create frequency distributions
* representing histograms
* stemming
* lemmatization
* bigrams which shows how often two words occur together

## Context-Free Grammars and Part-of-Speech (POS) Tagging
* Context Free Grammar and Part of Speech tagging
* POS tagging helps a computer understand how to interpret a sentence
* Context-free grammars (CFGs) define the rules for how sentences can be formed.

## Text Classification
To be covered later.



# Natural Language Processing (NLP)
The study of how computers can interact with humans through natural language

* intersection of *Computer Science*, *Artificial Intelligence*, *Linguistics*, and *Information Theory*
* History
    * Used to be rules based with rules borrowed from Linguistics
    * Around the 1980s machine learning and AI started to show potential
    * Now it is used around the globe every day by data scientists everywhere

## NLP and Bayesian Statistics
**_Naive Bayesian Classification_** is what keeps spam email at bay.

## Working With Text Data
Can require more cleaning and preprocessing than many other data types

## Creating a Bag of Words
* **Corpus**: a large structured text set used for NLP tasks
* **Bag of Words**: vectorize data by capturing the unique words in a work, in any order.  A common way to do that is have usage counts of all the unique words.

## Basic Cleaning and Tokenization
* Often lowercase everything and remove punctuation
* Decisions have to be made on how to tokenize and what variations of words to count as the same or different.  Ex. run and runs, Apple and Apple's

### Stemming, Lemmatization, and Stop Words
* **Stemming** reduces words to their root, in a crude way.  For example runs and running would become run but ponies would become poni.
<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-nlp-and-word-vectorization/master/images/new_stemming.png' alt="stemming rules and examples" width="400">
* **Lemmatization** uses **morphology** to reduce words to their basic forms called **lemma**

|   Word   |  Stem | Lemma |
|:--------:|:-----:|:-----:|
|  Studies | Studi | Study |
| Studying | Study | Study |

* **Stop Words** have little to no information.  examples are "of" and "the".  Stop words are often removed from models to cut down on dimensionality.

## Vectorization

### Count 
Count the number of times a word appears in a corpus.

| Document | Aardvark | Apple | ... | Zebra |
|:--------:|:--------:|:-----:|-----|-------|
|     1    |     0    |   3   | ... | 1     |
|     2    |     1    |   2   | ... | 0     |
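
A count matrix like the one above can be built with the standard library alone. The two toy documents are made up so that their counts match the three words shown in the table:

```python
from collections import Counter

documents = [
    "Apple zebra apple apple",
    "Aardvark apple apple",
]

# Basic cleaning: lowercase and split on whitespace
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary = every unique word across the corpus, in a fixed order
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one column per vocabulary word
counts = [Counter(tokens) for tokens in tokenized]
matrix = [[count[word] for word in vocabulary] for count in counts]

print(vocabulary)  # ['aardvark', 'apple', 'zebra']
print(matrix)      # [[0, 3, 1], [1, 2, 0]]
```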

### Term Frequency-Inverse Document Frequency (TF-IDF) Vectorization
The idea is that rare words contain more information about a document than words used all the time in all documents.


**_Term Frequency_** is calculated with the following formula:

$$\large Term\ Frequency(t) = \frac{number\ of\ times\ t\ appears\ in\ a\ document} {total\ number\ of\ terms\ in\ the\ document} $$ 

**_Inverse Document Frequency_** is calculated with the following formula:

$$\large IDF(t) = \log_e\left(\frac{Total\ Number\ of\ Documents}{Number\ of\ Documents\ with\ t\ in\ it}\right)$$

The **_TF-IDF_** value is the product of both. 
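
The two formulas translate almost directly into Python. This sketch uses three tiny pre-tokenized documents (made up) and implements the formulas exactly as written above; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their numbers will differ.

```python
import math
from collections import Counter

documents = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "cat", "ran", "fast"],
]

def term_frequency(term, doc):
    # times the term appears in the document / total terms in the document
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, docs):
    # assumes the term appears in at least one document
    n_containing = sum(term in doc for doc in docs)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    return term_frequency(term, doc) * inverse_document_frequency(term, docs)

print(tf_idf("the", documents[1], documents))  # 0.0, appears in every document
print(tf_idf("sat", documents[1], documents))  # in 2 of 3 documents
print(tf_idf("dog", documents[1], documents))  # in 1 of 3 documents, highest score
```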


## Lecture 6: Natural Language Processing (NLP)

### Introduction to NLP
In this lecture we go into natural language processing (NLP). We'll begin with text processing, followed by vectorization, and finally classification.

### Required Installations
Ensure you have installed the necessary libraries. You can do this using pip:
```bash
pip install nltk
```


In [None]:
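# how to download additional nltk data
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')  # tokenizer models used by word_tokenize (named 'punkt_tab' in recent NLTK releases)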

## Overview of NLP Tasks

- **Supervised Learning:** Classify documents (e.g., spam detection).
- **Unsupervised Learning:** Group documents into topics without predefined labels.

## Representing Text for Models

To feed text into models, we need to convert it into numerical form using techniques like Count Vectorizer or TF-IDF Vectorizer.

## Tokenization

Tokenization is the process of splitting text into individual tokens (words or phrases).

- **Word Tokenizer:** Splits text into words.
- **Sentence Tokenizer:** Splits text into sentences.

## Dataset Loading

In [None]:
import pandas as pd
data = pd.read_csv('your_dataset.csv')
print(data.head())

## Normalization Steps

1. **Tokenization:** Convert sentences to words.
2. **Removing Stopwords:** Filter out common words that don't add significant meaning.
3. **Lowercasing:** Convert all text to lowercase to ensure uniformity.
4. **Removing Punctuation and Numbers:** Eliminate unnecessary symbols and digits.
5. **Lemmatization/Stemming:** Reduce words to their base or root form.

## Example Code

Tokenizing and cleaning text:

In [None]:
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess_text(text):
    # Tokenize
    tokens = word_tokenize(text)
    # Lowercase and keep only alphanumeric tokens (drops punctuation)
    tokens = [word.lower() for word in tokens if word.isalnum()]
    # Remove stopwords
    stop_words = set(stopwords.words('english'))
    tokens = [word for word in tokens if word not in stop_words]
    # Lemmatize
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return tokens

data['processed_text'] = data['text_column'].apply(preprocess_text)
print(data.head())


## Evaluating Clusters

- **Silhouette Score:** Measures how similar a point is to its own cluster compared to other clusters.
- **Cophenetic Correlation Coefficient:** Measures how faithfully the dendrogram represents the dissimilarities among observations.

## Advantages and Disadvantages

**Advantages:**
- Can find clusters with arbitrary shapes.
- No need to specify the number of clusters upfront.

**Disadvantages:**
- Computationally intensive, especially with large datasets.
- Sensitive to noise and outliers.

## Applications

- Market segmentation
- Social network analysis
- Image segmentation
- Anomaly detection

## Conclusion

Agglomerative clustering is a powerful tool for hierarchical clustering that helps in identifying nested clusters within data. It offers flexibility in terms of linkage criteria and does not require the number of clusters to be predefined. However, it is computationally intensive and requires careful interpretation of dendrograms and threshold settings.

## Next Steps

In the next class, we will cover vectorization and classification. Ensure you are familiar with the concepts discussed today. If you have any questions, please feel free to ask.


# Lecture 6.2 NLP Classification

## Review of Tokenization and Pre-Processing

### What is Tokenization?

Tokenization is the process of splitting text into individual tokens. These tokens can be words, phrases, or even characters. There are different types of tokenization:

- **Word Tokenization:** Splits text into words.
- **Sentence Tokenization:** Splits text into sentences.

### Stop Words

Stop words are common words (like "the", "and", "is") that are often removed from text data to reduce noise and focus on the meaningful words.

### Stemming and Lemmatization

- **Stemming:** Reduces words to their base form by removing suffixes. It can produce non-existent words.
- **Lemmatization:** Reduces words to their base form considering the part of speech, resulting in valid words.

### Example Code

In [None]:

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Example text preprocessing
text = "The quick brown fox jumps over the lazy dog."

# Tokenization
tokens = nltk.word_tokenize(text)

# Removing stop words and punctuation (lowercase first so "The" matches "the")
stop_words = set(stopwords.words('english'))
filtered_tokens = [word.lower() for word in tokens
                   if word.isalpha() and word.lower() not in stop_words]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

print(lemmatized_tokens)

## Vectorization

Vectorization is the process of converting text into numerical form so it can be fed into machine learning models.

### Types of Vectorizers

- **Count Vectorizer:** Converts text into a matrix of token counts.
- **TF-IDF Vectorizer:** Converts text into a matrix of token counts scaled by inverse document frequency.

### Example Code

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Example text corpus
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]

# Count Vectorizer
count_vectorizer = CountVectorizer()
X_count = count_vectorizer.fit_transform(corpus)

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(corpus)

print(X_count.toarray())
print(X_tfidf.toarray())



# NLP with NLTK
**_Natural Language Toolkit (NLTK)_**: A popular Python library for Natural Language Processing
* Contains sample corpora such as presidential speeches and Project Gutenberg texts.
* Contains its own Bayesian classifiers for quick testing
* Relies heavily on linguistics, but the tools are easy for a non-linguist to use, ex. making a parse tree
<center> <img src='images/new_parse_tree.png'  width="750"> </center>

## Working with Text
* **Stop Word Removal**
* **Filtering and Cleaning**
* **Feature Selection and Feature Engineering** - Resources like the Penn Treebank, part-of-speech tags, sentence polarity, and more.


# Regular Expressions (regex)
* powerful for NLP
* `word_tokenize()` splits a word that contains an apostrophe into multiple tokens: they're becomes `["they", "'re"]`.
* we can use small regex patterns to capture they're as a single word
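
A minimal sketch of such a pattern, using only the standard `re` module. The sample sentence and the pattern itself are assumptions for illustration; real text may need a broader pattern (accented letters, curly apostrophes, hyphens).

```python
import re

text = "They're sure that it isn't raining."

# One or more letters, optionally followed by an apostrophe and more letters
pattern = r"[A-Za-z]+(?:'[A-Za-z]+)?"

tokens = re.findall(pattern, text)
print(tokens)
```

Contractions such as "They're" and "isn't" come back as single tokens, and the trailing period is dropped because it never matches the pattern.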
## Patterns
* Regex is as good as the patterns we create


In [None]:
import re
sentence = 'he said that she said "hello".'
pattern = 'he'
p = re.compile(pattern)
p.findall(sentence) # Output will be ['he', 'he', 'he']

**anchors** can be used to define word boundaries
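
As a short sketch of anchors, the `\b` word-boundary anchor restricts the earlier search to the standalone word, so the "he" inside "she" and "hello" no longer matches. The sentence is reused from the cell above.

```python
import re

sentence = 'he said that she said "hello".'

# \b matches the boundary between a word character and a non-word character
whole_word = re.compile(r'\bhe\b')
print(whole_word.findall(sentence))

# ^ anchors the match to the start of the string
print(re.findall(r'^he', sentence))
```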

## Ranges, Groups, Quantifiers
* Range: `[A-Za-z0-9]` matches any single alphanumeric character in the English alphabet
* Character Class Examples
    * `\d` is `[0-9]`
    * `\w` is any word character (letter, digit, or underscore)
    * `\D` anything that isn't a digit
    * `\W` anything that isn't a word character
* Quantifiers - control how many times the preceding character or group matches, ex. `a*` or `(cat)*`
    * `*` 0 or more times
    * `+` 1 or more times
    * `?` 0 or 1 times
    * `{n}` match exactly n times, ex `{3}` matches 3 times
    * `{n,k}` match between n and k times, ex `{3,5}` matches 3-5 times
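
The quantifiers above can be tried out on small strings. The test strings here are made up for illustration.

```python
import re

optional_u = re.findall(r'colou?r', 'color colour')      # ? makes the "u" optional
one_or_more = re.findall(r'\d+', 'a1 b22 c333')          # + needs at least one digit
zero_or_more = re.findall(r'ab*', 'a ab abbb')           # * allows zero or more "b"s
exactly_three = re.findall(r'\d{3}', '12345 678')        # exactly three digits
two_to_three = re.findall(r'\d{2,3}', '1 22 333 4444')   # two or three digits, as many as possible

for result in (optional_u, one_or_more, zero_or_more, exactly_three, two_to_three):
    print(result)
```

Quantifiers are greedy by default: `\d{2,3}` takes three digits from "4444" and leaves a single "4" that is too short to match again.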

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-introduction-to-regular-expressions/master/images/regex_cheat_sheet.png' alt="regex cheat sheet">

# Feature Engineering Text Data
* Text data has a lot of ambiguity

## Remove Stop Words

In [None]:
import string

from nltk import word_tokenize
from nltk.corpus import stopwords

# Get all the stop words in the English language
stopwords_list = stopwords.words('english')

# It is generally a good idea to also remove punctuation, so add every
# punctuation character to the list of tokens to drop
stopwords_list += list(string.punctuation)

# Placeholder text; replace with your own raw string
some_text_data = "The quick brown fox jumps over the lazy dog."
tokens = word_tokenize(some_text_data)

# Lowercase each token before checking it, so "The" is removed along with "the"
stopped_tokens = [w.lower() for w in tokens if w.lower() not in stopwords_list]

## Frequency Distributions

In [None]:
from nltk import FreqDist

# Count how often each token appears
freqdist = FreqDist(tokens)

# Get the 200 most common tokens as (token, count) pairs
most_common = freqdist.most_common(200)

## Stemming and Lemmatization
Lemmatization often relies on the WordNet lexical database.

In [None]:
from nltk.stem.wordnet import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize('feet') # foot
# The default part of speech is noun, so pass pos='v' to lemmatize a verb
lemmatizer.lemmatize('running', pos='v') # run

## Bigrams and Mutual Information Score
* Pair adjacent words and treat them as one token; bigrams are the special case of n-grams where n is 2
    * n-grams can also be created at the character level, which is common
    * useful because you can apply a frequency filter. A common minimum count is 5.
    * **Pointwise Mutual Information Score** - from information theory, a measure of mutual dependence between two words. ex. "San Francisco": the two words appear together a lot, so the pair has a high mutual information score.
    * NLTK can compute this too
* "the dog played outside" becomes `('the', 'dog'), ('dog', 'played'), ('played', 'outside')`
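
The bullets above can be sketched with the standard library. The helper names (`make_bigrams`, `pmi`) and the toy corpus are assumptions for illustration, and the score uses the usual definition $\log_2 \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)}$ estimated from raw counts. The sketch assumes the pair occurs at least once, since the log of zero is undefined.

```python
from collections import Counter
from math import log2

def make_bigrams(tokens):
    # Pair each token with the one that follows it
    return list(zip(tokens, tokens[1:]))

def pmi(w1, w2, tokens):
    unigrams = Counter(tokens)
    bigrams = Counter(make_bigrams(tokens))
    p_w1 = unigrams[w1] / len(tokens)
    p_w2 = unigrams[w2] / len(tokens)
    p_pair = bigrams[(w1, w2)] / sum(bigrams.values())
    return log2(p_pair / (p_w1 * p_w2))

print(make_bigrams("the dog played outside".split()))

# Toy corpus (hypothetical) where "san francisco" recurs as a unit
corpus = ("san francisco is foggy and new york is sunny "
          "and san francisco is hilly").split()
print(pmi("san", "francisco", corpus))
print(pmi("and", "san", corpus))
```

In this toy corpus "san" is always followed by "francisco", so that pair scores higher than "and san", where "and" also precedes other words.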


# Context-Free Grammars and POS Tagging

> "Colorless green ideas sleep furiously."
Noam Chomsky.  

While grammatically and syntactically correct, the sentence is not meaningful semantically. There is a "deep structure" we recognize as correct regardless of content. This is the idea behind a Context-Free Grammar (CFG): we don't need any context to determine whether the grammar is correct.

## Five Levels Of Language
<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_LevelsOfLanguage-Graph.png'>




**CFGs are the Syntax Level**
* CFGs are important to computer science due to parsers 
* Part of Speech (POS) Tags such as run being a noun or a verb.

## Parse Trees and Sentence Structure
* Sentences in English go Noun Phrase -> Verb Phrase -> Prepositional Phrase but this gets complicated because it can be recursive.  A noun phrase can have multiple noun phrases or even a verb phrase within it which themselves can contain noun and verb phrases.  This leads to ambiguity.

> "While hunting in Africa, I shot an elephant in my pajamas. How he got into my pajamas, I don't know."

The joke is that the second sentence reveals Marx meant the less expected reading, in which the elephant is the one wearing the pajamas. This kind of ambiguity is hard for a computer to resolve.
Parse trees help us understand the difference.

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/parse_tree.png'>

* Noun phrase: `['I']`
* Verb phrase: `['shot', 'an', 'elephant']`
* Prepositional phrase: `['in', 'my', 'pajamas']`

VS

* Noun phrase: `['I']`
* Verb phrase: `['shot', 'an', 'elephant', 'in', 'my', 'pajamas']`

The 2nd one treats "in my pajamas" as part of the noun phrase "an elephant in my pajamas" within the verb phrase (so the elephant is wearing the pajamas).
The 1st one is more typical and treats the prepositional phrase as its own unit describing the shooting.

Put simply, we know what an elephant is, what pajamas are, and understand that it's highly unlikely that an elephant could fit in pajamas. This helps us determine how we understand that sentence on the fly -- computers don't have this luxury, so they don't know which to choose!

## POS Tagging and CFGs
<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/cfg.png'>

Let's break down this CFG, and see if we can understand it a bit better. 

* `S -> NP VP` A sentence (S) consists of a Noun Phrase (NP) followed by a Verb Phrase (VP).
* `PP -> P NP` A Prepositional Phrase (PP) consists of a Preposition (P) followed by a Noun Phrase (NP)
* `NP -> Det N | Det N PP | 'I'` A Noun Phrase (NP) can consist of:
    * a Determiner (Det) followed by a Noun (N), or (as denoted by `|`) 
    * a Determiner (Det) followed by a Noun (N), followed by a Prepositional Phrase (PP), or
    * The token `'I'`.
* `VP -> V NP | VP PP` A Verb Phrase can consist of:
    * a Verb (V) followed by a Noun Phrase (NP) or
    * a Verb Phrase (VP) followed by a Prepositional Phrase (PP)
* `Det -> 'an' | 'my'` Determiners are the tokens 'an' or 'my'
* `N -> 'elephant' | 'pajamas'` Nouns are the tokens 'elephant' or 'pajamas'
* `V -> 'shot'` Verbs are the token 'shot'
* `P -> 'in'` Prepositions are the token 'in'

As we can see, the CFG provides explicit rules as to both:
1. How sentences, noun phrases, verb phrases, and prepositional phrases may be structured
2. What parts of speech each token belongs to 

This defines a very small CFG that allows the parser to successfully generate parse trees for Groucho Marx's sentence. Note that both of the parse trees seen above are valid according to the rules defined in this grammar. Even though this grammar is quite explicit, both of them work.
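
To make the ambiguity concrete, the grammar can be written as a plain dictionary and the number of valid parse trees counted with a small memoized recursion. This is a standard-library sketch written for these notes, not NLTK's parser; the names `GRAMMAR` and `count_parses` are assumptions, and the approach relies on every rule consuming at least one token (true for this grammar).

```python
from functools import lru_cache

# Nonterminals map to their alternative right-hand sides.
# Any symbol that is not a key (e.g. "shot") is treated as a literal token.
GRAMMAR = {
    "S": [("NP", "VP")],
    "PP": [("P", "NP")],
    "NP": [("Det", "N"), ("Det", "N", "PP"), ("I",)],
    "VP": [("V", "NP"), ("VP", "PP")],
    "Det": [("an",), ("my",)],
    "N": [("elephant",), ("pajamas",)],
    "V": [("shot",)],
    "P": [("in",)],
}

def count_parses(tokens, start="S"):
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def derive(symbol, i, j):
        """Number of trees rooted at `symbol` that cover tokens[i:j]."""
        if symbol not in GRAMMAR:
            return int(j - i == 1 and tokens[i] == symbol)
        return sum(derive_seq(rhs, i, j) for rhs in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def derive_seq(rhs, i, j):
        """Number of ways the symbols in `rhs` can jointly cover tokens[i:j]."""
        if len(rhs) == 1:
            return derive(rhs[0], i, j)
        # The first symbol covers tokens[i:k]; the rest cover tokens[k:j]
        return sum(derive(rhs[0], i, k) * derive_seq(rhs[1:], k, j)
                   for k in range(i + 1, j - len(rhs) + 2))

    return derive(start, 0, len(tokens))

sentence = "I shot an elephant in my pajamas".split()
print(count_parses(sentence))  # 2
```

The count of 2 corresponds to the two trees above: one where the prepositional phrase attaches to the verb phrase (`VP -> VP PP`) and one where it attaches to the noun phrase (`NP -> Det N PP`).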

NLTK can get these part-of-speech tags from resources like the **Penn Treebank**.

# Working with Text Data Questions
* Do we remove stop words or not?    
* Do we stem or lemmatize our text data, or leave the words as is?   
* Is basic tokenization enough, or do we need to support special edge cases through the use of regex?  
* Do we use the entire vocabulary, or just limit the model to a subset of the most frequently used words? If so, how many?  
* Do we engineer other features, such as bigrams, or POS tags, or Mutual Information Scores?   
* What sort of vectorization should we use in our model? Boolean Vectorization? Count Vectorization? TF-IDF? More advanced vectorization strategies such as Word2Vec?  


# Data Ethics
## Where Does NLP Data Come From?
### Ground Truth in Machine Learning
* labels need to exist for each record for supervised learning
    * these labels are called "ground truth"
    * difference between predictions and ground truth is how the quality of a model is measured
### Labeling
* Traces back to qualitative research
    * social science researchers would "code" (label) raw data
    * codes are sometimes determined before the analysis begins (also known as "a priori" or "deductive" codes), and are sometimes developed during the analysis (also known as "grounded" or "inductive" codes). Often multiple researchers will apply codes to the same text sample and then use statistical techniques to measure the [inter-rater reliability](https://www.statology.org/inter-rater-reliability/) and ensure that the codes are being applied to the samples in a consistent way

### Data Labeling Today: Crowd Workers
Data scientists ([and social scientists](https://blogs.lse.ac.uk/impactofsocialsciences/2020/12/15/how-to-conduct-valid-social-science-research-using-mturk-a-checklist/)) instead frequently use crowdsourcing platforms to label text data cheaply, quickly, and at scale.

The most popular platform for data labeling is [Amazon Mechanical Turk (MTurk)](https://www.mturk.com/). Requesters can post "Human Intelligence Tasks" (HITs) such as data labeling to the MTurk platform and pay workers for completion of each individual task.

#### MTurk and Privacy
Sending any kind of personally-identifiable information (PII) or other sensitive data to a platform like MTurk risks the exposure of this data. A [Microsoft Research team](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/04/cscw14-crowdattack.pdf) even found that they could pay MTurk workers to steal user data if it was presented as part of a legitimate task!

Some [data manipulation techniques](https://www.cs.cmu.edu/~jbigham/pubs/pdfs/2017/crowdmask.pdf) have been proposed to avoid making certain data identifiable, but the less-risky option is to avoid sending any kind of sensitive data to this kind of platform.

#### MTurk and Working Conditions
MTurk workers are independent contractors and are therefore usually not legally entitled to certain worker protections. Research has found that some requesters exploit these workers by [paying less than &#36;2 per hour](https://arxiv.org/abs/1712.05796) and create stressful working conditions by [mass rejecting completed tasks](https://blog.turkopticon.net/?p=731) with essentially no recourse for workers.


## Implicit Labels Scraped from the Internet

Given the logistical and ethical challenges that can arise in paying people to label data, some data scientists have moved towards using data that was never explicitly labeled. As a data scientist in [one study](https://arxiv.org/abs/1812.05239) put it:

> There isn’t really a thought process surrounding... _Should [our team] ingest this data in?_ [...] If it is available to us, we ingest it.

One example of such a readily-available data source is [Common Crawl](https://commoncrawl.org/). Common Crawl attempts to scrape the entire Internet every month or so, and hosts petabytes of data from each crawl. The scale and ease of access of this dataset has made it very appealing to data scientists.

#### Scraped Data and Privacy Concerns
Even when posting "publicly", website users often have an expectation that this data will not be aggregated and analyzed outside of its original context. For example, the [publication of a dataset](https://www.vox.com/2016/5/12/11666116/70000-okcupid-users-data-release) of 70,000 users scraped from OkCupid was broadly criticized. Social computing researcher Os Keyes [wrote](https://ironholds.org/scientific-consent/):

> this is without a doubt one of the most grossly unprofessional, unethical and reprehensible data releases I have ever seen.

(The dataset was eventually taken down, not because of ethical issues but because OkCupid filed a [DMCA](https://www.copyright.gov/dmca/) complaint.)


#### Scraped Data and Garbage In, Garbage Out
Concerns about data quality are especially relevant when working with data that was scraped from random sources rather than collected with a particular intention in mind.

Studies of the Common Crawl corpus in particular have found that it contains a ["significant amount of undesirable content"](https://aclanthology.org/2021.acl-short.24.pdf) including hate speech and sexually explicit content, and also that models trained on it [exhibit numerous historical biases](https://www.cs.princeton.edu/~arvindn/publications/language-bias.pdf) related to race and gender.

## Bonus Topic: Large Language Models
Another kind of model that uses text data but isn't a traditional text classifier is a ***large language model***. At the time of this writing, [GPT-3](https://www.nytimes.com/2020/11/24/science/artificial-intelligence-ai-gpt3.html), [GitHub Copilot](https://github.com/features/copilot), and [BERT](https://huggingface.co/blog/bert-101) are some popular examples of this type of model. Typically a large language model tries to predict the next word or words in a sequence, in a way that can _generate_ text rather than simply labeling it.

These models are almost always developed with scraped data, because that is the only way to achieve the necessary scale. GPT-3, for example, was trained on [the Common Crawl corpus in addition to curated sources](https://techcrunch.com/2020/08/07/here-are-a-few-ways-gpt-3-can-go-wrong/). It also has demonstrated bias against [Muslims](https://hai.stanford.edu/news/rooting-out-anti-muslim-bias-popular-language-model-gpt-3), [women](https://aclanthology.org/2021.nuse-1.5), and [disabled people](https://arxiv.org/abs/2206.11993).

In a more stunning example, the "Tay" chatbot started with a pre-trained dataset then "learned" racist and incendiary language from Twitter users, eventually being [pulled from the platform](https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist) after less than a day. This shows that social media platforms, despite having some form of moderation, may be worse sources of training data than generic sources like Common Crawl.


## Additional Resources
* [Qualitative Data Coding 101](https://gradcoach.com/qualitative-data-coding-101/)
* [Data and its (dis)contents: A survey of dataset development and use in machine learning research](https://arxiv.org/abs/2012.05345)
* [Documenting Data Production Processes: A Participatory Approach for Data Work](https://arxiv.org/abs/2207.04958)
* [The trainer, the verifier, the imitator: Three ways in which human platform workers support artificial intelligence](https://journals.sagepub.com/doi/10.1177/2053951720919776)
* [Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?](https://arxiv.org/abs/1912.08320)
* [The dangers of data scraped from the internet](https://www.technologyreview.com/2021/08/13/1031836/ai-ethics-responsible-data-stewardship/)
* [On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)

# Gradient in Gradient Descent

> Here, in finding gradient ascent, our task is not to calculate the gain from a move in either the $x$ or $y$ direction.  Instead, our task is to **find some combination of a change in $x$,$y$ that brings the largest change in output**.  

For the function $f(x,y) = 2x + 3y$, the gradient $\nabla f(x,y)$ is built from the two partial derivatives:

$\frac{\partial f}{\partial x}(2x + 3y) = 2$ and $\frac{\partial f}{\partial y}(2x + 3y) = 3$.

This tells us that the direction of greatest ascent for the function $f(x,y) = 2x + 3y$ is up three and to the right two. So we would expect the path of greatest ascent to be a straight line in that direction.

This example gives gradient **ascent**. In general, $\nabla f(x, y) = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)$. To take the path of greatest ascent, you move along a line whose slope is $\frac{\partial f}{\partial y}$ divided by $\frac{\partial f}{\partial x}$. So for example, when $\frac{\partial f}{\partial y} = 3$ and $\frac{\partial f}{\partial x} = 2$, you travel along a line with a slope of 3/2.

For gradient descent, that is, to find the direction of greatest decrease, you simply reverse the sign of your partial derivatives and move in the direction $\left(-\frac{\partial f}{\partial x}, -\frac{\partial f}{\partial y}\right)$.
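
A small numerical check of this example, using central differences to estimate the partial derivatives and then taking one descent step. The starting point `(1.0, 1.0)`, the step size `h`, and the learning rate are arbitrary choices for this sketch.

```python
def f(x, y):
    return 2 * x + 3 * y

def numerical_gradient(func, x, y, h=1e-6):
    # Central-difference estimates of the two partial derivatives
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return df_dx, df_dy

grad = numerical_gradient(f, x=1.0, y=1.0)
print(grad)  # approximately (2, 3)

# One gradient descent step: move against the gradient
learning_rate = 0.1
x_new = 1.0 - learning_rate * grad[0]
y_new = 1.0 - learning_rate * grad[1]
print(f(1.0, 1.0), f(x_new, y_new))
```

The function value drops after the step, which is what moving in the direction of the negative gradient is meant to achieve.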

# Scalars, Vectors, Matrices, Tensors
> **Scalar**: A single number
* **Real valued scalars**: Let $S \in  \mathbb{R} $  be the salary of an individual
* **Natural number scalars**: Let $n \in \mathbb{N}$ be the number of floors in a building


> **Vector**: An **array** of numbers arranged in some order, as opposed to the single numbered scalar. 

\begin{equation}
x = 
\begin{bmatrix}
  x_{1} \\
  x_{2} \\
  \vdots \\
  x_{n-1} \\
  x_{n} \\
\end{bmatrix}
\end{equation}

Where $x$ is the name of the vector and $(x_1,x_2, \ldots, x_{n-1}, x_n)$ are the scalar components of the vector.

A vector has direction and a magnitude (length).

Below is an example of a vector in 3D vector space:  

![](https://curriculum-content.s3.amazonaws.com/data-science/images/vec2.png)

In Python, NumPy arrays work well for vectors:
```python 
# create a vector from list [2,4,6]
import numpy as np
v = np.array([2, 4, 6])
print(v)

# a longer vector (1 through 10) to demonstrate slicing
x = np.arange(1, 11)
print(x[1:4])     # second to fourth element. Element 5 is not included
print(x[0:-1:2])  # every other element (the stop index -1 excludes the last element)
print(x[:])       # print the whole vector
print(x[::-1])    # reverse the vector!
```

> **Matrix** is a 2-dimensional array of numbers written between square brackets. 

$$
   A=
  \left[ {\begin{array}{cccc}
   A_{1,1} & A_{1,2} & \ldots &A_{1,n} \\
   A_{2,1} & A_{2,2} & \ldots &A_{2,n} \\
   \vdots& \vdots & \ddots & \vdots \\
   A_{m,1} & A_{m,2} & \ldots &A_{m,n} \\
  \end{array} } \right]
$$

We usually give matrices uppercase variable names with bold typeface, such as $A$. If a real-valued matrix $A$ has a height of $m$ and a width of $n$ as above, we state this as $A \in \mathbb{R}^{m \times n}$. In machine learning, a vector can be seen as a special case of a matrix.

> A vector is a matrix that has only 1 column, so you have an $(m \times 1)$-matrix. $m$ is the number of rows, and 1 here is the number of columns, so a matrix with just one column is a vector.

In Python, an array of arrays makes a matrix:
```python
X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(X)
print (X[0, 0]) # element at first row and first column
print (X[-1, -1]) # element from the last row and last column 
print (X[0, :]) # first row and all columns
print (X[:, 0]) # all rows and first column 
print (X[:]) # all rows and all columns
```

NumPy also has a MATLAB-style `np.matrix` class (for those used to MATLAB definitions), although NumPy's documentation recommends regular arrays for new code:
```python
Y = np.matrix([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(Y)
```

> **Tensor** An array of numbers arranged on a regular grid with a variable number of axes.

A vector is a one-dimensional or "first order tensor" and a matrix is a two-dimensional or "second order tensor".
Tensor notation is just like matrix notation, with a capital letter that represents a tensor, and lowercase letters with a subscript representing scalar values within the tensor. Many operations that can be performed with scalars, vectors, and matrices can be reformulated to be performed with tensors as well. The image below shows some of these operations for a  3D tensor. 

<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/new_tensors.png" width="700">

As a tool, tensors and tensor algebra are widely used in the fields of physics and engineering, and in data science they are particularly useful once you learn about deep learning models. 
We'll revisit tensors and relevant operations in the deep learning sections and explain how tensors are created, manipulated, and saved using more advanced analytical tools like Keras. 


## Transpose
<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/new_vector.png" width="150">

Neural networks frequently process weights and inputs of different sizes where the dimensions do not meet the requirements of matrix multiplication. Matrix transpose provides a way to "rotate" one of the matrices so that the operation complies with multiplication requirements and can continue. There are two steps to transpose a matrix:

* Rotate the matrix 90° clockwise.
* Reverse the order of elements in each row (e.g. [a b c] becomes [c b a]).

This can be better understood by looking at this image:

<img src="https://curriculum-content.s3.amazonaws.com/data-science/images/new_matrix.png" width="350">

Numpy provides the transpose operation by using the `.T` attribute or the `np.transpose()` function with the array that needs to be transposed as shown below:

```python
# create a transpose of a matrix

A = np.array([
   [1, 2, 3], 
   [4, 5, 6],
   [7, 8, 9]])

A_transposed = A.T
A_transposed_2 = np.transpose(A)

print(A,'\n\n', A_transposed, '\n\n', A_transposed_2)
```
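
As a quick check, the two-step description (rotate clockwise, then reverse each row) can be compared against `.T`. The matrices `W` and `X` at the end are hypothetical shapes, chosen only to show a multiplication that works once one operand is transposed.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

rotated = np.rot90(A, k=-1)   # k=-1 rotates 90 degrees clockwise
flipped = rotated[:, ::-1]    # reverse the order of elements in each row

print(flipped)
print(np.array_equal(flipped, A.T))

# (4, 3) dot (5, 3) is not defined, but (4, 3) dot (3, 5) is
W = np.ones((4, 3))
X = np.ones((5, 3))
product_shape = np.dot(W, X.T).shape
print(product_shape)
```

Equivalently, transposing swaps rows and columns, so the element at row $i$, column $j$ of $A^T$ is the element at row $j$, column $i$ of $A$.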

# Keras

- Scalars = 0D tensors
- Vectors = 1D tensors
- Matrices = 2D tensors
- 3D tensors

A tensor is defined by three key attributes:
- rank or number of axes
- the shape
- the data type

## Some important data manipulations in NumPy

### Unrowing matrices (important for images)

eg Santa: `(790, 64, 64, 3)` matrix to a `(64*64*3, 790)` matrix!

```python 
img_unrow = img.reshape(790, -1).T  
```
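
A sketch of the same unrowing on a smaller stand-in array, to show what ends up in each column. The batch of 5 images and the `np.arange` pixel values are assumptions made for this example; the dataset in these notes has 790 images.

```python
import numpy as np

n_images = 5  # hypothetical small batch
img = np.arange(n_images * 64 * 64 * 3).reshape(n_images, 64, 64, 3)

# Flatten each image into one row, then transpose so each image is a column
img_unrow = img.reshape(n_images, -1).T
print(img_unrow.shape)

# Column i holds every pixel value of image i
print(np.array_equal(img_unrow[:, 0], img[0].ravel()))
```

Passing `-1` lets NumPy infer the flattened length (64 * 64 * 3 = 12288) from the remaining dimensions.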

### Increasing the rank

A vector with shape `(790,)` can be reshaped into a `(1, 790)` matrix:

```python
np.reshape(vector, (1,790)) 
```

```python  
tensor[start_idx : end_idx]
```

### Tensor Operations
* **Element-wise**: updates each element with the corresponding element from another tensor
```python
np.array([1, 2, 3, 4]) + np.array([5, 6, 7, 8])
#result
array([ 6,  8, 10, 12])
```

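* **Broadcasting**: When tensors have different shapes, NumPy can broadcast the smaller one across the larger. Here, for example, B is added to every row of A.
```python
import numpy as np

A = np.arange(12).reshape(4, 3)
B = np.array([1, 2, 3])
print(A)
# [[ 0  1  2]
#  [ 3  4  5]
#  [ 6  7  8]
#  [ 9 10 11]]

A += B  # B has shape (3,), so it is added to each of the 4 rows
print(A)
# [[ 1  3  5]
#  [ 4  6  8]
#  [ 7  9 11]
#  [10 12 14]]
```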
* **Tensor dot**: sum of element products following matrix rules
```python
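import numpy as np

# Recall that B is the vector [1, 2, 3]
B = np.array([1, 2, 3])
# Taking the dot product of B and itself is equivalent to
# 1*1 + 2*2 + 3*3 = 1 + 4 + 9 = 14
print(np.dot(B, B))  # 14

# A is the (4, 3) matrix from the broadcasting example, before B was added
A = np.arange(12).reshape(4, 3)
# Each entry of the result is the dot product of one row of A with B
print(np.dot(A, B))  # [ 8 26 44 62]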
```

Here the first element is the sum of the first row of $A$ multiplied by $B$ elementwise:  
$$ 0*1 + 1*2 + 2*3 = 0 + 2 + 6 = 8 $$ 

Followed by the sum of the second row of $A$ multiplied by $B$ elementwise:  

$$ 3*1 + 4*2 + 5*3 = 3 + 8 + 15 = 26 $$

and so on.

<img src="images/dotproduct.png">

You continue this for every row x column.  Also notice that the width of A has to be the height of B.

## Keras Sequential Model Example

In [None]:
from keras import models
from keras import layers
from keras import optimizers

model = models.Sequential()
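# Dense means this layer is fully connected
# input_shape is only needed on the first layer; later layers infer their input shape from the previous layer
# units, activation, and input_shape are placeholders, e.g. units=16, activation='relu', input_shape=(n_features,)
model.add(layers.Dense(units=units, activation=activation, input_shape=input_shape))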
#notice the loss function
model.compile(optimizer=optimizers.RMSprop(learning_rate=0.001),
              loss='mse',
              metrics=['accuracy'])


history = model.fit(x_train,
                    y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(x_val, y_val))

### Machine Learning Training Terms

**Sample**: A single element of a dataset.
  * In a convolutional network, an image is a sample.
  * In a speech recognition model, an audio file is a sample.

**Batch**: A set of *N* samples processed independently, but in parallel.
  * **Training**: Processing a batch results in only one update to the model.
  * **Approximation**: Batches better approximate the distribution of input data than single samples. Larger batches provide a more accurate approximation.
  * **Inference**: Use the largest batch size your system can manage without memory issues for faster evaluation/prediction.

**Epoch**: One full pass over the entire dataset.
  * **Phases**: Used to segment training into distinct phases for easier logging and evaluation.
  * **Validation**: If using `validation_data` or `validation_split` during training with Keras, evaluations are performed at the end of each epoch.
  * **Callbacks**: Functions like adjusting learning rates or saving model checkpoints can be scheduled to run at the end of each epoch.
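
The relationship between samples, batches, and epochs can be worked out with simple arithmetic. The sketch reuses `batch_size=512` and `epochs=20` from the `fit` call above; the dataset size of 10,000 is a made-up number, and it assumes the final partial batch is still processed rather than dropped.

```python
from math import ceil

n_samples = 10_000  # hypothetical training set size
batch_size = 512
epochs = 20

# One weight update per batch, so updates per epoch = number of batches
updates_per_epoch = ceil(n_samples / batch_size)
total_updates = updates_per_epoch * epochs

# The last batch of each epoch holds whatever samples are left over
last_batch_size = n_samples - (updates_per_epoch - 1) * batch_size

print(updates_per_epoch, total_updates, last_batch_size)
```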

## After Fit
`history.history` - history on how the model was trained
- `history.history['loss']` loss values per epoch 
- `history.history['accuracy']` accuracy values per epoch

## Predict
```python
y_hat = model.predict(x)
```

## Evaluate the Model

Similarly, we can use the `.evaluate()` method in order to compute the loss and other specified metrics for our trained model.

For example,   

```python
model.evaluate(X_train, X_train_labels)
``` 

will return the final loss associated with the model for the training data as well as any other metrics that were specified during compilation.

Similarly, 

```python
model.evaluate(X_test, X_test_labels)
``` 
will return the final loss associated with the model for the test data as well as any other specified metrics.


## Additional Resource
    
* A full book on Keras by the author of Keras himself:  
https://www.manning.com/books/deep-learning-with-python-second-edition

# Neural Networks - Recap

## Key Takeaways

The key takeaways from this section include:

### Neural Networks

* Neural networks are powerful models that can be customized and tweaked using various amounts of nodes, layers, etc.
* The most basic neural networks are single-layer densely connected neural networks, whose properties are very similar to those of logistic regression models
* Compared to more traditional statistics and ML techniques, neural networks perform particularly well when using unstructured data
* Apart from densely connected networks, other types of neural networks include convolutional neural networks, recurrent neural networks, and generative adversarial neural networks 
* When working with image data, it's important to understand how the data is stored when you load it in Python
* Logistic regression can be seen as a single-layer neural network with a sigmoid activation function
* Neural networks use loss and cost functions to minimize the "loss", which is a function that summarizes the difference between the actual outcome (e.g. whether a picture contains Santa or not) and the model prediction (whether the model says the picture contains Santa)
* Backward and forward propagation are used to estimate the so-called "model weights"
* Adding more layers to neural networks can substantially increase model performance
* Several activation functions can be used in model nodes; you can experiment with different types and evaluate how they affect performance

### Deep Neural Networks

* Deep neural network representations can lighten the burden and automate certain tasks of heavy data preprocessing
* Deep representations can need exponentially fewer hidden units than shallow networks to obtain the same performance
* Parameter initialization, forward propagation, cost function evaluation, and backward propagation are again the cornerstones of deep networks
* Tensors are the building blocks of neural networks and a good understanding of them and how to use them in Python is crucial
* Scalars can be seen as 0-D tensors. Vectors can be seen as 1-D tensors, and matrices as 2-D tensors
* The usage of tensors reaches beyond matrices: tensors can have N dimensions
* Tensors can be created and manipulated using NumPy
* Keras makes building neural networks in Python easy, and you learned how to do that in this section
* You can use Keras to do some NLP as well, e.g. for tokenization 
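
To make the tensor terminology above concrete, here is a small NumPy sketch. The array contents and the image batch size are made up for illustration.

```python
import numpy as np

scalar = np.array(5)                  # 0-D tensor
vector = np.array([1, 2, 3])          # 1-D tensor
matrix = np.array([[1, 2], [3, 4]])   # 2-D tensor
# Hypothetical batch of 10 RGB images, 32x32 pixels each: a 4-D tensor
images = np.zeros((10, 32, 32, 3))

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    print(name, "ndim:", t.ndim, "shape:", t.shape)

# Flatten each image into one row vector, as a densely connected network expects
flat = images.reshape(images.shape[0], -1)
print(flat.shape)
```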

# Lecture 7: Neural Network Architecture

In this lecture, we'll explore the architecture of neural networks. On Monday, we will start working with TensorFlow to create neural networks.

## What is a Neural Network?

Neural networks are loosely inspired by the structure of the human brain. They are computational graphs that connect inputs to outputs through layers, including hidden layers. Each connection between neurons has an associated weight, each neuron has a bias, and activation functions let the network learn non-linear relationships.

### Layers and Complexity

- **Input Layer:** Represents features or columns of the dataset.
- **Hidden Layers:** Intermediate layers that learn from the input features.
- **Output Layer:** Provides the final prediction or classification.

Each neuron in a hidden layer represents a unique equation and processes inputs to generate outputs for the next layer.

### Example: Cat vs. Dog Classification

Inputs (features) are processed through multiple layers to produce an output (e.g., identifying if an image is a cat or a dog).

## Understanding Neural Network Training

### Epoch and Batch Sizes

- **Epoch:** One full pass through the entire dataset.
- **Batch Size:** The number of samples processed before the model's internal parameters are updated.

Each epoch consists of multiple batches, and after each batch, the model's weights are updated.
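
The relationship between epochs, batches, and weight updates can be sketched as a loop. The dataset size, batch size, and epoch count are illustrative values, and `iterate_minibatches` is a hypothetical helper; no actual training happens here.

```python
import math

def iterate_minibatches(n_samples, batch_size):
    """Yield (start, stop) index pairs covering the dataset once (one epoch)."""
    for start in range(0, n_samples, batch_size):
        yield start, min(start + batch_size, n_samples)

n_samples, batch_size, epochs = 1000, 32, 5   # illustrative values
updates = 0
for epoch in range(epochs):
    for start, stop in iterate_minibatches(n_samples, batch_size):
        # A real training loop would compute gradients on samples[start:stop]
        # and update the weights here; this sketch only counts the updates.
        updates += 1

batches_per_epoch = math.ceil(n_samples / batch_size)
print(batches_per_epoch, updates)
```

The last batch of each epoch is smaller when the dataset size is not a multiple of the batch size.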

### Activation Functions

Activation functions introduce non-linearity into the network, allowing it to learn complex patterns.

- **Sigmoid:** Often used in the output layer for binary classification.
- **Tanh:** A rescaled and shifted sigmoid function ranging from -1 to 1.
- **ReLU (Rectified Linear Unit):** Outputs the input if it's positive; otherwise, it outputs zero. Most popular due to its simplicity and effectiveness.

### Regularization and Learning Rate

- **Regularization:** Techniques like L1 and L2 regularization prevent overfitting by adding penalties to the loss function.
- **Learning Rate:** Determines the step size during gradient descent optimization.
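
To see what the step size does, the sketch below runs plain gradient descent on a toy one-parameter function. The function $f(w) = (w - 3)^2$ and the three learning rates are assumptions chosen for illustration.

```python
def gradient_descent(learning_rate, steps=20, w=0.0):
    """Minimize f(w) = (w - 3)**2 starting from w, returning the final w."""
    for _ in range(steps):
        grad = 2 * (w - 3)              # derivative of (w - 3)**2
        w = w - learning_rate * grad    # step against the gradient
    return w

for lr in (0.01, 0.1, 1.1):
    print(lr, gradient_descent(lr))
```

A very small rate is still far from the minimum at $w = 3$ after 20 steps, a moderate rate gets close, and a rate that is too large overshoots further on every step and diverges.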

## Building a Neural Network

### Weights and Biases

- Weights are initialized randomly and adjusted through training.
- Bias terms are added to the weighted sum of inputs before the activation function produces the neuron's output.

### Forward Propagation

Inputs are passed through the network layer by layer. Each layer applies weights, sums the inputs, adds biases, and uses activation functions to produce outputs for the next layer.

### Backpropagation

After forward propagation, the error is calculated using a loss function. Backpropagation adjusts the weights by calculating gradients and updating the weights in the direction that reduces the error.

## Example Code: Forward Propagation and Activation Functions


In [None]:
import numpy as np

# Initialize weights and bias
weights = np.random.rand(64)
bias = np.random.rand(1)

# Input features
inputs = np.random.rand(64)  # Example values: 64 random features, one per weight

# Weighted sum
z = np.dot(weights, inputs) + bias

# Sigmoid activation function
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# Output after activation
output = sigmoid(z)
print("Output:", output)

## Practical Example: Image Classification

### Steps

1. **Load Image Data:** Flatten images into vectors.
2. **Initialize Weights and Biases:** Randomly initialize for each neuron.
3. **Forward Propagation:** Calculate weighted sums and apply activation functions.
4. **Calculate Error:** Compare predicted outputs with actual labels.
5. **Backpropagation:** Adjust weights based on error gradients.

## Conclusion

Neural networks are powerful tools for learning complex patterns and making predictions. They involve intricate processes of forward and backward propagation, weight adjustment, and the use of activation functions. While their workings can seem like a black box, understanding their architecture and training process is crucial for effectively using them in machine learning tasks.

## Next Steps

In the next class, we will cover vectorization and classification. Ensure you are familiar with the concepts discussed today. If you have any questions, please feel free to ask.


# Neural Network Partial Derivatives
The formulas below correspond to the prediction $\hat y = X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6)$ and the squared-error cost $C = \sum (y - \hat y)^2$. Because $X_1$ and $X_2$ differ for each training example, they belong inside the sum. Here is a summary of each partial derivative:

1. $$\frac{\partial C}{\partial w_1} = -2 \sum w_5 X_1 \left( y - \left[ X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6) \right] \right)$$
2. $$\frac{\partial C}{\partial w_2} = -2 \sum w_6 X_1 \left( y - \left[ X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6) \right] \right)$$
3. $$\frac{\partial C}{\partial w_3} = -2 \sum w_5 X_2 \left( y - \left[ X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6) \right] \right)$$
4. $$\frac{\partial C}{\partial w_4} = -2 \sum w_6 X_2 \left( y - \left[ X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6) \right] \right)$$
5. $$\frac{\partial C}{\partial w_5} = -2 \sum (w_1 X_1 + w_3 X_2) \left( y - \left[ X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6) \right] \right)$$
6. $$\frac{\partial C}{\partial w_6} = -2 \sum (w_2 X_1 + w_4 X_2) \left( y - \left[ X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6) \right] \right)$$

In these equations:
- $y$ represents the actual target value.
- $X_1$ and $X_2$ are input features.
- $w_i$ are the weights of the neural network.
- The summation $\sum$ indicates summing over all training examples.

Each equation represents how the cost function $C$ changes with respect to a particular weight $w_i$. These derivatives are used in the backpropagation algorithm to adjust the weights iteratively, reducing the overall cost and improving the model's predictions.


To move down the slope of the cost surface, each weight is updated in the direction opposite to its partial derivative.
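
One way to sanity-check partial derivatives like these is to compare them with finite differences. The sketch below assumes the squared-error cost $C = \sum (y - \hat y)^2$ with $\hat y = X_1 (w_1 w_5 + w_2 w_6) + X_2 (w_3 w_5 + w_4 w_6)$, keeps $X_1$ and $X_2$ inside the sum, and uses made-up random data; the function names are hypothetical.

```python
import numpy as np

def predict(w, X1, X2):
    """y_hat = X1*(w1*w5 + w2*w6) + X2*(w3*w5 + w4*w6)."""
    w1, w2, w3, w4, w5, w6 = w
    return X1 * (w1 * w5 + w2 * w6) + X2 * (w3 * w5 + w4 * w6)

def cost(w, X1, X2, y):
    return np.sum((y - predict(w, X1, X2)) ** 2)

def gradient(w, X1, X2, y):
    w1, w2, w3, w4, w5, w6 = w
    r = y - predict(w, X1, X2)      # residual for every training example
    return np.array([
        -2 * np.sum(r * w5 * X1),
        -2 * np.sum(r * w6 * X1),
        -2 * np.sum(r * w5 * X2),
        -2 * np.sum(r * w6 * X2),
        -2 * np.sum(r * (w1 * X1 + w3 * X2)),
        -2 * np.sum(r * (w2 * X1 + w4 * X2)),
    ])

rng = np.random.default_rng(0)
X1, X2, y = rng.normal(size=(3, 50))     # made-up data, 50 examples
w = rng.normal(size=6)

# Central finite differences as an independent check of the formulas
eps = 1e-6
numeric = np.array([
    (cost(w + eps * e, X1, X2, y) - cost(w - eps * e, X1, X2, y)) / (2 * eps)
    for e in np.eye(6)
])
print(np.allclose(gradient(w, X1, X2, y), numeric, rtol=1e-4, atol=1e-6))

# One gradient descent step: move against the gradient
w_new = w - 1e-4 * gradient(w, X1, X2, y)
print(cost(w, X1, X2, y), cost(w_new, X1, X2, y))
```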

# Network Notation 

<img src='https://curriculum-content.s3.amazonaws.com/data-science/images/new_small_deeper.png' width='700'>

For our 2-layer neural network above:

- $x = a^{[0]}$  as x is what comes out of the input layer
- $a^{[1]} = \begin{bmatrix} a^{[1]}_1  \\ a^{[1]}_2 \\ a^{[1]}_3  \\\end{bmatrix}$ is the value generated by the hidden layer
- $\hat y =  a^{[2]}$, the output layer will generate a value $a^{[2]}$, which is equal to $\hat y$ 


Note that the input layer has index 0 and is not counted, so when you describe an n-layer network, do not include the input layer in n.
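
The notation maps directly onto a forward pass. The sketch below assumes 2 input features (the figure may show a different number), 3 hidden units as in $a^{[1]}$, sigmoid activations, and random weights, all chosen only for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(42)

n_x, n_h, n_y = 2, 3, 1     # assumed sizes: 2 inputs, 3 hidden units, 1 output

W1, b1 = rng.normal(size=(n_h, n_x)), np.zeros((n_h, 1))   # layer 1 parameters
W2, b2 = rng.normal(size=(n_y, n_h)), np.zeros((n_y, 1))   # layer 2 parameters

x = np.array([[0.5], [-1.2]])   # one made-up sample as a column vector

a0 = x                          # a^[0]: what comes out of the input layer
a1 = sigmoid(W1 @ a0 + b1)      # a^[1]: hidden layer activations, shape (3, 1)
a2 = sigmoid(W2 @ a1 + b2)      # a^[2]: output layer activation
y_hat = a2                      # the prediction equals a^[2]
print(a1.shape, y_hat.shape)
```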

# Activation Functions
In practice you won't write these yourself, since libraries provide them, but implementing them is a nice way to see how they work.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def sigmoid(x, derivative=False):
    f = 1 / (1 + np.exp(-x))
    if derivative:
        return f * (1 - f)
    return f

def tanh(x, derivative=False):
    f = np.tanh(x)
    if derivative:
        return 1 - f ** 2
    return f

def relu(x, derivative=False):
    x = np.asarray(x, dtype=float)
    if derivative:
        # Slope is 1 for positive inputs and 0 otherwise
        return np.where(x > 0, 1.0, 0.0)
    return np.maximum(0.0, x)

def leaky_relu(x, leakage=0.05, derivative=False):
    x = np.asarray(x, dtype=float)
    if derivative:
        # Slope is 1 for positive inputs and `leakage` otherwise
        return np.where(x > 0, 1.0, leakage)
    return np.where(x > 0, x, x * leakage)

def arctan(x, derivative=False):
    if derivative:
        return 1 / (1 + np.square(x))
    return np.arctan(x)

z = np.arange(-10, 10, 0.2)

## The hyperbolic tangent (tanh) function 

The hyperbolic tangent (or tanh) function ranges between -1 and +1, whereas the sigmoid ranges between 0 and 1. It is in fact a rescaled and shifted version of the sigmoid function, with formula $ a=\dfrac{\exp(z)- \exp(-z)}{\exp(z)+ \exp(-z)}$. For intermediate layers, the tanh function generally performs pretty well because, with values between -1 and +1, the means of the activations coming out are closer to zero!

A disadvantage of both tanh and sigmoid activation functions is that when $z$ gets very large or very small, the slopes of these functions become very small (close to 0). This will slow down gradient descent. You can see in the tanh plot that this already starts happening for values of $z > 2$ or $z < -2$. The next few activation functions will try to overcome this issue.
<img src="images/tanh.png">
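
The shrinking slopes can be checked numerically. This is a small sketch using only the standard library; the helper names and the sample values of $z$ are chosen for illustration.

```python
import math

def sigmoid_slope(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

def tanh_slope(z):
    return 1 - math.tanh(z) ** 2

for z in (0, 2, 5, 10):
    print(f"z={z:>2}  sigmoid'={sigmoid_slope(z):.6f}  tanh'={tanh_slope(z):.6f}")

# tanh is a rescaled, shifted sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
z = 0.7
print(math.isclose(math.tanh(z), 2 / (1 + math.exp(-2 * z)) - 1))
```

Both slopes are largest at $z = 0$ and fall toward zero as $|z|$ grows, which is what slows gradient descent.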


## arctan (inverse tangent)

The inverse tangent (arctan) function has a lot of the same qualities that tanh has, but its range goes from roughly -1.6 to 1.6 (exactly $-\pi/2$ to $\pi/2$), and it flattens out more gradually than the tanh function.

<img src="images/arctan.png">

## The Rectified Linear Unit function (ReLu)

This is probably the most popular activation function, along with tanh! One drawback is that both the activation and its derivative are exactly 0 when $z < 0$, so a node in that regime passes no gradient back during training.

$$a=\max(0, z)$$

<img src="images/relu.png">

## The leaky Rectified Linear Unit function 

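The leaky ReLU solves the derivative issue by allowing the activation to be slightly negative when $z <0$, so the slope there is small but not zero. The formula below uses a slope of 0.001; the `leaky_relu` function defined earlier defaults to 0.05.

$$a=\max(0.001 z ,z)$$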

<img src="images/leaky_relu.png">

## Additional Resources 

- [Visualising activation functions in neural networks](https://dashee87.github.io/data%20science/deep%20learning/visualising-activation-functions-in-neural-networks/)

# Tuning Neural Networks with Normalization

## Normalized Inputs: Speed Up Training

Normalizing inputs (scaling features to a consistent range, e.g., 0 to 1) speeds up training and promotes convergence. Standardization involves subtracting the mean and dividing by the standard deviation.
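
Both kinds of scaling can be sketched in a few lines of NumPy. The data below is randomly generated for illustration, and computing the statistics on the training set only is a common convention that is assumed here rather than taken from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: two features on very different scales
X_train = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(200, 2))
X_test = rng.normal(loc=[50.0, 0.5], scale=[10.0, 0.1], size=(50, 2))

# Standardization: compute the statistics on the training set, reuse them for the test set
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std

# Min-max scaling of the training set to the range [0, 1]
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_01 = (X_train - x_min) / (x_max - x_min)

print(X_train_std.mean(axis=0).round(3), X_train_std.std(axis=0).round(3))
```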

### Vanishing or Exploding Gradients

Normalization helps mitigate numerical problems in gradient computation, preventing vanishing or exploding gradients. For deep networks, gradients can grow or shrink excessively, making training unstable.

Example:
- For a deep network with linear activations, the weight products can explode due to many layers. Normalization helps manage this risk.

#### Solutions to Gradient Issues

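- **Initialization**: Keeping the initial weights small and scaled to the layer size can prevent gradient issues. Common practices, where $n$ is the number of inputs feeding into the layer:
  - **Variance Rule**: $ \text{Var}(w_i) = \frac{1}{n} $ or $ \text{Var}(w_i) = \frac{2}{n} $
  - **ReLU Initialization**: $ w^{[l]} = \text{np.random.randn(shape)} \times \sqrt{2/n_{l-1}} $, where $n_{l-1}$ is the number of units in layer $l-1$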
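
The ReLU initialization rule can be written as runnable NumPy. This is a sketch with illustrative layer sizes; `he_init` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

def he_init(n_prev, n_curr):
    """Weights for a ReLU layer with n_prev inputs and n_curr units: Var(w) = 2 / n_prev."""
    return rng.standard_normal((n_curr, n_prev)) * np.sqrt(2 / n_prev)

W = he_init(n_prev=500, n_curr=100)   # illustrative layer sizes
print(W.shape, W.var(), 2 / 500)
```

The empirical variance of the sampled weights should land close to the target $2/n$.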

## Optimization Strategies

### Gradient Descent with Momentum

- **Purpose**: Reduces oscillations and improves convergence.
- **How**:
  - Calculate moving averages for gradients.
  - Update weights with averaged gradients.
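
A minimal sketch of the two steps above for a single parameter. The toy objective $f(w) = w^2$, the learning rate, and $\beta = 0.9$ are assumptions for illustration, and `momentum_step` is a hypothetical helper.

```python
def momentum_step(w, v, grad, learning_rate=0.1, beta=0.9):
    """One update: v is the exponentially weighted moving average of the gradients."""
    v = beta * v + (1 - beta) * grad    # moving average of gradients
    w = w - learning_rate * v           # update with the averaged gradient
    return w, v

# Minimize f(w) = w**2 (gradient 2w) starting from w = 5 as a toy problem
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, grad=2 * w)
print(w)
```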

### RMSprop

- **Purpose**: Adapts learning rates for each parameter.
- **How**:
  - Use exponentially weighted average of squared gradients.
  - Adjust learning rate based on average squared gradients.

### Adam Optimization

- **Purpose**: Combines momentum and RMSprop benefits.
- **How**:
  - Maintain moving averages for gradients and their squares.
  - Apply bias corrections.
  - Update weights using corrected averages.
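
The three steps above can be sketched for a single parameter. The default values for $\beta_1$, $\beta_2$, and $\epsilon$ are the commonly used ones, while the learning rate and the toy objective $f(w) = w^2$ are assumptions for illustration; `adam_step` is a hypothetical helper, not a library function.

```python
import math

def adam_step(w, m, v, grad, t, learning_rate=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad           # moving average of gradients (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # moving average of squared gradients (RMSprop)
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - learning_rate * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = w**2 (gradient 2w) starting from w = 5 as a toy problem
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, grad=2 * w, t=t)
print(w)
```

Thanks to the bias corrections, the very first step has a size close to the learning rate regardless of the gradient's magnitude.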

### Learning Rate Decay

- **Purpose**: Gradually reduce the learning rate over epochs.
- **Methods**:
  - **Inverse Time Decay**: $ \alpha = \frac{1}{1 + \text{decay rate} \times \text{epoch}} \times \alpha_0 $
  - **Exponential Decay**: $ \alpha = 0.97^{\text{epoch}} \times \alpha_0 $
  - **Manual Decay**
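
The two decay formulas translate directly into code. The initial learning rate and decay rate below are illustrative values, and the function names are hypothetical.

```python
def inverse_time_decay(alpha_0, decay_rate, epoch):
    """alpha = alpha_0 / (1 + decay_rate * epoch)"""
    return alpha_0 / (1 + decay_rate * epoch)

def exponential_decay(alpha_0, epoch, base=0.97):
    """alpha = base**epoch * alpha_0"""
    return alpha_0 * base ** epoch

alpha_0 = 0.1   # illustrative initial learning rate
for epoch in (0, 1, 10, 50):
    print(epoch,
          round(inverse_time_decay(alpha_0, decay_rate=1.0, epoch=epoch), 5),
          round(exponential_decay(alpha_0, epoch), 5))
```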

## Hyperparameter Tuning

- **Important**: Learning rate ($ \alpha $).
- **Next**: Momentum ($ \beta $), number of hidden units, mini-batch size.
- **Less Critical**: Number of layers, learning rate decay.
- **Rarely Tuned**: $ \beta_1 $, $ \beta_2 $, $ \epsilon $ (for Adam).

Avoid an exhaustive grid search; instead, test a handful of values and iteratively narrow the range around the ones that work best.

## Additional Resources
- [Coursera: Normalizing Inputs](https://www.coursera.org/learn/deep-neural-network/lecture/lXv6U/normalizing-inputs)
- [Coursera: Gradient Descent with Momentum](https://www.coursera.org/learn/deep-neural-network/lecture/y0m1f/gradient-descent-with-momentum)

# Tuning Neural Networks with Regularization

## Key Hyperparameters

When tuning neural networks, focus on:

- **Number of Hidden Units**: Controls model capacity.
- **Number of Layers**: Affects model depth.
- **Learning Rate ($\alpha$)**: Determines step size in optimization.
- **Activation Function**: Transforms node inputs.

Use a **validation set** to balance accuracy and generalization.

## Data Splits

Divide your data into:

- **Training Set**: For training the model.
- **Validation Set**: To tune hyperparameters and select the final model.
- **Test Set**: To evaluate performance on unseen data.

Ensure all sets come from the same distribution (e.g., same image resolution).

## Bias and Variance

Balance **bias** (error from too-simple models) and **variance** (error from too-complex models).

### The Circles Example

Examine the bias-variance trade-off with concentric circles:

- **High Bias (Underfitting)**: Model is too simple, missing key patterns.
  ![Underfitting](https://raw.githubusercontent.com/learn-co-curriculum/dsc-tuning-neural-networks-with-regularization/master/images/underfitting.png)
- **Good Fit**: Model accurately captures underlying patterns.
  ![Good Fit](https://raw.githubusercontent.com/learn-co-curriculum/dsc-tuning-neural-networks-with-regularization/master/images/good.png)
- **High Variance (Overfitting)**: Model is too complex, capturing noise.
  ![Overfitting](https://raw.githubusercontent.com/learn-co-curriculum/dsc-tuning-neural-networks-with-regularization/master/images/overfitting.png)

## The Santa Example

Performance across different models:

|       | High Variance | High Bias | High Variance & Bias | Low Variance and Bias |
|-------|---------------|-----------|----------------------|-----------------------|
| Train Set Error | 12% | 26% | 26% | 12% |
| Validation Set Error | 25% | 28% | 40% | 13% |

A model with low variance and low bias performs best (13% validation error, i.e. 87% validation accuracy).

## Bias / Variance Tips

| High Bias? (Training Error) | High Variance? (Validation Error) |
|------------------------------|-----------------------------------|
| Increase network size | Gather more data |
| Train longer | Apply regularization |
| Try different architectures | Try different architectures |

## Regularization

Prevents overfitting by penalizing large weights.

### L1 and L2 Regularization

#### In Logistic Regression

For L2 regularization:

$$ J(w, b) = \dfrac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m} ||w||_2^2 $$

- **L2 Regularization**: Penalizes large weights to simplify the model.
- **L1 Regularization**: Adds a term $\dfrac{\lambda}{m} ||w||_1$ for sparsity.

#### In Neural Networks

For L2 regularization across all layers:

$$ J(w^{[1]}, b^{[1]}, \ldots, w^{[L]}, b^{[L]}) = \dfrac{1}{m} \sum_{i=1}^m \mathcal{L}(\hat{y}^{(i)}, y^{(i)}) + \dfrac{\lambda}{2m} \sum_{l=1}^L ||w^{[l]}||_2^2 $$

**Update Rule**:

$$ w^{[l]} := w^{[l]} - \alpha \left( \text{[backpropagation]} + \dfrac{\lambda}{m} w^{[l]} \right) $$
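
The penalty term and the update rule can be sketched in NumPy. The weights, $\lambda$, $m$, and $\alpha$ below are made up, the backpropagated gradient `dW` is set to zero to isolate the effect of the penalty, and the function names are hypothetical.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """(lambda / 2m) * sum of squared entries over all weight matrices."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def l2_update(W, dW, lam, m, alpha):
    """Gradient step where dW is the backpropagated gradient of the unregularized cost."""
    return W - alpha * (dW + lam / m * W)

W = np.array([[1.0, -2.0], [0.5, 3.0]])   # made-up weights
dW = np.zeros_like(W)                      # pretend the data gradient is zero
print(l2_penalty([W], lam=0.1, m=10))
print(l2_update(W, dW, lam=0.1, m=10, alpha=0.5))
```

With a zero data gradient, every weight shrinks by the same factor $1 - \alpha \lambda / m$, which is why L2 regularization is also known as weight decay.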

### Dropout Regularization

Dropout randomly ignores nodes during training, reducing overfitting.
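
What "randomly ignores nodes" means can be sketched with a random mask. This is a minimal NumPy illustration of inverted dropout (the kept activations are rescaled so their expected value is unchanged), not the Keras implementation; `dropout_forward` and the all-ones activations are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(a, rate, training=True):
    """Zero each activation with probability `rate` and rescale the rest."""
    if not training or rate == 0:
        return a                          # all nodes are kept at prediction time
    keep_prob = 1 - rate
    mask = rng.random(a.shape) < keep_prob
    return a * mask / keep_prob

a = np.ones((1000, 5))                    # made-up activations
out = dropout_forward(a, rate=0.3)
print((out == 0).mean(), out.mean())
```

Roughly 30% of the activations are zeroed, while the mean activation stays close to its original value.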

**Before Dropout**:

![Standard Neural Net](https://raw.githubusercontent.com/learn-co-curriculum/dsc-tuning-neural-networks-with-regularization/master/images/dropout.png)

**After Dropout**:

![Neural Net with Dropout](https://raw.githubusercontent.com/learn-co-curriculum/dsc-tuning-neural-networks-with-regularization/master/images/dropout.png)

In Keras, use the `Dropout` layer:

```python
from tensorflow.keras import layers, models

model = models.Sequential()
model.add(layers.Dense(5, activation='relu', input_shape=(500,)))
model.add(layers.Dropout(0.3))  # Dropout applied here
model.add(layers.Dense(5, activation='relu'))
model.add(layers.Dropout(0.3))  # Dropout applied here
model.add(layers.Dense(1, activation='sigmoid'))
```


# Machine Learning Interpretability - Introduction

## What is Interpretability?

Interpretability in machine learning refers to how well we can understand the decisions made by our models. There is no single definition or consensus, but it's crucial to align interpretability with the goals of your machine learning project.


## Interpretability and Machine Learning Goals

### Machine Learning Goals

- **Support Human Decisions**: Provides insights to aid human decision-making. For example, Clinical Decision Support (CDS) systems like [Watson Health's Micromedex](https://www.ibm.com/watson-health/solutions/clinical-decision-support) help clinicians by offering advice based on patient data and medical knowledge.

- **Automate Human Decisions**: Trains models to make decisions independently, such as Natural Language Generation applications used in predictive text and chatbots like ChatGPT.

### The Cost and Benefits of Decision Support

#### Clinical Decision Support (CDS)

CDS systems assist clinicians by providing actionable insights from medical data, which can speed up patient care but also introduce potential for false positives. 

![Clinical Decision Support](https://raw.githubusercontent.com/learn-co-curriculum/dsc-tuning-neural-networks-with-regularization/master/images/clinical-decision-support.png)

> **Goal**: Provide reliable recommendations to clinicians. Effectiveness depends on the relevance and accuracy of the information provided.

#### Computer Aided Detection (CAD)

CAD systems enhance diagnostic imaging, such as detecting breast cancer. They offer a "second opinion" but may introduce false positives, impacting patient care and resource allocation.

> **Goal**: Improve early detection of conditions like cancer. Balancing sensitivity with false positives is crucial.

### How Interpretability Enhances Decision Support

Interpretability can improve decision support in several ways:

1. **Trust**: Metrics like accuracy and ROC/AUC show how well the model performs and the nature of its errors.

2. **Causality**: Helps in understanding relationships between variables, guiding hypotheses and interventions.

3. **Transferability**: Assesses whether a model’s performance is consistent in new scenarios. Understanding decision drivers aids in model tuning and debugging.

4. **Informativeness**: Provides insights into feature importance, enhancing domain knowledge.

5. **Fair and Ethical Decision Making**: Ensures algorithms do not perpetuate societal biases, fostering accountability.

## White Box vs. Black Box Models

Historically, simpler models like regression and decision trees were more interpretable due to their transparency. However, modern complex models, especially neural networks, are often seen as "black boxes" because their decision-making processes are less transparent.

- **White Box Models**: Transparent and interpretable, such as regression and decision trees.
- **Black Box Models**: Complex and less interpretable, such as neural networks. 

Despite their complexity, black box models can be analyzed using various interpretability techniques.

# Machine Learning Interpretability - White-Box Models

## Definitions
- **White-box Models**: Models that are interpretable with minimal investigation, e.g., regression, decision trees.
- **Black-box Models**: Models that are not easily interpretable, e.g., neural networks.
- **Intrinsic Interpretability**: Understanding how a model arrived at a prediction directly from the model's structure.
- **Post-Hoc Interpretation**: Analyzing a model's predictions after training using additional methods.

## Model Selection and Common White-Box Models

### Linear Regression
Predicts a value based on well-understood variables, e.g., home price based on square footage.

### Logistic Regression
Classifies data into categories, e.g., determining if a home is a McMansion based on description.

### Naive Bayes
Analyzes unstructured data for classification, e.g., identifying common features of McMansions from descriptions.

### Decision Trees
Classifies based on important features, e.g., deciding whether to buy a McMansion based on listed characteristics.

## Types of Interpretation

### Intrinsic
- **Definition**: Directly interpretable due to model simplicity.
- **Examples**: Linear regression, logistic regression, simple decision trees.

### Post-Hoc
- **Definition**: Interpretation methods applied after model training.
- **Examples**: Permutation feature importance, visualizations, reading model internals.

## Methods of Interpretation

### Model-Specific
- **Definition**: Methods unique to specific models, e.g., regression weights for regression models.

### Model-Agnostic
- **Definition**: Methods applicable to any model based on input/output analysis.

### Scope of Interpretation
- **Local**: Explains individual predictions.
- **Global**: Explains overall model behavior.
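
Permutation feature importance, mentioned above as a post-hoc method, is also model-agnostic and global: it only needs a model's predictions. The sketch below is a simplified NumPy version with made-up data and a stand-in `model` function; it is an illustration of the idea, not a library implementation.

```python
import numpy as np

def permutation_importance(predict, X, y, n_repeats=5, seed=0):
    """Increase in mean squared error when each feature column is shuffled."""
    rng = np.random.default_rng(seed)
    baseline = np.mean((y - predict(X)) ** 2)
    importances = []
    for j in range(X.shape[1]):
        scores = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling breaks the link between feature j and the target
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            scores.append(np.mean((y - predict(X_perm)) ** 2))
        importances.append(np.mean(scores) - baseline)
    return np.array(importances)

# Made-up data: y depends strongly on feature 0, weakly on feature 1, not on feature 2
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1]

def model(X):
    # Stand-in for any fitted model's predict function
    return 3 * X[:, 0] + 0.5 * X[:, 1]

importances = permutation_importance(model, X, y)
print(importances.round(2))
```

The larger the increase in error after shuffling, the more the model relies on that feature.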


# Machine Learning Interpretability - Black Box Models

## Common Black-Box Models

### Gradient Boosted Trees (GBDT)
- **Definition**: An ensemble method combining multiple decision trees for improved accuracy.
- **How It Works**: Boosting enhances weak models incrementally, using gradient descent to minimize errors.
- **Applications**:
  - **Fraud Detection**: Identifies fraudulent transactions.
  - **Medical Outcomes**: Predicts disease likelihood or treatment effectiveness.
  - **Recommender Systems**: Suggests products or content based on user preferences.
  - **Computer Vision**: Recognizes objects in images and videos.
  - **Customer Churn**: Predicts which customers may leave.
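
The boosting idea described under "How It Works" can be sketched with one-split trees (stumps) fitted to residuals. This is a simplified illustration for squared-error regression on made-up one-dimensional data; `fit_stump` and `gradient_boost` are hypothetical helpers, not a production GBDT library.

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-split regressor on a 1-D feature (a weak learner)."""
    best = None
    for threshold in np.unique(x)[:-1]:
        left, right = residual[x <= threshold], residual[x > threshold]
        error = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return lambda x_new: np.where(x_new <= threshold, left_value, right_value)

def gradient_boost(x, y, n_trees=50, learning_rate=0.1):
    base = y.mean()
    prediction = np.full_like(y, base)
    stumps = []
    for _ in range(n_trees):
        # For squared error, the negative gradient is proportional to the residual
        residual = y - prediction
        stump = fit_stump(x, residual)
        prediction = prediction + learning_rate * stump(x)
        stumps.append(stump)
    return lambda x_new: base + learning_rate * sum(s(x_new) for s in stumps)

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)   # made-up data

model = gradient_boost(x, y)
mse_baseline = np.mean((y - y.mean()) ** 2)
mse_boosted = np.mean((y - model(x)) ** 2)
print(mse_baseline, mse_boosted)
```

Each stump is a weak model on its own; adding many of them, each correcting the errors left by the previous ones, yields a much better fit than predicting the mean.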

### Neural Networks
- **Definition**: Mimics biological neurons to process data through layers (input, hidden, output).
- **How They Work**: Nodes activate based on weights and thresholds, passing data through layers.
- **Applications**:
  - **Medical Imaging**: Analyzes X-rays, CT scans, and MRIs.
  - **Drug Research**: Predicts drug effectiveness and side effects.
  - **Patient Outcomes**: Predicts disease progression and survival rates.


## Use Cases for Neural Networks

Neural networks have significant applications in the medical field, including:

- **Medical Imaging**: Analyzing X-rays, CT scans, and MRIs to identify features or abnormalities, aiding in the diagnosis of diseases such as cancer, heart disease, and neurological disorders.

- **Drug Research and Development**: Analyzing data from chemical compounds to predict the effectiveness and side effects of new drugs, helping pharmaceutical companies identify promising drug candidates faster.

- **Patient Outcomes**: Predicting risks such as readmission, disease progression, and survival rates using a range of patient data (genetic, demographic, clinical), assisting doctors in making informed decisions about patient care.

Neural networks improve medical diagnosis and treatment accuracy by efficiently analyzing and learning from large data sets.


In [11]:
from PIL import Image, ImageGrab
from IPython.display import Image as IPImage, display

def save_clipboard_image(file_path):
    """Save the image currently on the clipboard to file_path; return the path or None."""
    try:
        # Grab the image from the clipboard
        image = ImageGrab.grabclipboard()
        if isinstance(image, Image.Image):
            # Save the image to the specified file path
            image.save(file_path)
            print(f"Image saved to {file_path}")
            return file_path
        else:
            print("No image found in clipboard.")
            return None
    except Exception as e:
        print(f"Error: {e}")
        return None

# Save image from clipboard
file_path = save_clipboard_image("clipboard_image.png")

# Display the saved image if it exists
# (display() is needed because the expression is inside an if block)
if file_path:
    display(IPImage(filename=file_path))


Image saved to clipboard_image.png
