# 11 - Ensembles

> **Use `scikit-learn`’s `RandomForestClassifier` and explain its main hyperparameters.**
> 

```python
pipe_rf = make_pipeline(
    preprocessor, RandomForestClassifier(random_state=123, n_jobs=-1)
)
```

- `n_estimators`: number of decision trees (higher = **increases complexity**)
- `max_depth`: max depth of each decision tree (higher = **increases complexity**)
- `max_features`: the number of features considered at each split (higher = **increases complexity**)
- setting `random_state` is important for reproducibility
- **Explain randomness in random forest algorithm.**
    - **randomness in classifier construction**
        1. **Data:** each tree is built on a bootstrap sample (drawn with replacement)
        2. **Feature:** each node selects a **random subset of features** and picks the best possible test involving them
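
The two sources of randomness above can be sketched in plain Python. This is only an illustration of the idea, not how `scikit-learn` implements it; the helper names `bootstrap_sample` and `random_feature_subset` are made up for this sketch.

```python
import random


def bootstrap_sample(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def random_feature_subset(n_features, max_features, rng):
    """Pick max_features distinct feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), max_features))


rng = random.Random(123)
rows = bootstrap_sample(10, rng)
features = random_feature_subset(n_features=6, max_features=2, rng=rng)
print(rows)      # row indices for one tree; typically some repeat and some are left out
print(features)  # only these features compete for the split at this node
```

Each tree would get its own bootstrap sample, and a fresh feature subset would be drawn at every node.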

> **Use other tree-based models such as `XGBoost` and `LGBM`.**
> 

```python
pipe_lgbm = make_pipeline(preprocessor, LGBMClassifier(random_state=123))
pipe_xgb = make_pipeline(
    preprocessor, XGBClassifier(random_state=123, eval_metric="logloss", verbosity=0)
)
```

> **Employ ensemble classifier approaches, in particular model averaging and stacking.**
> 

> **Use `scikit-learn` implementations of these ensemble methods.**
> 

```python
from sklearn.ensemble import VotingClassifier

classifiers = {
    "logistic regression": pipe_lr,
    "decision tree": pipe_dt,
    "random forest": pipe_rf,
    #"XGBoost": pipe_xgb,
    "LightGBM": pipe_lgbm,
    "CatBoost": pipe_catboost,
}
averaging_model = VotingClassifier(
    list(classifiers.items()), voting="soft"
)  # need the list() here for cross_val to work!

averaging_model.fit(X_train, y_train);
```

- `voting='hard'`
    - uses the output of `predict` from each base classifier and takes a majority vote
- `voting='soft'`
    - averages the output of `predict_proba` from the base classifiers and predicts the class with the larger average probability
    - assumes we trust `predict_proba`
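
A small plain-Python sketch of how hard and soft voting can disagree. The probabilities are invented for illustration, and `hard_vote` / `soft_vote` are hypothetical helpers, not `scikit-learn` functions.

```python
from collections import Counter

# Hypothetical predict_proba outputs of three base classifiers for ONE example,
# given as [P(class 0), P(class 1)].
probas = [
    [0.45, 0.55],
    [0.45, 0.55],
    [0.90, 0.10],
]


def hard_vote(probas):
    """Each classifier votes for its most probable class; the majority wins."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probas]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probas):
    """Average the probabilities per class, then pick the largest average."""
    n_classes = len(probas[0])
    avg = [sum(p[c] for p in probas) / len(probas) for c in range(n_classes)]
    return max(range(n_classes), key=lambda c: avg[c])


print(hard_vote(probas))  # 1: two of the three classifiers vote for class 1
print(soft_vote(probas))  # 0: the average P(class 0) is 0.60
```

Soft voting lets one very confident classifier outweigh two lukewarm ones, which is why it only makes sense when the probabilities are trustworthy.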

```python
from sklearn.ensemble import StackingClassifier

# remove cat boost for time
classifiers_nocat = classifiers.copy()
del classifiers_nocat["CatBoost"]

stacking_model = StackingClassifier(list(classifiers_nocat.items()))
stacking_model.fit(X_train, y_train)

pd.DataFrame(
    data=stacking_model.final_estimator_.coef_[0],
    index=classifiers_nocat.keys(),
    columns=["Coefficient"],
)

stacking_model.final_estimator_.intercept_
```

- default `final_estimator` is `LogisticRegression` for classification
- does cross-validation by default
    - fit base estimators on training fold → predicts on validation fold → fit meta-estimator on output (validation fold)
    - `estimators_` fitted on full `X`
    - `final_estimator_` trained using cross-validation predictions of base estimators using `cross_val_predict`

> **Explain voting and stacking and the differences between them.**
> 

| Voting | Stacking |
| --- | --- |
| Multiple models either soft vote (average) or hard vote to produce final classification | outputs of one model used as input to another |
|  | outputs coefficients + intercept for each base classifier |
| Takes long time to fit/predict | Takes very long time to fit/predict |
| reduces interpretability |  |
| reduces maintainability | reduces maintainability |
|  | better accuracy generally than voting |

- Voting
    - Cons
        - `fit` / `predict` time
        - reduces interpretability
        - reduces maintainability
- **Stacking:** one model’s output is the input to another model

# 12 - Feature Importances

> **Interpret the coefficients of linear regression for ordinal, one-hot encoded categorical, and scaled numeric features.**
> 

```python
lr = make_pipeline(preprocessor, Ridge())
lr.fit(X_train, y_train)
lr.named_steps['ridge'].coef_
```

- **Ordinal features:**
    - easier to interpret
    - increasing by one “ordered” category increases prediction by coefficient value
- **Categorical:** use a **reference category**
    - subtract the coefficient of the reference category from all OHE coefficients of that feature
    - the resulting coefficients explain the difference between the reference category and the others
- **Scaled Numeric Features:** increasing the feature by 1 scaled unit changes the prediction by the coefficient value
    - be careful of scale when interpreting coefficients
- coefficients tell us about the **model, and may not accurately reflect the data**
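
As a small worked example of the reference-category idea, the sketch below re-expresses one-hot coefficients relative to a chosen category. The coefficient values and the feature name `colour` are invented for illustration.

```python
# Hypothetical one-hot coefficients for a three-level categorical feature.
ohe_coefs = {"colour_red": 3.0, "colour_green": 1.0, "colour_blue": -4.0}


def relative_to_reference(coefs, reference):
    """Subtract the reference category's coefficient from every coefficient."""
    return {name: value - coefs[reference] for name, value in coefs.items()}


relative = relative_to_reference(ohe_coefs, reference="colour_red")
print(relative)
# {'colour_red': 0.0, 'colour_green': -2.0, 'colour_blue': -7.0}
```

Read this as: with everything else held fixed, switching from red to blue changes the prediction by -7.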

> **Explain why interpretability is important in ML.**
> 
- diagnosing errors in ML systems
- not mindlessly trusting model with high accuracy
- reasoning about predictions

> **Use `feature_importances_` attribute of `sklearn` models and interpret its output.**
> 

```python
pipe_dt = make_pipeline(preprocessor, DecisionTreeClassifier(max_depth=3))
pipe_dt.fit(X_train, y_train)
pipe_dt.named_steps["decisiontreeclassifier"].feature_importances_
```

- `feature_importances_` don’t have a sign
    - **increasing** a feature may cause the prediction to go **up, then down**

> **Use `eli5` to get feature importances of non `sklearn` models and interpret its output.**
> 

```python
# Install first from a shell: conda install -c conda-forge eli5
import eli5

# LightGBM
pipe_lgbm = make_pipeline(preprocessor, LGBMClassifier(random_state=123))
pipe_lgbm.fit(X_train, y_train)
eli5.explain_weights(
    pipe_lgbm.named_steps["lgbmclassifier"], 
    feature_names=feature_names
)
```

![Untitled](./img-notes/12-eli5.jpg)

- tell us globally what features are important

> **Apply SHAP to assess feature importances and interpret model predictions.**
> 
- **Shapley values:** for each example & feature → explain prediction by computing contribution of each feature to prediction
- For tree-based models
    
    ```python
    import shap
    
    pipe_lgbm = make_pipeline(preprocessor, LGBMClassifier(random_state=123))
    pipe_lgbm.fit(X_train, y_train)
    
    lgbm_explainer = shap.TreeExplainer(pipe_lgbm.named_steps["lgbmclassifier"])
    # for each example & each feature
    train_lgbm_shap_values = lgbm_explainer.shap_values(X_train_enc)
    ```
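
To make "contribution of each feature" concrete, here is a brute-force sketch that computes exact Shapley values for a tiny hand-written model by averaging marginal effects over all feature orderings. It is not how `shap.TreeExplainer` works internally, and it makes a simplifying assumption: a single `baseline` vector stands in for the dataset average that SHAP uses as the base value.

```python
from itertools import permutations


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x`, relative to `baseline`.

    An "absent" feature takes its baseline value. Each feature's contribution
    is its average marginal effect over all feature orderings.
    """
    n = len(x)
    contrib = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for j in order:
            current[j] = x[j]  # switch feature j from baseline to its real value
            value = predict(current)
            contrib[j] += value - previous
            previous = value
    return [c / len(orderings) for c in contrib]


# Toy model with an interaction between features 0 and 1.
def predict(v):
    return 2 * v[0] + v[0] * v[1] - v[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(predict, x, baseline)
print(phi)  # [3.0, 1.0, -3.0]
# The base value plus the contributions recovers the prediction for x.
print(predict(baseline) + sum(phi), predict(x))  # 1.0 1.0
```

The interaction term `v[0] * v[1]` is worth 2 here and is split evenly between features 0 and 1, which is the kind of fair sharing Shapley values are designed for.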
    

> **Explain force plot, summary plot, and dependence plot produced with Shapley values.**
> 
- **Force plot**
    
    ```python
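    # SHAP values for the test set, computed the same way as for the training set
    test_lgbm_shap_values = lgbm_explainer.shap_values(X_test_enc)

    ex_l50k_index = ...  # placeholder: set to the index of one example classified as <50k salary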
    shap.force_plot(
        lgbm_explainer.expected_value[1],
        test_lgbm_shap_values[1][ex_l50k_index, :],
        X_test_enc.iloc[ex_l50k_index, :],
        matplotlib=True,
    )
    ```
    
    ![Untitled](./img-notes/12-force.jpg)
    
    - starts from the **base value** → average prediction over the dataset
        - red → features pushing prediction towards higher score
        - blue → features pushing prediction towards lower score
    - base value plus the SHAP values of all features sums to the model output for this example
- **Summary plot**
    
    ```python
    shap.summary_plot(train_lgbm_shap_values[1], X_train_enc, plot_type="bar")
    
    shap.summary_plot(train_lgbm_shap_values[1], X_train_enc)
    ```
    

![Untitled](./img-notes/12-summary1.jpg)

![Untitled](./img-notes/12-summary2.jpg)

- `married-civ-spouse` = bigger SHAP values for class 1
- higher education = bigger SHAP
- **Dependence plot**
    
    ```python
    shap.dependence_plot("age", train_lgbm_shap_values[1], X_train_enc)
    ```
    
    ![Untitled](./img-notes/12-dependence.jpg)
    
    - X-axis = scaled `age` values
    - Y-axis = SHAP values
    - smaller age = smaller SHAP values
    - optimal age value with highest SHAP around (scaled) `

# 13 - K-Means Clustering

> **Explain the unsupervised paradigm.**
> 
- train model to find **patterns** in a dataset that is typically **unlabeled**

> **Explain the motivation and potential applications of clustering.**
> 
- Partition data into groups called clusters to **discover underlying groups**
    - labels arbitrarily identify clusters
    - meaning depends on application and prior knowledge about data
- Applications
    1. **Data exploration**
    2. **Customer segmentation**
    3. **Document clustering**

> **Define the clustering problem**
> 

> **Broadly explain the K-Means algorithm.**
> 
- Input: `X` data points, `K` clusters
- initialize `K` cluster centers
- Iterate until the assignments stop changing
    1. assign each example to the closest center
    2. estimate each new center as the **average of observations** assigned to it
- always converges, but may converge to a sub-optimal solution depending on initialization
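
The steps above can be sketched in plain Python. This is a minimal illustration, not `sklearn`'s `KMeans`: it initialises the centers with randomly chosen data points and uses squared Euclidean distance, both of which are assumptions of this sketch. It also returns the inertia (sum of squared distances to the assigned center).

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(X, k, n_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(X, k)  # initialise with k distinct data points
    labels = None
    for _ in range(n_iter):
        # Step 1: assign every example to its closest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(x, centers[c])) for x in X
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Step 2: move each center to the mean of its assigned examples.
        for c in range(k):
            members = [x for x, lab in zip(X, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(x, centers[lab]) for x, lab in zip(X, labels))
    return labels, centers, inertia


X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers, inertia = kmeans(X, k=2)
print(labels, centers, inertia)
```

On this toy data the two tight groups are far apart, so the first three points end up in one cluster and the last three in the other; which cluster gets id 0 depends on the initialisation.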

> **Apply `sklearn`’s `KMeans` algorithm.**
> 

```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)
kmeans.fit(X)
# We are only passing X because this is unsupervised learning

kmeans.predict(X)

kmeans.cluster_centers_
```

> **Point out pros and cons of K-Means and the difficulties associated with choosing the right number of clusters.**
> 

Pros

- allows clustering
- simple to understand, easy to implement
- fast + scales to large data
- choose clusters using elbow and silhouette method

Cons

- Stochastic initialization → can start poorly
- may converge suboptimally
- must specify clusters in advance
- each example must be assigned to one cluster

> **Create the Elbow plot and Silhouette plots for a given dataset.**
> 
- **Elbow Method:** looks at **inertia**, the sum of squared **intra-cluster distances** between points and their cluster center
    
    ```python
    from yellowbrick.cluster import KElbowVisualizer
    
    model = KMeans()
    visualizer = KElbowVisualizer(model, k=(1, 10))
    
    visualizer.fit(X)  # Fit the data to the visualizer
    visualizer.show()  # Finalize and render the figure
    ```
    
    ![Untitled](./img-notes/13-elbow.jpg)
    
    - elbow at `k=3` → more clusters doesn’t bring improvement in decreasing inertia
- **Silhouette plot:** calculated using
    - **mean intra-cluster distance $a$:** mean distance from a point to the other points in the same cluster
    - **mean nearest-cluster distance $b$:** mean distance from a point to the points in the nearest other cluster
    - **silhouette distance:** the difference $b-a$ normalized by the larger of the two, $\frac{b-a}{\max(a, b)}$
        - ranges over $[-1, 1]$
        - $0$ means overlapping clusters
    - **Silhouette score:** average of the silhouette distance over all samples
    
    ```python
    from yellowbrick.cluster import SilhouetteVisualizer
    ```
    

```python
model = KMeans(2, random_state=42)
visualizer = SilhouetteVisualizer(model, colors="yellowbrick")
visualizer.fit(X)  # Fit the data to the visualizer
visualizer.show();  # Finalize and render the figure
```

```python
model = KMeans(3, random_state=42)
visualizer = SilhouetteVisualizer(model, colors="yellowbrick")
visualizer.fit(X)  # Fit the data to the visualizer
visualizer.show();  # Finalize and render the figure
```

```python
model = KMeans(5, random_state=42)
visualizer = SilhouetteVisualizer(model, colors="yellowbrick")
visualizer.fit(X)  # Fit the data to the visualizer
visualizer.show();  # Finalize and render the figure
```

![Untitled](./img-notes/13-silhouette2.jpg)

![Untitled](./img-notes/13-silhouette3.jpg)

![Untitled](./img-notes/13-silhouette5.jpg)

- Silhouette score for each sample in cluster
    - **higher value →** well-separated
    - size → # samples
- **more rectangular →** points are happy in cluster
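
The quantities $a$, $b$ and the silhouette value can be computed by hand on a tiny example. The sketch below is plain Python using Euclidean distance and assumes at least two clusters; the function name `silhouette_values` is made up here and is not the `yellowbrick` or `sklearn` API.

```python
import math


def silhouette_values(X, labels):
    """Silhouette value (b - a) / max(a, b) for every sample."""
    values = []
    for i, x in enumerate(X):
        # Distances from x to every other point, grouped by cluster label.
        dists = {}
        for j, (y, lab) in enumerate(zip(X, labels)):
            if j != i:
                dists.setdefault(lab, []).append(math.dist(x, y))
        own = dists.get(labels[i])
        if not own:  # x is alone in its cluster: score 0 by convention
            values.append(0.0)
            continue
        a = sum(own) / len(own)  # mean intra-cluster distance
        b = min(  # mean distance to the nearest other cluster
            sum(d) / len(d) for lab, d in dists.items() if lab != labels[i]
        )
        values.append((b - a) / max(a, b))
    return values


X = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
values = silhouette_values(X, labels)
print(values)                     # all close to 0.9: well-separated clusters
print(sum(values) / len(values))  # silhouette score = average over samples
```

For the first point, $a = 1$ and $b = 10.5$, so its silhouette value is $9.5 / 10.5 \approx 0.90$.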

> **Visualize clusters in low dimensional space.**
> 

```python
import umap

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(data_df)

reducer = umap.UMAP(n_neighbors=15)
Z = reducer.fit_transform(data_df)
umap_df = pd.DataFrame(data=Z, columns=["dim1", "dim2"])
umap_df["cluster"] = kmeans.labels_

labels = np.unique(umap_df["cluster"])

fig, ax = plt.subplots(figsize=(10, 7))
ax.set_title("K-means with k = 3")

scatter = ax.scatter(
    umap_df["dim1"],
    umap_df["dim2"],
    c=umap_df["cluster"],
    cmap="tab20b",
    s=50,
    edgecolors="k",
    linewidths=0.1,
)

legend = ax.legend(*scatter.legend_elements(), loc="best", title="Clusters")
ax.add_artist(legend)

plt.show()
```

![k=3](./img-notes/13-lowdim.jpg)

k=3

> **Use clustering for customer segmentation problem.**
> 
- **Customer segmentation:** understand the landscape of the market in order to tailor products to each group
    - uses demographic, geographic, psychographic, behavioral data
1. Preprocess, EDA, and build pipeline
2. analyze which `k` to use using elbow or silhouette plots
3. build model
    
    ```python
    kmeans = KMeans(n_clusters=4, random_state=42)
    kmeans.fit(transformed_df)
    labels = kmeans.labels_
    
    cluster_centers = pd.DataFrame(
        data=kmeans.cluster_centers_, columns=transformed_df.columns
    )
    cluster_centers
    ```
    
4. **inverse transform to unscale cluster centers**
    
    ```python
    data = (
        preprocessor.named_transformers_["pipeline"]
        .named_steps["standardscaler"]
        .inverse_transform(cluster_centers[numeric_features])
    )
    ```
    

> **Interpret the clusters discovered by K-Means.**
>

# 14 - DBSCAN & Hierarchical Clustering

> **Identify limitations of K-Means.**
> 
- Stochastic initialization → can start poorly
- may converge suboptimally
- must specify clusters in advance
- each example must be assigned to one cluster
- **fails to identify complexity** in shape of data
    - **boundaries are linear**

> **Explain the difference between core points, border points, and noise points in the context of DBSCAN.**
> 
- Iterative algorithm using 3 kinds of points to identify dense regions
    1. **Core point:** point with at least `min_samples` points within `eps` distance
    2. **Border point:** within `eps` distance of a core point, but has fewer than `min_samples` points within `eps`
    3. **Noise point:** point that doesn’t belong to any cluster

> **Broadly explain how DBSCAN works.**
> 
- **DBSCAN:** Density-Based Spatial Clustering of Applications with Noise
    - **identify crowded regions:** clusters form dense regions in data
- algorithm
    1. pick random point $p$
    2. check whether $p$ is a **core point** (at least `min_samples` neighbors within `eps`)
    3. if core point, give label
    4. for neighbors of $p$, check if **core point** → continue spreading label if core point
    5. once no more core points to spread label, pick new unlabelled $p$ and repeat
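
A minimal plain-Python sketch of the label-spreading procedure above. It is an illustration only, not `sklearn`'s implementation: points are visited in index order rather than at random, distances are Euclidean, and a point is counted as its own neighbor.

```python
import math

NOISE = -1


def dbscan(X, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j in range(len(X)) if math.dist(X[i], X[j]) <= eps]

    labels = [None] * len(X)
    cluster = -1
    for p in range(len(X)):
        if labels[p] is not None:
            continue
        if len(neighbors(p)) < min_samples:
            labels[p] = NOISE  # may still become a border point later
            continue
        cluster += 1  # p is a core point: start a new cluster
        labels[p] = cluster
        queue = neighbors(p)
        while queue:
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster  # border point reached from a core point
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_neighbors = neighbors(q)
            if len(q_neighbors) >= min_samples:
                queue.extend(q_neighbors)  # q is core: keep spreading the label
    return labels


X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5], [20, 20]]
labels = dbscan(X, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at `[20, 20]` has no neighbors within `eps`, so it is labelled as noise instead of being forced into a cluster.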

> **Apply DBSCAN using `sklearn`.**
> 

```python
from sklearn.cluster import DBSCAN

dbscan = DBSCAN(eps=0.2)
dbscan.fit(X)
```

> **Explain the effect of epsilon and minimum samples hyperparameters in DBSCAN.**
> 
- `eps` : determines **closeness** of points
    - small eps → difficult to find neighbors
    - good eps → detects neighbors forming some clusters
    - high eps → points all in one cluster
- `min_samples` : determines # neighboring points to consider as part of cluster
    - low min_samples → few outliers, many clusters
    - good min_samples → some outliers, fewer clusters
    - high min_samples → many outliers, very few clusters

> **Identify DBSCAN limitations.**
> 

Pros

- doesn’t require # clusters in advance
- identifies points not part of any cluster
- captures complex shapes
- use silhouette method

Cons

- must tune hyperparameters `eps` and `min_samples`
- doesn’t `predict` new points (only existing)
- cannot use elbow method
- fails for varying density

> **Explain the idea of hierarchical clustering.**
> 
- get picture of similarity before picking # clusters
- **Algorithm**
    1. start with every point in own cluster
    2. **greedily merge** the most similar clusters
    3. repeat until only one cluster ($n-1$ times)
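
The merging loop above can be sketched in a few lines of plain Python. This version assumes single linkage (minimum distance between clusters) and Euclidean distance, and it is far less efficient than `scipy`'s routines; `agglomerative` is a name invented for this sketch.

```python
import math


def single_linkage_distance(A, B):
    """Smallest distance between any point of A and any point of B."""
    return min(math.dist(a, b) for a in A for b in B)


def agglomerative(X, cluster_distance=single_linkage_distance):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[tuple(x)] for x in X]  # 1. every point starts in its own cluster
    merges = []
    while len(clusters) > 1:  # 3. repeat until one cluster is left (n - 1 merges)
        # 2. greedily pick the closest pair of clusters
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        d = cluster_distance(clusters[i], clusters[j])
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


X = [[0.0], [1.0], [5.0], [5.5]]
merges = agglomerative(X)
for a, b, d in merges:
    print(a, "+", b, "at distance", d)
```

The merge distances (0.5, then 1.0, then 4.0) are what a dendrogram draws as the heights of its joins.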

> **Visualize dendrograms using `scipy.cluster.hierarchy.dendrogram`.**
> 

```python
from scipy.cluster.hierarchy import dendrogram

ax = plt.gca()
dendrogram(linkage_array, ax=ax)
plt.xlabel("Sample index")
plt.ylabel("Cluster distance");
```

- `truncate_mode` to control dendrogram length
    - `lastp` = # leaves
    - `level` = max depth
    
    ```python
    dendrogram(Z, p=6, truncate_mode="lastp", ax=ax, labels=data.index)
    dendrogram(Z, p=6, truncate_mode="level", ax=ax, labels=data.index);
    ```
    
- `fcluster` to flatten
    
    ```python
    from scipy.cluster.hierarchy import fcluster
    
    cluster_labels = fcluster(Z, 6, criterion="maxclust")
    
    pd.DataFrame(cluster_labels, data.index)
    ```
    

> **Explain the advantages and disadvantages of different clustering methods.**
> 

|  | K-means | DBSCAN | hierarchical |
| --- | --- | --- | --- |
| Advantage | - easy to implement<br>- fast/efficient for large data<br>- works well for linearly separated data<br>- variety of distance metrics | - automatically identifies # clusters<br>- identifies irregular shapes<br>- robust to outliers | - hierarchy of clusters, range of solutions<br>- doesn’t require specifying # clusters<br>- distance metrics & linkage methods<br>- identify clusters at varying granularity |
| Disadvantage | - pre-specified # clusters<br>- sensitive to initial selection → different results<br>- can converge sub-optimally<br>- not suitable for irregular shaped data | - sensitivity to hyperparameters<br>- computationally expensive for large data<br>- requires tuning<br>- doesn’t work for varying density | - computationally expensive for large data<br>- sensitive to distance/linkage choice<br>- difficult to determine optimal number of clusters<br>- suffers from chaining effect (merging too early) |

> **Apply clustering algorithms on image datasets and interpret clusters.**
> 

> **Recognize the impact of distance measure and representation in clustering methods.**
> 

```python
from scipy.cluster.hierarchy import (
    average,
    complete,
    dendrogram,
    fcluster,
    single,
    ward,
)

Z = single(X)
Z = average(X)
Z = complete(X)
Z = ward(X)
dendrogram(Z)
```

- similarity/**linkage criteria** between clusters
    1. **single linkage:** merge the two clusters with the smallest minimum distance between their points
    2. **average linkage:** merge the two clusters with the smallest average distance between their points
    3. **complete/max linkage:** merge the two clusters with the smallest maximum distance between their points
    4. **ward linkage:** merge clusters to minimize the increase in within-cluster variance
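
To see how the first three criteria differ, the sketch below computes each one for the same pair of clusters. It is plain Python with Euclidean distance; the function names are chosen here so they do not clash with the `scipy` functions of similar names, and ward linkage is left out.

```python
import math
from itertools import product


def pairwise_distances(A, B):
    """All distances between a point of cluster A and a point of cluster B."""
    return [math.dist(a, b) for a, b in product(A, B)]


def single_link(A, B):
    return min(pairwise_distances(A, B))


def complete_link(A, B):
    return max(pairwise_distances(A, B))


def average_link(A, B):
    d = pairwise_distances(A, B)
    return sum(d) / len(d)


A = [(0.0,), (1.0,)]
B = [(4.0,), (6.0,)]
print(single_link(A, B))    # 3.0: closest pair is 1.0 and 4.0
print(complete_link(A, B))  # 6.0: farthest pair is 0.0 and 6.0
print(average_link(A, B))   # 4.5: mean of 4, 6, 3 and 5
```

At every step the algorithm merges the pair of clusters for which the chosen criterion is smallest, so different criteria can produce different dendrograms on the same data.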

# 15 - Recommender Systems

> **State the problem of recommender systems.**
> 
- **Recommender system:** recommends particular products/services that users are likely to consume
    - requires user ratings, features related to items + users, customer purchase history
- **ratings** for set of $M$ items for $N$ users

> **Describe components of a utility matrix.**
> 
- **Utility matrix:** interactions between $N$ users and $M$ items
    - e.g. ratings, clicks, purchases
    - goal: complete the utility matrix to recommend items that users will rate highly

> **Create a utility matrix given ratings data.**
> 

```python
user_key = "userId"
item_key = "productId"

# Number of distinct users and items
N = len(np.unique(ratings[user_key]))
M = len(np.unique(ratings[item_key]))

# Maps user/item to a number
user_mapper = dict(zip(np.unique(ratings[user_key]), list(range(N))))
item_mapper = dict(zip(np.unique(ratings[item_key]), list(range(M))))
# Maps number to user/item
user_inverse_mapper = dict(zip(list(range(N)), np.unique(ratings[user_key])))
item_inverse_mapper = dict(zip(list(range(M)), np.unique(ratings[item_key])))


def create_Y_from_ratings(data, N, M, user_mapper, item_mapper):
    """Build the N x M utility matrix; NaN marks a missing rating."""
    Y = np.full((N, M), np.nan)
    for index, val in data.iterrows():
        n = user_mapper[val[user_key]]
        m = item_mapper[val[item_key]]
        Y[n, m] = val["rating"]
    return Y

> **Describe a common approach to evaluate recommender systems.**
> 
- **Diversity:** how different are recommendations
- **Freshness:** people like new/surprising things, tradeoff is trust issues and explaining recommendations
- **Persistence:** how long recommendation lasts
- **Social recommendation:** what did friends watch

> **Implement some baseline approaches to complete the utility matrix.**
> 

```python
X = ratings.copy()
# will not use y
y = ratings[user_key]
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_valid.shape
# ((113089, 3), (28273, 3))

# Create training and validation utility matrices
train_mat = create_Y_from_ratings(X_train, N, M, user_mapper, item_mapper)
valid_mat = create_Y_from_ratings(X_valid, N, M, user_mapper, item_mapper)
train_mat.shape, valid_mat.shape
# ((3635, 140), (3635, 140))
# same shape, but each only contain a proportion of all ratings
```

```python
def error(X1, X2):
    """
    Returns the root mean squared error.
    """
    return np.sqrt(np.nanmean((X1 - X2) ** 2))

def evaluate(pred_X, train_X, valid_X, model_name="Global average"):
    print("%s train RMSE: %0.2f" % (model_name, error(pred_X, train_X)))
    print("%s valid RMSE: %0.2f" % (model_name, error(pred_X, valid_X)))
```

- predict as global average
    
    ```python
    avg = np.nanmean(train_mat)
    pred_g = np.zeros(train_mat.shape) + avg
    
    evaluate(pred_g, train_mat, valid_mat, model_name="Global average")
    '''
    Global average train RMSE: 5.75
    Global average valid RMSE: 5.77
    '''
    ```
    
- KNN imputation: impute missing values using the **mean value** of the $k$ nearest neighbours in the training set, with distances computed on the values that both rows have
    
    ```python
    from sklearn.impute import KNNImputer
    
    imputer = KNNImputer(n_neighbors=10)
    train_mat_imp = imputer.fit_transform(train_mat)
    
    evaluate(train_mat_imp, train_mat, valid_mat, model_name="KNN imputer")
    '''
    KNN imputer train RMSE: 0.00
    KNN imputer valid RMSE: 4.79
    '''
    ```
    

> **Explain the idea of collaborative filtering.**
> 
- **unsupervised learning →** learn features using sparse labels
- **intuition**: similar users and items help predict entries → leveraging **social information**
- can use **cross-validation** and **grid search**
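
One way to "learn features using sparse labels" is to factorize the utility matrix: give every user and every item a short vector and fit the vectors so that their dot product matches the known ratings. The sketch below does this with stochastic gradient descent in plain Python. It is a simplified illustration (no bias terms, made-up ratings and hyperparameters), not the algorithm as implemented in `surprise`.

```python
import math
import random


def init_factors(n_rows, k, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_rows)]


def predict(U, V, u, i):
    """Predicted rating = dot product of the user and item vectors."""
    return sum(a * b for a, b in zip(U[u], V[i]))


def rmse(U, V, ratings):
    return math.sqrt(
        sum((r - predict(U, V, u, i)) ** 2 for u, i, r in ratings) / len(ratings)
    )


def fit_factors(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02, epochs=500, seed=42):
    rng = random.Random(seed)
    U = init_factors(n_users, k, rng)
    V = init_factors(n_items, k, rng)
    for _ in range(epochs):
        for u, i, r in ratings:  # only the observed entries are used
            err = r - predict(U, V, u, i)
            for f in range(k):
                uf, vf = U[u][f], V[i][f]
                U[u][f] += lr * (err * vf - reg * uf)
                V[i][f] += lr * (err * uf - reg * vf)
    return U, V


# Hypothetical (user, item, rating) triples; most of the 4 x 3 matrix is missing.
ratings = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1),
           (2, 1, 5), (2, 2, 2), (3, 0, 1), (3, 2, 5)]

U0, V0 = fit_factors(ratings, 4, 3, epochs=0)  # untrained factors
U, V = fit_factors(ratings, 4, 3)
print(rmse(U0, V0, ratings), rmse(U, V, ratings))  # error before vs. after training
print(predict(U, V, 0, 2))  # filled-in value for a rating that was never observed
```

Once the factors are learned, every missing entry of the utility matrix can be filled in with a dot product, which is how recommendations are produced.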

```python
import surprise
from surprise import SVD, Dataset, Reader, accuracy

reader = Reader()
data = Dataset.load_from_df(ratings, reader)  # Load the data

# I'm being sloppy here. Probably there is a way to create validset from our already split data.
trainset, validset = surprise.model_selection.train_test_split(
    data, test_size=0.2, random_state=42
)  # Split the data

k = 10
algo = SVD(n_factors=k, random_state=42)
algo.fit(trainset)
svd_preds = algo.test(validset)
accuracy.rmse(svd_preds, verbose=True)

from surprise.model_selection import cross_validate

pd.DataFrame(cross_validate(algo, data, measures=["RMSE", "MAE"], cv=5, verbose=True))
```

> **Explain the idea of content-based filtering.**
> 
- e.g. for movie recommendation
    - we know movie ratings
    - we know movie features
    - we can create profile for each user based on the movies they like
- rating prediction → **supervised regression problem**
    - given movie info, create profile
    - build regression model for each user, learn regression weights
    - each user has personalized regression model

**Pros**

- don’t need many users to provide rating
- each user modeled separately → uniqueness of taste
- can obtain **features of items**, can immediately recommend new items
    - not possible with collaborative filtering
- recommendations **interpretable** by weights

**Cons**

- feature acquisition & feature engineering
    - what features should we use to explain difference in ratings?
    - obtaining features for each item may be expensive
- **less diversity →** hardly recommend item outside profile
- **cold start →** new users, no information

> **Explain some serious consequences of recommendation systems.**
> 
- User exposed to information **reinforcing beliefs and biases**
    - increases **polarization**, **lessens diversity** in perspective
- maximizing user engagement by **reinforcing harmful ideas**
- perpetuate/amplify **bias & discrimination** by learning them and recommending the products/content
- **privacy violations** by relying on personal user data without consent
- **misinformation and propaganda** if recommendations spread false information

# 16 - NLP Intro

> **Broadly explain what is natural language processing (NLP).**
> 
- **NLP:** making computers understand what humans say
    - requires common sense and reasoning
        - **lexical ambiguity:** e.g. panini means a sandwich and a person’s name
        - **referential ambiguity:** e.g. the word “it” could refer to multiple nouns in a sentence

> **Name some common NLP applications.**
> 
- Voice assistants
- auto-complete
- translation

> **Explain the general idea of a vector space model.**
> 
- represent text as a **numeric vector in high-dimensional space**
    - each word represented as a point
    - distance between points represents **similarity**
        - i.e. normalized dot product between word vectors, also called **cosine similarity**
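
Cosine similarity is short enough to write out by hand. The vectors below are arbitrary toy values, not real word embeddings.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))    # same direction → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))    # orthogonal → 0.0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction → about -1.0
```

Because of the normalization, only the direction of the vectors matters, not their length.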

> **Explain the difference between different `word representations`: term-term co-occurrence matrix representation and Word2Vec representation.**
> 

| term-term co-occurrence matrix | Word2Vec representation. |
| --- | --- |
| in text, counts words within a context window | dense word embedding trained with ML models |
| long, sparse representation to capture relationship between words | short, dense vectors |
|  | training is expensive |
|  | pre-trained word embedding |
| BoW is document-term co-occurrence matrix |  |
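
A term-term co-occurrence matrix can be built by sliding a context window over the text and counting neighbours. The sketch below stores the counts in a `Counter` keyed by word pairs instead of a full matrix; the sentence and the window size are arbitrary choices for illustration.

```python
from collections import Counter


def cooccurrence_counts(tokens, window=1):
    """Count how often each (word, context word) pair occurs within `window` positions."""
    counts = Counter()
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
print(counts[("the", "cat")])  # 1
print(counts[("the", "mat")])  # 1
print(counts[("cat", "mat")])  # 0: never within one position of each other
```

The row of counts for a word (here, how often "the" appears next to every other word) is that word's long, sparse vector representation.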

> **Describe the reasons and benefits of using pre-trained embeddings.**
> 
- pre-computed word/phrase representations trained on large amount of data useful when
    1. **Data scarce:** data available is limited
    2. **Time efficient** to skip step of training own word embedding
    3. **improved performance** with good initialization point to capture semantic and syntactic information to learn better representations downstream
    4. **transfer learning:** model trained on one task fine-tuned on new task
    5. **multilingual support**

> **Load and use pre-trained word embeddings to find word similarities and analogies.**
> 

```python
import gensim
import gensim.downloader as api
import pandas as pd

google_news_vectors = api.load('word2vec-google-news-300')

google_news_vectors.most_similar("UBC")
google_news_vectors.similarity("Japan", "hockey")

# analogy (wrapped in a function; the name `analogy` is chosen here for illustration)
def analogy(word1, word2, word3, model=google_news_vectors):
    print("%s : %s :: %s : ?" % (word1, word2, word3))
    sim_words = model.most_similar(positive=[word3, word2], negative=[word1])
    return pd.DataFrame(sim_words, columns=["Analogy word", "Score"])
```
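
The analogy query boils down to vector arithmetic: take `word2 - word1 + word3` and look for the closest remaining word. The sketch below shows the idea with hand-made three-dimensional vectors, which are invented for illustration and are not real embeddings; `toy_analogy` is a simplified stand-in for what `most_similar` does, not its actual implementation.

```python
import math

# Hypothetical toy embeddings: dimension 1 loosely encodes gender, dimension 2 royalty.
toy_vectors = {
    "man": [1.0, 0.0, 0.1],
    "woman": [1.0, 1.0, 0.1],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.2, 0.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_analogy(word1, word2, word3, vectors):
    """Answer word1 : word2 :: word3 : ? by nearest neighbour of word2 - word1 + word3."""
    target = [
        b - a + c for a, b, c in zip(vectors[word1], vectors[word2], vectors[word3])
    ]
    candidates = [w for w in vectors if w not in (word1, word2, word3)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(toy_analogy("man", "woman", "king", toy_vectors))  # queen
```

The three query words are excluded from the candidates, otherwise the closest vector would often be one of the inputs.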

> **Demonstrate biases in embeddings and learn to watch out for such biases in pre-trained embeddings.**
> 

> **Use word embeddings in text classification and document clustering using `spaCy`.**
> 

```python
import spacy

nlp = spacy.load("en_core_web_md")

doc = nlp("pineapple") # extract all interesting information about the document

#Average Embedding
doc = nlp("All empty promises")
avg_sent_emb = doc.vector

# document similarity
doc1 = nlp("Deep learning is very popular these days.")
doc2 = nlp("Machine learning is dominated by neural networks.")
doc3 = nlp("A home-made fresh bread with butter and cheese.")
doc1.similarity(doc2)
```

Airline sentiment analysis

- Split data
    
    ```python
    from sklearn.model_selection import cross_validate, train_test_split
    
    train_df, test_df = train_test_split(df, test_size=0.2, random_state=123)
    X_train, y_train = train_df["text"], train_df["airline_sentiment"]
    X_test, y_test = test_df["text"], test_df["airline_sentiment"]
    ```
    
- BoW representation
    
    ```python
    pipe = make_pipeline(
        CountVectorizer(stop_words="english"), LogisticRegression(max_iter=1000)
    )
    pipe.named_steps["countvectorizer"].fit(X_train)
    pipe.fit(X_train, y_train)
    pipe.score(X_train, y_train)
    pipe.score(X_test, y_test)
    ```
    
- Average Embedding
    
    ```python
    # get word vectors by creating average embedding representation for examples
    X_train_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_train)])
    X_test_embeddings = pd.DataFrame([text.vector for text in nlp.pipe(X_test)])
    
    # Logistic Regression
    lgr = LogisticRegression(max_iter=2000)
    lgr.fit(X_train_embeddings, y_train)
    lgr.score(X_train_embeddings, y_train)
    lgr.score(X_test_embeddings, y_test)
    ```
    

> **Explain the general idea of topic modeling.**
> 
- **Topic modelling:** summarize the major themes in a collection of documents (corpus)
    - organize and categorize documents by topic
    - commonly done with **unsupervised methods**
- application
    - EDA to get sense of large corpus

```python
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 3 # number of topics
lda = LatentDirichletAllocation(
    n_components=n_topics, learning_method="batch", max_iter=10, random_state=0
)
lda.fit(toy_X) 
# document-topic association
document_topics = lda.transform(toy_X)

# word-topic association: weight associated with each word
lda.components_
```

> **Describe the input and output of topic modeling.**
> 
- **Input:**
    - collection of documents
    - hyperparameter for number of topics/clusters $K$
- **Output:**
    1. **Topic-word association:** for each topic, which words describe it
    2. **Document-topic association:** which topics are expressed by each document

> **Carry out basic text preprocessing using `spaCy`.**
> 

```python
min_token_len = 2
irrelevant_pos = ["ADV", "PRON", "CCONJ", "PUNCT", "PART", "DET", "ADP", "SPACE"]


def preprocess(doc):
    """Return one cleaned string of lowercased lemmas for a spaCy Doc."""
    clean_text = []
    for token in doc:
        if (
            not token.is_stop  # drop stopwords
            and len(token) > min_token_len  # drop very short tokens
            and token.pos_ not in irrelevant_pos  # drop uninformative POS tags
        ):
            clean_text.append(token.lemma_.lower())  # keep the lowercased lemma
    return " ".join(clean_text)


# nlp.pipe yields one Doc per text, so preprocess each Doc in turn
clean_texts = [preprocess(doc) for doc in nlp.pipe(text_df["text"])]
```

- **Tokenization:**
    - sentence segmentation (text into sentences)
        
        ```python
        from nltk.tokenize import sent_tokenize
        sent_tokenized = sent_tokenize(text)
        ```
        
    - word tokenization (sentences into words)
        
        ```python
        from nltk.tokenize import word_tokenize
        
        word_tokenized = [word_tokenize(sent) for sent in sent_tokenized]
        ```
        
- Punctuation and stopword removal
- lemmatization: convert inflected form of words into base form
- stemming: chopping affixes

# 17 - Multi-class classification

> **Apply classifiers to multi-class classification algorithms.**
> 
1. **One vs Rest:** for each class, fit a binary model separating it from all the others (imbalanced) → the class whose binary classifier gives the highest score wins
    
    ```python
    lr = LogisticRegression(max_iter=2000, multi_class="ovr")
    ```
    
2. **One vs One:** fit a binary model for each pair of classes, $\frac{n\times (n-1)}{2}$ binary classifiers → apply all classifiers, most votes wins
- Wrappers for any binary classifier
    - [`OneVsRestClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html)
    - [`OneVsOneClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsOneClassifier.html)
    
    ```python
    model = OneVsOneClassifier(LogisticRegression())
    ```
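As a quick check of the classifier counts, the sketch below fits both wrappers on a synthetic 4-class dataset (the data and parameter values are made up for illustration) and counts the underlying binary models.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with n = 4 classes
X, y = make_classification(
    n_samples=200,
    n_features=6,
    n_informative=4,
    n_redundant=0,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=123,
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=2000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=2000)).fit(X, y)

print(len(ovr.estimators_))  # one binary model per class: n
print(len(ovo.estimators_))  # one binary model per pair: n * (n - 1) / 2
```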
    

> **Explain the role of neural networks in machine learning, and the pros/cons of using them.**
> 
- **Neural networks:** apply a sequence of transformations to the input data
- **Perceptron**: single layer neural network with input/output layer and adjustable weights
- **Multi Layer Perceptron (MLP)**: multiple layers of perceptrons
    - layers can apply non-linear functions
    - can specify the number of features after each transformation

**Pros**

- Learn **complex** functions
- complexity trade-off is controlled by the number of layers and the layer size
- more/bigger layers → more complexity
- can get **model that won’t underfit**
- Works well for **structured** data
    - 1D sequences (e.g. timeseries, language)
    - 2D image
    - 3D image or video
- **Transfer** **learning** is useful

**Cons**

- requires lots of **data**
- requires a lot of **compute time**; GPUs make training faster
- **huge** number of **hyperparameters** → difficult to tune
    - each layer has hyperparameters + overall hyperparameters
- **not interpretable**
- `fit` **not** guaranteed to be **optimal**
    - hyperparameters specific to `fit`
    - don’t know if it was successful
    - never know how long to run `fit`

> **Explain why the methods we’ve learned previously would not be effective on image data.**
> 
- Previous methods require flattening each image into a vector of features → **removes the structured information of images**
- **Convolutional neural networks (CNN):** use images without flattening
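A small sketch of what flattening does to spatial structure, using a made-up 4×5 "image": horizontally adjacent pixels stay next to each other in the flat vector, but vertically adjacent pixels end up `width` positions apart, and a tabular model has no way of knowing they were neighbours.

```python
import numpy as np

height, width = 4, 5
image = np.arange(height * width).reshape(height, width)  # toy grayscale image
flat = image.flatten()  # what a tabular model would receive

row, col = 1, 2
idx = row * width + col  # position of pixel (row, col) in the flat vector
idx_right = row * width + (col + 1)  # its right-hand neighbour
idx_below = (row + 1) * width + col  # the pixel directly below it

print(idx_right - idx)  # distance between horizontal neighbours
print(idx_below - idx)  # distance between vertical neighbours
```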

> **Apply pre-trained neural networks to classification and regression problems.**
> 

```python
import numpy as np
import pandas as pd
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models import vgg16

clf = vgg16(weights="VGG16_Weights.DEFAULT")
preprocess = transforms.Compose(
    [
        transforms.Resize(299),
        transforms.CenterCrop(299),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ]
)

# `img` (a PIL image), `classes` (class labels) and `topn` are assumed to be defined elsewhere
img_t = preprocess(img)
batch_t = torch.unsqueeze(img_t, 0)  # add a batch dimension

clf.eval()  # switch to inference mode
with torch.no_grad():  # gradients are not needed for prediction
    output = clf(batch_t)
_, indices = torch.sort(output, descending=True)
probabilities = torch.nn.functional.softmax(output, dim=1)
d = {
    "Class": [classes[idx] for idx in indices[0][:topn]],
    "Probability score": [
        np.round(probabilities[0, idx].item(), 3) for idx in indices[0][:topn]
    ],
}
df = pd.DataFrame(d, columns=["Class", "Probability score"])
```

> **Utilize pre-trained networks as feature extractors and combine them with models we’ve learned previously.**
> 

```python
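# Attribution: Code from PyTorch docs
# https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html?highlight=transfer%20learning
import os

import torch
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from torch import nn
from torchvision import datasets, models, transforms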

IMAGE_SIZE = 200
BATCH_SIZE = 64

data_transforms = {
    "train": transforms.Compose(
        [
            # transforms.RandomResizedCrop(224),
            # transforms.RandomHorizontalFlip(),
            transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),     
            transforms.ToTensor(),
            #transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),            
        ]
    ),
    "valid": transforms.Compose(
        [
            # transforms.Resize(256),
            # transforms.CenterCrop(224),
            transforms.Resize((IMAGE_SIZE, IMAGE_SIZE)),                        
            transforms.ToTensor(),
            # transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
            transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),                        
        ]
    ),
}
data_dir = "data/animal-faces"
image_datasets = {
    x: datasets.ImageFolder(os.path.join(data_dir, x), data_transforms[x])
    for x in ["train", "valid"]
}
dataloaders = {
    x: torch.utils.data.DataLoader(
        image_datasets[x], batch_size=BATCH_SIZE, shuffle=True, num_workers=4
    )
    for x in ["train", "valid"]
}
dataset_sizes = {x: len(image_datasets[x]) for x in ["train", "valid"]}
class_names = image_datasets["train"].classes

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

def get_features(model, train_loader, valid_loader):
    """Extract the output of a pre-trained model (1024 features per image, as for DenseNet-121)"""
    with torch.no_grad():  # turn off computational graph stuff
        Z_train = torch.empty((0, 1024))  # Initialize empty tensors
        y_train = torch.empty((0))
        Z_valid = torch.empty((0, 1024))
        y_valid = torch.empty((0))
        for X, y in train_loader:
            Z_train = torch.cat((Z_train, model(X)), dim=0)
            y_train = torch.cat((y_train, y))
        for X, y in valid_loader:
            Z_valid = torch.cat((Z_valid, model(X)), dim=0)
            y_valid = torch.cat((y_valid, y))
    return Z_train.detach(), y_train.detach(), Z_valid.detach(), y_valid.detach()

densenet = models.densenet121(weights="DenseNet121_Weights.IMAGENET1K_V1")
densenet.classifier = nn.Identity()  # remove that last "classification" layer

Z_train, y_train, Z_valid, y_valid = get_features(
    densenet, dataloaders["train"], dataloaders["valid"]
)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
pipe.fit(Z_train, y_train)
pipe.score(Z_train, y_train)
# 1.0
pipe.score(Z_valid, y_valid)
# 0.9
```

# 18 - Time Series

> **Recognize when it is appropriate to use time series.**
> 
- Data indexed in time order

> **Explain the pitfalls of train/test splitting with time series data.**
> 
- cannot split randomly → **forecasting must not know the future**
    - split by time threshold

> **Appropriately split time series data, both train/test split and cross-validation.**
> 

```python
rain_df["Date"].min()
# Timestamp('2007-11-01 00:00:00')
rain_df["Date"].max()
# Timestamp('2017-06-25 00:00:00')

# then split by time in the middle
```
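A minimal sketch of the threshold split on a made-up ten-day frame (`toy_df` and `split_date` are hypothetical): everything before the threshold is training data, everything on or after it is test data.

```python
import pandas as pd

toy_df = pd.DataFrame(
    {
        "Date": pd.date_range("2007-11-01", periods=10, freq="D"),
        "Rainfall": [0.0, 1.2, 0.0, 3.4, 0.0, 0.0, 2.1, 0.0, 0.5, 0.0],
    }
)

split_date = pd.Timestamp("2007-11-08")  # hypothetical threshold
train_df = toy_df[toy_df["Date"] < split_date]
test_df = toy_df[toy_df["Date"] >= split_date]

print(train_df["Date"].max(), test_df["Date"].min())
```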

`TimeSeriesSplit`: sort the dataframe by date, then pass it as `cv` for **cross-validation**

```python
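from sklearn.model_selection import TimeSeriesSplit, cross_val_score

lr_pipe = make_pipeline(
    preprocessor,
    LogisticRegression(max_iter=1000),
)
# rows must already be sorted by date so each fold validates on later data
cross_val_score(
    lr_pipe, train_df_ordered, y_train_ordered, cv=TimeSeriesSplit()
).mean()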
```
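To see what the splitter does, this sketch prints the fold indices for eight time-ordered rows (the array is a stand-in for a sorted dataframe). In every fold the training indices come strictly before the validation indices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_ordered = np.arange(8).reshape(-1, 1)  # 8 rows, already sorted by time
folds = list(TimeSeriesSplit(n_splits=3).split(X_ordered))

for train_idx, valid_idx in folds:
    print("train:", train_idx, "valid:", valid_idx)
```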

> **Perform time series feature engineering:**
> 

> **Encode time as various features in a tabular dataset**
> 

```python
# Encodes days as number
first_day = train_df["Date"].min()

train_df = train_df.assign(
    Days_since=train_df["Date"].apply(lambda x: (x - first_day).days)
)
# Month name as a categorical feature (to be one-hot encoded during preprocessing)
train_df = train_df.assign(
    Month=train_df["Date"].apply(lambda x: x.month_name())
)
```

> **Create lag-based features**
> 

```python
orig_feature = "Rainfall"
lag = 1  # number of time steps to look back
new_df = df.copy()
new_feature_name = f"{orig_feature}_lag{lag}"

# if there are multiple time series by category
for location, df_location in new_df.groupby(
    "Location"
):  # Each location is its own time series
    # rows from position `lag` onward receive the value observed `lag` steps earlier
    new_df.loc[df_location.index[lag:], new_feature_name] = df_location.iloc[:-lag][
        orig_feature
    ].values

```
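An equivalent and shorter way to build the same kind of feature is `groupby(...).shift(...)`. The sketch below uses a made-up frame and assumes the rows are already sorted by date within each location.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "Location": ["A", "A", "A", "B", "B", "B"],
        "Rainfall": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)

# Shift within each location so one series never leaks into another
toy["Rainfall_lag1"] = toy.groupby("Location")["Rainfall"].shift(1)
print(toy)
```

The first row of each location has no previous value, so its lag feature is `NaN` and needs imputing or dropping.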

> **Explain how can you forecast multiple time steps into the future.**
> 
- Approaches to predict into future
    1. Train separate model for **each time step**
        - e.g. predict next month, predict two months
    2. multi-output model jointly predicting multiple time steps
    3. One model predicts one time step, and uses it to predict next time step… `for` loop
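The third approach can be sketched as follows, assuming a toy series and a one-lag linear model (both chosen only for illustration): each prediction is fed back in as the lag feature for the next step.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

series = np.arange(1.0, 11.0)  # toy series 1, 2, ..., 10
X_lag = series[:-1].reshape(-1, 1)  # value at time t-1
y_next = series[1:]  # value at time t
model = LinearRegression().fit(X_lag, y_next)

forecasts = []
last_value = series[-1]
for _ in range(3):  # forecast three steps ahead
    next_value = model.predict(np.array([[last_value]]))[0]
    forecasts.append(float(next_value))
    last_value = next_value  # the prediction becomes the next input

print(forecasts)
```

Because each step consumes an earlier prediction, errors can compound the further ahead the loop runs.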

> **Explain the challenges of time series data with unequally spaced time points.**
> 
- if unequally spaced
    - can still do feature engineering
    - lags would not make sense
        - could group into equal bins
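Grouping into equal bins can be sketched with pandas `resample`. The timestamps and values below are made up, and summing per day is only one possible aggregation.

```python
import pandas as pd

irregular = pd.DataFrame(
    {"Rainfall": [1.0, 2.0, 0.5, 4.0]},
    index=pd.to_datetime(
        [
            "2017-06-01 03:00",
            "2017-06-01 17:30",
            "2017-06-03 08:15",
            "2017-06-04 22:00",
        ]
    ),
)

# One bin per calendar day; days with no measurements still get a row
daily = irregular["Rainfall"].resample("D").sum()
print(daily)
```

With `sum`, a day without measurements shows up as 0, which may or may not be the right reading of missing data. After binning, the points are equally spaced and lag features make sense again.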

> **At a high level, explain the concept of trend.**
> 
- patterns that emerge over continuous time
    - usually consistent upward/downward movement

# 19 - Survival Analysis

> **Explain what is right-censored data.**
> 
- *Right-censoring*: the data are limited by the window of data collection, so for some individuals the event (e.g. churn) has not happened yet when collection ends
    - for these individuals we only know the duration is at least as long as what was observed; the record is incomplete

> **Explain the problem with treating right-censored data the same as “regular” data.**
> 
- Not all the data points are complete; the end of the data capture window is a cut-off
    - treating censored durations as final will tend to under-estimate the true durations
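A small numeric sketch of the under-estimation, using hypothetical tenures and a 12-month observation window. In practice the true tenures of censored customers are unknown; they are written out here only so the bias is visible.

```python
# Hypothetical true tenures in months
true_tenure = [3, 8, 15, 24, 40, 60]
window = 12  # data collection stops 12 months after signup

observed_tenure = [min(t, window) for t in true_tenure]
churned = [t <= window for t in true_tenure]  # False means right-censored

# Option 1: pretend the censored durations are final
naive_mean = sum(observed_tenure) / len(observed_tenure)

# Option 2: keep only customers who actually churned
churned_only = [t for t, c in zip(observed_tenure, churned) if c]
churned_only_mean = sum(churned_only) / len(churned_only)

true_mean = sum(true_tenure) / len(true_tenure)
print(naive_mean, churned_only_mean, true_mean)
```

Both shortcuts land well below the true mean, which is why censored rows need a method that accounts for them.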

> **Determine whether survival analysis is an appropriate tool for a given problem.**
> 
- helps answer questions such as
    1. how long customers stay
    2. for customer, predict how long they may stay
    3. factors influencing churn time

> **Apply survival analysis in Python using the `lifelines` package.**
> 
> 
> > **Interpret a survival curve, such as the Kaplan-Meier curve.**
> > 

```python
kmf = lifelines.KaplanMeierFitter()
kmf.fit(train_df_surv["tenure"], train_df_surv["Churn"])
kmf.survival_function_.plot();
# or with error 
# kmf.plot();
plt.title("Survival function of customer churn")
plt.xlabel("Time with service (months)")
plt.ylabel("Survival probability");
```

- can look at KM curves for different groups
    - e.g. market segments

> **Interpret the coefficients of a fitted Cox proportional hazards model.**
> 
- **Cox proportional hazards model:** interpret how features influence survival duration
    - **coefficient** for each feature → influence on survival (positive coefficient = higher hazard, i.e. shorter expected survival)
    - relies on the **proportional hazards assumption**

```python
cph = lifelines.CoxPHFitter(penalizer=0.1)
cph.fit(train_df_surv, duration_col="tenure", event_col="Churn");

cph_params = pd.DataFrame(cph.params_).sort_values(by="coef", ascending=False)
cph_params

cph.summary

# confidence intervals
cph.plot();
```
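In a Cox model the hazard is $h(t \mid x) = h_0(t)\exp(\beta^\top x)$, so exponentiating a coefficient gives the multiplicative change in hazard for a one-unit increase in that feature. The feature names and coefficient values below are hypothetical, not results from the churn data.

```python
import math

# Hypothetical coefficients, for illustration only
hypothetical_coefs = {"MonthlyCharges": 0.02, "Contract_Two year": -1.5}

hazard_ratios = {name: math.exp(coef) for name, coef in hypothetical_coefs.items()}
for name, ratio in hazard_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: hazard ratio {ratio:.2f} ({direction} the churn hazard)")
```

A ratio above 1 means churn tends to come sooner; below 1 means the feature is associated with longer survival. The `exp(coef)` column of `cph.summary` should report the same quantity.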

> **Make predictions for existing individuals and interpret these predictions.**
> 

```python
cph.predict_expectation(test_df_surv)

# survival function for individuals
cph.predict_survival_function(test_df_surv).plot()
```

# 20 - Ethics

> **Sources of bias**
> 
- **Historical bias:** training data reflects biases/prejudices from the past
- **Measurement bias:** data does not accurately measure what was intended
- **Representation bias:** data doesn’t accurately represent population/phenomenon of interest

> **Why algorithmic bias matters and how it can impact us**
> 
- can disproportionately harm marginalized groups
- can reinforce existing biases/stereotypes/prejudice

> **Different fairness metrics**
> 
- **Demographic/statistical parity:** each group's share of the population is reflected in the output classes
- **Equality of false negatives or equalized odds:** constant false-negative rates across groups
- **Equal opportunity:** equal true positive rate for all groups
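These metrics come down to comparing simple rates per group. The sketch below uses made-up labels, predictions and group memberships, and takes the rate of positive predictions as one common way to formalize demographic parity.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group positive prediction rate, true positive rate and false negative rate."""
    rates = {}
    for g in sorted(set(groups)):
        rows = [(t, p) for t, p, gg in zip(y_true, y_pred, groups) if gg == g]
        positives = [(t, p) for t, p in rows if t == 1]
        tp = sum(1 for _, p in positives if p == 1)
        rates[g] = {
            "positive_rate": sum(p for _, p in rows) / len(rows),
            "tpr": tp / len(positives),
            "fnr": 1 - tp / len(positives),
        }
    return rates


# Hypothetical data: 4 people in group "a", 4 in group "b"
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = group_rates(y_true, y_pred, groups)
print(rates)
```

Here group "b" has a lower positive rate, a lower true positive rate and a higher false negative rate than group "a", so this toy classifier would fail all three criteria.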

# 21 - Communication

> **When communicating about applied ML, tailor an explanation to the intended audience.**
> 

> **Apply best practices of technical communication, such as bottom-up explanations and reader-centric writing.**
> 

> **Given an ML problem, analyze the decision being made and the objectives.**
> 

> **Avoid the pitfall of thinking about ML as coding in isolation; build the habit of relating your work to the surrounding context and stakeholders.**
> 

> **Interpret a confidence score or credence, e.g. what does it mean to be 5% confident that a statement is true.**
> 
- **credence in practice**
    1. **I would accept a bet at these odds**
        - e.g. 99% → to win 1, I would bet 99
        - e.g. 75% → to win 25, I would bet 75
    2. **Long-run frequency of correctness**
        - e.g. 99% → for every 100 predictions, I expect 1 to be incorrect
        - e.g. 75% → for every 100 predictions, I expect 25 to be incorrect
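Both readings are simple arithmetic on the credence. A minimal sketch, with hypothetical helper names and credences given as whole percentages:

```python
def fair_bet(credence_pct):
    """Return (stake, winnings) per 100 units for a bet you would just accept."""
    return credence_pct, 100 - credence_pct


def expected_incorrect(credence_pct, n_predictions=100):
    """Number of predictions expected to be wrong if the credence is well calibrated."""
    return n_predictions * (100 - credence_pct) / 100


for pct in (99, 75):
    stake, winnings = fair_bet(pct)
    print(f"{pct}%: bet {stake} to win {winnings}, "
          f"expect {expected_incorrect(pct):.0f} wrong per 100 predictions")
```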

> **Maintain a healthy skepticism of `predict_proba` scores and their possible interpretation as credences.**
> 

> **Be careful and precise when communicating confidence to stakeholders in an ML project.**
> 

> **Identify misleading visualizations.**
> 
- Things to watch out for
    - Chopping off the x-axis
    - Saturating the axes
    - Bar charts for cherry-picked values
    - Different y-axes