## Cheat Sheet


### **0. Introduction:**


- 2020: 40 zettabytes of data ($40 \times 10^9$ terabytes)
- DS: "the journey of extracting knowledge from data", "From collecting data to doing something useful with it".
- Data Mining = Data science + machine learning + visualization
- “Machine learning is the science of making computers learn and act like humans by feeding data and information without explicitly being programmed”
- Triangle of Data and Knowledge
  - Data = simple, isolated facts (e.g. values in a database table)
  - Information = data in context
  - Knowledge = interpreted information
  - Intelligence = use of knowledge to choose between alternatives
  - Wisdom = intelligence + experience (guided by values and commitment)
- Dataset size: Columns (features) x Rows (instances/observations)
- Data gathering
  - Scraping
  - APIs
    - JSON (JavaScript Object Notation)
    - IMDB
    - Kaggle
  - Tools for data collection
    - beautifulsoup (HTML)
- **GDPR:**
  - All personal data that's collected within the EU is protected by the GDPR
  - Personal data = any information relating to an identified or identifiable natural person (data subject)
    - Name, ID number, religion, political views, location data, online identifier, etc.
    - Data controllers must notify the authorities of any data breach within 72 hours
    - Consent must be: clear, freely given, informed, and specific
  - Data rights:
    - Right to be informed
    - Right to be forgotten
    - Right to rectification (= correction of inaccurate personal data)
    - Right to restrict processing
    - Data portability (individuals can request a copy of their personal data; companies need to reply within 30 days)
    - Right to object or appeal to automated decision making
- **Data types:**
  - Structured data:
    - Categorical
      - Nominal (not orderable: e.g. gender, ID, zip code)
      - Ordinal (orderable: e.g. grades, low-med-high)
    - Numerical
      - Continuous (distance, time, temperature, weight, height)
      - Discrete (countable: number of children, number of cars)
  - Unstructured or semi-structured:
    - Text
    - Multimedia (image, audio, video)
    - XML/JSON
- **Metrics for data readiness:**
  - reliability, accuracy, accessibility, security, privacy, consistency, validity, timeliness, granularity
- Converting Continuous data to Discrete variables:
  - Convert incomes to income groups
- Typical numerical transformations:
  - Logarithmic transformation $log(x)$
  - 0-1 Normalization $x_{new} = \frac{x_i - \min(x)}{\max(x) - \min(x)}$
  - Z Normalization $z = \frac{x_i - mean(x)}{std(x)}$
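
The three numerical transformations above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names and the `incomes` sample are assumptions made for the example, and the z-normalization uses the population standard deviation.

```python
import math

def log_transform(values):
    """Natural logarithm; only defined for strictly positive values."""
    return [math.log(v) for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to rescale
    return [(v - lo) / (hi - lo) for v in values]

def z_normalize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        return [0.0 for _ in values]  # constant column: no spread
    return [(v - mean) / std for v in values]

incomes = [20_000, 40_000, 60_000, 100_000]  # hypothetical sample data
print(min_max_normalize(incomes))
print(z_normalize(incomes))
```

After z-normalization the values have mean 0 and standard deviation 1, which is the scaling gradient-based methods usually benefit from.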


### **1. Regression:**


Regression is a statistical method that helps you understand the relationship between dependent and independent variables. It's mainly used for prediction and forecasting.

- Predicting numerical values (age, price, temperature, etc.)
- Supervised learning
- Residual Sum of Squares = $\sum(y-y')^2$, where y=actual_value, y'=predicted value
- How to find the best fitted line = _Gradient Descent_
  - Scale the features first: on unscaled features gradient descent may converge slowly or poorly, because the loss contours form a squished ellipse instead of a circle.
  - Iterative approach that tries to reduce the RSS
- **Goodness of fit**
  - RSS = Residual Sum of Squares -> $\sum(y-y')^2$, where y=actual_value, y'=predicted value
  - TSS = Total Sum of Squares -> $\sum(y-\bar{y})^2$ where y=actual_value, $\bar{y}$=mean
  - $R^2$ : 1 = perfect fit, 0 = same as avg.
  - $R^2 = 1 - RSS/TSS$
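
As a rough sketch of how gradient descent and $R^2$ fit together, the example below fits a line $y = b_0 + b_1 x$ by repeatedly stepping against the gradient of the mean squared residual and then scores the fit with $1 - RSS/TSS$. The learning rate, step count, and toy data are assumptions chosen so that this small example converges.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Fit y = b0 + b1*x by gradient descent on the mean squared residual."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        residuals = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        grad_b0 = 2 / n * sum(residuals)
        grad_b1 = 2 / n * sum(r * x for r, x in zip(residuals, xs))
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

def r_squared(ys, predictions):
    """R^2 = 1 - RSS/TSS."""
    mean_y = sum(ys) / len(ys)
    rss = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    tss = sum((y - mean_y) ** 2 for y in ys)
    return 1 - rss / tss

xs = [0.0, 0.25, 0.5, 0.75, 1.0]   # feature already scaled to [0, 1]
ys = [1.0, 1.5, 2.0, 2.5, 3.0]     # generated from y = 1 + 2x
b0, b1 = fit_line_gradient_descent(xs, ys)
predictions = [b0 + b1 * x for x in xs]
print(b0, b1, r_squared(ys, predictions))
```

Predicting the mean for every point gives RSS = TSS and therefore $R^2 = 0$, which is the "same as avg." case.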
- **MAE & MSE:**

  - These are metrics to evaluate the regression model. MAE is the average of the absolute differences between the predicted and actual values. MSE is the average of the squared differences. MSE penalizes larger errors more than MAE. E.g., if MAE = 25’000, it means that on average our predictions are off by 25k CHF.

  - **MAE (Mean Absolute Error):**
    - $\frac{1}{n}\sum_{i=1}^{n}|y_i - y'_i|$
  - **MSE (Mean Squared Error):**
    - $\frac{1}{n}\sum_{i=1}^{n}(y_i - y'_i)^2$
  - Polynomial regression model of degree $d$: $y = \beta_0 + \beta_1 x + \beta_2 x^2 + \dots + \beta_d x^d + \varepsilon$
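
The MAE and MSE definitions above translate directly into code. The sketch below uses made-up prices (an assumption for illustration) to show how a single large error dominates MSE much more than MAE.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of the squared differences."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

actual = [100, 200, 300, 400]       # hypothetical true values
predicted = [110, 190, 300, 460]    # errors: 10, 10, 0, 60

print(mean_absolute_error(actual, predicted))  # (10 + 10 + 0 + 60) / 4
print(mean_squared_error(actual, predicted))   # (100 + 100 + 0 + 3600) / 4
```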

- **Multivariate linear regression:**

  - Here, you try to predict a dependent variable based on two or more independent variables. For example, predicting a house's price based on the number of bedrooms and the size in square feet. In the model below, $y$ is the dependent variable, $x_1, \dots, x_n$ are the independent variables, $\beta_0, \dots, \beta_n$ are the coefficients, and $\varepsilon$ is the error term.
  - Model: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$

- **Polynomial regression:**

  - This is a form of regression where the relationship between the independent and the dependent variables is modelled as an nth degree polynomial. It is useful when the relationship is not linear but rather curvilinear. For example, predicting crop yield based on temperature, where the yield might increase with temperature to a point, but then decrease with further temperature increase.

- **Regularization (Lasso, Ridge):**

  - Regularization helps to prevent overfitting by adding a penalty term to the loss function. Ridge regression (L2 regularization) penalizes the sum of the squares of the coefficients, while Lasso (L1 regularization) penalizes the sum of their absolute values. The penalty encourages smaller coefficients, which leads to simpler models; Lasso can shrink some coefficients to exactly zero.
  - Lasso (L1 regularization): Sum of absolute values of coefficients
  - Ridge (L2 regularization): Sum of squares of coefficients
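
To make the two penalty terms concrete, here is a small sketch that adds either penalty to a given MSE. The function name, the weighting factor `alpha`, and the sample coefficients are assumptions for illustration; the intercept is left out of the penalty, as is common practice.

```python
def penalized_loss(coefficients, mse, alpha, kind):
    """Return MSE plus an L1 (lasso) or L2 (ridge) penalty on the coefficients."""
    if kind == "lasso":
        penalty = sum(abs(b) for b in coefficients)   # sum of absolute values
    elif kind == "ridge":
        penalty = sum(b ** 2 for b in coefficients)   # sum of squares
    else:
        raise ValueError("kind must be 'lasso' or 'ridge'")
    return mse + alpha * penalty

coefficients = [3.0, -0.5, 0.0]  # hypothetical beta_1..beta_3, intercept excluded
print(penalized_loss(coefficients, mse=2.0, alpha=0.1, kind="lasso"))
print(penalized_loss(coefficients, mse=2.0, alpha=0.1, kind="ridge"))
```

The ridge penalty grows quadratically with a coefficient, so the large coefficient 3.0 contributes far more under ridge than under lasso.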

- **Cross Validation:**

  - Using K-fold (5 or 10 usually) to get a more accurate picture of the error.
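
A minimal sketch of how K-fold splitting works: every sample lands in the validation set exactly once, and the model is trained on the remaining folds each time. The helper name is hypothetical, and the sketch does not shuffle, which real data usually requires first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

The cross-validated error is then the average of the k validation errors.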

- **Regression with categorical values:**
  - Binary data (gender) = convert to [0,1]
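  - One-hot encoding = more than 2 values (convert each value to its own 0/1 column). Rule of thumb: up to about 10-15 distinct values
  - Label encoding = more than 2 values (give each value an integer label and keep a register of the mapping). Rule of thumb: more than 15 distinct values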
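
The two encodings can be sketched in a few lines. The function names and the `colors` sample are assumptions for illustration; categories are sorted only to make the output deterministic.

```python
def one_hot_encode(values):
    """Return the category order and one 0/1 column per category for each value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

def label_encode(values):
    """Return the register (value -> integer label) and the encoded values."""
    register = {c: i for i, c in enumerate(sorted(set(values)))}
    return register, [register[v] for v in values]

colors = ["red", "green", "blue", "green"]  # hypothetical categorical feature
print(one_hot_encode(colors))
print(label_encode(colors))
```

Note that label encoding introduces an artificial order (blue < green < red), which a linear regression will treat as meaningful.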


### **2. Classification:**


- Supervised learning
- Classification involves predicting a categorical (something discrete) label for an input. Useful to detect language, decide whether to give a loan or not, classify sentiment.
- Problems that can be solved by a linear classifier are called _linearly separable_.
- Default rate:

  - size (most_common_class) / size(dataset)
  - For balanced classes: 1/N, where N is the number of classes.

- **Logistic Regression:**
- $Logistic(x) = \frac{1}{1+e^{-x}}$
- This is a binary classification method that models the probability that an instance belongs to the default class. For example, it can be used to predict whether an email is spam (1) or not spam (0). The log-odds range from $-\infty$ to $+\infty$; the logistic function maps them to a probability between 0 and 1.
- $\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
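
The relationship between the linear model, the log-odds, and the probability can be sketched as follows. The coefficient values in the usage line are made up for illustration.

```python
import math

def logistic(x):
    """Map any real number to the interval (0, 1)."""
    return 1 / (1 + math.exp(-x))

def log_odds(p):
    """Inverse of the logistic function: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def predict_probability(features, intercept, coefficients):
    """Compute beta_0 + sum(beta_i * x_i) and squash it into a probability."""
    z = intercept + sum(b * x for b, x in zip(coefficients, features))
    return logistic(z)

# Hypothetical model: intercept -1.0, one coefficient 0.5
print(predict_probability([2.0], intercept=-1.0, coefficients=[0.5]))
```

A linear score of 0 corresponds to a probability of exactly 0.5, the usual decision threshold.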
- **Confusion Matrix:**

|                 | Predicted Positive | Predicted Negative |
| --------------- | ------------------ | ------------------ |
| Actual Positive | TP                 | FN                 |
| Actual Negative | FP                 | TN                 |

Where:

- TP = True Positives: The cases in which the model predicted Positive, and the truth is also Positive.
- FP = False Positives (Type I error): The cases in which the model predicted Positive, but the truth is Negative.
- FN = False Negatives (Type II error): The cases in which the model predicted Negative, but the truth is Positive.
- TN = True Negatives: The cases in which the model predicted Negative, and the truth is also Negative.

  - This is a table that describes the performance of a classification model. It includes True Positives (actual positive and predicted positive), True Negatives (actual negative and predicted negative), False Positives (actual negative but predicted positive), and False Negatives (actual positive but predicted negative).
  - **Accuracy:**
    - Accuracy is the ratio of correctly predicted instances to the total instances. Overall, how often is the model correct?
    - $(TP + TN) / (TP + TN + FP + FN)$
  - **Precision:**
    - Precision is the ratio of correctly predicted positive instances to the total predicted positives. When the model predicts Positive, how often is it correct?
    - $TP / (TP + FP)$
  - **Recall:**
    - Recall (or Sensitivity) is the ratio of correctly predicted positive instances to all actual positives. When the truth is Positive, how often does the model predict it?
    - $TP / (TP + FN)$
  - **Misclassification Rate (Error Rate):**
    - Overall, how often is the model wrong?
    - $Error Rate = (FP+FN) / (TP+FP+FN+TN)$
  - **F1 Score:**
    - F1 Score is the harmonic mean of Precision and Recall. It's useful when the data has imbalanced classes.
    - $2*(Precision*Recall)/(Precision + Recall)$

**Let's use the following matrix to do calculations:**

|              | Predicted Class 1 | Predicted Class 2 | Predicted Class 3 |
| ------------ | ----------------- | ----------------- | ----------------- |
| True Class 1 | 50                | 10                | 5                 |
| True Class 2 | 20                | 60                | 5                 |
| True Class 3 | 10                | 10                | 70                |

1. For class 1:

   - Precision for Class 1 = TP / (TP + FP) = 50 / (50 + 30) = 0.625
   - Recall for Class 1 = TP / (TP + FN) = 50 / (50 + 15) = 0.77
   - F1 Score for Class 1 = 2 \* (Precision \* Recall) / (Precision + Recall) = 2 \* (0.625 \* 0.77) / (0.625 + 0.77) = 0.69
   - Accuracy for Class 1 = (TP + TN) / (TP + TN + FP + FN) = (50 + 145) / (50 + 145 + 30 + 15) = 0.81, where TN = 60 + 5 + 10 + 70 = 145 (all cells outside row 1 and column 1)
   - Error rate for Class 1 = 1 - Accuracy = 1 - 0.81 = 0.19

2. For the entire confusion matrix:

   First, let's calculate the overall TP, FP, and FN:

   - Total TP = Sum of diagonal elements = 50 + 60 + 70 = 180
   - Total FP = Sum of all non-diagonal elements in predicted columns = 30 (from Class 1 column) + 20 (from Class 2 column) + 10 (from Class 3 column) = 60
   - Total FN = Sum of all non-diagonal elements in true class rows = 15 (from Class 1 row) + 25 (from Class 2 row) + 20 (from Class 3 row) = 60
   - Total FP and Total FN count the same 60 off-diagonal cells: every misclassified instance is a FN for its true class and a FP for its predicted class. An overall TN is therefore not meaningful; TN is only defined per class (one-vs-rest).
   - Accuracy for the matrix = Total TP / Sum of all elements = 180 / 240 = 0.75
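
The per-class and overall figures for this matrix can be checked with a short script. This is a sketch under the convention used in the table (rows are true classes, columns are predicted classes); the function names are assumptions.

```python
def per_class_counts(matrix, k):
    """One-vs-rest TP, FP, FN, TN for class index k."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]
    fp = sum(row[k] for row in matrix) - tp   # rest of column k
    fn = sum(matrix[k]) - tp                  # rest of row k
    tn = total - tp - fp - fn                 # everything outside row k and column k
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

matrix = [
    [50, 10, 5],    # true class 1
    [20, 60, 5],    # true class 2
    [10, 10, 70],   # true class 3
]
class_1 = metrics(*per_class_counts(matrix, 0))
overall_accuracy = sum(matrix[i][i] for i in range(3)) / sum(map(sum, matrix))
print(class_1, overall_accuracy)
```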

**ROC Curve:**

- The ROC curve is a plot of the True Positive Rate (Recall) vs the False Positive Rate (1 - Specificity) for different classification thresholds. It shows the tradeoff between sensitivity and specificity (TPR: y-axis, FPR: x-axis). FYI: a random model gives the diagonal line from (0,0) to (1,1), with an area under the curve (AUC) of 0.5.
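- The ROC curve is useful for comparing classifiers across all thresholds; when the classes are very imbalanced, the precision-recall curve is often more informative.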
- The ROC curve answers the question: what happens if we move the decision threshold up and down?
- Advantages of the ROC curve: (1) Visualize TPR against FPR (= 1 - TNR). (2) Compare different models.
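
To illustrate what "moving the threshold" means, the sketch below computes one (FPR, TPR) point per threshold and the area under the resulting curve. The labels and scores are made-up sample data, and the code assumes both classes occur at least once.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) pairs obtained by lowering the threshold step by step."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]                 # 1 = positive, 0 = negative
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]     # hypothetical model scores
points = roc_points(labels, scores)
print(points, auc(points))
```

An AUC close to 1 means positives are almost always scored above negatives; 0.5 corresponds to the random diagonal.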

**kNN (k-Nearest Neighbors) Classification:**

- Use an odd k in kNN to avoid ties (with two classes).
- Distance types:
  - (1) Euclidean (L2): straight-line distance between two points (Pythagorean theorem), in the plane $d(p,q) = \sqrt{(q_1-p_1)^2+(q_2-p_2)^2}$.
  - (2) Manhattan (L1): sum of the absolute coordinate differences between two points, $d(x,y) = \sum_{i=1}^{n}|x_i-y_i|$.
- Euclidean distance is more sensitive to outliers than Manhattan distance and is often used when the data is dense or continuous. Manhattan distance is often used when the data is sparse or discrete.
- We pick the right K by using cross-validation. Using ONLY the training data.
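
A minimal kNN classifier, assuming small in-memory data: it sorts the training points by distance to the query and takes a majority vote among the k nearest. The sample points and labels are invented for the example.

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def knn_predict(train_points, train_labels, query, k=3, distance=euclidean):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]  # hypothetical data
labels = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(points, labels, (2, 2)))
print(knn_predict(points, labels, (6, 5), distance=manhattan))
```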

**Decision Trees:**

- Decision trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features.
- Decision trees are easy to interpret and visualize.
- Use cases: decide prescriptions, decide whether to approve a loan, decide whether to hire a candidate, etc.
- How to grow a Decision Tree:
  - (1) Start at the root node with all observations
  - (2) Pick attributes and value/threshold to get new nodes.
    - Split should produce pure and equally balanced nodes
    - Select the feature that leads to lowest classification error.
  - (3) When to stop? Purity, less than x points...
- How to find the optimal tree depth: cross validation
- Entropy: measure of impurity in a bunch of examples. $H(X) = -\sum_{i=1}^{n}p(x_i)log_2p(x_i)$. The information gain is then defined as $IG(T,a) = H(T) - H(T|a)$.
  - Calculation steps:
    - (1) Calculate entropy of parent node -> $H(T)$.
    - (2) Calculate the entropy of each child node and take their average, weighted by the share of observations in each child -> $H(T|a)$.
    - (3) Calculate information gain -> $IG(T,a) = H(T) - H(T|a)$.
    - (4) Select the feature that leads to highest information gain.
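
The calculation steps can be sketched as follows, with $H(T|a)$ computed as the size-weighted average of the child entropies. The yes/no labels and the candidate split are made-up data.

```python
import math
from collections import Counter

def entropy(labels):
    """H(X) = -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted_child_entropy

parent = ["yes"] * 5 + ["no"] * 5                          # maximally impure node
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]       # hypothetical split
print(entropy(parent), information_gain(parent, split))
```

A split into two pure children would reach the maximum possible gain here, equal to the parent entropy.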

<image  src='./illustrations/entropy_inf_gain.png' height='200'/>

**Random Forests:**

- Random forests are an ensemble (classifier) learning method for classification and regression. It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.
- Random forests are a bagging (bootstrap aggregating) method.
- Steps to build a random forest:
  - (1) Draw a bootstrap sample of the training data (sampling with replacement)
  - (2) Randomly select k features from the total m features, where k << m
  - (3) Among the k features, determine the node d using the best split point
  - (4) Split the node into daughter nodes using the best split
  - (5) Repeat steps 2 to 4 until the leaf nodes are finalized
  - (6) Build the forest by repeating steps 1 to 5 n times to create n trees
  - (7) Predict by aggregating the trees: majority vote (classification) or mean (regression)


### **3. Clustering:**


- Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups.
- No labels, discover structure by grouping
- Unsupervised learning
- **K-Means:**
  - Runtime complexity: $O(ktn)$ where k is the number of clusters, t is the number of iterations, and n is the number of data points.
  - This algorithm divides a set of samples into disjoint clusters. Each cluster is described by the mean of the samples in the cluster. For example, you might cluster customers into k groups based on their shopping behavior.
  - Initialize k centroids randomly
  - Assign each point to the nearest centroid
  - Recalculate centroids
  - Repeat until convergence
  - How to find the right amount of clusters: elbow method
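
The K-Means loop above can be sketched in plain Python. One deviation is an assumption made for reproducibility: the initial centroids are passed in explicitly instead of being drawn at random (for example with `random.sample(points, k)`).

```python
import math

def k_means(points, centroids, max_iterations=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster.
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # hypothetical data
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids, clusters)
```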
- **Hierarchical clustering:**
  - Runtime complexity: At least $O(n^2)$ where n is the number of data points (single-linkage)
  - This algorithm builds a hierarchy of clusters by creating a cluster tree or dendrogram. It can either start with all individual instances and merge them into clusters (_agglomerative_ : bottom-up), or start with a single cluster and divide it up (_divisive_ : top-down).
  - Start with each point as a separate cluster
  - Merge the two closest clusters
  - Repeat until one single cluster remains
  - Hierarchical clustering works with any distance measure (k-means is tied to Euclidean distance).
  - How to derive the number of clusters: cut the dendrogram where the vertical distance between merges is largest.
  - Single-linkage: distance between two clusters is defined as the shortest distance between points in the two clusters.
  - Complete-linkage: distance between two clusters is defined as the longest distance between points in the two clusters.
  - Avg-linkage: distance between two clusters is defined as the average distance between points in the two clusters.
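  - Ward's linkage: distance between two clusters is defined as the increase in the within-cluster sum of squares that merging them would cause.
  - Impractical for large datasets (rule of thumb: more than about 10,000 data points), since the pairwise distance matrix grows as $O(n^2)$.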


### **4. Association Rules:**


- Association rules are widely used to analyze retail basket or transaction data, and are intended to identify strong rules discovered in transaction data using measures of interestingness, based on the concept of strong rules.
- Unsupervised learning
- Itemset = set of items

  - 1-itemset: set of 1 item
  - 2-itemset: set of 2 items
  - ...

- Frequency of an itemset = number of transactions containing that itemset
- Monotonicity property = if an itemset is frequent, then all of its subsets are frequent
- Support = frequency of an itemset / total number of transactions (A=>B, typically expressed as a percentage, "occurs together")
- Confidence = support of itemset A and B / support of itemset A (A=>B, typically expressed as a percentage, "conditional probability")
- Lift = confidence / support of itemset B (a ratio, not a percentage; "how much more likely"). 1: No association, >1: Positive association, <1: Negative association

- **Apriori Algorithm:**

  - This is a popular algorithm for extracting frequent itemsets with applications in association rule learning. The Apriori algorithm uses a breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.
  - Given a threshold for minimum support, find all itemsets in the dataset meeting that threshold
  - Generate rules from those itemsets meeting the minimum confidence

- **Example:**

Given the following buckets:

- a,b,c
- a,c
- a,d
- b,e,f

Minimum support = 50%, minimum confidence = 70%

- Frequent 1-itemsets (support count): a(3), b(2), c(2), d(1), e(1), f(1). With 4 buckets, 50% support means a count of at least 2, so a, b, c are the only ones with support >= 50%.
- Candidate 2-itemsets, built only from the frequent items a, b, c (support count): ab(1), ac(2), bc(1). ac is the only one with support >= 50%.
- Candidate 3-itemsets: abc(1) is the only possibility, and it is pruned because its subsets ab and bc are not frequent. None with support >= 50%.
- Generate rules from the frequent itemset ac and measure their confidence.
- Confidence: a=>c = 2/3 = 67%, c=>a = 2/2 = 100%. Only c=>a has confidence >= 70%.
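
The example can be reproduced with a small level-wise search. This is a simplified sketch of the Apriori idea rather than an optimized implementation; the helper names are assumptions.

```python
from itertools import combinations

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "e", "f"}]

def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t) / len(transactions)

def frequent_itemsets(min_support):
    """Level-wise search: size-k candidates are built from frequent (k-1)-itemsets."""
    items = sorted(set().union(*transactions))
    frequent, size = {}, 1
    candidates = [(item,) for item in items]
    while candidates:
        level = {c: support(c) for c in candidates if support(c) >= min_support}
        frequent.update(level)
        size += 1
        survivors = sorted({item for c in level for item in c})
        # Pruning: keep a candidate only if all of its (size-1)-subsets are frequent.
        candidates = [c for c in combinations(survivors, size)
                      if all(sub in level for sub in combinations(c, size - 1))]
    return frequent

def confidence(antecedent, consequent):
    return support(tuple(antecedent) + tuple(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

frequent = frequent_itemsets(0.5)
print(frequent)
print(confidence(("a",), ("c",)), confidence(("c",), ("a",)), lift(("c",), ("a",)))
```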


### **5. Recommender Systems:**

- Recommender systems are used to suggest products, services, information to users based on their past preferences.
- Supervised learning
- relevance: how relevant is the item to the user
- diversity: how diverse are the items in the recommendation
- serendipity: how surprising are the items in the recommendation
- matrix factorization: factorize the user-item rating matrix into two smaller matrices (user factors and item factors)
- Cold start: when a new user or item enters the system, there is no information about them yet. This is called the cold start problem.
  - Problem for: collaborative filtering, content based recommenders
  - Not a problem for: knowledge based recommenders, hybrid recommenders
- Precision-recall curve vs ROC curve: the precision-recall curve is preferred when the classes are very imbalanced; the ROC curve works well when the classes are roughly balanced.

- **User-User Collaborative Filtering:**

  - This method predicts a user's interest by collecting preferences from many users. The premise of collaborative filtering is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue.
  - Given a user u, predict items for u by finding users similar to u, and suggest items those similar users liked
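
A toy version of user-user collaborative filtering, assuming ratings are stored per user and using cosine similarity as the similarity measure (one common choice, not the only one). The users, items, and ratings are invented for illustration.

```python
import math

ratings = {  # hypothetical data: user -> {item: rating}
    "alice": {"matrix": 5, "inception": 4},
    "bob": {"matrix": 5, "inception": 5, "interstellar": 4},
    "carol": {"matrix": 1, "titanic": 5},
}

def cosine_similarity(u, v):
    """Cosine similarity of two users' rating vectors (missing ratings count as 0)."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in shared)
    norm_u = math.sqrt(sum(r ** 2 for r in ratings[u].values()))
    norm_v = math.sqrt(sum(r ** 2 for r in ratings[v].values()))
    return dot / (norm_u * norm_v)

def recommend(user, top_n=1):
    """Score unseen items by the similarity-weighted ratings of the other users."""
    scores = {}
    for other in ratings:
        if other == user:
            continue
        sim = cosine_similarity(user, other)
        for item, rating in ratings[other].items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("alice"))
```

Alice's ratings resemble Bob's far more than Carol's, so Bob's unseen item outranks Carol's.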

- **Content based recommenders:**
  - This method predicts a user's interest based on the similarity between the content of the items and a user profile. The premise of content-based filtering is that if a person A likes a particular item, he or she will also like an item that is similar to it.
  - Given a user u, predict items for u by finding items similar to items u liked


### **6. Text Analytics:**


Text Analytics is the process of converting unstructured text data into meaningful data.

- **Tokenization:** Breaking down a stream of text into words, phrases, symbols, or other meaningful elements called tokens.
- Corpus: collection of documents
- Document: collection of sentences and words
- Token: a word, number, or other "meaningful" element
- Vocabulary: set of unique tokens
- Stemming = removing the suffixes of words (Porter Stemmer)
- Word embedding: mapping words or phrases from the vocabulary to vectors of real numbers
- Word semantics: the meaning of words
- Word2Vec: a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. There are two variants: Skip-gram and Continuous Bag of Words (CBOW).

- **Lemmatization:** Reducing inflected words to their dictionary base form (lemma), so that, unlike with stemming, the result is always a valid word of the language (e.g. "better" -> "good").

- **Bag of Words, TF-IDF:** These are methods to convert text data into numerical vectors. Bag of Words counts the occurrence of each word in a document while TF-IDF increases proportionally to count but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently. BOW works well with linear models.
  - Cosine similarity (the vectors below correspond to sentences such as S1 = "I like cats and dogs" and S2 = "My cat does not like your dog")
    - Step 1: BOW Vocabulary -
      - Unique words are: ['I', 'like', 'cats', 'and', 'dogs', 'My', 'cat', 'does', 'not', 'your', 'dog']
    - Step 2: Vectorize sentences -
      - Vector S1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
      - Vector S2: [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    - Step 3: Calculate cosine similarity -
      - Dot product (sum of the products of the corresponding entries of the two vectors):
      - Dot product = (1\*0) + (1\*1) + (1\*0) + (1\*0) + (1\*0) + (0\*1) + (0\*1) + (0\*1) + (0\*1) + (0\*1) + (0\*1)
      - Dot product = 1
    - Magnitude of vectors:
      - Magnitude of S1 = sqrt((1^2) + (1^2) + (1^2) + (1^2) + (1^2) + (0^2) + (0^2) + (0^2) + (0^2) + (0^2) + (0^2)) = sqrt(5)
      - Magnitude of S2 = sqrt((0^2) + (1^2) + (0^2) + (0^2) + (0^2) + (1^2) + (1^2) + (1^2) + (1^2) + (1^2) + (1^2)) = sqrt(7)
    - Cosine similarity:
      - Cosine similarity = Dot product / (Magnitude of S1 \* Magnitude of S2) = 1 / (sqrt(5) \* sqrt(7)) = 1 / sqrt(35) ≈ 0.17
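
The cosine similarity steps can be checked with a few lines of Python. The two sentences are an assumption: they are reconstructed from the vectors S1 and S2 and the vocabulary, since bag of words does not preserve word order.

```python
import math

def bag_of_words(sentence, vocabulary):
    """Count how often each vocabulary word occurs in the sentence."""
    tokens = sentence.split()
    return [tokens.count(word) for word in vocabulary]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot / (magnitude_a * magnitude_b)

vocabulary = ['I', 'like', 'cats', 'and', 'dogs', 'My', 'cat', 'does', 'not', 'your', 'dog']
s1 = bag_of_words("I like cats and dogs", vocabulary)           # assumed sentence
s2 = bag_of_words("My cat does not like your dog", vocabulary)  # assumed sentence
print(s1, s2, round(cosine_similarity(s1, s2), 2))
```

The only shared word is "like", which is why the similarity is low even though both sentences are about cats and dogs.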

- **TF-IDF (Term Frequency-Inverse Document Frequency):** $TF(t, d) * IDF(t)$
  - Term frequency: TF(t, d) = Number of times term t/word appears in the document d (can contain several words and sentences)
  - $IDF(t) = log(N / df(t))$ where N is total number of documents and df(t) is document frequency i.e., number of documents that contain term t
  - Dimensionality: the dimensionality of the vector representing a document is equal to the size of the vocabulary of all documents. This vocabulary is formed by distinct words across all documents. Given D1 = "the cat is gray", D2 = "my cat is fast", the dimensionality of D1 is thus 6.
  - A high TF-IDF: a high weight in TF-IDF is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents.
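
A minimal TF-IDF sketch using the raw count as term frequency and $log(N / df(t))$ as IDF, applied to the two example documents. Whitespace tokenization and the natural logarithm are assumptions; libraries often use smoothed variants.

```python
import math

def tf_idf(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(documents)
    # Document frequency: number of documents containing the term.
    df = {term: sum(1 for tokens in tokenized if term in tokens) for term in vocabulary}
    vectors = [
        [tokens.count(term) * math.log(n_docs / df[term]) for term in vocabulary]
        for tokens in tokenized
    ]
    return vocabulary, vectors

vocabulary, vectors = tf_idf(["the cat is gray", "my cat is fast"])
print(vocabulary)
print(vectors[0])
```

Words that occur in every document ("cat", "is") get a weight of 0, while words unique to one document get the highest weight.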
- **The Jaccard similarity:** also known as the Jaccard coefficient or the Jaccard index, is a statistical measure used for comparing the similarity and diversity of sample sets. It's often used in natural language processing and information retrieval to estimate the similarity between documents or text. Given two sets, A and B, the Jaccard similarity is computed as the size of the intersection divided by the size of the union of the two sets. More formally, it can be defined as:

  - J(A, B) = |A ∩ B| / |A ∪ B|
  - Where:

    - |A ∩ B| is the number of elements in both sets (i.e., the intersection of A and B)
    - |A ∪ B| is the number of elements in either set, duplicates removed (i.e., the union of A and B)

  - In the context of text analysis, sets A and B could be sets of words (or n-grams) in two different documents. The Jaccard similarity then measures how many words the documents have in common, divided by the total number of unique words across both documents.

  - For example, for two sentences:

    - S1: "I like cats"
    - S2: "I like dogs"

    The Jaccard similarity would be calculated as follows:

    - A = {I, like, cats}
    - B = {I, like, dogs}

    - |A ∩ B| = 2 (the words "I" and "like" are common to both sets)
    - |A ∪ B| = 4 (the set of all unique words is {I, like, cats, dogs})

    Therefore, J(A, B) = 2 / 4 = 0.5

    So, the Jaccard similarity of sentences S1 and S2 is 0.5, meaning they share 50% of their unique words.
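
The same calculation as a short function, assuming simple whitespace tokenization:

```python
def jaccard_similarity(text_a, text_b):
    """Size of the intersection divided by size of the union of the word sets."""
    a, b = set(text_a.split()), set(text_b.split())
    return len(a & b) / len(a | b)

print(jaccard_similarity("I like cats", "I like dogs"))
```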

**Edit Distance:**

- Edit distance is a way of quantifying how dissimilar two strings (e.g., words) are to one another by counting the minimum number of operations required to transform one string into the other. Quadratic time complexity. Dynamic programming algorithm. Levenshtein distance is a string metric for measuring the difference between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
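
A sketch of the dynamic programming algorithm for the Levenshtein distance. It keeps only the previous row of the table, which reduces memory but leaves the quadratic running time unchanged.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into the empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution_cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("kitten", "sitting"))
```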


### **7. Neural Networks:**

- "Stacking" multiple logistic regression models together
- Tensor = multidimensional array (input)
- Fine tune the weights of the network to minimize the error
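- How? Backpropagation computes the gradient of the error with respect to each weight; gradient descent uses these gradients to update the weights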
- Number of outputs = number of classes to be predicted
- Softmax = activation function for the output layer. A way to get probabilities in [0,1] that sum to 1
  - Defined as: $softmax(x)_i = \frac{exp(x_i)}{\sum_{j=1}^{n}exp(x_j)}$,
  - If we have very large numbers: $softmax(x)_i = \frac{exp(x_i - c)}{\sum_{j=1}^{n}exp(x_j - c)}$, where $c = max(x_i)$
- Neuron = basic unit of a neural network, set of parameters (weights and biases) to be learnt.
- Learning rate = how fast the model learns, how big the steps are during gradient descent
  - Too big: overshoot the minimum, missing the optimal state
  - Too small: takes too long to converge
- epoch = one forward pass and one backward pass of all the training examples
- batch size = number of training examples in one forward/backward pass
- number of iterations = number of passes, each pass using [batch size] number of examples
- Ways to fight overfitting:
  - Early stopping (stop training when validation error starts to increase)
  - Dropout (randomly remove some neurons)
  - Regularization (L1, L2)
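
Two of the definitions above lend themselves to a short sketch: the numerically stable softmax (subtracting $c = max(x_i)$ before exponentiating) and the relation between dataset size, batch size, and iterations per epoch. The function names and the numbers in the usage lines are assumptions for illustration.

```python
import math

def softmax(values):
    """Stable softmax: shifting by the maximum avoids overflow in exp()."""
    c = max(values)
    exps = [math.exp(v - c) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def iterations_per_epoch(n_examples, batch_size):
    """Number of batches needed to see every training example once."""
    return math.ceil(n_examples / batch_size)

print(softmax([1.0, 2.0, 3.0]))
print(softmax([1000.0, 1000.0]))        # math.exp(1000.0) alone would overflow
print(iterations_per_epoch(1000, 32))
```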

A neural network takes in inputs, which are then processed in hidden layers using weights that are adjusted during training. The more hidden layers there are, the deeper the network is, and the more complex the patterns it can detect. The output layer then gives the final output for the given inputs. The network is trained by adjusting the weights between layers until the output error is minimized.

- **Artificial Neuron:** It is the basic unit of a neural network. It gets certain inputs, processes them based on some activation function and gives an output. Input is processed using a weighted sum followed by a non-linear function (often a Sigmoid or ReLU)
- **Multilayer Perceptron:** This is a class of feedforward artificial neural network which consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Each node in one layer connects with a certain weight to every node in the following layer.


### **8. Graph Analytics:**

- **Centrality (Degree, Closeness, Betweenness, Eigenvector)**: Measures to identify the most important vertices within a graph
- **PageRank:**
  - PageRank is a link analysis algorithm that assigns a numerical weighting to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set. The algorithm may be applied to any collection of entities with reciprocal quotations and references. The numerical weight that it assigns to any given element E is referred to as the PageRank of E and denoted by PR(E).
  - Video Explanation : https://www.youtube.com/watch?v=P8Kt6Abq_rM
  - Recipe for computing page rank:
    - Start with a graph
    - Initialize all nodes with a page rank of 1
    - For each iteration:
      - For each node:
        - Calculate the page rank of the node by summing, over every node that points to it, that node's page rank divided by its number of outgoing links
        - Update the page rank of the node
    - Repeat until convergence (= until the page rank of each node doesn't change much)
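
The recipe can be sketched as below. Two things are assumptions beyond the recipe: a damping factor (0.85 is the commonly used value) is added so the iteration converges on arbitrary graphs, and every node is assumed to have at least one outgoing link. The three-node graph is made up.

```python
def page_rank(graph, damping=0.85, tolerance=1e-9, max_iterations=1000):
    """graph maps each node to the list of nodes it links to."""
    nodes = list(graph)
    ranks = {node: 1.0 for node in nodes}  # initialize every node with rank 1
    for _ in range(max_iterations):
        new_ranks = {}
        for node in nodes:
            # Each in-neighbor passes on its rank split over its outgoing links.
            incoming = sum(ranks[src] / len(graph[src]) for src in nodes if node in graph[src])
            new_ranks[node] = (1 - damping) + damping * incoming
        converged = max(abs(new_ranks[v] - ranks[v]) for v in nodes) < tolerance
        ranks = new_ranks
        if converged:
            break
    return ranks

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}  # hypothetical link graph
ranks = page_rank(graph)
print(ranks)
```

With this initialization the ranks always sum to the number of nodes; C ranks highest because it receives links from both A and B.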


### **9. Dimensionality Reduction:**

- Unsupervised learning technique that reduces the number of features in a dataset by obtaining a set of principal features. It can be used for data compression, feature extraction, and data visualization.
- **Feature Selection:** Selecting a subset of the most relevant features for use in model construction
- **Feature Extraction:** Creating new features from the existing ones
- **Feature Transformation:** Transforming the features to a new space
- **Feature Scaling:** Scaling the features to a specific range

- **PCA (Principal Component Analysis):** A statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components
