## Cheat Sheet

### **1. Regression:**

 Regression is a statistical method that helps you understand the relationship between dependent and independent variables. It's mainly used for prediction and forecasting and is a supervised learning task.

   * **Multivariate linear regression:**
       * Here, you try to predict a dependent variable based on two or more independent variables. For example, predicting a house's price based on the number of bedrooms and the size in square feet. The formula is y = β0 + β1*x1 + β2*x2 + ... + βn*xn + ε, where y is the dependent variable, x1, x2, ..., xn are the independent variables, β0, β1, ..., βn are the coefficients, and ε is the error term.
       * $Model: y = β0 + β1*x1 + β2*x2 + ... + βn*xn + ε$
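
A minimal sketch of fitting this model with ordinary least squares, assuming NumPy is available. The house data and the coefficients in `true_beta` are made up for illustration, and the error term ε is set to zero so the recovered coefficients can be checked by eye.

```python
import numpy as np

# Hypothetical toy data: [bedrooms, size in square feet] for five houses.
X = np.array([[2, 800], [3, 1200], [3, 1500], [4, 2000], [5, 2400]], dtype=float)

# Prices generated from known (made-up) coefficients β0, β1, β2 with no noise.
true_beta = np.array([50_000.0, 10_000.0, 100.0])
y = true_beta[0] + X @ true_beta[1:]

# Prepend a column of ones so the intercept β0 is estimated with the slopes.
X_design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimise ||X_design @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def predict(bedrooms, sqft):
    """Predicted price y = β0 + β1*bedrooms + β2*sqft."""
    return beta[0] + beta[1] * bedrooms + beta[2] * sqft
```

Because the toy prices contain no noise, `beta` comes out essentially equal to `true_beta`; with real data the fit would only approximate the underlying relationship.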
   
   * **Polynomial regression:**
       * This is a form of regression where the relationship between the independent and the dependent variables is modelled as an nth degree polynomial. It is useful when the relationship is not linear but rather curvilinear. For example, predicting crop yield based on temperature, where the yield might increase with temperature to a point, but then decrease with further temperature increase.
       * $Model: y = β0 + β1*x + β2*x^2 + ... + βd*x^d + ε$
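
A short sketch of the crop-yield example, assuming NumPy and a made-up, noiseless quadratic relationship that peaks at 25 degrees. `np.polyfit` returns the coefficients with the highest power first.

```python
import numpy as np

# Hypothetical data: yield rises with temperature, peaks, then falls.
temperature = np.array([10.0, 15.0, 20.0, 25.0, 30.0, 35.0])
crop_yield = -0.1 * (temperature - 25.0) ** 2 + 40.0

# Fit a degree-2 polynomial y = b0 + b1*x + b2*x^2.
b2, b1, b0 = np.polyfit(temperature, crop_yield, deg=2)

# The fitted parabola peaks where its derivative b1 + 2*b2*x equals zero.
peak_temperature = -b1 / (2 * b2)
```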
   
   * **Regularization (Lasso, Ridge):**
       * Regularization helps to prevent overfitting by adding a penalty term to the loss function. Ridge regression (L2 regularization) penalizes the sum of the squares of the coefficients, while Lasso (L1 regularization) penalizes the sum of their absolute values. The penalty encourages smaller coefficients, which leads to simpler models; Lasso can shrink some coefficients exactly to zero.
       * Lasso (L1 regularization): penalty = $λ * Σ|βj|$ (sum of absolute values of coefficients)
       * Ridge (L2 regularization): penalty = $λ * Σβj^2$ (sum of squares of coefficients)
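
A rough sketch of the shrinkage effect, assuming NumPy and a model without intercept. Ridge has the closed-form solution $(X^T X + λI)^{-1} X^T y$; Lasso has no closed form, so only its penalty is computed here. The data and the value `lam=10.0` are arbitrary choices for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution without intercept; lam=0 gives plain least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def l1_penalty(beta):
    """Lasso penalty: sum of absolute values of the coefficients."""
    return np.sum(np.abs(beta))

def l2_penalty(beta):
    """Ridge penalty: sum of squared coefficients."""
    return np.sum(beta ** 2)

# Made-up data generated from y = 1*x1 + 2*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

beta_ols = ridge_fit(X, y, lam=0.0)     # recovers the coefficients [1, 2]
beta_ridge = ridge_fit(X, y, lam=10.0)  # coefficients are pulled towards zero
```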
   
   * **RSS, MAE & MSE:**
       * These are metrics to evaluate a regression model, where $y_i$ is the actual value and $y'_i$ the predicted value. RSS is the sum of the squared residuals. MAE is the average of the absolute differences between the predicted and actual values. MSE is the average of the squared differences. MSE penalizes larger errors more than MAE.
       * **RSS (Residual Sum of Squares):**
          * $Σ(y_i - (w_0 + w_1*x_i))^2$
       * **MAE (Mean Absolute Error):** 
         * $1/n * Σ|y_i - y'_i|$
       * **MSE (Mean Squared Error):** 
         * $1/n * Σ(y_i - y'_i)^2$
         * For the root mean squared error (RMSE), take the square root: $sqrt(MSE)$
       * **$R^2$** = $1 - RSS / Σ(y_i - mean(y))^2$
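
The formulas above can be sketched in plain Python as follows. The function name `regression_metrics` and the four sample values are illustrative only.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return RSS, MAE, MSE, RMSE and R^2 for paired actual/predicted values."""
    n = len(y_true)
    residuals = [yt - yp for yt, yp in zip(y_true, y_pred)]
    rss = sum(r ** 2 for r in residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = rss / n
    rmse = math.sqrt(mse)
    mean_y = sum(y_true) / n
    tss = sum((yt - mean_y) ** 2 for yt in y_true)  # total sum of squares
    r2 = 1 - rss / tss
    return {"RSS": rss, "MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}

# Residuals are 1, 0, -1, 0, so RSS = 2, MAE = 0.5, MSE = 0.5, R^2 = 0.9.
metrics = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
```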

### **2. Classification:**
   * Classification involves predicting a categorical label for an input. Supervised learning.
   * **Logistic Regression:** 
   * This is a binary classification method that models the probability p that an instance belongs to the positive class. For example, it can be used to predict whether an email is spam (1) or not spam (0).
   * $log(p/(1-p)) = β0 + β1*x1 + β2*x2 + ... + βn*xn$
   * **Confusion Matrix:**

|       | Predicted Positive | Predicted Negative |
|-------|--------------------|--------------------|
| Actual Positive | TP               | FN                |
| Actual Negative | FP               | TN                |

Where:
- FP = False Positives (Type I error): The cases in which the model predicted Positive, but the truth is Negative.
- FN = False Negatives (Type II error): The cases in which the model predicted Negative, but the truth is Positive.
  
   * **Accuracy:** 
       * Accuracy is the ratio of correctly predicted instances to the total instances. Overall, how often is the model correct? 
       * $(TP + TN) / (TP + TN + FP + FN)$ Correct prediction/total predictions   
   * **Default rate:** 
       * For c equally sized classes, the default rate (accuracy of random guessing) is $1/c$   
  For unbalanced classes:       
   * **Precision:** 
       * Of the ones we classified as Class X, how many are actually Class X?
       * $TP / (TP + FP)$ 
   * **Base rate:** 
       * If we always predict the largest class, how often are we right? The model should be better than that.
       * $size(largest class) / n$, where n is the total number of instances
   * **Recall:** 
     * Recall (or Sensitivity): of all the ones that actually are Class X, how many did we identify as such?
     * $TP / (TP + FN)$
   * **Misclassification Rate (Error Rate):** 
     * Overall, how often is the model wrong? 
     * $Error Rate = (FP+FN) / (TP+FP+FN+TN)$ Incorrect predictions / total predictions
   * **F1 Score:** 
       * F1 Score is the harmonic mean of Precision and Recall. It's useful when the data has imbalanced classes.
       * $2*(Precision*Recall)/(Precision + Recall)$

**Let's use the following matrix to do calculations (rows are true classes, columns are predicted classes):**


|       | Predicted Class 1 | Predicted Class 2 | Predicted Class 3 |
|-------|-------------------|-------------------|-------------------|
| True Class 1 | 50                | 10                | 5                 |
| True Class 2 | 20                | 60                | 5                 |
| True Class 3 | 10                | 10                | 70                |
   

1. For class 1:

   * Precision for Class 1 = TP / (TP + FP) = 50 / (50 + 30) = 0.625
   * Recall for Class 1 = TP / (TP + FN) = 50 / (50 + 15) = 0.77
   * F1 Score for Class 1 = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.625 * 0.77) / (0.625 + 0.77) = 0.69
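   * Accuracy for Class 1 = (TP + TN) / (TP + TN + FP + FN) = (50 + 145) / (50 + 145 + 30 + 15) = 195 / 240 = 0.81, where TN = 60 + 5 + 10 + 70 = 145 (all instances that are neither truly nor predicted Class 1)
   * Error rate for Class 1 = 1 - Accuracy = 1 - 0.81 = 0.19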

2. For the entire confusion matrix:

   First, let's calculate overall TP, FP, FN, and TN:

   * Total TP = Sum of diagonal elements = 50 + 60 + 70 = 180
   * Total FP = Sum of all non-diagonal elements in predicted columns = 30 (from Class 1 column) + 20 (from Class 2 column) + 10 (from Class 3 column) = 60
   * Total FN = Sum of all non-diagonal elements in true class rows = 15 (from Class 1 row) + 25 (from Class 2 row) + 20 (from Class 3 row) = 60
   * Total TN: a single overall TN is not meaningful in the multi-class case, because every misclassified instance counts as an FP for its predicted class and as an FN for its true class. Counted per class (one-vs-rest), TN is 145 for Class 1, 135 for Class 2, and 140 for Class 3.
   * Accuracy for the matrix = correct predictions / total predictions = Total TP / sum of all elements = 180 / 240 = 0.75
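
The calculations for the 3-class matrix above can be checked with a small plain-Python sketch. The helper names `per_class_metrics` and `overall_accuracy` are illustrative; each class is treated one-vs-rest.

```python
# Rows are true classes, columns are predicted classes.
matrix = [
    [50, 10, 5],
    [20, 60, 5],
    [10, 10, 70],
]

def per_class_metrics(cm, k):
    """One-vs-rest counts and metrics for class index k."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm))) - tp  # column k without the diagonal
    fn = sum(cm[k]) - tp                             # row k without the diagonal
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / total
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "precision": precision,
            "recall": recall, "f1": f1, "accuracy": accuracy}

def overall_accuracy(cm):
    """Share of instances on the diagonal (correct predictions)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)

class1 = per_class_metrics(matrix, 0)
accuracy = overall_accuracy(matrix)
```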

### **2.b. Entropy:**  
Entropy is a measure of disorder and can be used to quantify information: $H = - Σ p_i * log_2(p_i)$, where $p_i$ is the proportion of instances in class i.

To find out which feature to split on:
1. Calculate the entropy of the whole dataset: $H_tot = -p_yes * log_2(p_yes) - p_no * log_2(p_no)$
2. Split on a feature, e.g. feature 1 with options a and b, and calculate the entropy of yes/no within each option, e.g. $H(yes or no | feature 1a)$
3. Weight these by how often each option occurs: $H(yes or no | feature 1) = P(feature 1a) * H(yes or no | feature 1a) + P(feature 1b) * H(yes or no | feature 1b)$
4. Compare with the total entropy: we want it lower than $H_tot$, and the feature that reduces the entropy the most is the winner.
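
A minimal sketch of this procedure in plain Python. The labels and the two candidate features are made up: `feature1` separates yes from no perfectly, `feature2` carries no information.

```python
import math
from collections import Counter

def entropy(labels):
    """H = -sum(p_i * log2(p_i)) over the class proportions p_i."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def conditional_entropy(feature_values, labels):
    """Entropy of the labels within each feature option, weighted by option frequency."""
    n = len(labels)
    total = 0.0
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        total += len(subset) / n * entropy(subset)
    return total

def information_gain(feature_values, labels):
    """Reduction in entropy achieved by splitting on the feature."""
    return entropy(labels) - conditional_entropy(feature_values, labels)

labels = ["yes", "yes", "no", "no"]
feature1 = ["a", "a", "b", "b"]  # each option contains only one class
feature2 = ["a", "b", "a", "b"]  # each option is still a 50/50 mix

gain1 = information_gain(feature1, labels)
gain2 = information_gain(feature2, labels)
```

Here the total entropy is 1 bit; splitting on `feature1` removes all of it, while `feature2` removes none, so `feature1` would be chosen.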

### **3. Clustering:**
   * Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. Unsupervised learning.
   * **K-Means:**
       * Initialize k centroids randomly, assign each point to the nearest centroid, recalculate each centroid as the mean of its assigned points, and repeat the last two steps until convergence
   * **Hierarchical clustering:**
       * This algorithm builds a hierarchy of clusters by creating a cluster tree or dendrogram. It can either start with all individual instances and merge them into clusters (agglomerative), or start with a single cluster and divide it up (divisive).
       * Agglomerative procedure: start with each point as a separate cluster, merge the two closest clusters, and repeat until a single cluster remains
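
The K-Means steps can be sketched as below, assuming NumPy. Initial centroids are drawn from the data points, which is one common choice rather than the only one; the `seed` parameter and the six sample points are illustrative.

```python
import numpy as np

def k_means(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of three points each.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centroids = k_means(points, k=2)
```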

### **4. Association Rules:**
 Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules of the form S → i (baskets containing itemset S tend to also contain item i), using measures of interestingness.
   * **Support:** transactions containing I (for a rule: S and i) / total number of transactions
       * Given a support threshold s, frequent itemsets are those itemsets that appear in at least s% of the baskets.
   * **Confidence:** transactions containing S and i / transactions containing S
   * **Lift:** Confidence(S → i) / Support(i); measures the interestingness of a rule (lift > 1 means S and i occur together more often than expected if they were independent)
   * **Apriori Algorithm:**
       * This is a popular algorithm for extracting frequent itemsets with applications in association rule learning. The Apriori algorithm uses breadth-first search strategy to count the support of itemsets and uses a candidate generation function which exploits the downward closure property of support.
       * Given a threshold for minimum support and confidence, find all itemsets in the dataset meeting that threshold
       * Generate rules from those itemsets meeting the minimum confidence
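
A small plain-Python sketch of the three measures and of a level-wise frequent-itemset search in the spirit of Apriori. The baskets are made up, and `frequent_itemsets` is a simplified illustration, not an optimized implementation.

```python
from itertools import combinations

# Hypothetical baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]

def support(itemset):
    """Share of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Support of S and i together, divided by the support of S."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

def frequent_itemsets(min_support):
    """Breadth-first search: build size-k candidates from frequent size-(k-1) sets."""
    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items if support([i]) >= min_support]
    frequent, size = [], 1
    while current:
        frequent.extend(current)
        size += 1
        known = set(current)
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        # Downward closure: every (size-1)-subset of a frequent set must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in known for s in combinations(c, size - 1))
                   and support(c) >= min_support]
    return frequent

result = frequent_itemsets(min_support=0.6)
```

With these baskets, all four single items and the pair {bread, milk} reach a support of 0.6, and the rule bread → milk has confidence 0.75 and a lift just below 1.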

### **5. Recommender Systems:**
Recommender systems are used to suggest products, services, information to users based on their past preferences.
   * **User-User Collaborative Filtering:**
       * This method predicts a user's interest by collecting preferences from many users. The premise of collaborative filtering is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue.
       * Given a user u, predict items for u by finding users similar to u, and suggest items those similar users liked
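
One possible sketch of user-user collaborative filtering in plain Python. It assumes cosine similarity over co-rated items and a similarity-weighted average as the prediction; both are common choices, not the only ones. The users and ratings are made up.

```python
import math

# Hypothetical ratings: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 5, "B": 5, "C": 1, "D": 5},
    "carol": {"A": 1, "B": 1, "C": 5, "D": 1},
}

def user_similarity(u, v):
    """Cosine similarity over the items both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    norm_u = math.sqrt(sum(ratings[u][i] ** 2 for i in common))
    norm_v = math.sqrt(sum(ratings[v][i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    neighbours = [(user_similarity(user, v), r[item])
                  for v, r in ratings.items() if v != user and item in r]
    total_weight = sum(sim for sim, _ in neighbours)
    if total_weight == 0:
        return None  # nobody comparable has rated the item
    return sum(sim * rating for sim, rating in neighbours) / total_weight

# Alice has not rated item D; Bob is much more similar to her than Carol is.
predicted = predict_rating("alice", "D")
```

Because Bob's tastes match Alice's closely and he rated D highly, the prediction lands nearer to his rating than to Carol's.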

### **6. Text Analytics:**
   Text Analytics is the process of converting unstructured text data into meaningful data.
   * **Corpus:** collection of documents
   * **Document:** collection of sentences and words
   * **Token:** elementary building blocks (word, numbers, characters) in a doc
   * **Vocabulary:** all tokens appearing in a corpus
   * **Tokenization:** Breaking down a stream of text into words, phrases, symbols, or other meaningful elements called tokens.
  
   * **Lemmatization:** Reducing inflected words to their dictionary form (lemma), ensuring that the resulting root word belongs to the language.

   * **Bag of Words, TF-IDF:** These are methods to convert text data into numerical vectors. Bag of Words counts the occurrence of each word in a document while TF-IDF increases proportionally to count but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently.
     * **Cosine similarity**
       * Step 1: BOW Vocabulary -
       * Unique words are: ['I', 'like', 'cats', 'and', 'dogs', 'My', 'cat', 'does', 'not', 'your', 'dog']
       * Step 2: Vectorize sentences -
          * Vector S1: [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
          * Vector S2: [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
       * Step 3: Calculate cosine similarity -
          * Dot product (sum of the products of the corresponding entries of the two sequences of numbers):
          * Dot product = (1×0) + (1×1) + (1×0) + (1×0) + (1×0) + (0×1) + (0×1) + (0×1) + (0×1) + (0×1) + (0×1)
          * Dot product = 1
       * Magnitude of vectors:
          * Magnitude of S1 = sqrt((1^2) + (1^2) + (1^2) + (1^2) + (1^2) + (0^2) + (0^2) + (0^2) + (0^2) + (0^2) + (0^2)) = sqrt(5)
          * Magnitude of S2 = sqrt((0^2) + (1^2) + (0^2) + (0^2) + (0^2) + (1^2) + (1^2) + (1^2) + (1^2) + (1^2) + (1^2)) = sqrt(7)
       * Cosine similarity:
           * Cosine similarity = Dot product / (Magnitude of S1 * Magnitude of S2) = 1 / (sqrt(5) * sqrt(7)) = 1 / sqrt(35)

Rounded to 2 decimal places, this gives a cosine similarity of about 0.17.
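
The same calculation as a plain-Python sketch, using the vectors S1 and S2 from the example. The function name `cosine_similarity` is illustrative.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

s1 = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
s2 = [0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# S1 has five non-zero entries and S2 has seven, so this is 1 / sqrt(5 * 7).
similarity = cosine_similarity(s1, s2)
```
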
   * **TF-IDF (Term Frequency-Inverse Document Frequency):** $TF(t, d) * IDF(t)$
       * TF(t, d) = Number of times term t appears in document d
       * $IDF(t) = log(N / df(t))$ where N is total number of documents and df(t) is document frequency i.e., number of documents that contain term t
       * Dimensionality: the dimensionality of the vector representing a document is equal to the size of the vocabulary of all documents. This vocabulary is formed by the distinct words across all documents. Given D1 = "the cat is gray", D2 = "my cat is fast", the vocabulary is {the, cat, is, gray, my, fast}, so the dimensionality of D1 is 6. 
   * **The Jaccard similarity:** Measures how many words the documents have in common, divided by the total number of unique words across both documents
     * J(A, B) = |A ∩ B| / |A ∪ B|
     * Where:
         - |A ∩ B| is the number of elements in both sets (i.e., the intersection of A and B)
         - |A ∪ B| is the number of elements in either set, duplicates removed (i.e., the union of A and B)
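
A plain-Python sketch of TF-IDF and Jaccard similarity on the two documents D1 and D2 from the dimensionality example. It assumes whitespace tokenization and the natural logarithm for IDF; the source does not fix the base of the logarithm.

```python
import math

docs = {"D1": "the cat is gray", "D2": "my cat is fast"}
tokens = {name: text.split() for name, text in docs.items()}

# Vocabulary: distinct tokens across all documents.
vocabulary = sorted({t for toks in tokens.values() for t in toks})

def tf(term, doc):
    """Number of times the term appears in the document."""
    return tokens[doc].count(term)

def idf(term):
    """log(N / df(t)), with df(t) the number of documents containing the term."""
    df = sum(term in toks for toks in tokens.values())
    return math.log(len(tokens) / df)

def tfidf_vector(doc):
    """One TF-IDF weight per vocabulary entry."""
    return [tf(t, doc) * idf(t) for t in vocabulary]

def jaccard(doc_a, doc_b):
    """|A ∩ B| / |A ∪ B| on the sets of tokens."""
    a, b = set(tokens[doc_a]), set(tokens[doc_b])
    return len(a & b) / len(a | b)

vector_d1 = tfidf_vector("D1")
similarity = jaccard("D1", "D2")
```

Words that occur in both documents ("cat", "is") get an IDF of zero and so drop out of the TF-IDF vectors, while the Jaccard similarity is 2 shared words out of 6 unique words.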

### **7. Neural Networks:**
  A neural network takes in inputs, which are then processed in hidden layers using weights that are adjusted during training. The more hidden layers there are, the deeper the network is, and the more complex the patterns it can detect. The output layer then gives the final output for the given inputs. The network is trained by adjusting the weights between layers until the output error is minimized. 
  * The weights are typically updated with a simple rule: weight = weight - learning_rate * gradient
  * In words: weight change = learning rate * direction that reduces the error. The learning rate controls how much the weights are changed per step, i.e. the speed of convergence.
   * **Artificial Neuron:** It is the basic unit of a neural network. It gets certain inputs, processes them based on some activation function and gives an output. Input is processed using a weighted sum followed by a non-linear function (often a Sigmoid or ReLU)
   * **Multilayer Perceptron:** This is a class of feedforward artificial neural network which consists of at least three layers of nodes: an input layer, a hidden layer and an output layer. Each node in one layer connects with a certain weight to every node in the following layer.
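
A minimal sketch of a single artificial neuron and the weight update rule in plain Python. It assumes a sigmoid activation and a squared-error loss on one training example; the inputs, target, and learning rate are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs followed by a non-linear activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def train_step(inputs, target, weights, bias, learning_rate):
    """One gradient-descent step on the error 0.5 * (output - target)^2."""
    output = neuron(inputs, weights, bias)
    # Chain rule: dE/dz = (output - target) * sigmoid'(z), with sigmoid'(z) = output * (1 - output).
    delta = (output - target) * output * (1.0 - output)
    # weight = weight - learning_rate * gradient, where the gradient for w_j is delta * x_j.
    new_weights = [w - learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias - learning_rate * delta
    return new_weights, new_bias

inputs, target = [1.0, 0.5], 1.0
weights, bias = [0.0, 0.0], 0.0
before = neuron(inputs, weights, bias)  # sigmoid(0) = 0.5
for _ in range(100):
    weights, bias = train_step(inputs, target, weights, bias, learning_rate=0.5)
after = neuron(inputs, weights, bias)   # closer to the target of 1.0
```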

### **8. Graph Analytics:**
   * **Diameter**: longest shortest path between any two nodes in the graph
   * **Random Graph**: Start with N nodes; each link is formed independently with some probability p. Average degree $k ≈ pN$; the diameter peaks at k = 1
   * **Centrality (Degree, Closeness, Betweenness, Eigenvector)**: Measures to identify the most important vertices within a graph
       * **Degree centrality**: Who knows the most nodes? The node adjacent to the most edges: $u = arg max_v d(v)$, where d(v) is the degree of v
       * **Closeness centrality**: Who has the shortest distance to the other nodes? $C(c) = 1 / Σ_x d(x, c)$, where d(x, c) is the shortest-path distance between x and c
       * **Betweenness centrality**: Who controls knowledge flows? The node that the most shortest paths pass through
   * **PageRank**: $PR(u) = Σ_{v ∈ B(u)} PR(v) / L(v)$, where B(u) is the set of pages pointing to u and L(v) is the number of pages v points to
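
The PageRank formula can be applied repeatedly until the values settle. The sketch below is plain Python and uses the simplified form without a damping factor, which assumes every page has at least one outgoing link and that the iteration converges for the graph at hand. The three-page graph is made up.

```python
def pagerank(links, n_iter=100):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}  # start with equal rank
    for _ in range(n_iter):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            # Page v splits its rank evenly among the pages it points to.
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank

# A points to B and C, B points to C, C points back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
rank = pagerank(links)
```

For this graph the ranks settle at 0.4 for A, 0.2 for B, and 0.4 for C: C collects rank from both A and B and passes all of it on to A.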

### **9. Dimensionality Reduction:**
   * **PCA (Principal Component Analysis):** A statistical procedure that uses orthogonal transformation to convert a set of observations of possibly correlated variables into a set of linearly uncorrelated variables called principal components
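
A compact PCA sketch, assuming NumPy and using the singular value decomposition of the centered data, which is one standard way to obtain the principal components. The five sample points lie close to the line y = x and are made up.

```python
import numpy as np

def pca(X, n_components):
    """Return projected data, principal directions, and explained variances."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are orthonormal directions, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (len(X) - 1)
    projected = X_centered @ components.T
    return projected, components, explained_variance

# Two strongly correlated variables.
X = np.array([[2.0, 1.9], [4.0, 4.1], [6.0, 6.0], [8.0, 8.1], [10.0, 9.9]])
projected, components, variance = pca(X, n_components=2)
```

The first component points roughly along the diagonal and carries almost all of the variance, and the projected columns are uncorrelated with each other.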
