# Important Points
- Overfitting (Low Bias, High Variance): Happens when the model fits the training data well but makes large errors on the test data.
- Underfitting (High Bias, Low Variance): Happens when the model fits neither the training data nor the test data well.
- Ridge and LASSO regression work by adding a penalty term to the cost function, scaled by a hyperparameter lambda.
    - Ridge adds lambda times the sum of squared coefficients (L2 penalty); LASSO adds lambda times the sum of absolute coefficients (L1 penalty). Both reduce overfitting, and LASSO can also help in feature selection because it can shrink some coefficients to exactly zero.
- Gradient Descent works by repeatedly stepping downhill on the cost function towards a minimum (the global minimum when the cost function is convex). The length of each step depends on the learning rate (alpha), which is a hyperparameter.
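
A minimal sketch of the gradient descent step described above, in plain Python. The cost function J(x) = (x - 3)^2 and the two learning rates are made-up values chosen only to show how alpha changes the outcome; they are not from any real model.

```python
def gradient_descent(grad, start, alpha, steps):
    """Repeatedly step against the gradient; alpha is the learning rate."""
    x = start
    for _ in range(steps):
        x = x - alpha * grad(x)
    return x


def grad(x):
    # Gradient of the hypothetical cost J(x) = (x - 3)^2, whose minimum is at x = 3.
    return 2 * (x - 3)


# A small learning rate converges towards the minimum at x = 3.
small_step = gradient_descent(grad, start=0.0, alpha=0.1, steps=100)

# A learning rate that is too large overshoots further on every step and diverges.
large_step = gradient_descent(grad, start=0.0, alpha=1.1, steps=100)

print(small_step)
print(large_step)
```

With alpha = 0.1 the result ends up very close to 3, while with alpha = 1.1 it moves further away on every step. This is why the learning rate needs tuning.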

# Linear Regression

## Assumptions
- Linearity
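- Normality - Residuals should be approximately normally distributed. Skewed variables can be transformed using Logarithmic, Box-Cox or Square-root transformations
- Homoscedasticity - Variance of the residuals is the same for every value of X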
- Independence

## Advantages
- Good when the relationship between the features and the target is approximately linear
- Easy to train
- Overfitting can be handled using dimensionality reduction, cross-validation and regularization

## Disadvantages
- Multicollinearity (Can be solved using Ridge Regression, PCA, LASSO)
- Prone to noise and overfitting
- A lot of feature engineering is required

## Additional Points
- Feature scaling is required when training with gradient descent, because without it gradient descent takes longer to reach the global minimum (Min-max scaling, StandardScaler)
- Sensitive to missing data
- Sensitive to outliers
- R<sup>2</sup> and Adjusted R<sup>2</sup> are metrics used to measure goodness of fit.
    - R<sup>2</sup> usually ranges from 0 to 1 (it can be negative when the model fits worse than simply predicting the mean) and is simple to interpret, but on the training data it never decreases when more features are added.
    - Adjusted R<sup>2</sup> accounts for the number of features. If an added feature is meaningful, the value increases; otherwise it decreases. This is done by penalising attributes that contribute little to the fit.
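
The two metrics above can be sketched in a few lines of plain Python using the standard formulas R<sup>2</sup> = 1 - SS_res / SS_tot and Adjusted R<sup>2</sup> = 1 - (1 - R<sup>2</sup>)(n - 1) / (n - p - 1). The sample values below are made up for illustration.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, n_samples, n_features):
    """Penalise R^2 for the number of features used by the model."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


# Made-up targets and predictions.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0]
y_pred = [2.8, 5.1, 7.2, 8.9, 11.0]

r2 = r2_score(y_true, y_pred)
adj_one_feature = adjusted_r2(r2, n_samples=5, n_features=1)
adj_two_features = adjusted_r2(r2, n_samples=5, n_features=2)

print(r2, adj_one_feature, adj_two_features)
```

For the same R<sup>2</sup>, the adjusted value drops as the feature count goes up, so an extra feature has to improve the fit enough to pay for itself.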

In [7]:
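# DECORATOR

def hello(func):
    # wrapper calls the decorated function first, then prints the greeting
    def wrapper():
        func()
        print("Hello, World!")
    return wrapper

# @hello is shorthand for: bye = hello(bye)
@hello
def bye():
    print("Bye, World!")

bye()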

Bye, World!
Hello, World!


# Decision Trees

## Some Points
- Low Bias, High Variance: Can be fixed by making the tree shallow, or by using boosting methods such as XGBoost, where multiple shallow trees are built one after another and each new tree focuses on the errors of the previous trees (in AdaBoost-style boosting this is done by increasing the weights of the incorrectly predicted samples).
- Decision trees make no assumptions about the distribution of the data
- Hyperparameter tuning: we can do post-pruning or pre-pruning of the tree, or set a limit on the number of leaves

## Advantages
- Clear visualisation, like a set of if-else clauses
- Used for both Classification and regression
- No feature scaling required
- Handles non-linear relationships easily
- Can handle missing values automatically (depending on the implementation)
- Robust to outliers (Does not get affected by outliers)
- Less training time

## Disadvantages
- Overfitting
- High Variance
- Unstable: if a new data point is added, the tree may have to be rebuilt, and a small change in the data can produce a very different tree
- Not suitable for large datasets

# Logistic Regression
## Assumptions
- Linear relationship between the independent features and the log odds

## Some points
- In this formulation the classes are labelled 1 or -1 (instead of 1 and 0), because it makes the correctness check below simpler.
- If the product of Y (1 or -1) and the signed distance between the point and the decision boundary is +ve, then the point is correctly classified; if it is -ve, the point is misclassified.
- COST FUNCTION: max(Summation(func(Y<sub>i</sub> * W<sup>t</sup> * x<sub>i</sub>))), &nbsp;&nbsp;&nbsp; where W is the weight vector (normal to the decision boundary), W<sup>t</sup> is its transpose, and the maximum is taken over W.
- The ***func*** in the above equation is the sigmoid function, (1 / (1 + e<sup>-z</sup>)) where z is the value of Y<sub>i</sub> * W<sup>t</sup> * x<sub>i</sub>. It squashes every term into the range 0 to 1, so a single far-away outlier cannot dominate the sum.
- We maximise because correctly predicted points give a +ve value when put in the above formula, so the more points are correctly classified, the larger the sum.
- We can use the ROC curve (and the AUC score) to choose the threshold above or below which an input gets classified into one of the two classes.
- Feature scaling is required.
- Sensitive to missing values
- Sensitive to outliers, but their impact is reduced by the sigmoid function.
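
A small sketch of the correctness check and the sigmoid squashing described in the points above. The weight vector `w` and the three points are hypothetical values picked for illustration, and the bias term is left out to keep it short.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def signed_score(w, x, y):
    """y * (w . x): positive when the point is on the correct side of the boundary."""
    return y * sum(wi * xi for wi, xi in zip(w, x))


w = [1.0, -1.0]  # hypothetical weight vector
points = [
    ([2.0, 0.5], 1),     # correctly classified
    ([0.5, 2.0], -1),    # correctly classified
    ([100.0, 0.0], -1),  # far-away outlier on the wrong side
]

for x, y in points:
    score = signed_score(w, x, y)
    # The raw score of the outlier is -100, but after the sigmoid it is just close to 0.
    print(x, y, score, sigmoid(score))
```

Without the sigmoid, the outlier alone would pull the sum down by 100. With it, the outlier's contribution stays between 0 and 1 like every other point.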

## Advantages
- Less prone to overfitting than more complex models, and overfitting can be reduced further using L1 and L2 regularisation techniques

## Disadvantages
- A lot of feature engineering is required
- Multicollinearity causes problems
- Prone to noise and overfitting when there are many features
- If the features are not linearly related to the log odds, the model performs poorly.


# Random Forest Classifier (Bagging / Bootstrap Aggregator)
## Some points
- It is an ensemble technique.
- A decision tree has low bias and high variance because the tree is grown to its complete depth, which causes overfitting.
- To overcome this problem, we do bootstrap aggregating: we train multiple decision trees in parallel, each on a sample of the rows (drawn with replacement) and a sample of the columns, and combine their predictions by majority vote. We can also prune the trees to reduce their depth. All of this reduces overfitting.
- Feature importance is biased towards features having many categories.
- No feature scaling required
- Robust to Outliers
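
The row sampling, column sampling and voting steps can be sketched as below. This is only the sampling and aggregation part under toy data; training the individual trees is left out, and the helper names are made up for this sketch.

```python
import random
from collections import Counter


def bootstrap_sample(rows, n_features, rng):
    """Row sampling with replacement plus column sampling without replacement."""
    sampled_rows = [rng.choice(rows) for _ in rows]
    n_cols = len(rows[0])
    cols = sorted(rng.sample(range(n_cols), n_features))
    return [[row[c] for c in cols] for row in sampled_rows], cols


def majority_vote(predictions):
    """Final class is the one predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [[i, i * 2, i * 3, i * 4] for i in range(6)]  # toy dataset: 6 rows, 4 columns

# Each tree in the forest would be trained on its own sample like this one.
subset, cols = bootstrap_sample(rows, n_features=2, rng=rng)

# Hypothetical predictions from three trees for one input.
vote = majority_vote(["cat", "dog", "cat"])

print(cols, subset)
print(vote)
```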


# Support Vector Machines
- A linear support vector machine is like Regression in that it fits a line (the decision boundary that separates the positive and negative points), but it also has two margin hyperplanes that run parallel to that line.
- The two margin hyperplanes pass through the points of each class that are nearest to the decision boundary, and the boundary is chosen so that the marginal distance between the two hyperplanes is maximum.
- The points through which the margin hyperplanes pass are called support vectors.
- The SVM kernel trick implicitly maps the data into a higher-dimensional space, where data that is not linearly separable can be separated by a hyperplane.
- Feature Scaling is required.
- Sensitive to outliers.
- SVM can be affected by missing values.
- Soft margin means allowing some points to fall inside the margin (or on the wrong side of it) to reduce overfitting.
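
A short sketch of the geometry above for an already-trained linear SVM. It assumes the usual convention that the decision boundary is w . x + b = 0 and the two margin hyperplanes are w . x + b = +1 and -1, which makes the marginal distance 2 / ||w||. The values of `w`, `b` and the points are hypothetical.

```python
import math


def decision_value(w, b, x):
    """w . x + b: the sign gives the class, +1 or -1 means the point is on a margin hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def margin_width(w):
    """Distance between the two margin hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


w, b = [1.0, 1.0], -3.0  # hypothetical trained parameters

negative_sv = [1.0, 1.0]  # lies on w . x + b = -1, so it is a support vector
positive_sv = [2.0, 2.0]  # lies on w . x + b = +1, so it is a support vector
far_point = [3.0, 3.0]    # well inside the positive side, not a support vector

for point in (negative_sv, positive_sv, far_point):
    print(point, decision_value(w, b, point))

print(margin_width(w))
```

Only the two support vectors determine the margin here. Moving `far_point` around on its side of the margin would not change the boundary.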

## Advantages
- SVM works well with high-dimensional features, and kernels (rbf, poly, linear) extend it to non-linear data
- Memory efficient
- Good when we have little prior knowledge about the data
- Less overfitting
- Works well with semi-structured data like text, images, trees

## Disadvantages
- Long training time for large datasets
- Hard to choose a good kernel
- Hard to tune hyperparameters

# Weight initialisation techniques
- Uniform Weight initialisation
- Xavier / Glorot Initialisation
- He Initialisation
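
A rough sketch of the three techniques in plain Python. The Xavier/Glorot uniform limit sqrt(6 / (fan_in + fan_out)) and the He normal standard deviation sqrt(2 / fan_in) are the commonly used formulas. The 1 / sqrt(fan_in) range for the plain uniform case is an assumption, since the notes do not say which range is meant.

```python
import math
import random


def uniform_init(fan_in, fan_out, rng):
    # Assumed range: U(-1/sqrt(fan_in), 1/sqrt(fan_in)).
    limit = 1.0 / math.sqrt(fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def xavier_uniform(fan_in, fan_out, rng):
    # Xavier / Glorot uniform: scale depends on both inputs and outputs of the layer.
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)] for _ in range(fan_in)]


def he_normal(fan_in, fan_out, rng):
    # He normal: scale depends only on the number of inputs.
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_out)] for _ in range(fan_in)]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
w_uniform = uniform_init(4, 2, rng)
w_xavier = xavier_uniform(4, 2, rng)  # limit is sqrt(6 / 6) = 1.0 for a 4 x 2 layer
w_he = he_normal(4, 2, rng)

print(w_xavier)
```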

# Loss Functions for classification
## Binary Cross-Entropy
- Used when facing binary classification problem
- −(y⋅log(p)+(1−y)⋅log(1−p)) &nbsp; where y is the true label (0 or 1) and p is the predicted probability of class 1.
## Categorical Cross-Entropy
- Used when facing multi-class classification problem
- -Sum(y<sub>i</sub> * log(p<sub>i</sub>)) &nbsp; where y<sub>i</sub> is one-hot-encoded vector with 1 for true class and 0 for other.
## Sparse Categorical Cross-Entropy
- Similar to categorical cross-entropy, except the target labels are given as integers instead of one-hot-encoded vectors.
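
The three losses above for a single sample, written out in plain Python. The probabilities are made-up values, and each function assumes the predicted probabilities are strictly between 0 and 1 (no clipping is done).

```python
import math


def binary_cross_entropy(y, p):
    """y is 0 or 1, p is the predicted probability of class 1."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def categorical_cross_entropy(y_one_hot, probs):
    """y_one_hot has a 1 for the true class and 0 elsewhere."""
    return -sum(y * math.log(p) for y, p in zip(y_one_hot, probs))


def sparse_categorical_cross_entropy(label, probs):
    """Same loss, but the target is the integer index of the true class."""
    return -math.log(probs[label])


probs = [0.1, 0.7, 0.2]  # hypothetical softmax output for 3 classes

bce = binary_cross_entropy(1, 0.9)
cce = categorical_cross_entropy([0, 1, 0], probs)
scce = sparse_categorical_cross_entropy(1, probs)

print(bce, cce, scce)
```

The categorical and sparse versions give the same number here. They differ only in how the target is passed in.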

# Loss Functions for Regression
- Mean Absolute Error
- Mean Squared Error
- Root mean squared error
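
A quick sketch of the three regression losses in plain Python, with made-up targets and predictions.

```python
import math


def mean_absolute_error(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_squared_error(y_true, y_pred):
    # RMSE is in the same unit as the target, unlike MSE.
    return math.sqrt(mean_squared_error(y_true, y_pred))


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]  # errors are 1, 0 and 2

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = root_mean_squared_error(y_true, y_pred)

print(mae, mse, rmse)
```

The error of 2 counts four times as much as the error of 1 in MSE, which is why MSE and RMSE are more sensitive to outliers than MAE.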

# Optimizers
## Some Points
- **Gradient Descent**
    - GD is an optimizer that updates the model parameters (weights) on each epoch to reduce the loss, using the gradient computed over the whole training set
    - Requires huge resources for computation (computationally expensive)
    - Very slow
- **Stochastic Gradient Descent**
    - Mini-batch SGD computes the gradient on a small batch, so the updates are noisy and the path to the minimum is not smooth. To reduce this, we use SGD with momentum.
    - *SGD with Momentum* is basically SGD, but it also incorporates an exponentially weighted average of the previous gradients, so it keeps moving in a consistent direction on the loss curve.
- Adagrad
- Adadelta & RMSProp
- **ADAM** is a popular default optimizer that combines both ideas (Momentum and the adaptive per-parameter learning rates of AdaGrad / RMSProp)
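
A minimal sketch of the momentum update in plain Python, using one common form of the rule (velocity = beta * velocity + gradient). The cost function, `alpha` and `beta` are made-up values, and the gradient here is exact rather than noisy, so it only shows the update rule and not the smoothing effect.

```python
def sgd_momentum(grad, start, alpha, beta, steps):
    """Gradient descent where each step also carries over part of the previous steps."""
    x, velocity = start, 0.0
    for _ in range(steps):
        # Running (exponentially decayed) sum of past gradients.
        velocity = beta * velocity + grad(x)
        x = x - alpha * velocity
    return x


def grad(x):
    # Gradient of the hypothetical cost J(x) = (x - 3)^2, minimum at x = 3.
    return 2 * (x - 3)


result = sgd_momentum(grad, start=0.0, alpha=0.05, beta=0.9, steps=300)
print(result)
```

Setting `beta = 0` turns this back into plain gradient descent, because no past gradient is carried over.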


# Vanishing Gradient Problem
- This commonly occurs when we use the Sigmoid activation function in a deep network
- During backpropagation we calculate the derivative of the sigmoid function, and that derivative ranges from 0 to 0.25
- So when we apply the chain rule across many layers, these small derivatives get multiplied together, and the gradient that updates the weights of the early layers becomes so small that the change in the weights is almost negligible and the neural network effectively stops learning.
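
A small numeric illustration of the points above: the sigmoid derivative peaks at 0.25, so the product of ten such derivatives (one per layer, ignoring the weights) is already tiny. The depth of 10 layers is an arbitrary choice for this sketch.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# The derivative is largest at z = 0, where it equals 0.25.
max_derivative = sigmoid_derivative(0.0)

# Best case for a 10-layer chain: every layer contributes the maximum 0.25.
upper_bound_10_layers = max_derivative ** 10

print(max_derivative)
print(upper_bound_10_layers)
```

Even in this best case the factor is below one in a million, and away from z = 0 the derivative is smaller still.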

# Exploding Gradient Problem
- This occurs in the same way as the Vanishing gradient problem, except that it happens when the weights are large at the beginning: the gradients then grow as they are multiplied through the layers, the weight updates vary a lot, and training does not converge.
- When we use the ReLU activation function, the He initialisation technique can be used to reduce the Exploding and Vanishing Gradient Problems.
- When we use Sigmoid, we can use the Xavier / Glorot weight initialisation technique to reduce the Vanishing and Exploding Gradient Problems.

# Minkowski Distance
<img src="https://rittikghosh.com/images/min.png">

- Minkowski Distance is a generalised form of Manhattan and Euclidean distance, where the parameter P selects which of the two we get
    - P = 1 represents Manhattan distance
    - P = 2 represents Euclidean distance
    - The L1 norm corresponds to Manhattan distance
    - The L2 norm corresponds to Euclidean distance
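
The relationship above in a few lines of plain Python, with two made-up points.

```python
def minkowski_distance(a, b, p):
    """(sum of |a_i - b_i|^p) ^ (1/p); p = 1 is Manhattan, p = 2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


a, b = [0.0, 0.0], [3.0, 4.0]

manhattan = minkowski_distance(a, b, p=1)  # |3| + |4| = 7
euclidean = minkowski_distance(a, b, p=2)  # sqrt(9 + 16) = 5

print(manhattan, euclidean)
```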
    
# Cosine Similarity
- Used in recommendation systems
- Cosine Distance = 1 - Cosine Similarity
- Cosine Similarity is the cosine of the angle between the two vectors whose similarity we are finding.
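
A plain Python sketch of both quantities. The vectors are made up, and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    return 1 - cosine_similarity(a, b)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # angle 0, similarity close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # angle 90 degrees, similarity 0
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])         # angle 180 degrees, distance close to 2

print(same_direction, orthogonal, opposite)
```

Only the direction matters here: [1, 2] and [2, 4] have different lengths but count as identical, which suits recommendation systems where the overall rating scale differs between users.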

# Miscellaneous Points
- Gradient.ai is a website that provides AI solutions, and their API can be used to create a fine-tuned model using our own data.
- Gradient enables healthcare organizations to combine industry expert LLMs like Nightingale with their own institutional knowledge and data.
- Nightingale Open Science provides open data for public use from multiple companies in an ethical way.
- With data privacy a significant concern, HIPAA (Health Insurance Portability and Accountability Act) is the US federal law for protecting and securing PHI (Protected Health Information), and SOC 2 (System and Organization Controls 2) is an auditing standard for how service providers secure customer data.
- Word Embedding can be done using the gensim library (for example with Word2Vec)
    - 300 dimensions are commonly selected, so each word is made into a vector of 300 dimensions.
    - Each value in those 300 dimensions represents how closely the word is related to each of those features.
- Sentence Embedding can be done using Doc2Vec (also available in gensim)
    - Each sentence (or document) gets its own vector, instead of only one vector per word

# How do we decide how many neurons and layers we need in our deep learning model?
- KerasTuner is a library for hyperparameter tuning: we define a search space such as the range of the number of layers or the number of neurons in each layer, and KerasTuner searches for the best-performing number of layers and neurons for our model.

# Large Language Model
### Basically, Large Language Model is a general term and not an architecture. Large Language Models usually work on the Transformer architecture.
## ChatGPT
- When we give audio input to ChatGPT, it first converts the audio to text using Whisper.
- Then it uses GPT (Generative Pre-trained Transformer) to process the text and generate a response.
- When an image is requested, it uses DALL-E to convert the text prompt to an image.

# Reinforcement Learning
- The Agent and the Environment interact with each other, and on that basis the environment rewards (+ve or -ve) the agent for every step it takes

In [19]:
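from functools import lru_cache

@lru_cache(maxsize=None)  # cache results; the plain recursion is exponential and had to be interrupted for n = 40
def func(n):
    if n <= 2:  # base cases: func(0) = func(1) = func(2) = 1
        return 1
    return func(n-1) + func(n-2)
print(func(40))

102334155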

In [None]:
1 1 2 3 5 8