# Don't Run the Code

In [None]:
#Step 1: Import Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb

In [None]:
#Step 2: Read the Dataset

# Assuming the dataset is stored in a CSV file named 'dataset.csv'
data = pd.read_csv('dataset.csv')

In [None]:
#Step 3: Feature Engineering (if needed)
#If you need to perform any feature engineering tasks like handling missing values, encoding categorical variables, or creating new features, you can do so at this step.

In [None]:
# Step 4: Split Data into Features (X) and Target Variable (y)

X = data.drop(columns=['target_column'])  # Drop the target variable column
y = data['target_column']

In [None]:
# Step 5: Train-Test Split

# NOTE: the random_state value was cut off in the source; 42 is an assumed placeholder
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:
# Step 6: Feature Scaling (if needed)

# If your models require feature scaling (e.g., logistic regression), you can scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [None]:
# Step 7: Feature Selection

# Select the top k features using ANOVA F-value
selector = SelectKBest(score_func=f_classif, k=5)
X_train_selected = selector.fit_transform(X_train_scaled, y_train)
X_test_selected = selector.transform(X_test_scaled)

In [None]:
# Step 8: Define and Train Models

# Define classifiers
models = {
    'Logistic Regression': LogisticRegression(),
    'AdaBoost': AdaBoostClassifier(),
    'XGBoost': xgb.XGBClassifier()
}

# Train models
for name, model in models.items():
    model.fit(X_train_selected, y_train)

In [None]:
# Step 9: Evaluate Models

# Evaluate models
results = {}
for name, model in models.items():
    y_pred = model.predict(X_test_selected)
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    print(f'{name}: Accuracy = {accuracy:.4f}')

# Feature Scaling

#### Let's say we have independent features (IF) Height (cm) and Weight (kg), and a dependent feature (DF) BMI
#### Every value has a magnitude (for height, e.g. 183) and a unit (cm)

##### 1. If we directly apply an ML algorithm such as KNN (which works on Euclidean distance) to the raw magnitudes and plot the points on a 2D graph, the distances between points depend on the units and can be very large, so the feature with the larger magnitudes dominates.
##### 2. Scaling is applied to each feature separately, using that feature's own values.
##### 3. Algorithms in which scaling is needed:
* Linear Regression trained with gradient descent: after scaling, the features are on a similar range, so gradient descent gets from the randomly initialised coefficients to the global minimum in fewer steps and convergence happens quickly
* Algorithms in which Euclidean distance is used (K-means clustering, KNN)
# When should we not apply Feature Scaling

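Tree-based algorithms split on thresholds of one feature at a time, so rescaling a feature does not change their results:

1. Decision Tree
2. Random Forest
3. XGBoost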

# Standardization and Normalization

1. Normalization (min-max scaling) helps us to scale our features to the range 0 to 1
2. Standardization helps us to rescale our features so that they have mean 0 and standard deviation 1, as in the Standard Normal Distribution
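A minimal sketch of both techniques, assuming sklearn's `MinMaxScaler` for normalization and `StandardScaler` for standardization; the height values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical heights in cm, shaped as one feature column
heights = np.array([[150.0], [160.0], [170.0], [183.0]])

# Normalization: (x - min) / (max - min), so values land in [0, 1]
normalized = MinMaxScaler().fit_transform(heights)

# Standardization: (x - mean) / std, so the column has mean 0 and std 1
standardized = StandardScaler().fit_transform(heights)

print(normalized.ravel())
print(standardized.ravel())
```

In a real workflow the scaler would be fitted on the training split only and then applied to the test split, as in Step 6 above.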

# Encoding
1. Whenever we talk about encoding techniques, we are talking about categorical variables
2. For example, we have a gender feature (male, female). If we directly provide these values to an ML algorithm, it will not be able to use them, because ML algorithms involve a lot of mathematical calculations that need numbers

## Types of Encoding
1. Nominal Encoding
   * One Hot Encoding
   * One Hot Encoding with Multiple Categories
   * Mean Encoding
2. Ordinal Encoding
   * Label Encoding
   * Target Guided Ordinal Encoding

## Nominal Data
1. These categories have no intrinsic order or ranking.
2. Examples of nominal categories include gender (male, female), marital status (single, married, divorced), and types of fruits (apple, banana, orange).
3. Nominal data can be represented by numbers, but the numbers do not have any mathematical meaning. For example, assigning "1" to male and "2" to female does not imply any mathematical operation between male and female.
4. Statistical measures such as mode (most frequent value) and frequency distributions are commonly used to describe nominal data.

## Ordinal Data
1. Ordinal categories, unlike nominal, have a natural order or ranking between the categories.
2. The differences between categories are not necessarily uniform, but there's a clear hierarchy or sequence.
3. Examples of ordinal categories include ratings (poor, fair, good, excellent), educational levels (high school, bachelor's, master's, PhD), and socio-economic status (low, middle, high).
4. In ordinal data, the order matters, but the differences between the categories may not be equal.
5. Statistical measures such as median and percentile can be used with ordinal data, but arithmetic operations like addition and subtraction are not meaningful because the intervals between categories may not be consistent.
  
#### In summary, nominal categories are used for labeling and categorizing data without any inherent order, while ordinal categories have a natural order or hierarchy between the categories.

## 1. One Hot Encoding

Let's say we have a country feature with the values "Germany", "France" and "Spain". Three separate features will be created, named Germany, France and Spain. If Germany is present in the original feature, Germany gets 1 and the remaining two features get 0. When Germany is 0 and France is also 0, Spain must be 1, so the Spain column carries no extra information and we can delete it.
Keeping all three columns makes them perfectly collinear, which is called the dummy variable trap; dropping one column avoids it.
We can do this with pandas (pd.get_dummies) or with sklearn (OneHotEncoder).
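A small sketch of the country example with `pd.get_dummies`. The DataFrame is made up for illustration, and dropping the Spain column follows the reasoning in the text; `pd.get_dummies(..., drop_first=True)` is an alternative that drops the first category in sorted order (France here) instead.

```python
import pandas as pd

# Hypothetical data for the country example
df = pd.DataFrame({"country": ["Germany", "France", "Spain", "Germany"]})

# One column per category; exactly one of them is 1 in every row
one_hot = pd.get_dummies(df["country"], dtype=int)

# Drop one column: Spain is implied when Germany and France are both 0
reduced = one_hot.drop(columns=["Spain"])

print(one_hot)
print(reduced)
```

Any one of the three columns could be dropped; the remaining two still identify every country.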

#### Disadvantage
Let's say we have 100 unique categories. Then 99 columns will be created, which increases the number of dimensions and leads to the curse of dimensionality.

## 2. One Hot Encoding with Multiple Categories

Let's say we have 50 different categories in a feature. We find the top 10 categories that are repeated more often than the others and apply one hot encoding only to those categories; every other category gets 0 in all of the new columns.
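The idea can be sketched with pandas as follows. The series, the column prefix `f1_`, and the choice of `top_k = 2` (instead of 10, to keep the example small) are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical categorical feature; A and B are the most frequent values
f1 = pd.Series(["A", "B", "A", "C", "B", "A", "D", "E"], name="f1")

top_k = 2  # the text uses 10; a smaller value keeps this example readable

# Categories sorted by frequency, most frequent first
top_categories = f1.value_counts().head(top_k).index

# One indicator column per frequent category; rare categories stay all-zero
encoded = pd.DataFrame(
    {f"f1_{cat}": (f1 == cat).astype(int) for cat in top_categories}
)

print(encoded)
```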

## 3. Mean Encoding

##### ----Learn Target Guided Ordinal Encoding first, then read this----

In this we also use the output feature. Let's say feature f1 has the values (A,B,C,D,A,B...) and the output is (1,1,0,0,0,1...) respectively (a classification problem). We calculate the mean of the output over the rows where f1 is "A", and similarly for B, C and D. The original values are then replaced directly by the mean of their category. For example, if category A has a mean of 0.73, we replace every A with 0.73.
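A minimal pandas sketch of mean encoding, using only the six example values written out above. The column names `target` and `f1_mean_encoded` are assumptions for illustration, and in practice the means would be computed on the training split only.

```python
import pandas as pd

# The six example values from the text
df = pd.DataFrame({
    "f1": ["A", "B", "C", "D", "A", "B"],
    "target": [1, 1, 0, 0, 0, 1],
})

# Mean of the output for each category of f1
category_means = df.groupby("f1")["target"].mean()

# Replace each category by its mean
df["f1_mean_encoded"] = df["f1"].map(category_means)

print(df)
```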

##### When to use?
* When we have a feature like pincodes with many distinct values (say 1000) and we don't need to care about ranking

------------------------------------------------------------
------------------------------------------------------------

## 1. Label Encoding

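Let's say we have an Education feature with the values BE, Master's, PHD and Diploma. We give each category a rank; here PHD gets rank 4 (Rank 1 is the lowest and Rank 4 is the highest).
Sklearn provides the LabelEncoder class for this, but it assigns integers in sorted label order, so when the ranks must follow a specific order, use an explicit mapping or OrdinalEncoder with the categories listed in order.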
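The sketch below contrasts an explicit rank mapping with sklearn's `LabelEncoder`. The text only fixes PHD at rank 4, so the order Diploma < BE < Master's is an assumption made for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical Education column
education = pd.Series(["BE", "Master's", "PHD", "Diploma", "BE"])

# Explicit ranking (assumed order; only PHD = 4 comes from the text)
rank = {"Diploma": 1, "BE": 2, "Master's": 3, "PHD": 4}
education_ranked = education.map(rank)

# LabelEncoder numbers the labels in sorted order, starting from 0,
# so its codes do not necessarily follow the intended ranking
label_codes = LabelEncoder().fit_transform(education)

print(education_ranked.tolist())
print(label_codes.tolist())
```

Here `LabelEncoder` gives BE the lowest code only because "BE" sorts first alphabetically, which is why the explicit mapping is safer when the order matters.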


## 2. Target Guided Ordinal Encoding

In this we also use the output feature. Let's say feature f1 has the values (A,B,C,D,A,B...) and the output is (1,1,0,0,0,1...) respectively (a classification problem). We calculate the mean of the output over the rows where f1 is "A", and similarly for B, C and D. Based on these mean values we assign the ranks: the higher the mean, the higher the rank (1 is the lowest, 4 is the highest).
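A possible pandas sketch of this ranking step. The first six rows are the example values from the text; the last two rows are a made-up continuation added so that every category has a different mean, and the column names are assumptions.

```python
import pandas as pd

# First six rows follow the text; the last two are hypothetical
df = pd.DataFrame({
    "f1": ["A", "B", "C", "D", "A", "B", "C", "A"],
    "target": [1, 1, 0, 0, 0, 1, 1, 1],
})

# Mean of the output per category, sorted from lowest to highest
category_means = df.groupby("f1")["target"].mean().sort_values()

# Rank 1 goes to the lowest mean, rank 4 to the highest
ranks = {cat: r for r, cat in enumerate(category_means.index, start=1)}

# Replace each category by its rank
df["f1_encoded"] = df["f1"].map(ranks)

print(ranks)
print(df)
```

If two categories had the same mean, their relative rank would depend on the sort order, so ties need a deliberate rule in real data.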

##### When to use?
* When we have an ordinal variable with many categories