# Feature Engineering Assignment

## Assignment Questions

Q1) What is a parameter?

- A parameter is a measurable characteristic of a population that is typically unknown and is estimated using data. In the context of statistics, parameters refer to values like the population mean (μ), standard deviation (σ), or proportion (p). In Machine Learning, a parameter usually refers to the internal variables of a model that are learned from training data, such as weights in linear regression or coefficients in logistic regression. These parameters are optimized to minimize the model’s loss function and improve predictive accuracy.
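
To make the machine-learning sense of "parameter" concrete, the sketch below fits a straight line to a few made-up points with the closed-form least-squares formulas. The data and the names `fit_line`, `slope`, and `intercept` are illustrative assumptions, not part of the assignment; the point is only that the two numbers are learned from data rather than set by hand.

```python
def fit_line(xs, ys):
    """Estimate slope and intercept of y = slope * x + intercept by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

slope, intercept = fit_line(xs, ys)  # the learned parameters
print(slope, intercept)
```

Here `slope` and `intercept` play the role that weights and bias play in a larger model.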



---



Q2) What is correlation? What does negative correlation mean?

- Correlation is a statistical measure of how strongly, and in which direction, two variables move together. It is quantified by the correlation coefficient, which ranges from -1 to 1. A positive correlation means that as one variable increases, the other tends to increase. A negative correlation means that as one variable increases, the other tends to decrease. For example, if jacket sales fall as the temperature rises, the two are negatively correlated. A correlation close to 0 means there is little or no linear relationship.
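
As a rough illustration of the jacket example, the following sketch computes the Pearson correlation coefficient by hand. The temperature and sales figures are invented for the example, and `pearson` is a hypothetical helper name.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up data: jacket sales drop as the temperature rises.
temperature = [5, 10, 15, 20, 25]
jacket_sales = [95, 80, 62, 41, 20]

r = pearson(temperature, jacket_sales)
print(round(r, 3))  # a value close to -1 signals a strong negative correlation
```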





---



Q3) Define Machine Learning. What are the main components in Machine Learning?

- Machine Learning is a branch of Artificial Intelligence that enables systems to learn patterns from data and make predictions or decisions without being explicitly programmed. The main components of Machine Learning include:

  - Data: The input used to train and evaluate models.
  - Model: A mathematical representation that learns patterns from the data.
  - Loss function: Measures the error between predictions and actual values.
  - Optimization algorithm (optimizer): Minimizes the loss function by adjusting model parameters.
  - Evaluation metrics: Used to assess the performance of the model.



---



Q4) How does loss value help in determining whether the model is good or not?

- The loss value quantifies the difference between a model's predicted output and the actual target value. A smaller loss indicates that the predictions are close to the actual values, suggesting a better model; a high loss implies poor predictions and that the model needs improvement. The loss function is minimized during training, and its gradient guides updates to the model parameters. Comparing training loss with validation loss over time helps detect problems: both staying high suggests underfitting, while training loss falling as validation loss rises suggests overfitting.
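
A minimal sketch of this idea, using mean squared error as one possible loss function. The target values and the two sets of predictions are made up, and `mean_squared_error` here is a hand-written helper, not the Scikit-learn function of the same name.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, 5.0, 7.0]
close_preds = [2.5, 5.0, 7.5]  # hypothetical model A: small errors
far_preds = [0.0, 9.0, 2.0]    # hypothetical model B: large errors

loss_close = mean_squared_error(y_true, close_preds)
loss_far = mean_squared_error(y_true, far_preds)

# The model with the lower loss fits these targets better.
print(loss_close < loss_far)
```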





---



Q5) What are continuous and categorical variables?

- Continuous variables are numerical variables that can take an infinite number of values within a given range. Examples include height, temperature, and weight. They are measured and can be fractional.
Categorical variables, on the other hand, represent categories or labels and have a limited number of distinct values. They can be nominal (no natural order, like color) or ordinal (with order, like ratings: low, medium, high). Handling these types of variables properly is crucial for effective model training.





---



Q6) How do we handle categorical variables in Machine Learning? What are the common techniques?

- Categorical variables need to be converted into numerical values before being fed into most machine learning models. Common techniques include:

  - Label Encoding: Assigns an integer to each category (e.g., red=0, blue=1). The integers imply an order, so this suits target labels or tree-based models better than nominal input features.
  - One-Hot Encoding: Creates a binary column for each category and assigns 1/0 based on presence.
  - Ordinal Encoding: Assigns ordered integers to categories with a logical order.
  - Binary Encoding or Target Encoding: Used when there are many categories, where one-hot encoding would create too many columns.

  Choosing the right technique depends on the nature of the data and the model being used.
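
Two of the techniques above can be sketched with Scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The color and rating values are invented sample data, and the explicit category order passed to `OrdinalEncoder` is an assumption about what "low < medium < high" should mean.

```python
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Nominal feature: no natural order, so one-hot encode it.
colors = [["red"], ["blue"], ["green"], ["blue"]]
one_hot = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for display.
color_matrix = one_hot.fit_transform(colors).toarray()
# Columns follow one_hot.categories_[0], which is sorted: blue, green, red.

# Ordinal feature: pass the order explicitly so low < medium < high.
ratings = [["low"], ["high"], ["medium"]]
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
rating_codes = ordinal.fit_transform(ratings)

print(color_matrix)
print(rating_codes.ravel())
```

Without the explicit `categories` argument, `OrdinalEncoder` would sort the labels alphabetically, which would not match the intended order.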





---



Q7) What do you mean by training and testing a dataset?

- Training and testing a dataset involves dividing your data into two subsets. The training set is used to train the machine learning model, meaning it learns the underlying patterns and relationships in the data. The testing set is used to evaluate the model's performance on unseen data to check that it generalizes well. A common practice is to split the data in an 80:20 or 70:30 ratio. Keeping the test data out of training does not prevent overfitting by itself, but it exposes overfitting and gives a realistic estimate of model performance.





---



Q8) What is sklearn.preprocessing?

- sklearn.preprocessing is a module in Scikit-learn that provides utility functions and transformer classes for data preprocessing. They help convert raw data into a format suitable for model training.

  Common tasks include:

  - Scaling features (e.g., StandardScaler, MinMaxScaler).
  - Encoding categorical variables (e.g., LabelEncoder, OneHotEncoder).
  - Generating polynomial features (PolynomialFeatures).

  Handling missing values is a related preprocessing step, but in current Scikit-learn versions the imputers (e.g., SimpleImputer) live in sklearn.impute rather than sklearn.preprocessing.

  Proper preprocessing ensures that features are on similar scales and that categorical variables are appropriately represented for the algorithm to learn effectively.
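
As a small illustration of a polynomial transformation, the sketch below expands one sample with two features using `PolynomialFeatures`. The input values are arbitrary.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features, a = 2 and b = 3 (arbitrary values).
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
# Degree-2 expansion yields the columns: 1, a, b, a^2, a*b, b^2.
expanded = poly.fit_transform(X)
print(expanded)
```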



---



Q9) What is a Test set?

- A test set is a portion of the dataset that is separated from the training data and is used solely to evaluate the final performance of a machine learning model. The model never sees the test set during training, ensuring an unbiased assessment of how well it can generalize to new, unseen data. The test set helps detect overfitting and gives a realistic picture of model accuracy, precision, recall, and other performance metrics.



---



Q10) How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?

- Data can be split using the train_test_split() function from sklearn.model_selection, which randomly divides the data into training and testing sets. The approach to a Machine Learning problem typically involves:

  1. Understanding the problem and data.
  2. Preprocessing and cleaning the data.
  3. Performing Exploratory Data Analysis (EDA).
  4. Feature engineering and encoding.
  5. Splitting data.
  6. Selecting and training a model.
  7. Evaluating and tuning the model.
  8. Deploying the model.

  Each step is crucial for building an effective and accurate model.
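
The split and the train-then-evaluate part of this workflow might look like the sketch below. The Iris dataset, the 80:20 ratio, `random_state=42`, and the choice of `LogisticRegression` are all stand-in assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in dataset: 150 samples, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out 20% for testing; stratify keeps class proportions equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The pipeline fits the scaler on the training data only, then trains the model.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Accuracy on data the model has never seen.
accuracy = model.score(X_test, y_test)
print(X_train.shape, X_test.shape, accuracy)
```

Wrapping the scaler and the model in one pipeline is a design choice that keeps test-set statistics out of the preprocessing step.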



---



Q11) Why do we have to perform EDA before fitting a model to the data?

- Exploratory Data Analysis (EDA) helps us understand the underlying structure and patterns in the data before applying any model.

  It allows us to:

  - Identify missing or inconsistent data.
  - Understand the distribution of variables.
  - Detect outliers and correlations.
  - Visualize relationships between features.

  EDA provides insights into feature importance and interactions, helping to choose appropriate preprocessing techniques and models. It helps verify data quality and guides decisions throughout the modeling process.
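
A few of these checks can be sketched with pandas. The DataFrame below is invented for the example, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import pandas as pd

# Made-up data with one missing age and one extreme income.
df = pd.DataFrame({
    "age": [25, 32, 47, None, 51, 29],
    "income": [40_000, 52_000, 61_000, 58_000, 900_000, 45_000],
    "region": ["north", "south", "north", "east", "south", "north"],
})

# Missing values per column.
missing_counts = df.isna().sum()

# Summary statistics (count, mean, std, quartiles) for numeric columns.
summary = df.describe()

# Flag outliers with the 1.5 * IQR rule.
q1 = df["income"].quantile(0.25)
q3 = df["income"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# Pairwise correlation between the numeric columns.
correlations = df[["age", "income"]].corr()

print(missing_counts)
print(outliers)
```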





---



Q12) What is correlation?

- Correlation measures the strength and direction of a linear relationship between two variables. It is expressed as a coefficient ranging from -1 to 1:

  - A value of 1 indicates perfect positive correlation.
  - A value of 0 means no linear correlation.
  - A value of -1 indicates perfect negative correlation.

  Understanding correlation is essential in feature selection, because highly correlated features carry redundant information and can make model coefficients unstable (multicollinearity). Visualizing and quantifying correlation helps decide which features to keep.



---



Q13) What does negative correlation mean?

- Negative correlation means that when one variable increases, the other tends to decrease. For example, if the number of hours watching TV increases, academic performance might decrease. This is reflected by a correlation coefficient that lies between -1 and 0. A value closer to -1 implies a strong negative relationship. Negative correlations are useful in regression modeling, feature selection, and understanding inverse relationships in data.





---



Q14) How can you find correlation between variables in Python?

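- Correlation can be computed using the .corr() method in Pandas, which calculates the Pearson correlation coefficient by default. To visualize correlation, seaborn.heatmap() can be used.

Example:

```python
import seaborn as sns  # df is assumed to be an existing pandas DataFrame

corr_matrix = df.corr(numeric_only=True)  # Pearson by default; skips non-numeric columns
sns.heatmap(corr_matrix, annot=True)
```

Passing method="spearman" or method="kendall" to .corr() gives rank-based coefficients, which capture monotonic relationships that are not necessarily linear. This analysis helps identify multicollinearity and decide which features to keep or drop before training a model.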




---



Q15) What is causation? Explain difference between correlation and causation with an example.

- Causation implies that one variable directly affects another, whereas correlation only indicates that two variables move together. For example, ice cream sales and drowning incidents may be positively correlated, but one does not cause the other. The real cause may be a third variable like hot weather. Correlation does not imply causation; establishing causality requires controlled experiments or domain knowledge. Understanding this distinction is critical to avoid drawing incorrect conclusions from data.





---



Q16) What is an Optimizer? What are different types of optimizers? Explain each with an example.

- An optimizer in Machine Learning is an algorithm used to adjust the model parameters (like weights) to minimize the loss function. Common optimizers include:

  - Gradient Descent: Computes the gradient over the whole training set and updates parameters in the direction of the negative gradient.
  - Stochastic Gradient Descent (SGD): Updates parameters using a single data point (or, in the mini-batch variant, a small batch) per iteration.
  - Adam: Combines momentum and adaptive learning rates for faster convergence.
  - RMSprop: Scales the learning rate of each parameter by a running average of recent squared gradients.

  Example: in PyTorch you might write optimizer = torch.optim.Adam(model.parameters()), and in Keras optimizer = tf.keras.optimizers.Adam(), then use it to update weights during training. The choice of optimizer affects training speed and model performance.
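
To show the difference between the first two optimizers, here is a minimal pure-Python sketch that fits y = w·x + b with batch gradient descent and with SGD. The data, learning rates, step counts, and function names are all illustrative assumptions; Adam and RMSprop are not implemented here because deep learning frameworks already provide them.

```python
import random

# Hypothetical noiseless data from y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]


def gradients(w, b, batch):
    """Gradient of mean squared error over the (x, y) pairs in batch."""
    n = len(batch)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in batch) / n
    return grad_w, grad_b


def batch_gradient_descent(points, lr=0.05, steps=2000):
    """Each update uses the gradient over the full dataset."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w, grad_b = gradients(w, b, points)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


def stochastic_gradient_descent(points, lr=0.01, epochs=2000, seed=0):
    """Each update uses the gradient of a single, randomly ordered point."""
    rng = random.Random(seed)
    points = list(points)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(points)
        for point in points:
            grad_w, grad_b = gradients(w, b, [point])
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


w_gd, b_gd = batch_gradient_descent(data)
w_sgd, b_sgd = stochastic_gradient_descent(data)
print(w_gd, b_gd)
print(w_sgd, b_sgd)
```

Both variants should end up near w = 2 and b = 1 on this toy problem; on real, noisy data SGD usually needs a decaying learning rate to settle.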





---



Q17) What is sklearn.linear_model?

- sklearn.linear_model is a sub-module in Scikit-learn that provides various linear models for regression and classification tasks.

  Some of the commonly used models are:

  - LinearRegression
  - LogisticRegression
  - Ridge, Lasso, and ElasticNet

  These models are useful when the relationship between the input features and the target variable is approximately linear. The module offers a consistent interface for training, evaluation, and prediction, making it a core part of many ML workflows.





---



Q18) What does model.fit() do? What arguments must be given?

- model.fit() trains a machine learning model using the provided data.

  For supervised estimators it takes two main arguments:

  - X (features or input data)
  - y (target or labels)

  The method estimates the model parameters (like weights), typically by minimizing a loss function, and stores them on the model object for later predictions. Unsupervised estimators such as clustering models, and transformers such as scalers, need only X. Calling fit() is the fundamental training step across Scikit-learn estimators.





---



Q19) What does model.predict() do? What arguments must be given?

- model.predict() uses the learned parameters from training to predict target values for new or unseen data. It requires only the feature data X as input. The function returns predicted values that can be compared against actual values to evaluate the model’s performance. It’s typically used after model.fit() during model testing or deployment.
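
The fit-then-predict sequence can be sketched as follows. The training points and the new inputs are made up, and `LinearRegression` is just one convenient estimator to demonstrate with.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data from y = 2x + 1; X must be 2-D (samples x features).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
model.fit(X_train, y_train)  # learns coef_ and intercept_ from X and y

# predict() needs only feature data with the same number of columns.
X_new = np.array([[4.0], [5.0]])
predictions = model.predict(X_new)

print(model.coef_, model.intercept_)
print(predictions)
```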





---



Q20) What are continuous and categorical variables?

- Continuous variables are numerical variables that can take an infinite number of values within a given range. Examples include height, temperature, and weight. They are measured and can be fractional.
Categorical variables, on the other hand, represent categories or labels and have a limited number of distinct values. They can be nominal (no natural order, like color) or ordinal (with order, like ratings: low, medium, high). Handling these types of variables properly is crucial for effective model training.







---



Q21) What is feature scaling? How does it help in Machine Learning?

- Feature scaling is a technique to bring the independent variables in a dataset onto a comparable range. It matters most for distance-based algorithms such as SVM and KNN and for models trained with gradient descent; without scaling, features with larger magnitudes dominate those with smaller values. Common techniques are Min-Max Scaling and Standardization. For these algorithms, scaling often speeds up convergence and improves accuracy, whereas tree-based models are largely insensitive to it.
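
The two techniques named above reduce to short formulas, sketched here in plain Python. The income values and the helper names `min_max_scale` and `standardize` are illustrative assumptions.

```python
import statistics


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


def standardize(values):
    """Shift to mean 0 and scale to standard deviation 1 (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


incomes = [20_000, 40_000, 60_000, 80_000, 100_000]  # made-up feature values

scaled = min_max_scale(incomes)
z_scores = standardize(incomes)

print(scaled)
print([round(z, 3) for z in z_scores])
```

The population standard deviation is used here because that is also what Scikit-learn's `StandardScaler` computes.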





---



Q22) How do we perform scaling in Python?

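- In Python, feature scaling is done using:

  - StandardScaler: Scales data to have mean = 0 and std = 1.
  - MinMaxScaler: Scales data to a fixed range, [0, 1] by default.

Example using Scikit-learn:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# X is assumed to be a numeric feature matrix (ideally the training split only)
scaled_data = scaler.fit_transform(X)
```

Fit the scaler on the training data only and reuse it with scaler.transform() on the test data, so that test-set statistics do not leak into training. This transformation puts features on a comparable scale, which helps scale-sensitive models perform better.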





---



Q23) What is sklearn.preprocessing?

- sklearn.preprocessing is a module in Scikit-learn that provides utility functions and transformer classes for data preprocessing. They help convert raw data into a format suitable for model training. Common tasks include:

  - Scaling features (e.g., StandardScaler, MinMaxScaler).
  - Encoding categorical variables (e.g., LabelEncoder, OneHotEncoder).
  - Generating polynomial features (PolynomialFeatures).

  Handling missing values is a related preprocessing step, but in current Scikit-learn versions the imputers (e.g., SimpleImputer) live in sklearn.impute rather than sklearn.preprocessing.

  Proper preprocessing ensures that features are on similar scales and that categorical variables are appropriately represented for the algorithm to learn effectively.





---



Q24) How do we split data for model fitting (training and testing) in Python?

- Data can be split using the train_test_split() function from sklearn.model_selection, which randomly divides the data into training and testing sets. The test_size argument sets the fraction held out for testing (for example, 0.2 for an 80:20 split), random_state makes the split reproducible, and stratify keeps class proportions the same in both sets. A typical call is X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42).





---



Q25) Explain data encoding?

- Data encoding is the process of converting categorical values into numerical format for machine learning models. This is necessary because most models cannot handle non-numeric data.

  Common encoding techniques include:

  - Label Encoding: Converts each label into a unique integer.
  - One-Hot Encoding: Creates binary columns for each category.
  - Ordinal Encoding: Preserves order among categories.

  Choosing an encoding that matches the data helps models interpret categories correctly; for example, one-hot encoding nominal categories avoids the artificial ordering that integer labels would introduce.

