# Feature Engineering & Machine Learning: Assignment

## Theoretical Questions

1. **What is a parameter?**
   
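   A parameter is an internal value of a model that is learned from data during training. In machine learning, parameters (such as weights and biases) are adjusted during training so that the model can best map inputs to outputs. They differ from hyperparameters (such as the learning rate), which are chosen before training rather than learned.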

2. **What is correlation? What does negative correlation mean?**
   
   Correlation is a statistical measure that describes the degree to which two variables are linearly related. It is typically expressed as the Pearson correlation coefficient, a value between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship (the variables may still be related in a nonlinear way).
   Negative correlation means that as one variable increases, the other variable tends to decrease. It indicates an inverse relationship between the two variables.

3. **Define Machine Learning. What are the main components in Machine Learning?**
   
   Machine Learning is a branch of artificial intelligence that focuses on developing algorithms that learn patterns from data and use them to make predictions, without being explicitly programmed with task-specific rules. The main components include:
   - **Data:** The raw information used to train and evaluate models.
   - **Features:** The input variables or attributes derived from the data.
   - **Model:** The mathematical or computational structure that maps inputs to outputs.
   - **Loss Function:** A metric that measures the error of the model’s predictions.
   - **Optimizer:** An algorithm that adjusts the model parameters to minimize the loss.
   - **Evaluation Metrics:** Criteria to assess model performance (e.g., accuracy, precision, recall).

4. **How does loss value help in determining whether the model is good or not?**
   
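   The loss value quantifies the difference between the predicted outputs and the actual outputs. A lower loss value indicates that the model's predictions are closer to the actual values, which generally means a better-performing model. During training, the loss guides the optimization of parameters. To judge whether the model is actually good, compare the loss on held-out (validation or test) data with the training loss: a low training loss paired with a much higher validation loss is a sign of overfitting.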
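
   As a small illustration, the sketch below computes mean squared error (one common loss for regression) for two sets of predictions. The helper name `mean_squared_error` and the sample values are made up for this example.

   ```python
   import numpy as np

   def mean_squared_error(y_true, y_pred):
       """Average of the squared differences between targets and predictions."""
       y_true = np.asarray(y_true, dtype=float)
       y_pred = np.asarray(y_pred, dtype=float)
       return float(np.mean((y_true - y_pred) ** 2))

   # Hypothetical targets and two candidate sets of predictions.
   y_true = [3.0, 5.0, 7.0]
   close_preds = [2.5, 5.5, 7.0]   # near the targets
   far_preds = [0.0, 9.0, 4.0]     # far from the targets

   loss_close = mean_squared_error(y_true, close_preds)
   loss_far = mean_squared_error(y_true, far_preds)
   print(loss_close, loss_far)  # the closer predictions give the smaller loss
   ```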

5. **What are continuous and categorical variables?**
   
   - **Continuous Variables:** Variables that can take any numeric value within a range, such as temperature or height.
   - **Categorical Variables:** Variables that represent distinct categories or groups, such as color or type of product.

6. **How do we handle categorical variables in Machine Learning? What are the common techniques?**
   
   Categorical variables are typically transformed into a numerical format so that ML algorithms can process them. Common techniques include:
   - **One-Hot Encoding:** Creating binary columns for each category.
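   - **Label Encoding:** Assigning each category a unique integer. Because the integers imply an order, this is best suited to target labels or to models (such as tree-based ones) that are not misled by the arbitrary ordering.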
   - **Ordinal Encoding:** Encoding categories based on a natural order if one exists.
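
   A minimal sketch of two of these techniques with scikit-learn follows. The column names, category values, and the `small < medium < large` ordering are assumptions chosen for illustration.

   ```python
   import pandas as pd
   from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

   # Hypothetical data: "color" has no natural order, "size" does.
   df = pd.DataFrame({
       "color": ["red", "green", "blue", "green"],
       "size": ["small", "large", "medium", "small"],
   })

   # One-hot encoding: one binary column per category (sorted alphabetically).
   one_hot = OneHotEncoder(handle_unknown="ignore")
   color_encoded = one_hot.fit_transform(df[["color"]]).toarray()

   # Ordinal encoding: integers that follow an explicitly stated order.
   ordinal = OrdinalEncoder(categories=[["small", "medium", "large"]])
   size_encoded = ordinal.fit_transform(df[["size"]])

   print(one_hot.categories_[0])  # category order of the one-hot columns
   print(color_encoded)
   print(size_encoded.ravel())
   ```

   `pandas.get_dummies` is a common alternative for quick one-hot encoding of a DataFrame.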

7. **What do you mean by training and testing a dataset?**
   
   - **Training:** The process of teaching a model using a subset of the data to learn the underlying relationships.
   - **Testing:** Evaluating the model on a separate subset of the data that it has not seen during training to assess its performance.

8. **What is `sklearn.preprocessing`?**
   
   `sklearn.preprocessing` is a module in scikit-learn that provides functions and classes for data preprocessing, such as scaling, normalization, and encoding categorical variables.

9. **What is a Test set?**
    
    A test set is a subset of the dataset that is held out during model training and used to evaluate the model's performance on unseen data.

10. **How do we split data for model fitting (training and testing) in Python? How do you approach a Machine Learning problem?**
    
    We can split the data using the `train_test_split` function from `sklearn.model_selection`, which randomly divides the dataset into training and testing sets according to a specified ratio (set with `test_size`). Passing `random_state` makes the split reproducible.
    A typical approach involves:
    - Defining the problem
    - Collecting and exploring the data (EDA)
    - Preprocessing and feature engineering
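    - Splitting data into training and testing sets (fitting scalers and encoders on the training set only, to avoid leaking information from the test set)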
    - Selecting and training a model
    - Evaluating the model
    - Tuning hyperparameters
    - Deployment
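
    The splitting, training, and evaluation steps can be sketched as follows. The Iris dataset, the 80/20 ratio, the `random_state` value, and the choice of logistic regression are all assumptions made for this example, not requirements.

    ```python
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Load a small bundled dataset (chosen only for illustration).
    X, y = load_iris(return_X_y=True)

    # Hold out 20% of the rows for testing; stratify keeps class proportions.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # The pipeline fits the scaler on the training data only, then the model.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)

    # Evaluate on data the model has not seen during training.
    test_accuracy = model.score(X_test, y_test)
    print(f"Test accuracy: {test_accuracy:.2f}")
    ```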

11. **Why do we have to perform EDA before fitting a model to the data?**
    
    EDA (Exploratory Data Analysis) helps understand the data structure, detect outliers, uncover patterns, and inform decisions on data cleaning and feature engineering, all of which are crucial for building effective models.

12. **What is correlation?**
    
    Correlation is a statistical measure that describes the degree to which two variables are linearly related. It is typically expressed as the Pearson correlation coefficient, a value between -1 and 1. A correlation of 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship (the variables may still be related in a nonlinear way).

13. **What does negative correlation mean?**
    
    Negative correlation means that as one variable increases, the other variable tends to decrease. It indicates an inverse relationship between the two variables.

14. **How can you find correlation between variables in Python?**
    
    Correlation between variables can be computed with pandas using the `DataFrame.corr()` method or with NumPy using `np.corrcoef()`. Visualization libraries like Seaborn can be used to create heatmaps for a graphical representation.
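
    A short sketch of both approaches is shown below. The column names and values are invented for illustration.

    ```python
    import numpy as np
    import pandas as pd

    # Hypothetical data: study time, exam score, and gaming time.
    df = pd.DataFrame({
        "hours_studied": [1, 2, 3, 4, 5],
        "exam_score": [52, 58, 65, 70, 78],
        "hours_gaming": [9, 8, 6, 5, 3],
    })

    # Pairwise Pearson correlations between all numeric columns.
    corr_matrix = df.corr()
    print(corr_matrix)

    # Correlation between two specific variables with NumPy.
    r = np.corrcoef(df["hours_studied"], df["exam_score"])[0, 1]
    print(f"r(hours_studied, exam_score) = {r:.3f}")
    ```

    Here `hours_studied` and `exam_score` are strongly positively correlated, while `hours_studied` and `hours_gaming` are strongly negatively correlated. The matrix can be passed to `seaborn.heatmap(corr_matrix, annot=True)` to plot it.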

15. **What is causation? Explain the difference between correlation and causation with an example.**
    
    **Causation** implies that one event is the direct result of another. In contrast, **correlation** indicates a relationship between two variables without proving that one causes the other. For example, ice cream sales and drowning incidents may be correlated (both rise in summer), but ice cream sales do not cause drowning incidents; rather, a lurking variable (summer heat) affects both.

16. **What is an Optimizer? What are different types of optimizers? Explain each with an example.**
    
    An optimizer is an algorithm that adjusts the parameters (weights and biases) of a model to minimize the loss function. Examples of optimizers include:
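    - **Gradient Descent:** Computes the gradient of the loss over the full training set and moves the parameters a small step in the opposite direction. For example, a linear regression can be fitted by repeatedly stepping the weight and bias against the gradient of the mean squared error.
    - **Stochastic Gradient Descent (SGD):** Estimates the gradient from a single training example or a small mini-batch at each step, which makes each update cheaper but noisier. It is commonly used on large datasets, where a full pass per update would be too slow.
    - **Adam:** Combines momentum with RMSprop-style per-parameter adaptive learning rates. It is a widely used default for training neural networks.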
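
    To make the first two concrete, here is a minimal NumPy sketch that fits a line with full-batch gradient descent and with mini-batch SGD. The function name `gradient_descent`, the synthetic data (`y = 3x + 2`, noise-free), and the learning rate are assumptions for illustration; Adam is not implemented here.

    ```python
    import numpy as np

    # Synthetic, noise-free data from y = 3x + 2 (an assumption for the demo).
    data_rng = np.random.default_rng(0)
    X = data_rng.uniform(-1, 1, size=100)
    y = 3.0 * X + 2.0

    def gradient_descent(X, y, lr=0.1, n_steps=500, batch_size=None, seed=0):
        """Fit y ~ w*x + b by minimizing mean squared error.

        batch_size=None uses the full dataset each step (gradient descent);
        an integer samples a random mini-batch each step (SGD).
        """
        rng = np.random.default_rng(seed)
        w, b = 0.0, 0.0
        n = len(X)
        for _ in range(n_steps):
            if batch_size is None:
                xb, yb = X, y
            else:
                idx = rng.choice(n, size=batch_size, replace=False)
                xb, yb = X[idx], y[idx]
            error = (w * xb + b) - yb
            # Gradients of mean((w*x + b - y)^2) with respect to w and b.
            grad_w = 2.0 * np.mean(error * xb)
            grad_b = 2.0 * np.mean(error)
            # Step against the gradient.
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b

    w_gd, b_gd = gradient_descent(X, y)
    w_sgd, b_sgd = gradient_descent(X, y, batch_size=10)
    print(w_gd, b_gd)    # should approach 3 and 2
    print(w_sgd, b_sgd)  # should also approach 3 and 2
    ```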

17. **What is `sklearn.linear_model`?**
    
    `sklearn.linear_model` is a module within scikit-learn that provides a variety of linear models for regression and classification (e.g., Linear Regression, Logistic Regression, Ridge, Lasso).

18. **What does `model.fit()` do? What arguments must be given?**
    
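    The `model.fit()` method trains the model using the provided training data. In scikit-learn it requires the feature matrix (`X`) and, for supervised estimators, the target vector (`y`); many estimators also accept an optional `sample_weight` argument. Other libraries accept further arguments: Keras, for example, lets you pass validation data to `fit()`.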

19. **What does `model.predict()` do? What arguments must be given?**
    
    The `model.predict()` method uses the trained model to make predictions on new input data. The only required argument is the feature matrix for which predictions are to be made, which must have the same features (in the same order) as the data used in `fit()`.
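
    A minimal sketch of `fit()` followed by `predict()` with `LinearRegression` is shown below. The data follow the made-up relationship `y = 2x + 1`.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Feature matrix must be 2-D: one row per sample, one column per feature.
    X = np.arange(10, dtype=float).reshape(-1, 1)
    y = 2.0 * X.ravel() + 1.0  # hypothetical target: y = 2x + 1

    model = LinearRegression()
    model.fit(X, y)  # learns the coefficient and intercept from X and y

    # Predict for a new, unseen input with the same number of features.
    X_new = np.array([[10.0]])
    prediction = model.predict(X_new)
    print(model.coef_, model.intercept_, prediction)
    ```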

20. **What are continuous and categorical variables?**
    
    - **Continuous Variables:** Variables that can take any numeric value within a range, such as temperature or height.
    - **Categorical Variables:** Variables that represent distinct categories or groups, such as color or type of product.

21. **What is feature scaling? How does it help in Machine Learning?**
    
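    Feature scaling standardizes or normalizes the range of independent variables so that no feature dominates simply because of its units. It matters most for distance-based algorithms (such as k-nearest neighbors and SVMs) and for models trained with gradient descent, which tend to perform better or converge faster when features are on a similar scale. Tree-based models are largely insensitive to it.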

22. **How do we perform scaling in Python?**
    
    Scaling can be performed using scalers available in `sklearn.preprocessing`, such as `StandardScaler` for standardization or `MinMaxScaler` for normalization.
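
    The sketch below applies both scalers to a small made-up array. Note that each scaler is fitted on the training data and then reused on the test data.

    ```python
    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler

    # Hypothetical features on very different scales.
    X_train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
    X_test = np.array([[2.0, 400.0]])

    # Standardization: zero mean and unit variance per column.
    standard = StandardScaler().fit(X_train)
    X_train_std = standard.transform(X_train)
    X_test_std = standard.transform(X_test)

    # Min-max normalization: maps the training range of each column to [0, 1].
    minmax = MinMaxScaler().fit(X_train)
    X_train_mm = minmax.transform(X_train)
    X_test_mm = minmax.transform(X_test)  # values outside the training range fall outside [0, 1]

    print(X_train_std)
    print(X_train_mm)
    print(X_test_mm)
    ```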

23. **What is `sklearn.preprocessing`?**
    
    `sklearn.preprocessing` is a module in scikit-learn that provides functions and classes for data preprocessing, such as scaling, normalization, and encoding categorical variables.

24. **How do we split data for model fitting (training and testing) in Python?**
    
    We can split the data using the `train_test_split` function from `sklearn.model_selection`, which randomly divides the dataset into training and testing sets according to a specified ratio (set with `test_size`). Passing `random_state` makes the split reproducible.

25. **Explain data encoding?**
    
    Data encoding is the process of converting categorical data into a numerical format that can be utilized by machine learning algorithms. Common techniques include one-hot encoding, label encoding, and ordinal encoding.