## Question 1: What is a parameter?
A **parameter** is a variable that is part of a mathematical model, algorithm, or function, which helps define the behavior or output of that system. 

In machine learning, parameters are internal variables of a model that are learned from training data, such as weights and biases in neural networks.
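
As a small illustration of learned parameters, the sketch below fits a scikit-learn `LinearRegression` on toy data and reads back the values it learned. The data (generated from y = 2x + 1) and the variable names are invented for this example.

```python
from sklearn.linear_model import LinearRegression

# Toy data following y = 2x + 1 (values invented for illustration)
X = [[0], [1], [2], [3]]
y = [1, 3, 5, 7]

model = LinearRegression().fit(X, y)

# coef_ and intercept_ hold the parameters learned from the training data
slope = float(model.coef_[0])
intercept = float(model.intercept_)
print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
```

Here the slope and intercept play the same role that weights and biases play in a neural network.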

## Question 2: What is correlation?
**Correlation** is a statistical measure that describes the extent to which two variables are linearly related. It indicates how strongly, and in which direction, the two variables move together, and the correlation coefficient takes a value between **-1 and 1**:
- **1**: Perfect positive correlation  
- **0**: No linear correlation  
- **-1**: Perfect negative correlation  
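
To make the coefficient concrete, here is a minimal sketch that computes the Pearson correlation directly from its definition (covariance divided by the product of the standard deviations). The helper name `pearson_r` and the sample values are assumptions made for this example.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # deviations from the mean of x
    dy = y - y.mean()  # deviations from the mean of y
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

r_pos = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])  # y rises with x
r_neg = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])  # y falls as x rises
print(r_pos, r_neg)
```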

## Question 3: What does negative correlation mean?
**Negative correlation** means that as one variable increases, the other variable tends to decrease, and vice versa. The correlation coefficient is less than 0 (e.g., -0.5).

For example:
- There might be a negative correlation between the number of hours spent on social media and productivity levels.


## Question 4: Define Machine Learning.
**Machine Learning** is a subset of artificial intelligence that focuses on enabling machines to learn from data and improve their performance on a specific task without being explicitly programmed.

It involves algorithms that can generalize patterns and make predictions or decisions based on input data.


## Question 5: What are continuous and categorical variables?
- **Continuous variables**: Numerical variables that can take any value within a given range, so there are infinitely many possible values (e.g., height, temperature). They can have decimals.
- **Categorical variables**: Variables that represent distinct groups or categories (e.g., gender, colors). These values are often represented as labels or levels.


## Question 6: How do we handle categorical variables in Machine Learning? What are the common techniques?
To handle categorical variables in Machine Learning, we transform them into numerical representations that algorithms can process. The common techniques include:

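1. **Label Encoding**:
   - Each unique category is assigned a numeric value (e.g., "red" = 0, "blue" = 1).
   - The integers are assigned arbitrarily, so they imply an order that may not exist. It is best suited to target labels or to models that are insensitive to that order (e.g., tree-based models).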

2. **One-Hot Encoding**:
   - Creates binary columns for each category.
   - Example: "red", "blue" → [1, 0] and [0, 1].

3. **Ordinal Encoding**:
   - Similar to label encoding but preserves the inherent order of categories (e.g., "low" = 1, "medium" = 2, "high" = 3).

4. **Binary Encoding**:
   - Converts categories into binary representations and reduces dimensionality compared to one-hot encoding.

5. **Frequency Encoding**:
   - Replaces categories with the frequency of their occurrence in the dataset.

6. **Target Encoding**:
   - Encodes categories based on the mean of the target variable for each category.
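
A few of these techniques can be sketched with plain pandas. The column names and values below are invented for illustration, and frequency encoding is shown here with relative frequencies (raw counts are an equally common choice).

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "size": ["low", "high", "medium", "low"],
    "color": ["red", "blue", "red", "red"],
    "target": [1, 0, 1, 0],
})

# Ordinal encoding: an explicit mapping preserves low < medium < high
size_order = {"low": 1, "medium": 2, "high": 3}
df["size_ordinal"] = df["size"].map(size_order)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Frequency encoding: share of rows in which each category occurs
df["color_freq"] = df["color"].map(df["color"].value_counts(normalize=True))

# Target encoding: mean of the target within each category
df["color_target"] = df["color"].map(df.groupby("color")["target"].mean())

print(df)
print(one_hot)
```

In practice, target encoding is usually computed on the training split only, since using the full dataset leaks target information into the features.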


## Question 7: What do you mean by training and testing a dataset?
- **Training Dataset**:
  - A subset of the data used to train the machine learning model.
  - The model learns patterns, relationships, and rules from this dataset.

- **Testing Dataset**:
  - A separate subset of the data used to evaluate the trained model.
  - It provides a measure of the model's performance on unseen data, ensuring it generalizes well.


## Question 8: What is sklearn.preprocessing?
The `sklearn.preprocessing` module in **Scikit-learn** provides tools for transforming and preprocessing data. These tools help make data suitable for machine learning models. Commonly used classes include:

- **Scaling and Normalization**:
  - `StandardScaler`: Scales data to have zero mean and unit variance.
  - `MinMaxScaler`: Scales data to a specific range, often [0, 1].

- **Encoding**:
  - `LabelEncoder`: Encodes target labels with integer values between 0 and n_classes - 1.
  - `OneHotEncoder`: Encodes categorical features as a one-hot numeric array.

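- **Binarization**:
  - `Binarizer`: Converts numerical values into binary values based on a threshold.

- **Imputation**:
  - Missing values are handled with `SimpleImputer`, which lives in the separate `sklearn.impute` module rather than in `sklearn.preprocessing`.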
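
The sketch below chains imputation and binarization on a single invented column. Note that `SimpleImputer` is imported from `sklearn.impute`; the threshold of 0.5 and the sample values are arbitrary choices for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import Binarizer

# One feature with a missing value (values invented for illustration)
values = np.array([[0.2], [np.nan], [0.9], [0.7]])

# Replace the missing entry with the mean of the observed values
imputed = SimpleImputer(strategy="mean").fit_transform(values)

# Map values above the threshold to 1 and all others to 0
binary = Binarizer(threshold=0.5).fit_transform(imputed)

print(imputed.ravel())
print(binary.ravel())
```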

## Question 9: What is a Test set?
The **Test set** is a portion of the dataset that is separated from the training process and used to evaluate the model's performance on unseen data. 

- It ensures that the model generalizes well to new data and isn't overfitting to the training set.
- Typically, 20–30% of the dataset is reserved as the test set.
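
A minimal sketch of how a test set is used: the model is fitted on the training portion only, and its score on the held-out portion estimates performance on unseen data. The choice of the bundled Iris dataset, `LogisticRegression`, and a 25% test share are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data seen in training
test_acc = model.score(X_test, y_test)     # accuracy on unseen data
print("train accuracy:", train_acc, "test accuracy:", test_acc)
```

A training score far above the test score would be a typical sign of overfitting.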


## Question 10: How do we split data for model fitting (training and testing) in Python?
To split the dataset for training and testing, we use Scikit-learn's `train_test_split()` function. Here's an example:

```python
from sklearn.model_selection import train_test_split

# Assuming X (features) and y (target variable) are already defined
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Parameters:
# test_size: Fraction of the data to be used as the test set (e.g., 0.2 = 20%).
# random_state: A seed to ensure the splits are reproducible.
```

## Question 11: Why do we have to perform EDA before fitting a model to the data?
**Exploratory Data Analysis (EDA)** is crucial because:
- It helps identify missing values, outliers, and data inconsistencies.
- Reveals the underlying distribution of data and variable relationships.
- Detects multicollinearity and other data issues.
- Guides feature selection, transformation, and preprocessing.

EDA helps ensure that the data is clean, well-structured, and suitable for building robust machine learning models.
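
The checks listed above can be sketched with a few pandas calls. The DataFrame below is invented for illustration, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with one missing value and one extreme income
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [40_000, 52_000, 48_000, 250_000, 45_000],
    "spend": [400, 510, 480, 2600, 455],
})

# Missing values per column
missing = df.isna().sum()

# Summary statistics (count, mean, std, quartiles) for each numeric column
summary = df.describe()

# Flag outliers in one column using the 1.5 * IQR rule
q1, q3 = df["income"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["income"] < q1 - 1.5 * iqr) | (df["income"] > q3 + 1.5 * iqr)]

# Pairwise correlations; values near +/-1 hint at multicollinearity
corr = df.corr()

print(missing, summary, outliers, corr, sep="\n\n")
```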


## Question 12: What is correlation?
**Correlation** is a statistical measure that quantifies the degree to which two variables are linearly related.  
- It ranges from **-1** to **1**:
  - **1**: Perfect positive correlation.
  - **0**: No linear relationship.
  - **-1**: Perfect negative correlation.

## Question 13: What does negative correlation mean?
**Negative correlation** means that as one variable increases, the other decreases. The correlation coefficient is less than 0.  
**Example**: As the temperature decreases, the demand for hot beverages increases.


## Question 14: How can you find correlation between variables in Python?
In Python, you can compute the correlation between variables using the **`pandas`** library:


```python
import pandas as pd

data = {'A': [1, 2, 3, 4], 'B': [4, 3, 2, 1]}
df = pd.DataFrame(data)

# corr() computes pairwise Pearson correlations by default
correlation_matrix = df.corr()
print(correlation_matrix)
```

Output:

```
     A    B
A  1.0 -1.0
B -1.0  1.0
```


## Question 15: What is causation? Explain the difference between correlation and causation with an example.

- **Causation**: When one event directly leads to another.
- **Correlation**: A statistical relationship between two variables, but it does not imply that one causes the other.

**Example**:
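- **Correlation**: Ice cream sales and drowning rates are correlated because both increase in summer. Hot weather drives both, and neither causes the other.
- **Causation**: Increasing the price of a product typically reduces its sales, because the higher price directly discourages purchases.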
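
The ice cream example can be simulated to show how a shared cause produces correlation without causation. Everything in this sketch (the coefficients, noise levels, and the `residuals` helper) is invented for illustration: temperature drives both series, and once its effect is removed the remaining correlation is close to zero.

```python
import numpy as np

rng = np.random.default_rng(0)

# Temperature is the common cause of both quantities
temperature = rng.uniform(10, 35, size=500)
ice_cream = 20 * temperature + rng.normal(0, 50, size=500)
drownings = 0.3 * temperature + rng.normal(0, 1, size=500)

# Raw correlation between the two effects
r = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """Part of y left over after a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlation after removing the effect of temperature from both series
r_partial = np.corrcoef(
    residuals(ice_cream, temperature), residuals(drownings, temperature)
)[0, 1]

print("raw correlation:", r)
print("after controlling for temperature:", r_partial)
```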


## Question 16: What is an Optimizer? What are different types of optimizers? Explain each with an example.

An **Optimizer** is an algorithm that minimizes the loss function by updating the model's parameters (weights and biases) during training.

### Types of Optimizers:

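1. **Gradient Descent**:
   - Updates weights in the direction of the steepest descent of the loss function, i.e., along the negative gradient.
   - **Example** (Keras exposes this update rule through its `SGD` class; how many samples feed each update depends on the batch size used during training):
     ```python
     from tensorflow.keras.optimizers import SGD
     optimizer = SGD(learning_rate=0.01)
     ```

2. **Stochastic Gradient Descent (SGD)**:
   - Updates weights using one sample (or, in practice, a small mini-batch) at a time instead of the full dataset.
   - Useful for large datasets, because each update is cheap to compute.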

3. **Adam**:
   - Combines momentum and adaptive learning rates.
   - **Example**:
     ```python
     from tensorflow.keras.optimizers import Adam
     optimizer = Adam(learning_rate=0.001)
     ```

4. **RMSprop**:
   - Uses an exponentially decaying average of past squared gradients to scale learning rates.
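
To show what an optimizer does under the hood, here is a minimal sketch of plain (full-batch) gradient descent written in NumPy, fitting a line to toy data by minimizing mean squared error. The data, learning rate, and iteration count are assumptions chosen for this example; library optimizers such as Adam or RMSprop refine this basic update rule.

```python
import numpy as np

# Toy data following y = 2x (values invented for illustration)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.01

for _ in range(5000):
    error = (w * X + b) - y            # prediction error for every sample
    grad_w = 2 * np.mean(error * X)    # d(MSE)/dw
    grad_b = 2 * np.mean(error)        # d(MSE)/db
    w -= learning_rate * grad_w        # step against the gradient
    b -= learning_rate * grad_b

print("w:", round(w, 4), "b:", round(b, 4))
```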


## Question 17: What is sklearn.linear_model?

`sklearn.linear_model` is a module in Scikit-learn that implements linear models for regression and classification tasks.

### Popular Classes:
- **LinearRegression**: For ordinary least squares regression.
- **LogisticRegression**: For classification problems.
- **Ridge** and **Lasso**: For regression with regularization.

**Example**:
```python
from sklearn.linear_model import LinearRegression
model = LinearRegression()
```

## Question 18: What does model.fit() do? What arguments must be given?


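```python
# The model.fit() method trains the model on the training dataset by estimating
# the parameters (here, the coefficient and intercept) that best fit it.

# Example:
from sklearn.linear_model import LinearRegression

model = LinearRegression()

X_train = [[1], [2], [3], [4], [5]]  # 2D: one row per sample, one column per feature
y_train = [2, 4, 6, 8, 10]           # 1D: one target value per sample

model.fit(X_train, y_train)  # Arguments: X (input features) and y (target variable)
```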

## Question 19: What does model.predict() do? What arguments must be given?

```python
# The model.predict() method uses the trained model to make predictions on new data.

# Example (uses the `model` fitted in Question 18):
# Test dataset
X_test = [[6], [7], [8]]  # Input features for prediction

# Making predictions
predictions = model.predict(X_test)  # Argument: X (input features for prediction)

# Displaying the predictions
print("Predictions:", predictions)
```

Output:

```
Predictions: [12. 14. 16.]
```


## Question 20: What are continuous and categorical variables?

```python
# Continuous Variables
# These are variables that can take any value within a range.
# Example:
height = [160.5, 172.3, 158.0, 180.1]  # Heights in cm

# Categorical Variables
# These are variables that represent categories or labels.
# Example:
gender = ['Male', 'Female', 'Female', 'Male']
colors = ['Red', 'Blue', 'Green', 'Red']

print("Continuous Variables (Heights):", height)
print("Categorical Variables (Gender):", gender)
print("Categorical Variables (Colors):", colors)
```

Output:

```
Continuous Variables (Heights): [160.5, 172.3, 158.0, 180.1]
Categorical Variables (Gender): ['Male', 'Female', 'Female', 'Male']
Categorical Variables (Colors): ['Red', 'Blue', 'Green', 'Red']
```


## Question 21: What is feature scaling? How does it help in Machine Learning?

**Feature scaling** is the process of normalizing or standardizing the features of your data so that they all have comparable ranges. It can improve the performance and convergence speed of machine learning algorithms, especially those that rely on distance metrics, such as k-nearest neighbors (KNN), and those trained with gradient descent.

### Why is it important?
- Some algorithms are sensitive to the scale of the features (e.g., SVM and regularized Logistic Regression).
- Prevents one feature with larger numerical values from dominating others.
- Speeds up convergence during training.
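
The "dominating feature" problem can be seen in a small distance calculation. In this sketch the four rows (age in years, income in dollars) and the helper `nearest_to_first` are invented for illustration: before scaling, income differences swamp age differences, so the nearest neighbor of the first row changes once both columns are standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Columns: age (years), income (dollars); values invented for illustration
X = np.array([
    [25, 50_000],
    [45, 50_500],
    [26, 52_000],
    [60, 60_000],
], dtype=float)

def nearest_to_first(points):
    """Index of the row closest (Euclidean distance) to row 0."""
    distances = np.linalg.norm(points[1:] - points[0], axis=1)
    return int(np.argmin(distances)) + 1

# Raw data: income dominates, so the person 20 years older looks "closest"
raw_neighbor = nearest_to_first(X)

# Standardized data: both features contribute on a comparable scale
scaled_neighbor = nearest_to_first(StandardScaler().fit_transform(X))

print("nearest neighbor before scaling:", raw_neighbor)
print("nearest neighbor after scaling:", scaled_neighbor)
```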


## Question 22: How do we perform scaling in Python?

You can use **scikit-learn's preprocessing** module to perform feature scaling. Two common methods are **Standardization** and **Normalization**.

### Example with StandardScaler (Standardization):

```python
from sklearn.preprocessing import StandardScaler

data = [[1, 2], [3, 4], [5, 6]]

scaler = StandardScaler()

# fit_transform learns each column's mean and standard deviation, then rescales
scaled_data = scaler.fit_transform(data)

print("Scaled Data:\n", scaled_data)
```

Output:

```
Scaled Data:
 [[-1.22474487 -1.22474487]
 [ 0.          0.        ]
 [ 1.22474487  1.22474487]]
```
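
For comparison, normalization to the [0, 1] range can be sketched with `MinMaxScaler` on the same toy data. This is a minimal example using the default `feature_range`.

```python
from sklearn.preprocessing import MinMaxScaler

data = [[1, 2], [3, 4], [5, 6]]

# Rescale each column so its minimum maps to 0 and its maximum to 1
min_max_scaler = MinMaxScaler()
normalized_data = min_max_scaler.fit_transform(data)

print("Normalized Data:\n", normalized_data)
```

Each column's smallest value becomes 0, its largest becomes 1, and the middle row lands at 0.5.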


## Question 23: What is sklearn.preprocessing?
- `sklearn.preprocessing` is a module in scikit-learn that provides tools for feature scaling and encoding.
- It includes classes such as `StandardScaler`, `MinMaxScaler`, `OneHotEncoder`, and `LabelEncoder`.

**Example**:

```python
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

# LabelEncoder assigns integers in sorted order: Blue=0, Green=1, Red=2
labels = ['Red', 'Blue', 'Green', 'Red']
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)
print("\nEncoded Labels:", encoded_labels)

# OneHotEncoder expects 2D input; sparse_output requires scikit-learn >= 1.2
data = [['Red'], ['Blue'], ['Green'], ['Red']]
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_data = one_hot_encoder.fit_transform(data)
print("\nOne-Hot Encoded Data:\n", encoded_data)
```

Output:

```
Encoded Labels: [2 0 1 2]

One-Hot Encoded Data:
 [[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
```


## Question 24: How do we split data for model fitting (training and testing) in Python?
Use `train_test_split` from `sklearn.model_selection` to split data.

```python
from sklearn.model_selection import train_test_split

X = [[1, 2], [3, 4], [5, 6], [7, 8]]  # Features
y = [1, 0, 1, 0]  # Target labels

# With 4 samples, test_size=0.2 is rounded up to 1 test sample
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nX_train:", X_train)
print("X_test:", X_test)
```

Output:

```
X_train: [[7, 8], [1, 2], [5, 6]]
X_test: [[3, 4]]
```


## Question 25: Explain data encoding.
Data encoding is the process of converting categorical data into numerical format.

- One-Hot Encoding Example:

```python
from sklearn.preprocessing import OneHotEncoder

# Each row is one sample; columns are ordered alphabetically: Blue, Green, Red
data = [['Red'], ['Blue'], ['Green'], ['Red']]
one_hot_encoder = OneHotEncoder(sparse_output=False)
encoded_data = one_hot_encoder.fit_transform(data)
print("\nOne-Hot Encoded Data:\n", encoded_data)
```

Output:

```
One-Hot Encoded Data:
 [[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
```
