<a href="https://colab.research.google.com/github/koad7/DS_Interview_Revision/blob/main/DS_Interview_Revision_ML.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Machine Learning

#### 7.1. Say you are building a binary classifier for an unbalanced dataset (where one class is rarer than the other, say 1% and 99% respectively). How do you handle this situation?


**Solution 1**: Try to find more data.

**Solution 2**: Use appropriate performance metrics. Accuracy does not work well with imbalanced data: a model that always predicts the majority class scores 99% accuracy while never detecting the rare class. Prefer precision, recall, F1, or the area under the precision-recall curve.

**Solution 3**: Resample the training set by either oversampling the rare samples or undersampling the abundant samples. You can use bootstrapping.

**Solution 4**: Generate synthetic examples, e.g. with SMOTE.

**Solution 5**: Run ensemble models with different ratios of the classes, or run an ensemble in which each model uses all samples of the rare class and a differing amount of the abundant class.

**Solution 6**: Design your own cost function that penalizes wrong classification of the rare class more than wrong classifications of the abundant class.
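
To make the point behind Solution 2 concrete, here is a minimal sketch of a baseline that always predicts the majority class. The labels are made up for illustration (990 negatives, 10 positives), and the names `X_demo`, `y_demo`, and `baseline` are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 990 negatives and 10 positives (a 99% / 1% split)
y_demo = np.array([0] * 990 + [1] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are irrelevant for this baseline

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_hat)
baseline_recall = recall_score(y_demo, y_hat)  # recall on the rare class (label 1)
print(f"accuracy={baseline_accuracy:.2f}, rare-class recall={baseline_recall:.2f}")
```

The baseline looks excellent on accuracy yet finds none of the rare cases, which is why recall, precision, or F1 on the rare class is the more informative number here.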

```python
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Create a dataset (replace this with your actual dataset)
X, y = make_classification(
    n_classes=2, class_sep=2, weights=[0.99, 0.01], n_informative=3,
    n_redundant=1, flip_y=0, n_features=20, n_clusters_per_class=1,
    n_samples=1000, random_state=10,
)

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Apply SMOTE to the training set only, so the test set keeps the original class ratio
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
```
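
SMOTE creates synthetic points. The plain bootstrap oversampling mentioned in Solution 3 can be sketched with `sklearn.utils.resample`. The toy data and the names `X_toy`, `y_toy`, and `X_minor_up` are assumptions made for this illustration, not part of the notebook's dataset.

```python
import numpy as np
from sklearn.utils import resample

# Hypothetical toy data: 95 majority samples (label 0) and 5 minority samples (label 1)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 2))
y_toy = np.array([0] * 95 + [1] * 5)

X_major, X_minor = X_toy[y_toy == 0], X_toy[y_toy == 1]

# Draw minority rows with replacement until both classes have the same size
X_minor_up = resample(X_minor, replace=True, n_samples=len(X_major), random_state=0)

X_balanced = np.vstack([X_major, X_minor_up])
y_balanced = np.array([0] * len(X_major) + [1] * len(X_minor_up))
print(np.bincount(y_balanced))
```

As with SMOTE, resampling should be applied to the training split only; otherwise copies of the same minority row can end up in both training and test data.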


Adjusting class weight

```python
from sklearn.linear_model import LogisticRegression

# Initialize logistic regression with class_weight='balanced'
clf = LogisticRegression(class_weight='balanced')

# Train the model
clf.fit(X_train, y_train)

# Proceed with prediction and evaluation...
```


Using appropriate evaluation metrics

```python
from sklearn.metrics import classification_report, roc_auc_score

# Predict on test set
y_pred = clf.predict(X_test)

# Evaluation; ROC-AUC is computed from predicted probabilities rather than hard labels
print(classification_report(y_test, y_pred))
print("ROC-AUC Score:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))
```

Output:

```
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       247
           1       1.00      1.00      1.00         3

    accuracy                           1.00       250
   macro avg       1.00      1.00      1.00       250
weighted avg       1.00      1.00      1.00       250

ROC-AUC Score: 1.0
```


Ensemble method like Random Forests or Gradient Boosting

```python
from sklearn.ensemble import RandomForestClassifier

# Initialize random forest with class_weight='balanced'
rf = RandomForestClassifier(class_weight='balanced')

# Train the model
rf.fit(X_train, y_train)

# Proceed with prediction and evaluation...
```


Custom probability threshold

```python
import numpy as np

# Predict probabilities
y_probs = clf.predict_proba(X_test)[:, 1]  # Probabilities for the positive class

# Custom threshold
threshold = 0.3
y_pred_custom = np.where(y_probs > threshold, 1, 0)

# Evaluation with custom threshold
print(classification_report(y_test, y_pred_custom))
```

Output:

```
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       247
           1       0.75      1.00      0.86         3

    accuracy                           1.00       250
   macro avg       0.88      1.00      0.93       250
weighted avg       1.00      1.00      1.00       250
```



#### 7.2. What are some differences you would expect in a model that minimizes squared error versus a model that minimizes absolute error? In which cases would each error metric be appropriate?


Minimizing squared error and minimizing absolute error are two different approaches to error measurement in predictive models, each with its unique characteristics and implications. Here are the key differences you would expect in models optimized for these two types of errors, along with scenarios where each is appropriate:

###### Squared Error (L2 Loss)

**Minimizing Squared Error** focuses on minimizing the square of the differences between predicted and actual values. This is commonly known as L2 loss or mean squared error (MSE).

###### Characteristics:
1. **More Sensitive to Outliers**: Squared error penalizes larger errors more heavily than smaller ones, as the error increases quadratically with the distance from the true value.
2. **Differentiable**: This makes it easier to use in optimization algorithms that rely on gradient descent.
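3. **Tends to Have a Single Solution**: Squared error is strictly convex in the predictions, so linear least squares has a unique minimizer whenever the predictors are not perfectly collinear. This can lead to more stable and predictable models.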

###### Appropriate Cases:
- When the model needs to be particularly sensitive to larger errors.
- In regression problems where a smooth loss landscape is beneficial (e.g., linear regression).
- When the data distribution is Gaussian, as MSE corresponds to maximum likelihood estimation under this assumption.
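
The link between MSE and the Gaussian assumption can be made explicit. As a sketch, assume $y_i = f(x_i) + \varepsilon_i$ with independent $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$; the symbols $f$, $\varepsilon_i$, and $\sigma$ are notation introduced here for illustration. The negative log-likelihood is then

$$
-\log L(f) = \frac{n}{2}\log\left(2\pi\sigma^2\right) + \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - f(x_i)\bigr)^2 .
$$

For fixed $\sigma$ the first term is constant, so maximizing the likelihood is the same as minimizing the sum of squared errors. Repeating the argument with Laplace-distributed noise gives a sum of absolute errors instead, which is the corresponding justification for MAE.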

###### Example - Linear Regression Minimizing MSE:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Assuming X_train, y_train are your training data
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
predictions = model.predict(X_test)

# Calculate MSE
mse = mean_squared_error(y_test, predictions)
```

###### Absolute Error (L1 Loss)

**Minimizing Absolute Error** focuses on minimizing the absolute differences between predicted and actual values. This is known as L1 loss or mean absolute error (MAE).

###### Characteristics:
1. **Less Sensitive to Outliers**: L1 loss increases linearly with errors, making it more robust to outliers compared to L2 loss.
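2. **Can Have Multiple Optimal Solutions**: Absolute error is convex but not strictly convex, so several parameter values can attain the same minimum (for example, any value between the two middle points minimizes the absolute error of an even-sized sample). This can lead to a model that is less stable but potentially more robust.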
3. **Non-differentiable at Zero**: This can make optimization more challenging, particularly for methods that rely on gradient information.

###### Appropriate Cases:
- In scenarios where outliers are to be treated less harshly.
- For distributions that are not well approximated by a Gaussian.
- In problems where a median is more representative than a mean.
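
The mean/median distinction in the last bullet can be checked numerically: for a constant prediction, squared error is minimized by the mean and absolute error by the median. The sketch below uses a made-up sample (`values`) with one outlier and a brute-force grid search. It is only an illustration, not how either loss is optimized in practice.

```python
import numpy as np

# Hypothetical sample with one large outlier
values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Evaluate both losses for every candidate constant prediction on a fine grid
candidates = np.linspace(0.0, 100.0, 10001)  # step of 0.01
residuals = values[None, :] - candidates[:, None]
mse = (residuals ** 2).mean(axis=1)
mae = np.abs(residuals).mean(axis=1)

best_mse = candidates[mse.argmin()]
best_mae = candidates[mae.argmin()]

print(f"MSE minimizer: {best_mse:.2f} (mean   = {values.mean():.2f})")
print(f"MAE minimizer: {best_mae:.2f} (median = {np.median(values):.2f})")
```

The single outlier drags the squared-error minimizer far from the bulk of the data, while the absolute-error minimizer stays with the typical values.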

###### Example - Linear Regression Minimizing MAE:

```python
from sklearn.linear_model import QuantileRegressor
from sklearn.metrics import mean_absolute_error

# Assuming X_train, y_train are your training data.
# LinearRegression minimizes squared error, so it is not used here.
# Median regression (quantile=0.5) with no penalty (alpha=0) minimizes absolute error.
# QuantileRegressor is available in scikit-learn 1.0 and later.
model = QuantileRegressor(quantile=0.5, alpha=0.0, solver="highs")
model.fit(X_train, y_train)

# Predictions
predictions = model.predict(X_test)

# Calculate MAE
mae = mean_absolute_error(y_test, predictions)
```

##### Choosing Between MSE and MAE

The choice between MSE and MAE depends on the specific requirements of your problem and the nature of your data. If your data contains a lot of outliers and you want a measure that's robust to these outliers, MAE might be more appropriate. On the other hand, if you're dealing with data where the Gaussian assumption holds and you want to penalize larger errors more, MSE would be the better choice.

It's also worth noting that while MSE can give more weight to larger errors (which could be either good or bad, depending on the context), MAE provides a more uniform treatment of all errors, making it a more 'democratic' measure in some senses.

In practice, it's often beneficial to experiment with both types of loss functions and evaluate which one aligns better with the objectives of your specific problem. Additionally, consider the distribution of your data and the impact of outliers on your predictive modeling.

Squared error is *MSE* and absolute error is *MAE*.


The main difference is that errors are squared before being averaged in *MSE*, meaning there is a relatively high weight given to large errors.

*MSE* is useful when large errors are particularly undesirable. Outliers affect *MSE* disproportionately; *MAE* is more robust to them.

*MSE* is easier to optimize since it is differentiable everywhere and linear least squares has a closed-form solution. *MAE* is not differentiable at zero, so minimizing it typically relies on subgradient methods or, for linear models, a linear-programming formulation.





#### 7.3. Facebook: When performing K-means clustering, how do you choose k?


Choosing the right number of clusters (\( k \)) in K-means clustering is a critical step as it directly influences the results of the clustering process. There's no hard and fast rule for this, but several methods are commonly used to estimate a good value for \( k \):

##### 1. The Elbow Method

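The Elbow Method involves plotting the within-cluster sum of squared distances (called inertia or distortion) as a function of the number of clusters and picking the elbow of the curve, the point after which adding clusters yields only small reductions, as the number of clusters to use.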

###### Code Example using the Elbow Method:

```python
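from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Sample data: synthetic blobs for illustration; replace with your own data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)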

# Calculate distortions for a range of number of clusters
distortions = []
for i in range(1, 11):
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=300, random_state=0)
    km.fit(X)
    distortions.append(km.inertia_)

# Plot
plt.plot(range(1, 11), distortions, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Distortion')
plt.title('Elbow Method')
plt.show()
```

##### 2. The Silhouette Method

The Silhouette Method measures how similar an object is to its own cluster compared to other clusters. The silhouette score ranges from -1 to 1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters.

###### Code Example using the Silhouette Method:

```python
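from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Sample data: synthetic blobs for illustration; replace with your own data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Range of clusters to try (the silhouette score needs at least 2 clusters)
range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, n_init=10, random_state=10)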
    cluster_labels = clusterer.fit_predict(X)
    
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)
```

##### 3. The Gap Statistic

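The Gap Statistic compares the total intra-cluster variation for different values of \( k \) with its expected value under a null reference distribution of the data, that is, a distribution with no obvious clustering. A good \( k \) is one for which the observed variation falls well below the reference value.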

###### Code Example using the Gap Statistic:

For the Gap Statistic, it's recommended to use an existing implementation, such as the `gap_statistic` module (installed from the `gap-stat` package), due to the complexity of the calculations.

```python
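from gap_statistic import OptimalK
from sklearn.datasets import make_blobs
import numpy as np

# Sample data: synthetic blobs for illustration; replace with your own data
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)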

optimalK = OptimalK(parallel_backend='joblib')
n_clusters = optimalK(X, cluster_array=np.arange(1, 11))

print('Optimal clusters: ', n_clusters)
```

##### Choosing \( k \)

In practice, you might need to combine these methods and use domain knowledge to determine the most appropriate number of clusters. Each method has its strengths and weaknesses, and the choice can depend on the specific characteristics of your data and the purpose of the clustering.

Example:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
import numpy as np

# Generating sample data using make_blobs
# This will create a dataset with a specified number of clusters (for illustration purposes)
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

# Range of clusters to try
range_n_clusters = [2, 3, 4, 5, 6]

# Calculate silhouette scores for different number of clusters
silhouette_scores = {}
for n_clusters in range_n_clusters:
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    silhouette_avg = silhouette_score(X, cluster_labels)
    silhouette_scores[n_clusters] = silhouette_avg

print(silhouette_scores)
```

Here are the calculated average silhouette scores for each number of clusters, using the generated sample data:

- For 2 clusters, the average silhouette score is: $ \approx $ 0.705
- For 3 clusters, the average silhouette score is: $ \approx $ 0.588
- For 4 clusters, the average silhouette score is: $ \approx $ 0.651
- For 5 clusters, the average silhouette score is: $ \approx $ 0.564
- For 6 clusters, the average silhouette score is: $ \approx $ 0.450

Based on these scores, 2 clusters provide the highest silhouette score, indicating that it might be the most appropriate choice for this dataset. Note, however, that the data were generated with `centers=4`, and 4 clusters has the second-highest score: a coarser grouping can score higher when some groups lie close together, so the silhouette score should not be used on its own. In real-world scenarios, you should also consider domain knowledge and the specific context of the data when choosing the number of clusters.

#### 7.4. How can you make your models more robust to outliers?


Making your models more robust to outliers is crucial, especially in datasets where anomalies are common or when these outliers can significantly skew your results. Here are several strategies to improve the robustness of your models:

##### 1. Data Preprocessing

- **Outlier Detection and Removal**: Identify outliers using statistical methods (like Z-scores, IQR) or visualization techniques (like box plots) and remove them from your dataset.
- **Transformation**: Apply transformations like log, square root, or Box-Cox transformations to reduce the impact of outliers.
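
As a minimal sketch of the IQR approach mentioned above, the following applies Tukey's 1.5 x IQR rule with NumPy. The sample `data` is made up, and the clipping step is shown as an alternative to removal rather than as a recommendation.

```python
import numpy as np

# Hypothetical measurements with two obvious outliers
data = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 10.0, 95.0, 12.0, -40.0])

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (data < lower) | (data > upper)
cleaned = data[~is_outlier]

# Alternative to removal: clip (winsorize) extreme values to the fences
clipped = np.clip(data, lower, upper)

print("outliers:", data[is_outlier])
```

Removal discards information, so it is worth checking whether the flagged points are errors or genuine rare events before dropping them.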

##### 2. Robust Scaling

- **Robust Scaler**: Use scalers that are robust to outliers, like `RobustScaler` in Scikit-learn, which scales features using statistics that are robust to outliers.
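- **Normalization**: Be careful with Min-Max scaling: it is itself sensitive to outliers, because a single extreme value sets the range and compresses the rest of the data. Use it only after outliers have been clipped or removed.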

##### 3. Choosing Robust Models

- **Robust Algorithms**: Use algorithms that are inherently robust to outliers, like Random Forests, or robust variants of algorithms (e.g., RANSAC regressor).
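- **Regularization**: Use regularization techniques (like L1 or L2 regularization), which penalize large weights in linear models and so limit how far a few extreme points can pull the coefficients. Regularization does not change the loss applied to the residuals, so it complements a robust loss rather than replacing it.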

##### 4. Ensemble Methods

- **Bagging**: Use ensemble methods like Bagging which can reduce the variance and improve model robustness.
- **Random Forests**: A type of ensemble method that is inherently robust to outliers.

##### 5. Using Appropriate Loss Functions

- **Robust Loss Functions**: Use loss functions that are less sensitive to outliers, such as Mean Absolute Error (MAE) instead of Mean Squared Error (MSE).

##### 6. Data Resampling

- **Stratified Sampling**: Ensure that outliers are proportionally represented in both training and testing datasets.

##### 7. Cross-Validation

- **Robust Cross-Validation**: Use k-fold cross-validation (stratified where applicable) so that the performance estimate does not depend on which single split the outliers happen to fall into.

##### Example Code for Some Strategies:

###### Robust Scaler:

```python
from sklearn.preprocessing import RobustScaler

scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)  # X is your data
```

###### Random Forest:

```python
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)  # X_train and y_train are your training data
```

###### Regularization (Lasso for L1):

```python
from sklearn.linear_model import Lasso

model = Lasso(alpha=0.1)  # alpha is a regularization parameter
model.fit(X_train, y_train)
```
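
###### Robust Loss (Huber):

The robust-loss strategy from section 5 can be sketched with scikit-learn's `HuberRegressor`, which applies a squared penalty to small residuals and a linear penalty to large ones. The data below are synthetic and the names `y_noisy`, `ols`, and `huber` are assumptions for this illustration.

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Hypothetical data: y = 3x + 2 plus small noise, with a few corrupted targets
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y_clean = 3.0 * x + 2.0 + rng.normal(scale=0.5, size=200)
y_noisy = y_clean.copy()
y_noisy[:10] += 200.0  # inject 10 large outliers

X_demo = x.reshape(-1, 1)
ols = LinearRegression().fit(X_demo, y_noisy)
huber = HuberRegressor(max_iter=1000).fit(X_demo, y_noisy)

# Compare both fits against the true line at x = 5
true_at_5 = 3.0 * 5 + 2.0
ols_err = abs(ols.predict([[5.0]])[0] - true_at_5)
huber_err = abs(huber.predict([[5.0]])[0] - true_at_5)

print("OLS error at x=5:  ", ols_err)
print("Huber error at x=5:", huber_err)
```

On data like this the squared-error fit is pulled toward the corrupted points, while the Huber fit stays close to the line that generated the clean data.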

Remember, the choice of strategy heavily depends on the nature of your data and the specific requirements of your task. In many cases, a combination of these strategies will yield the best results.

#### 7.5. Say that you are running a multiple linear regression and that you have reason to believe that several of the predictors are correlated. How will the results of the regression be affected if several are indeed correlated? How would you deal with this problem?


When several predictors in a multiple linear regression are correlated, this situation is known as multicollinearity. Here's how it affects the regression and ways to deal with it:

##### Effects of Multicollinearity

1. **Unreliable Coefficients**: Multicollinearity can lead to large variances for the coefficient estimates, making them unstable and unreliable.
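2. **Misleading Significance of Predictors**: It becomes difficult to determine the individual effect of correlated predictors: inflated standard errors can make relevant predictors look statistically insignificant, and coefficients can change magnitude or even sign when a correlated predictor is added or removed.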
3. **Overfitting**: The model might fit the training data well but perform poorly on unseen data.
4. **Difficulty in Interpreting**: Interpreting the model becomes challenging because the effect of one predictor is difficult to isolate from the effect of the others.
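
The first effect can be illustrated numerically. The sketch below computes the theoretical OLS standard errors, the square roots of the diagonal of $\sigma^2 (X^\top X)^{-1}$, for two synthetic designs: one with independent predictors and one where the second predictor is almost a copy of the first. The helper `coef_std_errors` and all data are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

def coef_std_errors(x1, x2, sigma=1.0):
    """Theoretical OLS standard errors: sqrt of diag(sigma^2 * (X'X)^-1)."""
    X = np.column_stack([np.ones_like(x1), x1, x2])
    cov = sigma ** 2 * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(cov))

x1 = rng.normal(size=n)
x2_independent = rng.normal(size=n)
x2_correlated = x1 + 0.05 * rng.normal(size=n)  # almost a copy of x1

se_independent = coef_std_errors(x1, x2_independent)
se_correlated = coef_std_errors(x1, x2_correlated)

print("SE (intercept, x1, x2), independent predictors:", se_independent.round(3))
print("SE (intercept, x1, x2), correlated predictors: ", se_correlated.round(3))
```

The noise level and sample size are identical in both designs; only the correlation between the predictors changes, and that alone is enough to inflate the standard errors of their coefficients.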

##### Dealing with Multicollinearity

1. **Feature Selection**:
   - **Manual Selection**: Remove predictors that are known to be correlated.
   - **Variance Inflation Factor (VIF)**: Calculate the VIF for each predictor; common rules of thumb treat a VIF greater than 5 or 10 as a sign of high multicollinearity.

    ###### VIF Calculation Example:

    ```python
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant
    import pandas as pd

    # Assuming df is your DataFrame and it includes all the predictors
    X = add_constant(df)

    # Calculating VIF for each feature
    vif_data = pd.DataFrame()
    vif_data["feature"] = X.columns
    vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

    print(vif_data)
    ```

2. **Principal Component Analysis (PCA)**:
   - Use PCA to reduce the dimensionality of the data. PCA combines the predictors into a smaller set of uncorrelated components.
   
    ###### PCA Example:

    ```python
    from sklearn.decomposition import PCA

    # Assuming X is your predictors
    pca = PCA(n_components=3)  # n_components based on the desired number of features
    X_pca = pca.fit_transform(X)

    # Then use X_pca for your regression
    ```

3. **Ridge or Lasso Regression**:
   - Use regularization methods like Ridge or Lasso regression which can handle multicollinearity by shrinking the coefficients.
   
    ###### Ridge Regression Example:

    ```python
    from sklearn.linear_model import Ridge

    # Assuming X_train and y_train are your training data
    ridge = Ridge(alpha=1.0)  # Alpha is a tuning parameter
    ridge.fit(X_train, y_train)
    ```

4. **Partial Least Squares Regression (PLSR)**:
   - PLSR is similar to PCA but also considers the response variable, making it suitable for cases with multicollinearity.
   
    ###### PLSR Example:

    ```python
    from sklearn.cross_decomposition import PLSRegression

    pls = PLSRegression(n_components=3)
    pls.fit(X_train, y_train)
    ```

It's important to remember that each of these methods has its advantages and drawbacks. The choice of method depends on the specifics of your dataset and the research question at hand. Sometimes, a combination of these techniques might be necessary to effectively address the issue of multicollinearity.

#### 7.6. Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?


##### Motivation Behind Random Forests

Random Forests are motivated by the desire to improve upon the performance and robustness of individual decision trees. Decision trees are simple and interpretable, but they have significant limitations, like a tendency to overfit to the training data. Random Forests address these limitations by combining the predictions of multiple decision trees, leading to better generalization and robustness.

##### Improvements Over Individual Decision Trees

1. **Reduction of Overfitting**:
   - Individual decision trees often overfit the data; they're sensitive to noise and variations in the training set. Random Forests mitigate this by averaging the results of many trees, each built on a bootstrap sample of the rows and restricted to a random subset of the features at each split. The averaging reduces the model's sensitivity to the noise and specific quirks of the training set, and the feature subsampling keeps the trees from all making the same mistakes, leading to better generalization to unseen data.

2. **Increased Accuracy**:
   - By combining the predictions of multiple trees (each potentially focusing on different aspects of the data), Random Forests often achieve higher accuracy than individual trees. The ensemble approach leverages the strengths of each tree, while their individual errors are likely to cancel out when averaged, leading to a more accurate overall prediction.

3. **Handling Non-Linearity and Feature Interactions**:
   - Random Forests, by virtue of aggregating multiple trees, can capture complex non-linear relationships and interactions between features more effectively than a single tree.
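
Why averaging helps, and why the trees should differ from one another, can be sketched with a simplified simulation. Here each "tree" is just a noisy prediction with variance `sigma2` and pairwise correlation `rho`. This is an assumption made for illustration, not a real forest. Under that model the variance of the average is `rho * sigma2 + (1 - rho) * sigma2 / n_trees`.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trees, n_trials, sigma2 = 50, 20000, 1.0

def variance_of_average(rho):
    """Empirical variance of the mean of n_trees predictions with pairwise correlation rho."""
    # A shared component induces correlation rho between any two simulated trees
    shared = rng.normal(size=(n_trials, 1)) * np.sqrt(rho * sigma2)
    own = rng.normal(size=(n_trials, n_trees)) * np.sqrt((1 - rho) * sigma2)
    return (shared + own).mean(axis=1).var()

simulated, theory = {}, {}
for rho in (0.9, 0.5, 0.1):
    simulated[rho] = variance_of_average(rho)
    theory[rho] = rho * sigma2 + (1 - rho) * sigma2 / n_trees
    print(f"rho={rho}: simulated={simulated[rho]:.3f}, theory={theory[rho]:.3f}")
```

With highly correlated predictions, averaging barely reduces the variance; the less correlated the trees, the larger the benefit. This is the motivation for training each tree on different rows and different candidate features.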

##### Code Example

Below is a basic example of using Random Forests for classification with Scikit-Learn:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a sample dataset
data = load_iris()
X, y = data.data, data.target

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42) # n_estimators is the number of trees

# Train the model
rf.fit(X_train, y_train)

# Make predictions
predictions = rf.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy}")
```

This example demonstrates a basic application of a Random Forest classifier using the Iris dataset. It highlights the simplicity of using Random Forests in a similar manner to individual decision trees, while reaping the benefits of improved performance and robustness.

#### 7.7. Given a large dataset of payment transactions, say we want to predict the likelihood of a given transaction being fraudulent. However, there are many rows with missing values for various columns. How would you deal with this?

Dealing with missing values in a dataset, especially in a scenario like predicting fraudulent transactions, is crucial for the performance of the predictive model. Here are some common strategies to handle missing data, along with code examples:

##### 1. Removing Data

This involves simply removing rows or columns with missing values. This approach is straightforward but can lead to loss of valuable information, especially if a significant portion of data is missing.

###### Code Example - Dropping Rows/Columns:

```python
import pandas as pd

# Assuming df is your DataFrame
# Drop rows with any missing value
df_dropped_rows = df.dropna()

# Drop columns with any missing value
df_dropped_columns = df.dropna(axis=1)
```

##### 2. Imputation

Imputation involves filling in missing data with substitute values.

- **Mean/Median/Mode Imputation**: Use the mean for roughly normally distributed numerical columns, the median for skewed numerical columns, and the mode for categorical columns.
- **Model-Based Imputation**: Use other columns to predict missing values.

###### Code Example - Simple Imputation:

```python
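from sklearn.impute import SimpleImputer

# 'num_column', 'skewed_column', and 'cat_column' are placeholder column names.
# In a modelling pipeline, fit each imputer on the training split only.

# Mean imputation (roughly symmetric numerical column)
mean_imputer = SimpleImputer(strategy='mean')
df['num_column'] = mean_imputer.fit_transform(df[['num_column']]).ravel()

# Median imputation (skewed numerical column)
median_imputer = SimpleImputer(strategy='median')
df['skewed_column'] = median_imputer.fit_transform(df[['skewed_column']]).ravel()

# Mode imputation for categorical data
mode_imputer = SimpleImputer(strategy='most_frequent')
df['cat_column'] = mode_imputer.fit_transform(df[['cat_column']]).ravel()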
```

##### 3. Using Algorithms that Support Missing Values

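Some algorithms can handle missing values natively. For example, histogram-based gradient boosting implementations (such as scikit-learn's `HistGradientBoostingClassifier`, XGBoost, and LightGBM) learn which branch missing values should follow at each split, so no imputation is needed. Support varies by library and version, so check the documentation of the implementation you use.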

##### 4. Creating Missing Value Indicators

Sometimes the fact that data is missing can be informative. You can create a new column that indicates whether data was missing.

###### Code Example - Missing Value Indicator:

```python
df['column_missing'] = df['column'].isnull().astype(int)
```

##### 5. Advanced Imputation Techniques

- **K-Nearest Neighbors (KNN) Imputation**: Fill missing values based on K-nearest neighbors.
- **Multiple Imputation by Chained Equations (MICE)**: A sophisticated technique that models each feature with missing values as a function of other features in a round-robin fashion.

###### Code Example - KNN Imputation:

```python
from sklearn.impute import KNNImputer

knn_imputer = KNNImputer(n_neighbors=5)
df = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)
```
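
###### Code Example - Iterative (MICE-style) Imputation:

A round-robin imputation in the spirit of MICE can be sketched with scikit-learn's `IterativeImputer`. Note that it is still marked experimental (hence the extra import) and returns a single imputed dataset rather than multiple imputations. The frame `df_demo` is made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Hypothetical numeric frame in which column "b" is roughly 2 * "a"
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df_demo = pd.DataFrame({"a": a, "b": 2 * a + rng.normal(scale=0.1, size=200)})
df_demo.loc[::10, "b"] = np.nan  # remove every 10th value of "b"

# Each feature with missing values is regressed on the others, round-robin
imputer = IterativeImputer(max_iter=10, random_state=0)
df_imputed = pd.DataFrame(imputer.fit_transform(df_demo), columns=df_demo.columns)

print(df_imputed.isna().sum())
```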

##### Choosing the Right Strategy

The choice of strategy largely depends on:
- The nature of your data (categorical vs. numerical, distribution of values)
- The amount of missing data
- The importance of preserving all rows or columns vs. the importance of maintaining data integrity

In the context of predicting fraudulent transactions, it's crucial to balance the need for complete data with the need to avoid introducing biases or inaccuracies. For instance, if missing values in certain columns are related to the nature of the transaction itself (e.g., some types of transactions might not include certain information), then imputing these values might introduce noise. On the other hand, if the missing values are randomly distributed, imputation could be a viable approach.

In any case, it's advisable to perform exploratory data analysis to understand the patterns of missingness and to experiment with different strategies, evaluating their impact on the performance of your predictive models.

#### 7.8. Say you are running a simple logistic regression to solve a problem but find the results to be unsatisfactory. What are some ways you might improve your model, or what other models might you look into using instead?



If you find that a simple logistic regression model is yielding unsatisfactory results, there are several steps you can take to potentially improve your model, or you might consider exploring alternative models. Here are some strategies for improvement:

##### Improving Logistic Regression

1. **Feature Engineering**:
   - Create new features based on your existing data.
   - Perform transformations (like log, square, or interaction terms) to existing features.

   ###### Code Example - Feature Transformation:

   ```python
   import numpy as np

   # Assuming 'X' is your feature matrix and 'feature1' is a column in 'X'
   X['log_feature1'] = np.log(X['feature1'] + 1)
   X['feature1_squared'] = X['feature1'] ** 2
   ```

2. **Feature Selection**:
   - Use techniques like Recursive Feature Elimination (RFE) to select features that contribute most to the prediction.
   
   ###### Code Example - RFE:

   ```python
   from sklearn.feature_selection import RFE
   from sklearn.linear_model import LogisticRegression

   estimator = LogisticRegression()
   selector = RFE(estimator, n_features_to_select=5, step=1)
   selector = selector.fit(X, y)

   X_selected = selector.transform(X)
   ```

3. **Hyperparameter Tuning**:
   - Adjust hyperparameters like `C` (inverse of regularization strength) in logistic regression.

   ###### Code Example - Hyperparameter Tuning:

   ```python
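   from sklearn.linear_model import LogisticRegression
   from sklearn.model_selection import GridSearchCV

   # Search over C, the inverse of the regularization strength
   param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100]}
   grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5)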
   grid_search.fit(X_train, y_train)
   ```

4. **Data Scaling**:
   - Scale your features, especially if they're on different scales.

   ###### Code Example - Data Scaling:

   ```python
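   from sklearn.preprocessing import StandardScaler

   # Fit the scaler on the training data only, then apply it to the test data
   scaler = StandardScaler()
   X_train_scaled = scaler.fit_transform(X_train)
   X_test_scaled = scaler.transform(X_test)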
   ```

##### Alternative Models

If improving logistic regression doesn't yield better results, consider trying other models:

1. **Random Forest**:
   - Good for capturing non-linearities and interactions between features.
   
   ###### Code Example - Random Forest:

   ```python
   from sklearn.ensemble import RandomForestClassifier

   rf = RandomForestClassifier()
   rf.fit(X_train, y_train)
   ```

2. **Support Vector Machines (SVM)**:
   - Effective in high dimensional spaces.

   ###### Code Example - SVM:

   ```python
   from sklearn.svm import SVC

   svm = SVC()
   svm.fit(X_train, y_train)
   ```

3. **Gradient Boosting Machines (GBM)**:
   - Powerful for handling a variety of data types.

   ###### Code Example - Gradient Boosting:

   ```python
   from sklearn.ensemble import GradientBoostingClassifier

   gbm = GradientBoostingClassifier()
   gbm.fit(X_train, y_train)
   ```

4. **Neural Networks**:
   - Suitable for complex datasets with large amounts of data.

   ###### Code Example - Neural Network:

   ```python
   from sklearn.neural_network import MLPClassifier

   nn = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000)
   nn.fit(X_train, y_train)
   ```

When switching models, it's important to understand the nature of your data and the problem you're trying to solve. Different models have different strengths and are suited to different types of data and prediction tasks. For example, if your data is highly non-linear or if there are complex interactions between features, tree-based models like Random Forests or Gradient Boosting Machines might perform better than logistic regression.

Furthermore, it's crucial to validate your model thoroughly using techniques like cross-validation and to evaluate it using appropriate metrics for your problem domain. This process helps in understanding whether the changes you've made are genuinely improving the model's performance or if they're just fitting noise in the training data.
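
One way to put this into practice is to compare candidate models with the same cross-validation folds and metric. The sketch below does so for logistic regression and a random forest on a synthetic dataset; `X_demo`, `y_demo`, and the choice of ROC-AUC as the metric are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical dataset standing in for your own X and y
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    # Scaling lives inside the pipeline so it is re-fit on each training fold
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="roc_auc")
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean ROC-AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Looking at the spread across folds, not just the mean, helps judge whether a difference between two models is larger than the fold-to-fold noise.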

Lastly, keep in mind that more complex models can be more difficult to interpret and might require more computational resources. Therefore, it's important to balance complexity with the practical constraints and requirements of your project.

#### 7.9. Say you were running a linear regression for a dataset but you accidentally duplicated every data point. What happens to your beta coefficient?


Duplicating every data point in a dataset for linear regression will not change the beta coefficients (slope parameters) of the model. The reason for this is that linear regression parameters are estimated based on the relationship between the independent and dependent variables, and this relationship doesn't change when you duplicate the data.

However, duplicating the data points will affect other aspects of the regression output, such as the p-values and standard errors, making the model appear more significant than it actually is. This is because duplicating data points artificially inflates the sample size, misleadingly increasing the statistical power of the test.
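
Both statements follow from the ordinary least squares formulas. As a sketch, write $X$ for the $n \times p$ design matrix and $y$ for the response (notation introduced here). Stacking the data twice doubles both $X^\top X$ and $X^\top y$, so

$$
\hat\beta_{\text{dup}} = (2X^\top X)^{-1}(2X^\top y) = (X^\top X)^{-1}X^\top y = \hat\beta .
$$

The residual sum of squares also doubles, so the estimated coefficient covariance changes from $\frac{\mathrm{RSS}}{n-p}(X^\top X)^{-1}$ to

$$
\frac{2\,\mathrm{RSS}}{2n-p}\,(2X^\top X)^{-1} = \frac{\mathrm{RSS}}{2n-p}\,(X^\top X)^{-1},
$$

which means every standard error shrinks by the factor $\sqrt{(n-p)/(2n-p)} \approx 1/\sqrt{2}$, even though no new information was added.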

Let's demonstrate this with a code example. First, we'll fit a linear regression model to a dataset, and then we'll fit the same model to a duplicated version of the dataset and compare the beta coefficients.

##### Code Example

```python
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Generate some random data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X + 1 + np.random.randn(100, 1) * 0.5  # y = 2x + 1 + noise

# Add a constant to the model (for the intercept term)
X_sm = sm.add_constant(X)

# Fit a linear regression model
model = sm.OLS(y, X_sm).fit()
original_coefs = model.params

# Duplicate the data
X_duplicated = np.concatenate([X, X])
y_duplicated = np.concatenate([y, y])
X_duplicated_sm = sm.add_constant(X_duplicated)

# Fit the same model to the duplicated data
model_duplicated = sm.OLS(y_duplicated, X_duplicated_sm).fit()
duplicated_coefs = model_duplicated.params

original_coefs, duplicated_coefs
```

In this example, `original_coefs` and `duplicated_coefs` should match up to floating-point precision, indicating that duplicating the data leaves the beta coefficients unchanged. However, if you were to look at the summary of both models (using `model.summary()` and `model_duplicated.summary()`), you would see differences in the standard errors, t-statistics, and p-values.

As shown in the code example, the beta coefficients (`original_coefs` and `duplicated_coefs`) remain the same before and after duplicating the data points. Both models yield the coefficients:

- Intercept: approximately 1.111
- Slope: approximately 1.968

This demonstrates that duplicating every data point in a dataset for linear regression does not affect the estimated coefficients of the model. However, as previously mentioned, this will impact other statistical measures like p-values and standard errors, which can lead to misleading interpretations of the model's significance and confidence.

In [1]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

# Generate some random data
np.random.seed(0)
X = np.random.rand(100, 1)
y = 2 * X + 1 + np.random.randn(100, 1) * 0.5  # y = 2x + 1 + noise

# Add a constant to the model (for the intercept term)
X_sm = sm.add_constant(X)

# Fit a linear regression model
model = sm.OLS(y, X_sm).fit()
original_coefs = model.params

# Duplicate the data
X_duplicated = np.concatenate([X, X])
y_duplicated = np.concatenate([y, y])
X_duplicated_sm = sm.add_constant(X_duplicated)

# Fit the same model to the duplicated data
model_duplicated = sm.OLS(y_duplicated, X_duplicated_sm).fit()
duplicated_coefs = model_duplicated.params

original_coefs, duplicated_coefs

(array([1.11107554, 1.96846751]), array([1.11107554, 1.96846751]))
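
The effect on the standard errors can be checked directly. The following is a minimal sketch using only NumPy (the helper `ols_with_se` is a hypothetical name introduced here, and the data is freshly simulated rather than taken from the cell above). It assumes the usual OLS standard-error formula with $n - p$ degrees of freedom, under which duplicating the data should scale each standard error by $\sqrt{(n - p)/(2n - p)}$, which is close to $1/\sqrt{2}$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x = rng.random(n)
y = 2 * x + 1 + 0.5 * rng.standard_normal(n)  # y = 2x + 1 + noise


def ols_with_se(x, y):
    """Return OLS coefficients and standard errors for the model y ~ 1 + x."""
    X = np.column_stack([np.ones_like(x), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    dof = len(y) - X.shape[1]           # n - p degrees of freedom
    sigma2 = resid @ resid / dof        # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se


beta_orig, se_orig = ols_with_se(x, y)
beta_dup, se_dup = ols_with_se(np.tile(x, 2), np.tile(y, 2))

se_ratio = se_dup / se_orig
expected_ratio = np.sqrt((n - 2) / (2 * n - 2))

print("coefficients equal:", np.allclose(beta_orig, beta_dup))
print("SE ratio (duplicated / original):", se_ratio)
print("expected ratio:", expected_ratio)
```

Both coefficients are unchanged while both standard errors shrink by the same factor, which is why the duplicated fit looks more precise than the data justifies.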

#### <font color="cyan">7.10. Compare and contrast Gradient Boosting and Random Forests.</font>


Gradient Boosting and Random Forests are both popular ensemble learning techniques used for regression and classification problems, but they operate in fundamentally different ways.

##### Random Forests

**Random Forests** is an ensemble learning method that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

###### Key Characteristics:
1. **Bagging Model**: Random Forests use the bagging technique, where each tree is built on a bootstrap sample (with replacement) from the training data.
2. **Feature Subset**: In each split of the tree, only a random subset of features is considered.
3. **Parallel Training**: Trees are built independently, making the process easily parallelizable.
4. **Reduction of Variance**: Helps in reducing overfitting by averaging multiple trees.

###### Example - Random Forest in Python:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Random Forest Classifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Predict and evaluate on the held-out test set
rf_predictions = rf.predict(X_test)
print(f"Random Forest accuracy: {accuracy_score(y_test, rf_predictions):.3f}")
```

##### Gradient Boosting

**Gradient Boosting** is a boosting technique that builds trees one at a time, where each new tree helps to correct errors made by previously trained trees.

###### Key Characteristics:
1. **Boosting Model**: Gradient Boosting builds trees sequentially, where each tree tries to correct the errors of the previous one.
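2. **Gradient Descent**: Each new tree is fit to the negative gradient of the loss with respect to the current ensemble's predictions (for squared error, these are simply the residuals), so adding trees amounts to gradient descent in function space.
3. **Shrinkage**: Introduces a learning rate parameter that scales the contribution of each tree. Smaller values generally generalize better but require more trees.
4. **Reduction of Bias and Variance**: Primarily reduces bias by combining many weak (shallow) learners; shrinkage and row subsampling help keep variance in check.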

###### Example - Gradient Boosting in Python:

```python
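from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# Reuses X_train, X_test, y_train, y_test from the Random Forest example above.
# A learning rate below 1.0 applies shrinkage, as described in the characteristics list.
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb.fit(X_train, y_train)

# Predict and evaluate on the held-out test set
gb_predictions = gb.predict(X_test)
print(f"Gradient Boosting accuracy: {accuracy_score(y_test, gb_predictions):.3f}")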
```

##### Comparison

- **Model Complexity**: Random Forests build complex models by combining the results of many independently built decision trees. In contrast, Gradient Boosting builds increasingly complex models by sequentially adding trees that correct the errors of the combined ensemble of previous trees.
- **Training Speed**: Random Forests can be faster to train compared to Gradient Boosting because of parallel tree construction. Gradient Boosting often requires more careful tuning and longer training times.
- **Overfitting Risk**: Random Forests are less likely to overfit if there are enough trees, due to their averaging nature. Gradient Boosting can overfit if the number of trees is too large or the learning rate is too high.
- **Interpretability**: Both models are less interpretable than simple decision trees, but Random Forests generally offer a bit more interpretability since each tree can be considered independently.
- **Performance Tuning**: Gradient Boosting often requires careful tuning of parameters like the number of trees, learning rate, and depth of trees. Random Forests are more robust to the choice of parameters, but tuning can still improve performance.
- **Handling Noisy Data**: Random Forests can be more robust to noise, given their averaging nature. Gradient Boosting can be sensitive to outliers and noise.

In practice, the choice between Random Forests and Gradient Boosting depends on the specific requirements of the application, the nature of the data, and the computational resources available. Gradient Boosting is often a go-to when performance is the primary concern and you have the time and resources to properly tune the model. Random Forests are a strong choice when you need a robust model quickly, with less tuning required.
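
To make the structural difference concrete, here is a minimal from-scratch sketch of the two training loops for regression with squared error. It is an illustration under stated assumptions, not how scikit-learn implements either method: the data is synthetic, the tree depths, number of trees, and learning rate are arbitrary choices, and the bagging loop omits the per-split feature subsampling that Random Forests add (there is only one feature here).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(300)

n_trees, learning_rate = 50, 0.1

# Bagging: every tree is fit independently on its own bootstrap sample,
# so the loop iterations could run in parallel. Predictions are averaged.
bagged_trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))  # sample rows with replacement
    tree = DecisionTreeRegressor(max_depth=4, random_state=0)
    bagged_trees.append(tree.fit(X[idx], y[idx]))
bagging_pred = np.mean([t.predict(X) for t in bagged_trees], axis=0)

# Boosting: every tree is fit to the residuals of the current ensemble
# (the negative gradient of squared error), so the loop is inherently sequential.
boosting_pred = np.full(len(y), y.mean())
boosting_mse = []
for _ in range(n_trees):
    residuals = y - boosting_pred
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    boosting_pred += learning_rate * tree.predict(X)  # shrunken update
    boosting_mse.append(np.mean((y - boosting_pred) ** 2))

print(f"Bagging training MSE:  {np.mean((y - bagging_pred) ** 2):.4f}")
print(f"Boosting training MSE: {boosting_mse[0]:.4f} (1 tree) -> {boosting_mse[-1]:.4f} ({n_trees} trees)")
```

The boosting loop keeps lowering the training error as trees are added, which is also why it needs early stopping or a held-out set to avoid overfitting, whereas adding more bagged trees only stabilizes the average.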

#### <font color="cyan">7.11. Say that DoorDash is launching in Singapore. For this new market, you want to predict the estimated time of arrival (ETA) for a delivery to reach a customer after an order has been placed on the app. From an earlier beta test in Singapore, there were 10,000 deliveries made. Do you have enough training data to create an accurate ETA model?</font>

Whether 10,000 deliveries provide enough data for an accurate Estimated Time of Arrival (ETA) model depends on various factors, including the complexity of the model, the diversity of the data, and the specific features captured in the dataset. Generally, 10,000 data points could be considered a reasonable starting point for building a predictive model, especially if the data is rich in informative features.

For a more concrete assessment, we can simulate a scenario with a dataset of similar size and complexity to estimate the performance of a predictive model. Let's assume we're using a relatively simple model, like a Random Forest, for this purpose. We'll create a synthetic dataset with features that might be relevant for an ETA prediction model, such as distance to destination, time of day, day of the week, and weather conditions.

### Code Example: Simulating a Model Training with a Dataset of 10,000 Deliveries

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Generate a synthetic dataset
np.random.seed(42)
n_samples = 10000
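distance = np.random.uniform(1, 10, n_samples)  # distance in kilometers
time_of_day = np.random.randint(0, 24, n_samples)  # hour of the day
day_of_week = np.random.randint(0, 7, n_samples)  # day of the week
weather_condition = np.random.randint(0, 4, n_samples)  # 0: clear, 1: rainy, 2: foggy, 3: stormy

# Assumed, purely illustrative relationship so that the target depends on the features.
# Real ETAs would come from the beta-test logs, not from a formula like this.
rush_hour = np.isin(time_of_day, [8, 9, 17, 18, 19])
eta = 10 + 4 * distance + 8 * rush_hour + 3 * weather_condition + np.random.normal(0, 5, n_samples)

data = {
    'distance': distance,
    'time_of_day': time_of_day,
    'day_of_week': day_of_week,
    'weather_condition': weather_condition,
    'ETA': eta  # ETA in minutes
}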
df = pd.DataFrame(data)

# Prepare the data
X = df.drop('ETA', axis=1)
y = df['ETA']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a Random Forest model
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse}")
```

This example provides a basic framework to start with. In a real-world scenario, the model's accuracy would depend heavily on the quality and relevance of the features. For instance, in an ETA prediction model, traffic conditions, driver experience, and specific location data might be critical factors.

If the model's performance is not satisfactory, additional data, more complex models, or feature engineering might be necessary. However, with 10,000 data points, you have a solid foundation to begin the modeling process and iteratively improve the model.
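
One common way to probe whether the available data is enough is a learning curve: train on growing subsets and watch the validation error. If the error has flattened by the time all rows are used, more data of the same kind is unlikely to help much; if it is still falling, it probably would. The sketch below uses scikit-learn's `learning_curve` on a synthetic dataset. The ETA formula, the rush-hour hours, the noise level, and the model settings are all illustrative assumptions, not DoorDash data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the 10,000 beta-test deliveries (assumed relationship).
rng = np.random.default_rng(42)
n_samples = 10_000
distance = rng.uniform(1, 10, n_samples)   # kilometers
hour = rng.integers(0, 24, n_samples)      # hour of the day
weather = rng.integers(0, 4, n_samples)    # 0: clear ... 3: stormy
rush_hour = np.isin(hour, [8, 9, 17, 18, 19])
eta = 10 + 4 * distance + 8 * rush_hour + 3 * weather + rng.normal(0, 5, n_samples)
X = np.column_stack([distance, hour, weather])

model = RandomForestRegressor(n_estimators=30, min_samples_leaf=20, random_state=42)

# Validation error as a function of the number of training rows (3-fold CV).
train_sizes, train_scores, val_scores = learning_curve(
    model,
    X,
    eta,
    train_sizes=[0.1, 0.25, 0.5, 0.75, 1.0],
    cv=3,
    scoring="neg_mean_absolute_error",
)
val_mae = -val_scores.mean(axis=1)  # flip sign: scores are negated errors

for size, mae in zip(train_sizes, val_mae):
    print(f"{int(size):>5d} training rows -> validation MAE {mae:.2f} min")
```

On real delivery data, the same curve (ideally with a time-based split rather than random folds, since delivery conditions drift) gives an empirical answer to whether 10,000 deliveries are sufficient for the accuracy the product needs.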


Medium
how would you supply these reasons?
7.13. Google: Say you are given a very large corpus of words. How would you identify synonyms?
7.14. Facebook: What is the bias-variance trade-off? How is it expressed using an equation?
7.15. Uber: Define the cross-validation process. What is the motivation behind using it?
7.16. Salesforce: How would you build a lead scoring algorithm to predict whether a prospective company is likely to convert into being an enterprise customer?
7.17. Spotify: How would you approach creating a music recommendation algorithm?
7.18. Amazon: Define what it means for a function to be convex. What is an example of a machine learning algorithm that is not convex and describe why that is so?
7.19. Microsoft: Explain what information gain and entropy are in the context of a decision tree and walk through a numerical example.
7.20. Uber: What is L1 and L2 regularization? What are the differences between the two?
7.21. Amazon: Describe gradient descent and the motivations behind stochastic gradient descent.
7.22. Affirm: Assume we have a classifier that produces a score between 0 and 1 for the probability of a particular loan application being fraudulent. Say that for each application's score, we take the square root of that score. How would the ROC curve change? If it doesn't change, what kinds of functions would change the curve?
7.23. IBM: Say X is a univariate Gaussian random variable. What is the entropy of X?
7.24. Stitch Fix: How would you build a model to calculate a customer's propensity to buy a particular item? What are some pros and cons of your approach?
7.25. Citadel: Compare and contrast Gaussian Naive Bayes (GNB) and logistic regression. When would you use one over the other?
Hard
7.26. Walmart: What loss function is used in k-means clustering given k clusters and n sample points? Compute the update formula using (1) batch gradient descent and (2) stochastic gradient descent for the cluster mean for cluster k using a learning rate c.

## When performing K-means clustering, how do you choose K?

How can you make your models more robust to outliers?

Say that you are running a multiple linear regression and that you have reason to believe that several of the predictors are correlated. How will the results of the regression be affected if several are indeed correlated? How would you deal with this problem?

Describe the motivation behind random forests. What are two ways in which they improve upon individual decision trees?




Given a large dataset of payment transactions, say we want to predict the likelihood of a given transaction being fraudulent. However, there are many rows with missing values for various columns. How would you deal with this?

In [None]:
SELECT
  COUNT(DISTINCT app_id)
FROM events  -- hypothetical table name; the original query has no FROM clause
WHERE event_ID = 'clicktrough rate'
  AND YEAR(timestamp) = 2019;