# Task 1
Generate a dataset with 10,000 observations and 1,000 columns (sample from different distributions) and build a target from 100 of the columns plus noise (either one shared noise term or a small one per column; try to keep the noise from noticeably affecting the correlations between the predictors and the target). Make sure the dataset contains columns that were not used for the target but are highly correlated with ones that were (show this in code).

In [1]:
import numpy as np
import pandas as pd

np.random.seed(42)

n_samples = 10000
n_columns = 1000
data = np.zeros((n_samples, n_columns))

for i in range(n_columns // 4):
    data[:, i] = np.random.normal(loc=0, scale=1, size=n_samples)
    data[:, i + n_columns // 4] = np.random.uniform(-1, 1, size=n_samples)
    data[:, i + n_columns // 2] = np.random.exponential(scale=1, size=n_samples)
    data[:, i + 3 * n_columns // 4] = np.random.poisson(lam=3, size=n_samples)

# We want the last 100 columns to be highly correlated with each other
corr_base = np.random.normal(0, 1, size=n_samples)
for i in range(-100, 0):
    data[:, i] = corr_base + np.random.normal(0, 0.05, size=n_samples)  # small per-column noise

df = pd.DataFrame(data, columns=[f"feature_{i}" for i in range(n_columns)])
df.head()

Unnamed: 0,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,...,feature_990,feature_991,feature_992,feature_993,feature_994,feature_995,feature_996,feature_997,feature_998,feature_999
0,0.496714,1.280471,-1.122551,0.524982,0.987901,0.017862,0.312194,-0.609789,1.002612,0.293701,...,0.043637,-0.010377,0.050621,0.035498,-0.095048,0.021487,-0.041451,0.060218,-0.128142,0.030686
1,-0.138264,-0.245924,2.79612,0.690287,1.879859,0.255147,-0.601062,0.838519,-1.380665,-0.638074,...,1.957549,1.883027,1.991596,1.9052,2.040944,1.970121,2.013584,1.949059,1.943506,1.907892
2,0.647689,1.342503,0.689283,0.484867,-2.813737,0.290908,0.143389,-1.159115,-1.946456,-0.664319,...,1.22766,1.341263,1.264885,1.323409,1.300004,1.135428,1.306268,1.252321,1.253547,1.20188
3,1.52303,1.935503,-1.021214,-0.643541,0.897376,0.230489,1.918386,0.366521,0.354375,0.049559,...,0.085804,0.072291,0.05994,-0.018429,-0.005226,0.017241,0.03887,0.033559,0.113008,0.137452
4,-0.234153,0.975839,-1.576624,-0.670365,-0.653806,-0.146138,0.832448,0.083562,-0.72907,-0.278974,...,1.705048,1.732025,1.636691,1.703631,1.737022,1.67654,1.71413,1.739646,1.793698,1.780125


In [2]:
# Build the target
used_columns = np.random.choice(range(n_columns - 100), 100, replace=False)
target_noise = np.random.normal(0, 0.1, size=n_samples)
target = np.sum(df.iloc[:, used_columns].values, axis=1) + target_noise
df["target"] = target

In [3]:
correlation_matrix = df.corr()
target_correlations = correlation_matrix["target"].sort_values(ascending=False)
# After the descending sort, [-100:] picks the 100 features with the lowest correlation to the target
highly_correlated_unused = target_correlations.index[target_correlations.index.str.contains("feature_")][-100:]
correlation_summary = target_correlations[highly_correlated_unused]
print(correlation_summary)

feature_763   -0.011870
feature_932   -0.011876
feature_296   -0.011886
feature_988   -0.011918
feature_901   -0.011925
                 ...   
feature_838   -0.025293
feature_418   -0.027312
feature_635   -0.027908
feature_500   -0.028378
feature_579   -0.031655
Name: target, Length: 100, dtype: float64
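
The task also asks for a direct check that some unused columns are strongly correlated with used ones. One way to sketch it is to compute, for every unused column, its largest absolute Pearson correlation with any used column. The helper `max_abs_corr_with_used` and the toy data below are hypothetical stand-ins, not the notebook's `df` or `used_columns`.

```python
import numpy as np

def max_abs_corr_with_used(X, used_idx):
    """For every column not in used_idx, return its largest |Pearson r| with any used column."""
    used_idx = np.asarray(used_idx)
    unused_idx = np.setdiff1d(np.arange(X.shape[1]), used_idx)
    # Standardize columns so that a scaled dot product equals the Pearson correlation
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    corr = Z[:, unused_idx].T @ Z[:, used_idx] / X.shape[0]
    return unused_idx, np.abs(corr).max(axis=1)

# Toy stand-in data (assumption: 6 columns, the first 3 are "used")
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 6))
used_demo = np.array([0, 1, 2])
# Column 5 is a noisy copy of used column 0; columns 3 and 4 stay independent
X_demo[:, 5] = X_demo[:, 0] + rng.normal(0, 0.05, size=2000)

unused_demo, max_corr_demo = max_abs_corr_with_used(X_demo, used_demo)
for j, c in zip(unused_demo, max_corr_demo):
    print(f"feature_{j}: max |corr| with used columns = {c:.3f}")
```

On the real data the same function could be called with `df.drop(columns=["target"]).values` and `used_columns`; a threshold such as 0.9 on the returned values would then list the highly correlated unused columns.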


Implement forward stagewise regression in the standard way and, as fast as possible, using QR decomposition (time every variant you try). Measure the quality and the percentage of columns that were correctly identified.

In [4]:
import time
from sklearn.metrics import mean_squared_error

X = df.drop(columns=["target"]).values
y = df["target"].values

n_train = int(0.8 * len(y))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

In [5]:
def fsw(X, y, max_iter=10000, tol=1e-4, step_size=0.01):  # forward stagewise regression
    n_samples, n_features = X.shape
    coefficients = np.zeros(n_features)
    residuals = y.copy()

    for _ in range(max_iter):
        # Unnormalized correlation of every column with the current residuals
        correlations = X.T @ residuals
        max_correlation_index = np.argmax(np.abs(correlations))
        sign = np.sign(correlations[max_correlation_index])
        coefficients[max_correlation_index] += step_size * sign

        residuals = y - X @ coefficients

        if np.max(np.abs(correlations)) < tol:
            break

    return coefficients
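
Since each step changes a single coefficient, the residuals can be updated by subtracting one scaled column instead of recomputing `X @ coefficients`. The variant below is a sketch under that assumption; `fsw_incremental` is a hypothetical name, and unlike `fsw` it checks the tolerance before applying the step.

```python
import numpy as np

def fsw_incremental(X, y, max_iter=10000, tol=1e-4, step_size=0.01):
    """Forward stagewise regression with an O(n) residual update per iteration."""
    coefficients = np.zeros(X.shape[1])
    residuals = y.astype(float).copy()
    for _ in range(max_iter):
        correlations = X.T @ residuals
        j = np.argmax(np.abs(correlations))
        if np.abs(correlations[j]) < tol:
            break
        delta = step_size * np.sign(correlations[j])
        coefficients[j] += delta
        # Only column j changed, so adjust the residuals by that column alone
        residuals -= delta * X[:, j]
    return coefficients

# Toy check (assumption: 5 features, two of them carry the signal)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 3] + rng.normal(0, 0.1, size=200)
coef_demo = fsw_incremental(X_demo, y_demo, max_iter=2000)
print(np.round(coef_demo, 2))
```

The dominant cost per iteration is still `X.T @ residuals`, so the saving is roughly one of the two matrix-vector products.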

In [6]:
def fsw_qr(X, y, max_iter=1000, tol=1e-4):  # using QR decomposition
    n_samples, n_features = X.shape
    coefficients = np.zeros(n_features)
    residuals = y.copy()
    Q, R = np.linalg.qr(X)

    for _ in range(max_iter):
        correlations = Q.T @ residuals
        max_correlation_index = np.argmax(np.abs(correlations))
        step_size = correlations[max_correlation_index] / R[max_correlation_index, max_correlation_index]
        coefficients[max_correlation_index] += step_size
        residuals = y - X @ coefficients
        if np.max(np.abs(correlations)) < tol:
            break

    return coefficients
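
For comparison, the usual role of QR in regression is to solve the full least-squares problem: with `X = Q R`, the coefficients satisfy `R @ beta = Q.T @ y`. The sketch below shows only that step, assuming `X` has full column rank. It is not a rewrite of `fsw_qr`, and `lstsq_qr` is a hypothetical name.

```python
import numpy as np

def lstsq_qr(X, y):
    """Least-squares coefficients via reduced QR: solve R @ beta = Q.T @ y."""
    Q, R = np.linalg.qr(X)  # default reduced mode: Q is (n, p), R is (p, p) upper triangular
    # np.linalg.solve is a general solver; scipy.linalg.solve_triangular could exploit the structure of R
    return np.linalg.solve(R, Q.T @ y)

# Toy check against NumPy's own least-squares routine
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([1.0, -2.0, 0.0, 0.5]) + rng.normal(0, 0.1, size=100)
beta_qr = lstsq_qr(X_demo, y_demo)
beta_ref = np.linalg.lstsq(X_demo, y_demo, rcond=None)[0]
print(np.round(beta_qr, 3))
```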

In [7]:
start_standard = time.time()
coefficients_standard = fsw(X_train, y_train, max_iter=10000, step_size=0.01)
end_standard = time.time()

In [8]:
start_qr = time.time()
coefficients_qr = fsw_qr(X_train, y_train, max_iter=10000)
end_qr = time.time()

In [9]:
predicted_y_standard = X_test @ coefficients_standard
predicted_y_qr = X_test @ coefficients_qr

In [10]:
mse_standard = mean_squared_error(y_test, predicted_y_standard)
mse_qr = mean_squared_error(y_test, predicted_y_qr)

In [11]:
selected_standard = np.where(coefficients_standard != 0)[0]
selected_qr = np.where(coefficients_qr != 0)[0]

correct_standard = len(set(selected_standard) & set(used_columns)) / len(used_columns) * 100
correct_qr = len(set(selected_qr) & set(used_columns)) / len(used_columns) * 100
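
The percentage above is a recall: the share of `used_columns` that received a nonzero coefficient. A method that assigns a nonzero coefficient to every column would also score 100%, so it can help to report precision next to it. A minimal sketch, with `selection_precision_recall` as a hypothetical helper:

```python
import numpy as np

def selection_precision_recall(selected, true_columns):
    """Precision and recall of a selected feature set against the truly used columns."""
    selected = set(map(int, selected))
    true_columns = set(map(int, true_columns))
    hits = len(selected & true_columns)
    precision = hits / len(selected) if selected else 0.0
    recall = hits / len(true_columns) if true_columns else 0.0
    return precision, recall

# Toy example: 5 columns selected, 3 of them are truly used
precision_demo, recall_demo = selection_precision_recall(np.array([0, 1, 2, 3, 7]), np.array([0, 1, 2]))
print(f"precision = {precision_demo:.2f}, recall = {recall_demo:.2f}")
```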

In [12]:
results = {
    "Method": ["Standard Forward Stagewise", "QR-Optimized Forward Stagewise"],
    "MSE": [mse_standard, mse_qr],
    "Time (seconds)": [end_standard - start_standard, end_qr - start_qr],
    "Correct Features Found (%)": [correct_standard, correct_qr],
}

results_df = pd.DataFrame(results)
print(results_df)

                           Method       MSE  Time (seconds)  \
0      Standard Forward Stagewise  1.394428       56.895492   
1  QR-Optimized Forward Stagewise  0.011401       27.280895   

   Correct Features Found (%)  
0                       100.0  
1                       100.0  


Additionally: try generating the data so that the error gradually gets worse. Hint: increase the noise, use nonlinear functions and combinations of predictors. Try to estimate the bias and variance of forward stagewise regression.

In [13]:
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

np.random.seed(42)

# Generate the data together with one target per increasing noise level
def generate(n_samples=10000, n_features=1000, noise_levels=None):
    if noise_levels is None:
        noise_levels = [0.1, 0.5, 1.0, 2.0]
    data = np.zeros((n_samples, n_features))

    for i in range(n_features // 4):
        data[:, i] = np.random.normal(0, 1, size=n_samples)
        data[:, i + n_features // 4] = np.random.uniform(-1, 1, size=n_samples)
        data[:, i + n_features // 2] = np.random.exponential(1, size=n_samples)
        data[:, i + 3 * n_features // 4] = np.random.poisson(3, size=n_samples)

    # Build one target per noise level
    targets = []
    used_columns = np.random.choice(range(n_features - 100), 100, replace=False)
    for noise in noise_levels:
        target = np.sum(data[:, used_columns], axis=1)  # linear combination
        target += np.sum(np.sin(data[:, used_columns]), axis=1)  # nonlinear transformation
        target += np.random.normal(0, noise, size=n_samples)  # noise
        targets.append(target)

    return data, targets, used_columns

In [14]:
X, targets, used_columns = generate()

X_train, X_test, y_train_list, y_test_list = train_test_split(
    X, np.array(targets).T, test_size=0.2, random_state=42
)

In [15]:
biases = []
biases_qr = []
variances = []
variances_qr = []
mse_list = []
mse_list_qr = []

for i, y_train in enumerate(y_train_list.T):
    y_test = y_test_list[:, i]

    coefficients = fsw(X_train, y_train, max_iter=10000, step_size=0.01)
    coefficients_qr = fsw_qr(X_train, y_train, max_iter=10000)

    predictions_train = X_train @ coefficients
    predictions_train_qr = X_train @ coefficients_qr

    predictions_test = X_test @ coefficients
    predictions_test_qr = X_test @ coefficients_qr

    bias = np.mean((np.mean(predictions_test) - y_test) ** 2)
    bias_qr = np.mean((np.mean(predictions_test_qr) - y_test) ** 2)

    variance = np.var(predictions_test)
    variance_qr = np.var(predictions_test_qr)

    mse = mean_squared_error(y_test, predictions_test)
    mse_qr = mean_squared_error(y_test, predictions_test_qr)

    biases.append(bias)
    biases_qr.append(bias_qr)

    variances.append(variance)
    variances_qr.append(variance_qr)

    mse_list.append(mse)
    mse_list_qr.append(mse_qr)
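
The bias and variance above come from a single fit per noise level. A common alternative is to refit on several resampled training sets and compare the average prediction with the noise-free signal, which is available for synthetic data. The sketch below assumes that signal is passed as `f_test` and uses ordinary least squares as a stand-in for `fsw`; `bias_variance` and its parameters are hypothetical names.

```python
import numpy as np

def bias_variance(fit, X_train, y_train, X_test, f_test, n_rounds=20, seed=0):
    """Bootstrap estimate of squared bias and variance for a linear model.

    fit(X, y) must return a coefficient vector; f_test is the noise-free target on X_test.
    """
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    preds = np.empty((n_rounds, X_test.shape[0]))
    for b in range(n_rounds):
        idx = rng.integers(0, n, size=n)  # resample training rows with replacement
        preds[b] = X_test @ fit(X_train[idx], y_train[idx])
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - f_test) ** 2)   # squared bias, averaged over test points
    variance = np.mean(preds.var(axis=0))          # spread across refits, averaged over test points
    error = np.mean((preds - f_test) ** 2)         # equals bias_sq + variance
    return bias_sq, variance, error

# Toy data (assumption: linear plus sine signal on the first two of five columns)
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(400, 5))
f_demo = X_demo[:, :2].sum(axis=1) + np.sin(X_demo[:, :2]).sum(axis=1)
y_demo = f_demo + rng.normal(0, 0.5, size=400)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

bias_sq, variance, error = bias_variance(ols, X_demo[:300], y_demo[:300], X_demo[300:], f_demo[300:])
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, error vs. signal = {error:.4f}")
```

To apply this to the notebook's models, `fit` could be replaced by a wrapper around `fsw` or `fsw_qr`, and `generate` would need to return the noise-free target as well.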

In [17]:
results = pd.DataFrame({
    "Method": "Standard Forward Stagewise + noise",
    "Noise Level": [0.1, 0.5, 1.0, 2.0],
    "Bias": biases,
    "Variance": variances,
    "MSE": mse_list
})

print(results)

                               Method  Noise Level        Bias   Variance  \
0  Standard Forward Stagewise + noise          0.1  176.828816  93.658298   
1  Standard Forward Stagewise + noise          0.5  177.071621  93.694333   
2  Standard Forward Stagewise + noise          1.0  177.262838  93.877403   
3  Standard Forward Stagewise + noise          2.0  180.277090  93.564031   

         MSE  
0  43.136812  
1  43.141017  
2  43.982147  
3  47.063059  


In [18]:
results_qr = pd.DataFrame({
    "Method": "QR-Optimized Forward Stagewise + noise",
    "Noise Level": [0.1, 0.5, 1.0, 2.0],
    "Bias": biases_qr,
    "Variance": variances_qr,
    "MSE": mse_list_qr
})

print(results_qr)

                                   Method  Noise Level        Bias  \
0  QR-Optimized Forward Stagewise + noise          0.1  176.881543   
1  QR-Optimized Forward Stagewise + noise          0.5  177.120127   
2  QR-Optimized Forward Stagewise + noise          1.0  177.318761   
3  QR-Optimized Forward Stagewise + noise          2.0  180.359764   

     Variance        MSE  
0  172.583974  14.988887  
1  172.635323  15.134655  
2  173.248405  16.329562  
3  172.790497  19.842370  
