

# Lab Exercise 2

### Predicting Crime Rates Using Resampling Methods and Feature Selection

In this exercise we explore resampling methods for model evaluation and feature selection for predicting the crime rate from a range of attributes. We use the "Communities and Crime" dataset and fit linear regression models, chosen because linear regression is the simplest option.

In [6]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

### Data Preprocessing

We begin by loading the dataset and preprocessing it for linear regression:

- **Target variable**: `ViolentCrimesPerPop` (column `attr_127` in the zero-based naming used in the code)
- **Removed attributes**: `state`, `county`, `community`, `community name`, and `fold` (the first five columns, `attr_0` to `attr_4`)

Missing values (marked `?` in the file) are replaced with the mean of their respective columns.
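As a small illustration of mean imputation, the sketch below applies the same `fillna(mean)` pattern to a toy frame. The column names and values are made up for the example and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) whose gaps have already been parsed to NaN.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [10.0, 20.0, np.nan],
})

# DataFrame.mean() skips NaN by default, so each gap is filled with the
# mean of the observed values in its own column.
toy_filled = toy.fillna(toy.mean())
print(toy_filled)
```

Here the gap in `a` becomes 2.0 and the gap in `b` becomes 15.0. Computing the means on the full dataset before splitting lets a little information from validation rows reach the training data; a stricter protocol would compute them on each training split only.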

In [7]:

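data_file_path = "communities+and+crime/communities.data"
# Non-predictive identifier columns: state, county, community, community name, fold
columns_to_remove = ['attr_0', 'attr_1', 'attr_2', 'attr_3', 'attr_4']
target_column = 'attr_127'  # ViolentCrimesPerPop

# The file has no header row; '?' marks a missing value
data = pd.read_csv(data_file_path, header=None, na_values='?')
data.columns = [f'attr_{i}' for i in range(data.shape[1])]
data = data.drop(columns=columns_to_remove)

# Replace missing values with the mean of their column
data = data.fillna(data.mean())

X = data.drop(columns=[target_column])
y = data[target_column]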


## Cross-Validation and Leave-One-Out

We implement:

- **Cross-validation**: The dataset is split into `k` folds (5 or 10 folds are the usual choices); each fold serves once as the validation set while the model is trained on the rest.
- **Leave-one-out cross-validation**: A special case of cross-validation in which `k` equals the number of observations. Its error estimate tends to have high variance, and it requires fitting one model per observation.

These methods help estimate how well our linear regression model generalizes to unseen data.
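The two schemes described above can also be expressed with scikit-learn's `cross_val_score`. The following is a minimal sketch on synthetic data; the data, the helper name `cv_mse`, and the variable names are assumptions made for illustration, not part of the exercise.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

# Synthetic data (assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=60)

def cv_mse(X, y, splitter):
    """Mean validation MSE over the folds produced by `splitter`."""
    # scikit-learn scorers follow "higher is better", so MSE is negated.
    neg_mse = cross_val_score(
        LinearRegression(), X, y,
        scoring="neg_mean_squared_error", cv=splitter,
    )
    return -neg_mse.mean()

mse_5fold = cv_mse(X_demo, y_demo, KFold(n_splits=5))
mse_loo = cv_mse(X_demo, y_demo, LeaveOneOut())
print(f"5-fold MSE: {mse_5fold:.4f}, LOO MSE: {mse_loo:.4f}")
```

Only the splitter changes between the two calls, which makes the point that leave-one-out is k-fold cross-validation with `k` equal to the number of rows.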

 

In [8]:
from sklearn.model_selection import KFold

def cross_validation(X, y, k=5):
    """Return the mean validation MSE of linear regression over k folds."""
    kf = KFold(n_splits=k)  # contiguous folds, no shuffling
    model = LinearRegression()
    scores = []
    for train_index, val_index in kf.split(X):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        scores.append(mean_squared_error(y_val, y_pred))
    return np.mean(scores)

cross_validation(X, y, k=5)



np.float64(0.18907747633998667)

In [9]:
from sklearn.model_selection import LeaveOneOut

def leave_one_out(X, y):
    """Return the mean squared error over all leave-one-out splits."""
    loo = LeaveOneOut()  # one split per observation
    model = LinearRegression()
    scores = []
    for train_index, val_index in loo.split(X):
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_val)
        scores.append(mean_squared_error(y_val, y_pred))
    return np.mean(scores)

leave_one_out(X, y)

np.float64(0.2986172439516377)

## Feature Selection with Forward Selection

We implement forward feature selection to identify the subset of features that contributes most to predicting the crime rate.

- **Metric**: R-squared (`R²`)
- **Stopping criterion**: improvement smaller than `0.001`

At each step we add the feature that most improves the model's score on the chosen metric.
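For comparison, scikit-learn provides a ready-made greedy selector, `SequentialFeatureSelector`. The sketch below runs it on synthetic data in which only two columns drive the target. The data and variable names are assumptions for illustration, and a fixed feature count is used in place of the improvement threshold described above.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): only columns 0 and 3 drive the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=200)

# Greedy forward selection scored by cross-validated R^2.
# A fixed feature count is used here instead of an improvement threshold.
selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,
    direction="forward",
    scoring="r2",
    cv=5,
)
selector.fit(X_demo, y_demo)
selected = np.flatnonzero(selector.get_support())
print("Selected column indices:", selected)
```

Unlike a single train/validation split, this selector scores every candidate with cross-validation, which makes the choice less dependent on one particular split.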

### About R-squared (R²)

R-squared, also known as the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. It provides an indication of the goodness of fit and how well the model explains the variability of the outcome data.

- **Range**: On the training data of a linear model with an intercept, R-squared lies between 0 and 1. On held-out data, as used in the selection below, it can be negative.
  - **0**: The model explains none of the variability of the response around its mean, i.e. it does no better than always predicting the mean.
  - **1**: The model explains all the variability of the response around its mean.
- **Interpretation**: 
  - A higher R-squared value indicates a better fit.
  - However, a high R-squared value does not by itself mean the model is good, since on training data it never decreases when features are added. It is important to consider other metrics and validate the model on different data.

At each step we select the feature whose addition increases R-squared the most, thereby improving the model's ability to explain the variance in the target variable.
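To make the definition concrete, here is a minimal sketch of R² computed as `1 - SS_res / SS_tot`. The helper name `r_squared` and the toy values are assumptions for illustration. The last call shows that R² falls below 0 when predictions are worse than always predicting the mean.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
print(r_squared(y_true, y_true))            # perfect predictions -> 1.0
print(r_squared(y_true, np.full(4, 2.5)))   # always the mean     -> 0.0
print(r_squared(y_true, y_true[::-1]))      # worse than the mean -> -3.0
```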

In [None]:
def forward_selection(X, y, metric='r2', stopping_criteria=0.001):
    selected_features = []
    remaining_features = list(X.columns)
    
    if metric == 'mse':
        current_score = float('inf')
    else:
        current_score = -float('inf')
    
    while remaining_features:
        scores_with_candidates = []
        for candidate in remaining_features:
            features_to_test = selected_features + [candidate]
            X_train, X_val, y_train, y_val = train_test_split(
                X[features_to_test], y, test_size=0.2, random_state=42
            )
            model = LinearRegression().fit(X_train, y_train)
            y_pred = model.predict(X_val)
            
            if metric == 'mse':
                score = mean_squared_error(y_val, y_pred)
            else:
                score = model.score(X_val, y_val)
            
            scores_with_candidates.append((score, candidate))
        
        if metric == 'mse':
            best_new_score, best_candidate = min(scores_with_candidates, key=lambda x: x[0])
            improvement = current_score - best_new_score
        else:
            best_new_score, best_candidate = max(scores_with_candidates, key=lambda x: x[0])
            improvement = best_new_score - current_score
        
        print(f"Evaluating candidate: {best_candidate}, Score: {best_new_score}, Improvement: {improvement}")
        
        if improvement > stopping_criteria:
            remaining_features.remove(best_candidate)
            selected_features.append(best_candidate)
            current_score = best_new_score
            print(f"Selected {best_candidate}, Current Best Score: {current_score}")
        else:
            break
        
    return selected_features

selected_features = forward_selection(X, y, metric='r2', stopping_criteria=0.001)
print("Selected Features:", selected_features)

if selected_features:
    X_train, X_test, y_train, y_test = train_test_split(
        X[selected_features], y, test_size=0.2, random_state=42
    )
    
    print("X_train shape:", X_train.shape)
    print("y_train shape:", y_train.shape)
    
    model = LinearRegression().fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    print("Test MAE:", mean_absolute_error(y_test, y_pred))
    print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
    
    bootstrap_results = []
    n_iterations = 1000
    for _ in range(n_iterations):
        X_resample, y_resample = resample(X_train, y_train)
        model = LinearRegression().fit(X_resample, y_resample)
        y_pred = model.predict(X_test)
        bootstrap_results.append(mean_absolute_error(y_test, y_pred))
    
    confidence_interval = np.percentile(bootstrap_results, [2.5, 97.5])
    print("95% Confidence Interval for MAE:", confidence_interval)
else:
    print("No features were selected.")

Evaluating candidate: attr_49, Score: 0.5217378882227437, Improvement: inf
Selected attr_49, Current Best Score: 0.5217378882227437
Evaluating candidate: attr_8, Score: 0.6131370592366217, Improvement: 0.09139917101387796
Selected attr_8, Current Best Score: 0.6131370592366217
Evaluating candidate: attr_16, Score: 0.6290662293159336, Improvement: 0.01592917007931194
Selected attr_16, Current Best Score: 0.6290662293159336
Evaluating candidate: attr_54, Score: 0.6370152274288681, Improvement: 0.00794899811293448
Selected attr_54, Current Best Score: 0.6370152274288681
Evaluating candidate: attr_73, Score: 0.6451848153105901, Improvement: 0.008169587881722062
Selected attr_73, Current Best Score: 0.6451848153105901
Evaluating candidate: attr_12, Score: 0.6505539232280049, Improvement: 0.005369107917414739
Selected attr_12, Current Best Score: 0.6505539232280049
Evaluating candidate: attr_97, Score: 0.6547847405492455, Improvement: 0.004230817321240643
Selected attr_97, Current Best Score

## Bootstrap Method

We use the bootstrap to estimate confidence intervals for our model's performance:

- **Resampling**: We create 1000 different training sets by sampling with replacement
- **Trained models**: 1000 linear regression models
- **Performance evaluation**: We compute the MAE on the test set for each model
- **Confidence intervals**: We compute the 95% confidence interval for the MAE

This shows how much the model's performance varies with the training sample, and hence how reliable the reported MAE is. The bootstrap distribution can also be used to compute a t-statistic.
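The steps listed above can be sketched on synthetic data as follows. The data, the smaller iteration count, and the per-iteration seeding are assumptions chosen to keep the example quick and repeatable. The sketch also computes the bootstrap standard error, the quantity a t-statistic would be built from.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Synthetic data (assumption, for illustration only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))
y_demo = X_demo @ np.array([1.5, -0.5]) + rng.normal(scale=0.3, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

maes = []
for i in range(200):
    # Seeding each draw with the loop index makes the run repeatable.
    X_bs, y_bs = resample(X_tr, y_tr, random_state=i)
    model = LinearRegression().fit(X_bs, y_bs)
    maes.append(mean_absolute_error(y_te, model.predict(X_te)))

maes = np.asarray(maes)
lower, upper = np.percentile(maes, [2.5, 97.5])  # percentile interval
std_error = maes.std(ddof=1)  # bootstrap standard error of the MAE
print(f"95% CI: [{lower:.4f}, {upper:.4f}], SE: {std_error:.4f}")
```

Because the test set stays fixed, this interval reflects only the variability that comes from resampling the training data.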

In [11]:

from sklearn.utils import resample

bootstrap_results = []
n_iterations = 1000
for _ in range(n_iterations):
    X_resample, y_resample = resample(X_train, y_train)
    model = LinearRegression().fit(X_resample, y_resample)
    y_pred = model.predict(X_test)
    bootstrap_results.append(mean_absolute_error(y_test, y_pred))

confidence_interval = np.percentile(bootstrap_results, [2.5, 97.5])
print("95% Confidence Interval for MAE:", confidence_interval)

95% Confidence Interval for MAE: [0.08696414 0.09071894]
