Classification


Importing libraries

In [1]:
import pandas as pd
import numpy as np
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, f1_score, accuracy_score, confusion_matrix
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

Loading and initial data analysis

In [52]:
df_cancer = pd.read_csv('cancer-risk-factors.csv')
print("Dataset shape:", df_cancer.shape)
print("\nFirst 5 rows:")
print(df_cancer.head())

Dataset shape: (2000, 21)

First 5 rows:
  Patient_ID Cancer_Type  Age  Gender  Smoking  Alcohol_Use  Obesity  \
0     LU0000      Breast   68       0        7            2        8   
1     LU0001    Prostate   74       1        8            9        8   
2     LU0002        Skin   55       1        7           10        7   
3     LU0003       Colon   61       0        6            2        2   
4     LU0004        Lung   67       1       10            7        4   

   Family_History  Diet_Red_Meat  Diet_Salted_Processed  ...  \
0               0              5                      3  ...   
1               0              0                      3  ...   
2               0              3                      3  ...   
3               0              6                      2  ...   
4               0              6                      3  ...   

   Physical_Activity  Air_Pollution  Occupational_Hazards  BRCA_Mutation  \
0                  4              6                     3    

Dropping columns

In [53]:
columns_to_drop = ['Overall_Risk_Score', 'Risk_Level', 'Patient_ID']
df_cancer = df_cancer.drop(columns=columns_to_drop)
print("\nColumns after dropping:")
print(df_cancer.columns.tolist())


Columns after dropping:
['Cancer_Type', 'Age', 'Gender', 'Smoking', 'Alcohol_Use', 'Obesity', 'Family_History', 'Diet_Red_Meat', 'Diet_Salted_Processed', 'Fruit_Veg_Intake', 'Physical_Activity', 'Air_Pollution', 'Occupational_Hazards', 'BRCA_Mutation', 'H_Pylori_Infection', 'Calcium_Intake', 'BMI', 'Physical_Activity_Level']


Separating features and the target variable

In [54]:
y = df_cancer['Cancer_Type']
X = df_cancer.drop(columns=['Cancer_Type'])

print("X shape:", X.shape)
print("y shape:", y.shape)


X shape: (2000, 17)
y shape: (2000,)


Encoding the target variable

In [55]:
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

print("Cancer classes:", label_encoder.classes_)


Cancer classes: ['Breast' 'Colon' 'Lung' 'Prostate' 'Skin']


Splitting the data

In [56]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y_encoded,
    test_size=0.2,
    random_state=42,
    stratify=y_encoded
)


Training KNN

In [57]:
knn = KNeighborsClassifier(n_neighbors=5)

knn.fit(X_train, y_train)


Prediction

In [58]:
y_pred = knn.predict(X_test)

Quality evaluation

In [59]:
print("Accuracy:", accuracy_score(y_test, y_pred))

print("\nClassification report:")
print(classification_report(
    y_test,
    y_pred,
    target_names=label_encoder.classes_
))


Accuracy: 0.58

Classification report:
              precision    recall  f1-score   support

      Breast       0.49      0.61      0.54        92
       Colon       0.57      0.62      0.59        84
        Lung       0.69      0.78      0.74       105
    Prostate       0.55      0.39      0.46        61
        Skin       0.56      0.31      0.40        58

    accuracy                           0.58       400
   macro avg       0.57      0.54      0.55       400
weighted avg       0.58      0.58      0.57       400



Improving the baseline

Hypotheses:

1. Feature scaling

KNN is sensitive to feature scale: without scaling, features with larger value ranges (Age, BMI, Air_Pollution) dominate the distance over binary ones (Family_History, BRCA_Mutation).

2. Tuning the number of neighbors k

The fixed value k=5 may not be optimal.

3. Weighting the neighbors

Closer neighbors should have more influence on the prediction than distant ones.
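
To illustrate the first hypothesis, here is a minimal sketch with made-up values: four hypothetical patients described only by Age and a binary Family_History flag (the numbers are illustrative and not taken from the dataset). It shows that the raw squared Euclidean distance is driven almost entirely by Age, and that standardization brings the two features to a comparable contribution.

```python
import numpy as np

# Hypothetical toy data: columns are [Age, Family_History]; values are made up.
X_toy = np.array([
    [40.0, 0.0],
    [70.0, 0.0],
    [41.0, 1.0],
    [69.0, 1.0],
])

def share_of_first_feature(a, b):
    """Fraction of the squared Euclidean distance contributed by feature 0."""
    sq = (a - b) ** 2
    return sq[0] / sq.sum()

# Raw features: rows 0 and 3 differ in both Age and Family_History.
raw_share = share_of_first_feature(X_toy[0], X_toy[3])

# Standardize each column to zero mean and unit variance
# (the same transformation StandardScaler applies).
X_toy_scaled = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled_share = share_of_first_feature(X_toy_scaled[0], X_toy_scaled[3])

print(f"Age share of squared distance, raw:    {raw_share:.3f}")
print(f"Age share of squared distance, scaled: {scaled_share:.3f}")
```

On this toy data Age accounts for nearly all of the raw distance and roughly half of it after scaling.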


Searching for the best parameters

In [60]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])

param_grid = {
    'knn__n_neighbors': [3, 5, 7, 9, 11, 15],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

grid = GridSearchCV(
    pipeline,
    param_grid,
    cv=5,
    scoring='f1_weighted',
    n_jobs=-1
)

grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)


Best parameters: {'knn__metric': 'manhattan', 'knn__n_neighbors': 15, 'knn__weights': 'distance'}


Model with the improved baseline

In [61]:
best_knn = grid.best_estimator_

y_pred_knn = best_knn.predict(X_test)


Quality evaluation

In [62]:
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("\nClassification report:")
print(classification_report(
    y_test,
    y_pred_knn,
    target_names=label_encoder.classes_
))


Accuracy: 0.705

Classification report:
              precision    recall  f1-score   support

      Breast       0.70      0.79      0.74        92
       Colon       0.69      0.73      0.71        84
        Lung       0.75      0.87      0.80       105
    Prostate       0.68      0.62      0.65        61
        Skin       0.66      0.33      0.44        58

    accuracy                           0.70       400
   macro avg       0.69      0.67      0.67       400
weighted avg       0.70      0.70      0.69       400



Before (baseline results for comparison)

In [63]:
print("Accuracy:", accuracy_score(y_test, y_pred))

print("\nClassification report:")
print(classification_report(
    y_test,
    y_pred,
    target_names=label_encoder.classes_
))

Accuracy: 0.58

Classification report:
              precision    recall  f1-score   support

      Breast       0.49      0.61      0.54        92
       Colon       0.57      0.62      0.59        84
        Lung       0.69      0.78      0.74       105
    Prostate       0.55      0.39      0.46        61
        Skin       0.56      0.31      0.40        58

    accuracy                           0.58       400
   macro avg       0.57      0.54      0.55       400
weighted avg       0.58      0.58      0.57       400



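Conclusions:

Accuracy increased by about 0.125 (0.58 to 0.705).

Weighted F1 increased by 0.12 (0.57 to 0.69).

Macro F1 increased by 0.12 (0.55 to 0.67).

These gains are consistent with the hypotheses about scaling and hyperparameter tuning. Because scaling, k, weights and metric were changed together, the contribution of each is not isolated, and recall for the Skin class remains low (0.31 to 0.33).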
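
The deltas can be recomputed directly from the two reports. The sketch below hard-codes the values printed by the notebook (accuracy, macro F1, weighted F1) rather than recomputing them from `y_test`; the dictionary names are illustrative.

```python
# Metrics copied from the two classification reports printed above.
baseline = {"accuracy": 0.58, "macro_f1": 0.55, "weighted_f1": 0.57}
improved = {"accuracy": 0.705, "macro_f1": 0.67, "weighted_f1": 0.69}

# Absolute improvement for each metric, rounded to suppress float noise.
deltas = {name: round(improved[name] - baseline[name], 3) for name in baseline}

for name, delta in deltas.items():
    print(f"{name}: {baseline[name]:.3f} -> {improved[name]:.3f} (+{delta:.3f})")
```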

Implementing the machine learning algorithm


In [32]:
class MyKNNClassifier:
    """Minimal brute-force KNN classifier with uniform or distance-weighted voting."""

    def __init__(self, n_neighbors=5, metric='euclidean', weights='uniform'):
        self.k = n_neighbors
        self.metric = metric
        self.weights = weights

    def fit(self, X, y):
        # KNN is a lazy learner: fitting only stores the training data.
        # Converting to NumPy arrays keeps positional indexing valid
        # for DataFrame / Series inputs as well.
        self.X_train = np.asarray(X)
        self.y_train = np.asarray(y)
        return self

    def _distance(self, x1, x2):
        if self.metric == 'euclidean':
            return np.sqrt(np.sum((x1 - x2) ** 2))
        elif self.metric == 'manhattan':
            return np.sum(np.abs(x1 - x2))
        else:
            raise ValueError(f"Unknown metric: {self.metric!r}")

    def predict(self, X):
        X = np.asarray(X)
        predictions = []

        for x in X:
            # Distance from the query point to every training point.
            distances = np.array([
                self._distance(x, x_train)
                for x_train in self.X_train
            ])

            # Indices of the k nearest training points.
            k_idx = np.argsort(distances)[:self.k]
            k_labels = self.y_train[k_idx]
            k_distances = distances[k_idx]

            if self.weights == 'uniform':
                # Majority vote; on a tie the label encountered first wins.
                vote = Counter(k_labels).most_common(1)[0][0]

            elif self.weights == 'distance':
                # Inverse-distance weights; the small constant avoids division by zero.
                weights = 1 / (k_distances + 1e-5)
                class_scores = {}

                for label, w in zip(k_labels, weights):
                    class_scores[label] = class_scores.get(label, 0) + w

                vote = max(class_scores, key=class_scores.get)

            else:
                raise ValueError(f"Unknown weights: {self.weights!r}")

            predictions.append(vote)

        return np.array(predictions)


Training

In [33]:
my_knn_baseline = MyKNNClassifier(
    n_neighbors=5,
    metric='euclidean',
    weights='uniform'
)

my_knn_baseline.fit(X_train, y_train)
y_pred_baseline = my_knn_baseline.predict(X_test)


Quality evaluation

In [34]:
print("Baseline MyKNN Accuracy:",
      accuracy_score(y_test, y_pred_baseline))

print(classification_report(
    y_test,
    y_pred_baseline,
    target_names=label_encoder.classes_
))


Baseline MyKNN Accuracy: 0.555
              precision    recall  f1-score   support

      Breast       0.51      0.51      0.51        92
       Colon       0.53      0.50      0.52        84
        Lung       0.69      0.79      0.73       105
    Prostate       0.47      0.44      0.45        61
        Skin       0.47      0.40      0.43        58

    accuracy                           0.56       400
   macro avg       0.53      0.53      0.53       400
weighted avg       0.55      0.56      0.55       400



For comparison: the sklearn algorithm


In [35]:
print("Accuracy:", accuracy_score(y_test, y_pred))

print("\nClassification report:")
print(classification_report(
    y_test,
    y_pred,
    target_names=label_encoder.classes_
))

Accuracy: 0.58

Classification report:
              precision    recall  f1-score   support

      Breast       0.49      0.61      0.54        92
       Colon       0.57      0.62      0.59        84
        Lung       0.69      0.78      0.74       105
    Prostate       0.55      0.39      0.46        61
        Skin       0.56      0.31      0.40        58

    accuracy                           0.58       400
   macro avg       0.57      0.54      0.55       400
weighted avg       0.58      0.58      0.57       400



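Conclusion: the sklearn implementation has slightly better metrics (accuracy 0.58 vs 0.555) with the same settings (k=5, Euclidean metric, uniform weights).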
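
One plausible source of this small gap is tie-breaking in the majority vote; this is an assumption and was not verified in this notebook. `Counter.most_common` returns the tied label that was encountered first (here, the label of the nearest tied neighbor), whereas a rule that picks the smallest class label among the tied classes, which is how scikit-learn's uniform vote is understood to behave, can return a different class. A minimal sketch with made-up neighbor labels:

```python
from collections import Counter

# Hypothetical labels of the 5 nearest neighbors, ordered from nearest to farthest.
# Classes 3 and 1 both receive two votes, so the vote is tied.
k_labels = [3, 1, 3, 1, 0]

# Rule used by MyKNNClassifier: among tied labels, most_common keeps
# the one encountered first, i.e. the label of the nearest tied neighbor.
vote_counter = Counter(k_labels).most_common(1)[0][0]

# Alternative rule: among the labels with the highest count, take the smallest.
counts = Counter(k_labels)
top = max(counts.values())
vote_smallest = min(label for label, c in counts.items() if c == top)

print("first-encountered rule:", vote_counter)
print("smallest-label rule:   ", vote_smallest)
```

With five classes and k=5, such ties are fairly likely, so the two rules may disagree on a noticeable share of test points.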

Improving MyKNN

In [36]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

my_knn_improved = MyKNNClassifier(
    n_neighbors=15,
    metric='manhattan',
    weights='distance'
)

my_knn_improved.fit(X_train_scaled, y_train)
y_pred_improved = my_knn_improved.predict(X_test_scaled)


Evaluation


In [37]:
print("Improved MyKNN Accuracy:",
      accuracy_score(y_test, y_pred_improved))

print(classification_report(
    y_test,
    y_pred_improved,
    target_names=label_encoder.classes_
))


Improved MyKNN Accuracy: 0.705
              precision    recall  f1-score   support

      Breast       0.70      0.79      0.74        92
       Colon       0.69      0.73      0.71        84
        Lung       0.75      0.87      0.80       105
    Prostate       0.68      0.62      0.65        61
        Skin       0.66      0.33      0.44        58

    accuracy                           0.70       400
   macro avg       0.69      0.67      0.67       400
weighted avg       0.70      0.70      0.69       400



For comparison: the improved sklearn baseline

In [38]:
print("Accuracy:", accuracy_score(y_test, y_pred_knn))
print("\nClassification report:")
print(classification_report(
    y_test,
    y_pred_knn,
    target_names=label_encoder.classes_
))

Accuracy: 0.705

Classification report:
              precision    recall  f1-score   support

      Breast       0.70      0.79      0.74        92
       Colon       0.69      0.73      0.71        84
        Lung       0.75      0.87      0.80       105
    Prostate       0.68      0.62      0.65        61
        Skin       0.66      0.33      0.44        58

    accuracy                           0.70       400
   macro avg       0.69      0.67      0.67       400
weighted avg       0.70      0.70      0.69       400



Conclusion: with scaling and identical hyperparameters, the custom implementation and the sklearn algorithm produce identical test metrics. This is expected because KNN is deterministic, and it supports the correctness of the custom implementation.

Regression


Loading and initial data analysis

In [25]:
df_productivity = pd.read_csv('social_media_vs_productivity.csv')
print("Dataset shape:", df_productivity.shape)
print("\nFirst 5 rows:")
print(df_productivity.head())

Dataset shape: (30000, 19)

First 5 rows:
   age  gender    job_type  daily_social_media_time  \
0   56    Male  Unemployed                 4.180940   
1   46    Male      Health                 3.249603   
2   32    Male     Finance                      NaN   
3   60  Female  Unemployed                      NaN   
4   25    Male          IT                      NaN   

  social_platform_preference  number_of_notifications  work_hours_per_day  \
0                   Facebook                       61            6.753558   
1                    Twitter                       59            9.169296   
2                    Twitter                       57            7.910952   
3                   Facebook                       59            6.355027   
4                   Telegram                       66            6.214096   

   perceived_productivity_score  actual_productivity_score  stress_level  \
0                      8.040464                   7.291555           4.0   
1       

Handling missing values

In [26]:
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_squared_error, r2_score

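# Drop rows where the target itself is missing.
df_productivity_clean = df_productivity.dropna(
    subset=['actual_productivity_score']
)

# Keep only the first 1000 remaining rows (a subsample of the 30000-row dataset).
df_productivity_clean = df_productivity_clean.iloc[:1000].copy()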




Separating features and the target variable

In [27]:
target = 'actual_productivity_score'

X = df_productivity_clean.drop(columns=[target])
y = df_productivity_clean[target]


Preprocessing



In [28]:
numeric_features = X.select_dtypes(include=[np.number]).columns
categorical_features = X.select_dtypes(exclude=[np.number]).columns


from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(
    transformers=[
        ('num', SimpleImputer(strategy='mean'), numeric_features),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(handle_unknown='ignore'))
        ]), categorical_features)
    ]
)


Train / Test split

In [29]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


KNN regression

In [30]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline

knn_reg_baseline = Pipeline([
    ('preprocessor', preprocessor),
    ('knn', KNeighborsRegressor(n_neighbors=5))
])

knn_reg_baseline.fit(X_train, y_train)

y_pred_baseline = knn_reg_baseline.predict(X_test)


Quality evaluation

In [31]:
print("Baseline MSE:", mean_squared_error(y_test, y_pred_baseline))
print("Baseline R2:", r2_score(y_test, y_pred_baseline))


Baseline MSE: 2.4292999887798516
Baseline R2: 0.2809548955036979


Improving the baseline

Hypotheses:

Hypothesis 1

KNN is sensitive to feature scale, so adding StandardScaler should reduce MSE.

Hypothesis 2

The optimal k may differ from 5, so tuning k should improve quality.

Hypothesis 3

weights='distance' should give a better local average, because closer neighbors receive larger weights.
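
As an illustration of hypothesis 3, the sketch below compares a uniform average and an inverse-distance weighted average for a single query point. The distances and target values are invented for the example.

```python
# Hypothetical neighbors of one query point: (distance, target value).
neighbors = [(0.5, 8.0), (1.0, 6.0), (4.0, 2.0)]

# Uniform weights: plain mean of the neighbors' targets.
uniform_pred = sum(y for _, y in neighbors) / len(neighbors)

# Distance weights: each neighbor is weighted by 1 / distance,
# so the closest neighbor contributes the most.
weights = [1.0 / d for d, _ in neighbors]
weighted_pred = sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

print(f"uniform:  {uniform_pred:.3f}")
print(f"weighted: {weighted_pred:.3f}")
```

Here the weighted prediction is pulled toward the target of the nearest neighbor, while the distant neighbor with a very different value has less effect.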

In [32]:
from sklearn.preprocessing import StandardScaler

pipe_reg = Pipeline([
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler(with_mean=False)),
    ('knn', KNeighborsRegressor())
])


Hyperparameter tuning

In [33]:
from sklearn.model_selection import GridSearchCV

param_grid = {
    'knn__n_neighbors': [3, 5, 7, 11, 15],
    'knn__weights': ['uniform', 'distance'],
    'knn__metric': ['euclidean', 'manhattan']
}

grid_reg = GridSearchCV(
    pipe_reg,
    param_grid,
    cv=5,
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

grid_reg.fit(X_train, y_train)

print("Best parameters:", grid_reg.best_params_)


Best parameters: {'knn__metric': 'euclidean', 'knn__n_neighbors': 11, 'knn__weights': 'distance'}


Evaluating the improved baseline

In [34]:
best_knn_reg = grid_reg.best_estimator_
y_pred_improved = best_knn_reg.predict(X_test)

print("Improved MSE:", mean_squared_error(y_test, y_pred_improved))
print("Improved R2:", r2_score(y_test, y_pred_improved))


Improved MSE: 1.6356877163619854
Improved R2: 0.5158550815597089


Before (baseline results for comparison)

In [35]:
print("Baseline MSE:", mean_squared_error(y_test, y_pred_baseline))
print("Baseline R2:", r2_score(y_test, y_pred_baseline))

Baseline MSE: 2.4292999887798516
Baseline R2: 0.2809548955036979


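Conclusions:

The improved baseline is measurably better on the test set:

MSE dropped from 2.43 to 1.64, so predictions are more accurate on average.

R² rose from 0.28 to 0.52, so the model explains more of the variance in the target.

This supports the regression hypotheses (scaling, tuning k, distance weights), although the grid search applies them together, so their individual effects are not separated.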
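
For reference, both metrics can be computed by hand. The sketch below uses small made-up arrays, not the notebook's data, and follows the standard definitions: MSE is the mean squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up targets and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

residuals = y_true - y_hat
mse = np.mean(residuals ** 2)                    # mean squared error

ss_res = np.sum(residuals ** 2)                  # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print("MSE:", mse)
print("R2: ", r2)
```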

Custom implementation of KNN regression

In [36]:
import numpy as np

class MyKNNRegressorFast:
    """Vectorized brute-force KNN regressor (uniform or inverse-distance weights)."""

    def __init__(self, n_neighbors=5, metric='euclidean', weights='uniform'):
        self.k = n_neighbors
        self.metric = metric
        self.weights = weights

    def fit(self, X, y):
        # Fitting only stores the training data as NumPy arrays.
        self.X_train = np.asarray(X, dtype=float)
        self.y_train = np.asarray(y, dtype=float)
        return self

    def _compute_distances(self, X):
        # Broadcasting yields an (n_test, n_train, n_features) array of
        # differences, which is then reduced over the feature axis.
        diff = X[:, None, :] - self.X_train[None, :, :]
        if self.metric == 'euclidean':
            dists = np.sqrt(np.sum(diff ** 2, axis=2))
        elif self.metric == 'manhattan':
            dists = np.sum(np.abs(diff), axis=2)
        else:
            raise ValueError(f"Unknown metric: {self.metric!r}")
        return dists

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        dists = self._compute_distances(X)  # n_test x n_train

        # Indices, targets and distances of the k nearest training points.
        idx_k = np.argsort(dists, axis=1)[:, :self.k]  # n_test x k
        y_neighbors = self.y_train[idx_k]  # n_test x k
        d_neighbors = np.take_along_axis(dists, idx_k, axis=1)  # n_test x k

        if self.weights == 'uniform':
            preds = np.mean(y_neighbors, axis=1)
        elif self.weights == 'distance':
            # Inverse-distance weights; the small constant avoids division by zero.
            weights = 1 / (d_neighbors + 1e-5)
            preds = np.sum(weights * y_neighbors, axis=1) / np.sum(weights, axis=1)
        else:
            raise ValueError(f"Unknown weights: {self.weights!r}")

        return preds
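
The pairwise distances in `_compute_distances` rely on NumPy broadcasting: `X[:, None, :]` has shape (n_test, 1, n_features) and `self.X_train[None, :, :]` has shape (1, n_train, n_features), so their difference is a full (n_test, n_train, n_features) array. A minimal sketch with random data (sizes and variable names chosen arbitrarily for the demo) shows the shapes and checks one entry against a direct computation.

```python
import numpy as np

rng = np.random.default_rng(0)
X_test_demo = rng.normal(size=(4, 3))    # 4 query points, 3 features (arbitrary sizes)
X_train_demo = rng.normal(size=(6, 3))   # 6 training points

# Broadcasting builds every pairwise difference at once.
diff = X_test_demo[:, None, :] - X_train_demo[None, :, :]
dists = np.sqrt(np.sum(diff ** 2, axis=2))

print(diff.shape)    # (4, 6, 3)
print(dists.shape)   # (4, 6)

# Entry [i, j] equals the Euclidean distance between query i and training point j.
i, j = 2, 5
direct = np.linalg.norm(X_test_demo[i] - X_train_demo[j])
print(np.isclose(dists[i, j], direct))
```

The intermediate array grows with the product of all three dimensions, which may become a memory concern on inputs much larger than the 1000-row subsample used here.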


Preparing data for MyKNNRegressor (numeric features only)




In [37]:
X_num = df_productivity_clean[numeric_features]
y_num = df_productivity_clean[target]

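# Note: the imputer is fitted on all rows before the split, so test-set means
# leak into the training data; fitting it on the training part only is cleaner.
X_num = SimpleImputer(strategy='mean').fit_transform(X_num)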

X_train_n, X_test_n, y_train_n, y_test_n = train_test_split(
    X_num, y_num, test_size=0.2, random_state=42
)


Training and evaluation

In [38]:
my_knn_fast = MyKNNRegressorFast(n_neighbors=5)
my_knn_fast.fit(X_train_n, y_train_n.values)
y_pred_fast = my_knn_fast.predict(X_test_n)

print("MyKNN Fast Baseline MSE:", mean_squared_error(y_test_n, y_pred_fast))
print("MyKNN Fast Baseline R2:", r2_score(y_test_n, y_pred_fast))



MyKNN Fast Baseline MSE: 2.383604302591147
MyKNN Fast Baseline R2: 0.29448029771929274


For comparison: the sklearn algorithm

In [39]:
print("Baseline MSE:", mean_squared_error(y_test, y_pred_baseline))
print("Baseline R2:", r2_score(y_test, y_pred_baseline))

Baseline MSE: 2.4292999887798516
Baseline R2: 0.2809548955036979


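Conclusion: the custom implementation scored slightly better (MSE 2.38 vs 2.43, R² 0.29 vs 0.28). The comparison is not like-for-like, however: the custom model uses only numeric features, while the sklearn pipeline also includes one-hot encoded categorical features.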

Applying the hypotheses to the custom algorithm

In [40]:
my_knn_fast_improved = MyKNNRegressorFast(
    n_neighbors=15,
    metric='manhattan',
    weights='distance'
)

# Scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_n)
X_test_scaled = scaler.transform(X_test_n)

my_knn_fast_improved.fit(X_train_scaled, y_train_n.values)
y_pred_fast_improved = my_knn_fast_improved.predict(X_test_scaled)

print("MyKNN Fast Improved MSE:", mean_squared_error(y_test_n, y_pred_fast_improved))
print("MyKNN Fast Improved R2:", r2_score(y_test_n, y_pred_fast_improved))


MyKNN Fast Improved MSE: 0.6859945121523893
MyKNN Fast Improved R2: 0.7969534442214972


For comparison: the improved sklearn baseline

In [41]:
best_knn_reg = grid_reg.best_estimator_
y_pred_improved = best_knn_reg.predict(X_test)

print("Improved MSE:", mean_squared_error(y_test, y_pred_improved))
print("Improved R2:", r2_score(y_test, y_pred_improved))

Improved MSE: 1.6356877163619854
Improved R2: 0.5158550815597089


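Conclusion: the improved custom implementation scores better than the improved sklearn model (MSE 0.69 vs 1.64, R² 0.80 vs 0.52). The two setups differ, however: the custom model uses only numeric features with n_neighbors=15 and the Manhattan metric, while the sklearn pipeline also includes one-hot encoded categorical features and uses the grid-search result (Euclidean metric, n_neighbors=11). The gap therefore likely reflects the feature set and hyperparameters rather than the implementation itself.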