# Task 1. Dimensionality reduction methods

Apply the dimensionality reduction methods PCA, t-SNE, and UMAP to the blood cell images from the BloodMNIST dataset. Project the data onto a two-dimensional space, since this is the easiest to visualize (use [`sns.scatterplot`](https://seaborn.pydata.org/generated/seaborn.scatterplot.html)). Which method separates the classes best in the projected space? Describe your observations.
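Visual inspection can be backed by a rough numeric score. The sketch below is only an illustration, not part of the assignment: it computes the mean silhouette coefficient of the true classes in a 2D embedding. The helper name `separation_score` and the stand-in data (the small digits dataset bundled with scikit-learn instead of BloodMNIST) are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def separation_score(embedding: np.ndarray, labels: np.ndarray) -> float:
    """Mean silhouette coefficient of the true classes in the embedding (-1 to 1)."""
    return float(silhouette_score(embedding, labels))


# Stand-in data (assumption): 8x8 digit images instead of BloodMNIST
digits = load_digits()
x_demo = StandardScaler().fit_transform(digits.data)
y_demo = digits.target

# Two-dimensional PCA projection, scored by how well the classes separate
embedding_pca = PCA(n_components=2, random_state=0).fit_transform(x_demo)
score_pca = separation_score(embedding_pca, y_demo)
print(f"PCA silhouette: {score_pca:.3f}")
```

The same helper could be applied to the t-SNE and UMAP embeddings. Because those methods distort global distances, the score is best read as a supplement to the scatter plots rather than a replacement for them.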


## Result format

Example plot for one part of the task:

<img src ="https://edunet.kea.su/repo/EduNet-web_dependencies/dev-2.0/Exercises/EX04/result_1_task_ex04.png" width="300">

Installing and importing the required libraries:

In [None]:
!pip install -q umap-learn
!pip install -q --upgrade scikit-image
!pip install -q --upgrade git+https://github.com/MedMNIST/MedMNIST.git

In [None]:
import umap
import medmnist
import matplotlib
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from medmnist import INFO
from sklearn import manifold
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

matplotlib.style.use("ggplot")

Load the data:

In [None]:
data_flag = "bloodmnist"
download = True

info = INFO[data_flag]
task = info["task"]
n_channels = info["n_channels"]
n_classes = len(info["label"])

DataClass = getattr(medmnist, info["python_class"])


# load the data
bloodmnist = DataClass(split="train", download=download)
print(bloodmnist)

The images are available through the `bloodmnist.imgs` attribute, and the class labels through `bloodmnist.labels`.

In [None]:
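# Scale pixel values to [0, 1] and flatten each 28x28x3 image into a 2352-element vector
x = bloodmnist.imgs / 255.0
x = x.reshape(-1, 2352)
# Replace numeric class indices with human-readable class names
y = pd.Series(bloodmnist.labels.reshape(-1))
y = y.astype("int").map(dict(zip(range(0, 8), info["label"].values())))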


bloodmnist.montage(length=10)

## PCA

In [None]:
# Your code here
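# Standardize each pixel feature so that PCA is not dominated by high-variance pixels
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x)
# Project onto the first two principal components
pca = PCA(n_components=2)
x_pca = pca.fit_transform(x_scaled)

plt.figure(figsize=(10, 10))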
sns.scatterplot(
    x=x_pca[:, 0],
    y=x_pca[:, 1],
    hue=y,
    palette=sns.color_palette("hls", 8),
    legend="full",
    alpha=0.5,
)
plt.show()

## t-SNE

In [None]:
tsne = manifold.TSNE(n_components=2, random_state=42)
x_tsne = tsne.fit_transform(x)

plt.figure(figsize=(10, 10))
sns.scatterplot(
    x=x_tsne[:, 0],
    y=x_tsne[:, 1],
    hue=y,
    palette=sns.color_palette("hls", 8),
    legend="full",
    alpha=0.5,
)
plt.show()

## UMAP

In [None]:
umap_model = umap.UMAP(n_components=2, n_neighbors=100)
x_umap = umap_model.fit_transform(x)

plt.figure(figsize=(10, 10))
sns.scatterplot(
    x=x_umap[:, 0],
    y=x_umap[:, 1],
    hue=y,
    palette=sns.color_palette("hls", 8),
    legend="full",
    alpha=0.5,
)
plt.show()

Conclusions:

*Your text here*



# Task 2. Using dimensionality reduction to speed up training

Consider the TissueMNIST dataset. In this task you need to compare the performance of two models: one trained on all available features and one trained on reduced-dimensionality data. You are required to:

1. Build a `RandomForestClassifier()` model and train it on the training set; evaluate the model's `accuracy` on the test set and the time spent on training.
2. Fit a PCA model on the training data and determine the number of principal components that explain 90% of the variance (or use any other method of choosing the optimal number of principal components covered in the lecture).
3. Project the test data onto the principal components of the fitted PCA model.
4. Build a `RandomForestClassifier()` model and train it on the reduced-dimensionality data; evaluate the model's `accuracy` on the test set and the time spent on training.
5. Describe your observations and draw conclusions.
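Step 2 can be done by hand from the cumulative explained variance. The following is a minimal sketch under stated assumptions: the data are synthetic (50 features generated from 5 latent factors plus noise), and the names `x_demo`, `cumulative`, and `threshold` are introduced only for this illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in data (assumption): 500 samples, 50 features driven by 5 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
x_demo = latent @ mixing + 0.1 * rng.normal(size=(500, 50))

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(x_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative share reaches the threshold
threshold = 0.9
n_components = int(np.searchsorted(cumulative, threshold) + 1)
print(f"Components needed for {threshold:.0%} of variance: {n_components}")
```

Plotting `cumulative` against the component index gives the usual "elbow" view of the same information. Passing a float such as `PCA(n_components=0.9)` asks scikit-learn to make this choice internally.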


## Result format

Report the accuracy (`accuracy`) and training time of `RandomForestClassifier()` on the original data and on the reduced-dimensionality data.

Installing and importing the required libraries:

In [None]:
!pip install -q --upgrade git+https://github.com/MedMNIST/MedMNIST.git
!pip install -q --upgrade scikit-image

In [None]:
import time
import medmnist
import matplotlib
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from medmnist import INFO
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

matplotlib.style.use("ggplot")

Load the data:

In [None]:
data_flag = "tissuemnist"
download = True

info = INFO[data_flag]
task = info["task"]
n_channels = info["n_channels"]
n_classes = len(info["label"])

DataClass = getattr(medmnist, info["python_class"])

# load the data
tissuemnist = DataClass(split="test", download=download)
print(tissuemnist)

In [None]:
x = tissuemnist.imgs / 255.0
x = x.reshape(-1, 784)
y = tissuemnist.labels

tissuemnist.montage(length=10)

In [None]:
rng = np.random.RandomState(42)
rf = RandomForestClassifier(n_estimators=200, random_state=rng)
# Fix the split seed so that the results are reproducible between runs
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)
# Your code here
# Time only the fit call so that rf_time reflects training, not prediction
t0 = time.time()
rf.fit(x_train, y_train.ravel())
t1 = time.time()
rf_time = t1 - t0

y_pred = rf.predict(x_test)
rf_scores = accuracy_score(y_test, y_pred)

print(f"Test accuracy: {rf_scores:0.3f}")
print(f"Training time: {rf_time:0.3f}s")

In [None]:
# Your code here

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

pca = PCA(n_components=0.9)
x_pca_train = pca.fit_transform(x_train_scaled)
x_pca_test = pca.transform(x_test_scaled)
print(f"n components explaining 90% of variance: {pca.n_components_}")

sns.scatterplot(
    x=x_pca_train[:, 0],
    y=x_pca_train[:, 1],
    hue=y_train[:, 0],
    palette=sns.color_palette("hls", 8),
    legend="full",
    alpha=0.5,
)
plt.show()

In [None]:
# Your code here

rng = np.random.RandomState(42)
rf_pca = RandomForestClassifier(n_estimators=200, random_state=rng)
# Time only the fit call so that pca_time reflects training, not prediction
t0 = time.time()
rf_pca.fit(x_pca_train, y_train.ravel())
t1 = time.time()
pca_time = t1 - t0

y_pred = rf_pca.predict(x_pca_test)
pca_scores = accuracy_score(y_test, y_pred)

print(f"Test accuracy: {pca_scores:0.3f}")
print(f"Training time: {pca_time:0.3f}s")

In [None]:
print(f"Diff time: {(rf_time - pca_time):0.3f}s")
print(f"Diff accuracy: {(rf_scores - pca_scores):0.3f}")

# Task 3. Feature selection


We have a dataset with 30 features. It is known that using just 5 of them is enough to improve prediction quality, but it is not known which ones.

Select 5 features using feature selection methods and improve the prediction quality.
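One simple way to combine several selection methods is to let each method vote for its top-k features. The sketch below uses made-up rankings (the feature names `f1` to `f4` and the column names `method_a` to `method_c` are hypothetical) to show the idea and a common pitfall: if each column lists every feature, counting votes over the whole table gives all features the same score.

```python
import pandas as pd

# Hypothetical rankings (assumption): each column lists features from most to least important
rankings = pd.DataFrame(
    {
        "method_a": ["f1", "f2", "f3", "f4"],
        "method_b": ["f2", "f1", "f4", "f3"],
        "method_c": ["f2", "f3", "f1", "f4"],
    }
)

top_k = 2
# Count how many methods place each feature in their top-k
votes = rankings.head(top_k).stack().value_counts()
print(votes)

# Without head(top_k) every feature appears once per method, so the counts carry no information
all_counts = rankings.stack().value_counts()
print(all_counts)
```

Here `f2` is in the top 2 of all three methods, `f1` in two, and `f3` in one, while `f4` receives no votes.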

## Result format

* Model accuracy > 0.62.


Importing and installing the required libraries:

In [None]:
!pip install -q catboost phik boruta mlxtend

In [None]:
import matplotlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.inspection import permutation_importance

matplotlib.style.use("ggplot")

In [None]:
df = pd.read_csv(
    "https://edunet.kea.su/repo/EduNet-web_dependencies/datasets/feature_select_ex.csv"
)
df

In [None]:
x_train, x_test, y_train, y_test = train_test_split(
    df.drop(columns=["target"]), df["target"], test_size=0.2, random_state=42
)

rf = RandomForestClassifier(random_state=42)
rf.fit(x_train, y_train)
y_pred_rf = rf.predict(x_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Use all features, accuracy: {accuracy_rf:.2f}")

In [None]:
import phik

df_result = pd.DataFrame()

# Compute the phik correlation matrix once, treating every column as interval (continuous)
phik_matrix = df.phik_matrix(interval_cols=df.columns).round(2)

# Order columns and rows by decreasing correlation with the target
sorted_columns = phik_matrix.sort_values("target", ascending=False, axis=1).columns
phik_result = phik_matrix.loc[sorted_columns, sorted_columns]

plt.figure(figsize=(12, 12))
heatmap = sns.heatmap(
    phik_result,
    annot=True,
    square=True,
    cmap="Blues",
    cbar_kws={"fraction": 0.01},  # shrink colour bar
    linewidth=2,
)

heatmap.set_xticklabels(
    heatmap.get_xticklabels(), rotation=45, horizontalalignment="right"
)
heatmap.set_title("Correlation heatmap", fontdict={"fontsize": 18}, pad=16)
plt.show()
# The first entry is the target itself, so skip it and keep the ranked features
df_result["phik"] = phik_result.index[1:]

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

lr = LogisticRegression(max_iter=1000)
lr.fit(x_train_scaled, y_train)

temp_df = pd.DataFrame({"name": x_train.columns, "coef": lr.coef_[0]}).sort_values(
    "coef", key=abs, ascending=False
)

temp_df["sign"] = ["neg" if x < 0 else "pos" for x in temp_df["coef"]]

palette = {"neg": sns.xkcd_rgb["orange"], "pos": sns.xkcd_rgb["azure"]}

plt.figure(figsize=(8, 8))
sns.barplot(
    data=temp_df,
    y="name",
    x="coef",
    hue="sign",
    palette=palette,
    legend=False,
    orient="h",
)
plt.show()

df_result["logreg_coef"] = temp_df["name"].reset_index()["name"]

In [None]:
from sklearn.feature_selection import SelectFromModel

rf = RandomForestClassifier(random_state=42)

rf_selector = SelectFromModel(rf)
rf_selector.fit(x_train, y_train)  # Fit it on the training data

temp_df = pd.DataFrame(
    {"name": x_train.columns, "coef": rf_selector.estimator_.feature_importances_}
).sort_values("coef", key=abs, ascending=False)

temp_df["sign"] = ["neg" if x < 0 else "pos" for x in temp_df["coef"]]

palette = {"neg": sns.xkcd_rgb["orange"], "pos": sns.xkcd_rgb["azure"]}

plt.figure(figsize=(8, 8))
sns.barplot(
    data=temp_df,
    y="name",
    x="coef",
    hue="sign",
    palette=palette,
    legend=False,
    orient="h",
)
plt.show()

df_result["rf_fi"] = temp_df["name"].reset_index()["name"]

In [None]:
from catboost import CatBoostClassifier

model = CatBoostClassifier(random_state=42, thread_count=-1)
model.fit(
    x_train,
    y_train,
    eval_set=(x_test, y_test),
    verbose=100,
    plot=False,
    early_stopping_rounds=100,
)

temp_df = pd.DataFrame(
    {"name": x_train.columns, "coef": model.feature_importances_}
).sort_values("coef", key=abs, ascending=False)


temp_df["sign"] = ["neg" if x < 0 else "pos" for x in temp_df["coef"]]

palette = {"neg": sns.xkcd_rgb["orange"], "pos": sns.xkcd_rgb["azure"]}

plt.figure(figsize=(8, 8))
sns.barplot(
    data=temp_df,
    y="name",
    x="coef",
    hue="sign",
    palette=palette,
    legend=False,
    orient="h",
)
plt.show()

df_result["catboost_fi"] = temp_df["name"].reset_index()["name"]

In [None]:
from sklearn.inspection import permutation_importance

rf = RandomForestClassifier(random_state=42)
rf.fit(x_train, y_train)
perm_importance = permutation_importance(
    rf, x_train, y_train, n_repeats=10, random_state=42
)

temp_df = pd.DataFrame(
    {"name": x_train.columns, "imp": perm_importance.importances_mean}
).sort_values("imp", ascending=False)
df_result["rf_pi"] = temp_df["name"].reset_index()["name"]

In [None]:
from boruta import BorutaPy

# define Boruta feature selection method
model = RandomForestClassifier(random_state=42)

feat_selector = BorutaPy(model, n_estimators=100, verbose=0, random_state=42)

# find all relevant features
feat_selector.fit(x_train.values, y_train.values)
feature_ranks = list(
    zip(x_train.columns, feat_selector.ranking_, feat_selector.support_)
)

temp_df = pd.DataFrame(feature_ranks, columns=["feature", "rank", "boruta_keep"])
# Order features by Boruta rank (rank 1 = selected as relevant)
temp_df = temp_df.sort_values("rank")

df_result["rf_boruta"] = temp_df["feature"].reset_index(drop=True)

In [None]:
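# Every column of df_result lists all 30 features in ranked order, so count votes
# only among the top 5 of each method; otherwise every feature gets the same count
stacked = df_result.head(5).stack().value_counts()
top_features = stacked.index[:5]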

print(f"Top 5 selected features:\n{list(top_features)}")

rf = RandomForestClassifier(random_state=42)
rf.fit(x_train[top_features], y_train)
y_pred_rf = rf.predict(x_test[top_features])
fs_accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"\nUse top 5 selected features, accuracy: {fs_accuracy_rf:.2f}")

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import KFold

sffs = SequentialFeatureSelector(
    RandomForestClassifier(random_state=42),  # classifier used to score each subset
    k_features=5,  # number of features to select
    forward=True,  # start empty and add features one at a time
    floating=True,  # allow conditional removal of already added features
    scoring="accuracy",  # subsets are compared by cross-validated accuracy
    cv=KFold(n_splits=3, shuffle=True, random_state=42),
)

sffs.fit(x_train.values, y_train)  # performs the actual SFFS algorithm
# Cross-validation scores for every subset size visited by the search
temp_df = pd.DataFrame.from_dict(sffs.get_metric_dict()).T
# Map the selected column indices back to the DataFrame column names
sffs_columns = list(x_train.columns[list(sffs.k_feature_idx_)])

rf = RandomForestClassifier(random_state=42)
rf.fit(x_train[sffs_columns], y_train)
y_pred_rf = rf.predict(x_test[sffs_columns])
sffs_accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Use top 5 sffs selected features, accuracy : {sffs_accuracy_rf:.2f}")

# Task 4. Binary classification with LogisticRegression

In this task you need to solve a binary classification problem. Using only [`LogisticRegression`](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html), achieve an `accuracy` above 0.91.

Allowed:
* Generating and selecting features

Not allowed:
* Changing the model
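Generated features matter here because logistic regression can only draw a linear decision boundary in the feature space it is given. The sketch below illustrates this on synthetic data, not on the assignment dataset: `make_circles` produces two concentric rings, and the helper `add_squares` (a name introduced for this example) appends squared copies of the features so that a circular boundary becomes linear.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): two concentric rings, not linearly separable
x_demo, y_demo = make_circles(n_samples=1000, factor=0.5, noise=0.05, random_state=0)
x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=0
)


def add_squares(x: np.ndarray) -> np.ndarray:
    """Append the element-wise square of every feature as extra columns."""
    return np.hstack([x, x**2])


# Same model class, two feature sets: raw coordinates vs. raw plus squared
acc_raw = LogisticRegression().fit(x_tr, y_tr).score(x_te, y_te)
acc_squared = (
    LogisticRegression().fit(add_squares(x_tr), y_tr).score(add_squares(x_te), y_te)
)

print(f"Raw features:     {acc_raw:.2f}")
print(f"Squared features: {acc_squared:.2f}")
```

A pair plot of the real data is a reasonable way to spot which features show this kind of curved class boundary and are therefore candidates for squaring or other transformations.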

## Result format

* Model accuracy > 0.91.


Importing the required libraries:

In [None]:
import matplotlib
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

matplotlib.style.use("ggplot")

Load the data:

In [None]:
!wget -q https://edunet.kea.su/repo/EduNet-web_dependencies/datasets/feature_engineering_data.csv

In [None]:
df = pd.read_csv("feature_engineering_data.csv")  # file downloaded by wget into the working directory
df

In [None]:
df = pd.get_dummies(df)  # to One-Hot Encoding
x = df.drop(columns=["target"])
y = df["target"]


# We make a 80/20% train/test split of the data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# Evaluate accuracy on the test set
print("Accuracy of the model = %.2f" % model.score(x_test, y_test))

In [None]:
sns.pairplot(
    df,
    hue="target",
)
plt.show()

In [None]:
df["new_feature_1"] = df["feature_3"] ** 2
df["new_feature_2"] = df["feature_5"] ** 2

In [None]:
df = pd.get_dummies(df)  # to One-Hot Encoding

x = df.drop(columns=["target"])
y = df["target"]


# We make a 80/20% train/test split of the data
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)


scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)

# Evaluate accuracy on the test set
print("Accuracy of the model = %.2f" % model.score(x_test, y_test))