# Laboratory Work No. 3

## Task

Classify a chosen dataset using linear regression and logistic regression.

Importing libraries

In [67]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error
from sklearn.linear_model import LogisticRegression, LinearRegression

Loading the dataset

In [68]:
df = pd.read_csv('cars.csv', encoding='utf-8')
print(df.head(5))

  manufacturer_name model_name transmission   color  odometer_value  \
0            Subaru    Outback    automatic  silver          190000   
1            Subaru    Outback    automatic    blue          290000   
2            Subaru   Forester    automatic     red          402000   
3            Subaru    Impreza   mechanical    blue           10000   
4            Subaru     Legacy    automatic   black          280000   

   year_produced engine_fuel  engine_has_gas engine_type  engine_capacity  \
0           2010    gasoline           False    gasoline              2.5   
1           2002    gasoline           False    gasoline              3.0   
2           2001    gasoline           False    gasoline              2.5   
3           1999    gasoline           False    gasoline              3.0   
4           2001    gasoline           False    gasoline              2.5   

   ... feature_1  feature_2 feature_3 feature_4  feature_5  feature_6  \
0  ...      True       True      True

Removing unneeded columns from the dataset

In [69]:
columns_to_drop = ['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9', 'duration_listed']
df = df.drop(columns=columns_to_drop)
print(df.head(5))

  manufacturer_name model_name transmission   color  odometer_value  \
0            Subaru    Outback    automatic  silver          190000   
1            Subaru    Outback    automatic    blue          290000   
2            Subaru   Forester    automatic     red          402000   
3            Subaru    Impreza   mechanical    blue           10000   
4            Subaru     Legacy    automatic   black          280000   

   year_produced engine_fuel  engine_has_gas engine_type  engine_capacity  \
0           2010    gasoline           False    gasoline              2.5   
1           2002    gasoline           False    gasoline              3.0   
2           2001    gasoline           False    gasoline              2.5   
3           1999    gasoline           False    gasoline              3.0   
4           2001    gasoline           False    gasoline              2.5   

   body_type  has_warranty  state drivetrain  price_usd  is_exchangeable  \
0  universal         False  owned 

Removing rows with missing values

In [70]:
df.dropna(inplace=True)
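Before dropping rows it can help to see how many values are actually missing. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, not the cars data) to show the per-column count and the effect of `dropna`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the cars data (values are made up for illustration).
toy = pd.DataFrame({
    "engine_capacity": [2.5, np.nan, 3.0],
    "color": ["silver", "blue", None],
})

# Number of missing values per column, before anything is dropped.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Without inplace=True, dropna() returns a new frame and leaves the original intact.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```

If only a few columns contain gaps, the `subset` argument of `dropna` restricts the check to those columns.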

Data preprocessing

In [71]:
# Encode categorical features as integer codes
label_encoders = {}
for column in df.select_dtypes(include=['object']).columns:
    label_encoders[column] = LabelEncoder()
    df[column] = label_encoders[column].fit_transform(df[column])
    
print(df.head(5))

   manufacturer_name  model_name  transmission  color  odometer_value  \
0                 45         763             0      8          190000   
1                 45         763             0      1          290000   
2                 45         519             0      7          402000   
3                 45         609             1      1           10000   
4                 45         664             0      0          280000   

   year_produced  engine_fuel  engine_has_gas  engine_type  engine_capacity  \
0           2010            2           False            1              2.5   
1           2002            2           False            1              3.0   
2           2001            2           False            1              2.5   
3           1999            2           False            1              3.0   
4           2001            2           False            1              2.5   

   body_type  has_warranty  state  drivetrain  price_usd  is_exchangeable  \
0        
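To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does to a single column. The color list is made up for illustration; the codes depend on the set of values present, so they need not match the codes in the notebook output.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up sample of a categorical column.
colors = ["silver", "blue", "red", "blue", "black"]

encoder = LabelEncoder()
codes = encoder.fit_transform(colors)

# Classes are stored in sorted order; each value maps to its index in classes_.
print(list(encoder.classes_))   # ['black', 'blue', 'red', 'silver']
print(list(codes))              # [3, 1, 2, 1, 0]

# The mapping is reversible, which is why the fitted encoders are worth keeping.
restored = list(encoder.inverse_transform(codes))
print(restored)
```

Integer codes impose an arbitrary order on the categories, which linear models treat as a numeric scale. One-hot encoding (for example `OneHotEncoder`) is a common alternative when that order is meaningless.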

In [72]:
# Scale numeric features to zero mean and unit variance
scaler = StandardScaler()
numeric_features = df.select_dtypes(include=['int32', 'int64', 'float32', 'float64']).columns
df[numeric_features] = scaler.fit_transform(df[numeric_features])

print(df.head(5))

   manufacturer_name  model_name  transmission     color  odometer_value  \
0           1.089297    0.597314     -1.410285  0.978100       -0.432979   
1           1.089297    0.597314     -1.410285 -0.969685        0.302004   
2           1.089297   -0.149960     -1.410285  0.699845        1.125184   
3           1.089297    0.125674      0.709076 -0.969685       -1.755946   
4           1.089297    0.294117     -1.410285 -1.247940        0.228505   

   year_produced  engine_fuel  engine_has_gas  engine_type  engine_capacity  \
0       0.875318     0.721021           False     0.708498         0.662782   
1      -0.116665     0.721021           False     0.708498         1.407751   
2      -0.240663     0.721021           False     0.708498         0.662782   
3      -0.488659     0.721021           False     0.708498         1.407751   
4      -0.240663     0.721021           False     0.708498         0.662782   

   body_type  has_warranty     state  drivetrain  price_usd  is_exch
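The scaling step replaces each numeric value with its z-score, (x - mean) / std. A small sketch on five made-up odometer readings checks that `StandardScaler` matches the manual formula. The results differ from the notebook output because the mean and standard deviation here come from five rows only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Five made-up odometer readings as a single-column feature matrix.
odometer = np.array([[190000.0], [290000.0], [402000.0], [10000.0], [280000.0]])

scaled = StandardScaler().fit_transform(odometer)

# Manual z-score; StandardScaler uses the population standard deviation (ddof=0),
# which is also NumPy's default.
manual = (odometer - odometer.mean(axis=0)) / odometer.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.ravel())
```

Boolean columns are not selected by the integer and float dtype filter used in the notebook, which is why `engine_has_gas` still prints as `False` after scaling.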

Splitting the data into training and test sets

In [73]:
X = df.drop('feature_0', axis=1)
Y = df['feature_0']
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
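In the notebook the scaler is fit on the whole frame before the split, so statistics from the test rows influence the scaling of the training rows. One possible way to avoid this is to put the scaler into a pipeline, so that it is fit on training data only. The sketch below uses synthetic data from `make_classification` as a stand-in for the cars dataset. All names in it (`X_demo`, `y_demo`, `pipe`) are illustrative, and `stratify` is an optional addition that keeps the class ratio equal in both parts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the real dataset (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Split first; stratify keeps the share of each class similar in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# The scaler inside the pipeline is fit on X_tr only and then applied to X_te.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print("Test accuracy:", test_accuracy)
```

A pipeline can also be passed to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.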

Training the linear regression model

In [74]:
linear_model = LinearRegression()

Defining the parameter grid

In [75]:
param_grid_linear = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'positive': [True, False],
}

Searching for the best parameters with cross-validation

In [76]:
grid_search_linear = GridSearchCV(linear_model, param_grid_linear, cv=5)
grid_search_linear.fit(X_train, Y_train)
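When no `scoring` argument is given, `GridSearchCV` ranks candidates by the estimator's own `score` method, which for `LinearRegression` is R². Since the notebook later reports mean squared error, a search that optimizes the same quantity might look like the sketch below. It runs on synthetic data from `make_regression`, and all names are illustrative. `copy_X` is left out of the grid because it only controls whether the input array is copied and does not change the fitted model.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic regression data as a stand-in (assumption for illustration).
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

search = GridSearchCV(
    LinearRegression(),
    {"fit_intercept": [True, False], "positive": [True, False]},
    cv=5,
    scoring="neg_mean_squared_error",  # higher is better, hence the negated MSE
)
search.fit(X_reg, y_reg)

print("Best parameters:", search.best_params_)
print("Cross-validated MSE:", -search.best_score_)
```

The full table of per-candidate results is available in `search.cv_results_`.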

Retrieving the best parameters

In [77]:
best_params_linear = grid_search_linear.best_params_
print("Best parameters:", best_params_linear)

Best parameters: {'copy_X': True, 'fit_intercept': True, 'positive': False}


Training the model with the best parameters

In [78]:
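# With the default refit=True, GridSearchCV has already refit best_estimator_
# on the full training set, so the explicit fit below is redundant but harmless.
best_linear_model = grid_search_linear.best_estimator_
best_linear_model.fit(X_train, Y_train)

Predicting on the test set

In [79]:
Y_pred = best_linear_model.predict(X_test)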

Evaluating the model on the test set

In [81]:
mse_linear = mean_squared_error(Y_test, Y_pred)
print("Model mean squared error:", mse_linear)

Model mean squared error: 0.1351065296591853
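Linear regression returns continuous scores rather than class labels, so the mean squared error above is not directly comparable with a classification accuracy. One common way to use the regressor as a classifier is to threshold its output at 0.5. The notebook does not do this, so the sketch below is only an illustration on made-up one-dimensional data, assuming a boolean target like `feature_0`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score

# Made-up data: the class switches from False to True as x grows.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([False, False, False, True, True, True])

# Fit on the target cast to 0.0 / 1.0 so the regression is explicit about its inputs.
reg = LinearRegression().fit(X_toy, y_toy.astype(float))

# Continuous scores; values at or above 0.5 are read as the positive class.
scores = reg.predict(X_toy)
labels = scores >= 0.5

toy_accuracy = accuracy_score(y_toy, labels)
print(scores.round(3))
print("Accuracy after thresholding:", toy_accuracy)
```

The same thresholding applied to the linear model's test predictions would give an accuracy that can be set side by side with the logistic regression result.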


Training the logistic regression model

In [83]:
logistic_model = LogisticRegression()

Defining the parameter grid

In [103]:
param_grid = {
    'C': [0.1, 1, 10],
    'penalty': ['l2'],
    'solver': ['lbfgs', 'liblinear'],
}
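In this grid, `C` is the inverse of the regularization strength: smaller values penalize large coefficients more heavily. A rough sketch on synthetic data (all names are illustrative) shows the effect on the coefficient norm. It relies on the default L2 penalty and raises `max_iter` only to keep the solver from stopping early.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption for illustration).
X_c, y_c = make_classification(n_samples=300, n_features=6, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    clf = LogisticRegression(C=C, solver="lbfgs", max_iter=1000).fit(X_c, y_c)
    # Euclidean norm of the learned weight vector for this value of C.
    coef_norms[C] = float(np.linalg.norm(clf.coef_))

for C, norm in coef_norms.items():
    print(f"C={C}: coefficient norm = {norm:.3f}")
```

Stronger regularization (small `C`) shrinks the weights toward zero, which is why `C` is usually searched over several orders of magnitude.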

Searching for the best parameters with cross-validation

In [104]:
grid_search = GridSearchCV(logistic_model, param_grid, cv=5)
grid_search.fit(X_train, Y_train)

Retrieving the best parameters

In [105]:
best_params = grid_search.best_params_
print("Best parameters:", best_params)

Best parameters: {'C': 1, 'penalty': 'l2', 'solver': 'lbfgs'}


Training the model with the best parameters

In [106]:
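# With the default refit=True, GridSearchCV has already refit best_estimator_
# on the full training set, so the explicit fit below is redundant but harmless.
best_logistic_model = grid_search.best_estimator_
best_logistic_model.fit(X_train, Y_train)

Predicting on the test set

In [107]:
Y_pred = best_logistic_model.predict(X_test)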

Evaluating the model on the test set

In [108]:
accuracy = accuracy_score(Y_test, Y_pred)
print("Model accuracy:", accuracy)

Model accuracy: 0.8150551589876703


Classification report for the logistic regression model

In [109]:
report_log = classification_report(Y_test, Y_pred)
print(report_log)

              precision    recall  f1-score   support

       False       0.83      0.96      0.89      5958
        True       0.71      0.31      0.43      1747

    accuracy                           0.82      7705
   macro avg       0.77      0.64      0.66      7705
weighted avg       0.80      0.82      0.79      7705
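The report shows why accuracy alone can mislead here: 5958 of the 7705 test rows are `False`, so always predicting `False` would already score about 0.77, while recall for `True` is only 0.31. The sketch below shows how precision and recall follow from the confusion matrix. The labels are made up for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels: 6 negatives and 4 positives.
y_true = np.array([False] * 6 + [True] * 4)
y_hat = np.array([False, False, False, False, False, True,
                  True, False, False, False])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = (int(v) for v in confusion_matrix(y_true, y_hat).ravel())

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found

print("tn, fp, fn, tp:", tn, fp, fn, tp)
print("precision:", precision, "recall:", recall)
print("accuracy:", (tp + tn) / len(y_true))
```

In this toy case accuracy is 0.6 although only one of four positives is found, which mirrors the pattern in the report. Options such as `class_weight='balanced'` in `LogisticRegression` or a lower decision threshold are possible ways to trade some precision for recall.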

