# **Lab 1. Processing learning analytics data with machine learning methods**

# Assignment description

**Goal:** build skills in processing learning analytics data with machine learning methods. <br>
**Tasks:**


1.   Study the provided dataset.
2.   Collect the data for your variant.
3.   Preprocess the data.
4.   Study machine learning algorithms for classification.
5.   Justify which algorithms require feature binarization and normalization.
6.   Build classification models.
7.   Compute quality metrics for the models.
8.   Tune the parameters to reach an accuracy of at least 0.65.
9.   Draw conclusions from the work.

**Submission format:** a Google Colab notebook with analytical commentary.

In [1]:
# Compute the variant
student = 2 # enter your number in the group list
print('My variant, student numbers:', student, student*2, student*3)

My variant, student numbers: 2 4 6


**Note**

If the dataset has no files for the variant number you obtained, you may choose any other. What matters is that the resulting data for the classification algorithms contain at least 1500 rows.

In [2]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Data collection

In [3]:
feature_names = ['session', 'student_Id', 'exercise', 'activity', 'start_time', 'end_time', 'idle_time', 'mouse_wheel', 
                             'mouse_wheel_click', 'mouse_click_left', 'mouse_click_right', 'mouse_movement', 'keystroke']

In [4]:
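def concat_student_data(filenames: list[str]):
    """Read the given session files from ./data and stack them into one DataFrame."""
    files = []
    for file in filenames:
        # Column names are supplied explicitly via feature_names
        files.append(pd.read_csv(f'./data/{file}', names=feature_names))

    return pd.concat(files)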

In [5]:
N_SESSIONS = 6
student_2 = concat_student_data([f'{i}_2' for i in range(1, N_SESSIONS)])
student_4 = concat_student_data([f'{i}_4' for i in range(1, N_SESSIONS)])
student_6 = concat_student_data([f'{i}_6' for i in range(2, N_SESSIONS)]) # 1st session missing
students = pd.concat([student_2, student_4, student_6])
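The cell above hard-codes which session files exist for each student. As a hedged sketch, a loader could instead skip missing files and report them. `load_student` and `COLUMNS` are hypothetical names introduced here, and the `<session>_<student>` file naming is assumed from the calls above. The demo writes two tiny files to a temporary directory so the block runs on its own.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd

# Same column names as the notebook's feature_names list.
COLUMNS = ['session', 'student_Id', 'exercise', 'activity', 'start_time', 'end_time',
           'idle_time', 'mouse_wheel', 'mouse_wheel_click', 'mouse_click_left',
           'mouse_click_right', 'mouse_movement', 'keystroke']


def load_student(student_id, sessions, data_dir='./data'):
    """Concatenate the session files that exist for one student.

    Hypothetical helper: assumes files are named '<session>_<student>'.
    """
    frames = []
    for session in sessions:
        path = Path(data_dir) / f'{session}_{student_id}'
        if not path.exists():
            print(f'skipping missing file: {path}')
            continue
        frames.append(pd.read_csv(path, names=COLUMNS))
    if not frames:
        raise FileNotFoundError(f'no session files found for student {student_id}')
    # ignore_index gives the combined frame a unique 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Demo on throwaway files: sessions 1 and 3 exist, session 2 is missing.
with tempfile.TemporaryDirectory() as tmp:
    row = '1,2,Es,Other,2.10.2014 11:35:32,2.10.2014 11:35:34,16,0,0,0,0,37,0\n'
    for session in (1, 3):
        with open(os.path.join(tmp, f'{session}_2'), 'w') as fh:
            fh.write(row)
    demo = load_student(2, range(1, 4), data_dir=tmp)

print(demo.shape)
```

With a helper like this, the 1500-row requirement from the note can be checked directly on the combined frame, for example with `len(students) >= 1500`.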

In [33]:
student_2.head()

Unnamed: 0,session,student_Id,exercise,activity,start_time,end_time,idle_time,mouse_wheel,mouse_wheel_click,mouse_click_left,mouse_click_right,mouse_movement,keystroke
0,1,2,Es,Other,2.10.2014 11:35:32,2.10.2014 11:35:34,16,0,0,0,0,37,0
1,1,2,Es_1_1,Deeds_Es_1_1,2.10.2014 11:35:35,2.10.2014 11:38:23,8247889,0,0,2,0,45,20
2,1,2,Es_1_1,Other,2.10.2014 11:38:24,2.10.2014 11:38:24,0,0,0,0,0,0,0
3,1,2,Es_1_1,Blank,2.10.2014 11:38:25,2.10.2014 11:38:25,0,0,0,0,0,0,0
4,1,2,Es_1_1,Study_Es_1_1,2.10.2014 11:38:26,2.10.2014 11:38:36,13578,0,0,0,0,0,0


In [6]:
print(student_2.shape)
print(student_4.shape)
print(student_6.shape)
print(students.shape)

(1484, 13)
(2880, 13)
(2233, 13)
(6597, 13)


Check the most frequent activity types across the three students

In [7]:
students.activity.value_counts()[:5]

 Other         1296
 Blank          844
 Diagram        731
 Aulaweb        306
 Properties     249
Name: activity, dtype: int64
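The labels in the output above carry a leading space (`' Other'`, `' Diagram'`), which is easy to miss when filtering. A minimal sketch of normalizing such labels with `str.strip()`, shown on a toy frame rather than the real data:

```python
import pandas as pd

# Toy frame reproducing the leading-space labels seen in the output above.
df = pd.DataFrame({'activity': [' Other', ' Diagram', ' Other', ' Blank']})

# Strip surrounding whitespace so labels can be matched without the space.
df['activity'] = df['activity'].str.strip()

counts = df['activity'].value_counts()
print(counts)
```

If the notebook's data were cleaned this way, the class names used for filtering would have to be written without the leading space as well.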

In [8]:
students.head()

Unnamed: 0,session,student_Id,exercise,activity,start_time,end_time,idle_time,mouse_wheel,mouse_wheel_click,mouse_click_left,mouse_click_right,mouse_movement,keystroke
0,1,2,Es,Other,2.10.2014 11:35:32,2.10.2014 11:35:34,16,0,0,0,0,37,0
1,1,2,Es_1_1,Deeds_Es_1_1,2.10.2014 11:35:35,2.10.2014 11:38:23,8247889,0,0,2,0,45,20
2,1,2,Es_1_1,Other,2.10.2014 11:38:24,2.10.2014 11:38:24,0,0,0,0,0,0,0
3,1,2,Es_1_1,Blank,2.10.2014 11:38:25,2.10.2014 11:38:25,0,0,0,0,0,0,0
4,1,2,Es_1_1,Study_Es_1_1,2.10.2014 11:38:26,2.10.2014 11:38:36,13578,0,0,0,0,0,0


Filter the data to two of the most frequent activities, Other and Diagram (Blank, the second most frequent, is left out), and take the numeric features for training the models

In [9]:
CLASS_NAMES = [' Other', ' Diagram']
TARGET_FEATURE = 'activity'
features = ['idle_time', 'mouse_wheel', 'mouse_wheel_click', 'mouse_click_left', 'mouse_click_right', 'mouse_movement', 'keystroke']

def get_samples(df):
	return df.loc[df[TARGET_FEATURE].isin(CLASS_NAMES)]

def get_X_y(df):
	return [df[features], df[TARGET_FEATURE]]

In [10]:
students = get_samples(students)
X_students, y_students = get_X_y(students)

In [11]:
students.shape

(2027, 13)

Split the data into training and test sets

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X_students, y_students, test_size = 0.4, random_state = 1)
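The two classes are imbalanced (1296 Other rows against 731 Diagram rows), and a plain random split does not guarantee the same class ratio in both parts. As a sketch, passing `stratify` to `train_test_split` preserves the ratio. The labels below are synthetic and only mimic the class counts; the feature column is a placeholder.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the filtered data:
# 1296 'Other' rows and 731 'Diagram' rows.
y = np.array(['Other'] * 1296 + ['Diagram'] * 731)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1, stratify=y)

# Share of the minority class in each part of the split.
train_share = np.mean(y_train == 'Diagram')
test_share = np.mean(y_test == 'Diagram')
print(round(train_share, 3), round(test_share, 3))
```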

Standardize the features

In [13]:
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
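One of the tasks asks which algorithms need normalization. Distance-based KNN is sensitive to feature scale, while a decision tree splits on thresholds and is essentially unaffected by standardization. The sketch below illustrates this on synthetic data (not the lab data): one small-scale informative feature plus one large-scale noise feature, loosely imitating a column such as `idle_time`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 600
y = rng.integers(0, 2, size=n)
signal = y + rng.normal(0.0, 0.5, size=n)   # small scale, carries the class signal
noise = rng.normal(0.0, 1e6, size=n)        # huge scale, unrelated to the class
X = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=1)

# KNN without and with standardization (the scaler is fitted on the training part only).
knn_raw = KNeighborsClassifier().fit(X_tr, y_tr)
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(X_tr, y_tr)
acc_knn_raw = knn_raw.score(X_te, y_te)
acc_knn_scaled = knn_scaled.score(X_te, y_te)

# The same comparison for a decision tree with a fixed seed.
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = make_pipeline(
    StandardScaler(), DecisionTreeClassifier(random_state=0)).fit(X_tr, y_tr)
tree_agreement = np.mean(tree_raw.predict(X_te) == tree_scaled.predict(X_te))

print('KNN raw vs scaled accuracy:', acc_knn_raw, acc_knn_scaled)
print('Share of identical tree predictions:', tree_agreement)
```

Logistic regression falls in between: its regularization and solver convergence depend on feature scale, so standardizing is generally advisable for it too.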

In [14]:
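# Encode the target as 0/1 for inspection; the models below are still trained on the string labels in y_train
students = students.assign(activity=students['activity'].map({' Other': 0, ' Diagram': 1}))
students.head()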

Unnamed: 0,session,student_Id,exercise,activity,start_time,end_time,idle_time,mouse_wheel,mouse_wheel_click,mouse_click_left,mouse_click_right,mouse_movement,keystroke
0,1,2,Es,0,2.10.2014 11:35:32,2.10.2014 11:35:34,16,0,0,0,0,37,0
2,1,2,Es_1_1,0,2.10.2014 11:38:24,2.10.2014 11:38:24,0,0,0,0,0,0,0
13,1,2,Es_1_1,0,2.10.2014 11:44:5,2.10.2014 11:44:33,82512,0,0,0,0,0,32
15,1,2,Es_1_1,1,2.10.2014 11:47:7,2.10.2014 11:50:24,11799085,0,0,0,0,0,0
17,1,2,Es_1_1,1,2.10.2014 11:50:26,2.10.2014 11:51:25,846045,0,0,0,0,0,0


### Logistic regression

In [15]:
log_model = LogisticRegression()
log_model.fit(X_train, y_train)

log_pred = log_model.predict(X_test)

print('Accuracy Metrics for Logistic Regression:', round(accuracy_score(y_test, log_pred), 3), '\n')
print('Confusion matrix \n', confusion_matrix(y_test, log_pred), '\n')
print('Classification report \n', classification_report(y_test, log_pred))

Accuracy Metrics for Logistic Regression: 0.747 

Confusion matrix 
 [[107 184]
 [ 21 499]] 

Classification report 
               precision    recall  f1-score   support

     Diagram       0.84      0.37      0.51       291
       Other       0.73      0.96      0.83       520

    accuracy                           0.75       811
   macro avg       0.78      0.66      0.67       811
weighted avg       0.77      0.75      0.72       811
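The figures in the report above can be reproduced by hand from the confusion matrix, which also makes it easy to compare accuracy with a trivial majority-class baseline. The sketch below uses only the numbers printed above; rows are true classes and columns are predicted classes, in the order (Diagram, Other).

```python
# Confusion matrix from the logistic regression cell above.
cm = [[107, 184],
      [21, 499]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision divides by the column sum (predicted), recall by the row sum (true).
precision_diagram = cm[0][0] / (cm[0][0] + cm[1][0])
recall_diagram = cm[0][0] / (cm[0][0] + cm[0][1])
precision_other = cm[1][1] / (cm[0][1] + cm[1][1])
recall_other = cm[1][1] / (cm[1][0] + cm[1][1])

# Accuracy of a trivial classifier that always predicts the majority class.
baseline = max(sum(row) for row in cm) / total

print(f'accuracy={accuracy:.3f}, baseline={baseline:.3f}')
print(f'Diagram: precision={precision_diagram:.2f}, recall={recall_diagram:.2f}')
print(f'Other:   precision={precision_other:.2f}, recall={recall_other:.2f}')
```

Always predicting Other would already score about 0.641 on this test set, so the 0.65 accuracy target sits only slightly above the baseline, and the low Diagram recall (0.37) is the more informative number here.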



### KNN

In [16]:
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

print('Accuracy Metrics for KNN:',
      round(accuracy_score(y_test, knn_pred), 3), '\n')
print('Confusion_matrix \n', 
      confusion_matrix(y_test, knn_pred), '\n')
print(classification_report(y_test, knn_pred))

Accuracy Metrics for KNN: 0.73 

Confusion_matrix 
 [[167 124]
 [ 95 425]] 

              precision    recall  f1-score   support

     Diagram       0.64      0.57      0.60       291
       Other       0.77      0.82      0.80       520

    accuracy                           0.73       811
   macro avg       0.71      0.70      0.70       811
weighted avg       0.73      0.73      0.73       811



### Decision tree

In [17]:
cart_model = DecisionTreeClassifier()
cart_model.fit(X_train, y_train)
cart_pred = cart_model.predict(X_test)

print('Accuracy Metrics for Decision Tree Classifier:',
      round(accuracy_score(y_test, cart_pred), 3), '\n')
print('Confusion_matrix \n', 
      confusion_matrix(y_test, cart_pred), '\n')
print(classification_report(y_test, cart_pred))

Accuracy Metrics for Decision Tree Classifier: 0.684 

Confusion_matrix 
 [[168 123]
 [133 387]] 

              precision    recall  f1-score   support

     Diagram       0.56      0.58      0.57       291
       Other       0.76      0.74      0.75       520

    accuracy                           0.68       811
   macro avg       0.66      0.66      0.66       811
weighted avg       0.69      0.68      0.69       811



### Random Forest

In [18]:
rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)
rfc_pred = rfc_model.predict(X_test)

print('Accuracy Metrics for Random Forest Classifier:',
      round(accuracy_score(y_test, rfc_pred), 3), '\n')
print('Confusion_matrix \n', 
      confusion_matrix(y_test, rfc_pred), '\n')
print(classification_report(y_test, rfc_pred))

Accuracy Metrics for Random Forest Classifier: 0.719 

Confusion_matrix 
 [[164 127]
 [101 419]] 

              precision    recall  f1-score   support

     Diagram       0.62      0.56      0.59       291
       Other       0.77      0.81      0.79       520

    accuracy                           0.72       811
   macro avg       0.69      0.68      0.69       811
weighted avg       0.71      0.72      0.72       811
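The decision tree and random forest above are created without `random_state`, so their scores can change slightly from run to run. A small sketch on synthetic data (not the lab data) showing that fixing the seed makes the fitted forest reproducible; `fit_predict` is a helper introduced only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with seven features, like the notebook's feature list.
X, y = make_classification(n_samples=300, n_features=7, random_state=0)


def fit_predict(seed):
    """Fit a forest with the given seed and return class probabilities."""
    model = RandomForestClassifier(n_estimators=50, random_state=seed)
    return model.fit(X, y).predict_proba(X)


# Two fits with the same seed give identical probabilities.
same_seed_equal = bool((fit_predict(42) == fit_predict(42)).all())
print(same_seed_equal)
```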



All algorithms give reasonable results with default parameters: logistic regression has the highest accuracy (0.747) and the decision tree the lowest (0.684)

### Build a model for each student

Check the most frequent activity types: the same five activities lead for all three students

In [19]:
student_2.activity.value_counts()[:5]

 Other         220
 Diagram       167
 Blank         144
 Properties     71
 Aulaweb        55
Name: activity, dtype: int64

In [20]:
student_4.activity.value_counts()[:5]

 Other         705
 Blank         346
 Diagram       242
 Aulaweb       145
 Properties     76
Name: activity, dtype: int64

In [21]:
student_6.activity.value_counts()[:5]

 Other         371
 Blank         354
 Diagram       322
 Aulaweb       106
 Properties    102
Name: activity, dtype: int64

In [22]:
def pipeline(student_df):
    # Keep only the two target activities for this student
    student = get_samples(student_df)

    X_student, y_student = get_X_y(student)
    # Note: unlike the combined run above, the features are not standardized here
    X_train, X_test, y_train, y_test = train_test_split(X_student, y_student, test_size = 0.4, random_state = 1)

    log_model = LogisticRegression()
    log_model.fit(X_train, y_train)
    log_pred = log_model.predict(X_test)

    print('Accuracy Metrics for Logistic Regression:', round(accuracy_score(y_test, log_pred), 3), '\n')
    print('Confusion matrix \n', confusion_matrix(y_test, log_pred), '\n')
    print('Classification report \n', classification_report(y_test, log_pred))

    knn_model = KNeighborsClassifier()
    knn_model.fit(X_train, y_train)
    knn_pred = knn_model.predict(X_test)

    print('Accuracy Metrics for KNN:',
          round(accuracy_score(y_test, knn_pred), 3), '\n')
    print('Confusion_matrix \n',
          confusion_matrix(y_test, knn_pred), '\n')
    print(classification_report(y_test, knn_pred))

    cart_model = DecisionTreeClassifier()
    cart_model.fit(X_train, y_train)
    cart_pred = cart_model.predict(X_test)

    print('Accuracy Metrics for Decision Tree Classifier:',
          round(accuracy_score(y_test, cart_pred), 3), '\n')
    print('Confusion_matrix \n',
          confusion_matrix(y_test, cart_pred), '\n')
    print(classification_report(y_test, cart_pred))

    rfc_model = RandomForestClassifier()
    rfc_model.fit(X_train, y_train)
    rfc_pred = rfc_model.predict(X_test)

    print('Accuracy Metrics for Random Forest Classifier:',
          round(accuracy_score(y_test, rfc_pred), 3), '\n')
    print('Confusion_matrix \n',
          confusion_matrix(y_test, rfc_pred), '\n')
    print(classification_report(y_test, rfc_pred))


In [23]:
pipeline(student_2)

Accuracy Metrics for Logistic Regression: 0.658 

Confusion matrix 
 [[21 46]
 [ 7 81]] 

Classification report 
               precision    recall  f1-score   support

     Diagram       0.75      0.31      0.44        67
       Other       0.64      0.92      0.75        88

    accuracy                           0.66       155
   macro avg       0.69      0.62      0.60       155
weighted avg       0.69      0.66      0.62       155

Accuracy Metrics for KNN: 0.619 

Confusion_matrix 
 [[30 37]
 [22 66]] 

              precision    recall  f1-score   support

     Diagram       0.58      0.45      0.50        67
       Other       0.64      0.75      0.69        88

    accuracy                           0.62       155
   macro avg       0.61      0.60      0.60       155
weighted avg       0.61      0.62      0.61       155

Accuracy Metrics for Decision Tree Classifier: 0.613 

Confusion_matrix 
 [[33 34]
 [26 62]] 

              precision    recall  f1-score   support

     Dia

In [24]:
pipeline(student_4)

Accuracy Metrics for Logistic Regression: 0.33 

Confusion matrix 
 [[ 92   8]
 [246  33]] 

Classification report 
               precision    recall  f1-score   support

     Diagram       0.27      0.92      0.42       100
       Other       0.80      0.12      0.21       279

    accuracy                           0.33       379
   macro avg       0.54      0.52      0.31       379
weighted avg       0.66      0.33      0.26       379

Accuracy Metrics for KNN: 0.763 

Confusion_matrix 
 [[ 39  61]
 [ 29 250]] 

              precision    recall  f1-score   support

     Diagram       0.57      0.39      0.46       100
       Other       0.80      0.90      0.85       279

    accuracy                           0.76       379
   macro avg       0.69      0.64      0.66       379
weighted avg       0.74      0.76      0.75       379

Accuracy Metrics for Decision Tree Classifier: 0.755 

Confusion_matrix 
 [[ 52  48]
 [ 45 234]] 

              precision    recall  f1-score   suppor

In [25]:
pipeline(student_6)

Accuracy Metrics for Logistic Regression: 0.468 

Confusion matrix 
 [[130   1]
 [147   0]] 

Classification report 
               precision    recall  f1-score   support

     Diagram       0.47      0.99      0.64       131
       Other       0.00      0.00      0.00       147

    accuracy                           0.47       278
   macro avg       0.23      0.50      0.32       278
weighted avg       0.22      0.47      0.30       278

Accuracy Metrics for KNN: 0.612 

Confusion_matrix 
 [[71 60]
 [48 99]] 

              precision    recall  f1-score   support

     Diagram       0.60      0.54      0.57       131
       Other       0.62      0.67      0.65       147

    accuracy                           0.61       278
   macro avg       0.61      0.61      0.61       278
weighted avg       0.61      0.61      0.61       278

Accuracy Metrics for Decision Tree Classifier: 0.662 

Confusion_matrix 
 [[ 80  51]
 [ 43 104]] 

              precision    recall  f1-score   support



##### Summary: on the smaller per-student datasets logistic regression performs noticeably worse than the other algorithms for students 4 and 6 (accuracy 0.33 and 0.468)
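The per-student `pipeline` function fits every model on unscaled features, whereas the combined run used `StandardScaler`. To help separate the effect of sample size from the effect of missing scaling, the comparison could be repeated with the scaler inside each scale-sensitive model. The following is a sketch under that assumption, demonstrated on synthetic data; `evaluate_models` is a hypothetical helper, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier


def evaluate_models(X, y, test_size=0.4, random_state=1):
    """Return test accuracy per model; scaling is fitted on the training split only."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=random_state)
    models = {
        'logistic_regression': make_pipeline(StandardScaler(), LogisticRegression()),
        'knn': make_pipeline(StandardScaler(), KNeighborsClassifier()),
        'decision_tree': DecisionTreeClassifier(random_state=random_state),
        'random_forest': RandomForestClassifier(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[name] = round(accuracy_score(y_te, model.predict(X_te)), 3)
    return scores


# Synthetic stand-in data with seven features.
X_demo, y_demo = make_classification(n_samples=400, n_features=7, random_state=0)
scores = evaluate_models(X_demo, y_demo)
print(scores)
```

In the notebook this could be called as `evaluate_models(*get_X_y(get_samples(student_4)))`; whether the logistic regression scores recover with scaling would have to be checked on the actual data.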

Next, we tune the hyperparameters of the models trained on all three students using cross-validation

In [26]:
from sklearn.model_selection import GridSearchCV

For logistic regression we use the liblinear solver, which suits small datasets, and tune the regularization strength C and the penalty type

In [27]:
model = LogisticRegression(solver='liblinear')
param = {'C': [0.1, 0.5, 1, 2],
        'penalty': ['l1', 'l2']
        }
grid = GridSearchCV(estimator=model, param_grid=param, cv=20, n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_

{'C': 2, 'penalty': 'l1'}
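The grid search above runs on features that were standardized once, before cross-validation. An alternative sketch places the scaler inside a `Pipeline`, so that it is refitted on the training part of every fold; parameter names then take the step prefix (`clf__C`). The data here are synthetic, the step names `scaler` and `clf` are arbitrary choices, and `cv=5` is used only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with seven features.
X, y = make_classification(n_samples=400, n_features=7, random_state=0)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='liblinear')),
])
# Same grid as in the notebook, addressed through the 'clf' step.
param_grid = {
    'clf__C': [0.1, 0.5, 1, 2],
    'clf__penalty': ['l1', 'l2'],
}
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)

# best_score_ is the mean cross-validated accuracy of the best combination.
print(grid.best_params_, round(grid.best_score_, 3))
```

Reading `best_score_` alongside `best_params_` shows how much the winning combination actually gains over the alternatives, which is useful when the differences are as small as in the logistic regression case.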

In [28]:
model = LogisticRegression(solver='liblinear', C=2, penalty='l1')
model.fit(X_train, y_train)
log_pred = model.predict(X_test)
print('Accuracy Metrics for Logistic Regression:', round(accuracy_score(y_test, log_pred), 3), '\n')
print('Confusion matrix \n', confusion_matrix(y_test, log_pred), '\n')
print('Classification report \n', classification_report(y_test, log_pred))

Accuracy Metrics for Logistic Regression: 0.746 

Confusion matrix 
 [[107 184]
 [ 22 498]] 

Classification report 
               precision    recall  f1-score   support

     Diagram       0.83      0.37      0.51       291
       Other       0.73      0.96      0.83       520

    accuracy                           0.75       811
   macro avg       0.78      0.66      0.67       811
weighted avg       0.77      0.75      0.71       811



The result is practically the same as with the default parameters (0.746 vs 0.747). For KNN we tune the number of neighbors:

In [29]:
model = KNeighborsClassifier()

param = {'n_neighbors': [30, 35, 40]}
grid = GridSearchCV(estimator=model, param_grid=param, cv=20, n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_

{'n_neighbors': 35}

In [30]:
knn_model = KNeighborsClassifier(n_neighbors=35)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)

print('Accuracy Metrics for KNN:',
      round(accuracy_score(y_test, knn_pred), 3), '\n')
print('Confusion_matrix \n',
      confusion_matrix(y_test, knn_pred), '\n')
print(classification_report(y_test, knn_pred))

Accuracy Metrics for KNN: 0.766 

Confusion_matrix 
 [[150 141]
 [ 49 471]] 

              precision    recall  f1-score   support

     Diagram       0.75      0.52      0.61       291
       Other       0.77      0.91      0.83       520

    accuracy                           0.77       811
   macro avg       0.76      0.71      0.72       811
weighted avg       0.76      0.77      0.75       811



Accuracy improves noticeably (from 0.73 to 0.766). For the random forest we tune the number of trees, the maximum tree depth, and the split quality criterion:

In [31]:
model = RandomForestClassifier()
param = {'n_estimators': [100, 500, 1000],
        'max_depth': [10, 50, 100],
        'criterion': ['gini', 'entropy', 'log_loss'],
        }
grid = GridSearchCV(estimator=model, param_grid=param, cv=20, n_jobs=-1)
grid.fit(X_train, y_train)
grid.best_params_

{'criterion': 'gini', 'max_depth': 10, 'n_estimators': 100}

In [32]:
rfc_model = RandomForestClassifier(**grid.best_params_)
rfc_model.fit(X_train, y_train)
rfc_pred = rfc_model.predict(X_test)

print('Accuracy Metrics for Random Forest Classifier:',
      round(accuracy_score(y_test, rfc_pred), 3), '\n')
print('Confusion_matrix \n', 
      confusion_matrix(y_test, rfc_pred), '\n')
print(classification_report(y_test, rfc_pred))

Accuracy Metrics for Random Forest Classifier: 0.75 

Confusion_matrix 
 [[136 155]
 [ 48 472]] 

              precision    recall  f1-score   support

     Diagram       0.74      0.47      0.57       291
       Other       0.75      0.91      0.82       520

    accuracy                           0.75       811
   macro avg       0.75      0.69      0.70       811
weighted avg       0.75      0.75      0.73       811



Conclusion: in this work the classification task was solved with four algorithms. On the smaller per-student datasets accuracy was mostly lower and less stable than on the combined data, most of all for logistic regression. The hyperparameter search did not improve logistic regression (0.746 vs 0.747). The KNN classifier improved from 0.73 to 0.766 when the number of neighbors was raised to 35. The random forest improved from 0.719 to 0.75; since the selected number of trees (100) equals the default, this gain appears to come from limiting the tree depth to 10, and possibly from run-to-run randomness, rather than from adding trees.