#### The goal of this competition is to predict the mean math exam score achieved by the students of each tutor in test.csv. Two datasets are provided: train.csv (features and the target variable) and test.csv (features only).

#### Evaluation metric: the coefficient of determination (R²)
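
For reference, the coefficient of determination in its usual form, written with assumed notation ($y_i$ for the true values, $\hat{y}_i$ for the predictions, $\bar{y}$ for the mean of the true values over $n$ objects). This is the standard definition and matches the `r_2` function defined later in the notebook:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

A value of 1 corresponds to a perfect prediction, and 0 to a model that is no better than always predicting $\bar{y}$.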

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

#### 1. Exploring the dataset

In [2]:
TRAIN_PATH = 'train.csv'
TEST_PATH = 'test.csv'
data = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)
data.head(10)

Unnamed: 0,Id,age,years_of_experience,lesson_price,qualification,physics,chemistry,biology,english,geography,history,mean_exam_points
0,0,40.0,0.0,1400.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,63.0
1,1,48.0,4.0,2850.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0,86.0
2,2,39.0,0.0,1200.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,53.0
3,3,46.0,5.0,1400.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,56.0
4,4,43.0,1.0,1500.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,59.0
5,5,33.0,4.0,1650.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,43.0
6,6,53.0,1.0,2100.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,72.0
7,7,60.0,3.0,1800.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,60.0
8,8,39.0,1.0,1200.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,57.0
9,9,49.0,5.0,1750.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,37.0


**Dataset description**

#### Tutor data:

* **Id** - tutor identifier
* **age** - age
* **years_of_experience** - years of teaching experience
* **lesson_price** - price of a lesson
* **qualification** - qualification level

**Subjects taught:**
* **physics** - physics
* **chemistry** - chemistry
* **biology** - biology
* **english** - English
* **geography** - geography
* **history** - history  

* **mean_exam_points** - mean exam score (the target variable)


#### Data types in the dataset:

In [3]:
data.dtypes

Id                       int64
age                    float64
years_of_experience    float64
lesson_price           float64
qualification          float64
physics                float64
chemistry              float64
biology                float64
english                float64
geography              float64
history                float64
mean_exam_points       float64
dtype: object

In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
Id                     10000 non-null int64
age                    10000 non-null float64
years_of_experience    10000 non-null float64
lesson_price           10000 non-null float64
qualification          10000 non-null float64
physics                10000 non-null float64
chemistry              10000 non-null float64
biology                10000 non-null float64
english                10000 non-null float64
geography              10000 non-null float64
history                10000 non-null float64
mean_exam_points       10000 non-null float64
dtypes: float64(11), int64(1)
memory usage: 937.6 KB


In [5]:
data.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,10000.0,4999.5,2886.89568,0.0,2499.75,4999.5,7499.25,9999.0
age,10000.0,45.878,8.043929,23.0,40.0,46.0,51.0,68.0
years_of_experience,10000.0,1.9868,1.772213,0.0,0.0,2.0,3.0,10.0
lesson_price,10000.0,1699.105,524.886654,200.0,1300.0,1500.0,2150.0,3950.0
qualification,10000.0,1.7195,0.792264,1.0,1.0,2.0,2.0,4.0
physics,10000.0,0.375,0.484147,0.0,0.0,0.0,1.0,1.0
chemistry,10000.0,0.1329,0.339484,0.0,0.0,0.0,0.0,1.0
biology,10000.0,0.1096,0.312406,0.0,0.0,0.0,0.0,1.0
english,10000.0,0.0537,0.225436,0.0,0.0,0.0,0.0,1.0
geography,10000.0,0.0321,0.176274,0.0,0.0,0.0,0.0,1.0


The output above shows that the training set has 10,000 rows and no missing values, and that the minimum and maximum of each feature fall within plausible ranges (no negative or NaN values).
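
The same checks can be made explicit instead of being read off the summary tables. The sketch below uses a small made-up frame (`toy`, a stand-in for the training data, with one deliberately missing value) to show the pattern; the column subset and the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    'age': [40.0, 48.0, 39.0],
    'lesson_price': [1400.0, 2850.0, np.nan],
    'physics': [1.0, 0.0, 0.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# For each column, whether it contains a negative value (NaN compares as False).
has_negative = (toy < 0).any()

print(missing_per_column)
print(has_negative)
```

Applied to `data`, both results should be all zeros / all `False` if the statement above holds.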

For comparison, here are the statistics of the test data:

In [6]:
test.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Id,10000.0,14999.5,2886.89568,10000.0,12499.75,14999.5,17499.25,19999.0
age,10000.0,45.9728,7.95628,23.0,41.0,46.0,51.0,68.0
years_of_experience,10000.0,1.98,1.783289,0.0,0.0,2.0,3.0,10.0
lesson_price,10000.0,1697.095,524.262621,200.0,1300.0,1500.0,2150.0,4050.0
qualification,10000.0,1.7094,0.793483,1.0,1.0,2.0,2.0,4.0
physics,10000.0,0.3813,0.48573,0.0,0.0,0.0,1.0,1.0
chemistry,10000.0,0.1235,0.329027,0.0,0.0,0.0,0.0,1.0
biology,10000.0,0.1201,0.325095,0.0,0.0,0.0,0.0,1.0
english,10000.0,0.056,0.229933,0.0,0.0,0.0,0.0,1.0
geography,10000.0,0.0314,0.174405,0.0,0.0,0.0,0.0,1.0


Let us see which features correlate most strongly with the target, the mean exam score:

In [7]:
data.corr()['mean_exam_points'].sort_values()

age                   -0.007646
history               -0.000113
Id                     0.004121
english                0.013174
geography              0.014401
chemistry              0.017825
biology                0.023022
physics                0.187726
years_of_experience    0.205417
lesson_price           0.721179
qualification          0.755963
mean_exam_points       1.000000
Name: mean_exam_points, dtype: float64

The target correlates most strongly with qualification (0.76) and lesson price (0.72), and much more weakly with years of experience (0.21) and teaching physics (0.19). The remaining features show almost no linear relationship with the mean exam score.
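
`DataFrame.corr()` computes the Pearson coefficient by default. As a minimal sketch of what that number is, the hypothetical helper `pearson` below computes it by hand for the first five rows of the `head()` output above. With only five rows the value is not expected to match the full-dataset figure.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two 1-D sequences (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    covariance_term = (x_centered * y_centered).sum()
    scale = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return covariance_term / scale

# qualification and mean_exam_points for rows 0-4 of the table shown earlier
qualification = [1.0, 3.0, 1.0, 1.0, 1.0]
points = [63.0, 86.0, 53.0, 56.0, 59.0]

r = pearson(qualification, points)
print(r)
```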

We implement gradient boosting to predict the mean exam score, taking the algorithm developed in class as the starting point.

In [8]:
class Node: 
    def __init__(self, index, t, true_branch, false_branch):
        self.index = index  # index of the feature compared against the threshold in this node
        self.t = t  # threshold value
        self.true_branch = true_branch  # subtree for objects that satisfy the node condition
        self.false_branch = false_branch  # subtree for objects that do not satisfy it

In [9]:
class Leaf:
    
    def __init__(self, data, labels):
        self.data = data
        self.labels = labels
        self.prediction = self.predict()
        
    def predict(self):
        # the leaf predicts the mean target value of the objects it holds
        prediction = np.mean(self.labels)
        return prediction   

In [10]:
class Tree:

    def __init__(self, max_depth=50):
        self.max_depth = max_depth
        self.tree = None

    # Spread of the target values; despite the name, this is the standard deviation, not the variance
    def dispersion(self, labels):
        return np.std(labels)

    # Split quality: reduction of the weighted spread after the split

    def quality(self, left_labels, right_labels, current_dispersion):

        # fraction of the sample that went to the left subtree
        p = float(left_labels.shape[0]) / (left_labels.shape[0] + right_labels.shape[0])

        return current_dispersion - p * self.dispersion(left_labels) - (1 - p) * self.dispersion(right_labels)

    # Splitting the dataset at a node

    def split(self, data, labels, index, t):
    
        left = np.where(data[:, index] <= t)
        right = np.where(data[:, index] > t)
        
        true_data = data[left]
        false_data = data[right]
        true_labels = labels[left]
        false_labels = labels[right]
        
        return true_data, false_data, true_labels, false_labels

    # Finding the best split

    def find_best_split(self, data, labels):
    
        # minimum number of objects allowed in a node
        min_leaf = 5

        current_dispersion = self.dispersion(labels)

        best_quality = 0
        best_t = None
        best_index = None
    
        n_features = data.shape[1]
    
        for index in range(n_features):
            # only the unique values of the feature are tried as candidate thresholds
            t_values = np.unique(data[:, index])
      
            for t in t_values:
                true_data, false_data, true_labels, false_labels = self.split(data, labels, index, t)
                # skip splits that leave fewer than min_leaf objects in either child
                if len(true_data) < min_leaf or len(false_data) < min_leaf:
                    continue
        
                current_quality = self.quality(true_labels, false_labels, current_dispersion)
        
                # keep the threshold that gives the largest gain in quality
                if current_quality > best_quality:
                    best_quality, best_t, best_index = current_quality, t, index

        return best_quality, best_t, best_index

    # Building the tree recursively

    def build_tree(self, data, labels, tree_depth, max_depth):

        # Base case 1: stop when the maximum depth is reached
        # (checked first so that no split search is run for a node that becomes a leaf anyway)
        if tree_depth >= max_depth:
            return Leaf(data, labels)

        quality, t, index = self.find_best_split(data, labels)

        # Base case 2: stop when no split improves the quality
        if quality == 0:
            return Leaf(data, labels)

        # Increase the tree depth by 1
        tree_depth += 1

        true_data, false_data, true_labels, false_labels = self.split(data, labels, index, t)

        # Recursively build the two subtrees
        true_branch = self.build_tree(true_data, true_labels, tree_depth, max_depth)
        false_branch = self.build_tree(false_data, false_labels, tree_depth, max_depth)

        # Return the node together with its subtrees, i.e. the whole (sub)tree
        return Node(index, t, true_branch, false_branch)

    def predict_object(self, obj, node):

        # Stop the recursion once a leaf is reached
        if isinstance(node, Leaf):
            answer = node.prediction
            return answer

        if obj[node.index] <= node.t:
            return self.predict_object(obj, node.true_branch)
        else:
            return self.predict_object(obj, node.false_branch)

    def predict(self, data):
    
        val = []
        for obj in data:
            prediction = self.predict_object(obj, self.tree)
            val.append(prediction)
        return val

    def fit(self, data, labels):
        self.tree = self.build_tree(data, labels, 0, self.max_depth)
        return self
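
To make the split criterion concrete, here is a minimal standalone sketch of the same gain formula used in `Tree.quality`: the parent's standard deviation minus the size-weighted standard deviations of the two children. The helper name `split_gain` and the toy arrays are assumptions for illustration, not part of the notebook's classes.

```python
import numpy as np

def split_gain(labels, mask):
    """Drop in weighted standard deviation when `labels` is split by the boolean `mask`."""
    left, right = labels[mask], labels[~mask]
    p = len(left) / len(labels)  # fraction of objects sent to the left child
    return np.std(labels) - p * np.std(left) - (1 - p) * np.std(right)

# Made-up targets forming two clearly separated groups.
labels = np.array([50.0, 52.0, 54.0, 80.0, 82.0, 84.0])

# A feature that separates the two groups cleanly ...
informative = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
# ... and one that mixes them.
uninformative = np.array([1.0, 2.0, 1.0, 2.0, 1.0, 2.0])

good = split_gain(labels, informative <= 1.0)
poor = split_gain(labels, uninformative <= 1.0)
print(good, poor)
```

The informative split yields a much larger gain, which is why `find_best_split` would prefer it.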

In [11]:
class GradientBoosting:
  
    def __init__(self, n_trees, max_depth, coefs, eta):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.coefs = coefs
        self.eta = eta
        self.trees = []

    def bias(self, y, z):
        return (y - z)

    def fit(self, X_train, y_train):
    
        # Fitted trees are accumulated in self.trees

        for i in range(self.n_trees):
            tree = Tree(max_depth=self.max_depth)

            # the ensemble starts from a zero prediction,
            # so the first tree is simply fit on the original targets
            if len(self.trees) == 0:
                # fit the first tree on the training set
                tree.fit(X_train, y_train)
            else:
                # predictions of the current ensemble
                target = self.predict(X_train)

                # every tree after the first is fit on the residuals
                bias = self.bias(y_train, target)
                tree.fit(X_train, bias)

            self.trees.append(tree)

        return self

    def predict(self, X):
        # The ensemble starts from zero, so every tree in self.trees, including the first,
        # is an additive correction and enters the prediction scaled by the step eta
        return np.array([sum([self.eta* coef * alg.predict([x])[0] for alg, coef in zip(self.trees, self.coefs)]) for x in X])
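
Because the ensemble starts from zero and even the first tree is scaled by `eta`, it takes several iterations before the predictions reach the level of the targets. The sketch below illustrates this under a strong simplifying assumption: each tree is replaced by the simplest possible weak learner, a constant equal to the mean residual. The function `boost_with_constants` is hypothetical and is not the notebook's `GradientBoosting`.

```python
import numpy as np

def boost_with_constants(y, n_steps, eta):
    """Zero-initialised boosting where each weak learner is the mean residual."""
    prediction = np.zeros_like(y, dtype=float)
    history = []
    for _ in range(n_steps):
        residual = y - prediction        # same quantity as GradientBoosting.bias
        weak_learner = residual.mean()   # stand-in for a fitted tree
        prediction = prediction + eta * weak_learner
        history.append(prediction[0])    # the prediction is the same for all objects
    return np.array(history)

# A few target values from the table shown earlier.
y = np.array([53.0, 63.0, 86.0])
history = boost_with_constants(y, n_steps=20, eta=0.245)

# Closed form for this toy case: after k steps the model equals mean(y) * (1 - (1 - eta) ** k).
k = np.arange(1, 21)
closed_form = y.mean() * (1 - (1 - 0.245) ** k)
print(history[-1], y.mean())
```

In this toy setting the gap to the target mean shrinks by a factor of `1 - eta` per step, which gives some intuition for how `n_trees` and `eta` trade off against each other.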

In [12]:
def r_2(y_pred, y_true):
    numerator = ((y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    denominator = ((y_true - np.average(y_true)) ** 2).sum(axis=0, dtype=np.float64)
    return 1 - (numerator / denominator)
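
A quick sanity check of the metric on made-up values (the function is repeated so that the snippet runs on its own). Note that `r_2` takes the predictions first, whereas `sklearn.metrics.r2_score` takes the true values first.

```python
import numpy as np

def r_2(y_pred, y_true):
    numerator = ((y_true - y_pred) ** 2).sum(axis=0, dtype=np.float64)
    denominator = ((y_true - np.average(y_true)) ** 2).sum(axis=0, dtype=np.float64)
    return 1 - (numerator / denominator)

y_true = np.array([1.0, 2.0, 3.0, 4.0])

perfect = r_2(y_true, y_true)                                # exact predictions
baseline = r_2(np.full_like(y_true, y_true.mean()), y_true)  # always predict the mean
reversed_order = r_2(y_true[::-1], y_true)                   # worse than the mean

print(perfect, baseline, reversed_order)
```

The three cases show the scale of the metric: 1 for a perfect fit, 0 for the constant-mean baseline, and a negative value for predictions that are worse than the baseline.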

In [13]:
def standart_scaler(data, column_list):
    for columnName in column_list:
        col_mean = data[columnName].mean()
        col_std = data[columnName].std()
        data[columnName]=(data[columnName]-col_mean)/col_std
    return data

In [14]:
X = data.drop(['Id','mean_exam_points'], axis=1)
y = data['mean_exam_points']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# standart_columns = ['age','years_of_experience','lesson_price']
# X_train = standart_scaler(X_train,standart_columns)
# X_test = standart_scaler(X_test,standart_columns)
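
The commented-out calls above would standardise `X_train` and `X_test` separately, each with its own mean and standard deviation, and `standart_scaler` also modifies the frame it receives. If scaling were enabled, one common alternative is to compute the statistics on the training part only and reuse them. A minimal sketch, where `fit_scaler` and `apply_scaler` are hypothetical helper names and the frames are toy data:

```python
import pandas as pd

def fit_scaler(data, column_list):
    """Return per-column (mean, std) computed on `data` (hypothetical helper)."""
    return {name: (data[name].mean(), data[name].std()) for name in column_list}

def apply_scaler(data, stats):
    """Return a scaled copy of `data` using previously computed statistics."""
    scaled = data.copy()
    for name, (col_mean, col_std) in stats.items():
        scaled[name] = (scaled[name] - col_mean) / col_std
    return scaled

# Ages taken from the head() output above, split arbitrarily into two parts.
train_part = pd.DataFrame({'age': [40.0, 48.0, 39.0, 46.0]})
valid_part = pd.DataFrame({'age': [43.0, 33.0]})

stats = fit_scaler(train_part, ['age'])
train_scaled = apply_scaler(train_part, stats)
valid_scaled = apply_scaler(valid_part, stats)  # reuses the training statistics
print(valid_scaled)
```

For the threshold-based trees used here, standardising a feature does not change which objects fall on each side of a split, which may be why the notebook leaves scaling disabled.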

In [15]:
# the parameters below were chosen by manual trial and error:

# Number of trees in the ensemble
n_trees = 20
# all tree coefficients are set to 1
coefs = [1] * n_trees
# Maximum tree depth
max_depth = 6
# Step size (learning rate)
eta = 0.245

In [16]:
gb = GradientBoosting(n_trees, max_depth, coefs, eta)
gb.fit(X_train.values, y_train.values)
train_answers = gb.predict(X_train.values)
test_answers = gb.predict(X_test.values)
print(f'R2 on the training split: {r_2(train_answers, y_train.values)}')
print(f'R2 on the hold-out split: {r_2(test_answers, y_test.values)}')

R2 on the training split: 0.8015934820294607
R2 on the hold-out split: 0.7727435589060444


Now fit the model on the entire training dataset:

In [17]:
gb_all = GradientBoosting(n_trees, max_depth, coefs, eta)
gb_all.fit(X.values, y.values)
train_answers_all = gb_all.predict(X.values)
print(f'R2 on the full training dataset: {r_2(train_answers_all, y.values)}')

R2 on the full training dataset: 0.7970887553035478


#### Applying the model:

In [18]:
X_test_predict = test.drop(['Id'], axis=1)
X_test_predict.head()

Unnamed: 0,age,years_of_experience,lesson_price,qualification,physics,chemistry,biology,english,geography,history
0,46.0,3.0,1050.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0
1,43.0,3.0,1850.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0
2,52.0,1.0,1550.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
3,57.0,6.0,2900.0,3.0,1.0,0.0,1.0,0.0,0.0,0.0
4,44.0,4.0,3150.0,3.0,1.0,0.0,0.0,0.0,0.0,0.0


In [19]:
test_predict_answers = gb.predict(X_test_predict.values)
test_predict_answers_all = gb_all.predict(X_test_predict.values)
Out_data = pd.DataFrame({'Id':test['Id'], 'mean_exam_points':test_predict_answers.round(decimals=3)})
Out_data_all = pd.DataFrame({'Id':test['Id'], 'mean_exam_points':test_predict_answers_all.round(decimals=3)})

In [20]:
Out_data.to_csv(f'PodoynitsynVA_predictions_GB_{n_trees}_{max_depth}_{eta}.csv', index=False)
Out_data_all.to_csv(f'PodoynitsynVA_predictions_GB_{n_trees}_{max_depth}_{eta}_ALL.csv', index=False)