# Input

Our problem owner suggested revisiting polynomial regression because its predictions can be ranked, which could be useful in our case, so we took a brief look. The previous model did not give promising results, but that file contained mistakes (it had no cross-validation), so we wanted to investigate further. I therefore created this small polynomial regression example to see whether the approach could be useful for us.

In [13]:
# Importing libraries 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler, PowerTransformer, PolynomialFeatures
from sklearn.metrics import mean_squared_error 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from itertools import chain

pd.options.mode.chained_assignment = None

In [14]:
# CHANGE HERE
n_classes = 5
deg = 2
scaler = StandardScaler()
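
To make the effect of `n_classes` concrete, here is a minimal sketch of how sorted scores are grouped with `np.array_split`, the same call used further down in `split_classes`. It assumes the user scores are the integers 1 to 10, which the dataset may or may not match exactly, and `score_to_class` is a hypothetical helper written only for this illustration.

```python
import numpy as np

# Assumption: user scores are the integers 1 to 10
scores = np.arange(1, 11)

def score_to_class(scores, n_classes):
    """Hypothetical helper: map each distinct score to the index of its group."""
    groups = np.array_split(np.sort(scores), n_classes)
    return {int(s): label for label, group in enumerate(groups) for s in group}

mapping_5 = score_to_class(scores, 5)
mapping_3 = score_to_class(scores, 3)
print(mapping_5)  # {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2, 7: 3, 8: 3, 9: 4, 10: 4}
print(mapping_3)  # {1: 0, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 2, 9: 2, 10: 2}
```

When the number of distinct scores is not divisible by `n_classes`, `np.array_split` makes the first groups one element larger, so the lowest classes cover more scores.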

In [15]:
# Create df 
df = pd.read_csv('/datc/nano/notebooks/Target variable & Features (V3).csv', index_col = 0)

# Creating dataframe with only yen values (copied so that a column can be added to it safely)
df_yen = df[df['Threshold method'] == 'yen'].copy()

In [32]:
# MODEL EXPERIMENT

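# Classes (function from Oscar)
def split_classes(n_classes):
    # Sort the distinct user scores and split them into n_classes contiguous groups
    current_classes = np.sort(df_yen['User score'].unique())
    split = np.array_split(current_classes, n_classes)

    # Label each row with the index of the group that contains its score
    class_labels = []
    for _, row in df_yen.iterrows():
        for label, class_ranges in enumerate(split):
            if row['User score'] in class_ranges:
                class_labels.append(label)
                break  # groups are disjoint, so the first match is the only one
    return class_labels

# Add the class labels as a new column
df_yen['Class'] = split_classes(n_classes)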
df_yen.head()

# Creating x and y
#x = df_yen[['Threshold: separation', 'Threshold: border' , 'Threshold: area spread', 'Threshold: fill', 'Threshold: count', 'Threshold: intensity']]
x = df_yen[['Threshold: separation', 'Threshold: border']]
if n_classes == 10:
    y = df_yen[['User score']]
else:
    y = df_yen[['Class']]
    
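# Scaler
# Note: the scaler is fitted on all rows before the folds are split,
# so the validation rows influence the scaling statistics.
x_array = scaler.fit_transform(x)
x = pd.DataFrame(x_array, index=x.index, columns=x.columns)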

# Cross validation
mse_list = []
kf = KFold(n_splits=3, random_state=42, shuffle=True)
for train_ind, valid_ind in kf.split(x):
    x_train = x.iloc[train_ind].to_numpy()
    x_test = x.iloc[valid_ind].to_numpy()
    y_train = y.iloc[train_ind].to_numpy()
    y_test = y.iloc[valid_ind].to_numpy()

    # Polynomial
    poly = PolynomialFeatures(degree=deg, include_bias=False)
    xp_train = poly.fit_transform(x_train)
    xp_test = poly.transform(x_test)

    # Model
    model = LinearRegression()
    model.fit(xp_train, y_train)
    y_pred = model.predict(xp_test)

    # Unnest
    y_pred = list(chain.from_iterable(y_pred))
    y_test = list(chain.from_iterable(y_test))
    
    # Calculate mse
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

# Mean MSE
mse_mean = sum(mse_list)/len(mse_list)
print(mse_mean)

2.8747454081177612
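
The cell above fits the scaler on the full dataset before the folds are split. One possible way to keep preprocessing inside each fold is a scikit-learn `Pipeline` evaluated with `cross_val_score`. The sketch below is not the experiment itself: it runs on synthetic stand-in data (`X_demo` and `y_demo` are hypothetical), so its MSE says nothing about the real features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): two features and a class label from 0 to 4
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(60, 2))
y_demo = rng.integers(0, 5, size=60)

# Degree 2 with two inputs gives five columns: x1, x2, x1^2, x1*x2, x2^2
n_poly_features = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X_demo).shape[1]

# Scaling and polynomial expansion are refit on the training part of every fold
pipeline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, include_bias=False),
    LinearRegression(),
)

kf = KFold(n_splits=3, shuffle=True, random_state=42)
neg_mse = cross_val_score(pipeline, X_demo, y_demo, cv=kf, scoring="neg_mean_squared_error")

# scikit-learn reports the negated MSE, so flip the sign before averaging
mse_mean_demo = -neg_mse.mean()
print(n_poly_features, mse_mean_demo)
```

With the real data, `X_demo` and `y_demo` would be replaced by the unscaled feature columns and the target column. The resulting mean MSE may differ slightly from the value printed above because the scaling statistics then come from the training folds only.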


In [5]:
# VISUALIZATION

# Build a colour-coded table of true versus predicted values.
# Note: y_test and y_pred hold the values of the last cross-validation fold only.
def comp_predictions(true_col):

    compare_pol = pd.DataFrame({true_col: y_test, 'Predicted score': y_pred})

    def highlight(val):
        error = abs(val['Predicted score'] - val[true_col])
        if error <= 1:
            return ['background: green']*2
        elif error <= 2:
            return ['background: yellow']*2
        else:
            return ['background: red']*2

    return compare_pol.style.apply(highlight, axis=1)


# With 10 classes the raw user scores are compared, otherwise the class labels
comp_table = comp_predictions('User Score' if n_classes == 10 else 'Class')

comp_table

   Class  Predicted score
0      1         4.245725
1      0         1.906806
2      0         1.851714
3      2         0.576689
4      4         1.532727
5      2         2.531647
6      3         3.918067
7      3         3.637089
8      3         3.545485
9      3         3.120731


Mean Squared Error:  3.0137723273803045
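
The colour bands in the table can also be summarised as counts. The sketch below applies the same thresholds as `highlight` (within 1 is green, within 2 is yellow, otherwise red) to the ten rows shown in the table above. It covers only those rows and is not a separate evaluation.

```python
import pandas as pd

# The ten rows shown in the output table above
rows = pd.DataFrame({
    "Class": [1, 0, 0, 2, 4, 2, 3, 3, 3, 3],
    "Predicted score": [4.245725, 1.906806, 1.851714, 0.576689, 1.532727,
                        2.531647, 3.918067, 3.637089, 3.545485, 3.120731],
})

# Absolute distance between the prediction and the true class
abs_error = (rows["Predicted score"] - rows["Class"]).abs()

# Same bands as highlight(): within 1, within 2, further away
band_counts = {
    "green": int((abs_error <= 1).sum()),
    "yellow": int(((abs_error > 1) & (abs_error <= 2)).sum()),
    "red": int((abs_error > 2).sum()),
}
print(band_counts)  # {'green': 5, 'yellow': 3, 'red': 2}
```

Five of the ten rows shown fall within one class of the true value, and two are more than two classes away.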


# Output

At first glance the results do not look too bad. With the features and polynomial order chosen this time, which differ from those of the first model, the approach could potentially be useful. I therefore turned this example into a real experiment that compares scores directly.