link to [Colab](https://colab.research.google.com/drive/1JK2ZOFO3S0B_uFvrSgVPP8Qi7Mo5yqim?usp=sharing)

# Description  
Одной из самых больших проблем при покупке подержанного автомобиля на автоаукционе является риск того, что у машины могут быть серьезные проблемы, которые не позволят продать ее клиентам. В автомобильном сообществе такие неудачные покупки называют "киками".

Кикнутые автомобили часто становятся результатом подделки одометров, механических проблем, которые дилер не в состоянии решить, проблем с получением права собственности на автомобиль от продавца или других непредвиденных проблем. Кикнутые автомобили могут обойтись дилерам очень дорого, если учесть транспортные расходы, затраты на ремонт и рыночные потери при перепродаже автомобиля.

Специалисты по моделированию, способные определить, какие автомобили имеют более высокий риск стать "киком", могут принести реальную пользу дилерским центрам, стремящимся предоставить своим клиентам наилучший выбор товарного запаса.

Задача этого конкурса - предсказать, является ли автомобиль, купленный на аукционе, "киком" (неудачной покупкой).  
- **Target:** `'IsBadBuy'`

Field Name		---		Definition
- RefID				---        Unique (sequential) number assigned to vehicles
- IsBadBuy	(target)		--- 	Identifies if the kicked vehicle was an avoidable - purchase
- PurchDate			--- 	The Date the vehicle was Purchased at Auction
- Auction			--- 		Auction provider at which the  vehicle was purchased
- VehYear			--- 		The manufacturer's year of the vehicle
- VehicleAge		--- 		The Years elapsed since the manufacturer's year
- Make				--- 	Vehicle Manufacturer
- Model				--- 	Vehicle Model
- Trim				--- 	Vehicle Trim Level
- SubModel			--- 	Vehicle Submodel
- Color				--- 	Vehicle Color
- Transmission		--- 		Vehicles transmission type (Automatic, Manual)
- WheelTypeID		--- 		The type id of the vehicle wheel
- WheelType			--- 	The vehicle wheel type description (Alloy, Covers)
- VehOdo			--- 		The vehicles odometer reading
- Nationality		--- 		The Manufacturer's country
- Size				--- 	The size category of the vehicle (Compact, SUV, etc.)
- TopThreeAmericanName		--- 	Identifies if the manufacturer is one of the top three American manufacturers
- MMRAcquisitionAuctionAveragePrice	--- Acquisition price for this vehicle in average condition at time of purchase
- MMRAcquisitionAuctionCleanPrice	--- 	Acquisition price for this vehicle in the above Average condition at time of purchase
- MMRAcquisitionRetailAveragePrice	--- Acquisition price for this vehicle in the retail market in average condition at time of purchase
- MMRAcquisitonRetailCleanPrice		--- Acquisition price for this vehicle in the retail market in above average condition at time of purchase
- MMRCurrentAuctionAveragePrice		--- Acquisition price for this vehicle in average condition as of current day
- MMRCurrentAuctionCleanPrice		--- Acquisition price for this vehicle in the above condition as of current day
- MMRCurrentRetailAveragePrice		--- Acquisition price for this vehicle in the retail market in average condition as of current day
- MMRCurrentRetailCleanPrice		--- Acquisition price for this vehicle in the retail market in above average condition as of current day
- PRIMEUNIT				--- Identifies if the vehicle would have a higher demand than a standard purchase
- AcquisitionType				--- Identifies how the vehicle was aquired (Auction buy, trade in, etc)
- AUCGUART				--- The level guarntee provided by auction for the vehicle (Green light - Guaranteed/arbitratable, Yellow Light - caution/issue, red light - sold as is)
- KickDate				--- Date the vehicle was kicked back to the auction
- BYRNO					--- Unique number assigned to the buyer that purchased the vehicle
- VNZIP                ---                    Zipcode where the car was purchased
- VNST                ---                     State where the the car was purchased
- VehBCost				--- Acquisition cost paid for the vehicle at time of purchase
- IsOnlineSale		--- 		Identifies if the vehicle was originally purchased online
- WarrantyCost       ---                      Warranty price (term=36month  and millage=36K)





# 1. Download data from Don’tGetKicked competition.

In [68]:
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_curve, auc, roc_auc_score, precision_recall_curve
from sklearn import svm
from sklearn.model_selection import GridSearchCV
import warnings
warnings. filterwarnings('ignore')

In [2]:
df = pd.read_csv('data/training.csv')
df.head(5)

Unnamed: 0,RefId,IsBadBuy,PurchDate,Auction,VehYear,VehicleAge,Make,Model,Trim,SubModel,...,MMRCurrentRetailAveragePrice,MMRCurrentRetailCleanPrice,PRIMEUNIT,AUCGUART,BYRNO,VNZIP1,VNST,VehBCost,IsOnlineSale,WarrantyCost
0,1,0,12/7/2009,ADESA,2006,3,MAZDA,MAZDA3,i,4D SEDAN I,...,11597.0,12409.0,,,21973,33619,FL,7100.0,0,1113
1,2,0,12/7/2009,ADESA,2004,5,DODGE,1500 RAM PICKUP 2WD,ST,QUAD CAB 4.7L SLT,...,11374.0,12791.0,,,19638,33619,FL,7600.0,0,1053
2,3,0,12/7/2009,ADESA,2005,4,DODGE,STRATUS V6,SXT,4D SEDAN SXT FFV,...,7146.0,8702.0,,,19638,33619,FL,4900.0,0,1389
3,4,0,12/7/2009,ADESA,2004,5,DODGE,NEON,SXT,4D SEDAN,...,4375.0,5518.0,,,19638,33619,FL,4100.0,0,630
4,5,0,12/7/2009,ADESA,2005,4,FORD,FOCUS,ZX3,2D COUPE ZX3,...,6739.0,7911.0,,,19638,33619,FL,4000.0,0,1020


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72983 entries, 0 to 72982
Data columns (total 34 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   RefId                              72983 non-null  int64  
 1   IsBadBuy                           72983 non-null  int64  
 2   PurchDate                          72983 non-null  object 
 3   Auction                            72983 non-null  object 
 4   VehYear                            72983 non-null  int64  
 5   VehicleAge                         72983 non-null  int64  
 6   Make                               72983 non-null  object 
 7   Model                              72983 non-null  object 
 8   Trim                               70623 non-null  object 
 9   SubModel                           72975 non-null  object 
 10  Color                              72975 non-null  object 
 11  Transmission                       72974 non-null  obj

In [4]:
df.IsBadBuy.value_counts()

0    64007
1     8976
Name: IsBadBuy, dtype: int64

# 2. Design train/validation/test split.
Use “PurchDate” field for splitting, test must be later in time than validation, the same goes for validation and train: train.PurchDate < valid.PurchDate < test.PurchDate.
Use the first 33% of dates for train, last 33% of dates for test, and middle 33% for validation set.
Don’t use the test dataset until the end!

In [5]:
df_prepared = df.copy()

проверим дубликаты пропуски

In [6]:
df_prepared.duplicated().value_counts()

False    72983
dtype: int64

In [7]:
def print_useful_rows_info(df):
    """Количество и процент заполненных строк"""
    print('Amount of useful rows:', len(df.dropna()))
    print('Persentage of filled rows', round(len(df.dropna()) / len(df) * 100, 2))

In [8]:
print_useful_rows_info(df_prepared)

Amount of useful rows: 3276
Persentage of filled rows 4.49


In [9]:
def blank_rows_percentage(df):
  """Вывод колонок и процента пропусков в каждой"""
  print((df.isna().sum() / len(df) * 100).sort_values(ascending=False))

In [10]:
blank_rows_percentage(df_prepared)

PRIMEUNIT                            95.315347
AUCGUART                             95.315347
WheelType                             4.348958
WheelTypeID                           4.342107
Trim                                  3.233630
MMRCurrentAuctionAveragePrice         0.431607
MMRCurrentRetailCleanPrice            0.431607
MMRCurrentRetailAveragePrice          0.431607
MMRCurrentAuctionCleanPrice           0.431607
MMRAcquisitionAuctionAveragePrice     0.024663
MMRAcquisitionAuctionCleanPrice       0.024663
MMRAcquisitionRetailAveragePrice      0.024663
MMRAcquisitonRetailCleanPrice         0.024663
Transmission                          0.012332
SubModel                              0.010961
Color                                 0.010961
Nationality                           0.006851
Size                                  0.006851
TopThreeAmericanName                  0.006851
BYRNO                                 0.000000
VNZIP1                                0.000000
VNST         

в столбцах  PRIMEUNIT  и AUCGUART больше 90% пропусков, удалим эти столбцы

In [11]:
df_prepared.drop(columns=['PRIMEUNIT', 'AUCGUART'], axis=0, inplace=True)

In [12]:
print_useful_rows_info(df_prepared)

Amount of useful rows: 67270
Persentage of filled rows 92.17


остальные пропуски заполним

In [13]:
df_prepared['WheelType'].value_counts()

Alloy      36050
Covers     33004
Special      755
Name: WheelType, dtype: int64

In [14]:
df_prepared['WheelType'].isna().sum()

3174

In [15]:
df_prepared[df_prepared['WheelType'].isna()]['WheelTypeID'].isna().sum()

3169

In [16]:
df_prepared['WheelTypeID'].value_counts()

1.0    36050
2.0    33004
3.0      755
0.0        5
Name: WheelTypeID, dtype: int64

чтобы сохранить больше данных, заполним ячейки как 'other'

In [17]:
df_prepared['WheelType'] = df_prepared['WheelType'].fillna('other')
df_prepared['WheelTypeID'] = df_prepared['WheelTypeID'].fillna(-1.0)

In [18]:
df_prepared['Trim'].value_counts()

Bas    13950
LS     10174
SE      9348
SXT     3825
LT      3540
       ...  
Har        1
LL         1
JLX        1
JLS        1
L 3        1
Name: Trim, Length: 134, dtype: int64

In [19]:
df_prepared['Trim'] = df_prepared['Trim'].fillna('other')

In [20]:
print_useful_rows_info(df_prepared)

Amount of useful rows: 72658
Persentage of filled rows 99.55


In [21]:
df_prepared[df_prepared.isnull().any(axis=1)]['IsBadBuy'].value_counts(normalize=True)

0    0.898462
1    0.101538
Name: IsBadBuy, dtype: float64

In [22]:
df_prepared[df_prepared.notnull()]['IsBadBuy'].value_counts(normalize=True)

0    0.877012
1    0.122988
Name: IsBadBuy, dtype: float64

остальных пропусков меньше 0,5% и они имеют тот же дисбаланс классов целевой переменной, поэтому
можно их удалить

In [23]:
df_prepared.dropna(inplace=True)

преобразуем категориальные переменные

удалим разные айдишники

In [24]:
df_prepared.drop(columns=['RefId', 'BYRNO'], axis=1, inplace=True)

преобразуем категориальные переменные

In [25]:
df_prepared['Model'] = df_prepared['Model'].astype('category')
df_prepared['SubModel'] = df_prepared['SubModel'].astype('category')
df_prepared['Auction'] = df_prepared['Auction'].astype('category')
df_prepared['Make'] = df_prepared['Make'].astype('category')
df_prepared['Trim'] = df_prepared['Trim'].astype('category')
df_prepared['Color'] = df_prepared['Color'].astype('category')
df_prepared['Transmission'] = df_prepared['Transmission'].astype('category')
df_prepared['Nationality'] = df_prepared['Nationality'].astype('category')
df_prepared['VNST'] = df_prepared['VNST'].astype('category')
df_prepared['VNZIP1'] = df_prepared['VNZIP1'].astype('category')
df_prepared['WheelType'] = df_prepared['WheelType'].astype('category')
df_prepared['Size'] = df_prepared['Size'].astype('category')
df_prepared['TopThreeAmericanName'] = df_prepared['TopThreeAmericanName'].astype('category')

In [26]:
df_prepared['PurchDate'].dtype

dtype('O')

Как видим 'PurchDate' является типом объект, преобразуем в дату

In [27]:
df_prepared['PurchDate'] = pd.to_datetime(df_prepared['PurchDate'], errors='coerce') #ставим NaT в случае ошибки

if df_prepared['PurchDate'].notnull().all():
    print("All dates are valid.")
else:
    print("Some dates are invalid.")

All dates are valid.


In [28]:
df_prepared.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 72658 entries, 0 to 72982
Data columns (total 30 columns):
 #   Column                             Non-Null Count  Dtype         
---  ------                             --------------  -----         
 0   IsBadBuy                           72658 non-null  int64         
 1   PurchDate                          72658 non-null  datetime64[ns]
 2   Auction                            72658 non-null  category      
 3   VehYear                            72658 non-null  int64         
 4   VehicleAge                         72658 non-null  int64         
 5   Make                               72658 non-null  category      
 6   Model                              72658 non-null  category      
 7   Trim                               72658 non-null  category      
 8   SubModel                           72658 non-null  category      
 9   Color                              72658 non-null  category      
 10  Transmission                      

In [30]:
df_prepared['PurchDate'].min()

Timestamp('2009-01-05 00:00:00')

In [31]:
df_prepared['PurchDate'].max()

Timestamp('2010-12-30 00:00:00')

In [32]:
df_prepared.sort_values(by='PurchDate', inplace=True)

In [33]:
train_val, test = train_test_split(df_prepared, test_size=0.33, shuffle=False)
train, val = train_test_split(train_val, test_size=0.5, shuffle=False)

print(f"train date from {train['PurchDate'].min()} to {train['PurchDate'].max()} size:{train.shape[0]}")
print(f"val   date from {val['PurchDate'].min()} to {val['PurchDate'].max()} size:{val.shape[0]}")
print(f"test  date from {test['PurchDate'].min()} to {test['PurchDate'].max()} size:{test.shape[0]}")

train date from 2009-01-05 00:00:00 to 2009-09-16 00:00:00 size:24340
val   date from 2009-09-16 00:00:00 to 2010-05-18 00:00:00 size:24340
test  date from 2010-05-18 00:00:00 to 2010-12-30 00:00:00 size:23978


# 3. Preprocess categorical variables
Use LabelEncoder or OneHotEncoder from sklearn to preprocess categorical variables. Be careful with data leakage (fit Encoder on train and apply on validation & test). Consider another encoding approach if you meet new categorical values in valid & test (unseen in the training dataset), for example: https://contrib.scikit-learn.org/category_encoders/count.html

Handling New Categories
If you encounter new categories in the validation or test datasets that were not present in the training dataset, you have a few options:

- Drop the rows with the new categories in the validation and test datasets. This approach is straightforward but may result in losing valuable data.
- Create a new category (e.g., "Unknown") for these new categories. This approach is less drastic but may introduce bias if the "Unknown" category is significantly different from the known categories.
- Re-fit the encoder on the combined training, validation, and test data. This approach is more flexible but may lead to data leakage if not handled carefully.  

In practice, the choice depends on the specific context and the importance of avoiding data leakage versus preserving data integrity.



In [34]:
X_train = train.drop(columns=['IsBadBuy', 'PurchDate'], axis=1)
y_train = train['IsBadBuy']

X_val = val.drop(columns=['IsBadBuy', 'PurchDate'], axis=1)
y_val = val['IsBadBuy']

X_test = test.drop(columns=['IsBadBuy', 'PurchDate'], axis=1)
y_test = test['IsBadBuy']

**For hot encoder:**  
Model  
SubModel  
Auction  
Make  
Trim  
Color  
Transmission  
Nationality  
VNST  
VNZIP1  

In [36]:
list_ohe_features = ['Model',
                    'SubModel',
                    'Auction',
                    'Make',
                    'Trim',
                    'Color',
                    'Transmission',
                    'Nationality',
                    'VNST',
                    'VNZIP1']
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

encoded = encoder.fit(X_train[list_ohe_features])

X_train_ohe = pd.DataFrame(encoder.transform(X_train[list_ohe_features]), columns=encoded.get_feature_names_out())
X_train = pd.concat([X_train.reset_index(), X_train_ohe.reset_index()], axis=1).drop(['index'], axis=1)
X_train.drop(list_ohe_features, axis=1, inplace=True)

X_val_ohe = pd.DataFrame(encoder.transform(X_val[list_ohe_features]), columns=encoded.get_feature_names_out())
X_val = pd.concat([X_val.reset_index(), X_val_ohe.reset_index()], axis=1).drop(['index'], axis=1)
X_val.drop(list_ohe_features, axis=1, inplace=True)

X_test_ohe = pd.DataFrame(encoder.transform(X_test[list_ohe_features]), columns=encoded.get_feature_names_out())
X_test = pd.concat([X_test.reset_index(), X_test_ohe.reset_index()], axis=1).drop(['index'], axis=1)
X_test.drop(list_ohe_features, axis=1, inplace=True)



In [38]:
X_train.shape

(24340, 1809)

**For label encoder:**  
WheelTypeID  
WheelType  
Size  
TopThreeAmericanName  

In [39]:
X_test[['WheelType', 'Size', 'TopThreeAmericanName']].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23978 entries, 0 to 23977
Data columns (total 3 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   WheelType             23978 non-null  category
 1   Size                  23978 non-null  category
 2   TopThreeAmericanName  23978 non-null  category
dtypes: category(3)
memory usage: 71.2 KB


In [40]:
list_le_features = ['WheelType',
                    'Size',
                    'TopThreeAmericanName']
le = LabelEncoder()

for f in list_le_features:
    le.fit(X_train[f])

    X_train[f] = pd.DataFrame(le.transform(X_train[f]))
    X_val[f] = pd.DataFrame(le.transform(X_val[f]))
    X_test[f] = pd.DataFrame(le.transform(X_test[f]))

In [41]:
X_train['TopThreeAmericanName'].value_counts()

2    8468
0    8090
1    4600
3    3182
Name: TopThreeAmericanName, dtype: int64

In [42]:
X_train.shape

(24340, 1809)

# 4. Train: LogisticRegression, GaussianNB, KNN from sklearn
 on the training dataset and check the quality of your algorithms on the validation dataset. The dependent variable (IsBadBuy) is binary. Don’t forget to normalize your datasets before training models.  

You must receive at least 0.15 Gini score (the best of all four). Which algorithm performs better? Why?

Нормализация величин

In [43]:
df_prepared.describe()

Unnamed: 0,IsBadBuy,VehYear,VehicleAge,WheelTypeID,VehOdo,MMRAcquisitionAuctionAveragePrice,MMRAcquisitionAuctionCleanPrice,MMRAcquisitionRetailAveragePrice,MMRAcquisitonRetailCleanPrice,MMRCurrentAuctionAveragePrice,MMRCurrentAuctionCleanPrice,MMRCurrentRetailAveragePrice,MMRCurrentRetailCleanPrice,VehBCost,IsOnlineSale,WarrantyCost
count,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0,72658.0
mean,0.123083,2005.341243,4.180132,1.385794,71525.19929,6128.838944,7374.900493,8501.494164,9856.861805,6132.229624,7390.851221,8776.037766,10145.731509,6730.807825,0.025379,1277.072229
std,0.328535,1.729603,1.710427,0.720205,14568.009802,2462.457325,2723.82035,3156.178758,3386.050131,2434.561422,2686.25004,3090.554492,3310.117825,1767.914054,0.157275,599.236636
min,0.0,2001.0,0.0,-1.0,4825.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,462.0
25%,0.0,2004.0,3.0,1.0,61874.25,4273.0,5409.0,6288.0,7501.0,4275.0,5415.0,6537.0,7784.0,5435.0,0.0,837.0
50%,0.0,2005.0,4.0,1.0,73382.0,6098.0,7305.0,8447.0,9798.0,6063.0,7313.0,8729.0,10103.0,6700.0,0.0,1169.0
75%,0.0,2007.0,5.0,2.0,82452.0,7761.0,9023.75,10657.0,12092.0,7736.0,9013.0,10911.0,12309.0,7900.0,0.0,1623.0
max,1.0,2010.0,9.0,3.0,115717.0,35722.0,36859.0,39080.0,41482.0,35722.0,36859.0,39080.0,41062.0,45469.0,1.0,7498.0


In [44]:
minmax = MinMaxScaler()
cols_for_minmax = ['VehOdo',
                    'MMRAcquisitionAuctionAveragePrice',
                    'MMRAcquisitionAuctionCleanPrice',
                    'MMRAcquisitionRetailAveragePrice',
                    'MMRAcquisitonRetailCleanPrice',
                    'MMRCurrentAuctionAveragePrice',
                    'MMRCurrentAuctionCleanPrice',
                    'MMRCurrentRetailAveragePrice',
                    'MMRCurrentRetailCleanPrice',
                    'VehBCost',
                    'WarrantyCost']
X_train[cols_for_minmax] = minmax.fit_transform(X_train[cols_for_minmax])

X_val[cols_for_minmax] = minmax.transform(X_val[cols_for_minmax])
X_test[cols_for_minmax] = minmax.transform(X_test[cols_for_minmax])


In [45]:
%%time
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

lr_predictions = logreg.predict_proba(X_val)[:, 1]
lr_gini_score = 2 * roc_auc_score(y_val, lr_predictions) - 1
lr_gini_score

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


CPU times: user 13.2 s, sys: 2.26 s, total: 15.4 s
Wall time: 10.8 s


0.4673323431788501

In [46]:
%%time
gnb = GaussianNB()
gnb.fit(X_train, y_train)
gnb_predictions = gnb.predict_proba(X_val)[:, 1]
gnb_gini_score = 2 * roc_auc_score(y_val, gnb_predictions) - 1
gnb_gini_score

CPU times: user 1.4 s, sys: 1.39 s, total: 2.79 s
Wall time: 2.72 s


0.06498356195479849

In [47]:
%%time
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
knn_predictions = knn.predict_proba(X_val)[:, 1]
knn_gini_score = 2 * roc_auc_score(y_val, knn_predictions) - 1
knn_gini_score

CPU times: user 2min 14s, sys: 735 ms, total: 2min 15s
Wall time: 1min 23s


0.3476430514092421

Самый высокий джини у логистической регрессии

# 5. Implement Gini score calculation.
You can use 2*ROC AUC - 1 approach, so you need to implement ROC AUC calculation. Check if your metric approximately equals abs(2\*sklearn.metrcs.roc_auc_score - 1).

In [48]:
def calculate_roc_auc(y_true, y_pred):
    sorted_indices = np.argsort(y_pred)[::-1]
    y_true_sorted = y_true[sorted_indices]
    tpr = []
    fpr = []
    n_positive = np.sum(y_true)
    n_negative = len(y_true) - n_positive

    tp = 0
    fp = 0
    for i in range(len(y_true_sorted)):
        if y_true_sorted[i] == 1:
            tp += 1
        else:
            fp += 1
        tpr.append(tp / n_positive)
        fpr.append(fp / n_negative)

    auc = np.trapz(tpr, fpr)

    return auc



auc_custom = calculate_roc_auc(y_val.to_numpy(), lr_predictions)
auc_sklearn = roc_auc_score(y_val.to_numpy(), lr_predictions)

print("Custom  ROC AUC:", auc_custom)
print("Sklearn ROC AUC:", auc_sklearn)

Custom  ROC AUC: 0.7336661715894249
Sklearn ROC AUC: 0.733666171589425


In [56]:
def gini_score(y_true, y_pred):
    return 2 * calculate_roc_auc(y_true, y_pred) - 1

In [57]:
gini_score(y_val.to_numpy(), lr_predictions)

0.46733234317884986

In [51]:
abs(2*roc_auc_score(y_val.to_numpy(), lr_predictions) - 1)

0.4673323431788501

# 6. Implement your own versions of LogisticRegression, KNN and NaiveBayes classifiers.
For LogisticRegression compute gradients with respect to the loss and use stochastic gradient descent.
Are you able to reproduce results from step 4?
Guide for this task:
Your model must be represented by class with fit, predict (predict_proba with 0.5 threshold), predict_proba methods.
For LR moder compute gradient of loss with respect to parameters w and parameter b in fit function. Use a simple SGD approach for estimating optimal values of parameters.

**Предсказания:**  
$$
y_{pred}(x, w) = \frac{1}{1 + e^{-\langle x, w \rangle}}
$$

**Лосс (LogLoss):**  
$$
L(w) = -y\, log\,y_{pred} - (1-y)\,log\,(1-y_{pred})
$$

**Градиент:**  
$$
\frac{\partial{L}}{\partial{w}}
= \left(-\frac{y}{y_{pred}} + \frac{1-y}{1-y_{pred}}\right)\frac{\partial{y_{pred}}}{\partial{w}}
$$

$$
\frac{\partial{y_{pred}}}{\partial{w}} = \frac{1}{(1+e^{-\langle x, w \rangle})^2} e^{-\langle x, w \rangle} (-x) = -y_{pred}(1-y_{pred})x
$$

$$
\frac{\partial{L}}{\partial{w}} = (y_{pred} - y) x
$$

In [52]:
class MyLogisticRegression():
    def __init__(self, learning_rate=0.001, num_iter=100):
        self.learning_rate = learning_rate
        self.num_iter = num_iter

    def fit(self, X, y):
        self.w = np.zeros(X.shape[1])
        self.b = 0

        for _ in range(self.num_iter):
            z = np.dot(X, self.w) + self.b
            y_pred = self.sigmoid(z)
            gradient_w = (1 / X.shape[0]) * np.dot(X.T, (y_pred - y))
            gradient_b = (1 / X.shape[0]) * np.sum(y_pred - y)

            self.w -= self.learning_rate * gradient_w
            self.b -= self.learning_rate * gradient_b

    def predict_proba(self, X):
        z = np.dot(X, self.w) + self.b
        return self.sigmoid(z)

    def predict(self, X):
        return self.predict_proba(X) > 0.5

    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))


In [53]:
%%time
my_logreg = MyLogisticRegression()
my_logreg.fit(X_train, y_train)

my_lr_predictions = my_logreg.predict_proba(X_val)
gini_score(y_val.to_numpy(), my_lr_predictions)

  return 1 / (1 + np.exp(-z))


CPU times: user 28.9 s, sys: 43.1 s, total: 1min 12s
Wall time: 49.5 s


0.308373205203476

In [60]:
from scipy.spatial import KDTree

class MyKNeighborsClassifier:
    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors
        self.X_train = None
        self.y_train = None
        self.kdtree = None

    def fit(self, X_train, y_train):
        self.X_train = X_train
        self.y_train = y_train
        self.kdtree = KDTree(X_train)

    def predict(self, X_test):
        _, indices = self.kdtree.query(X_test, k=self.n_neighbors)
        y_pred = np.apply_along_axis(lambda x: np.argmax(np.bincount(x)), axis=1, arr=self.y_train[indices])
        return y_pred

    def predict_proba(self, X_test):
        _, indices = self.kdtree.query(X_test, k=self.n_neighbors)
        y_pred_proba = np.apply_along_axis(lambda x: np.bincount(x, minlength=len(np.unique(self.y_train))) / self.n_neighbors, axis=1, arr=self.y_train[indices])
        return y_pred_proba

In [61]:
%%time
my_knn = MyKNeighborsClassifier()
my_knn.fit(X_train.to_numpy(), y_train.to_numpy())
my_knn_predictions = my_knn.predict(X_val.to_numpy())
gini_score(y_val.to_numpy(), my_knn_predictions)

0.18907557460683533

In [79]:
class MyNaiveBayesClassifier:
    def __init__(self):
        self.class_probabilities = None
        self.feature_probabilities = None
        self.uniq_classes = None

    def fit(self, X, y):
        self.uniq_classes = np.unique(y)
        num_classes = X.shape[0]
        num_features = X.shape[1]

        self.class_probabilities = np.zeros(num_classes)
        self.feature_probabilities = np.zeros((num_classes, num_features))

        for i, c in enumerate(self.uniq_classes):
            X_c = X[y == c]
            self.class_probabilities[i] = X_c.shape[0] / num_classes
            self.feature_probabilities[i] = np.mean(X_c, axis=0)

    def predict_proba(self, X):
        num_samples = X.shape[0]
        count_classes = len(self.uniq_classes)

        predictions = np.zeros((num_samples, count_classes))

        for i in range(count_classes):
            class_probability = self.class_probabilities[i]
            feature_probability = self.feature_probabilities[i]
            predictions[:, i] = np.prod(X * feature_probability + (1 - X) * (1 - feature_probability), axis=1) * class_probability

        return predictions / np.sum(predictions, axis=1, keepdims=True)

    def predict(self, X):
        probabilities = self.predict_proba(X)
        return np.argmax(probabilities, axis=1)

In [80]:
%%time
my_nb = MyNaiveBayesClassifier()
my_nb.fit(X_train.to_numpy(), y_train.to_numpy())
my_nb_predictions = my_nb.predict(X_val.to_numpy())
my_nb_gini_score = 2 * roc_auc_score(y_val.to_numpy(), my_nb_predictions) - 1
my_nb_gini_score

CPU times: user 642 ms, sys: 499 ms, total: 1.14 s
Wall time: 1.6 s


0.2008982122055123

# 7. Try to create non-linear features,
for example:

fractions: feature1/feature2
groupby features: df[‘categorical_feature’].map(df.groupby(‘categorical_feature’)[‘continious_feature’].mean())

Add new features into your pipeline, repeat step 4. Did you manage to increase your Gini score (you should!)?

In [132]:
X_train_nl = X_train.copy()
X_val_nl = X_val.copy()
X_test_nl = X_test.copy()

In [190]:
X_train_nl['fraction_feature'] = pd.DataFrame(train['MMRCurrentRetailAveragePrice'] / train['WarrantyCost']).iloc[:,0].values
X_train_nl['fraction_feature2'] = pd.DataFrame(train['Color'].astype('category').cat.codes / train['WarrantyCost']).iloc[:,0].values
X_train_nl['grouped_mean'] = pd.DataFrame(train['Auction'] \
                                          .map(train.groupby('Auction')['VehicleAge'] \
                                               .mean())).iloc[:,0].values
X_train_nl['grouped_mean2'] = pd.DataFrame(train['Color'] \
                                          .map(train.groupby('Color')['VehYear'] \
                                               .mean())).iloc[:,0].values


In [191]:
X_val_nl['fraction_feature'] = pd.DataFrame(val['MMRCurrentRetailAveragePrice'] / val['WarrantyCost']).iloc[:,0].values
X_val_nl['fraction_feature2'] = pd.DataFrame(val['Color'].cat.codes / val['WarrantyCost']).iloc[:,0].values
X_val_nl['grouped_mean'] = pd.DataFrame(val['Auction'] \
                                          .map(val.groupby('Auction')['VehicleAge'] \
                                               .mean())).iloc[:,0].values
X_val_nl['grouped_mean2'] = pd.DataFrame(val['Color'] \
                                          .map(val.groupby('Color')['VehYear'] \
                                               .mean())).iloc[:,0].values

In [192]:
minmax = MinMaxScaler()
cols_for_minmax = ['fraction_feature',
                   'fraction_feature2',
                   'grouped_mean',
                   'grouped_mean2']
X_train_nl[cols_for_minmax] = minmax.fit_transform(X_train_nl[cols_for_minmax])

X_val_nl[cols_for_minmax] = minmax.transform(X_val_nl[cols_for_minmax])
# X_test[cols_for_minmax] = minmax.transform(X_test[cols_for_minmax])

In [193]:
%%time
logreg_nl = LogisticRegression()
logreg_nl.fit(X_train_nl, y_train)
lr_nl_predictions = logreg_nl.predict_proba(X_val_nl)[:, 1]

print(f"new: {(2 * roc_auc_score(y_val, lr_nl_predictions) - 1)} \nold: {lr_gini_score}")

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


new: 0.47394948186121333 
old: 0.4673323431788501
CPU times: user 15.7 s, sys: 1.54 s, total: 17.3 s
Wall time: 9.78 s


In [195]:
%%time
gnb_nl = GaussianNB()
gnb_nl.fit(X_train_nl, y_train)
gnb_nl_predictions = gnb_nl.predict_proba(X_val_nl)[:, 1]

print(f"new: {(2 * roc_auc_score(y_val, gnb_nl_predictions) - 1)} \nold: {gnb_gini_score}")

new: 0.06498356195479849 
old: 0.06498356195479849
CPU times: user 1.23 s, sys: 749 ms, total: 1.98 s
Wall time: 2.01 s


In [194]:
%%time
knn_nl = KNeighborsClassifier()
knn_nl.fit(X_train_nl, y_train)
knn_nl_predictions = knn_nl.predict_proba(X_val_nl)[:, 1]

print(f"new: {(2 * roc_auc_score(y_val, knn_nl_predictions) - 1)} \nold: {knn_gini_score}")

new: 0.3531454804727385 
old: 0.3476430514092421
CPU times: user 2min 18s, sys: 472 ms, total: 2min 18s
Wall time: 1min 25s


# 8. Detect the best features for the problem using coefficients of the Logistic model.
Try to eliminate useless features by hand and by using L1 regularization. Which approach is better in terms of Gini score?

In [196]:
%%time
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

lr_predictions = logreg.predict_proba(X_val)[:, 1]
gini_score(y_val.to_numpy(), lr_predictions)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


CPU times: user 16.9 s, sys: 1.08 s, total: 17.9 s
Wall time: 10.2 s


0.46733234317884986

Выбираем топ500 по значимости веса

In [202]:
important_features = logreg.coef_[0].argsort()[-500:]

обучаем на них

In [203]:
%%time
X_train_important = X_train.iloc[:, important_features]
X_val_important = X_val.iloc[:, important_features]

logreg_important = LogisticRegression()
logreg_important.fit(X_train_important, y_train)

y_pred_important = logreg_important.predict(X_val_important)
gini_score(y_val.to_numpy(), y_pred_important)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


CPU times: user 4.4 s, sys: 936 ms, total: 5.34 s
Wall time: 5.74 s


0.10315121215416201

Отбираем по l1 регуляризации

In [209]:
model = LogisticRegression(penalty='l1', solver='liblinear')
model.fit(X_train, y_train)

coefficients = model.coef_[0]

non_zero_features = np.where(coefficients != 0)[0]

X_train_selected = X_train.iloc[:, non_zero_features]
X_val_selected = X_val.iloc[:, non_zero_features]

logreg_l1 = LogisticRegression()
logreg_l1.fit(X_val_selected, y_train)

lr_predictions = logreg_l1.predict_proba(X_val_selected)[:, 1]
print(gini_score(y_val.to_numpy(), lr_predictions))

0.2670649064244721


по l1 лучше отбор получается чем в ручную по весам, но все это хуже чем изначальный вариант

# 9. *Try to apply non-linear variants of SVM,
use the RAPIDS library if you have access to GPU. In other cases, use sklearn SVC with a non-linear kernel. If the training process needs too much time or memory try to subsample training data. Are you able to receive a better Gini score (on valid dataset) with this approach?

In [212]:
%%time
model = svm.SVC(kernel='rbf', probability=True)

model.fit(X_train, y_train)

y_pred = model.predict_proba(X_val)[:, 1]
gini_score(y_val.to_numpy(), y_pred)


CPU times: user 26min 28s, sys: 5.35 s, total: 26min 33s
Wall time: 26min 42s


0.19815381312340974

нет получилось не лучше чем логрег

# 10. Select your best model
 (algorithm + feature set) and tweak its hyperparameters to increase Gini score on the validation dataset.
Which hyperparameters are the most impactful?

лучше всех себя показала лог регресиия, ее и буду максимизировать по гиперпараметрам

In [60]:
%%time
learning_rates = [0.01, 0.001, 0.0005]
num_iters = [50, 100, 200]

best_gini_score = -1
best_lr = None
best_num_iter = None

for lr in learning_rates:
    for num_iter in num_iters:
        # print(f'started: lr={lr}, n_iter={num_iter}')
        model = MyLogisticRegression(learning_rate=lr, num_iter=num_iter)
        model.fit(X_train, y_train)
        y_pred_proba = model.predict_proba(X_val)
        current_gini_score = gini_score(y_val.to_numpy(), y_pred_proba)
        print(f'Finished: lr={lr}, n_iter={num_iter}, gini Score:{current_gini_score}')

        if current_gini_score > best_gini_score:
            best_gini_score = current_gini_score
            best_lr = lr
            best_num_iter = num_iter

print("Best Gini Score:", best_gini_score)
print("Best Learning Rate:", best_lr)
print("Best Number of Iterations:", best_num_iter)

Finished: lr=0.01, n_iter=50, gini Score:-0.05541681584307945
Finished: lr=0.01, n_iter=100, gini Score:-0.05541681584307945
Finished: lr=0.01, n_iter=200, gini Score:-0.05541681584307945
Finished: lr=0.001, n_iter=50, gini Score:-0.05541681584307945
Finished: lr=0.001, n_iter=100, gini Score:0.308373205203476
Finished: lr=0.001, n_iter=200, gini Score:-0.05541681584307945
Finished: lr=0.0005, n_iter=50, gini Score:-0.05541681584307945
Finished: lr=0.0005, n_iter=100, gini Score:0.3083725497285372
Finished: lr=0.0005, n_iter=200, gini Score:-0.05541681584307945
Best Gini Score: 0.308373205203476
Best Learning Rate: 0.001
Best Number of Iterations: 100
CPU times: user 5min 5s, sys: 7min 38s, total: 12min 43s
Wall time: 8min 54s


# 11. Check Gini scores
 on all three datasets for your best model: train Gini, valid Gini, test Gini. Can you see any drop in performance when comparing valid quality vs test quality? Is your model overfitted or not? Explain.

In [62]:
%%time
best_model = MyLogisticRegression(learning_rate=0.001, num_iter=100)
best_model.fit(X_train, y_train)

print("Train Gini: ", gini_score(y_train.to_numpy(), best_model.predict_proba(X_train)))
print("Val Gini  : ", gini_score(y_val.to_numpy(), best_model.predict_proba(X_val)))
print("Test Gini : ", gini_score(y_test.to_numpy(), best_model.predict_proba(X_test)))


Train Gini:  0.33285295867409026
Val Gini  :  0.308373205203476
Test Gini :  0.3240141147204838
CPU times: user 29.1 s, sys: 42.9 s, total: 1min 12s
Wall time: 52.7 s


Модель не переобучена, тк показатель Джини довольно стабилен

# 12. Implement calculation or Recall, Precision, F1 score and AUC PR metrics.
Compare your algorithms on the test dataset using AUC PR metric.

In [71]:
def calculate_recall(y_true, y_pred):
    true_positives = np.sum(np.logical_and(y_true == 1, y_pred == 1))
    actual_positives = np.sum(y_true == 1)
    recall = true_positives / actual_positives
    return recall

def calculate_precision(y_true, y_pred):
    true_positives = np.sum(np.logical_and(y_true == 1, y_pred == 1))
    predicted_positives = np.sum(y_pred == 1)
    precision = true_positives / predicted_positives
    return precision

def calculate_f1_score(y_true, y_pred):
    recall = calculate_recall(y_true, y_pred)
    precision = calculate_precision(y_true, y_pred)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return f1_score

def calculate_auc_pr(y_true, y_pred_proba):
    sorted_indices = np.argsort(y_pred_proba)[::-1]
    y_true_sorted = y_true[sorted_indices]
    y_pred_proba_sorted = y_pred_proba[sorted_indices]
    precision = np.cumsum(y_true_sorted) / np.arange(1, len(y_true_sorted) + 1)
    auc_pr = np.sum(precision * y_true_sorted) / np.sum(y_true_sorted)
    return auc_pr

In [72]:
calculate_auc_pr(y_test.to_numpy(), my_logreg.predict_proba(X_test))

0.2237640684953177

In [70]:
precision, recall, thresholds = precision_recall_curve(y_test.to_numpy(), my_logreg.predict_proba(X_test))
auc(recall, precision)

0.2235992430399681

# 13. Which hard label metric do you prefer for the task of detecting “lemon” cars?

In [74]:
logreg.coef_[0].argsort()[-1:]

array([3])

In [78]:
X_train.iloc[:, logreg.coef_[0].argsort()[-1:]].columns[0]

'WheelType'

самый значимы признак для определения "кислой" машины это "WheelType"