Sure, I can help you with that. Here's some example code to build a KNN model for the given dataset:

In [None]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
# Load the data
df = pd.read_csv("repurchase_training.csv")

In [None]:
# Drop the age_band column so it is not used as a feature
df.drop("age_band", axis=1, inplace=True)
# Drop the ID column as it doesn't provide any valuable information
df.drop("ID", axis=1, inplace=True)

# Encode the categorical variables, keeping one fitted encoder per column
# so each mapping can be reapplied or inverted later
encoders = {}
for col in ["gender", "car_model", "car_segment"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Split the data into training and testing sets
X = df.drop("Target", axis=1)
y = df["Target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the KNN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = knn.predict(X_test)

# Evaluate the performance of the model
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

You can fine-tune the hyperparameters of the KNN model by adjusting the value of n_neighbors and experimenting with different distance metrics. You can also use cross-validation techniques to select the best value for n_neighbors.

Sure, here's an example of how you can fine-tune the hyperparameters of a KNN model in Python:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report, confusion_matrix

# Reuse X_train, X_test, y_train, y_test from the previous cell.
# Re-splitting X and y here would discard the feature scaling applied above.

# Define the KNN classifier
knn = KNeighborsClassifier()

# Define the grid of hyperparameters to search over
param_grid = {
    'n_neighbors': [3, 5, 7, 9],
    'weights': ['uniform', 'distance'],
    # 'minkowski' is omitted: with the default p=2 it is identical to 'euclidean'
    'metric': ['euclidean', 'manhattan']
}

# Perform a grid search with cross-validation to find the best hyperparameters
grid = GridSearchCV(knn, param_grid, cv=5, scoring='accuracy', verbose=1)
grid.fit(X_train, y_train)

# Print the best hyperparameters and the corresponding cross-validation score
print('Best hyperparameters:', grid.best_params_)
print('Cross-validation score:', grid.best_score_)

# Evaluate the performance of the best model on the test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print('Confusion matrix:\n', confusion_matrix(y_test, y_pred))
print('Classification report:\n', classification_report(y_test, y_pred))

In this example, we first define the KNN classifier and the hyperparameters to search over. We then use GridSearchCV to perform a grid search with cross-validation, which tests all combinations of hyperparameters in param_grid and finds the combination that produces the best cross-validation score. We print the best hyperparameters and cross-validation score, and then evaluate the performance of the best model on the test set using confusion_matrix and classification_report.

Note that this example assumes the earlier cells have already been run, so that the data is loaded, encoded, and split into X_train, X_test, y_train, and y_test.
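
As a variation on the grid search above, here is a minimal sketch that puts the scaler and the classifier in one Pipeline, so the scaling is refit on each cross-validation training fold rather than once on the whole training set. It runs on synthetic data from make_classification as a stand-in for the real dataset, and the names X_demo, y_demo, pipe, pipe_grid, and pipe_search are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and target (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling lives inside the pipeline, so it is refit on each training fold
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("knn", KNeighborsClassifier()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
pipe_grid = {
    "knn__n_neighbors": [3, 5, 7, 9],
    "knn__weights": ["uniform", "distance"],
}

pipe_search = GridSearchCV(pipe, pipe_grid, cv=5, scoring="accuracy")
pipe_search.fit(Xd_train, yd_train)

print("Best hyperparameters:", pipe_search.best_params_)
print("Test accuracy:", pipe_search.score(Xd_test, yd_test))
```

To try this on the real data, you would pass the unscaled training features to pipe_search.fit, since the pipeline handles the scaling itself.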

In [None]:
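import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import validation_curve, learning_curve

# Plotting the distribution of the target variable
# (the column is named "Target", matching the earlier cells)
sns.countplot(x='Target', data=df)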
plt.title('Target variable distribution')

# Plotting the correlation matrix
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(), cmap='coolwarm', annot=True)
plt.title('Correlation Matrix')

# Plotting the validation curve
param_range = range(1, 20, 2)
train_scores, test_scores = validation_curve(
    KNeighborsClassifier(), X_train, y_train,
    param_name="n_neighbors", param_range=param_range, cv=5
)
plt.figure(figsize=(12, 8))
plt.plot(param_range, np.mean(train_scores, axis=1), label='Training score')
plt.plot(param_range, np.mean(test_scores, axis=1), label='Cross-validation score')
plt.xlabel('Number of neighbors')
plt.ylabel('Score')
plt.legend()
plt.title('Validation Curve')

# Plotting the learning curve
train_sizes, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5
)
plt.figure(figsize=(12, 8))
plt.plot(train_sizes, np.mean(train_scores, axis=1), label='Training score')
plt.plot(train_sizes, np.mean(test_scores, axis=1), label='Cross-validation score')
plt.xlabel('Training set size')
plt.ylabel('Score')
plt.legend()
plt.title('Learning Curve')
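
The validation curve can also be read numerically instead of by eye. The sketch below picks the n_neighbors value with the highest mean cross-validation score. It uses synthetic data from make_classification as a stand-in for the real dataset, and the names X_vc, y_vc, k_values, vc_train, vc_cv, and best_k are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (assumption for illustration)
X_vc, y_vc = make_classification(n_samples=300, n_features=6, random_state=0)

# Same odd values of k as the plot above: 1, 3, ..., 19
k_values = np.arange(1, 20, 2)
vc_train, vc_cv = validation_curve(
    KNeighborsClassifier(), X_vc, y_vc,
    param_name="n_neighbors", param_range=k_values, cv=5,
)

# Each row holds the scores for one k, each column one CV fold
mean_cv = vc_cv.mean(axis=1)
best_k = int(k_values[np.argmax(mean_cv)])

print("Mean CV accuracy per k:", dict(zip(k_values.tolist(), mean_cv.round(3).tolist())))
print("Best n_neighbors by mean CV accuracy:", best_k)
```

A large gap between the training and cross-validation scores at small k usually points to overfitting, which is one reason to prefer the cross-validation curve when choosing k.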