In [6]:
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score

# Load the dataset 
file_path = '/Users/jean-paulhendriksen/Documents/Data Driven Decision Making in Business/DataMining/Amazon Sale Report.csv'
data = pd.read_csv(file_path)
data = data.drop(columns=['index', 'Order ID', 'promotion-ids', 'Unnamed: 22'], errors='ignore')
data = data.dropna(subset=['Amount', 'Qty', 'Category', 'Status'])

# Convert categorical columns to numerical values
categorical_columns = [col for col in ['Status', 'Category', 'Sales Channel', 'ship-country'] if col in data.columns]
data = pd.get_dummies(data, columns=categorical_columns, drop_first=True)

X = data[['Qty', 'Amount'] + [col for col in data.columns if 'Status_' in col or 'Category_' in col]]
y = data['B2B'].astype(int)

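# Apply random oversampling to balance the classes.
# Caveat: resampling before the split lets duplicated minority rows land in both
# the training and test sets, so the reported scores may be optimistic.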
oversampler = RandomOverSampler(random_state=42)
X_resampled, y_resampled = oversampler.fit_resample(X, y)

# Split the resampled data
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.3, random_state=42)

# Scale features and train the KNN model
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Evaluate
y_pred = knn.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))




Accuracy: 0.7275182471642453
Classification Report:
               precision    recall  f1-score   support

           0       0.75      0.68      0.71     35963
           1       0.71      0.77      0.74     36240

    accuracy                           0.73     72203
   macro avg       0.73      0.73      0.73     72203
weighted avg       0.73      0.73      0.73     72203
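
One caveat about the cell above: `RandomOverSampler` runs before `train_test_split`, so copies of the same minority row can end up in both the training and the test set. With a nearest-neighbour model this can inflate the scores. The sketch below illustrates the alternative order (split first, then oversample only the training part) using the standard library alone so it runs anywhere. The function `split_then_oversample` and the toy data are hypothetical and not part of the notebook. In the notebook itself the same idea would amount to calling `train_test_split` on `X, y` and then `fit_resample` on the training portion only.

```python
import random
from collections import Counter

def split_then_oversample(rows, labels, test_size=0.3, seed=42):
    """Hold out a test set first, then oversample only the training part."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)

    # The test set is carved out before any row is duplicated.
    n_test = int(round(len(indices) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    # Group the training indices by class label.
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)

    # Duplicate randomly chosen rows of the smaller classes until every
    # class is as large as the largest one.
    target = max(len(members) for members in by_class.values())
    balanced_idx = []
    for members in by_class.values():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        balanced_idx.extend(members + extra)
    rng.shuffle(balanced_idx)

    X_train = [rows[i] for i in balanced_idx]
    y_train = [labels[i] for i in balanced_idx]
    X_test = [rows[i] for i in test_idx]
    y_test = [labels[i] for i in test_idx]
    return X_train, y_train, X_test, y_test

# Toy data (hypothetical): 90 unique rows of class 0 and 10 of class 1.
rows = [(i, i * 2.0) for i in range(100)]
labels = [0] * 90 + [1] * 10

X_train, y_train, X_test, y_test = split_then_oversample(rows, labels)
print("train class counts:", Counter(y_train))
print("test class counts: ", Counter(y_test))
```

With this order the test set keeps the original class distribution, so accuracy alone becomes less informative and the per-class precision and recall matter more.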





1. **Precision**: The fraction of samples predicted as a given class that actually belong to that class.
   - In this case:
     - **Class 0 (Non-B2B)**: 75% of the predicted non-B2B labels were correct.
     - **Class 1 (B2B)**: 71% of the predicted B2B labels were correct.
   
2. **Recall**: The fraction of samples that actually belong to a given class that the model correctly identified.
   - Here:
     - **Class 0**: The model correctly identified 68% of the actual non-B2B samples.
     - **Class 1**: The model correctly identified 77% of the actual B2B samples.

3. **F1-Score**: The harmonic mean of precision and recall, which summarizes both in a single number per class. It is more informative than accuracy when the classes are imbalanced.
   - **Class 0**: 0.71 F1-score, combining a precision of 0.75 with a lower recall of 0.68.
   - **Class 1**: 0.74 F1-score, combining a precision of 0.71 with a higher recall of 0.77.

4. **Accuracy**: The fraction of all test samples classified correctly, here 72.75% across both classes.

5. **Macro Avg**: The unweighted mean of the per-class precision, recall, and F1-score. Every class counts equally, regardless of how many samples it has.

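6. **Weighted Avg**: The mean of the per-class scores, weighted by each class’s support (its number of test samples). Because oversampling left the two classes nearly equal in size (35,963 vs. 36,240 test samples), the weighted average here is almost identical to the macro average.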
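
As a quick check on the definitions above, the F1-scores and both averages can be recomputed by hand from the rounded precision, recall, and support values in the report. This is only a sketch: because it starts from two-decimal inputs, its results agree with the report to about two decimals, not exactly. The names `report`, `f1`, `macro_f1`, and `weighted_f1` are illustrative.

```python
# Rounded values copied from the classification report above.
report = {
    0: {"precision": 0.75, "recall": 0.68, "support": 35963},
    1: {"precision": 0.71, "recall": 0.77, "support": 36240},
}

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1_scores = {label: f1(row["precision"], row["recall"])
             for label, row in report.items()}

total_support = sum(row["support"] for row in report.values())

# Macro average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1_scores[label] * report[label]["support"]
                  for label in report) / total_support

for label, score in f1_scores.items():
    print(f"class {label}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.4f}, weighted F1 = {weighted_f1:.4f}")
```

The per-class values round to 0.71 and 0.74, matching the report, and the macro and weighted averages differ by less than 0.001 because the two supports are almost equal.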

