**Import Libraries and Datasets**

In [81]:
import warnings
import numpy as np
import pandas as pd
import time

#some settings to show data
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', 50)

#import dataset
audit_risk = pd.read_csv("datasets/audit_risk.csv")
trial = pd.read_csv("datasets/trial.csv")

**Show Data**

In [82]:
audit_risk.head(10)

In [83]:
trial.head(10)

**Let's See the Values of the Two Datasets**

In [84]:
audit_risk.describe()

In [85]:
trial.describe()

**Analysis**

As you can see, the two datasets are nearly identical, with a few differences. 
First, SCORE_A and SCORE_B in trial are 10 times the Score_A and Score_B values in audit_risk, and their names are in upper case. 
Second, the Loss and Risk columns in trial differ completely from those in audit_risk.

First, rename the trial columns to match the audit_risk names, then divide Score_A and Score_B by 10:

In [86]:
trial.columns = ['Sector_score', 'LOCATION_ID', 'PARA_A', 'Score_A', 'PARA_B',
                 'Score_B', 'TOTAL', 'numbers', 'Marks',
                 'Money_Value', 'MONEY_Marks', 'District',
                 'Loss', 'LOSS_SCORE', 'History', 'History_score', 'Score', 'Risk_trial']

In [87]:
trial['Score_A'] = trial['Score_A'] / 10
trial['Score_B'] = trial['Score_B'] / 10
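The factor-of-10 claim can be sanity-checked before merging. The following is a minimal sketch on made-up values: `audit_toy` and `trial_toy` are hypothetical stand-ins, not rows from the real files.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for a few rows of audit_risk and trial
audit_toy = pd.DataFrame({"Score_A": [0.2, 0.4, 0.6], "Score_B": [0.2, 0.6, 0.4]})
trial_toy = pd.DataFrame({"SCORE_A": [2, 4, 6], "SCORE_B": [2, 6, 4]})

# Rename the upper-case columns, then rescale them to the audit_risk range
trial_toy = trial_toy.rename(columns={"SCORE_A": "Score_A", "SCORE_B": "Score_B"})
for col in ["Score_A", "Score_B"]:
    trial_toy[col] = trial_toy[col] / 10

# np.allclose tolerates floating-point rounding introduced by the division
scores_match = np.allclose(audit_toy[["Score_A", "Score_B"]],
                           trial_toy[["Score_A", "Score_B"]])
print(scores_match)
```

If the check fails on the real data, the merge on Score_A and Score_B would silently produce unmatched rows, so it is worth running first.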

**Find the columns the two datasets share**

In [88]:
same_columns = np.intersect1d(audit_risk.columns, trial.columns)
same_columns

**Let's merge the two datasets on their shared columns**

In [89]:
merged_df = pd.merge(audit_risk, trial, how='outer',
                     on=['History', 'LOCATION_ID', 'Money_Value', 'PARA_A', 'PARA_B', 'Score', 'Score_A', 'Score_B',
                         'Sector_score', 'TOTAL', 'numbers'])
merged_df.columns

**Analysis**

As you can see, some values of Risk_trial (from trial) and Risk (from audit_risk) differ. We keep the Risk column from audit_risk because, according to https://api.openml.org/d/42931, Risk is the target of the audit_risk dataset. So we delete Risk_trial.

In [90]:
df = merged_df.drop(['Risk_trial'], axis=1)
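Before dropping a label column, it can help to count how often the two labels disagree on rows present in both files. This is a sketch on toy frames: `left`, `right` and their values are hypothetical, and only the `indicator=True` option of `pd.merge` is taken from pandas itself.

```python
import pandas as pd

# Hypothetical miniature versions of audit_risk (left) and trial (right)
left = pd.DataFrame({"PARA_A": [1.0, 2.0, 3.0], "Risk": [1, 0, 1]})
right = pd.DataFrame({"PARA_A": [1.0, 2.0, 4.0], "Risk_trial": [1, 1, 0]})

# indicator=True adds a _merge column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, how="outer", on="PARA_A", indicator=True)

# Compare the labels only on rows that matched in both frames
both = merged[merged["_merge"] == "both"]
n_disagree = int((both["Risk"] != both["Risk_trial"]).sum())

print(merged["_merge"].value_counts())
print("disagreeing labels:", n_disagree)
```

The `_merge` counts also show how many rows exist in only one of the two files, which an outer merge would otherwise hide behind NaN values.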

Check null values

In [91]:
df.isnull().sum()

As you can see, the Money_Value column has a null value. Fill it with the column median:

In [92]:
df['Money_Value'] = df['Money_Value'].fillna(df['Money_Value'].median())
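The cell above fills the gap with the median. A small sketch of why that can be preferable to the mean when a column contains large outliers (the `money` values below are invented for illustration):

```python
import numpy as np
import pandas as pd

# Invented, skewed values with one missing entry and one large outlier
money = pd.Series([1.0, 2.0, 3.0, np.nan, 1000.0])

mean_filled = money.fillna(money.mean())      # the outlier pulls the mean upward
median_filled = money.fillna(money.median())  # the median ignores the outlier's size

print("filled with mean:  ", mean_filled[3])
print("filled with median:", median_filled[3])
```

Here the mean-based fill lands far above the typical values, while the median-based fill stays among them.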

The Detection_Risk column carries the same values as the Risk column, so delete it.

In [93]:
df = df.drop(['Detection_Risk'], axis=1)
df.info()

So far, everything looks good. Let's look at LOCATION_ID:

In [94]:
df["LOCATION_ID"].unique()

If you go through the values shown, you will see some non-numeric values at the end: LOHARU, NUH and SAFIDON. How many rows contain them?

In [95]:
len(df[(df["LOCATION_ID"] == 'LOHARU') | (df["LOCATION_ID"] == 'NUH') | (df["LOCATION_ID"] == 'SAFIDON')])

In [96]:
len(df)

Only 3 rows have non-numeric values, so they can safely be deleted; I delete them.

In [97]:
df = df[(df.LOCATION_ID != 'LOHARU')]
df = df[(df.LOCATION_ID != 'NUH')]
df = df[(df.LOCATION_ID != 'SAFIDON')]

In [98]:
len(df)
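The three names were found by reading the output by eye. A more general way to locate non-numeric entries, sketched here on a hypothetical `loc` series, is `pd.to_numeric` with `errors="coerce"`, which turns anything unparseable into NaN:

```python
import pandas as pd

# Hypothetical LOCATION_ID values: mostly numeric strings plus a few place names
loc = pd.Series(["23", "6", "LOHARU", "11", "NUH", "SAFIDON"])

# Unparseable entries become NaN instead of raising an error
as_number = pd.to_numeric(loc, errors="coerce")

# The original strings that failed to parse
non_numeric = loc[as_number.isna()].tolist()

# Keep only the parseable rows and store them as integers
cleaned = as_number.dropna().astype(int)

print(non_numeric)
print(cleaned.tolist())
```

This also converts the column to a numeric dtype, which filtering by name alone does not do.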

I also drop duplicate rows:

In [99]:
df = df.drop_duplicates(keep='first')
print(f"Rows: {len(df)}")

I drop highly correlated columns:

In [100]:
import seaborn as sns

corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')
# 'RdBu_r' & 'BrBG' are other good diverging colormaps
cm = sns.diverging_palette(220, 20, sep=20, as_cmap=True)
corr.style.background_gradient(cmap=cm)

In [101]:
df = df[['Risk_A', 'Risk_B', 'Risk_C', 'Risk_D', 'RiSk_E', 'Prob', 'Score', 'CONTROL_RISK', 'Audit_Risk', 'Risk', 'MONEY_Marks', 'Loss']]
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')
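The notebook picks the columns to keep by hand from the heatmap. A common programmatic alternative, shown here only as a sketch on synthetic data, scans the upper triangle of the absolute correlation matrix and drops any column that exceeds a cut-off. The `threshold` of 0.95 and the `toy` frame are assumptions, not values from the notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "a_scaled": 3 * a + 1,        # linear function of a, so correlation is 1
    "b": rng.normal(size=200),    # independent noise
})

corr_abs = toy.corr().abs()

# Keep only the strict upper triangle so each pair is inspected once
mask = np.triu(np.ones(corr_abs.shape, dtype=bool), k=1)
upper = corr_abs.where(mask)

threshold = 0.95  # hypothetical cut-off
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop)
print(reduced.columns.tolist())
```

Of each correlated pair, this keeps the column that appears first, so the column order influences which one survives.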

In [102]:
df

# Data Cleaning Is Done

# Implementing KNN

In [103]:
import math

# Define a function to calculate the Euclidean distance between two points
def euclidean_distance(x1, x2):
    return math.sqrt(np.sum((x1 - x2) ** 2))
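A quick worked check of the distance function on a 3-4-5 right triangle, compared with NumPy's built-in norm. The points `p` and `q` are made up for the example.

```python
import math
import numpy as np

def euclidean_distance(x1, x2):
    # Square root of the summed squared coordinate differences
    return math.sqrt(np.sum((x1 - x2) ** 2))

p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])

d = euclidean_distance(p, q)     # sqrt(3^2 + 4^2)
d_norm = np.linalg.norm(p - q)   # same quantity via NumPy

print(d, d_norm)
```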

In [104]:
# Define the KNN function
def knn_classification_with_euclidean_distance(X_train, y_train, X_test, k):
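    # Work on NumPy arrays so that positional indexing (X_train[j], y_train[j]) is safe
    X_train, y_train, X_test = np.asarray(X_train), np.asarray(y_train), np.asarray(X_test)
    # List to store the predicted labels for the test set
    y_pred = []

    for i in range(len(X_test)):
        # Start a fresh distance list for every test point; otherwise
        # distances from earlier test points leak into this point's neighbors
        distances = []
        for j in range(len(X_train)):
            # Calculate the distance between the two points using euclidean_distance, defined above
            dist = euclidean_distance(X_test[i], X_train[j])
            distances.append((dist, y_train[j]))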

        distances.sort()
        neighbors = distances[:k] # Get the k nearest neighbors

        counts = {} # Count the votes for each class
        for neighbor in neighbors:
            label = neighbor[1]
            if label in counts:
                counts[label] += 1
            else:
                counts[label] = 1

        max_count = max(counts, key=counts.get) # Get the class with the most votes
        y_pred.append(max_count)

    return y_pred
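The same majority-vote idea can be written more compactly by computing all distances for one query point at once. This is a sketch, not the notebook's implementation: `knn_classify_sketch` and the toy arrays are hypothetical names introduced for illustration.

```python
from collections import Counter

import numpy as np

def knn_classify_sketch(X_train, y_train, X_test, k):
    """Hypothetical compact variant: majority vote among the k nearest points."""
    preds = []
    for x in X_test:
        # Euclidean distance from x to every training row, recomputed per query
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
        nearest = np.argsort(dists)[:k]            # indices of the k closest rows
        votes = Counter(y_train[nearest].tolist())
        preds.append(votes.most_common(1)[0][0])   # label with the most votes
    return preds

# Two well-separated toy clusters with labels 0 and 1
X_tr = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                 [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
y_tr = np.array([0, 0, 0, 1, 1, 1])
X_te = np.array([[0.05, 0.05], [5.0, 5.1]])

toy_pred = knn_classify_sketch(X_tr, y_tr, X_te, k=3)
print(toy_pred)
```

Because the distances are rebuilt inside the loop, each prediction depends only on its own query point.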

In [105]:
# Define a function to calculate the Manhattan distance between two points
def manhattan_distance(x1, x2):
    return np.sum(np.abs(x1 - x2))

In [106]:
def knn_regressor_with_manhattan_distance(X_train, y_train, X_test, k):
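    # Work on NumPy arrays so that positional indexing is safe
    X_train, y_train, X_test = np.asarray(X_train), np.asarray(y_train), np.asarray(X_test)
    y_pred = []

    for i in range(len(X_test)):
        # Reset per test point so neighbors come only from this point's distances
        distances = []
        for j in range(len(X_train)):
            # Calculate the distance between the two points using manhattan_distance, defined above
            dist = manhattan_distance(X_test[i], X_train[j])
            distances.append((dist, y_train[j]))

        distances.sort()
        neighbors = distances[:k]  # Get the k nearest neighbors

        # Average only the target values; np.mean(neighbors) would also average in the distances
        mean_val = np.mean([target for _, target in neighbors])
        y_pred.append(mean_val)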

    return y_pred
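For regression, the prediction is the mean of the neighbors' target values only. A compact sketch with Manhattan distance on a one-feature toy set (`knn_regress_sketch` and the arrays are hypothetical, introduced for illustration):

```python
import numpy as np

def knn_regress_sketch(X_train, y_train, X_test, k):
    """Hypothetical compact variant: mean target of the k nearest points."""
    preds = []
    for x in X_test:
        # Manhattan distance: sum of absolute coordinate differences
        dists = np.abs(X_train - x).sum(axis=1)
        nearest = np.argsort(dists)[:k]
        # Average the targets of the neighbors, not their distances
        preds.append(float(y_train[nearest].mean()))
    return preds

X_tr = np.array([[0.0], [1.0], [2.0], [10.0]])
y_tr = np.array([10.0, 20.0, 30.0, 100.0])

# The three nearest points to x = 1 are x = 0, 1, 2 with targets 10, 20, 30
reg_pred = knn_regress_sketch(X_tr, y_tr, np.array([[1.0]]), k=3)
print(reg_pred)
```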

**Preprocessing is finished. Next, I apply the functions, starting with Part 1**

In [107]:
from sklearn.model_selection import train_test_split

class_df = df.drop("Audit_Risk", axis=1)
classification_X = class_df.drop(["Risk"], axis=1)
classification_y = class_df["Risk"]

**I split the data into 70% train and 30% test with the train_test_split function from sklearn.model_selection**

In [108]:
X_train, X_test, y_train, y_test = train_test_split(classification_X, classification_y, test_size=0.3, random_state=42)
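For a classification target, `train_test_split` also accepts a `stratify` argument that keeps the class proportions equal in both parts. The notebook does not use it; the sketch below shows the option on a balanced toy set, with `X_toy` and `y_toy` as hypothetical data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 10 samples of class 0 and 10 of class 1
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# stratify=y_toy preserves the 50/50 class ratio in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)

print("train class counts:", np.bincount(y_tr))
print("test class counts: ", np.bincount(y_te))
```

Stratifying matters most when one class is rare, because a plain random split can leave the test set with very few examples of it.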

# PART 2

In [114]:
bike = pd.read_csv("datasets/day.csv")  # read_csv already returns a DataFrame
print(bike.head())
print(bike.info())
print(bike.describe())
print(bike.shape)

# Conclusion of Data Analysis

The dataset has 730 rows and 16 columns.
All columns except one are of float or integer type.
The remaining column (dteday) holds dates.

Looking at the data, some fields appear to be categorical even though they are stored as integers/floats.
We will analyse them and decide whether to treat them as categorical.

In [115]:
round(100 * (bike.isnull().sum() / len(bike)), 2).sort_values(ascending=False)

In [116]:
# Row-wise missing percentage: divide by the number of columns, not the number of rows
round((bike.isnull().sum(axis=1) / bike.shape[1]) * 100, 2).sort_values(ascending=False)

**Analysis**

There are no missing/null values in either the columns or the rows.

In [117]:
bike_dup = bike.copy()
bike_dup = bike_dup.drop_duplicates(keep='first')
# equivalent in-place form: bike_dup.drop_duplicates(subset=None, inplace=True)

print(bike_dup.shape)
print(bike.shape)

**Analysis**

The shape after dropping duplicates is the same as that of the original dataframe.
Hence we can conclude that the dataset contains no duplicate rows.
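Comparing shapes works; counting duplicates directly with `DataFrame.duplicated` gives the same information without making a copy. A sketch on a hypothetical three-row frame:

```python
import pandas as pd

# Hypothetical frame in which the second row repeats the first
toy = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# duplicated() marks every repeat of an earlier row as True
n_dupes = int(toy.duplicated().sum())

deduped = toy.drop_duplicates(keep="first")

print("duplicate rows:", n_dupes)
print("rows after dropping:", len(deduped))
```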

In [118]:
bike_dummy = bike.iloc[:, 1:16]

for col in bike_dummy:
    print(bike_dummy[col].value_counts(ascending=False), '\n\n\n')

**As you can see in the output above:**

The instant, dteday, casual and registered columns are not needed, so we can remove them, because:

*instant* : it is just a row index,
*dteday* : it holds the date, whose year, month and weekday are already available in the yr, mnth and weekday columns,
*casual & registered* : I don't consider the columns that split the bike count by customer category, since our objective is to predict the total bike count.

In [119]:
bike_new = bike[
    ['season', 'yr', 'mnth', 'holiday', 'weekday', 'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed', 'cnt']].copy()
# .copy() avoids SettingWithCopyWarning when columns are reassigned later
# alternative: bike_new = bike[['temp', 'atemp', 'hum', 'windspeed', 'cnt']] keeps only the numerical columns

In [120]:
bike_new.info()
bike_new.head()

**Creating Dummy Variables**

I could drop all the categorical data and keep only the numerical ['temp', 'atemp', 'hum', 'windspeed', 'cnt'] columns. Instead, I create dummy variables. 
Dummy variables are useful because they let us include categorical variables in the analysis, which would otherwise be difficult because their codes are not genuinely numeric. 
They can also help us control for confounding factors and improve the validity of our results.

In [121]:
bike_new['season'] = bike_new['season'].astype('category')
bike_new['weathersit'] = bike_new['weathersit'].astype('category')
bike_new['mnth'] = bike_new['mnth'].astype('category')
bike_new['weekday'] = bike_new['weekday'].astype('category')

In [122]:
bike_new = pd.get_dummies(bike_new, drop_first=True)
bike_new.info()
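To make the effect of `drop_first=True` concrete, here is a sketch on a hypothetical two-column frame. With four season codes, only three indicator columns are produced; the dropped first level (season 1) is the baseline, represented by all three indicators being zero.

```python
import pandas as pd

# Hypothetical miniature of bike_new with one categorical and one numeric column
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1],
                    "temp": [0.3, 0.5, 0.7, 0.4, 0.2]})
toy["season"] = toy["season"].astype("category")

# Only category-typed columns are expanded; temp is left untouched
encoded = pd.get_dummies(toy, drop_first=True)

print(encoded.columns.tolist())
print(encoded.head())
```

Dropping one level avoids perfectly collinear indicator columns, which matters for linear models; for a distance-based method such as KNN it mainly reduces the number of features.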

# Go

Splitting the data into train and test sets: I now split the data into TRAIN and TEST with a 70:30 ratio.

In [123]:
from sklearn.model_selection import train_test_split

regression_df = bike_new
regression_X = regression_df.drop(["cnt"], axis=1)
regression_y = regression_df["cnt"]

X_train, X_test, y_train, y_test = train_test_split(regression_X, regression_y, test_size=0.3, random_state=42)

**Perform K-NN for k = 3 With K-fold Cross Validation**

In [124]:
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert y_train and y_test to numpy arrays for use in knn_regressor_with_manhattan_distance
y_train = np.array(y_train)
y_test = np.array(y_test)

kf = KFold(n_splits=6, shuffle=True, random_state=42)

r2_values = []
mse_values = []
rmse_values = []

# Perform k-fold cross-validation
for train_index, val_index in kf.split(X_train_scaled):
    X_train_fold, X_val_fold = X_train_scaled[train_index], X_train_scaled[val_index]
    y_train_fold, y_val_fold = y_train[train_index], y_train[val_index]


    start_time = time.time()
    # Predict using KNN regression
    y_pred_fold = knn_regressor_with_manhattan_distance(X_train_fold, y_train_fold, X_val_fold, 3)
    end_time = time.time()
    
    # mean squared error, r2 score
    # r2 = r2_score(y_val_fold, y_pred_fold)
    mse = mean_squared_error(y_val_fold, y_pred_fold)
    rmse = np.sqrt(mse)

    # r2_values.append(r2)
    mse_values.append(mse)
    rmse_values.append(rmse)
    
# Average the metrics over the folds
# print("Average R2 Score:", np.mean(r2_values))
print(f"Average Mean Squared Error: {np.mean(mse_values)}")
print(f"Average Root Mean Squared error: {np.mean(rmse_values)}")
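The loop above records `start_time` and `end_time` for each fold but never reports them. One way to collect runtimes, sketched with a hypothetical `timed` helper and a stand-in workload, is to wrap the call and return the elapsed time alongside the result:

```python
import time

def timed(fn, *args, **kwargs):
    """Hypothetical helper: run fn and return (result, elapsed seconds)."""
    start = time.perf_counter()   # monotonic clock intended for interval timing
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would be the KNN call for one fold
result, elapsed = timed(sum, range(1_000_000))

print(f"result={result}, elapsed={elapsed:.4f}s")
```

Appending `elapsed` to a list inside the cross-validation loop would allow an average runtime per fold to be printed next to the error metrics.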

In [125]:
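from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# cnt is a continuous target, so compare against scikit-learn's KNN regressor
# (k = 3, Manhattan distance) rather than a classifier scored with accuracy
knn = KNeighborsRegressor(n_neighbors=3, metric='manhattan')
knn.fit(X_train_scaled, y_train)
k_neighbors_predictions = knn.predict(X_test_scaled)
mean_squared_error(y_test, k_neighbors_predictions)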

**Runtime Performance**

Mean Squared Error (MSE) is a metric commonly used to evaluate a regression model. It is the average of the squared errors, 
where an error is the difference between an actual value and the corresponding predicted value. 

A smaller MSE indicates better agreement between the predicted and actual values, 
whereas a larger MSE suggests poorer model performance.
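A small worked example of the definition, computed by hand with NumPy on invented numbers (`y_true` and `y_hat` are not results from the notebook):

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0])   # invented actual values
y_hat = np.array([2.0, 5.0, 9.0])    # invented predictions

errors = y_true - y_hat              # [1, 0, -2]
mse = np.mean(errors ** 2)           # (1 + 0 + 4) / 3
rmse = np.sqrt(mse)                  # back in the units of the target

print(mse, rmse)
```

Because the errors are squared, the single error of 2 contributes four times as much as the error of 1, which is why MSE is sensitive to large misses. RMSE is often reported alongside it because it is expressed in the same units as the target.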