## K Nearest Neighbours

In [1]:
# Imports

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,f1_score
from sklearn.utils import resample
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

### Data Science Workflow
1. Identify business questions
   - Data is only as good as the questions you ask. Many organizations spend millions collecting data of all kinds from different sources, but many fail to create value from it. No matter how much data your company owns or how many data scientists are in the department, data only becomes a game-changer once you have identified the right business questions.
2. Collect and store data
   - Now that you have a clear set of questions, it’s time to get your hands dirty. First, you need to collect your data and store it in a safe place so that it can be analyzed.
   - Sources: company data, machine-generated data, open-source data.
   - Types: qualitative or quantitative.
3. Clean and prepare data
   - Once you’ve collected and stored your data, the next step is to assess its quality. The success of your analysis depends greatly on the quality of your data: your insights will be wrong or misleading if your information is inaccurate, incomplete, or inconsistent. That’s why spending time cleaning and preparing the data, including exploratory data analysis (EDA), is mandatory.
4. Analyze data
   - Now that your data is clean, you’re ready to analyze it. Finding patterns, connections, insights, and predictions is often the most satisfying part of a data scientist's work.
5. Visualize and communicate data
   - The last step of the data science workflow is visualizing and communicating the results of your analysis. To turn your insights into decisions, you must ensure your audience and key stakeholders understand your work.

Source: https://www.datacamp.com/blog/how-to-analyze-data-for-business

### K Nearest Neighbours
- The k-nearest neighbors (k-NN) algorithm is a simple, instance-based learning algorithm used for classification and regression tasks in supervised machine learning. It is a non-parametric method that makes predictions based on the majority class or average value of the k-nearest neighbors of a given data point.
- How it works:

#### 1. Training Phase:
- In the training phase, the algorithm simply memorizes the training dataset. There is no explicit training involved as the model doesn't learn any parameters.

#### 2. Prediction Phase:
- When a prediction is needed for a new data point, the algorithm calculates the distances between the new data point and all the points in the training dataset. The most common distance metric used is the Euclidean distance, but other distance metrics can also be used.
- The algorithm then selects the k-nearest neighbors (data points with the smallest distances) to the new data point.

#### 3. Classification Task:
- For classification tasks, the algorithm assigns the class label that appears most frequently among the k-nearest neighbors to the new data point. This is known as the majority voting scheme.
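The prediction and voting steps above can be sketched from scratch. This is only an illustration, not how scikit-learn implements k-NN; the function name `knn_predict` and the toy points are assumptions made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote over the labels of those neighbours
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated clusters
X_toy = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y_toy = [0, 0, 0, 1, 1, 1]

print(knn_predict(X_toy, y_toy, [1.1, 1.0], k=3))  # point near the first cluster
print(knn_predict(X_toy, y_toy, [5.1, 5.0], k=3))  # point near the second cluster
```

Note that every prediction scans the whole training set, which is why k-NN becomes slow on very large datasets.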

#### 4. Regression Task:
- For regression tasks, the algorithm calculates the average value of the target variable among the k-nearest neighbors and assigns it as the predicted value for the new data point. 
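For regression, the only change is the last step: the neighbours' target values are averaged instead of voted on. A minimal sketch, with the helper name `knn_regress` and the toy values assumed for illustration:

```python
import numpy as np

def knn_regress(X_train, y_train, x_new, k=3):
    """Predict the mean target value of the k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - np.asarray(x_new, dtype=float), axis=1)
    # Average the targets of the k closest points
    nearest = np.argsort(distances)[:k]
    return float(y_train[nearest].mean())

# Toy data (hypothetical): one feature, one far-away outlier
X_toy_reg = [[1.0], [2.0], [3.0], [10.0]]
y_toy_reg = [10.0, 20.0, 30.0, 100.0]

# The three nearest points to 2.1 are 2.0, 3.0 and 1.0, so the outlier is ignored
print(knn_regress(X_toy_reg, y_toy_reg, [2.1], k=3))
```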

#### 5. Choosing the Value of k:
- The value of k, the number of neighbors to consider, is a hyperparameter that needs to be specified before applying the algorithm.
- The choice of k can significantly impact the performance of the algorithm. A smaller value of k tends to be more sensitive to noise, while a larger value of k may lead to smoother decision boundaries but could potentially miss local patterns in the data.
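One common way to choose k is to compare cross-validated scores for a few candidate values. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption, not the fraud dataset), and the candidate values of k are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=42
)

# Mean 5-fold cross-validated accuracy for several values of k
scores_by_k = {}
for k in (1, 5, 15, 45):
    model = KNeighborsClassifier(n_neighbors=k)
    scores_by_k[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

for k, score in scores_by_k.items():
    print(f"k={k:>2}: mean CV accuracy = {score:.3f}")
```

Which k scores best depends on the data, so the comparison should be rerun on the real features rather than read off this toy example.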

#### 6. Evaluation:
- The performance of the k-NN algorithm can be evaluated using various metrics such as accuracy, precision, recall, F1 score, or mean squared error, depending on whether it is applied to classification or regression tasks.
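To make the classification metrics concrete, here is a small hand-rolled computation from confusion-matrix counts. The helper `binary_metrics` and the label lists are made up for illustration; in practice `sklearn.metrics` provides these directly.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

metrics_demo = binary_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

On imbalanced problems such as fraud detection, precision, recall and F1 on the positive class are usually more informative than accuracy.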

Source: ChatGPT prompt.

### Data Preparation

Dataset link: https://www.kaggle.com/datasets/sriharshaeedala/financial-fraud-detection-dataset

In [None]:
# Load data
df = pd.read_csv('../data/Synthetic_Financial_datasets_log.csv')
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
df.describe()

In [None]:
df.isnull().sum()

In [None]:
# Fraudulent transactions that were not flagged by isFlaggedFraud
fraud_flagged_not_fraud = df.query("isFraud == 1 and isFlaggedFraud == 0")
fraud_flagged_not_fraud

In [None]:
fraud_flagged_not_fraud.shape[0]

### Feature Engineering

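In [None]:
# Inspect the destination accounts before dropping the identifier columns
df['nameDest'].value_counts()

In [None]:
df['nameDest'].value_counts().sum()

In [None]:
# Drop the high-cardinality account identifiers; the `type` column is kept for this analysis
df.drop(columns=['nameOrig', 'nameDest'], inplace=True)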

In [None]:
df = pd.get_dummies(df,drop_first=True)
df.head()

In [None]:
# get correlation
df.corr(numeric_only=True)['isFraud'].sort_values(ascending=False)

In [None]:
loans_df= df.drop('isFlaggedFraud',axis=1)

X = loans_df.drop('isFraud',axis=1)
y= loans_df['isFraud']

In [None]:
loans_df['isFraud'].value_counts()

In [None]:
sns.countplot(x='isFraud', data=loans_df)
# Customize the plot
plt.xlabel('Fraudulent')
plt.ylabel('Count')
plt.title('Count of Fraudulent vs. Non-Fraudulent Transactions')
plt.show()

The data is clearly imbalanced, so we need to balance the classes before modelling.

### Perform Oversampling

In [None]:
# Before oversampling
loans_df['isFraud'].value_counts()

In [None]:
smote = SMOTE(random_state=42)

X_resampled, y_resampled = smote.fit_resample(X,y)

# Join the values
oversampled_loans = pd.concat([pd.DataFrame(X_resampled, columns=X.columns), pd.DataFrame(y_resampled, columns=['isFraud'])], axis=1)



In [None]:
# Recheck after oversampling
oversampled_loans['isFraud'].value_counts()

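### Perform Train-Test Split

In [None]:
# Caveat: SMOTE was applied before this split, so synthetic test samples may be
# interpolated from points that end up in the training set; scores will be optimistic.
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)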

In [None]:
# Fitting and evaluating the model
knn = KNeighborsClassifier(n_neighbors=3)
#knn.fit(X_train,y_train)
#y_pred = knn.predict(X_test)

In [None]:
# Evaluate the model
#train_accuracy = knn.score(X_train, y_train)
#test_accuracy = knn.score(X_test, y_test)

#print("Training Accuracy:", train_accuracy)
#print("Test Accuracy:", test_accuracy)

I will continue with the downsampling technique instead.

In [None]:
majority_class = loans_df[loans_df['isFraud'] == 0]
minority_class = loans_df[loans_df['isFraud'] == 1]

# Downsample the majority class to match the minority class
majority_downsampled = resample(majority_class, replace=False, n_samples=len(minority_class), random_state=42)

downsampled_loans= pd.concat([majority_downsampled,minority_class])


In [None]:
# Check the shape and the class counts
print(f'Downsampled Loans: {downsampled_loans.shape}')
print(f"Downsampled Loans Count: {downsampled_loans['isFraud'].value_counts()}")

In [None]:
# Split, train and predict
X= downsampled_loans.drop('isFraud',axis=1)
y = downsampled_loans['isFraud']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train,y_train)
y_pred = knn.predict(X_test)

In [None]:
train_accuracy = knn.score(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)
print("Training Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)
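Because k-NN relies on distances, features on very different scales (for example raw amounts next to 0/1 dummy columns) can dominate the neighbour search. The sketch below compares an unscaled classifier with a `StandardScaler` pipeline on synthetic data; the data and the variable names are assumptions, and the effect on the fraud dataset would need to be checked separately.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical, not the fraud dataset)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo[:, 0] *= 10_000  # give one column a much larger scale, like a raw amount

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Same classifier with and without standardisation
raw_knn = KNeighborsClassifier(n_neighbors=3).fit(Xd_train, yd_train)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
scaled_knn.fit(Xd_train, yd_train)

raw_score = raw_knn.score(Xd_test, yd_test)
scaled_score = scaled_knn.score(Xd_test, yd_test)
print("Unscaled test accuracy:", raw_score)
print("Scaled test accuracy:", scaled_score)
```

Putting the scaler inside the pipeline means it is fitted on the training data only, so no information from the test set leaks into the scaling.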

In [None]:
f1 = f1_score(y_test, y_pred)
cls_report = classification_report(y_test, y_pred)
print(f'F1 Score: {f1}')

# Start the report on its own line so the column headers stay aligned
print(f'Classification Report:\n{cls_report}')
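The scores above are measured on a test set that was balanced by downsampling, which does not reflect how rare fraud is in practice. One alternative, sketched here on synthetic data (the 97/3 class ratio and all variable names are assumptions), is to split first and downsample only the training portion, so the test set keeps the original class distribution.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.utils import resample

# Synthetic imbalanced stand-in data (hypothetical)
X_arr, y_arr = make_classification(
    n_samples=5000, n_features=6, weights=[0.97, 0.03], random_state=42
)
demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])
demo["isFraud"] = y_arr

# Split first, keeping the class ratio the same in both parts
train, test = train_test_split(
    demo, test_size=0.2, stratify=demo["isFraud"], random_state=42
)

# Downsample the majority class in the training part only
majority = train[train["isFraud"] == 0]
minority = train[train["isFraud"] == 1]
majority_down = resample(majority, replace=False, n_samples=len(minority), random_state=42)
train_balanced = pd.concat([majority_down, minority])

# Train on balanced data, evaluate on the untouched, imbalanced test set
knn_demo = KNeighborsClassifier(n_neighbors=3)
knn_demo.fit(train_balanced.drop(columns="isFraud"), train_balanced["isFraud"])
test_pred = knn_demo.predict(test.drop(columns="isFraud"))
print(classification_report(test["isFraud"], test_pred))
```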

## Hyperparameter Tuning

In [None]:
parameters = {
    'n_neighbors': range(1, 11),
    'weights': ['uniform', 'distance'],
    # 'minkowski' with the default p=2 is identical to 'euclidean', so it is omitted
    'metric': ['manhattan', 'euclidean']
}
grid = GridSearchCV(estimator=knn, param_grid=parameters, cv=5, scoring='accuracy')
grid.fit(X_train,y_train)

print(grid.best_params_)
print(grid.best_score_)
print(grid.best_estimator_)
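`best_score_` is a cross-validation score on the training data, so it is worth confirming the tuned model on the held-out test set as well. A minimal sketch on synthetic data, with the grid values and variable names assumed for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Small grid scored by F1; refit=True (the default) retrains the best model on all training data
demo_grid = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": [1, 3, 5, 7], "weights": ["uniform", "distance"]},
    cv=5,
    scoring="f1",
)
demo_grid.fit(Xd_train, yd_train)

# predict() uses the refitted best estimator
yd_pred = demo_grid.predict(Xd_test)
held_out_f1 = f1_score(yd_test, yd_pred)
print("Best params:", demo_grid.best_params_)
print("CV F1:", demo_grid.best_score_)
print("Held-out F1:", held_out_f1)
```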
