# Prediction of ATM Dataset using Machine Learning Algorithms

The objective of this project is to analyze the ATM dataset and predict the monthly withdrawal (the revenue column) and the customer rating of an ATM from the other features in the dataset. The training data is in the file train.tsv and the test data is in test.tsv. The dataset contains 16 features. The columns referenced in this notebook are listed below.

## Dataset

- Number_of_Shops_Around_ATM
- ATM_Zone
- No_of_Other_ATMs_in_1_KM_radius
- Estimated_Number_of_Houses_in_1_KM_Radius
- ATM_Placement
- ATM_TYPE
- ATM_Location_TYPE
- ATM_looks
- ATM_Attached_to
- Average_Wait_Time
- Day_Type
- rating
- revenue

In [3]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.utils import shuffle
from sklearn.model_selection import cross_val_score
from sklearn.impute import SimpleImputer
from scipy.stats import pearsonr

Predict the monthly withdrawal (the revenue column) for a given ATM. We will use a Random Forest Regressor for this task and evaluate it on the test set using the Pearson correlation coefficient, r2_score, MAE, MSE and RMSE, plus 10-fold cross-validation R² scores on the training set.

In [4]:
# Load training and test data from tsv files

train_df = pd.read_csv('train.tsv', sep='\t')
test_df = pd.read_csv('test.tsv', sep='\t')
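# Shuffle the training rows; no random_state is set, so the row order
# (and therefore the exact scores below) varies slightly between runs
train_df = shuffle(train_df)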
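
If exact reproducibility matters, one option is to pass a seed to shuffle. The sketch below is a minimal illustration on a hypothetical toy frame (toy_df is not part of the dataset), and the seed value 25 is only borrowed from the regressor used later.

```python
import pandas as pd
from sklearn.utils import shuffle

# Hypothetical toy frame standing in for train_df
toy_df = pd.DataFrame({"revenue": [10, 20, 30, 40, 50]})

# With a fixed random_state the shuffled row order is identical on every call
first = shuffle(toy_df, random_state=25)
second = shuffle(toy_df, random_state=25)

print(first.index.tolist() == second.index.tolist())
```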

In [5]:
# Categorical features to encode

cat_cols = [
    'ATM_Zone',
    'ATM_Placement',
    'ATM_TYPE',
    'ATM_Location_TYPE',
    'ATM_looks',
    'ATM_Attached_to',
    'Day_Type'
]

In [6]:
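# Encode each categorical column as integers. The encoder is re-fitted per column
# on the training data and then applied to the same column of the test data
# (le.transform raises an error if the test data contains an unseen category).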

le = LabelEncoder()
for col in cat_cols:
    train_df[col] = le.fit_transform(train_df[col])
    test_df[col] = le.transform(test_df[col])
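
As a possible alternative, scikit-learn's OrdinalEncoder can encode several columns at once and map categories that appear only in the test data to a sentinel value instead of raising an error. The following is a sketch on hypothetical toy frames; the category values ("A", "Festival" and so on) and the sentinel -1 are assumptions for illustration, not values taken from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data; the real category values may differ
train_toy = pd.DataFrame({
    "ATM_Zone": ["A", "B", "C"],
    "Day_Type": ["Festival", "Working", "Working"],
})
test_toy = pd.DataFrame({
    "ATM_Zone": ["B", "D"],            # "D" never appears in training
    "Day_Type": ["Working", "Holiday"],  # "Holiday" never appears in training
})

toy_cat_cols = ["ATM_Zone", "Day_Type"]

# Unseen categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_enc = encoder.fit_transform(train_toy[toy_cat_cols])
test_enc = encoder.transform(test_toy[toy_cat_cols])

print(test_enc)
```

Whether a sentinel is appropriate depends on the model; tree-based models such as the ones used here can usually cope with it.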

In [7]:
# Separate features and target (revenue), and fill missing values
# with the column means learned from the training set

imputer = SimpleImputer(strategy='mean')
X_train = imputer.fit_transform(train_df.drop(['revenue'], axis=1))
y_train = train_df['revenue'].values
X_test = imputer.transform(test_df.drop(['revenue'], axis=1))
y_test = test_df['revenue'].values
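
Because the imputer here is fitted on the whole training set, the cross-validation run later sees column means that were computed using the validation folds as well. A common way to avoid this is to wrap the imputer and the model in a pipeline so that the imputer is re-fitted inside each fold. The sketch below uses synthetic data (X_toy and y_toy are hypothetical) rather than the ATM dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic regression data with roughly 5% of the entries missing
rng = np.random.default_rng(25)
X_toy = rng.normal(size=(200, 4))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)
X_toy[rng.random(X_toy.shape) < 0.05] = np.nan

# The imputer is fitted only on the training part of each fold
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestRegressor(n_estimators=50, random_state=25),
)
cv_scores = cross_val_score(model, X_toy, y_toy, cv=5, scoring="r2")

print(cv_scores.mean())
```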

In [8]:
# Train Random Forest Regressor model

rf = RandomForestRegressor(n_estimators=100, random_state=25)
rf.fit(X_train, y_train)

RandomForestRegressor(random_state=25)

In [9]:
# Evaluate our model on test set

y_pred = rf.predict(X_test)

In [10]:
# View various statistics

print("Regression Statistics\n")
print("Pearson Correlation Coefficient: ", pearsonr(y_test, y_pred)[0])
print("r2_score: ", r2_score(y_test, y_pred))
print("MAE: ", mean_absolute_error(y_test, y_pred))
print("MSE: ", mean_squared_error(y_test, y_pred))
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
scores = cross_val_score(rf, X_train, y_train, cv=10, scoring='r2')
print("Cross-validation scores: ", scores)
print("Average cross-validation score: ", scores.mean())

Regression Statistics

Pearson Correlation Coefficient:  0.9980186989844818
r2_score:  0.9960411617552533
MAE:  3095.674253350437
MSE:  25014695.200349074
RMSE:  5001.469304149439
Cross-validation scores:  [0.99538243 0.99592175 0.99615999 0.99612578 0.99594197 0.99607539
 0.99586181 0.99574443 0.99629987 0.99582814]
Average cross-validation score:  0.9959341559705773


Pearson Correlation Coefficient: This statistic measures the strength and direction
of the linear relationship between two variables, here the actual and the predicted values.
In this case, it is 0.9980186989844818, which indicates a very strong positive correlation.

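r2_score: The coefficient of determination measures how much of the variance in the target
the model explains. A score of 1 indicates a perfect fit, 0 corresponds to always predicting
the mean, and the score can be negative for a model that does worse than that. In this case,
the score is 0.9960411617552533, which suggests that the model fits the test data very well.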
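
The two statistics above are related but not interchangeable: correlation ignores a constant bias in the predictions, whereas R² penalizes it and can even fall below zero. A small sketch with made-up numbers (not taken from the ATM data) illustrates the difference.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import r2_score

# Made-up targets and predictions that are perfectly correlated but biased
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0, 500.0])
y_shifted = y_true_toy + 150.0  # every prediction is 150 too high

r = pearsonr(y_true_toy, y_shifted)[0]
r2 = r2_score(y_true_toy, y_shifted)

# r is 1 (perfect linear relationship), while R² is negative because the
# biased predictions are worse than simply predicting the mean of y_true_toy
print(r, r2)
```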

MAE: Mean Absolute Error is the average absolute difference between the predicted values
and the actual values. In this case, the MAE is 3095.674253350437, which means that on
average, the predicted values are about 3096 units away from the actual values.

MSE: Mean Squared Error is the average of the squared differences between the predicted
values and the actual values, so it gives more weight to large errors. In this case, the
MSE is 25014695.200349074.

RMSE: Root Mean Squared Error is the square root of MSE, which expresses the error in the
same units as the target while still penalizing large errors. In this case, the RMSE is
5001.469304149439.
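
To make the relationship between the three error metrics concrete, the sketch below computes them by hand with NumPy on a few made-up values (not from the ATM data). One prediction is deliberately far off, which shows how a single large error pulls RMSE well above MAE.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values: three small misses and one large miss
y_true_toy = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred_toy = np.array([1100.0, 1900.0, 3100.0, 3000.0])

errors = y_pred_toy - y_true_toy   # [100, -100, 100, -1000]
mae = np.mean(np.abs(errors))      # average absolute error
mse = np.mean(errors ** 2)         # average squared error
rmse = np.sqrt(mse)                # back in the units of the target

print(mae, mse, rmse)
```

The hand-computed values agree with mean_absolute_error and mean_squared_error from scikit-learn on the same arrays.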

Cross-validation scores: This is the list of R² scores obtained through 10-fold cross-validation
on the training set, a technique in which the model is repeatedly trained on nine folds and scored
on the remaining one. In this case, the scores range from 0.99538243 to 0.99629987.

Average cross-validation score: This is the mean of the cross-validation scores, which gives
us an overall idea of how well the model is performing. In this case, the average cross-validation
score is 0.9959341559705773, which is consistent with the test-set r2_score.
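
What cross_val_score does can be sketched as an explicit loop: split the data into folds, fit a fresh copy of the model on all but one fold, and score it on the held-out fold. The example below uses synthetic data (X_toy and y_toy are hypothetical) and 5 folds to keep it quick; for a regressor, an integer cv is understood to use unshuffled KFold splits, which is what the loop mirrors.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = 2.0 * X_toy[:, 0] + X_toy[:, 1] + rng.normal(scale=0.1, size=120)

model = RandomForestRegressor(n_estimators=30, random_state=25)
kfold = KFold(n_splits=5)

# Manual cross-validation: fit on the training folds, score on the held-out fold
manual_scores = []
for train_idx, val_idx in kfold.split(X_toy):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    fold_pred = fold_model.predict(X_toy[val_idx])
    manual_scores.append(r2_score(y_toy[val_idx], fold_pred))

# The library helper performs the same procedure
library_scores = cross_val_score(model, X_toy, y_toy, cv=kfold, scoring="r2")

print(np.allclose(manual_scores, library_scores))
```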

Predict the customer rating of a given ATM. We will use a Decision Tree Classifier
for this task and evaluate the model using confusion matrix, accuracy, precision, recall and F1 score.

In [11]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, precision_score, accuracy_score, recall_score, f1_score

In [12]:
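# Separate features and target (rating); revenue stays in as a feature.
# Missing values are filled with the column means learned from the training set.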

X_train = imputer.fit_transform(train_df.drop(['rating'], axis=1))
y_train = train_df['rating'].values
X_test = imputer.transform(test_df.drop(['rating'], axis=1))
y_test = test_df['rating'].values

In [13]:
# Train a Decision Tree Classifier

dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)

DecisionTreeClassifier()

In [14]:
# Evaluate our model on test set

predictions = dtc.predict(X_test)

In [15]:
# View various statistics

print("Classification Statistics\n")
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
print("Precision: ", precision_score(y_test, predictions, average=None))
print("Recall: ", recall_score(y_test, predictions, average=None))
print("Accuracy: ", accuracy_score(y_test, predictions))
print("F-score: ", f1_score(y_test, predictions, average=None))

Classification Statistics

Confusion Matrix:
 [[  75    0    0    0]
 [   0 1432    0    0]
 [   0    1 1134    0]
 [   0    0    0  194]]
Precision:  [1.         0.99930216 1.         1.        ]
Recall:  [1.         1.         0.99911894 1.        ]
Accuracy:  0.9996473906911142
F-score:  [1.         0.99965096 0.99955928 1.        ]


The classification statistics summarize the performance of the model on the test set. In the confusion matrix, each row corresponds to a true rating class and each column to a predicted class, so the diagonal holds the correct predictions and every off-diagonal entry is a misclassification. Precision, recall and F-score are reported per class (average=None). Precision is the proportion of samples predicted as a class that actually belong to it. Recall is the proportion of samples of a class that the model identifies. Accuracy is the overall proportion of correct predictions. The F-score is the harmonic mean of precision and recall.

Here only one of the 2,836 test samples is misclassified (a sample of the third class predicted as the second), which gives an accuracy of about 0.9996 and per-class precision and recall of at least 0.999.
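
The per-class scores can be recovered directly from the confusion matrix printed above, which is a useful sanity check. The sketch below copies that matrix into NumPy and assumes scikit-learn's convention of rows as true classes and columns as predicted classes.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted)
cm = np.array([
    [75,    0,    0,   0],
    [0,  1432,    0,   0],
    [0,     1, 1134,   0],
    [0,     0,    0, 194],
])

true_pos = np.diag(cm)                   # correct predictions per class
precision = true_pos / cm.sum(axis=0)    # divide by column sums (predicted counts)
recall = true_pos / cm.sum(axis=1)       # divide by row sums (true counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print(precision, recall, f1, accuracy)
```

The values agree with the precision, recall, F-score and accuracy reported by scikit-learn in the cell output.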

In [16]:
%load_ext watermark
%watermark -v -m -p pandas,numpy,sklearn,scipy
%watermark -u -n -t -z

Python implementation: CPython
Python version       : 3.7.9
IPython version      : 7.26.0

pandas : 1.3.5
numpy  : 1.21.5
sklearn: 1.0.2
scipy  : 1.7.3

Compiler    : MSC v.1900 64 bit (AMD64)
OS          : Windows
Release     : 10
Machine     : AMD64
Processor   : Intel64 Family 6 Model 158 Stepping 13, GenuineIntel
CPU cores   : 6
Architecture: 64bit

Last updated: Fri May 12 2023 22:29:40AUS Eastern Standard Time

