# Importance score for dataset training samples

This study evaluates how much a single observation in a training dataset contributes to the performance of a model trained on that dataset.

First, the necessary modules are imported.

In [1]:
import k_nn
from load_dataset import load_dataset
import explore_data

The load_dataset function from the custom load_dataset module reads the dataset into a pandas DataFrame called df. A second DataFrame, df_without, is then created from df with one data point (observation) removed. This is the observation whose importance is to be quantified.

In [2]:
filename = "vehicles.csv"
df = load_dataset(filename)

In [3]:
index_num = 842
df_without = df.drop(df.index[index_num])

The raw data in both DataFrames is then inspected. Because the deleted observation lies near the end of the dataset, the difference is visible in the output of the tail() method.

In [4]:
print(df.tail())

     COMPACTNESS  CIRCULARITY  DISTANCE_CIRCULARITY  RADIUS_RATIO  \
841           93           39                    87           183   
842           89           46                    84           163   
843          106           54                   101           222   
844           86           36                    78           146   
845           85           36                    66           123   

     PR.AXIS_ASPECT_RATIO  MAX.LENGTH_ASPECT_RATIO  SCATTER_RATIO  \
841                    64                        8            169   
842                    66                       11            159   
843                    67                       12            222   
844                    58                        7            135   
845                    55                        5            120   

     ELONGATEDNESS  PR.AXIS_RECTANGULARITY  MAX.LENGTH_RECTANGULARITY  \
841             40                      20                        134   
842             43      

In [5]:
print(df_without.tail())

     COMPACTNESS  CIRCULARITY  DISTANCE_CIRCULARITY  RADIUS_RATIO  \
840           93           34                    66           140   
841           93           39                    87           183   
843          106           54                   101           222   
844           86           36                    78           146   
845           85           36                    66           123   

     PR.AXIS_ASPECT_RATIO  MAX.LENGTH_ASPECT_RATIO  SCATTER_RATIO  \
840                    56                        7            130   
841                    64                        8            169   
843                    67                       12            222   
844                    58                        7            135   
845                    55                        5            120   

     ELONGATEDNESS  PR.AXIS_RECTANGULARITY  MAX.LENGTH_RECTANGULARITY  \
840             51                      18                        120   
841             40      

As expected, the observation at index 842 is absent from the second DataFrame, df_without.
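
Reading tail() works here only because the removed row sits near the end of the data. For a row elsewhere in the dataset, the same check can be made programmatically. The following is a minimal sketch on a toy DataFrame; the names toy and toy_without and the values in them are illustrative assumptions, not part of the vehicles data.

```python
import pandas as pd

# Toy stand-in for the real dataset; the values are illustrative only.
toy = pd.DataFrame(
    {"COMPACTNESS": [93, 89, 106, 86], "Class": ["a", "b", "a", "b"]}
)

# Drop the row at position 1, mirroring df.drop(df.index[index_num]).
toy_without = toy.drop(toy.index[1])

# Index labels present in the full frame but missing from the reduced one.
removed = toy.index.difference(toy_without.index)
print(list(removed))

# Exactly one row should have been removed.
print(len(toy) - len(toy_without))
```

Applied to the notebook's own frames, the same two expressions would use df and df_without in place of the toy frames.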

The accuracy of a KNN classifier trained on each dataset is computed, and the two values are returned as a list.

In [6]:
lst = k_nn.with_without(df, df_without, "Class")
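
The source of the k_nn module is not shown, so the internals of with_without are unknown. The sketch below illustrates one way such a comparison could be implemented, using only the standard library. Every name in it (knn_predict, knn_accuracy, with_without_sketch) is hypothetical, as are the value of k and the toy data.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest training rows."""
    # Rank training rows by Euclidean distance to the query point.
    nearest = sorted(
        range(len(train_rows)),
        key=lambda i: math.dist(train_rows[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def knn_accuracy(train_rows, train_labels, test_rows, test_labels, k=3):
    """Fraction of test rows whose predicted label matches the true label."""
    correct = sum(
        knn_predict(train_rows, train_labels, row, k) == label
        for row, label in zip(test_rows, test_labels)
    )
    return correct / len(test_rows)


def with_without_sketch(train_rows, train_labels, test_rows, test_labels,
                        drop_pos, k=3):
    """Return [accuracy with all training rows, accuracy with one row dropped]."""
    kept_rows = train_rows[:drop_pos] + train_rows[drop_pos + 1:]
    kept_labels = train_labels[:drop_pos] + train_labels[drop_pos + 1:]
    return [
        knn_accuracy(train_rows, train_labels, test_rows, test_labels, k),
        knn_accuracy(kept_rows, kept_labels, test_rows, test_labels, k),
    ]


# Toy data: the second training row carries a misleading label.
train_rows = [(0.0, 0.0), (0.4, 0.4), (5.0, 5.0)]
train_labels = ["a", "b", "b"]
test_rows = [(0.5, 0.5), (5.5, 5.5)]
test_labels = ["a", "b"]

scores = with_without_sketch(
    train_rows, train_labels, test_rows, test_labels, drop_pos=1, k=1
)
print(scores)  # [0.5, 1.0]
```

The sketch evaluates both models on the same fixed test set, so any change in accuracy can be attributed to the removed row. Whether k_nn.with_without does the same, or splits each DataFrame separately, cannot be determined from this notebook.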

The results of the computation are displayed as follows:

In [8]:
print(
    "The accuracy of the KNN model before the index {} observation was removed was {}.".format(
        index_num, lst[0]
    )
)
print(
    "The accuracy of the model without observation {} is {}.".format(index_num, lst[1])
)

difference = lst[0] - lst[1]
# The impact of the observation is expressed as a percentage of the original accuracy and rounded to two decimal places.
percentage = (difference / lst[0]) * 100
percentage = round(percentage, 2)

print("-" * 100)
print(
    "The performance score changed by {} when observation {} was not considered.".format(
        difference, index_num
    )
)
print(
    "The observation at index {} accounts for {} percent of the model's performance.".format(
        index_num, percentage
    )
)


The accuracy of the KNN model before the index 842 observation was removed was 0.6653543307086615.
The accuracy of the model without observation 842 is 0.6771653543307087.
----------------------------------------------------------------------------------------------------
The performance score changed by -0.011811023622047223 when observation 842 was not considered.
The observation at index 842 accounts for -1.78 percent of the model's performance.
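
The percentage reported above is the change in accuracy relative to the accuracy of the model trained on the full dataset. As a minimal sketch, the same arithmetic can be wrapped in a small helper; the function name relative_change_percent is an assumption introduced here for illustration.

```python
def relative_change_percent(acc_with, acc_without, ndigits=2):
    """Change in accuracy caused by removing a row, as a percentage of acc_with."""
    if acc_with == 0:
        raise ValueError("acc_with must be non-zero")
    # Positive: the row helped the model. Negative: accuracy rose without it.
    return round((acc_with - acc_without) / acc_with * 100, ndigits)


# Accuracies reported above for observation 842.
print(relative_change_percent(0.6653543307086615, 0.6771653543307087))  # -1.78
```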


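The negative values indicate that accuracy rose from about 0.665 to about 0.677 once observation 842 was removed, so in this run the observation slightly hurt the model rather than helped it.

The importance of any other data point in the dataset can be evaluated by assigning its row position to index_num.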
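
To score several observations at once, the single-index procedure can be wrapped in a loop. The sketch below assumes a hypothetical callable score_fn that takes a DataFrame and returns a performance score. In this notebook it would wrap the KNN accuracy computation. Here a column mean stands in as a placeholder so the example runs on its own.

```python
import pandas as pd


def importance_by_position(df, positions, score_fn):
    """Map each row position to score(full data) - score(data without that row)."""
    baseline = score_fn(df)
    return {
        pos: baseline - score_fn(df.drop(df.index[pos]))
        for pos in positions
    }


# Placeholder data and scoring function; both are illustrative assumptions.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 6.0]})


def placeholder_score(frame):
    # Stand-in for a real model accuracy; it only makes the sketch runnable.
    return frame["x"].mean()


result = importance_by_position(toy, [0, 3], placeholder_score)
print(result)
```

With a real model, each position costs one extra training run, so evaluating every row of a large dataset this way may be slow.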