Task 1: Perform a proper analysis of the whole data, plot all relevant plots, and note down all observations.

Task 2: Sample (S) 100 transactions from the whole data (D). For every transaction in S, print the 10 transactions from D that have the least values of 'similarity'.

The similarity between any two vectors is defined as
similarity(vi, vj) = cos^-1( dot product(vi, vj) / (length(vi) * length(vj)) )

1. vi represents a vector, i.e. a row in your data.
2. similarity(vi, vj) is just a function of two vectors; you can think of it like f(x, y).
3. length(vi): the length of the vector vi.
4. dot product(vi, vj) is the dot product between the vectors vi and vj (for more about the dot product, please check the linear algebra videos).
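
As a small worked example of the definition above, the sketch below computes the value for two hand-picked pairs of vectors. The helper name `angle_between` is only illustrative and is not part of the assignment. Because the function returns an angle, a smaller value means the two vectors point in more similar directions.

```python
import numpy as np

def angle_between(vi, vj):
    """Angle in radians between two vectors, following the definition above."""
    cosine = np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj))
    # clip guards against rounding errors pushing the cosine just outside [-1, 1]
    return np.arccos(np.clip(cosine, -1.0, 1.0))

# Perpendicular vectors: the angle is pi/2 (about 1.5708)
perpendicular = angle_between(np.array([1.0, 0.0]), np.array([0.0, 1.0]))
# Vectors pointing in the same direction: the angle is approximately 0
parallel = angle_between(np.array([1.0, 2.0]), np.array([2.0, 4.0]))

print(perpendicular)
print(parallel)
```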





## Task 1: Performing EDA (Exploratory Data Analysis) on credit card data

### About the data:

Features V1, V2, ... V28 are the principal components obtained with PCA. The only features that have not been transformed with PCA are 'Time' and 'Amount'.

**Feature 'Time'** contains the seconds elapsed between each transaction and the first transaction in the dataset. 

The **Feature 'Amount'** is the transaction amount. This feature can be used, for example, for example-dependent cost-sensitive learning.

**Feature 'Class'** is the response variable and it takes value 1 in case of fraud and 0 otherwise.



EDA is performed only on the 'Time' and 'Amount' features.

In [None]:
#importing pandas module
import pandas as pd

In [None]:
credit_card_data = pd.read_csv("../input/creditcard.csv")

In [None]:
credit_card_data.head()

In [None]:
#finding the shape of dataframe (finding no.of observations and features in the given dataframe)
credit_card_data.shape

The dataframe has 284807 observations and 31 columns: 30 features plus the 'Class' column.

In [None]:
#finding whether it is balanced data or imbalanced data
credit_card_data['Class'].value_counts()

From the above counts it is clear that this is an imbalanced dataset.
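
To put a number on the imbalance, the class shares can be read off directly. This is a minimal sketch on made-up labels (`toy_class` is a hypothetical stand-in for `credit_card_data['Class']`), not the actual proportions of the dataset.

```python
import pandas as pd

# Toy labels standing in for credit_card_data['Class'] (hypothetical values)
toy_class = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 0, 1], name='Class')

# normalize=True returns the fraction of rows per class instead of raw counts
class_share = toy_class.value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `credit_card_data['Class'].value_counts(normalize=True)`.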

Renaming the class labels

0 - legitimate

1- fraud

In [None]:
credit_card_data['Class'] = credit_card_data['Class'].apply(lambda x:'legitimate' if x == 0 else 'fraud')
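
An alternative to the lambda above is a dictionary lookup with `Series.map`. The sketch below runs on a toy frame (hypothetical values). One design difference: the lambda labels every non-zero value as 'fraud', while `map` turns any value missing from the dictionary into NaN, which makes unexpected labels easier to spot.

```python
import pandas as pd

# Toy frame standing in for credit_card_data (hypothetical values)
toy = pd.DataFrame({'Class': [0, 1, 0]})

# Explicit mapping from the numeric label to a readable name
toy['Class'] = toy['Class'].map({0: 'legitimate', 1: 'fraud'})
print(toy['Class'].tolist())  # ['legitimate', 'fraud', 'legitimate']
```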

In [None]:
credit_card_data.head()

In [None]:
#subsetting the dataset (selecting only required columns which are useful for analysis)
credit_card_data_subset = credit_card_data[['Time','Amount','Class']]

In [None]:
#checking the distribution of data (to check whether our subset of data contains same count as original dataset)
credit_card_data_subset['Class'].value_counts()

Performing EDA on this data (Pair plots to find out which features have more importance when compared to other features)


In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
%matplotlib inline

In [None]:
#sns.pairplot(data = credit_card_data,hue='Class',kind='scatter')

In [None]:
#'height' replaces the older 'size' argument of FacetGrid (seaborn 0.9 and later)
sns.FacetGrid(credit_card_data_subset,hue='Class',height = 5).map(sns.distplot,'Time').add_legend()
plt.title("Histogram with PDF for feature 'Time' ")

In [None]:
#'height' replaces the older 'size' argument of FacetGrid (seaborn 0.9 and later)
sns.FacetGrid(credit_card_data_subset,hue='Class',height=8).map(sns.distplot,'Amount').add_legend()
plt.title("Histogram with PDF for feature 'Amount' ")

**Observation:**

1. The distributions of the feature 'Time' for the two classes overlap heavily. This feature alone gives little basis for separating fraud from legitimate transactions, so it could be discarded when classifying the outcome.

2. The distributions of the feature 'Amount' also overlap, but they separate the two classes somewhat better than 'Time' does.
    

    

In [None]:
#dividing the data according to classes for appropriate analysis
credit_card_data_subset_class_fraud = credit_card_data_subset[credit_card_data_subset['Class'] == 'fraud']

In [None]:
credit_card_data_subset_class_fraud.shape

In [None]:
credit_card_data_subset_class_fraud.head()

In [None]:
credit_card_data_subset_class_legitimate = credit_card_data_subset[credit_card_data_subset['Class'] == 'legitimate']

In [None]:
credit_card_data_subset_class_legitimate.shape

In [None]:
credit_card_data_subset_class_legitimate.head()

### Univariate Analysis on feature 'Amount'

Plotting the **CDF (Cumulative Distribution Function)** and **PDF (Probability Density Function)** to analyze the data further

In [None]:
import numpy as np


In [None]:
count,bin_edges = np.histogram(credit_card_data_subset_class_legitimate['Amount'],bins = 20,density = True)
PDF = count/sum(count)
#print("PDF : ",PDF)
#print("\nbin edges : " , bin_edges)
#computing CDF with help of PDF 
CDF = np.cumsum(PDF)
#plotting PDF,CDF
plt.plot(bin_edges[1:],PDF,label = "PDF ---- legitimate")
plt.plot(bin_edges[1:],CDF,label = "CDF ---- legitimate")

count,bin_edges = np.histogram(credit_card_data_subset_class_fraud['Amount'],bins = 20,density = True)
PDF = count/sum(count)
#computing CDF with help of PDF 
CDF = np.cumsum(PDF)
#plotting PDF,CDF
plt.plot(bin_edges[1:],PDF,label = "PDF ---- Fraud")
plt.plot(bin_edges[1:],CDF,label = "CDF ---- Fraud")

plt.xlabel("Amount")
plt.ylabel("Probability")
plt.title("Plot of PDF and CDF for feature 'Amount' ")
plt.legend()
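
The cell above repeats the same three steps for each class: histogram, normalise the counts, cumulative sum. A minimal sketch of those steps as a reusable helper follows; the name `histogram_pdf_cdf` and the toy amounts are assumptions for illustration. Note that after dividing by the total, the "PDF" values are per-bin probabilities that sum to 1, which is why the last CDF value is 1.

```python
import numpy as np

def histogram_pdf_cdf(values, bins=20):
    """Return bin edges, per-bin probabilities and their cumulative sum."""
    counts, bin_edges = np.histogram(values, bins=bins)
    pdf = counts / counts.sum()   # fraction of observations in each bin
    cdf = np.cumsum(pdf)          # fraction of observations up to each bin's right edge
    return bin_edges, pdf, cdf

# Toy amounts (hypothetical values), four equal-width bins between 1 and 400
toy_amounts = np.array([1.0, 2.0, 2.5, 3.0, 10.0, 50.0, 120.0, 400.0])
bin_edges, pdf, cdf = histogram_pdf_cdf(toy_amounts, bins=4)

print(pdf)  # [0.75  0.125 0.    0.125]
print(cdf)  # [0.75  0.875 0.875 1.   ]
```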



**Observation and Conclusion**

1. From the above plot we can observe that there is a slight overlap between the CDFs of the two classes.

2. However, we can observe that more than **95% of fraud transactions** have a **purchase Amount below approximately 2500**.

3. Because legitimate transactions also fall in this range, a purchase amount below 2500 does not by itself identify a fraud transaction. What the plot does suggest is that a transaction with an amount well above 2500 is more likely to be legitimate.
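
The 95% figure is read off the plot by eye. A sketch of how such a figure could be checked numerically is given below, using a toy frame with made-up amounts (the real check would use `credit_card_data_subset`). The threshold 2500 comes from the observation above; everything else is an assumption for illustration.

```python
import pandas as pd

# Toy frame standing in for credit_card_data_subset (hypothetical values)
toy = pd.DataFrame({
    'Amount': [5.0, 20.0, 35.0, 3000.0, 1.0, 99.0, 250.0, 800.0],
    'Class': ['legitimate'] * 4 + ['fraud'] * 4,
})

# 95th percentile of Amount within each class
amount_p95 = toy.groupby('Class')['Amount'].quantile(0.95)

# Fraction of each class with Amount below the threshold
share_below = (toy['Amount'] < 2500).groupby(toy['Class']).mean()

print(amount_p95)
print(share_below)
```

Comparing `share_below` for both classes shows whether the threshold actually separates them or whether both classes mostly fall below it.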

### Box plot and whiskers

In [None]:
sns.boxplot(data = credit_card_data_subset,x='Class',y='Amount')

### Violin plot

In [None]:

sns.violinplot(data= credit_card_data_subset,x='Class',y='Amount')

**Observation:**
    
Both of these plots (the box plot and the violin plot) are harder to analyze than the PDF and CDF: the dataset is large and imbalanced, and the scale of 'Amount' makes the shapes hard to read.
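
One possible way to make the scale readable, not used in this notebook, is to plot a log-transformed amount. The sketch below only shows the transform on toy values; the column name `log_amount` is an assumption. `np.log1p` computes log(1 + x), so an amount of 0 stays at 0 instead of becoming minus infinity.

```python
import numpy as np
import pandas as pd

# Toy amounts spanning several orders of magnitude (hypothetical values)
toy = pd.DataFrame({'Amount': [0.0, 1.0, 9.0, 99.0, 25000.0]})

# log(1 + Amount) compresses the long right tail
toy['log_amount'] = np.log1p(toy['Amount'])
print(toy)
```

The box or violin plot could then be drawn with `y='log_amount'` instead of `y='Amount'`.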

### Task 2: Finding the least values of similarity
Task 2: Sample (S) 100 transactions from whole data (D), for every transaction in S, print 10 transactions from D which have least values of 'similarity'



In [None]:
#creating a sample of 100 values from Data , 
#sample will not have 'Class' feature in it because for comparison we don't need 'Class' feature.
credit_card_data_sample = credit_card_data[credit_card_data.columns[:-1]].sample(100)
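
`sample(100)` draws a different set of rows on every run, so the index values change each time the notebook is executed. If a repeatable sample is wanted, `DataFrame.sample` accepts a `random_state` seed. A minimal sketch on a toy frame (the seed 42 is an arbitrary choice):

```python
import pandas as pd

# Toy frame standing in for the full data (hypothetical values)
toy = pd.DataFrame({'Amount': range(1000)})

# The same seed gives the same 100 rows on every run
first_draw = toy.sample(100, random_state=42)
second_draw = toy.sample(100, random_state=42)

print(first_draw.index.equals(second_draw.index))  # True
```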

In [None]:
#indexes of all the samples (100 samples)
credit_card_data_sample.index

In [None]:
credit_card_data_sample.head()

In [None]:
#removing class for credit_card_data also.
credit_card_data_without_label = credit_card_data[credit_card_data.columns[:-1]]

In [None]:
credit_card_data_without_label.shape

#### Create a function called similarity to check for the similarity and return least 10 transactions:


In [None]:
def similarity(sample, whole_data): 
    """Returns a dataframe of the 10 transactions with the least values of similarity
    sample -- one of the index values of credit_card_data_sample
    whole_data -- entire dataframe without labels, as we have to compare each of its rows with the sample."""
    index_value = []
    Class = []
    similarity_list = []
    #the sample row and its norm do not change inside the loop, so compute them once
    sample_vector = credit_card_data_sample.loc[sample]
    sample_norm = np.linalg.norm(sample_vector)
    for i in whole_data.index:
        #use the whole_data argument instead of a global dataframe
        row = whole_data.loc[i]
        cosine_value = np.dot(sample_vector,row)/(sample_norm*np.linalg.norm(row))
        #clip guards against rounding errors that push the cosine slightly outside [-1, 1]
        similarity_value = np.arccos(np.clip(cosine_value,-1.0,1.0))
        similarity_list.append(similarity_value)
        Class.append(credit_card_data['Class'][i])
        index_value.append(i)
    similarity_df = pd.DataFrame({'index_value':index_value,'similarity_value':similarity_list,'Class':Class})
    print(f"sample index value is {sample} ")
    return similarity_df.sort_values('similarity_value').head(10)
    
    
#sort_values -- will sort the dataframe based on the column given in the function of sort_values

#np.dot -- built-in function of numpy, to calculate dot product between two vectors

# .loc[] --- is used to return a row based on the value provided in the arguments (passing the index in that arguments)

#linalg.norm -- calculates the magnitude of the vector
#The length of the vector is referred to as the vector norm or the vector's magnitude.
#The length of a vector is a nonnegative number that describes the extent of the vector in space, 
#and is sometimes referred to as the vector's magnitude or the norm.
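
The loop above visits every row of the dataframe one at a time, which can be slow on a few hundred thousand rows. The sketch below computes the same angles with a single matrix-vector product. It is an alternative under stated assumptions, not the notebook's method: the name `least_similarity_rows`, the `top_n` parameter and the toy data are all hypothetical, and it assumes that no row is all zeros (a zero norm would give NaN).

```python
import numpy as np
import pandas as pd

def least_similarity_rows(sample_row, whole_data, top_n=10):
    """Return the top_n rows of whole_data with the smallest angle to sample_row."""
    data_matrix = whole_data.to_numpy(dtype=float)
    sample_vector = np.asarray(sample_row, dtype=float)
    # cosine of the angle between the sample and every row, in one pass
    cosine = data_matrix @ sample_vector / (
        np.linalg.norm(data_matrix, axis=1) * np.linalg.norm(sample_vector))
    # clip guards against rounding errors just outside [-1, 1]
    angles = np.arccos(np.clip(cosine, -1.0, 1.0))
    result = pd.DataFrame({'index_value': whole_data.index,
                           'similarity_value': angles})
    return result.sort_values('similarity_value').head(top_n)

# Toy data: four 2-D vectors (hypothetical values)
toy = pd.DataFrame({'a': [1.0, 0.0, 1.0, -1.0], 'b': [0.0, 1.0, 1.0, 0.0]})
nearest = least_similarity_rows(toy.loc[0], toy, top_n=2)
print(nearest)
```

In the toy run, row 0 is compared with itself (angle 0) and comes first, followed by row 2 at an angle of pi/4. To cover every transaction in the sample, the function could be called once per index value in `credit_card_data_sample.index`.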
    

In [None]:
#sample output

#pass an index value directly instead of using the sample_index variable,
#because interactive input does not work in the Kaggle frontend

#similarity(index_from_sample_data,credit_card_data_without_label)

In [None]:
#commented out because this is an uploaded notebook and the Kaggle frontend cannot take input

#sample_index = int(input("Enter the one of the index value from your credit_card_data_sample : "))

In [None]:
#execute this only after you execute the above block 

#calling the function

#similarity(sample_index,credit_card_data_without_label)  

#credit_card_data_without_label -- is the whole dataframe without labels
#sample_index is the value (one of the index value) from the samples.

 