## Bank Note Authentication - PySpark

**Description of the data**

Data were extracted from images taken of genuine and forged banknote-like specimens. The specimens were digitized with an industrial camera normally used for print inspection. The final images are 400 x 400 pixels. Owing to the object lens and the distance to the specimen, the resulting grayscale images have a resolution of about 660 dpi. A wavelet transform tool was used to extract features from the images.

**Data obtained from UCI ML Repository**

**Objective:**
**To build a classification model that can predict the authenticity of banknotes**

### Importing Libraries

In [15]:
import findspark
findspark.init()

from pyspark.sql import SparkSession, functions as f

In [16]:
spark = SparkSession.builder.master('local[2]').appName('BankNoteAuthentication').getOrCreate()

In [28]:
df = spark.read.csv('BankNote_Authentication.csv',header=True,inferSchema=True)

**Let's print the first few rows of the dataset**

In [29]:
df.show(5)

+--------+--------+--------+--------+-----+
|variance|skewness|curtosis| entropy|class|
+--------+--------+--------+--------+-----+
|  3.6216|  8.6661| -2.8073|-0.44699|    0|
|  4.5459|  8.1674| -2.4586| -1.4621|    0|
|   3.866| -2.6383|  1.9242| 0.10645|    0|
|  3.4566|  9.5228| -4.0112| -3.5944|    0|
| 0.32924| -4.4552|  4.5718| -0.9888|    0|
+--------+--------+--------+--------+-----+
only showing top 5 rows



**Let's describe the dataset**

In [59]:
df.describe().show()

+-------+------------------+------------------+------------------+------------------+------------------+
|summary|          variance|          skewness|          curtosis|           entropy|             class|
+-------+------------------+------------------+------------------+------------------+------------------+
|  count|              1372|              1372|              1372|              1372|              1372|
|   mean|0.4337352570699707|1.9223531206393603|1.3976271172667651|-1.191656520043731|0.4446064139941691|
| stddev|2.8427625862785577| 5.869046743695513| 4.310030090106595| 2.101013137359609|0.4971032701256608|
|    min|           -7.0421|          -13.7731|           -5.2861|           -8.5482|                 0|
|    max|            6.8248|           12.9516|           17.9274|            2.4495|                 1|
+-------+------------------+------------------+------------------+------------------+------------------+



**The dataframe above has four features (variance, skewness, curtosis and entropy of the images) plus the target column, class**
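
Because `class` only takes the values 0 and 1, its mean in the summary above is the share of rows labelled 1, so the class balance can be read off the summary without another pass over the data. A minimal sketch using the numbers printed above (the variable names are illustrative, not part of the notebook):

```python
# Values copied from the describe() output above.
row_count = 1372
class_mean = 0.4446064139941691  # mean of a 0/1 label = share of rows with class 1

# Recover the per-class row counts; round() absorbs floating-point error.
class_1_rows = round(class_mean * row_count)
class_0_rows = row_count - class_1_rows

print(f"class 1: {class_1_rows} rows ({class_1_rows / row_count:.1%})")
print(f"class 0: {class_0_rows} rows ({class_0_rows / row_count:.1%})")
```

This gives 610 rows for class 1 and 762 for class 0, so the two classes are only mildly imbalanced. Running `df.groupBy('class').count().show()` on the dataframe should report the same counts directly.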

**Let's print the schema of the data**

In [30]:
df.printSchema()

root
 |-- variance: double (nullable = true)
 |-- skewness: double (nullable = true)
 |-- curtosis: double (nullable = true)
 |-- entropy: double (nullable = true)
 |-- class: integer (nullable = true)



**Spark has correctly inferred the schema for this data. Hence type casting is not required.**

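**Searching for null and NaN values in df**

In [50]:
# isnan() only matches NaN, so also test isNull() to catch missing values
df.select([f.count(f.when(f.col(c).isNull() | f.isnan(f.col(c)), True)).alias(f'nullcount_{c}') for c in df.columns]).show()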

+------------------+------------------+------------------+-----------------+---------------+
|nullcount_variance|nullcount_skewness|nullcount_curtosis|nullcount_entropy|nullcount_class|
+------------------+------------------+------------------+-----------------+---------------+
|                 0|                 0|                 0|                0|              0|
+------------------+------------------+------------------+-----------------+---------------+



**The dataset is clean: none of the columns contains missing values**

**Let's count the rows in the dataset**

In [58]:
df.count()

1372