<h3>6. Let’s carry out fraud detection on a very large transactional dataset (assume that it is 1 TB in size) generated by a retail chain with multiple store locations.</h3>
In the retail space, an SKU (stock keeping unit) is a unique code that identifies a particular product. In our context, each SKU is a letter followed by a 6-digit sequence, for example, B432156.
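
The SKU convention described above can be checked with a regular expression. This is a minimal sketch: the helper name `is_valid_sku` is hypothetical, and it assumes the leading letter is an uppercase ASCII letter, as in the example B432156.

```python
import re

# Assumption: one uppercase ASCII letter followed by exactly six digits.
SKU_PATTERN = re.compile(r"[A-Z][0-9]{6}")


def is_valid_sku(sku: str) -> bool:
    """Return True if `sku` is a letter followed by a 6-digit sequence."""
    return SKU_PATTERN.fullmatch(sku) is not None


print(is_valid_sku("B432156"))  # the example SKU from the text
print(is_valid_sku("B43215"))   # only five digits
```

A check like this could be used to skip malformed rows before they reach the fraud function.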

Every row in our dataset captures one complete transaction and can be represented as the sequence shown below, where k denotes the number of item types in the transaction.

{ Store ID,
  Transaction Date,
  (Item 1 SKU, Quantity Purchased of Item 1, Unit Cost of Item 1), 
  (Item 2 SKU, Quantity Purchased of Item 2, Unit Cost of Item 2),
      . . . 
  (Item k SKU, Quantity Purchased of Item k, Unit Cost of Item k) }
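
To make the row layout concrete, here is a minimal parsing sketch. It assumes each row is one tab-separated line whose first two fields are the store ID and the transaction date, followed by k groups of three fields (SKU, quantity, unit cost). The names `Item`, `Transaction` and `parse_transaction` are hypothetical, and the example row is illustrative.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Item:
    sku: str
    quantity: int
    unit_cost: float


@dataclass
class Transaction:
    store_id: str
    date: str
    items: List[Item]

    def total(self) -> float:
        """Transaction value: sum of quantity * unit cost over all items."""
        return sum(item.quantity * item.unit_cost for item in self.items)


def parse_transaction(line: str) -> Transaction:
    """Parse one tab-separated row into a Transaction with k items."""
    fields = line.rstrip("\n").split("\t")
    store_id, date, rest = fields[0], fields[1], fields[2:]
    if len(rest) % 3 != 0:
        raise ValueError("item fields must come in (SKU, quantity, unit cost) groups")
    items = [
        Item(rest[i], int(rest[i + 1]), float(rest[i + 2]))
        for i in range(0, len(rest), 3)
    ]
    return Transaction(store_id, date, items)


# Illustrative row with k = 3 items.
example = "5\t2/19/2023\tB398703\t9\t161\tB617760\t8\t473\tB252470\t3\t132"
txn = parse_transaction(example)
print(txn.store_id, len(txn.items), txn.total())
```

Because k is recovered from the number of fields, the same function handles transactions with any number of item types.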

In a flattened out “relational” format, transaction sequences might appear as shown, with items, their quantities of purchase and unit costs listed in serial order.

 

Assume you have been supplied a function isFraud(transaction sequence) that checks if an input transaction is fraudulent or not, and returns TRUE or FALSE. Skip worrying about the construction of the function isFraud or of its correctness – just assume that it is adequate for our purposes. 

As the proprietor of the chain, you wish to know which transactions are fraudulent and, just as importantly, the total amount of fraud by store.

Assume that the data is contained in a tab-separated file AllTransactions.txt. 

a.	In simple words, explain how you would use the Hadoop framework to both represent the problem and then solve it. Elaborate on the dynamics of what you would first do with the data file, what will Hadoop do next, and so on.

b.	Write down a mapper and reducer in a language of your choice, so long as it is commented properly. What are the inputs to each function? What logic will each function perform? Clearly explain how the final answer is arrived at. 


<h4>a. In simple words, explain how you would use the Hadoop framework to both represent the problem and then solve it. Elaborate on the dynamics of what you would first do with the data file, what will Hadoop do next, and so on.</h4>
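
One way to picture the flow: the data file is first copied into HDFS, which stores it as replicated blocks spread across the cluster. Hadoop then runs one map task per input split, close to where the data sits. Each mapper calls isFraud on its rows and emits a (store ID, transaction amount) pair for every fraudulent transaction. The shuffle-and-sort phase groups these pairs by store ID, and each reducer sums the amounts for the stores it receives. The sketch below simulates these phases in memory in plain Python. It is not Hadoop code: the two toy splits, the stand-in rule `toy_is_fraud` (which flags any transaction worth 100 or more) and all helper names are assumptions made purely for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Row = List[str]  # [store_id, date, sku_1, qty_1, cost_1, ..., sku_k, qty_k, cost_k]


def amount(row: Row) -> float:
    """Transaction value: sum of quantity * unit cost over the item groups."""
    items = row[2:]
    return sum(float(items[i + 1]) * float(items[i + 2])
               for i in range(0, len(items), 3))


def map_phase(split: Iterable[Row],
              is_fraud: Callable[[Row], bool]) -> Iterator[Tuple[str, float]]:
    """Map: emit (store_id, amount) for each fraudulent row in one split."""
    for row in split:
        if is_fraud(row):
            yield row[0], amount(row)


def shuffle(pairs: Iterable[Tuple[str, float]]) -> Dict[str, List[float]]:
    """Shuffle and sort: group the emitted values by key (store ID)."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for store_id, value in pairs:
        grouped[store_id].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(grouped: Dict[str, List[float]]) -> Dict[str, float]:
    """Reduce: sum the fraudulent amounts for each store."""
    return {store_id: sum(values) for store_id, values in grouped.items()}


def toy_is_fraud(row: Row) -> bool:
    """Stand-in for the supplied isFraud; the threshold is arbitrary."""
    return amount(row) >= 100


# Two hypothetical input splits, as if the file had been divided into two blocks.
splits = [
    [["1", "1/5/2023", "A000001", "2", "100"],
     ["2", "1/5/2023", "A000002", "1", "50"]],
    [["1", "1/6/2023", "A000003", "3", "200", "A000004", "1", "10"],
     ["2", "1/6/2023", "A000005", "4", "25"]],
]

mapped = [pair for split in splits for pair in map_phase(split, toy_is_fraud)]
totals = reduce_phase(shuffle(mapped))
print(totals)
```

On a real cluster the mappers run in parallel on different nodes and the grouping is done by the framework, so only the map and reduce logic has to be written by hand.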




<h4>b. Write down a mapper and reducer in a language of your choice, so long as it is commented properly. What are the inputs to each function? What logic will each function perform? Clearly explain how the final answer is arrived at.</h4>

In [6]:
# Load Spark libraries
import numpy as np
import pandas as pd
import pyspark

from pyspark import SparkContext
from pyspark.sql.session import SparkSession
from pyspark.sql.functions import *

In [7]:
# Print all outputs in a block - not just the last one

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [8]:
from pyspark.sql import SparkSession
spark = SparkSession \
        .builder \
        .appName("PythonSpark") \
        .config("hive.metastore.uris",
                "thrift://hive-metastore:9083") \
        .config("spark.sql.warehouse.dir",
                "hdfs://namenode:8020/user/hive/warehouse") \
        .enableHiveSupport() \
        .getOrCreate()

print(spark.version)

spark

3.3.1


In [21]:
from pyspark.sql.types import StructType, StructField, StringType

# Define the schema: store ID, transaction date, then three
# (SKU, quantity, unit cost) groups. All columns are read as strings.
schema = StructType([
    StructField("Store_ID", StringType(), True),
    StructField("Transaction_Date", StringType(), True),
    StructField("SKU1", StringType(), True),
    StructField("Quantity1", StringType(), True),
    StructField("UnitCost1", StringType(), True),
    StructField("SKU2", StringType(), True),
    StructField("Quantity2", StringType(), True),
    StructField("UnitCost2", StringType(), True),
    StructField("SKU3", StringType(), True),
    StructField("Quantity3", StringType(), True),
    StructField("UnitCost3", StringType(), True),
])


In [22]:
# Read the retail sales data from HDFS on the Hive cluster, using the schema defined above
sales_df = spark.read \
                    .option("header", "true") \
                    .schema(schema) \
                    .csv("hdfs://namenode:8020/examples/RetailSales", sep=r",")

In [23]:
sales_df.printSchema()

root
 |-- Store_ID: string (nullable = true)
 |-- Transaction_Date: string (nullable = true)
 |-- SKU1: string (nullable = true)
 |-- Quantity1: string (nullable = true)
 |-- UnitCost1: string (nullable = true)
 |-- SKU2: string (nullable = true)
 |-- Quantity2: string (nullable = true)
 |-- UnitCost2: string (nullable = true)
 |-- SKU3: string (nullable = true)
 |-- Quantity3: string (nullable = true)
 |-- UnitCost3: string (nullable = true)



In [24]:
sales_df.take(5)

[Row(Store_ID='5', Transaction_Date='2/19/2023', SKU1='B398703', Quantity1='9', UnitCost1='161', SKU2='B617760', Quantity2='8', UnitCost2='473', SKU3='B252470', Quantity3='3', UnitCost3='132'),
 Row(Store_ID='8', Transaction_Date='11/30/2022', SKU1='B325734', Quantity1='2', UnitCost1='253', SKU2='B598408', Quantity2='1', UnitCost2='328', SKU3='B777874', Quantity3='9', UnitCost3='494'),
 Row(Store_ID='7', Transaction_Date='8/6/2022', SKU1='B914702', Quantity1='2', UnitCost1='178', SKU2='B441976', Quantity2='1', UnitCost2='111', SKU3='B785187', Quantity3='10', UnitCost3='292'),
 Row(Store_ID='2', Transaction_Date='10/6/2022', SKU1='B532921', Quantity1='2', UnitCost1='275', SKU2='B477851', Quantity2='2', UnitCost2='401', SKU3='B334946', Quantity3='9', UnitCost3='492'),
 Row(Store_ID='1', Transaction_Date='10/15/2022', SKU1='B554939', Quantity1='2', UnitCost1='182', SKU2='B913557', Quantity2='8', UnitCost2='345', SKU3='B483552', Quantity3='7', UnitCost3='331')]

In [27]:
# Bring the first five rows to the driver as a pandas DataFrame
sales_df_p = sales_df.limit(5).toPandas()

In [26]:
sales_df_p

Unnamed: 0,Store_ID,Transaction_Date,SKU1,Quantity1,UnitCost1,SKU2,Quantity2,UnitCost2,SKU3,Quantity3,UnitCost3
0,5,2/19/2023,B398703,9,161,B617760,8,473,B252470,3,132
1,8,11/30/2022,B325734,2,253,B598408,1,328,B777874,9,494
2,7,8/6/2022,B914702,2,178,B441976,1,111,B785187,10,292
3,2,10/6/2022,B532921,2,275,B477851,2,401,B334946,9,492
4,1,10/15/2022,B554939,2,182,B913557,8,345,B483552,7,331
...,...,...,...,...,...,...,...,...,...,...,...
195,1,8/25/2022,B400974,6,405,B150259,4,293,B872148,3,101
196,2,8/7/2022,B708110,2,102,B137105,2,357,B406237,8,104
197,10,10/23/2022,B724754,2,318,B513070,5,327,B340658,7,109
198,1,2/22/2022,B919699,4,487,B357324,10,464,B912729,1,309


In [55]:
# Placeholder for the supplied fraud check: takes one transaction (row) and
# returns True if it is fraudulent. This stub flags every transaction.
def isFraud(transaction):
    return True

# Pass the rows one at a time to the fraud check and print the result for each.
for row in sales_df_p.itertuples():
    if isFraud(row):
        print('>> Print transaction No: ', str(row[0]), ' is fraud\n')
    else:
        print('>> Print transaction No: ', str(row[0]), ' is not fraud\n')

>> Print transaction No:  0  is fraud

>> Print transaction No:  1  is fraud

>> Print transaction No:  2  is fraud

>> Print transaction No:  3  is fraud

>> Print transaction No:  4  is fraud
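
The cells above flag rows on the driver but do not yet produce the total amount of fraud by store. A mapper and reducer in the Hadoop Streaming style could look like the sketch below. Assumptions: input rows are tab-separated, as stated for AllTransactions.txt; `isFraud` is the supplied function, stubbed here to flag everything; the mapper emits `store_id<TAB>amount` lines, and the reducer receives those lines sorted by key, which is what Hadoop's shuffle provides. The function names, the local `sorted` call standing in for the shuffle, and the third sample row are illustrative only.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def isFraud(fields: List[str]) -> bool:
    """Placeholder for the supplied fraud check; flags every transaction."""
    return True


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Input: raw transaction lines.
    Output: 'store_id<TAB>amount' for each fraudulent transaction."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Expect store ID, date, then complete (SKU, quantity, unit cost) groups.
        if len(fields) < 5 or (len(fields) - 2) % 3 != 0:
            continue  # skip malformed rows
        if not isFraud(fields):
            continue
        items = fields[2:]
        try:
            amount = sum(float(items[i + 1]) * float(items[i + 2])
                         for i in range(0, len(items), 3))
        except ValueError:
            continue  # non-numeric quantity or cost, e.g. a header row
        yield f"{fields[0]}\t{amount}"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Input: mapper output sorted by store ID.
    Output: 'store_id<TAB>total fraud amount', one line per store."""
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for store_id, group in groupby(pairs, key=lambda kv: kv[0]):
        total = sum(float(value) for _, value in group)
        yield f"{store_id}\t{total}"


# Local dry run: the first two rows from the sample output above,
# plus one hypothetical extra row for store 5.
sample = [
    "5\t2/19/2023\tB398703\t9\t161\tB617760\t8\t473\tB252470\t3\t132",
    "8\t11/30/2022\tB325734\t2\t253\tB598408\t1\t328\tB777874\t9\t494",
    "5\t3/1/2023\tB000001\t1\t100",
]
results = list(reducer(sorted(mapper(sample))))
for line in results:
    print(line)
```

Under Hadoop Streaming, the two functions would live in separate scripts that read `sys.stdin` and print to standard output, and the scripts would be supplied through the streaming jar's `-mapper` and `-reducer` options. To also list which transactions are fraudulent, the mapper could emit the full row alongside the amount.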

