# Null Value & Missing Data Analysis

## Census Data

- sa2_census.parquet
- sa2_pops.parquet
- sa2_to_postcode.parquet

The sa2_to_postcode dataframe has 363 null postcodes out of 35,040 rows, most likely because no postcode is linked to the corresponding SA2 areas. The postcodes present in the consumer data appear to be covered by the SA2 to postcode coding index already, so we can safely ignore these null postcodes.

Apart from these null postcodes, there are no missing entries in any of the initial SA2 datasets.
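
A per-column null count like the one reported above can be produced with a few lines of pandas. This is a minimal sketch on toy data, not the project's actual code: the helper `null_report` and the column names `sa2_code` and `postcode` are assumptions made for illustration.

```python
import pandas as pd


def null_report(df: pd.DataFrame) -> pd.DataFrame:
    """Return the null count and null fraction for every column of df."""
    counts = df.isna().sum()
    return pd.DataFrame({"null_count": counts, "null_fraction": counts / len(df)})


# Toy stand-in for sa2_to_postcode; the column names are assumed.
sa2_to_postcode = pd.DataFrame(
    {
        "sa2_code": [101, 102, 103, 104],
        "postcode": [3000, None, 3052, None],
    }
)

report = null_report(sa2_to_postcode)
print(report)
```

On the real data the same helper could be applied to a dataframe loaded with `pd.read_parquet`.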

## Consumer Data
- consumer_fraud_probability.parquet
- consumer_tbl.parquet
- consumer_user_details.parquet

There are no null values or missing entries in the consumer dataframes.

## Merchant Data
- merchant_fraud_probability.parquet
- merchant_tbl.parquet

There are no null values or missing entries in the merchant dataframes.

## Transaction Data
- transactions_all.parquet

There are no null values or missing entries in the transaction dataframe.

## Joining Datasets

The datasets are joined step by step, checking for null values and missing rows after each join.

### Join cons_tbl data with cons_user_det

There are no null values after joining the consumer data.
No data is lost in this join either: there are 499,999 rows before the join and still 499,999 rows after it.
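
The check applied after each join (compare row counts, then look for nulls) could be wrapped in a small helper. The sketch below uses pandas on toy data. The function `checked_merge`, the join key `consumer_id`, and the other column names are hypothetical and only illustrate the idea.

```python
import pandas as pd


def checked_merge(left, right, on, how="inner"):
    """Merge two dataframes and summarise the row-count change and nulls."""
    merged = left.merge(right, on=on, how=how)
    summary = {
        "rows_before": len(left),
        "rows_after": len(merged),
        "row_delta": len(merged) - len(left),
        "columns_with_nulls": [c for c in merged.columns if merged[c].isna().any()],
    }
    return merged, summary


# Toy stand-ins for cons_tbl and cons_user_det; column names are assumed.
cons_tbl = pd.DataFrame({"consumer_id": [10, 11, 12], "postcode": [3000, 3052, 2000]})
cons_user_det = pd.DataFrame({"user_id": [1, 2, 3], "consumer_id": [10, 11, 12]})

cons_join, summary = checked_merge(cons_tbl, cons_user_det, on="consumer_id")
print(summary)
```

A `row_delta` of zero together with an empty `columns_with_nulls` list corresponds to the "no nulls, no rows lost" outcome described for this join.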

### Join cons_join with transaction_all

There are no null values after joining in the transaction data.
No data is lost in this join either: the transaction data has 14,195,505 rows before the join and still 14,195,505 rows after it.

### Join cons_transaction with cons_fp

There are 14,115,157 null values in the consumer fraud probability column out of 14,195,717 rows. This is expected, as not all transactions have been flagged as possibly fraudulent; we will simply set the fraud probability to 0 for all other transactions.
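
Setting the fraud probability to 0 for unflagged transactions might look like the following in pandas. This is a sketch under assumptions: the column names (`user_id`, `order_datetime`, `fraud_probability`, `dollar_value`) and the toy rows are illustrative, and a left join is assumed so that unflagged transactions are kept.

```python
import pandas as pd

# Toy stand-ins for the consumer/transaction data and cons_fp; names are assumed.
transactions = pd.DataFrame(
    {
        "user_id": [1, 2, 3],
        "order_datetime": ["2021-03-01", "2021-03-01", "2021-03-02"],
        "dollar_value": [20.0, 35.5, 12.0],
    }
)
cons_fp = pd.DataFrame(
    {
        "user_id": [2],
        "order_datetime": ["2021-03-01"],
        "fraud_probability": [41.2],
    }
)

# A left join keeps every transaction; unflagged ones get a null probability.
joined = transactions.merge(cons_fp, on=["user_id", "order_datetime"], how="left")
nulls_before_fill = int(joined["fraud_probability"].isna().sum())

# Treat "never flagged" as a fraud probability of 0.
joined["fraud_probability"] = joined["fraud_probability"].fillna(0.0)
print(nulls_before_fill, joined["fraud_probability"].tolist())
```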

The data also grows in this join: there are 14,195,505 rows before the join and 14,195,717 rows after it. The 212 extra rows most likely come from transactions in the left table that have multiple matches in the right table (on user id and order datetime in the consumer_fraud_probability data). Because the number of duplicate matches is small relative to the total amount of consumer fraud probability data, we can safely ignore them, especially since the fraud probabilities will mainly be used to filter out fraudulent transactions.
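
One way to confirm that duplicate join keys in the right table explain the extra rows is to count them directly. The sketch below is illustrative only: the column names are assumed, and the final deduplication step (keeping the highest probability per key) is an alternative shown for completeness, not what this analysis does, since the duplicates are ignored here.

```python
import pandas as pd

# Toy stand-in for consumer_fraud_probability with one duplicated key.
cons_fp = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 3],
        "order_datetime": ["2021-03-01", "2021-03-01", "2021-03-01", "2021-03-02"],
        "fraud_probability": [10.0, 12.5, 30.0, 8.0],
    }
)
keys = ["user_id", "order_datetime"]

# Rows beyond the first occurrence of a key; each adds one extra row
# for every matching row in the left table.
extra_rows_per_match = int(cons_fp.duplicated(subset=keys).sum())

# Optional alternative: collapse duplicates before joining.
deduped = cons_fp.groupby(keys, as_index=False)["fraud_probability"].max()
print(extra_rows_per_match, len(deduped))
```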

### Join cons_transaction_with_fraud with merch_tbl

The only column with null values is still the consumer fraud probability, with 13,543,038 nulls out of 13,614,854 rows.

The data shrinks in this join: there are 14,195,505 rows before the join and 13,614,854 rows after it, so 580,651 rows are lost. The likely cause is that some merchant ABNs in the transaction data are not covered by the merchant data, either because the merchant ABN was entered incorrectly in the transaction data or because the merchant's information is missing from the merchant data. In either case there is little that can be done to resolve this, so we simply ignore these transactions.
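
To see which merchant ABNs cause rows to drop out, a left join with pandas' `indicator=True` option can flag the transactions that have no match. This is a sketch on toy data. The column names `merchant_abn`, `dollar_value`, and `name` are assumptions.

```python
import pandas as pd

# Toy stand-ins for the transaction data and merch_tbl; names are assumed.
transactions = pd.DataFrame(
    {"merchant_abn": [111, 222, 333, 111], "dollar_value": [20.0, 35.5, 12.0, 8.0]}
)
merch_tbl = pd.DataFrame(
    {"merchant_abn": [111, 222], "name": ["Merchant A", "Merchant B"]}
)

# indicator=True adds a "_merge" column saying where each row was found.
probe = transactions.merge(merch_tbl, on="merchant_abn", how="left", indicator=True)
unmatched = probe[probe["_merge"] == "left_only"]
missing_abns = sorted(unmatched["merchant_abn"].unique().tolist())
print(len(unmatched), missing_abns)
```

The number of unmatched rows should equal the number of rows lost by the corresponding inner join.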

### Join cons_transaction_fraud_merchant with merch_fp

After joining the merchant fraud probability data, two columns have null values: consumer_fraud_probability_% (13,543,038 nulls) and merchant_fraud_probability_% (13,610,826 nulls). This is again expected, as not all transactions have been flagged as possibly fraudulent; we will simply set the fraud probability to 0 for all other transactions.

There is also no change in the size of this data after this join. The number of rows before the join is 13,614,854 and remains the same after the join.

## Joining SA2/Census Data

### Join the sa2_to_post to sa2_pops

This join introduces no null values, and no rows are lost: 35,040 rows remain after the join.

### Join the sa2_postcode_and_pops with sa2_census

The null values found earlier are still present (363 postcode values are null).

After the join there are still 35,040 rows in the dataframe, so no rows have been lost.

## Joining the Original Data with the SA2 Population and Census Data

This join introduces no new columns with null values. The only columns with nulls are still consumer_fraud_probability_% (11,312,983 nulls) and merchant_fraud_probability_% (11,369,563 nulls).

The data shrinks in this join: there are 13,614,854 rows before the join and 11,372,905 rows after it, so 2,241,949 rows are lost. This is likely because not all postcodes present in the consumer data are present in the SA2 to postcode coding index. Further investigation online suggests that a large portion of these postcodes either have no SA2 code in the 2021 data or are tied to PO boxes. We therefore assume it is safe to ignore the transactions lost in this join, since a large portion of them, if not all, either fall outside the SA2 data or were entered with a PO box postcode.
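
To put the two lossy joins in proportion, the reported row counts can be turned into loss fractions. The snippet below only uses the before and after figures quoted in this document. The dictionary layout and labels are illustrative.

```python
# Row counts before and after each lossy join, as reported in this document.
joins = {
    "merchant table": (14_195_505, 13_614_854),
    "SA2/census data": (13_614_854, 11_372_905),
}

losses = {}
for name, (before, after) in joins.items():
    lost = before - after
    losses[name] = (lost, lost / before)
    print(f"{name}: {lost:,} rows lost ({lost / before:.2%} of the input)")
```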

# Outlier Analysis

For our outlier analysis we looked at records that fell below the 1st percentile of order counts (3 orders) in our total dataset. In total, 61 records were removed, leaving 3,965 merchants. These outliers had higher consumer and merchant fraud rates than the rest of the dataset.
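
A percentile-based filter of this kind could be sketched as follows. The sketch assumes the filter is applied to the number of orders per merchant, which is one reading of the description above. The function name `drop_low_volume_merchants` and the column `merchant_abn` are hypothetical, and the toy data is not the project's.

```python
import pandas as pd


def drop_low_volume_merchants(transactions: pd.DataFrame, q: float = 0.01):
    """Drop merchants whose order count is below the q-th quantile of order counts."""
    order_counts = transactions.groupby("merchant_abn").size()
    threshold = order_counts.quantile(q)
    keep = order_counts[order_counts >= threshold].index
    kept = transactions[transactions["merchant_abn"].isin(keep)]
    return kept, threshold


# Toy data: merchant 1 has a single order, merchants 2 to 5 have ten each.
transactions = pd.DataFrame(
    {"merchant_abn": [1] + [2] * 10 + [3] * 10 + [4] * 10 + [5] * 10}
)

kept, threshold = drop_low_volume_merchants(transactions)
removed = transactions["merchant_abn"].nunique() - kept["merchant_abn"].nunique()
print(removed, kept["merchant_abn"].nunique())
```

Whether the comparison should be strict (`<`) or inclusive (`<=`) at the threshold is a design choice that the text above does not settle. The sketch keeps merchants that sit exactly at the threshold.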

# Data Insights