# Solution for Overpaid Taxes

## Problem Understanding


Ask:

Develop a data science-based solution to optimize Deloitte Tax's process of identifying and refunding overpaid taxes for their clients. Analyze existing data from the client's accounts payable system to discover potential areas of improvement and create a model to streamline the process. Apply the data science process (attached) to historical client project data and present your findings and model to the Deloitte Tax team.
 
Context:

Clients provide four years of export data from their accounts payable system. The data covers all areas of business spending and can span multiple tax jurisdictions.
Each jurisdiction can treat the taxability of the same item differently.
The main output of the Deloitte teams' current work is a taxability determination, which is either ‘taxable’ or ‘non-taxable’. Once this determination is made, overpayments are identified as transactions where tax was paid even though the transaction is ‘non-taxable’. The determination is stored in the field “Taxability.STATE.Status”.
Clients want to understand why each determination is made so that they can update their tax software to correct past mistakes. This need for explainability should inform the type and complexity of the model selected.
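
The overpayment rule in the context above can be sketched as a simple filter. This is a minimal illustration on made-up data, not the client's actual schema: the `TaxPaid` column name and all values are hypothetical, and only `Taxability.STATE.Status` comes from the problem statement.

```python
import pandas as pd

# Toy data for illustration only; `TaxPaid` is a hypothetical column name.
toy = pd.DataFrame({
    "Taxability.STATE.Status": ["TAXABLE", "NONTAXABLE", "NONTAXABLE", "TAXABLE"],
    "TaxPaid": [8.25, 4.10, 0.00, 12.00],
})

def find_overpayments(frame, status_col="Taxability.STATE.Status", tax_col="TaxPaid"):
    """Return rows where tax was paid on a transaction determined to be non-taxable."""
    mask = (frame[status_col] == "NONTAXABLE") & (frame[tax_col] > 0)
    return frame[mask]

overpaid = find_overpayments(toy)
total_refund = overpaid["TaxPaid"].sum()
print(overpaid)
print("Potential refund:", total_refund)
```

On real data, the tax-amount column would need to be identified from the export before a filter like this could be applied.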
 


## Import libraries and load data

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from ydata_profiling import ProfileReport
from sklearn import metrics, model_selection
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import (
    accuracy_score, auc, classification_report,
    confusion_matrix, mean_squared_error, roc_curve,
)
%matplotlib inline

In [None]:
# Load data
df = pd.read_csv('Disney_CogTax_CA_Audit_Addl_Training-Working_Transaction[62].csv')

## Exploratory Data Analysis (EDA)

In [None]:
# Dataframe shape
print('Dataframe shape:')
print(df.shape)

The dataframe has 4457 rows and 61 columns.

### Univariate Analysis

In [None]:
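# Basic info (df.info() prints its summary directly and returns None,
# so wrapping it in print() would emit a stray "None")
print('Basic Info:')
df.info()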
print(' ')


The Basic Info output lists every column name in the dataframe. There are three data types (object, float64, and int64), so the dataframe contains both numerical and categorical data. Several columns are completely empty. Among the columns that do contain entries, only one:

     'Taxability.STATE.Exemption.CategoryCode'

has missing (null) values.
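
The split between fully empty and partially missing columns can also be checked programmatically instead of being read off the `df.info()` output. The sketch below uses a small made-up dataframe; the helper `summarize_missing` and the `Amount` and `Unused` columns are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

def summarize_missing(frame):
    """Split columns into fully empty and partially missing groups."""
    null_counts = frame.isna().sum()
    empty_cols = null_counts[null_counts == len(frame)].index.tolist()
    partial = null_counts[(null_counts > 0) & (null_counts < len(frame))]
    return empty_cols, partial

# Toy data for illustration only.
toy = pd.DataFrame({
    "Amount": [10.0, 20.0, 30.0],
    "Unused": [np.nan, np.nan, np.nan],
    "Taxability.STATE.Exemption.CategoryCode": ["A1", None, "B2"],
})

empty_cols, partial = summarize_missing(toy)
print("Empty columns:", empty_cols)
print("Partially missing columns (null counts):")
print(partial)
```

Calling `summarize_missing(df)` on the loaded data would list the empty columns, which could then be dropped with `df.drop(columns=empty_cols)` if they are not needed.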

In [None]:
df['Taxability.STATE.Status']

This column holds the taxability determination, either TAXABLE or NONTAXABLE, and serves as the target variable.
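
Before modeling, it may help to check how balanced the two classes are and to encode the target as 0/1. The following is a minimal sketch on made-up labels; the values are not taken from the client data, and mapping TAXABLE to 1 is an arbitrary choice made here for illustration.

```python
import pandas as pd

# Toy labels for illustration only.
status = pd.Series(
    ["TAXABLE", "NONTAXABLE", "TAXABLE", "TAXABLE"],
    name="Taxability.STATE.Status",
)

counts = status.value_counts()                 # absolute class counts
shares = status.value_counts(normalize=True)   # class proportions
y = (status == "TAXABLE").astype(int)          # 1 = TAXABLE, 0 = NONTAXABLE

print(counts)
print(shares)
print(y.tolist())
```

With the notebook's dataframe, the same calls would be applied to `df['Taxability.STATE.Status']`. A strongly imbalanced split would suggest looking beyond plain accuracy when evaluating the model.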