# Solution for Overpaid Taxes

## Problem Understanding
Ask:

Develop a data science-based solution to optimize Deloitte Tax's process of identifying and refunding overpaid taxes for their clients. Analyze existing data from the client's accounts payable system to discover potential areas of improvement and create a model to streamline the process. Apply the data science process (attached) to historical client project data and present your findings and model to the Deloitte Tax team.

Context:

Clients provide 4 years’ worth of export data from their accounts payable system. This covers all areas of their business spending, and can cover multiple different tax jurisdictions. Each jurisdiction can have their own way to treat the taxability for the same items. The main output of the work Deloitte teams do currently are determinations for taxability, which can be ‘taxable’ and ‘non-taxable’. Once this determination is made, overpayments are found by finding when taxes have been paid for transactions that are ‘non-taxable’. This field is labeled “Taxability.STATE.Status” Clients want to understand why determinations are made so that their tax software can be updated to address mistakes previously made. Incorporate this need into the type and complexity of the model selected.

Insight:

The clients ask presents a  binary classifictaion problem between the two classes of 'taxable' and 'non-taxable' status, as determined in our target variable for this dataset, “Taxability.STATE.Status”.

## Import libraries and load data

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from ydata_profiling import ProfileReport
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler
from sklearn import metrics
from sklearn.metrics import mean_squared_error,confusion_matrix
from sklearn.metrics import auc,roc_curve, accuracy_score, classification_report
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection
from category_encoders import HashingEncoder
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

In [2]:
# load data
data = pd.read_csv('KP_NV19_SUM_and_Details_Separate-Working_Summ_with totals.csv')

  data = pd.read_csv('KP_NV19_SUM_and_Details_Separate-Working_Summ_with totals.csv')


In [3]:
data.head()

Unnamed: 0,Invoice.ParentCompanyCode,Invoice.CompanyCode,Invoice.VendorName,Invoice.VendorCode,Invoice.Date,Invoice.Number,Invoice.Sequence,Line.Number,Line.CostCenterCode,Line.GLMainAccountNumber,...,Taxability.STATE.OutOfStatuteDate,Taxability.Mode,Taxability.STATE.Status,Taxability.STATE.ReviewStatus,Taxability.STATE.Confidence,Invoice.Note.text,Invoice.VoucherCode,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)"
0,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22768,1,1,4450-Prop/Fac-Facilities Svcs,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22768,0.0,2.44,NO_ERROR
1,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22767,1,1,4450-Prop/Fac-Facilities Svcs,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22767,0.0,3.14,NO_ERROR
2,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22773,1,1,0279-Hosp Svcs - Epidemic,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22773,0.0,84.94,NO_ERROR
3,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22771,1,1,4450-Prop/Fac-Facilities Svcs,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22771,0.0,0.87,NO_ERROR
4,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22779,1,1,0308-Outpatient Surgery- 1,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.98,00000000-ALL ITEMS,22779,0.0,2.91,NO_ERROR


In [4]:
data['@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)']

0             NO_ERROR
1             NO_ERROR
2             NO_ERROR
3             NO_ERROR
4             NO_ERROR
              ...     
224655       LIABILITY
224656       LIABILITY
224657       LIABILITY
224658       LIABILITY
224659    UNDETERMINED
Name: @CustomField(Invoice,TaxabilityClassification,Invoice,STRING), Length: 224660, dtype: object

## Exploratory Data Analysis (EDA)

In [5]:
# Dataframe shape
print('Dataframe shape:')
print(data.shape)

Dataframe shape:
(224660, 35)


This dataframe contains 4457 rows and 61 columns

### Univariate Analysis

In [6]:
# Basic info
print('Basic Info:')
print(data.info())
print(' ')

Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 224660 entries, 0 to 224659
Data columns (total 35 columns):
 #   Column                                                         Non-Null Count   Dtype  
---  ------                                                         --------------   -----  
 0   Invoice.ParentCompanyCode                                      224660 non-null  int64  
 1   Invoice.CompanyCode                                            224660 non-null  object 
 2   Invoice.VendorName                                             224660 non-null  object 
 3   Invoice.VendorCode                                             224660 non-null  object 
 4   Invoice.Date                                                   224660 non-null  object 
 5   Invoice.Number                                                 224660 non-null  int64  
 6   Invoice.Sequence                                               224660 non-null  int64  
 7   Line.Number                        

In Basic Info, all dataframe feature column names are listed. There are three data types in this dataset: object, float64, and int64, therefore we have both numerical and categorical data in this dataframe. Several columns are empty and have no entries at all. Here, only three features:

 'Taxability.STATE.Exemption.CategoryCode'
 
 'Invoice.VendorCode'
 
 'Invoice.VendorName'
 
are identified as having missing data producing null values. However as we will come to find later that identifictaion of null values is in fact decpetive.

In [7]:
data['Taxability.STATE.Status']

0              TAXABLE
1              TAXABLE
2              TAXABLE
3              TAXABLE
4              TAXABLE
              ...     
224655         TAXABLE
224656         TAXABLE
224657         TAXABLE
224658         TAXABLE
224659    UNDETERMINED
Name: Taxability.STATE.Status, Length: 224660, dtype: object

This categorical feature is our target variable and gives us the determination for taxability as either 'TAXABLE' or 'NONTAXABLE' .

In [8]:
data.isnull().sum(axis=0)

Invoice.ParentCompanyCode                                           0
Invoice.CompanyCode                                                 0
Invoice.VendorName                                                  0
Invoice.VendorCode                                                  0
Invoice.Date                                                        0
Invoice.Number                                                      0
Invoice.Sequence                                                    0
Line.Number                                                         0
Line.CostCenterCode                                                 0
Line.GLMainAccountNumber                                            0
Line.GLSubAccountNumber                                             0
Line.GLAccountDescription                                           0
Invoice.Description                                                 0
Line.ItemDescription                                                0
Invoice.GrossValue  

In [9]:
list(zip(data.columns, data.nunique()))

[('Invoice.ParentCompanyCode', 26),
 ('Invoice.CompanyCode', 26),
 ('Invoice.VendorName', 2298),
 ('Invoice.VendorCode', 3038),
 ('Invoice.Date', 1),
 ('Invoice.Number', 224660),
 ('Invoice.Sequence', 1),
 ('Line.Number', 1),
 ('Line.CostCenterCode', 1549),
 ('Line.GLMainAccountNumber', 222),
 ('Line.GLSubAccountNumber', 8),
 ('Line.GLAccountDescription', 221),
 ('Invoice.Description', 76913),
 ('Line.ItemDescription', 76913),
 ('Invoice.GrossValue', 60512),
 ('Invoice.SalesTaxPaid', 22609),
 ('Invoice.UseTax', 9535),
 ('Invoice.NetValue', 74395),
 ('Line.GrossValue', 60512),
 ('Line.SalesTaxPaid', 22609),
 ('Line.UseTax', 9535),
 ('Line.NetValue', 74395),
 ('Invoice.PaymentReference', 224660),
 ('Taxability.STATE.JurisdictionCode', 11),
 ('Taxability.STATE.JurisdictionDescription', 11),
 ('Taxability.STATE.OutOfStatuteDate', 5),
 ('Taxability.Mode', 1),
 ('Taxability.STATE.Status', 3),
 ('Taxability.STATE.ReviewStatus', 2),
 ('Taxability.STATE.Confidence', 2911),
 ('Invoice.Note.text'

In [10]:
data['Line.GLAccountDescription']

0         Other Non-Medical Supplies
1         Other Non-Medical Supplies
2         Other Non-Medical Supplies
3         Other Non-Medical Supplies
4         Other Non-Medical Supplies
                     ...            
224655        CIP-Moveable Equipment
224656        CIP-Moveable Equipment
224657        CIP-Moveable Equipment
224658        CIP-Moveable Equipment
224659        CIP-Moveable Equipment
Name: Line.GLAccountDescription, Length: 224660, dtype: object

In [11]:
data_final = data

In [12]:
data_final.head()

Unnamed: 0,Invoice.ParentCompanyCode,Invoice.CompanyCode,Invoice.VendorName,Invoice.VendorCode,Invoice.Date,Invoice.Number,Invoice.Sequence,Line.Number,Line.CostCenterCode,Line.GLMainAccountNumber,...,Taxability.STATE.OutOfStatuteDate,Taxability.Mode,Taxability.STATE.Status,Taxability.STATE.ReviewStatus,Taxability.STATE.Confidence,Invoice.Note.text,Invoice.VoucherCode,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)"
0,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22768,1,1,4450-Prop/Fac-Facilities Svcs,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22768,0.0,2.44,NO_ERROR
1,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22767,1,1,4450-Prop/Fac-Facilities Svcs,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22767,0.0,3.14,NO_ERROR
2,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22773,1,1,0279-Hosp Svcs - Epidemic,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22773,0.0,84.94,NO_ERROR
3,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22771,1,1,4450-Prop/Fac-Facilities Svcs,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.99,00000000-ALL ITEMS,22771,0.0,0.87,NO_ERROR
4,201,NORTHERN CA HOSPITAL,W W GRAINGER INC,100008941,11/1/19,22779,1,1,0308-Outpatient Surgery- 1,76999,...,1/31/23,INVOICE,TAXABLE,OUTSTANDING,99.98,00000000-ALL ITEMS,22779,0.0,2.91,NO_ERROR


In [13]:
X = data
X = X.dropna(subset=['@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)'])

In [14]:
y = X['Taxability.STATE.Status']
X = X.drop(['Taxability.STATE.Status'], axis =1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 41)

In [15]:

X_train.drop('Invoice.Number', axis=1,inplace=True)
X_train.drop('Invoice.VoucherCode', axis=1,inplace=True)
X_train.drop('Invoice.PaymentReference', axis=1,inplace=True)
X_train.drop('Taxability.Mode', axis=1,inplace=True)
X_train.drop('Line.Number', axis=1,inplace=True)
X_train.drop('Invoice.Sequence', axis=1,inplace=True)
X_train.drop('Invoice.Date', axis=1,inplace=True)
X_train.drop("Line.GrossValue",axis=1,inplace=True)
X_train.drop("Line.SalesTaxPaid",axis=1,inplace=True)
X_train.drop("Line.UseTax",axis=1,inplace=True)
X_train.drop("Line.NetValue",axis=1,inplace=True)
X_train.drop("Line.ItemDescription",axis=1,inplace=True)
X_train.drop('Line.GLSubAccountNumber', axis=1,inplace=True)
X_train.drop('Invoice.Note.text', axis=1,inplace=True)
X_train.drop('Taxability.STATE.JurisdictionDescription', axis=1,inplace=True)
X_train.drop('Invoice.VendorName',axis=1,inplace=True)
X_train.drop('Invoice.VendorCode',axis=1,inplace=True)
X_train.drop('Line.CostCenterCode',axis=1,inplace=True)
X_train.drop('Invoice.CompanyCode',axis=1,inplace=True)
X_train.drop('Line.GLMainAccountNumber',axis=1,inplace=True)
X_train.drop("Invoice.ParentCompanyCode",axis=1,inplace=True)

X_train.info()
print('data_final dimensions: ')
print(X_train.shape)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165261 entries, 182352 to 35183
Data columns (total 13 columns):
 #   Column                                                         Non-Null Count   Dtype  
---  ------                                                         --------------   -----  
 0   Line.GLAccountDescription                                      165261 non-null  object 
 1   Invoice.Description                                            165261 non-null  object 
 2   Invoice.GrossValue                                             165261 non-null  float64
 3   Invoice.SalesTaxPaid                                           165261 non-null  float64
 4   Invoice.UseTax                                                 165261 non-null  float64
 5   Invoice.NetValue                                               165261 non-null  float64
 6   Taxability.STATE.JurisdictionCode                              165261 non-null  object 
 7   Taxability.STATE.OutOfStatuteDate          

In [16]:
X_train = X_train.replace(to_replace='N.A',value=np.nan)

In [17]:
X_train.isna().sum()

Line.GLAccountDescription                                        12
Invoice.Description                                               0
Invoice.GrossValue                                                0
Invoice.SalesTaxPaid                                              0
Invoice.UseTax                                                    0
Invoice.NetValue                                                  0
Taxability.STATE.JurisdictionCode                                 4
Taxability.STATE.OutOfStatuteDate                                88
Taxability.STATE.ReviewStatus                                     0
Taxability.STATE.Confidence                                       0
@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)              0
@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)                0
@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)     0
dtype: int64

In [18]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 165261 entries, 182352 to 35183
Data columns (total 13 columns):
 #   Column                                                         Non-Null Count   Dtype  
---  ------                                                         --------------   -----  
 0   Line.GLAccountDescription                                      165249 non-null  object 
 1   Invoice.Description                                            165261 non-null  object 
 2   Invoice.GrossValue                                             165261 non-null  float64
 3   Invoice.SalesTaxPaid                                           165261 non-null  float64
 4   Invoice.UseTax                                                 165261 non-null  float64
 5   Invoice.NetValue                                               165261 non-null  float64
 6   Taxability.STATE.JurisdictionCode                              165257 non-null  object 
 7   Taxability.STATE.OutOfStatuteDate          

In [19]:
list(zip(X_train.columns, X_train.nunique()))

[('Line.GLAccountDescription', 209),
 ('Invoice.Description', 61435),
 ('Invoice.GrossValue', 50246),
 ('Invoice.SalesTaxPaid', 19123),
 ('Invoice.UseTax', 7958),
 ('Invoice.NetValue', 61247),
 ('Taxability.STATE.JurisdictionCode', 10),
 ('Taxability.STATE.OutOfStatuteDate', 5),
 ('Taxability.STATE.ReviewStatus', 2),
 ('Taxability.STATE.Confidence', 2812),
 ('@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)', 7397),
 ('@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)', 19122),
 ('@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)', 3)]

In [20]:
X_train['Taxability.STATE.ReviewStatus'].value_counts()

APPROVED       97757
OUTSTANDING    67504
Name: Taxability.STATE.ReviewStatus, dtype: int64

In [21]:
X_train['@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)'].value_counts()

NO_ERROR     130869
REFUND        29131
LIABILITY      5261
Name: @CustomField(Invoice,TaxabilityClassification,Invoice,STRING), dtype: int64

In [22]:
y_train.value_counts()

TAXABLE       136124
NONTAXABLE     29137
Name: Taxability.STATE.Status, dtype: int64

In [23]:
X_train['Taxability.STATE.Confidence'].value_counts()

100.00    97757
99.99     35721
99.98      2124
99.97      1392
99.96      1050
          ...  
94.22         1
84.65         1
76.10         1
77.63         1
70.74         1
Name: Taxability.STATE.Confidence, Length: 2812, dtype: int64

In [24]:
scaled = X_train[['Invoice.GrossValue','Invoice.SalesTaxPaid', 'Invoice.UseTax', 'Invoice.NetValue', 'Taxability.STATE.Confidence', '@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)', '@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)']]
cols = ['Invoice.GrossValue','Invoice.SalesTaxPaid', 'Invoice.UseTax', 'Invoice.NetValue', 'Taxability.STATE.Confidence',
        '@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)', '@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)',
]

scaler = StandardScaler()
ro_scaler = scaler.fit_transform(scaled)
ro_scaler = pd.DataFrame(ro_scaler, columns = cols)


In [25]:
ro_scaler

Unnamed: 0,Invoice.GrossValue,Invoice.SalesTaxPaid,Invoice.UseTax,Invoice.NetValue,Taxability.STATE.Confidence,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)"
0,-0.032179,-0.046012,-0.010132,-0.031855,0.236048,-0.033117,-0.023890
1,-0.006018,-0.048912,0.020403,-0.005661,0.221889,-0.033117,0.002875
2,-0.006568,0.041835,-0.010132,-0.005352,0.020837,-0.033117,0.012746
3,-0.027895,-0.034623,-0.010132,-0.027535,0.236048,-0.033117,-0.019141
4,-0.032091,-0.045400,-0.010132,-0.031754,0.238879,-0.033117,-0.023635
...,...,...,...,...,...,...,...
165256,0.013969,0.055998,0.005513,0.015147,0.238879,-0.033117,0.032985
165257,-0.032415,-0.047728,-0.010132,-0.032130,-2.346476,-0.029549,-0.025100
165258,-0.031025,-0.043053,-0.010132,-0.030695,0.238879,-0.033117,-0.022656
165259,-0.033053,-0.047612,-0.010132,-0.032711,0.238879,-0.033117,-0.024558


In [26]:
X_train_final = X_train
X_train_final.drop('Invoice.GrossValue', axis=1, inplace=True)
X_train_final.drop('Invoice.SalesTaxPaid', axis=1, inplace=True)
X_train_final.drop('Invoice.UseTax', axis=1, inplace=True)
X_train_final.drop('Invoice.NetValue', axis=1, inplace=True) 
X_train_final.drop('Taxability.STATE.Confidence', axis=1, inplace=True)
X_train_final.drop('@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)', axis=1, inplace=True)
X_train_final.drop('@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)', axis=1, inplace=True)

In [27]:
print(X_train_final.shape)
print(ro_scaler.shape)

(165261, 6)
(165261, 7)


In [28]:
X_train_final = pd.concat([X_train_final.reset_index(), ro_scaler.reset_index()], axis=1)

In [29]:
X_train_final.shape

(165261, 15)

In [30]:
X_train_final.drop('index', axis=1, inplace =True)

In [31]:
X_train_final.isnull().sum()

Line.GLAccountDescription                                        12
Invoice.Description                                               0
Taxability.STATE.JurisdictionCode                                 4
Taxability.STATE.OutOfStatuteDate                                88
Taxability.STATE.ReviewStatus                                     0
@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)     0
Invoice.GrossValue                                                0
Invoice.SalesTaxPaid                                              0
Invoice.UseTax                                                    0
Invoice.NetValue                                                  0
Taxability.STATE.Confidence                                       0
@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)              0
@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)                0
dtype: int64

In [32]:
colnames = ['Line.GLAccountDescription', 'Taxability.STATE.JurisdictionCode', 'Taxability.STATE.OutOfStatuteDate']
feat = X_train_final[['Line.GLAccountDescription', 'Taxability.STATE.JurisdictionCode', 'Taxability.STATE.OutOfStatuteDate']]
imputer = SimpleImputer(strategy ='most_frequent')
imputer.fit(feat)

feats = imputer.transform(feat)
feats = pd.DataFrame(feats, columns = colnames)
feats['Taxability.STATE.ReviewStatus'] = X_train_final['Taxability.STATE.ReviewStatus']
feats['@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)'] = X_train_final['@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)']

In [33]:
feats

Unnamed: 0,Line.GLAccountDescription,Taxability.STATE.JurisdictionCode,Taxability.STATE.OutOfStatuteDate,Taxability.STATE.ReviewStatus,"@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)"
0,Dental Supplies,WA,1/20/24,OUTSTANDING,NO_ERROR
1,Other Medical Supplies,CA,1/31/23,OUTSTANDING,NO_ERROR
2,Surgical Supplies,CA,1/31/23,OUTSTANDING,NO_ERROR
3,Expendable Equipment,CA,1/31/23,OUTSTANDING,NO_ERROR
4,Office Supplies,CA,1/31/23,APPROVED,NO_ERROR
...,...,...,...,...,...
165256,AP Inventory Clring-Non Contro,CA,1/31/23,APPROVED,NO_ERROR
165257,AP Inventory Clring-Non Contro,HI,1/20/23,OUTSTANDING,REFUND
165258,Medical Gloves,CA,1/31/23,APPROVED,NO_ERROR
165259,Office Supplies,CA,1/31/23,APPROVED,NO_ERROR


In [34]:
new_feats = pd.get_dummies(feats, columns = ['Taxability.STATE.JurisdictionCode', 'Taxability.STATE.OutOfStatuteDate', 'Taxability.STATE.ReviewStatus', '@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)'])

In [35]:
new_feats.drop('Taxability.STATE.ReviewStatus_OUTSTANDING', axis=1, inplace=True)

In [36]:
X_train_final.drop('Line.GLAccountDescription', axis=1, inplace=True)
X_train_final.drop('Taxability.STATE.JurisdictionCode', axis=1, inplace=True)
X_train_final.drop('Taxability.STATE.OutOfStatuteDate', axis=1, inplace=True)
X_train_final.drop('Taxability.STATE.ReviewStatus', axis=1, inplace=True)
X_train_final.drop('@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)', axis=1, inplace=True)

In [37]:
X_train_final = pd.concat([new_feats.reset_index(), X_train_final.reset_index()], axis=1)

In [38]:
X_train_final.drop('index', axis=1, inplace=True)

In [39]:
X_train_final.isna().sum()

Line.GLAccountDescription                                                  0
Taxability.STATE.JurisdictionCode_CA                                       0
Taxability.STATE.JurisdictionCode_CO                                       0
Taxability.STATE.JurisdictionCode_CT                                       0
Taxability.STATE.JurisdictionCode_DC                                       0
Taxability.STATE.JurisdictionCode_GA                                       0
Taxability.STATE.JurisdictionCode_HI                                       0
Taxability.STATE.JurisdictionCode_MD                                       0
Taxability.STATE.JurisdictionCode_OR                                       0
Taxability.STATE.JurisdictionCode_VA                                       0
Taxability.STATE.JurisdictionCode_WA                                       0
Taxability.STATE.OutOfStatuteDate_1/20/23                                  0
Taxability.STATE.OutOfStatuteDate_1/20/24                                  0

In [40]:
X_train_final.shape

(165261, 28)

In [41]:
hashenc1 = HashingEncoder(cols = ['Invoice.Description'],
                         n_components=16)
hash_res1 = hashenc1.fit_transform(X_train_final['Invoice.Description'])
hash_res1.sample(5)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,col_11,col_12,col_13,col_14,col_15
100268,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
137180,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
144832,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0
94372,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0
74736,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [42]:
hash_res1.shape

(165261, 16)

In [43]:
X_train_final = pd.concat([hash_res1.reset_index(), X_train_final.reset_index()], axis=1)
X_train_final.drop('Invoice.Description', axis=1, inplace=True)
X_train_final.drop('index', axis=1, inplace=True)
X_train_final

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,"@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_LIABILITY","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_NO_ERROR","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_REFUND",Invoice.GrossValue,Invoice.SalesTaxPaid,Invoice.UseTax,Invoice.NetValue,Taxability.STATE.Confidence,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)"
0,0,0,0,0,1,0,0,0,0,0,...,0,1,0,-0.032179,-0.046012,-0.010132,-0.031855,0.236048,-0.033117,-0.023890
1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,-0.006018,-0.048912,0.020403,-0.005661,0.221889,-0.033117,0.002875
2,0,0,0,0,1,0,0,0,0,0,...,0,1,0,-0.006568,0.041835,-0.010132,-0.005352,0.020837,-0.033117,0.012746
3,0,0,0,0,0,0,1,0,0,0,...,0,1,0,-0.027895,-0.034623,-0.010132,-0.027535,0.236048,-0.033117,-0.019141
4,0,0,0,0,0,0,0,0,0,1,...,0,1,0,-0.032091,-0.045400,-0.010132,-0.031754,0.238879,-0.033117,-0.023635
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165256,0,0,0,0,0,0,0,0,0,0,...,0,1,0,0.013969,0.055998,0.005513,0.015147,0.238879,-0.033117,0.032985
165257,0,0,0,0,0,0,0,0,0,0,...,0,0,1,-0.032415,-0.047728,-0.010132,-0.032130,-2.346476,-0.029549,-0.025100
165258,0,0,0,0,0,0,0,0,0,0,...,0,1,0,-0.031025,-0.043053,-0.010132,-0.030695,0.238879,-0.033117,-0.022656
165259,0,0,0,0,0,0,0,0,0,0,...,0,1,0,-0.033053,-0.047612,-0.010132,-0.032711,0.238879,-0.033117,-0.024558


In [44]:
hashenc2 = HashingEncoder(cols = ['Line.GLAccountDescription'],
                         n_components=16)
hash_res2 = hashenc2.fit_transform(X_train_final['Line.GLAccountDescription'])
hash_res2.sample(5)

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,col_10,col_11,col_12,col_13,col_14,col_15
77221,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
161937,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
127807,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0
115722,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0
149785,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


In [45]:
X_train_final = pd.concat([hash_res2.reset_index(), X_train_final.reset_index()], axis=1)
X_train_final.drop('Line.GLAccountDescription', axis=1, inplace=True)
X_train_final.drop('index', axis=1, inplace=True)

X_train_final

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,"@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_LIABILITY","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_NO_ERROR","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_REFUND",Invoice.GrossValue,Invoice.SalesTaxPaid,Invoice.UseTax,Invoice.NetValue,Taxability.STATE.Confidence,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)"
0,0,0,0,0,0,0,0,0,0,0,...,0,1,0,-0.032179,-0.046012,-0.010132,-0.031855,0.236048,-0.033117,-0.023890
1,0,0,0,0,0,0,1,0,0,0,...,0,1,0,-0.006018,-0.048912,0.020403,-0.005661,0.221889,-0.033117,0.002875
2,0,0,0,1,0,0,0,0,0,0,...,0,1,0,-0.006568,0.041835,-0.010132,-0.005352,0.020837,-0.033117,0.012746
3,0,0,0,0,0,0,0,0,0,1,...,0,1,0,-0.027895,-0.034623,-0.010132,-0.027535,0.236048,-0.033117,-0.019141
4,0,0,0,0,0,1,0,0,0,0,...,0,1,0,-0.032091,-0.045400,-0.010132,-0.031754,0.238879,-0.033117,-0.023635
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165256,0,1,0,0,0,0,0,0,0,0,...,0,1,0,0.013969,0.055998,0.005513,0.015147,0.238879,-0.033117,0.032985
165257,0,1,0,0,0,0,0,0,0,0,...,0,0,1,-0.032415,-0.047728,-0.010132,-0.032130,-2.346476,-0.029549,-0.025100
165258,0,0,0,0,0,0,0,0,0,1,...,0,1,0,-0.031025,-0.043053,-0.010132,-0.030695,0.238879,-0.033117,-0.022656
165259,0,0,0,0,0,1,0,0,0,0,...,0,1,0,-0.033053,-0.047612,-0.010132,-0.032711,0.238879,-0.033117,-0.024558


In [49]:
X_train_final.isna().sum()

col_0                                                                      0
col_1                                                                      0
col_2                                                                      0
col_3                                                                      0
col_4                                                                      0
col_5                                                                      0
col_6                                                                      0
col_7                                                                      0
col_8                                                                      0
col_9                                                                      0
col_10                                                                     0
col_11                                                                     0
col_12                                                                     0

In [46]:
lrclassifier = LogisticRegression(random_state=42)
lrclassifier.fit(X_train_final, y_train)

LogisticRegression(random_state=42)

In [47]:
y_train.shape

(165261,)

In [48]:
lrclassifier.score(X_train_final, y_train)

0.999963693793454

##### Test_set

In [49]:
X_test_final = X_test
X_test_final.drop('Invoice.Number', axis=1,inplace=True)
X_test_final.drop('Invoice.VoucherCode', axis=1,inplace=True)
X_test_final.drop('Invoice.PaymentReference', axis=1,inplace=True)
X_test_final.drop('Taxability.Mode', axis=1,inplace=True)
X_test_final.drop('Line.Number', axis=1,inplace=True)
X_test_final.drop('Invoice.Sequence', axis=1,inplace=True)
X_test_final.drop('Invoice.Date', axis=1,inplace=True)
X_test_final.drop("Line.GrossValue",axis=1,inplace=True)
X_test_final.drop("Line.SalesTaxPaid",axis=1,inplace=True)
X_test_final.drop("Line.UseTax",axis=1,inplace=True)
X_test_final.drop("Line.NetValue",axis=1,inplace=True)
X_test_final.drop("Line.ItemDescription",axis=1,inplace=True)
X_test_final.drop('Line.GLSubAccountNumber', axis=1,inplace=True)
X_test_final.drop('Invoice.Note.text', axis=1,inplace=True)
X_test_final.drop('Taxability.STATE.JurisdictionDescription', axis=1,inplace=True)
X_test_final.drop('Invoice.VendorName',axis=1,inplace=True)
X_test_final.drop('Invoice.VendorCode',axis=1,inplace=True)
X_test_final.drop('Line.CostCenterCode',axis=1,inplace=True)
X_test_final.drop('Invoice.CompanyCode',axis=1,inplace=True)
X_test_final.drop('Line.GLMainAccountNumber',axis=1,inplace=True)
X_test_final.drop("Invoice.ParentCompanyCode",axis=1,inplace=True)
X_test_final.shape

(55087, 13)

In [50]:
X_test_final = X_test_final.replace(to_replace='N.A',value=np.nan)

In [51]:
X_test_final.isna().sum()

Line.GLAccountDescription                                         4
Invoice.Description                                               0
Invoice.GrossValue                                                0
Invoice.SalesTaxPaid                                              0
Invoice.UseTax                                                    0
Invoice.NetValue                                                  0
Taxability.STATE.JurisdictionCode                                 0
Taxability.STATE.OutOfStatuteDate                                22
Taxability.STATE.ReviewStatus                                     0
Taxability.STATE.Confidence                                       0
@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)              0
@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)                0
@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)     0
dtype: int64

In [52]:
scaled2 = X_test_final[['Invoice.GrossValue','Invoice.SalesTaxPaid', 'Invoice.UseTax', 'Invoice.NetValue', 'Taxability.STATE.Confidence', '@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)', '@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)']]
col2 = ['Invoice.GrossValue','Invoice.SalesTaxPaid', 'Invoice.UseTax', 'Invoice.NetValue', 'Taxability.STATE.Confidence',         '@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)', '@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)',]

test_scaler = scaler.transform(scaled2)
test_scaler = pd.DataFrame(test_scaler, columns = col2)

In [53]:
test_scaler

Unnamed: 0,Invoice.GrossValue,Invoice.SalesTaxPaid,Invoice.UseTax,Invoice.NetValue,Taxability.STATE.Confidence,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)"
0,-0.032391,-0.047768,-0.009751,-0.032160,0.193528,-0.027609,-0.024923
1,-0.023137,-0.027204,-0.009751,-0.022931,0.239021,-0.027609,-0.015854
2,-0.033581,-0.047824,-0.009751,-0.033254,0.236178,-0.027609,-0.024947
3,-0.033648,-0.047971,-0.009751,-0.033320,0.239021,-0.027609,-0.025013
4,-0.032288,-0.048303,-0.009190,-0.032042,0.122445,-0.024805,-0.025159
...,...,...,...,...,...,...,...
55082,0.047219,0.148510,-0.009751,0.047930,0.239021,-0.027631,0.061638
55083,-0.030953,-0.042708,-0.009109,-0.030611,0.239021,-0.027609,-0.022106
55084,-0.033486,-0.048303,-0.009635,-0.033175,0.236178,-0.027030,-0.025159
55085,-0.032612,-0.045455,-0.009751,-0.032279,0.239021,-0.027609,-0.023903


In [54]:

X_test_final.drop('Invoice.GrossValue', axis=1, inplace=True)
X_test_final.drop('Invoice.SalesTaxPaid', axis=1, inplace=True)
X_test_final.drop('Invoice.UseTax', axis=1, inplace=True)
X_test_final.drop('Invoice.NetValue', axis=1, inplace=True) 
X_test_final.drop('Taxability.STATE.Confidence', axis=1, inplace=True)
X_test_final.drop('@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)', axis=1, inplace=True)
X_test_final.drop('@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)', axis=1, inplace=True)

In [55]:
X_test_final = pd.concat([X_test_final.reset_index(), test_scaler.reset_index()], axis=1)

In [56]:
X_test_final.shape

(55087, 15)

In [57]:
X_test_final.drop('index', axis=1, inplace =True)

In [58]:
X_test_final.shape

(55087, 13)

In [59]:
X_test_final.isna().sum()

Line.GLAccountDescription                                         4
Invoice.Description                                               0
Taxability.STATE.JurisdictionCode                                 0
Taxability.STATE.OutOfStatuteDate                                22
Taxability.STATE.ReviewStatus                                     0
@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)     0
Invoice.GrossValue                                                0
Invoice.SalesTaxPaid                                              0
Invoice.UseTax                                                    0
Invoice.NetValue                                                  0
Taxability.STATE.Confidence                                       0
@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)              0
@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)                0
dtype: int64

In [60]:
col2names = ['Line.GLAccountDescription', 'Taxability.STATE.JurisdictionCode', 'Taxability.STATE.OutOfStatuteDate']
feat2 = X_test_final[['Line.GLAccountDescription', 'Taxability.STATE.JurisdictionCode', 'Taxability.STATE.OutOfStatuteDate']]
feat2 = imputer.transform(feat2)
feat2 = pd.DataFrame(feat2, columns = col2names)
feat2['Taxability.STATE.ReviewStatus'] = X_test_final['Taxability.STATE.ReviewStatus']
feat2['@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)'] = X_test_final['@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)']

In [61]:
feat2.shape

(55087, 5)

In [62]:
new_feat2 = pd.get_dummies(feat2, columns = ['Taxability.STATE.JurisdictionCode', 'Taxability.STATE.OutOfStatuteDate', 'Taxability.STATE.ReviewStatus', '@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)'])

In [63]:
new_feat2.drop('Taxability.STATE.ReviewStatus_OUTSTANDING', axis=1, inplace=True)

In [64]:
X_test_final.drop('Line.GLAccountDescription', axis=1, inplace=True)
X_test_final.drop('Taxability.STATE.JurisdictionCode', axis=1, inplace=True)
X_test_final.drop('Taxability.STATE.OutOfStatuteDate', axis=1, inplace=True)
X_test_final.drop('Taxability.STATE.ReviewStatus', axis=1, inplace=True)
X_test_final.drop('@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)', axis=1, inplace=True)

In [65]:
X_test_final = pd.concat([new_feat2.reset_index(), X_test_final.reset_index()], axis=1)

In [66]:
X_test_final.isna().sum()

index                                                                      0
Line.GLAccountDescription                                                  0
Taxability.STATE.JurisdictionCode_CA                                       0
Taxability.STATE.JurisdictionCode_CO                                       0
Taxability.STATE.JurisdictionCode_DC                                       0
Taxability.STATE.JurisdictionCode_GA                                       0
Taxability.STATE.JurisdictionCode_HI                                       0
Taxability.STATE.JurisdictionCode_MD                                       0
Taxability.STATE.JurisdictionCode_OR                                       0
Taxability.STATE.JurisdictionCode_VA                                       0
Taxability.STATE.JurisdictionCode_WA                                       0
Taxability.STATE.OutOfStatuteDate_1/20/23                                  0
Taxability.STATE.OutOfStatuteDate_1/20/24                                  0

In [67]:
X_test_final.shape

(55087, 29)

In [68]:
X_test_final.drop('index', axis=1, inplace=True)

In [69]:
hash_res3 = hashenc1.transform(X_test_final['Invoice.Description'])


In [70]:
X_test_final = pd.concat([hash_res3.reset_index(), X_test_final.reset_index()], axis=1)
X_test_final.drop('Invoice.Description', axis=1, inplace=True)
X_test_final.drop('index', axis=1, inplace=True)
X_test_final

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,"@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_LIABILITY","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_NO_ERROR","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_REFUND",Invoice.GrossValue,Invoice.SalesTaxPaid,Invoice.UseTax,Invoice.NetValue,Taxability.STATE.Confidence,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)"
0,0,0,0,0,0,1,0,0,0,0,...,0,1,0,-0.032391,-0.047768,-0.009751,-0.032160,0.193528,-0.027609,-0.024923
1,0,0,1,0,0,0,0,0,0,0,...,0,1,0,-0.023137,-0.027204,-0.009751,-0.022931,0.239021,-0.027609,-0.015854
2,0,1,0,0,0,0,0,0,0,0,...,0,1,0,-0.033581,-0.047824,-0.009751,-0.033254,0.236178,-0.027609,-0.024947
3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,-0.033648,-0.047971,-0.009751,-0.033320,0.239021,-0.027609,-0.025013
4,0,0,0,0,0,0,0,0,0,0,...,0,0,1,-0.032288,-0.048303,-0.009190,-0.032042,0.122445,-0.024805,-0.025159
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
55082,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0.047219,0.148510,-0.009751,0.047930,0.239021,-0.027631,0.061638
55083,0,0,0,0,0,1,0,0,0,0,...,0,1,0,-0.030953,-0.042708,-0.009109,-0.030611,0.239021,-0.027609,-0.022106
55084,0,0,0,0,0,0,0,0,0,0,...,0,0,1,-0.033486,-0.048303,-0.009635,-0.033175,0.236178,-0.027030,-0.025159
55085,1,0,0,0,0,0,0,0,0,0,...,0,1,0,-0.032612,-0.045455,-0.009751,-0.032279,0.239021,-0.027609,-0.023903


In [71]:
X_test_final.isna().sum()

col_0                                                                      0
col_1                                                                      0
col_2                                                                      0
col_3                                                                      0
col_4                                                                      0
col_5                                                                      0
col_6                                                                      0
col_7                                                                      0
col_8                                                                      0
col_9                                                                      0
col_10                                                                     0
col_11                                                                     0
col_12                                                                     0

In [72]:
X_test_final.shape

(55087, 42)

In [73]:
hash_res4 = hashenc2.transform(X_test_final['Line.GLAccountDescription'])


In [74]:
X_test_final = pd.concat([hash_res2.reset_index(), X_test_final.reset_index()], axis=1)
X_test_final.drop('Line.GLAccountDescription', axis=1, inplace=True)
X_test_final.drop('index', axis=1, inplace=True)
X_test_final

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,col_8,col_9,...,"@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_LIABILITY","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_NO_ERROR","@CustomField(Invoice,TaxabilityClassification,Invoice,STRING)_REFUND",Invoice.GrossValue,Invoice.SalesTaxPaid,Invoice.UseTax,Invoice.NetValue,Taxability.STATE.Confidence,"@CustomField(Invoice,PotentialTaxDue,Invoice,DOUBLE)","@CustomField(Invoice,CalculatedTax,Invoice,DOUBLE)"
0,0,0,0,0,0,1,0,0,0,0,...,0.0,1.0,0.0,-0.032391,-0.047768,-0.009751,-0.032160,0.193528,-0.027609,-0.024923
1,0,0,0,0,0,0,1,0,0,0,...,0.0,1.0,0.0,-0.023137,-0.027204,-0.009751,-0.022931,0.239021,-0.027609,-0.015854
2,0,0,0,0,0,0,0,0,0,1,...,0.0,1.0,0.0,-0.033581,-0.047824,-0.009751,-0.033254,0.236178,-0.027609,-0.024947
3,0,0,0,0,0,0,1,0,0,0,...,0.0,1.0,0.0,-0.033648,-0.047971,-0.009751,-0.033320,0.239021,-0.027609,-0.025013
4,0,0,0,0,0,0,0,0,0,0,...,0.0,0.0,1.0,-0.032288,-0.048303,-0.009190,-0.032042,0.122445,-0.024805,-0.025159
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
165256,0,1,0,0,0,0,0,0,0,0,...,,,,,,,,,,
165257,0,0,1,0,0,0,0,0,0,0,...,,,,,,,,,,
165258,0,0,0,0,0,1,0,0,0,0,...,,,,,,,,,,
165259,0,1,0,0,0,0,0,0,0,0,...,,,,,,,,,,


In [75]:
X_test_final.isna().sum()

col_0                                                                           0
col_1                                                                           0
col_2                                                                           0
col_3                                                                           0
col_4                                                                           0
col_5                                                                           0
col_6                                                                           0
col_7                                                                           0
col_8                                                                           0
col_9                                                                           0
col_10                                                                          0
col_11                                                                          0
col_12          

In [76]:
X_test_final.dropna(inplace=True)

In [77]:
y_pred = lrclassifier.predict(X_test_final)

Feature names seen at fit time, yet now missing:
- Taxability.STATE.JurisdictionCode_CT



ValueError: X has 57 features, but LogisticRegression is expecting 58 features as input.

In [None]:

score = accuracy_score(y_pred, y_test)
print(score)
print(classification_report(y_pred, y_test))