Use decision trees to build a model on the fraud data, treating individuals with taxable_income <= 30000 as "Risky" and all others as "Good".

Data Description :

Undergrad : whether the person is an undergraduate or not

Marital.Status : marital status of a person

Taxable.Income : the portion of an individual's income that is subject to tax

Work.Experience : work experience of the individual

Urban : whether the person lives in an urban area or not
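The labeling rule stated at the top can be checked on a few values before it is applied to the real column. The sketch below uses made-up incomes (not rows from Fraud_check.csv) and a vectorized `numpy.where` in place of a row-wise `apply`; the point is only to confirm that the boundary value 30000 falls on the "Risky" side.

```python
import numpy as np
import pandas as pd

# Hypothetical incomes chosen to straddle the 30000 threshold.
incomes = pd.Series([12000, 30000, 30001, 68833], name="Taxable.Income")

# Rule from the problem statement: <= 30000 is "Risky", everything else is "Good".
labels = pd.Series(np.where(incomes <= 30000, "Risky", "Good"), name="Risk")

# Class counts show how balanced (or not) the resulting target is.
counts = labels.value_counts()
print(pd.concat([incomes, labels], axis=1))
print(counts)
```

Running `value_counts()` on the real target in the same way would show how many records fall into each class, which matters when reading the accuracy figure later.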

In [6]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the data and preview the first rows
data = pd.read_csv("Fraud_check.csv")
print(data.head())

# Turn the numeric income into the target label: <= 30000 is 'Risky', otherwise 'Good'
data['Taxable.Income'] = data['Taxable.Income'].apply(lambda x: 'Risky' if x <= 30000 else 'Good')

# Convert categorical variables to numeric columns using one-hot encoding
data = pd.get_dummies(data, columns=['Undergrad', 'Marital.Status', 'Urban'], drop_first=True)

# Features and target variable
X = data.drop('Taxable.Income', axis=1)
y = data['Taxable.Income']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a decision tree with default settings (no depth limit)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out rows and report accuracy
y_pred = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Print the learned splits as text
tree_rules = export_text(clf, feature_names=list(X.columns))
print("Decision Tree Rules:\n", tree_rules)


  Undergrad Marital.Status  Taxable.Income  City.Population  Work.Experience  \
0        NO         Single           68833            50047               10   
1       YES       Divorced           33700           134075               18   
2        NO        Married           36925           160205               30   
3       YES         Single           50190           193264               15   
4        NO        Married           81002            27533               28   

  Urban  
0   YES  
1   YES  
2   YES  
3   YES  
4    NO  
Accuracy: 0.6666666666666666
Decision Tree Rules:
 |--- Marital.Status_Married <= 0.50
|   |--- Work.Experience <= 29.50
|   |   |--- Work.Experience <= 23.50
|   |   |   |--- Work.Experience <= 6.50
|   |   |   |   |--- City.Population <= 170738.00
|   |   |   |   |   |--- City.Population <= 108091.00
|   |   |   |   |   |   |--- Work.Experience <= 1.50
|   |   |   |   |   |   |   |--- Work.Experience <= 0.50
|   |   |   |   |   |   |   |   |--- class: G
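The accuracy printed above (about 0.667) is hard to judge on its own: if one class is much more common than the other, always predicting that class can score just as well or better. The sketch below compares a tree against that majority-class baseline and prints a confusion matrix. It runs on synthetic data, not Fraud_check.csv; the 20% "Risky" share, the two feature columns, and the stratified split are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the fraud data (NOT the real Fraud_check.csv).
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "City.Population": rng.integers(25_000, 200_000, size=n),
    "Work.Experience": rng.integers(0, 31, size=n),
})
# Labels are drawn independently of the features; about 20% "Risky" (assumed ratio).
y = pd.Series(np.where(rng.random(n) < 0.2, "Risky", "Good"))

# stratify=y keeps the class ratio the same in the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Baseline: always predict the most frequent class seen in training.
majority = y_train.mode()[0]
baseline_acc = accuracy_score(y_test, [majority] * len(y_test))
tree_acc = accuracy_score(y_test, y_pred)

# Rows are true labels, columns are predicted labels, in the order given.
cm = confusion_matrix(y_test, y_pred, labels=["Good", "Risky"])
print(f"baseline accuracy: {baseline_acc:.3f}")
print(f"tree accuracy:     {tree_acc:.3f}")
print(cm)
```

If the tree does not beat the baseline on the real data, the features may carry little signal for the target, or the unrestricted tree may be overfitting; the confusion matrix shows which class the errors come from.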

Decision Tree
 
Assignment


About the data: 
Let’s consider a Company dataset with around 10 variables and 400 records. 
The attributes are as follows: 
 Sales -- Unit sales (in thousands) at each location
 Competitor Price -- Price charged by competitor at each location
 Income -- Community income level (in thousands of dollars)
 Advertising -- Local advertising budget for company at each location (in thousands of dollars)
 Population -- Population size in region (in thousands)
 Price -- Price company charges for car seats at each site
 Shelf Location at stores -- A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
 Age -- Average age of the local population
 Education -- Education level at each location
 Urban -- A factor with levels No and Yes to indicate whether the store is in an urban or rural location
 US -- A factor with levels No and Yes to indicate whether the store is in the US or not
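The factor attributes listed above have to be turned into numbers before scikit-learn's decision tree can use them. A minimal sketch of what `LabelEncoder` does with the shelf-location levels follows; the five sample values and the `quality_order` mapping are illustrative assumptions, not part of the assignment.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# A handful of made-up shelf-location values using the three documented levels.
shelf = pd.Series(["Bad", "Good", "Medium", "Medium", "Bad"], name="ShelveLoc")

# LabelEncoder sorts the labels, so the codes follow alphabetical order.
encoder = LabelEncoder()
alphabetical = encoder.fit_transform(shelf)
print({str(label): code for code, label in enumerate(encoder.classes_)})

# Alternative (an assumption, not the notebook's approach): an explicit
# mapping that preserves the quality order Bad < Medium < Good.
quality_order = {"Bad": 0, "Medium": 1, "Good": 2}
ordinal = shelf.map(quality_order)
print(ordinal.tolist())
```

With the alphabetical codes, a split such as `ShelveLoc <= 0.50` separates Bad from Good and Medium together, while a split at 1.50 would group Bad with Good. The explicit mapping makes every threshold correspond to a quality cutoff.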
The company dataset looks like this: 
 
Problem Statement:
A cloth manufacturing company wants to know which segments or attributes lead to high sales. 
Approach - A decision tree can be built with Sales as the target variable (we will first convert it into a categorical variable); all other variables will be treated as independent variables in the analysis.  
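Converting Sales into a categorical variable requires a cutoff. One simple choice, sketched below, is to split at the mean with `pandas.cut`. The five sales figures are the ones shown in the dataset preview further down; splitting at the mean is only one possible rule, and a median or a business-defined threshold would work the same way.

```python
import pandas as pd

# Sales values from the first five rows of the dataset preview.
sales = pd.Series([9.50, 11.22, 10.06, 7.40, 4.15], name="Sales")

# Two bins: (-inf, mean] is "Low" and (mean, inf) is "High".
# pd.cut closes intervals on the right by default, so a value exactly
# equal to the mean would be labeled "Low".
cutoff = sales.mean()
category = pd.cut(
    sales,
    bins=[-float("inf"), cutoff, float("inf")],
    labels=["Low", "High"],
)
print(pd.concat([sales, category.rename("Sales_Category")], axis=1))
```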


In [7]:
import pandas as pd
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Load the data and preview the first rows
data = pd.read_csv("Company_Data.csv")
print(data.head())

# Convert Sales into a two-level target: at or below the mean is 'Low', above it is 'High'
data['Sales_Category'] = pd.cut(
    data['Sales'],
    bins=[-float('inf'), data['Sales'].mean(), float('inf')],
    labels=['Low', 'High'],
)

# Drop the numeric Sales column so it cannot leak into the features
data.drop('Sales', axis=1, inplace=True)

# Encode the categorical columns as integers.
# LabelEncoder assigns codes in alphabetical order (for ShelveLoc: Bad=0, Good=1, Medium=2).
label_encoder = LabelEncoder()
data['ShelveLoc'] = label_encoder.fit_transform(data['ShelveLoc'])
data['Urban'] = label_encoder.fit_transform(data['Urban'])
data['US'] = label_encoder.fit_transform(data['US'])

# Features and target variable
X = data.drop('Sales_Category', axis=1)
y = data['Sales_Category']

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a decision tree with default settings (no depth limit)
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Predict on the held-out rows and report accuracy
y_pred = clf.predict(X_test)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Print the learned splits as text
tree_rules = export_text(clf, feature_names=list(X.columns))
print("Decision Tree Rules:\n", tree_rules)


   Sales  CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  \
0   9.50        138      73           11         276    120       Bad   42   
1  11.22        111      48           16         260     83      Good   65   
2  10.06        113      35           10         269     80    Medium   59   
3   7.40        117     100            4         466     97    Medium   55   
4   4.15        141      64            3         340    128       Bad   38   

   Education Urban   US  
0         17   Yes  Yes  
1         10   Yes  Yes  
2         12   Yes  Yes  
3         14   Yes  Yes  
4         13   Yes   No  
Accuracy: 0.7
Decision Tree Rules:
 |--- Price <= 92.50
|   |--- Population <= 253.50
|   |   |--- ShelveLoc <= 0.50
|   |   |   |--- Population <= 105.50
|   |   |   |   |--- class: High
|   |   |   |--- Population >  105.50
|   |   |   |   |--- class: Low
|   |   |--- ShelveLoc >  0.50
|   |   |   |--- Income <= 88.50
|   |   |   |   |--- class: High
|   |   |   |--- Inc