Use a Random Forest to build a model on the fraud data,
treating records with taxable_income <= 30000 as "Risky" and all others as "Good".
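
As a quick illustration of the labelling rule above, here is a minimal sketch on a few made-up income values (they are assumptions for illustration, not rows from the fraud dataset). It shows that the boundary value 30000 itself falls in the "Risky" group.

```python
import pandas as pd

# Toy incomes chosen to sit around the threshold (made-up values)
income = pd.Series([25000, 30000, 30001, 68833], name="Taxable.Income")

# Incomes <= 30000 map to "Risky"; everything else maps to "Good"
label = income.apply(lambda x: "Risky" if x <= 30000 else "Good")

print(label.tolist())  # ['Risky', 'Risky', 'Good', 'Good']
```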


In [1]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics


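# Load the fraud dataset
data = pd.read_csv("Fraud_checkk.csv")

# Preview the first rows
print(data.head())

# Convert the target to labels: taxable income <= 30000 is 'Risky', otherwise 'Good'
data['Taxable.Income'] = data['Taxable.Income'].apply(lambda x: 'Risky' if x <= 30000 else 'Good')

# One-hot encode the categorical columns, dropping the first level of each
data = pd.get_dummies(data, columns=['Undergrad', 'Marital.Status', 'Urban'], drop_first=True)

# Separate the features (X) from the target (y)
X = data.drop('Taxable.Income', axis=1)
y = data['Taxable.Income']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)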

# Initialize the Random Forest classifier
clf = RandomForestClassifier(random_state=42)

# Fit the model on the training data
clf.fit(X_train, y_train)

# Predict on the test data
y_pred = clf.predict(X_test)

# Compute accuracy on the held-out test set
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Displaying feature importance
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': clf.feature_importances_})
print("Feature Importance:\n", feature_importance.sort_values(by='Importance', ascending=False))


  Undergrad Marital.Status  Taxable.Income  City.Population  Work.Experience  \
0        NO         Single           68833            50047               10   
1       YES       Divorced           33700           134075               18   
2        NO        Married           36925           160205               30   
3       YES         Single           50190           193264               15   
4        NO        Married           81002            27533               28   

  Urban  
0   YES  
1   YES  
2   YES  
3   YES  
4    NO  
Accuracy: 0.7333333333333333
Feature Importance:
                   Feature  Importance
0         City.Population    0.543116
1         Work.Experience    0.334691
2           Undergrad_YES    0.036858
5               Urban_YES    0.036427
3  Marital.Status_Married    0.026014
4   Marital.Status_Single    0.022894
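
Accuracy alone does not show how well the "Risky" class is predicted. The sketch below uses made-up labels (an assumption for illustration, not results from the fraud data) to show how a majority-class baseline and a confusion matrix could be checked next to the accuracy score. The same calls could be applied to `y_test` and `y_pred` from the cell above.

```python
import pandas as pd
from sklearn import metrics

# Made-up labels: 8 "Good" and 2 "Risky" (illustrative only)
y_true = pd.Series(["Good"] * 8 + ["Risky"] * 2)
# A hypothetical model that always predicts the majority class
y_hat = pd.Series(["Good"] * 10)

# Share of the most frequent class: the accuracy of always guessing it
baseline = y_true.value_counts(normalize=True).max()

acc = metrics.accuracy_score(y_true, y_hat)

# Rows are true classes, columns are predicted classes, in the given label order
cm = metrics.confusion_matrix(y_true, y_hat, labels=["Good", "Risky"])

# Fraction of truly "Risky" rows that were predicted "Risky"
risky_recall = metrics.recall_score(y_true, y_hat, pos_label="Risky")

print(f"Baseline: {baseline}, Accuracy: {acc}, Risky recall: {risky_recall}")
print(cm)
```

In this toy case the accuracy equals the baseline and no "Risky" row is caught, which is the kind of situation a single accuracy figure would hide.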


Random Forest
 
Assignment


About the data: 
Let’s consider a Company dataset with around 10 variables and 400 records. 
The attributes are as follows: 
 Sales -- Unit sales (in thousands) at each location
 Competitor Price -- Price charged by competitor at each location
 Income -- Community income level (in thousands of dollars)
 Advertising -- Local advertising budget for company at each location (in thousands of dollars)
 Population -- Population size in region (in thousands)
 Price -- Price company charges for car seats at each site
 Shelf Location at stores -- A factor with levels Bad, Good and Medium indicating the quality of the shelving location for the car seats at each site
 Age -- Average age of the local population
 Education -- Education level at each location
 Urban -- A factor with levels No and Yes to indicate whether the store is in an urban or rural location
 US -- A factor with levels No and Yes to indicate whether the store is in the US or not
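
The factor attributes in the list above (shelf location, Urban, US) need a numeric encoding before a Random Forest in scikit-learn can use them. The sketch below shows one way to do that with `pd.get_dummies` on a three-row toy frame. The column names follow the dataset's CSV headers and the row values are made up for illustration.

```python
import pandas as pd

# Toy frame with one numeric column and the three factor columns (made-up rows)
toy = pd.DataFrame({
    "Price": [120, 83, 80],
    "ShelveLoc": ["Bad", "Good", "Medium"],
    "Urban": ["Yes", "Yes", "No"],
    "US": ["Yes", "No", "Yes"],
})

# One indicator column per level; drop_first removes the first level of each
# factor ("Bad", "No", "No"), which then acts as the reference level
encoded = pd.get_dummies(toy, columns=["ShelveLoc", "Urban", "US"], drop_first=True)

print(list(encoded.columns))
# ['Price', 'ShelveLoc_Good', 'ShelveLoc_Medium', 'Urban_Yes', 'US_Yes']
```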
The first rows of the company dataset are printed in the output of the code cell below.
 
Problem Statement:
A cloth manufacturing company wants to know which segments or attributes drive high sales. 
Approach - A Random Forest can be built with Sales as the target variable (first converted into a categorical variable) and all other variables as independent variables.  
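
One way to convert Sales into a categorical variable is to split it at its mean, so values up to the mean become "Low" and values above it become "High". The sketch below applies that split to the five Sales values shown in the dataset preview. The mean here is taken over these five values only, so it is an illustrative threshold and not the one for the full dataset.

```python
import pandas as pd

# First five Sales values from the dataset preview
sales = pd.Series([9.50, 11.22, 10.06, 7.40, 4.15], name="Sales")

# Threshold: mean of these five values only (illustrative)
threshold = sales.mean()

# Bins are (-inf, mean] -> "Low" and (mean, inf) -> "High"
category = pd.cut(
    sales,
    bins=[-float("inf"), threshold, float("inf")],
    labels=["Low", "High"],
)

print(category.tolist())  # ['High', 'High', 'High', 'Low', 'Low']
```

Other cut points (the median, or a business-defined sales target) are equally possible; the choice changes the class balance of the target.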


In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics


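# Load the company dataset
data = pd.read_csv("Company_Dataa.csv")

# Preview the first rows
print(data.head())

# Bin Sales at its mean: values up to the mean are 'Low', values above it are 'High'
data['Sales_Category'] = pd.cut(data['Sales'], bins=[-float('inf'), data['Sales'].mean(), float('inf')], labels=['Low', 'High'])

# Drop the original Sales column so it cannot leak into the features
data.drop('Sales', axis=1, inplace=True)

# One-hot encode the categorical columns, dropping the first level of each
data = pd.get_dummies(data, columns=['ShelveLoc', 'Urban', 'US'], drop_first=True)

# Separate the features (X) from the target (y)
X = data.drop('Sales_Category', axis=1)
y = data['Sales_Category']

# Hold out 20% of the rows as a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)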


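# Initialize the Random Forest classifier
clf = RandomForestClassifier(random_state=42)

# Fit the model on the training data
clf.fit(X_train, y_train)

# Predict on the test data
y_pred = clf.predict(X_test)

# Compute accuracy on the held-out test set
accuracy = metrics.accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

# Display feature importances, highest first
feature_importance = pd.DataFrame({'Feature': X.columns, 'Importance': clf.feature_importances_})
print("Feature Importance:\n", feature_importance.sort_values(by='Importance', ascending=False))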


   Sales  CompPrice  Income  Advertising  Population  Price ShelveLoc  Age  \
0   9.50        138      73           11         276    120       Bad   42   
1  11.22        111      48           16         260     83      Good   65   
2  10.06        113      35           10         269     80    Medium   59   
3   7.40        117     100            4         466     97    Medium   55   
4   4.15        141      64            3         340    128       Bad   38   

   Education Urban   US  
0         17   Yes  Yes  
1         10   Yes  Yes  
2         12   Yes  Yes  
3         14   Yes  Yes  
4         13   Yes   No  
Accuracy: 0.85
Feature Importance:
              Feature  Importance
4              Price    0.241766
5                Age    0.135556
0          CompPrice    0.116948
3         Population    0.108888
2        Advertising    0.108873
1             Income    0.091713
7     ShelveLoc_Good    0.090437
6          Education    0.052787
8   ShelveLoc_Medium    0.025219
10       