### Classification in Business

<center>
    <img src = 'images/uci_biz.png' />
</center>



------------


For this try-it, you will explore available datasets related to business applications of classification. Scan the datasets in the UCI Machine Learning Repository under the subject area "BUSINESS" ([link here](https://archive.ics.uci.edu/ml/datasets.php?format=&task=cla&att=&area=bus&numAtt=&numIns=&type=&sort=nameUp&view=table)). Find a dataset that looks interesting to you and decide how you could use Logistic Regression to help make a business decision using the data.

In sharing your results, be sure to clearly describe the following:

- the dataset and its features
- the classification problem -- what are you classifying here?
- a business decision that can be supported using the results of the classification model

Share your summary on the appropriate discussion board for the activity. 

In [333]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import set_config
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, PolynomialFeatures, StandardScaler

# Render estimators as diagrams in the notebook.
set_config(display="diagram")

In [334]:
df=pd.read_excel("data/Absenteeism_at_work.xls")

In [335]:
df.head()

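| | ID | Reason for absence | Month of absence | Day of the week | Seasons | Transportation expense | Distance from Residence to Work | Service time | Age | Work load Average/day | ... | Disciplinary failure | Education | Son | Social drinker | Social smoker | Pet | Weight | Height | Body mass index | Absenteeism time in hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11 | 26 | 7 | 3 | 1 | 289 | 36 | 13 | 33 | 239554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 4 |
| 1 | 36 | 0 | 7 | 3 | 1 | 118 | 13 | 18 | 50 | 239554 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 98 | 178 | 31 | 0 |
| 2 | 3 | 23 | 7 | 4 | 1 | 179 | 51 | 18 | 38 | 239554 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 89 | 170 | 31 | 2 |
| 3 | 7 | 7 | 7 | 5 | 1 | 279 | 5 | 14 | 39 | 239554 | ... | 0 | 1 | 2 | 1 | 1 | 0 | 68 | 168 | 24 | 4 |
| 4 | 11 | 23 | 7 | 5 | 1 | 289 | 36 | 13 | 33 | 239554 | ... | 0 | 1 | 2 | 1 | 0 | 1 | 90 | 172 | 30 | 2 |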


In [336]:
df.dtypes

ID                                 int64
Reason for absence                 int64
Month of absence                   int64
Day of the week                    int64
Seasons                            int64
Transportation expense             int64
Distance from Residence to Work    int64
Service time                       int64
Age                                int64
Work load Average/day              int64
Hit target                         int64
Disciplinary failure               int64
Education                          int64
Son                                int64
Social drinker                     int64
Social smoker                      int64
Pet                                int64
Weight                             int64
Height                             int64
Body mass index                    int64
Absenteeism time in hours          int64
dtype: object

In [337]:

col         = 'Absenteeism time in hours'

df[col].describe()

count    740.000000
mean       6.924324
std       13.330998
min        0.000000
25%        2.000000
50%        3.000000
75%        8.000000
max      120.000000
Name: Absenteeism time in hours, dtype: float64

In [338]:

# Earlier six-level binning, kept for reference:
# conditions  = [ df[col] >= 40, (df[col] < 40) & (df[col]> 16), (df[col] <=16 ) & (df[col]> 8), (df[col] <=8) & (df[col]> 4) , (df[col] <=4) & (df[col]> 2), df[col] <=2]
# choices     = [5, 4, 3, 2, 1, 0 ]

# Binary target: 1 for an absence of 40 hours or more, 0 otherwise.
conditions = [df[col] >= 40, df[col] < 40]
choices = [1, 0]
df[col]

0      4
1      0
2      2
3      4
4      2
      ..
735    8
736    4
737    0
738    0
739    0
Name: Absenteeism time in hours, Length: 740, dtype: int64

In [339]:

    
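# Label each row with the binary target. The NaN default makes np.select
# return floats, so the labels are 0.0 and 1.0.
df["absence"] = np.select(conditions, choices, default=np.nan)

# y=np.where(df["Absenteeism time in hours"] > 40, 1,  0)
# y=pd.Series(y)

# Hold out 25% of the rows (the default), stratified so that both splits
# keep the same share of long absences.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(["Absenteeism time in hours", "absence", "ID"], axis=1),
    df["absence"],
    random_state=442,
    stratify=df["absence"],
)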
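
The labeling step above can also be written as a single boolean comparison. The sketch below uses a small made-up Series (the values are illustrative, not taken from the dataset) to check that `np.select` with these two conditions and a direct comparison against the 40-hour cutoff give the same labels.

```python
import numpy as np
import pandas as pd

# Hypothetical absence durations in hours (illustrative values only).
hours = pd.Series([4, 0, 2, 40, 120, 8, 39], name="Absenteeism time in hours")

# Approach used in the notebook: np.select over two complementary conditions.
conditions = [hours >= 40, hours < 40]
choices = [1, 0]
via_select = pd.Series(np.select(conditions, choices, default=np.nan), index=hours.index)

# Equivalent one-liner: the boolean mask cast to integers.
via_mask = (hours >= 40).astype(int)

print(via_select.tolist())
print(via_mask.tolist())
```

The only difference is the dtype: the `np.select` version yields floats because of the NaN default, while the mask version yields integers.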


In [340]:
X_train.dtypes

Reason for absence                 int64
Month of absence                   int64
Day of the week                    int64
Seasons                            int64
Transportation expense             int64
Distance from Residence to Work    int64
Service time                       int64
Age                                int64
Work load Average/day              int64
Hit target                         int64
Disciplinary failure               int64
Education                          int64
Son                                int64
Social drinker                     int64
Social smoker                      int64
Pet                                int64
Weight                             int64
Height                             int64
Body mass index                    int64
dtype: object

In [341]:
y_train.unique()

array([0., 1.])

In [342]:
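# Called with no arguments, make_column_selector matches every column, so all
# features are one-hot encoded and the remainder scaler receives no columns.
selector = make_column_selector()
#transformer = make_column_transformer((OneHotEncoder(handle_unknown='ignore'), selector),
#                                    remainder = StandardScaler())


transformer = make_column_transformer((OneHotEncoder(handle_unknown='ignore'), selector),
                                      remainder=MinMaxScaler())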
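
To see which columns a selector actually picks up, it can be called directly on a DataFrame. The sketch below uses a three-column toy frame (the column names are borrowed from the dataset, the values are made up) and compares an argument-free selector with one restricted by a regular expression. Which columns should be treated as categorical is a modeling choice that this example does not settle.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with integer-coded columns, similar in type to the absenteeism data.
toy = pd.DataFrame({"Seasons": [1, 2, 3], "Age": [33, 50, 38], "Weight": [90, 98, 89]})

# No pattern and no dtype filter: every column is selected.
all_cols = make_column_selector()(toy)

# A regex pattern restricts the selection to matching column names.
season_only = make_column_selector(pattern="^Seasons$")(toy)

print(all_cols)
print(season_only)
```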


In [343]:
extractor = SelectFromModel(LogisticRegression(penalty='l2', solver = 'liblinear' ,random_state = 42))

In [344]:

# Encode the features, keep those with above-threshold coefficients,
# then fit the final logistic regression.
lgr_pipe = Pipeline([('transformer', transformer),
                     ('selector', extractor),
                     ('lgr', LogisticRegression(random_state=42, max_iter=1000))])

lgr_pipe.fit(X_train, y_train)

# Accuracy on the held-out test set.
pipe_1_acc = lgr_pipe.score(X_test, y_test)

In [345]:
pipe_1_acc

0.9675675675675676

In [346]:
print(y_test.value_counts(normalize = True))

0.0    0.972973
1.0    0.027027
Name: absence, dtype: float64
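
With a target this imbalanced, accuracy is easiest to read next to the majority-class baseline. The arithmetic below uses only the figures printed above, together with the test-set size of 185 rows that follows from the default 25% split of 740 rows. It is a quick sanity check, not a full evaluation.

```python
# Figures taken from the outputs above.
n_test = 185                      # 25% of 740 rows (train_test_split default)
positive_share = 0.027027         # share of class 1.0 in y_test
model_acc = 0.9675675675675676    # lgr_pipe.score(X_test, y_test)

# Recover the class counts in the test set.
n_pos = round(n_test * positive_share)
n_neg = n_test - n_pos

# A model that always predicts 0 is right on every negative row.
baseline_acc = n_neg / n_test

# Number of test rows the fitted pipeline classified correctly.
model_correct = round(model_acc * n_test)

print(f"positives in test set: {n_pos}, negatives: {n_neg}")
print(f"majority-class baseline accuracy: {baseline_acc:.4f}")
print(f"pipeline accuracy: {model_acc:.4f} ({model_correct} of {n_test} correct)")
```

By this measure the pipeline does not beat the rule "always predict 0", which suggests that accuracy alone says little about how well long absences are detected here.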


In [347]:
fp = 126
fn = 194
auc = 0.86
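
The three values above are entered by hand, and the two error counts add up to more than the 185 rows in this test set, so they may have been carried over from another exercise. As a hedged alternative, such quantities can be derived from a fitted model with `confusion_matrix` and `roc_auc_score`. The sketch uses synthetic data from `make_classification` and a plain `LogisticRegression`, both stand-ins for the notebook's data and pipeline, so its numbers will not match the hand-entered ones.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem (stand-in for the absenteeism data).
X_syn, y_syn = make_classification(
    n_samples=740, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, stratify=y_syn, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_te, model.predict(X_te), labels=[0, 1]).ravel()

# AUC is computed from the predicted probability of the positive class.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"fp={fp}, fn={fn}, auc={auc:.2f}")
```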

In [348]:
no_probs = lgr_pipe.predict_proba(X_test)[:, 0]
no_probs[:5]
high_prob_no = no_probs[no_probs > 0.95]


In [349]:
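# Share of all test rows predicted "no" with probability above 0.95
percent_of_test_data = len(high_prob_no) / len(y_test)
# The same count relative to the number of actual "no" rows in the test set
percent_of_no = len(high_prob_no) / sum(y_test == 0)

print(percent_of_test_data)
print(percent_of_no)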

0.8486486486486486
0.8722222222222222
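
The two ratios above count confident "no" predictions, which is not quite the same as counting confident predictions that are also correct. The sketch below, on made-up data with a plain `LogisticRegression` (both are assumptions for illustration), looks up the probability column through `classes_` instead of hard-coding index 0 and adds the share of confident "no" predictions that are true negatives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up two-feature problem with float labels 0.0 / 1.0, as in the notebook.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(float)

clf = LogisticRegression().fit(X_toy, y_toy)

# Columns of predict_proba follow clf.classes_, so look the index up explicitly.
neg_col = list(clf.classes_).index(0.0)
neg_probs = clf.predict_proba(X_toy)[:, neg_col]

# Boolean mask of rows predicted "no" with probability above 0.95.
confident_neg = neg_probs > 0.95

share_of_all = confident_neg.mean()
share_of_neg = confident_neg.sum() / (y_toy == 0).sum()

# Fraction of the confident "no" predictions that are actually negative.
if confident_neg.any():
    confident_precision = (y_toy[confident_neg] == 0).mean()
    print(share_of_all, share_of_neg, confident_precision)
```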


In [350]:
lgr_pipe

Pipeline(steps=[('transformer',
                 ColumnTransformer(remainder=MinMaxScaler(),
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(handle_unknown='ignore'),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x000001C027ECAA90>)])),
                ('selector',
                 SelectFromModel(estimator=LogisticRegression(random_state=42,
                                                              solver='liblinear'))),
                ('lgr', LogisticRegression(max_iter=1000, random_state=42))])


In [351]:
lgr_pipe.n_features_in_

19

In [352]:
df.shape

(740, 22)

In [353]:
lgr_pipe.named_steps['transformer'].get_feature_names_out() 

AttributeError: 'ColumnTransformer' object has no attribute 'get_feature_names_out'

In [206]:
# feature_names = lgr_pipe.named_steps['transformer'].get_feature_names_out() 

In [None]:

# selected_features =feature_names[ [int(i[1:]) for i in lgr_pipe.named_steps['selector'].get_feature_names_out()]]
# clean_names = [i.split('__')[-1] for i in selected_features]
# coef_df = pd.DataFrame({'feature': clean_names, 'coefs': lgr_pipe.named_steps['lgr'].coef_[0]})
# coef_df['coefs'] = coef_df['coefs'].apply(abs)
# coef_df = coef_df.sort_values(by = 'coefs', ascending = False)