<img src='logo/dsl-logo.png' width="500" align="center" />

# HR Competition

## Load Data

### Initializations

In [24]:
# Import libraries
import numpy as np
import scipy.stats as stats
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [25]:
# Define a class holding ANSI escape codes for text styles
class color:
    PURPLE = '\033[95m'
    CYAN = '\033[96m'
    DARKCYAN = '\033[36m'
    BLUE = '\033[94m'
    GREEN = '\033[92m'
    YELLOW = '\033[93m'
    RED = '\033[91m'
    BOLD = '\033[1m'
    UNDERLINE = '\033[4m'
    END = '\033[0m'

### Import Data from CSV

In [26]:
# Description of the HR dataset
with open('data/hr_desc.txt', 'r') as file:
    print(file.read())

Why are our best and most experienced employees leaving us?

Goal: Develop a predictive model that forecasts whether an employee might be the next to leave the company.

Available attributes:
- satisfaction_level: Satisfaction level (0-1)
- last_evaluation: Time in years since the last evaluation
- number_project: Number of completed projects
- average_monthly_hours: Average monthly working hours
- time_spent_company: Time in years at the company
- work_accident: Was there a workplace accident?
- promotion_last_5years: Was there a promotion in the last five years?
- department: Department the employee works in
- salary: Relative salary level
- left: Did the employee leave the company?


In [27]:
# Read the training data
df = pd.read_csv('data/hr_train.csv', delimiter=';', decimal='.')
df.head()

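| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.55 | 0.86 | 4 | 169 | 6 | 0 | 0 | 0 | IT | medium |
| 1 | 0.66 | 0.48 | 4 | 229 | 4 | 0 | 0 | 0 | sales | medium |
| 2 | 0.56 | 0.67 | 5 | 165 | 3 | 1 | 0 | 0 | management | medium |
| 3 | 0.59 | 1.0 | 2 | 155 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.87 | 0.49 | 4 | 149 | 2 | 0 | 0 | 0 | sales | low |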
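
Three column names in the CSV differ slightly from the names in the dataset description (`average_montly_hours`, `time_spend_company`, `Work_accident`). If consistent names are wanted, a rename could look roughly like the sketch below. The small `sample` frame stands in for `df`, and the mapping `RENAME_MAP` is a hypothetical helper that is not part of the original notebook.

```python
import pandas as pd

# Small stand-in for df, using values from the first two rows of the training data
sample = pd.DataFrame({
    'average_montly_hours': [169, 229],
    'time_spend_company': [6, 4],
    'Work_accident': [0, 0],
})

# Hypothetical mapping from the CSV column names to the names used in the description
RENAME_MAP = {
    'average_montly_hours': 'average_monthly_hours',
    'time_spend_company': 'time_spent_company',
    'Work_accident': 'work_accident',
}

# rename() returns a new DataFrame and leaves the original untouched
renamed = sample.rename(columns=RENAME_MAP)
print(list(renamed.columns))
```

The remaining cells of this notebook keep the original CSV names, so such a rename would have to be applied to both `df` and `df_test` and carried through every later cell.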


In [28]:
# Structure of the training DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11999 entries, 0 to 11998
Data columns (total 10 columns):
satisfaction_level       11999 non-null float64
last_evaluation          11999 non-null float64
number_project           11999 non-null int64
average_montly_hours     11999 non-null int64
time_spend_company       11999 non-null int64
Work_accident            11999 non-null int64
left                     11999 non-null int64
promotion_last_5years    11999 non-null int64
department               11999 non-null object
salary                   11999 non-null object
dtypes: float64(2), int64(6), object(2)
memory usage: 937.5+ KB


In [29]:
# Read the test data
df_test = pd.read_csv('data/hr_test.csv', delimiter=';', decimal='.')
df_test.head()

| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | promotion_last_5years | department | salary |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.1 | 0.93 | 7 | 258 | 4 | 0 | 0 | technical | low |
| 1 | 0.24 | 0.55 | 6 | 231 | 4 | 0 | 0 | sales | low |
| 2 | 0.23 | 0.84 | 5 | 140 | 4 | 0 | 0 | IT | low |
| 3 | 0.42 | 0.54 | 2 | 159 | 3 | 0 | 0 | sales | medium |
| 4 | 0.43 | 0.47 | 2 | 144 | 3 | 0 | 0 | hr | medium |


In [30]:
# Structure of the test DataFrame
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
satisfaction_level       3000 non-null float64
last_evaluation          3000 non-null float64
number_project           3000 non-null int64
average_montly_hours     3000 non-null int64
time_spend_company       3000 non-null int64
Work_accident            3000 non-null int64
promotion_last_5years    3000 non-null int64
department               3000 non-null object
salary                   3000 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 211.0+ KB
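
The two `info()` outputs show that the test data has the same columns as the training data except for the target `left`. A programmatic check of that assumption might look like the sketch below. `check_columns` is a hypothetical helper, and the tiny frames only stand in for `df` and `df_test`.

```python
import pandas as pd


def check_columns(train, test, target='left'):
    """Raise ValueError unless test has the training columns minus the target."""
    expected = [col for col in train.columns if col != target]
    actual = list(test.columns)
    if actual != expected:
        missing = sorted(set(expected) - set(actual))
        extra = sorted(set(actual) - set(expected))
        raise ValueError(
            f'column mismatch (names or order): missing={missing}, extra={extra}'
        )


# Stand-ins for df and df_test with a reduced set of columns
train = pd.DataFrame({'satisfaction_level': [0.55], 'left': [0], 'salary': ['medium']})
test = pd.DataFrame({'satisfaction_level': [0.10], 'salary': ['low']})

check_columns(train, test)  # passes silently when the layouts agree
```

In the notebook itself the call would be `check_columns(df, df_test)`.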


### Check Appropriate Data Types

In [48]:
# Number of unique values per column
print(color.UNDERLINE + color.BOLD + 'Number of Unique Values per Column:' + color.END)
for col in df.columns:
    col_unique_value_count = df[col].unique().size
    print('\t' + col + color.BOLD,':', col_unique_value_count, color.END)

[4m[1mNumber of Unique Values per Column:[0m
	satisfaction_level[1m : 92 [0m
	last_evaluation[1m : 65 [0m
	number_project[1m : 6 [0m
	average_montly_hours[1m : 215 [0m
	time_spend_company[1m : 8 [0m
	Work_accident[1m : 2 [0m
	left[1m : 2 [0m
	promotion_last_5years[1m : 2 [0m
	department[1m : 10 [0m
	salary[1m : 3 [0m
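
The same per-column counts can probably be obtained in a single call with `DataFrame.nunique()`. The sketch below uses a small stand-in frame rather than `df`. One difference to keep in mind: `nunique()` ignores missing values by default, whereas `unique().size` counts `NaN` as a value. For this dataset that should not matter, since `info()` reports no nulls.

```python
import pandas as pd

# Small stand-in for df, using values from the first five training rows
sample = pd.DataFrame({
    'number_project': [4, 4, 5, 2, 4],
    'salary': ['medium', 'medium', 'medium', 'low', 'low'],
})

# One Series with the number of distinct values per column
unique_counts = sample.nunique()
print(unique_counts)
```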


In [35]:
df.columns

Index(['satisfaction_level', 'last_evaluation', 'number_project',
       'average_montly_hours', 'time_spend_company', 'Work_accident', 'left',
       'promotion_last_5years', 'department', 'salary'],
      dtype='object')

In [47]:
# Print each distinct value and its count for every column with at most 10 unique values
for col in df.columns:
    if df[col].unique().size <= 10:
        print()
        print(color.BOLD + ' attr : count for ' + col + color.END)
        uniques, counts = np.unique(df[col], return_counts=True)
        for unique, count in zip(uniques, counts):
            print('\t', unique, ':', count)


 attr : count for number_project
	 2 : 1934
	 3 : 3246
	 4 : 3483
	 5 : 2191
	 6 : 938
	 7 : 207

 attr : count for time_spend_company
	 2 : 2584
	 3 : 5165
	 4 : 2049
	 5 : 1156
	 6 : 576
	 7 : 163
	 8 : 132
	 10 : 174

 attr : count for Work_accident
	 0 : 10245
	 1 : 1754

 attr : count for left
	 0 : 9149
	 1 : 2850

 attr : count for promotion_last_5years
	 0 : 11739
	 1 : 260

 attr : count for department
	 IT : 972
	 RandD : 631
	 accounting : 627
	 hr : 576
	 management : 514
	 marketing : 666
	 product_mng : 733
	 sales : 3325
	 support : 1795
	 technical : 2160

 attr : count for salary
	 high : 997
	 low : 5845
	 medium : 5157
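
The counts for `left` (9149 zeros, 2850 ones) mean that roughly 24 % of the training rows belong to the positive class, which is worth keeping in mind when evaluating a model later. As a sketch, `Series.value_counts()` gives both the absolute counts and, with `normalize=True`, the relative shares. The series below is rebuilt from the printed counts and only stands in for `df['left']`.

```python
import pandas as pd

# Rebuild the `left` column from the printed counts (9149 zeros, 2850 ones)
left = pd.Series([0] * 9149 + [1] * 2850, name='left')

# Absolute counts per class, ordered by class label
counts = left.value_counts().sort_index()

# Relative share per class (fractions that sum to 1)
shares = left.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```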


In [52]:
# Convert the data type of selected columns to category
for col in ['Work_accident', 'left', 'promotion_last_5years', 'department', 'salary']:
    print('transforming', col)
    df[col] = df[col].astype('category')
    if col != 'left':
        df_test[col] = df_test[col].astype('category')

transforming Work_accident
transforming left
transforming promotion_last_5years
transforming department
transforming salary


In [53]:
# Verify the result
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11999 entries, 0 to 11998
Data columns (total 10 columns):
satisfaction_level       11999 non-null float64
last_evaluation          11999 non-null float64
number_project           11999 non-null int64
average_montly_hours     11999 non-null int64
time_spend_company       11999 non-null int64
Work_accident            11999 non-null category
left                     11999 non-null category
promotion_last_5years    11999 non-null category
department               11999 non-null category
salary                   11999 non-null category
dtypes: category(5), float64(2), int64(3)
memory usage: 528.2 KB
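
`astype('category')` infers the categories from each frame separately and treats them as unordered. For `salary`, an explicit ordered dtype shared by the training and test data may be preferable. The sketch below assumes the ordering low < medium < high, which the dataset description does not state explicitly; `salary_dtype` and the two small frames are illustrative stand-ins.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed ordering of the three salary levels: low < medium < high
salary_dtype = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)

# Stand-ins for df and df_test, using the salary values of the first five rows
train = pd.DataFrame({'salary': ['medium', 'medium', 'medium', 'low', 'low']})
test = pd.DataFrame({'salary': ['low', 'low', 'low', 'medium', 'medium']})

# Applying the same dtype to both frames keeps categories and codes aligned
train['salary'] = train['salary'].astype(salary_dtype)
test['salary'] = test['salary'].astype(salary_dtype)

print(train['salary'].cat.codes.tolist())
print(train['salary'].dtype == test['salary'].dtype)
```

With a shared dtype, the integer codes mean the same thing in both frames even if one of them happens to lack a level.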
