### Finding outliers with z-score

We will use the 'Unemployment' dataset from the pydataset library.

Our goal is to find all the outliers regarding the duration of unemployment.

We will use the z-score to define the outliers.

Outliers are all records with a duration z-score greater than +3.
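For reference, the z-score of a value is its distance from the mean, measured in standard deviations: `(x - mean) / std`. Here is a minimal sketch on made-up numbers; the `durations` values are hypothetical and not taken from the Unemployment dataset.

```python
import pandas as pd

# Hypothetical durations: 19 ordinary values and one extreme value
durations = pd.Series([5] * 19 + [100])

# z-score: distance from the mean, in units of standard deviation
z_scores = (durations - durations.mean()) / durations.std()

# Flag values more than 3 standard deviations above the mean
is_outlier = z_scores > 3

print(durations[is_outlier])
```

Only the value 100 is flagged; the other values sit slightly below the mean and have small negative z-scores.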

In [18]:
# Install pydataset, if not already installed
# !pip install pydataset

# Importing libraries
import pydataset

In [19]:
# Load the dataset 'Unemployment' from pydataset
df = pydataset.data('Unemployment')

In [20]:
# Print the first 5 rows of the dataset
print(df.head())

   duration  spell      race     sex  reason search pubemp  ftp1  ftp2  ftp3  \
1         4      1     white    male  reentr    yes    yes     1     0     0   
2         7      0     white    male    lose     no     no     1     1     1   
3         1      0  nonwhite    male    lose     no     no     0     0     0   
4         1      1  nonwhite    male  reentr     no     no     0     1     0   
5         3      1  nonwhite  female  reentr     no     no     0     0     0   

   ftp4  nobs  
1     0     1  
2     1     2  
3     0     1  
4     0     1  
5     0     1  


In [21]:
# Calculate the mean of the 'duration' column
mean = df['duration'].mean()

In [22]:
# Calculate the standard deviation of the 'duration' column
std = df['duration'].std()

In [23]:
# Calculate the z-score of the 'duration' column
df['z_score'] = (df['duration'] - mean) / std
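One detail worth knowing: pandas' `Series.std()` computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below uses hypothetical values to show that the two give slightly different z-scores. With a few hundred rows the gap is small, but it can matter for values close to the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical values, used only to illustrate the ddof difference
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sample_std = values.std()                   # pandas default: ddof=1
population_std = np.std(values.to_numpy())  # NumPy default: ddof=0

# The same deviations divided by two different standard deviations
z_sample = (values - values.mean()) / sample_std
z_population = (values - values.mean()) / population_std

print(sample_std, population_std)
```

Because the sample standard deviation is the larger of the two, z-scores based on it are slightly smaller in absolute value.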

In [24]:
# Flag rows with a z-score greater than 3: True marks an outlier, False everything else
df['outlier'] = df['z_score'] > 3

In [25]:
# Print the first 15 rows of the outlier column
print(df['outlier'].head(15))

1     False
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9      True
10    False
11    False
12     True
13    False
14    False
15    False
Name: outlier, dtype: bool


In [26]:
# Print only the outliers
print(df[df['outlier']])

     duration  spell      race     sex  reason search pubemp  ftp1  ftp2  \
9         113      0     white  female  reentr     no     no     0     0   
12        104      0     white    male   leave    yes     no     1     0   
16        104      1     white  female    lose    yes     no     1     0   
26        104      1  nonwhite    male  reentr    yes    yes     0     1   
35        108      1     white    male    lose     no     no     1     1   
36        104      1     white  female    lose    yes     no     0     1   
51        104      1     white    male  reentr    yes    yes     1     0   
176       104      1     white  female   leave     no     no     0     0   
232       117      0     white    male    lose     no    yes     1     1   
236       100      0     white    male    lose     no    yes     1     1   
244       109      0  nonwhite  female    lose     no    yes     1     1   
252       112      1     white  female  reentr     no     no     1     1   
256       10

In [27]:
# Count the number of outliers
print(df['outlier'].value_counts())

outlier
False    438
True      14
Name: count, dtype: int64


### Now let's calculate the outliers separately for each category of the reason column, to see if we get different outliers

In [28]:
# Show unique values in the reason column
df['reason'].unique()

array(['reentr', 'lose', 'leave', 'new'], dtype=object)

In [29]:
# Calculate the mean of the duration column for each category in the reason column
df.groupby('reason').duration.mean()

reason
leave     18.130435
lose      21.362573
new       13.097561
reentr    16.952703
Name: duration, dtype: float64

In [30]:
# Calculate the standard deviation of the duration column for each category in the reason column
df.groupby('reason').duration.std()

reason
leave     21.378441
lose      25.047428
new       17.377866
reentr    22.895009
Name: duration, dtype: float64

In [31]:
# Calculate the z-score of the duration column within each category of the reason column and assign the values to a new column called z_score_grouped
df['z_score_grouped'] = df.groupby('reason').duration.transform(lambda x: (x - x.mean()) / x.std())
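`transform` applies the function to each group separately and returns a result aligned with the original index, which is why it can be assigned straight back as a column. Here is a small sketch with a hypothetical two-group frame; the `toy` frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical data: two groups on very different scales
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
})

# Each value is standardized with the mean and std of its own group
toy["z_grouped"] = toy.groupby("group")["value"].transform(
    lambda x: (x - x.mean()) / x.std()
)

print(toy)
```

Both groups end up with the same z-scores even though their raw values differ by a factor of ten, because each group is measured against its own mean and standard deviation.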

In [32]:
# Create a new column called outlier_grouped and mark the values that have a grouped z-score greater than 3 as True, otherwise False
df['outlier_grouped'] = df['z_score_grouped'] > 3

In [33]:
# Calculate the number of outliers for each category in the reason column
df.groupby('reason').outlier_grouped.sum()

reason
leave     2
lose      6
new       1
reentr    5
Name: outlier_grouped, dtype: int64

In [37]:
# Compare the number of outliers for each category in the reason column with the number of ungrouped outliers for the entire dataset
print("Ungrouped outliers:", df['outlier'].sum())
print("Grouped outliers:", df['outlier_grouped'].sum())


Ungrouped outliers: 14
Grouped outliers: 14


In [36]:
# Print z_score and z_score_grouped for each row
print(df[['z_score', 'z_score_grouped']])

      z_score  z_score_grouped
1   -0.628596        -0.565744
2   -0.498641        -0.573415
3   -0.758551        -0.812961
4   -0.758551        -0.696776
5   -0.671915        -0.609421
..        ...              ...
448  1.060822         0.863858
449 -0.282049        -0.216322
450  1.450688         1.584286
451 -0.282049        -0.216322
452  0.107817         0.176776

[452 rows x 2 columns]


In [38]:
# Print outlier and outlier_grouped for each row
print(df[['outlier', 'outlier_grouped']])

     outlier  outlier_grouped
1      False            False
2      False            False
3      False            False
4      False            False
5      False            False
..       ...              ...
448    False            False
449    False            False
450    False            False
451    False            False
452    False            False

[452 rows x 2 columns]


In [39]:
# Find all the rows where outlier and outlier_grouped are not equal
df[df['outlier'] != df['outlier_grouped']]

Unnamed: 0,duration,spell,race,sex,reason,search,pubemp,ftp1,ftp2,ftp3,ftp4,nobs,z_score,outlier,z_score_grouped,outlier_grouped
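An empty result means the two flags agree on every row. Another way to check this is a cross-tabulation with `pd.crosstab`, sketched here on hypothetical flags rather than the notebook's `df`: any count off the diagonal is a row where the two methods disagree.

```python
import pandas as pd

# Hypothetical flags standing in for df['outlier'] and df['outlier_grouped']
flags = pd.DataFrame({
    "outlier": [False, False, True, True, False],
    "outlier_grouped": [False, False, True, False, False],
})

# Rows: global flag, columns: per-group flag
agreement = pd.crosstab(flags["outlier"], flags["outlier_grouped"])
print(agreement)

# Number of rows where the two methods disagree
n_disagree = (flags["outlier"] != flags["outlier_grouped"]).sum()
print(n_disagree)
```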


#### Result: The two ways of calculating the outliers made no difference. The outliers from the overall z-score and the outliers from the z-scores per reason category are the same 14 records.
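This outcome is plausible because the per-category means (about 13 to 21) and standard deviations (about 17 to 25) shown earlier are close to one another, so standardizing within each category barely shifts the z-scores. The two approaches can diverge when groups sit on different scales. Here is a minimal sketch with made-up data; the `kind` and `value` columns are hypothetical.

```python
import pandas as pd

# Made-up data: a "low" group near 1 with one value of 10,
# and a "high" group alternating between 90 and 110
data = pd.DataFrame({
    "kind": ["low"] * 20 + ["high"] * 20,
    "value": [1.0] * 19 + [10.0] + [90.0, 110.0] * 10,
})

# Global z-score: one mean and std for the whole column
data["z_score"] = (data["value"] - data["value"].mean()) / data["value"].std()

# Grouped z-score: mean and std computed within each kind
data["z_score_grouped"] = data.groupby("kind")["value"].transform(
    lambda x: (x - x.mean()) / x.std()
)

data["outlier"] = data["z_score"] > 3
data["outlier_grouped"] = data["z_score_grouped"] > 3

# Rows where the two methods disagree
print(data[data["outlier"] != data["outlier_grouped"]])
```

In this toy case the value 10 is unremarkable against the whole column but extreme within the "low" group, so only the grouped method flags it.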