<a href="https://colab.research.google.com/github/nehagoyal09/Python_Main_Topics/blob/main/Day_20_Employee_Statistics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**
This Jupyter notebook is part of your learning experience in the study of descriptive statistics. You will work with a simple data set that contains information about employees.

In this exercise, you will perform the following tasks:

1 - Load and study the data

2 - Visualise the distributions of ratings and compensations

3 - Subset the data based on thresholds

### **`Task 1`** - Load and study the data
Load the data and study its features such as:

- The number of employees
- The number of features
- The types of features

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Read in the "Employee Dataset" file as a Pandas DataFrame
data = pd.read_csv('/content/Employee Dataset.csv')
data.head()

In [None]:
data.tail()

In [None]:
# Look at basic information about the dataframe
data.info()

### **`Observations from Task 1`**
There are 50 rows and 6 columns in the data. Each row contains the details of one employee in the company.

The features in the data set are:

1. The id of the employees

2. Their respective groups and age

3. Their healthy_eating and active_lifestyle scores

4. Their salary
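The row count, column count, and feature types can also be read directly from the DataFrame rather than from the `info()` printout. The sketch below is a minimal illustration on a small stand-in table: the column names follow the notebook, but the values are made up and `sample` is a hypothetical name, not the real data set.

```python
import pandas as pd

# Hypothetical stand-in for the employee data: the values are made up,
# only the column names follow the notebook.
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "groups": ["A", "O", "A", "B"],
    "age": [25, 38, 47, 52],
    "healthy_eating": [5, 7, 9, 6],
    "active_lifestyle": [4, 6, 8, 5],
    "salary": [1200, 2500, 700, 4100],
})

# shape is a (rows, columns) tuple: employees first, features second
n_rows, n_cols = sample.shape

# dtypes holds one data type per column
feature_types = sample.dtypes

print(f"Employees: {n_rows}, features: {n_cols}")
print(feature_types)
```

Applied to the real `data` frame, the same two attributes should agree with what `data.info()` reports.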

### **`Task 2`** - Visualise the distributions of ratings and compensations
We will now visualise the distributions of employee ratings and compensations

We will create the following plots:

1. Scatter plots of age and salary, healthy_eating and active_lifestyle, and healthy_eating and salary

2. A count plot of groups

3. A histogram of employee salary

In [None]:
# Create a scatter plot of the "age" and "salary" features
plt.figure(figsize= (9,4))

sns.scatterplot(data= data, x= 'age', y= 'salary', color= 'purple',
                edgecolor= 'black', alpha= 0.5)

plt.title("Scatterplot of Employee Age v/s Employee Salary")
plt.xlabel("Employee Age")
plt.ylabel("Employee Salary")
plt.show()


Observations:

In general, there is a positive relationship between employee age and salary.

As age increases, salary tends to increase as well, although a few outliers are present.
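A visual trend like this can be backed by a number. The sketch below computes the Pearson correlation coefficient with `Series.corr`, which is not something the notebook itself does. The values are made up for illustration, so the result says nothing about the real data.

```python
import pandas as pd

# Made-up ages and salaries that rise together (illustration only)
sample = pd.DataFrame({
    "age": [23, 31, 38, 45, 52, 60],
    "salary": [900, 1500, 2100, 2600, 3400, 3900],
})

# Series.corr uses the Pearson coefficient by default;
# values near +1 indicate a strong positive linear relationship
r = sample["age"].corr(sample["salary"])

print(f"Pearson correlation between age and salary: {r:.2f}")
```

On the real data, `data['age'].corr(data['salary'])` would give the corresponding figure. A few outliers can pull the coefficient down noticeably, so it is best read alongside the scatter plot.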

In [None]:
# Create a scatter plot of the "healthy_eating" and "active_lifestyle" features
plt.figure(figsize= (9,4))

sns.scatterplot(data= data, x ='healthy_eating', y= 'active_lifestyle', color= 'green',
                edgecolor= 'black', alpha = 0.5)

plt.title("Scatterplot of Healthy Eating v/s Active Lifestyle")
plt.xlabel("Healthy Eating")
plt.ylabel("Active Lifestyle")
plt.show()


Observations:

In general, there is a positive relationship between the healthy_eating and active_lifestyle scores of employees.

As healthy_eating increases, active_lifestyle tends to increase as well.

In [None]:
# Create a scatter plot of the "healthy_eating" and "salary" features
plt.figure(figsize= (9,4))

sns.scatterplot(data= data, x= 'healthy_eating', y= 'salary', color= 'grey',
                edgecolor= 'black', alpha= 0.5)

plt.title("Scatterplot of Healthy Eating v/s Salary")
plt.xlabel("Healthy Eating")
plt.ylabel("Salary")
plt.show()


Observations:

In general, there is a positive relationship between the healthy_eating score and the salary of employees.

As healthy_eating increases, salary tends to increase as well.

In [None]:
# Create a count plot of the "groups" feature
plt.figure(figsize= (9,4))

sns.countplot(data= data, x= 'groups',
              edgecolor= 'linen', alpha= 0.7)

plt.title("Countplot of Groups")
plt.xlabel("Groups")
plt.ylabel("Count")
plt.show()


Observation:

Most employees belong to either group A or group O, with group A being the most frequent.
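The bar heights in a count plot are plain frequencies, so the same reading can be checked in a table. This is a minimal sketch using `Series.value_counts`; the group labels and their counts are made up and do not come from the data set.

```python
import pandas as pd

# Made-up group labels (illustration only)
groups = pd.Series(["A", "O", "A", "B", "A", "O", "AB"], name="groups")

# Frequency of each group, sorted from most to least common
counts = groups.value_counts()

# Label of the most frequent group
most_common = counts.idxmax()

# Same counts expressed as fractions of the total
shares = groups.value_counts(normalize=True)

print(counts)
print(f"Most common group: {most_common}")
```

On the real data, `data['groups'].value_counts()` would list the exact count behind each bar.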

In [None]:
# Create a histogram of the "Salary" feature
plt.figure(figsize= (9,4))

sns.histplot(data= data, x= 'salary', color= 'orange',
             edgecolor= 'linen', alpha= 0.7, bins=10)

plt.title("Histogram of Employee salary")
plt.xlabel("Salary")
plt.ylabel("Count")
plt.show()

Observations:

The salaries are not uniformly distributed, which is not necessarily a discrepancy.

However, there are employees with salaries at both extremes of the histogram.

Based on the histogram, very few employees fall in the highest salary range.
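Summary statistics give a numeric view of what the histogram shows. The sketch below uses `describe()` and, as an assumption that is not part of the notebook, the common 1.5 × IQR rule to flag values at the extremes. The salary values are made up for illustration.

```python
import pandas as pd

# Made-up salaries with one very high value (illustration only)
salary = pd.Series([700, 1200, 1500, 1800, 2100, 2400, 2600, 9000],
                   name="salary")

# Count, mean, standard deviation, min, quartiles, and max in one call
summary = salary.describe()

# Interquartile range: spread of the middle 50% of values
q1, q3 = salary.quantile([0.25, 0.75])
iqr = q3 - q1

# Assumed rule of thumb: flag values beyond 1.5 * IQR from the quartiles
extremes = salary[(salary < q1 - 1.5 * iqr) | (salary > q3 + 1.5 * iqr)]

print(summary)
print("Flagged as extreme:", extremes.tolist())
```

The 1.5 × IQR cut-off is only a convention. A flagged value deserves a closer look; it is not automatically an error.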

### **Observations from Task 2**
1. Generally, as an employee's age increases, the salary increases

2. Generally, as an employee's healthy_eating score increases, the salary increases as well

3. However, the employee salary values are more spread out

4. A healthy_eating score greater than 8 can be considered a high score, so 8 is taken as the threshold

5. A salary of less than 1000 can be considered a low salary

### **`Task 3`** - Subset the data based on thresholds
We will now subset the original data frame based on the following conditions:

- Employees with healthy_eating greater than 8

- Employees with salary less than 1000

- Employees with healthy_eating greater than 8 and with salary less than 1000

In [None]:
# Employee healthy_eating is greater than 8
data[data['healthy_eating']> 8]

In [None]:
# salary less than 1000
data[data['salary']< 1000]

In [None]:
# Employees with healthy_eating greater than 8 and with salary less than 1000
data[(data['healthy_eating']> 8) & (data['salary']< 1000)]

**Observations:**

1. The only employee whose salary seems inconsistent with their healthy_eating score is the employee with id = 26

2. The employee with id = 26 has a salary of 700
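The combined condition can be written in more than one way. The sketch below compares the boolean-mask form used in this notebook with `DataFrame.query`, an equivalent pandas method that the notebook does not use. The rows are made up, chosen only so that one of them satisfies both thresholds.

```python
import pandas as pd

# Made-up rows (illustration only); column names follow the notebook
sample = pd.DataFrame({
    "id": [24, 25, 26, 27],
    "healthy_eating": [6, 9, 9, 4],
    "salary": [800, 3200, 700, 2500],
})

# Boolean-mask form: each comparison needs its own parentheses,
# because & binds more tightly than > and <
mask = (sample["healthy_eating"] > 8) & (sample["salary"] < 1000)
by_mask = sample[mask]

# Equivalent query-string form
by_query = sample.query("healthy_eating > 8 and salary < 1000")

print(by_mask)
print("Same result from both forms:", by_mask.equals(by_query))
```

Both forms return the same rows. The mask form is handy when the mask is reused, while `query` can be easier to read for longer conditions.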

### **Final Conclusions**
From the given data, we can use simple visualisations to get a sense of how the data are distributed.

We can conduct preliminary analyses simply by subsetting the data set using well-thought-out thresholds and conditions.