# Exercise

The following exercises cover basic data analysis with pandas.

In this exercise, you will work with a fictional dataset containing employee records for a company. The dataset is defined in code as a dictionary, converted to a pandas DataFrame, and consists of the following columns:

1. Employee_ID: Unique identifier for each employee.
2. Name: First name of the employee.
3. Department: Department the employee works in (e.g., HR, IT, Marketing, Finance).
4. Position: Job title of the employee.
5. Salary: Salary of the employee.
6. Hire_Date: Date the employee was hired.

Your task is to use pandas to perform various data analysis tasks and derive insights from the dataset.

In [1]:
import pandas as pd

# Create a fictional dataset
data = {
    'Employee_ID': [101, 102, 103, 104, 105],
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
    'Position': ['Manager', 'Developer', 'Marketing Specialist', 'Accountant', 'HR Assistant'],
    'Salary': [6000, 5000, 4500, 5500, 4000],
    'Hire_Date': ['2020-01-15', '2019-05-20', '2020-03-10', '2018-11-25', '2021-02-05']
}

# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)


### 1. Display Basic Information:
- Display the first 5 rows of the DataFrame.
- Display the basic information about the DataFrame (number of rows, columns, data types, memory usage).

In [4]:
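# Display the first 5 rows of the DataFrame
print("First 5 rows of the DataFrame:")
# NOTE: 5 is the default value, so df.head() also works in this case
# print() is needed here: only the last expression of a cell is displayed automatically
print(df.head(5))

# Display basic information about the DataFrame
print("\nBasic information about the DataFrame:")
# df.info() prints its report itself and returns None, so it is not wrapped in print()
df.info()

First 5 rows of the DataFrame:
   Employee_ID   Name  Department              Position  Salary   Hire_Date
0          101   John          HR               Manager    6000  2020-01-15
1          102  Alice          IT             Developer    5000  2019-05-20
2          103    Bob   Marketing  Marketing Specialist    4500  2020-03-10
3          104  Emily     Finance            Accountant    5500  2018-11-25
4          105  David          HR          HR Assistant    4000  2021-02-05

Basic information about the DataFrame:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Employee_ID  5 non-null      int64 
 1   Name         5 non-null      object
 2   Department   5 non-null      object
 3   Position     5 non-null      object
 4   Salary       5 non-null      int64 
 5   Hire_Date    5 non-null      object
dtypes: int64(2), object(4)
memory usage: 372.0+ bytes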
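
The individual pieces of information listed in the task (number of rows and columns, data types, memory usage) can also be read one at a time. The following is a minimal sketch that rebuilds the same fictional DataFrame so it runs on its own; the variable names `n_rows`, `n_cols`, and `total_bytes` are chosen here for illustration.

```python
import pandas as pd

# Same fictional dataset as in the exercise
df = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105],
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
    'Position': ['Manager', 'Developer', 'Marketing Specialist', 'Accountant', 'HR Assistant'],
    'Salary': [6000, 5000, 4500, 5500, 4000],
    'Hire_Date': ['2020-01-15', '2019-05-20', '2020-03-10', '2018-11-25', '2021-02-05'],
})

# Number of rows and columns as a tuple
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")

# Data type of each column
print(df.dtypes)

# Memory usage in bytes; deep=True also measures the contents of string columns
total_bytes = int(df.memory_usage(deep=True).sum())
print(total_bytes, "bytes")
```

The exact byte count depends on the pandas version and platform, so it is not shown here.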


### 2. Summary Statistics:
- Calculate and display summary statistics for numerical columns (count, mean, min, max, etc.).

In [5]:
# Calculate and display summary statistics for numerical columns
print("Summary statistics for numerical columns:")
print(df.describe())


Summary statistics for numerical columns:
       Employee_ID       Salary
count     5.000000     5.000000
mean    103.000000  5000.000000
std       1.581139   790.569415
min     101.000000  4000.000000
25%     102.000000  4500.000000
50%     103.000000  5000.000000
75%     104.000000  5500.000000
max     105.000000  6000.000000
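
Note that `describe()` treats Employee_ID as a number even though it is an identifier, so its mean and standard deviation carry no meaning. One possible way to restrict the summary to Salary is sketched below; the variable names `salary_stats` and `salary_summary` are illustrative only.

```python
import pandas as pd

# Same fictional data as in the exercise, reduced to the columns needed here
df = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
    'Salary': [6000, 5000, 4500, 5500, 4000],
})

# Describe only the column whose statistics are meaningful
salary_stats = df['Salary'].describe()
print(salary_stats)

# Select specific statistics explicitly with agg()
salary_summary = df['Salary'].agg(['min', 'max', 'mean', 'median'])
print(salary_summary)
```

Selecting a single column returns a Series, so `describe()` and `agg()` both return a Series indexed by the statistic names.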


### 3. Data Manipulation:
- Convert the Hire_Date column to datetime format.
- Add a new column named Years_Worked that represents the number of years each employee has worked in the company (as of the current year).

In [8]:
# Convert 'Hire_Date' column to datetime format
df['Hire_Date'] = pd.to_datetime(df['Hire_Date'])

# Add a new column 'Years_Worked' (difference in calendar years, not exact tenure)
current_year = pd.Timestamp.now().year
df['Years_Worked'] = current_year - df['Hire_Date'].dt.year

# Out of curiosity, look at the dtypes
print(df.dtypes)

# Display the modified DataFrame
print("DataFrame with 'Hire_Date' converted and 'Years_Worked' added:")
df


Employee_ID              int64
Name                    object
Department              object
Position                object
Salary                   int64
Hire_Date       datetime64[ns]
Years_Worked             int32
dtype: object
DataFrame with 'Hire_Date' converted and 'Years_Worked' added:


| | Employee_ID | Name | Department | Position | Salary | Hire_Date | Years_Worked |
|---|---|---|---|---|---|---|---|
| 0 | 101 | John | HR | Manager | 6000 | 2020-01-15 | 5 |
| 1 | 102 | Alice | IT | Developer | 5000 | 2019-05-20 | 6 |
| 2 | 103 | Bob | Marketing | Marketing Specialist | 4500 | 2020-03-10 | 5 |
| 3 | 104 | Emily | Finance | Accountant | 5500 | 2018-11-25 | 7 |
| 4 | 105 | David | HR | HR Assistant | 4000 | 2021-02-05 | 4 |
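
Subtracting calendar years counts a year as worked even when the hire anniversary has not yet been reached, and the result changes depending on when the notebook is run. The sketch below shows one possible way to count completed years against a fixed reference date. The date `as_of` is a hypothetical value chosen for illustration, not part of the exercise.

```python
import pandas as pd

# Hire dates from the exercise dataset, already converted to datetime
hire_dates = pd.to_datetime(pd.Series(
    ['2020-01-15', '2019-05-20', '2020-03-10', '2018-11-25', '2021-02-05']
))

# Hypothetical fixed reference date, so the result does not depend on the run date
as_of = pd.Timestamp('2025-06-30')

# Calendar-year difference, the same approach as in the cell above
years_by_calendar = as_of.year - hire_dates.dt.year

# True where the anniversary falls later in the reference year than as_of
anniversary_pending = (
    (hire_dates.dt.month > as_of.month)
    | ((hire_dates.dt.month == as_of.month) & (hire_dates.dt.day > as_of.day))
)

# Completed years: subtract one where the anniversary is still pending
years_completed = years_by_calendar - anniversary_pending.astype(int)

print(years_by_calendar.tolist())  # [5, 6, 5, 7, 4]
print(years_completed.tolist())    # [5, 6, 5, 6, 4]
```

Only the employee hired on 2018-11-25 differs between the two results, because that anniversary falls after the assumed reference date.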


### 4. Data Filtering:
- Filter the DataFrame to include only employees who work in the 'HR' department.
- Display the filtered DataFrame.

In [11]:
# Filter the DataFrame for employees in the 'HR' department
hr_df = df[df['Department'] == 'HR']

# Display the filtered DataFrame
print("Employees in the 'HR' department:")

hr_df


Employees in the 'HR' department:


| | Employee_ID | Name | Department | Position | Salary | Hire_Date | Years_Worked |
|---|---|---|---|---|---|---|---|
| 0 | 101 | John | HR | Manager | 6000 | 2020-01-15 | 5 |
| 4 | 105 | David | HR | HR Assistant | 4000 | 2021-02-05 | 4 |
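
The same boolean-mask idea extends to combined conditions and to matching several values at once. The following is a minimal sketch on a reduced copy of the fictional dataset; the salary threshold of 5000 and the variable names are arbitrary choices for illustration.

```python
import pandas as pd

# Reduced copy of the fictional dataset from the exercise
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Emily', 'David'],
    'Department': ['HR', 'IT', 'Marketing', 'Finance', 'HR'],
    'Salary': [6000, 5000, 4500, 5500, 4000],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
hr_high_salary = df[(df['Department'] == 'HR') & (df['Salary'] > 5000)]

# Match any of several departments with isin()
hr_or_it = df[df['Department'].isin(['HR', 'IT'])]

# The same HR filter written as a query string
hr_query = df.query("Department == 'HR'")

print(hr_high_salary['Name'].tolist())  # ['John']
print(hr_or_it['Name'].tolist())        # ['John', 'Alice', 'David']
print(hr_query['Name'].tolist())        # ['John', 'David']
```

All three forms keep the original index labels, which is why the filtered HR rows above are labeled 0 and 4 rather than 0 and 1.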
