# Exercise: Data Wrangling - Join, Combine, Reshape

### Submitted by: Nigel Haim N. Sebastian
#### Note:
- I manually created the CSV files (`employees.csv` and `departments.csv`) from the Assignment tab.
- The CSV files are located in a local `Datasets` folder.

## Install

In [17]:
# %pip install numpy
# %pip install pandas

## Imports

In [18]:
import numpy as np 
import pandas as pd 

### Get the dataset

In [19]:
employees = pd.read_csv('Datasets/employees.csv')
departments = pd.read_csv('Datasets/departments.csv') 

In [20]:
employees.head()

Unnamed: 0,Employee_ID,Name,Age,Department_ID
0,101,Alice,30,D001
1,102,Bob,35,D002
2,103,Charlie,28,D001
3,104,David,40,D003
4,105,Eve,45,D004


In [21]:
departments.head()

Unnamed: 0,Department_ID,Department_Name,Location
0,D001,Sales,New York
1,D002,Marketing,London
2,D003,IT,San Francisco
3,D004,HR,Singapore
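Because the CSV files live in a local folder, the notebook cannot be rerun without them. As a minimal sketch, the two frames can be rebuilt inline from the `head()` outputs above, assuming the files contain only the rows shown:

```python
import pandas as pd

# Rows copied from the head() outputs above; the notebook itself reads them from CSV.
employees = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [30, 35, 28, 40, 45],
    'Department_ID': ['D001', 'D002', 'D001', 'D003', 'D004'],
})
departments = pd.DataFrame({
    'Department_ID': ['D001', 'D002', 'D003', 'D004'],
    'Department_Name': ['Sales', 'Marketing', 'IT', 'HR'],
    'Location': ['New York', 'London', 'San Francisco', 'Singapore'],
})

print(employees.shape, departments.shape)
```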


## Tasks

### Join the Data

Merge the employees.csv and departments.csv datasets using the Department_ID column. Show the combined dataset.

In [22]:
merged_df = employees.merge(departments, on='Department_ID', how='inner')
merged_df

Unnamed: 0,Employee_ID,Name,Age,Department_ID,Department_Name,Location
0,101,Alice,30,D001,Sales,New York
1,102,Bob,35,D002,Marketing,London
2,103,Charlie,28,D001,Sales,New York
3,104,David,40,D003,IT,San Francisco
4,105,Eve,45,D004,HR,Singapore
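An inner join silently drops employees whose `Department_ID` has no match. The sketch below is one way to check for that, using a left join with the `indicator` and `validate` options of `merge`. The frames are rebuilt inline from the outputs shown earlier, which is an assumption about the full contents of the CSV files.

```python
import pandas as pd

# Inline copies of the data shown above (assumed to be the complete files).
employees = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [30, 35, 28, 40, 45],
    'Department_ID': ['D001', 'D002', 'D001', 'D003', 'D004'],
})
departments = pd.DataFrame({
    'Department_ID': ['D001', 'D002', 'D003', 'D004'],
    'Department_Name': ['Sales', 'Marketing', 'IT', 'HR'],
    'Location': ['New York', 'London', 'San Francisco', 'Singapore'],
})

# how='left' keeps every employee; validate raises if a Department_ID is
# duplicated in departments; indicator adds a '_merge' column per row.
checked = employees.merge(
    departments,
    on='Department_ID',
    how='left',
    validate='many_to_one',
    indicator=True,
)

# Employees whose department was not found in departments.
unmatched = checked[checked['_merge'] == 'left_only']
print(len(checked), len(unmatched))
```

With this data every employee has a matching department, so the inner and left joins return the same five rows.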


### Filter the Data
From the merged dataset, extract a subset of employees who are older than 30 and work in New York or London.

In [23]:
extracted_df = merged_df[(merged_df['Age'] > 30) & merged_df['Location'].isin(['New York', 'London'])]
extracted_df

Unnamed: 0,Employee_ID,Name,Age,Department_ID,Department_Name,Location
1,102,Bob,35,D002,Marketing,London
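The same filter can also be written with `DataFrame.query`, which some readers find easier to scan. This is a sketch on an inline copy of the relevant columns of the merged data; the `cities` list is a helper name introduced here for illustration.

```python
import pandas as pd

# Inline copy of the columns of merged_df that the filter uses.
merged_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [30, 35, 28, 40, 45],
    'Location': ['New York', 'London', 'New York', 'San Francisco', 'Singapore'],
})

cities = ['New York', 'London']  # hypothetical helper list

# '@cities' refers to the Python variable defined above.
subset = merged_df.query('Age > 30 and Location in @cities')
print(subset)
```

Alice is exactly 30, so the strict `>` comparison excludes her and only Bob remains.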


### Reshape the Data (Pivoting)

Create a summary table that shows the count of employees in each department by location.

In [24]:
location_summary = merged_df.pivot_table(index='Department_Name', columns='Location', aggfunc='size', fill_value=0)
location_summary

Department_Name / Location,London,New York,San Francisco,Singapore
HR,0,0,0,1
IT,0,0,1,0
Marketing,1,0,0,0
Sales,0,2,0,0
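For a plain count table, `pd.crosstab` is a possible alternative to `pivot_table` with `aggfunc='size'`. The sketch below uses an inline copy of the two relevant columns of the merged data.

```python
import pandas as pd

# Inline copy of the two columns of merged_df used in the summary.
merged_df = pd.DataFrame({
    'Department_Name': ['Sales', 'Marketing', 'Sales', 'IT', 'HR'],
    'Location': ['New York', 'London', 'New York', 'San Francisco', 'Singapore'],
})

# crosstab counts co-occurrences and fills missing combinations with 0.
counts = pd.crosstab(merged_df['Department_Name'], merged_df['Location'])
print(counts)
```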


### Create a New Column

#### Add a new column to the combined dataset that categorizes employees into age groups:
- "Young" if age < 35
- "Mid-aged" if age is between 35 and 45
- "Senior" if age > 45

In [25]:
def categorize_age(age):
    """Return the age group label; both 35 and 45 count as 'Mid-aged'."""
    if age < 35:
        return 'Young'
    elif 35 <= age <= 45:
        return 'Mid-aged'
    else:
        return 'Senior'

In [26]:
final_df = merged_df.copy()
final_df['Age_Group'] = final_df['Age'].apply(categorize_age)
final_df


Unnamed: 0,Employee_ID,Name,Age,Department_ID,Department_Name,Location,Age_Group
0,101,Alice,30,D001,Sales,New York,Young
1,102,Bob,35,D002,Marketing,London,Mid-aged
2,103,Charlie,28,D001,Sales,New York,Young
3,104,David,40,D003,IT,San Francisco,Mid-aged
4,105,Eve,45,D004,HR,Singapore,Mid-aged
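No employee in this dataset is older than 45, so the "Senior" branch is never exercised. As a sketch, the same rules can be expressed without `apply` using `np.select`; the extra age of 46 is a hypothetical value added only to show the "Senior" label.

```python
import numpy as np
import pandas as pd

# Ages from the dataset, plus one hypothetical value (46) to reach 'Senior'.
ages = pd.Series([30, 35, 28, 40, 45, 46], name='Age')

# Conditions are checked in order; the first one that is True wins.
conditions = [ages < 35, ages <= 45]
labels = ['Young', 'Mid-aged']

age_group = pd.Series(
    np.select(conditions, labels, default='Senior'),
    index=ages.index,
    name='Age_Group',
)
print(age_group.tolist())
```

Because the second condition is only reached when `ages < 35` is False, it covers exactly the 35 to 45 range, matching `categorize_age`.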


### Final Output:
- Save the final reshaped and categorized dataset to a CSV file.
- Submit the following in your PDF:
    - Screenshots of your code.
    - Final answers for the following:
        - Combined dataset.
        - Subset of employees older than 30 in New York or London.
        - Summary table with the count of employees per department by location.
        - Final dataset with the new age group column.

### Save the final reshaped and categorized dataset to a CSV file

In [27]:
final_df.to_csv('Datasets/reshaped_categorized.csv', index=False)

In [28]:
final_dataset = pd.read_csv('Datasets/reshaped_categorized.csv') 
final_dataset

Unnamed: 0,Employee_ID,Name,Age,Department_ID,Department_Name,Location,Age_Group
0,101,Alice,30,D001,Sales,New York,Young
1,102,Bob,35,D002,Marketing,London,Mid-aged
2,103,Charlie,28,D001,Sales,New York,Young
3,104,David,40,D003,IT,San Francisco,Mid-aged
4,105,Eve,45,D004,HR,Singapore,Mid-aged
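The reload above can be turned into an explicit round-trip check. The sketch below writes to a temporary directory rather than `Datasets/` so it runs anywhere, and uses an inline copy of a few columns of the final data.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Inline copy of part of final_df (subset of columns for brevity).
final_df = pd.DataFrame({
    'Employee_ID': [101, 102, 103, 104, 105],
    'Age': [30, 35, 28, 40, 45],
    'Age_Group': ['Young', 'Mid-aged', 'Young', 'Mid-aged', 'Mid-aged'],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'reshaped_categorized.csv'
    # index=False keeps the row index out of the file.
    final_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if values or column names differ; dtypes are not compared.
pd.testing.assert_frame_equal(final_df, reloaded, check_dtype=False)
print(list(reloaded.columns))
```

Without `index=False`, the row index would be written as an extra unnamed column and would appear as data when the file is read back.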


## Final Answers

### Combined dataset

In [29]:
merged_df

Unnamed: 0,Employee_ID,Name,Age,Department_ID,Department_Name,Location
0,101,Alice,30,D001,Sales,New York
1,102,Bob,35,D002,Marketing,London
2,103,Charlie,28,D001,Sales,New York
3,104,David,40,D003,IT,San Francisco
4,105,Eve,45,D004,HR,Singapore


### Subset of employees older than 30 in New York or London

In [30]:
extracted_df

Unnamed: 0,Employee_ID,Name,Age,Department_ID,Department_Name,Location
1,102,Bob,35,D002,Marketing,London


### Summary table with the count of employees per department by location

In [31]:
location_summary

Department_Name / Location,London,New York,San Francisco,Singapore
HR,0,0,0,1
IT,0,0,1,0
Marketing,1,0,0,0
Sales,0,2,0,0


### Final dataset with the new age group column

In [32]:
final_df

Unnamed: 0,Employee_ID,Name,Age,Department_ID,Department_Name,Location,Age_Group
0,101,Alice,30,D001,Sales,New York,Young
1,102,Bob,35,D002,Marketing,London,Mid-aged
2,103,Charlie,28,D001,Sales,New York,Young
3,104,David,40,D003,IT,San Francisco,Mid-aged
4,105,Eve,45,D004,HR,Singapore,Mid-aged
