# Advanced Pandas Assignment 4

In this assignment, you will practice date formatting, joins, and exporting data from a Pandas dataframe.

### Note about assignments
You can add lines of code as you see fit. As long as the code required by the assignment appears in this notebook under the corresponding question header (i.e., the answer to question 1 is underneath the title "Question 1"), you will receive credit for it.

## About the data
The data used in this assignment is a table built from the Human Resources schema of the Adventure Works 2019 database. Each row records one change to an employee's pay rate (the Employee Pay History), along with information about the employee and the department they were working in when they received that rate.

The actual data is stored in a CSV file called `pay_history.csv`, located inside the `data` folder. A supplemental file called `shifts.csv`, also in the `data` folder, is used in some of the exercises below.

## Instructions
### Set up
##### Import Pandas
Import the Pandas library into Jupyter Lab.

<p style="font-size:.75rem">Expected output: None</p>

In [1]:
import pandas as pd

##### Disable column display limit
Use the following code to disable the default limit for displaying columns. If you don't use this code, a data set with more than 20 columns will be truncated when displayed to take up less space.

```python
pd.options.display.max_columns = None
```

<p style="font-size:.75rem">Expected output: None</p>

In [2]:
pd.options.display.max_columns = None
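
If you would rather not change the setting for the whole notebook, one possible alternative is `pd.option_context`, which applies an option only inside a `with` block. The sketch below uses a made-up 30-column dataframe (`wide`), not the assignment data.

```python
import pandas as pd

# Hypothetical 30-column frame, used only to illustrate the display option.
wide = pd.DataFrame([range(30)])

before = pd.get_option("display.max_columns")

# Lift the column limit only inside the with-block.
with pd.option_context("display.max_columns", None):
    inside = pd.get_option("display.max_columns")
    print(wide)

# Once the block exits, the previous setting is restored.
after = pd.get_option("display.max_columns")
print(before, inside, after)
```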

##### Create the dataframe
Use the `read_csv()` function from Pandas to read the data from the `pay_history.csv` file into a dataframe called `df`.

<p style="font-size:.75rem">Expected output: None</p>

In [3]:
df = pd.read_csv("./data/pay_history.csv")

##### Preview dataframe
Use the `.head()` method to print out the first 5 rows of the dataframe.

<p style="font-size:.75rem">Expected output: 5 rows with indexes 0-4.</p>

In [4]:
df.head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department
0,40,7,00:00.0,25.0,2,adventure-works\jolynn0,3.0,Production Supervisor - WC60,1/13/1965,S,F,12/23/2016,0,82,61,1,3,12/23/2016,,00:00.0,Production,Manufacturing
1,41,7,00:00.0,12.45,1,adventure-works\bryan0,4.0,Production Technician - WC60,8/25/1982,S,M,1/19/2018,0,35,37,1,3,1/19/2018,,00:00.0,Production,Manufacturing
2,42,7,00:00.0,12.45,1,adventure-works\james0,4.0,Production Technician - WC60,7/23/1993,M,M,12/25/2017,0,39,39,1,3,12/25/2017,,00:00.0,Production,Manufacturing
3,43,7,00:00.0,12.45,1,adventure-works\nancy0,4.0,Production Technician - WC60,11/17/1997,M,F,12/31/2017,0,34,37,1,3,12/31/2017,,00:00.0,Production,Manufacturing
4,44,7,00:00.0,12.45,1,adventure-works\simon0,4.0,Production Technician - WC60,5/15/1999,S,M,12/6/2017,0,38,39,1,3,12/6/2017,,00:00.0,Production,Manufacturing


### Questions
#### Age when hired
Your organization, Adventure Works, wants to run a promotion to attract new talent. They would like to target a specific age group, and need to know how old each current employee was when they were hired. In this problem, you will convert the `BirthDate` and `HireDate` columns to datetimes and use subtraction to determine how many years old the average employee was when they were hired.

##### Question 1: Print the data types
Use the `.info()` method to print out the data types for each column in the dataframe.

<p style="font-size:.75rem">Expected output: Informational dataframe showing that BirthDate and HireDate have dtype of object.</p>

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         304 non-null    int64  
 1   DepartmentID       304 non-null    int64  
 2   RateChangeDate     304 non-null    object 
 3   Rate               304 non-null    float64
 4   PayFrequency       304 non-null    int64  
 5   LoginID            304 non-null    object 
 6   OrganizationLevel  303 non-null    float64
 7   JobTitle           304 non-null    object 
 8   BirthDate          304 non-null    object 
 9   MaritalStatus      304 non-null    object 
 10  Gender             304 non-null    object 
 11  HireDate           304 non-null    object 
 12  SalariedFlag       304 non-null    int64  
 13  VacationHours      304 non-null    int64  
 14  SickLeaveHours     304 non-null    int64  
 15  CurrentFlag        304 non-null    int64  
 16  ShiftID            304 non

##### Question 2: Data types for `BirthDate` and `HireDate`
Looking at the results of the `.info()` method above, what are the default data types for the `BirthDate` and `HireDate` columns? What data type would make the most sense for these columns?

<p style="font-size:.75rem">Expected output: None</p>

```
The `BirthDate` and `HireDate` columns have a default data type of `object`, which here means the values are stored as plain Python strings. The columns should be converted to data type `datetime`.
```
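
As a quick check of the answer above, the sketch below builds a tiny stand-in dataframe (the name `sample` and its two values are illustrative, copied from the preview rows) and inspects what is actually stored in the column. The exact dtype label printed can vary between Pandas versions, so the code looks at the Python type of one value rather than relying on the label.

```python
import pandas as pd

# Hypothetical stand-in for the BirthDate column (M/D/YYYY strings).
sample = pd.DataFrame({"BirthDate": ["1/13/1965", "8/25/1982"]})

# Each value is an ordinary Python string, not a date.
first_value_type = type(sample["BirthDate"].iloc[0])
print(sample["BirthDate"].dtype, first_value_type)

# Date accessors such as .dt only work once the column is a datetime.
converted = pd.to_datetime(sample["BirthDate"])
print(converted.dt.year.tolist())
```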

##### Question 3: Cast the `BirthDate` column to `datetime`
Using the Pandas function `to_datetime()`, cast the `BirthDate` column into data type `datetime` and print it out. Make sure that at the bottom of the Series that gets printed out you see `dtype: datetime64[ns]`.

You shouldn't need to pass in a Python format string to convert this column to a datetime.

<p style="font-size:.75rem">Expected output: Series of dtype datetime64[ns]</p>

In [6]:
pd.to_datetime(df['BirthDate'])

0     1965-01-13
1     1982-08-25
2     1993-07-23
3     1997-11-17
4     1999-05-15
         ...    
299   1994-05-02
300   1978-01-27
301   1985-01-03
302   1985-01-03
303   1985-01-03
Name: BirthDate, Length: 304, dtype: datetime64[ns]
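
The assignment does not require a format string, but when a date column is messier it can help to pass one. The sketch below is an optional illustration on made-up values (`raw` is not part of the assignment data): `format` states the expected month/day/year layout, and `errors="coerce"` turns anything unparseable into `NaT` instead of raising an error.

```python
import pandas as pd

# Hypothetical values: two valid M/D/YYYY dates and one bad entry.
raw = pd.Series(["1/13/1965", "8/25/1982", "not a date"])

# An explicit format documents the expected layout;
# errors="coerce" replaces unparseable entries with NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

print(parsed)
print("Unparseable rows:", parsed.isna().sum())
```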

##### Question 4: Save the `BirthDate` column
Using the code from above, save the newly converted `BirthDate` column back to the original dataframe as the column `BirthDate`.

<p style="font-size:.75rem">Expected output: None</p>

In [7]:
df['BirthDate'] = pd.to_datetime(df['BirthDate'])

##### Question 5: Print out the dataframe info
Using the `.info()` method, print out the dataframe information again. Make sure that the data type for the `BirthDate` column is now `datetime` (Pandas displays it as `datetime64[ns]`).

<p style="font-size:.75rem">Expected output: Informational dataframe now showing that BirthDate column has dtype of datetime64[ns]</p>

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   EmployeeID         304 non-null    int64         
 1   DepartmentID       304 non-null    int64         
 2   RateChangeDate     304 non-null    object        
 3   Rate               304 non-null    float64       
 4   PayFrequency       304 non-null    int64         
 5   LoginID            304 non-null    object        
 6   OrganizationLevel  303 non-null    float64       
 7   JobTitle           304 non-null    object        
 8   BirthDate          304 non-null    datetime64[ns]
 9   MaritalStatus      304 non-null    object        
 10  Gender             304 non-null    object        
 11  HireDate           304 non-null    object        
 12  SalariedFlag       304 non-null    int64         
 13  VacationHours      304 non-null    int64         
 14  SickLeaveH

##### Question 6: Save the `HireDate` column
Using code similar to the code that you wrote above, cast the `HireDate` column to data type `datetime` and save it back to the original dataframe as the column `HireDate`.

<p style="font-size:.75rem">Expected output: None</p>

In [9]:
df['HireDate'] = pd.to_datetime(df['HireDate'])

##### Question 7: Print out the dataframe info again
Using the `.info()` method, print out the dataframe information again. Make sure that the data type for both `BirthDate` and `HireDate` is now `datetime`.

<p style="font-size:.75rem">Expected output: Informational dataframe showing that BirthDate and HireDate columns have dtype of datetime64[ns].</p>

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   EmployeeID         304 non-null    int64         
 1   DepartmentID       304 non-null    int64         
 2   RateChangeDate     304 non-null    object        
 3   Rate               304 non-null    float64       
 4   PayFrequency       304 non-null    int64         
 5   LoginID            304 non-null    object        
 6   OrganizationLevel  303 non-null    float64       
 7   JobTitle           304 non-null    object        
 8   BirthDate          304 non-null    datetime64[ns]
 9   MaritalStatus      304 non-null    object        
 10  Gender             304 non-null    object        
 11  HireDate           304 non-null    datetime64[ns]
 12  SalariedFlag       304 non-null    int64         
 13  VacationHours      304 non-null    int64         
 14  SickLeaveH

##### Question 8: Subtract dates
Using the subtraction operator, create a Series of `timedelta` objects that shows how many days elapsed between each employee's birth date and hire date. Subtract birth date from hire date.

<p style="font-size:.75rem">Expected output: Series of dtype timedelta64[ns] whose first value is 18972 and last value is 12079 days.</p>

In [11]:
df['HireDate'] - df['BirthDate']

0     18972 days
1     12931 days
2      8921 days
3      7349 days
4      6780 days
         ...    
299    8699 days
300   14595 days
301   12079 days
302   12079 days
303   12079 days
Length: 304, dtype: timedelta64[ns]
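
To see the subtraction in isolation, here is a small sketch that rebuilds the first two rows by hand (the Series names `hire` and `birth` are illustrative). Subtracting one datetime Series from another is done row by row and gives a timedelta Series.

```python
import pandas as pd

# First two HireDate / BirthDate pairs from the preview above.
hire = pd.to_datetime(pd.Series(["12/23/2016", "1/19/2018"]), format="%m/%d/%Y")
birth = pd.to_datetime(pd.Series(["1/13/1965", "8/25/1982"]), format="%m/%d/%Y")

# Element-wise subtraction: each result is the time elapsed between the two dates.
age_at_hire = hire - birth

print(age_at_hire)
print(pd.api.types.is_timedelta64_dtype(age_at_hire))
```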

##### Question 9: What data type is returned?
After subtracting the `BirthDate` column from the `HireDate` column, what is the data type of the Series that is returned?

<p style="font-size:.75rem">Expected output: None</p>

```
The data type of the Series is `timedelta64[ns]`, the Pandas `timedelta` type.
```

##### Question 10: Save calculation as `DaysOldAtHireDate`
Using the calculation from above, create a new column called `DaysOldAtHireDate` that contains `timedelta` objects in each row.

<p style="font-size:.75rem">Expected output: None</p>

In [12]:
df['DaysOldAtHireDate'] = df['HireDate'] - df['BirthDate']

##### Question 11: Extract the days
Using the `.dt.days` property, extract the number of days between each employee's hire date and birth date using the new `DaysOldAtHireDate` column. Print out this Series of integers.

<p style="font-size:.75rem">Expected output: Series of dtype int64 whose first value is 18972 and last value is 12079.</p>

In [13]:
df['DaysOldAtHireDate'].dt.days

0      18972
1      12931
2       8921
3       7349
4       6780
       ...  
299     8699
300    14595
301    12079
302    12079
303    12079
Name: DaysOldAtHireDate, Length: 304, dtype: int64
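
One detail worth knowing about `.dt.days`: it returns only the whole-day component of each timedelta. That makes no difference here because both columns are plain dates, but it matters when times of day are involved. The sketch below uses two made-up timedeltas (`deltas`) to compare `.dt.days` with `.dt.total_seconds()`.

```python
import pandas as pd

# Hypothetical timedeltas: one whole number of days, one with an extra 12 hours.
deltas = pd.Series(pd.to_timedelta(["18972 days", "1 days 12:00:00"]))

# .dt.days keeps only the whole-day component.
whole_days = deltas.dt.days

# .dt.total_seconds() keeps the fractional part once divided by 86,400 seconds per day.
fractional_days = deltas.dt.total_seconds() / 86400

print(whole_days.tolist())
print(fractional_days.tolist())
```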

##### Question 12: Convert days to years
Using the code from above, convert the number of days old that each employee was when hired to years. You can do this by dividing the number of days by 365.25 (the average length of a year once leap years are taken into account), which approximates how old in years each employee was when hired.

<p style="font-size:.75rem">Expected output: Series of dtype float64 whose first value is 51.942505 and whose last value is 33.070500.</p>

In [14]:
df['DaysOldAtHireDate'].dt.days / 365.25

0      51.942505
1      35.403149
2      24.424367
3      20.120465
4      18.562628
         ...    
299    23.816564
300    39.958932
301    33.070500
302    33.070500
303    33.070500
Name: DaysOldAtHireDate, Length: 304, dtype: float64
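
Dividing by 365.25 is an approximation. For comparison only, the sketch below also counts completed birthdays using the calendar, which is how ages are usually quoted. It reuses the first two rows of the data; the variable names are illustrative and this is not required for the assignment.

```python
import pandas as pd

# First two BirthDate / HireDate pairs from the data above.
birth = pd.to_datetime(pd.Series(["1965-01-13", "1982-08-25"]))
hire = pd.to_datetime(pd.Series(["2016-12-23", "2018-01-19"]))

# Approximation used in the assignment: days divided by the average year length.
approx_years = (hire - birth).dt.days / 365.25

# Calendar-based alternative: difference in years, minus one
# if the birthday had not yet occurred in the hire year.
had_birthday = (hire.dt.month > birth.dt.month) | (
    (hire.dt.month == birth.dt.month) & (hire.dt.day >= birth.dt.day)
)
completed_years = hire.dt.year - birth.dt.year - (~had_birthday).astype(int)

print(approx_years.round(6).tolist())
print(completed_years.tolist())
```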

##### Question 13: Save new column `YearsOldAtHireDate`
Using the calculation from above, create a new column in the dataframe called `YearsOldAtHireDate` that contains float values reflecting how many years old each employee was when hired.

<p style="font-size:.75rem">Expected output: None</p>

In [15]:
df['YearsOldAtHireDate'] = df['DaysOldAtHireDate'].dt.days / 365.25

##### Question 14: Find the average age when hired
Use the `.mean()` method on the `YearsOldAtHireDate` column to determine how old the average employee was when they were hired.

<p style="font-size:.75rem">Expected output: 30.932985698332075</p>

In [16]:
df['YearsOldAtHireDate'].mean()

30.932985698332075

##### Question 15: How old was the average employee when they were hired?
Answer the question, "How old was the average employee when they were hired?" below.

<p style="font-size:.75rem">Expected output: None</p>

```
The average employee was 30.93 years old when they were hired.
```

#### Find shift leaders
There has recently been some disorganization among the employees working in the Production department. Employees whose `ShiftID` is 2 have been complaining for a while about problems going on during their shift with production equipment, and nobody seems to know who the manager of that shift is.

The database administrator provided you with a CSV file called `shifts.csv` that contains information about each shift.

##### Question 16: Import the shifts CSV
Using the `read_csv()` function from Pandas, import the file `shifts.csv` from the `data` directory into a Pandas dataframe called `shifts_df`.

<p style="font-size:.75rem">Expected output: None</p>

In [17]:
shifts_df = pd.read_csv("./data/shifts.csv")

##### Question 17: Print out the top 5 rows
Using the `.head()` method, print out the top 5 rows of the `shifts_df` dataframe.

<p style="font-size:.75rem">Expected output: 3 rows where the EmployeeShiftLeader column has values 39, 70, and 46.</p>

In [18]:
shifts_df.head()

Unnamed: 0,ShiftID,ShiftStartTime,ShiftEndTime,EmployeeShiftLeader
0,1,8:00:00 AM,4:00:00 PM,39
1,2,4:00:00 PM,12:00:00 AM,70
2,3,12:00:00 AM,8:00:00 AM,46


##### Question 18: Join the `shifts_df` dataframe to the `df` dataframe
Using the `.merge()` method, join the `shifts_df` dataframe to the `df` dataframe using the `ShiftID` column on both dataframes. Save this new dataframe with the joined data as `df`. Print out the first five rows of `df` using the `.head()` method when finished.

<p style="font-size:.75rem">Expected output: 5 rows whose EmployeeIDs are 40-44.</p>

In [19]:
df = df.merge(shifts_df, on="ShiftID")
df.head()

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department,DaysOldAtHireDate,YearsOldAtHireDate,ShiftStartTime,ShiftEndTime,EmployeeShiftLeader
0,40,7,00:00.0,25.0,2,adventure-works\jolynn0,3.0,Production Supervisor - WC60,1965-01-13,S,F,2016-12-23,0,82,61,1,3,12/23/2016,,00:00.0,Production,Manufacturing,18972 days,51.942505,12:00:00 AM,8:00:00 AM,46
1,41,7,00:00.0,12.45,1,adventure-works\bryan0,4.0,Production Technician - WC60,1982-08-25,S,M,2018-01-19,0,35,37,1,3,1/19/2018,,00:00.0,Production,Manufacturing,12931 days,35.403149,12:00:00 AM,8:00:00 AM,46
2,42,7,00:00.0,12.45,1,adventure-works\james0,4.0,Production Technician - WC60,1993-07-23,M,M,2017-12-25,0,39,39,1,3,12/25/2017,,00:00.0,Production,Manufacturing,8921 days,24.424367,12:00:00 AM,8:00:00 AM,46
3,43,7,00:00.0,12.45,1,adventure-works\nancy0,4.0,Production Technician - WC60,1997-11-17,M,F,2017-12-31,0,34,37,1,3,12/31/2017,,00:00.0,Production,Manufacturing,7349 days,20.120465,12:00:00 AM,8:00:00 AM,46
4,44,7,00:00.0,12.45,1,adventure-works\simon0,4.0,Production Technician - WC60,1999-05-15,S,M,2017-12-06,0,38,39,1,3,12/6/2017,,00:00.0,Production,Manufacturing,6780 days,18.562628,12:00:00 AM,8:00:00 AM,46
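
A common pitfall with this step is running the merge cell more than once: because the result is saved back to `df`, a second run merges the shift columns in again and Pandas renames the clashing columns with `_x` and `_y` suffixes. The sketch below shows this on two small made-up dataframes (`employees` and `shifts` are stand-ins, with values taken from the tables above). It also uses the optional `validate` argument, which raises an error if `ShiftID` is not unique in the right-hand table.

```python
import pandas as pd

# Hypothetical miniature versions of df and shifts_df.
employees = pd.DataFrame({"EmployeeID": [40, 70, 39], "ShiftID": [3, 2, 1]})
shifts = pd.DataFrame({"ShiftID": [1, 2, 3], "EmployeeShiftLeader": [39, 70, 46]})

# Many employees map to one shift; validate checks that assumption.
merged = employees.merge(shifts, on="ShiftID", how="left", validate="many_to_one")
print(merged.columns.tolist())

# Merging the same table a second time duplicates the non-key columns,
# which Pandas disambiguates with _x / _y suffixes.
merged_twice = merged.merge(shifts, on="ShiftID", how="left")
print(merged_twice.columns.tolist())
```

If you end up with suffixed columns, re-running the notebook from the top (so that `df` is read fresh from the CSV) is usually the simplest fix.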


##### Question 19: Get shift leaders
Knowing that the `EmployeeID` column of `df` matches up with the `EmployeeShiftLeader` column that was just joined on to `df`, create a filter that returns rows of the dataframe where `EmployeeID` is equal to `EmployeeShiftLeader`.

<p style="font-size:.75rem">Expected output: 3 rows for EmployeeID 46, 70, and 39.</p>

In [20]:
df.loc[df['EmployeeID'] == df['EmployeeShiftLeader']]

Unnamed: 0,EmployeeID,DepartmentID,RateChangeDate,Rate,PayFrequency,LoginID,OrganizationLevel,JobTitle,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours,CurrentFlag,ShiftID,StartDate,EndDate,ModifiedDate,DepartmentName,Sub-Department,DaysOldAtHireDate,YearsOldAtHireDate,ShiftStartTime,ShiftEndTime,EmployeeShiftLeader
6,46,7,00:00.0,12.45,1,adventure-works\eugene1,4.0,Production Technician - WC60,1985-02-07,S,M,2018-02-06,0,36,38,1,3,2/6/2018,,00:00.0,Production,Manufacturing,12052 days,32.996578,12:00:00 AM,8:00:00 AM,46
60,70,7,00:00.0,12.45,1,adventure-works\david2,4.0,Production Technician - WC60,1993-12-27,M,M,2017-12-13,0,33,36,1,2,12/13/2017,,00:00.0,Production,Manufacturing,8752 days,23.96167,4:00:00 PM,12:00:00 AM,70
187,39,7,00:00.0,12.45,1,adventure-works\ed0,4.0,Production Technician - WC60,1980-09-08,S,M,2019-02-03,0,25,32,1,1,2/3/2019,,00:00.0,Production,Manufacturing,14027 days,38.403833,8:00:00 AM,4:00:00 PM,39
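
The filter works because comparing two columns produces a boolean Series (a mask) with one `True`/`False` per row, and `.loc` keeps only the `True` rows. The sketch below repeats the idea on a four-row stand-in dataframe (`staff`, with values copied from the outputs above) and, as an optional extra, pulls the login name out of `LoginID` by splitting on the backslash.

```python
import pandas as pd

# Hypothetical four-row stand-in for the merged dataframe.
staff = pd.DataFrame({
    "EmployeeID": [40, 46, 70, 39],
    "LoginID": [
        r"adventure-works\jolynn0",
        r"adventure-works\eugene1",
        r"adventure-works\david2",
        r"adventure-works\ed0",
    ],
    "ShiftID": [3, 3, 2, 1],
    "EmployeeShiftLeader": [46, 46, 70, 39],
})

# Boolean mask: True where the employee is the leader of their own shift.
is_leader = staff["EmployeeID"] == staff["EmployeeShiftLeader"]

# Keep the matching rows and a few columns, ordered by shift.
leaders = staff.loc[is_leader, ["ShiftID", "EmployeeID", "LoginID"]].sort_values("ShiftID")

# Everything after the backslash in LoginID is the login name.
login_names = leaders["LoginID"].str.rsplit("\\", n=1).str[-1]

print(leaders)
print(login_names.tolist())
```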


##### Question 20: Info about the shift leaders
Answer the following question below:

What is the name of the shift leader for shift number 2? What department do they work in? You can look in the `LoginID` field for the first name.

<p style="font-size:.75rem">Expected output: None</p>

```
David is the leader of shift 2 and works in the Production department.
```

#### Export the data
##### Question 21: Export to CSV
Using the `.to_csv()` method, export the dataframe `df` as a CSV file called `pay_history_modified.csv`. Do not include the dataframe index in the CSV file (you can do this by passing in a parameter `index=False` to the method).

<p style="font-size:.75rem">Expected output: None</p>

In [21]:
df.to_csv('pay_history_modified.csv', index=False)
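
Keep in mind that a CSV file stores only text, so data types are not preserved: when the exported file is read back in, date columns become strings again unless you ask Pandas to parse them. The sketch below demonstrates the round trip with an in-memory buffer instead of a real file; the small `frame` dataframe is a made-up stand-in for `df`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for df with one datetime column.
frame = pd.DataFrame({
    "EmployeeID": [40, 41],
    "HireDate": pd.to_datetime(["2016-12-23", "2018-01-19"]),
})

# Write to an in-memory text buffer rather than a file on disk.
buffer = io.StringIO()
frame.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read it back; parse_dates restores the datetime dtype for HireDate.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["HireDate"])
print(restored.dtypes)
```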

### Submission

Submit this file as you normally would, by zipping this `.ipynb` file and the `data` folder into a zip file. Also include your `pay_history_modified.csv` file in the zip file, then submit the zip file to Canvas.