# Pandas Assignment 1
In this assignment you will practice working with Pandas dataframes.

Specifically, you will carry out the following exercises:

1. Import data from a CSV file.
2. Describe and get information from dataframes.
3. Collect observations of the data.

### Note about assignments
You can add lines of code according to your preferences. As long as the code required by the assignment appears in this notebook under the corresponding question header (i.e., the answer to question 1 is underneath the title "Question 1"), you will receive credit for it.

## About the data
The data used in this assignment is a table built from the Human Resources schema of the Adventure Works 2019 database. This data contains information about each time that Employee Pay History was changed (each line is a pay rate change). It also contains information about the employee and the department they were working in when they received the pay rate listed.

The actual data is stored in a `.csv` file located inside the `data` directory. The file is called `pay_history.csv`.

## Instructions
##### Question 1: Import pandas
Import pandas so that you can use it in the notebook.

In [1]:
import pandas as pd

##### Question 2: Import the data
This data is stored in the `data` directory as a CSV file. Read the CSV file into a dataframe, and store it in a variable called `df`.

In [2]:
df = pd.read_csv("./data/pay_history.csv")

##### Question 3: View the first five rows
Use the `.head()` method to output the first five rows of the dataframe.

In [3]:
df.head()

| | EmployeeID | DepartmentID | RateChangeDate | Rate | PayFrequency | LoginID | OrganizationLevel | JobTitle | BirthDate | MaritalStatus | ... | SalariedFlag | VacationHours | SickLeaveHours | CurrentFlag | ShiftID | StartDate | EndDate | ModifiedDate | DepartmentName | Sub-Department |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 16 | 00:00.0 | 125.5 | 2 | adventure-works\ken0 | NaN | Chief Executive Officer | 1/29/1969 | S | ... | 1 | 99 | 69 | 1 | 1 | 1/14/2009 | NaN | 00:00.0 | Executive | Executive General and Administration |
| 1 | 2 | 1 | 00:00.0 | 63.4615 | 2 | adventure-works\terri0 | 1.0 | Vice President of Engineering | 8/1/1971 | S | ... | 1 | 1 | 20 | 1 | 1 | 1/31/2008 | NaN | 00:00.0 | Engineering | Research and Development |
| 2 | 3 | 1 | 00:00.0 | 43.2692 | 2 | adventure-works\roberto0 | 2.0 | Engineering Manager | 11/12/1974 | M | ... | 1 | 2 | 21 | 1 | 1 | 11/11/2007 | NaN | 00:00.0 | Engineering | Research and Development |
| 3 | 4 | 1 | 00:00.0 | 8.62 | 2 | adventure-works\rob0 | 3.0 | Senior Tool Designer | 12/23/1974 | S | ... | 0 | 48 | 80 | 1 | 1 | 12/5/2007 | 5/30/2010 | 00:00.0 | Engineering | Research and Development |
| 4 | 4 | 2 | 00:00.0 | 8.62 | 2 | adventure-works\rob0 | 3.0 | Senior Tool Designer | 12/23/1974 | S | ... | 0 | 48 | 80 | 1 | 1 | 5/31/2010 | NaN | 00:00.0 | Tool Design | Research and Development |


##### Question 4: View the last three rows
Use the `.tail()` method to output the last **three** rows of the dataframe.

In [4]:
df.tail(3)

| | EmployeeID | DepartmentID | RateChangeDate | Rate | PayFrequency | LoginID | OrganizationLevel | JobTitle | BirthDate | MaritalStatus | ... | SalariedFlag | VacationHours | SickLeaveHours | CurrentFlag | ShiftID | StartDate | EndDate | ModifiedDate | DepartmentName | Sub-Department |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 301 | 288 | 3 | 00:00.0 | 23.0769 | 2 | adventure-works\rachel0 | 3.0 | Sales Representative | 7/9/1975 | S | ... | 1 | 35 | 37 | 1 | 1 | 5/30/2013 | NaN | 00:00.0 | Sales | Sales and Marketing |
| 302 | 289 | 3 | 00:00.0 | 23.0769 | 2 | adventure-works\jae0 | 3.0 | Sales Representative | 3/17/1968 | M | ... | 1 | 37 | 38 | 1 | 1 | 5/30/2012 | NaN | 00:00.0 | Sales | Sales and Marketing |
| 303 | 290 | 3 | 00:00.0 | 23.0769 | 2 | adventure-works\ranjit0 | 3.0 | Sales Representative | 9/30/1975 | S | ... | 1 | 34 | 37 | 1 | 1 | 5/30/2012 | NaN | 00:00.0 | Sales | Sales and Marketing |
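
As a small illustration of how `.head()` and `.tail()` choose rows, here is a sketch on a hypothetical ten-row frame (the `toy` dataframe is made up for this example and is not the assignment data). With no argument both methods return five rows; passing a number changes that.

```python
import pandas as pd

# Hypothetical toy frame with ten rows (not the pay_history data).
toy = pd.DataFrame({"EmployeeID": range(1, 11)})

first_five = toy.head()    # n defaults to 5
last_three = toy.tail(3)   # explicit n, as in Question 4

print(first_five["EmployeeID"].tolist())
print(last_three["EmployeeID"].tolist())
```

Both calls return new dataframes and keep the original index labels, which is why the `.tail(3)` output above starts at index 301 rather than 0.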


##### Question 5: Describe the data set
Use the `.describe()` method to get some summary statistics about the numerical data in the data set.

In [5]:
df.describe()

| | EmployeeID | DepartmentID | Rate | PayFrequency | OrganizationLevel | SalariedFlag | VacationHours | SickLeaveHours | CurrentFlag | ShiftID |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 304.0 | 304.0 | 304.0 | 304.0 | 303.0 | 304.0 | 304.0 | 304.0 | 304.0 | 304.0 |
| mean | 146.825658 | 7.302632 | 18.148704 | 1.460526 | 3.465347 | 0.203947 | 49.963816 | 45.207237 | 1.0 | 1.546053 |
| std | 85.129793 | 2.903371 | 12.356331 | 0.499261 | 0.804347 | 0.403595 | 28.57164 | 14.704341 | 0.0 | 0.769371 |
| min | 1.0 | 1.0 | 6.5 | 1.0 | 1.0 | 0.0 | 0.0 | 20.0 | 1.0 | 1.0 |
| 25% | 72.75 | 7.0 | 11.0 | 1.0 | 3.0 | 0.0 | 26.75 | 33.0 | 1.0 | 1.0 |
| 50% | 148.5 | 7.0 | 14.0 | 1.0 | 4.0 | 0.0 | 49.0 | 45.0 | 1.0 | 1.0 |
| 75% | 224.0 | 7.0 | 23.0769 | 2.0 | 4.0 | 0.0 | 74.0 | 57.25 | 1.0 | 2.0 |
| max | 290.0 | 16.0 | 125.5 | 2.0 | 4.0 | 1.0 | 99.0 | 80.0 | 1.0 | 3.0 |


##### Question 6: Observe the data
Knowing that the "SalariedFlag" column is either 1 or 0 to indicate whether an employee is salaried (1) or not (0), can you estimate the percentage of employees that are salaried (hint: look at the mean)? Why isn't this a good estimator of the actual ratio of salaried to non-salaried employees? Hint: What does each row of this data set represent?

```
The mean of the SalariedFlag column is about 0.20, which seems to indicate that 20% of employees are salaried. However, each row of this data set is a change in employee pay, not an employee, so an employee with several pay rate changes is counted several times. The mean is therefore the share of pay rate changes that belong to salaried employees, not the share of employees who are salaried.
```
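
The difference between the two averages can be sketched on a hypothetical toy table (the `toy` dataframe and its values are invented for illustration). One possible way to get a per-employee figure is to keep a single row per `EmployeeID` before taking the mean; this assumes an employee's `SalariedFlag` does not change between rows, which the assignment data does not guarantee.

```python
import pandas as pd

# Hypothetical toy data: employees 4 and 5 have several pay-change rows.
toy = pd.DataFrame({
    "EmployeeID":   [1, 2, 4, 4, 5, 5, 5],
    "SalariedFlag": [1, 1, 0, 0, 0, 0, 0],
})

# Mean over rows: share of pay-change records that belong to salaried employees.
row_share = toy["SalariedFlag"].mean()

# Mean over employees: keep one row per EmployeeID first (assumes the flag
# is constant for each employee).
employee_share = toy.drop_duplicates(subset="EmployeeID")["SalariedFlag"].mean()

print(f"per row: {row_share:.3f}, per employee: {employee_share:.3f}")
# prints "per row: 0.286, per employee: 0.500"
```

In this toy table the non-salaried employees have more pay changes, so the row-level mean understates the share of salaried employees.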

##### Question 7: Describe the categorical data
Use the `.describe()` method to get summary statistics about the categorical data in the data set. Remember that you can pass in `include=[object]` to the `.describe()` method to get information about categorical data.

In [6]:
df.describe(include=[object])

| | RateChangeDate | LoginID | JobTitle | BirthDate | MaritalStatus | Gender | HireDate | StartDate | EndDate | ModifiedDate | DepartmentName | Sub-Department |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 304 | 7 | 304 | 304 | 304 |
| unique | 1 | 290 | 67 | 275 | 2 | 2 | 164 | 170 | 6 | 1 | 16 | 6 |
| top | 00:00.0 | adventure-works\sheela0 | Production Technician - WC60 | 2/10/1978 | S | M | 5/31/2011 | 5/31/2011 | 7/14/2012 | 00:00.0 | Production | Manufacturing |
| freq | 304 | 6 | 26 | 6 | 153 | 212 | 9 | 9 | 2 | 304 | 180 | 187 |
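
To make the four summary rows concrete, here is a minimal sketch on a hypothetical toy frame (the `toy` data is invented). The text columns are created with an explicit `object` dtype, an assumption made here so that `include=[object]` selects them regardless of how a given pandas version stores strings by default.

```python
import pandas as pd

# Hypothetical toy data; text columns are given an explicit object dtype.
toy = pd.DataFrame({
    "Gender": pd.Series(["M", "M", "F", "M"], dtype=object),
    "DepartmentName": pd.Series(
        ["Sales", "Production", "Production", "Production"], dtype=object
    ),
    "Rate": [23.0769, 11.0, 14.0, 11.0],
})

summary = toy.describe(include=[object])

# count = non-null values, unique = distinct values,
# top = most frequent value, freq = how often the top value occurs.
print(summary)
```

The numeric `Rate` column is left out of `summary` because only `object` columns were requested.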


##### Question 8: Get data information
Use the `.info()` method to get information about the data set.

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 304 entries, 0 to 303
Data columns (total 22 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   EmployeeID         304 non-null    int64  
 1   DepartmentID       304 non-null    int64  
 2   RateChangeDate     304 non-null    object 
 3   Rate               304 non-null    float64
 4   PayFrequency       304 non-null    int64  
 5   LoginID            304 non-null    object 
 6   OrganizationLevel  303 non-null    float64
 7   JobTitle           304 non-null    object 
 8   BirthDate          304 non-null    object 
 9   MaritalStatus      304 non-null    object 
 10  Gender             304 non-null    object 
 11  HireDate           304 non-null    object 
 12  SalariedFlag       304 non-null    int64  
 13  VacationHours      304 non-null    int64  
 14  SickLeaveHours     304 non-null    int64  
 15  CurrentFlag        304 non-null    int64  
 16  ShiftID            304 non

##### Question 9: Analyze null values
Which columns have null values? How many null values do they have?

```
There are 297 null values in the EndDate column and there is 1 null value in the OrganizationLevel column.
```
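
Null counts like these can also be computed directly instead of being read off the `.info()` output. The following is a sketch on a hypothetical three-row frame (the `toy` data is invented); the same two calls should work on `df`.

```python
import pandas as pd

# Hypothetical toy data with missing values in two columns.
toy = pd.DataFrame({
    "EndDate": ["5/30/2010", None, None],
    "OrganizationLevel": [None, 1.0, 2.0],
    "Rate": [8.62, 63.4615, 43.2692],
})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = toy.isnull().sum()

# Keep only the columns that actually contain nulls.
cols_with_nulls = null_counts[null_counts > 0]
print(cols_with_nulls)
```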

##### Question 10: How many rows and columns are there?
Print out your answer using the `print()` function, or create a Markdown cell and write it there. Show what you did to work this out. Hint: Use the `.shape` property.

In [9]:
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns.")

There are 304 rows and 22 columns.
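
For reference, `.shape` is a plain `(rows, columns)` tuple, so it can be unpacked into two names. A minimal sketch on a hypothetical toy frame (the `toy` data and the names `n_rows` and `n_cols` are made up for this example):

```python
import pandas as pd

# Hypothetical toy frame: three rows, two columns.
toy = pd.DataFrame({
    "EmployeeID": [1, 2, 3],
    "Rate": [125.5, 63.4615, 43.2692],
})

# shape is a (rows, columns) tuple; unpack it into two variables.
n_rows, n_cols = toy.shape
print(f"There are {n_rows} rows and {n_cols} columns.")

# len() on a dataframe gives the row count only.
assert len(toy) == n_rows
```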
