# Advanced Pandas Assignment 3

In this assignment, you will practice cleaning "dirty" data.

### Note about assignments
You can add lines of code according to your preferences. As long as the code required by the assignment is found in this notebook under the corresponding task header (i.e., the code and answers for task 1 are underneath the title "Task 1"), you will receive credit for it.

## About the data
The data used in this assignment is a table built from the Human Resources schema of the Adventure Works 2019 database. This data contains information about each time that Employee Pay History was changed (each line is a pay rate change). It also contains information about the employee and the department they were working in when they received the pay rate listed.

This data is similar to data used in previous assignments but has been "dirtied" in order to make this assignment possible. 

The actual data is stored in a CSV file located inside the `data` folder. The file is called `pay_history.csv`.

## Instructions
### Set up
##### Import Pandas
Import the Pandas library into Jupyter Lab.

<p style="font-size:.75rem">Expected output: None</p>

##### Disable column display limit
Use the following code to disable the default limit for displaying columns. If you don't use this code, a data set with more than 20 columns will be truncated when displayed to take up less space.

```python
pd.options.display.max_columns = None
```

<p style="font-size:.75rem">Expected output: None</p>

##### Create the dataframe
Read the data from the `pay_history.csv` file into a dataframe called `df`.

<p style="font-size:.75rem">Expected output: None</p>

##### Preview dataframe
Print out the first 5 rows of the dataframe.

<p style="font-size:.75rem">Expected output: 5 rows, indexes 0-4</p>

##### Describe dataframe
Get information about the numerical data in the dataframe.

##### Information about dataframe
Get information about the columns in the dataframe.
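
The set-up steps above can be sketched on a tiny, made-up table. This is only an illustration: the in-memory CSV text, the name `toy`, and its column values are assumptions for the example, not the contents of `pay_history.csv`.

```python
import io

import pandas as pd

# Hypothetical stand-in for a CSV file on disk (not the assignment data).
toy_csv = io.StringIO(
    "EmployeeID,JobTitle,Rate\n"
    "1,Buyer,18.27\n"
    "2,Buyer,\n"
    "3,Janitor,9.25\n"
)

# read_csv accepts a file path or a file-like object such as StringIO.
toy = pd.read_csv(toy_csv)

print(toy.head())      # first rows (up to 5)
print(toy.describe())  # summary statistics for the numeric columns
toy.info()             # column names, non-null counts, and dtypes
```

Note that `.info()` prints directly, while `.head()` and `.describe()` return dataframes that Jupyter displays when they are the last expression in a cell.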

### Questions
The data you were given to analyze looks perfectly normal at first glance. However, knowing that humans are imperfect at inputting data, you decide to do a quick glance over the data to make sure that nothing strange is happening. If there are mistakes in the data, you are determined to fix them in such a way that they don't affect the analysis.

---

#### Task 1: Find Null Values

Determine which columns have null values.

<p style="font-size:.75rem">Expected output: None</p>

List the columns that have null values below.

```
Your answer here.
```
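
One common way to count nulls per column is to combine `.isna()` with `.sum()`. The sketch below uses a small invented dataframe (`toy` and its values are hypothetical), so the result will not match the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values.
toy = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3],
        "Rate": [10.0, np.nan, 12.5],
        "JobTitle": ["Buyer", "Buyer", None],
    }
)

# isna() marks each missing cell as True; sum() counts the True values per column.
null_counts = toy.isna().sum()

# Keep only the columns that contain at least one null.
cols_with_nulls = null_counts[null_counts > 0].index.tolist()
print(cols_with_nulls)
```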

---
#### Task 2: Rate - Imputation (part 1)

Because there are only two missing values in the `Rate` column, let's try to impute them. Get the average rate of the data set and assign it to the missing `Rate` values in the dataframe. Then, show that there are no longer null values in the `Rate` column.

Note: Before modifying the null values, make sure to take note of their indexes (either on a piece of paper or by using Pandas). This will be important in the next question.

<p style="font-size:.75rem">Expected output: Something that shows that there are no longer null values in the `Rate` column</p>
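
As a rough sketch of mean imputation, the example below records the null indexes first and then fills the gaps with `.fillna()`. The dataframe `toy` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two missing rates.
toy = pd.DataFrame({"Rate": [10.0, np.nan, 20.0, np.nan]})

# Record which rows are null before overwriting them.
null_idx = toy.index[toy["Rate"].isna()].tolist()

# mean() skips NaN by default, so this is the average of the known rates.
toy["Rate"] = toy["Rate"].fillna(toy["Rate"].mean())

print(null_idx)
print(toy["Rate"].isna().sum())  # number of nulls remaining
```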

---
#### Task 3: Rate - Imputation (part 2)

Now we can see that the overall average `Rate` was successfully imputed into the rows where `Rate` was null. However, does this value accurately reflect what each employee's `Rate` is likely to be? **What could be done to estimate the rate more accurately?** Answer this question below.

```
Your answer here.
```

Change the rates that were previously null to a value that more accurately reflects what the `Rate` probably is. You can do this by finding the median `Rate` for each employee's `JobTitle`. Then, show that these rows (whose `Rate` was previously null) have a new `Rate` that more accurately reflects what other employees with the same `JobTitle` earn.

<p style="font-size:.75rem">Expected output: A dataframe or two that shows that the `Rate` of both employees whose `Rate` was previously null is now similar to the `Rate` of other employees with the same `JobTitle`</p>
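
One possible way to impute by group, shown on invented data: compute the median `Rate` per `JobTitle` from the rows that were never null, then map it onto the rows that were. The names `toy`, `previously_null`, and `median_by_title` are hypothetical, and the values are not from the assignment.

```python
import pandas as pd

# Hypothetical toy data in which rows 2 and 4 were already filled with the
# overall mean (14.5) by an earlier imputation step.
toy = pd.DataFrame(
    {
        "JobTitle": ["Buyer", "Buyer", "Buyer", "Janitor", "Janitor", "Janitor"],
        "Rate": [18.0, 20.0, 14.5, 9.0, 14.5, 11.0],
    }
)
previously_null = [2, 4]  # indexes noted before the first imputation

# Exclude the imputed rows so they do not influence the group medians.
known = toy.drop(index=previously_null)
median_by_title = known.groupby("JobTitle")["Rate"].median()

# Look up each affected row's JobTitle and assign that group's median.
toy.loc[previously_null, "Rate"] = toy.loc[previously_null, "JobTitle"].map(median_by_title)

print(toy.loc[previously_null])
```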

---
#### Task 4: SalariedFlag - Drop Column
The columns `SalariedFlag` and `EndDate` seem to have many null values. In a way, it makes sense that there might be null values in the `EndDate` column, since the database designers may have designed the system so that it doesn't record any value until an employee actually stops working at Adventure Works. However, the `SalariedFlag` field is a little more suspicious.

Most likely, we will end up dropping the `SalariedFlag` column because there are so many null values. However, let's look at it first to confirm that we *can't* impute it (i.e., if we can see that all of the recorded values are the same, we might be able to justify imputation in specific situations).

First, check the value counts of each unique value in the `SalariedFlag` column. What do you notice when counting the distinct values in this column? Is there any way to justify imputation?

```
Your answer here.
```

Next, drop the `SalariedFlag` column and show that the column has been dropped.
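
A minimal sketch of both steps on made-up data: `value_counts(dropna=False)` includes nulls in the tally, and `drop(columns=...)` removes a column. The dataframe `toy` and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where most flag values are missing.
toy = pd.DataFrame(
    {
        "SalariedFlag": [1.0, np.nan, np.nan, 0.0, np.nan],
        "Rate": [40.0, 12.0, 15.0, 9.5, 11.0],
    }
)

# dropna=False keeps NaN as its own category in the counts.
counts = toy["SalariedFlag"].value_counts(dropna=False)
print(counts)

# drop() returns a new dataframe, so reassign the result.
toy = toy.drop(columns="SalariedFlag")
print(toy.columns.tolist())
```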

---
#### Task 5: Check inconsistent format
To make sure that each of our columns has data in a consistent format, we should check to make sure that each column has the data type that we expect. If all of the quantitative fields are either `int64` or `float64`, you can move on to the categorical fields and get a count of each unique value to make sure each row has an expected value.

First, output information about the dataframe columns and their datatypes.

<p style="font-size:.75rem">Expected output: Informational dataframe</p>

Looking at this information and knowing some things about the data, do any of the columns' data types surprise you? Which column(s) surprise you and why? (yes, there should be at least one)

```
Your answer here.
```

To be sure that there are issues with the `CurrentFlag` column, print out the count of unique values in the column.

<p style="font-size:.75rem">Expected output: Value count of CurrentFlag column is returned. 1 = 299 and yes = 5.</p>

What values do you see in the `CurrentFlag` column?

```
Your answer here.
```

Replace all occurrences of `"yes"` in the `CurrentFlag` column with **the integer** `1`. Then, print out the counts of the unique values in the column again. You should see the number `1` listed twice with different unique counts.

<p style="font-size:.75rem">Expected output: Value count of CurrentFlag column is returned with 1 = 299 and 1 = 5</p>

Why do you think `1` was shown twice?

```
Your answer here.
```

---
#### Task 6: Fix mismatched data type
It looks like because the column `CurrentFlag` started out with strings inside of it, most of its values are still of data type `object`. Change the entire `CurrentFlag` column so that it has a data type of `int8`. Then, output the dataframe information again to show that the column has a data type of `int8`.

<p style="font-size:.75rem">Expected output: Informational dataframe</p>
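
As an illustration of casting a mixed `object` column, the sketch below uses `.astype("int8")` on a small invented Series. The name `flags` and its values are hypothetical; the cast only succeeds when every value can be interpreted as an integer.

```python
import pandas as pd

# Hypothetical object column holding a mix of strings and integers.
flags = pd.Series(["1", "1", 1, "0"], dtype=object, name="Flag")

# Inspect the Python type of each element to see the mix.
print(flags.map(type).value_counts())

# Cast every element to an 8-bit integer.
flags_int = flags.astype("int8")
print(flags_int.dtype)
```

`int8` stores values from -128 to 127, which is plenty for a 0/1 flag and uses less memory than `int64`.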

---
#### Task 7: Find Extreme Values
You have already fixed some columns with null values and incorrect values. Now, check the data set for extreme/incorrect values.

Print out a description of the numerical fields in the dataframe.

<p style="font-size:.75rem">Expected output: Descriptive dataframe is displayed.</p>

Observe the information described above. What do you notice about the summary statistics that might indicate inaccurate data? You should see at least two columns with potential problems. Look closely at the min and maximum values for each column.

```
Your answer here.
```

---
#### Task 8: Fix Rate
Knowing that the `Rate` column has at least one negative number, create a filtered dataframe that contains rows where `Rate` is less than or equal to 0.

<p style="font-size:.75rem">Expected output: 1 row of index 286. Its Rate is -72.12.</p>

Looking at the row(s) returned, what do you think should be done? Can the true `Rate` of the rows with negative values be imputed, or should they be dropped?

```
Your answer here.
```

Filter the dataframe and change the negative rates to positive rates.
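
One way to flip the sign of only the negative values is a boolean mask with `.loc`. This sketch uses invented numbers, and `toy` and `negative` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy data with one negative rate.
toy = pd.DataFrame({"Rate": [12.5, -20.0, 6.5]})

# Boolean mask selecting the rows with a negative rate.
negative = toy["Rate"] < 0

# Replace only the masked rows with their absolute value.
toy.loc[negative, "Rate"] = toy.loc[negative, "Rate"].abs()

print(toy)
```

Assigning through `.loc` modifies the original dataframe, whereas modifying a filtered copy would leave `toy` unchanged.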

##### Print out the dataframe description again
Use the `.describe()` method to print out the dataframe description again. Make sure that the minimum rate is greater than 1.

<p style="font-size:.75rem">Expected output: The min Rate is now 6.5.</p>

---
#### Task 9: Check VacationHours
The maximum value in the `VacationHours` column is 290. That might be correct, but it also might be out of the ordinary. Let's do some work to determine whether it is an outlier or not, and fix it accordingly.

First, print out the rows of the dataframe with the highest number of `VacationHours`.

<p style="font-size:.75rem">Expected output: Some dataframe rows sorted by `VacationHours` descending</p>

Does 290 seem to be an appropriate value in the `VacationHours` column? Why or why not?

```
Your answer here.
```

---
#### Task 10: Check if outlier

Before anything else, let's test to see whether `290` is really an outlier for the `VacationHours` column. To begin, import the NumPy library. Then, apply the function `isOutlier()` to the `VacationHours` column to determine which rows are outliers. Save this Series as a new column in the dataframe called `isOutlierVacationHours`. Finally, print out the dataframe **sorted by `VacationHours` descending** to show that this column has been added correctly.

<p style="font-size:.75rem">Expected output: Dataframe rows that show the new column `isOutlierVacationHours`, which contains True/False values (mostly False)</p>

```python
import numpy as np

def isOutlier(hours):
    # Get 25th percentile
    first_quartile = np.quantile(df['VacationHours'], 0.25)
    # Get 75th percentile
    third_quartile = np.quantile(df['VacationHours'], 0.75)
    # Get interquartile range
    iqr = third_quartile - first_quartile
    # Calculate upper fence
    upper_fence = third_quartile + (1.5 * iqr)

    # A value is an outlier if it lies above the upper fence
    return hours > upper_fence
```

According to the column `isOutlierVacationHours` that you just made, is `290` an outlier in the `VacationHours` column? If so, what do you think could be done to fix this value?

```
Your answer here.
```
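
For comparison, the same upper-fence rule can be computed once for a whole column instead of once per row. This is an alternative sketch on made-up numbers, not the method the assignment asks for; `hours` and `is_outlier` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy column with one suspiciously large value.
hours = pd.Series([10, 20, 30, 40, 50, 290], name="VacationHours")

# Quartiles and interquartile range, computed a single time.
q1 = hours.quantile(0.25)
q3 = hours.quantile(0.75)
iqr = q3 - q1

# Values above Q3 + 1.5 * IQR are flagged as outliers.
upper_fence = q3 + 1.5 * iqr
is_outlier = hours > upper_fence

print(upper_fence)
print(is_outlier.tolist())
```

Comparing the whole Series against the fence yields a boolean Series, which can be stored as a new dataframe column in the same way as the result of `.apply()`.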

---
#### Task 11: Fix `VacationHours` outlier
As seen above, the row with the highest `VacationHours` has a `JobTitle` of "Sales Representative". Use a filter to get the rows of employees whose `JobTitle` is also "Sales Representative". Then, find the median `VacationHours` for employees with that `JobTitle`. Assign the median `VacationHours` for "Sales Representative" to the employee with `VacationHours` of `290`. Finally, print out a numerical description of the dataframe again to show that the maximum `VacationHours` is now 99.

---
#### Final Thoughts
In case you were curious, the value of the row with an outlier in `VacationHours` was originally `29` instead of `290`. That means that the error was most likely caused by a human who accidentally entered in an extra `0` on the end of the actual number. In any case, the median vacation hours among Sales Representatives turned out to be fairly close to this (`33`). Without looking more closely at the spread of the data, it's difficult to say whether or not this approximation is "good" or "bad". There are many ways to improve this data prediction and get a number even closer to `29`, but when in doubt, it might be best to avoid skewing your data by simply dropping the rows with outliers (if you have enough rows of data).