# Advanced Pandas Assignment 3

In this assignment, you will practice cleaning "dirty" data.

### Note about assignments
You can add lines of code according to your preferences. As long as the code required by the assignment is found in this notebook under the corresponding question header (i.e., the answer to question 1 is underneath the title "Question 1"), you will receive credit for it.

## About the data
The data used in this assignment is a table built from the Human Resources schema of the Adventure Works 2019 database. This data contains information about each time that Employee Pay History was changed (each line is a pay rate change). It also contains information about the employee and the department they were working in when they received the pay rate listed.

This data is similar to data used in previous assignments but has been "dirtied" in order to make this assignment possible. 

The actual data is stored in a CSV file located inside the `data` folder. The file is called `pay_history.csv`.

## Instructions
### Set up
##### Import Pandas
Import the Pandas library into Jupyter Lab.

<p style="font-size:.75rem">Expected output: None</p>

##### Disable column display limit
Use the following code to disable the default limit for displaying columns. If you don't use this code, a data set with more than 20 columns will be truncated when displayed to take up less space.

```python
pd.options.display.max_columns = None
```

<p style="font-size:.75rem">Expected output: None</p>

##### Create the dataframe
Use the `read_csv()` function from Pandas to read the data from the `pay_history.csv` file into a dataframe called `df`.

<p style="font-size:.75rem">Expected output: None</p>
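As a minimal sketch of how `read_csv()` behaves, the example below reads a small in-memory CSV instead of the assignment file. The column names, values, and the `toy_df` variable are made up for illustration and are not taken from `pay_history.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk
csv_text = "JobTitle,Rate\nBuyer,18.5\nJanitor,\n"

# read_csv accepts a file path or a file-like object;
# empty fields are read as NaN (null) values
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

For a file on disk, the same function takes the path as a string instead of the `StringIO` object.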

##### Preview dataframe
Use the `.head()` method to print out the first 5 rows of the dataframe.

<p style="font-size:.75rem">Expected output: 5 rows, indexes 0-4</p>

### Questions
The data you were given to analyze looks perfectly normal at first glance. However, knowing that humans are imperfect at inputting data, you decide to do a quick glance over the data to make sure that nothing strange is happening. If there are mistakes in the data, you are determined to fix them in such a way that they don't affect the analysis.


#### Find Null Values
##### Question 1: Describe the data 
Before anything else, use the `.info()` method to check the count of non-null values in each column of the data set.

<p style="font-size:.75rem">Expected output: Informational dataframe with 304 entries and 22 columns, showing non-null count and dtype for each. Ex. SalariedFlag column has 38 non-null values</p>

##### Question 2: Count null values
Use the `.isna()` method in conjunction with the `.sum()` method to get a count of null values in each column.

<p style="font-size:.75rem">Expected output: Series of column names and counts of their null values. Rate has 2 null values and SalariedFlag has 266 null values.</p>
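To illustrate how `.isna()` and `.sum()` combine, here is a small sketch on a made-up dataframe (`toy_df` and its columns are hypothetical, not the assignment data). `.isna()` produces a True/False table, and `.sum()` counts the True values per column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few missing values
toy_df = pd.DataFrame({
    "Rate": [10.0, np.nan, 12.5],
    "Flag": [np.nan, np.nan, 1.0],
})

# True counts as 1 and False as 0, so the sum is the null count per column
null_counts = toy_df.isna().sum()
print(null_counts)
```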

##### Question 3: Which columns have null values?
Which columns have null values? What do you think should be done to these columns to prepare the data set for analysis (imputation or drop the column)?

Remember that imputation is usually only done when there are few null values, and dropping the column is necessary when there are many null values.

<p style="font-size:.75rem">Expected output: None</p>

----> Your answer here.

#### Rate - Imputation (part 1)
Because there are only two missing values in the `Rate` column, let's try to impute them. We can start to do this by discovering the average rate and then assigning it to both rows where `Rate` is null.

##### Question 4: Select the `Rate` column
Get the `Rate` column out of the dataframe.

<p style="font-size:.75rem">Expected output: Series of 304 Rates, where the first value is 125.5000 and the last is 23.0769</p>

##### Question 5: Get the average rate
Use the `.mean()` method to get the average rate and save it to a variable `average_rate`. Print it out to show what the average rate is.

<p style="font-size:.75rem">Expected output: 17.665210596026487</p>

##### Question 6: Get the rows where `Rate` is null
Use the `.isna()` method with the `.loc` property to create a filter that returns rows where `Rate` is null. Show these rows of the dataframe.

<p style="font-size:.75rem">Expected output: 2 rows with indexes 86 and 301</p>

##### Question 7: Get the indexes of the null rows
Use the `.index` property to get the indexes from the filtered dataframe and save them to a variable `null_rates`. Print out this variable as well.

<p style="font-size:.75rem">Expected output: Index containing 86 and 301 (displayed as Int64Index in pandas versions before 2.0)</p>

##### Question 8: Get `Rate` column of filtered dataframe
Using the filtered dataframe from above (Question 6), get the `Rate` column by adding `Rate` to the `.loc` property.

You should get two rows back in a Series.

<p style="font-size:.75rem">Expected output: Series with two NaN values that have indexes 86 and 301</p>

##### Question 9: Set null rows equal to overall average rate
Using the code from the previous question, set the `Rate` of the two rows with null rates equal to the `average_rate`.

<p style="font-size:.75rem">Expected output: None</p>
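The pattern used in Questions 6 through 9 (build a null filter, capture its indexes, then assign through `.loc`) can be sketched on toy data as follows. The dataframe and the names `toy_average`, `null_mask`, and `toy_null_index` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of the four rates are missing
toy_df = pd.DataFrame({"Rate": [10.0, np.nan, 20.0, np.nan]})

# .mean() skips NaN values by default
toy_average = toy_df["Rate"].mean()

# Boolean filter that is True where Rate is null
null_mask = toy_df["Rate"].isna()

# Indexes of the rows that match the filter
toy_null_index = toy_df.loc[null_mask].index

# Passing both the row filter and the column name to .loc
# assigns into the original dataframe
toy_df.loc[null_mask, "Rate"] = toy_average
print(toy_df)
```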

##### Question 10: Print out the number of nulls
Using the code from Question 2, print out the number of null values in each column.

<p style="font-size:.75rem">Expected output: Count of null values. Rate should equal 0 now</p>

#### Rate - Imputation (part 2)
##### Question 11: Is the overall average `Rate` the best?
Now we can see that the overall average `Rate` was successfully imputed into the null values for each `Rate`. However, is this `Rate` an accurate reflection of what the `Rate` likely actually is for both employees? What could be done to estimate the rate more accurately?

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

##### Question 12: Get `JobTitle` of each (previously) null row
Using the `.loc` property and the `null_rates` list of indexes that you obtained previously, print out the `JobTitle` column (a Series) for each of the rows which previously had a null value for `Rate`.

<p style="font-size:.75rem">Expected output: Series of JobTitle for indexes 86 and 301</p>

##### Question 13: Turn the Series into a list
Add the `.tolist()` method onto the previous line of code to place the two job titles into a list. Save this list of job titles into a variable `job_titles` and then print it out.

<p style="font-size:.75rem">Expected output: List of job titles. Should be ['Production Technician - WC40', 'Sales Representative']</p>

##### Question 14: Filter the dataframe by `job_titles`
Using the `.isin()` method and the `.loc` property, get back all rows of the dataframe that have a value in the `JobTitle` column that exists in the `job_titles` list.

<p style="font-size:.75rem">Expected output: 40 rows starting with indexes 82-89 and ending with indexes 289-303</p>
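As a rough sketch of `.isin()` filtering, the example below keeps only the rows whose `JobTitle` appears in a list. The data and the `wanted_titles` list are made up for illustration.

```python
import pandas as pd

# Hypothetical data
toy_df = pd.DataFrame({
    "JobTitle": ["Buyer", "Janitor", "Buyer", "Engineer"],
    "Rate": [18.0, 9.0, 19.0, 30.0],
})

wanted_titles = ["Buyer", "Engineer"]

# .isin() returns True for each row whose value is found in the list
filtered = toy_df.loc[toy_df["JobTitle"].isin(wanted_titles)]
print(filtered)
```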

##### Question 15: Observe the filtered dataframe above
Observe the filtered dataframe returned above. What do you notice about the average rate of pay for the different job titles? How are they different from the overall average rate?

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

##### Question 16: Group by `JobTitle` and find the mode
Using the `.groupby()` and `.agg()` methods, group the filtered dataframe from Question 14 by `JobTitle` and aggregate to find the most common value for `Rate`.

Note: Pandas doesn't have a built-in aggregate function to find the mode using `.agg()`. Thus, the code `.agg({'Rate': 'mode'})` won't work. Instead, you'll have to use the Pandas Series method, which will look like this: `.agg({'Rate': pd.Series.mode})`.

<p style="font-size:.75rem">Expected output: Aggregation table where the mode Rate for Production Technician - WC40 is 15.0000 and the mode Rate for Sales Representative is 23.0769</p>
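One thing to keep in mind is that a group can have more than one mode when values tie. The sketch below is a variation on the approach described in the note, not a replacement for it: it uses a lambda that keeps only the first (smallest) mode so the result is always a single number per group. The toy data and the tie-breaking choice are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical data: "Buyer" has 18.0 twice, "Janitor" has 9.0 twice
toy_df = pd.DataFrame({
    "JobTitle": ["Buyer", "Buyer", "Buyer", "Janitor", "Janitor"],
    "Rate": [18.0, 18.0, 19.0, 9.0, 9.0],
})

# Series.mode() returns the modes sorted ascending; .iloc[0] keeps the first
mode_by_title = toy_df.groupby("JobTitle").agg(
    {"Rate": lambda s: s.mode().iloc[0]}
)
print(mode_by_title)
```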

##### Question 17: Replace overall rates with average rate by job title
Using any way you would like, replace both of the (previously) null rates that were assigned the overall average rate with a more precise average rate based on their `JobTitle`. You can either do this (1) by using the `.loc` property and the indexes created before, or (2) by using the `.loc` property with a filter by average `Rate` and a specific `JobTitle`. You can also just type in the numbers, rather than try to extract them from the dataframe.

It will be easiest to do this on more than one line.

<p style="font-size:.75rem">Expected output: None</p>

##### Question 18: Print out number of nulls
Using the `.isna()` and the `.sum()` methods, show how many null values exist in the dataframe.

<p style="font-size:.75rem">Expected output: Count of null values. Rate should still equal 0.</p>

#### SalariedFlag - Drop Column
The columns `SalariedFlag` and `EndDate` seem to have many null values. In a way, it makes sense that there might be null values in the `EndDate` column, since the database designers may have designed the system so that it doesn't record any value until an employee actually stops working at Adventure Works. However, the `SalariedFlag` field is a little more suspicious.

Most likely, we will end up dropping the `SalariedFlag` column because there are so many null values. However, let's look at it first to make sure that we *can't* impute it (i.e., if we can see that all of the recorded values are the same, we might be able to justify imputation in specific situations).

##### Question 19: Print out counts of each unique value in `SalariedFlag`
Before anything else, we should check to see the counts of distinct values in the `SalariedFlag` column. Use the `.value_counts()` method to print this out.

<p style="font-size:.75rem">Expected output: Value counts of distinct values in the SalariedFlag column are displayed. 1.0 = 20 and 0.0 = 18.</p>

##### Question 20: What do you see when counting the distinct values?
Upon observing the count of distinct values in the `SalariedFlag` column, what do you notice? Is there any way we could possibly justify imputation?

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

##### Question 21: Drop the `SalariedFlag` column
Use the `.drop()` method to drop the `SalariedFlag` column. 

<p style="font-size:.75rem">Expected output: None</p>
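A short sketch of dropping a column, using made-up data. Note that `.drop()` returns a new dataframe by default, so the result is assigned back to the variable here; the names `toy_df` and `Flag` are hypothetical.

```python
import pandas as pd

# Hypothetical data with a mostly empty column
toy_df = pd.DataFrame({"Rate": [10.0, 12.0], "Flag": [None, 1.0]})

# The columns argument names the column(s) to remove
toy_df = toy_df.drop(columns=["Flag"])
print(toy_df.columns.tolist())
```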

##### Question 22: Print out the number of nulls
Use the `.isna()` and `.sum()` methods to show how many null values still exist in the dataframe. Make sure that the column `SalariedFlag` has been dropped successfully.

<p style="font-size:.75rem">Expected output: Count of nulls is shown. The SalariedFlag column no longer appears.</p>

#### Check inconsistent format
To make sure that each of our columns has data in a consistent format, we should check to make sure that each column has the data type that we expect. If all of the quantitative fields are either `int64` or `float64`, we can move on to the categorical fields and get a count of each unique value to make sure each row has an expected value.

##### Question 23: Check for data types of quantitative fields
Use the `.info()` method to check the data types of each column in the dataframe.

<p style="font-size:.75rem">Expected output: Informational dataframe with info about 21 columns is shown. CurrentFlag column has a type of object.</p>

##### Question 24: Does any column surprise you?
Looking at the information printed above and knowing a little about the data, do any of the column data types surprise you? Which column(s) surprise you and why? (yes, there should be at least one)

```
Your answer here.
```

##### Question 25: Check count of values of `CurrentFlag`
To be sure that there are issues with the `CurrentFlag` column, print out the count of unique values in the column using the `.value_counts()` method.

<p style="font-size:.75rem">Expected output: Value count of CurrentFlag column is returned. 1 = 299 and yes = 5.</p>

##### Question 26: What do you observe?
After counting up each unique value in the `CurrentFlag` column, what do you see?

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

##### Question 27: Get rows where `CurrentFlag` is `yes`
Filter the dataframe using `.loc` to show only rows where `CurrentFlag` is equal to `"yes"`.

<p style="font-size:.75rem">Expected output: 5 rows whose indexes are 73, 107, and 123-125.</p>

##### Question 28: Change each `"yes"` to `1`
It's probably safe to assume that, since all of the values in the `CurrentFlag` column are either `1` or `"yes"`, all `"yes"` values can be changed to 1. Use the `.replace()` method to replace all occurrences of "yes" with **the integer** `1`. Save it to the original dataframe.

<p style="font-size:.75rem">Expected output: None</p>

##### Question 29: Print out the value counts again
Use the `.value_counts()` method to print out the counts of unique values in the `CurrentFlag` again. Make sure that the only value that occurs now is `1`.

<p style="font-size:.75rem">Expected output: Value counts for the CurrentFlag column are displayed. The number 1 appears twice, once with a count of 299 and again with a count of 5.</p>

##### Question 30: What do you notice?
What do you notice about the results of `.value_counts()`? Why do you think this occurred?

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

#### Fix not matching data type
It looks like because the column `CurrentFlag` started out with strings inside of it, most of its values are still of data type `object`. Change it so that the `CurrentFlag` column has a data type of `int8`.

##### Question 31: Print out the data type of `CurrentFlag`
Using the `.dtype` property, print out the data type of the `CurrentFlag` column.

<p style="font-size:.75rem">Expected output: dtype('O')</p>

##### Question 32: Create `CurrentFlag` as data type `int8`
Using the `.astype()` method, print out a copy of the `CurrentFlag` column whose data type is `int8`.

<p style="font-size:.75rem">Expected output: Series of all 1s is printed with dtype = int8.</p>
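To see how a column of data type `object` can hold values that look the same but are stored differently, and how `.astype()` unifies them, consider this small sketch. The `flags` Series is made up for illustration and is not the assignment data.

```python
import pandas as pd

# Hypothetical object column mixing the string "1" and the integer 1
flags = pd.Series(["1", 1, "1"], dtype=object)

# The string and the integer are counted as two different values
counts_before = flags.value_counts()

# .astype() returns a converted copy; the original Series is unchanged
flags_int8 = flags.astype("int8")
counts_after = flags_int8.value_counts()

print(len(counts_before), len(counts_after))
```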

##### Question 33: Override the `CurrentFlag` column
Using the code from above, save the new `CurrentFlag` column of data type `int8` to the `CurrentFlag` column of the dataframe.

<p style="font-size:.75rem">Expected output: None</p>

##### Question 34: Print out the dataframe info
Use the `.info()` method to print out information about the data types of each column. Make sure that the `CurrentFlag` column now has a data type of `int8`.

<p style="font-size:.75rem">Expected output: Informational dataframe is returned that shows that the CurrentFlag column has a dtype of int8.</p>

#### Fix Extreme Values
You have already fixed some columns with null values and incorrect values. Now, check the data set for extreme/incorrect values.

##### Question 35: Describe the dataframe
Use the `.describe()` method on the dataframe to get some summary statistics about each quantitative column.

<p style="font-size:.75rem">Expected output: Descriptive dataframe is displayed.</p>

##### Question 36: What do you notice?
Observe the information created by the `.describe()` method above. What do you notice about the summary statistics that might indicate inaccurate data? You should see at least two columns with potential problems. Look closely at the min and maximum values for each column.

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

#### Fix Rate
##### Question 37: Get a filtered `Rate` column
Knowing that the `Rate` column has at least one negative number, create a filtered dataframe that contains rows where `Rate` is less than or equal to 0.

<p style="font-size:.75rem">Expected output: 1 row of index 286. Its Rate is -72.12.</p>

##### Question 38: What should be done?
Looking at the row(s) returned, what do you think should be done? Can the true `Rate` of the rows with negative values be imputed, or should they be dropped?

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

##### Question 39: Make the negative `Rate` positive
Using the code from above, print out the `Rate` column of rows whose `Rate` is less than or equal to 0. Multiply it by `-1` to make it positive.

<p style="font-size:.75rem">Expected output: Series with one value of index 286. The Rate for this row is now 72.12.</p>

##### Question 40: Set the negative rate to positive rate
Using the code from above, filter the dataframe and change the negative rates to positive rates.

<p style="font-size:.75rem">Expected output: None</p>
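The flip from negative to positive can be sketched on toy data like this. The same filter is used on both sides of the assignment so that only the matching rows are read and written; `toy_df` and `negative_mask` are hypothetical names.

```python
import pandas as pd

# Hypothetical data with one negative rate
toy_df = pd.DataFrame({"Rate": [10.0, -72.5, 25.0]})

# True where the rate is zero or negative
negative_mask = toy_df["Rate"] <= 0

# Multiply only the filtered values by -1 and write them back
toy_df.loc[negative_mask, "Rate"] = toy_df.loc[negative_mask, "Rate"] * -1
print(toy_df)
```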

##### Question 41: Print out the dataframe description again
Use the `.describe()` method to print out the dataframe description again. Make sure that the minimum rate is greater than 0.

<p style="font-size:.75rem">Expected output: The min Rate is now 6.5.</p>

#### Fix VacationHours
The maximum value in the `VacationHours` column is 290. That might be correct, but it also might be out of the ordinary. Let's do some work to try and fix it.

##### Question 42: Sort the dataframe by `VacationHours` descending
Use the `.sort_values()` method and the `by` and `ascending` arguments to print out the dataframe sorted by `VacationHours`, with the rows with the highest number of vacation hours on top. Add the `.head()` method after the `.sort_values()` method to get the top five rows.

<p style="font-size:.75rem">Expected output: 5 rows with indexes 292, 0, 91, 120, and 119.</p>
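A minimal sketch of a descending sort followed by `.head()`, using made-up values. Notice that the original index labels travel with the rows, which is why the expected output lists indexes out of order.

```python
import pandas as pd

# Hypothetical data
toy_df = pd.DataFrame({"VacationHours": [12, 290, 45, 7]})

# ascending=False puts the largest values first; head(2) keeps the top two
top_two = toy_df.sort_values(by="VacationHours", ascending=False).head(2)
print(top_two)
```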

#### Check if outlier
##### Question 43: Import numpy
Before anything else, let's test to see whether `290` is really an outlier for the `VacationHours` column. To begin, import the NumPy library.

<p style="font-size:.75rem">Expected output: None</p>

##### Question 44: Create function `isOutlier()`
Now create a function called `isOutlier()` that accepts one parameter `hours`. The function will be applied individually to each row of the `VacationHours` column and should perform the following:

1. Uses the `np.quantile()` function on the `VacationHours` column to get the first quartile (`0.25`) and save it to a variable `first_quartile`.
2. Uses the `np.quantile()` function on the `VacationHours` column to get the third quartile (`0.75`) and save it to a variable `third_quartile`.
3. Subtracts the `first_quartile` from the `third_quartile` and saves the results to a variable `iqr` (interquartile range).
4. Calculates the upper fence by performing `third_quartile + (1.5 * iqr)`, saving it to a variable `upper_fence`.
    - The upper fence is the upper limit for identifying outliers. Anything greater than the upper fence is considered an outlier.
5. If `hours` is greater than the `upper_fence`, return `True` (the hours for this row *is* an outlier). Otherwise, return `False`.

<p style="font-size:.75rem">Expected output: None</p>

```python
def isOutlier(hours):
    # Get the 25th percentile (first quartile)
    first_quartile = np.quantile(df['VacationHours'], 0.25)
    # Get the 75th percentile (third quartile)
    third_quartile = np.quantile(df['VacationHours'], 0.75)
    # Get the interquartile range
    iqr = third_quartile - first_quartile
    # Calculate the upper fence
    upper_fence = third_quartile + (1.5 * iqr)

    # True when the value lies above the upper fence (an outlier)
    return hours > upper_fence
```

##### Question 45: Apply function to the `VacationHours` column
Using the `.apply()` method, pass in the `isOutlier` function without parentheses to return a Series of True/False values. True indicates a row where `VacationHours` is an outlier.

<p style="font-size:.75rem">Expected output: Series of length 304 and dtype bool. All values will appear to be False.</p>
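Because `.apply()` calls the function once per value, the quartiles are recomputed for every row. As an aside, the same upper-fence rule can be sketched in vectorized form, computing the fence once and comparing the whole column to it. This is an alternative illustration on made-up data (`toy_hours` is hypothetical), not the approach the assignment asks for.

```python
import numpy as np
import pandas as pd

# Hypothetical vacation hours with one suspiciously large value
toy_hours = pd.Series([10, 20, 30, 40, 290])

first_quartile = np.quantile(toy_hours, 0.25)   # 20.0
third_quartile = np.quantile(toy_hours, 0.75)   # 40.0
iqr = third_quartile - first_quartile           # 20.0
upper_fence = third_quartile + 1.5 * iqr        # 70.0

# Comparing the Series to a single number yields a True/False Series
is_outlier = toy_hours > upper_fence
print(is_outlier.tolist())
```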

##### Question 46: Save the Series above to a new column called `VacationHoursOutlier`
Using the code from above, create a new column in the dataframe called `VacationHoursOutlier` that has either a True or False if the `VacationHours` is an outlier.

<p style="font-size:.75rem">Expected output: None</p>

##### Question 47: Sort the dataframe by `VacationHours` descending again
In the same way that you did previously, use the `.sort_values()` method with the `by` and `ascending` arguments to view the rows with the highest number of `VacationHours`.

<p style="font-size:.75rem">Expected output: Dataframe rows where index 292 is at the top.</p>

##### Question 48: Is `290` an outlier for `VacationHours`?
Looking at the dataframe returned above and the `VacationHoursOutlier` column, is `290` an outlier in the `VacationHours` column? If so, what could be done to fix this value?

<p style="font-size:.75rem">Expected output: None</p>

```
Your answer here.
```

##### Question 49: Get rows of employees with `JobTitle` Sales Representative
As seen above, the row with the highest `VacationHours` has a `JobTitle` of "Sales Representative". Use a filter to get back rows of employees whose `JobTitle` is also "Sales Representative".

<p style="font-size:.75rem">Expected output: Dataframe starting with index 288 and ending with index 303, including index 292.</p>

##### Question 50: Remove the row with outliers from the filter
Add to the filter that you created above so that the row with an outlier in the `VacationHours` field is not included in the filtered dataframe.

<p style="font-size:.75rem">Expected output: Dataframe starting with index 288 and ending with index 303, excluding index 292.</p>

##### Question 51: Get the average vacation hours among Sales Representatives
Using the filtered dataframe above, get the average `VacationHours` for Sales Representatives. Store this number in a variable called `sales_rep_average_vac_hours` and then print it out.

<p style="font-size:.75rem">Expected output: 31.153846153846153</p>
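Combining two conditions in one filter, as in Questions 49 through 51, can be sketched as follows. Each condition goes in its own variable here for readability, `&` means "and", and `~` negates a True/False Series. The data and variable names are assumptions made for this example.

```python
import pandas as pd

# Hypothetical data: one Sales Representative row is flagged as an outlier
toy_df = pd.DataFrame({
    "JobTitle": ["Sales Representative"] * 3 + ["Buyer"],
    "VacationHours": [30, 290, 32, 50],
    "VacationHoursOutlier": [False, True, False, False],
})

is_sales_rep = toy_df["JobTitle"] == "Sales Representative"
not_outlier = ~toy_df["VacationHoursOutlier"]

# Average over Sales Representatives, leaving out the flagged row
toy_average = toy_df.loc[is_sales_rep & not_outlier, "VacationHours"].mean()
print(toy_average)
```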

##### Question 52: Assign the average vacation hours to the employee with an outlier `VacationHours`
Using the `.loc` property, change the `VacationHours` of the employee with an outlier in the `VacationHours` column to be `sales_rep_average_vac_hours` instead.

<p style="font-size:.75rem">Expected output: None</p>

##### Question 53: Print out the dataframe description
Finally, print out the dataframe description using the `.describe()` method. Make sure the maximum value for `VacationHours` is no longer 290.

<p style="font-size:.75rem">Expected output: Dataframe description that shows max VacationHours as 99.000000.</p>

#### Final Thoughts
In case you were curious, the value of the row with an outlier in `VacationHours` was originally `29` instead of `290`. That means that the error was most likely caused by a human who accidentally entered an extra `0` on the end of the actual number. In any case, the average vacation hours among Sales Representatives turned out to be fairly close to this (`31.15`), meaning that imputation worked fairly well in this example.