# Advanced Python for Data Analytics - Final Project (part 1)

In part 1 of this final project, you will use the skills you have learned in this course to clean a data set.

Specifically, you will perform the following in this project:

1. Use Jupyter Lab to organize notebook text and run Python code.
2. Use Pandas to import, clean, and export a data set.

## About the data
The data comes from the `InvoiceLines` table of the `Sales` schema from the `WideWorldImporters` database. The data contains information about invoice lines for different invoices for the company *Wide World Importers*. To clarify, each row contains information about a single invoice line, and a single invoice may contain multiple invoice lines. In other words, each invoice line (each row) contains information about the sale of a single product, even though a customer may have purchased several different products in a single invoice. Thus, each invoice **line item** (each row) belongs to a single invoice.

The table contains information about the quantity, unit price, and the profit for each **line item**. You will not need any other tables or data to complete this project.

## Instructions
This final project will be very similar to the homework assignments you completed in modules 2 and 3. Specifically, you will import and clean a data set. However, the instructions will be minimal, and they will not tell you which methods to use for each cleaning operation. You will not need any methods or functions beyond those covered in the course readings to complete the assignment, although you are free to use others if you would like.

The data set provided to you contains dirty data. Information about *why* the data set is dirty is listed below. Use your best judgement to clean the data in the way that you think best. Because there are multiple valid solutions to this project, you will be required to explain your purpose and process for each data cleaning step.

You will be graded based on how well you defend and show your data cleaning solutions.

Address each part of the dirty data by 1) showing the code that you used to clean the data and 2) explaining your reasoning behind each step. Order the cells so that the entire notebook can be run from top to bottom and everything runs as expected.

### Import packages
Use the cell below to import all of the packages you will need to complete this assignment. You can come back later and add code to this cell as needed.

### Import data set
Import the file `invoice_lines_dirty.csv` as a Pandas dataframe into a variable called `df`.

#### First five rows
Use appropriate dataframe methods to show the first five rows of the dataframe.

#### Describe the data set
Use appropriate dataframe methods to describe the data in the dataframe.

#### Dataframe information
Use appropriate dataframe methods to get information about the data types and non-null counts for each column in the data set.

#### Null value counts
Use appropriate dataframe methods to get information about the number of null values in each column of the dataframe.
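As a reminder of what these inspection methods return, here is a minimal sketch on a made-up dataframe. The name `toy` and all of its values are hypothetical; only the column names are borrowed from this project, and this is not the real data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real file; the values are made up.
toy = pd.DataFrame({
    "InvoiceLineID": [1, 2, 3, 4],
    "UnitPrice": [13.0, None, 32.0, 18.0],
    "LastEditedBy": [None, None, 4.0, None],
})

first_rows = toy.head()          # first five rows (only four exist here)
summary = toy.describe()         # count, mean, std, min, quartiles, max
toy.info()                       # prints dtypes and non-null counts
null_counts = toy.isna().sum()   # number of nulls in each column
print(null_counts)
```

Comparing the non-null counts from `.info()` with the null counts from `.isna().sum()` is a quick consistency check: for each column, the two should add up to the number of rows.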

### Instructions
Using your knowledge of cleaning data using Pandas, clean up the data set imported from the file `invoice_lines_dirty.csv`. By examining the dataframe using skills learned previously, you should be able to see the following problems with the data set:

1. The `LastEditedBy` column has 177,849 null values (78% of the column is null)
2. There are 79 rows with null values only in the `UnitPrice` column
3. There are 140 null values in the `StockItemID` and `Description` columns
4. The `PackageTypeID` column has a datatype of `object` but should be only integers
5. The `StockItemID` column has a datatype of `object` but should only contain integers
6. The minimum `TaxRate` is `-15`, which is impossible
7. The minimum `UnitPrice`, `TaxAmount`, and `ExtendedPrice` is 0, which should be impossible
8. The maximum value in the `ExtendedPrice` column appears to be extremely high

Use strategies that you learned about in this course to fix the data set as best as you can. There is not a single right way to carry this process out, but do your best to clean up the data set however you feel is best.

#### Data Cleaning
To answer the following questions, write code to clean the data in the way that you see appropriate and explain why you wrote that code.

---

##### Question 1: The `LastEditedBy` column has 177,849 null values (78% of the column is null)
As you can see, the `LastEditedBy` column has 177,849 null values (78% of the column is null). Write code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

```
Your answer here.
```

---
##### Question 2: There are 79 rows with null values only in the `UnitPrice` column
There are 79 rows with null values only in the `UnitPrice` column. You have three options: drop the `UnitPrice` column, drop the rows with null values, or impute the `UnitPrice` using the formula:

$$ UnitPrice=\frac{ExtendedPrice - TaxAmount}{Quantity} $$

You can choose to use imputation or drop the rows/columns associated with null UnitPrices. Write code to perform the cleaning that you think best fits this data set. Then, write a sentence or two that explains why you used this code. Show all of your work.
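If you are considering imputation, the formula above could be applied along these lines. This is only a sketch on hypothetical toy values (the frame `toy` and its numbers are invented for illustration), and it is one possible approach rather than the expected answer.

```python
import pandas as pd

# Hypothetical rows; the second row is missing its UnitPrice.
toy = pd.DataFrame({
    "Quantity": [10, 5, 2],
    "UnitPrice": [13.0, None, 32.0],
    "TaxAmount": [19.5, 13.5, 9.6],
    "ExtendedPrice": [149.5, 103.5, 73.6],
})

# UnitPrice = (ExtendedPrice - TaxAmount) / Quantity, computed for every row
imputed = (toy["ExtendedPrice"] - toy["TaxAmount"]) / toy["Quantity"]

# Fill in only the rows where UnitPrice is null
missing = toy["UnitPrice"].isna()
toy.loc[missing, "UnitPrice"] = imputed[missing]
print(toy)
```

Note that the formula only works when `Quantity`, `TaxAmount`, and `ExtendedPrice` are themselves present and valid for the row in question, which is worth checking before relying on it.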

```
Your answer here.
```

---
##### Question 3: There are 140 null values in the `StockItemID` and `Description` columns
There are 140 null values in the `StockItemID` and `Description` columns. Although the `UnitPrice` is known for these rows, different products can have the same `UnitPrice`, so it is probably impossible to accurately impute the `StockItemID` and `Description` for these rows.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned these columns and why you did it that way. Show all of your work.

```
Your answer here.
```

---
##### Question 4: The `PackageTypeID` column has a datatype of `object` but should be only integers
The `PackageTypeID` column has a datatype of `object` but should be only integers. Before making a choice as to what to do, look at what the non-integer values are and determine your strategy from there.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

```
Your answer here.
```

---
##### Question 5: The `StockItemID` column has a datatype of `object` but should only contain integers
The `StockItemID` column has a datatype of `object` but should only contain integers. Before making a choice of what to do, look at what the non-integer values are and determine your strategy from there. Hint: You can use `.value_counts()` to get a count of how many times each integer in the `StockItemID` occurs. If you see a value that is not an integer, look for it using a filter.
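The hint above can be sketched on a toy column as follows. The series `ids` and the stray value `"seven"` are hypothetical and are not taken from the real data; `pd.to_numeric(..., errors="coerce")` is used here as one convenient way to spot entries that cannot be read as numbers.

```python
import pandas as pd

# Hypothetical object-typed column with one non-integer entry
ids = pd.Series(["7", "7", "12", "seven", "12", "7"], name="StockItemID")

# Count how often each distinct value occurs
counts = ids.value_counts()
print(counts)

# Values that fail numeric conversion become NaN; use that as a filter
as_numbers = pd.to_numeric(ids, errors="coerce")
bad_values = ids[as_numbers.isna() & ids.notna()]
print(bad_values.unique())
```

Once you know what the offending values look like, you can decide whether to correct them, drop the rows, or handle them some other way.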

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

```
Your answer here.
```

---
##### Question 6: The minimum `TaxRate` is `-15`, which is impossible
The minimum `TaxRate` is `-15`, which is impossible. Use your best judgement to fix this.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

```
Your answer here.
```

---
##### Question 7: The minimum `UnitPrice`, `TaxAmount`, and `ExtendedPrice` is 0, which should be impossible
The minimum `UnitPrice`, `TaxAmount`, and `ExtendedPrice` is 0, which should be impossible. Use your best judgement to clean this up. Note that imputation will likely be impossible if `UnitPrice`, `TaxAmount`, and `ExtendedPrice` are all 0.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

```
Your answer here.
```

---
##### Question 8: The maximum value in the `ExtendedPrice` column appears to be extremely high
The maximum value in the `ExtendedPrice` column appears to be extremely high. You can optionally use the `is_outlier()` function below to check whether the value is an outlier and clean the data set accordingly. You can also look for other hints in the data that might explain this high value.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

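```python
import numpy as np

# First and third quartiles of ExtendedPrice
q1 = np.quantile(df['ExtendedPrice'], .25)
q3 = np.quantile(df['ExtendedPrice'], .75)
iqr = q3 - q1  # interquartile range

def is_outlier(row):
    # True if the value is more than 1.5 * IQR above the third quartile
    return row > q3 + (1.5 * iqr)
```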
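To see how `is_outlier()` behaves, here is a small self-contained sketch on a hypothetical toy column (the frame `toy` and its values are made up). It repeats the same quartile logic so that it can run on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with one suspiciously large value
toy = pd.DataFrame({"ExtendedPrice": [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 500.0]})

q1 = np.quantile(toy["ExtendedPrice"], 0.25)
q3 = np.quantile(toy["ExtendedPrice"], 0.75)
iqr = q3 - q1

def is_outlier(value):
    # True if the value is more than 1.5 * IQR above the third quartile
    return value > q3 + (1.5 * iqr)

# Apply the check to every value and show the flagged rows
flags = toy["ExtendedPrice"].apply(is_outlier)
print(toy[flags])
```

One caveat: `np.quantile` returns `nan` if the column still contains null values, so the quartiles are only meaningful once nulls in `ExtendedPrice` have been handled. The pandas method `Series.quantile` skips nulls, if you prefer that behavior.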

```
Your answer here.
```

---
### Results
Use the code cells below to check your work. There should be 0 null values left in the data set, data types should be correct (i.e., columns with `ID` in their names should be integers, and so forth), and the most extreme or impossible values should have been corrected or eliminated.

```python
df.info()
```

```python
df.describe()
```

```python
df.isna().sum()
```
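If you prefer automated checks over reading the output by eye, the same conditions could be expressed roughly as follows. This sketch runs on a hypothetical, already-clean toy frame; the helper names `total_nulls`, `id_columns`, and `ids_are_integers` are illustrative and not part of the assignment.

```python
import pandas as pd

# Hypothetical cleaned frame standing in for your final df
toy = pd.DataFrame({
    "StockItemID": [164, 77],
    "PackageTypeID": [7, 7],
    "UnitPrice": [13.0, 18.0],
})

# No null values should remain anywhere
total_nulls = int(toy.isna().sum().sum())

# Every column whose name ends in "ID" should have an integer dtype
id_columns = [col for col in toy.columns if col.endswith("ID")]
ids_are_integers = all(pd.api.types.is_integer_dtype(toy[col]) for col in id_columns)

print(total_nulls, ids_are_integers)
```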