# Advanced Python for Data Analytics - Final Project (part 1)

In part 1 of this final project, you will use the skills you have learned in this course to clean a data set.

Specifically, you will perform the following in this project:

1. Use Jupyter Lab to organize notebook text and run Python code.
2. Use Pandas to import, clean, and export a data set.

## About the data
The data comes from the `InvoiceLines` table of the `Sales` schema from the `WideWorldImporters` database. The data contains information about invoice lines for different invoices for the company *Wide World Importers*. To clarify, each row contains information about a single invoice line, and a single invoice may contain multiple invoice lines. In other words, each invoice line (each row) contains information about the sale of a single product, even though a customer may have purchased several different products in a single invoice. Thus, each invoice **line item** (each row) belongs to a single invoice.

The table contains information about the quantity, unit price, and the profit for each **line item**. You will not need any other tables or data to complete this project.

## Instructions
This final project will be very similar to the homework assignments you completed in modules 2 and 3. Specifically, you will import and clean a data set. However, the instructions given to you will be minimal, and the methods used to perform the necessary operations for cleaning the data set will not be provided in the instructions. You will not need any methods or functions that have not been covered in the course readings to complete the assignment, although you are free to use them if you would like.

The data set provided to you contains dirty data. Some of the reasons *why* the data set is dirty are listed below. Use your best judgement to clean the data in the way that you think best. Because there are multiple valid solutions to this project, you will be required to explain your purpose and process for each data cleaning step.

You will be graded based on how well you defend and show your data cleaning solutions.

Address each part of the dirty data by 1) showing the code that you used to clean the data and 2) explaining your reasoning behind each step. Each of the cells should be ordered in such a way that the entire file could be run all at once from top to bottom and everything will run as expected.

### Import packages
Use the cell below to import all of the packages you will need to complete this assignment. You can come back later and add code to this cell as needed.

In [90]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Import data set
Import the file `invoice_lines_dirty.csv` as a Pandas dataframe into a variable called `df`. Print out the first five lines of the dataframe.

In [91]:
df = pd.read_csv('./data/invoice_lines_dirty.csv')


#### First five rows
Use appropriate dataframe methods to show the first five rows of the dataframe.

In [92]:
df.head()

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedBy,LastEditedWhen,r
0,19221,5922,33,Developer joke mug - that's a hardware problem...,7,9,13.0,15,17.55,76.5,134.55,,00:00.0,6e-06
1,36171,11162,99,"""The Gu"" red shirt XML tag t-shirt (Black) 5XL",7,24,18.0,15,64.8,240.0,496.8,,00:00.0,1.4e-05
2,157470,48512,32,Developer joke mug - that's a hardware problem...,7,7,13.0,15,13.65,59.5,104.65,,00:00.0,1.4e-05
3,130857,40302,211,Small 9mm replacement blades 9mm,7,40,4.1,15,24.6,60.0,188.6,,00:00.0,2.3e-05
4,114701,35324,200,Black and yellow heavy despatch tape 48mmx100m,7,48,4.1,15,29.52,81.6,226.32,,00:00.0,3.2e-05


#### Describe the data set
Use appropriate dataframe methods to describe the data in the dataframe.

In [93]:
df.describe()

Unnamed: 0,InvoiceLineID,InvoiceID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedBy,r
count,228265.0,228265.0,228265.0,228186.0,228265.0,228265.0,228265.0,228265.0,50416.0,228265.0
mean,114133.0,35179.209686,39.211566,45.594051,14.977176,112.941264,375.568663,868.41181,10.798239,0.500807
std,65894.573936,20337.208453,54.558829,139.882467,0.341887,217.497895,754.052045,1728.386504,5.506725,0.288629
min,1.0,1.0,1.0,0.0,-15.0,0.0,-645.0,0.0,2.0,6e-06
25%,57067.0,17572.0,5.0,13.0,15.0,14.4,51.0,110.4,6.0,0.251084
50%,114133.0,35152.0,10.0,18.0,15.0,34.5,120.0,264.5,11.0,0.500565
75%,171199.0,52765.0,60.0,32.0,15.0,129.6,390.0,993.6,16.0,0.751201
max,228265.0,70510.0,360.0,1899.0,15.0,2848.5,9200.0,218385.0,20.0,0.999987


#### Dataframe information
Use appropriate dataframe methods to get information about the data types and non-null counts for each column in the data set.

In [94]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228265 entries, 0 to 228264
Data columns (total 14 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   InvoiceLineID   228265 non-null  int64  
 1   InvoiceID       228265 non-null  int64  
 2   StockItemID     228125 non-null  object 
 3   Description     228125 non-null  object 
 4   PackageTypeID   228265 non-null  object 
 5   Quantity        228265 non-null  int64  
 6   UnitPrice       228186 non-null  float64
 7   TaxRate         228265 non-null  int64  
 8   TaxAmount       228265 non-null  float64
 9   LineProfit      228265 non-null  float64
 10  ExtendedPrice   228265 non-null  float64
 11  LastEditedBy    50416 non-null   float64
 12  LastEditedWhen  228265 non-null  object 
 13  r               228265 non-null  float64
dtypes: float64(6), int64(4), object(4)
memory usage: 24.4+ MB


#### Null value counts
Use appropriate dataframe methods to get information about the number of null values in each column of the dataframe.

In [95]:
df.isna().sum()

InvoiceLineID          0
InvoiceID              0
StockItemID          140
Description          140
PackageTypeID          0
Quantity               0
UnitPrice             79
TaxRate                0
TaxAmount              0
LineProfit             0
ExtendedPrice          0
LastEditedBy      177849
LastEditedWhen         0
r                      0
dtype: int64

### Instructions
Using your knowledge of cleaning data using Pandas, clean up the data set imported from the file `invoice_lines_dirty.csv`. By examining the dataframe using skills learned previously, you should be able to see the following problems with the data set:

1. The `LastEditedBy` column has 177,849 null values (78% of the column is null)
2. There are 79 rows with null values only in the `UnitPrice` column
3. There are 140 null values in the `StockItemID` and `Description` columns
4. The `PackageTypeID` column has a datatype of `object` but should be only integers
5. The `StockItemID` column has a datatype of `object` but should only contain integers
6. The minimum `TaxRate` is `-15`, which is impossible
7. The minimum values of `UnitPrice`, `TaxAmount`, and `ExtendedPrice` are 0, which should be impossible
8. The maximum value in the `ExtendedPrice` column appears to be extremely high

Use strategies that you learned about in this course to fix the data set as best as you can. There is not a single right way to carry this process out, but do your best to clean up the data set however you feel is best.

#### Data Cleaning
To answer the following questions, write code to clean the data in the way that you see appropriate and explain why you wrote that code.

---

##### Question 1: The `LastEditedBy` column has 177,849 null values (78% of the column is null)
As you can see, the `LastEditedBy` column has 177,849 null values (78% of the column is null). Write code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

In [96]:
df.drop(columns='LastEditedBy', inplace=True)

```
I dropped the `LastEditedBy` column because only about 22% of its rows have a value, and the rows with nulls are too many to drop. I also couldn't have imputed the values for `LastEditedBy`.
```
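
The reasoning above can be checked numerically. The sketch below is only an illustration on a made-up frame: the column names are borrowed from the project, while the values and the `MAX_NULL_SHARE` cutoff are assumptions, not part of the assignment. It computes the share of nulls per column and drops any column above the cutoff.

```python
import pandas as pd

# Made-up values; only the column names come from the project data.
toy = pd.DataFrame({
    "InvoiceLineID": [1, 2, 3, 4, 5],
    "Quantity": [9, 24, 7, 40, 48],
    "LastEditedBy": [None, None, 11.0, None, None],
})

MAX_NULL_SHARE = 0.5  # hypothetical cutoff, chosen only for this example

# isna() yields booleans, and the mean of booleans is the fraction that are True.
null_share = toy.isna().mean()
too_sparse = null_share[null_share > MAX_NULL_SHARE].index.tolist()

# Return a new frame instead of modifying the original in place.
cleaned = toy.drop(columns=too_sparse)

# The figure quoted in the question: 177,849 nulls out of 228,265 rows.
reported_share = round(177849 / 228265, 3)
print(too_sparse, reported_share)
```

On the real data, the same `isna().mean()` call would show `LastEditedBy` at roughly 0.78 and every other column close to 0.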

---
##### Question 2: There are 79 rows with null values only in the `UnitPrice` column
There are 79 rows with null values only in the `UnitPrice` column. You have three options: drop the `UnitPrice` column, drop the rows with null values, or impute the `UnitPrice` using the formula:

$$ UnitPrice=\frac{ExtendedPrice - TaxAmount}{Quantity} $$

You can choose to use imputation or drop the rows/columns associated with null UnitPrices. Write code to perform the cleaning that you think best fits this data set. Then, write a sentence or two that explains why you used this code. Show all of your work.

In [97]:
# Rows where UnitPrice is missing
filt = df['UnitPrice'].isna()
# Impute from the given formula: (ExtendedPrice - TaxAmount) / Quantity
df.loc[filt, 'UnitPrice'] = (df.loc[filt, 'ExtendedPrice'] - df.loc[filt, 'TaxAmount']) / df.loc[filt, 'Quantity']

```
I chose to impute the `UnitPrice` values using the given formula because I wanted to preserve as many rows as possible.
```
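
Before trusting the imputation, it can help to confirm that the formula reproduces prices that are already known. The sketch below uses the first three rows from the `df.head()` output; the third `UnitPrice` is blanked out on purpose for illustration, and the names `toy`, `implied`, and `max_error` are hypothetical.

```python
import pandas as pd

# First three rows of the df.head() output; the last UnitPrice is
# deliberately set to None to simulate a missing value.
toy = pd.DataFrame({
    "Quantity": [9, 24, 7],
    "UnitPrice": [13.0, 18.0, None],
    "TaxAmount": [17.55, 64.8, 13.65],
    "ExtendedPrice": [134.55, 496.8, 104.65],
})

# UnitPrice implied by the formula given in the question.
implied = (toy["ExtendedPrice"] - toy["TaxAmount"]) / toy["Quantity"]

# Compare against the rows where UnitPrice is known.
known = toy["UnitPrice"].notna()
max_error = (implied[known] - toy.loc[known, "UnitPrice"]).abs().max()

# fillna aligns on the index, so only the missing rows are filled.
toy["UnitPrice"] = toy["UnitPrice"].fillna(implied)
print(max_error, toy["UnitPrice"].round(2).tolist())
```

A `max_error` near zero suggests the formula is consistent with the existing data, which supports imputing rather than dropping the rows.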

---
##### Question 3: There are 140 null values in the `StockItemID` and `Description` columns
There are 140 null values in the `StockItemID` and `Description` columns. Although the `UnitPrice` is known for these rows, different products can have the same `UnitPrice` so it will probably be impossible to accurately impute the `StockItemID` and `Description` for these rows.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned these columns and why you did it that way. Show all of your work.

In [98]:
filt = df['StockItemID'].isna() & df['Description'].isna()
df.drop(index=df.loc[filt].index, inplace=True)

```
I chose to drop the rows with null `StockItemID` and null `Description`. There's no way to tell what they could have been, and even though the rest of the columns have data, I believe that the lack of product data may cloud the analysis.
```
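
As an alternative to building the filter by hand, `dropna` with a `subset` could express the same idea. This is a sketch on made-up rows, not the method used above; it shows how `how="all"` matches the `isna() & isna()` filter, while `how="any"` is stricter.

```python
import pandas as pd

# Made-up rows; only the column names come from the project data.
toy = pd.DataFrame({
    "InvoiceLineID": [1, 2, 3, 4],
    "StockItemID": ["33", None, "99", None],
    "Description": ["mug", None, "t-shirt", "tape"],
})

cols = ["StockItemID", "Description"]

# how="all": drop a row only when BOTH columns are null.
both_null_dropped = toy.dropna(subset=cols, how="all")

# how="any": drop a row when EITHER column is null.
either_null_dropped = toy.dropna(subset=cols, how="any")

print(both_null_dropped["InvoiceLineID"].tolist())
print(either_null_dropped["InvoiceLineID"].tolist())
```

In the project data the two columns are reported as null in the same 140 rows, so both variants would be expected to remove the same rows there.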

---
##### Question 4: The `PackageTypeID` column has a datatype of `object` but should be only integers
The `PackageTypeID` column has a datatype of `object` but should be only integers. Before making a choice as to what to do, look at what the non-integer values are and determine your strategy from there.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

In [99]:
df['PackageTypeID'].value_counts()

7        217651
9          5206
10         4209
1          1035
seven        24
Name: PackageTypeID, dtype: int64

In [100]:
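# Assign the result back; an in-place replace on a selected column may not update df in newer pandas versions
df['PackageTypeID'] = df['PackageTypeID'].replace({'seven': 7})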

In [101]:
df['PackageTypeID'] = df['PackageTypeID'].astype('int32')

In [102]:
df['PackageTypeID'].value_counts()

7     217675
9       5206
10      4209
1       1035
Name: PackageTypeID, dtype: int64

```
This code checked the different values in the column `PackageTypeID` and found that both the number 7 and the word 'seven' were present. The `.replace()` method was used to change 'seven' to 7 and then the data type of the column was set to `int32`. This preserved more data without dropping rows that had 'seven'.
```
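
When a column has many distinct values, scanning `value_counts()` by eye can miss a stray string. One possible way to list the offenders directly is `pd.to_numeric` with `errors="coerce"`, sketched below on a made-up series that assumes the values were read as strings.

```python
import pandas as pd

# Made-up sample; assumes the column was read from the CSV as strings.
package_type = pd.Series(["7", "9", "seven", "10", "7", "1"], name="PackageTypeID")

# Values that cannot be parsed as numbers become NaN.
as_number = pd.to_numeric(package_type, errors="coerce")

# Keep entries that failed to parse but were not null to begin with.
offenders = package_type[as_number.isna() & package_type.notna()].unique().tolist()

# Fix the one known offender, then convert the whole column.
cleaned = package_type.replace({"seven": "7"}).astype("int32")
print(offenders, cleaned.tolist())
```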

---
##### Question 5: The `StockItemID` column has a datatype of `object` but should only contain integers
The `StockItemID` column has a datatype of `object` but should only contain integers. Before making a choice of what to do, look at what the non-integer values are and determine your strategy from there. Hint: You can use `.value_counts()` to get a count of how many times each integer in the `StockItemID` occurs. If you see a value that is not an integer, look for it using a filter.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

In [103]:
df['StockItemID'].value_counts()

65.0                            820
88.0                            819
13.0                            812
176.0                           802
120.0                           795
                               ... 
221                              36
220                              35
222                              34
224                              32
USB missile launcher (Green)      1
Name: StockItemID, Length: 455, dtype: int64

In [104]:
filt = df['StockItemID'] == 'USB missile launcher (Green)'
df.loc[filt]

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
23797,60222,18545,USB missile launcher (Green),1,7,10,25.0,15,37.5,155.0,287.5,00:00.0,0.105244


In [105]:
df.loc[df['Description'] == 'USB missile launcher (Green)'].head()

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
119,51327,15820,1,USB missile launcher (Green),7,5,25.0,15,18.75,77.5,143.75,00:00.0,0.000581
352,80464,24741,1,USB missile launcher (Green),7,8,25.0,15,30.0,124.0,230.0,00:00.0,0.001657
940,74185,22811,1,USB missile launcher (Green),7,4,25.0,15,15.0,62.0,115.0,00:00.0,0.004399
1206,8283,2549,1,USB missile launcher (Green),7,4,25.0,15,15.0,62.0,115.0,00:00.0,0.005571
1214,60208,18542,1,USB missile launcher (Green),7,2,25.0,15,7.5,31.0,57.5,00:00.0,0.00559


In [106]:
descriptions = df.loc[filt, 'StockItemID']
stock_items = df.loc[filt, 'Description']
df.loc[filt, 'StockItemID'] = stock_items
df.loc[filt, 'Description'] = descriptions

In [107]:
df['StockItemID'].value_counts()

65.0     820
88.0     819
13.0     812
176.0    802
120.0    795
        ... 
223       37
221       36
220       35
222       34
224       32
Name: StockItemID, Length: 454, dtype: int64

In [108]:
df['StockItemID'] = df['StockItemID'].astype('int16')

```
It looks like the `Description` and the `StockItemID` got switched around in one row. I used indexing to switch the values in the `StockItemID` and `Description` columns for that row.
```
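
The swap can also be written as a single assignment. The sketch below is an illustration on made-up rows and assumes that null `StockItemID` rows have already been removed, since they would also fail the numeric check. It locates rows whose `StockItemID` is not numeric and exchanges the two columns for those rows.

```python
import pandas as pd

# Made-up rows; the middle row has the two values switched.
toy = pd.DataFrame({
    "StockItemID": ["1", "USB missile launcher (Green)", "33"],
    "Description": ["USB missile launcher (Green)", "1", "Developer joke mug"],
})

# Rows whose StockItemID cannot be parsed as a number.
swapped = pd.to_numeric(toy["StockItemID"], errors="coerce").isna()

# to_numpy() drops the column labels, so the values are assigned by
# position instead of being realigned to their original columns.
toy.loc[swapped, ["StockItemID", "Description"]] = (
    toy.loc[swapped, ["Description", "StockItemID"]].to_numpy()
)

toy["StockItemID"] = toy["StockItemID"].astype("int16")
print(toy)
```

Without `to_numpy()`, pandas would align the right-hand side by column name and the assignment would leave the values where they were.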

---
##### Question 6: The minimum `TaxRate` is `-15`, which is impossible
The minimum `TaxRate` is `-15`, which is impossible. Use your best judgement to fix this.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

In [109]:
df['TaxRate'].describe()

count    228125.000000
mean         14.977162
std           0.341992
min         -15.000000
25%          15.000000
50%          15.000000
75%          15.000000
max          15.000000
Name: TaxRate, dtype: float64

In [110]:
filt = df['TaxRate'] < 0
df.loc[filt]

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
80052,166240,51231,133,Furry gorilla with big eyes slippers (Black) XL,7,10,32.0,-15,48.0,235.0,368.0,00:00.0,0.352486


In [111]:
df.loc[filt, 'TaxRate'] = df.loc[filt, 'TaxRate'] * -1 

In [112]:
df['TaxRate'].describe()

count    228125.000000
mean         14.977293
std           0.336183
min          10.000000
25%          15.000000
50%          15.000000
75%          15.000000
max          15.000000
Name: TaxRate, dtype: float64

```
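I selected the rows where `TaxRate` was less than 0 (only one row), multiplied the `TaxRate` by -1, and assigned the result back to the `TaxRate` column. The magnitude, 15, matches the rate in most other rows, so the negative sign looks like a data entry error.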
```
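
One way to back up the sign flip is to recompute the rate from the other columns. The sketch below assumes that `TaxAmount = Quantity * UnitPrice * TaxRate / 100`; that relationship is not stated in the assignment, although it is consistent with the rows displayed in this notebook. The first row is the negative `TaxRate` row shown above and the second is the first row of `df.head()`.

```python
import pandas as pd

toy = pd.DataFrame({
    "Quantity": [10, 9],
    "UnitPrice": [32.0, 13.0],
    "TaxRate": [-15, 15],
    "TaxAmount": [48.0, 17.55],
})

# Assumed relationship (not given in the assignment):
# TaxAmount = Quantity * UnitPrice * TaxRate / 100
implied_rate = toy["TaxAmount"] / (toy["Quantity"] * toy["UnitPrice"]) * 100

# abs() flips negative rates and leaves positive ones untouched.
toy["TaxRate"] = toy["TaxRate"].abs()

# After the fix, the stored rate should agree with the implied one.
matches = (implied_rate - toy["TaxRate"]).abs() < 1e-6
print(implied_rate.round(2).tolist(), matches.tolist())
```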

---
##### Question 7: The minimum values of `UnitPrice`, `TaxAmount`, and `ExtendedPrice` are 0, which should be impossible
The minimum values of `UnitPrice`, `TaxAmount`, and `ExtendedPrice` are 0, which should be impossible. Use your best judgement to clean this up. Note that imputation will likely be impossible if `UnitPrice`, `TaxAmount`, and `ExtendedPrice` are all 0.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

In [114]:
# where UnitPrice is 0
up_filt = df['UnitPrice'] <= 0
df.loc[up_filt]

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
11024,195280,60221,183,Shipping carton (Brown) 480x270x320mm,7,50,0.0,15,0.0,77.0,0.0,00:00.0,0.04882
58747,22844,7034,157,10 mm Double sided bubble wrap 20m,7,30,0.0,15,0.0,480.0,0.0,00:00.0,0.258311
63002,62669,19296,104,Alien officer hoodie (Black) 3XL,7,1,0.0,15,0.0,18.0,0.0,00:00.0,0.277391
75456,36930,11396,110,Superhero action jacket (Blue) S,7,9,0.0,15,0.0,54.0,0.0,00:00.0,0.332412
107322,124012,38174,186,Shipping carton (Brown) 457x457x457mm,7,75,0.0,15,0.0,75.0,0.0,00:00.0,0.471036
216096,170768,52636,26,DBA joke mug - SELECT caffeine FROM mug (White),7,3,0.0,15,0.0,25.5,0.0,00:00.0,0.946992
216736,59555,18347,167,10 mm Anti static bubble wrap (Blue) 50m,7,90,0.0,15,0.0,4860.0,0.0,00:00.0,0.949621


In [115]:
# where TaxAmount is 0
tax_filt = df['TaxAmount'] <= 0
df.loc[tax_filt]

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
11024,195280,60221,183,Shipping carton (Brown) 480x270x320mm,7,50,0.0,15,0.0,77.0,0.0,00:00.0,0.04882
58747,22844,7034,157,10 mm Double sided bubble wrap 20m,7,30,0.0,15,0.0,480.0,0.0,00:00.0,0.258311
63002,62669,19296,104,Alien officer hoodie (Black) 3XL,7,1,0.0,15,0.0,18.0,0.0,00:00.0,0.277391
75456,36930,11396,110,Superhero action jacket (Blue) S,7,9,0.0,15,0.0,54.0,0.0,00:00.0,0.332412
107322,124012,38174,186,Shipping carton (Brown) 457x457x457mm,7,75,0.0,15,0.0,75.0,0.0,00:00.0,0.471036
216096,170768,52636,26,DBA joke mug - SELECT caffeine FROM mug (White),7,3,0.0,15,0.0,25.5,0.0,00:00.0,0.946992
216736,59555,18347,167,10 mm Anti static bubble wrap (Blue) 50m,7,90,0.0,15,0.0,4860.0,0.0,00:00.0,0.949621


In [116]:
# where ExtendedPrice is 0
ep_filt = df['ExtendedPrice'] <= 0
df.loc[ep_filt]

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
11024,195280,60221,183,Shipping carton (Brown) 480x270x320mm,7,50,0.0,15,0.0,77.0,0.0,00:00.0,0.04882
58747,22844,7034,157,10 mm Double sided bubble wrap 20m,7,30,0.0,15,0.0,480.0,0.0,00:00.0,0.258311
63002,62669,19296,104,Alien officer hoodie (Black) 3XL,7,1,0.0,15,0.0,18.0,0.0,00:00.0,0.277391
75456,36930,11396,110,Superhero action jacket (Blue) S,7,9,0.0,15,0.0,54.0,0.0,00:00.0,0.332412
107322,124012,38174,186,Shipping carton (Brown) 457x457x457mm,7,75,0.0,15,0.0,75.0,0.0,00:00.0,0.471036
216096,170768,52636,26,DBA joke mug - SELECT caffeine FROM mug (White),7,3,0.0,15,0.0,25.5,0.0,00:00.0,0.946992
216736,59555,18347,167,10 mm Anti static bubble wrap (Blue) 50m,7,90,0.0,15,0.0,4860.0,0.0,00:00.0,0.949621


In [118]:
df.loc[up_filt & tax_filt & ep_filt]

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
11024,195280,60221,183,Shipping carton (Brown) 480x270x320mm,7,50,0.0,15,0.0,77.0,0.0,00:00.0,0.04882
58747,22844,7034,157,10 mm Double sided bubble wrap 20m,7,30,0.0,15,0.0,480.0,0.0,00:00.0,0.258311
63002,62669,19296,104,Alien officer hoodie (Black) 3XL,7,1,0.0,15,0.0,18.0,0.0,00:00.0,0.277391
75456,36930,11396,110,Superhero action jacket (Blue) S,7,9,0.0,15,0.0,54.0,0.0,00:00.0,0.332412
107322,124012,38174,186,Shipping carton (Brown) 457x457x457mm,7,75,0.0,15,0.0,75.0,0.0,00:00.0,0.471036
216096,170768,52636,26,DBA joke mug - SELECT caffeine FROM mug (White),7,3,0.0,15,0.0,25.5,0.0,00:00.0,0.946992
216736,59555,18347,167,10 mm Anti static bubble wrap (Blue) 50m,7,90,0.0,15,0.0,4860.0,0.0,00:00.0,0.949621


In [120]:
df.drop(index=df.loc[up_filt & tax_filt & ep_filt].index, inplace=True)

```
The same rows that had `UnitPrice` of 0 also had `TaxAmount` and `ExtendedPrice` of 0. There was no way to impute the values, and not enough incorrect values to drop the column, so I dropped the rows that had 0 in `UnitPrice`, `TaxAmount`, and `ExtendedPrice`.
```
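
The three separate filters can be collapsed into one comparison across several columns. The following sketch uses one row from the output above, the first row of `df.head()`, and a third made-up row (`InvoiceLineID` 999999 is hypothetical) to show the difference between "all three are 0" and "at least one is 0".

```python
import pandas as pd

toy = pd.DataFrame({
    "InvoiceLineID": [195280, 19221, 999999],  # 999999 is a made-up row
    "UnitPrice": [0.0, 13.0, 0.0],
    "TaxAmount": [0.0, 17.55, 1.5],
    "ExtendedPrice": [0.0, 134.55, 11.5],
})

price_cols = ["UnitPrice", "TaxAmount", "ExtendedPrice"]
non_positive = toy[price_cols] <= 0

# True only when every price column is non-positive (cannot be imputed).
all_zero = non_positive.all(axis=1)

# True when at least one price column is non-positive (might be imputable).
any_zero = non_positive.any(axis=1)

cleaned = toy.loc[~all_zero]
print(all_zero.tolist(), any_zero.tolist())
```

Comparing `all_zero.sum()` with `any_zero.sum()` on the real data is a quick way to confirm that the same rows are affected in all three columns.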

---
##### Question 8: The maximum value in the `ExtendedPrice` column appears to be extremely high
The maximum value in the `ExtendedPrice` column appears to be extremely high. You can use the `is_outlier()` function below to check whether the value is an outlier and clean the data set accordingly (this is optional). You can also look for other hints in the data that might explain this high value.

Write the code that will clean this part of the data set. Then, write a sentence or two that explains how you cleaned this column and why you did it that way. Show all of your work.

In [134]:
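import numpy as np
# Upper fence of the IQR rule: values above Q3 + 1.5 * IQR are flagged
q1 = np.quantile(df['ExtendedPrice'], .25)
q3 = np.quantile(df['ExtendedPrice'], .75)
iqr = q3 - q1
def is_outlier(value):
    # Only high outliers are checked; low values are never flagged
    return value > q3 + (1.5 * iqr)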

In [135]:
df['ExtendedPrice'].apply(is_outlier).sum()

19921
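
The same IQR rule can be applied to the whole column at once instead of value by value. The sketch below is a vectorized variant on made-up numbers; unlike `is_outlier()`, it also checks a lower fence, which is an addition for illustration only.

```python
import pandas as pd

# Made-up prices with one obviously extreme value.
prices = pd.Series([100.0, 110.0, 120.0, 130.0, 1000.0], name="ExtendedPrice")

q1 = prices.quantile(0.25)
q3 = prices.quantile(0.75)
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask computed for the whole column without apply().
outlier_mask = (prices < lower_fence) | (prices > upper_fence)
print(lower_fence, upper_fence, outlier_mask.tolist())
```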

In [137]:
df.sort_values(by='ExtendedPrice', ascending=False).head()

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
163373,194200,59885,215,Air cushion machine (Blue),7,10,1899.0,15,2848.5,7590.0,218385.0,00:00.0,0.717045
156577,208165,64262,215,Air cushion machine (Blue),7,10,1899.0,15,2848.5,7590.0,21838.5,00:00.0,0.686921
191950,153334,47239,215,Air cushion machine (Blue),7,10,1899.0,15,2848.5,7590.0,21838.5,00:00.0,0.841911
169917,166505,51315,215,Air cushion machine (Blue),7,10,1899.0,15,2848.5,7590.0,21838.5,00:00.0,0.745738
59553,149556,46068,215,Air cushion machine (Blue),7,10,1899.0,15,2848.5,7590.0,21838.5,00:00.0,0.261955


In [138]:
filt = df['ExtendedPrice'] > 21839
df.loc[filt]

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedWhen,r
163373,194200,59885,215,Air cushion machine (Blue),7,10,1899.0,15,2848.5,7590.0,218385.0,00:00.0,0.717045


In [140]:
df.loc[filt, 'ExtendedPrice'] = df.loc[filt, 'Quantity'] * df.loc[filt, 'UnitPrice'] + df.loc[filt, 'TaxAmount']

```
There were almost 20,000 outliers, so I decided not to drop all of them. Sorting by `ExtendedPrice` showed that only the top row was out of line with otherwise identical rows (218385.0 versus 21838.5). Since `ExtendedPrice` can be found by multiplying the `UnitPrice` by the `Quantity` and then adding the `TaxAmount`, I used that calculation to replace the strange value.
```
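
Rather than picking a cutoff such as 21839 by eye, the inconsistent row could be found by recomputing `ExtendedPrice` and comparing. This sketch uses the top two rows from the sorted output above; the 0.01 tolerance is an assumption chosen to absorb floating-point rounding.

```python
import pandas as pd

# Top two rows of the sorted output above.
toy = pd.DataFrame({
    "InvoiceLineID": [194200, 208165],
    "Quantity": [10, 10],
    "UnitPrice": [1899.0, 1899.0],
    "TaxAmount": [2848.5, 2848.5],
    "ExtendedPrice": [218385.0, 21838.5],
})

# ExtendedPrice as implied by the other columns.
expected = toy["Quantity"] * toy["UnitPrice"] + toy["TaxAmount"]

# Flag rows where the stored value disagrees (tolerance is an assumption).
inconsistent = (toy["ExtendedPrice"] - expected).abs() > 0.01

# Overwrite only the flagged rows.
toy.loc[inconsistent, "ExtendedPrice"] = expected[inconsistent]
print(inconsistent.tolist(), toy["ExtendedPrice"].tolist())
```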

---
### Results
Use the code cells below to check your work. There should be 0 null values left in the data set, data types should be correct (i.e., columns with `ID` in their names should be integers, and so forth), and the most extreme/impossible values should have been corrected or eliminated.

In [141]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 228118 entries, 0 to 228264
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   InvoiceLineID   228118 non-null  int64  
 1   InvoiceID       228118 non-null  int64  
 2   StockItemID     228118 non-null  int16  
 3   Description     228118 non-null  object 
 4   PackageTypeID   228118 non-null  int32  
 5   Quantity        228118 non-null  int64  
 6   UnitPrice       228118 non-null  float64
 7   TaxRate         228118 non-null  int64  
 8   TaxAmount       228118 non-null  float64
 9   LineProfit      228118 non-null  float64
 10  ExtendedPrice   228118 non-null  float64
 11  LastEditedWhen  228118 non-null  object 
 12  r               228118 non-null  float64
dtypes: float64(5), int16(1), int32(1), int64(4), object(2)
memory usage: 22.2+ MB


In [142]:
df.describe()

Unnamed: 0,InvoiceLineID,InvoiceID,StockItemID,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,r
count,228118.0,228118.0,228118.0,228118.0,228118.0,228118.0,228118.0,228118.0,228118.0,228118.0,228118.0
mean,114134.528249,35179.677132,110.18682,7.073773,39.212381,45.576526,14.977292,112.932028,375.530338,867.481029,0.500807
std,65892.674374,20336.620727,63.727063,0.644392,54.555871,139.739843,0.336188,217.373146,753.796121,1666.968187,0.28863
min,1.0,1.0,1.0,1.0,1.0,0.66,10.0,0.38,-645.0,2.88,6e-06
25%,57070.25,17572.0,54.0,7.0,5.0,13.0,15.0,14.4,51.0,110.4,0.251071
50%,114133.5,35152.0,111.0,7.0,10.0,18.0,15.0,34.5,120.0,264.5,0.50054
75%,171196.75,52764.0,165.0,7.0,60.0,32.0,15.0,129.6,390.0,993.6,0.751215
max,228265.0,70510.0,227.0,10.0,360.0,1899.0,15.0,2848.5,9200.0,21838.5,0.999987


In [143]:
df.isna().sum()

InvoiceLineID     0
InvoiceID         0
StockItemID       0
Description       0
PackageTypeID     0
Quantity          0
UnitPrice         0
TaxRate           0
TaxAmount         0
LineProfit        0
ExtendedPrice     0
LastEditedWhen    0
r                 0
dtype: int64
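
The checks above can also be written as code so that they fail loudly if a later change reintroduces a problem. The helper below, `check_clean`, is hypothetical and covers only some of the conditions listed in this section; it is demonstrated on two small made-up frames rather than the project data.

```python
import pandas as pd

def check_clean(frame: pd.DataFrame) -> list[str]:
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if frame.isna().any().any():
        problems.append("null values remain")
    for col in frame.columns:
        # Columns whose names end in "ID" are expected to hold integers.
        if col.endswith("ID") and not pd.api.types.is_integer_dtype(frame[col]):
            problems.append(f"{col} is not an integer column")
    if "TaxRate" in frame and (frame["TaxRate"] < 0).any():
        problems.append("negative TaxRate")
    return problems

# Made-up frames: one clean, one with three deliberate problems.
good = pd.DataFrame({"InvoiceLineID": [1, 2], "StockItemID": [33, 99], "TaxRate": [15, 15]})
bad = pd.DataFrame({"InvoiceLineID": [1, 2], "StockItemID": ["33", None], "TaxRate": [-15, 15]})

good_problems = check_clean(good)
bad_problems = check_clean(bad)
print(good_problems, bad_problems)
```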