# Advanced Python for Data Analytics - Final Project (part 2)

In part 2 of this final project, you will use the skills you have learned in this course to explore a data set.

Specifically, you will perform the following in this project:

1. Use Jupyter Lab to organize notebook text and run Python code.
2. Use Pandas to import data, create new columns, run aggregations, and export a CSV file.
3. Use regular expressions to enrich the data set.
<!-- 4. Use matplotlib to create meaningful exploratory visualizations -->

## About the data
The data comes from the `InvoiceLines` table in the `Sales` schema of the `WideWorldImporters` database and describes invoice lines for the company *Wide World Importers*. Each row is a single invoice **line item**: it records the sale of one product on one invoice. A customer may buy several different products on a single invoice, so one invoice can contain multiple invoice lines, but each line item belongs to exactly one invoice.

The table contains information about the quantity, unit price, and the profit for each **line item**. You will not need any other tables or data to complete this project.
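The relationship between invoices and invoice lines can be sketched on a tiny, made-up frame. The values below are illustrative only and are not taken from `invoice_lines.csv`; the name `toy` is a placeholder.

```python
import pandas as pd

# Made-up stand-in for the real table: three invoices, five line items.
toy = pd.DataFrame({
    "InvoiceLineID": [1, 2, 3, 4, 5],
    "InvoiceID": [1, 2, 2, 3, 3],
    "StockItemID": [67, 50, 10, 114, 206],
})

# Every InvoiceLineID appears once, while an InvoiceID can repeat across rows.
lines_per_invoice = toy.groupby("InvoiceID")["InvoiceLineID"].count()
print(lines_per_invoice)
```

Counting line IDs per invoice shows which invoices carry more than one line item.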

## Instructions
This final project will be very similar to the homework assignments you completed in modules 2 and 3. Specifically, you will import and analyze a data set by creating new columns and using aggregation to summarize data. However, the instructions given to you will be minimal, and the methods used to perform the necessary operations for cleaning the data set will not be provided in the instructions. You will not need any methods or functions that have not been covered in the course readings to complete the assignment, although you are free to use them if you would like.

Answer the questions enumerated below by writing code to obtain the answers. You should show all of your work and then answer the question in a sentence below the question header.

You are free to add as many cells to this file as you would like. However, each of the cells should be ordered in such a way that the entire file could be run all at once from top to bottom and everything will run as expected.

### Import packages
Use the cell below to import all of the packages you will need to complete this assignment. You can come back later and add code to this cell as needed.

In [59]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#### Import data set
Import the file `invoice_lines.csv` from the `data` directory into a variable called `df`.

In [60]:
df = pd.read_csv('./data/invoice_lines.csv')

#### First five rows
Use appropriate dataframe methods to show the first five rows of the dataframe.

In [61]:
df.head()

,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedBy,LastEditedWhen
0,1,1,67,Ride on toy sedan car (Black) 1/12 scale,7,10,230.0,15.0,345.0,850.0,2645.0,7,2013-01-01 12:00:00.0000000
1,2,2,50,Developer joke mug - old C developers never di...,7,9,13.0,15.0,17.55,76.5,134.55,7,2013-01-01 12:00:00.0000000
2,3,2,10,USB food flash drive - chocolate bar,7,9,32.0,15.0,43.2,180.0,331.2,7,2013-01-01 12:00:00.0000000
3,4,3,114,Superhero action jacket (Blue) XXL,7,3,30.0,15.0,13.5,24.0,103.5,7,2013-01-01 12:00:00.0000000
4,5,4,206,Permanent marker black 5mm nib (Black) 5mm,7,96,2.7,15.0,38.88,96.0,298.08,7,2013-01-01 12:00:00.0000000


#### Describe the data set
Use appropriate dataframe methods to describe the data in the dataframe.

In [62]:
df.describe()

,InvoiceLineID,InvoiceID,StockItemID,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedBy
count,228265.0,228265.0,228265.0,228265.0,228265.0,228265.0,228265.0,228265.0,228265.0,228265.0,228265.0
mean,114133.0,35179.209686,110.181285,7.073765,39.211566,45.591689,14.977307,112.948101,375.568663,867.603178,10.798795
std,65894.573936,20337.208453,63.729035,0.644437,54.558829,139.862055,0.336081,217.512539,754.052045,1668.036182,5.510103
min,1.0,1.0,1.0,1.0,1.0,0.66,10.0,0.38,-645.0,2.88,2.0
25%,57067.0,17572.0,54.0,7.0,5.0,13.0,15.0,14.4,51.0,110.4,6.0
50%,114133.0,35152.0,111.0,7.0,10.0,18.0,15.0,34.5,120.0,264.5,11.0
75%,171199.0,52765.0,165.0,7.0,60.0,32.0,15.0,129.6,390.0,993.6,16.0
max,228265.0,70510.0,227.0,10.0,360.0,1899.0,15.0,2848.5,9200.0,21838.5,20.0


#### Dataframe information
Use appropriate dataframe methods to get information about the data types and non-null counts for each column in the data set.

In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 228265 entries, 0 to 228264
Data columns (total 13 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   InvoiceLineID   228265 non-null  int64  
 1   InvoiceID       228265 non-null  int64  
 2   StockItemID     228265 non-null  int64  
 3   Description     228265 non-null  object 
 4   PackageTypeID   228265 non-null  int64  
 5   Quantity        228265 non-null  int64  
 6   UnitPrice       228265 non-null  float64
 7   TaxRate         228265 non-null  float64
 8   TaxAmount       228265 non-null  float64
 9   LineProfit      228265 non-null  float64
 10  ExtendedPrice   228265 non-null  float64
 11  LastEditedBy    228265 non-null  int64  
 12  LastEditedWhen  228265 non-null  object 
dtypes: float64(5), int64(6), object(2)
memory usage: 22.6+ MB


#### Null value counts
Use appropriate dataframe methods to get information about the number of null values in each column of the dataframe.

In [64]:
df.isna().sum()

InvoiceLineID     0
InvoiceID         0
StockItemID       0
Description       0
PackageTypeID     0
Quantity          0
UnitPrice         0
TaxRate           0
TaxAmount         0
LineProfit        0
ExtendedPrice     0
LastEditedBy      0
LastEditedWhen    0
dtype: int64

### Instructions
Using your knowledge of Pandas and regular expressions, answer the questions below. Again, you may add as many cells as you would like to this file, but make sure that you show your work in each question. Also make sure that the code cells in your file can run in order from top to bottom.

Below each question, add a comment that answers it.

---
##### Question 1: How many rows are there?
Answer the question below.

In [65]:
df.shape[0]

228265

```
There are 228,265 rows in the dataframe.
```

--- 
##### Question 2: How many unique InvoiceLines are there?
Answer the question below.

In [66]:
df['InvoiceLineID'].nunique()

228265

```
There are 228,265 unique `InvoiceLines`.
```

---
##### Question 3: How many unique Invoices are there?
Answer the question below.

In [67]:
df['InvoiceID'].nunique()

70510

```
There are 70,510 unique `InvoiceID` values.
```

---
##### Question 4: What is the average `LineProfit`?
Answer the question below.

In [68]:
df['LineProfit'].mean()

375.5686631765712

```
The average `LineProfit` is $375.57.
```

---
##### Question 5: What is the median `LineProfit`?
Answer the question below.

In [69]:
df['LineProfit'].median()

120.0

```
The median `LineProfit` is $120.00.
```

---
##### Question 6: Which product (`Description`) seems to have the highest `LineProfit`?
Answer the question below.

In [70]:
df.sort_values(by='LineProfit', ascending=False).head()

,InvoiceLineID,InvoiceID,StockItemID,Description,PackageTypeID,Quantity,UnitPrice,TaxRate,TaxAmount,LineProfit,ExtendedPrice,LastEditedBy,LastEditedWhen
102159,102160,31440,161,20 mm Double sided bubble wrap 50m,7,100,108.0,15.0,1620.0,9200.0,12420.0,5,2014-08-14 12:00:00.0000000
71590,71591,22038,161,20 mm Double sided bubble wrap 50m,7,100,108.0,15.0,1620.0,9200.0,12420.0,12,2014-03-03 12:00:00.0000000
127463,127464,39249,161,20 mm Double sided bubble wrap 50m,7,100,108.0,15.0,1620.0,9200.0,12420.0,6,2015-01-05 12:00:00.0000000
14949,14950,4596,161,20 mm Double sided bubble wrap 50m,7,100,108.0,15.0,1620.0,9200.0,12420.0,20,2013-04-04 12:00:00.0000000
159121,159122,49015,161,20 mm Double sided bubble wrap 50m,7,100,108.0,15.0,1620.0,9200.0,12420.0,16,2015-06-12 12:00:00.0000000


```
"20 mm Double sided bubble wrap 50m" seems to have the highest `LineProfit`: each of its top invoice lines shows a profit of $9,200.00.
```
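Sorting and reading the top rows works, but it can hide ties. As an optional sketch on made-up data (the frame `toy` and its values are placeholders), the maximum can be computed directly and every product that reaches it listed:

```python
import pandas as pd

# Made-up line items; two rows share the highest profit.
toy = pd.DataFrame({
    "Description": ["Bubble wrap", "Bubble wrap", "Joke mug", "Toy car"],
    "LineProfit": [9200.0, 9200.0, 76.5, 850.0],
})

# Highest single-line profit, then every distinct product that reaches it.
top_profit = toy["LineProfit"].max()
top_products = toy.loc[toy["LineProfit"] == top_profit, "Description"].unique()
print(top_profit, list(top_products))
```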

---
##### Question 7: Which product (`Description`) is the highest selling product, by total `Quantity` sold?
Answer the question below.

In [71]:
df.groupby(['StockItemID', 'Description']).agg({'Quantity': 'sum'}).sort_values(by='Quantity', ascending=False)

StockItemID,Description,Quantity
191,Black and orange fragile despatch tape 48mmx75m,207324
192,Black and orange fragile despatch tape 48mmx100m,193680
189,Clear packaging tape 48mmx75m,158626
188,3 kg Courier post bag (White) 300x190x95mm,152375
185,Shipping carton (Brown) 356x356x279mm,152125
...,...,...
112,Superhero action jacket (Blue) L,5464
110,Superhero action jacket (Blue) S,5426
114,Superhero action jacket (Blue) XXL,5404
17,DBA joke mug - mind if I join you? (Black),5402


```
Product with `StockItemID` 191, or "Black and orange fragile despatch tape 48mmx75m", is the highest selling product by total quantity.
```
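When only the top or bottom product is needed, `idxmax` and `idxmin` on the grouped totals give the answer without scanning a sorted table. This is a minimal sketch on made-up data; it groups by `Description` alone for brevity, whereas the cell above also groups by `StockItemID`.

```python
import pandas as pd

# Made-up line items for three products.
toy = pd.DataFrame({
    "Description": ["Tape", "Tape", "Mug", "Jacket", "Jacket"],
    "Quantity": [100, 200, 9, 3, 5],
})

# Total quantity per product, then the labels of the largest and smallest totals.
totals = toy.groupby("Description")["Quantity"].sum()
best_seller = totals.idxmax()
worst_seller = totals.idxmin()
print(best_seller, worst_seller)
```

Note that `idxmax` and `idxmin` return only the first matching label, so ties need a separate check.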

---
##### Question 8: Which product (`Description`) is the worst selling product by total `Quantity` sold?
Answer the question below.

In [72]:
df.groupby(['StockItemID', 'Description']).agg({'Quantity': 'sum'}).sort_values(by='Quantity', ascending=True)

StockItemID,Description,Quantity
113,Superhero action jacket (Blue) XL,5373
17,DBA joke mug - mind if I join you? (Black),5402
114,Superhero action jacket (Blue) XXL,5404
110,Superhero action jacket (Blue) S,5426
112,Superhero action jacket (Blue) L,5464
...,...,...
185,Shipping carton (Brown) 356x356x279mm,152125
188,3 kg Courier post bag (White) 300x190x95mm,152375
189,Clear packaging tape 48mmx75m,158626
192,Black and orange fragile despatch tape 48mmx100m,193680


```
The worst selling product by total quantity is product with `StockItemID` of 113, called the "Superhero action jacket (Blue) XL".
```

---
##### Question 9: Which product (`Description`) is the most frequently ordered (count) product?
Answer the question below.

In [73]:
df.groupby(['StockItemID', 'Description']).agg({'InvoiceID': 'count'}).sort_values(by='InvoiceID', ascending=False)

StockItemID,Description,InvoiceID
120,Dinosaur battery-powered slippers (Green) L,1123
104,Alien officer hoodie (Black) 3XL,1123
167,10 mm Anti static bubble wrap (Blue) 50m,1119
13,USB food flash drive - shrimp cocktail,1117
88,"""The Gu"" red shirt XML tag t-shirt (White) 7XL",1110
...,...,...
221,Novelty chilli chocolates 500g,130
225,Chocolate sharks 250g,126
224,Chocolate frogs 250g,117
227,White chocolate moon rocks 250g,117


```
The most frequently ordered products are "Dinosaur battery-powered slippers (Green) L" and "Alien officer hoodie (Black) 3XL", which are tied at 1,123 invoice lines each.
```
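Because two products share the top count, it can help to select every label whose count equals the maximum instead of reading only the first row. The following is a sketch on made-up data, not the course's prescribed method:

```python
import pandas as pd

# Made-up line items; two products appear twice, one appears once.
toy = pd.DataFrame({
    "Description": ["Slippers", "Hoodie", "Slippers", "Hoodie", "Mug"],
})

# Count rows per product and keep all products tied for the highest count.
counts = toy["Description"].value_counts()
most_frequent = sorted(counts[counts == counts.max()].index)
print(most_frequent)
```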

---
##### Question 10: Which product (`Description`) is the least frequently ordered product?
Answer the question below.

In [74]:
df.groupby(['StockItemID', 'Description']).agg({'InvoiceID': 'count'}).sort_values(by='InvoiceID', ascending=True)

StockItemID,Description,InvoiceID
223,Chocolate echidnas 250g,115
227,White chocolate moon rocks 250g,117
224,Chocolate frogs 250g,117
225,Chocolate sharks 250g,126
221,Novelty chilli chocolates 500g,130
...,...,...
88,"""The Gu"" red shirt XML tag t-shirt (White) 7XL",1110
13,USB food flash drive - shrimp cocktail,1117
167,10 mm Anti static bubble wrap (Blue) 50m,1119
120,Dinosaur battery-powered slippers (Green) L,1123


```
"Chocolate echidnas 250g" are the least frequently ordered product.
```

---
##### Question 11: Create a new column called `ProfitRatio`
You can calculate the `ProfitRatio` by using the following formula:
$$ ProfitRatio = \frac{LineProfit}{UnitPrice*Quantity} $$

In [75]:
df['ProfitRatio'] = df['LineProfit'] / (df['UnitPrice'] * df['Quantity'])

```
You don't need to write anything to answer this question.
```
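As a quick sanity check of the formula, here it is applied to the first two rows shown in the `df.head()` output earlier in the notebook. The frame name `sample` is a placeholder; the numbers are copied from that output.

```python
import pandas as pd

# Quantity, UnitPrice and LineProfit copied from the first two rows of df.head().
sample = pd.DataFrame({
    "Quantity": [10, 9],
    "UnitPrice": [230.0, 13.0],
    "LineProfit": [850.0, 76.5],
})

# ProfitRatio = LineProfit / (UnitPrice * Quantity), computed row by row.
sample["ProfitRatio"] = sample["LineProfit"] / (sample["UnitPrice"] * sample["Quantity"])
print(sample["ProfitRatio"].round(4).tolist())
```

The toy sedan line works out to 850 / 2300, roughly 0.37, and the joke mug line to 76.5 / 117, roughly 0.65.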

---
##### Question 12: Which product (`Description`) has the highest average `ProfitRatio`?
Answer the question below.

In [76]:
df.groupby(['StockItemID', 'Description']).agg({'ProfitRatio': 'mean'}).sort_values(by='ProfitRatio', ascending=False)

StockItemID,Description,ProfitRatio
161,20 mm Double sided bubble wrap 50m,0.851852
118,Dinosaur battery-powered slippers (Green) S,0.750000
135,Animal with big feet slippers (Brown) M,0.750000
134,Animal with big feet slippers (Brown) S,0.750000
132,Furry gorilla with big eyes slippers (Black) L,0.750000
...,...,...
149,Halloween skull mask (Gray) XL,0.166667
142,Halloween zombie mask (Light Brown) S,-0.055556
143,Halloween zombie mask (Light Brown) M,-0.055556
144,Halloween zombie mask (Light Brown) L,-0.055556


```
The "20 mm Double sided bubble wrap 50m" has the highest average `ProfitRatio`, at about 0.85.
```

---
##### Question 13: Create a new column called `Year`
You can find the year in the `LastEditedWhen` column.

In [77]:
df['Year'] = df['LastEditedWhen'].str[:4].astype('int16')

```
You don't need to write anything here to answer this question.
```
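Slicing the first four characters works because every timestamp starts with the year. An alternative, sketched here under the assumption that all values parse as dates, is to convert the column with `pd.to_datetime` and read the year from the `.dt` accessor. The first timestamp below comes from the `df.head()` output; the second is made up.

```python
import pandas as pd

# One timestamp in the format seen in LastEditedWhen, plus one made-up value.
stamps = pd.Series([
    "2013-01-01 12:00:00.0000000",
    "2016-05-31 12:00:00.0000000",
])

# Parse the strings into datetimes, then pull out the calendar year.
years = pd.to_datetime(stamps).dt.year
print(years.tolist())
```

Parsing costs more than slicing, but it also makes month, weekday, and other date parts available later.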

---
##### Question 14: Which year had the highest total `ExtendedPrice`?
Answer the question below.

In [78]:
df.groupby('Year').agg({'ExtendedPrice':'sum'}).sort_values(by='ExtendedPrice', ascending=False)

Year,ExtendedPrice
2015,62090220.81
2014,57418916.89
2013,52563272.64
2016,25971029.11


```
The year with the highest total `ExtendedPrice` was 2015.
```

---
##### Question 15: Create a new column called `Color`.
You can use the regular expression `\(([A-Z]\w+)\)` to extract the color from the `Description` column.

In [79]:
# A raw string keeps the regex backslashes from being treated as Python string escapes.
df['Color'] = df['Description'].str.extract(r"\(([A-Z]\w+)\)")

```
You don't need to write anything here to answer this question.
```
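To see what the pattern does and does not capture, it can be tried on a few descriptions that appear in the outputs above. The pattern requires an opening parenthesis, one capital letter, one or more word characters, and a closing parenthesis, so a two-word color such as "Light Brown" is not matched. This is an illustrative sketch; `expand=False` is used only so the result is a Series.

```python
import pandas as pd

# Descriptions taken from the notebook outputs above.
descriptions = pd.Series([
    "Ride on toy sedan car (Black) 1/12 scale",
    "Halloween zombie mask (Light Brown) S",
    "USB food flash drive - chocolate bar",
])

# Capture a single capitalized word enclosed in parentheses; no match gives NaN.
colors = descriptions.str.extract(r"\(([A-Z]\w+)\)", expand=False)
print(colors.tolist())
```

Rows without a match receive `NaN`, and `groupby('Color')` leaves those rows out by default.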

---
##### Question 16: Which is the most highly sold (total `Quantity`) color among all products?
Answer the question below.

In [80]:
df.groupby('Color').agg({'Quantity':'sum'}).sort_values(by='Quantity', ascending=False)

Color,Quantity
Brown,1435887
White,1401803
Black,1224921
Blue,741435
Gray,309847
Pink,290436
Red,121807
Yellow,68942
Green,65023


```
The most highly sold color among all products is Brown.
```

---
##### Question 17: Create a filtered dataframe called `slippers_df` that includes rows with the word "slipper" in the `Description`
Create the variable below.

In [81]:
slippers_df = df.loc[df['Description'].str.contains('slipper')]

```
You don't need to write anything here to answer this question.
```

---
##### Question 18: Create a new column called `Size` for `slippers_df`
You can use the regular expression `\s([A-Z]+)$` to extract the size from the `Description` column.

If you get a `SettingWithCopyWarning`, you can ignore it.

In [82]:
slippers_df['Size'] = slippers_df['Description'].str.extract(r"\s([A-Z]+)$")

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  slippers_df['Size'] = slippers_df['Description'].str.extract(r"\s([A-Z]+)$")


```
You don't need to write anything here to answer this question.
```
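The warning is harmless here, but one common way to avoid it altogether is to take an explicit copy when filtering. The sketch below uses descriptions from the outputs above; the names `toy` and `slippers` are placeholders, and this is an optional variation on the cells in this notebook.

```python
import pandas as pd

# Descriptions taken from the notebook outputs above.
toy = pd.DataFrame({"Description": [
    "Dinosaur battery-powered slippers (Green) L",
    "Animal with big feet slippers (Brown) M",
    "Superhero action jacket (Blue) XXL",
]})

# .copy() makes the filtered frame independent of the original,
# so adding a column to it does not trigger SettingWithCopyWarning.
slippers = toy.loc[toy["Description"].str.contains("slipper")].copy()

# Capture the run of capital letters at the very end of the description.
slippers["Size"] = slippers["Description"].str.extract(r"\s([A-Z]+)$", expand=False)
print(slippers["Size"].tolist())
```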

---
##### Question 19: What is the most popular size and color of slipper by total Quantity?
Answer the question below.

In [88]:
slippers_df.groupby(['Color', 'Size']).agg({'Quantity': 'sum'}).sort_values(by='Quantity', ascending=False)

Color,Size,Quantity
Green,L,12060
Green,M,11973
Green,XL,11839
Green,S,11767
Brown,XL,6048
Black,L,5943
Black,S,5816
Gray,XL,5814
Black,M,5798
Brown,L,5785


```
The most popular slippers are green ones in size L.
```
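When grouping by two columns, `idxmax` on the summed Series returns the winning combination as a tuple, which can be unpacked directly. This is a minimal sketch on made-up quantities:

```python
import pandas as pd

# Made-up slipper line items.
toy = pd.DataFrame({
    "Color": ["Green", "Green", "Green", "Brown"],
    "Size": ["L", "L", "M", "XL"],
    "Quantity": [10, 5, 12, 6],
})

# Sum quantity per (Color, Size) pair; idxmax returns the pair with the largest total.
totals = toy.groupby(["Color", "Size"])["Quantity"].sum()
best_color, best_size = totals.idxmax()
print(best_color, best_size)
```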

---
##### Question 20: Which size of slippers has the lowest `ProfitRatio`?
Answer the question below.

In [91]:
slippers_df.groupby('Size').agg({'ProfitRatio':'mean'}).sort_values(by='ProfitRatio')

Size,ProfitRatio
XL,0.734375
L,0.75
M,0.75
S,0.75


```
Slippers in size XL have the lowest profit ratio.
```