# IS4487 Module 4 - Practice Code

This notebook is designed to help you follow along with the **Module 4 Lecture and Reading**.

The practice code demos show working code that you can draw on for your lab and assignment work. Each section contains short explanations and annotated code that reflect the steps in the reading.

### Topics for this demo:
- Import libraries and data
- Profile the data
- Do basic data exploration
- Check whether variables have *outliers* (values that lie far from most of the others)

<a href="https://colab.research.google.com/github/vandanara/UofUtah_IS4487/blob/main/Demos/demo_04_data_understanding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


### Context: e-Commerce Retail Sales
This example uses a small set of customer orders from an e-commerce website:
- Order - OrderDate, OrderID, Quantity, Price
- Customer - CustomerID
- Product - Product, Category

Your task is to do basic statistical analysis of the variables to understand the data quality and completeness.

### Import Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Create and use data

In this exercise, we will create a small artificial dataset as a Python dictionary (`dict`) and then convert it to a pandas DataFrame.

A `dict` is a collection of key:value pairs separated by commas and enclosed in {}. The keys act as the column names, and each key's value holds the entries for that column (here 5), one per row of the table.

A `list` groups a collection of values. It is written as comma-separated values inside [].
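As a quick illustration of the two structures, the sketch below looks up a value in a list by position and in a dict by key, then shows what happens when the lists in a dict have different lengths. The names `prices`, `order`, and `bad_data` are made up for this example and are not part of the demo dataset.

```python
import pandas as pd

# A list keeps values in order; a dict maps each key to a value.
prices = [1200.00, 25.00, 50.00]
order = {"OrderID": 1001, "Product": "Laptop"}

first_price = prices[0]       # lists are indexed by position, starting at 0
product = order["Product"]    # dicts are looked up by key

# To form a table, every list in the dict needs the same length.
# Hypothetical mismatched input, used only to show the error pandas raises:
bad_data = {"OrderID": [1001, 1002], "Price": [1200.00]}
try:
    pd.DataFrame(bad_data)
    mismatch_error = None
except ValueError as err:
    mismatch_error = str(err)

print(first_price, product)
print("Error from mismatched lengths:", mismatch_error)
```

If a column is short by a value, it is usually better to fill the gap explicitly (for example with `None`) so the missing entry is visible when you profile the data.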


In [None]:
# below is a Python dict (dictionary), which contains key:value pairs
# each key is a string written in double quotes ""
# each value is a list written inside []
data = {
    "OrderID": [1001, 1002, 1003, 1004, 1005],
    "CustomerID": [1, 2, 3, 1, 2],
    "Product": ["Laptop", "Mouse", "Keyboard", "Mouse", "Monitor"],
    "Category": ["Electronics", "Accessories", "Accessories", "Accessories", "Electronics"],
    "Quantity": [1, 2, 1, 1, 1],
    "Price": [1200.00, 25.00, 50.00, 25.00, 300.00],
    "OrderDate": ["2025-06-01", "2025-06-02", "2025-06-03", "2025-06-01", "2025-06-03"]
}

# Convert the dataset into a pandas DataFrame
# each key becomes a column name, and the values in its list become the cells of that column
df = pd.DataFrame(data)

### Inspect the Data
- Is data missing from some variables?
- Do the data types look appropriate?
- Do you see any outliers? A general rule of thumb: look for values that are very small or very large compared to the mean (more than 3 standard deviations away).
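The first two questions can be checked directly in code. The sketch below uses a hypothetical three-row table named `sample`, with one price deliberately left missing, because the demo dataset itself has no gaps.

```python
import pandas as pd

# Hypothetical sample with one missing price, for illustration only
sample = pd.DataFrame({
    "OrderID": [1001, 1002, 1003],
    "Price": [1200.00, None, 50.00],
})

# Count the missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Show the data type pandas inferred for each column
print(sample.dtypes)
```

A non-zero count flags a column that needs attention before analysis.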


In [None]:
# preview the data
print(df)

   OrderID  CustomerID   Product     Category  Quantity   Price   OrderDate
0     1001           1    Laptop  Electronics         1  1200.0  2025-06-01
1     1002           2     Mouse  Accessories         2    25.0  2025-06-02
2     1003           3  Keyboard  Accessories         1    50.0  2025-06-03
3     1004           1     Mouse  Accessories         1    25.0  2025-06-01
4     1005           2   Monitor  Electronics         1   300.0  2025-06-03


In [None]:
# print which variables are in the DataFrame, their inferred data types,
# and how many non-null (non-missing) values each one has
# object refers to string or text data
# note that OrderDate has been inferred as object rather than a date type, which is not ideal
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   OrderID     5 non-null      int64  
 1   CustomerID  5 non-null      int64  
 2   Product     5 non-null      object 
 3   Category    5 non-null      object 
 4   Quantity    5 non-null      int64  
 5   Price       5 non-null      float64
 6   OrderDate   5 non-null      object 
dtypes: float64(1), int64(3), object(3)
memory usage: 412.0+ bytes
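Since `OrderDate` was read in as text, one common fix is to convert it with `pd.to_datetime`. The sketch below rebuilds only the date column in a separate frame named `dates` (a name chosen for this example) and assumes every value follows the `YYYY-MM-DD` pattern seen in the data.

```python
import pandas as pd

# Same OrderDate values as the demo dataset, held in a standalone frame
dates = pd.DataFrame({
    "OrderDate": ["2025-06-01", "2025-06-02", "2025-06-03", "2025-06-01", "2025-06-03"]
})

# Parse the text into real dates; format assumes YYYY-MM-DD throughout
dates["OrderDate"] = pd.to_datetime(dates["OrderDate"], format="%Y-%m-%d")

print(dates.dtypes)

# Date columns support date operations such as finding the earliest order
earliest = dates["OrderDate"].min()
print(earliest)
```

After conversion, the column can be sorted chronologically, filtered by date range, or grouped by day or month.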


In [None]:
# summarize the numeric variables: count, mean, std dev, min, quartiles, and max
# if the min or max is more than 3 std dev from the mean, there may be outliers in the data
df.describe()

           OrderID  CustomerID  Quantity        Price
count     5.000000     5.00000  5.000000     5.000000
mean   1003.000000     1.80000  1.200000   320.000000
std       1.581139     0.83666  0.447214   505.408251
min    1001.000000     1.00000  1.000000    25.000000
25%    1002.000000     1.00000  1.000000    25.000000
50%    1003.000000     2.00000  1.000000    50.000000
75%    1004.000000     2.00000  1.000000   300.000000
max    1005.000000     3.00000  2.000000  1200.000000
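The 3-standard-deviation rule can be applied in code by computing a z-score for each value. With only five rows, though, no value can sit more than about 1.79 sample standard deviations from the mean (the bound is (n-1)/sqrt(n)), so the rule cannot flag anything in this dataset. The sketch below computes the z-scores for `Price` and then, as an alternative that the steps above do not cover, applies the common 1.5 × IQR fence rule for comparison.

```python
import pandas as pd

# Price values from the demo dataset
price = pd.Series([1200.00, 25.00, 50.00, 25.00, 300.00], name="Price")

# z-score: distance from the mean in units of the sample standard deviation
z_scores = (price - price.mean()) / price.std()
flagged_by_z = price[z_scores.abs() > 3]
print(z_scores.round(2).tolist())
print("Flagged by 3-std rule:", flagged_by_z.tolist())      # → []

# IQR rule (an alternative technique): flag values beyond 1.5 * IQR from the quartiles
q1 = price.quantile(0.25)
q3 = price.quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
flagged_by_iqr = price[(price < lower_fence) | (price > upper_fence)]
print("Flagged by IQR rule:", flagged_by_iqr.tolist())      # → [1200.0]
```

Here the laptop price of 1200 passes the z-score check but falls above the upper IQR fence of 712.5. Whether it is a true outlier or simply an expensive product is a judgment call that depends on the business context.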


### Basic Data Exploration

In [None]:
# Total of the Price column per product (note: this does not multiply by Quantity)
df.groupby("Product")["Price"].sum()
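Summing `Price` alone ignores how many units were ordered. If `Price` is the price of a single unit (an assumption, since the dataset does not say), sales per product would multiply it by `Quantity` first. The sketch below does that in a frame named `orders`, with a helper column `LineTotal`; both names are introduced here for illustration.

```python
import pandas as pd

# The three columns needed from the demo dataset
orders = pd.DataFrame({
    "Product": ["Laptop", "Mouse", "Keyboard", "Mouse", "Monitor"],
    "Quantity": [1, 2, 1, 1, 1],
    "Price": [1200.00, 25.00, 50.00, 25.00, 300.00],
})

# Assumption: Price is per unit, so each order line is worth Quantity * Price
orders["LineTotal"] = orders["Quantity"] * orders["Price"]

# Sum the line totals within each product
revenue_by_product = orders.groupby("Product")["LineTotal"].sum()
print(revenue_by_product)
```

Under that assumption, Mouse comes to 75.00 (two units in one order plus one unit in another) rather than the 50.00 obtained by summing `Price` alone.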

In [None]:
# Average quantity purchased by category
df.groupby("Category")["Quantity"].mean()
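Several summaries per group can also be computed in one step with pandas named aggregation. This is a minimal sketch; the output column names `n_orders`, `avg_quantity`, and `total_price` are chosen here for readability.

```python
import pandas as pd

# Columns needed from the demo dataset
orders = pd.DataFrame({
    "OrderID": [1001, 1002, 1003, 1004, 1005],
    "Category": ["Electronics", "Accessories", "Accessories", "Accessories", "Electronics"],
    "Quantity": [1, 2, 1, 1, 1],
    "Price": [1200.00, 25.00, 50.00, 25.00, 300.00],
})

# One row per category, with one output column per (source column, function) pair
summary = orders.groupby("Category").agg(
    n_orders=("OrderID", "count"),
    avg_quantity=("Quantity", "mean"),
    total_price=("Price", "sum"),
)
print(summary)
```

Putting the counts next to the averages makes it easier to see when a group mean rests on only two or three rows.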