# Supermarket Ordering, Invoicing, and Sales

Joel Day, Nicholas Lee, and Christine Vu

Shiley-Marcos School of Engineering, University of San Diego

ADS 507: Practical Data Engineering

Professor Jonathan Sixt

February 8, 2023

***

## Data Description

### Invoices.csv

| Variable | Description  |
| --- | --- |
| Order Id | The order identification number |
| Date | The date the order was placed |
| Meal Id | The meal identification number |
| Company Id | The company identification number |
| Date of Meal | The date the meal was served |
| Participants | The number of people who participated in the meal |
| Meal Price | The cost of the meal |
| Type of Meal | The type of meal that was ordered |

### OrderLeads.csv

| Variable | Description  |
| --- | --- |
| Order Id | The order identification number |
| Company Id | The company identification number |
| Company Name | The name of the company associated with the order |
| Date | The date the order was placed |
| Order Value | The total value of the order |
| Converted | Whether or not the order was converted into a sale |

### SalesTeam.csv

| Variable | Description  |
| --- | --- |
| Sales Rep | The name of the sales representative |
| Sales Rep Id | The sales representative identification number |
| Company Name | The name of the company associated with the order |
| Company Id | The company identification number |

***

## Data Importing and Pre-processing

In [None]:
# Packages

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import re

import pymysql
import requests
import io
import os

import warnings
warnings.filterwarnings("ignore")

Import CSV files

In [None]:
# Pull a raw CSV file from GitHub and convert it to a pandas DataFrame

def github_to_pandas(raw_git_url):
    # Download the raw CSV; fail early on HTTP errors (e.g. a 404 page)
    response = requests.get(str(raw_git_url))
    response.raise_for_status()

    # Parse the decoded CSV text into a pandas DataFrame
    csv_df = pd.read_csv(io.StringIO(response.content.decode('utf-8')))

    return csv_df

In [None]:
# Pull CSV files from GitHub and Convert to Pandas Dataframe
invoice_df = github_to_pandas(
    "https://raw.githubusercontent.com/nlee98/ADS-507-Data-Engineering/main/Invoices.csv")

orderleads_df = github_to_pandas(
    "https://raw.githubusercontent.com/nlee98/ADS-507-Data-Engineering/main/OrderLeads.csv")

salesteam_df = github_to_pandas(
    "https://raw.githubusercontent.com/nlee98/ADS-507-Data-Engineering/main/SalesTeam.csv")

### Data Pre-processing

In [None]:
# Find missing values
print("- Invoice Missing Values:\n", invoice_df.isnull().sum())
print("\n- Order Leads Missing Values:\n", orderleads_df.isnull().sum())
print("\n- Sales Team Missing Values:\n", salesteam_df.isnull().sum())

In [None]:
# Data types of all columns
print("- Invoice Data Types:\n", invoice_df.dtypes)
print("\n- Order Leads Data Types:\n", orderleads_df.dtypes)
print("\n- Sales Team Data Types:\n", salesteam_df.dtypes)

In [None]:
# Duplicated data
print("- Invoice Duplicated Values:", invoice_df.duplicated().sum())
print("- Order Leads Duplicated Values:", orderleads_df.duplicated().sum())
print("- Sales Team Duplicated Values:", salesteam_df.duplicated().sum())

***

## Explore CSV Files

### Invoice CSV

In [None]:
invoice_df.head(3)

#### Transformations
* Replace spaces in column names with underscores
* Convert Date and Date of Meal to date/datetime data types
* Add a part-of-day column derived from Date of Meal
* Add a number-of-participants column

In [None]:
# Replace spaces with underscores in all dataframe column names
invoice_df.columns = invoice_df.columns.str.replace(" ", "_")
orderleads_df.columns = orderleads_df.columns.str.replace(" ", "_")
salesteam_df.columns = salesteam_df.columns.str.replace(" ", "_")

In [None]:
# Convert Date from a day-month-year string ("%d-%m-%Y") to datetime
invoice_df["Date"] = pd.to_datetime(
    invoice_df["Date"], format='%d-%m-%Y')

In [None]:
# Drop the trailing UTC offset (everything after "+") so all values share one format;
# this keeps the local wall-clock time and does not convert to UTC
invoice_df["Date_of_Meal"] = invoice_df["Date_of_Meal"].apply(
    lambda x: x.split("+")[0]
)

# Convert Date_of_Meal to Datetime format
invoice_df["Date_of_Meal"] = pd.to_datetime(
    invoice_df["Date_of_Meal"],
    format = "%Y-%m-%d %H:%M:%S"
)
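
The effect of stripping the offset can be checked with the standard library alone. The sketch below uses a made-up timestamp in the `YYYY-MM-DD HH:MM:SS+HH:MM` layout that the cell above assumes (it is not a row from Invoices.csv) and contrasts dropping the offset, as the notebook does, with actually converting to UTC.

```python
from datetime import datetime, timezone

# Hypothetical sample value; the real column may differ.
sample = "2015-01-29 08:00:00+01:00"

# Notebook approach: discard everything after "+" and parse the wall-clock time.
naive_local = datetime.strptime(sample.split("+")[0], "%Y-%m-%d %H:%M:%S")

# Alternative: keep the offset, convert to UTC, then drop the timezone info.
aware = datetime.fromisoformat(sample)
naive_utc = aware.astimezone(timezone.utc).replace(tzinfo=None)

print(naive_local)  # wall-clock time as written in the string
print(naive_utc)    # the same instant expressed in UTC
```

Keeping the wall-clock time is arguably the better fit for a part-of-day label, since a meal served at 08:00 local time is a morning meal whatever the offset; a true UTC conversion would matter only if timestamps from different offsets had to be compared as instants.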

In [None]:
# Map the hour of a timestamp to a named part of the day
def time_of_day(x):
    day_hour = x.hour
    if (day_hour >= 5) and (day_hour <= 8): # 5am - 8am
        return "Early Morning"
    elif (day_hour > 8) and (day_hour <= 12): # 9am - 12pm
        return "Late Morning"
    elif (day_hour > 12) and (day_hour <= 15): # 1pm - 3pm
        return "Early Afternoon"
    elif (day_hour > 15) and (day_hour <= 19): # 4pm - 7pm
        return "Evening"
    elif (day_hour > 19) and (day_hour <= 23): # 8pm - 11pm
        return "Night"
    else: # 12am - 4am
        return "Late Night"
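
The boundaries of the if/elif chain are easy to get wrong by one hour, so a table-driven variant can serve as a cross-check. The sketch below is one possible alternative, not the notebook's method: `part_of_day`, `UPPER_BOUNDS`, and `LABELS` are hypothetical names, and the function takes a plain integer hour rather than a timestamp.

```python
from bisect import bisect_left

# Inclusive upper hour bound of each bucket, matching the if/elif chain above.
UPPER_BOUNDS = [4, 8, 12, 15, 19, 23]
LABELS = ["Late Night", "Early Morning", "Late Morning",
          "Early Afternoon", "Evening", "Night"]

def part_of_day(hour):
    """Return the bucket label for an hour in 0..23 (hypothetical helper)."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be in 0..23, got {hour}")
    # bisect_left finds the first bucket whose upper bound is >= hour.
    return LABELS[bisect_left(UPPER_BOUNDS, hour)]

# Print the label at each boundary hour to inspect the bucket edges.
for hour in (0, 4, 5, 8, 9, 12, 13, 15, 16, 19, 20, 23):
    print(hour, part_of_day(hour))
```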

In [None]:
# Apply time_of_day function to Date_of_Meal column

invoice_df["Part_of_Day"] = invoice_df["Date_of_Meal"].apply(time_of_day)

In [None]:
# Add a field counting participants: each name is wrapped in a pair of single quotes,
# so the number of names is half the number of quote characters
invoice_df['Number_of_Participants'] = invoice_df['Participants'].str.count("'") // 2

invoice_df.head(5)

### Unique Customer Names and Table
Create a table with each unique customer and use the row index plus one as the customer id.
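
As a rough illustration of that idea without pandas, the sketch below builds the unique-customer list from a few invented participant strings in the `['name' 'name2']` layout. The sample names and the variable names are assumptions made for the example, not values from Invoices.csv.

```python
import re

# Hypothetical raw values in the "['name' 'name2']" string format.
participants_column = [
    "['Ana Silva' 'Ben Ortiz']",
    "['Ben Ortiz']",
    "['Carla de la Cruz' 'Ana Silva']",
]

# Flatten every quoted name into one list, keeping duplicates for now.
all_names = [
    name
    for raw in participants_column
    for name in re.findall(r"'(.*?)'", raw)
]

# dict.fromkeys drops duplicates while preserving first-appearance order.
unique_names = list(dict.fromkeys(all_names))

# Row index plus one becomes the customer id; split on the first space only
# so that multi-word last names stay intact.
customers = []
for customer_id, name in enumerate(unique_names, start=1):
    first_name, _, last_name = name.partition(" ")
    customers.append(
        {"customer_id": customer_id, "first_name": first_name, "last_name": last_name}
    )

for row in customers:
    print(row)
```

Note that the quote-based pattern would split a name containing an apostrophe, so real data may need a stricter parser.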

In [None]:
'''
# Function to convert string ['name' 'name2'] to list ['name', 'name2']
# Returns a list of participant names
def string_to_list(participant_string):
    return re.findall(r"'(.*?)'", participant_string)

invoice_df["Participants"] = invoice_df["Participants"].apply(string_to_list)
'''

In [None]:
'''
# Obtain an array of all unique customer names
customers = invoice_df["Participants"].explode().unique()

# Create new customer dataframe
customers_df = pd.DataFrame(
    customers,
    columns = ["CustomerName"]
)

# Add customer id
customers_df["customer_id"] = customers_df.index + 1

# Create a first_name and last_name column
customers_df["first_name"] = customers_df["CustomerName"].apply(lambda x: x.split(" ")[0])
# Join everything after the first token in case the person has multiple last names
customers_df["last_name"] = customers_df["CustomerName"].apply(lambda x: " ".join(x.split(" ")[1:]))
'''

### Customer-Order Table
Connect the customer id to each order id the customer placed. This table will link the customer information to the invoice information.

In [None]:
# Find all the occurrences of customer names then explode to convert values in lists to rows
cust = invoice_df['Participants'].str.findall(r"'(.*?)'").explode()

# Join with order id 
cust_order_df = invoice_df[['Order_Id']].join(cust)

# Factorize to encode the unique values in participants
cust_order_df['Customer_Id'] = cust_order_df['Participants'].factorize()[0] + 1

cust_order_df.head(9)
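
The explode, join, and factorize steps in the cell above can be traced in plain Python to see what each contributes. This is only an illustrative sketch with invented order ids and names: exploding yields one row per participant, joining repeats the order id on each of those rows, and factorizing numbers each distinct name by first appearance.

```python
import re

# Hypothetical invoices: (Order_Id, raw Participants string).
invoices = [
    ("ORDER-A", "['Ana Silva' 'Ben Ortiz']"),
    ("ORDER-B", "['Ben Ortiz']"),
    ("ORDER-C", "['Cy Young' 'Ana Silva']"),
]

codes = {}  # name -> Customer_Id, numbered by first appearance (factorize + 1)
rows = []   # one (Order_Id, Participants, Customer_Id) row per participant

for order_id, raw in invoices:
    for name in re.findall(r"'(.*?)'", raw):
        # setdefault assigns the next id only the first time a name is seen.
        customer_id = codes.setdefault(name, len(codes) + 1)
        rows.append((order_id, name, customer_id))

for row in rows:
    print(row)
```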

In [None]:
'''
cust_order_df = pd.DataFrame(columns = ["cust_id", "order_id"])


for i in range(0, 10):
    # Pulls in the row list of participant(s)
    customer_list = invoice_df["Participants"][i]
    # Corresponding order_id
    order_id = invoice_df["Order_Id"][i]
    for j in range(0, len(customer_list)):
        # Iterates over each name in the row list
        name = customer_list[j]
        # Get customer_id
        cust_id = customers_df.loc[customers_df["CustomerName"] == name]
        cust_order_df.loc[len(cust_order_df.index)] = [cust_id, order_id]
'''

In [None]:
'''
cust_order_df = pd.DataFrame(columns = ["cust_id", "order_id"])


for i in range(0, len(invoice_df["Participants"])):
    # Pulls in the row list of participant(s)
    customer_list = invoice_df["Participants"][i]
    # Corresponding order_id
    order_id = invoice_df["Order_Id"][i]
    for j in range(0, len(customer_list)):
        # Iterates over each name in the row list
        name = customer_list[j]
        # Get customer_id belonging to the name
        cust_id = customers_df.loc[
            customers_df["CustomerName"] == name, "customer_id"
            ].item()
        # Add customer_id and order_id to dataframe
        cust_order_df.loc[len(cust_order_df.index)] = [cust_id, order_id]
'''

### Order Leads CSV
* Converted column: whether or not an order was converted into a sale

In [None]:
orderleads_df.head(3)

In [None]:
orderleads_df.loc[orderleads_df["Order_Id"] == "839FKFW2LLX4LMBB"]

### Sales Team CSV

In [None]:
salesteam_df.head(3)

***

## Connection to MySQL Server

In [None]:
# Log in to MySQL manually (getpass keeps the password from being echoed)
from getpass import getpass

mysql_username = input("Enter MySQL Username: ")
mysql_password = getpass("Enter MySQL Password: ")

mysql_conn = pymysql.connect(
    host = "localhost",
    port = 3306,
    user = mysql_username,
    password = mysql_password
)
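
For unattended runs, where nobody is present to type credentials, one option is to read them from environment variables and fall back to a prompt. The sketch below assumes the variable names `MYSQL_USER` and `MYSQL_PASSWORD`, which the notebook does not define, and `read_mysql_credentials` is a hypothetical helper.

```python
import os
from getpass import getpass

def read_mysql_credentials(env=os.environ, ask=getpass):
    """Return (user, password), preferring environment variables.

    MYSQL_USER and MYSQL_PASSWORD are assumed names for this sketch.
    """
    # Fall back to interactive prompts only when a variable is missing or empty.
    user = env.get("MYSQL_USER") or input("Enter MySQL Username: ")
    password = env.get("MYSQL_PASSWORD") or ask("Enter MySQL Password: ")
    return user, password
```

The returned pair could then be passed to `pymysql.connect` in place of the prompted values.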

### Create Supermarket Database - if it does not already exist

In [None]:
# Create the ADS_507_Supermarket MySQL database
with mysql_conn.cursor() as cursor:
    cursor.execute("CREATE DATABASE IF NOT EXISTS ADS_507_Supermarket;")

# Navigate to Supermarket Database
mysql_conn.select_db("ADS_507_Supermarket")

## Upload dataframes as tables into MySQL
* Invoice
* Orders
* Sales Lead
* Customer
* Customer-order
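
One way the upload step might be approached is with parameterized inserts over a DB-API connection such as the PyMySQL one opened earlier. The sketch below is an assumption about how this could be done, not the notebook's implementation: `build_insert_sql` and `upload_rows` are hypothetical helpers, and the target table is assumed to exist already.

```python
def build_insert_sql(table, columns, placeholder="%s"):
    """Build a parameterized INSERT statement.

    Table and column names cannot be bound as parameters, so they must come
    from trusted code (for example the dataframe's column names), never from
    user input. PyMySQL uses "%s" as its placeholder.
    """
    column_list = ", ".join(f"`{col}`" for col in columns)
    value_list = ", ".join([placeholder] * len(columns))
    return f"INSERT INTO `{table}` ({column_list}) VALUES ({value_list})"


def upload_rows(conn, table, columns, rows, placeholder="%s"):
    """Insert rows through a DB-API connection and commit once at the end."""
    cursor = conn.cursor()
    try:
        cursor.executemany(build_insert_sql(table, columns, placeholder), rows)
        conn.commit()
    finally:
        cursor.close()
    return len(rows)
```

Each dataframe would be passed as its list of column names plus a list of row tuples holding plain Python values. The `placeholder` argument exists only so the same helper can be tried against other DB-API drivers that use a different marker.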