Consider the dataset below.

In [None]:
import pandas as pd

df = pd.read_csv('/kaggle/input/stores-preprocessed-version/StoresPrep.csv')

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns

In [None]:
df.info()

In [None]:
df.head(5)

1. Compute descriptive statistics (basic and advanced statistics, or a frequency table) for all categorical and continuous variables, with one suitable graph for each.

In [None]:
# Separating categorical and continuous variables

categorical_var = ['Property', 'Type', 'Old/New']
continuous_var = ['AreaStore', 'Checkout Number', 'Revenue']

In [None]:
# Descriptive Statistics: Categorical Variables


# Frequency table for categorical variables

def frequency_table(variable):

    # Get unique elements and their counts
    unique_elements, counts = np.unique(variable.dropna(), return_counts=True)

    # Calculate percentages relative to the non-missing observations,
    # so that they are consistent with the counts above and sum to 100
    percentages = (counts / counts.sum()) * 100

    # Pair each unique value with its count and percentage
    value_counts_and_percentages = zip(unique_elements, counts, percentages)

    # Print the value counts and percentages
    for i, j, k in value_counts_and_percentages:
        print(f"{i}: Count: {j}, Percentage: {k:.2f}%")


# Calculate frequency table and mode for each categorical variable
for var in categorical_var:
    print(f"Frequency table for {var}")
    frequency_table(df[var])
    print("Mode =", df[var].mode()[0])
    print("#" * 50)

In [None]:
# Chart for categorical variables

def categorical_bar_charts(cat_var, df):
    n = len(cat_var)

    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4))

    for i, var in enumerate(cat_var):
        counts = df[var].dropna().value_counts()
        axes[i].bar(counts.index, counts.values, color='green')
        axes[i].set_title(f"{var}")
        axes[i].tick_params(axis='x', rotation=45)

    plt.tight_layout()
    plt.show()


categorical_bar_charts(categorical_var, df)

In [None]:
# Descriptive Statistics: Continuous Variables


def describe_continuous(df, continuous_var, confidence_level=0.95):
    # Calculate basic statistics
    desc = df[continuous_var].describe().T

    # Skewness and kurtosis (pandas reports excess kurtosis, so a normal distribution gives 0)
    skewness = df[continuous_var].skew().rename('skewness')
    kurtosis = df[continuous_var].kurtosis().rename('kurtosis')

    # Combining all the statistics
    summary = pd.concat([desc, skewness, kurtosis], axis=1)

    # Confidence interval
    for var in continuous_var:
        values = df[var].dropna()
        mean = values.mean()
        std_error = stats.sem(values)

        if std_error != 0 and len(values) > 1:
            lower, upper = stats.t.interval(confidence_level, len(values) - 1, loc=mean, scale=std_error)
        else:
            lower, upper = mean, mean

        summary.loc[var, 'lower_ci'] = lower
        summary.loc[var, 'upper_ci'] = upper

    return summary




result = describe_continuous(df, continuous_var)
print(result)

In [None]:
# Chart for continuous variables

def continuous_boxplots(cont_var, df):
    n = len(cont_var)

    fig, axes = plt.subplots(1, n, figsize=(5 * n, 4))

    for i, var in enumerate(cont_var):
        axes[i].boxplot(df[var].dropna(), vert=True, patch_artist=True, boxprops=dict(facecolor='skyblue'))
        axes[i].set_title(f"{var}")
        axes[i].set_ylabel("Values")

    plt.tight_layout()
    plt.show()




continuous_boxplots(continuous_var, df)

2. Examine the relationship between 'Old/New' and 'Type' using a cross-tabulation and a count plot. Describe the output and give your opinion about the relationship.

In [None]:
# Cross Tab between Old/New & Type

cross_tab_Old_New_Type = pd.crosstab(df['Old/New'], df['Type'], normalize='index')
print(cross_tab_Old_New_Type)

In [None]:
# Count Plot: Old/New & Type

fig, ax = plt.subplots(figsize=(5, 4))
sns.countplot(data=df, x='Type', hue='Old/New', ax=ax)
plt.title("Count of Type by Old/New")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

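The store status (Old or New) appears to be associated with the type of product or service offered (Type): the cross-tabulation gives the share of each Type within old and within new stores, and the count plot shows the same comparison in absolute counts. This is an association rather than proof that store status determines Type, and a formal test of independence would be needed to confirm it. Even so, the pattern can be useful when making decisions about marketing strategies or resource allocation.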
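
One way to check whether an association like this is more than chance is a chi-square test of independence on the raw counts. The sketch below is illustrative only: the rows in `toy` are made up and are not taken from StoresPrep.csv, and the Type labels `"A"` and `"B"` are hypothetical placeholders. Note that the test needs counts, not the row proportions produced by `normalize='index'`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data for illustration only; these rows are NOT from StoresPrep.csv,
# and the Type labels "A"/"B" are hypothetical placeholders.
toy = pd.DataFrame({
    "Old/New": ["Old"] * 6 + ["New"] * 6,
    "Type": ["A", "A", "A", "A", "B", "B", "A", "B", "B", "B", "B", "B"],
})

# Contingency table of raw counts (no normalization)
observed = pd.crosstab(toy["Old/New"], toy["Type"])

# chi2_contingency returns the statistic, the p-value, the degrees of
# freedom, and the expected counts under independence
chi2, p_value, dof, expected = chi2_contingency(observed)

print(observed)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

With the real data, the same call would take `pd.crosstab(df['Old/New'], df['Type'])`. When expected counts are small, as in this toy table, the chi-square approximation is rough and the p-value should be read with caution.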

3. Examine the relationship between 'Old/New' and 'Revenue' using a pivot table and a violin plot. Describe the output and give your opinion about the relationship.

In [None]:
# Pivot Table between Old/New & Revenue

pivot_table_Old_New_Revenue = df.pivot_table(values='Revenue', index='Old/New', aggfunc=['min','max','mean','std'])
print(pivot_table_Old_New_Revenue)

In [None]:
# Violin Plot: Old/New & Revenue

plt.figure(figsize=(6, 4))
sns.violinplot(data=df, x='Old/New', y='Revenue', palette='Set2')

plt.title('Violin Plot of Revenue by Old/New')
plt.xlabel('Old/New')
plt.ylabel('Revenue')
plt.grid(True)
plt.tight_layout()
plt.show()

Based on the pivot table, we see that the average revenue (`mean`) differs between the 'Old' and 'New' store categories, with some variation in minimum, maximum, and standard deviation values as well.

The violin plot illustrates the distribution of revenue for both 'Old' and 'New' stores. It shows not only the central tendency but also the spread and density of the revenue data. For example, one category may have a wider spread or more pronounced peaks, indicating variability or concentration in certain revenue ranges.


There appears to be a noticeable difference in revenue between old and new stores, which suggests that store age/status is associated with revenue performance. Because other factors may differ between the two groups, this should not be read as a causal effect on its own. Understanding the relationship can still help in targeting business strategies, such as improving revenue in underperforming stores or leveraging strengths in others.
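
A difference in group means read off a pivot table can be checked with a two-sample test. The following is a minimal sketch under stated assumptions: the revenue figures in `toy` are invented for illustration and are not from the dataset, and the choice of Welch's t-test plus a Mann-Whitney U test is a suggestion, not part of the original analysis.

```python
import pandas as pd
from scipy import stats

# Toy data for illustration only; these values are NOT from StoresPrep.csv.
toy = pd.DataFrame({
    "Old/New": ["Old"] * 5 + ["New"] * 5,
    "Revenue": [100.0, 120.0, 110.0, 130.0, 115.0,
                140.0, 160.0, 150.0, 170.0, 155.0],
})

old_rev = toy.loc[toy["Old/New"] == "Old", "Revenue"]
new_rev = toy.loc[toy["Old/New"] == "New", "Revenue"]

# Welch's t-test compares the two means without assuming equal variances
t_stat, p_welch = stats.ttest_ind(old_rev, new_rev, equal_var=False)

# Mann-Whitney U is a rank-based alternative that does not rely on
# normality, which matters when revenue is skewed
u_stat, p_mw = stats.mannwhitneyu(old_rev, new_rev, alternative="two-sided")

print(f"Welch t = {t_stat:.3f}, p = {p_welch:.4f}")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_mw:.4f}")
```

On the real data, `old_rev` and `new_rev` would be selected from `df` in the same way, after dropping missing revenue values.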


4. Make an appropriate graph showing the relationship between 'Property', 'Old/New' and 'Revenue'. Describe the output and give your opinion about the relationship.

In [None]:
# Box Plot: relationship between Property, Old/New and Revenue

plt.figure(figsize=(10, 6))
sns.boxplot(data=df, x='Property', y='Revenue', hue='Old/New', palette='Set2')

plt.title('Revenue Distribution by Property and Old/New')
plt.xlabel('Property Type')
plt.ylabel('Revenue')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

In [None]:
# Violin Plot: relationship between Property, Old/New and Revenue

plt.figure(figsize=(10, 6))
sns.violinplot(data=df, x='Property', y='Revenue', hue='Old/New', split=True, palette='Set2')

plt.title('Violin Plot of Revenue by Property and Old/New')
plt.xlabel('Property Type')
plt.ylabel('Revenue')
plt.xticks(rotation=45)
plt.grid(True)
plt.tight_layout()
plt.show()

The box plot illustrates how the distribution of Revenue varies across different Property types and between Old and New stores.

The plot shows that, within each Property category, Old and New stores may differ in median revenue (the line inside each box) and in variability (the height of the box and the length of the whiskers). Some Property types exhibit higher median revenues than others, and the gap between Old and New stores is not the same in every category. In certain groups the range of revenue values is also wider, indicating diverse performance among stores of the same Property type.


Both the store status (Old vs. New) and the Property type appear to be related to Revenue, and the effect of one seems to depend on the other. The plots show association only, so this reading is tentative. It can still help managers tailor their sales and marketing strategies according to Property type and store status, allowing for better resource allocation.
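
The medians that a grouped box plot displays can also be tabulated directly, which makes the comparison between groups easier to quote. The sketch below uses invented rows, and the Property labels `"Owned"` and `"Rented"` are hypothetical placeholders rather than values confirmed to be in the dataset.

```python
import pandas as pd

# Toy data for illustration only; these rows are NOT from StoresPrep.csv,
# and the Property labels "Owned"/"Rented" are hypothetical placeholders.
toy = pd.DataFrame({
    "Property": ["Owned"] * 4 + ["Rented"] * 4,
    "Old/New": ["Old", "Old", "New", "New"] * 2,
    "Revenue": [100.0, 120.0, 150.0, 170.0, 80.0, 90.0, 95.0, 105.0],
})

# One row per (Property, Old/New) group with size, center, and spread
summary = (
    toy.groupby(["Property", "Old/New"])["Revenue"]
       .agg(["count", "median", "mean", "std"])
)

# Medians laid out as Property x Old/New, mirroring the box plot
medians = summary["median"].unstack("Old/New")

# If the New-minus-Old gap differs across Property types, the two
# factors interact rather than acting independently
gap = medians["New"] - medians["Old"]

print(summary)
print(medians)
print(gap)
```

In this toy table the gap is 50 for one Property type and 15 for the other, which is the kind of pattern that would support reading the two variables together rather than separately.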


5. Create a pair plot to investigate the pairwise relationships among 'AreaStore', 'Checkout Number' and 'Revenue'. Describe the output and give your opinion about the relationship.

In [None]:

selected_var = ['AreaStore', 'Checkout Number', 'Revenue']

sns.pairplot(df[selected_var], kind='scatter', diag_kind='kde', plot_kws={'alpha': 0.6})
plt.suptitle("Pairwise Relationships: AreaStore, Checkout Number, Revenue", y=1.02)
plt.show()

The pair plot shows the pairwise relationships among the variables 'AreaStore', 'Checkout Number', and 'Revenue'.

From the scatter plots, we can observe positive correlations between these variables: larger store areas tend to have more checkouts, and both larger areas and higher checkout numbers are associated with higher revenue. The diagonal KDE plots show the distribution of each variable, indicating some skewness, especially in 'Revenue'.


This suggests that store size and the number of checkouts are important factors related to revenue generation. Increasing either the store area or the number of checkouts could potentially boost revenue, but the two predictors are themselves correlated, so further analysis is needed to separate their contributions and to confirm causation.
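
The visual impression from a pair plot can be backed by correlation coefficients. This is a minimal sketch on invented numbers, not on the actual dataset. Spearman's rank correlation is included alongside Pearson's as an assumption-light alternative, since skewed variables can distort the Pearson coefficient.

```python
import pandas as pd

# Toy data for illustration only; these values are NOT from StoresPrep.csv.
toy = pd.DataFrame({
    "AreaStore": [100, 150, 200, 250, 300, 400],
    "Checkout Number": [2, 3, 3, 4, 5, 6],
    "Revenue": [50.0, 70.0, 95.0, 110.0, 160.0, 260.0],
})

# Pearson measures linear association
pearson = toy.corr(method="pearson")

# Spearman works on ranks, so it captures any monotonic relationship
# and is less sensitive to skewness and outliers
spearman = toy.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

On the real data the same two calls would be applied to `df[selected_var]`.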
