# Statistical Insights and Hypothesis Testing: A Study on Northwind Database
### Applying Hypothesis Tests to the Northwind Sample Database for Data-Driven Insights

### Introduction

This project analyzes the Northwind sample database to extract actionable business insights. Northwind models the operations of a trading company, with tables covering customers, employees, orders, products, suppliers, and shippers.

### Project Goal

The goal is to use statistical hypothesis testing to check assumptions about business operations against the data. The results can then inform business strategy.

### What is Hypothesis Testing?

Hypothesis testing is a statistical method used to make inferences or educated guesses about a population based on a sample of data. In the context of business, it can be used to validate assumptions made about sales, customer behavior, and other operational metrics.

- **Null Hypothesis (H0):** This is the initial assumption that there is no effect or relationship between variables. It serves as the starting point that we aim to test against.

- **Alternative Hypothesis (Ha):** This is the claim we look for evidence to support. It contradicts the null hypothesis and states that an effect or relationship exists. A test never proves Ha; it either rejects H0 in favor of Ha or fails to reject H0.

**Example:** If we assume that the average spending of customers from two different regions is the same, that's our null hypothesis (H0). The alternative hypothesis (Ha) would be that the average spending is different between these two regions.
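
As a minimal sketch of this example, the snippet below runs Welch's two-sample t-test with `scipy.stats.ttest_ind`. The spending figures are synthetic, and the names `region_a` and `region_b` and the 0.05 significance level are illustrative assumptions, not values taken from Northwind.

```python
import numpy as np
from scipy import stats

# Synthetic spending per customer (hypothetical values, not Northwind data)
rng = np.random.default_rng(seed=0)
region_a = rng.normal(loc=100, scale=15, size=200)
region_b = rng.normal(loc=110, scale=15, size=200)

# H0: the two regions have the same mean spending
# Ha: the mean spending differs (two-sided test)
# equal_var=False selects Welch's t-test, which does not assume equal variances
result = stats.ttest_ind(region_a, region_b, equal_var=False)

alpha = 0.05  # assumed significance level
reject_h0 = result.pvalue < alpha

print(f"t = {result.statistic:.3f}, p = {result.pvalue:.4g}")
print("Reject H0" if reject_h0 else "Fail to reject H0")
```

A p-value below `alpha` leads to rejecting H0; otherwise we fail to reject it, which is not the same as showing that the means are equal.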

### Types of Statistical Tests

We will employ a variety of statistical tests to answer different types of questions:

- **T-Tests:** Used for comparing the means between two groups.

- **Chi-Square Tests:** Ideal for examining the relationship between categorical variables.

- **ANOVA:** Useful for comparing means across three or more groups.

- **Z-Tests:** Employed when the sample size is large and the population variance is known, to compare a sample mean with a population mean.

- **F-Tests:** Used to compare the variances of two different samples.

- **Regression Analysis:** Applied for predicting outcomes based on relationships between variables.
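
As a rough guide to how two of these tests map onto code, the sketch below calls `scipy.stats.chi2_contingency` and `scipy.stats.f_oneway` on small made-up inputs. The counts, group names, and values are illustrative assumptions, not Northwind results.

```python
import numpy as np
from scipy import stats

# Chi-square test of independence on a hypothetical 2x3 contingency table
# (rows: two customer segments, columns: three product categories)
observed = np.array([
    [30, 45, 25],
    [35, 30, 35],
])
chi2, p_chi2, dof, expected = stats.chi2_contingency(observed)
print(f"chi-square = {chi2:.3f}, dof = {dof}, p = {p_chi2:.4f}")

# One-way ANOVA on three hypothetical groups of order values
group_1 = [20.1, 22.4, 19.8, 23.0, 21.5]
group_2 = [25.3, 27.1, 24.8, 26.0, 25.9]
group_3 = [20.5, 21.0, 22.2, 19.9, 21.7]
anova = stats.f_oneway(group_1, group_2, group_3)
print(f"F = {anova.statistic:.3f}, p = {anova.pvalue:.4g}")
```

The chi-square test takes counts per category combination, while ANOVA takes the raw measurements for each group, so the shape of the data usually narrows down which test applies.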

### Importance of Statistical Testing

Statistical tests are crucial for several reasons:

- **Validation:** They help validate or invalidate business assumptions, lending credibility to strategies.

- **Insight Generation:** They can uncover hidden trends and relationships in the data.

- **Risk Mitigation:** They provide a data-backed approach to decision-making, reducing business risks.

- **Strategic Planning:** Empirical data supports the formulation of more effective and targeted business strategies.

Let's dive into the data and start our journey of discovery!


In [1]:
# Import the libraries used throughout the notebook
import os
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib.patches import Rectangle

# Plot style (alternative: plt.style.use('ggplot'))
plt.style.use('fivethirtyeight')

# Suppress library warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

In [2]:
# Raw string so the backslashes in the Windows path are not read as escape sequences
os.listdir(r'D:\Statistical_analysis\Statistical-Analysis-on-Northwind-Sample-Database\Data')

['categories.csv',
 'customers.csv',
 'employees.csv',
 'employee_territories.csv',
 'orders.csv',
 'orders_details.csv',
 'products.csv',
 'regions.csv',
 'shippers.csv',
 'suppliers.csv',
 'territories.csv']

In [6]:
# Load data
# Table names; each matches a CSV filename without the extension
table_names = [
    'categories',
    'customers',
    'employees',
    'employee_territories',
    'orders',
    'orders_details',
    'products',
    #'regions',      # loaded separately in the next cell
    'shippers',
    'suppliers',
    #'territories'   # loaded separately in the next cell
]

# Base directory that contains the CSV files
base_directory = r'D:\Statistical_analysis\Statistical-Analysis-on-Northwind-Sample-Database\Data'

# Read each CSV file into a DataFrame bound to a variable named after its table
for table in table_names:
    # Construct the full path to the CSV file
    csv_file_path = os.path.join(base_directory, f'{table}.csv')

    # Assign through globals() rather than exec(): pasting a Windows path into
    # a code string turns sequences such as '\r' and '\t' into control characters
    globals()[table] = pd.read_csv(csv_file_path)

In [7]:
# Load the remaining tables, reusing base_directory from the previous cell
regions = pd.read_csv(os.path.join(base_directory, 'regions.csv'))
territories = pd.read_csv(os.path.join(base_directory, 'territories.csv'))
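
The two loading cells above could also be folded into one helper that returns a dictionary of DataFrames instead of creating one variable per table. This is only a sketch: `load_tables` is a hypothetical name, and the demonstration writes two tiny made-up CSV files to a temporary folder instead of reading the real Northwind data.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_tables(base_directory, table_names):
    """Read '<name>.csv' for each table name into a dict of DataFrames."""
    base = Path(base_directory)
    return {name: pd.read_csv(base / f"{name}.csv") for name in table_names}


# Demonstration with two tiny, made-up CSV files in a temporary folder
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "regions.csv").write_text("regionid,regiondescription\n1,Eastern\n")
    Path(tmp, "territories.csv").write_text("territoryid,regionid\n01581,1\n")
    data = load_tables(tmp, ["regions", "territories"])

print(data["regions"].head())
```

Because the file path is passed to `pd.read_csv` as a value and never embedded in generated code, table names that start with `r` or `t` need no special handling.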

In [12]:
# Check that every DataFrame loaded correctly
tables = [
    categories,
    customers,
    employees,
    employee_territories,
    orders,
    orders_details,
    products,
    regions,
    shippers,
    suppliers,
    territories,
]

# Print the first three rows of each table
for table in tables:
    print(table.head(3))

   categoryid categoryname                                        description  \
0           1    Beverages        Soft drinks, coffees, teas, beers, and ales   
1           2   Condiments  Sweet and savory sauces, relishes, spreads, an...   
2           3  Confections                Desserts, candies, and sweet breads   

  picture  
0      \x  
1      \x  
2      \x  
  customerid                         companyname     contactname  \
0      ALFKI                 Alfreds Futterkiste    Maria Anders   
1      ANATR  Ana Trujillo Emparedados y helados    Ana Trujillo   
2      ANTON             Antonio Moreno Taquería  Antonio Moreno   

           contacttitle                        address         city region  \
0  Sales Representative                  Obere Str. 57       Berlin    NaN   
1                 Owner  Avda. de la Constitución 2222  México D.F.    NaN   
2                 Owner                Mataderos  2312  México D.F.    NaN   

  postalcode  country         phone      