# The Preamble to Descriptive Statistics

- Change Font Size

In [None]:
%%html
<style>
    /* Increase the font size for all text */
    body {
        font-size: 20px;
    }
    
    /* Optionally, you can also increase the font size for specific elements */
    h1, h2, h3, h4, h5, h6 {
        font-size: 18px; /* Adjust the size as needed */
    }
</style>

## Setting Up Python
1. Check if Python is installed:
    <br>**Windows**: Open the Command Prompt:
    - Press Win + R to open the "Run" dialog.
    - Type cmd and press Enter. 
 
    In the Command Prompt, type:
    - python --version <br> or 
    
    - python3 --version
     
This command will display the installed Python version if Python is installed. If Python is not installed, you'll typically see an error message.

2. **macOS**: Open the Terminal
    - Press Command + Space to open Spotlight Search
    - Type "Terminal" and press Enter.
    
    In the Terminal, type:
    
    - python --version <br> or
    
    - python3 --version
    
Similar to Windows, this command will display the installed Python version if Python is installed. <br> If Python is not installed, you'll typically see an error message.

3. **Linux (Ubuntu/Debian-based)**: Open the Terminal
    - You can usually find the Terminal application in the Applications menu or by searching for "Terminal".
    
    In the Terminal, type:
    
    - python --version <br> or
    
    - python3 --version <br>


    
    

<br>4. **Install Python**: If none of the commands above reports a version, Python still needs to be installed. You can download the latest version of Python from the official website: [Python Downloads](https://www.python.org/downloads/). Follow the installation instructions for your operating system.
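
Once Python is available, the same version check can also be made from inside Python itself. The snippet below is a minimal sketch; the variable name `version_text` is only illustrative.

```python
import sys

# sys.version_info holds the interpreter version as a named tuple
version_text = f"{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}"
print("Running Python", version_text)
```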

## Install Jupyter Notebook

Once Python is installed, you can install Jupyter Notebook using pip, which is Python's package installer. Open your command line or terminal and run the following command: 

   - pip install jupyter

After the installation is complete, start Jupyter Notebook by running the following at the command prompt:

   - jupyter notebook





You may also install Jupyter Notebook by using Anaconda through the following steps:

  - Download and install Anaconda from the [Anaconda Website](https://www.anaconda.com/download)
  
  - Add Anaconda to your system’s PATH
  
  - Launch the Anaconda Navigator.
  
  - Click on the “Install Jupyter Notebook” button
  
  - Once the installation is complete, click the “Launch” button to start Jupyter Notebook.
  

  

## How to Start Jupyter Notebook Using the Anaconda Navigator

- **Launch Anaconda Navigator**: Open Anaconda Navigator. You can usually find it in your applications or by searching for "Anaconda Navigator" in the Start menu (Windows) or Launchpad (macOS).

- **Launch Jupyter Notebook**: In the Anaconda Navigator window, you'll see a list of applications. Find and click on the "Jupyter Notebook" icon. This will open a new tab in your default web browser, and Jupyter Notebook will start running.

- **Create or Open a Notebook**: Once Jupyter Notebook is running, you can create a new notebook by clicking on the "New" button in the top-right corner and selecting "Python 3" (or any other available kernel). Alternatively, you can open an existing notebook by navigating to the directory where your notebook is located and clicking on it.


## Python Libraries
In Python, libraries (also known as modules or packages) are collections of functions, classes, and variables that extend the capabilities of the core Python language. These libraries are designed to address specific tasks or domains, such as data manipulation, scientific computing, web development, machine learning, and more. By importing and using libraries in your Python code, you can leverage pre-written code to perform various tasks efficiently without having to write everything from scratch. Some popular Python libraries include NumPy, Pandas, Matplotlib, TensorFlow, Flask, and Django.
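
As a small illustration of importing and using a library, the sketch below uses two modules from Python's standard library, so it runs without installing anything. The modules chosen (`math` and `statistics`) and the variable names are only examples.

```python
import math                   # import a whole module
from statistics import mean   # import a single function from a module

root = math.sqrt(16)          # square root provided by the math module
average = mean([1, 2, 3, 4])  # arithmetic mean provided by the statistics module

print(root, average)
```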




## Python Libraries for Descriptive Statistics
There are several Python libraries commonly used for descriptive statistics, each offering various functionalities for analyzing and summarizing data. Here are some of the most popular ones:

1. **NumPy**:

- NumPy is a fundamental library for scientific computing in Python.
- It provides support for arrays, matrices, and mathematical functions to operate on these data structures efficiently.
- NumPy's `mean`, `median`, `std`, `var`, `min`, `max`, `sum`, and other functions can be used for descriptive statistics.

2. **Pandas**:

- Pandas is a powerful library for data manipulation and analysis.
- It offers data structures like Series and DataFrame, which are well-suited for handling tabular data.
- Pandas provides functions like `describe`, `mean`, `median`, `std`, `var`, `min`, `max`, `sum`, `count`, and more for descriptive statistics.

3. **SciPy**:

- SciPy is a library for scientific and technical computing.
- It builds on NumPy and provides additional functionality for optimization, integration, interpolation, and statistical functions.
- SciPy's stats module includes various statistical functions for descriptive statistics, such as `describe`, `skew`, `kurtosis`, `mode`, `iqr`, `sem`, and many more.

4. **StatsModels**:

- StatsModels is a library for statistical modeling and hypothesis testing.
- It provides classes and functions for estimating statistical models and conducting statistical tests.
- StatsModels offers descriptive statistics functions as well as tools for `regression analysis`, `time series analysis`, and more.

5. **Seaborn**:

- Seaborn is a data visualization library based on Matplotlib.
- It provides a high-level interface for creating attractive statistical graphics.
- Seaborn includes functions for `visualizing distributions`, `categorical data`, and `relationships between variables`, which can be useful for exploring data during descriptive statistics analysis.


6. **Matplotlib**:

- Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
- While Matplotlib is primarily known for its `plotting capabilities`, plots such as histograms and box plots are also useful for summarizing the distribution of data.

7. **Scikit-learn**:

- Scikit-learn is a `machine learning` library that includes modules for `data preprocessing`, `classification`, `regression`, `clustering`, and more.
- Although its primary focus is machine learning, Scikit-learn also provides tools for `feature selection`, `dimensionality reduction`, and `basic statistical analysis`.

8. **mimesis**: 
- mimesis is a Python library that generates mock data for various purposes. 

- It is built on the Python standard library, so it has no third-party dependencies. 

- Currently, the library supports 32 languages and 21 class providers, supplying various data.

- It can be easily integrated into data processing pipelines or testing frameworks to generate realistic sample data for testing or analysis.

9. The Python package **tableone** is a tool for creating summary statistics tables for your dataset. It's particularly useful for generating descriptive statistics for categorical and continuous variables, stratified by groups.

- Customizing Output: The tableone package offers various customization options for the summary table, such as specifying which statistics to include, formatting the output, and adding test statistics for group comparisons. Experimenting with these options can help tailor the output to your specific needs.

- Advanced Features: Beyond basic summary statistics, tableone offers advanced features such as subgroup analysis, stratification by multiple variables, and customization of statistical tests. Exploring these features can enhance the depth and complexity of your analyses.

These libraries offer a wide range of functionalities for descriptive statistics, data manipulation, visualization, and more. Depending on your specific needs and preferences, you can choose one or more of these libraries to perform descriptive statistics analysis in Python.



## DataFrame

In Python, a **DataFrame** is a `two-dimensional` labeled data structure provided by the Pandas library. It can be thought of as a `table` or a `spreadsheet-like` data structure where data is organized into rows and columns. Each column can have a different data type (e.g., integer, float, string), and each row represents a single observation or record. DataFrames are widely used for data manipulation, analysis, and exploration in data science, machine learning, and other domains due to their flexibility, ease of use, and powerful functionality. <br> Key features of a DataFrame include:

- **Indexing**: Each row has a label in the row index and each column has a label in the column index, which allows for easy and efficient access to individual elements.

- **Column Operations**: DataFrames support various operations on columns, such as selecting, adding, deleting, and renaming columns.

- **Data Alignment**: DataFrames automatically align data based on labels, making it easy to perform operations on data with different indexes.

- **Missing Data Handling**: DataFrames provide built-in support for handling missing data, allowing for flexible data manipulation and analysis.

- **Grouping and Aggregation**: DataFrames support grouping data based on one or more columns and performing aggregation functions (e.g., sum, mean, count) on grouped data.

- **Data I/O**: DataFrames can easily read data from and write data to various file formats, including CSV, Excel, SQL databases, and more.
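
A few of the features listed above can be sketched on a tiny, made-up table. The data, the row labels, and the variable names below are hypothetical and serve only as an illustration.

```python
import pandas as pd

# Hypothetical data: three rows with custom row labels
scores = pd.DataFrame(
    {"team": ["A", "A", "B"], "points": [10, 20, 30]},
    index=["r1", "r2", "r3"],
)

# Indexing: select one element by row label and column label
r2_points = scores.loc["r2", "points"]

# Column operations: add a derived column, then rename it
scores["double"] = scores["points"] * 2
scores = scores.rename(columns={"double": "points_x2"})

# Grouping and aggregation: mean points per team
team_means = scores.groupby("team")["points"].mean()

print(r2_points)
print(team_means)
```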









In [None]:
## DataFrame Example

import pandas as pd

# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emily'],
    'Age': [25, 30, 35, 40, 45],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'San Francisco']
}

df = pd.DataFrame(data)

print("DataFrame:")
print(df)


In [None]:
## How to import excel data that has irrelevant rows.
import pandas as pd

# Specify the path to your Excel file
excel_file = 'C:\\Users\\yawus\\OneDrive\\Desktop\\DS Notes\\notes\\visualization\\Canada.xlsx'

# Specify the name of the sheet/tab you want to import
sheet_name = 'Regions by Citizenship'

# Read the specific sheet, skipping the first 20 rows of the file
# `header=0` tells Pandas to use the first remaining row (after the skipped rows) as the column names of the DataFrame.
df = pd.read_excel(excel_file, sheet_name=sheet_name, skiprows=range(20), header=0)

# Display the top 10 rows of the DataFrame
print(df.head(10))



## Transform the DataFrame from Wide to Long:

You can see that the DataFrame stores each year in its own column, instead of keeping the dates in a single column like the other variables.

To transform it to have the dates under a date column, we follow these steps using Python and Pandas:

1. Read the data from Excel into a Pandas DataFrame.

2. Reshape the DataFrame using the melt() function to unpivot the years into a single column.

3. Rename the columns for clarity.

4. Convert the "Date" column values to datetime format.

5. Optionally, you can sort the DataFrame by the "Date" column.
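
The steps above can be sketched on a tiny, made-up table. The region names and values below are illustrative only and are not taken from the Excel file; step 1 (reading from Excel) is replaced by building the table by hand.

```python
import pandas as pd

# Hypothetical wide table: one column per year
wide = pd.DataFrame({
    "RegName": ["Asia", "Europe"],
    "1980": [10, 20],
    "1981": [30, 40],
})

# Step 2: unpivot the year columns into rows (default names: 'variable', 'value')
tidy = wide.melt(id_vars=["RegName"])

# Step 3: rename the generated columns for clarity
tidy = tidy.rename(columns={"variable": "Date", "value": "Value"})

# Step 4: convert the year strings to datetime (1 January of each year)
tidy["Date"] = pd.to_datetime(tidy["Date"], format="%Y")

# Step 5: sort by date, then by region
tidy = tidy.sort_values(["Date", "RegName"]).reset_index(drop=True)

print(tidy)
```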


In [None]:
import pandas as pd

# Melt the DataFrame from wide to long format, keeping 'Type', 'Coverage', 'AreaName', and 'RegName' as identifier columns
df_long = pd.melt(df, id_vars=['Type', 'Coverage', 'AreaName', 'RegName'], var_name='Year', value_name='Value')

# Remove rows with '..' values
df_long = df_long[df_long['Value'] != '..']

# Display the resulting DataFrame
print(df_long.head(10))

## Check if a Library is Installed

To use a library, it must be installed. It is useful to check whether a library is already installed before installing it again.

Installing a library that is already installed generally does not cause any harm. However, it can consume unnecessary time and bandwidth, especially if the library is large or if you are working with limited internet connectivity.

While it's generally safe to install a library that is already installed, it's a good practice to periodically check for updates to your installed libraries and only install new libraries when necessary. Additionally, consider using virtual environments or package management tools like pip to manage your dependencies efficiently and avoid unnecessary installations.

Here are a few considerations:

1. **Resource Utilization**: Installing a library consumes system resources such as disk space and memory. If you repeatedly install the same library unnecessarily, it can gradually consume significant resources over time.

2. **Network Bandwidth**: Repeatedly downloading and installing the same library can consume network bandwidth, especially if you are working with limited or metered internet connections.

3. **Version Control**: Installing the same library multiple times might result in different versions being installed across different environments, which could lead to compatibility issues or inconsistencies in your codebase.
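
To see which version of a library is installed, one option is the standard-library module `importlib.metadata` (Python 3.8 or later). The sketch below is a minimal example; the helper name `installed_version` is hypothetical, and the function expects the name used with `pip` (for example `scikit-learn`), not the import name.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(distribution_name)
    except PackageNotFoundError:
        return None


# Report the version of a few libraries (None means not installed)
for name in ["numpy", "pandas", "scikit-learn"]:
    print(name, "->", installed_version(name))
```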

In [None]:
## Check if a library is installed

# Check if NumPy is already installed
try:
    import numpy
    print("NumPy is already installed.")
except ImportError:
    print("NumPy is not installed.")


You can repeat this process for each library you want to check. If you have a long list of libraries to check, you can create a function to simplify the process:

In [None]:
def check_library(library_name):
    try: # Block that executes code that might raise an exception.
        __import__(library_name)
        print(f"Library '{library_name}' is already installed.")
    except ImportError: # This block executes If an `ImportError` occurs during the `try` block
        print(f"Library '{library_name}' is not installed.")

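# Example usage: pass the import name, which is lowercase and can differ
# from the name used with pip (scikit-learn is imported as sklearn)
check_library("numpy")
check_library("pandas")
check_library("matplotlib")
check_library("scipy")
check_library("statsmodels")
check_library("seaborn")
check_library("sklearn")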

## Installing Libraries

If a library is not installed, it can't be imported. You can install one library or several libraries
using the `pip install` command. Here is how to do it:

`!pip install library1 library2 library3 ... libraryN`

Additionally, you can specify versions for each library:

`!pip install library1==version1 library2==version2 library3==version3`

#### Example:

`!pip install numpy==1.20.3 pandas==1.3.2 matplotlib==3.4.3`





In [None]:
# Install The tableone package
!pip install tableone




## How to use `tableone`

- Importing the necessary libraries:

import pandas as pd
from tableone import TableOne

- Load your dataset into a pandas DataFrame:

df = pd.read_csv('your_dataset.csv')

- Create a TableOne object:


table = TableOne(data=df, 
                 columns=['age', 'sex', 'bmi', 'outcome'], 
                 categorical=['sex', 'outcome'], 
                 nonnormal=['age', 'bmi'], 
                 groupby='outcome')

- Generate the summary table:

print(table)

- Customizing the summary table:

table_customized = TableOne(data=df, 
                            columns=['age', 'sex', 'bmi', 'outcome'], 
                            categorical=['sex', 'outcome'], 
                            nonnormal=['age', 'bmi'], 
                            groupby='outcome',
                            rename={'age': 'Age (years)', 'bmi': 'BMI (kg/m^2)'},
                            pval=True,
                            decimals=2)

print(table_customized)


## Nota Bene

In Jupyter Notebook, you do not need to import a library in every cell. Once a library has been imported in one executed cell, it remains available in all other cells until the kernel is restarted. Note that installing a library does not import it: an installed library must still be imported once per kernel session before it can be used.

There are a few scenarios where you need to import a library again or take extra care with imports:

1. **Namespace Collision**: If a variable in your notebook has the same name as a library, it shadows the library. Importing the library under an alias avoids the collision.

- `import pandas as pd` imports Pandas under the alias `pd`.


2. **Restarting the Kernel**: If you restart the Jupyter Notebook kernel, you'll need to import the libraries again in the new kernel session before using them.


3. **Cell Execution Order**: If you're running code in multiple cells and the cell containing the import statement is executed after cells that use the library, you'll need to import the library before using it in those cells.


# Notes - Getting the Data Ready

### Import libraries



In [None]:
# xlrd is a package for reading excel files
#!pip install xlrd 
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import mimesis
from scipy import stats as st
import xlrd
%matplotlib inline

In [None]:
# Specify the path to your Excel file
path = r'C:\Users\yawus\OneDrive\Desktop\DS Notes\notes\descriptive statistics\Sample - Superstore.xls'

# Specify the name of the sheet/tab you want to import
tab_name = 'Orders'

# Import the file
df = pd.read_excel(path, sheet_name = tab_name)


### Make sure the data has been imported

In [None]:
# Let's get the top 5 rows of the data using the head() function
print(df.head())

### Get the Column Names
Sometimes you will have a long list of columns. <br> Listing the columns helps you identify which columns to modify.

In [None]:
column_names = list(df.columns)
print(column_names)



In [None]:
# Alternative method for getting column names
column_names = df.keys()
print(column_names)

In [None]:
# Another method for column names
column_names = df.columns
print("Column names:")
for col in column_names:
    print(col)

### Get Data Types of Columns
To retrieve the data type of each column, you can use the `dtypes` attribute of the DataFrame.

In [None]:
data_types = df.dtypes
print(data_types)

### View Data Summary

In [None]:
df.info()

### Data Type Details:
- Sales: Numeric Continuous
- Discount: Numeric Continuous
- Profit: Numeric Continuous
- Quantity: Numeric Discrete
- Columns of type `object` are non-numeric (categorical)

In [None]:
# Convert Numeric Columns to Strings

df['Row ID'] = df['Row ID'].astype(str)
df['Postal Code'] = df['Postal Code'].astype(str)

### Row Index:

- To get the row index (labels), you can use the `.index` attribute.
- It returns a pandas Index object containing the row labels.

In [None]:
row_index = df.index
print("Row index:")
for idx in row_index:
    print(idx)


### Get the number of rows and columns in the data


In [None]:
print(df.shape)

- This shows (# of rows, # of columns)
- There are 9994 rows and 21 columns

### Check for Missing Values in the Data

1. **Using** `.isnull()` or `.isna()`:
- Both .isnull() and .isna() return a DataFrame of the same shape as the original, <br> with True indicating missing values (NaN or None) and False indicating non-missing values.
- You can then use `.sum()` to count the missing values per column.

In [None]:
# .isnull()
# Check for missing values

missing_values = df.isnull()
print("Missing values (True indicates missing data):")
print(missing_values)

In [None]:
# Sum missing values: 0 means no missing values 
missing_counts = df.isnull().sum()
print("\nCount of missing values per column:")
print(missing_counts)

2. **Using** `.any()`:
- Calling `.any()` on the result of `.isnull()` checks each column (the default is axis=0) and returns True for every column that contains at least one missing value.
- To get a single True/False answer for the entire DataFrame, reduce the underlying array with `.values.any()`, as in the cell below.
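
The difference between the per-column check and the single overall answer can be seen on a small, made-up DataFrame. The column names and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data: columns 'a' and 'b' each contain one missing value
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": ["x", "y", None],
    "c": [1, 2, 3],
})

missing_counts = toy.isnull().sum()      # number of missing values per column
missing_by_column = toy.isnull().any()   # one True/False per column
missing_anywhere = toy.isnull().values.any()  # a single True/False

print(missing_counts)
print(missing_by_column)
print(missing_anywhere)
```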

In [None]:
has_missing_values = df.isnull().values.any()
print("\nDoes the DataFrame have any missing values?")
print(has_missing_values)

## Descriptive Statistics

**Descriptive statistics** refer to the analysis, summary, and presentation of findings related to a data set derived from a sample or the entire population. These statistics help us understand and describe the characteristics of the data.<br> The key components of descriptive statistics are: <br>
- Frequency Distribution
- Measures of Central Tendency
- Measures of Variability

The `statistics` module in Python provides functions for calculating mathematical statistics of numeric data. <br> Although it’s not meant to compete with third-party libraries like NumPy or SciPy, it serves well for basic statistical calculations. Here are some key features of the `statistics` module:<br> 

1. **Averages and Measures of Central Location**:
- `mean(data)`: Computes the arithmetic mean (average) of data.
- `fmean(data, weights=None)`: Calculates the floating-point arithmetic mean with optional weighting.
- `geometric_mean(data)`: Computes the geometric mean of data.
- `harmonic_mean(data)`: Calculates the harmonic mean of data.
- `median(data)`: Finds the median (middle value) of data.
- `median_low(data)` and `median_high(data)`: Compute the low and high medians of data.
- `median_grouped(data, interval=1)`: Calculates the median (50th percentile) of grouped data.
- `mode(data)`: Determines the single mode (most common value) of discrete or nominal data.
- `multimode(data)`: Provides a list of modes (most common values) for discrete or nominal data.
- `quantiles(data, n=4, method='exclusive')`: Divides data into intervals with equal probability.

2. **Measures of Spread**:
- `pstdev(data)`: Computes the population standard deviation of data.
- `pvariance(data)`: Calculates the population variance of data.
- `stdev(data)`: Computes the sample standard deviation of data.
- `variance(data)`: Calculates the sample variance of data.

The `statistics` module is particularly useful for basic statistical tasks, but for more advanced analyses, consider using specialized libraries like `NumPy` or `SciPy`.
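
As a quick illustration, the sketch below applies a few of these functions to a small, made-up list of numbers. It uses only the standard library; the data and variable names are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central location
center = {
    "mean": statistics.mean(data),
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}

# Measures of spread
population_sd = statistics.pstdev(data)  # divides by n
sample_sd = statistics.stdev(data)       # divides by n - 1

# Three cut points that split the data into four equal-probability groups
quartiles = statistics.quantiles(data, n=4)

print(center)
print(population_sd, sample_sd)
print(quartiles)
```

Note that the sample standard deviation is slightly larger than the population standard deviation, because it divides by n - 1 instead of n.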


### Summary Statistics

The `describe()` function in pandas is used to generate descriptive statistics. <br> For numeric columns, the output includes:
- `count`: the number of non-NA/null observations
- `mean`: the average
- `std`: the standard deviation
- `min`: the minimum
- `25%`, `50%`, `75%`: the lower quartile, median, and upper quartile
- `max`: the maximum

For non-numeric columns (such as strings or categoricals), the output includes:
- `count`: the number of non-NA/null observations
- `unique`: the number of unique values
- `top`: the most common value
- `freq`: the frequency of the most common value
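
The two kinds of output can be compared on a small, made-up table. The column names and values below are hypothetical and are not part of the Superstore data.

```python
import pandas as pd

# Hypothetical data: one numeric column and one text column
toy = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0],
    "color": ["red", "blue", "red", "red"],
})

numeric_summary = toy.describe()        # count, mean, std, min, quartiles, max
text_summary = toy["color"].describe()  # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```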

In [None]:
# Summary Stats for Numeric Variables
df.describe()


In [None]:
# Get summary statistics for non-numeric columns: use include = 'object'
df.describe(include=['object'])

### Central Tendency - Mean, Median, Mode

#### Mean of One Variable:

In [None]:

sales_mean =df['Sales'].mean()
sales_mean_rounded = round(sales_mean, 3)
print(sales_mean_rounded)

#### Mean of two or more variables

In [None]:


column_names =['Sales', 'Profit']
mean_Sales_Profit = df[column_names].mean()
print("Means of columns {} are:\n{}".format(column_names, mean_Sales_Profit))


#### Mean of Each Numeric Column

In [None]:

mean = df.mean(numeric_only = True)

print("Means of numeric columns are:\n{}".format(mean))

We can calculate the mean for each row by supplying the axis argument:

In [None]:
# Select only the valid numeric columns
numeric_columns = df.select_dtypes(include='number')

# Calculate the mean across rows
row_means = numeric_columns.mean(axis=1)

print("Means across rows are:\n{}".format(row_means))


#### Median of Each Numeric Column

In [None]:
median = df.median(numeric_only = True)

print("Medians of numeric columns are:\n{}".format(median))

#### Mode of Each Numeric Column

In [None]:
mode = df.mode(numeric_only = True)
print(mode)


### The 5 Summary Statistics
The five summary statistics, also known as the five-number summary, are a set of descriptive statistics that provide information about a dataset. <br> The summary consists of the following five sample percentiles:

- **Minimum**: The smallest observation in the dataset.
- **First Quartile (Q1)**: The value below which 25% of the data falls.
- **Median**: The middle value that separates the higher half from the lower half of the dataset.
- **Third Quartile (Q3)**: The value below which 75% of the data falls.
- **Maximum**: The largest observation in the dataset.

These five statistics provide a concise representation of the distribution of a dataset, <br>offering insight into the central tendency, variability, and shape of the distribution.
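
On a small, made-up series the five values are easy to verify by hand. The sketch below passes a list of quantile levels to the pandas `quantile` method; the data are hypothetical.

```python
import pandas as pd

# Hypothetical data: the integers 1 through 9
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Minimum, Q1, median, Q3, maximum in one call
five_numbers = values.quantile([0, 0.25, 0.5, 0.75, 1.0])

print(five_numbers)
```

With nine evenly spaced values, the result is 1, 3, 5, 7, and 9.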

In [None]:
five_num = [df['Sales'].quantile(0).round(3),
            df['Sales'].quantile(0.25),
            df['Sales'].quantile(0.50).round(3),
            df['Sales'].quantile(0.75),
            df['Sales'].quantile(1)]
five_num

### Histogram for Discount

In [None]:
import matplotlib.pyplot as plt


# You can adjust the number of bins to change the granularity of the histogram
plt.hist(df['Discount'], bins= 15,color='green', edgecolor='black')

plt.xlabel('Discount')
plt.ylabel('Frequency')
plt.title('Histogram of Discount')
plt.grid(True)
plt.show()

#### Distribution Curves for Sales

The mean of 229.86 is far greater than the median of 54.49. The variable is therefore skewed right.<br> The median is a better representative of the center compared to the mean. There are outliers. We can confirm this with a distribution plot.<br> Also the maximum sales is 22638.48 and the minimum sales is 0.444. The range is 22638.48-0.444 = 22638.036

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'df' is your DataFrame containing the 'Sales' column

# Create a figure and axis object
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the distribution curve for the 'sales' variable
sns.kdeplot(data=df['Sales'], ax=ax, label='Sales', fill=True)

# Set labels and title
ax.set_xlabel('Sales')
ax.set_ylabel('Density')
ax.set_title('Distribution of Sales')

# Show the plot
plt.show()


The plot confirms that Sales is right-skewed.

### Categorical Variables

#### Print out Categorical Variables Only

In [None]:
# Filter DataFrame to include only categorical variables
categorical_df = df.select_dtypes(include=['object'])

print(categorical_df.columns)


#### Print Out the Unique Categories in the 'Sub-Category' Variable

In [None]:
# Get Unique Categories
unique_SubCategory = df['Sub-Category'].unique()

# Print Unique Categories

print("Unique Sub-Category :\n", unique_SubCategory)

#### Counts and Percentages
- Counts

In [None]:
# Counts of 'Sub-Category' Cases
SubCategory_counts = df['Sub-Category'].value_counts()
print('Sub-Category :\n' , SubCategory_counts)

- Percent

In [None]:
# Calculate the counts of each subcategory
SubCategory_Counts = df['Sub-Category'].value_counts()

# Calculate the total number of observations
n = len(df)

# Calculate the percentage of each subcategory observation with 2 decimal places
SubCategory_Percent = SubCategory_Counts / n * 100

# Round the percentages to 2 decimal places
SubCategory_Percent_rounded = SubCategory_Percent.round(2)

# Convert the percentages to strings with the desired format
SubCategory_Percent_str = SubCategory_Percent_rounded.astype(str) + '%'

# Print the percentage of each subcategory observation with 2 decimal places
print(f'Percent of Each Subcategory Observation:\n{SubCategory_Percent_str}')


- Percent for Category

In [None]:
# Count
CategoryCount = df['Category'].value_counts()
# Percent
CategoryPercent = (CategoryCount / n) * 100

# Convert Float to String

print(CategoryPercent.round(3).astype(str) + " %" )


### Mode for Categorical Variables

In [None]:
# Compute the mode of each specified column (axis=0 computes one mode per column)
mode = df[['Ship Mode','City', 'State','Product ID','Category','Sub-Category' ,'Segment']].mode(axis=0, numeric_only=False)
print("Mode for categorical variables:")
print(mode)

The mode shows the value with the highest frequency for each categorical variable.

### Frequency Table & Bar Graph for Ship Mode Category

In [None]:
# Generate a frequency table
ship_mode_freq =df['Ship Mode'].value_counts()

print('The Ship Mode Counts Are:')
print(ship_mode_freq)

# Plot a bar graph

plt.figure(figsize = (10, 6))
ship_mode_freq.plot(kind='bar')
plt.title('Frequency of Ship Mode')
plt.xlabel('Ship Mode')
plt.ylabel('Frequency')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

The bar graph of frequencies confirms the mode that we calculated.

#### Create a Frequency Table and a Bar Graph for the Category Variable

In [None]:
# Select the 'Category' column (it is already a pandas Series)
category_series = df['Category']

# Calculate value counts
value_counts = category_series.value_counts()

# Calculate proportions
proportions = value_counts / len(category_series)

# Calculate percentages
percentages = proportions * 100

# Create a DataFrame for the frequency table with proportion and percent
frequency_df = pd.DataFrame({'Frequency': value_counts, 
                             'Proportion': proportions, 
                             'Percent': percentages})

# Sort the DataFrame by frequency in descending order
frequency_df = frequency_df.sort_values(by='Frequency', ascending=False)

# Display the frequency table
print("Frequency Table with Proportion and Percent:")
print(frequency_df)

### Bar Graph

In [None]:
import matplotlib.pyplot as plt

# Plot bar graph
plt.figure(figsize=(10, 6))
frequency_df['Frequency'].plot(kind='bar', color='skyblue')
plt.title('Frequency of Categories')
plt.xlabel('Category')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


The graph shows that Office Supplies has the highest frequency, which agrees with the mode.

- `plt.figure(figsize=(10, 6))`: This line creates a new figure object with a specific size. <br>The figsize parameter specifies the width and height of the figure in inches. In this case, the figure size is set to 10 inches wide and 6 inches tall. 

<br>

- `frequency_df['Frequency'].plot(kind='bar', color='skyblue')`: This line plots a bar <br> graph of the 'Frequency' column from the frequency_df DataFrame. <br> The `kind='bar'` parameter specifies the type of plot to create, which is a bar plot in this case.<br> The `color='skyblue'` parameter sets the color of the bars to sky blue.

<br>


- `plt.title('Frequency of Categories')`: This line sets the title of the plot to 'Frequency of Categories'.

<br>

- `plt.xlabel('Category')`: This line sets the label for the x-axis to 'Category'.

<br>

- `plt.ylabel('Frequency')`: This line sets the label for the y-axis to 'Frequency'.

<br>

- `plt.xticks(rotation=45)`: This line rotates the x-axis tick labels by 45 degrees. <br> This is useful when you have long category names and want to avoid overlap between them.

<br>

- `plt.tight_layout()`: This line adjusts the layout of the plot to make sure all elements fit properly within the figure area, preventing overlap.

<br>


- `plt.show()`: This line displays the plot. It's necessary to include this line to actually see the plot.

### Measures of Variability

**Measures of variability** describe how data points spread out from each other and from the center of a distribution. <br> These statistics help us understand the dispersion or scatter in a dataset.

- Range
- Interquartile Range
- Standard Deviation
- Variance

- Variability affects our ability to generalize results from a sample to a population.
- Low variability allows better predictions based on sample data, while high variability makes predictions more challenging.
- Both central tendency and variability together give a complete picture of the data.
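
The four measures can be computed on a small, made-up series as a rough sketch. One detail worth noting is an assumption about library defaults that is easy to overlook: the pandas `var` and `std` methods divide by n - 1 (sample statistics), while NumPy's `np.var` divides by n (population statistics) unless `ddof=1` is passed.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the integers 1 through 9
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

value_range = values.max() - values.min()            # highest minus lowest
iqr = values.quantile(0.75) - values.quantile(0.25)  # Q3 minus Q1

sample_variance = values.var()                   # pandas default: divide by n - 1
population_variance = np.var(values.to_numpy())  # NumPy default: divide by n
sample_std = values.std()                        # square root of the sample variance

print(value_range, iqr)
print(sample_variance, population_variance, sample_std)
```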


### Range

- The range is the simplest measure of variability.
- It tells us the **spread** of data from the **lowest** to the **highest** value.
- To calculate the range, subtract the lowest value from the highest value in the dataset.

In [None]:
# Calculate the Range for Sales

Sales_Range = df['Sales'].max() - df['Sales'].min()
print(Sales_Range)

### Percentile: using the `quantile` method from the pandas library

- **Percentile**: Percentile is a specific value below which a given percentage of scores in a dataset fall.<br> For example, the 50th percentile (also known as the median) is the value below which 50% of the scores in the dataset fall. 

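### Interquartile Range

- The interquartile range (IQR) is the difference between the 75th percentile (Q3) and the 25th percentile (Q1).
- It describes the spread of the middle 50% of the data.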

In [None]:
# Calculate the Interquartile Range
Lower_Sales_Quartile = df['Sales'].quantile(0.25)
Middle_Sales_Quartile = df['Sales'].quantile(0.50)
Upper_Sales_Quartile = df['Sales'].quantile(0.75)
Sales_IQR = Upper_Sales_Quartile - Lower_Sales_Quartile
Ninety_Fifth_Percentile = df['Sales'].quantile(0.95)

print('Sales Lower Quartile is:', Lower_Sales_Quartile)
print('Sales Median is:', Middle_Sales_Quartile.round(3))
print('Sales Upper Quartile is:', Upper_Sales_Quartile)
print('The Interquartile Range is:', Sales_IQR)
print('The 95th Percentile is:', Ninety_Fifth_Percentile.round(0))
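
A tiny made-up series makes the quartile arithmetic easy to follow. This is a sketch that assumes pandas' default linear interpolation between data points; the values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: nine evenly spaced values
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

q1 = toy.quantile(0.25)   # lower quartile
q2 = toy.quantile(0.50)   # median
q3 = toy.quantile(0.75)   # upper quartile
iqr = q3 - q1             # spread of the middle 50% of the data

print(q1, q2, q3, iqr)
```

For these nine values the quartiles fall exactly on data points: Q1 is 3, the median is 5, Q3 is 7, and the IQR is 4.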

### Percentile - using `scipy.stats.scoreatpercentile()`

- This function computes the score at a given percentile of the input data.
- As defined above, a percentile is the value below which a given percentage of scores in a dataset fall.
- For example, the 50th percentile (the median) is the value below which 50% of the scores fall. Similarly, the 25th percentile (Q1) is the value below which 25% of the scores fall, <br> and the 75th percentile (Q3) is the value below which 75% of the scores fall.


In [None]:
from scipy import stats
# Compute the median (50th percentile)

median = stats.scoreatpercentile(df['Sales'], 50)

# Compute the 25th and 75th percentiles
q1 = stats.scoreatpercentile(df['Sales'], 25)
q3 = stats.scoreatpercentile(df['Sales'], 75)

print("Median:", median.round(3))
print("25th Percentile (Q1):", q1)
print("75th Percentile (Q3):", q3)



### Percentile Rank - using `scipy.stats.percentileofscore()`
- This function computes the percentile rank of a score relative to a list of scores.
- Percentile Rank: The percentile rank of a score is the percentage of scores in a dataset that are equal to or less than the given score. For example, if a score has a percentile rank of 75, it means that 75% of the scores in the dataset are less than or equal to this score.

- Inputs: The function requires two main inputs: the list of scores (the dataset) and the specific score for which you want to calculate the percentile rank.

- How to Use It: Pass the list of scores as the first argument and the score of interest as the second argument. The function returns the percentile rank of that score relative to the entire list of scores.

In [None]:
from scipy import stats

# Calculate the percentile rank of the score 95
percentile_rank = stats.percentileofscore(df['Sales'], 95)
sixty_first_percentile = df['Sales'].quantile(0.61)
# round() works whether percentileofscore returns a Python float or a NumPy float
print("Percentile Rank of Score 95:", round(percentile_rank, 0))
print('The 61st Percentile is:', sixty_first_percentile.round(0))

- **Percentile Rank of Score 95: 61.0**: This means that about 61% of the scores in the distribution are less than or equal to 95.
- The percentile rank does not tell us the value at the 95th percentile; it tells us where the score 95 sits within the distribution.
- **The 61st Percentile is: 95.0**: This means that 61% of all the scores in the distribution are less than or equal to 95. <br> In other words, a score of 95 is at the 61st percentile, which is the same fact seen from the other direction.
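
The difference between "less than" and "less than or equal to" matters when the score itself appears in the data. The sketch below uses a made-up list of scores and the `kind` parameter of `scipy.stats.percentileofscore` to show both readings; the scores are illustrative only.

```python
from scipy import stats

# Hypothetical scores, not taken from the Sales column
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

# 'weak' counts scores less than or equal to 70
weak_rank = stats.percentileofscore(scores, 70, kind='weak')
# 'strict' counts only scores strictly less than 70
strict_rank = stats.percentileofscore(scores, 70, kind='strict')
# 'rank' is the default; it averages the percentage rankings of matching scores
default_rank = stats.percentileofscore(scores, 70)

print(weak_rank, strict_rank, default_rank)
```

Seven of the ten scores are at or below 70 and six are strictly below it, so the weak rank is 70 and the strict rank is 60.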

### Standard Deviation
- The standard deviation measures the typical distance of data points from the mean (average).
- It quantifies how much the data values deviate from the central tendency.
- A larger standard deviation indicates greater variability.
- It is calculated as the square root of the variance.

In [None]:
# Standard deviation of all numeric columns, including Sales
import numpy as np
import pandas as pd

numeric_columns = df.select_dtypes(include=np.number).columns

# Calculate standard deviation for numeric columns
std_devs = df[numeric_columns].std()

print("Standard Deviations for Numeric Columns:")
print(std_devs)

In [None]:
df['Sales'].std()

In [None]:
# Calculate standard deviation using the statistics module
# (statistics is part of the Python standard library, so no pip install is needed)
import statistics

std_dev = statistics.stdev(df['Sales'])
std_dev2 = statistics.stdev(df['Profit'])
print(std_dev)
print(std_dev2)
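
To see what these functions do internally, the standard deviation can be rebuilt step by step with the standard library alone. The numbers below are made up for illustration; the sketch shows both the population version (divide by n) and the sample version (divide by n - 1), the latter being what `statistics.stdev` and pandas `.std()` report by default.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical toy data

mean = sum(values) / len(values)
squared_deviations = [(x - mean) ** 2 for x in values]

# Population standard deviation: divide the sum of squares by n
pop_std = math.sqrt(sum(squared_deviations) / len(values))
# Sample standard deviation: divide the sum of squares by n - 1
sample_std = math.sqrt(sum(squared_deviations) / (len(values) - 1))

print(pop_std, statistics.pstdev(values))
print(sample_std, statistics.stdev(values))
```

For this data the mean is 5 and the squared deviations sum to 32, so the population standard deviation is exactly 2.0 while the sample version is slightly larger.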

### Coefficient of Variation 
The Coefficient of Variation is defined as the ratio of the standard deviation to the mean of a dataset, expressed as a percentage. 

Here's how you can do it:

- Using mean and standard deviation
- Using variation function from scipy

#### Using mean and standard deviation

In [None]:
import numpy as np

sales_mean = np.mean(df['Sales'])
profit_mean = np.mean(df['Profit'])

# np.std uses the population formula (ddof=0) by default,
# whereas df['Sales'].std() in pandas uses the sample formula (ddof=1)
sales_std_dev = np.std(df['Sales'])
profit_std_dev = np.std(df['Profit'])

sales_cv = (sales_std_dev / sales_mean) * 100
profit_cv = (profit_std_dev / profit_mean) * 100

print('The coefficient of variation for sales is:', sales_cv)
print('The coefficient of variation for profit is:', profit_cv)

- The standard deviation of Sales is greater than the standard deviation of Profit.
- However, the coefficient of variation for Profit is higher than the coefficient of variation for Sales.
- This means the Profit data points are more spread out relative to their mean than the Sales data points.

#### Using variation function from scipy

In [None]:
from scipy.stats import variation

sales_cv = variation(df['Sales'])*100

profit_cv = variation(df['Profit'])*100

print(sales_cv)
print(profit_cv)
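
A series can have the larger standard deviation and still the smaller coefficient of variation. The sketch below shows this with two made-up arrays (not the Sales and Profit columns) and checks the manual calculation against `scipy.stats.variation`, which also uses the population standard deviation by default.

```python
import numpy as np
from scipy.stats import variation

# Hypothetical toy data: series_a has large values, series_b small ones
series_a = np.array([100.0, 110.0, 90.0, 105.0, 95.0])
series_b = np.array([2.0, 8.0, 5.0, 1.0, 9.0])

std_a, std_b = np.std(series_a), np.std(series_b)
cv_a = std_a / np.mean(series_a) * 100
cv_b = std_b / np.mean(series_b) * 100

print('Standard deviations:', std_a, std_b)
print('CV (%):', cv_a, cv_b)
print('CV via scipy (%):', variation(series_a) * 100, variation(series_b) * 100)
```

Here `series_a` varies more in absolute terms, but relative to its mean of 100 that variation is small, so its coefficient of variation is far lower than that of `series_b`.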

### Variance
- Variance is the average of the squared distances from the mean.
- It provides a measure of how much the data points vary around the mean.
- Population variance = average of (data value - mean)^2. The sample variance divides the sum of squared distances by n - 1 instead of n.

#### Variance by numpy

In [None]:
import numpy as np

# np.var uses ddof=0 by default, i.e. the population variance (divides by n)
variance = np.var(df['Sales'])
print(variance)


#### Variance by statistics module

In [None]:
import statistics

# statistics.variance returns the sample variance (divides by n - 1)
variance = statistics.variance(df['Sales'])
print(variance)
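
The two cells above generally print slightly different numbers because the libraries use different defaults. The sketch below, on a made-up list, lines the defaults up side by side; `ddof` is the "delta degrees of freedom" argument that NumPy and pandas use to switch between dividing by n and by n - 1.

```python
import statistics
import numpy as np
import pandas as pd

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical toy data

pop_var_np = np.var(values)              # NumPy default: ddof=0 (population)
sample_var_np = np.var(values, ddof=1)   # NumPy with ddof=1 (sample)
sample_var_pd = pd.Series(values).var()  # pandas default: ddof=1 (sample)
sample_var_stats = statistics.variance(values)   # sample variance
pop_var_stats = statistics.pvariance(values)     # population variance

print('Population variance:', pop_var_np, pop_var_stats)
print('Sample variance:', sample_var_np, sample_var_pd, sample_var_stats)
```

On large datasets the gap between the two versions is small, but it is worth knowing which one a given function returns before comparing results across libraries.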

### Measures of Shape

Measures of Shape describe how data points are distributed within a dataset. 

1. **Symmetric Distribution**:
- A distribution is considered symmetric if both sides of the distribution mirror each other.
- Examples include the Normal Distribution, Rectangular Distribution, and U-shaped distribution.
- In a symmetric distribution with a single peak, the mean, median, and mode coincide.
- The Normal Distribution, in particular, exhibits symmetry and has its peak at the center.

2. **Skewed Distribution**:
- Skewed distributions are asymmetrical.
- They can be positively skewed (tail extends to the right) or negatively skewed (tail extends to the left).
- Positively skewed distributions have a longer right tail, while negatively skewed distributions have a longer left tail.
- Examples of skewed distributions include income distributions and response times (both often positively skewed) and scores on an easy exam (often negatively skewed).

3. **Kurtosis**:
- Kurtosis is traditionally described as the peakedness or flatness of a distribution; in practice it mostly reflects how heavy the tails are.
- A distribution can be:
    - Leptokurtic: Tall and narrow peak with heavy tails (high kurtosis).
    - Platykurtic: Flat peak with light tails (low kurtosis).
    - Mesokurtic: Similar to a normal distribution (moderate kurtosis).

#### Skewness & Kurtosis

In [None]:
from scipy.stats import skew, kurtosis

df['Sales'].skew()

The skew of 12.973 is greater than 1, so the Sales variable is highly positively skewed.

- A distribution with zero skew is perfectly symmetrical. Its left and right sides are mirror images.
- Skewness values between -0.5 and 0.5 indicate approximately symmetric data distributions.
- Skewness values between -1 and -0.5 (negative skewed) or between 0.5 and 1 (positive skewed) <br> suggest moderately skewed data distributions.
- Skewness values less than -1 (negative skewed) or greater than 1 (positive skewed) indicate highly skewed data distributions.
- Highly skewed data may require special treatment or transformation before certain statistical procedures.

In [None]:
df['Sales'].kurtosis()

The kurtosis of 305.3 is greater than 0, so the Sales variable is **leptokurtic**: <br> The distribution has heavier tails (more extreme values) than a normal distribution.

- A distribution with kurtosis less than 0 is **platykurtic**: The distribution has lighter tails than a normal distribution (fewer extreme values).
- A distribution with kurtosis equal to 0 is **mesokurtic**: The distribution has tails similar to a normal distribution.
- A distribution with kurtosis greater than 0 is **leptokurtic**: The distribution has heavier tails (more extreme values) than a normal distribution.<br> 

These thresholds refer to excess kurtosis, that is, kurtosis minus 3, where 3 is the kurtosis of a normal distribution. The pandas `kurtosis()` method used above already reports excess kurtosis, so no further adjustment is needed. As a rule of thumb, an excess kurtosis between **-3 and 3** is generally considered acceptable.
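
To make these definitions concrete, the sketch below computes skewness and excess kurtosis from central moments on a small made-up array and compares the results with `scipy.stats.skew` and `scipy.stats.kurtosis`. The data is illustrative; note that pandas applies a small-sample bias correction, so `Series.skew()` and `Series.kurtosis()` give slightly different numbers from these uncorrected formulas.

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical toy data with one large value in the right tail
x = np.array([1, 2, 2, 3, 3, 3, 4, 4, 14], dtype=float)

deviations = x - x.mean()
m2 = np.mean(deviations ** 2)  # second central moment (population variance)
m3 = np.mean(deviations ** 3)  # third central moment
m4 = np.mean(deviations ** 4)  # fourth central moment

skew_by_hand = m3 / m2 ** 1.5
excess_kurtosis_by_hand = m4 / m2 ** 2 - 3  # subtract 3, the normal baseline

print('Skewness:', skew_by_hand, skew(x))
print('Excess kurtosis:', excess_kurtosis_by_hand, kurtosis(x))
print('Kurtosis without subtracting 3:', kurtosis(x, fisher=False))
```

The single large value pulls the third moment positive, which is why the skewness comes out greater than 0.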

### Symmetric Distribution:
Symmetric data refers to a distribution where the values are evenly distributed around a central point.
1. **Characteristics of Symmetric Data**:
- The mean (average), median, and mode are approximately equal.
- The tails (extreme values) on both sides of the distribution have similar lengths.
- The shape of the distribution looks balanced.

2. **Importance of Symmetric Data**:
- Statistical Assumptions: Many statistical methods assume that the data is symmetric or approximately normal (bell-shaped). For example:
    - Parametric Tests: Techniques like t-tests, ANOVA, and linear regression assume approximately normal data or residuals. Symmetric data aligns well with these assumptions.
    - Confidence Intervals: Standard confidence interval formulas tend to be more accurate when the data is symmetric.
- Ease of Interpretation: Symmetric distributions are easier to interpret. The mean, median, and mode provide consistent information about the central tendency.
- Robustness: With symmetric data, summary statistics such as the mean are less likely to be pulled in one direction by extreme values than with skewed data.

3. **Algorithm Assumptions**:
- Many machine learning algorithms assume that the input features follow a symmetric or approximately normal distribution.
- Symmetric data aligns with the assumptions of linear regression, logistic regression, and other parametric models.

4. **Feature Scaling**:
- Symmetric data simplifies feature scaling. Techniques like standardization (z-score normalization) work well when the data is symmetric.
- Scaling ensures that features contribute equally to the model, improving convergence and stability.

5. **Distance Metrics**:
- Symmetric data often improves the effectiveness of distance-based algorithms (e.g., k-means clustering, k-nearest neighbors).
- These algorithms rely on distance measures (such as Euclidean distance), which can be dominated by the extreme values of a heavily skewed feature.

6. **Principal Component Analysis (PCA)**:
- PCA, a dimensionality reduction technique, is based on variance, so it tends to work better on roughly symmetric data; heavily skewed features can dominate the leading components.
- It transforms features into uncorrelated components, making it easier to capture essential information.

7. **Decision Trees and Random Forests**:
- While decision trees are not sensitive to data distribution, symmetric data often leads to better splits.
- Random forests benefit from symmetric features during ensemble aggregation.

8. **Interpretability**:
- Symmetric data simplifies model interpretation.
- Coefficients in linear models represent meaningful relationships when features are symmetric.

## How to Fix Skewed Data

- Transformation
- Normalization
- Binning or Bucketing
- Remove Outliers


### Transforming Skewed Data to Symmetric
Data transformation involves altering the original data to improve its quality, distribution, or suitability for analysis. <br> It aims to make the data more amenable to modeling or statistical techniques.

- **Common Techniques**:
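    - **Log Transformation**: Applying the natural logarithm to data can help normalize right-skewed distributions. It requires strictly positive values.
    - **Box-Cox Transformation**: A family of power transformations whose parameter (lambda) is chosen to make the data as close to normal as possible. It also requires strictly positive values.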
    - **Square Root Transformation**: Useful for stabilizing variance and handling non-constant variance.
    - **Other Power Transformations**: Including cube root, reciprocal, and exponential transformations.
    
- **When to Use**:
    - When the data violates assumptions of normality or homoscedasticity.
    - To reduce the impact of outliers.
    - To linearize relationships between variables.

The Sales variable is skewed. Let us transform it into a more symmetric variable.
There are several methods:

- **Using a defined function**

In [None]:
# Transform using square root
import numpy as np
from scipy.stats import skew, kurtosis

def sqrt_transform(data):
    return np.sqrt(data)

# Apply to your skewed variable (e.g., Sale Price)
transformed_sales = sqrt_transform(df['Sales'])

print(transformed_sales.skew())
print(transformed_sales.kurtosis())

In [None]:
# Transform using log
import numpy as np
from scipy.stats import skew, kurtosis

def log_transform(data):
    # np.log requires strictly positive values: zeros give -inf and negatives give NaN
    return np.log(data)

# Apply to the skewed variable (here, Sales)
transformed_sales = log_transform(df['Sales'])

print(transformed_sales.skew())
print(transformed_sales.kurtosis())

The log transform does better than the square root here, since its skew and kurtosis values are much closer to 0.

In [None]:
# Transform using Box Cox
import pandas as pd
from scipy.stats import boxcox
from scipy.stats import skew, kurtosis

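# Assuming you have a dataframe named df with a column named 'Sales'
# Note: boxcox requires all values to be strictly positive

# Transform using Box-Cox; lambda_value is the fitted power parameter
transformed_data, lambda_value = boxcox(df['Sales'])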

# Convert transformed data to a Pandas Series
transformed_series = pd.Series(transformed_data)

# Calculate skewness and kurtosis of the transformed data
skewness = skew(transformed_series)
kurt = kurtosis(transformed_series)

print("Skewness:", skewness)
print("Kurtosis:", kurt)



The Box-Cox transformation is also doing a good job.
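
The three transformations can also be compared in a single table. The sketch below does this on synthetic right-skewed data (a log-normal sample generated with a fixed seed), not on the actual Sales column, so the numbers it prints are only illustrative of the pattern.

```python
import numpy as np
import pandas as pd
from scipy.stats import boxcox

# Synthetic, strictly positive, right-skewed data (stand-in for a column like Sales)
rng = np.random.default_rng(0)
synthetic = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=5000))

# Box-Cox returns the transformed values and the fitted lambda
boxcox_values, fitted_lambda = boxcox(synthetic.to_numpy())

candidates = {
    'original': synthetic,
    'sqrt': np.sqrt(synthetic),
    'log': np.log(synthetic),
    'box-cox': pd.Series(boxcox_values),
}

# One row per transformation, one column per shape measure
summary = pd.DataFrame({
    'skew': {name: s.skew() for name, s in candidates.items()},
    'kurtosis': {name: s.kurtosis() for name, s in candidates.items()},
})
print(summary.round(3))
print('Fitted Box-Cox lambda:', round(fitted_lambda, 3))
```

A fitted lambda close to 0 indicates that Box-Cox is behaving almost like a log transform, which is what one would expect for log-normal data.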

**Visualize the log transformed variable**

In [None]:

import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import skew, kurtosis

def log_transform(data):
    return np.log(data)

# Apply to the skewed variable (here, Sales)
df['transformed_sales'] = log_transform(df['Sales'])



# Create a figure and axis object
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the distribution curve for the 'sales' variable
sns.kdeplot(data=df['transformed_sales'], ax=ax, label='Sales', fill=True)

# Set labels and title
ax.set_xlabel('transformed_sales')
ax.set_ylabel('Density')
ax.set_title('Distribution of transformed_sales')

# Show the plot
plt.show()


**Visualize the Box-Cox transformed variable**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import boxcox
from scipy.stats import skew, kurtosis

# Assuming you have a dataframe named df with a column named 'Sales'

# Transform using Box Cox
transformed_data, lambda_value = boxcox(df['Sales'])

# Add the transformed data to the dataframe
df['Transformed_Sales'] = transformed_data

# Plot the resulting distribution
plt.figure(figsize=(10, 6))

# Plot original data
plt.subplot(1, 2, 1)
plt.hist(df['Sales'], bins=20, color='skyblue', edgecolor='black')
plt.title('Original Sales Distribution')
plt.xlabel('Sales')
plt.ylabel('Frequency')

# Plot transformed data
plt.subplot(1, 2, 2)
plt.hist(df['Transformed_Sales'], bins=20, color='lightgreen', edgecolor='black')
plt.title('Transformed Sales Distribution')
plt.xlabel('Transformed Sales')
plt.ylabel('Frequency')

plt.tight_layout()
plt.show()


In [None]:
# Plot a distribution curve for the Box-Cox transformation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import boxcox
from scipy.stats import skew, kurtosis

# Assuming you have a dataframe named df with a column named 'Sales'

# Transform using Box Cox
transformed_data, lambda_value = boxcox(df['Sales'])

# Add the transformed data to the dataframe
df['Transformed_Sales'] = transformed_data

# Create a figure and axis object
fig, ax = plt.subplots(figsize=(10, 6))

# Plot the distribution curve for the 'sales' variable
sns.kdeplot(data=df['Transformed_Sales'], ax=ax, label='Sales', fill=True)

# Set labels and title
ax.set_xlabel('Transformed_Sales')
ax.set_ylabel('Density')
ax.set_title('Distribution of Transformed_Sales')

# Show the plot
plt.show()


**Notes**

The code **`plt.subplot(1, 2, 1)`** is a Matplotlib function used to create subplots in a figure.

Here's what each argument represents:

- **`1`**: The first argument specifies the total number of rows in the subplot grid.
- **`2`**: The second argument specifies the total number of columns in the subplot grid.
- **`1`**: The third argument specifies the index of the subplot to select. Indices start at 1 and run from left to right, top to bottom.

So, **`plt.subplot(1, 2, 1)`** is telling Matplotlib to create a subplot grid with 1 row and 2 columns,<br> and then to select the first subplot (from left to right) for plotting.

In the context of the provided code for plotting histograms, **`plt.subplot(1, 2, 1)`** is used to specify that the first histogram (for the original data) should be plotted in the first subplot position. Similarly, **`plt.subplot(1, 2, 2)`** is used to specify that the second histogram (for the transformed data) should be plotted in the second subplot position. This allows for side-by-side comparison of the two distributions.
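
An alternative way to build the same side-by-side layout is `plt.subplots(1, 2)`, which creates the figure and returns both axes at once. The sketch below uses synthetic right-skewed data rather than the Sales column, and the panel titles are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic right-skewed data and its log transform
rng = np.random.default_rng(1)
original = rng.lognormal(mean=3.0, sigma=1.0, size=1000)
transformed = np.log(original)

# One row, two columns: axes[0] is the left panel, axes[1] the right panel
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

axes[0].hist(original, bins=20, color='skyblue', edgecolor='black')
axes[0].set_title('Original (synthetic)')

axes[1].hist(transformed, bins=20, color='lightgreen', edgecolor='black')
axes[1].set_title('Log transformed')

fig.tight_layout()
plt.show()
```

Working with explicit axes objects avoids relying on which subplot is currently active, which can make longer plotting cells easier to follow.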

### Normalizing Skewed Data
Normalization ensures that features (variables) are on a similar scale, making them equally important <br> for machine learning algorithms. It doesn’t change the shape of the distribution but scales the data to a common range.

- **Common Techniques**:
    - **Standardization (Z-score normalization)**: Rescales features to have a mean of 0 and a standard deviation of 1. <br> Useful for optimization algorithms (e.g., gradient descent) and distance-based algorithms (e.g., K-nearest neighbors).
    - **Min-Max Normalization (Min-Max Scaling)**: Transforms features to a range between 0 and 1.<br>  The minimum value becomes 0, and the maximum value becomes 1.

**Standardization (Z-Score Normalization) Using Calculations**

In [None]:
# Calculate the mean and standard deviation of the 'Sales' column

sales_mean = df['Sales'].mean()
# pandas .std() returns the sample standard deviation (ddof=1)
sales_std = df['Sales'].std()

In [None]:
#Apply z-score normalization
df['Sales_normalized'] = (df['Sales'] - sales_mean) / sales_std

In [None]:
# Display the first few rows of the DataFrame to verify the results
print(df.head())

**Standardization (Z-Score Normalization) Using sklearn.preprocessing**

In [None]:
from sklearn.preprocessing import StandardScaler
# Create a StandardScaler object
scaler = StandardScaler()

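# Fit the scaler to the 'Sales' column and transform it
# StandardScaler divides by the population standard deviation (ddof=0),
# so the values differ very slightly from the manual calculation above
df['Sales_normalized'] = scaler.fit_transform(df[['Sales']])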

# Display the first few rows of the DataFrame to verify the results
print(df.head())
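
The manual calculation and `StandardScaler` do not give identical z-scores, because pandas `.std()` divides by n - 1 while scikit-learn's documentation describes `StandardScaler` as using the population standard deviation (equivalent to `ddof=0`). The sketch below reproduces both conventions with pandas alone on made-up values, so the size of the difference is visible.

```python
import numpy as np
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])  # hypothetical toy data

# Sample convention (pandas default, ddof=1)
z_sample = (values - values.mean()) / values.std()
# Population convention (ddof=0), the one StandardScaler is documented to use
z_population = (values - values.mean()) / values.std(ddof=0)

print(z_sample.round(4).tolist())
print(z_population.round(4).tolist())

# The two versions differ by a constant factor of sqrt(n / (n - 1))
n = len(values)
ratio = z_population.iloc[0] / z_sample.iloc[0]
print(ratio, np.sqrt(n / (n - 1)))
```

With only five values the factor is noticeable; with thousands of rows it is very close to 1, which is why the two approaches look almost the same on a real dataset.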

**Notes on Normalization**

Z-score normalization, also known as standardization, does not change the skewness of the data. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Z-score normalization shifts the mean of the data to 0 and scales it by the standard deviation, but it does not alter the shape of the distribution.

If the original data is positively or negatively skewed, the normalized data will retain the same skewness. However, standardization can help make the data more suitable for certain statistical analyses or machine learning algorithms that assume the data is centered around 0 with a standard deviation of 1.

If you want to reduce skewness in the data, you might consider applying transformations such as logarithmic transformation, square root transformation, or Box-Cox transformation before normalizing the data using z-score normalization. 
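
The point that standardization leaves skewness unchanged can be checked directly. The sketch below uses synthetic log-normal data with a fixed seed (not the Sales column) and compares the skew before z-scoring, after z-scoring, and after a log transform followed by z-scoring.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data
rng = np.random.default_rng(42)
skewed = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=2000))

# Z-score only: shifts and rescales, but keeps the shape
z_scores = (skewed - skewed.mean()) / skewed.std()

# Log transform first, then z-score: changes the shape, then rescales
logged = np.log(skewed)
log_then_z = (logged - logged.mean()) / logged.std()

print('Skew of original:   ', skewed.skew())
print('Skew after z-score: ', z_scores.skew())
print('Skew after log + z: ', log_then_z.skew())
```

The first two numbers match up to floating-point rounding, while the third is much closer to 0, which is why the transformation step comes before standardization.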

**Normalization Using Min-Max Scaling**
- Using Calculation

In [None]:
#Calculate the minimum and maximum values of the 'Sales' column
min_sales = df['Sales'].min()
max_sales = df['Sales'].max()

# Perform Min-Max Scaling
df['Sales_Scaled'] = (df['Sales'] - min_sales) / (max_sales - min_sales)

# Display the first few rows of the DataFrame to verify the results
print(df.head())

- Using sklearn.preprocessing

In [None]:
from sklearn.preprocessing import MinMaxScaler
#Create a MinMaxScaler object
scaler = MinMaxScaler()
#Fit the scaler to the 'Sales' column and transform it
df['Sales_Scaled2'] = scaler.fit_transform(df[['Sales']])
# Display the first few rows of the DataFrame to verify the results
print(df.head())

### Z-Score Normalization vs Min-Max Scaling

**Z-score normalization (Standardization)**:

- Centers the data on the mean and scales it by the standard deviation.
- The resulting distribution has a mean of 0 and a standard deviation of 1.
- Does not bound the data to a specific range.

**Scaling (Min-Max Scaling)**:

- Scales the data to a fixed range, typically between 0 and 1.
- Preserves the shape of the original distribution while rescaling the values.
- The resulting values are bounded within the specified range.
- Does not produce a fixed mean or standard deviation; both depend on the data.
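
The contrast can be seen side by side on a handful of made-up values. The sketch below applies both rescalings with plain pandas arithmetic and summarizes the results; the data is illustrative, and the z-score here uses the pandas default sample standard deviation.

```python
import pandas as pd

values = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])  # hypothetical toy data

# Z-score: center on the mean, scale by the standard deviation
z_scored = (values - values.mean()) / values.std()
# Min-max: shift by the minimum, scale by the range
min_max = (values - values.min()) / (values.max() - values.min())

comparison = pd.DataFrame({
    'original': values,
    'z_score': z_scored,
    'min_max': min_max,
})
print(comparison.round(3))
print(comparison.agg(['mean', 'std', 'min', 'max']).round(3))
```

In the summary, the z-score column has mean 0 and standard deviation 1 with no fixed bounds, while the min-max column runs exactly from 0 to 1 with a mean and standard deviation that depend on the data.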

