<a href="https://colab.research.google.com/github/meetrafay/EDA-portfolio-project/blob/main/EDA_Portfolio_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

EDA Portfolio Project - Treadmill Buyer Profile
===========================================================
Project Details
---------------
The market research team at AeroFit wants to identify the characteristics of the target audience for each type of treadmill offered by the company, so that it can recommend treadmills to new customers more effectively. The team decides to investigate whether customer characteristics differ across the products.
Product Portfolio
-------------------
KP281: An entry-level treadmill that sells for $1,500
KP481: For mid-level runners and sells for $1,750
KP781: Treadmill with advanced features, and it sells for $2,500
Data Description
-----------------
The company collected data on individuals who purchased a treadmill from the AeroFit stores during the prior three months. The dataset in aerofit_treadmill_data.csv has the following features:
Product: Product purchased (KP281, KP481, or KP781)
Age: In years
Gender: Male/female
Education: In years
MaritalStatus: Single or partnered
Usage: The average number of times the customer plans to use the treadmill each week
Fitness: Self-rated fitness on a 1-5 scale, where 1 is poor shape and 5 is excellent shape
Income: Annual income in US dollars
Miles: The average number of miles the customer expects to walk or run each week

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
import warnings
warnings.filterwarnings('ignore')

Data Exploration and Processing
------------------------------------
1. Importing Data
Import necessary libraries and load the dataset.
2. Reading DataFrame
Read the dataset into a pandas DataFrame.
3. Checking DataFrame Shape
Verify the number of rows and columns in the DataFrame.
4. Datatype of Each Column
Check the data type of each column in the DataFrame.
5. Missing Value Detection
Identify and handle missing values in the dataset.
6. Checking Duplicate Values
Detect and remove duplicate values in the dataset.
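
Steps 5 and 6 mention handling missing values and removing duplicates, while the cells below only count them. A minimal sketch of the handling step follows, assuming that dropping incomplete rows is acceptable; the `toy` frame and its values are hypothetical stand-ins for the real CSV.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV; the values are made up.
toy = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP481", "KP781", "KP281"],
    "Age": [18, 18, 25, np.nan, 30],
    "Income": [29562, 29562, 45480, 90886, 38658],
})

# Step 5: count missing values per column, then drop incomplete rows.
missing_per_column = toy.isnull().sum()
cleaned = toy.dropna()

# Step 6: count fully duplicated rows, then keep only the first occurrence.
duplicate_count = cleaned.duplicated().sum()
cleaned = cleaned.drop_duplicates().reset_index(drop=True)

print(missing_per_column)
print(f"Duplicates removed: {duplicate_count}; rows kept: {len(cleaned)}")
```

Imputing (for example with the column median) may be preferable to dropping rows when many values are missing; which one fits depends on the data.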



In [None]:
df = pd.read_csv('/content/aerofit_treadmill_data (2).csv')

In [None]:
print("Shape:", df.shape)

In [None]:
display(df.dtypes)

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
# check null values
df.isnull().sum()

In [None]:
# check duplicates
df.duplicated().sum()

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
display(df.columns)

# Provide an analysis of the statistical summary in a few lines for both categorical and numerical features.

## Statistical Summary Analysis

**Numerical Features:**

The statistical summary reports the central tendency and dispersion of the numerical features: Age, Education, Usage, Fitness, Income, and Miles. The mean and median describe the typical customer, and the standard deviation shows how widely values spread around the mean. A large standard deviation for Income, for example, would indicate a wide range of customer incomes. Where the mean sits well above the median, a few high values are likely pulling it up, so outliers need a separate check. The minimum and maximum give the observed range of each feature; correlations between these features and product choice are examined later.

**Categorical Features:**

For the categorical features (Product, Gender, MaritalStatus), `describe(include='object')` reports the number of non-null entries, the number of unique categories, the most frequent category, and how often it occurs. This identifies the best-selling product and the majority gender and marital status; full per-category frequencies come from the value counts in the next section. A product that is disproportionately common within one gender, for example, would suggest that it appeals more to that group. More in-depth analysis may reveal relationships between these categorical features.
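
The two checks described above can be sketched in a few lines. This is an illustration on a hypothetical `toy` frame with made-up values, not output from the AeroFit data.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP481", "KP781"],
    "Gender": ["Female", "Male", "Male", "Male"],
    "Income": [30000, 40000, 50000, 120000],
})

# Numerical: a mean well above the median hints at right skew or high outliers.
income_summary = toy["Income"].agg(["mean", "median", "std"])
mean_median_gap = income_summary["mean"] - income_summary["median"]

# Categorical: share of each category (in percent) instead of raw counts.
product_share = toy["Product"].value_counts(normalize=True) * 100

print(income_summary)
print(f"Mean minus median: {mean_median_gap:.0f}")
print(product_share.round(1))
```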

Non-Graphical Analysis
--------------------------

Categorical Feature Analysis
----------------------------

Value Counts
------------

Value counts for all categorical features, providing the frequency distribution of each category.

Unique Attributes
-----------------

Unique attributes for all categorical features, identifying distinct categories and their characteristics.

In [None]:
# Identify categorical and numerical columns
categorical_columns = df.select_dtypes(include='object').columns
numerical_columns = df.select_dtypes(exclude='object').columns
# All columns in the DataFrame
all_columns = df.columns

In [None]:
# Frequency of each category
for col in categorical_columns:
    print(f"\nValue counts for {col}:")
    display(df[col].value_counts())

In [None]:
# List distinct categories
for col in categorical_columns:
    print(f"\nUnique values for {col}:")
    print(df[col].unique())

Graphical Analysis
------------------

Univariate Analysis - Numerical Features
----------------------------------------

Visualization Tools
-------------------

*   Distribution Plot: Visualizing the distribution of numerical features to identify patterns and outliers.
    
*   Count Plot: Displaying how often each distinct value occurs; most informative for discrete features such as Education, Usage, and Fitness.
    
*   Box Plot: Illustrating the median and quartiles, and flagging potential outliers as points beyond the whiskers.

In [None]:
plt.figure(figsize=(20, 10))

for index, col in enumerate(numerical_columns):
  plt.subplot(2, 3, index+1)
  # distplot is deprecated in seaborn; histplot with a KDE overlay is the replacement
  sns.histplot(df[col], kde=True)
  plt.title(f'{col} Distribution Plot (figure {index+1})')
  plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20, 10))

for index, col in enumerate(numerical_columns):
  plt.subplot(2, 3, index+1)
  sns.countplot(x=df[col])
  plt.title(f'{col} Count Plot (figure {index+1})')
  plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(20, 10))

for index, col in enumerate(numerical_columns):
  plt.subplot(2, 3, index+1)
  sns.boxplot(x=df[col])
  plt.title(f'{col} Box Plot (figure {index+1})')
  plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

Univariate Analysis
-------------------

Categorical Features
--------------------

Visualization Tools
-------------------

*   Count Plot: Visualizing the frequency distribution of categorical features to identify the most common categories.

In [None]:
plt.figure(figsize=(20, 10))

for index, col in enumerate(categorical_columns):
  plt.subplot(2, 3, index+1)
  sns.countplot(x=df[col])
  plt.title(f'{col} Count Plot (figure {index+1})')
  plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

Bivariate Analysis
------------------

Analyzing Relationships
-----------------------

Feature Interactions
--------------------

Investigating how different features relate to the product purchased:
    
*   Product vs Gender: Examining how product choice differs by gender.
    
*   Product vs MaritalStatus: Analyzing how product choice differs by marital status.
    
*   Product vs Age: Understanding how product choice varies with age.
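
Age takes many distinct values, so a count plot with Age as the hue can be hard to read. One option is to bin Age first and tabulate it against Product. The sketch below uses hypothetical bin edges and a made-up `toy` frame; neither comes from the original analysis.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP481", "KP781", "KP781", "KP481"],
    "Age": [19, 24, 33, 28, 41, 35],
})

# Hypothetical bin edges; right=False makes each bin [start, end).
age_bins = [0, 20, 30, 40, 120]
age_labels = ["<20", "20s", "30s", "40+"]
toy["AgeGroup"] = pd.cut(toy["Age"], bins=age_bins, labels=age_labels, right=False)

# Row-normalized table: within each product, the percentage of buyers per age group.
age_by_product = pd.crosstab(toy["Product"], toy["AgeGroup"], normalize="index") * 100
print(age_by_product.round(1))
```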

In [None]:
plt.figure(figsize=(20, 10))

# Enumerate only the plotted features so subplot positions and figure numbers stay contiguous
for index, col in enumerate(["Age", "Gender", "MaritalStatus"]):
  plt.subplot(1, 3, index+1)
  sns.countplot(x='Product', hue=col, data=df)
  plt.title(f'Product vs {col} Count Plot (figure {index+1})')
  plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

Multivariate Analysis
---------------------

Pairplots for Feature Relationships
-----------------------------------

*   Create pairplots to visualize the relationship between features.

In [None]:
sns.pairplot(df, hue="Product")
plt.show()

Correlation Analysis
--------------------

Heatmap Visualization and Observations
--------------------------------------

*   Display the correlation matrix on a heatmap.
    
*   Provide a brief summary of key findings and observations.

In [None]:
sns.heatmap(df.corr(numeric_only=True),annot=True)
plt.show()

**Age:**

* Age has a moderate positive correlation with Income (0.513414): older customers tend to have higher incomes.
* Age has a weak positive correlation with Education (0.280496).
* Age has a very weak positive correlation with Usage (0.015064), Fitness (0.061105), and Miles (0.036618).

**Education:**

* Education has a moderate positive correlation with Income (0.625827): more years of education tend to go with higher incomes.
* Education has a moderate positive correlation with Usage (0.395155), Fitness (0.410581), and Miles (0.307284).

**Usage:**

* Usage has a strong positive correlation with Fitness (0.668606) and Miles (0.759130): customers who plan to use the treadmill more often tend to rate their fitness higher and expect to cover more miles.
* Usage has a moderate positive correlation with Income (0.519537) and Education (0.395155).

**Fitness:**

* Fitness has a strong positive correlation with Miles (0.785702): higher self-rated fitness is associated with more expected miles.
* Fitness has a moderate positive correlation with Income (0.535005) and Education (0.410581).

**Income:**

* Income has a moderate positive correlation with Miles (0.543473): higher incomes tend to be associated with more expected miles.

**Miles:**

* Miles has a moderate positive correlation with Income (0.543473) and Education (0.307284).
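
Reading every cell of the heatmap by eye is error-prone. As a rough sketch, the pairs can instead be ranked by absolute correlation so each one appears exactly once. The `toy` frame below is hypothetical, so its coefficients will not match the values reported for the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Usage": [2, 3, 3, 4, 5, 6],
    "Fitness": [2, 3, 3, 3, 4, 5],
    "Miles": [40, 75, 66, 85, 120, 200],
    "Age": [30, 22, 41, 25, 35, 28],
})

corr = toy.corr(numeric_only=True)

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once.
upper_mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = (
    corr.where(upper_mask)   # masked-out cells become NaN
        .stack()             # long format indexed by (feature_a, feature_b)
        .dropna()            # discard the masked cells
        .sort_values(key=abs, ascending=False)
)

print(pairs.round(3))
```

Sorting by absolute value keeps strong negative correlations near the top as well, which a plain descending sort would push to the bottom.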

## Outlier Detection:

* Check the shape of each distribution (normality and skewness), then flag outliers using the IQR (Interquartile Range) method.

In [None]:
# Test each numerical column for normality with the Shapiro-Wilk test
from scipy import stats

for col in numerical_columns:
  statistic, pvalue = stats.shapiro(df[col])
  # p > 0.05 means normality cannot be rejected; it does not prove the data is normal
  if pvalue > 0.05:
    print(f"The '{col}' data is consistent with a normal distribution (p = {pvalue:.4f})")
  else:
    print(f"The '{col}' data is not normally distributed (p = {pvalue:.4f})")
  print("----------------------------------------")

In [None]:
# Check data skewness using pandas' Series.skew()
for col in numerical_columns:
  skewness = df[col].skew()
  if skewness > 0.5:
    print(f"The '{col}' data is positively skewed")
    print("----------------------------------------")
  elif skewness < -0.5:
    print(f"The '{col}' data is negatively skewed")
    print("----------------------------------------")
  else:
    print(f"'{col}' not strongly skewed")
    print("----------------------------------------")

In [None]:
#Check for outliers using the IQR
def get_bounds(col):
  Q1 = df[col].quantile(0.25)
  Q3 = df[col].quantile(0.75)

  IQR = Q3 - Q1

  lower_bound = Q1 - 1.5*IQR
  upper_bound = Q3 + 1.5*IQR

  return lower_bound, upper_bound


for col in numerical_columns:
  lower_bound, upper_bound = get_bounds(col)

  outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
  print(f"The {col} data has {len(outliers)} outliers")
  display(outliers)
  print("----------------------------------------")

## 7. Conditional Probabilities:

* What percent of customers have purchased KP281, KP481, or KP781?
* Create frequency tables and calculate the percentage as follows:

    * **Product – Gender**
        * Percentage of a Male customer purchasing a treadmill.
        * Percentage of a Female customer purchasing KP781 treadmill.
        * Probability of a customer being a Female given that Product is KP281.
    * **Product – Age**
        * Percentage of customers with Age between 20s and 30s among all customers.
    * **Product – Income**
        * Percentage of a low-income customer purchasing a treadmill.
        * Percentage of a high-income customer purchasing KP781 treadmill.
        * Percentage of customer with high-income salary buying treadmill given that Product is KP781.
    * **Product – Fitness**
        * Percentage of customers that have fitness level 5.
        * Percentage of a customer with Fitness Level 5 purchasing KP781 treadmill.
        * Percentage of customer with fitness level 5 buying KP781 treadmill
    * **Product - Marital Status**
        * Percentage of customers who are partnered using treadmills.
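
The questions above mix three kinds of percentage: joint, conditional on the product, and conditional on the customer attribute. One way to keep them apart is the `normalize` argument of `pd.crosstab`. The sketch below illustrates this on a hypothetical `toy` frame with made-up rows; it is not the result for the AeroFit data.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP281", "KP481", "KP781", "KP781"],
    "Gender": ["Female", "Female", "Male", "Male", "Male", "Female"],
})

# Joint: P(Product and Gender) as a percentage of all customers.
joint = pd.crosstab(toy["Product"], toy["Gender"], normalize="all") * 100

# P(Gender | Product): each row sums to 100.
gender_given_product = pd.crosstab(toy["Product"], toy["Gender"], normalize="index") * 100

# P(Product | Gender): each column sums to 100.
product_given_gender = pd.crosstab(toy["Product"], toy["Gender"], normalize="columns") * 100

print(joint.round(2))
print(gender_given_product.round(2))
print(product_given_gender.round(2))
```

"Probability of a customer being Female given that Product is KP281" reads from `gender_given_product`, while "percentage of Female customers purchasing KP781" reads from `product_given_gender`. Using the wrong denominator is the most common mistake in this section.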

In [None]:
#percent of customers have purchased KP281, KP481, or KP781
customer_per_by_product = (df["Product"].value_counts() / len(df)) * 100
customer_per_by_product

In [None]:
# Frequency table of Product and Gender
product_gender_freq = pd.crosstab(df['Product'], df['Gender'])
product_gender_freq

In [None]:
# Percentage of customers who are male (share of all treadmill purchases made by men).
total_male = product_gender_freq['Male'].sum()
male_purchase_percent = total_male / len(df) * 100
round(male_purchase_percent, 2)

In [None]:
# Percentage of female customers who purchased the KP781 treadmill.
total_female_buy_KP781 = product_gender_freq.loc['KP781', 'Female']
total_female = product_gender_freq['Female'].sum()

female_purchase_percent_KP781 = (total_female_buy_KP781 / total_female) * 100
round(female_purchase_percent_KP781, 2)

In [None]:
# Probability of a customer being female given that the product is KP281.
total_female_buy_KP281 = product_gender_freq.loc['KP281', 'Female']
total_kp281 = product_gender_freq.loc['KP281'].sum()
probability_female_given_kp281 = (total_female_buy_KP281 / total_kp281) * 100

print(f"Probability of a customer being female given KP281: {probability_female_given_kp281:.2f}%")

In [None]:
# Percentage of customers aged 20 to 30 (inclusive) among all customers.
filtered_ages = df[(df['Age'] >= 20) & (df['Age'] <= 30)]

percent = len(filtered_ages) / len(df) * 100
round(percent, 2)

In [None]:
# Percentage of low-income customers purchasing a treadmill (low income = below the mean income).
income_mean = df['Income'].mean()
low_income_customers = df[df['Income'] < income_mean]
round(len(low_income_customers) / len(df) * 100, 2)

In [None]:
# Percentage of high-income customers (income above the mean) who purchased the KP781 treadmill.
high_income_customers = df[df['Income'] > income_mean]
customer_KP781 = high_income_customers[high_income_customers["Product"] == "KP781"]
round(len(customer_KP781) / len(high_income_customers) * 100, 2)


In [None]:
# Percentage of high-income customers among KP781 buyers, i.e. P(high income | Product is KP781)
kp781_customers = df[df['Product'] == 'KP781']

# Filter for high-income customers within KP781 customers
high_income_given_kp781 = kp781_customers[kp781_customers['Income'] >= income_mean]

# Calculate the percentage
high_income_given_kp781_percent = (len(high_income_given_kp781) / len(kp781_customers)) * 100
display(high_income_given_kp781_percent)

In [None]:
#Percentage of customers that have fitness level 5.
fitness_5_customer = df[df["Fitness"] == 5]
round(len(fitness_5_customer) / len(df) *100, 2)

In [None]:
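# Percentage of Fitness Level 5 customers who purchased the KP781 treadmill.
# Build the mask from fitness_5_customer itself so the index aligns, and divide by that group's size.
KP781_customer = fitness_5_customer[fitness_5_customer["Product"] == "KP781"]
round(len(KP781_customer) / len(fitness_5_customer) * 100, 2)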

In [None]:
#Percentage of customers who are partnered using treadmills.
round(len(df[df["MaritalStatus"] == "Partnered"]) / len(df) * 100, 2)

## Data Profiling Report

In [None]:
!pip install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

In [None]:
Profile = ProfileReport(df,title="aerofit_treadmill dataset profile")

In [None]:
Profile.to_notebook_iframe()

# Actionable Insights and Recommendations based on the AeroFit Treadmill Data Analysis

## Executive Summary:

This report summarizes key findings from the analysis of AeroFit treadmill customer data, identifying customer segments and providing recommendations to refine marketing and product strategies.  The analysis reveals distinct customer profiles for each treadmill model (KP281, KP481, KP781), based on demographics, usage patterns, and fitness levels.

## Key Insights:

* **Product Segmentation:**  Clear distinctions exist between customer segments for each treadmill.  KP781 buyers tend to be higher-income, more educated, and have a higher fitness level and usage frequency.  KP281 buyers form a distinct group, with lower income, education, and fitness levels. KP481 attracts a mid-range profile.

* **Demographics:**  Income and education levels are strongly associated with product choice.  Higher-end models attract higher earners and those with more education.  Age shows a moderate correlation with income (about 0.51).

* **Usage and Fitness:**  Usage and Fitness correlate positively with Income (about 0.52 and 0.54).  Customers who plan to use the treadmill more often and rate their fitness higher tend to opt for the advanced model and to have higher incomes.

* **Marital Status:** Marital status exhibits a slight influence on product choice, although it's less significant than income and fitness.
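
Segment claims like these are easiest to check against a per-product summary table. A minimal sketch follows, using the median as a summary less affected by outliers than the mean; the `toy` frame is hypothetical and its values are made up.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP481", "KP481", "KP781", "KP781"],
    "Income": [35000, 45000, 48000, 52000, 80000, 100000],
    "Fitness": [3, 3, 3, 3, 4, 5],
    "Usage": [3, 3, 3, 4, 5, 5],
})

# One row per product, one column per feature: the typical buyer of each model.
product_profile = toy.groupby("Product")[["Income", "Fitness", "Usage"]].median()
print(product_profile)
```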


## Recommendations:

1. **Targeted Marketing:** Develop targeted marketing campaigns for each product segment.  For instance, leverage online advertising platforms to reach high-income and educated individuals interested in advanced features (KP781).  Focus on value and affordability when promoting the entry-level model (KP281) to those with lower income and fitness goals.

2. **Product Development:** Use customer insights to develop new features and enhancements. For example, consider adding features desired by KP781 users (e.g., advanced training programs, connectivity options) to future models.  Likewise, consider simplifying and cost-reducing features for the entry-level models (KP281) without compromising basic functionality.  Continuously collect and analyze user feedback.

3. **Pricing Strategy:** Assess the price sensitivity of each segment. While the higher-priced models appeal to certain segments, analyze whether there are price optimization opportunities that would attract a broader customer base.

4. **Customer Relationship Management:** Implement customer segmentation in the CRM system to personalize communications and offers. Tailor marketing messages to the needs and product preferences of each customer group.

5. **Further Investigation:** Conduct deeper analysis into the subtle influence of marital status, gender, and age to refine segmentation strategies and product offerings. Additional research might include gathering customer feedback through surveys and focus groups. Investigate the relationship between Product and Miles to understand how expected usage differs across products.

6. **Outlier Analysis:** Address outliers in income and usage. Identify and understand the reasons for extreme values.  These might represent opportunities or data errors.


## Conclusion:

By implementing these recommendations, AeroFit can improve its product offerings, refine its marketing strategies, and boost sales by targeting the right customers with the right message. Continuous monitoring and data analysis will ensure the effectiveness of these strategies.

