#### Q1. Business Case: Aerofit - Descriptive Statistics & Probability

**Mindset:**

1. Evaluation will be kept lenient, so make sure you attempt this case study.
2. It is understandable that you might struggle with getting started on this. Just brainstorm, discuss with peers, or get help from TAs.
3. There is no right or wrong answer. We have to get used to dealing with uncertainty in business. This is exactly the skill we want to develop.

**About Aerofit**

Aerofit is a leading brand in the field of fitness equipment. Aerofit provides a product range including machines such as treadmills, exercise bikes, gym equipment, and fitness accessories to cater to the needs of all categories of people.


**Business Problem**

The market research team at AeroFit wants to identify the characteristics of the target audience for each type of treadmill offered by the company, so that it can recommend the right treadmill to new customers. The team decides to investigate whether customer characteristics differ across the products.

1. Perform descriptive analytics to create a customer profile for each AeroFit treadmill product by developing appropriate tables and charts.
2. For each AeroFit treadmill product, construct two-way contingency tables and compute all conditional and marginal probabilities along with their insights/impact on the business.
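
As a rough illustration of the second task, the sketch below builds a two-way contingency table with `pandas.crosstab` and derives joint, marginal, and conditional probabilities from it. The `toy` frame is hypothetical data that only borrows the `Product` and `Gender` column names; it is not the Aerofit dataset, and the numbers carry no business meaning.

```python
import pandas as pd

# Hypothetical toy data: same column names as the case study, invented rows.
toy = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP281", "KP481", "KP481", "KP781", "KP781", "KP781"],
    "Gender":  ["Male", "Female", "Female", "Male", "Female", "Male", "Male", "Female"],
})

# Two-way contingency table of counts; the "All" row/column hold the totals.
counts = pd.crosstab(toy["Product"], toy["Gender"], margins=True)

# Joint probabilities P(Product, Gender). With margins=True the "All"
# row/column contain the marginal probabilities P(Gender) and P(Product).
joint = pd.crosstab(toy["Product"], toy["Gender"], margins=True, normalize="all")

# Conditional probabilities P(Product | Gender): every column sums to 1.
product_given_gender = pd.crosstab(toy["Product"], toy["Gender"], normalize="columns")

print(counts)
print(joint.round(3))
print(product_given_gender.round(3))
```

Swapping `normalize="columns"` for `normalize="index"` would give P(Gender | Product) instead, which answers a different business question (who buys a given model rather than what a given customer group buys).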

**Dataset**

The company collected the data on individuals who purchased a treadmill from the AeroFit stores during the prior three months. The dataset has the following features:

- Product Purchased:	KP281, KP481, or KP781
- Age:	In years
- Gender:	Male/Female
- Education:	In years
- MaritalStatus:	Single or partnered
- Usage:	The average number of times the customer plans to use the treadmill each week.
- Income:	Annual income (in $)
- Fitness:	Self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.
- Miles:	The average number of miles the customer expects to walk/run each week

**Product Portfolio:**

- The KP281 is an entry-level treadmill that sells for $1,500.
- The KP481 is for mid-level runners and sells for $1,750.
- The KP781 treadmill has advanced features and sells for $2,500.

**What good looks like?**

1. Import the dataset and do usual data analysis steps like checking the structure & characteristics of the dataset
2. Detect Outliers (using boxplot, “describe” method by checking the difference between mean and median)
3. Check if features like marital status, age have any effect on the product purchased (using countplot, histplots, boxplots etc)
4. Represent the marginal probabilities in a table, for example the percentage of customers who purchased KP281, KP481, or KP781 (pandas.crosstab can be used here)
5. Check correlation among different factors using heat maps or pair plots.
6. With all the above steps you can answer questions like: What is the probability of a male customer buying a KP781 treadmill?
7. **Customer Profiling** - Categorization of users.
8. **Probability** - marginal, conditional probability.
9. Some recommendations and actionable insights, based on the inferences.
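
The question in step 6 can be read in more than one way, so it may help to compute each reading explicitly. The following is a minimal sketch on hypothetical rows (not the Aerofit data) that contrasts the joint probability with the two conditional probabilities using boolean masks; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: invented rows that reuse the dataset's column names.
toy = pd.DataFrame({
    "Product": ["KP281", "KP281", "KP481", "KP481", "KP781", "KP781", "KP781", "KP281"],
    "Gender":  ["Male", "Female", "Male", "Female", "Male", "Male", "Female", "Female"],
})

is_male = toy["Gender"] == "Male"
is_kp781 = toy["Product"] == "KP781"

# Joint: share of ALL customers who are male and bought a KP781.
p_male_and_kp781 = (is_male & is_kp781).mean()

# Conditional: among male customers only, the share who bought a KP781.
p_kp781_given_male = is_kp781[is_male].mean()

# Reverse conditional: among KP781 buyers only, the share who are male.
p_male_given_kp781 = is_male[is_kp781].mean()

print(p_male_and_kp781, p_kp781_given_male, p_male_given_kp781)
```

"The probability of a male customer buying a KP781" most naturally corresponds to `p_kp781_given_male`, but stating which of the three quantities is being reported avoids ambiguity in the write-up.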

**Evaluation Criteria**

1. Defining Problem Statement and Analysing basic metrics (10 Points)
    - Observations on shape of data, data types of all the attributes, conversion of categorical attributes to 'category' (If required), statistical summary
2. Non-Graphical Analysis: Value counts and unique attributes (10 Points)
3. Visual Analysis - Univariate & Bivariate (30 Points)
    - For continuous variable(s): Distplot, countplot, histogram for univariate analysis (10 Points)
    - For categorical variable(s): Boxplot (10 Points)
    - For correlation: Heatmaps, Pairplots (10 Points)
4. Missing Value & Outlier Detection (10 Points)
5. Business Insights based on Non-Graphical and Visual Analysis (10 Points)
    - Comments on the range of attributes
    - Comments on the distribution of the variables and relationship between them
    - Comments for each univariate and bivariate plot
6. Recommendations (10 Points) - Actionable items for business. No technical jargon. No complications. Simple action items that everyone can understand


**Submission Process:**

- Type your insights and recommendations in the text editor.
- Convert your Jupyter notebook into a PDF (Save as PDF using the Chrome browser’s Print command), upload it to your Google Drive (set the permission to allow public access), and paste that link in the text editor.
- Optionally, you may add images/graphs in the text editor by taking screenshots or saving matplotlib graphs using plt.savefig(...).
- After submitting, you will not be allowed to edit your submission.

# **Project Solution**


#### Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# import math

In [3]:
df = pd.read_csv("aerofit_treadmill.csv")

**Defining Problem Statement and Analysing basic metrics (10 Points)**

- Observations on shape of data
- Data types of all the attributes
- Conversion of categorical attributes to 'category'

In [4]:
df.shape

(180, 9)

In [5]:
df.head()

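| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles |
|---|---|---|---|---|---|---|---|---|---|
| 0 | KP281 | 18 | Male | 14 | Single | 3 | 4 | 29562 | 112 |
| 1 | KP281 | 19 | Male | 15 | Single | 2 | 3 | 31836 | 75 |
| 2 | KP281 | 19 | Female | 14 | Partnered | 4 | 3 | 30699 | 66 |
| 3 | KP281 | 19 | Male | 12 | Single | 3 | 3 | 32973 | 85 |
| 4 | KP281 | 20 | Male | 13 | Partnered | 4 | 2 | 35247 | 47 |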


In [6]:
df.tail()

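| | Product | Age | Gender | Education | MaritalStatus | Usage | Fitness | Income | Miles |
|---|---|---|---|---|---|---|---|---|---|
| 175 | KP781 | 40 | Male | 21 | Single | 6 | 5 | 83416 | 200 |
| 176 | KP781 | 42 | Male | 18 | Single | 5 | 4 | 89641 | 200 |
| 177 | KP781 | 45 | Male | 16 | Single | 5 | 5 | 90886 | 160 |
| 178 | KP781 | 47 | Male | 18 | Partnered | 4 | 5 | 104581 | 120 |
| 179 | KP781 | 48 | Male | 18 | Partnered | 4 | 5 | 95508 | 180 |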


In [7]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Age | 180.0 | 28.788889 | 6.943498 | 18.0 | 24.0 | 26.0 | 33.0 | 50.0 |
| Education | 180.0 | 15.572222 | 1.617055 | 12.0 | 14.0 | 16.0 | 16.0 | 21.0 |
| Usage | 180.0 | 3.455556 | 1.084797 | 2.0 | 3.0 | 3.0 | 4.0 | 7.0 |
| Fitness | 180.0 | 3.311111 | 0.958869 | 1.0 | 3.0 | 3.0 | 4.0 | 5.0 |
| Income | 180.0 | 53719.577778 | 16506.684226 | 29562.0 | 44058.75 | 50596.5 | 58668.0 | 104581.0 |
| Miles | 180.0 | 103.194444 | 51.863605 | 21.0 | 66.0 | 94.0 | 114.75 | 360.0 |
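
The summary above can already be screened for likely outliers before any plot is drawn. The sketch below re-enters the min, quartile, max, and mean values from that output by hand and applies Tukey's 1.5 × IQR fences to each column; the column names `lower_fence`, `upper_fence`, `has_low_outliers`, `has_high_outliers`, and `mean_minus_median` are introduced here for illustration, and the fence rule is a common heuristic rather than a formal test.

```python
import pandas as pd

# Values copied by hand from the df.describe().T output above.
summary = pd.DataFrame(
    {
        "min":  [18.0, 12.0, 2.0, 1.0, 29562.0, 21.0],
        "25%":  [24.0, 14.0, 3.0, 3.0, 44058.75, 66.0],
        "50%":  [26.0, 16.0, 3.0, 3.0, 50596.5, 94.0],
        "75%":  [33.0, 16.0, 4.0, 4.0, 58668.0, 114.75],
        "max":  [50.0, 21.0, 7.0, 5.0, 104581.0, 360.0],
        "mean": [28.788889, 15.572222, 3.455556, 3.311111, 53719.577778, 103.194444],
    },
    index=["Age", "Education", "Usage", "Fitness", "Income", "Miles"],
)

# Tukey fences: anything below Q1 - 1.5*IQR or above Q3 + 1.5*IQR is suspect.
iqr = summary["75%"] - summary["25%"]
summary["lower_fence"] = summary["25%"] - 1.5 * iqr
summary["upper_fence"] = summary["75%"] + 1.5 * iqr

# If the min or max lies outside a fence, at least one outlier exists on that side.
summary["has_low_outliers"] = summary["min"] < summary["lower_fence"]
summary["has_high_outliers"] = summary["max"] > summary["upper_fence"]

# A positive gap (mean above median) points to a longer right tail.
summary["mean_minus_median"] = summary["mean"] - summary["50%"]

print(summary[["lower_fence", "upper_fence", "has_low_outliers",
               "has_high_outliers", "mean_minus_median"]])
```

This only says whether a column has at least one value beyond a fence, not how many; a boxplot or a row-level filter on `df` is still needed to count and inspect them.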


**Missing Value & Outlier Detection (10 Points)**

In [8]:
df.isna().sum()

Product          0
Age              0
Gender           0
Education        0
MaritalStatus    0
Usage            0
Fitness          0
Income           0
Miles            0
dtype: int64

**There are no NA or Null values in the dataset.**

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Product        180 non-null    object
 1   Age            180 non-null    int64 
 2   Gender         180 non-null    object
 3   Education      180 non-null    int64 
 4   MaritalStatus  180 non-null    object
 5   Usage          180 non-null    int64 
 6   Fitness        180 non-null    int64 
 7   Income         180 non-null    int64 
 8   Miles          180 non-null    int64 
dtypes: int64(6), object(3)
memory usage: 12.8+ KB


### **Changing the "object" datatype to "category" to save memory**

In [10]:
# Measure memory before the conversion so the two figures can be compared
print('Old df memory usage:', df.memory_usage(deep=True).sum())

# Convert the three string columns to the 'category' dtype
for col in ["Product", "Gender", "MaritalStatus"]:
    df[col] = df[col].astype("category")

print('New df memory usage:', df.memory_usage(deep=True).sum())

Old df memory usage: 42721
New df memory usage: 10071
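
The saving comes from how a categorical column is stored: each distinct label is kept once, and every row holds only a small integer code pointing at it. A minimal sketch on a hypothetical column (three product codes repeated 1,000 times, not the Aerofit data) makes this visible:

```python
import pandas as pd

# Hypothetical column: three distinct labels repeated many times.
products = pd.Series(["KP281", "KP481", "KP781"] * 1000)
as_category = products.astype("category")

# deep=True counts the bytes of the Python strings, not just the pointers.
bytes_as_strings = products.memory_usage(deep=True)
bytes_as_category = as_category.memory_usage(deep=True)

print("as strings :", bytes_as_strings)
print("as category:", bytes_as_category)

# The labels are stored once; each row stores only an integer code.
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.dtype)
```

The benefit shrinks as the number of distinct values approaches the number of rows, so the conversion is mainly worthwhile for low-cardinality columns such as `Product`, `Gender`, and `MaritalStatus`.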


**Non-Graphical Analysis: Value counts and unique attributes (10 Points)**

In [11]:
df.columns

Index(['Product', 'Age', 'Gender', 'Education', 'MaritalStatus', 'Usage',
       'Fitness', 'Income', 'Miles'],
      dtype='object')

**Numeric Columns**

In [12]:
numeric_columns = df.select_dtypes(include=np.number).columns
numeric_columns

Index(['Age', 'Education', 'Usage', 'Fitness', 'Income', 'Miles'], dtype='object')

**Categorical Columns**

In [13]:
cat_columns = df.select_dtypes(include=['category']).columns
cat_columns

Index(['Product', 'Gender', 'MaritalStatus'], dtype='object')
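
With the categorical columns identified, value counts and unique counts can be read off directly. The sketch below uses only the ten rows printed by `df.head()` and `df.tail()` earlier, re-entered by hand, so the counts are those of that small slice and not of the full 180-row dataset; the names `sample`, `n_unique`, and `product_share` are illustrative.

```python
import pandas as pd

# The ten rows shown by df.head() and df.tail(), categorical columns only.
sample = pd.DataFrame({
    "Product": ["KP281"] * 5 + ["KP781"] * 5,
    "Gender": ["Male", "Male", "Female", "Male", "Male",
               "Male", "Male", "Male", "Male", "Male"],
    "MaritalStatus": ["Single", "Single", "Partnered", "Single", "Partnered",
                      "Single", "Single", "Single", "Partnered", "Partnered"],
})

# Number of distinct values per column.
n_unique = sample.nunique()
print(n_unique)

# Frequency of each value, column by column.
for col in sample.columns:
    print(sample[col].value_counts(), end="\n\n")

# normalize=True turns counts into shares, i.e. marginal probabilities.
product_share = sample["Product"].value_counts(normalize=True)
print(product_share)
```

On the real data the same loop could run over `cat_columns` on `df`, for example `df[col].value_counts()` for each `col` in `cat_columns`.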

### **Outlier Treatment**

In [14]:
# Age
_col = 'Age'

Q1, Q3 = np.percentile(df[_col], [25, 75])
IQR = Q3 - Q1

print("Old Shape: ", df.shape)

# Upper and lower bounds (1.5 * IQR rule)
upper_bound = Q3 + 1.5 * IQR
lower_bound = Q1 - 1.5 * IQR

# Removing the outliers: keep only rows strictly inside the bounds.
# A boolean mask selects rows by label, so it stays correct even when
# the index is no longer 0..n-1 (for example after an earlier drop).
df = df[(df[_col] > lower_bound) & (df[_col] < upper_bound)]
print("New Shape: ", df.shape)

Old Shape:  (180, 9)
New Shape:  (175, 9)
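
If the same rule is to be applied to further columns, it can be wrapped in a small helper. The sketch below is one possible shape for it, not the notebook's own method: `iqr_bounds` and `drop_iqr_outliers` are hypothetical names, and the `toy` frame is invented data with a deliberately non-contiguous index, as would be left behind once some rows have already been removed.

```python
import pandas as pd


def iqr_bounds(values, k=1.5):
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def drop_iqr_outliers(frame, column, k=1.5):
    """Return a copy of `frame` without rows whose `column` lies on or beyond the fences."""
    lower, upper = iqr_bounds(frame[column], k)
    keep = (frame[column] > lower) & (frame[column] < upper)
    return frame[keep].copy()


# Hypothetical ages on a non-contiguous index (labels 10..17, not 0..7).
toy = pd.DataFrame(
    {"Age": [22, 24, 25, 26, 28, 30, 33, 60]},
    index=[10, 11, 12, 13, 14, 15, 16, 17],
)

bounds = iqr_bounds(toy["Age"])
cleaned = drop_iqr_outliers(toy, "Age")

print(bounds)
print(cleaned.index.tolist())
```

Because the helper returns a new frame instead of modifying its input, the original data stays available for comparison, which is useful when deciding whether dropping the rows is justified at all for a dataset of only 180 customers.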
