# 1. Variables

In analytics and statistics, variables are classified into four types:

<img src="variables.png" alt="Variable types" width="600"/>

## 1.a. Numerical Variable

A numerical variable has a notion of magnitude or quantity. It represents a measurement and is also known as a **quantitative variable**.

- Numerical variables are divided into two types: **discrete** and **continuous**.

- **Continuous** variables are measured rather than counted and can take infinitely many values.
    - Examples: Age, Salary, Sales Revenue
    - Between the minimum and maximum values, a continuous variable can take any value, including fractions, so the number of possible values is infinite.
    
- **Discrete** variables are countable and take integer values. Within any bounded range, the number of possible values is finite.
    - Number of cars a person owns.
    - Number of children or dependents a person has.
    - Discrete variables cannot take fractional (float) values.
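
As a rough illustration of the distinction, the sketch below checks the dtype of two columns in pandas. The DataFrame and its values are hypothetical and are not part of the customer dataset used later. An integer dtype is only a hint that a column is discrete, not proof.

```python
import pandas as pd

# Hypothetical toy data, not taken from the course dataset.
toy_df = pd.DataFrame({
    "cars_owned": [0, 1, 2, 1],                          # discrete: whole-number counts
    "salary": [41250.50, 58000.00, 72310.75, 39999.99],  # continuous: fractions allowed
})

# Integer dtype suggests a discrete variable; float dtype suggests a continuous one.
is_integer_dtype = {
    col: pd.api.types.is_integer_dtype(toy_df[col]) for col in toy_df.columns
}
print(is_integer_dtype)
```

A float column can still hold a discrete variable (for example, counts stored as floats because of missing values), so the dtype should be read together with the meaning of the column.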

## 1.b. Categorical Variable 

Categorical variables are not numerical or measurable; their values fit into levels or categories. They are also known as **qualitative variables**.

- Categorical variables are divided into two types: **nominal** and **ordinal**.

- A **nominal** variable has unordered levels: no ordering of its categories is possible. Nominal variables can have two or more levels or categories.

    - Gender
    - Color of a car
    - Names of weekdays
    
- An **ordinal** variable has an order implied in its levels or categories.

    - Food Taste: Poor, Average, Good, Excellent
    - Compensation Brackets: High, Medium, or Low
    - There is an order, *High > Medium > Low*, but the gap between levels cannot be quantified.
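
One possible way to represent an ordinal variable in pandas is an ordered `Categorical`. The sketch below uses the food taste levels from the list above; the sample values themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical ratings using the food taste levels listed above.
taste = pd.Series(["Good", "Poor", "Excellent", "Average", "Good"])

# Declare the levels from lowest to highest and mark them as ordered.
taste_levels = ["Poor", "Average", "Good", "Excellent"]
taste_cat = pd.Series(pd.Categorical(taste, categories=taste_levels, ordered=True))

# Sorting and comparisons now follow the declared order, not alphabetical order.
sorted_taste = taste_cat.sort_values().tolist()
at_least_good = (taste_cat >= "Good").tolist()

print(sorted_taste)
print(at_least_good)
```

The order supports sorting and comparisons, but arithmetic such as "Good minus Poor" remains undefined, which matches the point that the levels cannot be quantified.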
    
#### Note: 
- Categorical values that are encoded as numbers, such as 0 or 1, are still categorical variables. For example, male may be encoded as 0 and female as 1.
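
A minimal sketch of this note, assuming a hypothetical column of 0/1 codes: the codes can be mapped back to labels and stored with the `category` dtype so that pandas treats them as categories rather than quantities.

```python
import pandas as pd

# Hypothetical encoded column: 0 = male, 1 = female.
gender_code = pd.Series([0, 1, 1, 0, 1])

# The mean is computable, but it is only the share of 1s, not an "average gender".
share_of_ones = gender_code.mean()

# Map the codes back to labels and store them as a categorical column.
gender = gender_code.map({0: "male", 1: "female"}).astype("category")
counts = gender.value_counts().to_dict()

print(share_of_ones)
print(counts)
```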

# 2. Customer Data

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sn

In [None]:
DATA_PATH = 'https://raw.githubusercontent.com/manaranjanp/MLCourseV1/main/Session_2/'

customer_df = pd.read_csv(DATA_PATH+"Customers.csv")

In [None]:
customer_df.head(10)

## 2.a. Variable Types:

| Column| Variable|
|--------|---------|
|CustomerID | Not a variable|
|Gender| Nominal|
|Income| Continuous|
|Profession| Nominal|
|Family Size| Discrete|
|Agegroup| Ordinal|
    

In [None]:
customer_df.info()

# 3. Distribution of Continuous Variables

- Draw a histogram for continuous values
- Draw a countplot or barplot for discrete values

In [None]:
income_stats = customer_df.Income.describe()
income_stats

In [None]:
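plt.figure(figsize=(12, 5))
# Bin edges every 20,000 from 0 to 180,000 (range() excludes the stop value 200000)
plt.hist(customer_df.Income, bins = range(0, 200000, 20000));
# Put the x-axis ticks on the bin edges
plt.xticks(range(0, 200000, 20000));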
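
One detail of the histogram cell is worth checking: `range(0, 200000, 20000)` stops at 180000, so any value above the last edge is left out of the plot. The sketch below shows this with `numpy.histogram`, which matplotlib's `hist` uses for binning. The income values are hypothetical; whether the real `Income` column contains values above 180,000 is not something this sketch can tell.

```python
import numpy as np

# Hypothetical incomes; the last one lies above the final bin edge used above.
incomes = np.array([15000, 42000, 42500, 99000, 185000])

# Same edges as the histogram cell: 0, 20000, ..., 180000 (10 edges, 9 bins).
edges = list(range(0, 200000, 20000))
counts, _ = np.histogram(incomes, bins=edges)
n_dropped = len(incomes) - counts.sum()   # values falling outside the edges

# Extending the stop value by one step adds the 180000-200000 bin.
edges_incl = list(range(0, 200001, 20000))
counts_incl, _ = np.histogram(incomes, bins=edges_incl)

print(counts.tolist(), n_dropped)
print(counts_incl.tolist())
```

Comparing `customer_df.Income.max()` with the last edge is a quick way to confirm whether the chosen bins cover the data.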

In [None]:
plt.figure(figsize=(12, 5))
sn.countplot(data = customer_df,
             x = 'Family Size');

# 3. Distribution of Categorical Variables

- Find unique values and their frequencies
- Create a barplot
- Order the barplot by count

### 3.a. Find unique values and their frequencies

In [None]:
customer_df.Profession.unique()

In [None]:
customer_df.Profession.value_counts()

In [None]:
customer_df.Profession.value_counts(normalize=True)

In [None]:
customer_df.Agegroup.unique()

In [None]:
customer_df.Agegroup.value_counts(normalize=True)
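
To make the difference between the calls above concrete, here is a small sketch on a hypothetical Series (the profession labels are invented, not read from `Customers.csv`). `value_counts()` returns absolute frequencies sorted in descending order, and `normalize=True` converts them to proportions that sum to 1.

```python
import pandas as pd

# Hypothetical values, not read from Customers.csv.
profession = pd.Series(
    ["Engineer", "Doctor", "Engineer", "Artist", "Engineer", "Doctor"]
)

counts = profession.value_counts()                # absolute frequencies
shares = profession.value_counts(normalize=True)  # relative frequencies

print(counts)
print(shares.round(3))
```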

### 3.b. Create a bar plot

In [None]:
plt.figure(figsize=(12, 5))
sn.countplot(data = customer_df,
             x = 'Profession');

In [None]:
plt.figure(figsize=(12, 5))
ax = sn.countplot(data = customer_df,
                  x = 'Profession');

ax.bar_label(ax.containers[0], label_type='edge');

### 3.c. Order the barplot by counts

In [None]:
customer_df.Profession.value_counts()

In [None]:
customer_df.Profession.value_counts().index

In [None]:
plt.figure(figsize=(12, 5))
ax = sn.countplot(data = customer_df,
                  x = 'Profession',
                  order = customer_df.Profession.value_counts().index);

ax.bar_label(ax.containers[0], label_type='edge');
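
The `order` argument of `sn.countplot` accepts any list of category labels, so ordering by count is one choice among several. For an ordinal variable, the logical order of the levels may read better. The sketch below builds both lists on hypothetical data; the bracket labels are assumptions for illustration.

```python
import pandas as pd

# Hypothetical ordinal values (compensation brackets).
brackets = pd.Series(["Medium", "High", "Medium", "Low", "Medium", "High"])

# Order by frequency, most common first (same idea as the cell above).
by_count = brackets.value_counts().index.tolist()

# Logical order of the levels, written out by hand.
logical_order = ["Low", "Medium", "High"]
counts_logical = brackets.value_counts().reindex(logical_order)

print(by_count)
print(counts_logical.tolist())
```

Either `by_count` or `logical_order` could then be passed as `order=` to the count plot.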