# Q1. What is data encoding? How is it useful in data science?

**Data encoding** is a fundamental technique in **data science** that converts data from one representation to another. Its main roles are outlined below:

1. **Data Encoding and Decoding**:
   - **Data encoding** is the process of transforming data into a different format, typically for purposes like transmission, storage, or analysis.
   - Conversely, **data decoding** reverses this process, converting data back to its original form for interpretation or use.
   - In data science, encoding most often means turning non-numeric values, such as category labels, into numbers that an algorithm can process.

2. **Why Data Encoding Matters**:
   - **Data is everywhere**, but it needs processing, transformation, and interpretation to extract meaning and value.
   - **Data encoding and decoding enable us to**:
     - **Prepare data for analysis**: By converting it into a suitable format for algorithms or models.
     - **Engineer features**: Extract relevant information and create new variables to enhance analysis accuracy.
     - **Compress data**: Reduce size or complexity without losing essential information.
     - **Protect data**: Encrypt or mask it to prevent unauthorized access.

3. **Encoding Techniques in Data Science**:
   - **One-hot Encoding**:
     - Used for handling **categorical variables** (e.g., gender, color, country).
     - Converts each category into a binary vector (0s and 1s).
     - Helps create **dummy variables** for machine learning models.
     - Avoids problems related to ordinality (implicit order or ranking).
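
As a rough illustration of both encoding and decoding, the sketch below one-hot encodes a small categorical column and then maps the binary vectors back to the original labels. The `colors` values are made up for the example; `OneHotEncoder` and its `inverse_transform` method come from scikit-learn.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical column (illustrative values only)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

# Encoding: each category becomes its own 0/1 column
encoder = OneHotEncoder()
encoded = encoder.fit_transform(colors).toarray()  # convert sparse output to a dense array

# The encoder stores the categories it found in sorted order
print(encoder.categories_[0])
print(encoded)

# Decoding: map each binary vector back to its original label
decoded = encoder.inverse_transform(encoded)
print(decoded.ravel())
```

Each row of `encoded` contains exactly one 1, so no category is numerically "larger" than another.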


# Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

**Nominal encoding** is a technique used in data science to handle **categorical features** that have **no inherent order or ranking** among their categories.

1. **Nominal Encoding**:
   - Nominal encoding applies to features whose values are **just names**, with **no meaningful order** or rank.
   - Examples of nominal features include:
     - **City of residence**: Different cities have no inherent order or ranking.
     - **Gender**: Male, female, and other gender identities are equal and lack any specific order.
     - **Marital status**: Single, married, divorced—these categories don't follow a natural sequence.

2. **Real-World Scenario**:
   - Suppose we're analyzing a dataset related to **customer preferences** for a retail company. We have a feature called **"Preferred Payment Method"** with the following categories:
     - **Credit card**
     - **Cash**
     - **Mobile wallet**
     - **Bank transfer**

   - Since these payment methods are nominal (no inherent order), we can encode them as follows:
     - **Credit card**: 0
     - **Cash**: 1
     - **Mobile wallet**: 2
     - **Bank transfer**: 3

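   - The integer codes let machine learning models process this categorical feature, for example to predict customer behavior or tailor marketing strategies. The codes are identifiers only: their magnitudes carry no meaning, so models that treat inputs as quantities (linear or distance-based models, for instance) may read a false order into them. Tree-based models are usually less sensitive to this.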
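
The mapping above can be sketched with a plain dictionary and pandas. The sample values in `payments` and the variable names are assumptions made for illustration; only the code table itself comes from the list above.

```python
import pandas as pd

# Hypothetical sample of the "Preferred Payment Method" feature
payments = pd.Series(
    ["Cash", "Credit card", "Bank transfer", "Mobile wallet", "Cash"],
    name="preferred_payment_method",
)

# Explicit code table, matching the assignment listed above
payment_codes = {
    "Credit card": 0,
    "Cash": 1,
    "Mobile wallet": 2,
    "Bank transfer": 3,
}

# Encode: replace each label with its integer code
encoded = payments.map(payment_codes)

# Decode: reverse lookup from integer code back to label
code_to_label = {code: label for label, code in payment_codes.items()}
decoded = encoded.map(code_to_label)

print(encoded.tolist())
print(decoded.tolist())
```

Keeping the code table explicit makes the encoding reproducible and easy to reverse.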


# Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.


1. **Nominal Encoding vs. One-Hot Encoding**:
   - **Nominal Encoding**:
     - Assigns each category a single integer code. The codes are arbitrary identifiers; some tools assign them in alphabetical order, but that order carries no meaning.
     - Suitable for **nominal features** where there's **no inherent order** among categories.
     - Examples: **City names**, **gender**, **marital status**.
   - **One-Hot Encoding**:
     - Creates one new binary variable (0 or 1) per category.
     - Also intended for **nominal categorical data**, and preferable when the model might otherwise read an order into integer codes.
     - Examples: **Colors**, **brands**, **payment methods**.

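2. **When to Use Nominal Encoding**:
   - **Column count must stay small**:
     - Nominal encoding always produces a single column, whereas one-hot encoding adds one column per category. The difference grows with the number of unique categories.
   - **No natural order**:
     - The feature has **no inherent ranking** among categories, and the model does not treat the integer codes as quantities (tree-based models, for example).
     - Example: **Gender** (male, female, other).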

3. **Practical Example**:
   - Imagine a dataset for a **music streaming app**:
     - **Genre** (categorical feature):
       - Categories: **Pop**, **Rock**, **Jazz**, **Electronic**.
       - No inherent order among genres.
     - **Nominal Encoding**:
       - Pop: 0
       - Rock: 1
       - Jazz: 2
       - Electronic: 3
     - Using nominal encoding, we create a single numeric column for the genre feature. The codes are labels only and are not meant to imply any ranking.
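
A minimal sketch of the genre example, comparing the number of columns each approach produces. The sample rows are hypothetical. `pd.factorize` is used here because it assigns codes in order of first appearance, which reproduces the codes listed above for this particular sample; other tools may assign codes differently.

```python
import pandas as pd

# Hypothetical listening records for the music streaming app
genres = pd.Series(["Pop", "Rock", "Jazz", "Electronic", "Rock", "Pop"], name="genre")

# Integer (nominal) encoding: one column, codes assigned by first appearance
codes, uniques = pd.factorize(genres)
integer_encoded = pd.DataFrame({"genre_code": codes})

# One-hot encoding: one column per genre
one_hot_encoded = pd.get_dummies(genres, prefix="genre")

print(dict(zip(uniques, range(len(uniques)))))
print(integer_encoded.shape[1], "column vs", one_hot_encoded.shape[1], "columns")
```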


# Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

## **One-Hot Encoding:**
- One-hot encoding maps each label to a binary vector (0s and 1s).
- For our 5 categories, we would create 5 binary features, one for each unique value.
- This technique is ideal when there’s no inherent order among the categories.
- It avoids the assumption of ordinality and gives each category its own clear representation.
- It ensures that the machine learning model treats all categories equally and independently.
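
The following sketch shows what this looks like for a feature with 5 unique values. The `department` column and its values are hypothetical. `drop_first=True` is shown as an optional variant that keeps 4 columns instead of 5, which is sometimes used with linear models to avoid a redundant column.

```python
import pandas as pd

# Hypothetical nominal feature with exactly 5 unique values
dept = pd.Series(
    ["Sales", "HR", "IT", "Finance", "Legal", "IT", "HR"],
    name="department",
)

# Full one-hot encoding: 5 binary columns, one per category
one_hot = pd.get_dummies(dept, prefix="department", dtype=int)

# Optional variant: drop the first category, leaving 4 columns
one_hot_reduced = pd.get_dummies(dept, prefix="department", dtype=int, drop_first=True)

print(one_hot.columns.tolist())
print(one_hot.shape, one_hot_reduced.shape)
```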

# Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

1. Assuming nominal encoding is implemented as one-hot encoding, each unique value in a categorical column becomes its own new column.
2. The question does not give the number of unique values, so let’s assume the first categorical column has 12 unique values and the second has 5. One-hot encoding then creates 12 + 5 = 17 new columns. These replace the 2 original categorical columns, so the encoded dataset has 3 + 17 = 20 columns and still 1000 rows.
3. In general, if the first categorical column has n unique values and the second has m, the encoding creates n + m new columns. (If the categories were instead mapped to integer codes, as in Q2, each column would be encoded in place and no new columns would be created.)
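
The calculation can be checked on a synthetic dataset. Everything in the sketch below is assumed for illustration: the column names, the values, and the choice of 12 and 5 unique categories, which mirrors the assumption made in the answer above.

```python
import numpy as np
import pandas as pd

n_rows = 1000
rng = np.random.default_rng(0)

# Hypothetical dataset: 3 numerical columns and 2 categorical columns
df = pd.DataFrame({
    "num_1": rng.normal(size=n_rows),
    "num_2": rng.normal(size=n_rows),
    "num_3": rng.normal(size=n_rows),
    # Cycle through labels so that every category is guaranteed to appear
    "cat_a": [f"a{i % 12}" for i in range(n_rows)],  # n = 12 unique values
    "cat_b": [f"b{i % 5}" for i in range(n_rows)],   # m = 5 unique values
})

n = df["cat_a"].nunique()
m = df["cat_b"].nunique()

# One-hot encode both categorical columns; the originals are replaced
encoded = pd.get_dummies(df, columns=["cat_a", "cat_b"])

new_columns = n + m              # columns created by the encoding
total_columns = 3 + new_columns  # numerical columns plus encoded columns

print(new_columns, total_columns, encoded.shape)
```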

# Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

1. The variables species, habitat, and diet are nominal features with no natural order or ranking.
2. When a categorical variable has no natural order or ranking, one-hot encoding is appropriate.
3. Hence one-hot encoding would be preferred in this case.

# Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

To transform the categorical data in the customer churn dataset into numerical data, we can combine ordinal encoding and one-hot encoding:

- If a categorical variable has a natural order or ranking, ordinal encoding can be used. Contract type is an example: its categories can be ranked by contract length, such as “month-to-month”, “one year”, and “two year” (the synthetic data below uses “monthly”, “quarterly”, “half yearly”, and “yearly”).
- If a categorical variable has no natural order or ranking, one-hot encoding can be used. Gender, with categories such as “male” and “female”, is an example.

Here are the steps to implement this encoding:

1. Identify the categorical variables in the dataset. In this case, they are the customer’s gender and contract type.
2. Separate the nominal and ordinal variables. Here gender is a nominal variable, while contract type is an ordinal variable.
3. Apply one-hot encoding to the nominal variable (gender).
4. Apply ordinal encoding to the ordinal variable (contract type).
5. Combine the numerical, ordinal-encoded, and one-hot-encoded columns into a single DataFrame.
6. Scale the combined data using StandardScaler.
7. Split the data into training and test sets. The data is now ready for a machine learning model.

In [1]:
import numpy as np
import pandas as pd

# Setting random seed 
np.random.seed(543)
n = 1000 # dataset size
gender = np.random.choice(['Male','Female'],size=n)
age = np.random.randint(low=25, high=65, size=n)
contract = np.random.choice(['monthly','quarterly','half yearly','yearly'], size=n)
monthly_charges = np.random.normal(loc=1000, scale=100,size=n)
tenure = np.random.randint(low=12, high=36, size=n)
churn = np.random.choice([1,0], p=[0.2,0.8], size=n)

# Creating dictionary
dct = {
    'gender':gender,
    'age':age,
    'contract':contract,
    'monthly_charges':monthly_charges,
    'tenure':tenure,
    'churn':churn
}

# Creating Dataframe
df = pd.DataFrame(dct)
df.head()

|  | gender | age | contract | monthly_charges | tenure | churn |
|---|---|---|---|---|---|---|
| 0 | Female | 34 | yearly | 1042.202157 | 25 | 1 |
| 1 | Female | 36 | half yearly | 966.337735 | 12 | 1 |
| 2 | Female | 30 | yearly | 1165.040177 | 16 | 0 |
| 3 | Female | 44 | quarterly | 1002.266319 | 35 | 0 |
| 4 | Male | 53 | half yearly | 1043.851952 | 16 | 1 |


In [2]:
# Separating X and Y
X = df.drop(labels=['churn'],axis=1)
Y = df[['churn']]

In [3]:
# Performing one hot encoding on gender column
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
gender_ohe = ohe.fit_transform(X[['gender']]).toarray()
# Get the DataFrame
X_gender = pd.DataFrame(gender_ohe,columns=ohe.get_feature_names_out())
X_gender.head()

|  | gender_Female | gender_Male |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 |


In [4]:
# Performing ordinal encoding on Contract type
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder(categories=[['monthly','quarterly','half yearly','yearly']])
# Getting ordinal dataframe
contract_encoded = ord_enc.fit_transform(X[['contract']]).flatten()
X_contract = pd.DataFrame(contract_encoded,columns=['contract'])
X_contract.head()

|  | contract |
|---|---|
| 0 | 3.0 |
| 1 | 2.0 |
| 2 | 3.0 |
| 3 | 1.0 |
| 4 | 2.0 |


In [6]:
# Getting numeric variables (select by numeric dtype rather than excluding object columns)
X_numeric = X.select_dtypes(include='number')
X_numeric.head()

|  | age | monthly_charges | tenure |
|---|---|---|---|
| 0 | 34 | 1042.202157 | 25 |
| 1 | 36 | 966.337735 | 12 |
| 2 | 30 | 1165.040177 | 16 |
| 3 | 44 | 1002.266319 | 35 |
| 4 | 53 | 1043.851952 | 16 |


In [7]:
# Concatenating all 3 variables Nominal, Ordinal and Numerical
X_encoded = pd.concat([X_numeric,X_contract,X_gender],axis=1)
X_encoded.head()

|  | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 0 | 34 | 1042.202157 | 25 | 3.0 | 1.0 | 0.0 |
| 1 | 36 | 966.337735 | 12 | 2.0 | 1.0 | 0.0 |
| 2 | 30 | 1165.040177 | 16 | 3.0 | 1.0 | 0.0 |
| 3 | 44 | 1002.266319 | 35 | 1.0 | 1.0 | 0.0 |
| 4 | 53 | 1043.851952 | 16 | 2.0 | 0.0 | 1.0 |


In [8]:
# Applying StandardScaler to entire encoded dataset
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_final = pd.DataFrame(scaler.fit_transform(X_encoded),columns=X_encoded.columns)
X_final.head()

|  | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 0 | -0.897852 | 0.446463 | 0.244225 | 1.354668 | 0.992032 | -0.992032 |
| 1 | -0.722935 | -0.322028 | -1.622281 | 0.462265 | 0.992032 | -0.992032 |
| 2 | -1.247688 | 1.690786 | -1.047972 | 1.354668 | 0.992032 | -0.992032 |
| 3 | -0.023264 | 0.041921 | 1.679999 | -0.430138 | 0.992032 | -0.992032 |
| 4 | 0.763865 | 0.463175 | -1.047972 | 0.462265 | -1.008032 | 1.008032 |


In [10]:
# Train test split
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X_final,Y,test_size=0.2, random_state=42, stratify=Y)

In [11]:
xtrain.head()

|  | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 514 | -0.635476 | -0.898498 | -1.478704 | 1.354668 | -1.008032 | 1.008032 |
| 224 | -1.07277 | -0.721926 | -1.191549 | -1.322542 | -1.008032 | 1.008032 |
| 845 | 1.463536 | -0.933469 | -0.617239 | 0.462265 | 0.992032 | -0.992032 |
| 736 | 1.026242 | -1.537408 | 0.962112 | -1.322542 | 0.992032 | -0.992032 |
| 792 | -0.722935 | -0.673444 | 0.818535 | -0.430138 | 0.992032 | -0.992032 |


In [12]:
ytrain.head()

|  | churn |
|---|---|
| 514 | 0 |
| 224 | 0 |
| 845 | 0 |
| 736 | 0 |
| 792 | 0 |


In [13]:
xtest.head()

|  | age | monthly_charges | tenure | contract | gender_Female | gender_Male |
|---|---|---|---|---|---|---|
| 855 | 0.676407 | 1.817057 | 0.244225 | -0.430138 | 0.992032 | -0.992032 |
| 942 | 0.064195 | -0.282519 | -1.335126 | 1.354668 | 0.992032 | -0.992032 |
| 234 | -0.897852 | 1.087411 | -1.335126 | 0.462265 | 0.992032 | -0.992032 |
| 108 | -0.985311 | 1.01579 | -1.622281 | 0.462265 | 0.992032 | -0.992032 |
| 697 | 0.151654 | -1.651405 | -1.622281 | 1.354668 | 0.992032 | -0.992032 |


In [14]:
ytest.head()

|  | churn |
|---|---|
| 855 | 0 |
| 942 | 0 |
| 234 | 0 |
| 108 | 0 |
| 697 | 0 |
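
As an alternative sketch, the same encoding steps can be bundled into a single scikit-learn `ColumnTransformer`. This is not the approach used in the cells above; it differs in two ways that are assumptions of this example: the split happens before any fitting, so the encoders and the scaler learn only from the training rows, and only the three numeric columns are scaled. The dataset is a smaller random stand-in with the same column names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Small synthetic stand-in for the churn dataset (values are random)
rng = np.random.default_rng(543)
n = 200
contract_order = ["monthly", "quarterly", "half yearly", "yearly"]
df = pd.DataFrame({
    "gender": rng.choice(["Male", "Female"], size=n),
    "age": rng.integers(25, 65, size=n),
    "contract": rng.choice(contract_order, size=n),
    "monthly_charges": rng.normal(1000, 100, size=n),
    "tenure": rng.integers(12, 36, size=n),
    "churn": rng.choice([1, 0], p=[0.2, 0.8], size=n),
})
X = df.drop(columns="churn")
y = df["churn"]

# Split first so that nothing is fitted on the test rows
xtrain, xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

preprocess = ColumnTransformer(
    [
        ("nominal", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
        ("ordinal", OrdinalEncoder(categories=[contract_order]), ["contract"]),
        ("numeric", StandardScaler(), ["age", "monthly_charges", "tenure"]),
    ],
    sparse_threshold=0.0,  # always return a dense array
)

# Fit on the training split only, then reuse the fitted transformer on the test split
xtrain_enc = preprocess.fit_transform(xtrain)
xtest_enc = preprocess.transform(xtest)

print(xtrain_enc.shape, xtest_enc.shape)
```

The output columns follow the order of the transformer list: two gender columns, one contract column, then the three scaled numeric columns.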
