# Week 13 Feature Engineering 3

### Q1. What is data encoding? How is it useful in data science?

Data encoding is the process of converting data from one form to another. In machine learning, it usually means converting categorical data into numerical data so that models can use it. Two popular techniques for encoding categorical data are ordinal encoding and one-hot encoding.
1. Categorical data is data that is divided into categories or groups, such as colors, shapes, or sizes. Most machine learning models cannot work with categorical data directly, so it needs to be converted into numerical data.
2. Ordinal encoding converts each category into an integer, so that the encoded column can be passed to a model.
3. One-hot encoding converts each category into its own binary (0/1) column, so that the encoded columns can be passed to a model.
4. Data encoding is useful in data science because most algorithms expect numerical input. Choosing an encoding that matches the nature of the variable (ordered or unordered) lets the model use the categorical information without learning relationships that do not exist.
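
The difference between the two techniques can be sketched on a toy column. This is a minimal illustration with pandas; the `sizes` column and the `size_order` mapping are made up for the example and are not part of any dataset in this notebook.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
sizes = pd.Series(["small", "large", "medium", "small"], name="size")

# Ordinal encoding: map each category to an integer that reflects its rank.
size_order = {"small": 0, "medium": 1, "large": 2}
ordinal = sizes.map(size_order)

# One-hot encoding: one binary indicator column per category.
one_hot = pd.get_dummies(sizes, prefix="size", dtype=int)

print(ordinal.tolist())
print(one_hot)
```

The ordinal version keeps a single column whose integers carry the ranking, while the one-hot version has one column per category and exactly one `1` in every row.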

### Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.

1. Nominal encoding is a type of data encoding used to convert nominal categorical data into numerical data. Nominal data is data that does not have an inherent order, such as colors, names, or categories.
2. An example of how nominal encoding can be used in a real-world scenario is in the analysis of customer data. Suppose a company has a customer database that includes information such as the customer’s name, age, gender, and location. The location data is nominal data because it does not have an inherent order. To use this data in a machine learning model, the location data needs to be converted into numerical data using nominal encoding.
3. One way to do this is to use one-hot encoding, which creates a new binary feature for each category in the nominal data. For example, if the location data includes categories such as “New York,” “Los Angeles,” and “Chicago,” then one-hot encoding would create three new binary features, one for each category.
4. Nominal encoding is useful in data science because it lets unordered categories be used in machine learning models without implying a ranking between them.

Example of One Hot Encoding below
![image.png](attachment:8e1e752c-f593-4848-83bd-1a193a275611.png)

### Q3. In what situations is nominal encoding preferred over one-hot encoding? Provide a practical example.

One-hot encoding is itself a nominal encoding technique: it encodes categorical variables that have no ranks.

The intended question is: when is ORDINAL encoding preferred over one-hot encoding?
1. Ordinal encoding is preferred over one-hot encoding when the categorical variable has a natural order or ranking. For example, in the context of machine learning, ordinal encoding can be used to encode the education level of a person, where the education level has a natural order such as “high school”, “bachelor’s degree”, “master’s degree”, and "doctorate degree".
2. One-hot encoding is preferred when the categorical variable has no natural order or ranking. For example, in the context of machine learning, one-hot encoding can be used to encode the color of a car, where the colors have no natural order or ranking.
3. It is important to note that forcing an ordinal relationship via an ordinal encoding and allowing the model to assume a natural ordering between categories may result in poor performance or unexpected results (predictions halfway between categories).
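
The education-level example can be sketched with scikit-learn's `OrdinalEncoder`. This is a minimal illustration: the `people` frame is hypothetical, and the order is passed explicitly through `categories` so that the integers follow the ranking described above rather than alphabetical order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Rank order of the categories, lowest to highest.
education_levels = [
    "high school",
    "bachelor's degree",
    "master's degree",
    "doctorate degree",
]

# Hypothetical data for illustration.
people = pd.DataFrame(
    {
        "education": [
            "master's degree",
            "high school",
            "doctorate degree",
            "bachelor's degree",
        ]
    }
)

# Passing the order explicitly keeps the encoding aligned with the ranking.
encoder = OrdinalEncoder(categories=[education_levels])
people["education_encoded"] = encoder.fit_transform(people[["education"]]).ravel()
print(people)
```

Without the explicit `categories` argument, the encoder would sort the labels alphabetically, which would not match the real ranking.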

### Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

1. If I have a dataset containing categorical data with 5 unique values, I can use either ordinal encoding or one-hot encoding to transform this data into a format suitable for machine learning algorithms.
2. If the categorical variable has a natural order or ranking, then ordinal encoding can be used. If the categorical variable has no natural order or ranking, then one-hot encoding can be used.
3. In general, one-hot encoding is the safer default for nominal variables because it does not assume any ordinal relationship between the categories. However, it adds one column per unique value, so it can lead to the curse of dimensionality if the number of unique values is very large. With only 5 unique values, this is not a concern.
4. Ordinal encoding produces a single column regardless of the number of unique values, so it is sometimes used to avoid that dimensionality problem. On a nominal variable, however, it imposes an artificial order, so it should be used with care.

### Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

1. Nominal (one-hot) encoding creates one new column for each unique value in each categorical column.
2. The question does not state how many unique values each column has, so let's assume that the first categorical column has 10 unique values and the second has 5. One-hot encoding would then create 10 + 5 = 15 new columns. These replace the 2 original categorical columns, so the encoded dataset has 3 + 15 = 18 columns in total.
3. In general, if the first categorical column has n unique values and the second categorical column has m unique values, then one-hot encoding would create n + m new columns.
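
The column count can be checked with a small sketch. It assumes the same n = 10 and m = 5 as above. The column names (`cat_a`, `cat_b`, `num_1` to `num_3`) and the values are hypothetical placeholders, built only to have the right shape.

```python
import pandas as pd

n_rows = 1000

# Hypothetical data: labels are cycled so that every category appears.
data = pd.DataFrame(
    {
        "cat_a": [f"a{i % 10}" for i in range(n_rows)],  # n = 10 unique values
        "cat_b": [f"b{i % 5}" for i in range(n_rows)],   # m = 5 unique values
        "num_1": range(n_rows),
        "num_2": range(n_rows),
        "num_3": range(n_rows),
    }
)

# One-hot encode only the two categorical columns.
encoded = pd.get_dummies(data, columns=["cat_a", "cat_b"], dtype=int)
new_columns = [c for c in encoded.columns if c not in data.columns]

print(len(new_columns))  # n + m new columns
print(encoded.shape)     # 3 numerical columns + n + m encoded columns

# Dropping one level per column gives (n - 1) + (m - 1) new columns instead.
reduced = pd.get_dummies(
    data, columns=["cat_a", "cat_b"], drop_first=True, dtype=int
)
print(reduced.shape)
```

With these assumptions, the encoded frame has 18 columns. The `drop_first=True` variant has 16 and is often used with linear models to avoid perfectly collinear indicator columns.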

### Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

1. The variables species, habitat, and diet are all NOMINAL features with no natural order or ranking.
2. If a categorical variable has no natural order or ranking, then one-hot encoding can be used.
3. Hence, one-hot encoding would be preferred in this case.
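
As a brief sketch of this choice, all three columns can be one-hot encoded in a single call to `pd.get_dummies`. The `animals` frame and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical animal data; every column is nominal.
animals = pd.DataFrame(
    {
        "species": ["lion", "penguin", "frog"],
        "habitat": ["savanna", "polar", "wetland"],
        "diet": ["carnivore", "carnivore", "insectivore"],
    }
)

# Encode all three columns at once; each gets its own group of indicators.
animals_encoded = pd.get_dummies(
    animals, columns=["species", "habitat", "diet"], dtype=int
)
print(animals_encoded.columns.tolist())
```

Each row ends up with exactly one `1` per original column, so no ordering between species, habitats, or diets is implied.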

### Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

To transform the categorical data in the customer churn dataset into numerical data, you can use either ordinal encoding or one-hot encoding.

If the categorical variable has a natural order or ranking, then ordinal encoding can be used. For example, if the dataset contains information about the contract type of the customers, such as “month-to-month”, “one year”, and “two year”, then ordinal encoding can be used to encode this information.

If the categorical variable has no natural order or ranking, then one-hot encoding can be used. For example, if the dataset contains information about the gender of the customers, such as “male” and “female”, then one-hot encoding can be used to encode this information.

#### Here are the steps to implement this encoding:
1. Identify the categorical variables in the dataset. In this case, the categorical variables are the customer’s gender and contract type.

2. Separate the nominal and ordinal variables. In this case, gender is a nominal variable, while contract type is an ordinal variable.

3. Apply one-hot encoding to the nominal variable, in this case the gender variable.

4. Apply ordinal encoding to the ordinal variable, in this case the contract type variable.

5. Scale the numerical data using StandardScaler.

6. Combine the encoded and scaled columns into a single DataFrame.

7. The data is now ready for a machine learning model.

In [1]:
import numpy as np
import pandas as pd

# Setting random seed 
np.random.seed(543)
n = 1000 # dataset size
gender = np.random.choice(['Male','Female'],size=n)
age = np.random.randint(low=25, high=65, size=n)
contract = np.random.choice(['monthly','quarterly','half yearly','yearly'], size=n)
monthly_charges = np.random.normal(loc=1000, scale=100,size=n)
tenure = np.random.randint(low=12, high=36, size=n)
churn = np.random.choice([1,0], p=[0.2,0.8], size=n)

# Creating dictionary
dct = {
    'gender':gender,
    'age':age,
    'contract':contract,
    'monthly_charges':monthly_charges,
    'tenure':tenure,
    'churn':churn
}

# Creating Dataframe
df = pd.DataFrame(dct)
df.head()

| | gender | age | contract | monthly_charges | tenure | churn |
|---|---|---|---|---|---|---|
| 0 | Female | 34 | yearly | 1042.202157 | 25 | 1 |
| 1 | Female | 36 | half yearly | 966.337735 | 12 | 1 |
| 2 | Female | 30 | yearly | 1165.040177 | 16 | 0 |
| 3 | Female | 44 | quarterly | 1002.266319 | 35 | 0 |
| 4 | Male | 53 | half yearly | 1043.851952 | 16 | 1 |


In [3]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
gender_ohe = ohe.fit_transform(df[['gender']]).toarray()
# Get the DataFrame
gender_e = pd.DataFrame(gender_ohe,columns=ohe.get_feature_names_out())
gender_e.head()

| | gender_Female | gender_Male |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 |


In [4]:
from sklearn.preprocessing import OrdinalEncoder
ord_enc = OrdinalEncoder(categories=[['monthly','quarterly','half yearly','yearly']])
# Getting ordinal dataframe
contract_encoded = ord_enc.fit_transform(df[['contract']]).flatten()
contract_e = pd.DataFrame(contract_encoded,columns=['contract'])
contract_e.head()

| | contract |
|---|---|
| 0 | 3.0 |
| 1 | 2.0 |
| 2 | 3.0 |
| 3 | 1.0 |
| 4 | 2.0 |


In [7]:
# Drop the raw categorical columns so the encoded versions replace them
# (keeping both would leave two columns named 'contract')
df_encoded = pd.concat([df.drop(columns=['gender','contract']),gender_e,contract_e],axis=1)

In [8]:
df_encoded.head()

| | age | monthly_charges | tenure | churn | gender_Female | gender_Male | contract |
|---|---|---|---|---|---|---|---|
| 0 | 34 | 1042.202157 | 25 | 1 | 1.0 | 0.0 | 3.0 |
| 1 | 36 | 966.337735 | 12 | 1 | 1.0 | 0.0 | 2.0 |
| 2 | 30 | 1165.040177 | 16 | 0 | 1.0 | 0.0 | 3.0 |
| 3 | 44 | 1002.266319 | 35 | 0 | 1.0 | 0.0 | 1.0 |
| 4 | 53 | 1043.851952 | 16 | 1 | 0.0 | 1.0 | 2.0 |
