## Q1. What is data encoding? How is it useful in data science?


- Data encoding is the process of converting data from one format or code to another. 

- Data encoding is an important step in machine learning. It converts categorical or textual data into numerical format so that it can be used as input for algorithms. Most machine learning models only accept numerical variables.
- Decoding is the reverse process of encoding. It maps encoded values back to their original, readable representation, for example turning integer category codes back into their labels.

Here's how data encoding is useful in data science:

1. **Normalization**: Scaling is often applied alongside encoding to bring features onto a common format or scale. This is particularly important when features have different units or ranges, because features with large values can otherwise dominate distance-based and gradient-based models. Normalized data often leads to better performance for such models.

2. **Categorical Data**: In data science, many datasets contain categorical data (e.g., colors, categories, or labels) that are not directly usable by machine learning algorithms. Data encoding techniques such as one-hot encoding or label encoding can be applied to convert categorical data into numerical format, making it suitable for machine learning models.

3. **Text Data**: Text data often requires encoding to convert words or sentences into numerical representations (e.g., word embeddings or TF-IDF vectors) that machine learning models can work with effectively. Text encoding is crucial for tasks like natural language processing (NLP) and sentiment analysis.

4. **Image Data**: Images are stored as pixel values, and deep learning models consume them as numerical tensors. Convolutional neural networks (CNNs) then encode the raw pixels into learned feature representations suitable for image classification or object detection tasks.

5. **Time Series Data**: Time series data encoding involves converting temporal data points into a structured format. This is important for analyzing trends, making forecasts, or training machine learning models on time series data.

6. **Feature Engineering**: Data encoding is a fundamental part of feature engineering, where you create new features or representations of data to improve model performance. Properly encoded features can capture complex relationships in the data and enhance the predictive power of models.

7. **Reducing Dimensionality**: Encoding techniques like Principal Component Analysis (PCA) or autoencoders can be used to reduce the dimensionality of data while preserving essential information. This is especially useful when dealing with high-dimensional data or for visualization purposes.

8. **Data Compression**: Data encoding can also be used for data compression, where redundant or unnecessary information is removed to reduce storage requirements and improve data transfer efficiency.
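As a minimal sketch of the encode/decode round trip described at the start of this answer, the snippet below turns a small text column into integer codes and back. The `colors` column is a made-up example, and `pd.factorize` is only one of several ways to do this.

```python
import pandas as pd

# Hypothetical sample column; the values are illustrative only.
colors = pd.Series(["red", "green", "blue", "green", "red"])

# Encode: pd.factorize assigns an integer code to each distinct value
# (in order of first appearance) and returns the lookup table of originals.
codes, uniques = pd.factorize(colors)

# Decode: index the lookup table with the codes to recover the labels.
decoded = pd.Series(uniques[codes])

print(codes.tolist())                       # [0, 1, 2, 1, 0]
print(list(uniques))                        # ['red', 'green', 'blue']
print(decoded.tolist() == colors.tolist())  # True
```

The integer codes here are arbitrary identifiers, so they should not be read as a ranking of the colors.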

## Q2. What is nominal encoding? Provide an example of how you would use it in a real-world scenario.


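Nominal encoding (one-hot encoding) is a technique for encoding nominal features, that is, features whose categories have no inherent order. Each category is mapped to its own binary column, which holds 1 when the category is present and 0 when it is absent, so the number of dummy variables created equals the number of categories in the feature. To avoid the dummy variable trap (perfect multicollinearity among the dummy columns), one column is usually removed; the removed category is still encoded implicitly as zeros in all the remaining columns.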

**Scenario**: Imagine you have a dataset of fruits and you want to apply nominal (one-hot) encoding to the "Fruit Type" feature, which contains categorical values like "Apple," "Banana," and "Orange."


In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = {
    'Fruit Type': ['Apple', 'Banana', 'Orange', 'Banana', 'Apple']
}

df = pd.DataFrame(data)

encoder = OneHotEncoder(sparse_output=False)

encoded_data = encoder.fit_transform(df[['Fruit Type']])

encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Fruit Type']))

encoded_df


|   | Fruit Type_Apple | Fruit Type_Banana | Fruit Type_Orange |
|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 |


## Q3. In what situations is ordinal encoding preferred over one-hot encoding? Provide a practical example.


Ordinal encoding and one-hot encoding are both used to handle categorical data, but they serve different purposes and are preferred in different situations. Ordinal encoding is typically preferred over one-hot encoding in the following situations:

1. **Many Categories**: When a categorical feature has many unique categories (classes), one-hot encoding creates one binary column per category, making the dataset wider and more sparse and potentially increasing computational overhead. Ordinal encoding keeps the feature in a single integer column. This is safe only when the integer order is meaningful or the model is not misled by it; tree-based models, for example, are less sensitive to arbitrary codes than linear models.

**Example**: Consider a dataset of car colors (e.g., Red, Blue, Green). Ordinal encoding represents these colors with numerical values in one column (e.g., Red: 1, Blue: 2, Green: 3) rather than the three separate binary columns one-hot encoding would create. Because colors have no real order, these codes are arbitrary, so this trade-off is reasonable mainly for tree-based models or when the number of colors is large.

2. **Ordinal Data**: When the categorical feature has an inherent order or hierarchy among the categories, ordinal encoding is suitable. In ordinal data, the values have a meaningful ranking, which one-hot encoding does not capture.

**Example**: Education levels (e.g., High School, Bachelor's, Master's, Ph.D.) represent ordinal data because they have a clear order. Ordinal encoding can assign numerical values based on this order (e.g., High School: 1, Bachelor's: 2, Master's: 3, Ph.D.: 4).

3. **Simplifying Interpretability**: If you aim to maintain interpretability in your models, ordinal encoding is often preferred for ordered features. It keeps one column per feature and preserves the direct relationship between the original categories and the encoded values, making it easier to understand the model's output.

**Example**: In customer surveys, you might have a feature like "Satisfaction Level" with categories like "Not Satisfied," "Satisfied," and "Very Satisfied." Using ordinal encoding, you can assign numerical values (e.g., Not Satisfied: 1, Satisfied: 2, Very Satisfied: 3) that retain the intuitive interpretation of satisfaction levels.

4. **Regression Models**: For regression models, ordinal encoding is more suitable when the categorical variable has a natural order, as it lets the model relate the outcome to the ranking of the categories. Note that integer codes treat the categories as evenly spaced, which is an assumption worth checking.

**Example**: In a real estate model, you have a "House Condition" feature with values like "Poor," "Fair," "Good," and "Excellent." Ordinal encoding can map these categories to numerical values that represent the condition's relative quality.

In summary, ordinal encoding is preferred over one-hot encoding when the categories have a natural order, when a compact single-column representation is needed for a feature with many categories, and when interpretability of an ordered feature matters. One-hot encoding is typically more suitable when there is no inherent order or hierarchy among the categories and the number of categories is small enough that the extra columns are manageable. The choice between these encoding methods should be based on the specific characteristics of the data and the goals of the analysis or modeling task.
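The education-level example can be sketched with scikit-learn's `OrdinalEncoder`. The sample rows are made up for illustration, and the category order is passed explicitly, because by default the encoder sorts categories alphabetically rather than by rank. The resulting codes start at 0, not 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample rows for illustration.
df = pd.DataFrame({"Education": ["Master's", "High School", "Ph.D.", "Bachelor's"]})

# Explicit order from lowest to highest level; codes will be 0, 1, 2, 3.
order = [["High School", "Bachelor's", "Master's", "Ph.D."]]
encoder = OrdinalEncoder(categories=order)

# fit_transform returns a 2D array, so flatten it to assign a single column.
df["Education_encoded"] = encoder.fit_transform(df[["Education"]]).ravel()
print(df)
```

Each level maps to a single number that respects the ranking, so the feature stays in one column.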

## Q4. Suppose you have a dataset containing categorical data with 5 unique values. Which encoding technique would you use to transform this data into a format suitable for machine learning algorithms? Explain why you made this choice.

#### ONE-HOT ENCODING
- We can use one-hot encoding because it's a straightforward and safe choice for a small number of unique values, ensuring that each category is represented as a separate binary column.

**Reasoning:**
- One-hot encoding preserves the distinctiveness of each category, preventing any implicit ordinal assumptions.
- It works well with most machine learning algorithms, including linear models, tree-based models, and neural networks.
- One-hot encoding is interpretable and easy to understand, as each binary column represents a specific category.
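A minimal sketch of this choice, using a hypothetical `City` feature with exactly 5 unique values and `pd.get_dummies` (one of several ways to one-hot encode):

```python
import pandas as pd

# Hypothetical feature with exactly 5 unique values (6 rows, "Pune" repeats).
df = pd.DataFrame(
    {"City": ["Delhi", "Mumbai", "Pune", "Chennai", "Kolkata", "Pune"]}
)

# dtype=int produces 0/1 integers instead of booleans.
one_hot = pd.get_dummies(df, columns=["City"], dtype=int)

print(one_hot.shape)  # (6, 5): one binary column per unique value
print(one_hot)
```

Every row has exactly one 1, so no category is treated as larger or smaller than another.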

## Q5. In a machine learning project, you have a dataset with 1000 rows and 5 columns. Two of the columns are categorical, and the remaining three columns are numerical. If you were to use nominal encoding to transform the categorical data, how many new columns would be created? Show your calculations.

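- Nominal (one-hot) encoding creates a new binary column for each unique category in each categorical column. The question does not say how many unique categories each column has, so the calculation rests on assumed values.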

Let's assume the following:

- Categorical Column 1: It has 4 unique categories.
- Categorical Column 2: It has 3 unique categories.
- Numerical Columns: There are 3 numerical columns.

In [2]:
# Number of unique categories in each categorical column
unique_categories_cat1 = 4
unique_categories_cat2 = 3

# Number of numerical columns
numerical_columns = 3

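# New columns created by nominal (one-hot) encoding: one per unique category
new_columns = unique_categories_cat1 + unique_categories_cat2

# Total columns after encoding: the new binary columns replace the two
# categorical columns, and the numerical columns are kept unchanged
total_columns_after_encoding = new_columns + numerical_columns

print("New columns created with nominal encoding:", new_columns)
print("Total columns after encoding:", total_columns_after_encoding)


New columns created with nominal encoding: 7
Total columns after encoding: 10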
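The arithmetic can be checked on synthetic data. This is a sketch under the same assumptions (4 and 3 unique categories); the column names and values are hypothetical. With these assumptions, one-hot encoding adds 4 + 3 = 7 binary columns, or (4 - 1) + (3 - 1) = 5 if the first level of each feature is dropped.

```python
import pandas as pd

n_rows = 1000

# Hypothetical dataset: 2 categorical columns (4 and 3 levels) and 3 numerical.
df = pd.DataFrame({
    "cat1": [["A", "B", "C", "D"][i % 4] for i in range(n_rows)],
    "cat2": [["X", "Y", "Z"][i % 3] for i in range(n_rows)],
    "num1": range(n_rows),
    "num2": range(n_rows),
    "num3": range(n_rows),
})

full = pd.get_dummies(df, columns=["cat1", "cat2"])
reduced = pd.get_dummies(df, columns=["cat1", "cat2"], drop_first=True)

n_numerical = 3
print(full.shape, full.shape[1] - n_numerical)        # (1000, 10) 7
print(reduced.shape, reduced.shape[1] - n_numerical)  # (1000, 8) 5
```

The row count stays at 1000 in both cases; only the number of columns changes.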


## Q6. You are working with a dataset containing information about different types of animals, including their species, habitat, and diet. Which encoding technique would you use to transform the categorical data into a format suitable for machine learning algorithms? Justify your answer.

 - If the categorical variables, such as "species," "habitat," and "diet," do not exhibit any inherent order or hierarchy among categories, nominal encoding (one-hot encoding) is a suitable choice. One-hot encoding ensures that each category is represented as a separate binary column, preserving the distinctiveness of each category.
 
**Advantages:**
- Preserves the categorical information without assuming any ordinal relationship.
- Works well with most machine learning algorithms, including linear models, tree-based models, and neural networks.
- Prevents the model from incorrectly assuming ordinality in the data.
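
**Example:** For the "species" variable, one-hot encoding would create separate binary columns for each species (e.g., "Lion," "Tiger," "Giraffe") without implying any inherent order among these species. When `drop='first'` is used, as in the code below, the first category in sorted order ("Giraffe") is dropped and is represented by zeros in the remaining species columns.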

In [3]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Create a sample DataFrame with categorical data
data = {
    'Species': ['Lion', 'Tiger', 'Giraffe', 'Lion', 'Tiger'],
    'Habitat': ['Savannah', 'Jungle', 'Savannah', 'Forest', 'Savannah'],
    'Diet': ['Carnivore', 'Carnivore', 'Herbivore', 'Carnivore', 'Carnivore']
}

df = pd.DataFrame(data)

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit and transform the encoder on the categorical columns
encoded_data = encoder.fit_transform(df[['Species', 'Habitat', 'Diet']])

# Create a new DataFrame with the one-hot encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(['Species', 'Habitat', 'Diet']))

# Display the resulting DataFrame
encoded_df


|   | Species_Lion | Species_Tiger | Habitat_Jungle | Habitat_Savannah | Diet_Herbivore |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |


## Q7. You are working on a project that involves predicting customer churn for a telecommunications company. You have a dataset with 5 features, including the customer's gender, age, contract type, monthly charges, and tenure. Which encoding technique(s) would you use to transform the categorical data into numerical data? Provide a step-by-step explanation of how you would implement the encoding.

In [4]:
# Step 1: Load the dataset
import pandas as pd

data = {
    'Gender': ['Male', 'Female', 'Male', 'Female', 'Male'],
    'Age': [35, 42, 28, 55, 48],
    'Contract Type': ['Month-to-Month', 'Two Year', 'One Year', 'Month-to-Month', 'Two Year'],
    'Monthly Charges': [65.5, 85.1, 50.2, 70.3, 95.6],
    'Tenure': [12, 24, 8, 60, 36]
}

df = pd.DataFrame(data)

# Check the data types of each feature
print(df.dtypes)


Gender              object
Age                  int64
Contract Type       object
Monthly Charges    float64
Tenure               int64
dtype: object


In [5]:
# Step 2: Apply one-hot encoding to 'Gender' and 'Contract Type' columns

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Separate categorical and numerical features
categorical_cols = ['Gender', 'Contract Type']
numerical_cols = ['Age', 'Monthly Charges', 'Tenure']

# Initialize the OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')

# Fit and transform the encoder on the categorical columns
encoded_data = encoder.fit_transform(df[categorical_cols])

# Create a DataFrame with the one-hot encoded data
encoded_df = pd.DataFrame(encoded_data, columns=encoder.get_feature_names_out(categorical_cols))

# Combine the one-hot encoded DataFrame with the numerical columns
final_df = pd.concat([encoded_df, df[numerical_cols]], axis=1)

# Display the resulting DataFrame
final_df

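|   | Gender_Male | Contract Type_One Year | Contract Type_Two Year | Age | Monthly Charges | Tenure |
|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 0.0 | 35 | 65.5 | 12 |
| 1 | 0.0 | 0.0 | 1.0 | 42 | 85.1 | 24 |
| 2 | 1.0 | 1.0 | 0.0 | 28 | 50.2 | 8 |
| 3 | 0.0 | 0.0 | 0.0 | 55 | 70.3 | 60 |
| 4 | 1.0 | 0.0 | 1.0 | 48 | 95.6 | 36 |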


#### Alternatively, with `pd.get_dummies`

In [6]:
# Apply one-hot encoding to 'Gender' and 'Contract Type' columns
df_encoded = pd.get_dummies(df, columns=['Gender', 'Contract Type'], prefix=['Gender', 'Contract'])

# Display the resulting DataFrame with one-hot encoding
df_encoded


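|   | Age | Monthly Charges | Tenure | Gender_Female | Gender_Male | Contract_Month-to-Month | Contract_One Year | Contract_Two Year |
|---|---|---|---|---|---|---|---|---|
| 0 | 35 | 65.5 | 12 | 0 | 1 | 1 | 0 | 0 |
| 1 | 42 | 85.1 | 24 | 1 | 0 | 0 | 0 | 1 |
| 2 | 28 | 50.2 | 8 | 0 | 1 | 0 | 1 | 0 |
| 3 | 55 | 70.3 | 60 | 1 | 0 | 1 | 0 | 0 |
| 4 | 48 | 95.6 | 36 | 0 | 1 | 0 | 0 | 1 |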
