In [17]:
# Import necessary libraries
import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('netflix_titles.csv')

# Handle Missing Values
missing_columns = df.columns[df.isnull().any()]  # Identify columns with missing values
print("Columns with missing values:")
print(missing_columns)

# Option 1: Drop rows or columns with missing values
# df.dropna(inplace=True)  # Uncomment this line to drop rows with missing values
# df.dropna(axis=1, inplace=True)  # Uncomment this line to drop columns with missing values

# Option 2: Fill missing values with appropriate values
# Example: Fill missing values in numeric columns with the mean
numeric_columns = df.select_dtypes(include=[np.number]).columns
df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].mean())

# Example: Fill missing values in categorical columns with the mode
categorical_columns = df.select_dtypes(include=[np.object_]).columns
df[categorical_columns] = df[categorical_columns].fillna(df[categorical_columns].mode().iloc[0])

Columns with missing values:
Index(['director', 'cast', 'country', 'date_added', 'rating', 'duration'], dtype='object')
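
Before choosing between dropping and filling, it can help to see how much is missing in each column. The following is a minimal sketch on a small made-up frame; the `toy` DataFrame and its values are illustrative assumptions, not rows from netflix_titles.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "director": ["A", None, "B", None],
    "release_year": [2019, 2020, np.nan, 2021],
    "title": ["t1", "t2", "t3", "t4"],
})

# Number and percentage of missing values per column
missing_count = toy.isnull().sum()
missing_pct = toy.isnull().mean() * 100

# Keep only the columns that actually have gaps
report = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
report = report[report["missing"] > 0]
print(report)
```

A column that is mostly empty is usually a candidate for dropping, while a column with a few gaps is a candidate for filling.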


In [None]:
# Remove Duplicate Records
# Example: Identify and remove duplicate records based on all columns
duplicate_records = df.duplicated()
df_no_duplicates = df[~duplicate_records]

# Display cleaned data
print("Cleaned DataFrame:")
print(df_no_duplicates.head())

In [12]:
# Deal with Outliers
# Example: Identify and remove outliers using z-scores
z_scores = np.abs((df[numeric_columns] - df[numeric_columns].mean()) / df[numeric_columns].std())
outlier_threshold = 3  # Adjust the threshold as needed
df_no_outliers = df[(z_scores < outlier_threshold).all(axis=1)]

# Example: Transform outliers using winsorization
from scipy.stats import mstats
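df_winsorized = pd.DataFrame(
    # axis=0 winsorizes each column separately; the default (axis=None) flattens the array
    mstats.winsorize(df[numeric_columns].values, limits=[0.05, 0.05], axis=0),
    columns=numeric_columns,
    index=df.index,  # keep the original row labels
)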


# Resolve Inconsistencies
# Example: Check and correct inconsistent formats
# df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')

# Example: Standardize units
# df['weight_column'] = df['weight_column'] * 0.453592  # Convert pounds to kilograms

# Example: Correct inconsistent categorical values
# df['category_column'] = df['category_column'].replace({'A1': 'A', 'A2': 'A'})

In this part, missing values in numeric columns are interpolated. Interpolation estimates each missing value from its neighboring data points. The pandas interpolate() method does this for numeric columns and uses linear interpolation by default, which is only meaningful when the row order itself carries meaning (for example, a time series).

In [13]:
# Option 3: Use advanced techniques like interpolation or imputation
# Example: Interpolate missing values in numeric columns
df[numeric_columns] = df[numeric_columns].interpolate()
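
To make the effect of interpolation concrete, here is a small sketch on a made-up Series (the values are assumptions chosen so the result is easy to check by hand). With the default linear method, each gap is filled along a straight line between its neighbors.

```python
import numpy as np
import pandas as pd

# Two consecutive gaps between 10 and 40
s = pd.Series([10.0, np.nan, np.nan, 40.0, 50.0])

# Default method is 'linear': gaps are spaced evenly between known neighbors
filled = s.interpolate()
print(filled.tolist())  # [10.0, 20.0, 30.0, 40.0, 50.0]
```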

Here, outliers in the numeric columns are identified and removed. A z-score is computed by subtracting the column mean from each data point and dividing by the column standard deviation, so it measures how many standard deviations a point lies from the mean. Values whose absolute z-score reaches a threshold (here, 3) are treated as outliers. The DataFrame df_no_outliers keeps only the rows in which every numeric column stays below the threshold.

In [14]:
# Deal with Outliers
# Example: Identify and remove outliers using z-scores
z_scores = np.abs((df[numeric_columns] - df[numeric_columns].mean()) / df[numeric_columns].std())
outlier_threshold = 3  # Adjust the threshold as needed
df_no_outliers = df[(z_scores < outlier_threshold).all(axis=1)]
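
The same z-score filter can be tried on a small made-up Series to see which points it removes. This is only a sketch; the planted value 1900 and the threshold of 3 are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Twenty ordinary years plus one planted outlier (1900)
years = pd.Series(list(range(2000, 2020)) + [1900])

# Absolute z-score; pandas' std() uses the sample standard deviation (ddof=1)
z = np.abs((years - years.mean()) / years.std())

# Keep only points within 3 standard deviations of the mean
kept = years[z < 3]
print(len(years), len(kept))
print(years[z >= 3].tolist())
```

Note that in very small samples a single extreme value inflates the standard deviation enough that its own z-score may never reach 3, so the threshold should be chosen with the sample size in mind.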

This example shows an alternative approach to dealing with outliers: winsorization. Instead of removing extreme values, winsorization replaces values beyond a chosen percentile with the value at that percentile. The winsorize() function from the scipy.stats.mstats module performs it; limits=[0.05, 0.05] caps the lowest 5% and the highest 5% of values. The DataFrame df_winsorized contains the winsorized values for the numeric columns.

In [15]:
# Example: Transform outliers using winsorization
from scipy.stats import mstats
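df_winsorized = pd.DataFrame(
    # axis=0 winsorizes each column separately; the default (axis=None) flattens the array
    mstats.winsorize(df[numeric_columns].values, limits=[0.05, 0.05], axis=0),
    columns=numeric_columns,
    index=df.index,  # keep the original row labels
)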
df_winsorized

      release_year
0             2020
1             2021
2             2021
3             2021
4             2021
...            ...
8802          2007
8803          2018
8804          2009
8805          2006
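
The idea of capping extremes can also be sketched with plain pandas by clipping at the 5th and 95th percentiles. Note that this is percentile clipping with `Series.clip`, a stand-in for `mstats.winsorize` rather than the function used above, and the two can give slightly different caps because `quantile` interpolates between data points. The data is made up.

```python
import pandas as pd

# Mostly similar values with one very low and one very high point
values = pd.Series([1, 50, 51, 52, 53, 54, 55, 56, 57, 500])

# Percentile bounds (interpolated between neighboring data points)
lower = values.quantile(0.05)
upper = values.quantile(0.95)

# Values outside the bounds are pulled in to the bounds; the rest are untouched
clipped = values.clip(lower=lower, upper=upper)
print(lower, upper)
print(clipped.tolist())
```

Unlike the z-score filter, this keeps every row, which matters when the other columns of an outlying row are still useful.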


These examples demonstrate how to handle inconsistencies in the data. The first example converts a column named 'date_column' to a consistent date format using the pd.to_datetime() function. The second example demonstrates standardizing units by multiplying a column named 'weight_column' by a conversion factor (e.g., from pounds to kilograms). The third example corrects inconsistent categorical values by using the replace() function to map specific values to desired values.

In [16]:
# Resolve Inconsistencies
# Example: Check and correct inconsistent formats
# df['date_column'] = pd.to_datetime(df['date_column'], format='%Y-%m-%d')

# Example: Standardize units
# df['weight_column'] = df['weight_column'] * 0.453592  # Convert pounds to kilograms

# Example: Correct inconsistent categorical values
# df['category_column'] = df['category_column'].replace({'A1': 'A', 'A2': 'A'})
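
Because the three examples above are commented out, here is a runnable sketch of the same steps on a made-up frame. The column names reuse the placeholders from the comments, and all values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data using the placeholder column names from the examples
toy = pd.DataFrame({
    "date_column": ["2021-09-25", "2021-09-24", "2020-01-01"],
    "weight_column": [100.0, 150.0, 200.0],   # pounds
    "category_column": ["A1", "A2", "B"],
})

# Parse text dates into a proper datetime column
toy["date_column"] = pd.to_datetime(toy["date_column"], format="%Y-%m-%d")

# Convert pounds to kilograms
toy["weight_column"] = toy["weight_column"] * 0.453592

# Map inconsistent labels onto one canonical label
toy["category_column"] = toy["category_column"].replace({"A1": "A", "A2": "A"})

print(toy)
print(toy.dtypes)
```

Assigning the result back to the column, rather than calling replace() with inplace=True on a selected column, avoids chained-assignment warnings in recent pandas versions.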


In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Examine Data Structure
num_rows = df.shape[0]
num_cols = df.shape[1]
print("Number of rows:", num_rows)
print("Number of columns:", num_cols)
print("Data Types:")
print(df.dtypes)
print("Data Structure:")
print(df.head())

Number of rows: 8807
Number of columns: 12
Data Types:
show_id         object
type            object
title           object
director        object
cast            object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
description     object
dtype: object
Data Structure:
  show_id     type                  title         director  \
0      s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1      s2  TV Show          Blood & Water    Rajiv Chilaka   
2      s3  TV Show              Ganglands  Julien Leclercq   
3      s4  TV Show  Jailbirds New Orleans    Rajiv Chilaka   
4      s5  TV Show           Kota Factory    Rajiv Chilaka   

                                                cast        country  \
0                                 David Attenborough  United States   
1  Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa   
2  Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... 

Count: It tells you how many values are available for each numerical column. If some columns have fewer values compared to others, it means there may be missing data in those columns.

Mean: It gives you an idea of the typical value for each numerical column. It's like finding the average of all the values in that column.

Standard Deviation: It shows you how spread out the values are in each numerical column. A higher standard deviation means the values are more spread out, while a lower standard deviation means the values are closer together.

Minimum and Maximum: They tell you the smallest and largest values in each numerical column, respectively. They give you an idea of the range of values covered by each variable.

Quartiles: They help you understand the distribution of the data. The first quartile represents the value below which 25% of the data falls, the median represents the middle value, and the third quartile represents the value below which 75% of the data falls.

By looking at these summary statistics, you can get an overall understanding of your numerical data. You can see how many values are available, the typical value, how spread out the values are, the range of values, and the distribution of the data. This information can help you identify any missing data, outliers, or unusual patterns in your dataset.


Missing Data: The count in the summary statistics provides the number of non-missing values for each numerical column. If a column has significantly fewer values compared to others, it indicates the presence of missing data. By observing the count values, you can identify which columns may have missing values and may require further investigation or handling.

Outliers: Outliers are extreme values that differ markedly from the majority of the data points. The minimum, maximum, and quartiles show the range each variable covers. If the maximum lies far above the 75th percentile, or the minimum far below the 25th percentile, outliers are likely. In the output below, for example, the minimum release_year is 1925 while the 25th percentile is 2013, so the oldest titles sit far from the bulk of the data. Outliers are worth investigating, as they may distort your analysis or indicate data issues.

Unusual Patterns: The mean, standard deviation, and quartiles together describe the shape and variability of the data. A large standard deviation, a mean far from the median, or unevenly spaced quartiles may signal skewed data or data quality issues, and should prompt you to investigate the underlying cause.

In [21]:
# Summarize Data
summary_stats = df.describe()
print("Summary Statistics:")
print(summary_stats)

Summary Statistics:
       release_year
count   8807.000000
mean    2014.180198
std        8.819312
min     1925.000000
25%     2013.000000
50%     2017.000000
75%     2019.000000
max     2021.000000
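
The checks described above can be read directly off the describe() table. The sketch below uses a made-up frame (columns `a` and `b` are assumptions) to show how the count row reveals missing data and how the gap between mean and median hints at extreme values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'a' has one extreme value, 'b' has two gaps
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 100.0],
    "b": [5.0, np.nan, 7.0, np.nan, 9.0],
})
stats = toy.describe()

# Missing data: number of rows minus the non-missing count per column
missing = len(toy) - stats.loc["count"]

# Skew hint: a mean far from the median (50%) suggests extreme values
mean_minus_median = stats.loc["mean"] - stats.loc["50%"]

print(missing)
print(mean_minus_median)
```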


The histogram helps you visualize the distribution of the numerical variable, showing the frequency or count of values within each bin. It allows you to observe the shape of the distribution, identify peaks or clusters, and get a sense of the spread or concentration of values.

In [None]:
# Visualize Data
# Example: Histogram for a numerical variable
plt.hist(df['numeric_variable'], bins=10)
plt.xlabel('Numeric Variable')
plt.ylabel('Frequency')
plt.title('Histogram of Numeric Variable')
plt.show()
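
To see what a histogram actually counts, the bin edges and counts can be computed without plotting. This is a sketch with made-up values, using `np.histogram`, which produces the same kind of per-bin counts that a histogram draws as bars.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 10])

# Three equal-width bins between the minimum (1) and maximum (10)
counts, edges = np.histogram(values, bins=3)

# Each bin includes its left edge; only the last bin also includes its right edge
for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

The number of bins changes the picture considerably, so it is worth trying a few values before drawing conclusions about the shape of a distribution.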

The bar plot helps you visualize the distribution of the categorical variable by showing the count or frequency of each category as individual bars. It allows you to compare the relative sizes of different categories and identify the dominant or minority categories. The rotation of the x-axis tick labels is often used to improve readability when there are many categories.

In [None]:
# Example: Bar plot for a categorical variable
category_counts = df['categorical_variable'].value_counts()  # compute the counts once
plt.bar(category_counts.index, category_counts.values)
plt.xlabel('Categories')
plt.ylabel('Count')
plt.title('Bar Plot of Categorical Variable')
plt.xticks(rotation=90)
plt.show()

The correlation matrix and heatmap help you identify relationships between numerical variables in your dataset. By examining the color patterns in the heatmap, you can quickly spot variables that are strongly positively or negatively correlated. This information is valuable for understanding dependencies between variables and identifying potential predictors or factors that influence each other.

The heatmap visualization allows you to identify clusters or groups of variables with similar correlations, helping you uncover underlying patterns and relationships in your data. The annotation of correlation values on the heatmap provides a numerical representation of the strength of the relationships between variables.

Remember that correlation does not imply causation, and it's important to interpret the results in the context of your specific data and domain knowledge.

In [None]:
# Identify Relationships
# Example: Correlation matrix
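# numeric_only=True skips text columns, which raise an error in pandas 2.x otherwise
corr_matrix = df.corr(numeric_only=True)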
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()
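
The correlation values behind the heatmap can be inspected directly. The sketch below uses a made-up frame (columns `x`, `y`, `z`, and `label` are assumptions) with one perfectly positive and one perfectly negative relationship, and selects numeric columns explicitly before calling corr().

```python
import pandas as pd

toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],     # rises with x
    "z": [5, 4, 3, 2, 1],      # falls as x rises
    "label": list("abcde"),    # non-numeric, excluded below
})

# Pearson correlation between every pair of numeric columns
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear relationship; a nonlinear relationship can still be present.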

The box plot provides several key insights into the distribution of the numerical variable:

The line inside the box marks the median, the middle value of the data.
The box spans the interquartile range (IQR), the middle 50% of the data: its lower edge is the 25th percentile (Q1) and its upper edge is the 75th percentile (Q3).
The whiskers extend to the smallest and largest data points that lie within 1.5 times the IQR of the box edges, that is, no lower than Q1 - 1.5 × IQR and no higher than Q3 + 1.5 × IQR.
Points beyond the whiskers are plotted individually and flagged as potential outliers.

By examining the box plot, you can quickly read off the median, the spread (IQR), and any extreme values. A median that sits off-center in the box, or whiskers of unequal length, points to skewness, which may call for a data transformation or outlier handling.

Remember to interpret the box plot in the context of your specific data and domain knowledge to gain meaningful insights about the distribution of the numerical variable.

In [None]:
# Analyze Distributions
# Example: Box plot for a numerical variable
sns.boxplot(x=df['numeric_variable'])
plt.xlabel('Numeric Variable')
plt.title('Box Plot of Numeric Variable')
plt.show()
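
The quantities a box plot draws can also be computed by hand. The following sketch applies the 1.5 × IQR rule to a made-up Series; the values are assumptions chosen so that one point falls outside the fences.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences: points outside them are the ones drawn individually
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(q1, q3, iqr)
print(lower_fence, upper_fence)
print(outliers.tolist())
```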



The grouping and aggregation process allows you to analyze your data at a more granular level by creating subsets based on a categorical variable and calculating summary statistics or aggregating values within each group. In this example, you calculate the mean of the numerical variable for each unique value of the categorical variable.

By examining the grouped data, you can gain insights into how the numerical variable varies across different categories. It helps you understand the average or typical values of the numerical variable within each category and identify any patterns, trends, or differences between the groups.

You can apply various aggregation functions (e.g., mean, sum, count, min, max, etc.) to obtain different summary statistics based on your analysis goals. Grouping and aggregating data is particularly useful for conducting exploratory data analysis and understanding the relationships between categorical and numerical variables in your dataset.

In [None]:
# Grouping and Aggregation
# Example: Group by a categorical variable and calculate mean of a numerical variable
grouped_data = df.groupby('categorical_variable')['numeric_variable'].mean()
print("Grouped Data:")
print(grouped_data)
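
Several aggregation functions can be applied in one pass with agg(). This sketch uses a made-up frame whose column names mirror the dataset's `type` and `release_year` columns; the values themselves are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show", "TV Show", "Movie"],
    "release_year": [2010, 2020, 2015, 2019, 2000],
})

# One row per category, one column per aggregation function
summary = toy.groupby("type")["release_year"].agg(["mean", "min", "max", "count"])
print(summary)
```

Including the count alongside the mean makes it easier to spot groups whose average rests on only a handful of rows.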

For Numerical Columns:

Line Plot: Suitable for visualizing the relationship between two numerical variables over a continuous interval.

Scatter Plot: Useful for exploring the relationship between two numerical variables as individual data points.

Histogram: Ideal for examining the distribution of a numerical variable.

Box Plot: Helpful for visualizing the summary statistics (median, quartiles, outliers) of a numerical variable.

Violin Plot: Similar to a box plot, but also shows the distribution of the numerical variable across different categories.

Area Plot: Useful for illustrating the cumulative contribution or stacked proportions of numerical variables over time or a continuous axis.

For Categorical Columns:

Bar Plot: Suitable for comparing and displaying the distribution of categorical variables.

Pie Chart: Helpful for representing the composition or proportion of different categories in a dataset.

Heatmap: Suitable for visualizing the relationship and patterns between two categorical variables.

Violin Plot: Can also be used to show the distribution of a numerical variable across different categories.

For both Numerical and Categorical Columns:

Pair Plot: Suitable for generating pairwise scatter plots for multiple numerical variables, but can also display the relationship between a numerical and a categorical variable.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Line Plot
plt.plot(df['x'], df['y'])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Line Plot')
plt.show()

# Scatter Plot
plt.scatter(df['x'], df['y'])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()

# Histogram
plt.hist(df['numeric_variable'], bins=10)
plt.xlabel('Numeric Variable')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

# Box Plot
plt.boxplot(df['numeric_variable'])
plt.xlabel('Numeric Variable')
plt.title('Box Plot')
plt.show()

# Violin Plot
sns.violinplot(x=df['category_variable'], y=df['numeric_variable'])
plt.xlabel('Category Variable')
plt.ylabel('Numeric Variable')
plt.title('Violin Plot')
plt.show()

# Area Plot
plt.stackplot(df['x'], df['y1'], df['y2'], labels=['Y1', 'Y2'])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Area Plot')
plt.legend()
plt.show()

# Bar Plot
plt.bar(df['category_variable'], df['count_variable'])
plt.xlabel('Category Variable')
plt.ylabel('Count')
plt.title('Bar Plot')
plt.xticks(rotation=90)
plt.show()

# Pie Chart
plt.pie(df['count_variable'], labels=df['category_variable'], autopct='%1.1f%%')
plt.title('Pie Chart')
plt.axis('equal')
plt.show()

# Heatmap
heatmap_data = df.pivot(index='category_variable1', columns='category_variable2', values='count_variable')
sns.heatmap(heatmap_data, cmap='coolwarm')
plt.title('Heatmap')
plt.show()

# Violin Plot (for categorical variable)
sns.violinplot(x=df['category_variable'], y=df['numeric_variable'])
plt.xlabel('Category Variable')
plt.ylabel('Numeric Variable')
plt.title('Violin Plot')
plt.show()

# Pair Plot
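sns.pairplot(df[['numeric_variable1', 'numeric_variable2', 'category_variable']], hue='category_variable')
plt.suptitle('Pair Plot', y=1.02)  # figure-level title; plt.title would label only the last subplot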
plt.show()


For Numerical Columns:

Kernel Density Plot: Estimates the probability density function of a continuous variable.

Scatter Matrix Plot: Displays pairwise scatter plots for multiple numerical variables.

Line Plot with Error Bars: Represents the trend of a numerical variable over time or other continuous interval, along with the uncertainty or variability using error bars.

Contour Plot: Visualizes three-dimensional data by displaying contours of equal values on a two-dimensional plane.

QQ Plot: Compares the quantiles of a dataset to the quantiles of a theoretical distribution, aiding in assessing the distributional fit.

Andrews Curves: Converts each data point into a curve based on the feature values and plots them, allowing for visual analysis of clusters or patterns.

For Categorical Columns:

Stacked Bar Plot: Shows the composition of categories within a variable and how it changes across groups.

Grouped Bar Plot: Compares the distribution of a categorical variable across different groups.

Word Cloud: Displays the frequency or importance of words in a text-based dataset using varying font sizes.

Sankey Diagram: Visualizes the flow or distribution of categorical data between different categories or stages.

Chord Diagram: Illustrates the relationships and flows between different categorical variables.

Determine the key questions:

Start by understanding the purpose of your data analysis and what insights you want to gain from it.

Identify the key questions or objectives you want to address with data visualization.

These questions should guide your analysis and visualization choices.

Select appropriate visualizations:

Based on the key questions and the nature of your data (numerical or categorical), choose the most suitable visualization techniques to explore and communicate the insights effectively.

Refer to the list of visualization techniques we discussed earlier and select the ones that best represent your data and answer your key questions.

Consider the type of data (numerical or categorical), the relationships between variables, the distribution characteristics, and any patterns or trends you want to highlight.

Create visualizations:

Use the visualization techniques chosen in the previous step to create the visualizations.

Use the code snippets provided earlier as a reference and customize them based on your dataset and visualization preferences.

Ensure that the visualizations are clear, well-labeled, and visually appealing.

Consider adding titles, axis labels, legends, and annotations to provide context and interpretation.

Summarize key findings:

Analyze the visualizations and extract insights relevant to your key questions.

Identify any patterns, trends, or relationships that emerge from the visualizations.

Highlight significant observations or noteworthy findings.

Use descriptive language to summarize the key findings concisely.

Support your findings with evidence from the visualizations, referring to specific charts or plots.

Consider providing recommendations or further actions based on the insights gained.

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data Visualization

## Determine the key questions
# Define the questions or objectives you want to address with data visualization.

## Select appropriate visualizations

# Example: Line Plot
plt.plot(df['x'], df['y'])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Line Plot')
plt.show()

# Example: Scatter Plot
plt.scatter(df['x'], df['y'])
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Scatter Plot')
plt.show()

# Example: Histogram
plt.hist(df['numeric_variable'], bins=10)
plt.xlabel('Numeric Variable')
plt.ylabel('Frequency')
plt.title('Histogram')
plt.show()

# Example: Box Plot
plt.boxplot(df['numeric_variable'])
plt.xlabel('Numeric Variable')
plt.title('Box Plot')
plt.show()

# Example: Violin Plot
sns.violinplot(x=df['categorical_variable'], y=df['numeric_variable'])
plt.xlabel('Categorical Variable')
plt.ylabel('Numeric Variable')
plt.title('Violin Plot')
plt.show()

# Example: Area Plot
plt.fill_between(df['x'], df['y'], alpha=0.5)
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Area Plot')
plt.show()

## Prepare data for visualization
# Perform any necessary data transformations, aggregations, or filtering specific to each visualization.

## Create visualizations

# Example: Bar Plot
plt.bar(df['categorical_variable'].value_counts().index, df['categorical_variable'].value_counts().values)
plt.xlabel('Categories')
plt.ylabel('Count')
plt.title('Bar Plot')
plt.xticks(rotation=90)
plt.show()

# Example: Pie Chart
plt.pie(df['categorical_variable'].value_counts(), labels=df['categorical_variable'].value_counts().index, autopct='%1.1f%%')
plt.title('Pie Chart')
plt.show()

# Example: Heatmap
cross_tab = pd.crosstab(df['categorical_variable1'], df['categorical_variable2'])
sns.heatmap(cross_tab, cmap='coolwarm', annot=True)
plt.xlabel('Categorical Variable 2')
plt.ylabel('Categorical Variable 1')
plt.title('Heatmap')
plt.show()

# Example: Pair Plot
sns.pairplot(df, hue='categorical_variable')
plt.suptitle('Pair Plot', y=1.02)  # figure-level title; plt.title would label only the last subplot
plt.show()

## Customize and enhance visuals
# Adjust colors, labels, titles, axes, legends, annotations, etc., to improve clarity and interpretability.

## Arrange and organize visuals
# Arrange the visualizations in a logical order, considering subplots, grids, or sections for better organization.

## Provide clear explanations
# Accompany each visualization with clear explanations, highlighting key findings, observations, or trends.

## Iterate and refine
# Review and revise the visualizations and explanations to improve clarity, impact, and insights.

## Summarize key findings
# Conclude the data visualization section by summarizing the key findings or patterns observed.

# Save or export the final notebook as desired.


Variables can be divided into several types by their level of measurement. The first three below are categorical; interval and ratio variables are numerical and are listed for comparison:

Nominal: Nominal variables are categorical variables that have no inherent order or hierarchy. The categories are distinct and can't be ranked. Examples include colors (e.g., red, blue, green) or categories like "dog," "cat," and "bird."

Ordinal: Ordinal variables have categories that can be ranked or ordered in a meaningful way. The categories have a relative order but not necessarily a specific numerical difference between them. Examples include ratings (e.g., low, medium, high) or educational levels (e.g., elementary, middle, high school).

Binary: Binary variables have only two categories or levels. They can represent yes/no, true/false, or presence/absence situations. Examples include gender (male/female), whether a customer made a purchase (yes/no), or a binary outcome in a classification problem.

Interval: Interval variables are numerical values with fixed, equal intervals between them but no meaningful zero point, so differences are meaningful while ratios are not. Examples include temperature measured in Celsius or Fahrenheit.

Ratio: Ratio variables are like interval variables but have a meaningful zero point that indicates the absence of the quantity, so ratios are meaningful too. Examples include age, weight, height, or income.
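
The nominal/ordinal distinction maps directly onto pandas categoricals. Here is a small sketch with made-up values; the level names and their order are assumptions for illustration.

```python
import pandas as pd

# Nominal: categories without any order
colors = pd.Categorical(["red", "blue", "green", "blue"])

# Ordinal: categories with an explicitly stated order
levels = pd.Categorical(
    ["low", "high", "medium", "low"],
    categories=["low", "medium", "high"],
    ordered=True,
)

print(colors.ordered, levels.ordered)   # False True
print(levels.codes.tolist())            # [0, 2, 1, 0]
print(levels.min(), levels.max())       # low high
```

Declaring the order explicitly lets sorting, comparisons, and min/max follow the intended ranking instead of alphabetical order.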

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA

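# Handling Missing Values
# Identify columns or features with missing values
missing_columns = df.columns[df.isnull().any()]

# Count missing values per row BEFORE imputing; afterwards the count would always be 0
df['feature_missing'] = df[missing_columns].isnull().sum(axis=1)

# Impute missing values with the mean (numeric columns only; the mean is undefined for text)
numeric_missing = df[missing_columns].select_dtypes(include='number').columns
if len(numeric_missing) > 0:
    imputer = SimpleImputer(strategy='mean')
    df[numeric_missing] = imputer.fit_transform(df[numeric_missing])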

# Encoding Categorical Variables
# Perform one-hot encoding for categorical variables
categorical_columns = df.select_dtypes(include='object').columns
df_encoded = pd.get_dummies(df, columns=categorical_columns)

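# Integer encoding for ordinal categorical variables
# LabelEncoder assigns codes alphabetically, so the order is stated explicitly instead.
# 'ordinal_variable' and the levels below are placeholders; adapt them to your data.
ordinal_levels = ['low', 'medium', 'high']
df_encoded['ordinal_variable_encoded'] = pd.Categorical(
    df['ordinal_variable'], categories=ordinal_levels, ordered=True
).codes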

# Feature Scaling and Normalization
# Standardize numerical features ('number' matches every integer and float dtype)
numerical_columns = df.select_dtypes(include='number').columns
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df[numerical_columns]), columns=numerical_columns, index=df.index)

# Min-max scaling for numerical features
minmax_scaler = MinMaxScaler()
df_minmax_scaled = pd.DataFrame(minmax_scaler.fit_transform(df[numerical_columns]), columns=numerical_columns, index=df.index)

# Creating Interaction or Derived Features
# Example: Create interaction feature between two numerical variables
df['interaction_feature'] = df['numeric_variable1'] * df['numeric_variable2']

# Handling Outliers
# Example: Remove outliers using z-scores
z_scores = np.abs((df['numeric_variable'] - df['numeric_variable'].mean()) / df['numeric_variable'].std())
df_no_outliers = df[z_scores < 3]

# Time-based Features
# Example: Extract year and month from a date variable
df['year'] = pd.to_datetime(df['date_column']).dt.year
df['month'] = pd.to_datetime(df['date_column']).dt.month

# Dimensionality Reduction
# Example: Perform PCA for dimensionality reduction
pca = PCA(n_components=2)
df_pca = pd.DataFrame(pca.fit_transform(df_scaled), columns=['PC1', 'PC2'])

# Summarize the transformed dataset
print("Transformed Dataset:")
print(df_pca.head())

# Save the transformed dataset
df_pca.to_csv('transformed_dataset.csv', index=False)


To analyze variables based on their type, you can use the following approaches:

Nominal Variables:

Frequency Counts: Calculate the frequency or count of each category to understand the distribution of nominal variables.
Bar Plot: Create a bar plot to visualize the frequency counts of different categories.
Cross-Tabulation: Use cross-tabulation or contingency tables to explore the relationship between two or more nominal variables.

Ordinal Variables:

Frequency Counts: Calculate the frequency or count of each category to understand the distribution of ordinal variables.
Bar Plot: Create a bar plot to visualize the frequency counts of different categories, while respecting the ordinal order.
Cross-Tabulation: Use cross-tabulation or contingency tables to explore the relationship between an ordinal variable and other categorical or numerical variables.

Binary Variables:

Frequency Counts: Calculate the frequency or count of each category to understand the distribution of binary variables.
Bar Plot: Create a bar plot to visualize the frequency counts of different categories.
Cross-Tabulation: Use cross-tabulation or contingency tables to explore the relationship between a binary variable and other categorical or numerical variables.
Proportion Analysis: Calculate the proportion or percentage of each category to understand the relative distribution.

Interval and Ratio Variables:

Summary Statistics: Calculate summary statistics such as mean, median, standard deviation, and quartiles to understand the central tendency and spread of interval and ratio variables.
Histogram: Create a histogram to visualize the distribution of interval and ratio variables.
Box Plot: Generate a box plot to identify outliers, quartiles, and the overall distribution of interval and ratio variables.
Grouping and Aggregation: Group the data based on categorical variables and calculate aggregated statistics (e.g., mean, median) for interval and ratio variables within each group.
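
The frequency, proportion, and cross-tabulation steps listed above can be sketched in a few lines. The frame below is made up; its column names mirror the dataset's `type` and `rating` columns, but the values are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie", "TV Show", "Movie"],
    "rating": ["PG", "TV-MA", "PG", "R", "TV-MA", "R"],
})

# Frequency counts per category
counts = toy["type"].value_counts()

# Proportions instead of raw counts
shares = toy["type"].value_counts(normalize=True)

# Cross-tabulation of two categorical variables
table = pd.crosstab(toy["type"], toy["rating"])

print(counts)
print(shares.round(3))
print(table)
```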

To provide a conclusion based on the analysis of your data, you can follow these steps:

Summarize Key Findings: Start by summarizing the key findings from your analysis. Highlight the most important insights and patterns that emerged from your data exploration and feature engineering. This could include significant relationships between variables, notable trends, or unexpected discoveries.

Answer the Research Questions: Refer back to the initial research questions or objectives that guided your analysis. Assess whether you were able to answer those questions based on the findings from your data analysis. Clearly state the conclusions you have drawn for each research question.

Discuss Implications and Insights: Discuss the implications of your findings and the insights they provide. Explain why the identified patterns or relationships are significant and how they can contribute to your overall understanding of the data or the problem you are trying to solve. Consider the potential impact of the insights on decision-making or future actions.

Address Limitations: Acknowledge any limitations or caveats in your analysis. Discuss factors that may have influenced the results or areas where further investigation may be needed. This demonstrates a critical and objective perspective and helps contextualize the conclusions.

Provide Recommendations: Based on your analysis, offer recommendations or suggestions for further actions. This could include areas for improvement, strategies for optimization, or potential areas of future research. Connect your recommendations directly to the insights and conclusions drawn from your analysis.

Summarize the Conclusion: Finally, provide a concise and clear summary of the overall conclusion based on your analysis. Restate the main findings, insights, and recommendations in a way that reinforces the main message you want to convey.