**1. What are the key tasks that machine learning entails? What does data pre-processing imply?**

**Ans:** Machine learning involves several key tasks, including:

1. **Data Collection:** Gathering relevant data that will be used to train the machine learning model.
2. **Data Preprocessing:** Cleaning, transforming, and organizing the data to make it suitable for training. This often involves tasks like handling missing values, normalizing or standardizing features, encoding categorical variables, and removing outliers.
3. **Feature Engineering:** Selecting, extracting, or creating features from the raw data that will be used as inputs to the model. Feature engineering aims to improve the performance of the model by providing it with more relevant information.
4. **Model Selection:** Choosing the appropriate machine learning algorithm or model architecture for the specific task at hand.
5. **Training:** Using the prepared data to train the chosen model. During training, the model learns patterns and relationships in the data.
6. **Evaluation:** Assessing the performance of the trained model using validation or test data to ensure it generalizes well to unseen examples.
7. **Hyperparameter Tuning:** Adjusting the settings or hyperparameters of the model to optimize its performance.
8. **Deployment:** Integrating the trained model into a production environment where it can be used to make predictions on new data.

Data preprocessing is a crucial step in machine learning that involves preparing the raw data to be fed into the model. This process typically includes:

1. **Data Cleaning:** Removing or imputing missing values, handling outliers, and dealing with any inconsistencies or errors in the data.
2. **Data Transformation:** Scaling or normalizing numerical features to ensure they have a similar scale, encoding categorical variables into a format suitable for machine learning algorithms, and transforming features to make them more suitable for modeling (e.g., logarithmic transformation).
3. **Feature Selection:** Identifying and selecting the most relevant features to include in the model to improve its performance and reduce overfitting.
4. **Dimensionality Reduction:** Techniques such as Principal Component Analysis (PCA) or feature selection methods can be used to reduce the number of features in the dataset while preserving its most important information, which can help improve model efficiency and reduce computational complexity.


Overall, data preprocessing is essential for ensuring that the data is of high quality and appropriate for training machine learning models, ultimately leading to more accurate and reliable predictions.
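
The transformation step described above can be illustrated with a minimal sketch in plain Python. The income values and the helper names `standardize` and `min_max` are assumptions made for this example; in practice a library such as scikit-learn would usually do this work.

```python
import math
from statistics import mean, pstdev

# Hypothetical incomes with one extreme value (illustrative data only).
incomes = [35000, 40000, 50000, 75000, 1000000]

# Log transformation compresses the long right tail.
log_incomes = [math.log10(v) for v in incomes]

def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

z_scores = standardize(log_incomes)
scaled = min_max(incomes)

print([round(z, 2) for z in z_scores])
print([round(s, 3) for s in scaled])
```

Min-max scaling leaves the extreme value dominating the range, which is one reason a log transform is often applied first to heavily skewed features.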







**2. Describe quantitative and qualitative data in depth. Make a distinction between the two.**

**Ans:** Quantitative and qualitative data are two fundamental types of data used in various fields, including statistics, social sciences, market research, and more. They differ in terms of the nature of the information they represent and the methods used to analyze them.

* **Quantitative Data:**
 Quantitative data are numerical and represent quantities or amounts. They are measured and expressed in terms of numbers. Quantitative data can be further classified into discrete and continuous data.

1. **Discrete Data:** Discrete data represent values that can be counted and are typically whole numbers. Examples include the number of students in a classroom, the number of cars in a parking lot, or the number of books on a shelf.

2. **Continuous Data:** Continuous data represent values that can be measured and can take any value within a range. They are often obtained through measurements. Examples include height, weight, temperature, and time.

Quantitative data are typically analyzed using statistical methods such as descriptive statistics (mean, median, mode), inferential statistics (hypothesis testing, regression analysis), and graphical representations (histograms, box plots, scatter plots).

* **Qualitative Data:**
Qualitative data describe qualities or characteristics and are non-numerical in nature. They provide insight into the underlying reasons, opinions, motivations, or behaviors of individuals or phenomena. Qualitative data are often obtained through observations, interviews, surveys, or open-ended questions.

**Qualitative data can take various forms:**

1. **Nominal Data:** Nominal data represent characteristics or qualities that can be grouped into categories with no inherent order. Examples include gender, marital status, type of car, or favorite color.

2. **Ordinal Data:** Ordinal data represent categories with a natural order or ranking. While they have a defined order, the differences between the categories may not be uniform or measurable. Examples include ratings (e.g., Likert scales), educational levels (e.g., elementary, high school, college), or socioeconomic status (e.g., low, middle, high).

Qualitative data are often analyzed using qualitative research methods such as content analysis, thematic analysis, grounded theory, or narrative analysis. These methods involve identifying patterns, themes, or relationships within the data to gain deeper insights into the phenomenon under study.

**Distinction between Quantitative and Qualitative Data:**

1. **Nature of Information:** Quantitative data represent quantities or amounts and are numerical, while qualitative data describe qualities, characteristics, or attributes and are non-numerical.

2. **Measurement:** Quantitative data are measured using standardized units of measurement and can be counted or measured continuously, while qualitative data are descriptive and often obtained through observations or interviews.

3. **Analysis Methods:** Quantitative data are analyzed using statistical methods to derive numerical summaries and inferential conclusions, while qualitative data are analyzed using qualitative research methods to identify patterns, themes, or relationships within the data.

In summary, while quantitative data provide numerical information about quantities or amounts, qualitative data offer insights into the qualities, characteristics, or behaviors of individuals or phenomena, often through descriptive or categorical means. Both types of data play essential roles in research and decision-making processes, and choosing the appropriate type depends on the nature of the research question and the information needed.







**3. Create a basic data collection that includes some sample records. Have at least one attribute from each of the machine learning data types.**

**Ans:** Here's a basic data collection with sample records, including at least one attribute from each of the machine learning data types:

| ID | Gender | Age | Education Level | Income | Product Category | Price | Rating |
|---|---|---|---|---|---|---|---|
| 1 | Male | 35 | Bachelor's | 50000 | Electronics | 799 | 4.5 |
| 2 | Female | 28 | Master's | 75000 | Clothing | 49.99 | 4.2 |
| 3 | Male | 42 | High School | 35000 | Home Appliances | 1299 | 4.8 |
| 4 | Female | 45 | PhD | 100000 | Electronics | 1499 | 4.6 |
| 5 | Male | 30 | Associate's | 40000 | Clothing | 29.99 | 4.0 |

Explanation of attributes:

* ID: Unique identifier for each record. It is stored as an integer, but it is a label rather than a quantity, so it behaves like a nominal attribute and is normally not used as a model feature.
* Gender: Categorical attribute representing the gender of the individual (Nominal).
* Age: Numeric attribute representing the age of the individual, recorded here in whole years (Discrete as recorded, although age itself is a continuous quantity).
* Education Level: Categorical attribute representing the highest level of education attained by the individual (Ordinal).
* Income: Numeric attribute representing the annual income of the individual (Continuous).
* Product Category: Categorical attribute representing the category of the purchased product (Nominal).
* Price: Numeric attribute representing the price of the purchased product (Continuous).
* Rating: Numeric attribute representing the rating given to the purchased product (Continuous).

Each record represents a hypothetical individual's demographic information (gender, age, education level, income) and a purchased product's details (category, price, rating). This dataset includes nominal, ordinal, discrete, and continuous attributes.







**4. What are the various causes of machine learning data issues? What are the ramifications?**


**Ans:** Machine learning data issues can arise from various sources, and understanding these causes is crucial for ensuring the quality and reliability of machine learning models. Here are some common causes of machine learning data issues and their ramifications:

**Data Quality Issues:**

* Incomplete Data: Missing values, incomplete records, or data entry errors can lead to incomplete datasets, affecting the representativeness and accuracy of the data.
* Inaccurate Data: Incorrect or outdated information, measurement errors, or data collection biases can introduce inaccuracies into the dataset, leading to unreliable model predictions.
* Inconsistent Data: Inconsistencies in data formats, units of measurement, or data encoding can make it challenging to integrate data from different sources or conduct meaningful analyses.
* Ramifications: Poor data quality can lead to biased model predictions, decreased model performance, and unreliable insights, ultimately undermining the effectiveness and trustworthiness of machine learning applications.

**Data Imbalance:**

* Class Imbalance: In classification tasks, imbalanced class distributions occur when one class is significantly more prevalent than others. This can lead to biased model training and poor generalization performance, with models favoring the majority class and performing poorly on minority classes.
* Ramifications: Imbalanced data can result in inaccurate classification results, reduced sensitivity to minority classes, and inflated model performance metrics, making it difficult to detect and address real-world issues.
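
The point about inflated performance metrics can be shown with a small sketch. The 95/5 class split and the always-predict-majority "model" are assumptions chosen only to illustrate the effect.

```python
# Hypothetical labels: 95 negatives (0) and 5 positives (1).
y_true = [0] * 95 + [1] * 5

# A trivial "model" that always predicts the majority class.
y_pred = [0] * 100

# Accuracy: share of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the minority class: share of actual positives that were found.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_pos = sum(t == 1 for t in y_true)
recall = true_pos / actual_pos

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
```

Accuracy looks high even though no minority-class example is detected, which is why metrics such as recall, precision, or F1 are usually reported alongside accuracy for imbalanced data.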

**Data Skewness and Distributional Issues:**

* Skewed Data: Skewed distributions, such as heavily right-skewed or left-skewed distributions, can result in non-normality and unequal representation of data points, affecting the assumptions of statistical models and algorithms.
* Outliers: Outliers, or extreme values, can distort statistical analyses, affect parameter estimates, and bias model predictions, especially in sensitive models like linear regression.
* Ramifications: Skewed data and outliers can lead to biased model estimates, decreased model accuracy, and increased vulnerability to overfitting or underfitting, hindering the robustness and generalizability of machine learning models.

**Feature Selection and Engineering Issues:**

* Irrelevant Features: Including irrelevant or redundant features in the model can introduce noise, increase model complexity, and degrade model performance.
* Feature Scaling: Features with different scales or units of measurement may require normalization or standardization to ensure fair comparison and effective model training.
* Ramifications: Poor feature selection and engineering decisions can lead to suboptimal model performance, longer training times, and decreased interpretability, hindering the model's ability to capture meaningful patterns and relationships in the data.

Addressing machine learning data issues requires careful data preprocessing, quality assurance, and feature engineering to ensure that the data used for model training and evaluation is clean, representative, and suitable for the intended analysis tasks. Failure to address these issues can lead to biased, inaccurate, or unreliable model predictions, undermining the utility and effectiveness of machine learning applications.







**5. Demonstrate various approaches to categorical data exploration with appropriate examples.**

**Ans:** Exploring categorical data involves understanding the distribution, frequency, and relationships between categories within the dataset. Here are several approaches to categorical data exploration with appropriate examples:

**Frequency Distribution:**

* Calculate the frequency of each category within a categorical variable.
* Visualize the frequency distribution using bar charts or pie charts.
* Example: Suppose we have a dataset of customer reviews, and one of the categorical variables is "Sentiment" with categories "Positive," "Neutral," and "Negative." We can calculate the frequency of each sentiment category and visualize it using a bar chart to understand the distribution of sentiments among customers.
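
A frequency distribution like the one described above can be sketched with the standard library. The `sentiments` list is made-up sample data, and the text bars stand in for a real bar chart.

```python
from collections import Counter

# Hypothetical sentiment labels from customer reviews.
sentiments = ["Positive", "Negative", "Positive", "Neutral", "Positive", "Negative"]

counts = Counter(sentiments)
total = len(sentiments)

# Print count, relative frequency, and a simple text bar per category.
for category, n in counts.most_common():
    print(f"{category:<8} {n}  {n / total:5.1%}  {'#' * n}")
```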

**Cross-Tabulation:**

* Create a cross-tabulation (contingency table) to analyze the relationships between two categorical variables.
* Calculate counts or percentages of observations for each combination of categories.
* Example: In the same customer reviews dataset, we can create a cross-tabulation between the "Sentiment" variable and the "Product Category" variable to understand how sentiments vary across different product categories. This allows us to identify which product categories receive more positive or negative reviews.
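
As a rough sketch of a cross-tabulation, the following builds a contingency table from (product category, sentiment) pairs. The `reviews` data and the nested-dictionary layout are assumptions for illustration; data analysis libraries provide ready-made cross-tab functions.

```python
from collections import Counter

# Hypothetical (product category, sentiment) pairs.
reviews = [
    ("Electronics", "Positive"), ("Electronics", "Negative"),
    ("Clothing", "Positive"), ("Clothing", "Positive"),
    ("Electronics", "Positive"), ("Clothing", "Neutral"),
]

pair_counts = Counter(reviews)
rows = sorted({category for category, _ in reviews})
cols = sorted({sentiment for _, sentiment in reviews})

# Contingency table: table[category][sentiment] -> count (0 if never seen).
table = {r: {c: pair_counts[(r, c)] for c in cols} for r in rows}

# Print the table with a header row.
print(" " * 12 + "".join(f"{c:>10}" for c in cols))
for r in rows:
    print(f"{r:<12}" + "".join(f"{table[r][c]:>10}" for c in cols))
```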

**Stacked Bar Charts:**

* Visualize the relationship between two categorical variables using stacked bar charts.
* Each bar represents the frequency of one variable, divided into segments representing the different categories of the other variable.
* Example: Continuing with the customer reviews dataset, we can create a stacked bar chart where each bar represents a product category, and the segments within the bar represent the distribution of sentiments (positive, neutral, negative) for that category. This allows us to compare the sentiment distribution across different product categories visually.

**Count Plot (Bar Chart of Counts):**

* Create a bar chart of counts (a count plot) to visualize the distribution of a categorical variable. A histogram in the strict sense bins numerical data, so the bar chart is its categorical counterpart.
* Each bar represents the frequency or count of observations in each category.
* Example: Suppose we have a dataset of employee job titles, and one categorical variable is "Department." We can create a count plot to visualize the distribution of employees across different departments, showing how many employees belong to each department.

**Heatmaps:**

* Visualize the frequency or proportion of observations for combinations of two categorical variables using a heatmap.
* Color-coding cells based on the frequency or proportion of observations.
* Example: In a dataset of customer demographics, we can create a heatmap to visualize the distribution of age groups across different income brackets. Each cell in the heatmap represents the proportion of customers belonging to a specific age group and income bracket combination, with colors indicating higher or lower proportions.

By employing these various approaches to categorical data exploration, analysts can gain insights into the distribution, relationships, and patterns within categorical variables, helping to inform further analysis and decision-making processes.


**6. How would the learning activity be affected if certain variables have missing values? Having said that, what can be done about it?**


**Ans:** **If certain variables in the dataset have missing values, the learning activity, particularly in machine learning tasks, can be significantly affected in several ways:**

**1. Bias in Model Training:** Missing values can introduce bias into the training process, especially if the missingness is not random. Models trained on incomplete data may learn from patterns associated with missing values rather than the true underlying relationships in the data.

**2. Reduced Model Performance:** Missing values can lead to reduced model performance, as models may struggle to accurately represent the data and make predictions. This can result in lower accuracy, higher error rates, and decreased model reliability.

**3. Data Loss:** Traditional approaches like complete case analysis (removing observations with missing values) can lead to significant data loss, especially if missing values are prevalent in multiple variables. This loss of data can reduce the sample size and potentially overlook valuable information.

**4. Inaccurate Estimates:** Imputation methods that replace missing values with estimated values can introduce noise and inaccuracies into the dataset. If the imputation process is not carefully executed, it may distort the true distribution and relationships in the data.

**To address missing values and mitigate their impact on the learning activity, several strategies can be employed:**

**1. Data Imputation:** Missing values can be imputed using various techniques, such as mean imputation, median imputation, regression imputation, or machine learning-based imputation methods. These methods estimate missing values based on the observed data and can help preserve the integrity of the dataset.

**2. Multiple Imputation:** Instead of imputing a single value for each missing observation, multiple imputation generates multiple plausible values for missing values, accounting for uncertainty due to missingness. This approach provides more accurate estimates and uncertainty measures and is preferred when the assumption of missing completely at random (MCAR) or missing at random (MAR) holds.

**3. Model-Based Imputation:** Machine learning models can be trained to predict missing values based on the available data. Models like decision trees, random forests, or deep learning models can learn complex patterns and relationships in the data and impute missing values accordingly.

**4. Feature Engineering:** Instead of imputing missing values directly, feature engineering techniques can be used to create new features that capture information about missingness. For example, a binary indicator variable can be created to flag whether a value is missing in a particular variable.

**5. Sensitivity Analysis:** Sensitivity analysis can be performed to assess the robustness of conclusions to different missing data handling methods. By comparing results obtained with different imputation techniques or handling strategies, analysts can evaluate the stability and reliability of findings.

Overall, addressing missing values in the dataset is essential for ensuring the quality and reliability of analyses and models. By employing appropriate imputation techniques and handling strategies, analysts can mitigate the impact of missing values on the learning activity and obtain more accurate and reliable results.
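
The trade-off between complete case analysis and imputation with a missingness indicator can be sketched as follows. The `records` data and the `income_missing` field name are hypothetical, and only the income column is imputed here.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 35, "income": 50000},
    {"age": 28, "income": None},
    {"age": None, "income": 35000},
    {"age": 45, "income": 100000},
]

# Complete case analysis: drop every record that has any missing value.
complete_cases = [r for r in records if None not in r.values()]

# Alternative: keep every record, flag missing incomes, then impute the mean
# of the observed incomes (the age column is left untouched in this sketch).
income_mean = mean(r["income"] for r in records if r["income"] is not None)
flagged = [
    {
        **r,
        "income_missing": int(r["income"] is None),
        "income": income_mean if r["income"] is None else r["income"],
    }
    for r in records
]

print(len(records), len(complete_cases), len(flagged))
```

Complete case analysis keeps only half of these records, while the flag-and-impute approach keeps all of them at the cost of introducing estimated values.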






**7. Describe the various methods for dealing with missing data values in depth.**

**Ans:** Dealing with missing data values is a crucial step in data preprocessing to ensure the quality and reliability of analyses and models. Here are various methods for handling missing data values in depth:

**a. Deletion Methods:**

* Listwise Deletion (Complete Case Analysis): In this method, entire observations with missing values in any variable are removed from the dataset. While simple, this approach can lead to loss of valuable information and reduced sample size.
* Pairwise Deletion: In this method, analyses are conducted on all available pairs of variables, excluding observations with missing values only for the variables under analysis. While it maximizes the use of available data, it may lead to biased results due to differences in sample sizes across analyses.

**b. Imputation Methods:**

* Mean/Median/Mode Imputation: Missing values are replaced with the mean, median, or mode of the observed values in the respective variable. While simple and quick, this method shrinks the variance of the variable and can distort its distribution and its relationships with other variables.
* Regression Imputation: Missing values are predicted from the other variables in the dataset. A regression model is fitted on the complete observations, with the other variables as predictors, and is then used to predict the missing values. This method preserves relationships between variables, but it may give biased estimates if the model does not capture the true relationship (for example, if it is not linear). Because every imputed value falls exactly on the fitted line, it also understates variability and overstates the strength of relationships.
* K-Nearest Neighbors (KNN) Imputation: Missing values are replaced with a value derived from the k most similar observations in the feature space, for example the mean of their values for a numerical variable or the most frequent value for a categorical one. This method preserves local relationships between variables and can handle both numerical and categorical data, given a suitable distance measure.

**c. Multiple Imputation:**

Multiple imputation involves creating several plausible values for each missing value based on the observed data distribution, which yields several completed datasets. Each dataset is analyzed separately, and the results are pooled using combination rules (commonly Rubin's rules). This method accounts for uncertainty due to missing data and provides more accurate estimates compared to single imputation methods.
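
To make the KNN imputation idea from the imputation methods above concrete, here is a minimal sketch. The helper `knn_impute`, the two-column layout, and the choice of k = 2 are assumptions for this example; it measures distance on a single feature and averages the neighbors' values, which is only one of several possible variants.

```python
# Hypothetical rows of (age, income in thousands); None marks a missing income.
rows = [(25, 40.0), (27, 42.0), (45, 90.0), (47, 95.0), (26, None)]

def knn_impute(rows, k=2):
    """Fill a missing second column with the mean over the k nearest
    complete rows, measuring distance on the first column only."""
    complete = [r for r in rows if r[1] is not None]
    filled = []
    for x, y in rows:
        if y is None:
            # Sort complete rows by distance to x and keep the k closest.
            nearest = sorted(complete, key=lambda r: abs(r[0] - x))[:k]
            y = sum(r[1] for r in nearest) / k
        filled.append((x, y))
    return filled

imputed = knn_impute(rows)
print(imputed[-1])
```

The missing income is filled from the two people closest in age rather than from the overall mean, so the imputed value reflects the local structure of the data.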

**d. Advanced Techniques:**

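Expectation-Maximization (EM) Algorithm: The EM algorithm is an iterative method for estimating the parameters of a statistical model when some data are missing. It alternates between an expectation step, which fills in the missing information using the current parameter estimates, and a maximization step, which re-estimates the parameters by maximizing the likelihood, and it repeats the two steps until the estimates converge.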

**e. Deep Learning Models:**

Deep learning models, such as autoencoders, can be used to learn complex patterns and relationships in the data and impute missing values. These models can handle high-dimensional data and capture nonlinear relationships but may require large amounts of data and computational resources.

**f. Domain-Specific Knowledge:**

Incorporating domain knowledge can help inform the imputation process by guiding the selection of appropriate methods and variables for imputation. For example, imputing missing values in time series data may involve using interpolation methods based on temporal patterns or seasonal trends.
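
For the time series case mentioned above, linear interpolation is one simple option. The function `interpolate_linear` and the temperature readings below are illustrative assumptions; the sketch fills only gaps that have observed values on both sides.

```python
def interpolate_linear(series):
    """Fill interior None gaps by linear interpolation between the nearest
    observed neighbors; leading and trailing gaps are left as None."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant step between two consecutive observed points.
        step = (series[right] - series[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = series[left] + step * (i - left)
    return filled

# Hypothetical hourly temperature readings with gaps.
temps = [20.0, None, None, 23.0, 24.0, None, 26.0]
filled_temps = interpolate_linear(temps)
print(filled_temps)
```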

Choosing the appropriate method for handling missing data depends on factors such as the nature of the data, the extent of missingness, the presence of patterns or relationships in the data, and the objectives of the analysis. It's essential to carefully consider these factors and evaluate the impact of missing data handling methods on the validity and reliability of the results. Additionally, sensitivity analysis can be performed to assess the robustness of conclusions to different imputation methods.







**8. What are the various data pre-processing techniques? Explain dimensionality reduction and function selection in a few words.**


**Ans:** Data preprocessing techniques are used to prepare raw data for analysis and modeling. Some common data preprocessing techniques include:

**a. Data Cleaning:** This involves handling missing values, outliers, and errors in the dataset. Techniques include imputation (replacing missing values with estimated values), removing outliers, and correcting errors.

**b. Data Transformation:** Data transformation techniques are used to modify the distribution or scale of the data. Examples include normalization (scaling numerical features to a standard range), log transformation (reducing skewness in data), and binning (grouping continuous values into bins or categories).

**c. Feature Engineering:** Feature engineering involves creating new features or modifying existing ones to improve model performance. Techniques include creating interaction terms, combining features, and deriving new features from existing ones.

**d. Encoding Categorical Variables:** Categorical variables need to be encoded into numerical values before they can be used in machine learning models. Techniques include one-hot encoding, label encoding, and target encoding.

**e. Feature Selection:** Feature selection involves selecting a subset of relevant features from the original set of features. This helps reduce dimensionality and improve model efficiency. Techniques include filter methods, wrapper methods, and embedded methods.

**f. Dimensionality Reduction:** Dimensionality reduction techniques are used to reduce the number of features in the dataset while preserving the most important information. This helps reduce computational complexity and mitigate the curse of dimensionality. Techniques include Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and Linear Discriminant Analysis (LDA).

**Dimensionality reduction in a few words:** Dimensionality reduction transforms the original features into a lower-dimensional space that retains most of the important information. It reduces computational complexity, removes redundant information, and can improve model performance.

**Function selection in a few words:** Function selection means choosing the mathematical function or algorithm that models the relationship between the features and the target variable, for example a regression function (linear, polynomial, exponential) or a learning algorithm (decision trees, support vector machines, neural networks), based on the problem domain and dataset characteristics. If the question intends feature selection, the usual companion to dimensionality reduction, that is the process described in item e: keeping a subset of the original features rather than transforming them.
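
As a sketch of a filter method for feature selection (item e above), the following ranks features by the absolute Pearson correlation with the target and keeps those above a cutoff. The feature names, the data, and the 0.5 threshold are all assumptions for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical housing features and target (price).
features = {
    "size_sqm": [50, 60, 80, 100, 120],
    "rooms": [2, 2, 3, 4, 5],
    "door_number": [7, 3, 9, 1, 5],
}
price = [150, 180, 240, 310, 360]

# Score each feature independently of the others (the "filter" idea).
scores = {name: abs(pearson(col, price)) for name, col in features.items()}

threshold = 0.5  # assumed cutoff, chosen only for this example
selected = [name for name, score in scores.items() if score >= threshold]
print(selected)
```

Unlike dimensionality reduction, this keeps a subset of the original columns unchanged, so the selected features remain directly interpretable.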

**9.**

**i. What is the IQR? What criteria are used to assess it?**

**ii. Describe the various components of a box plot in detail? When will the lower whisker surpass the upper whisker in length? How can box plots be used to identify outliers?**

**Ans:**

**i. What is the IQR? What criteria are used to assess it?**

The Interquartile Range (IQR) is a measure of statistical dispersion that represents the spread of the middle 50% of the data. It is calculated as the difference between the third quartile (Q3) and the first quartile (Q1):

IQR = Q3 - Q1.

**Criteria used to assess the IQR:**

* The IQR provides information about the variability within the central portion of the dataset, ignoring extreme values or outliers.
* A larger IQR indicates greater variability or spread within the middle 50% of the data.
* The IQR is resistant to outliers and is often used as a robust measure of spread in statistical analysis.
* The IQR is used to identify outliers using the rule: data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers.
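
The IQR and the 1.5 * IQR rule can be computed with the standard library, as in this sketch. The data are made up, and `method="inclusive"` is one of several quartile conventions, so other tools may report slightly different quartiles for the same data.

```python
from statistics import quantiles

# Hypothetical measurements with one suspiciously large value.
data = [2, 4, 4, 5, 6, 7, 8, 9, 30]

# Quartiles under the "inclusive" convention.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences from the 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print("outliers:", outliers)
```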

**ii. Describe the various components of a box plot in detail? When will the lower whisker surpass the upper whisker in length? How can box plots be used to identify outliers?**

**Components of a box plot:**

* Median (Q2): The middle value of the dataset, dividing it into two halves.
* Box: Represents the interquartile range (IQR), indicating the spread of the middle 50% of the data. The lower boundary of the box is Q1, and the upper boundary is Q3.
* Whiskers: Lines extending from the box to the smallest and largest data values that are not outliers. Under the common convention, each whisker reaches the most extreme data point within 1.5 times the IQR of the box edge, so its length is at most 1.5 * IQR; some plots instead extend the whiskers to the minimum and maximum of the data.
* Outliers: Data points that fall beyond the ends of the whiskers are considered outliers and are plotted individually as points.

**When will the lower whisker surpass the upper whisker in length?**

The lower whisker will surpass the upper whisker in length when the smallest non-outlier value lies farther below Q1 than the largest non-outlier value lies above Q3. This typically occurs when the dataset is left-skewed (negatively skewed): most data points are concentrated towards the upper end of the distribution, and a long tail extends towards the lower values.
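
A small numeric check of whisker lengths on left-skewed data, assuming the common 1.5 * IQR whisker convention and the "inclusive" quartile method; the data are made up for illustration.

```python
from statistics import quantiles

# Hypothetical left-skewed data: most values are high, with a tail of low values.
data = sorted([1, 5, 8, 9, 9, 10, 10, 10, 11])

q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
inside = [x for x in data if low_fence <= x <= high_fence]
lower_whisker = q1 - min(inside)
upper_whisker = max(inside) - q3
outliers = [x for x in data if x < low_fence or x > high_fence]

print(f"lower whisker={lower_whisker}, upper whisker={upper_whisker}")
print("outliers:", outliers)
```

Here the lower whisker is three times as long as the upper one, and the lowest value falls beyond the fence and would be drawn as an individual outlier point.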

**How can box plots be used to identify outliers?**

Box plots provide a visual summary of the distribution of data and facilitate the identification of outliers:

* Data points falling below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers and are plotted as individual points beyond the whiskers.
* Box plots make it easy to identify the presence, location, and extent of outliers in the dataset, helping to assess data quality and identify potential issues.

In summary, the Interquartile Range (IQR) measures the spread of the middle 50% of the data, while box plots visually represent the distribution of the data, including the median, quartiles, whiskers, and outliers. Box plots are effective tools for identifying outliers and assessing the variability and spread of data.








**10. Make brief notes on any two of the following:**

**a. Data collected at regular intervals**

**b. The gap between the quartiles**

**c. Use a cross-tab**

**Ans:**

**a. Data Collected at Regular Intervals:**

Data collected at regular intervals refers to observations or measurements that are taken consistently over time or at equal intervals.

**Examples of data collected at regular intervals include:**

* Time series data: Observations recorded at fixed intervals, such as daily, weekly, or monthly.
* Sensor data: Readings from sensors or instruments taken at regular time intervals.
* Stock market data: Daily or intraday price movements of stocks or financial instruments.

**Characteristics of data collected at regular intervals:**

* Regular intervals ensure uniform spacing between data points, facilitating analysis and comparison over time.
* Time series analysis techniques, such as trend analysis, seasonality detection, and forecasting, are commonly applied to data collected at regular intervals.
* Preprocessing steps may include handling missing values, smoothing noisy data, and resampling to align with desired time intervals.
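
Two of the preprocessing steps listed above, smoothing and resampling, can be sketched in plain Python. The `daily_sales` values, the 3-day window, and the 7-day resampling period are assumptions for this example.

```python
def moving_average(values, window=3):
    """Trailing moving average; positions without a full window are omitted."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical daily sales for two weeks, with one spike on day 5.
daily_sales = [10, 12, 11, 13, 40, 12, 11, 12, 13, 12, 11, 10, 12, 14]

# Smoothing: the spike is spread over neighboring points.
smoothed = moving_average(daily_sales, window=3)

# Resampling: aggregate daily values into weekly totals.
weekly_totals = [sum(daily_sales[i : i + 7]) for i in range(0, len(daily_sales), 7)]

print([round(v, 1) for v in smoothed])
print(weekly_totals)
```

Both operations rely on the observations being equally spaced; with irregular timestamps, a fixed window or fixed-size chunk would no longer correspond to a fixed span of time.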

**b. The Gap Between the Quartiles:**

The gap between the quartiles, also known as the interquartile range (IQR), is a measure of statistical dispersion that indicates the spread of the middle 50% of the data.

**Calculation of the interquartile range:**

* The quartiles divide a dataset into four equal parts. The first quartile (Q1) represents the 25th percentile, and the third quartile (Q3) represents the 75th percentile.
* The interquartile range (IQR) is calculated as the difference between the third quartile (Q3) and the first quartile (Q1): IQR = Q3 - Q1.

**Interpretation of the interquartile range:**

* The IQR provides information about the variability of the middle 50% of the data.
* A larger IQR indicates greater variability or spread within the central portion of the dataset, while a smaller IQR indicates less variability.
* The IQR is resistant to outliers and is often used to identify and assess the spread of data in robust statistical analysis.

**Applications of the interquartile range:**

* Box plots visually represent the interquartile range as the box between the first and third quartiles, with the median line inside the box.
* Outliers may be identified as data points that fall outside a specified range defined by the quartiles and the IQR (e.g., Q1 - 1.5 * IQR to Q3 + 1.5 * IQR).

These notes provide a brief overview of data collected at regular intervals and the interquartile range, highlighting their characteristics, calculations, interpretations, and applications in statistical analysis.








**11. Make a comparison between:**

**a. Data with nominal and ordinal values**

**b. Histogram and box plot**

**c. The average and median**

**Ans:** Here's a comparison between data with nominal and ordinal values, histogram and box plot, and the average and median:

**a. Data with Nominal and Ordinal Values:**

**Nominal Data:**

* Nominal data consists of categories or labels with no inherent order or ranking.
* Examples include gender (male/female), eye color (blue/brown/green), and country of origin.
* Nominal data are usually represented using one-hot encoding. Label encoding is also possible, but the integers it assigns are arbitrary, and a model may misread them as an order.

**Ordinal Data:**

* Ordinal data consists of categories or labels with a natural order or ranking.
* Examples include education level (high school, bachelor's, master's), satisfaction ratings (poor, fair, good, excellent), and income levels (low, medium, high).
* Ordinal data can be represented using label encoding, where the categories are assigned numerical values according to their order or ranking.
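
The difference in how the two kinds of data are encoded can be sketched as follows. The sample values and the `education_order` mapping are assumptions for illustration.

```python
# Hypothetical sample columns.
education = ["High School", "Bachelor's", "Master's", "Bachelor's"]
eye_color = ["blue", "brown", "green", "brown"]

# Ordinal: map each category to an integer that respects the natural order.
education_order = {"High School": 0, "Bachelor's": 1, "Master's": 2}
education_encoded = [education_order[e] for e in education]

# Nominal: one-hot encode so that no order is implied between categories.
color_levels = sorted(set(eye_color))
eye_color_onehot = [[int(c == level) for level in color_levels] for c in eye_color]

print(education_encoded)
print(color_levels, eye_color_onehot)
```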

**b. Histogram and Box Plot:**

**Histogram:**

* A histogram is a graphical representation of the distribution of numerical data.
* It consists of bars whose heights represent the frequencies or counts of data points falling within predefined intervals (bins).
* Histograms are useful for visualizing the shape, center, spread, and skewness of the data distribution.

**Box Plot (Box-and-Whisker Plot):**

* A box plot is a graphical summary of the distribution of numerical data through quartiles.
* It displays the median (middle line), quartiles (box), and range of the data (whiskers).
* Box plots are useful for identifying outliers, comparing the spread and central tendency of different datasets, and detecting skewness.

**c. The Average and Median:**

**Average (Mean):**

* The average, or mean, is a measure of central tendency calculated by summing all values in a dataset and dividing by the total number of values.
* It's sensitive to extreme values (outliers) and can be affected by skewed distributions.
* The average is commonly used to summarize numerical data with a symmetric distribution.

**Median:**

* The median is the middle value of a dataset when it's arranged in ascending or descending order (with an even number of values, it is the average of the two middle values).
* It's less affected by outliers and skewed distributions compared to the mean.
* The median is a robust measure of central tendency and is often used when the distribution of the data is skewed or contains outliers.
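
The different sensitivity of the two measures to outliers can be checked with a short sketch; the salary figures are made up for illustration.

```python
from statistics import mean, median

# Hypothetical salaries (in thousands), then the same data plus one extreme value.
salaries = [30, 32, 35, 38, 40]
with_outlier = salaries + [500]

mean_before, median_before = mean(salaries), median(salaries)
mean_after, median_after = mean(with_outlier), median(with_outlier)

print(f"mean:   {mean_before} -> {mean_after}")
print(f"median: {median_before} -> {median_after}")
```

A single extreme value more than triples the mean, while the median moves only slightly.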

In summary, data with nominal values represent unordered categories, while data with ordinal values represent ordered categories. Histograms and box plots are both graphical representations of data distributions, with histograms showing frequency distributions and box plots summarizing key statistics. The average and median are measures of central tendency, with the average being sensitive to outliers and the median being more robust.





