## Experiment 1

### Handling Missing Values: Identify and fill missing values in a dataset using methods such as mean imputation or forward/backward filling.

### Description

### **1. Data Cleaning**

Data cleaning is the essential process of detecting and correcting errors, inconsistencies, or inaccuracies in a dataset. It ensures the data is reliable, useful, and suitable for analysis. This step involves identifying irrelevant data, correcting inaccuracies, and ensuring consistency in data formatting. Clean data improves the quality of insights and enhances the performance of predictive models. It can include:

- Identifying erroneous values or outliers.
- Correcting or removing faulty or inconsistent entries.
- Handling data quality issues like redundant or duplicated data.

### **2. Handling Missing Values**

Handling missing values is critical to maintaining the integrity of the dataset. This step involves identifying and addressing missing data points using various techniques. Depending on the situation and data type, missing values can be handled in two ways:

- **Removal**: If the missing values are sparse and spread out across the dataset, it might be feasible to remove the rows or columns with missing data, but only if their absence doesn't significantly reduce the quality or representativeness of the dataset.
- **Imputation**: When dropping rows or columns would discard too much information, missing values are filled in with estimates derived from the observed data:
    - **Mean/Median Imputation**: Replaces missing values with the mean or median of the existing data in that column.
    - **Mode Imputation**: Useful for categorical data, where the most frequent value is used to fill missing data.
    - **Forward/Backward Filling**: In time series data, missing values are filled using the preceding (forward fill) or subsequent (backward fill) data points.
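
The removal, median, and mode options listed above can be sketched with pandas as follows. The `sample` DataFrame and its column names are hypothetical and chosen only for illustration; this is a minimal sketch, not a recommendation of which strategy to use.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: one numeric and one categorical column
sample = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0, np.nan],
    "city": ["Pune", "Delhi", None, "Delhi", "Delhi"],
})

# Identify: count the missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)

# Removal: drop every row that contains at least one missing value
rows_dropped = sample.dropna()

# Median imputation for the numeric column
age_filled = sample["age"].fillna(sample["age"].median())

# Mode imputation for the categorical column
# (mode() returns a Series because ties are possible; take the first entry)
city_filled = sample["city"].fillna(sample["city"].mode()[0])

print(rows_dropped)
print(age_filled.tolist())
print(city_filled.tolist())
```

Note that row-wise removal keeps only two of the five rows here, which shows how quickly `dropna()` can shrink a small dataset.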

### **3. Removing Duplicate Data**

Duplicate data can skew analysis and lead to inaccurate insights, so removing duplicates is a key data cleaning step. This involves identifying rows or entries that have been repeated unintentionally. Duplicate entries can arise from data collection errors or merging datasets. The process involves:

- Detecting duplicated rows or entries based on certain fields.
- Deciding which duplicates to keep and which to remove (e.g., keeping the first or last instance).

Eliminating duplicates helps streamline the dataset, reduce redundancy, and improve the performance of machine learning models.
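
As a rough illustration of detecting duplicates and choosing which instance to keep, the sketch below uses `DataFrame.duplicated` and `DataFrame.drop_duplicates`. The `records` table and its columns are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical records: row 2 repeats row 1 exactly,
# and customer 101 appears twice with different e-mail addresses
records = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 101],
    "email": ["a@example.com", "b@example.com", "b@example.com",
              "c@example.com", "a.new@example.com"],
})

# Detect rows that exactly repeat an earlier row
exact_dupes = records.duplicated()
print(exact_dupes.tolist())

# Remove exact duplicates, keeping the first occurrence (the default)
deduped_exact = records.drop_duplicates()

# Deduplicate on a key field only, keeping the last occurrence of each key
deduped_by_id = records.drop_duplicates(subset="customer_id", keep="last")
print(deduped_by_id)
```

Whether `keep="first"` or `keep="last"` is appropriate depends on which record is considered authoritative, for example the most recently updated one.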

### **4. Standardizing Data Formats**

Standardization ensures that the data in your dataset follows a consistent format. This includes uniformity in how dates, times, currencies, and categorical variables are represented. Common issues include inconsistent date formats (e.g., MM/DD/YYYY vs. DD/MM/YYYY), mixed use of text cases in strings, and varying units for measurements.

- **Date & Time Standardization**: All date fields should follow a uniform format (e.g., ISO 8601: YYYY-MM-DD).
- **Text Formatting**: Text data should be standardized in terms of case (upper/lower) and spelling (U.S. vs British English).
- **Numeric Data Standardization**: Numeric fields such as currency and measurements should be standardized in the same unit (e.g., meters, dollars).

Standardization ensures that the data is interpretable by both humans and machines and is crucial for further analysis.
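
The three kinds of standardization above might look like this in pandas. The `raw` table is hypothetical, and the sketch assumes the incoming dates are known to be in MM/DD/YYYY format; an ambiguous mix of formats cannot be resolved by parsing alone.

```python
import pandas as pd

# Hypothetical raw data with inconsistent formatting
raw = pd.DataFrame({
    "order_date": ["03/15/2024", "12/01/2023", "07/04/2024"],  # assumed MM/DD/YYYY
    "country": [" india", "INDIA ", "India"],
    "height_cm": [172.0, 165.0, 180.0],
})

clean = pd.DataFrame({
    # Dates: parse with an explicit format, then write as ISO 8601 (YYYY-MM-DD)
    "order_date": pd.to_datetime(raw["order_date"], format="%m/%d/%Y")
                    .dt.strftime("%Y-%m-%d"),
    # Text: trim surrounding whitespace and use one consistent case
    "country": raw["country"].str.strip().str.title(),
    # Units: convert centimetres to metres so every row uses the same unit
    "height_m": raw["height_cm"] / 100,
})

print(clean)
```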

### **5. Removing Outliers**

Outliers are data points that deviate significantly from the rest of the dataset. These points can be caused by data entry errors, measurement anomalies, or natural variations in the data. While outliers can sometimes hold valuable insights, they often distort the results of statistical analyses or machine learning models. The steps involved in handling outliers include:

- **Detection**: Identifying outliers using statistical techniques such as the interquartile range (IQR) method, Z-scores, or visual methods like box plots.
- **Removal or Transformation**: Depending on their impact, outliers can be removed, capped, or transformed to fit within an acceptable range.
  
By addressing outliers, you ensure that your model predictions are not unduly influenced by extreme values.
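
A minimal sketch of the IQR method follows, using the common convention of flagging points more than 1.5 × IQR beyond the quartiles. The `values` series is made-up data, and the 1.5 multiplier is a conventional choice rather than a fixed rule.

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 95, 11, 12])

# Detection: compute the quartiles and the IQR fences
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)
print(values[is_outlier].tolist())

# Option 1: remove the flagged points
removed = values[~is_outlier]

# Option 2: cap (winsorize) them at the fences instead of dropping them
capped = values.clip(lower=lower, upper=upper)
print(capped.tolist())
```

Capping keeps the row count unchanged, which can matter when other columns in the same rows are still valid.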

### **6. Data Transformation**

Data transformation involves modifying or converting data into a usable format that better suits analysis or modeling. This step is essential for scaling, normalizing, or encoding data. It can include:

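- **Normalization/Standardization**: Adjusting the scale of numerical values so they have comparable ranges. Min-max normalization rescales each feature to a fixed range such as 0 to 1, while standardization (z-scoring) rescales it to zero mean and unit variance.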
- **Encoding Categorical Variables**: Converting categorical data into numerical form through one-hot encoding, label encoding, or other techniques.
- **Log Transformation**: Used to handle skewed data by compressing the range of variables and reducing the effect of outliers.

Data transformation makes patterns easier to detect and helps scale-sensitive algorithms, such as distance-based and gradient-based methods, train more reliably, because features on comparable scales contribute more evenly.
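
The transformations above can be written directly with pandas and NumPy, as in the sketch below. The `features` table is hypothetical; libraries such as scikit-learn provide equivalent transformers, but plain arithmetic is used here to keep the example self-contained.

```python
import numpy as np
import pandas as pd

# Hypothetical features: a skewed numeric column and a categorical column
features = pd.DataFrame({
    "income": [20000.0, 35000.0, 50000.0, 1000000.0],
    "grade": ["low", "mid", "mid", "high"],
})
income = features["income"]

# Min-max normalization: rescale to the range [0, 1]
min_max = (income - income.min()) / (income.max() - income.min())

# Standardization (z-score): zero mean and unit variance
z_score = (income - income.mean()) / income.std(ddof=0)

# Log transformation: log1p computes log(1 + x), compressing large values
log_income = np.log1p(income)

# One-hot encoding: one indicator column per category
one_hot = pd.get_dummies(features["grade"], prefix="grade")

print(min_max.round(3).tolist())
print(log_income.round(2).tolist())
print(list(one_hot.columns))
```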

### **7. Data Integration**

Data integration involves merging data from different sources to create a unified dataset. This is especially important when combining data from disparate systems like databases, spreadsheets, or APIs. Integration allows for a more comprehensive view of the data, enabling richer analysis. Key tasks in data integration include:

- **Matching Schema**: Ensuring that the structure (schema) of different datasets aligns so that they can be combined without errors.
- **Joining Data**: Using methods such as joins (inner, outer, left, or right joins) to combine datasets on key attributes.
- **Handling Redundancy**: Removing duplicate or redundant data points created during the merging process.

Data integration enables more holistic insights and supports better decision-making across multiple data sources.
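
As an illustration of schema matching followed by a join, the sketch below merges two small tables with `DataFrame.merge`. The `customers` and `orders` tables, and their column names, are assumptions invented for this example.

```python
import pandas as pd

# Hypothetical sources whose key columns have different names
customers = pd.DataFrame({"cust_id": [1, 2, 3],
                          "name": ["Asha", "Ben", "Chen"]})
orders = pd.DataFrame({"customer": [1, 1, 3, 4],
                       "amount": [250, 100, 75, 60]})

# Matching schema: rename the key so both tables share the same column name
orders = orders.rename(columns={"customer": "cust_id"})

# Inner join: keep only customers that have at least one order
inner = customers.merge(orders, on="cust_id", how="inner")

# Outer join: keep everything; indicator=True adds a _merge column
# recording whether each row came from the left table, the right, or both
outer = customers.merge(orders, on="cust_id", how="outer", indicator=True)

print(inner)
print(outer)
```

The `_merge` column is a convenient way to spot keys that exist in only one source, such as customer 2 (no orders) and order key 4 (no matching customer) here.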

### Importing Required Libraries

In [2]:
# For Data handling and other operations
import numpy as np
import pandas as pd 

# For Ignoring warnings
import warnings
warnings.filterwarnings("ignore")

# To see the version of pandas
print(pd.__version__)

1.5.3


### Mean Imputation

Mean Imputation is a method where missing values in a dataset are replaced with the mean (average) of the available values for that particular variable or column. This approach preserves each column's mean, but because every filled value sits exactly at the mean, it reduces the column's variance and can weaken its correlations with other columns.

In [3]:
# Mean Imputation
# Imputing missing values with mean value

# Sample dataset with missing values
data = {
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, 3, 4, np.nan]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Mean Imputation
df_mean_imputed = df.fillna(df.mean())

# Display the DataFrame after Mean Imputation
print("\nDataFrame after Mean Imputation:")
print(df_mean_imputed)

Original DataFrame:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  NaN
2  NaN  3.0  3.0
3  4.0  NaN  4.0
4  5.0  5.0  NaN

DataFrame after Mean Imputation:
     A         B         C
0  1.0  3.333333  1.000000
1  2.0  2.000000  2.666667
2  3.0  3.000000  3.000000
3  4.0  3.333333  4.000000
4  5.0  5.000000  2.666667
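
The effect on variability can be checked directly. The short sketch below reuses the values of column `B` from the sample dataset and compares the mean and standard deviation before and after imputation.

```python
import numpy as np
import pandas as pd

# Column B from the sample dataset above
b = pd.Series([np.nan, 2, 3, np.nan, 5])
b_filled = b.fillna(b.mean())

# pandas skips NaN by default, so these statistics use only observed values
print("mean before:", b.mean(), "after:", b_filled.mean())
print("std before:", b.std(), "after:", b_filled.std())
```

The mean is unchanged while the standard deviation shrinks, since the imputed rows add observations with zero deviation from the mean.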


### Forward Filling

**Forward Filling** is a method used to handle missing data by filling the missing values with the last observed non-missing value in the column. This technique is commonly applied in time series data to propagate the most recent valid value forward until a new valid observation is encountered. A missing value with no earlier observation, such as the first row of column B in the output below, remains missing.

In [6]:
# Forward Filling
# This program fills missing values with the previous value in the column.

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Forward Filling
# DataFrame.ffill() replaces fillna(method='ffill'), which newer pandas releases deprecate
df_forward_filled = df.ffill()

# Display the DataFrame after Forward Filling
print("\nDataFrame after Forward Filling:")
print(df_forward_filled)

Original DataFrame:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  NaN
2  NaN  3.0  3.0
3  4.0  NaN  4.0
4  5.0  5.0  NaN

DataFrame after Forward Filling:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  1.0
2  2.0  3.0  3.0
3  4.0  3.0  4.0
4  5.0  5.0  4.0
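
For long gaps in a time series, carrying a stale value forward indefinitely may be undesirable. As a hedged sketch, the `limit` parameter of `ffill` restricts how many consecutive missing values are filled. The `readings` series and its dates are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with a three-day gap
readings = pd.Series(
    [20.0, np.nan, np.nan, np.nan, 24.0],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# Unlimited forward fill: the value 20.0 is carried across the whole gap
unlimited = readings.ffill()

# Limited forward fill: at most one consecutive missing value is filled
limited = readings.ffill(limit=1)

print(unlimited.tolist())
print(limited.tolist())
```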


### Backward Filling

Backward Filling is a technique for handling missing data by filling the missing values with the next valid observation in the column. In time-series data, this means each gap takes the value of the observation that follows it, so known values are propagated backward in time. A missing value with no later observation, such as the last row of column C in the output below, remains missing.

In [7]:
# Backward Filling
# This program fills missing values with the next value in the column.

# Display the original DataFrame
print("Original DataFrame:")
print(df)

# Backward Filling
# DataFrame.bfill() replaces fillna(method='bfill'), which newer pandas releases deprecate
df_backward_filled = df.bfill()

# Display the DataFrame after Backward Filling
print("\nDataFrame after Backward Filling:")
print(df_backward_filled)

Original DataFrame:
     A    B    C
0  1.0  NaN  1.0
1  2.0  2.0  NaN
2  NaN  3.0  3.0
3  4.0  NaN  4.0
4  5.0  5.0  NaN

DataFrame after Backward Filling:
     A    B    C
0  1.0  2.0  1.0
1  2.0  2.0  3.0
2  4.0  3.0  3.0
3  4.0  5.0  4.0
4  5.0  5.0  NaN
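
Because forward filling leaves leading gaps and backward filling leaves trailing gaps, the two are sometimes chained. The sketch below rebuilds the same sample DataFrame and applies a forward fill followed by a backward fill; whether this is appropriate depends on the data, so treat it as one possible approach rather than a general rule.

```python
import numpy as np
import pandas as pd

# Same sample dataset as in the cells above
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, np.nan, 5],
    'C': [1, np.nan, 3, 4, np.nan],
})

# Forward fill first, then backward fill whatever is still missing
# (only the leading gap in column B remains after the forward fill)
filled_both = df.ffill().bfill()

print(filled_both)
print("Remaining missing values:", int(filled_both.isna().sum().sum()))
```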
