#  Name -> Deven Chhajed
# Roll No-> 32
# Batch -> B1 (CSE)
# Prn -> 1032210789
# Missing Value And Noise Data

# Theory of Lab 2

# What is Data Preprocessing?
Data preprocessing is a crucial phase in data analysis and machine learning workflows. It involves the cleaning, reshaping, and organizing of raw data into a suitable format for subsequent analysis or for training machine learning models. The quality of data preprocessing significantly influences the accuracy and effectiveness of your analytical processes or models.
Here are various common procedures and strategies within data preprocessing:

**1.	Data Cleaning:**

•	Managing missing data: Decide whether to remove incomplete records, fill in missing values using methods like averages or advanced imputation techniques, or utilize domain-specific knowledge for replacement.

•	Detection and handling of outliers: Identify and address outliers, which are data points significantly different from the majority. You can choose to remove, transform, or treat them separately.

**2.	Data Transformation:**

•	Feature scaling: Standardize or normalize features so that they share a similar scale. This is typically done through techniques like Min-Max scaling or Z-score normalization.

•	Encoding categorical variables: Translate categorical variables, such as text or categories, into numerical representations using methods like one-hot encoding or label encoding.

•	Feature engineering: Create new features based on existing ones to capture more information or simplify the dataset.

•	Logarithmic transformations: Apply a logarithmic transformation to right-skewed data to bring its distribution closer to normal.


**3.	Data Reduction:**

•	Dimensionality reduction: Decrease the number of features while retaining essential information. Principal Component Analysis (PCA) is a common choice for this purpose; t-Distributed Stochastic Neighbor Embedding (t-SNE) is used mainly to visualize high-dimensional data in two or three dimensions.

•	Sampling: In cases of excessively large datasets, random sampling is often performed to create a smaller but representative subset for analysis or modelling.

**4.	Data Partitioning:**

•	Split the dataset into training, validation, and test sets to evaluate and fine-tune your machine learning models. Common choices are 70-30 or 80-20 ratios for training and testing, respectively.

**5.	Addressing Imbalanced Data:**

•	If your dataset exhibits class imbalance (where one class has significantly fewer instances than others), you may need to apply techniques like oversampling, undersampling, or synthetic data generation to balance the class distribution.

**6.	Normalization:**

•	Normalize the data distribution, if necessary, to better suit the chosen analysis or machine learning algorithm.

**7.	Feature Scaling:**

•	Ensure that all features share a comparable scale to prevent any single feature from dominating the model. Common scaling methods include Min-Max scaling and Z-score normalization.

**8.	Data Validation:**

•	Verify data integrity, consistency, and correctness, including the validation of data values within expected ranges and the identification of anomalies.

**9.	Handling Time-Series Data:**

•	Special preprocessing steps may be needed for time-series data, including resampling, lag feature creation, and managing seasonality.

**10.	Text Data Preprocessing:**

•	When working with text data, essential tasks include tokenization, removal of stop words, stemming or lemmatization, and vectorization using techniques like TF-IDF or word embeddings.

**11.	Scaling and Normalization:**

•	Depending on your chosen algorithms, it may be necessary to scale or normalize features to ensure they have similar ranges.

**12.	Feature Selection:**

•	Identify and select the most relevant features for your analysis or modelling to reduce dimensionality and enhance model performance.

Data preprocessing is an iterative process, and the specific procedures and techniques applied will depend on the nature of your data and the objectives of your data analysis or machine learning project. Effective data preprocessing is critical for developing robust and accurate models and for extracting meaningful insights from your data.
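A few of the transformation and partitioning steps listed above can be sketched with pandas alone. This is a minimal illustration, not the lab's own code: the `toy` DataFrame, its values, and the 80-20 cut are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 35.0],
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, 58000.0],
})

numeric = toy[["Age", "Salary"]]

# Min-Max scaling: map each numeric column onto the range [0, 1].
min_max = (numeric - numeric.min()) / (numeric.max() - numeric.min())

# Z-score normalization: subtract the mean, divide by the sample standard deviation.
z_score = (numeric - numeric.mean()) / numeric.std()

# One-hot encoding: one indicator column per distinct category.
encoded = pd.get_dummies(toy["Country"], prefix="Country")

# 80-20 split: shuffle the rows, then cut at 80% of the row count.
shuffled = toy.sample(frac=1.0, random_state=0)
cut = int(len(shuffled) * 0.8)
train, test = shuffled.iloc[:cut], shuffled.iloc[cut:]
```

Libraries such as scikit-learn provide ready-made scalers, encoders, and splitters; the arithmetic is written out here only to show what each step does.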



# Why Is Data Preprocessing Needed?
Data preprocessing is a vital stage in both data analysis and machine learning for various compelling reasons:

**1.	Enhancing Data Quality:** Initial data is often riddled with inaccuracies, gaps, or inconsistencies. Data preprocessing is indispensable for cleaning and rectifying these issues, resulting in higher-quality data, which, in turn, contributes to more accurate and dependable analyses and models.

**2.	Feature Extraction and Crafting:** Data preprocessing enables the creation of novel features or the modification of existing ones to better capture the inherent data patterns. This process boosts the performance of machine learning models by furnishing them with more pertinent and informative attributes.

**3.	Managing Missing Data:** Real-world datasets frequently exhibit missing values, which can lead to erroneous analysis or modelling outcomes. Data preprocessing techniques, such as imputation, address these gaps by filling them in using statistical methods, domain expertise, or machine learning algorithms.

**4.	Identification and Treatment of Outliers:** Outliers can significantly skew model performance. Data preprocessing plays a pivotal role in detecting and managing outliers through strategies like trimming, transformation, or segregated treatment.

**5.	Normalization and Scaling:** Numerous machine learning algorithms are sensitive to feature scaling. Scaling during preprocessing brings all features onto comparable scales, which keeps features with large numeric ranges from disproportionately influencing the model.

**6.	Handling Categorical Data:** While machine learning models generally require numerical inputs, real-world data often encompasses categorical variables. Data preprocessing facilitates the transformation of categorical data into a numerical format suitable for modelling, employing methods such as one-hot encoding or label encoding.

**7.	Dimensionality Reduction:** High-dimensional datasets can pose challenges and increase the risk of overfitting. Dimensionality reduction techniques (e.g., PCA) mitigate this by reducing the number of features while preserving vital information.

**8.	Addressing Data Imbalances:** In classification scenarios, imbalanced datasets where one class substantially outnumbers the others can result in biased models. Data preprocessing addresses class imbalance through techniques like oversampling, undersampling, or generating synthetic data.

**9.	Handling Time-Series Data:** Time-series data often requires specialized preprocessing steps such as resampling, lag feature generation, and the adjustment for seasonality to reveal meaningful patterns.

**10.	Text Data Processing:** When working with text data, preprocessing steps encompass tokenization, exclusion of stop words, stemming or lemmatization, and vectorization. These actions are indispensable for converting text into a format amenable to modelling.

**11.	Improving Model Performance:** Sound data preprocessing directly contributes to enhanced model performance by reducing noise, improving feature quality, and ensuring data conforms better to the assumptions of the chosen machine learning algorithm.

**12.	Enhancing Interpretability and Understanding:** Data preprocessing elevates the interpretability of results by converting data into a more comprehensible format. It also aids analysts in gaining a deeper comprehension of the data and its intrinsic patterns.

In summary, data preprocessing constitutes an integral phase in data analysis and machine learning endeavours. It elevates data quality, prepares data for modelling, and amplifies model effectiveness, ultimately leading to more precise and actionable insights derived from data.
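Outlier treatment, mentioned in point 4, can be illustrated with the interquartile range (IQR) rule. The sketch below assumes a hypothetical `salary` Series and the conventional 1.5 × IQR fences; other thresholds or methods may suit a given dataset better.

```python
import numpy as np
import pandas as pd

# Hypothetical salary values; the last one is deliberately extreme.
salary = pd.Series([48000.0, 52000.0, 54000.0, 58000.0, 61000.0, 250000.0])

# Quartiles and interquartile range.
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (salary < lower) | (salary > upper)

# Option 1 (trimming): drop the flagged values.
trimmed = salary[~is_outlier]

# Option 2 (transformation): compress large values with log(1 + x).
log_salary = np.log1p(salary)
```

Whether to trim, transform, or keep a flagged value remains a judgment call that depends on the domain.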





# Importing the Libraries

**import pandas as pd:** Pandas is a Python library for data manipulation and analysis. It provides the DataFrame and Series data structures used throughout this lab.

**import numpy as np:** NumPy is Python's core numerical computing library. It provides arrays, matrices, and mathematical functions for scientific computation.

**import seaborn as sns:** Seaborn is a statistical visualization library that simplifies the creation of plots and improves their appearance.

In [2]:
import pandas as pd
import numpy as np
import seaborn as sns

# Uploading a CSV File from Your Local Machine to Google Colab

In [3]:
from google.colab import files
uploaded = files.upload()

Saving sample_data.csv to sample_data.csv


# Reading the CSV File

In [4]:
df = pd.read_csv("sample_data.csv")
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Nigeria,18.0,15000.0,No
5,Germany,40.0,,Yes
6,France,35.0,58000.0,Yes
7,Spain,,52000.0,No
8,France,48.0,79000.0,Yes
9,Germany,50.0,83000.0,No


# Understanding the Data Set

In [5]:
df.shape

(29, 4)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29 entries, 0 to 28
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Country    28 non-null     object 
 1   Age        27 non-null     float64
 2   Salary     28 non-null     float64
 3   Purchased  28 non-null     object 
dtypes: float64(2), object(2)
memory usage: 1.0+ KB


In [7]:
df.ndim

2

In [8]:
df.describe()

Unnamed: 0,Age,Salary
count,27.0,28.0
mean,36.925926,53642.857143
std,8.757089,19216.532785
min,18.0,15000.0
25%,30.0,44750.0
50%,37.0,53000.0
75%,44.0,67000.0
max,50.0,83000.0


In [9]:
df.dtypes

Country       object
Age          float64
Salary       float64
Purchased     object
dtype: object

In [10]:
df.tail()

Unnamed: 0,Country,Age,Salary,Purchased
24,France,37.0,23000.0,Yes
25,Germany,45.0,50000.0,No
26,France,37.0,67000.0,Yes
27,Nigeria,30.0,30000.0,Yes
28,Nigeria,29.0,15000.0,No


# Data Cleaning
Data cleaning, also known as data cleansing or data scrubbing, is the process of identifying and correcting errors, inconsistencies, and inaccuracies in datasets. It is a critical step in data preparation and data analysis, as the quality and reliability of data can significantly impact the results and insights derived from it.
Data cleaning involves various tasks and techniques, including:

**1. Handling Missing Values:** Identifying and dealing with missing data points, which can include filling in missing values, removing rows with missing values, or using statistical methods to estimate missing values.

**2. Removing Duplicates:**  Identifying and eliminating duplicate records or observations in a dataset, ensuring that each data point is unique.

**3. Standardizing Data:** Ensuring that data follows a consistent format, such as converting dates to a common format, normalizing text to a consistent case, or standardizing units of measurement.

**4. Dealing with Outliers:** Identifying and addressing outliers, which are data points that deviate significantly from the typical pattern and can distort analysis results.

**5. Correcting Inaccuracies:** Detecting and correcting data inaccuracies, which may include typographical errors, incorrect values, or inconsistent data entries.

**6. Handling Inconsistencies:** Resolving inconsistencies in categorical data, such as different spellings of the same category or variations in coding.

**7. Addressing Data Integrity Issues:**  Ensuring data integrity by verifying that the data accurately represents the real-world entities it is meant to describe.

**8. Data Transformation:** Converting data into a suitable format for analysis, which may involve encoding categorical variables, scaling numerical features, or creating new derived variables.

**9. Data Validation:** Checking data against predefined rules or constraints to ensure that it meets specific quality standards.

**10. Data Imputation:**  When dealing with missing data, imputing values using statistical techniques or domain knowledge to fill in gaps while maintaining data integrity.

Data cleaning is an iterative process, and it often requires collaboration between domain experts and data scientists to understand the context and meaning of the data. The goal is to prepare the data for analysis or machine learning modeling, ensuring that the results are reliable and meaningful. Clean data is essential for making informed decisions, drawing accurate conclusions, and building robust predictive models.
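Standardizing text and removing duplicates (points 2, 3, and 6) often go together, because inconsistent spellings hide duplicate records. A minimal sketch, assuming a hypothetical `raw` DataFrame with messy casing and stray whitespace:

```python
import pandas as pd

# Hypothetical messy records; values are illustrative only.
raw = pd.DataFrame({
    "Country": [" france", "France", "SPAIN", "Spain ", "Germany"],
    "Purchased": ["yes", "Yes", "No", "no", "Yes"],
})

clean = raw.copy()

# Standardize: strip surrounding whitespace and normalize the casing.
clean["Country"] = clean["Country"].str.strip().str.title()
clean["Purchased"] = clean["Purchased"].str.strip().str.capitalize()

# Remove rows that became exact duplicates after standardization.
deduplicated = clean.drop_duplicates().reset_index(drop=True)
```

Running `drop_duplicates()` before standardizing would have kept all five rows, since the raw strings differ.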


# How to Handle Data Cleaning
**1. Familiarize Yourself with the Data:**
* Start by gaining a deep understanding of your dataset, including its source, structure, and context.
* Collaborate with subject matter experts or stakeholders to grasp the data's significance.

**2. Data Profiling:**
* Begin with an initial data profiling step to spot common issues like missing values, duplicates, and outliers.
* Create summary statistics and visualizations to get a better grasp of the dataset.

**3. Manage Missing Data:**
* Identify instances of missing data and decide on an appropriate approach for handling them.
* Choices include eliminating rows with missing values, filling in missing data using statistical methods, or drawing on domain knowledge to complete gaps.

**4. Eliminate Duplicate Entries:**
* Detect and eliminate duplicate records to maintain data integrity.
* Exercise caution when eliminating duplicates, as in some cases, duplicates may represent valid data points.

**5. Standardize Data:**
* Ensure data formats are standardized for consistency. Examples include uniform date formats, consistent text casing, and conversions of units.

**6. Handle Outliers:**
* Identify outliers through statistical methods or visualizations.
* Determine whether to address outliers, remove them, or retain them based on domain expertise and the objectives of your analysis.

**7. Rectify Inaccuracies:**
* Detect and rectify inaccuracies in the data, such as typos or erroneous values.
* Utilize validation rules and data profiling techniques to identify discrepancies.

**8. Address Inconsistencies:**
* Resolve inconsistencies in categorical data, like variations in spelling or coding for the same category.

**9. Data Transformation:**
* Transform data as required for analysis, which may involve encoding categorical variables, scaling numeric attributes, or generating new derived variables.

**10. Data Validation:**
* Validate data against established rules or criteria to ensure it complies with specific quality standards and business needs.

**11. Document Changes:**
* Maintain comprehensive documentation of all cleaning procedures undertaken. Such documentation promotes transparency and allows for reproducibility.

**12. Iterate:**
* Understand that data cleaning often requires multiple iterations. After initial improvements, re-evaluate the data to detect any new issues.

**13. Evaluate Data Quality:**
* Perform assessments of data quality to gauge the effectiveness of your cleaning efforts and verify that the data is suitable for analysis.

**14. Utilize Data Cleaning Tools:**
* Make use of data cleaning tools and software, such as Python libraries (e.g., Pandas), Excel, or dedicated data cleaning software, to streamline the process.

**15. Quality Assurance:**
* Engage other team members or experts to conduct quality assurance checks on the cleaned data to guarantee its accuracy.

**16. Establish a Data Cleaning Workflow:**
* If dealing with ongoing data streams, create a data cleaning workflow to automate tasks to the greatest extent possible.

**17. Monitor Data Quality:**
* Continually oversee data quality to identify and address issues as they arise, particularly in long-term projects.

It's important to remember that data cleaning is an ongoing process, not a single event. This continual effort is essential to maintain the trustworthiness and accuracy of data for the purposes of analysis and informed decision-making.
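The profiling and validation steps (2 and 10) can be sketched as simple checks on a DataFrame. The `records` data and the range rules below (age between 0 and 120, non-negative salary) are assumptions made for illustration; real rules come from the domain.

```python
import numpy as np
import pandas as pd

# Hypothetical records containing a gap and two implausible values.
records = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 150.0],
    "Salary": [72000.0, -100.0, 54000.0, 61000.0],
})

# Profiling: count missing values per column and fully duplicated rows.
missing_per_column = records.isnull().sum()
duplicate_rows = int(records.duplicated().sum())

# Validation: assumed rules for this example.
# Note that between() evaluates to False for NaN, so missing ages are flagged too.
valid_age = records["Age"].between(0, 120)
valid_salary = records["Salary"] >= 0

# Rows that break at least one rule, kept for review rather than silently dropped.
violations = records[~(valid_age & valid_salary)]
```

Keeping `violations` as a separate table supports the documentation and quality assurance steps, since every flagged row can be reviewed before it is changed.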






# df.isnull()
df.isnull() is a pandas DataFrame method that identifies missing values in a dataset.
Applied to a DataFrame df, it returns a Boolean mask of the same shape, where each element is True if the corresponding element in df is missing (NaN or None), and False otherwise.

In [11]:
df.isnull()

Unnamed: 0,Country,Age,Salary,Purchased
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,False,False
5,False,False,True,False
6,False,False,False,False
7,False,True,False,False
8,False,False,False,False
9,False,False,False,False


In [12]:
df.isnull().sum()
# .isnull(): This method is used to create a Boolean mask, where each element in the DataFrame is checked for being null (True) or not null (False).

# .sum(): After applying .isnull(), this sums up the Boolean values (True is treated as 1, and False as 0) along each column. So, it counts the number of missing values in each column.

# The result is a Pandas Series where each column name is associated with the count of missing values in that column.

Country      1
Age          2
Salary       1
Purchased    1
dtype: int64
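Beyond column-wise counts, it is often useful to know the share of missing values per column and which rows contain them. The sketch below uses a small hypothetical `sample` DataFrame rather than the lab's `df`, so it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap in each column.
sample = pd.DataFrame({
    "Country": ["France", "Spain", None, "Germany"],
    "Age": [44.0, np.nan, 30.0, 38.0],
})

# The mean of a Boolean mask is the fraction of True values,
# i.e. the share of missing entries in each column.
missing_share = sample.isnull().mean()

# any(axis=1) marks rows with at least one missing value.
rows_with_gaps = sample[sample.isnull().any(axis=1)]
```

The same two expressions can be applied to `df` unchanged.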

In [13]:
df['Country'].isnull().sum()
# .isnull(): This method is used to create a Boolean mask for the 'Country' column, where each element in that column is checked for being null (True) or not null (False).

# .sum(): After applying .isnull() to the 'Country' column, this sums up the Boolean values (True is treated as 1, and False as 0) in that column.

# The result is the count of missing values in the 'Country' column of the DataFrame 'df'.

1

In [14]:
df['Country'].isnull().any()
# .any(): After applying .isnull() to the 'Country' column, this function checks if there is at least one True value (i.e., at least one missing value) in the Boolean mask.

True

In [15]:
df['Country'].fillna("Space", inplace = False)
# .fillna("Space"): This method is used to fill missing (null) values in the 'Country' column. In this case, missing values are replaced with the string "Space".

# inplace=False: This argument specifies that the operation should not modify the original DataFrame 'df' in place. Instead, it returns a new Series with missing values filled, leaving 'df' unchanged.

# The result is a new Series with the missing values in the 'Country' column replaced by "Space," while the original DataFrame 'df' remains unaltered.

0      France
1       Spain
2     Germany
3       Spain
4     Nigeria
5     Germany
6      France
7       Spain
8      France
9     Germany
10     France
11    Nigeria
12     France
13      Space
14     France
15    Nigeria
16      Spain
17      Spain
18      Spain
19    Nigeria
20      Spain
21     France
22    Germany
23     France
24     France
25    Germany
26     France
27    Nigeria
28    Nigeria
Name: Country, dtype: object

In [16]:
df['Country'].isnull().sum()

1

In [17]:
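# Assign the result back: fillna(inplace=True) on a selected column may not update df in recent pandas versions.
df['Country'] = df['Country'].fillna("Spain")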

In [18]:
df['Country'].isnull().sum()

0
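The same pattern extends to the remaining columns with gaps. A common approach, shown here as a sketch and not as the lab's prescribed method, is to fill numeric columns with the mean or median and categorical columns with the most frequent value. The `people` DataFrame is hypothetical, and the choice of statistic per column is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one gap per column.
people = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
    "Purchased": ["No", "Yes", None, "No"],
})

filled = people.copy()

# Numeric columns: the mean, or the median when outliers are a concern.
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Salary"] = filled["Salary"].fillna(filled["Salary"].median())

# Categorical column: the most frequent value (first mode).
filled["Purchased"] = filled["Purchased"].fillna(filled["Purchased"].mode()[0])
```

Each result is assigned back to the column, so the original `people` DataFrame stays untouched and the filled copy can be compared against it.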