
In [None]:
"""
1. Steps in Data Analysis



a) Data Collection
Gathering raw data from various sources like databases, APIs, web scraping, or surveys.

b) Data Cleaning
Handling missing values
Removing duplicates
Fixing inconsistent data
Detecting and correcting outliers

c) Exploratory Data Analysis (EDA)
Understanding data distribution using summary statistics
Visualizing data through histograms, scatter plots, box plots, etc.
Identifying patterns, correlations, and anomalies

d) Data Transformation & Feature Engineering
Normalization and standardization
Encoding categorical variables
Creating new meaningful features
Dimensionality reduction (e.g., PCA)

e) Data Modeling
Applying statistical models or machine learning algorithms
Training predictive models
Evaluating model performance using metrics like RMSE, R², accuracy, etc."""
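
As a rough illustration of two of the cleaning tasks listed under step (b), the sketch below removes duplicate rows and flags outliers. The `raw` data is hypothetical, and the 1.5 * IQR rule is only one common convention for outliers, not something the steps above prescribe.

```python
import pandas as pd

# Hypothetical toy data: one duplicated row and one implausible salary
raw = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Bob', 'Charlie', 'Dana', 'Eve'],
    'Salary': [50000, 60000, 60000, 55000, 52000, 900000],
})

# Removing duplicates: keep the first occurrence of each identical row
deduped = raw.drop_duplicates().reset_index(drop=True)

# Detecting outliers with the 1.5 * IQR rule (an assumed convention)
q1 = deduped['Salary'].quantile(0.25)
q3 = deduped['Salary'].quantile(0.75)
iqr = q3 - q1
is_outlier = (deduped['Salary'] < q1 - 1.5 * iqr) | (deduped['Salary'] > q3 + 1.5 * iqr)

print(deduped[is_outlier])
```

Whether a flagged row should be corrected, capped, or removed depends on the data and is a judgment call the rule itself does not make.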

In [1]:
#missing values

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 30, 22],
        'Salary': [50000, 60000, None, 45000]}

df = pd.DataFrame(data)
print(df.isnull())
print(df.isnull().sum())


    Name    Age  Salary
0  False  False   False
1  False   True   False
2  False  False    True
3   True  False   False
Name      1
Age       1
Salary    1
dtype: int64
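
Raw counts are easier to judge as a share of the data. The sketch below, which rebuilds the same sample frame so that it runs on its own, reports the percentage of missing values per column and lists the rows with at least one gap. The names `missing_pct` and `rows_with_missing` are illustrative.

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', None],
        'Age': [25, None, 30, 22],
        'Salary': [50000, 60000, None, 45000]}
df = pd.DataFrame(data)

# Share of missing values per column, as a percentage
missing_pct = df.isnull().mean() * 100
print(missing_pct)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```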


In [None]:
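# Techniques to Handle Missing Values
# Removing Missing Values
# If only a few rows have missing values, dropping them might be an option.
# dropna() returns a new DataFrame, so the original df stays available for later cells.
df_rows_dropped = df.dropna()  # Drops every row that contains a NaN

# To remove columns that contain missing values instead:
df_cols_dropped = df.dropna(axis=1)  # Drops every column that contains a NaN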
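
Dropping does not have to be all-or-nothing. As a small sketch on the same sample data (rebuilt here so that the block is self-contained), `dropna` can be narrowed with its `subset`, `thresh`, and `how` parameters. The result names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', None],
                   'Age': [25, None, 30, 22],
                   'Salary': [50000, 60000, None, 45000]})

# Drop rows only when a specific column is missing
known_age = df.dropna(subset=['Age'])

# Keep rows that have at least 2 non-missing values
mostly_complete = df.dropna(thresh=2)

# Drop rows only when every value in the row is missing
not_all_missing = df.dropna(how='all')

print(len(known_age), len(mostly_complete), len(not_all_missing))
```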

In [None]:
# Filling Missing Values (Imputation)
# Instead of dropping missing values, we can fill them with meaningful replacements.
# The options below are alternatives: each builds its own result from df,
# so pick the one that suits the column rather than running them in sequence.

# 1. Fill with a Fixed Value
df_zero = df.fillna(0)  # Replaces all NaN with 0 (including text columns such as Name)

# 2. Fill with Mean, Median, or Mode
age_mean = df['Age'].fillna(df['Age'].mean())  # Replaces with mean
age_median = df['Age'].fillna(df['Age'].median())  # Replaces with median
age_mode = df['Age'].fillna(df['Age'].mode()[0])  # Replaces with mode (first one if tied)

# 3. Forward Fill (Propagating Last Value)
df_ffill = df.ffill()  # fillna(method='ffill') is deprecated in recent pandas

# 4. Backward Fill
df_bfill = df.bfill()  # fillna(method='bfill') is deprecated in recent pandas
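
Different columns usually call for different replacements. One way to express that, sketched here on the same sample data, is to pass a dictionary of per-column values to `fillna`. The `'Unknown'` label and the choice of median for `Age` and mean for `Salary` are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', None],
                   'Age': [25, None, 30, 22],
                   'Salary': [50000, 60000, None, 45000]})

# One replacement per column (choices here are illustrative assumptions)
fill_values = {
    'Name': 'Unknown',              # hypothetical placeholder label
    'Age': df['Age'].median(),      # median of the known ages
    'Salary': df['Salary'].mean(),  # mean of the known salaries
}

filled = df.fillna(fill_values)     # returns a new DataFrame; df is unchanged
print(filled)
```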




In [None]:
"""Using Predictive Models for Imputation
Machine learning models can predict missing values based on existing data."""

#1. Using K-Nearest Neighbors (KNN) Imputation

from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=3)
df_knn = df.copy()  # Work on a copy so df keeps its gaps for the next technique
df_knn[['Age', 'Salary']] = imputer.fit_transform(df[['Age', 'Salary']])


#2. Using Regression Imputation
from sklearn.linear_model import LinearRegression

# Train only on rows where both Age and Salary are known
# (LinearRegression cannot fit on NaN inputs)
train_data = df.dropna(subset=['Age', 'Salary'])
X_train = train_data[['Age']]
y_train = train_data['Salary']

# Training a simple linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict missing salaries, but only for rows whose Age is known
mask = df['Salary'].isnull() & df['Age'].notnull()
if mask.any():  # predict() raises an error when given zero rows
    df.loc[mask, 'Salary'] = model.predict(df.loc[mask, ['Age']])
# Use case: when there is a strong relationship between features.
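
KNN imputation picks neighbors by distance, so a column with large values such as `Salary` can dominate a column with small values such as `Age`. A possible remedy, sketched below under the assumption that standardizing is appropriate for the data, is to scale before imputing and undo the scaling afterwards. `StandardScaler` is a choice made for this example, and `n_neighbors=2` is arbitrary.

```python
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'Age': [25, None, 30, 22],
                   'Salary': [50000, 60000, None, 45000]})
cols = ['Age', 'Salary']

# StandardScaler ignores NaN when fitting and leaves it in place when transforming
scaler = StandardScaler()
scaled = scaler.fit_transform(df[cols])

# Impute in the scaled space so both columns contribute comparably to distances
imputer = KNNImputer(n_neighbors=2)
imputed_scaled = imputer.fit_transform(scaled)

# Map the result back to the original units
imputed = df.copy()
imputed[cols] = scaler.inverse_transform(imputed_scaled)
print(imputed)
```

With only four rows the effect is negligible, but on wider tables with mixed units the neighbor selection can change noticeably.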



In [None]:
"""3. Choosing the Right Method

Scenario              Recommended Approach
Few missing values    Drop rows or columns
Missing at random     Mean/Median/Mode imputation
Time-series data      Forward/Backward fill
Data has a pattern    Predictive imputation (KNN, regression)"""
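
For the time-series row of the table, the sketch below shows how forward and backward fill behave on an ordered index. The dates and readings are hypothetical.

```python
import pandas as pd

# Hypothetical daily readings with gaps in the middle and at the end
idx = pd.date_range('2024-01-01', periods=5, freq='D')
readings = pd.Series([10.0, None, None, 13.0, None], index=idx)

forward = readings.ffill()   # carry the last known value forward
backward = readings.bfill()  # pull the next known value backward

print(pd.DataFrame({'raw': readings, 'ffill': forward, 'bfill': backward}))
```

Note that backward fill leaves the final gap empty because no later value exists, and forward fill would do the same for a gap at the start.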