<a href="https://colab.research.google.com/github/hariomvyas/AIhub/blob/main/AIHub_Data_Cleaning_Template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# AIHub Data Cleaning Template by Hariom Vyas

Data cleaning is an essential step in preparing data for machine learning. The steps below walk through a typical cleaning workflow in Python:

## 1. Import libraries: 
First, import the libraries used throughout: pandas, NumPy, and scikit-learn (imported as sklearn).

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler

## 2. Load the data 
Load the dataset you want to clean using pandas.

In [4]:
data = pd.read_csv('filename.csv')
data
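
Before cleaning anything, it usually helps to look at the shape, column types, and missing-value counts of the loaded data. The following is a minimal sketch; the in-memory CSV text and its column names (`age`, `city`, `income`) are made up for illustration and stand in for `filename.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for 'filename.csv'
csv_text = """age,city,income
25,Paris,50000
,London,62000
31,,58000
25,Paris,50000
"""

df = pd.read_csv(io.StringIO(csv_text))

shape = df.shape            # (number of rows, number of columns)
dtypes = df.dtypes          # type pandas inferred for each column
missing = df.isna().sum()   # count of missing values per column

print(shape)
print(dtypes)
print(missing)
```

The missing-value counts indicate which columns need attention in the next step.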

## 3. Handle missing values: 
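Check for missing values in your dataset and handle them appropriately. You can either remove the rows or columns that contain missing values, or impute the missing values with a statistic such as the mean, median, or mode. These are alternatives: if you drop the rows first, there is nothing left to impute.

In [None]:
# Count the missing values in each column
print(data.isna().sum())

# Option A: drop every row that contains a missing value
# data = data.dropna()

# Option B: impute numeric columns with their column mean
# (numeric_only=True skips text columns, which have no mean)
data = data.fillna(data.mean(numeric_only=True))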
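
Median and mode imputation follow the same pattern as the mean. The sketch below is one possible way to impute numeric columns with the median and non-numeric columns with the most frequent value; the toy DataFrame and its column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names are illustrative only
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Paris", "London", None, "Paris"],
})

num_cols = df.select_dtypes(include="number").columns
cat_cols = df.select_dtypes(exclude="number").columns

# The median is less sensitive to outliers than the mean
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# mode() returns a DataFrame; its first row holds the most frequent values
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])

print(df)
```

Which statistic is appropriate depends on the column: the mean suits roughly symmetric numeric data, the median suits skewed data, and the mode is the usual choice for categories.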

## 4. Handle categorical data:
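Convert categorical data to numerical data using a technique such as one-hot encoding or label encoding. Apply only one of the two to a given column: one-hot encoding replaces the original column, so label encoding it afterwards would fail with a KeyError.

In [None]:
# 'category' is a placeholder for the name of your categorical column

# Option A: one-hot encoding (replaces 'category' with indicator columns)
data = pd.get_dummies(data, columns=['category'])

# Option B: label encoding (maps each distinct value to an integer)
# from sklearn.preprocessing import LabelEncoder
# le = LabelEncoder()
# data['category'] = le.fit_transform(data['category'])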
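
To make the difference between the two encodings concrete, here is a small sketch on a made-up `size` column. It uses scikit-learn's `OrdinalEncoder` in place of `LabelEncoder` for the integer encoding, because `OrdinalEncoder` is designed for feature columns and lets you state the category order explicitly; that substitution is a choice made for this example, not part of the template above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical column with a natural order: small < medium < large
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# One-hot encoding: one indicator column per distinct value
one_hot = pd.get_dummies(df, columns=["size"], dtype=int)

# Ordinal encoding: a single integer column that respects the given order
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
ordinal = encoder.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(ordinal)
```

One-hot encoding avoids implying an order between categories, at the cost of one column per value. Integer codes are compact but suggest an ordering, so they fit best when the categories really are ordered.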

## 5. Remove duplicate values:
Remove duplicate rows from the dataset.

In [None]:
data.drop_duplicates(inplace=True)
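
It can be useful to count duplicates before dropping them, and to decide whether a row counts as a duplicate only when every column matches or when a key column matches. A minimal sketch, assuming a toy frame with a hypothetical `id` key column:

```python
import pandas as pd

# Hypothetical data: row 1 repeats row 0 exactly; id 3 appears with two values
df = pd.DataFrame({
    "id": [1, 1, 2, 3, 3],
    "value": [10, 10, 20, 30, 35],
})

# Number of rows that exactly repeat an earlier row
n_exact = int(df.duplicated().sum())

# Drop exact duplicates only
deduped = df.drop_duplicates()

# Treat rows with the same 'id' as duplicates and keep the first one
by_id = df.drop_duplicates(subset="id", keep="first")

print(n_exact, len(deduped), len(by_id))
```

The `subset` and `keep` parameters control which columns define a duplicate and which occurrence survives.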

## 6. Standardize the data:
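Standardize the numeric features so that each one has zero mean and unit variance, which puts all features on a comparable scale.

In [None]:
scaler = StandardScaler()
# fit_transform returns a NumPy array, so the column names are not kept
data_scaled = scaler.fit_transform(data)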
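
A common refinement is to fit the scaler on the training rows only and then apply it to the test rows, so that statistics from the test set do not leak into training. The sketch below illustrates this under that assumption and also wraps the result back into a DataFrame to keep the column names; the data and column names are invented for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "weight_kg": [50.0, 60.0, 65.0, 80.0, 90.0, 100.0],
})

# Hold out two rows as a test set
train, test = train_test_split(df, test_size=2, random_state=0)

scaler = StandardScaler()

# Learn mean and standard deviation from the training rows only
train_scaled = pd.DataFrame(
    scaler.fit_transform(train), columns=train.columns, index=train.index
)

# Reuse the training statistics for the test rows
test_scaled = pd.DataFrame(
    scaler.transform(test), columns=test.columns, index=test.index
)

print(train_scaled.mean())
print(train_scaled.std(ddof=0))
```

The scaled training columns have mean 0 and standard deviation 1; the test columns generally do not, because they are scaled with the training statistics.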

## 7. Normalize the data:
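Normalize the data so that each row (sample) has unit length. Note that normalize rescales rows, not columns, so it does not put features on a common scale; to rescale each feature to a fixed range, use a column-wise scaler such as MinMaxScaler instead.

In [None]:
from sklearn.preprocessing import normalize
# Scale each row to unit L2 norm
data_normalized = normalize(data, norm='l2')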
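
The difference between row-wise normalization and column-wise rescaling is easiest to see on a tiny array. This sketch compares `normalize` with `MinMaxScaler` on made-up numbers; `MinMaxScaler` is brought in here only for contrast.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# Hypothetical data: 3 samples (rows), 2 features (columns)
X = np.array([
    [3.0, 4.0],
    [6.0, 8.0],
    [0.0, 5.0],
])

# Row-wise: divide each sample by its own L2 norm
row_unit = normalize(X, norm="l2")

# Column-wise: map each feature to the range [0, 1]
col_minmax = MinMaxScaler().fit_transform(X)

print(row_unit)
print(col_minmax)
```

The first two rows differ only in magnitude, so they become identical after L2 normalization, whereas min-max scaling keeps them distinct.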

## 8. Feature selection: 
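Select the features most relevant to your model. One common approach is to rank them by the importance scores of a fitted random forest.

In [6]:
# Select features using feature importance
from sklearn.ensemble import RandomForestClassifier

# 'target' is a placeholder for the name of your label column
target = data['target']
data = data.drop(columns=['target'])

rf = RandomForestClassifier(random_state=42)
rf.fit(data, target)
# Plot the ten highest importance scores
feat_importances = pd.Series(rf.feature_importances_, index=data.columns)
feat_importances.nlargest(10).plot(kind='barh')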
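
To try the importance ranking without a dataset of your own, the sketch below runs it on synthetic data from `make_classification`. The feature names `f0` to `f5` are invented, and with `shuffle=False` the informative features are placed in the first columns, so they are expected to rank near the top.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 6 features, of which the first 2 carry the signal
X_arr, y = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=2,
    n_redundant=0,
    n_clusters_per_class=1,
    shuffle=False,
    random_state=0,
)
X = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Importance scores sum to 1; sort them from most to least important
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)

# Keep the names of the two highest-ranked features
top2 = importances.head(2).index.tolist()

print(importances)
```

Impurity-based importances like these are a quick heuristic; they can favor high-cardinality features, so treat the ranking as a starting point rather than a final answer.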

## 9. Split the data:
Finally, split the dataset into training and testing sets.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.3, random_state=42)
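
When the classes are imbalanced, a plain random split can leave the test set with very few examples of the rare class. One option, sketched below on made-up data, is to pass `stratify` so that both splits keep the original class proportions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 16 samples of class 0, 4 of class 1
X = pd.DataFrame({"feature": range(20)})
y = pd.Series([0] * 16 + [1] * 4)

# stratify=y preserves the 80/20 class ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(y_train.value_counts())
print(y_test.value_counts())
```

Fixing `random_state` makes the split reproducible, which helps when comparing models across runs.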

These are the basic steps you can follow for data cleaning in Python before training a machine learning model. The actual steps you take will depend on the specific dataset and the machine learning problem you are trying to solve. More steps may be added in the future.