# Understanding Data in Notebooks
### Assignment 2: Data Cleaning, Visualization, and Analysis
**Authors:** Chaitanya Ravindra Inamdar, Shreya Deshpande, Ajin Abraham

## 1. Abstract
In this notebook, we explore the process of data cleaning, transformation, and visualization using the Iris dataset. We demonstrate how to handle missing data, normalize features, and use various Python libraries to visualize the dataset. The aim is to understand how these processes improve the accuracy of data analysis and decision-making.
We conclude by showing how visual insights derived from cleaned and prepared data can provide actionable outcomes for further machine learning modeling.


## 2. Theory and Background
Data cleaning and visualization are central to accurate data analysis. A frequently cited estimate holds that up to 80% of the time in a data science project is spent cleaning and transforming data. This notebook demonstrates three key steps: handling missing data, scaling feature values, and visualizing relationships between variables.
We use the Iris dataset, a well-known dataset in the field of machine learning, to show how interactive environments such as Jupyter Notebook support real-time data exploration and visualization.


## 3. Problem Statement
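The Iris dataset contains 150 samples from three iris species (setosa, versicolor, and virginica), each with four measurements in centimeters: sepal length, sepal width, petal length, and petal width. The challenge is to clean the dataset by handling any missing data and to transform the values so the data is ready for further analysis. This requires: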
- Handling missing values (imputation or removal)
- Normalizing numerical features for model compatibility
- Visualizing relationships between variables
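
The two options for missing values, removal and imputation, can be compared on a small example. The sketch below uses a hypothetical toy DataFrame (the values are made up for easy arithmetic and are not real Iris rows) to show what each option does to the data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy measurements with two gaps (not real Iris rows).
toy = pd.DataFrame({
    "sepal_length": [5.0, np.nan, 6.0, 7.0],
    "petal_length": [1.5, 4.5, np.nan, 6.0],
    "species": ["setosa", "versicolor", "versicolor", "virginica"],
})

# Option 1: removal. Drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: imputation. Replace each gap with the mean of its column.
# numeric_only=True keeps the string 'species' column out of the mean.
imputed = toy.fillna(toy.mean(numeric_only=True))

print(f"Rows before: {len(toy)}, after removal: {len(dropped)}")
print(imputed)
```

Removal keeps only fully observed rows and so shrinks the dataset, while mean imputation keeps every row but pulls the filled values toward the column average. Which trade-off is acceptable depends on how much data is missing.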


In [1]:
# 4. Data Preprocessing
# Import necessary libraries
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load the Iris dataset
iris = sns.load_dataset('iris')

# Check for missing values
print("Missing values per column:")
print(iris.isnull().sum())

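# Fill missing values (if any) with each numeric column's mean.
# numeric_only=True skips the string 'species' column, which would
# otherwise raise a TypeError in pandas 2.x.
iris = iris.fillna(iris.mean(numeric_only=True))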

# Standardize the numeric features (excluding the 'species' column)
features = iris.drop(columns='species')
scaler = StandardScaler()
iris_scaled = pd.DataFrame(scaler.fit_transform(features), columns=features.columns)

# Display the standardized data
iris_scaled.head()

### Explanation:
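- We first check for missing values with `isnull().sum()` and fill any gaps with the column mean using `fillna`. The copy of Iris that ships with seaborn has no missing values, so this step acts as a safeguard here.
- We then standardize the numerical features using `StandardScaler`, which rescales each column to a mean of 0 and a standard deviation of 1. This is often loosely called normalization, although that term can also refer to min-max scaling to the range [0, 1].
- Scaling matters for algorithms that are sensitive to feature magnitudes, such as k-nearest neighbors, support vector machines, and models trained with gradient descent.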
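
To make the transformation concrete, the sketch below applies the standardization formula z = (x - mean) / std by hand with NumPy. The input values are hypothetical and chosen for easy arithmetic. The code uses the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical measurements in cm, chosen for easy arithmetic.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

mean = x.mean()   # 3.0
std = x.std()     # population standard deviation (ddof=0), sqrt(2) here

# Standardize: subtract the mean, then divide by the standard deviation.
z = (x - mean) / std

print(z)
print(f"mean of z: {z.mean():.6f}, std of z: {z.std():.6f}")
```

One consequence worth knowing: pandas computes the sample standard deviation (`ddof=1`) by default, so calling `.std()` on a standardized DataFrame column reports a value slightly above 1 rather than exactly 1.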


In [2]:
# 5. Data Analysis
# Pair plot to visualize feature relationships
sns.pairplot(iris, hue='species')
plt.show()

### Results and Data Analysis:
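- The pair plot draws a scatter plot for every pair of features, with each feature's distribution on the diagonal, and colors the points by species. Note that it uses the original `iris` DataFrame rather than `iris_scaled`, so the axes are in the original units (centimeters).
- It helps us see which features best distinguish the species. For Iris, the petal measurements typically separate setosa cleanly from the other two species, while versicolor and virginica overlap to some degree.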
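
The visual impression from the pair plot can be checked numerically. The sketch below is one possible way to do this, not part of the original analysis: it loads the copy of Iris bundled with scikit-learn (an assumption made so the example runs without a download; its column names, such as `petal length (cm)`, differ from seaborn's) and compares the petal length range of each species.

```python
from sklearn.datasets import load_iris

# The scikit-learn copy of Iris ships with the library (no download needed).
data = load_iris(as_frame=True)
frame = data.frame.copy()

# Replace the integer target codes with readable species names.
frame["species"] = data.target_names[frame["target"].to_numpy()]

# Minimum and maximum petal length per species.
petal = frame.groupby("species")["petal length (cm)"]
print(petal.agg(["min", "max"]))

# If the largest setosa petal is shorter than the smallest petal of any
# other species, a single threshold on this feature isolates setosa.
setosa_max = petal.max()["setosa"]
others_min = petal.min().drop("setosa").min()
print("setosa separable on petal length:", setosa_max < others_min)
```

A summary like this complements the plot: the plot suggests which features to look at, and the per-species ranges show whether the separation holds for every sample.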


## 6. Conclusion
In this notebook, we walked through the process of cleaning, normalizing, and visualizing the Iris dataset. We explored how important it is to handle missing data and ensure that all features are scaled correctly. Visualization plays a key role in understanding the relationships between different variables, as demonstrated by the pair plot. This process ensures the data is ready for machine learning models, where the relationships between variables can be leveraged for predictive modeling.
### Future Work:
Further steps could include building machine learning models on the cleaned data, such as classification models to predict the species of an iris flower based on its features.
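
As a starting point for that future work, here is a minimal sketch of a species classifier. The choice of k-nearest neighbors, the 70/30 split, and `random_state=0` are illustrative assumptions rather than anything this notebook has evaluated, and the data comes from the copy of Iris bundled with scikit-learn so the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features (four measurements) and integer species labels.
X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples for testing, keeping species proportions equal.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training data only and then reuses
# those statistics for the test data.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```

Wrapping the scaler and the classifier in one pipeline avoids leaking test-set statistics into training, which would happen if the whole dataset were scaled before splitting.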

## 7. References
- Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. *Annals of Eugenics*, 7(2), 179-188.
- Python Pandas Documentation: https://pandas.pydata.org/
- Seaborn Documentation: https://seaborn.pydata.org/
- Scikit-learn Documentation: https://scikit-learn.org/
- Matplotlib Documentation: https://matplotlib.org/