# Activity - Download open source dataset
Use the attached dataset to perform initial data cleaning, then calculate the correlation between features. Dataset link: https://archive.ics.uci.edu/dataset/53/iris

In [1]:
# Install seaborn if not already installed
!pip install seaborn



In [2]:
# Import necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
# Load the dataset
column_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris_data = pd.read_csv('/Users/jc/projects/coding_exercise_python/MSE803/week6/bezdekIris.data', header=None, names=column_names)

# Display the first few rows of the dataset
print(iris_data.head())

   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa


In [13]:
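# Task 1: Find the number and mean of missing data
# Drop rows in which every value is missing. With inplace=True, dropna() modifies
# iris_data directly and returns None, so iris_X is None afterwards.
iris_X = iris_data.dropna(how="all", inplace=True)
print('Number of missing data :')
print(iris_data.isnull().sum()) # There is no missing data.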

Number of missing data :
sepal_length    0
sepal_width     0
petal_length    0
petal_width     0
species         0
dtype: int64
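
Task 1 asks for both the number and the mean of missing data, while the cell above prints only the count. A minimal sketch of reporting both per column follows. It assumes "mean" refers to the fraction of missing entries in each column; the helper name `missing_summary` and the sample rows are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and fraction of missing values for each column."""
    is_missing = df.isnull()
    return pd.DataFrame({
        "missing_count": is_missing.sum(),      # number of missing cells per column
        "missing_fraction": is_missing.mean(),  # share of rows that are missing
    })


# Hypothetical sample with two gaps, used only to illustrate the output shape.
sample = pd.DataFrame({
    "sepal_length": [5.1, None, 4.7, 4.6],
    "sepal_width": [3.5, 3.0, None, 3.1],
    "species": ["Iris-setosa"] * 4,
})
summary = missing_summary(sample)
print(summary)
```

Calling `missing_summary(iris_data)` on the loaded dataset should report zero in both columns for every feature, consistent with the output above.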


```markdown
# Iris Dataset Analysis

This Jupyter Notebook provides an analysis of the Iris dataset. The dataset contains measurements of sepal length, sepal width, petal length, and petal width for three species of iris flowers: Iris-setosa, Iris-versicolor, and Iris-virginica.

## Dataset

The dataset can be downloaded from the UCI Machine Learning Repository: [Iris Dataset](https://archive.ics.uci.edu/dataset/53/iris).

## Workflow

1. **Download and Load the Dataset**
    - The dataset is downloaded and loaded into a pandas DataFrame.
    - The dataset contains the following columns: `sepal_length`, `sepal_width`, `petal_length`, `petal_width`, and `species`.

2. **Data Cleaning**
    - Rows in which every value is missing are dropped, and the remaining missing values are counted per column. In this case, the dataset has no missing data.

3. **Data Analysis**
    - The pairwise correlation between the four numeric features is calculated and stored in `correlation_matrix`.

## Variables

- `cleaned_data`: NoneType, Value: None
- `column_names`: list, Value: `['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']`
- `correlation_matrix`: pandas DataFrame, Value:
  ```
                sepal_length  sepal_width  petal_length  petal_width
  sepal_length      1.000000    -0.117570      0.871754     0.817941
  sepal_width      -0.117570     1.000000     -0.428440    -0.366126
  petal_length      0.871754    -0.428440      1.000000     0.962865
  petal_width       0.817941    -0.366126      0.962865     1.000000
  ```
- `iris_X`: NoneType, Value: None (`dropna(..., inplace=True)` modifies `iris_data` in place and returns `None`)
- `iris_data`: pandas DataFrame, Value: DataFrame containing the Iris dataset
- `iris_df`: pandas DataFrame, Value: DataFrame containing the Iris dataset

## Installation

To run this notebook, you need to have the following libraries installed:

- pandas
- seaborn
- matplotlib

You can install the required libraries using the following command:

```python
!pip install pandas seaborn matplotlib
```

## Usage

To use this notebook, follow these steps:

1. Download the Iris dataset from the UCI Machine Learning Repository.
2. Load the dataset into a pandas DataFrame.
3. Perform data cleaning and analysis as described in the workflow.

## Conclusion

This notebook covers initial data cleaning and correlation analysis of the Iris dataset. The correlation matrix shows, for example, that petal length and petal width are strongly correlated (0.96), while sepal width is only weakly negatively correlated with sepal length (-0.12).
```
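
The correlation step described in the workflow is not shown in the cells above. One way it could be done is sketched below, assuming Pearson correlation over the numeric columns only; the original notebook may have used a different call. The helper name `feature_correlation` and the sample rows are hypothetical and are not real Iris measurements.

```python
import pandas as pd


def feature_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Pearson correlation between the numeric columns of df."""
    # Non-numeric columns such as 'species' are excluded before correlating.
    numeric = df.select_dtypes(include="number")
    return numeric.corr(method="pearson")


# Hypothetical rows for illustration only; not the real Iris measurements.
sample = pd.DataFrame({
    "sepal_length": [5.0, 6.0, 7.0, 6.5],
    "petal_length": [1.5, 4.0, 6.0, 5.0],
    "species": ["a", "b", "c", "c"],
})
corr = feature_correlation(sample)
print(corr.round(3))
```

With the real data, `correlation_matrix = feature_correlation(iris_data)` would be the equivalent call, and the result could be passed to `sns.heatmap(correlation_matrix, annot=True)` for a visual summary.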