# Iris Dataset Classification with Logistic Regression



This notebook performs data analysis and logistic regression modeling on the Iris dataset. It starts with data loading, exploration, and visualization, followed by model training and evaluation.



## Table of Contents

1. [Importing Necessary Tools](#importing-necessary-tools)
2. [Loading the Iris Data](#loading-the-iris-data)
3. [Exploratory Data Analysis](#exploratory-data-analysis)
4. [Visualizing Iris Features](#visualizing-iris-features)
5. [Data Preparation](#data-preparation)
6. [Model Training](#model-training)
7. [Model Evaluation](#model-evaluation)


## Importing Necessary Tools

This section sets up the environment by importing the required libraries for our analysis.  We'll be using:

*   **pandas:** For data manipulation and analysis.  It provides data structures like DataFrames that make working with tabular data easier.
*   **seaborn and matplotlib.pyplot:**  These libraries are used for data visualization. Seaborn builds on top of matplotlib to create statistically informative and visually appealing plots.
*   **scikit-learn:** A machine learning library. We import a function for splitting data (`train_test_split`), a class for building a logistic regression model (`LogisticRegression`), and a function for evaluating model performance (`accuracy_score`).

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score


---


## Loading the Iris Data

This section loads the Iris dataset from a CSV file named `iris.csv` into a pandas DataFrame called `df`. All later exploration and modeling steps operate on this DataFrame.

In [None]:
df = pd.read_csv('iris.csv')
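
Later cells rely on the columns `sepal_length` and `species`, so a quick check right after loading can make a mismatched CSV fail early. The sketch below is one way to do that. The inline CSV text, the `demo_df` name, and the `expected_columns` set are assumptions for illustration; in particular, `sepal_width`, `petal_length`, and `petal_width` are guessed from the naming pattern and are not confirmed by the notebook.

```python
import io

import pandas as pd

# Assumption: iris.csv uses these column names; only sepal_length and species
# are confirmed by later cells, the rest follow the same naming pattern.
expected_columns = {"sepal_length", "sepal_width", "petal_length", "petal_width", "species"}

# Made-up two-row CSV so the sketch runs without the real file.
csv_text = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
"""
demo_df = pd.read_csv(io.StringIO(csv_text))

# Fail early with a clear message if any expected column is absent.
missing = expected_columns - set(demo_df.columns)
if missing:
    raise ValueError(f"Missing columns: {sorted(missing)}")

print(demo_df.shape)
```

On the real data, the same check would be applied to `df` instead of `demo_df`.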


---


## Exploratory Data Analysis

This section examines the Iris dataset to understand its structure and characteristics. First, `df.head()` shows the first five rows, giving a peek at the actual values. Then, `df.info()` summarizes the dataset, including each column's data type and non-null count, which reveals any missing values. Finally, `df.describe()` computes descriptive statistics such as the mean, standard deviation, and quartiles for the numerical columns. Together, these methods provide a foundational understanding of the dataset before we move on to visualization and modeling.

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()
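
Two further checks often complement this overview: the number of rows per class and the number of missing values per column. The sketch below uses a small made-up DataFrame (`sample`, a hypothetical stand-in for `df`) so that it runs without `iris.csv`; on the real data the same calls would be applied to `df`.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are made up for illustration.
sample = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4, 6.3, None],
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
})

# Rows per class: a heavily imbalanced target would make plain accuracy misleading.
class_counts = sample["species"].value_counts()

# Missing values per column: non-zero counts need handling before model fitting.
missing_counts = sample.isna().sum()

print(class_counts)
print(missing_counts)
```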


---


## Visualizing Iris Features

This section explores the relationships between different Iris flower features and how these relate to the species.  We use visual tools to gain insights into these relationships.

*   **Pairplots:**  A pairplot helps visualize the relationships between all possible pairs of numerical features.  By coloring the points according to the Iris species, we can visually identify patterns and correlations specific to each species. This allows us to see how features might cluster or separate different species.
*   **Boxplots:** We use a boxplot to compare the distribution of sepal length for each Iris species. This visualization helps us understand the typical range of sepal lengths and identify any significant differences between the species based on this feature.  Boxplots can reveal variations and potential outliers within each species group.

In [None]:
sns.pairplot(df, hue='species')

In [None]:
sns.boxplot(x='species', y='sepal_length', data=df)


---


## Data Preparation

This section prepares the data for the machine learning model. We separate the dataset into features (`X`) and the target variable (`y`), the Iris species we aim to predict. We then split the data into training and testing sets, holding out 20% of the rows (`test_size=0.2`) so the model can be evaluated on data it did not see during training. This held-out evaluation shows whether the model has learned general patterns rather than memorized the training rows. Fixing `random_state` makes the split reproducible.

In [None]:
X = df.drop('species', axis=1)
y = df['species']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=45)
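
With a small test set, a purely random split can leave the species unevenly represented. One option, not used in the cells above, is the `stratify` argument of `train_test_split`, which keeps class proportions the same in both subsets. The sketch below assumes scikit-learn's bundled copy of the Iris data (`load_iris`) in place of `iris.csv`, and the variable names (`X_demo`, `y_demo`, `X_tr`, and so on) are illustrative only.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled Iris data stands in for iris.csv.
X_demo, y_demo = load_iris(return_X_y=True)

# stratify=y_demo keeps the class proportions equal in the train and test subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=45, stratify=y_demo
)

# Count how many samples of each class ended up in the test subset.
test_counts = Counter(y_te.tolist())
print(test_counts)
```

In the notebook itself, the equivalent change would be to pass `stratify=y` in the existing call.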


---


## Model Training

This section trains a logistic regression model to predict the Iris species. Logistic regression is a natural choice because the target variable, `species`, is categorical. The model learns how the features (sepal length, sepal width, petal length, and petal width) relate to the species. Calling `fit` estimates the model's coefficients by minimizing a regularized log loss on the training data. The trained model is then used in the next section to make predictions on the unseen test data.

In [None]:
model = LogisticRegression()

In [None]:
model.fit(X_train, y_train)
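
For a multi-class target such as `species`, logistic regression with the default settings of recent scikit-learn versions turns one linear score per class into probabilities with the softmax function and predicts the class with the highest probability. The sketch below illustrates that step in plain Python. The weights, intercepts, and the `predict_proba_sketch` helper are made up for illustration; they are not parameters learned from the Iris data.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba_sketch(features, weights, intercepts):
    """Compute one linear score per class, then apply softmax."""
    scores = [
        sum(w * x for w, x in zip(class_weights, features)) + b
        for class_weights, b in zip(weights, intercepts)
    ]
    return softmax(scores)

# Hypothetical parameters: one weight row and one intercept per class.
classes = ["setosa", "versicolor", "virginica"]
weights = [
    [0.4, 0.9, -2.2, -1.0],
    [0.5, -0.3, -0.2, -0.8],
    [-0.9, -0.6, 2.4, 1.8],
]
intercepts = [9.0, 2.0, -11.0]

# Feature order: sepal length, sepal width, petal length, petal width.
flower = [5.1, 3.5, 1.4, 0.2]
probs = predict_proba_sketch(flower, weights, intercepts)
predicted = classes[probs.index(max(probs))]
print(predicted, [round(p, 3) for p in probs])
```

After fitting, the learned counterparts of these values are available as `model.coef_` and `model.intercept_`, and `model.predict_proba` returns the per-class probabilities.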


---


## Model Evaluation

Having trained our logistic regression model, we now assess how well it predicts the held-out test data, which indicates how well the model generalizes to new data. We use `accuracy_score`, which returns the fraction of test samples classified correctly, a value between 0 and 1 (multiply by 100 for a percentage). The score is then printed.

In [None]:
predictions = model.predict(X_test)

In [None]:
accuracy = accuracy_score(y_test, predictions)
print(accuracy)
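
Accuracy is the share of test samples whose predicted label matches the true label, and a tally of the mismatches shows which species get confused with each other. The sketch below computes both in plain Python. The label lists are made up for illustration and are not output from the model above.

```python
from collections import Counter

# Hypothetical labels, made up for illustration.
true_labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
pred_labels = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

# Accuracy: matching pairs divided by the total number of samples (a value in [0, 1]).
correct = sum(t == p for t, p in zip(true_labels, pred_labels))
accuracy_sketch = correct / len(true_labels)

# Tally (true, predicted) pairs; entries with differing labels are the errors.
confusion = Counter(zip(true_labels, pred_labels))
errors = {pair: n for pair, n in confusion.items() if pair[0] != pair[1]}

print(f"accuracy: {accuracy_sketch:.3f}")
print(errors)
```

For the real predictions, `confusion_matrix` and `classification_report` from `sklearn.metrics` give the same kind of per-class breakdown.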