
**Exploratory Data Analysis (EDA):**

Exploratory Data Analysis (EDA) is a crucial initial step in data science projects. It involves analyzing and visualizing data sets to understand their key characteristics, uncover patterns, locate outliers, and identify relationships between variables. EDA is normally carried out as a preliminary step before more formal statistical analysis or modeling.

**Key aspects of EDA include:**

**Distribution of Data:** Examining the distribution of data points to understand their range, central tendencies (mean, median), and dispersion (variance, standard deviation).

**Graphical Representations:** Utilizing charts such as histograms, box plots, scatter plots, and bar charts to visualize relationships within the data and distributions of variables.

**Outlier Detection:** Identifying unusual values that deviate from other data points. Outliers can influence statistical analyses and might indicate data entry errors or unique cases.

**Correlation Analysis:** Checking the relationships between variables to understand how they might affect each other. This includes computing correlation coefficients and creating correlation matrices.

**Handling Missing Values:** Detecting and deciding how to address missing data points, whether by imputation or removal, depending on their impact and the amount of missing data.

**Summary Statistics:** Calculating key statistics (count, mean, standard deviation, minimum, maximum, and quartiles) that summarize each variable.

**Testing Assumptions:** Many statistical tests and models assume the data meet certain conditions (such as normality or homoscedasticity). EDA helps verify these assumptions.
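As a small illustration of the distribution and summary-statistic checks listed above, the sketch below computes central tendency, dispersion, and skewness for a single variable. The values in `toy` are made up for illustration and do not come from the lab dataset.

```python
import pandas as pd

# Toy values (invented for illustration): mostly small, with one large value
toy = pd.Series([1, 2, 2, 3, 3, 3, 4, 20], name="toy_feature")

# Central tendency, dispersion, and asymmetry of the distribution
summary = {
    "mean": toy.mean(),      # sensitive to the large value
    "median": toy.median(),  # robust to the large value
    "std": toy.std(),        # sample standard deviation
    "skew": toy.skew(),      # > 0 indicates a longer right tail
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

A mean well above the median, together with positive skewness, suggests a long right tail, which is worth confirming with a histogram or box plot.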

### **12. Exploratory Data Analysis for Classification using Pandas or Matplotlib.**

In [1]:
import pandas as pd

# Load the dataset
data = pd.read_csv('/content/Lab12.csv')

In [2]:
# Get summary statistics for numerical columns
print(data.describe())


                  ID  CREDIT_SCORE  VEHICLE_OWNERSHIP       MARRIED  \
count   69635.000000  69635.000000       69635.000000  69635.000000   
mean   395171.733711      0.601651           0.827759      0.583816   
std    279843.644596      0.137901           0.377592      0.492928   
min       101.000000      0.074401           0.000000      0.000000   
25%    156522.500000      0.514441           1.000000      0.000000   
50%    354344.000000      0.600581           1.000000      1.000000   
75%    599642.500000      0.702655           1.000000      1.000000   
max    999976.000000      0.954075           1.000000      1.000000   

           CHILDREN   POSTAL_CODE  ANNUAL_MILEAGE  SPEEDING_VIOLATIONS  \
count  69635.000000  69634.000000    69634.000000         69634.000000   
mean       0.520256  18060.209553    11052.617974             0.674814   
std        0.499593  16770.268108     2974.245981             1.383240   
min        0.000000  10238.000000     2000.000000             0.

In [3]:
# Display all column names in the dataset
print(data.columns)


Index(['ID', 'AGE', 'GENDER', 'DRIVING_EXPERIENCE', 'EDUCATION', 'INCOME',
       'CREDIT_SCORE', 'VEHICLE_OWNERSHIP', 'VEHICLE_YEAR', 'MARRIED',
       'CHILDREN', 'POSTAL_CODE', 'ANNUAL_MILEAGE', 'SPEEDING_VIOLATIONS',
       'DUIS', 'PAST_ACCIDENTS', 'OUTCOME', 'TYPE_OF_VEHICLE'],
      dtype='object')


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('/content/Lab12.csv')

# Set a visual style for the plots
sns.set(style="whitegrid")

# 1. Basic Information and Summary Statistics
data.info()  # Overview of dataset structure (info() prints directly and returns None)
print(data.describe())  # Summary statistics for numerical features
print(data.head())  # Peek at the first few rows of the dataset


In [None]:
# Check for missing values
print(data.isnull().sum())

# If there are missing values, you can visualize them (optional)
sns.heatmap(data.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Values in the Dataset')
plt.show()
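
Once missing values are located, one possible way to handle them is simple imputation. The sketch below is an assumption about how this could be done, not a step from the original lab: it fills numeric gaps with the column median and categorical gaps with the most frequent value. The toy rows and the vehicle-type labels are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (invented values) with one missing entry per column
toy = pd.DataFrame({
    "ANNUAL_MILEAGE": [12000.0, np.nan, 9000.0, 15000.0],
    "TYPE_OF_VEHICLE": ["Sedan", "SUV", None, "Sedan"],
})

# Share of missing values per column, useful for deciding between
# imputing and dropping
missing_share = toy.isnull().mean()
print(missing_share)

filled = toy.copy()

# Numeric columns: fill with the column median
num_cols = filled.select_dtypes(include="number").columns
filled[num_cols] = filled[num_cols].fillna(filled[num_cols].median())

# Non-numeric columns: fill with the most frequent value (mode)
cat_cols = filled.select_dtypes(exclude="number").columns
for col in cat_cols:
    filled[col] = filled[col].fillna(filled[col].mode().iloc[0])

print(filled)
```

When only a handful of rows are affected, dropping them with `dropna()` may be simpler than imputing; the right choice depends on how much data is missing and why.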


In [None]:
# Summary of numerical features
print(data.describe())

# Plot histograms of numerical features
data.hist(figsize=(15, 10), bins=30)
plt.tight_layout()
plt.show()


In [None]:
# 2. Plot the distribution of the target variable (OUTCOME)
plt.figure(figsize=(8,6))
sns.countplot(data=data, x='OUTCOME')
plt.title('Distribution of Outcome (Target Variable)')
plt.xlabel('Outcome (0 = No Claim, 1 = Claim)')
plt.ylabel('Count')
plt.show()
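
The count plot shows class balance visually; the same information can be expressed as proportions. The following is a minimal sketch on a made-up `OUTCOME` column (the values are not taken from the lab dataset), and `imbalance_ratio` is a name introduced here for illustration.

```python
import pandas as pd

# Toy target column (invented values): 0 = no claim, 1 = claim
toy_outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 0], name="OUTCOME")

# Share of each class instead of raw counts
class_share = toy_outcome.value_counts(normalize=True).sort_index()
print(class_share)

# Ratio of the largest class to the smallest class
imbalance_ratio = class_share.max() / class_share.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```

A ratio far from 1 would suggest that plain accuracy is a weak metric for a later classifier and that stratified splitting is worth considering.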

In [None]:
# 3. Plot the distribution of age groups and outcome
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='AGE', hue='OUTCOME')
plt.title('Age Distribution by Outcome')
plt.xlabel('Age Group')
plt.ylabel('Count')
plt.show()

# Plot the distribution of vehicle types and outcome
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='TYPE_OF_VEHICLE', hue='OUTCOME')
plt.title('Vehicle Type by Outcome')
plt.xlabel('Vehicle Type')
plt.ylabel('Count')
plt.show()

# Plot distribution of gender and outcome
plt.figure(figsize=(8,6))
sns.countplot(data=data, x='GENDER', hue='OUTCOME')
plt.title('Gender Distribution by Outcome')
plt.xlabel('Gender')
plt.ylabel('Count')
plt.show()
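
Raw counts per category can be hard to compare when the groups differ in size. Assuming `OUTCOME` is coded 0/1 as in the plot labels above, the mean of `OUTCOME` within each group gives the claim rate for that group. The sketch below uses made-up rows and age-group labels; `claim_rate` and `n` are column names introduced here for illustration.

```python
import pandas as pd

# Toy data (invented values and age-group labels)
toy = pd.DataFrame({
    "AGE": ["16-25", "16-25", "26-39", "26-39", "26-39", "40-64"],
    "OUTCOME": [1, 1, 0, 1, 0, 0],
})

# Mean of a 0/1 target per group = share of claims in that group;
# the group size is kept alongside so small groups are not over-read
rate = (
    toy.groupby("AGE")["OUTCOME"]
    .agg(["mean", "count"])
    .rename(columns={"mean": "claim_rate", "count": "n"})
)
print(rate)
```

The same pattern would apply to `GENDER`, `TYPE_OF_VEHICLE`, or any other categorical column by changing the `groupby` key.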


In [None]:
# Exclude non-numeric columns before calculating the correlation matrix
numeric_data = data.select_dtypes(include='number')  # Keeps every numeric dtype, not only float64/int64

# 4. Analyze the correlation between numerical features
plt.figure(figsize=(12,8))
corr = numeric_data.corr()  # Compute correlation matrix only on numeric columns
sns.heatmap(corr, annot=True, cmap="coolwarm", fmt=".2f")
plt.title('Correlation Heatmap of Numerical Features')
plt.show()
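
For classification, the column of the correlation matrix that belongs to the target is often the most useful part of the heatmap. The sketch below, on made-up numbers, ranks features by the absolute value of their Pearson correlation with `OUTCOME`; it is an illustration under those assumptions rather than a result from the lab dataset.

```python
import pandas as pd

# Toy data (invented values): credit score is built to separate the classes,
# annual mileage is not
toy = pd.DataFrame({
    "CREDIT_SCORE": [0.2, 0.3, 0.4, 0.6, 0.7, 0.8],
    "ANNUAL_MILEAGE": [12000, 9000, 11000, 10000, 12000, 9000],
    "OUTCOME": [1, 1, 1, 0, 0, 0],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["OUTCOME"]
    .drop("OUTCOME")                      # remove the trivial self-correlation
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Identifier-like columns such as `ID` or `POSTAL_CODE` are numeric but carry no ordinal meaning, so their correlations are usually best ignored or dropped before plotting.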

In [None]:
# 5. Relationship between Credit Score and Outcome
plt.figure(figsize=(8,6))
sns.histplot(data=data, x='CREDIT_SCORE', hue='OUTCOME', multiple="stack", kde=True)
plt.title('Credit Score Distribution by Outcome')
plt.xlabel('Credit Score')
plt.ylabel('Count')  # histplot's default statistic is the count per bin, not a density
plt.show()

In [None]:
# 6. Explore the distribution of vehicle types in relation to the outcome
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='TYPE_OF_VEHICLE', hue='OUTCOME')
plt.title('Vehicle Type vs Outcome')
plt.xlabel('Type of Vehicle')
plt.ylabel('Count')
plt.show()


In [None]:
# Driving Experience vs Outcome
plt.figure(figsize=(10,6))
sns.countplot(data=data, x='DRIVING_EXPERIENCE', hue='OUTCOME')
plt.title('Driving Experience vs Outcome')
plt.xlabel('Driving Experience')
plt.ylabel('Count')
plt.show()

In [None]:
# 7. Analyze the relationship between Annual Mileage and Outcome
plt.figure(figsize=(8,6))
sns.boxplot(data=data, x='OUTCOME', y='ANNUAL_MILEAGE')
plt.title('Annual Mileage vs Outcome')
plt.xlabel('Outcome')
plt.ylabel('Annual Mileage')
plt.show()
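
The points drawn beyond the whiskers of a box plot can also be listed explicitly. A minimal sketch, assuming the common 1.5 × IQR rule (Tukey fences): `iqr_outlier_mask` is a hypothetical helper written for this example, and the values in `toy` are made up.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside the Tukey fences."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)


# Toy data (invented values): one value is far from the rest
toy = pd.Series([10, 12, 11, 13, 12, 11, 95], name="toy_feature")

mask = iqr_outlier_mask(toy)
print(toy[mask])  # rows flagged as outliers
```

On the lab data, the same helper could be applied to a column such as `ANNUAL_MILEAGE`, optionally per `OUTCOME` group, to see which rows the box plot treats as outliers.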
