## Task 1: Problem Statement

**Problem Statement:**

The Pima Indians Diabetes Dataset contains medical diagnostic data from female patients of Pima Indian heritage aged 21 or older. The goal is to predict whether a patient has diabetes from eight medical attributes, including plasma glucose, blood pressure, BMI, and age.

**Objective:** Build a classification model to predict the onset of diabetes (binary classification: 0 = No Diabetes, 1 = Diabetes) using medical diagnostic measurements.

**Business Impact:** Early prediction of diabetes can help healthcare providers implement preventive measures and treatment plans, improving patient outcomes and reducing healthcare costs.

Task 1: Problem Statement

In [22]:
# Problem: Predict diabetes in PIMA Indian women using medical data
# Dataset has 8 features (Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age)
# Target: Outcome (0=No Diabetes, 1=Diabetes)

Task 2: Import Libraries

In [23]:
import pandas as pd

In [24]:
import numpy as np

In [25]:
import matplotlib.pyplot as plt

In [26]:
import seaborn as sns

Task 3: Load Dataset

In [27]:
df=pd.read_csv("diabetes.csv")

Task 4: Inspect Dataset

In [28]:
# Shape of dataset
df.shape

(768, 9)

In [29]:
# Column names
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'], dtype='object')

In [30]:
# Data types
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [31]:
# First 5 rows
df.head()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [32]:
# Last 5 rows
df.tail()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
763,10,101,76,48,180,32.9,0.171,63,0
764,2,122,70,27,0,36.8,0.34,27,0
765,5,121,72,23,112,26.2,0.245,30,0
766,1,126,60,0,0,30.1,0.349,47,1
767,1,93,70,31,0,30.4,0.315,23,0


In [33]:
# Dataset info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [34]:
# Statistical summary
df.describe()

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
count,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0,768.0
mean,3.845052,120.894531,69.105469,20.536458,79.799479,31.992578,0.471876,33.240885,0.348958
std,3.369578,31.972618,19.355807,15.952218,115.244002,7.88416,0.331329,11.760232,0.476951
min,0.0,0.0,0.0,0.0,0.0,0.0,0.078,21.0,0.0
25%,1.0,99.0,62.0,0.0,0.0,27.3,0.24375,24.0,0.0
50%,3.0,117.0,72.0,23.0,30.5,32.0,0.3725,29.0,0.0
75%,6.0,140.25,80.0,32.0,127.25,36.6,0.62625,41.0,1.0
max,17.0,199.0,122.0,99.0,846.0,67.1,2.42,81.0,1.0


In [35]:
# Extract specific column
df['Glucose']

0      148
1       85
2      183
3       89
4      137
      ... 
763    101
764    122
765    121
766    126
767     93
Name: Glucose, Length: 768, dtype: int64

In [36]:
# Extract multiple columns
df[['Age','BMI']]

,Age,BMI
0,50,33.6
1,31,26.6
2,32,23.3
3,21,28.1
4,33,43.1
...,...,...
763,63,32.9
764,27,36.8
765,30,26.2
766,47,30.1
767,23,30.4


In [37]:
# Extract rows using iloc
df.iloc[0:5]

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1


In [38]:
# Check missing values
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

In [39]:
# Check duplicates
df.duplicated().sum()

np.int64(0)

Task 5: Features and Target

In [None]:
X=df.drop('Outcome',axis=1)

In [41]:
X

,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age
0,6,148,72,35,0,33.6,0.627,50
1,1,85,66,29,0,26.6,0.351,31
2,8,183,64,0,0,23.3,0.672,32
3,1,89,66,23,94,28.1,0.167,21
4,0,137,40,35,168,43.1,2.288,33
...,...,...,...,...,...,...,...,...
763,10,101,76,48,180,32.9,0.171,63
764,2,122,70,27,0,36.8,0.340,27
765,5,121,72,23,112,26.2,0.245,30
766,1,126,60,0,0,30.1,0.349,47
767,1,93,70,31,0,30.4,0.315,23


In [42]:
# Target (y)
y=df['Outcome']

In [43]:
y

0      1
1      0
2      1
3      0
4      1
      ..
763    0
764    0
765    0
766    1
767    0
Name: Outcome, Length: 768, dtype: int64

In [44]:
# Check target distribution
y.value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

## Task 2: Import Required Libraries

In [45]:
# Import required libraries for data analysis and manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# For warnings
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

Libraries imported successfully!
Pandas version: 2.2.3
NumPy version: 2.1.3


## Task 3: Load the PIMA Indians Diabetes Dataset

The dataset has been distributed through the UCI Machine Learning Repository and Kaggle, and is widely mirrored online.  
The cell below reads a local `diabetes.csv` first and falls back to an online copy if the file is not found.

In [46]:
# Load the PIMA Indians Diabetes dataset
# Note: Update the file path based on where your dataset is stored
# Common sources:
# - Local CSV file: pd.read_csv('diabetes.csv')
# - Online: pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')

# For this example, we'll use the standard CSV format
# You may need to adjust the path based on your dataset location

try:
    # Try loading from local file first
    df = pd.read_csv('diabetes.csv')
    print("Dataset loaded from local file successfully!")
except FileNotFoundError:
    # If not found locally, load from online source
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
    column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
                    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
    df = pd.read_csv(url, names=column_names)
    print("Dataset loaded from online source successfully!")

print(f"\nDataset loaded with {len(df)} records")

Dataset loaded from local file successfully!

Dataset loaded with 768 records
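
The two sources differ in layout: the local file carries a header row, while the online copy does not and relies on `names=`. It can therefore be worth confirming that the loaded frame has the expected columns. A minimal sketch follows; `EXPECTED_COLUMNS` and `check_columns` are illustrative names, not part of the original notebook, and the two inline rows stand in for the real file.

```python
import io
import pandas as pd

# Expected schema, taken from the column list used in the loading cell.
EXPECTED_COLUMNS = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

def check_columns(frame: pd.DataFrame) -> None:
    """Raise ValueError if the frame's columns differ from the expected schema."""
    actual = frame.columns.tolist()
    if actual != EXPECTED_COLUMNS:
        raise ValueError(f"Unexpected columns: {actual}")

# Headerless data, mimicking the online copy: names must be supplied explicitly.
raw = "6,148,72,35,0,33.6,0.627,50,1\n1,85,66,29,0,26.6,0.351,31,0\n"
demo = pd.read_csv(io.StringIO(raw), names=EXPECTED_COLUMNS)
check_columns(demo)  # passes silently

# Without names=, the first data row would be misread as the header.
bad = pd.read_csv(io.StringIO(raw))
try:
    check_columns(bad)
    header_ok = True
except ValueError:
    header_ok = False
print(header_ok)
```

Calling such a check right after loading would catch a headerless file read without `names=`, which otherwise silently drops the first record.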


## Task 4: Inspect the Dataset

### 4.1 Shape of Dataset

In [47]:
# Display the shape of the dataset
print("=" * 60)
print("DATASET SHAPE")
print("=" * 60)
print(f"Number of rows (samples): {df.shape[0]}")
print(f"Number of columns (features + target): {df.shape[1]}")
print(f"Total data points: {df.shape[0] * df.shape[1]}")

DATASET SHAPE
Number of rows (samples): 768
Number of columns (features + target): 9
Total data points: 6912


### 4.2 Column Names and Description

In [48]:
# Display column names
print("=" * 60)
print("COLUMN NAMES")
print("=" * 60)
print(df.columns.tolist())
print("\n")

# Column descriptions
print("=" * 60)
print("COLUMN DESCRIPTIONS")
print("=" * 60)
column_descriptions = {
    'Pregnancies': 'Number of times pregnant',
    'Glucose': 'Plasma glucose concentration (2 hours in an oral glucose tolerance test)',
    'BloodPressure': 'Diastolic blood pressure (mm Hg)',
    'SkinThickness': 'Triceps skin fold thickness (mm)',
    'Insulin': '2-Hour serum insulin (mu U/ml)',
    'BMI': 'Body mass index (weight in kg/(height in m)^2)',
    'DiabetesPedigreeFunction': 'Diabetes pedigree function (genetic influence)',
    'Age': 'Age in years',
    'Outcome': 'Target variable (0 = No diabetes, 1 = Diabetes)'
}

for col, desc in column_descriptions.items():
    print(f"{col:25s}: {desc}")

COLUMN NAMES
['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']


COLUMN DESCRIPTIONS
Pregnancies              : Number of times pregnant
Glucose                  : Plasma glucose concentration (2 hours in an oral glucose tolerance test)
BloodPressure            : Diastolic blood pressure (mm Hg)
SkinThickness            : Triceps skin fold thickness (mm)
Insulin                  : 2-Hour serum insulin (mu U/ml)
BMI                      : Body mass index (weight in kg/(height in m)^2)
DiabetesPedigreeFunction : Diabetes pedigree function (genetic influence)
Age                      : Age in years
Outcome                  : Target variable (0 = No diabetes, 1 = Diabetes)


### 4.3 Data Types

In [49]:
# Display data types of each column
print("=" * 60)
print("DATA TYPES")
print("=" * 60)
print(df.dtypes)
print("\n")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

DATA TYPES
Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object


Memory usage: 54.13 KB
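
The reported figure can be checked by hand. Every column is a 64-bit type, so each value takes 8 bytes, and 768 × 9 × 8 = 55,296 bytes, or 54.00 KB. The remaining fraction of a kilobyte is the index, whose size may vary across pandas versions. A small sketch using a zero-filled stand-in frame with the same shape and dtypes (the stand-in data is an assumption made to keep the example self-contained):

```python
import numpy as np
import pandas as pd

int_cols = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
            'Insulin', 'Age', 'Outcome']
float_cols = ['BMI', 'DiabetesPedigreeFunction']

# Stand-in frame: same shape and dtypes as the dataset, values are placeholders.
data = {c: np.zeros(768, dtype='int64') for c in int_cols}
data.update({c: np.zeros(768, dtype='float64') for c in float_cols})
standin = pd.DataFrame(data)

# Column data only (index excluded): rows * columns * 8 bytes.
column_bytes = standin.memory_usage(index=False).sum()
print(column_bytes, 768 * 9 * 8)
print(f"{column_bytes / 1024:.2f} KB")
```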


### 4.4 Basic DataFrame Functions

In [50]:
# Display first few rows
print("=" * 60)
print("FIRST 5 ROWS (head)")
print("=" * 60)
print(df.head())

FIRST 5 ROWS (head)
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
0            6      148             72             35        0  33.6                     0.627   50        1
1            1       85             66             29        0  26.6                     0.351   31        0
2            8      183             64              0        0  23.3                     0.672   32        1
3            1       89             66             23       94  28.1                     0.167   21        0
4            0      137             40             35      168  43.1                     2.288   33        1


In [51]:
# Display last few rows
print("=" * 60)
print("LAST 5 ROWS (tail)")
print("=" * 60)
print(df.tail())

LAST 5 ROWS (tail)
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
763           10      101             76             48      180  32.9                     0.171   63        0
764            2      122             70             27        0  36.8                     0.340   27        0
765            5      121             72             23      112  26.2                     0.245   30        0
766            1      126             60              0        0  30.1                     0.349   47        1
767            1       93             70             31        0  30.4                     0.315   23        0


In [52]:
# Display random sample of rows
print("=" * 60)
print("RANDOM SAMPLE (5 rows)")
print("=" * 60)
print(df.sample(5))

RANDOM SAMPLE (5 rows)
     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
758            1      106             76              0        0  37.5                     0.197   26        0
560            6      125             76              0        0  33.8                     0.121   54        1
474            4      114             64              0        0  28.9                     0.126   24        0
0              6      148             72             35        0  33.6                     0.627   50        1
327           10      179             70              0        0  35.1                     0.200   37        0
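
`df.sample(5)` draws different rows on each run, so the output above will not match a rerun. If reproducible output matters, `sample` accepts a `random_state` argument. A brief sketch on a small stand-in frame (the seed 42 is arbitrary):

```python
import pandas as pd

# Stand-in frame; in the notebook this would be the loaded df.
demo = pd.DataFrame({'Glucose': [148, 85, 183, 89, 137, 116, 78, 115, 197, 125]})

# Fixing random_state makes the draw repeatable across calls and runs.
first = demo.sample(5, random_state=42)
second = demo.sample(5, random_state=42)
print(first.equals(second))
```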


In [53]:
# Display detailed information about the dataset
print("=" * 60)
print("DATASET INFO")
print("=" * 60)
df.info()

DATASET INFO
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [54]:
# Display statistical summary
print("=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
print(df.describe())

STATISTICAL SUMMARY
       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin         BMI  DiabetesPedigreeFunction         Age     Outcome
count   768.000000  768.000000     768.000000     768.000000  768.000000  768.000000                768.000000  768.000000  768.000000
mean      3.845052  120.894531      69.105469      20.536458   79.799479   31.992578                  0.471876   33.240885    0.348958
std       3.369578   31.972618      19.355807      15.952218  115.244002    7.884160                  0.331329   11.760232    0.476951
min       0.000000    0.000000       0.000000       0.000000    0.000000    0.000000                  0.078000   21.000000    0.000000
25%       1.000000   99.000000      62.000000       0.000000    0.000000   27.300000                  0.243750   24.000000    0.000000
50%       3.000000  117.000000      72.000000      23.000000   30.500000   32.000000                  0.372500   29.000000    0.000000
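75%       6.000000  140.250000      80.000000      32.000000  127.250000   36.600000                  0.626250   41.000000    1.000000
max      17.000000  199.000000     122.000000      99.000000  846.000000   67.100000                  2.420000   81.000000    1.000000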

In [55]:
# Extract specific columns
print("=" * 60)
print("EXTRACTING SPECIFIC COLUMNS")
print("=" * 60)
print("\nGlucose column (first 10 values):")
print(df['Glucose'].head(10))

print("\nMultiple columns - Age and BMI (first 10 rows):")
print(df[['Age', 'BMI']].head(10))

EXTRACTING SPECIFIC COLUMNS

Glucose column (first 10 values):
0    148
1     85
2    183
3     89
4    137
5    116
6     78
7    115
8    197
9    125
Name: Glucose, dtype: int64

Multiple columns - Age and BMI (first 10 rows):
   Age   BMI
0   50  33.6
1   31  26.6
2   32  23.3
3   21  28.1
4   33  43.1
5   30  25.6
6   26  31.0
7   29  35.3
8   53  30.5
9   54   0.0


In [56]:
# Extract specific rows
print("=" * 60)
print("EXTRACTING SPECIFIC ROWS")
print("=" * 60)
print("\nRow at index 0:")
print(df.iloc[0])

print("\nRows 10 to 14:")
print(df.iloc[10:15])

print("\nRows where Age > 50 (first 10):")
print(df[df['Age'] > 50].head(10))

EXTRACTING SPECIFIC ROWS

Row at index 0:
Pregnancies                   6.000
Glucose                     148.000
BloodPressure                72.000
SkinThickness                35.000
Insulin                       0.000
BMI                          33.600
DiabetesPedigreeFunction      0.627
Age                          50.000
Outcome                       1.000
Name: 0, dtype: float64

Rows 10 to 14:
    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome
10            4      110             92              0        0  37.6                     0.191   30        0
11           10      168             74              0        0  38.0                     0.537   34        1
12           10      139             80              0        0  27.1                     1.441   57        0
13            1      189             60             23      846  30.1                     0.398   59        1
14            5      166             72     
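
In the output for row 0, integer fields such as `Pregnancies` print as `6.000`. This is expected: selecting one row with `iloc` returns a Series, and a Series has a single dtype, so pandas upcasts the mixed integer and float values to `float64`. A small illustration with two of the columns:

```python
import pandas as pd

demo = pd.DataFrame({'Pregnancies': [6, 1], 'BMI': [33.6, 26.6]})

print(demo.dtypes)          # per-column dtypes: int64 and float64
row = demo.iloc[0]          # one row as a Series with a single common dtype
print(row.dtype)

# Slicing with a range keeps a DataFrame, so per-column dtypes are preserved.
row_frame = demo.iloc[0:1]
print(row_frame.dtypes)
```

Using a slice such as `iloc[0:1]` is one way to inspect a single record without the dtype change.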

In [57]:
# Check for missing values
print("=" * 60)
print("MISSING VALUES CHECK")
print("=" * 60)
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

MISSING VALUES CHECK
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

Total missing values: 0
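
A count of zero nulls does not necessarily mean the data is complete. The statistical summary shows a minimum of 0 for `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`, which is not physiologically plausible, so these zeros most likely stand in for missing measurements. One way to expose them is sketched here on the first five rows of the dataset. Treating zero as missing in exactly these five columns is an assumption, not something the dataset file states.

```python
import numpy as np
import pandas as pd

# First five rows of the dataset, as shown by df.head().
demo = pd.DataFrame({
    'Glucose':       [148, 85, 183, 89, 137],
    'BloodPressure': [72, 66, 64, 66, 40],
    'SkinThickness': [35, 29, 0, 23, 35],
    'Insulin':       [0, 0, 0, 94, 168],
    'BMI':           [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns where a zero is assumed to mean "not recorded".
zero_as_missing = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Count zeros per column before changing anything.
zero_counts = (demo[zero_as_missing] == 0).sum()
print(zero_counts)

# Work on a copy so the original frame stays untouched.
cleaned = demo.copy()
for col in zero_as_missing:
    cleaned[col] = cleaned[col].replace(0, np.nan)
print(cleaned.isnull().sum())
```

After this step `isnull().sum()` reports the hidden gaps, which can then be handled explicitly (for example by imputation) rather than being treated as real measurements.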


In [58]:
# Check for duplicate rows
print("=" * 60)
print("DUPLICATE ROWS CHECK")
print("=" * 60)
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

DUPLICATE ROWS CHECK
Number of duplicate rows: 0


## Task 5: Identify Features (X) and Target (y)

The dataset contains:
- **Features (X):** All columns except 'Outcome' - these are the independent variables used for prediction
- **Target (y):** 'Outcome' column - this is the dependent variable we want to predict (0 or 1)
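
A few quick checks can guard against the most common mistakes at this step, such as leaving the target inside the feature matrix. A sketch on a two-row stand-in frame (`split_features_target` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

def split_features_target(frame: pd.DataFrame, target: str = 'Outcome'):
    """Return (X, y) and verify that the split is consistent."""
    X = frame.drop(columns=target)   # returns a new frame; `frame` is unchanged
    y = frame[target]
    assert target not in X.columns, "target leaked into the features"
    assert X.index.equals(y.index), "X and y rows are misaligned"
    assert set(y.unique()) <= {0, 1}, "target is not binary 0/1"
    return X, y

demo = pd.DataFrame({'Glucose': [148, 85], 'BMI': [33.6, 26.6], 'Outcome': [1, 0]})
X_demo, y_demo = split_features_target(demo)
print(X_demo.columns.tolist(), y_demo.tolist())
```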

In [59]:
# Identify Features (X) and Target (y)
print("=" * 60)
print("FEATURES AND TARGET IDENTIFICATION")
print("=" * 60)

# Features (X) - all columns except the target
X = df.drop('Outcome', axis=1)
print("Features (X):")
print(f"  Columns: {X.columns.tolist()}")
print(f"  Shape: {X.shape}")
print(f"  Number of features: {X.shape[1]}")

print("\n" + "-" * 60 + "\n")

# Target (y) - the outcome column
y = df['Outcome']
print("Target (y):")
print("  Column: 'Outcome'")
print(f"  Shape: {y.shape}")
print(f"  Unique values: {y.unique()}")
print("  Value counts:")
print(y.value_counts())
print("\n  Class distribution:")
print(f"    No Diabetes (0): {(y == 0).sum()} samples ({(y == 0).sum() / len(y) * 100:.2f}%)")
print(f"    Diabetes (1): {(y == 1).sum()} samples ({(y == 1).sum() / len(y) * 100:.2f}%)")

FEATURES AND TARGET IDENTIFICATION
Features (X):
  Columns: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age']
  Shape: (768, 8)
  Number of features: 8

------------------------------------------------------------

Target (y):
  Column: 'Outcome'
  Shape: (768,)
  Unique values: [1 0]
  Value counts:
Outcome
0    500
1    268
Name: count, dtype: int64

  Class distribution:
    No Diabetes (0): 500 samples (65.10%)
    Diabetes (1): 268 samples (34.90%)
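
The 65/35 split matters when judging a model later: a classifier that always predicts the majority class would already be right about 65% of the time, so accuracy figures should be read against that floor. A sketch of the calculation, rebuilding the target from the counts reported above:

```python
import pandas as pd

# Rebuild a target Series with the reported class counts (500 zeros, 268 ones).
y_demo = pd.Series([0] * 500 + [1] * 268, name='Outcome')

# Class proportions; normalize=True returns fractions instead of counts.
proportions = y_demo.value_counts(normalize=True)

# Accuracy of always predicting the most frequent class.
majority_class = proportions.idxmax()
baseline_accuracy = proportions.max()
print(majority_class, f"{baseline_accuracy:.4f}")
```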


In [60]:
# Display the features (first 5 rows)
print("=" * 60)
print("FEATURES (X) - First 5 rows")
print("=" * 60)
print(X.head())

FEATURES (X) - First 5 rows
   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  DiabetesPedigreeFunction  Age
0            6      148             72             35        0  33.6                     0.627   50
1            1       85             66             29        0  26.6                     0.351   31
2            8      183             64              0        0  23.3                     0.672   32
3            1       89             66             23       94  28.1                     0.167   21
4            0      137             40             35      168  43.1                     2.288   33


In [61]:
# Display the target (first 20 values)
print("=" * 60)
print("TARGET (y) - First 20 values")
print("=" * 60)
print(y.head(20))

TARGET (y) - First 20 values
0     1
1     0
2     1
3     0
4     1
5     0
6     1
7     0
8     1
9     1
10    0
11    1
12    0
13    1
14    1
15    1
16    1
17    1
18    0
19    1
Name: Outcome, dtype: int64
