## Task 1: Problem Statement

**Problem Statement:**

The PIMA Indians Diabetes Dataset contains medical diagnostic data from female patients of Pima Indian heritage, aged 21 years or older. The goal is to predict whether a patient has diabetes based on various medical attributes such as glucose levels, blood pressure, BMI, age, and other health indicators.

**Objective:** Build a classification model to predict the onset of diabetes (binary classification: 0 = No Diabetes, 1 = Diabetes) using medical diagnostic measurements.

**Business Impact:** Early prediction of diabetes can help healthcare providers implement preventive measures and treatment plans, improving patient outcomes and reducing healthcare costs.

## Task 2: Import Required Libraries

In [None]:
# Import required libraries for data analysis and manipulation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set display options for better visualization
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

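# Suppress library warnings to keep notebook output readable
# (remove these two lines when debugging, since real issues are hidden too)
import warnings
warnings.filterwarnings('ignore')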

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Task 3: Load the PIMA Indians Diabetes Dataset

The dataset was originally hosted by the UCI Machine Learning Repository and is now commonly distributed via Kaggle and other public mirrors.  
We'll first try a local `diabetes.csv` and fall back to a public copy of the raw, header-less file.

In [None]:
# Load the PIMA Indians Diabetes dataset
# Note: Update the file path based on where your dataset is stored
# Common sources:
# - Local CSV file: pd.read_csv('diabetes.csv')
# - Online: pd.read_csv('https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv')

# For this example, we'll use the standard CSV format
# You may need to adjust the path based on your dataset location

try:
    # Try loading from local file first
    df = pd.read_csv('diabetes.csv')
    print("Dataset loaded from local file successfully!")
except FileNotFoundError:
    # If not found locally, load from online source
    url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv'
    column_names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 
                    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
    df = pd.read_csv(url, names=column_names)
    print("Dataset loaded from online source successfully!")

print(f"\nDataset loaded with {len(df)} records")
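
The two branches above read differently shaped files: the online copy is parsed with `names=column_names` because it has no header row, while the local file is read with pandas' default header handling. A minimal sketch of a sanity check that could run after either branch follows. `EXPECTED_COLUMNS` and `check_columns` are hypothetical names introduced here, and the two-row inline sample uses made-up values purely for illustration.

```python
import io
import pandas as pd

# Column names as used in the loading cell above
EXPECTED_COLUMNS = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness',
                    'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

def check_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are missing from `frame` (hypothetical helper)."""
    return [col for col in expected if col not in frame.columns]

# Illustrative header-less sample (made-up values, not real records)
raw = io.StringIO("1,100,70,20,80,25.0,0.5,30,0\n2,150,80,30,0,33.1,0.3,45,1\n")
sample = pd.read_csv(raw, names=EXPECTED_COLUMNS)

missing = check_columns(sample)
print(f"Missing columns: {missing}")
```

In the notebook, the same check could be applied to `df`; a non-empty result would suggest the local file uses different column names than the ones assumed here.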

## Task 4: Inspect the Dataset

### 4.1 Shape of Dataset

In [None]:
# Display the shape of the dataset
print("=" * 60)
print("DATASET SHAPE")
print("=" * 60)
print(f"Number of rows (samples): {df.shape[0]}")
print(f"Number of columns (features + target): {df.shape[1]}")
print(f"Total data points: {df.shape[0] * df.shape[1]}")

### 4.2 Column Names and Description

In [None]:
# Display column names
print("=" * 60)
print("COLUMN NAMES")
print("=" * 60)
print(df.columns.tolist())
print("\n")

# Column descriptions
print("=" * 60)
print("COLUMN DESCRIPTIONS")
print("=" * 60)
column_descriptions = {
    'Pregnancies': 'Number of times pregnant',
    'Glucose': 'Plasma glucose concentration (2 hours in an oral glucose tolerance test)',
    'BloodPressure': 'Diastolic blood pressure (mm Hg)',
    'SkinThickness': 'Triceps skin fold thickness (mm)',
    'Insulin': '2-Hour serum insulin (mu U/ml)',
    'BMI': 'Body mass index (weight in kg/(height in m)^2)',
    'DiabetesPedigreeFunction': 'Diabetes pedigree function (genetic influence)',
    'Age': 'Age in years',
    'Outcome': 'Target variable (0 = No diabetes, 1 = Diabetes)'
}

for col, desc in column_descriptions.items():
    print(f"{col:25s}: {desc}")

### 4.3 Data Types

In [None]:
# Display data types of each column
print("=" * 60)
print("DATA TYPES")
print("=" * 60)
print(df.dtypes)
print("\n")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024:.2f} KB")

### 4.4 Basic DataFrame Functions

In [None]:
# Display first few rows
print("=" * 60)
print("FIRST 5 ROWS (head)")
print("=" * 60)
print(df.head())

In [None]:
# Display last few rows
print("=" * 60)
print("LAST 5 ROWS (tail)")
print("=" * 60)
print(df.tail())

In [None]:
# Display random sample of rows
print("=" * 60)
print("RANDOM SAMPLE (5 rows)")
print("=" * 60)
print(df.sample(5, random_state=42))  # fixed seed so the sample is reproducible

In [None]:
# Display detailed information about the dataset
print("=" * 60)
print("DATASET INFO")
print("=" * 60)
df.info()

In [None]:
# Display statistical summary
print("=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
print(df.describe())

In [None]:
# Extract specific columns
print("=" * 60)
print("EXTRACTING SPECIFIC COLUMNS")
print("=" * 60)
print("\nGlucose column (first 10 values):")
print(df['Glucose'].head(10))

print("\nMultiple columns - Age and BMI (first 10 rows):")
print(df[['Age', 'BMI']].head(10))

In [None]:
# Extract specific rows
print("=" * 60)
print("EXTRACTING SPECIFIC ROWS")
print("=" * 60)
print("\nRow at index 0:")
print(df.iloc[0])

print("\nRows 10 to 14:")
print(df.iloc[10:15])

print("\nRows where Age > 50 (first 10):")
print(df[df['Age'] > 50].head(10))
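
The `Age > 50` filter above uses a single boolean mask. As a small sketch of how such masks combine, the example below selects rows matching two conditions with `&` and picks columns in the same `.loc` call. The `toy` frame and its values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame with made-up values, standing in for a few dataset columns
toy = pd.DataFrame({
    'Age': [25, 52, 61, 48, 55],
    'BMI': [22.0, 31.5, 28.0, 35.2, 36.1],
    'Outcome': [0, 1, 0, 1, 1],
})

# Each condition needs its own parentheses; `&` is element-wise AND
mask = (toy['Age'] > 50) & (toy['BMI'] > 30)

# .loc takes the row mask and a column list together
older_high_bmi = toy.loc[mask, ['Age', 'BMI', 'Outcome']]
print(older_high_bmi)
```

Using `and` instead of `&` would raise an error here, because Python cannot reduce a whole Series to a single truth value.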

In [None]:
# Check for missing values (isnull() only flags NaN/None; placeholder zeros are not counted)
print("=" * 60)
print("MISSING VALUES CHECK")
print("=" * 60)
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
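
A zero count from `isnull()` does not necessarily mean the data is complete. In this dataset, zeros in columns such as Glucose, BloodPressure, SkinThickness, Insulin, and BMI are physiologically implausible and are commonly treated as placeholders for missing measurements. The sketch below counts such zeros. `ZERO_AS_MISSING` and `count_zeros` are hypothetical names, the choice of columns is an assumption, and the `toy` frame holds made-up values.

```python
import pandas as pd

# Assumption: a zero in these columns stands for "not measured"
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

def count_zeros(frame, columns=ZERO_AS_MISSING):
    """Count zero entries per column (hypothetical helper)."""
    return (frame[columns] == 0).sum()

# Toy frame with made-up values for illustration
toy = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BloodPressure': [72, 66, 0],
    'SkinThickness': [35, 0, 0],
    'Insulin': [0, 0, 0],
    'BMI': [33.6, 26.6, 0.0],
})

zero_counts = count_zeros(toy)
print(zero_counts)

# Masking the zeros turns them into NaN, so isnull() can then see them
toy_nan = toy[ZERO_AS_MISSING].mask(toy[ZERO_AS_MISSING] == 0)
print(toy_nan.isnull().sum())
```

In the notebook, `count_zeros(df)` would give the per-column zero counts. Whether to impute or drop those rows is a separate decision for later preprocessing.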

In [None]:
# Check for duplicate rows
print("=" * 60)
print("DUPLICATE ROWS CHECK")
print("=" * 60)
duplicates = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicates}")

## Task 5: Identify Features (X) and Target (y)

The dataset contains:
- **Features (X):** All columns except 'Outcome' - these are the independent variables used for prediction
- **Target (y):** 'Outcome' column - this is the dependent variable we want to predict (0 or 1)

In [None]:
# Identify Features (X) and Target (y)
print("=" * 60)
print("FEATURES AND TARGET IDENTIFICATION")
print("=" * 60)

# Features (X) - all columns except the target
X = df.drop('Outcome', axis=1)
print("Features (X):")
print(f"  Columns: {X.columns.tolist()}")
print(f"  Shape: {X.shape}")
print(f"  Number of features: {X.shape[1]}")

print("\n" + "-" * 60 + "\n")

# Target (y) - the outcome column
y = df['Outcome']
print("Target (y):")
print("  Column: 'Outcome'")
print(f"  Shape: {y.shape}")
print(f"  Unique values: {y.unique()}")
print("  Value counts:")
print(y.value_counts())
print("\n  Class distribution:")
print(f"    No Diabetes (0): {(y == 0).sum()} samples ({(y == 0).sum() / len(y) * 100:.2f}%)")
print(f"    Diabetes (1): {(y == 1).sum()} samples ({(y == 1).sum() / len(y) * 100:.2f}%)")
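
The counts and percentages printed above can also be produced in one table with `value_counts(normalize=True)`. This is a minimal sketch on a made-up target series. `toy_y` and `summary` are illustrative names, and the values are not the dataset's real class balance.

```python
import pandas as pd

# Made-up binary target for illustration only
toy_y = pd.Series([0, 1, 0, 0, 1, 0, 0, 1], name='Outcome')

# sort_index() keeps the rows ordered by class label (0, then 1)
summary = pd.DataFrame({
    'count': toy_y.value_counts().sort_index(),
    'percent': (toy_y.value_counts(normalize=True).sort_index() * 100).round(2),
})
print(summary)
```

Applied to `y`, the same pattern gives the class balance at a glance, which is useful when later choosing evaluation metrics for an imbalanced target.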

In [None]:
# Display the features (first 5 rows)
print("=" * 60)
print("FEATURES (X) - First 5 rows")
print("=" * 60)
print(X.head())

In [None]:
# Display the target (first 20 values)
print("=" * 60)
print("TARGET (y) - First 20 values")
print("=" * 60)
print(y.head(20))

---

## Day 1 Summary

### Work Completed:
✅ **Problem Statement Defined:** Clear understanding of diabetes prediction task  
✅ **Libraries Imported:** pandas, numpy, matplotlib, seaborn  
✅ **Dataset Loaded:** Successfully loaded PIMA Indians Diabetes dataset  
✅ **Data Inspection Completed:**
   - Dataset has 768 rows and 9 columns
   - 8 feature columns + 1 target column
   - All columns are numeric (float64/int64)
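   - No NaN values detected by isnull(); zeros in Glucose, BloodPressure, SkinThickness, Insulin, and BMI may still encode missing measurements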
   - Checked for duplicates

✅ **Basic DataFrame Operations Applied:**
   - Viewed first/last rows with head() and tail()
   - Extracted specific columns and rows
   - Generated statistical summaries
   - Explored data structure with info()

✅ **Features and Target Identified:**
   - **Features (X):** Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age
   - **Target (y):** Outcome (Binary: 0 = No Diabetes, 1 = Diabetes)
   - Class distribution analyzed

### Next Steps (Day 2):
- Exploratory Data Analysis (EDA)
- Data visualization
- Feature correlation analysis
- Outlier detection