
# CODING TASK #1: IMPORT LIBRARIES AND DATASETS

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# For AutoGluon to work in Google Colab, install ipykernel and then restart the notebook runtime
# The IPython kernel is the Python execution backend for Jupyter
!pip install -U ipykernel

In [None]:
!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0"
!pip install autogluon --no-cache-dir

In [None]:
# pip install autogluon autogluon.tabular "mxnet<2.0.0"

In [None]:
# AutoGluon is split into sub-modules for tabular, text, and image data
from autogluon.tabular import TabularDataset, TabularPredictor

In [None]:
insurance_df = pd.read_csv('insurance.csv')

In [None]:
insurance_df

**PRACTICE OPPORTUNITY #1 [OPTIONAL]:**
- **How many unique regions do we have in the insurance_df DataFrame?**

# CODING TASK #2: PERFORM EXPLORATORY DATA ANALYSIS

In [None]:
# Explore the first five rows in the DataFrame
insurance_df.head()

In [None]:
# Explore the last five rows in the DataFrame
insurance_df.tail()

In [None]:
# Generate statistical summary
insurance_df.describe()

In [None]:
# Obtain dataset information
insurance_df.info()

In [None]:
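# Group by region to look for a relationship between region and charges
# The southeast region appears to have the highest charges and body mass index
# numeric_only=True skips text columns such as 'sex' and 'smoker', which cannot be averaged
df_region = insurance_df.groupby(by='region').mean(numeric_only=True)
df_region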
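
On recent pandas versions, averaging a grouped frame that still contains text columns raises an error, so the numeric columns have to be selected first. A minimal sketch on a made-up frame (the values are illustrative and not taken from insurance.csv):

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "smoker": ["yes", "no", "no", "no"],   # text column, cannot be averaged
    "charges": [100.0, 300.0, 50.0, 150.0],
})

# numeric_only=True drops the text column from the result instead of failing
region_means = toy_df.groupby("region").mean(numeric_only=True)
print(region_means)
```

Only the `charges` column survives in the result, with one row per region.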

**PRACTICE OPPORTUNITY #2 [OPTIONAL]:**
- **Group data by 'age' and examine the relationship between 'age' and 'charges'**

# CODING TASK #3: PERFORM DATA VISUALIZATION

In [None]:
# check if there are any Null values
sns.heatmap(insurance_df.isnull(), yticklabels = False, cbar = False, cmap="Blues")

In [None]:
# check if there are any Null values
insurance_df.isnull().sum()

In [None]:
# Histograms are drawn for numeric columns only; text columns such as 'sex' and 'smoker' are skipped
insurance_df[['age', 'bmi', 'children', 'charges']].hist(bins = 30, figsize = (12, 12), color = 'r');


In [None]:
# plot pairplot
sns.pairplot(insurance_df)

In [None]:
plt.figure(figsize = (15, 6))
sns.regplot(x = 'age', y = 'charges', data = insurance_df)
plt.show()


In [None]:
plt.figure(figsize = (15, 6))
sns.regplot(x = 'bmi', y = 'charges', data = insurance_df)
plt.show()


**PRACTICE OPPORTUNITY #3 [OPTIONAL]:**
 - **Calculate and plot the correlation matrix**
 - **Which feature has the most positive correlation with charges?**

# CODING TASK #4: TRAIN MULTIPLE MODELS USING AUTOGLUON

In [None]:
# Split the data into 80% for training and 20% for testing using train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(insurance_df, test_size=0.2, random_state=0)
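
As a rough picture of what a shuffled 80/20 split does, here is a plain-Python sketch. The helper `split_rows` is hypothetical and is not how scikit-learn implements `train_test_split`; it only illustrates the idea of shuffling with a fixed seed and cutting the rows into two parts.

```python
import random

def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    # The first n_test shuffled rows become the test set, the rest the training set
    return shuffled[n_test:], shuffled[:n_test]

train_rows, test_rows = split_rows(range(10), test_size=0.2, seed=0)
print(len(train_rows), len(test_rows))
```

Fixing the seed plays the same role as `random_state=0` above: the same rows land in the same part on every run.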

In [None]:
X_train

In [None]:
X_test

In [None]:
# Train multiple ML regression models using AutoGluon
# You need to specify the target column (label), train_data, time_limit (in seconds), and presets
# AutoGluon can infer from the 'label' column whether the problem is classification or regression;
# here problem_type is set explicitly to 'regression'
# For regression problems, 'label' values are generally non-integer floats with a large number of unique values

predictor = TabularPredictor(label="charges", problem_type = 'regression', eval_metric = 'r2').fit(train_data = X_train, time_limit = 200, presets = "best_quality")
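
The note above about inferring the problem type from the label column can be pictured with a small heuristic. This is a hypothetical sketch for intuition only: the function `guess_problem_type` and its `max_classes` threshold are assumptions, not AutoGluon's actual rules.

```python
def guess_problem_type(labels, max_classes=10):
    """Hypothetical heuristic: fractional or many distinct values suggest regression."""
    distinct = set(labels)
    # Any non-integer float is a strong hint that the target is continuous
    has_fraction = any(isinstance(v, float) and not v.is_integer() for v in distinct)
    if has_fraction:
        return "regression"
    if len(distinct) == 2:
        return "binary"
    if len(distinct) > max_classes:
        return "regression"
    return "multiclass"

print(guess_problem_type([16884.92, 1725.55, 4449.46]))
print(guess_problem_type([0, 1, 1, 0]))
```

Passing `problem_type` explicitly, as in the cell above, avoids relying on any such inference.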

In [None]:
predictor.fit_summary()

# CODING TASK #5: EVALUATE TRAINED MODELS PERFORMANCE

In [None]:
predictor.leaderboard()

In [None]:
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize = (15, 6))
sns.barplot(x = "model", y = "score_val", data = predictor.leaderboard(), color = "b")
ax.set(ylabel = "Performance Metric (R2)", xlabel = "Regression Models")
plt.xticks(rotation = 45);

In [None]:
predictor.evaluate(X_test)

In [None]:
# Assess model performance
# Generate predictions for the test set and print the first five
y_pred = predictor.predict(X_test)
print("Predictions:  ", list(y_pred)[:5])

In [None]:
X_test

In [None]:
y_test = X_test['charges']
y_test #groundtruth

In [None]:
y_predict = predictor.predict(X_test)
plt.figure(figsize = (15, 10))
plt.plot(y_test, y_predict, "^", color = 'r')
plt.ylabel('Model Predictions')
plt.xlabel('True Values')

In [None]:
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Compare the test-set predictions with the ground truth
MSE = mean_squared_error(y_test, y_predict)
RMSE = round(float(np.sqrt(MSE)), 3)
MAE = mean_absolute_error(y_test, y_predict)
r2 = r2_score(y_test, y_predict)

print('RMSE =', RMSE, '\nMSE =', MSE, '\nMAE =', MAE, '\nR2 =', r2)
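
To make the four metrics concrete, the sketch below computes them by hand on four made-up values using the standard definitions. The helper name `regression_metrics` is hypothetical, and the numbers are illustrative rather than model output.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute RMSE, MSE, MAE and R2 from two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    mse = ss_res / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"RMSE": sqrt(mse), "MSE": mse, "MAE": mae, "R2": r2}

# Three perfect predictions and one that is off by 2
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
print(metrics)
```

RMSE and MAE are in the same units as the target (here, charges), while R2 is unitless and equals 1 for a perfect fit.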

**PRACTICE OPPORTUNITY #4 [OPTIONAL]:**
- **Retrain a regressor model using AutoGluon with a different preset value**
- **Set time_limit to 300 seconds**
- **Use RMSE as the key metric and plot the barchart**
- **Which model provides the best performance?**
- **Assess trained model performance by comparing various metrics**

# PRACTICE OPPORTUNITY SOLUTIONS

**PRACTICE OPPORTUNITY #1 SOLUTION:**
- **How many unique regions do we have in the insurance_df?**

In [None]:
# List the unique regions, then count them
print(insurance_df['region'].unique())
insurance_df['region'].nunique()

**PRACTICE OPPORTUNITY #2 SOLUTION:**
- **Group data by 'age' and examine the relationship between 'age' and 'charges'**

In [None]:
# numeric_only=True skips text columns that cannot be averaged
df_age = insurance_df.groupby(by = 'age').mean(numeric_only=True)
df_age

**PRACTICE OPPORTUNITY #3 SOLUTION:**
 - **Calculate and plot the correlation matrix**
 - **Which feature has the most positive correlation with charges?**

In [None]:
plt.figure(figsize = (15, 10))
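# numeric_only=True restricts the matrix to numeric columns (text columns cannot be correlated directly)
sns.heatmap(insurance_df.corr(numeric_only=True), annot = True);
# age correlates positively with charges; 'smoker' is a text column and only appears in the matrix once it is encoded as a number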
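
A text column such as 'smoker' has to be turned into numbers before it can take part in a correlation matrix. A minimal sketch on made-up data; the column name `smoker_flag` and all values are assumptions for illustration:

```python
import pandas as pd

# Toy data, invented for illustration only
toy_df = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "smoker": ["no", "no", "yes", "yes"],
    "charges": [1000.0, 2000.0, 9000.0, 10000.0],
})

# Encode yes/no as 1/0 so the column becomes numeric
toy_df["smoker_flag"] = (toy_df["smoker"] == "yes").astype(int)

# The original text column is excluded; the encoded flag is included
corr = toy_df.corr(numeric_only=True)
print(corr["charges"].sort_values(ascending=False))
```

The same encoding step could be applied to `insurance_df` before plotting the heatmap.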

**PRACTICE OPPORTUNITY #4 SOLUTION:**
- **Retrain a regressor model using AutoGluon with a different preset value**
- **Set time_limit to 300 seconds**
- **Use RMSE as the key metric and plot the barchart**
- **Which model provides the best performance?**
- **Assess trained model performance by comparing various metrics**

In [None]:
predictor = TabularPredictor(label="charges", problem_type = 'regression', eval_metric = 'rmse').fit(train_data = X_train, time_limit = 300, presets = "optimize_for_deployment")
predictor.fit_summary()
predictor.leaderboard()

# Initialize the matplotlib figure
f, ax = plt.subplots(figsize = (15, 6))
sns.barplot(x = "model", y = "score_val", data = predictor.leaderboard(), color = "b")
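# AutoGluon reports scores so that higher is better, so RMSE appears as a negative number in score_val
ax.set(ylabel = "Validation Score (negative RMSE)", xlabel = "Regression Models")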
plt.xticks(rotation = 45);

predictor.evaluate(X_test)
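
When the metric is an error such as RMSE, AutoGluon shows it in "higher is better" form, so the values in `score_val` are negated. The sketch below uses a stand-in DataFrame with invented model names and scores (not real leaderboard output) to show how to read such a table:

```python
import pandas as pd

# Stand-in for predictor.leaderboard(); model names and scores are invented
toy_leaderboard = pd.DataFrame({
    "model": ["ModelA", "ModelB", "ModelC"],
    "score_val": [-4500.0, -4800.0, -6100.0],
})

# Flip the sign to recover the usual positive RMSE
toy_leaderboard["rmse"] = -toy_leaderboard["score_val"]

# The best model has the smallest RMSE, i.e. the largest score_val
best_model = toy_leaderboard.sort_values("rmse").iloc[0]["model"]
print(best_model)
```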

# FINAL CAPSTONE PROJECT

- The objective of this project is to build, train, and test a classifier model to predict diabetes in patients using AutoGluon. This project can be effectively used by healthcare professionals to detect diabetes and understand key factors that contribute to the disease.
- Please complete the following tasks:
  - Load the “diabetes.csv” dataset
  - Perform basic Exploratory Data Analysis (EDA)
  - Using ‘best_quality’ preset and ‘accuracy’ metric, train classification models using AutoGluon to predict the “Outcome” column
  - Evaluate trained models' performance by plotting the leaderboard and indicating the best model. Plot the confusion matrix.


# FINAL CAPSTONE PROJECT SOLUTION

# PROJECT TASK #1: IMPORT LIBRARIES AND DATASETS

In [None]:
# Import key libraries and datasets
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
!pip install -U pip
!pip install -U setuptools wheel
!pip install -U "mxnet<2.0.0"
!pip install autogluon --no-cache-dir

In [None]:
# pip install autogluon autogluon.tabular "mxnet<2.0.0"

In [None]:
# AutoGluon is split into sub-modules for tabular, text, and image data
from autogluon.tabular import TabularDataset, TabularPredictor

In [None]:
# Read the diabetes dataset
diabetes_df = pd.read_csv('diabetes.csv')

In [None]:
diabetes_df

# PROJECT TASK #2: PERFORM EXPLORATORY DATA ANALYSIS AND VISUALIZATION

In [None]:
# Explore the first five rows in the DataFrame
diabetes_df.head()

In [None]:
# Explore the last five rows in the DataFrame
diabetes_df.tail()

In [None]:
# Generate statistical summary
diabetes_df.describe()

In [None]:
# Obtain dataset information
diabetes_df.info()

In [None]:
# check if there are any Null values
sns.heatmap(diabetes_df.isnull(), yticklabels = False, cbar = False, cmap="Blues")

In [None]:
# check if there are any Null values
diabetes_df.isnull().sum()

In [None]:
diabetes_df.hist(bins = 30, figsize = (12, 12), color = 'r');

In [None]:
# plot pairplot
sns.pairplot(diabetes_df, hue = 'Outcome', vars = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age'])

In [None]:
plt.figure(figsize = (15, 10))
sns.heatmap(diabetes_df.corr(), annot = True);
# Read the 'Outcome' row to see which features correlate most strongly with diabetes

In [None]:
plt.figure(figsize = (12, 7))
sns.countplot(x = 'Outcome', data = diabetes_df)

# PROJECT TASK #3: TRAIN MULTIPLE MODELS USING AUTOGLUON

In [None]:
# Split the data into 80% for training and 20% for testing using train_test_split
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(diabetes_df, test_size=0.2, random_state=0)

In [None]:
X_train

In [None]:
X_test

In [None]:
# Train multiple ML classifier models using AutoGluon
# You need to specify the target column (label), train_data, time_limit (in seconds), and presets
# AutoGluon can infer from the 'label' column whether the problem is classification or regression;
# here problem_type is set explicitly to 'binary'
predictor = TabularPredictor(label="Outcome", problem_type = 'binary', eval_metric = 'accuracy').fit(train_data = X_train, time_limit = 200, presets = "best_quality")

In [None]:
predictor.fit_summary()

# PROJECT TASK #4: EVALUATE TRAINED MODEL PERFORMANCE

In [None]:
predictor.leaderboard()

In [None]:
# Initialize the matplotlib figure
f, ax = plt.subplots(figsize = (15, 6))
sns.barplot(x = "model", y = "score_val", data = predictor.leaderboard(), color = "b")
ax.set(ylabel = "Performance Metric (Accuracy)", xlabel = "Classification Models")
plt.xticks(rotation = 45);


In [None]:
predictor.evaluate(X_test)

In [None]:
# Assess model performance
# Generate predictions for the test set and print the first five
y_pred = predictor.predict(X_test)
print("Predictions:  ", list(y_pred)[:5])

In [None]:
y_test = X_test['Outcome']
y_test

In [None]:
# Test-set performance: rows are true classes, columns are predicted classes
from sklearn.metrics import confusion_matrix
plt.figure(figsize = (12, 8))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True, fmt = 'd')


In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
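
The numbers in the classification report follow directly from the four cells of the confusion matrix. A plain-Python sketch on made-up 0/1 labels; the helper `binary_confusion` is hypothetical and the labels are illustrative, not model output:

```python
def binary_confusion(y_true, y_pred):
    """Count (tn, fp, fn, tp) for 0/1 labels."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp

y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = binary_confusion(y_true_toy, y_pred_toy)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
accuracy = (tp + tn) / len(y_true_toy)
print(tn, fp, fn, tp, precision, recall, accuracy)
```

In scikit-learn's `confusion_matrix` output for binary labels, these counts sit at `[[tn, fp], [fn, tp]]`.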

# GREAT JOB!