
# Load Libraries/Data


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import make_pipeline

## Models
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression

## Metrics
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.metrics import classification_report, ConfusionMatrixDisplay, roc_auc_score, RocCurveDisplay

## Set global scikit-learn configuration 
from sklearn import set_config
## Display estimators as a diagram
set_config(display='diagram')  # 'text' or 'diagram'

# Dataset 1 - Healthcare Stroke Prediction

1. **Source of data**
  - https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset
2. **Brief description of data**
  - The dataset contains various features of people, including 'bmi', 'heart_disease', 'hypertension', etc.
  - Using these characteristics, I'll create a model to predict whether a person will have a stroke
3. **What is the target?**
  - Column 'stroke', yes (1) and no (0)
4. **What does one row represent? (A person?  A business?  An event? A product?)**
  - A row represents a person
5. **Is this a classification or regression problem?**
  - Classification, predicting stroke: yes or no
6. **How many features does the data have?**
  - 11 features, 12 columns including the target column
7. **How many rows are in the dataset?**
  - 5110 rows
8. **What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?**
  - Missing data, irrelevant columns, mix of categorical and numerical data
  - The metrics I'll focus on are accuracy and, especially, recall, since a missed stroke case (a false negative) is the error to avoid. Both need to be high.
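
To illustrate why recall is tracked alongside accuracy, here is a minimal sketch. The labels are hypothetical and made up for illustration; they are not results from this dataset or from any model in this notebook. `accuracy_score` and `recall_score` come from `sklearn.metrics` and are not among the imports in the first cell.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels for illustration only (1 = stroke, 0 = no stroke).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
# A hypothetical model that finds only one of the two stroke cases.
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Accuracy counts every correct prediction: 9 of 10 here.
accuracy = accuracy_score(y_true, y_pred)
# Recall counts only the actual stroke cases that were found: 1 of 2 here.
recall = recall_score(y_true, y_pred)

print(f"accuracy: {accuracy:.2f}")
print(f"recall:   {recall:.2f}")
```

In this toy case the accuracy looks strong while half of the stroke cases are missed, which is why recall gets extra weight for this problem.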

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [27]:
stroke_df = pd.read_csv('/content/drive/MyDrive/Data Sets for Coding Dojo/healthcare-dataset-stroke-data.csv')
stroke_df.head()

| | id | gender | age | hypertension | heart_disease | ever_married | work_type | Residence_type | avg_glucose_level | bmi | smoking_status | stroke |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9046 | Male | 67.0 | 0 | 1 | Yes | Private | Urban | 228.69 | 36.6 | formerly smoked | 1 |
| 1 | 51676 | Female | 61.0 | 0 | 0 | Yes | Self-employed | Rural | 202.21 | NaN | never smoked | 1 |
| 2 | 31112 | Male | 80.0 | 0 | 1 | Yes | Private | Rural | 105.92 | 32.5 | never smoked | 1 |
| 3 | 60182 | Female | 49.0 | 0 | 0 | Yes | Private | Urban | 171.23 | 34.4 | smokes | 1 |
| 4 | 1665 | Female | 79.0 | 1 | 0 | Yes | Self-employed | Rural | 174.12 | 24.0 | never smoked | 1 |


In [28]:
stroke_df.shape

(5110, 12)

In [29]:
stroke_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5110 entries, 0 to 5109
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5110 non-null   int64  
 1   gender             5110 non-null   object 
 2   age                5110 non-null   float64
 3   hypertension       5110 non-null   int64  
 4   heart_disease      5110 non-null   int64  
 5   ever_married       5110 non-null   object 
 6   work_type          5110 non-null   object 
 7   Residence_type     5110 non-null   object 
 8   avg_glucose_level  5110 non-null   float64
 9   bmi                4909 non-null   float64
 10  smoking_status     5110 non-null   object 
 11  stroke             5110 non-null   int64  
dtypes: float64(3), int64(4), object(5)
memory usage: 479.2+ KB
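
The `info()` output above shows missing values in `bmi` (4909 of 5110 non-null) and a mix of numeric and object columns. One possible way to handle both is sketched below. It is not the notebook's final pipeline: the toy frame is copied from the five `head()` rows, the names `sample_df`, `num_pipe`, `cat_pipe`, and `preprocessor` are hypothetical, `SimpleImputer` is an extra import, and treating `id` as the irrelevant column to drop is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame copied from the five rows shown by stroke_df.head().
sample_df = pd.DataFrame({
    "id": [9046, 51676, 31112, 60182, 1665],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
    "age": [67.0, 61.0, 80.0, 49.0, 79.0],
    "hypertension": [0, 0, 0, 0, 1],
    "heart_disease": [1, 0, 1, 0, 0],
    "ever_married": ["Yes", "Yes", "Yes", "Yes", "Yes"],
    "work_type": ["Private", "Self-employed", "Private", "Private", "Self-employed"],
    "Residence_type": ["Urban", "Rural", "Rural", "Urban", "Rural"],
    "avg_glucose_level": [228.69, 202.21, 105.92, 171.23, 174.12],
    "bmi": [36.6, np.nan, 32.5, 34.4, 24.0],
    "smoking_status": ["formerly smoked", "never smoked", "never smoked",
                       "smokes", "never smoked"],
    "stroke": [1, 1, 1, 1, 1],
})

# Assumption: 'id' carries no predictive information, so it is dropped
# together with the target.
X = sample_df.drop(columns=["id", "stroke"])

# Numeric columns: fill gaps with the median, then standardize.
num_pipe = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Non-numeric columns: fill gaps with the most frequent value, then one-hot encode.
cat_pipe = make_pipeline(SimpleImputer(strategy="most_frequent"),
                         OneHotEncoder(handle_unknown="ignore"))

preprocessor = make_column_transformer(
    (num_pipe, make_column_selector(dtype_include="number")),
    (cat_pipe, make_column_selector(dtype_exclude="number")),
    sparse_threshold=0,  # always return a dense array
)

X_processed = preprocessor.fit_transform(X)
print(X_processed.shape)
print(np.isnan(X_processed).any())
```

On the full dataset the imputers and scaler would be fit on the training split only, after `train_test_split`, so that no information leaks from the test rows.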


# Dataset 2 - Spaceship Titanic
1. **Source of data**
  - https://www.kaggle.com/competitions/spaceship-titanic
2. **Brief description of data**
  - The dataset contains various features of people on a spaceship
  - Using these characteristics, I'll create a model to predict which passengers were transported by an anomaly
3. **What is the target?**
  - Column 'Transported', True or False
4. **What does one row represent? (A person?  A business?  An event? A product?)**
  - A row represents a person
5. **Is this a classification or regression problem?**
  - Classification, predicting transported: True or False
6. **How many features does the data have?**
  - 13 features, 14 columns including the target column
7. **How many rows are in the dataset?**
  - 8693 rows
8. **What, if any, challenges do you foresee in cleaning, exploring, or modeling this dataset?**
  - Missing data, irrelevant columns, mix of categorical and numerical data
  - The metrics I'll focus on are accuracy and F1 score
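
As a quick reference for the F1 score mentioned above, here is a small sketch showing how it combines precision and recall. The labels are hypothetical and made up for illustration; they are not predictions from any model in this notebook. `precision_score`, `recall_score`, and `f1_score` come from `sklearn.metrics` and are not among the imports in the first cell.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for illustration only (1 = Transported, 0 = not transported).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)  # share of predicted positives that are correct
recall = recall_score(y_true, y_pred)        # share of actual positives that were found
f1 = f1_score(y_true, y_pred)

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

print(f"precision: {precision:.3f}")
print(f"recall:    {recall:.3f}")
print(f"f1:        {f1:.3f}")
```

Because it is a harmonic mean, F1 stays low unless precision and recall are both reasonably high.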

In [3]:
space_df = pd.read_csv('/content/drive/MyDrive/Data Sets for Coding Dojo/space_train.csv')
space_df.head()

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |


In [4]:
space_df.shape

(8693, 14)

In [5]:
space_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB
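
The non-null counts in the `info()` output above can be turned into missing-value counts per column. The sketch below copies those counts by hand so it runs without the CSV; the names `non_null`, `missing`, and `summary` are hypothetical. With the real frame loaded, `space_df.isna().sum()` should give the same counts directly.

```python
import pandas as pd

N_ROWS = 8693  # from space_df.shape

# Non-null counts copied from the space_df.info() output.
non_null = pd.Series({
    "PassengerId": 8693, "HomePlanet": 8492, "CryoSleep": 8476,
    "Cabin": 8494, "Destination": 8511, "Age": 8514, "VIP": 8490,
    "RoomService": 8512, "FoodCourt": 8510, "ShoppingMall": 8485,
    "Spa": 8510, "VRDeck": 8505, "Name": 8493, "Transported": 8693,
})

# Missing values per column, as a count and as a percentage of all rows.
missing = N_ROWS - non_null
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / N_ROWS * 100).round(2),
}).sort_values("missing", ascending=False)

print(summary)
```

Every column except `PassengerId` and `Transported` has gaps, so an imputation strategy is needed for both the numeric and the categorical features.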
