<a href="https://colab.research.google.com/github/ipark3/Hank-Ian/blob/main/GitHub%20Project/dataset/code/Learning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [63]:
# Google Drive Mount
from google.colab import drive
drive.mount('/content/drive')

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

# Load the dataset
file_path = '/content/drive/MyDrive/Personal/NCHS_-_Leading_Causes_of_Death__United_States.csv'
df = pd.read_csv(file_path)

# Fill or drop missing values if necessary
df = df.dropna()  # Dropping rows with missing values (can use df.fillna() if needed)

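# Define respiratory illnesses
# These names must match the values in df['Cause Name'] exactly; check df['Cause Name'].unique()
# (the dataset may use labels such as 'CLRD' and 'Influenza and pneumonia' instead)
respiratory_illnesses = ['Pneumonia', 'Asthma', 'Chronic Lower Respiratory Diseases', 'Influenza']  # Add relevant causes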

# Create a new binary column: Is_Respiratory
df['Is_Respiratory'] = df['Cause Name'].apply(lambda x: 1 if x in respiratory_illnesses else 0)

# Section: Supervised Learning - Predicting Respiratory Illness (Is_Respiratory)
# Define features and target
X = df.drop(columns=['Cause Name', 'Is_Respiratory', '113 Cause Name'])  # Drop irrelevant or target columns
y = df['Is_Respiratory']

# Encode categorical features
le = LabelEncoder()
for col in X.select_dtypes(include=['object']).columns:
    X[col] = le.fit_transform(X[col])

# Scale numeric features
scaler = StandardScaler()
X = scaler.fit_transform(X)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train a Random Forest Classifier
clf = RandomForestClassifier(random_state=42)
clf.fit(X_train, y_train)

# Evaluate the model
y_pred = clf.predict(X_test)
print("\nClassification Report (Supervised Learning):")
print(classification_report(y_test, y_pred))
print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))

# Section: Unsupervised Learning - Clustering States Based on Death Rates
# Aggregate data for clustering
cluster_features = df[['State', 'Age-adjusted Death Rate', 'Deaths']].groupby('State').mean()

# Scale the data
cluster_features_scaled = scaler.fit_transform(cluster_features)

# Apply K-Means clustering
kmeans = KMeans(n_clusters=3, random_state=42)
clusters = kmeans.fit_predict(cluster_features_scaled)

# Add cluster labels back to DataFrame
cluster_features['Cluster'] = clusters
print("\nClustered States:")
print(cluster_features)



Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Classification Report (Supervised Learning):
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3261

    accuracy                           1.00      3261
   macro avg       1.00      1.00      1.00      3261
weighted avg       1.00      1.00      1.00      3261


Confusion Matrix:
[[3261]]

Clustered States:
                      Age-adjusted Death Rate         Deaths  Cluster
State                                                                
Alabama                            153.289474    7672.416268        2
Alaska                             122.824402     564.066986        0
Arizona                            117.020096    7523.650718        0
Arkansas                           148.490909    4724.665072        2
California                         112.487081   39088.578947        0
Colorado       



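The supervised learning result needs careful reading. The classification report contains only class 0 (support 3261) and the confusion matrix is a single cell, so no row in the test set was labeled as a respiratory illness. The perfect scores, including the absence of false positives, therefore follow from a target with a single class rather than from predictive skill. The likely cause is that none of the names in respiratory_illnesses match the values in the Cause Name column.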
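
A quick way to catch a one-class target is to count the labels before training. The sketch below uses a small made-up frame in place of the real CSV, and the "corrected" names are an assumption about how the dataset spells these causes; the real values should be read from df['Cause Name'].unique().

```python
import pandas as pd

# Hypothetical stand-in for the 'Cause Name' column (not the real CSV).
df_demo = pd.DataFrame({
    "Cause Name": ["Heart disease", "CLRD", "Influenza and pneumonia",
                   "Cancer", "Stroke"],
})

# Names as written in the cell above.
respiratory_illnesses = ["Pneumonia", "Asthma",
                         "Chronic Lower Respiratory Diseases", "Influenza"]

# Exact-match labeling: a name only counts if it appears verbatim.
labels = df_demo["Cause Name"].isin(respiratory_illnesses).astype(int)
print(labels.value_counts())

# Names in the list that never occur in the data point to a mismatch.
unmatched = sorted(set(respiratory_illnesses) - set(df_demo["Cause Name"]))
print("Unmatched names:", unmatched)

# Assumed spellings; verify against the actual column before relying on them.
corrected = ["CLRD", "Influenza and pneumonia"]
labels_fixed = df_demo["Cause Name"].isin(corrected).astype(int)
print(labels_fixed.value_counts())
```

If value_counts() shows a single class, the classifier has nothing to separate and any reported accuracy is uninformative.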

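The unsupervised learning groups states by their mean age-adjusted death rate and mean number of deaths, averaged over all causes and years in the dataset, so the clusters are not specific to respiratory illness. In the visible output, the states tagged 2 (Alabama, Arkansas) have higher age-adjusted death rates than the states tagged 0 (Alaska, Arizona, California). The cluster numbers themselves are arbitrary labels assigned by K-Means.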
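
Because the cluster numbers carry no meaning on their own, it helps to look at the cluster centers in the original units. The following is a minimal sketch on made-up per-state averages (the frame demo and its values are hypothetical, not taken from the dataset).

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

cols = ["Age-adjusted Death Rate", "Deaths"]

# Hypothetical per-state averages (illustrative numbers only).
demo = pd.DataFrame(
    {
        "Age-adjusted Death Rate": [150.0, 148.0, 120.0, 118.0, 112.0, 115.0],
        "Deaths": [7600.0, 4700.0, 600.0, 7500.0, 39000.0, 36000.0],
    },
    index=["A", "B", "C", "D", "E", "F"],
)

scaler = StandardScaler()
scaled = scaler.fit_transform(demo[cols])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
demo["Cluster"] = kmeans.fit_predict(scaled)

# Map the centroids from scaled space back to the original units.
centers = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=cols
)
print(centers.round(1))

# Per-cluster averages of the original features, for comparison.
print(demo.groupby("Cluster")[cols].mean().round(1))
```

Reading the centers side by side shows which cluster corresponds to high rates or high counts, which is what the interpretation of a tag such as 2 should rest on.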

In [64]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, make_scorer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the dataset
file_path = '/content/drive/MyDrive/Personal/P_Data_Extract_From_World_Development_Indicators.xlsx'
df = pd.read_excel(file_path)

# Display basic information
print("Dataset Overview")
print(df.head())
print(df.info())
print(df.describe())

# Data preprocessing
print("\nPreprocessing Data")
# Handling missing values (example strategy: drop or fill with median)
df = df.dropna()
print(f"Dataset shape after removing missing values: {df.shape}")

# Feature selection
# Features are pollution levels from earlier years; the target is the 2020 level
features = ['1990 [YR1990]', '2000 [YR2000]', '2014 [YR2014]', '2015 [YR2015]',
            '2016 [YR2016]', '2017 [YR2017]', '2018 [YR2018]', '2019 [YR2019]']
target = '2020 [YR2020]'

# Ensure numeric conversion for features and target
for col in features + [target]:
    df[col] = pd.to_numeric(df[col], errors='coerce')

# Handle missing values again if introduced during conversion
df.dropna(subset=features + [target], inplace=True)

# Supervised Learning
X = df[features]
y = df[target]

# Splitting data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Training a Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions and evaluation
y_pred = model.predict(X_test)
print(f"R2 Score: {r2_score(y_test, y_pred):.2f}")
print(f"Mean Squared Error: {mean_squared_error(y_test, y_pred):.2f}")

# Unsupervised Learning: Example using K-Means Clustering
print("\nUnsupervised Learning: K-Means Clustering")
# Work on a copy of the cleaned DataFrame (rows with numeric features and target)
df_filtered = df.copy()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_filtered[features].astype(float))

kmeans = KMeans(n_clusters=3, random_state=42)  # Set n_clusters based on exploration
kmeans.fit(X_scaled)

# Add cluster labels to the dataset
df_filtered['Cluster'] = kmeans.labels_
print(df_filtered[['Country Name', 'Cluster']])


Mounted at /content/drive
Dataset Overview
                                         Series Name        Series Code  \
0  PM2.5 air pollution, mean annual exposure (mic...  EN.ATM.PM25.MC.M3   
1  PM2.5 air pollution, mean annual exposure (mic...  EN.ATM.PM25.MC.M3   
2  PM2.5 air pollution, mean annual exposure (mic...  EN.ATM.PM25.MC.M3   
3  PM2.5 air pollution, mean annual exposure (mic...  EN.ATM.PM25.MC.M3   
4  PM2.5 air pollution, mean annual exposure (mic...  EN.ATM.PM25.MC.M3   

     Country Name Country Code 1990 [YR1990] 2000 [YR2000] 2014 [YR2014]  \
0     Afghanistan          AFG     64.174097      64.76728     77.143728   
1         Albania          ALB     22.961579     22.265189     19.367944   
2         Algeria          DZA      22.35937     28.233108     24.111123   
3  American Samoa          ASM      6.433882       6.78903      6.381482   
4         Andorra          AND     16.827185     16.221599     10.435999   

  2015 [YR2015] 2016 [YR2016] 2017 [YR2017] 2018 

The supervised learning shows that, given the high R2 score, pollution data from earlier years is a strong predictor of 2020 levels. It helps indicate how trends may develop in the future.
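
A single train/test split gives one R2 value, which can vary with the split. The cell above imports KFold and cross_val_score without using them; one way they could be applied is sketched here on synthetic data (the values below are generated, not the World Bank series, and only three year columns are used for brevity).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the pollution table: each "country" has a base
# level, and each year is that level plus a little noise.
n = 60
base = rng.uniform(5, 80, size=n)
X = pd.DataFrame({
    "2017 [YR2017]": base + rng.normal(0, 1.5, n),
    "2018 [YR2018]": base + rng.normal(0, 1.5, n),
    "2019 [YR2019]": base + rng.normal(0, 1.5, n),
})
y = base + rng.normal(0, 1.5, n)  # plays the role of the 2020 column

# 5-fold cross-validation: every row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

print("R2 per fold:", np.round(scores, 3))
print(f"Mean R2: {scores.mean():.3f} (std {scores.std():.3f})")
```

A small spread across folds would support the claim that the fit is stable rather than a product of one favorable split.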

The unsupervised learning clusters countries by their pollution trend. It shows that countries such as Afghanistan have had higher pollution over time.
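
The cell above fixes n_clusters=3 with a note to set it based on exploration. One common way to explore is the silhouette score, sketched below on synthetic data with three well-separated pollution levels (the groups and their values are made up for illustration).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three groups of "countries" with low, medium, and
# high PM2.5 exposure across four yearly columns.
low = rng.normal(10, 1.5, size=(20, 4))
mid = rng.normal(30, 1.5, size=(20, 4))
high = rng.normal(65, 1.5, size=(20, 4))
X = np.vstack([low, mid, high])

X_scaled = StandardScaler().fit_transform(X)

# Score each candidate k; a higher silhouette means tighter, better
# separated clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print("Best k by silhouette:", best_k)
```

On real data the peak is usually less clear, so the score is best treated as one input alongside how interpretable the resulting groups are.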

Using the clusters, we could help those countries make intervention plans for the future.