
The objective of this assignment is to become familiar with the Random Forest (RF) method and to gain practical experience training both regression and classification RF models.

# Problem 01

In this problem you will work with the "Life Expectancy Data.csv" dataset. <br>
Import the dataset and
1. Drop all rows with missing values at the beginning of the project.
2. Drop the "Country" column from the dataset.
3. Split the dataset into train and test sets (use 10% of the data as the test set).
4. Prepare the datasets using a pipeline.

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

# Load the data and drop every row that contains a missing value
df = pd.read_csv('Life Expectancy Data.csv')
life_expectancy_data_cleaned = df.dropna()

# Drop the "Country" column
life_expectancy_data_cleaned = life_expectancy_data_cleaned.drop(columns=["Country"])

# Hold out 10% of the rows as the test set
train_data, test_data = train_test_split(life_expectancy_data_cleaned, test_size=0.1, random_state=42)

# Separate the features from the target
X_train = train_data.drop(columns=['Life expectancy'])
y_train = train_data['Life expectancy']
X_test = test_data.drop(columns=['Life expectancy'])
y_test = test_data['Life expectancy']

# Numeric columns are imputed and scaled; categorical columns are one-hot encoded
numeric_features = X_train.select_dtypes(include=['float64', 'int64']).columns
categorical_features = X_train.select_dtypes(include=['object']).columns

preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='mean')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ]
)

pipeline = Pipeline([
    ('preprocessor', preprocessor)
])

# Fit the preprocessing on the training data only, then apply it to the test data
X_train_prepared = pipeline.fit_transform(X_train)
X_test_prepared = pipeline.transform(X_test)

print("Shape of X_train_prepared:", X_train_prepared.shape)
print("Shape of X_test_prepared:", X_test_prepared.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

Shape of X_train_prepared: (1884, 19)
Shape of X_test_prepared: (210, 19)
Shape of y_train: (1884,)
Shape of y_test: (210,)


**Model training and testing**

1. Train a Random Forest model with 500 trees on the prepared train dataset.
2. Test your model on both the train and test datasets and calculate the RMSEs.
3. Extract the feature importances of all the features in the prepared dataset.
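
As a reminder of what the RMSE in step 2 measures, it is the square root of the mean squared difference between the targets and the predictions. A minimal sketch on made-up numbers (the two arrays are illustrative, not model output) compares a manual computation with the `mean_squared_error` route:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative targets and predictions
y_true = np.array([70.0, 65.0, 80.0, 75.0])
y_pred = np.array([72.0, 64.0, 77.0, 75.0])

# Errors are -2, 1, 3, 0; squared errors are 4, 1, 9, 0; their mean is 3.5
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(rmse_manual, rmse_sklearn)
```

Both values equal the square root of 3.5, roughly 1.87, in the same units as the target.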

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Train a random forest regressor with 500 trees
rf_model = RandomForestRegressor(n_estimators=500, random_state=42)
rf_model.fit(X_train_prepared, y_train)

# Predict on both the train and test sets
y_train_pred = rf_model.predict(X_train_prepared)
y_test_pred = rf_model.predict(X_test_prepared)

# RMSE is the square root of the mean squared error
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

# Impurity-based importances, one per column of the prepared matrix
feature_importances = rf_model.feature_importances_

# Column order of the prepared matrix: numeric features first, then the one-hot encoded categories
feature_names = list(numeric_features) + list(pipeline.named_steps['preprocessor'].transformers_[1][1].get_feature_names_out())
feature_importances_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

{
    "Train RMSE": train_rmse,
    "Test RMSE": test_rmse,
    "Feature Importances": feature_importances_df
}

{'Train RMSE': 0.6698125339503145,
 'Test RMSE': 1.8536320250742764,
 'Feature Importances':                             Feature  Importance
 15  Income composition of resources    0.536204
 10                         HIV/AIDS    0.299262
 1                   Adult Mortality    0.087609
 5                               BMI    0.013928
 16                        Schooling    0.008019
 8                 Total expenditure    0.007832
 14               thinness 5-9 years    0.007299
 3                           Alcohol    0.006335
 0                              Year    0.005649
 6                 under-five deaths    0.004248
 7                             Polio    0.003876
 13             thinness  1-19 years    0.003717
 11                              GDP    0.003474
 9                        Diphtheria    0.003453
 12                       Population    0.003072
 4            percentage expenditure    0.002999
 2                     infant deaths    0.002819
 17                 Status
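
The train RMSE (about 0.67) is well below the test RMSE (about 1.85), which is the usual pattern when a random forest is scored on the data it was trained on. One hedged way to get a less optimistic estimate without touching the test set is the out-of-bag (OOB) prediction that scikit-learn exposes when `oob_score=True`. The sketch below uses synthetic data from `make_regression` as a stand-in for the prepared life expectancy features; the sample size, feature count, and noise level are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (not the life expectancy dataset)
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)

# oob_score=True makes the forest keep, for each row, the average prediction
# of the trees whose bootstrap sample did not contain that row
rf_demo = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf_demo.fit(X_demo, y_demo)

train_rmse_demo = np.sqrt(mean_squared_error(y_demo, rf_demo.predict(X_demo)))
oob_rmse_demo = np.sqrt(mean_squared_error(y_demo, rf_demo.oob_prediction_))

print("Train RMSE:", train_rmse_demo)
print("OOB RMSE:  ", oob_rmse_demo)
```

On data like this the OOB RMSE comes out clearly higher than the train RMSE, and it is usually closer to what a held-out test set would show.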

# Problem 02

In this problem you will work with the "water_potability.csv" dataset. <br>
Import the dataset and

1. Split the data into train and test datasets; use 10% of the data as the test set.
2. Prepare the datasets using a pipeline.
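
`train_test_split` returns its outputs in a fixed order: the train and test parts of the first array, then the train and test parts of the second. A small sketch on synthetic labels (the arrays and the 60/40 class ratio are made up for illustration) shows that order, together with the optional `stratify` argument, which keeps the class ratio the same in both parts. Whether stratification is wanted here is a judgment call that the assignment does not specify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and binary labels: 60 rows of class 0, 40 rows of class 1
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 60 + [1] * 40)

# Return order: X_train, X_test, y_train, y_test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, stratify=y_demo, random_state=42
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)

# Share of class 1 in each part; stratification keeps both at the overall ratio
print(y_tr.mean(), y_te.mean())
```

Unpacking the four return values in a different order does not raise an error, because all four are valid arrays, so checking the printed shapes is a cheap safeguard.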

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('water_potability.csv')

x = df.drop(columns=['Potability'])
y = df['Potability']

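# train_test_split returns X_train, X_test, y_train, y_test, in that order
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)

# Impute missing values with the column mean, then standardize each feature
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Fit on the training data only, then apply the same transformation to the test data
x_train_prepared = pipeline.fit_transform(x_train)
x_test_prepared = pipeline.transform(x_test)

x_train_prepared.shape, x_test_prepared.shape, y_train.shape, y_test.shape

((2948, 9), (328, 9), (2948,), (328,))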

**Model training and testing**

1. Train a Random Forest model with 500 trees on the prepared train dataset.
2. Test your model on both the train and test datasets, calculate the precision and recall scores, and extract the confusion matrix.
3. Extract the feature importances of all the features in the prepared dataset.
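
To make the metrics in step 2 concrete, here is a small worked sketch on made-up labels (the two lists are illustrative, not predictions from the water potability model). It reads the four cells of the confusion matrix and derives precision and recall from them by hand, then compares with scikit-learn's own functions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Illustrative true labels and predictions for a binary problem
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Precision: share of predicted positives that are truly positive (2 of 3)
precision_manual = tp / (tp + fp)

# Recall: share of true positives that were found (2 of 4)
recall_manual = tp / (tp + fn)

print(cm)
print(precision_manual, precision_score(y_true, y_pred))
print(recall_manual, recall_score(y_true, y_pred))
```

Here the confusion matrix is `[[3, 1], [2, 2]]`, giving a precision of 2/3 and a recall of 0.5.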

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, confusion_matrix

# train_test_split returns X_train, X_test, y_train, y_test, in that order.
# The same random_state reproduces the rows behind x_train_prepared and x_test_prepared.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1, random_state=42)

# Potability is a class label, so use a classifier (500 trees)
model = RandomForestClassifier(n_estimators=500, random_state=42)
model.fit(x_train_prepared, y_train)

# Predict on both the train and test sets
y_train_pred = model.predict(x_train_prepared)
y_test_pred = model.predict(x_test_prepared)

# Precision and recall on both sets
train_precision = precision_score(y_train, y_train_pred)
test_precision = precision_score(y_test, y_test_pred)
train_recall = recall_score(y_train, y_train_pred)
test_recall = recall_score(y_test, y_test_pred)

# Confusion matrices (rows: true class, columns: predicted class)
train_conf_matrix = confusion_matrix(y_train, y_train_pred)
test_conf_matrix = confusion_matrix(y_test, y_test_pred)

# Impurity-based importances; imputing and scaling keep the column order of x
feature_importances = model.feature_importances_
feature_names = x.columns

importances_df = pd.DataFrame({
    'Feature': feature_names,
    'Importance': feature_importances
}).sort_values(by='Importance', ascending=False)

print(importances_df)

           Feature  Importance
0               ph    0.128606
4          Sulfate    0.124403
1         Hardness    0.120700
3      Chloramines    0.115204
2           Solids    0.113521
6   Organic_carbon    0.101080
5     Conductivity    0.100899
7  Trihalomethanes    0.097986
8        Turbidity    0.097601
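
The importances printed above are scikit-learn's impurity-based importances, and here they are spread fairly evenly (roughly 0.10 to 0.13 per feature). As a hedged cross-check, permutation importance measures how much a score drops when a single column is shuffled. The sketch below uses synthetic data from `make_classification` in place of the water potability features, so the numbers it prints say nothing about the real dataset; the sample size and feature counts are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 6 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

clf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
clf_demo.fit(X_tr, y_tr)

# Shuffle each column 5 times on held-out data and record the drop in accuracy
perm = permutation_importance(clf_demo, X_te, y_te, n_repeats=5, random_state=0)

# Report features from largest to smallest mean drop
for i in perm.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {perm.importances_mean[i]:.3f} +/- {perm.importances_std[i]:.3f}")
```

Because it is computed on held-out rows, permutation importance reflects what the model actually relies on for prediction, whereas impurity-based importances always sum to 1 regardless of how well the model generalizes.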
