# Portfolio Project

### For this project I will follow the ML (Machine Learning) workflow taught in Codecademy's: 
#### 'Machine Learning/AI Engineer' career path
1. ETL (Extract, Transform and Load) data
2. Data Cleaning
3. Train-Test-Validation Split
4. EDA (Exploratory Data Analysis)
5. Feature Engineering (normalization, removing autocorrelations, discretization, etc.)
6. Model Selection and Implementation
7. Model Evaluation
8. Hyperparameter Tuning
9. Model Validation
10. Build ML pipeline!

### Project Scoping

#### Goals  
- Improve my understanding of the Machine Learning concepts I've learned through the Codecademy
  "Machine Learning-Engineering" track by applying them to unfamiliar datasets.
- Choose a dataset from Kaggle
- Import the dataset
- Based on dataset choice, decide what to glean (prediction, classification, etc.)
- Perform EDA & gain solid understanding of the data
- Decide on Machine Learning technique/s
- Build the model
- Test model performance/score
- Publish code to GitHub and Kaggle

#### I decided to use Water Quality (potability) datasets  
- I will pursue a 'classification' approach.
- Based on the features, is the water safe (potable) or unsafe (not potable)?

I will first import the dataset for EDA

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#### I found two datasets on Kaggle and need to check whether they are identical

In [None]:
df = pd.read_csv('water_potability_AdityaKadiwal.csv')

In [None]:
df2 = pd.read_csv('water_potability_LaksikaTharmalingam.csv')

In [None]:
compare_df = pd.merge(df, df2, on=list(df.columns), how='outer', indicator=True)

In [None]:
differences = compare_df[compare_df['_merge'] != 'both']

In [None]:
if len(differences) == 0:
    print("Datasets are identical")
else:
    print("Datasets are NOT identical")
#print("Rows unique to either df or df2:")
#differences

#### They are indeed the same, so I will use the better-documented one:  
Aditya Kadiwal
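
As a cross-check on the merge-based comparison above, pandas also offers `DataFrame.equals`, which treats NaN values in the same position as equal. The sketch below uses small toy frames (hypothetical values, not the Kaggle files) to show the idea. Note that, unlike the outer merge, `equals` also requires the same row order and column dtypes.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the two CSV files (hypothetical values).
a = pd.DataFrame({"ph": [7.0, np.nan, 8.1], "Potability": [0, 1, 1]})
b = a.copy()           # exact copy of a
c = a.copy()
c.loc[2, "ph"] = 8.2   # change a single value

# equals() compares shape, dtypes, and values; NaNs in matching positions count as equal.
same_ab = a.equals(b)
same_ac = a.equals(c)
print(f"a vs b identical: {same_ab}")
print(f"a vs c identical: {same_ac}")
```

With the real files, the equivalent call would be `df.equals(df2)`.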

In [None]:
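# df.info() prints its report directly and returns None, so call it on its own
# rather than embedding it in an f-string.
print(f"Head: \n{df.head()}\n")
print(f"Number of Unique Values: \n{df.nunique()}\n")
print("Information:")
df.info()
print(f"\nDescribe the data: \n{df.describe()}")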

### First Dilemma: What to do with rows that have NULL values in columns?  
1. Remove any rows that have null values - ```df.dropna()```
2. Fill null values with the mean of that column - ```df['col'] = df['col'].fillna(df['col'].mean())```
3. Fill null values with the mode of that column if it is non-numeric - ```df['col'] = df['col'].fillna(df['col'].mode()[0])```
4. Fill null values with the median of that column - ```df['col'] = df['col'].fillna(df['col'].median())```
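
To make the four options concrete, here is a minimal sketch on a toy DataFrame. The values and the non-numeric `source` column are hypothetical and exist only to illustrate the mode-based fill; they are not part of the water potability dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one numeric and one non-numeric column, each with a gap.
toy = pd.DataFrame({
    "ph": [6.0, np.nan, 7.0, 11.0],
    "source": ["well", None, "well", "river"],
})

# Count the nulls per column before deciding how to handle them.
print(toy.isna().sum())

# Option 1: drop every row that contains at least one null.
dropped = toy.dropna()

# Options 2 and 4: fill the numeric column with its mean or its median.
mean_filled = toy["ph"].fillna(toy["ph"].mean())
median_filled = toy["ph"].fillna(toy["ph"].median())

# Option 3: fill the non-numeric column with its most frequent value.
mode_filled = toy["source"].fillna(toy["source"].mode()[0])

print(dropped.shape)
print(mean_filled.tolist())
print(median_filled.tolist())
print(mode_filled.tolist())
```

Each fill returns a new Series, so the original `toy` frame stays unchanged and the strategies can be compared side by side.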

#### Let's create a DataFrame for options 1, 2, and 4  
I can then run each through the model to compare the results

In [None]:
remove_null = df.dropna()
fillWithMean = df.apply(lambda col: col.fillna(col.mean()), axis=0)
fillWithMedian = df.apply(lambda col: col.fillna(col.median()), axis=0)

In [None]:
myDFs = [remove_null,fillWithMean,fillWithMedian]
for x in myDFs:
    print(x.shape)

In [None]:
for x in myDFs:
    print(x.nunique())

#### Conclusion of creating the 3 DataFrames:  
- ```remove_null``` has only 2011 rows, making it the smallest dataset.
- ```fillWithMean``` is the same size as the original (3276 rows), but the filled values add weight to each column's mean.
- ```fillWithMedian``` is the same size as the original (3276 rows), but the filled values add weight to each column's median.
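
One way to see the extra weight that filling places on the center of a column is to compare the spread before and after. This is a small sketch on a hypothetical Series, not the real data: filling with the mean leaves the mean unchanged but shrinks the standard deviation, and filling with the median has a similar effect.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing values.
s = pd.Series([5.0, 6.0, np.nan, 8.0, np.nan, 13.0])

mean_filled = s.fillna(s.mean())
median_filled = s.fillna(s.median())

# pandas skips NaN by default, so s.std() uses only the four observed values.
print(f"original std:      {s.std():.3f}")
print(f"mean-filled std:   {mean_filled.std():.3f}")
print(f"median-filled std: {median_filled.std():.3f}")
```

The same comparison could be run per column on `df`, `fillWithMean`, and `fillWithMedian` using `.std()`.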

#### For the remainder of this code I will use the ```fillWithMean``` dataset, as I anticipate better feature selection and prediction opportunities with more rows of data.

Notes:   
Aside from the Null (NaN) values, this is a clean dataset.  
With that in mind, I will not conduct EDA (Exploratory Data Analysis) beyond the basic methods I've run in the introduction:  
- Unique values
- describe()
- info()