Problem Framing:
* Definition: breaking the problem at hand into smaller tasks that can be addressed individually, which in turn helps define the project's goals and assess its feasibility.

In order to understand the problem:
* State the goal of the product you are developing.
> State the goal in non-ML terms.


* Determine whether the goal is best solved with ML or with traditional programming.
> ML is a specialized tool suitable only for particular problems. You don't want to implement a complex ML solution when a simpler non-ML solution will work.
> To confirm that ML is the right approach, first verify that your current non-ML solution is optimized. If you don't have a non-ML solution implemented, try solving the problem heuristically first (through trial and error).
>> Other things to consider when comparing an ML and a non-ML solution: quality (how much better the ML solution would be) and the cost and maintenance of implementing it.

* Verify the data acquired to train the model.
> To use ML, the data fed to the algorithm needs to meet certain criteria:
>> Abundant: plenty of relevant examples, as this affects the model's quality.
>> Consistent and reliable: data collected consistently from a reliable source produces a better model.
>> Trusted: you know where the data came from and have insight into its source.
>> Available: all inputs are available at prediction time in the correct format.
>> Correct: the labels are correct, which is needed to produce good predictions.
>> Representative: the data reflects the real world and accurately captures the events or behaviors it studies.
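Some of the criteria above can be checked mechanically before training. The following is a minimal sketch, not a complete validation suite: the helper `check_training_data`, the toy DataFrame, and its column names are all hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the columns and values are assumptions for illustration.
df = pd.DataFrame({
    "age": [25, 35, None, 28],
    "education": ["Bachelor", "Master", "PhD", "Master"],
    "target": [0, 1, 0, 2],  # 2 is deliberately outside the expected label set
})

def check_training_data(df, label_col, allowed_labels, required_cols):
    """Return a dict of simple data-quality findings (hypothetical helper)."""
    report = {}
    # Availability: every input needed at prediction time must be present.
    report["missing_columns"] = [c for c in required_cols if c not in df.columns]
    # Consistency: count missing values per column.
    report["null_counts"] = df.isna().sum().to_dict()
    # Correctness: flag any label outside the expected set.
    report["bad_labels"] = sorted(set(df[label_col].dropna()) - set(allowed_labels))
    # Reliability: exact duplicate rows often point to collection problems.
    report["duplicate_rows"] = int(df.duplicated().sum())
    return report

report = check_training_data(
    df,
    label_col="target",
    allowed_labels={0, 1},
    required_cols=["age", "income", "education"],
)
print(report)
```

Checks like these cover only the measurable criteria. Whether the data is trusted or representative still needs a human judgment about where it came from.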

Data Pipelines:
* A Machine Learning pipeline is a process of automating the
workflow of a complete machine learning task.
* A typical pipeline includes raw data input, features, outputs,
model parameters, ML models, and predictions.
* An ML pipeline contains multiple sequential steps, covering
everything from data extraction and pre-processing to
model training and deployment, organized in a
modular way.
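The modular idea can be illustrated with a minimal sketch using scikit-learn's `Pipeline`. The synthetic data and the step names `scaler` and `model` are assumptions chosen for this example; the point is that one step can be swapped without touching the others.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption): two numeric features and a binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Steps run in order: scale the features, then fit the model.
pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
preds = pipe.predict(X)

# Modularity: replace only the model step and refit the whole pipeline.
pipe.set_params(model=DecisionTreeClassifier(random_state=0))
pipe.fit(X, y)
tree_preds = pipe.predict(X)
print(type(pipe.named_steps["model"]).__name__)
```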

In [3]:
!pip install scikit-learn


In [13]:
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [36]:
#Simple dataset
data = {
    'age': [25, 35, 65, 28, 42, 50, 31, 39],
    'income': [50000, 60000, 43000, 89999, 80000, 75000, 62000, 69000],
    'education': ['Bachelor', 'Master', 'High School', 'Master', 'PhD', 'Bachelor', 'Master', 'PhD'],
    'target': [0, 1, 0, 1, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

In [37]:
# Perform one-hot encoding on 'education' column
df = pd.get_dummies(df, columns=['education'])

In [38]:
#Separate features and target
X = df.drop('target', axis=1)
y = df['target']

#Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
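With only eight rows, a random split can leave the test set with a single class. One option, sketched here as an alternative rather than as what the cell above does, is to pass `stratify=y` so that both classes appear in the test set. The data below repeats the `age` and `target` values from the toy dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Same 'age' and 'target' values as the toy dataset above.
X = pd.DataFrame({"age": [25, 35, 65, 28, 42, 50, 31, 39]})
y = pd.Series([0, 1, 0, 1, 1, 0, 1, 0], name="target")

# stratify=y keeps the class proportions the same in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(len(X_train), len(X_test))
print(y_test.value_counts().to_dict())
```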

In [39]:
#Define preprocessing steps for numerical and categorical features
numeric_features = ['age', 'income']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

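# Note: 'education' was already one-hot encoded with pd.get_dummies above,
# so this column no longer exists in X. The transformer below is defined
# for reference but is not used by the preprocessor that follows.
categorical_features = ['education']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder())
])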
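To see what the two transformers do in isolation, here is a small sketch that applies each of them to made-up inputs containing a missing value. The arrays `X_num` and `X_cat` are assumptions for illustration and are not part of the dataset above.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder()),
])

# Made-up numeric column: the NaN is replaced by the median (30),
# then the column is scaled to zero mean and unit variance.
X_num = np.array([[20.0], [30.0], [np.nan], [40.0]])
X_num_out = numeric_transformer.fit_transform(X_num)
print(X_num_out.ravel())

# Made-up categorical column: the NaN becomes the category 'missing',
# then each category gets its own 0/1 column.
X_cat = pd.DataFrame({"education": ["Master", np.nan, "PhD"]})
X_cat_out = categorical_transformer.fit_transform(X_cat)
print(X_cat_out.shape)
```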

In [40]:
#Create the preprocessor using ColumnTransformer (numeric columns only; all other columns are dropped by default)
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features)
])

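#Add the feature selection step. FeatureUnion runs its transformers in parallel on the same input
#and concatenates their outputs, so SelectKBest here sees the raw columns of X, not the preprocessed ones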
feature_union = FeatureUnion(transformer_list=[
    ('preprocessor', preprocessor),
    ('feature_selector', SelectKBest(k='all'))  # k value can be adjusted based on the feature selection needs
])
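Because `FeatureUnion` applies its transformers side by side rather than one after another, its output has the columns of every branch concatenated. The sketch below shows this on synthetic data (an assumption for illustration), with `StandardScaler` standing in for the preprocessor branch.

```python
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest

# Synthetic data (assumption): 20 rows, 2 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = np.array([0, 1] * 10)

union = FeatureUnion(transformer_list=[
    ("scaled", StandardScaler()),        # branch 1: scaled copy of X
    ("selected", SelectKBest(k="all")),  # branch 2: raw X, all columns kept
])
X_union = union.fit_transform(X, y)

# Two columns from each branch are stacked horizontally.
print(X.shape, "->", X_union.shape)
```

With `k="all"` the selector passes every raw column through unchanged, so the model receives each feature twice: once scaled and once unscaled.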

In [41]:
#Create the data pipeline by combining the preprocessor, feature selection, and the model
pipeline = Pipeline(steps=[
    ('feature_union', feature_union),
    ('classifier', LogisticRegression())
])

In [42]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)


In [43]:
#Make predictions on the test data
y_pred = pipeline.predict(X_test)

#Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

Accuracy: 0.5
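In the pipeline above, the categorical transformer is never used and `SelectKBest` runs in parallel on the raw input. A possible alternative, sketched here under the assumption that feature selection is meant to follow preprocessing, keeps `education` as a raw column, lets `ColumnTransformer` handle both column types, and chains the selector sequentially. `handle_unknown="ignore"` is an added assumption so that categories seen only in the test rows do not raise an error.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same toy dataset as above, kept raw (no pd.get_dummies).
df = pd.DataFrame({
    "age": [25, 35, 65, 28, 42, 50, 31, 39],
    "income": [50000, 60000, 43000, 89999, 80000, 75000, 62000, 69000],
    "education": ["Bachelor", "Master", "High School", "Master",
                  "PhD", "Bachelor", "Master", "PhD"],
    "target": [0, 1, 0, 1, 1, 0, 1, 0],
})
X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Each column type goes through its own transformer.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, ["age", "income"]),
    ("cat", categorical_transformer, ["education"]),
])

# Steps run one after another: preprocess, select features, classify.
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("feature_selector", SelectKBest(k="all")),
    ("classifier", LogisticRegression()),
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred)}")
```

Note that the test set here holds only two rows, so any accuracy figure on this toy data is a very coarse estimate.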
