> # MLOps
> ## Overall Machine Learning Operations Process

1. Problem Scoping - define the problem clearly and decide whether ML is a suitable solution
2. Data Acquisition - gather the data from its sources
3. Data Exploration (Preprocessing) - get the data ready for the ML process
4. Modeling (Selection) - select an appropriate model
5. Training - split the data into train/test sets and fit the model on the training set
6. Evaluation - score the model on the test set, then iterate on the earlier steps
7. Deployment - take the ML app to production
8. Monitoring and Maintenance - track the deployed model's performance and retrain when it degrades

### Problem Scoping:
> - Predict the **species** of a flower from physical measurements: **petal length (pl), petal width (pw), sepal length (sl) and sepal width (sw)** - $Classification$

> - Predict the **sepal length**, based on physical parameters like **petal length** - $Regression$
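
The two framings above differ only in which columns serve as inputs and which column is the target. A minimal sketch, assuming the iris dataset bundled with scikit-learn; the names `X_cls`, `y_cls`, `X_reg` and `y_reg` are illustrative and not used elsewhere in this notebook:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["species"] = iris.target

# Classification: inputs are the four measurements, target is the species code (0, 1, 2)
X_cls = df[iris.feature_names]
y_cls = df["species"]

# Regression: input is petal length, target is the continuous sepal length
X_reg = df[["petal length (cm)"]]
y_reg = df["sepal length (cm)"]

print(y_cls.nunique())  # number of distinct classes
print(y_reg.dtype)      # dtype of the continuous target
```

A discrete target with a handful of values points to classification; a continuous target points to regression.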

In [9]:
# pip install scikit-learn

In [14]:
# data acquisition

from sklearn.datasets import load_iris # brings us data from sklearn

data = load_iris()
print(data)

ModuleNotFoundError: No module named 'sklearn'

In [None]:
# observe the data
data.data           # NumPy array of measurements, shape (150, 4)
data.feature_names  # list of the four feature names

data.target         # integer class label per row: 0, 1 or 2
data.target_names   # species name for each class label

# build a DataFrame of iris from these variables
import pandas as pd
df = pd.DataFrame(data.data, columns=data.feature_names)

# add a column by assigning to a new column name
df['species'] = data.target
# species holds the codes 0, 1, 2 rather than the species names
df.head(2)

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
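
The `species` column holds the codes 0, 1 and 2. One way to see the species names instead is to index `data.target_names` with the codes. This is a sketch; the `species_name` column is an illustrative addition and is not used in the rest of the notebook:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["species"] = data.target

# data.target_names[i] is the name for code i, so indexing with the
# whole target array yields one name per row
df["species_name"] = data.target_names[data.target]

print(df[["species", "species_name"]].drop_duplicates())
```

The numeric `species` column stays in place because the models below need numbers, not strings.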


> ### Preprocessing of data
- Machine Learning models understand numeric data, not categorical, string or alphanumeric values; they work with numbers, including 0s and 1s
- Encoding - converting categorical values into numbers
- to be covered next time
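
As a preview of encoding, here is a minimal sketch using scikit-learn's `LabelEncoder`, which maps string labels to integer codes. The `species` list below is hypothetical sample data, not taken from the DataFrame above:

```python
from sklearn.preprocessing import LabelEncoder

# hypothetical string labels, as they might appear in a raw dataset
species = ["setosa", "virginica", "versicolor", "setosa"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)  # strings -> integer codes

print(codes)                             # one integer code per label
print(encoder.classes_)                  # sorted class names; position = code
print(encoder.inverse_transform(codes))  # codes -> original strings
```

`LabelEncoder` is intended for target labels. Input features that are categorical usually call for a different encoder, such as one-hot encoding.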

In [None]:
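# observe the data
# ML models only accept numerical input
# check data health: missing values and dtypes
# check class balance: iris has 50 rows for each species;
# an imbalanced dataset might have 100 rows for one class while the others have only 20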

df.info()
df.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   sepal length (cm)  150 non-null    float64
 1   sepal width (cm)   150 non-null    float64
 2   petal length (cm)  150 non-null    float64
 3   petal width (cm)   150 non-null    float64
 4   species            150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
species              0
dtype: int64
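
The class balance mentioned in the cell comments can be checked directly by counting rows per species. A minimal sketch, rebuilding the target as a Series so that the block runs on its own; `is_balanced` is an illustrative name:

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
species = pd.Series(data.target, name="species")

# number of rows per class code, ordered by code
counts = species.value_counts().sort_index()
print(counts)

# balanced means every class has the same number of rows
is_balanced = counts.nunique() == 1
print(is_balanced)
```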

- data is clean, usable, preprocessed
- Let's use the data we have here. 

> ## Modeling
- selecting the most suitable model for a problem
- different types of problems have different types of models that work best on them

Based on type of problems: 
- $Regression$ - predicting continuous values, e.g. the price of a house, temperature, a stock price
- $Classification$ - predicting classes, i.e. discrete values like 0, 1, 2, 3 drawn from a small set of classes

$Models$ $used$ $for$ $Regression$:
- Linear Regression
- Decision Trees
- Polynomial Regression
- Ridge Regression
- Lasso Regression
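
Each model in the list above has a scikit-learn counterpart with the same `fit`/`predict` interface. A sketch of how they might be instantiated; the hyperparameter values (`degree=2`, `alpha=1.0`, `alpha=0.1`, `random_state=42`) are illustrative assumptions, not tuned choices. Polynomial regression has no dedicated estimator, so it is built here as a pipeline of `PolynomialFeatures` followed by `LinearRegression`:

```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.tree import DecisionTreeRegressor

models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    # polynomial regression = polynomial feature expansion + linear regression
    "Polynomial Regression": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
}

for name, estimator in models.items():
    print(name, "->", type(estimator).__name__)
```

Because the interface is shared, any of these can be swapped in wherever `LinearRegression()` is used in this notebook.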

In [15]:
# Linear Regression
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model
# the model is created but not fitted, i.e. it has not been trained yet

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


## Training
- Train the model by fitting it to data
- Fit the model on the training data only; hold back part of the labelled data so the model never sees it
- The held-back part is used to test the model

In [None]:
# The data is split into 4 parts:
# training data -> X_train (input features) and y_train (target)
# the model learns how petal length (y) varies with sepal length (X)

# testing data -> we give the model X_test and it returns predictions, y_pred
# we then compare y_pred with y_test

X = df[['sepal length (cm)']] # x can have multiple features, input features can be multiple
y = df['petal length (cm)']

# about 150 rows, 20% of data will be kept for testing - hidden from model


# train test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# sepalL  sepalL  petalL   petalL 

In [28]:
y_test.shape

# training part has - 120 rows    = 80 % of 150
# testing part has - 30           = 20% of 150

(30,)
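
The split above is random, so each run can put different rows in the test set. Passing `random_state` fixes the shuffle. A minimal sketch, assuming the same `X` and `y` columns as above, that checks both the 120/30 split sizes and that two splits with the same seed select the same test rows; `split_a`, `split_b` and `same_rows` are illustrative names:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df[["sepal length (cm)"]]
y = df["petal length (cm)"]

# same seed -> same shuffle -> same rows in each part
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# compare the row indices of the two test sets
same_rows = split_a[1].index.equals(split_b[1].index)
print(same_rows)
```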

In [None]:
# train the model
model.fit(X_train,y_train) # our model is now trained

| Parameter | Value |
|---|---|
| fit_intercept | True |
| copy_X | True |
| tol | 1e-06 |
| n_jobs | None |
| positive | False |


## Evaluation
- make predictions on the test data
- check the model's performance by comparing predictions with the true values


In [36]:
X

| | sepal length (cm) |
|---|---|
| 0 | 5.1 |
| 1 | 4.9 |
| 2 | 4.7 |
| 3 | 4.6 |
| 4 | 5.0 |
| ... | ... |
| 145 | 6.7 |
| 146 | 6.3 |
| 147 | 6.5 |
| 148 | 6.2 |


In [37]:
y_predict = model.predict(X_test)
y_predict


# predicted petal length (cm) for a flower with a sepal length of 5.2 cm
petalLfor3 = model.predict([[5.2]])
petalLfor3



array([2.50120442])
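
Predictions alone do not say how good the model is. To check performance, the predictions can be compared with `y_test` using a metric. A minimal sketch using mean absolute error and R², assuming the same single-feature setup as above; the `random_state=42` seed is an assumption added so the numbers are repeatable, and the scores will differ from those of the unseeded split used earlier:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)
X = df[["sepal length (cm)"]]
y = df["petal length (cm)"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
y_predict = model.predict(X_test)

# average absolute gap between predicted and true petal length, in cm
mae = mean_absolute_error(y_test, y_predict)
# share of the variance in y_test explained by the model (1.0 is perfect)
r2 = r2_score(y_test, y_predict)

print(f"MAE: {mae:.3f} cm")
print(f"R^2: {r2:.3f}")
```

A lower MAE and an R² closer to 1 indicate a better fit on the held-back data.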

# Please train a Linear Regression Model to do this:

# **petal width** ------> predict -----> **sepal width**

In [67]:
X = df[['sepal length (cm)','sepal width (cm)', 'petal width (cm)']]
y = df['petal length (cm)']
# change the code above

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # setting the seed in the random function


model = LinearRegression()
model.fit(X_train,y_train)


y_predict = model.predict(X_test)
y_predict

# this model predicts petal length (cm), not sepal width:
# inputs are sepal length 4.7, sepal width 3.2 and petal width 0.2
petal_length_pred = model.predict([[4.7, 3.2, 0.2]])
petal_length_pred





array([1.39392486])
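
The cell above predicts petal length from three features. For the exercise as originally worded (petal width as the input, sepal width as the target), a minimal sketch could look like the following. It is one possible solution under the same split settings, not the only one; `query` and `sepal_width_pred` are illustrative names:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = load_iris()
df = pd.DataFrame(data.data, columns=data.feature_names)

X = df[["petal width (cm)"]]   # input feature
y = df["sepal width (cm)"]     # target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

# if petal width is 0.2 cm, what sepal width does the model predict?
query = pd.DataFrame({"petal width (cm)": [0.2]})
sepal_width_pred = model.predict(query)
print(sepal_width_pred)
```

Passing the query as a DataFrame with the training column name keeps the input consistent with the data the model was fitted on.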