<a href="https://colab.research.google.com/github/karim-mttk/machine-learning-basics/blob/main/ML_Project1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project 1
## Objectives:
* Differentiate between classification and regression problems, and use `DummyClassifier` and `DummyRegressor` as baseline models.
* Understand the fit and predict paradigm, and evaluate model performance with the `score` method of scikit-learn models.
* Explore decision trees, which are widely used for classification and regression problems.
* Understand how decision tree prediction works, and build decision trees with scikit-learn's `DecisionTreeClassifier` and `DecisionTreeRegressor` models.
* Distinguish between parameters and hyperparameters, and see how they affect model performance.
* Understand decision boundaries and how they relate to model complexity.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, export_graphviz

## Acquire & prepare data: 

In [2]:
url = "https://raw.githubusercontent.com/karim-mttk/machine-learning-basics/main/data/grade-toy-classification.csv"
classi_df = pd.read_csv(url)

In [3]:
classi_df.head()

Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,1,92,93,84,91,92,A+
1,1,0,94,90,80,83,91,not A+
2,0,0,78,85,83,80,80,not A+
3,0,1,91,94,92,91,89,A+
4,0,1,77,83,90,92,85,A+


* **Features**: x, the input columns used to make a prediction
* **Target**: y, the column to be predicted
* **Training**: the process of learning a function that maps x to y


In [4]:
url2 = "https://raw.githubusercontent.com/karim-mttk/machine-learning-basics/main/data/kc_house_data.csv"
housingPrice_df = pd.read_csv(url2)
housingPrice_df.drop(["id", "date"], axis = 1, inplace=True)
housingPrice_df.head()

Unnamed: 0,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,221900.0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,538000.0,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,180000.0,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,604000.0,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,510000.0,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


Set the features and target for the housing price table:

In [5]:
x = housingPrice_df.drop(columns=["price"])
y = housingPrice_df["price"]
y.head()

0    221900.0
1    538000.0
2    180000.0
3    604000.0
4    510000.0
Name: price, dtype: float64

In [6]:
x.head()

Unnamed: 0,bedrooms,bathrooms,sqft_living,sqft_lot,floors,waterfront,view,condition,grade,sqft_above,sqft_basement,yr_built,yr_renovated,zipcode,lat,long,sqft_living15,sqft_lot15
0,3,1.0,1180,5650,1.0,0,0,3,7,1180,0,1955,0,98178,47.5112,-122.257,1340,5650
1,3,2.25,2570,7242,2.0,0,0,3,7,2170,400,1951,1991,98125,47.721,-122.319,1690,7639
2,2,1.0,770,10000,1.0,0,0,3,6,770,0,1933,0,98028,47.7379,-122.233,2720,8062
3,4,3.0,1960,5000,1.0,0,0,5,7,1050,910,1965,0,98136,47.5208,-122.393,1360,5000
4,3,2.0,1680,8080,1.0,0,0,3,8,1680,0,1987,0,98074,47.6168,-122.045,1800,7503


## Analyse data:

### Supervised & Unsupervised learning:

Supervised learning is a machine learning approach in which the model is trained on labeled data, where each observation in the training dataset is associated with a known output value. The objective of supervised learning is to learn a function that maps the input to the output based on the labeled data, and to generalize this function to make accurate predictions on new, unseen data. Common examples of supervised learning problems include regression, where the output is a continuous value, and classification, where the output is a discrete value.

Unsupervised learning, on the other hand, is a machine learning approach in which the model is trained on unlabeled data, where there is no known output value associated with each observation. The objective of unsupervised learning is to find patterns or structure in the data that can help us understand it better. Common examples of unsupervised learning problems include clustering, where the goal is to group similar observations together, and dimensionality reduction, where the goal is to reduce the number of features in the dataset while preserving as much of the original information as possible.
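
The contrast between the two approaches shows up directly in how an estimator is fitted. The sketch below is a minimal illustration on a hypothetical toy array (the values are not taken from the course data): the supervised estimator receives both features and labels, while the unsupervised one (`KMeans`, chosen here only as a familiar clustering example) receives features alone.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: two scores per student (illustrative values only).
X = np.array([[92, 93], [94, 90], [78, 85], [91, 94], [60, 62], [58, 65]])
y = np.array(["A+", "not A+", "not A+", "A+", "not A+", "not A+"])

# Supervised: fit() receives the features AND the known labels.
supervised = DummyClassifier(strategy="most_frequent").fit(X, y)
label_predictions = supervised.predict(X)

# Unsupervised: fit() receives only the features and groups similar rows.
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = unsupervised.labels_

print(label_predictions)  # values drawn from the labels in y
print(cluster_ids)        # cluster indices (0 or 1) with no label meaning
```

The cluster indices are arbitrary identifiers; unlike the supervised predictions, they carry no meaning such as "A+" unless someone interprets the groups afterwards.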

**Classification:**
*   Multiclass classification: the goal is to predict a categorical output with more than two possible values. For example, predicting the species of a flower from its features (e.g., iris setosa, iris versicolor, or iris virginica) is a multiclass classification problem, because the output can take one of three possible values.

*   Binary classification: the goal is to predict an output that can take only one of two possible values. For example, predicting whether an email is spam is a binary classification problem, because the output is either "spam" or "not spam".



**Regression**:
the goal is to predict a continuous output value. The model learns, from a set of labeled training data, a function that maps the input data to a continuous output.
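
As a rough check of which kind of problem a target column represents, scikit-learn provides `type_of_target` in `sklearn.utils.multiclass`. The sketch below applies it to three hypothetical target lists (the values are made up for illustration). It is a heuristic based on the values alone: float targets that happen to be whole numbers are reported as a classification type, so the result should be read as a hint rather than a decision.

```python
from sklearn.utils.multiclass import type_of_target

# Hypothetical targets, one per problem type described above.
binary_target = ["A+", "not A+", "A+", "not A+"]
multiclass_target = ["setosa", "versicolor", "virginica", "setosa"]
continuous_target = [221900.5, 538000.25, 180000.75]

binary_kind = type_of_target(binary_target)
multiclass_kind = type_of_target(multiclass_target)
continuous_kind = type_of_target(continuous_target)

print(binary_kind)      # binary
print(multiclass_kind)  # multiclass
print(continuous_kind)  # continuous
```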

In [7]:
classi_df.shape

(21, 8)

In [8]:
# an example of regression data
url3 = "https://raw.githubusercontent.com/karim-mttk/machine-learning-basics/main/data/grade-toy-regression.csv"
regres_df = pd.read_csv(url3)
regres_df.head()


Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,1,92,93,84,91,92,90
1,1,0,94,90,80,83,91,84
2,0,0,78,85,83,80,80,82
3,0,1,91,94,92,91,89,92
4,0,1,77,83,90,92,85,90


In [9]:
housingPrice_df.shape

(21613, 19)

### Creating a very basic supervised machine learning model:

In [10]:
classi_df['quiz2'].value_counts()

not A+    11
A+        10
Name: quiz2, dtype: int64

**not A+** occurs more often than **A+** (11 versus 10 examples).

**Baseline:** A baseline machine learning algorithm is a simple model used to set a benchmark for the performance of more complex models. It provides a way to check the effectiveness of a given model and identify potential issues.

Using `sklearn`'s baseline model, `DummyClassifier`, to train a classifier:

1. Read the data
2. Separate the features (x) and target (y) variables
3. Create a classification model object
4. Train (`fit`) the model using the training data
5. `Predict` the target variable for new examples using the trained model
6. Evaluate the model's performance with `score`, which returns accuracy for scikit-learn classifiers. Other metrics such as precision, recall, or F1 score are computed separately.
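
The six steps above can be sketched end to end in a few lines. This is a minimal illustration, assuming a small hypothetical in-memory table (`toy_df`, with made-up values) in place of the CSV file; the names `X_toy`, `y_toy`, `baseline`, and `new_example` are introduced only for this sketch.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# 1. Read the data (a hypothetical table standing in for the CSV file).
toy_df = pd.DataFrame(
    {
        "lab1": [92, 94, 78, 91, 77],
        "quiz1": [92, 91, 80, 89, 85],
        "quiz2": ["A+", "not A+", "not A+", "A+", "not A+"],
    }
)

# 2. Separate the features and the target.
X_toy = toy_df.drop(columns=["quiz2"])
y_toy = toy_df["quiz2"]

# 3. Create the model object.
baseline = DummyClassifier(strategy="most_frequent")

# 4. Fit the model on the training data.
baseline.fit(X_toy, y_toy)

# 5. Predict the target for a new example with the same feature columns.
new_example = pd.DataFrame({"lab1": [88], "quiz1": [90]})
prediction = baseline.predict(new_example)

# 6. Score the model (accuracy on the training data).
accuracy = baseline.score(X_toy, y_toy)

print(prediction)  # ['not A+']
print(accuracy)    # 0.6
```

Because "not A+" appears in three of the five rows, the baseline predicts it everywhere and is correct on those three rows.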
  



1. Read the data:

In [11]:
classi_df.head()

Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1,quiz2
0,1,1,92,93,84,91,92,A+
1,1,0,94,90,80,83,91,not A+
2,0,0,78,85,83,80,80,not A+
3,0,1,91,94,92,91,89,A+
4,0,1,77,83,90,92,85,A+


2. Create features (x) and target (y):

In [12]:
x = classi_df.drop(columns=["quiz2"])
y = classi_df["quiz2"]
y.head()

0        A+
1    not A+
2    not A+
3        A+
4        A+
Name: quiz2, dtype: object

In [13]:
x.head()

Unnamed: 0,ml_experience,class_attendance,lab1,lab2,lab3,lab4,quiz1
0,1,1,92,93,84,91,92
1,1,0,94,90,80,83,91
2,0,0,78,85,83,80,80
3,0,1,91,94,92,91,89
4,0,1,77,83,90,92,85


3. Create an object of the classifier:

In [14]:
#import the classifier
from sklearn.dummy import DummyClassifier

#create the classifier object
dummy_classi = DummyClassifier(strategy="most_frequent")

4. `fit` the model:

In [15]:
#fit the classifier object:
dummy_classi.fit(x, y);

5. `predict` the target:

In [16]:
#predict using the trained classifier object:
dummy_classi.predict(x)

array(['not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+', 'not A+', 'not A+', 'not A+',
       'not A+', 'not A+', 'not A+'], dtype='<U6')

6. `score` the model using accuracy, where **accuracy = correct predictions / total examples**:

In [18]:
print("The accuracy of the model based on training data: %0.3f" % (dummy_classi.score(x,y)))


The accuracy of the model based on training data: 0.524


The error of the model is 1 - accuracy:

In [19]:
print("Error of the model based on training data: %0.3f" % (1 - dummy_classi.score(x,y)))

Error of the model based on training data: 0.476
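
The accuracy and error reported above can be reproduced by hand from the class counts shown by `value_counts()` (11 "not A+" and 10 "A+"). The sketch below rebuilds the labels from those counts rather than reloading the CSV; the row order is arbitrary and does not affect the result.

```python
import numpy as np

# Labels rebuilt from the class counts: 11 "not A+" and 10 "A+".
y_true = np.array(["not A+"] * 11 + ["A+"] * 10)

# A most-frequent baseline predicts "not A+" for every example.
y_pred = np.full(y_true.shape, "not A+")

# accuracy = correct predictions / total examples
accuracy = np.mean(y_true == y_pred)
error = 1 - accuracy

print(f"accuracy: {accuracy:.3f}")  # accuracy: 0.524
print(f"error: {error:.3f}")        # error: 0.476
```

In other words, the accuracy of a most-frequent baseline is simply the share of the majority class, here 11/21.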


Predicting on new data with the trained model:

In [24]:
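# Each new example needs one value per feature column (7 here).
# The final value in the second row is an illustrative placeholder.
new_ex = [[0, 1, 92, 90, 95, 93, 92], [1, 1, 92, 93, 94, 92, 90]]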
dummy_classi.predict(new_ex)

array(['not A+', 'not A+'], dtype='<U6')

Building a baseline regression model with `DummyRegressor`:

In [25]:
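from sklearn.dummy import DummyRegressor

# 1. Read the data
reg_df = pd.read_csv(url3)
# 2. Create features (x) and target (y)
x = reg_df.drop(columns=["quiz2"])
y = reg_df["quiz2"]
# 3. Create the regressor object (the default strategy predicts the mean of y)
reg = DummyRegressor()
# 4. Fit the model
reg.fit(x, y)
# 5. Score the model (R^2 on the training data; displayed in the next cell)
reg.score(x, y)
# 6. Predict the target for new examples (one value per feature column;
#    the final value in the second row is an illustrative placeholder)
new_ex = [[0, 1, 92, 90, 95, 93, 92], [1, 1, 92, 93, 94, 92, 90]]
reg.predict(new_ex)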


array([86.28571429, 86.28571429])

In [26]:
reg.score(x,y)

0.0

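By default, `DummyRegressor` predicts the mean of the training **y** values for every example, which is why its R² `score` on the training data is 0.0.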
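
To see where a score of 0.0 comes from, the sketch below computes R² by hand as 1 - SS_res / SS_tot on a hypothetical target vector (made-up values, not the course data). When every prediction equals the mean of y, the residual sum of squares equals the total sum of squares, so the ratio is 1 and R² is 0.

```python
import numpy as np
from sklearn.dummy import DummyRegressor

# Hypothetical target values; the features are placeholders because
# DummyRegressor ignores them.
y_toy = np.array([90.0, 84.0, 82.0, 92.0, 90.0])
X_toy = np.zeros((len(y_toy), 1))

reg_toy = DummyRegressor(strategy="mean").fit(X_toy, y_toy)
pred = reg_toy.predict(X_toy)

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_toy - pred) ** 2)          # squared errors of the model
ss_tot = np.sum((y_toy - y_toy.mean()) ** 2)  # squared errors of the mean
r2_by_hand = 1 - ss_res / ss_tot

print(pred[0], y_toy.mean())                    # both are the mean of y_toy
print(r2_by_hand, reg_toy.score(X_toy, y_toy))  # both are (approximately) 0.0
```

A model that scores above 0.0 on the same data is therefore doing better than always predicting the mean.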
