# What we're covering in the Scikit-Learn Introduction

This notebook outlines the content convered in the Scikit-Learn Introduction.

It's a quick stop to see all the Scikit-Learn functions and modules for each section outlined.

What we're covering follows the following diagram detailing a Scikit-Learn workflow.

<img src="../images/sklearn-workflow-title.png"/>

## 0. Standard library imports

For all machine learning projects, you'll often see these libraries (Matplotlib, NumPy and pandas) imported at the top.

In [15]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

We'll use 2 datasets for demonstration purposes.
* `heart_disease` - a classification dataset (predicting whether someone has heart disease or not)
* `boston_df` - a regression dataset (predicting the median house prices of cities in Boston)

In [16]:
# Classification data
heart_disease = pd.read_csv("../data/heart-disease.csv")

# Regression data
from sklearn.datasets import fetch_california_housing
boston = fetch_california_housing() # loads as dictionary
# Convert dictionary to dataframe
boston_df = pd.DataFrame(boston["data"], columns=boston["feature_names"])
boston_df["target"] = pd.Series(boston["target"])


## 1. Get the data ready

In [17]:
# Split data into X & y
X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X

In [18]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split
# Example use case (requires X & y)
X_train, X_test, y_train, y_test = train_test_split(X, y)

## 2. Pick a model/estimator (to suit your problem)
To pick a model we use the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html).

<img src="../images/sklearn-ml-map.png" width=400/>

**Note:** Scikit-Learn refers to machine learning models and algorithms as estimators.

In [19]:
# Random Forest Classifier (for classification problems)
from sklearn.ensemble import RandomForestClassifier
# Instantiating a Random Forest Classifier (clf short for classifier)
clf = RandomForestClassifier()

In [20]:
# Random Forest Regressor (for regression problems)
from sklearn.ensemble import RandomForestRegressor
# Instantiating a Random Forest Regressor
model = RandomForestRegressor()

## 3. Fit the model to the data and make a prediction


In [21]:
# All models/estimators have the fit() function built-in
clf.fit(X_train, y_train)

# Once fit is called, you can make predictions using predict()
y_preds = clf.predict(X_test)

# You can also predict with probabilities (on classification models)
y_probs = clf.predict_proba(X_test)

# View preds/probabilities
y_preds, y_probs

(array([0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0,
        0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0,
        1, 0, 0, 1, 0, 0, 1, 0, 0, 0], dtype=int64),
 array([[0.74, 0.26],
        [0.21, 0.79],
        [0.87, 0.13],
        [0.49, 0.51],
        [0.91, 0.09],
        [0.4 , 0.6 ],
        [0.94, 0.06],
        [0.95, 0.05],
        [0.03, 0.97],
        [0.52, 0.48],
        [0.34, 0.66],
        [0.89, 0.11],
        [0.3 , 0.7 ],
        [0.64, 0.36],
        [0.68, 0.32],
        [0.03, 0.97],
        [0.84, 0.16],
        [0.32, 0.68],
        [0.23, 0.77],
        [0.12, 0.88],
        [0.01, 0.99],
        [0.93, 0.07],
        [0.7 , 0.3 ],
        [0.88, 0.12],
        [0.56, 0.44],
        [0.31, 0.69],
        [0.9 , 0.1 ],
        [0.92, 0.08],
        [0.56, 0.44],
        [0.09, 0.91],
        [0.69, 0.31],
        [0.54, 0.46],
        [0.65, 0.35],

## 4. Evaluate the model

Every Scikit-Learn model has a default metric which is accessible through the `score()` function.

However there are a range of different evaluation metrics you can use depending on the model you're using.

A full list of evaluation metrics can be [found in the documentation](https://scikit-learn.org/stable/modules/model_evaluation.html).

In [22]:
# All models/estimators have a score() function
clf.score(X_test, y_test)

0.8552631578947368

In [23]:
# Evaluting a model using cross-validation is possible with cross_val_score
from sklearn.model_selection import cross_val_score

# scoring=None means default score() metric is used
print(cross_val_score(estimator=clf, 
                      X=X, 
                      y=y, 
                      cv=5, # use 5-fold cross-validation
                      scoring=None)) 

# Evaluate a model with a different scoring method
print(cross_val_score(estimator=clf, 
                      X=X, 
                      y=y,
                      cv=5, # use 5-fold cross-validation
                      scoring="precision"))

[0.80327869 0.91803279 0.81967213 0.8        0.76666667]
[0.83333333 0.90322581 0.84375    0.78787879 0.76315789]


In [24]:
# Different classification metrics

# Accuracy
from sklearn.metrics import accuracy_score
print(accuracy_score(y_test, y_preds))

# Reciver Operating Characteristic (ROC curve)/Area under curve (AUC)
from sklearn.metrics import roc_curve, roc_auc_score
false_positive_rate, true_positive_rate, thresholds = roc_curve(y_test, y_probs[:, 1])
print(roc_auc_score(y_test, y_preds))

# Confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, y_preds))

# Classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, y_preds))

0.8552631578947368
0.8548163548163548
[[31  6]
 [ 5 34]]
              precision    recall  f1-score   support

           0       0.86      0.84      0.85        37
           1       0.85      0.87      0.86        39

    accuracy                           0.86        76
   macro avg       0.86      0.85      0.86        76
weighted avg       0.86      0.86      0.86        76



In [25]:
# # Different regression metrics

# # Make predictions first
# X = boston_df.drop("target", axis=1)
# y = boston_df["target"]
# X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# model = RandomForestRegressor()
# model.fit(X_train, y_train)
# y_preds = model.predict(X_test)

# # R^2 (pronounced r-squared) or coefficient of determination
# from sklearn.metrics import r2_score
# print(r2_score(y_test, y_preds))

# # Mean absolute error (MAE)
# from sklearn.metrics import mean_absolute_error
# print(mean_absolute_error(y_test, y_preds))

# # Mean square error (MSE)
# from sklearn.metrics import mean_squared_error
# print(mean_squared_error(y_test, y_preds))

## 5. Improve through experimentation

Two of the main methods to improve a models baseline metrics (the first evaluation metrics you get).

From a data perspective asks:
* Could we collect more data? In machine learning, more data is generally better, as it gives a model more opportunities to learn patterns.
* Could we improve our data? This could mean filling in misisng values or finding a better encoding (turning things into numbers) strategy.

From a model perspective asks:
* Is there a better model we could use? If you've started out with a simple model, could you use a more complex one? (we saw an example of this when looking at the [Scikit-Learn machine learning map](https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html), ensemble methods are generally considered more complex models)
* Could we improve the current model? If the model you're using performs well straight out of the box, can the **hyperparameters** be tuned to make it even better?

**Hyperparameters** are like settings on a model you can adjust so some of the ways it uses to find patterns are altered and potentially improved. Adjusting hyperparameters is referred to as hyperparameter tuning.

In [26]:
# How to find a model's hyperparameters
clf = RandomForestClassifier()
clf.get_params() # returns a list of adjustable hyperparameters

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [27]:
# Example of adjusting hyperparameters by hand

# Split data into X & y
X = heart_disease.drop("target", axis=1) # use all columns except target
y = heart_disease["target"] # we want to predict y using X

# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y)

# Instantiate two models with different settings
clf_1 = RandomForestClassifier(n_estimators=100)
clf_2 = RandomForestClassifier(n_estimators=200)

# Fit both models on training data
clf_1.fit(X_train, y_train)
clf_2.fit(X_train, y_train)

# Evaluate both models on test data and see which is best
print(clf_1.score(X_test, y_test))
print(clf_2.score(X_test, y_test))

0.8026315789473685
0.8157894736842105


In [28]:
# Example of adjusting hyperparameters computationally (recommended)

from sklearn.model_selection import RandomizedSearchCV

# Define a grid of hyperparameters
grid = {"n_estimators": [10, 100, 200, 500, 1000, 1200],
        "max_depth": [None, 5, 10, 20, 30],
        "max_features": ["auto", "sqrt"],
        "min_samples_split": [2, 4, 6],
        "min_samples_leaf": [1, 2, 4]}

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Set n_jobs to -1 to use all cores (NOTE: n_jobs=-1 is broken as of 8 Dec 2019, using n_jobs=1 works)
clf = RandomForestClassifier(n_jobs=1)

# Setup RandomizedSearchCV
rs_clf = RandomizedSearchCV(estimator=clf,
                            param_distributions=grid,
                            n_iter=10, # try 10 models total
                            cv=5, # 5-fold cross-validation
                            verbose=2) # print out results

# Fit the RandomizedSearchCV version of clf
rs_clf.fit(X_train, y_train);

# Find the best hyperparameters
print(rs_clf.best_params)

# Scoring automatically uses the best hyperparameters
rs_clf.score(X_test, y_test)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.0s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.0s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.0s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.0s
[CV] END max_depth=5, max_features=auto, min_samples_leaf=4, min_samples_split=2, n_estimators=100; total time=   0.0s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=1000; total time=   0.0s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=1000; total time=   0.0s
[CV] END max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=6, n_estimators=1000; 

40 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
40 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\mnour\Desktop\machine learing\projekts\env\lib\site-packages\sklearn\model_selection\_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\mnour\Desktop\machine learing\projekts\env\lib\site-packages\sklearn\base.py", line 1144, in wrapper
    estimator._validate_params()
  File "C:\Users\mnour\Desktop\machine learing\projekts\env\lib\site-packages\sklearn\base.py", line 637, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\mnour\Desktop\machine learing\projekts\env\lib\site-package

AttributeError: 'RandomizedSearchCV' object has no attribute 'best_params'

## 6. Save and reload your trained model
You can save and load a model with `pickle`.

In [29]:
# Saving a model with pickle
import pickle

# Save an existing model to file
pickle.dump(rs_clf, open("rs_random_forest_model_1.pkl", "wb"))

In [30]:
# Load a saved pickle model
loaded_pickle_model = pickle.load(open("rs_random_forest_model_1.pkl", "rb"))

# Evaluate loaded model
loaded_pickle_model.score(X_test, y_test)

0.8524590163934426

You can do the same with `joblib`. `joblib` is usually more efficient with numerical data (what our models are).

In [31]:
# Saving a model with joblib
from joblib import dump, load

# Save a model to file
dump(rs_clf, filename="gs_random_forest_model_1.joblib") 

['gs_random_forest_model_1.joblib']

In [32]:
# Import a saved joblib model
loaded_joblib_model = load(filename="gs_random_forest_model_1.joblib")

In [33]:
# Evaluate joblib predictions 
loaded_joblib_model.score(X_test, y_test)

0.8524590163934426

## 7. Putting it all together (not pictured)

We can put a number of different Scikit-Learn functions together using `Pipeline`.

As an example, we'll use `car-sales-extended-missing-data.csv`. Which has missing data as well as non-numeric data. For a machine learning model to work, there can be no missing data or non-numeric values.

The problem we're solving here is predicting a cars sales price given a number of parameters about the car (a regression problem).

In [34]:
# Getting data ready
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Modelling
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split, GridSearchCV

# Setup random seed
import numpy as np
np.random.seed(42)

# Import data and drop the rows with missing labels
data = pd.read_csv("../data/car-sales-extended-missing-data.csv")
data.dropna(subset=["Price"], inplace=True)

# Define different features and transformer pipelines
categorical_features = ["Make", "Colour"]
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))])

door_feature = ["Doors"]
door_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=4))])

numeric_features = ["Odometer (KM)"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean"))
])

# Setup preprocessing steps (fill missing values, then convert to numbers)
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("door", door_transformer, door_feature),
        ("num", numeric_transformer, numeric_features)])

# Create a preprocessing and modelling pipeline
model = Pipeline(steps=[("preprocessor", preprocessor),
                        ("model", RandomForestRegressor())])

# Split data
X = data.drop("Price", axis=1)
y = data["Price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Fit and score the model
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.22188417408787875

In [35]:
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [36]:
X= heart_disease.drop("target",axis=1)
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [37]:
y= heart_disease["target"]
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [38]:
from sklearn.model_selection import train_test_split

x_train,x_test ,y_train,y_text = train_test_split(X,y,test_size=0.2)

In [39]:
x_train.shape,x_test.shape ,y_train.shape,y_text.shape

((242, 13), (61, 13), (242,), (61,))

In [40]:
X.shape

(303, 13)

In [41]:
len(heart_disease)

303

# 1.1 Make sure its all numerical 

In [42]:
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [43]:
len(car_sales)

1000

In [44]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object

In [45]:
# Spli into x/y
x= car_sales.drop("Price",axis=1)
y= car_sales["Price"]
# split into training and test 
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)

In [46]:
#Build machine learing model 
from sklearn.ensemble import RandomForestRegressor # der kann zahl ausgeben
model = RandomForestRegressor()
model.fit(x_train,y_train)
# model.score(x_test,y_train)

ValueError: could not convert string to float: 'Toyota'

## convert data to number 
### Trun the categories into numbers 

In [47]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# car_sales.columns
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
tansformer  = ColumnTransformer([("one_hot",one_hot,categorical_features)]
                               ,remainder="passthrough")
tansformed_x = tansformer.fit_transform(x)
pd.DataFrame(tansformed_x)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,35431.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,192714.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,84714.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,154365.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,181577.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,35820.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,155144.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,66604.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,215883.0


In [48]:
dummies = pd.get_dummies(car_sales)
dummies

Unnamed: 0,Odometer (KM),Doors,Price,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,35431,4,15323,False,True,False,False,False,False,False,False,True
1,192714,5,19943,True,False,False,False,False,True,False,False,False
2,84714,4,28343,False,True,False,False,False,False,False,False,True
3,154365,4,13434,False,False,False,True,False,False,False,False,True
4,181577,3,14043,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
995,35820,4,32042,False,False,False,True,True,False,False,False,False
996,155144,3,5716,False,False,True,False,False,False,False,False,True
997,66604,4,31570,False,False,True,False,False,True,False,False,False
998,215883,4,4001,False,True,False,False,False,False,False,False,True


In [49]:
# # my exsampel 
# x= dummies.drop("Price",axis=1)
# y= dummies["Price"]
# # split into training and test 
# x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)
# model = RandomForestRegressor(n_estimators=100)
# model.fit(x_train,y_train)
# model.score(x_test,y_test)

In [50]:
# Lets refit the model
np.random.seed(42)
x_train,x_test,y_train,y_test = train_test_split(tansformed_x,y,test_size=0.2)
model = RandomForestRegressor(n_estimators=100)
model.fit(x_train,y_train)
model.score(x_test,y_test)

0.3235867221569877

# 1.2 What if there were missing values ? 
 ### 1. Fillthem with some value (also kmowm as imputation).
 ### 2. Remove the samples with missing data altogetether 

In [51]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum(),len(car_sales_missing)

(Make             49
 Colour           50
 Odometer (KM)    50
 Doors            50
 Price            50
 dtype: int64,
 1000)

In [52]:
xm = car_sales_missing.drop("Price",axis=1)
ym = car_sales_missing["Price"]

### option 1: Fill missing data with Pandas 

In [53]:
car_sales_missing.columns

Index(['Make', 'Colour', 'Odometer (KM)', 'Doors', 'Price'], dtype='object')

In [54]:
car_sales_missing["Doors"].value_counts()

Doors
4.0    811
5.0     75
3.0     64
Name: count, dtype: int64

In [55]:
car_sales_missing["Make"].fillna("missing",inplace=True)
car_sales_missing["Colour"].fillna("missing",inplace=True)
car_sales_missing["Odometer (KM)"].fillna(pd.to_numeric(car_sales_missing['Odometer (KM)'], errors='coerce').mean(),inplace=True)
car_sales_missing["Doors"].fillna(4,inplace=True)


In [56]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64

In [57]:
#Remove rows with missing Price value
car_sales_missing.dropna(inplace=True)
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

In [58]:
len(car_sales_missing)

950

In [59]:
xm = car_sales_missing.drop("Price",axis=1)
ym = car_sales_missing["Price"]

In [60]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# car_sales.columns
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
tansformer  = ColumnTransformer([("one_hot",one_hot,categorical_features)]
                               ,remainder="passthrough")
tansformed_x = tansformer.fit_transform(car_sales_missing)

tansformed_x

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        3.54310e+04, 1.53230e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        1.92714e+05, 1.99430e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        8.47140e+04, 2.83430e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        6.66040e+04, 3.15700e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.15883e+05, 4.00100e+03],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        2.48360e+05, 1.27320e+04]])

In [61]:
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0


In [62]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

In [63]:
car_sales_missing.dropna(subset=["Price"],inplace=True)
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64

In [64]:
#split into y and y 
x= car_sales_missing.drop("Price",axis=1)
y= car_sales_missing["Price"]

In [65]:
# Fehlende kategoriale Werte mit 'missing' und numerische Werte mit dem Mittelwert ersetzen
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")
door_imputer = SimpleImputer(strategy="constant", fill_value=4)
num_imputer = SimpleImputer(strategy="mean")

# Definition der Spalten
cat_features = ["Make", "Colour"]
door_feature = ["Odometer (KM)"]
num_feature = ["Doors"]

# Erstellen eines Imputers (etwas, das fehlende Daten füllt)
imputer = ColumnTransformer([
    ("cat_imputer", cat_imputer, cat_features),
    ("num_imputer", num_imputer, num_feature),
    ("door_imputer", door_imputer, door_feature)
])

# Transformiere die Daten
filled_x = imputer.fit_transform(x)

# Überprüfe den Datentyp der gefüllten Daten (kann je nach Bedarf auskommentiert werden)
# type(filled_x.isna())

# Erstelle ein neues DataFrame aus den gefüllten Daten
colums_name = ['Make', 'Colour', 'Doors', 'Odometer (KM)']
new_Data = pd.DataFrame(filled_x,columns=colums_name)

# Überprüfe die Summe der fehlenden Werte im neuen DataFrame
new_Data.iloc[0],car_sales_missing.iloc[0]


(Make               Honda
 Colour             White
 Doors                4.0
 Odometer (KM)    35431.0
 Name: 0, dtype: object,
 Make               Honda
 Colour             White
 Odometer (KM)    35431.0
 Doors                4.0
 Price            15323.0
 Name: 0, dtype: object)

In [66]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
# car_sales.columns
categorical_features = ['Make', 'Colour', 'Doors']
one_hot = OneHotEncoder()
tansformer  = ColumnTransformer([("one_hot",one_hot,categorical_features)]
                               ,remainder="passthrough")
tansformed_x = tansformer.fit_transform(new_Data)

tansformed_x


<950x16 sparse matrix of type '<class 'numpy.float64'>'
	with 3800 stored elements in Compressed Sparse Row format>

In [67]:
# Now we got our data as numbers and filled (no missing value)
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_x, y)

clf = RandomForestRegressor(n_estimators=100)
clf.fit(X_train,y_train)
clf.score(X_test,y_test)

NameError: name 'transformed_x' is not defined

In [68]:
len(car_sales),len(filled_x)

(1000, 950)

## 2. Choosing the right estimator for your Problem 
some thinge note: 
* Sklearn refers to machine learning models,  algorithms as estimater.
* Classfication problem - predicting a category (heart disease or not)
  * Sometime you see clf (short for classifier) used as a classification estimator 
* Regrssion problem - predicting a number (selling price of a car)

### 2.1 Picking a machine learning model for a regrssion problem 
* using a Callfornie Housing dataset from sklearn

In [69]:
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing()
housing

{'data': array([[   8.3252    ,   41.        ,    6.98412698, ...,    2.55555556,
           37.88      , -122.23      ],
        [   8.3014    ,   21.        ,    6.23813708, ...,    2.10984183,
           37.86      , -122.22      ],
        [   7.2574    ,   52.        ,    8.28813559, ...,    2.80225989,
           37.85      , -122.24      ],
        ...,
        [   1.7       ,   17.        ,    5.20554273, ...,    2.3256351 ,
           39.43      , -121.22      ],
        [   1.8672    ,   18.        ,    5.32951289, ...,    2.12320917,
           39.43      , -121.32      ],
        [   2.3886    ,   16.        ,    5.25471698, ...,    2.61698113,
           39.37      , -121.24      ]]),
 'target': array([4.526, 3.585, 3.521, ..., 0.923, 0.847, 0.894]),
 'frame': None,
 'target_names': ['MedHouseVal'],
 'feature_names': ['MedInc',
  'HouseAge',
  'AveRooms',
  'AveBedrms',
  'Population',
  'AveOccup',
  'Latitude',
  'Longitude'],
 'DESCR': '.. _california_housing_dataset:\n

In [70]:
housing_df = pd.DataFrame(housing["data"],columns=housing["feature_names"])
housing_df["target"]= housing["target"]
housing_df

Unnamed: 0,MedInc,HouseAge,AveRooms,AveBedrms,Population,AveOccup,Latitude,Longitude,target
0,8.3252,41.0,6.984127,1.023810,322.0,2.555556,37.88,-122.23,4.526
1,8.3014,21.0,6.238137,0.971880,2401.0,2.109842,37.86,-122.22,3.585
2,7.2574,52.0,8.288136,1.073446,496.0,2.802260,37.85,-122.24,3.521
3,5.6431,52.0,5.817352,1.073059,558.0,2.547945,37.85,-122.25,3.413
4,3.8462,52.0,6.281853,1.081081,565.0,2.181467,37.85,-122.25,3.422
...,...,...,...,...,...,...,...,...,...
20635,1.5603,25.0,5.045455,1.133333,845.0,2.560606,39.48,-121.09,0.781
20636,2.5568,18.0,6.114035,1.315789,356.0,3.122807,39.49,-121.21,0.771
20637,1.7000,17.0,5.205543,1.120092,1007.0,2.325635,39.43,-121.22,0.923
20638,1.8672,18.0,5.329513,1.171920,741.0,2.123209,39.43,-121.32,0.847


In [76]:
# import algorithm / estimator
from sklearn.linear_model import Ridge

#setup random seed 
np.random.seed(42)

#create the  data 
x = housing_df.drop("target",axis=1)
y = housing_df["target"]
# split into train and test sets 
X_train, X_test, y_train, y_test = train_test_split(x, y,test_size=0.2)

#instantiate an fit the model (on the training set)

model = Ridge()
model.fit(X_train,y_train)
#Check the score of the model (on the test set)
model.score(X_test,y_test)

0.5758549611440131

In [77]:
from sklearn.ensemble import RandomForestRegressor
np.random.seed(42)
model = RandomForestRegressor()
model.fit(X_train,y_train)
#Check the score of the model (on the test set)
model.score(X_test,y_test)

0.8051230593157366

In [73]:
# ist die daten große wie 50 line/colum ? Ja
len(heart_disease)
heart_disease

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3,0
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3,0
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3,0
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3,0


### ist Vorhersage einer Kategorie ? ja
### Haben Sie beschriftete Daten? ja
### ist kleine wie 100K? ja 
### Consultinng the map and it says to try LinearSVC

In [74]:
heart_disease = pd.read_csv("../data/heart-disease.csv")

In [78]:
X = heart_disease.drop("target",axis=1)
y= heart_disease["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.2)
from sklearn.svm import LinearSVC
np.random.seed(42)
model = LinearSVC()
model.fit(X_train,y_train)
#Check the score of the model (on the test set)
model.score(X_test,y_test)



0.8032786885245902

In [79]:
heart_disease["target"].value_counts()

target
1    165
0    138
Name: count, dtype: int64

In [80]:
from sklearn.neighbors import KNeighborsClassifier
np.random.seed(42)
model = KNeighborsClassifier()
model.fit(X_train,y_train)
#Check the score of the model (on the test set)
model.score(X_test,y_test)

0.7213114754098361

In [90]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
np.random.seed(42)
model = make_pipeline(StandardScaler(), SVC(gamma='auto'))
model.fit(X_train,y_train)
#Check the score of the model (on the test set)
model.score(X_test,y_test)

0.8524590163934426

In [119]:
from sklearn.linear_model import LogisticRegression

np.random.seed(42)
model = LogisticRegression()
model.fit(X_train,y_train)
#Check the score of the model (on the test set)
model.score(X_test,y_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8032786885245902

In [120]:
heart_disease[heart_disease["target"]==1].head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [121]:
# Angenommen, Sie haben eine Liste mit neuen Datenpunkten
new_data = [
    [63,	1	,3,	140,	233,	1,	0,	150	,0	,2.3,	0,	0	,1	]  # Beispiel für einen Datenpunkt
    # Weitere neue Datenpunkte...
]

# Erstellen eines DataFrames für die neuen Daten
columns = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal']
new_df = pd.DataFrame(new_data, columns=columns)

# Durchführen von Vorhersagen mit dem trainierten Modell
predictions = model.predict(new_df)
predictions[0]

1

In [122]:
y_preds = model.predict(X_test)
np.mean(y_preds==y_test)*100

80.32786885245902

In [123]:
model.score(X_test,y_test)

0.8032786885245902

In [127]:
model.predict_proba(new_df)

array([[0.20828504, 0.79171496]])

In [128]:
model.predict(X_test[:5])

array([0, 0, 1, 1, 1], dtype=int64)

In [2]:
from sklearn.ensemble  import RandomForestRegressor
# RandomForestRegressor?  die ? ist für doc an zuschauen