
In [None]:
# Fitting a scikit-learn model on numerical data
# In this notebook, we present how to build predictive models on tabular datasets, with only numerical features.
#  First model with scikit-learn

# In particular we will highlight:

# * the scikit-learn API: `.fit(X, y)`/`.predict(X)`/`.score(X, y)`;
# * how to evaluate the generalization performance of a model with a train-test split.

## Loading the dataset with Pandas

import pandas as pd

url = "https://raw.githubusercontent.com/mankind/scikit-learn-mooc/main/datasets/adult-census-numeric.csv"
adult_census = pd.read_csv(url)

print("Check the first records")
print(" ")
print(adult_census.head())

# Separate the data and the target

target_name = "class"
target = adult_census[target_name]
print(" ")
print("target is")
print(target)

print(" ")
print("### data ###")
data = adult_census.drop(columns=[target_name])
print(data.head())

print(" ")
print("""
   We can now look at the variables, also called features, that we will use to build our predictive model. \n
   We can also check how many samples are available in our dataset.
      """)

print("columns")
print(data.columns)

print(" ")
print(f"The dataset contains {data.shape[0]} samples and {data.shape[1]} features")

Check the first records
 
   age  capital-gain  capital-loss  hours-per-week   class
0   41             0             0              92   <=50K
1   48             0             0              40   <=50K
2   60             0             0              25   <=50K
3   37             0             0              45   <=50K
4   73          3273             0              40   <=50K
 
target is
0         <=50K
1         <=50K
2         <=50K
3         <=50K
4         <=50K
          ...  
39068     <=50K
39069     <=50K
39070      >50K
39071     <=50K
39072      >50K
Name: class, Length: 39073, dtype: object
 
### data ###
   age  capital-gain  capital-loss  hours-per-week
0   41             0             0              92
1   48             0             0              40
2   60             0             0              25
3   37             0             0              45
4   73          3273             0              40
 

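The separation of features and target can be checked on a small frame without downloading anything. The sketch below rebuilds the five rows printed above by hand (the names `toy`, `toy_data` and `toy_target` are illustrative, and the label strings are simplified), then confirms that every remaining column is numerical.

```python
import pandas as pd

# Toy frame copied from the five rows shown in the output above.
# It is NOT the full dataset; it only illustrates the data/target split.
toy = pd.DataFrame(
    {
        "age": [41, 48, 60, 37, 73],
        "capital-gain": [0, 0, 0, 0, 3273],
        "capital-loss": [0, 0, 0, 0, 0],
        "hours-per-week": [92, 40, 25, 45, 40],
        "class": ["<=50K"] * 5,
    }
)

target_name = "class"
toy_target = toy[target_name]               # what we want to predict
toy_data = toy.drop(columns=[target_name])  # the input features

# Check that every feature column has a numeric dtype.
all_numeric = all(
    pd.api.types.is_numeric_dtype(dtype) for dtype in toy_data.dtypes
)
n_samples, n_features = toy_data.shape
print(f"{n_samples} samples, {n_features} features, all numeric: {all_numeric}")
```

The same check (`data.dtypes`) can be run on the full `data` frame before fitting a model that expects numerical input.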

In [None]:
## Fit a model and make predictions

print("""
  
We will build a classification model using the "K-nearest neighbors"
strategy. To predict the target of a new sample, a k-nearest neighbors model
takes into account its `k` closest samples in the training set and predicts
the majority target of these samples.   \n

We use a K-nearest neighbors model here. However, be aware that it is seldom
useful in practice. We use it because it is an intuitive algorithm. \n

The `fit` method is called to train the model from the input (features) and
target data.
""")

print("  ")
print("""
   The method `fit` is composed of two elements: (i) a learning algorithm and (ii) some model states. \n
   The learning algorithm takes the training data and training target as input and sets the model states. \n
   These model states will be used later to either predict (for classifiers and regressors) or transform data (for transformers).
""")
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()

print("In scikit-learn documentation, data is commonly named X and target is commonly called y.")
_ = model.fit(data, target)


print("Make predictions with the trained model")
print(" ")
print("""
To predict, a model uses a prediction function that will use the input data together with the model states. \n
As for the learning algorithm and the model states, the prediction function is specific for each type of model.
""")
target_predicted = model.predict(data)
print(target_predicted)

print(" ")
print("Let's now have a look at the computed predictions. For the sake of simplicity, we will look at the first five predicted targets.")

print(" ")
print("### print target_predicted[:5]")
print(target_predicted[:5])
print("### compare predictions with the actual data")
print(target[:5])

print(" ")
print("### ...and we could even check if the predictions agree with the real targets: ")
# A bare expression in the middle of a cell displays nothing, so print it.
print(target[:5] == target_predicted[:5])

print(" ")
print("""Here, we see that our model makes a mistake when predicting for the first sample.""")
print("Number of correct predictions: "
      f"{(target[:5] == target_predicted[:5]).sum()} / 5")

print(" ")
print("### To get a better assessment, we can compute the average success rate.")
print((target == target_predicted).mean())
print(" ")
print("""
This result means that the model makes a correct prediction for approximately 82 samples out of 100. \n
 Note that we used the same data to train and evaluate our model. Can this evaluation be trusted or is it too good to be true?
""")


  
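The majority vote described in this cell can be sketched by hand with NumPy. This is only an illustration of the idea, not how `KNeighborsClassifier` is implemented: the function `knn_predict_one`, the Euclidean distance, and the six two-feature training points are all assumptions made up for the example.

```python
from collections import Counter

import numpy as np


def knn_predict_one(X_train, y_train, x_new, k=5):
    """Predict the label of x_new by majority vote among its k nearest samples."""
    # Euclidean distance from x_new to every training sample.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training samples.
    nearest = np.argsort(distances)[:k]
    # Count the labels of these neighbors and return the most frequent one.
    votes = Counter(str(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training set: two features per sample, two classes.
X_toy = np.array(
    [[25, 40], [30, 38], [28, 45], [52, 60], [48, 55], [55, 50]], dtype=float
)
y_toy = np.array(["<=50K", "<=50K", "<=50K", ">50K", ">50K", ">50K"])

prediction_a = knn_predict_one(X_toy, y_toy, np.array([27.0, 41.0]), k=3)
prediction_b = knn_predict_one(X_toy, y_toy, np.array([50.0, 57.0]), k=3)
print(prediction_a, prediction_b)
```

Here the "model states" are simply the stored training samples and labels, and the prediction function is the distance computation followed by the vote.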

In [None]:
## Train-test data split

print("""
When building a machine learning model, it is important to evaluate the
trained model on data that was not used to fit it, as **generalization** is
more than memorization (meaning we want a rule that generalizes to new data,
without comparing to data we memorized).
It is harder to draw conclusions about never-seen instances than about already seen ones. \n

Correct evaluation is easily done by leaving out a subset of the data when
training the model and using it afterwards for model evaluation.
The data used to fit a model is called training data while the data used to
assess a model is called testing data. \n

We can load more data, which was actually left out of the original
dataset.
""")

import pandas as pd

url = "https://raw.githubusercontent.com/mankind/scikit-learn-mooc/main/datasets/adult-census-numeric-test.csv"
adult_census_test = pd.read_csv(url)

print(" ")
print("### We separate our input features and the target to predict")
target_name = "class"
target_test = adult_census_test[target_name]
data_test = adult_census_test.drop(columns=[target_name])

print(" ")
print("### We can check the number of features and samples available in this new set.")
print(f"The testing data contains {data_test.shape[0]} samples and there are {data_test.shape[1]} features")

print(" ")
print("""
 Instead of computing the predictions and manually computing the average success rate, we can use the `score` method. \n 
 For classifiers, this method returns the mean accuracy.
""")

accuracy = model.score(data_test, target_test)
model_name = model.__class__.__name__
print(" ")
print(f"The test accuracy using a {model_name} is "
      f"{accuracy:.3f}")

print(" ")
print("""
To compute the score, the predictor first computes the predictions (using the `predict` method) and
then uses a scoring function to compare the true target y and the predictions. Finally, the score is returned.
""")


 
### We separate our input features and the target to predict
 
### We can check the number of features and samples available in this new set.
The testing data contains 9769 samples and there are 4 features
 

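When no separate test file is available, a held-out set can be carved out of a single dataset. The sketch below is an assumption-laden illustration rather than part of the original notebook: it uses synthetic data from `make_classification`, scikit-learn's `train_test_split`, and `n_neighbors=1` chosen on purpose so that memorization is visible. It also checks that `score` matches the manually computed average success rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for a numerical tabular dataset (4 features).
X, y = make_classification(n_samples=500, n_features=4, random_state=0)

# Keep 25% of the samples aside; they are never seen during fit.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# With a single neighbor, each training sample is its own nearest neighbor,
# so the model simply memorizes the training set.
knn = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

train_accuracy = knn.score(X_train, y_train)
test_accuracy = knn.score(X_test, y_test)

# score() is equivalent to predicting and averaging the correct answers.
manual_test_accuracy = np.mean(knn.predict(X_test) == y_test)

print(f"Accuracy on training data: {train_accuracy:.3f}")
print(f"Accuracy on held-out data: {test_accuracy:.3f}")
print(f"Manual held-out accuracy:  {manual_test_accuracy:.3f}")
```

The training accuracy is perfect by construction here, which is why only the held-out figure says anything about generalization.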