# Learning Curves

👇 Run the code below.

In [3]:
import pandas as pd
data = pd.read_csv('data/insurance_charges.csv', nrows=134)
data.head()

Unnamed: 0,age,children,bmi,sex_female,sex_male,smoker,region_northeast,region_northwest,region_southeast,region_southwest,charges
0,19,0,27.9,1,0,1,0,0,0,1,16884.924
1,18,1,33.77,0,1,0,0,0,1,0,1725.5523
2,28,3,33.0,0,1,0,0,0,1,0,4449.462
3,33,0,22.705,0,1,0,0,1,0,0,21984.47061
4,32,0,28.88,0,1,0,0,1,0,0,3866.8552


Each observation is the profile of a health insurance client. `charges` is the amount the client pays for the insurance. Some of the features have been preprocessed.

The task is to estimate the price a client will be charged for insurance based on their profile.

## Data Exploration

👇 How many observations make up the dataset?

👇 What is the average price a client pays for insurance?

👇 What is the average price a smoker pays for insurance?

👇 What is the average price a non-smoker pays for insurance?
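
As a rough sketch of the pandas calls these questions rely on, the snippet below rebuilds only the five rows displayed above (not the full 134-row sample) and computes a row count, an overall mean, and a per-group mean. It assumes `smoker == 1` marks a smoker, which the preprocessed column suggests but the notebook does not state.

```python
import pandas as pd

# Only the five rows shown in the head() output, reduced to the two
# columns the questions need. This is not the full dataset.
data = pd.DataFrame({
    "smoker": [1, 0, 0, 0, 0],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

n_observations = len(data)            # number of rows
mean_charge = data["charges"].mean()  # average price over all clients

# Average price per group; assumption: index 1 = smoker, 0 = non-smoker
mean_by_smoker = data.groupby("smoker")["charges"].mean()

print(n_observations, mean_charge)
print(mean_by_smoker)
```

The same three expressions should carry over unchanged to the `data` frame loaded from the CSV.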

## Initial Learning Curves

👇 Plot the `learning_curve`s of a `LinearRegression` model trained on all the features except `age` and `bmi`. Specify the following parameters:
- Train sizes : [25, 50, 75, 100]
- 10-fold Cross validation
- R2 scoring

[`learning_curve` documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.learning_curve.html)

👇 What is the best cross-validated R2 score reached by the model?
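
A minimal sketch of how `learning_curve` can be called with the parameters listed above, and how the best cross-validated score can be read from its output. The data here is a hypothetical stand-in (random profiles and made-up charges with only four feature columns), so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

# Hypothetical stand-in for the 134-row sample (values are made up)
rng = np.random.default_rng(0)
n = 134
data = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 4, n),
    "smoker": rng.integers(0, 2, n),
})
data["charges"] = 250 * data["age"] + 20000 * data["smoker"] + rng.normal(0, 3000, n)

# All features except age and bmi (with the real file, the `Unnamed: 0`
# index column would presumably be dropped as well)
X = data.drop(columns=["charges", "age", "bmi"])
y = data["charges"]

train_sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y,
    train_sizes=[25, 50, 75, 100],
    cv=10,
    scoring="r2",
)

# One row per train size, each score averaged over the 10 folds
curves = pd.DataFrame({
    "train_size": train_sizes,
    "train_r2": train_scores.mean(axis=1),
    "validation_r2": val_scores.mean(axis=1),
})
best_cv_r2 = curves["validation_r2"].max()

print(curves)
print(best_cv_r2)
```

Plotting `train_r2` and `validation_r2` against `train_size` (for instance with matplotlib) gives the two learning curves.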

❓ What effects would adding more data have on this model?

<details>
<summary>Answer</summary>
    
The curves have not yet converged, which signals that adding more data should improve the model's performance.
</details>

## More data!

👇 Run the code below to load the full version of the dataset

In [2]:
more_data = pd.read_csv('data/insurance_charges.csv')

more_data.head()

Unnamed: 0,age,children,bmi,sex_female,sex_male,smoker,region_northeast,region_northwest,region_southeast,region_southwest,charges
0,19,0,27.9,1,0,1,0,0,0,1,16884.924
1,18,1,33.77,0,1,0,0,0,1,0,1725.5523
2,28,3,33.0,0,1,0,0,0,1,0,4449.462
3,33,0,22.705,0,1,0,0,1,0,0,21984.47061
4,32,0,28.88,0,1,0,0,1,0,0,3866.8552


👇 How many observations are there?

👇 Plot the `learning_curve`s of a `LinearRegression` model trained on all the features except `age` and `bmi`. Specify the following parameters:

- Train sizes : [50, 100, 200, 300, 500]
- 10-fold Cross validation
- R2 scoring

👇 What is the best cross-validated R2 score reached by the model?

❓ What effects would adding more data have on this model?

<details>
<summary>Answer</summary>
    
None! The curves have converged, which signals that adding more data will not improve performance.
</details>
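
One possible way to put a number on "converged" is to track the gap between the mean training score and the mean validation score at each train size. The helper `score_gap` and the 600-row synthetic dataset below are hypothetical illustrations, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve


def score_gap(estimator, X, y, train_sizes, cv=10):
    """Mean train R2 minus mean validation R2, indexed by train size."""
    sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, train_sizes=train_sizes, cv=cv, scoring="r2"
    )
    gap = train_scores.mean(axis=1) - val_scores.mean(axis=1)
    return pd.Series(gap, index=sizes, name="r2_gap")


# Hypothetical data (values are made up); 600 rows so that a 10-fold
# training split can hold the largest train size of 500
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "children": rng.integers(0, 4, n),
    "smoker": rng.integers(0, 2, n),
})
y = 20000 * X["smoker"] + rng.normal(0, 4500, n)

gaps = score_gap(LinearRegression(), X, y, train_sizes=[50, 100, 200, 300, 500])
print(gaps)
```

A gap that stays close to zero as the train size grows is one sign that the two curves have met.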

❓ What could you do to improve the performance of the model?

<details>
<summary>Answer</summary>
    
The converged curves indicate that the model has reached its full potential with the current features. However, adding new features might improve its performance.
</details>

## More features!

👇 Plot the `learning_curve`s of a `LinearRegression` model trained on **all features** (don't forget to scale!). Specify the following parameters:
- Train sizes : [100, 200, 300, 400, 500]
- 10-fold Cross validation
- R2 scoring
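
One way to handle the scaling, sketched here on hypothetical data, is to wrap `StandardScaler` and `LinearRegression` in a pipeline so the scaler is refit on each training fold. This is an alternative to fitting a separate scaler up front, not necessarily the approach the exercise expects.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data (values are made up) with features on different scales
rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "bmi": rng.normal(30, 6, n),
    "children": rng.integers(0, 4, n),
    "smoker": rng.integers(0, 2, n),
})
y = 250 * X["age"] + 20000 * X["smoker"] + rng.normal(0, 3000, n)

# The scaler is fit only on the training part of each fold
model = make_pipeline(StandardScaler(), LinearRegression())

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=[100, 200, 300, 400, 500],
    cv=10,
    scoring="r2",
)
best_cv_r2 = val_scores.mean(axis=1).max()
print(best_cv_r2)
```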

👇 What is the best cross-validated R2 score reached by the model?

Adding features should have greatly improved the performance of the model!

❓ Is the model at risk of overfitting?

<details>
<summary>Answer</summary>
    
No! Overfitting is indicated when the training score is suspiciously high and the gap between it and the test score is significant.
    
</details>

## Prediction

A potential client wants a quote for how much she would be charged.

In [4]:
potential_client = pd.read_csv('data/new_data.csv')

potential_client

Unnamed: 0,age,children,bmi,sex_female,sex_male,smoker,region_northeast,region_northwest,region_southeast,region_southwest,charges
0,40,0,36.19,1,0,0,0,0,1,0,


👇 Use your model to give her a price estimate. Make sure you scale the new data with the original scaler.
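
A minimal sketch of reusing a fitted scaler at prediction time. The training data is hypothetical (random values under the same column names), so the printed estimate is not a real quote. Only the client row is taken from the table above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

features = [
    "age", "children", "bmi", "sex_female", "sex_male", "smoker",
    "region_northeast", "region_northwest", "region_southeast", "region_southwest",
]

# Hypothetical training data (values are made up)
rng = np.random.default_rng(0)
n = 300
sex_female = rng.integers(0, 2, n)
regions = np.eye(4, dtype=int)[rng.integers(0, 4, n)]  # one-hot region columns
X_train = pd.DataFrame(
    np.column_stack([
        rng.integers(18, 65, n), rng.integers(0, 4, n), rng.normal(30, 6, n),
        sex_female, 1 - sex_female, rng.integers(0, 2, n), regions,
    ]),
    columns=features,
)
y_train = 250 * X_train["age"] + 20000 * X_train["smoker"] + rng.normal(0, 3000, n)

# Fit the scaler once on the training features, then train on the scaled values
scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# The potential client from the table; `charges` is empty, so only features are kept
potential_client = pd.DataFrame(
    [[40, 0, 36.19, 1, 0, 0, 0, 0, 1, 0]], columns=features
)

# transform (not fit_transform) applies the training means and variances
estimate = model.predict(scaler.transform(potential_client))
print(estimate[0])
```

Calling `fit_transform` on the single new row instead would rescale it against itself and discard the statistics learned from the training data.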

⚠️ Don't forget to push your exercise once you have completed it 🙃

# 🏁