# Linear Regression Using Scikit-learn

This notebook introduces the basics of machine learning and linear regression.

What is **linear regression**?

Linear regression analysis is used to predict the value of a variable based on the value of another variable. The variable you want to predict is called the dependent variable. The variable you are using to predict the other variable's value is called the independent variable.
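
As a minimal sketch of the idea, the snippet below fits a straight line to a handful of made-up points by ordinary least squares. The numbers and the helper name `fit_line` are assumptions for illustration only and are not taken from the dataset used in this notebook.

```python
# Toy data (made up for illustration): age is the independent variable,
# annual charges is the dependent variable.
ages = [20, 30, 40, 50]
charges = [2000, 3000, 4000, 5000]


def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Covariance-like and variance-like sums (unnormalised)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(ages, charges)
predicted = slope * 45 + intercept  # predicted charges for a 45-year-old
print(slope, intercept, predicted)  # 100.0 0.0 4500.0
```

Because the toy points lie exactly on a line, the fit recovers it exactly. Real data will scatter around the fitted line.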

In this notebook we will cover:
* A typical problem statement for machine learning
* Downloading and exploring a dataset for machine learning
* Linear regression with one variable using Scikit-learn
* Linear regression with multiple variables
* Using categorical features for machine learning
* Regression coefficients and feature importance
* Other models and techniques for regression using Scikit-learn

**QUESTION**: 

ACME Insurance Inc. offers affordable health insurance to thousands of customers all over the United States. As the lead data scientist at ACME, you're tasked with creating an automated system to estimate the annual medical expenditure for new customers, using information such as their age, sex, BMI, number of children, smoking habits and region of residence.

## Importing the Data

In [2]:
dataset_url = 'https://raw.githubusercontent.com/JovianML/opendatasets/master/data/medical-charges.csv'

In [3]:
from urllib.request import urlretrieve

urlretrieve(dataset_url, 'medical.csv')

('medical.csv', <http.client.HTTPMessage at 0x7fe3a9374940>)

In [4]:
import pandas as pd

medical_df = pd.read_csv('medical.csv')

In [5]:
medical_df.head()

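| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.9 | 0 | yes | southwest | 16884.924 |
| 1 | 18 | male | 33.77 | 1 | no | southeast | 1725.5523 |
| 2 | 28 | male | 33.0 | 3 | no | southeast | 4449.462 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.88 | 0 | no | northwest | 3866.8552 |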


In [6]:
medical_df.shape

(1338, 7)

The dataframe has 1338 rows and 7 columns. The column we need to predict is 'charges', given the values of the other columns for a new customer.
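
One possible way to express this split into inputs and target is sketched below. It uses a small stand-in dataframe built from the first three rows of the `head()` output; the names `sample_df`, `inputs` and `target` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Stand-in for medical_df: the first three rows shown by head() above.
sample_df = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Everything except 'charges' is an input; 'charges' is what we predict.
inputs = sample_df.drop(columns=["charges"])
target = sample_df["charges"]

print(inputs.shape, target.name)
```

The same two lines should work on the full `medical_df`, where `inputs` would have 6 columns.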

In [8]:
# Check the data type of each feature
medical_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


There are no null values in the dataframe, so no imputation is needed.

Here age and children are integers and bmi and charges are floats (all numerical values), whereas sex, smoker and region are strings (categorical values).
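
The checks above can also be done programmatically. The sketch below counts missing values per column and separates numerical from categorical columns by dtype. It runs on a small stand-in dataframe built from the first rows of the dataset; `sample_df` and the other variable names are assumptions for illustration.

```python
import pandas as pd

# Stand-in for medical_df: the first three rows shown by head() above.
sample_df = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "children": [0, 1, 3],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

# Number of missing values in each column (all zeros means nothing to impute).
missing_per_column = sample_df.isna().sum()

# Split columns by dtype: numbers vs. everything else (strings here).
numeric_cols = sample_df.select_dtypes(include="number").columns.tolist()
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print(numeric_cols)
print(categorical_cols)
```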

In [9]:
# Check summary statistics of the numerical columns
medical_df.describe()

| | age | bmi | children | charges |
|---|---|---|---|---|
| count | 1338.0 | 1338.0 | 1338.0 | 1338.0 |
| mean | 39.207025 | 30.663397 | 1.094918 | 13270.422265 |
| std | 14.04996 | 6.098187 | 1.205493 | 12110.011237 |
| min | 18.0 | 15.96 | 0.0 | 1121.8739 |
| 25% | 27.0 | 26.29625 | 0.0 | 4740.28715 |
| 50% | 39.0 | 30.4 | 1.0 | 9382.033 |
| 75% | 51.0 | 34.69375 | 2.0 | 16639.912515 |
| max | 64.0 | 53.13 | 5.0 | 63770.42801 |


The charges appear to be right-skewed: the mean (about 13,270) is well above the median (about 9,382), and the maximum (about 63,770) is far above both. At least 75% of the population has 2 or fewer children.
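
As a rough check on the direction of the skew, two summary-based skewness measures can be computed directly from the `describe()` output for `charges`. This is only a sketch: the choice of Pearson's second coefficient and Bowley's quartile coefficient is an assumption made here for illustration, not something the original analysis used.

```python
# Summary statistics for 'charges', copied from the describe() output above.
mean = 13270.422265
std = 12110.011237
q1 = 4740.28715       # 25th percentile
median = 9382.033     # 50th percentile
q3 = 16639.912515     # 75th percentile

# Pearson's second skewness coefficient: 3 * (mean - median) / std
pearson_skew = 3 * (mean - median) / std

# Bowley's quartile skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)
bowley_skew = (q3 + q1 - 2 * median) / (q3 - q1)

# Positive values indicate a longer right tail.
print(round(pearson_skew, 2), round(bowley_skew, 2))
```

Both values come out positive, which is consistent with a right-skewed distribution.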

**If your BMI is less than 18.5, it falls within the underweight range. If your BMI is 18.5 to 24.9, it falls within the Healthy Weight range. If your BMI is 25.0 to 29.9, it falls within the overweight range. If your BMI is 30.0 or higher, it falls within the obese range.**

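With a mean BMI of about 30.7 and a median of 30.4, the typical person in this population is overweight or obese, and very few appear to be underweight. Since the 25th percentile (about 26.3) is already above 24.9, fewer than 25% of the population falls within the healthy weight range.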
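
The BMI ranges quoted above can be written as a small helper. The function name `bmi_category` is hypothetical, and the use of half-open intervals (so that a value such as 24.95 counts as healthy weight) is an assumption, since the quoted ranges leave small gaps between categories.

```python
def bmi_category(bmi):
    """Map a BMI value to the weight range described in the text above."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25.0:
        return "healthy weight"
    if bmi < 30.0:
        return "overweight"
    return "obese"


# BMI values taken from the head() and describe() outputs shown earlier.
for value in [15.96, 22.705, 27.9, 33.77]:
    print(value, bmi_category(value))
```

On the full dataset, something like `medical_df['bmi'].apply(bmi_category).value_counts()` would give the actual share of customers in each range.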
