In [3]:
#import important libraries
import numpy as np
import pandas as pd
import sklearn

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

## Linear Regression Example



In [21]:
#import data
insurance = pd.read_csv('./Data/insurance.csv')
insurance

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500


In this dataset, we will be trying to predict the amount of medical charges based on a patient's data.

So which column in this table will be our <b>target variable</b> (the variable we are trying to predict)?

Which columns are our <b> feature variables </b> (the variables we are using to predict our target)?



In [3]:
insurance.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

First, let's do some simple EDA to see what our dataset looks like:

In [4]:
categorical = ['sex', 'smoker', 'region']

for category in categorical:
    fig = px.histogram(insurance, x = category)
    fig.show()

In [0]:
numerical = ['age', 'bmi', 'children']

for column in numerical:
    fig = px.histogram(insurance, x = column)
    fig.show()

In [0]:
fig = px.scatter(insurance, x = 'children', y = 'charges')
fig.show()

In [0]:
fig = px.scatter(insurance, x = 'age', y = 'charges')
fig.show()
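
Alongside the plots, a quick numeric summary can help. The sketch below is only an illustration: `toy` is a small made-up table (hypothetical values, not the real insurance data) that borrows a few of the column names. It shows `value_counts()` for a categorical column and `corr()` for numerical columns.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration only
toy = pd.DataFrame({
    'age': [19, 25, 33, 41, 52, 60],
    'bmi': [27.9, 22.1, 30.5, 26.3, 31.0, 28.4],
    'smoker': ['yes', 'no', 'no', 'yes', 'no', 'no'],
    'charges': [1000.0, 1500.0, 2100.0, 2900.0, 3600.0, 4400.0],
})

# How many rows fall into each category?
smoker_counts = toy['smoker'].value_counts()
print(smoker_counts)

# Pairwise Pearson correlation between the numerical columns
corr = toy[['age', 'bmi', 'charges']].corr()
print(corr)
```

Values close to 1 or -1 in the correlation table suggest a strong linear relationship, while values near 0 suggest a weak one.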

Before we build a model, we need to perform a train-test split on our data! What is a train-test split?

Our training set is the data the model learns from: we fit the model on it so that it can produce the right output.
Our validation/test set acts as a set of <b> NEW and UNSEEN </b> data, which lets us check whether the model still makes accurate predictions on data it was not trained on.

sklearn library for train test split: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [22]:
#import sklearn library for train test split
from sklearn.model_selection import train_test_split

#first let's split up our feature variables from our target variables
X = insurance[['age', 'bmi', 'children']] # X represents our features
y = insurance[['charges']] # y represents what we are trying to predict

#now, we must split this into 4 different sets:

# X_train -- represents our features in the train set
# X_test -- represents our features in the test set
# y_train -- represents our target in the train set
# y_test -- represents our target in the test set

#train_test_split returns all four sets at once, so we can assign them in one line using commas :) !!
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state=42)
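
To see what the split actually does, here is a minimal sketch on a hypothetical 10-row table (made-up values, with `toy_` names chosen so they don't overwrite the real variables). With `train_size=0.8`, 8 rows should land in the training set and 2 in the test set, and no row should appear in both.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, one feature and one target
toy = pd.DataFrame({'feature': range(10), 'target': range(100, 110)})

toy_X = toy[['feature']]
toy_y = toy['target']

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, train_size=0.8, random_state=42
)

# Number of rows and columns in each feature set
print(toy_X_train.shape, toy_X_test.shape)

# The two sets share no rows: their index labels do not overlap
overlap = set(toy_X_train.index) & set(toy_X_test.index)
print(overlap)
```

Setting `random_state` makes the shuffle repeatable, so the same rows end up in each set every time the cell is run.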

Looks like this data is pretty clean! So let's get into linear regression! Remember that in linear regression, we want to predict a numerical value -- which in this case is the amount of charges per patient. We can do this using scikit-learn (sklearn)!

sklearn library for linear regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html



In [4]:
#import sklearn library for linear regression
from sklearn.linear_model import LinearRegression

#first we set a variable equal to our regression model
reg = LinearRegression()

#next, we need to fit our model! This is the first step in training a ML model
'''which set do we use to fit our model?'''

reg.fit(X_train, y_train)

#now we can use our fitted model to make predictions on both the train and test sets!
train_pred = reg.predict(X_train)
y_pred = reg.predict(X_test)
y_pred

array([[13305.28945949],
       [11801.95170145],
       [16941.71437111],
       [14278.42206855],
       [ 8680.25439362],
       [16202.22349193],
       [ 5555.8901083 ],
       [20602.58565492],
       [ 5806.95206068],
       [15919.03165614],
       [10299.48549201],
       [14221.13480456],
       [10676.8197114 ],
       [19794.64417995],
       [20721.14505796],
       [18319.57207031],
       [20003.52486079],
       [16989.16491869],
       [13918.10991054],
       [12305.80129754],
       [10155.63429188],
       [15330.37731182],
       [ 8175.30663033],
       [11961.94421188],
       [15950.43401031],
       [16659.25006455],
       [18787.72664854],
       [11630.42705085],
       [14220.19351839],
       [ 7868.1762212 ],
       [14697.87471015],
       [17037.49024392],
       [10101.76722818],
       [ 8821.1844745 ],
       [10150.74943027],
       [17039.37281627],
       [ 7851.03169019],
       [13870.41119463],
       [13940.42279767],
       [14085.77995875],
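
A fitted `LinearRegression` stores one coefficient per feature in `coef_` and the intercept in `intercept_`, which is a handy way to see what the model learned. The sketch below uses synthetic data (made up, generated from y = 3*x1 + 2*x2 + 5 with no noise), so we know in advance which values the model should recover. The `toy_` names are placeholders, not part of the insurance example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: the target is an exact linear function of two features
rng = np.random.default_rng(0)
toy_X = rng.uniform(0, 10, size=(50, 2))
toy_y = 3 * toy_X[:, 0] + 2 * toy_X[:, 1] + 5

toy_reg = LinearRegression()
toy_reg.fit(toy_X, toy_y)

# One learned coefficient per feature, plus the intercept
print(toy_reg.coef_)
print(toy_reg.intercept_)

# A prediction is intercept + sum(coefficient * feature value)
new_point = np.array([[1.0, 1.0]])
toy_prediction = toy_reg.predict(new_point)
print(toy_prediction)
```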


Now how well did our model perform? Let's check using r2_score!

sklearn library for r2_score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html

If we were using logistic regression/classification, we would use accuracy!

sklearn library for accuracy score: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
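
To get a feel for what r2_score measures, here is a small sketch that computes it by hand as 1 - (sum of squared errors) / (total sum of squares around the mean) and compares the result with sklearn. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# Squared error of our predictions
ss_res = np.sum((y_true - y_hat) ** 2)
# Squared error we would get by always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

manual_r2 = 1 - ss_res / ss_tot
sklearn_r2 = r2_score(y_true, y_hat)
print(manual_r2, sklearn_r2)

# Baseline: always predicting the mean gives an R-squared of 0
baseline = np.full_like(y_true, y_true.mean())
baseline_r2 = r2_score(y_true, baseline)
print(baseline_r2)
```

So an R-squared near 1 means the predictions are much better than just guessing the mean, and an R-squared near 0 means they are barely better.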



In [5]:
from sklearn.metrics import r2_score

#note: this cell scores the training set (y_train vs. train_pred)
r2 = r2_score(y_train, train_pred)
print('R-squared for training set: ', r2)

R-squared for training set:  0.10987471044767094


In [6]:
#import sklearn library for r2_score
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print('R-squared for test set: ', r2)

R-squared for test set:  0.15489592484270776


Dang, that is a really low R-squared! It means that age, bmi, and children together explain only about 15% of the variation in charges on the test set (and about 11% on the training set), so these features alone are not strong predictors. From here, we can go back and add or change feature columns to try to improve our model!
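
One way to add columns is to bring in categorical features such as smoker or region, which first need to be converted to numbers. Here is a minimal sketch using `pd.get_dummies` on a hypothetical toy table (made-up rows). It only shows the encoding step and makes no claim about how much the R-squared on the real dataset would change.

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration
toy = pd.DataFrame({
    'age': [19, 45, 33, 60],
    'smoker': ['yes', 'no', 'no', 'yes'],
    'region': ['southwest', 'southeast', 'northwest', 'southwest'],
})

# One-hot encode the categorical columns.
# drop_first=True drops one category per column to avoid redundant columns.
encoded = pd.get_dummies(toy, columns=['smoker', 'region'], drop_first=True)
print(encoded.columns.tolist())
print(encoded)
```

The encoded columns can then be listed alongside the numerical ones when building X.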

## Logistic Regression Example

In [4]:
drug = pd.read_csv('./Data/drug200.csv')
drug

Unnamed: 0,Age,Sex,BP,Cholesterol,Na_to_K,Drug
0,23,F,HIGH,HIGH,25.355,DrugY
1,47,M,LOW,HIGH,13.093,drugC
2,47,M,LOW,HIGH,10.114,drugC
3,28,F,NORMAL,HIGH,7.798,drugX
4,61,F,LOW,HIGH,18.043,DrugY
...,...,...,...,...,...,...
195,56,F,LOW,HIGH,11.567,drugC
196,16,M,LOW,HIGH,12.006,drugC
197,52,M,NORMAL,HIGH,9.894,drugX
198,23,M,NORMAL,NORMAL,14.020,drugX


In [5]:
#renaming the columns: lowercase names, with Na_to_K shortened to na-k and Drug renamed to drug-type
drug.columns = ['age', 'sex', 'bp', 'cholesterol', 'na-k', 'drug-type']
drug.columns

Index(['age', 'sex', 'bp', 'cholesterol', 'na-k', 'drug-type'], dtype='object')

In this example, we will be building a model to predict the type of drug that a patient should take.

What are our <b> feature variables </b>?

What is our <b> target variable </b>?

Now, let's perform our train test split!

In [7]:
from sklearn.model_selection import train_test_split
X = drug[['age', 'na-k']] #feature variables
y = drug[['drug-type']] #target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8, random_state = 42)

Next, let's import our <b> Logistic Regression </b> model!

sklearn library for logistic regression: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html


In [8]:
#import library for logistic regression
from sklearn.linear_model import LogisticRegression

#first, set a variable equal to our model
lr = LogisticRegression()

#next, let's fit the model to our training set
lr.fit(X_train, y_train)

#predict using test set
y_pred = lr.predict(X_test)

  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


In [9]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of test set: ', accuracy)

Accuracy of test set:  0.675


In [13]:
from sklearn.neighbors import KNeighborsClassifier

kn = KNeighborsClassifier(n_neighbors=5)

kn.fit(X_train, y_train)

y_pred = kn.predict(X_test)

  return self._fit(X, y)


In [14]:
from sklearn.metrics import accuracy_score

accuracy = accuracy_score(y_test, y_pred)
print('Accuracy of test set: ', accuracy)

Accuracy of test set:  0.65


In [19]:
from sklearn.metrics import recall_score
recall = recall_score(y_test, y_pred, average = 'micro')
recall

0.65
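
Notice that the recall above is exactly the same as the accuracy. That is expected: when every sample has exactly one label, micro-averaged recall counts all correct predictions over all samples, which is the same calculation as accuracy. Macro averaging instead computes recall per class and then averages, so it can differ. A small sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up labels for illustration
toy_true = ['drugX', 'drugX', 'drugX', 'DrugY', 'DrugY', 'drugC']
toy_pred = ['drugX', 'drugX', 'DrugY', 'DrugY', 'DrugY', 'drugX']

acc = accuracy_score(toy_true, toy_pred)

# micro: pool every prediction together before computing recall
micro = recall_score(toy_true, toy_pred, average='micro')

# macro: compute recall for each class separately, then take the plain average
macro = recall_score(toy_true, toy_pred, average='macro')

print(acc, micro, macro)
```

Here the rare class drugC is never predicted correctly, which pulls the macro average below the accuracy.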

### sklearn library for k-nearest neighbors: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html


### sklearn library for decision trees: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html