### Logistic Regression

Logistic regression aims to solve classification problems. It does this by predicting categorical outcomes, unlike linear regression, which predicts a continuous outcome.

In the simplest case there are two outcomes, which is called binomial; an example is predicting whether a tumor is malignant or benign. When there are more than two outcomes to classify, it is called multinomial. A common example of multinomial logistic regression is predicting the species of an iris flower from among three different species.

Here we will be using basic logistic regression to predict a binomial variable. This means it has only two possible outcomes.

In Python we have modules that will do the work for us. Start by importing the NumPy module.

In [1]:
import numpy

Store the independent variables in X.

Store the dependent variable in y.

Below is a sample dataset:

In [2]:
#X represents the size of a tumor in centimeters.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1,1)

#Note: X has to be reshaped into a column from a row for the LogisticRegression() function to work.
# For more information about reshape, see https://stackoverflow.com/questions/18691084/what-does-1-mean-in-numpy-reshape
print ("array X after being reshaped into 1 column and 12 rows (12, 1) is:\n", X)

#y represents whether or not the tumor is cancerous (0 for "No", 1 for "Yes").
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
print ("\n y is: \n", y)

array X after being reshaped into 1 column and 12 rows (12, 1) is:
 [[3.78]
 [2.44]
 [2.09]
 [0.14]
 [1.72]
 [1.65]
 [4.92]
 [4.37]
 [4.96]
 [4.52]
 [3.69]
 [5.88]]

 y is: 
 [0 0 0 0 0 0 1 1 1 1 1 1]


We will use a class from the sklearn module, so we will have to import that module as well:

In [3]:
from sklearn import linear_model

From the sklearn module we will use the LogisticRegression class to create a logistic regression object.

This object has a method called fit() that takes the independent and dependent values as parameters and fills the regression object with data that describes the relationship:

In [4]:
logr = linear_model.LogisticRegression()
logr.fit(X,y)

LogisticRegression()

Now we have a logistic regression object that is ready to predict whether a tumor is cancerous based on the tumor size:

In [5]:
# Predict whether a tumor of size 3.46 cm is cancerous.
# predict() expects a 2D array, so the single value 3.46 is reshaped into one row and one column.
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))

print("The result is:", predicted)
print ("[0] means tumor is not cancerous")
print ("[1] means tumor is cancerous")

The result is: [0]
[0] means tumor is not cancerous
[1] means tumor is cancerous


We have predicted that a tumor with a size of 3.46 cm will not be cancerous.
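One way to see how close 3.46 cm is to the cutoff is to find the size at which the fitted model switches from predicting 0 to predicting 1. For a single feature, `predict()` returns 1 when `coef_ * x + intercept_` is positive, so the switch happens at `x = -intercept_ / coef_`. The following is a minimal sketch that refits the same sample data so it can run on its own; the name `boundary` is introduced here for illustration only.

```python
import numpy
from sklearn import linear_model

# Same sample data as above: tumor sizes in centimeters and their labels.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

# For this one-feature binary model, coef_ has shape (1, 1)
# and intercept_ has shape (1,).
coef = logr.coef_[0, 0]
intercept = logr.intercept_[0]

# Size (in cm) at which coef * x + intercept == 0, i.e. where the
# predicted label flips from 0 to 1.
boundary = -intercept / coef
print("Decision boundary (cm):", boundary)

# Sizes slightly below and slightly above the boundary get different labels.
nearby = numpy.array([boundary - 0.1, boundary + 0.1]).reshape(-1, 1)
print("Predictions around the boundary:", logr.predict(nearby))
```

Any size below the printed boundary is labeled 0 and any size above it is labeled 1, regardless of how far from the boundary it lies.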

#### Coefficient

In logistic regression the coefficient is the expected change in log-odds of having the outcome per unit change in X.

For more information about odds, log-odds and the Logit Function, see https://towardsdatascience.com/https-towardsdatascience-com-what-and-why-of-log-odds-64ba988bf704

Odds are the ratio of something happening to something not happening, whereas probability is the ratio of something happening to everything that could happen. Log-odds are simply the logarithm of the odds, i.e., log(odds).
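As a small numeric illustration of these definitions, the sketch below converts a probability into odds and log-odds and then back again. The probability 0.8 is an arbitrary value chosen for the example, not something taken from the tumor dataset.

```python
import numpy

p = 0.8                      # illustrative probability of the event
odds = p / (1 - p)           # about 4: the event is four times as likely to happen as not
log_odds = numpy.log(odds)   # natural logarithm of the odds

print("odds:", odds)
print("log-odds:", log_odds)

# Going back: exponentiate the log-odds to recover the odds,
# then divide the odds by 1 plus the odds to recover the probability.
odds_back = numpy.exp(log_odds)
p_back = odds_back / (1 + odds_back)
print("probability recovered:", p_back)
```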

The logit function, which is the basis for logistic regression, one of the most commonly used machine learning algorithms, is defined as logit(p) = log(p / (1 - p)), where p is the probability of the outcome.
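Because the coefficient is a change in log-odds, exponentiating it gives the factor by which the odds are multiplied for each one-unit increase in X. The sketch below shows this for the tumor data; it refits the model so that it runs on its own, and the names `log_odds_change` and `odds_multiplier` are chosen here for illustration.

```python
import numpy
from sklearn import linear_model

# Same sample data as above: tumor sizes in centimeters and their labels.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

# The coefficient is the change in log-odds per 1 cm increase in tumor size.
log_odds_change = logr.coef_[0, 0]

# Exponentiating turns that change in log-odds into a multiplier on the odds.
odds_multiplier = numpy.exp(log_odds_change)
print("Odds multiplier per additional cm:", odds_multiplier)
```

A multiplier above 1 means the odds of a tumor being cancerous grow as its size increases; a multiplier below 1 would mean they shrink.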

#### Probability
The coefficient and intercept values can be used to find the probability that each tumor is cancerous.

Create a function that uses the model's coefficient and intercept values to return a new value. This new value represents the probability that the tumor in the given observation is cancerous.

Let's see an example.

In [6]:
def logit2prob(logr, x):
  # To find the log-odds for each observation, build a formula that looks
  # similar to the one from linear regression, using the fitted coefficient
  # and intercept. For more information about coef_ and intercept_, see
  # https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
  log_odds = logr.coef_ * x + logr.intercept_

  # Exponentiate the log-odds to convert them to odds.
  odds = numpy.exp(log_odds)

  # Convert the odds to a probability by dividing them by 1 plus the odds.
  probability = odds / (1 + odds)
  return probability

Let us now use the function with what we have learned to find out the probability that each tumor is cancerous.

In [8]:
print(logit2prob(logr, X))

[[0.60749955]
 [0.19268876]
 [0.12775886]
 [0.00955221]
 [0.08038616]
 [0.07345637]
 [0.88362743]
 [0.77901378]
 [0.88924409]
 [0.81293497]
 [0.57719129]
 [0.96664243]]


#### Results Explained:

3.78 -- 0.61 -> The probability that a tumor with the size 3.78cm is cancerous is 61%.

2.44 -- 0.19 -> The probability that a tumor with the size 2.44cm is cancerous is 19%.

2.09 -- 0.13 -> The probability that a tumor with the size 2.09cm is cancerous is 13%.
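As a sanity check, the hand-written conversion can be compared with scikit-learn's own `predict_proba()` method, which returns one column of probabilities per class. The sketch below repeats the data, the model fit and the `logit2prob` function so that it runs on its own; the names `manual` and `builtin` are introduced here for illustration.

```python
import numpy
from sklearn import linear_model

# Same sample data as above: tumor sizes in centimeters and their labels.
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65,
                 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

logr = linear_model.LogisticRegression()
logr.fit(X, y)

def logit2prob(logr, x):
    # Log-odds -> odds -> probability, as described above.
    log_odds = logr.coef_ * x + logr.intercept_
    odds = numpy.exp(log_odds)
    return odds / (1 + odds)

manual = logit2prob(logr, X)            # shape (12, 1)
builtin = logr.predict_proba(X)[:, 1]   # shape (12,), probability of class 1

# Flatten the manual result so both arrays have the same shape, then compare.
print("Manual and built-in probabilities agree:",
      numpy.allclose(manual.ravel(), builtin))
```

In practice `predict_proba()` is usually the more convenient choice; the manual function is mainly useful for seeing how the coefficient and intercept produce the probabilities.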