# Logistic Regression

The dataset that was used for this lab is [here](https://archive.ics.uci.edu/ml/datasets/Blood+Transfusion+Service+Center).

---

## 1. Dataset Description

#### a. Describe the Dataset
This data set contains a list of blood doners that gave blood to a specific blood transfusion center.     
#### b. Feature Representation
The features are: "Recency" which represent how recent the donor gave blood, "Frequency" represents the total amount of times given, "Monetary" which is the total amount of blood given in c.c.s, "Time" represents the total amount of months since the first donation, and "Donated" is a binary value that represents whether or not a specific blood donor gave blood in March 2007.     
#### c. Target Variable
The target variable in this data set is the "Donated" value in each of the data points. We use the other features in order to predict this value.    

## 2. Splitting the Data

In [37]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, r2_score, accuracy_score as accuracy

#### Importing the Data

In [38]:
dataset = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data")

#dataset["Bias"] = 1
dataset.columns = ["Recency", "Frequency", "Monetary", "Time", "Donated"]
dataset.head()

Unnamed: 0,Recency,Frequency,Monetary,Time,Donated
0,2,50,12500,98,1
1,0,13,3250,28,1
2,1,16,4000,35,1
3,2,20,5000,45,1
4,1,24,6000,77,0


#### Splitting the Data

In [39]:
X = dataset.drop(["Donated"], axis = 1)
Y = dataset.Donated

In [40]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3)

X_train.shape, X_test.shape, Y_train.shape, Y_test.shape

((523, 4), (225, 4), (523,), (225,))

## 3. Logistic Regression with Scikit-Learn

In [41]:
logReg = LogisticRegression(solver='lbfgs', multi_class='multinomial')

logReg.fit(X_train, Y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [42]:
logReg.predict(X_test)

acc = logReg.score(X_train, Y_train)

print('Train accuracy: {:.2f}'.format(acc))

print('Test accuracy: {:.2f}'.format(logReg.score(X_test, Y_test)))

Train accuracy: 0.78
Test accuracy: 0.76


In [43]:
print(classification_report(Y_test, Y_pred))

              precision    recall  f1-score   support

           0       0.72      0.93      0.81       162
           1       0.21      0.05      0.08        63

   micro avg       0.68      0.68      0.68       225
   macro avg       0.46      0.49      0.44       225
weighted avg       0.58      0.68      0.60       225



In [44]:
logReg.predict_proba(X_train)

array([[0.86722379, 0.13277621],
       [0.8161895 , 0.1838105 ],
       [0.96150993, 0.03849007],
       ...,
       [0.96650458, 0.03349542],
       [0.87182875, 0.12817125],
       [0.64575612, 0.35424388]])

## 4. Parameters

In [45]:
print(logReg.intercept_) 
print(logReg.coef_)

[-0.15486996]
[[-6.28934336e-02  9.11092719e-07  2.27773180e-04 -1.07525510e-02]]


$\theta_0$ = -0.26037524      
$\theta_1$ = -0.06095942117   
$\theta_2$ = -0.000000980667403   
$\theta_3$ = -0.000245166851    
$\theta_4$ = -0.0073296504

The different thetas represent the different weights that each of the variables, or features, carries with the logistic regresssion calculation.     

$\theta_0$ represents the weight of the bias of the logistic function. It also shows as the y-intercept when graphing the logistic function. The $\theta_1$ represents the weight that the "Recency" feature carries when calculating the regression. Since $\theta_1$ is the highest number, the "Recency" feature carries the most amount of weight out of the other features in the calculation.      
      
$\theta_2$ represents the weight that the "Frequency" feature of each data point carries within the logistic calculation. When comparing to the other $\theta$s,  $\theta_2$ has the lowest amount of weight; therefore, when used in the calculations, it still matters, but not nearly as much as the other features in the data point. The $\theta_3$ weight represents the amount of weight that the "Monetary" tab carries within the calculation of the logistic regression of the data set. The $\theta_4$ weight represents the weight that the "Time" feature within each data point carries when that data point is being used in the calculation of logistic regression of the data set.

## 5. Statement of Collaboration

#### a. Whom you worked with
I mainly did this by myself, but I did get help from Kolby, Matt and Tucker.    
      
#### b. Resources Used
The only resources I used were trying to figure out Logistic Regressison using SkLearn and other websites similar to that.    

## 6. Extra Credit

#### a. Logistic Regression by Hand

#### d. Article
     
EssentiLly the article begins with the idea of the author wanting to get more in depth with machine learning, and as he got more in depth with it, he learned he had to become more familiar with statistics and statistical terms. He said that he wanted to write this blog post to explain to the average person what some of the specific terms mean. 

The author then defines a few terms with emphasis on Entropy. He goes on to describe entropy as being something used for probability distributions and that it "measures the uncertainty inherent in their probability distribution." From there he begins to go into the idea of response variables. At the start of this section, he describes the concept of a model taking in a specific input and returning a desired output. The author describes the idea of a maximum entropy distrubtions basically being a distrbution that obeys specific constraints. After that, the author gives a brief description/equation of other distributions such as Gaussian Distribution, Binomial Distirbution, and Multinomial Distributions. 

From the response variables, he begins to go into the idea of functional form, starting with exponential family distribtuions. A large portion of this section outlines the math behind the distributions above and how they work. Generalized linear models is up next, and here is goes into the ideas of regression, specifically Linear Regression, Logistic Regression and Softmax Regression with some emphasis on why the functions associated with them are associated with them. He then goes on to the loss function and he describes it as being a way to compute how good specific parameters are. The author then uses the generalized linear models to show this using maximum likelihood estimation. The author then uses the same models for maximum a posteriori estimation. 