## Part 2: Classification: Logistic regression

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model

### Task 1 -  Import CSV into pandas

1. Create a function to read the CSV file provided into a DataFrame. 
2. The file imported has NA (not available) values in some columns. These rows need to be dropped as machine learning algorithms cannot process data with missing values. Remember when rows are dropped some (row) indexes will be missing. 
3. The first step in processing data is to review the data types of the features (columns). 
4. Use **pandas** features *columns* and *dtypes* to create a dictionary with column names as keys and the datatype as values.
5. This function then returns the new dataframe (df) and the df_types dictionary (df_types), where a key-value pair represents column name-column's dtype. 
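
The drop-and-reindex step and the dtype dictionary can be tried on a small in-memory frame before reading the real file. This is a minimal sketch: the toy columns and values (`toy`, `clean`, `toy_types`) are made up for illustration and are not taken from the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the stroke dataset): the second row has a missing value
toy = pd.DataFrame({
    "age": [34.0, np.nan, 51.0],
    "gender": ["Male", "Female", "Female"],
})

# Dropping the NA row leaves a gap in the index (0, 2); reset_index closes it (0, 1)
clean = toy.dropna().reset_index(drop=True)

# Map each column name to its dtype
toy_types = clean.dtypes.to_dict()

print(list(clean.index))
print(toy_types)
```

Resetting the index matters later if rows are accessed by position through their labels, as in `df[col][j]`.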

In [None]:
def process_data(fl):
    
    # Import the CSV file (fl)
    df = pd.read_csv(fl)
    
    # Drop all rows with NA values
    df2 = df.dropna()
    df2 = df2.reset_index(drop=True)
    
    # Create a dictionary with keys the column names and values the type of data
    df2_types = {}
    
    for col in df2.columns:
        df2_types[col] = df2[col].dtype
    
    return df2, df2_types

### Task 2  - Convert categorical (non-numeric) variables 

Many machine learning algorithms are designed to process numeric data and cannot natively handle categorical data. Therefore as part of the model building process, we must apply pre-processing steps to convert the data into an encoded format which the algorithms can handle.

1. In the following function you will identify and convert categorical variables to numeric data type. 
2. You will need the python *dictionary* "df2_types" of the function "process_data" we created in task 1. We can use this to identify data in a categorical (non-numeric) data format.
3. Create a list "cat_ls" of column names which are non-numeric. 
4. Process each column named in "cat_ls" separately. 
5. For a column name, say "col_name", find the *distinct* categories. For example, in column "gender" there are 2 categories "Male" and "Female". 
6. For a (categorical) column 'C' with *k* categories *k-1* new columns are created and 'C' is replaced by these new columns. For example, the "*gender*" column will be replaced by one numerical column. The column "*smoking_status*" is to be replaced with 2 numerical columns. This process is referred to as *one-hot encoding* (with one category dropped, it is also known as dummy encoding).
7. The encoding is done as follows. Suppose there are 3 categories "cat1", "cat2", "cat3" in column 'C'. Create 2 columns with distinct names, say "cat_level1" and "cat_level2". If the observation in a row is 'cat1', put a 1 in 'cat_level1' and a 0 in 'cat_level2' in the same row. If it is 'cat2', put 0 in 'cat_level1' and 1 in 'cat_level2'. If it is 'cat3', put 0 in both.
8. It is simpler if the column has only 2 categories (like "gender"). It will be replaced by 1 column of 1's and 0's. 
9. The number of columns in the new DataFrame will be generally more than the original. For the *stroke-dataset* this number is 11. Remember to **drop** the old non-numeric columns.  
10. Depending on how you do it the column orderings may change. This is important for identifying the output column "stroke". 
11. You may reorder the columns. Suggestion: move "stroke" to the last column in the new dataframe. 
12. You should NOT use any feature-processing modules from **sklearn** or pandas.get_dummies() for this part. If used, the maximum mark for this task will not exceed 60%. 
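
The encoding rule in step 7 can be checked on a toy column. The sketch below uses only boolean comparisons, so it stays within the restriction on `sklearn` and `pandas.get_dummies()`. The frame `toy` and the names `cat_level1` and `cat_level2` are illustrative assumptions, and which category ends up as the all-zeros baseline depends on the order returned by `unique()`.

```python
import pandas as pd

# Hypothetical toy column with k = 3 categories
toy = pd.DataFrame({"C": ["cat1", "cat2", "cat3", "cat1"]})

# unique() returns categories in order of first appearance: cat1, cat2, cat3
categories = list(toy["C"].unique())
encoded = pd.DataFrame(index=toy.index)

# One indicator column per category except the last, which becomes the all-zeros baseline
for level, cat in enumerate(categories[:-1], start=1):
    encoded["cat_level" + str(level)] = (toy["C"] == cat).astype(int)

print(encoded)
```

Rows holding 'cat3' get 0 in both new columns, matching the rule described in step 7.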


In [None]:
def convert_to_numeric():
    
    # Read the appropriate file, should be in the same directory as the notebook
    df, dict_types = process_data("Stroke_data_for_part_TWO.csv")
    
    # Split the column names into numeric and categorical (object dtype) lists
    num_ls = [i for i in dict_types if dict_types[i] != "O"]
    cat_ls = [i for i in dict_types if dict_types[i] == "O"]
    
    # Keep the numeric columns and set the output column (assumed to be the
    # last column of the file) aside so it can be re-attached at the end
    df2 = df[num_ls]
    output = df2.iloc[:,-1] 
    df2 = df2.drop(df.columns[-1], axis=1) 
    
    # Apply the one hot encoding process outlined above to each categorical column
    for col in cat_ls:
        categories = df[col].unique()
        df_aux = pd.DataFrame(index=df.index)
        # k categories give k-1 indicator columns; the last category is encoded as all zeros
        for i in range(len(categories)-1):
            df_aux[col + str(i)] = (df[col] == categories[i]).astype(float)
        df2 = df2.join(df_aux)
    
    # Move the output column to the end of the new dataframe
    df2 = df2.join(output)
    
    return df2        

### Task 3 - Generate ndarrays for train and test data

1. Convert all columns except "id" and "stroke" into a numerical feature matrix **X**. The size of the matrix will be *no_of_rows* $\times$  *(no_of_columns-2)*. The number of columns should be 9. 
2. Put the values in the "stroke" column in the array **y**. 
3. Use the sklearn [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) method to generate *X_train, X_test, y_train, y_test*. 
4. In the *train_test_split()* method the fraction of data to be split for testing has to be specified. Vary this fraction between 0.2 and 0.33. Run your program a few times to choose an optimum value. The optimum will correspond to the fraction giving the best accuracy/precision (see Task 5). 
5. Return the 4 arrays. 
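
To see how the test fraction changes the split, the call can be tried on synthetic data first. This is a sketch under stated assumptions: `X_toy` and `y_toy` are randomly generated stand-ins with 9 feature columns, not the stroke data, and `random_state=50` is simply a fixed seed so that repeated runs give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 9 feature columns, binary target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 9))
y_toy = rng.integers(0, 2, size=100)

# Compare the two ends of the suggested range for the test fraction
for t_size in (0.2, 0.33):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_toy, y_toy, test_size=t_size, random_state=50
    )
    print(t_size, X_tr.shape, X_te.shape)
```

A larger test fraction leaves fewer rows for training, which is the trade-off being explored when the fraction is varied.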

In [4]:
# Import train_test_split from sklearn.model_selection
from sklearn.model_selection import train_test_split

def create_arrays(t_size=0.33):
    
    # Call the function created in Task 2 to source the encoded data frame
    df = convert_to_numeric()
    
    # Create the X and y objects
    # Your code goes here
    X = df.iloc[:,1:10]
    y = df.iloc[:,-1]
    
    # Create test/train splits for X and y
    # Your code goes here
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=t_size, random_state=50)
    
    # Function returns the four newly created objects
    
    return X_train, X_test, y_train, y_test


### Task 4 - Create the logistic regression model 
1. In the following function we will use [linear_model.LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) from sklearn to create and train a logistic regression model. 
2. The model should be trained on the train set created in task 3. **Do not use the full dataset or test set for training**.
3. As this is a binary classification problem (2 classes: "stroke", "no-stroke") the default model does not need significant adjustment.


In [5]:
def fit_logitmodel(X, y):
    
    # Create the logitmodel_stroke model
    # Your code goes here
    logitmodel_stroke = linear_model.LogisticRegression(max_iter=200)
    
    # Train the logitmodel_stroke model
    # Your code goes here
    logitmodel_stroke.fit(X, y)
    
    return logitmodel_stroke

### Task 5 - Model evaluation 
The process for evaluating a classification model is different from that for a regression model. In regression the target takes a wide range of values, so we measure how far the predictions fall from the observed values. Classification has a much smaller set of possible outputs, so we measure how often the correct prediction is made. There are multiple metrics for measuring this; [this article](https://www.mage.ai/blog/definitive-guide-to-accuracy-precision-recall-for-product-developers) and the [Wikipedia page](https://en.wikipedia.org/wiki/Precision_and_recall) provide additional context.

1. As this is binary classification there are 2 classes. Class 1 indicates positive stroke risk and class 0 indicates negative stroke risk. 
2. When testing we use a separate dataset which the model was not trained on. This is essential to observe how the model performs on data it has not seen before.
3. In the function below *X_ts* represents the data used to generate test predictions and *y_obs* represents the actual values we are trying to predict. 
4. We can evaluate a classification model by having it make a set of predictions for a test set (X_ts) and comparing these with the actual values (y_obs).
5. Suppose *y_pred* is a predicted value when run on a sample from *X_ts*. We compare it to the corresponding observed value in *y_obs*. There are four potential outcomes from this comparison:

    1. *y_pred* = 1 (positive) and *y_obs* = 1 (positive): counted as *true positive*.
    2. *y_pred* = 1 (positive) and *y_obs* = 0 (negative): counted as *false positive*. 
    3. *y_pred* = 0 (negative) and *y_obs* = 0 (negative): counted as *true negative*. 
    4. *y_pred* = 0 (negative) and *y_obs* = 1 (positive): counted as *false negative*. 
    
6. Count all 4 cases for the entire sample input to the function *evaluate_logitmodel* and store them in 4 variables: *tp*, *fp*, *tn* and *fn*. For example, *tp* will give the total number of true positives and *tn* the total number of true negatives. 
7. The two metrics we will be using for evaluation are *accuracy* and *precision*. The formulas for these are below. 
$$acc = \frac{tp+tn}{tp+tn+fp+fn} \quad\text{(accuracy)}, \quad prec = \frac{tp}{tp + fp} \quad\text{(precision)}$$

8. Run the model training/evaluation process for 5 different test/train split ratios.
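
The four counts and the two metrics can be worked through by hand on a handful of samples. The sketch below uses made-up vectors `y_pred` and `y_obs` (not model output) and returns `nan` for precision when no positive prediction is made, which is one possible convention rather than a requirement of the task.

```python
import numpy as np

# Hypothetical predictions and observations for 8 samples
y_pred = np.array([1, 1, 0, 0, 0, 1, 0, 0])
y_obs = np.array([1, 0, 0, 1, 0, 1, 0, 0])

# Count each of the four outcomes with boolean masks
tp = int(np.sum((y_pred == 1) & (y_obs == 1)))
fp = int(np.sum((y_pred == 1) & (y_obs == 0)))
tn = int(np.sum((y_pred == 0) & (y_obs == 0)))
fn = int(np.sum((y_pred == 0) & (y_obs == 1)))

acc = (tp + tn) / (tp + tn + fp + fn)
# Precision is undefined when there are no positive predictions
prec = tp / (tp + fp) if (tp + fp) > 0 else float("nan")

print(tp, fp, tn, fn)
print(acc, prec)
```

Here tp = 2, fp = 1, tn = 4 and fn = 1, so the accuracy is 6/8 and the precision is 2/3.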

In [6]:
# The model object is the output of the function fit_logitmodel; it is used to obtain y_preds
def evaluate_logitmodel(model, X_ts,  y_obs):
    
    # Use the .predict() method of the model to generate a set of predictions for X_ts
    # Your code goes here
    y_preds = model.predict(X_ts)
    
    # Determine the tp, fp, tn and fn values for the prediction set
    # Your code goes here
    tp = 0
    fp = 0
    tn = 0
    fn = 0
    
    # Compare each prediction with the observed value at the same position
    for i in range(y_preds.shape[0]):
        if y_preds[i] == 1 and y_obs.iloc[i] == 1:
            tp += 1
        elif y_preds[i] == 1 and y_obs.iloc[i] == 0:
            fp += 1
        elif y_preds[i] == 0 and y_obs.iloc[i] == 0:
            tn += 1
        else:
            fn += 1
    
    # Calculate the accuracy and precision values
    # Your code goes here
    acc = (tp+tn)/(tp+tn+fp+fn)
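    # Precision is undefined if the model makes no positive predictions
    prec = tp/(tp+fp) if (tp+fp) > 0 else float("nan")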
    
    return acc, prec

In [7]:
t_sizes = [0.2, 0.24, 0.27, 0.30, 0.33]
for size in t_sizes:
    X_train, X_test, y_train, y_test = create_arrays(size)
    logitmodel_stroke = fit_logitmodel(X_train, y_train)
    acc, prec = evaluate_logitmodel(logitmodel_stroke, X_test,  y_test)
    print("Test size %.2f -> Accuracy = %.7f, Precision = %.2f, Accuracy/Precision = %.7f" % (size, acc, prec,(acc/prec)))

Test size 0.20 -> Accuracy = 0.9503650, Precision = 1.00, Accuracy/Precision = 0.9503650
Test size 0.24 -> Accuracy = 0.9476886, Precision = 1.00, Accuracy/Precision = 0.9476886
Test size 0.27 -> Accuracy = 0.9481081, Precision = 1.00, Accuracy/Precision = 0.9481081
Test size 0.30 -> Accuracy = 0.9503891, Precision = 1.00, Accuracy/Precision = 0.9503891
Test size 0.33 -> Accuracy = 0.9513705, Precision = 1.00, Accuracy/Precision = 0.9513705


After evaluating the model with different sizes for the train and test data sets, the following results were obtained. The precision of the model was equal to 1 for every split. The accuracy, however, varied with the size of the test set. With a test set of 20% of the data, the accuracy was 95.04%. When the test set increased to 24% and 27% of the data, the accuracy decreased to 94.77% and 94.81%. At 30% of the data the accuracy rose again to 95.04%, and at 33% the highest accuracy of 95.14% was obtained. 

Accuracy and precision differed for every test set size. Accuracy measures the fraction of correct predictions over all predictions. Precision measures the fraction of true positives over all positive predictions. A precision of 1 therefore means that the model's only errors were positive observations classified as negative (false negatives); every positive prediction it made was correct. 
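
To see why a precision of 1 is compatible with an accuracy below 1, set $fp = 0$ in the Task 5 formulas (assuming at least one true positive, $tp > 0$):

$$prec = \frac{tp}{tp + 0} = 1, \quad acc = \frac{tp+tn}{tp+tn+fn} = 1 - \frac{fn}{tp+tn+fn}$$

With no false positives, the gap between the accuracy and 1 is exactly the share of false negatives in the test set.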

- For a fixed train/test split, evaluate the metrics on the train data (*X_train*) and test data (*X_test*) separately and record the values of the metrics.  

In [8]:
# Test size is equal to 35% of all the data
X_train, X_test, y_train, y_test = create_arrays(0.35)
logitmodel_stroke = fit_logitmodel(X_train, y_train)
acc, prec = evaluate_logitmodel(logitmodel_stroke, X_test,  y_test)
print("Accuracy = ", acc)
print("Precision = ", prec)

Accuracy =  0.9516263552960801
Precision =  1.0
