# Data Description

### This notebook presents the data, and its description, used for our colonoscopy relevance detection task

### Codebook
* quality: a measure of the quality of the recorded video.
* bits: number of bits used to encode that block in the video stream.
* intra_parts: number of sub-blocks inside this block that are encoded without making use of
information in other frames.
* skip_parts: number of sub-blocks inside this block that are copied directly from another
frame.
* inter_16x16_parts: number of sub-blocks inside this block making use of information in other
frames and whose size is 16x16 pixels.
* inter_4x4_parts: number of sub-blocks inside this block making use of information in other
frames and whose size is 4x4 pixels.
* inter_other_parts: number of sub-blocks inside this block making use of information in other
frames and whose size is different from 16x16 and 4x4 pixels.
* non_zero_pixels: number of pixels different from 0 after encoding the block.
* frame_width: the width of the video frame in pixels.
* frame_height: the height of the video frame in pixels.
* movement_level: a measure of the level of movement of this frame with respect to the previous
one.
* mean: mean of the pixels of the encoded block.
* sub_mean_1: mean of the pixels contained in the first 32x32 sub-block of the current block.
* sub_mean_2: mean of the pixels contained in the second 32x32 sub-block of the current block.
* sub_mean_3: mean of the pixels contained in the third 32x32 sub-block of the current block.
* sub_mean_4: mean of the pixels contained in the fourth 32x32 sub-block of the current block.
* var_sub_blocks: variance of the four previous values.
* sobel_h: mean of the pixels of the encoded block after applying the Sobel operator in
horizontal direction.
* sobel_v: mean of the pixels of the encoded block after applying the Sobel operator in vertical
direction.
* variance: variance of the pixels of the encoded block.
* block_movement_h: a measure of the movement of the current block in the horizontal
direction.
* block_movement_v: a measure of the movement of the current block in the vertical direction.
* var_movement_h: a measure of the variance of the movements inside the current block in the
horizontal direction.
* var_movement_v: a measure of the variance of the movements inside the current block in the
vertical direction.
* cost_1: a measure of the cost of encoding this block without partitioning it.
* cost_2: a measure of the cost of encoding this block without partitioning it and without
considering any movement in it.
* relevant: the target variable that indicates whether the current block is relevant (1) or not (0).
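
As a worked illustration of the `var_sub_blocks` entry, the sketch below computes the variance of four sub-block means. The numbers are made up, and the codebook does not say whether the population or the sample variance is used, so both are shown as an assumption.

```python
from statistics import pvariance, variance

# Hypothetical sub_mean_1..sub_mean_4 values for one block (not taken from data.csv)
sub_means = [100.0, 110.0, 90.0, 120.0]

# Population variance divides by n, sample variance by n - 1
population_var = pvariance(sub_means)
sample_var = variance(sub_means)

print(population_var, sample_var)
```

Comparing both values against the `var_sub_blocks` column for a few rows would show which convention the dataset follows.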

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [2]:
# There are several variables with missing data.
# The file is comma-separated, so we read it with the default separator;
# reading it with sep=';' collapses every row into a single column.
df = pd.read_csv('data.csv')
print(df.isna().sum())

In [3]:
# The number of missing values is not significant, so we drop rows with missing data
df = df.dropna()
df.shape

In [4]:
# Here we explore the number of unique values for each variable to decide which variables should be considered numerical
df.nunique()

In [5]:
# Dividing the columns into the appropriate type: fewer than 30 unique values is treated as categorical
cat = df.loc[:, df.nunique() < 30]
cont = df.loc[:, df.nunique() >= 30]

# 'relevant' is binary, so it lands in cat; we add it to the continuous columns to compare the groups
lst = cont.columns.tolist()
lst.append('relevant')
cont_rel = df[lst]
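
A file read with the wrong separator ends up as a single column whose name is the whole header line. A minimal sketch of checking the delimiter before loading, using only the standard library; the sample text and the helper `guess_delimiter` are hypothetical stand-ins, not part of the notebook.

```python
import csv
import io

def guess_delimiter(header_line, candidates=(",", ";", "\t")):
    """Return the candidate character that occurs most often in the header line."""
    return max(candidates, key=header_line.count)

# Hypothetical three-column excerpt standing in for data.csv
sample = "quality,bits,relevant\n22,1024,1\n27,2048,0\n"

header = sample.splitlines()[0]
delimiter = guess_delimiter(header)

# Parse with the detected delimiter and confirm every row has the same width
rows = list(csv.reader(io.StringIO(sample), delimiter=delimiter))
n_columns = len(rows[0])
assert all(len(row) == n_columns for row in rows)

print(repr(delimiter), n_columns)
```

The detected character can then be passed to `pd.read_csv` through its `sep` parameter.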

In [None]:
# We explore how the occurrence of a certain categorical value increases the chance of "relevant" being true
fig, ((ax1, ax2), (ax3, ax4), (ax5, ax6), (ax7, ax8)) = plt.subplots(ncols=2, nrows=4,figsize=(17, 20))
pd.crosstab(df.quality, df.relevant).plot(kind='bar', ax=ax1)
pd.crosstab(df.intra_parts, df.relevant).plot(kind='bar', ax=ax2)
pd.crosstab(df.skip_parts, df.relevant).plot(kind='bar', ax=ax3)
pd.crosstab(df.inter_16x16_parts, df.relevant).plot(kind='bar', ax=ax4)
pd.crosstab(df.inter_4x4_parts, df.relevant).plot(kind='bar', ax=ax5)
pd.crosstab(df.inter_other_parts, df.relevant).plot(kind='bar', ax=ax6)
pd.crosstab(df.frame_width, df.relevant).plot(kind='bar', ax=ax7)
pd.crosstab(df.frame_height, df.relevant).plot(kind='bar', ax=ax8)

In [None]:
# Here we check how the categorical variables correlate with the image's relevance to colonoscopy
plt.figure(figsize=(13, 9))
corrMatrix = cat.corr()
viz = sns.heatmap(corrMatrix, annot=True)

viz.set_xticklabels(viz.get_xticklabels(), rotation=45)
viz.set_yticklabels(viz.get_yticklabels(), rotation=45)

#plt.savefig('CatCorr.png')

In [None]:
# Here we explore whether the variable means differ noticeably between the subgroups relevant and not relevant to colonoscopy
cont_rel.groupby('relevant').mean()

In [None]:
# We visualize the mean difference of each variable and cannot claim that the subgroups of any variable are equal

# One boxplot per continuous variable (the first 18 columns of cont_rel), split by "relevant"
fig, axes = plt.subplots(ncols=3, nrows=6, figsize=(17, 20))
for ax, col in zip(axes.flat, cont_rel.columns[:18]):
    sns.boxplot(y=col, x="relevant", data=df, ax=ax)

In [None]:
from scipy.stats import ttest_rel
from statsmodels.stats.stattools import jarque_bera
import random

def equality_testing(df, variables, y, alpha=0.05):
    """Compare each variable between the y == 0 and y == 1 groups at significance level alpha."""

    def check_normality(var):
        # Jarque-Bera test: a p-value below alpha rejects normality
        normality = jarque_bera(var)
        if float(normality[1]) < alpha:
            print(" violates the normality!")

    for el in variables:
        zero_y = df.loc[df[y] == 0][el].tolist()
        one_y = df.loc[df[y] == 1][el].tolist()
        print(el.upper())
        # Resample both groups with replacement to a common size, as the paired test needs equal lengths
        sample_size = max(len(zero_y), len(one_y))
        zero_y = random.choices(zero_y, k=sample_size)
        one_y = random.choices(one_y, k=sample_size)
        check_normality(zero_y)
        check_normality(one_y)

        # A p-value of at least alpha means we cannot reject equal means
        if ttest_rel(zero_y, one_y).pvalue >= alpha:
            print("!!!The groups related to 1 or 0 have the same mean :" + el)
        else:
            print("!!!The groups related to 1 or 0 are different :" + el)

In [None]:
# Here, with the paired t-test (even though the normality assumptions are not met), we claim that each of the subgroups is
# statistically different, and we cannot suspect any variable of being inefficient in identifying relevance to colonoscopy
equality_testing(cont_rel, cont_rel.columns[:-1].tolist(), cont_rel.columns[-1])
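
The two groups here are independent observations of different sizes, so an unpaired test is a natural alternative to the paired one. The sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom by hand; the function name `welch_t` and the sample values are illustrative assumptions, not part of the notebook.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Return Welch's t statistic and degrees of freedom for two independent samples."""
    va, vb = variance(a) / len(a), variance(b) / len(b)  # squared standard errors
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Made-up samples standing in for the relevant == 0 and relevant == 1 groups
group_0 = [1.0, 2.0, 3.0, 4.0, 5.0]
group_1 = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(group_0, group_1)
print(round(t_stat, 4), round(dof, 3))
```

In practice `scipy.stats.ttest_ind(a, b, equal_var=False)` runs the same test and also returns a p-value, without any need to resample the groups to equal size.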

In [None]:
# Here we explore the correlation matrix, where the correlations range from 0.02 to 0.25
plt.figure(figsize=(18, 15))
corrMatrix = cont_rel.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

### Variable selection and Construction

In [None]:
# Here we generate new variables from old ones to reduce the cross-correlation between the predictor variables and to decrease needless model complexity

In [None]:
# The proportion of useful information per square: frame area divided by the number of non-zero pixels
df["pixels_height_width"] = (df['frame_height']*df['frame_width'])/df['non_zero_pixels']
df = df.drop(['frame_height', 'frame_width', 'non_zero_pixels'], axis = 1)

In [None]:
# unweighted mean of the four sub-block means
df['sub_mean'] = (df['sub_mean_1']+ df['sub_mean_2'] + df['sub_mean_3'] +df['sub_mean_4'])/4
df = df.drop(['sub_mean_1', 'sub_mean_2', 'sub_mean_3', 'sub_mean_4'], axis=1)

In [None]:
# block movement relative to movement variance, averaged over the horizontal and vertical directions
#df['movement'] = (df['var_movement_h'] + df['var_movement_v'])/(df['block_movement_h'] + df['block_movement_v'])
df['movement_var'] = ((df['block_movement_h']/df['var_movement_h'])+(df['block_movement_v']/df['var_movement_v']))/2
df = df.drop(['block_movement_h', 'block_movement_v', 'var_movement_h', 'var_movement_v'], axis=1)
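
Ratios like the one above are undefined when the denominator is zero, for example for a block with no movement variance. A minimal sketch of guarding the division, with a hypothetical helper `safe_ratio` and made-up values; returning NaN for such rows is an assumption about the desired behaviour, not something the notebook specifies.

```python
import math

def safe_ratio(numerator, denominator):
    """Divide, returning NaN instead of raising ZeroDivisionError when the denominator is 0."""
    return numerator / denominator if denominator != 0 else math.nan

def movement_var(move_h, var_h, move_v, var_v):
    """Average of the horizontal and vertical movement-to-variance ratios."""
    return (safe_ratio(move_h, var_h) + safe_ratio(move_v, var_v)) / 2

print(movement_var(4.0, 2.0, 3.0, 1.0))   # both ratios defined
print(movement_var(4.0, 0.0, 3.0, 1.0))   # horizontal variance is zero
```

Rows that come out as NaN can then be dropped or imputed explicitly rather than carrying infinite values into later steps.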

In [None]:
# the average cost of the block encoding
df['cost'] = (df['cost_1']+df['cost_2'])/2
df = df.drop(['cost_1', 'cost_2'], axis=1)

In [None]:
# the mean of pixels encoded after Sobel operation
df['sobel'] = (df['sobel_h']+df['sobel_v'])/2
df = df.drop(['sobel_h', 'sobel_v'], axis=1)

In [None]:
# Here we explore how the transformed variables correlate with "relevant" and check that not much variability is shared among the predictor variables
plt.figure(figsize=(18, 15))
corrMatrix = df.corr()
sns.heatmap(corrMatrix, annot=True)
plt.show()

In [None]:
# deleting based on correlation pattern with other variables
df = df.drop(["var_sub_blocks", "sub_mean", "movement_var", "inter_16x16_parts", "mean"], axis=1)
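
Reading strongly correlated pairs off a heatmap can also be done programmatically. The sketch below lists column pairs whose absolute Pearson correlation reaches a threshold; the helper names, the threshold of 0.8 and the toy columns are all assumptions for illustration, since the notebook does not state the cut-off it used.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def correlated_pairs(columns, threshold=0.8):
    """List column pairs whose absolute correlation is at least the threshold."""
    return [(a, b) for a, b in combinations(columns, 2)
            if abs(pearson(columns[a], columns[b])) >= threshold]

# Made-up columns: "mean" and "sub_mean" move together, "bits" does not
columns = {
    "mean": [1.0, 2.0, 3.0, 4.0, 5.0],
    "sub_mean": [2.1, 3.9, 6.2, 8.0, 9.8],
    "bits": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(correlated_pairs(columns))
```

For each reported pair, one of the two variables is a candidate for removal.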

In [None]:
# Generation of the final correlation matrix of variables we use for predicting "relevant"
%matplotlib inline
plt.rcParams['figure.dpi'] = 300
plt.rcParams['savefig.dpi'] = 300

plt.figure(figsize=(15, 10))
corrMatrix = df.corr()
viz = sns.heatmap(corrMatrix, annot=True)
viz.set_xticklabels(viz.get_xticklabels(), rotation=45)
viz.set_yticklabels(viz.get_yticklabels(), rotation=45)
plt.savefig('FinalCorr.png')
plt.show()

In [None]:
df.describe().T

In [None]:
# Here we see that the number of relevant observations is 6 times higher than the number of irrelevant observations
%matplotlib inline
sns.countplot(x='relevant', data=df)
plt.show()

In [None]:
# Checking the generated continuous data
cont = df.loc[:, df.nunique() >= 30]
lst = cont.columns.tolist()
lst.append('relevant')
cont_rel = df[lst]

In [None]:
# Here we check how the generated and selected variables correlate with the relevance to colonoscopy
fig, (ax1, ax2, ax3) = plt.subplots(ncols=3, nrows=1,figsize=(17, 5))
sns.boxplot(y="pixels_height_width", x="relevant", data=df, ax=ax1)
sns.boxplot(y="cost", x="relevant", data=df, ax=ax2)
sns.boxplot(y="sobel", x="relevant", data=df, ax=ax3)

In [None]:
# Checking if all the sub-groups (relevant=0 and relevant=1) of variables are statistically different from each other

In [None]:
# All sub-groups of variables are statistically different in their means
equality_testing(cont_rel, ["pixels_height_width", "cost", "sobel"], cont_rel.columns[-1])

### Data Transformation

In [None]:
# The column that has less than 30 different values is considered to have categorical data
cat = df.loc[:, df.nunique() < 30]
cont = df.loc[:, df.nunique() >= 30]
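
The rule above can be spelled out without pandas. This is a small sketch on made-up columns; the helper `split_by_cardinality` is hypothetical, and the threshold of 30 is the one used in this notebook.

```python
def split_by_cardinality(columns, threshold=30):
    """Split column names into (categorical, continuous) by number of distinct values."""
    categorical = [name for name, values in columns.items() if len(set(values)) < threshold]
    continuous = [name for name, values in columns.items() if len(set(values)) >= threshold]
    return categorical, continuous

# Made-up columns: "relevant" has 2 distinct values, "bits" has 40
columns = {
    "relevant": [i % 2 for i in range(40)],
    "bits": [100 + 7 * i for i in range(40)],
}
categorical, continuous = split_by_cardinality(columns)
print(categorical, continuous)
```

Note that the split depends only on the count of distinct values, so a numeric variable that happens to take few values is treated as categorical.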

In [None]:
# Normalization of continuous data

In [None]:
print(cont.head())
# Scale each continuous column by its maximum value (vectorized, and without writing into a slice of df)
cont = cont / cont.max()
cont.head()
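
Dividing by the maximum maps a non-negative column onto values up to 1, but it does not anchor the minimum at 0. The sketch below contrasts it with min-max scaling on made-up values; both helper functions are illustrative and not part of the notebook.

```python
def scale_by_max(values):
    """Divide every value by the column maximum."""
    peak = max(values)
    return [v / peak for v in values]

def min_max_scale(values):
    """Map values linearly onto [0, 1] using the column minimum and maximum."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

column = [2.0, 4.0, 8.0]   # made-up values
print(scale_by_max(column))
print(min_max_scale(column))
```

Which of the two suits a model better depends on whether a true zero in the original variable should be preserved.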

In [None]:
# Categorical data transformations:

In [None]:
# We believe that the order of the selected categorical variables matters, so we use a label encoder.
# We found no need for one-hot encoding, as none of the variables differ only by class without an order.

print(cat.head())
from sklearn.preprocessing import LabelEncoder
# LabelEncoder maps the sorted unique values of a column to 0..n-1, so the smallest value gets label 0
encode = LabelEncoder()
cat = cat.copy()  # work on a copy to avoid writing into a slice of df
for el in cat.columns:
    cat[el] = encode.fit_transform(cat[el])
cat.head()
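
To make the encoding concrete, the following sketch reproduces the mapping that a label encoder applies to a numeric column: each distinct value is replaced by its rank among the sorted distinct values. The helper `label_encode` and the quality levels are made up for illustration.

```python
def label_encode(values):
    """Map each distinct value to its rank among the sorted distinct values."""
    mapping = {value: code for code, value in enumerate(sorted(set(values)))}
    return [mapping[value] for value in values], mapping

quality = [22, 37, 27, 22, 32]   # made-up quality levels
codes, mapping = label_encode(quality)
print(codes)
print(mapping)
```

Because the codes follow the sorted order, the ordering of the original values is preserved, although the spacing between them is not.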

In [None]:
# First version

In [None]:
from sklearn.utils import shuffle
from sklearn import metrics

In [None]:
df = pd.concat([cont, cat], axis=1)

In [None]:
# As there are 6 times more relevant observations than irrelevant ones, we randomly select a sample of the relevant observations equal in size to the irrelevant data
# (this will help to avoid low recall)

sample_size = df['relevant'].value_counts()[0]
not_relevant = df[df.relevant == 0]
relevant = df[df.relevant == 1].sample(sample_size, replace=False)
not_relevant = not_relevant.reset_index(drop=True)
relevant = relevant.reset_index(drop=True)
# DataFrame.append was removed in pandas 2.0, so we concatenate instead
fin_df = pd.concat([relevant, not_relevant])
fin_df = shuffle(fin_df).reset_index(drop=True)
fin_df = fin_df.dropna(axis=1)
fin_df.head()
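
The balancing step is random undersampling of the majority class. A minimal standard-library sketch of the same idea follows; the helper `undersample`, the fixed seed and the toy rows are assumptions for illustration only.

```python
import random

def undersample(rows, label_key="relevant", seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = [row for group in by_class.values() for row in rng.sample(group, smallest)]
    rng.shuffle(balanced)
    return balanced

# Made-up data: 12 relevant rows and 2 irrelevant rows (a 6:1 imbalance)
rows = [{"bits": i, "relevant": 1} for i in range(12)] + [{"bits": i, "relevant": 0} for i in range(2)]
balanced = undersample(rows)
counts = {label: sum(r["relevant"] == label for r in balanced) for label in (0, 1)}
print(counts)
```

Fixing the seed makes the sample reproducible; in pandas, passing `random_state` to `DataFrame.sample` serves the same purpose.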

In [None]:
# Saving to the file
fin_df.to_csv("tryMe.csv", index=False)

In [None]:

#Importing all the libraries for modelling 
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import recall_score
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import precision_recall_curve
from sklearn import tree


In [None]:
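# Dividing into explanatory variables and our target variable

# Select the target by name rather than by position, so the column order cannot affect the split
X = df.drop(columns='relevant')
y = df['relevant']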

In [None]:
#Logistic Regression model


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
#print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))


confusion = metrics.confusion_matrix(y_test,y_pred)

TP = confusion[1,1]
TN = confusion[0,0]
FP = confusion[0,1]
FN = confusion[1,0]

accuracy = (TP + TN) / float(TP+TN+FP+FN) # metrics.accuracy_score(y_test, y_pred)
sensitiviy = TP / float(TP+FN)  # recall: metrics.recall_score(y_test, y_pred)
specificity = TN / float(TN+FP) # when the actual value is negative, how often is the prediction correct?
precision = TP / float(TP+FP)   #metrics.precision_score(y_test, y_pred)


print("METRICS FOR LOGISTIC REGRESSION")
print("accuracy", accuracy.round(4))  
print("recall", sensitiviy.round(4))
print("specificity", specificity.round(4))
print("precision",precision.round(4))
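
The four metrics follow directly from the confusion-matrix counts, where scikit-learn places true classes on the rows and predicted classes on the columns. The sketch below collects the formulas in one hypothetical helper, `binary_metrics`, so they can be reused for every model; the counts are made up and are not results from this notebook.

```python
def binary_metrics(tp, tn, fp, fn):
    """Accuracy, recall (sensitivity), specificity and precision from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
    }

# Made-up counts, not results from the models in this notebook
scores = binary_metrics(tp=80, tn=10, fp=5, fn=5)
for name, value in scores.items():
    print(name, round(value, 4))
```

Recall divides by all actual positives (TP + FN), whereas specificity divides by all actual negatives (TN + FP).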

In [None]:
#Decision Tree model

X_train2, X_test2, y_train2, y_test2 = train_test_split(X, y, test_size=0.3, random_state=0)

clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train2, y_train2)

y_pred2 = clf.predict(X_test2)


confusion = metrics.confusion_matrix(y_test2,y_pred2)

TP = confusion[1,1]
TN = confusion[0,0]
FP = confusion[0,1]
FN = confusion[1,0]

accuracy = (TP + TN) / float(TP+TN+FP+FN) # metrics.accuracy_score(y_test2, y_pred2)
sensitiviy = TP / float(TP+FN)  # recall: metrics.recall_score(y_test2, y_pred2)
specificity = TN / float(TN+FP) # when the actual value is negative, how often is the prediction correct?
precision = TP / float(TP+FP)   # metrics.precision_score(y_test2, y_pred2)

print("METRICS FOR DECISION TREE")
print("accuracy", accuracy.round(4))  
print("recall", sensitiviy.round(4))
print("specificity", specificity.round(4))
print("precision",precision.round(4))


#tree.plot_tree(clf) 

In [None]:
#Random Forest model

X_train3, X_test3, y_train3, y_test3 = train_test_split(X, y, test_size=0.3, random_state=0)

#Create a random forest classifier
clf=RandomForestClassifier()

#Train the model using the training set
clf.fit(X_train3,y_train3)

y_pred3=clf.predict(X_test3)

confusion = metrics.confusion_matrix(y_test3,y_pred3)

TP = confusion[1,1]
TN = confusion[0,0]
FP = confusion[0,1]
FN = confusion[1,0]

accuracy = (TP + TN) / float(TP+TN+FP+FN) # metrics.accuracy_score(y_test3, y_pred3)
sensitiviy = TP / float(TP+FN)  # recall: metrics.recall_score(y_test3, y_pred3)
specificity = TN / float(TN+FP) # when the actual value is negative, how often is the prediction correct?
precision = TP / float(TP+FP)   # metrics.precision_score(y_test3, y_pred3)

print("METRICS FOR RANDOM FOREST")
print("accuracy", accuracy.round(4))  
print("recall", sensitiviy.round(4))
print("specificity", specificity.round(4))
print("precision",precision.round(4))