In [2]:
import pandas as pd
import numpy as np
df = pd.read_csv('HTRU_2.csv')
df.head()

Unnamed: 0,140.5625,55.68378214,-0.234571412,-0.699648398,3.199832776,19.11042633,7.975531794,74.24222492,0
0,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0
1,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
2,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,0
3,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,0
4,93.570312,46.698114,0.531905,0.416721,1.636288,14.545074,10.621748,131.394004,0


Oops: that first row is supposed to be data, not headers. The last column is all 0s in these rows, so that's likely the target column.

In [3]:
df = pd.read_csv('HTRU_2.csv', header=None)
df.columns = [['Mean of integrated profile', 
               'Standard deviation of integrated profile',
               'Excess kurtosis of integrated profile',
               'Skewness of integrated profile',
               'Mean of DM-SNR curve',
               'Standard deviation of DM-SNR curve', 
               'Excess kurtosis of DM-SNR curve',
               'Skewness of DM-SNR curve',
               'Class']]
df.head()

Unnamed: 0,Mean of integrated profile,Standard deviation of integrated profile,Excess kurtosis of integrated profile,Skewness of integrated profile,Mean of DM-SNR curve,Standard deviation of DM-SNR curve,Excess kurtosis of DM-SNR curve,Skewness of DM-SNR curve,Class
0,140.5625,55.683782,-0.234571,-0.699648,3.199833,19.110426,7.975532,74.242225,0
1,102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0
2,103.015625,39.341649,0.323328,1.051164,3.121237,21.744669,7.735822,63.171909,0
3,136.75,57.178449,-0.068415,-0.636238,3.642977,20.95928,6.896499,53.593661,0
4,88.726562,40.672225,0.600866,1.123492,1.17893,11.46872,14.269573,252.567306,0


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17898 entries, 0 to 17897
Data columns (total 9 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   (Mean of integrated profile,)                17898 non-null  float64
 1   (Standard deviation of integrated profile,)  17898 non-null  float64
 2   (Excess kurtosis of integrated profile,)     17898 non-null  float64
 3   (Skewness of integrated profile,)            17898 non-null  float64
 4   (Mean of DM-SNR curve,)                      17898 non-null  float64
 5   (Standard deviation of DM-SNR curve,)        17898 non-null  float64
 6   (Excess kurtosis of DM-SNR curve,)           17898 non-null  float64
 7   (Skewness of DM-SNR curve,)                  17898 non-null  float64
 8   (Class,)                                     17898 non-null  int64  
dtypes: float64(8), int64(1)
memory usage: 1.2 MB
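
The tuple-style names in the `df.info()` output, such as `(Class,)`, appear because the column names were assigned as a list nested inside another list, which pandas treats as a one-level `MultiIndex`. Below is a minimal sketch of one way to get plain string column names. It assumes the same nine names, and it reads a two-row in-memory sample (copied from the first rows shown earlier) rather than the real file. The names `sample_csv` and `sample` are illustrative only.

```python
import io
import pandas as pd

# A flat list of names (one pair of brackets) gives ordinary string columns
columns = ['Mean of integrated profile',
           'Standard deviation of integrated profile',
           'Excess kurtosis of integrated profile',
           'Skewness of integrated profile',
           'Mean of DM-SNR curve',
           'Standard deviation of DM-SNR curve',
           'Excess kurtosis of DM-SNR curve',
           'Skewness of DM-SNR curve',
           'Class']

# Two-row stand-in for HTRU_2.csv so the sketch runs without the file
sample_csv = (
    "140.5625,55.68378214,-0.234571412,-0.699648398,3.199832776,19.11042633,7.975531794,74.24222492,0\n"
    "102.507812,58.88243,0.465318,-0.515088,1.677258,14.860146,10.576487,127.39358,0\n"
)

# header=None keeps the first row as data; names= assigns the column labels
sample = pd.read_csv(io.StringIO(sample_csv), header=None, names=columns)
print(type(sample.columns).__name__)
print(sample['Class'].count())
```

With flat names, an expression like `df.Class.count()` returns a single number instead of the one-element Series seen in the later cells.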


In [5]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

In [6]:
X = df.iloc[:,:-1]
y = df.iloc[:, -1]

In [7]:
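def clf_model(model):
    """Cross-validate `model` on X, y and print the per-fold and mean scores."""
    clf = model
    # With no cv or scoring given, this runs 5-fold CV and reports accuracy
    scores = cross_val_score(clf, X, y)
    print(f'Scores: {scores}')
    print(f'Mean score: {scores.mean()}')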

In [8]:
clf_model(LogisticRegression(max_iter=1000))

Scores: [0.97486034 0.97988827 0.98184358 0.97736798 0.9782062 ]
Mean score: 0.9784332723007113


Trying Naive Bayes. This is a probabilistic type of algorithm, and there is a family of such 
algorithms, such as GaussianNB, MultinomialNB, and ComplementNB.  We are using GaussianNB here since the features are continuous.

In [9]:
from sklearn.naive_bayes import GaussianNB
clf_model(GaussianNB())

Scores: [0.96061453 0.92374302 0.94273743 0.92847164 0.96451523]
Mean score: 0.9440163679814436


Not as good as logistic regression, but let's try again using the K Nearest Neighbors classifier.

In [10]:
from sklearn.neighbors import KNeighborsClassifier
clf_model(KNeighborsClassifier())

Scores: [0.96955307 0.96927374 0.97318436 0.9706622  0.97289746]
Mean score: 0.9711141653437728


Still not as good as the original logistic regression.  Let's try the decision tree classifier.

In [11]:
from sklearn.tree import DecisionTreeClassifier
clf_model(DecisionTreeClassifier(random_state=0))

Scores: [0.96843575 0.96424581 0.96871508 0.96227997 0.96954457]
Mean score: 0.9666442360073738


Not as good here either.  Let's try the random forest ensemble classifier.


In [12]:
from sklearn.ensemble import RandomForestClassifier
clf_model(RandomForestClassifier(random_state=0))

Scores: [0.97709497 0.98324022 0.98072626 0.97485331 0.97848561]
Mean score: 0.978880074800083


Slightly better than the original logistic regression.  All of them are doing pretty well, which is a little suspicious.  Let's look into it.  The dataset may be imbalanced.

In [13]:
df.Class.count()

Class    17898
dtype: int64

In [14]:
df[df.Class == 1].Class.count()

Class    1639
dtype: int64

In [15]:
df[df.Class==1].Class.count()/df.Class.count()

Class    0.091574
dtype: float64

So this is an imbalanced dataset: only ~9% of the rows are pulsars, so just by always saying "nope, not a pulsar" you'd still get about 90.8% accuracy.
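
The accuracy of an always-"not a pulsar" predictor is simply one minus the pulsar fraction. A quick check using the two counts printed above (the variable names are illustrative):

```python
total = 17898    # all rows, from df.Class.count()
pulsars = 1639   # rows with Class == 1

pulsar_fraction = pulsars / total
# Predicting 0 for every row is correct exactly on the non-pulsar rows
baseline_accuracy = (total - pulsars) / total

print(f"Pulsar fraction: {pulsar_fraction:.4f}")
print(f"Always-'not a pulsar' accuracy: {baseline_accuracy:.4f}")
```

Any model's accuracy is worth comparing against this baseline rather than against 0.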

Need to create a confusion matrix, which classifies every prediction as a true positive, a true negative, a false positive (type 1 error), or a false negative (type 2 error).  This should shed light on how well the models really do.

Precision: true positives / number of positive predictions

Recall: true positives / number of actual positives, so...
    for 0 (not pulsar) = correctly predicted non-pulsars / (correctly predicted non-pulsars + false "pulsar" predictions)
    for 1 (pulsar) = correctly predicted pulsars / (correctly predicted pulsars + false "not a pulsar" predictions)

f1-score: harmonic mean of the precision and recall scores, computed separately for the 0s and for the 1s

Need to take into account what you're trying to predict.  Here we're trying to identify all possible pulsars, so recall on the pulsar class matters most and lower precision is ok.  But for medical purposes you'd probably want a much higher precision.
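
One way to see this trade-off is to vary the decision threshold applied to a model's scores. The sketch below uses hypothetical toy labels and scores (not taken from the pulsar data) and a helper named `precision_recall` that is made up for illustration. With scikit-learn classifiers, scores like these would typically come from `predict_proba`.

```python
# Hypothetical toy data: true labels (1 = pulsar) and model scores in [0, 1]
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.05, 0.10, 0.20, 0.35, 0.45, 0.60, 0.40, 0.55, 0.80, 0.95]

def precision_recall(y_true, scores, threshold):
    """Return (precision, recall) for class 1 at the given score threshold."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# A lower threshold catches more pulsars (recall) at the cost of more false alarms
for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy data, lowering the threshold raises recall and lowers precision, which is the direction you'd lean when missing a pulsar is worse than following up a false alarm.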

In [16]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)
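
Because the classes are imbalanced, a plain random split can leave the test set with a somewhat different pulsar rate than the full dataset. In the reports that follow, the test split has 360 pulsars out of 4475 rows, about 8.0% versus roughly 9.2% overall. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below runs on hypothetical stand-in data (`X_demo`, `y_demo`) with a 9% positive rate rather than on the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 2000 rows, 8 features, 9% positives
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 8))
y_demo = np.array([1] * 180 + [0] * 1820)

# stratify=y_demo keeps the class ratio (nearly) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(f"train positive rate: {y_tr.mean():.3f}")
print(f"test positive rate:  {y_te.mean():.3f}")
```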

In [19]:
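def confusion(model):
    """Fit `model` on the training split and report test-set metrics."""
    clf = model
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    return clf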

In [20]:
confusion(LogisticRegression(max_iter=1000))

[[4095   20]
 [  63  297]]
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      4115
           1       0.94      0.82      0.88       360

    accuracy                           0.98      4475
   macro avg       0.96      0.91      0.93      4475
weighted avg       0.98      0.98      0.98      4475
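
The class 1 row of this report can be reproduced by hand from the confusion matrix printed above it. This is a small sketch using those four counts. The variable names are illustrative, and the layout assumed is scikit-learn's convention of actual classes in rows and predicted classes in columns.

```python
# Counts from the LogisticRegression confusion matrix above
tn, fp = 4095, 20   # actual 0: predicted 0, predicted 1
fn, tp = 63, 297    # actual 1: predicted 0, predicted 1

precision = tp / (tp + fp)   # of predicted pulsars, how many are real
recall = tp / (tp + fn)      # of real pulsars, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

Note how accuracy stays near 0.98 even though 63 of the 360 real pulsars in the test set were missed.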



In [21]:
confusion(KNeighborsClassifier())

[[4077   38]
 [  69  291]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      4115
           1       0.88      0.81      0.84       360

    accuracy                           0.98      4475
   macro avg       0.93      0.90      0.92      4475
weighted avg       0.98      0.98      0.98      4475



In [22]:
confusion(GaussianNB())

[[3946  169]
 [  52  308]]
              precision    recall  f1-score   support

           0       0.99      0.96      0.97      4115
           1       0.65      0.86      0.74       360

    accuracy                           0.95      4475
   macro avg       0.82      0.91      0.85      4475
weighted avg       0.96      0.95      0.95      4475



In [23]:
confusion(RandomForestClassifier())

[[4094   21]
 [  58  302]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4115
           1       0.93      0.84      0.88       360

    accuracy                           0.98      4475
   macro avg       0.96      0.92      0.94      4475
weighted avg       0.98      0.98      0.98      4475



An f1-score of 0.88 on the pulsar class is the best we've seen so far, so now we will use boosting algorithms, which "boost" an ML algorithm by retraining it with extra emphasis on the examples it got wrong to increase its correct rate.
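
To make "retraining on the incorrect data" concrete, here is a minimal sketch of one reweighting round in the classic discrete AdaBoost scheme. The five examples and their results are hypothetical. scikit-learn's `AdaBoostClassifier` and XGBoost's gradient boosting differ in their details, so treat this as an illustration of the idea rather than a description of either library's internals.

```python
import math

# Hypothetical toy round: five equally weighted examples, one misclassified
weights = [0.2, 0.2, 0.2, 0.2, 0.2]
misclassified = [False, False, True, False, False]

# Weighted error of the current weak learner
error = sum(w for w, m in zip(weights, misclassified) if m)

# Learner's vote strength: larger when the error is smaller
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weight of mistakes, lower the weight of correct examples
updated = [w * math.exp(alpha if m else -alpha)
           for w, m in zip(weights, misclassified)]

# Renormalize so the weights sum to 1 for the next round
total = sum(updated)
new_weights = [w / total for w in updated]

print(f"error={error:.2f} alpha={alpha:.4f}")
print([round(w, 3) for w in new_weights])
```

After this round the single misclassified example carries half of the total weight, so the next weak learner is pushed to get it right.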

In [24]:
#first AdaBoost

In [25]:
from sklearn.ensemble import AdaBoostClassifier
clf_model(AdaBoostClassifier())



Scores: [0.97430168 0.97988827 0.98128492 0.97597094 0.97708857]
Mean score: 0.977706874833175


ugh deprecation warnings

In [26]:
confusion(AdaBoostClassifier())



[[4094   21]
 [  63  297]]
              precision    recall  f1-score   support

           0       0.98      0.99      0.99      4115
           1       0.93      0.82      0.88       360

    accuracy                           0.98      4475
   macro avg       0.96      0.91      0.93      4475
weighted avg       0.98      0.98      0.98      4475



XGBoost must be installed, so let's do that.

In [27]:
import sys
!{sys.executable} -m pip install xgboost

Collecting xgboost
  Downloading xgboost-3.0.2-py3-none-win_amd64.whl.metadata (2.1 kB)
Downloading xgboost-3.0.2-py3-none-win_amd64.whl (150.0 MB)

In [28]:
from xgboost import XGBClassifier
clf_model(XGBClassifier())

Scores: [0.97653631 0.98128492 0.9801676  0.97513272 0.97680916]
Mean score: 0.9779861420046485


In [29]:
confusion(XGBClassifier())

[[4084   31]
 [  53  307]]
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      4115
           1       0.91      0.85      0.88       360

    accuracy                           0.98      4475
   macro avg       0.95      0.92      0.93      4475
weighted avg       0.98      0.98      0.98      4475

