# Machine Learning Model Using the K-Nearest Neighbors Algorithm

## Created a machine learning model using KNeighborsClassifier for stock price direction prediction, optimised the model and prepared an analysis report.

NOTE: Required data is stored in IPython's database

1. Prepare training and test data
2. Find the optimal value of n_neighbors for the dataset using cross-validation.
3. Train model using the optimal number of neighbors
4. Make predictions using trained model.
5. Print classification report.

The steps above were repeated with varying parameters to optimise the model.

**Scenario 1** Vary the maximum number of neighbors (e.g. 21, 51) and the scoring metric used for cross-validation (e.g. "accuracy", "recall").  
**Scenario 2** By default, the training and test data are prepared chronologically to respect the time-series nature of the data; also check the impact on model performance when they are prepared randomly using train_test_split.  
**Scenario 3** Compare results by increasing/decreasing test data size.  
**Scenario 4** Add sentiment analysis based on tweets as one of the X features to determine its impact on the performance of the model.

In [1]:
import pandas as pd
from datetime import datetime

from sklearn.model_selection import cross_val_score,train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report,balanced_accuracy_score
%store -r

In [2]:
# %run PrepareData.ipynb
# ta_df.plot(y="close_value")

In [22]:
# The purpose of this function is to prepare the training and test datasets
def get_training_testing_data(dataframe,random=False,test_size=.25):
    # we want to predict the direction of stock, so our target is price_direction
    y = dataframe["price_direction"]
    print(y.value_counts())
    
    # dropping target from the features dataframe  
    X = dataframe.drop(["price_direction"],axis=1)
    
    # use train_test_split when the datasets are to be prepared randomly
    if(random):
        # Split the dataset using train_test_split
        X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=test_size,random_state=1)
    else:
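        # Determine the training and test data sizes (in rows) from the input value of test_size.
        train_data_size = int(len(dataframe) * (1 - test_size))
        test_data_size = len(dataframe) - train_data_size
        
        # Prepare the training dataset, keeping in mind the time series nature of the data.
        # Note: train_data_size is a row count but is applied here as a number of calendar
        # days, so the split matches test_size only if the index has exactly one row per day.
        train_start = X.index.min()
        train_end = X.index.min() + pd.DateOffset(days=train_data_size)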
        X_train = X.loc[train_start:train_end]
        y_train = y.loc[train_start:train_end]
        
        # prepare test dataset, keeping in mind time series nature of the data.
        test_start = train_end + pd.DateOffset(hours=1)         
        X_test = X.loc[test_start:]
        y_test = y.loc[test_start:]
    
    return X_train, X_test, y_train, y_test       
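
A chronological split can also be made by row position rather than by date offset, which keeps the test set at the requested fraction regardless of how the index is spaced. The following is a minimal sketch of that idea, not the method used in this notebook; the function name `chronological_split` and the toy data are assumptions made purely for illustration.

```python
import pandas as pd


def chronological_split(dataframe, target="price_direction", test_size=0.25):
    """Split rows by position so the last `test_size` fraction becomes the test set."""
    # Sort by the index so that "earlier" rows always come first
    ordered = dataframe.sort_index()
    split_at = int(len(ordered) * (1 - test_size))

    X = ordered.drop(columns=[target])
    y = ordered[target]
    return X.iloc[:split_at], X.iloc[split_at:], y.iloc[:split_at], y.iloc[split_at:]


# Toy data, invented for illustration only
index = pd.date_range("2021-01-01", periods=20, freq="D")
toy = pd.DataFrame({"rsi": range(20), "price_direction": [0, 1] * 10}, index=index)

X_train, X_test, y_train, y_test = chronological_split(toy, test_size=0.25)
print(len(X_train), len(X_test))
```

Because the split is positional, every training row precedes every test row, and the test set size follows `test_size` directly.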
            

In [4]:
# The purpose of this function is to get the optimal number of neighbors using cross validation technique
def get_optimal_k_neighbors(maxnum,X_scaled_data,y_scaled_data,scoring="accuracy"):
    # checking for odd number of neighbors    
    k_neighbors = [num for num in range(1,maxnum,2)]

    k_acc_scores = []

    for k in k_neighbors:
        
        knn = KNeighborsClassifier(n_neighbors=k)
        
        # Compute cross validation scores, using accuracy/recall
        cv_scores = cross_val_score(knn, X_scaled_data,y_scaled_data,cv=5, scoring=scoring)
        
        # Take the mean of scores and append to the list of scores        
        k_acc_scores.append(cv_scores.mean())
        
        
    # find optimal number of neighbors by finding the index of max score    
    return k_neighbors[k_acc_scores.index(max(k_acc_scores))]
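
The search above passes `cv=5`, which for a classifier means stratified folds that do not respect time order. If the chronological order of the rows matters, scikit-learn's `TimeSeriesSplit` can be passed as `cv` instead, so each fold is validated only on rows that come after its training rows. The sketch below shows that variation under stated assumptions: the helper name `mean_cv_score_by_k` and the random toy data are hypothetical, and this is not the procedure that produced the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier


def mean_cv_score_by_k(X, y, max_neighbors=21, scoring="accuracy", n_splits=5):
    """Return {k: mean CV score} for odd k below max_neighbors, using ordered folds."""
    cv = TimeSeriesSplit(n_splits=n_splits)
    scores = {}
    for k in range(1, max_neighbors, 2):
        knn = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(knn, X, y, cv=cv, scoring=scoring).mean()
    return scores


# Random toy data, invented for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 3))
y_toy = rng.integers(0, 2, size=120)

scores = mean_cv_score_by_k(X_toy, y_toy, max_neighbors=11)
best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```

Note that the first fold of `TimeSeriesSplit` trains on a small slice of the data, so the largest candidate k has to stay below that slice's row count.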
        

In [5]:
# This function executes all the steps required to develop, train and test the KNN machine learning model
# for the given dataset and inputs. It prints the classification report so the model's performance can be analysed.
def exe_knn_model(dataset,test_size=.25,max_neighbors=100,random_data=False,scoring="accuracy"):  

    X_train, X_test, y_train, y_test = get_training_testing_data(dataset,test_size=test_size,random=random_data)

    scaler = StandardScaler()

    # Fitting Standard Scaler
    X_scaler = scaler.fit(X_train)

    # Scaling data
    X_train_scaled = X_scaler.transform(X_train)
    X_test_scaled = X_scaler.transform(X_test)
    
    # Get Optimal value of n_neighbors    
    optimal_k_neighbors = get_optimal_k_neighbors(maxnum=max_neighbors,
                                                  X_scaled_data=X_train_scaled,
                                                  y_scaled_data=y_train,
                                                  scoring=scoring)
    
    print(f"Optimal value of n_neighbors is {optimal_k_neighbors}")
    knn_model = KNeighborsClassifier(n_neighbors=optimal_k_neighbors)
    
    # Train model using training data
    knn_model.fit(X_train_scaled, y_train)
    
    # Create predictions using the testing data
    y_pred = knn_model.predict(X_test_scaled)
    
    # Print the balanced_accuracy score of the model
    print(f"Balanced accuracy score for the model is {balanced_accuracy_score(y_test,y_pred)}")
    
    # Print the classification report comparing the testing data to the model predictions
    print(classification_report(y_test, y_pred))
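
The balanced accuracy printed by this function is the average of the per-class recall values that appear in the classification report. The sketch below works through that arithmetic in plain Python on a small set of labels invented for illustration; the helper name `balanced_accuracy` is hypothetical and stands in for `balanced_accuracy_score`.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of per-class recall, computed without any libraries."""
    recalls = []
    for label in sorted(set(y_true)):
        # Rows whose true class is `label`
        total = sum(1 for t in y_true if t == label)
        # Of those rows, how many were predicted as `label`
        correct = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls.append(correct / total)
    return sum(recalls) / len(recalls)


# Invented labels: four rows of class 0 and six rows of class 1
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0, 0, 0]

# Recall for class 0 is 3/4 and recall for class 1 is 3/6
print(balanced_accuracy(y_true, y_pred))  # 0.625
```

Plain accuracy on the same labels is 6/10 = 0.6, so the two metrics differ whenever the classes are unequal in size and are not predicted equally well.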

In [6]:
# For correct mapping with price_direction, get technical indicators of previous day
ta_df_temp = ta_df.loc[:, ta_df.columns != 'price_direction'].shift(1)
ta_df_temp['price_direction'] = ta_df['price_direction']
ta_df_temp.dropna(inplace= True)
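
The effect of `shift(1)` in the cell above is easiest to see on a tiny frame: each indicator value moves down one row, so a row's label is paired with the indicators from the row before it, and the first row (which has no predecessor) is dropped. The values and the `rsi` column below are made up for illustration.

```python
import pandas as pd

# Toy indicator and label values, invented for illustration only
toy = pd.DataFrame(
    {"rsi": [30.0, 45.0, 60.0, 55.0], "price_direction": [1, 0, 1, 1]},
    index=pd.date_range("2021-01-04", periods=4, freq="D"),
)

# Shift every feature down one row, then re-attach the unshifted label
lagged = toy.drop(columns=["price_direction"]).shift(1)
lagged["price_direction"] = toy["price_direction"]
lagged = lagged.dropna()

print(lagged)
```

After the shift, the row for 2021-01-05 carries the indicator value from 2021-01-04 (30.0) together with its own label (0).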

**Scenario 1** Vary the maximum number of neighbors (e.g. 21, 51) and the scoring metric used for cross-validation (e.g. "accuracy", "recall").<br>  Based on these parameters, evaluate the performance of the model and find the scoring metric that yields the n_neighbors value giving the best model performance.

In [23]:
# Execute the model with n_neighbors searched over odd values below 21 and test_size set to 0.2
exe_knn_model(dataset=ta_df_temp,test_size=.2,max_neighbors=21,scoring="accuracy")

1    676
0    651
Name: price_direction, dtype: int64
Optimal value of n_neighbors is 5
Balanced accuracy score for the model is 0.4778513412053784
              precision    recall  f1-score   support

           0       0.48      0.55      0.51       294
           1       0.48      0.40      0.44       299

    accuracy                           0.48       593
   macro avg       0.48      0.48      0.47       593
weighted avg       0.48      0.48      0.47       593



In [8]:
exe_knn_model(dataset=ta_df_temp,test_size=.2,max_neighbors=21,scoring="recall")

Optimal value of n_neighbors is 1
Balanced accuracy score for the model is 0.5133665506336313
              precision    recall  f1-score   support

           0       0.51      0.60      0.55       294
           1       0.52      0.43      0.47       299

    accuracy                           0.51       593
   macro avg       0.51      0.51      0.51       593
weighted avg       0.51      0.51      0.51       593



In [9]:
exe_knn_model(dataset=ta_df_temp,test_size=.2,max_neighbors=51,scoring="accuracy")

Optimal value of n_neighbors is 23
Balanced accuracy score for the model is 0.4865765704275021
              precision    recall  f1-score   support

           0       0.48      0.39      0.43       294
           1       0.49      0.58      0.53       299

    accuracy                           0.49       593
   macro avg       0.49      0.49      0.48       593
weighted avg       0.49      0.49      0.48       593



In [10]:
exe_knn_model(dataset=ta_df_temp,test_size=.2,max_neighbors=201,scoring="recall")

Optimal value of n_neighbors is 23
Balanced accuracy score for the model is 0.4865765704275021
              precision    recall  f1-score   support

           0       0.48      0.39      0.43       294
           1       0.49      0.58      0.53       299

    accuracy                           0.49       593
   macro avg       0.49      0.49      0.48       593
weighted avg       0.49      0.49      0.48       593



**Scenario 2** By default, the training and test data are prepared chronologically to respect the time-series nature of the data; also check the impact when they are prepared randomly using train_test_split.  

In [11]:
# From our analysis in Scenario 1: increasing the number of neighbors did not improve the performance of the model, and the "recall" scoring metric gave better results.

exe_knn_model(dataset=ta_df_temp,test_size=.2,random_data=True,max_neighbors=21,scoring="recall")


Optimal value of n_neighbors is 13
Balanced accuracy score for the model is 0.46711607092278934
              precision    recall  f1-score   support

           0       0.49      0.45      0.47       139
           1       0.45      0.49      0.47       127

    accuracy                           0.47       266
   macro avg       0.47      0.47      0.47       266
weighted avg       0.47      0.47      0.47       266



In [12]:
# Run a test to see whether the optimal value of n_neighbors changes when max_neighbors=101

exe_knn_model(dataset=ta_df_temp,test_size=.2,random_data=True,max_neighbors=101,scoring="accuracy")

Optimal value of n_neighbors is 69
Balanced accuracy score for the model is 0.4946751260408996
              precision    recall  f1-score   support

           0       0.52      0.45      0.48       139
           1       0.47      0.54      0.51       127

    accuracy                           0.49       266
   macro avg       0.49      0.49      0.49       266
weighted avg       0.50      0.49      0.49       266



**Scenario 3** Compare results by increasing/decreasing test data size.  

In [13]:
#increase test data size
exe_knn_model(dataset=ta_df_temp,test_size=.3,max_neighbors=21)

Optimal value of n_neighbors is 5
Balanced accuracy score for the model is 0.4916762857509298
              precision    recall  f1-score   support

           0       0.49      0.54      0.51       339
           1       0.50      0.44      0.47       345

    accuracy                           0.49       684
   macro avg       0.49      0.49      0.49       684
weighted avg       0.49      0.49      0.49       684



In [14]:
#decrease test data size
exe_knn_model(dataset=ta_df_temp,test_size=.1,max_neighbors=51)

Optimal value of n_neighbors is 13
Balanced accuracy score for the model is 0.5130605345410474
              precision    recall  f1-score   support

           0       0.50      0.52      0.51       244
           1       0.53      0.51      0.52       257

    accuracy                           0.51       501
   macro avg       0.51      0.51      0.51       501
weighted avg       0.51      0.51      0.51       501



In [15]:
#decrease test data size
exe_knn_model(dataset=ta_df_temp,test_size=.1,max_neighbors=51,scoring="recall")

Optimal value of n_neighbors is 13
Balanced accuracy score for the model is 0.5130605345410474
              precision    recall  f1-score   support

           0       0.50      0.52      0.51       244
           1       0.53      0.51      0.52       257

    accuracy                           0.51       501
   macro avg       0.51      0.51      0.51       501
weighted avg       0.51      0.51      0.51       501



**Scenario 4** Add sentiment analysis based on tweets as one of the X features to determine whether it helps improve stock price direction prediction.  

**Test including Vader sentiment analysis along with technical indicators**

In [16]:
# Added VaderSentiments to the features along with technical indicators
ta_df_vader_temp = pd.concat([ta_df_temp,tsla_sentiments_df],axis=1, join="inner")

exe_knn_model(dataset=ta_df_vader_temp,test_size=.1,max_neighbors=21,scoring="recall")

Optimal value of n_neighbors is 15
Balanced accuracy score for the model is 0.5037793952967525
              precision    recall  f1-score   support

           0       0.51      0.47      0.49       235
           1       0.50      0.54      0.52       228

    accuracy                           0.50       463
   macro avg       0.50      0.50      0.50       463
weighted avg       0.50      0.50      0.50       463



In [17]:
# checking to see if increase in n_neighbors improves the results
exe_knn_model(dataset=ta_df_vader_temp,test_size=.1,max_neighbors=201,scoring="accuracy")

Optimal value of n_neighbors is 21
Balanced accuracy score for the model is 0.4995893990294886
              precision    recall  f1-score   support

           0       0.51      0.46      0.48       235
           1       0.49      0.54      0.52       228

    accuracy                           0.50       463
   macro avg       0.50      0.50      0.50       463
weighted avg       0.50      0.50      0.50       463



In [18]:
# As there is a slight improvement when n_neighbors increases, check whether a further increase in the number of neighbors improves the result
exe_knn_model(dataset=ta_df_vader_temp,test_size=.3,max_neighbors=201)

Optimal value of n_neighbors is 63
Balanced accuracy score for the model is 0.5057417565856893
              precision    recall  f1-score   support

           0       0.51      0.48      0.50       318
           1       0.50      0.53      0.52       313

    accuracy                           0.51       631
   macro avg       0.51      0.51      0.51       631
weighted avg       0.51      0.51      0.51       631



**Test including TextBlob sentiment analysis along with technical indicators**

In [19]:
# Added Textblob subjectivity and polarity to the features along with technical indicators
ta_df_text_blob_temp = pd.concat([ta_df_temp,tsla_sentiments_df_textblob],axis=1, join="inner")

exe_knn_model(dataset=ta_df_text_blob_temp,test_size=.2,max_neighbors=21,scoring="recall")

Optimal value of n_neighbors is 1
Balanced accuracy score for the model is 0.5039049919484702
              precision    recall  f1-score   support

           0       0.51      0.48      0.50       276
           1       0.50      0.53      0.51       270

    accuracy                           0.50       546
   macro avg       0.50      0.50      0.50       546
weighted avg       0.50      0.50      0.50       546



In [20]:
exe_knn_model(dataset=ta_df_text_blob_temp,test_size=.2,max_neighbors=51)

Optimal value of n_neighbors is 41
Balanced accuracy score for the model is 0.49991948470209335
              precision    recall  f1-score   support

           0       0.51      0.34      0.41       276
           1       0.49      0.66      0.57       270

    accuracy                           0.50       546
   macro avg       0.50      0.50      0.49       546
weighted avg       0.50      0.50      0.49       546



In [21]:
exe_knn_model(dataset=ta_df_text_blob_temp,test_size=.3,max_neighbors=101,scoring="recall")

Optimal value of n_neighbors is 97
Balanced accuracy score for the model is 0.4674633793477606
              precision    recall  f1-score   support

           0       0.45      0.27      0.34       318
           1       0.47      0.66      0.55       313

    accuracy                           0.47       631
   macro avg       0.46      0.47      0.45       631
weighted avg       0.46      0.47      0.45       631



# KNN Analysis
## Summary
We tried different permutations and combinations of parameters to optimise our KNN model, finding the optimal value of n_neighbors for each set of parameters used to train and test the model.
We also evaluated the model with sentiment analysis based on tweet data included alongside the other features.

In the best case across all the tests we ran, the model achieved a balanced accuracy of 51.3%. Recall was 52% for class 0 and 51% for class 1, meaning the model correctly identified 52% of the actual class 0 observations and 51% of the actual class 1 observations of stock price direction. Below is the classification report for the best-performing model among all the KNN models we created.

>Optimal value of n_neighbors is 13<br>
>Balanced accuracy score for the model is 0.5130605345410474<br>
>
>                   precision    recall  f1-score   support
>
>                0       0.50      0.52      0.51       244
>                1       0.53      0.51      0.52       257
>
>         accuracy                           0.51       501
>        macro avg       0.51      0.51      0.51       501
>     weighted avg       0.51      0.51      0.51       501


## Details
### Scenario 1
Impact of scoring technique used to find n_neighbors on performance of the KNN Model

**On this data, the performance of the KNN model improved when n_neighbors was set to 1 and the test size was set to 20%. This value of n_neighbors was selected by cross-validation using the "recall" score.**

***Below are the results when optimal value of n_neighbors is obtained by using cross_val_score, with scoring set to "recall"***  
>Optimal value of n_neighbors is 1<br>
>Balanced accuracy score for the model is 0.5133665506336313<br>
>
>                   precision    recall  f1-score   support 
>
>                0       0.51      0.60      0.55       294  
>                1       0.52      0.43      0.47       299   
>
>         accuracy                           0.51       593 
>        macro avg       0.51      0.51      0.51       593  
>     weighted avg       0.51      0.51      0.51       593  

***Below are the results when optimal value of n_neighbors is obtained by using cross_val_score, with scoring set to "accuracy"***  
>Optimal value of n_neighbors is 5<br>                                 
>Balanced accuracy score for the model is 0.4778513412053784<br>
>
>                   precision    recall  f1-score   support             
>
>                0       0.48      0.55      0.51       294  
>                1       0.48      0.40      0.44       299  
>
>         accuracy                           0.48       593  
>        macro avg       0.48      0.48      0.47       593  
>     weighted avg       0.48      0.48      0.47       593  

### Scenario 2
Impact of the way training and testing dataset are generated on performance of the KNN Model.

**On this data, the KNN model performs better when the training and testing datasets are constructed chronologically, respecting the time-series nature of the data; this is also the default behaviour of the function we created. When the training and testing datasets are instead generated randomly using train_test_split, the performance of our model falls.**

***The classification report below shows the best results we could achieve when the training and testing datasets were generated randomly.***  
>Optimal value of n_neighbors is 69<br>
>Balanced accuracy score for the model is 0.4946751260408996<br>
>
>                   precision    recall  f1-score   support
>    
>                0       0.52      0.45      0.48       139
>                1       0.47      0.54      0.51       127
>    
>         accuracy                           0.49       266
>        macro avg       0.49      0.49      0.49       266
>     weighted avg       0.50      0.49      0.49       266

### Scenario 3
Impact of increasing/decreasing test data size on performance of the model.

**As there is a major peak in the data at later dates, enlarging the training dataset by reducing the test data size from 20% to 10% improved the performance of the model.**

***The classification report below shows the best results we could achieve by increasing the training dataset to 90%.***  
>Optimal value of n_neighbors is 13<br>
>Balanced accuracy score for the model is 0.5130605345410474<br>
>
>                   precision    recall  f1-score   support
>
>                0       0.50      0.52      0.51       244
>                1       0.53      0.51      0.52       257
>
>         accuracy                           0.51       501
>        macro avg       0.51      0.51      0.51       501
>     weighted avg       0.51      0.51      0.51       501


### Scenario 4

Impact on the performance of the model of adding tweet-based sentiment analysis results alongside the technical indicators.

**Our model performed slightly better with the output from VaderSentiment analysis. We also observed that VaderSentiment analysis is much faster than TextBlob.**

#### Using VaderSentiment Analysis

***When we added the polarity of sentiments as one of the features along with the technical indicators, the report below shows the best performance we could achieve. In this scenario, increasing the test data size to 30% helped improve the performance of the model.***


>Optimal value of n_neighbors is 63<br>
>Balanced accuracy score for the model is 0.5057417565856893<br>
>
>                   precision    recall  f1-score   support
>
>                0       0.51      0.48      0.50       318
>                1       0.50      0.53      0.52       313
>
>         accuracy                           0.51       631
>        macro avg       0.51      0.51      0.51       631
>     weighted avg       0.51      0.51      0.51       631

#### Using TextBlob Sentiment Analysis

***When we added the output from TextBlob sentiment analysis as features along with the technical indicators, the report below shows the best performance we could achieve. In this scenario, a test data size of 20% helped improve the performance of the model.***

>Optimal value of n_neighbors is 1<br>
>Balanced accuracy score for the model is 0.5039049919484702<br>
>
>                   precision    recall  f1-score   support
>     
>                0       0.51      0.48      0.50       276
>                1       0.50      0.53      0.51       270
>     
>         accuracy                           0.50       546
>        macro avg       0.50      0.50      0.50       546
>     weighted avg       0.50      0.50      0.50       546

