# INVESCO TRADING - RISK FOR REDEEMING



<b> OVERVIEW: </b>

Invesco is an independent investment management firm that sells retail mutual funds across the United States of America via financial intermediaries. They would like to understand which advisors are at risk of selling (redeeming) their mutual funds in the next month, and which mutual funds are at risk, so that they can take a proactive approach targeted specifically at the risk areas.

They have a wealth of internal data regarding the purchase and sale of their mutual funds, asset holding balances, product investment experience information and financial intermediary activity information. This data was provided to us for analysis.

<b> Team: </b> 

Vallabh Remani - Intern,
Koti - Software Engineer,
Seshu - Director of Engineering, 
Sri Harsha - Mentor

<b>WORKFLOW STRUCTURE</b>

To solve the given problem, we have taken a seven-step approach, with each step dealing with one particular part of the solution development. The steps are as follows:

    
    1. Question or problem definition.
    2. Acquire training and testing data.
    3. Wrangle, prepare, cleanse the data.
    4. Analyze, identify patterns, and explore the data.
    5. Model, predict and solve the problem.
    6. Visualize, report, and present the problem solving steps and final solution.
    7. Supply or submit the results.




<b>1. QUESTION OR PROBLEM DEFINITION</b>

Given a training set of samples that record whether an investment was redeemed by the advisor or not, can our model predict whether a particular investment_id will be redeemed in a test dataset whose redemption status is unknown? If yes, what will the accuracy of the system be?

To solve the above problem, we first had to develop an early understanding of the problem domain by consulting experts in the field, as this gives us a clear picture of the factors that influence the outcome.

The highlights of the understanding were as follows:
<ul>
<li> Advisors are more likely to buy funds that have been around for at least 3 years, as a longer track record is seen as safer.</li>
<li> The types of advisors vary. Some look at short-term ratings and short-term gains, while others look at long-term ratings and long-term gains.</li>
<li> The more often an advisor is involved in any activity, the higher the chances of that advisor purchasing the fund in the following month.</li>
<li> Advisors generally care more about buying funds than about selling them, meaning that an advisor would sell a particular fund when there is a better fund to invest the same money in.</li>
</ul>

<b>2. ACQUIRING TRAINING AND TESTING DATASET</b>

The dataset was made available as 4 csv files.
<ul>
<li>InvestmentExperience.csv</li>
<li>Transaction.csv</li>
<li>AUM.csv</li>
<li>Activity.csv</li>
</ul>



In [None]:
# Importing all the necessary packages and Libraries which will help us in solving the above mentioned problem.

# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

<b>2.1 ACQUIRING DATA</b>

The Python pandas and numpy packages help us work with our dataset. We load the csv files into pandas dataframes for further data processing and data analysis.

Let us load the datasets into the pandas dataframes and have an initial understanding of the structure of the data provided to us.

In [None]:
# Forward slashes avoid invalid escape sequences such as '\d' in the path
# strings and work on Windows as well as on Linux and macOS.

# Loading the Investment Experience Dataset into the pandas dataframe.
invexp_df = pd.read_csv('resources/dataset/InvestmentExperience.csv')

# Loading the Transaction Dataset into the pandas dataframe.
transaction_df = pd.read_csv('resources/dataset/Transaction.csv')

# Loading the AUM Dataset into the pandas dataframe.
aum_df = pd.read_csv('resources/dataset/AUM.csv')

# Loading the Activity Dataset into the pandas dataframe.
activity_df = pd.read_csv('resources/dataset/Activity.csv')

<b>2.2 ANALYZE BY DESCRIBING THE DATA</b>

Let us try to understand the features that are available in the data provided to us.



In [None]:
# Knowing the features in the Investment Experience Table
print(invexp_df.columns.values)



<b>Which features are categorical?</b>

Categorical values classify the samples into sets of similar samples. Within the categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

   <ul><li>Categorical: Morningstar Category , Investment. </li>
   <li>Ordinal: Rating.</li></ul>

<b>Which features are numerical?</b>

Numerical values change from sample to sample. Within the numerical features, are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.
 
   <ul><li>Discrete: Unique_Investment_Id</li>
   <li>Timeseries: Month</li>
   <li>Continuous: All others</li>
   </ul>
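
The manual classification above can be cross-checked against the column dtypes that pandas infers. The following is a minimal sketch on a toy frame: the values and the `Return_1Y` column are hypothetical, and only the other column names mirror the dataset. A dtype check only separates numeric from non-numeric columns, so identifiers such as Unique_Investment_Id still show up as numeric and have to be judged by hand.

```python
import pandas as pd

# Toy frame with hypothetical values; Return_1Y is a made-up continuous column.
toy_invexp = pd.DataFrame({
    "Unique_Investment_Id": [11681, 11682, 11683],
    "Month": ["2016-01", "2016-01", "2016-02"],
    "Morningstar Category": ["Large Growth", "Large Value", "Large Growth"],
    "Rating": [3, 5, 4],
    "Return_1Y": [0.071, -0.012, 0.034],
})


def split_feature_types(df):
    """Group column names into numeric and non-numeric, based on dtype only."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    non_numeric = df.select_dtypes(exclude="number").columns.tolist()
    return {"numeric": numeric, "non_numeric": non_numeric}


feature_types = split_feature_types(toy_invexp)
print(feature_types)
```

On the real tables the same helper could be called as `split_feature_types(invexp_df)`.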



In [None]:
# Preview the Data

invexp_df.head()

In [None]:
# Knowing the features in the Transaction Table
print(transaction_df.columns.values)

<b>Which features are categorical?</b>

Categorical values classify the samples into sets of similar samples. Within the categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

   <ul><li>Categorical: Transaction_Type </li>
   </ul>

<b>Which features are numerical?</b>

Numerical values change from sample to sample. Within the numerical features, are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.
 
   <ul><li>Discrete: Unique_Advisor_Id , Unique_Investment_Id , Code_1 , Code_2 , Code_3 , Code_4 , Code_5  </li>
   <li>Timeseries: Month</li>
   <li>Continuous: Amount</li>
   </ul>

In [None]:
# Preview the Data

transaction_df.head()

In [None]:
# Knowing the features in the AUM Table
print(aum_df.columns.values)

<b>Which features are categorical?</b>

Categorical values classify the samples into sets of similar samples. Within the categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

   <ul><li>Categorical: None </li>
   </ul>

<b>Which features are numerical?</b>

Numerical values change from sample to sample. Within the numerical features, are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.
 
   <ul><li>Discrete: Unique_Advisor_Id , Unique_Investment_Id   </li>
   <li>Timeseries: Month</li>
   <li>Continuous: Shares , AUM</li>
   </ul>

In [None]:
# Preview the Data

aum_df.head()

In [None]:
# Knowing the features in the Activity Table
print(activity_df.columns.values)

<b>Which features are categorical?</b>

Categorical values classify the samples into sets of similar samples. Within the categorical features, are the values nominal, ordinal, ratio, or interval based? Among other things, this helps us select the appropriate plots for visualization.

   <ul><li>Categorical: Activity_Type </li>
   </ul>

<b>Which features are numerical?</b>

Numerical values change from sample to sample. Within the numerical features, are the values discrete, continuous, or time-series based? Among other things, this helps us select the appropriate plots for visualization.
 
   <ul><li>Discrete: Unique_Advisor_Id , Activity_Count   </li>
   <li>Timeseries: Month</li>
   </ul>

In [None]:
# Preview the Data

activity_df.head()

In [None]:
# Knowing the number of Unique_Investment_Id's that come under each Morningstar Category

invexp_df[['Morningstar Category', 'Unique_Investment_Id']].groupby(['Morningstar Category'], as_index=False).count().sort_values(by='Morningstar Category', ascending=True)

In [None]:
# Knowing the number of Unique_Investment_Id rows recorded in each Month

invexp_df[['Month', 'Unique_Investment_Id']].groupby(['Month'], as_index=False).count().sort_values(by='Month', ascending=True)

This means that every month contains records for 592 investments.
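
Note that `count()` counts rows, which equals the number of distinct investments only if each investment appears once per month. A small sketch on toy data (the values are hypothetical) shows how both figures can be obtained side by side:

```python
import pandas as pd

# Hypothetical toy data: investment 2 appears twice in 2016-01.
toy = pd.DataFrame({
    "Month": ["2016-01", "2016-01", "2016-01", "2016-02", "2016-02"],
    "Unique_Investment_Id": [1, 2, 2, 1, 3],
})

# rows     -> number of records per month (what count() reports)
# distinct -> number of different investments per month
per_month = toy.groupby("Month")["Unique_Investment_Id"].agg(
    rows="count", distinct="nunique"
)
print(per_month)
```

If the two columns agree on the real `invexp_df`, the row count can safely be read as the number of investments.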

In [None]:
# Knowing the number of Transactions of Type 'Redeem' in each month
# (a separate variable keeps the full transaction_df intact for later cells)
redeem_df = transaction_df[transaction_df.Transaction_Type == 'R']
redeem_df[['Month', 'Transaction_Type']].groupby(['Month'], as_index=False).count().sort_values(by='Month', ascending=True)

In [None]:
# Knowing the number of Transactions of Type 'Purchase' in each month
transaction_df = pd.read_csv('resources/dataset/Transaction.csv')
purchase_df = transaction_df[transaction_df.Transaction_Type == 'P']
purchase_df[['Month', 'Transaction_Type']].groupby(['Month'], as_index=False).count().sort_values(by='Month', ascending=True)

In [None]:
# Number of Investments done by an advisor in a given month 

aum_df[['Unique_Advisor_Id', 'Unique_Investment_Id' , 'Month']].groupby(['Unique_Advisor_Id' , 'Month'], as_index=False).count().sort_values(by='Month', ascending=True)

In [None]:
# Number of Activities done by Each Advisor in each month

activity_df[['Unique_Advisor_Id', 'Month', 'Activity_Count']].groupby(['Unique_Advisor_Id' , 'Month'], as_index=False).sum().sort_values(by='Month', ascending=True)

<b>2.3 VISUALIZING THE GIVEN DATA FOR ANALYTICS</b>

In [None]:
# Transaction amount in various months for a particular Advisor = 1000103

transaction_df = pd.read_csv('resources/dataset/Transaction.csv')
# Filter into a separate variable so that transaction_df keeps all advisors
advisor_txn_df = transaction_df[transaction_df.Unique_Advisor_Id == 1000103]
advisor_txn_df.plot('Month', 'Amount', kind='bar', color='r')

In [None]:
# This is the AUM for the advisor = 12243 and Investment = 11681.
# A separate variable keeps the full aum_df available for the merge in section 3.

advisor_aum_df = aum_df[(aum_df.Unique_Advisor_Id == 12243) & (aum_df.Unique_Investment_Id == 11681)]
advisor_aum_df.plot('Month', 'AUM', kind='bar', color='b')


In [None]:
# For a given Advisor = 1000103, find the types of activities that the advisor does.

activity_df = pd.read_csv('resources/dataset/Activity.csv')
# Filter into a separate variable so that activity_df keeps all advisors for the merge in section 3
advisor_activity_df = activity_df[activity_df.Unique_Advisor_Id == 1000103]
advisor_activity_df.plot('Activity_Type', 'Activity_Count', kind='bar', color='g')


<b>3. WRANGLE , PREPARE AND CLEAN THE DATA</b>

We have collected several assumptions and decisions regarding our datasets and solution requirements. So far we have not had to change a single feature or value to arrive at these. Let us now act on those decisions and assumptions by correcting, creating, and completing features.

<b>Correcting by dropping features</b>

This is a good starting goal to execute. By dropping features we deal with fewer data points, which speeds up our processing and eases the analysis.



In [None]:
# We will try to see which columns are unnecessary for our analysis and drop them.

invexp_df.columns.values

We believe that the Morningstar Category and Investment Columns are not necessary for our computation, so we will delete those columns from our table.



In [None]:
invexp_df = pd.read_csv('resources/dataset/InvestmentExperience.csv')
invexp_df = invexp_df.drop(['Morningstar Category', 'Investment'], axis=1)
invexp_df.columns.values


In [None]:
# We will try to see which columns are unnecessary for our analysis and drop them.

transaction_df.columns.values

We believe that Code_1 , Code_2 .. Code_5 are not required for the computation, so we are dropping them.

In [None]:
transaction_df = transaction_df.drop(['Code_1', 'Code_2', 'Code_3', 'Code_4', 'Code_5',], axis=1)
transaction_df.columns.values

In [None]:
# We will try to see which columns are unnecessary for our analysis and drop them.

aum_df.columns.values

We assume that all the features are important in this table, so not dropping any of these.

In [None]:
# We will try to see which columns are unnecessary for our analysis and drop them.

activity_df.columns.values

We assume all the attributes are important for the final computation, so leaving everything as it is.

Now we need to join all the dataframes so that we have one dataframe with all the necessary details. We use pandas merge (the equivalent of a SQL join) on the columns that the tables have in common.

In [None]:
# myconfig is the project's configuration module (output folder and path separator)
from src.main.core import myconfig

# merging the tables according to common attributes
# (inner joins keep only the rows that have a match in both tables)
data = pd.merge(aum_df, activity_df, how='inner', on=['Unique_Advisor_Id', 'Month']).fillna(0.0)
data = pd.merge(data, invexp_df, how='inner', on=['Unique_Investment_Id', 'Month']).fillna(0.0)

# Writing this data to a csv file called feature_vector.csv for future use.
data.to_csv(myconfig.PROCESSED_DATASET_FOLDER + myconfig.SEP + "feature_vector.csv")
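
An inner join silently drops every AUM row whose advisor has no activity record in that month, so the `fillna(0.0)` never gets a chance to fill those gaps. The sketch below, on hypothetical toy data, shows one way to measure how many rows would be lost, and how a left join would instead keep them with an activity count of zero. Whether keeping those rows is desirable for this problem is an assumption to be checked, not something the analysis above establishes.

```python
import pandas as pd

# Hypothetical toy tables that reuse the column names of the real data.
toy_aum = pd.DataFrame({
    "Unique_Advisor_Id": [1, 1, 2],
    "Unique_Investment_Id": [10, 11, 10],
    "Month": ["2016-01", "2016-01", "2016-01"],
    "AUM": [100.0, 50.0, 75.0],
})
toy_activity = pd.DataFrame({
    "Unique_Advisor_Id": [1],
    "Month": ["2016-01"],
    "Activity_Count": [4],
})
keys = ["Unique_Advisor_Id", "Month"]

# Inner join: advisor 2 has no activity row and disappears.
inner = pd.merge(toy_aum, toy_activity, how="inner", on=keys)

# Left join with indicator=True marks the rows that found no match.
left = pd.merge(toy_aum, toy_activity, how="left", on=keys, indicator=True)
unmatched = int((left["_merge"] == "left_only").sum())

# Keep every AUM row and treat missing activity as zero.
left_filled = left.drop(columns="_merge").fillna({"Activity_Count": 0})

print(len(inner), len(left), unmatched)
```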

In [None]:
# Looking at all the values in the table . Check if they have been combined or not.
data.columns.values

In [None]:
# Attach a class label to each (advisor, investment, month) row.
# The label is derived from the transactions recorded in the FOLLOWING month:
#   'H' (hold)     - no transactions in the next month
#   'R' (redeem)   - the redeemed amount exceeds the purchased amount
#   'P' (purchase) - otherwise
# The final table is written to a csv file called dataset.csv for future use.
from src.main.core import invesco
from src.main.core import myconfig

# Assumption: tnx is the full transaction table, loaded with the same project
# helper that the collaborative filtering code further down uses.
tnx = invesco.get_txn_df()

data['class_label'] = 'X'  # placeholder, overwritten below
label_col = data.columns.get_loc('class_label')

print("Length of dataset: " + str(len(data)))
for rowidx in range(len(data)):
    # DataFrame.get_value was removed in pandas 1.0, so use positional access
    this_month = data['Month'].iloc[rowidx]
    next_month = invesco.get_next_month(this_month)
    this_aid = data['Unique_Advisor_Id'].iloc[rowidx]
    this_iid = data['Unique_Investment_Id'].iloc[rowidx]

    # Transactions of this advisor on this investment in the following month
    t3 = tnx[(tnx.Unique_Advisor_Id == this_aid)
             & (tnx.Unique_Investment_Id == this_iid)
             & (tnx.Month == next_month)]

    if len(t3) == 0:
        class_label = 'H'
    else:
        p_amount = 0.0
        r_amount = 0.0
        # A separate loop variable keeps the outer rowidx intact
        for txn_idx in range(len(t3)):
            txn_type = t3['Transaction_Type'].iloc[txn_idx]
            amt_str = t3['Amount'].iloc[txn_idx]
            print("AID : %s / IID: %s / Row Idx: %s  / Amount: %s" % (this_aid, this_iid, txn_idx, amt_str))
            amount = abs(float(amt_str))
            # Accumulate totals so that every transaction in the month counts
            if txn_type == 'P':
                p_amount += amount
            else:
                r_amount += amount
        class_label = 'R' if r_amount > p_amount else 'P'
    data.iloc[rowidx, label_col] = class_label

data.to_csv(myconfig.PROCESSED_DATASET_FOLDER + myconfig.SEP + "dataset.csv")
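
The row-by-row labelling loop above filters the transaction table once per row, which becomes slow on large tables. The same labelling rule can be sketched with a group-and-merge, shown here on hypothetical toy data. The sketch assumes that Month is stored as a monthly pandas `Period`, so that subtracting 1 moves a transaction back to the month whose features it should label; the project's own `invesco.get_next_month` helper is not shown in this notebook and may use a different month representation. The function name `label_rows` is made up for this illustration.

```python
import pandas as pd

KEYS = ["Unique_Advisor_Id", "Unique_Investment_Id"]


def label_rows(data, tnx):
    """Label each row of data with 'H', 'P' or 'R' from next month's transactions."""
    # Total absolute purchase (P) and redeem (R) amounts per advisor, investment, month.
    totals = (
        tnx.assign(Amount=tnx["Amount"].abs())
        .pivot_table(index=KEYS + ["Month"], columns="Transaction_Type",
                     values="Amount", aggfunc="sum", fill_value=0.0)
        .reindex(columns=["P", "R"], fill_value=0.0)
        .reset_index()
    )
    # Shift each transaction month back by one so it lines up with the
    # feature row of the preceding month (assumes a monthly Period dtype).
    totals["Month"] = totals["Month"] - 1

    merged = data.merge(totals, how="left", on=KEYS + ["Month"])
    has_txn = merged["P"].notna()  # False where the next month has no transactions

    labels = pd.Series("H", index=merged.index)
    labels[has_txn & (merged["R"] > merged["P"])] = "R"
    labels[has_txn & (merged["R"] <= merged["P"])] = "P"
    return merged.drop(columns=["P", "R"]).assign(class_label=labels)


# Hypothetical toy data.
toy_data = pd.DataFrame({
    "Unique_Advisor_Id": [1, 1, 2],
    "Unique_Investment_Id": [10, 11, 10],
    "Month": pd.PeriodIndex(["2016-01", "2016-01", "2016-01"], freq="M"),
})
toy_tnx = pd.DataFrame({
    "Unique_Advisor_Id": [1, 1, 1, 2],
    "Unique_Investment_Id": [10, 10, 11, 10],
    "Month": pd.PeriodIndex(["2016-02", "2016-02", "2016-02", "2016-03"], freq="M"),
    "Transaction_Type": ["R", "P", "P", "R"],
    "Amount": [-500.0, 200.0, 300.0, -50.0],
})

labeled = label_rows(toy_data, toy_tnx)
print(labeled)
```

In the toy data, advisor 2 only transacts in 2016-03, so the 2016-01 row for that advisor is labelled as a hold.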

<b>4. MODEL, PREDICT AND SOLVE</b>

Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirement to narrow these down to a select few models that we can evaluate. Our problem is a classification problem: we want to identify the relationship between the output (the class_label: redeem, purchase, or hold) and the other variables or features (AUM, Shares, Activity_Count, Rating...). We are also performing supervised learning, as we train our model on a labelled dataset. With these two criteria, supervised learning plus classification, we can narrow our choice of models down to a few. These include:

   <ul><li> Logistic Regression</li>
    <li>KNN or k-Nearest Neighbors</li>
    <li>Support Vector Machines</li>
    <li>Naive Bayes classifier</li>
    <li>Decision Tree</li>
    <li>Random Forest</li>
    <li>Perceptron</li>
    <li>Artificial neural network</li>
    <li>RVM or Relevance Vector Machine</li></ul>


We will divide the available dataset into a training dataset and a test dataset in the ratio 80:20, which is a common convention.

We will then run the available machine learning algorithms on the dataset to see which one produces the highest accuracy.
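
When the classes are imbalanced, a plain random split can leave very few rare-class rows in the test set, and a high accuracy can simply reflect the majority class. The sketch below uses synthetic data with a hypothetical 80/10/10 label mix (not the real class distribution, which is not reported here) to show a stratified 80:20 split and the majority-class accuracy that any model should beat.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and a hypothetical, imbalanced label mix.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array(["H"] * 80 + ["P"] * 10 + ["R"] * 10)

# stratify=y keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Accuracy of always predicting the most frequent training class.
classes, counts = np.unique(y_tr, return_counts=True)
majority_class = classes[np.argmax(counts)]
baseline_accuracy = float(np.mean(y_te == majority_class))

print(majority_class, baseline_accuracy)
```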

<b>6. VISUALIZE, REPORT AND SOLVE THE PROBLEM</b>

<b>6.1 VERSION 1 SOLUTION:</b>

Looking at the data provided to us, we may have to predict the status for investments and advisors for whom complete information has not been provided.

One way to solve this problem is collaborative filtering, which is built on a similarity matrix. This similarity matrix can be constructed either between users (advisors) or between items (investments). The basic idea of the user-based variant is to find the users who are similar to a given user and to predict that the given user will take the same decisions as the users he or she is most similar to.

The same logic applies when we build the item-item similarity matrix. We find items that are similar to each other, and to predict whether an item will be redeemed we look at the status of the similar items whose information is known to us.

This solution is inspired by the BellKor solution to the Netflix Prize problem, which relied on similarities between users who watch movies.
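
The user-user idea can be illustrated on a tiny matrix before looking at the full implementation. This is a minimal sketch with hypothetical values: rows are advisors, columns are investments, and each entry stands for some interaction score. It uses cosine similarity and a similarity-weighted average, which is one common choice for collaborative filtering and not the only one.

```python
import numpy as np

# Hypothetical advisor x investment scores (0 means no data).
ratings = np.array([
    [1.0, 0.0, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])


def cosine_similarity(mat, eps=1e-9):
    """Row-by-row cosine similarity; eps guards against division by zero."""
    sim = mat @ mat.T + eps
    norms = np.sqrt(np.diagonal(sim))
    return sim / norms[:, None] / norms[None, :]


# Advisors 0 and 1 share investment 0, so they come out as similar;
# advisor 2 shares nothing with them.
user_sim = cosine_similarity(ratings)

# Predicted score = similarity-weighted average of all advisors' scores.
user_pred = user_sim @ ratings / np.abs(user_sim).sum(axis=1, keepdims=True)

print(np.round(user_sim, 3))
print(np.round(user_pred, 3))
```

The item-item variant follows by applying the same function to the transposed matrix, `cosine_similarity(ratings.T)`.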


In [None]:
"""
Apply Collaborative filtering 
    - Try both item - item based and user - user based approach.
"""
from src.main.core import invesco
from src.main.core import myconfig
import pandas as pd

__EPSILON = 1e-9


def compute_cfmat(month=None):
    df = invesco.get_txn_df()
    if month:
        df = df[df.Month <= invesco.get_datetime(month)]

    print("#Rows: " + str(len(df)))
    x = pd.DataFrame(df, columns=[df.Unique_Advisor_Id.name, df.Unique_Investment_Id.name, df.Transaction_Type.name])
    x['count'] = 0

    y = x.groupby([x.Unique_Advisor_Id.name, x.Unique_Investment_Id.name, x.Transaction_Type]).count().reset_index()

    pcounts = y[y.Transaction_Type == 'P']
    rcounts = y[y.Transaction_Type == 'R']

    pmat = pd.DataFrame(pcounts, columns=[pcounts.Unique_Advisor_Id.name, pcounts.Unique_Investment_Id.name, 'count'])
    rmat = pd.DataFrame(rcounts, columns=[rcounts.Unique_Advisor_Id.name, rcounts.Unique_Investment_Id.name, 'count'])

    users = set()
    items = set()

    users = users.union(set(pmat[pmat.Unique_Advisor_Id.name].unique()))
    users = users.union(set(rmat[rmat.Unique_Advisor_Id.name].unique()))

    items = items.union(set(pmat[pmat.Unique_Investment_Id.name].unique()))
    items = items.union(set(rmat[rmat.Unique_Investment_Id.name].unique()))

    print("USERS: \n")
    print(users)

    print("\nITEMS")
    print(items)

    columns = list(sorted(list(items)))
    rows = list(sorted(list(users)))

    # Start from an all-zero float matrix (advisors x investments)
    cfmat = pd.DataFrame(0.0, index=rows, columns=columns)

    for user in users:
        for item in items:
            pvalc = 0.0
            rvalc = 0.0

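            # Number of redeem transactions for this advisor and investment
            rval = rmat[rmat.Unique_Advisor_Id == user]
            rval = rval[rval.Unique_Investment_Id == item]
            if (len(rval) > 0):
                rvalc = float(rval['count'].iloc[0])

            # Number of purchase transactions for this advisor and investment
            pval = pmat[pmat.Unique_Advisor_Id == user]
            pval = pval[pval.Unique_Investment_Id == item]
            if (len(pval) > 0):
                pvalc = float(pval['count'].iloc[0])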

            prob = float(rvalc / (__EPSILON + rvalc + pvalc))
            # print("user : " + user + " / item: " + item + " / pvalc : " + str(pvalc) + " / rvalc : " + str(rvalc) + " / prob: " + str(prob))
            cfmat.loc[user, item] = prob
    return cfmat


if __name__ == '__main__':
    cfmat = compute_cfmat(month='2016 / 10')
    # pmat.to_csv(myconfig.PROCESSED_DATASET_FOLDER + myconfig.SEP + "purchase_mat.csv")
    # rmat.to_csv(myconfig.PROCESSED_DATASET_FOLDER + myconfig.SEP + "redeem_mat.csv")
    cfmat.to_csv(myconfig.PROCESSED_DATASET_FOLDER + myconfig.SEP + "cfmat.csv")
    print("Done")


In the above piece of code, we count how many 'P' transactions and how many 'R' transactions have happened for every user and every item whose data is available. Each matrix entry stores the fraction of redemptions, R / (R + P), with a small epsilon in the denominator to avoid division by zero.

This matrix is then used to predict values for unknown data using the similarity approach described earlier.
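
The nested loop over every user and item can be slow when there are many advisors and investments. As a sketch of an alternative, the same redeem fraction can be computed with a group-by and two `unstack` calls. The toy transactions below are hypothetical, and `cfmat_fast` is a name made up for this illustration.

```python
import pandas as pd

# Hypothetical toy transactions.
toy_txn = pd.DataFrame({
    "Unique_Advisor_Id": [1, 1, 1, 2],
    "Unique_Investment_Id": [10, 10, 11, 10],
    "Transaction_Type": ["P", "R", "P", "R"],
})

# Count P and R transactions per (advisor, investment) pair.
counts = (
    toy_txn.groupby(["Unique_Advisor_Id", "Unique_Investment_Id", "Transaction_Type"])
    .size()
    .unstack("Transaction_Type", fill_value=0)
    .reindex(columns=["P", "R"], fill_value=0)
)

# Fraction of redemptions, with the same epsilon guard as compute_cfmat.
ratio = counts["R"] / (counts["R"] + counts["P"] + 1e-9)

# Advisors as rows, investments as columns, 0.0 where a pair never transacted.
cfmat_fast = ratio.unstack("Unique_Investment_Id", fill_value=0.0)
print(cfmat_fast)
```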

In [None]:
from sklearn.metrics import mean_squared_error
import numpy as np

from src.main.core.algorithms import txnmat


class CF1(object):
    def __init__(self, month=None):
        self.iid2idx = dict()
        self.idx2iid = dict()
        self.aid2idx = dict()
        self.idx2aid = dict()
        self.n_users = -1
        self.n_items = -1
        self.month = month
        self.__load()

    def __get_mse(self, pred, actual):
        # Compare only the entries where actual is non-zero (zero means no data).
        pred = pred[actual.nonzero()].flatten()
        actual = actual[actual.nonzero()].flatten()
        return mean_squared_error(pred, actual)

    def __measure_sparsity(self, mat):
        sparsity = float(len(mat.nonzero()[0]))
        sparsity /= (mat.shape[0] * mat.shape[1])
        sparsity *= 100
        return 'Sparsity: {:4.2f}%'.format(sparsity)

    def __train_test_split(self, datamat, size=2):
        test = np.zeros(datamat.shape)
        train = datamat.copy()
        for user in range(datamat.shape[0]):
            non_zeros = datamat[user, :].nonzero()[0]
            if (len(non_zeros) > size):
                test_ratings = np.random.choice(datamat[user, :].nonzero()[0], size=size, replace=False)
                train[user, test_ratings] = 0.
                test[user, test_ratings] = datamat[user, test_ratings]

        # Test and training are truly disjoint
        assert (np.all((train * test) == 0))
        return train, test

    def __fast_similarity(self, datamat, kind='user', epsilon=1e-9):
        # epsilon -> small number for handling divide-by-zero errors
        if kind == 'user':
            sim = datamat.dot(datamat.T) + epsilon
        elif kind == 'item':
            sim = datamat.T.dot(datamat) + epsilon
        norms = np.array([np.sqrt(np.diagonal(sim))])
        return (sim / norms / norms.T)

    def __predict_slow_simple(self, datamat, similarity, kind='user'):
        pred = np.zeros(datamat.shape)
        if kind == 'user':
            for i in range(datamat.shape[0]):
                for j in range(datamat.shape[1]):
                    pred[i, j] = similarity[i, :].dot(datamat[:, j]) \
                                 / np.sum(np.abs(similarity[i, :]))
            return pred
        elif kind == 'item':
            for i in range(datamat.shape[0]):
                for j in range(datamat.shape[1]):
                    pred[i, j] = similarity[j, :].dot(datamat[i, :].T) \
                                 / np.sum(np.abs(similarity[j, :]))

            return pred

    def __predict_fast_simple(self, ratings, similarity, kind='user'):
        if kind == 'user':
            return similarity.dot(ratings) / np.array([np.abs(similarity).sum(axis=1)]).T
        elif kind == 'item':
            return ratings.dot(similarity) / np.array([np.abs(similarity).sum(axis=1)])

    def __print_data(self, matrix, message):
        print("\n")
        print("Dataset: " + message)
        print("Shape: " + str(matrix.shape))
        print("Sparsity: " + str(self.__measure_sparsity(matrix)))
        print("First 5 rows: ")
        print(matrix[:5, ])
        print("\n")

    def __load_index_map(self, df):
        self.iids = list(df.columns[1:])
        for idx, iid in enumerate(self.iids):
            self.iid2idx[iid] = idx
            self.idx2iid[idx] = iid

        self.aids = list(df.index)
        for idx, aid in enumerate(self.aids):
            self.aid2idx[aid] = idx
            self.idx2aid[idx] = aid

        self.n_users = len(self.aids)
        self.n_items = len(self.iids)

    def __load(self):
        df = txnmat.compute_cfmat(month=self.month)
        self.__load_index_map(df)
        # DataFrame.as_matrix was removed in pandas 1.0; to_numpy is its replacement
        ratings = df[df.columns[1:]].to_numpy(dtype=float)

        # print_data(ratings, "Ratings matrix")
        train, test = self.__train_test_split(ratings)
        # print_data(train, "Training Matrix")
        # print_data(test, "Testing Matrix")

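        # Build the similarities from the training part only, so that the
        # held-out test entries do not leak into the reported MSE
        user_similarity = self.__fast_similarity(train)
        item_similarity = self.__fast_similarity(train, kind='item')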

        # print_data(user_similarity, "User Similarity Matrix")
        # print_data(item_similarity, "Item Similarity Matrix")

        self.item_prediction = self.__predict_fast_simple(train, item_similarity, kind='item')
        self.user_prediction = self.__predict_fast_simple(train, user_similarity, kind='user')

        print('User-based CF MSE: ' + str(self.__get_mse(self.user_prediction, test)))
        print('Item-based CF MSE: ' + str(self.__get_mse(self.item_prediction, test)))
        # print_data(item_prediction, "Item Pred")
        # print_data(user_prediction, "User Pred")

    def get_value(self, aid, iid, algorithm='user'):
        if aid in self.aid2idx:
            aid_idx = self.aid2idx[aid]
        else:
            return -1

        if iid in self.iid2idx:
            iid_idx = self.iid2idx[iid]
        else:
            return -1

        value = -1
        if algorithm == 'user':
            value = self.user_prediction[aid_idx][iid_idx]
        elif algorithm == 'item':
            value = self.item_prediction[aid_idx][iid_idx]
        else:
            print("Invalid algorithm input provided.")
        return value


if __name__ == '__main__':
    pass


This method was then used to predict the future action based on the history of the item or user. It gave us an accuracy of 28.15% on the testing dataset, which we take as the baseline.

We will now build solutions that improve on this baseline.

<b>6.2 BETTERING THE VERSION 1 SOLUTION</b>

Let us now use the Machine Learning Algorithms Like Logistic Regression to find the solution to the above problem.

In [None]:
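# Importing all the necessary libraries and packages for applying machine learning algorithms.
# Logistic Regression

import numpy as np
import pandas as pd

from sklearn import model_selection
from sklearn.linear_model import LogisticRegression

df = pd.read_csv('resources/processed_dataset/dataset.csv')
# read_csv parses missing values as NaN, so fill them with a sentinel value
df.fillna(-9999, inplace=True)
# Drop identifiers and descriptive columns; errors='ignore' skips columns
# that were already dropped before dataset.csv was written
df.drop(columns=['Unique_Advisor_Id', 'Unique_Investment_Id', 'Month', 'Morningstar Category', 'Investment'], errors='ignore', inplace=True)

x = np.array(df.drop(columns=['class_label']))
y = np.array(df['class_label'])

x_train, x_test, y_train, y_test = model_selection.train_test_split(x, y, test_size=0.2)

logreg = LogisticRegression()
logreg.fit(x_train, y_train)

def predict(features):
    """Predict class labels for new feature rows with the fitted model."""
    output = logreg.predict(features)
    print(output)
    return output

# Class probabilities for the held-out rows (columns follow logreg.classes_)
print(logreg.predict_proba(x_test))

# Accuracy on the training split (the figure quoted below)
accuracy_of_logistic = logreg.score(x_train, y_train)
print(accuracy_of_logistic)

# Accuracy on the held-out 20%, a better estimate for unseen data
print(logreg.score(x_test, y_test))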


<b>We got an accuracy of 96.0006437596% using the Logistic Regression algorithm on the dataset that we have.</b>

This figure is the score on the training split (x_train, y_train). Accuracy on the held-out test split is the more reliable estimate of how the model behaves on unseen data.

<b> We have used the Logistic Regression model because we wanted to know both the class to which a particular input belongs and the probability that the input belongs to that class. </b>

Had we not needed the probability of the input belonging to each class, we could have used the other machine learning algorithms whose code is available below.

In this particular problem we applied logistic regression and stopped, but in other problems it may be necessary to run all the candidate algorithms on the data and see which one gives the highest accuracy.

We would then settle on the algorithm that gives the highest accuracy and use it to make further predictions.
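
Running several algorithms and keeping the most accurate one can be sketched as a small loop. The example below uses synthetic data and three of the models named earlier; the data, the helper name `compare_models` and the fixed `random_state` are assumptions made for illustration. Every model is scored on the same held-out split so that the numbers are comparable.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def compare_models(models, x, y, test_size=0.2, random_state=0):
    """Fit every model on the same training split and score it on the same test split."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        x, y, test_size=test_size, stratify=y, random_state=random_state
    )
    scores = {}
    for name, model in models.items():
        model.fit(x_tr, y_tr)
        scores[name] = model.score(x_te, y_te)  # held-out accuracy
    return scores


# Synthetic data: the label depends on the sum of the first two features.
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(200, 4))
y_demo = np.where(x_demo[:, 0] + x_demo[:, 1] > 0, "R", "P")

scores = compare_models(
    {
        "logistic_regression": LogisticRegression(),
        "knn": KNeighborsClassifier(n_neighbors=3),
        "decision_tree": DecisionTreeClassifier(random_state=0),
    },
    x_demo,
    y_demo,
)
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

A single split can favour one model by chance, so cross-validation would give a steadier comparison when the candidates score closely.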

In [None]:
# Using SVM

svc = SVC()
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_svc = round(svc.score(x_test, y_test) * 100, 2)
acc_svc

In [None]:
# Using kNN

knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(x_train, y_train)
y_pred = knn.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_knn = round(knn.score(x_test, y_test) * 100, 2)
acc_knn

In [None]:
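# Gaussian Naive Bayes
# (uses the lowercase x_train / y_train names defined in the logistic regression cell)

gaussian = GaussianNB()
gaussian.fit(x_train, y_train)
y_pred = gaussian.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_gaussian = round(gaussian.score(x_test, y_test) * 100, 2)
acc_gaussian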

In [None]:
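# Perceptron
# (uses the lowercase x_train / y_train names defined in the logistic regression cell)

perceptron = Perceptron()
perceptron.fit(x_train, y_train)
y_pred = perceptron.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_perceptron = round(perceptron.score(x_test, y_test) * 100, 2)
acc_perceptron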

In [None]:
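# Linear SVC
# (uses the lowercase x_train / y_train names defined in the logistic regression cell)

linear_svc = LinearSVC()
linear_svc.fit(x_train, y_train)
y_pred = linear_svc.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_linear_svc = round(linear_svc.score(x_test, y_test) * 100, 2)
acc_linear_svc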

In [None]:
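# Stochastic Gradient Descent
# (uses the lowercase x_train / y_train names defined in the logistic regression cell)

sgd = SGDClassifier()
sgd.fit(x_train, y_train)
y_pred = sgd.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_sgd = round(sgd.score(x_test, y_test) * 100, 2)
acc_sgd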

In [None]:
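# Decision Tree
# (uses the lowercase x_train / y_train names defined in the logistic regression cell)

decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train, y_train)
y_pred = decision_tree.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_decision_tree = round(decision_tree.score(x_test, y_test) * 100, 2)
acc_decision_tree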

In [None]:
# Random Forest
# (uses the lowercase x_train / y_train names defined in the logistic regression cell)

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(x_train, y_train)
y_pred = random_forest.predict(x_test)
# Accuracy (in %) on the held-out test split
acc_random_forest = round(random_forest.score(x_test, y_test) * 100, 2)
acc_random_forest

<b>7. SUPPLY OR SUBMIT THE RESULTS</b>

Now that the code is ready and the logistic regression model reports an accuracy of 96.0006437596%, we can take any new dataset on which predictions are required, prepare it in the same way as the training data, and run the model described above to get the predictions.
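
Scoring a new dataset only works if its features are prepared exactly like the training features: the same columns, in the same order, with the same fill value for missing data. The sketch below shows one way to enforce this. The toy tables, the helper name `prepare_features` and the choice of feature columns are hypothetical; the -9999 sentinel follows the logistic regression cell above.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

ID_COLUMNS = ["Unique_Advisor_Id", "Unique_Investment_Id", "Month"]


def prepare_features(df, feature_columns=None):
    """Drop identifiers and the label, fill gaps, and align columns to the training layout."""
    features = df.drop(columns=ID_COLUMNS + ["class_label"], errors="ignore").fillna(-9999)
    if feature_columns is not None:
        # Same columns in the same order as at training time
        features = features.reindex(columns=feature_columns, fill_value=-9999)
    return features


# Hypothetical labelled training data.
train = pd.DataFrame({
    "Unique_Advisor_Id": [1, 1, 2, 2, 3, 3],
    "Unique_Investment_Id": [10, 11, 10, 11, 10, 11],
    "Month": ["2016-01"] * 6,
    "AUM": [1.0, 0.2, 0.9, 0.1, 0.8, 0.3],
    "Activity_Count": [5, 0, 4, 1, 6, 0],
    "class_label": ["P", "R", "P", "R", "P", "R"],
})
x_train_df = prepare_features(train)
model = LogisticRegression().fit(x_train_df, train["class_label"])

# Hypothetical new data: no label, different column order, one missing value.
new = pd.DataFrame({
    "Activity_Count": [3, None],
    "AUM": [0.7, 0.2],
    "Unique_Advisor_Id": [4, 5],
    "Unique_Investment_Id": [10, 11],
    "Month": ["2016-02"] * 2,
})
x_new_df = prepare_features(new, feature_columns=x_train_df.columns)

predictions = model.predict(x_new_df)
probabilities = model.predict_proba(x_new_df)  # columns follow model.classes_
print(predictions)
print(probabilities)
```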