# Skillenza - Devengers Hackathon Prediction

### Objective of the problem
The objective is to predict the value of the “treatment” attribute from the given features of the test data. The predictions are to be written to a CSV file along with the ID, which is the unique identifier for each tuple. See the sample submission file for the expected layout, then upload the submission file to get a score.

### Description of files
**Training File** : Contains all features, including the target. The machine learning model is trained on this file, which is used for both training and validation.  
**Test File** : Contains all features except the target variable. A prediction is to be made for every tuple in the test file; the predicted values are written to a CSV file along with the ID and uploaded.  
**Sample Submission** : An example of what the actual submission file should look like.
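
A minimal sketch of building such a submission with pandas. The column names `s.no` and `treatment`, the `Yes`/`No` labels, and the helper name `build_submission` are assumptions for illustration; the sample submission file remains the authority on the real layout.

```python
import pandas as pd

def build_submission(ids, predictions, id_col="s.no", target_col="treatment"):
    """Pair each test-row identifier with its predicted label.

    The column names and the Yes/No encoding are assumed, not taken
    from the actual sample submission file.
    """
    labels = ["Yes" if p == 1 else "No" for p in predictions]
    return pd.DataFrame({id_col: list(ids), target_col: labels})

# Toy identifiers and 0/1 predictions, not real competition output
submission = build_submission(ids=[1, 2, 3], predictions=[1, 0, 1])

# index=False keeps the pandas row index out of the CSV
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path to `to_csv` instead of capturing the returned string would write the file to disk.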

### Features 
 - Timestamp  
 - Age  
 - Gender 
 - Country 
 - state: If you live in the United States, which state or territory do you live in? 
 - self_employed: Are you self-employed? 
 - family_history: Do you have a family history of mental illness? 
 - treatment: Does the respondent really need treatment? (Target variable.) 
 - work_interfere: If you have a mental health condition, do you feel that it interferes with your work? 
 - no_employees: How many employees does your company or organization have? 
 - remote_work: Do you work remotely (outside of an office) at least 50% of the time? 
 - tech_company: Is your employer primarily a tech company/organization? 
 - benefits: Does your employer provide mental health benefits? 
 - care_options: Do you know the options for mental health care your employer provides? 
 - wellness_program: Has your employer ever discussed mental health as part of an employee wellness program? 
 - seek_help: Does your employer provide resources to learn more about mental health issues and how to seek help? 
 - anonymity: Is your anonymity protected if you choose to take advantage of mental health or substance abuse treatment resources? 
 - leave: How easy is it for you to take medical leave for a mental health condition? 
 - mental_health_consequence: Do you think that discussing a mental health issue with your employer would have negative consequences? 
 - phys_health_consequence: Do you think that discussing a physical health issue with your employer would have negative consequences? 
 - coworkers: Would you be willing to discuss a mental health issue with your coworkers? 
 - supervisor: Would you be willing to discuss a mental health issue with your direct supervisor(s)? 
 - mental_health_interview: Would you bring up a mental health issue with a potential employer in an interview? 
 - phys_health_interview: Would you bring up a physical health issue with a potential employer in an interview? 
 - mental_vs_physical: Do you feel that your employer takes mental health as seriously as physical health? 
 - obs_consequence: Have you heard of or observed negative consequences for coworkers with mental health conditions in your workplace? 
 - comments: Any additional notes or comments.

In [1]:
import datetime
import glob
import ipywidgets
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
from scipy import stats
import seaborn as sns
from sklearn import metrics
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV


from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier

import time
import warnings
import xgboost as xgb

%matplotlib inline
%run ./plugins/widgets.py

Widget Loaded


### Global parameters and variables

In [2]:
plt.rcParams['figure.figsize'] = [16, 9]
plt.rcParams['font.size'] = 14
plt.rcParams['axes.grid'] = True
plt.rcParams['figure.facecolor'] = 'white'
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)

## Load data

In [3]:
full_train_df = pd.read_csv('./devengers_train.csv')
print("Shape : ", full_train_df.shape)
full_train_df.sample(2)

Shape :  (1000, 28)


Unnamed: 0,s.no,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
410,411,2014-08-27 15:29:03,33,male,United States,OH,No,Yes,Yes,Often,More than 1000,Yes,Yes,Yes,Yes,No,No,Don't know,Somewhat easy,Yes,No,No,No,No,No,No,No,
947,948,2014-08-29 00:11:17,43,Male,United States,WA,No,No,Yes,Rarely,More than 1000,No,Yes,Yes,Yes,Yes,Yes,Don't know,Somewhat easy,No,No,Yes,Yes,Maybe,Yes,Yes,No,I have an exceptional employer. I haven't run ...


In [4]:
full_test_df = pd.read_csv('./devengers_test.csv')
print("Shape : ", full_test_df.shape)
full_test_df.sample(2)

Shape :  (259, 27)


Unnamed: 0,s.no,Timestamp,Age,Gender,Country,state,self_employed,family_history,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments
74,75,2014-08-29 14:04:59,31,m,United States,CA,No,No,Sometimes,100-500,No,Yes,Yes,Yes,No,No,Don't know,Somewhat easy,Maybe,No,Some of them,No,No,Maybe,No,No,
141,142,2014-09-01 09:15:34,29,Male,Sweden,,No,No,Rarely,More than 1000,No,Yes,Don't know,Not sure,Don't know,Don't know,Don't know,Don't know,Maybe,Maybe,No,No,No,Yes,Don't know,Yes,


## Data Pre-processing and Cleanup

In [5]:
def preprocess_and_cleanup(df, test_data=False):
    
    # Age Column Cleanup: ages outside 18-100 are replaced with the mean
    # of the ages that fall within 10-100
    mean_age = np.mean(df['Age'][(df['Age']>=10) & (df['Age'] <= 100)].values)
    df['Age'] = np.where((df['Age'] >= 18) & (df['Age'] <= 100), df['Age'], mean_age)
    
    # Cleanup Gender Column
    df['Gender'] = df['Gender'].str.strip()
    male_strings = ['Male', 'male', 'M', 'm', 'Man', "man", "male-ish",
                    "mal", 'Make', "maile", "msle", "mail", "malr", "make",
                    "male (cis)",  "cis man",  "Cis Male", "cis male"]
    df['gender_male'] = df['Gender'].isin(male_strings).astype(int)
    female_strings = ['Female', 'female', 'F', 'f', 'Woman', "woman",
                      "femake", "femail",
                      "cis female", "cis-female/femme", "female (cis)"
                     ]
    df['gender_female'] = df['Gender'].isin(female_strings).astype(int)
    df['gender_other'] = np.where((df['gender_male']==0) & (df['gender_female']==0), 1, 0)
    
    # Replace all no/dont know/yes with numeric values
    mcq_columns = []
    mcq_columns.extend(['self_employed', 'family_history', 'remote_work', 'tech_company', 'benefits', 
                       'care_options', 'wellness_program', 'seek_help', 'anonymity', 'mental_health_consequence', 
                       'phys_health_consequence', 'coworkers', 'supervisor', 'mental_health_interview', 
                       'phys_health_interview', 'mental_vs_physical', 'obs_consequence'])
    if not test_data:
        mcq_columns.append('treatment')
    for col in mcq_columns:
        df[col] = df[col].map({'Yes': 1, 
                               'Maybe': 0.5,  'Some of them' : 0.5, 'Not sure': 0.5, 'Don\'t know': 0.5,
                               'No': 0})

    # Convert Other Columns to Numeric
    # df['work_interfere'] = df['work_interfere'].map({'Often': 1, 'Sometimes': 2/3, 'Rarely': 1/3, 'Never': 0})

    # df['leave'] = df['leave'].map({'Very easy': 1, 'Somewhat easy': 3/4, 'Don\'t know ': 0.5, 
    #                               'Somewhat difficult': 1/4, 'Very difficult': 0})

    # Drop Rubbish Columns
    df = df.drop(['s.no', 'Timestamp', 'Gender', 'comments'], axis=1)
    
    # Final Step
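    # numeric_only=True avoids a TypeError on the remaining text columns in recent pandas
    df = df.fillna(df.median(numeric_only=True))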
    df = pd.get_dummies(df)
    df = df.reindex(columns=(sorted(df.columns)))
    
    return df

In [6]:
train_df = preprocess_and_cleanup(full_train_df, False)
train_df.sample(5)

Unnamed: 0,Age,Country_Australia,Country_Austria,"Country_Bahamas, The",Country_Belgium,Country_Bosnia and Herzegovina,Country_Brazil,Country_Bulgaria,Country_Canada,Country_Colombia,Country_Costa Rica,Country_Croatia,Country_Denmark,Country_Finland,Country_France,Country_Germany,Country_Hungary,Country_India,Country_Ireland,Country_Israel,Country_Italy,Country_Japan,Country_Latvia,Country_Mexico,Country_Netherlands,Country_New Zealand,Country_Nigeria,Country_Norway,Country_Poland,Country_Portugal,Country_Romania,Country_Russia,Country_Singapore,Country_Slovenia,Country_South Africa,Country_Spain,Country_Sweden,Country_Switzerland,Country_Thailand,Country_United Kingdom,Country_United States,Country_Uruguay,Country_Zimbabwe,anonymity,benefits,care_options,coworkers,family_history,gender_female,gender_male,...,state_CO,state_CT,state_DC,state_FL,state_GA,state_IA,state_ID,state_IL,state_IN,state_KS,state_KY,state_LA,state_MA,state_MD,state_ME,state_MI,state_MN,state_MO,state_MS,state_NC,state_NE,state_NH,state_NJ,state_NM,state_NV,state_NY,state_OH,state_OK,state_OR,state_PA,state_RI,state_SC,state_SD,state_TN,state_TX,state_UT,state_VA,state_VT,state_WA,state_WI,state_WV,state_WY,supervisor,tech_company,treatment,wellness_program,work_interfere_Never,work_interfere_Often,work_interfere_Rarely,work_interfere_Sometimes
736,43.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1.0,1.0,1.0,0.5,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,0.0,0.0,0.5,0,0,0,0
41,35.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0.5,1.0,0.5,0.5,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,1.0,0.0,0.5,0,0,0,0
919,22.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0.5,0.5,0.5,0.0,1.0,1,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.0,1.0,1.0,0.0,0,1,0,0
230,36.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0.5,1.0,0.0,0.5,0.0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.5,1.0,1.0,0.0,0,0,0,1
782,34.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1.0,0.5,0.5,0.5,1.0,0,1,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,1.0,1.0,0.5,0,0,0,1


In [7]:
test_df = preprocess_and_cleanup(full_test_df, True)
test_df.sample(5)

Unnamed: 0,Age,Country_Australia,Country_Belgium,Country_Brazil,Country_Bulgaria,Country_Canada,Country_China,Country_Colombia,Country_Croatia,Country_Czech Republic,Country_Finland,Country_France,Country_Georgia,Country_Germany,Country_Greece,Country_India,Country_Ireland,Country_Israel,Country_Italy,Country_Moldova,Country_Netherlands,Country_New Zealand,Country_Philippines,Country_Poland,Country_Singapore,Country_South Africa,Country_Sweden,Country_Switzerland,Country_United Kingdom,Country_United States,anonymity,benefits,care_options,coworkers,family_history,gender_female,gender_male,gender_other,leave_Don't know,leave_Somewhat difficult,leave_Somewhat easy,leave_Very difficult,leave_Very easy,mental_health_consequence,mental_health_interview,mental_vs_physical,no_employees_1-5,no_employees_100-500,no_employees_26-100,no_employees_500-1000,no_employees_6-25,no_employees_More than 1000,obs_consequence,phys_health_consequence,phys_health_interview,remote_work,seek_help,self_employed,state_AL,state_CA,state_CT,state_FL,state_GA,state_IL,state_IN,state_MD,state_MI,state_MN,state_MO,state_NC,state_NJ,state_NY,state_OH,state_OK,state_OR,state_PA,state_SC,state_SD,state_TN,state_TX,state_UT,state_VA,state_VT,state_WA,state_WI,supervisor,tech_company,wellness_program,work_interfere_Never,work_interfere_Often,work_interfere_Rarely,work_interfere_Sometimes
231,24.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0.5,0.0,0.0,0.5,0.0,0,0,1,1,0,0,0,0,0.5,0.0,0.5,0,0,0,0,1,0,1.0,0.5,0.5,1.0,0.0,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.5,1.0,0.0,0,0,0,1
113,25.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0.5,0.5,0.0,1.0,1.0,0,1,0,0,0,0,0,1,0.0,0.5,0.5,1,0,0,0,0,0,0.0,0.0,1.0,1.0,0.5,0.0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,1.0,0.5,0,0,0,1
194,27.0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0.5,0.5,0.5,0.5,0.0,0,1,0,1,0,0,0,0,0.0,1.0,0.5,0,0,1,0,0,0,0.0,0.0,1.0,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,1.0,0.0,0,0,0,0
80,25.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0,0.0,1.0,1.0,1.0,0,1,0,0,0,0,0,1,0.5,0.0,0.0,1,0,0,0,0,0,0.0,0.0,0.5,0.0,0.0,0.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1.0,1.0,0.0,0,1,0,0
255,32.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1.0,1.0,1.0,0.5,1.0,0,1,0,0,1,0,0,0,0.0,0.0,1.0,0,0,1,0,0,0,0.0,0.0,0.0,1.0,0.0,0.0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1.0,1.0,0.0,0,1,0,0


In [8]:
s1 = set(train_df.columns)
s2 = set(test_df.columns)
print("Missing from Test : ", s1 - s2)
print("\nMissing from Train : ", s2 - s1)

Missing from Test :  {'Country_Japan', 'state_MA', 'state_NV', 'Country_Costa Rica', 'state_NM', 'Country_Denmark', 'Country_Bosnia and Herzegovina', 'state_NE', 'Country_Norway', 'state_DC', 'Country_Bahamas, The', 'Country_Romania', 'Country_Austria', 'state_RI', 'state_ID', 'Country_Portugal', 'state_LA', 'state_WY', 'treatment', 'Country_Russia', 'state_ME', 'Country_Mexico', 'state_CO', 'Country_Slovenia', 'state_WV', 'Country_Latvia', 'state_IA', 'state_KY', 'state_AZ', 'Country_Hungary', 'Country_Zimbabwe', 'state_MS', 'state_NH', 'Country_Uruguay', 'Country_Thailand', 'Country_Spain', 'Country_Nigeria', 'state_KS'}

Missing from Train :  {'Country_Moldova', 'Country_Greece', 'Country_Czech Republic', 'Country_China', 'Country_Georgia', 'Country_Philippines'}


In [9]:
train_df, test_df = train_df.align(test_df, join='outer', axis=1, fill_value=0)

train_df = train_df.reindex(columns=(sorted(list([a for a in train_df.columns if a != 'treatment'])) + ['treatment']))
test_df = test_df.drop(['treatment'], axis=1)
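
The column mismatch reported above arises because `pd.get_dummies` only creates columns for categories that occur in the frame it is given. A toy sketch (made-up frames, not the competition data) of how `DataFrame.align` with `join='outer'` and `fill_value=0` gives both frames the same set of columns:

```python
import pandas as pd

# Each toy frame contains a country the other one lacks
train_toy = pd.get_dummies(
    pd.DataFrame({"Country": ["Japan", "Spain"], "treatment": [1, 0]}), dtype=int
)
test_toy = pd.get_dummies(pd.DataFrame({"Country": ["China", "Spain"]}), dtype=int)

# Outer join on the column axis: every column from either frame appears in both,
# and columns that were missing on one side are filled with 0
train_aligned, test_aligned = train_toy.align(
    test_toy, join="outer", axis=1, fill_value=0
)

# The target was introduced into the test frame by the alignment, so drop it again
test_aligned = test_aligned.drop(columns=["treatment"])

print(list(train_aligned.columns))
print(test_aligned)
```

After alignment the two frames differ only by the target column, so a model fitted on the training features can be applied to the test features column for column.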

## Scikit-Learn Classifiers Evaluation

In [10]:
data_x, data_y = train_df.iloc[:,:-1], train_df.iloc[:,-1]

In [11]:
classifier_list = [
    KNeighborsClassifier(3),
    SVC(kernel="linear", probability=True),
    SVC(kernel="poly", probability=True),
    SVC(kernel='rbf', probability=True),
    # GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis(),
    GradientBoostingClassifier()
]

In [12]:
warnings.filterwarnings('ignore')

## Cross Validation
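
A minimal sketch of how candidate classifiers could be compared with k-fold cross-validation. It runs on synthetic data from `make_classification` as a stand-in for `data_x` and `data_y`, uses a shortened list of candidates, and the fold count and scoring metric are assumptions rather than the settings behind the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for data_x / data_y (assumption: binary target)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

candidates = [
    DecisionTreeClassifier(max_depth=5, random_state=0),
    AdaBoostClassifier(n_estimators=50, random_state=0),
    GradientBoostingClassifier(n_estimators=50, random_state=0),
]

# Mean and standard deviation of accuracy over 5 folds, keyed by class name
cv_results = {}
for candidate in candidates:
    scores = cross_val_score(candidate, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[type(candidate).__name__] = (scores.mean(), scores.std())

# Report candidates from best to worst mean accuracy
for name, (mean_acc, std_acc) in sorted(cv_results.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:30s} accuracy = {mean_acc:.3f} +/- {std_acc:.3f}")
```

With the real data, `X_demo` and `y_demo` would be replaced by `data_x` and `data_y`, and `candidates` by `classifier_list`.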

## Test Data Predictions

In [13]:
# best_classifier = clf.best_estimator_
best_classifier = AdaBoostClassifier(learning_rate=0.01333521432163324, n_estimators=500)
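
The commented-out `clf.best_estimator_` suggests the hyperparameters came from a search object. A sketch of one way such a search could look, assuming a `RandomizedSearchCV` over `learning_rate` and `n_estimators`; the search ranges, fold count, and synthetic data are illustrative assumptions, not the configuration that produced the values above.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for data_x / data_y
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

search = RandomizedSearchCV(
    AdaBoostClassifier(random_state=0),
    param_distributions={
        # Sample the learning rate on a log scale (assumed range)
        "learning_rate": loguniform(1e-3, 1.0),
        "n_estimators": [25, 50, 100],
    },
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=0,
)
search.fit(X_demo, y_demo)

# best_estimator_ is refitted on the full data with the best parameters found
tuned_classifier = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```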

In [14]:
%%time
best_classifier.fit(data_x, data_y)
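# Predict on the same rows the model was fitted on, so the figure below is
# training accuracy rather than a held-out estimate
y_pred = best_classifier.predict_proba(data_x)
y_pred = y_pred[:, 1]
# Threshold the positive-class probability at 0.5
y_pred = np.where(y_pred <= 0.5, 0, 1)
correct = np.where(y_pred==data_y, 1, 0)
np.mean(correct)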

CPU times: user 1.47 s, sys: 2.37 ms, total: 1.47 s
Wall time: 1.49 s


0.842

In [15]:
full_train_df['treat_pred'] = y_pred

In [16]:
full_train_df['accurate_pred'] = full_train_df.apply(lambda row: 1 if row['treatment'] == row['treat_pred'] else 0, 
                                                     axis=1)

In [17]:
np.mean(full_train_df['accurate_pred'])

0.842

In [18]:
full_train_df

Unnamed: 0,s.no,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,gender_male,gender_female,gender_other,treat_pred,accurate_pred
0,1,2014-08-27 11:29:31,37.000000,Female,United States,IL,,0.0,1.0,Often,6-25,0.0,1.0,1.0,0.5,0.0,1.0,1.0,Somewhat easy,0.0,0.0,0.5,1.0,0.0,0.5,1.0,0.0,,0,1,0,1,1
1,2,2014-08-27 11:29:37,44.000000,M,United States,IN,,0.0,0.0,Rarely,More than 1000,0.0,0.0,0.5,0.0,0.5,0.5,0.5,Don't know,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,,1,0,0,0,1
2,3,2014-08-27 11:29:44,32.000000,Male,Canada,,,0.0,0.0,Rarely,6-25,0.0,1.0,0.0,0.0,0.0,0.0,0.5,Somewhat difficult,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,,1,0,0,0,1
3,4,2014-08-27 11:29:46,31.000000,Male,United Kingdom,,,1.0,1.0,Often,26-100,0.0,1.0,0.0,1.0,0.0,0.0,0.0,Somewhat difficult,1.0,1.0,0.5,0.0,0.5,0.5,0.0,1.0,,1,0,0,1,1
4,5,2014-08-27 11:30:22,31.000000,Male,United States,TX,,0.0,0.0,Never,100-500,1.0,1.0,1.0,0.0,0.5,0.5,0.5,Don't know,0.0,0.0,0.5,1.0,1.0,1.0,0.5,0.0,,1,0,0,0,1
5,6,2014-08-27 11:31:22,33.000000,Male,United States,TN,,1.0,0.0,Sometimes,6-25,0.0,1.0,1.0,0.5,0.0,0.5,0.5,Don't know,0.0,0.0,1.0,1.0,0.0,0.5,0.5,0.0,,1,0,0,1,0
6,7,2014-08-27 11:31:50,35.000000,Female,United States,MI,,1.0,1.0,Sometimes,1-5,1.0,1.0,0.0,0.0,0.0,0.0,0.0,Somewhat difficult,0.5,0.5,0.5,0.0,0.0,0.0,0.5,0.0,,0,1,0,1,1
7,8,2014-08-27 11:32:05,39.000000,M,Canada,,,0.0,0.0,Never,1-5,1.0,1.0,0.0,1.0,0.0,0.0,1.0,Don't know,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,1,0,0,0,1
8,9,2014-08-27 11:32:39,42.000000,Female,United States,IL,,1.0,1.0,Sometimes,100-500,0.0,1.0,1.0,1.0,0.0,0.0,0.0,Very difficult,0.5,0.0,1.0,1.0,0.0,0.5,0.0,0.0,,0,1,0,1,1
9,10,2014-08-27 11:32:43,23.000000,Male,Canada,,,0.0,0.0,Never,26-100,0.0,1.0,0.5,0.0,0.5,0.5,0.5,Don't know,0.0,0.0,1.0,1.0,0.5,0.5,1.0,0.0,,1,0,0,0,1


In [33]:
api_df = pd.read_csv('./devengers_api_predict.csv')
api_df.head(5)

Unnamed: 0,s.no,Timestamp,Age,Gender,Country,state,self_employed,family_history,treatment,work_interfere,no_employees,remote_work,tech_company,benefits,care_options,wellness_program,seek_help,anonymity,leave,mental_health_consequence,phys_health_consequence,coworkers,supervisor,mental_health_interview,phys_health_interview,mental_vs_physical,obs_consequence,comments,treat_pred,accurate_pred
0,1,2014-08-27 11:29:31,37,Female,United States,IL,No,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,,Yes,1
1,2,2014-08-27 11:29:37,44,M,United States,IN,No,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,,No,1
2,3,2014-08-27 11:29:44,32,Male,Canada,,No,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,,No,1
3,4,2014-08-27 11:29:46,31,Male,United Kingdom,,No,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,,Yes,1
4,5,2014-08-27 11:30:22,31,Male,United States,TX,No,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,,No,1


In [35]:
merge_df = full_train_df.join(api_df, lsuffix='_l', rsuffix='_r', how='outer')
merge_df.head(5)

Unnamed: 0,s.no_l,Timestamp_l,Age_l,Gender_l,Country_l,state_l,self_employed_l,family_history_l,treatment_l,work_interfere_l,no_employees_l,remote_work_l,tech_company_l,benefits_l,care_options_l,wellness_program_l,seek_help_l,anonymity_l,leave_l,mental_health_consequence_l,phys_health_consequence_l,coworkers_l,supervisor_l,mental_health_interview_l,phys_health_interview_l,mental_vs_physical_l,obs_consequence_l,comments_l,gender_male,gender_female,gender_other,treat_pred_l,accurate_pred_l,s.no_r,Timestamp_r,Age_r,Gender_r,Country_r,state_r,self_employed_r,family_history_r,treatment_r,work_interfere_r,no_employees_r,remote_work_r,tech_company_r,benefits_r,care_options_r,wellness_program_r,seek_help_r,anonymity_r,leave_r,mental_health_consequence_r,phys_health_consequence_r,coworkers_r,supervisor_r,mental_health_interview_r,phys_health_interview_r,mental_vs_physical_r,obs_consequence_r,comments_r,treat_pred_r,accurate_pred_r
0,1,2014-08-27 11:29:31,37.0,Female,United States,IL,,0.0,1.0,Often,6-25,0.0,1.0,1.0,0.5,0.0,1.0,1.0,Somewhat easy,0.0,0.0,0.5,1.0,0.0,0.5,1.0,0.0,,0,1,0,1,1,1,2014-08-27 11:29:31,37,Female,United States,IL,No,No,Yes,Often,6-25,No,Yes,Yes,Not sure,No,Yes,Yes,Somewhat easy,No,No,Some of them,Yes,No,Maybe,Yes,No,,Yes,1
1,2,2014-08-27 11:29:37,44.0,M,United States,IN,,0.0,0.0,Rarely,More than 1000,0.0,0.0,0.5,0.0,0.5,0.5,0.5,Don't know,0.5,0.0,0.0,0.0,0.0,0.0,0.5,0.0,,1,0,0,0,1,2,2014-08-27 11:29:37,44,M,United States,IN,No,No,No,Rarely,More than 1000,No,No,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,No,No,No,No,No,Don't know,No,,No,1
2,3,2014-08-27 11:29:44,32.0,Male,Canada,,,0.0,0.0,Rarely,6-25,0.0,1.0,0.0,0.0,0.0,0.0,0.5,Somewhat difficult,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,,1,0,0,0,1,3,2014-08-27 11:29:44,32,Male,Canada,,No,No,No,Rarely,6-25,No,Yes,No,No,No,No,Don't know,Somewhat difficult,No,No,Yes,Yes,Yes,Yes,No,No,,No,1
3,4,2014-08-27 11:29:46,31.0,Male,United Kingdom,,,1.0,1.0,Often,26-100,0.0,1.0,0.0,1.0,0.0,0.0,0.0,Somewhat difficult,1.0,1.0,0.5,0.0,0.5,0.5,0.0,1.0,,1,0,0,1,1,4,2014-08-27 11:29:46,31,Male,United Kingdom,,No,Yes,Yes,Often,26-100,No,Yes,No,Yes,No,No,No,Somewhat difficult,Yes,Yes,Some of them,No,Maybe,Maybe,No,Yes,,Yes,1
4,5,2014-08-27 11:30:22,31.0,Male,United States,TX,,0.0,0.0,Never,100-500,1.0,1.0,1.0,0.0,0.5,0.5,0.5,Don't know,0.0,0.0,0.5,1.0,1.0,1.0,0.5,0.0,,1,0,0,0,1,5,2014-08-27 11:30:22,31,Male,United States,TX,No,No,No,Never,100-500,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Yes,Yes,Don't know,No,,No,1


In [36]:
merge_df[merge_df['accurate_pred_l'] != merge_df['accurate_pred_r']]

Unnamed: 0,s.no_l,Timestamp_l,Age_l,Gender_l,Country_l,state_l,self_employed_l,family_history_l,treatment_l,work_interfere_l,no_employees_l,remote_work_l,tech_company_l,benefits_l,care_options_l,wellness_program_l,seek_help_l,anonymity_l,leave_l,mental_health_consequence_l,phys_health_consequence_l,coworkers_l,supervisor_l,mental_health_interview_l,phys_health_interview_l,mental_vs_physical_l,obs_consequence_l,comments_l,gender_male,gender_female,gender_other,treat_pred_l,accurate_pred_l,s.no_r,Timestamp_r,Age_r,Gender_r,Country_r,state_r,self_employed_r,family_history_r,treatment_r,work_interfere_r,no_employees_r,remote_work_r,tech_company_r,benefits_r,care_options_r,wellness_program_r,seek_help_r,anonymity_r,leave_r,mental_health_consequence_r,phys_health_consequence_r,coworkers_r,supervisor_r,mental_health_interview_r,phys_health_interview_r,mental_vs_physical_r,obs_consequence_r,comments_r,treat_pred_r,accurate_pred_r
19,20,2014-08-27 11:35:08,36.000000,Male,France,,1.0,1.0,0.0,,6-25,1.0,1.0,0.0,0.0,1.0,0.0,1.0,Somewhat easy,0.0,0.0,0.5,0.5,0.5,0.5,0.5,0.0,,1,0,0,0,1,20,2014-08-27 11:35:08,36,Male,France,,Yes,Yes,No,Sometimes,6-25,Yes,Yes,No,No,Yes,No,Yes,Somewhat easy,No,No,Some of them,Some of them,Maybe,Maybe,Don't know,No,,Yes,0
26,27,2014-08-27 11:37:23,33.000000,male,United States,TN,0.0,0.0,0.0,,1-5,0.0,1.0,0.5,0.5,0.0,0.5,0.5,Don't know,0.5,0.5,0.5,0.0,0.0,0.0,0.5,0.0,,1,0,0,0,1,27,2014-08-27 11:37:23,33,male,United States,TN,No,No,No,Sometimes,1-5,No,Yes,Don't know,Not sure,No,Don't know,Don't know,Don't know,Maybe,Maybe,Some of them,No,No,No,Don't know,No,,Yes,0
37,38,2014-08-27 11:41:50,38.000000,Male,Portugal,,0.0,0.0,0.0,,100-500,0.0,1.0,0.0,1.0,0.0,0.0,0.5,Somewhat easy,0.5,0.0,0.5,0.5,0.0,0.5,0.0,0.0,,1,0,0,0,1,38,2014-08-27 11:41:50,38,Male,Portugal,,No,No,No,Sometimes,100-500,No,Yes,No,Yes,No,No,Don't know,Somewhat easy,Maybe,No,Some of them,Some of them,No,Maybe,No,No,,Yes,0
38,39,2014-08-27 11:42:08,50.000000,M,United States,IN,0.0,0.0,0.0,,100-500,0.0,1.0,1.0,1.0,0.0,0.5,0.5,Don't know,0.0,0.0,0.5,1.0,0.0,0.5,0.5,0.0,,1,0,0,0,1,39,2014-08-27 11:42:08,50,M,United States,IN,No,No,No,Sometimes,100-500,No,Yes,Yes,Yes,No,Don't know,Don't know,Don't know,No,No,Some of them,Yes,No,Maybe,Don't know,No,,Yes,0
41,42,2014-08-27 11:42:31,35.000000,Male,United States,MI,0.0,0.0,0.0,,More than 1000,1.0,1.0,1.0,0.5,0.5,1.0,0.5,Somewhat difficult,1.0,1.0,0.5,0.0,0.0,0.5,0.5,0.0,,1,0,0,0,1,42,2014-08-27 11:42:31,35,Male,United States,MI,No,No,No,Sometimes,More than 1000,Yes,Yes,Yes,Not sure,Don't know,Yes,Don't know,Somewhat difficult,Yes,Yes,Some of them,No,No,Maybe,Don't know,No,,Yes,0
46,47,2014-08-27 11:43:45,28.000000,Male,United Kingdom,,0.0,0.0,0.0,,26-100,0.0,1.0,0.5,0.5,0.0,0.0,0.5,Don't know,0.0,0.5,0.5,1.0,0.5,1.0,0.5,0.0,,1,0,0,0,1,47,2014-08-27 11:43:45,28,Male,United Kingdom,,No,No,No,Sometimes,26-100,No,Yes,Don't know,Not sure,No,No,Don't know,Don't know,No,Maybe,Some of them,Yes,Maybe,Yes,Don't know,No,,Yes,0
47,48,2014-08-27 11:43:48,34.000000,Male,United States,TN,0.0,0.0,0.0,,6-25,0.0,1.0,0.0,0.0,0.0,0.0,0.5,Don't know,0.0,0.0,1.0,1.0,0.5,1.0,0.5,0.0,,1,0,0,0,1,48,2014-08-27 11:43:48,34,Male,United States,TN,No,No,No,Sometimes,6-25,No,Yes,No,No,No,No,Don't know,Don't know,No,No,Yes,Yes,Maybe,Yes,Don't know,No,,Yes,0
51,52,2014-08-27 11:45:32,33.000000,Male,United States,UT,0.0,0.0,0.0,,100-500,1.0,1.0,0.5,0.5,0.5,0.5,0.5,Don't know,0.0,0.0,0.5,1.0,0.5,0.5,0.5,0.0,,1,0,0,0,1,52,2014-08-27 11:45:32,33,Male,United States,UT,No,No,No,Sometimes,100-500,Yes,Yes,Don't know,Not sure,Don't know,Don't know,Don't know,Don't know,No,No,Some of them,Yes,Maybe,Maybe,Don't know,No,,Yes,0
52,53,2014-08-27 11:45:33,31.000000,M,United States,,0.0,0.0,0.0,,100-500,1.0,1.0,0.5,0.0,0.5,0.5,0.5,Don't know,0.5,0.5,0.5,0.5,0.5,0.5,0.5,0.0,,1,0,0,0,1,53,2014-08-27 11:45:33,31,M,United States,,No,No,No,Sometimes,100-500,Yes,Yes,Don't know,No,Don't know,Don't know,Don't know,Don't know,Maybe,Maybe,Some of them,Some of them,Maybe,Maybe,Don't know,No,,Yes,0
54,55,2014-08-27 11:46:36,28.000000,M,Switzerland,,0.0,0.0,0.0,,100-500,0.0,1.0,0.0,0.0,0.0,0.0,0.5,Don't know,0.0,0.0,0.0,0.0,0.0,0.5,0.5,0.0,,1,0,0,0,1,55,2014-08-27 11:46:36,28,M,Switzerland,,No,No,No,Sometimes,100-500,No,Yes,No,No,No,No,Don't know,Don't know,No,No,No,No,No,Maybe,Don't know,No,,Yes,0
