# Multiple Classification models

The balanced and preprocessed dataset is taken for supervised classification. The preprocessed text is used as the features by converting it into vectors using TFIDF scores. The polarityNum is used as the lables. The dataset is split into train, validate and test set. Different classification models used are in this notebook .

In [36]:
#Import dataset into pandas
import pandas as pd
tweets = pd.read_csv('iphone_final_14may.csv')
tweets.head()

Unnamed: 0,id,date,text_original,permalink,text_preprocessed,polarity,polarity_confidence,subjectivity,subjectivity_confidence,polarityNum,brand
0,948570515085713408,2018-01-03 16:03,It's 10 AM and my iPhone 7 battery is already ...,https://twitter.com/TheScottBeach/status/94857...,s iphone battery already minute phone call tol...,negative,0.894631,subjective,1.0,-1.0,iphone
1,945970348436094977,2017-12-27 11:51,Apple launched three phones this year: the bez...,https://twitter.com/Today__Tech/status/9459703...,apple launched three phone year bezelbusting i...,positive,0.492551,subjective,1.0,1.0,iphone
2,951907048035270656,2018-01-12 21:01,Cut one phone completely OFF and my iPhone on ...,https://twitter.com/gods1blessings/status/9519...,cut one phone completely iphone dnd yeah ready...,positive,0.355765,subjective,1.0,1.0,iphone
3,972090952251670528,2018-03-09 13:45,"Do you think Apple iphone x ""notch"" will be th...",https://twitter.com/puspakpatnaik/status/97209...,think apple iphone x notch trend flagship phon...,neutral,0.577812,subjective,1.0,0.0,iphone
4,958395849737990144,2018-01-30 18:45,Fluffy Bling iPhone Case All Colors in Stock h...,https://twitter.com/IDMD_MIAMI/status/95839584...,fluffy bling iphone case color stock http shop...,positive,0.501405,subjective,1.0,1.0,iphone


In [24]:
tweets.polarity.value_counts()

negative    7209
positive    7209
neutral     7209
Name: polarity, dtype: int64

# Filter the neutral tweets 

For binary classification only tweets with positive and negative polarity are considered. But for multi-class classification this is not used.

In [26]:
tweets_PosNeg = tweets[(tweets.polarityNum != 0.0)] 

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier , VotingClassifier
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score, f1_score
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Split the data into train,validate,test set

In [28]:
#80% train data, 10% validation data, 10% test data
train, validate_test = train_test_split(tweets_PosNeg, test_size=0.2, random_state=1) #use tweets dataframe for multi class
validate, test = train_test_split(validate_test, test_size=0.5, random_state=1)

X_train = train['text_preprocessed'].values
X_validate = validate['text_preprocessed'].values
X_test = test['text_preprocessed'].values
y_train = train['polarityNum']
y_validate = validate['polarityNum']
y_test = test['polarityNum']

# Convert text into TFIDF vectors

 Build a TFIDF model for the training data and transform the test and validation data into it to get feature vectors.

In [29]:
v = TfidfVectorizer(norm='l2',min_df=0, use_idf= True, smooth_idf=True, sublinear_tf=True,
                    analyzer='word',  ngram_range=(1, 2))

train_features_model = v.fit(X_train)
test_features = v.transform(X_test)
validate_features =  v.transform(X_validate)
train_features = v.fit_transform(X_train)

# Parameter tuning using GridSearchCV

 Parameter tuning is used to get the optimized hyper parameter for models to perform the best.

In [None]:
import numpy as np

clf = LogisticRegression()
tuned_parameters = [{'C': [0.01,0.1,1,10,13,30], 'max_iter' : [200,500,1000] , 'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}] #clf = KNeighborsClassifier()
#tuned_parameters = [{'n_neighbors': [3,5,7],'weights': ['distance','uniform']}] 
#clf = MultinomialNB()
#tuned_parameters = [{'alpha': [1.0], 'binarize' : [None,0]}]
#clf = SVC()
#tuned_parameters=[{'C': np.logspace(-5, 0, 10), 'class_weight':[None, 'balanced']}]
#clf = DecisionTreeClassifier()
#tuned_parameters=[{'max_depth' :[3,5,7,8], 'random_state' : [7, 50,100], 'min_samples_leaf' : [5,7,9]}]
#clf= RandomForestClassifier()
#tuned_parameters=[{'n_estimators': [200,100,500], #'max_features': [3,7,10], #'min_samples_leaf': [ 1, 4,6], 'min_samples_split' : [2,10,20]
                  #}]
# clf= AdaBoostClassifier()
# tuned_parameters=[{'n_estimators':[100,200,500]} ]
# clf=BernoulliNB()
# tuned_parameters=[{'alpha' : [1.0,0.0]}] 

In [None]:
inner_cv = KFold(n_splits=5)

classifier = GridSearchCV(clf, tuned_parameters, cv=inner_cv, scoring='f1_macro')
model = classifier.fit(train_features, y_train)
scores = classifier.best_score_
scores_std = classifier.cv_results_['std_test_score']
print('F1-macro of '+classifier.__class__.__name__+' is '+str(scores))

# Different classifiers used for modelling

 Run the classification models one by one to see the performance on validation set and also use the model to predict the performance on test set.

In [30]:
Classifiers = [ 
    LogisticRegression(C=30, solver='saga',max_iter=200),
    #KNeighborsClassifier(7, weights= 'distance'), #set params
    #SVC(kernel="linear", C=1, probability=True),
    #MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None),
    #BernoulliNB(alpha = 1.0),
    #DecisionTreeClassifier()#max_depth = 7, min_samples_leaf = 5,
        #random_state = 50),
    #XGBClassifier(n_estimators = 500),
    #RandomForestClassifier(n_estimators=500,max_features= 3),
    #GradientBoostingClassifier(learning_rate = 0.5, n_estimators= 500),
    AdaBoostClassifier(learning_rate = 0.5,n_estimators=500, random_state= 50),
]

In [32]:
import warnings
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)

dense_features=train_features.toarray() #when text is used
dense_test= validate_features.toarray()
Model=[]

for classifier in Classifiers:
    try:
        model = classifier.fit(train_features,y_train)
        pred = model.predict(validate_features)
    except Exception:
        model = classifier.fit(dense_features,y_train)
        pred = model.predict(dense_test)
        
    accuracy = accuracy_score(pred,y_validate)
    f1 = f1_score(pred,y_validate, average = 'macro')
    scores = cross_val_score(classifier, train_features, y_train, cv=5)
    Model.append(classifier.__class__.__name__)
    print('Accuracy of '+classifier.__class__.__name__+' is '+str(accuracy))    
    print('F1 of '+classifier.__class__.__name__+' is '+str(f1)) 
    print('Cross_validation_score of '+classifier.__class__.__name__+' is '+str(scores))

Accuracy of LogisticRegression is 0.8723994452149791
F1 of LogisticRegression is 0.8723748944185662
Cross_validation_score of LogisticRegression is [0.84532062 0.84482011 0.83701777 0.84605377 0.83651344]


# Test the model on test data

In [33]:
#F1 score on test data for each model
pred_test = model.predict(test_features)
f1_score(pred_test,y_test, average = 'macro')

0.8668390433096316

# Ensemble by voting

The top three best performing individual models logistic regression, support vector machine and Naive bayes are used to build the ensemble by voting.

In [39]:
import warnings
warnings.filterwarnings(module='sklearn*', action='ignore', category=DeprecationWarning)
seed = 1
kfold = model_selection.KFold(n_splits=10, random_state=seed)
estimators = []
model1 = LogisticRegression(C=10,max_iter=200)
estimators.append(('logistic', model1))
model2 = SVC(kernel="linear", C=1, probability=True)
estimators.append(('SVM', model1))
model3 = MultinomialNB ()
estimators.append(('NB', model3))
# create the ensemble model
ensemble = VotingClassifier(estimators)
results = model_selection.cross_val_score(ensemble, train_features, y_train, cv=kfold)
print(results.mean())

0.8483603920749276


In [40]:
ensemble_model = ensemble.fit(train_features,y_train )
pred = ensemble_model.predict(test_features)
f1_score(pred,y_test, average = 'macro')

0.8661528998217174