# Assignment 5. Machine Learning and Natural Language Processing

OPIM 5894 Data Science with Python

Name: Pawan Shivhare

NetID: PRS16109

Discussed with: if any

## Instructions
In this assignment, you are asked to predict genders of users using their public information on websites. In question 1, you are asked to predict gender using only usename. In question 2, you are asked to predict gender using the profile description of a user instead. Finally, you may combine all available information of users to make predictions. You may explore different models and different combination of features, as well as different ways to transform features, to achieve best performance. 
<br> <br>
- It is recommended to use NLTK for this classification task, as the features stored in dictionary style can be easily extended. While scikit-learn is easier for Q2, it might not be that straightforward to combine different features in Q3. In addition, dealing with categorical variables can be a pain in scikit-learn. If you plan to use scikit-learn anyway, please read the following post: http://pbpython.com/categorical-encoding.html
- While protyping, it is easier to stick to the Naive Bayes Classifier. Adding other classifiers once your code is bug-free.
- Use cross validation on the training set to avoid over-fitting, though it is not guaranteed achieve that purpose.


<br>
This assignment involves the following challenges:
- Construct features from strings (i.e., usernames)
- Frequent use of zip() and zip(*) (see doc https://docs.python.org/3/library/functions.html)
- Parsing a json style column into multiple columns
- Merging different features into one feature set
- Find appropriate models and features to improve prediction accuracy
- Writing and debugging a lot of code
<br><br>

What to submit?
- The predictions of 5 models on the test set (see a sample submission sample_submission.csv). Diverify your portfolio, as similar models may suffer from similar problems.
- The notebook file (** please make sure that your code are sufficiently commented**)
- In the end of the notebook file, briefly describe what you have done, which models work the best, and what findings you have.
<br><br>

The top 50% submissions will get 0-3 extra points. Try at least 3 models for each question. Try as many as you want for extra credit.
<br><br>
** Please do NOT distribute the dataset used in this assignment!**


In [1]:
import pandas as pd
import os
os.chdir('/Users/pawanshivhare/Documents/Uconn/data science in python/Assignment5')

In [2]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
train.head()

Unnamed: 0,username,gender,status,description
0,Vimal20011,M,"{u'payment_verified': False, u'identity_verifi...",A team of 5 working on various projects relate...
1,sheom,M,"{u'payment_verified': True, u'identity_verifie...",We are an IT solution and service provider com...
2,ezbik,M,"{u'payment_verified': False, u'identity_verifi...",System administration is my work & hobby.
3,angelme,F,"{u'payment_verified': False, u'identity_verifi...",Good day! Thank you for taking some time to ch...
4,snitch1,M,"{u'payment_verified': False, u'identity_verifi...",I build good relation with clients and deliver...


## 1. Predicting Gender with Username
Some potential features of usernames: whether it has capital letters, whether it has digits, number of characters, number of vowels, first and last letters, etc. See http://www.nltk.org/book/ch06.html for some related code.

In [3]:
# conda install -c anaconda nltk
# download english.pickle
# nltk.download('punkt')
# nltk.download('stopwords')
import nltk
import string
from sklearn.model_selection import KFold
import numpy as np
from sklearn import metrics
import warnings; warnings.simplefilter('ignore')

In [4]:
# Define a function to preprocess the username field and create features. 
def fcreate(df,istrain):    
    
    gal_words=['LADY','WOMAN','MISS','GAL','CAT','ANGEL','GIRL','BLONDE','QUEEN']
    boy_words=['BOY','GUY','DUDE','KING','BOND']
    
    uids=df['username']
    
    # The length of id
    len_id=[len(id) for id in uids]
    
    # Number of digits, has digits and ends with digits in username
    has_digits=np.where([any(c.isdigit() for c in id) for id in uids ],1,0)
    cnt_digits=[sum(c.isdigit() for c in id) for id in uids ]
    end_digits=np.where([(id[len(id)-1:] in string.digits) for id in uids],1,0)
    
    # Number of Capital, has capital and ends with capital
    has_capital=np.where([any(c.isupper() for c in id) for id in uids ],1,0)
    cnt_capital=[sum(c.isupper() for c in id) for id in uids ]
    end_capital=np.where([id[len(id)-1:].isupper() for id in uids ],1,0)
    
    # Number of Vowels, has Vowels and ends with Vowels
    has_vowels=np.where([any(c in ('a','e','i','o','u') for c in id) for id in uids],1,0)
    cnt_vowels=[sum(c in ('a','e','i','o','u') for c in id) for id in uids]
    end_vowels=np.where([(id[len(id)-1:] in ('a','e','i','o','u')) for id in uids],1,0)
    start_vowels=np.where([(id[0:1] in ('a','e','i','o','u')) for id in uids],1,0)
    
    # First and Last letter
    
    #first_char=[id[0:1] for id in uids]
    #last_char=[id[len(id)-1:] for id in uids]
    
    # Has Girl/Boy Words
    
    girl_flag=np.where((uids.str.upper().str.contains('|'.join(gal_words))),1,0)
    boy_flag=np.where((uids.str.upper().str.contains('|'.join(boy_words))),1,0)
    
    
    if istrain:
        gender=np.where([(cl=='M') for cl in df['gender']],1,0)
        zz = pd.DataFrame({'gender':gender,
                      'len_id':len_id,'has_digits':has_digits,
                     'cnt_digits':cnt_digits,'end_digits':end_digits,
                       
                     'has_capital':has_capital,
                     'cnt_capital':cnt_capital,'end_capital':end_capital,
                      
                     'has_vowels':has_vowels,
                     'cnt_vowels':cnt_vowels,'end_vowels':end_vowels, 
                     'start_vowels':start_vowels,
                       
                       'girl_flag':girl_flag,'boy_flag':boy_flag})
    else:
        zz = pd.DataFrame({'len_id':len_id,'has_digits':has_digits,
                     'cnt_digits':cnt_digits,'end_digits':end_digits,
                       
                     'has_capital':has_capital,
                     'cnt_capital':cnt_capital,'end_capital':end_capital,
                      
                     'has_vowels':has_vowels,
                     'cnt_vowels':cnt_vowels,'end_vowels':end_vowels, 
                     'start_vowels':start_vowels,
                       
                       'girl_flag':girl_flag,'boy_flag':boy_flag})
        
    return(zz)

In [5]:
# Create features in the train and test dataset

train_new=fcreate(train,istrain=True)
test_new=fcreate(test,istrain=False)
train_new.head()

Unnamed: 0,boy_flag,cnt_capital,cnt_digits,cnt_vowels,end_capital,end_digits,end_vowels,gender,girl_flag,has_capital,has_digits,has_vowels,len_id,start_vowels
0,0,1,5,2,0,1,0,1,0,1,1,1,10,0
1,0,0,0,2,0,0,0,1,0,0,0,1,5,0
2,0,0,0,2,0,0,0,1,0,0,0,1,5,1
3,0,0,0,3,0,0,1,0,1,0,0,1,7,1
4,0,0,1,1,0,1,0,1,0,0,1,1,7,0


In [6]:
train_features=['len_id','has_digits','cnt_digits','end_digits','has_vowels','cnt_vowels',
                'end_vowels','start_vowels','girl_flag','boy_flag','cnt_capital','end_capital','has_capital']
label=['gender']

#***********************************K fold cross validation with different values of K as 5 *******#

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score 

kfold = KFold(n_splits= 5, random_state= 123)
result = cross_val_score(LogisticRegression(), train_new[train_features], train_new[label], cv=kfold, scoring='accuracy')

print(result)
print(result.mean())
print(np.median(result))

# The mean and median of accuracy on test is almost same indicating the model is not overfitting #

#*********************************** Fit a logistic model ****************#

logreg = LogisticRegression()
model = logreg.fit(train_new[train_features],train_new[label])

from sklearn.metrics import accuracy_score
train_pred=model.predict(train_new[train_features])
lm_accu=accuracy_score(train_new[label],train_pred)
print(lm_accu)

lm_test_pred=model.predict(test_new)

[ 0.8192      0.808       0.8192      0.82        0.81905524]
0.817091048839
0.8192
0.817090734518


In [7]:
train_features=['len_id','has_digits','cnt_digits','end_digits','has_vowels','cnt_vowels',
                'end_vowels','start_vowels','girl_flag','boy_flag','cnt_capital','end_capital','has_capital']
label=['gender']

#***********************************K fold cross validation with different values of K as 5 *******#

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score 

kfold = KFold(n_splits= 5, random_state= 123)
result = cross_val_score(RandomForestClassifier(), train_new[train_features], train_new[label], cv=kfold, scoring='accuracy')

print(result)
print(result.mean())
print(np.median(result))

# The mean and median of accuracy on test is almost same indicating the model is not overfitting #

#*********************************** Fit a Random Forest Model ****************#

rfmodel = RandomForestClassifier()
model = rfmodel.fit(train_new[train_features],train_new[label])

from sklearn.metrics import accuracy_score
train_pred=model.predict(train_new[train_features])
rf_accu=accuracy_score(train_new[label],train_pred)
print(rf_accu)

rf_test_pred=model.predict(test_new)

[ 0.7968      0.78        0.7976      0.7936      0.78622898]
0.790845796637
0.7936
0.841894703153


In [8]:
train_features=['len_id','has_digits','cnt_digits','end_digits','has_vowels','cnt_vowels',
                'end_vowels','start_vowels','girl_flag','boy_flag','cnt_capital','end_capital','has_capital']
label=['gender']

#***********************************K fold cross validation with different values of K as 5 *******#

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score 

kfold = KFold(n_splits= 5, random_state= 123)
svm = SVC(kernel='linear', C=10)
result = cross_val_score(svm, train_new[train_features], train_new[label], cv=kfold, scoring='accuracy')

print(result)
print(result.mean())
print(np.median(result))

# The mean and median of accuracy on test is almost same indicating the model is not overfitting #

#*********************************** Fit a SVM Model ****************#

model = svm.fit(train_new[train_features],train_new[label])

from sklearn.metrics import accuracy_score
train_pred=model.predict(train_new[train_features])
svm_accu=accuracy_score(train_new[label],train_pred)
print(svm_accu)

svm_test_pred=model.predict(test_new)

[ 0.8192      0.808       0.8192      0.82        0.81905524]
0.817091048839
0.8192
0.817090734518


In [9]:
# Printing the predictions on test

zz = pd.DataFrame({'username':test['username'], 'prediction_lm':lm_test_pred,'prediction_rf':rf_test_pred
                  ,'prediction_svm':svm_test_pred})
zz.to_csv('pred_uname.csv', index=False)
zz.head()

Unnamed: 0,prediction_lm,prediction_rf,prediction_svm,username
0,0,1,1,nazrulmadina
1,0,0,0,SehrishWarraich
2,0,1,1,samadhinie
3,0,1,1,ebottabi
4,0,1,1,mrjimoy


## 2. Predicting Gender with Description
The updated notebook for lecture 11 might be of some help, which now includes demo code for making predictions with NLTK classifier.

In [10]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
ps = PorterStemmer()
from nltk.tokenize import word_tokenize
def preprocess(text):
    return [ps.stem(w) for w in word_tokenize(text.lower()) 
             if w not in string.punctuation and w not in stopwords.words('english')] 

In [11]:
def extract_features(words, selected_words):
    ''' simply using words counts'''
    return nltk.FreqDist([w for w in words if w in selected_words])

In [12]:
# Prepping up Train Data from description

m_text=train[train['gender']=='M'][['description']]
f_text=train[train['gender']!='M'][['description']]

train_list=[m_text,f_text]

categories = ['Male','Female']
train_words = [(preprocess(t), category) 
               for desc, category in zip(train_list, categories) 
               for t in desc.description]

all_words = [w for words, c in train_words for w in words]
words_freq = nltk.FreqDist(all_words)

selected_words = [word for word, freq in words_freq.items() if freq>1]
print('Before:',len(words_freq), ', after:', len(selected_words))

train_words = [([w for w in words if w in selected_words], c) for words, c in train_words]

# Prepping up Test Data from description

categories = ['Male']
test_list=[test]

test_words = [(preprocess(t), category) 
               for desc, category in zip(test_list, categories) 
               for t in desc.description]

all_words = [w for words, c in test_words for w in words]
words_freq = nltk.FreqDist(all_words)

selected_words = [word for word, freq in words_freq.items() if freq>1]
print('Before:',len(words_freq), ', after:', len(selected_words))

test_words = [([w for w in words if w in selected_words], c) for words, c in test_words]

Before: 36128 , after: 9953
Before: 20142 , after: 5820


In [13]:
def extract_features(words):
    ''' simply using words counts'''
    return nltk.FreqDist(words)

features = [(extract_features(words), c) for words, c in train_words]

In [14]:
# Training a NaiveBayes Classifier

from sklearn.model_selection import KFold
import numpy as np
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(features):
    train1 = [features[i] for i in train_idx]
    test1 = [features[i] for i in test_idx]
    classifier = nltk.NaiveBayesClassifier.train(train1)   
    accu.append( nltk.classify.util.accuracy(classifier, test1) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

accuracy: 0.564
accuracy: 0.5576
accuracy: 0.5448
accuracy: 0.5528
accuracy: 0.5940752602081665
CV mean accuracy: 0.562655052042


In [15]:
# Training Classifier on full train data 

classifier = nltk.NaiveBayesClassifier.train(features)

train_feat = [extract_features(words) for words, c in train_words]
train_pred = [classifier.classify(row) for row in train_feat]

# Predicting gender on Test Data using NaiveBayes classifier

test_feat = [extract_features(words) for words, c in test_words]
nb_test_pred = [classifier.classify(row) for row in test_feat]

In [16]:
# Training a Maximum Entropy Classifier

k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(features):
    train1 = [features[i] for i in train_idx]
    test1 = [features[i] for i in test_idx]
    classifier = nltk.classify.MaxentClassifier.train(train1, trace=3, max_iter=1)       
    accu.append( nltk.classify.util.accuracy(classifier, test1) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.810
         Final          -0.40096        0.795
accuracy: 0.8064
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.815
         Final          -0.34748        0.809
accuracy: 0.796
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.814
         Final          -0.31994        0.810
accuracy: 0.8056
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.816
         Final          -0.38275        0.803
accuracy: 0.7872
  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accur

In [17]:
# Training Classifier on full train data 

classifier = nltk.classify.MaxentClassifier.train(features, trace=3, max_iter=1) 

train_feat = [extract_features(words) for words, c in train_words]
train_pred = [classifier.classify(row) for row in train_feat]

# Predicting gender on Test Data using NaiveBayes classifier

test_feat = [extract_features(words) for words, c in test_words]
me_test_pred = [classifier.classify(row) for row in test_feat]

  ==> Training (1 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
         Final          -0.36920        0.804


In [18]:
# Training a SVC Classifier

from nltk.classify import SklearnClassifier
from sklearn.svm import SVC
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
for train_idx, test_idx in k_fold.split(features):
    train1 = [features[i] for i in train_idx]
    test1 = [features[i] for i in test_idx]
    classifier = SklearnClassifier(SVC(kernel='linear', C=10, random_state=1), sparse=True).train(train1)       
    accu.append( nltk.classify.util.accuracy(classifier, test1) )
    print('accuracy:', accu[len(accu)-1])    
print('CV mean accuracy:', np.mean(accu)) 

accuracy: 0.7456
accuracy: 0.7304
accuracy: 0.74
accuracy: 0.7408
accuracy: 0.7373899119295436
CV mean accuracy: 0.738837982386


In [19]:
# Training Classifier on full train data 

classifier = SklearnClassifier(SVC(kernel='linear', C=10, random_state=1), sparse=True).train(features) 

train_feat = [extract_features(words) for words, c in train_words]
train_pred = [classifier.classify(row) for row in train_feat]

# Predicting gender on Test Data using NaiveBayes classifier

test_feat = [extract_features(words) for words, c in test_words]
SVC_test_pred = [classifier.classify(row) for row in test_feat]

In [20]:
# Printing the predictions on test

zz = pd.DataFrame({'username':test['username'], 'prediction_nb':nb_test_pred,'prediction_me':me_test_pred
                  ,'prediction_svc':SVC_test_pred})
zz.to_csv('pred_description.csv', index=False)
zz.head()

Unnamed: 0,prediction_me,prediction_nb,prediction_svc,username
0,Male,Female,Male,nazrulmadina
1,Male,Female,Male,SehrishWarraich
2,Male,Male,Male,samadhinie
3,Female,Male,Male,ebottabi
4,Male,Male,Female,mrjimoy


## 3. Predicting Gender with Username, Description, and Status
If you need to merge multiple dict-format features into one, check the following question: https://stackoverflow.com/questions/38987/how-to-merge-two-dictionaries-in-a-single-expression

In [44]:
# Parse Json format status as dictionary
from ast import literal_eval
import gensim
from gensim import corpora
train_status = train['status'].apply(literal_eval)
test_status = test['status'].apply(literal_eval)

In [29]:
def status_cleanse(status):
    deposit=[]
    email=[]
    facebook=[]
    identity=[]
    payment=[]
    phone=[]
    profile=[]
    for d1 in status:
        for key,value in d1.items():
            if key=='deposit_made': deposit.append(d1[key])
            if key=='email_verified': email.append(d1[key])
            if key=='facebook_connected': facebook.append(d1[key])
            if key=='identity_verified': identity.append(d1[key])
            if key=='payment_verified': payment.append(d1[key])
            if key=='phone_verified': phone.append(d1[key])
            if key=='profile_complete': profile.append(d1[key])
    deposit_made=np.where(deposit,1,0)
    email_verified=np.where(email,1,0) 
    facebook_connected=np.where(facebook,1,0) 
    identity_verified=np.where(identity,1,0) 
    payment_verified=np.where(payment,1,0) 
    phone_verified=np.where(phone,1,0) 
    profile_complete=np.where(profile,1,0)
    st=pd.DataFrame({'deposit_made':deposit_made,
                    'email_verified':email_verified,
                    'facebook_connected':facebook_connected,
                    'identity_verified':identity_verified,
                    'payment_verified':payment_verified,
                    'phone_verified':phone_verified,
                    'profile_complete':profile_complete})
    return(st)

In [30]:
train_status1=status_cleanse(train_status)
test_status1=status_cleanse(test_status)
train_status1.head()

Unnamed: 0,deposit_made,email_verified,facebook_connected,identity_verified,payment_verified,phone_verified,profile_complete
0,1,1,0,0,0,0,1
1,1,1,1,0,1,0,1
2,0,1,0,0,0,0,1
3,1,1,0,0,0,1,1
4,0,1,0,0,0,0,1


In [45]:
# Topic Modeling for the description term in test and train
features = [(extract_features(words), c) for words, c in train_words]

In [67]:
X, y = zip(*features)
dictionary = corpora.Dictionary(X)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in X]
ldamodel = gensim.models.ldamodel.LdaModel(doc_term_matrix, num_topics=3, id2word = dictionary, passes=20)
ldamodel.print_topics(num_topics=5, num_words=6)

[(0,
  '0.009*"design" + 0.008*"year" + 0.007*"graphic" + 0.006*"experi" + 0.006*"work" + 0.006*"busi"'),
 (1,
  '0.018*"work" + 0.009*"year" + 0.008*"experi" + 0.008*"write" + 0.008*"time" + 0.007*"project"'),
 (2,
  '0.024*"develop" + 0.018*"web" + 0.017*"experi" + 0.017*"year" + 0.013*"php" + 0.012*"design"')]

In [87]:
# Calculating Topic weights on Train

feat_train = [(extract_features(words)) for words, c in train_words]
dict_new = [dictionary.doc2bow(doc) for doc in feat_train]
lda_train=[ldamodel[d1] for d1 in dict_new]
# Calculating Topic Weights on Test

feat_test = [(extract_features(words)) for words, c in test_words]
dict_new = [dictionary.doc2bow(doc) for doc in feat_test]
lda_test=[ldamodel[d1] for d1 in dict_new]

6249
6249


In [97]:
# Extract Topic weights to a data frame
def twts(liist):
    topic1=[]
    topic2=[]
    topic3=[]
    for l1 in liist:
        topic1.append(0)
        topic2.append(0)
        topic3.append(0)
        for tu in l1:
            if tu[0]==0: topic1[len(topic1)-1]=tu[1]
            if tu[0]==1: topic2[len(topic2)-1]=tu[1]
            if tu[0]==2: topic3[len(topic3)-1]=tu[1]
    wl=pd.DataFrame({'topic1':topic1,
                    'topic2':topic2,
                    'topic3':topic3})
    return(wl)

In [99]:
wts_train=twts(lda_train)
wts_test=twts(lda_test)

In [104]:
# Combining Username, Description Topic and Status data for train and test
final_train = pd.concat([train_new, wts_train, train_status1], axis=1)
final_test = pd.concat([test_new, wts_test, test_status1], axis=1)

In [105]:
train_features=['len_id','has_digits','cnt_digits','end_digits','has_vowels','cnt_vowels',
                'end_vowels','start_vowels','girl_flag','boy_flag','cnt_capital','end_capital','has_capital',
               'topic1','topic2','topic3','deposit_made','email_verified','facebook_connected','identity_verified',
                'payment_verified','phone_verified','profile_complete']
label=['gender']

#***********************************K fold cross validation with different values of K as 5 *******#

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score 

kfold = KFold(n_splits= 5, random_state= 123)
result = cross_val_score(LogisticRegression(), final_train[train_features], final_train[label], cv=kfold, scoring='accuracy')

print(result)
print(result.mean())
print(np.median(result))

# The mean and median of accuracy on test is almost same indicating the model is not overfitting #

#*********************************** Fit a logistic model ****************#

logreg = LogisticRegression()
model = logreg.fit(final_train[train_features],final_train[label])

from sklearn.metrics import accuracy_score
train_pred=model.predict(final_train[train_features])
lm_accu=accuracy_score(final_train[label],train_pred)
print(lm_accu)

lm_test_pred=model.predict(final_test)

[ 0.82        0.808       0.8176      0.8184      0.81905524]
0.816611048839
0.8184
0.817090734518


In [106]:
train_features=['len_id','has_digits','cnt_digits','end_digits','has_vowels','cnt_vowels',
                'end_vowels','start_vowels','girl_flag','boy_flag','cnt_capital','end_capital','has_capital',
               'topic1','topic2','topic3','deposit_made','email_verified','facebook_connected','identity_verified',
                'payment_verified','phone_verified','profile_complete']
label=['gender']
#***********************************K fold cross validation with different values of K as 5 *******#

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score 

kfold = KFold(n_splits= 5, random_state= 123)
result = cross_val_score(RandomForestClassifier(), final_train[train_features], final_train[label], cv=kfold, scoring='accuracy')

print(result)
print(result.mean())
print(np.median(result))

# The mean and median of accuracy on test is almost same indicating the model is not overfitting #

#*********************************** Fit a Random Forest Model ****************#

rfmodel = RandomForestClassifier()
model = rfmodel.fit(final_train[train_features],final_train[label])

from sklearn.metrics import accuracy_score
train_pred=model.predict(final_train[train_features])
rf_accu=accuracy_score(final_train[label],train_pred)
print(rf_accu)

rf_test_pred=model.predict(final_test)

[ 0.7712      0.7648      0.7768      0.7912      0.78142514]
0.777085028022
0.7768
0.987838054089


In [107]:
train_features=['len_id','has_digits','cnt_digits','end_digits','has_vowels','cnt_vowels',
                'end_vowels','start_vowels','girl_flag','boy_flag','cnt_capital','end_capital','has_capital',
               'topic1','topic2','topic3','deposit_made','email_verified','facebook_connected','identity_verified',
                'payment_verified','phone_verified','profile_complete']
label=['gender']

#***********************************K fold cross validation with different values of K as 5 *******#

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score 

kfold = KFold(n_splits= 5, random_state= 123)
svm = SVC(kernel='linear', C=10)
result = cross_val_score(svm, final_train[train_features], final_train[label], cv=kfold, scoring='accuracy')

print(result)
print(result.mean())
print(np.median(result))

# The mean and median of accuracy on test is almost same indicating the model is not overfitting #

#*********************************** Fit a SVM Model ****************#

model = svm.fit(final_train[train_features],final_train[label])

from sklearn.metrics import accuracy_score
train_pred=model.predict(final_train[train_features])
svm_accu=accuracy_score(final_train[label],train_pred)
print(svm_accu)

svm_test_pred=model.predict(final_test)

[ 0.8192      0.808       0.8192      0.82        0.81905524]
0.817091048839
0.8192
0.817090734518


In [108]:
# Printing the predictions on test

zz = pd.DataFrame({'username':test['username'], 'prediction_lm':lm_test_pred,'prediction_rf':rf_test_pred
                  ,'prediction_svm':svm_test_pred})
zz.to_csv('pred_all_info.csv', index=False)
zz.head()

Unnamed: 0,prediction_lm,prediction_rf,prediction_svm,username
0,0,1,1,nazrulmadina
1,0,1,0,SehrishWarraich
2,0,1,1,samadhinie
3,0,0,1,ebottabi
4,0,1,1,mrjimoy


## Method:

The approach followed was as follows:

1) Predicting using Username: 
    - Created features such as length of username, has digits, has vowels, has capital on both train and test data
    - Using the features implemented logistic regression, RandomForest and SVM models with 5 Kfold cross validation
    - Compared the accuracy from the models and derived predictions from the three models on test

2) Predicting using Description: 
    - Created word count dicitionary after removing stopwaords and steming
    - Using the word freq implemented Naive Bayes, Max Entropy and SVM models with 5 Kfold cross validation
    - Compared the accuracy from the models and derived predictions from the three models on test

3) Predicting using Username, Description and Status: 
    - Created 3 Topics from description and calculated topic weights for train and test
    - Extracted binary status data from Json column
    - Combined Username features, Topic weights and Status features for train and test
    - Using all features implemented logistic regression, RandomForest and SVM models with 5 Kfold cross validation
    - Compared the accuracy from the models and derived predictions from the three models on test

### Extra Credit: Try Different Features and Models for Best Performance
Save your predictions as netid_1.csv, ..., netid_5.csv