# Assignment 5. Machine Learning and Natural Language Processing

OPIM 5894 Data Science with Python

Name:   NetID:

Discussed with: if any

## Instructions
In this assignment, you are asked to predict genders of users using their public information on websites. In question 1, you are asked to predict gender using only usename. In question 2, you are asked to predict gender using the profile description of a user instead. Finally, you may combine all available information of users to make predictions. You may explore different models and different combination of features, as well as different ways to transform features, to achieve best performance. 
<br> <br>
- It is recommended to use NLTK for this classification task, as the features stored in dictionary style can be easily extended. While scikit-learn is easier for Q2, it might not be that straightforward to combine different features in Q3. In addition, dealing with categorical variables can be a pain in scikit-learn. If you plan to use scikit-learn anyway, please read the following post: http://pbpython.com/categorical-encoding.html
- While protyping, it is easier to stick to the Naive Bayes Classifier. Adding other classifiers once your code is bug-free.
- Use cross validation on the training set to avoid over-fitting, though it is not guaranteed achieve that purpose.


<br>
This assignment involves the following challenges:
- Construct features from strings (i.e., usernames)
- Frequent use of zip() and zip(*) (see doc https://docs.python.org/3/library/functions.html)
- Parsing a json style column into multiple columns
- Merging different features into one feature set
- Find appropriate models and features to improve prediction accuracy
- Writing and debugging a lot of code
<br><br>

What to submit?
- The predictions of 5 models on the test set (see a sample submission sample_submission.csv). Diverify your portfolio, as similar models may suffer from similar problems.
- The notebook file (** please make sure that your code are sufficiently commented**)
- In the end of the notebook file, briefly describe what you have done, which models work the best, and what findings you have.
<br><br>

The top 50% submissions will get 0-3 extra points. Try at least 3 models for each question. Try as many as you want for extra credit.
<br><br>
** Please do NOT distribute the dataset used in this assignment!**


In [3]:
import pandas as pd
import os
os.chdir('D:/Dropbox/Teaching/Data Science using Python/Notebooks/Assignment5')

In [4]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

## 1. Predicting Gender with Username
Some potential features of usernames: whether it has capital letters, whether it has digits, number of characters, number of vowels, first and last letters, etc. See http://www.nltk.org/book/ch06.html for some related code.

In [5]:
import nltk
import numpy as np
from nltk.classify import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.model_selection import KFold

In [6]:
def extract_features(name):
    name = name.lower()
    features = {
        'last': name[-1],
        'last_two': name[-2:],
        'last_three': name[-3:],
        'first': name[0],
        'first2': name[:1],
        'first3': name[:2],
        'nchar': len(name),
        'vowels.pct': sum(c in 'aoeiu' for c in name)/len(name),
        'digits.pct': sum(c.isdigit() for c in name)/len(name),
        'endwd': name[-1].isdigit(),
    }
    # the features below do not seem to be useful
    #for letter in 'abcdefghijklmnopqrstuvwxyz':
        #features["count({})".format(letter)] = name.lower().count(letter)
    return features

In [7]:
feat_uname = [extract_features(row['username']) for idx, row in train.iterrows()]
feat_uname_test = [extract_features(row['username']) for idx, row in test.iterrows()]

In [8]:
method = ['SVM','NB','ME'][1]
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
Xy = list(zip(feat_uname, train['gender']))
for train_idx, test_idx in k_fold.split(Xy):
    train_obs = [Xy[i] for i in train_idx]
    test_obs = [Xy[i] for i in test_idx]
    if method == 'SVM':
        classifier = SklearnClassifier(SVC(kernel='linear', C=1, random_state=1), sparse=True).train(train_obs)
    if method == 'NB':
        classifier = nltk.NaiveBayesClassifier.train(train_obs)
    if method == 'ME':
        classifier = nltk.classify.MaxentClassifier.train(train_obs, trace=3, max_iter=30)    
    accu.append( nltk.classify.util.accuracy(classifier, test_obs) )
    print('accuracy:', accu[len(accu)-1])    
# select the best model based on CV performance shown below
print('Final accuracy:', np.mean(accu))    

accuracy: 0.7456
accuracy: 0.7416
accuracy: 0.7488
accuracy: 0.7344
accuracy: 0.7365892714171337
Final accuracy: 0.741397854283


In [9]:
clf_uname = nltk.classify.MaxentClassifier.train(Xy, trace=3, max_iter=30)
pred_uname = [clf_uname.classify(row) for row in feat_uname_test]

  ==> Training (30 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.813
             2          -0.34948        0.813
             3          -0.33268        0.813
             4          -0.31775        0.815
             5          -0.30471        0.819
             6          -0.29333        0.828
             7          -0.28333        0.837
             8          -0.27446        0.850
             9          -0.26654        0.856
            10          -0.25940        0.863
            11          -0.25292        0.870
            12          -0.24702        0.877
            13          -0.24160        0.884
            14          -0.23661        0.888
            15          -0.23199        0.893
            16          -0.22771        0.897
            17          -0.22371        0.902
            18          -0.21998        0.905
            19          -0.21649        0.909
  

In [10]:
# support your predictions are stored in a list named pred_uname
zz = pd.DataFrame({'username':test['username'], 'prediction':pred_uname})
zz.to_csv('pred_uname.csv', index=False)

## 2. Predicting Gender with Description
The updated notebook for lecture 11 might be of some help, which now includes demo code for making predictions with NLTK classifier.

In [11]:
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import string
ps = PorterStemmer()
from nltk.tokenize import word_tokenize
def preprocess(text):
    return [ps.stem(w) for w in word_tokenize(text.lower()) 
             if w not in string.punctuation and w not in stopwords.words('english')] 

In [12]:
desc_words = [preprocess(desc) for desc in train['description']]
desc_words_test = [preprocess(desc) for desc in test['description']]

In [13]:
def filter_words(words_2d_list, thd=1):
    words = [word for desc in desc_words for word in desc]
    words_freq = nltk.FreqDist(words)
    selected_words = {word for word, freq in words_freq.items() if freq>1}
    print('Before:',len(words_freq), ', after:', len(selected_words))
    return selected_words

In [14]:
selected_words = filter_words(desc_words)

Before: 36128 , after: 9953


In [15]:
def extract_features(words, selected_words):
    ''' simply using words counts'''
    return nltk.FreqDist([w for w in words if w in selected_words])

In [16]:
feat_desc = [extract_features(desc, selected_words) for desc in desc_words]
feat_desc_test = [extract_features(desc, selected_words) for desc in desc_words_test]

In [17]:
method = ['SVM','NB','ME'][1]
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
Xy = list(zip(feat_desc, train['gender']))
for train_idx, test_idx in k_fold.split(Xy):
    train_obs = [Xy[i] for i in train_idx]
    test_obs = [Xy[i] for i in test_idx]
    if method == 'SVM':
        classifier = SklearnClassifier(SVC(kernel='linear', C=1, random_state=1), sparse=True).train(train_obs)
    if method == 'NB':
        classifier = nltk.NaiveBayesClassifier.train(train_obs)
    if method == 'ME':
        classifier = nltk.classify.MaxentClassifier.train(train_obs, trace=3, max_iter=30)    
    accu.append( nltk.classify.util.accuracy(classifier, test_obs) )
    print('accuracy:', accu[len(accu)-1])    
# select the best model based on CV performance shown below
print('Final accuracy:', np.mean(accu))    

accuracy: 0.564
accuracy: 0.5544
accuracy: 0.5696
accuracy: 0.5896
accuracy: 0.5468374699759808
Final accuracy: 0.564887493995


## 3. Predicting Gender with Username, Description, and Status

In [18]:
# Parse Json format status as dictionary
from ast import literal_eval
status = train['status'].apply(literal_eval)

In [19]:
# Find a way to expand/split the status column as multiple columns
status_ext = status.apply(pd.Series)

In [20]:
feat_status = status_ext.to_dict('records')

In [21]:
status_test_ext = test['status'].apply(literal_eval).apply(pd.Series)
feat_status_test = status_test_ext.to_dict('records')

In [22]:
feat_all = [{**a, **b, **c} for a,b,c in zip(feat_uname, feat_desc, feat_status)]
feat_all_test = [{**a, **b, **c} for a,b,c in zip(feat_uname_test, feat_desc_test, feat_status_test)]

In [23]:
method = ['SVM','NB','ME'][1]
k_fold = KFold(n_splits=5, shuffle=True)
accu = []
Xy = list(zip(feat_all, train['gender']))
for train_idx, test_idx in k_fold.split(Xy):
    train_obs = [Xy[i] for i in train_idx]
    test_obs = [Xy[i] for i in test_idx]
    if method == 'SVM':
        classifier = SklearnClassifier(SVC(kernel='linear', C=1, random_state=1), sparse=True).train(train_obs)
    if method == 'NB':
        classifier = nltk.NaiveBayesClassifier.train(train_obs)
    if method == 'ME':
        classifier = nltk.classify.MaxentClassifier.train(train_obs, trace=3, max_iter=30)    
    accu.append( nltk.classify.util.accuracy(classifier, test_obs) )
    print('accuracy:', accu[len(accu)-1])    
# select the best model based on CV performance shown below
print('Final accuracy:', np.mean(accu))   

accuracy: 0.5984
accuracy: 0.5304
accuracy: 0.6128
accuracy: 0.5704
accuracy: 0.6020816653322658
Final accuracy: 0.582816333066


## 4. Try Different Features and Models for Best Performance

In [24]:
clf = nltk.classify.NaiveBayesClassifier.train(Xy)
pred = [clf.classify(row) for row in feat_all_test]

In [26]:
zz = pd.DataFrame({'username':test['username'], 'prediction':pred_uname})
zz.to_csv('pred_all.csv', index=False)