#### Data 620 - Project 3 <br>July 10, 2019<br>Team 2: <ul><li>Anthony Munoz</li> <li>Katie Evers</li> <li>Juliann McEachern</li> <li>Mia Siracusa</li></ul>

<h1 align="center">Network Analysis: Text Mining </h1>

## Getting Started

Prompt: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 

#### Python Dependencies

In [1]:
# basic requirements 
import pandas as pd, numpy as np, random, nltk
from nltk.corpus import names # data source 

# sklearn packages
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

#### Upload and Label Data

In [2]:
# retrieve names from the nltk corpus
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])

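# randomly shuffle the names
# random.shuffle uses Python's `random` module, so that is the generator to seed;
# np.random.seed has no effect on it
random.seed(1)
random.shuffle(labeled_names)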

In [3]:
labeled_names[:10]

[('Gustav', 'male'),
 ('Quent', 'male'),
 ('Lou', 'female'),
 ('Klaus', 'male'),
 ('Gardener', 'male'),
 ('Chloette', 'female'),
 ('Jade', 'female'),
 ('Miran', 'female'),
 ('Trace', 'female'),
 ('Kenyon', 'male')]

#### Subset Corpus

We split the names corpus into three subsets:
1.  Test Set (500 names)
2.  Dev-test Set (500 names)
3.  Training Set (remaining names)

In [4]:
train_names = labeled_names[1000:]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[:500]

print('Testing names count:',len(test_names),'\nDevelopment names count:',len(devtest_names),'\nTraining names count:', len(train_names))

Testing names count: 500 
Development names count: 500 
Training names count: 6944
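
The split depends entirely on the shuffle order, so it can only be reproduced if the generator that performs the shuffle is the one being seeded. A minimal sketch of that idea, assuming a hypothetical helper `split_names` and a private `random.Random` instance (neither appears in the original notebook), run on toy data rather than the corpus:

```python
import random

def split_names(labeled, n_test=500, n_dev=500, seed=1):
    """Shuffle a copy of `labeled` and return (train, devtest, test) lists."""
    rng = random.Random(seed)      # private generator, unaffected by NumPy's seed
    shuffled = list(labeled)       # leave the caller's list untouched
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, devtest, test

# Toy stand-in for the corpus so the sketch runs without NLTK data
toy = [("name%d" % i, "female" if i % 3 else "male") for i in range(30)]
train, devtest, test = split_names(toy, n_test=5, n_dev=5, seed=1)
print(len(train), len(devtest), len(test))
```

Calling the helper twice with the same seed returns identical subsets, which makes accuracy figures comparable between runs.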


## Name-Gender Classifier

Task: Start with the example name gender classifier & make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

#### Example Name-Gender Classifier

Our first attempt follows the book's example: a single feature, the last letter of the name, is extracted for every name in the training, dev-test, and test sets.

In [5]:
# Create function
def gender_features(word):
    return {'last_letter': word[-1]}

# Apply function to train and test data
feature_test = [(gender_features(n), gender) for (n, gender) in test_names]
feature_devtest = [(gender_features(n), gender) for (n, gender) in devtest_names]
feature_train = [(gender_features(n), gender) for (n, gender) in train_names]

# Apply naive Bayes algorithm classifier
classifier = nltk.NaiveBayesClassifier.train(feature_train)
print('Train Accuracy', round(nltk.classify.accuracy(classifier, feature_train),3)) 
print('Example DevTest Accuracy:', round(nltk.classify.accuracy(classifier, feature_devtest),3)) 
print('Example Test Set Accuracy:', round(nltk.classify.accuracy(classifier, feature_test),3)) 

Train Accuracy 0.764
Example DevTest Accuracy: 0.752
Example Test Set Accuracy: 0.752
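
To put these accuracies in context, it helps to compare them with a classifier that ignores the name and always guesses the most frequent label. The sketch below assumes a hypothetical helper, `majority_baseline_accuracy`, and uses a four-name toy list; in the notebook it could be applied to `devtest_names` or `test_names`.

```python
from collections import Counter

def majority_baseline_accuracy(labeled):
    """Accuracy obtained by always guessing the most frequent label."""
    counts = Counter(label for _, label in labeled)
    _, top_count = counts.most_common(1)[0]
    return top_count / len(labeled)

# Toy data: 3 of the 4 labels are 'female'
toy = [("Ann", "female"), ("Bea", "female"), ("Cy", "male"), ("Dee", "female")]
print(majority_baseline_accuracy(toy))  # prints 0.75
```

Any feature set worth keeping should beat this number on the dev-test set.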


#### Incremental Improvements

We listed the classifier's most informative features, shown below, to see which last letters most strongly separate the two genders.

In [6]:
# show_most_informative_features() prints its table and returns None,
# so it is called directly rather than wrapped in print()
classifier.show_most_informative_features()

Most Informative Features
             last_letter = 'a'            female : male   =     38.7 : 1.0
             last_letter = 'k'              male : female =     30.7 : 1.0
             last_letter = 'f'              male : female =     14.5 : 1.0
             last_letter = 'm'              male : female =     11.7 : 1.0
             last_letter = 'd'              male : female =     10.5 : 1.0
             last_letter = 'p'              male : female =     10.5 : 1.0
             last_letter = 'v'              male : female =      9.1 : 1.0
             last_letter = 'o'              male : female =      8.5 : 1.0
             last_letter = 'z'              male : female =      7.1 : 1.0
             last_letter = 'r'              male : female =      6.4 : 1.0
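
Each ratio in the listing compares how likely a feature value is under one label versus the other. The sketch below illustrates that calculation on toy data with a simplified additive smoothing term; the helper `last_letter_ratio` and the smoothing choice are assumptions for illustration and will not reproduce NLTK's exact estimates.

```python
from collections import Counter

def last_letter_ratio(labeled, letter, smoothing=0.5):
    """Ratio P(last letter | female) : P(last letter | male), lightly smoothed."""
    hits = Counter()    # names per gender that end in `letter`
    totals = Counter()  # names per gender
    for name, gender in labeled:
        totals[gender] += 1
        if name[-1].lower() == letter:
            hits[gender] += 1
    p_female = (hits["female"] + smoothing) / (totals["female"] + smoothing)
    p_male = (hits["male"] + smoothing) / (totals["male"] + smoothing)
    return p_female / p_male

toy = [("Anna", "female"), ("Maria", "female"), ("Ruth", "female"),
       ("Mark", "male"), ("Luca", "male"), ("Erik", "male")]
ratio = last_letter_ratio(toy, "a")
print(round(ratio, 2))  # prints 1.67
```

A ratio well above 1 marks a letter that favors female names; a ratio well below 1 favors male names.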


We used these informative features to design a second feature function, aiming to improve the predictive accuracy of our model.

We used the following pattern combinations in order to improve the gender classification:

1. Second letter in the name.
2. First 3 letters in the name.
3. Up to 5 letters starting at the midpoint of the name.
4. Last letter of the name.
5. Second-to-last letter of the name.
6. First 2 letters with the last letter of the name.

In [7]:
# Improvements function

def gender_features_new(word):
    word = word.lower()
    mid = int(len(word)/2)
    return {'comb1': word[1],
            'comb2': word[:3],
            'comb3': word[mid:mid+5],
            'comb4': word[-1],
            'comb5': word[-2],
            'comb6': word[:2]+word[-1]}
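
Plain indexing such as `word[1]` raises an `IndexError` on a one-letter string, whereas slicing returns an empty string. The variant below is a sketch under that assumption; `gender_features_safe` is a hypothetical name, and whether the corpus actually contains names that short is not checked here.

```python
def gender_features_safe(word):
    """Same six features as gender_features_new, using slices instead of indexes."""
    word = word.lower()
    mid = len(word) // 2
    return {
        'comb1': word[1:2],            # second letter, '' if there is none
        'comb2': word[:3],             # first three letters
        'comb3': word[mid:mid + 5],    # up to five letters from the midpoint
        'comb4': word[-1:],            # last letter
        'comb5': word[-2:-1],          # second-to-last letter, '' if there is none
        'comb6': word[:2] + word[-1:], # first two letters plus the last letter
    }

print(gender_features_safe("Cathy"))
print(gender_features_safe("A"))
```

For names of two or more letters the output matches the indexed version, so the change should not affect the feature values used elsewhere.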

We first worked with the training and development datasets until we felt confident in our model. Then we applied the function to the test dataset.

In [8]:
# Apply function to train, devtest, and test datasets
train_set = [(gender_features_new(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_new(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features_new(n), gender) for (n, gender) in test_names]

To improve our model, we looped over the dev-test names, classified each one, and recorded the names whose predicted gender was wrong.

In [9]:
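errors = []
# Note: `classifier` is still the last-letter model trained in In [5]. It has
# never seen the comb* keys, so NLTK ignores them and falls back to the class
# prior, giving the same guess for every name. Retraining on `train_set`
# before this loop would produce an error list for the new features.
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features_new(name))
    if guess != tag:
        errors.append( (tag, guess, name) )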

We then reviewed the misclassified names for recurring patterns. This is what led us to try different letter combinations and keep those that improved our predictions.

In [10]:
for (tag_gender_name, guess_name, name) in sorted(errors):
        print({'correct': tag_gender_name,
               'guessing':guess_name,
               'name':name})
        
print('\n\nTotal Errors:', len(errors))

{'correct': 'male', 'guessing': 'female', 'name': 'Aamir'}
{'correct': 'male', 'guessing': 'female', 'name': 'Abbie'}
{'correct': 'male', 'guessing': 'female', 'name': 'Abby'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aditya'}
{'correct': 'male', 'guessing': 'female', 'name': 'Ahmet'}
{'correct': 'male', 'guessing': 'female', 'name': 'Alfie'}
{'correct': 'male', 'guessing': 'female', 'name': 'Alfred'}
{'correct': 'male', 'guessing': 'female', 'name': 'Ambrosi'}
{'correct': 'male', 'guessing': 'female', 'name': 'Andy'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aristotle'}
{'correct': 'male', 'guessing': 'female', 'name': 'Armando'}
{'correct': 'male', 'guessing': 'female', 'name': 'Armond'}
{'correct': 'male', 'guessing': 'female', 'name': 'Artur'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aubrey'}
{'correct': 'male', 'guessing': 'female', 'name': 'Austen'}
{'correct': 'male', 'guessing': 'female', 'name': 'Baron'}
...
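
A long error listing is easier to read once it is aggregated. The sketch below counts errors by direction and by name ending; `summarize_errors` is a hypothetical helper, and the sample rows are a few entries copied from the listing above.

```python
from collections import Counter

def summarize_errors(errors, suffix_len=2):
    """Count misclassifications by (correct, guess) pair and by name ending."""
    by_direction = Counter((tag, guess) for tag, guess, _ in errors)
    by_suffix = Counter(name[-suffix_len:].lower() for _, _, name in errors)
    return by_direction, by_suffix

# A few (correct, guess, name) rows taken from the output above
sample_errors = [
    ('male', 'female', 'Aamir'),
    ('male', 'female', 'Abbie'),
    ('male', 'female', 'Alfie'),
    ('male', 'female', 'Andy'),
]
by_direction, by_suffix = summarize_errors(sample_errors)
print(by_direction.most_common())
print(by_suffix.most_common(1))
```

If every error runs in the same direction, as in this sample, the classifier is probably defaulting to one label rather than making feature-driven mistakes.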

#### Final Performance

We trained the Naive Bayes classifier on our training dataset with the new features and measured the accuracy on the dev-test and test datasets again.

Our new feature combinations improved our prediction accuracy: dev-test accuracy rose from 0.752 to 0.816 and test accuracy from 0.752 to 0.79. The gap between training accuracy (0.899) and test accuracy also widened.

In [11]:
classifier = nltk.NaiveBayesClassifier.train(train_set) 

# Store results for analysis
NLTK_score_X = round(nltk.classify.accuracy(classifier, train_set),3)
NLTK_score_y_dev = round(nltk.classify.accuracy(classifier, devtest_set),3)
NLTK_score_y = round(nltk.classify.accuracy(classifier, test_set),3)

NLTK_Results = pd.DataFrame([[NLTK_score_X,NLTK_score_y]]).rename(index={0:'NLTK_NBC'})

print('Train Accuracy', NLTK_score_X) 
print('DevTest Accuracy', NLTK_score_y_dev) 
print('TestSet Accuracy:', NLTK_score_y) 

Train Accuracy 0.899
DevTest Accuracy 0.816
TestSet Accuracy: 0.79


## Sklearn Approach

After working with the `nltk` package, we tried a few approaches using the `sklearn` package to see whether other types of models could improve our accuracy.

#### Prepare Dataset

To prepare our data, we adapted the feature function from the nltk model with two changes: the middle slice is shortened to three letters, and the last combination uses only the first letter together with the last letter. We loaded the train and test names into DataFrames for easier access and vectorized the feature function so it can be applied to a whole column at once, returning an array of feature dictionaries.

We applied the function to our train and test splits, using the resulting feature dictionaries as X and the gender labels as y.

In [12]:
# turn test/train into dataframe to ease accessing
test_df = pd.DataFrame(test_names)
train_df = pd.DataFrame(train_names)

# Define feature function
def gender_features_new_2(word):
    word = word.lower()
    mid = int(len(word)/2)
    return {'comb1': word[1],
            'comb2': word[:3],
            'comb3': word[mid:mid+3],
            'comb4': word[-1],
            'comb5': word[-2],
            'comb6': word[:1]+word[-1]}

# Vectorize function 
func_gender = np.vectorize(gender_features_new_2)

# Apply function to dataframes
X_train, y_train = func_gender(train_df[0]), train_df[1]
X_test, y_test = func_gender(test_df[0]), test_df[1]

vectorizer = DictVectorizer()

# Fit the DictVectorizer on the training feature dictionaries
vect = vectorizer.fit(X_train)

This is what the names look like after the vectorized function returns the array of feature dictionaries.

In [13]:
print(func_gender(['Cathy',"Mark"]))

[{'comb1': 'a', 'comb2': 'cat', 'comb3': 'thy', 'comb4': 'y', 'comb5': 'h', 'comb6': 'cy'}
 {'comb1': 'a', 'comb2': 'mar', 'comb3': 'rk', 'comb4': 'k', 'comb5': 'r', 'comb6': 'mk'}]
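
`DictVectorizer` turns string-valued features like these into one indicator column per feature-value pair. The pure-Python sketch below illustrates the idea only; `one_hot_rows` is a hypothetical helper, it returns dense lists rather than the sparse matrix scikit-learn produces, and it is not scikit-learn's implementation.

```python
def one_hot_rows(feature_dicts):
    """Return (column_names, rows) with one 0/1 column per 'key=value' pair."""
    columns = sorted({f"{k}={v}" for d in feature_dicts for k, v in d.items()})
    index = {col: i for i, col in enumerate(columns)}
    rows = []
    for d in feature_dicts:
        row = [0] * len(columns)
        for k, v in d.items():
            row[index[f"{k}={v}"]] = 1
        rows.append(row)
    return columns, rows

columns, rows = one_hot_rows([
    {'comb1': 'a', 'comb4': 'y'},   # two of the features for 'Cathy'
    {'comb1': 'a', 'comb4': 'k'},   # two of the features for 'Mark'
])
print(columns)  # ['comb1=a', 'comb4=k', 'comb4=y']
print(rows)     # [[1, 0, 1], [1, 1, 0]]
```

Because every distinct value gets its own column, features such as `comb2` and `comb3` can add thousands of sparse columns, which gives flexible models more room to memorize the training names.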


#### Multinomial Model

We first tried a Naive Bayes classifier again, this time using the sklearn implementation. Our results were comparable to our final nltk attempt.

In [14]:
# Fit Naive Bayes classifier 
clf = MultinomialNB()
clf.fit(vect.transform(X_train),y_train)

# Store results for analysis
NBC_test = round(clf.score(vect.transform(X_test), y_test),4)
NBC_train = round(clf.score(vect.transform(X_train), y_train),4)

# Keep the same column order as the other result frames: [train, test]
NBC_Results = pd.DataFrame([[NBC_train,NBC_test]]).rename(index={0:'SK_NBC'})

# View results
print("Train Accuracy: " + str(NBC_train))
print("Test Accuracy: " + str(NBC_test))

Train Accuracy: 0.8659
Test Accuracy: 0.8


#### Linear Model

We next fitted a linear model trained with Stochastic Gradient Descent (SGD). This raised training accuracy to 0.929, while test accuracy was essentially unchanged (0.804 versus 0.80 for multinomial Naive Bayes).

In [15]:
# Fit SGD classifier 
SGD = SGDClassifier(max_iter=1000, tol=0.001,random_state=1)
SGD.fit(vect.transform(X_train),y_train)

# Store results for analysis
SGD_test= round(SGD.score(vect.transform(X_test), y_test),4)
SGD_train = round(SGD.score(vect.transform(X_train), y_train),4)

SGD_Results = pd.DataFrame([[SGD_train,SGD_test]]).rename(index={0:'SGD'})

# View results
print("Train Accuracy: " + str(SGD_train))
print("Test Accuracy: " + str(SGD_test))

Train Accuracy: 0.9291
Test Accuracy: 0.804


#### Random Forest Model

In our final attempt, we used the `RandomForestClassifier` to improve our accuracy and avoid the overfitting we observed with the linear approach. However, training accuracy increased again, to 0.951, while test accuracy fell to 0.782.

In [16]:
# Fit RFC classifier 
RFC = RandomForestClassifier(n_estimators=10,random_state=1)
RFC.fit(vect.transform(X_train),y_train)

# Store results for analysis
RFC_test = round(RFC.score(vect.transform(X_test), y_test),4)
RFC_train = round(RFC.score(vect.transform(X_train), y_train),4)

RFC_Results = pd.DataFrame([[RFC_train,RFC_test]]).rename(index={0:'RFC'})

# View results
print("Train Accuracy: " + str(RFC_train))
print("Test Accuracy: " + str(RFC_test))

Train Accuracy: 0.9512
Test Accuracy: 0.782
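
A single 500-name test split gives a fairly noisy estimate, so k-fold cross-validation on the training names is one way to compare models more steadily. The sketch below assumes a simplified feature function (`last_letters`, hypothetical) and a 12-name toy list; in the notebook, `train_names` and `gender_features_new_2` would take their place.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def last_letters(name):
    """Simplified, illustrative feature function (not the notebook's)."""
    name = name.lower()
    return {'last': name[-1:], 'last2': name[-2:]}

# Toy data only; the notebook would use train_names here
toy = [('Anna', 'female'), ('Maria', 'female'), ('Julia', 'female'),
       ('Sofia', 'female'), ('Emma', 'female'), ('Olivia', 'female'),
       ('Mark', 'male'), ('Erik', 'male'), ('Frank', 'male'),
       ('Jack', 'male'), ('Derek', 'male'), ('Patrick', 'male')]
X = [last_letters(name) for name, _ in toy]
y = [gender for _, gender in toy]

# The pipeline refits the vectorizer inside each fold, so held-out names
# never influence the feature vocabulary
model = make_pipeline(DictVectorizer(), MultinomialNB())
scores = cross_val_score(model, X, y, cv=3)
print(scores.mean())
```

Swapping `MultinomialNB()` for `SGDClassifier` or `RandomForestClassifier` in the pipeline would give per-fold scores for the other models on identical splits.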


## Analysis

Text mining can be challenging, especially at the beginning of the process, because you don't know what you will find in the data or how difficult it will be to work with. One of our initial challenges was identifying the right features to build a good model for predicting the gender of the names in our corpus. While we had a clear objective, finding these features can be nuanced and tricky. Thankfully, the text provided good examples of what features to look for and how to identify them on our own. In our first model, we followed the book's steps to measure our progress while attempting to improve prediction accuracy on our training, development, and testing datasets.

Using the NLTK package, we tested many different features against the dev-test set. We combined the most useful ones into a single feature function, extending the book's example, to obtain the best predictions we could from these features.

While we were satisfied with our results, we were interested to see whether other machine-learning packages could improve upon our NLTK models. We used the scikit-learn package to implement different algorithms and compared their accuracy on the training and test data. Our methods included multinomial Naive Bayes, Stochastic Gradient Descent (SGD), and random forest classifiers.

After comparing all of the models, we found that our incremental improvements were overfitting the training data. This was most apparent with our sklearn approaches, where training accuracy improved while test accuracy stayed flat or declined. Test accuracy for all four models fell between roughly 0.78 and 0.80. Interestingly, random forests are meant to help alleviate overfitting, but this method overfitted our data the most. Our Naive Bayes methods provided our most consistent results: the NLTK model performed slightly better on the training data, while the sklearn model did slightly better on the test data.

In [17]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the same rows
pd.concat([NLTK_Results, NBC_Results, SGD_Results, RFC_Results]).rename(columns={0:"Train", 1:"Test"})

| Model | Train | Test |
|---|---|---|
| NLTK_NBC | 0.899 | 0.79 |
| SK_NBC | 0.8659 | 0.8 |
| SGD | 0.9291 | 0.804 |
| RFC | 0.9512 | 0.782 |
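
Because each test score comes from only 500 names, small differences between models may be within sampling noise. The sketch below applies a normal-approximation interval to the test accuracies reported above; `accuracy_interval` is a hypothetical helper, and the approximation is a rough guide rather than an exact test.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation interval for an accuracy measured on n examples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)  # standard error of a proportion
    return accuracy - z * se, accuracy + z * se

# Test accuracies from the results above, each measured on 500 names
for label, acc in [('NLTK_NBC', 0.79), ('SK_NBC', 0.8), ('SGD', 0.804), ('RFC', 0.782)]:
    low, high = accuracy_interval(acc, n=500)
    print(f"{label}: {low:.3f} to {high:.3f}")
```

Each interval spans roughly 0.035 on either side of the measured accuracy, so the four intervals overlap heavily and the test-set ranking of the models should be read with caution.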


In [1]:
%%html

<div style="position: relative; padding-bottom: 56.25%; height: 0;"><iframe src="https://www.loom.com/embed/0ddad4e67dae4dd9b5534d8ed7f77899" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

### References 

1. https://www.nltk.org/book/ch06.html#fig-supervised-classification
2. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
3. https://nlpforhackers.io/introduction-machine-learning/