#### Data 620 - Project 3 <br>July 10, 2019<br>Team 2: <ul><li>Anthony Munoz</li> <li>Katie Evers</li> <li>Juliann McEachern</li> <li>Mia Siracusa</li></ul>

<h1 align="center">Network Analysis: Text Mining </h1>

## Getting Started

Prompt: Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. 

#### Python Dependencies

In [1]:
# basic requirements 
import pandas as pd, numpy as np, random, nltk
from nltk.corpus import names # data source 

# sklearn packages
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier

#### Upload and Label Data

In [2]:
# retrieve names from the nltk corpus
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
[(name, 'female') for name in names.words('female.txt')])

# randomly shuffle the names
# (random.shuffle uses Python's random module, so that is the generator to seed)
random.seed(1)
random.shuffle(labeled_names)

In [3]:
labeled_names[:10]

[('Jewelle', 'female'),
 ('Cletus', 'male'),
 ('Orsola', 'female'),
 ('Harrison', 'male'),
 ('Hersch', 'male'),
 ('Marin', 'female'),
 ('Hewe', 'male'),
 ('Prince', 'male'),
 ('Melba', 'female'),
 ('Elliott', 'male')]
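
Because `random.shuffle` draws from Python's standard-library generator, it is `random.seed` (or a dedicated `random.Random` instance), not `np.random.seed`, that makes the shuffle repeatable. A minimal sketch, using a hypothetical four-name list and a hypothetical helper `seeded_shuffle` that is not part of the notebook:

```python
import random

# Hypothetical stand-in for the labeled names list
sample = [('Ann', 'female'), ('Bob', 'male'), ('Cleo', 'female'), ('Dan', 'male')]

def seeded_shuffle(items, seed):
    """Return a shuffled copy of `items` that is reproducible for a given seed."""
    rng = random.Random(seed)   # independent generator; global state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled

first = seeded_shuffle(sample, seed=1)
second = seeded_shuffle(sample, seed=1)
print(first == second)  # same seed, same order
```

Using a separate `random.Random` instance is one possible design choice; it keeps the shuffle reproducible even if other code also consumes random numbers.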

#### Subset Corpus

We split the names corpus into three subsets:
1.  Test Set (500 names)
2.  Dev-test (500 names)
3.  Training Set (remaining names)

In [4]:
train_names = labeled_names[1000:]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[:500]

print('Testing names count:',len(test_names),'\nDevelopment names count:',len(devtest_names),'\nTraining names count:', len(train_names))

Testing names count: 500 
Development names count: 500 
Training names count: 6944
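
The same three-way split can be written as a small reusable function. This is only a sketch: the name `three_way_split` and its default sizes are assumptions for illustration, and the input is a dummy list with the same length as the corpus counts reported above.

```python
def three_way_split(items, n_test=500, n_devtest=500):
    """Slice an already shuffled list into training, dev-test, and test subsets."""
    test = items[:n_test]
    devtest = items[n_test:n_test + n_devtest]
    train = items[n_test + n_devtest:]
    return train, devtest, test

# Dummy data with the same total length as the counts reported above (7,944)
dummy = list(range(7944))
train, devtest, test = three_way_split(dummy)
print(len(train), len(devtest), len(test))  # 6944 500 500
```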


## Name-Gender Classifier

Task: Start with the example name gender classifier & make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. 

#### Example Name-Gender Classifier

We used the book example for our first attempt. The function shows us how to create basic features to classify our train and test sets.

In [5]:
# Create function
def gender_features(word):
    return {'last_letter': word[-1]}

# Apply function to train and test data
feature_test = [(gender_features(n), gender) for (n, gender) in test_names]
feature_devtest = [(gender_features(n), gender) for (n, gender) in devtest_names]
feature_train = [(gender_features(n), gender) for (n, gender) in train_names]

# Apply naive Bayes algorithm classifier
classifier = nltk.NaiveBayesClassifier.train(feature_train)
print('Train Accuracy', round(nltk.classify.accuracy(classifier, feature_train),3)) 
print('Example DevTest Accuracy:', round(nltk.classify.accuracy(classifier, feature_devtest),3)) 
print('Example Test Set Accuracy:', round(nltk.classify.accuracy(classifier, feature_test),3)) 

Train Accuracy 0.766
Example DevTest Accuracy: 0.73
Example Test Set Accuracy: 0.758


#### Incremental Improvements

We then inspected the classifier's most informative features, shown below, to see which patterns it relies on.

In [6]:
# show_most_informative_features() prints its table directly and returns None
classifier.show_most_informative_features()

Most Informative Features
             last_letter = 'a'            female : male   =     37.7 : 1.0
             last_letter = 'k'              male : female =     28.3 : 1.0
             last_letter = 'f'              male : female =     22.0 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0
             last_letter = 'd'              male : female =     10.9 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0
             last_letter = 'o'              male : female =      8.8 : 1.0
             last_letter = 'm'              male : female =      8.1 : 1.0
             last_letter = 'w'              male : female =      8.0 : 1.0
             last_letter = 'r'              male : female =      6.6 : 1.0
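
Each row in this table is a likelihood ratio: how much more often a feature value occurs under one label than under the other. The sketch below computes a raw, unsmoothed version of that ratio on hypothetical toy data. NLTK smooths its probability estimates, so its reported figures will not match a plain count ratio, and the helper name `likelihood_ratio` is an assumption for illustration.

```python
from collections import Counter

# Hypothetical toy data; not drawn from the names corpus
toy_names = [('Anna', 'female'), ('Maria', 'female'), ('Ella', 'female'),
             ('Joan', 'female'), ('Luca', 'male'), ('Mark', 'male'),
             ('Frank', 'male'), ('Peter', 'male')]

def likelihood_ratio(data, letter, label_a, label_b):
    """Ratio of P(last letter | label_a) to P(last letter | label_b), unsmoothed.

    Raises ZeroDivisionError if the letter never occurs under label_b.
    """
    label_totals = Counter(label for _, label in data)
    hits = Counter(label for name, label in data if name[-1].lower() == letter)
    p_a = hits[label_a] / label_totals[label_a]
    p_b = hits[label_b] / label_totals[label_b]
    return p_a / p_b

ratio_a = likelihood_ratio(toy_names, 'a', 'female', 'male')
print(round(ratio_a, 1))  # 3.0 : three of four female names end in 'a', one of four male names
```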


We used these informative features to design a second feature function, aiming to improve the predictive accuracy of our model.

We used the following pattern combinations in order to improve the gender classification:

1. Second letter in the name.
2. First 3 letters in the name.
3. Up to 5 letters starting at the middle of the name.
4. Last letter of the name.
5. Second-to-last letter of the name.
6. First 2 letters with the last letter of the name.

In [7]:
# Improvements function

def gender_features_new(word):
    word = word.lower()
    mid = int(len(word)/2)
    return {'comb1': word[1],
            'comb2': word[:3],
            'comb3': word[mid:mid+5],
            'comb4': word[-1],
            'comb5': word[-2],
            'comb6': word[:2]+word[-1]}
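
To make the slices concrete, the sketch below applies the same feature function to one name from the corpus sample shown earlier. The function is repeated only so the snippet runs on its own, and `//` is used as the integer-division equivalent of `int(len(word)/2)`.

```python
def gender_features_new(word):
    """Letter-pattern features, mirroring the function defined in the cell above."""
    word = word.lower()
    mid = len(word) // 2
    return {'comb1': word[1],             # second letter
            'comb2': word[:3],            # first three letters
            'comb3': word[mid:mid+5],     # up to five letters from the middle
            'comb4': word[-1],            # last letter
            'comb5': word[-2],            # second-to-last letter
            'comb6': word[:2]+word[-1]}   # first two letters plus last letter

features = gender_features_new('Harrison')
for key, value in features.items():
    print(key, '->', value)
```

For `'Harrison'` the middle index is 4, so `comb3` is `'ison'`: the slice simply stops at the end of the name when fewer than five letters remain.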

We first worked with the training and development datasets until we felt confident in our model. Then we applied the function to the test dataset.

In [8]:
# Apply function to train, devtest, and test datasets
train_set = [(gender_features_new(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_features_new(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_features_new(n), gender) for (n, gender) in test_names]

To improve our model, we used the following loop to iterate through the dev-test names, classify each one, and record the predictions that were wrong.

In [9]:
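errors = []
for (name, tag) in devtest_names:
    # Note: `classifier` is still the last-letter model trained in In [5].
    # It has never seen the comb* feature names, so NLTK discards them
    # and the guess falls back to the class prior.
    guess = classifier.classify(gender_features_new(name))
    if guess != tag:
        errors.append( (tag, guess, name) )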

Here we look for patterns in the misclassified names. This is why we started building letter combinations and kept those that improved our predictions.

In [10]:
for (tag_gender_name, guess_name, name) in sorted(errors):
        print({'correct': tag_gender_name,
               'guessing':guess_name,
               'name':name})
        
print('\n\nTotal Errors:', len(errors))

{'correct': 'male', 'guessing': 'female', 'name': 'Abdul'}
{'correct': 'male', 'guessing': 'female', 'name': 'Ace'}
{'correct': 'male', 'guessing': 'female', 'name': 'Adger'}
{'correct': 'male', 'guessing': 'female', 'name': 'Adolf'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aguste'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aldis'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aldus'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aldwin'}
{'correct': 'male', 'guessing': 'female', 'name': 'Andreas'}
{'correct': 'male', 'guessing': 'female', 'name': 'Antoine'}
{'correct': 'male', 'guessing': 'female', 'name': 'Antone'}
{'correct': 'male', 'guessing': 'female', 'name': 'Armond'}
{'correct': 'male', 'guessing': 'female', 'name': 'Ashton'}
{'correct': 'male', 'guessing': 'female', 'name': 'Aub'}
{'correct': 'male', 'guessing': 'female', 'name': 'Ave'}
{'correct': 'male', 'guessing': 'female', 'name': 'Bartlet'}
{'correct': 'male', 'guessing': 'female', 'name': '
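
Reading a long error list by eye is slow, so one option is to tally the errors by a candidate pattern such as the last letter. The sketch below assumes triples in the same `(correct, guess, name)` shape as `errors`. The sample list is hypothetical (the last entry in particular is invented for illustration), and `tally_errors` is not part of the notebook.

```python
from collections import Counter

# Hypothetical (correct, guess, name) triples in the same shape as `errors`
sample_errors = [('male', 'female', 'Abdul'), ('male', 'female', 'Ace'),
                 ('male', 'female', 'Antoine'), ('male', 'female', 'Antone'),
                 ('female', 'male', 'Marin')]

def tally_errors(errors, n_last=1):
    """Count errors by (correct label, last n letters), most frequent first."""
    counts = Counter((tag, name[-n_last:].lower()) for tag, _guess, name in errors)
    return counts.most_common()

tally = tally_errors(sample_errors)
for (tag, suffix), count in tally:
    print(f'{tag} names ending in {suffix!r}: {count} error(s)')
```

A suffix that accounts for many errors is a candidate for a new feature; raising `n_last` to 2 or 3 checks longer endings in the same way.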

#### Final Performance

We trained the Naive Bayes classifier on our training set with the new features and measured accuracy on the dev-test and test sets again. We were satisfied with the accuracy improvements in this final attempt.

Our new classification patterns helped improve our prediction accuracy. In both cases our results were above 80%.

In [11]:
classifier = nltk.NaiveBayesClassifier.train(train_set) 

# Store results for analysis
NLTK_score_X = round(nltk.classify.accuracy(classifier, train_set),3)
NLTK_score_y_dev = round(nltk.classify.accuracy(classifier, devtest_set),3)
NLTK_score_y = round(nltk.classify.accuracy(classifier, test_set),3)

NLTK_Results = pd.DataFrame([[NLTK_score_X,NLTK_score_y]]).rename(index={0:'NLTK_NBC'})

print('Train Accuracy', NLTK_score_X) 
print('DevTest Accuracy', NLTK_score_y_dev) 
print('TestSet Accuracy:', NLTK_score_y) 

Train Accuracy 0.896
DevTest Accuracy 0.804
TestSet Accuracy: 0.84


## Sklearn Approach

After working with the `nltk` package, we also tried a few approaches from the `sklearn` package to see whether other types of machine learning models could improve our accuracy.

#### Prepare Dataset

To prepare our model, we created a new gender feature function based on the previous one used for the nltk model, with a few changes to the features. We loaded the name splits into dataframes for easy access and applied the feature function to obtain an array of feature dictionaries.

We called the function on our dataset splits to retrieve the name features and gender labels, and set these as X and y for training and testing.

In [12]:
# turn test/train into dataframe to ease accessing
test_df = pd.DataFrame(test_names)
train_df = pd.DataFrame(train_names)

# Define feature function
def gender_features_new_2(word):
    word = word.lower()
    mid = int(len(word)/2)
    return {'comb1': word[1],
            'comb2': word[:3],
            'comb3': word[mid:mid+3],
            'comb4': word[-1],
            'comb5': word[-2],
            'comb6': word[:1]+word[-1]}

# Vectorize function 
func_gender = np.vectorize(gender_features_new_2)

# Apply function to dataframes
X_train, y_train = func_gender(train_df[0]), train_df[1]
X_test, y_test = func_gender(test_df[0]), test_df[1]

vectorizer = DictVectorizer()

# fit the DictVectorizer on the training features
vect = vectorizer.fit(X_train)

This is how the names look after we call the vectorized feature function, which returns an array of feature dictionaries.

In [13]:
print(func_gender(['Cathy',"Mark"]))

[{'comb1': 'a', 'comb2': 'cat', 'comb3': 'thy', 'comb4': 'y', 'comb5': 'h', 'comb6': 'cy'}
 {'comb1': 'a', 'comb2': 'mar', 'comb3': 'rk', 'comb4': 'k', 'comb5': 'r', 'comb6': 'mk'}]
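
For string-valued features like these, `DictVectorizer` produces a one-hot encoding with one column per `name=value` pair. The pure-Python sketch below illustrates that idea only. The real class returns a sparse matrix by default and ignores unseen feature values at transform time, and `one_hot_encode` is a hypothetical helper, not a scikit-learn function.

```python
def one_hot_encode(feature_dicts):
    """Map string-valued feature dicts to 0/1 rows, one column per 'name=value' pair."""
    columns = sorted({f'{k}={v}' for d in feature_dicts for k, v in d.items()})
    index = {col: i for i, col in enumerate(columns)}
    rows = []
    for d in feature_dicts:
        row = [0] * len(columns)
        for k, v in d.items():
            row[index[f'{k}={v}']] = 1
        rows.append(row)
    return columns, rows

# Two of the features shown above for 'Cathy' and 'Mark'
samples = [{'comb1': 'a', 'comb4': 'y'}, {'comb1': 'a', 'comb4': 'k'}]
columns, rows = one_hot_encode(samples)
print(columns)  # ['comb1=a', 'comb4=k', 'comb4=y']
print(rows)     # [[1, 0, 1], [1, 1, 0]]
```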


#### Multinomial Model

We first tried a Naive Bayes classifier again, this time using the sklearn implementation. Our results were comparable to our final nltk attempt.

In [14]:
# Fit Naive Bayes classifier 
clf = MultinomialNB()
clf.fit(vect.transform(X_train),y_train)

# Store results for analysis
NBC_test = round(clf.score(vect.transform(X_test), y_test),4)
NBC_train = round(clf.score(vect.transform(X_train), y_train),4)

# order is [train, test], matching the other result rows
NBC_Results = pd.DataFrame([[NBC_train,NBC_test]]).rename(index={0:'SK_NBC'})

# View results
print("Train Accuracy: " + str(NBC_train))
print("Test Accuracy: " + str(NBC_test))

Train Accuracy: 0.8661
Test Accuracy: 0.84


#### Linear Model

We next fitted our data using a Stochastic Gradient Descent (SGD) approach on a linear model. This improved accuracy on our training data, but test accuracy decreased slightly (0.828 versus 0.84 for Naive Bayes).

In [15]:
# Fit SGD classifier 
SGD = SGDClassifier(max_iter=1000, tol=0.001,random_state=1)
SGD.fit(vect.transform(X_train),y_train)

# Store results for analysis
SGD_test= round(SGD.score(vect.transform(X_test), y_test),4)
SGD_train = round(SGD.score(vect.transform(X_train), y_train),4)

SGD_Results = pd.DataFrame([[SGD_train,SGD_test]]).rename(index={0:'SGD'})

# View results
print("Train Accuracy: " + str(SGD_train))
print("Test Accuracy: " + str(SGD_test))

Train Accuracy: 0.9291
Test Accuracy: 0.828


#### Decision Tree Model

In our final attempt, we used the `RandomForestClassifier` to improve our accuracy and avoid the overfitting we observed with the linear approach. However, we found that our train accuracy increased again while our testing accuracies decreased. 

In [16]:
# Fit RFC classifier 
RFC = RandomForestClassifier(n_estimators=10,random_state=1)
RFC.fit(vect.transform(X_train),y_train)

# Store results for analysis
RFC_test = round(RFC.score(vect.transform(X_test), y_test),4)
RFC_train = round(RFC.score(vect.transform(X_train), y_train),4)

RFC_Results = pd.DataFrame([[RFC_train,RFC_test]]).rename(index={0:'RFC'})

# View results
print("Train Accuracy: " + str(RFC_train))
print("Test Accuracy: " + str(RFC_test))

Train Accuracy: 0.9512
Test Accuracy: 0.788


## Analysis

Text mining can be challenging, especially at the start, because you do not yet know what the data contains or how easy it will be to work with. One challenge we faced was that, although the goal of the project was clear (identify whether a name is female or male), finding the right features to build a good predictive model was tricky. To measure our progress while improving on the book's example model, we were asked to split the data into three groups: training, development (dev-test), and test.

Working with the NLTK package, we tried many different feature sets against the dev-test set and reused the book's example error-analysis code to list the misclassified names, which guided the changes to our features.

We also used the scikit-learn package to try additional algorithms from a different library and compare the results. From this package, we used Naive Bayes, Stochastic Gradient Descent (SGD), and `RandomForestClassifier`.

Across all the models, one recurring issue is that by tuning against the dev-test set (NLTK) and the training data (scikit-learn) we tend to overfit, so accuracy on the test set comes out somewhat lower than on the training set. Even so, test accuracy for the four models ranged from 0.788 to 0.84.
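
One simple way to quantify this overfitting is the gap between train and test accuracy for each model. The sketch below uses the accuracies printed in the cells above; the helper name `generalization_gaps` is an assumption for illustration.

```python
# (train, test) accuracies as printed in the cells above
scores = {
    'NLTK_NBC': (0.896, 0.84),
    'SK_NBC': (0.8661, 0.84),
    'SGD': (0.9291, 0.828),
    'RFC': (0.9512, 0.788),
}

def generalization_gaps(results):
    """Return train-minus-test accuracy per model, rounded to 4 decimals."""
    return {model: round(train - test, 4) for model, (train, test) in results.items()}

gaps = generalization_gaps(scores)
for model, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f'{model}: {gap}')
```

On these numbers the random forest shows the largest gap (about 0.16) and the scikit-learn Naive Bayes model the smallest (about 0.03), which matches the pattern described above.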




In [17]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the rows instead
pd.concat([NLTK_Results, NBC_Results, SGD_Results, RFC_Results]).rename(columns={0:"Train", 1:"Test"})

| Model | Train | Test |
|---|---|---|
| NLTK_NBC | 0.896 | 0.84 |
| SK_NBC | 0.8661 | 0.84 |
| SGD | 0.9291 | 0.828 |
| RFC | 0.9512 | 0.788 |


### References 

1. https://www.nltk.org/book/ch06.html#fig-supervised-classification
2. https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
3. https://nlpforhackers.io/introduction-machine-learning/