# Machine Learning Project 
## Using Sentiment Analysis to Classify Gender of Book Authors

## By Malori Hales

In this machine learning project, I built a simple sentiment analysis model to satisfy one of my own curiosities: is there a detectable difference in writing style between modern male and female authors?

First, I will import the necessary libraries for the project.

In [1]:
import numpy as np    
import pandas as pd   
import re             
import nltk           
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split as tts

In [2]:
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/malorieve/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/malorieve/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
stop_words = set(stopwords.words('english'))

Now I will read in the file containing the training data as a pandas dataframe.

In [4]:
df = pd.read_csv("authorData2.csv")    

In [5]:
df

Unnamed: 0,Book,Sentiment,Word Count,Text Sample
0,The Book Thief - Markus Zusak,0,1143,DEATH AND CHOCOLATE\n\n\nFirst the colors.\nTh...
1,The Vanished Birds by Simon Jimenez,0,1694,He was born with an eleventh finger. A small b...
2,The Small Crimes of Tiffany Templeton by Richa...,0,766,The court-appointed advocate dropped me off at...
3,Starsight - Brandon Sanderson,0,1885,I slammed on my overburn and boosted my starsh...
4,Looking for Alaska - John Green,0,1048,“So do you really memorize last words?”\nShe r...
5,The Kingdom of Back - Marie Lu,1,972,"SALZBURG, AUSTRIA\n1759\n\nMOZART BY THE OCEAN..."
6,Ember Queen by Laura Sebastian,1,1479,Reckoning\n\n\n\nThe sun is blinding when I st...
7,Field Notes on Love by Jennifer E Smith,1,974,Hugo\n\nThe shock of it takes a few minutes to...
8,The Twin by Natasha Preston,1,1419,I dig the tips of my yellow-painted fingernai...
9,All the Pretty Things by Emily Arsenault,1,1782,“Hon. Morgan is missing.” \n\n\nThat’s how Dad...


In [6]:
men = df.loc[df['Sentiment'] == 0]
men

Unnamed: 0,Book,Sentiment,Word Count,Text Sample
0,The Book Thief - Markus Zusak,0,1143,DEATH AND CHOCOLATE\n\n\nFirst the colors.\nTh...
1,The Vanished Birds by Simon Jimenez,0,1694,He was born with an eleventh finger. A small b...
2,The Small Crimes of Tiffany Templeton by Richa...,0,766,The court-appointed advocate dropped me off at...
3,Starsight - Brandon Sanderson,0,1885,I slammed on my overburn and boosted my starsh...
4,Looking for Alaska - John Green,0,1048,“So do you really memorize last words?”\nShe r...
10,An Abundance of Katherines - John Green,0,1211,The morning after noted child prodigy Colin Si...
11,Chrysalis - Brendan Reichs,0,1733,Back in the village we grabbed several lengths...
12,Rayne and Delilah’s Midnite Matinee - Jeff Zen...,0,499,Here’s the thing with dreams--and I’m talking ...
13,The Between - David Hofmeyr,0,1712,Between school and home lies a path through th...
14,Frankly in Love - David Yoon,0,1438,"Mom-n-Dad work at The Store every day, from mo..."


In [7]:
women = df.loc[df['Sentiment'] == 1]
women

Unnamed: 0,Book,Sentiment,Word Count,Text Sample
5,The Kingdom of Back - Marie Lu,1,972,"SALZBURG, AUSTRIA\n1759\n\nMOZART BY THE OCEAN..."
6,Ember Queen by Laura Sebastian,1,1479,Reckoning\n\n\n\nThe sun is blinding when I st...
7,Field Notes on Love by Jennifer E Smith,1,974,Hugo\n\nThe shock of it takes a few minutes to...
8,The Twin by Natasha Preston,1,1419,I dig the tips of my yellow-painted fingernai...
9,All the Pretty Things by Emily Arsenault,1,1782,“Hon. Morgan is missing.” \n\n\nThat’s how Dad...
15,The Queen’s Assassin - Melissa De La Cruz,1,1730,Something or someone is following me. I’ve bee...
16,A Good Girl’s Guide to Murder - Holly Jackson,1,1782,Pip knew where they lived. \n\n\n\nEveryone in...
17,One of Us is Next - Karen M. McManus,1,1340,“I have to go to the football field real quick...
18,The Fountains of Silence - Ruta Sepetys,1,485,They stand in line for blood.\n\n\nJune’s earl...
19,American Royals - Katherine McGee,1,986,Present Day\n\nBeatrice could trace her ancest...


I checked that the dataset contains an equal number of male authors (label 0) and female authors (label 1).

In [8]:
print(len(men))
print(len(women))

20
20
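
The same balance check can also be done in a single call with `value_counts`. The sketch below is only an illustration: `toy_df` and its four rows are hypothetical stand-ins, not the project's data, and only the `Sentiment` column (0 = male author, 1 = female author) mirrors the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the project's DataFrame; only the label column matters here.
toy_df = pd.DataFrame({
    "Book": ["A", "B", "C", "D"],
    "Sentiment": [0, 0, 1, 1],   # 0 = male author, 1 = female author
})

# Count how many rows carry each label.
label_counts = toy_df["Sentiment"].value_counts()
print(label_counts)

# The classes are balanced when every label occurs equally often.
is_balanced = label_counts.nunique() == 1
print("Balanced:", is_balanced)
```

On the real data, the equivalent call would be `df["Sentiment"].value_counts()`.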


## Defining Functions

In [9]:
def clean_my_text(text):
    text = re.sub(r"<.*?>", "", text)        # strip HTML-style tags
    text = re.sub(r"[^a-zA-Z]", " ", text)   # replace every non-letter (digits, punctuation, non-ASCII) with a space
    text = text.strip().lower()              # trim and lowercase
    text = re.sub(r" s ", " ", text)         # drop the stray "s" left behind by possessives

    tokenizer = nltk.tokenize.TreebankWordTokenizer()
    tokens = tokenizer.tokenize(text)

    # Remove English stop words
    unstopped = [word for word in tokens if word not in stop_words]

    # Lemmatize the remaining tokens (WordNetLemmatizer is a lemmatizer, not a stemmer)
    lemmatizer = nltk.stem.WordNetLemmatizer()
    cleanText = " ".join(lemmatizer.lemmatize(token) for token in unstopped)
    return cleanText
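
To make the regex stage of the cleaning function easier to follow, here is a minimal sketch that applies the same four substitutions to one sentence. The helper name `regex_clean` and the sample string are hypothetical, and a plain whitespace `split()` stands in for the Treebank tokenizer so that the example runs without any NLTK downloads; stop-word removal and lemmatization are left out.

```python
import re

def regex_clean(text):
    """Apply only the regex steps of the cleaning function (hypothetical helper)."""
    text = re.sub(r"<.*?>", "", text)        # strip HTML-style tags
    text = re.sub(r"[^a-zA-Z]", " ", text)   # non-letters become spaces
    text = text.strip().lower()              # trim and lowercase
    text = re.sub(r" s ", " ", text)         # drop the "s" left by a possessive
    return text

sample = "<b>Dad’s 2 dogs barked!</b>"       # invented sentence, not from the data set
tokens = regex_clean(sample).split()         # whitespace split as a simple tokenizer stand-in
print(tokens)  # ['dad', 'dogs', 'barked']
```

The apostrophe in "Dad’s" is replaced by a space, which leaves an isolated "s"; the last substitution removes it.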

In [10]:
X_train, X_test, y_train, y_test = tts(df["Text Sample"], df["Sentiment"], test_size=0.2)
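
The split above is neither seeded nor stratified, so every run produces a different partition, and with 40 samples the 8-sample test set may not contain equal numbers of each class. A possible variation, sketched here on hypothetical toy data (`toy`, the shortened variable names, and the seed value 42 are all assumptions made for illustration), passes `stratify` and `random_state` to `train_test_split`:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data set: 20 samples per class, mirroring the 20/20 counts above.
toy = pd.DataFrame({
    "Text Sample": [f"sample text {i}" for i in range(40)],
    "Sentiment": [0] * 20 + [1] * 20,
})

X_tr, X_te, y_tr, y_te = train_test_split(
    toy["Text Sample"],
    toy["Sentiment"],
    test_size=0.2,
    stratify=toy["Sentiment"],  # keep the 50/50 class ratio in both splits
    random_state=42,            # arbitrary seed so the split is repeatable
)

print(len(X_tr), len(X_te))           # 32 8
print(sorted(y_te.tolist()))          # [0, 0, 0, 0, 1, 1, 1, 1]
```

A fixed seed makes the reported accuracy reproducible, and stratification keeps the test set balanced.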

In [11]:
def clean_my_data(dataList):
    """Clean every text entry in dataList in place and return the list."""
    print("Cleaning all of the data")
    for i, textEntry in enumerate(dataList):
        dataList[i] = clean_my_text(textEntry)
        print("Cleaning entry number", i + 1, "out of", len(dataList))
    print("Finished cleaning all of the data\n")
    return dataList


print("Operating on training data...\n")
reviews = X_train.tolist()                # text samples from the training split
cleanReviewData = clean_my_data(reviews)

Operating on training data...

Cleaning all of the data
Cleaning entry number 1 out of 32
Cleaning entry number 2 out of 32
Cleaning entry number 3 out of 32
Cleaning entry number 4 out of 32
Cleaning entry number 5 out of 32
Cleaning entry number 6 out of 32
Cleaning entry number 7 out of 32
Cleaning entry number 8 out of 32
Cleaning entry number 9 out of 32
Cleaning entry number 10 out of 32
Cleaning entry number 11 out of 32
Cleaning entry number 12 out of 32
Cleaning entry number 13 out of 32
Cleaning entry number 14 out of 32
Cleaning entry number 15 out of 32
Cleaning entry number 16 out of 32
Cleaning entry number 17 out of 32
Cleaning entry number 18 out of 32
Cleaning entry number 19 out of 32
Cleaning entry number 20 out of 32
Cleaning entry number 21 out of 32
Cleaning entry number 22 out of 32
Cleaning entry number 23 out of 32
Cleaning entry number 24 out of 32
Cleaning entry number 25 out of 32
Cleaning entry number 26 out of 32
Cleaning entry number 27 out of 32
Cleaning

In [12]:
def create_bag_of_words(X):
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    print('Generating bag of words...')

    # Count unigrams and bigrams, keeping at most the 10,000 most frequent features
    vectorizer = CountVectorizer(analyzer="word",
                                 tokenizer=None,
                                 preprocessor=None,
                                 stop_words=None,
                                 ngram_range=(1, 2),
                                 max_features=10000)

    train_data_features = vectorizer.fit_transform(X)
    train_data_features = train_data_features.toarray()

    # Reweight the raw counts with TF-IDF
    tfidf = TfidfTransformer()
    tfidf_features = tfidf.fit_transform(train_data_features)
    return vectorizer, tfidf_features, tfidf
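
As a side note, scikit-learn's `TfidfVectorizer` combines the counting and TF-IDF steps into one object. The sketch below, run on three hypothetical mini-documents (invented for illustration), checks that it gives the same features as `CountVectorizer` followed by `TfidfTransformer` when both use the same settings. It is an alternative shown for comparison, not the approach used in this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical, already-cleaned mini-documents.
docs = [
    "sun blinding step outside",
    "born eleventh finger small",
    "sun small finger step",
]

# Two-step version: raw unigram/bigram counts, then TF-IDF weighting.
counts = CountVectorizer(ngram_range=(1, 2), max_features=10000).fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One-step version with the same settings.
one_step = TfidfVectorizer(ngram_range=(1, 2), max_features=10000).fit_transform(docs)

same_features = np.allclose(two_step.toarray(), one_step.toarray())
print(same_features)
```

Both versions accept the sparse count matrix directly, so the intermediate `toarray()` call is optional.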

In [13]:
vectorizer, tfidf_features, tfidf = create_bag_of_words(cleanReviewData)

Generating bag of words...


In [14]:
def train_logistic_regression(features, label):
    print("Training the logistic regression model...")
    from sklearn.linear_model import LogisticRegression
    # C is the inverse of the regularization strength, so C=100 applies only weak regularization
    ml_model = LogisticRegression(C=100, random_state=0, solver='liblinear')
    ml_model.fit(features, label)
    print('Finished training the model\n')
    return ml_model

## Train and Test Model

In [15]:
ml_model = train_logistic_regression(tfidf_features, y_train)

Training the logistic regression model...
Finished training the model



In [16]:
print("Operating on test data...\n")
sentiments = X_test.tolist()
cleanTestData = clean_my_data(sentiments)

Operating on test data...

Cleaning all of the data
Cleaning entry number 1 out of 8
Cleaning entry number 2 out of 8
Cleaning entry number 3 out of 8
Cleaning entry number 4 out of 8
Cleaning entry number 5 out of 8
Cleaning entry number 6 out of 8
Cleaning entry number 7 out of 8
Cleaning entry number 8 out of 8
Finished cleaning all of the data



In [17]:
test_data_features = vectorizer.transform(cleanTestData)
test_data_features = test_data_features.toarray()

In [18]:
# Reuse the IDF weights learned from the training data; fit_transform would refit them on the test set
test_data_tfidf_features = tfidf.transform(test_data_features)
test_data_tfidf_features = test_data_tfidf_features.toarray()
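
The difference between `transform` and `fit_transform` on held-out data can be seen in a small sketch. The documents below are hypothetical and invented for illustration. A fitted `TfidfTransformer` stores its learned weights in `idf_`; calling `transform` leaves them untouched, while fitting on the test counts produces different weights, so the model would see features on a different scale from the one it was trained on.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical cleaned documents.
train_docs = ["sun blinding step", "sun finger small", "finger step outside"]
test_docs = ["sun sun finger", "outside blinding"]

vectorizer = CountVectorizer()
train_counts = vectorizer.fit_transform(train_docs)
test_counts = vectorizer.transform(test_docs)      # reuse the training vocabulary

tfidf = TfidfTransformer()
tfidf.fit(train_counts)
idf_from_train = tfidf.idf_.copy()

test_features = tfidf.transform(test_counts)       # applies the training IDF weights
idf_unchanged = np.array_equal(tfidf.idf_, idf_from_train)

# Fitting a transformer on the test counts yields different IDF weights.
refit = TfidfTransformer().fit(test_counts)
idf_differs = not np.allclose(refit.idf_, idf_from_train)

print(idf_unchanged, idf_differs)
```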

In [19]:
predicted_y = ml_model.predict(test_data_tfidf_features)

In [20]:
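# Compare predictions with the true labels ("Sentiment" holds the author's gender: 0 = male, 1 = female)
correctly_identified_y = predicted_y == y_test
accuracy = np.mean(correctly_identified_y) * 100
print('The accuracy of the model in predicting book author sentiment is %.0f%%' % accuracy)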

The accuracy of the model in predicting book author sentiment is 50%
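
To put this number in context: the test split contains only 8 samples, so each prediction is worth 12.5 percentage points and the accuracy can take just nine distinct values. The short sketch below (variable names are illustrative) lists them using the same `%.0f%%` formatting as the print statement above and finds how many correct predictions correspond to 50%.

```python
n_test = 8  # size of the test split shown in the cleaning log above

# Every accuracy value that 8 test samples can produce, formatted like the notebook's print statement.
possible = ["%.0f%%" % (100 * correct / n_test) for correct in range(n_test + 1)]
print(possible)

# Number of correct predictions that corresponds to the reported 50%.
correct_for_50 = [c for c in range(n_test + 1) if "%.0f%%" % (100 * c / n_test) == "50%"]
print(correct_for_50)  # [4]
```

A score of 50% therefore means 4 of the 8 test samples were classified correctly, which is what random guessing would be expected to achieve on a balanced two-class problem.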
