# Krystian Gronek & Katarzyna Piotrowska
# Text Mining and Social Media Mining, final project - Analyzing men's and women's comments using NLP methods

# Loading packages and data

In [1]:
%matplotlib inline 

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import nltk

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
from sklearn.pipeline import Pipeline

# Naive Bayes Classifier
from sklearn.naive_bayes import MultinomialNB

men = pd.read_csv('data/final_askmen.csv', sep = ';')
women = pd.read_csv('data/final_askwomen.csv', sep = ';')

# Describing the categorization problem approach

In this notebook we try to predict which subreddit a comment or post came from. For that we use a Multinomial Naive Bayes classifier; the input data is the cleaned text of comments and post submissions from the /r/AskMen and /r/AskWomen subreddits.

To categorize both comments and posts, we need to convert each text into a vectorized form, i.e. a numeric representation of the words it contains. We use 2 different representations of words:
- as word counts, representing how many times a word appears in a given comment or post title
- as TF-IDF (Term Frequency - Inverse Document Frequency) scores, which act as weights for each word and down-weight words that occur in many documents

After that we divide the dataset into training and test parts and fit the model. Finally we make predictions for text categorization and assess the model accuracy.

In [2]:
# add a categorical variable that distinguishes whether an observation comes from the /r/AskMen or the /r/AskWomen subreddit
men['subreddit'] = "askmen"
women['subreddit'] = "askwomen"

# merge two datasets into one
df = pd.concat([men, women], axis = 0)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30809 entries, 0 to 14637
Data columns (total 23 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   username                          30809 non-null  object 
 1   com_original                      30809 non-null  object 
 2   cleaned                           30809 non-null  object 
 3   cleaned_wo_sw                     30809 non-null  object 
 4   tokenized                         30809 non-null  object 
 5   stemmed                           30809 non-null  object 
 6   tokenized_wo_sw                   30809 non-null  object 
 7   submission_title                  30809 non-null  object 
 8   submission_title_cleaned          30809 non-null  object 
 9   submission_title_cleaned_wo_sw    30809 non-null  object 
 10  submission_title_tokenized        30809 non-null  object 
 11  submission_title_stemmed          30809 non-null  object 
 12  subm

# Categorization of comments according to subreddit 

In [3]:
comments_cleaned = df['cleaned']
subreddits = df['subreddit']

# Vectorization
# count how many times a word occurs in each comment (term frequency)
# weigh the counts so that frequent tokens get lower weight (inverse document frequency)
# normalize the vectors to unit length to abstract from the original text length (L2 norm)
cv = CountVectorizer().fit(comments_cleaned)
X = cv.transform(comments_cleaned)

# the bag-of-words counts for the entire comments corpus
print("Summary of Document-Term Matrix for comments text:")
print('Shape of Sparse Matrix: ',X.shape)
print('Amount of non-zero occurences:',X.nnz)
# Share of non-zero entries in percent (strictly the density; sparsity = 100 - this value)
sparsity =(100.0 * X.nnz/(X.shape[0]*X.shape[1]))
print('Sparsity: {}'.format(sparsity),"%")
print('\n')

# Term weighting and normalization with TF-IDF
tfidf_transformer=TfidfTransformer().fit(X)
X_tfidf = tfidf_transformer.transform(X)

# split the data into train and test parts
comments_train, comments_test, subreddit_train, subreddit_test = train_test_split(X, subreddits, test_size=0.2, random_state=9)

# Naive Bayes Classifier + fit the model to train data
model = MultinomialNB().fit(comments_train, subreddit_train)

# predict on the full dataset (train + test parts together);
# the classification report and confusion matrix below are based on these predictions
all_predictions_comments = model.predict(X)

print("Summary of predictions for vectorized text:")
# Accuracy of our Model - train data
print("Accuracy of Model - train data", model.score(comments_train,subreddit_train)*100,"%")

# Accuracy of our Model - test data
print("Accuracy of Model - test data", model.score(comments_test,subreddit_test)*100,"%")
print("\n")
print("Classification report for vectorized variable:")
print(classification_report(df['subreddit'],all_predictions_comments))
print("Confusion matrix for vectorized variable:")
print(confusion_matrix(df['subreddit'],all_predictions_comments))
print('\n')

# Accuracy after TF-IDF
# split the data into train and test parts
comments_train_TFIDF, comments_test_TFIDF, subreddit_train_TFIDF, subreddit_test_TFIDF = train_test_split(X_tfidf, subreddits, test_size=0.2, random_state=9)

# Naive Bayes Classifier + fit the model to train data
model = MultinomialNB().fit(comments_train_TFIDF, subreddit_train_TFIDF)

# predict on the full dataset (train + test parts together)
all_predictions_comments_TFIDF = model.predict(X_tfidf)

print("Summary of predictions for TF-IDF transformed text:")
# Accuracy of our Model - train data
print("Accuracy of Model - train data", model.score(comments_train_TFIDF,subreddit_train_TFIDF)*100,"%")

# Accuracy of our Model - test data
print("Accuracy of Model - test data", model.score(comments_test_TFIDF,subreddit_test_TFIDF)*100,"%")
print('\n')

print("Classification report for TF-IDF transformed variable:")
print(classification_report(df['subreddit'],all_predictions_comments_TFIDF))
print("Confusion matrix for TF-IDF transformed variable:")
print(confusion_matrix(df['subreddit'],all_predictions_comments_TFIDF))
print('\n')

Summary of Document-Term Matrix for comments text:
Shape of Sparse Matrix:  (30809, 30197)
Amount of non-zero occurences: 781735
Sparsity: 0.08402686403341117 %


Summary of predictions for vectorized text:
Accuracy of Model - train data 84.41595326003166 %
Accuracy of Model - test data 76.87439143135346 %


Classification report for vectorized variable:
              precision    recall  f1-score   support

      askmen       0.84      0.83      0.84     16171
    askwomen       0.81      0.83      0.82     14638

    accuracy                           0.83     30809
   macro avg       0.83      0.83      0.83     30809
weighted avg       0.83      0.83      0.83     30809

Confusion matrix for vectorized variable:
[[13358  2813]
 [ 2453 12185]]


Summary of predictions for TF-IDF transformed text:
Accuracy of Model - train data 86.61906114334403 %
Accuracy of Model - test data 77.6209023044466 %


Classification report for TF-IDF transformed variable:
              precision    recal

### Results - categorization of comments according to subreddit

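Looking at the document-term matrix, we can see that only 0.08% of its entries are non-zero, i.e. the matrix is about 99.92% sparse (the value printed as "Sparsity" above is strictly the density).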
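Strictly speaking, the quantity computed in the cell above, `100 * nnz / (rows * cols)`, is the density of the matrix, and the sparsity is its complement. A minimal sketch using the shape and non-zero count printed above (the helper name `density_percent` is introduced here only for illustration):

```python
def density_percent(nnz, shape):
    """Share of non-zero cells in a matrix, in percent."""
    n_rows, n_cols = shape
    return 100.0 * nnz / (n_rows * n_cols)

# Values printed above for the comments document-term matrix
density = density_percent(781735, (30809, 30197))
sparsity = 100.0 - density

print(f"Density:  {density:.4f} %")
print(f"Sparsity: {sparsity:.4f} %")
```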

For word-count vectorized text, the model's predictions were accurate for 84.4% of the comments in the train data and 76.9% in the test data. For TF-IDF transformed text the accuracy increased slightly, to 86.6% on the train data and 77.6% on the test data. The classification reports and confusion matrices above are computed on the full dataset (train and test parts together), which is why the accuracy reported there for word counts (0.83) lies between the train and test scores.
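Because the reports above are based on predictions for the whole dataset, a stricter check would evaluate only on held-out text. The following is a minimal sketch of one way to do that with a `Pipeline`, so that the vectorizer and the TF-IDF weights are also fitted on the train part only. It is not the setup used in this notebook: the `texts` and `labels` lists are made-up stand-ins for `df['cleaned']` and `df['subreddit']`, and the step names are arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical stand-ins for df['cleaned'] and df['subreddit']
texts = [
    "alpha beta alpha", "beta gamma alpha", "alpha alpha gamma", "gamma beta beta",
    "delta epsilon delta", "epsilon zeta delta", "delta delta zeta", "zeta epsilon epsilon",
]
labels = ["askmen"] * 4 + ["askwomen"] * 4

# Split the raw text first, so nothing is fitted on the test part
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=9, stratify=labels
)

pipe = Pipeline([
    ("counts", CountVectorizer()),   # word counts
    ("tfidf", TfidfTransformer()),   # TF-IDF weighting
    ("nb", MultinomialNB()),         # Naive Bayes classifier
])
pipe.fit(texts_train, y_train)

# Report computed on the held-out test part only
test_predictions = pipe.predict(texts_test)
print(classification_report(y_test, test_predictions, zero_division=0))
```

On the real data the same pattern would apply with `df['cleaned']` and `df['subreddit']` in place of the toy lists.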

Overall, the categorization accuracy was relatively high, which suggests that there is a distinction in how men and women write replies on Reddit and in the words they choose.

# Predicting which subreddit a post comes from

In [4]:
posts_cleaned = df['submission_title_cleaned']
subreddits = df['subreddit']

# Vectorization
# count how many times a word occurs in each post title (term frequency)
# weigh the counts so that frequent tokens get lower weight (inverse document frequency)
# normalize the vectors to unit length to abstract from the original text length (L2 norm)
cv = CountVectorizer().fit(posts_cleaned)
X = cv.transform(posts_cleaned)

# the bag-of-words counts for the entire post titles corpus
print("Summary of Document-Term Matrix for posts text:")
print('Shape of Sparse Matrix: ',X.shape)
print('Amount of non-zero occurences:',X.nnz)
# Share of non-zero entries in percent (strictly the density; sparsity = 100 - this value)
sparsity =(100.0 * X.nnz/(X.shape[0]*X.shape[1]))
print('Sparsity: {}'.format(sparsity),"%")
print('\n')

# Term weighting and normalization with TF-IDF
tfidf_transformer=TfidfTransformer().fit(X)
X_tfidf = tfidf_transformer.transform(X)

# split the data into train and test parts
posts_train, posts_test, subreddit_train, subreddit_test = train_test_split(X, subreddits, test_size=0.2, random_state=9)

# Naive Bayes Classifier + fit the model to train data
model = MultinomialNB().fit(posts_train, subreddit_train)

# predict on the full dataset (train + test parts together);
# the classification report and confusion matrix below are based on these predictions
all_predictions_posts = model.predict(X)

print("Summary of predictions for vectorized posts text:")
# Accuracy of our Model - train data
print("Accuracy of Model - train data", model.score(posts_train,subreddit_train)*100,"%")

# Accuracy of our Model - test data
print("Accuracy of Model - test data", model.score(posts_test,subreddit_test)*100,"%")
print('\n')
print("Classification report for vectorized variable:")
print(classification_report(df['subreddit'],all_predictions_posts))
print("Confusion matrix for vectorized variable:")
print(confusion_matrix(df['subreddit'],all_predictions_posts))
print('\n')

# Accuracy after TF-IDF
# split the data into train and test parts
posts_train_TFIDF, posts_test_TFIDF, subreddit_train_TFIDF, subreddit_test_TFIDF = train_test_split(X_tfidf, subreddits, test_size=0.2, random_state=9)

# Naive Bayes Classifier + fit the model to train data
model = MultinomialNB().fit(posts_train_TFIDF, subreddit_train_TFIDF)

# predict on the full dataset (train + test parts together)
all_predictions_posts_TFIDF = model.predict(X_tfidf)

print("Summary of predictions for TF-IDF transformed posts text:")
# Accuracy of our Model - train data
print("Accuracy of Model - train data", model.score(posts_train_TFIDF,subreddit_train_TFIDF)*100,"%")

# Accuracy of our Model - test data
print("Accuracy of Model - test data", model.score(posts_test_TFIDF,subreddit_test_TFIDF)*100,"%")
print('\n')
print("Classification report for TF-IDF transformed variable:")
print(classification_report(df['subreddit'],all_predictions_posts_TFIDF))
print("Confusion matrix for TF-IDF transformed variable:")
print(confusion_matrix(df['subreddit'],all_predictions_posts_TFIDF))
print('\n')

Summary of Document-Term Matrix for posts text:
Shape of Sparse Matrix:  (30809, 1448)
Amount of non-zero occurences: 369515
Sparsity: 0.8282966572335091 %


Summary of predictions for vectorized posts text:
Accuracy of Model - train data 98.23913660891792 %
Accuracy of Model - test data 98.10126582278481 %


Classification report for vectorized variable:
              precision    recall  f1-score   support

      askmen       0.99      0.98      0.98     16171
    askwomen       0.97      0.99      0.98     14638

    accuracy                           0.98     30809
   macro avg       0.98      0.98      0.98     30809
weighted avg       0.98      0.98      0.98     30809

Confusion matrix for vectorized variable:
[[15799   372]
 [  179 14459]]


Summary of predictions for TF-IDF transformed posts text:
Accuracy of Model - train data 98.93293301415994 %
Accuracy of Model - test data 98.83154819863681 %


Classification report for TF-IDF transformed variable:
              precision 

### Results - categorization of post titles according to subreddit

Looking at the document-term matrix for post titles, we can see that 0.8% of its entries are non-zero.

For post titles, the model predicted the subreddit of origin very accurately. For word-count vectorized text the accuracy was 98.2% and 98.1% for train and test data respectively. The TF-IDF transformation again improved the predictions slightly: the accuracy increased to 98.9% for train data and 98.8% for test data.

The accuracy of predicting the origin subreddit of post titles seems pretty impressive. It is important to note that for comments we can reasonably assume that replies were written by the respective sex, because a question asked on a subreddit aimed at one sex should be answered by users of that sex. This is not the case for post titles, as they are usually questions that can be asked by anyone (both men and women). Nonetheless, the wording used in the two subreddits is clearly very distinct, which results in such high prediction accuracy.
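One caveat may be worth checking here. Assuming each row of `df` is a comment that carries the title of the post it replies to (which the row count of 30809 and the small title vocabulary of 1448 words suggest, but which is not verified in this notebook), the same title can end up in both the train and the test part of a random split. A group-aware split keeps all rows with the same title on one side. The sketch below uses a made-up `toy_df` rather than the project data.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical frame: two comment rows per post title
toy_df = pd.DataFrame({
    "submission_title_cleaned": [
        "title a", "title a", "title b", "title b",
        "title c", "title c", "title d", "title d",
    ],
    "subreddit": ["askmen"] * 4 + ["askwomen"] * 4,
})

# Split by title, so every title falls entirely into train or test
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=9)
train_idx, test_idx = next(
    splitter.split(
        toy_df,
        toy_df["subreddit"],
        groups=toy_df["submission_title_cleaned"],
    )
)

train_titles = set(toy_df.iloc[train_idx]["submission_title_cleaned"])
test_titles = set(toy_df.iloc[test_idx]["submission_title_cleaned"])
print("Titles shared between train and test:", train_titles & test_titles)
```

If the accuracy on the real data stays high under such a split, the conclusion about distinct wording in post titles is better supported.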

# Conclusions

Overall we can say that there is a clear difference in how users write both their comments and posts on the examined subreddits. The categorization accuracy was high, and based on the scores alone we could conclude that men and women do use different language online.