# Data Mining Challenge: *Reddit Gender Text-Classification*

#### Successful Training Strategy

The train set was grouped by author, and each author's texts, concatenated with `" ".join`, were turned into a bag of words (BOW; see this [brief Kaggle tutorial](https://www.kaggle.com/matleonard/text-classification#Bag-of-Words)). 80% of the resulting data was used to train an [XGBoost](https://www.kaggle.com/alexisbcook/xgboost) model, which then predicted the remaining 20%.  
Then, a [Document Embedding model](https://medium.com/wisio/a-gentle-introduction-to-doc2vec-db3e8c0cce5e) was fitted on the test and train texts. 80% of the train vectors were used to train a [Multi Layer Perceptron](https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html) (MLP), which then predicted the remaining 20% and the test set. Third, an MLP was trained on the count-vectorized subreddits, following the same procedure as the models above.  
The predictions of the XGBoost model and of the two MLPs on the held-out 20% were used to train and validate a final logistic regression.  
Finally, a new XGBoost model and two new MLPs were trained on all train texts, and their predictions on the test set were fed to the logistic regression to produce the final submission.  
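
The stacking scheme described above can be sketched on synthetic data. This is only an illustration under stated assumptions, not the code of the original notebooks: `make_classification` stands in for the author-level features, `GradientBoostingClassifier` replaces XGBoost, a single small `MLPClassifier` replaces the two MLPs, and the helper names `make_base_models` and `stack_probabilities` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the author-level features (hypothetical data).
X_all, y_all = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, _ = train_test_split(
    X_all, y_all, test_size=0.25, random_state=0
)

# Step 1: hold out 20% of the train set for the meta-model.
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_train, y_train, test_size=0.2, random_state=0
)

def make_base_models():
    # Stand-ins: gradient boosting for XGBoost, one small MLP for the two MLPs.
    return [
        GradientBoostingClassifier(random_state=0),
        MLPClassifier(hidden_layer_sizes=(16,), max_iter=1000, random_state=0),
    ]

def stack_probabilities(models, X):
    # One column per base model: predicted probability of class 1.
    return np.column_stack([m.predict_proba(X)[:, 1] for m in models])

# Step 2: fit the base models on the 80% and predict the held-out 20%.
base = [m.fit(X_fit, y_fit) for m in make_base_models()]
meta_features = stack_probabilities(base, X_hold)

# Step 3: fit the logistic regression on the held-out predictions.
meta = LogisticRegression().fit(meta_features, y_hold)

# Step 4: refit fresh base models on the full train set, then let the
# meta-model combine their test-set predictions.
base_full = [m.fit(X_train, y_train) for m in make_base_models()]
final_scores = meta.predict_proba(stack_probabilities(base_full, X_test))[:, 1]
```

Training the meta-model only on held-out predictions keeps it from learning on scores that the base models produced for rows they had already seen.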

#### Unsuccessful Training Strategy

An exploration of [spaCy](https://github.com/explosion/spaCy) was also carried out; the relevant notebooks are [here](https://github.com/pitmonticone/data-mining-challange/tree/master/spaCy). The model works and follows a strategy similar to the one presented above, but its performance is lower (ROC = 0.894). The exploration concluded with this [Stack Overflow Question](https://stackoverflow.com/questions/60821793/text-classification-with-spacy-going-beyond-the-basics-to-improve-performance), this [GitHub Issue](https://github.com/explosion/spaCy/issues/5224) and a comment on a [Feature Request](https://github.com/explosion/spaCy/issues/2253#issuecomment-605502320).

### Modules

In [None]:
# Numpy & matplotlib for notebooks 
%pylab inline

# Pandas for data analysis and manipulation 
import pandas as pd 

# Sparse matrix package for numeric data.
from scipy import sparse

# Module for word and document embeddings (word2vec, doc2vec)
import gensim  

# Module for progress monitoring
import tqdm   

# Sklearn 
from sklearn.preprocessing import StandardScaler # to standardize features by removing the mean and scaling to unit variance (z=(x-u)/s)
from sklearn.neural_network import MLPClassifier # Multi-layer Perceptron classifier which optimizes the log-loss function using LBFGS or SGD.
from sklearn.model_selection import train_test_split # to split arrays or matrices into random train and test subsets
from sklearn.model_selection import KFold # K-Folds cross-validator providing train/test indices to split data in train/test sets.
from sklearn.decomposition import PCA, TruncatedSVD # Principal component analysis (PCA); dimensionality reduction using truncated SVD.
from sklearn.linear_model import LogisticRegression 
from sklearn.naive_bayes import MultinomialNB # Naive Bayes classifier for multinomial models

# Matplotlib
import matplotlib # Data visualization
import matplotlib.pyplot as plt 
import matplotlib.patches as mpatches  

# Seaborn
import seaborn as sns # Statistical data visualization (based on matplotlib)

### Data Collection 

In [None]:
# Import the test dataset
test_data = pd.read_csv("data/test_data.csv", encoding="utf8")

In [None]:
# Create a list of authors
a_test = []
for author, group in test_data.groupby("author"):
    a_test.append(author)
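
The loop above relies on `groupby` iterating over the authors in sorted order, so `a_test` ends up as the sorted list of unique authors. The sketch below, on a made-up frame, shows an equivalent shorter form together with the `" ".join` aggregation mentioned in the training strategy. The column name `body` and the sample rows are assumptions, since the actual columns of `test_data.csv` are not shown here.

```python
import pandas as pd

# Hypothetical miniature of the comments table; "body" is an assumed column name.
toy = pd.DataFrame({
    "author": ["bob", "alice", "bob", "carol"],
    "body": ["hello", "first post", "world", "only one"],
})

# groupby sorts group keys by default, so iteration yields authors alphabetically.
authors_loop = [author for author, _ in toy.groupby("author")]

# Equivalent result without the explicit loop.
authors_sorted = sorted(toy["author"].unique())

# One aggregated text per author, indexed in the same sorted order.
texts = toy.groupby("author")["body"].apply(" ".join)
```

Whichever form is used, the author order has to match the row order of the prediction arrays loaded later, which is assumed here to come from the same grouping.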

In [None]:
# Load the validation labels shared by all first-level models
y = np.load("../y_valid.csv.npy") # common validation y of previous steps

In [None]:
# MLP on doc2vec
x1 = np.load("../y_D2V-mlpClf.npy")
# XGB on countvectorized texts
x2 = np.load("../y_predict_XGB.csv.npy")
# MLP on binary countvectorized subreddits
x3 = np.load("../y_score_MLPs.npy")

In [None]:
# Test-set predictions of the first-level models, in the same order as x1, x2, x3
t1 = np.load("y_testD2V.npy")
t2 = np.load("y_testXGBnS.csv.npy")
t3 = np.load("y_testMLPs.npy")

In [None]:
a = np.vstack((x3,x2,x1))
t = np.vstack((t3,t2,t1))

In [None]:
X = a.T # transpose
T = t.T # transpose
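
Stacking the three prediction vectors with `np.vstack` and transposing gives a matrix with one row per author and one column per model. A small sketch with made-up numbers shows the resulting shape; `np.column_stack` is an equivalent one-step alternative. The arrays `p1`, `p2`, `p3` are illustrative and not taken from the saved predictions.

```python
import numpy as np

# Made-up probabilities for four authors from three models (illustrative only).
p1 = np.array([0.1, 0.8, 0.4, 0.9])
p2 = np.array([0.2, 0.7, 0.5, 0.6])
p3 = np.array([0.3, 0.9, 0.2, 0.8])

stacked = np.vstack((p3, p2, p1)).T    # shape (4, 3): rows are authors
same = np.column_stack((p3, p2, p1))   # equivalent one-step construction

print(stacked.shape)
```

The column order must be identical for the validation matrix and the test matrix, because the logistic regression learns one coefficient per column.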

In [None]:
# Project the data onto its first 2 truncated-SVD components and plot it, colored by label
def plot_LSA(test_data, test_labels, savepath="PCA_demo.csv", plot=True):
    # savepath is kept in the signature but is not used
    lsa = TruncatedSVD(n_components=2)
    lsa_scores = lsa.fit_transform(test_data)
    colors = ['orange', 'blue']
    if plot:
        plt.scatter(lsa_scores[:, 0], lsa_scores[:, 1], s=8, alpha=.8,
                    c=test_labels, cmap=matplotlib.colors.ListedColormap(colors))
        # Legend patches: the lower label value is drawn in orange, the higher in blue
        orange_patch = mpatches.Patch(color='orange', label='M')
        blue_patch = mpatches.Patch(color='blue', label='F')
        plt.legend(handles=[orange_patch, blue_patch], prop={'size': 20})

fig = plt.figure(figsize=(8, 8))          
plot_LSA(X, y)
plt.show()

In [None]:
# Logistic regression (meta-model on the stacked predictions)
lrClf = LogisticRegression(class_weight="balanced", solver="saga", C=0.00005)

# K-fold cross-validation
kf = KFold(n_splits=10)

for train_indices, test_indices in kf.split(X):
    lrClf.fit(X[train_indices], y[train_indices])
    print(lrClf.score(X[test_indices], y[test_indices]))
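
`LogisticRegression.score` reports mean accuracy, and the loop above leaves `lrClf` fitted on the training split of the last fold only. If a ROC AUC estimate and a final fit on all validation rows are wanted, one possible variant is sketched below on synthetic data. The data, the default `C`, the shuffling and the seed are assumptions made for the illustration, not the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
# Synthetic stand-in for the stacked predictions: 3 noisy scores per author.
y_demo = rng.integers(0, 2, size=500)
X_demo = np.clip(
    y_demo[:, None] * 0.4 + rng.normal(0.3, 0.2, size=(500, 3)), 0, 1
)

clf = LogisticRegression(class_weight="balanced")
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# ROC AUC per fold instead of accuracy.
auc_per_fold = cross_val_score(clf, X_demo, y_demo, cv=cv, scoring="roc_auc")
print(auc_per_fold.mean())

# cross_val_score fits clones of clf, so fit once more on all rows before predicting.
clf.fit(X_demo, y_demo)
demo_scores = clf.predict_proba(X_demo)[:, 1]
```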

In [None]:
y_scorel = lrClf.predict_proba(T)[:,1]

In [None]:
y_scorel

In [None]:
test = {'author': a_test,
        'gender': y_scorel
        }

df = pd.DataFrame(test, columns = ['author', 'gender'])

print(df)

In [None]:
df.to_csv(r'Submission.csv', index = False)