# Text classification 


## Load libraries 

I adapted some Python code that my RAs developed for another project: https://github.com/jaeyk/ITS-Text-Classification/blob/master/code/05_classification.ipynb

Import only relevant libraries.

In [1]:
# Numpy and Pandas 

import numpy as np
import warnings
import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import StandardScaler

# Data visualization 

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# REGEX and NLP

import re
import string
!pip install nltk
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# ML

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB # Naive-Bayes
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression # Linear models
from xgboost import XGBClassifier # XG Boost
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score # Accuracy score
from sklearn.metrics import balanced_accuracy_score # Balanced accuracy score
from sklearn.metrics import cohen_kappa_score # Cohen's Kappa score
from sklearn.utils import resample # for resampling

# Interface
import tkinter as tk
from tkinter import filedialog

warnings.filterwarnings('ignore')



[nltk_data] Downloading package stopwords to /home/jae/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load data



In [2]:
# Open file path

# root = tk.Tk()
# root.withdraw()

# file_path = filedialog.askopenfilename()

In [3]:

# The labeled data 

asian_sample = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/sample_asian.csv")
black_sample = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/sample_black.csv")

# The unlabeled data

asian_unlabeled = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/unlabeled_asian.csv")
black_unlabeled = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/unlabeled_black.csv")

Examine files.

In [4]:
# First five rows

asian_sample.head(5)

| | Unnamed: 0 | author | date | source | text | year | linked_progress | linked_hurt |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Lopez, Flora | 1976-04-30 | International Examiner | \n\n\n\n\n\n\n\nS.P.I.C.E. is a nutritional pr... | 1976 | 1 | 0 |
| 1 | 2 | | 1976-09-30 | International Examiner | \n\n\n\n\n\n\n\nCommunity control rather than ... | 1976 | 0 | 0 |
| 2 | 3 | | 1976-06-30 | International Examiner | \n\n\n\n\n\n\n\n"Peasants of the Second Fortre... | 1976 | 0 | 1 |
| 3 | 4 | Chin, Doug | 1976-10-31 | International Examiner | \n\n\n\n\n\n\n\nMuch of what is now the Intern... | 1976 | 0 | 0 |
| 4 | 5 | Chow, Ron | 1976-02-29 | International Examiner | \n\n\n\n\n\n\n\nInternational District Housing... | 1976 | 1 | 0 |


Select only relevant columns.

In [5]:
# Drop the first column

## Seen 

asian_sample = asian_sample.drop(['Unnamed: 0'], axis = 1)
black_sample = black_sample.drop(['Unnamed: 0'], axis = 1)

# An alternative way of doing this is asian_sample = asian_sample[['col1', 'col2']] 

## Unseen 

asian_unlabeled = asian_unlabeled.drop(['Unnamed: 0'], axis = 1)
black_unlabeled = black_unlabeled.drop(['Unnamed: 0'], axis = 1)

Convert the date column to datetime. This data type lets us extract date components from the column. For instance, `asian_sample['date'].dt.year` returns the years.

In [6]:

# Seen data 
asian_sample["date"] = pd.to_datetime(asian_sample["date"])
black_sample["date"] = pd.to_datetime(black_sample["date"])

# Unseen data 
asian_unlabeled["date"] = pd.to_datetime(asian_unlabeled["date"])
black_unlabeled["date"] = pd.to_datetime(black_unlabeled["date"])
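As a small illustration of what the datetime type makes possible, here is a sketch on a toy data frame (the dates are made up and `toy` is a hypothetical name, not one of the project tables). It uses the `.dt` accessor to pull the year and month out of a parsed column.

```python
import pandas as pd

# Toy frame standing in for one of the article tables (values are made up).
toy = pd.DataFrame({"date": ["1976-04-30", "1976-09-30", "1977-01-15"]})

# Parse the strings into datetime values.
toy["date"] = pd.to_datetime(toy["date"])

# The .dt accessor exposes date components for the whole column at once.
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

print(toy)
```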


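Check the balance of the target values. Both targets are **imbalanced**: the 0 class is several times larger than the 1 class in every sample. I used a resampling method (upsampling, also called oversampling, of the minority class) to address this problem.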

In [7]:
# Check the balance of target values 

asian_sample['linked_progress'].value_counts()

0    736
1    254
Name: linked_progress, dtype: int64

In [8]:
asian_sample['linked_hurt'].value_counts()

0    843
1    147
Name: linked_hurt, dtype: int64

In [9]:
black_sample['linked_progress'].value_counts()

0    751
1    257
Name: linked_progress, dtype: int64

In [10]:
black_sample['linked_hurt'].value_counts()

0    812
1    196
Name: linked_hurt, dtype: int64

Note that the number of labeled Asian American articles is smaller because I removed 18 duplicates from the original sample.
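To put the imbalance in proportions, the sketch below rebuilds a label column from the `linked_progress` counts printed above for the Asian American sample (736 zeros, 254 ones). The variable names are hypothetical. It also shows the accuracy that a classifier would reach by always predicting the majority class, which is why plain accuracy alone can be misleading here.

```python
import pandas as pd

# Counts copied from the linked_progress output for the Asian American sample.
labels = pd.Series([0] * 736 + [1] * 254, name="linked_progress")

# Share of each class instead of raw counts.
share = labels.value_counts(normalize=True)
print(share.round(3))

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = share.max()
print(round(baseline_accuracy, 3))
```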

## Preprocessing

### Remove special characters, punctuation, whitespace, and stopwords

- I created a function for cleaning texts.
- Removing stop words did not improve performance in this case, so I commented that step out.

In [11]:

# stop_words = stopwords.words('english')

def clean_text(document):
    document = document.str.lower() # lower case
    document = document.str.replace('[\r?\n]','', regex = True) # drop carriage returns, newlines, and question marks
    document = document.str.replace('[^\\w\\s]','', regex = True) # drop punctuation and other special characters
    document = document.str.replace('\\d+', '', regex = True) # drop digits
    document = document.str.strip() # remove leading and trailing whitespace
    # document = document.apply(lambda x: " ".join([y for y in x.split() if y not in stop_words])) # remove stop words (disabled)
    return document

Let's see how it works using one sample.

In [12]:
clean_text(black_sample['text']).head() # first 5 rows 

0    friday nov  at  pm rev l s rubin pastor at oli...
1    we have a large building an ante bellum buildi...
2    ktvus televoters were back to being pretty upt...
3    washington dc  washingtons appointed mayor wal...
4    spokesmen for the congress of racial equality ...
Name: text, dtype: object
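To make each cleaning step visible, here is a sketch that applies the same function to a single made-up sentence. The function body is copied from the cell above so the example runs on its own; the input string is hypothetical. Note that removing digits leaves double spaces behind, which matches the gaps in the output above.

```python
import pandas as pd

def clean_text(document):
    # Same steps as the notebook's clean_text, reproduced so this runs alone.
    document = document.str.lower()
    document = document.str.replace('[\r?\n]', '', regex=True)
    document = document.str.replace('[^\\w\\s]', '', regex=True)
    document = document.str.replace('\\d+', '', regex=True)
    document = document.str.strip()
    return document

# Made-up input with leading newlines, punctuation, digits, and trailing spaces.
toy = pd.Series(["\n\nFriday, Nov. 12 at 8 PM: Rev. L. S. Rubin speaks!  "])

cleaned = clean_text(toy)[0]
print(repr(cleaned))
```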

Apply the function to each corpus.

In [13]:
# Seen

asian_sample['text'] = clean_text(asian_sample['text'])
black_sample['text'] = clean_text(black_sample['text'])

# Unseen

asian_unlabeled['text'] = clean_text(asian_unlabeled['text'])
black_unlabeled['text'] = clean_text(black_unlabeled['text'])

## Feature engineering

Here, we turn the texts into a document-term matrix (DTM): one row per article and one column per term. The terms serve as features in the model, and we aim to find the combination of features that is most effective in predicting the target values.

### Vectorizer 

In [14]:

# Bag of Words (BOW)

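vectorizer = CountVectorizer(
    max_features = 5000, # keep the 5,000 most frequent terms (large enough here)
    min_df = 1, # keep terms that appear in at least 1 document
    ngram_range = (1,2), # unigrams and bigrams
    binary = True, # record presence/absence instead of counts
)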


The function below does several things at once:

- Resamples the data to correct the class imbalance by upsampling the minority class
- Converts the text into a document-term matrix
- Splits the matrix into training and testing sets using stratified random sampling (stratified by year)
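The upsampling step can be sketched in isolation on a toy data frame. The data and variable names here are hypothetical, and the sketch sets `n_samples` to the size of the majority class, whereas the notebook function uses a fixed value.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data: six majority-class rows (0) and two minority-class rows (1).
toy = pd.DataFrame({"text": list("abcdefgh"),
                    "label": [0, 0, 0, 0, 0, 0, 1, 1]})

majority = toy[toy["label"] == 0]
minority = toy[toy["label"] == 1]

# Draw minority rows with replacement until the classes are the same size.
minority_up = resample(minority,
                       replace=True,
                       n_samples=len(majority),
                       random_state=1234)

balanced = pd.concat([majority, minority_up])
print(balanced["label"].value_counts())
```

Because the draw is with replacement, the upsampled rows are repeated copies of the two original minority rows; no new text is created.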

### Resampling, Creating DTM, and Splitting data


In [15]:


def dtm_train_resample(data, text, column, year):
    
    ############################### RESAMPLING ################################
    
    # Split into majority and minority classes (adapted from https://elitedatascience.com/imbalanced-classes)
       
    df_majority = data[data[column] == 0]
    df_minority = data[data[column] == 1]
    
    # Upsample (oversample) minority class 
    
    df_minority_upsampled = resample(df_minority, 
                                 replace = True,     # sample with replacement
                                 n_samples = 750,    # fixed at 750, roughly the majority-class size for linked_progress
                                 random_state = 1234) # reproducible results
    
    # Combine majority class with upsampled minority class
    data = pd.concat([df_majority, df_minority_upsampled])
    
    ############################### DOCUMENT-TERM MATRIX ################################
    
    # BOW model 
    
    features = vectorizer.fit_transform(data[text]).todense() # Convert the sparse DTM to a dense matrix (GaussianNB needs dense input)

    # Response variable
    
    response = data[column].values # values 

    ############################### STRATIFIED RANDOM SAMPLING ################################
    
    # Split into training and testing sets 

    X_train, X_test, y_train, y_test = train_test_split(features, response, 
                                                        test_size = 0.2, # training = 80%, test = 20%
                                                        random_state = 1234, # for reproducibility
                                                        stratify = data[year]) # stratifying by year
    
    # Label encode (normalize) response variable
    
    encoder = preprocessing.LabelEncoder()
    
    y_train = encoder.fit_transform(y_train)
    y_test = encoder.transform(y_test) # reuse the mapping learned from y_train

    return(X_train, y_train, X_test, y_test)
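One thing to keep in mind with this design: the upsampling happens before the split, so copies of the same minority article can land in both the training and the testing set, which tends to make test scores look better than they would on unseen articles. A minimal sketch of the alternative order (split first, then upsample only the training rows) follows. It uses toy data and a hypothetical helper name, `split_then_upsample`, stratifies by label rather than by year for simplicity, and is not the function used for the results in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

def split_then_upsample(data, column, test_size=0.2, seed=1234):
    # Hold out the test rows first so none of them can be copied into training.
    train, test = train_test_split(data,
                                   test_size=test_size,
                                   random_state=seed,
                                   stratify=data[column])
    majority = train[train[column] == 0]
    minority = train[train[column] == 1]
    # Upsample the minority class within the training rows only.
    minority_up = resample(minority,
                           replace=True,
                           n_samples=len(majority),
                           random_state=seed)
    return pd.concat([majority, minority_up]), test

# Toy data: 15 majority rows and 5 minority rows, each with a unique id.
toy = pd.DataFrame({"doc_id": range(20), "label": [0] * 15 + [1] * 5})

train_bal, test = split_then_upsample(toy, "label")

# No document id appears on both sides of the split.
overlap = set(train_bal["doc_id"]) & set(test["doc_id"])
print(train_bal["label"].value_counts().to_dict(), len(overlap))
```

With this order the test set keeps the original class imbalance, so balanced accuracy and Cohen's kappa become the more informative scores.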



### Training and testing data and response variables

I created training and testing data (text features) and their response variables using the custom function shown above.


In [16]:
# Asian American newspapers 

## Linked progress
asian_lp_dtm = dtm_train_resample(asian_sample, 'text', 'linked_progress', 'year')
asian_X_train_lp = asian_lp_dtm[0]
asian_y_train_lp = asian_lp_dtm[1]
asian_X_test_lp = asian_lp_dtm[2]
asian_y_test_lp = asian_lp_dtm[3]

## Linked hurt 
asian_lh_dtm = dtm_train_resample(asian_sample, 'text', 'linked_hurt', 'year')
asian_X_train_lh = asian_lh_dtm[0]
asian_y_train_lh = asian_lh_dtm[1]
asian_X_test_lh = asian_lh_dtm[2]
asian_y_test_lh = asian_lh_dtm[3]

# African American newspapers

## Linked progress
black_lp_dtm = dtm_train_resample(black_sample, 'text', 'linked_progress', 'year')
black_X_train_lp = black_lp_dtm[0]
black_y_train_lp = black_lp_dtm[1]
black_X_test_lp = black_lp_dtm[2]
black_y_test_lp = black_lp_dtm[3]

## Linked hurt 
black_lh_dtm = dtm_train_resample(black_sample, 'text', 'linked_hurt', 'year')
black_X_train_lh = black_lh_dtm[0]
black_y_train_lh = black_lh_dtm[1]
black_X_test_lh = black_lh_dtm[2]
black_y_test_lh = black_lh_dtm[3]


## Fit and evaluate a ML model

### Functions for various ML models

In [17]:
# Lasso

def fit_logistic_regression(X_train, y_train):
    model = LogisticRegression(fit_intercept = True, penalty = 'l1', solver = 'saga') # Lasso
    model.fit(X_train, y_train)
    return model

# Naive-Bayes 

def fit_bayes(X_train, y_train):
    model = GaussianNB()
    model.fit(X_train, y_train)
    return model

# XG Boost 

def fit_xgboost(X_train, y_train):
    model = XGBClassifier(random_state = 42,
                         seed = 2, 
                         colsample_bytree = 0.6,
                         subsample = 0.7)
    model.fit(X_train, y_train)
    return model


### Function for evaluating ML models (accuracy, balanced accuracy, and Cohen's kappa)

In [None]:

def test_model(model, X_train, y_train, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    kappa = cohen_kappa_score(y_test, y_pred)
    print("Accuracy:", accuracy, "\n"
          "Balanced accuracy:", balanced_accuracy, "\n"
          "Cohen's Kappa:", kappa)


### Model fitting 

In [18]:
# Asian American newspapers

## Linked progress
asian_lp = fit_logistic_regression(asian_X_train_lp, asian_y_train_lp)
asian_lp_bayes = fit_bayes(asian_X_train_lp, asian_y_train_lp)
asian_lp_xgboost = fit_xgboost(asian_X_train_lp, asian_y_train_lp)

## Linked hurt 
asian_lh = fit_logistic_regression(asian_X_train_lh, asian_y_train_lh)
asian_lh_bayes = fit_bayes(asian_X_train_lh, asian_y_train_lh)
asian_lh_xgboost = fit_xgboost(asian_X_train_lh, asian_y_train_lh)

# African American newspapers

## Linked progress
black_lp = fit_logistic_regression(black_X_train_lp, black_y_train_lp)
black_lp_bayes = fit_bayes(black_X_train_lp, black_y_train_lp)
black_lp_xgboost = fit_xgboost(black_X_train_lp, black_y_train_lp)

## Linked hurt 
black_lh = fit_logistic_regression(black_X_train_lh, black_y_train_lh)
black_lh_bayes = fit_bayes(black_X_train_lh, black_y_train_lh)
black_lh_xgboost = fit_xgboost(black_X_train_lh, black_y_train_lh)

### Model evaluations 

In [19]:
# Asian American newspapers

print("Asian linked progress: Logistic regression")
test_model(asian_lp, asian_X_train_lp, asian_y_train_lp, asian_X_test_lp, asian_y_test_lp)
print("Asian linked progress: Naive Bayes")
test_model(asian_lp_bayes, asian_X_train_lp, asian_y_train_lp, asian_X_test_lp, asian_y_test_lp)
print("Asian linked progress: XGBoost")
test_model(asian_lp_xgboost, asian_X_train_lp, asian_y_train_lp, asian_X_test_lp, asian_y_test_lp)

print("Asian linked hurt: Logistic regression")
test_model(asian_lh, asian_X_train_lh, asian_y_train_lh, asian_X_test_lh, asian_y_test_lh)
print("Asian linked hurt: Naive Bayes")
test_model(asian_lh_bayes, asian_X_train_lh, asian_y_train_lh, asian_X_test_lh, asian_y_test_lh)
print("Asian linked hurt: XGBoost")
test_model(asian_lh_xgboost, asian_X_train_lh, asian_y_train_lh, asian_X_test_lh, asian_y_test_lh)

# African American newspapers

print("Black linked progress: Logistic regression")
test_model(black_lp, black_X_train_lp, black_y_train_lp, black_X_test_lp, black_y_test_lp)
print("Black linked progress: Naive Bayes")
test_model(black_lp_bayes, black_X_train_lp, black_y_train_lp, black_X_test_lp, black_y_test_lp)
print("Black linked progress: XGBoost")
test_model(black_lp_xgboost, black_X_train_lp, black_y_train_lp, black_X_test_lp, black_y_test_lp)

print("Black linked hurt: Logistic regression")
test_model(black_lh, black_X_train_lh, black_y_train_lh, black_X_test_lh, black_y_test_lh)
print("Black linked hurt: Naive Bayes")
test_model(black_lh_bayes, black_X_train_lh, black_y_train_lh, black_X_test_lh, black_y_test_lh)
print("Black linked hurt: XGBoost")
test_model(black_lh_xgboost, black_X_train_lh, black_y_train_lh, black_X_test_lh, black_y_test_lh)

Asian linked progress: Logistic regression
Accuracy: 0.9194630872483222 
Balanced accuracy: 0.9189980628012795 
Cohen's Kappa: 0.8387518600351715
Asian linked progress: Naive Bayes
Accuracy: 0.8221476510067114 
Balanced accuracy: 0.8223408568725503 
Cohen's Kappa: 0.6443914081145584
Asian linked progress: XGBoost
Accuracy: 0.87248322147651 
Balanced accuracy: 0.8720998333108079 
Cohen's Kappa: 0.7447364861818674
Asian linked hurt: Logistic regression
Accuracy: 0.9561128526645768 
Balanced accuracy: 0.9595375722543353 
Cohen's Kappa: 0.912248988092899
Asian linked hurt: Naive Bayes
Accuracy: 0.9059561128526645 
Balanced accuracy: 0.9052775358302321 
Cohen's Kappa: 0.810555071660464
Asian linked hurt: XGBoost
Accuracy: 0.9592476489028213 
Balanced accuracy: 0.961358777417056 
Cohen's Kappa: 0.9183002029196793
Black linked progress: Logistic regression
Accuracy: 0.8637873754152824 
Balanced accuracy: 0.8637086092715232 
Cohen's Kappa: 0.7275296403417747
Black linked progress: Naive Bayes
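Accuracy, balanced accuracy, and Cohen's kappa are reported together because they can diverge sharply when the classes are uneven. The sketch below uses made-up labels and a hypothetical classifier that always predicts the majority class: accuracy looks respectable, while the other two scores show that the predictions carry no information.

```python
from sklearn.metrics import (accuracy_score,
                             balanced_accuracy_score,
                             cohen_kappa_score)

# Made-up labels: 8 negatives and 2 positives.
y_true = [0] * 8 + [1] * 2

# A classifier that always predicts the majority class.
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)                    # share of correct predictions
balanced_accuracy = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall
kappa = cohen_kappa_score(y_true, y_pred)                    # agreement beyond chance

print(accuracy, balanced_accuracy, kappa)
```

Here accuracy is 0.8, balanced accuracy is 0.5 (recall of 1 for class 0 and 0 for class 1), and kappa is 0.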


## Prediction

### Function for predicting the unlabeled data

In [20]:

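def test_text(text, model):   
      
    # BOW model 
    # Use transform (not fit_transform) so the columns follow the vocabulary learned during training.
    # Caution: `vectorizer` is a shared global that every dtm_train_resample call refits,
    # so it holds the vocabulary from the most recent call.
    
    features = vectorizer.transform(text).todense()
    
    # Prediction
    
    preds = model.predict(features)
    
    return preds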
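Predictions are only meaningful when the unlabeled text is mapped onto the same vocabulary, in the same column order, that the model saw during training. A minimal sketch of that pattern on made-up documents follows. The names `toy_vectorizer`, `train_docs`, and `new_docs` are hypothetical, and the idea of keeping one fitted vectorizer per model is an assumption about how the pipeline could be organized, not a description of the code above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up labeled and unlabeled documents.
train_docs = ["housing progress for all", "progress in housing",
              "crime hurt the community", "the community was hurt"]
train_labels = [1, 1, 0, 0]
new_docs = ["progress on housing", "hurt by crime"]

# Learn the vocabulary once, from the training documents only.
toy_vectorizer = CountVectorizer(binary=True)
X_train = toy_vectorizer.fit_transform(train_docs)
model = LogisticRegression().fit(X_train, train_labels)

# Reuse the fitted vocabulary for new text; unseen words such as "on" are ignored.
X_new = toy_vectorizer.transform(new_docs)
preds = model.predict(X_new)

print(X_train.shape[1] == X_new.shape[1], preds)
```

Calling `fit_transform` on the new documents instead would build a different vocabulary, so column 0 of the new matrix would no longer refer to the same term as column 0 of the training matrix.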

### Label the unlabeled data

In [21]:
# Asian Americans 

asian_lp_full = test_text(asian_unlabeled['text'], asian_lp)
asian_lh_full = test_text(asian_unlabeled['text'], asian_lh)

# African Americans 

black_lp_full = test_text(black_unlabeled['text'], black_lp)
black_lh_full = test_text(black_unlabeled['text'], black_lh)

## Export classification results as CSV files 

I saved the classification results as CSV files to plot them in R. 

In [25]:

# Rename new columns  

asian_lp_data = pd.DataFrame(asian_lp_full).rename(columns = {0:'labeled_linked_progress'})
asian_lh_data = pd.DataFrame(asian_lh_full).rename(columns = {0:'labeled_linked_hurt'})
black_lp_data = pd.DataFrame(black_lp_full).rename(columns = {0:'labeled_linked_progress'})
black_lh_data = pd.DataFrame(black_lh_full).rename(columns = {0:'labeled_linked_hurt'})

# Save data 

asian_lp_data.to_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/asian_lp_data.csv")
asian_lh_data.to_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/asian_lh_data.csv")
black_lp_data.to_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/black_lp_data.csv")
black_lh_data.to_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/black_lh_data.csv")
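By default, `to_csv` also writes the row index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with pandas. The sketch below demonstrates this with an in-memory buffer and a made-up prediction column, and shows `index = False` as one way to leave the index out; whether to keep the index depends on how the files are used in R.

```python
import io
import pandas as pd

# Made-up predictions standing in for one of the exported data frames.
preds = pd.DataFrame({"labeled_linked_progress": [1, 1, 0]})

# Default: the index is written as an extra, unnamed column.
with_index = io.StringIO()
preds.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False writes only the data columns.
without_index = io.StringIO()
preds.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```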

This is what the final data looks like.

In [26]:
asian_lp_data.head()

| | labeled_linked_progress |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
