# Text classification 


## Loading libraries 

I adapted some Python code that my RAs developed for another project: https://github.com/jaeyk/ITS-Text-Classification/blob/master/code/05_classification.ipynb

Import only relevant libraries.

In [1]:
# Numpy and Pandas 

import numpy as np
import warnings
import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.preprocessing import StandardScaler

# Data visualization 

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# REGEX and NLP

import re
import string
!pip install nltk
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

# ML

from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB # Naive-Bayes
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression # Linear models
from xgboost import XGBClassifier # XG Boost
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import PCA
from sklearn.metrics import accuracy_score # Accuracy score
from sklearn.metrics import balanced_accuracy_score # Balanced accuracy score

# Interface
import tkinter as tk
from tkinter import filedialog

warnings.filterwarnings('ignore')



[nltk_data] Downloading package stopwords to /home/jae/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Load data



In [2]:
# Open file path

# root = tk.Tk()
# root.withdraw()

# file_path = filedialog.askopenfilename()

In [3]:

# The labeled data 

asian_sample = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/sample_asian.csv")
black_sample = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/sample_black.csv")

# The unlabeled data

asian_unlabeled = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/unlabeled_asian.csv")
black_unlabeled = pd.read_csv("/home/jae/content-analysis-for-evaluating-ML-performances/processed_data/unlabeled_black.csv")

Examine files.

In [4]:
# First five rows

asian_sample.head(5)

Unnamed: 0.1,Unnamed: 0,author,date,source,text,year,linked_progress,linked_hurt
0,1,"Lopez, Flora",1976-04-30,International Examiner,\n\n\n\n\n\n\n\nS.P.I.C.E. is a nutritional pr...,1976,1,0
1,2,,1976-09-30,International Examiner,\n\n\n\n\n\n\n\nCommunity control rather than ...,1976,0,0
2,3,,1976-06-30,International Examiner,"\n\n\n\n\n\n\n\n""Peasants of the Second Fortre...",1976,0,1
3,4,"Chin, Doug",1976-10-31,International Examiner,\n\n\n\n\n\n\n\nMuch of what is now the Intern...,1976,0,0
4,5,"Chow, Ron",1976-02-29,International Examiner,\n\n\n\n\n\n\n\nInternational District Housing...,1976,1,0


Select only relevant columns.

In [5]:
# Drop the first column

## Seen 

asian_sample = asian_sample.drop(['Unnamed: 0'], axis = 1)
black_sample = black_sample.drop(['Unnamed: 0'], axis = 1)

# An alternative way of doing this is asian_sample = asian_sample[['col1', 'col2']] 

## Unseen 

asian_unlabeled = asian_unlabeled.drop(['Unnamed: 0'], axis = 1)
black_unlabeled = black_unlabeled.drop(['Unnamed: 0'], axis = 1)

Convert the date column to datetime. This data type lets us extract components from the column; for instance, `asian_sample['date'].dt.year` returns the years.

In [6]:

# Seen data 
asian_sample["date"] = pd.to_datetime(asian_sample["date"])
black_sample["date"] = pd.to_datetime(black_sample["date"])

# Unseen data 
asian_unlabeled["date"] = pd.to_datetime(asian_unlabeled["date"])
black_unlabeled["date"] = pd.to_datetime(black_unlabeled["date"])
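As a minimal sketch of what the datetime conversion enables, the toy frame below (made-up dates, not the project data) pulls the year and month out of a converted column through the `.dt` accessor:

```python
import pandas as pd

# Toy frame using the same date format as the labeled data (values are illustrative)
toy = pd.DataFrame({"date": ["1976-04-30", "1976-09-30", "1977-02-28"]})
toy["date"] = pd.to_datetime(toy["date"])

# On a Series, datetime components are reached through the .dt accessor
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month

print(toy)
```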


In [7]:
# Check the size of each file

print("The N of labeled Asian American articles:", len(asian_sample['text']), "\n"
      "The N of labeled African American articles:", len(black_sample['text']), "\n"
      "The N of unlabeled Asian American articles:", len(asian_unlabeled['text']), "\n"
      "The N of unlabeled African American articles:", len(black_unlabeled['text']))

The N of labeled Asian American articles: 990 
The N of labeled African American articles: 1008 
The N of unlabeled Asian American articles: 7739 
The N of unlabeled African American articles: 37743


Note that the number of labeled Asian American articles is smaller because I removed 18 duplicates from the original sample.
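The notebook does not show how the duplicates were identified. One possible approach, assuming duplicates are rows with exactly the same `text`, is sketched below on a toy frame (the data and variable names are illustrative):

```python
import pandas as pd

# Toy labeled sample with one repeated article (illustrative data)
toy = pd.DataFrame({
    "text": ["article a", "article b", "article a", "article c"],
    "linked_progress": [1, 0, 1, 0],
})

n_before = len(toy)

# Assumption: two rows are duplicates when their text matches exactly
deduped = toy.drop_duplicates(subset="text", keep="first").reset_index(drop=True)

n_removed = n_before - len(deduped)
print("Removed", n_removed, "duplicate rows;", len(deduped), "remain")
```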

## Preprocessing

### Remove special characters, punctuation, whitespace, and stopwords

- I created a function for cleaning texts.
- Removing stop words did not increase performance in this case. (So, I commented it out.)

In [8]:

# stop_words = stopwords.words('english')

def clean_text(document):
    document = document.str.lower() # lower case
    document = document.str.replace('[\r?\n]','', regex = True) # remove carriage returns, newlines, and question marks
    document = document.str.replace('[^\\w\\s]','', regex = True) # remove punctuation
    document = document.str.replace('\\d+', '', regex = True) # remove digits
    document = document.str.strip() # remove leading and trailing whitespace
  #  document = document.apply(lambda x: " ".join([y for y in x.split() if y not in stop_words]))
    return document

Let's see how it works using one sample.

In [9]:
clean_text(black_sample['text']).head() # first 5 rows 

0    friday nov  at  pm rev l s rubin pastor at oli...
1    we have a large building an ante bellum buildi...
2    ktvus televoters were back to being pretty upt...
3    washington dc  washingtons appointed mayor wal...
4    spokesmen for the congress of racial equality ...
Name: text, dtype: object
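The output above still contains double spaces where digits were removed. A possible variant, not the function used in this notebook, replaces line breaks with a space and collapses runs of whitespace. The name `clean_text_spaced` and the toy input are hypothetical:

```python
import pandas as pd

def clean_text_spaced(document: pd.Series) -> pd.Series:
    """Variant of clean_text that keeps word boundaries at line breaks."""
    document = document.str.lower()
    document = document.str.replace(r"[\r\n]+", " ", regex=True)  # line breaks -> one space
    document = document.str.replace(r"[^\w\s]", "", regex=True)   # drop punctuation
    document = document.str.replace(r"\d+", "", regex=True)       # drop digits
    document = document.str.replace(r"\s+", " ", regex=True)      # collapse whitespace runs
    return document.str.strip()

toy = pd.Series(["Friday, Nov. 5 at 8 PM:\r\nRev. L.S. Rubin"])
cleaned = clean_text_spaced(toy)
print(cleaned[0])  # friday nov at pm rev ls rubin
```

Whether this changes classification performance on these corpora is untested here.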

Apply the function to each corpus.

In [10]:
# Seen

asian_sample['text'] = clean_text(asian_sample['text'])
black_sample['text'] = clean_text(black_sample['text'])

# Unseen

asian_unlabeled['text'] = clean_text(asian_unlabeled['text'])
black_unlabeled['text'] = clean_text(black_unlabeled['text'])

## Feature engineering

Here, we turn the texts into a document-term matrix. Each term is a feature in the model, and we aim to find the combination of features that best predicts the target values.

In [11]:

# Bag of Words (BOW)

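vectorizer = CountVectorizer(
    max_features = 5000, # keep the 5,000 most frequent terms (large enough here)
    min_df = 1, # keep terms that appear in at least one document
    ngram_range = (1,2), # unigrams and bigrams
    binary = True, # record presence or absence rather than counts
)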
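To make the document-term matrix concrete, here is a small sketch on two made-up sentences using the same `ngram_range` and `binary` settings. The documents and the name `toy_vectorizer` are illustrative only:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (illustrative, not from the corpora)
docs = ["housing for the community", "community housing housing"]

toy_vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
dtm = toy_vectorizer.fit_transform(docs)

# Columns follow the alphabetical order of the learned vocabulary
terms = sorted(toy_vectorizer.vocabulary_)
print(terms)
print(dtm.toarray())
```

Because `binary = True`, "housing" is recorded as 1 in the second document even though it occurs twice.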


In [12]:

# Document-term Matrix (DTM)

def dtm_train(data, text, column, year):
    
    # BOW model 
    
    features = vectorizer.fit_transform(data[text]).toarray() # dense NumPy array; np.matrix (from .todense()) is not accepted by newer scikit-learn

    # Response variable
    
    response = data[column].values # values 

    # Split into training and testing sets 

    X_train, X_test, y_train, y_test = train_test_split(features, response, 
                                                        test_size = 0.2, # training = 80%, test = 20%
                                                        random_state = 1234, # for reproducibility
                                                        stratify = data[year]) # stratifying by year
    
    # Label encode (normalize) response variable
    
    encoder = preprocessing.LabelEncoder()
    
    y_train = encoder.fit_transform(y_train)
    y_test = encoder.transform(y_test) # reuse the mapping learned on the training labels

    return(X_train, y_train, X_test, y_test)
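The split in `dtm_train` is stratified by year, not by the label. The sketch below, on a toy frame with made-up years and labels, illustrates what that guarantees: the test set keeps the year proportions of the full sample, while the label balance is left to chance.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy sample: one third of the rows from 1976, two thirds from 1977 (illustrative)
toy = pd.DataFrame({
    "year": [1976] * 10 + [1977] * 20,
    "label": [0, 1] * 15,
})
X = np.arange(len(toy)).reshape(-1, 1)  # placeholder features

X_train, X_test, y_train, y_test, year_train, year_test = train_test_split(
    X, toy["label"].values, toy["year"].values,
    test_size=0.2,          # 6 of the 30 rows go to the test set
    random_state=1234,
    stratify=toy["year"],   # preserve the share of each year in both sets
)

print(pd.Series(year_test).value_counts().sort_index())
```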


In [13]:
# Asian 

## Linked progress
asian_lp_dtm = dtm_train(asian_sample, 'text', 'linked_progress', 'year')
asian_X_train_lp = asian_lp_dtm[0]
asian_y_train_lp = asian_lp_dtm[1]
asian_X_test_lp = asian_lp_dtm[2]
asian_y_test_lp = asian_lp_dtm[3]

## Linked hurt 
asian_lh_dtm = dtm_train(asian_sample, 'text', 'linked_hurt', 'year')
asian_X_train_lh = asian_lh_dtm[0]
asian_y_train_lh = asian_lh_dtm[1]
asian_X_test_lh = asian_lh_dtm[2]
asian_y_test_lh = asian_lh_dtm[3]

# Black

## Linked progress
black_lp_dtm = dtm_train(black_sample, 'text', 'linked_progress', 'year')
black_X_train_lp = black_lp_dtm[0]
black_y_train_lp = black_lp_dtm[1]
black_X_test_lp = black_lp_dtm[2]
black_y_test_lp = black_lp_dtm[3]

## Linked hurt 
black_lh_dtm = dtm_train(black_sample, 'text', 'linked_hurt', 'year')
black_X_train_lh = black_lh_dtm[0]
black_y_train_lh = black_lh_dtm[1]
black_X_test_lh = black_lh_dtm[2]
black_y_test_lh = black_lh_dtm[3]


## Fit and evaluate a model

In [14]:

## fitting 

def fit_logistic_regression(X_train, y_train):
    model = LogisticRegression(fit_intercept = True, penalty = 'l1', solver = 'saga') # Lasso
    model.fit(X_train, y_train)
    return model

def fit_bayes(X_train, y_train):
    model = GaussianNB()
    model.fit(X_train, y_train)
    return model

def fit_xgboost(X_train, y_train):
    model = XGBClassifier(random_state = 42,
                         seed = 2, 
                         colsample_bytree = 0.6,
                         subsample = 0.7)
    model.fit(X_train, y_train)
    return model

## evaluating 

def test_model(model, X_train, y_train, X_test, y_test):
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    balanced_accuracy = balanced_accuracy_score(y_test, y_pred)
    print("Accuracy:", accuracy, "\n"
          "Balanced accuracy:", balanced_accuracy)
    

In [15]:
# asian_sample 

## linked progress
asian_lp = fit_logistic_regression(asian_X_train_lp, asian_y_train_lp)
asian_lp_bayes = fit_bayes(asian_X_train_lp, asian_y_train_lp)
asian_lp_xgboost = fit_xgboost(asian_X_train_lp, asian_y_train_lp)

## linked hurt 
asian_lh = fit_logistic_regression(asian_X_train_lh, asian_y_train_lh)
asian_lh_bayes = fit_bayes(asian_X_train_lh, asian_y_train_lh)
asian_lh_xgboost = fit_xgboost(asian_X_train_lh, asian_y_train_lh)

# black_sample 

## linked progress
black_lp = fit_logistic_regression(black_X_train_lp, black_y_train_lp)
black_lp_bayes = fit_bayes(black_X_train_lp, black_y_train_lp)
black_lp_xgboost = fit_xgboost(black_X_train_lp, black_y_train_lp)

## linked hurt 
black_lh = fit_logistic_regression(black_X_train_lh, black_y_train_lh)
black_lh_bayes = fit_bayes(black_X_train_lh, black_y_train_lh)
black_lh_xgboost = fit_xgboost(black_X_train_lh, black_y_train_lh)
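The evaluation helper `test_model` reports both accuracy and balanced accuracy. A toy case with made-up labels shows why the two can diverge when one class is rare: a model that always predicts the majority class looks accurate but has a balanced accuracy of only 0.5.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Made-up labels: 8 negatives, 2 positives
y_true = [0] * 8 + [1] * 2

# A degenerate classifier that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)           # share of correct predictions
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall

print("Accuracy:", acc)            # 0.8
print("Balanced accuracy:", bal)   # (1.0 + 0.0) / 2 = 0.5
```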

### Held-out test set check

In [16]:
# asian sample test 

print("Asian linked progress: Logistic regression")
test_model(asian_lp, asian_X_train_lp, asian_y_train_lp, asian_X_test_lp, asian_y_test_lp)
print("Asian linked progress: Naive Bayes")
test_model(asian_lp_bayes, asian_X_train_lp, asian_y_train_lp, asian_X_test_lp, asian_y_test_lp)
print("Asian linked progress: XGBoost")
test_model(asian_lp_xgboost, asian_X_train_lp, asian_y_train_lp, asian_X_test_lp, asian_y_test_lp)

print("Asian linked hurt: Logistic regression")
test_model(asian_lh, asian_X_train_lh, asian_y_train_lh, asian_X_test_lh, asian_y_test_lh)
print("Asian linked hurt: Naive Bayes")
test_model(asian_lh_bayes, asian_X_train_lh, asian_y_train_lh, asian_X_test_lh, asian_y_test_lh)
print("Asian linked hurt: XGBoost")
test_model(asian_lh_xgboost, asian_X_train_lh, asian_y_train_lh, asian_X_test_lh, asian_y_test_lh)

# black sample test

print("Black linked progress: Logistic regression")
test_model(black_lp, black_X_train_lp, black_y_train_lp, black_X_test_lp, black_y_test_lp)
print("Black linked progress: Naive Bayes")
test_model(black_lp_bayes, black_X_train_lp, black_y_train_lp, black_X_test_lp, black_y_test_lp)
print("Black linked progress: XGBoost")
test_model(black_lp_xgboost, black_X_train_lp, black_y_train_lp, black_X_test_lp, black_y_test_lp)

print("Black linked hurt: Logistic regression")
test_model(black_lh, black_X_train_lh, black_y_train_lh, black_X_test_lh, black_y_test_lh)
print("Black linked hurt: Naive Bayes")
test_model(black_lh_bayes, black_X_train_lh, black_y_train_lh, black_X_test_lh, black_y_test_lh)
print("Black linked hurt: XGBoost")
test_model(black_lh_xgboost, black_X_train_lh, black_y_train_lh, black_X_test_lh, black_y_test_lh)

Asian linked progress: Logistic regression
Accuracy: 0.7626262626262627 
Balanced accuracy: 0.6352216748768473
Asian linked progress: Naive Bayes
Accuracy: 0.6363636363636364 
Balanced accuracy: 0.5459359605911329
Asian linked progress: XGBoost
Accuracy: 0.7626262626262627 
Balanced accuracy: 0.615024630541872
Asian linked hurt: Logistic regression
Accuracy: 0.8636363636363636 
Balanced accuracy: 0.6732142857142858
Asian linked hurt: Naive Bayes
Accuracy: 0.8282828282828283 
Balanced accuracy: 0.5291666666666667
Asian linked hurt: XGBoost
Accuracy: 0.8636363636363636 
Balanced accuracy: 0.6321428571428571
Black linked progress: Logistic regression
Accuracy: 0.7772277227722773 
Balanced accuracy: 0.6223214285714286
Black linked progress: Naive Bayes
Accuracy: 0.7326732673267327 
Balanced accuracy: 0.6205357142857143
Black linked progress: XGBoost
Accuracy: 0.7970297029702971 
Balanced accuracy: 0.5997023809523809
Black linked hurt: Logistic regression
Accuracy: 0.8465346534653465 
Balan
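The scores above come from a single 80/20 split. A k-fold estimate would average over several splits; the sketch below shows one way to do that with `cross_val_score` on synthetic data. The data, the number of folds, and `max_iter` are assumptions for illustration, not settings from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary data standing in for a document-term matrix
X, y = make_classification(n_samples=200, n_features=20,
                           weights=[0.75, 0.25], random_state=1234)

# Same penalty and solver as fit_logistic_regression; max_iter is an assumption
model = LogisticRegression(penalty="l1", solver="saga", max_iter=5000)

# Five stratified folds keep the class ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1234)
scores = cross_val_score(model, X, y, cv=cv, scoring="balanced_accuracy")

print("Mean balanced accuracy:", scores.mean(), "SD:", scores.std())
```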

## Prediction

In [17]:

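def test_text(text, model):   
      
    # BOW model
    # Note: fit_transform rebuilds the vocabulary from `text`, so the columns
    # match the model's training features only if both vocabularies coincide.
    
    features = vectorizer.fit_transform(text).toarray()
    
    preds = model.predict(features)
    
    return preds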
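A common pattern for scoring new text is to fit the vectorizer once on the labeled text and call only `transform` on the new text, so that every column means the same term at training and prediction time. The sketch below is a minimal illustration on made-up sentences, not a drop-in replacement for `test_text`; all names and data are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up labeled and unlabeled texts
train_texts = ["community housing progress", "housing crisis hurts",
               "progress for community", "crisis hurts families"]
train_labels = [1, 0, 1, 0]
new_texts = ["new housing progress", "families in crisis"]

vec = CountVectorizer(binary=True)
X_train = vec.fit_transform(train_texts)   # learn the vocabulary on labeled text only
clf = LogisticRegression().fit(X_train, train_labels)

X_new = vec.transform(new_texts)           # reuse that vocabulary; unseen words are ignored
preds = clf.predict(X_new)

# Refitting on the new text would yield a different vocabulary
refit_vocab = CountVectorizer(binary=True).fit(new_texts).vocabulary_
print(X_new.shape[1] == X_train.shape[1], refit_vocab == vec.vocabulary_)
```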

## Label the entire data

In [18]:
# asian
## unlabeled
asian_lp_full = test_text(asian_unlabeled['text'], asian_lp)
asian_lh_full = test_text(asian_unlabeled['text'], asian_lh)

# black
## unlabeled 
black_lp_full = test_text(black_unlabeled['text'], black_lp)
black_lh_full = test_text(black_unlabeled['text'], black_lh)


In [19]:
# The original 

print("asian linked progress:", sum(asian_sample['linked_progress']),
      "asian linked hurt:", sum(asian_sample['linked_hurt']),
      "black linked progress:", sum(black_sample['linked_progress']),
      "black linked hurt:", sum(black_sample['linked_hurt']))

asian linked progress: 254 asian linked hurt: 147 black linked progress: 257 black linked hurt: 196


In [20]:
# The machine coded

print("asian linked progress:", sum(test_text(asian_sample['text'], asian_lp)),
      "asian linked hurt:", sum(test_text(asian_sample['text'], asian_lh)),
      "black linked progress:", sum(test_text(black_sample['text'], black_lp)),
      "black linked hurt:", sum(test_text(black_sample['text'], black_lh)))

asian linked progress: 220 asian linked hurt: 138 black linked progress: 246 black linked hurt: 174


In [21]:

print("asian linked progress:", sum(asian_lp_full),
      "asian linked hurt:", sum(asian_lh_full),
      "black linked progress:", sum(black_lp_full),
      "black linked hurt:", sum(black_lh_full))

print("asian progress / hurt", round(sum(asian_lp_full)/sum(asian_lh_full),2),
      "black progress / hurt", round(sum(black_lp_full)/sum(black_lh_full),2))


asian linked progress: 2369 asian linked hurt: 2976 black linked progress: 15087 black linked hurt: 24212
asian progress / hurt 0.8 black progress / hurt 0.62
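Raw counts are hard to compare across corpora of different sizes. As a rough sketch, the counts and corpus sizes printed earlier can be turned into shares of positive labels; the dictionary names are hypothetical and the numbers are copied from the outputs above.

```python
# Corpus sizes and positive counts copied from the printed outputs above
labeled_n = {"asian": 990, "black": 1008}
unlabeled_n = {"asian": 7739, "black": 37743}

hand_coded = {"asian_lp": 254, "asian_lh": 147, "black_lp": 257, "black_lh": 196}
machine_coded = {"asian_lp": 2369, "asian_lh": 2976, "black_lp": 15087, "black_lh": 24212}

def shares(counts, sizes):
    """Divide each count by the size of the corpus named in the key prefix."""
    return {key: count / sizes[key.split("_")[0]] for key, count in counts.items()}

labeled_share = shares(hand_coded, labeled_n)
unlabeled_share = shares(machine_coded, unlabeled_n)

for key in hand_coded:
    print(key, "labeled:", round(labeled_share[key], 3),
          "unlabeled (predicted):", round(unlabeled_share[key], 3))
```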


## Export results as csv files 

I saved the results as csv files to plot them in R.

In [22]:

# Asian
#asian_lp_data = pd.DataFrame(asian_lp_full).rename(columns = {0:'labeled_linked_progress'})
#asian_lp_data.to_csv("/home/jae/linked_fate_evolution/Output/asian_lp_data.csv")

#asian_lh_data = pd.DataFrame(asian_lh_full).rename(columns = {0:'labeled_linked_hurt'})
#asian_lh_data.to_csv("/home/jae/linked_fate_evolution/Output/asian_lh_data.csv")

# Black
#black_lp_data = pd.DataFrame(black_lp_full).rename(columns = {0:'labeled_linked_progress'})
#black_lp_data.to_csv("/home/jae/linked_fate_evolution/Output/black_lp_data.csv")

#black_lh_data = pd.DataFrame(black_lh_full).rename(columns = {0:'labeled_linked_hurt'})
#black_lh_data.to_csv("/home/jae/linked_fate_evolution/Output/black_lh_data.csv")
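The export step can be sketched as follows, using a stand-in prediction array and a temporary directory in place of the real output path. Passing `index=False` is an assumption made here so that R does not read the row index as an extra column.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Stand-in for a prediction array such as asian_lp_full (illustrative values)
preds = np.array([1, 0, 0, 1])

# Hypothetical output location; replace with the project's output directory
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "asian_lp_data.csv"

# One named column, no index column
pd.DataFrame({"labeled_linked_progress": preds}).to_csv(out_path, index=False)

# Read the file back to confirm what was written
round_trip = pd.read_csv(out_path)
print(round_trip)
```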