<a href="https://colab.research.google.com/github/mkane968/Text-Mining-with-Student-Papers/blob/main/Student_Essay_Text_Mining_Master_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Master Pipeline: Text Mining Student Essays

This notebook lets users clean, explore and analyze student academic essays.

First, any corpus of student essays can be loaded into a single dataframe and linked to its metadata in a master dataframe.

Then, basic cleaning and enrichment can be performed on the corpus. 

Using a combination of natural language processing techniques, each essay can then be searched for **key words and phrases** associated with students' **uptake of certain first-year writing program outcomes.** Examples include: 
*   `Rhetorical analysis terms like "pathos", "ethos", "logos"` &rarr; *To learn to employ rhetorical terms and strategies and strengthen your ability to analyze rhetorical techniques in published essays and visual texts.*
*   `Regular expressions associated with in-text citations` &rarr; *To learn to employ academic evidence*
*   `Terms from stance word dictionary` &rarr; *To develop competent academic arguments over multiple drafts*

*Outcomes from ENG 701 Syllabus, Temple University First-Year Writing Program, Revised June 2022*

Once texts have been searched for these outcomes, their surrounding context can be retrieved and stored for further analysis. Since each text is associated with its grade metadata, researchers can investigate trends in language patterns and their differences across high (A), mid-range (B-C) and low-scoring (D-F) essays. This can be done using natural-language processing tools in this notebook as well as [the DocuScope tool.](https://www.cmu.edu/dietrich/english/research-and-publications/docuscope.html)
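
As a minimal sketch of the grade grouping described above, the function below buckets a numeric score into high, mid-range, or low. The 100-point scale and the cutoffs of 90 and 70 are assumptions for illustration only; `grade_band` is a hypothetical helper, not part of this notebook or of program policy.

```python
def grade_band(score, high_cutoff=90.0, low_cutoff=70.0):
    """Bucket a numeric portfolio score into 'high', 'mid', or 'low'.

    The cutoffs are placeholders (assumed 100-point scale), not program policy.
    """
    if score >= high_cutoff:
        return "high"   # roughly the A range under the assumed cutoffs
    if score >= low_cutoff:
        return "mid"    # roughly the B-C range
    return "low"        # roughly the D-F range


scores = [95, 88, 72, 64]
bands = [grade_band(s) for s in scores]
print(bands)
```

Adjusting `high_cutoff` and `low_cutoff` to the program's actual grading scale would be the first step before comparing language patterns across bands.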


This code can support program instructors who seek to gauge patterns of student performance on valued writing program outcomes, articulate those outcomes to instructors, and identify areas where outcomes can be refined, reshaped, or expanded to better capture what is valued about students’ languaging practices. 


## 1. Install Packages

In [3]:
#Import Colab helpers for mounting Google Drive and uploading/downloading files
from google.colab import drive
from google.colab import files

#Import glob (file path matching)
import glob 

#Import pandas
import pandas as pd

#Import numpy
import numpy as np

#Import the Natural Language Toolkit (NLTK), which is needed to download NLTK packages and libraries
#!pip install nltk
import nltk

#Download the tokenizer models and import the tokenizers
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize

#Download the stopword list and import tools to clean text
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

#Import spaCy itself, necessary to use its features 
#!pip install spacy
import spacy
#Load the natural language processing pipeline
nlp = spacy.load("en_core_web_sm")
#Import the spaCy visualizer
from spacy import displacy

#Import matplotlib for visualizations
import matplotlib.pyplot as plt


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


## 2. Add Student Essays and Metadata to DataFrame

### Upload all student essays from local folder and add to dataframe

In [4]:
#Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Add files to upload from local machine
uploaded = files.upload()

In [None]:
essays = pd.DataFrame.from_dict(uploaded, orient='index')
essays.head()

In [None]:
#Reset index and add column names to make wrangling easier
essays = essays.reset_index()
essays.columns = ["ID", "Text"]
essays


In [None]:
#Decode bytes to text; 'utf-8-sig' also strips the byte order mark (b'\xef\xbb\xbf') if present
essays['Text'] = essays['Text'].apply(lambda x: x.decode('utf-8-sig'))
essays.head()
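
The decoding step can be checked on a small byte string. This sketch uses made-up sample bytes (an assumption, not data from the corpus) to show that plain `utf-8` decoding keeps the byte order mark as the character `\ufeff`, while `utf-8-sig` removes it, and that collapsing whitespace afterwards removes the line breaks.

```python
import re

# Hypothetical file contents: a UTF-8 byte order mark followed by two lines.
raw = b"\xef\xbb\xbfFirst line\r\nSecond line"

plain = raw.decode("utf-8")         # the BOM survives as '\ufeff'
stripped = raw.decode("utf-8-sig")  # the BOM is dropped during decoding

# Collapse carriage returns, newlines, and other whitespace runs into one space.
cleaned = re.sub(r"\s+", " ", stripped)
print(repr(plain[:6]), repr(cleaned))
```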

In [None]:
#Replace literal "\r" and "\n" sequences, then collapse newlines and other whitespace runs into single spaces
essays['Text'] = essays['Text'].str.replace(r'\\r|\\n', ' ', regex=True) 
essays['Text'] = essays['Text'].str.replace(r'\s+', ' ', regex=True) 
essays

### Remove identifying information (student/instructor names) from each paper
This step is imperfect: headers are not standardized across papers, some students end with the professor's name or other information, and student names remain visible when they are referenced in the papers themselves.



In [None]:
#Remove identifying information from ID
#Remove any occurrences of "LATE_" from the filenames (otherwise they will skew ID cleaning)
essays['ID'] = essays['ID'].str.replace(r'LATE_', '', regex=True) 

#Split ID on underscores (_) and keep only the text between the first and second underscore (ID number)
start = essays["ID"].str.split("_", expand = True)
essays['ID'] = start[1]
essays['ID'] = essays['ID'].astype(int)
essays
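
The ID cleaning above can be tried on individual filenames before running it on the whole dataframe. This is a sketch only: the filename layout (`name_ID_othernumber_title`) and the sample filenames are assumptions for illustration, and `extract_id` is a hypothetical helper.

```python
def extract_id(filename):
    """Return the numeric ID between the first and second underscore, or None.

    Assumes filenames shaped like 'name_ID_...'; 'LATE_' markers are removed first.
    """
    cleaned = filename.replace("LATE_", "")
    parts = cleaned.split("_")
    # Guard against filenames that do not follow the assumed layout.
    if len(parts) < 2 or not parts[1].isdigit():
        return None
    return int(parts[1])


print(extract_id("doejane_LATE_12345_678901_Essay.txt"))
print(extract_id("notes.txt"))
```

Returning `None` for unexpected filenames makes it easier to spot files that would otherwise raise an error during the integer conversion.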

In [None]:
#Remove headers containing student name, instructor name, course name and date
#Split text on the first "22" (the year 2022 will likely be the last value in the header) and add all contents before it to a new column
headers = essays["Text"].str.split("22", n=1, expand = True)
essays['Header'] = headers[0]
print(essays['Header'])

#Add "22" back to the header column so the full year is removed along with the header
essays['Header'] = essays['Header'] + '22'
essays['Header'][0]

In [None]:
#Remove any occurrences of the header from the rest of the text in each cell (should be at top of each essay in portfolio)
essays['Text_NoHeaders'] = essays.apply(lambda row : row['Text'].replace(str(row['Header']), ''), axis=1)
essays['Text_NoHeaders']

In [None]:
#Remove old text and header columns from dataframe 
essays = essays.drop(columns=['Text', 'Header'])
essays


### Upload essay grades and additional metadata to second dataframe

In [None]:
#Upload csvs with essay metadata
uploaded_grades = files.upload()

In [None]:
#Path where the uploaded csv files are stored in the Colab runtime
local_path = r'/content'

#Create variable to store all csvs in path
filenames = glob.glob(local_path + "/*.csv")

#Create df list for all csvs
dfs = [pd.read_csv(filename) for filename in filenames]

# Concatenate all data into one DataFrame
metadata = pd.concat(dfs, ignore_index=True)

#Leave the grade columns numeric (they are summed in a later step); only the Student column is filtered as text

metadata

In [None]:
#Drop header rows(Points Possible) and test student rows (Student, Test)
metadata = metadata[metadata['Student'].str.contains('Points Possible|Student, Test')==False]
metadata


In [None]:
#Get all column names
for col in metadata.columns:
    print(col)

#Choose which columns to keep (ID, Section, and the Final Portfolio columns with numbers after them are chosen here)
metadata = metadata[['ID', 'Section', "Final Portfolio (1689777)", "Final Portfolio (1676963)"]]
metadata

In [None]:
#Replace all NaN values with 0 
metadata = metadata.replace(np.nan, 0)

In [None]:
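#Create new final portfolio column with all values
#Assumes each student has a score in only one section's column, so after NaN is replaced with 0 the sum equals that score
#With more than two sections, every Final Portfolio column would need to be added here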
metadata['Portfolio Score'] = metadata['Final Portfolio (1689777)'] + metadata['Final Portfolio (1676963)']
metadata
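
One way to handle more than two gradebook files is to sum every column whose name starts with "Final Portfolio" instead of naming each one. This is a sketch under the same assumption as the cell above (each student has a score in only one of those columns); the column numbers and scores in `metadata_demo` are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical metadata with three section-specific portfolio columns.
metadata_demo = pd.DataFrame({
    "ID": [1, 2, 3],
    "Final Portfolio (111)": [92.0, np.nan, np.nan],
    "Final Portfolio (222)": [np.nan, 85.0, np.nan],
    "Final Portfolio (333)": [np.nan, np.nan, 71.5],
})

# Collect every portfolio column by name prefix, however many there are.
portfolio_cols = [c for c in metadata_demo.columns if c.startswith("Final Portfolio")]

# Row-wise sum that skips NaN; min_count=1 keeps NaN for rows with no score at all.
metadata_demo["Portfolio Score"] = metadata_demo[portfolio_cols].sum(axis=1, min_count=1)
print(metadata_demo[["ID", "Portfolio Score"]])
```

With `min_count=1`, a student with no portfolio score stays `NaN` rather than becoming 0, which keeps missing grades distinguishable from real zeros.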


In [None]:
#Drop grade columns for individual classes (copy so later column edits do not trigger a SettingWithCopyWarning)
clean_metadata = metadata[['ID', 'Section', "Portfolio Score"]].copy()
clean_metadata

In [None]:
#Drop decimal from ID (inconsistent with ID in essay dataframe)
clean_metadata['ID'] = clean_metadata['ID'].astype(int)

#Check cleaned DF one more time
clean_metadata


### Merge essays and grade metadata into one dataframe

In [None]:
#Merge metadata and cleaned essays into new dataframe
#Will only keep rows where both essay and metadata are present
essays_grades_df = clean_metadata.merge(essays,on='ID')
essays_grades_df
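
Because the merge keeps only IDs present in both dataframes, it can be useful to see which rows were dropped. The sketch below uses small made-up dataframes (`meta` and `texts` are hypothetical stand-ins for the metadata and essay dataframes) and the `indicator` option of `merge` to list unmatched IDs.

```python
import pandas as pd

# Hypothetical stand-ins for the metadata and essay dataframes.
meta = pd.DataFrame({"ID": [1, 2, 3], "Portfolio Score": [92.0, 85.0, 71.5]})
texts = pd.DataFrame({"ID": [2, 3, 4],
                      "Text_NoHeaders": ["essay two", "essay three", "essay four"]})

# An outer merge with indicator=True adds a '_merge' column
# marking each row as 'left_only', 'right_only', or 'both'.
check = meta.merge(texts, on="ID", how="outer", indicator=True)
unmatched = check[check["_merge"] != "both"]
print(unmatched[["ID", "_merge"]])

# The default (inner) merge keeps only IDs found in both dataframes.
matched = meta.merge(texts, on="ID")
print(matched["ID"].tolist())
```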

In [None]:
#Save new df to csv and download
essays_grades_df.to_csv('essays_grades.csv') 
files.download('essays_grades.csv')

## 3. Clean and Enrich Student Essays 

### Cleaning: Lowercasing, Punctuation Removal, and Stopword Removal

In [None]:
#Assumption: 'Text' should hold the header-free essay text, since later cells read from this column
essays_grades_df['Text'] = essays_grades_df['Text_NoHeaders']

#Lowercase all words
essays_grades_df['LC_Text'] = essays_grades_df['Text'].str.lower()

#Remove punctuation and replace with no space (except periods, hyphens and apostrophes)
essays_grades_df['NoPunct_Text'] = essays_grades_df['LC_Text'].str.replace(r'[^\w\-\.\'\s]+', '', regex = True)

#Replace periods with a space (to prevent incorrect compounds)
essays_grades_df['NoPunct_Text'] = essays_grades_df['NoPunct_Text'].str.replace(r'[^\w\-\'\s]+', ' ', regex = True)

#Remove stopwords from the lowercased text
stop_words = set(stopwords.words("english"))
essays_grades_df['no_stops'] = essays_grades_df['LC_Text'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

#Check output
essays_grades_df.head()
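
To see what the cleaning steps do to a single sentence, the sketch below chains lowercasing, the two punctuation regexes from the cell above, and stopword removal. The stopword set is a tiny stand-in (an assumption) so the example runs without NLTK; `clean_text` and the sample sentence are hypothetical.

```python
import re

# Tiny stand-in stopword list (assumption); the notebook uses NLTK's English list.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}


def clean_text(text):
    """Lowercase, strip punctuation, and drop stopwords from one string."""
    lowered = text.lower()
    # Drop punctuation except periods, hyphens, and apostrophes.
    no_punct = re.sub(r"[^\w\-\.\'\s]+", "", lowered)
    # Turn periods into spaces so "timed.readers" does not fuse into one token.
    no_punct = re.sub(r"[^\w\-\'\s]+", " ", no_punct)
    return " ".join(w for w in no_punct.split() if w not in STOP_WORDS)


sample = "The author's use of pathos is well-timed.Readers respond (quickly)!"
print(clean_text(sample))
```

Here punctuation is removed before the stopword filter, which is one possible ordering; it lets stopwords followed by a comma or parenthesis be matched as well.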

### Enrichment: Lemmatization, Part-of-Speech Tagging, and Named Entity Recognition

In [None]:
#Make a separate dataframe for enriched data (copy so the cleaned dataframe is not modified)
enriched_essays_grades_df = essays_grades_df.copy()

#Get lemmas
lemma_list = []

# Disable the dependency parser and NER, since lemmatization does not need them
with nlp.disable_pipes('parser', 'ner'):
  #Iterate through each doc object, collect each token's lemma, and append the lemmas to the list
  for doc in nlp.pipe(enriched_essays_grades_df.Text.astype(str).values, batch_size=100):
    word_list = []
    for token in doc:
        word_list.append(token.lemma_)
        
    lemma_list.append(word_list)

#Make lemma list a new column in dataframe
enriched_essays_grades_df['lemma_list'] = lemma_list
enriched_essays_grades_df.head()


In [None]:
#Get part of speech tags
pos_list = []

# Disable the dependency parser and NER, since all we want is POS
with nlp.disable_pipes('parser', 'ner'):
  #Iterate through each doc object and tag POS, append POS to list
  for doc in nlp.pipe(enriched_essays_grades_df.Text.astype(str).values, batch_size=100):
    word_list = []
    for token in doc:
        word_list.append(token.pos_)
        
    pos_list.append(word_list)

#Make pos list a new column in dataframe
enriched_essays_grades_df['pos_list'] = pos_list

#Check pos tags
enriched_essays_grades_df.head()


In [None]:
#Get dependency parsing for single doc and visualize
doc = nlp(enriched_essays_grades_df.Text_NoHeaders[0]) 
print(doc)

displacy.render(doc, style="dep", jupyter=True)

In [None]:
#Get named entities
ent_list = []

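with nlp.disable_pipes('tagger', 'parser'):
    for doc in nlp.pipe(enriched_essays_grades_df.Text.astype(str).values, batch_size=100):
        #Store plain (text, label) pairs rather than spaCy Span objects so the column saves cleanly to csv
        ent_list.append([(ent.text, ent.label_) for ent in doc.ents])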

enriched_essays_grades_df['ent_list'] = ent_list

#Check named entities
enriched_essays_grades_df.head()


In [None]:
#Get named entities in a single document and visualize
doc = nlp(enriched_essays_grades_df.Text[0]) 

displacy.render(doc, style="ent", jupyter=True)

In [None]:
#Download enriched df
enriched_essays_grades_df.to_csv('essays_grades_enriched.csv') 
files.download('essays_grades_enriched.csv')

### Interlude: Some preliminary analyses of essay length, part-of-speech counts, and named entities

(Work in progress)

In [None]:
#Get word count of each text (split on whitespace; len(x) alone would count characters)
essays_grades_df['Length'] = essays_grades_df['Text'].apply(lambda x: len(x.split()))
essays_grades_df.head()

In [None]:
#Graph portfolio grade by length
import matplotlib.pyplot as plt

essays_grades_df = essays_grades_df.sort_values(by=['Portfolio Score'], ascending=True)

essays_grades_df.plot(kind='bar',x='Portfolio Score',y='Length')

## 4. Section Texts Based on Outcomes-Related Keywords

### Get Sections of Each Essay Containing Rhetorical Analysis Terms
**Related Outcome:** To learn to employ rhetorical terms and strategies and strengthen your ability to analyze rhetorical techniques in published essays and visual texts.



In [None]:
#We only need one version of the cleaned text for this analysis; the stopword-free version is kept here
df_rhetorical = essays_grades_df.drop(["LC_Text", "Text", "NoPunct_Text"], axis=1)
df_rhetorical.head()

In [None]:
#Set up column for score plus ID
df_rhetorical['ID + Score'] = df_rhetorical['ID'].astype(str) + '_' + df_rhetorical['Portfolio Score'].astype(str)

#Count number of occurrences of rhetorical terms in each paper and store the counts as columns
df_rhetorical['Pathos_Counts'] = df_rhetorical['no_stops'].str.count('pathos')
df_rhetorical['Ethos_Counts'] = df_rhetorical['no_stops'].str.count('ethos')
df_rhetorical['Logos_Counts'] = df_rhetorical['no_stops'].str.count('logos')

df_rhetorical.head()

In [None]:
#Graph number of pathos, ethos and logos mentions across essays
#https://plotly.com/python/bar-charts/
import plotly.graph_objects as go

fig = go.Figure(data=[
    go.Bar(name='Pathos Counts', x=df_rhetorical["Portfolio Score"], y=df_rhetorical["Pathos_Counts"]),
    go.Bar(name='Ethos Counts', x=df_rhetorical["Portfolio Score"], y=df_rhetorical["Ethos_Counts"]),
    go.Bar(name='Logos Counts', x=df_rhetorical["Portfolio Score"], y=df_rhetorical["Logos_Counts"])
])
# Change the bar mode
fig.update_layout(barmode='stack')
fig.show()

In [None]:
#Use concordancing to get context around each rhetorical term

def concordance(ci, word, width=400, lines=25):
    """
    Rewrite of nltk.text.ConcordanceIndex.print_concordance that returns results
    instead of printing them. 

    See:
    http://www.nltk.org/api/nltk.html#nltk.text.ConcordanceIndex.print_concordance
    """
    half_width = (width - len(word) - 2) // 2
    context = width // 2 # approx number of words of context

    results = []
    offsets = ci.offsets(word)
    if offsets:
        lines = min(lines, len(offsets))
        for i in offsets:
            if lines <= 0:
                break
            start = max(0, i - context)  # clamp so early matches do not wrap around to the end of the text
            left = (' ' * half_width +
                    ' '.join(ci._tokens[start:i]))
            right = ' '.join(ci._tokens[i+1:i+context])
            left = left[-half_width:]
            right = right[:half_width]
            results.append('%s %s %s' % (left, ci._tokens[i], right))
            lines -= 1

    return results
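
The idea behind concordancing can be shown without NLTK on a plain token list. This is a simplified sketch, not the function above: `keyword_in_context` is a hypothetical helper that measures context in tokens rather than characters, and the sample sentence is made up. It also shows why the left boundary should be clamped at 0: a negative slice start counts from the end of the list.

```python
def keyword_in_context(tokens, word, window=5):
    """Return one context string per occurrence of `word` in `tokens`."""
    contexts = []
    for i, token in enumerate(tokens):
        if token == word:
            start = max(0, i - window)  # clamp so early hits do not wrap around
            left = tokens[start:i]
            right = tokens[i + 1:i + 1 + window]
            contexts.append(" ".join(left + [token] + right))
    return contexts


tokens = "author builds pathos through vivid stories while pathos fades later".split()
hits = keyword_in_context(tokens, "pathos", window=2)
for hit in hits:
    print(hit)

# Without clamping, a negative start index silently returns the wrong slice.
print(tokens[2 - 5:2])
```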

In [None]:
#Get context around each instance of pathos in each essay and append to dataframe
from nltk.text import ConcordanceIndex

pathos_results = []
for text in df_rhetorical['no_stops']:
  ci = ConcordanceIndex((word_tokenize(text)))
  results = concordance(ci, 'pathos')
  pathos_results.append(results)

pathos_df = pd.DataFrame(pathos_results)


#Use .values so labels are assigned by row position rather than aligned on the (sorted) index of df_rhetorical
pathos_df.insert(loc = 0,
          column = 'ID_Score',
          value = df_rhetorical['ID + Score'].values)

pathos_df.head()


In [None]:
#Associate each use of pathos with score and reset index, tidy column names
pathos_df = pathos_df.set_index('ID_Score')
pathos_clean = pathos_df.stack().reset_index()

pathos_clean.columns = ["ID_Score","Pathos_Count","Pathos_Context"]

pathos_clean.head()


In [None]:
#Repeat the above steps for ethos concordancing

#Get context around each instance of ethos in each essay and append to dataframe
ethos_results = []
for text in df_rhetorical['no_stops']:
  ci = ConcordanceIndex((word_tokenize(text)))
  results = concordance(ci, 'ethos')
  ethos_results.append(results)

ethos_df = pd.DataFrame(ethos_results)


#Use .values so labels are assigned by row position rather than aligned on index
ethos_df.insert(loc = 0,
          column = 'ID_Score',
          value = df_rhetorical['ID + Score'].values)

#Associate each instance with score and reset index, tidy column names
ethos_df = ethos_df.set_index('ID_Score')
ethos_clean = ethos_df.stack().reset_index()

ethos_clean.columns = ["ID_Score","Ethos_Count","Ethos_Context"]

ethos_clean.head()


In [None]:
#Repeat the above steps for logos concordancing

#Get context around each instance of logos in each essay and append to dataframe
logos_results = []
for text in df_rhetorical['no_stops']:
  ci = ConcordanceIndex((word_tokenize(text)))
  results = concordance(ci, 'logos')
  logos_results.append(results)

logos_df = pd.DataFrame(logos_results)


#Use .values so labels are assigned by row position rather than aligned on index
logos_df.insert(loc = 0,
          column = 'ID_Score',
          value = df_rhetorical['ID + Score'].values)

#Associate each instance with score and reset index, tidy column names
logos_df = logos_df.set_index('ID_Score')
logos_clean = logos_df.stack().reset_index()

logos_clean.columns = ["ID_Score","Logos_Count","Logos_Context"]

logos_clean.head()

In [None]:
#Combine into single dataframe
import functools as ft
rhetorical_dfs = [logos_clean, pathos_clean, ethos_clean]
rhetorical_df_final = ft.reduce(lambda left, right: pd.merge(left, right, on='ID_Score'), rhetorical_dfs)

#Clean by removing duplicate values of each (replicated during merge)
rhetorical_df_final.loc[rhetorical_df_final['Pathos_Context'].duplicated(), 'Pathos_Context'] = 'None'
rhetorical_df_final.loc[rhetorical_df_final['Logos_Context'].duplicated(), 'Logos_Context'] = 'None'
rhetorical_df_final.loc[rhetorical_df_final['Ethos_Context'].duplicated(), 'Ethos_Context'] = 'None'

rhetorical_df_final
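
Merging the three tables on `ID_Score` keeps only essays that contain all three terms and repeats contexts across rows, which is why duplicates need to be blanked afterwards. An alternative layout, sketched here as an assumption rather than the notebook's method, is a long table with one row per term occurrence. `context_rows` and the sample essays are hypothetical, and context is measured in tokens.

```python
TERMS = ["pathos", "ethos", "logos"]


def context_rows(essays, terms=TERMS, window=3):
    """Return one dict per term occurrence with keys ID_Score, Term, Context."""
    rows = []
    for id_score, text in essays.items():
        tokens = text.split()
        for i, token in enumerate(tokens):
            if token in terms:
                start = max(0, i - window)  # clamp the left edge at the start of the essay
                context = " ".join(tokens[start:i + 1 + window])
                rows.append({"ID_Score": id_score, "Term": token, "Context": context})
    return rows


# Hypothetical cleaned essays keyed by their 'ID_Score' label.
essays = {
    "101_95.0": "writer leans pathos early then ethos later",
    "102_72.0": "essay mentions logos once",
}
rows = context_rows(essays)
for row in rows:
    print(row)
```

A list of dicts like `rows` can be passed directly to `pd.DataFrame(rows)` to get a dataframe with the three columns.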


In [None]:
#Download file to csv
rhetorical_df_final.to_csv('rhetorical_context_df.csv', encoding = 'utf-8-sig') 
files.download('rhetorical_context_df.csv')

### Get Sections of Each Essay Containing Citations
**Related Outcome:** To learn to employ academic evidence

This section does not work yet. The plan is to adapt this code: https://levelup.gitconnected.com/count-citations-in-a-word-document-with-python-and-regular-expressions-d068218c50b9

In [None]:
#https://ideone.com/IqZvxm

import re
pattern = r'\(([^"\)]*|\bAnonymous\b|"[^"\)]*")(, )([\d]+|n\.d\.|[\d]+[\w])\)'


# Find citation matches and store the matched strings for each essay
# (re.finditer returns a one-time iterator, so the matches are collected into a list)
citation_results = []
for text in essays_grades_df['Text_NoHeaders']:
  results = [match.group(0) for match in re.finditer(pattern, text)]
  citation_results.append(results)

citation_results
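
The pattern above can be tested on a short string before it is applied to the essays. This sketch uses the same regular expression with made-up citations (the author names and titles are placeholders); `find_citations` is a hypothetical helper. It also shows one limitation: parenthetical citations without a comma before the year or page number are not matched.

```python
import re

# Same pattern as in the cell above: (Author, year), (Author, n.d.), ("Title", 2019a).
CITATION_PATTERN = re.compile(
    r'\(([^"\)]*|\bAnonymous\b|"[^"\)]*")(, )([\d]+|n\.d\.|[\d]+[\w])\)'
)


def find_citations(text):
    """Return every matched citation string in `text`, in order of appearance."""
    return [m.group(0) for m in CITATION_PATTERN.finditer(text)]


sample = ('Writing is social (Doe, 2020). Others disagree (Roe & Poe, n.d.), '
          'and ("Sample Title", 2019a) adds data.')
found = find_citations(sample)
print(found)

# A citation with no comma before the number is not captured by this pattern.
print(find_citations("A page-number citation (Doe 12) is missed."))
```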

### Get Sections of Each Essay Containing Stance Words
**Related Outcome:** To develop competent academic arguments

This section does not work yet. The plan is to adapt methods from:

*   https://www.sciencedirect.com/science/article/abs/pii/S147515851730005X
*   https://journals-sagepub-com.libproxy.temple.edu/doi/full/10.1177/0741088314527055
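
Until those methods are adapted, a dictionary lookup gives a rough starting point. The sketch below counts stance words by category in one text. The categories and word lists in `STANCE_WORDS` are placeholders chosen for illustration (an assumption); they are not taken from the articles linked above, and `stance_counts` is a hypothetical helper.

```python
import re

# Placeholder stance dictionary (assumption), not drawn from the cited studies.
STANCE_WORDS = {
    "hedge": {"may", "might", "perhaps", "possibly"},
    "booster": {"clearly", "certainly", "obviously"},
}


def stance_counts(text):
    """Count how many tokens in `text` fall into each stance category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {
        category: sum(1 for t in tokens if t in words)
        for category, words in STANCE_WORDS.items()
    }


sample = ("This clearly matters, although it may not apply everywhere. "
          "Perhaps more data might help.")
counts = stance_counts(sample)
print(counts)
```

Single-word matching like this ignores multi-word stance expressions and context, so the counts are only a first approximation.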
