<a href="https://colab.research.google.com/github/mkane968/Text-Mining-with-Student-Papers/blob/main/Text_Mining_Student_Essays_A_Computational_Exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Text Mining Student Essays: A Computational Exploration

This pipeline ingests and cleans a corpus of student papers and analyzes it for meaningful language patterns. The following input is required:

*   Corpus of student papers (.txt files)
*   Grades and other relevant metadata associated with the papers (.csv files)


## 1. Install Packages

In [2]:
#Import Colab helpers for mounting Google Drive and uploading/downloading files
from google.colab import drive
from google.colab import files

#Import glob to find files by pattern
import glob

#Import pandas for dataframes
import pandas as pd

#Import numpy for numerical operations
import numpy as np

#Import the Natural Language Toolkit (uncomment the next line to install it first)
#!pip install nltk
import nltk

#Download the Punkt models and import the tokenizers
nltk.download('punkt')
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.text import ConcordanceIndex

#Download the stopword lists and import text-cleaning tools
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer

#Import matplotlib for visualizations
import matplotlib.pyplot as plt


import re  # For preprocessing
from time import time  # To time operations
from collections import defaultdict  # For word frequency
import logging  # To monitor gensim

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 2. Import Student Essays and Metadata

### Import Student Essays and Add to DataFrame

In [3]:
#Mount Google Drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
#Add files to upload from local machine
uploaded = files.upload()

Saving achesonalessandro_193606_13858036_Final Portfolio Eng 802 - Acheson .txt to achesonalessandro_193606_13858036_Final Portfolio Eng 802 - Acheson .txt
Saving ahenkoraravenmanu_LATE_232002_18873129_English 0802 Portfolio - Raven Ahenkora.txt to ahenkoraravenmanu_LATE_232002_18873129_English 0802 Portfolio - Raven Ahenkora.txt
Saving bedellolivia_195145_16640649_Analytical Reading and Writing Final Portfolio.txt to bedellolivia_195145_16640649_Analytical Reading and Writing Final Portfolio.txt
Saving benjamincamillia_193400_11506010_Camillia Benjamin Final Portfolio.txt to benjamincamillia_193400_11506010_Camillia Benjamin Final Portfolio.txt
Saving bernsteingage_LATE_227638_18903293_Final Portfolio- English 0802.txt to bernsteingage_LATE_227638_18903293_Final Portfolio- English 0802.txt
Saving bortolottiryan_LATE_232933_18872422_Portfolio.txt to bortolottiryan_LATE_232933_18872422_Portfolio.txt
Saving braunsteinaydan_232993_19273434_Final Portfolio.txt to braunsteinaydan_232993_192

In [5]:
#Put essays into dataframe
essays = pd.DataFrame.from_dict(uploaded, orient='index')

#Reset index and add column names to make wrangling easier
essays = essays.reset_index()
essays.columns = ["ID", "Text"]

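#Decode the uploaded bytes to text; 'utf-8-sig' also strips a leading byte order mark (b'\xef\xbb\xbf')
essays['Text'] = essays['Text'].apply(lambda x: x.decode('utf-8-sig'))

#Keep the original text (with newlines) in a new column, then collapse whitespace in Text
essays['Text_Newlines'] = essays['Text']
#\s+ matches real whitespace (tabs, newlines, carriage returns); \\r and \\n match literal backslash sequences
essays['Text'] = essays['Text'].str.replace(r'\s+|\\r', ' ', regex=True) 
essays['Text'] = essays['Text'].str.replace(r'\s+|\\n', ' ', regex=True) 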
essays.head()

Unnamed: 0,ID,Text,Text_Newlines
0,achesonalessandro_193606_13858036_Final Portfo...,"I have learned a lot through English 802, mor...","\tI have learned a lot through English 802, mo..."
1,ahenkoraravenmanu_LATE_232002_18873129_English...,Raven Ahenkora Professor Megan Kane English 08...,Raven Ahenkora\nProfessor Megan Kane\nEnglish ...
2,bedellolivia_195145_16640649_Analytical Readin...,Olivia Bedell Professor Megan Kane ENG 802 14 ...,Olivia Bedell\nProfessor Megan Kane\nENG 802\n...
3,benjamincamillia_193400_11506010_Camillia Benj...,Camillia Benjamin Prof. Kane English 802 06 De...,Camillia Benjamin\nProf. Kane\nEnglish 802\n06...
4,bernsteingage_LATE_227638_18903293_Final Portf...,Bernstein 1 Gage Bernstein English 0802 Profes...,Bernstein 1\nGage Bernstein\nEnglish 0802\nPro...
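The decoding and whitespace cleanup above can be sketched as a single helper. This is a minimal illustration, not the notebook's own code: the function name `decode_and_flatten` and the sample bytes are hypothetical, and unlike the cell above it also trims leading and trailing spaces.

```python
import re

def decode_and_flatten(raw: bytes) -> str:
    """Decode uploaded bytes and collapse whitespace runs into single spaces."""
    # 'utf-8-sig' decodes UTF-8 and drops a leading byte order mark if present
    text = raw.decode('utf-8-sig')
    # \s+ matches tabs, newlines, and carriage returns as well as spaces
    return re.sub(r'\s+', ' ', text).strip()

# Hypothetical sample: a BOM followed by mixed line endings
sample = b'\xef\xbb\xbfJane Doe\r\nProfessor Smith\n\nENG 802'
flattened = decode_and_flatten(sample)
print(flattened)
```

Applied to a dataframe, the same helper could be passed to `essays['Text'].apply(...)` before any tokenizing.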


### Remove identifying information from each paper ID (instructor/student names) 

In [6]:
#Remove identifying information from ID
#Remove any occurrences of "LATE_" first (the extra field would otherwise shift the position of the ID number)
essays['ID'] = essays['ID'].str.replace(r'LATE_', '', regex=True) 

#Split each filename on underscores and keep only the second field (the ID number)
start = essays["ID"].str.split("_", expand = True)
essays['ID'] = start[1]
essays['ID'] = essays['ID'].astype(int)
essays

Unnamed: 0,ID,Text,Text_Newlines
0,193606,"I have learned a lot through English 802, mor...","\tI have learned a lot through English 802, mo..."
1,232002,Raven Ahenkora Professor Megan Kane English 08...,Raven Ahenkora\nProfessor Megan Kane\nEnglish ...
2,195145,Olivia Bedell Professor Megan Kane ENG 802 14 ...,Olivia Bedell\nProfessor Megan Kane\nENG 802\n...
3,193400,Camillia Benjamin Prof. Kane English 802 06 De...,Camillia Benjamin\nProf. Kane\nEnglish 802\n06...
4,227638,Bernstein 1 Gage Bernstein English 0802 Profes...,Bernstein 1\nGage Bernstein\nEnglish 0802\nPro...
...,...,...,...
99,189740,Isabella Volpe Professor Kane ENG 802 21 April...,Isabella Volpe\nProfessor Kane\nENG 802\n21 Ap...
100,186629,Amaya Whipple Professor Megan Kane ENG 802 9 F...,Amaya Whipple\nProfessor Megan Kane\nENG 802\n...
101,185528,Ashrita Yellani Professor Kane English 0802 De...,Ashrita Yellani\nProfessor Kane\nEnglish 0802\...
102,189403,Yuknek 1 Kathryn Yuknek Professor Kane ENG 802...,Yuknek 1\n\n\nKathryn Yuknek\n\n\nProfessor Ka...
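As an alternative to removing "LATE_" and then splitting, the ID number could be pulled out with one regular expression. The sketch below assumes filenames follow the pattern seen in the upload log (name, optional `LATE_` flag, ID number, second number, title); `ID_PATTERN`, `extract_id`, and the sample filenames are hypothetical.

```python
import re
from typing import Optional

# Assumed layout: <name>_[LATE_]<ID>_<submission number>_<title>.txt
ID_PATTERN = re.compile(r'^[^_]+_(?:LATE_)?(\d+)_')

def extract_id(filename: str) -> Optional[int]:
    """Return the numeric ID from a filename, or None if the pattern does not match."""
    match = ID_PATTERN.match(filename)
    return int(match.group(1)) if match else None

# Hypothetical filenames for illustration
print(extract_id('studentname_LATE_123456_7890123_Portfolio.txt'))
print(extract_id('studentname_123456_7890123_Final.txt'))
print(extract_id('notes.txt'))
```

Returning `None` for non-matching names makes it possible to spot unexpected files before the column is cast to `int`.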


### Import grades and additional metadata to second dataframe


In [7]:
#Upload csvs with essay metadata
uploaded_grades = files.upload()

Saving 2022-09-13T0943_Grades-LA-ENG-0802-711-10742-202220.csv to 2022-09-13T0943_Grades-LA-ENG-0802-711-10742-202220.csv
Saving 2022-09-13T0945_Grades-LA-ENG-0802-062-37264-202203.csv to 2022-09-13T0945_Grades-LA-ENG-0802-062-37264-202203.csv
Saving 2022-11-28T1326_Grades-LA-ENG-0802-011-4684-202103.csv to 2022-11-28T1326_Grades-LA-ENG-0802-011-4684-202103.csv
Saving 2022-11-28T1331_Grades-LA-ENG-0802-012-3352-202136.csv to 2022-11-28T1331_Grades-LA-ENG-0802-012-3352-202136.csv
Saving 2022-11-28T1332_Grades-LA-ENG-0802-010-3350-202036.csv to 2022-11-28T1332_Grades-LA-ENG-0802-010-3350-202036.csv


In [28]:
#Path where the uploaded csv files are stored (the Colab working directory)
local_path = r'/content'

#Create variable to store all csvs in path
filenames = glob.glob(local_path + "/*.csv")

#Create df list for all csvs
dfs = [pd.read_csv(filename) for filename in filenames]

# Concatenate all data into one DataFrame
metadata = pd.concat(dfs, ignore_index=True)

#Column dtypes are left as read by pandas; ID and score columns are cleaned in later cells

metadata.head()

Unnamed: 0,Student,ID,SIS User ID,SIS Login ID,Integration ID,Section,Final Portfolio (1689777),Weekly Assignments Current Score,Weekly Assignments Unposted Current Score,Weekly Assignments Final Score,...,Assignments Unposted Final Score,Quizzes Current Score,Quizzes Unposted Current Score,Quizzes Final Score,Quizzes Unposted Final Score,Discussions Current Score,Discussions Unposted Current Score,Discussions Final Score,Discussions Unposted Final Score,Final Portfolio (1059452)
0,Points Possible,,,,,,100.0,(read only),(read only),(read only),...,,,,,,,,,,
1,"Braunstein, Aydan",232993.0,tun93646,tun93646,915967676.0,Section: 711,94.0,100.09,100.09,100.09,...,,,,,,,,,,
2,"Clancy, Hannah",232214.0,tuo91570,tuo91570,916062331.0,Section: 711,92.0,100.26,100.26,100.26,...,,,,,,,,,,
3,"Cuascut-Palmer, Corey",237430.0,tuo77740,tuo77740,916050168.0,Section: 711,85.0,97.44,97.44,97.44,...,,,,,,,,,,
4,"Duckworth, Emily",227040.0,tuo35762,tuo35762,916008595.0,Section: 711,95.0,100.26,100.26,100.26,...,,,,,,,,,,


In [64]:
#Drop header rows (Points Possible) and test student rows (Student, Test)
#na=True treats rows with a missing Student name as matches, so they are dropped too
metadata = metadata[~metadata['Student'].str.contains('Points Possible|Student, Test', na=True)]
metadata.head()

Unnamed: 0,Student,ID,SIS User ID,SIS Login ID,Integration ID,Section,Final Portfolio (1689777),Weekly Assignments Current Score,Weekly Assignments Unposted Current Score,Weekly Assignments Final Score,...,Assignments Unposted Final Score,Quizzes Current Score,Quizzes Unposted Current Score,Quizzes Final Score,Quizzes Unposted Final Score,Discussions Current Score,Discussions Unposted Current Score,Discussions Final Score,Discussions Unposted Final Score,Final Portfolio (1059452)
1,"Braunstein, Aydan",232993.0,tun93646,tun93646,915967676.0,Section: 711,94.0,100.09,100.09,100.09,...,,,,,,,,,,
2,"Clancy, Hannah",232214.0,tuo91570,tuo91570,916062331.0,Section: 711,92.0,100.26,100.26,100.26,...,,,,,,,,,,
3,"Cuascut-Palmer, Corey",237430.0,tuo77740,tuo77740,916050168.0,Section: 711,85.0,97.44,97.44,97.44,...,,,,,,,,,,
4,"Duckworth, Emily",227040.0,tuo35762,tuo35762,916008595.0,Section: 711,95.0,100.26,100.26,100.26,...,,,,,,,,,,
5,"Gallagher, Chris",184517.0,tul44633,tul44633,915837216.0,Section: 711,90.0,100.26,100.26,100.26,...,,,,,,,,,,


In [91]:
#Keep only relevant metadata (ID, Section, Final Portfolio scores)
score_columns = [col for col in metadata.columns if col.startswith('Final Portfolio (')]
#Copy so that later column assignments act on an independent dataframe
clean_metadata = metadata[['ID', 'Section'] + score_columns].copy()
clean_metadata
#Want other metadata? Uncomment the loop below to list all column names
#for col in metadata.columns:
#    print(col)

Unnamed: 0,ID,Section,Final Portfolio (1689777),Final Portfolio (878160),Final Portfolio (1676963),Final Portfolio (1313717),Final Portfolio (1059452)
1,232993.0,Section: 711,94.0,,,,
2,232214.0,Section: 711,92.0,,,,
3,237430.0,Section: 711,85.0,,,,
4,227040.0,Section: 711,95.0,,,,
5,184517.0,Section: 711,90.0,,,,
...,...,...,...,...,...,...,...
120,177300.0,Section: 011,,,,,92.0
121,193777.0,Section: 011,,,,,89.0
122,189740.0,Section: 011,,,,,89.0
123,186629.0,Section: 011,,,,,86.0


In [94]:
#Replace all NaN values with 0 
clean_metadata = clean_metadata.replace(np.nan, 0)
clean_metadata

Unnamed: 0,ID,Section,Final Portfolio (1689777),Final Portfolio (878160),Final Portfolio (1676963),Final Portfolio (1313717),Final Portfolio (1059452)
1,232993.0,Section: 711,94.0,0.0,0.0,0,0.0
2,232214.0,Section: 711,92.0,0.0,0.0,0,0.0
3,237430.0,Section: 711,85.0,0.0,0.0,0,0.0
4,227040.0,Section: 711,95.0,0.0,0.0,0,0.0
5,184517.0,Section: 711,90.0,0.0,0.0,0,0.0
...,...,...,...,...,...,...,...
120,177300.0,Section: 011,0.0,0.0,0.0,0,92.0
121,193777.0,Section: 011,0.0,0.0,0.0,0,89.0
122,189740.0,Section: 011,0.0,0.0,0.0,0,89.0
123,186629.0,Section: 011,0.0,0.0,0.0,0,86.0


In [95]:
#Combine the per-section score columns into one Portfolio_Score column
#Each student has a score in only one section column (the others are 0), so the row sum equals that score
score_counts = clean_metadata.columns[2:]
clean_metadata['Portfolio_Score'] = clean_metadata[score_counts].sum(axis=1)



In [96]:
clean_metadata['Portfolio_Score']

1      94.0
2      92.0
3      85.0
4      95.0
5      90.0
       ... 
120    92.0
121    89.0
122    89.0
123    86.0
124    84.0
Name: Portfolio_Score, Length: 115, dtype: float64
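Filling missing scores with 0 and summing works when every student has exactly one real score, but it makes a missing grade look the same as a grade of 0. A possible variant, sketched here on a toy dataframe with made-up column names and values, sums the score columns without filling and uses `min_count=1` so that rows with no score at all stay `NaN`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one score column per section, NaN where not applicable
toy = pd.DataFrame({
    'ID': [1, 2, 3],
    'Final Portfolio (A)': [94.0, np.nan, np.nan],
    'Final Portfolio (B)': [np.nan, 0.0, np.nan],
})

score_cols = [c for c in toy.columns if c.startswith('Final Portfolio (')]

# NaN values are skipped; min_count=1 returns NaN when a row has no score at all
toy['Portfolio_Score'] = toy[score_cols].sum(axis=1, min_count=1)
print(toy[['ID', 'Portfolio_Score']])
```

Here student 2 keeps a genuine score of 0, while student 3, who has no score in any column, remains missing.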

In [98]:
#Drop grade columns for individual classes (copy so later assignments act on an independent dataframe)
clean_metadata = clean_metadata[['ID', 'Section', "Portfolio_Score"]].copy()
clean_metadata

Unnamed: 0,ID,Section,Portfolio_Score
1,232993.0,Section: 711,94.0
2,232214.0,Section: 711,92.0
3,237430.0,Section: 711,85.0
4,227040.0,Section: 711,95.0
5,184517.0,Section: 711,90.0
...,...,...,...
120,177300.0,Section: 011,92.0
121,193777.0,Section: 011,89.0
122,189740.0,Section: 011,89.0
123,186629.0,Section: 011,86.0


In [None]:
#Drop decimal from ID (inconsistent with ID in essay dataframe)
clean_metadata['ID'] = clean_metadata['ID'].astype(int)

#Check cleaned DF one more time
clean_metadata
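The cast to `int` above relies on the ID column having no missing values left. If missing IDs need to be preserved, pandas' nullable `Int64` dtype is one option. The values below are illustrative only.

```python
import numpy as np
import pandas as pd

# Float IDs as read from the csv, with one missing value (illustrative data)
ids = pd.Series([232993.0, np.nan, 184517.0])

# 'Int64' (capital I) is pandas' nullable integer dtype; missing values become <NA>
ids_int = ids.astype('Int64')
print(ids_int)
```

Rows with a missing ID would then simply fail to match during the merge instead of being treated as ID 0.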

### Merge essays and grade metadata into one dataframe

In [None]:
#Merge metadata and cleaned essays into new dataframe
#Will only keep rows where both essay and metadata are present
essays_grades_master = clean_metadata.merge(essays,on='ID')

#Print dataframe
essays_grades_master
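Because the default merge keeps only IDs found in both dataframes, unmatched essays or grade rows disappear silently. One way to audit this, sketched on toy dataframes with hypothetical values, is an outer merge with `indicator=True`, which adds a `_merge` column naming the source of each row.

```python
import pandas as pd

# Toy stand-ins for clean_metadata and essays (hypothetical values)
grades = pd.DataFrame({'ID': [1, 2, 3], 'Portfolio_Score': [94.0, 92.0, 85.0]})
papers = pd.DataFrame({'ID': [2, 3, 4], 'Text': ['a', 'b', 'c']})

# An outer merge keeps every row; _merge is 'both', 'left_only', or 'right_only'
audit = grades.merge(papers, on='ID', how='outer', indicator=True)

# Rows that an inner merge would have dropped
unmatched = audit[audit['_merge'] != 'both']
print(unmatched[['ID', '_merge']])
```

In this toy example, ID 1 has a grade but no essay and ID 4 has an essay but no grade, so both show up in `unmatched`.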

In [None]:
#Sort dataframe by grades (lowest to highest)
essays_grades_master.sort_values(by=['Portfolio_Score'], inplace = True)
essays_grades_master.head()

In [None]:
#Save new df to csv and download
essays_grades_master.to_csv('essays_grades_master.csv') 
files.download('essays_grades_master.csv')
