# Paper Grading Assistant

## Data Wrangling and Pre-processing

Data comes from these links:
- https://components.one/datasets/all-the-news-2-news-articles-dataset/
- https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus
- https://www.kaggle.com/krsoninikhil/pual-graham-essays
- https://www.kaggle.com/c/asap-sas/data
- https://www.kaggle.com/c/asap-aes/data
- https://www.kaggle.com/thevirusx3/automated-essay-scoring-dataset

In [1]:
# !pip install gensim
import sys
from gensim import corpora, models
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import re
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
# from sklearn.feature_extraction.text import CountVectorizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\maxw2\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [2]:
train_docs = {
    'doc1' : "D:\\Kaggle\\asap-sas\\train.tsv",
    'doc2' : "D:\\Kaggle\\asap-aes\\training_set_rel3.tsv",
    'doc3' : "D:\\Kaggle\\paul-graham-essays\\paul_graham_essay.txt",
    'doc4' : "gibberish"
} 

In [3]:
# Cleaning the text

def get_data(path):
    dataset = []
    if path.endswith('.tsv'):
        dataset = pd.read_table(path)
    elif path.endswith('.csv'):
        dataset = pd.read_csv(path)
    elif path.endswith('.txt'):
        with open(path) as file:
            for line in file:
                dataset.append(line.rstrip())
    else:
        dataset = ''
    return (dataset)

def strip_html(raw_html):
    clean_re = re.compile('<.*?>|&([a-z0-9]+|#[0-9]{1,6}|#x[0-9a-f]{1,6});')
    text = re.sub(clean_re, '', raw_html)
    return text

def process_text(text):
    # remove handles and urls specifically
    text = re.sub(r"(@\[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http.+?", "", text)
    # remove anything not a letter
    text = re.sub('[^a-zA-Z]', ' ', text)

    text = text.lower()
    text = text.split()
    
#     ps = PorterStemmer()
    wnl = WordNetLemmatizer()
    all_stopwords = stopwords.words('english')
    all_stopwords.remove('not')
    
#     text = [ps.stem(word) for word in text if not word in set(all_stopwords)]
    text = [wnl.lemmatize(word) for word in text if not word in set(all_stopwords)]
    text = ' '.join(text)
    return text
    

def clean_df(df):
    cols = df.columns
    print(cols)
    target = input("Which column has the text? Copy and paste here: ")
    print(target)
    clean_text = []
    for i in range(len(df)):
        try:
            text = strip_html(df[target][i])
            text = process_text(text)
            clean_text.append(text)
        except:
            pass
#     print(clean_text)
    return clean_text

def clean_list(lst):
    print('processing text data...')
    clean_text = []
    for i in range(len(lst)):
        try:
            lst[i] = lst[i].strip()
            if len(lst[i]) < 2: # removes random empty lines
                lst.pop(i)
                continue
            text = strip_html(lst[i])
            text = process_text(text)
            clean_text.append(text)
        except IndexError:
            break
        except:
            print("Unexpected error:", sys.exc_info())
            pass
#     print(clean_text)
    return clean_text

def process_data(data):
    if isinstance(data, pd.DataFrame):
        return clean_df(data)
    elif isinstance(data, list):
        return clean_list(data)
    else:
        print('data type not recognized')
        return ''
    

In [4]:
all_data = []
for key in train_docs.keys():
    data = get_data(train_docs[key])
    big_data = process_data(data)
    all_data.append(big_data)

Index(['Id', 'EssaySet', 'Score1', 'Score2', 'EssayText'], dtype='object')
Which column has the text? Copy and paste here: EssayText
EssayText
Index(['essay_id', 'essay_set', 'essay', 'rater1_domain1', 'rater2_domain1',
       'rater3_domain1', 'domain1_score', 'rater1_domain2', 'rater2_domain2',
       'domain2_score', 'rater1_trait1', 'rater1_trait2', 'rater1_trait3',
       'rater1_trait4', 'rater1_trait5', 'rater1_trait6', 'rater2_trait1',
       'rater2_trait2', 'rater2_trait3', 'rater2_trait4', 'rater2_trait5',
       'rater2_trait6', 'rater3_trait1', 'rater3_trait2', 'rater3_trait3',
       'rater3_trait4', 'rater3_trait5', 'rater3_trait6'],
      dtype='object')
Which column has the text? Copy and paste here: essay
essay
processing text data...
data type not recognized


In [5]:
all_data[0]

['additional information would need replicate experiment much vinegar placed identical container tool use measure mass four different sample much distilled water use rinse four sample taking vinegar',
 'reading expirement realized additional information need replicate expireiment one amant vinegar poured container two label container start yar expirement three write conclusion make sure yar result accurate',
 'need trial control set exact amount vinegar pour cupbeaker could also take check mass every min hour',
 'student list rock better rock worse procedure',
 'student able make replicate would need tell use much vinegar used tipe material needed expirement',
 'would need information would let different sample dry container drying',
 'information would need order sucessfully replicate experiment correct measurement used experiment also material used creat experiament hour removing sample container rinsen sample distilled water making dry mintued might not long enough really determine 

In [6]:
all_data[1]

['dear local newspaper think effect computer people great learning skillsaffects give u time chat friendsnew people help u learn globeastronomy keep u troble thing dont think would feel teenager always phone friend ever time chat friend buisness partner thing well there new way chat computer plenty site internet organization organization cap facebook myspace ect think setting meeting bos computer teenager fun phone not rushing get cause want use learn countrysstates outside well computerinternet new way learn going time might think child spends lot time computer ask question economy sea floor spreading even date youll surprise much heshe know believe not computer much interesting class day reading book child home computer local library better friend fresh perpressured something know isnt right might not know child cap forbidde hospital bed driveby rather child computer learning chatting playing game safe sound home community place hope reached point understand agree computer great effe

In [7]:
all_data[2]

['september',
 'fma example general surprising hard',
 'combination achieve territory tends picked',
 'clean precisely insight valuable',
 'either surprising without general eg',
 'gossip general without surprising eg',
 'platitude',
 'insight get small addition whichever',
 'quality missing common case small',
 'addition generality piece gossip thats',
 'gossip teach something interesting',
 'world another le common approach focus',
 'general idea see find something new',
 'say start general',
 'need small delta novelty produce useful',
 'insight',
 'time mean take route idea',
 'seem lot like one already exist sometimes',
 'youll find youve merely rediscovered idea',
 'already exist dont discouraged remember huge',
 'multiplier kick manage think',
 'something even little new',
 'le worry repeating',
 'write enough inevitable brain much',
 'year year stimulus hit',
 'feel slightly bad find ive said something',
 'close ive said plagiarizing',
 'rationally one shouldnt wont say',
 'some