# Data 
The dataset (the ELLIPSE corpus) comprises argumentative essays written by 8th-12th grade English Language Learners (ELLs). The essays have been scored according to six analytic measures: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. Each measure represents a component of proficiency in essay writing, with greater scores corresponding to greater proficiency in that measure. The scores range from 1.0 to 5.0 in increments of 0.5. The task is to predict the score of each of the six measures for the essays given in the test set.

**File and Field Information**
train.csv - The training set, comprising the full_text of each essay, identified by a unique text_id. Each essay is also given a score for each of the six analytic measures above: cohesion, syntax, vocabulary, phraseology, grammar, and conventions. These analytic measures comprise the target for the competition.  
test.csv - The test set, which provides only the full_text of each essay together with its text_id.




In [1]:
import warnings
warnings.filterwarnings('ignore')  # keep library warnings out of the notebook output
import os  # required for os.path.join when building data file paths
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import logging
import logging.config
from IPython.display import display, HTML

from atelier.workflow.pipeline import DataPipeBuilder, DataPipe
from atelier.data.io import YamlIO

In [2]:
# Seaborn
sns.set_palette("Blues_r")
sns.set_style("whitegrid")
# Pandas
pd.options.display.float_format = '{:,.2f}'.format
pd.set_option("display.max_rows", 100)
pd.set_option("display.max_columns", 20)
pd.set_option("display.max_colwidth", 100)
# Configurations
ETL_CONFIG_FILE = "config/etl.yml"
LOGGING_CONFIG_FILE = "config/logging.yml"
DATA_CONFIG_FILE = "config/data.yml"
FEATURE_CONFIG_FILE = "config/features.yml"
# Logging
io = YamlIO()
LOGGING_CONFIG = io.read(LOGGING_CONFIG_FILE)
logging.config.dictConfig(LOGGING_CONFIG)
logger = logging.getLogger(__name__)
# Data Directories
DATA_DIRECTORIES = io.read(DATA_CONFIG_FILE)['directories']
FEATURE_STORE = io.read(FEATURE_CONFIG_FILE)

## Extract, Transform, and Load
The following data pipeline obtains and prepares the data for profiling, analysis, feature engineering, and modeling. In short, the data pipeline:

1. Extracts the data from the Kaggle website, unpacks, and stores it locally in the raw data directory. 
2. Transforms the raw data into tokens, tags the associated parts-of-speech, and parses the syntactic dependencies for syntactic analysis.  
3. Loads the raw and transformed data into the local environment for analysis.

In [3]:
builder = DataPipeBuilder()
builder.build(ETL_CONFIG_FILE)  # assemble the pipeline from the ETL configuration
datapipe = builder.pipeline
datapipe.run()  # run the extract, transform, and load steps
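
The three stages can be illustrated with a minimal, library-free sketch. It does not reproduce the atelier `DataPipeBuilder`. The function names (`extract`, `transform`, `load`, `run_pipeline`) and the in-memory records are hypothetical, and a regular-expression tokenizer stands in for the part-of-speech tagging and dependency parsing, which would normally be delegated to an NLP library.

```python
import re
from typing import Dict, List

Record = Dict[str, object]

def extract() -> List[Record]:
    """Stand-in for downloading and unpacking the raw data (hypothetical records)."""
    return [
        {"text_id": "A1", "full_text": "Working alone is great.\n\nGroups help too."},
        {"text_id": "B2", "full_text": "Students should enjoy summer break."},
    ]

def transform(records: List[Record]) -> List[Record]:
    """Add a token list to each record; a regex replaces real tagging and parsing."""
    return [
        {**r, "tokens": re.findall(r"[A-Za-z']+", str(r["full_text"]))}
        for r in records
    ]

def load(records: List[Record]) -> Dict[str, Record]:
    """Index the transformed records by text_id for analysis."""
    return {str(r["text_id"]): r for r in records}

def run_pipeline() -> Dict[str, Record]:
    """Run the three stages in order."""
    return load(transform(extract()))

store = run_pipeline()
```

Keeping each stage a plain function makes it straightforward to test a stage in isolation or to swap the tokenizer for a full parser later.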


## Data Profile
The data profile examines the structure, data types, nullability, cardinality, summary statistics, and distributions of the dataset to surface data quality issues, so that they can be corrected before investing in analysis, feature engineering, and modeling. Let's start with the training set.
### Training Set

In [4]:
train_filepath = os.path.join(DATA_DIRECTORIES['raw'], 'train.csv')
train = pd.read_csv(train_filepath)
train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3911 entries, 0 to 3910
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   text_id      3911 non-null   object 
 1   full_text    3911 non-null   object 
 2   cohesion     3911 non-null   float64
 3   syntax       3911 non-null   float64
 4   vocabulary   3911 non-null   float64
 5   phraseology  3911 non-null   float64
 6   grammar      3911 non-null   float64
 7   conventions  3911 non-null   float64
dtypes: float64(6), object(2)
memory usage: 244.6+ KB
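
Nullability and cardinality, which `info()` covers only in part, can be tabulated per column with a small helper. This is a sketch rather than part of the original notebook: the `profile` function and the three-row `demo` frame are hypothetical stand-ins, and the same call should work on `train`.

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, null count, and number of distinct values per column."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "nulls": df.isna().sum(),
        "unique": df.nunique(),
    })

# Tiny hypothetical frame standing in for train.csv
demo = pd.DataFrame({
    "text_id": ["A1", "B2", "C3"],
    "full_text": ["first essay", "second essay", "third essay"],
    "cohesion": [3.0, 3.5, 3.0],
})
summary = profile(demo)
```

A `unique` count equal to the row count for `text_id` would confirm that it is a valid identifier.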


The training set has 3,911 rows and 8 columns, with no null values. The identifier, text_id, and the full_text are stored as string objects, and the six target variables are floats. Let's examine a few random samples.
### Random Training Samples

In [5]:
idx = np.random.randint(train.shape[0], size=5)
train.loc[idx]

| | text_id | full_text | cohesion | syntax | vocabulary | phraseology | grammar | conventions |
|---|---|---|---|---|---|---|---|---|
| 743 | 394F5867B7EC | Working alone is great but have you ever thought about the benefits of working in groups? for ex... | 3.5 | 3.0 | 3.5 | 3.5 | 2.5 | 3.0 |
| 1158 | 598C1E9B27C8 | Your character will be what you yourself choose to make. Do we choose our own character traits, ... | 3.0 | 3.5 | 3.5 | 3.0 | 4.0 | 4.0 |
| 3653 | F584D9BB5F5C | " Has the limitation of human contact due to the use of technology had a positive or negative af... | 4.0 | 3.5 | 3.5 | 4.0 | 3.0 | 3.5 |
| 2212 | A4F11F4A76AE | Students should enjoy summer break, cause they would not have to worry about school and they can... | 2.5 | 2.5 | 3.0 | 2.5 | 3.0 | 3.5 |
| 1077 | 529A633179FF | Have you ever wondered about the important's and the difference between imagination and knowledg... | 3.0 | 3.5 | 4.0 | 3.5 | 3.0 | 3.5 |
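
Note that `np.random.randint` draws indices with replacement and without a fixed seed, so the rows shown can repeat and will differ between runs. A hedged alternative is `DataFrame.sample`, which draws without replacement by default and accepts a `random_state`. The `demo` frame below is a hypothetical stand-in for `train`.

```python
import pandas as pd

# Hypothetical frame standing in for the training set
demo = pd.DataFrame({
    "text_id": [f"id{i}" for i in range(20)],
    "cohesion": [3.0] * 20,
})

# Five distinct rows, reproducible across runs because the seed is fixed
sample = demo.sample(n=5, random_state=42)
```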


Let's take a closer look at some text data.

In [6]:
pd.set_option("display.max_colwidth", 5000)
idx = np.random.randint(train.shape[0], size=2)
train['full_text'].loc[idx]

1499    [essay text not preserved in this export]

The essays appear to be organized into paragraphs separated by double line breaks. These breaks may be useful for establishing sentence boundaries during preprocessing, but they should be removed before the modeling stage.
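
Both uses of the line breaks can be sketched with the standard library. The helper names `split_paragraphs` and `flatten` are hypothetical, and the sample essay is invented for illustration.

```python
import re
from typing import List

def split_paragraphs(text: str) -> List[str]:
    """Split on runs of two or more line breaks and drop empty pieces."""
    parts = re.split(r"(?:\r?\n){2,}", text)
    return [p.strip() for p in parts if p.strip()]

def flatten(text: str) -> str:
    """Collapse all whitespace, including line breaks, into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

essay = "First paragraph.\n\nSecond paragraph,\nstill second.\n\n\nThird."
paragraphs = split_paragraphs(essay)  # paragraph boundaries for preprocessing
flat = flatten(essay)                 # line breaks removed for modeling
```

Splitting first and flattening afterwards keeps the paragraph count available as a feature while giving the model clean, single-line text.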

### Score Descriptive Statistics
Let's get a sense of the target variable distributions.

In [7]:
scores = train[['cohesion', 'syntax', 'vocabulary', 'phraseology', 'grammar', 'conventions']]
scores.describe().T

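| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| cohesion | 3911.0 | 3.13 | 0.66 | 1.0 | 2.5 | 3.0 | 3.5 | 5.0 |
| syntax | 3911.0 | 3.03 | 0.64 | 1.0 | 2.5 | 3.0 | 3.5 | 5.0 |
| vocabulary | 3911.0 | 3.24 | 0.58 | 1.0 | 3.0 | 3.0 | 3.5 | 5.0 |
| phraseology | 3911.0 | 3.12 | 0.66 | 1.0 | 2.5 | 3.0 | 3.5 | 5.0 |
| grammar | 3911.0 | 3.03 | 0.7 | 1.0 | 2.5 | 3.0 | 3.5 | 5.0 |
| conventions | 3911.0 | 3.08 | 0.67 | 1.0 | 2.5 | 3.0 | 3.5 | 5.0 |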
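
The summary above can be complemented by checking that every score lies on the documented 1.0 to 5.0 grid in steps of 0.5, and by counting how often each score value occurs. The sketch below assumes pandas only. The `demo` frame holds invented scores, and `MEASURES`, `in_range`, `on_grid`, and `counts` are hypothetical names.

```python
import pandas as pd

MEASURES = ["cohesion", "syntax", "vocabulary", "phraseology", "grammar", "conventions"]

# Hypothetical scores standing in for the training targets
demo = pd.DataFrame({m: [2.5, 3.0, 3.0, 3.5] for m in MEASURES})

# Every score should lie in [1.0, 5.0] and be a multiple of 0.5
in_range = bool(demo[MEASURES].apply(lambda s: s.between(1.0, 5.0)).all().all())
on_grid = bool(((demo[MEASURES] * 2) % 1 == 0).all().all())

# Frequency of each score value per measure (rows: score, columns: measure)
counts = (
    demo[MEASURES]
    .apply(lambda s: s.value_counts())
    .fillna(0)
    .astype(int)
    .sort_index()
)
```

With matplotlib installed, `counts.plot(kind="bar", subplots=True)` is one way to turn the table into a bar chart per measure.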


### Score Distribution Plot

All scores are in the range [1, 5], with medians of 3.0 and means between 3.03 and 3.24. Next, we examine surface features of the texts, such as word length, sentence length, and frequency measures.

### Surface Features



In [8]:
# Column names are reproduced exactly as stored; several carry a leading space.
surface_columns = [
    ' number words', ' number types', ' TTR', ' Letters per word',
    ' number paragraphs', ' number of sentences',
    ' number of words per sentence', 'determiners', 'demonstratives',
    'number of pronouns', 'first person pronouns', 'second person pronouns',
    'third person pronouns', 'conjuncts', 'connectives', 'negations',
    'future',
]
analytics = pd.read_csv(FEATURE_STORE['sinlp'], index_col=None)[surface_columns]
analytics.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| number words | 3911.0 | 430.49 | 191.87 | 14.0 | 294.0 | 402.0 | 526.5 | 1260.0 |
| number types | 3911.0 | 165.45 | 56.72 | 13.0 | 124.0 | 158.0 | 198.0 | 439.0 |
| TTR | 3911.0 | 0.41 | 0.08 | 0.15 | 0.35 | 0.4 | 0.46 | 0.93 |
| Letters per word | 3911.0 | 4.26 | 0.28 | 3.3 | 4.07 | 4.26 | 4.44 | 5.6 |
| number paragraphs | 3911.0 | 10.08 | 6.23 | 1.0 | 7.0 | 9.0 | 11.0 | 103.0 |
| number of sentences | 3911.0 | 18.8 | 10.49 | 1.0 | 11.0 | 17.0 | 25.0 | 100.0 |
| number of words per sentence | 3911.0 | 28.88 | 25.41 | 6.34 | 17.89 | 22.5 | 31.43 | 565.5 |
| determiners | 3911.0 | 0.08 | 0.03 | 0.0 | 0.06 | 0.08 | 0.1 | 0.23 |
| demonstratives | 3911.0 | 0.02 | 0.01 | 0.0 | 0.01 | 0.02 | 0.03 | 0.1 |
| number of pronouns | 3911.0 | 0.1 | 0.04 | 0.0 | 0.07 | 0.1 | 0.12 | 0.23 |
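
To make the first few measures concrete, the sketch below computes approximate word, type, and sentence counts along with the type-token ratio (TTR, the number of distinct words divided by the total number of words). The notebook does not state how the feature store defines these measures, so the tokenization and sentence-splitting rules here are assumptions and will not reproduce the stored values exactly. The function name `surface_features` is hypothetical.

```python
import re
from typing import Dict

def surface_features(text: str) -> Dict[str, float]:
    """Compute simple length and lexical-diversity measures for one essay."""
    # Assumption: a word is a run of letters or apostrophes, case-folded
    words = re.findall(r"[A-Za-z']+", text.lower())
    # Assumption: sentences end at runs of '.', '!' or '?'
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    types = set(words)
    n_words = len(words)
    n_letters = sum(ch.isalpha() for w in words for ch in w)
    return {
        "number_words": n_words,
        "number_types": len(types),
        "ttr": len(types) / n_words if n_words else 0.0,
        "letters_per_word": n_letters / n_words if n_words else 0.0,
        "number_sentences": len(sentences),
        "words_per_sentence": n_words / len(sentences) if sentences else 0.0,
    }

features = surface_features("The cat sat. The dog sat too!")
```

Because TTR tends to fall as texts grow longer, it is worth reading it alongside the word count rather than on its own.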
