# Statistical Analysis of Writing 
## Introduction

For my project, I created a tool that analyzes writing level and style using the Flesch-Kincaid readability tests, average sentence length and its standard deviation, and the repetitiveness of word use. When I was in high school, an English teacher told my class that a good way to keep a reader engaged is to vary the length of your sentences and to avoid starting too many of them with the same word. To do this in his class, we would manually count the words in each sentence while editing, then go through our work and circle the first word of every sentence to make sure we were not being too repetitive. To this day I still use this technique, as I find it an excellent and easy way to improve my writing. So I thought it would be a great project idea to create a program that speeds up the process!

The Flesch-Kincaid readability tests are two tests that indicate the reading ease and suggested grade level of a piece of writing. The reading ease test returns a score that typically falls between _zero_ and _one hundred_; the higher the score, the easier the text is to read. It uses this formula:


\begin{equation}
206.835 - 1.015 \left( \frac{\text{Total Words}}{\text{Total Sentences}} \right) - 84.6 \left( \frac{\text{Total Syllables}}{\text{Total Words}} \right)
\end{equation}

The chart below illustrates how the reading ease score maps to the level of difficulty of the writing.

![Reading Ease](https://lh3.googleusercontent.com/broaSYuKWSgDp6GEcuVKd76Jt_AeHFi26vqRMyfs0uZV1FmTkvehJZ1oerWAb4uIEnj_bRXWw0dYOBlhkDpfKHFjt4gZD1_u5b0nGK8yZWAvSIzt-Cd_TFUf4IZLG08Y37IO5jce)

Similarly, the grade level test returns a number corresponding to the American school grade level at which the text is estimated to be readable. It uses this formula:


\begin{equation}
0.39 \left( \frac{\text{Total Words}}{\text{Total Sentences}} \right) + 11.8 \left( \frac{\text{Total Syllables}}{\text{Total Words}} \right) - 15.59
\end{equation}
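
As a quick sanity check on the two formulas, here is a minimal sketch that evaluates both from raw counts. The function name `flesch_scores` and the counts (100 words, 5 sentences, 150 syllables) are made up for illustration and do not come from the essays analyzed later.

```python
def flesch_scores(total_words, total_sentences, total_syllables):
    """Return (reading_ease, grade_level) for the given counts."""
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

# Illustrative counts only: 100 words, 5 sentences, 150 syllables.
ease, grade = flesch_scores(100, 5, 150)
print(round(ease, 2), round(grade, 2))
```

Longer sentences and more syllables per word both push the reading ease score down and the grade level up.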

My search for data on ideal sentence length and variation came up pretty slim. However, I did find a paid online tool called ProWritingAid that measures the sentence lengths in your writing and their standard deviation. I chose not to purchase the program, but one of its demonstrations included a brief explanation that gave some decent information on the topic. According to them, most published writers vary their sentences, averaging between eleven and eighteen words. Any more would be too complicated and any fewer would be too choppy. These seemed like decent enough numbers for me to go off of, and so I began...

## Data

To test my design, I had friends and family send me essays they had written for various classes. The writers ranged from high school students to college seniors. It was important to me to find writers at different levels who studied different subjects so I could see the variance across the outcomes. Here is my data set:

In [38]:
import pandas
Data = pandas.read_csv("PData1.csv")
Data

Unnamed: 0,Title,Text
0,The Balfour Declaration,"The Balfour Declaration, published November 2n..."
1,"Tolerance, Feminism, and the Downside of Globa...","During the Enlightenment, works such as Candid..."
2,Common App Essay,"Following many hours of travel—flying, then dr..."
3,The Role of Setting in “Disgrace”,"Published in 1999, J.M. Coetzee’s novel revolv..."
4,Designer Babies,"If given the chance to choose the sex, physiq..."
5,Freedom and Remorse,Many authors use the same literary tools to w...
6,Mansaf and the Sustainability of Traditional J...,Living in the desert has proven time and time ...
7,The Final Unveiling,Huda Shaarawi is regarded as one of the founde...
8,America’s Lunch Line,"Imagine a group of kindergarteners, all of the..."
9,More Than Surface Value: A Dialogue,This scene takes place at a prestigious art mu...


## Method
First, I wrote a function to get the average word count of the sentences in a text, as well as one to get the total word count.

In [12]:
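def average_wordcount(text):
    wordcounts = []
    # Treat every period as a sentence boundary.
    sentences = text.split('.')
    for sentence in sentences:
        # Count the space-separated tokens in each piece.
        words = sentence.split(' ')
        wordcounts.append(len(words))
    # Compute the mean once, after every piece has been counted.
    avg_wordcount = sum(wordcounts) / len(wordcounts)
    return avg_wordcount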

In [13]:
def wordcount(text):
    simplewords = text.split(' ')
    num_words = len(simplewords)
    return num_words
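
One thing to keep in mind is that `text.split('.')` treats every period as a sentence boundary, so it yields an empty piece after the final period and ignores `!` and `?`. The sketch below is a hypothetical alternative, not the method used for the results in this notebook: it splits on whitespace that follows `.`, `!`, or `?` using the standard `re` module. The helper names `split_sentences` and `average_sentence_length` are made up for this sketch, it would still split after an abbreviation followed by a space, and swapping it in would change the numbers reported in the tables.

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?' and drop empty pieces."""
    pieces = re.split(r'(?<=[.!?])\s+', text.strip())
    return [piece for piece in pieces if piece]

def average_sentence_length(text):
    """Mean number of whitespace-separated words per sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

sample = "Hello world. How are you? Fine."
print(split_sentences(sample))
print(average_sentence_length(sample))
```

On the same sample, splitting on `'.'` alone gives three pieces: `'Hello world'`, `' How are you? Fine'`, and an empty string.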

My next step was to get the standard deviation of the sentence lengths.

In [14]:
import statistics

def stnd_dev(text):
    wordcounts = []
    sentences = text.split('.')
    for sentence in sentences:
        words = sentence.split(' ')
        wordcounts.append(len(words))
    return (statistics.stdev(wordcounts))   
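
To make the standard deviation easier to interpret, here is a small sketch using made-up sentence lengths (not taken from the essays). `statistics.stdev` computes the sample standard deviation, dividing by n - 1, while `statistics.pstdev` computes the population version.

```python
import statistics

uniform = [15, 15, 15]   # every sentence the same length
varied = [10, 20, 30]    # short, medium, and long sentences

print(statistics.stdev(uniform))  # no variation at all
print(statistics.stdev(varied))
print(statistics.pstdev(varied))  # population version is slightly smaller
```

A larger value means the sentence lengths are more spread out around their average. Note that `statistics.stdev` needs at least two data points, so a one-sentence text would raise an error.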

Finally, I created a function to find the first word of each sentence and display them.

In [15]:
def startwords(text):
    mylist = text.split()  
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]
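
Since the goal is to spot repetitive sentence openers, one possible next step is to tally the list that `startwords` returns. This is a sketch using `collections.Counter` from the standard library; the helper name `startword_counts` and the sample text are invented for illustration. `startwords` is repeated here so the example runs on its own.

```python
from collections import Counter

def startwords(text):
    # Same logic as the notebook cell: keep each word that follows a word ending in '.'.
    mylist = text.split()
    bigrams = zip(mylist, mylist[1:])
    return [b[1] for b in bigrams if b[0].endswith('.')]

def startword_counts(text):
    """Count how often each sentence-opening word appears."""
    return Counter(startwords(text))

sample = "The cat sat. The dog ran. A bird flew. The end."
print(startword_counts(sample).most_common())
```

Because `startwords` only looks at words that follow a period, the very first word of the text is not included in the tally.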

Next, I applied these functions to the essays I collected from friends, along with some of my own. Let's see how our writing styles vary in terms of sentence length.

In [39]:
Data['Average Sentence Length'] = Data.Text.apply(average_wordcount)
Data['Standard Deviation'] = Data.Text.apply(stnd_dev)
Data['Start Words'] = Data.Text.apply(startwords)
Data

Unnamed: 0,Title,Text,Average Sentence Length,Standard Deviation,Start Words
0,The Balfour Declaration,"The Balfour Declaration, published November 2n...",19.5,12.525155,"[These, The, The, It, It, His, I, A, This, The..."
1,"Tolerance, Feminism, and the Downside of Globa...","During the Enlightenment, works such as Candid...",22.06383,10.174614,"[Candide, The, Candide, All, Pangloss, Despite..."
2,Common App Essay,"Following many hours of travel—flying, then dr...",23.466667,10.874625,"[Eager, My, He, When, We, Each, My, Faces, Pho..."
3,The Role of Setting in “Disgrace”,"Published in 1999, J.M. Coetzee’s novel revolv...",21.226415,14.070085,"[Coetzee’s, Disgrace, Coetzee, Throughout, Coe..."
4,Designer Babies,"If given the chance to choose the sex, physiq...",23.509804,12.207565,"[Couples, People, As, TThe, The, However,, The..."
5,Freedom and Remorse,Many authors use the same literary tools to w...,22.95,11.041409,"[This, In, “The, Montresor, However,, All, His..."
6,Mansaf and the Sustainability of Traditional J...,Living in the desert has proven time and time ...,28.527778,13.739383,"[A, One, The, Some, Stemming, Mansaf,, Lamb, H..."
7,The Final Unveiling,Huda Shaarawi is regarded as one of the founde...,20.553191,12.719052,"[She, In, Perhaps, When, The, According, Even,..."
8,America’s Lunch Line,"Imagine a group of kindergarteners, all of the...",26.362637,13.850404,"[Some, Others, The, Each, This, The, Today,, T..."
9,More Than Surface Value: A Dialogue,This scene takes place at a prestigious art mu...,19.784615,11.557061,"[A, Ingres’, GOYA, GOYA:, BUYER:, You’re, GOYA..."
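
For context, these averages can be compared against the eleven-to-eighteen-word range mentioned in the introduction. The sketch below uses three averages copied from the table above; the helper name `length_verdict` and its labels are shorthand made up for this example, not ProWritingAid's terminology.

```python
def length_verdict(average, low=11, high=18):
    """Label an average sentence length relative to the 11-18 word range."""
    if average < low:
        return "below range"
    if average > high:
        return "above range"
    return "within range"

# Averages copied from the table above.
averages = {
    "The Balfour Declaration": 19.5,
    "Common App Essay": 23.466667,
    "Designer Babies": 23.509804,
}

for title, average in averages.items():
    print(title, length_verdict(average))
```

Bear in mind that `average_wordcount` splits on single spaces, so the empty string before each leading space is counted as a word. The averages therefore likely run about one word high.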


Now that I had the average sentence length, standard deviation, and first words of the sentences in the data, I moved on to calculating the Flesch-Kincaid reading ease and grade level scores. The first new piece of data I needed was the syllable count for each text.

In [27]:
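def syllable_count(word):
    # Estimate syllables by counting groups of consecutive vowels.
    # Note: a later cell passes an entire essay to this function, so the
    # silent-"e" adjustment only applies if the whole string ends in "e".
    word = word.lower()
    count = 0
    vowels = "aeiouy"
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        # A vowel that follows a non-vowel starts a new group.
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    # Discount a trailing silent "e".
    if word.endswith("e"):
        count -= 1
    # Every input counts as at least one syllable.
    if count == 0:
        count += 1
    return count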
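
Because `syllable_count` is written for a single word, one possible refinement is to apply it word by word and add up the results. The sketch below is an assumption about how that could look, not what produced the tables in this notebook: the helper `text_syllables` is a hypothetical name, and it strips surrounding ASCII punctuation with `string.punctuation` before counting (curly quotes would not be stripped). `syllable_count` is repeated so the example runs on its own.

```python
import string

def syllable_count(word):
    # Same vowel-group heuristic as the notebook cell.
    word = word.lower()
    count = 0
    vowels = "aeiouy"
    if word[0] in vowels:
        count += 1
    for index in range(1, len(word)):
        if word[index] in vowels and word[index - 1] not in vowels:
            count += 1
    if word.endswith("e"):
        count -= 1
    if count == 0:
        count += 1
    return count

def text_syllables(text):
    """Sum the per-word syllable estimates, skipping tokens that are only punctuation."""
    total = 0
    for token in text.split():
        word = token.strip(string.punctuation)
        if word:
            total += syllable_count(word)
    return total

sample = "The table is blue."
print(text_syllables(sample))   # per-word estimate
print(syllable_count(sample))   # whole-string estimate, as applied to the essays
```

The two approaches can disagree, since the silent-"e" rule fires for each word in the first case but only at the very end of the string in the second. Either way the heuristic is rough; for example, it counts "table" as one syllable.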

Next, I found the number of sentences in the text.

In [28]:
def sentence_count(text):
    simplesents = text.split('.')
    num_sents = len(simplesents)
    return num_sents

And finally, the word count.

In [29]:
def wordcount(text):
    simplewords = text.split(' ')
    num_words = len(simplewords)
    return num_words

I added this new data to my data table to ensure my code was working correctly.

In [30]:
Data['Syllable Count'] = Data.Text.apply(syllable_count)
Data['Sentence Count'] = Data.Text.apply(sentence_count)
Data['Word Count'] = Data.Text.apply(wordcount)
Data.head()

Unnamed: 0,Title,Text,Average Sentence Length,Standard Deviation,Start Words,Syllable Count,Sentence Count,Word Count
0,The Balfour Declaration,"The Balfour Declaration, published November 2n...",19.5,12.525155,"[These, The, The, It, It, His, I, A, This, The...",2698,84,1555
1,"Tolerance, Feminism, and the Downside of Globa...","During the Enlightenment, works such as Candid...",22.06383,10.174614,"[Candide, The, Candide, All, Pangloss, Despite...",3566,94,1981
2,Common App Essay,"Following many hours of travel—flying, then dr...",23.466667,10.874625,"[Eager, My, He, When, We, Each, My, Faces, Pho...",956,30,675
3,The Role of Setting in “Disgrace”,"Published in 1999, J.M. Coetzee’s novel revolv...",21.226415,14.070085,"[Coetzee’s, Disgrace, Coetzee, Throughout, Coe...",3646,106,2145
4,Designer Babies,"If given the chance to choose the sex, physiq...",23.509804,12.207565,"[Couples, People, As, TThe, The, However,, The...",4002,102,2297


Now that I had all the data I needed, I calculated the readability score for each text.

In [31]:
def FRE(text):
    score = 206.835 - 1.015 * (wordcount(text) / sentence_count(text)) - 84.6 * (syllable_count(text) / wordcount(text))
    return score

Then, the estimated grade level.

In [32]:
def grade_level(text):
    new_score = 0.39 * (wordcount(text) / sentence_count(text)) + 11.8 * (syllable_count(text) / wordcount(text)) - 15.59
    return new_score

In [33]:
Data['Flesh-Kincaid Reading Ease'] = Data.Text.apply(FRE)
Data["Flesh-Kincaid Grade Level"] = Data.Text.apply(grade_level)
Data.head()

Unnamed: 0,Title,Text,Average Sentence Length,Standard Deviation,Start Words,Syllable Count,Sentence Count,Word Count,Flesh-Kincaid Reading Ease,Flesh-Kincaid Grade Level
0,The Balfour Declaration,"The Balfour Declaration, published November 2n...",19.5,12.525155,"[These, The, The, It, It, His, I, A, This, The...",2698,84,1555,41.260336,12.103212
1,"Tolerance, Feminism, and the Downside of Globa...","During the Enlightenment, works such as Candid...",22.06383,10.174614,"[Candide, The, Candide, All, Pangloss, Despite...",3566,94,1981,33.155874,13.870234
2,Common App Essay,"Following many hours of travel—flying, then dr...",23.466667,10.874625,"[Eager, My, He, When, We, Each, My, Faces, Pho...",956,30,675,64.178833,9.897296
3,The Role of Setting in “Disgrace”,"Published in 1999, J.M. Coetzee’s novel revolv...",21.226415,14.070085,"[Coetzee’s, Disgrace, Coetzee, Throughout, Coe...",3646,106,2145,42.495333,12.359231
4,Designer Babies,"If given the chance to choose the sex, physiq...",23.509804,12.207565,"[Couples, People, As, TThe, The, However,, The...",4002,102,2297,36.581342,13.751463


Here I cleaned up my data to display the important information to the user. And ta-da! My program is complete.

In [36]:
Data[["Title", "Average Sentence Length", "Standard Deviation", "Start Words", "Flesh-Kincaid Reading Ease", "Flesh-Kincaid Grade Level"]]

Unnamed: 0,Title,Average Sentence Length,Standard Deviation,Start Words,Flesh-Kincaid Reading Ease,Flesh-Kincaid Grade Level
0,The Balfour Declaration,19.5,12.525155,"[These, The, The, It, It, His, I, A, This, The...",41.260336,12.103212
1,"Tolerance, Feminism, and the Downside of Globa...",22.06383,10.174614,"[Candide, The, Candide, All, Pangloss, Despite...",33.155874,13.870234
2,Common App Essay,23.466667,10.874625,"[Eager, My, He, When, We, Each, My, Faces, Pho...",64.178833,9.897296
3,The Role of Setting in “Disgrace”,21.226415,14.070085,"[Coetzee’s, Disgrace, Coetzee, Throughout, Coe...",42.495333,12.359231
4,Designer Babies,23.509804,12.207565,"[Couples, People, As, TThe, The, However,, The...",36.581342,13.751463
5,Freedom and Remorse,22.95,11.041409,"[This, In, “The, Montresor, However,, All, His...",47.882232,12.03785
6,Mansaf and the Sustainability of Traditional J...,28.527778,13.739383,"[A, One, The, Some, Stemming, Mansaf,, Lamb, H...",31.583651,15.69961
7,The Final Unveiling,20.553191,12.719052,"[She, In, Perhaps, When, The, According, Even,...",45.997567,11.706434
8,America’s Lunch Line,26.362637,13.850404,"[Some, Others, The, Each, This, The, Today,, T...",28.405065,15.600907
9,More Than Surface Value: A Dialogue,19.784615,11.557061,"[A, Ingres’, GOYA, GOYA:, BUYER:, You’re, GOYA...",55.025728,10.252876


## Conclusion

Overall, I am pretty happy with how this program turned out. However, one limitation of my design is that although it calculates a grade level score, it does not actually check the correctness of the writing or give feedback on specific style or syntax. Going forward, I would like to investigate whether there are ways to return the text marked up with any sentences that are outliers in terms of length or structure. This would point the user to much more specific areas that need correction.
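
As a rough idea of how flagging outliers by length might work, here is a minimal sketch. Everything in it is an assumption rather than part of the finished program: the function name `flag_outlier_sentences`, the regular-expression sentence splitter, and the cutoff of two standard deviations from the mean are all illustrative choices.

```python
import re
import statistics

def flag_outlier_sentences(text, threshold=2.0):
    """Return (sentence, word_count) pairs whose length is far from the mean."""
    # Split on whitespace that follows '.', '!' or '?' (an assumed splitter).
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return []  # stdev needs at least two sentences
    mean = statistics.mean(lengths)
    spread = statistics.stdev(lengths)
    if spread == 0:
        return []  # every sentence has the same length
    return [
        (sentence, length)
        for sentence, length in zip(sentences, lengths)
        if abs(length - mean) / spread > threshold
    ]

# Made-up text: nine three-word sentences followed by one thirty-word sentence.
sample = "I like tea. " * 9 + " ".join(["word"] * 30) + "."
for sentence, length in flag_outlier_sentences(sample):
    print(length, sentence[:40])
```

A real version would probably need a better sentence splitter and a cutoff tuned on actual essays.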

One tool I would like to look into more in the future for this is spaCy. I am really intrigued by it, and I think that with more time I would be able to do a lot to further my project and look deeper into analyzing and editing sentence structure and syntax.

Additionally, I believe I could make this program more useful by collecting works of recognized authors and scholars and seeing if I could find a correlation between their sentence length and structure, or perhaps similarities across disciplines. 

But despite its limitations, I believe my project was successful. I gained more confidence with Python and created something that will actually be useful to me and others. Already, by looking at the sentence length for all of the pieces that I contributed, I can see that I tend to write sentences that are a bit too lengthy, at least according to ProWritingAid... Seeing the data lined up together was very interesting to me, and my friends and I liked being able to see what level we are supposedly writing at. I am excited to continue with Python and see what else I can create!