# Loading Data

In [110]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import nltk
from collections import Counter
from nltk.corpus import stopwords
from nltk.corpus import wordnet
nltk.download('stopwords')
nltk.download('omw-1.4')

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, f1_score

from tqdm import tqdm
tqdm.pandas(desc="progress-bar")
from gensim.models import doc2vec, Doc2Vec
from sklearn import utils
import gensim
from gensim.models.doc2vec import TaggedDocument
import re

import os

import warnings
warnings.filterwarnings('ignore')
"""
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
"""

[nltk_data] Downloading package stopwords to /usr/share/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /usr/share/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


"\nfor dirname, _, filenames in os.walk('/kaggle/input'):\n    for filename in filenames:\n        print(os.path.join(dirname, filename))\n"

# Finding and Creating a Dataset (10 hours)

### To undergo analysis of whether a line indicates the end of a paragraph, we need a lot of data. Thus, there is a readily available dataset on Kaggle with thousands of Wikipedia articles generated via web scrapping that we will use.


### First, we will create an empty dataset with two columns that we for sure know we will need - the line of text we are given, and a boolean indicator for whether that line of text is at the end of a paragraph.

In [111]:
wikipedia_data = pd.DataFrame(columns = ['Text', 'End of Paragraph'])

### Currently, the Wikipedia data is not in the form that we need. Looking through the data, we can see that each article has many text files associated with them, indicating the link of the article, external links, etc. 
### We only need to extract one text file from each article - that which is giving us the body of the Wikipedia article. Thus, we will loop through all the articles given and only save the text files with the name 'bodyText.txt' into a list.

In [112]:
base_dir = '../input/wikipedia-articles-extracted/Article'

target_sub_dir_name = 'bodyText.txt'

bt_paths = []

# walk subdirs under base_dir
for root, dirs, files in os.walk(base_dir):
    for dirs in files:
        if dirs and target_sub_dir_name in dirs:
            bt_paths.append(os.path.join(root, target_sub_dir_name))


### After we have the list of text files that we need, we will add all the lines of those files into one large list of text.

In [113]:
# Adds all files that have the name "bodyText.txt" so we can analyze the main body of the article
topology_list = []
for path in bt_paths:
    file = open(path,'r')
    topology_list.append(file.readlines())

### We haven't fully filtered through the data we need yet. So far, the data is in the form of blobs of text, where each element of topology_list is a paragraph of text. We want to filter through the paragraphs and manually add new lines after 80 characters, a standard line length by a variety of platforms. We will create a function to perform this task.


### (Aside - this may not be the standard line length of a Wikipedia article on many laptops and computers, yet 80 characters is a somewhat universal standard for which many writers base their line length upon. Furthermore, this increases accessibility for a variety of devices in which many people may be reading Wikipedia on, including tablets and smart phones. We may not be following the exact line lengths of Wikipedia, but this will allow for more general analysis that can be applied to a variety of other sources of data).

In [114]:
# Inserts a new line about every 80 characters
def insert_newlines(string, every=80):
    lines = []
    for i in range(0, len(string), every):
        lines.append(string[i:i+every])
    return '\n'.join(lines)

#### We will then create a new topology list that takes in lines based on the standard line length we have arbitrarily imposed. Before extracting each individual line, we will indicate at the end of each paragraph in topology_list whether it is a paragraph by a key marker we will remove later on. 

#### This step is adding an indication of a new line after 80 characters so those lines may be separated later.

In [115]:
# New list for each line based on given line length
new_topology_list = []
for file in topology_list:
    for paragraph in file:
        # Adds indication of end of paragraph to end of paragraph
        paragraph = paragraph + "[[END]]"
        # Adds automatic new line label after every 80 characters
        new_topology_list.append(insert_newlines(paragraph))

### To now get all the lines without a new line marker, we will split all lines based on the "\n" marker we imposed.

In [116]:
all_lines = []
for paragraph in new_topology_list:
    lines = paragraph.split("\n")
    for line in lines:
        all_lines.append(line)

### Thus, we now add all the individual lines to the dataframe's "Text" column and initialize all "End of Paragraph" boolean values as false before we can edit this later.

In [117]:
wikipedia_data['Text'] = [line for line in all_lines]
wikipedia_data['End of Paragraph'] = False

### We can now analyze the data. In each original paragraph, there was a "\n" marker at the end of the paragraph that showed that it was the end. We then added "[[END]]" at the end of each paragraph so that we would not confuse the end of paragraph marking with our arbitrarily-added new lines. As a result, we now have many rows that only have the value of "[[END]]". This will prove useful as we develop the 'End of Paragraph' column.

In [118]:
display(wikipedia_data)

Unnamed: 0,Text,End of Paragraph
0,John Prophet (1356–1416) was an English mediev...,False
1,"per of the Privy Seal and, Dean of Hereford an...",False
2,le administrator he remained loyal to all king...,False
3,cunning. Although guilty of simony and plura...,False
4,successfully made the transition from Richard ...,False
...,...,...
627160,(in French) Béroul's Le Roman de Tristan,False
627161,[[END]],False
627162,(in French) Thomas d'Angleterre's Tristan,False
627163,[[END]],False


### If we see a row that has the value "[[END]]", we know that the previous row must have been the last line in the paragraph. Thus, we can loop through the 'Text' column and assign the previous row as being the end of the paragraph if the current row has the value of '[[END]]'

In [119]:
for ind, text in enumerate(wikipedia_data['Text']):
    if "[[END]]" in text:
        # assign previous row's value for end of paragraph as true
        wikipedia_data.loc[ind - 1, "End of Paragraph"] = True

### And then we can remove the row with "[[END]]" so that it will not interfere with our analysis (as well as remove any general empty rows

In [120]:
# remove rows that have the value "[[END]]" in them
wikipedia_data = wikipedia_data[wikipedia_data['Text'].str.contains("END") == False]
# remove rows with empty strings
wikipedia_data = wikipedia_data[wikipedia_data['Text'] != ""]
# resetting indices after removing elements
wikipedia_data = wikipedia_data.reset_index()

### We have now created the dataset we need and can perform feature engineering. 

In [121]:
display(wikipedia_data)

Unnamed: 0,index,Text,End of Paragraph
0,0,John Prophet (1356–1416) was an English mediev...,False
1,1,"per of the Privy Seal and, Dean of Hereford an...",False
2,2,le administrator he remained loyal to all king...,False
3,3,cunning. Although guilty of simony and plura...,False
4,4,successfully made the transition from Richard ...,False
...,...,...,...
416593,627154,Tristan page from the Camelot Project,True
416594,627156,Bibliography of Modern Tristania in English,True
416595,627158,Tristan and Iseult public domain audiobook at...,True
416596,627160,(in French) Béroul's Le Roman de Tristan,True


# Feature Engineering

### Let's brainstorm some features that may be an indication that the line we read is the end of a paragraph. Some immediate features that come to mind are the length of the line (we can create two separate variables for the length and whether the length is short) and whether the line ends with punctuation such as '.' or '!'. Let's create functions to add these features.

In [122]:
# Creating a feature that marks the length of the line
wikipedia_data['Line Length'] = wikipedia_data['Text'].apply(lambda x: len(x))

# Creating a feature that determines whether the length of the line is significantly smaller
wikipedia_data['Short Length Line'] = wikipedia_data['Text'].apply(lambda x: len(x) < 70)

# Creating a feature that determines whether the line ends with punctuation
wikipedia_data['Ends with Punctuation'] = wikipedia_data['Text'].apply(lambda x: x[-1] is '.' or x[-1] is '!')

# Tones with ending sentences ?
display(wikipedia_data)

Unnamed: 0,index,Text,End of Paragraph,Line Length,Short Length Line,Ends with Punctuation
0,0,John Prophet (1356–1416) was an English mediev...,False,80,False,False
1,1,"per of the Privy Seal and, Dean of Hereford an...",False,80,False,False
2,2,le administrator he remained loyal to all king...,False,80,False,False
3,3,cunning. Although guilty of simony and plura...,False,80,False,False
4,4,successfully made the transition from Richard ...,False,80,False,False
...,...,...,...,...,...,...
416593,627154,Tristan page from the Camelot Project,True,37,True,False
416594,627156,Bibliography of Modern Tristania in English,True,43,True,False
416595,627158,Tristan and Iseult public domain audiobook at...,True,55,True,False
416596,627160,(in French) Béroul's Le Roman de Tristan,True,40,True,False


### We can also try to find the most common words used in lines that indicate the end of a paragraph and compare them to other lines. Let's see if there are any significant differences.

In [123]:
# List of all words in text where the lines are not the end of the paragraph
not_paragraph = []
# List of all words in text where the lines are the end of the paragraph
paragraph = []

# Find most common words in both lines that are at the end of a paragraph and those that aren't
for ind, text in enumerate(wikipedia_data['Text']):
    # Gets unique list of words to analyze
    for txt in list(set(text.split())):
        txt = txt.lower()
        if txt not in stopwords.words('english') and len(txt) > 1 and wordnet.synsets(txt):
            if wikipedia_data.loc[ind, 'End of Paragraph'] and txt not in paragraph:
                paragraph.append(txt)
            elif not wikipedia_data.loc[ind, 'End of Paragraph'] and txt not in not_paragraph:
                not_paragraph.append(txt)
                
# Pass the lists to instance of Counter class
not_paragraph_counter = Counter(not_paragraph)
paragraph_counter = Counter(paragraph)

# Get 500 most common words for each list 
most_occur_not_paragraph = not_paragraph_counter.most_common(500)
most_occur_paragraph = paragraph_counter.most_common(500)

#maybe contains numbers ? more numbers at end?
# create column that counts 500 top words and returns whether that text contains a word in either set

# isolate most common words that are only at the end of paragraph list 
#only_end_of_paragraph = [word for word in list(zip(*most_occur_paragraph)) if word not in list(zip(*most_occur_not_paragraph))] 
only_end_of_paragraph = list(set(zip(*most_occur_not_paragraph)).difference(zip(*most_occur_paragraph)))                         
print("Most common words only at lines that are at the end of a paragraph: %s" %(only_end_of_paragraph))

print("\n\n\n")

# isolate most common words that are only at the not end of paragraph list
#only_not_end_of_paragraph = [word for word in list(zip(*most_occur_not_paragraph)) if word not in list(zip(*most_occur_paragraph))]
only_not_end_of_paragraph = list(set(zip(*most_occur_paragraph)).difference(zip(*most_occur_not_paragraph)))   
print("Most common words only at lines that are not at the end of a paragraph: %s" %(only_not_end_of_paragraph))

Most common words only at lines that are at the end of a paragraph: [('henry', 'prophet', 'john', 'medieval', 'english', 'secretary', 'king', 'hereford', 'distinguished', 'dean', 'privy', 'seal', 'le', 'administrator', 'kings', 'remained', 'loyal', 'mix', 'guilty', 'simony', 'made', 'transition', 'successfully', 'court', 'extravagant', 'educated', 'university', 'entered', 'ordained', 'holy', 'priest', 'th', 'clerk', 'served', 'office', 'onwards', 'appointed', 'chaplain', 'archbishop', 'council', 'time', 'prebendary', 'converted', 'diocese', 'habit', 'cathedral', 'continued', 'gifts', 'curios', 'anomalous', 'another', 'nonetheless', 'servants', 'clever', 'crown', 'hard', 'commonly', 'promoted', 'rewarded', 'energetic', 'royal', 'pursuit', 'agenda', 'first', 'system', 'introducing', 'register', 'collation', 'november', 'ran', 'kind', 'thence', 'portion', 'appertaining', 'upper', 'saints', 'church', 'parish', 'hall', 'orpington', 'kent', 'rector', 'also', 'preferred', 'deanery', 'year', '

### We can take in this input and create separate boolean columns for whether words in the given text are found in the list of most common words occurring at the end of a paragraph and the list of most common words occurring in the middle or beginning of the paragraph (exclusive).

In [124]:
# Turns tuple input into list
only_end_of_paragraph_list = list(only_end_of_paragraph[0])
only_not_end_of_paragraph_list = list(only_not_end_of_paragraph[0])

In [125]:
# Creates column based on whether the text contains words common at lines that signal the end of paragraph
wikipedia_data['Contains Common Word at End'] = wikipedia_data['Text'].apply(lambda x: any(word in str(x) for word in only_end_of_paragraph_list))
# Creates column based on whether the text contains words common at lines not at the end of the paragraph
wikipedia_data['Contains Common Word NOT at End'] = wikipedia_data['Text'].apply(lambda x: any(word in str(x) for word in only_not_end_of_paragraph_list))

### Now we have many key features we can use to build a machine learning model.

In [126]:
display(wikipedia_data)

Unnamed: 0,index,Text,End of Paragraph,Line Length,Short Length Line,Ends with Punctuation,Contains Common Word at End,Contains Common Word NOT at End
0,0,John Prophet (1356–1416) was an English mediev...,False,80,False,False,True,True
1,1,"per of the Privy Seal and, Dean of Hereford an...",False,80,False,False,True,True
2,2,le administrator he remained loyal to all king...,False,80,False,False,True,True
3,3,cunning. Although guilty of simony and plura...,False,80,False,False,True,True
4,4,successfully made the transition from Richard ...,False,80,False,False,True,True
...,...,...,...,...,...,...,...,...
416593,627154,Tristan page from the Camelot Project,True,37,True,False,True,True
416594,627156,Bibliography of Modern Tristania in English,True,43,True,False,True,True
416595,627158,Tristan and Iseult public domain audiobook at...,True,55,True,False,True,True
416596,627160,(in French) Béroul's Le Roman de Tristan,True,40,True,False,True,False


# Building and Evaluating a Model (2 hours)

## Now that we have developed many features, we can create a model based around our dataset. Because we know that our model is going to be dealing with text classification, we must find a model ideal for text classification. After conducting some research, I found a model called 'doc2vec' that can take the mathematical average of the word vector representations of all sentences in the document. I will be borrowing code from the blog post at https://towardsdatascience.com/multi-class-text-classification-model-comparison-and-selection-5eb066197568.

### We will first begin by labeling each sentence in the document.

In [127]:
def label_sentences(corpus, label_type):
    """
    Gensim's Doc2Vec implementation requires each document/paragraph to have a label associated with it.
    We do this by using the TaggedDocument method. The format will be "TRAIN_i" or "TEST_i" where "i" is
    a dummy index of the post.
    """
    labeled = []
    for i, v in enumerate(corpus):
        label = label_type + '_' + str(i)
        labeled.append(doc2vec.TaggedDocument(v.split(), [label]))
    return labeled
X_train, X_test, y_train, y_test = train_test_split(wikipedia_data_copy['Text'], wikipedia_data_copy['End of Paragraph'], random_state=0, test_size=0.3)
X_train = label_sentences(X_train, 'Train')
X_test = label_sentences(X_test, 'Test')
all_data = X_train + X_test

### Then we will initialize the model.

In [128]:
model_dbow = Doc2Vec(dm=0, vector_size=300, negative=5, min_count=1, alpha=0.065, min_alpha=0.065)
model_dbow.build_vocab([x for x in tqdm(all_data)])

for epoch in range(30):
    model_dbow.train(utils.shuffle([x for x in tqdm(all_data)]), total_examples=len(all_data), epochs=1)
    model_dbow.alpha -= 0.002
    model_dbow.min_alpha = model_dbow.alpha

100%|██████████| 416598/416598 [00:00<00:00, 3586992.15it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3589313.15it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3577525.65it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3462119.54it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3525241.16it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3410000.19it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3512697.45it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3570143.00it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3537402.06it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3480983.13it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3538197.14it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3554253.72it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3530726.03it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3529763.16it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3524906.92it/s]
100%|██████████| 416598/416598 [00:00<00:00, 3461694.29it/s]
100%|██████████| 416598/

### Next, we will obtain vectors from the model...

In [129]:
def get_vectors(model, corpus_size, vectors_size, vectors_type):
    """
    Get vectors from trained doc2vec model
    :param doc2vec_model: Trained Doc2Vec model
    :param corpus_size: Size of the data
    :param vectors_size: Size of the embedding vectors
    :param vectors_type: Training or Testing vectors
    :return: list of vectors
    """
    vectors = np.zeros((corpus_size, vectors_size))
    for i in range(0, corpus_size):
        prefix = vectors_type + '_' + str(i)
        vectors[i] = model.docvecs[prefix]
    return vectors
    
train_vectors_dbow = get_vectors(model_dbow, len(X_train), 300, 'Train')
test_vectors_dbow = get_vectors(model_dbow, len(X_test), 300, 'Test')

### ... and get a logistic regression model from our trained features.

In [130]:
logreg = LogisticRegression(n_jobs=1, C=1e5)
logreg.fit(train_vectors_dbow, y_train)
logreg = logreg.fit(train_vectors_dbow, y_train)

## Now that we have our model, we can evaluate how the model performs on our test set through a variety of metrics. Because our dataset is imbalanced (logically, there are many more lines that are not at the end of the paragraph than are at the end of the paragraph), we will prioritize understanding our f1-score; we can still examine our accuracy to make sure that our accuracy is also high since it is another important metric.

In [131]:
y_pred = logreg.predict(test_vectors_dbow)
print('accuracy: %s' % accuracy_score(y_pred, y_test))
print('f1: %s' % f1_score(y_pred, y_test))

print("\n")

print(classification_report(y_test, y_pred))

accuracy: 0.8672747639622339
f1: 0.7241493996740613


              precision    recall  f1-score   support

       False       0.88      0.95      0.91     91283
        True       0.82      0.65      0.72     33697

    accuracy                           0.87    124980
   macro avg       0.85      0.80      0.82    124980
weighted avg       0.86      0.87      0.86    124980



## After all our work, we can see that we have achieved a total accuracy of 0.868 and an f1 score of 0.727, ensuring that the model we created is reliable for predicting what lines are at the end of a paragraph.

# Reflection / Next Steps

I had a lot of fun working on this project and really expanded my machine learning skills. I familiaried myself with many principles that I previously learned and engaged in learning new ones, such as how to connect many files based on the same name and how to analyze a plethora of data. This was my first time starting completely from scratch with a dataset and manually finding ways to add features, which proved to be very rewarding; I also learned how to optimize a model for a text classification task specifically by undergoing research to find an ideal solution. I learned how to experiment with a variety of new models and techniques for machine learning (such as doc2vec) that I know I will use throughout my future ML practices. Most importantly, I learned how to learn new techniques for analyzing machine learning fast and incorporate those techniques throughout my work.

Though I am satisfied with the current work I have done, I can recognize future areas of improvement that I would love to tackle. First, I would like to incorporate a much greater variety of data throughout this program. While I used Wikipedia data because it was plentiful and readily available, Wikipedia articles tend to have similar structures to each other, making it easier for the model to predict when the end of a paragraph is likely to occur. I want to feed a greater variety of data into the model, such as random blog posts, poems, and stories, to diversify the data and allow for there to be many more ways to predict when the end of a paragraph occurs. This would also allow for a greater diversity of voices to be heard so that we may hear a multitude of perspectives and understand how a wide set of people may end paragraphs differently (i.e. within different settings - colloquial blog posts versus formal contracts). Another way to increase the diversity of voices heard would be to analyze more text in more languages; I may take this project further by inputting Ukrainian and Russian language data within the model to then improve linguistic diversity of identifying the end of paragraphs while ensuring I can understand the data I analyze (and maintain an ethical standard). To make this model more applicable to contracts and acceptance letters, I want to work towards creating a database of a variety of these to analyze explicitly, customizing the model for the specific task at hand.

There are also more features that I would like to experiment with. For example, I noticed that lines at the end of a paragraph tend to have a different "conclusive" tone than those that are at the beginning or middle of a paragraph. I want to find a way to extract this tone, such as by using IBM Watson (albeit likely having to pay a fee), to add as a column to the dataset. I also want to experiment with finding ways to algorithmically identify more features that may lead to an end of paragraph indication and experiment with dropping and combining features (for example, having just n columns may lead to greater accuracy than considering all columns when building a model). I want to build a pipeline where I can test many models ideal for text classification and finetune their hyperparameters to truly maximize accuracy.

As a simple task, I want to grow my machine learning skills so that I can learn to optimize the code within this program as a whole and increase efficiency and runtime wherever possible.

Another area of possible work is to find a way to create my own machine learning models to optimize the task at hand. While researching how to complete this task, I found an interesting paper from the 2017 International Conference on Information and Communication Technology Convergence titled "Line-break prediction of hanmun text using recurrent neural networks". This paper used recurrent neural networks to create NLP models that predicted when Hanmun poems had line breaks. This paper was a fascinating read and inspired me to use its applications and formulas to find a way to create an original model optimized for specifically finding line breaks.

I can foresee some possible limitations with this model, such as that a short line length is not necessarily indicative of being the last line of a paragraph (such as the given acceptance letter that had a paragraph listing the employer, company name, etc on different lines); this is another area to explore with further work.

Overall, this project was thrilling and confirmed my passion for machine learning. I am inspired to continue iterating on this program and I hope to do similar projects to these in the future, analyzing a wide variety of texts and to help serve others.