# Sentence Extraction
- We have paragraphs in HTML format: each paragraph sits between paragraph tags (`<p>` tags), and paragraphs are separated by newlines.
- We need to extract proper sentences from these paragraphs.
- **Using bs4**, we extract the plain text from the paragraph tags (`<p>` tags).
- Then we use **spaCy** to split each extracted paragraph into sentences.

### Encoding sidenotes
- Files store bytes rather than unicode strings, so we always encode **unicode objects** to UTF-8 before writing them to a file.
- spaCy's **Doc.sents** yields sentences as ***Span objects***, and ***span_object.text*** returns a **unicode object** (a regular Python unicode string), so we need to encode it before appending `\n` and writing it out.
- This also means that when we read the strings back from a file, we need to decode them with the decode() function to get the **unicode objects** back.
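
The round trip described in these notes can be sketched as follows. This is a minimal illustration that uses only the standard library; the helper names `write_sentences` and `read_sentences` and the sample sentences are hypothetical and do not come from the notebook. The files are opened in binary mode so that the explicit encode and decode steps behave the same way under Python 2 and Python 3.

```python
import os
import tempfile


def write_sentences(path, sentences):
    """Encode each unicode sentence to UTF-8 bytes and write one per line."""
    with open(path, 'wb') as fobj:
        for sentence in sentences:
            # Encode first, then append the newline as bytes.
            fobj.write(sentence.encode('utf-8') + b'\n')


def read_sentences(path):
    """Read the bytes back and decode each line into a unicode object."""
    with open(path, 'rb') as fobj:
        return [line.decode('utf-8').rstrip(u'\n') for line in fobj]


# Hypothetical sample data containing non-ASCII characters.
sentences = [u'Caf\xe9 au lait.', u'Na\xefve Bayes works.']
path = os.path.join(tempfile.mkdtemp(), 'sentences.txt')

write_sentences(path, sentences)
restored = read_sentences(path)
```

If the encode step is skipped under Python 2, writing a unicode object with non-ASCII characters to a file opened in the default mode falls back to the ASCII codec and raises `UnicodeEncodeError`, which is why the notes insist on encoding explicitly.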

In [1]:
from bs4 import BeautifulSoup
import os

In [2]:
import spacy
nlp = spacy.load('en')

In [3]:
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

In [9]:
def get_sentences(PATH):
    '''Read PATH/paras.txt, strip the <p> tags with bs4, split each paragraph
    into sentences with spaCy and write them, one per line, to PATH/sentences.txt.'''

    with open(os.path.join(PATH, 'paras.txt')) as fobj, \
         open(os.path.join(PATH, 'sentences.txt'), 'w') as sent_file:
        # Read the file line by line; only lines starting with <p> hold a paragraph.
        for line in fobj:
            line = line.strip()
            if not line.startswith('<p>'):
                continue
            try:
                # Decode the raw bytes and extract the tagless paragraph text from the <p> tag.
                soup = BeautifulSoup(line.decode('utf-8'), "lxml")

                # Split the tagless paragraph into proper sentences with spaCy.
                doc = nlp(soup.p.text)
            except Exception:
                # If the line can't be parsed, log it and continue with the next line.
                logging.warning("%s : %r can't be parsed.", PATH, line)
                continue

            # If parsed, write each sentence, encoded as UTF-8, to 'sentences.txt'.
            for each in doc.sents:
                sent_file.write(each.text.encode('utf-8') + '\n')

In [10]:
dirs = os.listdir('./stackexchange')
#print len(dirs)
for each in dirs:
    PATH = os.path.join(os.getcwd(),'stackexchange',each)
    logging.info('Generating sentences for ' + each)
    get_sentences(PATH)
    logging.info('SUCCESS')

2017-06-27 10:15:36,110 : INFO : Generating sentences for earthscience.stackexchange.com
2017-06-27 10:16:40,870 : INFO : SUCCESS
2017-06-27 10:16:40,870 : INFO : Generating sentences for meta.webapps.stackexchange.com
2017-06-27 10:16:51,761 : INFO : SUCCESS
2017-06-27 10:16:51,762 : INFO : Generating sentences for economics.stackexchange.com
2017-06-27 10:18:32,489 : INFO : SUCCESS
2017-06-27 10:18:32,490 : INFO : Generating sentences for meta.hinduism.stackexchange.com
2017-06-27 10:18:39,745 : INFO : SUCCESS
2017-06-27 10:18:39,746 : INFO : Generating sentences for meta.webmasters.stackexchange.com
2017-06-27 10:18:49,431 : INFO : SUCCESS
2017-06-27 10:18:49,432 : INFO : Generating sentences for meta.mechanics.stackexchange.com
2017-06-27 10:18:56,352 : INFO : SUCCESS
2017-06-27 10:18:56,353 : INFO : Generating sentences for meta.arduino.stackexchange.com
2017-06-27 10:19:00,001 : INFO : SUCCESS
2017-06-27 10:19:00,002 : INFO : Generating sentences for meta.ell.stackexchange.com
20

2017-06-27 12:53:02,784 : INFO : SUCCESS
2017-06-27 12:53:02,785 : INFO : Generating sentences for meta.money.stackexchange.com
2017-06-27 12:53:14,178 : INFO : SUCCESS
2017-06-27 12:53:14,178 : INFO : Generating sentences for mechanics.stackexchange.com
2017-06-27 12:57:49,894 : INFO : SUCCESS
2017-06-27 12:57:49,895 : INFO : Generating sentences for meta.boardgames.stackexchange.com
2017-06-27 12:58:03,891 : INFO : SUCCESS
2017-06-27 12:58:03,892 : INFO : Generating sentences for meta.stats.stackexchange.com
2017-06-27 12:58:33,993 : INFO : SUCCESS
2017-06-27 12:58:33,994 : INFO : Generating sentences for meta.sitecore.stackexchange.com
2017-06-27 12:58:36,039 : INFO : SUCCESS
2017-06-27 12:58:36,040 : INFO : Generating sentences for meta.iot.stackexchange.com
2017-06-27 12:58:38,433 : INFO : SUCCESS
2017-06-27 12:58:38,434 : INFO : Generating sentences for meta.startups.stackexchange.com
2017-06-27 12:58:41,749 : INFO : SUCCESS
2017-06-27 12:58:41,750 : INFO : Generating sentences f

2017-06-27 16:31:14,729 : INFO : SUCCESS
2017-06-27 16:31:14,730 : INFO : Generating sentences for meta.3dprinting.stackexchange.com
2017-06-27 16:31:16,673 : INFO : SUCCESS
2017-06-27 16:31:16,674 : INFO : Generating sentences for meta.literature.stackexchange.com
2017-06-27 16:31:23,489 : INFO : SUCCESS
2017-06-27 16:31:23,490 : INFO : Generating sentences for meta.engineering.stackexchange.com
2017-06-27 16:31:29,603 : INFO : SUCCESS
2017-06-27 16:31:29,604 : INFO : Generating sentences for meta.woodworking.stackexchange.com
2017-06-27 16:31:32,616 : INFO : SUCCESS
2017-06-27 16:31:32,618 : INFO : Generating sentences for meta.history.stackexchange.com
2017-06-27 16:31:45,658 : INFO : SUCCESS
2017-06-27 16:31:45,659 : INFO : Generating sentences for meta.sqa.stackexchange.com
2017-06-27 16:31:48,701 : INFO : SUCCESS
2017-06-27 16:31:48,701 : INFO : Generating sentences for meta.computergraphics.stackexchange.com (3)
2017-06-27 16:31:50,977 : INFO : SUCCESS
2017-06-27 16:31:50,978 : 

2017-06-27 19:40:09,622 : INFO : SUCCESS
2017-06-27 19:40:09,623 : INFO : Generating sentences for meta.pets.stackexchange.com
2017-06-27 19:40:17,837 : INFO : SUCCESS
2017-06-27 19:40:17,838 : INFO : Generating sentences for meta.biology.stackexchange.com
2017-06-27 19:40:31,193 : INFO : SUCCESS
2017-06-27 19:40:31,194 : INFO : Generating sentences for english.stackexchange.com
2017-06-27 20:10:32,780 : INFO : SUCCESS
2017-06-27 20:10:32,781 : INFO : Generating sentences for writers.stackexchange.com
2017-06-27 20:13:57,416 : INFO : SUCCESS
2017-06-27 20:13:57,417 : INFO : Generating sentences for meta.reverseengineering.stackexchange.com
2017-06-27 20:13:59,848 : INFO : SUCCESS
2017-06-27 20:13:59,849 : INFO : Generating sentences for meta.serverfault.com
2017-06-27 20:14:47,315 : INFO : SUCCESS
2017-06-27 20:14:47,316 : INFO : Generating sentences for meta.electronics.stackexchange.com
2017-06-27 20:15:16,202 : INFO : SUCCESS
2017-06-27 20:15:16,203 : INFO : Generating sentences for