# Q&A Data for Pretraining

Load several NLP datasets and combine them into a single source. This notebook outputs an 850 MB text data file with 84M words/tokens, with the following distribution (by character count):
- 65% from wiki103
- 18% from Tensorflow 2.0 Q&A
- 17% from the StackSample dataset

## Text datasets used

- **[StackSample data](https://www.kaggle.com/stackoverflow/stacksample):** A dump of 10% of StackOverflow questions and answers. Only answers with a score > 7 were used, to try to enforce some quality control on the relationship between question and answer

- **[Kaggle Tensorflow 2.0 Q&A data](https://www.kaggle.com/c/tensorflow2-question-answering/data):** Q&A competition data from [Google's Natural Questions dataset](https://github.com/google-research-datasets/natural-questions/blob/master/README.md)

- **[Wikitext-103](https://www.salesforce.com/products/einstein/ai-research/the-wikitext-dependency-language-modeling-dataset/):** A scrape of Wikipedia articles. I used the "raw" version here

## Data Processing
I aligned the datasets as best I could.

- HTML was stripped
- Wikipedia articles were combined based on the header sub-structure ("=" equals H1, "==" equals H2, etc.)
- Beginning and end of wikipedia article: `[BOS = 'xxars', EOS = 'xxare']`
- Headers and sub-headers got a special token for start and end: `[H1S='xxh1s', H1e='xxh1e', H2S='xxh2s',H2e='xxh2e' etc..]`
- Question Title, Body and Answers all got start and end tokens: `xxqts, xxqte, xxqbs, xxqbe, xxans, xxane`
- Newlines were replaced with `xxnpg`
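
The token scheme above can be sketched on a single toy Q&A example. This is only an illustration: `build_qa_doc` and `mark_newlines` are hypothetical helpers, not functions from this notebook, and the sample strings are made up.

```python
# Special tokens as listed above
QTS, QTE = 'xxqts', 'xxqte'   # question title start / end
QBS, QBE = 'xxqbs', 'xxqbe'   # question body start / end
ANS, ANE = 'xxans', 'xxane'   # answer start / end
NPG = 'xxnpg'                 # newline marker


def mark_newlines(text: str) -> str:
    """Replace each newline with the newline token, padded by spaces."""
    return text.replace('\n', f' {NPG} ')


def build_qa_doc(title: str, body: str, answer: str) -> str:
    """Hypothetical helper: wrap each field in its tokens and join into one document."""
    parts = [(QTS, title, QTE), (QBS, body, QBE), (ANS, answer, ANE)]
    return ' '.join(f'{start} {mark_newlines(text)} {end}' for start, text, end in parts)


doc = build_qa_doc(
    'How do I reverse a list?',
    'I have a list.\nHow can I reverse it?',
    'Use reversed(xs).',
)
print(doc)
```

Padding the tokens with spaces keeps them separate from neighbouring words, so a whitespace-based tokenizer can treat each one as its own token.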


## Thanks
This [notebook from @xhlulu on Kaggle](https://www.kaggle.com/xhlulu/tf-qa-jsonl-to-dataframe) was super helpful in processing the TensorFlow 2.0 data.


In [1]:
%reload_ext autoreload
%autoreload 2

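from fastai2.basics import *
import gc

# Explicit imports for names used below (the star import above also provides them)
import json
import pandas as pd
from pathlib import Path
from tqdm import tqdm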

#### HTML Stripper

In [2]:
from html.parser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        # Initialise the base parser; convert_charrefs=True decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
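
A quick way to see what this kind of stripper keeps and drops is to run it on a couple of small strings. The sketch below is self-contained and uses hypothetical names (`TextOnly`, `to_text`) rather than the class above, but follows the same `HTMLParser` approach.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Hypothetical minimal stripper: collects text nodes and ignores tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_text(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return ''.join(parser.chunks)


print(to_text('<p>Use <code>list.sort()</code> &amp; friends</p>'))
print(to_text('<p>a</p><p>b</p>'))
```

The second call shows a limitation worth keeping in mind: tags are removed without inserting whitespace, so text from adjacent block elements is joined directly.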

#### Load Wiki data func

In [3]:
def get_wiki(wiki_num: int=2, raw: bool=False):
    # Raw files live in 'wikitext-N-raw' with a '.raw' suffix, tokenized ones in 'wikitext-N' with '.tokens'
    if raw:
        raw_p='-raw'
        suff='raw'
    else:
        raw_p=''
        suff='tokens'
        
    wiki_path = Path(f'data/wikitext-{wiki_num}{raw_p}')
    
    df_train = pd.read_csv(wiki_path/f'wiki.train.{suff}', sep='\n', header=None)   
    df_valid = pd.read_csv(wiki_path/f'wiki.valid.{suff}', sep='\n', header=None)
    df_test = pd.read_csv(wiki_path/f'wiki.test.{suff}', sep='\n', header=None)

    df_train.columns=['doc']
    df_valid.columns=['doc']
    df_test.columns=['doc']
    
    df_train['source'] = f'wiki{wiki_num}'
    df_valid['source'] = f'wiki{wiki_num}'
    df_test['source'] = f'wiki{wiki_num}'
    
    return df_train, df_valid, df_test 

#### Clean Wiki Data

In [4]:
def add_wiki_header_toks(df_all):
    """Replace default wikitext header tokens ("=") with special tokens"""
    df_all_cp = df_all.copy()

    headers={i:'= '*i for i in range(7,0,-1)}

    # Find the indexes that are likely to be headers
    header_idxs=df_all.loc[df_all.doc.str.contains(r'=.*?=', regex=True)].index
    
    # Deal with sub-headers first
    for h_i in range(7,1,-1):
        mod_head_ls = []
        mod_head_idx = []
        df_all_cp.doc = df_all_cp.doc.str.replace(pat=headers[h_i], repl=f'xxh{h_i}s ')

        # Replace all the end header tokens with a different one
        for j, s in enumerate(df_all_cp.doc.values[header_idxs]):
            if f'xxh{h_i}s' in s:
                head, _sep, tail = s.rpartition(f'xxh{h_i}s ')    
                mod_head_ls.append(f'{head} xxh{h_i}e {tail}')
                mod_head_idx.append(header_idxs[j])

        df_all_cp.doc.values[mod_head_idx] = mod_head_ls

    # Deal with top level headers, which are more tricky as there is only one "="
    mod_tail_ls = []
    mod_tail_idx = []
    for j, s in enumerate(df_all_cp.doc.values[header_idxs]):   
        if s.endswith('= '):
            # replace both = with H1s
            uu = s.replace('=', 'xxh1s')

            # Split by H1s and replace the last one with H1e
            head, _sep, tail = uu.rpartition(f'xxh1s')   

            mod_tail_ls.append(f'{head} xxh1e ')
            mod_tail_idx.append(header_idxs[j])

    df_all_cp.doc.values[mod_tail_idx] = mod_tail_ls

    return df_all_cp, header_idxs


def add_wiki_start_end(comp_ls, wiki_num):
    ## Add start and end
    BOS = 'xxars'
    EOS = 'xxare'
    H1S = 'xxh1s'
    fin_dfs = []
    
    for df in comp_ls:

        # Find the starting index for each article
        art_starts = list(df.loc[df.doc.str.contains(pat=H1S, case=True, regex=False)].index)

        new_arts = []
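        # Each article runs from its H1 line up to the line before the next H1.
        # Note: the final article in each split has no following H1 to bound it, so it is not emitted.
        for i, idx in enumerate(art_starts[:-1]):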
            # Add Article Start and End Tokens
            new_arts.append(BOS + df.loc[art_starts[i]:art_starts[i+1]-1].doc.str.cat(sep='xxnpg ') + EOS)

        new_df = pd.DataFrame(new_arts, columns=['doc'])
        new_df['source'] = f'wiki{wiki_num}'
        new_df['char_count'] = new_df['doc'].str.len().values
        
        fin_dfs.append(new_df)
    return fin_dfs
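
For intuition, the header tagging and article wrapping can be sketched on a handful of lines without pandas. This is a simplified, regex-based alternative and not the method used above: `HEADER_RE`, `tag_header` and `build_articles` are hypothetical names, the sample lines are made up, and unlike `add_wiki_start_end` (which iterates over `art_starts[:-1]`) this sketch also emits the final article.

```python
import re

# Matches a WikiText header line such as " = = Gameplay = = "
HEADER_RE = re.compile(r'^\s*((?:= )+)(.*?)((?: =)+)\s*$')


def tag_header(line: str) -> str:
    """Replace '=' header markers with xxhNs / xxhNe tokens; leave other lines unchanged."""
    m = HEADER_RE.match(line)
    if not m:
        return line
    level = m.group(1).count('=')
    if level != m.group(3).count('='):
        return line  # unbalanced markers: treat as ordinary text
    return f' xxh{level}s {m.group(2).strip()} xxh{level}e '


def build_articles(lines):
    """Group lines into articles at each H1 and wrap them in xxars / xxare."""
    articles, current = [], []
    for line in lines:
        tagged = tag_header(line)
        if 'xxh1s' in tagged and current:
            articles.append('xxars' + 'xxnpg '.join(current) + 'xxare')
            current = []
        current.append(tagged)
    if current:
        articles.append('xxars' + 'xxnpg '.join(current) + 'xxare')
    return articles


lines = [' = Title A = ', ' First paragraph . ', ' = = Section = = ',
         ' More text . ', ' = Title B = ', ' Body . ']
for art in build_articles(lines):
    print(art)
```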

#### Load StackSample data func

- Only Answers with scores > 7 are loaded

In [5]:
def load_ss_data(num_samples: int=5000):
    ss_q = pd.read_csv('data/stacksample/Questions.csv', nrows=num_samples,
                       usecols =['Id','Title', 'Body'],encoding='latin1')
    ss_q = ss_q.dropna()

    ss_a = pd.read_csv('data/stacksample/Answers.csv', nrows=num_samples,
                       usecols =['ParentId','Body','Score'], encoding='latin1')
    # Rename by name rather than by position, so the result does not depend on the CSV column order
    ss_a = ss_a.rename(columns={'Body': 'Answer_Body'})
    ss_a = ss_a.dropna()

    # Merge Questions and Answers
    ss_fin = pd.merge(left=ss_q, right=ss_a, left_on='Id', right_on='ParentId')

    ss_fin = ss_fin.query('Score > 7').copy()
    ss_fin.drop_duplicates(inplace=True)
    
    return ss_fin
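
The merge-then-filter step is easy to check on toy data. The sketch below uses made-up rows with the same column names as the StackSample files; it only illustrates the join on `Id` / `ParentId` and the `Score > 7` filter.

```python
import pandas as pd

questions = pd.DataFrame({
    'Id': [1, 2],
    'Title': ['Q one', 'Q two'],
    'Body': ['body 1', 'body 2'],
})
answers = pd.DataFrame({
    'ParentId': [1, 1, 2],
    'Score': [12, 3, 8],
    'Body': ['great answer', 'weak answer', 'good answer'],
}).rename(columns={'Body': 'Answer_Body'})

# Inner join: each answer is paired with its parent question
merged = pd.merge(left=questions, right=answers, left_on='Id', right_on='ParentId')

# Keep only the higher-scoring answers
kept = merged.query('Score > 7')
print(kept[['Title', 'Score', 'Answer_Body']])
```

A question with several qualifying answers appears once per answer, which is why the same question title can show up in multiple rows of the final data.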

#### Clean StackSample data func
- Add question title, question body and answer body start and end tags
- Merge question title, question body and answer to a single document
- Replace `\n` with `xxnpg`
- Strip HTML tags

In [6]:
def clean_ss_data(ss_fin):
    # Add new paragraph tag (xxnpg)
    ss_fin.Title = ss_fin.Title.str.replace(pat='\n', repl=f' xxnpg ')
    ss_fin.Body = ss_fin.Body.str.replace(pat='\n', repl=f' xxnpg ')
    ss_fin.Answer_Body = ss_fin.Answer_Body.str.replace(pat='\n', repl=f' xxnpg ')

    # Add start and end tags
    ss_fin.Title = 'xxqts ' + ss_fin.Title.astype(str) + ' xxqte'
    ss_fin.Body = 'xxqbs ' + ss_fin.Body.astype(str) + ' xxqbe'
    ss_fin.Answer_Body = 'xxans ' + ss_fin.Answer_Body.astype(str) + ' xxane'

    # Combine Question Title and Body and Answer Body to a single document
    ss_fin['doc'] = ss_fin.Title.values + ' ' + ss_fin.Body.values + ' ' + ss_fin.Answer_Body.values
    ss_fin = pd.DataFrame(ss_fin['doc'])

    # Strip HTML tags
    stripped = [strip_tags(t) for t in ss_fin['doc'].values]
    
    # Add some info
    ss_fin['doc'] = stripped
    ss_fin['source'] = 'stacksample'
    #ss_fin['doc_type'] = 'qt_qb_a'
    ss_fin['char_count'] = ss_fin['doc'].str.len().values
    
    return ss_fin

#### Load TensorFlow 2.0 Q&A data (Google Natural Questions)

In [7]:
# Taken from https://www.kaggle.com/xhlulu/tf-qa-jsonl-to-dataframe
def tf2qa_jsonl_to_df(file_path, n_rows=-1, load_annotations=True, truncate=True, offset=0):
    """
    Simple utility function to load the .jsonl files for the 
    TF2.0 QA competition. It creates a dataframe of the dataset.
    
    To use, click "File" > "Add utility script", search the name of this 
    notebook, then run:
    
    >>> from tf_qa_jsonl_to_dataframe import jsonl_to_df
    >>> train = jsonl_to_df("/kaggle/...train.jsonl")
    >>> test = jsonl_to_df("/kaggle/...test.jsonl", load_annotations=False)
    
    Parameters:
        * file_path (str): The path to your json_file.
        * n_rows (int): The number of rows you are importing. Set value to -1 if you want to import everything. [Default=-1]
        * load_annotations (bool): Whether to load annotations (for training data) or not (test set does not have
          annotations). [Default=True]
        * truncate (bool): Whether to cut the text before the first answer (long or short)
          and after the last answer (long or short), leaving a space for the offset. [Default=True]
        * offset (int): If offset = k, then keep only the interval (answer_start - k, answer_end + k). [Default=0]
        
    Returns:
        A Dataframe containing the following columns:
            * document_text (str): The document split by whitespace, possibly truncated
            * question_text (str): the question posed
            * yes_no_answer (str): Could be "YES", "NO", or "NONE"
            * short_answer_start (int): Start index of token, -1 if does not exist
            * short_answer_end (int): End index of token, -1 if does not exist
            * long_answer_start (int): Start index of token, -1 if does not exist
            * long_answer_end (int): End index of token, -1 if does not exist
            * example_id (str): ID representing the string.
    
    Author: @xhlulu
    Source: https://www.kaggle.com/xhlulu/tf-qa-jsonl-to-dataframe
    """
    json_lines = []
    
    with open(file_path) as f:
        for i, line in tqdm(enumerate(f)):
            if i == n_rows:
                break
            
            line = json.loads(line)
            last_token = line['long_answer_candidates'][-1]['end_token']

            out_di = {
                'document_text': line['document_text'],
                'question_text': line['question_text']
            }
            
            if 'example_id' in line:
                out_di['example_id'] = line['example_id']
            
            if load_annotations:
                annot = line['annotations'][0]
                
                out_di['yes_no_answer'] = annot['yes_no_answer']
                out_di['long_answer_start'] = annot['long_answer']['start_token']
                out_di['long_answer_end'] = annot['long_answer']['end_token']

                if len(annot['short_answers']) > 0:
                    out_di['short_answer_start'] = annot['short_answers'][0]['start_token']
                    out_di['short_answer_end'] = annot['short_answers'][0]['end_token']
                else:
                    out_di['short_answer_start'] = -1
                    out_di['short_answer_end'] = -1

                if truncate:
                    if out_di['long_answer_start'] == -1:
                        start_threshold = out_di['short_answer_start'] - offset
                    elif out_di['short_answer_start'] == -1:
                        start_threshold = out_di['long_answer_start'] - offset
                    else:
                        start_threshold = min(out_di['long_answer_start'], out_di['short_answer_start']) - offset
                        
                    start_threshold = max(0, start_threshold)
                    end_threshold = max(out_di['long_answer_end'], out_di['short_answer_end']) + offset + 1
                    
                    out_di['document_text'] = " ".join(
                        out_di['document_text'].split(' ')[start_threshold:end_threshold]
                    )

            json_lines.append(out_di)

    df = pd.DataFrame(json_lines).fillna(-1)
    
    return df
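
The truncation logic is the least obvious part of this loader, so here it is pulled out on its own. `answer_window` is a hypothetical helper written for illustration; it mirrors the `truncate` branch above on a plain token list, with made-up spans.

```python
def answer_window(tokens, long_span, short_span, offset=0):
    """Hypothetical helper mirroring the truncate branch: slice tokens around the answers."""
    long_start, long_end = long_span
    short_start, short_end = short_span

    if long_start == -1:
        start = short_start - offset
    elif short_start == -1:
        start = long_start - offset
    else:
        start = min(long_start, short_start) - offset

    start = max(0, start)  # never slice from a negative index
    end = max(long_end, short_end) + offset + 1
    return tokens[start:end]


tokens = 'a b c d e f g h i j'.split(' ')

print(answer_window(tokens, (3, 6), (4, 5)))              # long and short answer present
print(answer_window(tokens, (3, 6), (-1, -1), offset=2))  # long answer only, padded by 2 tokens
print(answer_window(tokens, (-1, -1), (-1, -1)))          # no answer at all
```

When neither answer exists the window is empty, so such rows end up with an empty `document_text`; they are dropped later by the long-answer filter in the cleaning step.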

#### Clean TensorFlow 2.0 Q&A data

In [8]:
def clean_tf2qa_data(tf2qa_data):
    # Remove articles without long answer (copy so the assignments below do not act on a slice)
    tf2qa_data = tf2qa_data.loc[tf2qa_data['long_answer_start'] != -1, ['question_text', 'document_text']].copy()

    # Append a question mark to each question and wrap it in question start and end tags
    tf2qa_data['question_text'] = 'xxqbs ' + tf2qa_data['question_text'].values + '?' + ' xxqbe'

    # Add Answer start and end tags
    tf2qa_data['document_text'] = 'xxans ' + tf2qa_data['document_text'] + ' xxane'

    # Merge to a single doc
    tf2qa_data['doc'] = tf2qa_data['question_text'] + ' ' + tf2qa_data['document_text']

    # Strip HTML
    tf2qa_data['doc'] = [strip_tags(t) for t in tf2qa_data['doc'].values]
    
    tf2qa_data_fin= pd.DataFrame(tf2qa_data['doc'].values, columns=['doc'])
    
    # Add some source info
    tf2qa_data_fin['source'] = 'tf2qa'

    # Add character count
    tf2qa_data_fin['char_count'] = tf2qa_data_fin['doc'].str.len().values
    return tf2qa_data_fin

# Load Data
#### Load Wiki Data
- Option to either load wiki2 or wiki103, with the raw or tokenized versions

In [9]:
# Load data
wiki_num = 103
wiki_train, wiki_valid, wiki_test = get_wiki(wiki_num=wiki_num, raw=True)
print('Data loaded')

# Add header toks
wiki_dfs = [wiki_train, wiki_valid, wiki_test]
comb_ls = []
for df in wiki_dfs:
    new_df, header_idxs = add_wiki_header_toks(df)
    comb_ls.append(new_df)
print('Header toks added')

# Add start and end articles
wiki_tmp_dfs = add_wiki_start_end(comb_ls, wiki_num)
print('Start, end toks added')
# Remove short articles, less than 2000 characters
# The below commented out code shows the distribution of character lengths
# import matplotlib.pyplot as plt
# bins=[]
# for h in range(0,10000, 200):
#     bins.append(h)
# plt.hist(wiki_fin_df['doc'].str.len(),bins=bins)

def cut_short_wikis(df):
    df = df[df['doc'].str.len() > 2000]
    return df
    
# Split df ls and tidy up
wiki_train_fin_df=cut_short_wikis(wiki_tmp_dfs[0])
wiki_valid_fin_df=cut_short_wikis(wiki_tmp_dfs[1])
wiki_test_fin_df=cut_short_wikis(wiki_tmp_dfs[2])
print('Short articles discarded')

wiki_fin_df = pd.concat([wiki_train_fin_df, wiki_valid_fin_df])
wiki_fin_df.reset_index(drop=True, inplace=True)
#wiki_fin_df['char_count'] = wiki_fin_df['doc'].str.len().values

wiki_test_fin_df.reset_index(drop=True, inplace=True)
#wiki_test_fin_df['char_count'] = wiki_test_fin_df['doc'].str.len().values

print(len(wiki_fin_df), wiki_fin_df.char_count.sum())
wiki_fin_df.head(3)

Data loaded
Header toks added
Start, end toks added
Short articles discarded
28784 546080745


Unnamed: 0,doc,source,char_count
0,"xxars xxh1s Valkyria Chronicles III xxh1e xxnpg Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the "" Nameless "" , a penal military unit...",wiki103,21104
1,"xxars xxh1s Tower Building of the Little Rock Arsenal xxh1e xxnpg The Tower Building of the Little Rock Arsenal , also known as U.S. Arsenal Building , is a building located in MacArthur Park in downtown Little Rock , Arkansas . Built in 1840 , it was part of Little Rock 's first military installation . Since its decommissioning , The Tower Building has housed two museums . It was home to the Arkansas Museum of Natural History and Antiquities from 1942 to 1997 and the MacArthur Museum of Arkansas Military History since 2001 . It has also been the headquarters of the Little Rock Æsthetic ...",wiki103,21870
2,"xxars xxh1s Cicely Mary Barker xxh1e xxnpg Cicely Mary Barker ( 28 June 1895 – 16 February 1973 ) was an English illustrator best known for a series of fantasy illustrations depicting fairies and flowers . Barker 's art education began in girlhood with correspondence courses and instruction at the Croydon School of Art . Her earliest professional work included greeting cards and juvenile magazine illustrations , and her first book , Flower Fairies of the Spring , was published in 1923 . Similar books were published in the following decades . xxnpg Barker was a devout Anglican , and dona...",wiki103,16713


#### Load StackSample Q&A data (from Stackoverflow)

- Source https://www.kaggle.com/stackoverflow/stacksample/tasks
- 2M Q&A Pairs

In [10]:
num_samples = 1000000
ss_fin = load_ss_data(num_samples)
ss_fin = clean_ss_data(ss_fin)

print(len(ss_fin), ss_fin.char_count.sum())
ss_fin.head()

79704 147225545


Unnamed: 0,doc,source,char_count
0,"xxqts SQLStatement.execute() - multiple queries in one statement xxqte xxqbs I've written a database generation script in SQL and want to execute it in my Adobe AIR application: xxnpg xxnpg Create Table tRole ( xxnpg roleID integer Primary Key xxnpg ,roleName varchar(40) xxnpg ); xxnpg Create Table tFile ( xxnpg fileID integer Primary Key xxnpg ,fileName varchar(50) xxnpg ,fileDescription varchar(500) xxnpg ,thumbnailID integer xxnpg ,fileFormatID integer xxnpg ,categoryID integer xxnpg ,isFavorite boolean xxnpg ,dateAdded date xxnpg ,global...",stacksample,2620
3,"xxqts Good branching and merging tutorials for TortoiseSVN? xxqte xxqbs Are there any really good tutorials explaining branching and merging with Apache Subversion? xxnpg xxnpg All the better if it's specific to TortoiseSVN client. xxnpg xxqbe xxans Version Control with Subversion\r xxnpg \r xxnpg A very good resource for source control in general. Not really TortoiseSVN specific, though. xxane",stacksample,398
5,xxqts Good branching and merging tutorials for TortoiseSVN? xxqte xxqbs Are there any really good tutorials explaining branching and merging with Apache Subversion? xxnpg xxnpg All the better if it's specific to TortoiseSVN client. xxnpg xxqbe xxans My easy click-by-click instructions (specific to TortoiseSVN) are in Stack Overflow question What is the simplest way to do branching and merging using TortoiseSVN?. xxnpg xxane,stacksample,431
6,"xxqts ASP.NET Site Maps xxqte xxqbs Has anyone got experience creating SQL-based ASP.NET site-map providers? xxnpg xxnpg I've got the default XML file web.sitemap working properly with my Menu and SiteMapPath controls, but I'll need a way for the users of my site to create and modify pages dynamically. xxnpg xxnpg I need to tie page viewing permissions into the standard ASP.NET membership system as well. xxnpg xxqbe xxans The Jeff Prosise version from MSDN magazine works pretty well, but it has a few flaws: xxnpg xxnpg AddNode freaks out with links to external sites on your menu (www.g...",stacksample,1772
9,"xxqts Function for creating color wheels xxqte xxqbs This is something I've pseudo-solved many times and never quite found a solution. That's stuck with me. The problem is to come up with a way to generate N colors, that are as distinguishable as possible where N is a parameter. xxnpg xxqbe xxans My first thought on this is ""how generate N vectors in a space that maximize distance from each other."" You can see that the RGB (or any other scale you use that forms a basis in color space) are just vectors. Take a look at Random Point Picking. Hope this is a good start for you! Once you have a...",stacksample,1353


#### Load TensorFlow 2.0 Q&A data

- From Google Natural Questions dataset

- https://www.kaggle.com/c/tensorflow2-question-answering/data

- https://github.com/google-research-datasets/natural-questions

- https://storage.googleapis.com/pub-tools-public-publication-data/pdf/1f7b46b5378d757553d3e92ead36bda2e4254244.pdf

The NQ training data contains 307,373 examples. 152,148 have a long answer and 110,724 have a short answer. Short answers can be sets of spans in the document (106,926), or yes or no (3,798). Long answers are HTML bounding boxes.

- Filtered for Long Answers only

In [11]:
tf2qa_data = tf2qa_jsonl_to_df('data/tensorflow2-question-answering/simplified-nq-train.jsonl', n_rows=200000, offset=0)
tf2qa_fin = clean_tf2qa_data(tf2qa_data)

print(len(tf2qa_fin), tf2qa_fin.char_count.sum())
tf2qa_fin.head()

199824it [02:10, 1697.09it/s]

99093 153611999


Unnamed: 0,doc,source,char_count
0,"xxqbs which is the most common use of opt-in e-mail marketing? xxqbe xxans A common example of permission marketing is a newsletter sent to an advertising firm 's customers . Such newsletters inform customers of upcoming events or promotions , or new products . In this type of advertising , a company that wants to send a newsletter to their customers may ask them at the point of purchase if they would like to receive the newsletter . xxane",tf2qa,446
1,"xxqbs how i.met your mother who is the mother? xxqbe xxans Tracy McConnell , better known as `` The Mother '' , is the title character from the CBS television sitcom How I Met Your Mother . The show , narrated by Future Ted , tells the story of how Ted Mosby met The Mother . Tracy McConnell appears in 8 episodes from `` Lucky Penny '' to `` The Time Travelers '' as an unseen character ; she was first seen fully in `` Something New '' and was promoted to a main character in season 9 . The Mother is played by Cristin Milioti . xxane",tf2qa,539
2,"xxqbs what type of fertilisation takes place in humans? xxqbe xxans The process of fertilization involves a sperm fusing with an ovum . The most common sequence begins with ejaculation during copulation , follows with ovulation , and finishes with fertilization . Various exceptions to this sequence are possible , including artificial insemination , in vitro fertilization , external ejaculation without copulation , or copulation shortly after ovulation . Upon encountering the secondary oocyte , the acrosome of the sperm produces enzymes which allow it to burrow through the outer jelly coat...",tf2qa,791
3,"xxqbs who had the most wins in the nfl? xxqbe xxans Active quarterback Tom Brady holds the records for most wins with 220 , most regular season wins with 195 , and most postseason wins with 25 , as of Week 16 of the 2017 NFL season . Having played the entirety of his career with the New England Patriots , each of Brady 's win records also apply to wins with a single team . xxane",tf2qa,384
4,"xxqbs who played mantis guardians of the galaxy 2? xxqbe xxans Pom Klementieff ( born 3 May 1986 ) is a French actress . She was trained at the Cours Florent drama school in Paris and has appeared in such films as Loup ( 2009 ) , Sleepless Night ( 2011 ) and Hacker 's Game ( 2015 ) . She plays the role of Mantis in the film Guardians of the Galaxy Vol. 2 ( 2017 ) and will appear in the same role in the film Avengers : Infinity War ( 2018 ) . xxane",tf2qa,454


# Save Data

In [32]:
from datetime import date
import os
fn = f'data/lm_data_{date.today()}.csv'

dfs = [wiki_fin_df, ss_fin, tf2qa_fin]
final_data = pd.concat(dfs)
final_data.reset_index(inplace=True, drop=True)
final_data.to_csv(fn, sep="\t", encoding='utf-8')

final_data.to_feather(f'data/lm_data_{date.today()}.ftr')

print(f'Total rows : {len(final_data)}')
print(f'Total character count : {final_data.char_count.sum()}')
print(f'NaNs in doc column: {final_data.isna().doc.sum() / len(final_data)}')
statinfo = os.stat(fn)
print(f'File size is {statinfo.st_size/1000000000}GB')
print()
# Restrict the sum to the numeric char_count column
print(final_data.groupby('source')[['char_count']].sum() / final_data.char_count.sum())
print()
print(fn)
print()

final_data.sample(100).head(15)

Total rows : 207581
Total character count : 846918289
NaNs in doc column: 0.0
File size is 0.855439529GB

             char_count
source                 
stacksample    0.173837
tf2qa          0.181378
wiki103        0.644786

data/lm_data_2020-02-04.csv



Unnamed: 0,doc,source,char_count
115117,xxqbs when did man on the moon come out? xxqbe xxans Man on the Moon Theatrical release poster Directed by Miloš Forman Produced by Danny DeVito Michael Shamberg Stacey Sher Written by Scott Alexander Larry Karaszewski Starring Jim Carrey Danny DeVito Courtney Love Paul Giamatti Music by R.E.M. Cinematography Anastas N. Michos Edited by Adam Boome Lynzee Klingman Christopher Tellefsen Production companies BBC Films Cinehaus Jersey Films Marubeni Mutual Film Company Shapiro / West Productions ...,tf2qa,921
23261,"xxars xxh1s Wendell Willkie xxh1e xxnpg Wendell Lewis Willkie ( born Lewis Wendell Willkie ; February 18 , 1892 – October 8 , 1944 ) was an American lawyer , corporate executive , and the 1940 Republican candidate for president . Willkie appealed to many convention delegates as the Republican field 's only interventionist : although the U.S. remained neutral prior to Pearl Harbor , he favored greater U.S. involvement in World War II to support Britain and other Allies . His Democratic opponent , incumbent President Franklin D. Roosevelt , won the 1940 election with roughly 55 % of the po...",wiki103,68081
27841,"xxars xxh1s Coming Up to Breathe xxh1e xxnpg Coming Up to Breathe is the fourth studio album by Christian rock band MercyMe . Released on April 25 , 2006 , by INO Records , the album was intended by MercyMe to be edgier than their previous albums . Coming Up to Breathe sold 58 @,@ 000 copies its first week , MercyMe 's biggest sales week at the time . It debuted and peaked at number one on the Billboard Christian Albums chart , number five on the Rock Albums chart , and number thirteen on the Billboard 200 . It also appeared on the Alternative Albums chart in 2007 , peaking at number thi...",wiki103,10414
67891,"xxqts Add a column if it doesn't exist to all tables? xxqte xxqbs I'm using SQL Server 2005/2008. I need to add a column to a table if it does not yet exist. This will apply to all tables in a given database. I hoped I was close, but I'm having issues with this solution. xxnpg xxnpg How can this be done? xxnpg xxnpg Here's what I have: xxnpg xxnpg EXEC sp_MSforeachtable ' xxnpg declare @tblname varchar(255); xxnpg SET @tblname = PARSENAME(""?"",1); xxnpg xxnpg if not exists (select column_name from INFORMATION_SCHEMA.columns xxnpg where table_name = @tb...",stacksample,1633
128752,"xxqbs why did new zealand soldiers go to world war 1? xxqbe xxans The military history of New Zealand during World War I began in August 1914 when Great Britain declared war on Germany at the start of the First World War , the New Zealand government followed without hesitation , despite its geographic isolation and small population . It was believed at the time that any declaration of war by the United Kingdom automatically included New Zealand . xxane",tf2qa,459
156442,"xxqbs identify the demographic groups that made up the new reagan coalition? xxqbe xxans VOTER GROUPS AND THE PRESIDENTIAL VOTE , 1980 AND 1976 Size ' 80 Carter ' 80 Reagan ' 80 Anderson ' 76 Carter ' 76 Ford Party Democratic 43 66 26 6 77 22 Independent 23 30 54 12 43 54 Republican 28 11 84 9 90 Ideology Liberal 18 57 27 11 70 26 Moderate 51 42 48 8 51 48 Conservative 31 23 71 29 70 Race Black 10 82 14 82 16 ...",tf2qa,2606
29089,"xxqts Factorial Algorithms in different languages xxqte xxqbs I want to see all the different ways you can come up with, for a factorial subroutine, or program. The hope is that anyone can come here and see if they might want to learn a new language. xxnpg xxnpg Ideas: xxnpg xxnpg xxnpg Procedural xxnpg Functional xxnpg Object Oriented xxnpg One liners xxnpg Obfuscated xxnpg Oddball xxnpg Bad Code xxnpg Polyglot xxnpg xxnpg xxnpg Basically I want to see an example, of different ways of writing an algorithm, and what they would look like in different languages. xxnpg xxnpg Please limi...",stacksample,1644
170018,"xxqbs where does the last name polanco originate from? xxqbe xxans Polanco is a Spanish surname originating from the municipality of Polanco , Cantabria in Spain . Notable people with the surname include : xxane",tf2qa,214
130324,"xxqbs history of garden of the gods colorado springs? xxqbe xxans The Garden of the Gods ' red rock formations were created during a geological upheaval along a natural fault line millions of years ago . Archaeological evidence shows that prehistoric people visited Garden of the Gods about 1330 BC . At about 250 BC , Native American people camped in the park ; they are believed to have been attracted to wildlife and plant life in the area and used overhangs created by the rocks for shelter . Many native peoples have reported a connection to Garden of the Gods , including Apache , Cheyenne...",tf2qa,671
129605,"xxqbs who does miss rabbit's voice on peppa pig? xxqbe xxans Sarah Ann Kennedy is a British voice actress best known for providing the voices of Miss Rabbit and Mummy Rabbit in the children 's animated series Peppa Pig , Nanny Plum in the children 's animated series Ben & Holly 's Little Kingdom and Dolly Pond in Pond Life . She is also a writer and animation director and the creator of Crapston Villas , an animated soap opera for Channel 4 in 1996 -- 1998 . She has also written for Hit Entertainment and Peppa Pig , and is a lecturer at the University of Central Lancashire . xxane",tf2qa,590
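
To sanity-check a saved file, it can be read back with the same tab separator. This is a minimal sketch on toy data written to a temporary directory; the rows and the file name `lm_data_toy.csv` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({
    'doc': ['xxqbs q? xxqbe xxans a xxane', 'xxars xxh1s T xxh1e xxnpg text xxare'],
    'source': ['tf2qa', 'wiki103'],
})
toy['char_count'] = toy['doc'].str.len()

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'lm_data_toy.csv')
    toy.to_csv(path, sep='\t', encoding='utf-8')
    # index_col=0 restores the row index that to_csv wrote as the first column
    loaded = pd.read_csv(path, sep='\t', index_col=0, encoding='utf-8')

# Share of characters contributed by each source
shares = loaded.groupby('source')['char_count'].sum() / loaded['char_count'].sum()
print(shares)
```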
